
Creating bilingual dictionaries for under-resourced languages

Marzanne Janse van Rensburg and Vivian Marr Oxford University Press

Afrilex, Windhoek, June 2019

Creating dictionaries for under-resourced languages

Timing: 8.30-9.45

1. The Oxford Global Languages programme

2. Content creation methodology

• Principles

• Single lexicographical framework

• Hands-on translation and discussion (50 minutes)

9.45-10.15: Tea break!

Afrilex workshop, June 2019 (C) Oxford University Press


Timing: 10.15-12.00

1. Corpora

• Gap-filling

• Building your own corpus

• Using corpus to build a framework entry

2. Ask questions at any time!



Oxford Global Languages

The vision:

To bring lexical content online for 100 of the world’s languages and make it available to developers, consumers, licensees, and researchers for a wide variety of uses

The mission:

To improve the quality and breadth of global linguistic knowledge and communication, giving voice to all people in a rapidly changing world


Oxford Global Languages

20 languages launched to date – 6 more almost ready to go

https://developer.oxforddictionaries.com https://www.oxforddictionaries.com

Georgian

Greek

Gujarati

Hausa

Hindi

Igbo

Indonesian

isiXhosa

isiZulu

Kiswahili

Latvian

Malay

Marathi

Northern Sotho

Persian

Romanian

Setswana

Southern Quechua

Tajik

Tamil

Tatar

Telugu

Tok Pisin

Turkmen

Urdu

Yoruba




• The principles – applicable to all languages:

– Single neutral, common framework

– Arranged by frequency

– Translate

– Peer review

– Discuss

– Finalize

– Reverse

• Working iteratively, based locally



• Why a single common, neutral framework?

– Reusable as the core for content creation

– Scalable – can be iteratively expanded and enhanced

– Common source for multiple languages allows automatic generation of language pairs
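The last point can be sketched in a few lines of Python. This is a minimal illustration, not the OGL pipeline: it assumes each language's translations are stored against shared framework sense IDs, and the IDs (`water.n.1`, `dog.n.1`) and the function name are hypothetical.

```python
from collections import defaultdict


def pair_via_framework(source, target):
    """Join two translation tables on their shared framework sense IDs."""
    pairs = defaultdict(set)
    for sense_id, source_words in source.items():
        for source_word in source_words:
            # Words attached to the same framework sense are paired up.
            pairs[source_word].update(target.get(sense_id, []))
    return {word: sorted(translations) for word, translations in pairs.items()}


# Illustrative data: framework sense ID -> translations in each language.
zulu = {"water.n.1": ["amanzi"], "dog.n.1": ["inja"]}
swahili = {"water.n.1": ["maji"], "dog.n.1": ["mbwa"]}

zulu_swahili = pair_via_framework(zulu, swahili)
```

The output is only a draft isiZulu-Kiswahili wordlist: the pairs are as reliable as the framework's sense division, so they would still need editorial review.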



• WordReference automatically generated “virtual dictionaries”

• “Far from perfect” – but useful



• The results:


ELF: English Language Framework

• Frequency order

• Prioritized senses

• Translation help to aid editors

• Available to license in exchange for digital rights



• What’s the minimum we want to offer users? – Headword

– Part of speech

– Sense division

– Example sentences showing language in use

– Language register shown
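As a rough illustration of these minimum elements, a framework entry could be modelled as follows. The class and field names are assumptions made for this sketch and do not reflect the actual ELF or JSON schema used at OUP.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Sense:
    indicator: str                                     # short label for the sense division
    examples: list = field(default_factory=list)       # example sentences showing use
    register: str = "neutral"                          # e.g. neutral, informal, formal
    translations: list = field(default_factory=list)   # filled in per target language


@dataclass
class Entry:
    headword: str
    part_of_speech: str
    senses: list = field(default_factory=list)


entry = Entry(
    headword="quick",
    part_of_speech="adjective",
    senses=[Sense(indicator="fast", examples=["a quick answer"])],
)

# Serialise to JSON; translations stay empty until a translator adds them.
entry_json = json.dumps(asdict(entry), ensure_ascii=False, indent=2)
```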



• Translations, of course! Not definitions.

• Translations must be:

– Up-to-date

– Accurate

– Matching in register



• Additional help for licensees

– Translation help

– Training materials



• Arranged by frequency, derived from corpus

– Covers core language first



• Based locally

– Recruitment where the language is used

– Opportunities for local skills development and expansion

• Working iteratively

– Build and release, build and release

– Quicker to market and to generate revenue


Content reversal



• Content reversal

– Translations become headword candidates

– Still experimenting

– Success with English-Igbo, English-Yoruba, English-Marathi – but still small

– Other sources needed to ensure appropriate and sufficient coverage

– Creating framework on the fly

– Challenge to build iteratively
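The core idea of reversal (translations become headword candidates) can be sketched as below. This is an assumption-laden toy, not the process used for the languages named above: the row layout, the function name, and the Kiswahili sample data are illustrative only.

```python
from collections import defaultdict


def reverse_entries(rows):
    """Group (headword, sense, translation) rows by translation.

    Each translation becomes a candidate headword for the reversed
    dictionary, listing the source headwords and senses it came from.
    """
    candidates = defaultdict(list)
    for headword, sense, translation in rows:
        candidates[translation].append((headword, sense))
    return dict(candidates)


# Illustrative English-Kiswahili rows (unreviewed sample data).
rows = [
    ("sleep", "rest with eyes closed", "lala"),
    ("lie", "be in a flat position", "lala"),
    ("paper", "material for writing on", "karatasi"),
]

candidates = reverse_entries(rows)
```

Here `lala` collects two source senses, which an editor would then need to organise into a proper sense division. Words that never occur as a translation do not appear at all, which is why other sources are needed for coverage.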



• Editing tools

– Spreadsheet approach

– Also possible to work in XML, with tools like TshwaneLex

– Moving to JSON format for purely digital product

– Building our own editing tool - DELTA

– Dictionary Creation Package can be supplied as spreadsheet or along with DELTA



• Translation task: 20 minutes in pairs or groups of three

– Choose an entry or entries from the English Language Framework (ELF) and work through translating into your language

– Followed by 20 minutes of group discussion:

• what went well/could have gone better

• where the help provided was sufficient/could have been improved


Creating dictionaries for under-resourced languages

Sample ELF entries to choose from:

bear

centre

dark

exercise

happy

measure

oil

paper

quick

sleep


Tea break!


Gap-filling in an Oxford Kiswahili dictionary

Approach

Used all the OUP-published Kiswahili textbooks and literature for Kenya and Tanzania to build a corpus

Extracted a wordlist from the textbook corpus, sorted by frequency
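A frequency-sorted wordlist of this kind can be produced with a short script. The sketch below is a minimal stand-in, assuming the corpus is available as plain-text strings; the tokenising pattern and the two sample sentences are illustrative, not the method used for the Kiswahili project.

```python
import re
from collections import Counter

# Letters only, allowing internal apostrophes and hyphens (e.g. ng'ombe).
WORD = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def frequency_wordlist(texts):
    """Return (word, count) pairs, most frequent first."""
    counts = Counter()
    for text in texts:
        counts.update(WORD.findall(text.lower()))
    return counts.most_common()


texts = ["Mtoto anasoma kitabu.", "Kitabu ni kizuri."]
wordlist = frequency_wordlist(texts)
```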

Next step: to clean up these lists.

This entails removing any words that shouldn’t be on the list, e.g. names of places and people, file extensions like .indd, numerals, etc.
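Part of this clean-up can be automated. The following sketch assumes a wordlist of (word, count) pairs plus two hand-compiled sets, `proper_names` and `junk_tokens`; both sets and the function name are hypothetical, and in practice the name list would need manual checking.

```python
def clean_wordlist(wordlist, proper_names, junk_tokens):
    """Drop numerals, file-extension debris, and known proper names."""
    cleaned = []
    for word, count in wordlist:
        if any(ch.isdigit() for ch in word):
            continue  # numerals and tokens containing digits
        if word.startswith(".") or word in junk_tokens:
            continue  # file extensions such as .indd
        if word in proper_names:
            continue  # names of places and people
        cleaned.append((word, count))
    return cleaned


raw = [("kitabu", 120), ("nairobi", 45), ("2019", 30), (".indd", 12), ("shule", 98)]
cleaned = clean_wordlist(raw, proper_names={"nairobi"}, junk_tokens={"indd", "pdf"})
```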

To see which words had to be considered for inclusion in the new edition of the dictionary, a comparison was necessary: comparing the existing wordlist (of the current dictionary) with the wordlist extracted from the corpus.

This is easy to do in Excel:

Paste the two lists next to each other and run a formula that shows which words appear in one list but not in the other.

Formula (entered in row 1 and filled down; TRUE means the word in column A also appears in column B):

=IF(ISERROR(VLOOKUP(A1,B:B,1,FALSE)),FALSE,TRUE)
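The same comparison can be made outside Excel with set operations. This is a minimal sketch, assuming both wordlists are plain lists of strings; the function name and sample words are illustrative.

```python
def compare_wordlists(dictionary_words, corpus_words):
    """Report which words appear in one list but not in the other."""
    dictionary_set = {w.strip().lower() for w in dictionary_words if w.strip()}
    corpus_set = {w.strip().lower() for w in corpus_words if w.strip()}
    return {
        # Candidates for inclusion in the new edition.
        "missing_from_dictionary": sorted(corpus_set - dictionary_set),
        # Existing headwords with no corpus evidence.
        "missing_from_corpus": sorted(dictionary_set - corpus_set),
    }


result = compare_wordlists(
    dictionary_words=["kitabu", "shule", "mwalimu"],
    corpus_words=["Kitabu", "shule", "simu"],
)
```

Unlike the one-directional formula, this reports gaps in both directions in a single pass.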

Sketch Engine and WebBootCaT technology

Some background and a hands-on exercise

How it works

Sketch Engine has a built-in corpus tool that enables users to extract information from the internet to build a corpus.

Sketch Engine’s corpus-building tool, which uses the WebBootCaT technology, automatically creates a text corpus from relevant web pages.

Data downloaded from the internet is cleaned and optionally de-duplicated, and non-text is eliminated, to obtain linguistically valuable text material.
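To give a feel for what cleaning and de-duplication involve, here is a deliberately simple sketch. It is not how Sketch Engine or WebBootCaT works internally: the paragraph-level exact matching, the `min_words` threshold, and the function name are all assumptions made for illustration.

```python
import hashlib
import re


def clean_and_deduplicate(paragraphs, min_words=5):
    """Normalise whitespace, drop short fragments, and remove exact repeats."""
    seen = set()
    kept = []
    for paragraph in paragraphs:
        text = re.sub(r"\s+", " ", paragraph).strip()
        if len(text.split()) < min_words:
            continue  # likely menu items or other non-text debris
        digest = hashlib.sha1(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # already kept an identical paragraph
        seen.add(digest)
        kept.append(text)
    return kept


paragraphs = [
    "Home About Login",
    "Kamusi hii ina maneno mengi ya Kiswahili.",
    "Kamusi  hii ina maneno mengi ya Kiswahili. ",
]
kept = clean_and_deduplicate(paragraphs)
```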

Let’s try this technology in Sketch Engine.

Step 1 – Click on “New corpus”

Step 2 – Give your corpus a name and specify the language

Step 3 – Click on “Find texts on the web”

Step 4 – Click on “Web search”, specify the maximum number of URLs, and then click “Go”

Once the corpus is compiled, a pop-up will appear to show you it is done

Now you can start having fun!

Search for concordances, compile wordlists, etc.


• Using Sketch Engine to identify:

– Common collocates of different types
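For readers without access to Sketch Engine, the basic idea of a collocate search can be approximated with raw window counts. This is a simplification and not Sketch Engine's method: it ignores grammar and statistical scoring, and the function name, window size, and sample sentences are assumptions.

```python
import re
from collections import Counter


def collocates(sentences, node, window=2):
    """Count words occurring within `window` tokens of the node word."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[^\W\d_]+", sentence.lower())
        for i, token in enumerate(tokens):
            if token == node:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                counts.update(left + right)
    return counts.most_common()


sentences = ["a herbal remedy for colds", "an effective remedy for headaches"]
top = collocates(sentences, "remedy", window=1)
```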



• Using Sketch Engine to find:

– Good (bland) example phrases, which illustrate how the headword is used in context
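A crude way to shortlist candidate examples from any list of sentences is to keep short ones that contain the headword. The sketch below is a plain-Python stand-in, not a Sketch Engine feature; the `max_words` limit and the sample sentences are assumptions, and an editor would still choose and adapt the final example.

```python
import re


def candidate_examples(sentences, headword, max_words=12):
    """Return short sentences containing the headword, shortest first."""
    pattern = re.compile(rf"\b{re.escape(headword)}\b", re.IGNORECASE)
    hits = [
        s.strip()
        for s in sentences
        if pattern.search(s) and len(s.split()) <= max_words
    ]
    return sorted(hits, key=lambda s: len(s.split()))


sentences = [
    "There was a long queue outside the bank.",
    "Join the queue.",
    "The riot lasted all night and spread across several districts of the city before order was restored.",
]
examples = candidate_examples(sentences, "queue")
```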


Gap filling

Word lists, reading programme, corpus research, …


Types of corpora

synchronic corpus: texts from a single period of time

diachronic corpus: texts collected over a long period of time, to show chronological change

– monitor corpus: frequently updated corpus of contemporary language, for identifying emerging language trends

– historical corpus: uses historical texts to show language change over a long period of time

learner corpus: content produced by learners of a language, used to study language learning

multilingual corpus: encompasses text in two or more languages

– parallel corpus: versions of the same texts in two (or more) different languages, aligned with each other to aid in identifying translations

– comparable corpus: a set of corpora in multiple languages on the same topic but not translating the same text




• Corpus task: 15 minutes in pairs or groups of three

– Choose an entry or entries from a list of suggestions from the Oxford English Corpus and use Sketch Engine to create a framework entry for a basic dictionary

– It should include: part(s) of speech, sense division, illustrative example(s)

– Followed by 15 minutes of group discussion as to

• what you found interesting about using corpus

• how you might use it in the future



Creating dictionaries for under-resourced languages

Suggested corpus look-up entries:

aerial

alcoholic

Austrian

curve

insult

manual

queue

riot

remedy

throttle


Tools that may be of interest

• Sketch Engine (https://app.sketchengine.eu/#open): corpus software and corpora in many languages, including WebBootCaT (https://www.sketchengine.eu/guide/create-a-corpus-from-the-web/): corpus builder

• NoSketch Engine (https://www.sketchengine.eu/nosketch-engine/): free version with limited functionality

• TextSTAT (http://neon.niederlandistik.fu-berlin.de/en/textstat/): corpus software

• AntConc (http://www.laurenceanthony.net/software/antconc/): corpus software

• Wordsmith (https://www.lexically.net/wordsmith/): corpus software

• BYU corpora (https://www.english-corpora.org/): corpora in English and Spanish (https://www.corpusdelespanol.org/)

• CQPWeb (https://cqpweb.lancs.ac.uk/): corpora in many languages

• Lexonomy.eu (https://www.lexonomy.eu/): dictionary editing and publishing tool



Thank you

With special thanks to: everyone at Afrilex who made this workshop possible, the team at Lexical Computing Ltd for access to Sketch Engine and the Oxford English Corpus, and the team at OUP, especially Andy Allen, Tressy Arts, Emma Davies, Phillip Louw, Katherine Martin, Judy Pearsall, and Angus Stevenson

Marzanne Janse van Rensburg and Vivian Marr Oxford University Press
