Creating bilingual dictionaries for under-resourced languages
Marzanne Janse van Rensburg and Vivian Marr Oxford University Press
Afrilex, Windhoek, June 2019
Creating dictionaries for under-resourced languages
Timing: 8.30-9.45
1. The Oxford Global Languages programme
2. Content creation methodology
• Principles
• Single lexicographical framework
• Hands-on translation and discussion (50 minutes)
9.45-10.15: Tea break!
Afrilex workshop, June 2019 (C) Oxford University Press
Timing: 10.15-12.00
1. Corpora
• Gap-filling
• Building your own corpus
• Using corpus to build a framework entry
2. Ask questions at any time!
Oxford Global Languages
The vision:
To bring lexical content online for 100 of the world’s languages and make it available to developers, consumers, licensees, and researchers for a wide variety of uses
The mission:
To improve the quality and breadth of global linguistic knowledge and communication, giving voice to all people in a rapidly changing world
Oxford Global Languages
20 languages launched to date – 6 more almost ready to go
https://developer.oxforddictionaries.com https://www.oxforddictionaries.com
Georgian
Greek
Gujarati
Hausa
Hindi
Igbo
Indonesian
isiXhosa
isiZulu
Kiswahili
Latvian
Malay
Marathi
Northern Sotho
Persian
Romanian
Setswana
Southern Quechua
Tajik
Tamil
Tatar
Telugu
Tok Pisin
Turkmen
Urdu
Yoruba
• The principles – applicable to all languages:
– Single neutral, common framework
– Arranged by frequency
– Translate
– Peer review
– Discuss
– Finalize
– Reverse
• Working iteratively, based locally
• Why a single common, neutral framework?
– Reusable as the core for content creation
– Scalable – can be iteratively expanded and enhanced
– Common source for multiple languages allows automatic generation of language pairs
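To illustrate the last point, here is a minimal sketch of how two translation sets built on the same framework could be joined into a draft language pair. The sense IDs, the `pair_languages` helper, and the French and Spanish sample data are all assumptions made for illustration; they are not the framework's actual identifiers or data, and any generated pair would still need editorial review.

```python
from collections import defaultdict


def pair_languages(lang_a, lang_b):
    """Join two framework-keyed translation tables on their shared sense IDs.

    Each table maps a (hypothetical) framework sense ID to a list of
    translations in one language.
    """
    pairs = defaultdict(set)
    for sense_id, a_words in lang_a.items():
        for a_word in a_words:
            # Translations of the same framework sense are candidate equivalents.
            pairs[a_word].update(lang_b.get(sense_id, []))
    # Drop words with no counterpart and sort for stable output.
    return {word: sorted(targets) for word, targets in pairs.items() if targets}


# Illustrative data only: sense IDs and coverage are made up.
french = {"bear.n.1": ["ours"], "bear.v.1": ["supporter"], "oil.n.1": ["huile"]}
spanish = {"bear.n.1": ["oso"], "bear.v.1": ["soportar"], "paper.n.1": ["papel"]}

draft_fr_es = pair_languages(french, spanish)
print(draft_fr_es)
```

Only senses translated in both languages produce a pair, so coverage of the generated dictionary depends on how much of the framework each language has completed.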
• WordReference automatically generated “virtual dictionaries”
• “Far from perfect” – but useful
• The results:
ELF: English Language Framework
• Frequency order
• Prioritized senses
• Translation help to aid editors
• Available to license in exchange for digital rights
• What’s the minimum we want to offer users? – Headword
– Part of speech
– Sense division
– Example sentences showing language in use
– Language register shown
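The minimum entry listed above could be modelled roughly as follows. This is a sketch only: the class names, field names, and the sample example phrase are hypothetical and do not reflect the actual ELF schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Sense:
    indicator: str                                   # short label that separates this sense from the others
    translations: List[str] = field(default_factory=list)
    examples: List[str] = field(default_factory=list)  # sentences showing language in use
    register: Optional[str] = None                   # e.g. "informal"; None means neutral


@dataclass
class Entry:
    headword: str
    part_of_speech: str
    senses: List[Sense] = field(default_factory=list)


# Hypothetical entry awaiting translation; the example phrase is made up.
entry = Entry(
    headword="quick",
    part_of_speech="adjective",
    senses=[Sense(indicator="fast", examples=["a quick reply"])],
)

# Serialise to JSON so the entry can be stored or exchanged.
entry_json = json.dumps(asdict(entry), ensure_ascii=False, indent=2)
print(entry_json)
```

Keeping translations, examples, and register at sense level rather than entry level means each sense can be translated and reviewed independently.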
• Translations, of course! Not definitions.
• Translations must be:
– Up-to-date
– Accurate
– Matching in register
• Additional help for licensees
– Translation help
– Training materials
• Arranged by frequency, derived from corpus
– Covers core language first
• Based locally
– Recruitment where the language is used
– Opportunities for local skills development and expansion
• Working iteratively
– Build and release, build and release
– Quicker to market and to generate revenue
Content reversal
• Content reversal
– Translations become headword candidates
– Still experimenting
– Success with English-Igbo, English-Yoruba, English-Marathi – but still small
– Other sources needed to ensure appropriate and sufficient coverage
– Creating framework on the fly
– Challenge to build iteratively
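The first step of reversal, where translations become headword candidates, can be sketched as below. The tuple layout, the `reverse_entries` helper, and the English-French sample data are assumptions for illustration, not the process actually used; the output is a list of candidates that still needs other sources and editorial work, as noted above.

```python
from collections import defaultdict


def reverse_entries(entries):
    """Turn source-to-target entries into target-language headword candidates.

    `entries` is a list of (headword, part_of_speech, sense_label, translations)
    tuples; this layout is hypothetical.
    """
    candidates = defaultdict(list)
    for headword, pos, sense_label, translations in entries:
        for translation in translations:
            # Each translation points back to the source sense it came from.
            candidates[translation].append((headword, pos, sense_label))
    return dict(sorted(candidates.items()))


# Illustrative English-French data only.
entries = [
    ("quick", "adjective", "moving fast", ["rapide"]),
    ("fast", "adjective", "moving quickly", ["rapide"]),
    ("paper", "noun", "material", ["papier"]),
]

reversed_candidates = reverse_entries(entries)
for candidate, sources in reversed_candidates.items():
    print(candidate, sources)
```

A candidate that collects several source senses, like "rapide" here, shows where an editor would need to decide on sense division in the reversed entry.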
• Editing tools
– Spreadsheet approach
– Also possible to work in XML, with tools like TshwaneLex
– Moving to JSON format for purely digital product
– Building our own editing tool - DELTA
– Dictionary Creation Package can be supplied as spreadsheet or along with DELTA
• Translation task: 20 minutes in pairs or groups of three
– Choose an entry or entries from the English Language Framework (ELF) and work through translating into your language
– Followed by 20 minutes of group discussion:
• what went well/could have gone better
• where the help provided was sufficient/could have been improved
Sample ELF entries to choose from:
bear
centre
dark
exercise
happy
measure
oil
paper
quick
sleep
Tea break!
Gap-filling in an Oxford Kiswahili dictionary
Approach
Used all the OUP-published Kiswahili textbooks and literature for Kenya and Tanzania to build a corpus
Extracted a wordlist from the textbook corpus, sorted by frequency
Next step: to clean up these lists.
This entails removing any words that shouldn’t be on the list, e.g. names of places and people, file extensions like .indd, numerals, etc.
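The extraction and clean-up steps could look something like the sketch below. It assumes plain-text input and whitespace tokenisation, and the `frequency_wordlist` helper, the exclusion list of names, and the short sample text are all made up for illustration; a real clean-up would still need manual checking.

```python
import re
from collections import Counter

# Tokens ending in something like ".indd" are treated as file names.
FILE_EXTENSION = re.compile(r"\.[a-z0-9]{2,5}$")


def frequency_wordlist(text, excluded=frozenset()):
    """Return (word, count) pairs sorted by descending frequency, then A-Z."""
    counts = Counter()
    for token in text.lower().split():
        # Trim surrounding quotes/brackets and trailing punctuation.
        word = token.strip("\"'()[]").rstrip(".,;:!?")
        if not word or FILE_EXTENSION.search(word):
            continue
        if any(ch.isdigit() for ch in word):      # numerals and codes
            continue
        if word in excluded:                      # names of places and people
            continue
        counts[word] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Made-up sample text and a hypothetical list of names to exclude.
sample = "Kitabu cha mwalimu. Mwalimu ana kitabu 12 Nairobi cover.indd kitabu"
wordlist = frequency_wordlist(sample, excluded={"nairobi"})
print(wordlist)
```

The exclusion list has to be supplied by hand or from a gazetteer; nothing in this sketch recognises proper names automatically.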
In order to see which words had to be considered for inclusion in the new edition of the dictionary, a comparison was necessary: comparing the existing wordlist (of the current dictionary) to the wordlist extracted from the corpus.
This is easy to do in Excel: paste the two lists next to each other and run a formula that shows which words appear in the one list but not in the other.
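Formula (enter next to the first word in column A and fill down; TRUE means the word in column A also appears in column B):
=IF(ISERROR(VLOOKUP(A1,B:B,1,FALSE)),FALSE,TRUE)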
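For larger lists, the same comparison can be sketched outside Excel with set operations. The `compare_wordlists` function name and the sample words are hypothetical, and the sketch assumes that matching should ignore case and surrounding whitespace.

```python
def compare_wordlists(corpus_words, dictionary_words):
    """Report which words appear in one list but not in the other."""
    corpus = {w.strip().lower() for w in corpus_words if w.strip()}
    dictionary = {w.strip().lower() for w in dictionary_words if w.strip()}
    return {
        # Candidates for inclusion in the new edition.
        "missing_from_dictionary": sorted(corpus - dictionary),
        # Dictionary headwords that the corpus does not attest.
        "missing_from_corpus": sorted(dictionary - corpus),
    }


# Made-up sample lists.
result = compare_wordlists(
    corpus_words=["kitabu", "mwalimu", "shule "],
    dictionary_words=["Kitabu", "shule", "chakula"],
)
print(result)
```

Unlike the one-directional lookup formula, this reports the gaps in both directions in a single pass.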
Sketch Engine and WebBootCaT technology
Some background and a hands-on exercise
How it works
Sketch Engine has a built-in corpus-building tool that lets users build a corpus from text on the internet.
The tool uses the WebBootCaT technology to create a text corpus automatically from relevant web pages.
Data downloaded from the internet is cleaned and optionally de-duplicated, and non-text content is removed, to obtain linguistically valuable text material.
Let’s try this technology in Sketch Engine.
Step 1 – Click on “New corpus”
Step 2 – Give your corpus a name and specify the language
Step 3 – Click on “Find texts on the web”
Step 4 – Click on “Web search”, specify the maximum number of URLs, and then click “Go”
Once the corpus is compiled, a pop-up will appear to show you it is done
Now you can start having fun!
Search for concordances, compile wordlists, etc.
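As a rough idea of what a concordance search returns, here is a minimal keyword-in-context sketch over a tokenised text. It is not how Sketch Engine works internally; the `concordance` function and the sample sentence are assumptions for illustration only.

```python
def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit."""
    keyword = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Made-up sample text.
tokens = "the bear slept while the other bear looked for food".split()
hits = concordance(tokens, "bear", width=2)
for left, node, right in hits:
    # Align the keyword in a central column, as concordancers usually do.
    print(f"{left:>12} | {node} | {right}")
```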
• Using Sketch Engine to identify:
– Common collocates of different types
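A very simplified way to surface candidate collocates is to count the words that occur within a few tokens of the headword. The sketch below does only that; it does not group collocates by grammatical type or apply any association measure, so it should not be read as Sketch Engine's method. The function name, window size, stopword list, and sample text are all assumptions.

```python
from collections import Counter


def window_collocates(tokens, node, window=2, stopwords=frozenset()):
    """Count words that occur within `window` tokens of `node`."""
    node = node.lower()
    lowered = [t.lower() for t in tokens]
    counts = Counter()
    for i, token in enumerate(lowered):
        if token != node:
            continue
        neighbours = lowered[max(0, i - window):i] + lowered[i + 1:i + 1 + window]
        counts.update(w for w in neighbours if w not in stopwords)
    return counts.most_common()


# Made-up sample text.
tokens = "dark night dark chocolate a dark night".split()
collocates = window_collocates(tokens, "dark", window=1, stopwords={"a"})
print(collocates)
```

Raw counts favour very frequent words, which is why the sketch takes a stopword list; on a real corpus a proper association score would be needed.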
• Using Sketch Engine to find:
– Good (bland) example phrases, which illustrate how the headword is used in context
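One way to shortlist candidate example sentences from corpus output is to filter on simple surface criteria. The thresholds and rules in this sketch (length limits, complete-sentence check, no digits or URLs) are assumptions chosen for illustration; they are not Sketch Engine's criteria, and an editor would still choose the final examples.

```python
import re


def example_candidates(sentences, headword, min_words=4, max_words=12):
    """Keep short, complete sentences that contain the headword."""
    pattern = re.compile(r"\b" + re.escape(headword) + r"\b", re.IGNORECASE)
    kept = []
    for sentence in sentences:
        sentence = sentence.strip()
        if not pattern.search(sentence):
            continue
        if not min_words <= len(sentence.split()) <= max_words:
            continue
        # Require an initial capital and final punctuation (a whole sentence).
        if not sentence[0].isupper() or sentence[-1] not in ".!?":
            continue
        # Skip sentences with digits or URLs, which rarely make bland examples.
        if re.search(r"\d|https?://", sentence):
            continue
        kept.append(sentence)
    # Shortest candidates first.
    return sorted(kept, key=lambda s: len(s.split()))


# Made-up sample sentences.
sentences = [
    "She joined the queue outside the shop.",
    "queue 4 tickets http://example.com",
    "The queue.",
    "We waited for an hour.",
]
candidates = example_candidates(sentences, "queue")
print(candidates)
```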
Gap filling
Word lists, reading programme, corpus research, …
Types of corpora
synchronic corpus: texts from a single period of time
diachronic corpus: texts collected over a long period of time, to show chronological change
– monitor corpus: frequently updated corpus of contemporary language, for identifying emerging language trends
– historical corpus: uses historical texts to show language change over a long period of time
learner corpus: content produced by learners of a language, used to study language learning
multilingual corpus: encompasses text in two or more languages
– parallel corpus: versions of the same texts in two (or more) different languages, aligned with each other to aid in identifying translations
– comparable corpus: a set of corpora in multiple languages on the same topic but not translating the same text
• Corpus task: 15 minutes in pairs or groups of three
– Choose an entry or entries from a list of suggestions from the Oxford English Corpus and use Sketch Engine to create a framework entry for a basic dictionary
– It should include: part(s) of speech, sense division, illustrative example(s)
– Followed by 15 minutes of group discussion:
• what you found interesting about using corpus
• how you might use it in the future
Suggested corpus look-up entries:
aerial
alcoholic
Austrian
curve
insult
manual
queue
riot
remedy
throttle
Tools that may be of interest
• Sketch Engine (https://app.sketchengine.eu/#open): corpus software and corpora in many languages, including WebBootCaT (https://www.sketchengine.eu/guide/create-a-corpus-from-the-web/): corpus builder
• NoSketch Engine (https://www.sketchengine.eu/nosketch-engine/): free version with limited functionality
• TextSTAT (http://neon.niederlandistik.fu-berlin.de/en/textstat/): corpus software
• AntConc (http://www.laurenceanthony.net/software/antconc/): corpus software
• Wordsmith (https://www.lexically.net/wordsmith/): corpus software
• BYU corpora (https://www.english-corpora.org/): corpora in English and Spanish (https://www.corpusdelespanol.org/)
• CQPWeb (https://cqpweb.lancs.ac.uk/): corpora in many languages
• Lexonomy.eu (https://www.lexonomy.eu/): dictionary editing and publishing tool
Thank you
With special thanks to: everyone at Afrilex who made this workshop possible, the team at Lexical Computing Ltd for access to Sketch Engine and the Oxford English Corpus, and the team at OUP, especially Andy Allen, Tressy Arts, Emma Davies, Phillip Louw, Katherine Martin, Judy Pearsall, and Angus Stevenson