
Creating bilingual dictionaries for under-resourced languages

Marzanne Janse van Rensburg and Vivian Marr Oxford University Press

Afrilex, Windhoek, June 2019


Creating dictionaries for under-resourced languages

Timing: 8.30-9.45

1. The Oxford Global Languages programme

2. Content creation methodology

• Principles

• Single lexicographical framework

• Hands-on translation and discussion (50 minutes)

9.45-10.15: Tea break!

Afrilex workshop, June 2019 (C) Oxford University Press


Timing: 10.15-12.00

1. Corpora

• Gap-filling

• Building your own corpus

• Using corpus to build a framework entry

2. Ask questions at any time!


Oxford Global Languages

The vision:

To bring lexical content online for 100 of the world’s languages and make it available to developers, consumers, licensees, and researchers for a wide variety of uses

The mission:

To improve the quality and breadth of global linguistic knowledge and communication, giving voice to all people in a rapidly changing world


Oxford Global Languages

20 languages launched to date – 6 more almost ready to go

https://developer.oxforddictionaries.com https://www.oxforddictionaries.com

Georgian

Greek

Gujarati

Hausa

Hindi

Igbo

Indonesian

isiXhosa

isiZulu

Kiswahili

Latvian

Malay

Marathi

Northern Sotho

Persian

Romanian

Setswana

Southern Quechua

Tajik

Tamil

Tatar

Telugu

Tok Pisin

Turkmen

Urdu

Yoruba


• The principles – applicable to all languages:

– Single neutral, common framework

– Arranged by frequency

– Translate

– Peer review

– Discuss

– Finalize

– Reverse

• Working iteratively, based locally


• Why a single common, neutral framework?

– Reusable as the core for content creation

– Scalable – can be iteratively expanded and enhanced

– Common source for multiple languages allows automatic generation of language pairs
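A minimal sketch of how a shared framework could allow language pairs to be generated automatically: if two languages have both been translated from the same framework senses, their translations can be linked through the shared sense key. The function name, the data layout and the placeholder words below are assumptions for illustration only, not OUP's actual process or data.

```python
from collections import defaultdict


def pair_via_framework(lang_a, lang_b):
    """Link translations in two languages that share a framework sense.

    Both arguments map a (framework headword, sense number) key to a list
    of translations in one target language (hypothetical layout).
    """
    pairs = defaultdict(set)
    for sense_key, words_a in lang_a.items():
        for word_a in words_a:
            # Every translation of the same sense in language B is a
            # candidate equivalent of word_a.
            pairs[word_a].update(lang_b.get(sense_key, []))
    # Drop words with no counterpart and sort for stable output.
    return {word: sorted(targets) for word, targets in pairs.items() if targets}


# Placeholder data for two hypothetical languages "xx" and "yy".
lang_xx = {
    ("paper", 1): ["xx_paper"],
    ("paper", 2): ["xx_newspaper"],
    ("oil", 1): ["xx_oil"],
}
lang_yy = {
    ("paper", 1): ["yy_paper"],
    ("paper", 2): ["yy_newspaper", "yy_gazette"],
}

virtual_xx_yy = pair_via_framework(lang_xx, lang_yy)
```

Because the link is made per sense rather than per headword, the two senses of "paper" stay separate; a pair generated this way would still need review by an editor.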


• WordReference automatically generated “virtual dictionaries”

• “Far from perfect” – but useful


• The results:


ELF: English Language Framework

• Frequency order

• Prioritized senses

• Translation help to aid editors

• Available to license in exchange for digital rights


• What’s the minimum we want to offer users?

– Headword

– Part of speech

– Sense division

– Example sentences showing language in use

– Language register shown


• Translations, of course! Not definitions.

• Translations must be:

– Up-to-date

– Accurate

– Matching in register
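The minimum entry described above (headword, part of speech, sense division, examples, register, translations) can be sketched as a small data structure. The class and field names are assumptions chosen for illustration; they are not the actual ELF schema, and the translation strings are placeholders.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Example:
    source: str        # example sentence or phrase in the framework language
    translation: str   # its translation in the target language


@dataclass
class Sense:
    indicator: str                 # short hint that tells senses apart
    translations: list             # translations, not definitions
    register: str = "neutral"      # e.g. "formal", "informal"
    examples: list = field(default_factory=list)


@dataclass
class Entry:
    headword: str
    part_of_speech: str
    senses: list = field(default_factory=list)


# Placeholder content: the angle-bracket strings stand in for real translations.
entry = Entry(
    headword="quick",
    part_of_speech="adjective",
    senses=[
        Sense(
            indicator="fast",
            translations=["<translation>"],
            examples=[Example("a quick answer", "<translated example>")],
        )
    ],
)

# asdict() converts nested dataclasses, so the entry serializes directly.
entry_json = json.dumps(asdict(entry), ensure_ascii=False, indent=2)
```

Keeping register on the sense rather than on the entry lets one headword carry both a neutral and an informal sense.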


• Additional help for licensees

– Translation help

– Training materials


• Arranged by frequency, derived from corpus

– Covers core language first


• Based locally

– Recruitment where the language is used

– Opportunities for local skills development and expansion

• Working iteratively

– Build and release, build and release

– Quicker to market and to generate revenue


Content reversal


• Content reversal

– Translations become headword candidates

– Still experimenting

– Success with English-Igbo, English-Yoruba, English-Marathi – but still small

– Other sources needed to ensure appropriate and sufficient coverage

– Creating framework on the fly

– Challenge to build iteratively
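The first step of content reversal, where translations become headword candidates, can be sketched as an inversion of the translated entries. This is a simplified illustration under assumed data (placeholder translations "t1" to "t3" and a flat tuple layout), not the reversal procedure actually used.

```python
from collections import defaultdict


def reverse_entries(entries):
    """Turn (headword, part of speech, translation) rows into reversed candidates.

    Each translation becomes a candidate headword for the reversed dictionary,
    pointing back at the framework headwords it translated.
    """
    candidates = defaultdict(list)
    for headword, pos, translation in entries:
        back_link = (headword, pos)
        if back_link not in candidates[translation]:
            candidates[translation].append(back_link)
    # Sort candidate headwords alphabetically for review.
    return dict(sorted(candidates.items()))


# Placeholder rows: "t1", "t2", "t3" stand in for target-language words.
rows = [
    ("quick", "adjective", "t1"),
    ("fast", "adjective", "t1"),
    ("sleep", "verb", "t2"),
    ("sleep", "noun", "t3"),
]

reversed_candidates = reverse_entries(rows)
```

A list built this way only contains words that happened to be used as translations, which is why other sources are needed for sufficient coverage.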


• Editing tools

– Spreadsheet approach

– Also possible to work in XML, with tools like TshwaneLex

– Moving to JSON format for purely digital product

– Building our own editing tool - DELTA

– Dictionary Creation Package can be supplied as spreadsheet or along with DELTA


• Translation task: 20 minutes in pairs or groups of three

– Choose an entry or entries from the English Language Framework (ELF) and work through translating into your language

– Followed by 20 minutes of group discussion:

• what went well/could have gone better

• where the help provided was sufficient/could have been improved


Sample ELF entries to choose from:

bear

centre

dark

exercise

happy

measure

oil

paper

quick

sleep


Tea break!


Gap-filling in an Oxford Kiswahili dictionary


Approach

Used all the OUP-published Kiswahili textbooks and literature for Kenya and Tanzania to build a corpus.

Extracted a wordlist from the textbook corpus, sorted by frequency.
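Extracting a frequency-sorted wordlist from a set of texts can be sketched as follows. The tokenizer is a deliberately simple assumption (letters, with internal apostrophes or hyphens), and the two sample sentences are illustrative only; they are not taken from the OUP corpus.

```python
import re
from collections import Counter

# Runs of letters, optionally joined by an internal apostrophe or hyphen.
TOKEN = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def frequency_wordlist(texts):
    """Return (word, count) pairs, most frequent first."""
    counts = Counter()
    for text in texts:
        counts.update(token.lower() for token in TOKEN.findall(text))
    # Ties are broken alphabetically so the output is reproducible.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Illustrative sample texts (assumed, not from the textbook corpus).
sample_texts = ["Mtoto anasoma kitabu.", "Kitabu ni kizuri."]
wordlist = frequency_wordlist(sample_texts)
```

This counts word forms, not lemmas, so inflected forms of the same word appear as separate rows.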


Next step: to clean up these lists.

This entails removing any words that shouldn’t be on the list, e.g. names of places and people, file extensions like .indd, numerals, etc.
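A rough sketch of this clean-up step, assuming the names of places and people are supplied as a separate list and that a short, hypothetical set of file extensions is enough to catch layout-file residue. A real clean-up would still need a manual pass.

```python
import re

# Hypothetical extension list; extend it to match the source files in use.
FILE_EXT = re.compile(r"^\.?(indd|pdf|docx?|jpe?g|png)$", re.IGNORECASE)


def clean_wordlist(words, proper_names):
    """Drop numerals, file-extension residue and known proper names."""
    names = {name.lower() for name in proper_names}
    kept = []
    for word in words:
        if any(ch.isdigit() for ch in word):
            continue  # numerals and tokens with digits in them
        if FILE_EXT.match(word):
            continue  # residue such as "indd" or ".indd"
        if word.lower() in names:
            continue  # names of places and people
        kept.append(word)
    return kept


raw_words = ["kitabu", "indd", "2019", "nairobi", "shule", ".indd", "ukurasa12"]
cleaned = clean_wordlist(raw_words, proper_names=["Nairobi", "Tanzania"])
```

Note that a genuine word that happens to look like a file extension would also be dropped, so the removed items are worth a quick review.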


To see which words had to be considered for inclusion in the new edition of the dictionary, a comparison was necessary: the existing wordlist (of the current dictionary) against the wordlist extracted from the corpus.


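This is easy to do in Excel:

Paste the two lists next to each other, in columns A and B, and run a formula that shows which words appear in the one list but not in the other.

Formula (entered in row 1 of a third column and filled down):

=IF(ISERROR(VLOOKUP(A1,B:B,1,FALSE)),FALSE,TRUE)

TRUE means the word in column A also appears in column B; FALSE means it does not.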
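The same comparison can be sketched outside Excel with a set lookup. The function name and the normalization (trimming spaces, lower-casing) are assumptions; adjust them if case or spacing is meaningful in the wordlists.

```python
def compare_wordlists(dictionary_words, corpus_words):
    """Return corpus words that are missing from the dictionary wordlist."""
    dictionary_set = {word.strip().lower() for word in dictionary_words}
    missing = []
    seen = set()
    for word in corpus_words:
        key = word.strip().lower()
        # Keep corpus order, skip blanks and repeated words.
        if key and key not in dictionary_set and key not in seen:
            seen.add(key)
            missing.append(key)
    return missing


# Illustrative lists (assumed data).
current_dictionary = ["kitabu", "shule"]
corpus_wordlist = ["Kitabu", "mtoto", "shule", "mtoto ", "ni"]

gap_candidates = compare_wordlists(current_dictionary, corpus_wordlist)
```

If the corpus list is already sorted by frequency, the result keeps that order, so the most frequent missing words come first.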


Sketch Engine and WebBootCaT technology

Some background and a hands-on exercise


How it works

Sketch Engine has a built-in corpus tool that enables users to extract information from the internet to build a corpus.


Sketch Engine’s corpus-building tool, which uses the WebBootCaT technology, automatically creates a text corpus from relevant web pages.

Data downloaded from the internet is cleaned and optionally de-duplicated, and non-text content is eliminated, to obtain linguistically valuable text material.
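To make the cleaning and de-duplication idea concrete, here is a toy sketch that strips markup from downloaded pages, discards very short fragments and removes exact duplicates. It is not how WebBootCaT itself works; the regular expressions, the `min_words` threshold and the sample pages are all assumptions for illustration.

```python
import html
import re

SCRIPT_OR_STYLE = re.compile(r"<(script|style)\b.*?</\1>", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def clean_page(raw_html):
    """Remove scripts, styles and tags, then normalize whitespace."""
    text = SCRIPT_OR_STYLE.sub(" ", raw_html)
    text = TAG.sub(" ", text)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def build_corpus(pages, min_words=3):
    """Keep each sufficiently long cleaned text once."""
    seen = set()
    corpus = []
    for raw in pages:
        text = clean_page(raw)
        if len(text.split()) < min_words:
            continue  # menus, buttons and other near-empty fragments
        key = text.lower()
        if key in seen:
            continue  # exact duplicate after cleaning
        seen.add(key)
        corpus.append(text)
    return corpus


# Illustrative pages (assumed data).
pages = [
    "<html><body><p>Kitabu ni kizuri sana.</p><script>var x = 1;</script></body></html>",
    "<p>Kitabu  ni kizuri sana.</p>",
    "<div>Menu</div>",
]
corpus = build_corpus(pages)
```

Real web data also contains near-duplicates and boilerplate inside otherwise useful pages, which exact matching like this does not catch.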


Let’s try this technology in Sketch Engine.

Step 1 – Click on “New corpus”


Step 2 – Give your corpus a name and specify the language


Step 3 – Click on “Find texts on the web”


Step 4 – Click on “Web search” and specify the maximum URLs and then click “Go”


Once the corpus is compiled, a pop-up will appear to show you it is done


Now you can start having fun!

Search for concordances, compile wordlists, etc.
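For readers who have not used a concordancer, a keyword-in-context (KWIC) search can be sketched in a few lines. This is a generic illustration of what a concordance shows, not Sketch Engine's implementation; the function name and sample sentence are assumptions.

```python
def concordance(tokens, keyword, window=3):
    """Return (left context, match, right context) for each hit of keyword."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


# Illustrative token list (assumed data).
tokens = "the quick fox saw a quick brown dog".split()
hits = concordance(tokens, "quick", window=2)
```

Aligning the matches in a middle column, as concordance displays do, makes recurring left and right neighbours easy to spot.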


• Using Sketch Engine to identify:

– Common collocates of different types
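As a rough illustration of what "common collocates" means, the sketch below counts the words that occur within a small window around a node word. Sketch Engine groups collocates by grammatical relation and ranks them with an association score, so this raw window count is only a simplified stand-in; the function name, window size and sample text are assumptions.

```python
from collections import Counter


def collocates(tokens, node, window=2, min_count=2):
    """Count words appearing within `window` tokens of each hit of `node`."""
    lowered = [token.lower() for token in tokens]
    node = node.lower()
    counts = Counter()
    for i, token in enumerate(lowered):
        if token != node:
            continue
        start = max(0, i - window)
        end = min(len(lowered), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[lowered[j]] += 1
    # Keep only neighbours seen at least `min_count` times.
    return [(word, n) for word, n in counts.most_common() if n >= min_count]


# Illustrative text (assumed data).
text = "heavy rain fell and heavy rain stopped but light rain came"
rain_collocates = collocates(text.split(), "rain", window=1)
```

Raw counts favour very frequent function words, which is one reason corpus tools use association scores instead.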


• Using Sketch Engine to find:

– Good (bland) example phrases, which illustrate how the headword is used in context
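One way to think about "bland" examples is as sentences that contain the headword, are of moderate length, and avoid names and numbers. The sketch below ranks candidates on exactly those assumed criteria; it is a hypothetical heuristic for illustration, not the method Sketch Engine uses, and the thresholds and sample sentences are invented.

```python
def candidate_examples(sentences, headword, min_words=4, max_words=12):
    """Rank sentences containing the headword, blandest first."""
    scored = []
    for sentence in sentences:
        words = sentence.split()
        stripped = [word.strip(".,;:!?\"'").lower() for word in words]
        if headword.lower() not in stripped:
            continue
        if not (min_words <= len(words) <= max_words):
            continue
        # Capitalized words after the first (likely names) and tokens with
        # digits make a sentence less bland.
        penalty = sum(
            1
            for word in words[1:]
            if word[:1].isupper() or any(ch.isdigit() for ch in word)
        )
        scored.append((penalty, len(words), sentence))
    return [sentence for _, _, sentence in sorted(scored)]


# Illustrative sentences (assumed data).
sentences = [
    "She stood in the queue for an hour.",
    "Queue!",
    "In 2019 Maria joined the queue outside Harrods in London.",
    "There was a long queue at the bank.",
]
ranked = candidate_examples(sentences, "queue")
```

The ranking only narrows the field; an editor still chooses the example that best shows the headword in typical use.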


Gap filling

Word lists, reading programme, corpus research, …


Types of corpora

synchronic corpus: texts from a single period of time

diachronic corpus: texts collected over a long period of time, to show chronological change

– monitor corpus: frequently updated corpus of contemporary language, for identifying emerging language trends

– historical corpus: uses historical texts to show language change over a long period of time

learner corpus: content produced by learners of a language, used to study language learning

multilingual corpus: encompasses text in two or more languages

– parallel corpus: versions of the same texts in two (or more) different languages, aligned with each other to aid in identifying translations

– comparable corpus: a set of corpora in multiple languages on the same topic but not translating the same text


• Corpus task: 15 minutes in pairs or groups of three

– Choose an entry or entries from a list of suggestions from the Oxford English Corpus and use Sketch Engine to create a framework entry for a basic dictionary

– It should include: part(s) of speech, sense division, illustrative example(s)

– Followed by 15 minutes of group discussion as to

• what you found interesting about using corpus

• how you might use it in the future


Suggested corpus look-up entries:

aerial

alcoholic

Austrian

curve

insult

manual

queue

riot

remedy

throttle


Tools that may be of interest

• Sketch Engine (https://app.sketchengine.eu/#open): corpus software and corpora in many languages, including WebBootCaT (https://www.sketchengine.eu/guide/create-a-corpus-from-the-web/): corpus builder

• NoSketch Engine (https://www.sketchengine.eu/nosketch-engine/): free version with limited functionality

• TextSTAT (http://neon.niederlandistik.fu-berlin.de/en/textstat/): corpus software

• AntConc (http://www.laurenceanthony.net/software/antconc/): corpus software

• Wordsmith (https://www.lexically.net/wordsmith/): corpus software

• BYU corpora (https://www.english-corpora.org/): corpora in English and Spanish (https://www.corpusdelespanol.org/)

• CQPWeb (https://cqpweb.lancs.ac.uk/): corpora in many languages

• Lexonomy.eu (https://www.lexonomy.eu/): dictionary editing and publishing tool


Thank you

With special thanks to: everyone at Afrilex who made this workshop possible, the team at Lexical Computing Ltd for access to Sketch Engine and the Oxford English Corpus, and the team at OUP, especially Andy Allen, Tressy Arts, Emma Davies, Phillip Louw, Katherine Martin, Judy Pearsall, and Angus Stevenson

Marzanne Janse van Rensburg and Vivian Marr Oxford University Press