The Crúbadán Project: Corpus building for under-resourced languages Kevin P. Scannell Presented by: Ben King

The Crúbadán Project: Corpus building for under-resourced languages

Kevin P. Scannell

Presented by:Ben King

Problems with Minority Languages

• Often no public funding available and no commercial interest in this work

• Few native speakers with NLP training• Data are too scarce for “representativeness”

Overcoming these Problems

• An open source development model• Volunteer labor by language enthusiasts• Free, web-crawled corpora• Language-independent tools (like the crawler)

deployed for a large number of languages• Unsupervised machine learning algorithms

An Crúbadán

• A web-crawler that builds corpora for minority languages

• Phase 1 aims to collect text in all languages on the web

• Phase 2 aims to produce NLP resources in these languages

Statistics

• 1872 languages• 836,000 documents• 1,900,000,000 words

Design Principles

• Orthographies instead of languages• A corpus should consist of real running text– Not just word lists

• It should be possible to collect all the Web text for small languages

The pipeline – step 1: finding languages

• Lots of manual work– Web searching– Scanning and OCRing books– Monitor for new languages at

• Wikipedia• UDHR• Watchtower• bible.is

– Add language metadata• Expected polluting languages• Encodings



The pipeline – step 2: building a query

• Separate the known words into stopwords and other words

• OR together content words• AND with a stopword

The pipeline – step 3: processing documents

• Google returns many types of documents• Convert to plain text– pdf2text– Remove markup– Remove boilerplate– Convert to Unicode

• Identify the document language• Update lexicon, lang-id trigrams

The pipeline – step 4: repeat

• Repeat

Applications

Spell checkers

• Most languages do not have a spell checker• Building a spell checker requires– List of roots– Basic morphological rules

• A native speaker must identify roots and affixes

• 28 new spellcheckers created

Indigenous Tweets and Language Revitalization

Indigenous Tweets and Language Revitalization

• Indigenous Tweets project identifies and tracks Twitter accounts that tweet in minority languages

• New usage of endangered languages online:– Gamilaraay (Australia, ~3 speakers), – Nawat (El Salvador, ~20 speakers), – Delaware (USA, ~78 speakers)

• Encourages a younger generation to use endangered languages

Diacritic Restoration

• Many languages with diacritics have developed an ASCII orthography on the Web

• Examples– Oll skulu vera fraels at hava sinar askodanir og bera taer fram uttan fordan →– Øll skulu vera fræls at hava sínar áskoðanir og bera tær fram uttan forðan– Ua noa i na kanaka apau ke kuokoa o ka manao a me ka hoike ana i ka manao →– Ua noa i nā kānaka apau ke kūʻokoʻa o ka manaʻo a me ka hōʻike ʻana i ka manaʻo– Moi nguoi deu co quyen tu do ngon luan va bay to quan diem →– Mọi người đều có quyền tự do ngôn luận và bầy tỏ quan điểm– Eni kookan lo ni eto si omi nira lati ni imoran ti o wu u, ki o si so iru imoran bee jade→– Ẹnì k kan ló ní t sí òmì nira láti ní ìm ràn tí ó wù ú, kí ó sì sọ irú ìm ràn b jádeọ�ọ� è� ọ� ọ� ọ� è�è�

• Released as an open-source Firefox add-on, accentuate.us– Supports 84 languages

Diacritic Restoration

Continuing problems

• Variations in scripts– Example: Azerbaijani• ИНСАН ҺҮГУГЛАРЫ ҺАГГЫНДА ҮМУМИ БӘЈАННАМӘ• İNSAN HÜQUQLARI HAQQINDA ÜMUMİ BƏYANNAMƏ• INSAN HUQUQLARI HAQQINDA UMUMI BEYANNAME

• Mixed-language documents• Manual intervention

Documents

The Crúbadán Project: Corpus building for under-resourced languages Kevin P. Scannell Presented by: Ben King