Upload
melvin-sharp
View
225
Download
3
Tags:
Embed Size (px)
Citation preview
The Crúbadán Project: Corpus building for under-resourced languages
Kevin P. Scannell
Presented by:Ben King
Problems with Minority Languages
• Often no public funding available and no commercial interest in this work
• Few native speakers with NLP training• Data are too scarce for “representativeness”
Overcoming these Problems
• An open source development model• Volunteer labor by language enthusiasts• Free, web-crawled corpora• Language-independent tools (like the crawler)
deployed for a large number of languages• Unsupervised machine learning algorithms
An Crúbadán
• A web-crawler that builds corpora for minority languages
• Phase 1 aims to collect text in all languages on the web
• Phase 2 aims to produce NLP resources in these languages
Statistics
• 1872 languages• 836,000 documents• 1,900,000,000 words
Design Principles
• Orthographies instead of languages• A corpus should consist of real running text– Not just word lists
• It should be possible to collect all the Web text for small languages
The pipeline – step 1: finding languages
• Lots of manual work– Web searching– Scanning and OCRing books– Monitor for new languages at
• Wikipedia• UDHR• Watchtower• bible.is
– Add language metadata• Expected polluting languages• Encodings
The pipeline – step 1: finding languages
The pipeline – step 1: finding languages
The pipeline – step 2: building a query
• Separate the known words into stopwords and other words
• OR together content words• AND with a stopword
The pipeline – step 3: processing documents
• Google returns many types of documents• Convert to plain text– pdf2text– Remove markup– Remove boilerplate– Convert to Unicode
• Identify the document language• Update lexicon, lang-id trigrams
The pipeline – step 4: repeat
• Repeat
Applications
Spell checkers
• Most languages do not have a spell checker• Building a spell checker requires– List of roots– Basic morphological rules
• A native speaker must identify roots and affixes
• 28 new spellcheckers created
Indigenous Tweets and Language Revitalization
Indigenous Tweets and Language Revitalization
• Indigenous Tweets project identifies and tracks Twitter accounts that tweet in minority languages
• New usage of endangered languages online:– Gamilaraay (Australia, ~3 speakers), – Nawat (El Salvador, ~20 speakers), – Delaware (USA, ~78 speakers)
• Encourages a younger generation to use endangered languages
Diacritic Restoration
• Many languages with diacritics have developed an ASCII orthography on the Web
• Examples– Oll skulu vera fraels at hava sinar askodanir og bera taer fram uttan fordan →– Øll skulu vera fræls at hava sínar áskoðanir og bera tær fram uttan forðan– Ua noa i na kanaka apau ke kuokoa o ka manao a me ka hoike ana i ka manao →– Ua noa i nā kānaka apau ke kūʻokoʻa o ka manaʻo a me ka hōʻike ʻana i ka manaʻo– Moi nguoi deu co quyen tu do ngon luan va bay to quan diem →– Mọi người đều có quyền tự do ngôn luận và bầy tỏ quan điểm– Eni kookan lo ni eto si omi nira lati ni imoran ti o wu u, ki o si so iru imoran bee jade→– Ẹnì k kan ló ní t sí òmì nira láti ní ìm ràn tí ó wù ú, kí ó sì sọ irú ìm ràn b jádeọ�ọ� è� ọ� ọ� ọ� è�è�
• Released as an open-source Firefox add-on, accentuate.us– Supports 84 languages
Diacritic Restoration
Continuing problems
• Variations in scripts– Example: Azerbaijani• ИНСАН ҺҮГУГЛАРЫ ҺАГГЫНДА ҮМУМИ БӘЈАННАМӘ• İNSAN HÜQUQLARI HAQQINDA ÜMUMİ BƏYANNAMƏ• INSAN HUQUQLARI HAQQINDA UMUMI BEYANNAME
• Mixed-language documents• Manual intervention