LDMT MURI Data Collection and Linguistic Annotation November 2, 2012 Jason Baldridge, UT Austin Lori Levin, CMU


Purpose

Collect and build data
• Monolingual text
• Bilingual text
• Linguistic annotations

to support work on machine translation for
• Kinyarwanda-English
• Malagasy-English

Kinyarwanda Data Resources

[Diagram of Kinyarwanda data resources; sizes are word counts. Release legend: 1.0 Release 02/11, 2.0 Release 10/11, 3.0 Release 11/12.]

• BILINGUAL (285k): KGMC (270k) / KGMC (225k); Pbook (0.9k) / Pbook (0.7k)
• ENGLISH monolingual (huge): GWord (8b)
• KINYARWANDA monolingual (7m): News (7m)
• Text and treebank boxes (ENG treebank, ENG text, KIN text, KIN treebank): PTB (1m); KGMC (5.8k) / KGMC (4.8k); BBC (0.3k) / BBC (0.3k); IGT (0.1k) / IGT (0.06k); Dict (9k) / Dict (8k); KGMC (2.9k) / KGMC (3.8k); BBC (0.3k) / BBC (0.3k); IGT (0.06k) / IGT (0.1k)
• Added in the final build of the slide: Part-of-speech (2k) and GFL (4.7k) next to KGMC (2.9k); two entries marked "Reviewed & improved"

Malagasy Data Resources

[Diagram of Malagasy data resources; sizes are word counts. Release legend: 1.0 Release 02/11, 2.0 Release 10/11, 3.0 Release 11/12.]

• BILINGUAL (732k): Bible (730k) / Bible (725k); News (2.1k) / News (2.3k)
• ENGLISH monolingual (huge): GWord (8b)
• MALAGASY monolingual
• Text and treebank boxes (ENG treebank, ENG text, MLG text, MLG treebank): PTB (1m); News (2.1k); News (2.3k)
• Added in the final build of the slide: both News entries marked "Reviewed & improved"; Part-of-speech (2k) next to News (2.3k); Global Voices (1.8m) / Global Voices (1.9m); Leipzig (600k); Global Voices GFL (3.7k); Dictionary (77.5k)


• Year 1: 19th century Malagasy bible
• Year 2:
– Univ. of Leipzig Web Corpus
    • Monolingual Malagasy, very clean
– CMU Global Voices Archive

Malagasy Resources

                               Tokens    Types   Hapax
Bible (Year 1)                579,578   19,460   8,401
Leipzig corpus (Year 2)       618,282   41,462  23,659
CMU Global Voices (Year 2)  2,148,976   84,744  46,627
Total                       3,346,836  115,172  62,517
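The three columns above can be reproduced from any tokenized corpus. A minimal sketch, assuming plain whitespace tokenization (the slides do not say which tokenizer was used); `corpus_stats` is a hypothetical helper name, and the sample lines are placeholder tokens, not corpus data:

```python
from collections import Counter


def corpus_stats(lines):
    """Return (tokens, types, hapax) for an iterable of text lines.

    Assumption: tokens are whitespace-separated; the real pipeline
    may tokenize differently.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    tokens = sum(counts.values())                      # running words
    types = len(counts)                                # distinct word forms
    hapax = sum(1 for c in counts.values() if c == 1)  # forms seen exactly once
    return tokens, types, hapax


# Placeholder input, for illustration only.
sample = ["a b a c", "b a d"]
stats = corpus_stats(sample)
print(stats)
```

Types and hapax counts for a combined corpus come from the merged counts rather than from adding per-corpus figures, which is consistent with the Total row: its Types and Hapax values are smaller than the sums of the rows above it.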

Malagasy - English Resources

                            eng-Tokens   eng-Types  mlg-Tokens   mlg-Types
Bible (Year 1)                 584,872      13,084     579,578      19,460
CMU Global Voices (Year 2)   1,785,472      63,357   2,148,976      84,744
Total                        2,370,344      67,790   3,346,836     115,172

CMU Global Voices Corpus

• Domains include Twitter, blogs, news about popular democracy movements
• Actively published by volunteer translators
– We are gathering ~500k words / language / year of high quality parallel data

                         eng-Tokens   eng-Types  mlg-Tokens   mlg-Types
Global Voices <Jun 2011   1,318,780      56,414   1,569,343      72,906
Global Voices <Jun 2012   1,732,674      59,750   2,066,419      79,269

http://www.ark.cs.cmu.edu/global-voices/
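The two snapshots give a rough check on the ~500k words / language / year figure. A small sketch, assuming the two rows are cumulative token counts taken about one year apart (the slide does not state this explicitly):

```python
# Token counts copied from the Global Voices table: (before Jun 2011, before Jun 2012).
snapshots = {
    "eng": (1_318_780, 1_732_674),
    "mlg": (1_569_343, 2_066_419),
}

# Assumption: the difference between snapshots is one year of new material.
growth = {lang: later - earlier for lang, (earlier, later) in snapshots.items()}

for lang, added in growth.items():
    print(f"{lang}: {added:,} tokens added between snapshots")
```

Under that assumption the Malagasy side grew by 497,076 tokens and the English side by 413,894.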

Morphological analysis

• We decided against creating morphological gold-standard annotations from the output of finite state transducers.
• Initially tried to use XFST analyzer created by Dalrymple, Liakata and Mackie 2006.
– Quality of the output of Dalrymple transducer was poor (ambiguous, many incorrect).
• No existing Kinyarwanda transducer
– Any annotations would be subject to changing analyses during transducer development.

Morphological analysis

• Developed new transducers for both Kinyarwanda and Malagasy.
– Less ambiguity
– Cautious guessing for unknown stems => better precision
• Improvements driven by measuring ambiguity/coverage on data and effect on performance in other tasks.
• We may produce annotations once transducer development is deemed sufficient.
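One way to measure ambiguity and coverage on data can be sketched as follows. The definitions are assumptions, since the slides do not give them: coverage is the fraction of tokens that receive at least one analysis, and ambiguity is the mean number of analyses per analyzed token. The `analyze` callable and the toy lexicon are hypothetical stand-ins for a transducer lookup:

```python
def coverage_and_ambiguity(words, analyze):
    """Return (coverage, ambiguity) for a list of word tokens.

    analyze(word) must return a list of analyses, empty if the word
    is not recognized. Both metric definitions are assumptions.
    """
    analyzed = 0        # tokens with at least one analysis
    total_analyses = 0  # analyses summed over analyzed tokens
    for word in words:
        n = len(analyze(word))
        if n > 0:
            analyzed += 1
            total_analyses += n
    coverage = analyzed / len(words) if words else 0.0
    ambiguity = total_analyses / analyzed if analyzed else 0.0
    return coverage, ambiguity


# Hypothetical toy lexicon standing in for a real transducer.
toy_lexicon = {"w1": ["stem1+A"], "w2": ["stem2+B", "stem2+C"]}
cov, amb = coverage_and_ambiguity(
    ["w1", "w2", "w3", "w1"], lambda w: toy_lexicon.get(w, [])
)
print(f"coverage={cov:.2f} ambiguity={amb:.2f}")
```

Tracking both numbers together matters because a guesser for unknown stems raises coverage but can also raise ambiguity, which is the trade-off the "cautious guessing" bullet points at.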

Syntactic annotations

• During the past year, we reviewed and revised phrase structures annotated for kin and mlg texts.
– Analyses and labels made more consistent across languages
– Head annotations added to enable dependency parsing training/evaluation.
– All tokenization standardized.
• GFL annotations: ~4k tokens each for kin and mlg
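Head annotations make it possible to read dependencies off a phrase structure tree: within each constituent, the lexical head of every non-head child depends on the lexical head of the head child. A minimal sketch of that conversion, assuming a hypothetical tree encoding (not the project's actual file format) in which a leaf is a token index and an internal node is `(label, head_child_index, children)`; the English sentence is a toy example:

```python
def to_dependencies(tree):
    """Convert a head-annotated tree to dependency arcs.

    A leaf is an int (token position). An internal node is a tuple
    (label, head_child_index, children). Returns (lexical_head, arcs),
    where arcs is a list of (dependent, head) token positions.
    """
    if isinstance(tree, int):
        return tree, []
    _label, head_idx, children = tree
    child_heads, arcs = [], []
    for child in children:
        head, child_arcs = to_dependencies(child)
        child_heads.append(head)
        arcs.extend(child_arcs)
    lexical_head = child_heads[head_idx]
    # Every non-head child attaches to the head child's lexical head.
    for i, head in enumerate(child_heads):
        if i != head_idx:
            arcs.append((head, lexical_head))
    return lexical_head, arcs


# Toy example: indices refer to positions in `words`.
words = ["the", "dog", "chased", "a", "cat"]
tree = ("S", 1, [("NP", 1, [0, 1]), ("VP", 0, [2, ("NP", 1, [3, 4])])])
root, arcs = to_dependencies(tree)
for dep, head in sorted(arcs):
    print(f"{words[dep]} -> {words[head]}")
print("root:", words[root])
```

Using token positions rather than word strings keeps the arcs unambiguous when a word occurs more than once in a sentence.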

Data accomplishments

• Fieldwork on Kinyarwanda that informs theoretical linguistic work and transducers.
• New morphological transducers for kin and mlg.
• V 3.0 of monolingual, bilingual, and treebanked data for Kinyarwanda and Malagasy to be released this coming week.
– Order of magnitude more parallel data (mlg)
– Better & more syntactic data (kin/mlg)

Data accomplishments

• Evaluation
– Pilot annotations for linguistically targeted test suites
• Formal linguistic advances
– GFL specification and tools for annotation and visualization
– Abstract Meaning Representation (AMR): leverage ideas, data and tools from ISI as part of other synergistic projects.
