LDMT MURI Data Collection and Linguistic Annotation November 2, 2012 Jason Baldridge, UT Austin Lori Levin, CMU


LDMT MURI

Data Collection and Linguistic Annotation

November 2, 2012

Jason Baldridge, UT Austin

Lori Levin, CMU


Purpose

Collect and build data
• Monolingual text
• Bilingual text
• Linguistic annotations

to support work on machine translation for
• Kinyarwanda-English
• Malagasy-English


Kinyarwanda Data Resources

[Diagram: data resources with word counts. Legend: 1.0 Release 02/11, 2.0 Release 10/11.]
• BILINGUAL (285k): KGMC (270k), KGMC (225k), Pbook (0.9k), Pbook (0.7k)
• ENGLISH monolingual (huge): GWord (8b)
• KINYARWANDA monolingual (7m): News (7m)
• ENG treebank: PTB (1m)
• ENG text, KIN text, KIN treebank: KGMC (5.8k), KGMC (4.8k), BBC (0.3k), BBC (0.3k), IGT (0.1k), IGT (0.06k), Dict (9k), Dict (8k), KGMC (2.9k), KGMC (3.8k), BBC (0.3k), BBC (0.3k), IGT (0.06k), IGT (0.1k)


Kinyarwanda Data Resources

[Diagram: data resources with word counts. Legend: 1.0 Release 02/11, 2.0 Release 10/11, 3.0 Release 11/12.]
• BILINGUAL (285k): KGMC (270k), KGMC (225k), Pbook (0.9k), Pbook (0.7k)
• ENGLISH monolingual (huge): GWord (8b)
• KINYARWANDA monolingual (7m): News (7m)
• ENG treebank: PTB (1m)
• ENG text, KIN text, KIN treebank: KGMC (5.8k), KGMC (4.8k), BBC (0.3k), BBC (0.3k), IGT (0.1k), IGT (0.06k), Dict (9k), Dict (8k), KGMC (2.9k), KGMC (3.8k), BBC (0.3k), BBC (0.3k), IGT (0.06k), IGT (0.1k)
• Also shown: Part-of-speech (2k), GFL (4.7k), and two items marked "Reviewed & improved"


Malagasy Data Resources

[Diagram: data resources with word counts. Legend: 1.0 Release 02/11, 2.0 Release 10/11.]
• BILINGUAL (732k): Bible (730k), Bible (725k), News (2.1k), News (2.3k)
• ENGLISH monolingual (huge): Gword (8b)
• MALAGASY monolingual
• ENG treebank: PTB (1m)
• ENG text, MLG text, MLG treebank: News (2.1k), News (2.3k)


Malagasy Data Resources

[Diagram: data resources with word counts. Legend: 1.0 Release 02/11, 2.0 Release 10/11, 3.0 Release 11/12.]
• BILINGUAL (732k): Bible (730k), Bible (725k), News (2.1k), News (2.3k)
• ENGLISH monolingual (huge): Gword (8b)
• MALAGASY monolingual: Leipzig (600k)
• ENG treebank: PTB (1m)
• ENG text, MLG text, MLG treebank: News (2.1k), reviewed & improved; News (2.3k), reviewed & improved; Part-of-speech (2k)
• Also shown: Global voices (1.8m), Global voices (1.9m), Global voices GFL (3.7k), Dictionary (77.5k)


Malagasy Data Resources

• Year 1: 19th century Malagasy Bible
• Year 2:
  – Univ. of Leipzig Web Corpus
    • Monolingual Malagasy, very clean
  – CMU Global Voices Archive


Malagasy Resources

                                Tokens      Types      Hapax
Bible (Year 1)                 579,578     19,460      8,401
Leipzig corpus (Year 2)        618,282     41,462     23,659
CMU Global Voices (Year 2)   2,148,976     84,744     46,627
Total                        3,346,836    115,172     62,517
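The Tokens, Types, and Hapax columns can be computed along the lines of the following minimal sketch. It assumes whitespace tokenization and case-sensitive word forms, which may differ from the tokenization behind the figures above; the function name `corpus_stats` and the toy input are illustrative only.

```python
from collections import Counter


def corpus_stats(lines):
    """Return (tokens, types, hapax) for an iterable of text lines.

    Assumes whitespace tokenization; this is an illustrative choice,
    not necessarily the tokenization used for the table above.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    tokens = sum(counts.values())  # running words
    types = len(counts)            # distinct word forms
    hapax = sum(1 for c in counts.values() if c == 1)  # forms seen exactly once
    return tokens, types, hapax


# Toy input (not project data)
sample = ["a b a c", "b d"]
print(corpus_stats(sample))  # (6, 4, 2)
```

Types and hapax in the Total row appear to be counted over the combined corpora, which is why they are smaller than the column sums, while the token total is the plain sum.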

Malagasy - English Resources

                            eng-Tokens  eng-Types mlg-Tokens  mlg-Types
Bible (Year 1)                 584,872     13,084    579,578     19,460
CMU Global Voices (Year 2)   1,785,472     63,357  2,148,976     84,744
Total                        2,370,344     67,790  3,346,836    115,172


CMU Global Voices Corpus

• Domains include Twitter, blogs, news about popular democracy movements
• Actively published by volunteer translators
  – We are gathering ~500k words / language / year of high quality parallel data

                            eng-Tokens  eng-Types mlg-Tokens  mlg-Types
Global Voices <Jun 2011      1,318,780     56,414  1,569,343     72,906
Global Voices <Jun 2012      1,732,674     59,750  2,066,419     79,269

http://www.ark.cs.cmu.edu/global-voices/
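As a quick check of the yearly growth estimate against the two snapshots in the table, the difference can be computed directly. This sketch assumes the "<Jun 2011" and "<Jun 2012" rows are cumulative snapshots taken one year apart; the dictionary keys are labels chosen here for illustration.

```python
# Token counts copied from the table above.
snapshots = {
    "eng": {"2011-06": 1_318_780, "2012-06": 1_732_674},
    "mlg": {"2011-06": 1_569_343, "2012-06": 2_066_419},
}


def yearly_growth(counts):
    """Tokens added between the June 2011 and June 2012 snapshots."""
    return counts["2012-06"] - counts["2011-06"]


growth = {lang: yearly_growth(c) for lang, c in snapshots.items()}
print(growth)  # {'eng': 413894, 'mlg': 497076}
```

Under that assumption, the corpus grew by roughly 414k English and 497k Malagasy tokens over the year.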


Morphological analysis

• We decided against creating morphological gold-standard annotations from the output of finite-state transducers.
• We initially tried to use the XFST analyzer created by Dalrymple, Liakata and Mackie (2006).
  – The quality of the Dalrymple transducer's output was poor (ambiguous, many incorrect analyses).
• There was no existing Kinyarwanda transducer.
  – Any annotations would be subject to changing analyses during transducer development.


Morphological analysis

• Developed new transducers for both Kinyarwanda and Malagasy.
  – Less ambiguity
  – Cautious guessing for unknown stems => better precision
• Improvements are driven by measuring ambiguity/coverage on data and the effect on performance in other tasks.
• We may produce annotations once transducer development is deemed sufficient.
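One way to measure coverage and ambiguity on data is sketched below. This is an illustration under stated assumptions, not the project's evaluation code: `analyze` is a hypothetical callable standing in for a transducer lookup, coverage is taken as the fraction of tokens with at least one analysis, and ambiguity as the mean number of analyses per analyzed token.

```python
def coverage_and_ambiguity(tokens, analyze):
    """Compute token coverage and mean ambiguity for an analyzer.

    `analyze` is a hypothetical callable mapping a token to a list of
    analyses (an empty list if the token is not recognized).
    """
    analyzed = 0
    total_analyses = 0
    for tok in tokens:
        n = len(analyze(tok))
        if n > 0:
            analyzed += 1
            total_analyses += n
    coverage = analyzed / len(tokens) if tokens else 0.0
    ambiguity = total_analyses / analyzed if analyzed else 0.0
    return coverage, ambiguity


# Toy lexicon standing in for a transducer (invented analyses).
toy = {"x": ["x+N"], "y": ["y+N", "y+V", "y+A"]}
cov, amb = coverage_and_ambiguity(["x", "y", "z", "x"], lambda t: toy.get(t, []))
print(cov, round(amb, 2))  # 0.75 1.67
```

Tracking both numbers together matters because guessing analyses for unknown stems raises coverage but can also raise ambiguity.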


Syntactic annotations

• During the past year, we reviewed and revised the phrase structures annotated for kin and mlg texts.
  – Analyses and labels made more consistent across languages.
  – Head annotations added to enable dependency parsing training/evaluation.
  – All tokenization standardized.
• GFL annotations: about 4k tokens each for kin and mlg
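To show why head annotations enable dependency parsing, here is a minimal sketch of reading dependencies off a head-annotated phrase structure tree. The tree encoding (a tuple of label, index of the head child, and children, with plain strings as leaves) is an assumption made for this example and is not the project's treebank format; the English toy sentence is likewise invented.

```python
def lexical_head(node):
    """Return the head word of a node by following head-child indices."""
    if isinstance(node, str):
        return node
    _label, head_idx, children = node
    return lexical_head(children[head_idx])


def dependencies(node, deps=None):
    """Collect (dependent, head) word pairs from a head-annotated tree.

    Words are used as identifiers for brevity; real data would need
    token positions to distinguish repeated words.
    """
    if deps is None:
        deps = []
    if isinstance(node, str):
        return deps
    _label, head_idx, children = node
    head_word = lexical_head(children[head_idx])
    for i, child in enumerate(children):
        if i != head_idx:
            # Each non-head child attaches to the head of its parent.
            deps.append((lexical_head(child), head_word))
        dependencies(child, deps)
    return deps


# Toy tree: (label, index of head child, children); leaves are words.
tree = ("S", 1, [
    ("NP", 1, ["the", "dog"]),
    ("VP", 0, ["chased", ("NP", 1, ["a", "cat"])]),
])
print(dependencies(tree))
# [('dog', 'chased'), ('the', 'dog'), ('cat', 'chased'), ('a', 'cat')]
```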


Data accomplishments

• Fieldwork on Kinyarwanda that informs theoretical linguistic work and transducers.
• New morphological transducers for kin and mlg.
• V 3.0 of monolingual, bilingual, and treebanked data for Kinyarwanda and Malagasy to be released this coming week.
  – Order of magnitude more parallel data (mlg)
  – Better & more syntactic data (kin/mlg)


Data accomplishments

• Evaluation
  – Pilot annotations for linguistically targeted test suites
• Formal linguistic advances
  – GFL specification and tools for annotation and visualization
  – Abstract Meaning Representation (AMR): leverage ideas, data and tools from ISI as part of other synergistic projects.