Comparable Corpora Within and Across Languages, Word Frequency Lists and the KELLY Project
Adam Kilgarriff
Lexical Computing Ltd
Lexicography MasterClass
Universities of Leeds and Sussex
Malta, May 2010 Kilgarriff: BUCC 2
Two corpora are comparable iff
roughly the same text types, subject matter, proportions
Applicable where: different languages; same language
comparable = similar
Any corpus is entirely similar to itself
Comparing Corpora
Input: word frequency list for c1, word frequency list for c2
For the top 500 words, compute the sum of (observed - expected)^2 / expected
Chi-square-based; discriminates well
Better than Spearman rank, cross-entropy
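The measure on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "top 500 words" are taken from the joint frequency list of the two corpora, and each expected value is the joint count scaled by the corpus's share of the total size. The function name `chi_square_distance` is hypothetical, not taken from the talk.

```python
from collections import Counter


def chi_square_distance(freq1, freq2, top_n=500):
    """Sum of (observed - expected)^2 / expected over the top_n joint words.

    freq1 and freq2 map word -> raw count in corpus 1 and corpus 2.
    Assumption: expected counts are proportional to corpus size.
    """
    n1 = sum(freq1.values())
    n2 = sum(freq2.values())
    joint = Counter(freq1) + Counter(freq2)
    total = 0.0
    for word, joint_count in joint.most_common(top_n):
        for observed, size in ((freq1.get(word, 0), n1), (freq2.get(word, 0), n2)):
            expected = joint_count * size / (n1 + n2)
            total += (observed - expected) ** 2 / expected
    return total
```

A corpus compared with itself scores zero, which matches the remark that any corpus is entirely similar to itself; larger values indicate less similar corpora.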
1990s work
Then: very few corpora; purely theoretical interest
Now: the web; lots of corpora, created to spec
Compare: the first question to ask about a new corpus
(Monolingual) Word Lists
Define a syllabus
Which words get used in:
• Learning-to-read books (NS children)
• NNS language learner textbooks
• Dictionaries
• Language testing
NS: educational psychologists
NNS: proficiency levels
Should be corpus-based
Most aren't: corpora are quite new
Easy to do better
People will use them; maybe also governments
How
Take your corpus
Count
Voilà
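In its simplest form, the "take your corpus, count" step might look like the sketch below. The tokenisation rule (runs of letters, allowing internal apostrophes and hyphens, lower-cased) is an assumption made for illustration, and the function name `word_frequency_list` is hypothetical.

```python
import re
from collections import Counter

# Assumed token pattern: letters, optionally joined by ' or - (can't, co-operate).
TOKEN = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def word_frequency_list(text):
    """Return (word, count) pairs, most frequent first."""
    tokens = TOKEN.findall(text.lower())
    return Counter(tokens).most_common()
```

For example, `word_frequency_list("The cat saw the other cat. The end.")` starts with `("the", 3)` and `("cat", 2)`. Everything the sketch glosses over (what counts as a word, lemmas, word classes, multiwords) is a per-language policy decision.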
Complications
• What is a word
• Words and lemmas
• Grammatical classes
• Numbers, names...
• Multiwords
• Homonymy
All are slightly different issues for each language
What is a word; delimiters
Found between spaces
Not for Chinese: segmentation
English: co-operate, widely-held, farmer's, can't
Norwegian, Swedish: compounding, separable verbs
Arabic, Italian: clitics, al, ...
...
Words and lemmas
Word form (in text): invading
Lemma (dictionary headword): invade, for the forms invade, invades, invaded, invading
Lemmatisation
• Chinese, none; English, simple
• Middling: Swedish, Norwegian, Italian, Greek
• Tough: Russian, Polish, Arabic
Word Families
Derivational morphology
• efficient/efficiently
• access/accessible/accessibility
• available/availability/unavailable
'Word families' tradition, e.g. Coxhead, Academic Word List
Pedagogy: one item to learn
But: where do families end? Different meanings
Grammatical classes
brush (verb) and brush (noun): same item or different? (Both in same word family)
Required: a (short) list of word classes, and a POS-tagger
The POS-tagger will make mistakes
Marginal cases
Numbers: twelve, seventeenth, fifties
Closed sets: days of week, months
Countries: capitals, nationalities, currencies, adjectives, languages
Regional/dialects, political groups, religions: easter, christmas, islam, republican
Policies always needed
Multiwords
"according to": linguistically a word, but...
Multiword frequency list: the top item is "of the"
Can't use frequencies (alone) to select multiwords
Homonymy
bank (river) and bank (money)
Word sense disambiguation: we can't do it (with decent accuracy), so we can't give frequencies for senses
Lists are of words, not meanings: sometimes disconcerting
Corpora
A fairly arbitrary sample of a language
To limit arbitrariness of the wordlist: make it big and diverse
WACKY corpora: from the web; can do for any language
??? Comparable ???
Web language: less formal
Comparing corpora
Corpora: new; we are all beginners
Best way to get a sense of a corpus: compare with another
Keywords of each vs. the other
Case studies; Sketch Engine functions
Comparing frequency lists
• Web1T
– Present from Google
– All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English
• that’s 1,000,000,000,000
• Compare with BNC
– Take top 50,000 items of each
– 105 Web1T words not in BNC top 50k
– 50 words with highest Web1T:BNC ratio
– 50 words with lowest ratio
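The comparison steps listed above can be sketched as follows. This is a rough illustration only: it assumes each list maps word to raw count, and it normalises by the sum of the counts in each list (the real corpus sizes could be used instead). The function name `compare_frequency_lists` and its parameters are hypothetical.

```python
def compare_frequency_lists(freq_a, freq_b, top_n=50000, k=50):
    """Compare the top_n items of two frequency lists.

    Returns (words only in A's top_n, k highest A:B ratio words,
    k lowest A:B ratio words). Ratios use relative frequencies.
    """
    size_a = sum(freq_a.values())
    size_b = sum(freq_b.values())
    top_a = set(sorted(freq_a, key=freq_a.get, reverse=True)[:top_n])
    top_b = set(sorted(freq_b, key=freq_b.get, reverse=True)[:top_n])
    only_a = sorted(top_a - top_b)
    ratio = {
        w: (freq_a[w] / size_a) / (freq_b[w] / size_b)
        for w in top_a & top_b
    }
    ranked = sorted(ratio, key=ratio.get, reverse=True)
    return only_a, ranked[:k], ranked[-k:]
```

With Web1T as A and the BNC as B, the first two results together would correspond to the "web-high" terms and the third to the "web-low" terms.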
Web-high (155 terms)
• 61 web and computing
– config browser spyware url www forum
• 38 porn
• 22 US English (incl Spanish influence – los)
• 18 business/products common on web
– poker viagra lingerie ringtone dvd casino rental collectible tiffany
– NB: BNC is old
• 4 legal
– trademarks pursuant accordance herein
Web-low
• Exclude British English, transcription/tokenisation anomalies
– herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him
Observations
• Pronouns and past tense verbs
– Fiction
• Masc vs fem
• Yesterday
– Probably daily newspapers
• Constancy of ratios:
– He/him/himself
– She/her/herself
Corpus Factory
• Many languages
• General corpus, 100m+ words
• Fast
• High quality
• Comparable across languages
Gather Seed words
Wikipedia (Wiki) corpora: many domains; free; 265 languages covered, more to come
Extract text from Wiki: Wikipedia 2 Text
Tokenise the text: morphology of the language is important; can use existing word tokeniser tools
Web Corpus Statistics
Language     Unique URLs collected   After filtering   After de-duplication   Web corpus size   Words
Dutch        97,584                  22,424            19,708                 739 MB            108.6 m
Hindi        71,613                  20,051            13,321                 424 MB            30.6 m
Telugu       37,864                  6,178             5,131                  107 MB            3.4 m
Thai         120,314                 23,320            20,998                 1.2 GB            81.8 m
Vietnamese   106,076                 27,728            19,646                 1.2 GB            149 m
Evaluation
For each of the languages, two corpora are available: Web and Wiki
Dutch: also a carefully designed lexicographic corpus
Hypothesis: Wiki corpora are 'informational'
Informational --> typical written
Interactional --> typical spoken
Evaluation
1st and 2nd person pronouns are strong indicators of interactional language
English: I me my mine you your yours we us our
For each language: take the ten commonest 1st and 2nd person pronouns
For each pronoun: calculate the ratio, web:wiki
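One way to compute the web:wiki ratios is sketched below, assuming the ratio is taken between relative frequencies (count divided by corpus size) so that corpora of different sizes can be compared. The function names and the decision to skip pronouns absent from the Wiki corpus are assumptions, not details given in the talk.

```python
def pronoun_ratios(web_freq, wiki_freq, pronouns):
    """Return {pronoun: web relative freq / wiki relative freq}.

    Pronouns with zero count in the wiki corpus are skipped (assumption),
    since their ratio is undefined.
    """
    web_size = sum(web_freq.values())
    wiki_size = sum(wiki_freq.values())
    ratios = {}
    for pronoun in pronouns:
        wiki_rel = wiki_freq.get(pronoun, 0) / wiki_size
        if wiki_rel > 0:
            ratios[pronoun] = (web_freq.get(pronoun, 0) / web_size) / wiki_rel
    return ratios


def summarise(ratios):
    """Average, minimum and maximum of the ratios, as reported per language."""
    values = list(ratios.values())
    return sum(values) / len(values), min(values), max(values)
```

A ratio above 1 means the pronoun is relatively more frequent in the web corpus, which is what the hypothesis predicts if Wiki text is more 'informational'.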
Results: ratios, web:wiki
Language Average Min Max
Dutch 2.98 2.03 10.03
Hindi 5.36 1.85 11.50
Telugu 4.96 0.54 7.34
Thai 2.40 0.63 7.87
Vietnamese 3.82 1.81 19.41
KELLY
EU lifelong learning project
Goal: wordcards for language learning, with the word in one language on one side and in the other language on the other
9 languages, 36 pairs: Arabic, Chinese, English, Greek, Italian, Norwegian, Polish, Russian, Swedish
Partners in 6 countries
Method
Prepare monolingual lists
Translate: each into 8 target languages; professional translation services
Integrate, finalise
Produce cards
Goal for each set: 9000 pairs at 6 levels
Stages
Sort out corpora, tagging
Automatically generate M1 lists
– names, numbers, countries ...
– keywords vis-a-vis other corpora
Review, compare, prepare M2 lists
Translate
Use translations: M3 lists
Finalise
review - how?
Points system: 2 points for each of 6 levels; 12 points for the most frequent words
Deduct points for words in over-represented areas
Add in words from other corpora
Translation database
On the web
All translations entered into it
Queries like:
All Swedish words used as translations more than six times
All 1:1:1:1... 'simple cases'
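The first kind of query might be sketched as follows. The record layout, a tuple of (source language, source word, target language, target word), the language code "sv", and the function name are all hypothetical; the talk does not describe the database schema.

```python
from collections import Counter


def frequent_translations(records, target_lang, more_than=6):
    """Target-language words used as translations more than `more_than` times.

    records: iterable of (source_lang, source_word, target_lang, target_word).
    """
    counts = Counter(
        target_word
        for _src_lang, _src_word, tgt_lang, target_word in records
        if tgt_lang == target_lang
    )
    return {word: n for word, n in counts.items() if n > more_than}
```

Calling `frequent_translations(records, "sv")` would then list Swedish words that serve as the translation of more than six source words, which are candidates for closer inspection.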
Using the translations database
Find words not in M2 lists that need adding
Multiwords: English "look for" is probably the translation of a high-frequency word in several of the 8 other languages, so add it to the English list
Homonyms: could be similar
Monolingual master lists (M3)
Based on a WAC corpus
Input from other same-language corpora, and from translations from 8 languages
Useful words which might not be high-frequency
Added words/multiwords must be above a lower frequency threshold
Target: 9000
Numbers
Target: 9000 per list
M2 lists: estimate 5000-6000 needed
We add 3000-4000 multiwords and other 'back-translations'
Current status
M1 lists prepared
Lists checked, compared with other lists (corpus-based and other)
M2 lists prepared
Translation underway
Big problems
Multiwords (as anticipated)
Homonymy (as anticipated)
orange, banana, alphabet, elbow, Hello: worse than anticipated
Lists from spoken corpora, learner corpora, needed
Relation between competence for communicating and the corpora at our disposal
Word lists are useful, but
...are they scientific? A tiny bit, occasionally
...could they be scientific? Yes: article of faith
By the end of KELLY, we'll have a clearer idea how
And now for something completely different: DANTE
Lexical database for English
Detailed Accurate Extensive of English
Highly corpus-driven
3-year project, 18 expert lexicographers, led by Sue Atkins
BNC, FrameNet, Euralex, COBUILD...
English side of the New English-Irish Dictionary
Available for NLP research imminently