Linguistic evidence within and across languages, word frequency lists and language learning

Adam Kilgarriff
Lexical Computing Ltd
Lexicography MasterClass
Universities of Leeds and Sussex



Linguistic evidence within and across languages, word frequency lists and language learning

Or: Word lists are useful, but are they (could they be) scientific?


Leeds April 2010 Kilgarriff: KELLY 3

KELLY

• EU lifelong learning project
• Goal: wordcards
– Word in one lg on one side, its translation on the other
– For language learning
• 9 languages, 36 pairs
– Arabic, Chinese, English, Greek, Italian, Norwegian, Polish, Russian, Swedish
• Partners (incl. Leeds) in 6 countries
– Leeds does Arabic, Chinese, Russian
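The figure of 36 pairs follows from choosing 2 of the 9 languages. A minimal check, assuming one card set per unordered pair and one translation job per ordered pair (the variable names are illustrative only):

```python
from itertools import combinations

# The nine KELLY languages as listed on the slide.
LANGUAGES = ["Arabic", "Chinese", "English", "Greek", "Italian",
             "Norwegian", "Polish", "Russian", "Swedish"]

# Unordered pairs: one wordcard set per pair of languages.
pairs = list(combinations(LANGUAGES, 2))
print(len(pairs))  # 36

# Ordered pairs: each monolingual list is translated into the 8 other languages.
directed = len(LANGUAGES) * (len(LANGUAGES) - 1)
print(directed)  # 72
```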


Method

• Prepare monolingual lists
• Translate
– Each into 8 target languages
– Professional translation services
• Integrate, finalise
• Produce cards
• Goal for each set
– 9000 pairs at 6 levels


(Monolingual) Word Lists

• Define a syllabus
• Which words get used in
– Learning-to-read books (NS children)
– NNS language learner textbooks
– Dictionaries
– Language testing
• NS: educational psychologists
• NNS: proficiency levels


Should be corpus-based

• Most aren't
– Corpora are quite new
• Easy to do better
• People will use them
– Maybe also governments


How

• Take your corpus
• Count
• Voilà
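In its naive form, "take your corpus, count" is only a few lines. A minimal sketch, assuming a word is a lowercase run of ASCII letters (the complications that follow show why this assumption is too simple); `frequency_list` is a name invented for this example:

```python
import re
from collections import Counter

def frequency_list(text):
    """Return (word, count) pairs, most frequent first."""
    # Assumption: a "word" is a run of letters a-z after lowercasing.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens).most_common()

sample = "the cat sat on the mat and the dog sat too"
for word, count in frequency_list(sample)[:2]:
    print(word, count)
```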


Complications

• What is a word
• Words and lemmas
• Grammatical classes
• Numbers, names...
• Multiwords
• Homonymy
• All are slightly different issues for each lg


What is a word; delimiters

• Found between spaces
– Not for Chinese: segmentation
• English
– co-operate, widely-held, farmer's, can't
• Norwegian, Swedish
– Compounding, separable verbs
• Arabic, Italian
– Clitics, al, ...
• ...
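The English examples already show how much the delimiter policy matters. A small sketch comparing two possible policies (neither is claimed to be the one KELLY adopted):

```python
import re

text = "co-operate widely-held farmer's can't"

# Policy A: a word is whatever sits between spaces.
by_space = text.split()

# Policy B: a word is an unbroken run of letters,
# so hyphens and apostrophes split tokens.
by_letters = re.findall(r"[a-z]+", text)

print(len(by_space), by_space)
print(len(by_letters), by_letters)
```

The same four strings yield four tokens under one policy and eight under the other, so any frequency list depends on which policy was fixed beforehand.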


Words and lemmas

• Word form (in text)
– invading
• Lemma (dictionary headword)
– invade, for forms invade, invades, invaded, invading
• Lemmatisation
– Chinese, none; English, simple
– Middling: Swe, Nor, It, Gr
– Tough: Rus, Pol, Ara
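Counting lemmas rather than word forms means pooling the counts of all forms under one headword. A minimal sketch, assuming a hand-written lookup table (`LEMMA_OF` is hypothetical; a real lemmatiser for the language in question would take its place):

```python
from collections import Counter

# Hypothetical lookup table standing in for a real lemmatiser.
LEMMA_OF = {"invade": "invade", "invades": "invade",
            "invaded": "invade", "invading": "invade"}

def lemma_counts(tokens):
    """Pool word-form counts under their lemma; unknown forms count as themselves."""
    return Counter(LEMMA_OF.get(tok, tok) for tok in tokens)

tokens = "they invaded then invading armies invade".split()
counts = lemma_counts(tokens)
print(counts["invade"])  # 3
```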


Grammatical classes

• brush (verb) and brush (noun)
– Same item or different?
• Proposal: lempos
– Recommendation: different
– With trepidation
• Chinese: weak sense of noun, verb
• Required: (short) list of word classes for each lg
– Same for all unless good reason
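A lempos joins lemma and word class into one key, so brush (verb) and brush (noun) are counted separately. A minimal sketch, assuming a `lemma-class` spelling such as `brush-n` and hand-made tagged input (both are assumptions of this example, not project specifications):

```python
from collections import Counter

# Hypothetical (lemma, word class) pairs; a tagger would supply these.
tagged = [("brush", "v"), ("brush", "n"), ("brush", "n"), ("tooth", "n")]

# Counting by lemma alone merges the verb and the noun.
by_lemma = Counter(lemma for lemma, _ in tagged)

# Counting by lempos keeps them apart.
by_lempos = Counter(f"{lemma}-{pos}" for lemma, pos in tagged)

print(by_lemma["brush"])                            # 3
print(by_lempos["brush-v"], by_lempos["brush-n"])   # 1 2
```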


Marginal cases

• Numbers
– twelve, seventeenth, fifties
• Closed sets
– Days of week, months
• Countries
– Capitals, nationalities, currencies, adjectives, languages
• regional/dialects, political groups, religions
– easter, christmas, islam, republican
• Consistency before freq: policies needed


Multiwords

• According to
– Linguistically a word, but...
• Multiword frequency list: top item is "of the"
– Can't use freqs (alone) to select multiwords
• Base list
– Recommendation: no multiwords
– But see below
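Why raw frequency cannot select multiwords shows up even on a toy text: the most frequent two-word sequence is a pair of function words, while a genuine multiword such as "according to" sits far down the list. A sketch with an invented sample sentence:

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))

# Invented sample text, for illustration only.
sample = ("the end of the day is the start of the night "
          "according to the keeper of the gate")
counts = bigram_counts(sample.split())

top, freq = counts.most_common(1)[0]
print(top, freq)                        # ('of', 'the') 3
print(counts[("according", "to")])      # 1
```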


Homonymy

• bank (river) and bank (money)
• Word sense disambiguation
– We can't do (with decent accuracy)
– We can't give freqs for senses
• Lists of words not meanings
– Sometimes disconcerting
– See also below


Corpora

• A fairly arbitrary sample of a lg
• To limit arbitrariness of wdlist
– Make it big and diverse
• WACKY corpora
– From web
– Can do for any language
– Web language: less formal
– not mainly 'reporting' or fiction, cf news, BNC
– Good for lg learners


Comparing corpora

• Corpora: new
– We are all beginners
• Best way to get sense of a corpus
– Compare with another
– Keywords of each vs. other
• Case studies
• Sketch Engine functions


Comparing frequency lists

• Web1T

– Present from Google

– All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English

• that’s 1,000,000,000,000

• Compare with BNC

– Take top 50,000 items of each

– 105 Web1T words not in BNC top50k

– 50 words with highest Web1T:BNC ratio

– 50 words with lowest ratio
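Ranking words by the ratio of their frequencies in two lists can be sketched as follows. The per-million normalisation and the add-one smoothing (which avoids division by zero for words missing from one list) are assumptions of this sketch, and the counts are invented toy figures, not Web1T or BNC data:

```python
def per_million(count, corpus_size):
    """Normalise a raw count to a frequency per million words."""
    return count * 1_000_000 / corpus_size

def rank_by_ratio(freq_a, size_a, freq_b, size_b, smoothing=1.0):
    """Rank words by (fpm_a + s) / (fpm_b + s), highest ratio first."""
    scored = []
    for word in set(freq_a) | set(freq_b):
        fa = per_million(freq_a.get(word, 0), size_a)
        fb = per_million(freq_b.get(word, 0), size_b)
        scored.append((word, (fa + smoothing) / (fb + smoothing)))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Invented toy counts for two corpora of 100,000 words each.
web = {"browser": 90, "the": 6000, "herself": 2}
bnc = {"browser": 1, "the": 6100, "herself": 40}

ranking = rank_by_ratio(web, 100_000, bnc, 100_000)
for word, ratio in ranking:
    print(word, round(ratio, 2))
```

The head of the ranking gives the "web-high" words and the tail the "web-low" ones; words near ratio 1 are equally common in both lists.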


Web-high (155 terms)

• 61 web and computing

– config browser spyware url www forum

• 38 porn

• 22 US English (incl Spanish influence: los)

• 18 business/products common on web

– poker viagra lingerie ringtone dvd casino rental collectible tiffany

– NB: BNC is old

• 4 legal

– trademarks pursuant accordance herein


Web-low

• Exclude British English, transcription/tokenisation anomalies

– herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him


Observations

• Pronouns and past tense verbs

– Fiction

• Masc vs fem

• Yesterday

– Probably daily newspapers

• Constancy of ratios:

– He/him/himself

– She/her/herself


Corpus Factory

• Many languages
• General corpus, 100m+ words
• Fast
• High quality
• Comparable across languages


Gather Seed words

• Wikipedia (Wiki) corpora
– many domains
– free
– 265 languages covered, more to come
• Extract text from Wiki
– Wikipedia 2 Text
• Tokenise the text
– Morphology of the language is important
– Can use the existing word tokeniser tools


Web Corpus Statistics

Language     Unique URLs collected   After filtering   After de-duplication   Web corpus size   Words
Dutch        97,584                  22,424            19,708                 739 MB            108.6 m
Hindi        71,613                  20,051            13,321                 424 MB            30.6 m
Telugu       37,864                  6,178             5,131                  107 MB            3.4 m
Thai         120,314                 23,320            20,998                 1.2 GB            81.8 m
Vietnamese   106,076                 27,728            19,646                 1.2 GB            149 m


Evaluation

• For each of the languages, two corpora available: Web and Wiki
– Dutch: also a carefully designed lexicographic corpus
• Hypothesis: Wiki corpora are ‘informational’
– Informational --> typical written
– Interactional --> typical spoken


Evaluation

• 1st, 2nd person pronouns: strong indicators of interactional language
– English: I me my mine you your yours we us our
• For each language
– Ratio: web:wiki


Results

Thai

Word        Web     Wiki    Ratio
ผม          2935    366     8.00
ดิฉัน        133     19      7.00
ฉัน          770     97      7.87
คุณ          1722    320     5.36
ท่าน         2390    855     2.79
กระผม       21      6       3.20
ข้าพเจ้า      434     66      6.54
ตัว          2108    2070    1.01
กู           179     148     1.20
ชั้น          431     677     0.63
Total       11123   4624    2.40

Table: 1st and 2nd person pronouns in Web and Wiki corpora, per million words
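The totals row can be recomputed from the per-row figures. A short check using the (Web, Wiki) numbers from the table, with the pronouns themselves left out since only the counts matter here:

```python
# (web, wiki) frequencies per million words, row by row from the Thai table.
rows = [(2935, 366), (133, 19), (770, 97), (1722, 320), (2390, 855),
        (21, 6), (434, 66), (2108, 2070), (179, 148), (431, 677)]

web_total = sum(web for web, _ in rows)
wiki_total = sum(wiki for _, wiki in rows)
overall_ratio = web_total / wiki_total

# Number of pronouns that are more frequent on the web than in Wikipedia.
higher_on_web = sum(1 for web, wiki in rows if web > wiki)

print(web_total, wiki_total)   # 11123 4624
print(overall_ratio)           # roughly 2.4
print(higher_on_web)           # 9
```

Nine of the ten pronouns are more frequent in the Web corpus, which is what the hypothesis of more interactional web text would predict.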


ANW

Theme        Word        English gloss
Belgian      Brussel     (city)
             Belgische   Belgian
             Vlaamse     Flemish
Fiction      keek        looked/watched
Newspapers   vorig       previous
             kreek       watched/looked
             procent     percent
             miljoen     million
             miljard     billion
             frank       (Belgian) franc

NlWaC

Theme        Word        English gloss
Religion     God
             Jezus
             Christus
             Gods
Web          http
             Geplaatst   posted
             Nl          (Web domain)
             Bewerk      edited
             Reacties    replies
             www


Stages

• Sort out corpora, tagging
• Automatically generate M1 lists
– names, numbers, countries ...
– keywords vis-a-vis other corpora
• Review, prepare M2 lists
• Translate


review - how?

• points system
– 2 points for each of 6 levels
– 12 points for most freq words
• deduct points for words in over-represented areas
• add in words from other corpora
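One possible reading of the points system, as a sketch only: the slide gives 2 points per level and 12 points for the most frequent words, while the equal band size of 1500 (9000 words over 6 levels) and the 2-point deduction are assumptions introduced here for illustration:

```python
LEVELS = 6
POINTS_PER_LEVEL = 2

def base_points(rank, band_size=1500):
    """Points from frequency rank (0 = most frequent word).

    Assumption: six equal frequency bands of band_size words each.
    """
    level = rank // band_size            # 0 for the top band
    if level >= LEVELS:
        return 0                         # beyond the 6 levels
    return POINTS_PER_LEVEL * (LEVELS - level)

def reviewed_points(rank, over_represented=False, penalty=2):
    """Deduct a (hypothetical) penalty for words from over-represented areas."""
    deduction = penalty if over_represented else 0
    return max(0, base_points(rank) - deduction)

print(base_points(0))                              # 12
print(base_points(8999))                           # 2
print(reviewed_points(0, over_represented=True))   # 10
```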


Translation database

• On the web
• All translations entered into it
• Queries like
– All Swedish words used as translations more than six times
– All 1:1:1:1... 'simple cases'
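A query of the first kind could look like the sketch below. The record layout (source language, source word, target language, target word) and the sample rows are invented for this example; the actual database schema is not described on the slides:

```python
from collections import Counter

# Hypothetical records: (source language, source word, target language, target word).
records = [
    ("English", "get", "Swedish", "få"),
    ("English", "receive", "Swedish", "få"),
    ("English", "obtain", "Swedish", "få"),
    ("English", "house", "Swedish", "hus"),
    ("English", "get", "Italian", "ottenere"),
]

def frequent_targets(records, target_lang, min_uses):
    """Words of target_lang used as a translation more than min_uses times."""
    uses = Counter(tgt for _, _, lang, tgt in records if lang == target_lang)
    return sorted(word for word, n in uses.items() if n > min_uses)

# The slide's threshold is six; the toy data only supports a smaller one.
print(frequent_targets(records, "Swedish", 2))  # ['få']
```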


Translations

• Usually, of texts
– Words in context
• Kelly: no context
– Usual principles don't apply
• Instructions to translators


Using the database

• Find words not in M2 lists, that need adding
• Multiwords
– English: look for
– Probably the translation of a high-freq word in several of the 8 other lgs
– So: add it to English list
• Homonyms: could be similar
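The "look for" case can be sketched as a search for items that several source languages translate into but that the target language's own list lacks. The record layout, the sample rows and the function name are all invented for illustration:

```python
from collections import defaultdict

# Hypothetical records: (source language, source word, target language, target item).
records = [
    ("Swedish", "leta", "English", "look for"),
    ("Italian", "cercare", "English", "look for"),
    ("Polish", "szukać", "English", "look for"),
    ("Italian", "casa", "English", "house"),
    ("Swedish", "hus", "English", "house"),
]

def back_translation_candidates(records, target_lang, m2_list, min_langs):
    """Items absent from m2_list that at least min_langs source languages translate into."""
    langs_by_item = defaultdict(set)
    for src_lang, _, tgt_lang, tgt_item in records:
        if tgt_lang == target_lang and tgt_item not in m2_list:
            langs_by_item[tgt_item].add(src_lang)
    return sorted(item for item, langs in langs_by_item.items()
                  if len(langs) >= min_langs)

english_m2 = {"house", "look"}   # toy stand-in for the English M2 list
print(back_translation_candidates(records, "English", english_m2, 3))  # ['look for']
```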


Monolingual master lists (M3)

• Based on a WAC corpus
• Input from other same-lg corpora
• And from translations from 8 lgs
– Useful words which might not be hi-freq
– added words/multiwords must be above a lower freq threshold
• Target 9000
• Important contribution


Numbers

• Target: 9000 per list
• M2 lists
– Estimate: 5000-6000 needed
• We add 3000-4000 multiwords and other 'back-translations'


From M3 lists to T2 lists


Current status

• M1 lists prepared
• Lists checked, compared with other lists
– Corpus-based and other
• M2 lists prepared
• Translation underway


Big problems

• Multiwords (as anticipated)
• Homonymy (as anticipated)
• orange banana alphabet elbow, Hello
– Worse than anticipated
– Lists from spoken corpora, learner corpora, needed
• Relation between
– Competence for communicating
– The corpora at our disposal


Word lists are useful, but

• ...are they scientific?
– A tiny bit, occasionally
• ...could they be scientific?
– Yes
– article of faith
• By the end of KELLY, we'll have a clearer idea how


http://forbetterenglish.com