Using Corpora in Language Research - also Introduction to the Sketch Engine (WS15), part 1

Adam Kilgarriff
Lexical Computing Ltd
Universities of Leeds


Adam Kilgarriff, May 2011

What is language?

In our heads
In texts and sound signals
Both


Methodology

Study language in our heads
• Competence
• Chomsky
• “rationalist” (Descartes, Leibniz)
• Odd method for objective science
• Practical problems: coverage, arbitrariness


Methodology

Study text
• “empiricist” (Locke, Hume)

Physics: forces, matter
Chemistry: chemicals, bonds
Language: text, speech signals


It goes against the grain

What is important about a sentence?
• its meaning

Corpus methodology:
• Throw away individual sentence meaning
• Find patterns
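
One way to picture "find patterns" is to count which words occur next to each other across many sentences, ignoring what any single sentence means. The following is a minimal sketch in Python, assuming a toy three-sentence corpus and naive whitespace tokenization. Both are invented for illustration; real corpus tools use proper tokenizers and far larger data.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count adjacent word pairs across all sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenization
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Toy corpus, invented for illustration only.
toy_corpus = [
    "The volcano erupted last night",
    "The volcano erupted again",
    "A volcano erupted in Iceland",
]

counts = bigram_counts(toy_corpus)
for pair, n in counts.most_common(2):
    print(pair, n)
```

The pair that recurs most often ("volcano", "erupted") surfaces without anyone having read the sentences for meaning, which is the point of the slide.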


Computer power

Corpora: bigger and bigger data sets

Language technology tools
• lemmatizers, POS-taggers, parsers
• Machine learning, pattern-finding

20 years of rapid ascent


All the linguisticses

Theoretical, Socio, Psycho, Developmental, Law and, Computational, Contrastive, Applied, ... linguistics


Developmental

CHILDES, TalkBank
• How children learn language
• Parents record all interactions
• Since 1980s
• Prof. Brian MacWhinney, Carnegie-Mellon
• Many languages
• Largest chunk: English, 23m words


Language change

Brown family
Small but perfectly formed
• 1m words
• 500 x 2000-word samples
• the same 15 text types

Supports comparison
• American and British English
• 1931, 1961, 1991, 2006
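
The size figure follows directly from the design: 500 samples of 2000 words each give 1m words. The Python sketch below checks that arithmetic and shows one simple way to cut a fixed-length sample from a longer text. The function `take_sample` and the placeholder tokens are assumptions for illustration; how the actual Brown family samples were selected is not described here.

```python
SAMPLES = 500            # number of text samples, from the slide
WORDS_PER_SAMPLE = 2000  # words per sample, from the slide

corpus_size = SAMPLES * WORDS_PER_SAMPLE
print(corpus_size)  # 1000000


def take_sample(tokens, start, length=WORDS_PER_SAMPLE):
    """Return a fixed-length run of tokens beginning at `start`."""
    sample = tokens[start:start + length]
    if len(sample) < length:
        raise ValueError("text too short for a full sample")
    return sample


# Hypothetical source text: 5,000 placeholder tokens.
text = [f"w{i}" for i in range(5000)]
sample = take_sample(text, start=1000)
print(len(sample), sample[0], sample[-1])  # 2000 w1000 w2999
```

Because every corpus in the family has the same size and the same mix of text types, counts from different years or varieties can be set side by side.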


Language and gender

When you see a dentist <he/she/they> ...
• What is now normal?

Recent study
• they now the norm
• themself now needed, despite what spellcheck says

BNC (most text from 1989): 0.2/million
EnTenTen (mostly 2009): 0.4/million
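
Per-million figures like these make corpora of very different sizes comparable: the raw count is divided by the corpus size and scaled to a million words. A minimal sketch in Python follows. The hit counts and corpus sizes are hypothetical, chosen only so that the results match the two rates on the slide; they are not the real counts behind those figures.

```python
def per_million(hits, corpus_size):
    """Normalize a raw count to occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return hits * 1_000_000 / corpus_size


# Hypothetical counts and sizes, chosen only to reproduce the slide's rates.
older = per_million(hits=20, corpus_size=100_000_000)
newer = per_million(hits=800, corpus_size=2_000_000_000)

print(older, newer, newer / older)
```

With these made-up inputs the newer corpus has forty times as many raw hits, yet the normalized rate is only twice as high, which is why the normalized figure is the one to compare.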


Language and law

Trade marks
Hoover and similar
• trademark or generic

Cases
• sabatier, botox, kettle chips

Key evidence
• Do people tend to capitalize?
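
The capitalization question can be put to a corpus as a simple ratio: of all the times a word is used, how often does it start with a capital? The Python sketch below is one way to compute that, under stated assumptions: the example sentences are invented, tokenization is a crude letters-only regex, and sentence-initial tokens are skipped because they are capitalized regardless. It is an illustration, not the method used in the cases mentioned.

```python
import re


def capitalization_rate(word, sentences):
    """Share of non-sentence-initial uses of `word` that start with a capital."""
    capitalized = total = 0
    for sentence in sentences:
        tokens = re.findall(r"[A-Za-z]+", sentence)
        # Skip the first token: sentence-initial capitals say nothing here.
        for token in tokens[1:]:
            if token.lower() == word.lower():
                total += 1
                if token[0].isupper():
                    capitalized += 1
    return capitalized / total if total else None


# Invented example sentences, for illustration only.
sentences = [
    "She bought a new Hoover yesterday.",
    "I need to hoover the stairs.",
    "Where did you put the hoover?",
    "Hoover sales rose last year.",
]

rate = capitalization_rate("hoover", sentences)
print(rate)
```

A rate near 1 would suggest writers treat the word as a name; a low rate would suggest generic use. The function returns `None` when the word never occurs outside sentence-initial position.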


English nouns: % capitalized

[Chart: Types and Tokens series; x-axis bands <10, <30, <50, <70, <90; y-axis 0 to 70]


Syntax and semantics


DANTE

Detailed account of English lexis
Corpus-driven
• From word sketches
• Lexicographers assign to senses

High precision
Available at http://webdante.com
Brochures


What data shall I use?


Think hard


Sometimes ...

Just-in-time corpus from the web
Use case:
• Translator, French-to-English
• Translation task
  • volcanoes
  • In French
  • I understand it OK, but I'm no vulcanologist, I don't know the English terminology

BootCaT, Baroni and Bernardini
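
As generally described, BootCaT starts from a handful of seed terms, combines them into small random tuples, sends each tuple to a search engine as a query, and downloads and cleans the pages returned. The Python sketch below illustrates only the first step, building the query tuples. The seed terms, tuple size, and function name are assumptions made for illustration; this is not BootCaT's own code, and the search and download steps are left out.

```python
import math
import random


def seed_tuples(seeds, tuple_size=3, n_tuples=5, rng_seed=0):
    """Draw distinct random combinations of seed terms to use as queries."""
    if tuple_size > len(seeds):
        raise ValueError("tuple_size cannot exceed the number of seeds")
    if n_tuples > math.comb(len(seeds), tuple_size):
        raise ValueError("not enough distinct combinations for n_tuples")
    rng = random.Random(rng_seed)  # fixed seed so runs are repeatable
    tuples = set()
    # Keep drawing until we have n_tuples distinct combinations.
    while len(tuples) < n_tuples:
        tuples.add(tuple(sorted(rng.sample(seeds, tuple_size))))
    return sorted(tuples)


# Hypothetical seed terms a translator might already know.
seeds = ["volcano", "lava", "eruption", "magma", "crater", "ash"]

queries = [" ".join(t) for t in seed_tuples(seeds)]
for query in queries:
    print(query)
```

Each printed line is a candidate search query. Pages that match several domain terms at once are likely to be on topic, which is what makes a small seed list enough to gather a specialist corpus.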


Corpora in Sketch Engine

Access-to-all
42 languages
• All major world languages

Mostly large, web-crawled
Various other
• CHILDES, Brown, ...

“My corpora”
BootCaT and other


LCL sponsorship of LSA

One year free accounts for participants
http://www.sketchengine.co.uk
• “Register”
• “Site licence member”
• Your details and
  • Organisation: select LSA2011
  • Site licence key: Boulder
• Password by email
  • change it (under Settings)


Today
• Motivations, taster

Sunday 9-12
• practical