Corpora, Language Technology and Maltese

Adam Kilgarriff
Lexical Computing Ltd
Lexicography MasterClass Ltd
University of Sussex

Kilgarriff, Lexical Computing. Slide: 2. Malta, Nov 2006

How do you find out about a language?

Native speakers
Dictionaries and Grammars
Corpus

Four ages of corpus research

Age 1: Pre-computer

Oxford English Dictionary:
• 20 million index cards

Age 2: KWIC Concordances

From 1980
Computerised

Age 2: KWIC Concordance

 1  arity, which will be used to take a party of under-privileged children to D
 2   from outside. You are invited to a party and after a couple of drinks you d
 3  tion, we believe politicians of all parties will listen to our views. &equo
 4  ould be reaching agreement with all parties concerned, as to which events,
 5   lack people. I have certainly been party to one or two discussions amongst
 6  . These should be discussed by both parties before entering into the relatio
 7  presents They had hosted a cocktail party at Kensington palace, for example
 8  akes. By midnight the end-of-course party is in full swing, but most cadet
 9  e should be a right for the injured party to terminate the contract. A mana
10   by the Safran Peoples ' Liberation Party. This presents the powerful neigh
11  s. Ahead I could see the rest of my party plodding towards the final slope t
12   cial ethic. The two main political parties - the Tories and the Liberals -
13  ritish successes in Perth The small party of British players competing in th
14   to help control. One member of the party went to summon the rescue team and
15   rket society fashion magazine. The party was held at his flat which was a l
16   security and secrecy than any Tory Party Conference : it seems that bootleg
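
A concordance like this one can be produced with very little code. The sketch below is only an illustration, not the software behind the slide: the function name `kwic`, the 35-character context width and the sample sentence are all assumptions made for the example.

```python
import re

def kwic(text, pattern, width=35):
    """Return keyword-in-context lines: each match with `width` characters either side."""
    text = " ".join(text.split())  # collapse line breaks and repeated spaces
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so every keyword starts in the same column.
        lines.append(f"{left:>{width}}{m.group(0)}{right}")
    return lines

# Hypothetical sample text, not taken from the corpus behind the slide.
sample = ("You are invited to a party and after a couple of drinks you relax. "
          "The two main political parties disagreed.")
for line in kwic(sample, r"\bpart(?:y|ies)\b"):
    print(line)
```

Lining the keyword up in one column is what lets a reader scan down the page and spot patterns to its left and right.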

Age 2: KWIC Concordances

From 1980
Computerised
COBUILD project was the innovator
The coloured-pens method

The coloured pens method

Senses of party:
1 political association
2 social event
3 group of people
4 person in an agreement/dispute
5 to be party to something...

(Concordance lines 1 to 16 for party, as on the KWIC Concordance slide above.)

Age 2: limitations

As corpora get bigger: too much data
• 50 lines for a word: read all
• 500 lines: could read all, takes a long time
• 5000 lines: no

Age 3: Collocation statistics

Problem: too much data. How to summarise?

Solution: list of words occurring in the neighbourhood of the headword, with frequencies

Sorted by salience
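
As a rough illustration of the idea, the sketch below counts the words that follow a headword and sorts them by a salience score. The slides do not say which statistic was used; pointwise mutual information (PMI) is an assumption chosen here because it is simple, and the function name `right_collocates` and the toy sentence are hypothetical.

```python
import math
from collections import Counter

def right_collocates(tokens, headword, window=1, min_freq=1):
    """Return (word, co-occurrence freq, PMI) for words up to `window` tokens right of headword."""
    word_freq = Counter(tokens)
    near = Counter()
    for i, tok in enumerate(tokens):
        if tok == headword:
            near.update(tokens[i + 1:i + 1 + window])
    n = len(tokens)
    rows = []
    for word, f_xy in near.items():
        if f_xy < min_freq:  # frequency threshold, like the ">5 hits" cut-off
            continue
        # PMI: how much more often the pair occurs than chance would predict.
        pmi = math.log2(f_xy * n / (word_freq[headword] * word_freq[word]))
        rows.append((word, f_xy, pmi))
    return sorted(rows, key=lambda row: row[2], reverse=True)

# Toy token list, invented for the example.
tokens = ("we save money when we save energy and they save money "
          "too but money is tight").split()
for word, freq, score in right_collocates(tokens, "save", min_freq=2):
    print(word, freq, round(score, 2))
```

PMI tends to favour rare words, which is one reason a frequency threshold is usually applied alongside it.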

Collocation listing
For right collocates of save (>5 hits)

word       freq    word        freq
forests       6    life          36
$1.2          6    dollars        8
lives        37    costs          7
enormous      6    thousands      6
annually      7    face           9
jobs         20    estimated      6
money        64    your           7

Age 4: The word sketch

A corpus-derived one-page summary of a word’s grammatical and collocational behaviour

Age 4: The word sketch

Large well-balanced corpus
Parse to find
• subjects, objects, heads, modifiers etc
One list for each grammatical relation
Statistics to sort each list
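
The steps above can be sketched in a few lines, assuming the parsing has already been done and its output is a list of (relation, headword, collocate) triples. The slides do not name the sorting statistic; logDice is used here as one possible association score, and the function name `word_sketch` and the toy triples are hypothetical.

```python
import math
from collections import Counter, defaultdict

def word_sketch(triples, headword):
    """Group a headword's collocates by grammatical relation, each list sorted by logDice."""
    f_hrc = Counter(triples)                                  # (relation, head, collocate)
    f_hr = Counter((rel, head) for rel, head, _ in triples)   # head in this relation
    f_rc = Counter((rel, coll) for rel, _, coll in triples)   # collocate in this relation
    sketch = defaultdict(list)
    for (rel, head, coll), f in f_hrc.items():
        if head != headword:
            continue
        # logDice (an assumed choice): 14 + log2 of the Dice coefficient.
        score = 14 + math.log2(2 * f / (f_hr[(rel, head)] + f_rc[(rel, coll)]))
        sketch[rel].append((coll, f, round(score, 2)))
    for rel in sketch:
        sketch[rel].sort(key=lambda row: row[2], reverse=True)
    return dict(sketch)

# Toy parser output, invented for the example.
triples = [
    ("object", "save", "money"), ("object", "save", "money"),
    ("object", "save", "life"), ("object", "spend", "money"),
    ("subject", "save", "scheme"),
]
for relation, rows in word_sketch(triples, "save").items():
    print(relation, rows)
```

Each key of the returned dictionary corresponds to one list on the one-page summary: one grammatical relation, with its collocates ranked by the statistic.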

Macmillan English Dictionary for Advanced Learners

Ed: Rundell, 2002

Developer: Pavel Rychly, Brno
Users:
• OUP, Chambers, CUP
• Universities for teaching and research
• ELT textbook authors
Demo: http://www.sketchengine.co.uk/
• Self-registration for free account

How to develop language technologies?

Introspection
Copy others
Corpus

Last two decades

Corpora have moved centre-stage for
• Spellcheckers
• Grammar checkers
• Automatic translation
• Question-answering
• …

Machine learning from big data

The corpus

English
• British National Corpus
• 100M words, range of text types
• Led the world
• Late 1980s/early 1990s

Maltese …

Corpus design: ideals

Very large
• Most words, phrases are rare
Cover all types of text
All texts labelled by text type, author
Available for anyone to use
Only real language

Corpus development: pragmatics

What you can get
Some sources easy, others hard
Often unlabelled
• expensive to label
Copyright
Lots of junk
• expensive to clean
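
To give a feel for what cleaning involves, here is a minimal sketch of two common heuristics: dropping very short lines (often menus and buttons) and dropping lines that repeat across pages (often footers and notices). The slides do not describe a cleaning procedure, so the function name `clean_pages`, the five-word threshold and the sample pages are all assumptions; real cleaning needs far more than this.

```python
def clean_pages(pages, min_words=5):
    """Drop very short lines and lines already seen, keeping the first occurrence."""
    seen = set()
    cleaned = []
    for page in pages:
        kept = []
        for line in page.splitlines():
            line = " ".join(line.split())          # normalise whitespace
            if len(line.split()) < min_words:      # likely navigation or a button
                continue
            if line in seen:                       # repeated boilerplate across pages
                continue
            seen.add(line)
            kept.append(line)
        if kept:
            cleaned.append("\n".join(kept))
    return cleaned

# Two invented pages sharing a menu and a footer.
pages = [
    "Home About Contact\n"
    "This is the first real sentence of the page.\n"
    "Copyright 2006 all rights reserved by the site",
    "Home About Contact\n"
    "Copyright 2006 all rights reserved by the site\n"
    "A second page has its own sentence here.",
]
for page in clean_pages(pages):
    print(page)
    print("---")
```

Even this toy version shows the cost: every threshold is a judgement call, and some junk (the first copy of the footer here) still gets through.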

Sources

Traditional
• Book publishers
• Newspapers, magazines
• Official

Web

Traditional

Write to them and ask
Wooing needed
Appeal to national pride
Newspapers:
• Large
Other sources:
• Often small amounts

Web

Lots of data (even for Maltese)
Free access
• (but copyright)
Some formal
• Laws, government pages
Some informal
• Chatroom, email

Laws vs web crawl

Agenda

Data cleaning
More kinds of text
Spelling: standardisation
Morphology
• inventing → invent (v, -ing)
Grammar

In sum

Dictionaries, language technology need corpus

Maltese
• Some components available
• More work needed

Solid foundation for LT is within reach