UCREL: from LOB to REVERE Paul Rayson



November 1999 CSEG awayday Paul Rayson

A brief history of UCREL

In ten minutes, I will present a brief history of UCREL (the University Centre for Computer Corpus Research on Language), which has members in the Computing and Linguistics Departments. UCREL specialises in the automatic or computer-aided analysis of large bodies of naturally occurring language ('corpora') and has pioneered this field for more than twenty years. The history runs from the first million-word corpus of British English (LOB) in 1978 to the present-day REVERE project, which is piloting the application of UCREL's techniques to texts in the requirements engineering domain.

Timeline: 1960 to 1999


Noam Chomsky

• Chomsky changed the direction of Linguistics away from empiricism and towards rationalism

• observation of naturally occurring data versus theory of how human language processing is actually undertaken

• model competence rather than performance

• early corpus linguists saw language as finite and collected it all!


Brown and LOB

• One million word machine-readable corpora:

– Brown corpus (American English)

– Lancaster-Oslo-Bergen (British English)


Grammatical analysis

• Statistical word-class tagging by CLAWS: 98% accuracy, where simple rule-based systems achieved only the low 90s

• Manual full parse: 3 million words for training


Speech recognition

• In collaboration with IBM’s continuous speech recognition group

• Produced a detailed analysis of spoken data and a grammar of spoken English


Word sense tagging

• Automatic tagger to assign semantic tags

• Tags differentiate dictionary word senses to accuracy of 91%

• Full text, hybrid rule-based & statistical

• Application to market research


The next generation

• British National Corpus

• One hundred million words

• All tagged at Lancaster


REVERE

• Application of UCREL’s techniques to the requirements engineering domain

• Legacy documents (specifications, ethnographic reports, manuals)

• Assisting the requirements engineer (RE) to extract domain knowledge


UCREL consultancy

• CLAWS licences

• Tagging services

– 146 million words processed