UCREL: from LOB to REVERE
Paul Rayson
November 1999 CSEG awayday Paul Rayson 2
A brief history of UCREL
In ten minutes, I will present a brief history of UCREL (the University Centre for Computer Corpus Research on Language), which has members in the Computing and Linguistics Departments. UCREL specialises in the automatic or computer-aided analysis of large bodies of naturally occurring language ('corpora'). UCREL has a record of achievement of more than twenty years as a pioneer in this field, from the first million-word corpus of British English in 1978 (LOB) up to the present day with the REVERE project, which is piloting the application of UCREL's techniques to texts in the requirements engineering domain.
1960 THE TIMELINE 1999
Noam Chomsky
• Chomsky changed the direction of Linguistics away from empiricism and towards rationalism
• observation of naturally occurring data versus theory of how human language processing is actually undertaken
• model competence rather than performance
• early corpus linguists saw language as finite and collected it all!
Brown and LOB
• One million word machine-readable corpora:
– Brown corpus (American English)
– Lancaster-Oslo-Bergen (British English)
Grammatical analysis
• Statistical word-class tagging by CLAWS reached 98% accuracy; simple rule-based systems achieved only the low 90s
• Manual full parse - 3 million words for training
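
To make "statistical word-class tagging" concrete, here is a minimal sketch of the general idea, assuming a bigram hidden Markov model with add-one smoothing and Viterbi decoding. It is not CLAWS itself, and it says nothing about how CLAWS reaches its accuracy. The training sentences, the tag names (DET, NOUN, VERB) and the function names are all hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict
import math

# Toy hand-tagged training data (hypothetical; not taken from LOB or any real corpus).
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
]

START = "<s>"  # pseudo-tag marking the start of a sentence


def train(sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)  # trans[previous_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    vocab = set()
    for sentence in sentences:
        prev = START
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, sorted(emit), vocab


def log_prob(counts, key, n_outcomes):
    """Add-one smoothed log probability of `key` under `counts`."""
    return math.log((counts[key] + 1) / (sum(counts.values()) + n_outcomes))


def viterbi(words, model):
    """Return the most probable tag sequence for a non-empty list of words."""
    trans, emit, tags, vocab = model
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra outcome reserved for unseen words

    # best[i][tag] = (log score of best path ending in tag at position i, previous tag)
    best = [{
        t: (log_prob(trans[START], t, n_tags) + log_prob(emit[t], words[0], n_words), None)
        for t in tags
    }]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            e = log_prob(emit[t], words[i], n_words)
            column[t] = max(
                (best[i - 1][p][0] + log_prob(trans[p], t, n_tags) + e, p)
                for p in tags
            )
        best.append(column)

    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


model = train(TRAIN)
print(viterbi(["the", "dog", "sleeps"], model))
```

The point of the sketch is the division of labour: manually analysed text supplies the counts, and the tagger picks the tag sequence that is most probable under those counts rather than applying hand-written rules one word at a time.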
Speech recognition
• In collaboration with IBM’s continuous speech recognition group
• Produced a detailed analysis of spoken data and a grammar of spoken English
Word sense tagging
• Automatic tagger to assign semantic tags
• Tags differentiate dictionary word senses to accuracy of 91%
• Full text, hybrid rule-based & statistical
• Application to market research
The next generation
• British National Corpus
• One hundred million words
• All tagged at Lancaster
REVERE
• Application of UCREL’s techniques to requirements engineering domain
• Legacy documents (specifications, ethnographic reports, manuals)
• Assisting the requirements engineer (RE) to extract domain knowledge
UCREL consultancy
• CLAWS licences
• Tagging services
– 146 million words processed