DATA-DRIVEN APPROACHES TO NATURAL LANGUAGE PROCESSING Guy De Pauw

Data-driven approaches to natural language processing ... approaches to... · Data-Driven Approach to Natural Language Processing - CURRENT TRENDS IN AI VUB 20/4/2012 TASKS & APPLICATIONS



DATA-DRIVEN APPROACHES TO NATURAL LANGUAGE PROCESSING

Guy De Pauw


Data-Driven Approach to Natural Language Processing - CURRENT TRENDS IN AI, VUB, 20/4/2012

WHO AM I?

http://aflat.org/guy

• Born in Antwerp, 1975
• Studied Germanic Languages & Literature (Dutch – English)
• PhD (2002): An Agent-Based Evolutionary Computing Approach to Memory-Based Syntactic Parsing of Natural Language
• 2002-2006: FLaVoR: flexible large vocabulary recognition (Dutch morphology and parsing)
• 2006-2012: African Language Technology
  - FWO postdoc at University of Antwerp
  - Machine learning approaches to language technology for African languages
  - AfLaT.org


OUTLINE

• What is Natural Language Processing?
• Introduction to CLiPS research
  - African language technology
• Data-driven Approaches
  - Paradigm Shift
  - Machine Learning Recap
  - Memory-Based Language Processing
• Data-Driven African Language Technology
  - Language Independence
  - Development Speed
  - Adaptability
  - Applicability
  - Empiricism


NATURAL LANGUAGE PROCESSING


LANGUAGE & SPEECH TECHNOLOGY

• Language is an important medium for
  - Communication
  - Storing knowledge
• Language & Speech Technology allows people to
  - Communicate with computers
  - Work with computers in natural language
  - Extract knowledge from speech and text
  - …

• Ultimate goal: natural language understanding


TASKS & APPLICATIONS

• Tasks:
  - Tokenization
  - Grapheme-to-phoneme conversion
  - Morphological analysis (segmentation, generation, lemmatization, stemming)
  - Part-of-speech tagging
  - Named-entity recognition
  - Syntactic parsing
  - Word sense disambiguation
  - Semantic-role labeling
  - Co-reference resolution
  - Discourse analysis
• Applications:
  - Optical character recognition
  - Spell-checking
  - Text-to-speech
  - Predictive text (T9)
  - Automatic summarization
  - Question answering
  - Sentiment analysis
  - Information retrieval/extraction
  - Terminology extraction
  - Speech recognition (AI-complete)
  - Machine translation (AI-complete)


SOCIO-ECONOMIC IMPORTANCE

• Information explosion (internet)
  - Technostress
  - 2002: 5 exabytes newly stored information
    • 1 exabyte = 1 million terabytes
  - Doubles every 2-3 years
• Translation explosion
  - EU (2005)
    • 20+ official languages
    • > 1 billion euro per year
    • 2500 translators
    • 40% administrative budget
  - Not just in Europe: South Africa (11 official languages)

• Helpdesks, call centers, gaming, ...


SOCIO-ECONOMIC IMPORTANCE

• Text production
  - Spelling and grammar check
  - “Clear language”: governments, pharmaceutical companies
• Language Teaching
  - Language tests, exercises, …
• “Business Intelligence”
  - Collect information on the competition
  - Opinion mining


SCIENTIFIC IMPORTANCE

• Artificial Intelligence
• Language capacity = intelligence

Nim Chimpsky: “me eat drink more”, “banana eat me Nim”


http://www.youtube.com/watch?v=oUj9AzSE_9c


WHY NO HAL9000 IN 2001

• Natural language processing is done at several levels
  - Phonetics: speech sounds
      I can participate    aj kæm patɪsəpe
  - Phonology: how do phonemes combine into larger units
      I can participate    aj kæn partɪsəpet
  - Morphology: smallest meaning-carrying unit in language
      Elle est joli+e
  - Syntax: grammar, how do combinations of words express meaning
      Elle est jolie vs *Il est jolie
  - Semantics: how is meaning expressed
      He beat me vs I was beaten by him
  - Pragmatics: contextual knowledge
      Can you pass me the salt?
• Each level introduces errors, contains ambiguity, …


AMBIGUITY

• Orthographic/phonological (cf. “Eye halve a spell...”)

Eye halve a spelling chequer
It came with my pea sea
It plainly marques four my revue
Miss steaks eye kin knot sea.

Eye strike a key and type a word
And weight four it two say
Weather eye am wrong oar write
It shows me strait a weigh.

As soon as a mist ache is maid
It nose bee fore two long
And eye can put the error rite
Its rarely ever wrong.

Eye have run this poem threw it
I am shore your pleased two no
Its letter perfect in it's weigh
My chequer tolled me sew.


AMBIGUITY

• Orthographic/phonological (cf. “Eye halve a spell...”)
• Lexical - morphological
  - The can will rust
• Syntactic
  - The prime minister reported his marriage to the king
• Semantic
  - My cat is on the television
  - All students know two languages


AMBIGUITY

• Text level
  - Tom didn’t have a job. He grabbed the newspaper.
  - Tom thought the fly was annoying. He grabbed the newspaper.
  Why did Tom take the newspaper?
• World Knowledge!
  - The mayors prohibited the students from demonstrating because they preached the revolution
  - The mayors prohibited the students from demonstrating because they feared violence
• Ellipsis
  - Alcohol is more damaging to women than men


MACHINE TRANSLATION


http://www.news.com.au/breaking-news/world/predictive-text-error-leads-uk-man-to-fatally-stab-friend/story-e6frfkui-1226004032018


WHAT CAN WE DO?


WHAT CAN WE DO?

Phonetics/Phonology

• Speech Synthesis: “Pas de déclaration choc lors des débats dominicaux : sur les plateaux TV, les partis qui ont conclu l’accord sur BHV ont défendu énergiquement le résultat de leurs travaux.”
  http://www.acapela-group.com/text-to-speech-interactive-demo.html
• Speech Recognition
  http://www.youtube.com/watch?v=-0kDcUEDfmY


WHAT CAN WE DO?

Morphology

• Morphological analysis: segmentation of compounds, derivations, …
    watermarking = ((water[N]+mark[N])[N]+ing[V|N.])[V]
• Lemmatization, stemming: get the base forms, roots of word forms
    watermarking = watermark
• E.g. Google search: bobcats

http://ilk.uvt.nl/mbma/


NOT ALWAYS EASY

uygarlastiramayabileceklerimizdenmissinizcesine

uygar/civilized las/BECOME tir/CAUS ama/NEG yabil/POT ecek/FUT ler/3PL imiz/POSS-1PL den/ABL mis/NARR siniz/2PL cesine/AS-IF

• Adverb meaning “(behaving) as if you were one of those whom we might not be able to civilize”

• [Turkish, from Oflazer & Guzey 1994]


WHAT CAN WE DO?

Morpho-syntax

• Part-of-Speech Tagging

  The         can   will   rust
  determiner  noun  modal  verb

http://www.clips.ua.ac.be/cgi-bin/webdemo/MBSP-instant-webdemo.cgi


WHAT CAN WE DO?

Syntax

• Sentence analysis
  http://www.link.cs.cmu.edu/link/submit-sentence-4.html
  http://www.connexor.com/nlplib/?q=demo/syntax


APPLICATIONS

• Information Retrieval (data-mining)
• INFORMATION = power
• Document classification:
  - SPAM filter
  - Intercept terrorist messages
• Document retrieval:
  - Web search
• Question-answering systems
  - How far is Brussels from Antwerp?
  - http://www.wolframalpha.com
• Text-mining: get facts from texts
• Ontology learning


[Figure: fragment of a learned ontology centred on "hepatitis". Nodes: disease, infection, HBV, cirrhosis, liver, immunization, antibody, vaccination, culture, antisera. Edge labels: related_to, sim, produced by, prevented by.]


APPLICATIONS

- Machine translation: http://translate.google.com
- Automatic subtitling: e.g. YouTube captions
- Computational Stylometry: http://www.clips.ua.ac.be/cgi-bin/kim/TACTiCSdemo.cgi
- Automatic Summarization: http://www.clips.ua.ac.be/~iris/sumdemo.html
- Spell-Checking (MS Word)
- Grammar Checking (MS Word)
- Spoken Dialogue Systems (banking, airline, gaming, …)
- T9, autocomplete
- Google Adsense
- …


CLIPS COMPUTATIONAL LINGUISTICS GROUP RESEARCH


CLIPS

www.clips.ua.ac.be

• Psycholinguistics Group
  - Language Acquisition (Steven Gillis)
  - Language Processing (Dominiek Sandra)
• Computational Linguistics Group (Walter Daelemans)
  - Text mining
  - Natural language understanding


COMPUTATIONAL STYLOMETRY

• Authorship Attribution, personality prediction from text

• Exploring feature sets, corpora, different types of tasks (few vs many authors, …)

• Stylene (EWI): Stylometry and Readability Environment for Dutch

• Stylometry experiments with Middle-Dutch sermons

• Investigate Hugo Claus’ work to find evidence of Alzheimer’s disease


AUTHOR AND COPYIST ATTRIBUTION IN MEDIEVAL DUTCH TEXTS

• Use computational stylometry techniques to assign an author or copyist to anonymous texts

• Adapt and develop language technology tools for Medieval Dutch

People: Walter Daelemans, Mike Kestemont

Also: Dating (of texts)


BIOGRAPH

• Adaptation of text analysis tools to biomedical language

• Handling of negation, modality, and quantification in medical language

• Extract accurate relations from text

People: Walter Daelemans, Roser Morante, Vincent Van Asch


DAPHNE: DETECTING “GROOMING” BY PEDOPHILES IN SOCIAL NETWORKS

• Use text analysis tools to distinguish between children and adults posing as children in chat rooms

People: Walter Daelemans, Claudia Peersman


ARTIFICIAL CREATIVITY IN GRAPHICAL DESIGN

• Develop a software algorithm that summarizes, interprets and processes textual content (or data sets) in the context of graphical design

People: Walter Daelemans, Tom De Smedt


TREND MINING


DELEARYOUS

• Develop a serious 3D game to help with training of interpersonal communicative skills

• Use text analysis tools to associate human interaction with quadrants of Leary’s Rose and plan next interaction

People: Walter Daelemans, Frederik Vaassen


ALADIN

ALADIN: Adaptation and Learning for Assistive Domestic Vocal INterfaces

Goal: develop a robust, self-learning domestic vocal interface that adapts to the user instead of the other way around:
- learn the user’s vocabulary & grammar constructs
- learn the user’s voice & pronunciation characteristics

How? Unsupervised learning on the basis of training examples: vocal commands + associated controls (actions)

• People: Janneke van de Loo, Guy De Pauw, Walter Daelemans


AFRICAN LANGUAGE TECHNOLOGY (FWO)

• Explore machine learning techniques for building language technology applications and modules for (resource-scarce) African languages

• Data collection, annotation and deployment

• Unsupervised learning of morphology

People: Guy De Pauw, Naomi Maajabu


OTHER PROJECTS

• NEON: automatic subtitling of television programs

• AMiCA: Automatic Monitoring for Cyberspace Applications

• STARLING: Statistical Relational Learning of Natural Language


DATA-DRIVEN APPROACHES


NATURAL LANGUAGE PROCESSING

• Early NLP: deductive, rule-based methods
  - Limited computational power
  - AI: expert systems
• Use linguistic experts to build rule-based NLP applications
  - Advantages: linguistically relevant, precise, fine-tuned for specific domain
  - Disadvantages: expensive development, not robust, not domain/language independent
    KNOWLEDGE ACQUISITION BOTTLENECK


SHIFT TO INDUCTIVE PARADIGM

• Late 80s, early 90s: shift from deductive paradigm to inductive paradigm, i.e. from rule-based to corpus-based approaches
  - Exploit large, annotated language corpora with statistical and machine learning methods to automatically induce NLP tools
  - Intelligent NLP systems learning from examples


PARADIGM SHIFT

© Walter Daelemans

Deductive Methods
- Hard-coded solutions
- (Linguistic) Expert systems
- Rule-Based methods

Inductive Methods
- Induced from (annotated) corpora
- Statistical, Machine Learning Techniques
- Data-driven methods


WHY THIS SHIFT?

• NLP was coming out of the toy domain
• Disadvantages of rule-based methods (expense, lack of robustness, domain dependence) were becoming too obstructive for effective NLP research

Fred Jelinek (1988) on working on a speech recognizer:
“Every time I fire a linguist the performance of the recognizer goes up”



HUGO BRANDT CORSTIUS

Three laws of computational linguistics
1. Whatever one does, semantics will always interfere.
2. Any linguistic description, no matter how precise, will turn out to contain an error when one attempts to implement it.
3. Law of diminishing returns.


DATA-DRIVEN NLP

• Statistical techniques
  - Number crunching
  - Exploit statistics for classification


Probabilistic POS Tagging

• Requires annotated corpus

  can/md the/dt tag/nn be/vb better/jjr

• Unigram: $P(\text{word} \mid \text{tag})$
  frequency of the tag for this word in corpus

• Bigram: $P(\text{word}_i \mid \text{tag}_i)\, P(\text{tag}_i \mid \text{tag}_{i-1})$
  frequency of the tag for this word in corpus, given the previous tag

• Trigram: $P(\text{word}_i \mid \text{tag}_i)\, P(\text{tag}_i \mid \text{tag}_{i-1}, \text{tag}_{i-2})$
  frequency of the tag for this word in corpus, given the previous two tags

• Good results, but data sparseness problems
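
The bigram model can be sketched in a few lines of Python. This is a minimal illustration, not the tagger used in the lecture: the three-sentence corpus is invented for the example, the add-alpha smoothing (`alpha=0.1`) is an assumed, simple way to soften the data sparseness problem, and Viterbi decoding is a standard search technique that the slide does not name.

```python
from collections import Counter, defaultdict

# Toy annotated corpus in word/tag format (invented for this sketch).
CORPUS = [
    "the/dt can/nn will/md rust/vb",
    "the/dt tag/nn can/md be/vb better/jjr",
    "can/md the/dt tag/nn be/vb better/jjr",
]

def train(corpus):
    """Count word-given-tag emissions and tag-given-previous-tag transitions."""
    emit = defaultdict(Counter)   # emit[tag][word]      -> count
    trans = defaultdict(Counter)  # trans[prev_tag][tag] -> count
    for sentence in corpus:
        prev = "<s>"              # sentence-start marker
        for token in sentence.split():
            word, tag = token.rsplit("/", 1)
            emit[tag][word] += 1
            trans[prev][tag] += 1
            prev = tag
    return emit, trans

def prob(counter, key, n_outcomes, alpha=0.1):
    """Relative frequency with add-alpha smoothing (assumed remedy for sparseness)."""
    return (counter[key] + alpha) / (sum(counter.values()) + alpha * n_outcomes)

def tag_bigram(words, emit, trans):
    """Viterbi search for the tag sequence maximising the bigram product."""
    tags = sorted(emit)
    n_words = len({w for c in emit.values() for w in c}) + 1  # +1 for unseen words
    best = {"<s>": (1.0, [])}     # best[tag] = (score, tag path ending in tag)
    for word in words:
        new_best = {}
        for tag in tags:
            p_emit = prob(emit[tag], word, n_words)
            new_best[tag] = max(
                ((score * prob(trans[prev], tag, len(tags)) * p_emit, path + [tag])
                 for prev, (score, path) in best.items()),
                key=lambda candidate: candidate[0],
            )
        best = new_best
    return max(best.values(), key=lambda candidate: candidate[0])[1]

EMIT, TRANS = train(CORPUS)
print(tag_bigram("the can will rust".split(), EMIT, TRANS))
print(tag_bigram("can the can rust".split(), EMIT, TRANS))
```

On this toy corpus the transition term is what separates the two readings of "can": after a determiner it is tagged `nn`, at the start of a sentence `md`.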


DATA-DRIVEN NLP

• Statistical techniques
  - Number crunching
  - Exploit statistics for classification
• Machine-learning techniques:
  - Symbolic approaches
  - Use annotated corpora as example of a particular classification task
  - The machine learning algorithm learns from examples
  - Simple approach: Memory-based learning


Memory-Based Learning


Memory-Based Learning

• Describe problem
• Gather data

Temp  coughing  headache  nose    class
37    YES       YES       RUNNY   COLD
39    NO        YES       OK      FLU
40    YES       NO        STUFFY  BRONCHITIS
…     …         …         …       …


Memory-Based Learning

Classification in Memory-Based Learning:

New instance: 39,YES,YES,OK,????? → FLU

Temp  coughing  headache  nose    class
37    YES       YES       RUNNY   COLD        overlap = 2
39    NO        YES       OK      FLU         overlap = 3
40    YES       NO        STUFFY  BRONCHITIS  overlap = 1
…     …         …         …       …
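
The overlap counts above can be reproduced with a short Python sketch. It is only an illustration of the idea: the function names are hypothetical, the instance base holds just the three rows shown, and breaking ties by majority vote among the equally near instances is an assumption of this sketch.

```python
from collections import Counter

# Instance base copied from the table: (temp, coughing, headache, nose) -> class.
INSTANCES = [
    ((37, "YES", "YES", "RUNNY"), "COLD"),
    ((39, "NO", "YES", "OK"), "FLU"),
    ((40, "YES", "NO", "STUFFY"), "BRONCHITIS"),
]

def overlap(a, b):
    """Number of feature positions on which two instances agree."""
    return sum(x == y for x, y in zip(a, b))

def classify(new, instances=INSTANCES):
    """Return the majority class among the stored instances with the highest overlap."""
    scored = [(overlap(new, features), label) for features, label in instances]
    best = max(score for score, _ in scored)
    votes = Counter(label for score, label in scored if score == best)
    return votes.most_common(1)[0][0]

new_instance = (39, "YES", "YES", "OK")
print([overlap(new_instance, features) for features, _ in INSTANCES])
print(classify(new_instance))
```

Nothing is abstracted from the data at training time; all the work happens when a new instance is compared with the stored ones.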


Memory-Based Language Processing

• Lazy Learning Algorithm: no abstraction made during learning (↔ C5, Brill, Neural Networks, ...)

• Data is stored in memory

• Nearest Neighbor search: new data is classified by comparing the new instances to the instances in memory and extrapolating the class of the most similar instance

• Psycholinguistic Relevance


NLP as classification

• Most tasks in NLP: mapping between representations

the can will rust ► dt nn md vb

• This mapping in NLP: very complex because of different levels in language

• cf. rule-based methods: only approximate

• machine-learning methods: also approximate, but less effort


NLP as classification

Transform the description of your problem into a fixed feature vector

e.g. Past tense of English verbs

work-worked sing-sang sting-stung

...

• describe verb in 3 features: onset-nucleus-coda

w,o,rk s,i,ng st,i,ng ...

• describe output in finite set of classes

-ed i→a i→u ...
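The three-feature description of a verb can be sketched as follows. The split is purely orthographic and assumes monosyllabic verbs whose nucleus is written with the letters a, e, i, o, u; the function name `split_syllable` is hypothetical and only serves the illustration.

```python
VOWELS = set("aeiou")


def split_syllable(verb):
    """Split a monosyllabic verb into (onset, nucleus, coda)."""
    # Onset: every letter before the first vowel.
    start = 0
    while start < len(verb) and verb[start] not in VOWELS:
        start += 1
    # Nucleus: the run of vowels that follows the onset.
    end = start
    while end < len(verb) and verb[end] in VOWELS:
        end += 1
    # Coda: whatever is left.
    return verb[:start], verb[start:end], verb[end:]


for verb in ["work", "sing", "sting"]:
    print(verb, split_syllable(verb))
```

Every verb is reduced to the same three slots, which is what makes the description a fixed feature vector that a classifier can compare position by position.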


NLP as classification

Create an instance base from a set of examples

onset  nucleus  coda  class
w      o        rk    -ed
s      i        ng    i→a
st     i        ng    i→u
sh     oo       t     oo→o
cr     a        mp    -ed
r      i        ng    i→u

And feed it to the machine learner: e.g. TiMBL

New instances to classify: cork, fling, loot → ?


Classification

Apply this method to NLP tasks: e.g. POS-tagging

e.g. Can/MD this/DT tag/NN be/VB better/JJR

T-2  T-1  F          T+1        TAG
#    #    MD_NN      DT_RB      MD
#    MD   DT_RB      NN_VB      DT
MD   DT   NN_VB      VB         NN
DT   NN   VB         JJR_VB_RB  VB
NN   VB   JJR_VB_RB  #          JJR
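The tagging instances for "Can this tag be better" can be generated with a sliding window. The sketch below assumes a small lexicon that maps each word to its ambiguity class (all tags the word can take); the `LEXICON` contents are read off the slide, while the function name `make_instances` is hypothetical.

```python
# Ambiguity classes read off the slide; a real lexicon would be derived
# from an annotated corpus.
LEXICON = {
    "Can": "MD_NN",
    "this": "DT_RB",
    "tag": "NN_VB",
    "be": "VB",
    "better": "JJR_VB_RB",
}


def make_instances(words, tags, lexicon=LEXICON):
    """Build (T-2, T-1, F, T+1, TAG) instances for one tagged sentence."""
    instances = []
    for i, word in enumerate(words):
        # Left context: the already disambiguated tags, padded with '#'.
        t_minus_2 = tags[i - 2] if i >= 2 else "#"
        t_minus_1 = tags[i - 1] if i >= 1 else "#"
        # Focus and right context: still ambiguous, so use ambiguity classes.
        focus = lexicon[word]
        t_plus_1 = lexicon[words[i + 1]] if i + 1 < len(words) else "#"
        instances.append((t_minus_2, t_minus_1, focus, t_plus_1, tags[i]))
    return instances


words = ["Can", "this", "tag", "be", "better"]
tags = ["MD", "DT", "NN", "VB", "JJR"]
for instance in make_instances(words, tags):
    print(" ".join(instance))
```

The asymmetry is deliberate: tags to the left are known when tagging proceeds left to right, whereas the focus word and the word to its right can only be described by the set of tags they may receive.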


DATA-DRIVEN AFRICAN LANGUAGE TECHNOLOGY


DATA ACQUISITION

• Many advantages to data-driven NLP

• BUT: Methods need annotated data
- Generally less expensive to develop
- Same annotated data can be used by different researchers, using different methods (publications)

• But: what about lesser-used languages, resource-scarce languages?


AFRICAN LANGUAGE TECHNOLOGY

• +2000 languages
• Very limited work on BLARKs (Basic Language Resource Kits) in Africa
• Bridging the digital divide: need for Language Technology:
- Localization
- Machine translation

But: computational linguistics for African languages = resource-scarce language engineering


RESOURCE-SCARCE LANGUAGE ENGINEERING

• Develop re-usable tools and methodology
- Corpus-based methods (machine learning)
- Develop annotated corpora
- Develop automated methods that minimize the amount of manual effort (and linguistic expertise) involved

Research Question: are data-driven methods applicable to African languages?

How to overcome Indo-European bias?


ADVANTAGES OF MACHINE LEARNING

• Language independence
• Development Speed
• Adaptability
• Applicability
• Empiricism


LANGUAGE INDEPENDENCE


CORPUS COLLECTION

• Hoogeveen & De Pauw (2011) A Web Corpus Mining Tool for Resource-Scarce Languages
- Increasing amount of vernacular data for many sub-Saharan African languages - web-mining
- Language identification of over 500 languages (96% accurate)

• Encoding Issues
- Diacritics not or inconsistently used
- Normalization


MEMORY-BASED DIACRITIC CORRECTION

• mbũri written as mburi
• No digital lexicons available: simple look-up approach not applicable
• Alternative approach to normalization by defining the problem on the character level.

L  L  L  L  L  F  R  R  R  R  R  C
-  -  -  -  -  m  b  u  r  i  -  m
-  -  -  -  m  b  u  r  i  -  -  b
-  -  -  m  b  u  r  i  -  -  -  ũ
-  -  m  b  u  r  i  -  -  -  -  r
-  m  b  u  r  i  -  -  -  -  -  i
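The character-level instances can be derived automatically from any correctly spelled word: strip the diacritics to obtain the input side and keep the original characters as class labels. The sketch below assumes that every diacritic letter is a single precomposed Unicode character after NFC normalization, so that the stripped and the original word have the same length; the function names are hypothetical.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks, e.g. 'mbũri' -> 'mburi'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def make_instances(word, width=5, pad="-"):
    """One instance per character: 5 left, focus, 5 right -> original character."""
    target = unicodedata.normalize("NFC", word)
    plain = strip_diacritics(target)
    # Assumption: one code point per letter on both sides.
    assert len(plain) == len(target)
    padded = pad * width + plain + pad * width
    instances = []
    for i, label in enumerate(target):
        window = tuple(padded[i : i + 2 * width + 1])
        instances.append((window, label))
    return instances


for window, label in make_instances("mbũri"):
    print(" ".join(window), label)
```

Because the training material is produced by stripping diacritics from existing text, no lexicon and no manual annotation are needed, only text in which the diacritics are used correctly.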


EVALUATION

[Figure: the data set is divided into a TRAINING SET and a TEST SET]

• 10-fold cross-validation
• Compare output of automatic system to reference translation
• Calculate accuracy scores for unseen data
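The evaluation procedure can be sketched as follows. This is a generic illustration of k-fold cross-validation and not the evaluation scripts used for the experiments; `train` and `predict` are hypothetical callables standing in for whatever learner is being evaluated, and the data set is assumed to hold at least k labelled (features, label) pairs.

```python
def k_fold_splits(items, k=10):
    """Yield (training_set, test_set) pairs; each item is tested exactly once."""
    folds = [items[i::k] for i in range(k)]
    for i, test_set in enumerate(folds):
        training_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training_set, test_set


def accuracy(predicted, reference):
    """Percentage of predictions that match the reference."""
    correct = sum(1 for p, r in zip(predicted, reference) if p == r)
    return 100.0 * correct / len(reference)


def cross_validate(items, train, predict, k=10):
    """Average accuracy over k folds of (features, label) pairs."""
    scores = []
    for training_set, test_set in k_fold_splits(items, k):
        model = train(training_set)  # the learner never sees the test fold
        predicted = [predict(model, features) for features, _ in test_set]
        reference = [label for _, label in test_set]
        scores.append(accuracy(predicted, reference))
    return sum(scores) / len(scores)
```

Each fold serves as test set once while the other nine are used for training, so every accuracy score is computed on data the system has not seen.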


MEMORY-BASED DIACRITIC CORRECTION

        Language        Types   LLU   MBT   LLU+MBT
Africa  Cilubà          20.0k   77.0  85.3  79.6
        Gĩkũyũ          9.1k    77.3  92.4  91.5
        Kĩkamba         9.7k    79.4  91.6  90.4
        Northern Sotho  157.8k  97.6  99.2  99.4
        Tshivenda       9.6k    97.7  99.4  99.2
        Yoruba          4.2k    67.8  76.8  68.5
Europe  Czech           105.8k  61.8  89.2  90.1
        Romanian        146.9k  94.0  96.5  96.6
        French          258.6k  89.1  88.3  89.3
        Dutch           301.9k  99.9  99.8  99.9
        German          365.6k  96.2  95.3  96.8
Asia    Vietnamese      50.9k   74.5  73.5  75.5
        Chinese Pinyin  12.0k   78.5  83.9  80.3

• Single set of scripts for each language
• Limited linguistic expertise


DEVELOPMENT SPEED


CORPUS ANNOTATION

• Swahili Part-of-Speech Tagging (De Pauw, de Schryver & Wagacha, 2006): uses annotated Helsinki Corpus of Swahili as training material for data-driven tagger (>98% accurate)

• Northern Sotho: start from scratch
- Minimize human linguistic expertise
- No extended tagging protocol development phase
- Maximize re-usability of methodology


CORPUS ANNOTATION

• Starting point: digital lexicon (word + possible tags)

• Annotator environment: Spreadsheet


TAGGED CORPUS

• 10,000 words annotated over the course of a couple of weeks
• Use data as training material for a maximum entropy-based tagger (advanced statistical modeling of data)

• Classification on the basis of
- Contextual features: surrounding words and tags
- Orthographic features: capitalization, prefix/suffix letters


TAGGED CORPUS

Instance → Tag

['W-1=#', 'T-1=#', 'FW=Ke', 'FT=SC_COPp', 'W+1=a', 'T+1=SC_PRES_PC_DEM_OC_HRTp', 'P1=K', 'S1=e', 'P2=Ke', 'S2=Ke', 'CAP'] → SC

['W-1=Ke', 'T-1=SC', 'FW=a', 'FT=SC_PRES_PC_DEM_OC_HRTp', 'W+1=eletša', 'T+1=V', 'P1=a', 'S1=a'] → PRES

['W-1=a', 'T-1=PRES', 'FW=eletša', 'FT=V', 'W+1=.', 'T+1=Punc', 'P1=e', 'S1=a', 'P2=el', 'S2=ša', 'P3=ele', 'S3=tša'] → V

['W-1=eletša', 'T-1=V', 'FW=.', 'FT=Punc', 'W+1=#', 'T+1=#', 'P1=.', 'S1=.'] → Punc

Ke/SC a/PRES eletša/V ./Punc
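The instances shown on this slide can be regenerated with a small feature extractor. The sketch below is inferred from the four instances only: in particular, the rule that prefixes and suffixes go up to length 3 but never beyond the word length is an assumption, and `extract_features` and the `ambiguity` dictionary are hypothetical names.

```python
def extract_features(words, ambiguity, i, previous_tag):
    """Contextual and orthographic features for the word at position i."""
    word = words[i]
    prev_word = words[i - 1] if i > 0 else "#"
    next_word = words[i + 1] if i + 1 < len(words) else "#"
    # The right-hand tag is not known yet, so its ambiguity class is used.
    next_tags = ambiguity[next_word] if next_word != "#" else "#"
    features = [
        "W-1=" + prev_word,
        "T-1=" + previous_tag,
        "FW=" + word,
        "FT=" + ambiguity[word],
        "W+1=" + next_word,
        "T+1=" + next_tags,
    ]
    # Prefix and suffix letters (assumed maximum length: 3).
    for n in range(1, min(3, len(word)) + 1):
        features.append("P%d=%s" % (n, word[:n]))
        features.append("S%d=%s" % (n, word[-n:]))
    if word[0].isupper():
        features.append("CAP")
    return features


words = ["Ke", "a", "eletša", "."]
ambiguity = {
    "Ke": "SC_COPp",
    "a": "SC_PRES_PC_DEM_OC_HRTp",
    "eletša": "V",
    ".": "Punc",
}
print(extract_features(words, ambiguity, 0, "#"))
```

The previous tag is passed in explicitly because, at tagging time, it is the tag the classifier has just predicted for the preceding word.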


RESULTS

          Known Words  Unknown Words (8%)  Total
Baseline  75.8         35.1                73.5
MaxTag    95.1         78.9                93.5

• Minimal Development Time
• Tagging Protocol on-the-fly (grounded in performance)
• Good tagging accuracy


LEARNING CURVE

[Figure: learning curve showing accuracy on the test set (y-axis, 70 to 100) against the number of words in the training set (x-axis, 1k to 10k), with separate curves for Known, Unknown and Total]


ADAPTABILITY


SWAHILI LEMMATIZATION

• Extract information from Helsinki Corpus of Swahili
• Typical HCS annotation:

ulikanusha kanusha V deny, disprove, refute, negate
Ulikoanzia anza V begin, establish

• Perform pattern-matching of lemma onto word form to create two-level morphological segmentation:

Ulikanusha kanusha
Surface: uli[P] + kanusha[R]
Lexical: uli[P] + kanusha[R]

ulikoanzia anza
Surface: uliko[P] + anz[R] + ia[S]
Lexical: uliko[P] + anza[R] + ia[S]
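One way to picture the pattern-matching step is a longest-match search of the lemma inside the word form. The sketch below is an assumption about how such matching could work, not the procedure actually used on the HCS data: it tries the full lemma first and then ever shorter prefixes of it, and takes the first position at which a match is found. The function names are hypothetical.

```python
def format_segments(segments):
    """Render [(morpheme, tag), ...] as 'uli[P]+kanusha[R]', skipping empty parts."""
    return "+".join("%s[%s]" % (m, t) for m, t in segments if m)


def segment(word_form, lemma):
    """Return (surface, lexical) segmentations, or None if nothing matches."""
    word = word_form.lower()
    # Longest match first: the full lemma, then shorter and shorter prefixes.
    for length in range(len(lemma), 0, -1):
        root = lemma[:length]
        start = word.find(root)
        if start == -1:
            continue
        prefix, suffix = word[:start], word[start + length:]
        surface = [(prefix, "P"), (root, "R"), (suffix, "S")]
        # On the lexical level the root is the complete lemma.
        lexical = [(prefix, "P"), (lemma, "R"), (suffix, "S")]
        return format_segments(surface), format_segments(lexical)
    return None


print(segment("Ulikanusha", "kanusha"))
print(segment("ulikoanzia", "anza"))
```

A heuristic of this kind will sometimes latch onto the wrong substring, especially for short roots, so the segmentations it produces are bound to contain some errors.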


HCS PROBLEMS

• 97k word forms extracted from 9M word HCS
• Noise in data:
- Remove hapaxes, English words
- But still many inconsistencies and mistakes in automatic annotation

• For proper evaluation: manually develop clean gold-standard evaluation set:
- Take 10% of original word form list (9.7k word forms)
- Manually annotate it with prefix-root-suffix protocol


GOLD STANDARD EVALUATION

• Annotation using Spreadsheet


MEMORY-BASED MORPHOLOGICAL ANALYSIS

• Extract instances from morphologically annotated word form list, e.g. for uliko[P] + anz[R] + ia[S]

    L5  L4  L3  L2  L1  F   R1  R2  R3  R4  R5  CLASS
1   -   -   -   -   -   u   l   i   k   o   a   0
2   -   -   -   -   u   l   i   k   o   a   n   0
3   -   -   -   u   l   i   k   o   a   n   z   0
4   -   -   u   l   i   k   o   a   n   z   i   0
5   -   u   l   i   k   o   a   n   z   i   a   P
6   u   l   i   k   o   a   n   z   i   a   -   0
7   l   i   k   o   a   n   z   i   a   -   -   0
8   i   k   o   a   n   z   i   a   -   -   -   R+a
9   k   o   a   n   z   i   a   -   -   -   -   0
10  o   a   n   z   i   a   -   -   -   -   -   S
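Once the classifier has assigned a class to every character, the segmentation can be read back from the class sequence. The sketch below assumes the reading of the labels suggested by the table: `0` means no boundary, a tag such as `P` or `S` closes a morpheme at that character, and `R+a` closes a root whose lexical form carries an extra `a`. The function name `decode` is hypothetical.

```python
def decode(word, classes):
    """Turn per-character classes into (surface, lexical) segmentations."""
    surface, lexical = [], []
    start = 0
    for i, label in enumerate(classes):
        if label == "0":
            continue  # no morpheme boundary after this character
        # 'R+a' -> tag 'R' plus the letters to restore on the lexical level.
        tag, _, extra = label.partition("+")
        morpheme = word[start : i + 1]
        surface.append("%s[%s]" % (morpheme, tag))
        lexical.append("%s[%s]" % (morpheme + extra, tag))
        start = i + 1
    return "+".join(surface), "+".join(lexical)


classes = ["0", "0", "0", "0", "P", "0", "0", "R+a", "0", "S"]
print(decode("ulikoanzia", classes))
```

Encoding the lexical change in the class label lets a single classification pass deliver both the surface segmentation and the lemma.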


NLP-EVALUATION

           Segmentation of surface representation  Further lemmatization
           WER                                     WER
Morfessor  70.7 %                                  73.6 %
SALAMAx    11.7 %                                  12.0 %
MBSMA-c    13.3 %                                  13.6 %
MBSMA-s    11.6 %                                  11.7 %


NLP-EVALUATION

• Minimal Development time
• Robust for unknown words (vs. SALAMA, kamusiproject.org, …)

• How can a data-driven approach outperform the system that was used to build its training material? The generalization properties of the machine-learning approach filter out noise in the HCS-induced training set.


APPLICABILITY


PARALLEL CORPUS ENGLISH - SWAHILI

• SAWA corpus: 2.5 million word corpus of translated texts

• Limited availability of parallel texts English – Swahili:
- Smaller documents: investment reports, political texts, e.g. Universal Declaration of Human Rights
- Bible, Quran, secular literature
- New translations

“There is no data like more data”

• Experiment with statistical machine translation techniques


AVAILABLE DATA IN SAWA CORPUS (2010)

                    English    Swahili    English  Swahili
                    Sentences  Sentences  Words    Words
Bible               52.4k      51.2k      813.3k   653.7k
Quran               14.3k      14.5k      165.5k   124.3k
Declaration of HR   0.2k       0.2k       1.8k     1.8k
Kamusi.org          5.6k       5.6k       35.5k    26.7k
Movie Subtitles     9.0k       9.0k       72.2k    58.4k
Investment Reports  3.2k       3.1k       52.9k    54.9k
Local Translator    1.5k       1.6k       25.0k    25.7k
Draft Constitution  4.0k       3.8k       56.5k    51.1k
Total               90.2k      89k        1.2M     996.6k


WORD ALIGNMENT

Most difficult task: relate words between languages

[Figure: word-by-word alignment of an English utterance (No, she 's, uh, up north) with its Swahili translation (La, yuko, aa, juu kaskazini)]


WORD ALIGNMENT

You caught me skiving , I ‘m afraid .

Samahani , umenidaka nikihepa .

• Can be done automatically using established tools (GIZA++)
• Provide manual reference to evaluate automatic word alignment tools (5000 words, annotated with UMIACS alignment interface)


ALIGNMENT PROBLEMS

nimemkatalia

I have turned him down


MORPHOLOGICAL DECOMPOSITION

I have turned him down

ni+ me+ m+ katalia


SMT EXPERIMENTS

• Proof-of-principle experiments with MOSES
• GIZA++ word alignment
• SRILM language models

                           BLEU  NIST
Google  English → Swahili  0.15  4.56
SAWA    English → Swahili  0.14  4.23
Google  Swahili → English  0.18  4.54
SAWA    Swahili → English  0.23  4.74


LUO

• Truly resource-scarce language
• Nilotic language, 3M speakers
• International Bible Society (2005) Luo New Testament. Available at http://www.biblica.com/bibles/luo
• Use English and Swahili New Testament data of SAWA corpus to construct small trilingual parallel corpus
• Preprocessing:
- PDF-to-text conversion
- Tokenization
- Sentence alignment


SMT EXPERIMENTS

                       OOV    NIST  BLEU
Luo → English          4.4%   5.39  0.23
Luo → English [F]      4.4%   6.52  0.29
English → Luo          11.4%  4.12  0.18
English → Luo [F]      11.4%  5.31  0.22
Luo → Swahili          6.1%   2.91  0.11
Luo → Swahili [F]      6.1%   3.17  0.15
Swahili → Luo          11.4%  2.96  0.10
Swahili → Luo [F]      11.4%  3.36  0.15


EXAMPLES

Source en ng’a moloyo piny ? mana jalo moyie ni yesu en wuod nyasaye

Translation who is more than the earth ? only he who believes that he is the son of god

Reference who is it that overcomes the world ? only he who believes that jesus is the son of god

Source atimo erokamano kuom thuoloni

Translation do thanks about this time

Reference I am thankful for your leadership
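To see what scores such as BLEU measure on examples like these, the sketch below computes the clipped n-gram precision that BLEU is built on. It is only this one component: full BLEU combines the precisions for n-gram lengths 1 to 4 with a geometric mean and a brevity penalty, and the scores in the tables were of course not produced with this snippet. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n=1):
    """Return (matched, total) n-gram counts of the candidate translation.

    A candidate n-gram only counts as matched as often as it occurs in
    the reference, so repeating a correct word is not rewarded.
    """
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return matched, total


translation = "who is more than the earth ? only he who believes that he is the son of god"
reference = "who is it that overcomes the world ? only he who believes that jesus is the son of god"
matched, total = clipped_precision(translation, reference, n=1)
print(matched, total, matched / total)
```

For the first example most single words are matched, even though "more than the earth" and the missing "jesus" change the meaning considerably; longer n-grams penalize such errors more strongly.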


FUTURE WORK

• More data!
- Optimize techniques for smaller data sets

• Unsupervised machine learning
- Implicit linguistics
- Linguistic classification from scratch
- Typically uses huge data sets
- Spell checkers + morphological analysis project


CONCLUSION

• Our research on African Language Technology shows the typical advantages of the data-driven paradigm:
- Language independence
- Development speed
- Adaptability
- Applicability
- Empiricism

• “Every time I fire a linguist, the recognizer’s accuracy goes up”
- Fred Jelinek (2005) Some of my best friends are linguists
- Anno 2012: linguists are not programmers per se, but domain experts


HTTP://AFLAT.ORG

SPECIAL ISSUE ON AFRICAN LANGUAGE TECHNOLOGY OF LANGUAGE RESOURCES AND EVALUATION 45(3). SEPTEMBER 2011

AFLAT 2012 (ISTANBUL, TURKEY)