Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
CS544: Natural Language Processing
Zornitsa Kozareva!USC/ISI!
Marina del Rey, [email protected]!
www.isi.edu/~kozareva!
January 15, 2013
The Dream • It would be great if machines could
– Process our emails – Translate languages accurately – Help us manage, summarize, and aggregate informa=on
– Understand phone conversa=on – Talk to us / listen to us
• But they cannot: – Language is complex, ambiguous, flexible, and subtle – Good solu=ons need linguis=cs and machine learning knowledge
What is NLP?
• Goal: intelligent processing of human language – Not just string and keyword matching
• End systems we want to build: – Less ambi=ous: spelling correc=on, name en=ty extractors – Ambi=ous: machine transla=on, informa=on extrac=on,
ques=on answering, summariza=on …
Informa=on Extrac=on • Goal: build database entries from unstructured text
• Simple Task: Named En=ty Extrac=on
hQp://emm.newsexplorer.eu/NewsExplorer/home/en/latest.html
Informa=on Extrac=on • Goal: build database entries from unstructured text
• Advanced: Mul=-‐sentence template extrac=on A bomb went off this morning near a power tower in San Salvador leaving a large part of the city without energy, but no casual=es have been reported. According to unofficial sources, the bomb-‐allegedly detonated by urban guerrilla commandos blew up a power tower in the north western part of San Salvador at 0650.
Incident type: Date: Location: Perpetrator: Physical target: Effect on physical target: Effect on human target: Instrument:
Informa=on Extrac=on • Goal: build database entries from unstructured text
• Advanced: Mul=-‐sentence template extrac=on
Incident type: Date: Location: Perpetrator: Physical target: Effect on physical target: Effect on human target: Instrument:
A bomb went off this morning near a power tower in San Salvador leaving a large part of the city without energy, but no casualCes have been reported. According to unofficial sources, the bomb-‐allegedly detonated by urban guerrilla commandos blew up a power tower in the north western part of San Salvador at 0650.
Informa=on Extrac=on • Goal: build database entries from unstructured text
• Advanced: Mul=-‐sentence template extrac=on A bomb went off this morning near a power tower in San Salvador leaving a large part of the city without energy, but no casualCes have been reported. According to unofficial sources, the bomb-‐allegedly detonated by urban guerrilla commandos blew up a power tower in the north western part of San Salvador at 0650.
Incident type: bombing Date: March 11, 2010 Location: San Salvador (city) Perpetrator: urban guerrilla commandos Physical target: power tower Effect on physical target: destroyed Effect on human target: no injury or death Instrument: bomb
Informa=on Retrieval • Given a huge collec=on of text and a query • Goal: find documents that are relevant to the query
Ques=on Answering • Find answers to general comprehension ques=ons in a document collec=on
Text Summariza=on hQp://emm-‐labs.jrc.it/EMMLabs/NewsGist.html
Machine Transla=on
Speech Processing • Automa=c Speech Recogni=on
• Performance: 5% for dicta=on, 50%+TV
“will you move the clinic there?”
Sen=ment Analysis • Tracking sen=ment towards poli=cians, movies, products.
• Assis=ng in wri=ng e-‐mails, documents, and other text to convey desired emo=on (and avoiding misinterpreta=on).
• Detec=ng how people use emo=on-‐bearing-‐words to persuade and coerce others
• Decep=on detec=on
Text Categoriza=on • Build a system that groups text based on certain criteria – Spam Filtering – Share the same Topic (news clustering)
Linguis=cs Levels of Analysis
• Phonology: sounds / leQers / pronuncia=on • Morphology: construc=on of words
• Syntax: structural rela=onships between words • Seman=cs: meaning of strings (words, phrases)
• Discourse: rela=onships across different sentences • Pragma=cs: how we use language to communicate
• World Knowledge: facts about the world, common sense
MORPHOLOGY
Morphological Analysis • Morphology studies the internal structure of
words • A morpheme is the smallest linguis=c unit that
has seman=c meaning (Wikipedia) • Morphological Analysis is the task of
segmen=ng a word into its morphemes • carried => carry + ed (past tense) • disconnect => dis (not) + connect
• Challenging for morphologically rich languages like Finish and Turkish
SYNTACTIC TASKS
Part-‐of-‐Speech Tagging (POS)
• Annotate each word in a sentence with a part-‐of-‐speech tag
I ate the spagheh with meatballs.
Pro V Det N Prep N
• Useful for syntac=c parsing and word sense disambigua=on
• English POS tagging 95% accurate
Phrase Chunking
• Find all non-‐recursive noun phrases (NPs) and verb phrases (VPs) in a sentence.
[NP I] [VP ate] [NP the spagheh] [PP with] [NP meatballs] .
Syntac=c Parsing
• Produce syntac=c parse tree of a sentence
I ate the spagheh with meatballs.
• Help figuring out ques=ons like: Who did what and when?
S
Pro V
N
NP Prep N Det
NP PP
VP NP
More issues in Syntax
• Preposi=onal AQachment “I saw the man with the telescope”
Syntax does not tell us much about meaning
SEMANTIC TASKS
Word Sense Disambigua=on • Understand language! How?
I walked to the bank … of the river. to get money.
• Useful for machine transla=on, informa=on retrieval
How to learn the meaning of words?
• From dic=onaries, lexical repository like WordNet
bank -‐-‐ sloping land, especially the slope beside a body of water
ex. "they pulled the canoe up on the bank"
bank – a financial ins@tu@on that accepts deposits and channels the money into lending ac@vi@es
ex. "he cashed a check at the bank"
• Automa=cally from the Web
Seman=c Role Labeling
• For each clause, determine the seman=c role played by each noun phrase that is an argument to the verb
agent pa=ent source des=na=on
John drove Mary from LA to San Diego.
Textual Entailment
• Determine whether one natural language sentence entails another
The glass is half empty.
The glass is half full.
Google bought Youtube. Google acquired Youtube.
DISCOURSE, PRAGMATICS AND WORLD KNOWLEDGE
Anaphora Resolu=on
• Determine which phrases in a document refer to the same en=ty
“George woke up. He went to the kitchen.”
“ Peter put the carrot on the plate and ate it.”
Pragma=cs
• Studies how language is used to accomplish goals
What can we conclude from the following sentences?
“Could you please pass me the salt?” “ I am afraid I cannot do this”
“George woke up. He went to the bathroom and started shaving. He took the car key and ler.”
World Knowledge
WHERE WE STAND TODAY
What can NLP do (robustly) today? • Surface-‐level preprocessing (POS tagging, word segmenta=on, named en=ty extrac=on): 94%+
• Shallow syntac=c parsing: 92%+ for English • IE: ~40% for well-‐behaved topics (MUC, ACE)
• Speech: ~80% large vocab; 20%+ open vocab, noisy input
• IR: 40% (TREC) • MT: ~70% depending on what you measure
• SummarizaCon: ? (~60% for extracts; DUC)
• QA: ? (~60% for factoids; TREC)
90s–
00s–
80s–
80–90s 80–90s
80s–
90–00s
00s–
What cannot NLP do today? • Do general-‐purpose text generaCon • Deliver semanCcs—either in theory or in prac=ce • Deliver long/complex answers by extrac=ng, merging, and summarizing web info
• Handle extended dialogues • Read and learn (extend own knowledge) • Use pragmaCcs (style, emo=on, user profile…)
• Provide significant contribu=ons to a theory of Language (in Linguis=cs or Neurolinguis=cs) or of InformaCon (in Signal Processing)
CLASS DETAILS
What is in this Class? • Some linguis=c basics
– structure of English – sentence segmenta=on, tokeniza=on – Morphology
• Syntac=c parsing
• Seman=cs – Word sense disambigua=on – Seman=c rela=ons – Textual Entailment – Paraphrases – Noun Compounds
What is in this Class? • Applica=ons:
– Informa=on Extrac=on • Named en=ty extrac=on • Rela=on extrac=on
– Informa=on Retrieval
– Text Categoriza=on • Clustering • Latent Seman=c Analysis
– Ques=on Answering • Ques=on classifica=on
– Sen=ment Analysis
– Text Summariza=on
– Machine Transla=on
What is in this Class? • Overview of Tools
– Weka – Mallet
• Machine Learning Algorithms – Supervised learning – Semi-‐supervised learning – Unsupervised learning
• Graph Theory
What will we learn? • We will learn issues and techniques in NLP • We will learn various tools • We will learn about applica=ons that can benefit from
NLP • We will understand issues involved in processing natural
language • We will develop skills necessary to build NLP tools • We will learn to solve real-‐world problems
Class Requirements • Class requirements:
– Basic linguis=cs background – Basic linear algebra – Decent coding skills
Course Work • Recommended Readings:
– James Allen. Natural Language Understanding (2nd ed), Addison Wesley, 1994. – Christopher Manning and Hinrich Schütze.
Founda@ons of Sta@s@cal Natural Language Processing, MIT Press, 1999. – Daniel Jurafsky and James Mar=n. Speech and Language Processing, 2nd edi.,
Pren=ce Hall, 2008.
• Assignments: – 2 coding assignments
• brief 2-‐3 paged descrip=on • code
– 1 final project • power point presenta=on / poster • 8 paged report which will follow ACL guidelines • code (demo visualizing input-‐output of your system with graphs etc)
NLP AT USC, ISI AND ICT
44
Ph.D. Researchers and Topics At ISI:
• David Chiang — parsing, sta=s=cal processing • Ulf Hermjakob — parsing, QA, language learning • Jerry Hobbs — seman=cs, ontologies, discourse • Eduard Hovy (on leave) — summariza=on, ontologies, NLG • Kevin Knight — MT, NLG, decipherment
• Zornitsa Kozareva — IE, text mining, lexical seman=cs, sen=ment analysis • Daniel Marcu — MT, QA, summariza=on, discourse
At ICT: • David DeVault — NL genera=on • Andrew Gordon — cogni=ve science and language • Anton Leuski — IR • Kenji Sagae — parsing • Bill Swartout — NLG
• David Traum — dialogue
At USC/EE: • Shri Narayanan — speech recogni=on
NLP Projects at ISI
Large Resources
Ontologies OntoNotes (semantic corpus) Omega (for MT, summarization) DINO (for multi-database access) CORCO (semi-auto construction)
Lexicons Text Analysis
Discourse Parsing DMT (English, Japanese) Sentence Parsing, Grammar Learning CONTEX (English, Japanese, Korean, Chinese) Parser and grammar learning
Text Generation
Sent. Realization NITROGEN, HALOGEN PENMAN
Text Planning Sent. Planning ICT agent-based NLP HealthDoc
Machine Translation
AGILE (Arabic, Chinese) REWRITE (Chinese, Arabic, Tetun) TRANSONIC (speech translation) ADGEN GAZELLE (Japanese, Spanish, Arabic) QuTE (Indonesian)
EM: YASMET MT: GIZA FSM: CARMEL Name transliteration Clustering: ISICL
General packages
Social Network Analysis Email analysis
Document Management
Clustering CBC, ISICL
Web Access / IR MuST / C*ST*RD
TEXTMAP (English) WEBCLOPEDIA (English, Korean, Chinese)
Single-doc: SUMMARIST (English, Spanish, Indonesian, German) Multi-doc: NeATS, AGILE compaction, GOSP (headlines) Evaluation: SEE, ROUGE, BE Breaker
Summarization and Question
Answering
Information Extraction
Med. informatics Psyop/SOCOM Learning by Reading eRulemaking