23
CS544: Natural Language Processing Zornitsa Kozareva USC/ISI Marina del Rey, CA [email protected] www.isi.edu/~kozareva January 15, 2013 The Dream It would be great if machines could Process our emails Translate languages accurately Help us manage, summarize, and aggregate informa=on Understand phone conversa=on Talk to us / listen to us But they cannot: Language is complex, ambiguous, flexible, and subtle Good solu=ons need linguis=cs and machine learning knowledge

lecture1-final€¦ · Parser and grammar learning Text Generation NITROGEN, HALOGEN PENMAN Text Planning Sent. Planning ICT agent-based NLP HealthDoc Machine Translation AGILE (Arabic,

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: lecture1-final€¦ · Parser and grammar learning Text Generation NITROGEN, HALOGEN PENMAN Text Planning Sent. Planning ICT agent-based NLP HealthDoc Machine Translation AGILE (Arabic,

CS544:  Natural  Language  Processing  

Zornitsa Kozareva!USC/ISI!

Marina del Rey, [email protected]!

www.isi.edu/~kozareva!

January  15,  2013  

The  Dream  •   It  would  be  great  if  machines  could    

–   Process  our  emails      –   Translate  languages  accurately    –   Help  us  manage,  summarize,  and  aggregate    informa=on    

–   Understand  phone  conversa=on  –   Talk  to  us  /  listen  to  us  

•   But  they  cannot:    –  Language  is  complex,  ambiguous,  flexible,  and  subtle    – Good  solu=ons  need  linguis=cs  and  machine  learning  knowledge  

Page 2: lecture1-final€¦ · Parser and grammar learning Text Generation NITROGEN, HALOGEN PENMAN Text Planning Sent. Planning ICT agent-based NLP HealthDoc Machine Translation AGILE (Arabic,

What  is  NLP?  

•  Goal:  intelligent  processing  of  human  language  –  Not  just  string  and  keyword  matching  

•  End  systems  we  want  to  build:  –  Less  ambi=ous:  spelling  correc=on,  name  en=ty  extractors  –  Ambi=ous:  machine  transla=on,  informa=on  extrac=on,  

ques=on  answering,  summariza=on  …  

Informa=on  Extrac=on  •  Goal:  build  database  entries  from  unstructured  text  

•  Simple  Task:  Named  En=ty  Extrac=on  

hQp://emm.newsexplorer.eu/NewsExplorer/home/en/latest.html  

Page 3: lecture1-final€¦ · Parser and grammar learning Text Generation NITROGEN, HALOGEN PENMAN Text Planning Sent. Planning ICT agent-based NLP HealthDoc Machine Translation AGILE (Arabic,

Informa=on  Extrac=on  •  Goal:  build  database  entries  from  unstructured  text  

•  Advanced:  Mul=-­‐sentence  template  extrac=on  A  bomb  went  off  this  morning  near  a  power  tower  in  San  Salvador  leaving  a  large  part  of  the  city  without  energy,  but  no  casual=es  have  been  reported.  According   to   unofficial   sources,   the   bomb-­‐allegedly   detonated   by   urban  guerrilla  commandos  blew  up  a  power  tower   in   the  north  western  part  of  San  Salvador  at  0650.  

Incident type: Date: Location: Perpetrator: Physical target: Effect on physical target: Effect on human target: Instrument:

Informa=on  Extrac=on  •  Goal:  build  database  entries  from  unstructured  text  

•  Advanced:  Mul=-­‐sentence  template  extrac=on  

Incident type: Date: Location: Perpetrator: Physical target: Effect on physical target: Effect on human target: Instrument:

A  bomb  went  off  this  morning  near  a  power  tower  in  San  Salvador  leaving  a  large  part  of  the  city  without  energy,  but  no  casualCes  have  been  reported.  According   to   unofficial   sources,   the   bomb-­‐allegedly   detonated   by   urban  guerrilla  commandos  blew  up  a  power  tower  in  the  north  western  part  of  San  Salvador  at  0650.  

Page 4: lecture1-final€¦ · Parser and grammar learning Text Generation NITROGEN, HALOGEN PENMAN Text Planning Sent. Planning ICT agent-based NLP HealthDoc Machine Translation AGILE (Arabic,

Informa=on  Extrac=on  •  Goal:  build  database  entries  from  unstructured  text  

•  Advanced:  Mul=-­‐sentence  template  extrac=on  A  bomb  went  off  this  morning  near  a  power  tower  in  San  Salvador  leaving  a  large  part  of  the  city  without  energy,  but  no  casualCes  have  been  reported.  According   to   unofficial   sources,   the   bomb-­‐allegedly   detonated   by   urban  guerrilla  commandos  blew  up  a  power  tower  in  the  north  western  part  of  San  Salvador  at  0650.  

Incident type: bombing Date: March 11, 2010 Location: San Salvador (city) Perpetrator: urban guerrilla commandos Physical target: power tower Effect on physical target: destroyed Effect on human target: no injury or death Instrument: bomb

Informa=on  Retrieval  •  Given  a  huge  collec=on  of  text  and  a  query  •  Goal:  find  documents  that  are  relevant  to  the  query  

Page 5: lecture1-final€¦ · Parser and grammar learning Text Generation NITROGEN, HALOGEN PENMAN Text Planning Sent. Planning ICT agent-based NLP HealthDoc Machine Translation AGILE (Arabic,

Ques=on  Answering  •  Find  answers  to  general  comprehension  ques=ons  in  a  document  collec=on  

Text  Summariza=on  hQp://emm-­‐labs.jrc.it/EMMLabs/NewsGist.html  

Page 6: lecture1-final€¦ · Parser and grammar learning Text Generation NITROGEN, HALOGEN PENMAN Text Planning Sent. Planning ICT agent-based NLP HealthDoc Machine Translation AGILE (Arabic,

Machine  Transla=on  

Speech  Processing  •  Automa=c  Speech  Recogni=on  

•  Performance:  5%  for  dicta=on,  50%+TV  

“will you move the clinic there?”

Page 7: lecture1-final€¦ · Parser and grammar learning Text Generation NITROGEN, HALOGEN PENMAN Text Planning Sent. Planning ICT agent-based NLP HealthDoc Machine Translation AGILE (Arabic,

Sen=ment  Analysis  •  Tracking  sen=ment  towards  poli=cians,  movies,  products.  

•  Assis=ng  in  wri=ng  e-­‐mails,  documents,  and  other  text  to  convey  desired  emo=on  (and  avoiding  misinterpreta=on).  

•  Detec=ng  how  people  use  emo=on-­‐bearing-­‐words  to  persuade  and  coerce  others  

•  Decep=on  detec=on  

Text  Categoriza=on  •  Build  a  system  that  groups  text  based  on  certain  criteria  –  Spam  Filtering  –  Share  the  same  Topic  (news  clustering)  

Page 8: lecture1-final€¦ · Parser and grammar learning Text Generation NITROGEN, HALOGEN PENMAN Text Planning Sent. Planning ICT agent-based NLP HealthDoc Machine Translation AGILE (Arabic,

Linguis=cs  Levels  of  Analysis  

•  Phonology:  sounds  /  leQers  /  pronuncia=on  •  Morphology:  construc=on  of  words  

•  Syntax:  structural  rela=onships  between  words  •  Seman=cs:  meaning  of  strings  (words,  phrases)  

•  Discourse:  rela=onships  across  different  sentences  •  Pragma=cs:  how  we  use  language  to  communicate  

•  World  Knowledge:  facts  about  the  world,  common  sense  

MORPHOLOGY  

Page 9: lecture1-final€¦ · Parser and grammar learning Text Generation NITROGEN, HALOGEN PENMAN Text Planning Sent. Planning ICT agent-based NLP HealthDoc Machine Translation AGILE (Arabic,

Morphological  Analysis  •  Morphology  studies  the  internal  structure  of  

words  •  A  morpheme  is  the  smallest  linguis=c  unit  that  

has  seman=c  meaning  (Wikipedia)  •  Morphological  Analysis  is  the  task  of  

segmen=ng  a  word  into  its  morphemes  •  carried  =>  carry  +  ed  (past  tense)  •  disconnect  =>  dis  (not)  +  connect    

•  Challenging  for  morphologically  rich  languages  like  Finish  and  Turkish  

SYNTACTIC  TASKS  

Page 10: lecture1-final€¦ · Parser and grammar learning Text Generation NITROGEN, HALOGEN PENMAN Text Planning Sent. Planning ICT agent-based NLP HealthDoc Machine Translation AGILE (Arabic,

Part-­‐of-­‐Speech  Tagging  (POS)  

•  Annotate  each  word  in  a  sentence  with  a  part-­‐of-­‐speech  tag  

   I              ate        the        spagheh        with            meatballs.  

 Pro            V          Det                  N                              Prep                        N  

•  Useful  for  syntac=c  parsing  and  word  sense  disambigua=on  

•  English  POS  tagging  95%  accurate  

Phrase  Chunking  

•  Find  all  non-­‐recursive  noun  phrases  (NPs)  and  verb  phrases  (VPs)  in  a  sentence.  

     [NP  I]    [VP  ate]    [NP  the    spagheh]  [PP  with]  [NP  meatballs]  .  

Page 11: lecture1-final€¦ · Parser and grammar learning Text Generation NITROGEN, HALOGEN PENMAN Text Planning Sent. Planning ICT agent-based NLP HealthDoc Machine Translation AGILE (Arabic,

Syntac=c  Parsing  

•  Produce  syntac=c  parse  tree  of  a  sentence  

               I              ate        the        spagheh        with        meatballs.  

•  Help  figuring  out  ques=ons  like:  Who  did  what  and  when?  

S  

Pro   V  

N  

NP  Prep  N  Det  

NP  PP  

VP  NP  

More  issues  in  Syntax  

•  Preposi=onal  AQachment            “I  saw  the  man  with  the  telescope”  

Syntax  does  not  tell  us  much  about  meaning  

Page 12: lecture1-final€¦ · Parser and grammar learning Text Generation NITROGEN, HALOGEN PENMAN Text Planning Sent. Planning ICT agent-based NLP HealthDoc Machine Translation AGILE (Arabic,

SEMANTIC  TASKS  

Word  Sense  Disambigua=on  •  Understand  language!  How?  

I  walked  to  the  bank  …                  of  the  river.                  to  get  money.  

•  Useful  for  machine  transla=on,  informa=on  retrieval  

Page 13: lecture1-final€¦ · Parser and grammar learning Text Generation NITROGEN, HALOGEN PENMAN Text Planning Sent. Planning ICT agent-based NLP HealthDoc Machine Translation AGILE (Arabic,

How  to  learn  the  meaning  of  words?  

•  From  dic=onaries,  lexical  repository  like  WordNet  

bank  -­‐-­‐  sloping  land,  especially  the  slope  beside  a  body  of  water  

     ex.  "they  pulled  the  canoe  up  on  the  bank"  

bank  –  a  financial  ins@tu@on  that  accepts  deposits  and  channels                              the  money  into  lending  ac@vi@es  

           ex.  "he  cashed  a  check  at  the  bank"  

•  Automa=cally  from  the  Web  

Seman=c  Role  Labeling  

•  For  each  clause,  determine  the  seman=c  role  played  by  each  noun  phrase  that  is  an  argument  to  the  verb  

   agent                            pa=ent                        source                des=na=on  

   John      drove      Mary      from            LA            to      San  Diego.  

Page 14: lecture1-final€¦ · Parser and grammar learning Text Generation NITROGEN, HALOGEN PENMAN Text Planning Sent. Planning ICT agent-based NLP HealthDoc Machine Translation AGILE (Arabic,

Textual  Entailment  

•  Determine  whether  one  natural  language  sentence  entails  another  

     The  glass  is  half  empty.    

           The  glass  is  half  full.  

   Google  bought  Youtube.      Google  acquired  Youtube.  

DISCOURSE,  PRAGMATICS  AND  WORLD  KNOWLEDGE  

Page 15: lecture1-final€¦ · Parser and grammar learning Text Generation NITROGEN, HALOGEN PENMAN Text Planning Sent. Planning ICT agent-based NLP HealthDoc Machine Translation AGILE (Arabic,

Anaphora  Resolu=on  

•  Determine  which  phrases  in  a  document  refer  to  the  same  en=ty  

   “George  woke  up.  He  went  to  the  kitchen.”      

   “  Peter  put  the  carrot  on  the  plate  and  ate  it.”  

Pragma=cs  

•  Studies  how  language  is  used  to  accomplish  goals  

What  can  we  conclude  from  the  following  sentences?  

   “Could  you  please  pass  me  the  salt?”      “  I  am  afraid  I  cannot  do  this”  

Page 16: lecture1-final€¦ · Parser and grammar learning Text Generation NITROGEN, HALOGEN PENMAN Text Planning Sent. Planning ICT agent-based NLP HealthDoc Machine Translation AGILE (Arabic,

   “George  woke  up.  He  went  to  the  bathroom  and  started  shaving.  He  took  the  car  key  and  ler.”  

World  Knowledge  

WHERE  WE  STAND  TODAY  

Page 17: lecture1-final€¦ · Parser and grammar learning Text Generation NITROGEN, HALOGEN PENMAN Text Planning Sent. Planning ICT agent-based NLP HealthDoc Machine Translation AGILE (Arabic,

What  can  NLP  do  (robustly)  today?  •  Surface-­‐level  preprocessing  (POS  tagging,  word  segmenta=on,  named  en=ty  extrac=on):  94%+    

•  Shallow  syntac=c  parsing:  92%+  for  English      •  IE:  ~40%  for  well-­‐behaved  topics  (MUC,  ACE)  

•  Speech:  ~80%  large  vocab;  20%+  open  vocab,  noisy  input    

•  IR:  40%  (TREC)    •  MT:  ~70%  depending  on  what  you  measure    

•  SummarizaCon:  ?  (~60%  for  extracts;  DUC)    

•  QA:  ?  (~60%  for  factoids;  TREC)  

90s–

00s–

80s–

80–90s 80–90s

80s–

90–00s

00s–

What  cannot  NLP  do  today?  •  Do  general-­‐purpose  text  generaCon    •  Deliver  semanCcs—either  in  theory  or  in  prac=ce    •  Deliver  long/complex  answers  by  extrac=ng,  merging,  and  summarizing  web  info    

•  Handle  extended  dialogues    •  Read  and  learn  (extend  own  knowledge)    •  Use  pragmaCcs  (style,  emo=on,  user  profile…)    

•  Provide  significant  contribu=ons  to  a  theory  of  Language  (in  Linguis=cs  or  Neurolinguis=cs)  or  of  InformaCon  (in  Signal  Processing)  

Page 18: lecture1-final€¦ · Parser and grammar learning Text Generation NITROGEN, HALOGEN PENMAN Text Planning Sent. Planning ICT agent-based NLP HealthDoc Machine Translation AGILE (Arabic,

CLASS  DETAILS  

What  is  in  this  Class?  •  Some  linguis=c  basics  

–  structure  of  English  –  sentence  segmenta=on,  tokeniza=on  –  Morphology  

•  Syntac=c  parsing  

•  Seman=cs  –  Word  sense  disambigua=on  –  Seman=c  rela=ons  –  Textual  Entailment  –  Paraphrases  –  Noun  Compounds  

Page 19: lecture1-final€¦ · Parser and grammar learning Text Generation NITROGEN, HALOGEN PENMAN Text Planning Sent. Planning ICT agent-based NLP HealthDoc Machine Translation AGILE (Arabic,

What  is  in  this  Class?  •  Applica=ons:  

–  Informa=on  Extrac=on  •  Named  en=ty  extrac=on  •  Rela=on  extrac=on  

–  Informa=on  Retrieval  

–  Text  Categoriza=on  •  Clustering  •  Latent  Seman=c  Analysis  

–  Ques=on  Answering  •  Ques=on  classifica=on  

–  Sen=ment  Analysis  

–  Text  Summariza=on  

–  Machine  Transla=on  

What  is  in  this  Class?  •  Overview  of  Tools  

–  Weka  –  Mallet  

•  Machine  Learning  Algorithms  –  Supervised  learning  –  Semi-­‐supervised  learning  –  Unsupervised  learning  

•  Graph  Theory  

Page 20: lecture1-final€¦ · Parser and grammar learning Text Generation NITROGEN, HALOGEN PENMAN Text Planning Sent. Planning ICT agent-based NLP HealthDoc Machine Translation AGILE (Arabic,

What  will  we  learn?  •  We  will  learn  issues  and  techniques  in  NLP  •  We  will  learn  various  tools  •  We  will  learn  about  applica=ons  that  can  benefit  from  

NLP  •  We  will  understand  issues  involved  in  processing  natural  

language  •  We  will  develop  skills  necessary  to  build  NLP  tools  •  We  will  learn  to  solve  real-­‐world  problems  

Class  Requirements  •  Class  requirements:  

– Basic  linguis=cs  background  – Basic  linear  algebra  – Decent  coding  skills  

Page 21: lecture1-final€¦ · Parser and grammar learning Text Generation NITROGEN, HALOGEN PENMAN Text Planning Sent. Planning ICT agent-based NLP HealthDoc Machine Translation AGILE (Arabic,

Course  Work  •  Recommended  Readings:  

–  James  Allen.  Natural  Language  Understanding  (2nd  ed),  Addison  Wesley,  1994.    –  Christopher  Manning  and  Hinrich  Schütze.  

Founda@ons  of  Sta@s@cal  Natural  Language  Processing,  MIT  Press,  1999.    –  Daniel  Jurafsky  and  James  Mar=n.  Speech  and  Language  Processing,  2nd  edi.,  

Pren=ce  Hall,  2008.    

•  Assignments:  –  2  coding  assignments  

•  brief  2-­‐3  paged  descrip=on  •  code  

–  1  final  project  •  power  point  presenta=on  /  poster  •  8  paged  report  which  will  follow  ACL  guidelines  •  code  (demo  visualizing  input-­‐output  of  your  system  with  graphs  etc)  

NLP  AT  USC,  ISI  AND  ICT  

Page 22: lecture1-final€¦ · Parser and grammar learning Text Generation NITROGEN, HALOGEN PENMAN Text Planning Sent. Planning ICT agent-based NLP HealthDoc Machine Translation AGILE (Arabic,

44  

Ph.D.  Researchers  and  Topics  At  ISI:    

•  David  Chiang  —  parsing,  sta=s=cal  processing    •  Ulf  Hermjakob  —  parsing,  QA,  language  learning    •  Jerry  Hobbs  —  seman=cs,  ontologies,  discourse    •  Eduard  Hovy  (on  leave)  —  summariza=on,  ontologies,  NLG  •  Kevin  Knight  —  MT,  NLG,  decipherment  

•  Zornitsa  Kozareva  —  IE,  text  mining,  lexical  seman=cs,  sen=ment  analysis    •  Daniel  Marcu  —  MT,  QA,  summariza=on,  discourse      

At  ICT:    •  David  DeVault  —  NL  genera=on    •  Andrew  Gordon  —  cogni=ve  science  and  language    •  Anton  Leuski  —  IR    •  Kenji  Sagae  —  parsing    •  Bill  Swartout  —  NLG  

•  David  Traum  —  dialogue    

At  USC/EE:    •  Shri  Narayanan  —  speech  recogni=on      

Page 23: lecture1-final€¦ · Parser and grammar learning Text Generation NITROGEN, HALOGEN PENMAN Text Planning Sent. Planning ICT agent-based NLP HealthDoc Machine Translation AGILE (Arabic,

NLP  Projects  at  ISI  

Large Resources

Ontologies OntoNotes (semantic corpus) Omega (for MT, summarization) DINO (for multi-database access) CORCO (semi-auto construction)

Lexicons Text Analysis

Discourse Parsing DMT (English, Japanese) Sentence Parsing, Grammar Learning CONTEX (English, Japanese, Korean, Chinese) Parser and grammar learning

Text Generation

Sent. Realization NITROGEN, HALOGEN PENMAN

Text Planning Sent. Planning ICT agent-based NLP HealthDoc

Machine Translation

AGILE (Arabic, Chinese) REWRITE (Chinese, Arabic, Tetun) TRANSONIC (speech translation) ADGEN GAZELLE (Japanese, Spanish, Arabic) QuTE (Indonesian)

EM: YASMET MT: GIZA FSM: CARMEL Name transliteration Clustering: ISICL

General packages

Social Network Analysis Email analysis

Document Management

Clustering CBC, ISICL

Web Access / IR MuST / C*ST*RD

TEXTMAP (English) WEBCLOPEDIA (English, Korean, Chinese)

Single-doc: SUMMARIST (English, Spanish, Indonesian, German) Multi-doc: NeATS, AGILE compaction, GOSP (headlines) Evaluation: SEE, ROUGE, BE Breaker

Summarization and Question

Answering

Information Extraction

Med. informatics Psyop/SOCOM Learning by Reading eRulemaking