Information Extraction (IE), Fadi Biadsy, CS4705, Oct 30, 2008

Information Extraction (IE) - Columbia University julia/courses/CS4705/kathy/Slides09/...(Biadsy, et al., 2008) • Train a binary classifier to identify biographical sentences • Manually


Information Extraction (IE): Task

• Idea: ‘extract’ or tag particular types of information from arbitrary text or transcribed speech

Named Entity Tagger

• Identify the types and boundaries of named entities
• For example:
    – Alexander Mackenzie, (January 28, 1822 - April 17, 1892), a building contractor and writer, was the second Prime Minister of Canada from ….
    – <PERSON>Alexander Mackenzie</PERSON>, (<TIMEX>January 28, 1822</TIMEX> - <TIMEX>April 17, 1892</TIMEX>), a building contractor and writer, was the second Prime Minister of <GPE>Canada</GPE> from ….
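As a rough illustration of what a tagger's output gives downstream components, the sketch below reads entity types and boundaries back out of inline-tagged text. It assumes well-formed tags of the form <TYPE>text</TYPE>; the function name extract_entities is made up for this example and is not part of any tagger.

```python
import re

# Matches <TYPE>text</TYPE>; the backreference \1 requires the closing tag
# to name the same type as the opening tag.
TAG_RE = re.compile(r"<([A-Z]+)>(.*?)</\1>")


def extract_entities(tagged: str) -> list[tuple[str, str]]:
    """Return (entity_type, surface_text) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in TAG_RE.finditer(tagged)]


tagged = (
    "<PERSON>Alexander Mackenzie</PERSON>, (<TIMEX>January 28, 1822</TIMEX> - "
    "<TIMEX>April 17, 1892</TIMEX>), a building contractor and writer, was the "
    "second Prime Minister of <GPE>Canada</GPE>"
)
entities = extract_entities(tagged)
for entity_type, text in entities:
    print(entity_type, "->", text)
```

A tag that is opened but never closed is simply skipped by this regex, so malformed output shows up as missing entities rather than an error.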

IE for Template Filling / Relation Detection

• Given a set of documents and a domain of interest, fill a table of required fields.
• For example:
    – Number of car accidents per vehicle type and number of casualties in the accidents.

Vehicle Type    # accidents    # casualties    Weather
SUV             1200           190             Rainy
Trucks          200            20              Sunny

IE for Question Answering

• Q: When was Gandhi born?
• A: October 2, 1869
• Q: Where was Bill Clinton educated?
• A: Georgetown University in Washington, D.C.
• Q: What was the education of Yassir Arafat?
• A: Civil Engineering
• Q: What is the religion of Noam Chomsky?
• A: Jewish

Approaches

1. Statistical Sequence Labeling
2. Supervised
3. Semi-Supervised and Bootstrapping

Approach for NER

• <PERSON>Alexander Mackenzie</PERSON>, (<TIMEX>January 28, 1822</TIMEX> - <TIMEX>April 17, 1892</TIMEX>), a building contractor and writer, was the second Prime Minister of <GPE>Canada</GPE> from ….
• Statistical sequence-labeling techniques can be used, similar to POS tagging.
    – Word-by-word sequence labeling
    – Example features:
        • POS tags
        • Syntactic constituents
        • Shape features
        • Presence in a named entity list
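The per-token features listed above could be computed along these lines. This is a minimal sketch: the POS tags and the named entity list are supplied by hand, syntactic-constituent features are left out, and the names word_shape and token_features are illustrative only.

```python
import re


def word_shape(token: str) -> str:
    """Map uppercase to X, lowercase to x, digits to d, then collapse repeats."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    return re.sub(r"(.)\1+", r"\1", shape)


def token_features(tokens, pos_tags, i, gazetteer):
    """Feature dictionary for the token at position i."""
    token = tokens[i]
    return {
        "word": token.lower(),
        "pos": pos_tags[i],
        "prev_pos": pos_tags[i - 1] if i > 0 else "<S>",
        "shape": word_shape(token),
        "in_gazetteer": token in gazetteer,
    }


# Hand-written inputs standing in for a POS tagger and a named entity list.
tokens = ["Prime", "Minister", "of", "Canada"]
pos_tags = ["NNP", "NNP", "IN", "NNP"]
gazetteer = {"Canada"}

feats = token_features(tokens, pos_tags, 3, gazetteer)
print(feats)
```

A sequence labeler would consume one such dictionary per token and predict a label for each word in turn.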

Supervised Approach for Relation Detection

• Given a corpus of annotated relations between entities, train two classifiers:
    1. A binary classifier:
        • Given a span of text and two entities
        • Decide if there is a relationship between these two entities.
    2. A classifier trained to determine the type of relation that exists between the entities
• Features:
    – Types of the two named entities
    – Bag-of-words
    – …
• Example:
    – A rented SUV went out of control on Sunday, causing the death of seven people in Brooklyn
    – Relation: Type = Accident, Vehicle Type = SUV, casualties = 7, weather = ?
• Pros and Cons?
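The two feature types named above can be sketched for a single candidate entity pair as follows. The span format (start, end, type) with an exclusive end, and the choice to take the bag of words only from the tokens between the two entities, are assumptions made for this example.

```python
from collections import Counter


def relation_features(tokens, ent1, ent2):
    """Features for one candidate pair.

    ent1 and ent2 are (start, end, type) token spans, end exclusive,
    with ent1 appearing before ent2 in the sentence.
    """
    (_, end1, type1), (start2, _, type2) = ent1, ent2
    feats = Counter()
    feats[f"types={type1}|{type2}"] = 1
    for word in tokens[end1:start2]:  # words between the two entities
        feats[f"bow={word.lower()}"] += 1
    return feats


tokens = "A rented SUV went out of control on Sunday".split()
feats = relation_features(tokens, (2, 3, "CAR_TYPE"), (8, 9, "TIMEX"))
print(dict(feats))
```

The same feature vector could feed both classifiers: the binary one that decides whether any relation holds, and the second one that picks the relation type.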

Pattern Matching for Relation Detection

• Patterns:
    • “[CAR_TYPE] went out of control on [TIMEX], causing the death of [NUM] people”
    • “[PERSON] was born in [GPE]”
    • “[PERSON] was graduated from [FAC]”
    • “[PERSON] was killed by <X>”
• Matching Techniques
    – Exact matching
        • Pros and Cons?
    – Flexible matching (e.g., [X] was .* killed .* by [Y])
        • Pros and Cons?
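One way to see the trade-off between the two matching techniques is to run both on a few sentences. The sketch below uses Python regular expressions; the example sentences are invented, and the named groups X and Y stand in for the entity slots.

```python
import re

# Exact matching: the words between the slots must appear verbatim.
exact = re.compile(r"(?P<X>.+?) was killed by (?P<Y>.+)")
# Flexible matching: arbitrary material may appear around "killed".
flexible = re.compile(r"(?P<X>.+?) was\b.*\bkilled\b.*\bby (?P<Y>.+)")

sentence_a = "John Smith was killed by a falling tree"
sentence_b = "John Smith was reportedly killed on Monday by a falling tree"
noisy = "Mary was not killed, but was saved by John"

for name, rx in (("exact", exact), ("flexible", flexible)):
    for sentence in (sentence_a, sentence_b, noisy):
        m = rx.search(sentence)
        print(name, "|", sentence, "|", m.groupdict() if m else None)
```

The exact pattern misses sentence_b (lower recall), while the flexible pattern also fires on the noisy sentence and proposes a relation that the text does not state (lower precision).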

Pattern Matching

• How can we come up with these patterns?
• Manually?
    – Task and domain specific: tedious, time consuming, and not scalable.

Semi-Supervised Approach: AutoSlog-TS (Riloff, 1996)

• MUC-4 task: extract information about terrorist events in Latin America.
• Two corpora:
    1) Domain-dependent corpus that contains relevant information
    2) A set of irrelevant documents
• Algorithm:
    1. Using some heuristic rules, all patterns are extracted from both corpora. For example:
        Rule: <Subj> passive-verb
        <Subj> was murdered
        <Subj> was called
    2. Pattern Ranking: The output patterns are then ranked by the frequency of their occurrences in corpus 1 relative to corpus 2 (corpus 1 / corpus 2).
    3. Filter out the patterns by hand
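The ranking step can be sketched as a simple frequency ratio, following the slide's description rather than the exact scoring function of the original system. The counts below are invented, and the add-one smoothing in the denominator is an assumption added here to avoid division by zero.

```python
from collections import Counter


def rank_patterns(relevant_counts, irrelevant_counts, smoothing=1.0):
    """Rank patterns by count in corpus 1 divided by (smoothed) count in corpus 2."""
    scores = {
        pattern: relevant_counts[pattern]
        / (irrelevant_counts.get(pattern, 0) + smoothing)
        for pattern in relevant_counts
    }
    return sorted(scores, key=scores.get, reverse=True)


# Invented counts, for illustration only.
relevant = Counter({"<Subj> was murdered": 40, "<Subj> was called": 30})
irrelevant = Counter({"<Subj> was murdered": 1, "<Subj> was called": 29})

ranking = rank_patterns(relevant, irrelevant)
print(ranking)
```

A pattern such as "<Subj> was called" is frequent in both corpora, so its ratio stays low and it sinks in the ranking before the manual filtering step.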

Bootstrapping

[Figure: bootstrapping cycle with the boxes Seed Patterns, Pattern Set, Tuple Search, Tuple Extraction, Tuple Set, Seed Tuples, Pattern Search, Pattern Extraction]

Example from the figure:
    – Pattern: X was born in Y
    – Matching text: George W. Bush was born in Connecticut
    – Extracted tuple: <George W. Bush, Connecticut>
    – Text found with the tuple: Born in Connecticut on July 8, 1946, George was
    – New pattern: Born in Y on Z, X was
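The cycle in the figure can be sketched as a loop that alternates between applying patterns to find tuples and generalizing sentences that mention known tuples into new patterns. Everything here is a simplification for illustration: the four-sentence corpus is made up, entities are approximated by runs of capitalized words instead of a real NE tagger, and the {X}/{Y} template syntax and function names are hypothetical.

```python
import re

# Crude stand-in for an NE tagger: one or more capitalized words.
ENTITY = r"[A-Z]\w*(?:\.? [A-Z]\w*)*"


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Compile a template such as '{X} was born in {Y}' into a regex."""
    rx = re.escape(pattern)
    rx = rx.replace(re.escape("{X}"), f"(?P<X>{ENTITY})")
    rx = rx.replace(re.escape("{Y}"), f"(?P<Y>{ENTITY})")
    return re.compile(rx)


def generalize(sentence: str, x: str, y: str, context: int = 2) -> str:
    """Replace a known tuple by slots, keeping up to `context` tokens on the left."""
    ix, iy = sentence.index(x), sentence.index(y)
    start = min(ix, iy)
    end = max(ix + len(x), iy + len(y))
    left = sentence[:start].split()[-context:]
    body = sentence[start:end].replace(x, "{X}").replace(y, "{Y}")
    return " ".join(left + [body])


def bootstrap(corpus, seed_patterns, rounds=2):
    patterns, tuples = set(seed_patterns), set()
    for _ in range(rounds):
        # Tuple search and extraction: apply every known pattern.
        for pattern in patterns:
            rx = pattern_to_regex(pattern)
            for sentence in corpus:
                for m in rx.finditer(sentence):
                    tuples.add((m.group("X"), m.group("Y")))
        # Pattern search and extraction: generalize sentences with a known tuple.
        for x, y in tuples:
            for sentence in corpus:
                if x in sentence and y in sentence:
                    patterns.add(generalize(sentence, x, y))
    return patterns, tuples


corpus = [
    "George W. Bush was born in Connecticut.",
    "Bill Clinton was born in Arkansas.",
    "Born in Arkansas, Bill Clinton studied at Georgetown.",
    "Born in Hawaii, Barack Obama moved to Chicago.",
]
patterns, tuples = bootstrap(corpus, ["{X} was born in {Y}"])
print(sorted(patterns))
print(sorted(tuples))
```

Starting from one seed pattern, the first round learns a second pattern from the sentence about Bill Clinton, and the second round uses it to reach a tuple that the seed pattern could not find. A real system would also score or filter patterns between rounds to keep noise from accumulating.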

TASK 12: (DARPA – GALE year 2) PRODUCE A BIOGRAPHY OF [PERSON].

1. Name(s), aliases:
2. *Date of Birth or Current Age:
3. *Date of Death:
4. *Place of Birth:
5. *Place of Death:
6. Cause of Death:
7. Religion (Affiliations):
8. Known locations and dates:
9. Last known address:
10. Previous domiciles:
11. Ethnic or tribal affiliations:
12. Immediate family members
13. Native Language spoken:
14. Secondary Languages spoken:
15. Physical Characteristics
16. Passport number and country of issue:
17. Professional positions:
18. Education
19. Party or other organization affiliations:
20. Publications (titles and dates):

Biography – two approaches

• To obtain high precision, we handle each slot independently using bootstrapping to learn IE patterns.
• To improve the recall, we utilize a biographical-sentence classifier.

Biography patterns from Wikipedia

• Martin Luther King, Jr., (January 15, 1929 – April 4, 1968) was the most …
• Martin Luther King, Jr., was born on January 15, 1929, in Atlanta, Georgia.

Run NER on these sentences

• <Person>Martin Luther King, Jr.</Person>, (<Date>January 15, 1929</Date> – <Date>April 4, 1968</Date>) was the most …
• <Person>Martin Luther King, Jr.</Person>, was born on <Date>January 15, 1929</Date>, in <GPE>Atlanta, Georgia</GPE>.
• Take the token sequence that includes the tags of interest + some context (2 tokens before and 2 tokens after)
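The windowing step can be sketched as follows, under the assumption that each tagged entity counts as a single token and punctuation marks are separate tokens. The tokenizer regex and the function names are illustrative, not the ones used in the original system.

```python
import re

# A whole tagged entity, a word, or a single punctuation mark.
TOKEN_RE = re.compile(r"<(\w+)>.*?</\1>|\w+|[^\w\s]")


def tokenize(tagged: str) -> list[str]:
    """Each tagged entity becomes one '<Type>' token; other tokens are kept."""
    return [
        f"<{m.group(1)}>" if m.group(1) else m.group(0)
        for m in TOKEN_RE.finditer(tagged)
    ]


def window(tokens, tags_of_interest, context=2):
    """Span from the first to the last tag of interest, plus context tokens."""
    hits = [i for i, tok in enumerate(tokens) if tok in tags_of_interest]
    if not hits:
        return []
    start = max(0, hits[0] - context)
    end = min(len(tokens), hits[-1] + 1 + context)
    return tokens[start:end]


tagged = (
    "<Person>Martin Luther King, Jr.</Person>, was born on "
    "<Date>January 15, 1929</Date>, in <GPE>Atlanta, Georgia</GPE>."
)
tokens = tokenize(tagged)
span = window(tokens, {"<Person>", "<Date>"})
print(" ".join(span))
```

For the date-of-birth slot, the tags of interest are the person and the date, so the place name falls outside the window.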

Convert to Patterns:

• <Target_Person> (<Target_Date> – <Date>) was the
• <Target_Person>, was born on <Target_Date>, in
• Remove more specific patterns: if one pattern contains another, take the smallest one with more than k tokens.
    • <Target_Person>, was born on <Target_Date>
    • <Target_Person> (<Target_Date> – <Date>)
• Finally, verify the patterns manually to remove irrelevant patterns.
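The pruning rule can be read in more than one way. The sketch below takes one reading: a pattern is dropped when it contains, as a contiguous token sub-sequence, a different pattern that has more than k tokens. Patterns are represented as tuples of tokens, and the function names are hypothetical.

```python
def contains(longer: tuple, shorter: tuple) -> bool:
    """True if `shorter` occurs as a contiguous sub-sequence of `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def prune_specific(patterns, k):
    """Drop patterns that contain another pattern of more than k tokens."""
    kept = []
    for p in patterns:
        subsumed = any(
            q != p and len(q) > k and contains(p, q) for q in patterns
        )
        if not subsumed:
            kept.append(p)
    return kept


long_pattern = ("<Target_Person>", ",", "was", "born", "on", "<Target_Date>", ",", "in")
short_pattern = ("<Target_Person>", ",", "was", "born", "on", "<Target_Date>")

print(prune_specific([long_pattern, short_pattern], k=3))
```

The threshold k keeps very short, overly general patterns from eliminating everything else: with k=6 the six-token pattern is too short to replace the eight-token one, and both survive.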

Examples of Patterns:

• 502 distinct place-of-birth patterns:
    – 600 <Target_Person> was born in <Target_GPE>
    – 169 <Target_Person> (born <Date> in <Target_GPE>)
    – 44 Born in <Target_GPE>, <Target_Person>
    – 10 <Target_Person> was a native <Target_GPE>
    – 10 <Target_Person>'s hometown of <Target_GPE>
    – 1 <Target_Person> was baptized in <Target_GPE>
    – …
• 291 distinct date-of-death patterns:
    – 770 <Target_Person> (<Date> - <Target_Date>)
    – 92 <Target_Person> died on <Target_Date>
    – 19 <Target_Person> <Date> - <Target_Date>
    – 16 <Target_Person> died in <GPE> on <Target_Date>
    – 3 <Target_Person> passed away on <Target_Date>
    – 1 <Target_Person> committed suicide on <Target_Date>
    – …

Biography as an IE task

• This approach is good for the consistently annotated fields in Wikipedia: place of birth, date of birth, place of death, date of death
• Not all fields of interest are annotated; a different approach is needed to cover the rest of the slots

Bouncing between Wikipedia and Google

• Use one seed tuple only:
    – <Target Person> and <Target field>
• Google: “Arafat” “civil engineering”, we get:
    ⇒ Arafat graduated with a bachelor’s degree in civil engineering
    ⇒ Arafat studied civil engineering
    ⇒ Arafat, a civil engineering student
    ⇒ …
• Using these snippets, corresponding patterns are created, then filtered manually
• Due to time limitations, the automatic filter was not completed.
    – To get more seed tuples, go to Wikipedia biography pages only and search for:
    – “graduated with a bachelor’s degree in”
    – We get:

• New seed tuples:
    – “Burnie Thompson” “political science”
    – “Henrey Luke” “Environment Studies”
    – “Erin Crocker” “industrial and management engineering”
    – “Denise Bode” “political science”
    – …
• Go back to Google and repeat the process to get more seed patterns!

Bouncing between Wikipedia and Google

• This approach worked well for a few fields, such as education, publications, immediate family members, and party or other organization affiliations
• It did not provide good patterns for some of the fields, such as religion, ethnic or tribal affiliations, and previous domiciles; we got a lot of noise
• Why is the bouncing idea better than using only one corpus?
• None of the patterns match? Back-off strategy…

Biographical-Sentence Classifier (Biadsy, et al., 2008)

• Train a binary classifier to identify biographical sentences
• Manually annotating a large corpus of biographical and non-biographical information (e.g., Zhou et al., 2004) is labor intensive
• Our approach: collect biographical and non-biographical corpora automatically

Training Data – Biographical Corpus from Wikipedia

• Utilize Wikipedia biographies
• Extract 17K biographies from the XML version of Wikipedia
• Apply simple text processing techniques to clean up the text

Constructing the Biographical Corpus

1. Identify the subject of each biography
2. Run NYU’s ACE system to tag NEs and do coreference resolution (Grishman et al., 2005)
3. Replace each NE by its tag type and subtype
4. Non-pronominal referring expressions that are coreferential with the target person are replaced by [TARGET_PER]
5. Every pronoun P that refers to the target person is replaced by [TARGET_P], where P is the pronoun replaced
6. Sentences containing no reference to the target person are removed

Example:
    – Original: In September 1951, King began his doctoral studies in theology at Boston University.
    – Intermediate form shown after step 3: In [TIMEX], [PER_Individual] began [TARGET_HIS] doctoral studies in theology at [ORG_Educational].
    – Final form: In [TIMEX], [TARGET_PER] began [TARGET_HIS] doctoral studies in theology at [ORG_Educational].
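Steps 3 to 6 can be sketched on a single tokenized sentence as shown below. The mention annotations are written by hand as a stand-in for the output of an NE tagger and coreference system; the dictionary keys and the function name normalize are assumptions made for this example.

```python
def normalize(tokens, mentions):
    """Rewrite one sentence, or return None if it never mentions the target.

    Each mention is a dict with 'start', 'end' (exclusive), 'tag',
    'target' (refers to the biography subject) and 'pronoun'.
    """
    if not any(m["target"] for m in mentions):
        return None  # step 6: drop sentences without a target reference
    by_start = {m["start"]: m for m in mentions}
    out, i = [], 0
    while i < len(tokens):
        m = by_start.get(i)
        if m is None:
            out.append(tokens[i])
            i += 1
            continue
        if m["target"] and m["pronoun"]:
            out.append(f"[TARGET_{tokens[i].upper()}]")  # step 5
        elif m["target"]:
            out.append("[TARGET_PER]")  # step 4
        else:
            out.append(f"[{m['tag']}]")  # step 3
        i = m["end"]
    return " ".join(out)


tokens = ["In", "September", "1951", ",", "King", "began", "his", "doctoral",
          "studies", "in", "theology", "at", "Boston", "University", "."]
mentions = [
    {"start": 1, "end": 3, "tag": "TIMEX", "target": False, "pronoun": False},
    {"start": 4, "end": 5, "tag": "PER_Individual", "target": True, "pronoun": False},
    {"start": 6, "end": 7, "tag": "PER_Individual", "target": True, "pronoun": True},
    {"start": 12, "end": 14, "tag": "ORG_Educational", "target": False, "pronoun": False},
]
normalized = normalize(tokens, mentions)
print(normalized)
```

Replacing names and dates by class tags lets a classifier learn from the shape of a sentence such as "[TARGET_PER] began [TARGET_HIS] doctoral studies" without memorizing the individuals involved.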

Constructing the Non-Biographical Corpus

• English newswire articles in TDT4 are used to represent non-biographical sentences

1. Run NYU’s ACE system on each article
2. Select a PERSON NE mention at random from all NEs in the article to represent the target person
3. Exclude sentences with no reference to this target
4. Replace referring expressions and NEs as in the biography corpus

Biographical-Sentence Classifier

• Train a classifier on the biographical and non-biographical corpora
    – Biographical corpus:
        • 30,002 sentences from Wikipedia
        • 2,108 sentences held out for testing
    – Non-Biographical corpus:
        • 23,424 sentences from TDT4
        • 2,108 sentences held out for testing

Biographical-Sentence Classifier

• Features:
    – Frequency of 1-2-3 grams of class-based/lexical, e.g.:
        • [TARGET_PER] was born
        • [TARGET_HER] husband was
        • [TARGET_PER] said
    – Frequency of 1-2 grams of POS
• Chi-square for feature selection
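A small sketch of n-gram extraction and chi-square feature selection follows. It assumes the chi-square statistic is computed on a 2x2 table of feature presence versus class, counted per sentence; the slides do not spell out this detail, and the four toy sentences are invented.

```python
from collections import Counter


def ngrams(tokens, n_max=3):
    """All 1- to n_max-grams of a token list, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def chi_square(n11, n10, n01, n00):
    """2x2 chi-square statistic.

    n11: bio sentences with the feature, n10: non-bio sentences with it,
    n01: bio sentences without it, n00: non-bio sentences without it.
    """
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return num / den if den else 0.0


def select_features(bio, non_bio, top_k):
    """Return the top_k n-grams by chi-square score (ties broken alphabetically)."""
    df_bio = Counter(g for s in bio for g in set(ngrams(s)))
    df_non = Counter(g for s in non_bio for g in set(ngrams(s)))
    scores = {
        g: chi_square(df_bio[g], df_non[g],
                      len(bio) - df_bio[g], len(non_bio) - df_non[g])
        for g in set(df_bio) | set(df_non)
    }
    return sorted(scores, key=lambda g: (-scores[g], g))[:top_k]


# Toy normalized sentences, invented for illustration.
bio = [["[TARGET_PER]", "was", "born", "in", "[GPE]"],
       ["[TARGET_PER]", "was", "born", "on", "[TIMEX]"]]
non_bio = [["[TARGET_PER]", "said", "on", "[TIMEX]"],
           ["[TARGET_PER]", "said", "in", "[GPE]"]]

top = select_features(bio, non_bio, top_k=7)
print(top)
```

N-grams that occur in only one class, such as "was born" or "[TARGET_PER] said", get the highest scores, while "[TARGET_PER]" alone appears in every sentence of both classes and scores zero.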

Classification Results

• Experimented with three types of classifiers:

Classifier               Accuracy    F-Measure
SVM                      87.6%       0.87
M. Naïve Bayes (MNB)     84.1%       0.84
C4.5                     81.8%       0.82

• Note: Classifiers provide a confidence score for each classified sample
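For reference, the two reported metrics can be computed from gold and predicted labels as sketched below. The slides do not say which variant of the F-measure was used; this example assumes the balanced F1 of the biographical class, and the label sequences are invented.

```python
def accuracy_and_f1(gold, predicted, positive="bio"):
    """Return (accuracy, F1 of the positive class)."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    correct = sum(g == p for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return correct / len(pairs), f1


# Invented labels, for illustration only.
gold = ["bio", "bio", "bio", "non", "non", "non"]
predicted = ["bio", "bio", "non", "non", "non", "bio"]

accuracy, f1 = accuracy_and_f1(gold, predicted)
print(f"accuracy={accuracy:.3f} f1={f1:.3f}")
```

With balanced test sets like the ones described earlier (2,108 held-out sentences per class), accuracy and F1 tend to be close, which is consistent with the table.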

Thank you