Relation Extraction
What is relation extraction?
Many slides adapted from Dan Jurafsky
Extracting relations from text
• Company report: “International Business Machines Corporation (IBM or the company) was incorporated in the State of New York on June 16, 1911, as the Computing-Tabulating-Recording Co. (C-T-R)…”
• Extracted Complex Relation: Company-Founding
    Company          IBM
    Location         New York
    Date             June 16, 1911
    Original-Name    Computing-Tabulating-Recording Co.
• But we will focus on the simpler task of extracting relation triples
    Founding-year(IBM, 1911)
    Founding-location(IBM, New York)
Why Relation Extraction?
• Create new structured knowledge bases, useful for any app
• Augment current knowledge bases
    • Adding words to WordNet thesaurus, facts to FreeBase or DBPedia
• Support question answering
    • The granddaughter of which actor starred in the movie “E.T.”?
      (acted-in ?x “E.T.”) (is-a ?y actor) (granddaughter-of ?x ?y)
• But which relations should we extract?
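The question-answering query on this slide can be read as a conjunction of triple patterns matched against a knowledge base. The following is a minimal sketch under stated assumptions: the tiny `KB` set is hand-written for illustration (it is not from the slides), and `answer` is a hypothetical helper that hard-codes this one query rather than a general query engine.

```python
# Toy knowledge base of (relation, arg1, arg2) triples.
# Assumption: hand-written facts for illustration only.
KB = {
    ("acted-in", "Drew Barrymore", "E.T."),
    ("acted-in", "Henry Thomas", "E.T."),
    ("is-a", "John Barrymore", "actor"),
    ("granddaughter-of", "Drew Barrymore", "John Barrymore"),
}


def answer(kb):
    """Solve (acted-in ?x "E.T.") (is-a ?y actor) (granddaughter-of ?x ?y)."""
    results = []
    for rel, x, movie in kb:
        if rel != "acted-in" or movie != "E.T.":
            continue
        # Join on ?x: look for a granddaughter-of triple whose first argument is x.
        for rel2, x2, y in kb:
            if rel2 == "granddaughter-of" and x2 == x and ("is-a", y, "actor") in kb:
                results.append((x, y))
    return sorted(results)


print(answer(KB))
```

Each binding of ?y is an answer to the question; ?x is the intermediate entity that links the three triples.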
Automatic Content Extraction (ACE)
    ARTIFACT: User-Owner-Inventor-Manufacturer
    GENERAL AFFILIATION: Citizen-Resident-Ethnicity-Religion, Org-Location-Origin
    ORG AFFILIATION: Founder, Employment, Membership, Ownership, Student-Alum, Investor, Sports-Affiliation
    PART-WHOLE: Geographical, Subsidiary
    PERSON-SOCIAL: Business, Family, Lasting Personal
    PHYSICAL: Located, Near
17 relations from 2008 “Relation Extraction Task”
Automatic Content Extraction (ACE)
• Physical-Located (PER-GPE): He was in Tennessee
• Part-Whole-Subsidiary (ORG-ORG): XYZ, the parent company of ABC
• Person-Social-Family (PER-PER): John’s wife Yoko
• Org-AFF-Founder (PER-ORG): Steve Jobs, co-founder of Apple…
UMLS: Unified Medical Language System
• 134 entity types, 54 relations
    Injury                     disrupts       Physiological Function
    Bodily Location            location-of    Biologic Function
    Anatomical Structure       part-of        Organism
    Pharmacologic Substance    causes         Pathological Function
    Pharmacologic Substance    treats         Pathologic Function
Databases of Wikipedia Relations
Relations extracted from Infobox
    Stanford state California
    Stanford motto “Die Luft der Freiheit weht”
    …
(Figure: Wikipedia Infobox)
How to build relation extractors
1. Hand-written patterns
2. Supervised machine learning
3. Semi-supervised and unsupervised
    • Bootstrapping (using seeds)
    • Distant supervision
    • Unsupervised learning from the web
Relation Extraction
Using patterns to extract relations
Extracting Richer Relations Using Rules
• Intuition: relations often hold between specific entities
    • located-in (ORGANIZATION, LOCATION)
    • founded (PERSON, ORGANIZATION)
    • cures (DRUG, DISEASE)
• Start with Named Entity tags to help extract relation!
Named Entities aren’t quite enough. Which relations hold between 2 entities?
    Drug, Disease: Cure? Prevent? Cause?
What relations hold between 2 entities?
    PERSON, ORGANIZATION: Founder? Investor? Member? Employee? President?
Extracting Richer Relations Using Rules and Named Entities
Who holds what office in what organization?
PERSON, POSITION of ORG
    • George Marshall, Secretary of State of the United States
PERSON (named|appointed|chose|etc.) PERSON Prep? POSITION
    • Truman appointed Marshall Secretary of State
PERSON [be]? (named|appointed|etc.) Prep? ORG POSITION
    • George Marshall was named US Secretary of State
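The three patterns above can be approximated with regular expressions over text that already carries named-entity tags. This is a sketch only: the inline `[TYPE text]` annotation format, the `POS` tag for positions, the short preposition list, and the `extract_office` helper are all assumptions made for illustration, not part of the slides.

```python
import re

# Assumed (hypothetical) inline annotation format: [TYPE surface text]
PER = r"\[PER (?P<person>[^\]]+)\]"
POS = r"\[POS (?P<position>[^\]]+)\]"
ORG = r"\[ORG (?P<org>[^\]]+)\]"
PREP = r"(?:(?:as|to|for) )?"  # optional preposition, a small illustrative list

PATTERNS = [
    # PERSON, POSITION of ORG
    re.compile(PER + r", " + POS + r" of (?:the )?" + ORG),
    # PERSON (named|appointed|chose) PERSON Prep? POSITION
    # The office holder is the second PERSON, so only that one is captured.
    re.compile(r"\[PER [^\]]+\] (?:named|appointed|chose) " + PER + " " + PREP + POS),
    # PERSON [be]? (named|appointed) Prep? ORG POSITION
    re.compile(PER + r" (?:(?:was|is|were) )?(?:named|appointed) " + PREP + ORG + " " + POS),
]


def extract_office(tagged_sentence):
    """Return (person, position, org or None) for the first pattern that matches."""
    for pattern in PATTERNS:
        match = pattern.search(tagged_sentence)
        if match:
            groups = match.groupdict()
            return groups["person"], groups["position"], groups.get("org")
    return None


examples = [
    "[PER George Marshall], [POS Secretary of State] of the [ORG United States]",
    "[PER Truman] appointed [PER Marshall] [POS Secretary of State]",
    "[PER George Marshall] was named [ORG US] [POS Secretary of State]",
]
for sentence in examples:
    print(extract_office(sentence))
```

The second pattern has no ORG slot, so the organization comes back as `None`; a sentence that fits none of the three patterns is simply missed, which is one way to see the low-recall limitation of hand-built patterns.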
Hand-built patterns for relations
• Plus:
    • Human patterns tend to be high-precision
    • Can be tailored to specific domains
• Minus:
    • Human patterns are often low-recall
    • A lot of work to think of all possible patterns!
    • Don’t want to have to do this for every relation!
    • We’d like better accuracy
Relation Extraction
Supervised relation extraction
Supervised machine learning for relations
• Choose a set of relations we’d like to extract
• Choose a set of relevant named entities
• Find and label data
    • Choose a representative corpus
    • Label the named entities in the corpus
    • Hand-label the relations between these entities
    • Break into training, development, and test
• Train a classifier on the training set
How to do classification in supervised relation extraction
1. Find all pairs of named entities (usually in same sentence)
2. Decide if 2 entities are related
3. If yes, classify the relation
• Why the extra step?
    • Faster classification training by eliminating most pairs
    • Can use distinct feature-sets appropriate for each task.
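The three steps above can be sketched as a small pipeline in which a binary "related?" filter runs before the multi-class relation labeler. The two classifiers here are hypothetical stand-in functions (`toy_is_related`, `toy_label`) with hard-coded answers; in practice each would be a trained model with its own feature set.

```python
from itertools import combinations


def candidate_pairs(mentions):
    """Step 1: every pair of entity mentions found in one sentence."""
    return list(combinations(mentions, 2))


def extract_relations(mentions, is_related, classify_relation):
    """Steps 2 and 3: binary filter first, then label only the surviving pairs."""
    relations = []
    for m1, m2 in candidate_pairs(mentions):
        if is_related(m1, m2):  # step 2: most pairs are dropped here
            relations.append((m1, m2, classify_relation(m1, m2)))  # step 3
    return relations


# Stand-in classifiers (assumptions for illustration, not trained models).
def toy_is_related(m1, m2):
    return (m1, m2) != ("AMR", "Tim Wagner")


def toy_label(m1, m2):
    return "SUBSIDIARY" if m2 == "AMR" else "EMPLOYMENT"


mentions = ["American Airlines", "AMR", "Tim Wagner"]
print(extract_relations(mentions, toy_is_related, toy_label))
```

Because the filter and the labeler are passed in separately, each can be trained and tuned on its own, which is the design choice the slide motivates.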
Automatic Content Extraction (ACE)
    ARTIFACT: User-Owner-Inventor-Manufacturer
    GENERAL AFFILIATION: Citizen-Resident-Ethnicity-Religion, Org-Location-Origin
    ORG AFFILIATION: Founder, Employment, Membership, Ownership, Student-Alum, Investor, Sports-Affiliation
    PART-WHOLE: Geographical, Subsidiary
    PERSON-SOCIAL: Business, Family, Lasting Personal
    PHYSICAL: Located, Near
17 sub-relations of 6 relations from 2008 “Relation Extraction Task”
Relation Extraction
Classify the relation between two entities in a sentence
American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.
Candidate labels: SUBSIDIARY, FAMILY, EMPLOYMENT, NIL, FOUNDER, CITIZEN, INVENTOR, …
Word Features for Relation Extraction
• Headwords of M1 and M2, and combination
    Airlines    Wagner    Airlines-Wagner
• Bag of words and bigrams in M1 and M2
    {American, Airlines, Tim, Wagner, American Airlines, Tim Wagner}
• Words or bigrams in particular positions left and right of M1/M2
    M2: -1 spokesman
    M2: +1 said
• Bag of words or bigrams between the two entities
    {a, AMR, of, immediately, matched, move, spokesman, the, unit}
American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said
(Mention 1: American Airlines; Mention 2: Tim Wagner)
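The word features listed on this slide can be computed directly from a tokenized sentence and the two mention spans. The sketch below makes two simplifying assumptions that are not in the slides: the head word of a mention is taken to be its last token, and punctuation is dropped from the between-mention bag with `str.isalnum`. The function name and feature keys are illustrative.

```python
def word_features(tokens, m1, m2):
    """Word features for a mention pair.

    m1 and m2 are (start, end) token spans, end exclusive, with m1 before m2.
    """
    w1 = tokens[m1[0]:m1[1]]
    w2 = tokens[m2[0]:m2[1]]
    # Assumption: the head word is the last token of the mention.
    head1, head2 = w1[-1], w2[-1]

    def bigrams(words):
        return [" ".join(pair) for pair in zip(words, words[1:])]

    return {
        "head_m1": head1,
        "head_m2": head2,
        "head_pair": f"{head1}-{head2}",
        "mention_ngrams": set(w1 + w2 + bigrams(w1) + bigrams(w2)),
        "m2_prev": tokens[m2[0] - 1] if m2[0] > 0 else None,
        "m2_next": tokens[m2[1]] if m2[1] < len(tokens) else None,
        # Words strictly between the two mentions, punctuation removed.
        "between": {t for t in tokens[m1[1]:m2[0]] if t.isalnum()},
    }


tokens = ["American", "Airlines", ",", "a", "unit", "of", "AMR", ",",
          "immediately", "matched", "the", "move", ",", "spokesman",
          "Tim", "Wagner", "said", "."]
features = word_features(tokens, (0, 2), (14, 16))
for name, value in features.items():
    print(name, value)
```

On the example sentence this reproduces the values shown on the slide, for instance the head pair `Airlines-Wagner` and `spokesman` / `said` as the words to the left and right of M2.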
Named Entity Type and Mention Level Features for Relation Extraction
• Named-entity types
    • M1: ORG
    • M2: PERSON
• Concatenation of the two named-entity types
    • ORG-PERSON
• Entity Level of M1 and M2 (NAME, NOMINAL, PRONOUN)
    • M1: NAME [it or he would be PRONOUN]
    • M2: NAME [the company would be NOMINAL]
American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said
(Mention 1: American Airlines; Mention 2: Tim Wagner)
Parse Features for Relation Extraction
• Base syntactic chunk sequence from one to the other
    NP NP PP VP NP NP
• Constituent path through the tree from one to the other
    NP ↑ NP ↑ S ↑ S ↓ NP
• Dependency path
    Airlines matched Wagner said
American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said
(Mention 1: American Airlines; Mention 2: Tim Wagner)
Gazetteer and trigger word features for relation extraction
• Trigger list for family: kinship terms
    • parent, wife, husband, grandparent, etc. [from WordNet]
• Gazetteer:
    • Lists of useful geo or geopolitical words
    • Country name list
    • Other sub-entities
American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.
Classifiers for supervised methods
• Now you can use any classifier you like
    • MaxEnt
    • Naïve Bayes
    • SVM
    • ...
• Train it on the training set, tune on the dev set, test on the test set
Evaluation of Supervised Relation Extraction
• Compute P/R/F1 for each relation

$$P = \frac{\#\text{ of correctly extracted relations}}{\text{Total }\#\text{ of extracted relations}}$$

$$R = \frac{\#\text{ of correctly extracted relations}}{\text{Total }\#\text{ of gold relations}}$$

$$F_1 = \frac{2PR}{P + R}$$
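The three measures can be computed per relation from a set of predicted triples and a set of gold triples. A minimal sketch, assuming relations are represented as `(relation, arg1, arg2)` tuples and that an empty denominator yields 0.0 (both are conventions chosen here, not stated in the slides). The predicted set is a made-up system output with one deliberate error.

```python
def prf1(predicted, gold):
    """Precision, recall and F1 for two sets of extracted relation triples."""
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def prf1_per_relation(predicted, gold):
    """Apply prf1 separately to the triples of each relation type."""
    relations = {triple[0] for triple in predicted | gold}
    return {
        rel: prf1({t for t in predicted if t[0] == rel},
                  {t for t in gold if t[0] == rel})
        for rel in relations
    }


gold = {("founding-year", "IBM", "1911"), ("founding-location", "IBM", "New York")}
# Hypothetical system output: the year is right, the location is wrong.
predicted = {("founding-year", "IBM", "1911"), ("founding-location", "IBM", "Armonk")}

print(prf1(predicted, gold))
print(prf1_per_relation(predicted, gold))
```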
Summary: Supervised Relation Extraction
+ Can get high accuracies with enough hand-labeled training data, if test similar enough to training
- Labeling a large training set is expensive
- Supervised models are brittle, don’t generalize well to different genres
Relation Extraction
Semi-supervised and unsupervised relation extraction
Seed-based or bootstrapping approaches to relation extraction
• No training set? Maybe you have:
    • A few seed tuples or
    • A few high-precision patterns
• Can you use those seeds to do something useful?
    • Bootstrapping: use the seeds to directly learn to populate a relation
Relation Bootstrapping (Hearst 1992)
• Gather a set of seed pairs that have relation R
• Iterate:
    1. Find sentences with these pairs
    2. Look at the context between or around the pair and generalize the context to create patterns
    3. Use the patterns to grep for more pairs
Bootstrapping
• <Mark Twain, Elmira> Seed tuple
• Grep (google) for the environments of the seed tuple
    “Mark Twain is buried in Elmira, NY.”
        X is buried in Y
    “The grave of Mark Twain is in Elmira”
        The grave of X is in Y
    “Elmira is Mark Twain’s final resting place”
        Y is X’s final resting place.
• Use those patterns to grep for new tuples
• Iterate
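The loop on this slide can be sketched over a toy in-memory corpus instead of a web search. Everything below is an illustrative assumption rather than Hearst's procedure: the `<X>`/`<Y>` slot notation, the rule that trailing context is cut at the first comma or period, the capitalized-word regex standing in for a named-entity tagger, and the fourth corpus sentence, which was added so there is a new tuple to find.

```python
import re

# Crude stand-in for a named-entity tagger: one or more capitalized words.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"


def generalize(sentence, x, y):
    """Replace the seed arguments with slots; cut trailing context at , or ."""
    slotted = sentence.replace(x, "<X>").replace(y, "<Y>")
    end = max(slotted.rfind("<X>"), slotted.rfind("<Y>")) + 3
    tail = re.split(r"[,.]", slotted[end:])[0]
    return slotted[:end] + tail


def to_regex(pattern):
    """Compile a slotted pattern into a regex with named groups x and y."""
    parts = []
    for piece in re.split(r"(<X>|<Y>)", pattern):
        if piece == "<X>":
            parts.append(f"(?P<x>{NAME})")
        elif piece == "<Y>":
            parts.append(f"(?P<y>{NAME})")
        else:
            parts.append(re.escape(piece))
    return re.compile("".join(parts))


def bootstrap(corpus, seeds, rounds=2):
    tuples, patterns = set(seeds), set()
    for _ in range(rounds):
        # Steps 1 and 2: find sentences with known pairs, turn them into patterns.
        for x, y in tuples:
            for sentence in corpus:
                if x in sentence and y in sentence:
                    patterns.add(generalize(sentence, x, y))
        # Step 3: use the patterns to "grep" the corpus for more pairs.
        for pattern in patterns:
            regex = to_regex(pattern)
            for sentence in corpus:
                match = regex.search(sentence)
                if match:
                    tuples.add((match.group("x"), match.group("y")))
    return tuples, patterns


corpus = [
    "Mark Twain is buried in Elmira, NY.",
    "The grave of Mark Twain is in Elmira",
    "Elmira is Mark Twain's final resting place",
    "Emily Dickinson is buried in Amherst, MA.",  # toy sentence, not from the slides
]
tuples, patterns = bootstrap(corpus, {("Mark Twain", "Elmira")})
print(sorted(tuples))
print(sorted(patterns))
```

Nothing in this loop checks whether a newly learned pattern still expresses the original relation, so a real system would need some confidence scoring on patterns and tuples.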
DIPRE: Extract <author, book> pairs
• Start with 5 seeds:
    Author                 Book
    Isaac Asimov           The Robots of Dawn
    David Brin             Startide Rising
    James Gleick           Chaos: Making a New Science
    Charles Dickens        Great Expectations
    William Shakespeare    The Comedy of Errors
• Find Instances:
    The Comedy of Errors, by William Shakespeare, was
    The Comedy of Errors, by William Shakespeare, is
    The Comedy of Errors, one of William Shakespeare's earliest attempts
    The Comedy of Errors, one of William Shakespeare's most
• Extract patterns (group by middle, take longest common prefix/suffix)
    ?x , by ?y ,
    ?x , one of ?y 's
• Now iterate, finding new seeds that match the pattern
Brin, Sergey. 1998. Extracting Patterns and Relations from the World Wide Web.
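The pattern-extraction step ("group by middle, take longest common prefix/suffix") can be sketched as follows. This is a simplified illustration under stated assumptions, not Brin's full algorithm: it assumes the book (?x) always precedes the author (?y) in the text, and the function names are hypothetical. The longest common suffix of the left contexts and the longest common prefix of the right contexts become the pattern's outer context.

```python
import os
from collections import defaultdict


def split_occurrence(text, x, y):
    """Split text into (prefix, middle, suffix) around x and y; assumes x comes first."""
    i = text.index(x)
    j = text.index(y)
    return text[:i], text[i + len(x):j], text[j + len(y):]


def common_suffix(strings):
    """Longest common suffix, computed as the common prefix of the reversed strings."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


def extract_patterns(occurrences):
    """Group occurrences by their middle context and generalize the outer context."""
    groups = defaultdict(list)
    for text, x, y in occurrences:
        prefix, middle, suffix = split_occurrence(text, x, y)
        groups[middle].append((prefix, suffix))
    patterns = []
    for middle, contexts in groups.items():
        left = common_suffix([p for p, _ in contexts])
        right = os.path.commonprefix([s for _, s in contexts])
        patterns.append(f"{left}?x{middle}?y{right}")
    return sorted(patterns)


book, author = "The Comedy of Errors", "William Shakespeare"
occurrences = [
    ("The Comedy of Errors, by William Shakespeare, was", book, author),
    ("The Comedy of Errors, by William Shakespeare, is", book, author),
    ("The Comedy of Errors, one of William Shakespeare's earliest attempts", book, author),
    ("The Comedy of Errors, one of William Shakespeare's most", book, author),
]
for pattern in extract_patterns(occurrences):
    print(repr(pattern))
```

Apart from whitespace, the two resulting patterns correspond to the ones listed on the slide.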
Distant Supervision
• Combine bootstrapping with supervised learning
    • Instead of 5 seeds,
    • Use a large database to get huge # of seed examples
    • Create lots of features from all these examples
    • Combine in a supervised classifier
Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17
Fei Wu and Daniel S. Weld. 2007. Autonomously Semantifying Wikipedia. CIKM 2007
Mintz, Bills, Snow, Jurafsky. 2009. Distant supervision for relation extraction without labeled data. ACL09
Distant supervision paradigm
• Like supervised classification:
    • Uses a classifier with lots of features
    • Supervised by detailed hand-created knowledge
    • Doesn’t require iteratively expanding patterns
• Like unsupervised classification:
    • Uses very large amounts of unlabeled data
    • Not sensitive to genre issues in training corpus
Distantly supervised learning of relation extraction patterns
1. For each relation
       Born-In
2. For each tuple in big database
       <Edwin Hubble, Marshfield>
       <Albert Einstein, Ulm>
3. Find sentences in large corpus with both entities
       Hubble was born in Marshfield
       Einstein, born (1879), Ulm
       Hubble’s birthplace in Marshfield
4. Extract frequent features (parse, words, etc)
       PER was born in LOC
       PER, born (XXXX), LOC
       PER’s birthplace in LOC
5. Train supervised classifier using thousands of features
       P(born-in | f1, f2, f3, …, f70000)
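Steps 1 through 4 of this procedure can be sketched on the slide's own example. The sketch makes several assumptions that are not in the slides: entities are matched by plain substring search, a person is matched by surname only, the argument types PER and LOC are hard-coded for Born-In, and four-digit numbers are masked as XXXX. It stops before step 5; the resulting counts would be the input features for whatever classifier is trained.

```python
import re
from collections import Counter

KB = {"born-in": {("Edwin Hubble", "Marshfield"), ("Albert Einstein", "Ulm")}}

CORPUS = [
    "Hubble was born in Marshfield",
    "Einstein, born (1879), Ulm",
    "Hubble's birthplace in Marshfield",
    "Hubble studied galaxies",  # toy distractor: only one entity, so it is skipped
]


def distant_features(kb, corpus):
    """Count lexical features from sentences that mention both entities of a KB tuple."""
    counts = Counter()
    for relation, pairs in kb.items():            # 1. for each relation
        for person, place in pairs:               # 2. for each tuple in the database
            surname = person.split()[-1]          # assumption: match on surname
            for sentence in corpus:               # 3. sentences with both entities
                if surname in sentence and place in sentence:
                    # 4. generalize the sentence into a feature string
                    feature = sentence.replace(surname, "PER").replace(place, "LOC")
                    feature = re.sub(r"\d{4}", "XXXX", feature)
                    counts[(relation, feature)] += 1
    return counts


for (relation, feature), n in sorted(distant_features(KB, CORPUS).items()):
    print(relation, "|", feature, "|", n)
```

No sentence is hand-labeled here: the database tuple supplies the label, which is what makes the supervision "distant".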
Unsupervised relation extraction
• Open Information Extraction:
    • extract relations from the web with no training data, no list of relations
1. Use parsed data to train a “trustworthy tuple” classifier
2. Single-pass extract all relations between NPs, keep if trustworthy
3. Assessor ranks relations based on text redundancy
    (FCI, specializes in, software development)
    (Tesla, invented, coil transformer)
M. Banko, M. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. 2007. Open information extraction from the web. IJCAI
Evaluation of Semi-supervised and Unsupervised Relation Extraction
• Since it extracts totally new relations from the web
    • There is no gold set of correct instances of relations!
    • Can’t compute precision (don’t know which ones are correct)
    • Can’t compute recall (don’t know which ones were missed)
• Instead, we can approximate precision (only)
    • Draw a random sample of relations from output, check precision manually
• Can also compute precision at different levels of recall.
    • Precision for top 1000 new relations, top 10,000 new relations, top 100,000
    • In each case taking a random sample of that set
• But no way to evaluate recall

$$\hat{P} = \frac{\#\text{ of correctly extracted relations in the sample}}{\text{Total }\#\text{ of extracted relations in the sample}}$$
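The sampled precision estimate, and its variant for the top-k ranked extractions, can be sketched as below. The `is_correct` callback is a stand-in for the manual judgment described on the slide, and the function names, the fixed random seed, and the integer "extractions" in the usage example are assumptions made for illustration.

```python
import random


def estimate_precision(extractions, is_correct, sample_size, seed=0):
    """Estimate precision from a random sample judged by is_correct."""
    if not extractions:
        raise ValueError("no extractions to sample from")
    rng = random.Random(seed)
    sample = rng.sample(extractions, min(sample_size, len(extractions)))
    return sum(1 for item in sample if is_correct(item)) / len(sample)


def precision_at_k(ranked, is_correct, ks, sample_size, seed=0):
    """Estimated precision for the top-k extractions, for each k in ks."""
    return {k: estimate_precision(ranked[:k], is_correct, sample_size, seed) for k in ks}


# Toy stand-in: ten ranked "extractions", of which the first seven count as correct.
ranked = list(range(10))


def judged_correct(item):
    return item < 7


print(estimate_precision(ranked, judged_correct, sample_size=10))
print(precision_at_k(ranked, judged_correct, ks=[5, 10], sample_size=10))
```

When the sample is smaller than the set it is drawn from, the result is only an estimate and varies with the sample; recall still cannot be estimated this way because the missed relations are unknown.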