Web-scale Information Extraction in KnowItAll
Oren Etzioni et al., U. of Washington
WWW’2004
Presented by Zheng Shao, CS591CXZ
Outline
  Motivation
  System Architecture
  Detail Techniques
    Search Engine Interface
    Extractor
    Probabilistic Assessment
  Experimental Results
  Future Work
  Conclusion
Motivation
  Why Web-scale Information Extraction?
    The Web is the largest knowledge base.
    Extracting information by searching the Web is not easy, e.g. list the cities in the world whose population is above 400,000, or the humans who have visited space.
    Unless we find the “right” document, this becomes a tedious, error-prone process of piecemeal search.
Motivation (2)
  Previous Information Extraction Work
    Supervised Learning
      Difficult to scale to the Web because of:
        the diversity of the Web
        the prohibitive cost of creating an equally diverse set of hand-tagged documents
    Weakly Supervised and Bootstrap
      Need domain-specific seeds
      Learn rules from seeds, and then new seeds from rules
  KnowItAll
    Domain-independent
    Uses the bootstrap technique
System Architecture
  4 Components
  Data Flow
  [Diagram: Extractor, Search Engine Interface, Assessor, Database]
System Architecture
  System Work Flow
  [Diagram: Extractor, Search Engine Interface, Assessor, Database; Web Pages; Rule]
  Rule template (NP1: noun phrase; NPList2: noun phrase list):
    NP1 “such as” NPList2
    & head(NP1) = plural(name(Class1))
    & properNoun(head(each(NPList2)))
    => instanceOf(Class1, head(each(NPList2)))
  Rule for the class Country:
    NP1 “such as” NPList2
    & head(NP1) = “countries”
    & properNoun(head(each(NPList2)))
    => instanceOf(Country, head(each(NPList2)))
    Keywords: “countries such as”
System Architecture
  System Work Flow
  [Diagram: Extractor, Search Engine Interface, Assessor, Database; Web Pages, Rule, Extracted Information, Knowledge]
  Extracted noun phrase lists:
    the United Kingdom and Canada
    India
    North Korea, Iran, India and Pakistan
    Japan
    Iraq, Italy and Spain
    …
  Extracted instances:
    the United Kingdom
    Canada
    India
    North Korea
    Iran
    …
  Discriminator phrases:
    Country AND X
    “Countries such as X”
  Instantiated for “the United Kingdom”:
    Country AND the United Kingdom
    “Countries such as the United Kingdom”
  Frequency
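The step from extracted noun phrase lists to individual instances can be sketched as below. This is a minimal illustration, not KnowItAll's actual code: the function names `split_np_list` and `collect_instances` are hypothetical, and splitting on commas and the word "and" is an assumed simplification of real noun phrase list parsing.

```python
import re


def split_np_list(np_list):
    """Split a list like 'A, B and C' into ['A', 'B', 'C'].

    Assumption: members are separated by commas and/or the word 'and'.
    """
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", np_list)
    return [p.strip() for p in parts if p.strip()]


def collect_instances(np_lists):
    """Flatten extracted lists into unique candidate instances, keeping first-seen order."""
    seen = []
    for np_list in np_lists:
        for item in split_np_list(np_list):
            if item not in seen:
                seen.append(item)
    return seen


if __name__ == "__main__":
    extracted = [
        "the United Kingdom and Canada",
        "India",
        "North Korea, Iran, India and Pakistan",
    ]
    print(collect_instances(extracted))
```

Duplicates such as "India" are collapsed here, so each candidate is assessed once.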
System Architecture
  Search Engine Interface
    Distributes jobs to different search engines
  Extractor
    Rule Instantiation
    Information Extraction
  Assessor
    Discriminator Phrase Construction
    Assessment of Information
Search Engine Interface
  Metaphor: Information Food Chain
    Search engine: herbivore
    KnowItAll: carnivore
  Why build on top of search engines?
    No need to duplicate existing work
    Low cost/time/effort
  Query Distribution
    Make sure not to overload search engines
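The slides do not say how queries are distributed. One simple policy that avoids concentrating load on a single engine is round-robin assignment, sketched below. The function name `distribute_queries` and the engine names are hypothetical, and round-robin itself is an assumption, not the documented KnowItAll policy.

```python
from itertools import cycle


def distribute_queries(queries, engines):
    """Assign queries to engines in round-robin order.

    Returns a dict mapping each engine name to the list of queries it receives,
    so that per-engine load differs by at most one query.
    """
    assignment = {engine: [] for engine in engines}
    for query, engine in zip(queries, cycle(engines)):
        assignment[engine].append(query)
    return assignment


if __name__ == "__main__":
    queries = ["countries such as", "cities such as", "countries including"]
    # "engine_a" and "engine_b" are placeholder names, not real services.
    print(distribute_queries(queries, ["engine_a", "engine_b"]))
```

A real interface would presumably also need per-engine rate limits; that is left out of this sketch.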
Extractor
  Extraction Template Examples
    NP1 {“,”} “such as” NPList2
    NP1 {“,”} “and other” NP2
    NP1 {“,”} “is a” NP2
  All are domain-independent!
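To make the “such as” template concrete, here is a rough regular-expression sketch of how it might be applied to raw text. It is only an approximation under stated assumptions: a proper noun phrase is taken to be a run of capitalized words, optionally preceded by "the", whereas a real extractor would rely on a noun phrase chunker. The name `extract_such_as` is hypothetical.

```python
import re

# Assumption: a proper noun phrase is one or more capitalized words,
# optionally preceded by "the". This stands in for real NP chunking.
PROPER_NP = r"(?:the\s+)?[A-Z][\w-]*(?:\s+[A-Z][\w-]*)*"


def extract_such_as(text, plural_class):
    """Apply: NP1 {","} "such as" NPList2, with head(NP1) = plural_class."""
    pattern = re.compile(
        rf"\b{re.escape(plural_class)}\s*,?\s+such as\s+"
        rf"({PROPER_NP}(?:\s*,\s*{PROPER_NP})*(?:\s*,?\s+and\s+{PROPER_NP})?)"
    )
    instances = []
    for match in pattern.finditer(text):
        # Split the matched list on commas and on the word "and".
        for item in re.split(r"\s*,\s*(?:and\s+)?|\s+and\s+", match.group(1)):
            if item:
                instances.append(item)
    return instances


if __name__ == "__main__":
    sentence = "He visited countries such as North Korea, Iran, India and Pakistan last year."
    print(extract_such_as(sentence, "countries"))
```

The same function works for any class by changing `plural_class`, which is the sense in which the template is domain-independent.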
Extractor (2)
  Noun phrase analysis
    A. “China is a country in Asia”
    B. “Garth Brooks is a country singer”
  In A, the word “country” is the head of a simple noun phrase.
  In B, the word “country” is not the head of a simple noun phrase.
  So China is indeed a country, while Garth Brooks is not.
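The head test on examples A and B can be imitated with a crude heuristic: take the words after "is a" up to the first boundary word and treat the last one as the head. This is only a sketch; the boundary word list `NP_BOUNDARY` and both function names are hypothetical, and a real system would use a part-of-speech tagger or noun phrase chunker.

```python
from typing import Optional

# Hypothetical stop list: words assumed to end a simple noun phrase.
NP_BOUNDARY = {"in", "of", "on", "at", "from", "with", "that", "who", "which", "and", "or"}


def head_of_np_after_is_a(sentence: str) -> Optional[str]:
    """Return the (lowercased) head of the simple noun phrase following 'is a/an'."""
    words = sentence.rstrip(".!?").split()
    lowered = [w.lower() for w in words]
    for i in range(len(words) - 2):
        if lowered[i] == "is" and lowered[i + 1] in ("a", "an"):
            np_words = []
            for w in lowered[i + 2:]:
                if w in NP_BOUNDARY:
                    break
                np_words.append(w.strip(","))
            # In English the head of a simple noun phrase is its last word.
            return np_words[-1] if np_words else None
    return None


def is_instance_of(sentence: str, class_name: str) -> bool:
    """True when class_name is the head of the noun phrase after 'is a/an'."""
    return head_of_np_after_is_a(sentence) == class_name.lower()


if __name__ == "__main__":
    print(is_instance_of("China is a country in Asia", "country"))
    print(is_instance_of("Garth Brooks is a country singer", "country"))
```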
Extractor (3)
  Rule Template:
    NP1 “such as” NPList2
    & head(NP1) = plural(name(Class1))
    & properNoun(head(each(NPList2)))
    => instanceOf(Class1, head(each(NPList2)))
  The Extractor generates a rule for “Country” from this template by substituting “Country” for “Class1”.
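The substitution step can be sketched as a small function that fills the template for a given class name. The `Rule` dataclass, `naive_plural`, and `instantiate_such_as` are hypothetical names introduced for illustration, and the pluralizer handles only regular English plurals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    class_name: str
    pattern: str
    constraints: tuple
    binding: str
    keywords: str


def naive_plural(noun):
    """Hypothetical helper: regular English plurals only, lowercased."""
    lower = noun.lower()
    if lower.endswith("y") and lower[-2:-1] not in "aeiou":
        return lower[:-1] + "ies"
    if lower.endswith(("s", "x", "ch", "sh")):
        return lower + "es"
    return lower + "s"


def instantiate_such_as(class_name):
    """Substitute class_name for Class1 in the 'such as' rule template."""
    plural = naive_plural(class_name)
    return Rule(
        class_name=class_name,
        pattern='NP1 "such as" NPList2',
        constraints=(
            f'head(NP1) = "{plural}"',
            "properNoun(head(each(NPList2)))",
        ),
        binding=f"instanceOf({class_name}, head(each(NPList2)))",
        # The keywords are what gets sent to the search engine as a query.
        keywords=f"{plural} such as",
    )


if __name__ == "__main__":
    print(instantiate_such_as("Country"))
```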
Assessor
  Naïve Bayesian Model
    Features: hit counts returned by the search engine
    Event: whether the extracted information is a fact
  Adjusting the threshold
    Trade-off between precision and recall
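As a sketch of the model above: under the naive (conditional independence) assumption, the probability that an extraction is a fact is the prior times the product of per-feature likelihoods, normalized over both outcomes. The function names and all numbers below are made up for illustration; they are not values from the paper.

```python
def naive_bayes_posterior(prior, likelihoods_true, likelihoods_false):
    """P(fact | f1..fn) assuming features are independent given the class.

    likelihoods_true[i]  = P(f_i | fact)
    likelihoods_false[i] = P(f_i | not fact)
    """
    p_true = prior
    p_false = 1.0 - prior
    for lt, lf in zip(likelihoods_true, likelihoods_false):
        p_true *= lt
        p_false *= lf
    total = p_true + p_false
    return p_true / total if total > 0 else 0.0


def accept(posterior, threshold):
    """Raising the threshold favors precision; lowering it favors recall."""
    return posterior >= threshold


if __name__ == "__main__":
    # Illustrative numbers only.
    p = naive_bayes_posterior(0.5, [0.8, 0.9], [0.2, 0.3])
    print(p, accept(p, 0.9))
```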
Assessor (2)
  Use bootstrapping to learn P(fi|Ф) and P(fi|¬Ф)
  Define PMI(I, D) = |Hits(D+I)| / |Hits(I)|
    I: the extracted NP
    D: discriminator phrase
  4 functions for P(fi|Ф) and P(fi|¬Ф)
    Hits-Thresh: P(hits > Hits(D+I) | Ф)
    Hits-Density: p(hits = Hits(D+I) | Ф)
    PMI-Thresh: P(pmi > PMI(I,D) | Ф)
    PMI-Density: p(pmi = PMI(I,D) | Ф)
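The PMI definition above is a simple ratio of hit counts. The sketch below computes it and adds one possible empirical reading of the PMI-Thresh expression P(pmi > PMI(I,D) | Ф): the fraction of training examples of a class whose score exceeds the observed one. That reading, the handling of zero hits, the function names, and all numbers are assumptions made for illustration.

```python
def pmi(hits_d_plus_i, hits_i):
    """PMI(I, D) = |Hits(D + I)| / |Hits(I)|."""
    if hits_i == 0:
        return 0.0  # assumption: an instance with no hits gets PMI 0
    return hits_d_plus_i / hits_i


def thresh_likelihood(observed, training_scores):
    """Empirical P(score > observed | class), from scores of that class's training examples."""
    if not training_scores:
        raise ValueError("need at least one training score")
    above = sum(1 for s in training_scores if s > observed)
    return above / len(training_scores)


if __name__ == "__main__":
    # Made-up hit counts for an instance I and a discriminator phrase D.
    score = pmi(hits_d_plus_i=50, hits_i=200)
    # Made-up PMI scores of known positive examples.
    print(score, thresh_likelihood(score, [0.1, 0.3, 0.5, 0.2]))
```

Dividing by |Hits(I)| keeps very frequent instances from scoring highly on raw hit counts alone.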
Experimental Results
  [Figure: Precision vs. Recall]
  Thresh is better than Density
  PMI is better than Hits
Experimental Results (2)
  Run length: 4 days
  [Figure: Web pages retrieved vs. time]
    3000 pages/hour
  [Figure: New facts vs. Web pages retrieved]
    From 1 new fact / 3 pages to 1 new fact / 7 pages
Conclusion & Future Work
  Conclusion:
    Domain-independent rule templates
    Rules generated from rule templates
    Built on top of search engines
    Assessor model: more data, more accurate
  Future work:
    Learn domain-specific rules to improve recall
    Automatically extend the ontology
Q & A
Thanks!