Web-scale Information Extraction in KnowItAll. Oren Etzioni et al., U. of Washington. WWW’2004. Presented by Zheng Shao, CS591CXZ


Outline

Motivation
System Architecture
Detail Techniques
  Search Engine Interface
  Extractor
  Probabilistic Assessment
Experimental Results
Future Work
Conclusion

Motivation

Why Web-scale Information Extraction?
The Web is the largest knowledge base.
Extracting information by searching the Web is not easy. Examples: list the cities in the world whose population is above 400,000; list the humans who have visited space.
Unless we find the “right” document, this work can be a tedious, error-prone process of piecemeal search.

Motivation (2)

Previous Information Extraction Work
Supervised Learning
  Difficult to scale to the Web, because of:
    the diversity of the Web
    the prohibitive cost of creating an equally diverse set of hand-tagged documents
Weakly Supervised and Bootstrap
  Need domain-specific seeds
  Learn rules from seeds, and then vice versa
KnowItAll
  Domain-independent
  Uses a bootstrap technique

System Architecture

4 Components: Extractor, Search Engine Interface, Assessor, Database
[Diagram: data flow among the four components]


System Work Flow

[Diagram labels: Extractor, Assessor, Database, Web Pages, Rule template, Rule, keywords]

Rule template (NP1 is a noun phrase, NPList2 is a noun phrase list):
  NP1 “such as” NPList2
  & head(NP1) = plural(name(Class1))
  & properNoun(head(each(NPList2)))
  => instanceOf(Class1, head(each(NPList2)))

Rule generated for Class1 = Country:
  NP1 “such as” NPList2
  & head(NP1) = “countries”
  & properNoun(head(each(NPList2)))
  => instanceOf(Country, head(each(NPList2)))
  Keywords: “countries such as”


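System Work Flow (2)

[Diagram labels: Web Pages, Rule, Extractor, Extracted Information, Assessor, Knowledge, Database]

Noun phrase lists matched by the rule:
  the United Kingdom and Canada
  India
  North Korea, Iran, India and Pakistan
  Japan
  Iraq, Italy and Spain
  …

Instances extracted from them:
  the United Kingdom
  Canada
  India
  North Korea
  Iran
  …

Discriminator phrases:
  Country AND X
  “Countries such as X”

Instantiated for X = the United Kingdom:
  Country AND the United Kingdom
  Countries such as the United Kingdom

Frequency: the Assessor uses the search-engine hit counts of these phrases.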
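The step from matched noun phrase lists to individual instances and discriminator queries can be sketched roughly as follows. This is a minimal illustration, not KnowItAll's code: the function names `split_np_list` and `discriminator_queries` are hypothetical, and the splitting uses a simple regular expression rather than real noun phrase analysis.

```python
import re

def split_np_list(np_list: str) -> list[str]:
    """Split a coordinated list such as 'A, B and C' into its items.

    Assumption: items are separated only by commas and the word 'and'.
    """
    parts = re.split(r"\s*,\s*(?:and\s+)?|\s+and\s+", np_list)
    return [p.strip() for p in parts if p.strip()]

def discriminator_queries(instance: str) -> list[str]:
    """Instantiate the two discriminator phrases from the slide for one instance."""
    return [f"Country AND {instance}", f'"Countries such as {instance}"']

# Noun phrase lists taken from the slide.
matched = [
    "the United Kingdom and Canada",
    "India",
    "North Korea, Iran, India and Pakistan",
]

# Collect each instance once, preserving first-seen order.
instances: list[str] = []
for np_list in matched:
    for item in split_np_list(np_list):
        if item not in instances:
            instances.append(item)

queries = discriminator_queries(instances[0])
```

Each query in `queries` would then be sent to a search engine, and the returned hit count used as the frequency for that instance.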


[Diagram labels: Extractor, Assessor, Database]

Search Engine Interface
  Distributes jobs to different search engines
Extractor
  Rule instantiation
  Information extraction
Assessor
  Discriminator phrase construction
  Assessment of information

Search Engine Interface

Metaphor: Information Food Chain
  Search engine: herbivore
  KnowItAll: carnivore
Why build on top of search engines?
  No need to duplicate existing work
  Low cost/time/effort
Query Distribution
  Make sure not to overload search engines
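One simple way to spread queries without overloading any single engine is round-robin assignment with a minimum gap between queries to the same engine. The sketch below only computes a schedule and sends nothing; the function name `distribute`, the engine names, and the one-second interval are all assumptions for illustration, not details from the talk.

```python
from itertools import cycle

def distribute(queries, engines, min_interval_s=1.0):
    """Assign queries to engines round-robin.

    Returns (engine, query, send_time_s) tuples, where send_time_s is the
    earliest time that keeps at least min_interval_s between two queries
    to the same engine.
    """
    next_free = {engine: 0.0 for engine in engines}
    schedule = []
    for query, engine in zip(queries, cycle(engines)):
        send_time = next_free[engine]
        schedule.append((engine, query, send_time))
        next_free[engine] = send_time + min_interval_s
    return schedule

# Hypothetical engine names and queries.
plan = distribute(
    ["countries such as", "cities such as", "films such as", "actors such as"],
    ["engine_a", "engine_b"],
)
```

With two engines, consecutive queries alternate between them, so each engine sees only every second query.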

Extractor

Extraction Template Examples
  NP1 {“,”} “such as” NPList2
  NP1 {“,”} “and other” NP2
  NP1 {“,”} “is a” NP2
All are domain-independent!
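To make the “such as” template concrete, here is a rough regular-expression approximation. It is only a sketch: it treats any run of capitalized words (optionally preceded by “the”) as a proper noun phrase, which is an assumption standing in for real noun phrase chunking, and the name `such_as_pattern` is hypothetical.

```python
import re

# Assumption: a proper noun phrase is one or more capitalized words,
# optionally preceded by "the".
PROPER = r"(?:the\s+)?[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*"
# A list of proper noun phrases joined by commas and an optional final "and".
NP_LIST = rf"{PROPER}(?:\s*,\s*{PROPER})*(?:\s*,?\s+and\s+{PROPER})?"

def such_as_pattern(plural_class: str) -> re.Pattern:
    """Build a pattern for: <plural class> {","} "such as" NPList2."""
    # Only the class word is matched case-insensitively, so the
    # capitalization test on the list items still applies.
    return re.compile(
        rf"\b(?i:{re.escape(plural_class)}),?\s+such\s+as\s+({NP_LIST})"
    )

text = "They visited countries such as North Korea, Iran, India and Pakistan last year."
match = such_as_pattern("countries").search(text)
np_list = match.group(1) if match else None
```

The same pattern works for any class word, which is the sense in which the template is domain-independent.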

Extractor (2)

Noun phrase analysis
  A. “China is a country in Asia”
  B. “Garth Brooks is a country singer”
In A, the word “country” is the head of a simple noun phrase.
In B, the word “country” is not the head of a simple noun phrase.
So China is indeed a country, while Garth Brooks is not a country.
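The head test on sentences A and B can be imitated with a toy heuristic: take the words after “is a” up to the first preposition as the simple noun phrase, and treat its last word as the head. This is an assumption made purely for illustration (a real system would use a proper noun phrase chunker), and `simple_np_head` and the preposition list are hypothetical.

```python
# Small, deliberately incomplete preposition list (assumption).
PREPOSITIONS = {"in", "of", "on", "at", "from", "with", "for"}

def simple_np_head(sentence: str, copula: str = "is a"):
    """Return the head of the simple noun phrase that follows the copula.

    Toy rule: the simple NP runs from the copula to the first preposition
    (or the end of the sentence); its head is its last word.
    """
    words = [w.lower() for w in sentence.rstrip(".").split()]
    cop = copula.split()
    for i in range(len(words) - len(cop) + 1):
        if words[i:i + len(cop)] == cop:
            np_words = []
            for word in words[i + len(cop):]:
                if word in PREPOSITIONS:
                    break
                np_words.append(word)
            return np_words[-1] if np_words else None
    return None

head_a = simple_np_head("China is a country in Asia")
head_b = simple_np_head("Garth Brooks is a country singer")
```

Under this rule sentence A yields the head “country”, so the extraction is accepted, while sentence B yields “singer”, so it is rejected for the class Country.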

Extractor (3)

Rule Template:
  NP1 “such as” NPList2
  & head(NP1) = plural(name(Class1))
  & properNoun(head(each(NPList2)))
  => instanceOf(Class1, head(each(NPList2)))
The Extractor generates a rule for “Country” from this template by substituting “Country” for “Class1”.
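The substitution step can be sketched as a small function that fills the class name into the template. The `Rule` fields and the naive English `plural` helper below are assumptions for illustration; the slides do not say how pluralization is implemented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    class_name: str       # Class1 in the template
    head_constraint: str  # required value of head(NP1)
    keywords: str         # search keywords derived from the rule

def plural(noun: str) -> str:
    """Naive English pluralization (assumption, covers only regular nouns)."""
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    if noun.endswith(("s", "x", "ch", "sh")):
        return noun + "es"
    return noun + "s"

def instantiate_such_as_rule(class_name: str) -> Rule:
    """Substitute class_name for Class1 in the 'such as' template."""
    head = plural(class_name.lower())
    return Rule(class_name, head, f"{head} such as")

country_rule = instantiate_such_as_rule("Country")
```

For “Country” this reproduces the head constraint “countries” and the keywords “countries such as” shown on the work flow slide.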

Assessor

Naïve Bayesian Model
  Features: hit counts returned by the search engine
  Event to predict: whether the extracted information is a fact
Adjusting the threshold
  Trade-off between precision and recall
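A naïve Bayes assessor of this kind combines per-feature likelihoods into a posterior probability that the extraction is a fact, and then compares it with a threshold. The sketch below shows the standard calculation; the prior, the likelihood values, and the function names are illustrative assumptions, not numbers from the paper.

```python
from math import prod

def posterior(prior: float, likelihoods) -> float:
    """P(fact | f1..fn) under the naive Bayes independence assumption.

    likelihoods is a sequence of (P(fi | fact), P(fi | not fact)) pairs.
    """
    pos = prior * prod(p for p, _ in likelihoods)
    neg = (1.0 - prior) * prod(q for _, q in likelihoods)
    total = pos + neg
    return pos / total if total > 0 else 0.0

def accept(probability: float, threshold: float) -> bool:
    """Raising the threshold favors precision; lowering it favors recall."""
    return probability >= threshold

# Made-up likelihoods for two discriminator features.
p_fact = posterior(0.5, [(0.8, 0.2), (0.6, 0.3)])
strict = accept(p_fact, 0.9)
lenient = accept(p_fact, 0.5)
```

Here the same extraction is rejected at the strict threshold and accepted at the lenient one, which is the precision/recall trade-off in miniature.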

Assessor (2)

Use bootstrapping to learn P(fi|Φ) and P(fi|¬Φ)
Define PMI(I, D) = |Hits(D+I)| / |Hits(I)|
  I: the extracted NP
  D: discriminator phrase
Four ways to model P(fi|Φ) and P(fi|¬Φ):
  Hits-Thresh: P(hits > Hits(D+I) | Φ)
  Hits-Density: p(hits = Hits(D+I) | Φ)
  PMI-Thresh: P(pmi > PMI(I,D) | Φ)
  PMI-Density: p(pmi = PMI(I,D) | Φ)
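As a worked illustration of the PMI definition and a Thresh-style estimate, the sketch below computes PMI from two hit counts and estimates P(pmi > PMI(I,D) | Φ) as the fraction of known-positive examples with a larger PMI. The hit counts and training values are invented, and estimating the probability by a simple empirical fraction is an assumption about how such a function could be fitted.

```python
def pmi(hits_d_and_i: int, hits_i: int) -> float:
    """PMI(I, D) = |Hits(D+I)| / |Hits(I)|; defined as 0 when I has no hits."""
    return hits_d_and_i / hits_i if hits_i > 0 else 0.0

def thresh_probability(training_values, observed: float) -> float:
    """Empirical estimate of P(value > observed) over a training sample."""
    if not training_values:
        return 0.0
    above = sum(1 for value in training_values if value > observed)
    return above / len(training_values)

# Hypothetical hit counts for one instance and one discriminator phrase.
score = pmi(hits_d_and_i=50, hits_i=1000)

# Hypothetical PMI values of instances already believed to be facts.
positive_pmis = [0.01, 0.02, 0.1, 0.2]
p_thresh = thresh_probability(positive_pmis, score)
```

The Hits-Thresh variant would use the raw count |Hits(D+I)| in place of the PMI ratio, with the same kind of estimate.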

Experimental Results

Precision vs. Recall
  Thresh better than Density
  PMI better than Hits

Experimental Results (2)

Running time: 4 days
Web pages retrieved vs. time
  3000 pages/hour
New facts vs. Web pages retrieved
  From 1 new fact / 3 pages to 1 new fact / 7 pages

Conclusion & Future Work

Conclusion:
  Domain-independent rule templates
  Rules generated from rule templates
  Built on top of search engines
  Assessor model: more data, more accurate
Future work:
  Learn domain-specific rules to improve recall
  Automatically extend the ontology

Q & A

Thanks!
