Taxonomy-Aided Massive Relationship Extraction from the Web


The Task

• Goal:
  – Extract instance pairs that satisfy a given relationship from the Web.
  – E.g., (Microsoft, Redmond) satisfies the HeadquarterOf relationship.

• Input:
  – Web corpus
  – Probase taxonomy: classes/instances

Traditional Approaches

• Traditional approach
  – Manually supply a set of seed pairs for a given relationship
  – Find patterns from the seed pairs
  – Use the patterns to find more pairs

• Problem: Scalability
  – Cannot manually supply seed pairs for millions of relationships
  – Relationship extraction is time-consuming even for a single relationship

• Problem: Quality
  – Semantics are not guaranteed: for HeadquarterOf, for example, nothing guarantees that the first element of an extracted pair is a company entity and the second is a location entity

Our goal: targeting all the relationships we can think of …

• How large is R, the set of target relationships?

• Each is in the form of

• Example:

Challenge: where does R come from?

• From Probase?

• From Freebase?

• From Tables?

• Language patterns?
  – What is the <attr> of <instance>?

Challenge: What are the seed pairs for each relationship?

• Not enough seed pairs
  – Possible to miss some useful patterns
  – Hard to evaluate the quality of patterns

• Two-phase approach
  1. From one to several pairs
     • Extract high-quality patterns, then extract seed pairs
  2. From several to more pairs
     • Extract discriminative terms, then extract more pairs
     • Terms are more general and carry more semantics

Challenge: What are the candidate entities?

• Cannot afford to scan the web corpus for each relationship

• Scan web texts to find candidates for millions of relationships simultaneously

• Need to avoid generating massive candidate pairs
  – Massive relations * massive candidate pairs = infeasible

• Solution:
  – Use instances in a taxonomy

Approach overview

• Phase 1: from one pair to several seed pairs
  – Extract high-quality patterns
  – Extract seed pairs

• Phase 2: from seed pairs to more pairs
  – Map classes in seed pairs and taxonomy
  – Extract discriminative terms
  – Extract more pairs

Overview

• Phase 1
  – Input relations: (spider-man, director, Sam Raimi), (Microsoft, headquarter, Redmond), (telephone, invented, Bell), …
  – Web pages:
    • Page 1: You can view the upcoming Toy Story 3 trailer here!
    • Page 2: View company contact information for Pretty Woman on IMDbPro
    • …
  – Patterns: (The director of, is, .), (The headquarter of, is, .), (, is invented by, .), …
  – Seed pairs:
    • (Natural Born Killers, Oliver Stone), (Army of Darkness, Sam Raimi), (Toy Story, John Lasseter)
    • (Microsoft, Redmond), (IBM, Armonk), (Google, Mountain View)
    • (telephone, Bell), (polygraph machine, Mackenzie), (dry cleaning, Baptiste)
    • …

• Phase 2
  – Taxonomy: Movies: Toy Story, Matrix, …; Companies: Microsoft, IBM, …; Directors: Stone, Lasseter, …; …
  – Classes: R1: (movies, directors); R2: (companies, locations); R3: (products, persons); …
  – Terms: (director, directed, movie, …), (location, located, headquarter, …), (inventor, invented, …), …
  – More pairs: (Titanic, Cameron), …

Phase 1: from one to several

• Step 1: extract patterns for each relation
  – E.g., relation: (Microsoft, headquarter, Redmond)
  – Pattern: (The headquarter of el is er.)
  – General pattern: (pl, pm, pr)

• Step 2: extract seed pairs based on patterns
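The general pattern (pl, pm, pr) from Step 1 can be read as the text to the left of, between, and to the right of the two instances. A minimal sketch of that reading follows; the function name `extract_pattern` and the plain substring matching are assumptions made for illustration, not the method used in the original system.

```python
from typing import Optional, Tuple


def extract_pattern(sentence: str, el: str, er: str) -> Optional[Tuple[str, str, str]]:
    """Split a sentence around el and er into (pl, pm, pr).

    Returns None when el does not occur or er does not occur after el.
    """
    i = sentence.find(el)
    if i < 0:
        return None
    j = sentence.find(er, i + len(el))
    if j < 0:
        return None
    pl = sentence[:i].strip()             # text left of el
    pm = sentence[i + len(el):j].strip()  # text between el and er
    pr = sentence[j + len(er):].strip()   # text right of er
    return (pl, pm, pr)


pattern = extract_pattern(
    "The headquarter of Microsoft is Redmond.", "Microsoft", "Redmond"
)
print(pattern)  # ('The headquarter of', 'is', '.')
```

The resulting triple matches the pattern form shown in the overview, (The headquarter of, is, .).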

Step 1: extract patterns

• Input:
  – One tuple (el, r, er) for each relation
  – Web corpus

• Output:
  – Pattern for each relation

• Approach:
  – MAP:
    • Scan web texts and find sentences that contain all three elements of a relation
    • For each selected sentence, output a candidate pattern (pl, pm, pr)
  – REDUCE:
    • Rank the patterns of each relation by TF-IDF value and select the top-ranked one
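The MAP and REDUCE steps above can be sketched in a few lines of single-process Python. The slides do not give the exact TF-IDF formula, so the version here (pattern count within a relation times a smoothed log inverse relation frequency) is an assumption, as are the function names `map_patterns` and `reduce_patterns`.

```python
import math
from collections import Counter, defaultdict


def map_patterns(sentences, relations):
    """MAP: emit (relation, pattern) for sentences containing el, r and er."""
    for s in sentences:
        for (el, r, er) in relations:
            if r not in s:
                continue
            i = s.find(el)
            j = s.find(er, i + len(el)) if i >= 0 else -1
            if j < 0:
                continue
            pattern = (s[:i].strip(), s[i + len(el):j].strip(), s[j + len(er):].strip())
            yield (el, r, er), pattern


def reduce_patterns(emitted):
    """REDUCE: keep the pattern with the highest TF-IDF score per relation."""
    tf = defaultdict(Counter)
    for rel, pattern in emitted:
        tf[rel][pattern] += 1
    n = len(tf)  # number of relations, playing the role of documents
    df = Counter(p for counts in tf.values() for p in counts)
    best = {}
    for rel, counts in tf.items():
        # Assumed scoring: count * log((1 + n) / df), smoothed to stay positive.
        best[rel] = max(counts, key=lambda p: counts[p] * math.log((1 + n) / df[p]))
    return best


sentences = [
    "The headquarter of Microsoft is Redmond.",
    "The headquarter of Microsoft is Redmond.",
    "Microsoft moved its headquarter to Redmond in 1986.",
    "Spider-Man was made by director Sam Raimi.",
]
relations = [
    ("Microsoft", "headquarter", "Redmond"),
    ("Spider-Man", "director", "Sam Raimi"),
]
best = reduce_patterns(map_patterns(sentences, relations))
print(best[("Microsoft", "headquarter", "Redmond")])  # ('The headquarter of', 'is', '.')
```

In a real MapReduce job the relation tuple would serve as the shuffle key, so each reducer sees all candidate patterns of one relation.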

Step 2: extract seed pairs

• Input:
  – One pattern for each relation
  – Suchas dataset
  – Web corpus

• Output:
  – k seed pairs for each relation

• Approach:
  – MAP:
    • Scan web texts and find sentences in which any pattern occurs
    • Output candidate seed pairs
  – REDUCE:
    • Rank the candidate pairs of each relation by frequency
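A rough sketch of this step: each pattern is compiled into a regular expression with two capture groups, matches are filtered against known instances, and the survivors are ranked by frequency. How the Suchas dataset is used is not spelled out on the slide; treating it as a filter over candidate instances (the `known` argument) is an assumption, as are all function names.

```python
import re
from collections import Counter, defaultdict


def pattern_to_regex(pl, pm, pr):
    """Compile (pl, pm, pr) into a regex whose two groups capture el and er."""
    head = re.escape(pl) if pl else r"^"
    tail = re.escape(pr) if pr else r"$"
    return re.compile(head + r"\s*(.+?)\s+" + re.escape(pm) + r"\s+(.+?)\s*" + tail)


def map_seed_pairs(sentences, patterns, known):
    """MAP: emit (relation, (el, er)) when a pattern matches known instances."""
    regexes = {rel: pattern_to_regex(*p) for rel, p in patterns.items()}
    for s in sentences:
        for rel, rx in regexes.items():
            m = rx.search(s)
            if not m:
                continue
            el, er = m.group(1), m.group(2)
            lefts, rights = known[rel]  # assumed: instance sets from Suchas
            if el in lefts and er in rights:
                yield rel, (el, er)


def reduce_seed_pairs(emitted, k):
    """REDUCE: keep the k most frequent candidate pairs per relation."""
    counts = defaultdict(Counter)
    for rel, pair in emitted:
        counts[rel][pair] += 1
    return {rel: [p for p, _ in c.most_common(k)] for rel, c in counts.items()}


patterns = {"director": ("The director of", "is", ".")}
known = {"director": ({"Toy Story", "Army of Darkness"}, {"John Lasseter", "Sam Raimi"})}
sentences = [
    "The director of Toy Story is John Lasseter.",
    "The director of Toy Story is John Lasseter.",
    "The director of Army of Darkness is Sam Raimi.",
    "The director of this show is unknown.",
]
seeds = reduce_seed_pairs(map_seed_pairs(sentences, patterns, known), k=2)
print(seeds)
# {'director': [('Toy Story', 'John Lasseter'), ('Army of Darkness', 'Sam Raimi')]}
```

The last sentence matches the pattern but is dropped because neither captured string is a known instance, which is the kind of semantic check a taxonomy makes possible.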

Phase 2: from several to more

• Approach
  – Step 1: match classes in seed pairs and taxonomy
  – Step 2: extract terms for each relation
  – Step 3: extract pairs based on terms

• Advantage of term-based approach
  – More efficient
  – More general

“Suchas” dataset

• Extract information from the web corpus with patterns like: “NNS including/such as I1, I2, …, and In”

• E.g., “companies including MS, IBM and Google”
  – Class label: companies
  – Instances: MS, IBM and Google
  – Combine all instances of a class label and rank them with TF-IDF
    • Instance as term, and class label as document
    • companies: IBM 0.1, MS 0.09, Google 0.08, Intel 0.078, …
    • countries: China 0.2, USA 0.18, France 0.15, …
  – Millions of instances and classes
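The construction of the “Suchas” dataset can be illustrated with a small sketch. It approximates the plural noun (NNS) with a single word matched by a regular expression instead of a part-of-speech tagger, and it uses an assumed TF-IDF variant (relative frequency within a class times a smoothed log inverse class frequency); both choices, and the function names, are illustrative only.

```python
import math
import re
from collections import Counter, defaultdict

# Assumption: the class label is the single word before "including"/"such as".
SUCHAS = re.compile(r"(\w+)\s+(?:including|such as)\s+(.+)")


def parse_suchas(sentence):
    """Return (class_label, [instances]) or None if the pattern is absent."""
    m = SUCHAS.search(sentence)
    if not m:
        return None
    label = m.group(1).lower()
    items = re.split(r",\s*(?:and\s+)?|\s+and\s+", m.group(2).rstrip(". "))
    return label, [x.strip() for x in items if x.strip()]


def rank_instances(sentences):
    """Rank instances per class label, instance as term and class as document."""
    counts = defaultdict(Counter)
    for s in sentences:
        parsed = parse_suchas(s)
        if parsed:
            label, items = parsed
            counts[label].update(items)
    n = len(counts)
    df = Counter(i for c in counts.values() for i in c)
    ranked = {}
    for label, c in counts.items():
        total = sum(c.values())
        scores = {i: (k / total) * math.log((1 + n) / df[i]) for i, k in c.items()}
        ranked[label] = sorted(scores, key=scores.get, reverse=True)
    return ranked


print(parse_suchas("companies including MS, IBM and Google"))
# ('companies', ['MS', 'IBM', 'Google'])
```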

Step 1: map classes

• For each relation, obtain the left class and the right class

• Find the most relevant class in the taxonomy for the left/right class
  – Measurement of “relevance”
    • For left/right class c and c’ in Suchas

Step 2: extract terms

• Input:
  – k seed pairs for each relation
  – Web corpus

• Output:
  – Terms for each relation
    • A term can appear in multiple relations

• Approach:
  – MAP:
    • Scan web texts and find sentences in which a seed pair appears
    • Take each word (except the words in the instance pair) as a candidate term
  – REDUCE:
    • Rank the terms of each relation by TF-IDF value
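A minimal sketch of term extraction, assuming lowercase alphabetic tokenization and the same smoothed TF-IDF variant as a stand-in for whatever scoring the original system used. The names `map_terms` and `reduce_terms` are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def map_terms(sentences, seed_pairs):
    """MAP: for sentences containing a seed pair, emit every other word."""
    for s in sentences:
        for rel, pairs in seed_pairs.items():
            for el, er in pairs:
                if el in s and er in s:
                    # Drop the instance words, keep the rest as candidate terms.
                    rest = s.replace(el, " ").replace(er, " ")
                    for word in re.findall(r"[a-z]+", rest.lower()):
                        yield rel, word


def reduce_terms(emitted):
    """REDUCE: rank terms per relation by an assumed TF-IDF score."""
    tf = defaultdict(Counter)
    for rel, word in emitted:
        tf[rel][word] += 1
    n = len(tf)
    df = Counter(w for c in tf.values() for w in c)
    ranked = {}
    for rel, c in tf.items():
        score = {w: k * math.log((1 + n) / df[w]) for w, k in c.items()}
        ranked[rel] = sorted(score, key=score.get, reverse=True)
    return ranked


seed_pairs = {
    "director": [("Toy Story", "John Lasseter")],
    "headquarter": [("Microsoft", "Redmond")],
}
sentences = [
    "Toy Story was directed by John Lasseter.",
    "John Lasseter directed Toy Story.",
    "Microsoft was founded before moving to Redmond.",
]
ranked = reduce_terms(map_terms(sentences, seed_pairs))
print(ranked["director"])  # ['directed', 'by', 'was']
```

Because "was" occurs in both relations, its inverse frequency is lower and it falls below "directed" and "by", which is the discriminative effect the ranking is meant to produce.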

Step 3: extract more pairs

• Input:
  – Terms for each relation
  – Suchas dataset
  – Web corpus

• Output:
  – More instance pairs for each relation

• Approach:
  – MAP:
    • Scan web texts and find sentences in which any term(s) occur(s)
    • Generate candidate pairs
  – REDUCE:
    • Rank the candidate pairs of each relation
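The following sketch shows one way the term-based extraction could work: a sentence that contains a term of a relation is searched for instances of the relation's left and right classes, and each left/right combination becomes a candidate pair. The slide does not say how candidates are ranked in REDUCE; ranking by frequency is an assumption here, and the substring scan over instances is a simplification for brevity.

```python
import re
from collections import Counter, defaultdict


def map_pairs(sentences, terms, classes, taxonomy):
    """MAP: emit candidate pairs from sentences that contain a relation term."""
    for s in sentences:
        words = set(re.findall(r"\w+", s.lower()))
        for rel, rel_terms in terms.items():
            if not words & rel_terms:
                continue
            left_cls, right_cls = classes[rel]
            lefts = [i for i in taxonomy[left_cls] if i in s]
            rights = [i for i in taxonomy[right_cls] if i in s]
            for el in lefts:
                for er in rights:
                    yield rel, (el, er)


def reduce_pairs(emitted):
    """REDUCE: rank candidate pairs per relation (assumed: by frequency)."""
    counts = defaultdict(Counter)
    for rel, pair in emitted:
        counts[rel][pair] += 1
    return {rel: c.most_common() for rel, c in counts.items()}


terms = {"director": {"directed", "director"}}
classes = {"director": ("movies", "directors")}
taxonomy = {
    "movies": {"Titanic", "Toy Story"},
    "directors": {"Cameron", "Lasseter"},
}
sentences = [
    "Titanic was directed by Cameron.",
    "Cameron directed Titanic in 1997.",
    "Lasseter directed Toy Story.",
    "Titanic sank in 1912.",
]
pairs = reduce_pairs(map_pairs(sentences, terms, classes, taxonomy))
print(pairs["director"])
# [(('Titanic', 'Cameron'), 2), (('Toy Story', 'Lasseter'), 1)]
```

Unlike a fixed pattern, the term "directed" matches regardless of whether the movie or the director comes first in the sentence, which illustrates why terms are described as more general.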

Implementation

• MAP operation
  – Build a hash table for the one pair or the seed pairs
  – Build a hash table for the patterns or the terms

• Retrieving a value from a hash table costs nearly O(1)
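As an illustration of the hash-table lookup, the sketch below indexes taxonomy instances in a Python dict and probes it with the word n-grams of a sentence, so the cost per sentence depends on its length and not on the number of instances. The n-gram probing, the `max_len` limit, and the function names are assumptions for illustration.

```python
import re


def build_index(taxonomy):
    """Map each lowercased instance to the set of classes it belongs to."""
    index = {}
    for cls, instances in taxonomy.items():
        for inst in instances:
            index.setdefault(inst.lower(), set()).add(cls)
    return index


def find_instances(sentence, index, max_len=3):
    """Probe the index with every n-gram of up to max_len words."""
    tokens = re.findall(r"\w+", sentence.lower())
    hits = []
    for i in range(len(tokens)):
        for n in range(1, max_len + 1):
            if i + n > len(tokens):
                break
            gram = " ".join(tokens[i:i + n])
            if gram in index:  # average-case O(1) dict lookup
                hits.append((gram, index[gram]))
    return hits


taxonomy = {
    "companies": {"Microsoft", "IBM"},
    "locations": {"Redmond", "Mountain View"},
}
index = build_index(taxonomy)
print(find_instances("Microsoft is based in Redmond, not Mountain View.", index))
# [('microsoft', {'companies'}), ('redmond', {'locations'}), ('mountain view', {'locations'})]
```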

Implementation

• Challenge: when the taxonomy is too big

• Partition terms and classes into groups

• Reduce sentences by group ID of term

• For each group, use instances in the corresponding classes

[Figure: classes and terms partitioned into group1 and group2; labels shown: movie, director, actor, directed, role, company, CEO]
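The grouping idea can be sketched as follows: each term carries a group ID, sentences are routed to the groups of the terms they contain, and each group only needs the instances of its own classes in memory. How terms and classes are assigned to groups is not described on the slide, so the `term_group` and `group_classes` mappings below are hypothetical inputs loosely based on the figure labels.

```python
import re
from collections import defaultdict


def route_by_group(sentences, term_group):
    """Send each sentence to every group whose terms it contains."""
    buckets = defaultdict(list)
    for s in sentences:
        words = set(re.findall(r"\w+", s.lower()))
        for gid in sorted({term_group[w] for w in words if w in term_group}):
            buckets[gid].append(s)
    return dict(buckets)


def instances_for_group(gid, group_classes, taxonomy):
    """Collect only the instances of the classes assigned to one group."""
    return {inst for cls in group_classes[gid] for inst in taxonomy.get(cls, ())}


# Hypothetical grouping for illustration only.
term_group = {"directed": "group1", "role": "group1", "ceo": "group2"}
group_classes = {
    "group1": ["movie", "director", "actor"],
    "group2": ["company", "CEO"],
}
taxonomy = {
    "movie": {"Titanic"},
    "director": {"Cameron"},
    "company": {"Microsoft"},
}
sentences = ["Cameron directed Titanic.", "The CEO of Microsoft spoke."]

buckets = route_by_group(sentences, term_group)
print(buckets)
# {'group1': ['Cameron directed Titanic.'], 'group2': ['The CEO of Microsoft spoke.']}
print(instances_for_group("group2", group_classes, taxonomy))  # {'Microsoft'}
```

With this layout a worker handling group2 never loads movie or director instances, which keeps the per-worker dictionary small even when the full taxonomy does not fit in memory.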

Results

c1                rel                      c2               db_x    our_db x   our_x   Preci_1
football player   pastteam                 football team    3042    2807       6569    0.996745
writer            notablework              book             274     262        3207    0.9920635
book              author                   writer           5340    4736       8579    0.9980587
astronaut         mission                  space mission    308     275        301     0.9962264
single            musicalband              musical artist   6306    5362       11591   0.9657787
single            musicalartist            musical artist   6306    5392       11734   0.9637105
musical artist    associatedband           band             5653    5316       17200   0.9280287
musical artist    associatedmusicalartist  band             5653    5288       17240   0.9245726
album             artist                   band             17145   13261      31743   0.9748375
lake              outflow                  river            481     380        2276    0.997076
radio station     sisterstation            radio station    3527    3276       10878   0.9701493
airline           headquarter              country          1578    1442       1938    0.9234303
company           keyperson                person           690     508        5345    0.9847826
film              starring                 actor            17630   13959      17769   0.975433
videogame         publisher                company          6486    6021       9669    0.9507834
lake              inflow                   river            434     332        2055    0.9825784
television show   creator                  actor            487     423        5267    0.9198895
film              director                 actor            8023    6629       19007   0.9688858
automobile        manufacturer             company          2209    1936       2588    0.8061288
radio station     owner                    company          2302    1920       8144    0.9755572

2361 5173 0.881

Efficiency

• Number of instances
  – More than 1 million

• Runtime
  – Step 1: less than 30 min
  – Step 2: about 1 hour
  – Step 3: about 3 hours

• Analysis
  – Good scalability
    • Scan the whole web corpus twice
    • Process all relations simultaneously
    • All instances are organized in a dictionary with O(1) access cost
