
Discovery and Reconciliation of Entity Type Taxonomies

Soumen Chakrabarti

IIT Bombay

www.cse.iitb.ac.in/~soumen

Searching with types and entities

Answer types:
• "How far is it from Rome to Paris?" → type=distance#n#1 NEAR words={Rome, Paris}

Restrictions on match conditions:
• "How many movies did 20th Century Fox release in 1923?"
• A distractor match: "…by 1950, the forest had only 1923 foxes left…"
• type={number#n#1, hasDigit} NEAR … year=1923, organization="20th Century Fox"

Corpus = set of docs; doc = token sequence; tokens connected to lexical networks.

Searching personal info networks

• No clean schema, data changes rapidly
• Lots of generic "graph proximity" information

[Figure: a personal information network laid out along a timeline. A PDF file node (paper "XYZ" by A. Thor, with abstract and LastMod timestamp) and two email nodes ("…your ECML submission titled XYZ has been accepted…" and "Could I get a preprint of your recent ECML paper?", each with EmailDate and EmailTo edges to A. U. Thor) all connect to a canonical node for ECML. Goal: quickly find this file.]

Building blocks

Structured "seeds" of type info:
• WordNet, Wikipedia, OpenCYC, …

Semi-structured sources:
• List of faculty members in a department
• Catalog of products at an ecommerce site

Unstructured open-domain text:
• Email bodies, text of papers, blog text, Web pages, …

Discovery and extension of type attachments:
• Hearst patterns, list extraction, NE taggers

Reconciling federated type systems:
• Schema and data integration

Query execution engines using type catalogs

Hearst patterns and enhancements

Hearst, 1992; KnowItAll (Etzioni+ 2004):
• "T such as x", "x and other Ts", "x or other Ts", "T x", "x is a T", "x is the only T", …

C-PANKOW (Cimiano and Staab 2005)

Suitable for unformatted natural language; generally high precision, low recall
• If there are few possible Ts, use a named entity tagger (a toy sketch follows)
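A minimal sketch of how such patterns can be matched with regular expressions; the pattern set and the example sentence are illustrative, not from the talk:

```python
import re

# A few classic Hearst patterns; each captures (type T, instance x).
# This toy version matches single capitalized words as instances.
PATTERNS = [
    re.compile(r"(\w+(?: \w+)?) such as ([A-Z]\w+)"),  # "T such as x"
    re.compile(r"([A-Z]\w+) and other (\w+)"),         # "x and other Ts"
    re.compile(r"([A-Z]\w+) is a (\w+)"),              # "x is a T"
]

def extract_is_a(text):
    """Yield (instance, type) pairs found by the patterns above."""
    for pat in PATTERNS:
        for m in pat.finditer(text):
            a, b = m.groups()
            # In "T such as x" the type comes first; otherwise the instance does.
            if "such as" in pat.pattern:
                yield b, a
            else:
                yield a, b

print(list(extract_is_a("European cities such as Paris attract tourists. "
                        "Paris and other cities grew quickly.")))
# [('Paris', 'European cities'), ('Paris', 'cities')]
```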

Set extraction

• Each node of a graph is a word
• An edge connects words w_i and w_j if they occur together in more than k docs
  • Use apriori-style searches to enumerate
  • Edge weight depends on #docs

Given a set of words Q, set up a Pagerank (sketch below):
• Random surfer on the word graph
• W.p. d, jump to some element of Q
• W.p. 1−d, walk to a neighbor

Present the nodes (words) with the largest Pagerank.
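A minimal sketch of this random walk with restart as power iteration; the toy graph, the restart probability d=0.25, and the function name are illustrative assumptions:

```python
import numpy as np

def set_expansion_pagerank(W, Q, d=0.25, iters=100):
    """Random-walk-with-restart scores over a word co-occurrence graph.

    W : (n, n) symmetric matrix of co-occurrence edge weights
    Q : indices of the query words (the restart set)
    d : probability of jumping back to a random element of Q
    """
    n = W.shape[0]
    # Column-stochastic transition matrix: step to a neighbor in
    # proportion to edge weight.
    P = W / np.maximum(W.sum(axis=0, keepdims=True), 1e-12)
    r = np.full(n, 1.0 / n)                      # initial score vector
    jump = np.zeros(n); jump[Q] = 1.0 / len(Q)   # uniform restart over Q
    for _ in range(iters):
        r = d * jump + (1 - d) * P @ r
    return r  # words with the largest scores expand the set Q

# Toy graph over 4 words; query set {word 0}
W = np.array([[0, 3, 1, 0], [3, 0, 2, 0], [1, 2, 0, 1], [0, 0, 1, 0]], float)
scores = set_expansion_pagerank(W, Q=[0])
print(np.argsort(-scores))  # words ranked by proximity to the query set
```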

Example

List extraction

• Given a current set of candidate Ts
• Limit to candidates having high confidence
• Select a random subset of k=4 candidates
• Generate a query from the selected candidates
• Download the response documents
• Look for lists containing candidate mentions
• Extract more instances from the lists found (see the sketch below)

Boosts extraction rate 2 to 5-fold.
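A sketch of this bootstrap loop; `search`, `fetch`, and `find_lists` are hypothetical helpers standing in for the web-search and list-parsing machinery:

```python
import random

def expand_by_lists(candidates, confidence, rounds=5, k=4, min_conf=0.8):
    """Bootstrap set expansion via web lists (hypothetical helpers assumed).

    candidates : set of instance strings found so far
    confidence : dict mapping instance -> confidence score
    """
    instances = set(candidates)
    for _ in range(rounds):
        # Keep only high-confidence candidates, then sample k of them.
        trusted = [x for x in instances if confidence.get(x, 0) >= min_conf]
        if len(trusted) < k:
            break
        seeds = random.sample(trusted, k)
        query = " ".join(f'"{s}"' for s in seeds)    # e.g. '"Ford" "Honda" ...'
        for url in search(query):                    # hypothetical web search
            page = fetch(url)                        # hypothetical HTTP fetch
            for lst in find_lists(page):             # hypothetical list parser
                # Harvest co-listed items from lists that mention a seed.
                if any(s in lst for s in seeds):
                    instances.update(lst)
    return instances
```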

Wrapping, scraping, tagging

• HTML formatting clues help extract records and fields
• Extensive work in the DB, KDD, ML communities

[Figure: an extracted record graph. Node P167 is-a Paper, with has-author edges to Gerhard Pass and Gregor Heinrich and a has-title edge to "Investigating word correlation…".]

Reconciling type systems

• WordNet: small and precise
• Wikipedia: much larger, less controlled
• Collect into a common is-a database

Mapping between taxonomies

Each type has a set of instances:
• Assoc Prof: K. Burn, R. Cook
• Synset: lemmas from leaf instances
• Wikipedia concept: list of instances
• Yahoo topic: set of example Web pages

Goal: establish connections between types; connections could be "soft" or probabilistic.

Cross-training

Set of labels or types B, partly related but not identical to type set A:
• A = Dmoz topics, B = Yahoo topics
• A = personal bookmark topics, B = Yahoo topics

Training docs now come in two flavors:
• Fully labeled with both A and B labels (rare)
• Half-labeled with either an A or a B label

Can B make classification for A more accurate (and vice versa)?

Inductive transfer, multi-task learning (Sarawagi+ 2003)

Notation below: D_A = docs carrying A-labels, D_B = docs carrying B-labels.

Motivation

Symmetric taxonomy mapping:
• Ecommerce catalogs: A = distributor, B = retailer
• Web directories: A = Dmoz, B = Yahoo

Incomplete taxonomies, small training sets:
• Bookmark taxonomy vs. Yahoo

Cartesian label spaces

[Figure: two taxonomies crossed. A Region taxonomy (Top → Regional → {UK, USA, …}) and a Topic taxonomy (Top → Sports → {Baseball, Cricket}) form a grid of (Region, Topic) label pairs; each cell carries a label-pair-conditioned term distribution.]

Labels as features

• A-label known; estimate the B-label
• Suppose we have an A+B labeled training set
• Append the A-label as a discrete-valued "label column" to the term feature values: the augmented feature vector predicts the target B-label

Two problems:
• Most text classifiers cannot balance the importance of very heterogeneous features
• We do not have fully-labeled data, so we must guess the label column (use soft scores instead of 0/1)

SVM-CT: Cross-trained SVM

• S(A,0): train a one-vs-rest SVM ensemble for A on D_A−D_B (docs having only A-labels); it returns |A| scores (signed distance from the separator) for each test doc
• Feed D_B−D_A (docs having only B-labels) through S(A,0) and append the |A| output scores to their term features
• S(B,1): train a one-vs-rest SVM ensemble for B (the target label set) on these augmented vectors
• A test case with a known A-label has the label coded as a vector of +1 and −1 (−1,…,−1,+1,−1,…) alongside its term features
• Alternate to keep refining: S(A,1), S(B,2), S(A,2), … (sketch below)
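A minimal sketch of one cross-training round with scikit-learn, assuming dense bag-of-words matrices are already built; the function and variable names are illustrative:

```python
import numpy as np
from sklearn.svm import LinearSVC

def cross_train_step(X_a, y_a, X_b, y_b):
    """One round of SVM-CT (dense numpy feature matrices assumed).

    X_a, y_a : term vectors / A-labels for docs with only A-labels (D_A - D_B)
    X_b, y_b : term vectors / B-labels for docs with only B-labels (D_B - D_A)
    """
    # S(A,0): one-vs-rest ensemble for A (LinearSVC is one-vs-rest by default).
    s_a0 = LinearSVC().fit(X_a, y_a)

    # |A| signed distances from the separators become extra features
    # (assumes more than two A-labels, so decision_function has |A| columns).
    a_scores = s_a0.decision_function(X_b)
    X_b_aug = np.hstack([X_b, a_scores])

    # S(B,1): one-vs-rest ensemble for the target label set B.
    s_b1 = LinearSVC().fit(X_b_aug, y_b)
    return s_a0, s_b1

def predict_b(s_a0, s_b1, X_test, a_label=None):
    """Predict B-labels; a known A-label is coded as a +1/-1 vector."""
    if a_label is not None:
        coded = np.where(s_a0.classes_ == a_label, 1.0, -1.0)
        extra = np.tile(coded, (X_test.shape[0], 1))
    else:
        extra = s_a0.decision_function(X_test)  # soft guesses when unknown
    return s_b1.predict(np.hstack([X_test, extra]))
```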

SVM-CT anecdotes

Largest-weight Yahoo label features in the discriminant for a Dmoz class:

Topic       | Dmoz class        | Positive                                                    | Negative
Movies      | Genres.Western    | Titles.Western                                              | Titles.Horror
Photography | Techniques+Styles | Pinhole_Photography, 3D_Photography, Panoramic_Photography | Organizations

Largest-weight Dmoz label features for a Yahoo class:

Topic       | Yahoo class         | Positive          | Negative
Photography | Pinhole_Photography | Techniques+Styles | Photographers
Software    | OS.MS_Windows       | OS.MS_Windows     | OS.UNIX

The discriminant reveals relations between A and B:
• One-to-one, many-to-one, related, antagonistic

However, accuracy gains are meager.

EM1D: Info from unlabeled docs

• Use training docs to induce an initial classifier for taxonomy B, say
• Repeat until the classifier is satisfactory:
  • Estimate Pr(β|d) for each unlabeled doc d, β ∈ B
  • Reweigh d by the factor Pr(β|d) and add it to the training set for label β
  • Retrain the classifier

EM1D: expectation maximization with one label set B (Nigam et al.)

Ignores labels from the other taxonomy A.

Stratified EM1D

• Target labels = B
• B-labeled docs are labeled training instances
• Consider A-labeled docs labeled α: these are unlabeled as far as taxonomy B is concerned
• Run EM1D for each row α
• A test instance has a known α: invoke the semi-supervised model for row α to classify

[Figure: a grid with A-topics as rows and B-topics as columns. Docs in D_A−D_B labeled α fill row α, those labeled α′ fill row α′; docs in D_B−D_A (with B-labels) span the columns.]

EM2D: Cartesian-product EM

• Initialize with fully labeled docs, which go to a specific (α, β) cell
• Smear each half-labeled training doc across its label row or column
  • A uniform smear could be bad
  • Use a naïve Bayes classifier to seed

Parameters extended from EM1D:
• π_{α,β}: prior probability for label pair (α, β)
• θ_{α,β,t}: multinomial term probability for (α, β)

[Figure: a grid with labels in A along one axis and labels in B along the other; an A-labeled doc smears across the cells sharing its A-label, a B-labeled doc across the cells sharing its B-label.]

EM2D updates

E-step for an A-labeled document d with c(d) = α, where n(d,t) is the count of term t in d and n(d) = Σ_t n(d,t):

$$\Pr(\alpha,\beta \mid d,\theta) \;\propto\; \pi_{\alpha,\beta}\prod_t \theta_{\alpha,\beta,t}^{\,n(d,t)}, \qquad \beta \in B$$

(symmetrically for a B-labeled document).

M-step, updated class-pair priors:

$$\pi'_{\alpha,\beta} \;=\; \frac{1}{|D|}\Bigg(\sum_{d:\,c(d)\in A}\Pr(\alpha,\beta\mid d,\theta) \;+\; \sum_{d:\,c(d)\in B}\Pr(\alpha,\beta\mid d,\theta)\Bigg)$$

Updated class-pair-conditioned term stats (numpy sketch below):

$$\theta'_{\alpha,\beta,t} \;=\; \frac{\displaystyle\sum_{d:\,c(d)\in A} n(d,t)\Pr(\alpha,\beta\mid d,\theta) \;+\; \sum_{d:\,c(d)\in B} n(d,t)\Pr(\alpha,\beta\mid d,\theta)}{\displaystyle\sum_{d:\,c(d)\in A} n(d)\Pr(\alpha,\beta\mid d,\theta) \;+\; \sum_{d:\,c(d)\in B} n(d)\Pr(\alpha,\beta\mid d,\theta)}$$
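A compact numpy sketch of these updates; the Laplace smoothing and the simple masking of known labels are simplifying assumptions, and names are illustrative:

```python
import numpy as np

def em2d(X, rowlab, collab, nA, nB, iters=20, eps=1e-9):
    """EM over the Cartesian label space A x B (multinomial naive Bayes).

    X      : (n_docs, n_terms) term-count matrix
    rowlab : A-label index per doc, -1 if unknown
    collab : B-label index per doc, -1 if unknown
    """
    n, T = X.shape
    pi = np.full((nA, nB), 1.0 / (nA * nB))    # class-pair priors pi[a, b]
    theta = np.full((nA, nB, T), 1.0 / T)      # term probs theta[a, b, t]

    for _ in range(iters):
        # E-step: unnormalized log Pr(a, b | d) for every cell.
        ll = X @ np.log(theta + eps).reshape(nA * nB, T).T    # (n, nA*nB)
        post = np.log(pi + eps).ravel()[None, :] + ll
        post = np.exp(post - post.max(axis=1, keepdims=True))
        post = post.reshape(n, nA, nB)
        # A known label pins the row (or column); mass smears over the rest.
        for d in range(n):
            if rowlab[d] >= 0:
                mask = np.zeros((nA, nB)); mask[rowlab[d], :] = 1
                post[d] *= mask
            if collab[d] >= 0:
                mask = np.zeros((nA, nB)); mask[:, collab[d]] = 1
                post[d] *= mask
        post /= post.sum(axis=(1, 2), keepdims=True) + eps

        # M-step: class-pair priors and term stats (Laplace-smoothed).
        pi = post.sum(axis=0) / n
        counts = np.einsum('dab,dt->abt', post, X)
        theta = (counts + 1.0) / (counts.sum(axis=2, keepdims=True) + T)
    return pi, theta
```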

Applying EM2D to a test doc

Mapping a B-labeled test doc d to an A-label (e-commerce catalogs):
• Given β, find argmax_α Pr(α, β | d)

Classifying a document d with no labels to an A-label:
• Aggregation: for each β compute Pr(α, β | d), pick the best α
• Guessing (EM2D-G): guess the best β* using a B-classifier, then find argmax_α Pr(α, β* | d)

EM pitfalls: damping factor, early stopping.
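Continuing the numpy sketch above (pi and theta as returned by em2d), the two inference modes might look like this; using the B-marginal as the β* guesser is an assumption standing in for a separate B-classifier:

```python
import numpy as np

def cell_scores(x, pi, theta, eps=1e-9):
    # log Pr(a, b | x) up to a constant, for one term-count vector x
    return np.log(pi + eps) + (np.log(theta + eps) * x).sum(axis=2)

def map_b_to_a(x, beta, pi, theta):
    """Known B-label beta: argmax over alpha of Pr(alpha, beta | x)."""
    return cell_scores(x, pi, theta)[:, beta].argmax()

def classify_aggregate(x, pi, theta):
    """No labels: aggregate over beta, then pick the best alpha."""
    s = np.exp(cell_scores(x, pi, theta))
    return s.sum(axis=1).argmax()

def classify_guess(x, pi, theta):
    """EM2D-G: first guess beta*, then argmax over alpha in that column."""
    s = np.exp(cell_scores(x, pi, theta))
    beta_star = s.sum(axis=0).argmax()   # stand-in for a B-classifier
    return s[:, beta_star].argmax()
```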

Experiments

• Selected 5 Dmoz and Yahoo subtree pairs
• Compare EM2D against:
  • Naïve Bayes, best #features and smoothing
  • EM1D: ignore labels from the other taxonomy, treat those docs as unlabeled
  • Stratified EM1D

Tasks:
• Mapping a test doc with an A-label to a B-label, or vice versa
• Classifying a zero-labeled test doc

Accuracy = fraction with correct labels.

Accuracy benefits in mapping

[Chart: mapping accuracy (roughly 20 to 90 percent) for the A and B directions on the Autos, Movies, Outdoors, Photo, and Software subtree pairs; bars for EM2D, NB, EM1D, and Stratified EM1D.]

EM1D and NB are close, because the training set sizes for each taxonomy are not too small.

EM2D > Stratified EM1D > NB:
• 2d transfer of model info seems important
• Improvement over NB: 30% best, 10% average

Asymmetric setting

• Few (only 300) bookmarked URLs (taxonomy B, the target)
• Many Yahoo URLs, a larger number of classes (taxonomy A)
• Need to control the damping factor (the relative importance of labeled vs. unlabeled docs) to tackle the population skew

[Chart: B-side accuracy (0 to 100 percent) on Autos, Movies, Outdoors, Photo, and Software; bars for NB and EM2D.]

Zero-labeled test documents

[Chart: accuracy (0 to 80 percent) vs. training set size (12, 60, 120, 360), each at damping L=0.01 and L=1; bars for NB, EM1D, and EM2D-G.]

• EM1D improves accuracy only with 12 training docs
• EM2D with guessing improves beyond EM1D; in fact, better than aggregating scores to 1d
• The choice of the unlabeled:labeled damping ratio L may be important to get the benefits

Robustness to initialization

[Chart: accuracy (20 to 70 percent) vs. fraction of smeared guesses (0 = all naïve Bayes seeding, 1 = fully uniform smear), for Movies A and Movies B.]

• Seeding choices: hard (best class), NB scores, uniform
• Smear a fraction uniformly, the rest by NB scores
• EM2D is robust to a wide range of smear fractions
• Fully uniform smearing can fail (local optima)

Handling constraints

Type systems are often hierarchical. Neighborhood constraints:
• Two nodes match if their children also match
• Two nodes match if their parents match and some of their descendants also match
• If all children of node X match node Y, then X also matches Y
• If a node in the neighborhood of node X matches ASSOCIATE-PROFESSOR, then the chance that X matches PROFESSOR is increased

Three-stage approach (Doan+ 2002) …

Three-stage reconciliation

• Use EM2D to build a distribution estimator: for every type pair A, B, find, over the domain of entities, Pr(A,B), Pr(A,¬B), Pr(¬A,B), Pr(¬A,¬B)
• From these, compute a local similarity measure, e.g. Jaccard (toy computation below):

$$\mathrm{Jaccard}(A,B) \;=\; \frac{\Pr(A,B)}{\Pr(A,B) + \Pr(A,\neg B) + \Pr(\neg A,B)}$$

• Perform relaxation labeling to find a good mapping (Chakrabarti+ 1998, Doan+ 2002)
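As a concrete reading of the formula, a one-liner over the estimated joint (the numbers are invented):

```python
def jaccard(p_ab, p_a_notb, p_nota_b):
    """Jaccard similarity of types A and B from their joint distribution."""
    return p_ab / (p_ab + p_a_notb + p_nota_b)

print(jaccard(0.30, 0.10, 0.05))  # 0.666...: A and B share most instances
```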

Relaxation labeling

f maps each label (node) X of taxonomy O_1 to a label in O_2. Conditioning on everything known, marginalize over the labelings K of the other nodes:

$$\Pr\big(f(X)=L \mid O_1\big) \;=\; \sum_{K}\Pr\big(f(X)=L,\; f(O_1\setminus X)=K \,\big|\, O_1\big)$$

Markovian assumption: only the features f_1, …, f_m of the immediate neighborhood of X matter:

$$\Pr\big(f(X)=L \,\big|\, f(O_1\setminus X), O_1\big) \;=\; \Pr\big(f(X)=L \,\big|\, f_1,\ldots,f_m\big)$$

Initialize some reasonable f and reassign via Bayes rule until f stabilizes.
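A schematic of the iterative reassignment, a simplified stand-in in which neighbors' label beliefs play the role of the neighborhood features f_1, …, f_m; it is not the exact formulation of Doan+ 2002:

```python
import numpy as np

def relaxation_labeling(sim, neighbors, iters=50):
    """Iteratively refine a soft mapping from O1 nodes to O2 labels.

    sim       : (n1, n2) local similarity (e.g. Jaccard) between type pairs
    neighbors : list of neighbor-index lists for each O1 node
    Returns an (n1, n2) matrix of label distributions.
    """
    n1, n2 = sim.shape
    f = sim / sim.sum(axis=1, keepdims=True)   # seed with local similarity
    for _ in range(iters):
        g = np.empty_like(f)
        for x in range(n1):
            # Neighborhood support: average of the neighbors' label beliefs.
            support = f[neighbors[x]].mean(axis=0) if neighbors[x] else 1.0
            g[x] = sim[x] * support            # Bayes-style reweighting
        f = g / g.sum(axis=1, keepdims=True)   # renormalize per node
    return f
```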

Sample results

[Figure: sample reconciliation results; content not recoverable.]

Summary

To realize the semantic Web vision we must:
• Assemble type schemas from diverse sources
• Mine type instances automatically
• Annotate large corpora efficiently
• Build indices integrating text and annotations
• Support schema-agnostic query languages
  • When did you last type XQuery into Google?
• Design high-performance query execution engines
  • New family of ranking functions