Discovery and Reconciliation of Entity Type Taxonomies
Soumen Chakrabarti
IIT Bombay
www.cse.iitb.ac.in/~soumen
Searching with types and entities

Answer types
• How far is it from Rome to Paris?
• type=distance#n#1 near words={Rome, Paris}
Restrictions on match conditions
• How many movies did 20th Century Fox release in 1923?
• …by 1950, the forest had only 1923 foxes left…
• type={number#n#1, hasDigit} NEAR year=1923 organization="20th Century Fox"
Corpus = set of docs, doc = token sequence, tokens connected to lexical networks
Searching personal info networks

No clean schema, data changes rapidly
Lots of generic "graph proximity" information
[Diagram: a personal information network over time. A PDF file node (title XYZ, author A. Thor, abstract, LastMod) is linked, via canonical nodes for "XYZ" and "ECML" and for the author "A. U. Thor", to two emails with EmailDate and EmailTo attributes: "…your ECML submission titled XYZ has been accepted…" and "Could I get a preprint of your recent ECML paper?". Goal: quickly find the file.]
Building blocks

Structured "seeds" of type info
• WordNet, Wikipedia, OpenCYC, …
Semi-structured sources
• List of faculty members in a department
• Catalog of products at an ecommerce site
Unstructured open-domain text
• Email bodies, text of papers, blog text, Web pages, …
Discovery and extension of type attachments
• Hearst patterns, list extraction, NE taggers
Reconciling federated type systems
• Schema and data integration
Query execution engines using type catalogs
Hearst patterns and enhancements

Hearst, 1992; KnowItAll (Etzioni+ 2004)
• T such as x, x and other Ts, x or other Ts, T x, x is a T, x is the only T, …
C-PANKOW (Cimiano and Staab 2005)
Suitable for unformatted natural language
Generally high precision, low recall
• If few possible Ts, use a named entity tagger
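A minimal sketch of pattern-based extraction for two of the patterns above, using regular expressions; the pattern set and the example sentences are illustrative, not from the talk:

```python
import re

# Two of the Hearst patterns, as naive regexes (illustrative only).
PATTERNS = [
    # "T such as x, y and z" -> (T, [x, y, z])
    re.compile(r"(\w+) such as ((?:\w+(?:, )?)+(?: and \w+)?)"),
    # "x and other Ts" -> (T, [x])
    re.compile(r"(\w+) and other (\w+)"),
]

def hearst_extract(sentence):
    """Return (type, instance) pairs found by the patterns above."""
    pairs = []
    m = PATTERNS[0].search(sentence)
    if m:
        t = m.group(1)
        for inst in re.split(r", | and ", m.group(2)):
            if inst:
                pairs.append((t, inst))
    m = PATTERNS[1].search(sentence)
    if m:
        pairs.append((m.group(2), m.group(1)))
    return pairs

print(hearst_extract("cities such as Rome, Paris and Berlin"))
```

Real systems such as KnowItAll pair patterns like these with web-scale search and statistical assessment of each candidate; the regexes here only show the matching step.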
Set extraction

Each node of a graph is a word
Edge connects words wi and wj if these words occur together in more than k docs
• Use apriori-style searches to enumerate
• Edge weight depends on #docs
Given a set of words Q, set up Pagerank
• Random surfer on word graph
• W.p. d, jump to some element of Q
• W.p. 1−d, walk to a neighbor
Present nodes (words) with largest Pagerank
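The query-biased random surfer above can be sketched as a small personalized-PageRank loop; the co-occurrence graph, edge weights, and damping value below are illustrative assumptions:

```python
def personalized_pagerank(graph, query, d=0.15, iters=50):
    """graph: word -> {neighbor: edge_weight}; query: seed word set Q.
    With prob. d the surfer jumps to a word in Q, otherwise it walks to
    a neighbor in proportion to edge weight."""
    nodes = list(graph)
    pr = {w: 1.0 / len(nodes) for w in nodes}
    jump = {w: (1.0 / len(query) if w in query else 0.0) for w in nodes}
    for _ in range(iters):
        nxt = {w: d * jump[w] for w in nodes}
        for w in nodes:
            total = sum(graph[w].values())
            if total == 0:
                continue                      # dead end: mass dropped (sketch)
            for v, wt in graph[w].items():
                nxt[v] += (1 - d) * pr[w] * wt / total
        pr = nxt
    return pr

# Toy co-occurrence graph (edge weight = #docs the pair co-occurs in)
cooc = {
    "rome": {"paris": 3, "city": 5},
    "paris": {"rome": 3, "city": 4},
    "city": {"rome": 5, "paris": 4, "village": 1},
    "village": {"city": 1},
}
scores = personalized_pagerank(cooc, {"rome", "paris"})
# "city" scores well because it neighbors both seed words
```

Words are then presented in decreasing score order, as on the slide.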
List extraction

Given a current set of candidate Ts
Limit to candidates having high confidence
Select a random subset of k=4 candidates
Generate a query from the selected candidates
Download response documents
Look for lists containing candidate mentions
Extract more instances from the lists found
Boosts extraction rate 2–5×
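A toy version of the list-spotting step: scan fetched HTML for lists that mention enough known candidates, and harvest the remaining items. The regex-based parsing here is a deliberately naive sketch, not a production scraper:

```python
import re

def extract_from_lists(html, seeds, min_hits=2):
    """Find <ul>/<ol> lists; if a list mentions at least min_hits known
    candidates (seeds), return its remaining items as new candidates."""
    new = set()
    for block in re.findall(r"<[ou]l>(.*?)</[ou]l>", html, re.S):
        items = [re.sub(r"<.*?>", "", it).strip()
                 for it in re.findall(r"<li>(.*?)</li>", block, re.S)]
        hits = [it for it in items if it in seeds]
        if len(hits) >= min_hits:
            new.update(it for it in items if it not in seeds)
    return new

html = "<ul><li>Rome</li><li>Paris</li><li>Berlin</li></ul>"
print(extract_from_lists(html, {"Rome", "Paris"}))  # {'Berlin'}
```

The min_hits threshold plays the role of the confidence filter: a page list counts as evidence only if it corroborates several existing candidates.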
Wrapping, scraping, tagging

HTML formatting clues help extract records and fields
Extensive work in the DB, KDD, ML communities

[Diagram: an extracted record as a small graph —
(P167, is-a, Paper), (P167, has-author, Gerhard Pass),
(P167, has-author, Gregor Heinrich),
(P167, has-title, "Investigating word correlation…")]
Reconciling type systems

WordNet: small and precise
Wikipedia: much larger, less controlled
Collect into a common is-a database

Mapping between taxonomies
Each type has a set of instances
• Assoc Prof: K. Burn, R. Cook
• Synset: lemmas from leaf instances
• Wikipedia concept: list of instances
• Yahoo topic: set of example Web pages
Goal: establish connections between types
Connections could be "soft" or probabilistic
Cross-training

Set of labels or types B, partly related but not identical to type set A
• A=Dmoz topics, B=Yahoo topics
• A=Personal bookmark topics, B=Yahoo topics
Training docs come in two flavors now
• Fully labeled with both A and B labels (rare)
• Half-labeled with either an A label (set DA) or a B label (set DB)
Can B make classification for A more accurate (and vice versa)?
Inductive transfer, multi-task learning (Sarawagi+ 2003)
Motivation

Symmetric taxonomy mapping
• Ecommerce catalogs: A=distributor, B=retailer
• Web directories: A=Dmoz, B=Yahoo
Incomplete taxonomies, small training sets
• Bookmark taxonomy vs. Yahoo
Cartesian label spaces
[Diagram: a Region axis (Top > Regional > UK, USA, …) crossed with a Topic axis (Top > Sports > Baseball, Cricket); each (region, topic) label pair conditions its own term distribution.]
Labels as features

A-label known, estimate B-label
Suppose we have an A+B labeled training set
Append a discrete-valued "label column" to the term feature values: augmented feature vector, with the B label as target
Most text classifiers cannot balance the importance of very heterogeneous features
We do not have fully-labeled data
• Must guess (use soft scores instead of 0/1)
SVM-CT: Cross-trained SVM

Train S(A,0): a one-vs-rest SVM ensemble for A on DA–DB (docs having only A-labels); it returns |A| scores (signed distances from the separators) for each test doc
Run S(A,0) on DB–DA (docs having only B-labels) and append its |A| scores to each doc's term features
Train S(B,1): a one-vs-rest SVM ensemble for B (the target label set) on the augmented vectors
Test doc with A-label known: the label is coded as a vector of +1 and −1 (e.g. −1,…,−1,+1,−1,…) appended to the term features
Iterate: S(A,1), S(B,2), S(A,2), …
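The augmentation step can be sketched as follows; a toy centroid scorer stands in for the SVM ensemble, and the `_Ascore` feature names are hypothetical:

```python
def centroid_scores(doc_vec, centroids):
    """One score per A label: dot product of the doc's term vector
    (a dict term -> weight) with that label's centroid. A stand-in for
    the |A| signed distances an SVM ensemble would return."""
    return [sum(doc_vec.get(t, 0.0) * c.get(t, 0.0) for t in c)
            for c in centroids]

def augment(doc_vec, centroids, known_a=None, num_a=None):
    """Append either the A-ensemble scores or, when the A label is
    known (test time), a +1/-1 coding of it, as on the slide."""
    if known_a is not None:
        extra = [1.0 if i == known_a else -1.0 for i in range(num_a)]
    else:
        extra = centroid_scores(doc_vec, centroids)
    out = dict(doc_vec)
    for i, s in enumerate(extra):
        out[f"_Ascore{i}"] = s      # hypothetical feature names
    return out

# Toy A-label centroids (e.g. sports vs. computing vocabulary)
centroids = [{"goal": 1.0, "match": 0.5}, {"cpu": 1.0, "ram": 0.8}]
doc = {"goal": 2.0, "ram": 1.0}
print(augment(doc, centroids))
```

The B-classifier is then trained on these augmented vectors; with a real SVM the score columns would typically need rescaling against the term features, which is exactly the heterogeneous-feature balancing problem noted on the previous slide.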
SVM-CT anecdotes

Dmoz → Yahoo
• Movies: Genres.Western → Titles.Western (positive); Titles.Horror (negative)
• Photography: Techniques+Styles → Pinhole_Photography, 3D_Photography, Panoramic_Photography (positive); Organizations (negative)

Yahoo → Dmoz
• Photography: Pinhole_Photography → Techniques+Styles (positive); Photographers (negative)
• Software: OS.MS_Windows → OS.MS_Windows (positive); OS.UNIX (negative)

Discriminant reveals relations between A and B
• One-to-one, many-to-one, related, antagonistic
However, accuracy gains are meager
EM1D: Info from unlabeled docs

EM1D: expectation maximization with one label set, say B (Nigam et al.)
Use training docs to induce an initial classifier for taxonomy B
Repeat until the classifier is satisfactory
• Estimate Pr(β|d) for each unlabeled doc d and each β ∈ B
• Reweigh d by the factor Pr(β|d) and add it to the training set for label β
• Retrain the classifier
Ignores labels from the other taxonomy A
Stratified EM1D

Target labels = B; B-labeled docs are labeled training instances
Consider the A-labeled docs with label α
• These are unlabeled for taxonomy B
Run EM1D for each row α: docs in DA–DB labeled α (the row's unlabeled pool) together with DB–DA, the docs with B-labels
Test instance has known α
• Invoke the semi-supervised model for row α to classify
[Diagram: a grid with A-topics as rows and B-topics as columns]
EM2D: Cartesian product EM

Initialize with fully labeled docs, which go to a specific (α,β) cell
Smear each half-labeled training doc across its label row or column
• Uniform smear could be bad
• Use a naïve Bayes classifier to seed
Parameters extended from EM1D
• π(α,β): prior probability for label pair (α,β)
• θ(α,β,t): multinomial term probability for (α,β)
[Diagram: grid of labels in A × labels in B; an A-labeled doc smears along its row, a B-labeled doc along its column]
EM2D updates

E-step for an A-labeled document d with label c(d) = α:

  Pr(α,β | d) ∝ π(α,β) ∏_t θ(α,β,t)^{n(d,t)}

normalized over β ∈ B (zero for α ≠ c(d)); symmetrically over α for a B-labeled doc.

M-step, updated class-pair priors:

  π′(α,β) = (1/|D|) [ Σ_{d∈DA: c(d)=α} Pr(α,β | d) + Σ_{d∈DB: c(d)=β} Pr(α,β | d) ]

M-step, updated class-pair-conditioned term stats:

  θ′(α,β,t) = [ Σ_{d∈DA: c(d)=α} n(d,t) Pr(α,β | d) + Σ_{d∈DB: c(d)=β} n(d,t) Pr(α,β | d) ]
            / [ Σ_{d∈DA: c(d)=α} n(d) Pr(α,β | d) + Σ_{d∈DB: c(d)=β} n(d) Pr(α,β | d) ]

where n(d,t) is the count of term t in d and n(d) = Σ_t n(d,t).
Applying EM2D to a test doc

Mapping a B-labeled test doc d to an A label (e-commerce catalogs)
• Given β, find argmax_α Pr(α,β | d)
Classifying a document d with no labels to an A label
• Aggregation: for each β compute Pr(α,β | d), sum over β, pick the best α
• Guessing (EM2D-G): guess the best β* using a B-classifier, then find argmax_α Pr(α,β* | d)
EM pitfalls: damping factor, early stopping
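The E- and M-steps of EM2D fit in a short sketch under the multinomial naive Bayes model; the toy corpus, the smoothing constant, and the random perturbation standing in for naive-Bayes seeding are all assumptions for illustration:

```python
import numpy as np

def em2d(docs_A, docs_B, nA, nB, V, iters=20, alpha=0.1):
    """EM over the |A| x |B| label-pair grid.
    docs_A: list of (counts, a) -- counts is a length-V term-count
    array, a the known A label; docs_B analogous with B labels."""
    pi = np.full((nA, nB), 1.0 / (nA * nB))      # pair priors pi(a,b)
    theta = np.full((nA, nB, V), 1.0 / V)        # term probs theta(a,b,t)
    rng = np.random.default_rng(0)
    theta += rng.uniform(0, 1e-3, theta.shape)   # break symmetry (toy seeding)
    theta /= theta.sum(-1, keepdims=True)
    N = len(docs_A) + len(docs_B)
    for _ in range(iters):
        counts = np.full((nA, nB, V), alpha)     # smoothed term counts
        newpi = np.zeros((nA, nB))
        for x, a in docs_A:                      # E-step along row a
            logp = np.log(pi[a] + 1e-12) + np.log(theta[a]) @ x
            r = np.exp(logp - logp.max())
            r /= r.sum()                         # Pr(a, b | d) over b
            newpi[a] += r
            counts[a] += r[:, None] * x
        for x, b in docs_B:                      # E-step along column b
            logp = np.log(pi[:, b] + 1e-12) + np.log(theta[:, b]) @ x
            r = np.exp(logp - logp.max())
            r /= r.sum()                         # Pr(a, b | d) over a
            newpi[:, b] += r
            counts[:, b] += r[:, None] * x
        pi = newpi / N                           # M-step: pair priors
        theta = counts / counts.sum(-1, keepdims=True)  # term stats
    return pi, theta

# Toy corpus: label 0 (in both A and B) uses terms {0,1}; label 1 uses {2,3}
docs_A = [(np.array([3., 2., 0., 0.]), 0), (np.array([0., 0., 3., 2.]), 1)]
docs_B = [(np.array([2., 3., 0., 0.]), 0), (np.array([0., 0., 2., 3.]), 1)]
pi, theta = em2d(docs_A, docs_B, nA=2, nB=2, V=4)
```

Mapping a B-labeled test doc then amounts to `np.argmax` over the α axis of the cell posteriors, as described above; a damping factor would reweight labeled vs. unlabeled contributions in `counts`, which this sketch omits.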
Experiments

Selected 5 Dmoz and Yahoo subtree pairs
Compare EM2D against
• Naïve Bayes, with the best #features and smoothing
• EM1D: ignore labels from the other taxonomy, treat those docs as unlabeled
• Stratified EM1D
Tasks: map a test doc with an A-label to a B-label (or vice versa); classify a zero-labeled test doc
Accuracy = fraction with correct labels
Accuracy benefits in mapping
[Bar chart: mapping accuracy (20–90%) of EM2D, NB, EM1D, and Strat-EM in both directions (A, B) for Autos, Movies, Outdoors, Photo, Software]
EM1D and NB are close, because training set sizes for each taxonomy are not too small
EM2D > Stratified EM1D > NB
• 2d transfer of model info seems important
Improvement over NB: 30% best, 10% average
Asymmetric setting

Few (only 300) bookmarked URLs (taxonomy B, the target)
Many Yahoo URLs, larger number of classes (taxonomy A)
Need to control the damping factor (relative importance of labeled vs. unlabeled docs) to tackle population skew
[Bar chart: accuracy (0–100%) of NB vs. EM2D on the B side for Autos, Movies, Outdoors, Photo, Software]
Zero-labeled test documents
[Line chart: accuracy (0–80%) of NB, EM1D, and EM2D-G vs. training set size (12, 60, 120, 360), each at damping L=0.01 and L=1]
EM1D improves accuracy only for 12 training docs
EM2D with guessing improves beyond EM1D
• In fact, better than aggregating scores to 1d
Choice of the unlabeled:labeled damping ratio L may be important to get benefits
Robustness to initialization
[Line chart: accuracy (20–70%) vs. fraction of smeared guesses (0 to 1, from naïve Bayes smear to fully uniform smear) for Movies A and Movies B]

Seeding choices: hard (best class), NB scores, uniform
Smear a fraction uniformly, the rest by NB scores
EM2D is robust to a wide range of smear fractions
Fully uniform smearing can fail (local optima)
Handling constraints

Type systems are often hierarchical
Neighborhood constraints
• Two nodes match if their children also match
• Two nodes match if their parents match and some of their descendants also match
• If all children of node X match node Y, then X also matches Y
• If a node in the neighborhood of node X matches ASSOCIATE-PROFESSOR, then the chance that X matches PROFESSOR is increased
Three-stage approach (Doan+ 2002) …
Three-stage reconciliation

Use EM2D to build a distribution estimator
• For every type pair (A, B), find, over the domain of entities, Pr(A,B), Pr(A,¬B), Pr(¬A,B), Pr(¬A,¬B)
From these, compute a local similarity measure
• E.g. Jaccard:

  Jaccard(A,B) = Pr(A,B) / ( Pr(A,B) + Pr(A,¬B) + Pr(¬A,B) )

Perform relaxation labeling to find a good mapping (Chakrabarti+ 1998, Doan+ 2002)
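The Jaccard measure is a one-liner once the three relevant joint probabilities are estimated; the example values are illustrative:

```python
def jaccard_sim(p_ab, p_a_notb, p_nota_b):
    """Jaccard similarity of types A and B from joint instance
    probabilities: the probability-space analogue of |A∩B| / |A∪B|."""
    denom = p_ab + p_a_notb + p_nota_b
    return p_ab / denom if denom else 0.0

# e.g. Pr(A,B)=0.5, Pr(A,!B)=0.25, Pr(!A,B)=0.25
print(jaccard_sim(0.5, 0.25, 0.25))  # 0.5
```

Note Pr(¬A,¬B) does not appear: entities in neither type say nothing about how much the two types overlap.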
Relaxation labeling

Mapping each label X in O1 to a label in O2. If everything else were known:

  Pr(f(X)=L | O1) = Σ_K Pr(f(X)=L, f(O1\X)=K | O1)

summing over all assignments K of labels to the other nodes.

Markovian assumption:

  Pr(f(X)=L | f(O1\X)=K) ≈ Pr(f(X)=L | f1, …, fm)

where f1, …, fm are features of the immediate neighborhood of X.

Initialize some reasonable f and reassign via Bayes rule until f stabilizes
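A toy relaxation-labeling loop in this spirit; the neighbor-agreement update below is a simplified stand-in for the Bayes-rule re-estimation over neighborhood features, and the node names, weights, and mixing constant are assumptions:

```python
def relax(neighbors, init, iters=10, w=0.7):
    """neighbors: node -> list of neighboring nodes.
    init: node -> {label: probability}, the initial assignment f.
    Each pass mixes a node's own belief with its neighbors' support
    for each label, then renormalizes, until f stabilizes."""
    f = {n: dict(p) for n, p in init.items()}
    labels = {l for p in init.values() for l in p}
    for _ in range(iters):
        nxt = {}
        for n in f:
            scores = {}
            for l in labels:
                support = sum(f[m].get(l, 0.0) for m in neighbors[n])
                scores[l] = (1 - w) * f[n].get(l, 0.0) + w * support
            z = sum(scores.values()) or 1.0
            nxt[n] = {l: s / z for l, s in scores.items()}
        f = nxt
    return f

# Triangle graph: two nodes confidently labeled "x" pull the third along
neighbors = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
init = {"a": {"x": 0.9, "y": 0.1},
        "b": {"x": 0.9, "y": 0.1},
        "c": {"x": 0.5, "y": 0.5}}   # undecided node
f = relax(neighbors, init)
```

This illustrates the fixed-point character of the method: each node's distribution is repeatedly re-derived from its neighborhood until nothing changes much, which is how the ASSOCIATE-PROFESSOR/PROFESSOR constraint from the previous slide propagates.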
Summary

To realize the semantic Web vision we must
Assemble type schemas from diverse sources
Mine type instances automatically
Annotate large corpora efficiently
Build indices integrating text and annotations
Support schema-agnostic query languages
• When did you last type XQuery into Google?
Design high-performance query execution engines
• New family of ranking functions