Towards Breaking the Quality Curse. A Web-Querying Approach to
Web People Search.
Dmitri V. Kalashnikov
Rabia Nuray-Turan
Sharad Mehrotra
Dept of Computer Science
University of California, Irvine
Searching for People on the Web
• Approx. 5-10% of internet searches look for specific people by their names [Guha & Garg '04]
• Challenges:
  – Heterogeneity of representations: the same entity is represented in multiple ways.
  – Ambiguity of representations: the same representation refers to multiple entities.
• WePS systems address both of these challenges.
Example: web pages for "Andrew McCallum" illustrate both problems.
– Heterogeneity: the same UMass professor appears as "Andrew McCallum's Homepage … Associate Professor", "ACM: Andrew Kachites McCallum", "Items for Author: McCallum, R. Andrew", and "DBLP: Andrew McCallum".
– Ambiguity: the same string also refers to a musician ("Mental Floss … Andrew McCallum") and to a CROSS president ("News: … Andrew McCallum … Australia").
Search Engines vs. WePS
Google results for "Andrew McCallum" (namesakes mixed together) vs. WePS results for "Andrew McCallum" (separate clusters: UMass professor, photographer, physician).
A WePS system returns clusters of web pages in which pages referring to the same entity are grouped together.
Advantages: improved precision; the user can refine the query and filter away pages of irrelevant namesakes; resolves the "famous person" problem.
Clustering Search Engines vs. WePS
Search results for "Andrew McCallum" using Clusty vs. WePS system results for "Andrew McCallum".
Examples of clustering search engines:
– Clusty – http://www.clusty.com
– Kartoo – http://www.kartoo.com
Differences: while both return clustered search results, clustering search engines group similar/related pages by broad topic (e.g., research, photos), whereas a WePS system clusters by disambiguating among the various namesakes.
WePS System Architecture
Classified based on where references are disambiguated:
• Client-side solution – the client resolves references: it sends the query, receives the returned pages from the server, and runs the disambiguation approach locally.
• Proxy solution – a proxy between client and server resolves references and returns clusters to the client.
• Server-side solution – the server resolves references and returns clusters directly.
Current WePS Approaches
• Commercial systems:
  – ZoomInfo – http://www.zoominfo.com
  – Spock – http://www.spock.com
• Multiple research proposals: [Kamha et al. '04; Artiles et al. '05, '07; Bekkerman et al. '05; Bollegala et al. '06; Matsou et al. '06], etc.
• A Web people search task was included in the SemEval workshop in 2007 – 16 teams participated in the task.
• WePS techniques differ from each other in the data analyzed for disambiguation and the clustering approach used.
Type of data analyzed
• Baseline: exploit information in the result pages only
  – keywords, named entities, hyperlinks, emails, …
  – [Bekkerman et al. '05], [Li et al. '05], [Elmachioglu et al. '07]
• Exploit external information
  – ontologies, encyclopedias, …
  – [Kalashnikov et al. '07]
• Exploit Web content
  – crawling: [Kanani et al. '07]
  – search engine statistics
Contributions of this paper
• Exploiting Co-occurrence statistics from the search engine for WePS.
• A novel classification approach based on skylines to support reference resolution.
• Outperforms existing techniques on multiple data sets
Approach Overview
Pipeline: Top-K web pages → Preprocessing → Initial Clustering → Refining Clusters using Web Queries → Post-processing → final clusters (Person1, Person2, Person3, …).
• Preprocessing: 1. extract NEs; 2. filter. Output: preprocessed pages.
• Initial Clustering: 1. compute similarity using TF/IDF on NEs; 2. cluster based on TF/IDF similarity. Output: initial clusters.
• Refining Clusters using Web Queries: 1. web query; 2. feature extraction; 3. skyline-based classifier; 4. cluster refinement. Output: final clusters.
• Post-processing: 1. rank clusters; 2. extract cluster sketches; 3. rank pages in clusters.
Preprocessing: Named Entity Extraction
• Locate the query name on the web page.
• Extract the M neighboring named entities:
  – organizations
  – people names
• Stanford Named Entity Recognizer & GATE are used.
Example extracted NEs:
– People: Lise Getoor, Ben Taskar, John McCallum, Charles
– Organizations: UMASS, ACM, Queue
– Locations: Amherst
Preprocessing: Filtering Named Entities
Input: extracted NEs. Four filters are applied:
– Location filter: remove location NEs by looking them up in a dictionary.
– Ambiguous-name filter: remove one-word person names, such as 'John'.
– Ambiguous-org. filter: remove common English words.
– Same-last-name filter: remove person names with the same last name.
Output: cleaned NEs.
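The filtering stage above can be sketched as a small pipeline. This is a minimal illustration, not the authors' implementation: the location dictionary and common-word list here are tiny stand-ins, and the filters simply run in the order listed on the slide.

```python
# A minimal sketch of the NE-filtering stage. The dictionaries below are
# illustrative stand-ins for the real location / common-English-word lists.
LOCATIONS = {"amherst"}       # stand-in for a location dictionary
COMMON_WORDS = {"queue"}      # stand-in for a common-English-words list

def filter_nes(people, orgs, locations, query_last_name):
    # 1. Location filter: drop NEs found in the location dictionary.
    locations = [l for l in locations if l.lower() not in LOCATIONS]
    # 2. Ambiguous-name filter: drop one-word person names ("John", "Charles").
    people = [p for p in people if len(p.split()) > 1]
    # 3. Ambiguous-org filter: drop organizations that are common English words.
    orgs = [o for o in orgs if o.lower() not in COMMON_WORDS]
    # 4. Same-last-name filter: drop person names sharing the query's last name.
    people = [p for p in people if p.split()[-1].lower() != query_last_name.lower()]
    return people, orgs

people, orgs = filter_nes(
    ["Lise Getoor", "Ben Taskar", "John McCallum", "Charles"],
    ["UMASS", "ACM", "Queue"],
    ["Amherst"],
    query_last_name="McCallum",
)
# people -> ["Lise Getoor", "Ben Taskar"]; orgs -> ["UMASS", "ACM"]
```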
Initial Clustering
• Intuition: if two web pages "significantly" overlap in terms of their associated NEs, then the two pages refer to the same individual.
• Approach:
  – Compute the similarity between web pages based on TF/IDF over named entities.
  – Merge all pairs of web pages whose TF/IDF similarity exceeds a threshold (e.g., 0.5).
  – The threshold is learnt over the training dataset.
(Figure: with threshold = 0.5, page pairs with similarities 0.9 and 0.8 are merged – resolved references – while pairs at 0.1, 0.2, and 0.3 remain unresolved.)
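A minimal sketch of this initial-clustering step: TF/IDF vectors over each page's named entities, cosine similarity, and union-find merging of pairs above a threshold. The NE lists and the threshold value below are illustrative; in the paper the threshold is learnt from training data.

```python
# Sketch: TF/IDF-over-NE similarity + threshold merging (union-find).
import math
from itertools import combinations

def tfidf_vectors(docs):
    # docs: one list of NEs per page; returns one {term: tf*idf} dict per page.
    n = len(docs)
    df = {}
    for d in docs:
        for t in set(d):
            df[t] = df.get(t, 0) + 1
    return [{t: d.count(t) * math.log(n / df[t]) for t in set(d)} for d in docs]

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def initial_clusters(docs, threshold):
    vecs = tfidf_vectors(docs)
    parent = list(range(len(docs)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(range(len(docs)), 2):
        if cosine(vecs[i], vecs[j]) > threshold:
            parent[find(i)] = find(j)      # merge the two clusters
    clusters = {}
    for i in range(len(docs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

docs = [
    ["UMASS", "ACM", "KDD", "Lise Getoor"],
    ["UMASS", "ACM", "KDD", "Ben Taskar"],
    ["CMU", "band"],
    ["guitar", "band"],
]
clusters = initial_clusters(docs, threshold=0.4)   # -> [[0, 1], [2], [3]]
```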
Cluster Refinement
• Initial clustering based on TF/IDF similarity is:
  – Accurate – the references it resolves are usually correct.
  – Conservative – it leaves too many references unresolved.
Example: a page "Andrew McCallum … CMU" and a page "Andrew McCallum … UMASS" get TF/IDF similarity 0.01 and stay unmerged, even though both refer to the same person (Job: UMASS, PhD: CMU).
Key observation:
• The Web may contain additional information that can further resolve unresolved references.
• Co-occurrence between the contexts associated with unresolved references (e.g., similarities of 0.5 and 0.7 in the figure) can be learnt by querying the search engine.
Using Co-occurrence to Disambiguate
Context associated with search results: for the two pages D1 ("Andrew McCallum … CMU") and D2 ("Andrew McCallum … UMASS"), the contexts are C1 = UMASS and C2 = CMU.
Overlap between contexts over the Web: query the search engine with N = "Andrew McCallum" together with the contexts; |N ∧ C1 ∧ C2| is the number of web pages containing "Andrew McCallum" AND "UMASS" AND "CMU".
The overlap between the contexts of two pages of the same entity is expected to be higher than for pages of unrelated entities.
Merge web pages Di and Dj based on |N ∧ Ci ∧ Cj| / |N|?
• What if the context NEs Ci and Cj are very "general"? The co-occurrence counts could then be high for all namesakes!
• Solution: normalize |N ∧ Ci ∧ Cj| using Dice similarity.
Dice Similarity
• Two possible versions of Dice:
  – Dice1 = |N ∧ Ci ∧ Cj| / (|N ∧ Ci| + |N ∧ Cj|) – will be large for "general context" words.
  – Dice2 = |N ∧ Ci ∧ Cj| / (|N| + |Ci ∧ Cj|)
Dice2 Similarity – Example
• Dice2 captures that Yahoo and Google are too unspecific organization entities to be a good context for "Andrew McCallum".
• Dice2 leads to visibly better results in practice!
Let an N ∧ O1 ∧ O2 query be:
Q: "Andrew McCallum" AND "Yahoo" AND "Google"
Let the counts be:
– "Andrew McCallum" AND "Yahoo" AND "Google" – 522
– "Andrew McCallum" AND "Yahoo" – 2,470
– "Andrew McCallum" AND "Google" – 8,950
– "Andrew McCallum" – 38,500
– "Yahoo" AND "Google" – 73,000,000
Dice:
Dice1 = 522 / (2,470 + 8,950) ≈ 0.05
Dice2 = 522 / (38,500 + 73,000,000) ≈ 7.15e-6
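The two Dice variants can be checked directly against the example hit counts; the small sketch below plugs the slide's numbers into formulas of the stated shape and reproduces the slide's values (≈0.05 and ≈7.15e-6).

```python
# Sketch: the two Dice variants computed from search-engine hit counts,
# using the "Andrew McCallum" / Yahoo / Google example from the slide.
def dice1(n_ci_cj, n_ci, n_cj):
    # Dice1 = |N ^ Ci ^ Cj| / (|N ^ Ci| + |N ^ Cj|)
    return n_ci_cj / (n_ci + n_cj)

def dice2(n_ci_cj, n, ci_cj):
    # Dice2 = |N ^ Ci ^ Cj| / (|N| + |Ci ^ Cj|)
    return n_ci_cj / (n + ci_cj)

d1 = dice1(522, 2470, 8950)          # ~ 0.0457
d2 = dice2(522, 38500, 73_000_000)   # ~ 7.15e-6
```

Note how the huge |Yahoo ∧ Google| count (73M) in the Dice2 denominator is exactly what penalizes over-general context words, while Dice1 ignores it.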
Web Queries to Gather Co-Occurrence Statistics
Context of D1 – People NEs: Lise Getoor, Ben Taskar; Organizations: UMASS, ACM.
Context of D2 – People NEs: Charles A. Sutton, Ben Wellner; Organizations: DBLP, KDD.
Association between contexts (4 queries):
– Q1 = (L.G. OR B.T.) AND (C.A.S. OR B.W.) – #97
– Q2 = (L.G. OR B.T.) AND (DBLP OR KDD) – #2,570
– Q3 = (UMASS OR ACM) AND (C.A.S. OR B.W.) – #720
– Q4 = (UMASS OR ACM) AND (DBLP OR KDD) – #2,190K
Association amongst contexts within pages relevant to the original query N (4 more queries):
– N AND Q1 – #79; N AND Q2 – #49; N AND Q3 – #63; N AND Q4 – #10,100
8 web queries in total.
Ex: Q1 = ("Lise Getoor" OR "Ben Taskar") AND ("Charles A. Sutton" OR "Ben Wellner")
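Building the eight query strings for an edge (D1, D2) can be sketched as below. This only constructs the query strings; issuing them to a search engine and reading hit counts is outside the scope of the sketch, and the function and variable names are illustrative.

```python
# Sketch: forming the 8 web queries for an edge from the two pages'
# People/Organization context NEs.
def or_group(terms):
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

def build_queries(name, p1, o1, p2, o2):
    # Q1..Q4 pair the People/Org contexts of the two pages; the other
    # four queries conjoin each Qi with the person name N.
    q = [
        f"{or_group(p1)} AND {or_group(p2)}",   # Q1: People-People
        f"{or_group(p1)} AND {or_group(o2)}",   # Q2: People-Org
        f"{or_group(o1)} AND {or_group(p2)}",   # Q3: Org-People
        f"{or_group(o1)} AND {or_group(o2)}",   # Q4: Org-Org
    ]
    return q + [f'"{name}" AND {qi}' for qi in q]

queries = build_queries(
    "Andrew McCallum",
    ["Lise Getoor", "Ben Taskar"], ["UMASS", "ACM"],
    ["Charles A. Sutton", "Ben Wellner"], ["DBLP", "KDD"],
)
# queries[0] matches the slide's Q1 example.
```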
Creating Features from Web Co-Occurrence Counts
Each edge (Di, Dj) gets an 8-feature vector:
1: |N ∧ Pi ∧ Pj|  2: Dice(N, Pi ∧ Pj)  3: |N ∧ Pi ∧ Oj|  4: Dice(N, Pi ∧ Oj)
5: |N ∧ Oi ∧ Pj|  6: Dice(N, Oi ∧ Pj)  7: |N ∧ Oi ∧ Oj|  8: Dice(N, Oi ∧ Oj)
Example values (counts from the previous slide, normalized by |N| = 152K):
79/152K = 0.00052 (Dice = 0.000918); 49/152K = 0.00032 (Dice = 0.000561); 63/152K = 0.00041 (Dice = 0.00073); 10,100/152K = 0.066 (Dice = 0.008552).
Role of a Classifier
• A classifier marks edges as "merge" (+) or "do not merge" (−), which induces the final clusters.
• Since the features are "similarity"-based, the classifier should satisfy the dominance property.
Pipeline: initial clusters → web co-occurrence feature generation → edges with 8D feature vectors, e.g., (0.1, 0.8, …, 0.4), (0.2, 0.1, …, 0.9), (0.3, 0.5, …, 0.1) → edge classification → final clusters.
Dominance Property
• Let f and g be 8D feature vectors:
  – f = (f1, f2, …, f8)
  – g = (g1, g2, …, g8)
• Notation:
  – f up-dominates g if f1 ≤ g1, f2 ≤ g2, …, f8 ≤ g8
  – f down-dominates g if f1 ≥ g1, f2 ≥ g2, …, f8 ≥ g8
• Dominance property for a classifier:
  – Let f and g be the feature vectors of edges (di, dj) and (dm, dn), respectively.
  – If f ≤ g (componentwise) and the classifier decides to merge (di, dj), then it must also merge (dm, dn).
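The two dominance relations are just componentwise comparisons; a minimal sketch (plain tuples standing in for the 8D feature vectors):

```python
# Sketch: componentwise dominance checks over feature vectors.
def up_dominates(f, g):
    # f up-dominates g: every component of f is <= the matching one of g.
    return all(fi <= gi for fi, gi in zip(f, g))

def down_dominates(f, g):
    # f down-dominates g: every component of f is >= the matching one of g.
    return all(fi >= gi for fi, gi in zip(f, g))
```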
Skyline-Based Classification
(Figure: (a) the skyline of the feature space; (b) the skyline dividing the feature space. A badly placed skyline causes misclassification – documents are grouped wrongly.)
• A skyline classifier consists of an up-dominating classification skyline (CS):
  – All points "above" or "on" it are declared "merge" ("+").
  – The rest are "do not merge" ("−").
  – This automatically takes care of the dominance property.
Our goal is to learn a classification skyline that yields a good final clustering of the search results.
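Classifying an edge against a given classification skyline amounts to asking whether its feature vector lies "above or on" some skyline point. A minimal sketch, with hypothetical 2D skyline points standing in for the 8D case:

```python
# Sketch: an edge is "merge" if its feature vector is componentwise >=
# some point of the classification skyline (i.e., above or on the CS).
def merges(features, skyline):
    return any(all(f >= s for f, s in zip(features, pt)) for pt in skyline)

cs = [(0.1, 0.6), (0.3, 0.4), (0.5, 0.2)]   # hypothetical CS points
hit = merges((0.4, 0.5), cs)    # above (0.3, 0.4) -> "merge"
miss = merges((0.2, 0.3), cs)   # below every CS point -> "do not merge"
```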
Learning Classification Skyline
(Figure (a): points a–w in the feature space, with the initial CS.)
Greedy approach:
• Initialize the classification skyline (CS) with the point that dominates the entire space.
• Pool of candidate points to add to the CS: the down-dominating skyline of the active "+" edges.
• Maintain the "current" clustering – the clustering induced by the current CS.
• Choose a point from the pool to add to the CS that:
  – Stage 1: minimizes the loss of quality due to adding new "−" edges;
  – Stage 2: maximizes overall quality.
• Iterate until done; output the best skyline discovered.
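The greedy loop can be sketched in a much-simplified form. This is not the authors' algorithm: the synthetic 2D training edges are invented for illustration, and edge-level F1 stands in for the paper's cluster-level Fp quality measure (the two stages are collapsed into one scoring step).

```python
# Simplified sketch of greedy skyline learning: candidates are the
# down-dominating skyline of still-misclassified "+" edges; each step
# adds the candidate maximizing edge-level F1 (a stand-in for Fp).
def dominates(f, g):          # f componentwise >= g
    return all(a >= b for a, b in zip(f, g))

def merges(f, cs):            # "+" if f lies above or on the skyline
    return any(dominates(f, pt) for pt in cs)

def f1(cs, edges):
    tp = sum(1 for f, y in edges if y == 1 and merges(f, cs))
    fp = sum(1 for f, y in edges if y == 0 and merges(f, cs))
    fn = sum(1 for f, y in edges if y == 1 and not merges(f, cs))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def learn_skyline(edges, steps=10):
    cs = []                   # empty CS merges nothing
    for _ in range(steps):
        pool = [f for f, y in edges if y == 1 and not merges(f, cs)]
        # keep only the minimal (down-dominating skyline) points
        pool = [p for p in pool if not any(q != p and dominates(p, q) for q in pool)]
        if not pool:
            break
        cs.append(max(pool, key=lambda p: f1(cs + [p], edges)))
    return cs

edges = [((0.8, 0.8), 1), ((0.6, 0.7), 1), ((0.3, 0.9), 1),
         ((0.2, 0.1), 0), ((0.1, 0.2), 0)]   # invented training edges
learned_cs = learn_skyline(edges)
```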
Example: Adding a Point to the CS
(Figure (a): points a–w in the feature space, showing the ground-truth clustering, the initial clustering, and the current clustering.)
Current CS = {q, d, c}; point pool = {f, e, u, h}.
Example: Considering f
What if we add f to the CS?
Stage 1: adding edge k – Fp = 0.805.
Stage 2: adding edge f – Fp = 0.860.
Example: Considering e
What if we add e to the CS?
Stage 1: adding edges k and v – Fp = 0.718.
Stage 2: adding edge e – Fp = 0.774.
Example: Considering u
What if we add u to the CS?
Stage 1: adding edge i – Fp = 0.805.
Stage 2: adding edge u – Fp = 0.833.
Example: Considering h
What if we add h to the CS?
Stage 1: adding edges i and t – Fp = 0.625.
Stage 2: adding edge h – Fp = 0.741.
Example: Stage 2
After Stage 1, only f and u remain:
– Stage 1 (for f): Fp = 0.805; Stage 1 (for u): Fp = 0.805.
– Stage 2 (for f): Fp = 0.860; Stage 2 (for u): Fp = 0.833.
Thus f is added to the CS! (Figure (b).)
Post Processing
• Page ranking: use the original ranking from the search engine.
• Cluster ranking: for each cluster, select its highest-ranked page and use that to order the clusters.
• Cluster sketches: coalesce all pages in the cluster into a single page; remove stop words; compute the TF/IDF of the remaining words; select the set of terms above a certain threshold (or the top N terms).
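Cluster-sketch extraction can be sketched as below. This is an illustrative simplification: plain term frequency stands in for TF/IDF, and the stop-word list and example pages are invented stand-ins.

```python
# Rough sketch of cluster-sketch extraction: concatenate the cluster's
# pages, drop stop words, score remaining terms, keep the top N.
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "at", "is"}   # illustrative list

def cluster_sketch(pages, top_n=3):
    words = " ".join(pages).lower().split()
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [term for term, _ in counts.most_common(top_n)]

sketch = cluster_sketch([
    "Andrew McCallum is a professor at UMASS",
    "Andrew McCallum machine learning UMASS Amherst",
])
# sketch -> ["andrew", "mccallum", "umass"]
```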
Experiments
• Datasets:
  – WWW'05: by Bekkerman and McCallum; 12 different person names, all related to each other; each person name has a different number of namesakes.
  – SIGIR'05: by Artiles et al.; 9 different person names, each with a different number of namesakes.
  – WEPS: for the SemEval workshop on the WePS task (by Artiles et al.); training dataset (49 different person names) and test dataset (30 different person names).
Experiments
• Baselines:
  – Named-entity-based clustering (NE): the TF/IDF merging scheme that uses the NEs in the web pages.
  – Web-based co-occurrence clustering (WebDice): computes the similarity between web pages by summing the individual Dice similarities of the People-People and People-Organization pairs of the web pages.
Experiments on WWW & SIGIR Datasets
Experiments on the WEPS Dataset
Effect of Initial Clustering
Conclusion and Future Work
• Web co-occurrence statistics can significantly improve disambiguation quality in the WePS task.
• The skyline classifier is effective in making merge decisions.
• However, the current approach has limitations:
  – It requires a large number of search-engine queries to collect statistics, so given today's limitations it will work only as a server-side solution.
  – It is based on the assumption that the contexts of different entities are very different; it may not work as well in the difficult case where the contexts of different entities significantly overlap.
Additional Slides
Techniques tested in SemEval 2007
• CU-COMSEM:
  – Uses various token-based and phrase-based features.
  – Agglomerative clustering with single linkage.
• PSNUS:
  – Merging using TF/IDF on named entities.
• IRST-BP:
  – Features include named entities, frequent words (phrases), time expressions, etc.
  – The similarity score is the sum of the individual scores of the common features, which are weighted according to the maximum of the distances between the name of interest and the feature.