Towards Breaking the Quality Curse. A Web-Querying Approach to
Web People Search.
Dmitri V. Kalashnikov
Rabia Nuray-Turan
Sharad Mehrotra
Dept of Computer Science
University of California, Irvine
Searching for People on the Web
• Approx. 5-10% of internet searches look for specific people by their names [Guha & Garg '04]
• Challenges:
  – Heterogeneity of representations: the same entity is represented in multiple ways.
  – Ambiguity of representations: the same representation refers to multiple entities.
• WePS systems address both of these challenges.
Example: web pages for "Andrew McCallum" illustrate both problems.
– Heterogeneity: the same UMass professor appears as "Andrew McCallum's Homepage … Associate Professor", "ACM: Andrew Kachites McCallum", "Items for Author: McCallum, R. Andrew", and "DBLP: Andrew McCallum".
– Ambiguity: the same string also refers to a musician ("Mental Floss … Andrew McCallum") and to a CROSS president ("News: … Andrew McCallum … Australia").
Search Engines vs. WePS
Google results for "Andrew McCallum" (namesakes mixed together) vs. WePS results for "Andrew McCallum" (separate clusters: UMass professor, photographer, physician).
A WePS system returns clusters of web pages in which pages referring to the same entity are grouped together.
Advantages: improved precision; the user can refine the query and filter away pages of irrelevant namesakes; resolves the "famous person" problem.
Clustering Search Engines vs. WePS
Search results for "Andrew McCallum" using Clusty vs. WePS system results for "Andrew McCallum".
Examples of clustering search engines:
– Clusty – http://www.clusty.com
– Kartoo – http://www.kartoo.com
Differences: while both return clustered search results, clustering search engines group similar/related pages by broad topic (e.g., research, photos), whereas a WePS system clusters by disambiguating among the various namesakes.
WePS System Architecture
Classified based on where references are disambiguated:
• Client-side solution – the client resolves references: it sends the query, receives the returned pages from the server, and runs the disambiguation approach locally.
• Proxy solution – a proxy between client and server resolves references and returns clusters to the client.
• Server-side solution – the server resolves references and returns clusters directly.
Current WePS Approaches
• Commercial systems:
  – ZoomInfo – http://www.zoominfo.com
  – Spock – http://www.spock.com
• Multiple research proposals: [Kamha et al. '04; Artiles et al. '05, '07; Bekkerman et al. '05; Bollegala et al. '06; Matsou et al. '06], etc.
• A Web people search task was included in the SemEval workshop in 2007 – 16 teams participated in the task.
• WePS techniques differ from each other in the data analyzed for disambiguation and the clustering approach used.
Type of data analyzed
• Baseline: exploit information in the result pages only
  – keywords, named entities, hyperlinks, emails, …
  – [Bekkerman et al. '05], [Li et al. '05], [Elmachioglu et al. '07]
• Exploit external information
  – ontologies, encyclopedias, …
  – [Kalashnikov et al. '07]
• Exploit Web content
  – crawling: [Kanani et al. '07]
  – search engine statistics
Contributions of this paper
• Exploiting Co-occurrence statistics from the search engine for WePS.
• A novel classification approach based on skylines to support reference resolution.
• Outperforms existing techniques on multiple data sets
Approach Overview
Pipeline: Top-K web pages → Preprocessing → Initial Clustering → Refining Clusters using Web Queries → Post-processing → final clusters (Person1, Person2, Person3, …).
• Preprocessing: 1. extract NEs; 2. filter. Output: preprocessed pages.
• Initial Clustering: 1. compute similarity using TF/IDF on NEs; 2. cluster based on TF/IDF similarity. Output: initial clusters.
• Refining Clusters using Web Queries: 1. web query; 2. feature extraction; 3. skyline-based classifier; 4. cluster refinement. Output: final clusters.
• Post-processing: 1. rank clusters; 2. extract cluster sketches; 3. rank pages in clusters.
Preprocessing: Named Entity Extraction
• Locate the query name on the web page.
• Extract the M neighboring named entities:
  – organizations
  – people names
• Stanford Named Entity Recognizer & GATE are used.
Example extracted NEs:
– People: Lise Getoor, Ben Taskar, John McCallum, Charles
– Organizations: UMASS, ACM, Queue
– Locations: Amherst
Preprocessing: Filtering Named Entities
Input: extracted NEs. Four filters are applied:
– Location filter: remove location NEs by looking them up in a dictionary.
– Ambiguous-name filter: remove one-word person names, such as 'John'.
– Ambiguous-org. filter: remove common English words.
– Same-last-name filter: remove person names with the same last name.
Output: cleaned NEs.
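The filtering stage above can be sketched as a small pipeline. This is a minimal illustration, not the authors' implementation: the location dictionary and common-word list here are tiny stand-ins, and the filters simply run in the order listed on the slide.

```python
# A minimal sketch of the NE-filtering stage. The dictionaries below are
# illustrative stand-ins for the real location / common-English-word lists.
LOCATIONS = {"amherst"}       # stand-in for a location dictionary
COMMON_WORDS = {"queue"}      # stand-in for a common-English-words list

def filter_nes(people, orgs, locations, query_last_name):
    # 1. Location filter: drop NEs found in the location dictionary.
    locations = [l for l in locations if l.lower() not in LOCATIONS]
    # 2. Ambiguous-name filter: drop one-word person names ("John", "Charles").
    people = [p for p in people if len(p.split()) > 1]
    # 3. Ambiguous-org filter: drop organizations that are common English words.
    orgs = [o for o in orgs if o.lower() not in COMMON_WORDS]
    # 4. Same-last-name filter: drop person names sharing the query's last name.
    people = [p for p in people if p.split()[-1].lower() != query_last_name.lower()]
    return people, orgs

people, orgs = filter_nes(
    ["Lise Getoor", "Ben Taskar", "John McCallum", "Charles"],
    ["UMASS", "ACM", "Queue"],
    ["Amherst"],
    query_last_name="McCallum",
)
# people -> ["Lise Getoor", "Ben Taskar"]; orgs -> ["UMASS", "ACM"]
```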
Initial Clustering
• Intuition: if two web pages "significantly" overlap in terms of their associated NEs, then the two pages refer to the same individual.
• Approach:
  – Compute the similarity between web pages based on TF/IDF over named entities.
  – Merge all pairs of web pages whose TF/IDF similarity exceeds a threshold (e.g., 0.5).
  – The threshold is learnt over the training dataset.
(Figure: with threshold = 0.5, page pairs with similarities 0.9 and 0.8 are merged – resolved references – while pairs at 0.1, 0.2, and 0.3 remain unresolved.)
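A minimal sketch of this initial-clustering step: TF/IDF vectors over each page's named entities, cosine similarity, and union-find merging of pairs above a threshold. The NE lists and the threshold value below are illustrative; in the paper the threshold is learnt from training data.

```python
# Sketch: TF/IDF-over-NE similarity + threshold merging (union-find).
import math
from itertools import combinations

def tfidf_vectors(docs):
    # docs: one list of NEs per page; returns one {term: tf*idf} dict per page.
    n = len(docs)
    df = {}
    for d in docs:
        for t in set(d):
            df[t] = df.get(t, 0) + 1
    return [{t: d.count(t) * math.log(n / df[t]) for t in set(d)} for d in docs]

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def initial_clusters(docs, threshold):
    vecs = tfidf_vectors(docs)
    parent = list(range(len(docs)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(range(len(docs)), 2):
        if cosine(vecs[i], vecs[j]) > threshold:
            parent[find(i)] = find(j)      # merge the two clusters
    clusters = {}
    for i in range(len(docs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

docs = [
    ["UMASS", "ACM", "KDD", "Lise Getoor"],
    ["UMASS", "ACM", "KDD", "Ben Taskar"],
    ["CMU", "band"],
    ["guitar", "band"],
]
clusters = initial_clusters(docs, threshold=0.4)   # -> [[0, 1], [2], [3]]
```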
Cluster Refinement
• Initial clustering based on TF/IDF similarity is:
  – Accurate – the references it resolves are usually correct.
  – Conservative – it leaves too many references unresolved.
Example: a page "Andrew McCallum … CMU" and a page "Andrew McCallum … UMASS" get TF/IDF similarity 0.01 and stay unmerged, even though both refer to the same person (Job: UMASS, PhD: CMU).
Key observation:
• The Web may contain additional information that can further resolve unresolved references.
• Co-occurrence between the contexts associated with unresolved references (e.g., similarities of 0.5 and 0.7 in the figure) can be learnt by querying the search engine.
Using Co-occurrence to Disambiguate
Context associated with search results: for the two pages D1 ("Andrew McCallum … CMU") and D2 ("Andrew McCallum … UMASS"), the contexts are C1 = UMASS and C2 = CMU.
Overlap between contexts over the Web: query the search engine with N = "Andrew McCallum" together with the contexts; |N ∧ C1 ∧ C2| is the number of web pages containing "Andrew McCallum" AND "UMASS" AND "CMU".
The overlap between the contexts of two pages of the same entity is expected to be higher than for pages of unrelated entities.
Merge web pages Di and Dj based on |N ∧ Ci ∧ Cj| / |N|?
• What if the context NEs Ci and Cj are very "general"? The co-occurrence counts could then be high for all namesakes!
• Solution: normalize |N ∧ Ci ∧ Cj| using Dice similarity.
Dice Similarity
• Two possible versions of Dice:
  – Dice1 = |N ∧ Ci ∧ Cj| / (|N ∧ Ci| + |N ∧ Cj|) – will be large for "general context" words.
  – Dice2 = |N ∧ Ci ∧ Cj| / (|N| + |Ci ∧ Cj|)
Dice2 Similarity – Example
• Dice2 captures that Yahoo and Google are too unspecific organization entities to be a good context for "Andrew McCallum".
• Dice2 leads to visibly better results in practice!
Let an N ∧ O1 ∧ O2 query be:
Q: "Andrew McCallum" AND "Yahoo" AND "Google"
Let the counts be:
– "Andrew McCallum" AND "Yahoo" AND "Google" – 522
– "Andrew McCallum" AND "Yahoo" – 2,470
– "Andrew McCallum" AND "Google" – 8,950
– "Andrew McCallum" – 38,500
– "Yahoo" AND "Google" – 73,000,000
Dice:
Dice1 = 522 / (2,470 + 8,950) ≈ 0.05
Dice2 = 522 / (38,500 + 73,000,000) ≈ 7.15e-6
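The two Dice variants can be checked directly against the example hit counts; the small sketch below plugs the slide's numbers into formulas of the stated shape and reproduces the slide's values (≈0.05 and ≈7.15e-6).

```python
# Sketch: the two Dice variants computed from search-engine hit counts,
# using the "Andrew McCallum" / Yahoo / Google example from the slide.
def dice1(n_ci_cj, n_ci, n_cj):
    # Dice1 = |N ^ Ci ^ Cj| / (|N ^ Ci| + |N ^ Cj|)
    return n_ci_cj / (n_ci + n_cj)

def dice2(n_ci_cj, n, ci_cj):
    # Dice2 = |N ^ Ci ^ Cj| / (|N| + |Ci ^ Cj|)
    return n_ci_cj / (n + ci_cj)

d1 = dice1(522, 2470, 8950)          # ~ 0.0457
d2 = dice2(522, 38500, 73_000_000)   # ~ 7.15e-6
```

Note how the huge |Yahoo ∧ Google| count (73M) in the Dice2 denominator is exactly what penalizes over-general context words, while Dice1 ignores it.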
Web Queries to Gather Co-Occurrence Statistics
Context of D1 – People NEs: Lise Getoor, Ben Taskar; Organizations: UMASS, ACM.
Context of D2 – People NEs: Charles A. Sutton, Ben Wellner; Organizations: DBLP, KDD.
Association between contexts (4 queries):
– Q1 = (L.G. OR B.T.) AND (C.A.S. OR B.W.) – #97
– Q2 = (L.G. OR B.T.) AND (DBLP OR KDD) – #2,570
– Q3 = (UMASS OR ACM) AND (C.A.S. OR B.W.) – #720
– Q4 = (UMASS OR ACM) AND (DBLP OR KDD) – #2,190K
Association amongst contexts within pages relevant to the original query N (4 more queries):
– N AND Q1 – #79; N AND Q2 – #49; N AND Q3 – #63; N AND Q4 – #10,100
8 web queries in total.
Ex: Q1 = ("Lise Getoor" OR "Ben Taskar") AND ("Charles A. Sutton" OR "Ben Wellner")
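Building the eight query strings for an edge (D1, D2) can be sketched as below. This only constructs the query strings; issuing them to a search engine and reading hit counts is outside the scope of the sketch, and the function and variable names are illustrative.

```python
# Sketch: forming the 8 web queries for an edge from the two pages'
# People/Organization context NEs.
def or_group(terms):
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

def build_queries(name, p1, o1, p2, o2):
    # Q1..Q4 pair the People/Org contexts of the two pages; the other
    # four queries conjoin each Qi with the person name N.
    q = [
        f"{or_group(p1)} AND {or_group(p2)}",   # Q1: People-People
        f"{or_group(p1)} AND {or_group(o2)}",   # Q2: People-Org
        f"{or_group(o1)} AND {or_group(p2)}",   # Q3: Org-People
        f"{or_group(o1)} AND {or_group(o2)}",   # Q4: Org-Org
    ]
    return q + [f'"{name}" AND {qi}' for qi in q]

queries = build_queries(
    "Andrew McCallum",
    ["Lise Getoor", "Ben Taskar"], ["UMASS", "ACM"],
    ["Charles A. Sutton", "Ben Wellner"], ["DBLP", "KDD"],
)
# queries[0] matches the slide's Q1 example.
```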
Creating Features from Web Co-Occurrence Counts
Each edge (Di, Dj) gets an 8-feature vector:
1: |N ∧ Pi ∧ Pj|  2: Dice(N, Pi ∧ Pj)  3: |N ∧ Pi ∧ Oj|  4: Dice(N, Pi ∧ Oj)
5: |N ∧ Oi ∧ Pj|  6: Dice(N, Oi ∧ Pj)  7: |N ∧ Oi ∧ Oj|  8: Dice(N, Oi ∧ Oj)
Example values (counts from the previous slide, normalized by |N| = 152K):
79/152K = 0.00052 (Dice = 0.000918); 49/152K = 0.00032 (Dice = 0.000561); 63/152K = 0.00041 (Dice = 0.00073); 10,100/152K = 0.066 (Dice = 0.008552).
Role of a Classifier
• A classifier marks edges as "merge" (+) or "do not merge" (−), which induces the final clusters.
• Since the features are "similarity"-based, the classifier should satisfy the dominance property.
Pipeline: initial clusters → web co-occurrence feature generation → edges with 8D feature vectors, e.g., (0.1, 0.8, …, 0.4), (0.2, 0.1, …, 0.9), (0.3, 0.5, …, 0.1) → edge classification → final clusters.
Dominance Property
• Let f and g be 8D feature vectors:
  – f = (f1, f2, …, f8)
  – g = (g1, g2, …, g8)
• Notation:
  – f up-dominates g if f1 ≤ g1, f2 ≤ g2, …, f8 ≤ g8
  – f down-dominates g if f1 ≥ g1, f2 ≥ g2, …, f8 ≥ g8
• Dominance property for a classifier:
  – Let f and g be the feature vectors of edges (di, dj) and (dm, dn), respectively.
  – If f ≤ g (componentwise) and the classifier decides to merge (di, dj), then it must also merge (dm, dn).
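The two dominance relations are just componentwise comparisons; a minimal sketch (plain tuples standing in for the 8D feature vectors):

```python
# Sketch: componentwise dominance checks over feature vectors.
def up_dominates(f, g):
    # f up-dominates g: every component of f is <= the matching one of g.
    return all(fi <= gi for fi, gi in zip(f, g))

def down_dominates(f, g):
    # f down-dominates g: every component of f is >= the matching one of g.
    return all(fi >= gi for fi, gi in zip(f, g))
```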
Skyline-Based Classification
(Figure: (a) the skyline of the feature space; (b) the skyline dividing the feature space. A badly placed skyline causes misclassification – documents are grouped wrongly.)
• A skyline classifier consists of an up-dominating classification skyline (CS):
  – All points "above" or "on" it are declared "merge" ("+").
  – The rest are "do not merge" ("−").
  – This automatically takes care of the dominance property.
Our goal is to learn a classification skyline that yields a good final clustering of the search results.
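Classifying an edge against a given classification skyline amounts to asking whether its feature vector lies "above or on" some skyline point. A minimal sketch, with hypothetical 2D skyline points standing in for the 8D case:

```python
# Sketch: an edge is "merge" if its feature vector is componentwise >=
# some point of the classification skyline (i.e., above or on the CS).
def merges(features, skyline):
    return any(all(f >= s for f, s in zip(features, pt)) for pt in skyline)

cs = [(0.1, 0.6), (0.3, 0.4), (0.5, 0.2)]   # hypothetical CS points
hit = merges((0.4, 0.5), cs)    # above (0.3, 0.4) -> "merge"
miss = merges((0.2, 0.3), cs)   # below every CS point -> "do not merge"
```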
Learning Classification Skyline
(Figure (a): points a–w in the feature space, with the initial CS.)
Greedy approach:
• Initialize the classification skyline (CS) with the point that dominates the entire space.
• Pool of candidate points to add to the CS: the down-dominating skyline of the active "+" edges.
• Maintain the "current" clustering – the clustering induced by the current CS.
• Choose a point from the pool to add to the CS that:
  – Stage 1: minimizes the loss of quality due to adding new "−" edges;
  – Stage 2: maximizes overall quality.
• Iterate until done; output the best skyline discovered.
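The greedy loop can be sketched in a much-simplified form. This is not the authors' algorithm: the synthetic 2D training edges are invented for illustration, and edge-level F1 stands in for the paper's cluster-level Fp quality measure (the two stages are collapsed into one scoring step).

```python
# Simplified sketch of greedy skyline learning: candidates are the
# down-dominating skyline of still-misclassified "+" edges; each step
# adds the candidate maximizing edge-level F1 (a stand-in for Fp).
def dominates(f, g):          # f componentwise >= g
    return all(a >= b for a, b in zip(f, g))

def merges(f, cs):            # "+" if f lies above or on the skyline
    return any(dominates(f, pt) for pt in cs)

def f1(cs, edges):
    tp = sum(1 for f, y in edges if y == 1 and merges(f, cs))
    fp = sum(1 for f, y in edges if y == 0 and merges(f, cs))
    fn = sum(1 for f, y in edges if y == 1 and not merges(f, cs))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def learn_skyline(edges, steps=10):
    cs = []                   # empty CS merges nothing
    for _ in range(steps):
        pool = [f for f, y in edges if y == 1 and not merges(f, cs)]
        # keep only the minimal (down-dominating skyline) points
        pool = [p for p in pool if not any(q != p and dominates(p, q) for q in pool)]
        if not pool:
            break
        cs.append(max(pool, key=lambda p: f1(cs + [p], edges)))
    return cs

edges = [((0.8, 0.8), 1), ((0.6, 0.7), 1), ((0.3, 0.9), 1),
         ((0.2, 0.1), 0), ((0.1, 0.2), 0)]   # invented training edges
learned_cs = learn_skyline(edges)
```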
Example: Adding a Point to the CS
(Figure (a): points a–w in the feature space, showing the ground-truth clustering, the initial clustering, and the current clustering.)
Current CS = {q, d, c}; point pool = {f, e, u, h}.
Example: Considering f
What if we add f to the CS?
Stage 1: adding edge k – Fp = 0.805.
Stage 2: adding edge f – Fp = 0.860.
Example: Considering e
What if we add e to the CS?
Stage 1: adding edges k and v – Fp = 0.718.
Stage 2: adding edge e – Fp = 0.774.
Example: Considering u
What if we add u to the CS?
Stage 1: adding edge i – Fp = 0.805.
Stage 2: adding edge u – Fp = 0.833.
Example: Considering h
What if we add h to the CS?
Stage 1: adding edges i and t – Fp = 0.625.
Stage 2: adding edge h – Fp = 0.741.
Example: Stage 2
After Stage 1, only f and u remain:
– Stage 1 (for f): Fp = 0.805; Stage 1 (for u): Fp = 0.805.
– Stage 2 (for f): Fp = 0.860; Stage 2 (for u): Fp = 0.833.
Thus f is added to the CS! (Figure (b).)
Post Processing
• Page ranking: use the original ranking from the search engine.
• Cluster ranking: for each cluster, select its highest-ranked page and use that to order the clusters.
• Cluster sketches: coalesce all pages in the cluster into a single page; remove stop words; compute the TF/IDF of the remaining words; select the set of terms above a certain threshold (or the top N terms).
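Cluster-sketch extraction can be sketched as below. This is an illustrative simplification: plain term frequency stands in for TF/IDF, and the stop-word list and example pages are invented stand-ins.

```python
# Rough sketch of cluster-sketch extraction: concatenate the cluster's
# pages, drop stop words, score remaining terms, keep the top N.
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "at", "is"}   # illustrative list

def cluster_sketch(pages, top_n=3):
    words = " ".join(pages).lower().split()
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [term for term, _ in counts.most_common(top_n)]

sketch = cluster_sketch([
    "Andrew McCallum is a professor at UMASS",
    "Andrew McCallum machine learning UMASS Amherst",
])
# sketch -> ["andrew", "mccallum", "umass"]
```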
Experiments
• Datasets:
  – WWW'05: by Bekkerman and McCallum; 12 different person names, all related to each other; each person name has a different number of namesakes.
  – SIGIR'05: by Artiles et al.; 9 different person names, each with a different number of namesakes.
  – WEPS: for the SemEval workshop on the WePS task (by Artiles et al.); training dataset (49 different person names) and test dataset (30 different person names).
Experiments
• Baselines:
  – Named-entity-based clustering (NE): the TF/IDF merging scheme that uses the NEs in the web pages.
  – Web-based co-occurrence clustering (WebDice): computes the similarity between web pages by summing the individual Dice similarities of the People-People and People-Organization pairs of the web pages.
Experiments on WWW & SIGIR Datasets
Experiments on the WEPS Dataset
Effect of Initial Clustering
Conclusion and Future Work
• Web co-occurrence statistics can significantly improve disambiguation quality in the WePS task.
• The skyline classifier is effective in making merge decisions.
• However, the current approach has limitations:
  – It requires a large number of search-engine queries to collect statistics, so given today's limitations it will work only as a server-side solution.
  – It is based on the assumption that the contexts of different entities are very different; it may not work as well in the difficult case where the contexts of different entities significantly overlap.
Additional Slides
Techniques tested in SemEval 2007
• CU-COMSEM:
  – Uses various token-based and phrase-based features.
  – Agglomerative clustering with single linkage.
• PSNUS:
  – Merging using TF/IDF on named entities.
• IRST-BP:
  – Features include named entities, frequent words (phrases), time expressions, etc.
  – The similarity score is the sum of the individual scores of the common features, which are weighted according to the maximum of the distances between the name of interest and the feature.