Reference Reconciliation in Complex Information Spaces


Xin (Luna) Dong, Alon Halevy, Jayant Madhavan

@ SIGMOD 2005, University of Washington

Semex: Personal Information Management System

MentionedIn(315)

AuthorOfArticles(52)

RecipientOfEmails(8547)

SenderOfEmails(7595)

Homepage(1)


Email Contacts(1145)

Co-authors(24)


Authors

FromFile

CitedBy

Cites(33)

PublishedIn

Article: Reference Reconciliation in Complex Information Spaces


Xin (Luna) Dong

xin dong

xinluna dong

luna

dongxin

x. dong

Lab-#dong xin

dong xin luna

Names

Emails

Semex Without Deduplication: Search results for luna

luna dong: SenderOfEmails(3043), RecipientOfEmails(2445), MentionedIn(94)

Xin (Luna) Dong: AuthorOfArticles(49), MentionedIn(20)

23 persons

Semex Without Deduplication

A Platform for Personal Information Management and Integration

9 Persons: dong xin xin dong

Semex NEEDS Deduplication (Reference Reconciliation)




Complex Information Space Example – An Abstract View of Personal Information

Article:
a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1)
a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)

Venue:
c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
c2=(“ACM SIGMOD”, “1978”, null)

Person:
p1=(“Robert S. Epstein”, null)
p2=(“Michael Stonebraker”, null)
p3=(“Eugene Wong”, null)
p4=(“Epstein, R.S.”, null)
p5=(“Stonebraker, M.”, null)
p6=(“Wong, E.”, null)

Person (additional references):
p7=(“Eugene Wong”, “[email protected]”)
p8=(null, “[email protected]”)
p9=(“mike”, “[email protected]”)

Legend: Class, Reference, AtomicAttribute, AssociationAttribute

Other Complex Information Spaces

Citation portals, e.g., Citeseer, Cora
Online product catalogs in E-commerce

Real-World Objects

Article:
a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1)
a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)




Reference Reconciliation

Input: A set of references R
Output: A partitioning over R, such that
  Each partition refers to a single real-world object – high precision
  Different partitions refer to different objects – high recall

Related Work

A very active area of research in Databases, Data Mining and AI

Most current approaches assume matching tuples from a single database table

Traditional approaches (surveyed in [Cohen et al., 2003]):
  Step I. Compare attributes
  Step II. Combine attribute similarities to decide tuple match/non-match
  Step III. Compute transitive closures to get partitions

New approaches explore the relationship between reconciliation decisions using probability models [Russell et al., 2002] [Domingos et al., 2004]

Harder for complex information spaces
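
The three traditional steps listed on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the string measure (`difflib.SequenceMatcher`), the plain averaging in Step II, the 0.8 threshold, and the sample venue tuples are all made up here and are not taken from the surveyed systems.

```python
from difflib import SequenceMatcher
from itertools import combinations


def attr_sim(a, b):
    """Step I: string similarity in [0, 1]; a missing value gives no evidence."""
    if a is None or b is None:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def tuples_match(t1, t2, threshold=0.8):
    """Step II: average the attribute similarities and apply a threshold (assumed rule)."""
    sims = [attr_sim(x, y) for x, y in zip(t1, t2)]
    return sum(sims) / len(sims) >= threshold


def reconcile(tuples, threshold=0.8):
    """Step III: transitive closure over matching pairs, using union-find."""
    parent = list(range(len(tuples)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(tuples)), 2):
        if tuples_match(tuples[i], tuples[j], threshold):
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(tuples)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical single-table input: (venue name, year).
venues = [
    ("ACM SIGMOD", "1978"),
    ("ACM Sigmod", "1978"),
    ("VLDB", "1980"),
]
partitions = reconcile(venues)  # lists of tuple indexes, one list per partition
```

Every tuple here has the same attributes, which is what makes the attribute-by-attribute comparison in Step II possible; the following slides explain why that assumption breaks down in complex information spaces.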

Challenges in Complex Information Spaces

Article: a1=(“Distributed Query Processing”,“169-180”, {p1,p2,p3}, c1)

a2=(“Distributed query processing”,“169-180”, {p4,p5,p6}, c2)




1. Multiple Classes
2. Limited Information
3. Multi-value Attributes

Intuition

Complex information spaces can be considered as networks of instances and associations between the instances

Key: exploit the network, specifically, the clues hidden in the associations

Outline

Introduction and problem definition

Reconciliation algorithm

Experimental results

Conclusions

Framework: Dependency Graph

p2=(“Michael Stonebraker”, null, {p1, p3})
p3=(“Eugene Wong”, null, {p1, p2})
p7=(“Eugene Wong”, “[email protected]”, {p8})
p8=(null, “[email protected]”, {p7})
p9=(“mike”, “[email protected]”, null)

[Figure: dependency graph over these references. Reference-similarity nodes: (p2, p8), (p3, p7), (p8, p9), (p2, p9). Attribute-similarity nodes: (“Michael Stonebraker”, “stonebraker@”), (“Michael Stonebraker”, “mike”), (“[email protected]”, “[email protected]”), (“Eugene Wong”, “Eugene Wong”). Edge labels: Compare contacts, Cross-attr similarity. Other pairs shown during the build: (p1, p7), (“Michael Stonebraker”, p7), (p1, “[email protected]”), (p3, “[email protected]”).]

Exploit the Dependency Graph

[Figure: dependency graph for articles a1 and a2. Reference-similarity nodes: (a1, a2), (p1, p4), (p2, p5), (p3, p6), (c1, c2). Attribute-similarity nodes: (“Distributed…”, “Distributed …”), (“169-180”, “169-180”), (“Robert S. Epstein”, “Epstein, R.S.”), (“Michael Stonebraker”, “Stonebraker, M.”), (“Eugene Wong”, “Wong, E.”), (“ACM …”, “ACM SIGMOD”), (“1978”, “1978”). Legend: Reference similarity, Attribute similarity.]

Dependency Graph Example II


[Figure: dependency graph with nodes (“169-180”, “169-180”), (a1, a2), (p1, p4), (p2, p5), (p3, p6), (c1, c2), (“ACM …”, “ACM SIGMOD”), and (“1978”, “1978”). Edge label: Compare authored papers.]

Strategy I. Consider Richer Evidence

Cross-attribute similarity – Name&email
  p5=(“Stonebraker, M.”, null)
  p8=(null, “[email protected]”)

Context Information I – Contact list
  p5=(“Stonebraker, M.”, null, {p4, p6})
  p8=(null, “[email protected]”, {p7})
  p6=p7

Context Information II – Authored articles
  p2=(“Michael Stonebraker”, null)
  p5=(“Stonebraker, M.”, null)
  p2 and p5 authored the same article

Considering Only Attribute-wise Similarities Cannot Merge Persons Well

[Figure: number of person partitions (y-axis) versus evidence used (x-axis). Attribute-wise matching produces 3159 partitions, 1409 more than the gold standard. Person references: 24076. Real-world persons (gold standard): 1750.]

Considering Richer Evidence Improves the Recall

[Figure: number of person partitions (y-axis) versus evidence used (x-axis). Attr-wise: 3159; Name&Email: 2169; Article: 2169; Contact: 2096. Gap to the gold standard: 1409 for Attr-wise, 346 for Contact. Person references: 24076. Real-world persons: 1750.]



[Figure sequence: the dependency graph with nodes (“169-180”, “169-180”), (a1, a2), (p1, p4), (p2, p5), (p3, p6), (c1, c2), (“ACM …”, “ACM SIGMOD”), and (“1978”, “1978”), shown over several frames. Legend: Reconciled, Similar.]

Strategy II. Propagate Information between Reconciliation Decisions

After changing the similarity score of one node, re-compute similarity scores of its neighbors

This process converges if
  Similarity score is monotone in the similarity values of neighbors
  Compute neighbor similarities only if similarity increase is not too small
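
The propagation loop described on this slide might look like the following minimal sketch. The names `propagate` and `make_sim_fn`, the scoring rule (a base score plus a fixed bonus per reconciled neighbor), and all numbers are assumptions chosen for illustration; the slides do not specify them. The two convergence conditions appear directly: scores never decrease, and neighbors are re-queued only when the increase is at least `epsilon`.

```python
from collections import deque


def propagate(scores, neighbors, sim_fn, epsilon=0.01):
    """Re-score nodes until no score rises by at least epsilon."""
    queue = deque(scores)      # visit every node once to start
    queued = set(scores)
    while queue:
        node = queue.popleft()
        queued.discard(node)
        new_score = sim_fn(node, scores, neighbors)
        gain = new_score - scores[node]
        if gain > 0:
            scores[node] = new_score   # scores only ever increase
        if gain >= epsilon:
            # A noticeable increase: the neighbors may now score higher too.
            for nb in neighbors.get(node, ()):
                if nb not in queued:
                    queue.append(nb)
                    queued.add(nb)
    return scores


def make_sim_fn(base, boost=0.2, merge_threshold=0.8):
    """Hypothetical monotone rule: base score plus a bonus per reconciled neighbor."""
    def sim_fn(node, scores, neighbors):
        reconciled = sum(
            1 for nb in neighbors.get(node, ()) if scores[nb] >= merge_threshold
        )
        return min(1.0, base[node] + boost * reconciled)
    return sim_fn


# Made-up base scores for three reference-similarity nodes from the example.
base = {("a1", "a2"): 0.9, ("p2", "p5"): 0.65, ("c1", "c2"): 0.55}
neighbors = {
    ("a1", "a2"): [("p2", "p5"), ("c1", "c2")],
    ("p2", "p5"): [("a1", "a2")],
    ("c1", "c2"): [("a1", "a2")],
}
scores = propagate(dict(base), neighbors, make_sim_fn(base))
```

In this toy run the confident article match lifts the author pair over the threshold, which in turn lifts the article pair further. Because every re-queue requires a gain of at least `epsilon` and scores are capped at 1, the loop cannot run forever.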

Propagating Information between Reconciliation Decisions Further Improves Recall

[Figure: number of person partitions (y-axis) versus evidence used (x-axis: Attr-wise, Name&Email, Article, Contact). Traditional: 3159, 2169, 2169, 2096. Propagation: 3159, 2146, 2135, 2022.]


Strategy III. Enrich References in Reconciliation

Enrich knowledge of a real-world object for later reconciliation

Naïve:
  Construct graph
  Compute similarity
  Transitive closure

Problems
  Dependency-graph construction is expensive
  Reference enrichment does not take effect until the next pass

Solution
  Instant enrichment by adding neighbors in the dependency graph
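
One way to picture "adding neighbors in the dependency graph" is the sketch below. It assumes the graph is stored as a symmetric adjacency map, and the helper names `link` and `absorb` are hypothetical. When two references are reconciled, a node that has become a duplicate of another node hands over its neighbors immediately, instead of waiting for the graph to be rebuilt in a later pass.

```python
def link(neighbors, a, b):
    """Add an undirected edge between nodes a and b."""
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)


def absorb(neighbors, keep, drop):
    """Move every neighbor of `drop` onto `keep`, then delete `drop`."""
    for nb in neighbors.pop(drop, set()):
        neighbors[nb].discard(drop)
        if nb != keep:
            link(neighbors, keep, nb)
    return neighbors


ms = "Michael Stonebraker"
neighbors = {}
link(neighbors, ("p2", "p8"), (ms, "stonebraker@"))
link(neighbors, ("p2", "p8"), ("p3", "p7"))
link(neighbors, ("p2", "p9"), (ms, "mike"))

# Suppose p8 and p9 have just been reconciled: (p2, p9) now duplicates (p2, p8).
absorb(neighbors, keep=("p2", "p8"), drop=("p2", "p9"))
```

After the call, the node (p2, p8) depends on the name comparison with “mike” as well as on its original evidence, so its similarity can be re-computed with the pooled information right away.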

Enrich References by Adding Neighbors

p2=(“Michael Stonebraker”, null, {p1, p3})

[Figure sequence: nodes (p8, p9), (p2, p8), and (p2, p9). In the final frames (p2, p9) no longer appears, and (p2, p8) is shown with the neighbors (“Michael Stonebraker”, “mike”), (p3, p7), and (“Michael Stonebraker”, “stonebraker@”). Legend: Reconciled, Similar.]

Reference Enrichment Improves Recall More than Information Propagation

[Figure: number of person partitions (y-axis) versus evidence used (x-axis). Legend: Traditional, Enrichment, Propagation. Data labels recovered: Traditional 3159, 2169, 2169, 2096; Enrichment 3169, 2036, 2036, 1910.]


Applying Both Information Propagation and Reference Enrichment Gets the Highest Recall

[Figure: number of person partitions (y-axis) versus evidence used (x-axis). Legend: Traditional, Enrichment, Propagation, Full. Data labels recovered: Traditional 3159, 2169, 2169, 2096; Full 3169, 2002, 1990, 1873. Annotations in the figure: 1409, 125, 346.]

Outline

Introduction and problem definition

Reconciliation algorithm

Experimental results

Conclusions

Experiment Settings

Datasets
  Four personal datasets
  Cora dataset for citations

Use the same parameters and thresholds for all datasets

Measure
  Precision and recall, F-measure
    Precision: The percentage of correctly reconciled reference pairs over all reconciled reference pairs
    Recall: The percentage of correctly reconciled reference pairs over pairs of references that refer to the same real-world object
  Diversity and Dispersion
    Diversity: For every result partition, how many real-world objects are included; ideally should be 1 (related to precision)
    Dispersion: For every real-world object, how many result partitions include them; ideally should be 1 (related to recall)
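
The measures defined on this slide can be computed from a result partitioning and a gold-standard partitioning as sketched below. The function names are hypothetical, and reporting diversity and dispersion as averages over partitions and objects is an assumption (the slide only says "for every"). The tiny example data reuses reference names from the talk but is otherwise made up.

```python
from itertools import combinations


def same_pairs(partitions):
    """All unordered reference pairs that share a partition."""
    return {
        frozenset(pair)
        for part in partitions
        for pair in combinations(sorted(part), 2)
    }


def pairwise_scores(result, gold):
    """Pairwise precision, recall, and F-measure of a result partitioning."""
    found, truth = same_pairs(result), same_pairs(gold)
    correct = len(found & truth)
    precision = correct / len(found) if found else 1.0
    recall = correct / len(truth) if truth else 1.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def diversity_dispersion(result, gold):
    """Average objects per result partition, and average partitions per object."""
    obj_of = {ref: i for i, part in enumerate(gold) for ref in part}
    part_of = {ref: i for i, part in enumerate(result) for ref in part}
    diversity = sum(len({obj_of[r] for r in part}) for part in result) / len(result)
    dispersion = sum(len({part_of[r] for r in part}) for part in gold) / len(gold)
    return diversity, dispersion


# Made-up example: the result fails to merge p8 with p2 and p5.
gold = [{"p2", "p5", "p8"}, {"p3", "p6"}]
result = [{"p2", "p5"}, {"p8"}, {"p3", "p6"}]
precision, recall, f_measure = pairwise_scores(result, gold)
diversity, dispersion = diversity_dispersion(result, gold)
```

Here no partition mixes two people, so precision and diversity are both 1, while the missed merge shows up as a recall of 0.5 and a dispersion of 1.5.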

Recall Results on One Personal Dataset

[Figure: number of person partitions (y-axis) versus evidence used (x-axis). Legend: Traditional, Enrichment, Propagation, Full. Data labels recovered: Traditional 3159, 2169, 2169, 2096; Full 3169, 2002, 1990, 1873. Annotations in the figure: 1409, 125, 346.]

Results Considering All Occurrences of Person Instances

Dataset (#per/#ref)   Attr-wise Matching              Dependency Graph
                      Prec/Recall   F       #Par      Prec/Recall   F       #Par
A (1750/24076)        0.999/0.741   0.851   3159      0.999/0.999   0.999   1873
B (1989/36359)        0.974/0.998   0.986   2154      0.999/0.999   0.999   2068
C (1570/15160)        0.999/0.967   0.983   1660      0.982/0.987   0.985   1596
D (1518/17199)        0.894/0.998   0.943   1579      0.999/0.920   0.958   1546
Avg                   0.967/0.926   0.946             0.995/0.976   0.986

Both precision and recall increase compared with attr-wise matching.

Results Considering Only Distinct Person References

Dataset (#per/#dist-ref)   Attr-wise Matching              Dependency Graph
                           Prec/Recall   F       #Par      Prec/Recall   F       #Par
A (1750/3114)              0.995/0.509   0.673   3159      0.982/0.947   0.964   1873
B (1989/3211)              0.81/0.803    0.806   2154      0.958/0.891   0.923   2068
C (1570/2430)              0.987/0.782   0.873   1660      0.814/0.925   0.867   1596
D (1518/2188)              0.694/0.837   0.759   1579      0.942/0.737   0.827   1546
Avg                        0.872/0.733   0.778             0.924/0.875   0.895

Precision and recall increase largely compared with attr-wise matching.

Diversity and Dispersion Are Very Close to 1

Dataset (#per/#ref)   Attr-wise Matching      Dependency Graph
                      Diversity/Dispersion    Diversity/Dispersion
A (1750/24076)        1.18/1.003              1.047/1.003
B (1989/36359)        1.067/1.01              1.039/1.008
C (1570/15160)        1.053/1.003             1.03/1.017
D (1518/17199)        1.041/1.004             1.023/1.005
Avg                   1.085/1.005             1.035/1.008

Our Algorithm Equals or Outperforms Attr-wise Matching in All Classes

Class     Attr-wise Matching      Dependency Graph
          Precision   Recall      Precision   Recall
Person    0.967       0.926       0.995       0.976
Article   0.997       0.977       0.999       0.976
Venue     0.935       0.790       0.987       0.937

Results on the Cora Dataset Are Competitive with Other Reported Results

Results reported in other record linkage papers:
  Precision/Recall = 0.990/0.925 [Cohen et al., 2002]
  Precision/Recall = 0.842/0.909 [Parag and Domingos, 2004]
  F-measure = 0.867 [Bilenko and Mooney, 2003]

Class     Attr-wise Matching      Dependency Graph
          Prec/Recall   F-msre    Prec/Recall   F-msre
Article   0.985/0.913   0.948     0.985/0.924   0.954
Person    0.994/0.985   0.989     1/0.987       0.993
Venue     0.982/0.362   0.529     0.837/0.714   0.771

Conclusions

Contributions: Dependency-graph-based reconciliation algorithm
  Exploit rich evidence
  Propagate information between reconciliation decisions
  Enrich references during reconciliation

Extended Work
  Propagate negative information through dependency graph



@ SIGMOD 2005, http://data.cs.washington.edu/semex

Strategy IV. Enforce Constraints

Problem: [Figure: three references P1, P2, P3]

Solution: Propagate negative information (constraints)
  Non-merge node: the two elements are guaranteed to be different and should never be merged
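
A non-merge constraint can be illustrated with a union-find structure that refuses any merge that would put two forbidden elements in the same partition. This is a simplified sketch, not the dependency-graph mechanism of the talk: the class `ConstrainedPartition` and its methods are hypothetical, and the choice of which pair is forbidden is made up for the example.

```python
class ConstrainedPartition:
    """Union-find that refuses merges contradicting a non-merge constraint."""

    def __init__(self, items):
        self.parent = {x: x for x in items}
        self.non_merge = set()

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def forbid(self, a, b):
        """Record that a and b are guaranteed to be different objects."""
        self.non_merge.add((a, b))

    def allowed(self, a, b):
        """True if joining the partitions of a and b breaks no constraint."""
        roots = {self.find(a), self.find(b)}
        return all(
            {self.find(x), self.find(y)} != roots for x, y in self.non_merge
        )

    def merge(self, a, b):
        """Merge the partitions of a and b; return False if a constraint blocks it."""
        if self.find(a) == self.find(b):
            return True
        if not self.allowed(a, b):
            return False
        self.parent[self.find(a)] = self.find(b)
        return True


refs = ConstrainedPartition(["p2", "p8", "p9"])
refs.forbid("p2", "p9")            # assumed constraint for this example
first = refs.merge("p2", "p8")     # accepted
second = refs.merge("p8", "p9")    # blocked: it would also join p2 and p9
```

The second merge is rejected even though p8 and p9 are not directly constrained, which is the sense in which negative information propagates: once p8 is reconciled with p2, it inherits the constraint against p9.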

Enforce Constraints by Propagating Negative Information

p2=(“Michael Stonebraker”, null, {p1, p3})
p3=(“Eugene Wong”, null, {p1, p2})
p7=(“Eugene Wong”, “[email protected]”, {p8})
p8=(null, “[email protected]”, {p7})
p9=(“matt”, “[email protected]”, null)

[Figure sequence: nodes (p2, p8), (p2, p9), (p8, p9), and (“Michael Stonebraker”, “matt”), with a Constraint label, shown over several frames. Legend: Reconciled, Similar, Non-merge.]

Enforcing Constraints Improves Precision

Method          Precision   #(Entities reconciled with others incorrectly)
Constraint      0.999       13
No Constraint   0.947       61

Similarity Computation

Similarity function for node N – s(N)
  Input: sim scores of N’s neighbors
  Output: sim score of N, ranging from 0 to 1

Similarity function can be defined by applying domain knowledge, learning from training data, resorting to global knowledge, etc.

S = Srv + Ssb + Swb
  Srv: from real-valued neighbors. Decision-tree shape.
  Ssb: from strong-boolean-valued neighbors
  Swb: from weak-boolean-valued neighbors

Framework: Dependency Graph Definition

For every pair of references A and B:
  A node representing their similarity
  For every attribute of A and attribute of B
    A node representing attribute similarity
    An edge between attr-sim node and ref-sim node, representing the dependency between the similarities

Each node is associated with a similarity score between 0 and 1

Construction: include only nodes whose two elements have potential to be similar
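
The definition above can be turned into a small construction routine. This sketch makes several assumptions: references are dictionaries of atomic attribute values, "potential to be similar" is approximated by sharing a word (the hypothetical `share_token` filter), and a reference pair is kept only if at least one attribute pair passes that filter. None of these choices come from the slides.

```python
from itertools import combinations, product


def tokens(text):
    """Lower-cased words of a string, ignoring commas and periods."""
    return set(text.lower().replace(",", " ").replace(".", " ").split())


def share_token(a, b):
    """Hypothetical filter for 'potential to be similar': a common word."""
    return bool(tokens(a) & tokens(b))


def build_graph(refs, may_match=share_token):
    """Return (scores, edges): one node per candidate pair, all scores start at 0."""
    scores, edges = {}, set()
    for (id_a, a), (id_b, b) in combinations(sorted(refs.items()), 2):
        attr_nodes = [
            (va, vb)
            for va, vb in product(a.values(), b.values())
            if va is not None and vb is not None and may_match(va, vb)
        ]
        if not attr_nodes:
            continue  # no evidence at all, so the reference pair is left out
        ref_node = (id_a, id_b)
        scores[ref_node] = 0.0
        for node in attr_nodes:
            scores[node] = 0.0
            edges.add((node, ref_node))  # ref-sim depends on attr-sim
    return scores, edges


refs = {
    "p2": {"name": "Michael Stonebraker", "email": None},
    "p3": {"name": "Eugene Wong", "email": None},
    "p5": {"name": "Stonebraker, M.", "email": None},
}
scores, edges = build_graph(refs)
```

Pruning at construction time keeps the graph far smaller than the full set of reference pairs, since pairs such as (p2, p3) never get a node.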
