63
Reference Reconciliation in Complex Information Spaces Xin (Luna) Dong, Alon Halevy, Jayant Madhavan @ Sigmod 2005 University of Washington

Reference Reconciliation in Complex Information Spaces

Embed Size (px)

DESCRIPTION

Reference Reconciliation in Complex Information Spaces. Xin (Luna) Dong , Alon Halevy, Jayant Madhavan @ Sigmod 2005 University of Washington. Semex : Personal Information Management System. Homepage(1). SenderOfEmails(7595). RecipientOfEmails(8547). AuthorOfArticles(52). - PowerPoint PPT Presentation

Citation preview

Page 1: Reference Reconciliation in Complex Information Spaces

Reference Reconciliation in Complex Information Spaces

Xin (Luna) Dong, Alon Halevy, Jayant Madhavan

@ Sigmod 2005University of Washington

Page 2: Reference Reconciliation in Complex Information Spaces

Semex: Personal Information Management System

MentionedIn(315)

AuthorOfArticles(52)

RecipientOfEmails(8547)

SenderOfEmails(7595)

Homepage(1)

Page 3: Reference Reconciliation in Complex Information Spaces

Semex: Personal Information Management System

Email Contacts(1145)

Co-authors(24)

Page 4: Reference Reconciliation in Complex Information Spaces

Semex: Personal Information Management System

Authors

FromFile

CitedBy

Cites(33)

PublishedIn

Article: Reference Reconciliation in Complex Information Spaces

Page 5: Reference Reconciliation in Complex Information Spaces

Semex: Personal Information Management System

Xin (Luna) Dong

xin dong

•¶ ðà xinluna dong

luna

dongxin

x. dong

Lab-#dong xin

dong xin luna

Names

Emails

Page 6: Reference Reconciliation in Complex Information Spaces

Semex Without DeduplicationSearch results for luna

luna dongSenderOfEmails(3043)RecipientOfEmails(2445)MentionedIn(94)

23 persons

Page 7: Reference Reconciliation in Complex Information Spaces

Semex Without DeduplicationSearch results for luna

Xin (Luna) DongAuthorOfArticles(49)MentionedIn(20)

23 persons

Page 8: Reference Reconciliation in Complex Information Spaces

Semex Without Deduplication

A Platform for Personal Information Management and Integration

Page 9: Reference Reconciliation in Complex Information Spaces

Semex Without Deduplication

9 Persons: dong xin xin dong

Page 10: Reference Reconciliation in Complex Information Spaces

Semex NEEDS Deduplication (Reference Reconciliation)

Page 11: Reference Reconciliation in Complex Information Spaces

Reference Reconciliation in Complex Information Spaces

Xin (Luna) Dong, Alon Halevy, Jayant Madhavan

@ Sigmod 2005University of Washington

Page 12: Reference Reconciliation in Complex Information Spaces

Complex Information Space Example – An Abstract View of Personal Information Article: a1=(“Distributed Query Processing”,“169-180”,

{p1,p2,p3}, c1)a2=(“Distributed query processing”,“169-180”,

{p4,p5,p6}, c2)

Venue: c1=(“ACM Conference on Management of Data”, “1978”,

“Austin, Texas”) c2=(“ACM SIGMOD”, “1978”, null)

Person: p1=(“Robert S. Epstein”, null)p2=(“Michael Stonebraker”, null)p3=(“Eugene Wong”, null) p4=(“Epstein, R.S.”, null)p5=(“Stonebraker, M.”, null)p6=(“Wong, E.”, null)

Page 13: Reference Reconciliation in Complex Information Spaces

Complex Information Space Example – An Abstract View of Personal Information Article: a1=(“Distributed Query Processing”,“169-180”, {p1,p2,p3},

c1)a2=(“Distributed query processing”,“169-180”, {p4,p5,p6},

c2)

Venue: c1=(“ACM Conference on Management of Data”, “1978”,

“Austin, Texas”) c2=(“ACM SIGMOD”, “1978”, null)

Person: p1=(“Robert S. Epstein”, null)p2=(“Michael Stonebraker”, null)p3=(“Eugene Wong”, null) p4=(“Epstein, R.S.”, null)p5=(“Stonebraker, M.”, null)p6=(“Wong, E.”, null) p7=(“Eugene Wong”, “[email protected]”)p8=(null, “[email protected]”)p9=(“mike”, “[email protected]”)

Class

Reference

AtomicAttribute

AssociationAttribute

Page 14: Reference Reconciliation in Complex Information Spaces

Other Complex Information Spaces Citation portals, e.g., Citeseer,

Cora Online product catalogs in E-

commerce

Page 15: Reference Reconciliation in Complex Information Spaces

Real-World Objects Article: a1=(“Distributed Query Processing”,“169-180”, {p1,p2,p3},

c1)a2=(“Distributed query processing”,“169-180”, {p4,p5,p6},

c2)

Venue: c1=(“ACM Conference on Management of Data”, “1978”,

“Austin, Texas”) c2=(“ACM SIGMOD”, “1978”, null)

Person: p1=(“Robert S. Epstein”, null)p2=(“Michael Stonebraker”, null)p3=(“Eugene Wong”, null) p4=(“Epstein, R.S.”, null)p5=(“Stonebraker, M.”, null)p6=(“Wong, E.”, null) p7=(“Eugene Wong”, “[email protected]”)p8=(null, “[email protected]”)p9=(“mike”, “[email protected]”)

Page 16: Reference Reconciliation in Complex Information Spaces

Reference Reconciliation

Input: A set of references R Output: A partitioning over R, such

thatEach partition refers to a single real-

world object – high precision

Different partitions refer to different objects – high recall

Page 17: Reference Reconciliation in Complex Information Spaces

Related Work

A very active area of research in Databases, Data Mining and AI

Most current approaches assume matching tuples from a single database table Traditional approaches (Surveyed in [Cohen, et

al. 2003]) Step I. Compare attributes Step II. Combine attribute similarities to decide tuple

match/non-match Step III. Compute transitive closures to get partitions

New approaches explore relationship between reconciliation decisions using probability models[Russell et al, 2002] [Domingos et al, 2004]

Harder for complex information spaces

Page 18: Reference Reconciliation in Complex Information Spaces

Challenges in Complex Information Spaces

Article: a1=(“Distributed Query Processing”,“169-180”, {p1,p2,p3}, c1)

a2=(“Distributed query processing”,“169-180”, {p4,p5,p6}, c2)

Venue: c1=(“ACM Conference on Management of Data”, “1978”,

“Austin, Texas”) c2=(“ACM SIGMOD”, “1978”, null)

Person: p1=(“Robert S. Epstein”, null)p2=(“Michael Stonebraker”, null)p3=(“Eugene Wong”, null) p4=(“Epstein, R.S.”, null)p5=(“Stonebraker, M.”, null)p6=(“Wong, E.”, null) p7=(“Eugene Wong”, “[email protected]”)p8=(null, “[email protected]”)p9=(“mike”, “[email protected]”)

1. MultipleClasses 3. Multi-value

Attributes

2. LimitedInformation

?

?

Page 19: Reference Reconciliation in Complex Information Spaces

Intuition

Complex information spaces can be considered as networks of instances and associations between the instances

Key: exploit the network, specifically, the clues hidden in the associations

Page 20: Reference Reconciliation in Complex Information Spaces

Outline

Introduction and problem definition

Reconciliation algorithmExperimental resultsConclusions

Page 21: Reference Reconciliation in Complex Information Spaces

Framework: Dependency Graph p2=(“Michael Stonebraker”, null, {p1, p3})

p3=(“Eugene Wong”, null, {p1, p2}) p7=(“Eugene Wong”, “[email protected]”, {p8}) p8=(null, “[email protected]”, {p7}) p9=(“mike”, “[email protected]”, null)

(p2, p8)

(p3,p7) (“Michael Stonebraker”, “stonebraker@”)

Reference Similarity Attribute Similarity

Compare contacts

Cross-attr similarity

(p1,p7)

(“Michael Stonebraker”, p7)

(p1, “[email protected]”)

(p3, “[email protected]”)

Page 22: Reference Reconciliation in Complex Information Spaces

Framework: Dependency Graph p2=(“Michael Stonebraker”, null, {p1, p3})

p3=(“Eugene Wong”, null, {p1, p2}) p7=(“Eugene Wong”, “[email protected]”, {p8}) p8=(null, “[email protected]”, {p7}) p9=(“mike”, “[email protected]”, null)

(p2, p8)

(p3,p7) (“Michael Stonebraker”, “stonebraker@”)

Reference Similarity Attribute Similarity

Compare contacts

Cross-attr similarity

Page 23: Reference Reconciliation in Complex Information Spaces

Framework: Dependency Graph p2=(“Michael Stonebraker”, null, {p1, p3})

p3=(“Eugene Wong”, null, {p1, p2}) p7=(“Eugene Wong”, “[email protected]”, {p8}) p8=(null, “[email protected]”, {p7}) p9=(“mike”, “[email protected]”, null)

(p8, p9)

(p2, p8)

(“Michael Stonebraker”, “mike”)

(p2, p9)

(p3,p7) (“Michael Stonebraker”, “stonebraker@”)

Reference Similarity Attribute Similarity

(“[email protected]”, “[email protected]”)

(“Eugene Wong”, “Eugene Wong”)

Page 24: Reference Reconciliation in Complex Information Spaces

Exploit the Dependency Graph

(“Distributed…”, “Distributed …”)

(“169-180”, “169-180”)

(a1, a2)

(“Michael Stonebraker”, “Stonebraker, M.”)

(p2, p5)

(“Eugene Wong”, “Wong, E.”)

(p3, p6)(c1, c2)

(“ACM …”, “ACM SIGMOD”) (“1978”, “1978”)

Reference similarity Attribute similarity

(“Robert S. Epstein”, “Epstein, R.S.”)

(p1, p4)

Page 25: Reference Reconciliation in Complex Information Spaces

Dependency Graph Example II

(“Distributed…”, “Distributed …”)

(“169-180”, “169-180”)

(a1, a2)

(“Michael Stonebraker”, “Stonebraker, M.”)

(p2, p5)

(“Eugene Wong”, “Wong, E.”)

(p3, p6)(c1, c2)

(“ACM …”, “ACM SIGMOD”) (“1978”, “1978”)

Reference similarity Attribute similarity

(“Robert S. Epstein”, “Epstein, R.S.”)

(p1, p4)

Compare authored papers

Page 26: Reference Reconciliation in Complex Information Spaces

Strategy I. Consider Richer Evidence Cross-attribute similarity –

Name&email p5=(“Stonebraker, M.”, null) p8=(null, “[email protected]”)

Context Information I – Contact list p5=(“Stonebraker, M.”, null, {p4, p6}) p8=(null, “[email protected]”, {p7}) p6=p7

Context Information II – Authored articles p2=(“Michael Stonebraker”, null) p5=(“Stonebraker, M.”, null) p2 and p5 authored the same article

Page 27: Reference Reconciliation in Complex Information Spaces

Considering Only Attribute-wise Similarities Cannot Merge Persons Well

1750

1950

2150

2350

2550

2750

2950

3150

3350

1 2 3 4

Evidence

#(P

erso

n P

arti

tio

ns)

1409

Person references: 24076 Real-world persons (gold-standard):1750

3159

Page 28: Reference Reconciliation in Complex Information Spaces

Considering Richer Evidence Improves the Recall

3159

2169 21692096

1750

1950

2150

2350

2550

2750

2950

3150

3350

Attr-wise Name&Email Article Contact

Evidence

#(P

erso

n P

arti

tio

ns)

1409

346

Person references: 24076 Real-world persons:1750

Page 29: Reference Reconciliation in Complex Information Spaces

Exploit the Dependency Graph

(“Distributed…”, “Distributed …”)

(“169-180”, “169-180”)

(a1, a2)

(“Michael Stonebraker”, “Stonebraker, M.”)

(p2, p5)

(“Eugene Wong”, “Wong, E.”)

(p3, p6)(c1, c2)

(“ACM …”, “ACM SIGMOD”) (“1978”, “1978”)

Reference similarity Attribute similarity

(“Robert S. Epstein”, “Epstein, R.S.”)

(p1, p4)

Page 30: Reference Reconciliation in Complex Information Spaces

Exploit the Dependency Graph

(“Distributed…”, “Distributed …”)

(“169-180”, “169-180”)

(a1, a2)

(“Michael Stonebraker”, “Stonebraker, M.”)

(p2, p5)

(“Eugene Wong”, “Wong, E.”)

(p3, p6)(c1, c2)

(“ACM …”, “ACM SIGMOD”) (“1978”, “1978”)

(“Robert S. Epstein”, “Epstein, R.S.”)

(p1, p4)

Reconciled Similar

Page 31: Reference Reconciliation in Complex Information Spaces

Exploit the Dependency Graph

(“Distributed…”, “Distributed …”)

(“169-180”, “169-180”)

(a1, a2)

(“Michael Stonebraker”, “Stonebraker, M.”)

(p2, p5)

(“Eugene Wong”, “Wong, E.”)

(p3, p6)(c1, c2)

(“ACM …”, “ACM SIGMOD”) (“1978”, “1978”)

(“Robert S. Epstein”, “Epstein, R.S.”)

(p1, p4)

Reconciled Similar

Page 32: Reference Reconciliation in Complex Information Spaces

Exploit the Dependency Graph

(“Distributed…”, “Distributed …”)

(“169-180”, “169-180”)

(a1, a2)

(“Michael Stonebraker”, “Stonebraker, M.”)

(p2, p5)

(“Eugene Wong”, “Wong, E.”)

(p3, p6)(c1, c2)

(“ACM …”, “ACM SIGMOD”) (“1978”, “1978”)

(“Robert S. Epstein”, “Epstein, R.S.”)

(p1, p4)

Reconciled Similar

Page 33: Reference Reconciliation in Complex Information Spaces

Exploit the Dependency Graph

(“Distributed…”, “Distributed …”)

(“169-180”, “169-180”)

(a1, a2)

(“Michael Stonebraker”, “Stonebraker, M.”)

(p2, p5)

(“Eugene Wong”, “Wong, E.”)

(p3, p6)(c1, c2)

(“ACM …”, “ACM SIGMOD”) (“1978”, “1978”)

(“Robert S. Epstein”, “Epstein, R.S.”)

(p1, p4)

Reconciled Similar

Page 34: Reference Reconciliation in Complex Information Spaces

Exploit the Dependency Graph

(“Distributed…”, “Distributed …”)

(“169-180”, “169-180”)

(a1, a2)

(“Michael Stonebraker”, “Stonebraker, M.”)

(p2, p5)

(“Eugene Wong”, “Wong, E.”)

(p3, p6)(c1, c2)

(“ACM …”, “ACM SIGMOD”) (“1978”, “1978”)

(“Robert S. Epstein”, “Epstein, R.S.”)

(p1, p4)

Reconciled Similar

Page 35: Reference Reconciliation in Complex Information Spaces

Strategy II. Propagate Information between Reconciliation Decisions After changing the similarity

score of one node, re-compute similarity scores of its neighbors

This process converges ifSimilarity score is monotone in the

similarity values of neighborsCompute neighbor similarities only

if similarity increase is not too small

Page 36: Reference Reconciliation in Complex Information Spaces

3159

2169 21692096

3159

2146 2135

2022

1750

1950

2150

2350

2550

2750

2950

3150

3350

Attr-w ise Name&Email Article Contact

Evidence

#(Pe

rson

Par

titio

ns)

Traditional Propagation

Propagating Information between Reconciliation Decisions Further Improves Recall

Person references: 24076 Real-world persons:1750

Page 37: Reference Reconciliation in Complex Information Spaces

Strategy III. Enrich References in Reconciliation Enrich knowledge of a real-world

object for later reconciliation Naïve:

Construct graph Compute similarity Transitive Closure

Problems Dependency-graph construction is expensive Reference enrichment takes effect until the

next pass Solution

Instant enrichment by adding neighbors in the dependency graph

Page 38: Reference Reconciliation in Complex Information Spaces

Enrich References by Adding Neighbors p2=(“Michael Stonebraker”, null, {p1, p3})

p3=(“Eugene Wong”, null, {p1, p2}) p7=(“Eugene Wong”, “[email protected]”, {p8}) p8=(null, “[email protected]”, {p7}) p9=(“mike”, “[email protected]”, null)

(p8, p9)

(“[email protected]”, “[email protected]”)

(p2, p8)

(“Michael Stonebraker”, “mike”)

(p2, p9)

(p3,p7) (“Michael Stonebraker”, “stonebraker@”)

Reconciled Similar

Page 39: Reference Reconciliation in Complex Information Spaces

Enrich References by Adding Neighbors p2=(“Michael Stonebraker”, null, {p1, p3})

p3=(“Eugene Wong”, null, {p1, p2}) p7=(“Eugene Wong”, “[email protected]”, {p8}) p8=(null, “[email protected]”, {p7}) p9=(“mike”, “[email protected]”, null)

(p8, p9)

(“[email protected]”, “[email protected]”)

(p2, p8)

(“Michael Stonebraker”, “mike”)

(p2, p9)

(p3,p7) (“Michael Stonebraker”, “stonebraker@”)

Reconciled Similar

Page 40: Reference Reconciliation in Complex Information Spaces

Enrich References by Adding Neighbors p2=(“Michael Stonebraker”, null, {p1, p3})

p3=(“Eugene Wong”, null, {p1, p2}) p7=(“Eugene Wong”, “[email protected]”, {p8}) p8=(null, “[email protected]”, {p7}) p9=(“mike”, “[email protected]”, null)

(p8, p9)

(“[email protected]”, “[email protected]”)

(p2, p8)

(“Michael Stonebraker”, “mike”)

(p2, p9)

(p3,p7) (“Michael Stonebraker”, “stonebraker@”)

Reconciled Similar

Page 41: Reference Reconciliation in Complex Information Spaces

Enrich References by Adding Neighbors p2=(“Michael Stonebraker”, null, {p1, p3})

p3=(“Eugene Wong”, null, {p1, p2}) p7=(“Eugene Wong”, “[email protected]”, {p8}) p8=(null, “[email protected]”, {p7}) p9=(“mike”, “[email protected]”, null)

(p8, p9)

(“[email protected]”, “[email protected]”)

(p2, p8)

(“Michael Stonebraker”, “mike”)(p3,p7) (“Michael Stonebraker”, “stonebraker@”)

Reconciled Similar

Page 42: Reference Reconciliation in Complex Information Spaces

Enrich References by Adding Neighbors p2=(“Michael Stonebraker”, null, {p1, p3})

p3=(“Eugene Wong”, null, {p1, p2}) p7=(“Eugene Wong”, “[email protected]”, {p8}) p8=(null, “[email protected]”, {p7}) p9=(“mike”, “[email protected]”, null)

(p8, p9)

(“[email protected]”, “[email protected]”)

(p2, p8)

(“Michael Stonebraker”, “mike”)(p3,p7) (“Michael Stonebraker”, “stonebraker@”)

Reconciled Similar

Page 43: Reference Reconciliation in Complex Information Spaces

References Enrichment Improves Recall More than Information Propagation

3159

2169 21692096

3169

2036 2036

19101750

1950

2150

2350

2550

2750

2950

3150

3350

Attr-wise Name&Email Article Contact

Evidence

#(P

erso

n P

arti

tio

ns)

Traditional Enrichment Propagation

Person references: 24076 Real-world persons:1750

Page 44: Reference Reconciliation in Complex Information Spaces

3159

2169 21692096

3169

2002 1990

18731750

1950

2150

2350

2550

2750

2950

3150

3350

Attr-wise Name&Email Article Contact

Evidence

#(P

erso

n P

artit

ions

)

Traditional Enrichment Propagation Full

Applying Both Information Propagation and Reference Enrichment Get the Highest Recall

Person references: 24076 Real-world persons:1750

1409

125346

Page 45: Reference Reconciliation in Complex Information Spaces

Outline

Introduction and problem definition

Reconciliation algorithm Experimental results Conclusions

Page 46: Reference Reconciliation in Complex Information Spaces

Experiment Settings Datasets

Four personal datasets Cora dataset for citations

Use the same parameters and thresholds for all datasets Measure

Precision and recall, F-measure Precision: The percentage of correctly reconciled reference pairs over

all reconciled reference pairs Recall: The percentage of correctly reconciled reference pairs over

pairs of references that refer to the same real-world object

Diversity and Dispersion Diversity: For every result partition, how many real-world objects are

included; ideally should be 1 (related to precision) Dispersion: For every real-world object, how many result partitions

include them; ideally should be 1 (related to recall)

Page 47: Reference Reconciliation in Complex Information Spaces

3159

2169 21692096

3169

2002 1990

18731750

1950

2150

2350

2550

2750

2950

3150

3350

Attr-wise Name&Email Article Contact

Evidence

#(P

erso

n P

artit

ions

)

Traditional Enrichment Propagation Full

Recall Results on One Personal Dataset

Person references: 24076 Real-world persons:1750

1409

125346

Page 48: Reference Reconciliation in Complex Information Spaces

Results Considering All Occurrences of Person Instances

Dataset#per/#ref

Attr-wise Matching Dependency Graph

Prec/Recall

F#Pa

rPrec/Recall

F#Pa

r

A (1750/2407

6)B

(1989/36359)C

(1570/15160)D

(1518/17199)

Avg

0.999/0.741

0.974/0.998

0.999/0.967

0.894/0.998

0.967/0.926

0.8510.9860.9830.943

0.946

3159

2154

1660

1579

0.999/0.999

0.999/0.999

0.982/0.987

0.999/0.920

0.995/0.976

0.9990.9990.9850.958

0.986

1873

2068

1596

1546

Both precision and recall increase compared with attr-wise matching.

Page 49: Reference Reconciliation in Complex Information Spaces

Results Considering Only Distinct Person References

Dataset#per/#dist-

ref

Attr-wise Matching Dependency Graph

Prec/Recall

F#Pa

rPrec/Recall

F#Pa

r

A (1750/3114

)B

(1989/3211)C

(1570/2430)D

(1518/2188)

Avg

0.995/0.509

0.81/0.803

0.987/0.782

0.694/0.837

0.872/0.733

0.6730.8060.8730.759

0.778

3159

2154

1660

1579

0.982/0.947

0.958/0.891

0.814/0.925

0.942/0.737

0.924/0.875

0.9640.9230.8670.827

0.895

1873

2068

1596

1546

Precision and recall increase largely compared with attr-wise matching.

Page 50: Reference Reconciliation in Complex Information Spaces

Diversity and Dispersion Are Very Close to 1

Dataset#per/#ref

Attr-wise Matching Dependency Graph

Diversity/Dispersion Diversity/Dispersion

A (1750/2407

6)B

(1989/36359)C

(1570/15160)D

(1518/17199)

Avg

1.18/1.0031.067/1.01

1.053/1.0031.041/1.004

1.085/1.005

1.047/1.0031.039/1.0081.03/1.017

1.023/1.005

1.035/1.008

Page 51: Reference Reconciliation in Complex Information Spaces

Our Algorithm Equals or OutperformsAttr-wise Matching in All Classes

Class

Attr-wise Matching

Dependency Graph

Precision

RecallPrecisi

onRecall

Person

Article

Venue

0.9670.9970.935

0.9260.9770.790

0.9950.9990.987

0.9760.9760.937

Page 52: Reference Reconciliation in Complex Information Spaces

Results on Cora Dataset is Competitive with Other Reported Results

Results reported in other record linkage papers: Precision/Recall = 0.990/0.925 [Cohen et al., 2002] Precision/Recall = 0.842/0.909 [Parag and Domingo, 2004] F-measure = 0.867 [Bilenko and Mooney, 2003]

Class

Attr-wise Matching

Dependency Graph

Prec/Recall

F-msre

Prec/Recall F-msre

Article

PersonVenue

0.985/0.913

0.994/0.985

0.982/0.362

0.948 0.9890.529

0.985/0.924

1/0.9870.837/0.71

4

0.954 0.9930.771

Page 53: Reference Reconciliation in Complex Information Spaces

Conclusions

Contributions: Dependency-graph-based reconciliation algorithm Exploit rich evidence Propagate information between

reconciliation decisions Enrich references during reconciliation

Extended Work Propagate negative information through

dependency Graph

Page 54: Reference Reconciliation in Complex Information Spaces

Reference Reconciliation in Complex Information Spaces

Xin (Luna) Dong, Alon Halevy, Jayant Madhavan

@ Sigmod 2005http://data.cs.washington.edu/semex

Page 55: Reference Reconciliation in Complex Information Spaces

Strategy IV. Enforce Constraints Problem:

Solution: Propagate negative information—ConstraintsNon-merge node: the two elements are

guaranteed to be different and should never be merged

P1

P2

P3

Page 56: Reference Reconciliation in Complex Information Spaces

Enforce Constraints by Propagating Negative Information p2=(“Michael Stonebraker”, null, {p1, p3})

p3=(“Eugene Wong”, null, {p1, p2}) p7=(“Eugene Wong”, “[email protected]”, {p8}) p8=(null, “[email protected]”, {p7}) p9=(“matt”, “[email protected]”, null)

(p2, p8)

(“Michael Stonebraker”, “matt”)

(p2, p9)

(p3,p7) (“Michael Stonebraker”, “stonebraker@”)

(p8, p9)

Reference Similarity Attribute Similarity

(“[email protected]”, “[email protected]”)

Page 57: Reference Reconciliation in Complex Information Spaces

Enforce Constraints by Propagating Negative Information p2=(“Michael Stonebraker”, null, {p1, p3})

p3=(“Eugene Wong”, null, {p1, p2}) p7=(“Eugene Wong”, “[email protected]”, {p8}) p8=(null, “[email protected]”, {p7}) p9=(“matt”, “[email protected]”, null)

(p2, p8)

(“Michael Stonebraker”, “matt”)

(p2, p9)

(p3,p7) (“Michael Stonebraker”, “stonebraker@”)

Reconciled Similar Non-merge

(p8, p9)

(“[email protected]”, “[email protected]”)

Constraint

Page 58: Reference Reconciliation in Complex Information Spaces

Enforce Constraints by Propagating Negative Information p2=(“Michael Stonebraker”, null, {p1, p3})

p3=(“Eugene Wong”, null, {p1, p2}) p7=(“Eugene Wong”, “[email protected]”, {p8}) p8=(null, “[email protected]”, {p7}) p9=(“matt”, “[email protected]”, null)

(p2, p8)

(“Michael Stonebraker”, “matt”)

(p2, p9)

(p3,p7) (“Michael Stonebraker”, “stonebraker@”)

Reconciled Similar Non-merge

(p8, p9)

(“[email protected]”, “[email protected]”)

Constraint

Page 59: Reference Reconciliation in Complex Information Spaces

Enforce Constraints by Propagating Negative Information p2=(“Michael Stonebraker”, null, {p1, p3})

p3=(“Eugene Wong”, null, {p1, p2}) p7=(“Eugene Wong”, “[email protected]”, {p8}) p8=(null, “[email protected]”, {p7}) p9=(“matt”, “[email protected]”, null)

(p2, p8)

(“Michael Stonebraker”, “matt”)

(p2, p9)

(p3,p7) (“Michael Stonebraker”, “stonebraker@”)

Reconciled Similar Non-merge

(p8, p9)

(“[email protected]”, “[email protected]”)

Constraint

Page 60: Reference Reconciliation in Complex Information Spaces

Enforce Constraints by Propagating Negative Information p2=(“Michael Stonebraker”, null, {p1, p3})

p3=(“Eugene Wong”, null, {p1, p2}) p7=(“Eugene Wong”, “[email protected]”, {p8}) p8=(null, “[email protected]”, {p7}) p9=(“matt”, “[email protected]”, null)

(p2, p8)

(“Michael Stonebraker”, “matt”)

(p2, p9)

(p3,p7) (“Michael Stonebraker”, “stonebraker@”)

Reconciled Similar Non-merge

(p8, p9)

(“[email protected]”, “[email protected]”)

Constraint

Page 61: Reference Reconciliation in Complex Information Spaces

Enforcing Constraints Improves Precision

Method Precision

#(Entities reconciled with others incorrectly)

Constraint 0.999 13No

Constraint0.947 61

Page 62: Reference Reconciliation in Complex Information Spaces

Similarity Computation

Similarity function for node N – s(N) Input: sim scores of N’s neighborsOutput: sim score of N, ranged from 0 to 1

Similarity function can be defined by applying domain knowledge, learning from training data, resorting to global knowledge, etc.

S = Srv + Ssb + Swb

Srv: from real-valued neighbors. Decision-tree shape.

Ssb: from strong-boolean-valued neighborsSwb: from weak-boolean-valued neighbors

Page 63: Reference Reconciliation in Complex Information Spaces

Framework: Dependency Graph Definition

For every pair of references A and B: A node representing their similarity

For every attribute of A and attribute of B A node representing attribute similarity An edge between attr-sim node and ref-sim

node, representing the dependency between the similarities

Each node is associated with a similarity score between 0 and 1

Construction: include only nodes whose two elements have potential to be similar