Linking Records with Erroneous Values

Songtao Guo, Xin Luna Dong, Divesh Srivastava, and Remi Zajac

AT&T Labs


Motivation

| Src | Name | Phone | Address | City |
|---|---|---|---|---|
| V | A-Link Wireless | 8185491449 | 2148 GLENDALE GALLERIA | GLENDALE |
| V | Abercrombie | 8185020728 | 2229 GLENDALE GALLERIA | GLENDALE |
| V | Abercrombie & Fitch | 8185507492 | 2151 GLENDALE GALLERIA | GLENDALE |
| V | Aeropostale | 8185458972 | 2187 GLENDALE GALLERIA | GLENDALE |
| V | Aerosoles | 8182462455 | 1163 GLENDALE GALLERIA | GLENDALE |
| V | Newtown Pizza Palace | 2034266114 | 65 Church hill Rd | NEWTOWN |
| V | Pizza Palace Of Newtown | 2034266114 | 65 Church hill Rd | NEWTOWN |

[Figure: several sources (s) feed an integration step that produces CleanedData, which serves the SearchBox.]

| Src | Name | Phone | Address | City |
|---|---|---|---|---|
| D | Aerosoles | 8182462455 | 1163 GLENDALE GALLERIA | GLENDALE |
| D | Aldo Shoes | 8184090612 | 1157 GLENDALE GALLERIA | GLENDALE |
| D | Newtown Pizza Palace | 2034266114 | 65 Church hill Rd | Newtown |
| D | Pizza Palace of Newtown | 2034266114 | Church Hill Rd | Newtown |

| Src | Name | Phone | Address | City |
|---|---|---|---|---|
| A | A 24 Hour 1 A 1 Locksmith | 8182404644 | 3210 GLENDALE GALLERIA | GLENDALE |
| A | A Link Wireless | 8185491449 | 2148 GLENDALE GALLERIA | GLENDALE |
| A | Abercrombie | 8185020728 | 2229 GLENDALE GALLERIA | GLENDALE |
| A | Abercrombie & Fitch | 8185507492 | 2151 GLENDALE GALLERIA | GLENDALE |
| A | Newtown Pizza Palace | 2034266114 | 65 Church hill Rd | Newtown |
| A | Aldo Shoes | 8185482540 | 2154 GLENDALE GALLERIA | GLENDALE |
| A | Alert Cellular | 8182404779 | 2148 GLENDALE GALLERIA | GLENDALE |

| Src | Name | Phone | Address | City |
|---|---|---|---|---|
| T | Newtown Pizza Palace | 2034266114 | 65 Church hill Rd | Newtown |
| T | Aldo Shoes | 8185482540 | 2154 GLENDALE GALLERIA | GLENDALE |
| T | American Eagle Outfitters | 8189561893 | 2182 GLENDALE GALLERIA | GLENDALE |
| T | ANN TAYLOR | 8182460350 | 2178 GLENDALE GALLERIA | GLENDALE |
| T | Ann Taylor Stores | 8182460350 | 1108 GLENDALE GALLERIA | GLENDALE |


Motivation

Which type of listing are they?

• A: the same business

• B: different businesses sharing the same phone#

• C: different businesses, only one correctly associated with the given phone#


Current Solution

• Uniqueness constraint
– Each real-world entity has a unique value, e.g., phone, address
• The data may not satisfy the constraint
– Erroneous values
– Small number of exceptions
• Current two-step solution
– Step 1: Record Linkage
• Link records that are likely to refer to the same real-world entity [A. K. Elmagarmid, TKDE'07], [W. Winkler, Tech Report'06]
– Step 2: Data Fusion
• Decide the correct values in the presence of conflicts [J. Bleiholder et al., ACM Computing Surveys]


Limitations of Current Solution

| SOURCE | NAME | PHONE | ADDRESS |
|---|---|---|---|
| s1 | Microsofe Corp. | xxx-1255 | 1 Microsoft Way |
| s1 | Microsofe Corp. | xxx-9400 | 1 Microsoft Way |
| s1 | Macrosoft Inc. | xxx-0500 | 2 Sylvan W. |
| s2 | Microsoft Corp. | xxx-1255 | 1 Microsoft Way |
| s2 | Microsofe Corp. | xxx-9400 | 1 Microsoft Way |
| s2 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way |
| s3 | Microsoft Corp. | xxx-1255 | 1 Microsoft Way |
| s3 | Microsoft Corp. | xxx-9400 | 1 Microsoft Way |
| s3 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way |
| s4 | Microsoft Corp. | xxx-1255 | 1 Microsoft Way |
| s4 | Microsoft Corp. | xxx-9400 | 1 Microsoft Way |
| s4 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way |
| s5 | Microsoft Corp. | xxx-1255 | 1 Microsoft Way |
| s5 | Microsoft Corp. | xxx-9400 | 1 Microsoft Way |
| s5 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way |
| s6 | Microsoft Corp. | xxx-2255 | 1 Microsoft Way |
| s6 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way |
| s7 | MS Corp. | xxx-1255 | 1 Microsoft Way |
| s7 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way |
| s8 | MS Corp. | xxx-1255 | 1 Microsoft Way |
| s8 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way |
| s9 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way |
| s10 | MS Corp. | xxx-0500 | 2 Sylvan Way |

Locally resolving conflicts for linked records may overlook important global evidence

Erroneous values may prevent correct matching

Traditional techniques may fall short when exceptions to the uniqueness constraints exist

(Microsoft Corp., Microsofe Corp., MS Corp.) (xxx-1255, xxx-9400) (1 Microsoft Way)

(Macrosoft Inc.) (xxx-0500) (2 Sylvan Way, 2 Sylvan W.)


Our Solution

• Perform linkage and fusion simultaneously
– Able to identify incorrect values from the beginning, so can improve linkage
• Make global decisions
– Consider sources that associate a pair of values in the same record, so can improve fusion
• Allow a small number of violations to capture possible exceptions in the real world


Road Map

• Motivation and overview

• Problem definition

• Solution

• Evaluations on YP data

• Conclusions


Problem Input

• A set of independent data sources, each providing a set of records

• A set of (soft) uniqueness constraints
– Uniqueness constraint (hard constraint):
• Business Name, Business Phone, Business Address
– Soft uniqueness constraint (soft constraint):
• Business Phone, annotated with 1-p1 and 1-p2


K-Partite Graph Encoding

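[Figure: k-partite graph of the running example. Name nodes: N1 Microsofe Corp., N2 Microsoft Corp., N3 MS Corp., N4 Macrosoft Inc. Phone nodes: P1 xxx-1255, P2 xxx-2255, P3 xxx-9400, P4 xxx-0500. Address nodes: A1 1 Microsoft Way, A2 2 Sylvan Way, A3 2 Sylvan W. Each edge is labeled with the sources that support it, e.g. s(1), s(1-2), s(2-5), s(3-5), s(6), s(7-8), s(10), s(1-5,7,8), s(2-6), s(1-9), s(2-9), s(2-10).]

Example: the record from S1 (Microsofe Corp., xxx-1255, 1 Microsoft Way) contributes the edges labeled s(1).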
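The encoding can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the authors' implementation: it uses only four records from the running example, and the names `build_k_partite_graph`, `ATTRIBUTES`, and `RECORDS` are hypothetical. Each distinct value becomes a node of its attribute, and two values are connected when they co-occur in a record, with the edge labeled by the set of supporting sources.

```python
from collections import defaultdict
from itertools import combinations

ATTRIBUTES = ("name", "phone", "address")

# A few records from the running example: (source id, name, phone, address).
RECORDS = [
    (1, "Microsofe Corp.", "xxx-1255", "1 Microsoft Way"),
    (2, "Microsoft Corp.", "xxx-1255", "1 Microsoft Way"),
    (2, "Macrosoft Inc.", "xxx-0500", "2 Sylvan Way"),
    (10, "MS Corp.", "xxx-0500", "2 Sylvan Way"),
]


def build_k_partite_graph(records):
    """Map each edge (a frozenset of two (attribute, value) nodes) to its sources."""
    edges = defaultdict(set)
    for source, *values in records:
        nodes = list(zip(ATTRIBUTES, values))
        # Connect every pair of values that co-occur in one record.
        # A record holds one value per attribute, so no edge joins
        # two nodes of the same attribute.
        for u, v in combinations(nodes, 2):
            edges[frozenset((u, v))].add(source)
    return dict(edges)


graph = build_k_partite_graph(RECORDS)
edge = frozenset((("phone", "xxx-1255"), ("address", "1 Microsoft Way")))
print(sorted(graph[edge]))  # sources supporting this phone-address edge
```

Keeping the supporting sources as a set, rather than a count, lets later steps ask which sources mention or support a group of values.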


Solution Encoding

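[Figure: the k-partite graph of the running example, with name nodes N1-N4 (Microsofe Corp., Microsoft Corp., MS Corp., Macrosoft Inc.), phone nodes P1-P4 (xxx-1255, xxx-2255, xxx-9400, xxx-0500), and address nodes A1-A3 (1 Microsoft Way, 2 Sylvan Way, 2 Sylvan W.).]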

Clustering problem & Matching problem


Solution Encoding with Hard Constraint

[Figure: the k-partite graph of the running example (name nodes N1-N4, phone nodes P1-P4, address nodes A1-A3) with its nodes grouped into clusters C1, C2, C3, and C4.]

Clustering problem


Road Map

• Motivation and overview

• Problem definition

• Solution
– Clustering w.r.t. hard constraint
– Matching w.r.t. soft constraint

• Evaluations on YP data

• Conclusions


Clustering w.r.t. Hard Constraints

[Figure: the example graph restricted to two clusters, C1 (names N1 Microsofe Corp., N2 Microsoft Corp., N3 MS Corp.; phone P1 xxx-1255; address A1 1 Microsoft Way) and C4 (name N4 Macrosoft Inc.; phone P4 xxx-0500; addresses A2 2 Sylvan Way and A3 2 Sylvan W.).]

• Ideal clustering:
– high cohesion within each cluster
– low correlation between different clusters
• Objective function
– Davies-Bouldin index (minimization)
• Average distance of
– similarity distance
– association distance
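As a rough sketch of the objective, the following Python computes the Davies-Bouldin index in its standard form: for each cluster, take the worst ratio of summed within-cluster distances to between-cluster distance, then average over clusters. The function names and the numeric distances are hypothetical, and reading the slide's "average distance" as the plain mean of similarity and association distance is an assumption.

```python
def davies_bouldin(intra, inter):
    """Average, over clusters, of the worst (intra_i + intra_j) / inter_ij ratio.

    intra: {cluster: distance within the cluster}
    inter: {frozenset({ci, cj}): distance between the two clusters}
    """
    clusters = list(intra)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    worst = [
        max((intra[ci] + intra[cj]) / inter[frozenset((ci, cj))]
            for cj in clusters if cj != ci)
        for ci in clusters
    ]
    return sum(worst) / len(worst)


def combined(similarity_distance, association_distance):
    # Assumed reading of the slide: plain average of the two distances.
    return (similarity_distance + association_distance) / 2


# Hypothetical distances for two clusters, for illustration only.
intra = {"C1": combined(0.10, 0.20), "C4": combined(0.05, 0.15)}
inter = {frozenset(("C1", "C4")): combined(0.80, 0.90)}
print(davies_bouldin(intra, inter))
```

Lower values are better: tight clusters shrink the numerator, and well-separated clusters grow the denominator.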


Similarity Distance

• Similarity of values
• Defined for each attribute

[Figure: clusters C1 and C4 of the example graph, with edges between values of the same attribute labeled by similarity: 0.95, 0.65, 0.65, 0.4, 0.7, 0.7, 0.9, and 0.]

Within C1:

$d_S^1(C_1,C_1) = 1 - (0.95+0.65+0.65)/3 = 0.25$ (name)

$d_S^2(C_1,C_1) = 0$ (phone)

$d_S^3(C_1,C_1) = 0$ (address)

$d_S(C_1,C_1) = (0.25+0+0)/3 = 0.083$

Between C1 and C4:

$d_S^1(C_1,C_4) = 1 - (0.7+0.7+0.4)/3 = 0.4$

$d_S^2(C_1,C_4) = 1 - 0 = 1$

$d_S^3(C_1,C_4) = 1 - 0 = 1$

$d_S(C_1,C_4) = (0.4+1+1)/3 = 0.8$
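The arithmetic on this slide can be reproduced with a short Python sketch. The similarity values are taken from the slide; how they were computed (which string similarity measure) is not stated, so they are plain inputs here. The function names are hypothetical, and treating an attribute with no pairs to compare as distance 0 is an assumption that matches the slide's phone and address values within C1.

```python
def attribute_distance(pairwise_similarities):
    """1 minus the mean pairwise similarity; 0 when there are no pairs to compare."""
    if not pairwise_similarities:
        return 0.0
    return 1 - sum(pairwise_similarities) / len(pairwise_similarities)


def similarity_distance(similarities_by_attribute):
    """Average the per-attribute distances (name, phone, address)."""
    distances = [attribute_distance(s) for s in similarities_by_attribute.values()]
    return sum(distances) / len(distances)


# Within C1: three names to compare, a single phone, a single address.
within_c1 = {"name": [0.95, 0.65, 0.65], "phone": [], "address": []}
# Between C1 and C4: phones and addresses have similarity 0 on the slide.
c1_vs_c4 = {"name": [0.7, 0.7, 0.4], "phone": [0.0], "address": [0.0]}

print(round(similarity_distance(within_c1), 3))  # 0.083
print(round(similarity_distance(c1_vs_c4), 3))   # 0.8
```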


Association Distance

• Association by edges
• Defined for each pair of attributes

[Figure: clusters C1 and C4 of the example graph, with edges between values of different attributes labeled by their supporting sources: s(1), s(1-2), s(2-5), s(3-5), s(7-8), s(10), s(1-5,7,8), s(2-6), s(1-9), s(2-9), s(2-10).]

Within C1:

• 9 sources (S1-S8, S10) mention (N1,N2,N3,P1)
• 7 sources (S1-S5, S7, S8) support (N1,N2,N3)-P1

$d_A^{1,2}(C_1,C_1) = 1 - 7/9 = 0.22$

$d_A^{1,3}(C_1,C_1) = 1 - 8/9 = 0.11$

$d_A^{2,3}(C_1,C_1) = 1 - 7/8 = 0.125$

$d_A(C_1,C_1) = (0.22+0.11+0.125)/3 = 0.153$

Between C1 and C4:

• 10 sources (S1-S10) mention (N1,N2,N3,N4), (P1,P4)
• 1 source (S10) supports (N1,N2,N3)-P4
• No connection between (N4,P1)

$d_A^{1,2}(C_1,C_4) = 1 - \max(1/10, 0/10) = 0.9$

$d_A^{1,3}(C_1,C_4) = 0.9$

$d_A^{2,3}(C_1,C_4) = 1$

$d_A(C_1,C_4) = (0.9+0.9+1)/3 = 0.93$
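The slide's numbers can be recomputed from the ten-source example table. The sketch below is one reading of the slide, stated as an assumption: within a cluster, the distance for an attribute pair is 1 minus the fraction of mentioning sources that provide a record containing a value from both sides; between clusters, the larger of the two cross-cluster support fractions is used. All function and variable names are hypothetical.

```python
NAME, PHONE, ADDRESS = 1, 2, 3  # column positions inside a record tuple
MS_WAY, SYLVAN = "1 Microsoft Way", "2 Sylvan Way"

# Records of the running example: (source id, name, phone, address).
RECORDS = [
    (1, "Microsofe Corp.", "xxx-1255", MS_WAY),
    (1, "Microsofe Corp.", "xxx-9400", MS_WAY),
    (1, "Macrosoft Inc.", "xxx-0500", "2 Sylvan W."),
    (2, "Microsoft Corp.", "xxx-1255", MS_WAY),
    (2, "Microsofe Corp.", "xxx-9400", MS_WAY),
    (6, "Microsoft Corp.", "xxx-2255", MS_WAY),
    (7, "MS Corp.", "xxx-1255", MS_WAY),
    (8, "MS Corp.", "xxx-1255", MS_WAY),
    (10, "MS Corp.", "xxx-0500", SYLVAN),
]
for s in (3, 4, 5):
    RECORDS += [(s, "Microsoft Corp.", "xxx-1255", MS_WAY),
                (s, "Microsoft Corp.", "xxx-9400", MS_WAY)]
for s in range(2, 10):  # sources 2 through 9
    RECORDS.append((s, "Macrosoft Inc.", "xxx-0500", SYLVAN))

C1 = {NAME: {"Microsofe Corp.", "Microsoft Corp.", "MS Corp."},
      PHONE: {"xxx-1255"}, ADDRESS: {MS_WAY}}
C4 = {NAME: {"Macrosoft Inc."}, PHONE: {"xxx-0500"},
      ADDRESS: {SYLVAN, "2 Sylvan W."}}


def mentioning(values_i, i, values_j, j):
    """Sources with a record containing a value from either side."""
    return {r[0] for r in RECORDS if r[i] in values_i or r[j] in values_j}


def supporting(values_i, i, values_j, j):
    """Sources with a record containing a value from both sides."""
    return {r[0] for r in RECORDS if r[i] in values_i and r[j] in values_j}


def assoc_within(c, i, j):
    return 1 - len(supporting(c[i], i, c[j], j)) / len(mentioning(c[i], i, c[j], j))


def assoc_between(a, b, i, j):
    mention = mentioning(a[i] | b[i], i, a[j] | b[j], j)
    best = max(len(supporting(a[i], i, b[j], j)),
               len(supporting(b[i], i, a[j], j)))
    return 1 - best / len(mention)


PAIRS = [(NAME, PHONE), (NAME, ADDRESS), (PHONE, ADDRESS)]
within = [assoc_within(C1, i, j) for i, j in PAIRS]
between = [assoc_between(C1, C4, i, j) for i, j in PAIRS]
print([round(d, 3) for d in within], round(sum(within) / 3, 3))
print([round(d, 3) for d in between], round(sum(between) / 3, 3))
```

Under these assumptions the within-cluster values come out as 1 − 7/9, 1 − 8/9, and 1 − 7/8, and the between-cluster values as 0.9, 0.9, and 1, in line with the slide.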


Greedy Algorithm

• Obtaining the optimal clustering is intractable
– [T. F. Gonzalez, 82], [J. Simal et al., 06]
• Hill-climbing approximation: CLUSTER
– Step 1: Initialization
• Cluster value representations by their similarity. Do majority voting to associate clusters
– Step 2: Adjustment
• For each node, move it to the cluster that minimizes the DB index
– Step 3: Convergence checking
• Terminate if Step 2 doesn't change the clustering result. Otherwise, repeat Step 2
• The algorithm converges
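The adjustment loop of Steps 2 and 3 can be sketched generically. This is not the CLUSTER algorithm itself: the initialization step is skipped, and the objective below is a toy stand-in (within-cluster squared deviation of points on a line) rather than the DB index. The names `hill_climb`, `POSITION`, and `squared_deviation` are hypothetical.

```python
def hill_climb(nodes, assignment, cluster_ids, objective):
    """Repeat single-node moves that strictly lower the objective until none helps."""
    assignment = dict(assignment)
    changed = True
    while changed:
        changed = False
        for node in nodes:
            best_cluster = assignment[node]
            best_cost = objective(assignment)
            for cluster in cluster_ids:
                candidate = {**assignment, node: cluster}
                cost = objective(candidate)
                if cost < best_cost:  # strict improvement only
                    best_cluster, best_cost = cluster, cost
            if best_cluster != assignment[node]:
                assignment[node] = best_cluster
                changed = True
    return assignment


# Toy stand-in objective (NOT the DB index): squared deviation within clusters.
POSITION = {"a": 0.0, "b": 0.1, "c": 5.0, "d": 5.2}


def squared_deviation(assignment):
    total = 0.0
    for cluster in set(assignment.values()):
        members = [POSITION[n] for n, c in assignment.items() if c == cluster]
        mean = sum(members) / len(members)
        total += sum((x - mean) ** 2 for x in members)
    return total


start = {"a": 0, "b": 1, "c": 1, "d": 1}  # "b" starts in the wrong cluster
result = hill_climb(list(POSITION), start, [0, 1], squared_deviation)
print(result)
```

Because a node moves only when the objective strictly decreases and there are finitely many assignments, the loop terminates, which is the intuition behind the convergence claim.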


[Figure: the example graph (name nodes N1-N4, phone nodes P1-P4, address nodes A1-A3) grouped into clusters C1, C2, C3, and C4, annotated with objective values Φ=0.94, Φ=1.16, Φ=0.93, Φ=0.89, Φ=0.71, Φ=0.45.]


Road Map

• Motivation and overview

• Problem definition

• Solution
– Clustering w.r.t. hard constraint
– Matching w.r.t. soft constraint

• Evaluations on YP data

• Conclusions


Matching w.r.t. Soft Constraints

• Next? Matching problem
• How to match?

[Figure: GRAPH TRANSFORM. The clustered k-partite graph (name nodes N1-N4, phone nodes P1-P4, address nodes A1-A3) is transformed into a graph over cluster nodes: NC1 (Microsofe Corp., Microsoft Corp., MS Corp.), NC4 (Macrosoft Inc.), PC1 (xxx-1255), PC2 (xxx-2255), PC3 (xxx-9400), PC4 (xxx-0500), AC1 (1 Microsoft Way), and AC4 (2 Sylvan Way, 2 Sylvan W.). Each edge is weighted by its number of supporting sources: 7 s(1-5,7,8), 1 s(6), 5 s(1-5), 1 s(10), 9 s(1-9), 9 s(1-9), 1 s(10), 8 s(1-8).]


Matching w.r.t. Soft Constraint

• Intuitions
– Largest sum of weights
– Smallest gap
– How to balance these two goals?
• Optimization problem
– Maximize

$$\sum_{(u,v) \in M} \frac{w(u,v)}{Gap(u) + Gap(v)}$$

– Subject to

$$0 \le \frac{|\hat{A}_K|}{|A_K|} \le p_1, \qquad 0 \le \frac{|\hat{A}|}{|A|} \le p_2$$

• Two-phase greedy algorithm: MATCH

[Figure: a node N connected to phone nodes P1, P2, and P3 by edges supported by 1 source (s1), 9 sources (s2-s10), and 10 sources (s1-s10). Three candidate matchings are shown: Solution 1 with Gap(N) = 1, Solution 2 with Gap(N) = 9, and Solution 3 with Gap(N) = 0.]
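The tension between a large sum of weights and a small gap can be illustrated with a few lines of Python. The slide text does not define Gap, so the definition below (weight of the heaviest incident edge minus weight of the lightest selected edge) is an assumption, chosen because it yields the three gap values shown, 0, 1, and 9. The mapping of weights to P1, P2, and P3 and the choice of which edges each candidate selects are likewise hypothetical.

```python
# Supporting-source counts of the three edges incident to N, from the slide.
# Which phone node each weight attaches to is an assumption made here.
WEIGHTS = {"P1": 1, "P2": 9, "P3": 10}


def gap(weights, selected):
    """Assumed definition: heaviest incident edge minus lightest selected edge."""
    return max(weights.values()) - min(weights[p] for p in selected)


def total_weight(weights, selected):
    """Sum of the weights of the selected edges."""
    return sum(weights[p] for p in selected)


for selected in (["P3"], ["P2", "P3"], ["P1", "P2", "P3"]):
    print(selected, total_weight(WEIGHTS, selected), gap(WEIGHTS, selected))
```

Selecting more edges raises the total weight (10, 19, 20) but also widens the gap (0, 1, 9), which is the trade-off the objective has to balance.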


Road Map

• Motivation and overview

• Problem definition

• Solution

• Evaluations on YP data

• Conclusions


Experiment Settings

• Dataset I
– Business listings for two zip codes (07035 Lincoln Park, NJ; 07715 Belmar, NJ) from multiple sources

| Zip | #Businesses | #Sources | #Srcs/business |
|---|---|---|---|
| 07035 | 662 | 15 | 1-7 |
| 07715 | 149 | 6 | 1-3 |

Records:

| Zip | #Recs | #Names | #Phones | #Addresses | #(Err Ps) |
|---|---|---|---|---|---|
| 07035 | 1629 | 1154 | 839 | 735 | 72 |
| 07715 | 266 | 243 | 184 | 55 | 12 |

Constraint violation:

| Zip | NP | PN | NA | AN |
|---|---|---|---|---|
| 07035 | 8% (2.6) | .8% (2.7) | 2% (2.3) | 12.6% (5.1) |
| 07715 | 4% (2) | 1% (3) | 4% (2) | 4% (8.5) |


Experiment Settings

• Implementation
– MATCH (invoking CLUSTER first)
– LINK: record linkage only
– FUSE: data fusion only
– LINKFUSE: first LINK, then FUSE
• Gold standard: by manual checking
• Measures: Precision/Recall/F-measure

Matching of values of different attributes:

$$P = \frac{|G_M \cap R_M|}{|R_M|}, \qquad R = \frac{|G_M \cap R_M|}{|G_M|}, \qquad F = \frac{2PR}{P+R}$$

Clustering of values of the same attribute:

$$P = \frac{|G_A \cap R_A|}{|R_A|}, \qquad R = \frac{|G_A \cap R_A|}{|G_A|}, \qquad F = \frac{2PR}{P+R}$$

| Notation | Description |
|---|---|
| $G_M$ | Matched pairs for the gold standard |
| $R_M$ | Matched pairs for our results |
| $G_A$ | Clustered pairs for the gold standard |
| $R_A$ | Clustered pairs for our results |
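The three measures operate on sets of pairs, so they translate directly into set operations. In the sketch below the function name `precision_recall_f` and the example pairs are hypothetical; the pairs are not taken from the experiments.

```python
def precision_recall_f(gold_pairs, result_pairs):
    """Precision, recall, and F-measure of result pairs against gold pairs."""
    gold, result = set(gold_pairs), set(result_pairs)
    correct = len(gold & result)
    precision = correct / len(result) if result else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical name-phone matches, for illustration only.
gold = {
    ("Microsoft Corp.", "xxx-1255"),
    ("Microsoft Corp.", "xxx-9400"),
    ("Macrosoft Inc.", "xxx-0500"),
}
result = {
    ("Microsoft Corp.", "xxx-1255"),
    ("Macrosoft Inc.", "xxx-0500"),
    ("Microsoft Corp.", "xxx-2255"),
}
p, r, f = precision_recall_f(gold, result)
print(round(p, 3), round(r, 3), round(f, 3))
```

The same function serves both evaluations: pass matched pairs of values from different attributes for matching, or clustered pairs of values from the same attribute for clustering.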


Accuracy

[Figure: accuracy plots for zip codes 07035 and 07715, with three panels each: Matching (NAME-PHONE), Matching (NAME-ADDRESS), Clustering (NAME).]

• MATCH achieves the highest F-measure in most cases
– Improves LINK by 11% on name-phone matching, by 20% on name clustering
• LINK vs. FUSE vs. LINKFUSE
– LINK: high recall in matching
– FUSE: high precision in matching, high precision in name clustering
– LINKFUSE: only slightly better than FUSE in matching and similar to LINK in clustering


Efficiency and Scalability

• Data set II
– Entire listing: 40+M records
• Hadoop-based linkage framework
– Fuzzy self-join using Hadoop
– Partition records into strongly connected components
• Efficiency
– Linear growth
– Execution time

| Module | Execution time (hour) |
|---|---|
| Record extraction | 0.002 |
| Fuzzy self join | 0.89 |
| Connected component | 0.89 |
| Linkage | 1.36 |
| Overall | 3.26 |

| median | 95th percentile | 99th percentile | max |
|---|---|---|---|
| 2 | 5 | 7 | 2103 |
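The partitioning step can be pictured with a single-machine union-find sketch. This is not the Hadoop implementation from the talk: it assumes the fuzzy self-join has already produced undirected pairs of similar records, and it computes ordinary connected components over those pairs. The function name and the record ids are hypothetical.

```python
def connected_components(record_ids, similar_pairs):
    """Group record ids into components linked by the given similar pairs."""
    parent = {r: r for r in record_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    groups = {}
    for r in record_ids:
        groups.setdefault(find(r), set()).add(r)
    return list(groups.values())


# Six hypothetical records; the pairs stand in for fuzzy self-join output.
components = connected_components(range(1, 7), [(1, 2), (2, 3), (4, 5)])
print(sorted(sorted(c) for c in components))
```

Each component can then be linked independently, which is what makes the per-component workload easy to distribute.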


Conclusions

• In the real world, we need to resolve duplicates and conflicts at the same time.

• We reduce the problem to a k-partite graph clustering and matching problem
– Combine linkage and fusion
– Apply them in a global fashion

• Experiments show high accuracy and scalability


Thank You!