Detecting Nearly Duplicated Records in Location Datasets

Microsoft Research Asia, Search Technology Center

Yu Zheng, Xing Xie, Shuang Peng, James Fu

Background

Web maps and local search engines are frequently used
The quality of these services depends on geographic data

Name              Address                   GPS Position     Phone Num.  Category  Type
The Matt’s Bar    701 5th Ave Seattle, WA   116.325, 35.364  1-56987452  Café      YP
Silver Cloud Inn  314 7th Ave Redmond, WA   116.451, 35.209  1-25698716  Hotel     POI

Points of interest (POI)
  Collected by people holding GPS-enabled devices in the physical world
  Accurate GPS coordinates
  Less accurate addresses

Yellow pages (YP)
  Entered by people in a cyber environment, e.g., online
  Accurate addresses
  Inaccurate GPS coordinates (translated by geocoding)

Problem

Nearly duplicated POIs
  The same entity in the physical world
  With slightly different presentations of name, address, ...

Caused by multiple sources
  Different vendors and channels
  Different types: POI and YP

Results
  Complicate data management
  Confuse users

Example: "Seattle Premier Outlet Mall" vs. "Seattle Premium Outlet"

What we do

Infer the similarity between two location entities
  Using a machine-learning-based approach
  Considering multiple fields: name, address, coordinates, categories

Identify some useful features

Evaluate our method using real datasets

Methodology

Similarities between two entities
  Name similarity
  Address similarity
  Category similarity

Train an inference model
  Using these similarities as features
  Trained on a small human-labeled set
  Applied to a large-scale dataset
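The training step can be illustrated with a minimal sketch. The experiments use a decision tree with bagging; the code below swaps in depth-1 trees (decision stumps) to stay short, so it is a simplification and not the authors' implementation. The feature order [S1, S2, S3, S4], the toy values, the labels, and all function names are made up for illustration.

```python
import random
from collections import Counter

def fit_stump(X, y):
    """Fit a depth-1 decision tree: the (feature, threshold) split with the fewest errors."""
    if len(set(y)) == 1:            # bootstrap sample with one class only
        label = y[0]
        return lambda row: label
    best = None                     # (errors, feature, threshold, label_if_greater)
    for f in range(len(X[0])):
        values = sorted({row[f] for row in X})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            for label_hi in (1, 0):
                errors = sum((label_hi if row[f] > t else 1 - label_hi) != target
                             for row, target in zip(X, y))
                if best is None or errors < best[0]:
                    best = (errors, f, t, label_hi)
    if best is None:                # every row identical: predict the majority label
        majority = Counter(y).most_common(1)[0][0]
        return lambda row: majority
    _, f, t, label_hi = best
    return lambda row: label_hi if row[f] > t else 1 - label_hi

def fit_bagging(X, y, n_estimators=25, seed=0):
    """Train one stump per bootstrap sample (sampling rows with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models

def predict(models, row):
    """Majority vote over all stumps: 1 = duplicate, 0 = not duplicate."""
    return Counter(m(row) for m in models).most_common(1)[0][0]

# Hypothetical labeled pairs: [S1, S2, S3, S4] per pair (values are invented).
X = [[6.0, 1.0, 3, 2], [5.5, 0.5, 3, 3], [7.0, 1.5, 2, 2], [6.5, 0.0, 3, 3],
     [2.0, 5.0, 1, 1], [1.5, 6.0, 2, 1], [3.0, 5.5, 1, 2], [2.5, 4.5, 0, 1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
models = fit_bagging(X, y)
```

A new pair would then be scored with `predict(models, [s1, s2, s3, s4])`. In practice a library implementation of bagged decision trees would replace the stump code.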

Name similarity

Edit distance does not work
Use the concept of IDF (inverse document frequency) instead

Shared part: V_s; different part: V_d

Output S1 and S2 as features

Record names                           Same part  Difference           Edit Dist.  Results
Galaxies Coffee House / Galaxies Cafe  Galaxies   Coffee House / Cafe  9           Same
Espresso Darer / Espresso Diana        Espresso   Darer / Diana        4           Diff.

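$V_s = \langle w_1, w_2, \ldots, w_i \rangle$

$V_d = \langle w'_1, w'_2, \ldots, w'_i \rangle$

$S_1 = \sum_{i=1}^{|V_s|} \mathrm{idf}(w_i), \quad w_i \in V_s$

$S_2 = \max_{w_i \in V_d} \mathrm{idf}(w_i)$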
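The two name features can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization, lowercasing, and the common definition idf(w) = log(N / df(w)); the slides do not state which IDF variant or tokenizer is used. The small corpus of names and the `unseen_idf` parameter are hypothetical.

```python
import math
from collections import Counter

def tokenize(name):
    # Assumption: lowercase and split on whitespace.
    return name.lower().split()

def build_idf(names):
    """idf(w) = log(N / df(w)), where df(w) counts the names containing w."""
    df = Counter()
    for name in names:
        df.update(set(tokenize(name)))
    n = len(names)
    return {w: math.log(n / count) for w, count in df.items()}

def name_features(name_a, name_b, idf, unseen_idf):
    """Return (S1, S2): summed IDF of shared words, max IDF of differing words."""
    a, b = set(tokenize(name_a)), set(tokenize(name_b))
    shared = a & b        # V_s
    different = a ^ b     # V_d
    s1 = sum(idf.get(w, unseen_idf) for w in shared)
    s2 = max((idf.get(w, unseen_idf) for w in different), default=0.0)
    return s1, s2

# Hypothetical corpus; only the first four names come from the slides.
names = ["Galaxies Coffee House", "Galaxies Cafe", "Espresso Darer", "Espresso Diana",
         "Corner Coffee House", "Blue Cafe", "Espresso Bar", "Moon Cafe"]
idf = build_idf(names)
rare = math.log(len(names))   # treat unseen words like words seen once

galaxies = name_features("Galaxies Coffee House", "Galaxies Cafe", idf, rare)
espresso = name_features("Espresso Darer", "Espresso Diana", idf, rare)
```

On this toy corpus the Galaxies pair differs only in common words ("coffee", "house", "cafe"), so its S2 is lower than that of the Espresso pair, whose differing words ("darer", "diana") are rare.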

Address similarity

The geospatially closer two records are, the higher the probability that they are near duplicates

Example: the same building with two different address presentations
  79 Beaver St, New York, NY 10005-2812
  92 Water St, New York, NY 10005-3511

[Figure: City structure. A hierarchy with the levels City, Borough, Area, and Street. Node labels: New York City (1xxxx); Manhattan (100xxx), Queen (113xxx); Lower East (1000x), Upper East (1002x); 5th Street, Wall Street.]

Insert YP data into the city structure according to their addresses
Calculate the mean coordinates of each leaf node
Insert POI data into the city structure according to their coordinates
Find the co-parent node of two records in the structure

[Figure: three cases, A), B), and C), each showing two records R1 and R2 and their co-parent node np in the city structure.]
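The four steps of the address similarity can be sketched as below. This is a rough illustration under several assumptions that the slides do not confirm: a leaf is identified by its root-to-leaf path, a POI is attached to the leaf with the nearest mean coordinates, distance is plain Euclidean distance on the coordinate pair, and the depth of the co-parent node is used directly as the feature. All node names and coordinates are invented.

```python
from collections import defaultdict

def leaf_means(yp_records):
    """Mean coordinates of the YP records stored at each leaf (keyed by path)."""
    groups = defaultdict(list)
    for path, x, y in yp_records:
        groups[path].append((x, y))
    return {path: (sum(p[0] for p in pts) / len(pts),
                   sum(p[1] for p in pts) / len(pts))
            for path, pts in groups.items()}

def nearest_leaf(x, y, means):
    """Assumed insertion rule: a POI goes to the leaf with the closest mean."""
    return min(means, key=lambda path: (means[path][0] - x) ** 2
                                       + (means[path][1] - y) ** 2)

def co_parent_depth(path_a, path_b):
    """Number of levels shared by two root-to-leaf paths (depth of the co-parent)."""
    depth = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        depth += 1
    return depth

# Hypothetical YP records: (path in the city structure, x, y).
street1 = ("City", "Borough A", "Area 1", "Street 1")
street2 = ("City", "Borough A", "Area 1", "Street 2")
street9 = ("City", "Borough B", "Area 3", "Street 9")
yp_records = [(street1, 0.0, 0.0), (street1, 2.0, 0.0),
              (street2, 5.0, 1.0), (street9, 40.0, 30.0)]

means = leaf_means(yp_records)
poi_leaf = nearest_leaf(1.2, 0.3, means)         # a POI known only by coordinates
depth = co_parent_depth(poi_leaf, street2)       # compare it with a YP record
```

A deeper co-parent means the two records sit closer together in the city structure; two records on the same street share all four levels.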

Category similarity

Map each entity to a category hierarchy
Find the co-parent node of two entities
The lower the level of the co-parent node in the hierarchy, the higher the similarity

[Figure: Category hierarchy with Level 1, Level 2, and Level 3. Node labels: Entertainment, Restaurant, Education, Chinese Restaurant, Italian Restaurant, Cinema.]

E.g., some shops serve coffee, lunch, and wine at the same time, so different people classify them into different categories
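Finding the co-parent node in a category hierarchy can be sketched with parent pointers. The tree below is an assumed arrangement of the category names on the slide: the root "All" is hypothetical, and the parent of each category is a guess. The feature returned is the level of the co-parent, counted from the root as level 1, so a larger number means a deeper co-parent and a higher similarity.

```python
# Assumed hierarchy: child -> parent. "All" is a hypothetical root node.
PARENT = {
    "All": None,
    "Restaurant": "All",
    "Entertainment": "All",
    "Education": "All",
    "Chinese Restaurant": "Restaurant",
    "Italian Restaurant": "Restaurant",
    "Cinema": "Entertainment",
}

def path_to_root(category):
    """List of nodes from the category up to the root (category first)."""
    path = []
    while category is not None:
        path.append(category)
        category = PARENT[category]
    return path

def co_parent(cat_a, cat_b):
    """Deepest node that is an ancestor of (or equal to) both categories."""
    ancestors_a = set(path_to_root(cat_a))
    for node in path_to_root(cat_b):
        if node in ancestors_a:
            return node
    raise ValueError("categories are not in the same hierarchy")

def category_similarity(cat_a, cat_b):
    """Level of the co-parent, with the root counted as level 1."""
    return len(path_to_root(co_parent(cat_a, cat_b)))
```

For example, `category_similarity("Chinese Restaurant", "Italian Restaurant")` meets at "Restaurant", while `category_similarity("Chinese Restaurant", "Cinema")` only meets at the root.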

Experiments - Settings

Beijing dataset
  0.7 million entities in total
  0.3 million POIs and 0.4 million YPs

Human labeled
Decision tree + Bagging
Baselines
  Exact match
  Rule-based: edit distance and geo-distance

Datasets  Training Set  Test Set  Total
D1        200           200       400
D2        400           400       800
D3        600           600       1200
D4        800           800       1600
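The rule-based baseline combines edit distance and geo-distance. The slides do not give its thresholds or its distance formula, so the sketch below is only one plausible reading: Levenshtein distance on the names, haversine distance on the coordinates, and two invented thresholds (`max_edits`, `max_meters`). The record layout and the coordinates are also hypothetical.

```python
import math

def edit_distance(a, b):
    """Levenshtein distance with unit cost for insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters on a sphere of radius 6,371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371000.0 * math.asin(math.sqrt(h))

def rule_based_duplicate(rec_a, rec_b, max_edits=5, max_meters=100.0):
    """Flag a pair as duplicated when both distances are under the thresholds."""
    close_name = edit_distance(rec_a["name"], rec_b["name"]) <= max_edits
    close_geo = haversine_m(rec_a["lat"], rec_a["lon"],
                            rec_b["lat"], rec_b["lon"]) <= max_meters
    return close_name and close_geo

# Name pairs from the name similarity slide, placed at one made-up location.
here = {"lat": 39.9, "lon": 116.4}
galaxies = rule_based_duplicate({"name": "Galaxies Coffee House", **here},
                                {"name": "Galaxies Cafe", **here})
espresso = rule_based_duplicate({"name": "Espresso Darer", **here},
                                {"name": "Espresso Diana", **here})
```

With these thresholds the rule rejects the Galaxies pair (edit distance 9) and accepts the Espresso pair (edit distance 4), the opposite of the labels on the name similarity slide, which illustrates why edit distance alone is a weak signal.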

Experiments - Results

Single feature study
  S1 and S2 are the name similarity features
  S3 denotes address similarity
  S4 represents category similarity

[Figure: two plots for the single feature study, showing precision and recall of S1, S2, S3, and S4 against the number of entity pairs (400, 800, 1200, 1600).]

Experiments - Results

Feature combination

Features  Duplicated      Non-duplicated  Overall accuracy
          Pre.    Rec.    Pre.    Rec.
          0.860   0.857   0.852   0.864   0.858
          0.800   0.767   0.746   0.819   0.782
          0.864   0.859   0.853   0.869   0.861
          0.864   0.859   0.853   0.869   0.861
          0.885   0.866   0.858   0.891   0.875

Experiments - Results

Method             Duplicated      Non-duplicated  Overall accuracy
                   Pre.    Rec.    Pre.    Rec.
Exact match        1       0.183   0.558   0.100   0.598
Rule-based method  0.780   0.701   0.736   0.808   0.755
Our approach       0.885   0.866   0.858   0.891   0.875
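The columns of the tables above follow the usual definitions of per-class precision and recall. As a reminder of how they relate, here is a small sketch that derives all five numbers from the four confusion-matrix counts, treating "duplicated" as the positive class. The counts in the example are invented and do not come from the experiments.

```python
def duplicate_metrics(tp, fp, fn, tn):
    """Per-class precision/recall and overall accuracy from confusion counts.

    tp: duplicated pairs predicted duplicated
    fp: non-duplicated pairs predicted duplicated
    fn: duplicated pairs predicted non-duplicated
    tn: non-duplicated pairs predicted non-duplicated
    Assumes every denominator is non-zero.
    """
    return {
        "precision_dup": tp / (tp + fp),
        "recall_dup": tp / (tp + fn),
        "precision_non": tn / (tn + fn),
        "recall_non": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Hypothetical counts for a test set of 200 pairs.
metrics = duplicate_metrics(tp=80, fp=20, fn=10, tn=90)
```

Note that the two classes share their errors: a false positive for the duplicated class is a missed non-duplicated pair, so precision on one class and recall on the other move together.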

[Figure: Performance measures (precision (Y), recall (Y), precision (N), recall (N), overall) on datasets D1, D2, D3, and D4; the vertical axis runs from 0.65 to 0.95.]

Conclusion

A classification model using
  Name similarity
  Address similarity
  Category similarity

Determines nearly duplicated location data
  With an overall accuracy of 0.89

Thanks!

Yu Zheng, Microsoft Research Asia
