Frequency-aware Similarity Measures

Date: 2011/12/26
Source: Dustin Lange et al. (CIKM’11)
Advisor: Jia-ling, Koh
Speaker: Jiun Jia, Chiou


Outline

• Introduction
• Composing similarity
• Exploiting frequencies
• Partitioning strategies
• Experiment
• Conclusion

Introduction

The paper proposes a novel comparison method that partitions the data using value frequency information and then automatically determines a similarity measure for each individual partition.

Compared record pairs are partitioned according to the frequencies of their attribute values, for example:

• Partition 1 contains all pairs with rare names.
• Partition 2 contains all pairs with names of medium frequency.
• Partition 3 contains all pairs with frequent names.

Introduction

Motivation: Schufa is a credit rating agency that stores data on about 66 million citizens; the data are reported by banks, insurance agencies, etc.

Queries about the rating of an individual must be answered as precisely as possible.

To ensure the quality of the data, it is necessary to detect and fuse duplicates.

Introduction

Why is Arnold Schwarzenegger always a duplicate?

In a person table of U.S. citizens, this is a very rare name. If we find several Arnold Schwarzeneggers in it, it is very likely that they are duplicates.

The authors argue that for rows with rare names, address and date-of-birth similarity are less important than for rows with frequent names.

Attributes compared: person's name, birth date, address.

Introduction

Determining the similarity (or distance) of two records in a database is a well-known but challenging problem.

The problem comprises two main difficulties:

1. Records may contain typos, outdated values, and sloppy data or query entries. Remedy: devising sophisticated similarity measures.

2. The amount of data might be very large, thus prohibiting exhaustive comparisons. Remedy: efficient algorithms and indexes that avoid comparing each entry with all other entries.

Composing Similarity

Base Similarity Measures

Define $\mathrm{sim}_p(r_1, r_2)$ with $\mathrm{sim}_p : (R \times R) \to [0, 1] \subset \mathbb{R}$.

Each base similarity measure is responsible for calculating the similarity of a specific attribute p of the compared records r1 and r2 from a set R of records.

Examples:

• sim_Name: Jaro-Winkler distance
• sim_BirthDate: relative distance
• sim_Address: Euclidean distance

Base measures can also test for equality (e.g., for email addresses) or compare Boolean values (e.g., for gender).

Jaro-Winkler distance

Jaro distance $d_j$ of two strings $s_1$ and $s_2$:

$d_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)$

• m: the number of matching characters.
• t: half the number of transpositions.

Jaro-Winkler distance $d_w$:

$d_w = d_j + \ell \, p \, (1 - d_j)$

• $d_j$: the Jaro distance for strings s1 and s2.
• $\ell$: the length of the common prefix at the start of the strings, up to a maximum of 4 characters.
• p: a constant scaling factor. p should not exceed 0.25, otherwise the distance can become larger than 1. The standard value for this constant in Winkler's work is p = 0.1.

Example 1: s1 = MARTHA, s2 = MARHTA

m = 6, |s1| = 6, |s2| = 6
t = 2/2 = 1 (H/T and T/H)
$d_j = \frac{1}{3}\left(\frac{6}{6} + \frac{6}{6} + \frac{6 - 1}{6}\right) = 0.944$, standard weight p = 0.1
Common prefix MAR: $\ell = 3$
$d_w = 0.944 + 3 \cdot 0.1 \cdot (1 - 0.944) = 0.961$

Example 2: s1 = DWAYNE, s2 = DUANE

m = 4, |s1| = 6, |s2| = 5
t = 0
$d_j = \frac{1}{3}\left(\frac{4}{6} + \frac{4}{5} + \frac{4 - 0}{4}\right) = 0.822$, standard weight p = 0.1
Common prefix D: $\ell = 1$
$d_w = 0.822 + 1 \cdot 0.1 \cdot (1 - 0.822) = 0.84$
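
The two worked examples can be checked with a short script. This is a minimal sketch of the standard Jaro and Jaro-Winkler definitions; the function names and the matching-window rule (characters match only if they are equal and at most max(|s1|, |s2|)/2 - 1 positions apart) come from the usual textbook formulation, not from the slides.

```python
def jaro(s1, s2):
    """Jaro score d_j in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Standard matching window (assumption: not stated on the slides).
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t = half the number of matched characters that appear in a different order.
    seq1 = [c for c, ok in zip(s1, matched1) if ok]
    seq2 = [c for c, ok in zip(s2, matched2) if ok]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2
    return (m / len1 + m / len2 + (m - t) / m) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro-Winkler score d_w = d_j + l * p * (1 - d_j)."""
    dj = jaro(s1, s2)
    prefix = 0  # l: length of the common prefix, capped at max_prefix
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return dj + prefix * p * (1 - dj)


print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
print(round(jaro_winkler("DWAYNE", "DUANE"), 2))   # 0.84
```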

Composing Similarity

Composition of Base Similarity Measures

Integrate the base similarity measures into an overall judgement to calculate the overall similarity of two records. This is treated as a classification problem:

• The classes are isSimilar and isDissimilar.
• The features are the results of the base similarity measures.

To derive a general model, employ machine learning techniques; supervised learning methods require enough training data.

Candidate techniques: logistic regression, decision trees, and SVM (support vector machine).
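
As an illustration of the composition step, the following sketch combines three base similarity values with a logistic function. The weights, the bias, and the 0.5 decision threshold are hypothetical placeholders; in the approach described here they would be learned from labeled record pairs rather than set by hand.

```python
import math

# Hypothetical parameters (assumption): a real model learns these from training data.
WEIGHTS = {"name": 4.0, "birth_date": 2.0, "address": 2.0}
BIAS = -5.0


def composite_similarity(base_sims):
    """Map base similarities in [0, 1] to an overall similarity in (0, 1)."""
    z = BIAS + sum(WEIGHTS[attr] * base_sims[attr] for attr in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def classify(base_sims, threshold=0.5):
    """Return the class label for a compared record pair."""
    if composite_similarity(base_sims) >= threshold:
        return "isSimilar"
    return "isDissimilar"


pair = {"name": 0.96, "birth_date": 1.0, "address": 0.7}
print(classify(pair))
```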

Frequency Function

Determine the value frequencies of the selected attributes for the two compared records.

Define a frequency function f : R × R → ℕ (here over FirstName and LastName).

Goal: partition the data according to the name frequencies.

The frequency function has to cope with several data quality problems:

1. Swapping of first and last name.
2. Typos (e.g., Arnold vs. Arnnold).
3. Combining two attributes (e.g., Schwarzenegger is more distinguishing than Arnold).
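
One possible way to address the three problems is sketched below. All design choices here are assumptions made for illustration, not the definition used in the paper: name tokens are counted across both fields (so a swap does not change the result), a record takes the frequency of its rarer name part (the more distinguishing one), and a pair takes the minimum of both records (a typo usually produces a rare value).

```python
from collections import Counter


def build_name_counts(records):
    """Count name tokens over both fields so swapped names are counted alike."""
    counts = Counter()
    for first, last in records:
        counts[first.lower()] += 1
        counts[last.lower()] += 1
    return counts


def record_frequency(record, counts):
    """Frequency of the rarer (more distinguishing) name part of one record."""
    first, last = record
    return min(counts[first.lower()], counts[last.lower()])


def pair_frequency(r1, r2, counts):
    """Hypothetical f(r1, r2): the minimum frequency over both records."""
    return min(record_frequency(r1, counts), record_frequency(r2, counts))


people = [("Arnold", "Schwarzenegger"), ("Arnold", "Smith"),
          ("John", "Smith"), ("Smith", "John")]
counts = build_name_counts(people)
print(pair_frequency(("Arnold", "Schwarzenegger"), ("Arnold", "Smith"), counts))
```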

Exploiting frequencies

[Figure: the record (FirstName: Arnold, LastName: Schwarzenegger) is looked up in a FirstName frequency table (entries such as Josh, Kevin, Jack) and a LastName frequency table (entries such as Powell, Johnson, Wills).]

Exploiting frequencies

Frequency-enriched Models

A first idea for exploiting frequency distributions is to alter the models learned with the machine learning techniques:

1. Manually add rules to the models.
   Example for logistic regression: "if the frequency of the name value is below 10, then increase the weight of the name similarity by 10% and appropriately decrease the weights of the other similarity functions".
   Drawback: manually defining such rules is cumbersome and error-prone.

2. Integrate the frequencies directly into the machine learning models, using the frequency scaled to [0, 1] as an additional feature: $f(r_1, r_2) / M$, where M is the maximum frequency in the data set.

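Partitioning strategies

Partition the compared record pairs into n partitions using the determined frequencies.

Choosing the number of partitions is a trade-off:

• Too many partitions give small partitions (e.g., frequency range 0 to 10): risk of overfitting.
• Too few partitions give large partitions (e.g., frequency range 0 to 100): too coarse for discovering frequency-specific differences.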

Partitioning strategies

Define partitions: the entire frequency space is divided into non-overlapping, contiguous partitions by a set of thresholds Ɵ0 < Ɵ1 < … < Ɵn, with Ɵ0 = 0 and Ɵn = M + 1, where M is the maximum frequency in the data set.

The partitions are defined as frequency ranges Ii:

$I_i = [\theta_i, \theta_{i+1})$

A partition covers a set of record pairs. A record pair (r1, r2) falls into a partition [Ɵi, Ɵi+1) iff the frequency function value for this pair lies in the partition's range:

$(r_1, r_2) \in I_i \iff \theta_i \le f(r_1, r_2) < \theta_{i+1}$
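
Looking up the partition for a given frequency value is a binary search over the sorted thresholds. The sketch below uses Python's standard `bisect` module; the function name `partition_index` and the example thresholds are made up for illustration.

```python
from bisect import bisect_right


def partition_index(frequency, thresholds):
    """Return i such that thresholds[i] <= frequency < thresholds[i + 1].

    thresholds is the sorted list [theta_0 = 0, theta_1, ..., theta_n = M + 1].
    """
    if not thresholds[0] <= frequency < thresholds[-1]:
        raise ValueError("frequency lies outside the frequency space")
    return bisect_right(thresholds, frequency) - 1


# Example with M = 50 and three partitions: [0, 3), [3, 10), [10, 51).
thresholds = [0, 3, 10, 51]
print(partition_index(2, thresholds))   # 0
print(partition_index(3, thresholds))   # 1
print(partition_index(50, thresholds))  # 2
```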

Partitioning strategies

Random partitioning: randomly pick several thresholds Ɵi ∈ {0, …, M + 1}.

The number of thresholds in each partitioning is also chosen randomly, with a maximum of 20 partitions in one partitioning.

Equi-depth partitioning: divide the frequency space into e partitions, each containing the same number of tuples from the original data set R.

e ∈ {2, …, 20}

[Figure: equi-depth partitionings ranging from 1 partition to 20 partitions; the example shown uses e = 9.]
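
Equi-depth thresholds can be read off the sorted frequency values at evenly spaced ranks. This is a minimal sketch under the assumption that the input is a list of frequency values, one per tuple; the function name is hypothetical, and when many tuples share the same frequency the partitions can only be approximately equal in size.

```python
def equi_depth_thresholds(frequencies, e):
    """Return thresholds [0, ..., M + 1] splitting the values into about e equal parts."""
    values = sorted(frequencies)
    max_freq = values[-1]
    thresholds = [0]
    for k in range(1, e):
        cut = values[k * len(values) // e]
        if cut > thresholds[-1]:  # skip duplicate thresholds caused by ties
            thresholds.append(cut)
    thresholds.append(max_freq + 1)
    return thresholds


print(equi_depth_thresholds([1, 2, 3, 4, 5, 6, 7, 8, 9], 3))  # [0, 4, 7, 10]
```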

Partitioning strategies

Greedy partitioning: define a list of threshold candidates C = {Ɵ0, …, Ɵn} by dividing the frequency space into segments with the same number of tuples (similar to equi-depth partitioning, but with a fixed, large e = 50).

Process:

1. Learn a partition for the first candidate thresholds: [Ɵ0, Ɵ1).
2. Learn a second partition that extends the current partition by moving its upper threshold to the next threshold candidate: [Ɵ0, Ɵ2).
3. Compare both partitions using the F-measure.
4. If the extended partition achieves better performance, repeat the process for the next threshold slot, e.g. [Ɵ0, Ɵ3).
5. If not, keep the smaller partition and start a new partition at its upper threshold; another iteration starts with this new partition.

This process is repeated until all threshold candidates have been processed.
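
The greedy process can be sketched as a single pass over the threshold candidates. In this sketch, `evaluate(lo, hi)` is a hypothetical callback that learns a similarity measure on the record pairs in [lo, hi) and returns its F-measure; how that model is trained is outside the scope of the example.

```python
def greedy_partitioning(candidates, evaluate):
    """Return the chosen thresholds, given sorted candidates [theta_0, ..., theta_n]."""
    chosen = [candidates[0]]
    lower, upper = candidates[0], candidates[1]
    best = evaluate(lower, upper)
    for nxt in candidates[2:]:
        score = evaluate(lower, nxt)
        if score > best:
            # The extended partition performs better: keep extending it.
            upper, best = nxt, score
        else:
            # Keep the smaller partition and start a new one at its upper threshold.
            chosen.append(upper)
            lower, upper = upper, nxt
            best = evaluate(lower, upper)
    chosen.append(upper)
    return chosen


# Made-up F-measures for illustration only.
scores = {(0, 1): 0.71, (0, 2): 0.83, (0, 3): 0.69,
          (2, 3): 0.50, (2, 4): 0.60, (2, 5): 0.55, (4, 5): 0.40}
print(greedy_partitioning([0, 1, 2, 3, 4, 5], lambda lo, hi: scores[(lo, hi)]))
```

With these made-up scores the result is the partitioning [0, 2), [2, 4), [4, 5).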

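Example: greedy partitioning with F-measure comparison

Record pairs per frequency range (S / D: number of pairs predicted similar / dissimilar):

[Ɵi , Ɵj)            Total   S   D
0 ≤ Frequency < 1    10      8   2
1 ≤ Frequency < 2    10      5   5
2 ≤ Frequency < 3    5       2   3
3 ≤ Frequency < 4    15
4 ≤ Frequency < 5    0

Confusion matrices (rows: actual, columns: predicted) and resulting scores:

Partition [0, 1):
                     predicted similar   predicted dissimilar
actual similar       5                   1
actual dissimilar    3                   1
P = 5/8 = 0.625, R = 5/6 = 0.83, F = 2PR/(P + R) = 0.71

Partition [0, 2):
                     predicted similar   predicted dissimilar
actual similar       10                  1
actual dissimilar    3                   6
P = 10/13 = 0.77, R = 10/11 = 0.91, F = 2PR/(P + R) = 0.83

Partition [0, 3):
                     predicted similar   predicted dissimilar
actual similar       10                  4
actual dissimilar    5                   6
P = 10/15 = 0.67, R = 10/14 = 0.71, F = 2PR/(P + R) = 0.69

[0, 2) achieves the highest F-measure, so it is kept and the next partition starts at threshold 2:

[2, 3): 5 pairs (5 ≠ 20)
[2, 4): 5 + 15 = 20 pairs
[2, 5): 5 + 15 + 0 = 20 pairs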
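
The precision, recall, and F-measure values of the three confusion matrices can be recomputed as follows. The helper name and the argument names `tp`, `fp`, `fn` (true positives, false positives, false negatives for the class "similar") are introduced here for illustration.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) for the class 'similar'."""
    if tp == 0:
        return 0.0, 0.0, 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# (tp, fp, fn) read from the confusion matrices of [0, 1), [0, 2), and [0, 3).
for name, cells in [("[0, 1)", (5, 3, 1)), ("[0, 2)", (10, 3, 1)), ("[0, 3)", (10, 5, 4))]:
    p, r, f = precision_recall_f(*cells)
    print(name, round(p, 3), round(r, 3), round(f, 3))
```

Computed without intermediate rounding, the F-measures are 0.714, 0.833, and 0.690.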

Partitioning strategies

Genetic Partitioning Algorithm

1. Initialization: Create an initial population consisting of several random partitionings. These partitionings are created with the random partitioning approach described above.

2. Growth: Learn one composite similarity function for each partition in the current set of partitionings.

3. Selection: For each partition, determine the maximum F-measure that can be achieved by choosing an appropriate threshold for the similarity function. Rank the partitionings by weighted F-measure and select the top five.

4. Reproduction: Build pairs of the selected best individuals and combine them to create new individuals.

   a) Recombination: First create the union of the thresholds of both partitionings. For each threshold, randomly decide whether to keep it in the resulting partitioning or not. Both decisions have equal chances.

   b) Mutation: Randomly decide whether to add another new (also randomly picked) threshold and whether to delete a (randomly picked) threshold from the current threshold list.

   A minimum partition size is defined (set to 20 record pairs). Randomly created partitionings with too-small partitions are discarded.

[Figure: candidate partitions [Ɵ0, Ɵ1), [Ɵ1, Ɵ2), [Ɵ2, Ɵ3) built from thresholds Ɵ0 to Ɵ3 are combined into partitionings such as [0, 1), [1, 3), [3, 4) and [0, 2), [2, 4), [4, 5); the top 5 partitionings are selected.]

Partitioning strategies

5. Termination:

• The resulting partitions are evaluated and added to the set of evaluated partitions.
• The selection and reproduction phases are repeated until a certain number of iterations is reached or until no significant improvement can be measured.
• A minimum F-measure improvement of 0.001 after 5 iterations is required.
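
The recombination and mutation steps of the reproduction phase can be sketched as operations on threshold lists. This sketch assumes that the boundary thresholds 0 and M + 1 are always kept and that each mutation decision is taken with probability 0.5; the slides only say "randomly decide", so that probability is an assumption. The minimum partition size check is left out because it needs the record pairs.

```python
import random


def recombine(parent_a, parent_b, max_freq, rng):
    """Union the inner thresholds of both parents; keep each with probability 1/2."""
    inner = (set(parent_a) | set(parent_b)) - {0, max_freq + 1}
    kept = [t for t in sorted(inner) if rng.random() < 0.5]
    return [0] + kept + [max_freq + 1]


def mutate(thresholds, max_freq, rng):
    """Randomly add a new threshold and/or delete an existing inner threshold."""
    inner = set(thresholds) - {0, max_freq + 1}
    if max_freq >= 1 and rng.random() < 0.5:
        inner.add(rng.randint(1, max_freq))       # add a randomly picked threshold
    if inner and rng.random() < 0.5:
        inner.discard(rng.choice(sorted(inner)))  # delete a randomly picked threshold
    return [0] + sorted(inner) + [max_freq + 1]


rng = random.Random(42)  # fixed seed only to make the example repeatable
parent_a = [0, 2, 4, 11]
parent_b = [0, 1, 3, 4, 11]
child = mutate(recombine(parent_a, parent_b, 10, rng), 10, rng)
print(child)
```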

Experiment

Evaluation on Schufa Data Set

The data set consists of two parts: a person data set and a query data set. From these, record pairs of the form (query, correct result) or (query, incorrect result) were built.

Experiment

Evaluation on DBLP Data Set (bibliographic database for computer science)

Paper pairs fall into four categories:

(1) Two papers from the same author.
(2) Two papers from the same author with different name aliases.
(3) Two papers from different authors with the same name.
(4) Two papers from different authors with different names.

For each paper pair, the matching task is to decide whether the two papers were written by the same author.

Conclusion

• The paper introduced a novel approach for improving composite similarity measures.
• Divide a data set consisting of record pairs into partitions according to the frequencies of selected attributes.
• Learn optimal similarity measures for each partition.
• Experiments on different real-world data sets showed that partitioning the data can improve learning results and that genetic partitioning performs better than several other partitioning strategies.

Thank you for listening!
