An Adaptive Algorithm for Detection of Duplicate Records

Technical Seminar 2004

Presented by: Ramakanta Behera (IT200127207)

Under the guidance of: Miss Ipsita Mishra





INTRODUCTION

A database is a collection of related data. A “records set” is a list of prior distinct records. A new record is to be verified for a duplicate against the records set.

Various algorithms address this problem, such as:

• Machine learning algorithms
• Learnable string similarity measures
• Adaptive algorithms


OBJECTIVES

Reduce the cost of duplicate record detection.

Achieve perfect scalability of the detection procedure.

Cache prior information about distinct records, so that the prior records themselves need not be retained for further searches.

Keep the algorithm adaptive.


PREVALENT METHODS

The Brute Force Method

This method has a time complexity on the order of the number of records in the records set for each new record, and requires all prior records to be stored.
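The brute-force check can be sketched as follows. This is a minimal illustration; the function name and the sample records are hypothetical.

```python
def is_duplicate_brute_force(new_record, records_set):
    """Compare the new record against every stored record (full text match)."""
    for prior in records_set:
        if prior == new_record:
            return True
    return False


# Hypothetical sample data: every prior record must be kept in memory.
records_set = ["alice,1980", "bob,1975"]
print(is_duplicate_brute_force("bob,1975", records_set))
print(is_duplicate_brute_force("carol,1990", records_set))
```

Each call scans the whole list, so the work per new record grows linearly with the size of the records set.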

Method by Rail et al.

The comparison of a new record against the records set is reduced from a full-text match to a comparison of two integers.


OUTLINE OF THE PROPOSED SOLUTION

The central idea behind the present algorithm rests on a fundamental property of prime numbers: a product of distinct primes is exactly divisible by a given prime only if that prime is one of its factors.

Fig: Hashing. f(x) maps the record set into the integer number space (I).

Fig: Extended hashing into prime space. f(x) maps the record set to integer numbers (I), and g(x) maps the integer numbers to prime numbers (P).


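Fig: The complete algorithm. The records r1, r2, …, rn are mapped by f(x) to the integers I1, I2, …, In, which are mapped by g(x) to the primes P1, P2, …, Pn. Their product is stored as Pprior: P1 * P2 * … * Pn = Pprior.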


REALIZATION OF THE ALGORITHM

Two functions f(x) and g(x) are to be realized for the implementation of the algorithm.

• Realizing f(x)
• Realizing g(x)
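One possible realization of the two functions, given purely as an assumption: f(x) is taken to be a truncated SHA-256 hash reduced into a small integer space, and g(x) a lookup of the i-th prime in a precomputed table. The names HASH_SPACE, first_primes, and PRIMES are hypothetical, and the hash space is kept small only so the table is cheap to build.

```python
import hashlib

HASH_SPACE = 1000  # hypothetical size of the integer space (an assumption)


def f(record):
    """Assumed f(x): hash a record to an integer in [0, HASH_SPACE)."""
    digest = hashlib.sha256(record.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % HASH_SPACE


def first_primes(count):
    """Return the first `count` primes by trial division against earlier primes."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


PRIMES = first_primes(HASH_SPACE)


def g(i):
    """Assumed g(x): map the integer i to the (i + 1)-th prime, so g(0) == 2."""
    return PRIMES[i]


print(g(f("alice,1980")))
```

Because this toy f(x) reduces the hash modulo HASH_SPACE, two distinct records can collide; the scheme is only exact when f(x) gives every distinct record its own integer.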


STEPS OF THE ALGORITHM

Step 1 : Each new record is hashed to obtain a hash value (Hnew) that is unique for each distinct record.

Step 2 : Hnew is mapped to its corresponding unique prime (Pnew).

Step 3 : Pprior is divided by Pnew. If Pnew exactly divides Pprior, then the record corresponding to Pnew is a duplicate and is already represented in Pprior. Otherwise, the record is distinct.

Step 4 : If the record is distinct, Pprior is multiplied by Pnew and the result is stored back in Pprior. Updating Pprior in this way renders the algorithm adaptive.
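The four steps can be sketched as below. This is a minimal sketch under stated assumptions: the toy f(x) (CRC-32 reduced into a small space), the table-based g(x), and the names DuplicateDetector, HASH_SPACE, and prime_table are all hypothetical stand-ins for whatever realization is actually chosen.

```python
import zlib


def prime_table(count):
    """Return the first `count` primes using trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


HASH_SPACE = 512                  # hypothetical size of the integer space
PRIMES = prime_table(HASH_SPACE)  # g(x) is a lookup into this table


def f(record):
    """Toy f(x): CRC-32 of the record, reduced into [0, HASH_SPACE)."""
    return zlib.crc32(record.encode("utf-8")) % HASH_SPACE


def g(h):
    """Toy g(x): the prime at index h of the table."""
    return PRIMES[h]


class DuplicateDetector:
    def __init__(self):
        self.p_prior = 1  # empty product: no records seen yet

    def is_duplicate(self, record):
        h_new = f(record)              # Step 1: hash the new record
        p_new = g(h_new)               # Step 2: map the hash to a prime
        if self.p_prior % p_new == 0:  # Step 3: exact division means duplicate
            return True
        self.p_prior *= p_new          # Step 4: fold the new prime into Pprior
        return False


detector = DuplicateDetector()
for record in ["alice,1980", "bob,1975", "alice,1980"]:
    print(record, detector.is_duplicate(record))
```

The repeated record is reported as a duplicate on its second appearance, and only the single integer p_prior is kept between calls. With this toy f(x), distinct records that collide in the small hash space would also be reported as duplicates.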


Fig: Flowchart


IMPLEMENTATIONS

There are three important implementation details that need to be discussed:

• Size of Records set
• Use of Logarithms
• Subsets of Records set
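One reason the size of the records set matters, assuming Pprior is kept as an exact integer, is that the product grows with every distinct record. The sketch below measures the bit length of the product of the first n primes, which is the smallest Pprior possible for n distinct records; the helper names are hypothetical.

```python
def first_primes(count):
    """Return the first `count` primes using trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


def p_prior_bits(n):
    """Bit length of the product of the first n primes (smallest Pprior for n records)."""
    p_prior = 1
    for p in first_primes(n):
        p_prior *= p
    return p_prior.bit_length()


for n in (10, 100, 1000):
    print(n, p_prior_bits(n))
```

Python integers are arbitrary precision, so the product never overflows here, but the storage and division cost still rise as the records set grows.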


CONCLUSION

A new approach to handling duplicate records is presented.

This approach combines concepts from number theory and algorithmics to solve the frequently encountered problem of “duplicate record detection”.


THANK YOU !!!