Real Time Fuzzy Matching With Spark and ElasticSearch

ElasticSearch Real Time Fuzzy Matching With Spark anddevelopermarch.com/...Apr29_FuzzyMatchingSpark... · Real Time Spark + ElasticSearch. Advantages Point and Shoot - Zero config

BFSI

Wilful Defaulters?

Sanctions Screening

PEP

HMT

OFAC SDN

..and many others

However ...

7TH OF TIR

7TH OF TIR COMPLEX

7TH OF TIR INDUSTRIAL COMPLEX

7TH OF TIR INDUSTRIES

7TH OF TIR INDUSTRIES OF ISFAHAN/ESFAHAN

SEVENTH of Tir
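The aliases above show why exact string comparison fails for sanctions screening. The following is a minimal sketch, assuming two generic similarity measures (token-set Jaccard and a character-level ratio from Python's standard `difflib`); the slides do not say which measures Reifier uses, and the helper names are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lower-case the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def token_jaccard(a: str, b: str) -> float:
    """Share of distinct words that the two names have in common."""
    ta, tb = set(normalize(a).split()), set(normalize(b).split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def char_ratio(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


# Aliases taken from the slide.
aliases = [
    "7TH OF TIR",
    "7TH OF TIR COMPLEX",
    "7TH OF TIR INDUSTRIAL COMPLEX",
    "SEVENTH of Tir",
]

base = aliases[0]
for other in aliases[1:]:
    exact = normalize(base) == normalize(other)
    print(f"{other!r}: exact={exact}, "
          f"jaccard={token_jaccard(base, other):.2f}, "
          f"ratio={char_ratio(base, other):.2f}")
```

None of the aliases equals the first one exactly, yet each shares most of its words with it, which is the signal a fuzzy matcher has to pick up.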

Entity Resolution

Directory Listings

Dew Drops, Shop no - A-152, super mart 1, Gurgaon - 122001, DLF Phase 4

DewDrop Florist, A 152, DLF City Phase 4, Near Galleria Market, Super Mart 1

Ecommerce

Cherry Mobile Amethyst Android 4.2 Jelly Bean (Black) with Free Smart and Globe SIM

Cherry Mobile Amethyst (White) with 1 Smart SIM

CHERRY MOBILE AMETHYST + 1 SMART SIM

Cherry Mobile Amethyst Android 4.2 Jelly Bean

Cherry Mobile Amethyst (White) with 1 Samsung Galaxy V

CHERRY MOBILE AMETHYST + 1 SAMSUNG GALAXY V. + 1 SMART AND GLOBE SIM

Government of ..

● Benefit rollouts
● Surveillance
● Licenses
● Linking NPR with Passport

360 view

ID     Company   Name            Project
12345  UBM Asia  Dave Chan       HK - Fine Jewellery
13222  UBM A     Dave C          HK - Fashion Jewellery
15656  UBM       Davechan        HK - Beauty
14456  ubmAsia   Mr. Dave CChan  HK - Fine Jewellery
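A 360 view needs the four rows above to collapse into one entity once pairwise matches are known. The following is a minimal sketch, assuming a union-find grouping over matched ID pairs; the matched pairs listed here are hypothetical, and the slides do not describe how Reifier forms its groups.

```python
def cluster(ids, matched_pairs):
    """Group record IDs into entities, given pairs judged to be the same."""
    parent = {i: i for i in ids}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller ID as the representative of the merged group.
            parent[max(ra, rb)] = min(ra, rb)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())


# IDs from the slide; the matched pairs are hypothetical matcher output.
ids = [12345, 13222, 15656, 14456]
matched_pairs = [(12345, 14456), (13222, 15656), (12345, 13222)]
print(cluster(ids, matched_pairs))
```

Three pairwise matches are enough to link all four records, because matches are treated as transitive.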

“In order to be irreplaceable, one must always be different.”

― Coco Chanel

Other uses

● Cross selling
● Data Quality
● Vendor consolidation
● Master Data Management
● CRM Deduplication

Challenges

● Discovering and maintaining rules is extremely tough

● Custom coding and domain-specific logic make maintenance a nightmare

● No one size fits all: big custom implementations are needed every time, even after using existing tools

Challenges..

● High data volumes
● Each record has multiple dimensions
● Exact matches are rare
● Comparing each record with every other is not possible
● Languages have unique issues
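The point about comparing each record with every other follows from simple arithmetic: n records produce n(n-1)/2 unordered pairs. A small sketch of that growth, with record counts chosen purely for illustration:

```python
def pair_count(n: int) -> int:
    """Number of unordered record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Illustrative dataset sizes, not figures from the talk.
for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9,} records -> {pair_count(n):,} pairwise comparisons")
```

At one million records the count is already close to half a trillion comparisons, each of which spans several fields.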

Let's start wishing...

● Data variety
● Scalable
● No manual configuration of rules or algorithms
● Multi language
● Real time

Our Approach

- Learn from the data
- Divide the load
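One common way to divide the load in entity resolution is blocking: records are bucketed by a cheap key, and only records within the same bucket are compared. The sketch below assumes that technique for illustration; the slides do not state how Reifier divides the load, and the blocking key and the "Acme Trading" record are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record: dict) -> str:
    """Hypothetical key: first three characters of the company name,
    lower-cased and with whitespace removed."""
    return "".join(record["company"].lower().split())[:3]


def candidate_pairs(records: list) -> list:
    """Return ID pairs that share a blocking key; other pairs are skipped."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record["id"])
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


records = [
    {"id": 12345, "company": "UBM Asia"},
    {"id": 13222, "company": "UBM A"},
    {"id": 15656, "company": "UBM"},
    {"id": 14456, "company": "ubmAsia"},
    {"id": 99999, "company": "Acme Trading"},  # hypothetical unrelated record
]

pairs = candidate_pairs(records)
print(len(pairs), "candidate pairs instead of", 5 * 4 // 2)
```

Each block can then be scored independently, which is what makes the work easy to spread across a cluster.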

Reifier Workflow

[Flowchart with the steps: Configure data; decision "Have training data?" (Yes / No); Reifier Interactive Learner; Reifier Match; Linked Result]

1. Select Data

2. Field Selection and Stop Words

Strata Hadoop World Singapore 2015

3. Choose Training Set

4. Run the Spark Job

5. Enjoy the results

Fuzzy Match in Reifier: Stop Words

At the beginning (without Chinese stop words configured):

亚洲博闻有限公司 Dave Chan
亚洲华乐有限公司 David Chan

In this case, the similarity between the two records is very high, because both company names share 亚洲 ("Asia") and 有限公司 ("Limited Company").

What if we configure the stop words (亚洲, 有限公司)?

博闻 Dave Chan
华乐 David Chan

The company names of these records no longer match at all, so the system will not group them together.
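The stop-word effect on this slide can be reproduced with a generic character-level similarity. This is a minimal sketch, assuming `difflib.SequenceMatcher` as the measure and plain substring removal for stop words; neither is confirmed as Reifier's method, and the helper names are hypothetical.

```python
from difflib import SequenceMatcher


def strip_stop_words(text: str, stop_words: list) -> str:
    """Remove every occurrence of each stop word from the text."""
    for word in stop_words:
        text = text.replace(word, "")
    return text.strip()


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


# Company names and stop words taken from the slide.
name_a = "亚洲博闻有限公司"
name_b = "亚洲华乐有限公司"
stop_words = ["亚洲", "有限公司"]

before = similarity(name_a, name_b)
after = similarity(strip_stop_words(name_a, stop_words),
                   strip_stop_words(name_b, stop_words))
print(f"without stop words configured: {before:.2f}")  # 0.75
print(f"with stop words removed:       {after:.2f}")   # 0.00
```

Six of the eight characters in each name come from the shared stop words, so removing them leaves 博闻 and 华乐, which have nothing in common.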

Reifier Interactive Learner

Spark Benefits

● Distributed
● Scalable
● Fast
● Machine Learning
● Sampling
● No need to orchestrate multiple jobs

Real Time

Spark + ElasticSearch
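The slides do not show how ElasticSearch is queried in the real-time path. As a hedged illustration only, ElasticSearch's `match` query accepts a `fuzziness` parameter, so an incoming record could be used to fetch candidate matches with a request body like the one built below. The field name `company` and the helper `fuzzy_match_query` are assumptions, not part of the talk.

```python
import json


def fuzzy_match_query(field: str, text: str,
                      fuzziness: str = "AUTO", size: int = 10) -> dict:
    """Build an ElasticSearch search body using a fuzzy `match` query.

    `field` is a hypothetical field name; `size` caps the number of
    candidate records returned.
    """
    return {
        "size": size,
        "query": {
            "match": {
                field: {
                    "query": text,
                    "fuzziness": fuzziness,
                }
            }
        },
    }


# Example: look up candidates for a misspelled variant of a company name.
body = fuzzy_match_query("company", "ubm asai")
print(json.dumps(body, indent=2))
```

The returned candidates would still need to be scored by the learned similarity model; the query only narrows the search to a handful of records.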

Advantages

● Point and Shoot - Zero config

● Learning similarity definitions from data

■ No hard coding of business rules

■ Domain agnostic

■ Handle multiple languages (English, Chinese, Japanese, Thai)

Advantages

● Scalability

● Real time as well as batch