Beyond Kaggle: Solving Data Science Challenges at Scale

Think Big, Start Smart, Scale Fast

Data Matching and Deduplication using Dato Toolkits

Dato Conference, July 21st, 2015

Guillermo Breto Rangel, PhD

Entity Resolution: Multiple Definitions

Entity Resolution (ER)

Extract, match and disambiguate entity records in data.

Entity Resolution: Real World Entity

Matching real world entities with profiles, mentions...

Facebook account(s)

LinkedIn profile(s)

Tweets

Google searches

Many records → Unique identities

Entity Resolution: Use Cases

◆ Network Analysis

◆ Vocabulary Normalization: different organizations report different names for the same entities

◆ Network Security: Finding user actions/intents

◆ Data Cleaning: removing duplicated records

◆ Metadata enrichment: when records are matched, their metadata is appended to the entity.

Entity Resolution: Challenges

◆ Missing Values

◆ Data entry errors

◆ Abbreviations and formatting

◆ Data volume

◆ Variety of raw data sources

o free text, semi-structured, streaming

◆ Data integration from multiple sources

◆ Preprocessing

◆ Normalization

◆ Choosing similarity metrics

Dataset: DBpedia/Amazon-Google Products

Putting a schema to Wikipedia

Crowd-sourced community project

Queries against Wikipedia data

Match data sets on the Web to Wikipedia data

A set of triples → <dbpedia:Luc_Besson> <dbpedia-owl:spouse> <dbpedia:Milla_Jovovich>
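A minimal sketch of reading one such triple into its subject, predicate and object. The regular expression and the function name `parse_triple` are illustrative assumptions, not part of the Dato toolkits or of any DBpedia tooling.

```python
import re

# Three angle-bracketed terms; whitespace between them is optional,
# since extracted text sometimes drops it.
TRIPLE_RE = re.compile(r"<([^<>]+)>\s*<([^<>]+)>\s*<([^<>]+)>")

def parse_triple(line):
    """Return (subject, predicate, object), or None if the line does not match."""
    match = TRIPLE_RE.search(line)
    return match.groups() if match else None

triple = parse_triple(
    "<dbpedia:Luc_Besson> <dbpedia-owl:spouse> <dbpedia:Milla_Jovovich>"
)
print(triple)
```

Each parsed triple gives one (entity, attribute, value) fact, which is the raw material for the preprocessing steps.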

Matching Amazon Products and Google Products

Deich Library and

Preprocessing: Steps

1) Extract tokens

2) Clean triplets

3) Pivot table

4) Select relevant features

5) Normalization

6) Choosing similarity metrics
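The first five steps can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names (`strip_prefix`, `pivot`, `normalize`, `tokens`), the second triple and the choice of `spouse` as the selected feature are all hypothetical, and the talk itself used Dato toolkits rather than this code.

```python
import re
from collections import defaultdict

def strip_prefix(term):
    """Clean a term: drop a namespace prefix such as 'dbpedia:' and
    turn underscores into spaces."""
    return term.split(":", 1)[-1].replace("_", " ")

def pivot(triples):
    """Pivot (subject, predicate, object) triples into one record per subject."""
    table = defaultdict(dict)
    for subj, pred, obj in triples:
        table[strip_prefix(subj)][strip_prefix(pred)] = strip_prefix(obj)
    return dict(table)

def normalize(text):
    """Lowercase, replace anything outside a-z/0-9 with a space, trim.
    Note: this simple rule also drops non-ASCII letters."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def tokens(text):
    """Extract the set of word tokens from a string."""
    return set(normalize(text).split())

# The second triple is made up for illustration.
triples = [
    ("dbpedia:Luc_Besson", "dbpedia-owl:spouse", "dbpedia:Milla_Jovovich"),
    ("dbpedia:Luc_Besson", "dbpedia-owl:occupation", "dbpedia:Film_director"),
]
records = pivot(triples)

# Hypothetical feature selection: keep only the columns used for matching.
features = ["spouse"]
selected = {
    name: {f: normalize(rec.get(f, "")) for f in features}
    for name, rec in records.items()
}
print(selected)
```

Missing values become empty strings here; whether that is the right treatment depends on the similarity metric chosen in step 6.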

Algorithm: Nearest Neighbors

● The entity resolution problem is approached as a network problem

○ Nodes: entity records

○ Edges: similarity measures

● Define a distance between entities to find the nearest neighbors. Composite distances can be built from Euclidean, squared Euclidean, Levenshtein, Jaccard, Manhattan, cosine and dot-product distances

● Compute the distance between all entities and find the nearest neighbors

● Duplicates are the connected components of the graph; each component is labeled as one entity

● Some parameters to keep in mind are:

○ Grouping_features

○ k (number of neighbors to compare)

○ Radius (the distance threshold)
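The approach above can be sketched in plain Python. This is a stand-in under stated assumptions, not the Dato toolkit API: the function names, the sample product records and the weights are made up, and every pair of records is compared instead of limiting the search to k neighbors.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| on the word-token sets of two strings."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)

def levenshtein(a, b):
    """Edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def normalized_levenshtein(a, b):
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

def composite_distance(r1, r2, weights):
    """Weighted sum of per-feature distances; weights maps feature -> (fn, weight)."""
    return sum(w * fn(r1[f], r2[f]) for f, (fn, w) in weights.items())

def find_duplicates(records, weights, radius):
    """Link records whose distance is within radius, then return the
    connected components (union-find) as lists of record indices."""
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(len(records)), 2):
        if composite_distance(records[i], records[j], weights) <= radius:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Made-up product records, for illustration only.
records = [
    {"name": "apple ipod nano 8gb", "maker": "apple"},
    {"name": "apple ipod nano 8gb silver", "maker": "apple inc"},
    {"name": "adobe photoshop cs3", "maker": "adobe"},
]
weights = {"name": (jaccard_distance, 0.7), "maker": (normalized_levenshtein, 0.3)}
print(find_duplicates(records, weights, radius=0.5))
```

Comparing all pairs costs O(n²) distance evaluations, which is why limiting the comparison to k neighbors and grouping records first matters at scale.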

Results:

The benchmark results can be found at:

https://github.com/cubreto/dataDeduplication

Lessons Learned:

◆ Most of the time is spent on preprocessing

◆ Hard to define the distance threshold

◆ Weighting the composite distance

◆ Data volume

◆ Dealing with missing values

◆ Tuning the parameters

◆ Finding exact matches
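One possible way to approach the threshold question is to sweep the radius over a small labeled sample and watch precision and recall. The sketch below assumes such labeled pairs exist; the distances, the labels and the function names are made up for illustration and are not the benchmark results.

```python
def precision_recall(predicted_pairs, true_pairs):
    """Precision and recall of predicted duplicate pairs against labeled ones.
    An empty prediction set is given precision 1.0 by convention."""
    tp = len(predicted_pairs & true_pairs)
    precision = tp / len(predicted_pairs) if predicted_pairs else 1.0
    recall = tp / len(true_pairs) if true_pairs else 1.0
    return precision, recall

def sweep_radius(pair_distances, true_pairs, radii):
    """For each radius, predict every pair within that distance as a duplicate.
    Returns a list of (radius, precision, recall)."""
    results = []
    for radius in radii:
        predicted = {pair for pair, d in pair_distances.items() if d <= radius}
        results.append((radius, *precision_recall(predicted, true_pairs)))
    return results

# Made-up pairwise distances and labels, for illustration only.
pair_distances = {(0, 1): 0.2, (0, 2): 0.9, (1, 2): 0.8, (2, 3): 0.4}
true_pairs = {(0, 1), (2, 3)}

results = sweep_radius(pair_distances, true_pairs, [0.1, 0.3, 0.5, 0.85])
for radius, precision, recall in results:
    print(f"radius={radius:.2f} precision={precision:.2f} recall={recall:.2f}")
```

The same sweep can be repeated for different composite-distance weights to see how sensitive the matches are to the weighting.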

Some Resources/Bibliography

◆ Ricardo Vasquez Sierra, PhD: Senior Data Scientist from Ooyala

◆ Kevin Glynn, MS: Data Scientist and Khan Academy Instructor

◆ Vince Gonzalez: MapR Software Engineer

◆ Alexey Svyatkovskiy, PhD: Big Data Scientist, Princeton University

◆ Ashwin Machanavajjhala, PhD: Professor of Computer Science, Duke University

◆ Lise Getoor, PhD: Professor of Computer Science, UC Santa Cruz

o KDD Tutorial on Entity Resolution in Big Data

o Deduplication and Group Detection using Links, Indrajit Bhattacharya and Lise Getoor, The 10th ACM SIGKDD Workshop on Link Analysis and Group Detection (LinkKDD-04).

o Collective Entity Resolution in Relational Data, Indrajit Bhattacharya and Lise Getoor, ACM Transactions on Knowledge Discovery from Data (ACM-TKDD), 2007

◆ The Dato Team

◆ My colleagues at Think Big
