BigDansing presentation slides for SIGMOD 2015

BigDansing: A BigData Cleansing System

Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel Madden,

Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Si Yin

Problem of Dirty Data

● “duplicate and dirty data costs the healthcare industry over $300 billion every year”

– Joe Fusaro (RingLead)

● “inaccurate data has a direct impact ... the average company losing 12% of its revenue”

– Ben Davis (Econsultancy)

The Process of Data Cleansing

One approach: Violation Detection using declarative rules

[Figure: a dirty dataset goes through violation detection, suggested repairs, and applying the repairs to produce a clean dataset]

Side effect: new Violations
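The detect, repair, and re-check cycle on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rule (an FD zipcode → city), the majority-vote repair, and all function names (`detect`, `suggest_repairs`, `apply_repairs`, `cleanse`) are hypothetical and are not taken from BigDansing.

```python
from collections import Counter

def detect(rows):
    """Return pairs of tuple ids that agree on zipcode but differ on city."""
    ids = sorted(rows)
    return [(a, b) for i, a in enumerate(ids) for b in ids[i + 1:]
            if rows[a]["zipcode"] == rows[b]["zipcode"]
            and rows[a]["city"] != rows[b]["city"]]

def suggest_repairs(rows, violations):
    """Assumed repair policy: propose the most frequent city per zipcode."""
    repairs = {}
    for pair in violations:
        for tid in pair:
            zipcode = rows[tid]["zipcode"]
            cities = Counter(r["city"] for r in rows.values()
                             if r["zipcode"] == zipcode)
            best = cities.most_common(1)[0][0]
            if rows[tid]["city"] != best:
                repairs[tid] = best
    return repairs

def apply_repairs(rows, repairs):
    """Return a new dataset with the suggested city values applied."""
    return {tid: {**row, "city": repairs.get(tid, row["city"])}
            for tid, row in rows.items()}

def cleanse(rows, max_rounds=10):
    """Repeat detection and repair, since repairs may cause new violations."""
    for round_no in range(max_rounds):
        violations = detect(rows)
        if not violations:
            return rows, round_no
        rows = apply_repairs(rows, suggest_repairs(rows, violations))
    return rows, max_rounds

dirty = {
    "t1": {"zipcode": "10001", "city": "New York"},
    "t2": {"zipcode": "10001", "city": "NYC"},
    "t3": {"zipcode": "10001", "city": "New York"},
    "t4": {"zipcode": "60601", "city": "Chicago"},
}
clean, rounds = cleanse(dirty)
```

The `max_rounds` cap is a design choice for the sketch: because applying repairs can introduce new violations, a loop without a bound is not guaranteed to stop for arbitrary rules.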

Related work

• Functional dependencies (FDs, CFDs)

• Inclusion dependencies (INDs, CINDs)

• Denial constraints (DCs)

• Matching dependencies (MDs)

• Entity resolution rules (ERs)

Limited quality rules support*

NADEEF**

• Easy-to-use

• Extensible

• Efficient

• Scalability

** SIGMOD 2013

* On approximating optimum repairs for functional dependency violations, ICDT 2009

* Holistic data cleaning: Putting violations into context, ICDE 2013

* The LLUNATIC data-cleaning framework, VLDB 2013

Data Cleansing is a Big Data problem

[Figure: dirty data]

BigData Cleansing Requirements

● Fast

● Scalable

● Portable

Challenges of BigData Cleansing

Abstraction vs. Scalability

Ease-of-use vs. Efficiency

Quality Rules

Inequalities

BigDansing

[Figure: BigDansing architecture. Inputs: a dirty dataset plus a declarative rule or UDFs (operators Scope, Block, Iterate, Detect, GenFix). The Rule Engine has a Rule Parser and a Logical Layer, Physical Layer, and Execution Layer, which produce logical plans, physical plans, and execution plans on a Data Processing Framework. Violations & possible repairs go to the Repair Algorithm; its value updates are applied by Apply Updates. Steps are numbered (1) to (7).]

BigDansing: Abstraction

Declarative Rules: FD, CFD, DC, ...

Logical Operators

● Scope

● Block

● Iterate

● Detect

● GenFix

Example rule, FD: zipcode -> city

Easy to use and enables scalability!
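To make the five logical operators concrete, here is a minimal Python sketch for the rule FD: zipcode -> city. It assumes the usual reading of the operator names (Scope keeps the relevant attributes, Block groups tuples by a key, Iterate enumerates candidate pairs, Detect tests a pair, GenFix proposes a fix). The function signatures and the tuple layout are hypothetical and are not BigDansing's actual API.

```python
from collections import defaultdict
from itertools import combinations

def scope(rows):
    """Scope: keep only the attributes the rule needs."""
    return [{"id": r["id"], "zipcode": r["zipcode"], "city": r["city"]}
            for r in rows]

def block(rows):
    """Block: group tuples that share the same zipcode."""
    blocks = defaultdict(list)
    for r in rows:
        blocks[r["zipcode"]].append(r)
    return list(blocks.values())

def iterate(blk):
    """Iterate: enumerate candidate pairs inside one block."""
    return combinations(blk, 2)

def detect(pair):
    """Detect: same zipcode (guaranteed by Block) but different city."""
    t1, t2 = pair
    return t1["city"] != t2["city"]

def gen_fix(pair):
    """GenFix: propose making the two city values equal."""
    t1, t2 = pair
    return (t1["id"], "city", "=", t2["id"], "city")

def run_rule(rows):
    """Chain the operators: Scope, Block, Iterate, Detect, GenFix."""
    fixes = []
    for blk in block(scope(rows)):
        for pair in iterate(blk):
            if detect(pair):
                fixes.append(gen_fix(pair))
    return fixes

rows = [
    {"id": 1, "name": "Ann", "zipcode": "10001", "city": "New York"},
    {"id": 2, "name": "Bob", "zipcode": "10001", "city": "NYC"},
    {"id": 3, "name": "Cid", "zipcode": "60601", "city": "Chicago"},
]
fixes = run_rule(rows)
```

Because Block restricts comparisons to tuples with the same zipcode, each block can in principle be processed independently, which is one plausible reading of why the slide says the abstraction enables scalability.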

BigDansing: Optimizations

[Figure: BigDansing architecture diagram]

● Shared Scans

● Fast Inequality Joins

● Shared Execution

Optimizations: Fast Inequality Joins

DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ∧ t1.Rate < t2.Rate)

1. Partitioning (divide): split the data into partition 1, partition 2, …, partition n on rate

2. Sorting (prepare): sort each partition on rate & salary

3. Pruning (reduce): compare partitions using their min-max values

4. Joining (execute): join the remaining partitions (partition 2, partition 3, …)
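The partition, sort, prune, and join stages named on this slide can be sketched as follows for the Salary/Rate denial constraint. This is a single-machine illustration only: the range partitioning, the min-max pruning test, and every function name are assumptions made for the example, not the system's actual join implementation.

```python
from itertools import product

def partition_on_rate(rows, n):
    """Divide: range-partition rows on rate into at most n chunks."""
    ordered = sorted(rows, key=lambda r: r["rate"])
    size = max(1, -(-len(ordered) // n))  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def sort_partition(part):
    """Prepare: sort one partition on (rate, salary)."""
    return sorted(part, key=lambda r: (r["rate"], r["salary"]))

def min_max(part):
    """Collect min-max values of salary and rate for one partition."""
    salaries = [r["salary"] for r in part]
    rates = [r["rate"] for r in part]
    return min(salaries), max(salaries), min(rates), max(rates)

def may_violate(p_stats, q_stats):
    """Reduce: keep (P, Q) only if some t1 in P, t2 in Q could violate the DC."""
    _, p_max_sal, p_min_rate, _ = p_stats
    q_min_sal, _, _, q_max_rate = q_stats
    return p_max_sal > q_min_sal and p_min_rate < q_max_rate

def join_pair(p, q):
    """Execute: find t1 in p, t2 in q with t1.salary > t2.salary and t1.rate < t2.rate."""
    out = []
    for t1 in p:
        for t2 in reversed(q):  # q is sorted on rate, so scan from the highest rate
            if t2["rate"] <= t1["rate"]:
                break  # every remaining t2 has an even lower rate
            if t1["salary"] > t2["salary"]:
                out.append((t1["id"], t2["id"]))
    return out

def detect_violations(rows, n=2):
    parts = [sort_partition(p) for p in partition_on_rate(rows, n)]
    stats = [min_max(p) for p in parts]
    violations = []
    for (i, p), (j, q) in product(enumerate(parts), repeat=2):
        if may_violate(stats[i], stats[j]):
            violations.extend(join_pair(p, q))
    return violations

rows = [
    {"id": 1, "salary": 3000, "rate": 0.20},
    {"id": 2, "salary": 4000, "rate": 0.10},
    {"id": 3, "salary": 5000, "rate": 0.30},
    {"id": 4, "salary": 2000, "rate": 0.25},
]
violations = detect_violations(rows, n=2)
```

The pruning test is conservative: it only discards partition pairs that cannot contain a violating pair, so the result matches a brute-force comparison of all tuple pairs.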

BigDansing: Scalable Repair

[Figure: BigDansing architecture diagram]

Scalable Repair

Distributed Equivalence Class

FD: A → B

      A    B
t1    a1   b1
t2    a1   b2
t3    a1   b1
t4    a2   b4
t5    a2   b4
t6    a2   b3

EQ1: t1, t2, t3 (repair: b2 → b1)

EQ2: t4, t5, t6 (repair: b3 → b4)

EQ algorithm as a word count problem
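The word-count analogy can be illustrated with the six-tuple example from this slide: count how often each B value occurs per A value, then rewrite minority values to the most frequent one. The majority-vote policy is inferred from the repairs shown (b2 → b1, b3 → b4), and the function name `repair_fd` is hypothetical.

```python
from collections import Counter, defaultdict

def repair_fd(rows, lhs, rhs):
    """Group by lhs (the equivalence class), count rhs values, fix minorities."""
    counts = defaultdict(Counter)
    for row in rows.values():
        counts[row[lhs]][row[rhs]] += 1  # the "word count" step
    updates = {}
    for tid, row in rows.items():
        target = counts[row[lhs]].most_common(1)[0][0]
        if row[rhs] != target:
            updates[tid] = (rhs, row[rhs], target)
    return updates

rows = {
    "t1": {"A": "a1", "B": "b1"},
    "t2": {"A": "a1", "B": "b2"},
    "t3": {"A": "a1", "B": "b1"},
    "t4": {"A": "a2", "B": "b4"},
    "t5": {"A": "a2", "B": "b4"},
    "t6": {"A": "a2", "B": "b3"},
}
updates = repair_fd(rows, "A", "B")
# updates == {"t2": ("B", "b2", "b1"), "t6": ("B", "b3", "b4")}
```

Counting per key is an aggregation that distributed frameworks handle well, which is presumably why the slide frames the equivalence class algorithm this way.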

Data Repair as a Black box

[Figure: data cells linked by data errors, repaired by a centralized data repair algorithm]

big connected components?
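One way to read this slide is that data errors link data cells into a graph, and each connected component can be handed to an existing centralized repair algorithm independently. The sketch below follows that reading. It is an assumption made for illustration, as are the union-find grouping and the names `connected_components` and `repair_all`; `repair_fn` stands in for whatever black-box algorithm is plugged in.

```python
from collections import defaultdict

def connected_components(violations):
    """Group violation indexes whose cells are transitively linked.

    Each violation is a non-empty tuple of cell identifiers.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for cells in violations:
        find(cells[0])
        for other in cells[1:]:
            union(cells[0], other)

    groups = defaultdict(list)
    for idx, cells in enumerate(violations):
        groups[find(cells[0])].append(idx)
    return list(groups.values())

def repair_all(violations, repair_fn):
    """Call the black-box repair function once per connected component."""
    return [repair_fn([violations[i] for i in comp])
            for comp in connected_components(violations)]

violations = [("t1.B", "t2.B"), ("t2.B", "t3.B"), ("t4.B", "t6.B")]
components = connected_components(violations)
sizes = repair_all(violations, len)
```

The question on the slide about big connected components applies directly here: if most violations fall into one component, the per-component calls no longer spread the work evenly.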

BigDansing: Portability

[Figure: BigDansing architecture diagram]

Centralized Experiment

DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ∧ t1.Rate < t2.Rate)

Several orders of magnitude faster!

Parallel Experiment

DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ∧ t1.Rate < t2.Rate)

Only BigDansing finished!

Summary

[Figure: BigDansing architecture diagram]

● Ease-of-Use

● Scalability

● Portability

Experiments – Parallel FD

● TPCH dataset:

● FD: custkey → custAddress

Experiments – Scalability

● TPCH Dataset:

● FD: custkey → custAddress

● Dataset: 500M rows

Repair Quality for FDs and DC

● Φ6: FD: Zipcode → State

● Φ7: FD: PhoneNumber → Zipcode

● Φ8: FD: ProviderID → City, PhoneNumber

● ΦD: DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ∧ t1.Rate < t2.Rate)

[Figure: repair quality for the rule sets Φ6, Φ6 & Φ7, Φ6 - Φ8, and ΦD]
