Using Hadoop to build a Data Quality Service for both real-time and batch data

Page 1: Using Hadoop to build a Data Quality Service for both real-time and batch data

Using Hadoop to build a Data Quality Service for both real-time and batch data

Griffin – https://github.com/ebay/griffin

Page 2: Using Hadoop to build a Data Quality Service for both real-time and batch data

About Us:

• Alex Lv ([email protected]), Senior Staff Software Engineer – Data Products, Platform & Engineering at eBay

• Amber Vaidya ([email protected]), Lead Product Manager – Data Products, Platform & Engineering at eBay

Page 3: Using Hadoop to build a Data Quality Service for both real-time and batch data

Agenda

• Background
• Introduction to Griffin
• Demo

Page 4: Using Hadoop to build a Data Quality Service for both real-time and batch data

Background

Page 5: Using Hadoop to build a Data Quality Service for both real-time and batch data

eBay Marketplace at a Glance

One of the world’s largest and most vibrant marketplaces (Q2 2016 data):

• $19.8B GMV in Q2 2016
• 10M new listings added via mobile per week
• 300M searches each day
• 65% of transactions ship for free (in US, UK, DE)
• 80% of items sold as new
• 1B live listings

Page 6: Using Hadoop to build a Data Quality Service for both real-time and batch data

Velocity Stats

US
• 3 car parts or accessories are sold every 1 sec
• A smartphone is sold every 4 sec
• A dress is sold every 6 sec

UK
• A necklace is sold every 10 sec
• A make-up product is sold every 3 sec
• A Lego product is sold every 19 sec

GERMANY
• A truck or car is sold every 5 min
• A pair of women’s jeans is sold every 4 sec
• A video game is sold every 11 sec

AUSTRALIA
• A pair of men’s sunglasses is sold every 1 min
• A home décor item is sold every 12 sec
• A car or truck part is sold every 4 sec

Page 7: Using Hadoop to build a Data Quality Service for both real-time and batch data

Mobile Velocity Stats

US
• A woman’s handbag is sold every 10 sec
• A car or truck is sold every 5 min
• An action figure is sold every 10 sec

UK
• A tablet is sold every 1 min
• A cookware item is sold every 6 sec
• A car is sold every 2 min

GERMANY
• A pair of women’s shoes is sold every 20 sec
• A watch is sold every 48 sec
• A tire or car part is sold every 35 sec

AUSTRALIA
• A piece of jewelry is sold every 12 sec
• A baby clothing item is sold every 46 sec
• A motorcycle part is sold every 51 sec

Page 8: Using Hadoop to build a Data Quality Service for both real-time and batch data

Big Data @ eBay

We manage one of the largest data platforms in the world

We utilize one of the largest data platforms in the world

Page 9: Using Hadoop to build a Data Quality Service for both real-time and batch data

Ensuring data quality at this scale is challenging!

Challenges at eBay:
• No unified view of data quality across multiple systems and teams
• No shared platform to manage data quality
• No system to measure near real-time data quality

Page 10: Using Hadoop to build a Data Quality Service for both real-time and batch data

What is Data Quality?

Definition
• How well it meets the expectations of data consumers
• How well it represents the objects, events, and concepts it is created to represent

Core Dimensions
• Completeness
• Uniqueness
• Timeliness
• Validity
• Accuracy
• Consistency
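To make a few of the dimensions above concrete, here is a minimal sketch in Python of how completeness, uniqueness, and validity could be scored over a handful of records. The record fields, the allowed site IDs, and the function names are illustrative assumptions, not taken from Griffin.

```python
# Hypothetical sample records; the field names are assumptions for illustration.
records = [
    {"user_id": "u1", "site_id": 0, "ts": "2016-06-01T12:00"},
    {"user_id": "u2", "site_id": 3, "ts": "2016-06-01T12:05"},
    {"user_id": None, "site_id": 77, "ts": "2016-06-01T12:06"},
    {"user_id": "u2", "site_id": 3, "ts": "2016-06-01T12:05"},  # duplicate row
]


def completeness(rows, field):
    """Share of rows in which `field` is present (not None)."""
    return sum(r[field] is not None for r in rows) / len(rows)


def uniqueness(rows, key_fields):
    """Number of distinct keys divided by the number of rows."""
    keys = [tuple(r[f] for f in key_fields) for r in rows]
    return len(set(keys)) / len(keys)


def validity(rows, field, allowed):
    """Share of rows whose `field` value is in the allowed set."""
    return sum(r[field] in allowed for r in rows) / len(rows)


print(completeness(records, "user_id"))        # 3 of 4 rows have a user_id
print(uniqueness(records, ["user_id", "ts"]))  # 3 distinct keys in 4 rows
print(validity(records, "site_id", {0, 3}))    # 3 of 4 site_ids are allowed
```

Each function returns a score between 0 and 1, which is one simple way to make different dimensions comparable on a single scale.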

Page 11: Using Hadoop to build a Data Quality Service for both real-time and batch data

Virtuous Cycle of Data Quality

Define → Measure → Analyze → Improve → (back to Define)

• Define the scope, dimensions, goals, thresholds, etc.
• Measure data quality values
• Analyze data quality results
• Improve data quality

Page 12: Using Hadoop to build a Data Quality Service for both real-time and batch data

Our Goal

A solution with all of the capabilities below:

Capability                      Commercial DQ software    Open source DQ software
Support eBay’s scale            x                         x
Data Quality measurement        √                         x
Support real-time data          x                         x
Support unstructured data       x                         x
Service based API               √                         x
Data Profiling                  √                         √
Pluggable measurement types     x                         x

Page 13: Using Hadoop to build a Data Quality Service for both real-time and batch data

Griffin

Page 14: Using Hadoop to build a Data Quality Service for both real-time and batch data

What is Griffin?

• Data Quality Platform built on Hadoop and Spark
  - Batch data
  - Real-time data
  - Unstructured data
• A unified process to detect DQ issues
  - Incomplete
  - Inaccurate
  - Invalid
  - ……
• An open source solution: https://github.com/ebay/griffin

Page 15: Using Hadoop to build a Data Quality Service for both real-time and batch data

Data Quality Framework in Griffin

[Diagram: a three-stage flow, Define → Measure → Analyze. Labels recovered from the diagram: Measure Repository, calculators running on source, RDBMS, Metrics, Metrics Repository, Scorecards.]

Define
• Define Data Quality Dimensions
• Define Metrics, Goals, Thresholds

Measure (calculators running on source)
• Accuracy
• Completeness
• Uniqueness
• Timeliness
• Validity
• Consistency

Measure and Analyze outputs
• Scorecard reports generated and displayed
• Measurement values and quality scores calculated and stored
• Data quality trending graphs generated
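The Define, Measure, and Analyze stages on this slide can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Griffin's actual design: the `MeasureDefinition` class, the `title_completeness` calculator, and the PASS/FAIL scorecard are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Rows = List[dict]


@dataclass
class MeasureDefinition:
    """Define stage: a named measure with its dimension and threshold."""
    name: str
    dimension: str
    calculator: Callable[[Rows], float]  # pluggable: any function of the rows
    threshold: float                     # minimum acceptable score, 0.0 to 1.0


def title_completeness(rows: Rows) -> float:
    """Hypothetical calculator: share of rows with a non-empty title."""
    return sum(bool(r.get("title")) for r in rows) / len(rows)


def measure(definitions: List[MeasureDefinition], rows: Rows) -> Dict[str, float]:
    """Measure stage: run every calculator and collect the metric values."""
    return {d.name: d.calculator(rows) for d in definitions}


def scorecard(definitions: List[MeasureDefinition],
              metrics: Dict[str, float]) -> Dict[str, str]:
    """Analyze stage: compare each metric with its threshold."""
    return {d.name: "PASS" if metrics[d.name] >= d.threshold else "FAIL"
            for d in definitions}


rows = [{"title": "Lego set"}, {"title": ""}, {"title": "Dress"}, {"title": "Watch"}]
definitions = [
    MeasureDefinition("title_completeness", "Completeness", title_completeness, 0.9),
]
metrics = measure(definitions, rows)
report = scorecard(definitions, metrics)
print(metrics, report)
```

Passing the calculator in as a plain function is one way to get the pluggable behavior the deck asks for: adding a new measurement type means registering another definition, with no change to `measure` or `scorecard`.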

Page 16: Using Hadoop to build a Data Quality Service for both real-time and batch data

Technical Highlights

• Real-time
• Fast
• Massive
• Pluggable

Page 17: Using Hadoop to build a Data Quality Service for both real-time and batch data

Component Design

Page 18: Using Hadoop to build a Data Quality Service for both real-time and batch data

Measure Calculator Example – Accuracy of Viewitem

~300M customer view events per day

Item_view fields:
• User Id
• Page Id
• Site Id
• Title
• Date
• ……

[Diagram: the Accuracy Calculator takes Source and Target as inputs and outputs a Metric, expressed as a ratio × 100%.]
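One plausible reading of the accuracy metric on this slide is the percentage of source records that have a matching record in the target. The Python sketch below illustrates that reading only. The exact formula is not recoverable from the slide, and the function name, field names, and sample events are assumptions made for illustration.

```python
from typing import Iterable, List, Sequence


def accuracy_percent(source: List[dict], target: Iterable[dict],
                     fields: Sequence[str]) -> float:
    """Percentage of source records with an exact match in target on `fields`."""
    if not source:
        raise ValueError("source must contain at least one record")
    # Index the target once so each source lookup is a constant-time set check.
    target_keys = {tuple(r.get(f) for f in fields) for r in target}
    matched = sum(tuple(r.get(f) for f in fields) in target_keys for r in source)
    return matched / len(source) * 100.0


# Hypothetical item-view events; the values are made up for illustration.
source_events = [
    {"user_id": 1, "page_id": 10, "site_id": 0, "title": "Lego set"},
    {"user_id": 2, "page_id": 11, "site_id": 3, "title": "Dress"},
    {"user_id": 3, "page_id": 12, "site_id": 77, "title": "Watch"},
    {"user_id": 4, "page_id": 13, "site_id": 15, "title": "Handbag"},
]
target_events = [
    {"user_id": 1, "page_id": 10, "site_id": 0, "title": "Lego set"},
    {"user_id": 2, "page_id": 11, "site_id": 3, "title": "Dress"},
    {"user_id": 3, "page_id": 12, "site_id": 77, "title": "Watch"},
    {"user_id": 4, "page_id": 13, "site_id": 15, "title": "Hand bag"},  # differs
]

fields = ["user_id", "page_id", "site_id", "title"]
print(accuracy_percent(source_events, target_events, fields))
```

At hundreds of millions of events per day the same comparison would run as a distributed join rather than an in-memory set lookup, but the ratio being computed is the same.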

Page 19: Using Hadoop to build a Data Quality Service for both real-time and batch data

Use Cases

• Griffin is deployed in production at eBay and provides a centralized data quality service for several eBay systems.

• ~1.2PB of data
• 800+M daily records
• 100+ metrics

Page 20: Using Hadoop to build a Data Quality Service for both real-time and batch data

Now life is easier……

Page 21: Using Hadoop to build a Data Quality Service for both real-time and batch data

Demo

Page 22: Using Hadoop to build a Data Quality Service for both real-time and batch data

We are open source and welcome contributions

GitHub: https://github.com/eBay/griffin
Blog: http://www.ebaytechblog.com/?p=5877/
Contact: [email protected]