Using Hadoop to build a Data Quality Service for both real-time and batch data
Griffin – https://github.com/ebay/griffin
About Us:
• Alex Lv ([email protected])
  Senior Staff Software Engineer – Data Products
  Platform & Engineering at eBay
• Amber Vaidya ([email protected])
  Lead Product Manager – Data Products
  Platform & Engineering at eBay
Agenda
• Background
• Introduction to Griffin
• Demo
Background
eBay Marketplace at a Glance
Q2 2016 data
• $19.8B GMV in Q2 2016
• 10M new listings added via mobile per week
• 300M searches each day
• 65% of transactions ship for free (in US, UK, DE)
• 80% of items sold as new
• 1B live listings
One of the world’s largest and most vibrant marketplaces
Velocity Stats
US
• 3 car parts or accessories are sold every 1 sec
• A smartphone is sold every 4 sec
• A dress is sold every 6 sec
UK
• A necklace is sold every 10 sec
• A make-up product is sold every 3 sec
• A Lego product is sold every 19 sec
GERMANY
• A truck or car is sold every 5 min
• A pair of women’s jeans is sold every 4 sec
• A video game is sold every 11 sec
AUSTRALIA
• A pair of men’s sunglasses is sold every 1 min
• A home décor item is sold every 12 sec
• A car or truck part is sold every 4 sec
Mobile Velocity Stats
US
• A woman’s handbag is sold every 10 sec
• A car or truck is sold every 5 min
• An action figure is sold every 10 sec
UK
• A tablet is sold every 1 min
• A cookware item is sold every 6 sec
• A car is sold every 2 min
GERMANY
• A pair of women’s shoes is sold every 20 sec
• A watch is sold every 48 sec
• A tire or car part is sold every 35 sec
AUSTRALIA
• A piece of jewelry is sold every 12 sec
• A baby clothing item is sold every 46 sec
• A motorcycle part is sold every 51 sec
Big Data @ eBay
We manage one of the largest data platforms in the world
We utilize one of the largest data platforms in the world
Ensuring data quality at this scale is challenging!
Challenges at eBay:
• No unified view of data quality across multiple systems and teams
• No shared platform to manage data quality
• No system to measure near real-time data quality
What is Data Quality?
Definition
• How well it meets the expectations of data consumers
• How well it represents the objects, events, and concepts it is created to represent
Core Dimensions
• Completeness
• Uniqueness
• Timeliness
• Validity
• Accuracy
• Consistency
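To make two of the dimensions above concrete, here is a minimal Python sketch of how completeness and uniqueness could be scored as percentages over a small batch of records. The function names, the record layout, and the choice of "distinct keys / total records" for uniqueness are illustrative assumptions, not Griffin's API or definitions.

```python
from typing import Mapping, Optional, Sequence

# Hypothetical record shape: field name -> value (None or "" means missing).
Record = Mapping[str, Optional[str]]


def completeness(records: Sequence[Record], field: str) -> float:
    """Percentage of records where `field` is present and non-empty."""
    if not records:
        return 100.0  # assumption: an empty batch has nothing missing
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return 100.0 * filled / len(records)


def uniqueness(records: Sequence[Record], key_fields: Sequence[str]) -> float:
    """Percentage of distinct keys relative to the total record count."""
    if not records:
        return 100.0
    keys = {tuple(r.get(f) for f in key_fields) for r in records}
    return 100.0 * len(keys) / len(records)


# Made-up sample data for illustration only.
sample = [
    {"user_id": "u1", "page_id": "p1", "title": "Lego set"},
    {"user_id": "u2", "page_id": "p2", "title": ""},
    {"user_id": "u1", "page_id": "p1", "title": "Lego set"},  # duplicate key
    {"user_id": "u3", "page_id": "p3", "title": None},
]

title_completeness = completeness(sample, "title")           # 2 of 4 filled
key_uniqueness = uniqueness(sample, ["user_id", "page_id"])  # 3 distinct of 4
```

With this sample, `title_completeness` is 50.0 and `key_uniqueness` is 75.0.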
Virtuous Cycle of Data Quality
Define → Measure → Analyze → Improve
• Define the scope, dimensions, goals, thresholds, etc.
• Measure data quality values
• Analyze data quality results
• Improve data quality
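The first three steps of the cycle above can be sketched in a few lines of Python: metrics are defined with a dimension and a threshold, measured against a batch, and then analyzed to find which ones fall short. All names here (`MetricDefinition`, `measure_all`, `analyze`), the thresholds, and the sample records are hypothetical and only illustrate the idea; the Improve step is a human or pipeline action and is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


def percent(records: Sequence[dict], predicate: Callable[[dict], bool]) -> float:
    """Share of records satisfying `predicate`, as a percentage."""
    if not records:
        return 100.0
    return 100.0 * sum(1 for r in records if predicate(r)) / len(records)


@dataclass(frozen=True)
class MetricDefinition:
    """Define: a named metric, its dimension, and its minimum acceptable value."""
    name: str
    dimension: str
    threshold: float  # percent
    measure: Callable[[Sequence[dict]], float]


def measure_all(defs: Sequence[MetricDefinition], records: Sequence[dict]) -> Dict[str, float]:
    """Measure: compute every defined metric over the batch."""
    return {d.name: d.measure(records) for d in defs}


def analyze(defs: Sequence[MetricDefinition], values: Dict[str, float]) -> List[str]:
    """Analyze: list the metrics whose value is below the threshold."""
    return [d.name for d in defs if values[d.name] < d.threshold]


VALID_SITES = {"US", "UK", "DE", "AU"}  # assumed set of valid site ids

definitions = [
    MetricDefinition("title_completeness", "Completeness", 70.0,
                     lambda rs: percent(rs, lambda r: bool(r.get("title")))),
    MetricDefinition("site_id_validity", "Validity", 90.0,
                     lambda rs: percent(rs, lambda r: r.get("site_id") in VALID_SITES)),
]

records = [
    {"title": "Watch", "site_id": "US"},
    {"title": "", "site_id": "UK"},
    {"title": "Dress", "site_id": "DE"},
    {"title": "Tablet", "site_id": "XX"},
]

values = measure_all(definitions, records)
failing = analyze(definitions, values)
```

Both metrics measure 75.0 on this sample; only `site_id_validity` is reported as failing because its threshold is 90.0.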
Our Goal
A solution with all the below capabilities
Capability                   | Commercial DQ software | Open source DQ software
Support eBay’s scale         | x                      | x
Data Quality measurement     | √                      | x
Support real-time data       | x                      | x
Support unstructured data    | x                      | x
Service based API            | √                      | x
Data Profiling               | √                      | √
Pluggable measurement types  | x                      | x
Griffin
What is Griffin?
• Data Quality Platform built on Hadoop and Spark
  - Batch data
  - Real-time data
  - Unstructured data
• A unified process to detect DQ issues
  - Incomplete
  - Inaccurate
  - Invalid
  - ……
• An open source solution
  https://github.com/ebay/griffin
Data Quality Framework in Griffin
Define
• Define Data Quality Dimensions
• Define Metrics, Goals, Thresholds
Measure
• Calculators running on Source, RDBMS
• Accuracy, Completeness, Uniqueness, Timeliness, Validity, Consistency
• Measure Repository
• Metrics, Metrics Repository
Analyze
• Scorecards
• Scorecard Reports generated and displayed
• Measurement values and quality scores calculated and stored
• Data quality trending graphs generated
Technical Highlights
• Real-time
• Fast
• Massive
• Pluggable
Component Design
Measure Calculator Example – Accuracy of Viewitem
~300M customer view events per day
Accuracy Calculator
• Inputs: Source, Target
• Item_view fields: User Id, Page Id, Site Id, Title, Date, ……
• Metric: ratio expressed as a percentage (X 100%)
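As a rough illustration of what an accuracy calculator of this kind might compute, the Python sketch below assumes that accuracy means the share of target records that have an exactly matching record in the source on the compared fields, multiplied by 100. That definition, the snake_case field names, and the sample data are assumptions for illustration; the slide does not spell out the formula, and Griffin itself runs on Hadoop and Spark rather than in-memory Python.

```python
from typing import Mapping, Sequence, Tuple

# Assumed snake_case names for the Item_view fields listed on the slide.
FIELDS: Tuple[str, ...] = ("user_id", "page_id", "site_id", "title", "date")


def accuracy(source: Sequence[Mapping[str, str]],
             target: Sequence[Mapping[str, str]],
             fields: Sequence[str] = FIELDS) -> float:
    """Percentage of target records with an exact match in source on `fields`."""
    if not target:
        return 100.0  # assumption: nothing to mismatch in an empty target
    source_keys = {tuple(r.get(f) for f in fields) for r in source}
    matched = sum(1 for r in target
                  if tuple(r.get(f) for f in fields) in source_keys)
    return 100.0 * matched / len(target)


def rec(user_id: str, page_id: str, site_id: str, title: str, date: str) -> dict:
    """Small helper to build a made-up Item_view record."""
    return dict(zip(FIELDS, (user_id, page_id, site_id, title, date)))


source = [
    rec("u1", "p1", "0", "Lego set", "2016-06-01"),
    rec("u2", "p2", "3", "Watch", "2016-06-01"),
    rec("u3", "p3", "77", "Dress", "2016-06-01"),
]
target = [
    rec("u1", "p1", "0", "Lego set", "2016-06-01"),  # matches
    rec("u2", "p2", "3", "Watch", "2016-06-01"),     # matches
    rec("u3", "p3", "77", "Drses", "2016-06-01"),    # title differs
    rec("u4", "p4", "0", "Tablet", "2016-06-01"),    # not in source
]

item_view_accuracy = accuracy(source, target)
```

Here two of the four target records match the source, so `item_view_accuracy` is 50.0. At the event volumes quoted above, the same comparison would be expressed as a distributed join rather than an in-memory set lookup.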
Use Cases
• Griffin has been deployed in production at eBay and provides a centralized data quality service for several eBay systems.
• ~1.2PB data
• 800+M daily records
• 100+ metrics
Now life is easier……
Demo
We are open source and welcome contributions
• GitHub: https://github.com/eBay/griffin
• Blog: http://www.ebaytechblog.com/?p=5877/
• Contact: [email protected]