Description
This presentation covers the design principles and techniques used to build data pipelines, taking into consideration the following aspects: architecture evolution, capacity, data quality, performance, flexibility, and alignment with business objectives. The discussion is set in the context of managing a pipeline with multi-petabyte data sets: a code base composed of Java map/reduce jobs with HBase integration, Hive scripts, and Kafka/Storm inputs. We'll talk about how to make sure that data pipelines have the following features: 1) assurance that the input data is ready at each step, 2) workflows that are easy to maintain, and 3) data quality and validation included in the architecture. Part of the presentation is dedicated to showing how to organize the warehouse using layers of data sets. A suggested starting point for these layers is: 1) raw input (logs, messages, etc.), 2) logical input (scrubbed data), 3) foundational warehouse data (the most relevant joins), 4) departmental/project data sets, and 5) report data sets (used by traditional report engines). The final part discusses the design of a rule-based system to perform validation and trending reporting.
Rocket Fuel
Big Data and Artificial Intelligence for Digital Advertising
Abhijit Pol, Marilson Campos
Designing Data Pipelines
July, 2013
What We Do
[Diagram: the Rocket Fuel Platform. Flow: Page Request (Web Browser → Publishers) → Ad Request → Bid Request (Ad Exchange → Real-time Bidder / Automated Decisions) → Bid & Ad → Rocket Fuel Winning Ad → Ad Served to User → User Engages with Ad → User Engagement Recorded. Supporting components: Response Prediction Model, Campaign & User Data, Warehouse (Refresh Learning, Qualify Audience, Optimize), Ads & Budget, Data Partners, Some Exchange Partners.]
How Big Is This Problem Each Day?

Trades on NASDAQ: 10 million
Facebook Page Views: 30 billion
Searches on Google: ~5 billion
Bid Requests Considered by Rocket Fuel: ~20 billion
BIG DATA + AI
Advertising That Learns
Outline
• Architecture Evolution
• Hurdles and Challenges Faced
• Data Pipelines Best Practices
Architecture for Growth
• 20 GB/month to 2 PB/month in 3 years
• New and complex requirements
• More consumers
• Rapid growth
How We Started
Architecture 2.0
Current Architecture
Outline
• Architecture Evolution
• Hurdles and Challenges Faced
• Data Pipelines Best Practices
Hurdles and Challenges Faced
• Exponential data growth and user queries
• Network issues
• Bots
• Bad user queries
Outline
• Architecture Evolution
• Hurdles and Challenges Faced
• Data Pipelines Best Practices
Data Pipeline Design Best Practices
• Job Design / Consistency
• Job Features / Avoid Re-work
• Golden Input / Shadow Cluster
• Data Collection
• Dashboard
Job Design / Consistency
• Idempotent
• Execution by different users
• Account for execution time
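Idempotence can be sketched as write-to-staging plus atomic rename: a re-run (by the same or a different user) either finds the published output and skips, or starts over cleanly, so partial output from a failed run is never visible downstream. The function and argument names below are illustrative, not from the deck:

```python
import os
import shutil

def run_idempotent(output_dir, work_fn):
    """Run work_fn at most once per output_dir.

    work_fn writes its files into a staging directory; the staging
    directory is atomically renamed to the final path on success, so
    downstream jobs only ever see complete output.
    """
    if os.path.exists(output_dir):           # already published: skip
        return False
    staging = output_dir + "._staging"
    if os.path.exists(staging):              # leftover from a failed run
        shutil.rmtree(staging)
    os.makedirs(staging)
    try:
        work_fn(staging)                     # produce files into staging
        os.rename(staging, output_dir)       # atomic on the same filesystem
        return True
    except Exception:
        shutil.rmtree(staging, ignore_errors=True)
        raise
```

The same pattern is what Hadoop's output committer does with temporary task directories; here it also makes whole-job re-execution safe.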
Job Execution Timeline
Job Features / Re-Work
• Smaller jobs
• Record completion of steps
Recording completion times

[Flowchart, for each step of a workflow, job, or script: Start; check "Is mark already there?"; if yes, End; if no, execute the work for the step, create the mark, collect other data (optional), then End.]
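The flow above might be coded as follows; the mark file also stores the completion time, so delivery-time reporting can read it later. `run_step`, `mark_dir`, and the `.done` suffix are assumptions for illustration:

```python
import json
import os
import time

def run_step(step_name, mark_dir, step_fn):
    """Execute one step of a workflow, job, or script at most once.

    If the completion mark already exists, the step is skipped;
    otherwise the work runs and the mark is written, carrying the
    completion timestamp as the "other data" collected per step.
    """
    mark = os.path.join(mark_dir, step_name + ".done")
    if os.path.exists(mark):                 # "Is mark already there?" -> yes
        return False
    step_fn()                                # "Execute work for the step"
    with open(mark, "w") as f:               # "Create the mark"
        json.dump({"step": step_name, "completed_at": time.time()}, f)
    return True
```

Because the mark is only written after the work succeeds, a crash mid-step leaves no mark and the step simply re-runs on the next attempt.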
Golden Input / Shadow Cluster
• Integration tests on realistic data sets
• Safe environment to innovate
Data Collection: Delivery Time View

[Diagram: a data product is produced by workflows; each workflow is a DAG of jobs (map/reduce, Hive, Pig, SSH scripts). Delivery times are collected at each job, so the dashboard can show when each piece of the product lands.]
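One way to guarantee that the input data is ready at each step, a central theme of the talk, is to run a workflow's jobs in dependency order, starting a job only after everything upstream has completed. A minimal sketch, with hypothetical job dicts standing in for Hive, Pig, map/reduce, or SSH-script jobs:

```python
def ready_to_run(job, completed):
    """A job may start only when every upstream dependency has completed."""
    return all(dep in completed for dep in job["deps"])

def run_workflow(jobs):
    """Execute the jobs of one workflow in dependency order.

    Each job is a dict like {"name": "load", "deps": ["extract"]};
    the real executor would launch the job here instead of just
    recording the order. Raises if the graph has a cycle or a
    dependency that no job produces.
    """
    completed, order = set(), []
    pending = list(jobs)
    while pending:
        runnable = [j for j in pending if ready_to_run(j, completed)]
        if not runnable:
            raise RuntimeError("cycle or missing dependency in workflow")
        for j in runnable:
            order.append(j["name"])          # launch the job here
            completed.add(j["name"])
            pending.remove(j)
    return order
```

Combined with the completion marks above, "completed" can be derived from mark files, so a restarted workflow resumes exactly where it left off.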
Data Collection: Data Profiles View

[Diagram: boxes are data sets, arrows are transformations, chained together to build the data product.]

Profiles collected per data set:
• Record size & type
• Job counts
• Join success ratios
• Data set consistency
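These profile metrics can be computed cheaply as a side effect of each run. A sketch of two of them, record-size profiling and join success ratio (function names are illustrative; a sustained drop in the join ratio between runs often signals upstream data loss):

```python
def profile(records):
    """Per-data-set profile: record count and mean record size in bytes."""
    sizes = [len(r) for r in records]
    return {"count": len(sizes),
            "avg_size": sum(sizes) / len(sizes) if sizes else 0.0}

def join_success_ratio(left_keys, right_keys):
    """Fraction of left-side records that found a match in the join."""
    right = set(right_keys)
    matched = sum(1 for k in left_keys if k in right)
    return matched / len(left_keys) if left_keys else 0.0
```

In a map/reduce setting the same numbers come almost for free from job counters, which is why they sit naturally in the data-collection layer.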
Data Collection Hierarchy
[Diagram: a hierarchy with two node types, workflow/job/script steps and data products. Example nodes: wk_external_events, wk_build_profile, user_profile, extract_fields, consolidate_metrics, load_into_data_centers, extract_features, compact_user_profile.]
Dashboard
• Delivery time
• Data profile ratios
• Counters
• Alarms
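An alarm rule for the dashboard can compare today's value of a profile ratio against its trailing average, which is the trending half of the rule-based validation described in the abstract. A minimal sketch with an assumed 20% tolerance (the threshold and function name are illustrative, not from the deck):

```python
def check_against_trend(today, history, tolerance=0.2):
    """Return True (alarm) when today's metric deviates from the
    trailing average of `history` by more than `tolerance` (fractional).

    With no history there is nothing to compare against, so no alarm.
    """
    if not history:
        return False
    baseline = sum(history) / len(history)
    if baseline == 0:
        return today != 0                    # any value off a zero baseline
    return abs(today - baseline) / baseline > tolerance
```

Run against each collected ratio (join success, record counts, delivery time), this turns the passive data-collection layer into active alarms without hand-tuning a threshold per data set.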
Thank you
www.rocketfuel.com