Thank you

Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system


Page 1

Thank you

Page 2

Harel Ben Attia
Senior Software Engineer

River
A data workflow management system

Page 3

– Tens of Billions of Recommendations per month
– Most major publishers in the World
– Hundreds of GBs of new data every day

Page 4

Context

• Data Processing Workflows

• Multiple Types of Processing
  – Rollups, Grouping, Filtering, Algorithm Calculations

• Multiple Stages of Processing
  – Using the output of other processes as input

Page 5

Problems

• Dependency “Management”
  – Hardcoded into code/scripts
  – Time-based using cron or another scheduler

• Logic is scattered around the system
  – Developers need to take care of monitoring, alerts, permissions etc.
  – Multiple Locations of Execution

Page 6

River

Data Processing Management Infrastructure

Page 7

River

• Execution Management
  – Full Execution History and Filtering
  – Monitoring and Actionable Alerting
  – Automatic Retries
  – Web UI
  – JobLogs

• Ease of Development
  – Declarative Data Processing Definitions
  – Decentralized
    • Shared Data, separate development

• Data Driven Dependencies
  – Why?

Ops / NOC

Developers

Page 8

Other Approaches

[Figure: jobs A, B and C with a dependent job J on a timeline t, comparing Option 1 and Option 2]

Page 9

Other Approaches

[Figure: Option 2, with jobs A, B, C and J on a timeline t]

Page 10

Other Approaches

D fails
D sends email
Developer of D still works here
Where is the code?

Page 11

Other Approaches

2am is a great hour for troubleshooting!

D: Data from C is missing…
C: The data of C is all there!

Page 12

Other Approaches

[Figure: jobs A, B, C, D, … on a timeline t]

X:37 seems like a good time… C never finished after X:30 anyway

Job J has been working for more than a week before the incident

Page 13

Other Approaches

Need to rerun processes B, C and D

• Without running A again?
• Without colliding with ongoing executions?
• Which hours failed?
• How to run all of them for the specific hours?

Page 14

Other Approaches

[Figure: jobs A and J on a timeline t, with X:00 marked]

“A will never take more than 15 minutes, so X:20 is more than enough”

A WILL eventually take longer

Page 15

River

• Execution Management
  – Full Execution History + Filtering and Searching
  – Monitoring and Actionable Alerting
  – Automatic Retries
  – Web UI
  – JobLogs

• Ease of Development
  – Declarative Data Processing Definitions
  – Decentralized
    • Shared Data, separate development

• Data Driven Dependencies
  – Why? Robustness, Reliability, Parallelism

Page 16

River

What? When?

Where? How?

Page 17

Execution Layer – the “What”

• Importing from MySQL to Hive
• Hive Queries
• JDBC Queries
• Transfer data from Hive into MySQL and to Cassandra
• Running External Commands: MapReduce, Java, bash, Legacy code, etc.

Every data processing task is called a Job

A Job can contain multiple Steps

Jobs use Parameters
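
The Job / Step / Parameters relationship described on this slide can be sketched roughly as below. This is an illustrative model only: the class names `Step` and `Job`, the `run` method and the example job are assumptions made for this sketch, not River's actual definitions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Params = Dict[str, str]


@dataclass
class Step:
    """One unit of work inside a job (hypothetical model, not River's API)."""
    name: str
    action: Callable[[Params], str]


@dataclass
class Job:
    """A data processing task made of ordered steps that share parameters."""
    name: str
    steps: List[Step] = field(default_factory=list)

    def run(self, params: Params) -> List[str]:
        # Run the steps in order, handing every step the same parameters.
        return [step.action(params) for step in self.steps]


# Hypothetical job with two steps; the lambdas only describe the work.
import_step = Step("import", lambda p: f"import MySQL rows for {p['date']} into Hive")
rollup_step = Step("rollup", lambda p: f"run Hive rollup for {p['date']}")
job = Job("dailyRollup", [import_step, rollup_step])
results = job.run({"date": "2012-10-10"})
```

Each step only describes what to do for the given parameters; the same job definition can be run for any date.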

Page 18

Scheduling Layer – the “When”

Events that describe Data Availability

Each job registers to an event, which will trigger its execution

Each job emits an event at job completion

Events that are time dependent
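
A minimal sketch of this register-and-emit idea follows. The `TriggerManager` class below, its method names and the event names are hypothetical and exist only for this illustration; the sketch also assumes the dependency chain has no cycles.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List


class TriggerManager:
    """Toy event dispatcher: jobs register to an event and emit one when done."""

    def __init__(self) -> None:
        self.subscribers: Dict[str, List[str]] = defaultdict(list)
        self.emits: Dict[str, str] = {}
        self.executed: List[str] = []

    def register(self, job: str, listens_to: str, emits: str) -> None:
        # The job runs when `listens_to` fires and emits `emits` on completion.
        self.subscribers[listens_to].append(job)
        self.emits[job] = emits

    def fire(self, event: str) -> None:
        queue: Deque[str] = deque([event])
        while queue:
            current = queue.popleft()
            for job in self.subscribers[current]:
                self.executed.append(job)      # stand-in for actually running the job
                queue.append(self.emits[job])  # completion event triggers dependents


manager = TriggerManager()
manager.register("A", listens_to="rawDataArrived", emits="A.done")
manager.register("B", listens_to="A.done", emits="B.done")
manager.register("C", listens_to="B.done", emits="C.done")
manager.fire("rawDataArrived")
executed = manager.executed
```

Job B starts only once A has actually finished, however long A takes. A time-dependent event would simply be another event name, fired by a clock instead of by a job.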

Page 19

The “How” and the “Where”

Both handled by the infrastructure

• Integration with other systems
• Connecting to Hive/Hadoop/Cassandra
• Connecting to JDBC Databases
• Retries, throttling, timeouts

• Monitoring and Alerts
  – Centralized Management, email notifications and dashboards

• Location of Execution
  – Actual location is hidden from the developer/ops
  – Logical names for all data sources: “readOnlyDataWarehouse”, “productionCassandra”
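
One way to picture logical data source names is a registry that the infrastructure resolves at run time, so job definitions never contain physical locations. The sketch below is an assumption for illustration: the `DATA_SOURCES` mapping, the `resolve` function and both URLs are made up, and only the two logical names come from the slide.

```python
from typing import Dict

# Hypothetical registry owned by the infrastructure; the URLs are placeholders.
DATA_SOURCES: Dict[str, str] = {
    "readOnlyDataWarehouse": "jdbc:mysql://dwh-replica.example.internal:3306/dwh",
    "productionCassandra": "cassandra://cassandra.example.internal:9042",
}


def resolve(logical_name: str) -> str:
    """Translate a logical data source name into a physical connection URL."""
    try:
        return DATA_SOURCES[logical_name]
    except KeyError:
        raise KeyError(f"unknown data source: {logical_name}") from None


warehouse_url = resolve("readOnlyDataWarehouse")
```

Moving a database then means changing one registry entry rather than every job that reads from it.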

Page 20

River UI

Restart Job
Fail Job and Dependents
Download JobLog

Page 21

Monitoring Dashboard

Page 22

Monitoring Dashboard

Page 23

Steps

Steps only contain what needs to be done

Copy Data From JDBC to Hive

sourceDB = “productionDatabase”
sourceTable = “myRawData”
targetCluster = “onlineHadoopCluster”
targetHiveTable = “rawDataTable”
Filter = “date=#handledDate#”
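
The `#handledDate#` placeholder in the Filter suggests that step definitions are filled in from job parameters at run time. A rough sketch of such substitution follows; the `substitute` function and the `#name#` matching rule are assumptions for this example, not River's documented behaviour.

```python
import re
from typing import Dict

# Assumed placeholder syntax: a parameter name wrapped in '#' characters.
PLACEHOLDER = re.compile(r"#(\w+)#")


def substitute(template: str, params: Dict[str, str]) -> str:
    """Replace every #name# placeholder with the matching parameter value."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in params:
            raise KeyError(f"missing parameter: {name}")
        return params[name]

    return PLACEHOLDER.sub(replace, template)


# The step definition from the slide, written as a plain dictionary.
step = {
    "sourceDB": "productionDatabase",
    "sourceTable": "myRawData",
    "targetCluster": "onlineHadoopCluster",
    "targetHiveTable": "rawDataTable",
    "Filter": "date=#handledDate#",
}
resolved = {key: substitute(value, {"handledDate": "2012-10-10"})
            for key, value in step.items()}
```

Failing loudly on a missing parameter is a design choice of this sketch: a filter that silently keeps its placeholder would copy the wrong data.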

Page 24

A bit more about triggers

Triggers have parameters as well

Date=2012-10-10,hour=15
Date=2012-10-10,hour=19

Parameters Propagate through jobs and to other triggers
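
Parameter propagation can be pictured as each job copying the parameters of the trigger that started it onto the trigger it emits. The sketch below is illustrative only: `run_job` and the extra `emittedBy` field are hypothetical names introduced here.

```python
from typing import Dict

Params = Dict[str, str]


def run_job(job_name: str, trigger_params: Params) -> Params:
    """Pretend to run a job and return the parameters of the trigger it emits."""
    # The job receives the incoming trigger's parameters unchanged.
    job_params = dict(trigger_params)
    # The trigger emitted at completion carries the same parameters onward.
    emitted = dict(job_params)
    emitted["emittedBy"] = job_name  # hypothetical bookkeeping field
    return emitted


t1 = {"Date": "2012-10-10", "hour": "15"}
t2 = run_job("Job1", t1)
t3 = run_job("Job3", t2)
```

Every job down the chain therefore knows which date and hour it is processing without any job hardcoding them.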

Page 25

Developer’s Point-of-View

Automatic Retries

Parameters Pass-through
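
From the developer's side, automatic retries mean the job code contains no retry loop, because the infrastructure wraps the execution. A minimal sketch of such a wrapper follows; `run_with_retries` and `flaky_step` are hypothetical, and a real system would presumably wait between attempts.

```python
from typing import Callable, List


def run_with_retries(action: Callable[[], str], max_attempts: int = 3) -> str:
    """Call `action` until it succeeds or `max_attempts` attempts have failed."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception:
            # Re-raise only after the last allowed attempt has failed.
            if attempt == max_attempts:
                raise
    raise AssertionError("unreachable")


attempts: List[int] = []


def flaky_step() -> str:
    """Hypothetical step that fails twice (e.g. an environment issue), then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("cluster temporarily unavailable")
    return "done"


outcome = run_with_retries(flaky_step, max_attempts=3)
```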

Page 26

Topology

[Figure: River topology diagram. Components: External Systems, Trigger Manager, Trigger Queue, Execution Queue, Execution Manager, Spring Batch, Spring Batch DB, and the Hive/Hadoop, OS, Cassandra and JDBC interfaces]

Page 27

Dependencies for detailed example

Page 28

Success Example

[Figure: the River topology diagram annotated with a successful run. Labels: T1 (Date=2012-01-02, hour=03); Job1, Job2 (Date=2012-01-02, hour=03); T2 (from Job1), T2 (from Job2); Job3; T3 (Date=2012-01-02, hour=03)]

Page 29

Failure Example

[Figure: the River topology diagram annotated with a failed run. Labels: Job2 (Date=2012-01-02, hour=03); T3 (Date=2012-01-02, hour=03); Job3; UI]

Page 30

Notable Features

• Parameter Enrichment
  – Example: #beginningOfMonth

• Precondition Expressions
  – Example: isLastDayOfMonth(#handleDate)

• Data Comparison Capabilities
  – Data Validations
  – Supports Tolerance
    • Absolute and Percentage margins

• Command Line and Java Clients
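
The first three features can be approximated in a few lines. The functions below are sketches written for this summary, under the assumption that the handled date is a calendar date; only the names `beginningOfMonth` and `isLastDayOfMonth` come from the slide, and the tolerance semantics (either margin is enough to pass) are an assumption.

```python
import calendar
from datetime import date


def beginning_of_month(handled: date) -> date:
    """Parameter enrichment: derive the first day of the handled date's month."""
    return handled.replace(day=1)


def is_last_day_of_month(handled: date) -> bool:
    """Precondition: true only on the final day of the month."""
    return handled.day == calendar.monthrange(handled.year, handled.month)[1]


def within_tolerance(expected: float, actual: float,
                     absolute: float = 0.0, percentage: float = 0.0) -> bool:
    """Data comparison: accept a difference within an absolute or percentage margin."""
    diff = abs(expected - actual)
    return diff <= absolute or diff <= abs(expected) * percentage / 100.0


handled_date = date(2012, 10, 10)
month_start = beginning_of_month(handled_date)
run_monthly_job = is_last_day_of_month(handled_date)
counts_match = within_tolerance(expected=1000, actual=1004, absolute=5)
```

A precondition like this lets a monthly job register to the same daily trigger as everything else and simply skip the days that do not apply.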

Page 31

River at Outbrain

• 6 River Instances Running
• 5 Teams
• ~4100 Jobs running every day
• ~50 Different Job Types

• Job Failures due to environment issues have almost no overhead

• Automatic restarts of jobs when data arrives late

Page 32

Future Plans

• Multiple Dependencies
• Offline Job Testing Capabilities
• Improved DSL for Job Definitions
• Support for Master/Worker River machines
• Job Priorities
• Analysis Tools

Outbrain is working on Open Sourcing River

Illustration by Chris Whetzel

Page 33

Questions

Page 34

Thank You

[email protected]

@harelba on Twitter
Harel Ben Attia: http://www.linkedin.com/in/harelba