Recent Development in Oozie
Purshotam Shah, Satish Saley

August 2016 HUG: Recent development in Apache Oozie

Page 1: August 2016 HUG: Recent development in Apache Oozie

Recent Development in Oozie

Purshotam Shah, Satish Saley

Page 2: August 2016 HUG: Recent development in Apache Oozie

Agenda

1. Oozie at Yahoo
2. Data Pipelines and Complex dependencies
3. Oozie unit testing
4. Spark Action
5. Future Work

Page 3: August 2016 HUG: Recent development in Apache Oozie

Why Oozie?

• Out-of-box support for multiple job types
  • Java, shell, distcp
  • MapReduce (Pipes, streaming)
  • Pig, Hive, Spark
• Highly scalable
• High availability
  • Hot-Hot with rolling upgrades
  • Load balanced
• Hue integration

[Diagram: Oozie alongside HBase, Pig, Hive, Spark, YARN, HDFS, Hue, HCatalog]

Page 4: August 2016 HUG: Recent development in Apache Oozie

Scale at Yahoo

• Deployed on all clusters (production, non-production)
  • One instance per cluster
• 75 products / 2000+ projects
  • 255 monthly users
• 90,00 workflow jobs daily (June 2016, one busy cluster)
  • Between 1 and 8 actions per workflow; avg. 4 actions/workflow
  • Extreme use case: 100-200 workflow jobs submitted per minute
• 2,277 coordinator jobs daily (June 2016, one busy cluster)
  • Frequency: 5, 10, 15 mins, hourly, daily, weekly, monthly (25%: < 15 min)
  • 99% of workflow jobs are kicked off by a coordinator
• 97 bundle jobs daily (June 2016, one busy cluster)

Page 6: August 2016 HUG: Recent development in Apache Oozie

Data Pipelines

• Advertisement: Ad Exchange, Ad Latency, Search Advertising
• Content: Content Management, Content Optimization, Content Personalization, Flickr Video
• Targeting: Audience Targeting, Behavioral Targeting, Partner Targeting, Retargeting, Web Targeting

Page 7: August 2016 HUG: Recent development in Apache Oozie

Data Pipelines

Anti Spam, Content, Retargeting

Research, Dashboards & Reports, Forecasting

Email Data Intelligence Data Management

Audience Pipeline

Page 8: August 2016 HUG: Recent development in Apache Oozie

Use Case - Data pipeline

Page 9: August 2016 HUG: Recent development in Apache Oozie

Oozie Coordinator

<coordinator-app name="MY_APP" frequency="1440" start="${start}" end="${end}" timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
  <datasets>
    <dataset name="input1" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://localhost:9000/tmp/revenue_feed-1/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
    <dataset name="input2" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://localhost:9000/tmp/revenue_feed-2/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="coordInput1" dataset="input1">
      <instance>${coord:current(0)}</instance>
    </data-in>
    <data-in name="coordInput2" dataset="input2">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://localhost:9000/tmp/workflows</app-path>
    </workflow>
  </action>
</coordinator-app>
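To make the dataset definition concrete, here is a minimal Python sketch of how a uri-template could expand for a given instance. It is a toy model, not Oozie code: the function name `resolve_instance` is hypothetical, and it assumes the nominal time already falls on a dataset instance boundary, so `${coord:current(n)}` is approximated as "nominal time plus n times the dataset frequency".

```python
from datetime import datetime, timedelta


def resolve_instance(uri_template, nominal_time, frequency_minutes, offset=0):
    """Toy model of ${coord:current(offset)}: shift the nominal time by
    offset * frequency, then fill in the date tokens of the template."""
    instance_time = nominal_time + timedelta(minutes=offset * frequency_minutes)
    replacements = {
        "${YEAR}": f"{instance_time.year:04d}",
        "${MONTH}": f"{instance_time.month:02d}",
        "${DAY}": f"{instance_time.day:02d}",
        "${HOUR}": f"{instance_time.hour:02d}",
    }
    uri = uri_template
    for token, value in replacements.items():
        uri = uri.replace(token, value)
    return uri


template = "hdfs://localhost:9000/tmp/revenue_feed-1/${YEAR}/${MONTH}/${DAY}/${HOUR}"
nominal = datetime(2009, 1, 2, 0, 0)

print(resolve_instance(template, nominal, 60))
# hdfs://localhost:9000/tmp/revenue_feed-1/2009/01/02/00
print(resolve_instance(template, nominal, 60, offset=-1))
# hdfs://localhost:9000/tmp/revenue_feed-1/2009/01/01/23
```

With the coordinator above, the action waits until the resolved path exists for both input1 and input2.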

Page 10: August 2016 HUG: Recent development in Apache Oozie

Current limitations of the Oozie coordinator

• All datasets are required
• All instances are mandatory
• We can't combine datasets from multiple providers
• There is no way to assign priority among datasets

Page 11: August 2016 HUG: Recent development in Apache Oozie

Complex dependencies

OOZIE-1976 : Specifying coordinator input datasets in more logical ways

Page 12: August 2016 HUG: Recent development in Apache Oozie

Oozie Coordinator with input logic

<coordinator-app name="MY_APP" frequency="1440" start="${start}" end="${end}" timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
  <datasets>
    <dataset name="input1" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://localhost:9000/tmp/revenue_feed-1/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
    <dataset name="input2" frequency="60" initial-instance="2009-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://localhost:9000/tmp/revenue_feed-2/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="coordInput1" dataset="input1">
      <instance>${coord:current(0)}</instance>
    </data-in>
    <data-in name="coordInput2" dataset="input2">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <input-logic>
    <or name="input1ORinput2">
      <data-in dataset="input1"/>
      <data-in dataset="input2"/>
    </or>
  </input-logic>
…...............

Page 13: August 2016 HUG: Recent development in Apache Oozie

BCP Support

Pull data from A or B. Specify the dataset as AorB. The action starts running as soon as either dataset A or B is available.

<input-logic>
  <or name="AorB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </or>
</input-logic>

Page 14: August 2016 HUG: Recent development in Apache Oozie

Minimum availability processing

Sometimes we want to process even if only partial data is available.

<input-logic>
  <data-in dataset="A" min="4"/>
</input-logic>

Page 15: August 2016 HUG: Recent development in Apache Oozie

Optional feeds

Dataset B is optional. Oozie will start processing as soon as A is available, and it will include the data from A plus whatever is available from B.

<input-logic>
  <and name="optional">
    <data-in dataset="A"/>
    <data-in dataset="B" min="0"/>
  </and>
</input-logic>
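The interplay of `and` and `min` can be illustrated with a small Python sketch. This is a toy model written for this transcript, not Oozie's implementation: the function names and the path strings are hypothetical, and availability is reduced to membership in a set.

```python
def data_in_ready(expected, available, minimum=None):
    """Return (ready, found) for one data-in.

    expected  -- instance paths the data-in resolves to
    available -- set of paths that currently exist
    minimum   -- models the `min` attribute; None means every instance is required
    """
    found = [path for path in expected if path in available]
    required = len(expected) if minimum is None else minimum
    return len(found) >= required, found


def resolve_and(children, available):
    """children is a list of (expected, minimum) pairs.
    The `and` is ready only when every child is ready."""
    resolved = []
    for expected, minimum in children:
        ready, found = data_in_ready(expected, available, minimum)
        if not ready:
            return None
        resolved.extend(found)
    return resolved


feed_a = (["A/00", "A/01"], None)  # every instance of A is required
feed_b = (["B/00", "B/01"], 0)     # min="0": B is optional

print(resolve_and([feed_a, feed_b], {"A/00", "A/01", "B/00"}))
# ['A/00', 'A/01', 'B/00']
print(resolve_and([feed_a, feed_b], {"A/00"}))
# None
```

In this model, a `min` of 4 on a dataset with more expected instances would make the input ready once any four of them exist.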

Page 16: August 2016 HUG: Recent development in Apache Oozie

Priority Among Dataset Instances

A takes precedence over B, and B takes precedence over C.

<input-logic>
  <or name="AorBorC">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
    <data-in dataset="C"/>
  </or>
</input-logic>
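One way to picture this precedence is as an ordered scan over the children of the `or`. The Python sketch below is only an illustration under that assumption: `resolve_or` and the path strings are hypothetical names, and the real evaluation inside Oozie may differ in detail.

```python
def resolve_or(children, available):
    """children is an ordered list of (name, expected_paths).
    Returns the first child, in declaration order, whose paths all exist."""
    for name, expected in children:
        if all(path in available for path in expected):
            return name, list(expected)
    return None  # nothing is ready yet, keep waiting


children = [("A", ["A/00"]), ("B", ["B/00"]), ("C", ["C/00"])]

print(resolve_or(children, {"B/00", "C/00"}))
# ('B', ['B/00'])
print(resolve_or(children, {"A/00", "B/00", "C/00"}))
# ('A', ['A/00'])
```

Declaration order is the only priority signal in this model, which matches the slide's reading of A before B before C.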

Page 17: August 2016 HUG: Recent development in Apache Oozie

Wait for primary

Sometimes we want to give preference to the primary data source and switch to the secondary only after waiting for a specific amount of time.

<input-logic>
  <or name="AorB">
    <data-in dataset="A" wait="120"/>
    <data-in dataset="B"/>
  </or>
</input-logic>

Page 18: August 2016 HUG: Recent development in Apache Oozie

Combining Dataset From Multiple Providers

The combine function will first check instances from A, then go to B for whatever is missing in A.

<data-in name="A" dataset="dataset_A">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>

<data-in name="B" dataset="dataset_B">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>

<input-logic>
  <combine name="AB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </combine>
</input-logic>
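The combine behaviour described here can be sketched in a few lines of Python. This is a simplified illustration, not Oozie's code: the `combine` function and the path naming are hypothetical, and both providers are assumed to cover the same instance range, as in the `current(-5)` to `current(-1)` example.

```python
def combine(instances_a, instances_b, available):
    """For each instance slot, prefer A's copy and fall back to B's.
    Returns None while any slot is missing from both providers."""
    resolved = []
    for path_a, path_b in zip(instances_a, instances_b):
        if path_a in available:
            resolved.append(path_a)
        elif path_b in available:
            resolved.append(path_b)
        else:
            return None
    return resolved


offsets = range(-5, 0)  # models current(-5) .. current(-1)
instances_a = [f"dataset_A/{n}" for n in offsets]
instances_b = [f"dataset_B/{n}" for n in offsets]
available = {
    "dataset_A/-5", "dataset_A/-4",
    "dataset_B/-3",
    "dataset_A/-2", "dataset_B/-2",
    "dataset_B/-1",
}

print(combine(instances_a, instances_b, available))
# ['dataset_A/-5', 'dataset_A/-4', 'dataset_B/-3', 'dataset_A/-2', 'dataset_B/-1']
```

Note that slot -2 exists in both providers and the sketch picks A's copy, consistent with checking A first.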

Page 20: August 2016 HUG: Recent development in Apache Oozie

MiniOozie

• MiniOozie: HCat, Pig, Hive, Spark
• MiniOozieClient: communicates with the Oozie server

Page 21: August 2016 HUG: Recent development in Apache Oozie

Oozie unit YAML

name: TestCoordinator
job:
  properties:
    raw_logs_path: "/tmp/test/input"
    aggregated_logs_path: "/user/test/output"
    oozie.coord.application.path: src/test/resources/coordinator-test.xml
hdfs:
  touchz:
    - /tmp/test/input/2010/02/01/09/_SUCCESS
    - /tmp/test/input/2010/02/01/10/_SUCCESS
  mkdir:
    - /user/test/output
validations:
  validate_job:
    sleep: 6000
    coordinator_actions:
      - coordinator_action: "@2"
        not_status: WAITING
        nominal_time: 2010-02-01T11:00Z

Page 23: August 2016 HUG: Recent development in Apache Oozie

Spark Action

Yahoo Confidential & Proprietary

• Oozie native support for Apache Spark jobs

• Introduced last year in Apache Oozie 4.2.0

Page 24: August 2016 HUG: Recent development in Apache Oozie

Example

<spark xmlns="uri:oozie:spark-action:0.2">
  <master>yarn</master>
  <mode>cluster</mode>
  <name>Spark-FileCopy</name>
  <class>org.apache.oozie.example.SparkFileCopy</class>
  <jar>${nameNode}/${examplesRoot}/apps/spark/lib/oozie-examples.jar</jar>
  <file>${nameNode}/${examplesRoot}/apps/spark/myfiles/somefile.txt</file>
  <archive>${nameNode}/${examplesRoot}/apps/spark/myfiles/someArchive.zip</archive>
  <spark-opts>--conf spark.yarn.historyServer.address=localhost:18080 --queue default</spark-opts>
  <arg>${nameNode}/${examplesRoot}/input-data/text/data.txt</arg>
  <arg>${nameNode}/${examplesRoot}/output-data/spark</arg>
</spark>

Page 25: August 2016 HUG: Recent development in Apache Oozie

PySpark Example

Oozie automatically sets up pyspark.zip and py4j-src.zip from the Sharelib.

<spark xmlns="uri:oozie:spark-action:0.2">
  <master>yarn</master>
  <mode>cluster</mode>
  <name>PySparkExample</name>
  <jar>${nameNode}/${examplesRoot}/apps/spark/lib/pi.py</jar>
  <spark-opts>--conf spark.yarn.historyServer.address=localhost:18080 --queue default</spark-opts>
</spark>

Page 26: August 2016 HUG: Recent development in Apache Oozie

Modes supported

• In local and yarn-client mode, the driver runs inside the Oozie launcher itself, so any property intended for the driver must carry the oozie.launcher prefix.

• For example, raise oozie.launcher.mapreduce.map.memory.mb and oozie.launcher.mapreduce.map.java.opts to increase driver memory.

Master      Mode
local[*]
yarn        client
yarn        cluster
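As an illustration of the prefixing rule, the Python sketch below generates a `<configuration>` block with the two launcher properties named above. The helper name `launcher_configuration` and the memory values are assumptions chosen for the example, not recommendations; the block would be placed inside the Spark action by hand.

```python
import xml.etree.ElementTree as ET

LAUNCHER_PREFIX = "oozie.launcher."


def launcher_configuration(driver_settings):
    """Build a <configuration> element whose property names carry the
    oozie.launcher. prefix, so they apply to the launcher that hosts the driver."""
    config = ET.Element("configuration")
    for name, value in driver_settings.items():
        prop = ET.SubElement(config, "property")
        ET.SubElement(prop, "name").text = LAUNCHER_PREFIX + name
        ET.SubElement(prop, "value").text = str(value)
    return config


config = launcher_configuration({
    "mapreduce.map.memory.mb": 4096,         # illustrative value
    "mapreduce.map.java.opts": "-Xmx3276m",  # illustrative value
})
print(ET.tostring(config, encoding="unicode"))
```

Keeping the unprefixed names in one dictionary makes it harder to forget the prefix on one of the two settings, which usually need to change together.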

Page 27: August 2016 HUG: Recent development in Apache Oozie

Recent enhancements

• Support for PySpark jobs

• Show Spark Job URLs in Oozie UI under Child Jobs Tab

• Automatically include spark-defaults.conf from Sharelib

• Support for <file> and <archive>

• Faster job launch time

• Simplify setting up of classpath

• Avoid re-uploading jars for localization by reusing hdfs paths in mapreduce.job.cache.files

• A couple of bug fixes

Page 29: August 2016 HUG: Recent development in Apache Oozie

Future Work

• Oozie unit testing framework
  • No unit tests now; jobs are tested directly by running them in staging
• Coordinator dependency management
  • Better reprocessing
• Aperiodic processing
  • Currently managed through workarounds