Cascading, or: “was it worth three days out of the office?”

Intro to Cascading



Introduction and overview of Cascading framework for my team.


Page 1: Intro to Cascading

Cascading, or: “was it worth three days out of the office?”

Page 2: Intro to Cascading

Agenda

What is Cascading?

Building cascades and flows

How does this fit our needs?

Advantages/disadvantages

Q&A

Page 3: Intro to Cascading

What is Cascading anyway?

Page 4: Intro to Cascading

Cascading 101

JVM framework and SDK for creating abstracted data flows

Translates data flows into actual Hadoop/RDBMS/local jobs

Page 5: Intro to Cascading

Huh? Okay, let’s back up a bit.

Page 6: Intro to Cascading

Data flows

Think of an ETL: Extract-Transform-Load

In simple terms, take data from a source, change it somehow, and stick the result into something (a “sink”)

[Diagram: data source → extract → transformation(s) → load → data sink]
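A minimal plain-Java sketch of that same shape (not Cascading; the names and the in-memory source/sink are hypothetical, for intuition only):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

class EtlSketch {
    // "Extract": read records from a source (here, just an in-memory list).
    static List<String> extract(List<String> source) {
        return new ArrayList<>(source);
    }

    // "Transform": change each record somehow (here, uppercase it).
    static List<String> transform(List<String> records) {
        return records.stream()
                      .map(String::toUpperCase)
                      .collect(Collectors.toList());
    }

    // "Load": write the results into a sink (here, another list).
    static void load(List<String> records, List<String> sink) {
        sink.addAll(records);
    }
}
```

Every real flow swaps the lists for games, Hive, Couchbase, HDFS, and so on, but the extract → transform → load shape stays the same.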

Page 7: Intro to Cascading

Data flow implementation

Pretty much everything we do is some flavor of this

Sources: Games, Hadoop, Hive/MySQL, Couchbase, web service

Transformations: Aggregations, group-bys, combined fields, filtering, etc.

Sinks: Hadoop, Hive/MySQL, Couchbase

Page 8: Intro to Cascading

Cascading 101 (Part Deux)

JVM data flow framework

Models data flows as abstractions:

Separates details of where and how we get data from what we do with it

Implements transform operations as SQL or MapReduce or whatever

Page 9: Intro to Cascading

In other words…

An ETL framework.

A Pentaho we can program.

Page 10: Intro to Cascading

Building cascades and flows

Page 11: Intro to Cascading

Cascading terminology

Flow: A path for data with some number of inputs, some operations, and some outputs

Cascade: A series of connected flows

Page 12: Intro to Cascading

More terminology

Operation: A function applied to data, yielding new data

Pipe: Moves data from someplace to some other place

Tap: Feeds data from outside the flow into it and writes data from inside the flow out of it

Page 13: Intro to Cascading

Simplest possible flow

// create the source tap
Tap inTap = new Hfs(new TextDelimited(true, "\t"), inPath);

// create the sink tap
Tap outTap = new Hfs(new TextDelimited(true, "\t"), outPath);

// specify a pipe to connect the taps
Pipe copyPipe = new Pipe("copy");

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef()
    .addSource(copyPipe, inTap)
    .addTailSink(copyPipe, outTap);

// run the flow
flowConnector.connect(flowDef).complete();

Page 14: Intro to Cascading

We already have that.


It’s called ‘cp’.

Page 15: Intro to Cascading

Actually…

Runs entirely in the cluster

Works fine on megabytes, gigabytes, terabytes or petabytes; i.e., IT SCALES

Completely testable outside of the cluster

Who gets shell access to a namenode to run the bash or python equivalent?

Page 16: Intro to Cascading

Reliability is ESSENTIAL


if we, and our system, are to be taken srsly.

Reliability is a feature, not a goal.

Page 17: Intro to Cascading

Let’s do something more interesting.

Page 18: Intro to Cascading

Real world use case: Word counting

Read a simple file format

Count the occurrence of every word in the file

Output a list of all words and their counts

Page 19: Intro to Cascading

doc_id	text
doc01	A rain shadow is a dry area on the lee back side
doc02	This sinking, dry air produces a rain shadow, or
doc03	A rain shadow is an area of dry land that lies on
doc04	This is known as the rain shadow effect and is the
doc05	Two Women. Secrets. A Broken Land. [DVD Australia]

Newline-delimited entries

ID and text fields, separated by tabs

Plan: split each line into words, then count occurrences across all lines

Page 20: Intro to Cascading

Flow I/O

Tap docTap = new Hfs(new TextDelimited(true, "\t"), docPath);
Tap wcTap = new Hfs(new TextDelimited(true, "\t"), wcPath);

No surprises here:

docTap reads a file from HDFS

wcTap will write the results to a different HDFS file

Page 21: Intro to Cascading

File parsing

Fields token = new Fields("token");
Fields text = new Fields("text");
RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");
Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS);

Fields are names for the tuple elements

RegexSplitGenerator splits the input text on the regex (its delimiter characters) and emits each resulting piece as a tuple in the “token” field

docPipe applies the splitter to the “text” field of each input tuple and outputs the resulting tokens
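For intuition only, here is the splitter’s regex behavior reproduced in plain Java (this is not the Cascading API, just the same delimiter pattern applied with String.split):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class TokenSplit {
    // The same delimiter regex the RegexSplitGenerator is given on the slide:
    // split on spaces, brackets, parentheses, commas, and periods.
    static final String DELIM = "[ \\[\\]\\(\\),.]";

    // Split a line of text into non-empty tokens.
    static List<String> tokens(String text) {
        return Arrays.stream(text.split(DELIM))
                     .filter(t -> !t.isEmpty())
                     .collect(Collectors.toList());
    }
}
```

Running this over the sample line “This sinking, dry air…” shows why the empty-string filter matters: adjacent delimiters (comma then space) would otherwise produce empty tokens.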

Page 22: Intro to Cascading

Count the tokens (words)

Pipe wcPipe = new Pipe("wc", docPipe);
wcPipe = new GroupBy(wcPipe, token);
wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);

wcPipe connects to docPipe, using it for input

Add a GroupBy onto wcPipe, grouping on the token field (the actual words)

For every group (each distinct word), count its occurrences and output the result
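The GroupBy + Every(Count) pair amounts to “bucket identical tokens, then count each bucket.” A plain-Java sketch of that step (again, not the Cascading API):

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class WordCountSketch {
    // Group identical tokens together and count each group, which is
    // what the GroupBy + Every(Count) pipes do inside the flow.
    static Map<String, Long> count(List<String> tokens) {
        return tokens.stream()
                     .collect(Collectors.groupingBy(Function.identity(),
                                                    Collectors.counting()));
    }
}
```

The difference, of course, is that Cascading compiles its version of this into MapReduce jobs that run across the cluster instead of in one JVM’s heap.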

Page 23: Intro to Cascading

Create and run the flow

FlowDef flowDef = FlowDef.flowDef()
    .setName("wc")
    .addSource(docPipe, docTap)
    .addTailSink(wcPipe, wcTap);
Flow wcFlow = flowConnector.connect(flowDef);
wcFlow.complete();

Define a new flow with name “wc”

Feed the docTap (the original text file) into the docPipe

Send the output of the wcPipe into the wcTap (the word-count output)

Connect to the flowConnector (Hadoop) and go!

Page 24: Intro to Cascading

Cascading flow

100% Java

Databases and processing are behind class abstractions

Automatically scalable

Easily testable

Page 25: Intro to Cascading

How could this help us?

Page 26: Intro to Cascading

Testing

Create flows entirely in code on a local machine

Write tests for controlled sample data sets

Run tests as regular old Java without needing access to actual Hadoopery or databases

Local machine and CI testing are easy!

Page 27: Intro to Cascading

Reusability

Pipe assemblies are designed for reuse

Once created and tested, use them in other flows

Write logic to do something only once

This is *essential* for data integrity as well as good programming

Page 28: Intro to Cascading

Common code base

The infrastructure team writes its MR-style jobs in Cascading; the warehouse team writes its data manipulations in Cascading

Everybody uses the same terms and same tech

Teams understand each other’s code

Can be modified by anyone, not just tool experts

Page 29: Intro to Cascading

Simpler stack

Cascading creates DAG of dependent jobs for us

Removes most of the need for Oozie (ew)

Keeps track of where a flow fails and can rerun from that point on failure
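For intuition, a “DAG of dependent jobs” just means each job runs only after all of its prerequisites have run. A minimal sketch of that ordering in plain Java (job names hypothetical; Cascading does this planning for us):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DagSketch {
    // Produce a run order in which every job appears after all of the
    // jobs it depends on. deps maps each job to its prerequisites.
    static List<String> runOrder(Map<String, List<String>> deps) {
        List<String> order = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        for (String job : deps.keySet()) {
            visit(job, deps, visited, order);
        }
        return order;
    }

    // Depth-first visit: recurse into prerequisites before appending
    // the job itself, so prerequisites always come first in the order.
    static void visit(String job, Map<String, List<String>> deps,
                      Set<String> visited, List<String> order) {
        if (visited.contains(job)) return;
        visited.add(job);
        for (String dep : deps.getOrDefault(job, List.of())) {
            visit(dep, deps, visited, order);
        }
        order.add(job);
    }
}
```

This sketch omits what makes the real thing valuable: Cascading also remembers which jobs already succeeded, which is what lets a failed cascade resume mid-graph instead of starting over.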

Page 30: Intro to Cascading

Disadvantages

“silver bullets are not a thing”

Page 31: Intro to Cascading

Some bad news

JVM, which means Java (or Scala (or CLOJURE :) :)

Argument: Java is the platform for big data, so we can’t avoid embracing it.

PyCascading uses Jython, which kinda sucks

Page 32: Intro to Cascading

Some other bad news

Doesn’t have a job scheduler

Can figure out the dependency graph for jobs, but has nothing to run them on a regular interval

We still need Jenkins or Quartz

Concurrent is building proprietary products (read: $) for this kind of thing, but they’re months away

Page 33: Intro to Cascading

Other bad news

No real built-in monitoring

Easy to have a flow report what it has done; hard to watch it in progress

We’d have to roll our own (but we’d have to do that anyway, so whatevs)

Page 34: Intro to Cascading

Recommendations

“Enough already!”

Page 35: Intro to Cascading

Yes, we should try it.

It’s not everything we need, but it’s a lot

Possibly replace MapReduce and Sqoop

Proven tech; this isn’t bleeding edge work

We need an ETL framework and we don’t have time to write one from scratch.

Page 36: Intro to Cascading

Let’s prototype a couple of jobs and see what people other than me think.

Page 37: Intro to Cascading

Questions?

Satisfactory answers not guaranteed.