Cascading, or: “was it worth three days out of the office?”

Intro to Cascading



Introduction and overview of Cascading framework for my team.


Page 1: Intro to Cascading

Cascading, or: “was it worth three days out of the office?”

Page 2: Intro to Cascading

Agenda

What is Cascading?

Building cascades and flows

How does this fit our needs?

Advantages/disadvantages

Q&A

Page 3: Intro to Cascading

What is Cascading anyway?

Page 4: Intro to Cascading

Cascading 101

JVM framework and SDK for creating abstracted data flows

Translates data flows into actual Hadoop/RDBMS/local jobs

Page 5: Intro to Cascading

Huh? Okay, let’s back up a bit.

Page 6: Intro to Cascading

Data flows

Think of an ETL: Extract-Transform-Load

In simple terms, take data from a source, change it somehow, and stick the result into something (a “sink”)

[Diagram: data source → extract → transformation(s) → load → data sink]
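A minimal plain-Java sketch of that same shape (not Cascading; the names and the in-memory source/sink are hypothetical, for intuition only):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

class EtlSketch {
    // "Extract": read records from a source (here, just an in-memory list).
    static List<String> extract(List<String> source) {
        return new ArrayList<>(source);
    }

    // "Transform": change each record somehow (here, uppercase it).
    static List<String> transform(List<String> records) {
        return records.stream()
                      .map(String::toUpperCase)
                      .collect(Collectors.toList());
    }

    // "Load": write the results into a sink (here, another list).
    static void load(List<String> records, List<String> sink) {
        sink.addAll(records);
    }
}
```

Every real flow swaps the lists for games, Hive, Couchbase, HDFS, and so on, but the extract → transform → load shape stays the same.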

Page 7: Intro to Cascading

Data flow implementation

Pretty much everything we do is some flavor of this

Sources: Games, Hadoop, Hive/MySQL, Couchbase, web service

Transformations: Aggregations, group-bys, combined fields, filtering, etc.

Sinks: Hadoop, Hive/MySQL, Couchbase

Page 8: Intro to Cascading

Cascading 101 (Part Deux)

JVM data flow framework

Models data flows as abstractions:

Separates details of where and how we get data from what we do with it

Implements transform operations as SQL or MapReduce or whatever

Page 9: Intro to Cascading

In other words…

An ETL framework.

A Pentaho we can program.

Page 10: Intro to Cascading

Building cascades and flows

Page 11: Intro to Cascading

Cascading terminology

Flow: A path for data with some number of inputs, some operations, and some outputs

Cascade: A series of connected flows

Page 12: Intro to Cascading

More terminology

Operation: A function applied to data, yielding new data

Pipe: Moves data from someplace to some other place

Tap: Feeds data from outside the flow into it and writes data from inside the flow out of it

Page 13: Intro to Cascading

Simplest possible flow

// create the source tap
Tap inTap = new Hfs(new TextDelimited(true, "\t"), inPath);

// create the sink tap
Tap outTap = new Hfs(new TextDelimited(true, "\t"), outPath);

// specify a pipe to connect the taps
Pipe copyPipe = new Pipe("copy");

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef()
    .addSource(copyPipe, inTap)
    .addTailSink(copyPipe, outTap);

// run the flow
flowConnector.connect(flowDef).complete();

Page 14: Intro to Cascading

We already have that.


It’s called ‘cp’.

Page 15: Intro to Cascading

Actually…

Runs entirely in the cluster

Works fine on megabytes, gigabytes, terabytes or petabytes; i.e., IT SCALES

Completely testable outside of the cluster

Who gets shell access to a namenode to run the bash or python equivalent?

Page 16: Intro to Cascading

Reliability is ESSENTIAL


if we, and our system, are to be taken srsly.

Reliability is a feature, not a goal.

Page 17: Intro to Cascading

Let’s do something more interesting.

Page 18: Intro to Cascading

Real world use case: Word counting

Read a simple file format

Count the occurrence of every word in the file

Output a list of all words and their counts

Page 19: Intro to Cascading

doc_id	text
doc01	A rain shadow is a dry area on the lee back side
doc02	This sinking, dry air produces a rain shadow, or
doc03	A rain shadow is an area of dry land that lies on
doc04	This is known as the rain shadow effect and is the
doc05	Two Women. Secrets. A Broken Land. [DVD Australia]

Newline-delimited entries

ID and text fields, separated by tabs

Plan: split each line into words, then count occurrences across all lines

Page 20: Intro to Cascading

Flow I/O

Tap docTap = new Hfs(new TextDelimited(true, "\t"), docPath);
Tap wcTap = new Hfs(new TextDelimited(true, "\t"), wcPath);

No surprises here:

docTap reads a file from HDFS

wcTap will write the results to a different HDFS file

Page 21: Intro to Cascading

File parsing

Fields token = new Fields("token");
Fields text = new Fields("text");
RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");
Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS);

Fields are names for the tuple elements

RegexSplitGenerator splits the input text on the regex (its delimiter characters) and emits each resulting piece as a tuple in the “token” field

docPipe applies the splitter to the “text” field of each input tuple and outputs the resulting tokens
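For intuition only, here is the splitter’s regex behavior reproduced in plain Java (this is not the Cascading API, just the same delimiter pattern applied with String.split):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class TokenSplit {
    // The same delimiter regex the RegexSplitGenerator is given on the slide:
    // split on spaces, brackets, parentheses, commas, and periods.
    static final String DELIM = "[ \\[\\]\\(\\),.]";

    // Split a line of text into non-empty tokens.
    static List<String> tokens(String text) {
        return Arrays.stream(text.split(DELIM))
                     .filter(t -> !t.isEmpty())
                     .collect(Collectors.toList());
    }
}
```

Running this over the sample line “This sinking, dry air…” shows why the empty-string filter matters: adjacent delimiters (comma then space) would otherwise produce empty tokens.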

Page 22: Intro to Cascading

Count the tokens (words)

Pipe wcPipe = new Pipe("wc", docPipe);
wcPipe = new GroupBy(wcPipe, token);
wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);

wcPipe connects to docPipe, using it for input

Add a GroupBy onto wcPipe, grouping on the token field (the actual words)

For every group (each distinct word), count its occurrences and output the result
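The GroupBy + Every(Count) pair amounts to “bucket identical tokens, then count each bucket.” A plain-Java sketch of that step (again, not the Cascading API):

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class WordCountSketch {
    // Group identical tokens together and count each group, which is
    // what the GroupBy + Every(Count) pipes do inside the flow.
    static Map<String, Long> count(List<String> tokens) {
        return tokens.stream()
                     .collect(Collectors.groupingBy(Function.identity(),
                                                    Collectors.counting()));
    }
}
```

The difference, of course, is that Cascading compiles its version of this into MapReduce jobs that run across the cluster instead of in one JVM’s heap.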

Page 23: Intro to Cascading

Create and run the flow

FlowDef flowDef = FlowDef.flowDef()
    .setName("wc")
    .addSource(docPipe, docTap)
    .addTailSink(wcPipe, wcTap);
Flow wcFlow = flowConnector.connect(flowDef);
wcFlow.complete();

Define a new flow with name “wc”

Feed the docTap (the original text file) into the docPipe

Send the output of the wcPipe into the wcTap (the word-count output)

Connect to the flowConnector (Hadoop) and go!

Page 24: Intro to Cascading

Cascading flow

100% Java

Databases and processing are behind class abstractions

Automatically scalable

Easily testable

Page 25: Intro to Cascading

How could this help us?

Page 26: Intro to Cascading

Testing

Create flows entirely in code on a local machine

Write tests for controlled sample data sets

Run tests as regular old Java without needing access to actual Hadoopery or databases

Local machine and CI testing are easy!

Page 27: Intro to Cascading

Reusability

Pipe assemblies are designed for reuse

Once created and tested, use them in other flows

Write logic to do something only once

This is *essential* for data integrity as well as good programming

Page 28: Intro to Cascading

Common code base

The infrastructure team writes its MR-style jobs in Cascading; the warehouse team writes its data manipulations in Cascading

Everybody uses the same terms and same tech

Teams understand each other’s code

Can be modified by anyone, not just tool experts

Page 29: Intro to Cascading

Simpler stack

Cascading creates DAG of dependent jobs for us

Removes most of the need for Oozie (ew)

Keeps track of where a flow fails and can rerun from that point on failure
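For intuition, a “DAG of dependent jobs” just means each job runs only after all of its prerequisites have run. A minimal sketch of that ordering in plain Java (job names hypothetical; Cascading does this planning for us):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DagSketch {
    // Produce a run order in which every job appears after all of the
    // jobs it depends on. deps maps each job to its prerequisites.
    static List<String> runOrder(Map<String, List<String>> deps) {
        List<String> order = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        for (String job : deps.keySet()) {
            visit(job, deps, visited, order);
        }
        return order;
    }

    // Depth-first visit: recurse into prerequisites before appending
    // the job itself, so prerequisites always come first in the order.
    static void visit(String job, Map<String, List<String>> deps,
                      Set<String> visited, List<String> order) {
        if (visited.contains(job)) return;
        visited.add(job);
        for (String dep : deps.getOrDefault(job, List.of())) {
            visit(dep, deps, visited, order);
        }
        order.add(job);
    }
}
```

This sketch omits what makes the real thing valuable: Cascading also remembers which jobs already succeeded, which is what lets a failed cascade resume mid-graph instead of starting over.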

Page 30: Intro to Cascading

Disadvantages

“silver bullets are not a thing”

Page 31: Intro to Cascading

Some bad news

JVM, which means Java (or Scala (or CLOJURE :) :)

Argument: Java is the platform for big data, so we can’t avoid embracing it.

PyCascading uses Jython, which kinda sucks

Page 32: Intro to Cascading

Some other bad news

Doesn’t have a job scheduler

Can figure out the dependency graph for jobs, but has nothing to run them on a regular interval

We still need Jenkins or Quartz

Concurrent is building proprietary products (read: $) for this kind of thing, but they’re months away

Page 33: Intro to Cascading

Other bad news

No real built-in monitoring

Easy to have a flow report what it has done; hard to watch it in progress

We’d have to roll our own (but we’d have to do that anyway, so whatevs)

Page 34: Intro to Cascading

Recommendations

“Enough already!”

Page 35: Intro to Cascading

Yes, we should try it.

It’s not everything we need, but it’s a lot

Possibly replace MapReduce and Sqoop

Proven tech; this isn’t bleeding edge work

We need an ETL framework and we don’t have time to write one from scratch.

Page 36: Intro to Cascading

Let’s prototype a couple of jobs and see what people other than me think.

Page 37: Intro to Cascading

Questions?

Satisfactory answers not guaranteed.