
Cascading for the Impatient


Page 1: Cascading for the Impatient

Cascading for the Impatient
Paco Nathan
Concurrent, Inc.
[email protected]
@pacoid

[Figure: flow diagram with nodes Document Collection, Tokenize, Regex token, Scrub token, Stop Word List, HashJoin Left (RHS), GroupBy token, Count, Word Count; M and R mark the map and reduce phases]

Copyright @2012, Concurrent, Inc.

Page 2: Cascading for the Impatient

Unstructured Data meets Enterprise Scale

why?

Page 3: Cascading for the Impatient

how?

Cascading.org/

[Figure: flow diagram with nodes Document Collection, Tokenize, Regex token, Scrub token, Stop Word List, HashJoin Left (RHS), GroupBy token, Count, Word Count; M and R mark the map and reduce phases]

Page 4: Cascading for the Impatient

who?

• Business Stakeholder POV: business process management for workflow orchestration (think BPM/BPEL)

• Systems Integrator POV: system integration of heterogeneous data sources and compute platforms

• Data Scientist POV: a directed acyclic graph (DAG) on which we can apply Amdahl's Law

• Data Architect POV: a physical plan for large-scale data flow management

• Software Architect POV: a pattern language, similar to plumbing or circuit design

• App Developer POV: API bindings for Scala, Clojure, Python, Ruby, Java

• Systems Engineer POV: a JAR file, has passed CI, available in a Maven repo
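The Data Scientist point of view mentions Amdahl's Law, which bounds the speedup of a workload when only part of it can run in parallel. A minimal sketch of the standard formula follows; the class name `AmdahlSketch` and the sample fractions are hypothetical and not taken from the deck.

```java
final class AmdahlSketch {
    // Amdahl's Law: speedup = 1 / ((1 - p) + p / n),
    // where p is the parallelizable fraction and n is the number of workers.
    static double speedup(double parallelFraction, int workers) {
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / workers);
    }

    public static void main(String[] args) {
        // Illustrative values only: 90% of the work parallelizes.
        for (int workers : new int[] {1, 10, 100, 1000}) {
            System.out.println(workers + " workers: " + speedup(0.9, workers));
        }
    }
}
```

With a fixed serial fraction, the speedup flattens out as workers are added, which is one reason to look at the shape of the DAG rather than only at cluster size.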

Page 5: Cascading for the Impatient

where?

[Figure: layered stack with the levels business process; API language; logical plan / optimize; physical plan; compute framework; monitors, notification. The word-count flow diagram is drawn alongside the stack.]

Annotations on the stack:

• Domain expertise, business trade-offs, operating parameters, etc.

• Scala, Clojure, Python, Ruby, Java, etc. …envision whatever else runs in a JVM

• (raw human intellect, unless…)

• “assembler” code

• Apache Hadoop, in-memory local mode …envision GPUs, other frameworks, etc.

• Nagios, etc.

Page 6: Cascading for the Impatient

1: copy

[Figure: flow diagram: Source, Sink; M marks the single map phase]

import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class Main
  {
  public static void main( String[] args )
    {
    String inPath = args[ 0 ];
    String outPath = args[ 1 ];

    Properties props = new Properties();
    AppProps.setApplicationJarClass( props, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

    // create the source tap
    Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

    // create the sink tap
    Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

    // specify a pipe to connect the taps
    Pipe copyPipe = new Pipe( "copy" );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
     .addSource( copyPipe, inTap )
     .addTailSink( copyPipe, outTap );

    // run the flow
    flowConnector.connect( flowDef ).complete();
    }
  }

1 mapper, 0 reducers, 10 lines of code

Page 7: Cascading for the Impatient

ten lines of code for a file copy …seems like a lot.

wait!

Page 8: Cascading for the Impatient

same JAR, any scale…

Your Mom’s Laptop:
• MBs of data
• Hadoop standalone mode
• passes unit tests, or not
• runtime: seconds – minutes

Staging Cluster:
• GBs of data
• EMR + 4 Spot Instances
• CI shows red or green lights
• runtime: minutes – hours

Production Cluster:
• TBs of data
• EMR + 50 HPC Instances
• Ops monitors results
• runtime: hours – days

MegaCorp Enterprise IT:
• PBs of data
• 1000+ node cluster
• EVP calls you when app fails
• runtime: days+

Page 9: Cascading for the Impatient

2: word count

[Figure: flow diagram: Document Collection, Tokenize, GroupBy token, Count, Word Count; M and R mark the map and reduce phases]

1 mapper, 1 reducer, 18 lines of code
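The Tokenize, GroupBy token, and Count steps of this flow can be mimicked in plain Java to see what the job computes. This is only a sketch using the standard library, not the Cascading API; the class name `WordCountSketch` and the delimiter regex are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

final class WordCountSketch {
    // Assumed delimiters: space, square brackets, parentheses, comma, period.
    static final String DELIMS = "[ \\[\\]\\(\\),.]";

    static Map<String, Long> wordCount(String text) {
        return Arrays.stream(text.split(DELIMS))          // Tokenize
            .filter(token -> !token.isEmpty())             // skip empty tokens between delimiters
            .collect(Collectors.groupingBy(                // GroupBy token
                Function.identity(),
                TreeMap::new,
                Collectors.counting()));                   // Count
    }

    public static void main(String[] args) {
        String text = "A rain shadow is a dry area on the lee back side of a mountainous area.";
        System.out.println(wordCount(text));
    }
}
```

Note that "A" and "a" are counted separately here, since no case normalization happens at this stage.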

Page 10: Cascading for the Impatient

3: wc + scrub

[Figure: flow diagram: Document Collection, Tokenize, Scrub token, GroupBy token, Count, Word Count; M and R mark the map and reduce phases]

1 mapper, 1 reducer, 22+10 lines of code
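The slide does not spell out what the Scrub step does. Assuming it trims whitespace and lowercases each token (which would be consistent with the lowercase tokens in the results shown later in the deck), a plain-Java sketch might look like this; `ScrubSketch` is a hypothetical name and this is not the Cascading API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

final class ScrubSketch {
    // Assumed scrubbing rule: trim, lowercase, and drop tokens that end up empty.
    static List<String> scrub(List<String> tokens) {
        return tokens.stream()
            .map(token -> token.trim().toLowerCase(Locale.ROOT))
            .filter(token -> !token.isEmpty())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(scrub(Arrays.asList(" Rain", "SHADOW ", "  ", "Australia")));
    }
}
```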

Page 11: Cascading for the Impatient

4: wc + scrub + stop words

[Figure: flow diagram: Document Collection, Tokenize, Regex token, Scrub token, Stop Word List, HashJoin Left (RHS), GroupBy token, Count, Word Count; M and R mark the map and reduce phases]

1 mapper, 1 reducer, 28+10 lines of code
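The HashJoin Left step joins the token stream against the stop word list on the right-hand side (RHS); tokens with no match on the RHS are the ones worth counting. A rough plain-Java analogue is sketched below, assuming the stop list fits in memory; `StopWordSketch` and the sample stop words are hypothetical, and this is not the Cascading API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class StopWordSketch {
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            // Left join: every token gets a row; the RHS value is null when there is no match.
            String stop = stopWords.contains(token) ? token : null;
            if (stop == null) {
                kept.add(token); // keep only tokens that did not match a stop word
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "is", "the", "on", "of"));
        List<String> tokens = Arrays.asList("a", "rain", "shadow", "is", "a", "dry", "area");
        System.out.println(removeStopWords(tokens, stopWords));
    }
}
```

Because the lookup table is held in memory, this kind of join can be done while streaming over the tokens, which fits the slide's count of still only one mapper and one reducer.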

Page 12: Cascading for the Impatient

5: tf-idf

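[Figure: flow diagram. Shared front end: Document Collection, Tokenize, Regex token, Scrub token, HashJoin Left against the Stop Word List (RHS). Branches: TF (GroupBy doc_id, token; Count), D (Unique doc_id; Insert 1; SumBy doc_id), DF (Unique token; GroupBy token; Count), and Word Count (GroupBy token; Count). The branches are combined with HashJoin (RHS) and CoGroup (RHS), then ExprFunc tf-idf produces the TF-IDF output. M and R labels mark the map and reduce phases.]

11 mappers, 9 reducers, 65+10 lines of code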

Page 13: Cascading for the Impatient

6: tf-idf + tdd

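[Figure: flow diagram. Shared front end: Document Collection, Tokenize, Regex token, Scrub token, HashJoin Left against the Stop Word List (RHS). Branches: TF (GroupBy doc_id, token; Count), D (Unique doc_id; Insert 1; SumBy doc_id), DF (Unique token; GroupBy token; Count), and Word Count (GroupBy token; Count). The branches are combined with HashJoin (RHS) and CoGroup (RHS), then ExprFunc tf-idf produces the TF-IDF output. Added for testing: Assert, Failure Traps, Checkpoint. M and R labels mark the map and reduce phases.]

12 mappers, 9 reducers, 76+14 lines of code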
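The idea behind Assert plus Failure Traps is that a record failing an assertion is diverted to a side output instead of failing the whole job. The plain-Java sketch below illustrates that routing only; it is not the Cascading API, and both the class name `TrapSketch` and the assertion pattern (a doc_id of "doc" followed by digits) are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class TrapSketch {
    // Hypothetical assertion: a valid doc_id is "doc" followed by digits.
    static final Pattern VALID_DOC_ID = Pattern.compile("doc\\d+");

    final List<String[]> passed = new ArrayList<>();
    final List<String[]> trapped = new ArrayList<>();

    // Each record is {doc_id, text}; failing records go to the trap list.
    void route(List<String[]> records) {
        for (String[] record : records) {
            String docId = record[0];
            if (docId != null && VALID_DOC_ID.matcher(docId).matches()) {
                passed.add(record);
            } else {
                trapped.add(record);
            }
        }
    }

    public static void main(String[] args) {
        TrapSketch sketch = new TrapSketch();
        sketch.route(Arrays.asList(
            new String[] {"doc05", "Two Women. Secrets. A Broken Land. [DVD Australia]"},
            new String[] {"zoink", null}));
        System.out.println(sketch.passed.size() + " passed, " + sketch.trapped.size() + " trapped");
    }
}
```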

Page 14: Cascading for the Impatient

deployed…

elastic-mapreduce --create --name "TF-IDF" \
  --jar s3n://temp.cascading.org/impatient/part6.jar \
  --arg s3n://temp.cascading.org/impatient/rain.txt \
  --arg s3n://temp.cascading.org/impatient/out/wc \
  --arg s3n://temp.cascading.org/impatient/en.stop \
  --arg s3n://temp.cascading.org/impatient/out/tfidf \
  --arg s3n://temp.cascading.org/impatient/out/trap \
  --arg s3n://temp.cascading.org/impatient/out/check

Page 15: Cascading for the Impatient

results?

doc_id  text
doc01   A rain shadow is a dry area on the lee back side of a mountainous area.
doc02   This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03   A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04   This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05   Two Women. Secrets. A Broken Land. [DVD Australia]
zoink   null

doc_id  tf-idf  token
doc02   0.9163  air
doc05   0.9163  australia
doc05   0.9163  broken
doc04   0.9163  california's
doc04   0.9163  cause
doc02   0.9163  cloudcover
doc04   0.9163  death
doc04   0.9163  deserts
doc03   0.9163  downwind
…
doc02   0.9163  sinking
doc04   0.9163  such
doc04   0.9163  valley
doc05   0.9163  women
doc03   0.5108  land
doc05   0.5108  land
doc01   0.5108  lee
doc02   0.5108  lee
doc03   0.5108  leeward
doc04   0.5108  leeward
doc01   0.4463  area
doc02   0.2231  area
doc03   0.2231  area
doc01   0.2231  dry
doc02   0.2231  dry
doc03   0.2231  dry
doc02   0.2231  mountain
doc03   0.2231  mountain
doc04   0.2231  mountain
doc01   0.0000  rain
doc02   0.0000  rain
doc03   0.0000  rain
doc04   0.0000  rain
doc01   0.0000  shadow
doc02   0.0000  shadow
doc03   0.0000  shadow
doc04   0.0000  shadow
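The scores above appear consistent with the expression tf * ln(D / (1 + df)), where tf is the token's count within a document, df is the number of documents containing the token, and D is the total number of documents. The sketch below checks a few rows under that assumption, taking D = 5 (doc01 through doc05, with the malformed "zoink" record left out). The formula is inferred from the listed numbers rather than quoted from the slides, and `TfIdfSketch` is a hypothetical name.

```java
import java.util.Locale;

final class TfIdfSketch {
    // Assumed scoring expression: tf * ln(D / (1 + df)).
    static double tfidf(long tfCount, long dfCount, long nDocs) {
        return tfCount * Math.log((double) nDocs / (1.0 + dfCount));
    }

    static String fmt(double value) {
        return String.format(Locale.ROOT, "%.4f", value);
    }

    public static void main(String[] args) {
        long nDocs = 5; // assumption: doc01..doc05 only
        System.out.println("air   doc02 " + fmt(tfidf(1, 1, nDocs))); // 0.9163
        System.out.println("land  doc03 " + fmt(tfidf(1, 2, nDocs))); // 0.5108
        System.out.println("area  doc01 " + fmt(tfidf(2, 3, nDocs))); // 0.4463
        System.out.println("rain  doc01 " + fmt(tfidf(1, 4, nDocs))); // 0.0000
    }
}
```

The doc01 score for "area" is double the doc02 and doc03 scores because the token occurs twice in doc01, while "rain" and "shadow" score zero because they occur in four of the five documents.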

Page 16: Cascading for the Impatient

comparisons?

compare similar code in Scalding and Cascalog:

• Scalding: sujitpal.blogspot.com/2012/08/scalding-for-impatient.html
  based on: github.com/twitter/scalding/wiki

• Cascalog: github.com/Quantisan/Impatient
  based on: github.com/nathanmarz/cascalog/wiki

Page 17: Cascading for the Impatient

drill-down?

blog, code, wiki, gists, jars, list, DevOps products:

• cascading.org/category/impatient/
• github.org/Cascading/
• conjars.org/
• goo.gl/KQtUL
• concurrentinc.com/