DRIVING INNOVATION THROUGH DATA

CASCADING 3 AND BEYOND

André Kelpe | Apache Big Data Europe | Budapest, September 28th 2015

SPEAKER


André Kelpe
Senior Software Engineer at Concurrent, the company behind Cascading, Lingual and Driven
http://concurrentinc.com / @concurrent
andre@concurrentinc.com / @fs111

http://cascading.org

Apache licensed Java framework for writing data oriented applications

production ready, stable and battle proven

INTRODUCTION


PHILOSOPHY

developer productivity

users focus on business problems, not distributed systems knowledge

predictable runtime behaviour

fail fast

PHILOSOPHY


stable user APIs

safe defaults with knobs for experts

batch workloads

PHILOSOPHY


testability & robustness

production quality applications rather than a collection of scripts

abstractions over interchangeable platforms

PHILOSOPHY


TERMINOLOGY

A SERIES OF PIPES


https://www.flickr.com/photos/theilr/4283377543/sizes/l

CASCADING TERMINOLOGY


• Taps are sources and sinks for data
• Schemes represent the format of the data
• Pipes connect Taps

● Tuples flow through Pipes
● Fields describe the Tuples
● Operations are executed on Tuples in TupleStreams
● Pipes can be merged, spliced, joined etc.
● Pipe-assemblies are reusable components

CASCADING TERMINOLOGY


FlowConnector uses the QueryPlanner to translate a FlowDef into a Flow that runs on a computational platform

Flows can be orchestrated via Cascade

Applications are Directed Acyclic Graphs (DAG)
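Because an application is a DAG, any engine that runs it has to execute the steps in an order that respects the edges. The sketch below shows that idea with Kahn's algorithm in plain Java. It is an illustration only and not how Cascading's QueryPlanner is implemented; the class name DagOrder and the node names (sourceA, sourceB, join, sink) are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a generic DAG ordering, not Cascading's QueryPlanner.
class DagOrder {

    // Kahn's algorithm: returns the nodes so that each one appears after all of
    // its predecessors. Throws IllegalArgumentException if the graph has a cycle.
    static List<String> topologicalOrder(Map<String, List<String>> edges) {
        // Count incoming edges for every node mentioned as a source or a target.
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : edges.entrySet()) {
            inDegree.putIfAbsent(entry.getKey(), 0);
            for (String target : entry.getValue()) {
                inDegree.merge(target, 1, Integer::sum);
            }
        }

        // Start with the nodes that have no incoming edges.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : inDegree.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String node = ready.poll();
            order.add(node);
            for (String target : edges.getOrDefault(node, Collections.<String>emptyList())) {
                // A node becomes ready once all of its predecessors have been emitted.
                if (inDegree.merge(target, -1, Integer::sum) == 0) {
                    ready.add(target);
                }
            }
        }

        if (order.size() != inDegree.size()) {
            throw new IllegalArgumentException("graph contains a cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        // Two hypothetical sources feed a join, which feeds a sink.
        Map<String, List<String>> edges = new LinkedHashMap<>();
        edges.put("sourceA", Arrays.asList("join"));
        edges.put("sourceB", Arrays.asList("join"));
        edges.put("join", Arrays.asList("sink"));
        System.out.println(topologicalOrder(edges));
    }
}
```

The cycle check is what makes the "acyclic" part matter: if two steps depend on each other, no valid execution order exists and the sketch rejects the graph.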

CASCADING TERMINOLOGY


DAG


PLATFORMS

CASCADING PLATFORMS


local

change 1 line of code, recompile, done.

COMPILER ANALOGY


[Diagram: two parallel pipelines]

Compiler: User Code → Translation → Optimisation → Assembly → CPU Architecture

Cascading: FlowDef → QueryPlanner/RuleEngine → MR, Tez, Flink, others…

DAG


A DAG RUNNING ON A PLATFORM


REAL WORLD DAG


https://github.com/cchepelov/wcplus

https://driven.cascading.io/index.html#/apps/A7544E2B8E7C410397B4AE88F53326D1


CODE EXAMPLE

● Fluid - A Fluent API for Cascading
  − Targeted at application writers
  − https://github.com/Cascading/fluid
● "Raw" Cascading API
  − Targeted at library writers, code generators, integration layers
  − https://github.com/Cascading/cascading

APIS


COUNTING WORDS


// input and output paths are passed on the command line
String docPath = args[ 0 ];
String wcPath = args[ 1 ];

Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
FlowConnector flowConnector = new Hadoop2MR1FlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

...

COUNTING WORDS (CONT.)


// specify a regex operation to split the "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter =
  new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );

// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

...

COUNTING WORDS (CONT.)


// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef()
  .setName( "word count" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );

Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.complete(); // ← runs the code
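To make the logic of the word count flow concrete without a cluster, the split, group and count steps can be sketched in plain Java. This is not Cascading code: the class WordCountSketch and the method countWords are hypothetical names, and only the delimiter pattern is taken from the slide. Adjacent delimiters produce empty tokens with Java's split, and this sketch keeps them rather than guessing how they should be handled.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Plain-Java sketch of the word count logic; no Cascading classes involved.
class WordCountSketch {

    // Same delimiter pattern that the slide passes to RegexSplitGenerator.
    private static final Pattern DELIMITERS = Pattern.compile("[ \\[\\]\\(\\),.]");

    // Splits every line into tokens, groups equal tokens and counts them.
    // A TreeMap keeps the result sorted by token.
    static Map<String, Long> countWords(Iterable<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : DELIMITERS.split(line)) {
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical input lines, standing in for the "text" field of the source tap.
        System.out.println(countWords(Arrays.asList("rain shadow", "rain (dry) area")));
    }
}
```

The three steps map onto the pipe assembly: the inner loop plays the role of the Each with the splitter, and the merge call combines what GroupBy and Every with Count do in the flow.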

A FULL TOOLBOX


● Operations
  − Function
  − Filter
  − Regex/Scripts
  − Boolean operators
  − Count/Limit/Last/First
  − Scripts
  − Unique
  − Asserts
  − Min/Max
● Splices
  − GroupBy
  − CoGroup
  − HashJoin
  − Merge
● Joins
  − Left, right, outer, inner, mixed, custom
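The difference between the join types in the list above is easiest to see on a tiny data set. The following plain-Java sketch contrasts an inner join with a left join. It does not use Cascading's API; the class JoinSketch and its join method are hypothetical, and it assumes at most one value per key on each side.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of inner vs. left join semantics; not Cascading code.
class JoinSketch {

    // Joins two keyed data sets on their keys.
    // keepUnmatchedLeft = false gives an inner join: only keys present on both sides.
    // keepUnmatchedLeft = true gives a left join: every left key is kept, and a
    // missing right-hand value is rendered as "null".
    static List<String> join(Map<String, String> left,
                             Map<String, String> right,
                             boolean keepUnmatchedLeft) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, String> entry : left.entrySet()) {
            String rightValue = right.get(entry.getKey());
            if (rightValue != null || keepUnmatchedLeft) {
                rows.add(entry.getKey() + "\t" + entry.getValue() + "\t" + rightValue);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sample data, sorted by key so the output order is stable.
        Map<String, String> left = new TreeMap<>();
        left.put("a", "1");
        left.put("b", "2");
        Map<String, String> right = new TreeMap<>();
        right.put("a", "x");

        System.out.println(join(left, right, false));
        System.out.println(join(left, right, true));
    }
}
```

A right join is the same operation with the two sides swapped, and an outer join keeps unmatched keys from both sides.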

A FULL TOOLBOX


• data access: JDBC, HBase, elasticsearch, redshift, HDFS, S3, Cassandra, kinesis, accumulo …

• data formats: avro, parquet, ORC (+ACID), thrift, protobuf, CSV, TSV…

• integration points: Cascading Lingual (SQL), Apache Hive, M/R apps, custom

OUTLOOK TO CASCADING 3.1+


• improved serialization through strong typing

• Cascading on Apache Flink

• Cascading on Hazelcast

DON’T LIKE JAVA?


Clojure/logic programming

https://github.com/nathanmarz/cascalog

Clojure

https://github.com/Netflix/PigPen

Scala

https://github.com/twitter/scalding


QUESTIONS?

LINK COLLECTION


• http://www.cascading.org/
• https://github.com/Cascading/
• http://driven.io/
• http://concurrentinc.com
• https://groups.google.com/forum/#!forum/cascading-user
• http://docs.cascading.org/tutorials/etl-log/
• http://docs.cascading.org/cascading/3.0/userguide/html/
