DRIVING INNOVATION THROUGH DATA
CASCADING 3 AND BEYOND
André Kelpe | Apache Big Data Europe | Budapest, September 28th 2015
SPEAKER
André Kelpe
Senior Software Engineer at Concurrent, the company behind Cascading, Lingual and Driven
http://concurrentinc.com / @concurrent
[email protected] / @fs111
http://cascading.org
Apache licensed Java framework for writing data oriented applications
production ready, stable and battle proven
INTRODUCTION
PHILOSOPHY
developer productivity
users focus on business problems, not distributed systems knowledge
predictable runtime behaviour
fail fast
stable user APIs
safe defaults with knobs for experts
batch workloads
testability & robustness
production quality applications rather than a collection of scripts
abstractions over interchangeable platforms
TERMINOLOGY
A SERIES OF PIPES
https://www.flickr.com/photos/theilr/4283377543/sizes/l
CASCADING TERMINOLOGY
• Taps are sources and sinks for data
• Schemes represent the format of the data
• Pipes connect Taps
• Tuples flow through Pipes
• Fields describe the Tuples
• Operations are executed on Tuples in TupleStreams
• Pipes can be merged, spliced, joined etc.
• Pipe-assemblies are reusable components
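The terms above can be pictured with a toy model in plain Java. This is only a sketch: none of the classes below are Cascading classes. A "tuple" is an ordered list of values, "fields" is a list of names for its positions, and `applyToEach` and `upperCase` are hypothetical helpers standing in for a pipe stage and an operation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Toy model only: plain Java stand-ins, NOT the Cascading API.
public class TupleStreamSketch {

    /** Applies one operation to every tuple in the stream, as a pipe stage would. */
    public static List<List<Object>> applyToEach(List<List<Object>> tuples,
                                                 UnaryOperator<List<Object>> operation) {
        List<List<Object>> result = new ArrayList<>();
        for (List<Object> tuple : tuples) {
            result.add(operation.apply(tuple));
        }
        return result;
    }

    /** Hypothetical operation: upper-cases the value stored under the given field name. */
    public static UnaryOperator<List<Object>> upperCase(List<String> fields, String fieldName) {
        int position = fields.indexOf(fieldName); // fields map names to tuple positions
        return tuple -> {
            List<Object> copy = new ArrayList<>(tuple); // leave the input tuple untouched
            copy.set(position, copy.get(position).toString().toUpperCase());
            return copy;
        };
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("word", "count");
        // The "source": two tuples described by the fields above.
        List<List<Object>> source = Arrays.asList(
            Arrays.<Object>asList("pipe", 1),
            Arrays.<Object>asList("tap", 2));
        // The "sink" receives whatever flows out of the stage.
        List<List<Object>> sink = applyToEach(source, upperCase(fields, "word"));
        System.out.println(fields + " -> " + sink);
    }
}
```

In Cascading itself these roles are played by Tap, Fields, Tuple, Pipe and the Operation types, as used in the word count example later in the deck.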
The FlowConnector uses the QueryPlanner to translate a FlowDef into a Flow that runs on a computational platform
Flows can be orchestrated via Cascade
Applications are Directed Acyclic Graphs (DAG)
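Because an application is a DAG, its steps can always be put into an order in which every step runs after the steps it depends on. The sketch below shows that idea with Kahn's algorithm in plain Java. It is an illustration of the graph property only, not how Cascading's QueryPlanner works, and the class name, method name and node names are made up for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a generic topological sort, NOT Cascading's planner.
public class DagOrderSketch {

    /** Returns the nodes in an order where each node comes after all of its predecessors. */
    public static List<String> topologicalOrder(Map<String, List<String>> edges) {
        // Count incoming edges for every node that appears as a source or a target.
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : edges.entrySet()) {
            inDegree.putIfAbsent(entry.getKey(), 0);
            for (String target : entry.getValue()) {
                inDegree.merge(target, 1, Integer::sum);
            }
        }
        // Nodes without incoming edges can run right away.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : inDegree.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String node = ready.remove();
            order.add(node);
            // "Finishing" a node releases the nodes that were waiting on it.
            for (String target : edges.getOrDefault(node, Collections.<String>emptyList())) {
                if (inDegree.merge(target, -1, Integer::sum) == 0) {
                    ready.add(target);
                }
            }
        }
        if (order.size() != inDegree.size()) {
            throw new IllegalArgumentException("graph contains a cycle, so it is not a DAG");
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> edges = new LinkedHashMap<>();
        edges.put("sourceA", Collections.singletonList("join"));
        edges.put("sourceB", Collections.singletonList("join"));
        edges.put("join", Collections.singletonList("sink"));
        System.out.println(topologicalOrder(edges));
    }
}
```

A graph with a cycle has no such order, which is why the sketch rejects it.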
DAG
PLATFORMS
CASCADING PLATFORMS
local
change 1 line of code, recompile, done.
COMPILER ANALOGY
[Diagram: a compiler pipeline (User Code, Translation, Optimisation, Assembly, CPU Architecture) shown alongside Cascading (FlowDef, QueryPlanner/RuleEngine, platforms MR, Tez, Flink, others…)]
DAG
A DAG RUNNING ON A PLATFORM
REAL WORLD DAG
https://github.com/cchepelov/wcplus
https://driven.cascading.io/index.html#/apps/A7544E2B8E7C410397B4AE88F53326D1
CODE EXAMPLE
● Fluid - A Fluent API for Cascading
− Targeted at application writers
− https://github.com/Cascading/fluid
● "Raw" Cascading API
− Targeted at library writers, code generators, integration layers
− https://github.com/Cascading/cascading
APIS
COUNTING WORDS
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
FlowConnector flowConnector = new Hadoop2MR1FlowConnector( properties );
// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
...
COUNTING WORDS (CONT.)
// specify a regex operation to split the "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter =
  new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
...
COUNTING WORDS (CONT.)
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef()
  .setName( "word count" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.complete(); // runs the flow and blocks until it completes
}
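For readers who want to see what the flow computes without a cluster, here is a plain Java sketch of the same logic: split each line on the delimiter pattern from the slide, then count occurrences per token. It does not use Cascading at all. The class and method names are made up, and skipping empty tokens is an assumption of this sketch, a step the slide code does not show.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain Java illustration of the word count logic, NOT a Cascading application.
public class PlainWordCount {

    // Same delimiter pattern as passed to RegexSplitGenerator on the slide.
    private static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    /** Splits every line into tokens and counts how often each token occurs. */
    public static Map<String, Long> countTokens(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>(); // sorted by token for stable output
        for (String line : lines) {
            for (String token : line.split(DELIMITERS)) {
                if (token.isEmpty()) {
                    continue; // assumption of this sketch: ignore empty tokens
                }
                counts.merge(token, 1L, Long::sum); // the "group by token, then count" step
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a rain shadow (dry), a");
        System.out.println(countTokens(lines));
    }
}
```

The Cascading version expresses the same two steps as an Each with a RegexSplitGenerator followed by GroupBy and Every with Count, and leaves the execution on the chosen platform to the planner.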
A FULL TOOLBOX
● Operations
− Function
− Filter
− Regex/Scripts
− Boolean operators
− Count/Limit/Last/First
− Scripts
− Unique
− Asserts
− Min/Max
● Splices
− GroupBy
− CoGroup
− HashJoin
− Merge
● Joins
− Left, right, outer, inner, mixed, custom
• data access: JDBC, HBase, elasticsearch, redshift, HDFS, S3, Cassandra, kinesis, accumulo …
• data formats: avro, parquet, ORC (+ACID), thrift, protobuf, CSV, TSV…
• integration points: Cascading Lingual (SQL), Apache Hive, M/R apps, custom
OUTLOOK TO CASCADING 3.1+
• improved serialization through strong typing
• Cascading on Apache Flink
• Cascading on Hazelcast
DON’T LIKE JAVA?
Clojure/logic programming
https://github.com/nathanmarz/cascalog
Clojure
https://github.com/Netflix/PigPen
Scala
https://github.com/twitter/scalding
QUESTIONS?
LINK COLLECTION
• http://www.cascading.org/
• https://github.com/Cascading/
• http://driven.io/
• http://concurrentinc.com
• https://groups.google.com/forum/#!forum/cascading-user
• http://docs.cascading.org/tutorials/etl-log/
• http://docs.cascading.org/cascading/3.0/userguide/html/