© Cloudera, Inc. All rights reserved.
Marton Balassi | Solutions Architect, Flink PMC member
@MartonBalassi | [email protected]
The Flink - Apache Bigtop integration
Outline
• Short introduction to Bigtop
• An even shorter intro to Flink
• From Flink source to linux packages
• Implementing BigPetStore
• From linux packages to Cloudera parcels
• Summary
Short introduction to Bigtop
What is Bigtop?
Apache project for standardizing testing, packaging and integration of
leading big data components.
Components as building blocks
And many more …
Dependency hell
[Diagram: dependencies among hdfs, zookeeper, hbase, kafka, spark, ..., mapred, oozie, hive, etc.]
Build all the Things!!!
Early value added
• Bigtop has been around since the 0.20 days of Hadoop
• Provides a common foundation for proper integration of a growing number of Hadoop family components
• The foundation provides a solid base for validating applications running on top of the stack(s)
• Neutral packaging and deployment/config
Early mission accomplished
• Foundation for commercial Hadoop distros/services
• Leveraged by app providers
…
Adding more components
…
New focus and target groups
• Going way beyond just building debs/rpms
• Data engineers vs distro builders
• Enhance Operations/Deployment
• Reference implementations & tutorials
An even shorter intro to Flink
The Flink stack
DataStream API: Stream Processing
DataSet API: Batch Processing
Runtime: Distributed Streaming Data Flow
Libraries
Streaming and batch as first class citizens.
Flink in the wild
• 30 billion events daily
• 2 billion events in 10 1Gb machines
• Picked Flink for "Saiki" data integration & distribution platform
See talks by at
Runs their fork of Flink on 1000+ nodes
From Flink source to linux packages
The Bigtop component build
• Bigtop builds the component (potentially after patching it)
• Lays the files out in a Linux-distro-friendly way (/etc/flink/conf, …)
• Adds users, groups and systemd services for the components
• Sets up the paths and alternatives for convenient access
• Builds the debs/rpms and takes care of the dependencies
http://jayunit100.blogspot.com/2014/04/how-bigtop-packages-hadoop.html
Implementing BigPetStore
BigPetStore Outline
• BigPetStore model
• Data generator with the DataSet API
• ETL with the DataSet and Table APIs
• Matrix factorization with FlinkML
• Recommendation with the DataStream API
BigPetStore
• Blueprints for Big Data applications
• Consists of:
  • Data Generators
  • Examples using tools in Big Data ecosystem to process data
  • Build system and tests for integrating tools and multiple JVM languages
• Part of the Bigtop project
BigPetStore model
• Customers visiting pet stores generating transactions, location based
Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet Store," in Big Data and Cloud Computing (BDCloud), 2014 IEEE Fourth International Conference on, pp. 49-55, 3-5 Dec. 2014
Data generation
• Use RJ Nowling’s Java generator classes
• Write transactions to JSON

val env = ExecutionEnvironment.getExecutionEnvironment
val (stores, products, customers) = getData()
val startTime = getCurrentMillis()

val transactions = env.fromCollection(customers)
  .flatMap(new TransactionGenerator(products))
  .withBroadcastSet(stores, "stores")
  .map{t => t.setDateTime(t.getDateTime + startTime); t}

transactions.writeAsText(output)
ETL with the DataSet API
• Read the dirty JSON
• Output (customer, product) pairs for the recommender

val env = ExecutionEnvironment.getExecutionEnvironment
val transactions = env.readTextFile(json).map(new FlinkTransaction(_))

val productsWithIndex = transactions.flatMap(_.getProducts)
  .distinct
  .zipWithUniqueId

val customerAndProductPairs = transactions
  .flatMap(t => t.getProducts.map(p => (t.getCustomer.getId, p)))
  .join(productsWithIndex).where(_._2).equalTo(_._2)
  .map(pair => (pair._1._1, pair._2._1))
  .distinct

customerAndProductPairs.writeAsCsv(output)
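The same ETL logic can be sketched with plain Scala collections, which may help when reasoning about the join without a Flink cluster. This is only an illustration under stated assumptions: `Transaction` and its fields are hypothetical stand-ins for `FlinkTransaction`, and `zipWithIndex` yields consecutive indices, whereas Flink's `zipWithUniqueId` only guarantees unique ones.

```scala
// Plain Scala sketch of the ETL step (no Flink dependency).
// Transaction is a hypothetical stand-in for FlinkTransaction.
object EtlSketch {
  final case class Transaction(customerId: Long, products: Seq[String])

  // Assign a numeric index to every distinct product.
  def indexProducts(transactions: Seq[Transaction]): Map[String, Long] =
    transactions.flatMap(_.products).distinct.zipWithIndex
      .map { case (product, idx) => product -> idx.toLong }
      .toMap

  // Emit distinct (customerId, productIndex) pairs for the recommender.
  def customerProductPairs(transactions: Seq[Transaction]): Seq[(Long, Long)] = {
    val index = indexProducts(transactions)
    transactions
      .flatMap(t => t.products.map(p => (t.customerId, index(p))))
      .distinct
  }
}
```

The lookup in `index` plays the role of the join on the product field in the DataSet version.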
ETL with Table API
• Read the dirty JSON
• SQL style queries (SQL coming in Flink 1.1)

val env = ExecutionEnvironment.getExecutionEnvironment
val transactions = env.readTextFile(json).map(new FlinkTransaction(_))

val table = transactions.map(toCaseClass(_)).toTable

val storeTransactionCount = table.groupBy('storeId)
  .select('storeId, 'storeName, 'storeId.count as 'count)

val bestStores = table.groupBy('storeId)
  .select('storeId.max as 'max)
  .join(storeTransactionCount)
  .where("count = max")
  .select('storeId, 'storeName, 'storeId.count as 'count)
  .toDataSet[StoreCount]
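Assuming the intent of the query is to find the stores with the most transactions, the same idea can be sketched with plain Scala collections. The case classes and their fields below are hypothetical; only the name `StoreCount` appears on the slide.

```scala
// Plain Scala sketch: stores with the highest transaction count.
// StoreTx and the fields of StoreCount are assumptions for illustration.
object BestStoresSketch {
  final case class StoreTx(storeId: Int, storeName: String)
  final case class StoreCount(storeId: Int, storeName: String, count: Int)

  // Count transactions per store.
  def countPerStore(txs: Seq[StoreTx]): Seq[StoreCount] =
    txs.groupBy(t => (t.storeId, t.storeName))
      .map { case ((id, name), group) => StoreCount(id, name, group.size) }
      .toSeq

  // Keep the stores whose count equals the maximum count (ties included).
  def bestStores(txs: Seq[StoreTx]): Seq[StoreCount] = {
    val counts = countPerStore(txs)
    if (counts.isEmpty) Seq.empty
    else {
      val maxCount = counts.map(_.count).max
      counts.filter(_.count == maxCount).sortBy(_.storeId)
    }
  }
}
```

Computing the counts once and filtering on the maximum mirrors the group, aggregate and join pattern of the Table API version.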
A little recommender theory
[Diagram labels: item factors, user side information, user-item matrix, user factors, item side information; symbols U, I, P, Q, R]
• R is potentially huge, approximate it with PQ
• Prediction is TopK(user’s row · Q)
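To make the prediction step concrete, here is a minimal plain Scala sketch of scoring items for one user and keeping the best k. It assumes that a user's row of P and each item's factor vector from Q are dense arrays of equal length; the object and function names are hypothetical.

```scala
// Plain Scala sketch of TopK(user's row · Q); names are illustrative only.
object TopKSketch {
  // Dot product of two equally sized factor vectors.
  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "factor vectors must have the same length")
    a.zip(b).map { case (x, y) => x * y }.sum
  }

  // Score every item against one user's factor vector and return the k best item ids.
  // Ties are broken by the smaller item id to keep the result deterministic.
  def topK(userFactors: Array[Double], itemFactors: Map[Int, Array[Double]], k: Int): Seq[Int] =
    itemFactors.toSeq
      .map { case (itemId, q) => (itemId, dot(userFactors, q)) }
      .sortWith((a, b) => a._2 > b._2 || (a._2 == b._2 && a._1 < b._1))
      .take(k)
      .map(_._1)
}
```

Only the user's factor vector and Q are needed at prediction time, so the full matrix R never has to be materialized.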
Matrix factorization with FlinkML
• Read the (customer, product) pairs
• Write P and Q to file

val env = ExecutionEnvironment.getExecutionEnvironment
val input = env.readCsvFile[(Int,Int)](inputFile)
  .map(pair => (pair._1, pair._2, 1.0))

val model = ALS()
  .setNumFactors(numFactors)
  .setIterations(iterations)
  .setLambda(lambda)

model.fit(input)

val (p, q) = model.factorsOption.get
p.writeAsText(pOut)
q.writeAsText(qOut)
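For orientation, the generic regularized objective that alternating least squares minimizes can be written as follows. This is the textbook form, stated as an assumption; FlinkML's implementation may weight the regularization terms differently. Here $\Omega$ is the set of observed (customer, product) pairs, $r_{ui} = 1.0$ for every pair in the input above, and $\lambda$ corresponds to `setLambda`.

```latex
\min_{P,Q} \; \sum_{(u,i) \in \Omega} \left( r_{ui} - p_u^{\top} q_i \right)^2
  + \lambda \left( \sum_{u} \lVert p_u \rVert^2 + \sum_{i} \lVert q_i \rVert^2 \right)

% With Q fixed, each user vector has a closed-form least-squares solution
% (and symmetrically for each item vector with P fixed):
p_u = \Bigl( \sum_{i : (u,i) \in \Omega} q_i q_i^{\top} + \lambda I \Bigr)^{-1}
      \sum_{i : (u,i) \in \Omega} r_{ui} \, q_i
```

Alternating between the two closed-form updates for a fixed number of rounds is what `setIterations` controls.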
Recommendation with the DataStream API
• Give the TopK recommendation for a user
• (Could be optimized)

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.socketTextStream("localhost", 9999)
    .map(new GetUserVector())
    .broadcast()
    .map(new PartialTopK())
    .keyBy(0)
    .flatMap(new GlobalTopK())
    .print();
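The pipeline above suggests a two-stage pattern: each parallel instance computes a partial top-K over its share of the items, and a keyed step merges those into the global top-K. The plain Scala sketch below illustrates that pattern only; the behavior of `PartialTopK` and `GlobalTopK` is assumed from their names, and the `Scored` type is hypothetical.

```scala
// Plain Scala sketch of a partial/global top-K merge (no Flink dependency).
object DistributedTopKSketch {
  type Scored = (Int, Double) // (itemId, score)

  // Higher score first; ties broken by the smaller item id.
  private def better(a: Scored, b: Scored): Boolean =
    a._2 > b._2 || (a._2 == b._2 && a._1 < b._1)

  // Each parallel worker keeps only its local k best items.
  def partialTopK(partition: Seq[Scored], k: Int): Seq[Scored] =
    partition.sortWith(better).take(k)

  // A single downstream step merges the partial results into the global k best.
  def globalTopK(partials: Seq[Seq[Scored]], k: Int): Seq[Scored] =
    partials.flatten.sortWith(better).take(k)
}
```

Because every global top-K item must also be in the top-K of its own partition, merging the partial results gives the same answer as sorting all items at once while shipping at most k items per worker.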
From linux packages to Cloudera parcels
Why parcels?
• We have linux packages, why a new format?
• Cloudera Manager needs to update parcels without root privileges
• A big, single bundle for the whole ecosystem
• Plays well with the CM services and monitoring
• Package signing
https://github.com/cloudera/cm_ext
Managing the Flink parcel from CM
Next steps – Flink operations
• Flink does not offer a HistoryServer yet
  • Running on YARN is inconvenient like this
  • Follow [FLINK-4136] for resolution
• The stand-alone cluster mode runs multiple jobs in the JVM
  • In practice users fire up clusters per job
  • Alibaba has a multitenant fork and aims to contribute it
https://www.youtube.com/watch?v=_Nw8NTdIq9A
Next steps – CM services, monitoring
Summary
• Flink is a dataflow engine with batch and streaming as first class citizens
• Bigtop offers unified packaging, testing and integration
• BigPetStore gives you a blueprint for a range of apps
• It is straightforward to build a CM parcel based on Bigtop
Big thanks to
• Clouderans supporting the project:
Sean Owen
Alexander Bartfeld
Justin Kestelyn
• The BigPetStore folks:
Suneel Marthi
Ronald J. Nowling
Jay Vyas
• Bigtop people answering my silly questions:
Konstantin Boudnik
Roman Shaposhnik
Nate D'Amico
• Squirrels pushing the integration:
Robert Metzger
Fabian Hueske
Check out the code
github.com/mbalassi/bigpetstore-flink
github.com/mbalassi/flink-parcel
Feel free to give me feedback.
Come to Flink Forward
Thank you
@MartonBalassi | [email protected]