The Flink - Apache Bigtop integration

1© Cloudera, Inc. All rights reserved.

Marton Balassi | Solutions Architect, Flink PMC member

@MartonBalassi | mbalassi@cloudera.com

Outline

• Short introduction to Bigtop

• An even shorter intro to Flink

• From Flink source to Linux packages

• Implementing BigPetStore

• From Linux packages to Cloudera parcels

• Summary

Short introduction to Bigtop

What is Bigtop?

An Apache project for standardizing the testing, packaging and integration of leading big data components.

Components as building blocks

And many more …

Dependency hell

[Figure: dependency matrix between the components: hdfs, zookeeper, hbase, kafka, spark, ..., mapred, oozie, hive, etc.]

Build all the Things!!!

Early value added

• Bigtop has been around since the 0.20 days of Hadoop

• Provides a common foundation for proper integration of a growing number of Hadoop family components

• The foundation provides a solid base for validating applications running on top of the stack(s)

• Neutral packaging and deployment/config

Early mission accomplished

• Foundation for commercial Hadoop distros/services

• Leveraged by app providers

Adding more components

New focus and target groups

• Going way beyond just building debs/rpms

• Data engineers vs distro builders

• Enhance Operations/Deployment

• Reference implementations & tutorials

An even shorter intro to Flink

The Flink stack

DataStream API: stream processing

DataSet API: batch processing

Runtime: distributed streaming data flow

Libraries

Streaming and batch as first class citizens.

Flink in the wild

30 billion events daily

2 billion events in 10 1Gb machines

Picked Flink for the "Saiki" data integration & distribution platform

See talks by at

Runs their fork of Flink on 1000+ nodes

From Flink source to Linux packages

The Bigtop component build

• Bigtop builds the component (potentially after patching it)

• Lays out the files in a Linux-distro-friendly way (/etc/flink/conf, …)

• Adds users, groups, systemd services for the components

• Sets up the paths and alternatives for convenient access

• Builds the debs/rpms and takes care of the dependencies

http://jayunit100.blogspot.com/2014/04/how-bigtop-packages-hadoop.html

Implementing BigPetStore

BigPetStore Outline

• BigPetStore model

• Data generator with the DataSet API

• ETL with the DataSet and Table APIs

• Matrix factorization with FlinkML

• Recommendation with the DataStream API

BigPetStore

• Blueprints for Big Data applications

• Consists of:
  • Data generators
  • Examples using tools in the Big Data ecosystem to process data
  • Build system and tests for integrating tools and multiple JVM languages

• Part of the Bigtop project

BigPetStore model

• Customers visiting pet stores generating transactions, location based

Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet Store," in Big Data and Cloud Computing (BDCloud), 2014 IEEE Fourth International Conference on, pp. 49-55, 3-5 Dec. 2014

Data generation

• Use RJ Nowling's Java generator classes

• Write transactions to JSON

val env = ExecutionEnvironment.getExecutionEnvironment
val (stores, products, customers) = getData()
val startTime = getCurrentMillis()

// Generate transactions per customer; the stores are shipped as a broadcast set
val transactions = env.fromCollection(customers)
  .flatMap(new TransactionGenerator(products))
  .withBroadcastSet(stores, "stores")
  // Shift the generated timestamps so they start at the current time
  .map { t => t.setDateTime(t.getDateTime + startTime); t }

transactions.writeAsText(output)

ETL with the DataSet API

• Read the dirty JSON

• Output (customer, product) pairs for the recommender

val env = ExecutionEnvironment.getExecutionEnvironment
val transactions = env.readTextFile(json).map(new FlinkTransaction(_))

// Assign a unique numeric id to every distinct product: (id, product)
val productsWithIndex = transactions.flatMap(_.getProducts).distinct.zipWithUniqueId

// Join (customerId, product) with (id, product) on the product, keep (customerId, id)
val customerAndProductPairs = transactions
  .flatMap(t => t.getProducts.map(p => (t.getCustomer.getId, p)))
  .join(productsWithIndex).where(_._2).equalTo(_._2)
  .map(pair => (pair._1._1, pair._2._1))
  .distinct

customerAndProductPairs.writeAsCsv(output)
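To make the join above concrete, here is a minimal plain-Java sketch (no Flink dependency) of what this ETL step computes: every distinct product gets a numeric id, and the transactions are flattened into distinct (customerId, productId) rows. The `Transaction` class, the sample data and the consecutive ids are illustrative assumptions, not part of BigPetStore; Flink's zipWithUniqueId only guarantees unique ids, not consecutive ones.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class EtlSketch {

    /** Hypothetical stand-in for a parsed transaction: a customer id plus product names. */
    static final class Transaction {
        final int customerId;
        final List<String> products;

        Transaction(int customerId, String... products) {
            this.customerId = customerId;
            this.products = Arrays.asList(products);
        }
    }

    /** Assigns an id to every distinct product, in order of first appearance. */
    static Map<String, Long> indexProducts(List<Transaction> transactions) {
        Map<String, Long> index = new LinkedHashMap<>();
        for (Transaction t : transactions) {
            for (String p : t.products) {
                if (!index.containsKey(p)) {
                    index.put(p, (long) index.size());
                }
            }
        }
        return index;
    }

    /** Flattens transactions into distinct "customerId,productId" rows. */
    static Set<String> customerProductPairs(List<Transaction> transactions) {
        Map<String, Long> index = indexProducts(transactions);
        Set<String> pairs = new LinkedHashSet<>();
        for (Transaction t : transactions) {
            for (String p : t.products) {
                pairs.add(t.customerId + "," + index.get(p));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<Transaction> transactions = new ArrayList<>();
        transactions.add(new Transaction(1, "dog food", "leash"));
        transactions.add(new Transaction(2, "cat litter", "dog food"));
        transactions.add(new Transaction(1, "dog food")); // repeat purchase, deduplicated
        customerProductPairs(transactions).forEach(System.out::println);
    }
}
```

The repeated purchase by customer 1 collapses into a single row, which mirrors the final `.distinct` in the Flink job.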

ETL with Table API

• Read the dirty JSON

• SQL style queries (SQL coming in Flink 1.1)

val env = ExecutionEnvironment.getExecutionEnvironment
val transactions = env.readTextFile(json).map(new FlinkTransaction(_))

val table = transactions.map(toCaseClass(_)).toTable

val storeTransactionCount = table.groupBy('storeId)
  .select('storeId, 'storeName, 'storeId.count as 'count)

val bestStores = table.groupBy('storeId)
  .select('storeId.max as 'max)
  .join(storeTransactionCount)
  .where("count = max")
  .select('storeId, 'storeName, 'storeId.count as 'count)
  .toDataSet[StoreCount]

A little recommender theory

[Figure: the user-item matrix R is approximated by the product of the user factors P and the item factors Q; U and I denote the user and item side information]

• R is potentially huge, approximate it with PQ

• Prediction is TopK(user's row of P × Q)
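The prediction rule above can be sketched in a few lines of plain Java: multiply one user's row of P by Q to get a score per item, then keep the indices of the K largest scores. This is only an illustration under assumed shapes (Q stored as factors × items) and made-up numbers; the class and method names are hypothetical and not taken from the BigPetStore code.

```java
import java.util.Arrays;
import java.util.Comparator;

public class TopKSketch {

    /** Predicted scores for one user: the user's row of P multiplied by Q (factors x items). */
    static double[] scores(double[] userFactors, double[][] q) {
        int items = q[0].length;
        double[] result = new double[items];
        for (int f = 0; f < userFactors.length; f++) {
            for (int i = 0; i < items; i++) {
                result[i] += userFactors[f] * q[f][i];
            }
        }
        return result;
    }

    /** Indices of the k highest-scoring items, best first. */
    static int[] topK(double[] scores, int k) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort item indices by descending score
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -scores[i]));
        int[] best = new int[Math.min(k, order.length)];
        for (int i = 0; i < best.length; i++) {
            best[i] = order[i];
        }
        return best;
    }

    public static void main(String[] args) {
        double[] user = {1.0, 0.5};   // one row of P with 2 latent factors (made-up values)
        double[][] q = {              // Q: 2 factors x 4 items (made-up values)
            {0.9, 0.1, 0.4, 0.0},
            {0.2, 0.8, 0.4, 1.0}
        };
        System.out.println(Arrays.toString(topK(scores(user, q), 2))); // prints [0, 2]
    }
}
```

Sorting all items is the simplest way to express TopK; for large item catalogs a bounded heap avoids the full sort.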

Matrix factorization with FlinkML

• Read the (customer, product) pairs

• Write P and Q to file

val env = ExecutionEnvironment.getExecutionEnvironment
// Every observed (customer, product) pair gets an implicit rating of 1.0
val input = env.readCsvFile[(Int, Int)](inputFile)
  .map(pair => (pair._1, pair._2, 1.0))

val model = ALS()
  .setNumFactors(numFactors)
  .setIterations(iterations)
  .setLambda(lambda)

model.fit(input)

// The fitted user (P) and item (Q) factors
val (p, q) = model.factorsOption.get
p.writeAsText(pOut)
q.writeAsText(qOut)

Recommendation with the DataStream API

• Give the TopK recommendation for a user

• (Could be optimized)

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.socketTextStream("localhost", 9999)
    .map(new GetUserVector())
    .broadcast()
    .map(new PartialTopK())
    .keyBy(0)
    .flatMap(new GlobalTopK())
    .print();
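The slide does not show `PartialTopK` and `GlobalTopK`, so the following is only a guess at the pattern their names suggest: each parallel instance computes a top-K over its own share of the items, and a final step merges those partial results and takes the top-K again. The plain-Java sketch below (no Flink dependency, hypothetical class names, made-up scores) shows why the merge is enough: the global top-K is always contained in the union of the per-partition top-Ks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class PartialGlobalTopK {

    /** An item id with its predicted score. */
    static final class Scored {
        final int item;
        final double score;

        Scored(int item, double score) {
            this.item = item;
            this.score = score;
        }
    }

    /** Keeps the k best candidates using a min-heap of size k; result is sorted best first. */
    static List<Scored> topK(List<Scored> candidates, int k) {
        PriorityQueue<Scored> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Scored s) -> s.score));
        for (Scored s : candidates) {
            heap.add(s);
            if (heap.size() > k) {
                heap.poll(); // drop the current worst candidate
            }
        }
        List<Scored> best = new ArrayList<>(heap);
        best.sort(Comparator.comparingDouble((Scored s) -> -s.score));
        return best;
    }

    /** Global step: merge the per-partition winners and take the top k again. */
    static List<Scored> globalTopK(List<List<Scored>> partials, int k) {
        List<Scored> merged = new ArrayList<>();
        for (List<Scored> partial : partials) {
            merged.addAll(partial);
        }
        return topK(merged, k);
    }

    public static void main(String[] args) {
        // Two partitions of already-scored items (made-up numbers)
        List<Scored> partition1 = Arrays.asList(new Scored(0, 1.0), new Scored(1, 0.5));
        List<Scored> partition2 =
            Arrays.asList(new Scored(2, 0.6), new Scored(3, 0.9), new Scored(4, 0.1));
        List<List<Scored>> partials = Arrays.asList(topK(partition1, 2), topK(partition2, 2));
        for (Scored s : globalTopK(partials, 2)) {
            System.out.println(s.item + " " + s.score);
        }
    }
}
```

With these numbers the merged result is item 0 followed by item 3, the same as a top-2 over all five items at once.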

From Linux packages to Cloudera parcels

Why parcels?

• We have Linux packages, why a new format?

• Cloudera Manager needs to update parcels without root privileges

• A big, single bundle for the whole ecosystem

• Plays well with the CM services and monitoring

• Package signing

https://github.com/cloudera/cm_ext

Managing the Flink parcel from CM

Next steps – Flink operations

• Flink does not offer a HistoryServer yet
  Running on YARN is inconvenient without one
  Follow [FLINK-4136] for the resolution

• The stand-alone cluster mode runs multiple jobs in the same JVM
  In practice users fire up a cluster per job
  Alibaba has a multitenant fork and aims to contribute it

https://www.youtube.com/watch?v=_Nw8NTdIq9A

Next steps – CM services, monitoring

Summary

• Flink is a dataflow engine with batch and streaming as first class citizens

• Bigtop offers unified packaging, testing and integration

• BigPetStore gives you a blueprint for a range of apps

• It is straightforward to build a CM parcel based on Bigtop

Big thanks to

• Clouderans supporting the project:

Sean Owen

Alexander Bartfeld

Justin Kestelyn

• The BigPetStore folks:

Suneel Marthi

Ronald J. Nowling

Jay Vyas

• Bigtop people answering my silly questions:

Konstantin Boudnik

Roman Shaposhnik

Nate D'Amico

• Squirrels pushing the integration:

Robert Metzger

Fabian Hueske

Check out the code

github.com/mbalassi/bigpetstore-flink

github.com/mbalassi/flink-parcel

Feel free to give me feedback.

Come to Flink Forward

Thank you

@MartonBalassi

mbalassi@cloudera.com