Big Data Processing with Spark and AWS EMR @ glomex
17.10.2016
Michael Ludwig

Our Architecture

Our Use Cases

Billing Pre-Aggregations

Interactive Big Data

Spark components

Spark 1.6, PySpark, spark-submit, DataFrames, SparkSQL, UDFs, Accumulators

Example: SparkSQL

EMR Cluster Startup

§ AWS Web Console
§ AWS CLI
§ AWS SDKs (Python, Java, JS etc.)

Startup parameters
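
The concrete parameters on this slide are not recoverable. As a rough sketch of what starting a cluster through the Python SDK can look like, the helper below assembles keyword arguments for boto3's EMR `run_job_flow` call, optionally requesting the core nodes on the spot market. The function names, instance type, region and the release label in the usage example are illustrative assumptions, not values from the talk. The role names are the EMR default roles.

```python
def build_cluster_request(name, release_label, core_count,
                          spot_bid_usd=None, instance_type="m3.xlarge",
                          keep_alive=True, log_uri=None):
    """Return keyword arguments for boto3's EMR client run_job_flow()."""
    core_group = {
        "Name": "core",
        "InstanceRole": "CORE",
        "InstanceType": instance_type,  # assumed instance type
        "InstanceCount": core_count,
        "Market": "ON_DEMAND",
    }
    if spot_bid_usd is not None:
        # Spot instances need a bid price; the API expects it as a string.
        core_group["Market"] = "SPOT"
        core_group["BidPrice"] = "%.3f" % spot_bid_usd

    request = {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}, {"Name": "Ganglia"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": instance_type, "InstanceCount": 1,
                 "Market": "ON_DEMAND"},
                core_group,
            ],
            # False makes the cluster terminate once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": keep_alive,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }
    if log_uri:
        request["LogUri"] = log_uri
    return request


def start_cluster(request, region="eu-west-1"):
    """Submit the request; needs boto3 and AWS credentials (region is assumed)."""
    import boto3  # imported lazily so the builder works without the SDK
    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]


if __name__ == "__main__":
    # "emr-4.7.2" is only an example release label.
    print(build_cluster_request("adhoc-analysis", "emr-4.7.2", 4, spot_bid_usd=0.10))
```

Keeping the request a plain dictionary makes it easy to review or version the startup parameters before anything is launched.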

Spot prices

Cluster Interaction

YARN Manager

Monitoring: Spark UI

Monitoring: Ganglia on EMR

Error Troubleshooting

Summary

§ EMR
  § Easy cluster startup and configuration
  § Throw-away, isolated clusters
  § No big upfront investments needed

§ Spark
  § Best framework to get started with Big Data
  § Big community & fast development
  § Local development easy

Backup
§ TODO

EMR Access URLs

RDD, DataFrame and Dataset

Spark Cluster

In-Memory Computation

Operations
§ placeholder

Sample Transformations

RDD Lineage

RDD DAG
