Alexis Seigneurin @aseigneurin @ippontech

Spark - Alexis Seigneurin (English)


Spark

● Processing of large volumes of data
● Distributed processing on commodity hardware
● Written in Scala, with Java and Python bindings

History

● 2009: AMPLab, UC Berkeley
● June 2013: "Top-level project" of the Apache foundation
● May 2014: version 1.0.0
● Currently: version 1.2.0

Use cases

● Logs analysis
● Processing of text files
● Analytics
● Distributed search (as Google did originally)
● Fraud detection
● Product recommendation

Proximity with Hadoop

● Same use cases
● Same development model: MapReduce
● Integration with the ecosystem

Simpler than Hadoop

● Simpler API to learn
● "Relaxed" MapReduce
● Spark Shell: interactive processing (launched as shown below)
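As a taste of the interactive mode, the shell ships with the distribution (assuming $SPARK_HOME points to a standard install):

$ $SPARK_HOME/bin/spark-shell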

Spark ecosystem

● Spark
● Spark Shell
● Spark Streaming
● Spark SQL
● Spark ML
● GraphX

Integration

● YARN, ZooKeeper, Mesos
● HDFS
● Cassandra
● Elasticsearch
● MongoDB

Spark - Operating principle

RDD

● Resilient Distributed Dataset
● Abstraction of a collection processed in parallel
● Fault tolerant
● Can work with tuples:
○ Key - Value
○ Tuples must be independent from each other

Sources

● Files on HDFS
● Local files
● Collection in memory (see the sketch below)
● Amazon S3
● NoSQL databases
● ...
● Or a custom implementation of InputFormat
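A minimal sketch of the first few sources, reusing the trees file from the example later in this deck and an in-memory list:

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext sc = new JavaSparkContext("local", "sources");

// RDD from a text file: one element per line
JavaRDD<String> lines = sc.textFile("data/arbresalignementparis2010.csv");

// RDD from an in-memory collection
JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));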

Transformations

● Processes an RDD, returns another RDD
● Lazy! (see the sketch below)
● Examples:
○ map(): one value → another value
○ mapToPair(): one value → a tuple
○ filter(): filters values/tuples given a condition
○ groupByKey(): groups values by key
○ reduceByKey(): aggregates values by key
○ join(), cogroup()...: joins two RDDs
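A minimal sketch of laziness, reusing the numbers RDD from the previous sketch; none of these lines triggers any computation:

// Transformations only describe the processing: nothing runs yet
JavaRDD<Integer> doubled = numbers
    .filter(x -> x % 2 == 0)   // keep even values: 2, 4
    .map(x -> x * 10);         // one value → another value: 20, 40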

Actions

● Does not return an RDD
● Examples (see the sketch below):
○ count(): counts values/tuples
○ saveAsHadoopFile(): saves results in Hadoop's format
○ foreach(): applies a function on each item
○ collect(): retrieves values in a list (List<T>)
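Continuing the sketch above, calling an action is what forces the evaluation:

long n = doubled.count();                    // triggers the processing; n == 2
List<Integer> values = doubled.collect();    // retrieves the values: [20, 40]
doubled.foreach(x -> System.out.println(x)); // applies a function on each item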

Example

● Trees of Paris: CSV file, Open Data
● Count of trees by species

Spark - Example

geom_x_y;circonfere;adresse;hauteurenm;espece;varieteouc;dateplanta
48.8648454814, 2.3094155344;140.0;COURS ALBERT 1ER;10.0;Aesculus hippocastanum;;
48.8782668139, 2.29806967519;100.0;PLACE DES TERNES;15.0;Tilia platyphyllos;;
48.889306184, 2.30400164126;38.0;BOULEVARD MALESHERBES;0.0;Platanus x hispanica;;
48.8599934405, 2.29504883623;65.0;QUAI BRANLY;10.0;Paulownia tomentosa;;1996-02-29
...

Spark - Example

JavaSparkContext sc = new JavaSparkContext("local", "arbres");

sc.textFile("data/arbresalignementparis2010.csv")
    .filter(line -> !line.startsWith("geom"))                       // skip the header line
    .map(line -> line.split(";"))                                   // split each line into fields
    .mapToPair(fields -> new Tuple2<String, Integer>(fields[4], 1)) // key = species, value = 1
    .reduceByKey((x, y) -> x + y)                                   // count per species
    .sortByKey()                                                    // sort by species name
    .foreach(t -> System.out.println(t._1 + " : " + t._2));         // print each result

[Diagram: records flowing through the pipeline textFile → filter → map → mapToPair → reduceByKey → sortByKey → foreach]

Spark - Example

Acacia dealbata : 2

Acer acerifolius : 39

Acer buergerianum : 14

Acer campestre : 452

...

Spark clusters

Topology & Terminology

● One master / several workers
○ (+ one standby master)
● Submit an application to the cluster
● Execution managed by a driver

Spark in a cluster

Several options:

● YARN
● Mesos
● Standalone
○ Workers started manually
○ Workers started by the master

Storage & Processing

MapReduce:
● Spark (API)
● Distributed processing
● Fault tolerant

Storage:
● HDFS, NoSQL databases...
● Distributed storage
● Fault tolerant

Data locality

● Process the data where it is stored
● Avoid network I/O

Data locality

[Diagram: each Spark Worker is co-located with an HDFS Datanode; the Spark Master is co-located with the HDFS Namenode, each with a standby]

Demo: Spark in a cluster

Demo

$ $SPARK_HOME/sbin/start-master.sh

$ $SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker \
    spark://MBP-de-Alexis:7077 --cores 2 --memory 2G

$ mvn clean package
$ $SPARK_HOME/bin/spark-submit \
    --master spark://MBP-de-Alexis:7077 \
    --class com.seigneurin.spark.WikipediaMapReduceByKey \
    --deploy-mode cluster \
    target/pres-spark-0.0.1-SNAPSHOT.jar

Spark SQL

● Usage of an RDD in SQL
● SQL engine: converts SQL instructions to low-level instructions

Spark SQL

Prerequisites:

● Use tabular data
● Describe the schema → SchemaRDD

Describing the schema:

● Programmatic description of the data
● Schema inference through reflection (POJO, sketched below)

Creating tabular data (type Row):

JavaRDD<Row> rdd = trees.map(fields -> Row.create(
    Float.parseFloat(fields[3]),   // hauteurenm
    fields[4]));                   // espece
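For the reflection-based alternative, a sketch with a hypothetical Tree POJO (the 1.2-era Java API also offers applySchema(JavaRDD, Class); the class and field names here are assumptions):

// Hypothetical JavaBean: Spark infers the schema from its getters
public static class Tree implements Serializable {
    private float hauteurenm;
    private String espece;
    public float getHauteurenm() { return hauteurenm; }
    public void setHauteurenm(float h) { hauteurenm = h; }
    public String getEspece() { return espece; }
    public void setEspece(String e) { espece = e; }
}

JavaRDD<Tree> treeRdd = trees.map(fields -> {
    Tree t = new Tree();
    t.setHauteurenm(Float.parseFloat(fields[3]));
    t.setEspece(fields[4]);
    return t;
});
JavaSchemaRDD inferred = sqlContext.applySchema(treeRdd, Tree.class);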

Spark SQL - Example

---------------------------------------
| hauteurenm | espece                 |
---------------------------------------
| 10.0       | Aesculus hippocastanum |
| 15.0       | Tilia platyphyllos     |
| 0.0        | Platanus x hispanica   |
| 10.0       | Paulownia tomentosa    |
| ...        | ...                    |

Spark SQL - Example

● Describing the schema

List<StructField> fields = new ArrayList<StructField>();
fields.add(DataType.createStructField("hauteurenm", DataType.FloatType, false));
fields.add(DataType.createStructField("espece", DataType.StringType, false));
StructType schema = DataType.createStructType(fields);

JavaSchemaRDD schemaRDD = sqlContext.applySchema(rdd, schema);
schemaRDD.registerTempTable("tree");

---------------------------------------
| hauteurenm | espece                 |
---------------------------------------
| 10.0       | Aesculus hippocastanum |
| 15.0       | Tilia platyphyllos     |
| 0.0        | Platanus x hispanica   |
| 10.0       | Paulownia tomentosa    |
| ...        | ...                    |

Spark SQL - Example

● Counting trees by species

sqlContext
    .sql("SELECT espece, COUNT(*) FROM tree WHERE espece <> '' GROUP BY espece ORDER BY espece")
    .foreach(row -> System.out.println(row.getString(0) + " : " + row.getLong(1)));

Acacia dealbata : 2

Acer acerifolius : 39

Acer buergerianum : 14

Acer campestre : 452

...

Spark Streaming

Micro-batches

● Slices a continuous flow of data into batches
● Same API as for batch processing
● ≠ Apache Storm (which processes events one by one)

DStream

● Discretized Stream
● Sequence of RDDs
● Initialized with a Duration (the batch interval), as sketched below
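A minimal sketch, assuming a text source on a local socket (port 9999 is arbitrary):

import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

// One RDD is produced for every 5-second batch
JavaStreamingContext ssc =
    new JavaStreamingContext("local[2]", "streaming", new Duration(5000));

JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);
lines.print();           // prints the first elements of each batch

ssc.start();             // starts the processing
ssc.awaitTermination();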

Window operations

● Sliding window
● Reuses data from other windows
● Initialized with a window length and a slide interval (see the sketch below)
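For instance, continuing the sketch above, a count over the last 30 seconds recomputed every 10 seconds (durations are arbitrary, but must be multiples of the batch interval):

// Window length of 30 s, sliding every 10 s: consecutive windows
// share (reuse) the RDDs of the batches they have in common
lines.window(new Duration(30000), new Duration(10000))
     .count()     // number of elements in each window
     .print();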

Sources

● Socket
● Kafka
● Flume
● HDFS
● MQ (ZeroMQ...)
● Twitter
● ...
● Or a custom implementation of Receiver

Demo: Spark Streaming

Spark Streaming Demo

● Receive tweets with hashtag #Android (sketched below)
○ Twitter4J
● Detect the language of each tweet
○ Language Detection
● Index into Elasticsearch
● Report with Kibana 4
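The first step of the demo might look like this sketch (TwitterUtils comes from the spark-streaming-twitter module; the Twitter4J OAuth credentials are assumed to be provided as system properties):

import twitter4j.Status;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.twitter.TwitterUtils;

// DStream of tweets matching the #Android filter
String[] filters = { "#Android" };
JavaReceiverInputDStream<Status> tweets = TwitterUtils.createStream(ssc, filters);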

Demo

● Launch Elasticsearch

$ curl -X DELETE localhost:9200
$ curl -X PUT localhost:9200/spark/_mapping/tweets -d '{
  "tweets": {
    "properties": {
      "user":      { "type": "string", "index": "not_analyzed" },
      "text":      { "type": "string" },
      "createdAt": { "type": "date", "format": "date_time" },
      "language":  { "type": "string", "index": "not_analyzed" }
    }
  }
}'

● Launch Kibana → http://localhost:5601
● Launch the Spark Streaming process