
Spark - The beginnings


Page 2: Spark -  The beginnings

Spark – The beginnings
Daniel Leon
Optymyze

[7th of November 2015]

Page 3: Spark -  The beginnings

Content
1) Hadoop Dilemma
2) Processing engines war
3) Spark ecosystem
4) Resilient Distributed Datasets
5) Spark application workflow
6) Conclusion

Page 4: Spark -  The beginnings

Big data technologies

Page 5: Spark -  The beginnings

Hadoop – Where does it end?

Page 6: Spark -  The beginnings

Hadoop Architecture

Page 7: Spark -  The beginnings

Hadoop evolution

Page 8: Spark -  The beginnings

MapReduce workflow

Page 9: Spark -  The beginnings

Hadoop ecosystem

Page 10: Spark -  The beginnings

Beyond MapReduce

• Complex iterative algorithms
• Interactive queries
• Real-time processing

Page 11: Spark -  The beginnings

Different processing model

• More operations available
• Flexible ways of composing operations
• Pluggable data sources
• Built-in streaming capabilities
• Pluggable algorithms

Page 12: Spark -  The beginnings

Searching for another processing engine

Page 13: Spark -  The beginnings

Processing engine comparison

Page 14: Spark -  The beginnings

Processing engine comparison

Page 15: Spark -  The beginnings

Spark ecosystem

Page 16: Spark -  The beginnings

Spark ecosystem

Page 17: Spark -  The beginnings

100TB Daytona Sort Competition 2014

Page 18: Spark -  The beginnings

Resilient Distributed Dataset - RDD

• Stored in memory or on disk
• Immutable
• Enables parallel operations on collections of elements
• Contains lineage information
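A minimal PySpark sketch of these properties (the variable names and data are illustrative, not from the slides): transformations return new RDDs instead of modifying existing ones, and each RDD records the lineage needed to recompute lost partitions.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

numbers = sc.parallelize(range(10), numSlices=2)   # partitioned collection, processed in parallel
doubled = numbers.map(lambda x: x * 2)             # a new RDD; `numbers` itself is never modified

# The recorded lineage is what lets Spark recompute lost partitions after a failure.
print(doubled.toDebugString().decode("utf-8"))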

Page 19: Spark -  The beginnings

Resilient Distributed Dataset - RDD

Page 20: Spark -  The beginnings

Constructing RDDs

• Parallelize existing collections
    rdd = sc.parallelize(["a", "b", "c"])
• From files in HDFS, S3, Hive
    linesRDD = sc.textFile("README")
• Transforming an existing RDD
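The one-liners above, gathered into a runnable PySpark sketch (the file name README.md and the variable names are placeholders):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# 1) Parallelize an existing collection held by the driver
letters = sc.parallelize(["a", "b", "c"])

# 2) Read from external storage (a local path here; hdfs:// or s3a:// URIs work the same way)
lines = sc.textFile("README.md")

# 3) Transform an existing RDD into a new one
upper = letters.map(lambda s: s.upper())

print(upper.collect())   # ['A', 'B', 'C']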

Page 21: Spark -  The beginnings

Operations on RDDs
• Transformations – lazy
    filter, map, groupBy
• Actions
    count, collect
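A short sketch (illustrative data and names) of this split: transformations only record lineage, and nothing is computed until an action is called.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

words = sc.parallelize(["spark", "hadoop", "spark", "flink"])

# Transformations: lazy, only the lineage is recorded
long_words = words.filter(lambda w: len(w) > 4)
tagged = long_words.map(lambda w: (w, 1))
grouped = tagged.groupBy(lambda kv: kv[0])

# Actions: these trigger evaluation of the whole chain above
print(grouped.count())
print(words.collect())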

Page 22: Spark -  The beginnings

Spark terminology
• Job – the work required to compute an RDD
• Stage – a wave of work within a job, corresponding to one or more pipelined RDDs
• Task – a unit of work within a stage, corresponding to one RDD partition
• Shuffle – the transfer of data between stages
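As an illustration of these terms (an assumed example, not from the slides): the groupBy below introduces a shuffle, so the job triggered by count() runs as two stages, with one task per partition in each stage.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

data = sc.parallelize(range(100), numSlices=4)               # 4 partitions -> 4 tasks in the first stage
by_parity = data.groupBy(lambda x: x % 2, numPartitions=4)   # shuffle boundary; 4 tasks in the second stage

# The lineage shows the shuffle dependency that splits the job into stages
print(by_parity.toDebugString().decode("utf-8"))

by_parity.count()   # the action that submits the job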

Page 23: Spark -  The beginnings

Spark architecture

Page 24: Spark -  The beginnings

Conclusion
Spark is:
• A complete, standalone solution for distributed processing
• A fluent API
• Pluggable with other big data frameworks
• One of the most actively contributed-to Apache projects

Page 25: Spark -  The beginnings

Documentation

https://hadoopecosystemtable.github.io
https://databricks.com/spark/developer-resources
https://databricks.com/resources/slides
https://databricks.com/spark/training

Page 26: Spark -  The beginnings

Spark – The beginnings
Daniel Leon
Optymyze

[7th of November 2015]
