
Spark - The beginnings


Page 2: Spark -  The beginnings

Spark – The beginnings
Daniel Leon
Optymyze

[7th of November 2015]

Page 3: Spark -  The beginnings

Content
1) Hadoop Dilemma
2) Processing engines war
3) Spark ecosystem
4) Resilient Distributed Datasets
5) Spark application workflow
6) Conclusion

Page 4: Spark -  The beginnings

Big data technologies

Page 5: Spark -  The beginnings

Hadoop – Where does it end?

Page 6: Spark -  The beginnings

Hadoop Architecture

Page 7: Spark -  The beginnings

Hadoop evolution

Page 8: Spark -  The beginnings

MapReduce workflow

Page 9: Spark -  The beginnings

Hadoop ecosystem

Page 10: Spark -  The beginnings

Beyond MapReduce

• Complex iterative algorithms
• Interactive queries
• Real-time processing

Page 11: Spark -  The beginnings

Different processing model

• More operations available
• Flexible ways of composing operations
• Pluggable data sources
• Built-in streaming capabilities
• Pluggable algorithms

Page 12: Spark -  The beginnings

Searching for another processing engine

Page 13: Spark -  The beginnings

Processing engine comparison

Page 14: Spark -  The beginnings

Processing engine comparison

Page 15: Spark -  The beginnings

Spark ecosystem

Page 16: Spark -  The beginnings

Spark ecosystem

Page 17: Spark -  The beginnings

100TB Daytona Sort Competition 2014

Page 18: Spark -  The beginnings

Resilient Distributed Dataset - RDD

• Stored in memory or on disk
• Immutable
• Enables parallel operations on collections of elements
• Contains lineage information
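A minimal PySpark sketch of these properties (the variable names and data are illustrative, not from the slides): transformations return new RDDs instead of modifying existing ones, and each RDD records the lineage needed to recompute lost partitions.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

numbers = sc.parallelize(range(10), numSlices=2)   # partitioned collection, processed in parallel
doubled = numbers.map(lambda x: x * 2)             # a new RDD; `numbers` itself is never modified

# The recorded lineage is what lets Spark recompute lost partitions after a failure.
print(doubled.toDebugString().decode("utf-8"))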

Page 19: Spark -  The beginnings

Resilient Distributed Dataset - RDD

Page 20: Spark -  The beginnings

Constructing RDDs

• Parallelize existing collections
    rdd = sc.parallelize(["a", "b", "c"])
• From files in HDFS, S3, Hive
    linesRDD = sc.textFile("README")
• Transforming an existing RDD
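The one-liners above, gathered into a runnable PySpark sketch (the file name README.md and the variable names are placeholders):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# 1) Parallelize an existing collection held by the driver
letters = sc.parallelize(["a", "b", "c"])

# 2) Read from external storage (a local path here; hdfs:// or s3a:// URIs work the same way)
lines = sc.textFile("README.md")

# 3) Transform an existing RDD into a new one
upper = letters.map(lambda s: s.upper())

print(upper.collect())   # ['A', 'B', 'C']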

Page 21: Spark -  The beginnings

Operations on RDDs
• Transformations – lazy
    filter, map, groupBy
• Actions
    count, collect
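A short sketch (illustrative data and names) of this split: transformations only record lineage, and nothing is computed until an action is called.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

words = sc.parallelize(["spark", "hadoop", "spark", "flink"])

# Transformations: lazy, only the lineage is recorded
long_words = words.filter(lambda w: len(w) > 4)
tagged = long_words.map(lambda w: (w, 1))
grouped = tagged.groupBy(lambda kv: kv[0])

# Actions: these trigger evaluation of the whole chain above
print(grouped.count())
print(words.collect())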

Page 22: Spark -  The beginnings

Spark terminology
• Job – the work required to compute an RDD
• Stage – a wave of work within a job, corresponding to one or more pipelined RDDs
• Task – a unit of work within a stage, corresponding to one RDD partition
• Shuffle – the transfer of data between stages
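As an illustration of these terms (an assumed example, not from the slides): the groupBy below introduces a shuffle, so the job triggered by count() runs as two stages, with one task per partition in each stage.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

data = sc.parallelize(range(100), numSlices=4)               # 4 partitions -> 4 tasks in the first stage
by_parity = data.groupBy(lambda x: x % 2, numPartitions=4)   # shuffle boundary; 4 tasks in the second stage

# The lineage shows the shuffle dependency that splits the job into stages
print(by_parity.toDebugString().decode("utf-8"))

by_parity.count()   # the action that submits the job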

Page 23: Spark -  The beginnings

Spark architecture

Page 24: Spark -  The beginnings

Conclusion
Spark is:
• A complete, standalone solution for distributed processing
• A fluent API
• Pluggable with other big data frameworks
• One of the most actively contributed-to Apache projects

Page 25: Spark -  The beginnings

Documentation

https://hadoopecosystemtable.github.io
https://databricks.com/spark/developer-resources
https://databricks.com/resources/slides
https://databricks.com/spark/training

Page 26: Spark -  The beginnings

Spark – The beginnings
Daniel Leon
Optymyze

[7th of November 2015]
