Spark in the BigData dark

About me

• 6+ years in HighLoad and BigData

• 3+ years as Team / Tech Lead

• Java, Scala, JavaScript, PHP

Hadoop components

Nano History

Apache Hadoop

Pros:

• Batch operations

• Scalability

• User-defined methods

Cons:

• The problem must be solved within the context of a single job

• Filesystem based
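To make the "single job" constraint concrete, here is a minimal sketch of the map, shuffle, and reduce phases as plain Python functions. It is an in-memory stand-in rather than Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample input are assumptions made for illustration, and a real Hadoop job would read its input from and write its output to the filesystem.

```python
from collections import defaultdict


def map_phase(line):
    # Emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Collapse all values for one key into a single result.
    return key, sum(values)


def run_job(lines):
    # The whole problem has to fit this one map -> shuffle -> reduce shape.
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical sample input, not taken from the talk.
counts = run_job(["spark in the dark", "the big data dark"])
print(counts)
```

Anything that does not fit one such pass has to be expressed as a chain of jobs, each handing its result to the next through the filesystem.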

Nano History

Tez, Pig, Hive, etc

Pros:

• Batch operations

• Over Hadoop

• Faster than MapReduce

• DAG

Cons:

• Filesystem based

Nano History

What is in-memory?

• In-memory compute grid

• In-memory data grid

In-memory compute grid

In-memory data grid

HDD vs MEMORY?

• Memory speed is in nanoseconds

• 10GbE network speed is in microseconds (~50)

• Flash speed is in microseconds (between 20 and 500+)

• Disk speed is in milliseconds (between 4 and 7)
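The gap between these tiers is easier to see as ratios. The sketch below converts the figures from the slide to nanoseconds and divides by the memory latency. The 100 ns memory value is an assumption (the slide only says "nanoseconds"), and the flash and disk entries use the lower bound of each quoted range.

```python
# Latencies in nanoseconds, taken from the slide where a number is given.
LATENCY_NS = {
    "memory": 100,               # ASSUMPTION: the slide only says "nanoseconds"
    "10GbE network": 50_000,     # ~50 microseconds
    "flash (best case)": 20_000,     # 20 microseconds, lower bound of the range
    "disk (best case)": 4_000_000,   # 4 milliseconds, lower bound of the range
}


def slowdown_vs_memory(latency_ns):
    # How many memory accesses fit into one access of this tier.
    return latency_ns / LATENCY_NS["memory"]


ratios = {name: slowdown_vs_memory(ns) for name, ns in LATENCY_NS.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:,.0f}x memory latency")
```

Under that assumption even the fastest disk access costs tens of thousands of memory accesses, which is the motivation for keeping working data in memory.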

Spark in-memory model

Apache Spark

Pros:

• In-memory operations up to 100x faster than Hadoop MapReduce

• On-disk operations up to 10x faster than Hadoop MapReduce

• In-memory

• Batch operations & near real time

• Interactive

• Not bound to Hadoop

• Easy to start for developers

Really fast?

Is Spark popular?

Hazelcast, Apache Spark, Apache Hadoop

Is it popular?

The most active project

Who uses Spark?

Languages

Libraries

RDD

• Resilient == fault-tolerant

• Distributed == compute in parallel

• Dataset == collection
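The three words can be illustrated with a toy model in plain Python. This is a sketch of the idea only, not Spark's implementation: `split_into_partitions`, `compute_partition`, and the sample lineage are hypothetical names introduced here. The dataset is split into partitions, the partitions are computed in parallel, and a lost result is rebuilt by replaying the recorded steps (the lineage) on the source partition.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    # Dataset + Distributed: one collection, split into independent chunks.
    return [data[i::num_partitions] for i in range(num_partitions)]


def compute_partition(source_partition, lineage):
    # Apply every recorded step, in order, to one partition.
    rows = source_partition
    for step in lineage:
        rows = [step(row) for row in rows]
    return rows


source = split_into_partitions(list(range(10)), 3)
lineage = [lambda x: x * 2, lambda x: x + 1]  # hypothetical steps

# Distributed: partitions are processed in parallel (threads stand in for workers).
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda part: compute_partition(part, lineage), source))

# Resilient: simulate losing one partition's result, then rebuild it from lineage.
lost = results[1]
results[1] = None
results[1] = compute_partition(source[1], lineage)
recovered = results[1]
print(recovered)
```

Only the lost partition is recomputed; the other partitions are untouched.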

How to create an RDD

• parallelize

• external dataset: filesystem, HDFS, HBase, etc

Lazy RDD

Transformations:

• map

• filter

• flatMap

• mapPartitions

• mapPartitionsWithIndex

• union

• intersection

• distinct

• groupByKey

• reduceByKey

• join

Actions:

• collect

• count

• first

• take(n)

• reduce

• countByKey

• foreach

• takeOrdered

• takeSample

• saveAsTextFile

• saveAsSequenceFile

• saveAsObjectFile
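The split between transformations and actions is what makes an RDD lazy: a transformation only records a step, and work happens when an action asks for a result. The sketch below imitates that behaviour in plain Python. `LazyDataset` is a hypothetical toy class, not part of Spark, and it borrows only the method names `map`, `filter`, `collect`, `count`, and `take` from the lists above.

```python
import itertools


class LazyDataset:
    """Toy model of lazy evaluation; hypothetical, not Spark's RDD."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = steps

    # Transformations: record a step and return a new dataset, run nothing.
    def map(self, fn):
        return LazyDataset(self._data, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._data, self._steps + (("filter", predicate),))

    def _run(self):
        # Build the iterator chain from the recorded steps.
        rows = iter(self._data)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return rows

    # Actions: execute the recorded steps and return a result.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def take(self, n):
        return list(itertools.islice(self._run(), n))


calls = []


def traced_square(x):
    calls.append(x)  # record every real invocation
    return x * x


odd_squares = LazyDataset(range(1, 6)).map(traced_square).filter(lambda x: x % 2 == 1)
calls_before_action = len(calls)  # nothing has run yet
result = odd_squares.collect()    # the action triggers the computation
print(calls_before_action, result, len(calls))
```

In this toy model, as with an uncached RDD, every action replays the recorded steps from the source data.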

Example

DataFrame

• Distributed collection of data organized into named columns

• SQL like syntax

• Catalyst Optimizer

DataFrame vs RDD

RDD

Cluster Overview

Cluster managers

• Standalone

• Apache Mesos

• Hadoop YARN

DEMO

• Standalone

• 1.5G dataset

• 2G RAM executor

DEMO 2

https://goo.gl/xbnANN

Reference list

https://spark.apache.org

https://databricks.com/blog

http://hadoop.apache.org/docs/current

http://www.gridgain.com

https://www.google.com/trends

http://blog.revolutionanalytics.com/2013/12/apache-spark.html

http://0xdata.com/blog/2014/09/Sparkling-Water/

http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

https://spark.apache.org/docs/1.3.1/job-scheduling.html

https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

http://aryannava.com/2014/02/19/apache-hadoop-ecosystem/

http://www.gridgain.com/in-memory-compute-grid-explained/

http://gridgain.blogspot.com/2012/11/gridgain-and-hadoop-differences-and.html

http://blog.infinio.com/relative-speeds-from-ram-to-flash-to-disk

Thank You!
