Spark in the BigData dark


Page 1: Spark in the BigData dark

Spark in the BigData dark

Page 2: Spark in the BigData dark

About me

• 6+ years in HighLoad and BigData

• 3+ years as Team / Tech Lead

• Java, Scala, JavaScript, PHP

Page 3: Spark in the BigData dark

Hadoop components

Page 4: Spark in the BigData dark

Nano History

Page 5: Spark in the BigData dark

Apache Hadoop

Pros:

• Batch operations

• Scalability

• User-defined methods

Cons:

• The problem must be solved within a single job

• Filesystem based

Page 6: Spark in the BigData dark

Nano History

Page 7: Spark in the BigData dark

Tez, Pig, Hive, etc.

Pros:

• Batch operations

• Run on top of Hadoop

• Faster than MapReduce

• DAG

Cons:

• Filesystem based

Page 8: Spark in the BigData dark

Nano History

Page 9: Spark in the BigData dark

What is in-memory?

• In-memory compute grid

• In-memory data grid

Page 10: Spark in the BigData dark

In-memory compute grid

Page 11: Spark in the BigData dark

In-memory data grid

Page 12: Spark in the BigData dark

HDD vs MEMORY?

• Memory speed is in nanoseconds

• 10GbE network speed is in microseconds (~50)

• Flash speed is in microseconds (between 20 and 500+)

• Disk speed is in milliseconds (between 4 and 7)
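To put these orders of magnitude side by side, the sketch below converts the slide's figures to nanoseconds and expresses each one as a multiple of memory latency. The 100 ns memory value is an assumption (the slide only says "nanoseconds"), and the names `LATENCY_NS` and `slowdown_vs_memory` are illustrative only.

```python
# Latency figures from the slide, converted to nanoseconds.
# ASSUMPTION: memory is taken as 100 ns; the slide gives no exact number.
LATENCY_NS = {
    "memory": 100,
    "10GbE network": 50_000,        # ~50 microseconds
    "flash (best case)": 20_000,    # 20 microseconds
    "flash (worst case)": 500_000,  # 500 microseconds
    "disk (best case)": 4_000_000,  # 4 milliseconds
    "disk (worst case)": 7_000_000, # 7 milliseconds
}


def slowdown_vs_memory(latencies, baseline="memory"):
    """Return how many times slower each medium is than the baseline."""
    base = latencies[baseline]
    return {name: value / base for name, value in latencies.items()}


if __name__ == "__main__":
    for name, factor in slowdown_vs_memory(LATENCY_NS).items():
        print(f"{name:20s} ~{factor:,.0f}x memory latency")
```

Under that assumption a single disk access costs tens of thousands of memory accesses, which is the motivation for keeping working data in memory.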

Page 13: Spark in the BigData dark

Spark in-memory model

Page 14: Spark in the BigData dark

Apache Spark

Pros:

• In-memory operations up to 100x faster than Hadoop MapReduce

• On-disk operations up to 10x faster than Hadoop MapReduce

• In-memory

• Batch operations & near real time

• Interactive

• Not bound to Hadoop

• Easy to start for developers

Page 15: Spark in the BigData dark

Really fast?

Page 16: Spark in the BigData dark

Is Spark popular?

(Chart legend: Hazelcast, Apache Spark, Apache Hadoop)

Page 17: Spark in the BigData dark

Is it popular?

Page 18: Spark in the BigData dark

The most active project

Page 19: Spark in the BigData dark

Who uses Spark?

Page 20: Spark in the BigData dark

Languages

Page 21: Spark in the BigData dark

Libraries

Page 22: Spark in the BigData dark

RDD

• Resilient == fault-tolerant

• Distributed == compute in parallel

• Dataset == collection

Page 23: Spark in the BigData dark

How to create an RDD

• parallelize

• external dataset: filesystem, HDFS, HBase, etc.
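As a rough illustration of what "distributed" and `parallelize` mean, here is a plain-Python toy that splits a local collection into partitions and processes each partition independently. It is not Spark code: `toy_parallelize` and `map_partitions` are hypothetical names, and a thread pool stands in for a cluster of executors.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_parallelize(data, num_slices):
    """Split a local collection into num_slices contiguous partitions."""
    if num_slices < 1:
        raise ValueError("num_slices must be positive")
    items = list(data)
    n = len(items)
    # Integer arithmetic keeps the partitions contiguous and nearly equal in size.
    return [
        items[i * n // num_slices:(i + 1) * n // num_slices]
        for i in range(num_slices)
    ]


def map_partitions(partitions, func):
    """Apply func to every partition; threads stand in for cluster workers."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(func, partitions))


if __name__ == "__main__":
    parts = toy_parallelize(range(10), 3)
    print(parts)                        # three contiguous partitions
    partial_sums = map_partitions(parts, sum)
    print(sum(partial_sums))            # combine the per-partition results
```

Each partition is summed on its own and only the small partial results are combined, which is the same shape of computation that an RDD spreads across machines.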

Page 24: Spark in the BigData dark

Lazy RDD

Transformations

• map

• filter

• flatMap

• mapPartitions

• mapPartitionsWithIndex

• union

• intersection

• distinct

• groupByKey

• reduceByKey

• join

Actions

• collect

• count

• first

• take(n)

• reduce

• countByKey

• foreach

• takeOrdered

• takeSample

• saveAsTextFile

• saveAsSequenceFile

• saveAsObjectFile
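The split between lazy transformations and actions can be sketched with a small plain-Python class. This is a toy stand-in, not Spark's implementation: `LazyDataset` is a hypothetical name, and only a few of the operations listed on this slide are modelled. Transformations merely record a step; nothing runs until an action is called.

```python
from itertools import chain, islice


class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source       # any re-iterable collection
        self._steps = tuple(steps)  # recorded transformations, in order

    # Transformations: return a new dataset and do no work yet.
    def map(self, func):
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def flat_map(self, func):
        return LazyDataset(self._source, self._steps + (("flat_map", func),))

    def _run(self):
        """Build the iterator pipeline from the recorded steps."""
        stream = iter(self._source)
        for kind, func in self._steps:
            if kind == "map":
                stream = map(func, stream)
            elif kind == "filter":
                stream = filter(func, stream)
            else:  # flat_map
                stream = chain.from_iterable(map(func, stream))
        return stream

    # Actions: trigger the computation and return a result.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def take(self, n):
        return list(islice(self._run(), n))


if __name__ == "__main__":
    calls = []

    def traced_square(x):
        calls.append(x)  # remember that the function really ran
        return x * x

    ds = LazyDataset(range(1, 6)).map(traced_square).filter(lambda x: x % 2 == 1)
    print(calls)         # still empty: no action has been called
    print(ds.collect())  # the action runs the whole pipeline
```

Because the pipeline is only described until an action arrives, the engine is free to plan the whole chain at once instead of materializing every intermediate result.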

Page 25: Spark in the BigData dark

Example

Page 26: Spark in the BigData dark

DataFrame

• Distributed collection of data organized into named columns

• SQL-like syntax

• Catalyst Optimizer
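Because a DataFrame query is a declarative plan over named columns, an optimizer can rewrite it before anything executes. The toy below illustrates one such rewrite, pushing a filter below a projection, in plain Python. It is only in the spirit of what Catalyst does: the classes `Scan`, `Filter`, `Project` and the functions `optimize` and `execute` are hypothetical and are not Spark's API.

```python
from dataclasses import dataclass, replace
from typing import Any, Tuple


@dataclass
class Scan:
    rows: Tuple[dict, ...]  # each row maps a column name to a value


@dataclass
class Filter:
    child: Any
    column: str
    value: Any


@dataclass
class Project:
    child: Any
    columns: Tuple[str, ...]


def execute(plan):
    """Evaluate a plan tree bottom-up and return a list of rows."""
    if isinstance(plan, Scan):
        return list(plan.rows)
    if isinstance(plan, Filter):
        return [r for r in execute(plan.child) if r[plan.column] == plan.value]
    if isinstance(plan, Project):
        return [{c: r[c] for c in plan.columns} for r in execute(plan.child)]
    raise TypeError(f"unknown plan node: {plan!r}")


def optimize(plan):
    """Rewrite Filter(Project(x)) into Project(Filter(x)) wherever it occurs."""
    if isinstance(plan, Filter) and isinstance(plan.child, Project):
        project = plan.child
        pushed = optimize(Filter(project.child, plan.column, plan.value))
        return Project(pushed, project.columns)
    if isinstance(plan, (Filter, Project)):
        return replace(plan, child=optimize(plan.child))
    return plan


if __name__ == "__main__":
    people = Scan(rows=(
        {"name": "Ann", "age": 31, "city": "A"},
        {"name": "Bob", "age": 25, "city": "B"},
        {"name": "Eve", "age": 31, "city": "C"},
    ))
    query = Filter(Project(people, ("name", "age")), "age", 31)
    print(optimize(query))           # the filter now sits directly on the scan
    print(execute(optimize(query)))  # same rows as execute(query)
```

The optimized plan filters rows first and projects only the survivors, yet returns the same result as the original plan. With opaque functions passed to RDD operations, an engine has far less room for this kind of rewrite.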

Page 27: Spark in the BigData dark

• Catalyst Optimizer

Page 28: Spark in the BigData dark

DataFrame vs RDD

Page 29: Spark in the BigData dark

RDD

Page 30: Spark in the BigData dark

Cluster Overview

Page 31: Spark in the BigData dark

Cluster managers

• Standalone

• Apache Mesos

• Hadoop YARN
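The cluster manager is normally selected through the master URL passed to Spark. The helper below is a small sketch that maps the common master URL forms to the managers listed above. The function name `cluster_manager_for` and the host names are hypothetical, and the `yarn-` prefix check is there for the older `yarn-client` / `yarn-cluster` spellings.

```python
def cluster_manager_for(master_url):
    """Return the cluster manager implied by a Spark master URL."""
    if master_url.startswith("spark://"):
        return "Standalone"
    if master_url.startswith("mesos://"):
        return "Apache Mesos"
    if master_url == "yarn" or master_url.startswith("yarn-"):
        return "Hadoop YARN"
    if master_url == "local" or master_url.startswith("local["):
        return "local mode (no cluster manager)"
    raise ValueError(f"unrecognized master URL: {master_url!r}")


if __name__ == "__main__":
    # Host names below are placeholders, not real machines.
    for url in ("spark://master-host:7077", "mesos://mesos-host:5050",
                "yarn", "local[*]"):
        print(f"{url:28s} -> {cluster_manager_for(url)}")
```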

Page 32: Spark in the BigData dark

DEMO

• Standalone

• 1.5G dataset

• 2G RAM executor

Page 34: Spark in the BigData dark

DEMO 2

https://goo.gl/xbnANN


Page 35: Spark in the BigData dark

Reference list

• https://spark.apache.org

• https://databricks.com/blog

• http://hadoop.apache.org/docs/current

• http://www.gridgain.com

• https://www.google.com/trends

• http://blog.revolutionanalytics.com/2013/12/apache-spark.html

• http://0xdata.com/blog/2014/09/Sparkling-Water/

• http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

• https://spark.apache.org/docs/1.3.1/job-scheduling.html

• https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

• https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

• http://aryannava.com/2014/02/19/apache-hadoop-ecosystem/

• http://www.gridgain.com/in-memory-compute-grid-explained/

• http://gridgain.blogspot.com/2012/11/gridgain-and-hadoop-differences-and.html

• http://blog.infinio.com/relative-speeds-from-ram-to-flash-to-disk

Page 36: Spark in the BigData dark

Thank You!