Spark in the BigData dark
About me
• 6+ years in HighLoad and BigData
• 3+ years as Team / Tech Lead
• Java, Scala, JavaScript, PHP
Hadoop components
Nano History
Apache Hadoop
Pros:
• Batch operations
• Scalability
• User-defined methods
Cons:
• The problem must be solved within the context of a single job
• Filesystem based
Nano History
Tez, Pig, Hive, etc
Pros:
• Batch operations
• Runs on top of Hadoop
• Faster than MapReduce
• DAG-based execution
Cons:
• Filesystem based
Nano History
What is in-memory?
• In-memory compute grid
• In-memory data grid
In-memory compute grid
In-memory data grid
HDD vs MEMORY?
• Memory latency is in nanoseconds
• 10GbE network latency is in microseconds (~50)
• Flash latency is in microseconds (20-500+)
• Disk latency is in milliseconds (4-7)
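To put the figures above on one scale, here is a small sketch that converts them to nanoseconds and prints how many times slower each medium is than memory. The 100 ns memory latency is an assumption (the slide only says "nanoseconds"); the other values are the range endpoints quoted above.

```python
# Representative latencies in nanoseconds.
# The memory value is an ASSUMPTION; the rest come from the ranges on the slide.
LATENCY_NS = {
    "memory": 100,                   # assumed, slide only says "nanoseconds"
    "10GbE network": 50_000,         # ~50 microseconds
    "flash (best case)": 20_000,     # 20 microseconds
    "flash (worst case)": 500_000,   # 500 microseconds
    "disk (best case)": 4_000_000,   # 4 milliseconds
    "disk (worst case)": 7_000_000,  # 7 milliseconds
}

def slowdown_vs_memory(latencies):
    """Return each latency as a multiple of the memory latency."""
    base = latencies["memory"]
    return {name: ns / base for name, ns in latencies.items()}

ratios = slowdown_vs_memory(LATENCY_NS)
for name, ratio in ratios.items():
    print(f"{name:20s} {ratio:>10,.0f}x memory latency")
```

Under that assumption, a single disk access costs tens of thousands of memory accesses, which is the motivation for keeping working data in memory.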
Spark in-memory model
Apache Spark
Pros:
• In-memory operations up to 100x faster than Hadoop MapReduce
• On-disk operations up to 10x faster than Hadoop MapReduce
• In-memory
• Batch operations and near real time
• Interactive
• Not bound to Hadoop
• Easy for developers to get started
Really fast?
Is Spark popular?
[Chart: Hazelcast, Apache Spark, Apache Hadoop]
Is it popular?
The most active project
Who uses Spark?
Languages
Libraries
RDD
• Resilient == fault-tolerant
• Distributed == compute in parallel
• Dataset == collection
How to create an RDD
• parallelize
• external dataset: filesystem, HDFS, HBase, etc
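The idea behind parallelize can be sketched in plain Python: a local collection is cut into partitions, each partition is processed independently, and the partial results are combined. This is a toy model, not the Spark API; the `parallelize` and `sum_of_squares` functions here are hypothetical helpers, and a thread pool stands in for cluster executors.

```python
from concurrent.futures import ThreadPoolExecutor

def parallelize(data, num_partitions):
    """Split a local collection into contiguous, roughly equal partitions."""
    data = list(data)
    n = len(data)
    bounds = [(i * n) // num_partitions for i in range(num_partitions + 1)]
    return [data[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]

def sum_of_squares(partition):
    """Work done independently on one partition."""
    return sum(x * x for x in partition)

partitions = parallelize(range(1, 11), num_partitions=4)

# Each partition is handled by its own worker; results are combined at the end.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum_of_squares, partitions))

total = sum(partial_sums)
print(partitions)
print(total)  # 385
```

In Spark the partitions would live on different machines, but the shape of the computation (partition, compute locally, combine) is the same.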
Lazy RDD
Transformations
• map
• filter
• flatMap
• mapPartitions
• mapPartitionsWithIndex
• union
• intersection
• distinct
• groupByKey
• reduceByKey
• join
Actions
• collect
• count
• first
• take(n)
• reduce
• countByKey
• foreach
• takeOrdered
• takeSample
• saveAsTextFile
• saveAsSequenceFile
• saveAsObjectFile
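The split between transformations and actions can be illustrated with a toy model in plain Python. `LazyDataset` is a hypothetical class, not part of Spark: its transformations only record what should be done, and nothing is computed until an action is called. Every action replays the recorded steps from the source, which loosely mirrors how an RDD can be recomputed from its lineage.

```python
import itertools

class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source  # callable returning a fresh iterable
        self._steps = steps    # recorded transformations, in order

    # Transformations: return a new dataset, compute nothing.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _iterate(self):
        """Replay the recorded steps over the source as a lazy iterator."""
        items = iter(self._source())
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger the actual computation.
    def collect(self):
        return list(self._iterate())

    def count(self):
        return sum(1 for _ in self._iterate())

    def take(self, n):
        return list(itertools.islice(self._iterate(), n))

calls = []  # records every element the map function actually touches

def traced_square(x):
    calls.append(x)
    return x * x

ds = LazyDataset(lambda: range(1, 6)).map(traced_square).filter(lambda x: x % 2 == 1)
print(calls)         # [] because no action has run yet
print(ds.collect())  # [1, 9, 25]
print(calls)         # [1, 2, 3, 4, 5]
```

Because the pipeline is lazy, an action such as `take(1)` only pulls as many elements through the steps as it needs.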
Example
DataFrame
• Distributed collection of data organized into named columns
• SQL-like syntax
• Catalyst Optimizer
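To make "named columns" and "SQL-like" concrete, here is a plain-Python sketch in which rows are dictionaries and the query is expressed by column name rather than by opaque functions over tuples. The data, and the `where` and `group_avg` helpers, are hypothetical and are not the Spark DataFrame API; the query corresponds roughly to `SELECT dept, AVG(age) FROM people WHERE age > 30 GROUP BY dept`.

```python
from collections import defaultdict

# Hypothetical sample data: each row has named columns.
people = [
    {"name": "alice", "dept": "eng", "age": 34},
    {"name": "bob",   "dept": "eng", "age": 28},
    {"name": "carol", "dept": "ops", "age": 45},
    {"name": "dave",  "dept": "ops", "age": 39},
]

def where(rows, column, predicate):
    """Keep rows whose value in `column` satisfies `predicate`."""
    return [row for row in rows if predicate(row[column])]

def group_avg(rows, key, value):
    """Average of column `value` for each distinct value of column `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: sum(v) / len(v) for k, v in groups.items()}

avg_age_by_dept = group_avg(where(people, "age", lambda a: a > 30), "dept", "age")
print(avg_age_by_dept)  # {'eng': 34.0, 'ops': 42.0}
```

Because the query names the columns it touches, an optimizer can see which columns and rows are actually needed before running anything, which is the kind of information a query optimizer such as Catalyst works from.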
DataFrame vs RDD
RDD
Cluster Overview
Cluster managers
• Standalone
• Apache Mesos
• Hadoop YARN
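The choice of cluster manager mostly shows up in the master URL an application is started with. The helper below is a hypothetical sketch that builds that string for each manager listed above; the host names are made up, and the ports are the usual defaults (7077 for standalone, 5050 for Mesos).

```python
# Usual default ports for the managers that are addressed by host and port.
DEFAULT_PORTS = {"standalone": 7077, "mesos": 5050}

def master_url(manager, host=None, port=None):
    """Build the Spark master string for a given cluster manager."""
    if manager == "standalone":
        return f"spark://{host}:{port or DEFAULT_PORTS['standalone']}"
    if manager == "mesos":
        return f"mesos://{host}:{port or DEFAULT_PORTS['mesos']}"
    if manager == "yarn":
        # Spark 1.x used "yarn-client" / "yarn-cluster"; Spark 2.0+ uses "yarn".
        # The ResourceManager address comes from the Hadoop configuration,
        # not from the master string.
        return "yarn-client"
    raise ValueError(f"unknown cluster manager: {manager}")

# Hypothetical host names, for illustration only.
print(master_url("standalone", host="master.example.com"))
print(master_url("mesos", host="mesos.example.com"))
print(master_url("yarn"))
```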
DEMO
• Standalone
• 1.5G dataset
• 2G RAM executor
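A rough back-of-the-envelope check of this setup, as a sketch under stated assumptions: with the legacy memory settings of Spark 1.x, the share of executor heap available for cached data is approximately heap x `spark.storage.memoryFraction` (default 0.6) x `spark.storage.safetyFraction` (default 0.9). The number of executors in the demo is not stated, and the in-memory size of a dataset usually differs from its size on disk, so the result is only indicative.

```python
import math

EXECUTOR_HEAP_MB = 2 * 1024         # 2G RAM executor (from the slide)
DATASET_MB = 1.5 * 1024             # 1.5G dataset (from the slide)

# ASSUMPTION: Spark 1.x legacy memory model defaults.
STORAGE_MEMORY_FRACTION = 0.6       # spark.storage.memoryFraction
STORAGE_SAFETY_FRACTION = 0.9       # spark.storage.safetyFraction

cache_mb = EXECUTOR_HEAP_MB * STORAGE_MEMORY_FRACTION * STORAGE_SAFETY_FRACTION
fits_in_one_executor = DATASET_MB <= cache_mb
executors_needed = math.ceil(DATASET_MB / cache_mb)

print(f"cache space per executor: {cache_mb:.0f} MB")
print(f"dataset fits in one executor's cache: {fits_in_one_executor}")
print(f"executors needed to cache it fully: {executors_needed}")
```

Under these assumptions a single 2G executor cannot hold the whole dataset in its cache, so part of the data would be recomputed or read again from disk.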
Thank You!