www.edureka.co/apache-spark-scala-training Performance of Spark vs MapReduce

What will you learn today?

Beyond Hadoop MapReduce

How is Spark better than MapReduce?

Benchmark: Spark vs MapReduce

Hands-On: Analyzing Data with Spark

Word Count Problem - MapReduce

MapReduce Code for a Simple Word Count Problem
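
The code shown on this slide did not survive in the text. As a rough stand-in, here is a minimal sketch of the three MapReduce phases (map, shuffle, reduce) for word count. It is plain Python rather than the Hadoop Java API, and the function names and sample input are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on an in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    sample = ["Spark and Hadoop", "Spark and MapReduce"]  # hypothetical input
    print(word_count(sample))
```

In a real Hadoop job the mapper and reducer are separate classes, and the framework performs the shuffle across machines; the sketch only mirrors the data flow.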

Apache Spark

Apache Spark is a general-purpose data processing engine with in-memory computing capabilities.

Spark provides APIs for Scala, Java, Python, and R, which has helped make it widely adopted for data processing.

How does Spark fit into the Hadoop Ecosystem?

Spark is intended to enhance, not replace, the Hadoop stack.

Spark is designed to read and write data in HDFS as well as in other storage systems such as Amazon S3 and NoSQL databases, and it can work with plain file formats such as CSV.

Word Count Problem - Spark

Spark Scala Code for Word Count Problem

Spark Python Code for Word Count Problem

Processing data with Spark is much simpler than with MapReduce, and Spark gives you the flexibility to choose your preferred language: Scala, Java, Python, and so on.
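
The Spark code from the slides is not present in the text, so the following is a minimal sketch of word count with the PySpark RDD API, not the presenter's original code. It assumes PySpark is installed and that `path` points to a readable text file; the helper names `tokenize` and `spark_word_count` are hypothetical.

```python
from operator import add
from typing import List, Tuple


def tokenize(line: str) -> List[str]:
    """Split one line into lowercase words on whitespace."""
    return line.lower().split()


def spark_word_count(path: str, master: str = "local[*]") -> List[Tuple[str, int]]:
    """Count words in the text file at `path` using the RDD API."""
    # Imported lazily so tokenize() stays usable without a Spark installation.
    from pyspark import SparkContext

    sc = SparkContext(master, "WordCount")
    try:
        return (
            sc.textFile(path)             # one record per line
            .flatMap(tokenize)            # one record per word
            .map(lambda word: (word, 1))  # pair each word with a count of 1
            .reduceByKey(add)             # sum the counts for each word
            .collect()                    # bring the results to the driver
        )
    finally:
        sc.stop()
```

The whole job is a single chain of transformations, whereas a MapReduce version needs separate mapper, reducer, and driver code.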

Why Spark for Big Data Analytics?

What makes Spark suitable for Big Data analytics?

The following features make Spark well suited to Big Data analytics:

Spark simplifies data analysis

Spark provides built-in libraries to do advanced analytics

Spark speaks more than one language

Spark provides faster results

Spark allows you to use different Hadoop vendors

Benchmark: Spark is Blazingly Fast

Isn’t Spark In-Memory Only?

But I have heard that Spark is good only for in-memory processing?

Spark: Best of Both Worlds

It’s a common misconception that Spark is only for in-memory processing. From its inception, Spark was designed to be a general execution engine that works both in memory and on disk. Almost all Spark operators perform external operations (that is, they spill to disk) when data does not fit in memory.
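
To make "external operation" concrete, here is a minimal sketch of an external sort in plain Python: sorted chunks are spilled to temporary files once an in-memory limit is reached and then merged as streams. This illustrates the general technique only and is not Spark's implementation; `external_sort` and `max_in_memory` are hypothetical names.

```python
import heapq
import tempfile
from typing import Iterable, Iterator, List


def _spill(sorted_chunk: List[int]):
    """Write one sorted chunk to a temporary file and rewind it."""
    f = tempfile.TemporaryFile(mode="w+")
    f.writelines(f"{value}\n" for value in sorted_chunk)
    f.seek(0)
    return f


def external_sort(values: Iterable[int], max_in_memory: int) -> Iterator[int]:
    """Sort integers, buffering at most `max_in_memory` values before spilling."""
    runs = []
    chunk: List[int] = []
    for value in values:
        chunk.append(value)
        if len(chunk) >= max_in_memory:
            runs.append(_spill(sorted(chunk)))  # memory limit hit: spill to disk
            chunk = []
    if chunk:
        runs.append(_spill(sorted(chunk)))
    try:
        # heapq.merge streams the sorted runs instead of loading them all at once.
        yield from heapq.merge(*((int(line) for line in run) for run in runs))
    finally:
        for run in runs:
            run.close()


if __name__ == "__main__":
    print(list(external_sort([5, 3, 9, 1, 4, 8, 2], max_in_memory=3)))
```

The result is the same as an in-memory sort; only the amount of data held in memory at one time changes, which is the trade-off the slide describes.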

Spark Libraries

Spark SQL: Spark’s module for working with structured data

MLlib: Spark’s machine learning library

GraphX: Spark’s API for graph computation

Spark Streaming: Spark’s API for processing streaming data

Spark in one Snapshot

Spark Use Cases

Companies use Spark to solve a variety of problems, such as recommendation systems, business intelligence, and fraud detection.

Who is using Spark?

A list of companies using Spark can be found here: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark

Spark is here to stay

Spark is not one of those "here today, gone tomorrow" technologies. It is here to stay for the foreseeable future, and it is well worth sinking your teeth into to get value out of your data.

Hands-On: Analyzing Data with Spark
