Big Data Processing With Spark and Scala
http://www.edureka.co/apache-spark-scala-training
Slide 2
Objectives of this Session
What is Big Data?
What is Spark?
Why Spark?
Spark Ecosystem
A note about Scala
Why Scala?
MapReduce vs Spark
Hello Spark!
Slide 3
Big Data
Lots of Data (Terabytes or Petabytes)
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications
The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization
Slide 4
What is Spark?
Apache Spark is a general-purpose, in-memory cluster computing system
Provides high-level APIs in Java, Scala and Python, and an optimized engine that supports general execution graphs
Provides various high-level tools such as Spark SQL for structured data processing, MLlib for machine learning, and more
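To make the high-level API concrete, here is a minimal word-count sketch in Scala. It is an illustration only and is not taken from the session's demo: the object name `HelloSpark`, the helper `countWords`, and the sample sentences are hypothetical, and `local[*]` assumes Spark runs inside a single JVM rather than on a cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object HelloSpark {
  // Counts word occurrences with the RDD API and returns the result to the driver.
  // On Spark versions before 1.3, reduceByKey also needs: import org.apache.spark.SparkContext._
  def countWords(sc: SparkContext, lines: Seq[String]): Map[String, Int] =
    sc.parallelize(lines)          // distribute the input as an RDD
      .flatMap(_.split("\\s+"))    // split each line into words
      .filter(_.nonEmpty)          // drop empty tokens
      .map(word => (word, 1))      // pair each word with a count of 1
      .reduceByKey(_ + _)          // sum the counts per word
      .collect()                   // bring the (small) result back to the driver
      .toMap

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, using all available cores.
    val conf = new SparkConf().setAppName("HelloSpark").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      countWords(sc, Seq("hello spark", "hello scala"))
        .foreach { case (word, n) => println(s"$word: $n") }
    } finally {
      sc.stop()
    }
  }
}
```

The chain of `flatMap`, `map`, and `reduceByKey` only describes the computation; Spark runs it when `collect()` asks for a result.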
Slide 5
Why Spark?
The Spark framework can be deployed through Apache Mesos, Apache Hadoop via YARN, or Spark's own standalone cluster manager.
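As a rough sketch of how the deployment target is selected in code, the master URL passed to `SparkConf` decides which cluster manager is used. The host names and ports below are placeholders, and the object name `DeploymentTargets` is hypothetical. The accepted URL forms depend on the Spark version: older releases used `yarn-client` or `yarn-cluster` instead of `yarn`, and recent releases have dropped Mesos support.

```scala
import org.apache.spark.SparkConf

object DeploymentTargets {
  // Builds a configuration for the given master URL (helper name is illustrative).
  def confFor(master: String): SparkConf =
    new SparkConf().setAppName("DeploymentSketch").setMaster(master)

  // Spark's own standalone cluster manager (placeholder host and port).
  val standalone: SparkConf = confFor("spark://master-host:7077")

  // Apache Mesos (placeholder host and port).
  val mesos: SparkConf = confFor("mesos://mesos-host:5050")

  // Hadoop YARN; the cluster location is read from the Hadoop configuration.
  val yarn: SparkConf = confFor("yarn")

  // Local mode for experiments on a single machine.
  val local: SparkConf = confFor("local[*]")
}
```

In practice the master is often supplied when the application is submitted rather than hard-coded, so the same program can run on any of these cluster managers.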
Slide 6
Why Spark?
Polyglot
The Spark framework is polyglot: it can be programmed in several languages (currently Scala, Java, and Python).
Slide 7
Why Spark?
Shark: a fully Apache Hive-compatible data warehousing system that can run up to 100x faster than Hive.
Spark: up to 100x faster than Hadoop MapReduce for certain applications.
Slide 8
Why Spark?
Provides powerful caching and disk persistence capabilities
Interactive Data Analysis
Faster Batch
Iterative Algorithms
Real-Time Stream Processing
Faster Decision-Making
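The caching and persistence point can be sketched as follows. This is an illustrative example under stated assumptions, not the session's own code: the object name `CachingSketch`, the helper `evenStats`, and the toy data set are hypothetical. The idea is that an RDD reused by several actions is persisted once, so later actions read the stored partitions instead of recomputing them, which is what helps iterative algorithms and interactive analysis.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CachingSketch {
  // Returns (count, sum) of the even numbers in 1..n, reusing one persisted RDD.
  def evenStats(sc: SparkContext, n: Int): (Long, Long) = {
    val evens = sc.parallelize(1 to n).filter(_ % 2 == 0)

    // Keep partitions in memory and spill to disk if they do not fit.
    evens.persist(StorageLevel.MEMORY_AND_DISK)

    val count = evens.count()                          // first action: computes and stores the RDD
    val sum = evens.map(_.toLong).fold(0L)(_ + _)      // second action: reads the persisted data

    evens.unpersist()                                  // release the storage when done
    (count, sum)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode for demonstration.
    val sc = new SparkContext(new SparkConf().setAppName("CachingSketch").setMaster("local[*]"))
    try {
      val (count, sum) = evenStats(sc, 10)
      println(s"count=$count sum=$sum")
    } finally {
      sc.stop()
    }
  }
}
```

`cache()` is the shorthand for persisting with the default in-memory storage level; `persist` with an explicit `StorageLevel` is used here to show the disk option mentioned above.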
Slide 9
Spark Community is Super Active!
Slide 10
Spark Ecosystem
Spark Core Engine
Alpha/Pre-alpha
Shark (SQL)
Spark Streaming (Streaming)
MLlib (Machine Learning)
GraphX (Graph Computation)
SparkR (R on Spark)
BlinkDB (Approximate SQL)
Slide 11
Spark Ecosystem (Contd.)
Spark Core Engine
Shark (SQL): Used for structured data. Can run unmodified Hive queries on an existing Hadoop deployment.
Spark Streaming (Streaming): Enables analytical and interactive apps for live streaming data.
MLlib (Machine Learning): Machine learning library built on top of Spark. Supports many machine learning algorithms at speeds up to 100 times faster than MapReduce.
GraphX (Graph Computation): Graph computation engine (similar to Giraph).
SparkR (R on Spark): Package for the R language that lets R users leverage Spark from the R shell.
BlinkDB (Approximate SQL): An approximate query engine that runs on the Spark Core Engine.
Slide 12
A Note on Scala
Scala is a general-purpose programming language designed to express common programming patterns in a concise, elegant, and type-safe way
Scala supports both Object Oriented Programming and Functional Programming
Scala is woven into the fabric of present and future Big Data frameworks such as Scalding, Spark, and Akka
» All Spark examples in class are covered in Scala
» Scala is covered before Spark as part of the course
Slide 13
Why Scala?
Scala is a pure object-oriented language. Conceptually, every value is an object and every operation is a method-call. The language supports advanced component architectures through classes and traits
Scala is also a functional language. It supports first-class functions and immutable data structures, and favors immutability over mutation
Seamlessly integrated with Java
Used heavily in Big Data and application development frameworks such as Spark, Akka, Scalding, and Play
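The points above can be illustrated with a short, self-contained Scala sketch that needs no Spark dependency. All names in it (`ScalaBasics`, `Greeter`, `Framework`, `greetings`, `javaNames`) are hypothetical and chosen only for illustration.

```scala
object ScalaBasics {
  // A trait is a reusable unit of behavior that classes can mix in.
  trait Greeter {
    def name: String
    def greet: String = s"Hello, $name!"
  }

  // A case class is immutable by default and mixes in the trait.
  case class Framework(name: String) extends Greeter

  // Every operation is a method call: 1 + 2 is shorthand for (1).+(2).
  val three: Int = (1).+(2)

  // Functional style: transform a list with a function instead of mutating it.
  def greetings(frameworks: List[Framework]): List[String] =
    frameworks.map(_.greet)

  // Java interoperability: Java classes are used directly from Scala.
  def javaNames(frameworks: List[Framework]): java.util.ArrayList[String] = {
    val names = new java.util.ArrayList[String]()
    frameworks.foreach(f => names.add(f.name))
    names
  }

  def main(args: Array[String]): Unit = {
    val frameworks = List(Framework("Spark"), Framework("Akka"))
    // Prepending builds a new list; the original list is left unchanged.
    val more = Framework("Scalding") :: frameworks
    greetings(more).foreach(println)
    println(s"original size: ${frameworks.size}, new size: ${more.size}")
    println(javaNames(frameworks))
  }
}
```

The same `map` call used on the list here is what Spark exposes on distributed collections, which is one reason Scala fits Spark well.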
Slide 14
Real Time Analytics
If you need real-time analytics, where results are expected quickly, Hadoop should not be used directly
Hadoop works on batch processing, so response time is high
[Figure: timeline from Day 1 to Day n. Input data is collected and then processed using MR, with a time lag between input and processing.]
Slide 15
Real Time Analytics – Accepted Way
[Figure labels: Streaming Data, Storing]
Slide 16
MapReduce vs Spark
MapReduce: 14 sec
Spark: 0.6 sec
Slide 17
Spark Demo!
Slide 18
Questions?