© Cloudera, Inc. All rights reserved.
Hari Shreedharan, Software Engineer @ Cloudera
Committer/PMC Member, Apache Flume
Committer, Apache Sqoop
Contributor, Apache Spark
Author, Using Flume (O’Reilly)
Real Time Data Processing using Spark Streaming
Motivation for Real-Time Stream Processing
Data is being created at unprecedented rates
• Exponential data growth from mobile, web, social
• Connected devices: 9B in 2012 to 50B by 2020
• Over 1 trillion sensors by 2020
• Datacenter IP traffic growing at a CAGR of 25%
How can we harness this data in real-time?
• Value can quickly degrade → capture value immediately
• From reactive analysis to direct operational impact
• Unlocks new competitive advantages
• Requires a completely new approach…
Use Cases Across Industries
Credit: Identify fraudulent transactions as soon as they occur.
Transportation: Dynamic re-routing of traffic or vehicle fleets.
Retail: Dynamic inventory management; real-time in-store offers and recommendations.
Consumer Internet & Mobile: Optimize user engagement based on the user's current behavior.
Healthcare: Continuously monitor patient vital stats and proactively identify at-risk patients.
Manufacturing: Identify equipment failures and react instantly; perform proactive maintenance.
Surveillance: Identify threats and intrusions in real-time.
Digital Advertising & Marketing: Optimize and personalize content based on real-time information.
From Volume and Variety to Velocity
Big Data has evolved:
• Past: Big Data = Volume + Variety; batch processing; time to insight of hours.
• Present: Big Data = Volume + Variety + Velocity; batch + stream processing; time to insight of seconds.
The Hadoop ecosystem evolves as well…
Key Components of Streaming Architectures
• Data Ingestion & Transportation Service (Kafka, Flume)
• Real-Time Stream Processing Engine
• Real-Time Data Serving
• Data Management & Integration
• System Management
• Security
Canonical Stream Processing Architecture
[Diagram: data sources feed a data-ingest layer (Flume, Kafka); Kafka buffers the events for stream-processing applications (App 1, App 2, …), which write results to HDFS and HBase.]
What is Spark?
Spark is a general-purpose computational framework that offers more flexibility than MapReduce.
Key properties:
• Leverages distributed memory
• Full directed-graph expressions for data-parallel computations
• Improved developer experience
Yet it retains: linear scalability, fault tolerance, and data-locality-based computation.
Spark: Easy and Fast Big Data
• Easy to develop
  • Rich APIs in Java, Scala, Python
  • Interactive shell
• Fast to run
  • General execution graphs
  • In-memory storage
2-5× less code; up to 10× faster on disk, 100× in memory.
Easy: High productivity language support
Python:
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala:
val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()

Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("ERROR"); }
}).count();
• Native support for multiple languages with identical APIs
• Use of closures, iterations, and other common language constructs to minimize code
Easy: Use Interactively
• Interactive exploration of data for data scientists – no need to develop "applications"
• Developers can prototype applications on a live system as they build them
Easy: Expressive API
• map
• filter
• groupBy
• sort
• union
• join
• leftOuterJoin
• rightOuterJoin
• reduce
• count
• fold
• reduceByKey
• groupByKey
• cogroup
• cross
• zip
• sample
• take
• first
• partitionBy
• mapWith
• pipe
• save
Easy: Example – Word Count (M/R)
public static class WordCountMapClass
extends MapReduceBase implements
Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one =
new IntWritable(1);
private Text word = new Text();
public void map(
LongWritable key, Text value,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer itr
= new StringTokenizer(line);
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
output.collect(word, one);
}
}
}
public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Easy: Example – Word Count (Spark)
val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
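To make the shape of that pipeline concrete, here is a plain-Python sketch of the same word count over local data. It is an analogy only: `word_count` is a hypothetical helper that mirrors flatMap / map / reduceByKey on an in-memory list, not the Spark API itself.

```python
# Pure-Python analogy of the Spark word-count pipeline above.
# Runs on a local list of strings, not on a cluster.
from collections import Counter

def word_count(lines):
    # flatMap: split every line into words
    words = [w for line in lines for w in line.split(" ") if w]
    # map + reduceByKey: emit (word, 1) pairs and sum them per key
    return dict(Counter(words))

counts = word_count(["to be or", "not to be"])
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```

The Spark version distributes exactly this computation: each partition counts locally, and reduceByKey merges the partial counts across the cluster.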
Easy: Out of the Box Functionality
• Hadoop integration
  • Works with Hadoop data
  • Runs with YARN
• Libraries
  • MLlib
  • Spark Streaming
  • GraphX (alpha)
• Language support
  • Improved Python support
  • SparkR
  • Java 8
  • Schema support in Spark's APIs
Spark Architecture
[Diagram: a Driver program coordinates several Workers; each Worker holds data partitions in RAM.]
RDDs
RDD = Resilient Distributed Dataset
• Immutable representation of data
• Operations on one RDD create a new one
• Memory caching layer that stores data in a distributed, fault-tolerant cache
• Created by parallel transformations on data in stable storage
• Lazy materialization
Two observations:
a. Can fall back to disk when the data set does not fit in memory
b. Provides fault tolerance through the concept of lineage
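The lineage idea can be illustrated with a toy sketch (an analogy, not Spark's implementation): each dataset remembers its parent and the transformation that produced it, so a lost result can always be recomputed from stable storage. `ToyRDD` is a hypothetical class invented for this illustration.

```python
# Toy illustration of lazy materialization and lineage-based recovery.
class ToyRDD:
    def __init__(self, parent=None, fn=None, source=None):
        # Either a source dataset (stable storage) or a parent + transformation.
        self.parent, self.fn, self.source = parent, fn, source

    def map(self, fn):
        # Building a new ToyRDD records lineage; nothing is computed yet.
        return ToyRDD(parent=self, fn=fn)

    def compute(self):
        # Lazy materialization: walk the lineage back to the source
        # and replay the transformations.
        if self.source is not None:
            return list(self.source)
        return [self.fn(x) for x in self.parent.compute()]

base = ToyRDD(source=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
# Even if a cached copy of `doubled` were lost, compute() can
# re-derive it from the recorded lineage.
result = doubled.compute()  # [2, 4, 6]
```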
Fast: Using RAM, Operator Graphs
• In-memory caching: data partitions are read from RAM instead of disk
• Operator graphs: scheduling optimizations and fault tolerance
[Diagram: an RDD operator graph (map, filter, join, groupBy, take) over RDDs A-F, with cached partitions marked.]
Spark Streaming
Extension of Apache Spark's core API, for stream processing.
The framework provides:
• Fault tolerance
• Scalability
• High throughput
Spark Streaming
• Incoming data represented as Discretized Streams (DStreams)
• Stream is broken down into micro-batches
• Each micro-batch is an RDD – can share code between batch and streaming
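The micro-batch idea can be sketched in a few lines of plain Python (an illustration of the model, not the Spark Streaming API): the stream is cut into small batches, and one ordinary batch function runs unchanged on each of them. `micro_batches` and `batch_fn` are hypothetical names for this sketch.

```python
# Cut a "stream" (here just a list) into fixed-size micro-batches.
def micro_batches(stream, batch_size):
    for i in range(0, len(stream), batch_size):
        yield stream[i:i + batch_size]

# An ordinary batch function -- nothing about it is streaming-specific,
# which is exactly the point of the DStream model.
def batch_fn(batch):
    return [x.upper() for x in batch]

events = ["a", "b", "c", "d", "e"]
results = [batch_fn(b) for b in micro_batches(events, 2)]
# results == [["A", "B"], ["C", "D"], ["E"]]
```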
"Micro-batch" Architecture

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

[Diagram: the tweets DStream is composed of small (1-10s) batch computations (batch @ t, batch @ t+1, batch @ t+2); flatMap on each batch produces the corresponding batch of the hashTags DStream, and each batch is saved.]
Use DStreams for Windowing Functions
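A sliding window over micro-batches can be sketched in plain Python (again an analogy, not the DStream `window` API): with a window length of three batches and a slide of one batch, each step aggregates over the last three batches. `windowed_counts` is a hypothetical helper for this sketch.

```python
# Count elements over a sliding window of micro-batches.
# window_len = 3 batches, slide = 1 batch.
def windowed_counts(batches, window_len):
    for i in range(len(batches)):
        # The window covers the current batch and up to the
        # (window_len - 1) batches before it.
        window = batches[max(0, i - window_len + 1): i + 1]
        yield sum(len(b) for b in window)

batches = [[1], [2, 3], [4], [5, 6, 7]]
counts = list(windowed_counts(batches, window_len=3))
# counts == [1, 3, 4, 6]
```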
Spark Streaming
• Runs as a Spark job
• YARN or standalone for scheduling
• YARN has KDC integration
• Use the same code for real-time Spark Streaming and for batch Spark jobs.
• Integrates natively with messaging systems such as Flume, Kafka, ZeroMQ, etc.
• Easy to write “Receivers” for custom messaging systems.
Sharing Code between Batch and Streaming
def filterErrors(rdd: RDD[String]): RDD[String] = {
  rdd.filter(s => s.contains("ERROR"))
}
Library that filters “ERRORS”
• Streaming generates RDDs periodically
• Any code that operates on RDDs can therefore be used in streaming as well
Sharing Code between Batch and Streaming
Spark:

val lines = sc.textFile(…)
val filtered = filterErrors(lines)
filtered.saveAsTextFile(...)

Spark Streaming:

val dStream = FlumeUtils.createStream(ssc, "34.23.46.22", 4435)
val filtered = dStream.transform((rdd: RDD[String], time: Time) => filterErrors(rdd))
filtered.saveAsTextFiles(…)
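The same point can be shown in a runnable plain-Python analogue (not the Spark API): one filtering function is used unchanged on a full dataset and per micro-batch. `filter_errors` is a hypothetical helper mirroring the Scala `filterErrors` above.

```python
# One shared function, used by both "batch" and "streaming" paths.
def filter_errors(records):
    return [r for r in records if "ERROR" in r]

logs = ["ok", "ERROR disk", "ok", "ERROR net"]

# Batch: apply to the whole dataset at once.
batch_result = filter_errors(logs)
# Streaming: apply to each micro-batch as it arrives.
stream_result = [filter_errors(b) for b in [logs[:2], logs[2:]]]
# batch_result == ["ERROR disk", "ERROR net"]
# stream_result == [["ERROR disk"], ["ERROR net"]]
```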
Spark Streaming Use-Cases
• Real-time dashboards
• Show approximate results in real-time
• Reconcile periodically with source-of-truth using Spark
• Joins of multiple streams
• Time-based or count-based “windows”
• Combine multiple sources of input to produce composite data
• Re-use RDDs created by Streaming in other Spark jobs.
Hadoop in the Spark world
[Diagram: core Hadoop (YARN over HDFS and HBase) runs MapReduce2, Hive, Pig, and Impala alongside the supported Spark components (Spark, Spark Streaming, GraphX, MLlib); Shark and Search are shown as unsupported add-ons.]
Current project status
• 100+ contributors and 25+ companies contributing
• Includes Databricks, Cloudera, Intel, Yahoo, etc.
• Dozens of production deployments
• Included in CDH!
More Info..
• CDH Docs: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_spark_installation.html
• Cloudera Blog: http://blog.cloudera.com/blog/category/spark/
• Apache Spark homepage: http://spark.apache.org/
• Github: https://github.com/apache/spark
Thank you