
Spark Streaming and GraphX

Amir H. Payberah

SICS Swedish ICT

Amir H. Payberah (SICS) Spark Streaming and GraphX June 30, 2016 1 / 1


Spark Streaming


Motivation

• Many applications must process large streams of live data and provide results in real time.
  • Wireless sensor networks
  • Traffic management applications
  • Stock market applications
  • Environmental monitoring applications
  • Fraud detection tools
  • ...


Stream Processing Systems

• Database Management Systems (DBMS): data-at-rest analytics
  • Store and index data before processing it.
  • Process data only when explicitly asked by the users.

• Stream Processing Systems (SPS): data-in-motion analytics
  • Process information as it flows, without storing it persistently.


DBMS vs. SPS (1/2)

• DBMS: persistent data where updates are relatively infrequent.
• SPS: transient data that is continuously updated.


DBMS vs. SPS (2/2)

• DBMS: runs queries just once to return a complete answer.
• SPS: executes standing queries, which run continuously and provide updated answers as new data arrives.


Core Idea of Spark Streaming

• Run a streaming computation as a series of very small, deterministic batch jobs.


Spark Streaming

• Run a streaming computation as a series of very small, deterministic batch jobs.
  • Chop up the live stream into batches of X seconds.
  • Spark treats each batch of data as an RDD and processes it using RDD operations.
  • Finally, the processed results of the RDD operations are returned in batches.
  • This model is called Discretized Stream Processing (DStream).
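The micro-batch idea can be illustrated without Spark. The following is a minimal plain-Scala sketch, not Spark's implementation: `Event`, `discretize`, `wordCount`, and the sample data are hypothetical names introduced here, and ordinary collections stand in for RDDs.

```scala
// Plain-Scala sketch of micro-batching (no Spark dependency).
// Event, discretize and wordCount are hypothetical names for illustration.
object MicroBatchSketch {
  final case class Event(timestampMs: Long, line: String)

  // Chop a stream of events into batches of `batchIntervalMs` milliseconds.
  // Each batch is identified by its index; batches are returned in time order.
  // Intervals that received no events produce no batch in this sketch.
  def discretize(events: Seq[Event], batchIntervalMs: Long): Seq[(Long, Seq[String])] =
    events
      .groupBy(e => e.timestampMs / batchIntervalMs)
      .toSeq
      .sortBy(_._1)
      .map { case (idx, es) => (idx, es.sortBy(_.timestampMs).map(_.line)) }

  // The per-batch job: a word count applied independently to each batch.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(100, "a b"), Event(900, "a"), Event(1200, "b b"))
    // One small, deterministic job per 1-second batch
    discretize(events, 1000).foreach { case (idx, lines) =>
      println(s"batch $idx: ${wordCount(lines)}")
    }
  }
}
```

Because each batch is processed by a deterministic function of its input, a lost batch result can be recomputed from the same input, which is the property the batch-job formulation relies on.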


DStream

• DStream: a sequence of RDDs representing a stream of data.
• Any operation applied on a DStream translates to operations on the underlying RDDs.


StreamingContext

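• StreamingContext: the main entry point for all Spark Streaming functionality.
• To initialize a Spark Streaming program, a StreamingContext object has to be created.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
val conf = new SparkConf().setAppName(appName).setMaster(master) // appName, master: placeholders
val ssc = new StreamingContext(conf, Seconds(1)) // batch interval of 1 second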


Source of Streaming

• Two categories of streaming sources:
  • Basic sources, directly available in the StreamingContext API, e.g., file systems, socket connections, ...
  • Advanced sources, e.g., Kafka, Flume, Kinesis, Twitter, ...

ssc.socketTextStream("localhost", 9999)

TwitterUtils.createStream(ssc, None)


DStream Transformations

• Transformations: modify data from one DStream to produce a new DStream.
  • Standard RDD operations, e.g., map, join, ...
  • DStream-specific operations, e.g., window operations


DStream Transformation Example

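import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
// Read lines from a TCP socket and count the words in each 1-second batch
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
// Nothing is computed until the context is started
ssc.start()
ssc.awaitTermination()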


Window Operations

• Apply transformations over a sliding window of data, defined by a window length and a slide interval.
  • Both must be multiples of the batch interval.

val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream(IP, Port)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
// Every 10 seconds, count the words seen in the last 30 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
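The windowing semantics can be sketched on plain collections. This is an illustrative simulation, not Spark code: `WindowSketch` and its methods are hypothetical, and the window length and slide interval are expressed as a number of batches instead of a duration.

```scala
// Plain-Scala simulation of a sliding-window reduce (no Spark dependency).
// All names here are hypothetical and exist only for illustration.
object WindowSketch {
  type Batch = Seq[(String, Int)]

  // Sum the values of each key, like reduceByKey(_ + _) on a single batch.
  def reduceByKey(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (key, kvs) => (key, kvs.map(_._2).sum) }

  // Every `slide` batches, reduce the most recent `windowLen` batches.
  // Early windows cover fewer batches because less data has arrived.
  def reduceByKeyAndWindow(batches: Seq[Batch], windowLen: Int, slide: Int): Seq[Map[String, Int]] =
    (slide to batches.length by slide).map { end =>
      val window = batches.slice(math.max(0, end - windowLen), end)
      reduceByKey(window.flatten)
    }

  def main(args: Array[String]): Unit = {
    val batches: Seq[Batch] =
      Seq(Seq("a" -> 1), Seq("a" -> 1, "b" -> 1), Seq("b" -> 1), Seq("a" -> 1))
    reduceByKeyAndWindow(batches, windowLen = 3, slide = 2).foreach(println)
  }
}
```

With a window length of 3 batches and a slide of 2, consecutive windows overlap by one batch, so the same input pair can contribute to more than one output.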


MapWithState Operation

• Maintains state while continuously updating it with new information.
• It requires a checkpoint directory.
• A newer alternative to updateStateByKey.

val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint(".")
val lines = ssc.socketTextStream(IP, Port)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
// Defined before it is used: adds the new value to the running count kept in state
val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (word, sum)
}
val stateWordCount = pairs.mapWithState(StateSpec.function(mappingFunc))
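The effect of keeping per-key state across batches can be sketched with a fold over plain collections. This is a hedged illustration only: `RunningCountSketch` is hypothetical, an immutable `Map` stands in for Spark's checkpointed state store, and `mapping` only mirrors the shape of `mappingFunc` above.

```scala
// Plain-Scala simulation of a stateful running word count (no Spark dependency).
// RunningCountSketch, mapping and run are hypothetical names for illustration.
object RunningCountSketch {
  // Same shape as mappingFunc: combine the new value with the previous state.
  def mapping(word: String, one: Option[Int], state: Option[Int]): (String, Int) =
    (word, one.getOrElse(0) + state.getOrElse(0))

  // Thread the state map through every batch, updating one key per word.
  def run(batches: Seq[Seq[String]]): Map[String, Int] =
    batches.foldLeft(Map.empty[String, Int]) { (state, batch) =>
      batch.foldLeft(state) { (current, word) =>
        val (_, sum) = mapping(word, Some(1), current.get(word))
        current.updated(word, sum)
      }
    }

  def main(args: Array[String]): Unit =
    println(run(Seq(Seq("a", "b"), Seq("a"), Seq("a", "c"))))
}
```

Unlike the windowed count, the result here depends on every batch seen so far, which is why the Spark operation needs a checkpoint directory to recover the state after a failure.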


Transform Operation

• Allows arbitrary RDD-to-RDD functions to be applied on a DStream.
• Apply any RDD operation that is not exposed in the DStream API, e.g., joining every RDD in a DStream with another RDD.

// RDD containing spam information

val spamInfoRDD = ssc.sparkContext.newAPIHadoopRDD(...)

val cleanedDStream = wordCounts.transform(rdd => {

// join data stream with spam information to do data cleaning

rdd.join(spamInfoRDD).filter(...)

...

})


Spark Streaming and DataFrame

val words: DStream[String] = ...

words.foreachRDD { rdd =>

// Get the singleton instance of SQLContext

val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)

import sqlContext.implicits._

// Convert RDD[String] to DataFrame

val wordsDataFrame = rdd.toDF("word")

// Register as table

wordsDataFrame.registerTempTable("words")

// Do word count on DataFrame using SQL and print it

val wordCountsDataFrame =

sqlContext.sql("select word, count(*) as total from words group by word")

wordCountsDataFrame.show()

}


GraphX


Introduction

• Graphs provide a flexible abstraction for describing relationships between discrete objects.
• Many problems can be modeled by graphs and solved with appropriate graph algorithms.


Large Graph


Can we use platforms like MapReduce or Spark, which are based on the data-parallel model, for large-scale graph processing?


Graph-Parallel Processing

• Restricts the types of computation.
• Introduces new techniques to partition and distribute graphs.
• Exploits graph structure.
• Can execute graph algorithms orders of magnitude faster than more general data-parallel systems.


Data-Parallel vs. Graph-Parallel Computation (1/3)


Data-Parallel vs. Graph-Parallel Computation (2/3)

• Graph-parallel computation: restricting the types of computation to achieve performance.
• But the same restrictions make it difficult and inefficient to express many stages in a typical graph-analytics pipeline.


Data-Parallel vs. Graph-Parallel Computation (3/3)

• A typical pipeline moves between table and graph views of the same physical data.
• This is inefficient: extensive data movement and duplication across the network and file system.


GraphX

• Unifies data-parallel and graph-parallel systems.
• Tables and graphs are composable views of the same physical data.
• Implemented on top of Spark.


GraphX vs. Data-Parallel/Graph-Parallel Systems


Property Graph

• Represented using two Spark RDDs:
  • Vertex collection: VertexRDD
  • Edge collection: EdgeRDD

// VD: the type of the vertex attribute

// ED: the type of the edge attribute

class Graph[VD, ED] {

val vertices: VertexRDD[VD]

val edges: EdgeRDD[ED]

}


Triplets

• The triplet view logically joins the vertex and edge properties, yielding an RDD[EdgeTriplet[VD, ED]].
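The logical join behind the triplet view can be sketched on plain collections. This is an illustration under stated assumptions, not GraphX code: `SimpleEdge` and `SimpleTriplet` are hypothetical stand-ins for GraphX's `Edge` and `EdgeTriplet`, and a `Map` stands in for the vertex RDD.

```scala
// Plain-Scala sketch of the triplet view (no GraphX dependency).
// SimpleEdge and SimpleTriplet are hypothetical stand-ins for illustration.
object TripletSketch {
  final case class SimpleEdge[ED](srcId: Long, dstId: Long, attr: ED)
  final case class SimpleTriplet[VD, ED](srcId: Long, srcAttr: VD, dstId: Long, dstAttr: VD, attr: ED)

  // Join every edge with the attributes of its source and destination vertex.
  // Edges whose endpoints are missing from `vertices` are dropped in this sketch.
  def triplets[VD, ED](vertices: Map[Long, VD], edges: Seq[SimpleEdge[ED]]): Seq[SimpleTriplet[VD, ED]] =
    edges.flatMap { e =>
      for {
        src <- vertices.get(e.srcId)
        dst <- vertices.get(e.dstId)
      } yield SimpleTriplet(e.srcId, src, e.dstId, dst, e.attr)
    }

  def main(args: Array[String]): Unit = {
    val vertices = Map(3L -> "rxin", 7L -> "jgonzal")
    val edges = Seq(SimpleEdge(3L, 7L, "collab"))
    triplets(vertices, edges).foreach(t => println(s"${t.srcAttr} is the ${t.attr} of ${t.dstAttr}"))
  }
}
```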


Example Property Graph (1/3)


Example Property Graph (2/3)

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sc: SparkContext // an existing SparkContext
// Create an RDD for the vertices
val users: RDD[(VertexId, (String, String))] = sc.parallelize(
  Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
        (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
// Create an RDD for edges
val relationships: RDD[Edge[String]] = sc.parallelize(
  Array(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
        Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
// Define a default user in case there are relationships with missing users
val defaultUser = ("John Doe", "Missing")
// Build the initial Graph
val userGraph: Graph[(String, String), String] =
  Graph(users, relationships, defaultUser)


Example Property Graph (3/3)

// Constructed from above
val userGraph: Graph[(String, String), String]
// Count all users who are postdocs (a pattern-matching function needs `case`)
userGraph.vertices.filter { case (id, (name, pos)) => pos == "postdoc" }.count
// Count all the edges where src > dst
userGraph.edges.filter(e => e.srcId > e.dstId).count
// Use the triplets view to create an RDD of facts
val facts: RDD[String] = userGraph.triplets.map(triplet =>
  triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1)


Property Operators

def mapVertices[VD2](map: (VertexId, VD) => VD2): Graph[VD2, ED]

def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]

def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]

val newGraph = graph.mapVertices((id, attr) => mapUdf(id, attr))


Structural Operators

def reverse: Graph[VD, ED]

def subgraph(epred: EdgeTriplet[VD, ED] => Boolean,

vpred: (VertexId, VD) => Boolean): Graph[VD, ED]

def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]

// Run Connected Components

val ccGraph = graph.connectedComponents() // No longer contains missing field

// Remove missing vertices as well as the edges connected to them

val validGraph = graph.subgraph(vpred = (id, attr) => attr._2 != "Missing")

// Restrict the answer to the valid subgraph

val validCCGraph = ccGraph.mask(validGraph)


Join Operators

def joinVertices[U](table: RDD[(VertexId, U)])(map: (VertexId, VD, U) => VD):

Graph[VD, ED]

def outerJoinVertices[U, VD2](table: RDD[(VertexId, U)])

(map: (VertexId, VD, Option[U]) => VD2):

Graph[VD2, ED]

val outDegrees: VertexRDD[Int] = graph.outDegrees

val degreeGraph = graph.outerJoinVertices(outDegrees) {

(id, oldAttr, outDegOpt) =>

outDegOpt match {

case Some(outDeg) => outDeg

case None => 0 // No outDegree means zero outDegree

}

}


Neighborhood Aggregation

def aggregateMessages[Msg: ClassTag](

sendMsg: EdgeContext[VD, ED, Msg] => Unit, // map

mergeMsg: (Msg, Msg) => Msg, // reduce

tripletFields: TripletFields = TripletFields.All):

VertexRDD[Msg]

val graph: Graph[Double, Int] = ...

val olderFollowers: VertexRDD[(Int, Double)] =

graph.aggregateMessages[(Int, Double)](triplet =>

{ // Map Function

if (triplet.srcAttr > triplet.dstAttr) {

// Send message to destination vertex containing counter and age

triplet.sendToDst((1, triplet.srcAttr))

}

},

// Reduce Function

(a, b) => (a._1 + b._1, a._2 + b._2)

)

val avgAgeOfOlderFollowers: VertexRDD[Double] = olderFollowers.mapValues(

(id, value) => value match {case (count, totalAge) => totalAge / count})
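The send/merge pattern of neighborhood aggregation can be mimicked on plain collections. This is a hedged simulation, not GraphX: `AggregateSketch`, `SimpleEdge`, and the sample ages are hypothetical, with `collect` playing the role of `sendMsg` and `reduce` the role of `mergeMsg`.

```scala
// Plain-Scala simulation of aggregateMessages (no GraphX dependency).
// AggregateSketch and SimpleEdge are hypothetical names for illustration.
object AggregateSketch {
  final case class SimpleEdge(srcId: Long, dstId: Long)

  // ages: vertex id -> age; follows: src follows dst.
  // Returns, per vertex, the average age of the followers older than that vertex.
  def avgAgeOfOlderFollowers(ages: Map[Long, Double], follows: Seq[SimpleEdge]): Map[Long, Double] = {
    // "sendMsg": emit (counter, age) to the destination when the source is older
    val messages: Seq[(Long, (Int, Double))] = follows.collect {
      case SimpleEdge(src, dst) if ages(src) > ages(dst) => (dst, (1, ages(src)))
    }
    // "mergeMsg": add up counters and ages of all messages sent to the same vertex
    val merged: Map[Long, (Int, Double)] = messages.groupBy(_._1).map { case (dst, msgs) =>
      (dst, msgs.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2)))
    }
    // Divide the total age by the number of older followers
    merged.map { case (id, (count, totalAge)) => (id, totalAge / count) }
  }

  def main(args: Array[String]): Unit = {
    val ages = Map(1L -> 30.0, 2L -> 40.0, 3L -> 20.0, 4L -> 50.0)
    val follows = Seq(SimpleEdge(2L, 1L), SimpleEdge(4L, 1L), SimpleEdge(3L, 1L), SimpleEdge(1L, 3L))
    println(avgAgeOfOlderFollowers(ages, follows))
  }
}
```

Vertices that receive no message do not appear in the result, which matches the behaviour of a `VertexRDD` produced by message aggregation.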


Summary

• Spark Streaming
  • Mini-batch processing
  • DStream (sequence of RDDs)
  • Transformations, e.g., stateful, window, join, transform, ...

• GraphX
  • Unifies graph-parallel and data-parallel models
  • Property graph (VertexRDD and EdgeRDD)


Questions?
