Introduction to Structured Streaming


Manish Mishra

Software Consultant, Knoldus Software LLP

Agenda

What is Structured Streaming?

How is it different from the previous streaming engine?

Structured Streaming Programming Model

Basic Operations: Selection, Projection and Aggregation

Window Operations on Event Time

Example Demo

What is Structured Streaming?

It is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine.

It was introduced in the Spark 2.0 release.

A unified API for streams, which makes it possible to combine stream computation with batch processing (see the sketch after this list).

Computations can be expressed as SQL-like queries on Datasets/DataFrames and applied to streaming DataFrames just as they are to static ones.
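
As a minimal sketch of what this unification buys (the JSON directory and schema here are hypothetical), the very same query can run over a static DataFrame and over a streaming one; only the read call changes:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("UnifiedApiSketch").getOrCreate()

// Hypothetical schema of the JSON input
val schema = new StructType()
  .add("device", StringType)
  .add("signal", DoubleType)

// Batch: read the directory once as a static DataFrame
val staticDf = spark.read.schema(schema).json("/data/devices")

// Streaming: treat the same directory as a stream of newly arriving files
val streamingDf = spark.readStream.schema(schema).json("/data/devices")

// The identical transformation applies to both
val strongStatic = staticDf.where("signal > 10")
val strongStream = streamingDf.where("signal > 10")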

What is new in Structured Streaming?

The entry point of a streaming app is the SparkSession, instead of the StreamingContext used by the previous engine (see the sketch after this list).

Unlike DStreams, a stream is represented as an unbounded DataFrame.

It is interoperable with DStreams

It harnesses the Catalyst optimizer to improve query performance without changing the query semantics.

The unified API makes the developer's task easier: there is no need to reason about how a streaming computation differs from a normal batch computation.
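
A minimal sketch of the new entry point (the app name is hypothetical); the commented line shows how the previous engine was bootstrapped instead:

import org.apache.spark.sql.SparkSession

// Structured Streaming: a single SparkSession serves batch and streaming alike
val spark = SparkSession.builder
  .appName("StructuredStreamingApp")
  .getOrCreate()

// Previous engine (DStreams), for contrast:
// val ssc = new StreamingContext(sparkConf, Seconds(1))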

Structured Streaming Programming Model

It treats a live data stream as an unbounded table

Any streaming computation can be expressed as a batch-like query on static tables

Internally, Spark runs this computation as an incremental query on the unbounded input table.

The result of the computation depends on the output mode specified in the streaming query.

Structured Streaming Programming Model

[Diagram: a data stream treated as an unbounded input table, with new data appended as rows. Image source: Apache Spark documentation]

[Diagram: the query executed incrementally over the input table, producing an updated Result Table at every trigger. Image source: Apache Spark documentation]

There are three output modes, which decide what part of the result is written to the sink:

Complete Mode

Append Mode

Update Mode

Structured Streaming Programming Model: Output Modes

Complete Mode: The entire updated Result Table is written to the sink. It is up to the storage connector to decide how to handle writing the entire table. It is specified with outputMode("complete") when starting the streaming query.

Structured Streaming Programming Model: Output Modes

Append Mode (default): Only the new rows appended to the Result Table since the last trigger are written to the external storage.

This is applicable only to queries where existing rows in the Result Table are not expected to change.
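
A minimal sketch of a query that qualifies for append mode, assuming a socket source on localhost:9000; a plain filter never modifies rows already in the Result Table, so the default mode applies:

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9000)
  .load()

// A filtered row, once emitted, never changes, so append mode is valid here
val longLines = lines.filter("length(value) > 20")

val query = longLines.writeStream
  .outputMode("append")
  .format("console")
  .start()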

Structured Streaming Programming Model: Output Modes

Update Mode: Only the rows that were updated in the Result Table since the last trigger are written to the external storage. Unlike Complete Mode, this mode does not output unchanged rows.

Note: This mode is not yet implemented as of Spark 2.0.

Example: Running Word Count

// Create DataFrame representing the stream of input lines from connection to localhost:9000
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9000)
  .load()

// Split the lines into words (the implicits import is needed for the .as[String] conversion)
import spark.implicits._
val words = lines.as[String].flatMap(_.split(" "))

// Generate running word count
val wordCounts = words.groupBy("value").count()

// Start running the query that prints the running counts to the console
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
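
To try the example, you can serve text on the socket with a tool such as netcat (nc -lk 9000) and type lines into it; because the query runs in complete mode, the full updated word counts are printed to the console after every trigger.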

Basic Operations - Selection, Projection, Aggregation

case class DeviceData(device: String, `type`: String, signal: Double, time: String) // `type` must be back-quoted: it is a reserved word in Scala

val df: DataFrame = ... // streaming DataFrame with IoT device data with schema { device: string, type: string, signal: double, time: string }
val ds: Dataset[DeviceData] = df.as[DeviceData] // streaming Dataset with IoT device data

// Select the devices which have signal more than 10
df.select("device").where("signal > 10") // using untyped APIs
ds.filter(_.signal > 10).map(_.device) // using typed APIs

// Running count of the number of updates for each device type
df.groupBy("type").count() // using untyped API

// Running average signal for each device type
import org.apache.spark.sql.expressions.scalalang.typed
ds.groupByKey(_.`type`).agg(typed.avg(_.signal)) // using typed API
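
The same computations can also be expressed in plain SQL. A minimal sketch, assuming the streaming DataFrame df from above (the view name "updates" is hypothetical):

// Register the streaming DataFrame as a temporary view...
df.createOrReplaceTempView("updates")

// ...then query it with SQL; the result is itself a streaming DataFrame
val typeCounts = spark.sql("SELECT type, count(*) FROM updates GROUP BY type")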

Window Operations on Event Time

import spark.implicits._
import org.apache.spark.sql.functions.window

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }

// Group the data by a sliding window and by word, and compute the count of each group
val windowedCounts = words.groupBy(
  window($"timestamp", "10 minutes", "5 minutes"),
  $"word"
).count()
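
Because the 10-minute windows slide every 5 minutes, they overlap: a word that arrives at 12:07, for example, is counted in both the 12:00-12:10 window and the 12:05-12:15 window.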

[Diagram: windowed grouped aggregation with overlapping 10-minute windows sliding every 5 minutes. Image source: Apache Spark documentation]

References

Structured Streaming Programming Guide: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

Thanks!!