
Real-Time Data Stream Processing on Azure with Spark Streaming (左继红)

ACP-B205

Real-time computing on Azure

Spark Streaming: concepts and features

EventHub example:

val stream = EventHubUtils.createStream(ssc, eventHubName, partitionNum, consumerGroupName)

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

val dataset: RDD[(Int, String)] = …
val metricsDS: DStream[(Int, SensorMetrics)] = stream.window(Seconds(3), Seconds(2))
val joinedDS: DStream[(Int, (SensorMetrics, String))] = metricsDS.transform(rdd => rdd.join(dataset))

val computeMeanFunc = (values: Seq[SensorMetrics], state: Option[SensorState]) => {
  // Mean of the "back" sensor's ax readings in the current window
  val back_ax_vals = values.map(_.getSensorReading("back").get.ax)
  val back_ax_mean = back_ax_vals.reduce(_ + _) / values.size
  // Population standard deviation of the same readings
  val back_ax_dev = Math.pow(
    back_ax_vals.map(x => Math.pow(x - back_ax_mean, 2)).reduce(_ + _) / values.size, 0.5)
  ...
}
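
The deck truncates the body of computeMeanFunc. Below is a minimal sketch, under stated assumptions, of how such an update function could be completed and wired into updateStateByKey; Reading, SensorMetrics, SensorState, the attach helper, and the checkpoint path are hypothetical stand-ins, not taken from the deck.

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

object MotionStateSketch {
  // Hypothetical stand-ins for the demo's types.
  case class Reading(ax: Double, ay: Double, az: Double)
  case class SensorMetrics(readings: Map[String, Reading]) {
    def getSensorReading(position: String): Option[Reading] = readings.get(position)
  }
  case class SensorState(backAxMean: Double, backAxDev: Double)

  // Completed update function: it must return the new per-key state so that
  // updateStateByKey can carry it forward to the next batch.
  val computeMeanFunc = (values: Seq[SensorMetrics], state: Option[SensorState]) => {
    if (values.isEmpty) state                    // no data in this batch: keep the old state
    else {
      val backAxVals = values.map(_.getSensorReading("back").get.ax)
      val mean = backAxVals.sum / values.size
      val dev = math.sqrt(backAxVals.map(x => math.pow(x - mean, 2)).sum / values.size)
      Some(SensorState(mean, dev))
    }
  }

  def attach(metricsDS: DStream[(Int, SensorMetrics)], ssc: StreamingContext): Unit = {
    ssc.checkpoint("/tmp/motion-checkpoint")     // updateStateByKey requires a checkpoint directory
    val stateDS: DStream[(Int, SensorState)] = metricsDS.updateStateByKey(computeMeanFunc)
    stateDS.print()
  }
}

In this shape, stateDS carries one SensorState per device key and is refreshed on every batch.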

Integrating with EventHub

Parallel structure avoids resource contention
Events can be retained for multiple days and read repeatedly
Performance can be controlled through Throughput Units

EventData

Offset
Sequence number
Body
User properties
System properties
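
For orientation, the sketch below models these fields as a plain record and shows one way a CSV body might be decoded into the keyed tuples the streaming snippets above work with; EventRecord, toKeyedFields, and the sample payload are hypothetical and are not the SDK's EventData API.

import java.nio.charset.StandardCharsets

object EventRecordSketch {
  // Hypothetical record mirroring the fields listed above; the real client type is the
  // EventHubs SDK's EventData, whose accessor names vary across SDK versions.
  case class EventRecord(
      offset: String,                        // byte offset within the partition
      sequenceNumber: Long,                  // monotonically increasing per partition
      body: Array[Byte],                     // event payload
      userProperties: Map[String, String],   // application-defined metadata
      systemProperties: Map[String, String]) // broker-assigned metadata

  // Illustrative decoding of a CSV body such as "42,back,0.12,0.03,9.81" into the
  // (key, fields) shape that the streaming snippets above operate on.
  def toKeyedFields(e: EventRecord): (Int, Array[String]) = {
    val columns = new String(e.body, StandardCharsets.UTF_8).split(",")
    (columns(0).toInt, columns.drop(1))
  }
}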

[Diagram: an Event Hub with four partitions, Partition1 through Partition4]

Events are stored in the order in which they are received

Offset: the event's byte offset within the partition

Each EventHubReceiver maps to one EventHub partition
Uses the EventHubs Java client
Under the hood, the Apache Qpid library is used to access EventHub over the AMQP protocol
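
A rough sketch of this one-receiver-per-partition layout follows; createPartitionStream is a hypothetical callback standing in for the connector call, since the deck does not show the exact per-partition API.

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

object PartitionUnionSketch {
  // createPartitionStream opens one receiver for a single EventHub partition;
  // the real connector's signature may differ.
  def unionAllPartitions(
      ssc: StreamingContext,
      eventHubName: String,
      partitionCount: Int,
      consumerGroupName: String,
      createPartitionStream: (StreamingContext, String, String, String) => DStream[Array[Byte]]
  ): DStream[Array[Byte]] = {
    val perPartition = (0 until partitionCount).map { pid =>
      // One receiver (and hence one long-running receiver task) per partition.
      createPartitionStream(ssc, eventHubName, pid.toString, consumerGroupName)
    }
    // Merge the per-partition streams into a single DStream for downstream operators.
    ssc.union(perPartition)
  }
}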

EventHub stores event data durably
ResilientEventHubReceiver recovers automatically
Offsets are checkpointed periodically
Metadata and RDD data are checkpointed periodically
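
A minimal sketch of that periodic metadata/RDD checkpointing, assuming an illustrative wasb:// checkpoint directory (the path is not from the deck):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointSketch {
  // Illustrative Azure Blob Storage location for checkpoints.
  val checkpointDir = "wasb://container@account.blob.core.windows.net/spark-checkpoints"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("motion-analysis")
    val ssc = new StreamingContext(conf, Seconds(2))
    ssc.checkpoint(checkpointDir)   // periodic metadata and RDD checkpoints go to durable storage
    // ... build the EventHub input streams and transformations here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On driver restart, rebuild the context from the checkpoint instead of recreating it.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}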

Reliable Receiver: acknowledges to the source once data has been received and reliably stored
Unreliable Receiver: sends no acknowledgement to the source

The Unreliable Receiver ensures reliable ingestion through offset checkpointing
Offsets are stored in Azure Blob Storage
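
One plausible shape for that offset checkpointing, sketched against the standard Hadoop filesystem API (which is how a wasb:// Azure Blob Storage path is read and written from Spark); the connector's actual storage layout is not shown in the deck.

import java.net.URI
import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Minimal sketch: persist and restore an offset string at a wasb:// path.
object OffsetStore {
  private val conf = new Configuration()

  def save(path: String, offset: String): Unit = {
    val fs = FileSystem.get(new URI(path), conf)
    val out = fs.create(new Path(path), true)     // overwrite the previous checkpoint file
    try out.write(offset.getBytes(StandardCharsets.UTF_8)) finally out.close()
  }

  def load(path: String): Option[String] = {
    val fs = FileSystem.get(new URI(path), conf)
    val p = new Path(path)
    if (!fs.exists(p)) None
    else {
      val in = fs.open(p)
      try Some(scala.io.Source.fromInputStream(in, "UTF-8").mkString) finally in.close()
    }
  }
}

On recovery, the receiver would call load to resume reading from the last checkpointed offset rather than from the beginning of the partition.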

Deploying Spark clusters on Azure

Demo: analyzing motion signals with Spark Streaming

Comparison of real-time analytics tools on Azure

After-class reminders
