Comparing processing frameworks v7

Testing Processing Frameworks

Spark Streaming and Storm

Gabriela Choy

Spark / Storm

                    Spark                                                Storm
Implemented in      Scala                                                Clojure, Java
Delivery semantics  Exactly once                                         At least once; exactly once with Trident
API                 Python, Java, Scala                                  Java, Scala, Clojure, Python, etc.; Trident: Java, Scala, Clojure
Processing model    Batch; micro-batches with Spark Streaming (~500 ms)  Record at a time; Trident allows for micro-batches
Latency             1-2 seconds                                          Sub-second

Pipeline

[Diagram: two pipelines, each with its own Producer/s and MySQL; diagram labels: Producer/s, Streaming, MySQL]

Each pipeline is run independently.

Node Specifications

              Spark Streaming          Storm
Nodes         4 AWS nodes, m3.medium   4 AWS nodes, m3.medium
Zookeeper     3.4.6                    3.4.6
Kafka         0.8.2.1                  0.8.2.1
Framework     Spark (streaming) 1.3    Storm 0.9.5

Cluster Configuration

Spark Streaming: 1 master node, 3 workers (Worker 1, Worker 2, Worker 3)

Storm: 1 Nimbus, 3 Supervisors (Supervisor 1, Supervisor 2, Supervisor 3)

Metric

Throughput: the amount of data processed per unit of time.

● Measured while changing the batch size

● Measured while changing the load (i.e. scaling up the number of producers)

● The benchmark program is wordcount.
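As a rough illustration of the metric and the benchmark program above, the sketch below counts words in one batch of lines and turns the number of words seen into a per-second rate. It is plain Python, not the Spark Streaming or Storm API; the function names, the sample lines, and the 2-second window are assumptions made for the example and do not come from the actual benchmark.

```python
from collections import Counter


def word_count(lines):
    """Count word occurrences across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def throughput(records_processed, elapsed_seconds):
    """Records processed per second over a measurement window."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return records_processed / elapsed_seconds


# Hypothetical batch standing in for messages sent by a producer.
batch = ["to be or not to be", "that is the question"]
counts = word_count(batch)
words_seen = sum(counts.values())

# Assume (for illustration only) that this batch covers a 2-second window.
rate = throughput(words_seen, 2.0)
```

Here throughput is expressed in words per second; counting messages or bytes instead would only change what is passed as `records_processed`.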

Tests for Spark Streaming

# Producers   Batch Interval
1             1s, 2s, 3s, 4s, 6s
4             1s, 2s, 3s, 4s, 6s
8             1s, 2s, 3s, 4s, 6s
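The Spark Streaming test grid (every producer count crossed with every batch interval) can be enumerated programmatically, which helps when scripting the runs. This is only a sketch: the dictionary keys are hypothetical names, and how each run is actually launched is not described in the slides.

```python
from itertools import product

# Values taken from the test table: producer counts and batch intervals (seconds).
producer_counts = [1, 4, 8]
batch_intervals_s = [1, 2, 3, 4, 6]

# One entry per (producers, batch interval) combination.
test_matrix = [
    {"producers": p, "batch_interval_s": b}
    for p, b in product(producer_counts, batch_intervals_s)
]
```

This yields 15 configurations, ordered by producer count and then by batch interval.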

[Plots: throughput for 1 producer, with and without 95% CI]

[Plots: throughput for 4 producers, with and without 95% CI]

[Plots: throughput for 8 producers (two slides)]
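The slides do not say how the 95% confidence intervals were computed. One common approach, sketched here under that caveat, is a normal-approximation interval around the mean of repeated throughput measurements. The sample values are hypothetical, and for small samples a t-based interval would be somewhat wider.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(samples, confidence=0.95):
    """Return (low, mean, high) using a normal approximation."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    m = mean(samples)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(samples) / sqrt(len(samples))
    return m - half_width, m, m + half_width


# Hypothetical throughput measurements (records per second), not benchmark data.
samples = [980.0, 1010.0, 1000.0, 990.0, 1020.0]
low, centre, high = mean_confidence_interval(samples)
```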

Tests for Storm

# Producers   Tuples Emitted-Acked
1             10 min
4             10 min
8             10 min

Preliminary results for Storm

[Plots: Tuples Emitted per Second, Tuples Acked per Second, Spout Latency]

Takeaways

● The batch interval in Spark Streaming should be set by monitoring processing times and load size.

● For Storm, as the number of producers increases, so do throughput and spout latency.
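The first takeaway can be made concrete with a simple rule of thumb: a batch interval keeps up only if batches finish processing in less time than the interval itself. The sketch below picks the smallest interval that satisfies this for a set of observed processing times. The function names and all the timings are hypothetical, chosen only to illustrate the check.

```python
def is_stable(batch_interval_s, processing_times_s):
    """True when every observed batch finished within the batch interval."""
    return max(processing_times_s) < batch_interval_s


def smallest_stable_interval(observations):
    """Smallest interval whose observed batches all kept up, or None."""
    stable = [
        interval
        for interval, times in observations.items()
        if times and is_stable(interval, times)
    ]
    return min(stable) if stable else None


# Hypothetical processing times (seconds) per batch interval (seconds).
observed = {
    1: [1.4, 1.6, 1.5],
    2: [1.7, 2.1, 1.9],
    3: [2.0, 2.2, 2.1],
    4: [2.3, 2.4, 2.2],
    6: [2.9, 3.0, 2.8],
}
best = smallest_stable_interval(observed)
```

With these made-up numbers, the 1 s and 2 s intervals fall behind, so 3 s is the smallest setting that keeps up. Under a heavier load the processing times grow and the answer shifts, which is why both need monitoring.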

Would like to add:

● Increase number of producers. Use real data.

● Add a graph as a second use case.

● Dashboard to monitor live streaming.

About Me

Gabriela Choy

BSc in Chemical Engineering, ULA, Venezuela

MSc in Statistics, UT Dallas

Previously: worked in Device Reliability Engineering at View, Inc.