
Tuning Spark Streaming for Throughput
By Gerard Maas, 22/12/2014

Spark Streaming is an abstraction that brings streaming capabilities to Spark. It works by creating micro-batches of data that are given to Spark for further processing, and it offers a rich set of stream operations consistent with the Spark API.

While migrating some jobs to Spark Streaming, we faced a series of performance challenges. This article summarizes our findings in the form of a 'tuning guide' that could be of more general use. We hope it can be of use to other Spark Streaming adopters and we welcome an open discussion on the topic.

After we present the context of our system and background, we cover how Spark Streaming works, going into the level of detail needed to explain the parameters involved in the performance of a streaming job. We have divided this article into four sections:

Our Context
Understanding Spark Streaming
Going One Level Deeper
A Tuning Guide
    Scaling up consumers
    Parallelism
    Partitions
    Data Locality
    Caching
    Logging
Closing words

Under the motto 'to measure is to know', performance measurements are an essential part of a performance improvement process. In an upcoming article, we will cover how to measure these improvements using the tools provided by Spark.

Our Context

At Virdata, we collect, store, transform and analyse data produced by a large number of devices or things. We have been migrating our data ingestion pipelines from a well-known streaming framework to Spark Streaming. Our rationale for that has been two-fold: the micro-batching model is a great match for inserting records in Cassandra, and having a single programming model for our Spark analytics and the data ingestion layer makes it easier to grow and maintain a coherent and reusable knowledge- and code-base.

Our initial tests with Spark Streaming were really promising and we went ahead with the migration, but as we implemented the full spectrum of business requirements on the data ingestion pipeline, we observed how our Spark Streaming jobs were increasingly lagging behind in terms of performance. To address that situation, we did an in-depth analysis to spot the issues and find a workable solution, which led to an improvement of about 60x in throughput within the limits of the same micro-batch interval.

The tuning aspects discussed here are based on our system, which consists of Kafka 0.8.1.1, Spark (+ Streaming) 1.1.0, the Spark Cassandra Connector 1.0.0 and Cassandra 2.x. Spark runs on top of Mesos 0.20, separate from the Cassandra 2.0.6 ring.

In this article, our focus will be mainly on the aspects and parameters related to Spark Streaming.

How Spark Streaming works

Spark Streaming combines one or more stream consumers with a Spark transformation process that must be materialized by an action (like saveAs) in order to get scheduled for execution. Intuitively, it's very easy to understand: as data comes in, it's collected and packaged in blocks during a given time interval, also known as the batch interval. Once the interval of time is completed, the collected data blocks are given to Spark for processing.

What's generally less known is the timing of how this sequence of events takes place:

The batch interval provided to the Spark Streaming constructor will determine the duration of each interval (10 seconds in this example):

    val ssc = new StreamingContext(sc, Seconds(10))

On start of Spark Streaming (t0), the consumers will be instantiated and will start consuming events from the streaming source. In our case, that means several Kafka consumers will start fetching messages from Kafka topics. That data is placed in blocks, implemented as arrays, and delivered to the Block Manager. These blocks become the RDD partitions that Spark will work on.

On the next batch interval (t1), the data collected at t0 is provided to Spark. Spark Streaming starts filling the next bucket of data (#1).

At any point in time, Spark is processing the previous batch of data, while the streaming consumers are collecting data for the current interval. In this chart, we can see that Spark Streaming is processing interval #1 while collecting data for t2 (becoming block #2). Once a batch has been processed by Spark (like #0 in the above illustration), it can be cleaned up. When that RDD will be cleaned up is determined by the spark.cleaner.ttl setting.
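As a minimal sketch of how these two intervals fit together (the application name and TTL value below are illustrative, not taken from our setup), a streaming context could be set up as follows:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Illustrative configuration: a 10-second batch interval and an explicit cleaner TTL.
    // The TTL should comfortably exceed the time a batch stays in use, so that an RDD is
    // only cleaned up after Spark has finished processing it.
    val conf = new SparkConf()
      .setAppName("streaming-ingest")
      .set("spark.cleaner.ttl", "3600") // in seconds

    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(10))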

    Going one level deeper

Summarizing the previous section, Spark Streaming consists of two processes:

Fetching data, done by the streaming consumer (in our case, the Kafka consumer)
Processing the data, done by Spark

These two processes are connected by the timely delivery of collected data blocks from Spark Streaming to Spark. This also gives us the main performance guideline for Spark Streaming:

The time to process the data of a batch interval must be less than the batch interval time.

Given an unbounded consumer like the Kafka consumer, this implies that our Spark job must be able to process the incoming data of a batch interval in time. In order to achieve our goal of maximizing the throughput of the system, we want to consume data as fast as we can and tune our Spark job to process that data within the time interval restriction.

As the Kafka consumer has proven to be quite good at delivering data up to overwhelming levels, the tuning efforts further described in this article are focused on optimizing the Spark side of Spark Streaming.

Let's do a quick Spark recap:

A Spark job consists of transformations and actions. It is broken down into stages of operations that can be inlined.
An RDD acts on a distributed collection of data, broken down into partitions spread over nodes.
A task applies a stage to a data partition on an executor. Scheduling a task has some fixed cost.
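A small, self-contained sketch (not from the original job) may help to make these terms concrete:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("recap").setMaster("local[4]"))

    // An RDD with 4 partitions.
    val words = sc.parallelize(Seq("a", "b", "a", "c"), numSlices = 4)

    // map and the map-side of reduceByKey are inlined into one stage; the shuffle
    // introduces a second stage, so the count() job runs roughly
    // 2 stages x 4 partitions = 8 tasks, each scheduled on an executor core.
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
    counts.count()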

In Spark Streaming, at each batch interval we will apply the same job to a new batch of data, so the batch processing time will be informally determined by:

    processing time ~= #tasks * scheduling cost + #tasks * time-complexity per task / parallelism level

    where

    #tasks = #stages x #partitions

From these two statements, we can infer that to minimize the processing time of a batch we need to minimize the stages and partitions and maximize parallelism. Note how interval time is not explicit in this set of equations.

Note: Although this model is a simplification of the performance characterization of a Spark Streaming job, it provides a sufficient framework to reason about the streaming job in terms of tasks, stages, partitions and executors.
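A back-of-the-envelope helper for the informal model above, with made-up numbers (the scheduling cost and per-task time are illustrative only):

    def estimatedBatchTimeMs(stages: Int, partitions: Int, schedulingCostMs: Double,
                             timePerTaskMs: Double, parallelism: Int): Double = {
      val tasks = stages * partitions
      tasks * schedulingCostMs + tasks * timePerTaskMs / parallelism
    }

    // 3 stages over 16 partitions, ~5 ms scheduling overhead and ~100 ms of work per task,
    // 8 cores available to Spark: 48 * 5 + 48 * 100 / 8 = 240 + 600 = 840 ms,
    // which must stay below the batch interval.
    estimatedBatchTimeMs(stages = 3, partitions = 16, schedulingCostMs = 5.0,
                         timePerTaskMs = 100.0, parallelism = 8)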

A tuning guide

Now that we have identified the elements that determine the performance of a Spark Streaming job, let's see what knobs we have to turn in order to optimize performance.

Scaling up consumers

In order to increase the number of messages consumed by the system, we can create multiple consumers that will fetch data in parallel. Each consumer is assigned one core on an executor.

    This is a common pattern:

    @transient val inKafkaList: List[DStream[(K, V)]] = List.fill(kafkaParallelism) {
      KafkaUtils.createStream[K, V, KDecoder, VDecoder](ssc, kafkaConfig, topics, StorageLevel.MEMORY_AND_DISK_SER)
    }
    @transient val inKafka = inKafkaList.tail.foldLeft(inKafkaList.head) { _.union(_) }

The union of the created DStreams is important as this reduces the number of transformation pipelines on the input DStream to one. Not doing this will multiply the number of stages by the number of consumers.

    Notes:

    kafkaParallelism is the (configurable) number of consumers to create

Storage level MEMORY_AND_DISK_SER will allow Spark Streaming to spill serialized data to disk in cases of overload, when the available memory is not sufficient to hold the incoming data.
Declaring the DStream references as transient is often necessary to avoid them being serialized with the job. Not doing so would result in a serialization exception, as DStreams are not supposed to be serialized.

Parallelism

As we explained before, Spark Streaming is in fact two processes running concurrently: the data consumers and Spark. The parallelism level of the consumer side is defined by the number of consumers created (see the previous section on how to create consumers). The parallelism of the Spark processing cluster is determined by the total number of cores configured for the job minus the number of consumers.

    Given the total number of cores, controlled by the configuration parameter spark.cores.max:

    Consumer parallelism = #consumers created (kafkaParallelism in the previous example)

    Spark parallelism = spark.cores.max - #consumers

    Tuning guide for spark.cores.max:

To maximize the chances of data locality and even parallel execution, spark.cores.max should be a multiple of #consumers. For example, if you are creating 4 Kafka consumers, one could assign spark.cores.max = 8 or spark.cores.max = 12, effectively configuring 1 or 2 Spark cores per consumer respectively.

Notes:

It's some kind of an urban Internet legend that a Spark Streaming application needs n+1 cores, where n is the number of consumers. This is ONLY correct in test cases and very small deployments. For a throughput-sensitive application, provision the Spark side of the job with enough resources as outlined in this section.

There are no hard guarantees on the even distribution of consumers and Spark cores across executors, which can result in less-than-ideal cluster topologies. For network-intensive applications, an even deployment across physical nodes would be ideal. At the moment of writing, there's no way to express that constraint in Mesos.
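A minimal sketch of this guideline (the numbers are illustrative): with 4 receivers and 2 Spark cores per receiver, spark.cores.max works out to 12.

    import org.apache.spark.SparkConf

    val kafkaParallelism = 4      // receivers, one core each
    val sparkCoresPerConsumer = 2 // processing cores per receiver
    val totalCores = kafkaParallelism * (1 + sparkCoresPerConsumer) // 4 + 8 = 12

    val conf = new SparkConf().set("spark.cores.max", totalCores.toString)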

Partitions

As we discussed previously, reducing the number of partitions is important in order to reduce the overall processing time, as it leads to fewer tasks and therefore bigger chunks of data to operate on at once and less scheduling overhead.

    How many partitions do we have for each RDD in a DStream?

Each receiver fetches data. That data is provided by the Receiver to its executor, a ReceiverSupervisor that takes care of managing the blocks. Each block becomes a partition of the RDD produced during each batch interval. The size of these blocks is time-bound and defined by the configuration parameter (with its default value):

    spark.streaming.blockInterval = 200

Interval (milliseconds) at which data received by Spark Streaming receivers is coalesced into blocks of data before storing them in Spark (see the docs).

Given that each consumer will produce the same number of blocks, it follows that the number of partitions in an RDD for a given interval is:

    #partitions = #consumers * batchIntervalMillis / blockInterval

    Tuning guide for spark.streaming.blockInterval:

Increasing spark.streaming.blockInterval will reduce the number of partitions in the RDD and therefore the number of tasks per batch. blockInterval must be an integer divisor of the batch interval. Following the Spark guideline of having the number of partitions roughly 2x-3x the number of available cores, we have been successfully applying the following guideline:

    Given:

batchIntervalMillis = configured batch interval in milliseconds

    spark.cores.max = total cores

    #consumers = created streaming consumers

    sparkCores = spark.cores.max - #consumers

    partitionFactor = # of partitions / core (1, 2, 3,... ideally in multiples of k where sparkCores = k * #consumers)

    Then:

    spark.streaming.blockInterval = batchIntervalMillis * #consumers / (partitionFactor x sparkCores)

E.g.

In our configuration we have assigned spark.cores.max = 12 cores and we have created 4 consumers, therefore leaving sparkCores = 8. We have defined a batch interval of 6 seconds and consider that a partitionFactor = 2 is acceptable.

    Then:

    spark.streaming.blockInterval = 6000 * 4 / (2 * 8) = 24000/16 = 1500 ms

Let's check:

    #partitions = 4 * 6000/1500 = 16 partitions = 2 factor x 8 cores [QED]
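The same calculation can be captured in a small helper (the function name and arguments are ours); it reproduces the 1500 ms above and the resulting 16 partitions per RDD:

    def blockIntervalMs(batchIntervalMillis: Long, consumers: Int,
                        sparkCoresMax: Int, partitionFactor: Int): Long = {
      val sparkCores = sparkCoresMax - consumers
      batchIntervalMillis * consumers / (partitionFactor.toLong * sparkCores)
    }

    blockIntervalMs(batchIntervalMillis = 6000, consumers = 4,
                    sparkCoresMax = 12, partitionFactor = 2) // = 1500 ms

    // #partitions = #consumers * batchIntervalMillis / blockInterval
    //             = 4 * 6000 / 1500 = 16 = 2 x 8 Spark cores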

    Data Locality

Big blocks of data created with a large configured value for spark.streaming.blockInterval are great when they can be processed on the same node where they reside, using data locality level NODE_LOCAL. But they can be heavy to transport over the network if another node has idle processing capacity.

We try to improve our data locality odds by allocating k Spark nodes per consumer task, so that collected data can be evenly processed. Nevertheless, depending on the complexity of the Spark job defined over the DStream, the scheduler might launch some tasks at a lower locality level. The time Spark will wait for locality is controlled by the configuration parameter spark.locality.wait, with a default value of 3000 ms.

    Tuning guide for spark.locality.wait:

The default value of 3000 ms is too high for jobs that are expected to execute in the 5-10 s range, as it will in many cases define a lower bound for the job execution time if any task falls beyond the NODE_LOCAL locality level. We have observed that setting this parameter to a value between 500 and 1000 ms helps lower the total processing time of a Spark Streaming job.

Notes:

We need to find a balance between spark.streaming.blockInterval and spark.locality.wait. If we observe that tasks are taken by a non-local executor, setting a lower spark.streaming.blockInterval will improve the network transfer time, while increasing spark.locality.wait will increase the chance of that task executing with data locality NODE_LOCAL.
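As an illustrative sketch (the exact values are ours, within the ranges discussed above), the two settings could be balanced like this:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.streaming.blockInterval", "1500") // ms: fewer, larger partitions per batch
      .set("spark.locality.wait", "750")            // ms: shorter wait before falling back from NODE_LOCAL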

Caching

In our data ingestion use case, we are routing data to different Cassandra keyspaces. This seems to be a reasonably common use case, as the processed data streams ought to be persisted somewhere for further use (HDFS, Cassandra, local disk, ...) and sorting it on some common denominator (keyspaces, date, customers, folders) will help with retrieval afterwards.

To implement that routing function, we iterate over each RDD as many times as we have different routes, each time creating a filtered version of the original RDD that gets persisted.

    In code, this process looks like this:

    dstream.foreachRDD { rdd =>
      rdd.cache() // cache the RDD before iterating!
      keys.foreach { key =>
        // keep only the elements routed to this key and persist them
        rdd.filter(elem => key(elem) == key).saveAsFooBar(...)
      }
      rdd.unpersist()
    }

We enclose the iterative section with a cache/unpersist pair. This way we keep the data cached only for the time we need it.

Using rdd.cache speeds up the process considerably. As elsewhere in Spark, while the first cycle takes the same time as the uncached version, each subsequent iteration takes only a fraction of the time. This discovery took us by surprise: given that Spark Streaming data is per se already in memory, it was counter-intuitive that rdd.cache would have any beneficial effect. Our testing shows a different reality: if a DStream or the corresponding RDD is used multiple times, caching it significantly speeds up the process. In our case, it was the setting that delivered the largest performance improvement.

Note: In the case of window operations, DStreams are implicitly cached, as the RDDs are preserved beyond the limits of a single batch interval.

    Tuning guide for .cache:

Use it if the streaming job involves iterating over the DStream or its RDDs.

Logging

One of the popular rules of thumb regarding logging is to never place logging calls within a loop. As a Spark Streaming job is basically a long-running loop of the same job over each new incoming dataset, this advice is quite relevant.

On a simple ETL test job we measured the effect of two logInfo(...) lines in the scope of a DStream closure. This chart illustrates a comparison in performance for that case:

Tuning guide for logging:

Avoid using logging calls within the DStream and RDD transformations of a Spark Streaming job. Spark is quite chatty in the logs; set the right log levels for your application.
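As a hypothetical before/after sketch of this advice (the transformation and the logging are ours, not the ETL job measured above):

    import org.apache.spark.Logging
    import org.apache.spark.streaming.dstream.DStream

    object IngestJob extends Logging {

      // Anti-pattern: this log call runs once per record, for every batch.
      def noisy(in: DStream[String]): DStream[Int] = in.map { line =>
        logInfo(s"processing line: $line")
        line.length
      }

      // Preferred: keep per-record closures free of logging and log at most once per batch.
      def quiet(in: DStream[String]): DStream[Int] = {
        val out = in.map(_.length)
        out.foreachRDD((rdd, time) => logInfo(s"batch $time: ${rdd.partitions.length} partitions"))
        out
      }
    }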

Enable Kryo

This one is still on our TODO list. After we enabled Kryo, we had issues with some of the data being silently nullified.

To Measure is to Know

To tune a Spark Streaming application, we need to have a means of determining that our changes are delivering a beneficial effect. We have been perusing the Spark UI, and in particular the Streaming tab and the Spark metrics subsystem, to gather performance data. We will cover these tools in detail in a follow-up article.

Closing Words

Spark Streaming offers a micro-batch based streaming model that brings the rich and expressive capabilities of Spark to streaming data. Given that the processing time is constrained to the batch interval, special care must be taken to ensure that Spark Streaming applications are tuned to consistently deliver results within that time, for every interval.


In this article we have visited the code changes, settings and parameters that have helped us improve the throughput of our Spark Streaming applications 60-fold. We don't claim that these are the only parameters affecting the performance of a Spark Streaming job, but we have seen consistent performance improvements after applying this tuning guide, backed up by extensive testing in development and production environments.

In a follow-up article we will further explain how to use the tools delivered by Spark to help with the tuning process.

All feedback, corrections and discussions are welcome.

    By: Gerard Maas (twitter: @maasg)
