USF Spark Workshop
Ilya Ganelin
Overview
• Goal:
• Understand how Spark internals drive design and configuration
• Contents:
• Background
• Partitions
• Caching
• Serialization
• Shuffle
• Lessons 1-4
• Experimentation, debugging, exploration
• ASK QUESTIONS.
Background
Partitions, Caching, and Serialization
• Partitions
• How data is split on disk
• Affects memory / CPU usage and shuffle size
• Caching
• Persist RDDs in distributed memory
• Major speedup for repeated operations
• Serialization
• Efficient movement of data
• Java vs. Kryo
• Spark Architecture / Workflow
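Before diving into shuffle, a minimal PySpark sketch tying together the partition, caching, and serialization knobs just described; the path, app name, and partition count are illustrative assumptions, not workshop code:

from pyspark import SparkConf, SparkContext, StorageLevel

# Kryo is faster and more compact than the default Java serialization.
conf = (SparkConf()
        .setAppName("background-demo")  # hypothetical app name
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
sc = SparkContext(conf=conf)

rdd = sc.textFile("hdfs:///data/events")  # hypothetical path; initial partitions follow HDFS blocks
rdd = rdd.repartition(200)                # choose the partition count deliberately
rdd.persist(StorageLevel.MEMORY_ONLY)     # keep the RDD in distributed memory
rdd.count()                               # first action materializes the cache
rdd.count()                               # second pass reads from memory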
Shuffle?
Shuffle!
• All-to-all operations
• reduceByKey, groupByKey
• Data movement
• Serialization
• Akka
• Memory overhead
• Spills to disk when out of memory
• Garbage collection
• EXPENSIVE!
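To see why the all-to-all operations above differ in cost, a toy sketch (assuming a live SparkContext sc):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)] * 1000)

# groupByKey serializes and ships every (key, value) pair, then sums.
sums_slow = pairs.groupByKey().mapValues(sum)

# reduceByKey combines values within each partition first, so far less
# data crosses the network.
sums_fast = pairs.reduceByKey(lambda x, y: x + y)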
Map Reduce
Spark Architecture
Lessons
Lesson 1: Spark is a problem child!
• Memory
• You're using more than you think
• JVM garbage collection
• Spark metadata (shuffles, long-running jobs)
• Scala & Java type overhead
• Shuffle / heap / YARN
• Debugging is hard
• Distributed logs
• Hundreds of tasks
Lesson 1: Discipline
• Tame the beast (memory)
• Partition wisely
• Know your data!
• Size, Types, Distribution
• Kryo Serialization
• Cleanup
• Long-running jobs consume memory indefinitely
• Spark context cleanup fails in production environments
• Solution: YARN!
• Separate spark-submits per batch (see the sketch below)
• Stable Spark-based job that runs for weeks
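One way to realize "separate spark-submits per batch" is a small launcher loop; the batch directory and driver script below are hypothetical, not the workshop's actual setup:

import glob
import subprocess

# One independent YARN application per batch: executor memory and shuffle
# metadata are reclaimed when each application exits.
for batch in sorted(glob.glob("/data/batches/*")):   # hypothetical layout
    subprocess.run(["spark-submit", "--master", "yarn",
                    "--executor-memory", "20g",
                    "process_batch.py", batch],      # hypothetical script
                   check=True)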
Spark Memory Structure
spark.executor.memory – defines the total amount of memory available to the executor.
spark.storage.memoryFraction – defines the fraction (0.6 by default) of the total memory to use for storing persisted RDDs.
spark.shuffle.memoryFraction – defines the fraction of memory to reserve for shuffle (0.2 by default).
Typically don't touch: spark.storage.unrollFraction and spark.storage.safetyFraction. These exist primarily for certain internal constructs and size estimation, and default to 20% and 10% respectively.
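As a worked example under this (legacy, pre-1.6) memory model, here is how the default fractions carve up the 20 GB executor used in the launch command at the end of these slides:

executor_mb = 20 * 1024         # spark.executor.memory = 20g
storage_mb = 0.6 * executor_mb  # ~12288 MB for persisted RDDs
shuffle_mb = 0.2 * executor_mb  # ~4096 MB reserved for shuffle
# The remaining ~20% is working memory for task objects and the like.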
• yarn.nodemanager.resource.memory-mb – controls the maximum sum of memory used by the containers on each node.
• --executor-memory / spark.executor.memory – controls the executor heap size.
• JVMs can use memory off heap (interned Strings and direct byte buffers).
• spark.yarn.executor.memoryOverhead + executor memory determine the memory request to YARN for each executor. The overhead defaults to max(384, .07 * spark.executor.memory).
• YARN may round the requested memory up a little:
• yarn.scheduler.minimum-allocation-mb
• yarn.scheduler.increment-allocation-mb
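Putting those rules together for the same 20 GB executor, a quick arithmetic check:

executor_mb = 20 * 1024
overhead_mb = max(384, 0.07 * executor_mb)  # = 1433.6 MB
request_mb = executor_mb + overhead_mb      # ~21914 MB asked of YARN
# YARN rounds this up per yarn.scheduler.increment-allocation-mb, and the
# per-node total must fit under yarn.nodemanager.resource.memory-mb.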
Lesson 2: Avoid shuffles!
• Why?
• Speed up execution
• Increase stability
• ????
• Profit!
• How?
• Custom partitioning
• Use the driver! (see the sketch below)
• Collect
• Broadcast
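A sketch of the collect-and-broadcast pattern above, which turns a shuffle join into a map-side lookup (toy data; the small side must fit in driver memory):

small_rdd = sc.parallelize([("a", "apple"), ("b", "berry")])
big_rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

lookup = sc.broadcast(dict(small_rdd.collect()))  # shipped once per executor

# Map-side join: no shuffle; each task reads its local broadcast copy.
joined = big_rdd.map(lambda kv: (kv[0], (kv[1], lookup.value.get(kv[0]))))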
Lesson 3: Using the driver is hard!
• Limited memory
• Collected RDDs
• Metadata
• Results (Accumulators)
• Akka messaging
• 10^7 records x 120 bytes ≈ 1.2 GB across 20 partitions
• ~60 MB read per partition, but the default Akka frame size is only 10 MB
• Solution: partition appropriately and set akka.frameSize – know your data! (sketched below)
• Big data
• Solution: Batch process
• Problem: Cleanup and long-term stability
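The frame-size fix in code, reusing the slide's numbers (1.2 GB collected over 20 partitions is ~60 MB per task result, far above the 10 MB default); rdd here stands for the large dataset being collected:

# Option 1: raise the Akka frame limit (in MB) above the per-partition
# result size. Must be set before the SparkContext is created.
conf = SparkConf().set("spark.akka.frameSize", "128")

# Option 2: repartition so each task's result stays under the limit;
# 1.2 GB over 200 partitions is ~6 MB per task instead of ~60 MB.
rdd = rdd.repartition(200)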
Lesson 4: Speed!
• Cache, but cache wisely
• If you use it twice, cache it
• Broadcast variables
• Visible to all executors
• Only serialized once
• Blazing-fast lookups!
• Threading
• Thread pool on driver
• Fast operations, many tasks
• 75x speedup over MLlib ALS predict()
• Start: 1 rec / 1.5 seconds
• End: 50 recs / second
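A hedged sketch combining the broadcast and threading ideas above (assuming a live SparkContext sc; the toy lookup table stands in for the ALS model):

from concurrent.futures import ThreadPoolExecutor

# Serialized once, then visible to every executor for fast lookups.
model = sc.broadcast({user: user * 0.5 for user in range(1000)})

def predict(user_id):
    # Each call launches a small Spark job; SparkContext job submission
    # is thread-safe, so jobs from different threads run concurrently.
    return (sc.parallelize([user_id])
              .map(lambda u: (u, model.value[u]))
              .collect())

with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(predict, range(100)))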
screen pyspark \
  --driver-memory 100g \
  --num-executors 60 \
  --executor-cores 5 \
  --master yarn-client \
  --conf "spark.executor.memory=20g" \
  --conf "spark.io.compression.codec=lz4" \
  --conf "spark.shuffle.consolidateFiles=true" \
  --conf "spark.dynamicAllocation.enabled=false" \
  --conf "spark.shuffle.manager=tungsten-sort" \
  --conf "spark.akka.frameSize=1028" \
  --conf "spark.executor.extraJavaOptions=-Xss256k -XX:MaxPermSize=128m -XX:PermSize=96m -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=6 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AggressiveOpts -XX:+UseCompressedOops"
Questions?
References
• http://spark.apache.org/docs/latest/programming-guide.html
• http://spark.apache.org/docs/latest/sql-programming-guide.html
• http://tinyurl.com/leqek2d (Working With Spark, by Ilya Ganelin)
• http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/ (by Sandy Ryza)
• http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ (by Sandy Ryza)
• http://www.slideshare.net/ilganeli/frustrationreduced-pyspark-data-engineering-with-dataframes
• http://www.amazon.com/Spark-Data-Cluster-Computing-Production/dp/1119254019