A PRACTICAL CASE OF AIOPS: FAILURE
MANAGEMENT IN APACHE SPARK
GUGLIELMO IOZZIA
MSD
MOSCOW, OCTOBER 9TH 2019
#GuglielmoIozzia
ABOUT ME
Currently at
Previously at
Author
I got some awards lately. I love cooking.
Champion
MSD IRELAND
http://www.msd-ireland.com/
• 50+ years in Ireland
• Approx. 2,000 employees
• $2.5 billion investment to date
• Approx. 50% of MSD’s top 20 products manufactured here
• Exports to 60+ countries
• €6.1 billion turnover in 2017
• 2017: 300+ jobs & €280m investment
• MSD Biotech, Dublin, coming in 2021
AGENDA
• What’s Failure in Spark? Let’s agree on the definition of failure in Spark.
• A Real Use Case Scenario. Let’s make predictions about Spark application performance.
• Where to Go from Here? The next level.
• Q & A. Have your say!
APACHE SPARK
It is an Open Source unified analytics engine for large-scale data
processing.
APACHE SPARK
Speed
▪ It achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
Ease of Use
▪ It offers over 80 high-level operators that make it easy to build parallel apps.
▪ It makes it possible to write applications quickly in Java, Scala, Python, R, and SQL.
Runs Everywhere
▪ It runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud.
▪ It can access diverse data sources.
THE PHILOSOPHY BEHIND THIS TALK
https://www.youtube.com/watch?v=2A_Bx2gBRv8
Pawel Leszczynski
Allegro.pl
WHAT’S FAILURE?
According to the Oxford English Dictionary, definitions of failure are:
• Lack of success.
• The neglect or omission of expected or required action.
• The action or state of not functioning.
WHAT’S FAILURE IN APACHE SPARK?
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:57)
at java.nio.CharBuffer.allocate(CharBuffer.java:331)
at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:777)
at org.apache.hadoop.io.Text.decode(Text.java:405)
at org.apache.hadoop.io.Text.decode(Text.java:382)
at org.apache.hadoop.io.Text.toString(Text.java:280)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:344)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:344)
Lack of success
The action or state of not functioning
WHAT’S FAILURE IN APACHE SPARK?
The neglect or omission of expected or required action.
WHAT’S FAILURE IN APACHE SPARK?
A Spark application is taking too
long to complete its execution.
You then decide to:
• Add more nodes to the cluster or
• Add more memory to the existing
nodes of the cluster.
The neglect or omission of expected or required action.
Lack of success
FAILURE CLASSIFICATION
FAILURE CLASSIFICATION AND ALERTING
FAILURE CLASSIFICATION AND ALERTING
From:
Spark job failure

To:
Spark job failure
Root Cause:
The CityState class Kryo serialization failed.
Recommendation:
Switch serialization from Kryo to Java for this job, or exclude
the CityState class from the list of Kryo-serializable classes.
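Enrichment of this kind can be approximated with a rule-based classifier over the driver log. A minimal sketch, assuming illustrative rules and messages (the actual system and its rules are not shown in the talk):

```python
import re

# Illustrative rules mapping a log pattern to (root cause, recommendation).
# These patterns and texts are examples, not the production rule set.
RULES = [
    (re.compile(r"KryoException"),
     "Kryo serialization failed.",
     "Switch serialization from Kryo to Java for this job, or exclude "
     "the offending class from the list of Kryo-serializable classes."),
    (re.compile(r"OutOfMemoryError: GC overhead limit exceeded"),
     "The JVM spent most of its time in garbage collection.",
     "Increase executor memory or reduce the amount of cached data."),
]

def classify(log_line):
    """Return (root_cause, recommendation) for a known failure, else None."""
    for pattern, cause, recommendation in RULES:
        if pattern.search(log_line):
            return cause, recommendation
    return None
```

An alert pipeline can then attach the classifier output to the bare "Spark job failure" notification, falling back to the raw message when no rule matches.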
PERFORMANCE PREDICTION
OBJECTIVES
• Predict the performance of Spark applications in order to improve
it and at the same time try to minimize costs/resources utilization.
• Shift defects as much as possible from production to test.
• Satisfy the SLAs.
BACKGROUND INFORMATION
• Pre-production and production environments based on OpenShift Origin.
• The PoC started before Spark 2.3.
• Spark has been upgraded to 2.3+ in a later stage.
• Spark cluster management in OpenShift Origin/Kubernetes initially based on
radanalytics.io’s Oshinko Open Source project.
• Initially only Scala and Java Spark applications involved.
• No application data in Hadoop. Different data storage options (Cassandra,
S3, MySQL).
CHALLENGES
• Spark jobs not running repeatedly on the same data set.
• ML or hyperparameter tuning jobs with no relevant history.
• Spark specific extra complexity:
• Support from more than one programming language (Scala, Java)
• Diverse applications (SparkSQL, GraphX, ML)
• Different submission methods: Docker/Kubernetes, batch, scheduled
(Airflow), REST (Livy), notebooks (Jupyter, Zeppelin)
• Infrastructure (Kubernetes, auto-scaling, Airflow)
FIRST APPROACH
The Ernest project
▪ A research paper from the AMPLab @ UC Berkeley.
Pros
▪ It is an end-to-end linear model.
▪ It can make performance predictions after training a model on small samples of data.
▪ It follows an Optimal Experimental Design to choose the data points to collect.
▪ Low overhead.
Cons
▪ It has been validated against EC2 instances only.
▪ It has been validated mostly against MLlib Spark apps.
▪ Black box approach.
HOW DOES ERNEST WORK?
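Per the paper linked at the end, Ernest predicts runtime from a handful of interpretable terms (fixed cost, parallel work, tree-reduction communication, per-machine overhead) fitted on small training runs. A minimal sketch with NumPy, using plain least squares clipped to non-negative values as a rough stand-in for the non-negative least squares solver the paper uses:

```python
import numpy as np

def features(scale, machines):
    # Ernest's feature set: fixed cost, parallel work (scale/machines),
    # tree-reduction term (log machines), and per-machine overhead.
    return np.array([1.0, scale / machines, np.log(machines), machines])

def fit(samples):
    """Fit the model. samples: iterable of (input_fraction, machines, runtime_sec)."""
    X = np.array([features(s, m) for s, m, _ in samples])
    y = np.array([t for _, _, t in samples])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.clip(coef, 0.0, None)  # crude stand-in for NNLS

def predict(coef, scale, machines):
    """Predicted runtime (seconds) for a given input fraction and machine count."""
    return float(features(scale, machines) @ coef)
```

Because the model is linear in these features, a few cheap runs on small input fractions are enough to extrapolate runtime across cluster sizes.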
THE ERNEST FLOW: EXAMPLE
Input to Experimental Design:
Min Partitions  Max Partitions  Total Partitions  Min Pods  Max Pods  Cores per Pod
4               32              256               1         8         4

Produces:
Pods  Cores  Input Fraction  Partitions  Weight
3     12     0.055921        15          1.000007
3     12     0.061678        16          1.000002
1     4      0.015625        4           0.999983
1     4      0.021382        6           0.999982
3     12     0.050164        13          0.999979
4     16     0.061678        16          0.999973
8     32     0.125000        32          0.999949
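The weighting itself comes from Ernest's optimal experimental design (a convex program, not reproduced here); the grid it chooses from can be sketched simply. A minimal sketch, assuming 4 cores per pod and the input fraction derived from partitions out of a 256-partition total:

```python
def candidate_grid(min_pods=1, max_pods=8, cores_per_pod=4,
                   total_partitions=256, min_partitions=4, max_partitions=32):
    """Enumerate (pods, cores, input_fraction, partitions) candidate runs."""
    grid = []
    for pods in range(min_pods, max_pods + 1):
        cores = pods * cores_per_pod
        for partitions in range(min_partitions, max_partitions + 1):
            grid.append((pods, cores, partitions / total_partitions, partitions))
    return grid
```

The experimental design then assigns each candidate a weight and only the highest-weighted configurations are actually run, which keeps the training cost low.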
THE ERNEST FLOW: EXAMPLE
Collected Data Set for Training:
Cores  Input Fraction  Time (sec)
4      0.015625        0.4131746292114258
4      0.021382        0.43631744384765625
12     0.050164        1.7999424934387207
12     0.055921        1.3083412647247314
…      …               …
THE ERNEST FLOW: EXAMPLE
Predictions:
Pods  Predicted Time (sec)
4     7.6657299238771355
8     5.174769180472036
12    4.940850807574099
16    5.271193027302953
20    5.827239484082524
24    6.4961380593874525
28    7.229523559564015
32    8.003213387785348
…     …
48    11.299494340894533
52    12.150692492278885
…     …
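Given such predictions, the objective of minimizing cost while satisfying SLAs suggests picking the smallest pod count whose predicted runtime is close to the minimum, rather than the fastest configuration outright. A minimal sketch over the (rounded) values from the table above:

```python
# Predicted times from the table above, rounded, keyed by pod count.
predicted = {4: 7.67, 8: 5.17, 12: 4.94, 16: 5.27,
             20: 5.83, 24: 6.50, 28: 7.23, 32: 8.00}

def best_pods(predictions, tolerance=0.1):
    """Smallest pod count within `tolerance` (fractional) of the fastest time."""
    fastest = min(predictions.values())
    close_enough = [pods for pods, t in predictions.items()
                    if t <= fastest * (1 + tolerance)]
    return min(close_enough)
```

With a 10% tolerance this picks 8 pods instead of the fastest 12-pod configuration, trading a fraction of a second for a third fewer pods.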
INTRODUCING SPARKMEASURE
PERFORMANCE METRICS COLLECTION
• sparkMeasure used in Flight Recorder mode.
• Metrics collected at both stage and task level in pre-production.
• Metrics saved in Prometheus.
ADDING SPARKMEASURE
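In flight recorder mode, sparkMeasure attaches as a Spark listener with no code changes. A sketch of the spark-submit configuration, where the artifact version, output path, and application class/jar are illustrative (check the sparkMeasure README for current coordinates and options):

```shell
# Attach sparkMeasure's flight recorder as a Spark listener; stage-level
# metrics are serialized to a file when the application ends, with no
# changes to the application code itself.
spark-submit \
  --packages ch.cern.sparkmeasure:spark-measure_2.11:0.17 \
  --conf spark.extraListeners=ch.cern.sparkmeasure.FlightRecorderStageMetrics \
  --conf spark.sparkmeasure.outputFilename=/tmp/stagemetrics.serialized \
  --class com.example.MyApp my-app.jar
```

The serialized metrics file can then be post-processed offline, which is what makes this mode suitable for collecting training data across many runs.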
PATTERNS SPOTTED THROUGH SPARKMEASURE
After the first experiment design run, a common pattern was spotted across different configurations:
PATTERNS SPOTTED THROUGH SPARKMEASURE
After the code fix:
THE FUTURE
REINFORCEMENT LEARNING…
…OR BETTER, DEEP REINFORCEMENT LEARNING
USEFUL LINKS
Apache Spark: https://spark.apache.org/
DeepLearning4J: https://deeplearning4j.org/
Streamsets Data Collector: https://streamsets.com/products/sdc
SparkMeasure: https://github.com/LucaCanali/sparkMeasure
Ernest paper: http://shivaram.org/publications/ernest-nsdi.pdf
THANKS!!!
Any questions? You can find me at
@guglielmoiozzia
https://ie.linkedin.com/in/giozzia
googlielmo.blogspot.com