A PRACTICAL CASE OF AIOPS: FAILURE MANAGEMENT IN APACHE SPARK · 2019-10-21

A PRACTICAL CASE OF AIOPS: FAILURE MANAGEMENT IN APACHE SPARK

GUGLIELMO IOZZIA

MSD

MOSCOW, OCTOBER 9TH 2019

#GuglielmoIozzia

ABOUT ME

Currently at

Previously at

Author

I got some awards lately

I love cooking

Champion

MSD IRELAND

http://www.msd-ireland.com/

50+ years

Approx. 2,000 employees

$2.5 billion investment to date

Approx. 50% of MSD’s top 20 products manufactured here

Exports to 60+ countries

€6.1 billion turnover in 2017

2017: +300 jobs and €280m investment

MSD Biotech, Dublin, coming in 2021

AGENDA

• What’s Failure in Spark? Let’s agree on the definition of failure in Spark.

• A Real Use Case Scenario. Let’s predict the performance of Spark apps.

• Where to Go from Here? The next level.

• Q & A. Have your say!

APACHE SPARK

It is an open source unified analytics engine for large-scale data processing.

APACHE SPARK

Speed

▪ It achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.

Ease of Use

▪ It offers over 80 high-level operators that make it easy to build parallel apps.

▪ It makes it possible to write applications quickly in Java, Scala, Python, R, and SQL.

Runs Everywhere

▪ It runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud.

▪ It can access diverse data sources.

THE PHILOSOPHY BEHIND THIS TALK

https://www.youtube.com/watch?v=2A_Bx2gBRv8

Pawel Leszczynski

Allegro.pl

WHAT’S FAILURE?

According to the English Oxford dictionary, definitions of failure are:

• Lack of success.

• The neglect or omission of expected or required action.

• The action or state of not functioning.

WHAT’S FAILURE IN APACHE SPARK?

java.lang.OutOfMemoryError: GC overhead limit exceeded

at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:57)

at java.nio.CharBuffer.allocate(CharBuffer.java:331)

at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:777)

at org.apache.hadoop.io.Text.decode(Text.java:405)

at org.apache.hadoop.io.Text.decode(Text.java:382)

at org.apache.hadoop.io.Text.toString(Text.java:280)

at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:344)

at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:344)

Lack of success

The action or state of not functioning


The neglect or omission of expected or required action.


A Spark application is taking too long to complete its execution. You then decide to:

• Add more nodes to the cluster, or

• Add more memory to the existing nodes of the cluster.

The neglect or omission of expected or required action.

Lack of success

FAILURE CLASSIFICATION

FAILURE CLASSIFICATION AND ALERTING

FAILURE CLASSIFICATION AND ALERTING

To:
From:
Spark job failure

Root Cause:
The Kryo serialization of the CityState class failed.

Recommendation:
Switch serialization from Kryo to Java for this job, or exclude the CityState class from the Kryo serializable classes list.
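The alert above pairs a detected failure with a root cause and a recommendation. A minimal sketch of how such a pairing could be produced, assuming a plain rule table matched against the log text. The rule patterns, the `Diagnosis` type and the `classify_failure` name are hypothetical and are not taken from the system shown in the talk; the OutOfMemoryError recommendation is a generic placeholder, while the Kryo recommendation reuses the wording of the alert.

```python
import re
from typing import NamedTuple, Optional


class Diagnosis(NamedTuple):
    root_cause: str
    recommendation: str


# Hypothetical rule table: (pattern searched in the log text, diagnosis).
RULES = [
    (
        re.compile(r"java\.lang\.OutOfMemoryError: GC overhead limit exceeded"),
        Diagnosis(
            "The JVM spent most of its time in garbage collection.",
            "Review the memory settings and the partition sizes of this job.",
        ),
    ),
    (
        re.compile(r"KryoException"),
        Diagnosis(
            "Kryo serialization failed.",
            "Switch serialization from Kryo to Java for this job, or exclude "
            "the class from the Kryo serializable classes list.",
        ),
    ),
]


def classify_failure(log_text: str) -> Optional[Diagnosis]:
    """Return the diagnosis of the first matching rule, or None if unknown."""
    for pattern, diagnosis in RULES:
        if pattern.search(log_text):
            return diagnosis
    return None


# First line of the stack trace shown earlier in the deck.
diagnosis = classify_failure("java.lang.OutOfMemoryError: GC overhead limit exceeded")
```

A rule table like this only covers failures that have already been seen; anything unmatched returns `None` and would still need manual triage.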

PERFORMANCE PREDICTION

OBJECTIVES

• Predict the performance of Spark applications in order to improve it while minimizing cost and resource utilization.

• Shift as many defects as possible from production to test.

• Satisfy the SLAs.

BACKGROUND INFORMATION

• Pre-production and production environments based on OpenShift Origin.

• The PoC started before Spark 2.3.

• Spark was upgraded to 2.3+ at a later stage.

• Spark cluster management in OpenShift Origin/Kubernetes initially based on the radanalytics.io Oshinko OS project.

• Initially only Scala and Java Spark applications involved.

• No application data in Hadoop. Different data storage options (Cassandra, S3, MySQL).

CHALLENGES

• Spark jobs not running repeatedly on the same data set.

• ML or hyperparameter tuning jobs with no relevant history.

• Spark-specific extra complexity:

  • Support for more than one programming language (Scala, Java)

  • Diverse applications (SparkSQL, GraphX, ML)

  • Different submission methods (Docker/Kubernetes, batch, scheduled (Airflow), REST (Livy), notebooks (Jupyter, Zeppelin))

  • Infrastructure (Kubernetes, auto-scaling, Airflow)

FIRST APPROACH

The Ernest project

▪ A research paper from the AMPLab @ UC Berkeley.

Pros

▪ It is an end-to-end linear model.

▪ It can make performance predictions after training a model on small samples of data.

▪ It follows an Optimal Experimental Design to choose the data points to collect.

▪ Low overhead.

Cons

▪ It has been validated against EC2 instances only.

▪ It has been validated mostly against Spark MLlib apps.

▪ Black box approach.

HOW DOES ERNEST WORK?

THE ERNEST FLOW: EXAMPLE

Input to Experimental Design:

Min Partitions   Max Partitions   Total Partitions   Min Pods   Max Pods   Cores per Pod
4                32               256                1          8          4

Produces:

Pods   Cores   Input Fraction   Partitions   Weight
3      12      0.055921         15           1.000007
3      12      0.061678         16           1.000002
1      4       0.015625         4            0.999983
1      4       0.021382         6            0.999982
3      12      0.050164         13           0.999979
4      16      0.061678         16           0.999973
8      32      0.125000         32           0.999949
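The rows produced by the experimental design are consistent with a simple candidate grid: cores = pods × cores per pod, input fractions evenly spaced between min/total and max/total partitions, and partitions rounded up from fraction × total. A minimal sketch of that enumeration, assuming 20 evenly spaced fractions (a value inferred from the numbers in the table, not stated on the slide) and assuming every pod count is paired with every fraction. The `Candidate` type and `candidate_points` name are hypothetical, and the optimization step that assigns the Weight column is not reproduced here.

```python
import math
from typing import List, NamedTuple


class Candidate(NamedTuple):
    pods: int
    cores: int
    input_fraction: float
    partitions: int


def candidate_points(
    min_parts: int,
    max_parts: int,
    total_parts: int,
    min_pods: int,
    max_pods: int,
    cores_per_pod: int,
    fraction_steps: int = 20,  # assumption inferred from the table values
) -> List[Candidate]:
    """Enumerate (pods, cores, input fraction, partitions) candidates."""
    low = min_parts / total_parts
    high = max_parts / total_parts
    fractions = [
        low + (high - low) * i / (fraction_steps - 1)
        for i in range(fraction_steps)
    ]
    points = []
    for pods in range(min_pods, max_pods + 1):
        for fraction in fractions:
            # Round first so float noise cannot push an exact value upwards.
            partitions = math.ceil(round(fraction * total_parts, 9))
            points.append(
                Candidate(pods, pods * cores_per_pod, fraction, partitions)
            )
    return points


# Design input from the slide: 4 to 32 of 256 partitions, 1 to 8 pods, 4 cores per pod.
candidates = candidate_points(4, 32, 256, 1, 8, 4)
```

The experimental design then keeps only the candidates that are most informative for the model, which is why the slide lists a handful of rows rather than the full grid.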

THE ERNEST FLOW: EXAMPLE

Collected Data Set for Training:

Cores   Input Fraction   Time (sec)
4       0.015625         0.4131746292114258
4       0.021382         0.43631744384765625
12      0.050164         1.7999424934387207
12      0.055921         1.3083412647247314
…       …                …
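The Ernest paper models running time as a non-negative combination of four terms: a constant, the input share per machine, the logarithm of the number of machines, and the number of machines, fitted with non-negative least squares. A minimal sketch of that fit, under several assumptions: cores stand in for machines (as in the table), a small pure-Python coordinate-descent solver stands in for a library NNLS routine, and only the four rows visible on the slide are used. With so few rows the coefficients are not unique, so the result is illustrative only. The names `features`, `fit_nnls` and `predict` are hypothetical.

```python
import math
from typing import List, Sequence


def features(input_fraction: float, cores: int) -> List[float]:
    """Ernest-style terms: constant, input share per core, log(cores), cores."""
    return [1.0, input_fraction / cores, math.log(cores), float(cores)]


def fit_nnls(
    rows: Sequence[Sequence[float]],
    targets: Sequence[float],
    sweeps: int = 5000,
) -> List[float]:
    """Non-negative least squares by cyclic coordinate descent."""
    n_features = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_features)]
    norms = [sum(v * v for v in column) for column in columns]
    theta = [0.0] * n_features
    residual = list(targets)  # targets minus predictions, with theta = 0
    for _ in range(sweeps):
        for j in range(n_features):
            if norms[j] == 0.0:
                continue
            # Exact minimizer along coordinate j, clipped at zero.
            step = sum(v * r for v, r in zip(columns[j], residual)) / norms[j]
            new_value = max(0.0, theta[j] + step)
            delta = new_value - theta[j]
            if delta != 0.0:
                residual = [r - delta * v for r, v in zip(residual, columns[j])]
                theta[j] = new_value
    return theta


def predict(theta: Sequence[float], input_fraction: float, cores: int) -> float:
    """Predicted running time in seconds for a given input fraction and core count."""
    return sum(t * f for t, f in zip(theta, features(input_fraction, cores)))


# The four training rows visible on the slide: (cores, input fraction, seconds).
observed = [
    (4, 0.015625, 0.4131746292114258),
    (4, 0.021382, 0.43631744384765625),
    (12, 0.050164, 1.7999424934387207),
    (12, 0.055921, 1.3083412647247314),
]
rows = [features(fraction, cores) for cores, fraction, _ in observed]
times = [seconds for _, _, seconds in observed]
theta = fit_nnls(rows, times)
```

Once fitted on the small samples, the same `predict` call can be evaluated at input fraction 1.0 for each cluster size of interest, which is the step that yields a prediction table.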

THE ERNEST FLOW: EXAMPLE

Predictions:

Pods   Predicted Time (sec)
4      7.6657299238771355
8      5.174769180472036
12     4.940850807574099
16     5.271193027302953
20     5.827239484082524
24     6.4961380593874525
28     7.229523559564015
32     8.003213387785348
…      …
48     11.299494340894533
52     12.150692492278885
…      …
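The predicted times first drop and then rise again as the cluster grows, so the largest cluster is not the fastest. A minimal sketch of how such a table could be used against the stated objectives (meet the SLA with as few resources as possible). The 6-second SLA and the function names are hypothetical, and the rows elided on the slide are simply left out.

```python
from typing import Dict, Optional

# Predicted times copied from the slide (cluster size -> seconds).
# Rows shown as "…" on the slide are omitted.
predicted: Dict[int, float] = {
    4: 7.6657299238771355,
    8: 5.174769180472036,
    12: 4.940850807574099,
    16: 5.271193027302953,
    20: 5.827239484082524,
    24: 6.4961380593874525,
    28: 7.229523559564015,
    32: 8.003213387785348,
    48: 11.299494340894533,
    52: 12.150692492278885,
}


def fastest(predictions: Dict[int, float]) -> int:
    """Cluster size with the lowest predicted time."""
    return min(predictions, key=predictions.get)


def smallest_within_sla(
    predictions: Dict[int, float], sla_seconds: float
) -> Optional[int]:
    """Smallest cluster size whose predicted time meets the SLA, if any."""
    eligible = [size for size, t in predictions.items() if t <= sla_seconds]
    return min(eligible) if eligible else None


best_size = fastest(predicted)                    # 12
cheapest_ok = smallest_within_sla(predicted, 6.0)  # 8, for a hypothetical 6 s SLA
```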

INTRODUCING SPARKMEASURE

PERFORMANCE METRICS COLLECTION

• sparkMeasure used in Flight Recorder mode.

• Metrics collected at both stage and task levels in pre-production.

• Metrics saved in Prometheus.
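In Flight Recorder mode sparkMeasure is attached to an application through Spark's `spark.extraListeners` setting, so the application code does not need to change. A minimal sketch of a helper that builds the corresponding `spark-submit` arguments for stage-level and task-level collection. The helper name is hypothetical, the sparkMeasure jar is assumed to be on the classpath already, and the output location and the export to Prometheus are not covered because the slides do not show how they were configured.

```python
from typing import List

# Listener classes provided by sparkMeasure for Flight Recorder mode.
STAGE_LISTENER = "ch.cern.sparkmeasure.FlightRecorderStageMetrics"
TASK_LISTENER = "ch.cern.sparkmeasure.FlightRecorderTaskMetrics"


def flight_recorder_args(
    stage_metrics: bool = True, task_metrics: bool = True
) -> List[str]:
    """Build the spark-submit arguments that register the chosen listeners."""
    listeners = []
    if stage_metrics:
        listeners.append(STAGE_LISTENER)
    if task_metrics:
        listeners.append(TASK_LISTENER)
    if not listeners:
        return []
    # spark.extraListeners takes a comma-separated list of class names.
    return ["--conf", "spark.extraListeners=" + ",".join(listeners)]


submit_args = flight_recorder_args()
```

Task-level collection produces far more records than stage-level collection, which is one reason to keep it to pre-production runs.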

ADDING SPARKMEASURE

PATTERNS SPOTTED THROUGH SPARKMEASURE

After the first experiment design run, a common pattern was spotted across different configurations:

PATTERNS SPOTTED THROUGH SPARKMEASURE

After the code fix:

THE FUTURE

REINFORCEMENT LEARNING…

…OR BETTER, DEEP REINFORCEMENT LEARNING

HANDS-ON DL ON APACHE SPARK

http://tinyurl.com/y9jkvtuy

USEFUL LINKS

Apache Spark: https://spark.apache.org/

DeepLearning4J: https://deeplearning4j.org/

Streamsets Data Collector: https://streamsets.com/products/sdc

SparkMeasure: https://github.com/LucaCanali/sparkMeasure

Ernest paper: http://shivaram.org/publications/ernest-nsdi.pdf

THANKS!!!

Any questions?

You can find me at

@guglielmoiozzia

https://ie.linkedin.com/in/giozzia

googlielmo.blogspot.com
