A PRACTICAL CASE OF AIOPS: FAILURE MANAGEMENT IN APACHE SPARK · GUGLIELMO IOZZIA · MSD · MOSCOW, OCTOBER 9TH 2019 · #GuglielmoIozzia

A PRACTICAL CASE OF AIOPS: FAILURE MANAGEMENT IN APACHE SPARK · 2019-10-21 · BACKGROUND INFORMATION •Pre-production and production environments based on OpenShift Origin. •The

A PRACTICAL CASE OF AIOPS: FAILURE MANAGEMENT IN APACHE SPARK

GUGLIELMO IOZZIA

MSD

MOSCOW, OCTOBER 9TH 2019

#GuglielmoIozzia

ABOUT ME

Currently at

Previously at

Author

I got some awards lately

I love cooking

Champion

#guglielmoiozzia

MSD IRELAND

http://www.msd-ireland.com/

50+ years

Approx. 2,000 employees

$2.5 billion investment to date

Approx. 50% of MSD’s top 20 products manufactured here

Exports to 60+ countries

€6.1 billion turnover in 2017

2017: +300 jobs and €280m investment

MSD Biotech, Dublin, coming in 2021

AGENDA

• What’s Failure in Spark? Let’s agree on the definition of failure in Spark.

• A Real Use Case Scenario. Let’s predict the performance of Spark apps.

• Where to Go from Here? The next level.

• Q & A. Have your say!

APACHE SPARK

It is an open source unified analytics engine for large-scale data processing.

APACHE SPARK

Speed

▪ It achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.

Ease of Use

▪ It offers over 80 high-level operators that make it easy to build parallel apps.

▪ It makes it possible to write applications quickly in Java, Scala, Python, R, and SQL.

Runs Everywhere

▪ It runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud.

▪ It can access diverse data sources.

THE PHILOSOPHY BEHIND THIS TALK

https://www.youtube.com/watch?v=2A_Bx2gBRv8

Pawel Leszczynski

Allegro.pl

WHAT’S FAILURE?

According to the English Oxford dictionary, definitions of failure are:

• Lack of success.

• The neglect or omission of expected or required action.

• The action or state of not functioning.

WHAT’S FAILURE IN APACHE SPARK?

java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:57)
    at java.nio.CharBuffer.allocate(CharBuffer.java:331)
    at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:777)
    at org.apache.hadoop.io.Text.decode(Text.java:405)
    at org.apache.hadoop.io.Text.decode(Text.java:382)
    at org.apache.hadoop.io.Text.toString(Text.java:280)
    at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:344)
    at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:344)

Lack of success

The action or state of not functioning

WHAT’S FAILURE IN APACHE SPARK?

The neglect or omission of expected or required action.

WHAT’S FAILURE IN APACHE SPARK?

A Spark application is taking too long to complete its execution. You then decide to:

• Add more nodes to the cluster, or

• Add more memory to the existing nodes of the cluster.

The neglect or omission of expected or required action.

Lack of success

FAILURE CLASSIFICATION

FAILURE CLASSIFICATION AND ALERTING

FAILURE CLASSIFICATION AND ALERTING

Spark job failure

From:

To:

Root Cause: The CityState class Kryo serialization failed.

Recommendation: Switch serialization from Kryo to Java for this job, or exclude the CityState class from the Kryo serializable classes list.
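
The classification and alerting flow on these slides can be illustrated with a small sketch. The talk does not say how failures are classified, so the rule-based matcher below is only an assumption: `Diagnosis`, `RULES`, `classify` and `format_alert` are hypothetical names, and the out-of-memory rule with its recommendation is illustrative. The Kryo recommendation follows the wording of the alert, and `spark.serializer` is the standard Spark setting for choosing the serializer.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagnosis:
    root_cause: str
    recommendation: str


# Hypothetical rule table: (pattern searched in the job log, diagnosis to report).
RULES = [
    (
        re.compile(r"KryoException"),
        Diagnosis(
            "Kryo serialization failed.",
            "Switch serialization from Kryo to Java for this job "
            "(spark.serializer=org.apache.spark.serializer.JavaSerializer).",
        ),
    ),
    (
        re.compile(r"OutOfMemoryError: GC overhead limit exceeded"),
        Diagnosis(
            "The JVM spent most of its time in garbage collection.",
            "Review executor memory and partitioning for this job.",
        ),
    ),
]

# Fallback when no rule matches.
UNKNOWN = Diagnosis("Unclassified failure.", "Inspect the full log.")


def classify(log_text: str) -> Diagnosis:
    """Return the diagnosis of the first rule whose pattern occurs in the log."""
    for pattern, diagnosis in RULES:
        if pattern.search(log_text):
            return diagnosis
    return UNKNOWN


def format_alert(diagnosis: Diagnosis) -> str:
    """Render the alert body in the same layout as the example alert."""
    return (
        "Spark job failure\n"
        f"Root Cause: {diagnosis.root_cause}\n"
        f"Recommendation: {diagnosis.recommendation}"
    )
```

Calling `format_alert(classify(log_text))` on the out-of-memory trace shown earlier would produce an alert with the illustrative garbage-collection diagnosis; a real system would need a much richer rule set or a learned classifier.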

PERFORMANCE PREDICTION

OBJECTIVES

• Predict the performance of Spark applications in order to improve it while minimizing cost and resource utilization.

• Shift as many defects as possible from production to test.

• Satisfy the SLAs.

BACKGROUND INFORMATION

• Pre-production and production environments based on OpenShift Origin.

• The PoC started before Spark 2.3.

• Spark was upgraded to 2.3+ at a later stage.

• Spark cluster management in OpenShift Origin/Kubernetes was initially based on the radanalytics.io Oshinko OS project.

• Initially only Scala and Java Spark applications were involved.

• No application data in Hadoop. Different data storage options (Cassandra, S3, MySQL).

CHALLENGES

• Spark jobs not running repeatedly on the same data set.

• ML or hyperparameter tuning jobs with no relevant history.

• Spark specific extra complexity:

• Support for more than one programming language (Scala, Java)

• Diverse applications (SparkSQL, GraphX, ML)

• Different submission methods (Docker/Kubernetes, batch, scheduled (Airflow), REST (Livy), notebooks (Jupyter, Zeppelin))

• Infrastructure (Kubernetes, auto-scaling, Airflow)

FIRST APPROACH

The Ernest project

▪ A research paper from the AMPLab @ UC Berkeley.

Pros

▪ It is an end-to-end linear model.

▪ It can make performance predictions after training a model on small samples of data.

▪ It follows an Optimal Experimental Design to choose the data points to collect.

▪ Low overhead.

Cons

▪ It has been validated against EC2 instances only.

▪ It has been validated mostly against Spark MLlib apps.

▪ Black box approach.

HOW DOES ERNEST WORK?
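
The figure for this slide did not survive, so here is a brief reminder of the model as described in the Ernest paper linked under Useful Links. The running time of a job is modelled as a combination of four terms whose coefficients are fitted with non-negative least squares. The symbols are chosen here for illustration: $s$ is the fraction of the input data and $m$ is the number of machines (cores in the example on the next slides).

```latex
\[
  \mathrm{time}(s, m) \;=\; \theta_0 \;+\; \theta_1 \,\frac{s}{m}
  \;+\; \theta_2 \,\log m \;+\; \theta_3 \, m,
  \qquad \theta_i \ge 0 .
\]
```

Roughly, $\theta_0$ captures fixed serial cost, $\theta_1$ the parallel computation that shrinks as machines are added, and $\theta_2$ and $\theta_3$ the communication and coordination overheads that grow with the number of machines.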

THE ERNEST FLOW: EXAMPLE

Input to Experimental Design:

Min Partitions  Max Partitions  Total Partitions  Min Pods  Max Pods  Cores per Pod
4               32              256               1         8         4

Produces:

Pods  Cores  Input Fraction  Partitions  Weight
3     12     0.055921        15          1.000007
3     12     0.061678        16          1.000002
1     4      0.015625        4           0.999983
1     4      0.021382        6           0.999982
3     12     0.050164        13          0.999979
4     16     0.061678        16          0.999973
8     32     0.125000        32          0.999949
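
The numbers in the output table can be related to the inputs with a little arithmetic. The sketch below only enumerates candidate experiments; it does not implement the optimal experimental design step that assigns the weights. It assumes that input fractions are evenly spaced between min/total and max/total partitions, that cores = pods × cores per pod, and that partitions are rounded up. The default of 20 steps is inferred from the fractions in the table, not stated in the talk, and the function names are hypothetical.

```python
import math


def linspace(lo: float, hi: float, n: int) -> list:
    """Return n evenly spaced values from lo to hi, both included."""
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]


def candidate_experiments(min_parts, max_parts, total_parts,
                          min_pods, max_pods, cores_per_pod, steps=20):
    """Enumerate (pods, cores, input_fraction, partitions) candidates."""
    fractions = linspace(min_parts / total_parts, max_parts / total_parts, steps)
    candidates = []
    for pods in range(min_pods, max_pods + 1):
        for fraction in fractions:
            # Round before ceil so float noise cannot push 32.0 up to 33.
            partitions = math.ceil(round(fraction * total_parts, 9))
            candidates.append(
                (pods, pods * cores_per_pod, round(fraction, 6), partitions)
            )
    return candidates


# Inputs taken from the "Input to Experimental Design" table.
candidates = candidate_experiments(4, 32, 256, 1, 8, 4)
```

Under these assumptions, every row of the output table (for example 3 pods, 12 cores, fraction 0.055921, 15 partitions) appears among the enumerated candidates; the design step then selects a subset of them to actually run.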

THE ERNEST FLOW: EXAMPLE

Collected Data Set for Training:

Cores  Input Fraction  Time (sec)
4      0.015625        0.4131746292114258
4      0.021382        0.43631744384765625
12     0.050164        1.7999424934387207
12     0.055921        1.3083412647247314
…      …               …
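
To show how such a data set could be turned into a model, here is a dependency-free sketch of a non-negative least squares fit over Ernest-style features (constant, fraction/cores, log(cores), cores). It is not the project's own code: a real implementation would normally call a library NNLS routine, whereas this version simply tries every subset of features, which is only practical because there are four of them. Only the four rows visible on the slide are used; the elided rows are not filled in, so the resulting coefficients are purely illustrative.

```python
import itertools
import math


def features(cores, fraction):
    """Ernest-style feature vector for one run."""
    return [1.0, fraction / cores, math.log(cores), float(cores)]


def predict(theta, cores, fraction):
    return sum(w * x for w, x in zip(theta, features(cores, fraction)))


def sse(rows, targets, theta):
    """Sum of squared errors of the linear model theta on the data."""
    return sum((sum(x * w for x, w in zip(r, theta)) - t) ** 2
               for r, t in zip(rows, targets))


def solve(a, b):
    """Solve a small square system by Gauss-Jordan elimination; None if singular."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return None
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def nnls(rows, targets):
    """Brute-force NNLS: least squares on every feature subset, keep the
    best solution whose coefficients are all non-negative."""
    k = len(rows[0])
    best, best_err = [0.0] * k, sse(rows, targets, [0.0] * k)
    for size in range(1, k + 1):
        for support in itertools.combinations(range(k), size):
            ata = [[sum(r[i] * r[j] for r in rows) for j in support]
                   for i in support]
            atb = [sum(r[i] * t for r, t in zip(rows, targets))
                   for i in support]
            sol = solve(ata, atb)
            if sol is None or min(sol) < 0:
                continue
            theta = [0.0] * k
            for i, v in zip(support, sol):
                theta[i] = v
            err = sse(rows, targets, theta)
            if err < best_err:
                best, best_err = theta, err
    return best


# (cores, input fraction, time in seconds): the four rows shown on the slide.
runs = [
    (4, 0.015625, 0.4131746292114258),
    (4, 0.021382, 0.43631744384765625),
    (12, 0.050164, 1.7999424934387207),
    (12, 0.055921, 1.3083412647247314),
]
rows = [features(c, f) for c, f, _ in runs]
times = [t for _, _, t in runs]
theta = nnls(rows, times)
```

With `theta` in hand, `predict(theta, cores, 1.0)` estimates the running time on the full input for a given number of cores, which is the kind of table shown on the next slide.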

THE ERNEST FLOW: EXAMPLE

Predictions:

Pods  Predicted Time (sec)
4     7.6657299238771355
8     5.174769180472036
12    4.940850807574099
16    5.271193027302953
20    5.827239484082524
24    6.4961380593874525
28    7.229523559564015
32    8.003213387785348
…     …
48    11.299494340894533
52    12.150692492278885
…     …
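
A prediction table like this one can feed directly into the objectives stated earlier: pick the fastest configuration, or the smallest one that still meets an SLA. The sketch below uses only the rows visible on the slide (the elided rows are left out), keeps the slide's own size column as the key, and uses a 6-second limit that is purely hypothetical, as are the function names.

```python
# Size (first column of the slide) -> predicted time in seconds.
predictions = {
    4: 7.6657299238771355,
    8: 5.174769180472036,
    12: 4.940850807574099,
    16: 5.271193027302953,
    20: 5.827239484082524,
    24: 6.4961380593874525,
    28: 7.229523559564015,
    32: 8.003213387785348,
    48: 11.299494340894533,
    52: 12.150692492278885,
}


def fastest(pred):
    """Size with the lowest predicted running time."""
    return min(pred, key=pred.get)


def cheapest_within(pred, max_seconds):
    """Smallest size whose predicted time meets the limit, or None."""
    ok = [size for size, seconds in pred.items() if seconds <= max_seconds]
    return min(ok) if ok else None


best = fastest(predictions)                  # 12
budget = cheapest_within(predictions, 6.0)   # 8, under the hypothetical 6 s limit
```

The gap between the two answers is the point: beyond a certain size the predicted time rises again, so adding resources costs more without helping.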

INTRODUCING SPARKMEASURE

PERFORMANCE METRICS COLLECTION

• sparkMeasure used in Flight Recorder mode.

• Metrics collected at both stage and task levels in pre-production.

• Metrics saved in Prometheus.
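
As a rough illustration of what Flight Recorder mode involves, the sketch below assembles a `spark-submit` command that registers sparkMeasure's stage and task listeners through Spark's `spark.extraListeners` setting. It assumes the sparkMeasure jar is already on the application classpath; the jar path, the main class and the function name are hypothetical. Output file options and the export to Prometheus are not shown, because the talk does not describe how they were configured.

```python
def flight_recorder_args(app_jar: str, main_class: str) -> list:
    """Build spark-submit arguments that enable sparkMeasure's listeners."""
    listeners = ",".join([
        "ch.cern.sparkmeasure.FlightRecorderStageMetrics",  # stage-level metrics
        "ch.cern.sparkmeasure.FlightRecorderTaskMetrics",   # task-level metrics
    ])
    return [
        "spark-submit",
        "--class", main_class,
        "--conf", f"spark.extraListeners={listeners}",
        app_jar,
    ]


# Hypothetical application jar and main class, for illustration only.
command = flight_recorder_args("my-app.jar", "com.example.MyApp")
```

Because the listeners are attached through configuration, the application code itself does not need to change, which suits a pre-production environment running many different jobs.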

ADDING SPARKMEASURE

PATTERNS SPOTTED THROUGH SPARKMEASURE

After the first experimental design run, a common pattern was spotted across different configurations:

PATTERNS SPOTTED THROUGH SPARKMEASURE

After the code fix:

THE FUTURE

REINFORCEMENT LEARNING…

…OR BETTER, DEEP REINFORCEMENT LEARNING

HANDS-ON DL ON APACHE SPARK

http://tinyurl.com/y9jkvtuy

USEFUL LINKS

Apache Spark: https://spark.apache.org/

DeepLearning4J: https://deeplearning4j.org/

Streamsets Data Collector: https://streamsets.com/products/sdc

SparkMeasure: https://github.com/LucaCanali/sparkMeasure

Ernest paper: http://shivaram.org/publications/ernest-nsdi.pdf

THANKS!!!

Any questions?

You can find me at

@guglielmoiozzia

https://ie.linkedin.com/in/giozzia

googlielmo.blogspot.com