Session 5 Addendum
Data Imbalance & Memory Usage Estimation
Session Addendum Objectives
➢ Understand what causes data-imbalance
➢ Understand the impact of data-imbalance
➢ Be familiar with common strategies for dealing with data-imbalance
➢ Understanding and Estimating memory usage
What is Imbalanced Data?
● “Unbalanced” or “Imbalanced” or “Lumpy” data typically refers to “too much of one and not another” for performing some type of operation
● For example, imagine “groupByKey” for businesses by zipcode in NY State.
○ NYC would have a huge proportion of the data!
● Caused by trying to organize around low-cardinality categorical or ordinal data
What Does Imbalanced Data Do?
● Makes things explode, is the main problem!
● Causes unstable nodes
● Causes laggy nodes (stragglers)
● Causes OOM errors
● Causes Looooong shuffles for wide operations
● Often manifests well-down-the-pipeline so you find out...too late
If you are a data science practitioner, the notion of imbalanced data is not new to you. However, keep in mind that the Spark execution concerns for ETL processing, while stemming from the same core issues, have a different solution space that may not be appropriate for ML scenarios. Here we are primarily concerned with how the data is managed from a cluster-execution memory and shuffle perspective, which may or may not be relevant to how an ML algorithm will experience the data imbalance.
How Do We Deal With Imbalanced Data? Part 1
● Operations Strategies (judicious use of groupBy & related shuffle-triggers)
● Filter early and often (Optimizer may handle some of this for you)
● Be cognizant of your partitioning mechanism
○ Custom Partitioner if needed/helpful
● Detecting Stragglers
○ Tasks in a stage that take overly long to execute relative to others
○ Sign of uneven partitioning
○ Use the Spark UI
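The straggler check above can also be done programmatically, before the UI shows the damage. A rough sketch (assuming a Dataset named `biz`, as in the grouping example later in this deck; names are illustrative):

```scala
// Count records per partition to spot skew early.
// `biz` is illustrative - any Dataset/RDD works the same way.
val sizes = biz.rdd
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()

// The biggest partitions are your straggler candidates
sizes.sortBy(-_._2).take(5).foreach { case (idx, n) =>
  println(s"partition $idx: $n records")
}
```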
How Do We Deal With Imbalanced Data? Part 2
● Enforcing higher cardinality
○ Re-keying pre-join (add ‘noise’ to keys)
■ E.g. use the zip+4, or add a business category
■ 10017-1234 or 10017-laundromat
● Broadcast smallish data to avoid a shuffle
○ You can push smaller dataframes out to every executor and join to them
businesses.join(broadcast(nyZips).as("z"), $"z.zip" === $"postal")
● Remove duplicates or combine via “mapPartition” before join/grouping.
● When all else fails, vertically partition the data, land it, then make a 2nd pass with remapped keys...subset, subset, subset.
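The “add noise to keys” idea above can be sketched in the DataFrame API. This is a hypothetical illustration (column and table names are made up, the salt count is arbitrary), not the lab’s code: the big, skewed side gets a random salt, and the small side is replicated once per salt value so the join still matches.

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

val salts = 10  // arbitrary; tune to how hot your keys are

// Big, skewed side: postal 10017 becomes 10017-0 .. 10017-9
val saltedBiz = biz.withColumn("saltedKey",
  concat($"postal".cast("string"), lit("-"),
    (rand() * salts).cast("int").cast("string")))

// Small side: replicate each zip row once per salt so every salted key matches
val saltedZips = nyZips
  .crossJoin(spark.range(salts).toDF("salt"))
  .withColumn("saltedKey",
    concat($"zip".cast("string"), lit("-"), $"salt".cast("string")))

// Same join result as before, but one giant group is now N smaller ones
val joined = saltedBiz.join(saltedZips, "saltedKey")
```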
https://medium.com/teads-engineering/spark-performance-tuning-from-the-trenches-7cbde521cf60
“Spark is supposed to automatically identify DataFrames that should be broadcasted. Be careful though, it’s not as automatic as it appears in the documentation. Once again, the truth is in the code and the current implementation supposes that Spark knows the DataFrame’s metadata, which is not effective by default and requires to use Hive and its metastore.”
Check your default parallelism: sc.defaultParallelism
The behavior differs between regular RDDs and DataFrames. The Spark SQL module ships with the default configuration spark.sql.shuffle.partitions set to 200.
With a DataFrame you can also call df.repartition(x) to change the number of partitions, as can rdd.coalesce(x). The main difference is that coalesce simply combines existing partitions to shrink their count, whereas repartition performs a full “rebalancing of the data” (think wide-op shuffle), equalizing the partition sizes.
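A quick sketch of that difference (the path matches the example on the next slide; partition counts are arbitrary):

```scala
val df = spark.read.csv("data/imbalance.csv")

val narrowed = df.coalesce(5)     // narrow op: merges existing partitions,
                                  // no shuffle, sizes may stay uneven
val balanced = df.repartition(5)  // wide op: full shuffle, roughly equal sizes

println(narrowed.rdd.getNumPartitions) // 5, or fewer if the input had fewer
println(balanced.rdd.getNumPartitions) // 5
```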
Example: Increasing Cardinality for Groupings
//Both work!

//This one may cause more shuffling, which increases network
//chatter, incurring delays
val nyBiz = spark.read.csv("data/imbalance.csv").coalesce(5)
case class Datum(city: String, postal: Int, category: String)
val biz = nyBiz.map(x => Datum(x.getAs[String](0),
  Integer.parseInt(x.getAs[String](1)), x.getAs[String](2)))
val groupedByPostal = biz.groupBy("postal").count.show

//This one may incur more "memory overhead" and therefore
//garbage collection
val nyBiz = spark.read.csv("data/imbalance.csv").coalesce(5)
case class Datum(city: String, postal: String, category: String)
val biz = nyBiz.map(x => Datum(x.getAs[String](0),
  x.getAs[String](1) + x.getAs[String](2), x.getAs[String](2)))
val groupedByPostal = biz.groupBy("postal").count.show
Here we are doing what appears to be the same thing - because it pretty much is. Note however that in the second case, we are changing the datatype of “postal” from Int to String, and concatenating 2 strings together to increase the cardinality of the data that will be grouped (postal). Increasing the cardinality of the grouping column spreads the data across more, smaller groups, reducing skew during group-by type activities.
Proactively Managing Memory
● Understand what eats it up
● Understand what gets used when
● Inspect the web UI and review usage
● Do the math!
● Use SizeEstimator.estimate
● Manually configure ratios (in more extreme cases)
https://spark.apache.org/docs/latest/tuning.html
Components of Memory Usage
● Objects stored & the ‘meta’ they carry with them
○ Object headers can be > the data itself
○ Linked structures retain “pointers” to siblings
○ “Primitives” might be boxed
● Cost of object access
● Spark “memory overhead minimum”, ~ 384MB
● Serialization style
By avoiding the Java features that add overhead we can reduce memory consumption. There are several ways to achieve this:
● Avoid nested structures with lots of small objects and pointers.
● Instead of using strings for keys, use numeric IDs or enumerated objects.
● If the RAM size is less than 32 GB, set the JVM flag -XX:+UseCompressedOops to make pointers four bytes instead of eight.
Remember, anything that is “an object” has to be serialized and deserialized for shuffles, or persistence. This all costs CPU and RAM.
This becomes a balance between programmer convenience and code readability vs. code complexity + speed.
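To make the boxing and key-type points concrete, a small, self-contained Scala illustration (not from the slides):

```scala
// An Array[Int] stores elements flat, 4 bytes each; an array of boxed
// java.lang.Integer stores pointers to heap objects, each with a header.
val flat: Array[Int] = Array(1, 2, 3)
val boxed: Array[java.lang.Integer] = Array(1, 2, 3) // each element boxed

// Likewise, numeric keys are smaller and cheaper to hash than strings:
val byName: Map[String, Long] = Map("laundromat" -> 42L) // heavier keys
val byId: Map[Int, Long] = Map(7 -> 42L)                 // lighter keys
```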
Components of Memory Access
● Parallelism: too few partitions can be problematic
○ More partitions decreases each task’s memory use (input data)
○ Aim for 2-3 tasks per CPU core
● Garbage Collection (the unseen devil in JVM performance problems)
○ Collect & review statistics
○ Adjust allocation to fit
○ Try different flags, and balance with “fraction settings”
■ spark.memory.fraction
■ spark.memory.storageFraction
○ This is its own science + art form: https://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html
● Data Locality
This article has a good set of problems & solutions around memory & disk space. https://www.indix.com/blog/engineering/lessons-from-using-spark-to-process-large-amounts-of-data-part-i/
Data Locality & Streaming
● Where is the Partition?
○ RDD carries information about location
○ Hadoop RDDs know about the location of HDFS data
○ KafkaRDDs indicate the Kafka-Spark partition should get data from the machine hosting the Kafka topic
○ Spark Streaming - partitions are local to the node the receiver is running on
● What is “local” for a Spark task is based on what the RDD implementer decided would be local
● 4 Kinds of Locality
○ PROCESS_LOCAL - task runs in the same process as the source data
○ NODE_LOCAL - task runs on the same machine as the source data
○ RACK_LOCAL - task runs on the same rack
○ NO_PREF/ANY - data has no locality preference, or cannot be co-located with the task
● spark.locality.wait - determines how long to wait before changing the locality goal of a task
https://github.com/apache/spark/blob/aba9492d25e285d00033c408e9bfdd543ee12f72/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L137
RDD Subclasses can create their own implementation of getPreferredLocations to provide hints about where the data is, e.g. CassandraRDD, KafkaRDD implementers.
Setting spark.locality.wait should be considered based on the merits of locality for your application, relative to latency.
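For example, lowering the wait to favor latency over locality (values here are illustrative, not recommendations):

```scala
import org.apache.spark.SparkConf

// Lower waits mean tasks accept a less-local slot sooner.
// Raise them instead if fetching remote data is your real bottleneck.
val conf = new SparkConf()
  .set("spark.locality.wait", "1s")       // default is 3s
  .set("spark.locality.wait.node", "1s")  // per-level overrides also exist
```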
Memory and Data Locality article: http://www.russellspitzer.com/2017/09/01/Spark-Locality/
Here we can review the different metrics in the Spark UI around memory usage and data shuffled about.
http://hydronitrogen.com/apache-spark-shuffles-explained-in-depth.html
Here we can look at the relative cost of tasks that involve shuffles - most of the more expensive stages involve shuffled data.
Specific Actions You Can Take - 1
● Design for size: use arrays and primitives over standard collections and richer types
○ Key off numbers, not strings
○ Use minimized objects to reduce overhead: http://fastutil.di.unimi.it/
● Calculate expected usage & size & parallelize accordingly
○ Pass in initializations (e.g. sc.textFile or spark.read.xx.coalesce(n))
○ Change the default - spark.default.parallelism
● Try dynamic allocation
● Change pointer size (< 32 GB RAM: set -XX:+UseCompressedOops in spark-env.sh)
“For most programs, switching to Kryo serialization and persisting data in serialized form will solve most common performance issues.”
“The best way to size the amount of memory consumption a dataset will require is to create an RDD, put it into cache, and look at the “Storage” page in the web UI. The page will tell you how much memory the RDD is occupying.”
- Spark Tuning Guide
Specific Actions You Can Take - 2
● Switch to Kryo for serialization
○ conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
○ Warning: must register your custom classes
● Create facilities for on-demand clusters to isolate “RBJs” (Really Big Jobs)
● Estimate Size

scala> import org.apache.spark.util.SizeEstimator

//The imported text file
scala> println(SizeEstimator.estimate(nyBiz))
62436248

//The resulting dataset post-map-to-case-class
scala> println(SizeEstimator.estimate(biz))
62440912

//The post-groupBy dataframe
scala> println(SizeEstimator.estimate(groupedByPostal))
62440904
This gist has an example of Kryo custom class registration: https://gist.github.com/claudinei-daitx/f39d51e6ecf1e0683b21741bb1cb6f53
Kryo’s own writeup: https://github.com/EsotericSoftware/kryo
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)
Specific Actions You Can Take - 3
● Adjust Garbage Collection Settings
○ Turn on logging:
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
○ Try the G1GC: -XX:+UseG1GC
○ Many others…
● Adjust “Fractions”
○ spark.memory.fraction
○ spark.memory.storageFraction
“Although there are two relevant configurations, the typical user should not need to adjust them as the default values are applicable to most workloads:
● spark.memory.fraction expresses the size of M as a fraction of the (JVM heap space - 300MB) (default 0.6). The rest of the space (40%) is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually large records.
● spark.memory.storageFraction expresses the size of R as a fraction of M (default 0.5). R is the storage space within M where cached blocks are immune to being evicted by execution.
The value of spark.memory.fraction should be set in order to fit this amount of heap space comfortably within the JVM’s old or “tenured” generation. See the discussion of advanced GC tuning below for details.”
https://spark.apache.org/docs/latest/tuning.html
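“Doing the math” on those fractions for a hypothetical 10 GB executor heap - a sketch of the formulas quoted above, with the heap size chosen purely for illustration:

```scala
// Unified memory M = (heap - 300MB reserve) * spark.memory.fraction
// Eviction-immune storage R = M * spark.memory.storageFraction
val heapBytes       = 10L * 1024 * 1024 * 1024  // hypothetical 10 GB heap
val reservedBytes   = 300L * 1024 * 1024        // fixed 300 MB reserve
val memoryFraction  = 0.6                       // default
val storageFraction = 0.5                       // default

val m = math.round((heapBytes - reservedBytes) * memoryFraction)
val r = math.round(m * storageFraction)

println(s"M = ${m / 1024 / 1024} MB, R = ${r / 1024 / 1024} MB")
// prints: M = 5964 MB, R = 2982 MB
```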
Session Review
● Definitions of Imbalanced Data
● Solutions for managing Imbalanced Data
● Understanding components of memory consumption
● Actionable steps for improving functionality
EMR
Elastic Map Reduce on AWS
Session 9: Amazon EMR
Session 9 - EMR Session Objectives
➢ Understand EMR Solution
➢ Know how to start and run a cluster
➢ Know how to interact with Spark on EMR
➢ Interfacing with S3 data
➢ Zeppelin with EMR
➢ Cost optimizations for storage and networking
➢ Scaling
➢ Debugging
EMR 101
● Elastic Map Reduce (Hadoop & Friends on AWS)
● Managed Cluster (Less devopsy stuff for you to do)
● Autoscaling
● Many Softwares
○ Spark!
○ Hadoop
○ HBase
○ Presto
○ Hive
Slide credit to ReInvent
YARN Schedulers - CapacityScheduler
● Default scheduler specified in Amazon EMR
● Queues
○ A single queue is set by default
○ Can create additional queues for workloads based on multitenancy requirements
● Capacity guarantees
○ Set minimal resources for each queue
○ Programmatically assign free resources to queues
● Adjust these settings using the classification capacity-scheduler in an EMR configuration object (or bootstrapping)
The two built-in schedulers are the Capacity Scheduler and the Fair Scheduler. EMR uses the Capacity Scheduler by default. Cloudera, on the other hand, recommends the Fair Scheduler. This can be configured using the procedure called ‘bootstrapping’, where we customize the configurations your EMR cluster runs with. The Fair Scheduler can be more configurable in the sense of handling queues.
Fair - Allocates resources to weighted pools, with fair sharing within each pool (docs).
Capacity - Allocates resources to pools, with FIFO scheduling within each pool (docs).
https://hortonworks.com/blog/yarn-capacity-scheduler/
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
http://blog.cloudera.com/blog/2016/01/untangling-apache-hadoop-yarn-part-3/
https://www.quora.com/On-what-basis-do-I-decide-between-Fair-and-Capacity-Scheduler-in-YARN
https://community.pivotal.io/s/article/How-to-configure-queues-using-YARN-capacity-scheduler-xml
EMR Hadoop
● Preconfigured software for your convenience
○ AWS-instance-type-based YARN and Hadoop settings
● Contains Hadoop customizations that are uniquely AWS
○ Must build binaries using EMR (in other words, on-cluster)
■ Not the case for Spark
○ Must build binaries with the same Linux version
● Build on EMR > Copy to S3 > Run Step Sequence
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-daemons.html
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-build-binaries.html
EMR Spark: The Sales Pitch
● “Easy” to use/get started
● Cost savings (potential)
● Open Source tools (with mods in some cases)
● Managed
● Secure
● Flexible
EMR Spark
● Fundamentally “still a Yarn-based cluster”
● Supports pretty much all the same features you’d expect running your own
● Ala-carte opportunity to drink more AWS Kool-Aid
○ Data Pipeline
○ Encryption at rest/in transit
○ Aurora-based Hive metastore
○ Spot provisioning for cost measures
■ Can incur delays
○ IAM security measures
○ S3 Data Lake (decouple compute & storage)
Much like the majority of AWS products, EMR is functionally an open source product, customized for use by AWS in the AWS ecosystem. It plays nice with many other tools in a coherent, well organized aggregate of data streaming, processing and storage tools.
Some of their tools really do seem home-grown - Kinesis, DynamoDB, S3 - whereas others are thin wrappers around popular tools:
Athena: PrestoDB
Redshift: PostgreSQL with columnar storage
And still others are useful wrappers around advertised products, such as Aurora/MySQL, ElastiCache/Redis, Elasticsearch Service...etc.
Session Review
● What EMR is
● EMR’s purpose
● Spark on EMR Basics
EMR
Elastic Map Reduce on AWS
9.1 S3 Data Lake
EMR Spark: S3 Data Lake - Why?
The S3 Data Lake concept has some advantages
● High durability (11 9’s) and high availability
● Security
○ Can constrain by IAM roles
○ VPC-only access
○ In-depth bucket policies
○ Encryption at rest
● Low Cost (dramatically lower than RDBMS or NoSQL storage)
● S3-Select (where viable/appropriate)
Additional value was added on the security-side of the S3 equation: https://aws.amazon.com/blogs/aws/new-vpc-endpoint-for-amazon-s3/
Even more security: https://aws.amazon.com/blogs/security/how-to-use-bucket-policies-and-apply-defense-in-depth-to-help-secure-your-amazon-s3-data/
Learn how to pre-filter your data in S3 before bringing it to the cluster: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3select.html
Lower your storage cost with Glacier, when performance is less of a concern: https://aws.amazon.com/blogs/aws/s3-glacier-select/
EMR Spark: S3 Data Lake
But some disadvantages too!
● Limited read/write speed
● Network latency
● “Ghost files”, “conceived files” (eventual consistency side effects)
● Somewhat confusing protocol addressing s3:// s3n:// s3a://
EMR Spark: S3 Data Lake - Protocol & File Access
● s3:// - Hadoop implementation of a block-based file system backed by S3
○ Also how you might be used to referencing directly to AWS as an S3 URI
● s3n:// - “Native file system” access by Hadoop
● s3a:// - “s3n part 2” - the upgrade to s3n
○ Supports files > 5GB
○ Uses/requires AWS SDK
○ Backwards compatible with s3n
○ NOT SUPPORTED in EMR!
● EMRFS - Wait...we’re back to s3:// - yep. AWS EMR has re-simplified the confusion back to just s3:// if you are on EMR
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html
EMR Spark: S3 Data Lake - Latency Concerns
● Resolve S3 inconsistencies, if present, with “EMRFS consistent View” in cluster setup
● Use compression!
○ CSV/JSON - GZip or Bzip2 (if you wish S3-Select to be an option)
● Use S3-Select for CSV or JSON if filtering out ½ or more of the dataset
● Use other types of file-store, i.e. Parquet/Orc
● Chunk your files
○ Spark handles many small files better than a few huge files, up to a point
EMR Spark: S3 Data Lake - Latency Concerns: Sizing
How big should my files be? It depends -
● No S3-Select
○ With GZip, 1-2GB tops. GZip cannot be split.
○ Splittable files: between 2GB and 4GB
■ Allows more than 1 mapper, increasing throughput
■ Goal is to process as many files in parallel as possible
EMR Spark: S3 Data Lake - Latency Concerns: Sizing
How big should my files be? It depends -
● Using S3-Select? Less of a concern
○ Input files must be CSV, JSON, or Parquet
○ Output files must be JSON or CSV
○ Files must be uncompressed, .gz (GZip), or .bz2 (Bzip2) (JSON/CSV only)
○ Max SQL expression length: 256KB
○ Max result record length: 1MB
EMR Spark: Ingesting and Landing Data
● Ingestion
○ We can often improve overall system performance by using S3DistCp to bring data in from S3 and push it to HDFS - but this is a whole extra step
○ We may of course ingest from other datastores: RDS, Redshift, Streaming (Kafka/Kinesis), Dynamo...etc. We can also ingest from Elasticsearch
● Landing
○ When we land data, we can temporarily land it to a Hive table for interactive exploration
○ We can of course land it to any JDBC storage
○ Often we will land it to S3 for further interaction with other systems (Athena, Presto...etc)
○ Fun tip - need to feed your ES search? You can write your final result directly to Elasticsearch!
https://www.elastic.co/guide/en/elasticsearch/hadoop/6.x/spark.html
Session Review
● Value of S3 Data Lake
● Hindrances (pros/cons)
● Some ideas for how to store/retrieve data
EMR
Elastic Map Reduce on AWS
9.2 Setting up the Cluster
EMR Cluster: Setup
● Methods
○ AWS Console
■ Advantages: Simplicity, Clarity
○ AWS CLI
■ Advantages: Completeness, Scriptable
○ AWS SDK
■ Java
■ Boto3 (Python)
■ ...etc...
■ Advantages: Infrastructure as Code (IaC) friendly, completeness
https://sysadmins.co.za/aws-create-emr-cluster-with-java-sdk-examples/
EMR Cluster: Setup: Console
Don’t be seduced by the one-pager (Create Cluster - “Quick Options”). You typically want to use “Advanced” mode (if you use the Console at all).
If you do use quick options, at least take note of your log location, and make sure you select Spark ;)
Instance type is a function of sizing exercises you presumably have already done, or perhaps will do after you run some trial code.
Step execution is good for 1-off job clusters (launch, do, terminate)
EMR Cluster: Setup: Console - Quick
Pick the smallest instance you can. This is only a test...
You will need to add a key-pair in order to ssh-in, including accessing Web UI for Spark
EMR Cluster: Setup: Console - Quick
Get coffee. This is not a super-quick procedure. The EMR machinery is doing a bit of work, and the more software you selected, the longer it will take.
Once this cluster is launched, it is really not much different programmatically, from a local or on-prem cluster, except you have to SSH in to do much.
Look up the master node for your cluster in the Console UI, or:
> aws emr list-clusters
> aws emr list-instances --cluster-id j-YOUR-CLUSTER-ID
Or
> aws emr describe-cluster --cluster-id j-YOUR-CLUSTER-ID
SSH instructions can be found on the cluster page, click the “SSH” link next to the master address.
EMR Cluster: Setup: Console - Quick
You can SSH in:
> ssh -i ~/.ssh/rf_emr_dev_access.pem [email protected]
--EMR banner shows --
> sudo spark-shell
If you don’t `sudo` you get a bunch of warnings basically saying the logs cannot be written.
If you have trouble ssh’ing in - by default the security group sometimes does NOT allow SSH from anywhere (or everywhere). In the cluster UI, click on the security group for the master, and make sure it allows SSH (port 22) from at least your IP range, while staying safe.
EMR Cluster: Setup: Console - Quick
You can connect from Zeppelin:
(It’s already installed on the cluster by default, just need an SSH tunnel)
Once a tunnel is set up, you can just “click the link” in the cluster-ui.
> ssh -i ~/.ssh/rf_emr_dev_access.pem -ND 8157 [email protected]
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-ssh-tunnel-local.html
MINI-LAB - Launch a Quick Cluster
● Log into AWS
● Launch a Quick Cluster
● Connect from Spark Shell
● Run a few commands to prove it’s working
○ Try parallelizing an array of some data and mapping it
○ Review the Spark UI (in the cluster window click “Enable Web Connection” and follow instructions)
● Bonus Credit - launch Zeppelin (in the cluster window click “Enable Web Connection” and follow instructions)
● Terminate the cluster
Run some commands -
scala> val numbers = Array(1, 2, 3, 3, 4, 4, 5)
numbers: Array[Int] = Array(1, 2, 3, 3, 4, 4, 5)
scala> val numbersRDD = sc.parallelize(numbers)
numbersRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:26
scala> numbersRDD.distinct.collect
1
2
3
4
5
To try Spark UI or Zeppelin, establish an SSH Tunnel
ssh -i ~/.ssh/rf_emr_dev_access.pem -ND 8157 [email protected]
Much like with SSH, you may need to open up the port in the security groups.
This command will just stay running, it will not terminate until you ctrl-c it.
Then open a browser, or click, the links in the Console UI.
E.g.
http://ec2-100-25-196-44.compute-1.amazonaws.com:18080/
http://ec2-100-25-196-44.compute-1.amazonaws.com:8890/#/
EMR Cluster: Setup: Console - Advanced
● Unsurprisingly, kinda the same, but with options
● Customize your software set. ○ Need to ala-carte TensorFlow? ○ Contrast TensorFlow and MXNet?○ Try out Presto?
● Customize the installed configurations (i.e. adjust hadoop-env, yarn-site..etc)
● Add steps & conditionally set auto-terminate
● Optimize pricing/Instance types
● Customize Security/VPC/Subnets
EMR Cluster: Setup: Console - Advanced
What’s in a Node?
There are some details to be aware of around node types that are hidden from the Quick Cluster setup.
● Master Node, you are probably familiar with
○ Runs the HDFS NameNode service
○ Runs the YARN ResourceManager service
○ Tracks submitted job statuses and monitors the health of the instance groups
○ Like Highlander, there can be only one (per instance group/fleet)
● Core Nodes
○ Run the DataNode daemon (for HDFS)
○ Run the TaskTracker daemon
○ This is a scaling point
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances.html
EMR Cluster: Setup: Console - Advanced
What’s in a Node?
There are some details to be aware of around node types that are hidden from the Quick Cluster setup.
● Task Nodes
○ Do not run the DataNode daemon (not participating in HDFS)
○ Best for autoscale/spike capacity in your cluster
● Instance Fleets
○ Fully configurable cluster management
○ Able to take advantage of Spot instances (cost optimization)
○ Allow AWS to “mix and match” instance types, optimizing your pricing and their utilization. Can result in sudden-death nodes.
○ Can be used to really optimize, but are complex and require experimentation
○ Can add a “Task Instance Fleet” to an active cluster
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances.html
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-instance-fleet.html
EMR Cluster: Setup: Console - Advanced
What’s in a Node?
There are some details to be aware of around node types that are hidden from the Quick Cluster setup.
● Uniform Instance Groups
○ Simplified capacity management
○ While allowing flexible autoscaling setup
○ Specify purchasing options to manage cost
○ Don’t run the Master as a Spot Instance on any cluster you care about
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-uniform-instance-group.html
EMR Cluster: Setup: Console - Advanced
3: Cluster Settings
There are a few things to note here, but the highlights for now are -
● Logging of course (location spec)
● “EMRFS consistent view” - remember when we talked about S3 “eventual consistency”?
● Bootstrap Actions (NOT the same as ‘steps’)
○ Definitely an “advanced mode” option here, for:
■ Pre-loading some common data set onto each node
■ Installing additional software (Drill, Impala, Elasticsearch, Saltstack...etc)
■ Max of 16 actions
https://github.com/aws-samples/emr-bootstrap-actions
https://aws.amazon.com/premiumsupport/knowledge-center/bootstrap-step-emr/
Bootstrap actions are the first thing to run after an Amazon EMR cluster transitions from the STARTING state to the BOOTSTRAPPING state. Bootstrap actions, which run on all cluster nodes, are scripts that run as the Hadoop user by default, but they can also run as the root user with the sudo command. The cluster won't start if a bootstrap action fails.
Steps are a distinct unit of work, comprising one or more Hadoop jobs that run only on the master node of an Amazon EMR cluster. Steps are usually used to transfer or process data. One step might submit work to a cluster, and others might process the submitted data and then send the processed data to a particular location. Steps are often what is used for a “run and done” cluster.
MINI-LAB - Configure an Advanced Cluster
● Log into AWS
● Configure an Advanced Cluster
● Play with the options
● Click some “i” icons
● Q&A
What Are These “Steps”?
“When you are done starting, do this one thing.” Then maybe shut down too.
● Available in:
○ Quick Launch - auto-terminate when done
○ Advanced - specify steps & termination option
● A “unit of work” submitted to the cluster
○ Stream processing
○ Hive/Pig
○ Spark job
○ Custom Hadoop
● Each has its own unique configuration
One disadvantage is that ephemeral clusters can be hard to troubleshoot. Shutdown post-step is not required (when using Advanced Config).
EMR Cluster: Setup: Other Ways To Start
● Are you using IaC? Code it in.
○ On-Demand Jobs from Jenkins
○ Other “AWS-SDK”-based solutions
● Script it
○ Generally anything that can be done in the Console can be done in the CLI
○ Usually more options in the CLI
○ Also an IaC option here…
○ https://docs.aws.amazon.com/cli/latest/reference/emr/index.html
For guaranteeing execution isolation, I really like the one-off-cluster mechanism, but it is easily overkill for “many small/mid-sized jobs” environments.
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-submit-step.html
EMR Cluster: Setup: Launch via CLI
Demo Lab 9.2
Follow along if you like/are able. What we’ll do:
1. Push a chunk of our data from earlier to S3
2. Take one of our lab projects as a Jar and push it to S3
3. Launch a cluster using “steps” and the AWS CLI that will
a. Start
b. Run Steps
c. Terminate
4. Validate the creation, output, and termination
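A hedged sketch of what such a launch might look like with the AWS CLI. Every s3:// path, the key pair, the --class value, and the release label here are placeholders - substitute the paths you recorded when staging your assets:

```shell
# Hypothetical one-off cluster: start, run one Spark step, terminate.
aws emr create-cluster \
  --name "lab-9.2-oneoff" \
  --release-label emr-5.20.0 \
  --applications Name=Spark \
  --use-default-roles \
  --ec2-attributes KeyName=my-keypair \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --log-uri s3://my-bucket/emr-logs/ \
  --steps 'Type=Spark,Name=LabStep,ActionOnFailure=TERMINATE_CLUSTER,Args=[--class,com.example.LabJob,s3://my-bucket/jars/lab.jar,s3://my-bucket/data/]' \
  --auto-terminate
```

With --auto-terminate set, the cluster shuts itself down once the step sequence finishes, which is what makes the "start, run steps, terminate" flow hands-off.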
EMR Cluster: Setup: CLI Lab 9.2
Asset Placement
We need to get our assets onto S3, to be accessible to EMR.
● Build the Jar
● Use the AWS SDK to copy the jar
● Use the AWS SDK to copy the data
● Record those paths
https://www.oreilly.com/learning/how-do-i-package-a-spark-scala-script-with-sbt-for-use-on-an-amazon-elastic-mapreduce-emr-cluster
Session Review
● Cluster Setup
● Cluster Management (a bit)
● Some decisions & options for clusters
● We set up a cluster! (or 2..)
● We learned EMR is a highly malleable environment, though the stock configuration gives a lot of benefit out of the box
EMR
Elastic Map Reduce on AWS
9.3 Tools on EMR Clusters
Many Tools Available
● Databases○ Hive○ HBase/Phoenix○ Presto
● ML Tools○ Mahout○ MXNet○ Tensorflow
● Data Streaming/Loading○ Flink/Sqoop○ Kinesis/Kafka/ES/Cassandra
● Workflow & Monitoring○ Hue○ Oozie
Aaand Notebooks!
● Zeppelin
● Jupyter
Notebooks on EMR
● Houston, we have options
● Zeppelin vs. Jupyter: what’s worth fighting for?○ Well...Zeppelin gives you native Scala support…?○ Yeah but...Almond is a Scala kernel for Jupyter○ Jupyter has more better visualization and stuff...python libs man…○ Yeah but Zeppelin is growing more quickly○ Dude the data science guys LIKE Jupyter ok?○ But multi-user. But authentication. But...○ …scala….python...scala...python...jupyter...zeppelin
● Both are good. Both have merits. Both have some issues on EMR
● “What about Beaker man!” “But I like Databricks!” - “Religious debate has no place in [data] science”
Zeppelin on EMR
Here’s the deal with Zeppelin, if that’s what you want to use
● Multi-user setup may be less effort
● It can be more secure than Jupyter, if that is a business concern
● Store Zeppelin notebooks on S3 so they don’t go away with the cluster!
○ If we don’t store off-cluster, we are scared to terminate, increasing cost
○ Security options here:
■ Access key/secret
■ IAM/User
■ Secure by-bucket if you require
○ Can be done two ways
■ Shelling in to the running cluster and updating the config
■ Configuring the cluster at startup using the “configurations” block
https://www.zepl.com/blog/setting-multi-tenant-environment-zeppelin-amazon-emr/
https://medium.com/@addnab/s3-backed-notebooks-for-zeppelin-running-on-amazon-emr-7a743d546846
Zeppelin on EMR
● One-off clusters for analysis can use Spot instances. If you do this, make sure you set up to store the notebook to S3
● You can even set up your own EC2 with Zeppelin to run off the main cluster-master (so node deaths and cluster decommissions have less impact)
● Zeppelin on Amazon EMR does not support the SparkR interpreter
● Zepl is a 3rd party solution from the Zeppelin folks offering a product ZeppelinHub to ease the burdens here.
● https://www.zepl.com/blog/setting-multi-tenant-environment-zeppelin-amazon-emr/
https://zeppelin.apache.org/docs/0.8.0/setup/storage/storage.html#notebook-storage-in-s3
Simply adding an EBS volume to the cluster config does not guarantee persistent storage. https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-storage.html states:
“Amazon EBS volumes attached to EMR clusters are ephemeral: the volumes are deleted upon cluster and instance termination (for example, when shrinking instance groups), so it's important that you not expect data to persist.”
Zeppelin Performance Notes 1
● Store Zeppelin notebooks on S3 so they don’t go away with the cluster. Ephemeral notebooks on cluster crash or decommission are just no fun
○ You can also store them on EFS if you bootstrap your cluster to do so
● (Potentially) Set your notebooks to use interpreter-per rather than shared
○ Configure “interpreter/spark interpreter” as “The interpreter will be instantiated ‘Per User’ in ‘scoped’ process” and click “Save” (JVM isolation)
○ Otherwise a single interpreter will be used by all notebooks
● Understanding CPU/vCPU/YARN CPU/Zeppelin allocations
○ Do not expect your cluster settings to be in effect:
“Zeppelin does not use some of the settings defined in your cluster’s spark-defaults.conf configuration file, even though it instructs YARN to allocate executors dynamically if you have set spark.dynamicAllocation.enabled to true. You must set executor settings, such as memory and cores, using the Zeppelin Interpreter tab, and then restart the interpreter for them to be used.”
● Interpreter options override spark-submit options
○ spark.executor.memory > export SPARK_SUBMIT_OPTIONS="--executor-memory 10G ..."
https://zeppelin.apache.org/docs/0.8.0/usage/interpreter/interpreter_binding_mode.html
https://docs.amazonaws.cn/en_us/emr/latest/ReleaseGuide/zeppelin-considerations.html
Considerations When Using Zeppelin on Amazon EMR
● Connect to Zeppelin using the same SSH tunneling method to connect
to other web servers on the master node. Zeppelin server is found at
port 8890.
● Zeppelin on Amazon EMR release versions 5.0.0 and later supports
Shiro authentication.
● Zeppelin on Amazon EMR release versions 5.8.0 and later supports
using AWS Glue Data Catalog as the metastore for Spark SQL. For
more information, see Using AWS Glue Data Catalog as the Metastore
for Spark SQL.
● Zeppelin does not use some of the settings defined in your cluster’s
spark-defaults.conf configuration file, even though it instructs YARN to
allocate executors dynamically if you have set
spark.dynamicAllocation.enabled to true. You must set executor
settings, such as memory and cores, using the Zeppelin Interpreter
tab, and then restart the interpreter for them to be used.
● Zeppelin on Amazon EMR does not support the SparkR interpreter.
https://community.hortonworks.com/articles/212176/key-factors-that-affects-zeppelins-performance-1.html
Important: Before choosing Isolated, check that your system has enough resources. For example, if there are 30 users and interpreter memory is 1GB, you will need 30GB of RAM available on the Zeppelin node.
https://github.com/awslabs/deeplearning-emr/blob/master/training-on-emr/emr_efs_bootstrap.sh
Zeppelin Performance Notes 2
When creating your cluster -
● Use the Script Runner step or “Configurations” to customize the setup
○ Give more memory, e.g. zeppelin-env.sh
■ export ZEPPELIN_MEM="-Xms4024m -Xmx6024m -XX:MaxPermSize=512m"
■ export ZEPPELIN_INTP_MEM="-Xms4024m -Xmx4024m -XX:MaxPermSize=512m"
● Limit the result sets, e.g. in the interpreter
○ zeppelin.spark.maxResult
● Limit the interpreter output, e.g. zeppelin-site.xml
○ <name>zeppelin.interpreter.output.limit</name>
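In zeppelin-site.xml that property sits in a standard Hadoop-style property block. A sketch (the 102400-byte value here is illustrative, not a recommendation):

```xml
<property>
  <name>zeppelin.interpreter.output.limit</name>
  <value>102400</value>
  <description>Truncate interpreter output beyond this many bytes.</description>
</property>
```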
We’ll do an example of this in the next lab.
https://medium.com/@omidvd/bootstrapping-zeppelin-emr-779c4a138aed
From: https://community.hortonworks.com/articles/141589/zeppelin-best-practices.html
Deployment Choices
While you can select any node type to install Zeppelin, the best place is a gateway node. The reason a gateway node makes the most sense is that when the cluster is firewalled off and protected from outside, users can still reach the gateway node.
1. Hardware Requirements
1. More memory & more cores are better
2. Memory: Minimum of 64 GB per node
3. Cores: Minimum of 8 cores
4. # of users: A given Zeppelin node can support 8-10 users. If you want more users, you can set up multiple Zeppelin instances. More details in MT section.
Lab: Set Up Zeppelin on EMR
Lab 9.3-A
What we’ll do
1. Launch a single-node cluster
2. Add Zeppelin & Spark
3. Do some work in Zeppelin
4. Terminate the cluster
An example repo of AWS Console setup. https://github.com/arunkundgol/zeppelin-setup
Lab: Set Up Zeppelin on EMR with S3 Storage
Lab 9.3-B FIXME/TODO: NEED TO FINISH
What we’ll do
1. Launch a single-node cluster
2. Create a folder for S3 persistence
3. Configure Zeppelin to use S3 for persistence
4. Edit & save the notebook
5. Check S3
6. Terminate the cluster
7. Start a new one
8. Load the notebook & run it
9. Terminate the cluster
An example repo of AWS Console setup. https://github.com/arunkundgol/zeppelin-setup
You may run into permissions issues with Zeppelin S3 storage. These unfortunately manifest as an IO exception or SAX parse exception in the Zeppelin logs.
Caused by: java.io.IOException: Unable to list objects in S3: com.amazonaws.SdkClientException: Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler
Double check your settings in configurations.json and zeppelin-site.xml to make sure the user/bucket values match, and that the values exist in S3.
Jupyter on EMR
Notice that the EMR documentation only includes instructions for JupyterHub - which we will get to…
● To run a Jupyter Notebook directly against EMR requires a few extra steps
○ We need to configure EMR to have Jupyter available
○ We need to enable jupyter_spark
○ We need to configure security groups
○ We need to create the cluster with the custom bootstrap and script-runner step
○ We need to manually start pyspark
● That seems like a lot of work.
○ WAY more work than running it locally
● Let’s just use JupyterHub, for goodness’ sake
https://mikestaszel.com/2017/10/16/jupyter-notebooks-with-pyspark-on-aws-emr/
emr_bootstrap.sh
#!/bin/bash
# yum packages:
sudo yum install -y htop tmux
# download and install anaconda:
wget -q https://repo.continuum.io/archive/Anaconda2-4.4.0-Linux-x86_64.sh -O ~/anaconda2.sh
/bin/bash ~/anaconda2.sh -b -p $HOME/anaconda
echo -e '\nexport SPARK_HOME=/usr/lib/spark\nexport PATH=$HOME/anaconda/bin:$PATH' >> ~/.bashrc && source ~/.bashrc
# cleanup:
rm ~/anaconda2.sh
# enable https://github.com/mozilla/jupyter-spark:
sudo mkdir -p /usr/local/share/jupyter
sudo chmod -R 777 /usr/local/share/jupyter
pip install jupyter-spark
jupyter serverextension enable --py jupyter_spark
jupyter nbextension install --py jupyter_spark
jupyter nbextension enable --py jupyter_spark
jupyter nbextension enable --py widgetsnbextension
jupyter_step.sh
#!/bin/bash
# set up spark to use jupyter:
echo "" | sudo tee --append /etc/spark/conf/spark-env.sh > /dev/null
echo "export PYSPARK_PYTHON=/home/hadoop/anaconda/bin/python" | sudo tee
--append /etc/spark/conf/spark-env.sh > /dev/null
echo "export PYSPARK_DRIVER_PYTHON=/home/hadoop/anaconda/bin/jupyter" | sudo
tee --append /etc/spark/conf/spark-env.sh > /dev/null
echo "export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --ip=0.0.0.0'"
| sudo tee --append /etc/spark/conf/spark-env.sh > /dev/null
Running Jupyter Off-Master - JupyterHub, SparkMagic, Spark-Rest/Livy
● Livy: Apache incubated a REST API for Spark called Livy
○ Livy enables off-cluster hosting of notebooks for JupyterHub
● SparkMagic: Extra bits for Jupyter with Spark via JupyterHub + Livy
○ Automatically installed on EMR with the JupyterHub package
● Spark-REST is Livy, the server side of the equation
○ We really don’t need to think much about it
○ Abstracted by wrapper APIs under the covers of JupyterHub
● This creates a whole new set of capabilities
● As usual, EMR introduces some hiccups
○ Cluster access is from within AWS
○ Which usually means an SSH tunnel is required
https://aws.amazon.com/blogs/big-data/running-jupyter-notebook-and-jupyterhub-on-amazon-emr/
https://livy.incubator.apache.org/
https://github.com/jupyterhub/jupyterhub
https://github.com/jupyter-incubator/sparkmagic
Livy/SparkMagic Notes
● SparkMagic
○ Included with JupyterHub
○ Uses Livy
○ Gives extra ‘magics’ - %% capabilities in your notebook
○ Automatic visualization of SQL queries in the PySpark, PySpark3, Spark and SparkR kernels; use an easy visual interface to interactively construct visualizations, no code required
○ Ability to capture the output of SQL queries as Pandas dataframes to interact with other Python libraries (e.g. matplotlib)
● Introduces some limitations, notably
○ “Since all code is run on a remote driver through Livy, all structured data must be serialized to JSON and parsed by the Sparkmagic library so that it can be manipulated and visualized on the client side. In practice this means that you must use Python for client-side data manipulation in %%local mode”
○ Which can be confusing to readers…and writers…
● You might also want to keep an eye on Toree for more ‘magics’ https://toree.apache.org/
https://github.com/jupyter-incubator/sparkmagic
Notebooks: Jupyter Mini-Lab - 9.4
Lab (or Demo) 9.3.2
Follow along if you like/are able. What we’ll do
1. Push a chunk of our data from earlier to S3 (or use if its still there)
2. Launch a cluster using “steps” and the AWS CLI that will use JupyterHub
3. Start up an SSH tunnel
4. Do some work in the notebook
5. Terminate the cluster
“JupyterHub on Amazon EMR has a default user with administrator permissions. The user name is jovyan and the password is jupyter. We strongly recommend that you replace the user with another user who has administrative permissions. You can do this using a step when you create the cluster, or by connecting to the master node when the cluster is running.”
Externalizing Zeppelin
It is also possible, probably preferable, to completely externalize Zeppelin to a secondary EC2 instance near your cluster(s).
● Isolation
● Insulation
● Multi-cluster access
https://aws.amazon.com/blogs/big-data/running-an-external-zeppelin-instance-using-s3-backed-notebooks-with-spark-on-amazon-emr/
Notebooks on EMR, Review
● Zeppelin vs Jupyter
● Both are good. Both have merits. Both have some issues on EMR○ Just know they must run “up there” and you’ll be ok
● For highest stability, externalize your setup from your cluster
● Evaluate your security needs and consider if those requirements drive you toward either solution in particular
● Outside of that, the normal arguments for/against each do not differ for being on EMR
https://medium.com/@muppal/aws-emr-jupyter-spark-2-x-7da54dc4bfc8
Security Notes
Zeppelin + EMR supports:
● Shiro: Active Directory
● Shiro: LDAP
● Shiro: Role-based
● NGINX: Basic Auth
● Kerberos-enabled cluster
● SSL
● Restriction to
○ UI
○ Data
○ Notes
JupyterHub + EMR supports:
● LDAP/AD
● Livy + impersonation (user name passed along)
● PAM (Pluggable Authentication Module) (users on master; mucks with off-cluster setups)
○ https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-pam-users.html
EMR: Elastic Map Reduce on AWS
9.4 EMR Cluster Troubleshooting
Troubleshooting & Debugging
Step 1: Check the logs
Troubleshooting & Debugging
Step 2: Find the logs
Why So Difficult?!
Distributed Systems
● Are inherently challenging to troubleshoot
● Require a distributed mindset
● Require a deeper understanding of the components
● Take time to adjust your internal Sherlock Holmes
● Have no golden hammers or silver bullets
● May contain additional tooling for troubleshooting
Regular Spark
Spark UI already comes with some great tools
● Master UI
○ Workers
○ Cluster Details
● Spark Worker UI
○ Jobs
○ Stages
■ DAG Visualization
■ Event Timeline
○ Storage
○ Environment
○ Executors
■ Memory/Cores/IO
○ SQL
What more could you need?
Maximize Information, Minimize Surface Area
All That There Is
EMR + Spark
The AWS EMR solution has taken extra steps to help limit the “surface area” needed to find most problems.
● Enable Debugging
○ With debugging enabled, you get a lot of information retained (contextually), and accessible via the AWS Console without having to sort through S3 log folders and peruse the gz files
● Without it, the data IS still there, you just have to dig a little harder
Quick Demo: Nav to an S3 log file and open.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-troubleshoot.html
EMR Best Practices
● Do a ‘dry run’ of completed code against limited datasets.
○ Great case for notebooks
○ Great case for a small one-off cluster you can leave running a bit for diagnostics
○ Enable Debugging
○ Configure Logging
○ Logging IS expensive; don’t leave it on for production runs, only test cases and diagnostics
● Consult the “common errors” doc in AWS (link below) - good chance you’ll see your problem & solution listed there.
● Master Node log browsing
○ If you’ve a good idea of the problem and don’t have debug enabled, most of the logs you need will live on the Master node
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-debugging.html
“The debugging tool allows you to more easily browse log files from the EMR console.”
As part of your cluster spin-up from CLI, you could add a json config file to your --configurations such as
[ { "Classification": "yarn-site", "Properties": { "yarn.log-aggregation-enable": "true", "yarn.log-aggregation.retain-seconds": "-1", "yarn.nodemanager.remote-app-log-dir": "s3:\/\/mybucket\/logs" } }]
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-troubleshoot-errors.html
SSH in to the Master node, and peruse
Startup: /mnt/var/log/bootstrap-actions
Performance: /mnt/var/log/instance-state
An App (e.g. Spark, Zeppelin): /mnt/var/log/application
A particular step: /mnt/var/log/hadoop/steps/N
Change what gets logged by using the cluster --configurations classification: spark-log4j
https://github.com/apache/spark/blob/master/conf/log4j.properties.template
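A spark-log4j classification just carries log4j.properties keys. As a sketch (the property names follow Spark’s log4j.properties template linked above; the WARN levels are an example, not a recommendation):

```json
[
  {
    "Classification": "spark-log4j",
    "Properties": {
      "log4j.rootCategory": "WARN, console",
      "log4j.logger.org.apache.spark": "WARN"
    }
  }
]
```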
Debug Logs In Context
Debug Logs: Lab 9.4
Lab 9.4
What we’ll do
● Navigate the Debug logs of previously launched clusters
Slow Cluster? https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-troubleshoot-slow.html
Troubleshooting
Understand your problem class
● Sizing/Scaling? OOM, hung nodes, stragglers
○ Review cluster configuration
○ Review Zeppelin interpreter config (if appropriate)
○ Do some math with your data sizes
● Can’t start the cluster?
○ Bootstrapping failures
○ One-off jobs failing
○ Security
● Processing
○ Data formats - noisy data throwing exceptions?
○ Data access - security preventing file access?
Real-Failure Suppression
Sometimes in Spark, the real problem is hidden by downstream tasks
● Log parsing failures
● Run explain plan
● Construct an alternative pipeline by
○ Checkpointing
○ Landing data & making your job multi-step
● Sometimes the best solution is to not try to make your Spark job a one-shot, and get back to basic ETL methodology
○ Extract (O) - one job - load (O) /convert/filter/land (O’)
○ Transform (O’) - one job - load (O’) /join/filter/land (O’’)
○ Load (O’’) - one job - load (O’’) /load/aggregate/land
For really big jobs, you might be better off multi-stage pipelining. This can be noisy from a duplication perspective, but you can also clean up O’ and O’’ after the fact, if it makes sense. Consider also any lineage impacts.
Real-Failure Suppression, Step By Step Testing
1. Launch a cluster with debugging & termination protection
2. Configure a Notebook with enough memory and cores for the interpreter that you are confident it is sufficient for your data population
3. Look for opportunities in each step for errors.
a. Use type-testing (number? string?)
b. Use try/catch/log semantics
c. When you are 100% sure the problem isn’t in this step, move on
4. Check the logs after each step via the Console
5. Correct any errors. If your automatic job still fails after this effort, you should have sufficiently ruled out programming problems, and should look for scale issues, security issues, latency problems, and other “infra” surfaces
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-troubleshoot-failed-5-test-steps.html
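The type-testing and try/catch/log semantics from step 3 can be sketched in plain Python (the record layout and `parse_record` helper here are hypothetical, not from the labs); the same fail-soft pattern works inside a PySpark map function or UDF:

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("etl-step")

def parse_record(raw):
    """Type-test each field and fail soft: return (record, None) on
    success or (None, error_message) so bad rows can be counted and
    logged instead of killing the whole step."""
    try:
        zip_code, revenue = raw.split(",")
        if not zip_code.strip().isdigit():     # type-test: is it a number?
            raise ValueError("zip_code is not numeric: %r" % zip_code)
        return ({"zip": zip_code.strip(), "revenue": float(revenue)}, None)
    except (ValueError, TypeError) as exc:
        log.warning("bad record %r: %s", raw, exc)
        return (None, str(exc))

good, bad = [], []
for row in ["10001,19.99", "NYC,5.00", "10002,oops"]:
    rec, err = parse_record(row)
    (good if rec else bad).append(rec or err)
# one good record survives; the two malformed rows are quarantined with
# their error messages instead of failing the job
```

Keeping the bad rows (with reasons) makes step 4's log review concrete: you can count them and decide whether the step is clean enough to move on.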
The Usual Suspects 1 (how are you coding)
Stop errors before they blow up at cost by -
● Testing your code, ideally locally or on a 1-server “cluster”, with a limited-but-representative dataset
● Unit Testing your code with small data that contains expected variations and corner cases in your raw data
● Try/Catch works in Spark too...error handling
● Defect-based unit testing - when something blows up and it takes you 2 hours or 2 days to find the culprit, and you code a workaround - code a test for that workaround
● Configurable debug logging in your code
Your problem is really only happening post-dev with your shiny unit tested code. I know we’ve talked about this before, but it’s worth a second look.
● Review your DAG in the Spark UI
○ Look for stragglers
○ Re-evaluate your partitioning schemes
● Do a little math and double check your data size assumptions
● Look for opportunities to:
○ Coalesce or Repartition
○ DISTRIBUTE BY, SORT BY, CLUSTER BY
● If that all fails, try pulling in a good sample of your data, coalesce it to a reasonable number, and pay attention to the Spark UI
The Usual Suspects 2 (how are you running)
https://deepsense.ai/optimize-spark-with-distribute-by-and-cluster-by/
Quick Lab: Do the code blocks in here and examine the Spark UI.
https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4
The Unusual Suspects
AWS has its own conditions which may cause cluster instability
● Service Outages http://status.aws.amazon.com/
● Usage Limits
○ EC2 default quota of 20 EC2 servers
■ EC2 QUOTA EXCEEDED error
○ S3 bucket count limit (100 per acct)
■ Consider nesting by env and project
● Networking issues communicating between resources, in-or-across VPCs
○ Subnets run short of addressable IPs for large clusters
● Last state change before termination may hold clues
If data throughput is a problem, and there is value to “moving the data closer to the processing”, s3DistCp can be used to pull from S3 and push into Core Node hdfs. https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/#5
Debugging Scala/Spark Applications
● Most bugs can be replicated from a reasonable sample-set of the data
● Running locally in an IDE can help you to perform interactive real-time troubleshooting of the running application
○ Any Scala IDE can do this - I like IntelliJ
● With debug opts enabled you can also runtime-debug a running cluster - locally or on EMR
○ You’ll obviously not do this with Prod
○ It’s best to do with a small cluster and data-switched breakpoints
● Arguably the best way to solve any issues with data validation, data munging, joins, failing actions.
● NOT the best way to troubleshoot performance issues!
https://community.hortonworks.com/articles/15030/spark-remote-debugging.html
Debugging PySpark Applications
● Some messages are from the JVM & some from Python, which can lead to some confusion
● In Jupyter, the useful error messages are most likely in the console log
● In YARN, YARN logs. Point being, the stuff in your notebook may feel pointless
● .take(1), .count(), and loop/print are your allies
● Lambdas can make troubleshooting even more difficult
○ You can write tests in Python too! https://github.com/holdenk/spark-testing-base
● As with Scala, being able to run your code in a debuggable environment (e.g. Pycharm) can dramatically increase productivity, though it can feel alien (or, “low level”) to the data analyst accustomed to notebooks
https://www.slideshare.net/SparkSummit/debugging-pyspark-spark-summit-east-talk-by-holden-karau
Holden Karau is also a co-author of High Performance Spark - a Must Read.
https://github.com/holdenk/spark-testing-base
https://issues.apache.org/jira/browse/MAPREDUCE-3883
EMR: Elastic Map Reduce on AWS
9.5 The Operations Perspective
Observability & Monitoring
We have been talking about “how the application experiences the infrastructure”
Now we will be talking about “how the infrastructure experiences the applications”
Observability & Monitoring
What is the delineation of responsibility in your organization?
Observability & Monitoring
Who should configure all this? Who owns it?
Observability & Monitoring
On a case-by-case basis we may know a particular Spark job we are responsible for didn’t work out, and that’s what we’ve been talking about. At a system level, how do we know when things are going right or wrong?
Let’s take a look at some options
● Cloudwatch - AWS log/log-monitoring solution
● Ganglia - cluster visualization tool
● Third Party Options
○ Influx TICK Stack
○ Prometheus
○ Home-rolled with any number of time-series DBs
○ (Paid solutions) Loggly, Datadog...etc
https://www.loggly.com/blog/monitoring-amazon-emr-logs-in-loggly/
https://docs.datadoghq.com/integrations/amazon_emr/
Observability & Monitoring: Cloudwatch
● More #’s than you can probably stomach!
○ Grows linearly by # metrics * # jobs
● Find the ones that matter for your purposes
● This can be time consuming and requires developing some subject matter expertise
● One limitation is any aggregation of system metrics across clusters must be performed manually
Observability & Monitoring: Cloudwatch
This makes ephemeral clusters a bit harder to deal with.
Observability & Monitoring: Cloudwatch
Tough to group across business-purposes when organized this way
Observability & Monitoring: Cloudwatch
Events vs Metrics
● “Actions vs Data”
● Goals with Events
○ Know “something happened” or “something changed”
■ E.g. Cluster status went isIdle = true
■ Great for creating actions you want to know about
● SNS yourself a message for:
○ Scale-out events
○ Idle or zombie server decommission
● You can generate custom Events from your app as well!
● Metrics: Know the usage of the system (at a very fine grained level)
EMR Cluster Events can be seen here: https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html#emr_event_type
Observability & Monitoring: Cloudwatch Rule
EMR Cluster Events can be seen here: https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html#emr_event_type
Observability & Monitoring: Cloudwatch Explore
9.5.1 Mini lab (group or demo-style)
Lets review some Cloudwatch metrics from the clusters we’ve run.
● MemoryAvailableMB
● IsIdle
● CoreNodesRunning
● S3BytesRead
● MRUnhealthyNodes
● ...etc...
https://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_ViewingMetrics.html
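Pulling one of these metrics programmatically is mostly about building the right request: EMR publishes under the AWS/ElasticMapReduce namespace, keyed by the JobFlowId dimension. A sketch of the kwargs you would pass to boto3’s `get_metric_statistics` (the cluster id and the one-hour/five-minute window are placeholders):

```python
from datetime import datetime, timedelta

def emr_metric_request(cluster_id, metric, hours=1, period=300):
    """Build keyword arguments for boto3's
    cloudwatch.get_metric_statistics() for one EMR cluster metric."""
    now = datetime.utcnow()
    return {
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": metric,          # e.g. "IsIdle", "MemoryAvailableMB"
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
        "Period": period,              # seconds per datapoint
        "Statistics": ["Average"],
    }

# Usage (assumes boto3 and AWS credentials; "j-XXXXXXXXXXXXX" is a placeholder):
#   import boto3
#   cw = boto3.client("cloudwatch", region_name="us-east-1")
#   resp = cw.get_metric_statistics(**emr_metric_request("j-XXXXXXXXXXXXX", "IsIdle"))
```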
Observability & Monitoring: Cloudwatch: Custom Dashboard
Observability & Monitoring: Ganglia
Another “box-solution” provided by AWS for EMR is Ganglia.
http://ganglia.info/
“Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. “
Lab 9.5: Lets launch one cluster with Ganglia, as a group, and poke around a bit, separately.
Observability & Monitoring: Ganglia
Observability & Monitoring: Ganglia
Observability & Monitoring: Ganglia
Observability & Monitoring: Ganglia
Observability & Monitoring: Ganglia
Observability & Monitoring: Ganglia
Observability & Monitoring: Ganglia
Observability & Monitoring: 3rd Party
Since EMR is so flexible in custom bootstrapping, any agent you like can be added at cluster-provisioning time to broadcast to any accessible target.
Whether you decide to use TICK stack, Prometheus, or another system that you hand-roll, you can accomplish the goal of creating graphs and alerts about your infrastructure.
You can find an example of rolling a custom solution broadcasting to Graphite: https://dzone.com/articles/how-to-properly-collect-aws-emr-metrics
Graphite can be directly integrated to Ganglia as well.
Observability & Monitoring: Tick Stack
Tick stack is an “open source core” for a time-series-data platform built to handle metrics and events.
This open source core consists of the projects —
● Telegraf - A collection and reporting agent
● InfluxDB - A high performance time-series database written in Go with 0 dependencies
● Chronograf - Dashboarding
● Kapacitor - Realtime batch-and-streaming data processing engine for munching data from InfluxDB
-- collectively called the TICK Stack.
https://www.influxdata.com/time-series-platform/
https://gist.github.com/travisjeffery/43f424fbd7ac677adbba304cef6eb58f
EMR: Elastic Map Reduce on AWS
9.6 EMR Cluster Optimization
Scaling: Plenty of Knobs to Turn
● Sensible defaults can get you a long way
○ Dynamic Allocation is “on by default”
■ This requires the Shuffle service
● Spark Shuffle Service is automatically configured by Amazon EMR
■ maxExecutors = infinity
● Autoscaling can be set up for Instance Groups
● Spark parameters can be adjusted with --configurations
● Many defaults are set based on instance types selected
https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
Scaling: Defaults
By default, Spark in EMR picks up its basic settings from instance type selected
spark.dynamicAllocation.enabled true
spark.executor.memory Setting is configured based on the core and task instance types in the cluster.
spark.executor.cores Setting is configured based on the core and task instance types in the cluster.
You can adjust these defaults and others on cluster creation with the --configurations classification: spark-defaults
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html
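Overriding those defaults at cluster creation can look like this (a sketch; the memory and core values are placeholders, not recommendations):

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executor.memory": "4g",
      "spark.executor.cores": "2",
      "spark.dynamicAllocation.enabled": "true"
    }
  }
]
```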
Scaling: maximizeResourceAllocation: true
On cluster launch, settings you can supply via --configurations within Classification:Spark that affect how performance is managed. If maximizeResourceAllocation:true, then
spark.default.parallelism 2X number of CPU cores available to YARN containers.
spark.driver.memory Setting is configured based on the instance types in the cluster. This is set based on the smaller of the instance types in the two instance groups (master/core)
spark.executor.memory Setting is configured based on the core and task instance types in the cluster.
spark.executor.cores Setting is configured based on the core and task instance types in the cluster.
spark.executor.instances Setting is configured based on the core and task instance types in the cluster. Set unless spark.dynamicAllocation.enabled explicitly set to true at the same time.
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html
Dynamic Allocation (on by default)
maximizeResourceAllocation
Configuring via --configurations
[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
A footnote about this setting: https://stackoverflow.com/questions/34003759/spark-emr-using-amazons-maximizeresourceallocation-setting-does-not-use-all
You probably want to explicitly set spark.default.parallelism to 4x the number of instance cores you want the job to run on, on a per-"step" (in EMR speak) / "application" (in YARN speak) basis, i.e. set it every time.
Scaling: maximizeResourceAllocation: true
● Limits your cluster to one-job-at-a-time
● Best for single-use or single-purpose clusters
Scaling: dynamicAllocation Notes
Dynamic Allocation may be “on by default” in EMR, but it still has some of its own knobs to turn.
spark.dynamicAllocation.executorIdleTimeout Default: 60 - seconds of idle time after which an executor is “removable”
spark.dynamicAllocation.cachedExecutorIdleTimeout Default: Infinity - the lifespan of an executor which has cached data blocks.
spark.dynamicAllocation.initialExecutors Default: spark.dynamicAllocation.minExecutors - Default number of executors for DA, only if < --num-executors.
spark.dynamicAllocation.maxExecutors Default: Infinity - upper bound of num executors.
spark.dynamicAllocation.minExecutors Default: 0 - Number of executors ‘by default’.
spark.dynamicAllocation.executorAllocationRatio Default: 1 - Ratio of executors to tasks (1:1 is maximum parallelism)
https://aws.amazon.com/blogs/big-data/submitting-user-applications-with-spark-submit/
https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
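Turning those knobs at cluster creation goes through the same spark-defaults classification. A sketch (the min/max executor counts and timeout are illustrative only):

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.dynamicAllocation.minExecutors": "2",
      "spark.dynamicAllocation.maxExecutors": "50",
      "spark.dynamicAllocation.executorIdleTimeout": "120s"
    }
  }
]
```

Capping maxExecutors is worth considering on shared clusters, since the EMR default of infinity lets one job grab everything dynamic allocation can hand out.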
Scaling: Some Numbers
● YARN: when tuning, be sure to leave at least 1-2GB RAM and 1 vCPU for each instance's O/S and other applications to run too.
○ The default amount of RAM seems to cover this, but this will leave us with (N-1) vCPUs per instance off the top
● Executors: Parallelism is the goal; keep cluster size in mind when setting parameters. If you are not using maximizeResourceAllocation or dynamic allocation, you might, e.g., have 3 machines with 4 CPUs each. Leaving 1 per instance for the system, you have 3 each, so --num-executors = 9 (3 executors per node) would be reasonable.
● Executor-cores: How many parallel tasks can an executor take on?
○ Think about time spent on IO and determine the ratio
○ Each executor-core will have various operations in which it is waiting on other things (reads/writes), so increasing executors and reducing cores-per-executor can result in better performance
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
To hopefully make all of this a little more concrete, here’s a worked example of configuring a Spark app to use as much of the cluster as possible: Imagine a cluster with six nodes running NodeManagers, each equipped with 16 cores and 64GB of memory. The NodeManager capacities, yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, should probably be set to 63 * 1024 = 64512 (megabytes) and 15 respectively. We avoid allocating 100% of the resources to YARN containers because the node needs some resources to run the OS and Hadoop daemons. In this case, we leave a gigabyte and a core for these system processes. Cloudera Manager helps by accounting for these and configuring these YARN properties automatically.
The likely first impulse would be to use --num-executors 6 --executor-cores 15 --executor-memory 63G. However, this is the wrong approach because:
● 63GB + the executor memory overhead won’t fit within the
63GB capacity of the NodeManagers.
● The application master will take up a core on one of the nodes,
meaning that there won’t be room for a 15-core executor on
that node.
● 15 cores per executor can lead to bad HDFS I/O throughput.
A better option would be to use --num-executors 17 --executor-cores 5 --executor-memory 19G. Why?
● This config results in three executors on all nodes except for
the one with the AM, which will have two executors.
● --executor-memory was derived as (63/3 executors per
node) = 21. 21 * 0.07 = 1.47. 21 – 1.47 ~ 19.
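The arithmetic in that worked example generalizes. A small sketch (the 7% memory-overhead factor, the one-core/one-GB OS reservation, and the 5-core cap all follow the quoted walkthrough; treat the output as a starting point, not a definitive setting):

```python
def size_executors(nodes, vcores_per_node, mem_gb_per_node,
                   cores_per_executor=5, overhead_frac=0.07):
    """Derive --num-executors / --executor-cores / --executor-memory
    from cluster hardware, following the sizing walkthrough above:
    reserve 1 core + 1 GB per node for the OS and daemons, cap
    executors at ~5 cores for HDFS I/O throughput, subtract one
    executor for the YARN ApplicationMaster, and deduct the memory
    overhead from each executor's share."""
    usable_cores = vcores_per_node - 1
    usable_mem = mem_gb_per_node - 1
    execs_per_node = usable_cores // cores_per_executor
    num_executors = nodes * execs_per_node - 1           # minus the AM
    mem_per_exec = usable_mem / execs_per_node           # e.g. 63 / 3 = 21
    executor_mem = int(mem_per_exec * (1 - overhead_frac))  # leave overhead room
    return num_executors, cores_per_executor, executor_mem

# Six 16-core/64GB nodes reproduce the example's
# --num-executors 17 --executor-cores 5 --executor-memory 19G
```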
Scaling: Drive Space
● By default you generally get a 10GB EBS volume
○ Add volumes to increase storage when drive space is a problem
○ Add volumes to offset memory vs CPU vs storage inequities (in other words, you keep running out of drive space, but the RAM/CPU are fine)
○ These are also ephemeral!
○ “EBS-Optimized” = network traffic is dedicated, not shared
● You can add additional volumes, but you must consciously adjust configuration to be aware of and take advantage of them.
○ Check the directories where the logs are stored and change parameters as needed
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-storage.html
https://aws.amazon.com/blogs/aws/amazon-emr-update-support-for-ebs-volumes-and-m4-c4-instance-types/
https://aws.amazon.com/premiumsupport/knowledge-center/core-node-emr-cluster-disk-space/
Scaling: Drive Space
● Cross-cluster or intra-cluster storage can be performed using EFS rather than S3, if preferred. This is not recommended practice for log files.
● Change the guarantees of performance by selecting custom EBS
○ Provisioned IOPS SSD - high performance (ops/sec)
○ Throughput Optimized - high throughput (MiB/sec)
○ AWS measures IOPS in 256K or smaller blocks
https://cloud.netapp.com/blog/ebs-volumes-5-lesser-known-functions
Some clarification on the details of these options can be found here: https://www.fittedcloud.com/blog/aws-ebs-performance-confused/
https://aws.amazon.com/premiumsupport/knowledge-center/core-node-emr-cluster-disk-space/
Changing the log retention period:
1. Connect to the master node using SSH.
2. Open the /etc/hadoop/conf/yarn-site.xml file on each node in your Amazon EMR cluster (master, core, and task nodes).
3. Reduce the value of the yarn.log-aggregation.retain-seconds property on all nodes.
4. Restart the ResourceManager daemon. For more information, see Viewing and Restarting Amazon EMR and Application Processes (Daemons).
https://docs.aws.amazon.com/efs/latest/ug/mounting-fs-mount-cmd-dns-name.html
https://github.com/awslabs/deeplearning-emr/blob/master/training-on-emr/emr_efs_bootstrap.sh
● EFS is automatically mounted on all worker instances during startup.
● EFS allows sharing of code, data, and results across worker instances.
● Using EFS doesn't degrade the performance for densely packed files (for example, .rec files containing image data used in MXNet).
Scaling: Drive Space
If you decide to create an EBS volume to tolerate more local-logs, you will want to bootstrap your environment to accommodate this. Set the log path in yarn-site, and possibly do a custom mount operation to guarantee the device.
[hadoop@ip-172-31-50-48 /]$ mount
proc on /proc type proc (rw,relatime)
sysfs on /sys type sysfs (rw,relatime)
devtmpfs on /dev type devtmpfs (rw,relatime,size=7686504k,nr_inodes=1921626,mode=755)
devpts on /dev/pts type devpts (rw,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /dev/shm type tmpfs (rw,relatime)
/dev/xvda1 on / type ext4 (rw,noatime,data=ordered)
devpts on /dev/pts type devpts (rw,relatime,gid=5,mode=620,ptmxmode=000)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)
/dev/xvdb1 on /emr type xfs (rw,relatime,attr2,inode64,noquota)
/dev/xvdb2 on /mnt type xfs (rw,relatime,attr2,inode64,noquota)
/dev/xvdc on /mnt1 type xfs (rw,relatime,attr2,inode64,noquota)
/dev/xvdd on /mnt2 type xfs (rw,relatime,attr2,inode64,noquota)
/dev/xvde on /mnt3 type xfs (rw,relatime,attr2,inode64,noquota)
https://cloud.netapp.com/blog/ebs-volumes-5-lesser-known-functions
Some clarification on the details of these options can be found here: https://www.fittedcloud.com/blog/aws-ebs-performance-confused/
Scaling: Autoscaling
● Autoscaling requires Instance Groups and is not supported with Instance Fleets
● EMR scaling is more complex than EC2 autoscaling
○ Core Node vs Task Node
○ Core Node decommission times are longer due to HDFS
● Scale Out != Scale In
○ Scale-out policies can be more flexible
○ Scale-in must be more prudent
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html
Dynamic Allocation (on by default)
maximizeResourceAllocation
https://aws.amazon.com/blogs/big-data/best-practices-for-resizing-and-automatic-scaling-in-amazon-emr/
https://aws.amazon.com/blogs/big-data/dynamically-scale-applications-on-amazon-emr-with-auto-scaling/
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-automatic-scaling.html
https://docs.aws.amazon.com/autoscaling/ec2/userguide/AutoScalingGroupLifecycle.html
https://cloud.netapp.com/blog/optimizing-aws-emr-best-practices
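The maximizeResourceAllocation setting mentioned in the notes above is applied through EMR's configuration classifications at cluster-creation time. A minimal sketch of that payload, expressed in Python for illustration (the classification and property names come from the linked EMR docs; everything else is illustrative):

```python
# Sketch of an EMR configuration-classification payload enabling
# maximizeResourceAllocation. Pass the resulting JSON under --configurations
# (or the Configurations parameter) when creating the cluster.
import json

configurations = [
    {
        "Classification": "spark",
        "Properties": {
            # Lets EMR derive executor memory/cores from the instance type.
            "maximizeResourceAllocation": "true"
        },
    }
]

print(json.dumps(configurations, indent=2))
```

Note that maximizeResourceAllocation and dynamic allocation interact; check the linked EMR Spark configuration page before combining them.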
Scaling: Scale In: Switches
● "Amazon EMR implements a blacklisting mechanism in Spark that is built on top of YARN's decommissioning mechanism. This mechanism helps ensure that no new tasks are scheduled on a node that is decommissioning, while at the same time allowing tasks that are already running to complete."
spark.blacklist.decommissioning.enabled
Default: true - Spark does not schedule new tasks on executors running on that node. Tasks already running are allowed to complete.

spark.blacklist.decommissioning.timeout
Default: 1 hour - After the decommissioning timeout expires, the node transitions to a decommissioned state and EMR can terminate the node's EC2 instance. Any tasks still running after the timeout expires are lost or killed and rescheduled on executors running on other nodes.

spark.decommissioning.timeout.threshold
Default: 20 seconds - This allows Spark to handle Spot instance terminations better, because Spot instances decommission within a 20-second timeout regardless of the value of yarn.resourcemanager.decommissioning.timeout, which may not provide other nodes enough time to read shuffle files.

spark.stage.attempt.ignoreOnDecommissionFetchFailure
Default: true - When set to true, helps prevent Spark from failing stages and eventually failing the job because of too many failed fetches from decommissioned nodes. Failed fetches of shuffle blocks from a node in the decommissioned state will not count toward the maximum number of consecutive fetch failures.
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html#emr-spark-maximizeresourceallocation
spark.resourceManager.cleanupExpiredHost
Default: true - When set to true, Spark unregisters all cached data and shuffle blocks that are stored in executors on nodes that are in the decommissioned state. This speeds up the recovery process.
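These switches rarely need changing, but when they do they are set through the spark-defaults configuration classification. A sketch that simply makes the documented defaults explicit (the property names are from the EMR docs; the JSON wrapper is illustrative):

```python
# Sketch: decommissioning-related Spark switches expressed as an EMR
# "spark-defaults" configuration classification. The values shown are the
# documented defaults, written out explicitly for illustration.
import json

decommission_conf = {
    "Classification": "spark-defaults",
    "Properties": {
        "spark.blacklist.decommissioning.enabled": "true",
        "spark.blacklist.decommissioning.timeout": "1h",
        "spark.decommissioning.timeout.threshold": "20",
        "spark.stage.attempt.ignoreOnDecommissionFetchFailure": "true",
        "spark.resourceManager.cleanupExpiredHost": "true",
    },
}

print(json.dumps([decommission_conf], indent=2))
```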
Sizing Suggestions
● Memory at 3X your data size expectations
● Enough cores to reasonably parallelize your data, assuming you’ve also worked through the partitioning scenarios
● Filter, filter, filter. Narrow, narrow, narrow.
● Ephemeral clusters have fewer variables
● Shared clusters have MANY more details to consider
● This is as much art as science, and is invariably “case-by-case”
A note on underutilization: https://stackoverflow.com/questions/38331502/spark-on-yarn-resource-manager-relation-between-yarn-containers-and-spark-execu
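The "3X your data size" and parallelization rules of thumb above can be wired into a quick back-of-the-envelope estimator. estimate_cluster is a hypothetical helper, not an official formula; the 64 GB node size and 128 MB partition target are illustrative assumptions you should replace with your own:

```python
# Back-of-the-envelope cluster sizing from the rules of thumb above.
# The 3x memory multiplier and the 128 MB partition target are the slide's
# heuristics, not hard rules; node_mem_gb is an assumed instance size.

def estimate_cluster(data_gb, mem_multiplier=3.0, node_mem_gb=64,
                     target_partition_mb=128):
    """Rough node count and partition count for a given input size."""
    total_mem_gb = data_gb * mem_multiplier          # "memory at 3X data size"
    nodes = max(1, -(-total_mem_gb // node_mem_gb))  # ceiling division
    partitions = max(1, int(data_gb * 1024 // target_partition_mb))
    return {"total_mem_gb": total_mem_gb, "nodes": int(nodes),
            "partitions": partitions}

# e.g. 500 GB of input on 64 GB nodes:
# estimate_cluster(500)
```

Treat the output as a starting point for the "case-by-case" tuning described above, not a final answer.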
Sizing/Setup Suggestions
● Use HDFS for intermediate data storage while the cluster is running and Amazon S3 only to input the initial data and output the final results.
● If your clusters will commit 200 or more transactions per second to Amazon S3, contact support to prepare your bucket for greater transactions per second and consider using the key partition strategies described in the links below
● Set the Hadoop configuration setting io.file.buffer.size to 65536. This causes Hadoop to spend less time seeking through Amazon S3 objects.
● If listing buckets with large numbers of files, pre-cache the results of an Amazon S3 list operation locally on the cluster.
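The pre-caching suggestion above can be sketched as follows. list_keys assumes boto3 and AWS credentials are available on the cluster, and the bucket/path names are illustrative; the cache read/write logic is plain local file I/O:

```python
# Sketch: cache the result of an S3 list operation locally so repeated jobs
# don't re-list a large bucket. list_keys() requires boto3 and credentials;
# the caching helpers themselves are plain file I/O.
import json
import os

def cache_keys(keys, cache_path):
    """Write an iterable of S3 keys to a local JSON cache file."""
    with open(cache_path, "w") as f:
        json.dump(list(keys), f)

def load_cached_keys(cache_path):
    """Return the cached key list, or None if no cache exists yet."""
    if not os.path.exists(cache_path):
        return None
    with open(cache_path) as f:
        return json.load(f)

def list_keys(bucket, prefix=""):
    """List keys from S3 (illustrative; needs boto3 + credentials)."""
    import boto3
    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]

# Usage sketch (bucket and path are hypothetical):
# keys = load_cached_keys("/mnt/s3-list-cache.json")
# if keys is None:
#     keys = list(list_keys("my-bucket", "data/"))
#     cache_keys(keys, "/mnt/s3-list-cache.json")
```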
You can find some links to S3 specific performance optimization best practices here: https://docs.aws.amazon.com/AmazonS3/latest/dev/PerformanceOptimization.html
This document covers some of the "why" of performance concerns, offers suggestions on how to break up your buckets, and includes some hashing mechanisms that help balance performance and listability.
https://aws.amazon.com/blogs/aws/amazon-s3-performance-tips-tricks-seattle-hiring-event/
Finally this document has 4 great tips from a heavy user of Spark on EMR.
https://medium.com/teads-engineering/spark-performance-tuning-from-the-trenches-7cbde521cf60
Cost Optimization Recommendations
● Ephemeral clusters which auto-terminate for spark-submit & sizeable jobs
○ Can integrate with Jenkins for a seamless commit/execute CICD
● Dynamic allocation with minimal primary-cluster for active analysis
○ Good pool of Task Nodes available
● Off-cluster notebook connectivity and management (jupyter-hub, zeppelin, livy)
○ Cluster Core Node pool should remain relatively fixed to reduce decommission time
○ Primary scale point should be Task Node
○ Task Nodes are the best case for Spot Instances (least risk, least cost)
● Off-cluster notebook storage
● Cloudwatch alerts with SNS listeners that proactively act or message
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html - node decommissioning
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances-guidelines.html
Fleets vs instance groups for cost optimization vs complexity. Spot instances.
Cost Optimization In Dev
● Watch how you work
○ Develop locally - grab a file and start the process
○ Troubleshoot locally - use an IDE or local notebook to evaluate your work and debug
● Go to the cluster when you have a job you are ready to productionize (doesn’t work on local machine anymore)
● Cluster job fails
○ Get back to local once you understand the failure mode
○ Emulate, correct, and re-deploy
● Too often we get in the mindset of “just this one tweak will fix it” and waste hours upon hours of cluster-runtime cycles.
Cost Source Notes
● EMR costs are on-top-of the underlying infrastructure (EC2) costs.
● S3 costs are around $700/mo for 10TB 'with reduced redundancy'. Contrast this with Redshift at $1,000/TB/mo at the lowest tier (3-year buy-in).
● You may be charged for use of “SimpleDB” when you enable debugging
● If you add large EBS volumes to your clusters this can add up. Important if you are writing to HDFS, using Hive, or expect a lot of spill-to-disk or disk-cache
● In-Region data transfer should not add cost. Transferring data across regions will. Bear this in mind when establishing any data-landing practices
● You can save a ton of money leveraging reserved & spot instances
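To make the S3-vs-Redshift gap above concrete, a quick worked calculation using the figures quoted on this slide (prices are illustrative, vary by region, and change over time):

```python
# Back-of-the-envelope monthly storage cost comparison using the slide's
# quoted figures: ~$700/mo for 10TB on S3 (reduced redundancy) versus
# ~$1000/TB/mo for Redshift at the lowest tier. Illustrative, not current.
DATA_TB = 10
S3_MONTHLY = 700                      # quoted flat figure for 10TB on S3
REDSHIFT_PER_TB_MONTHLY = 1000        # quoted lowest-tier Redshift rate

redshift_monthly = DATA_TB * REDSHIFT_PER_TB_MONTHLY
ratio = redshift_monthly / S3_MONTHLY

print(f"S3: ${S3_MONTHLY}/mo, Redshift: ${redshift_monthly}/mo "
      f"(~{ratio:.0f}x more)")
```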
Common Problems
● https://www.knowru.com/blog/first-3-frustrations-you-will-encounter-when-migrating-spark-applications-aws-emr/
● https://www.indix.com/blog/engineering/lessons-from-using-spark-to-process-large-amounts-of-data-part-i/
EMR: Elastic Map Reduce on AWS
9.7 EMR Security Notes
What’s at Risk?
● Being based on AWS Technologies means you have “all the basic” AWS tools to help you secure your system.
● You launch your cluster in a VPC
● You leverage completely customizable IAM roles to
○ Interact with other services
○ Allow cleanup
○ Autoscale
● You leverage completely customizable Security Groups
● The typical “risk” is identified as in-org risk
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-roles.html
What’s at Risk?
● In-Org Risk
○ Some teams cannot see other teams' data
○ Organization wants to split resources by budgets in different departments
○ Custom roles/groups can be useful for this
● Extra-Org Risk
○ The most common failing is when data gets left lying around
○ S3 public buckets ("Hey, I couldn't get the S3 policy set right so I just made it public")
○ Content emitted to email through notifications & SNS that contains sensitive information
Risk Mitigation: Mechanisms for Security
● AWS Level
○ IAM Roles
○ Cloudtrail
○ S3 Policies
○ Firewall/VPN
● Spark/Hadoop level
○ LDAP/AD Integration
○ Authorized-user access (IAM + LDAP)
○ HDFS Permissions/ACLs
○ Kerberos
● Mechanisms
○ Lock down access (IAM Roles, VPCs, S3 Policies, LDAP)
○ Audit access (Cloudtrail)
○ Encrypt
https://docs.aws.amazon.com/emr/latest/ManagementGuide/logging_emr_api_calls.html
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-kerberos-configure.html
Kerberos is a secure authentication method developed at MIT that allows two services located on a non-secured network to authenticate themselves in a secure way. Kerberos, which is based on a ticketing system, serves as both an Authentication Server and a Ticket Granting Server (TGS).
Risk Mitigation: Encryption Options
There are many options for at-rest and in-transit encryption in the Spark/EMR/S3 ecosystem.
● What matters to your organization?
● What attack vectors concern you?
● Do you consider VPC secure?
● Do you need to protect yourself from internal threats?
● What are your regulated surfaces/relevant responsibilities?
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-data-encryption.html
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-data-encryption-options.html
Encryption Options
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-encryption-tdehdfs.html
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-data-encryption.html
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/TransparentEncryption.html
Encryption of Notebook storage on S3
https://zeppelin.apache.org/docs/0.8.0/setup/storage/storage.html#notebook-storage-in-s3
Enable server-side encryption
To request server-side encryption of notebooks, set the following environment variable in the file zeppelin-env.sh:
export ZEPPELIN_NOTEBOOK_S3_SSE=true
Or using the following setting in zeppelin-site.xml:
<property>
<name>zeppelin.notebook.s3.sse</name>
<value>true</value>
<description>Server-side encryption enabled for notebooks</description>
</property>
Time Remaining?
● Q&A
● Lab Struggles & Assistance
● Group Experiments?
● Active problems in the extant system?
● Thank you and good night!