Session 5 Addendum
Data Imbalance & Memory Usage Estimation
Session Addendum Objectives
➢ Understand what causes data-imbalance
➢ Understand the impact of data-imbalance
➢ Be familiar with common strategies for dealing with data-imbalance
➢ Understanding and Estimating memory usage
What is Imbalanced Data?
● “Unbalanced” or “Imbalanced” or “Lumpy” data typically refers to “too much of one and not another” for performing some type of operation
● For example, imagine “groupByKey” for businesses by zipcode in NY State.
○ NYC would have a huge proportion of the data!
● Caused by trying to organize around low-cardinality categorical or ordinal data
What Does Imbalanced Data Do?
● Makes things explode, is the main problem!
● Causes unstable nodes
● Causes laggy nodes (stragglers)
● Causes OOM errors
● Causes Looooong shuffles for wide operations
● Often manifests well-down-the-pipeline so you find out...too late
If you are a data science practitioner, the notion of imbalanced data is not new to you. However, keep in mind that the Spark execution concerns for ETL processing, while stemming from the same core issues, have a different solution space that may not be appropriate for ML scenarios. Here we are primarily concerned with how the data is managed from a cluster-execution memory and shuffle perspective, which may or may not be relevant to how an ML algorithm will experience the data imbalance.
How Do We Deal With Imbalanced Data? Part 1
● Operations Strategies (judicious use of groupBy & related shuffle-triggers)
● Filter early and often (Optimizer may handle some of this for you)
● Be cognizant of your partitioning mechanism
○ Custom Partitioner if needed/helpful
● Detecting Stragglers
○ Tasks in a stage that take overly long to execute relative to others
○ Sign of uneven partitioning
○ Use the Spark UI
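The straggler check above can also be done programmatically, before the UI shows the damage. A rough sketch (assuming a Dataset named `biz`, as in the grouping example later in this deck; names are illustrative):

```scala
// Count records per partition to spot skew early.
// `biz` is illustrative - any Dataset/RDD works the same way.
val sizes = biz.rdd
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()

// The biggest partitions are your straggler candidates
sizes.sortBy(-_._2).take(5).foreach { case (idx, n) =>
  println(s"partition $idx: $n records")
}
```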
How Do We Deal With Imbalanced Data? Part 2
● Enforcing higher cardinality
○ Re-keying pre-join (add ‘noise’ to keys)
■ E.g. use the zip+4, or add a business category
■ 10017-1234 or 10017-laundromat
● Broadcast smallish data to avoid a shuffle
○ You can push smaller dataframes out to every executor and join to them
businesses.join(broadcast(nyZips).as("z"), $"z.zip" === $"postal")
● Remove duplicates or combine via “mapPartition” before join/grouping.
● When all else fails, vertically partition the data, land it, then make a 2nd pass with remapped keys...subset, subset, subset.
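The “add noise to keys” idea above can be sketched in the DataFrame API. This is a hypothetical illustration (column and table names are made up, the salt count is arbitrary), not the lab’s code: the big, skewed side gets a random salt, and the small side is replicated once per salt value so the join still matches.

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

val salts = 10  // arbitrary; tune to how hot your keys are

// Big, skewed side: postal 10017 becomes 10017-0 .. 10017-9
val saltedBiz = biz.withColumn("saltedKey",
  concat($"postal".cast("string"), lit("-"),
    (rand() * salts).cast("int").cast("string")))

// Small side: replicate each zip row once per salt so every salted key matches
val saltedZips = nyZips
  .crossJoin(spark.range(salts).toDF("salt"))
  .withColumn("saltedKey",
    concat($"zip".cast("string"), lit("-"), $"salt".cast("string")))

// Same join result as before, but one giant group is now N smaller ones
val joined = saltedBiz.join(saltedZips, "saltedKey")
```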
https://medium.com/teads-engineering/spark-performance-tuning-from-the-trenches-7cbde521cf60
“Spark is supposed to automatically identify DataFrames that should be broadcasted. Be careful though, it’s not as automatic as it appears in the documentation. Once again, the truth is in the code and the current implementation supposes that Spark knows the DataFrame’s metadata, which is not effective by default and requires to use Hive and its metastore.”
Check your default parallelism: sc.defaultParallelism
The behavior differs between regular RDDs and DataFrames. The Spark SQL module ships with the default configuration spark.sql.shuffle.partitions set to 200.
With a DataFrame you can also call df.repartition(x) to change the number of partitions, as can rdd.coalesce(x). The main difference is that coalesce simply combines existing partitions to shrink their count, whereas repartition performs a full “rebalancing of the data” (think wide-op shuffle), equalizing the partition sizes.
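A quick sketch of that difference (the path matches the example on the next slide; partition counts are arbitrary):

```scala
val df = spark.read.csv("data/imbalance.csv")

val narrowed = df.coalesce(5)     // narrow op: merges existing partitions,
                                  // no shuffle, sizes may stay uneven
val balanced = df.repartition(5)  // wide op: full shuffle, roughly equal sizes

println(narrowed.rdd.getNumPartitions) // 5, or fewer if the input had fewer
println(balanced.rdd.getNumPartitions) // 5
```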
Example: Increasing Cardinality for Groupings
//Both work!

//This one may cause more shuffling, which increases network
//chatter, incurring delays
val nyBiz = spark.read.csv("data/imbalance.csv").coalesce(5)
case class Datum(city: String, postal: Int, category: String)
val biz = nyBiz.map(x => Datum(x.getAs[String](0),
  Integer.parseInt(x.getAs[String](1)), x.getAs[String](2)))
val groupedByPostal = biz.groupBy("postal").count.show

//This one may incur more "memory overhead" and therefore
//garbage collection
val nyBiz = spark.read.csv("data/imbalance.csv").coalesce(5)
case class Datum(city: String, postal: String, category: String)
val biz = nyBiz.map(x => Datum(x.getAs[String](0),
  x.getAs[String](1) + x.getAs[String](2), x.getAs[String](2)))
val groupedByPostal = biz.groupBy("postal").count.show
Here we are doing what appears to be the same thing - because it pretty much is. Note however that in the second case, we are changing the datatype of “postal” from Int to String, and concatenating 2 strings together to increase the cardinality of the data that will be grouped (postal). Increasing the cardinality of the grouping column spreads the data across more, smaller groups, reducing skew during group-by type activities.
Proactively Managing Memory
● Understand what eats it up
● Understand what gets used when
● Inspect the web UI and review usage
● Do the math!
● Use SizeEstimator.estimate
● Manually configure ratios (in more extreme cases)
https://spark.apache.org/docs/latest/tuning.html
Components of Memory Usage
● Objects stored & the ‘meta’ they carry with them
○ Object headers can be > the data itself
○ Linked structures retain “pointers” to siblings
○ “Primitives” might be boxed
● Cost of object access
● Spark “memory overhead minimum”, ~ 384MB
● Serialization style
By avoiding the Java features that add overhead we can reduce memory consumption. There are several ways to achieve this:
● Avoid nested structures with lots of small objects and pointers.
● Instead of using strings for keys, use numeric IDs or enumerated objects.
● If the RAM size is less than 32 GB, set the JVM flag -XX:+UseCompressedOops to make pointers four bytes instead of eight.
Remember, anything that is “an object” has to be serialized and deserialized for shuffles, or persistence. This all costs CPU and RAM.
This becomes a balance between programmer convenience and code readability vs. code complexity + speed.
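To make the boxing and key-type points concrete, a small, self-contained Scala illustration (not from the slides):

```scala
// An Array[Int] stores elements flat, 4 bytes each; an array of boxed
// java.lang.Integer stores pointers to heap objects, each with a header.
val flat: Array[Int] = Array(1, 2, 3)
val boxed: Array[java.lang.Integer] = Array(1, 2, 3) // each element boxed

// Likewise, numeric keys are smaller and cheaper to hash than strings:
val byName: Map[String, Long] = Map("laundromat" -> 42L) // heavier keys
val byId: Map[Int, Long] = Map(7 -> 42L)                 // lighter keys
```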
Components of Memory Access
● Parallelism: too few partitions can be problematic
○ More partitions decreases each task’s memory use (input data)
○ Aim for 2-3 tasks per CPU core
● Garbage Collection (the unseen devil in JVM performance problems)
○ Collect & review statistics
○ Adjust allocation to fit
○ Try different flags, and balance with “fraction settings”
■ spark.memory.fraction
■ spark.memory.storageFraction
○ This is its own science + art form: https://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html
● Data Locality
This article has a good set of problems & solutions around memory & disk space. https://www.indix.com/blog/engineering/lessons-from-using-spark-to-process-large-amounts-of-data-part-i/
Data Locality & Streaming
● Where is the Partition?
○ RDD carries information about location
○ Hadoop RDDs know about the location of HDFS data
○ KafkaRDDs indicate the Kafka-Spark partition should get data from the machine hosting the Kafka topic
○ Spark Streaming - partitions are local to the node the receiver is running on
● What is “local” for a Spark task is based on what the RDD implementer decided would be local
● 4 Kinds of Locality
○ PROCESS_LOCAL - task runs in the same process as the source data
○ NODE_LOCAL - task runs on the same machine as the source data
○ RACK_LOCAL - task runs on the same rack
○ NO_PREF/ANY - data has no locality preference, or cannot be co-located with the task
● spark.locality.wait - determines how long to wait before changing the locality goal of a task
https://github.com/apache/spark/blob/aba9492d25e285d00033c408e9bfdd543ee12f72/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L137
RDD Subclasses can create their own implementation of getPreferredLocations to provide hints about where the data is, e.g. CassandraRDD, KafkaRDD implementers.
Setting spark.locality.wait should be considered based on the merits of locality for your application, relative to latency.
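For example, lowering the wait to favor latency over locality (values here are illustrative, not recommendations):

```scala
import org.apache.spark.SparkConf

// Lower waits mean tasks accept a less-local slot sooner.
// Raise them instead if fetching remote data is your real bottleneck.
val conf = new SparkConf()
  .set("spark.locality.wait", "1s")       // default is 3s
  .set("spark.locality.wait.node", "1s")  // per-level overrides also exist
```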
Memory and Data Locality article: http://www.russellspitzer.com/2017/09/01/Spark-Locality/
Here we can review the different metrics in the Spark UI around memory usage and data shuffled about.
http://hydronitrogen.com/apache-spark-shuffles-explained-in-depth.html
Here we can look at the relative cost of tasks that involve shuffles - most of the more expensive stages involve shuffled data.
Specific Actions You Can Take - 1
● Design for size: use arrays and primitives over standard collections and richer types
○ Key off numbers, not strings
○ Use minimized objects to reduce overhead: http://fastutil.di.unimi.it/
● Calculate expected usage & size & parallelize accordingly
○ Pass in initializations (e.g. sc.textFile or spark.read.xx.coalesce(n))
○ Change the default - spark.default.parallelism
● Try dynamic allocation
● Change pointer size (< 32 GB RAM: set -XX:+UseCompressedOops in spark-env.sh)
“For most programs, switching to Kryo serialization and persisting data in serialized form will solve most common performance issues.”
“The best way to size the amount of memory consumption a dataset will require is to create an RDD, put it into cache, and look at the “Storage” page in the web UI. The page will tell you how much memory the RDD is occupying.”
- Spark Tuning Guide
Specific Actions You Can Take - 2
● Switch to Kryo for serialization
○ conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
○ Warning: must register your custom classes
● Create facilities for on-demand clusters to isolate “RBJs” (Really Big Jobs)
● Estimate Size

scala> import org.apache.spark.util.SizeEstimator

//The imported text file
scala> println(SizeEstimator.estimate(nyBiz))
62436248

//The resulting dataset post-map-to-case-class
scala> println(SizeEstimator.estimate(biz))
62440912

//The post-groupBy dataframe
scala> println(SizeEstimator.estimate(groupedByPostal))
62440904
This gist has an example of Kryo custom class registration: https://gist.github.com/claudinei-daitx/f39d51e6ecf1e0683b21741bb1cb6f53
Kryo’s own writeup: https://github.com/EsotericSoftware/kryo
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)
Specific Actions You Can Take - 3
● Adjust Garbage Collection Settings
○ Turn on logging:
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
○ Try the G1GC: -XX:+UseG1GC
○ Many others…
● Adjust “Fractions”
○ spark.memory.fraction
○ spark.memory.storageFraction
“Although there are two relevant configurations, the typical user should not need to adjust them as the default values are applicable to most workloads:
● spark.memory.fraction expresses the size of M as a fraction of the (JVM heap space - 300MB) (default 0.6). The rest of the space (40%) is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually large records.
● spark.memory.storageFraction expresses the size of R as a fraction of M (default 0.5). R is the storage space within M where cached blocks are immune to being evicted by execution.
The value of spark.memory.fraction should be set in order to fit this amount of heap space comfortably within the JVM’s old or “tenured” generation. See the discussion of advanced GC tuning below for details.”
https://spark.apache.org/docs/latest/tuning.html
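“Doing the math” on those fractions for a hypothetical 10 GB executor heap - a sketch of the formulas quoted above, with the heap size chosen purely for illustration:

```scala
// Unified memory M = (heap - 300MB reserve) * spark.memory.fraction
// Eviction-immune storage R = M * spark.memory.storageFraction
val heapBytes       = 10L * 1024 * 1024 * 1024  // hypothetical 10 GB heap
val reservedBytes   = 300L * 1024 * 1024        // fixed 300 MB reserve
val memoryFraction  = 0.6                       // default
val storageFraction = 0.5                       // default

val m = math.round((heapBytes - reservedBytes) * memoryFraction)
val r = math.round(m * storageFraction)

println(s"M = ${m / 1024 / 1024} MB, R = ${r / 1024 / 1024} MB")
// prints: M = 5964 MB, R = 2982 MB
```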
Session Review
● Definitions of Imbalanced Data
● Solutions for managing Imbalanced Data
● Understanding components of memory consumption
● Actionable steps for improving functionality
EMR
Elastic Map Reduce on AWS
Session 9: Amazon EMR
Session 9 - EMR Session Objectives
➢ Understand EMR Solution
➢ Know how to start and run a cluster
➢ Know how to interact with Spark on EMR
➢ Interfacing with S3 data
➢ Zeppelin with EMR
➢ Cost optimizations for storage and networking
➢ Scaling
➢ Debugging
EMR 101
● Elastic Map Reduce (Hadoop & Friends on AWS)
● Managed Cluster (Less devopsy stuff for you to do)
● Autoscaling
● Many Softwares
○ Spark!
○ Hadoop
○ HBase
○ Presto
○ Hive
Slide credit to ReInvent
YARN Schedulers - CapacityScheduler
● Default scheduler specified in Amazon EMR
● Queues
○ A single queue is set by default
○ Can create additional queues for workloads based on multitenancy requirements
● Capacity guarantees
○ Set minimal resources for each queue
○ Programmatically assign free resources to queues
● Adjust these settings using the classification capacity-scheduler in an EMR configuration object (or bootstrapping)
The two built-in schedulers are the Capacity Scheduler and the Fair Scheduler. EMR uses the Capacity Scheduler by default. Cloudera, on the other hand, recommends the Fair Scheduler. This can be configured using the procedure called ‘bootstrapping’, where we customize the configurations your EMR cluster runs with. The Fair Scheduler can be more configurable in the sense of handling queues.
Fair - Allocates resources to weighted pools, with fair sharing within each pool (docs).
Capacity - Allocates resources to pools, with FIFO scheduling within each pool (docs).
https://hortonworks.com/blog/yarn-capacity-scheduler/
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
http://blog.cloudera.com/blog/2016/01/untangling-apache-hadoop-yarn-part-3/
https://www.quora.com/On-what-basis-do-I-decide-between-Fair-and-Capacity-Scheduler-in-YARN
https://community.pivotal.io/s/article/How-to-configure-queues-using-YARN-capacity-scheduler-xml
EMR Hadoop
● Preconfigured software for your convenience
○ AWS-instance-type-based YARN and Hadoop settings
● Contains Hadoop customizations that are uniquely AWS
○ Must build binaries using EMR (in other words, on-cluster)
■ Not the case for Spark
○ Must build binaries with the same Linux version
● Build on EMR > Copy to S3 > Run Step Sequence
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-daemons.html
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-build-binaries.html
EMR Spark: The Sales Pitch
● “Easy” to use/get started
● Cost savings (potential)
● Open Source tools (with mods in some cases)
● Managed
● Secure
● Flexible
EMR Spark
● Fundamentally “still a Yarn-based cluster”
● Supports pretty much all the same features you’d expect running your own
● Ala-carte opportunity to drink more AWS Kool-Aid
○ Data Pipeline
○ Encryption at rest/in transit
○ Aurora-based Hive metastore
○ Spot provisioning for cost measures
■ Can incur delays
○ IAM security measures
○ S3 Data Lake (decouple compute & storage)
Much like the majority of AWS products, EMR is functionally an open source product, customized for use by AWS in the AWS ecosystem. It plays nice with many other tools in a coherent, well organized aggregate of data streaming, processing and storage tools.
Some of their tools really do seem home-grown - Kinesis, DynamoDB, S3 - whereas others are thin wrappers around popular tools:
Athena: PrestoDB
Redshift: PostgreSQL with columnar storage
And still others are useful wrappers around advertised products, such as Aurora/MySQL, ElastiCache/Redis, Elasticsearch Service...etc.
Session Review
● What EMR is
● EMR’s purpose
● Spark on EMR Basics
EMR
Elastic Map Reduce on AWS
9.1 S3 Data Lake
EMR Spark: S3 Data Lake - Why?
The S3 Data Lake concept has some advantages
● High durability (11 9’s) and high availability
● Security
○ Can constrain by IAM roles
○ VPC-only access
○ In-depth bucket policies
○ Encryption at rest
● Low Cost (dramatically lower than RDBMS or NoSQL storage)
● S3-Select (where viable/appropriate)
Additional value was added on the security-side of the S3 equation: https://aws.amazon.com/blogs/aws/new-vpc-endpoint-for-amazon-s3/
Even more security: https://aws.amazon.com/blogs/security/how-to-use-bucket-policies-and-apply-defense-in-depth-to-help-secure-your-amazon-s3-data/
Learn how to pre-filter your data in S3 before bringing it to the cluster: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3select.html
Lower your storage cost with Glacier, when performance is less of a concern: https://aws.amazon.com/blogs/aws/s3-glacier-select/
EMR Spark: S3 Data Lake
But some disadvantages too!
● Limited read/write speed
● Network latency
● “Ghost files”, “conceived files” (eventual consistency side effects)
● Somewhat confusing protocol addressing s3:// s3n:// s3a://
EMR Spark: S3 Data Lake - Protocol & File Access
● s3:// - Hadoop implementation of a block-based file system backed by S3
○ Also how you might be used to referencing directly to AWS as an S3 URI
● s3n:// - “Native file system” access by Hadoop
● s3a:// - “s3n part 2” - the upgrade to s3n
○ Supports files > 5GB
○ Uses/requires AWS SDK
○ Backwards compatible with s3n
○ NOT SUPPORTED in EMR!
● EMRFS - Wait...we’re back to s3:// - yep. AWS EMR has re-simplified the confusion back to just s3:// if you are on EMR
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html
EMR Spark: S3 Data Lake - Latency Concerns
● Resolve S3 inconsistencies, if present, with “EMRFS consistent View” in cluster setup
● Use compression!
○ CSV/JSON - GZip or Bzip2 (if you wish S3-Select to be an option)
● Use S3-Select for CSV or JSON if filtering out ½ or more of the dataset
● Use other types of file-store, i.e. Parquet/Orc
● Chunk your files
○ Spark handles many small files better than a few huge files, up to a point
EMR Spark: S3 Data Lake - Latency Concerns: Sizing
How big should my files be? It depends -
● No S3-Select
○ With GZip, 1-2GB tops. GZip cannot be split.
○ Splittable files: between 2GB and 4GB
■ Allows more than 1 mapper, increasing throughput
■ Goal is to process as many files in parallel as possible
EMR Spark: S3 Data Lake - Latency Concerns: Sizing
How big should my files be? It depends -
● Using S3-Select? Less of a concern
○ Input files must be CSV, JSON, or Parquet
○ Output files must be JSON or CSV
○ Files must be uncompressed, .gz (GZip), or .bz2 (Bzip2) (JSON/CSV only)
○ Max SQL expression length: 256KB
○ Max result record length: 1MB
EMR Spark: Ingesting and Landing Data
● Ingestion
○ We can often improve overall system performance by using S3DistCp to bring data in from S3 and push it to HDFS - but this is a whole extra step
○ We may of course ingest from other datastores: RDS, Redshift, Streaming (Kafka/Kinesis), Dynamo...etc. We can also ingest from Elasticsearch
● Landing
○ When we land data, we can temporarily land it to a Hive table for interactive exploration
○ We can of course land it to any JDBC storage
○ Often we will land it to S3 for further interaction with other systems (Athena, Presto...etc)
○ Fun tip - need to feed your ES search? You can write your final result directly to Elasticsearch!
https://www.elastic.co/guide/en/elasticsearch/hadoop/6.x/spark.html
Session Review
● Value of S3 Data Lake
● Hindrances (pros/cons)
● Some ideas for how to store/retrieve data
EMR
Elastic Map Reduce on AWS
9.2 Setting up the Cluster
EMR Cluster: Setup
● Methods
○ AWS Console
■ Advantages: Simplicity, Clarity
○ AWS CLI
■ Advantages: Completeness, Scriptable
○ AWS SDK
■ Java
■ Boto3 (Python)
■ ...etc...
■ Advantages: Infrastructure as Code (IaC) friendly, completeness
https://sysadmins.co.za/aws-create-emr-cluster-with-java-sdk-examples/
EMR Cluster: Setup: Console
Don’t be seduced by the one-pager (Create Cluster - “Quick Options”). You typically want to use “Advanced” mode (if you use the Console at all).
If you do use quick options, at least take note of your log location, and make sure you select Spark ;)
Instance type is a function of sizing exercises you presumably have already done, or perhaps will do after you run some trial code.
Step execution is good for 1-off job clusters (launch, do, terminate)
EMR Cluster: Setup: Console - Quick
Pick the smallest instance you can. This is only a test...
You will need to add a key-pair in order to ssh-in, including accessing Web UI for Spark
EMR Cluster: Setup: Console - Quick
Get coffee. This is not a super-quick procedure. The EMR machinery is doing a bit of work, and the more software you selected, the longer it will take.
Once this cluster is launched, it is really not much different programmatically, from a local or on-prem cluster, except you have to SSH in to do much.
Look up the master node for your cluster in the Console UI, or:
> aws emr list-clusters
> aws emr list-instances --cluster-id j-YOUR-CLUSTER-ID
Or
> aws emr describe-cluster --cluster-id j-YOUR-CLUSTER-ID
SSH instructions can be found on the cluster page, click the “SSH” link next to the master address.
EMR Cluster: Setup: Console - Quick
You can SSH in:
> ssh -i ~/.ssh/rf_emr_dev_access.pem [email protected]
--EMR banner shows --
> sudo spark-shell
If you don’t `sudo` you get a bunch of warnings basically saying the logs cannot be written.
If you have trouble ssh’ing in - by default the security group sometimes does NOT allow SSH from anywhere (or everywhere). In the cluster UI, click on the security group for the master, and make sure it allows SSH (port 22) from at least your IP range, while staying safe.
EMR Cluster: Setup: Console - Quick
You can connect from Zeppelin:
(It’s already installed on the cluster by default, just need an SSH tunnel)
Once a tunnel is set up, you can just “click the link” in the cluster-ui.
> ssh -i ~/.ssh/rf_emr_dev_access.pem -ND 8157 [email protected]
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-ssh-tunnel-local.html
MINI-LAB - Launch a Quick Cluster
● Log into AWS
● Launch a Quick Cluster
● Connect from Spark Shell
● Run a few commands to prove it’s working
○ Try parallelizing an array of some data and mapping it
○ Review the Spark UI (in the cluster window click “Enable Web Connection” and follow instructions)
● Bonus Credit - launch Zeppelin (in the cluster window click “Enable Web Connection” and follow instructions)
● Terminate the cluster
Run some commands -
scala> val numbers = Array(1, 2, 3, 3, 4, 4, 5)
numbers: Array[Int] = Array(1, 2, 3, 3, 4, 4, 5)
scala> val numbersRDD = sc.parallelize(numbers)
numbersRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:26
scala> numbersRDD.distinct.collect
1
2
3
4
5
To try Spark UI or Zeppelin, establish an SSH Tunnel
ssh -i ~/.ssh/rf_emr_dev_access.pem -ND 8157 [email protected]
Much like with SSH, you may need to open up the port in the security groups.
This command will just stay running, it will not terminate until you ctrl-c it.
Then open a browser, or click, the links in the Console UI.
E.g.
http://ec2-100-25-196-44.compute-1.amazonaws.com:18080/
http://ec2-100-25-196-44.compute-1.amazonaws.com:8890/#/
EMR Cluster: Setup: Console - Advanced
● Unsurprisingly, kinda the same, but with options
● Customize your software set. ○ Need to ala-carte TensorFlow? ○ Contrast TensorFlow and MXNet?○ Try out Presto?
● Customize the installed configurations (i.e. adjust hadoop-env, yarn-site..etc)
● Add steps & conditionally set auto-terminate
● Optimize pricing/Instance types
● Customize Security/VPC/Subnets
EMR Cluster: Setup: Console - Advanced
What’s in a Node?
There are some details to be aware of around node types that are hidden from the Quick Cluster setup.
● Master Node, you are probably familiar with
○ Runs the HDFS NameNode service
○ Runs the YARN ResourceManager service
○ Tracks submitted job statuses and monitors the health of the instance groups
○ Like Highlander, there can be only one (per instance group/fleet)
● Core Nodes
○ Run the DataNode daemon (for HDFS)
○ Run the TaskTracker daemon
○ This is a scaling point
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances.html
EMR Cluster: Setup: Console - Advanced
What’s in a Node?
There are some details to be aware of around node types that are hidden from the Quick Cluster setup.
● Task Nodes
○ Do not run the DataNode daemon (not participating in HDFS)
○ Best for autoscale/spike capacity in your cluster
● Instance Fleets
○ Fully configurable cluster management
○ Able to take advantage of Spot instances (cost optimization)
○ Allow AWS to “mix and match” instance types, optimizing your pricing and their utilization. Can result in sudden-death nodes.
○ Can be used to really optimize, but are complex and require experimentation
○ Can add a “Task Instance Fleet” to an active cluster
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances.html
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-instance-fleet.html
EMR Cluster: Setup: Console - Advanced
What’s in a Node?
There are some details to be aware of around node types that are hidden from the Quick Cluster setup.
● Uniform Instance Groups
○ Simplified capacity management
○ While allowing flexible autoscaling setup
○ Specify purchasing options to manage cost
○ Don’t run the Master as a Spot Instance on any cluster you care about
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-uniform-instance-group.html
EMR Cluster: Setup: Console - Advanced
3: Cluster Settings
There are a few things to note here, but the highlights for now are -
● Logging of course (location spec)
● “EMRFS consistent view” - remember when we talked about S3 “eventual consistency”?
● Bootstrap Actions (NOT the same as ‘steps’)
○ Definitely an “advanced mode” option here, for:
■ Pre-loading some common data set onto each node
■ Installing additional software (Drill, Impala, Elasticsearch, Saltstack...etc)
■ Max of 16 actions
https://github.com/aws-samples/emr-bootstrap-actions
https://aws.amazon.com/premiumsupport/knowledge-center/bootstrap-step-emr/
Bootstrap actions are the first thing to run after an Amazon EMR cluster transitions from the STARTING state to the BOOTSTRAPPING state. Bootstrap actions, which run on all cluster nodes, are scripts that run as the Hadoop user by default, but they can also run as the root user with the sudo command. The cluster won't start if a bootstrap action fails.
Steps are a distinct unit of work, comprising one or more Hadoop jobs that run only on the master node of an Amazon EMR cluster. Steps are usually used to transfer or process data. One step might submit work to a cluster, and others might process the submitted data and then send the processed data to a particular location. Steps are often what is used for a “run and done” cluster.
MINI-LAB - Configure an Advanced Cluster
● Log into AWS
● Configure an Advanced Cluster
● Play with the options
● Click some “i” icons
● Q&A
What Are These “Steps”?
“When you are done starting, do this one thing.” Then maybe shut down too.
● Available in:
○ Quick Launch - auto-terminate when done
○ Advanced - specify steps & termination option
● A “unit of work” submitted to the cluster
○ Stream processing
○ Hive/Pig
○ Spark job
○ Custom Hadoop
● Each has its own unique configuration
One disadvantage is that ephemeral clusters can be hard to troubleshoot. Shutdown post-step is not required (when using Advanced Config).
EMR Cluster: Setup: Other Ways To Start
● Are you using IaC? Code it in.
○ On-Demand Jobs from Jenkins
○ Other “AWS-SDK”-based solutions
● Script it
○ Generally anything that can be done in the Console can be done in the CLI
○ Usually more options in the CLI
○ Also an IaC option here…
○ https://docs.aws.amazon.com/cli/latest/reference/emr/index.html
For guaranteeing execution isolation, I really like the one-off-cluster mechanism, but it is easily overkill for “many small/mid-sized jobs” environments.
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-submit-step.html
EMR Cluster: Setup: Launch via CLI
Demo Lab 9.2
Follow along if you like/are able. What we’ll do:
1. Push a chunk of our data from earlier to S3
2. Take one of our lab projects as a Jar and push it to S3
3. Launch a cluster using “steps” and the AWS CLI that will
a. Start
b. Run Steps
c. Terminate
4. Validate the creation, output, and termination
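A hedged sketch of what such a launch might look like with the AWS CLI. Every s3:// path, the key pair, the --class value, and the release label here are placeholders - substitute the paths you recorded when staging your assets:

```shell
# Hypothetical one-off cluster: start, run one Spark step, terminate.
aws emr create-cluster \
  --name "lab-9.2-oneoff" \
  --release-label emr-5.20.0 \
  --applications Name=Spark \
  --use-default-roles \
  --ec2-attributes KeyName=my-keypair \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --log-uri s3://my-bucket/emr-logs/ \
  --steps 'Type=Spark,Name=LabStep,ActionOnFailure=TERMINATE_CLUSTER,Args=[--class,com.example.LabJob,s3://my-bucket/jars/lab.jar,s3://my-bucket/data/]' \
  --auto-terminate
```

With --auto-terminate set, the cluster shuts itself down once the step sequence finishes, which is what makes the "start, run steps, terminate" flow hands-off.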
EMR Cluster: Setup: CLI Lab 9.2
Asset Placement
We need to get our assets onto S3, to be accessible to EMR.
● Build the Jar
● Use the AWS SDK to copy the jar
● Use the AWS SDK to copy the data
● Record those paths
https://www.oreilly.com/learning/how-do-i-package-a-spark-scala-script-with-sbt-for-use-on-an-amazon-elastic-mapreduce-emr-cluster
Session Review
● Cluster Setup
● Cluster Management (a bit)
● Some decisions & options for clusters
● We set up a cluster! (or 2..)
● We learned EMR is a highly malleable environment, though the stock configuration gives a lot of benefit out of the box
EMR
Elastic Map Reduce on AWS
9.3 Tools on EMR Clusters
Many Tools Available
● Databases○ Hive○ HBase/Phoenix○ Presto
● ML Tools○ Mahout○ MXNet○ Tensorflow
● Data Streaming/Loading○ Flink/Sqoop○ Kinesis/Kafka/ES/Cassandra
● Workflow & Monitoring○ Hue○ Oozie
Aaand Notebooks!
● Zeppelin
● Jupyter
Notebooks on EMR
● Houston, we have options
● Zeppelin vs. Jupyter: what’s worth fighting for?○ Well...Zeppelin gives you native Scala support…?○ Yeah but...Almond is a Scala kernel for Jupyter○ Jupyter has more better visualization and stuff...python libs man…○ Yeah but Zeppelin is growing more quickly○ Dude the data science guys LIKE Jupyter ok?○ But multi-user. But authentication. But...○ …scala….python...scala...python...jupyter...zeppelin
● Both are good. Both have merits. Both have some issues on EMR
● “What about Beaker man!” “But I like Databricks!” - “Religious debate has no place in [data] science”
Zeppelin on EMR
Here’s the deal with Zeppelin, if that’s what you want to use
● Multi-user setup may be less effort
● It can be more secure than Jupyter, if that is a business concern
● Store Zeppelin notebooks on S3 so they don’t go away with the cluster!
○ If we don’t store off-cluster, we are scared to terminate, increasing cost
○ Security options here:
■ Access key/secret
■ IAM/User
■ Secure by-bucket if you require
○ Can be done two ways
■ Shelling in to the running cluster and updating the config
■ Configuring the cluster at startup using the “configurations” block
https://www.zepl.com/blog/setting-multi-tenant-environment-zeppelin-amazon-emr/
https://medium.com/@addnab/s3-backed-notebooks-for-zeppelin-running-on-amazon-emr-7a743d546846
Zeppelin on EMR
● One-off clusters for analysis can use Spot instances. If you do this, make sure you set up to store the notebook to S3
● You can even set up your own EC2 with Zeppelin to run off the main cluster-master (so node deaths and cluster decommissions have less impact)
● Zeppelin on Amazon EMR does not support the SparkR interpreter
● Zepl is a 3rd party solution from the Zeppelin folks offering a product ZeppelinHub to ease the burdens here.
● https://www.zepl.com/blog/setting-multi-tenant-environment-zeppelin-amazon-emr/
https://zeppelin.apache.org/docs/0.8.0/setup/storage/storage.html#notebook-storage-in-s3
Simply adding an EBS volume to the cluster config does not guarantee persistent storage. https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-storage.html states:
“Amazon EBS volumes attached to EMR clusters are ephemeral: the volumes are deleted upon cluster and instance termination (for example, when shrinking instance groups), so it's important that you not expect data to persist.”
Zeppelin Performance Notes 1
● Store Zeppelin notebooks on S3 so they don’t go away with the cluster. Ephemeral notebooks on cluster crash or decommission are just no fun
○ You can also store them on EFS if you bootstrap your cluster to do so
● (Potentially) Set your notebooks to use interpreter-per rather than shared
○ Configure “interpreter/spark interpreter” as “The interpreter will be instantiated ‘Per User’ in ‘scoped’ process” and click “Save” (JVM isolation)
○ Otherwise a single interpreter will be used by all notebooks
● Understanding CPU/vCPU/YARN CPU/Zeppelin allocations
○ Do not expect your cluster settings to be in effect:
“Zeppelin does not use some of the settings defined in your cluster’s spark-defaults.conf configuration file, even though it instructs YARN to allocate executors dynamically if you have set spark.dynamicAllocation.enabled to true. You must set executor settings, such as memory and cores, using the Zeppelin Interpreter tab, and then restart the interpreter for them to be used.”
● Interpreter options override spark-submit options
○ spark.executor.memory > export SPARK_SUBMIT_OPTIONS="--executor-memory 10G ..."
https://zeppelin.apache.org/docs/0.8.0/usage/interpreter/interpreter_binding_mode.html
https://docs.amazonaws.cn/en_us/emr/latest/ReleaseGuide/zeppelin-considerations.html
Considerations When Using Zeppelin on Amazon EMR
● Connect to Zeppelin using the same SSH tunneling method to connect
to other web servers on the master node. Zeppelin server is found at
port 8890.
● Zeppelin on Amazon EMR release versions 5.0.0 and later supports
Shiro authentication.
● Zeppelin on Amazon EMR release versions 5.8.0 and later supports
using AWS Glue Data Catalog as the metastore for Spark SQL. For
more information, see Using AWS Glue Data Catalog as the Metastore
for Spark SQL.
● Zeppelin does not use some of the settings defined in your cluster’s
spark-defaults.conf configuration file, even though it instructs YARN to
allocate executors dynamically if you have set
spark.dynamicAllocation.enabled to true. You must set executor
settings, such as memory and cores, using the Zeppelin Interpreter
tab, and then restart the interpreter for them to be used.
● Zeppelin on Amazon EMR does not support the SparkR interpreter.
https://community.hortonworks.com/articles/212176/key-factors-that-affects-zeppelins-performance-1.html
Important: Before choosing Isolated, check that your system has enough resources. For example, if there are 30 users and interpreter memory is 1GB, you will need 30GB of RAM available on the Zeppelin node.
https://github.com/awslabs/deeplearning-emr/blob/master/training-on-emr/emr_efs_bootstrap.sh
Zeppelin Performance Notes 2
When creating your cluster -
● Use the Script Runner step or “Configurations” to customize the setup
○ Give more memory, e.g. zeppelin-env.sh
■ export ZEPPELIN_MEM="-Xms4024m -Xmx6024m -XX:MaxPermSize=512m"
■ export ZEPPELIN_INTP_MEM="-Xms4024m -Xmx4024m -XX:MaxPermSize=512m"
● Limit the result sets, e.g. in the interpreter
○ zeppelin.spark.maxResult
● Limit the interpreter output, e.g. zeppelin-site.xml
○ <name>zeppelin.interpreter.output.limit</name>
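In zeppelin-site.xml that property sits in a standard Hadoop-style property block. A sketch (the 102400-byte value here is illustrative, not a recommendation):

```xml
<property>
  <name>zeppelin.interpreter.output.limit</name>
  <value>102400</value>
  <description>Truncate interpreter output beyond this many bytes.</description>
</property>
```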
We’ll do an example of this in the next lab.
https://medium.com/@omidvd/bootstrapping-zeppelin-emr-779c4a138aed
From: https://community.hortonworks.com/articles/141589/zeppelin-best-practices.html
Deployment Choices
While you can select any node type to install Zeppelin, the best place is a gateway node. The reason a gateway node makes the most sense is that when the cluster is firewalled off and protected from outside, users can still reach the gateway node.
1. Hardware Requirements
1. More memory & more cores are better
2. Memory: Minimum of 64 GB per node
3. Cores: Minimum of 8 cores
4. # of users: A given Zeppelin node can support 8-10 users. If you want more users, you can set up multiple Zeppelin instances. More details in MT section.
Lab: Set Up Zeppelin on EMR
Lab 9.3-A
What we’ll do
1. Launch a single-node cluster
2. Add Zeppelin & Spark
3. Do some work in Zeppelin
4. Terminate the cluster
An example repo of AWS Console setup. https://github.com/arunkundgol/zeppelin-setup
Lab: Set Up Zeppelin on EMR with S3 Storage
Lab 9.3-B FIXME/TODO: NEED TO FINISH
What we’ll do
1. Launch a single-node cluster
2. Create a folder for S3 persistence
3. Configure Zeppelin to use S3 for persistence
4. Edit & save the notebook
5. Check S3
6. Terminate the cluster
7. Start a new one
8. Load the notebook & run it
9. Terminate the cluster
An example repo of AWS Console setup. https://github.com/arunkundgol/zeppelin-setup
You may run into permissions issues with Zeppelin S3 storage. These unfortunately manifest as an IO exception or SAX parse exception in the Zeppelin logs.
Caused by: java.io.IOException: Unable to list objects in S3: com.amazonaws.SdkClientException: Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler
Double check your settings in configurations.json and zeppelin-site.xml to make sure the user/bucket values match, and that the values exist in S3.
Jupyter on EMR
Notice that the EMR documentation only includes instructions for JupyterHub - which we will get to…
● To run a Jupyter Notebook directly against EMR requires a few extra steps
○ We need to configure EMR to have Jupyter available
○ We need to enable jupyter_spark
○ We need to configure security groups
○ We need to create the cluster with the custom bootstrap and script-runner step
○ We need to manually start pyspark
● That seems like a lot of work.
○ WAY more work than running it locally
● Let’s just use JupyterHub, for goodness’ sake
https://mikestaszel.com/2017/10/16/jupyter-notebooks-with-pyspark-on-aws-emr/
emr_bootstrap.sh
#!/bin/bash
# yum packages:
sudo yum install -y htop tmux
# download and install anaconda:
wget -q https://repo.continuum.io/archive/Anaconda2-4.4.0-Linux-x86_64.sh -O ~/anaconda2.sh
/bin/bash ~/anaconda2.sh -b -p $HOME/anaconda
echo -e '\nexport SPARK_HOME=/usr/lib/spark\nexport PATH=$HOME/anaconda/bin:$PATH' >> ~/.bashrc && source ~/.bashrc
# cleanup:
rm ~/anaconda2.sh
# enable https://github.com/mozilla/jupyter-spark:
sudo mkdir -p /usr/local/share/jupyter
sudo chmod -R 777 /usr/local/share/jupyter
pip install jupyter-spark
jupyter serverextension enable --py jupyter_spark
jupyter nbextension install --py jupyter_spark
jupyter nbextension enable --py jupyter_spark
jupyter nbextension enable --py widgetsnbextension
jupyter_step.sh
#!/bin/bash
# set up spark to use jupyter:
echo "" | sudo tee --append /etc/spark/conf/spark-env.sh > /dev/null
echo "export PYSPARK_PYTHON=/home/hadoop/anaconda/bin/python" | sudo tee
--append /etc/spark/conf/spark-env.sh > /dev/null
echo "export PYSPARK_DRIVER_PYTHON=/home/hadoop/anaconda/bin/jupyter" | sudo
tee --append /etc/spark/conf/spark-env.sh > /dev/null
echo "export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --ip=0.0.0.0'"
| sudo tee --append /etc/spark/conf/spark-env.sh > /dev/null
Running Jupyter Off-Master - JupyterHub, SparkMagic, Spark-Rest/Livy
● Livy: Apache incubated a REST API for Spark called Livy
○ Livy enables off-cluster hosting of notebooks for JupyterHub
● SparkMagic: Extra bits for Jupyter with Spark via JupyterHub + Livy
○ Automatically installed on EMR with the JupyterHub package
● Spark-REST is Livy, the server side of the equation
○ We really don’t need to think much about it
○ Abstracted by wrapper APIs under the covers of JupyterHub
● This creates a whole new set of capabilities
● As usual, EMR introduces some hiccups
○ Cluster access is from within AWS
○ Which usually means an SSH tunnel is required
https://aws.amazon.com/blogs/big-data/running-jupyter-notebook-and-jupyterhub-on-amazon-emr/
https://livy.incubator.apache.org/
https://github.com/jupyterhub/jupyterhub
https://github.com/jupyter-incubator/sparkmagic
Livy/SparkMagic Notes
● SparkMagic
○ Included with JupyterHub
○ Uses Livy
○ Gives extra ‘magics’ - %% capabilities in your notebook
○ Automatic visualization of SQL queries in the PySpark, PySpark3, Spark and SparkR kernels; use an easy visual interface to interactively construct visualizations, no code required
○ Ability to capture the output of SQL queries as Pandas dataframes to interact with other Python libraries (e.g. matplotlib)
● Introduces some limitations, notably
○ “Since all code is run on a remote driver through Livy, all structured data must be serialized to JSON and parsed by the Sparkmagic library so that it can be manipulated and visualized on the client side. In practice this means that you must use Python for client-side data manipulation in %%local mode”
○ Which can be confusing to readers…and writers…
● You might also want to keep an eye on Toree for more ‘magics’ https://toree.apache.org/
https://github.com/jupyter-incubator/sparkmagic
Notebooks: Jupyter Mini-Lab - 9.4
Lab (or Demo) 9.3.2
Follow along if you like/are able. What we’ll do
1. Push a chunk of our data from earlier to S3 (or use if its still there)
2. Launch a cluster using “steps” and the AWS CLI that will use JupyterHub
3. Start up an SSH tunnel
4. Do some work in the notebook
5. Terminate the cluster
“JupyterHub on Amazon EMR has a default user with administrator permissions. The user name is jovyan and the password is jupyter. We strongly recommend that you replace the user with another user who has administrative permissions. You can do this using a step when you create the cluster, or by connecting to the master node when the cluster is running.”
Externalizing Zeppelin
It is also possible, probably preferable, to completely externalize Zeppelin to a secondary EC2 instance near your cluster(s).
● Isolation
● Insulation
● Multi-cluster access
https://aws.amazon.com/blogs/big-data/running-an-external-zeppelin-instance-using-s3-backed-notebooks-with-spark-on-amazon-emr/
Notebooks on EMR, Review
● Zeppelin vs Jupyter
● Both are good. Both have merits. Both have some issues on EMR○ Just know they must run “up there” and you’ll be ok
● For highest stability, externalize your setup from your cluster
● Evaluate your security needs and consider if those requirements drive you toward either solution in particular
● Outside of that, the normal arguments for/against each do not differ for being on EMR
https://medium.com/@muppal/aws-emr-jupyter-spark-2-x-7da54dc4bfc8
Security Notes
Zeppelin + EMR supports:
● Shiro: Active Directory
● Shiro: LDAP
● Shiro: Role-based
● NGINX: Basic Auth
● Kerberos-enabled cluster
● SSL
● Restriction to
○ UI
○ Data
○ Notes
JupyterHub + EMR supports:
● LDAP/AD
● Livy + impersonation (user name passed along)
● PAM (Pluggable Authentication Module) (users on master; mucks with off-cluster setups)
○ https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-pam-users.html
EMR: Elastic Map Reduce on AWS
9.4 EMR Cluster Troubleshooting
Troubleshooting & Debugging
Step 1: Check the logs
Troubleshooting & Debugging
Step 2: Find the logs
Why So Difficult?!
Distributed Systems
● Are inherently challenging to troubleshoot
● Require a distributed mindset
● Require a deeper understanding of the components
● Take time to adjust your internal Sherlock Holmes
● Have no golden hammers or silver bullets
● May contain additional tooling for troubleshooting
Regular Spark
Spark UI already comes with some great tools
● Master UI
○ Workers
○ Cluster Details
● Spark Worker UI
○ Jobs
○ Stages
■ DAG Visualization
■ Event Timeline
○ Storage
○ Environment
○ Executors
■ Memory/Cores/IO
○ SQL
What more could you need?
Maximize Information, Minimize Surface Area
All That There Is
EMR + Spark
The AWS EMR solution has taken extra steps to help limit the “surface area” needed to find most problems.
● Enable Debugging
○ With debugging enabled, you get a lot of information retained (contextually), and accessible via the AWS Console without having to sort through S3 log folders and peruse the gz files
● Without it, the data IS still there, you just have to dig a little harder
Quick Demo: Nav to an S3 log file and open.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-troubleshoot.html
EMR Best Practices
● Do a ‘dry run’ of completed code against limited datasets.
○ Great case for notebooks
○ Great case for a small one-off cluster you can leave running a bit for diagnostics
○ Enable Debugging
○ Configure Logging
○ Logging IS expensive; don’t leave it on for production runs, only test cases and diagnostics
● Consult the “common errors” doc in AWS (link below) - good chance you’ll see your problem & solution listed there.
● Master Node log browsing
○ If you’ve a good idea of the problem and don’t have debug enabled, most of the logs you need will live on the Master node
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-debugging.html
“The debugging tool allows you to more easily browse log files from the EMR console.”
As part of your cluster spin-up from CLI, you could add a json config file to your --configurations such as
[ { "Classification": "yarn-site", "Properties": { "yarn.log-aggregation-enable": "true", "yarn.log-aggregation.retain-seconds": "-1", "yarn.nodemanager.remote-app-log-dir": "s3:\/\/mybucket\/logs" } }]
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-troubleshoot-errors.html
SSH in to the Master node, and peruse
Startup: /mnt/var/log/bootstrap-actions
Performance: /mnt/var/log/instance-state
An App (e.g. Spark, Zeppelin): /mnt/var/log/application
A particular step: /mnt/var/log/hadoop/steps/N
Change what gets logged by using the cluster --configurations classification: spark-log4j
https://github.com/apache/spark/blob/master/conf/log4j.properties.template
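A spark-log4j classification just carries log4j.properties keys. As a sketch (the property names follow Spark’s log4j.properties template linked above; the WARN levels are an example, not a recommendation):

```json
[
  {
    "Classification": "spark-log4j",
    "Properties": {
      "log4j.rootCategory": "WARN, console",
      "log4j.logger.org.apache.spark": "WARN"
    }
  }
]
```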
Debug Logs In Context
Debug Logs: Lab 9.4
Lab 9.4
What we’ll do
● Navigate the Debug logs of previously launched clusters
Slow Cluster? https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-troubleshoot-slow.html
Troubleshooting
Understand your problem class
● Sizing/Scaling? OOM, hung nodes, stragglers
○ Review cluster configuration
○ Review Zeppelin interpreter config (if appropriate)
○ Do some math with your data sizes
● Can’t start the cluster?
○ Bootstrapping failures
○ One-off jobs failing
○ Security
● Processing
○ Data formats - noisy data throwing exceptions?
○ Data access - security preventing file access?
Real-Failure Suppression
Sometimes in Spark, the real problem is hidden by downstream tasks
● Log parsing failures
● Run explain plan
● Construct an alternative pipeline by
○ Checkpointing
○ Landing data & making your job multi-step
● Sometimes the best solution is to not try to make your Spark job a one-shot, and get back to basic ETL methodology
○ Extract (O) - one job - load (O) /convert/filter/land (O’)
○ Transform (O’) - one job - load (O’) /join/filter/land (O’’)
○ Load (O’’) - one job - load (O’’) /load/aggregate/land
For really big jobs, you might be better off multi-stage pipelining. This can be noisy from a duplication perspective, but you can also clean up O’ and O’’ after the fact, if it makes sense. Consider also any lineage impacts.
Real-Failure Suppression, Step By Step Testing
1. Launch a cluster with debugging & termination protection
2. Configure a Notebook with enough memory and cores for the interpreter that you are confident it is sufficient for your data population
3. Look for opportunities in each step for errors.
a. Use type-testing (number? string?)
b. Use try/catch/log semantics
c. When you are 100% sure the problem isn’t in this step, move on
4. Check the logs after each step via the Console
5. Correct any errors. If your automatic job still fails after this effort, you should have sufficiently ruled out programming problems, and should look for scale issues, security issues, latency problems, and other “infra” surfaces
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-troubleshoot-failed-5-test-steps.html
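The type-testing and try/catch/log semantics from step 3 can be sketched in plain Python (the record layout and `parse_record` helper here are hypothetical, not from the labs); the same fail-soft pattern works inside a PySpark map function or UDF:

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("etl-step")

def parse_record(raw):
    """Type-test each field and fail soft: return (record, None) on
    success or (None, error_message) so bad rows can be counted and
    logged instead of killing the whole step."""
    try:
        zip_code, revenue = raw.split(",")
        if not zip_code.strip().isdigit():     # type-test: is it a number?
            raise ValueError("zip_code is not numeric: %r" % zip_code)
        return ({"zip": zip_code.strip(), "revenue": float(revenue)}, None)
    except (ValueError, TypeError) as exc:
        log.warning("bad record %r: %s", raw, exc)
        return (None, str(exc))

good, bad = [], []
for row in ["10001,19.99", "NYC,5.00", "10002,oops"]:
    rec, err = parse_record(row)
    (good if rec else bad).append(rec or err)
# one good record survives; the two malformed rows are quarantined with
# their error messages instead of failing the job
```

Keeping the bad rows (with reasons) makes step 4's log review concrete: you can count them and decide whether the step is clean enough to move on.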
The Usual Suspects 1 (how are you coding)
Stop errors before they blow up at cost by -
● Testing your code, ideally locally or on a 1-server “cluster”, with a limited-but-representative dataset
● Unit Testing your code with small data that contains expected variations and corner cases in your raw data
● Try/Catch works in Spark too...error handling
● Defect-based unit testing - when something blows up and it takes you 2 hours or 2 days to find the culprit, and you code a workaround - code a test for that workaround
● Configurable debug logging in your code
Your problem is really only happening post-dev with your shiny unit tested code. I know we’ve talked about this before, but it’s worth a second look.
● Review your DAG in the Spark UI
○ Look for stragglers
○ Re-evaluate your partitioning schemes
● Do a little math and double check your data size assumptions
● Look for opportunities to:
○ Coalesce or Repartition
○ DISTRIBUTE BY, SORT BY, CLUSTER BY
● If that all fails, try pulling in a good sample of your data, coalesce it to a reasonable number, and pay attention to the Spark UI
The Usual Suspects 2 (how are you running)
https://deepsense.ai/optimize-spark-with-distribute-by-and-cluster-by/
Quick Lab: Do the code blocks in here and examine the Spark UI.
https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4
The Unusual Suspects
AWS has its own conditions which may cause cluster instability
● Service Outages http://status.aws.amazon.com/
● Usage Limits
○ EC2 default quota of 20 EC2 servers
■ EC2 QUOTA EXCEEDED error
○ S3 bucket count limit (100 per acct)
■ Consider nesting by env and project
● Networking issues communicating between resources, in-or-across VPCs
○ Subnets run short of addressable IPs for large clusters
● Last state change before termination may hold clues
If data throughput is a problem, and there is value to “moving the data closer to the processing”, s3DistCp can be used to pull from S3 and push into Core Node hdfs. https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/#5
Debugging Scala/Spark Applications
● Most bugs can be replicated from a reasonable sample-set of the data
● Running locally in an IDE can help you to perform interactive real-time troubleshooting of the running application
○ Any Scala IDE can do this - I like IntelliJ
● With debug opts enabled you can also runtime-debug a running cluster - locally or on EMR
○ You’ll obviously not do this with Prod
○ It’s best to do with a small cluster and data-switched breakpoints
● Arguably the best way to solve any issues with data validation, data munging, joins, failing actions.
● NOT the best way to troubleshoot performance issues!
https://community.hortonworks.com/articles/15030/spark-remote-debugging.html
Debugging PySpark Applications
● Some messages are from the JVM & some from Python, which can lead to some confusion
● In Jupyter, the useful error messages are most likely in the console log
● In YARN, YARN logs. Point being, the stuff in your notebook may feel pointless
● .take(1), .count(), and loop/print are your allies
● Lambdas can make troubleshooting even more difficult
○ You can write tests in Python too! https://github.com/holdenk/spark-testing-base
● As with Scala, being able to run your code in a debuggable environment (e.g. Pycharm) can dramatically increase productivity, though it can feel alien (or, “low level”) to the data analyst accustomed to notebooks
https://www.slideshare.net/SparkSummit/debugging-pyspark-spark-summit-east-talk-by-holden-karau
Holden Karau is also a co-author of High Performance Spark - a Must Read.
https://github.com/holdenk/spark-testing-base
https://issues.apache.org/jira/browse/MAPREDUCE-3883
EMR: Elastic Map Reduce on AWS
9.5 The Operations Perspective
Observability & Monitoring
We have been talking about “how the application experiences the infrastructure”
Now we will be talking about “how the infrastructure experiences the applications”
Observability & Monitoring
What is the delineation of responsibility in your organization?
Observability & Monitoring
Who should configure all this? Who owns it?
Observability & Monitoring
On a case-by-case basis we may know a particular Spark job we are responsible for didn’t work out, and that’s what we’ve been talking about. At a system level, how do we know when things are going right or wrong?
Let’s take a look at some options
● Cloudwatch - AWS log/log-monitoring solution
● Ganglia - cluster visualization tool
● Third Party Options
○ Influx TICK Stack
○ Prometheus
○ Home-rolled with any number of time-series DBs
○ (Paid solutions) Loggly, Datadog...etc
https://www.loggly.com/blog/monitoring-amazon-emr-logs-in-loggly/
https://docs.datadoghq.com/integrations/amazon_emr/
Observability & Monitoring: Cloudwatch
● More #’s than you can probably stomach!
○ Grows linearly by # metrics * # jobs
● Find the ones that matter for your purposes
● This can be time consuming and requires developing some subject matter expertise
● One limitation is any aggregation of system metrics across clusters must be performed manually
Observability & Monitoring: Cloudwatch
This makes ephemeral clusters a bit harder to deal with.
Observability & Monitoring: Cloudwatch
Tough to group across business-purposes when organized this way
Observability & Monitoring: Cloudwatch
Events vs Metrics
● “Actions vs Data”
● Goals with Events
○ Know “something happened” or “something changed”
■ E.g. Cluster status went isIdle = true
■ Great for creating actions you want to know about
● SNS yourself a message for:
○ Scale-out events
○ Idle or zombie server decommission
● You can generate custom Events from your app as well!
● Metrics: Know the usage of the system (at a very fine grained level)
EMR Cluster Events can be seen here: https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html#emr_event_type
Observability & Monitoring: Cloudwatch Rule
EMR Cluster Events can be seen here: https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html#emr_event_type
Observability & Monitoring: Cloudwatch Explore
9.5.1 Mini lab (group or demo-style)
Lets review some Cloudwatch metrics from the clusters we’ve run.
● MemoryAvailableMB
● IsIdle
● CoreNodesRunning
● S3BytesRead
● MRUnhealthyNodes
● ...etc...
https://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_ViewingMetrics.html
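Pulling one of these metrics programmatically is mostly about building the right request: EMR publishes under the AWS/ElasticMapReduce namespace, keyed by the JobFlowId dimension. A sketch of the kwargs you would pass to boto3’s `get_metric_statistics` (the cluster id and the one-hour/five-minute window are placeholders):

```python
from datetime import datetime, timedelta

def emr_metric_request(cluster_id, metric, hours=1, period=300):
    """Build keyword arguments for boto3's
    cloudwatch.get_metric_statistics() for one EMR cluster metric."""
    now = datetime.utcnow()
    return {
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": metric,          # e.g. "IsIdle", "MemoryAvailableMB"
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
        "Period": period,              # seconds per datapoint
        "Statistics": ["Average"],
    }

# Usage (assumes boto3 and AWS credentials; "j-XXXXXXXXXXXXX" is a placeholder):
#   import boto3
#   cw = boto3.client("cloudwatch", region_name="us-east-1")
#   resp = cw.get_metric_statistics(**emr_metric_request("j-XXXXXXXXXXXXX", "IsIdle"))
```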
Observability & Monitoring: Cloudwatch: Custom Dashboard
Observability & Monitoring: Ganglia
Another “box-solution” provided by AWS for EMR is Ganglia.
http://ganglia.info/
“Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. “
Lab 9.5: Lets launch one cluster with Ganglia, as a group, and poke around a bit, separately.
Observability & Monitoring: Ganglia
Observability & Monitoring: Ganglia
Observability & Monitoring: Ganglia
Observability & Monitoring: Ganglia
Observability & Monitoring: Ganglia
Observability & Monitoring: Ganglia
Observability & Monitoring: Ganglia
Observability & Monitoring: 3rd Party
Since EMR is so flexible in custom bootstrapping, any agent you like can be added at cluster-provisioning time to broadcast to any accessible target.
Whether you decide to use TICK stack, Prometheus, or another system that you hand-roll, you can accomplish the goal of creating graphs and alerts about your infrastructure.
You can find an example of rolling a custom solution broadcasting to Graphite: https://dzone.com/articles/how-to-properly-collect-aws-emr-metrics
Graphite can be directly integrated to Ganglia as well.
Observability & Monitoring: Tick Stack
Tick stack is an “open source core” for a time-series-data platform built to handle metrics and events.
This open source core consists of the projects —
● Telegraf - A collection and reporting agent
● InfluxDB - A high performance time-series database written in Go with 0 dependencies
● Chronograf - Dashboarding
● Kapacitor - Realtime batch-and-streaming data processing engine for munching data from InfluxDB
-- collectively called the TICK Stack.
https://www.influxdata.com/time-series-platform/
https://gist.github.com/travisjeffery/43f424fbd7ac677adbba304cef6eb58f
EMR: Elastic Map Reduce on AWS
9.6 EMR Cluster Optimization
Scaling: Plenty of Knobs to Turn
● Sensible defaults can get you a long way
○ Dynamic Allocation is “on by default”
■ This requires the Shuffle service
● Spark Shuffle Service is automatically configured by Amazon EMR
■ maxExecutors = infinity
● Autoscaling can be set up for Instance Groups
● Spark parameters can be adjusted with --configurations
● Many defaults are set based on instance types selected
https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
Scaling: Defaults
By default, Spark in EMR picks up its basic settings from instance type selected
spark.dynamicAllocation.enabled true
spark.executor.memory Setting is configured based on the core and task instance types in the cluster.
spark.executor.cores Setting is configured based on the core and task instance types in the cluster.
You can adjust these defaults and others on cluster creation with the --configurations classification: spark-defaults
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html
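Overriding those defaults at cluster creation can look like this (a sketch; the memory and core values are placeholders, not recommendations):

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executor.memory": "4g",
      "spark.executor.cores": "2",
      "spark.dynamicAllocation.enabled": "true"
    }
  }
]
```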
Scaling: maximizeResourceAllocation: true
On cluster launch, settings you can supply via --configurations within Classification:Spark that affect how performance is managed. If maximizeResourceAllocation:true, then
spark.default.parallelism 2X number of CPU cores available to YARN containers.
spark.driver.memory Setting is configured based on the instance types in the cluster. This is set based on the smaller of the instance types in the two instance groups (master/core)
spark.executor.memory Setting is configured based on the core and task instance types in the cluster.
spark.executor.cores Setting is configured based on the core and task instance types in the cluster.
spark.executor.instances Setting is configured based on the core and task instance types in the cluster. Set unless spark.dynamicAllocation.enabled explicitly set to true at the same time.
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html
Dynamic Allocation (on by default)
maximizeResourceAllocation
Configuring via --configurations
[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
A footnote about this setting: https://stackoverflow.com/questions/34003759/spark-emr-using-amazons-maximizeresourceallocation-setting-does-not-use-all
You probably want to explicitly set spark.default.parallelism to 4x the number of instance cores you want the job to run on, on a per-"step" (in EMR speak) / "application" (in YARN speak) basis, i.e. set it every time.
Scaling: maximizeResourceAllocation: true
● Limits your cluster to one-job-at-a-time
● Best for single-use or single-purpose clusters
Scaling: dynamicAllocation Notes
Dynamic Allocation may be “on by default” in EMR, but it still has some of its own knobs to turn.
spark.dynamicAllocation.executorIdleTimeout Default: 60 - seconds of idle time after which an executor is “removable”
spark.dynamicAllocation.cachedExecutorIdleTimeout Default: Infinity - the lifespan of an executor which has cached data blocks.
spark.dynamicAllocation.initialExecutors Default: spark.dynamicAllocation.minExecutors - Default number of executors for DA, only if < --num-executors.
spark.dynamicAllocation.maxExecutors Default: Infinity - upper bound of num executors.
spark.dynamicAllocation.minExecutors Default: 0 - Number of executors ‘by default’.
spark.dynamicAllocation.executorAllocationRatio Default: 1 - Ratio of executors to tasks (1:1 is maximum parallelism)
https://aws.amazon.com/blogs/big-data/submitting-user-applications-with-spark-submit/
https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
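Turning those knobs at cluster creation goes through the same spark-defaults classification. A sketch (the min/max executor counts and timeout are illustrative only):

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.dynamicAllocation.minExecutors": "2",
      "spark.dynamicAllocation.maxExecutors": "50",
      "spark.dynamicAllocation.executorIdleTimeout": "120s"
    }
  }
]
```

Capping maxExecutors is worth considering on shared clusters, since the EMR default of infinity lets one job grab everything dynamic allocation can hand out.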
Scaling: Some Numbers
● YARN: when tuning, be sure to leave at least 1-2GB RAM and 1 vCPU for each instance's O/S and other applications to run too.
○ The default amount of RAM seems to cover this, but this will leave us with (N-1) vCPUs per instance off the top
● Executors: Parallelism is the goal; keep cluster size in mind when setting parameters. If you are not using maximizeResourceAllocation or dynamic allocation, you might, e.g., have 3 machines with 4 CPUs each. Leaving 1 per instance for the system, you have 3 each, so --num-executors = 9 (3 executors per node) would be reasonable.
● Executor-cores: How many parallel tasks can an executor take on?
○ Think about time spent on IO and determine the ratio
○ Each executor-core will have various operations in which it is waiting on other things (reads/writes), so increasing executors and reducing cores-per-executor can result in better performance
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
To hopefully make all of this a little more concrete, here’s a worked example of configuring a Spark app to use as much of the cluster as possible: Imagine a cluster with six nodes running NodeManagers, each equipped with 16 cores and 64GB of memory. The NodeManager capacities, yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, should probably be set to 63 * 1024 = 64512 (megabytes) and 15 respectively. We avoid allocating 100% of the resources to YARN containers because the node needs some resources to run the OS and Hadoop daemons. In this case, we leave a gigabyte and a core for these system processes. Cloudera Manager helps by accounting for these and configuring these YARN properties automatically.
The likely first impulse would be to use --num-executors 6 --executor-cores 15 --executor-memory 63G. However, this is the wrong approach because:
● 63GB + the executor memory overhead won’t fit within the
63GB capacity of the NodeManagers.
● The application master will take up a core on one of the nodes,
meaning that there won’t be room for a 15-core executor on
that node.
● 15 cores per executor can lead to bad HDFS I/O throughput.
A better option would be to use --num-executors 17 --executor-cores 5 --executor-memory 19G. Why?
● This config results in three executors on all nodes except for
the one with the AM, which will have two executors.
● --executor-memory was derived as (63/3 executors per
node) = 21. 21 * 0.07 = 1.47. 21 – 1.47 ~ 19.
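The arithmetic in that worked example generalizes. A small sketch (the 7% memory-overhead factor, the one-core/one-GB OS reservation, and the 5-core cap all follow the quoted walkthrough; treat the output as a starting point, not a definitive setting):

```python
def size_executors(nodes, vcores_per_node, mem_gb_per_node,
                   cores_per_executor=5, overhead_frac=0.07):
    """Derive --num-executors / --executor-cores / --executor-memory
    from cluster hardware, following the sizing walkthrough above:
    reserve 1 core + 1 GB per node for the OS and daemons, cap
    executors at ~5 cores for HDFS I/O throughput, subtract one
    executor for the YARN ApplicationMaster, and deduct the memory
    overhead from each executor's share."""
    usable_cores = vcores_per_node - 1
    usable_mem = mem_gb_per_node - 1
    execs_per_node = usable_cores // cores_per_executor
    num_executors = nodes * execs_per_node - 1           # minus the AM
    mem_per_exec = usable_mem / execs_per_node           # e.g. 63 / 3 = 21
    executor_mem = int(mem_per_exec * (1 - overhead_frac))  # leave overhead room
    return num_executors, cores_per_executor, executor_mem

# Six 16-core/64GB nodes reproduce the example's
# --num-executors 17 --executor-cores 5 --executor-memory 19G
```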
Scaling: Drive Space
● By default you generally get a 10GB EBS volume
○ Add volumes to increase storage when drive space is a problem
○ Add volumes to offset memory vs CPU vs storage inequities (in other words, you keep running out of drive space, but the RAM/CPU are fine)
○ These are also ephemeral!
○ “EBS-Optimized” = network traffic is dedicated, not shared
● You can add additional volumes, but you must consciously adjust configuration to be aware of and take advantage of them.
○ Check the directories where the logs are stored and change parameters as needed
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-storage.html
https://aws.amazon.com/blogs/aws/amazon-emr-update-support-for-ebs-volumes-and-m4-c4-instance-types/
https://aws.amazon.com/premiumsupport/knowledge-center/core-node-emr-cluster-disk-space/
Scaling: Drive Space
● Cross-cluster or intra-cluster storage can be performed using EFS rather than S3, if preferred. This is not recommended practice for log files.
● Change the guarantees of performance by selecting custom EBS
○ Provisioned IOPS SSD - high performance (ops/sec)
○ Throughput Optimized - high throughput (MiB/sec)
○ AWS measures IOPS in 256K or smaller blocks
https://cloud.netapp.com/blog/ebs-volumes-5-lesser-known-functions
Some clarification on the details of these options can be found here: https://www.fittedcloud.com/blog/aws-ebs-performance-confused/
https://aws.amazon.com/premiumsupport/knowledge-center/core-node-emr-cluster-disk-space/
Changing the log retention period:
1. Connect to the master node using SSH.
2. Open the /etc/hadoop/conf/yarn-site.xml file on each node in your Amazon EMR cluster (master, core, and task nodes).
3. Reduce the value of the yarn.log-aggregation.retain-seconds property on all nodes.
4. Restart the ResourceManager daemon. For more information, see Viewing and Restarting Amazon EMR and Application Processes (Daemons).
https://docs.aws.amazon.com/efs/latest/ug/mounting-fs-mount-cmd-dns-name.html
https://github.com/awslabs/deeplearning-emr/blob/master/training-on-emr/emr_efs_bootstrap.sh
● EFS is automatically mounted on all worker instances during startup.
● EFS allows sharing of code, data, and results across worker instances.
● Using EFS doesn't degrade the performance for densely packed files (for example, .rec files containing image data used in MXNet).
Scaling: Drive Space
If you decide to create an EBS volume to tolerate more local-logs, you will want to bootstrap your environment to accommodate this. Set the log path in yarn-site, and possibly do a custom mount operation to guarantee the device.
[hadoop@ip-172-31-50-48 /]$ mount
proc on /proc type proc (rw,relatime)
sysfs on /sys type sysfs (rw,relatime)
devtmpfs on /dev type devtmpfs (rw,relatime,size=7686504k,nr_inodes=1921626,mode=755)
devpts on /dev/pts type devpts (rw,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /dev/shm type tmpfs (rw,relatime)
/dev/xvda1 on / type ext4 (rw,noatime,data=ordered)
devpts on /dev/pts type devpts (rw,relatime,gid=5,mode=620,ptmxmode=000)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)
/dev/xvdb1 on /emr type xfs (rw,relatime,attr2,inode64,noquota)
/dev/xvdb2 on /mnt type xfs (rw,relatime,attr2,inode64,noquota)
/dev/xvdc on /mnt1 type xfs (rw,relatime,attr2,inode64,noquota)
/dev/xvdd on /mnt2 type xfs (rw,relatime,attr2,inode64,noquota)
/dev/xvde on /mnt3 type xfs (rw,relatime,attr2,inode64,noquota)
https://cloud.netapp.com/blog/ebs-volumes-5-lesser-known-functions
Some clarification on the details of these options can be found here: https://www.fittedcloud.com/blog/aws-ebs-performance-confused/
Scaling: Autoscaling
● Autoscaling requires Instance Groups and is not supported with Instance Fleets
● EMR scaling is more complex than EC2 autoscaling
○ Core Node vs Task Node
○ Core Node decommission times are longer due to HDFS
● Scale Out != Scale In
○ Scale-out policies can be more flexible
○ Scale-in must be more prudent
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html
Dynamic Allocation (on by default)
maximizeResourceAllocation
https://aws.amazon.com/blogs/big-data/best-practices-for-resizing-and-automatic-scaling-in-amazon-emr/
https://aws.amazon.com/blogs/big-data/dynamically-scale-applications-on-amazon-emr-with-auto-scaling/
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-automatic-scaling.html
https://docs.aws.amazon.com/autoscaling/ec2/userguide/AutoScalingGroupLifecycle.html
https://cloud.netapp.com/blog/optimizing-aws-emr-best-practices
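The maximizeResourceAllocation setting mentioned in the notes above is applied through EMR's configuration classifications at cluster-creation time. A minimal sketch of that payload, expressed in Python for illustration (the classification and property names come from the linked EMR docs; everything else is illustrative):

```python
# Sketch of an EMR configuration-classification payload enabling
# maximizeResourceAllocation. Pass the resulting JSON under --configurations
# (or the Configurations parameter) when creating the cluster.
import json

configurations = [
    {
        "Classification": "spark",
        "Properties": {
            # Lets EMR derive executor memory/cores from the instance type.
            "maximizeResourceAllocation": "true"
        },
    }
]

print(json.dumps(configurations, indent=2))
```

Note that maximizeResourceAllocation and dynamic allocation interact; check the linked EMR Spark configuration page before combining them.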
Scaling: Scale In: Switches
● "Amazon EMR implements a blacklisting mechanism in Spark that is built on top of YARN's decommissioning mechanism. This mechanism helps ensure that no new tasks are scheduled on a node that is decommissioning, while at the same time allowing tasks that are already running to complete."
spark.blacklist.decommissioning.enabled
Default: true - Spark does not schedule new tasks on executors running on that node. Tasks already running are allowed to complete.

spark.blacklist.decommissioning.timeout
Default: 1 hour - After the decommissioning timeout expires, the node transitions to a decommissioned state and EMR can terminate the node's EC2 instance. Any tasks still running after the timeout expires are lost or killed and rescheduled on executors running on other nodes.

spark.decommissioning.timeout.threshold
Default: 20 seconds - This allows Spark to handle Spot instance terminations better, because Spot instances decommission within a 20-second timeout regardless of the value of yarn.resourcemanager.decommissioning.timeout, which may not provide other nodes enough time to read shuffle files.

spark.stage.attempt.ignoreOnDecommissionFetchFailure
Default: true - When set to true, helps prevent Spark from failing stages and eventually failing the job because of too many failed fetches from decommissioned nodes. Failed fetches of shuffle blocks from a node in the decommissioned state will not count toward the maximum number of consecutive fetch failures.
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html#emr-spark-maximizeresourceallocation
spark.resourceManager.cleanupExpiredHost
Default: true - When set to true, Spark unregisters all cached data and shuffle blocks that are stored in executors on nodes that are in the decommissioned state. This speeds up the recovery process.
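These switches rarely need changing, but when they do they are set through the spark-defaults configuration classification. A sketch that simply makes the documented defaults explicit (the property names are from the EMR docs; the JSON wrapper is illustrative):

```python
# Sketch: decommissioning-related Spark switches expressed as an EMR
# "spark-defaults" configuration classification. The values shown are the
# documented defaults, written out explicitly for illustration.
import json

decommission_conf = {
    "Classification": "spark-defaults",
    "Properties": {
        "spark.blacklist.decommissioning.enabled": "true",
        "spark.blacklist.decommissioning.timeout": "1h",
        "spark.decommissioning.timeout.threshold": "20",
        "spark.stage.attempt.ignoreOnDecommissionFetchFailure": "true",
        "spark.resourceManager.cleanupExpiredHost": "true",
    },
}

print(json.dumps([decommission_conf], indent=2))
```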
Sizing Suggestions
● Memory at 3X your data size expectations
● Enough cores to reasonably parallelize your data, assuming you’ve also worked through the partitioning scenarios
● Filter, filter, filter. Narrow, narrow, narrow.
● Ephemeral clusters have fewer variables
● Shared clusters have MANY more details to consider
● This is as much art as science, and is invariably “case-by-case”
A note on underutilization: https://stackoverflow.com/questions/38331502/spark-on-yarn-resource-manager-relation-between-yarn-containers-and-spark-execu
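The "3X your data size" and parallelization rules of thumb above can be wired into a quick back-of-the-envelope estimator. estimate_cluster is a hypothetical helper, not an official formula; the 64 GB node size and 128 MB partition target are illustrative assumptions you should replace with your own:

```python
# Back-of-the-envelope cluster sizing from the rules of thumb above.
# The 3x memory multiplier and the 128 MB partition target are the slide's
# heuristics, not hard rules; node_mem_gb is an assumed instance size.

def estimate_cluster(data_gb, mem_multiplier=3.0, node_mem_gb=64,
                     target_partition_mb=128):
    """Rough node count and partition count for a given input size."""
    total_mem_gb = data_gb * mem_multiplier          # "memory at 3X data size"
    nodes = max(1, -(-total_mem_gb // node_mem_gb))  # ceiling division
    partitions = max(1, int(data_gb * 1024 // target_partition_mb))
    return {"total_mem_gb": total_mem_gb, "nodes": int(nodes),
            "partitions": partitions}

# e.g. 500 GB of input on 64 GB nodes:
# estimate_cluster(500)
```

Treat the output as a starting point for the "case-by-case" tuning described above, not a final answer.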
Sizing/Setup Suggestions
● Use HDFS for intermediate data storage while the cluster is running and Amazon S3 only to input the initial data and output the final results.
● If your clusters will commit 200 or more transactions per second to Amazon S3, contact support to prepare your bucket for greater transactions per second and consider using the key partition strategies described in the links below
● Set the Hadoop configuration setting io.file.buffer.size to 65536. This causes Hadoop to spend less time seeking through Amazon S3 objects.
● If listing buckets with large numbers of files, pre-cache the results of an Amazon S3 list operation locally on the cluster.
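The pre-caching suggestion above can be sketched as follows. list_keys assumes boto3 and AWS credentials are available on the cluster, and the bucket/path names are illustrative; the cache read/write logic is plain local file I/O:

```python
# Sketch: cache the result of an S3 list operation locally so repeated jobs
# don't re-list a large bucket. list_keys() requires boto3 and credentials;
# the caching helpers themselves are plain file I/O.
import json
import os

def cache_keys(keys, cache_path):
    """Write an iterable of S3 keys to a local JSON cache file."""
    with open(cache_path, "w") as f:
        json.dump(list(keys), f)

def load_cached_keys(cache_path):
    """Return the cached key list, or None if no cache exists yet."""
    if not os.path.exists(cache_path):
        return None
    with open(cache_path) as f:
        return json.load(f)

def list_keys(bucket, prefix=""):
    """List keys from S3 (illustrative; needs boto3 + credentials)."""
    import boto3
    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]

# Usage sketch (bucket and path are hypothetical):
# keys = load_cached_keys("/mnt/s3-list-cache.json")
# if keys is None:
#     keys = list(list_keys("my-bucket", "data/"))
#     cache_keys(keys, "/mnt/s3-list-cache.json")
```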
You can find some links to S3 specific performance optimization best practices here: https://docs.aws.amazon.com/AmazonS3/latest/dev/PerformanceOptimization.html
This document covers some of the "why" of performance concerns, offers suggestions on how to break up your buckets, and includes some hashing mechanisms that help balance performance and listability.
https://aws.amazon.com/blogs/aws/amazon-s3-performance-tips-tricks-seattle-hiring-event/
Finally this document has 4 great tips from a heavy user of Spark on EMR.
https://medium.com/teads-engineering/spark-performance-tuning-from-the-trenches-7cbde521cf60
Cost Optimization Recommendations
● Ephemeral clusters which auto-terminate for spark-submit & sizeable jobs
○ Can integrate with Jenkins for a seamless commit/execute CICD
● Dynamic allocation with minimal primary-cluster for active analysis
○ Good pool of Task Nodes available
● Off-cluster notebook connectivity and management (jupyter-hub, zeppelin, livy)
○ Cluster Core Node pool should remain relatively fixed to reduce decommission time
○ Primary scale point should be Task Node
○ Task Nodes are the best case for Spot Instances (least risk, least cost)
● Off-cluster notebook storage
● Cloudwatch alerts with SNS listeners that proactively act or message
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html - node decommissioning
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances-guidelines.html
Fleets vs instance groups for cost optimization vs complexity. Spot instances.
Cost Optimization In Dev
● Watch how you work
○ Develop locally - grab a file and start the process
○ Troubleshoot locally - use an IDE or local notebook to evaluate your work and debug
● Go to the cluster when you have a job you are ready to productionize (doesn’t work on local machine anymore)
● Cluster job fails
○ Get back to local once you understand the failure mode
○ Emulate, correct, and re-deploy
● Too often we get in the mindset of “just this one tweak will fix it” and waste hours upon hours of cluster-runtime cycles.
Cost Source Notes
● EMR costs are on-top-of the underlying infrastructure (EC2) costs.
● S3 costs are around $700/mo for 10TB 'with reduced redundancy'. Contrast this with Redshift at $1,000/TB/mo at the lowest tier (3-year buy-in).
● You may be charged for use of “SimpleDB” when you enable debugging
● If you add large EBS volumes to your clusters this can add up. Important if you are writing to HDFS, using Hive, or expect a lot of spill-to-disk or disk-cache
● In-Region data transfer should not add cost. Transferring data across regions will. Bear this in mind when establishing any data-landing practices
● You can save a ton of money leveraging reserved & spot instances
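To make the S3-vs-Redshift gap above concrete, a quick worked calculation using the figures quoted on this slide (prices are illustrative, vary by region, and change over time):

```python
# Back-of-the-envelope monthly storage cost comparison using the slide's
# quoted figures: ~$700/mo for 10TB on S3 (reduced redundancy) versus
# ~$1000/TB/mo for Redshift at the lowest tier. Illustrative, not current.
DATA_TB = 10
S3_MONTHLY = 700                      # quoted flat figure for 10TB on S3
REDSHIFT_PER_TB_MONTHLY = 1000        # quoted lowest-tier Redshift rate

redshift_monthly = DATA_TB * REDSHIFT_PER_TB_MONTHLY
ratio = redshift_monthly / S3_MONTHLY

print(f"S3: ${S3_MONTHLY}/mo, Redshift: ${redshift_monthly}/mo "
      f"(~{ratio:.0f}x more)")
```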
Common Problems
● https://www.knowru.com/blog/first-3-frustrations-you-will-encounter-when-migrating-spark-applications-aws-emr/
● https://www.indix.com/blog/engineering/lessons-from-using-spark-to-process-large-amounts-of-data-part-i/
EMR: Elastic Map Reduce on AWS
9.7 EMR Security Notes
What’s at Risk?
● Being based on AWS Technologies means you have “all the basic” AWS tools to help you secure your system.
● You launch your cluster in a VPC
● You leverage completely customizable IAM roles to
○ Interact with other services
○ Allow cleanup
○ Autoscale
● You leverage completely customizable Security Groups
● The typical “risk” is identified as in-org risk
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-roles.html
What’s at Risk?
● In-Org Risk
○ Some teams cannot see other teams' data
○ Organization wants to split resources by budgets in different departments
○ Custom roles/groups can be useful for this
● Extra-Org Risk
○ The most common failing is when data gets left lying around
○ S3 public buckets ("Hey, I couldn't get the S3 policy set right so I just made it public")
○ Content emitted to email through notifications & SNS that contains sensitive information
Risk Mitigation: Mechanisms for Security
● AWS Level
○ IAM Roles
○ Cloudtrail
○ S3 Policies
○ Firewall/VPN
● Spark/Hadoop level
○ LDAP/AD Integration
○ Authorized-user access (IAM + LDAP)
○ HDFS Permissions/ACLs
○ Kerberos
● Mechanisms
○ Lock down access (IAM Roles, VPCs, S3 Policies, LDAP)
○ Audit access (Cloudtrail)
○ Encrypt
https://docs.aws.amazon.com/emr/latest/ManagementGuide/logging_emr_api_calls.html
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-kerberos-configure.html
Kerberos is a secure authentication method developed at MIT that allows two services located on a non-secured network to authenticate themselves in a secure way. Kerberos, which is based on a ticketing system, serves as both an Authentication Server and a Ticket Granting Server (TGS).
Risk Mitigation: Encryption Options
There are many options for at-rest and in-transit encryption in the Spark/EMR/S3 ecosystem.
● What matters to your organization?
● What attack vectors concern you?
● Do you consider VPC secure?
● Do you need to protect yourself from internal threats?
● What are your regulated surfaces/relevant responsibilities?
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-data-encryption.html
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-data-encryption-options.html
Encryption Options
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-encryption-tdehdfs.html
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-data-encryption.html
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/TransparentEncryption.html
Encryption of Notebook storage on S3
https://zeppelin.apache.org/docs/0.8.0/setup/storage/storage.html#notebook-storage-in-s3
Enable server-side encryption
To request server-side encryption of notebooks, set the following environment variable in the file zeppelin-env.sh:
export ZEPPELIN_NOTEBOOK_S3_SSE=true
Or using the following setting in zeppelin-site.xml:
<property>
<name>zeppelin.notebook.s3.sse</name>
<value>true</value>
<description>Server-side encryption enabled for notebooks</description>
</property>
Time Remaining?
● Q&A
● Lab Struggles & Assistance
● Group Experiments?
● Active problems in the extant system?
● Thank you and good night!