© Cloudera, Inc. All rights reserved.

Spark Operations
Kostas Sakellis

Me
• Software Engineer at Cloudera
• Contributor to Apache Spark
• Before that, contributed to Cloudera Manager

Building a proof of concept!
Courtesy of: http://www.nefloridadesign.com/mbimages/6.jpg

Example
sc.textFile("hdfs://data/u.item", 4)
  .map(Movie(_))
  .filter(_.month.equals("Nov"))
  .collect()
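
The slide does not show the Movie type, so the following is only a minimal sketch of what it might look like. It uses plain Scala collections instead of a SparkContext so it runs without a cluster. The MovieExample object, the field names, the pipe-delimited line layout, and the sample lines are all assumptions made for illustration, not taken from the talk.

```scala
object MovieExample {
  // Hypothetical stand-in for the Movie type used on the slide.
  final case class Movie(id: Int, title: String, month: String)

  object Movie {
    // Assumes lines shaped like "id|title|dd-MMM-yyyy|...".
    def apply(line: String): Movie = {
      val fields = line.split('|')
      val dateParts = fields(2).split('-') // e.g. Array("22", "Nov", "1996")
      val month = if (dateParts.length == 3) dateParts(1) else ""
      Movie(fields(0).toInt, fields(1), month)
    }
  }

  // Invented sample input; not real data from u.item.
  val lines: Seq[String] = Seq(
    "1|Sample Film A (1995)|01-Jan-1995|",
    "2|Sample Film B (1996)|22-Nov-1996|",
    "3|Sample Film C (1997)|14-Nov-1997|"
  )

  // Same map/filter chain as the slide, applied to a local Seq.
  def novemberMovies(input: Seq[String]): Seq[Movie] =
    input.map(Movie(_)).filter(_.month == "Nov")
}
```

Calling MovieExample.novemberMovies(MovieExample.lines) keeps only the two records whose month field is "Nov". On a cluster the same chain would start from sc.textFile and end with collect(), as on the slide.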

Partitions
[Diagram: the same code as the Example slide; the HDFS file is read as four partitions, Partition 1 through Partition 4.]

RDDs
[Diagram, built up over four slides with the same code as the Example slide: the HDFS file with Partitions 1-4 feeds a first RDD, followed by a second RDD with Partitions 1-4 and a third RDD with Partitions 1-4; the final slide adds Collect at the end of the chain.]

RDD Lineage
[Diagram: the same chain as the RDDs slide, from HDFS through three RDDs (Partitions 1-4 each) to Collect; the chain is labeled "Lineage".]

Task
[Diagram: the same chain as the RDDs slide, from HDFS through three RDDs (Partitions 1-4 each) to Collect.]
• A pipelined set of transformations on a single thread
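
To make "pipelined" concrete, here is a small sketch using a plain Scala Iterator rather than Spark itself. The PipelineSketch object and its runTask function are hypothetical names. The point it illustrates is that each record passes through map and then filter before the next record is read, all on the calling thread, with no intermediate collection in between.

```scala
import scala.collection.mutable.ListBuffer

object PipelineSketch {
  // Runs a map-then-filter chain over one "partition" and records the
  // order in which the two functions are invoked.
  def runTask(partition: Iterator[String]): (List[String], List[String]) = {
    val log = ListBuffer.empty[String]
    val result = partition
      .map { s => log += s"map($s)"; s.toUpperCase }
      .filter { s => log += s"filter($s)"; s.startsWith("A") }
      .toList // stands in for the action that drains the iterator
    (result, log.toList)
  }
}
```

For the input Iterator("apple", "berry", "avocado") the log alternates map and filter record by record, which is the interleaving the slide refers to. A real Spark task works on a partition's iterator in a similar spirit, but this sketch does not reproduce Spark's scheduling or serialization.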

Spark Architecture

Spark System Architecture

Deployments
• Spark supports pluggable Cluster Managers
  • local, Standalone, YARN and Mesos
• In early 2014, CDH 4.x with Spark 0.9 only supported Standalone
• CDH 5.x includes Spark on YARN support

Standalone
[Diagram labels: Client, Master, two Workers, AppMaster, two Processes.]

Standalone
• On cluster
  ./sbin/start-master.sh
  ./sbin/start-slave.sh <master-spark-URL>
• Submit job
  spark-submit --master <master-spark-URL> …

YARN Architecture
[Diagram labels: Client, Resource Manager, two Node Managers, three Containers, AppMaster, two Processes.]

Spark on YARN Architecture
[Diagram, shown over two slides, with the same labels as the YARN Architecture slide: Client, Resource Manager, two Node Managers, three Containers, AppMaster, two Processes.]

Spark on YARN
• Submit job
  spark-submit --master yarn-client …
• Cluster mode
  spark-submit --master yarn-cluster …
• Spark shell only works in client mode!

Customers often have shared infrastructure
Courtesy of: https://radioglobalistic.files.wordpress.com/2011/02/lagos-traffic.jpg

Multi-tenancy
• Cluster utilization is the top metric
  • Target: 70-80% utilization
• Mixed workloads from mixed customers
• We recommend YARN
  • Built-in resource manager

Underutilized Clusters
Courtesy of: http://media.nbclosangeles.com/images/1200*675/60-freeway-repair-dec16-2-empty.JPG

Dynamic Allocation
• Spark applications scale the number of executors based on load
  • Removes need for: --num-executors
  • Idle executors get killed
• First supported in CDH 5.4
• Ideal for:
  • Long ETL jobs with large shuffles
  • Shell applications: hive and spark shell
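
As a rough illustration of what enabling this involves, the sketch below lists the Spark configuration properties usually associated with dynamic allocation and renders them as --conf arguments for spark-submit. The property names are Spark configuration keys; the values, the DynamicAllocationArgs object, and the toSubmitArgs helper are assumptions for illustration, not recommendations from the talk.

```scala
object DynamicAllocationArgs {
  // Keys are Spark configuration properties; values are illustrative only.
  val settings: Seq[(String, String)] = Seq(
    "spark.dynamicAllocation.enabled" -> "true",
    // External shuffle service, so shuffle files outlive removed executors.
    "spark.shuffle.service.enabled" -> "true",
    "spark.dynamicAllocation.minExecutors" -> "2",
    "spark.dynamicAllocation.maxExecutors" -> "50",
    // How long an executor may sit idle before it is released.
    "spark.dynamicAllocation.executorIdleTimeout" -> "60s"
  )

  // Turns each key/value pair into "--conf key=value" for spark-submit.
  def toSubmitArgs(conf: Seq[(String, String)]): Seq[String] =
    conf.flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
}
```

Printing DynamicAllocationArgs.toSubmitArgs(DynamicAllocationArgs.settings).mkString(" ") gives a string that could be pasted after spark-submit. Supported keys and value formats vary between Spark versions, so check the configuration page for the release in use.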

Dynamic Allocation Limitations
• Still required to specify cores
  • --executor-cores
• Memory
  • --executor-memory
  • Includes JVM overhead
  • Need to do the math yourself
• Our customers still get it wrong!
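
A worked version of "the math", as a hedged sketch. On YARN each executor container has to hold the --executor-memory heap plus an off-heap overhead allowance. The 10% factor and 384 MB floor below are assumptions that match the documented defaults of some Spark releases (older ones used a smaller factor), so verify them for your version. The ExecutorSizing object is a hypothetical helper, and YARN's rounding of container sizes to its allocation increment is not modeled.

```scala
object ExecutorSizing {
  // Assumed defaults; verify against your Spark version's documentation.
  val OverheadPercent: Int = 10
  val MinOverheadMb: Int = 384

  /** Overhead (MB) added on top of the executor heap. */
  def overheadMb(executorMemoryMb: Int): Int =
    math.max(MinOverheadMb, executorMemoryMb * OverheadPercent / 100)

  /** Total memory (MB) one executor container would need. */
  def containerMb(executorMemoryMb: Int): Int =
    executorMemoryMb + overheadMb(executorMemoryMb)

  /** How many such executors fit in the memory YARN may hand out on one node. */
  def executorsPerNode(nodeMemoryMb: Int, executorMemoryMb: Int): Int =
    nodeMemoryMb / containerMb(executorMemoryMb)
}
```

Under these assumptions an 8192 MB executor needs 8192 + 819 = 9011 MB, so a node offering 65536 MB to YARN fits 7 executors rather than the 8 that dividing 64 GB by 8 GB suggests. That gap is the kind of mistake the slide is pointing at.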

The Future of Dynamic Allocation
• Only “task size” needed: --task-size
• Eliminates
  • --executor-cores
  • --num-executors
  • --executor-memory
• Leads to better cluster utilization

Security, now it’s getting serious.
Courtesy of: https://www.iti.illinois.edu/sites/default/files/Cybersecurity_image.jpg

Authentication
• Kerberos – the necessary evil
  • Ubiquitous amongst other services
  • YARN, HDFS, Hive, HBase, etc.
• Spark utilizes delegation tokens

Encryption
• Control plane: SPARK-6028 (Replace with netty)
• File distribution: Replace with netty
• Block Manager: Spark 1.4
• User UI / REST API: SPARK-2750 (SSL)
• Data-at-rest (shuffle files): SPARK-5682

Authorization
• Enterprises have sensitive data
  • Beyond HDFS file permissions
  • Partial access to data
  • Column level granularity
• Apache Sentry
  • HDFS-Sentry synchronization plugin
• Record Service
  • Column level security for Spark!

Thank you
We’re Hiring!