Apache Spark in the Cloud
Zbyněk Roubalík
Senior Quality Engineer, Red Hat
February 15, 2018
Technologies
● Apache Spark
● Docker
● Kubernetes
● OpenShift
Apache Spark in the Cloud
aka
How to create and deploy Apache Spark
applications to cloud native environments like
OpenShift
What is cloud native?
● Containerized
● Dynamically orchestrated
● Microservice oriented
● www.cncf.io/about/faq
Containers
● A container image is a lightweight, stand-alone,
executable package of a piece of software that
includes everything needed to run it: code, runtime,
system tools, system libraries, settings.
● https://www.docker.com/what-container
VM vs Containers
Containers
● Cloud vs standard deployment model
● Pets vs Cattle
● Developers + Operations (Admins) → DevOps
● Docker
Kubernetes
● Container cluster manager
Kubernetes
● Keeps its cluster state in etcd, a distributed key-value store
● The smallest deployable unit is the Pod
OpenShift
● Open source container application platform
● Focused on applications (not just containers as a concept) and on developer experience
OpenShift
● Sits on top of Kubernetes
● Source code, build, and deployment management
● S2I - Source to Image
● Application lifecycle management (CI/CD)
● Service catalog (language runtimes, middleware, databases)
● Security
OpenShift architecture
Apache Spark
● Fast and general engine for large-scale data processing
● Distributed computation system
● Provides high-level APIs in Java, Scala, Python and R
● Supports a rich set of tools for Big Data, AI, ML
  ● Spark SQL for SQL and structured data processing
  ● MLlib for machine learning
  ● GraphX for graph processing
  ● Spark Streaming
  ● ...
General Spark architecture
How to interact with Spark
● Run an application
● Start a REPL
  ● Scala
  ● Python
  ● R
The fundamental Spark abstraction
Resilient distributed dataset (RDD)
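● RDDs are partitioned, lazy, and immutable homogeneous collections
  ● partitioned: the data is split into chunks spread across the cluster
  ● lazy: transformations are only computed when an action needs the result
  ● immutable: a transformation returns a new RDD instead of changing the existing one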
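The three properties can be illustrated without a cluster. The following is a toy model in plain Python, not Spark's implementation; the class name ToyDataset and its methods are hypothetical and only loosely mirror the RDD API.

```python
class ToyDataset:
    """Toy stand-in for an RDD (hypothetical, NOT Spark's implementation)."""

    def __init__(self, partitions, ops=()):
        # Tuples keep both the data and the recorded steps immutable.
        self._partitions = tuple(tuple(p) for p in partitions)
        self._ops = tuple(ops)

    @classmethod
    def from_items(cls, items, num_partitions):
        items = list(items)
        # Partitioned: round-robin split of the items into chunks.
        return cls([items[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        # Lazy and immutable: record the step and return a NEW dataset.
        return ToyDataset(self._partitions, self._ops + (("map", fn),))

    def filter(self, pred):
        return ToyDataset(self._partitions, self._ops + (("filter", pred),))

    def _compute(self, partition):
        # Replay the recorded steps over a single partition.
        values = iter(partition)
        for kind, fn in self._ops:
            values = map(fn, values) if kind == "map" else filter(fn, values)
        return list(values)

    def collect(self):
        # Action: this is the point where the work actually happens.
        result = []
        for partition in self._partitions:
            result.extend(self._compute(partition))
        return result

    def count(self):
        return sum(len(self._compute(p)) for p in self._partitions)


numbers = ToyDataset.from_items(range(6), num_partitions=2)
evens = numbers.filter(lambda n: n % 2 == 0)  # nothing is computed yet
print(evens.count())    # the action triggers the computation
print(numbers.count())  # the original dataset is unchanged
```

In real Spark the partitions live on different executors and the recorded steps form the lineage used to recompute lost partitions; the toy version runs everything in one process.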
Resilient distributed dataset in action
What is a Spark application?
simple.py application
● Even numbers count
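The code shown on this slide did not survive extraction. A minimal sketch of what an even-numbers count could look like in PySpark follows; the function names is_even, count_even, and main, the application name, and the upper bound of 1000 are assumptions made for illustration, not the original simple.py.

```python
def is_even(n):
    """Predicate applied to each element on the executors."""
    return n % 2 == 0


def count_even(sc, upper):
    """Count the even numbers in [0, upper) with an RDD pipeline.

    `sc` is expected to behave like a pyspark.SparkContext.
    """
    numbers = sc.parallelize(range(upper))  # distribute the data
    evens = numbers.filter(is_even)         # lazy transformation
    return evens.count()                    # action: triggers the job


def main(upper=1000):
    # Imported here so the helpers above stay importable without Spark.
    from pyspark import SparkContext

    sc = SparkContext(appName="simple")  # app name is an assumption
    try:
        print(count_even(sc, upper))
    finally:
        sc.stop()
```

Under these assumptions, adding a call to main() at the bottom of the file and running it with spark-submit simple.py would print the count. Keeping the logic in count_even, separate from the SparkContext setup, also makes it easy to test against a stub context.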
A little more complex application
Designing a Spark microservice
On demand batch processing
Continuous batch processing
Stream processing
OpenShift architecture - recall
Spark on OpenShift
Oshinko - Integrating Spark and OpenShift
Demo time
Takeaways
● Containers
● Kubernetes
● OpenShift
● Apache Spark
● Oshinko tooling
Thank you!
www.github.com/radanalyticsio