Make the cloud work for you Easy, fast, and low-cost streaming with Apache Flink on Google Cloud Platform James Malone [ [email protected] ]

Make the cloud work for you - Francisco · Cloud Dataflow Cloud Dataflow is a real-time data processing service for batch and stream data processing. Fully managed Unified programming


Make the cloud work for you - Easy, fast, and low-cost streaming with Apache Flink on Google Cloud Platform

James Malone [ [email protected] ]


Open source - powerful but complex


The Apache ecosystem


Typical OSS deployments


Scaling makes your life difficult

Scaling can take hours, days, or weeks, which may delay needed data processing

[Chart: capacity needed over time; annotation: "Time needed to obtain new capacity"]


Scaling should be painless and fast

Things take seconds to minutes, not hours or weeks.

[Chart: capacity needed over time]


You have to babysit utilization

It takes effort to pack clusters so that they do not have periods of inactivity and wasted resources

[Chart: cluster utilization over time, with a "Cluster Inactive" period marked]


Only use clusters when you need them

Be an expert with your data, not your infrastructure

[Charts: cluster utilization over time, one of them with "cluster 1" and "cluster 2" labeled]


You are not paying for what you use

[Chart: cost over time]

You are paying for more (spare) capacity than you actually need to process your data


Pay for exactly what you use

[Chart: cost over time]

You only pay for resources when you need them
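The difference between paying for spare capacity and paying only for what you use can be sketched with simple arithmetic. Every number below (node count, hourly rate, busy hours) is a made-up assumption for illustration, not a Google Cloud price.

```python
def cluster_cost(nodes: int, hours: float, rate_per_node_hour: float) -> float:
    """Cost of running `nodes` machines for `hours` at a flat hourly rate."""
    return nodes * hours * rate_per_node_hour


# Hypothetical inputs, for illustration only.
RATE = 0.25            # assumed price per node-hour
NODES = 10             # assumed cluster size
HOURS_IN_MONTH = 730   # an always-on cluster runs all month
BUSY_HOURS = 120       # hours per month the jobs actually run

always_on = cluster_cost(NODES, HOURS_IN_MONTH, RATE)
on_demand = cluster_cost(NODES, BUSY_HOURS, RATE)

print(f"always-on: ${always_on:.2f}")
print(f"on-demand: ${on_demand:.2f}")
print(f"saved:     ${always_on - on_demand:.2f}")
```

With these assumed inputs, the always-on cluster is billed for all 730 hours while the on-demand cluster is billed for 120.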


Open source on Google Cloud


Google is passionate about open source

MapReduce

BigTable

Dremel

Colossus

Flume

Megastore

Spanner

PubSub

Millwheel

Cloud Dataflow

Cloud Dataproc

Apache Beam


What is Cloud Dataflow?

Intelligently scales to millions of QPS

Open source programming model

Unified batch and streaming processing

Fully managed, no-ops data processing


Cloud Dataflow

Compute and storage

Unbounded

Bounded

Resource management

Resource auto-scaler

Dynamic work rebalancer

Work scheduler

Monitoring

Log collection

Graph optimization

Auto-healing

Intelligent watermarking

Source

Sink


Cloud Dataproc offers a spectrum

Cloud Dataflow Cloud Dataproc


Cloud Dataflow

Cloud Dataflow is a real-time data processing service for batch and stream data processing.

● Fully managed

● Unified programming model

● Integrated and open source

● Resource management

● Autoscaling

● Monitoring


What is Cloud Dataproc?

Google Cloud Dataproc is a fast, easy-to-use, low-cost, fully managed service, powered by Google Cloud Platform, that helps you take advantage of the Spark, Flink, and Hadoop ecosystem.


Google Cloud Dataproc

Fast - Things take seconds to minutes, not hours or weeks

Easy - Be an expert with your data, not your data infrastructure

Cost-effective - Pay for exactly what you use to process your data, not more


Cloud Dataproc clusters

Create cluster

(0 seconds)

Configure cluster

(20 seconds)

Use cluster

(90 seconds)

Scale anytime
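The create, use, and scale steps above might map onto Cloud SDK commands roughly as follows. This is a minimal sketch that only builds the command lines and does not run them. The cluster name, region, and worker counts are hypothetical, and it assumes a current Cloud SDK where `gcloud dataproc clusters create`, `update`, and `delete` accept `--region` and `--num-workers`.

```python
import shlex
from typing import List


def create_cluster_cmd(name: str, region: str, num_workers: int) -> List[str]:
    """Command line that creates a cluster with the given worker count."""
    return ["gcloud", "dataproc", "clusters", "create", name,
            f"--region={region}", f"--num-workers={num_workers}"]


def scale_cluster_cmd(name: str, region: str, num_workers: int) -> List[str]:
    """Command line that resizes an existing cluster ("scale anytime")."""
    return ["gcloud", "dataproc", "clusters", "update", name,
            f"--region={region}", f"--num-workers={num_workers}"]


def delete_cluster_cmd(name: str, region: str) -> List[str]:
    """Command line that deletes the cluster once the work is done."""
    return ["gcloud", "dataproc", "clusters", "delete", name,
            f"--region={region}", "--quiet"]


# Hypothetical cluster name and region, for illustration only.
for cmd in (create_cluster_cmd("demo-cluster", "us-central1", 2),
            scale_cluster_cmd("demo-cluster", "us-central1", 10),
            delete_cluster_cmd("demo-cluster", "us-central1")):
    print(shlex.join(cmd))
```

Keeping the commands as argument lists makes them easy to hand to `subprocess.run` later without shell-quoting problems.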


Cloud Dataproc offers a spectrum


Connecting OSS to Cloud Platform

Input

Cloud Dataproc

BigQuery

Compute Engine

Cloud Storage

Cloud Bigtable

Output

Cloud Dataproc

BigQuery

Compute Engine

Cloud Storage

Cloud Bigtable


Cloud Dataproc - under the hood

Dataproc Cluster

Dataproc Agent

Master node(s) - Compute Engine

Managed instance group

Persistent workers - Compute Engine

PVM worker(s) - Compute Engine

HDFS - Persistent Disk

Cluster bucket - Cloud Storage

Networking - Cloud Network

Cloud IAM


Cloud Dataproc - under the hood

Dataproc Cluster

Dataproc Agent

HDFS - Persistent Disk

Cluster bucket - Cloud Storage

Compute Engine nodes

Dataproc Image OSS

Apache Flink

Apache Hadoop

Apache Hive

(many others)

Clients

Cloud Dataproc API

Clusters

Operations

Jobs


Cloud Dataproc - under the hood

Clients

Developers Console

Cloud SDK

SSH

Customer project(s)

Dataproc API

Clusters - Cloud Dataproc


Cloud Storage performance (and improvement)


Cloud Dataproc demo


$12.85


Example new features and their impact


Restartable jobs (beta)

● Any job submitted through the Cloud Dataproc Jobs API can now be set to automatically restart on failure

● Very useful for both batch and streaming jobs. Jobs which checkpoint can also be automatically restarted

● Specified with the switch --max_retries_per_hour when using the Cloud SDK (gcloud), or with max_failures_per_hour in the Jobs API
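As a rough illustration of the Jobs API setting named above, the sketch below builds a job description that carries `max_failures_per_hour`. The surrounding field layout (`placement`, `spark_job`, `scheduling`) is an assumption based on the snake_case field names of the Dataproc `Job` message, and the cluster name, main class, and jar path are hypothetical. Nothing here is sent to the service.

```python
from typing import Dict, List


def restartable_job(cluster_name: str, main_class: str, jar_uris: List[str],
                    max_failures_per_hour: int) -> Dict:
    """Build a job description that asks for automatic restart on failure."""
    if max_failures_per_hour < 1:
        raise ValueError("max_failures_per_hour must be at least 1")
    return {
        # Which cluster runs the job.
        "placement": {"cluster_name": cluster_name},
        # What to run; a Spark job is used purely as an example.
        "spark_job": {"main_class": main_class, "jar_file_uris": list(jar_uris)},
        # The restart setting described on the slide.
        "scheduling": {"max_failures_per_hour": max_failures_per_hour},
    }


# Hypothetical values, for illustration only.
job = restartable_job("demo-cluster", "org.example.StreamingMain",
                      ["gs://example-bucket/app.jar"], 5)
print(job["scheduling"])
```

A dictionary of this shape is what a client library would typically accept when submitting a job; check the current Jobs API reference before relying on the exact field names.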


Clusters with GPUs (beta)

● Cloud Dataproc clusters support Compute Engine nodes with Nvidia Tesla K80 GPUs attached to them

● We expect GPU support will continue to grow in the open source data processing ecosystem throughout 2017

● Easily add GPUs to a Cloud Dataproc cluster with the switch --master/worker_accelerator with the Cloud SDK (gcloud)
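The accelerator switch above could be assembled as in the following sketch. It assumes the slide's shorthand expands to the two Cloud SDK flags `--master-accelerator` and `--worker-accelerator`, each taking a `type=...,count=...` value. The cluster name, region, and GPU counts are hypothetical, and the GPU type string is an assumed identifier for the Tesla K80.

```python
from typing import List


def accelerator_flag(role: str, gpu_type: str, count: int) -> str:
    """Format one accelerator flag for the master or worker nodes."""
    if role not in ("master", "worker"):
        raise ValueError("role must be 'master' or 'worker'")
    if count < 1:
        raise ValueError("count must be at least 1")
    return f"--{role}-accelerator=type={gpu_type},count={count}"


def gpu_cluster_cmd(name: str, region: str, gpu_type: str,
                    master_gpus: int, worker_gpus: int) -> List[str]:
    """Command line that creates a cluster with GPUs attached to its nodes."""
    return ["gcloud", "dataproc", "clusters", "create", name,
            f"--region={region}",
            accelerator_flag("master", gpu_type, master_gpus),
            accelerator_flag("worker", gpu_type, worker_gpus)]


# Hypothetical values, for illustration only.
print(" ".join(gpu_cluster_cmd("gpu-demo", "us-central1",
                               "nvidia-tesla-k80", 1, 2)))
```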


Single-node clusters (beta)

● Create a sandbox Cloud Dataproc “cluster” with only one node instead of the typical three node design (1 master, 2 workers)

● Great for lightweight data science, small-scale testing, proof of concept building, and education

● Use the --single-node argument in the Cloud SDK or select “Single node” when creating a cluster in the Google Cloud Console


Regional endpoints & private IP clusters (beta)

● Cloud Dataproc supports a “global” endpoint and “regional endpoints” in each Compute Engine region. This allows you to isolate Cloud Dataproc interactions to one specific region

● Traditionally clusters have needed a public IP attached to them. Cloud Dataproc now supports (easy to set up) “private IP only” clusters which do not require a public IP address on Compute Engine nodes
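To make the global versus regional distinction concrete, here is a small sketch that picks the API host for a region and builds the clusters collection URL. It assumes the global endpoint is `dataproc.googleapis.com`, that regional endpoints follow the `REGION-dataproc.googleapis.com` pattern, and that the v1 REST path has the form shown. The project ID and region are hypothetical.

```python
def dataproc_endpoint(region: str = "global") -> str:
    """Return the API host for the global endpoint or a regional one."""
    if region == "global":
        return "dataproc.googleapis.com"
    return f"{region}-dataproc.googleapis.com"


def clusters_url(project: str, region: str = "global") -> str:
    """Build the v1 REST URL for the clusters collection in a region."""
    host = dataproc_endpoint(region)
    return f"https://{host}/v1/projects/{project}/regions/{region}/clusters"


# Hypothetical project ID, for illustration only.
print(clusters_url("example-project"))
print(clusters_url("example-project", "europe-west1"))
```

With a regional endpoint, both the host and the resource path name the same region, which is what keeps the interaction isolated to that region.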


Cloud platform architecture concepts


Disaggregation of storage and compute

Analysis - Cloud Datalab

Development & Test

Data sinks

Production - Cloud Dataproc

External applications

Storage - Cloud Storage

Application Logs

Storage - BigQuery

Clusters

Development - Cloud Dataproc

Test - Cloud Dataproc

Data sources

Storage - Cloud Bigtable

Storage - Cloud Storage

Storage - BigQuery

Storage - Cloud Bigtable

Data science

Cluster monitoring

Monitor - Stackdriver

Logs - Logging


Cost savings through preemptible VMs

Dataproc Cluster

Master node(s) - Compute Engine

Managed instance group

Persistent workers - Compute Engine

PVM (preemptible VM) worker(s) - Compute Engine

HDFS - Persistent Disk

Cluster bucket - Cloud Storage

Networking - Cloud Network

Cloud IAM


Right-sizing your hardware with custom machine types

[Chart: workload compared against machine types n1-standard-4, n1-highmem-4, n1-standard-8, and custom-6-30720]
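A custom machine type name such as custom-6-30720 encodes the shape directly: the first number is the vCPU count and the second is the memory in MB. The sketch below decodes such a name; the helper and class names are hypothetical, and it handles only the plain `custom-CPUS-MEMORY_MB` form.

```python
import re
from typing import NamedTuple


class MachineShape(NamedTuple):
    vcpus: int
    memory_mb: int

    @property
    def memory_gb(self) -> float:
        """Memory in GB, treating 1 GB as 1024 MB."""
        return self.memory_mb / 1024


# Matches names of the form custom-<vCPUs>-<memory in MB>.
_CUSTOM_TYPE = re.compile(r"^custom-(\d+)-(\d+)$")


def parse_custom_machine_type(name: str) -> MachineShape:
    """Decode a custom machine type name into vCPUs and memory."""
    match = _CUSTOM_TYPE.match(name)
    if match is None:
        raise ValueError(f"not a custom machine type name: {name!r}")
    return MachineShape(int(match.group(1)), int(match.group(2)))


shape = parse_custom_machine_type("custom-6-30720")
print(shape.vcpus, "vCPUs,", shape.memory_gb, "GB")
```

For the name on the slide this gives 6 vCPUs and 30 GB of memory, a shape that sits between the predefined types rather than matching one of them.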


Split clusters and jobs

Customer’s on-premise datacenter

Flink/Hadoop cluster

Job 1

Job 2

Job 3

Job 4

Cluster 1 - Cloud Dataproc

Cluster 2 - Cloud Dataproc

Cluster 3 - Cloud Dataproc

Job 1

Job 2

Job 3

Job 4


Separate development and production

Customer’s on-premise datacenter

Production - Cloud Dataproc

Development - Cloud Dataproc

Staging - Cloud Dataproc

Production cluster

Compute HDFS

Development cluster

Compute HDFS

Cluster bucket - Cloud Storage

Read/write

Read only


Semi-long-lived clusters - group and select by label

Ephemeral and semi-long-lived clusters

Clusters per job

[Diagram labels: six Cloud Dataproc clusters, Cloud Storage, edge nodes (Compute Engine), "Canary" v1.2, "Current" v1.1, clients]
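Grouping and selecting clusters by label, as with the "Canary" and "Current" groups on this slide, can be sketched in plain Python. The cluster names, label keys, and label values below are hypothetical, and the in-memory list stands in for whatever a real cluster listing would return.

```python
from typing import Dict, List

# Hypothetical inventory: each cluster carries a dictionary of labels.
CLUSTERS: List[Dict] = [
    {"name": "cluster-a", "labels": {"release": "current", "version": "v1-1"}},
    {"name": "cluster-b", "labels": {"release": "current", "version": "v1-1"}},
    {"name": "cluster-c", "labels": {"release": "canary", "version": "v1-2"}},
]


def select_by_labels(clusters: List[Dict], **wanted: str) -> List[str]:
    """Return names of clusters whose labels match every requested pair."""
    return [
        cluster["name"]
        for cluster in clusters
        if all(cluster.get("labels", {}).get(key) == value
               for key, value in wanted.items())
    ]


print(select_by_labels(CLUSTERS, release="canary"))
print(select_by_labels(CLUSTERS, release="current", version="v1-1"))
```

Clients can then be pointed at a label selector rather than at a fixed cluster name, so a semi-long-lived cluster can be replaced without reconfiguring them.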


Questions and next steps


Getting started

Codelabs - codelabs.developers.google.com/codelabs/cloud-dataproc-starter/

Cloud Dataproc quickstarts - cloud.google.com/dataproc/docs/quickstarts

Cloud Dataproc tutorials - cloud.google.com/dataproc/docs/tutorials

Cloud Dataproc initialization actions - github.com/GoogleCloudPlatform/dataproc-initialization-actions


Getting help

Cloud Dataproc documentation - cloud.google.com/dataproc/docs

Cloud Dataproc release notes - cloud.google.com/dataproc/docs

Stack Overflow - google-cloud-dataproc

Cloud Dataproc email discussion - [email protected]

Google Cloud Support - cloud.google.com/support


Thank you

https://goo.gl/Qyf5U7