
Make the cloud work for you
Easy, fast, and low-cost streaming with Apache Flink on Google Cloud Platform

James Malone [ jamesmalone@google.com ]

Open source: powerful but complex

The Apache ecosystem

Typical OSS deployments

Scaling makes your life difficult

Scaling can take hours, days, or weeks to perform, which may delay needed data processing

[Chart: capacity needed vs. time, showing the lag between when new capacity is needed and when it is obtained]

Scaling should be painless and fast

Things take seconds to minutes, not hours or weeks.

[Chart: capacity needed vs. time, with capacity tracking demand closely]

You have to babysit utilization

[Chart: cluster utilization vs. time, with long stretches where the cluster is inactive]

It takes effort to pack clusters so they do not have periods of inactivity and wasted resources

Only use clusters when you need them

Be an expert with your data, not your infrastructure

[Chart: utilization of two short-lived clusters (cluster 1, cluster 2) over time, each running only while it has work]

You are not paying for what you use

[Chart: cost vs. time under fixed capacity]

You are paying for more (spare) capacity than you actually need to process your data

Pay for exactly what you use

[Chart: cost vs. time, with cost rising and falling with actual usage]

You only pay for resources when you need them
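The two cost charts above can be sketched numerically. This is an illustration only: the per-node hourly rate and the usage pattern are made-up round numbers, not real Google Cloud prices.

```python
# Illustrative comparison of fixed-capacity vs. pay-per-use pricing.
# The rate below is an assumed round number, not a real GCP price.

HOURLY_RATE_PER_NODE = 1.0  # assumed cost of one node for one hour

def fixed_capacity_cost(peak_nodes: int, hours: int) -> float:
    """Always pay for enough nodes to cover the peak."""
    return peak_nodes * hours * HOURLY_RATE_PER_NODE

def pay_per_use_cost(nodes_per_hour: list) -> float:
    """Pay only for the nodes actually used each hour."""
    return sum(n * HOURLY_RATE_PER_NODE for n in nodes_per_hour)

# A bursty day: 20 idle hours, then a 4-hour burst needing 10 nodes.
usage = [0] * 20 + [10] * 4
print(fixed_capacity_cost(peak_nodes=10, hours=24))  # 240.0
print(pay_per_use_cost(usage))                       # 40.0
```

Under this (assumed) bursty workload, paying only for hours actually used costs a sixth of keeping peak capacity running all day.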

Open source on Google Cloud

Google is passionate about open source

MapReduce · BigTable · Dremel · Colossus · Flume · Megastore · Spanner · PubSub · Millwheel

Cloud Dataflow

Cloud Dataproc

Apache Beam

What is Cloud Dataflow?

Intelligently scales to millions of QPS

Open source programming model

Unified batch and streaming processing

Fully managed, no-ops data processing

Cloud Dataflow

[Diagram: the Cloud Dataflow service manages compute and storage for unbounded and bounded data, from SOURCE to SINK, providing resource management, a resource auto-scaler, a dynamic work rebalancer, a work scheduler, monitoring, log collection, graph optimization, auto-healing, and intelligent watermarking]

Cloud Dataproc offers a spectrum

[Diagram: a spectrum from fully managed (Cloud Dataflow) to full control over open source clusters (Cloud Dataproc)]


Cloud Dataflow

Cloud Dataflow is a fully managed data processing service for both batch and real-time (streaming) data.

● Fully managed
● Unified programming model
● Integrated and open source
● Resource management
● Autoscaling
● Monitoring

What is Cloud Dataproc?

Google Cloud Dataproc is a fast, easy-to-use, low-cost, fully managed service on Google Cloud Platform that helps you take advantage of the Spark, Flink, and Hadoop ecosystem.

Google Cloud Dataproc

Fast: Things take seconds to minutes, not hours or weeks

Easy: Be an expert with your data, not your data infrastructure

Cost-effective: Pay for exactly what you use to process your data, not more

Cloud Dataproc clusters

Create cluster (0 seconds) → Configure cluster (20 seconds) → Use cluster (90 seconds)

Scale anytime

Cloud Dataproc offers a spectrum

[Diagram: a spectrum of cluster configurations, labeled A through F]

Connecting OSS to Cloud Platform

[Diagram: Cloud Dataproc reads input from and writes output to BigQuery, Compute Engine, Cloud Storage, and Cloud Bigtable]

Cloud Dataproc - under the hood

Dataproc cluster:
● Dataproc agent
● Master node(s) (Compute Engine)
● Managed instance group: persistent workers (Compute Engine) and PVM worker(s) (Compute Engine)
● HDFS (Persistent Disk)
● Cluster bucket (Cloud Storage)
● Networking (Cloud Network)
● Cloud IAM

Cloud Dataproc - under the hood

Dataproc cluster:
● Dataproc agent
● HDFS (Persistent Disk)
● Cluster bucket (Cloud Storage)
● Compute Engine nodes running the Dataproc image plus OSS: Apache Flink, Apache Hadoop, Apache Hive, and many others

Clients interact with the Cloud Dataproc API, which manages clusters, operations, and jobs.

Cloud Dataproc - under the hood

Clients (Developers Console, Cloud SDK, SSH) connect to customer project(s) through the Dataproc API, which manages the project's Cloud Dataproc clusters.

Cloud Storage performance (and improvement)

Cloud Dataproc demo

$12.85

Example new features and their impact

Restartable jobs (beta)

● Any job submitted through the Cloud Dataproc Jobs API can now be set to automatically restart on failure

● Very useful for both batch and streaming jobs. Jobs which checkpoint can also be automatically restarted

● Specified with the --max-failures-per-hour flag when using the Cloud SDK (gcloud), or the max_failures_per_hour field in the Jobs API

Clusters with GPUs (beta)

● Cloud Dataproc clusters support Compute Engine nodes with Nvidia Tesla K80 GPUs attached to them

● We expect GPU support will continue to grow in the open source data processing ecosystem throughout 2017

● Easily add GPUs to a Cloud Dataproc cluster with the --master-accelerator and --worker-accelerator flags in the Cloud SDK (gcloud)

Single-node clusters (beta)

● Create a sandbox Cloud Dataproc “cluster” with only one node instead of the typical three-node design (1 master, 2 workers)

● Great for lightweight data science, small-scale testing, proof of concept building, and education

● Use the --single-node argument in the Cloud SDK or select “Single node” when creating a cluster in the Google Cloud Console

Regional endpoints & private IP clusters (beta)

● Cloud Dataproc supports a “global” endpoint and “regional endpoints” in each Compute Engine region. This allows you to isolate Cloud Dataproc interactions to one specific region

● Traditionally, clusters have needed a public IP address attached to them. Cloud Dataproc now supports easy-to-configure “private IP only” clusters, which do not require a public IP address on their Compute Engine nodes

Cloud Platform architecture concepts

Disaggregation of storage and compute

[Diagram: development, test, and production Cloud Dataproc clusters are separated from storage — data sources and sinks live in Cloud Storage, BigQuery, and Cloud Bigtable; Cloud Datalab is used for analysis and data science; external applications consume outputs; application logs and cluster monitoring flow to Stackdriver Monitoring and Logging]

Cost savings through preemptible VMs

[Diagram: the same cluster layout as above — master node(s) on Compute Engine, HDFS on Persistent Disk, cluster bucket on Cloud Storage, networking, Cloud IAM — with the managed instance group holding both persistent workers and preemptible (PVM) workers on Compute Engine]
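The savings from mixing preemptible workers into a cluster can be sketched with some simple arithmetic. The rates below are assumptions for illustration — preemptible VMs have historically been heavily discounted relative to standard VMs, but check current Compute Engine pricing for real numbers.

```python
# Sketch of blended cluster cost with preemptible (PVM) workers.
# Both the per-worker rate and the 80% preemptible discount are
# illustrative assumptions, not actual GCP prices.

STANDARD_HOURLY = 0.10       # assumed per-worker hourly rate
PREEMPTIBLE_DISCOUNT = 0.80  # assumed discount for preemptible VMs

def cluster_hourly_cost(persistent: int, preemptible: int) -> float:
    """Hourly worker cost for a mix of persistent and PVM workers."""
    pvm_rate = STANDARD_HOURLY * (1 - PREEMPTIBLE_DISCOUNT)
    return persistent * STANDARD_HOURLY + preemptible * pvm_rate

# 10 persistent workers vs. 10 persistent + 40 preemptible:
print(round(cluster_hourly_cost(10, 0), 2))   # 1.0
print(round(cluster_hourly_cost(10, 40), 2))  # 1.8
```

Under these assumed rates, adding 40 preemptible workers gives 5x the workers for less than 2x the hourly cost — the reason PVMs pair well with fault-tolerant batch jobs.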

Right-sizing your hardware with custom machine types

Workload options: n1-standard-4 · n1-highmem-4 · n1-standard-8 · custom-6-30720
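The last option above is a custom machine type: `custom-6-30720` means 6 vCPUs with 30720 MB (30 GB) of memory. A small sketch of how such names are formed, with validity checks reflecting the N1 custom machine type constraints as documented at the time (even vCPU count, memory in 256 MB increments, 0.9–6.5 GB per vCPU) — verify against current Compute Engine documentation:

```python
# Build an N1 custom machine type name of the form custom-<vcpus>-<memMB>.
# The constraints below are the documented N1 rules at the time of the talk.

def custom_machine_type(vcpus: int, memory_mb: int) -> str:
    if vcpus != 1 and vcpus % 2 != 0:
        raise ValueError("vCPU count must be 1 or an even number")
    if memory_mb % 256 != 0:
        raise ValueError("memory must be a multiple of 256 MB")
    per_vcpu_mb = memory_mb / vcpus
    if not 0.9 * 1024 <= per_vcpu_mb <= 6.5 * 1024:
        raise ValueError("memory per vCPU must be between 0.9 and 6.5 GB")
    return f"custom-{vcpus}-{memory_mb}"

print(custom_machine_type(6, 30 * 1024))  # custom-6-30720
```

This lets you right-size a cluster (say, 6 vCPUs with 5 GB each) instead of jumping from n1-standard-4 to n1-standard-8 and paying for capacity you don't need.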

Split clusters and jobs

[Diagram: in the customer’s on-premise datacenter, Jobs 1–4 share one Flink/Hadoop cluster; on Google Cloud, the same four jobs are split across Cloud Dataproc clusters 1–3]

Separate development and production

[Diagram: instead of one on-premise cluster, run separate development, staging, and production Cloud Dataproc clusters; the production cluster (compute + HDFS) has read/write access to the cluster bucket in Cloud Storage, while the development cluster has read-only access]

Semi-long-lived clusters - group and select by label

Ephemeral and semi-long-lived clusters

[Diagram: each job gets its own ephemeral Cloud Dataproc cluster, all sharing data through Cloud Storage; clients reach the clusters through Compute Engine edge nodes, with some clusters on a “canary” v1.2 image and the rest on the “current” v1.1 image]
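The clusters-per-job pattern boils down to three gcloud invocations: create an ephemeral cluster, run one job on it, and delete it. The sketch below builds those commands as argv lists without executing them; the cluster name, jar path, and label values are hypothetical, and the exact gcloud flags should be checked against the current Cloud SDK reference.

```python
# Sketch of the ephemeral clusters-per-job pattern: one cluster per job,
# created just in time and deleted afterward. Commands are built as argv
# lists (not executed here); names, paths, and labels are hypothetical.

def ephemeral_job_commands(cluster: str, jar: str, labels: dict) -> list:
    label_arg = "--labels=" + ",".join(f"{k}={v}" for k, v in labels.items())
    return [
        # 1. Create the cluster, labeled so it can be grouped/selected later.
        ["gcloud", "dataproc", "clusters", "create", cluster, label_arg],
        # 2. Submit the one job this cluster exists for.
        ["gcloud", "dataproc", "jobs", "submit", "hadoop",
         f"--cluster={cluster}", f"--jar={jar}"],
        # 3. Tear the cluster down so you stop paying for it.
        ["gcloud", "dataproc", "clusters", "delete", cluster, "--quiet"],
    ]

cmds = ephemeral_job_commands(
    "job-42-cluster", "gs://my-bucket/wordcount.jar", {"release": "canary"})
for c in cmds:
    print(" ".join(c))
```

Because the cluster's data lives in the Cloud Storage cluster bucket rather than in cluster-local HDFS, deleting the cluster loses no state — which is what makes this pattern (and the canary/current split above) practical.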

Questions and next steps

Getting started

Codelabs - codelabs.developers.google.com/codelabs/cloud-dataproc-starter/

Cloud Dataproc quickstarts - cloud.google.com/dataproc/docs/quickstarts

Cloud Dataproc tutorials - cloud.google.com/dataproc/docs/tutorials

Cloud Dataproc initialization actions - github.com/GoogleCloudPlatform/dataproc-initialization-actions

Getting help

Cloud Dataproc documentation - cloud.google.com/dataproc/docs

Cloud Dataproc release notes - cloud.google.com/dataproc/docs/release-notes

Stack Overflow - google-cloud-dataproc

Cloud Dataproc email discussion - cloud-dataproc-discuss@googlegroups.com

Google Cloud Support - cloud.google.com/support

Thank you

https://goo.gl/Qyf5U7
