in the Clouds Scientific Computing€¦ · Cloud Dataprep is a serverless service, so you do not...

Scientific Computing in the CloudsKaran Bhatia, GoogleMay 1, 2017

Investing to meet University and research needs

$29.4 BillionGoogle’s trailing 3 Year CAPEX investment

1 Billion End users served by GCP customers

Current regions and number of zones

Data Localization

Network path

Committed regions for 2017 and number of zones

# https://peering.google.comhttps://cloud.google.com/compute/docs/regions-zones/regions-zones

Singapore2

S Carolina

N Virginia

BelgiumLondon

Tokyo (2016)

TaiwanMumbai

Sydney

Oregon

Frankfurt

São Paulo

Finland

Points of presence

Agenda

Big Compute

Big Data

Programs

Patterns

Big Compute

Proprietary + Confidential

SC16 CMS DemonstratorTarget: generate 1 Billion events in 48 hours during Supercomputing 2016 on Google Cloud via HEPCloud

35% filter efficiency = stage out 380 million events → 150 TB output

Double the size of global CMS computing resources

CMS Higgs Event - credit: CERN https://commons.wikimedia.org/wiki/File:CMS_Higgs-event.jpg

Cores from Google

MIT Research w/ VMs

Products used: Google Compute Engine, Cloud Storage, DataStore

220,000 cores on preemptible VMs

2,250 32-core instances, 60 CPU-years of computation in a single afternoon

Answers in hours v. months

Broad Firecloud:WDL, Cromwell and Google Genomics

WDL: an external DSL used by computational biologists to express the analytical pipelines

Cromwell: a scalable, robust engine for executing WDL against pluggable backends including local, Docker, Grid Engine or …

Google Genomics Pipelines API: co-developed by Broad and Google Genomics, a scalable Docker-as-a-Service with data scheduling

Pipeline definition{

"name": "samtools index",

"description": "Run samtools index to generate a BAM index file",

"inputParameters": [

{"name": "inputFile",

"localCopy": {

"disk": "data",

"path": "input.bam"

{"name": "outputFile",

"localCopy": {

"disk": "data",

"path": "output.bam.bai"

"resources": {

"minimumCpuCores": 1,

"minimumRamGb": 1,

"disks": [{

"name": "data",

"type": "PERSISTENT_HDD"

"sizeGb": 200,

"mountPoint": "/mnt/data",

"docker": {

"imageName": "quay.io/cancercollaboratory/dockstore-tool-samtools-index",

"cmd": "samtools index /mnt/data/input.bam /mnt/data/output.bam.bai"

Create, run, monitor, and kill pipelines

Create$ gcloud alpha genomics pipelines create --pipeline-json-file PIPELINE-FILE.json --pipeline-json-file samtools_index.json

Created samtools index, id: PIPELINE-ID

Run$ gcloud alpha genomics pipelines run --pipeline_id PIPELINE-ID \

--logging gs://YOUR-BUCKET/YOUR-DIRECTORY/logs \

--inputs inputFile=gs://genomics-public-data/gatk-examples/example1/NA12878_chr22.bam \

--outputs outputFile=gs://YOUR-BUCKET/YOUR-DIRECTORY/output/NA12878_chr22.bam.bai

Running: operations/OPERATION-ID

Status$ gcloud alpha genomics operations describe OPERATION-ID

Kill$ gcloud alpha genomics operations cancel OPERATION-ID

DSUB (google genomics pipelines)

Lessons

● Integration with third-party workload manager vs roll your own vs something in between

○ HTCondor, Slurm, Google Genomics Pipelines, ssh○ Managed instance groups

● On-premise + hybrid vs on-cloud● Cost optimizations

○ Preemptible vms and custom machine types○ Per-minute billing

● Networking is a key differentiator, public peering + internet2 member

Intel Skylake

● Significant “per core” performance improvements

● Intel® Advanced Vector Extension 512 (Intel® AVX-512)

○ 2x flops/second● Accelerated IO with Intel® Omni-Path

Architecture (Fabric)● Integrated Intel® QuickAssist Technology

(crypto & compression offload)● Intel® Resource Director Technology (Intel®

RDT) for Efficiency & TCO

Hardware Accelerated

● Available Today: NVIDIA K80 GPU● Coming Soon: Tensor Processing

Unit (TPU)● Custom ASIC built and optimized

for TensorFlow● Used in production at Google for

over 16 months● 7 years ahead of GPU performance

per watt

Data Prep (beta)

Cloud Dataprep

Cloud Pub/Sub

Cloud Dataflow

1. Ingest Data

Clean Data Raw Data

Google BigQuery

Data Studio

Cloud ML

2. Instantly Prepare Data 3. Analyze Data

Supports Common Data Sources of Any SizeProcess diverse datasets - structured and unstructured. Transform data stored in CSV, JSON, or relational Table formats. Prepare datasets of any size, megabytes to terabytes, with equal ease.

Cloud DataprepInstant Data ExplorationVisually explore and interact with data in seconds. Instantly understand data distribution and patterns. There is no need for one to write code. You can prepare data with a few clicks.

Intelligent Data CleansingCloud Dataprep automatically identifies data anomalies and helps you to take corrective actions fast. Get data transformation suggestions based on your usage pattern. Standardize, structure, and join datasets easily with a guided approach.

ServerlessCloud Dataprep is a serverless service, so you do not need to create or manage infrastructure.

Seriously PowerfulCloud Dataprep is built on top of powerful Google Cloud Dataflow service. Cloud Dataprep is auto-scalable and can easily handle processing massive data sets.

Google cloud computing can help universities transform

Teaching Faculty in select

countries

Teachinguniversity courses

In computer science or

related fields

● National Science Foundation

○ BIGDATA

● National Institutes of Health

○ Data Commons

Funding Agency Partnerships

Google Cloud Public Datasets Program

Mission: Facilitate the onboarding of datasets into Google Cloud products

How to use the basic headline + body:

1. Replace body text by either typing directly into table boxes or copy and paste content in from other source42+ datasets

You can contribute too!

Visit: https://cloud.google.com/public-datasets/

Email: bq-public-data@google.com

Themes / Patterns for Scientific Computing

Extending the Cloud APIs

PI/BiologistWeb Access

Computational Research ScientistPython, R, SQL

Algorithm Developerssh, programmatic access

ISB-CGC GUI Google GUI

GoogleISB-CGC API

Compute Engine VMs

Cloud Storage BigQuery Genomics

Local Storage

ISB-CGC Hosted Data Controlled-Access Data Open-Access Data User Data

TensorFlow

● World’s most popular ML framework

● Developer friendly yet performance optimized

● Powers over 100 Google services

● Managed infrastructure with Cloud ML

● Tutorials at https://www.tensorflow.org

Linear Regression VS Neural Network

Thank you!

in the Clouds Scientific Computing€¦ · Cloud Dataprep is a serverless service, so you do not...

Documents

Using Serverless to Modernize Legacy Applications: Quick ...€¦ · Cloud Native Applications &theHybrid Cloud.. Introduction to Serverless & Fission 7. Serverless: A New Way To

Real-time Serverless: Cloud Resource Management …...cloud resource type, real-time serverless, with allocation-rate service-level guarantees. This is distinct from existing cloud

TECH UPDATE Serverless APIs with AWS · Serverless APIs with AWS TECH UPDATE. ... • Dick Eimers, Cloud Solution Architect ... Introduce Amazon’s serverless building blocks for

Alibaba Cloud Serverless

The Data Dossier...Appliance o e Cloud Dataflow Cloud Dataproc BigQuery Cloud ML Cloud Vision API Cloud Speech API Translate API Cloud Natural Language API Cloud Dataprep Cloud Video

PolarDB Serverless: A Cloud Native Database for

Dataprep Master Presentation

Economics of Serverless Cloud Computing - American Marvelsamericanmarvels.com/cloud/economics.pdf · or containers. Serverless is really a shift from PaaS or Ôcontainers as a serviceÕ

On Merits and Viability of Multi-Cloud Serverless

Building Full-Stack Serverless Applications · resource on building full-stack serverless ... “Serverless computing, is a cloud computing code execution model in which the cloud

Bimodal / Two Speed IT and Cloud Serverless Microservice Architecture

LEVERAGING SERVERLESS CLOUD COMPUTING ARCHITECTURES · Best practices for creating and migrating towards a serverless architecture ... Serverless computing is a new and interesting

the serverless cloud - JAX London · the serverless cloud pay per execution - example (aws) ... the serverless cloud seamless scaling ... aws lambda. ibm bluemix openwhisk. microsoft

Microsoft Cloud Workshop - azurebootcampdk.comazurebootcampdk.com/Materials/Bootcamp 8 Hands-on lab step-by-step - Serverless...Microsoft Cloud Workshop Serverless architecture 1 |

Basics of Cloud Computing – Lecture 4 Serverless ...IBM BlueMix Cloud Function IBM Bluemix, rebranded IBM Cloud in 2017 Based on OpenWhisk - Open Source Serverless cloud platform

Serverless IoT at iRobot - QCon San Francisco · 2020-06-05 · Serverless IoT at iRobot Ben Kehoe @ben11kehoe Cloud Robotics Research Scientist. ... Serverless Serverless-ness goes

Serverless with Google Cloud Functions

Serverless - O'Reilly...The future is Serverless. On-Prem Cloud Serverless. Ride sharing Serverless Infrastructure Someone else's car Never manage servers Cost Pay per ride Pay per

(CMP301) AWS Lambda and the Serverless Cloud

Getting Started with AWS Lambda & Serverless Cloud