Scaling Galaxy on GCP
Lynn Langit, Cloud and Data Architect
Google Developer Cloud Expert, AWS Community Hero, Microsoft Data Platform MVP

Scaling Galaxy on Google Cloud Platform

Page 2: Scaling Galaxy on Google Cloud Platform

Agenda

• Scaling Up
  • Virtual Machines
  • Hello Galaxy
  • Adding Tools to Galaxy
  • Genomic Data on GCP

• Scaling Out
  • Docker Containers
  • Google Persistent Disks
  • Pipelines
  • Google Genomics APIs
  • BigQuery

Galaxy on Google Cloud Platform

Page 3: Scaling Galaxy on Google Cloud Platform

Google Cloud in Australia
Data center here in 2017

Page 4: Scaling Galaxy on Google Cloud Platform

Galaxy on GCP – Scale Up

Page 5: Scaling Galaxy on Google Cloud Platform

Demo 1 - Hello Galaxy on Google Cloud

Page 6: Scaling Galaxy on Google Cloud Platform

Demo 2 - Adding Galaxy Tools

Page 7: Scaling Galaxy on Google Cloud Platform

GCP Virtual Machine Services

• Cloud Storage (file) buckets
  • Source data
• Compute Engine Virtual Machines
  • Virtual Machine Image files
  • External VM persistent hard disks with your source data

Key Concepts:
-- VM configuration as code
-- Fast, cheap scalable VMs
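As a rough illustration of "VM configuration as code", the sketch below keeps a VM's settings in a small Python data structure and renders them into a `gcloud compute instances create` command. The instance name, zone, machine type, image, and disk size are placeholders chosen for this example, not settings taken from the talk.

```python
from dataclasses import dataclass
import shlex


@dataclass(frozen=True)
class VMConfig:
    """Settings for one Compute Engine VM (all values are illustrative)."""
    name: str
    zone: str
    machine_type: str
    image_family: str
    image_project: str
    boot_disk_size_gb: int


def create_command(cfg: VMConfig) -> list[str]:
    """Render the config as an argv list for the gcloud CLI."""
    return [
        "gcloud", "compute", "instances", "create", cfg.name,
        f"--zone={cfg.zone}",
        f"--machine-type={cfg.machine_type}",
        f"--image-family={cfg.image_family}",
        f"--image-project={cfg.image_project}",
        f"--boot-disk-size={cfg.boot_disk_size_gb}GB",
    ]


# Hypothetical Galaxy server definition; adjust every value for a real project.
galaxy_vm = VMConfig(
    name="galaxy-server",
    zone="australia-southeast1-a",
    machine_type="n1-standard-4",
    image_family="debian-12",
    image_project="debian-cloud",
    boot_disk_size_gb=100,
)

# Print the command instead of running it, so the sketch has no side effects.
print(shlex.join(create_command(galaxy_vm)))
```

Keeping the definition in a file like this means it can be reviewed and versioned; the printed command could be passed to `subprocess.run` once the values are real.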

Page 8: Scaling Galaxy on Google Cloud Platform

Scale Up Patterns

• Re-size Virtual Machines
• Attach more persistent disks
• Update base image
• Monitor with Stackdriver
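The first two scale-up patterns can be sketched as gcloud command sequences. The Python below only builds and prints the commands; the instance, zone, machine type, and disk names are hypothetical. It assumes the usual Compute Engine behaviour that an instance must be stopped before its machine type is changed.

```python
def resize_commands(instance: str, zone: str, new_machine_type: str) -> list[list[str]]:
    """Commands to stop a VM, change its machine type, and start it again."""
    base = ["gcloud", "compute", "instances"]
    zone_flag = f"--zone={zone}"
    return [
        base + ["stop", instance, zone_flag],
        base + ["set-machine-type", instance,
                f"--machine-type={new_machine_type}", zone_flag],
        base + ["start", instance, zone_flag],
    ]


def attach_disk_command(instance: str, zone: str, disk: str) -> list[str]:
    """Command to attach an existing persistent disk to a VM."""
    return ["gcloud", "compute", "instances", "attach-disk", instance,
            f"--disk={disk}", f"--zone={zone}"]


# Hypothetical names, used only to show the shape of the commands.
for cmd in resize_commands("galaxy-server", "australia-southeast1-a", "n1-highmem-8"):
    print(" ".join(cmd))
print(" ".join(attach_disk_command("galaxy-server", "australia-southeast1-a", "galaxy-data-2")))
```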

Page 10: Scaling Galaxy on Google Cloud Platform

Genomic Data

• Files at GCS
  • gs://genomics-public-data
• Query via BigQuery
  • https://bigquery.cloud.google.com/queries/genomics-public-data
• Code via Genomics API
  • Implements Global Alliance for Genomics and Health APIs
• Genome browser - https://gabrowse.appspot.com
• Google Genomics example code on GitHub
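Paths such as gs://genomics-public-data follow the `gs://<bucket>/<object>` pattern. A minimal sketch of splitting such a path in Python is shown below; the object name in the example is made up for illustration, and the HTTPS form assumes the object is publicly readable.

```python
from urllib.parse import quote, urlsplit


def parse_gs_uri(uri: str) -> tuple[str, str]:
    """Split a gs:// URI into (bucket, object_name)."""
    parts = urlsplit(uri)
    if parts.scheme != "gs" or not parts.netloc:
        raise ValueError(f"not a gs:// URI: {uri!r}")
    # The object name is the path without its leading slash (may be empty).
    return parts.netloc, parts.path.lstrip("/")


def public_https_url(uri: str) -> str:
    """HTTPS address of the same object, assuming it is publicly readable."""
    bucket, obj = parse_gs_uri(uri)
    return f"https://storage.googleapis.com/{bucket}/{quote(obj)}"


# The object path here is hypothetical; only the bucket name comes from the slide.
example = "gs://genomics-public-data/hypothetical/path/file.vcf"
print(parse_gs_uri(example))
print(public_https_url(example))
```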

Page 12: Scaling Galaxy on Google Cloud Platform

Galaxy on GCP – Scale Out

Page 13: Scaling Galaxy on Google Cloud Platform

GCP Docker Container Services

• Cloud Storage
• Container Engine / Docker

Key Concepts:
-- Container configuration as code
-- Fast, cheap scalable Docker Containers

Page 15: Scaling Galaxy on Google Cloud Platform

Scale Out Patterns

• Docker Container Cluster
• Kubernetes manager
• Container orchestration

Page 16: Scaling Galaxy on Google Cloud Platform

GCP Serverless Services

• Cloud Functions
• Microservices

Key Concepts:
-- Function configuration as code
-- Fast, cheap scalable Microservices

Page 17: Scaling Galaxy on Google Cloud Platform

Galaxy on GCP – Advanced Pipelines

Page 18: Scaling Galaxy on Google Cloud Platform

Demo 3 – Using the Google Genomics API & BigQuery

Page 19: Scaling Galaxy on Google Cloud Platform

BigQuery

• ANSI SQL Queries
• Query-as-a-service

Key Concepts:
-- SQL query configuration as code
-- Fast, cheap scalable SQL Queries
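One way to read "SQL query configuration as code" is to keep queries in small, testable functions. The sketch below builds a BigQuery standard-SQL string with a named query parameter. The table name and the `reference_name` column are assumptions made for illustration; check them against the dataset actually being queried.

```python
import re

# Accept only fully qualified names of the form project.dataset.table.
_TABLE_RE = re.compile(r"^[A-Za-z0-9_-]+\.[A-Za-z0-9_]+\.[A-Za-z0-9_]+$")


def variant_count_query(table: str) -> str:
    """Build a query that counts rows for one reference name.

    The value for @reference_name is supplied separately as a query
    parameter when the query is submitted, not pasted into the SQL text.
    """
    if not _TABLE_RE.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    return (
        "SELECT reference_name, COUNT(*) AS variant_count\n"
        f"FROM `{table}`\n"
        "WHERE reference_name = @reference_name\n"
        "GROUP BY reference_name"
    )


# Hypothetical table; replace with a real project.dataset.table.
print(variant_count_query("my-project.my_dataset.variants"))
```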

Page 20: Scaling Galaxy on Google Cloud Platform

Variant Analysis (architecture diagram)

• Actors and sources: Scientist, High Throughput Genome Sequencers
• Data Ingest: Genomics (BAM, FASTQ)
• Private Datasets and Public Datasets, all in Cloud Storage: MSSNG Autism, 1000 Genomes, Patient Data, Illumina Platform, Ref Genomes, TCGA
• Analytics: Online Analytics (BigQuery), Batch Analytics (Cloud Dataflow), Lab Notebooks (Cloud Datalab)

Page 21: Scaling Galaxy on Google Cloud Platform

Genomics, Secondary Analysis (architecture diagram)

• Stages: Ingest, Elastic Cluster, Storage, Analytics
• Actors and sources: Scientist, High Throughput Genome Sequencers
• Networking: Carrier Interconnect, Cloud Load Balancing, Cloud Network
• Compute Engine: Ingest Server, HPC Cluster (10 Nodes)
• Cloud Storage: Raw Datafiles, Processed Data
• Cloud SQL: Metadata
• Cloud Datalab: Lab notebooks
• BigQuery: Online Analytics

Page 22: Scaling Galaxy on Google Cloud Platform

Advanced GCP Pipelines Core Products

• Cloud Storage / Public datasets on GCP
• BigQuery
• Cloud Dataflow
• Genomics API

Key Concepts:
-- Pipeline configuration as code
-- Fast, cheap scalable cloud services

Page 23: Scaling Galaxy on Google Cloud Platform

Resources

Page 24: Scaling Galaxy on Google Cloud Platform

Resources

• Cloud Storage (files) -- here
• Compute Engine (VMs) -- here
• Container Engine (Docker) -- here
• BigQuery (SQL) -- here
• Cloud Dataflow (pipelines) -- here
• Genomics API -- here

• Genomics Cookbook -- here
• Public datasets on GCP -- here
• Google’s Genomic code samples -- here
• Lynn’s GitHub code samples -- here

Page 25: Scaling Galaxy on Google Cloud Platform

More about Google Cloud Services

Page 26: Scaling Galaxy on Google Cloud Platform

Google Cloud Platform Services Part One

• Compute: Compute Engine, App Engine, Container Engine, Container Registry, Cloud Functions
• Networking: Cloud Virtual Network, Cloud Load Balancing, Cloud CDN, Cloud Interconnect, Cloud DNS
• Big Data: BigQuery, Cloud Dataflow, Cloud Dataproc, Cloud Datalab, Cloud Pub/Sub, Genomics
• Identity & Security: Cloud IAM, Cloud Resource Manager, Cloud Security Scanner, Cloud Platform Security
• Storage and Databases: Cloud Storage, Cloud Bigtable, Cloud Datastore, Cloud SQL, Persistent Disk
• Machine Learning: Cloud Machine Learning, Vision API, Speech API, Natural Language API, Translation API, Jobs API

Page 27: Scaling Galaxy on Google Cloud Platform

Google Cloud Platform Services Part Two

• Management Tools: Stackdriver, Monitoring, Logging, Error Reporting, Trace, Debugger, Deployment Manager, Cloud Endpoints, Cloud Console
• Developer Tools: Cloud SDK, Deployment Manager, Cloud Source Repositories, Cloud Tools for Android Studio, Cloud Tools for IntelliJ, Cloud Tools for PowerShell, Cloud Tools for Visual Studio, Google Plug-in for Eclipse, Cloud Test Lab
• Also listed: Cloud Shell, Cloud Mobile App, Billing App, Cloud APIs

Page 30: Scaling Galaxy on Google Cloud Platform

GCE Persistence Options – Disks, etc.

• Image
  Created from: GCS file or disk
  Notes: File path <bucket>/<folder>/<file>. Disk must be detached from VM.
• Snapshot
  Created from: Disk or instance (boot)
  Notes: Can create an instance FROM a snapshot.
• Persistent Disk
  Created from: Image, snapshot, or blank
  Notes: Blank disk must be formatted. Can create an instance or snapshot FROM a disk.
• Bucket
  Created from: GCS console for file
  Notes: Access via path gs://<bucketName>/<fileName>
• VM Instance Boot Disk
  Created from: Image, snapshot, or disk
  Notes: Images -> OS, application, or custom image. Snapshot: N/A. Disk: from saved disk.
• VM Instance Additional Disk
  Created from: Local scratch, standard persistent, or SSD persistent
  Notes: Max 8 at 375 GB each. 500 GB 64 TB. Read/Write or Read Only. Attach up to 16 Disks* per VM.
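The local scratch limit quoted above (a maximum of 8 disks at 375 GB each) works out to 3000 GB per VM. The short sketch below does that arithmetic and rejects counts outside the quoted limit; the constants come from the slide and may differ from current Compute Engine limits.

```python
# Limits as quoted on the slide; treat them as assumptions, not current quotas.
LOCAL_SCRATCH_MAX_DISKS = 8
LOCAL_SCRATCH_DISK_GB = 375


def local_scratch_capacity_gb(disk_count: int) -> int:
    """Total local scratch space in GB for a given number of scratch disks."""
    if not 0 <= disk_count <= LOCAL_SCRATCH_MAX_DISKS:
        raise ValueError(
            f"disk_count must be between 0 and {LOCAL_SCRATCH_MAX_DISKS}")
    return disk_count * LOCAL_SCRATCH_DISK_GB


print(local_scratch_capacity_gb(LOCAL_SCRATCH_MAX_DISKS))  # 8 * 375 = 3000
```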
