40
www.nimbusproject.org Infrastructure Clouds for Science and Education: Platform Tools 1 7/16/2012 Kate Keahey, Renato J. Figueiredo, John Bresnahan, Mike Wilde, David LaBissoniere Argonne National Laboratory Computation Institute, University of Chicago University of Florida

Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

Infrastructure Clouds for Science and

Education: Platform Tools

1 7/16/2012

Kate Keahey, Renato J. Figueiredo, John Bresnahan, Mike Wilde, David LaBissoniere

Argonne National Laboratory

Computation Institute, University of Chicago

University of Florida

Page 2: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

The Power of Infrastructure Clouds

Virtualization opens the flood gates

7/16/2012 2

• Outsourcing

• Virtual appliances

– Freeze your stack in time

– Run it anywhere

• Multi-cloud applications

– Run many copies all over the world!

• Elasticity

Page 3: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

Harnessing The Power

• Organization tools and techniques

7/16/2012 3

Page 4: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

Towards a Power Adapter

7/16/2012 4

Page 5: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

What Needs To Be Harnessed

• VM (appliance) creation and development – configuration management tools (chef, puppet)

• VM hypervisors – Infrastructure-as-a-Service (IaaS)

• Cloud applications – virtual clusters, cloudinit.d, CloudFormation,

• Elasticity – Auto-scaling tools, phantom

• Workflow – Swift, etc

7/16/2012 5

Page 6: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

What Needs To Be Organized?

• VM (appliance) creation and development – configuration management tools (chef, puppet)

• VM hypervisors – Infrastructure-as-a-Service (IaaS)

• Cloud applications – virtual clusters, cloudinit.d, CloudFormation,

• Elasticity – Auto-scaling tools, phantom

• Workflow – Swift, etc

7/16/2012 6

Page 7: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

VM Applications

• An entire system frozen in time

– Full software stacks (versions)

– Configuration files

– Important for science!

• A dedicated modular service

– Web service, database, AMQP node, etc

• Demos

• A binary single file (or set of files)

– Easy to freeze

7/16/2012 7

Page 8: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

Developing Appliances

• A single binary image?

– Many developers?

– Version control?

– Merging conflicts?

• Base image with a description

– Ex: Ubuntu 11.04 base images plus a set of scripts

• Configuration Management Software

– Chef, Puppet, FG Rain, etc

7/16/2012 8

Page 9: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

• Software stack description – ruby and json

• A library of cookbooks

• Cookbooks contain recipes – Ex: apache2 server with php4

• Attributes to customize each recipe – Ex: on what port will apache listen

• Templates for configuration files

• Appliance developers make recipes – Version control can be done with git/svn/cvs…

7/16/2012 9

Chef

Page 10: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

Example Recipe

7/16/2012 10

app_dir = node[:appdir] ve_dir = node[:virtualenv][:path] git app_dir do repository node[:autoscale][:git_repo] reference node[:autoscale][:git_branch] action :sync user node[:username] group node[:groupname] end execute "run install" do cwd app_dir user node[:username] group node[:groupname] command "python setup.py install" end

Page 11: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

Example Template

7/16/2012 11

phantom: system: type: epu rabbit: <%= node[:autoscale][:rabbit_host] %> rabbit_port: <%= node[:autoscale][:rabbit_port] %> rabbit_ssl: False rabbit_user: <%= node[:autoscale][:rabbit_username] %> rabbit_pw: <%= node[:autoscale][:rabbit_password] %> rabbit_exchange: <%= node[:autoscale][:rabbit_exchange] %> authz: type: sqldb dburl: <%= node[:autoscale][:dburl] %>

phantom: system: type: epu rabbit: vm-102.uc.futuregrid.org rabbit_port: 5672 rabbit_ssl: False rabbit_user: XXX rabbit_pw: PPPPPP rabbit_exchange: default_dashi_exchange authz: type: sqldb dburl: mysql://nimbus:[email protected]/testphantom

Page 12: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

What Needs To Be Organized?

• VM (appliance) creation and development – configuration management tools (chef, puppet)

• VM hypervisors – Infrastructure-as-a-Service (IaaS)

• Cloud applications – virtual clusters, cloudinit.d, CloudFormation,

• Elasticity – Auto-scaling tools, phantom

• Workflow – Swift, etc

7/16/2012 12

Page 13: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

Cloud Applications

• More than 1 VM needed for the job

• Information exchange is needed

– Manual information exchange

• Multi-cloud

– Cloud independence required

7/16/2012 13

Web Server database

Web Web Web Server

nginx

Web Servers

Page 14: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

Cloud Management Tools

• Architecture description

– VM type, location, count

– Volumes

– Networks

– Other services

• Contextualization

– Exchange dynamically determined information • IP addrs, security information.

– Bootstrap component connections • Ex: mount NFS, connect to DB, etc

7/16/2012 14

Page 15: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

A Simplified Deployment Scenario

7/16/2012 15

Page 16: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

A Grid in Your Pocket…

7/16/2012 16

Pierre

EC2

Page 17: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

A Grid in Your Pocket…

7/16/2012 17

Jamie

EC2 OOI private cloud

Pierre

Page 18: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org 7/16/2012 18

Jamie David

EC2 OOI private cloud FutureGrid

A Grid in Your Pocket…

Pierre

Page 19: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

CloudFormation

• Assemble AWS services

– Run AMIs.

– Connect EBS volumes to AMIs

– Associate and SQS queue, etc

• JSON descriptions

• AWS only

• No configuration management software integration

– Manual integration with Chef

7/16/2012 19

Page 20: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

cloudinit.d

• Multicloud VM dependency management – Uses the libcloud abstraction library

• Integrated with chef solo

• ini file format descriptions – Coupled with any executable script

• Launch plan end-users/operators – Lightweight

– Copy launch plan and “one click” action

– Easily reconfigured for various clouds

• Launch plan/application developers: – Minimal software assumptions (ssh)

– “Stem cell” deployment approach

– Incremental launch plan development

7/16/2012 20

[svc-alamoHTTP]

iaas_key: XXXXXX

iaas_secret: XXXX

iaas_host: alamo.futuregrid.org

iaas_port: 8443

iaas: Nimbus

image: ubunut10.10

ssh_username: ubuntu

localsshkeypath: ~/.ssh/fg.pem

readypgm: http-test.py

bootpgm: http-boot.sh

Page 21: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

cloudinit.d Overview

• Services

• Run Levels

– Collections of

services without

dependencies on

each other

• Launch Plan

– An ordered set of

run levels

7/16/2012 21

Page 22: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

Cloudinit.d Features

7/16/2012 22

database

Web Server Web Server Web Server

• Repeatability: write a launch plan once, deploy many times

Launch plan

Page 23: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

Cloudinit.d Features

7/16/2012 23

database

Web Server Web Server Web Server

• Deploy on cloud and non-cloud resources from many providers

Launch plan

Page 24: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

Cloudinit.d Features

7/16/2012 24

database

Web Server Web Server Web Server

• Coordination of interdependent launches

Launch plan

Ru

n-level 1

R

un

-level 2

Page 25: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

Cloudinit.d Features

7/16/2012 25

database

Web Server Web Server Web Server

Launch plan

Ru

n-level 1

R

un

-level 2

• User-defined launch tests

Page 26: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

Cloudinit.d Features

7/16/2012 26

database

Web Server Web Server Web Server

Launch plan

Ru

n-level 1

R

un

-level 2

• Test-based monitoring and repair

Page 27: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

Cloudinit.d Features

7/16/2012 27

database

Web Server Web Server Web Server

Launch plan

Ru

n-level 1

R

un

-level 2

• Test-based monitoring and repair

Page 28: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

Cloudinit.d Iaas

Interface

A Single Service Application Boot

Infrastructure Cloud

Request a new VM

Check Status New VM

sshd Verify ssh works

bootpgm

Run the boot program….

VM HTTP Server readypgm

Run the ready program…

If the has a successful exit code (0), then the new simple cloud application is set to go!

The VM is running Now the VM has been contextualized to be a web server

scp over the boot contextualization program…

scp over the ready program

Poll the IaaS service to determine when the VM is running…

sshd needs to startup and be accessible on the new VM

Here we show how cloudinit.d automatically creates a HTTP server from a simple distribution base image

Page 29: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

What Needs To Be Organized?

• VM (appliance) creation and development – configuration management tools (chef, puppet)

• VM hypervisors – Infrastructure-as-a-Service (IaaS)

• Cloud applications – virtual clusters, cloudinit.d, CloudFormation,

• Elasticity – Auto-scaling tools, phantom

• Workflow – Swift, etc

7/16/2012 29

Page 30: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

Escalation Pattern

7/16/2012 30

Operational Units

User Domain (configuration and security)

Domain Management: Monitor and regulate domain properties based on

system-specific and application-specific metrics

• Challenge: leverage on-demand, large but unreliable provider pool – Applications that absorb resources

– Applications that tolerate failures

Page 31: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

Scaling Considerations

• Reasons to scale – Business vs science

• Cost vs quota

• Lossy environment – VMs fail more often than bare metal

– N preserving

• Spot instances – If the price is right

• Backfill – If resources are idle

7/16/2012 31

Page 32: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

Amazon Auto Scaling and CloudWatch

• Auto Scaling in EC2 – Policies to scale up and down servers

• Min, Max, and desired size

• Integrated with AWS CloudWatch Sensors – Triggers

– CPU load, disk capacity, load balancer loads,etc

– Custom sensors

• No contextualization

• REST API

• AWS only

7/16/2012 32

Page 33: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

Phantom Scaling Services • Multi-cloud

– Fail-over and even distribution policies

• Monitor scaling factors and failures – Generic/system qualities: deployment status,

load, bank account, etc.

– Application-specific qualities, e.g., a workload queue for ALiEn, PBS, AMQP, and others

• Evaluate against policies

• Scale and/or recover – For user components

– For system components

– Across different cloud providers

• Release as a Service

• 0.1 running on FutureGrid now – Initially available as a service on FutureGrid

resources

– Provides high availability

7/16/2012 33

Sensor information

Reliably provision, manage and contextualize resources

Apply Policy

Page 34: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

Infrastructure Platform Goals • Multi-cloud

– Work across private, community and commercial clouds

• Any Scale – Scale in response to a diverse set of sensors/triggers

– Both system and application sensors

• High Availability – “Any VM can die”: system or user VMs

– Minimizing time to recovery (TTR)

• Your Polices, Our Enactment – User-defined sensors/triggers and policies

• Engineered from the ground up to work with infrastructure clouds

• Easy on the user

7/16/2012 34

Page 35: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

How Can Science Plug Into This Power

Example Embarrassingly Parallel

Scientific Application

Demonstration

7/16/2012 36

Page 36: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

M subtask messages

Task Queue

Application Start the workers

Using Nimbus Domains

Page 37: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

Preserve N worker VMs

M subtask messages

Cumulus/S3

Message Queue

“N preserving” policy

Infrastructure Compute Cloud

Get task

Results/Checkpoints

Application Start the workers

Using Nimbus Domains

Page 38: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

Phantom Architecture

7/16/2012 39

MySQL

nginx

REST HTTPS

Web Application HTTPS REST Service

Web Application

FutureGrid Clouds

RabbitMQ EPUM

Provisioner

DTRS

Zookeeper Cluster

REST Service REST Service IaaS

Clouds

Page 39: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

Adventures in Availability

• Time to scale (TTS)

– PENDING (request)

– STARTED (deployment)

– RUNNING

(contextualization)

7/16/2012 40

TTS: preliminary results for 2,000 VMs provisioned on AWS EC2

Page 40: Infrastructure Clouds for Science and Education: Platform ... · Phantom Scaling Services • Multi-cloud – Fail-over and even distribution policies • Monitor scaling factors

www.nimbusproject.org

Application adaptation:

Applications

7/16/2012 41

Infrastructure Platform Contextualization, multi-cloud bridge, repeatable launches, scaling, elasticity

and High Availability

Schedulers

Elastic MapReduce

Workflow Systems (Swift) Data Transfer Systems

Science Gateways Custom Applications (OOI)

Library of generic sensors

Application-specific sensors

Policies Decision Engine