
Integration of OpenStack and Amazon Web Service into local batch job system

Wataru Takase, Tomoaki Nakamura, Koichi Murakami, Takashi Sasaki
Computing Research Center, KEK, Japan


Background: KEK Batch Job System

• 10000 CPU cores

• Scientific Linux 6

• IBM Spectrum LSF

[Diagram: users remotely log in to work servers for interactive work and job submission; the LSF batch job scheduler dispatches jobs from job queues to the calc. servers of the batch service.]


Background: Challenges for the Batch Job System

• Requirements for specific systems from experiment groups
• Piled-up pending jobs due to resource shortage
• Take advantage of cloud computing:
  • Provide heterogeneous clusters
  • Expand computing resources to clouds


Overview of Cloud-integrated Batch Job System (Test Phase)

• Use cloud resources via the batch job submission command:
$ bsub -q aws /bin/hostname

[Diagram: the LSF Resource Connector[1] performs queue-based resource selection across on-premise resources (SL6 cluster, OpenStack) and off-premise resources (AWS, other clouds).]

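As an illustration of queue-based selection, the submission queue decides where a job runs. A minimal sketch, assuming the queue names below; only the aws queue name is given above, while the OpenStack queue name and the default local queue are assumptions for illustration:

$ bsub -q aws /bin/hostname    # dispatched to an AWS instance
$ bsub -q os /bin/hostname     # dispatched to an OpenStack VM (assumed queue name)
$ bsub /bin/hostname           # default queue: local SL6 batch cluster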

Integration with OpenStack

1. A group manager creates a custom image from the base image.
2. The cloud admin creates a Resource Connector template:

{
    "Name": "CentOS7_01",
    "Attributes": {
        "type": ["String", "X86_64"],
        "openstackhost": ["Numeric", "1"],
        "template": ["CentOS7_01"]
    },
    "Image": "generic-cent7-01",
    "Flavor": "c04-m016G"
}

3. The end user submits a job to LSF.
4. LSF launches an instance (calc. server VM) on OpenStack.
5. LSF dispatches the job to the VM.

Normal jobs are still dispatched to the physical machines (SL6) of the batch service.
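A hedged sketch of how such a template could be targeted at submission time; the queue name and the job script are hypothetical, while the openstackhost resource name comes from the template above:

$ bsub -q os -R "select[openstackhost==1]" ./analysis.sh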


Share GPFS between Local Batch and OpenStack

• Each compute node mounts GPFS and exposes the directories to its VMs via NFS.

[Diagram: OpenStack compute nodes mount GPFS and re-export it over NFS to the calc. server VMs; the physical calc. servers of the batch service mount GPFS directly.]
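A minimal sketch of the re-export, assuming GPFS is mounted at /gpfs and the VMs sit on a hypothetical 192.168.100.0/24 tenant network:

# On the OpenStack compute node: export the GPFS directory over NFS
echo "/gpfs 192.168.100.0/24(rw,sync,no_root_squash)" >> /etc/exports
exportfs -ra
# On each calc. server VM: mount the export from the compute node
mkdir -p /gpfs
mount -t nfs 192.168.100.1:/gpfs /gpfs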


Integration with AWS

• LSF calc. servers are launched on demand as EC2 instances and reached from KEK over a VPN connection.
• Jobs submitted to the AWS queue run on these instances; the OpenStack queue and the other queues dispatch to OpenStack and to the physical machines (SL6) at KEK.
• The filesystem is not shared with the KEK batch system; S3 object storage is used for sharing input/output data between KEK and AWS.


Use AWS S3 Object Storage for Sharing Data between KEK and AWS

1. Put input data (KEK work server → S3 bucket).
2. Copy input data (S3 bucket → AWS calc. server).
3. Submit job.
4. Copy output data (AWS calc. server → S3 bucket).
5. Get output data (S3 bucket → KEK).

• The KEK batch system and OpenStack share the GPFS filesystem at KEK.
• The AWS environment is independent of the KEK system.
• S3FS[2] allows Linux to mount an AWS S3 bucket via FUSE.
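A sketch of the data sharing with S3FS; the bucket name, mount point, and file paths are hypothetical:

# On a KEK work server and on the AWS calc. servers: mount the shared bucket
s3fs kek-job-data /mnt/s3 -o passwd_file=${HOME}/.passwd-s3fs
# KEK side: put input data
cp /gpfs/group1/input.dat /mnt/s3/
# AWS side: copy input, run the job, copy output back
cp /mnt/s3/input.dat . && ./run_job.sh && cp output.dat /mnt/s3/
# KEK side: get output data
cp /mnt/s3/output.dat /gpfs/group1/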


Available Resource Transition on AWS

[Plots: transition of the number of instances and the total number of cores on AWS after jobs are submitted.]


Scalability Test: Submission of Geant4 based Particle Therapy Monte Carlo Simulation Jobs to AWS

[Figure: mass density distribution generated from input CT data, simulated dose distribution, and the particle beam direction.]

• Monte Carlo simulation shooting 2,000,000 protons, split over N CPU cores (N jobs).
• For example, with N = 10, each of the 10 jobs simulates 200,000 events.
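A sketch of the job splitting; the wrapper script run_g4sim.sh and its options are hypothetical placeholders for the actual Geant4 application:

# Split 2,000,000 protons evenly across N jobs submitted to the AWS queue
N=10
EVENTS=$((2000000 / N))
for i in $(seq 1 $N); do
  bsub -q aws ./run_g4sim.sh --nevents $EVENTS --seed $i
done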


Scalability Test: Submission of Geant4 based Particle Therapy Monte Carlo Simulation Jobs

• Scalability comparison between KEK and AWS.
[Plots: scalability on AWS and on KEK (better CPU and file system).]
• The AWS result shows the same scaling trend as the KEK result.


Scalability Test: Image Classification by Deep Learning on AWS

• Classify CIFAR-10 images[3] into 10 categories.

• We built a Convolutional Neural Network and trained it for the classification using TensorFlow[4].

[3] https://www.cs.toronto.edu/~kriz/cifar.html [4] https://www.tensorflow.org/tutorials/deep_cnn

[Diagram: Convolutional Neural Network with conv1, pool1, conv2, pool2, FC1, and FC2 layers; an input image is classified into a category such as "automobile", with feedback during training.]


Scalability Test: Image Classification by Multi-node Deep Learning on AWS

• Submitted TensorFlow jobs to the AWS queue and measured scalability by changing the number of workers.
• TensorFlow cluster: a parameter server stores and updates the parameters; the workers calculate the loss.

[Plot: CIFAR-10 training time of 500 epochs for the CNN; training time [sec] and training/test accuracy versus the number of used cores. Training time drops from 23,000 sec (6.5 hours) with 1 worker (64 cores) to about 1,000 sec with 30 workers (1920 cores); at 57 workers (3648 cores) a network bandwidth bottleneck is suspected.]
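A hedged sketch of how the distributed training jobs could be submitted; trainer.py and its flags follow the classic parameter-server pattern and are hypothetical placeholders for the actual setup:

# One parameter server plus N workers, all submitted to the AWS queue
PS="ps0:2222"
WORKERS="worker0:2222,worker1:2222,worker2:2222"
bsub -q aws python trainer.py --job_name=ps --task_index=0 --ps_hosts=$PS --worker_hosts=$WORKERS
for i in 0 1 2; do
  bsub -q aws python trainer.py --job_name=worker --task_index=$i --ps_hosts=$PS --worker_hosts=$WORKERS
done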


Another Use Case: Automatic Offloading to Cloud

Submit 3,000 jobs to the mixed-resources (KEK and AWS) queue:
1. Some jobs are dispatched to KEK servers.
2. When no free resource is left at KEK, AWS instances are launched and some jobs are dispatched to them.
3. When free resource is found at KEK again, some jobs are dispatched to KEK servers.
4. Some jobs are dispatched to AWS servers.

[Plot: status (PEND/RUN) of each job over time.]
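A minimal sketch of the submission; the queue name "mixed" and the job script are hypothetical:

# Submit the 3,000 jobs as an LSF job array to the mixed-resources queue
bsub -q mixed -J "offload[1-3000]" ./job.sh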


Summary

• We have integrated the OpenStack and AWS clouds with the LSF batch job system using Resource Connector.
• The system is in a test phase.
• We have succeeded in offloading some batch workloads to the cloud.
• The cloud resources used in this work were provided through the Demonstration Experiment of Cloud Use conducted by the National Institute of Informatics (NII), Japan (FY2017).

