CENTER FOR BRAIN SIMULATION

Automated Configuration and Administration of a Storage-class Memory System to Support Supercomputer-based Scientific Workflows

J. Bernard 1, P. Morjan 2, B. Hagley 3, F. Delalondre 1, F. Schürmann 1, B. Fitch 4, A. Curioni 5

1 Blue Brain Project (BBP), Geneva, Switzerland
2 IBM, Böblingen, Germany
3 Swiss National Computing Center (CSCS), Lugano, Switzerland
4 IBM, Yorktown Heights, NY, USA
5 IBM, Zurich, Switzerland

Automated Configuration and Administration of a …spscicomp.org/wordpress/wp-content/uploads/2015/05/...Administration of a Storage-class Memory System to Support Supercomputer-based


Outline

•  Why do we need a storage-class memory system?

•  Blue Brain Project hardware system design

•  Why do we need system management automation?

•  First implementation supporting application user-defined system configuration

Example complex workflow

[Figure: workflow diagram. Components: HPC Simulation, Key-Value Store, Field Voxelization, Volume Renderer, Report Reader, GPFS. Edge labels: "writes", "uses", "reads events from". Node groups: Visualization Cluster Nodes, Compute, BGAS Nodes.]

Why use storage-class memory?

Multi-step, complex workflows require a lot of effort from scientific application users and developers
•  Building the brain tissue model
•  Simulating electrical evolution
•  Analysis of simulation results
•  Visualization

Brain modeling requires a large memory footprint
•  Rat brain: about 100 TB
•  Estimate for human brain: 100 PB
•  DRAM is not cost-effective enough
•  Requires a memory hierarchy


BBP resources at CSCS

[Figure: system diagram. Labels: Blue Gene/Q, Blue Gene/Q I/O nodes, BGAS I/O nodes, Viz x86 compute cluster, GSS storage cluster.]

System Overview
•  4 racks of compute nodes: 8 midplanes & 4096 nodes
•  8 BG/Q Production I/O drawers (64 nodes)
•  8 BGAS I/O drawers (64 nodes)
•  GSS storage cluster
•  x86 compute cluster

Distributed management
•  CSCS storage team
•  CSCS BG team
•  BBP HPC & infrastructure team


BGAS I/O nodes compared to standard IONs

BGAS I/O nodes: hardware

•  PCIe 2.0 x8
•  Infiniband replaced by 10 GbE
•  optical cables between drawers
•  <2,2,2> torus extended to <4,4,4>
•  potentially expandable to <8,8,8>
•  2 TiB SLC flash

BGAS I/O nodes: NVM user interfaces

Direct storage access (DSA)
•  OFED RDMA verbs provider
•  0 – 1.4 TiB
•  applications need to be modified

Block devices based on DSA
•  Ext4 block device
    •  0 – 1.4 TiB
    •  overhead, but POSIX
•  GPFS
    •  0 – 89.6 TiB
    •  NSDs communicate over iWARP
    •  overhead, no data locality worries, POSIX

HS4 flash card – partitioning

[Figure: the 1.4 TiB usable capacity divided into flash partition 0 (block device; GPFS, EXT4) and flash partition 1 (DSA).]

•  2 TiB
    •  0.6 TiB for wear-leveling
    •  1.4 TiB usable
•  Native DSA interface
•  Verbs block device (VBD) on top of DSA for block access
•  GPFS, EXT4, DSA partitions can all be from 0 to 100% of the usable capacity
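
The capacity figures on this slide and the previous one can be checked with a little arithmetic. The following is a minimal sketch, not the project's tooling: the function names are hypothetical, and the rule that the three shares together may not exceed 100% is an assumption (the slide only gives a 0 to 100% range for each partition).

```python
from decimal import Decimal
from typing import Dict

RAW_TIB = Decimal("2.0")            # raw flash per card (from the slide)
WEAR_LEVELING_TIB = Decimal("0.6")  # reserved for wear-leveling (from the slide)
BGAS_NODES = 64                     # 8 BGAS I/O drawers, 64 nodes (from the slides)


def usable_tib() -> Decimal:
    """Usable flash per node: raw capacity minus the wear-leveling reserve."""
    return RAW_TIB - WEAR_LEVELING_TIB


def split_flash(dsa_pct: int, ext4_pct: int, gpfs_pct: int) -> Dict[str, Decimal]:
    """Return the TiB given to each interface on one node.

    Assumption: the shares must sum to at most 100%.
    """
    shares = {"dsa": dsa_pct, "ext4": ext4_pct, "gpfs": gpfs_pct}
    for name, pct in shares.items():
        if not 0 <= pct <= 100:
            raise ValueError(f"{name} share must be between 0 and 100")
    if sum(shares.values()) > 100:
        raise ValueError("shares add up to more than 100%")
    return {name: usable_tib() * pct / 100 for name, pct in shares.items()}


def max_gpfs_tib(nodes: int = BGAS_NODES) -> Decimal:
    """Largest GPFS file system if every node gives all usable flash to GPFS."""
    return usable_tib() * nodes
```

With all 64 nodes contributing their full 1.4 TiB, `max_gpfs_tib()` reproduces the 89.6 TiB upper bound quoted for GPFS.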


Why do we need automation?

•  Highly configurable system supporting different memory interfaces (GPFS, SKV)
•  Automated partitioning based on user requirements for fast application prototyping

[Figure, built up over three slides: workflow stages (circuit building; simulation; analysis, visualization), three SLURM queues, and storage interfaces (GPFS, SKV, ext4).]

What do we want to automate?

System software management
•  New major release
•  Release update

2-level partitioning
•  Cluster partitioning
•  On-node flash memory partitioning

Integration with the rest of the ecosystem (Blue Gene/Q, x86 cluster, GSS storage)

System maintenance & update

BGAS sandbox creation workflow (major release)
•  build sandbox
•  create ramdisk
•  add RPMs
•  compile GPFS kernel module
•  install Soft-iWARP
•  integrate with other services (cp config files to sandbox)
    •  ssh
    •  kerberos
    •  SLURM
    •  environment modules

Shell access to BGAS I/O nodes

•  SSH with Kerberos from any other BBP user node
    •  add /etc/krb5.conf, /etc/krb5.keytab to sandbox
•  DNS and /etc/hosts
    •  sshd must return FQDN
    •  consistent across all user-accessible nodes
        •  Viz cluster
        •  BGAS nodes
        •  BG/Q and BGAS front end nodes
        •  BBP desktops
•  Limit access to users with running jobs
    •  when fully productionized

User-defined configuration parameters

List of basic configuration parameters at partitioning time
•  How many clusters: 1 to 8 BGAS clusters
•  How many nodes per cluster: 8 to 64 nodes
•  How much flash allocated for DSA: 0 to 100%
•  How much flash allocated to local ext4: 0 to 100%
•  How much flash allocated to GPFS: 0 to 100%

Advanced configuration (only for GPFS)
•  From 1 to 8 GB GPFS page pool
•  From 64 KB to 4 MB GPFS block size
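
The parameter ranges above lend themselves to a simple validation step before any partitioning starts. The sketch below is illustrative only: the class name `BgasConfig`, its field names, and the default values are hypothetical. Two checks go beyond the slide and are assumptions: that the clusters together cannot use more than the 64 BGAS nodes, and that the three flash shares cannot exceed 100% in total. KB and MB are read as binary units here.

```python
from dataclasses import dataclass
from typing import List

KIB = 1024
MIB = 1024 * KIB
TOTAL_BGAS_NODES = 64  # 8 BGAS I/O drawers, 64 nodes (from the slides)


@dataclass(frozen=True)
class BgasConfig:
    clusters: int
    nodes_per_cluster: int
    dsa_pct: int
    ext4_pct: int
    gpfs_pct: int
    gpfs_pagepool_gb: int = 1       # default value is an assumption
    gpfs_block_size: int = 1 * MIB  # default value is an assumption

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the request is valid."""
        errors = []
        if not 1 <= self.clusters <= 8:
            errors.append("clusters must be 1..8")
        if not 8 <= self.nodes_per_cluster <= 64:
            errors.append("nodes_per_cluster must be 8..64")
        # Assumption: all clusters share the same pool of 64 BGAS nodes.
        if self.clusters * self.nodes_per_cluster > TOTAL_BGAS_NODES:
            errors.append("clusters * nodes_per_cluster exceeds 64 nodes")
        shares = {"dsa_pct": self.dsa_pct, "ext4_pct": self.ext4_pct,
                  "gpfs_pct": self.gpfs_pct}
        for name, pct in shares.items():
            if not 0 <= pct <= 100:
                errors.append(f"{name} must be 0..100")
        # Assumption: the three partitions share one card, so they sum to <= 100%.
        if sum(shares.values()) > 100:
            errors.append("flash shares exceed 100%")
        if not 1 <= self.gpfs_pagepool_gb <= 8:
            errors.append("gpfs_pagepool_gb must be 1..8")
        if not 64 * KIB <= self.gpfs_block_size <= 4 * MIB:
            errors.append("gpfs_block_size must be 64 KiB..4 MiB")
        return errors
```

For example, `BgasConfig(1, 64, 0, 0, 100).validate()` returns an empty list, while requesting two 64-node clusters is rejected by the node-count check.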

Overview of partitioning workflow

On service node
•  Free block
•  Boot block

On all I/O nodes in the block
•  Partition flash
•  Partition & set up ext4

On first node of each block
•  Set up GPFS
•  Integration: Grant remote access to Viz cluster
•  Integration: Set up remote mounts from GSS cluster
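
The ordering of these steps can be written down as a plan of (host, step) pairs. This is a sketch under assumptions: the function name, the `service-node` host name, and the strictly sequential ordering are all hypothetical, and the real scripts may well run the per-node steps in parallel.

```python
from typing import List, Tuple


def partition_plan(block_nodes: List[str],
                   service_node: str = "service-node") -> List[Tuple[str, str]]:
    """Build the ordered (host, step) list for one I/O block."""
    if not block_nodes:
        raise ValueError("an I/O block needs at least one node")
    # Steps run on the service node.
    plan = [(service_node, "free block"), (service_node, "boot block")]
    # Steps run on every I/O node in the block.
    for node in block_nodes:
        plan.append((node, "partition flash"))
        plan.append((node, "partition & set up ext4"))
    # Steps run only on the first node of the block.
    first = block_nodes[0]
    plan.append((first, "set up GPFS"))
    plan.append((first, "grant remote access to Viz cluster"))
    plan.append((first, "set up remote mounts from GSS cluster"))
    return plan
```

Keeping the plan as data makes it easy to print it for review before anything is executed on the machine.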

Partitioning BGAS nodes into I/O blocks

[Figure: BGAS drawers grouped into I/O blocks; labels shown: I0-64, I4-32, I0-32, I6-16, I0-48, etc.]

15 possible I/O blocks connected via <4,4,4> 3D torus
•  8 drawers / 64 nodes: 0-7
•  4 drawers / 32 nodes: 0-3, 4-7
•  2 drawers / 16 nodes: 0-1, 2-3, 4-5, 6-7
•  1 drawer / 8 nodes: 0, 1, 2, 3, 4, 5, 6, 7
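
The count of 15 follows from halving the 8 drawers repeatedly and keeping only aligned groups: 1 + 2 + 4 + 8 = 15. A minimal sketch that enumerates them is shown below; the function name and the tuple layout are hypothetical.

```python
from typing import List, Tuple

DRAWERS = 8           # BGAS I/O drawers (from the slides)
NODES_PER_DRAWER = 8  # 64 nodes / 8 drawers


def io_blocks() -> List[Tuple[int, int, int]]:
    """Return (first_drawer, last_drawer, node_count) for every possible block."""
    blocks = []
    size = DRAWERS
    while size >= 1:
        # Only blocks aligned to their own size are listed on the slide.
        for first in range(0, DRAWERS, size):
            blocks.append((first, first + size - 1, size * NODES_PER_DRAWER))
        size //= 2
    return blocks
```

The result starts with the single 8-drawer block `(0, 7, 64)` and ends with the eight 1-drawer blocks.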

Compute node partitions

[Figure: compute node partitions; labels shown: I0-64, I4-32, I0-32, I6-16, I0-48.]

BGAS GPFS cluster creation

Clusters identified & authenticated using
•  Cluster name
•  Automatically generated cluster ID
•  Automatically generated SSL certificate to authenticate the name & ID

BGAS GPFS remote cluster access
Integrating with rest of the system

Remote access requires
•  Key generation
•  Certificate exchange
•  mmauth on server cluster
•  mmremote* on client cluster

But …
•  All of these require root mmauth update?
•  Depends on cluster ID not changing

BGAS GPFS cluster creation, step 1
Integrating with rest of the system

Set up 15 clusters
•  One for each possible BGAS I/O block
•  Extract and save cluster names, IDs, certificates
•  Exchange certs with GSS and Viz cluster admins
•  GSS cluster authorizes each BGAS cluster
•  Each BGAS cluster authorizes mounts by Viz cluster
•  Viz cluster adds each cluster & its file system
•  Delete the clusters (they are re-created at partitioning time; see step 2)

BGAS GPFS cluster creation, step 2
Integrating with rest of the system

•  mmcrcluster
•  mmauth genkey new
•  <TOTALLY UNSUPPORTED>
•  cp -af $CERT_DIR/* /var/mmfs/ssl
•  new_cluster_id=$(mmlsconfig clusterID …)
•  sed "s/old_cluster_id/new_cluster_id/" mmfs.cfg …
•  mmauth genkey propagate
•  </TOTALLY UNSUPPORTED>

Sharing scripts and public certificates
Integrating with rest of the system

•  git repo hosted at EPFL
•  commit access for BBP and CSCS
•  automated checkout by puppet on Viz cluster
•  checkout by non-root user with read-only access

Mounting BGAS GPFS on Viz cluster
Integrating with rest of the system

•  BGAS GPFS file systems come and go
•  Viz nodes get rebooted
•  when a BGAS file system is created
    •  touch a status file on a GSS file system
•  every N minutes
    •  check the status file
    •  if mtime < N, or uptime < N
        •  mmlsfs && mmmount || mmumount
        •  non-root admin user via sudo
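
The periodic check described above can be separated into a pure decision step and the command sequence it triggers. The sketch below is one reading of the slide, not the actual script: it assumes "mtime < N" means the status file was modified less than N minutes ago, the function names are hypothetical, and no GPFS command is actually executed. The second function only models which commands the shell chain `mmlsfs && mmmount || mmumount` would run.

```python
from typing import List


def should_refresh_mounts(status_mtime: float, boot_time: float,
                          now: float, interval_minutes: int) -> bool:
    """True if the status file changed, or the node booted, within the last N minutes.

    All times are seconds since the epoch.
    """
    window = interval_minutes * 60
    status_changed = (now - status_mtime) < window
    recently_booted = (now - boot_time) < window
    return status_changed or recently_booted


def refresh_commands(fs_listed: bool, mount_ok: bool) -> List[str]:
    """Model `mmlsfs && mmmount || mmumount` with normal shell semantics.

    mmumount runs when mmlsfs fails, and also when mmmount fails.
    """
    commands = ["mmlsfs"]
    if fs_listed:
        commands.append("mmmount")
        if mount_ok:
            return commands
    commands.append("mmumount")
    return commands
```

Because the check only fires inside the N-minute window, a Viz node that has been up for a long time does nothing until a BGAS file system is created and the status file is touched again.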

Repartitioning performance

[Figure: chart with series "boot time (sec)" and "partition time (sec)"; horizontal axis ticks 8, 16, 32, 64; vertical axis from 0 to 2500.]


User experience – expected workflow

Configuring BGAS
•  Configuration of BGAS according to multiple teams’ needs (multi-tenancy)
•  Configuring a BGAS cluster according to one team’s needs

Using BGAS for fast scientific development
•  From IBM Blue Gene/Q
•  From Viz cluster
•  From BGAS itself as regular cluster

Expected user development cycle
Configuring BGAS

Super-user (manager, PI) decides how the cluster should be partitioned based on several teams’ needs (few weeks)

Team developers decide how they want the flash of their cluster partitioned at job submission time (few days)

Using BGAS from Blue Gene/Q: switching I/O links automatically

$ sinfo
PARTITION   AVAIL  JOB_SIZE  TIMELIMIT   CPUS  S:C:T   NODES  STATE      MIDPLANELIST
debug*      down   1-256     1:00:00     8K    512:16  512    idle       bgq1011
test        up     1-256     6:00:00     8K    512:16  512    allocated  bgq1001
prod        up     512-2K    7-00:00:00  8K    512:16  512    drained    bgq1000
prod        up     512-2K    7-00:00:00  8K    512:16  2.50K  allocated  bgq[0000x0011,1001]
prod-large  up     1-4K      2-12:00:00  8K    512:16  512    drained    bgq1000
prod-large  up     1-4K      2-12:00:00  8K    512:16  2.50K  allocated  bgq[0000x0011,1001]
prod-large  up     1-4K      2-12:00:00  8K    512:16  1024   idle       bgq[1010x1011]
bgas        up     1-4K      1-00:00:00  8K    512:16  512    drained    bgq1000
bgas        up     1-4K      1-00:00:00  8K    512:16  1.50K  allocated  bgq[0000x0001,1001]
bgas        up     1-4K      1-00:00:00  8K    512:16  2K     idle       bgq[0010x1011]

•  BGAS queue seen as regular queue
•  Jobs will run on cnk nodes
•  I/O will be routed automatically to BGAS nodes instead of production I/O nodes

Using BGAS from Blue Gene/Q: switching I/O links automatically

Switching between I/O nodes
•  Compute nodes cabled to both sets of IONs
•  Only one link can be active
•  Compute nodes need to be rebooted

SLURM prolog
•  Check partition for “bgas”
•  Are requested BGAS IONs already linked?
•  If not, deallocate compute nodes, switch links
•  Restart job, boot compute nodes
•  Leave links in place to minimize rebooting
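
The prolog logic amounts to a small decision function. The sketch below is a hedged illustration, not the actual SLURM prolog: the function name and the action strings are hypothetical, and what happens for jobs outside the "bgas" partition is not described on the slide, so the sketch simply returns no actions in that case.

```python
from typing import List


def prolog_actions(partition: str, bgas_ions_linked: bool) -> List[str]:
    """Return the ordered actions the prolog would take for one job."""
    if "bgas" not in partition:
        # Assumption: non-BGAS jobs are out of scope for this sketch.
        return []
    if bgas_ions_linked:
        # Links were left in place by an earlier job, so no reboot is needed.
        return []
    return [
        "deallocate compute nodes",
        "switch I/O links to BGAS IONs",
        "restart job",
        "boot compute nodes",
    ]
```

Returning an empty list when the links are already in place reflects the last bullet: consecutive BGAS jobs on the same nodes avoid the reboot.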

Using BGAS as an independent cluster

$ sinfo
PARTITION    AVAIL  JOB_SIZE  TIMELIMIT   CPUS  S:C:T   NODES  STATE  NODELIST
bgas001-008  down   1-8       2-00:00:00  1     1:12:1  8      idle   bbpbgas[001-008]
bgas009-016  down   1-8       2-00:00:00  1     1:12:1  8      idle   bbpbgas[009-016]
bgas017-024  down   1-8       2-00:00:00  1     1:12:1  8      idle   bbpbgas[017-024]
bgas025-032  down   1-8       2-00:00:00  1     1:12:1  8      idle   bbpbgas[025-032]
bgas033-040  down   1-8       2-00:00:00  1     1:12:1  8      idle   bbpbgas[033-040]
bgas041-048  down   1-8       2-00:00:00  1     1:12:1  8      idle   bbpbgas[041-048]
bgas049-056  down   1-8       2-00:00:00  1     1:12:1  8      idle   bbpbgas[049-056]
bgas057-064  down   1-8       2-00:00:00  1     1:12:1  8      idle   bbpbgas[057-064]
bgas001-016  down   1-16      2-00:00:00  1     1:12:1  16     idle   bbpbgas[001-016]
bgas017-032  down   1-16      2-00:00:00  1     1:12:1  16     idle   bbpbgas[017-032]
bgas033-048  down   1-16      2-00:00:00  1     1:12:1  16     idle   bbpbgas[033-048]
bgas049-064  down   1-16      2-00:00:00  1     1:12:1  16     idle   bbpbgas[049-064]
bgas001-032  down   1-32      2-00:00:00  1     1:12:1  32     idle   bbpbgas[001-032]
bgas033-064  down   1-32      2-00:00:00  1     1:12:1  32     idle   bbpbgas[033-064]
bgas001-064  up     1-64      2-00:00:00  1     1:12:1  64     idle   bbpbgas[001-064]

•  Log in to BGAS front end node
•  Use the queue of your BGAS cluster
•  All created queues visible, only created cluster queues up
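
The 15 queue names in the listing follow a regular pattern, and only the queues of clusters that currently exist are marked "up". The sketch below regenerates the names in the same order and derives the AVAIL column from a set of created clusters. The function names are hypothetical, and the naming rule is inferred from the `sinfo` output above rather than taken from the project's SLURM configuration.

```python
from typing import Dict, List, Set

TOTAL_NODES = 64  # BGAS I/O nodes (from the slides)


def partition_names() -> List[str]:
    """Queue names in listing order: 8-node blocks first, the 64-node block last."""
    names = []
    for size in (8, 16, 32, 64):
        for first in range(1, TOTAL_NODES + 1, size):
            names.append(f"bgas{first:03d}-{first + size - 1:03d}")
    return names


def availability(created: Set[str]) -> Dict[str, str]:
    """Mark a queue "up" only if its BGAS cluster has been created."""
    return {name: ("up" if name in created else "down")
            for name in partition_names()}
```

Calling `availability({"bgas001-064"})` reproduces the AVAIL column shown above, with one queue up and the other fourteen down.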

GPFS IOR performance, reads, MiB/node

[Figure: chart with series "1 proc/node, read", "2 proc/node, read", "4 proc/node, read"; horizontal axis ticks from 4 to 16384 in powers of two; vertical axis from 0 to 1400. RDMA over roq (iWARP) interface, 1MB blocks.]

GPFS IOR performance, writes, MiB/node

[Figure: chart with series "1 proc/node, write", "2 proc/node, write", "4 proc/node, write"; horizontal axis ticks from 4 to 16384 in powers of two; vertical axis from 0 to 1400. RDMA over roq (iWARP) interface, 1MB blocks.]

Next steps

Further integration to increase automation
•  Integration of BGAS cluster partitioning with SLURM
•  Integration of flash partitioning with SLURM
•  Complete integration of all services

Getting user feedback & experience

Enhancement & addition of new services
•  Performance benchmarking of data store interfaces (SKV, …)
•  Automated data transfer/copy between BGAS & GSS
•  Multicluster allocation via co-scheduling to support complex workflow execution