CENTER FOR BRAIN SIMULATION
Automated Configuration and Administration of a Storage-class Memory System to Support Supercomputer-based Scientific Workflows

J. Bernard (1), P. Morjan (2), B. Hagley (3), F. Delalondre (1), F. Schürmann (1), B. Fitch (4), A. Curioni (5)

(1) Blue Brain Project (BBP), Geneva, Switzerland
(2) IBM, Böblingen, Germany
(3) Swiss National Computing Center (CSCS), Lugano, Switzerland
(4) IBM, Yorktown Heights, NY, USA
(5) IBM, Zurich, Switzerland
Outline
• Why do we need a storage-class memory system?
• Blue Brain Project hardware system design
• Why do we need system management automation?
• First implementation supporting user-defined system configuration for applications
Example complex workflow
[Workflow diagram: an HPC simulation writes to a key-value store and to GPFS; field voxelization, a volume renderer, and a report reader consume these data ("uses", "reads events from"). The components span BG/Q compute nodes, BGAS nodes, and visualization cluster nodes.]
Why use storage-class memory?
Multi-step, complex workflows require substantial effort from scientific application users/developers
• building the brain tissue model
• simulating its electrical evolution
• analysis of simulation results
• visualization

Brain modeling requires a large memory footprint
• rat brain model: about 100 TB
• estimate for a human brain: 100 PB
• DRAM alone is not cost-effective enough
• a memory hierarchy is required
Outline
• Why do we need a storage-class memory system?
• Blue Brain Project hardware system design
• Why do we need system management automation?
• First implementation supporting user-defined system configuration for applications
BBP resources at CSCS

[Diagram: Blue Gene/Q racks with their production I/O nodes and BGAS I/O nodes, alongside the Viz x86 compute cluster and the GSS storage cluster.]

System overview
• 4 racks of compute nodes: 8 midplanes & 4096 nodes
• 8 BG/Q production I/O drawers (64 nodes)
• 8 BGAS I/O drawers (64 nodes)
• GSS storage cluster
• x86 compute cluster

Distributed management
• CSCS storage team
• CSCS BG team
• BBP HPC & infrastructure team
BGAS I/O nodes compared to standard IONs
BGAS I/O nodes: hardware
• PCIe 2.0 x8
• InfiniBand replaced by 10 GbE
• optical cables between drawers
• <2,2,2> torus extended to <4,4,4>, potentially expandable to <8,8,8>
• 2 TiB SLC flash
BGAS I/O nodes: NVM user interfaces
Direct storage access (DSA)
• OFED RDMA verbs provider
• 0 – 1.4 TiB
• applications need to be modified

Block devices based on DSA
• ext4 block device
  • 0 – 1.4 TiB
  • some overhead, but POSIX
• GPFS
  • 0 – 89.6 TiB
  • NSDs communicate over iWARP
  • some overhead, but POSIX and no data-locality worries
HS4 flash card – partitioning
[Diagram: 2 TiB HS4 flash card split into flash partition 0 (block device: GPFS, ext4) and flash partition 1 (DSA), for 1.4 TiB of usable capacity.]

• 2 TiB raw, 0.6 TiB reserved for wear-leveling, 1.4 TiB usable
• native DSA interface
• verbs block device (VBD) on top of DSA for block access
• GPFS, ext4, and DSA partitions can each take from 0 to 100% of the usable capacity (see the sketch below)
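As a concrete illustration, the small script below splits the usable capacity between the three interfaces from user-supplied percentages. The hs4_partition command and the /dev/vbd0 device node are hypothetical placeholders, not the real tooling; only the capacities and the 100% constraint come from the slides.

  #!/bin/bash
  # Sketch: divide the 1.4 TiB usable HS4 capacity between DSA, ext4 and GPFS.
  # 'hs4_partition' and '/dev/vbd0' are hypothetical placeholders.
  set -euo pipefail
  if (( $# != 3 )); then echo "usage: $0 <dsa%> <ext4%> <gpfs%>" >&2; exit 1; fi
  dsa_pct=$1 ext4_pct=$2 gpfs_pct=$3
  usable_gib=1434                  # ~1.4 TiB left after the 0.6 TiB wear-leveling reserve
  if (( dsa_pct + ext4_pct + gpfs_pct > 100 )); then
      echo "requested partitions exceed 100% of usable flash" >&2
      exit 1
  fi
  hs4_partition \
      --dsa  "$(( usable_gib * dsa_pct  / 100 ))GiB" \
      --ext4 "$(( usable_gib * ext4_pct / 100 ))GiB" \
      --gpfs "$(( usable_gib * gpfs_pct / 100 ))GiB"
  mkfs.ext4 -q /dev/vbd0           # local ext4 on the verbs block device (VBD)
  mount /dev/vbd0 /mnt/flash-ext4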
Outline
• Why do we need a storage-class memory system?
• Blue Brain Project hardware system design
• Why do we need system management automation?
• First implementation supporting user-defined system configuration for applications
Why do we need automation?
• Highly configurable system supporting different memory interfaces (GPFS, SKV)
• Automated partitioning based on user requirements for fast application prototyping

[Diagram, built up across several slides: the circuit building, simulation, and analysis/visualization workflow stages each submit through their own SLURM queue onto BGAS partitions exposing GPFS, SKV, and ext4 interfaces.]
What do we want to automate?

System software management
• new major release
• release update

2-level partitioning
• cluster partitioning
• on-node flash memory partitioning

Integration with the rest of the eco-system (Blue Gene/Q, x86 cluster, GSS storage)
System maintenance & update
BGAS sandbox creation workflow (major release); a condensed script sketch follows
• build sandbox
• create ramdisk
• add RPMs
• compile GPFS kernel module
• install Soft-iWARP
• integrate with other services (copy config files to the sandbox)
  • ssh
  • Kerberos
  • SLURM
  • environment modules
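A sketch of how this build could be scripted is shown below. All paths, the create_ramdisk helper, and the package locations are illustrative assumptions, not the actual BBP tooling; the GPFS portability-layer build targets (make Autoconfig / World / InstallImages) are the standard GPFS ones.

  #!/bin/bash
  # Sketch of the sandbox creation workflow; paths and helper names are assumed.
  set -euo pipefail
  SANDBOX=/bgsys/bgas/sandboxes/$(date +%Y%m%d)   # assumed layout

  mkdir -p "$SANDBOX/rootfs"                      # build sandbox
  create_ramdisk --output "$SANDBOX/ramdisk.img"  # hypothetical helper
  rpm --root "$SANDBOX/rootfs" -ivh rpms/*.rpm    # add RPMs

  # compile the GPFS portability layer inside the sandbox
  chroot "$SANDBOX/rootfs" sh -c \
      'cd /usr/lpp/mmfs/src && make Autoconfig && make World && make InstallImages'

  # install Soft-iWARP
  rpm --root "$SANDBOX/rootfs" -ivh rpms/softiwarp/*.rpm

  # integrate with other services: copy config files into the sandbox
  for f in /etc/ssh/sshd_config /etc/krb5.conf /etc/krb5.keytab \
           /etc/slurm/slurm.conf /etc/profile.d/modules.sh; do
      cp -a "$f" "$SANDBOX/rootfs$f"
  done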
Shell access to BGAS I/O nodes
• SSH with Kerberos from any other BBP user node
  • add /etc/krb5.conf and /etc/krb5.keytab to the sandbox
• DNS and /etc/hosts
  • sshd must return the FQDN
  • consistent across all user-accessible nodes: Viz cluster, BGAS nodes, BG/Q and BGAS front-end nodes, BBP desktops
• limit access to users with running jobs (once fully productionized)

A configuration sketch follows below.
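Continuing the sandbox sketch, the fragment below shows the kind of configuration this implies. The host name and address are placeholders; GSSAPIAuthentication, GSSAPICleanupCredentials, and UseDNS are standard sshd_config options.

  # Sketch: enable Kerberos (GSSAPI) logins in the sandbox sshd.
  cp /etc/krb5.conf /etc/krb5.keytab "$SANDBOX/rootfs/etc/"
  {
      echo "GSSAPIAuthentication yes"
      echo "GSSAPICleanupCredentials yes"
      echo "UseDNS yes"
  } >> "$SANDBOX/rootfs/etc/ssh/sshd_config"
  # sshd must resolve its own FQDN consistently on every user-accessible node
  # (hypothetical host/address for illustration):
  echo "10.1.0.1  bbpbgas001.example.org  bbpbgas001" >> "$SANDBOX/rootfs/etc/hosts"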
User-defined configuration parameters
List of basic configuration parameters at partitioning time
• how many clusters: 1 to 8 BGAS clusters
• how many nodes per cluster: 8 to 64 nodes
• how much flash allocated to DSA: 0 to 100%
• how much flash allocated to local ext4: 0 to 100%
• how much flash allocated to GPFS: 0 to 100%

Advanced configuration (GPFS only)
• GPFS page pool: 1 to 8 GB
• GPFS block size: 64 KB to 4 MB

An example configuration file is sketched below.
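These parameters could be captured in a small per-partitioning configuration file. The variable names and file format are assumptions for illustration; the values sit inside the ranges above.

  # bgas_job.conf -- example user-defined configuration (illustrative format)
  BGAS_NUM_CLUSTERS=2          # 1..8 clusters
  BGAS_NODES_PER_CLUSTER=32    # 8..64 nodes per cluster
  BGAS_FLASH_DSA_PCT=50        # 0..100% of usable flash
  BGAS_FLASH_EXT4_PCT=20       # 0..100%
  BGAS_FLASH_GPFS_PCT=30       # 0..100%
  # advanced (GPFS only)
  BGAS_GPFS_PAGEPOOL=4G        # 1..8 GB
  BGAS_GPFS_BLOCKSIZE=1M       # 64 KB..4 MB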
Overview of partitioning workflow
On the service node
• free block
• boot block

On all I/O nodes in the block
• partition flash
• partition & set up ext4

On the first node of each block
• set up GPFS
• integration: grant remote access to the Viz cluster
• integration: set up remote mounts from the GSS cluster

A driver-script sketch follows.
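The sketch below strings these steps together, assuming the configuration file from the previous slide. The free_block/boot_block commands, the block_to_nodes helper, and the per-step scripts are hypothetical names standing in for the control-system and site tooling.

  #!/bin/bash
  # Sketch of the partitioning workflow; helper names are assumed.
  set -euo pipefail
  source ./bgas_job.conf               # user-defined parameters (see previous slide)
  block=$1                             # e.g. I0-32
  nodes=$(block_to_nodes "$block")     # hypothetical: block name -> host list
  first=${nodes%% *}

  # on the service node: recycle and boot the I/O block
  free_block "$block"
  boot_block "$block"

  # on all I/O nodes in the block: partition flash, set up local ext4
  for n in $nodes; do
      ssh "$n" /bgas/scripts/partition_flash.sh \
          "$BGAS_FLASH_DSA_PCT" "$BGAS_FLASH_EXT4_PCT" "$BGAS_FLASH_GPFS_PCT" &
  done
  wait

  # on the first node of the block: GPFS plus cross-cluster integration
  ssh "$first" /bgas/scripts/create_gpfs_cluster.sh "$block"
  ssh "$first" /bgas/scripts/grant_viz_access.sh
  ssh "$first" /bgas/scripts/mount_gss_remote.sh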
Partitioning BGAS nodes into I/O blocks
[Diagram: example I/O blocks on the torus, e.g. I0-64, I0-32, I4-32, I6-16.]

15 possible I/O blocks connected via the <4,4,4> 3D torus
• 8 drawers / 64 nodes: drawers 0-7
• 4 drawers / 32 nodes: 0-3, 4-7
• 2 drawers / 16 nodes: 0-1, 2-3, 4-5, 6-7
• 1 drawer / 8 nodes: 0, 1, 2, 3, 4, 5, 6, 7
(the enumeration can be scripted, as sketched below)
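The enumeration is mechanical, as the small loop below shows; the I<start-drawer>-<nodes> naming is assumed from the figure labels.

  # Enumerate the 15 possible I/O blocks (8 nodes per drawer).
  for drawers in 8 4 2 1; do
      for (( start = 0; start < 8; start += drawers )); do
          echo "I${start}-$(( drawers * 8 ))"
      done
  done
  # prints I0-64, I0-32, I4-32, I0-16, ..., I7-8 -- 15 blocks in total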
Compute node partitions
[Diagram: compute node partitions mapped onto the corresponding I/O blocks (I0-64, I0-32, I4-32, I6-16).]
BGAS GPFS cluster creation
Clusters are identified & authenticated using
• the cluster name
• an automatically generated cluster ID
• an automatically generated SSL certificate that authenticates the name & ID
BGAS GPFS remote cluster access
Integrating with the rest of the system
Remote access requires
• key generation
• certificate exchange
• mmauth on the server cluster
• mmremote* on the client cluster

But …
• all of these require root
• mmauth update? it depends on the cluster ID not changing

The corresponding GPFS commands are sketched below.
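In plain GPFS commands the exchange looks roughly as follows; the cluster names, contact nodes, file system name, and key-file paths are placeholders.

  # On the storage (server) cluster, as root: authorize a BGAS client cluster
  # using the public key it generated with 'mmauth genkey new'.
  mmauth add bgas-I0-32.example.org -k /tmp/bgas-I0-32_id_rsa.pub
  mmauth grant bgas-I0-32.example.org -f gss_fs1

  # On the client (BGAS) cluster, as root: define the remote cluster and fs.
  mmremotecluster add gss.example.org -n gss01,gss02 -k /tmp/gss_id_rsa.pub
  mmremotefs add gss_fs1 -f gss_fs1 -C gss.example.org -T /gpfs/gss_fs1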
BGAS GPFS cluster creation, step 1
Integrating with the rest of the system
Set up 15 clusters
• one for each possible BGAS I/O block
• extract and save the cluster names, IDs, and certificates
• exchange certificates with the GSS and Viz cluster admins
  • the GSS cluster authorizes each BGAS cluster
  • each BGAS cluster authorizes mounts by the Viz cluster
  • the Viz cluster adds each cluster & its file system
• delete the clusters (their saved identities are re-injected when a cluster is recreated at partition time; see step 2)
BGAS GPFS cluster creation, step 2
Integrating with the rest of the system
• mmcrcluster
• mmauth genkey new
• <TOTALLY UNSUPPORTED>
  • cp -af $CERT_DIR/* /var/mmfs/ssl
  • new_cluster_id=$(mmlsconfig clusterID …)
  • sed "s/old_cluster_id/new_cluster_id/" mmfs.cfg …
  • mmauth genkey propagate
• </TOTALLY UNSUPPORTED>

Re-injecting the saved certificate and cluster ID keeps the remote authorizations from step 1 valid across repartitioning.
Sharing scripts and public certificates
Integrating with the rest of the system
• git repo hosted at EPFL
• commit access for BBP and CSCS
• automated checkout by Puppet on the Viz cluster
• checkout by a non-root user with read-only access
Mounting BGAS GPFS on the Viz cluster
Integrating with the rest of the system
• BGAS GPFS file systems come and go
• Viz nodes get rebooted
• when a BGAS file system is created
  • touch a status file on a GSS file system
• every N minutes
  • check the status file
  • if mtime < N, or uptime < N: mmlsfs && mmmount || mmumount
  • run by a non-root admin user via sudo

A cron-script sketch follows.
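A sketch of that periodic check, e.g. run from cron on every Viz node; the status-file path, file system name, and the value of N are placeholders.

  #!/bin/bash
  # Sketch: mount/unmount the BGAS file system based on the status file.
  N=5                                   # minutes, matching the cron interval
  status=/gpfs/gss_fs1/bgas/status      # touched when a BGAS fs is (re)created
  fs=bgas_fs1

  now=$(date +%s)
  mtime=$(stat -c %Y "$status" 2>/dev/null || echo 0)
  up_min=$(awk '{ print int($1 / 60) }' /proc/uptime)

  # act only if the status file changed recently or this node just rebooted
  if (( (now - mtime) / 60 < N )) || (( up_min < N )); then
      # mirror the slide's one-liner: mmlsfs && mmmount || mmumount
      sudo mmlsfs "$fs" >/dev/null 2>&1 && sudo mmmount "$fs" || sudo mmumount "$fs"
  fi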
Repartitioning performance
[Chart: boot time and partition time in seconds (0–2500) for blocks of 8, 16, 32, and 64 nodes.]
Outline
• Why do we need a storage-class memory system?
• Blue Brain Project hardware system design
• Why do we need system management automation?
• First implementation supporting user-defined system configuration for applications
User experience – expected workflow
Configuring BGAS
• configuring BGAS according to multiple teams' needs (multi-tenancy)
• configuring a BGAS cluster according to one team's needs

Using BGAS for fast scientific development
• from IBM Blue Gene/Q
• from the Viz cluster
• from BGAS itself as a regular cluster
Expected user development cycle
Configuring BGAS

Super-user (manager, PI)
• decides how the cluster should be partitioned, based on several teams' needs (every few weeks)

Team developers
• decide how they want the flash of their cluster partitioned, at job submission time (every few days)
Using BGAS from Blue Gene/Q: switching I/O links automatically
$ sinfo
PARTITION  AVAIL JOB_SIZE TIMELIMIT  CPUS S:C:T  NODES STATE     MIDPLANELIST
debug*     down  1-256    1:00:00    8K   512:16 512   idle      bgq1011
test       up    1-256    6:00:00    8K   512:16 512   allocated bgq1001
prod       up    512-2K   7-00:00:00 8K   512:16 512   drained   bgq1000
prod       up    512-2K   7-00:00:00 8K   512:16 2.50K allocated bgq[0000x0011,1001]
prod-large up    1-4K     2-12:00:00 8K   512:16 512   drained   bgq1000
prod-large up    1-4K     2-12:00:00 8K   512:16 2.50K allocated bgq[0000x0011,1001]
prod-large up    1-4K     2-12:00:00 8K   512:16 1024  idle      bgq[1010x1011]
bgas       up    1-4K     1-00:00:00 8K   512:16 512   drained   bgq1000
bgas       up    1-4K     1-00:00:00 8K   512:16 1.50K allocated bgq[0000x0001,1001]
bgas       up    1-4K     1-00:00:00 8K   512:16 2K    idle      bgq[0010x1011]
• the BGAS queue is seen as a regular queue
• jobs run on CNK compute nodes
• I/O is routed automatically to the BGAS nodes instead of the production I/O nodes
Using BGAS from Blue Gene/Q: switching I/O links automatically

Switching between I/O nodes
• compute nodes are cabled to both sets of IONs
• only one link can be active
• compute nodes need to be rebooted

SLURM prolog (sketched below)
• check the partition for "bgas"
• are the requested BGAS IONs already linked?
• if not, deallocate the compute nodes and switch the links
• restart the job, boot the compute nodes
• leave links in place to minimize rebooting
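A sketch of that prolog logic follows; the bgas_* helpers are hypothetical stand-ins for the Blue Gene control-system calls, and the SLURM_* variables are the usual prolog environment.

  #!/bin/bash
  # Sketch of the SLURM prolog; bgas_* helpers are hypothetical.
  [[ ${SLURM_JOB_PARTITION:-} == bgas ]] || exit 0   # only act for the bgas queue

  wanted=$(bgas_ions_for "$SLURM_JOB_NODELIST")      # hypothetical: IONs the job needs
  current=$(bgas_active_links "$SLURM_JOB_NODELIST") # hypothetical: IONs currently linked

  if [[ $wanted != "$current" ]]; then
      # switching links requires rebooting the compute nodes
      bgas_free_compute_block "$SLURM_JOB_NODELIST"
      bgas_switch_links "$SLURM_JOB_NODELIST" --to "$wanted"
      bgas_boot_compute_block "$SLURM_JOB_NODELIST"
  fi
  # otherwise leave the links in place to minimize rebooting
  exit 0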
Using BGAS as an independent cluster
$ sinfo
PARTITION   AVAIL JOB_SIZE TIMELIMIT  CPUS S:C:T  NODES STATE NODELIST
bgas001-008 down  1-8      2-00:00:00 1    1:12:1 8     idle  bbpbgas[001-008]
bgas009-016 down  1-8      2-00:00:00 1    1:12:1 8     idle  bbpbgas[009-016]
bgas017-024 down  1-8      2-00:00:00 1    1:12:1 8     idle  bbpbgas[017-024]
bgas025-032 down  1-8      2-00:00:00 1    1:12:1 8     idle  bbpbgas[025-032]
bgas033-040 down  1-8      2-00:00:00 1    1:12:1 8     idle  bbpbgas[033-040]
bgas041-048 down  1-8      2-00:00:00 1    1:12:1 8     idle  bbpbgas[041-048]
bgas049-056 down  1-8      2-00:00:00 1    1:12:1 8     idle  bbpbgas[049-056]
bgas057-064 down  1-8      2-00:00:00 1    1:12:1 8     idle  bbpbgas[057-064]
bgas001-016 down  1-16     2-00:00:00 1    1:12:1 16    idle  bbpbgas[001-016]
bgas017-032 down  1-16     2-00:00:00 1    1:12:1 16    idle  bbpbgas[017-032]
bgas033-048 down  1-16     2-00:00:00 1    1:12:1 16    idle  bbpbgas[033-048]
bgas049-064 down  1-16     2-00:00:00 1    1:12:1 16    idle  bbpbgas[049-064]
bgas001-032 down  1-32     2-00:00:00 1    1:12:1 32    idle  bbpbgas[001-032]
bgas033-064 down  1-32     2-00:00:00 1    1:12:1 32    idle  bbpbgas[033-064]
bgas001-064 up    1-64     2-00:00:00 1    1:12:1 64    idle  bbpbgas[001-064]
• log in to a BGAS front-end node
• use the queue of your BGAS cluster
• all queues are visible, but only the queues of clusters that have actually been created are up
GPFS IOR performance, reads, MiB/node
[Chart: read bandwidth in MiB/s per node (0–1400) over the 4–16384 x-axis range, for 1, 2, and 4 processes per node. RDMA over the roq (iWARP) interface, 1 MB blocks.]
GPFS IOR performance, writes, MiB/node
[Chart: write bandwidth in MiB/s per node (0–1400) over the same 4–16384 x-axis range, for 1, 2, and 4 processes per node. RDMA over the roq (iWARP) interface, 1 MB blocks.]
Next steps
Further integration to increase automation
• integration of BGAS cluster partitioning with SLURM
• integration of flash partitioning with SLURM
• complete integration of all services

Getting user feedback & experience

Enhancement & addition of new services
• performance benchmarking of data store interfaces (SKV, …)
• automated data transfer/copy between BGAS & GSS
• multicluster allocation via co-scheduling to support complex workflow execution