CENTER FOR BRAIN SIMULATION
Automated Configuration and Administration of a Storage-class Memory System to Support Supercomputer-based Scientific Workflows

J. Bernard (1), P. Morjan (2), B. Hagley (3), F. Delalondre (1), F. Schürmann (1), B. Fitch (4), A. Curioni (5)

(1) Blue Brain Project (BBP), Geneva, Switzerland
(2) IBM, Böblingen, Germany
(3) Swiss National Computing Center (CSCS), Lugano, Switzerland
(4) IBM, Yorktown Heights, NY, USA
(5) IBM, Zurich, Switzerland
Outline
• Why do we need a storage-class memory system?
• Blue Brain Project hardware system design
• Why do we need system management automation?
• First implementation supporting user-defined system configuration for applications
Example complex workflow
[Workflow diagram: an HPC simulation writes to a key-value store and to GPFS; field voxelization, a volume renderer, and a report reader consume these data ("uses", "reads events from"). The components span BG/Q compute nodes, BGAS nodes, and visualization cluster nodes.]
Why use storage-class memory?
Multi-step, complex workflows require substantial effort from scientific application users/developers
• building the brain tissue model
• simulating its electrical evolution
• analysis of simulation results
• visualization

Brain modeling requires a large memory footprint
• rat brain model: about 100 TB
• estimate for a human brain: 100 PB
• DRAM alone is not cost-effective enough
• a memory hierarchy is required
Outline
• Why do we need a storage-class memory system?
• Blue Brain Project hardware system design
• Why do we need system management automation?
• First implementation supporting user-defined system configuration for applications
BBP resources at CSCS

[Diagram: Blue Gene/Q racks with their production I/O nodes and BGAS I/O nodes, alongside the Viz x86 compute cluster and the GSS storage cluster.]

System overview
• 4 racks of compute nodes: 8 midplanes & 4096 nodes
• 8 BG/Q production I/O drawers (64 nodes)
• 8 BGAS I/O drawers (64 nodes)
• GSS storage cluster
• x86 compute cluster

Distributed management
• CSCS storage team
• CSCS BG team
• BBP HPC & infrastructure team
BGAS I/O nodes compared to standard IONs
BGAS I/O nodes: hardware
• PCIe 2.0 x8
• InfiniBand replaced by 10 GbE
• optical cables between drawers
• <2,2,2> torus extended to <4,4,4>, potentially expandable to <8,8,8>
• 2 TiB SLC flash
BGAS I/O nodes: NVM user interfaces
Direct storage access (DSA)
• OFED RDMA verbs provider
• 0 – 1.4 TiB
• applications need to be modified

Block devices based on DSA
• ext4 block device
  • 0 – 1.4 TiB
  • some overhead, but POSIX
• GPFS
  • 0 – 89.6 TiB
  • NSDs communicate over iWARP
  • some overhead, but POSIX and no data-locality worries
HS4 flash card – partitioning
[Diagram: 2 TiB HS4 flash card split into flash partition 0 (block device: GPFS, ext4) and flash partition 1 (DSA), for 1.4 TiB of usable capacity.]

• 2 TiB raw, 0.6 TiB reserved for wear-leveling, 1.4 TiB usable
• native DSA interface
• verbs block device (VBD) on top of DSA for block access
• GPFS, ext4, and DSA partitions can each take from 0 to 100% of the usable capacity (see the sketch below)
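As a concrete illustration, the small script below splits the usable capacity between the three interfaces from user-supplied percentages. The hs4_partition command and the /dev/vbd0 device node are hypothetical placeholders, not the real tooling; only the capacities and the 100% constraint come from the slides.

  #!/bin/bash
  # Sketch: divide the 1.4 TiB usable HS4 capacity between DSA, ext4 and GPFS.
  # 'hs4_partition' and '/dev/vbd0' are hypothetical placeholders.
  set -euo pipefail
  if (( $# != 3 )); then echo "usage: $0 <dsa%> <ext4%> <gpfs%>" >&2; exit 1; fi
  dsa_pct=$1 ext4_pct=$2 gpfs_pct=$3
  usable_gib=1434                  # ~1.4 TiB left after the 0.6 TiB wear-leveling reserve
  if (( dsa_pct + ext4_pct + gpfs_pct > 100 )); then
      echo "requested partitions exceed 100% of usable flash" >&2
      exit 1
  fi
  hs4_partition \
      --dsa  "$(( usable_gib * dsa_pct  / 100 ))GiB" \
      --ext4 "$(( usable_gib * ext4_pct / 100 ))GiB" \
      --gpfs "$(( usable_gib * gpfs_pct / 100 ))GiB"
  mkfs.ext4 -q /dev/vbd0           # local ext4 on the verbs block device (VBD)
  mount /dev/vbd0 /mnt/flash-ext4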
Outline
• Why do we need a storage-class memory system?
• Blue Brain Project hardware system design
• Why do we need system management automation?
• First implementation supporting user-defined system configuration for applications
Why do we need automation?
• Highly configurable system supporting different memory interfaces (GPFS, SKV)
• Automated partitioning based on user requirements for fast application prototyping

[Diagram, built up across several slides: the circuit building, simulation, and analysis/visualization workflow stages each submit through their own SLURM queue onto BGAS partitions exposing GPFS, SKV, and ext4 interfaces.]
What do we want to automate?

System software management
• new major release
• release update

2-level partitioning
• cluster partitioning
• on-node flash memory partitioning

Integration with the rest of the eco-system (Blue Gene/Q, x86 cluster, GSS storage)
System maintenance & update
BGAS sandbox creation workflow (major release); a condensed script sketch follows
• build sandbox
• create ramdisk
• add RPMs
• compile GPFS kernel module
• install Soft-iWARP
• integrate with other services (copy config files to the sandbox)
  • ssh
  • Kerberos
  • SLURM
  • environment modules
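A sketch of how this build could be scripted is shown below. All paths, the create_ramdisk helper, and the package locations are illustrative assumptions, not the actual BBP tooling; the GPFS portability-layer build targets (make Autoconfig / World / InstallImages) are the standard GPFS ones.

  #!/bin/bash
  # Sketch of the sandbox creation workflow; paths and helper names are assumed.
  set -euo pipefail
  SANDBOX=/bgsys/bgas/sandboxes/$(date +%Y%m%d)   # assumed layout

  mkdir -p "$SANDBOX/rootfs"                      # build sandbox
  create_ramdisk --output "$SANDBOX/ramdisk.img"  # hypothetical helper
  rpm --root "$SANDBOX/rootfs" -ivh rpms/*.rpm    # add RPMs

  # compile the GPFS portability layer inside the sandbox
  chroot "$SANDBOX/rootfs" sh -c \
      'cd /usr/lpp/mmfs/src && make Autoconfig && make World && make InstallImages'

  # install Soft-iWARP
  rpm --root "$SANDBOX/rootfs" -ivh rpms/softiwarp/*.rpm

  # integrate with other services: copy config files into the sandbox
  for f in /etc/ssh/sshd_config /etc/krb5.conf /etc/krb5.keytab \
           /etc/slurm/slurm.conf /etc/profile.d/modules.sh; do
      cp -a "$f" "$SANDBOX/rootfs$f"
  done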
Shell access to BGAS I/O nodes
• SSH with Kerberos from any other BBP user node
  • add /etc/krb5.conf and /etc/krb5.keytab to the sandbox
• DNS and /etc/hosts
  • sshd must return the FQDN
  • consistent across all user-accessible nodes: Viz cluster, BGAS nodes, BG/Q and BGAS front-end nodes, BBP desktops
• limit access to users with running jobs (once fully productionized)

A configuration sketch follows below.
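Continuing the sandbox sketch, the fragment below shows the kind of configuration this implies. The host name and address are placeholders; GSSAPIAuthentication, GSSAPICleanupCredentials, and UseDNS are standard sshd_config options.

  # Sketch: enable Kerberos (GSSAPI) logins in the sandbox sshd.
  cp /etc/krb5.conf /etc/krb5.keytab "$SANDBOX/rootfs/etc/"
  {
      echo "GSSAPIAuthentication yes"
      echo "GSSAPICleanupCredentials yes"
      echo "UseDNS yes"
  } >> "$SANDBOX/rootfs/etc/ssh/sshd_config"
  # sshd must resolve its own FQDN consistently on every user-accessible node
  # (hypothetical host/address for illustration):
  echo "10.1.0.1  bbpbgas001.example.org  bbpbgas001" >> "$SANDBOX/rootfs/etc/hosts"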
User-defined configuration parameters
List of basic configuration parameters at partitioning time
• how many clusters: 1 to 8 BGAS clusters
• how many nodes per cluster: 8 to 64 nodes
• how much flash allocated to DSA: 0 to 100%
• how much flash allocated to local ext4: 0 to 100%
• how much flash allocated to GPFS: 0 to 100%

Advanced configuration (GPFS only)
• GPFS page pool: 1 to 8 GB
• GPFS block size: 64 KB to 4 MB

An example configuration file is sketched below.
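These parameters could be captured in a small per-partitioning configuration file. The variable names and file format are assumptions for illustration; the values sit inside the ranges above.

  # bgas_job.conf -- example user-defined configuration (illustrative format)
  BGAS_NUM_CLUSTERS=2          # 1..8 clusters
  BGAS_NODES_PER_CLUSTER=32    # 8..64 nodes per cluster
  BGAS_FLASH_DSA_PCT=50        # 0..100% of usable flash
  BGAS_FLASH_EXT4_PCT=20       # 0..100%
  BGAS_FLASH_GPFS_PCT=30       # 0..100%
  # advanced (GPFS only)
  BGAS_GPFS_PAGEPOOL=4G        # 1..8 GB
  BGAS_GPFS_BLOCKSIZE=1M       # 64 KB..4 MB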
Overview of partitioning workflow
On the service node
• free block
• boot block

On all I/O nodes in the block
• partition flash
• partition & set up ext4

On the first node of each block
• set up GPFS
• integration: grant remote access to the Viz cluster
• integration: set up remote mounts from the GSS cluster

A driver-script sketch follows.
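The sketch below strings these steps together, assuming the configuration file from the previous slide. The free_block/boot_block commands, the block_to_nodes helper, and the per-step scripts are hypothetical names standing in for the control-system and site tooling.

  #!/bin/bash
  # Sketch of the partitioning workflow; helper names are assumed.
  set -euo pipefail
  source ./bgas_job.conf               # user-defined parameters (see previous slide)
  block=$1                             # e.g. I0-32
  nodes=$(block_to_nodes "$block")     # hypothetical: block name -> host list
  first=${nodes%% *}

  # on the service node: recycle and boot the I/O block
  free_block "$block"
  boot_block "$block"

  # on all I/O nodes in the block: partition flash, set up local ext4
  for n in $nodes; do
      ssh "$n" /bgas/scripts/partition_flash.sh \
          "$BGAS_FLASH_DSA_PCT" "$BGAS_FLASH_EXT4_PCT" "$BGAS_FLASH_GPFS_PCT" &
  done
  wait

  # on the first node of the block: GPFS plus cross-cluster integration
  ssh "$first" /bgas/scripts/create_gpfs_cluster.sh "$block"
  ssh "$first" /bgas/scripts/grant_viz_access.sh
  ssh "$first" /bgas/scripts/mount_gss_remote.sh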
Partitioning BGAS nodes into I/O blocks
[Diagram: example I/O blocks on the torus, e.g. I0-64, I0-32, I4-32, I6-16.]

15 possible I/O blocks connected via the <4,4,4> 3D torus
• 8 drawers / 64 nodes: drawers 0-7
• 4 drawers / 32 nodes: 0-3, 4-7
• 2 drawers / 16 nodes: 0-1, 2-3, 4-5, 6-7
• 1 drawer / 8 nodes: 0, 1, 2, 3, 4, 5, 6, 7
(the enumeration can be scripted, as sketched below)
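The enumeration is mechanical, as the small loop below shows; the I<start-drawer>-<nodes> naming is assumed from the figure labels.

  # Enumerate the 15 possible I/O blocks (8 nodes per drawer).
  for drawers in 8 4 2 1; do
      for (( start = 0; start < 8; start += drawers )); do
          echo "I${start}-$(( drawers * 8 ))"
      done
  done
  # prints I0-64, I0-32, I4-32, I0-16, ..., I7-8 -- 15 blocks in total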
Compute node partitions
[Diagram: compute node partitions mapped onto the corresponding I/O blocks (I0-64, I0-32, I4-32, I6-16).]
BGAS GPFS cluster creation
Clusters are identified & authenticated using
• the cluster name
• an automatically generated cluster ID
• an automatically generated SSL certificate that authenticates the name & ID
BGAS GPFS remote cluster access
Integrating with the rest of the system
Remote access requires
• key generation
• certificate exchange
• mmauth on the server cluster
• mmremote* on the client cluster

But …
• all of these require root
• mmauth update? it depends on the cluster ID not changing

The corresponding GPFS commands are sketched below.
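In plain GPFS commands the exchange looks roughly as follows; the cluster names, contact nodes, file system name, and key-file paths are placeholders.

  # On the storage (server) cluster, as root: authorize a BGAS client cluster
  # using the public key it generated with 'mmauth genkey new'.
  mmauth add bgas-I0-32.example.org -k /tmp/bgas-I0-32_id_rsa.pub
  mmauth grant bgas-I0-32.example.org -f gss_fs1

  # On the client (BGAS) cluster, as root: define the remote cluster and fs.
  mmremotecluster add gss.example.org -n gss01,gss02 -k /tmp/gss_id_rsa.pub
  mmremotefs add gss_fs1 -f gss_fs1 -C gss.example.org -T /gpfs/gss_fs1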
BGAS GPFS cluster creation, step 1
Integrating with the rest of the system
Set up 15 clusters
• one for each possible BGAS I/O block
• extract and save the cluster names, IDs, and certificates
• exchange certificates with the GSS and Viz cluster admins
  • the GSS cluster authorizes each BGAS cluster
  • each BGAS cluster authorizes mounts by the Viz cluster
  • the Viz cluster adds each cluster & its file system
• delete the clusters (their saved identities are re-injected when a cluster is recreated at partition time; see step 2)
BGAS GPFS cluster creation, step 2
Integrating with the rest of the system
• mmcrcluster
• mmauth genkey new
• <TOTALLY UNSUPPORTED>
  • cp -af $CERT_DIR/* /var/mmfs/ssl
  • new_cluster_id=$(mmlsconfig clusterID …)
  • sed "s/old_cluster_id/new_cluster_id/" mmfs.cfg …
  • mmauth genkey propagate
• </TOTALLY UNSUPPORTED>

Re-injecting the saved certificate and cluster ID keeps the remote authorizations from step 1 valid across repartitioning.
Sharing scripts and public certificates
Integrating with the rest of the system
• git repo hosted at EPFL
• commit access for BBP and CSCS
• automated checkout by Puppet on the Viz cluster
• checkout by a non-root user with read-only access
Mounting BGAS GPFS on the Viz cluster
Integrating with the rest of the system
• BGAS GPFS file systems come and go
• Viz nodes get rebooted
• when a BGAS file system is created
  • touch a status file on a GSS file system
• every N minutes
  • check the status file
  • if mtime < N, or uptime < N: mmlsfs && mmmount || mmumount
  • run by a non-root admin user via sudo

A cron-script sketch follows.
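A sketch of that periodic check, e.g. run from cron on every Viz node; the status-file path, file system name, and the value of N are placeholders.

  #!/bin/bash
  # Sketch: mount/unmount the BGAS file system based on the status file.
  N=5                                   # minutes, matching the cron interval
  status=/gpfs/gss_fs1/bgas/status      # touched when a BGAS fs is (re)created
  fs=bgas_fs1

  now=$(date +%s)
  mtime=$(stat -c %Y "$status" 2>/dev/null || echo 0)
  up_min=$(awk '{ print int($1 / 60) }' /proc/uptime)

  # act only if the status file changed recently or this node just rebooted
  if (( (now - mtime) / 60 < N )) || (( up_min < N )); then
      # mirror the slide's one-liner: mmlsfs && mmmount || mmumount
      sudo mmlsfs "$fs" >/dev/null 2>&1 && sudo mmmount "$fs" || sudo mmumount "$fs"
  fi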
Repartitioning performance
[Chart: boot time and partition time in seconds (0–2500) for blocks of 8, 16, 32, and 64 nodes.]
Outline
• Why do we need a storage-class memory system?
• Blue Brain Project hardware system design
• Why do we need system management automation?
• First implementation supporting user-defined system configuration for applications
User experience – expected workflow
Configuring BGAS
• configuring BGAS according to multiple teams' needs (multi-tenancy)
• configuring a BGAS cluster according to one team's needs

Using BGAS for fast scientific development
• from IBM Blue Gene/Q
• from the Viz cluster
• from BGAS itself as a regular cluster
Expected user development cycle
Configuring BGAS

Super-user (manager, PI)
• decides how the cluster should be partitioned, based on several teams' needs (every few weeks)

Team developers
• decide how they want the flash of their cluster partitioned, at job submission time (every few days)
Using BGAS from Blue Gene/Q: switching I/O links automatically
$ sinfo
PARTITION  AVAIL JOB_SIZE TIMELIMIT  CPUS S:C:T  NODES STATE     MIDPLANELIST
debug*     down  1-256    1:00:00    8K   512:16 512   idle      bgq1011
test       up    1-256    6:00:00    8K   512:16 512   allocated bgq1001
prod       up    512-2K   7-00:00:00 8K   512:16 512   drained   bgq1000
prod       up    512-2K   7-00:00:00 8K   512:16 2.50K allocated bgq[0000x0011,1001]
prod-large up    1-4K     2-12:00:00 8K   512:16 512   drained   bgq1000
prod-large up    1-4K     2-12:00:00 8K   512:16 2.50K allocated bgq[0000x0011,1001]
prod-large up    1-4K     2-12:00:00 8K   512:16 1024  idle      bgq[1010x1011]
bgas       up    1-4K     1-00:00:00 8K   512:16 512   drained   bgq1000
bgas       up    1-4K     1-00:00:00 8K   512:16 1.50K allocated bgq[0000x0001,1001]
bgas       up    1-4K     1-00:00:00 8K   512:16 2K    idle      bgq[0010x1011]
• the BGAS queue is seen as a regular queue
• jobs run on CNK compute nodes
• I/O is routed automatically to the BGAS nodes instead of the production I/O nodes
Using BGAS from Blue Gene/Q: switching I/O links automatically

Switching between I/O nodes
• compute nodes are cabled to both sets of IONs
• only one link can be active
• compute nodes need to be rebooted

SLURM prolog (sketched below)
• check the partition for "bgas"
• are the requested BGAS IONs already linked?
• if not, deallocate the compute nodes and switch the links
• restart the job, boot the compute nodes
• leave links in place to minimize rebooting
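A sketch of that prolog logic follows; the bgas_* helpers are hypothetical stand-ins for the Blue Gene control-system calls, and the SLURM_* variables are the usual prolog environment.

  #!/bin/bash
  # Sketch of the SLURM prolog; bgas_* helpers are hypothetical.
  [[ ${SLURM_JOB_PARTITION:-} == bgas ]] || exit 0   # only act for the bgas queue

  wanted=$(bgas_ions_for "$SLURM_JOB_NODELIST")      # hypothetical: IONs the job needs
  current=$(bgas_active_links "$SLURM_JOB_NODELIST") # hypothetical: IONs currently linked

  if [[ $wanted != "$current" ]]; then
      # switching links requires rebooting the compute nodes
      bgas_free_compute_block "$SLURM_JOB_NODELIST"
      bgas_switch_links "$SLURM_JOB_NODELIST" --to "$wanted"
      bgas_boot_compute_block "$SLURM_JOB_NODELIST"
  fi
  # otherwise leave the links in place to minimize rebooting
  exit 0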
Using BGAS as an independent cluster
$ sinfo
PARTITION   AVAIL JOB_SIZE TIMELIMIT  CPUS S:C:T  NODES STATE NODELIST
bgas001-008 down  1-8      2-00:00:00 1    1:12:1 8     idle  bbpbgas[001-008]
bgas009-016 down  1-8      2-00:00:00 1    1:12:1 8     idle  bbpbgas[009-016]
bgas017-024 down  1-8      2-00:00:00 1    1:12:1 8     idle  bbpbgas[017-024]
bgas025-032 down  1-8      2-00:00:00 1    1:12:1 8     idle  bbpbgas[025-032]
bgas033-040 down  1-8      2-00:00:00 1    1:12:1 8     idle  bbpbgas[033-040]
bgas041-048 down  1-8      2-00:00:00 1    1:12:1 8     idle  bbpbgas[041-048]
bgas049-056 down  1-8      2-00:00:00 1    1:12:1 8     idle  bbpbgas[049-056]
bgas057-064 down  1-8      2-00:00:00 1    1:12:1 8     idle  bbpbgas[057-064]
bgas001-016 down  1-16     2-00:00:00 1    1:12:1 16    idle  bbpbgas[001-016]
bgas017-032 down  1-16     2-00:00:00 1    1:12:1 16    idle  bbpbgas[017-032]
bgas033-048 down  1-16     2-00:00:00 1    1:12:1 16    idle  bbpbgas[033-048]
bgas049-064 down  1-16     2-00:00:00 1    1:12:1 16    idle  bbpbgas[049-064]
bgas001-032 down  1-32     2-00:00:00 1    1:12:1 32    idle  bbpbgas[001-032]
bgas033-064 down  1-32     2-00:00:00 1    1:12:1 32    idle  bbpbgas[033-064]
bgas001-064 up    1-64     2-00:00:00 1    1:12:1 64    idle  bbpbgas[001-064]
• log in to a BGAS front-end node
• use the queue of your BGAS cluster
• all queues are visible, but only the queues of clusters that have actually been created are up
GPFS IOR performance, reads, MiB/node
[Chart: read bandwidth in MiB/s per node (0–1400) over the 4–16384 x-axis range, for 1, 2, and 4 processes per node. RDMA over the roq (iWARP) interface, 1 MB blocks.]
GPFS IOR performance, writes, MiB/node
[Chart: write bandwidth in MiB/s per node (0–1400) over the same 4–16384 x-axis range, for 1, 2, and 4 processes per node. RDMA over the roq (iWARP) interface, 1 MB blocks.]
Next steps
Further integration to increase automation
• integration of BGAS cluster partitioning with SLURM
• integration of flash partitioning with SLURM
• complete integration of all services

Getting user feedback & experience

Enhancement & addition of new services
• performance benchmarking of data store interfaces (SKV, …)
• automated data transfer/copy between BGAS & GSS
• multicluster allocation via co-scheduling to support complex workflow execution