NCAR Globally Accessible User Environment
Spectrum Scale User Group – UK Meeting
Pamela Hill, Manager, Data Analysis Services
17 May 2016
Data Analysis Services Group
NCAR / CISL / HSS / DASG
• Data Transfer and Storage Services: Pamela Hill, Joey Mendoza, Craig Ruff
  • High-Performance File Systems
  • Data Transfer Protocols
  • Science Gateway Support
  • Innovative I/O Solutions
• Visualization Services: John Clyne, Scott Pearse, Alan Norton
  • VAPOR development and support
  • 3D visualization
  • Visualization User Support
GLADE Mission: GLobally Accessible Data Environment
• Unified and consistent data environment for NCAR HPC
  • Supercomputers, Data Analysis and Visualization Clusters
  • Support for project work spaces
  • Support for shared data transfer interfaces
  • Support for Science Gateways and access to RDA & ESG data sets
• Data is available at high bandwidth to any server or supercomputer within the GLADE environment
• Resources outside the environment can manipulate data using common interfaces
• Choice of interfaces supports current projects; platform is flexible to support future projects
GLADE Environment
[Diagram: the GLADE environment. GLADE and HPSS sit at the center, serving Project Spaces, Data Collections, $HOME, $WORK, and $SCRATCH; data transfer gateways (Globus Online, Globus+, Data Share, GridFTP, HSI/HTAR, scp, sftp, bbcp); science gateways (RDA, ESG, CDP); computation (yellowstone, cheyenne); analysis & visualization (geyser, caldera, pronghorn); and remote visualization (VirtualGL).]
GLADE Today
• 90 GB/s bandwidth
• 16 PB useable capacity
• 76 IBM DCS3700
  • 6840 3TB drives
  • shared data + metadata
• 20 GPFS NSD servers
• 6 management nodes
• File Services
  • FDR
  • 10 GbE

Expansion Fall 2016
• 200 GB/s bandwidth
• ~21.5 PB useable capacity
• 4 DDN SFA14KX
  • 3360 8TB drives
  • data only
• 48 800GB SSD
  • metadata
• 24 GPFS NSD servers
• File Services
  • EDR
  • 40 GbE
GLADE File System Structure
GLADE GPFS Cluster
• scratch: SFA14KX, 15 PB, SSD metadata, 8M block, 4K inode
• p: DCS3700, 10 PB, 4M block, 512 byte inode
• p2: SFA14KX 5 PB + DCS3700 5 PB, SSD metadata, 8M block, 4K inode
• home, apps: DDN 7700, 100 TB, SSD metadata, SSD applications, 512K block, 4K inode
• ILM rules to migrate between pools
• Data collections pinned to DCS3700
GLADE I/O Network
• Network architecture providing global access to data storage from multiple HPC resources
• Flexibility provided by support of multiple connectivity options and multiple compute network topologies
  • 10GbE, 40GbE, FDR, EDR
  • Full Fat Tree, Quasi Fat Tree, Hypercube
• Scalability allows for addition of new HPC or storage resources
• Agnostic with respect to vendor and file system
  • Can support multiple solutions simultaneously
[Diagram: NWSC data networks. The NWSC core Ethernet switch (10/40/100 GbE) connects GLADE (16 PB, 90 GB/s GPFS today; 21.5 PB, 200 GB/s with the expansion), the HPSS archive (160 PB, 12.5 GB/s, 60 PB holdings) plus HPSS disaster recovery in Boulder, (4) DTNs, data services (20 GB/s), the science gateways, and the FRGP network (100 GbE). HPC/DAV resources attach over Infiniband: Cheyenne (4,032 nodes, EDR), Yellowstone (4,536 nodes, FDR), and Geyser, Caldera, Pronghorn (46 nodes, FDR).]
GLADE I/O Network Connections
• All NSD servers are connected to the 40GbE network
• Some NSD servers are connected to the FDR IB network
  • Yellowstone is a full fat tree network
  • Geyser/Caldera/Pronghorn are a quasi fat tree with up/down routing
  • NSD servers use up/down routing
• Some NSD servers are connected to the EDR IB network
  • Cheyenne is an enhanced hypercube
  • NSD servers are nodes in the hypercube
  • Each NSD server is connected to two points in the hypercube
• Data transfer gateways and the RDA, ESG and CDP science gateways are connected to the 40GbE and 10GbE networks
• NSD servers will route traffic over the 40GbE network to serve data to both IB networks
HPC Network Architecture
[Diagram: the FDR (Yellowstone, DAV), EDR (Cheyenne), 40GbE, and NWSC/ML 10GbE networks; the 20 NSD servers of NWSC-1 and the 24 NSD servers of NWSC-2 plus 6 management servers; 2 NSD servers, 3 chgpfsmgt nodes, and $HOME (A/B); 4 data movers; the RDA and ESG/CDP gateways; and the ML and NWSC HPSS archives.]
GPFS Multi-Cluster
• Clusters can serve file systems or be diskless
• Each cluster is managed separately
  • can have different administrative domains
• Each storage cluster controls who has access to which file systems
• Any cluster mounting a file system must have TCP/IP access from all nodes to the storage cluster
• Diskless clusters only need TCP/IP access to the storage cluster hosting the file system they wish to mount
  • No need for access to other diskless clusters
• Each cluster is configured with a GPFS subnet, which allows NSD servers in the storage cluster to act as routers
  • Allows blocking of a subnet when necessary
• User names and groups need to be kept in sync across the multi-cluster domain (a minimal consistency check is sketched below)
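To illustrate the last point, here is a minimal sketch (not a GLADE tool) that compares the local user and group maps against a snapshot taken on the storage cluster; the JSON snapshot format and file argument are assumptions made for the example.

```python
#!/usr/bin/env python3
"""Illustrative only: check that user/group name-to-ID maps match a snapshot
taken on another cluster. The snapshot format is an assumption, not GLADE tooling."""
import grp
import json
import pwd
import sys

def local_maps():
    """Collect name -> numeric ID maps from the local passwd/group databases."""
    return {
        "users": {p.pw_name: p.pw_uid for p in pwd.getpwall()},
        "groups": {g.gr_name: g.gr_gid for g in grp.getgrall()},
    }

def mismatches(reference, local):
    """Yield names whose numeric IDs differ between the two clusters."""
    for kind in ("users", "groups"):
        for name, ref_id in reference[kind].items():
            local_id = local[kind].get(name)
            if local_id is not None and local_id != ref_id:
                yield f"{kind[:-1]} {name}: {ref_id} (reference) vs {local_id} (local)"

if __name__ == "__main__":
    # Snapshot produced on the storage cluster, e.g. json.dump(local_maps(), fh)
    with open(sys.argv[1]) as fh:
        reference = json.load(fh)
    problems = list(mismatches(reference, local_maps()))
    print("\n".join(problems) or "user/group maps are consistent")
```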
GPFS Multi-Cluster
[Diagram: the GLADE storage cluster (nsd1 ... nsd20 and nsd21 ... nsd46) serves remote clusters over the FDR, EDR, and 40GbE networks: the YS cluster (Yellowstone, n1 ... n4536), DAV cluster (Geyser, Caldera, Pronghorn, n1 ... n46), CH cluster (Cheyenne, n1 ... n4032), CH PBS cluster (Cheyenne PBS and login nodes, n1 ... n10), DTN cluster (data transfer services, n1 ... n4), RDA science gateway cluster (web, n1 ... n8), and SAGE cluster (ESG/CDP science gateways, web, n1 ... n8).]
NCARP – Requirements
The NCAR Common Address Redundancy Protocol
• Must provide IP routing between Infiniband-only GPFS clients and Ethernet-only GPFS servers
  • Some GPFS protocol traffic is always carried over IP, even if RDMA is used for I/O
  • All GPFS protocol traffic can be carried over IP if necessary
• Need high availability of the virtual router addresses (VRAs) to provide the clients uninterrupted file system access
  • Routine maintenance
  • Unattended automatic VRA failure recovery
  • Use gratuitous ARP to reduce failover time
  • Actively detect network interface failures
  • Use digital signatures for valid message detection and spoofing rejection (sketched below)
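As a sketch of the signature requirement, each advertisement could carry an HMAC computed over its payload with a shared key, letting receivers discard spoofed or corrupted messages. The message layout, key source (an NCARP_KEY environment variable), and advertisement fields below are assumptions for illustration, not the actual NCARP wire format.

```python
"""Illustrative only: HMAC-signed advertisements; not the NCARP wire format."""
import hashlib
import hmac
import os

# Hypothetical shared key source; real key management would differ.
SHARED_KEY = os.environ.get("NCARP_KEY", "example-key").encode()

def sign(payload: bytes) -> bytes:
    """Append an HMAC-SHA256 tag so receivers can authenticate the sender."""
    return payload + hmac.new(SHARED_KEY, payload, hashlib.sha256).digest()

def verify(message: bytes):
    """Return the payload if the tag checks out, otherwise None (reject as spoofed)."""
    payload, tag = message[:-32], message[-32:]
    expected = hmac.new(SHARED_KEY, payload, hashlib.sha256).digest()
    return payload if hmac.compare_digest(tag, expected) else None

if __name__ == "__main__":
    adv = sign(b"vra=10.12.200.1 master=nsd07 prio=120")   # made-up advertisement fields
    assert verify(adv) == b"vra=10.12.200.1 master=nsd07 prio=120"
    assert verify(adv[:-1] + bytes([adv[-1] ^ 1])) is None  # tampered tag is rejected
```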
NCARP – Requirements
The NCAR Common Address Redundancy Protocol
• Make use of existing hardware where possible
  • Current need is for 1 year, then all NSD servers migrate to the EDR hypercube network
• Be scalable if client cluster requirements change
  • Phase in/phase out of client clusters and cluster membership over time
  • There are thousands of client nodes in multiple clusters
• Each router node has limited bandwidth available
  • IP over IB throughput is the limiting issue in testing
• Partition nodes among the routers (see the sketch after this list)
  • Allows a measure of load balancing
• Reduce the number of static rules in the Linux kernel routing table
• Usable for both servers and clients
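One way to partition clients among the routers is sketched below: hash each client address to pick a stable virtual router address, which spreads clients roughly evenly and keeps the per-node routing rules static. The VRAs and client addresses are invented for the example and are not GLADE's.

```python
"""Illustrative only: partition client nodes among router VRAs by hashing
their addresses; VRAs and client addresses are made up."""
import hashlib
import ipaddress

# Hypothetical virtual router addresses, one per NSD/router node.
VRAS = ["10.12.200.1", "10.12.200.2", "10.12.200.3", "10.12.200.4"]

def vra_for(client_ip: str) -> str:
    """Hash the client address to a stable VRA choice for rough load balancing."""
    digest = hashlib.sha256(ipaddress.ip_address(client_ip).packed).digest()
    return VRAS[int.from_bytes(digest[:4], "big") % len(VRAS)]

if __name__ == "__main__":
    # Each IB-only client would then route its GPFS traffic via the returned VRA.
    for node in ("10.12.4.17", "10.12.4.18", "10.12.7.250"):
        print(node, "->", vra_for(node))
```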
NCARP – Requirements
The NCAR Common Address Redundancy Protocol
• Must run on subnets managed by the organization's Networking Group as well as on the cluster private I/O networks
• Must not collide with Virtual Router Redundancy Protocol (VRRP) on the organization's subnets
• Must not appear as general purpose routers for non-GPFS traffic
  • Will not support MPI job traffic
• Use a common configuration description for all participating networks and routers (one possible shape is sketched below)
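A common configuration description might look roughly like the structure below; the format, network names, addresses, and router names are invented for illustration and are not NCARP's actual configuration.

```python
"""Illustrative only: one possible shape for a shared NCARP configuration;
the format and values are invented for the example."""

NCARP_CONFIG = {
    "networks": {
        "fdr-ib": {"subnet": "10.12.0.0/16", "kind": "infiniband"},
        "edr-ib": {"subnet": "10.16.0.0/16", "kind": "infiniband"},
        "40gbe":  {"subnet": "10.40.0.0/16", "kind": "ethernet"},
    },
    "routers": {
        # Each router owns one primary VRA and backs up some secondaries.
        "nsd01": {"primary_vra": "10.12.200.1", "secondary_vras": ["10.12.200.2"]},
        "nsd02": {"primary_vra": "10.12.200.2", "secondary_vras": ["10.12.200.1"]},
    },
}

def vras_served_by(router: str) -> list:
    """All VRAs a router may answer for, primary first."""
    entry = NCARP_CONFIG["routers"][router]
    return [entry["primary_vra"], *entry["secondary_vras"]]
```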
NCARP
The NCAR Common Address Redundancy Protocol
• Based loosely on the BSD Common Address Redundancy Protocol (CARP)
• Reduce the number of NCARP packets traversing the networks to keep overhead lower
• Each node should handle a primary Virtual Router Address (VRA) and some number of secondary VRAs
• Use the node's load input when deciding to assume mastership of a secondary VRA (see the sketch below)
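The load-aware takeover decision might look roughly like the sketch below; the advertisement timeout and load threshold are assumed values, not NCARP's.

```python
"""Illustrative only: load-aware takeover of a secondary VRA; thresholds are assumed."""
import os
import time

ADVERT_TIMEOUT = 3.0    # seconds without a master advertisement (assumed value)
MAX_LOAD_PER_CPU = 0.8  # decline takeover if this node is already busy (assumed value)

def should_assume_mastership(last_advert, now=None):
    """Take over a secondary VRA only if its master looks dead and we have headroom."""
    now = time.time() if now is None else now
    master_silent = (now - last_advert) > ADVERT_TIMEOUT
    load_ok = os.getloadavg()[0] / os.cpu_count() < MAX_LOAD_PER_CPU
    return master_silent and load_ok

if __name__ == "__main__":
    # Pretend the last advertisement for this secondary VRA arrived 5 seconds ago.
    print(should_assume_mastership(last_advert=time.time() - 5.0))
```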
NCARP – Deployment Strategy
The NCAR Common Address Redundancy Protocol
• Deploy NCARP on existing NSD server nodes
  • CPUs are usually mostly idle even under file I/O load
  • Spare memory and PCI bandwidth are available
• Deploy dedicated inexpensive router nodes if bandwidth is insufficient or the impact on GPFS is too great
• Use 40G (or 100G) Ethernet as the interchange network between the router nodes and the GPFS servers
  • Insulates us from IB dependencies and restrictions
  • Have to have this network anyway for non-IB attached GPFS clusters
QUESTIONS? [email protected]