SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA, SAN DIEGO
Gordon: Design, Performance, & Experiences Deploying & Supporting a Data-Intensive Supercomputer
Shawn Strande
Gordon Project Manager, San Diego Supercomputer Center
XSEDE ’12
July 16-19, 2012
Chicago, IL
Allan Snavely, 1962 – 2012
“Though I am clearly a ‘rolleur,’ a cyclist who goes faster on the flats, as opposed to a ‘grimpeur,’ a cyclist who goes faster uphill, for some reason I actually prefer climbing.”
Gordon – An Innovative Data-Intensive Supercomputer
• Designed to accelerate access to massive amounts of data in areas of genomics, earth science, engineering, medicine, and others.
• Emphasizes memory and IO over FLOPS.
• Appro-integrated 1,024-node Sandy Bridge cluster.
• 300 TB of high-performance Intel flash.
• Large memory supernodes via vSMP Foundation from ScaleMP.
• 3D torus interconnect from Mellanox.
• In production operation since February 2012.
• Funded by the NSF and available through the Extreme Science and Engineering Discovery Environment (XSEDE) program.
Gordon Design: Two Driving Ideas
• Observation #1: Data keeps getting further away from processor cores (“red shift”)
  • Do we need a new level in the memory hierarchy?
• Observation #2: Many data-intensive applications are serial and difficult to parallelize
  • Would a large, shared-memory machine be better from the standpoint of researcher productivity for some of these?
  • Rapid prototyping of new approaches to data analysis
The Memory Hierarchy of a Typical Supercomputer
[Figure: shared-memory programming (single node) and message-passing programming sit at the top of the hierarchy, separated by a wide latency gap from disk I/O, where the big data lives.]
The Memory Hierarchy of Gordon
[Figure: shared-memory programming extends across nodes via vSMP, and a flash layer sits between memory and disk I/O (big data), narrowing the latency gap.]
Gordon Design Highlights
• 3D torus interconnect, dual-rail QDR
• 64 dual-socket Westmere I/O nodes
  • 12 cores, 48 GB/node
  • 4 LSI controllers
  • 16 SSDs
  • Dual 10GbE
  • SuperMicro motherboard
  • PCI Gen2
• 300 GB Intel 710 eMLC SSDs
  • 300 TB aggregate
• 1,024 dual-socket Xeon E5 (Sandy Bridge) nodes
  • 16 cores, 64 GB/node
  • Intel Jefferson Pass motherboard
  • PCI Gen3
• Large-memory vSMP supernodes
  • 2 TB DRAM
  • 10 TB flash
• “Data Oasis” Lustre PFS: 100 GB/s, 4 PB
                                  Flash drive (e.g., SLC, eMLC)   Typical HDD      Good for data-intensive apps?
Latency                           < 0.1 ms                        10 ms            ✔
Bandwidth (r/w)                   270 / 210 MB/s                  100-150 MB/s     ✔
IOPS (r/w)                        38,500 / 2,000                  100              ✔
Power consumption (during r/w)    2-5 W                           6-10 W           ✔
Price/GB                          $3/GB                           $0.50/GB         -
Endurance                         2-10 PB                         N/A              ✔
Total cost of ownership           Jury is still out.

(Some) SSDs are a good fit for data-intensive computing.
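To make the IOPS gap concrete, here is a back-of-the-envelope Python sketch using the nominal figures from the table above (an illustration, not a measured result):

```python
# Rough estimate of the time to service random 4 KB reads, using the
# nominal IOPS figures from the table above (38,500 for the flash drive,
# ~100 for a typical HDD). Real throughput depends on queue depth and caching.
ssd_iops = 38_500
hdd_iops = 100
n_reads = 10_000_000  # e.g., random lookups into an out-of-core index

print(f"SSD: {n_reads / ssd_iops / 60:.1f} minutes")  # ~4.3 minutes
print(f"HDD: {n_reads / hdd_iops / 3600:.1f} hours")  # ~27.8 hours
```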
Gordon 32-way Supernode
[Figure: 32 dual-socket Sandy Bridge compute nodes plus two I/O nodes (each a dual-socket Westmere I/O processor with 4.8 TB of flash SSD), aggregated into a single system image by vSMP aggregation software.]
Gordon 3D Torus Interconnect Fabric: 4x4x4 3D Torus Topology
[Figure: each torus node pairs two 36-port fabric switches (one per rail) with 16 compute nodes and 2 I/O nodes; each node has a single connection to each network, and each switch carries 18 x 4X IB network connections. The 4x4x4 mesh ends are folded in all three dimensions to form a 3D torus, and the dual-rail network adds bandwidth and redundancy.]

Why a 3D torus interconnect?
• Lower cost: 40% as many switches and 25% to 50% fewer cables compared to a fat tree
• Works well for localized communication
• Linearly expandable
• Simple wiring pattern
• Short cables; fiber optic cables generally not required
• Fault tolerant within the mesh with 2QoS alternate routing
• Fault tolerant with dual rails for all routing algorithms
• Based on the OFED IB stack
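As a concrete illustration of the folded-mesh property, here is a minimal Python sketch of hop counts in a 4x4x4 torus, assuming simple minimal per-axis routing (an illustration, not Gordon's actual routing algorithm):

```python
# Minimal hop-distance calculation for a 4x4x4 3D torus: because the mesh
# ends are folded (wrap around) in every dimension, no pair of switches is
# more than 2 hops apart per axis, or 6 hops total.
DIM = 4

def torus_hops(a, b, dim=DIM):
    """Shortest per-axis distance with wrap-around, summed over x, y, z."""
    return sum(min(abs(ai - bi), dim - abs(ai - bi)) for ai, bi in zip(a, b))

print(torus_hops((0, 0, 0), (3, 3, 3)))  # 3 -- wrap-around gives 1 hop per axis
print(torus_hops((0, 0, 0), (2, 2, 2)))  # 6 -- the worst case in a 4x4x4 torus
```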
Full System
• 16 Compute Node Racks (all racks 48U)
• 4 I/O Node Racks
• 1 Service Node Rack
• Hot aisle containment
• 500 kW
• Earthquake isobases
Gordon Network Architecture
[Figure: 1,024 compute nodes and 64 I/O nodes sit on both rails of the 3D torus (QDR 40 Gb/s); 4 login nodes, 2 management nodes, 4 NFS servers, and 4 data movers connect through the management and public edge & core Ethernet (GbE/10GbE), with 2x10GbE links out to the 4 PB Data Oasis Lustre PFS and connections to the SDSC network and the XSEDE & R&E networks.]
• Dual-rail IB
• Dual 10GbE storage
• GbE management
• GbE public
• Round-robin login
• Mirrored NFS
• Redundant front end
Data Oasis: Heterogeneous Architecture, Lustre-based Parallel File System
[Figure: 64 OSSes (object storage servers, 72 TB each) provide 100 GB/s of performance and more than 4 PB of raw capacity; JBODs (just a bunch of disks, 90 TB each) provide capacity scale-out to an additional 5.8 PB; redundant Arista 7508 10G switches provide reliability and performance; metadata servers (MDS) complete the file system. Three distinct network architectures connect Data Oasis to the clusters: the GORDON IB cluster through 64 Lustre LNET routers at 100 GB/s, the TRESTLES IB cluster through a Mellanox 5020 bridge at 12 GB/s, and the TRITON Myrinet cluster through a Myrinet 10G switch at 25 GB/s.]
Innovation carries risk, and Gordon had equal amounts of both
• Sandy Bridge processor wasn’t available; delivery schedule was uncertain
• SSD market in the midst of a revolution
• vSMP new to large, multi-user HPC environment
• Dual-rail 3D torus had never been deployed
• Data intensive user community not well defined
Risk Reduction
• Deployed Dash prototype
• vSMP 16-way testing
• Dash available to users
• vSMP 32-way testing
• Deployed 16 Gordon I/O nodes with Postville SSDs
• Early delivery of all I/O nodes
• Full system delivery
• 3D torus prototype demonstration
Testing, testing, and more testing
Challenge: Intel SSD Roadmap Changes Necessitated a Revisit of SSD Options
• Rigorous acceptance criteria required high IOPS, endurance, and capacity, and a low UBER
• Tested numerous drives
• Performed paper studies of many more
• Cost was an issue for the vendor
• The Dash prototype was crucial

The final choice was the new Intel 710 eMLC 300 GB SSD, launched at IDF 2011. There are 1,024 of these in Gordon.
Challenge: Exporting & Preserving Flash Performance
• There are several layers of overhead that reduce performance (SATA, Linux, network)
• I/O models need to be driven by the applications
• No one had really done this before
• iSCSI over RDMA was the best protocol
• XFS performs well
• Early work with OCFS is promising for a shared file system
Challenge: vSMP had not been used in large scale, multi-user HPC system
• The Dash prototype was used for engineering scale-up work (16- and 32-way)
• SDSC did significant systems and application testing
• Users had early access to Dash
• The first Gordon Sandy Bridge nodes were shipped to ScaleMP for certification
• ScaleMP has been a partner throughout the project

vSMP is in production on Gordon. Most users need 16-way (1 TB), but larger nodes can be provisioned.
Challenge: User Outreach to Identify Good Applications for Gordon
• Many traditional HPC users are not “data-intensive”
• Mined the existing NSF allocations database to identify potential users
• Conducted data intensive summer institutes
• Reached out to new communities in linguistics, political science, and others
• Revised the allocations models for Gordon to encourage new users to apply for time
• We’re still not quite there
Applications
Computational Style Codes: Answering the Question, Why Gordon?
• V: Uses vSMP
• C: Computationally intensive; leverages the Sandy Bridge architecture
• M: Uses large memory per core on Gordon (4 GB/core)
• T: Threaded
• F: Uses flash
• L: Lustre I/O intensive
Breadth First Search Comparison using SSD and HDD
Source: Sandeep Gupta, San Diego Supercomputer Center. Used by permission. 2011
Graphs are mathematical and computational representations of relationships among objects in a network. Such networks occur in many natural and man-made scenarios, including communication, biological, and social contexts. Understanding the structure of these graphs is important for uncovering important relationships among their members.
• Implementation of the breadth-first search (BFS) graph algorithm developed by Munagala and Ranade
• 134 million nodes
• Flash drives reduced I/O time by a factor of 6.5
• Problem converted from I/O bound to compute bound
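The Munagala–Ranade algorithm streams each BFS frontier to and from storage rather than chasing pointers randomly through a huge visited set, which is why fast flash I/O pays off. The Python sketch below captures the level-by-level idea only; it is not the group's code, and the `neighbors` accessor is a hypothetical stand-in for an on-disk adjacency file:

```python
# Simplified external-memory BFS in the spirit of Munagala & Ranade: visit the
# graph level by level, keeping only the current and previous frontiers around
# and scanning adjacency lists sequentially (from flash, in Gordon's case).
def external_bfs(source, neighbors):
    prev, curr = set(), {source}
    levels = 0
    while curr:
        nxt = set()
        for v in curr:                 # sequential scan of the frontier's adjacency lists
            nxt.update(neighbors(v))
        nxt -= prev                    # duplicate removal against the last two levels;
        nxt -= curr                    # the real algorithm does this by sorting runs on disk
        prev, curr = curr, nxt
        levels += 1
    return levels                      # number of BFS levels reached from the source

# Tiny in-memory example standing in for a 134-million-node on-disk graph.
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
print(external_bfs(0, adj.__getitem__))  # 3 levels: {0}, {1, 2}, {3}
```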
Postgres pgbench Results for a Gordon I/O Node
pgbench is a standard Postgres benchmark that tests performance using a real-world banking scenario. Tests are performed for a range of database sizes and client connections. The benchmark scale is the number of bank branches, with 10 tellers and 100,000 accounts per branch; each client executes 100,000 transactions.
Gordon I/O node: 2 x 6-core Westmere, 48 GB DRAM, 4.4 TB of high-performance flash.
[Charts: transactions per second for a query/update/insert (read/write) workload and a random-select (read-only) workload.] The I/O node achieves high TPS (transactions per second) at large scale (150 GB) and high client counts.
Source: Kai Lin, San Diego Supercomputer Center. 2012
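For reference, the heart of what pgbench measures is a small TPC-B-like read/write transaction issued by each client. The Python/psycopg2 sketch below approximates that transaction; the pgbench_* table and column names follow recent pgbench conventions and are assumptions here, not taken from the slide:

```python
import random
import psycopg2

# One TPC-B-like pgbench transaction, approximately: update an account balance,
# read it back, update the teller and branch balances, and append to the history
# table. pgbench drives many concurrent clients through this loop.
def run_transaction(conn, n_branches, tellers_per_branch=10, accts_per_branch=100_000):
    bid = random.randint(1, n_branches)
    tid = random.randint(1, n_branches * tellers_per_branch)
    aid = random.randint(1, n_branches * accts_per_branch)
    delta = random.randint(-5000, 5000)
    with conn.cursor() as cur:
        cur.execute("UPDATE pgbench_accounts SET abalance = abalance + %s WHERE aid = %s",
                    (delta, aid))
        cur.execute("SELECT abalance FROM pgbench_accounts WHERE aid = %s", (aid,))
        cur.fetchone()
        cur.execute("UPDATE pgbench_tellers SET tbalance = tbalance + %s WHERE tid = %s",
                    (delta, tid))
        cur.execute("UPDATE pgbench_branches SET bbalance = bbalance + %s WHERE bid = %s",
                    (delta, bid))
        cur.execute("INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) "
                    "VALUES (%s, %s, %s, %s, now())", (tid, bid, aid, delta))
    conn.commit()
```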
PDB Query Comparisons with a DB2 Database on Two Gordon I/O Nodes: One with HDDs, One with SSDs
The Protein Data Bank (PDB) is the single worldwide repository of information about the 3D structures of large biological molecules. These are the molecules of life found in all organisms; understanding the shape of a molecule helps to understand how it works.
• For single queries, HDD and SSD perform about the same.
• For concurrent queries, SSDs achieve a big speedup.
• Q5B is more than 10x faster, and performance varies by type of query.
Source: Vishwinath Nandigam, San Diego Supercomputer Center. 2011
Daphnia Genome Assembly using Velvet and vSMP
Source: Wayne Pfeiffer, San Diego Supercomputer Center. Used by permission.
Daphnia (a.k.a. the water flea) is a model species used for understanding mechanisms of inheritance and evolution, and as a surrogate species for studying human health responses to environmental changes.
De novo assembly of short DNA reads using the de Bruijn graph algorithm. The code is parallelized using OpenMP directives. Benchmark problem: Daphnia genome assembly from 44-bp and 75-bp reads using 35-mers.
Photo: Dr. Jan Michels, Christian-Albrechts-University, Kiel
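Velvet's memory appetite comes from the de Bruijn graph it builds over the reads' k-mers, which is why a large shared-memory (vSMP) node helps. A minimal Python sketch of the graph construction, for illustration only (not Velvet's implementation):

```python
from collections import defaultdict

# Build a toy de Bruijn graph: nodes are (k-1)-mers and edges are k-mers observed
# in the reads. Velvet does this -- plus error correction and graph simplification --
# over hundreds of millions of reads, which is what drives the memory footprint.
def de_bruijn(reads, k=35):
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: (k-1)-mer prefix -> (k-1)-mer suffix
    return graph

g = de_bruijn(["ACGTACGTGACG"], k=5)        # tiny example; the benchmark used 35-mers
```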
Foxglove Calculation Using Gaussian 09 with vSMP: MP2 Energy Gradient Calculation
Source: Jerry Greenberg, San Diego Supercomputer Center. January, 2012.
The Foxglove plant (Digitalis) is studied for its medicinal uses. Digoxin, an extract of the Foxglove, is used to treat a variety of conditions including diseases of the heart. There is some recent research that suggests it may also be a beneficial cancer treatment.
• Time to solution: 43,000 s
• Processor footprint: 4 nodes (64 threads)
• Memory footprint: 10 nodes (700 GB)
• 1 compute node = 16 cores, 64 GB
Axial compression of caudal rat vertebra using Abaqus and vSMP
Source: Matthew Goff, Chris Hernandez. Cornell University. Used by permission. 2012
The goal of the simulations is to analyze how small variances in boundary conditions affect high-strain regions in the model. The research goal is to understand the response of trabecular bone to mechanical stimuli. This is relevant to paleontologists inferring the habitual locomotion of ancient people and animals, and to treatment strategies for populations with fragile bones, such as the elderly.
• 5 million quadratic, 8-noded elements
• Model created with a custom Matlab application that converts 253 micro-CT images into voxel-based finite element models
Cosmology simulation - matter power spectrum measurement using vSMP
Source: Rick Wagner, Michael L. Norman. SDSC.
Goal is to measure the effect of the light from the first stars on the evolution of the universe. To quantitatively compare the matter distribution of each simulation, we use radially binned 3D power spectra.
• 2 simulations
• 3,200³ uniform 3D grids
• 15k+ files each
• Existing OpenMP code
• ~256 GB of memory used
• ~5.5 hours per field
• Zero development effort
[Figure panels: the individual simulations, their difference, and the resulting power spectra.]
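A radially binned 3D power spectrum is essentially |FFT(field)|² averaged over spherical shells of wavenumber. A minimal numpy sketch of that reduction follows; it is an illustration at a tiny grid size, not the project's production analysis code:

```python
import numpy as np

def radial_power_spectrum(field, nbins=64):
    """Radially binned 3D power spectrum of a cubic density field."""
    n = field.shape[0]
    power = (np.abs(np.fft.fftn(field)) ** 2).ravel()
    # Wavenumber magnitude at every point of the FFT grid.
    k = np.fft.fftfreq(n) * n
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    kmag = np.sqrt(kx**2 + ky**2 + kz**2).ravel()
    # Average the power within spherical shells of |k|.
    edges = np.linspace(0.0, kmag.max(), nbins + 1)
    which = np.digitize(kmag, edges)                      # shell index for each grid point
    sums = np.bincount(which, weights=power, minlength=nbins + 2)[1:nbins + 1]
    counts = np.bincount(which, minlength=nbins + 2)[1:nbins + 1]
    return 0.5 * (edges[1:] + edges[:-1]), sums / np.maximum(counts, 1)

# Small random field for illustration; the study binned two 3,200^3 grids.
k_centers, pk = radial_power_spectrum(np.random.rand(64, 64, 64))
```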
Impact of high-frequency trading on financial markets
Source: Mao Ye, Dept. of Finance, U. Illinois. Used by permission. 6/1/2012
To determine the impact of high-frequency trading activity on financial markets, it is necessary to construct nanosecond-resolution limit order books: records of all unexecuted orders to buy or sell a stock at a specified price. The analysis provides evidence of quote stuffing, a manipulative practice that involves submitting a large number of orders with immediate cancellation to generate congestion.
Time to construct the limit order books is now under 15 minutes for the threaded application using 16 cores on a single Gordon compute node.
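A limit order book is, at bottom, a record of resting orders per price level on each side, rebuilt by replaying the exchange's add/cancel/execute messages. The Python sketch below shows the core data structure only; the real nanosecond-resolution feed handling in this study is far more involved, and the message fields here are simplified assumptions:

```python
from collections import defaultdict

# Minimal limit order book: replay add/cancel messages and keep the set of
# unexecuted shares at each price level, per side ('B' buy / 'S' sell).
class OrderBook:
    def __init__(self):
        self.orders = {}                       # order_id -> (side, price, shares)
        self.levels = {"B": defaultdict(int),  # price -> resting shares
                       "S": defaultdict(int)}

    def add(self, order_id, side, price, shares):
        self.orders[order_id] = (side, price, shares)
        self.levels[side][price] += shares

    def cancel(self, order_id, shares):
        side, price, remaining = self.orders[order_id]
        self.levels[side][price] -= shares
        self.orders[order_id] = (side, price, remaining - shares)

    def best_bid_ask(self):
        bids = [p for p, s in self.levels["B"].items() if s > 0]
        asks = [p for p, s in self.levels["S"].items() if s > 0]
        return (max(bids) if bids else None, min(asks) if asks else None)

book = OrderBook()
book.add(1, "B", 100.00, 200)   # order to buy 200 shares at $100.00
book.add(2, "S", 100.05, 300)   # order to sell 300 shares at $100.05
book.cancel(1, 200)             # immediate cancellation (the quote-stuffing pattern)
print(book.best_bid_ask())      # (None, 100.05)
```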
Massive Data Analysis of Large-eddy Simulation of Deep Convection in Atmosphere (Clouds) using vSMP
The Center for Multi-scale Modeling of Atmospheric Processes (CMMAP) is an NSF Science and Technology Center focused on improving the representation of cloud processes in climate models.
• System for Atmospheric Modeling: M. Kharoutdinov, SUNY Stonybrook
• Visualization: J. Helly, A. Chourasia
• Analysis: J. Helly, S. Strande
Simulation details (GigaLES model run dataset, partial):
• 40 time steps (24-hour simulation)
• 256 vertical layers
• 204.8 x 204.8 kilometers
• 100 m horizontal resolution
R analysis:
• 160 GB data set (40 netCDF files @ 4 GB each)
• 340 GB memory footprint
• ~3.5 hours for data input and analysis
MrBayes Running on Gordon through the CIPRES Gateway
Source: Wayne Pfeiffer, San Diego Supercomputer Center.
MrBayes 3.1.2 is used extensively via the CIPRES Science Gateway to infer phylogenetic trees. The hybrid parallel version running at SDSC uses both MPI and OpenMP.
• CIPRES has allowed over 4000 biologists world-wide to run parallel tree inference codes via a simple-to-use web interface.
• Applications can be targeted to appropriate architectures.
• Gordon provides a significant speedup for unpartitioned data sets over the SDSC Trestles system.
• A model for future data intensive projects
Application-Aware Dynamic Voltage and Frequency Scaling Saves an Average of 12% Energy on HPC Workloads
Source: Laura Carrington, PMaC Lab, San Diego Supercomputer Center. May 2012
A series of HPC applications was run on 1,024 cores comparing the Intel baseline power-savings settings against application-aware settings. The average performance penalty is 7.9%. LAMMPS realizes a power savings of 31.7% with a performance penalty of 3.9%.
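Since energy is average power times run time, a power reduction only pays off if the slowdown is smaller. A quick check of the LAMMPS numbers above under that simple model (assuming the 31.7% figure is a reduction in average power draw):

```python
# Energy = average power x run time. With application-aware DVFS the power drops,
# but the run time grows by the performance penalty, so both enter the energy bill.
power_reduction = 0.317   # LAMMPS: 31.7% lower average power draw (assumed meaning)
slowdown = 0.039          # 3.9% performance penalty

energy_ratio = (1 - power_reduction) * (1 + slowdown)
print(f"Energy saved: {(1 - energy_ratio) * 100:.1f}%")   # about 29%
```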
Gordon Impact as a Resource Provider
Conclusions
• The nature of computational research is becoming more data-intensive, requiring new kinds of high-performance computer architectures.
• Gordon is an innovative system that addresses a range of challenges associated with data-intensive computing.
• A prototype system and significant testing mitigated the challenges of deploying Gordon.
• Outreach to new user communities takes concerted and ongoing effort.
• Gordon supports a wide range of applications: large memory, MPI applications, and dedicated I/O nodes.
• Productive data-intensive computing is being done.
Thank you very much!
sstrande@ucsd.edu
And thank you to the co-authors:
Pietro Cicotti, Bob Sinkovits, Bill Young, Rick Wagner, Mahidhar Tatineni, Eva Hocks, Allan Snavely, and Mike Norman