SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA, SAN DIEGO
Gordon: Design, Performance, & Experiences Deploying & Supporting a Data-Intensive Supercomputer
Shawn Strande
Gordon Project Manager, San Diego Supercomputer Center
XSEDE ’12
July 16-19, 2012
Chicago, IL
Allan Snavely, 1962 – 2012
“Though I am clearly a ‘rolleur,’ a cyclist who goes faster on the flats, as opposed to a ‘grimpeur,’ a cyclist who goes faster uphill, for some reason I actually prefer climbing.”
Gordon – An Innovative Data-Intensive Supercomputer
• Designed to accelerate access to massive amounts of data in areas of genomics, earth science, engineering, medicine, and others.
• Emphasizes memory and IO over FLOPS.
• Appro-integrated 1,024-node Sandy Bridge cluster.
• 300 TB of high-performance Intel flash.
• Large memory supernodes via vSMP Foundation from ScaleMP.
• 3D torus interconnect from Mellanox.
• In production operation since February 2012.
• Funded by the NSF and available through the Extreme Science and Engineering Discovery Environment (XSEDE) program.
Gordon Design: Two Driving Ideas
• Observation #1: Data keeps getting further away from processor cores (“red shift”)
  • Do we need a new level in the memory hierarchy?
• Observation #2: Many data-intensive applications are serial and difficult to parallelize
  • Would a large, shared-memory machine be better from the standpoint of researcher productivity for some of these?
  • Rapid prototyping of new approaches to data analysis
The Memory Hierarchy of a Typical Supercomputer
[Figure: shared-memory programming (single node) and message-passing programming sit at the top of the hierarchy, separated by a wide latency gap from disk I/O, where the big data lives.]
The Memory Hierarchy of Gordon
[Figure: shared-memory programming extends across nodes via vSMP, and a flash layer sits between memory and disk I/O (big data), narrowing the latency gap.]
Gordon Design Highlights
• 3D torus interconnect, dual-rail QDR
• 64 dual-socket Westmere I/O nodes
  • 12 cores, 48 GB/node
  • 4 LSI controllers
  • 16 SSDs
  • Dual 10GbE
  • SuperMicro motherboard
  • PCI Gen2
• 300 GB Intel 710 eMLC SSDs
  • 300 TB aggregate
• 1,024 dual-socket Xeon E5 (Sandy Bridge) nodes
  • 16 cores, 64 GB/node
  • Intel Jefferson Pass motherboard
  • PCI Gen3
• Large-memory vSMP supernodes
  • 2 TB DRAM
  • 10 TB flash
• “Data Oasis” Lustre PFS: 100 GB/s, 4 PB
                                  Flash drive (e.g., SLC, eMLC)   Typical HDD      Good for data-intensive apps?
Latency                           < 0.1 ms                        10 ms            ✔
Bandwidth (r/w)                   270 / 210 MB/s                  100-150 MB/s     ✔
IOPS (r/w)                        38,500 / 2,000                  100              ✔
Power consumption (during r/w)    2-5 W                           6-10 W           ✔
Price/GB                          $3/GB                           $0.50/GB         -
Endurance                         2-10 PB                         N/A              ✔
Total cost of ownership           Jury is still out.

(Some) SSDs are a good fit for data-intensive computing.
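To make the IOPS gap concrete, here is a back-of-the-envelope Python sketch using the nominal figures from the table above (an illustration, not a measured result):

```python
# Rough estimate of the time to service random 4 KB reads, using the
# nominal IOPS figures from the table above (38,500 for the flash drive,
# ~100 for a typical HDD). Real throughput depends on queue depth and caching.
ssd_iops = 38_500
hdd_iops = 100
n_reads = 10_000_000  # e.g., random lookups into an out-of-core index

print(f"SSD: {n_reads / ssd_iops / 60:.1f} minutes")  # ~4.3 minutes
print(f"HDD: {n_reads / hdd_iops / 3600:.1f} hours")  # ~27.8 hours
```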
Gordon 32-way Supernode
[Figure: 32 dual-socket Sandy Bridge compute nodes plus two I/O nodes (each a dual-socket Westmere I/O processor with 4.8 TB of flash SSD), aggregated into a single system image by vSMP aggregation software.]
Gordon 3D Torus Interconnect Fabric: 4x4x4 3D Torus Topology
[Figure: each torus node pairs two 36-port fabric switches (one per rail) with 16 compute nodes and 2 I/O nodes; each node has a single connection to each network, and each switch carries 18 x 4X IB network connections. The 4x4x4 mesh ends are folded in all three dimensions to form a 3D torus, and the dual-rail network adds bandwidth and redundancy.]

Why a 3D torus interconnect?
• Lower cost: 40% as many switches and 25% to 50% fewer cables compared to a fat tree
• Works well for localized communication
• Linearly expandable
• Simple wiring pattern
• Short cables; fiber optic cables generally not required
• Fault tolerant within the mesh with 2QoS alternate routing
• Fault tolerant with dual rails for all routing algorithms
• Based on the OFED IB stack
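As a concrete illustration of the folded-mesh property, here is a minimal Python sketch of hop counts in a 4x4x4 torus, assuming simple minimal per-axis routing (an illustration, not Gordon's actual routing algorithm):

```python
# Minimal hop-distance calculation for a 4x4x4 3D torus: because the mesh
# ends are folded (wrap around) in every dimension, no pair of switches is
# more than 2 hops apart per axis, or 6 hops total.
DIM = 4

def torus_hops(a, b, dim=DIM):
    """Shortest per-axis distance with wrap-around, summed over x, y, z."""
    return sum(min(abs(ai - bi), dim - abs(ai - bi)) for ai, bi in zip(a, b))

print(torus_hops((0, 0, 0), (3, 3, 3)))  # 3 -- wrap-around gives 1 hop per axis
print(torus_hops((0, 0, 0), (2, 2, 2)))  # 6 -- the worst case in a 4x4x4 torus
```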
Full System
• 16 Compute Node Racks (all racks 48U)
• 4 I/O Node Racks
• 1 Service Node Rack
• Hot aisle containment
• 500 kW
• Earthquake isobases
Gordon Network Architecture
[Figure: 1,024 compute nodes and 64 I/O nodes sit on both rails of the 3D torus (QDR 40 Gb/s); 4 login nodes, 2 management nodes, 4 NFS servers, and 4 data movers connect through the management and public edge & core Ethernet (GbE/10GbE), with 2x10GbE links out to the 4 PB Data Oasis Lustre PFS and connections to the SDSC network and the XSEDE & R&E networks.]
• Dual-rail IB
• Dual 10GbE storage
• GbE management
• GbE public
• Round-robin login
• Mirrored NFS
• Redundant front end
Data Oasis: Heterogeneous Architecture, Lustre-based Parallel File System
[Figure: 64 OSSes (object storage servers, 72 TB each) provide 100 GB/s of performance and more than 4 PB of raw capacity; JBODs (just a bunch of disks, 90 TB each) provide capacity scale-out to an additional 5.8 PB; redundant Arista 7508 10G switches provide reliability and performance; metadata servers (MDS) complete the file system. Three distinct network architectures connect Data Oasis to the clusters: the GORDON IB cluster through 64 Lustre LNET routers at 100 GB/s, the TRESTLES IB cluster through a Mellanox 5020 bridge at 12 GB/s, and the TRITON Myrinet cluster through a Myrinet 10G switch at 25 GB/s.]
Innovation carries risk, and Gordon had equal amounts of both
• Sandy Bridge processor wasn’t available; delivery schedule was uncertain
• SSD market in the midst of a revolution
• vSMP new to large, multi-user HPC environment
• Dual-rail 3D torus had never been deployed
• Data intensive user community not well defined
Risk Reduction
• Deployed Dash prototype
• vSMP 16-way testing
• Dash available to users
• vSMP 32-way testing
• Deployed 16 Gordon I/O nodes with Postville SSDs
• Early delivery of all I/O nodes
• Full system delivery
• 3D torus prototype demonstration
Testing, testing, and more testing
Challenge: Intel SSD Roadmap Changes Necessitated a Revisit of SSD Options
• Rigorous acceptance criteria required high IOPS, endurance, and capacity, and a low UBER
• Tested numerous drives
• Performed paper studies of many more
• Cost was an issue for the vendor
• The Dash prototype was crucial

The final choice was the new Intel 710 eMLC 300 GB SSD, launched at IDF 2011. There are 1,024 of these in Gordon.
Challenge: Exporting & Preserving Flash Performance
• There are several layers of overhead that reduce performance (SATA, Linux, network)
• I/O models need to be driven by the applications
• No one had really done this before
• iSCSI over RDMA was the best protocol
• XFS performs well
• Early work with OCFS is promising for a shared file system
Challenge: vSMP had not been used in large scale, multi-user HPC system
• The Dash prototype was used for engineering scale-up work (16- and 32-way)
• SDSC did significant systems and application testing
• Users had early access to Dash
• The first Gordon Sandy Bridge nodes were shipped to ScaleMP for certification
• ScaleMP has been a partner throughout the project

vSMP is in production on Gordon. Most users need 16-way (1 TB), but larger nodes can be provisioned.
Challenge: User Outreach to Identify Good Applications for Gordon
• Many traditional HPC users are not “data-intensive”
• Mined the existing NSF allocations database to identify potential users
• Conducted data intensive summer institutes
• Reached out to new communities in linguistics, political science, and others
• Revised the allocations models for Gordon to encourage new users to apply for time
• We’re still not quite there
Applications
Computational Style Codes: Answering the Question, Why Gordon?
• V: Uses vSMP
• C: Computationally intensive; leverages the Sandy Bridge architecture
• M: Uses large memory per core on Gordon (4 GB/core)
• T: Threaded
• F: Uses flash
• L: Lustre I/O intensive
Breadth First Search Comparison using SSD and HDD
Source: Sandeep Gupta, San Diego Supercomputer Center. Used by permission. 2011
Graphs are mathematical and computational representations of relationships among objects in a network. Such networks occur in many natural and man-made scenarios, including communication, biological, and social contexts. Understanding the structure of these graphs is important for uncovering important relationships among their members.
• Implementation of the breadth-first search (BFS) graph algorithm developed by Munagala and Ranade
• 134 million nodes
• Flash drives reduced I/O time by a factor of 6.5
• Problem converted from I/O bound to compute bound
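The Munagala–Ranade algorithm streams each BFS frontier to and from storage rather than chasing pointers randomly through a huge visited set, which is why fast flash I/O pays off. The Python sketch below captures the level-by-level idea only; it is not the group's code, and the `neighbors` accessor is a hypothetical stand-in for an on-disk adjacency file:

```python
# Simplified external-memory BFS in the spirit of Munagala & Ranade: visit the
# graph level by level, keeping only the current and previous frontiers around
# and scanning adjacency lists sequentially (from flash, in Gordon's case).
def external_bfs(source, neighbors):
    prev, curr = set(), {source}
    levels = 0
    while curr:
        nxt = set()
        for v in curr:                 # sequential scan of the frontier's adjacency lists
            nxt.update(neighbors(v))
        nxt -= prev                    # duplicate removal against the last two levels;
        nxt -= curr                    # the real algorithm does this by sorting runs on disk
        prev, curr = curr, nxt
        levels += 1
    return levels                      # number of BFS levels reached from the source

# Tiny in-memory example standing in for a 134-million-node on-disk graph.
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
print(external_bfs(0, adj.__getitem__))  # 3 levels: {0}, {1, 2}, {3}
```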
Postgres pgbench Results for a Gordon I/O Node
pgbench is a standard Postgres benchmark that tests performance using a real-world banking scenario. Tests are performed for a range of database sizes and client connections. The benchmark scale is the number of bank branches, with 10 tellers and 100,000 accounts per branch; each client executes 100,000 transactions.
Gordon I/O node: 2 x 6-core Westmere, 48 GB DRAM, 4.4 TB of high-performance flash.
[Charts: transactions per second for a query/update/insert (read/write) workload and a random-select (read-only) workload.] The I/O node achieves high TPS (transactions per second) at large scale (150 GB) and high client counts.
Source: Kai Lin, San Diego Supercomputer Center. 2012
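For reference, the heart of what pgbench measures is a small TPC-B-like read/write transaction issued by each client. The Python/psycopg2 sketch below approximates that transaction; the pgbench_* table and column names follow recent pgbench conventions and are assumptions here, not taken from the slide:

```python
import random
import psycopg2

# One TPC-B-like pgbench transaction, approximately: update an account balance,
# read it back, update the teller and branch balances, and append to the history
# table. pgbench drives many concurrent clients through this loop.
def run_transaction(conn, n_branches, tellers_per_branch=10, accts_per_branch=100_000):
    bid = random.randint(1, n_branches)
    tid = random.randint(1, n_branches * tellers_per_branch)
    aid = random.randint(1, n_branches * accts_per_branch)
    delta = random.randint(-5000, 5000)
    with conn.cursor() as cur:
        cur.execute("UPDATE pgbench_accounts SET abalance = abalance + %s WHERE aid = %s",
                    (delta, aid))
        cur.execute("SELECT abalance FROM pgbench_accounts WHERE aid = %s", (aid,))
        cur.fetchone()
        cur.execute("UPDATE pgbench_tellers SET tbalance = tbalance + %s WHERE tid = %s",
                    (delta, tid))
        cur.execute("UPDATE pgbench_branches SET bbalance = bbalance + %s WHERE bid = %s",
                    (delta, bid))
        cur.execute("INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) "
                    "VALUES (%s, %s, %s, %s, now())", (tid, bid, aid, delta))
    conn.commit()
```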
PDB Query Comparisons with a DB2 Database on Two Gordon I/O Nodes: One with HDDs, One with SSDs
The Protein Data Bank (PDB) is the single worldwide repository of information about the 3D structures of large biological molecules. These are the molecules of life found in all organisms; understanding the shape of a molecule helps to understand how it works.
• For single queries, HDD and SSD perform about the same.
• For concurrent queries, SSDs achieve a big speedup.
• Q5B is more than 10x faster, and performance varies by type of query.
Source: Vishwinath Nandigam, San Diego Supercomputer Center. 2011
Daphnia Genome Assembly using Velvet and vSMP
Source: Wayne Pfeiffer, San Diego Supercomputer Center. Used by permission.
Daphnia (a.k.a. the water flea) is a model species used for understanding mechanisms of inheritance and evolution, and as a surrogate species for studying human health responses to environmental changes.
De novo assembly of short DNA reads using the de Bruijn graph algorithm. The code is parallelized using OpenMP directives. Benchmark problem: Daphnia genome assembly from 44-bp and 75-bp reads using 35-mers.
Photo: Dr. Jan Michels, Christian-Albrechts-University, Kiel
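Velvet's memory appetite comes from the de Bruijn graph it builds over the reads' k-mers, which is why a large shared-memory (vSMP) node helps. A minimal Python sketch of the graph construction, for illustration only (not Velvet's implementation):

```python
from collections import defaultdict

# Build a toy de Bruijn graph: nodes are (k-1)-mers and edges are k-mers observed
# in the reads. Velvet does this -- plus error correction and graph simplification --
# over hundreds of millions of reads, which is what drives the memory footprint.
def de_bruijn(reads, k=35):
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: (k-1)-mer prefix -> (k-1)-mer suffix
    return graph

g = de_bruijn(["ACGTACGTGACG"], k=5)        # tiny example; the benchmark used 35-mers
```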
Foxglove Calculation Using Gaussian 09 with vSMP: MP2 Energy Gradient Calculation
Source: Jerry Greenberg, San Diego Supercomputer Center. January, 2012.
The Foxglove plant (Digitalis) is studied for its medicinal uses. Digoxin, an extract of the Foxglove, is used to treat a variety of conditions including diseases of the heart. There is some recent research that suggests it may also be a beneficial cancer treatment.
• Time to solution: 43,000 s
• Processor footprint: 4 nodes (64 threads)
• Memory footprint: 10 nodes (700 GB)
• 1 compute node = 16 cores, 64 GB
Axial compression of caudal rat vertebra using Abaqus and vSMP
Source: Matthew Goff, Chris Hernandez. Cornell University. Used by permission. 2012
The goal of the simulations is to analyze how small variances in boundary conditions affect high-strain regions in the model. The research goal is to understand the response of trabecular bone to mechanical stimuli. This is relevant to paleontologists inferring the habitual locomotion of ancient people and animals, and to treatment strategies for populations with fragile bones, such as the elderly.
• 5 million quadratic, 8-noded elements
• Model created with a custom Matlab application that converts 253 micro-CT images into voxel-based finite element models
Cosmology simulation - matter power spectrum measurement using vSMP
Source: Rick Wagner, Michael L. Norman. SDSC.
Goal is to measure the effect of the light from the first stars on the evolution of the universe. To quantitatively compare the matter distribution of each simulation, we use radially binned 3D power spectra.
• 2 simulations
• 3,200³ uniform 3D grids
• 15k+ files each
• Existing OpenMP code
• ~256 GB of memory used
• ~5.5 hours per field
• Zero development effort
[Figure panels: the individual simulations, their difference, and the resulting power spectra.]
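A radially binned 3D power spectrum is essentially |FFT(field)|² averaged over spherical shells of wavenumber. A minimal numpy sketch of that reduction follows; it is an illustration at a tiny grid size, not the project's production analysis code:

```python
import numpy as np

def radial_power_spectrum(field, nbins=64):
    """Radially binned 3D power spectrum of a cubic density field."""
    n = field.shape[0]
    power = (np.abs(np.fft.fftn(field)) ** 2).ravel()
    # Wavenumber magnitude at every point of the FFT grid.
    k = np.fft.fftfreq(n) * n
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    kmag = np.sqrt(kx**2 + ky**2 + kz**2).ravel()
    # Average the power within spherical shells of |k|.
    edges = np.linspace(0.0, kmag.max(), nbins + 1)
    which = np.digitize(kmag, edges)                      # shell index for each grid point
    sums = np.bincount(which, weights=power, minlength=nbins + 2)[1:nbins + 1]
    counts = np.bincount(which, minlength=nbins + 2)[1:nbins + 1]
    return 0.5 * (edges[1:] + edges[:-1]), sums / np.maximum(counts, 1)

# Small random field for illustration; the study binned two 3,200^3 grids.
k_centers, pk = radial_power_spectrum(np.random.rand(64, 64, 64))
```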
Impact of high-frequency trading on financial markets
Source: Mao Ye, Dept. of Finance, U. Illinois. Used by permission. 6/1/2012
To determine the impact of high-frequency trading activity on financial markets, it is necessary to construct nanosecond-resolution limit order books: records of all unexecuted orders to buy or sell a stock at a specified price. The analysis provides evidence of quote stuffing, a manipulative practice that involves submitting a large number of orders with immediate cancellation to generate congestion.
Time to construct the limit order books is now under 15 minutes for the threaded application using 16 cores on a single Gordon compute node.
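A limit order book is, at bottom, a record of resting orders per price level on each side, rebuilt by replaying the exchange's add/cancel/execute messages. The Python sketch below shows the core data structure only; the real nanosecond-resolution feed handling in this study is far more involved, and the message fields here are simplified assumptions:

```python
from collections import defaultdict

# Minimal limit order book: replay add/cancel messages and keep the set of
# unexecuted shares at each price level, per side ('B' buy / 'S' sell).
class OrderBook:
    def __init__(self):
        self.orders = {}                       # order_id -> (side, price, shares)
        self.levels = {"B": defaultdict(int),  # price -> resting shares
                       "S": defaultdict(int)}

    def add(self, order_id, side, price, shares):
        self.orders[order_id] = (side, price, shares)
        self.levels[side][price] += shares

    def cancel(self, order_id, shares):
        side, price, remaining = self.orders[order_id]
        self.levels[side][price] -= shares
        self.orders[order_id] = (side, price, remaining - shares)

    def best_bid_ask(self):
        bids = [p for p, s in self.levels["B"].items() if s > 0]
        asks = [p for p, s in self.levels["S"].items() if s > 0]
        return (max(bids) if bids else None, min(asks) if asks else None)

book = OrderBook()
book.add(1, "B", 100.00, 200)   # order to buy 200 shares at $100.00
book.add(2, "S", 100.05, 300)   # order to sell 300 shares at $100.05
book.cancel(1, 200)             # immediate cancellation (the quote-stuffing pattern)
print(book.best_bid_ask())      # (None, 100.05)
```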
Massive Data Analysis of Large-eddy Simulation of Deep Convection in Atmosphere (Clouds) using vSMP
The Center for Multi-scale Modeling of Atmospheric Processes (CMMAP) is an NSF Science and Technology Center focused on improving the representation of cloud processes in climate models.
• System for Atmospheric Modeling: M. Kharoutdinov, SUNY Stonybrook
• Visualization: J. Helly, A. Chourasia
• Analysis: J. Helly, S. Strande
Simulation details (GigaLES model run dataset, partial):
• 40 time steps (24-hour simulation)
• 256 vertical layers
• 204.8 x 204.8 kilometers
• 100 m horizontal resolution
R analysis:
• 160 GB data set (40 netCDF files @ 4 GB each)
• 340 GB memory footprint
• ~3.5 hours for data input and analysis
MrBayes Running on Gordon through the CIPRES Gateway
Source: Wayne Pfeiffer, San Diego Supercomputer Center.
MrBayes 3.1.2 is used extensively via the CIPRES Science Gateway to infer phylogenetic trees. The hybrid parallel version running at SDSC uses both MPI and OpenMP.
• CIPRES has allowed over 4000 biologists world-wide to run parallel tree inference codes via a simple-to-use web interface.
• Applications can be targeted to appropriate architectures.
• Gordon provides a significant speedup for unpartitioned data sets over the SDSC Trestles system.
• A model for future data intensive projects
Application-Aware Dynamic Voltage and Frequency Scaling Saves an Average of 12% Energy on HPC Workloads
Source: Laura Carrington, PMaC Lab, San Diego Supercomputer Center. May 2012
A series of HPC applications was run on 1,024 cores comparing the Intel baseline power-savings settings against application-aware settings. The average performance penalty is 7.9%. LAMMPS realizes a power savings of 31.7% with a performance penalty of 3.9%.
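Since energy is average power times run time, a power reduction only pays off if the slowdown is smaller. A quick check of the LAMMPS numbers above under that simple model (assuming the 31.7% figure is a reduction in average power draw):

```python
# Energy = average power x run time. With application-aware DVFS the power drops,
# but the run time grows by the performance penalty, so both enter the energy bill.
power_reduction = 0.317   # LAMMPS: 31.7% lower average power draw (assumed meaning)
slowdown = 0.039          # 3.9% performance penalty

energy_ratio = (1 - power_reduction) * (1 + slowdown)
print(f"Energy saved: {(1 - energy_ratio) * 100:.1f}%")   # about 29%
```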
Gordon Impact as a Resource Provider
Conclusions
• The nature of computational research is becoming more data-intensive, requiring new kinds of high-performance computer architectures.
• Gordon is an innovative system that addresses a range of challenges associated with data-intensive computing.
• A prototype system and significant testing mitigated the challenges of deploying Gordon.
• Outreach to new user communities takes concerted and ongoing effort.
• Gordon supports a wide range of applications: large memory, MPI applications, and dedicated I/O nodes.
• Productive data-intensive computing is being done.
Thank you very much!
sstrande@ucsd.edu
And thank you to the co-authors:
Pietro Cicotti, Bob Sinkovits, Bill Young, Rick Wagner, Mahidhar Tatineni, Eva Hocks, Allan Snavely, and Mike Norman