Data-Driven Computational Science and Future Architectures at the Pittsburgh Supercomputing Center
Ralph Roskies, Scientific Director, Pittsburgh Supercomputing Center
Jan 30, 2009
NSF TeraGrid Cyberinfrastructure
• Mission: Advancing scientific research capability through advanced IT
• Resources: Computational, Data Storage, Instruments, Network
Now is a Resource Rich Time
• NSF has funded two very large distributed memory machines available to the national research community
– Track 2a (Texas): Ranger (62,976 cores, 579 teraflops, 123 TB memory)
– Track 2b (Tennessee): Kraken (18,048 cores, 166 teraflops, 18 TB memory), growing to close to a petaflop
– Track 2d: data centric; experimental architecture; … proposals in review
• All part of TeraGrid. Largest single allocation this past September was 46M processor hours.
• In 2011, NCSA is going to field a 10 PF machine.
Increasing Importance of Data in Scientific Discovery
• Large amounts from instruments and sensors:
– Genomics
– Large Hadron Collider
– Huge astronomy databases: Sloan Digital Sky Survey, Pan-STARRS, Large Synoptic Survey Telescope
• Results of large simulations (CFD, MD, cosmology, …)
Insight by Volume: NIST Machine Translation Contest
• In 2005, Google beat all the experts by exploiting 200 billion words of documents (high-quality UN Arabic-to-English translations), counting all 1-word, 2-word, …, 5-word phrases and estimating the best translation of each, then applying those estimates to the test text.
• No one on the Google team spoke Arabic or understood its syntax!
• Results depend critically on the volume of text analyzed; 1 billion words would not have sufficed.
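The counting step behind this approach can be illustrated with a short sketch. This is not Google's system; the `ngram_counts` function is a hypothetical, minimal illustration of how every 1- to 5-word phrase in a corpus can be tallied:

```python
from collections import Counter

def ngram_counts(tokens, max_n=5):
    """Count every 1- to max_n-word phrase in a token stream.

    At Web scale the same tallies, over 200 billion words, drive the
    choice of the most likely translation for each phrase.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

tokens = "the security council called for the security council vote".split()
counts = ngram_counts(tokens, max_n=3)
# Frequent phrases ("security council") dominate rare ones in the statistics.
```

With vastly more text, the relative frequencies of these phrases become reliable enough to pick translations without any knowledge of syntax, which is the slide's point about volume.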
What computer architecture is best for data intensive work?
• Based on discussions with many communities, we believe that a complementary architecture embodying large shared memory will be invaluable:
– Large graph algorithms (many fields, including web analysis, bioinformatics, …)
– Rapid assessment of data-analysis ideas, using OpenMP rather than MPI, with access to large data
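Why shared memory eases prototyping can be sketched in a few lines. This toy uses Python threads only to illustrate the programming model (on the real system the analogue is OpenMP in C or Fortran, and Python's GIL prevents actual parallel speedup here): every thread sees the same array, so no partitioning or message passing is needed.

```python
import threading

# One shared array, visible to all threads: no data decomposition,
# no send/receive calls, unlike an MPI formulation of the same sum.
data = list(range(1_000_000))
partial = [0, 0, 0, 0]

def sum_chunk(tid, nthreads):
    # Each thread reads its strided slice of the *shared* array in place.
    partial[tid] = sum(data[tid::nthreads])

threads = [threading.Thread(target=sum_chunk, args=(t, 4)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = sum(partial)
```

The MPI version of this would need explicit scatter and reduce steps; the shared-memory version is what makes "rapid assessment of data-analysis ideas" practical.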
PSC Facilities
(A history of first or early systems.)
• Storage silos: 2 PB
• DMF archive server
• Visualization nodes: NVIDIA Quadro4 980 XGL
• Storage cache nodes: 100 TB
• XT3 (BigBen): 4,136 processors, 22 Tflop/s
• Altix (Pople): 768 processors, 1.5 TB shared memory
PSC Shared Memory Systems
• Pople, introduced March 2008: SGI Altix 4700, 768 Intel cores, 1.5 TB coherent shared memory, NUMAlink interconnect
• Highly oversubscribed
• Has already stimulated work in new areas, because of the perceived ease of programming in shared memory:
– Game theory (poker)
– Epidemiological modeling
– Social network analysis
– Economics of Internet connectivity
– fMRI study of cognition
Desiderata for New System
• Powerful Performance
• Programmability
• Support for current applications
• Support for a host of new applications and science communities.
Proposed Track 2 System at PSC
• Combines next-generation Intel processors (Nehalem-EX) with SGI's next-generation interconnect technology (NUMAlink-5)
• ~100,000 cores, ~100 TB memory, ~1 PF peak
• At least 4TB coherent shared memory components, with full globally addressable memory
• Superb MPI and IO performance
Accelerated Performance
• MPI Offload Engine (MOE)
– Frees the CPU from MPI activity
– Faster reductions (2-3× compared to competitive clusters/MPPs)
– Order-of-magnitude faster barriers and random access
• NUMAlink 5 advantage
– 2-3× MPI latency improvement
– 3× the bandwidth of InfiniBand QDR
– Special support for block transfer and global operations
• Massively memory-mapped I/O
– Under user control
– Big speedup for I/O-bound applications
Enhanced productivity from Shared Memory
• Easier shared-memory programming for rapid development/prototyping
• Will allow large-scale generation of data and analysis on the same platform, without moving it (a major problem for current Track 2 systems)
• Mixed shared-memory/MPI programming between much larger blocks (e.g., Woodward's PPM code, or the examples on later slides)
High-Productivity, High-Performance Programming Models
The T2c system will support programming models for:
• High productivity: Star-P (parallel MATLAB), Python, R
• Coherent shared memory: OpenMP, pthreads
• Hybrid: MPI/OpenMP, MPI/threaded
• PGAS: UPC, CAF
• MPI, shmem
• Charm++
These span a spectrum from extreme capability and algorithm expression to user productivity and workflows.
Programming Models: Petascale Capability Applications
• Full-system applications will run in any of 4 programming models
• Dual emphasis on performance and productivity
– Existing codes
– Optimization for multicore
– New and rewritten applications
Programming Models: High Productivity Supercomputing
• Algorithm development
• Rapid prototyping
• Interactive simulation
• Also:
– Analysis and visualization
– Computational steering
– Workflows
Programming Models: New Research Communities
• Multi-TB coherent shared memory
• Global address space
• Express algorithms not well served by distributed systems
– Complex, dynamic connectivity
– Simplified load balancing
Enhanced Service for Current Power Users: Analyze Massive Data Where You Produce It
• Combines superb MPI performance with shared memory and higher-level languages for rapid analysis prototyping.
Analysis of Seismology Simulation Results
• Validation across models (Quake: CMU; AWM: SCEC). 4D waveform output at 2 Hz (to address civil-engineering structures) for 200 s earthquake simulations will generate hundreds of TB of output.
• Voxel-by-voxel comparison is not an appropriate comparison technique. PSC developed data-intensive statistical analysis tools to understand subtle differences in these vast spatiotemporal datasets.
• The analysis required having substantial windows of both datasets in memory to compare.
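The windowed, in-memory comparison described here can be sketched simply. The `windowed_rmse` function below is a hypothetical illustration (PSC's actual statistical tools are more sophisticated): it holds whole windows of both datasets in memory at once, which is exactly what a large shared-memory machine makes feasible at the hundreds-of-TB scale.

```python
import math

def windowed_rmse(a, b, window):
    """Root-mean-square difference of two signals over successive windows.

    Both windows must be resident in memory simultaneously; at full
    scale that requirement is what motivates large shared memory.
    """
    out = []
    for start in range(0, len(a) - window + 1, window):
        wa, wb = a[start:start + window], b[start:start + window]
        mse = sum((x - y) ** 2 for x, y in zip(wa, wb)) / window
        out.append(math.sqrt(mse))
    return out

# Two toy "models" of the same waveform, differing by a constant bias.
model_a = [math.sin(0.01 * t) for t in range(1000)]
model_b = [math.sin(0.01 * t) + 0.1 for t in range(1000)]
diffs = windowed_rmse(model_a, model_b, window=250)
```

A voxel-by-voxel (pointwise) comparison would flag every sample; the windowed statistic instead summarizes how the models differ region by region.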
Design of LSST Detectors
• Gravitational lensing can map the distribution of dark matter in the Universe and make estimates of the dark-energy content more accurate.
– The measurements are very subtle.
– High-quality modeling, with robust statistics, is needed for LSST detector design.
• Must calculate ~10,000 light cones through each simulated universe.
– Each universe is 30 TB.
– Each light-cone calculation requires analyzing large chunks of the entire dataset.
Understanding the Processes that Drive Stress-Corrosion Cracking (SCC)
• Stress-corrosion cracking affects the safe, reliable performance of buildings, dams, bridges, and vehicles.
– Corrosion costs the U.S. economy about 3% of GDP annually.
• Predicting the lifetime beyond which SCC may cause failure requires multiscale simulations that couple quantum, atomistic, and structural scales.
– 100-300 nm, 1-10 million atoms, over 1-5 μs, 1 fs timestep
• Efficient execution requires large SMP nodes to minimize surface-to-volume communication, large cache capacity, and high-bandwidth, low-latency communication.
• The system is expected to achieve the ~1,000 timesteps per second needed for realistic simulation of stress-corrosion cracking.
A crack in the surface of a piece of metal grows from the activity of atoms at the point of cracking. Quantum-level simulation (right panel) leads to modeling of the consequences (left panel). From http://viterbi.usc.edu/news/news/2004/2004_10_08_corrosion.htm
Courtesy of Priya Vashishta, USC
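A quick sanity check of the numbers on this slide shows why ~1,000 timesteps per second matters (the wall-clock figure is our own back-of-envelope arithmetic, not from the slide):

```python
# 1 microsecond of physical time (low end of the 1-5 us range above),
# advanced in 1 fs timesteps.
fs_per_us = 10**9          # 1 us = 1e9 fs, so 1e9 timesteps per simulated us
steps = fs_per_us
rate = 1_000               # target: ~1000 timesteps per wall-clock second
wallclock_s = steps / rate         # 1e6 seconds of wall-clock time
wallclock_days = wallclock_s / 86_400   # about 11.6 days per simulated us
```

At even 10× fewer timesteps per second, a single microsecond-scale SCC run would take months, which is why the slide treats the sustained rate as the key requirement.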
Analyzing the Spread of Pandemics
• Understanding the spread of infectious diseases is critical for effective response to disease outbreaks (e.g., avian flu).
• EpiFast: a fast, reliable method for simulating pandemics, based on a combinatorial interpretation of percolation on directed networks.
– Madhav Marathe, Keith Bisset, et al., Network Dynamics and Simulation Science Laboratory (NDSSL) at Virginia Tech
• Large shared memory is needed for efficient implementation of the graph-theoretic algorithms that simulate transmission networks, which model how disease spreads from one individual to the next.
• 4 TB of shared memory will allow study of world-wide pandemics.
From Karla Atkins et al., An Interaction Based Composable Architecture for Building Scalable Models of Large Social, Biological, Information and Technical Systems, CTWatch Quarterly, March 2008, http://www.ctwatch.org
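The percolation idea can be shown in miniature. This is emphatically not EpiFast itself; `percolation_outbreak` is a hypothetical toy that captures only the core combinatorial view: each contact edge independently "transmits" with probability p, and the outbreak is the set of people reachable from the seed through transmitting edges.

```python
import random
from collections import deque

def percolation_outbreak(edges, seed, p, rng):
    """Toy bond-percolation view of epidemic spread on a directed network.

    Decide once, per edge, whether it transmits; then the outbreak is a
    reachability computation over the 'open' edges. At national scale
    this graph traversal is what wants large shared memory.
    """
    open_edges = {}
    for u, v in edges:
        open_edges.setdefault(u, [])
        if rng.random() < p:
            open_edges[u].append(v)
    infected, queue = {seed}, deque([seed])
    while queue:
        u = queue.popleft()
        for v in open_edges.get(u, []):
            if v not in infected:
                infected.add(v)
                queue.append(v)
    return infected

# A small chain of contacts: with p=1 every contact transmits.
contacts = [(0, 1), (1, 2), (2, 3), (3, 4)]
outbreak = percolation_outbreak(contacts, seed=0, p=1.0, rng=random.Random(42))
```

Because the edges' transmit decisions are independent of the traversal, many outbreak scenarios can be replayed cheaply over one resident copy of the contact network.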
Engaging New Communities: Memory-Intensive Graph Algorithms
• Web analytics: a graph of ~10^10 pages (nodes) and ~10^11 links (edges); at 40 bytes/link, the link structure alone is ~4 TB.
• Applications: fight spam, rank importance, cluster information, determine communities.
• These algorithms are notoriously hard to implement on distributed-memory machines.
Courtesy of Guy Blelloch (CMU)
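The slide's arithmetic, and the appeal of keeping the whole graph in one address space, can be checked in a few lines (the degree-counting loop is a hypothetical stand-in for real ranking algorithms such as PageRank):

```python
from collections import defaultdict

# Sanity check of the web-graph numbers on this slide.
pages, links, bytes_per_link = 10**10, 10**11, 40
total_tb = links * bytes_per_link / 10**12   # 4.0 TB: fits in 4 TB of shared memory

# In shared memory the graph is one big adjacency structure; a toy
# "rank importance" pass (in-degree counting) needs no partitioning.
adjacency = defaultdict(list)
for src, dst in [(0, 1), (0, 2), (1, 2), (2, 0)]:
    adjacency[src].append(dst)

in_degree = defaultdict(int)
for dsts in adjacency.values():
    for d in dsts:
        in_degree[d] += 1
```

On a distributed-memory machine the same pass requires partitioning the edge list across nodes and exchanging counts, which is what makes these algorithms notoriously hard to implement there.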
More Memory-Intensive Graph Algorithms
• Protein-interaction graphs → biological pathways
• IP packet/session graphs → computer security
• Item/common-receipt graphs → analyzing buying habits
• Word-adjacency graphs → machine translation
• Also: epidemiology, social networks, …
Courtesy of Guy Blelloch (CMU)
PSC T2c: Summary
• PSC’s T2c system, when awarded, will leverage architectural innovations in the processor (Intel Nehalem-EX) and the platform (SGI Project Ultraviolet) to enable groundbreaking science and engineering simulations using both “traditional HPC” and emerging paradigms
• Complement and dramatically extend existing NSF program capabilities
• Usability features will be transformative
– Unprecedented range of target communities
• perennial computational scientists
• algorithm developers, especially those tackling irregular problems
• data-intensive and memory-intensive fields
• highly dynamic workflows (modify code, run, modify code again, run again, …)
• Reduced concept-to-results time transforming NSF user productivity
Integrated in National Cyberinfrastructure
• Enabled and supported by PSC's advanced user support, application and system optimization, and middleware and infrastructure, leveraging national cyberinfrastructure.
Questions?
Predicting Mesoscale Atmospheric Phenomena
• Accurate prediction of atmospheric phenomena at the 1-100 km scale is needed to reduce economic losses and injuries due to strong storms.
• To achieve this, we require 20-member ensemble runs at 1 km resolution, covering the continental US, with dynamic data assimilation in quasi-real time.
– Ming Xue, University of Oklahoma
– Reaching 1.0-1.5 km resolution is critical. (In certain weather situations, fewer ensemble members may suffice.)
• Expected to sustain 200 Tflop/s for WRF, enabling prediction of atmospheric phenomena at the mesoscale.
Fanyou Kong et al., Real-Time Storm-Scale Ensemble Forecast Experiment – Analysis of 2008 Spring Experiment Data, Preprints, 24th Conf. on Severe Local Storms, Amer. Meteor. Soc., 27-31 October 2008. http://twister.ou.edu/papers/Kong_24thSLS_extendedabs-2008.pdf
Reliability
• Hardware-enabled fault detection, prevention, containment
• Enhanced monitoring and serviceability
• NUMAlink automatic retry and various error-correcting mechanisms