Data-Driven Computational Science and Future Architectures at the Pittsburgh Supercomputing Center
Ralph Roskies, Scientific Director, Pittsburgh Supercomputing Center
Jan 30, 2009
NSF TeraGrid Cyberinfrastructure
• Mission: Advancing scientific research capability through advanced IT
• Resources: Computational, Data Storage, Instruments, Network
Now is a Resource Rich Time
• NSF has funded two very large distributed memory machines available to the national research community
– Track 2a (Texas): Ranger (62,976 cores, 579 teraflops, 123 TB memory)
– Track 2b (Tennessee): Kraken (18,048 cores, 166 teraflops, 18 TB memory), growing to close to a petaflop
– Track 2d: data centric; experimental architecture; … proposals in review
• All part of TeraGrid. Largest single allocation this past September was 46M processor hours.
• In 2011, NCSA is going to field a 10 PF machine.
Increasing Importance of Data in Scientific Discovery
• Large amounts from instruments and sensors:
– Genomics
– Large Hadron Collider
– Huge astronomy databases: Sloan Digital Sky Survey, Pan-STARRS, Large Synoptic Survey Telescope
• Results of large simulations (CFD, MD, cosmology, …)
Insight by Volume: NIST Machine Translation Contest
• In 2005, Google beat all the experts by exploiting 200 billion words of documents (high-quality UN Arabic-to-English translations), counting all 1-word, 2-word, …, 5-word phrases and estimating the best translation of each, then applying those estimates to the test text.
• No one on the Google team spoke Arabic or understood its syntax!
• Results depend critically on the volume of text analyzed; 1 billion words would not have sufficed.
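The counting step behind this approach can be illustrated with a short sketch. This is not Google's system; the `ngram_counts` function is a hypothetical, minimal illustration of how every 1- to 5-word phrase in a corpus can be tallied:

```python
from collections import Counter

def ngram_counts(tokens, max_n=5):
    """Count every 1- to max_n-word phrase in a token stream.

    At Web scale the same tallies, over 200 billion words, drive the
    choice of the most likely translation for each phrase.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

tokens = "the security council called for the security council vote".split()
counts = ngram_counts(tokens, max_n=3)
# Frequent phrases ("security council") dominate rare ones in the statistics.
```

With vastly more text, the relative frequencies of these phrases become reliable enough to pick translations without any knowledge of syntax, which is the slide's point about volume.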
What computer architecture is best for data intensive work?
• Based on discussions with many communities, we believe that a complementary architecture embodying large shared memory will be invaluable:
– Large graph algorithms (many fields, including web analysis, bioinformatics, …)
– Rapid assessment of data-analysis ideas, using OpenMP rather than MPI, with access to large data
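Why shared memory eases prototyping can be sketched in a few lines. This toy uses Python threads only to illustrate the programming model (on the real system the analogue is OpenMP in C or Fortran, and Python's GIL prevents actual parallel speedup here): every thread sees the same array, so no partitioning or message passing is needed.

```python
import threading

# One shared array, visible to all threads: no data decomposition,
# no send/receive calls, unlike an MPI formulation of the same sum.
data = list(range(1_000_000))
partial = [0, 0, 0, 0]

def sum_chunk(tid, nthreads):
    # Each thread reads its strided slice of the *shared* array in place.
    partial[tid] = sum(data[tid::nthreads])

threads = [threading.Thread(target=sum_chunk, args=(t, 4)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = sum(partial)
```

The MPI version of this would need explicit scatter and reduce steps; the shared-memory version is what makes "rapid assessment of data-analysis ideas" practical.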
PSC Facilities
(A history of first or early systems.)
• Storage silos: 2 PB
• DMF archive server
• Visualization nodes: NVIDIA Quadro4 980 XGL
• Storage cache nodes: 100 TB
• XT3 (BigBen): 4,136 processors, 22 Tflop/s
• Altix (Pople): 768 processors, 1.5 TB shared memory
PSC Shared Memory Systems
• Pople, introduced March 2008: SGI Altix 4700, 768 Intel cores, 1.5 TB coherent shared memory, NUMAlink interconnect
• Highly oversubscribed
• Has already stimulated work in new areas, because of the perceived ease of programming in shared memory:
– Game theory (poker)
– Epidemiological modeling
– Social network analysis
– Economics of Internet connectivity
– fMRI study of cognition
Desiderata for New System
• Powerful Performance
• Programmability
• Support for current applications
• Support for a host of new applications and science communities.
Proposed Track 2 System at PSC
• Combines next-generation Intel processors (Nehalem-EX) with SGI's next-generation interconnect technology (NUMAlink-5)
• ~100,000 cores, ~100 TB memory, ~1 PF peak
• At least 4TB coherent shared memory components, with full globally addressable memory
• Superb MPI and IO performance
Accelerated Performance
• MPI Offload Engine (MOE)
– Frees the CPU from MPI activity
– Faster reductions (2-3× compared to competitive clusters/MPPs)
– Order-of-magnitude faster barriers and random access
• NUMAlink 5 advantage
– 2-3× MPI latency improvement
– 3× the bandwidth of InfiniBand QDR
– Special support for block transfer and global operations
• Massively memory-mapped I/O
– Under user control
– Big speedup for I/O-bound applications
Enhanced productivity from Shared Memory
• Easier shared-memory programming for rapid development/prototyping
• Will allow large-scale generation of data and analysis on the same platform, without moving it (a major problem for current Track 2 systems)
• Mixed shared-memory/MPI programming between much larger blocks (e.g., Woodward's PPM code, or the examples on later slides)
High-Productivity, High-Performance Programming Models
The T2c system will support programming models for:
• High productivity: Star-P (parallel MATLAB), Python, R
• Coherent shared memory: OpenMP, pthreads
• Hybrid: MPI/OpenMP, MPI/threaded
• PGAS: UPC, CAF
• MPI, shmem
• Charm++
These span a spectrum from extreme capability and algorithm expression to user productivity and workflows.
Programming Models: Petascale Capability Applications
• Full-system applications will run in any of 4 programming models
• Dual emphasis on performance and productivity
– Existing codes
– Optimization for multicore
– New and rewritten applications
Programming Models: High Productivity Supercomputing
• Algorithm development
• Rapid prototyping
• Interactive simulation
• Also:
– Analysis and visualization
– Computational steering
– Workflows
Programming Models: New Research Communities
• Multi-TB coherent shared memory
• Global address space
• Express algorithms not well served by distributed systems
– Complex, dynamic connectivity
– Simplified load balancing
Enhanced Service for Current Power Users: Analyze Massive Data Where You Produce It
• Combines superb MPI performance with shared memory and higher-level languages for rapid analysis prototyping.
Analysis of Seismology Simulation Results
• Validation across models (Quake: CMU; AWM: SCEC). 4D waveform output at 2 Hz (to address civil-engineering structures) for 200 s earthquake simulations will generate hundreds of TB of output.
• Voxel-by-voxel comparison is not an appropriate comparison technique. PSC developed data-intensive statistical analysis tools to understand subtle differences in these vast spatiotemporal datasets.
• The analysis required having substantial windows of both datasets in memory to compare.
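The windowed, in-memory comparison described here can be sketched simply. The `windowed_rmse` function below is a hypothetical illustration (PSC's actual statistical tools are more sophisticated): it holds whole windows of both datasets in memory at once, which is exactly what a large shared-memory machine makes feasible at the hundreds-of-TB scale.

```python
import math

def windowed_rmse(a, b, window):
    """Root-mean-square difference of two signals over successive windows.

    Both windows must be resident in memory simultaneously; at full
    scale that requirement is what motivates large shared memory.
    """
    out = []
    for start in range(0, len(a) - window + 1, window):
        wa, wb = a[start:start + window], b[start:start + window]
        mse = sum((x - y) ** 2 for x, y in zip(wa, wb)) / window
        out.append(math.sqrt(mse))
    return out

# Two toy "models" of the same waveform, differing by a constant bias.
model_a = [math.sin(0.01 * t) for t in range(1000)]
model_b = [math.sin(0.01 * t) + 0.1 for t in range(1000)]
diffs = windowed_rmse(model_a, model_b, window=250)
```

A voxel-by-voxel (pointwise) comparison would flag every sample; the windowed statistic instead summarizes how the models differ region by region.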
Design of LSST Detectors
• Gravitational lensing can map the distribution of dark matter in the Universe and make estimates of the dark-energy content more accurate.
– The measurements are very subtle.
– High-quality modeling, with robust statistics, is needed for LSST detector design.
• Must calculate ~10,000 light cones through each simulated universe.
– Each universe is 30 TB.
– Each light-cone calculation requires analyzing large chunks of the entire dataset.
Understanding the Processes that Drive Stress-Corrosion Cracking (SCC)
• Stress-corrosion cracking affects the safe, reliable performance of buildings, dams, bridges, and vehicles.
– Corrosion costs the U.S. economy about 3% of GDP annually.
• Predicting the lifetime beyond which SCC may cause failure requires multiscale simulations that couple quantum, atomistic, and structural scales.
– 100-300 nm, 1-10 million atoms, over 1-5 μs, 1 fs timestep
• Efficient execution requires large SMP nodes to minimize surface-to-volume communication, large cache capacity, and high-bandwidth, low-latency communication.
• The system is expected to achieve the ~1,000 timesteps per second needed for realistic simulation of stress-corrosion cracking.
A crack in the surface of a piece of metal grows from the activity of atoms at the point of cracking. Quantum-level simulation (right panel) leads to modeling of the consequences (left panel). From http://viterbi.usc.edu/news/news/2004/2004_10_08_corrosion.htm
Courtesy of Priya Vashishta, USC
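A quick sanity check of the numbers on this slide shows why ~1,000 timesteps per second matters (the wall-clock figure is our own back-of-envelope arithmetic, not from the slide):

```python
# 1 microsecond of physical time (low end of the 1-5 us range above),
# advanced in 1 fs timesteps.
fs_per_us = 10**9          # 1 us = 1e9 fs, so 1e9 timesteps per simulated us
steps = fs_per_us
rate = 1_000               # target: ~1000 timesteps per wall-clock second
wallclock_s = steps / rate         # 1e6 seconds of wall-clock time
wallclock_days = wallclock_s / 86_400   # about 11.6 days per simulated us
```

At even 10× fewer timesteps per second, a single microsecond-scale SCC run would take months, which is why the slide treats the sustained rate as the key requirement.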
Analyzing the Spread of Pandemics
• Understanding the spread of infectious diseases is critical for effective response to disease outbreaks (e.g., avian flu).
• EpiFast: a fast, reliable method for simulating pandemics, based on a combinatorial interpretation of percolation on directed networks.
– Madhav Marathe, Keith Bisset, et al., Network Dynamics and Simulation Science Laboratory (NDSSL) at Virginia Tech
• Large shared memory is needed for efficient implementation of the graph-theoretic algorithms that simulate transmission networks, which model how disease spreads from one individual to the next.
• 4 TB of shared memory will allow study of world-wide pandemics.
From Karla Atkins et al., An Interaction Based Composable Architecture for Building Scalable Models of Large Social, Biological, Information and Technical Systems, CTWatch Quarterly, March 2008, http://www.ctwatch.org
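The percolation idea can be shown in miniature. This is emphatically not EpiFast itself; `percolation_outbreak` is a hypothetical toy that captures only the core combinatorial view: each contact edge independently "transmits" with probability p, and the outbreak is the set of people reachable from the seed through transmitting edges.

```python
import random
from collections import deque

def percolation_outbreak(edges, seed, p, rng):
    """Toy bond-percolation view of epidemic spread on a directed network.

    Decide once, per edge, whether it transmits; then the outbreak is a
    reachability computation over the 'open' edges. At national scale
    this graph traversal is what wants large shared memory.
    """
    open_edges = {}
    for u, v in edges:
        open_edges.setdefault(u, [])
        if rng.random() < p:
            open_edges[u].append(v)
    infected, queue = {seed}, deque([seed])
    while queue:
        u = queue.popleft()
        for v in open_edges.get(u, []):
            if v not in infected:
                infected.add(v)
                queue.append(v)
    return infected

# A small chain of contacts: with p=1 every contact transmits.
contacts = [(0, 1), (1, 2), (2, 3), (3, 4)]
outbreak = percolation_outbreak(contacts, seed=0, p=1.0, rng=random.Random(42))
```

Because the edges' transmit decisions are independent of the traversal, many outbreak scenarios can be replayed cheaply over one resident copy of the contact network.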
Engaging New Communities: Memory-Intensive Graph Algorithms
• Web analytics: a graph of ~10^10 pages (nodes) and ~10^11 links (edges); at 40 bytes/link, the link structure alone is ~4 TB.
• Applications: fight spam, rank importance, cluster information, determine communities.
• These algorithms are notoriously hard to implement on distributed-memory machines.
Courtesy of Guy Blelloch (CMU)
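The slide's arithmetic, and the appeal of keeping the whole graph in one address space, can be checked in a few lines (the degree-counting loop is a hypothetical stand-in for real ranking algorithms such as PageRank):

```python
from collections import defaultdict

# Sanity check of the web-graph numbers on this slide.
pages, links, bytes_per_link = 10**10, 10**11, 40
total_tb = links * bytes_per_link / 10**12   # 4.0 TB: fits in 4 TB of shared memory

# In shared memory the graph is one big adjacency structure; a toy
# "rank importance" pass (in-degree counting) needs no partitioning.
adjacency = defaultdict(list)
for src, dst in [(0, 1), (0, 2), (1, 2), (2, 0)]:
    adjacency[src].append(dst)

in_degree = defaultdict(int)
for dsts in adjacency.values():
    for d in dsts:
        in_degree[d] += 1
```

On a distributed-memory machine the same pass requires partitioning the edge list across nodes and exchanging counts, which is what makes these algorithms notoriously hard to implement there.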
More Memory-Intensive Graph Algorithms
• Protein-interaction graphs → biological pathways
• IP packet/session graphs → computer security
• Item/common-receipt graphs → analyzing buying habits
• Word-adjacency graphs → machine translation
• Also: epidemiology, social networks, …
Courtesy of Guy Blelloch (CMU)
PSC T2c: Summary
• PSC’s T2c system, when awarded, will leverage architectural innovations in the processor (Intel Nehalem-EX) and the platform (SGI Project Ultraviolet) to enable groundbreaking science and engineering simulations using both “traditional HPC” and emerging paradigms
• Complement and dramatically extend existing NSF program capabilities
• Usability features will be transformative
– Unprecedented range of target communities
• perennial computational scientists
• algorithm developers, especially those tackling irregular problems
• data-intensive and memory-intensive fields
• highly dynamic workflows (modify code, run, modify code again, run again, …)
• Reduced concept-to-results time transforming NSF user productivity
Integrated in National Cyberinfrastructure
• Enabled and supported by PSC's advanced user support, application and system optimization, and middleware and infrastructure, leveraging national cyberinfrastructure.
Questions?
Predicting Mesoscale Atmospheric Phenomena
• Accurate prediction of atmospheric phenomena at the 1-100 km scale is needed to reduce economic losses and injuries due to strong storms.
• To achieve this, we require 20-member ensemble runs at 1 km resolution, covering the continental US, with dynamic data assimilation in quasi-real time.
– Ming Xue, University of Oklahoma
– Reaching 1.0-1.5 km resolution is critical. (In certain weather situations, fewer ensemble members may suffice.)
• Expected to sustain 200 Tflop/s for WRF, enabling prediction of atmospheric phenomena at the mesoscale.
Fanyou Kong et al., Real-Time Storm-Scale Ensemble Forecast Experiment – Analysis of 2008 Spring Experiment Data, Preprints, 24th Conf. on Severe Local Storms, Amer. Meteor. Soc., 27-31 October 2008. http://twister.ou.edu/papers/Kong_24thSLS_extendedabs-2008.pdf
Reliability
• Hardware-enabled fault detection, prevention, containment
• Enhanced monitoring and serviceability
• NUMAlink automatic retry and various error-correcting mechanisms