Simulation Science in Grid Environments: Integrated Adaptive Software Systems

Lennart Johnsson
Advanced Computing Research Laboratory, Department of Computer Science, University of Houston
and
Department of Numerical Analysis and Computer Science, Royal Institute of Technology, Stockholm
Outline
• Technology drivers
• Sample applications
• Domain specific software environments
• High-performance software
Cost of Computing
In 2010, the compute power of today’s top-of-the-line PC can be found in $1 consumer electronics
Today’s most powerful computers (the power of 10,000 PCs, at a cost of ~$100M) will by then cost a few hundred thousand dollars
SIA Roadmap

[Charts: SIA Roadmap projections, 1999 (180 nm) through 2014 (30 nm). One plots functions per chip (Mtransistors, 0–100,000) for DRAM and for cost-performance and high-performance MPUs at introduction and production, plus ASICs at production; the other plots cost per Mtransistor ($/Mtransistor, 0–45) for DRAM (cost × 100) and MPUs at introduction and production.]
In 2010, $1 will buy enough disk space to store:
• 10,000 books
• 35 hrs of CD-quality audio
• 2 min of DVD-quality video
[Chart: Average Price of Storage, 1980–2010 (Ed Grochowski, IBM Almaden): price per MByte ($0.001–$1000, log scale) for HDDs (3.5", 2.5", and 1" drives, from the IBM 6150 and Seagate ST125 through the IBM Ultrastar, Deskstar, Travelstar, and 1 GB Microdrive), DRAM, Flash (8 KB through 128 MB parts), and the range of paper/film, with DataQuest 2000 projections for 1" HDD and Flash.]
Growth of Cell vs. Internet

[Chart: cell subscriptions vs. Internet hosts, 1992–2001, in millions (0–800).]
Access Technologies
Computing Platforms 2001 ⇒ 2030

Personal Computers O[$1000]
– 10^9 Flops/sec in 2001 ⇒ 10^15–10^17 Flops/sec by 2030

Supercomputers O[$100,000,000]
– 10^13 Flops/sec in 2001 ⇒ 10^18–10^20 Flops/sec by 2030

Number of Computers [global population ~10^10]
– SCs ⇒ 10^-8–10^-6 per person ⇒ 10^2–10^4 systems
– PCs ⇒ 0.1x–10x per person ⇒ 10^9–10^11 systems
– Embedded ⇒ 10x–10^5x per person ⇒ 10^11–10^15 systems
– Nanocomputers ⇒ 0x–10^10 per person ⇒ 0–10^20 systems

Available Flops Planetwide by 2030
– 10^24–10^30 Flops/sec [assuming classical models of computation]
Courtesy Rick Stevens
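The per-person counts above multiply straight through to the quoted totals. A quick back-of-envelope check of the slide's own ranges (variable names are mine):

```python
# Back-of-envelope check of the 2030 projections above.
# All ranges are the slide's own figures.

population = 1e10  # ~10^10 people

# systems = (per-person count) x population
supercomputers = (1e-8 * population, 1e-6 * population)  # 10^2 - 10^4 systems
pcs            = (0.1  * population, 10   * population)  # 10^9 - 10^11 systems
embedded       = (10   * population, 1e5  * population)  # 10^11 - 10^15 systems

# Planetwide flops: PCs alone (10^9 - 10^11 machines at
# 10^15 - 10^17 flops/sec each) span 10^24 - 10^28 flops/sec,
# consistent with the quoted 10^24 - 10^30 total.
pc_flops_low  = 1e9  * 1e15
pc_flops_high = 1e11 * 1e17

print(supercomputers, pc_flops_low, pc_flops_high)
```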
MEMS - Biosensors
http://www.darpa.mil/mto/mems/presentations/memsatdarpa3.pdf
MEMS – Jet Engine Application
http://www.darpa.mil/mto/mems/presentations/memsatdarpa3.pdf
Smart Dust - UCB RF Mote
RF Mini Mote ILaser Mote
Sensor
IrDA Mote
http://robotics.eecs.berkeley.edu/~pister/SmartDust/
RF Mini Mote II
Laser Mote with CCD
Polymer Radio Frequency Identification Transponder
http://www.research.philips.com/pressmedia/pictures/polelec.html
Optical Communication costs
Larry Roberts, Caspian Networks
Fiber Optic Communication
In 2010. . .
A million books can be sent across the Pacific for $1 in 8 seconds
All books in the American Research Libraries can be sent across the Pacific in about 1 hr for $500
Fiberoptic Communication Milestones
First Laser 1960
First room temperature laser, ~1970
Continuous mode commercial lasers, ~1980
Tunable lasers, ~1990
Commercial fiberoptic WANs, 1985
10 Tbps/strand demonstrated in 2000 (10% of fiber peak capacity). (10 Tbps is enough bandwidth to transmit a million high-definition movies simultaneously, or over 100 million phone calls.)
WAN fiberoptic cables often have 384 strands of fiber and would have a capacity of 2 Pbps. Several such cables are typically deployed in the same conduit/right-of-way
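The ~2 Pbps cable capacity is consistent with the per-strand figure if the 384 strands are used as 192 transmit/receive pairs; that pairing is my assumption, as the slide does not state it:

```python
# Sanity check of the cable-capacity claim above.
# Assumption (mine): strands are deployed in transmit/receive pairs,
# so a 384-strand cable carries 192 full-duplex channels.

tbps_per_strand = 10      # demonstrated per-strand rate, 2000
strands = 384
pairs = strands // 2      # 192 duplex channels

capacity_pbps = pairs * tbps_per_strand / 1000
print(capacity_pbps)  # 1.92, i.e. "about 2 Pbps"
```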
[Chart: backbone bandwidth growth, 1986–2001 (Mbit/s, 0–10,000): NSFnet, vBNS, Internet2 Abilene, and TeraGrid at OC-3, OC-12, OC-48, and OC-192 rates, doubling every year; Pacific and Atlantic capacity shown for comparison.]
[Map: DTF 40Gb network linking SDSC (San Diego), NCSA/UIUC, ANL, UC, UIC, IIT, NU/Starlight, Star Tap, PSC, IU, and U Wisconsin, with connections to Vancouver, Seattle, Portland, San Francisco, Los Angeles, Chicago, NYC, Atlanta, SURFnet, CA*net4, AMPATH, NTON, and I-WIRE.]

Charlie Catlett, Argonne National Laboratory
• State Funded Infrastructure to support Networking and Applications Research
  – $6.5M Total Funding
    • $4M FY00-01 (in hand)
    • $2.5M FY02 (approved 1-June-01)
    • Possible add’l $1M in FY03-5
  – Application Driven
    • Access Grid: Telepresence & Media
    • Computational Grids: Internet Computing
    • Data Grids: Information Analysis
  – New Technologies Proving Ground
    • Optical Switching
    • Dense Wave Division Multiplexing
    • Ultra-High Speed SONET
    • Wireless
    • Advanced middleware infrastructure
CalREN-2
CA*net 4 Architecture
[Map: CA*net 4 nodes and possible future nodes (Victoria, Vancouver, Calgary, Edmonton, Saskatoon, Regina, Winnipeg, Thunder Bay, Windsor, Toronto, Ottawa, Montreal, Quebec, Fredericton, Charlottetown, Halifax, St. John’s) with links to Seattle, Chicago, New York, and Boston; legend distinguishes CANARIE GigaPOP, ORAN DWDM, and carrier DWDM.]
Bill St Arnaud
CANARIE
Wavelength Disk Drives

Computer data continuously circulates around the WDD

[Map: WDD nodes on CA*net 3/4 at Vancouver, Calgary, Regina, Winnipeg, Toronto, Ottawa, Montreal, Fredericton, Charlottetown, Halifax, and St. John’s.]
GEANT
Nordic Grid Networks
[Map legend: link speeds of 0.155, 0.622, 2.5, and 10 Gbps.]
SURFnet4 Topology
Grid Applications
Grid Application Projects
ODIN
PAMELA
March 28, 2000 Fort Worth Tornado
Courtesy Kelvin Droegemeier
In 1988 … NEXRAD Was Becoming a Reality
Courtesy Kelvin Droegemeier
Houston, TX
Environmental Studies
Neptune Undersea Grid
Air Quality Measurement and Control

Surface data, radar data, balloon data, satellite data
Real-time data
NCAR
Digital Mammography
• About 40 million mammograms/yr (USA) (estimates 32–48 million)
• About 250,000 new breast cancer cases detected each year
• Over 10,000 units (analogue)
• Resolution: up to about 25 microns/pixel
• Image size: up to about 4k x 6k (example: 4096 x 5624)
• Dynamic range: 12 bits
• Image size: about 48 Mbytes
• Images per patient: 4
• Data set size per patient: about 200 Mbytes
• Data set per year: about 10 Pbytes
• Data set per unit, if digital: 1 Tbytes/yr, on average
• Data rates/unit: 4 Gbytes/operating day, or 0.5 Gbytes/hr, or 1 Mbps
• Computation: 100 ops/pixel = 10 Mflops/unit, 100 Gflops total; 1000 ops/pixel = 1 Tflops total
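These estimates hang together arithmetically. A quick check (my reading: "40 million mammograms/yr" means 40 million four-image exams; all constants are from the slide):

```python
# Consistency check of the digital-mammography numbers above.

MB = 1e6  # bytes (decimal megabytes, as usual for storage estimates)

image_size = 48 * MB                  # ~48 MB per image
images_per_patient = 4
per_patient = images_per_patient * image_size   # 192 MB, "about 200 MB"

exams_per_year = 40e6
per_year = exams_per_year * per_patient         # 7.68e15 B, "about 10 PB"

units = 10_000
per_unit_year = per_year / units                # ~0.77 TB/yr, "~1 TB/yr"

# ~3 GB per operating day (slide rounds to 4 GB), assuming
# ~250 operating days per year.
per_unit_day = per_unit_year / 250

print(round(per_patient / MB), per_year, round(per_unit_day / 1e9, 1))
```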
E-Science: Data Gathering, Analysis, Simulation, and Collaboration
LHC
CMS
Simulated Higgs Decay
Molecular Dynamics
Jim Briggs
University of Houston
Molecular Dynamics Simulations
SimDB
Simulation Data Base
SimDB Architecture
[Image: JEOL 3000-FEG with liquid He stage (NSF support); scale bar 500 Å]
Biological Imaging
No. of Particles Needed for 3-D Reconstruction

Resolution:    8.5 Å    4.5 Å
B = 100 Å²     6,000    5,000,000
B = 50 Å²      3,000    150,000

8.5 Å Structure of the HSV-1 Capsid
EMEN Database
• Archival
• Data Mining
• Management

Vitrification Robot

EMAN reconstruction loop: Particle Selection / Power Spectrum Analysis → Initial 3D Model → Classify Particles → Align, Average, Deconvolute → Build New 3D Model → Reproject 3D Model → (iterate)
Tele-Microscopy, Osaka, Japan (Mark Ellisman, UCSD)
GEMSviz at iGRID 2000
STAR TAP
NORDUnet
APAN
INET
Paralleldatorcentrum, KTH Stockholm
University of Houston
Computational Steering
GrADS – Grid Application Development Software
Grids – Contract Development
Grids – Application Launch
Grids – Library Evaluation
Grids – Performance Models
Grids – Library Evaluation
Cactus on the Grid
Cactus – Job Migration
Cactus – Migration Architecture
Cactus – Migration example
Adaptive Software
• Diversity of execution environments
  – Growing complexity of modern microprocessors
    • Deep memory hierarchies
    • Out-of-order execution
    • Instruction level parallelism
  – Growing diversity of platform characteristics
    • SMPs
    • Clusters (employing a range of interconnect technologies)
    • Grids (heterogeneity, wide range of characteristics)
• Wide range of application needs
  – Dimensionality and sizes
  – Data structures and data types
  – Languages and programming paradigms
Challenges
• Algorithmic
  – High arithmetic efficiency
    • Low floating-point vs. load/store ratio
  – Unfavorable data access patterns (big 2^n strides)
    • Application owns the data structures/layout
  – Additions/multiplications unbalanced
• Version explosion
  – Verification
  – Maintenance
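Why large 2^n strides are unfavorable: successive accesses map to the same few cache sets and evict one another. A toy set-index model makes this concrete (the 32 KB, 64 B-line, 4-way parameters are illustrative, not any specific processor):

```python
# Count how many distinct sets of a modeled cache are touched when
# accessing n doubles at a given stride. Cache parameters are
# illustrative: 32 KB, 64 B lines, 4-way => 128 sets.

LINE = 64   # bytes per line
SETS = 128  # 32 KB / 64 B / 4 ways

def sets_touched(n, stride_elems, elem_size=8):
    addrs = (i * stride_elems * elem_size for i in range(n))
    return len({(a // LINE) % SETS for a in addrs})

print(sets_touched(128, 1))     # unit stride: 16 sets (one per line)
print(sets_touched(128, 1024))  # stride 2^10 doubles = 8 KB: 1 set only,
                                # so a 4-way cache thrashes after 4 accesses
print(sets_touched(128, 1025))  # padding by one element: 16 sets again
```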
Opportunities
• Multiple algorithms with comparable numerical properties for many functions
• Improved software techniques and hardware performance
• Integrated performance monitors, models and data bases
• Run-time code construction
• Automatic algorithm selection – polyalgorithmic functions (CMSSL, FFTW, ATLAS, SPIRAL, …)
• Exploit multiple precision options
• Code generation from high-level descriptions (WASSEM, CMSSL, CM-Convolution-Compiler, FFTW, UHFFT, SPIRAL, …)
• Integrated performance monitoring, modeling and analysis
• Judicious choice between compile-time and run-time analysis and code construction
• Automated installation process
Approach
• Program preparation at installation (platform dependent)
• Integrated performance models (in progress) and data bases
• Algorithm selection at run-time from set defined at installation
• Automatic multiple precision constant generation
• Program construction at run-time based on application and performance predictions
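The run-time selection step can be sketched as a search over factorizations ranked by a performance model. The function names and per-radix cost table below are hypothetical stand-ins, not UHFFT's actual interface; in the real library the ranking comes from measured performance data:

```python
# Sketch of run-time plan selection (hypothetical names and costs).

# Stand-in for the installation-time performance database:
# modeled cost per point, in arbitrary units, for each codelet radix.
CODELET_COST = {2: 1.9, 4: 1.4, 8: 1.2, 16: 1.1}

def factorizations(n, radices=(16, 8, 4, 2)):
    """All ordered ways to write n as a product of available radices."""
    if n == 1:
        return [[]]
    plans = []
    for r in radices:
        if n % r == 0:
            plans += [[r] + rest for rest in factorizations(n // r, radices)]
    return plans

def predicted_cost(plan):
    # Simple model: each factor contributes one pass over the data,
    # weighted by that codelet's per-point cost.
    return sum(CODELET_COST[r] for r in plan)

def select_plan(n):
    return min(factorizations(n), key=predicted_cost)

print(select_plan(32))  # [8, 4] under this illustrative cost table
```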
The UHFFT
Run-time:
• Input parameters: size, dim., …
• Initialization: select best plan (factorization)
• Execution: calculate one or more FFTs
• Performance monitoring
• Database update
Performance Tuning Methodology

Installation:
• Input parameters: system specifics, user options
• UHFFT code generator → library of FFT modules
• Performance database
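The installation side can be sketched the same way: time each generated codelet on the host and record the best observation in the performance database that the run-time planner consults. The kernel and database layout here are illustrative only:

```python
# Sketch of installation-time tuning (illustrative, not UHFFT's code).

import json
import time

def toy_codelet(n, reps=200):
    """Stand-in workload: a size-n butterfly-like sweep, timed per rep."""
    t0 = time.perf_counter()
    for _ in range(reps):
        data = [float(i) for i in range(n)]
        for i in range(0, n, 2):
            a, b = data[i], data[i + 1]
            data[i], data[i + 1] = a + b, a - b
    return (time.perf_counter() - t0) / reps

def build_database(sizes=(2, 4, 8, 16), trials=3):
    # Keep the minimum over a few trials to suppress timer noise.
    return {n: min(toy_codelet(n) for _ in range(trials)) for n in sizes}

db = build_database()
# Persist for the run-time planner.
print(json.dumps({str(k): round(v, 9) for k, v in db.items()}))
```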
Codelet efficiency

[Charts: codelet, radix-4, and radix-8 codelet efficiency on Intel PIV 1.8 GHz, AMD Athlon 1.4 GHz, and PowerPC G4 867 MHz.]
Plan Performance, 32-bit Architectures

[Chart: plan performance on a 222 MHz Power3 (888 Mflops peak), MFLOPS (0–350) for the size-16 plans 16, 2×8, 4×4, 8×2, 2×2×4, 2×4×2, 4×2×2, and 2×2×2×2.]
Itanium …

Processor            Clock      Peak perf.   Cache structure
Intel Itanium        800 MHz    3.2 GFlops   L1: 16K+16K (Data+Instruction); L2: 96K; L3: 2-4M (off-die)
Intel Itanium 2      900 MHz    3.6 GFlops   L1: 16K+16K (Data+Instruction); L2: 256K; L3: 1.5M (on-die)
Intel Itanium 2      1000 MHz   4 GFlops     L1: 16K+16K (Data+Instruction); L2: 256K; L3: 3M (on-die)
Sun UltraSparc-III   750 MHz    1.5 GFlops   L1: 64K+32K+2K+2K (Data+Instruction+Pre-fetch+Write); L2: up to 8M (off-die)
Sun UltraSparc-III   1050 MHz   2.1 GFlops   L1: 64K+32K+2K+2K (Data+Instruction+Pre-fetch+Write); L2: up to 8M (off-die)
Tested configuration
Memory Hierarchy

L1I and L1D
                           Itanium                            Itanium-2 (McKinley)
Size:                      16KB + 16KB                        16KB + 16KB
Line size/Associativity:   32B/4-way                          64B/4-way
Latency:                   1 cycle                            1 cycle
Write policies:            Write through, no write allocate   Write through, no write allocate

Unified L2
Size:                      96KB                               256KB
Line size/Associativity:   64B/6-way                          128B/8-way
Integer latency:           Min 6 cycles                       Min 5 cycles
FP latency:                Min 9 cycles                       Min 6 cycles
Write policies:            Write back, write allocate         Write back, write allocate

Unified L3
Size:                      4MB or 2MB (off chip)              3MB or 1.5MB (on chip)
Line size/Associativity:   64B/4-way                          128B/12-way
Integer latency:           Min 21 cycles                      Min 12 cycles
FP latency:                Min 24 cycles                      Min 13 cycles
Bandwidth:                 16B/cycle                          32B/cycle
Itanium Comparison

Workstation   HP i2000                   HP zx2000
Processor     800 MHz Intel Itanium      900 MHz Intel Itanium 2 (McKinley)
Bus Speed     133 MHz                    400 MHz
Bus Width     64 bit                     128 bit
OS            64-bit Red Hat Linux 7.1   HP version of the 64-bit RH Linux 7.2
Compiler      Intel 6.0                  Intel 6.0
Memory        2 GB SDRAM (133 MHz)       2 GB DDR SDRAM (266 MHz)
Chipset       Intel 82460GX              HP zx1
HP zx1 Chipset

Features:
• 2-way and 4-way
• Low latency connection to the DDR memory
  – Directly (112 ns latency)
  – Through up to 12 scalable memory expanders (+25 ns latency)
• Up to 64 GB of DDR today (256 in the future)
• AGP 4x today (8x in future versions)
• 1-8 I/O adapters supporting PCI, PCI-X, AGP

2-way block diagram
UHFFT Codelet Performance
UHFFT Codelet Performance
Codelet Performance Radix-2
Codelet Performance Radix-3
Codelet Performance Radix-4
Codelet Performance Radix-5
Codelet Performance Radix-6
Codelet Performance Radix-7
Codelet Performance Radix-13
Codelet Performance Radix-64
The UHFFT: Summary
• Code generator written in C
• Code is generated at installation
• Codelet library is tuned to the underlying architecture
• The whole library can be easily customized through parameter specification
  – No need for laborious manual changes in the source
  – Existing code generation infrastructure allows easy library extensions
• Future:
  – Inclusion of vector/streaming instruction set extensions for various architectures
  – Implementation of new scheduling/optimization algorithms
  – New codelet types and better execution routines
  – Unified algorithm specification on all levels
Acknowledgements
Dave Angulo, Ruth Aydt, Fran Berman, Andrew Chien, Keith Cooper, Holly Dail, Jack Dongarra, Ian Foster, Sridhar Gullapalli, Lennart Johnsson, Ken Kennedy, Carl Kesselman, Chuck Koelbel, Bo Liu, Chuang Liu, Xin Liu, Anirban Mandal, Mark Mazina, John Mellor-Crummey, Celso Mendes, Graziano Obertelli, Alex Olugbile, Mitul Patel, Dan Reed, Martin Swany, Linda Torczon, Satish Vahidyar, Shannon Whitmore, Rich Wolski, Huaxia Xia, Lingyun Yang, Asim Yarkin, …
GrADS contributors
Funding: NSF Next Generation Software initiative, Los Alamos Computer Science Institute
Acknowledgements

SimDB Contributors:
Matin Abdullah, Michael Feig, Lennart Johnsson, Seonah Kim, Prerna Kohsla, Gillian Lynch, Montgomery Pettitt

Funding:
NPACI (NSF)
Texas Learning and Computation Center
Acknowledgements
UHFFT Contributors
Dragan Mirkovic
Rishad Mahasoom
Fredrick Mwandia
Nils Smeds
Funding:
Alliance (NSF)
LACSI (DoE)