HPC as a Driver for Computing Technology and Education
Tarek El-Ghazawi
The George Washington University Washington D.C., USA
2 Tarek El-Ghazawi, GWU
NOW - July 2015: The TOP 10 Systems

| Rank | Site | Computer | Cores | Rmax [Pflops] | % of Peak | Power [MW] | MFlops/Watt |
|---|---|---|---|---|---|---|---|
| 1 | National Super Computer Center in Guangzhou, China | Tianhe-2 NUDT, Xeon 12C 2.2GHz + Intel Xeon Phi (57c) + Custom | 3,120,000 | 33.9 | 62 | 17.8 | 1905 |
| 2 | DOE / OS Oak Ridge Nat Lab, USA | Titan, Cray XK7, AMD (16C) + Nvidia Kepler GPU (14c) + Custom | 560,640 | 17.6 | 65 | 8.3 | 2120 |
| 3 | DOE / NNSA L Livermore Nat Lab, USA | Sequoia, BlueGene/Q (16c) + Custom | 1,572,864 | 17.2 | 85 | 7.9 | 2063 |
| 4 | RIKEN Advanced Inst for Comp Sci, Japan | K computer, Fujitsu SPARC64 VIIIfx (8c) + Custom | 705,024 | 10.5 | 93 | 12.7 | 827 |
| 5 | DOE / OS Argonne Nat Lab, USA | Mira, BlueGene/Q (16c) + Custom | 786,432 | 8.16 | 85 | 3.95 | 2066 |
| 6 | Swiss CSCS | Piz Daint, Cray XC30, Xeon 8C + Nvidia Kepler (14c) + Custom | 115,984 | 6.27 | 81 | 2.3 | 2726 |
| 7 | KAUST, Saudi | Shaheen II, Cray XC30, Xeon 16C + Custom | 196,608 | 5.54 | 77 | 4.5 | 1146 |
| 8 | TACC, USA | Stampede, Dell Intel (8c) + Intel Xeon Phi (61c) + IB | 204,900 | 5.17 | 61 | 4.5 | 1489 |
| 9 | Forschungszentrum Juelich (FZJ), Germany | JuQUEEN, BlueGene/Q, Power BQC 16C 1.6GHz + Custom | 458,752 | 5.01 | 85 | 2.30 | 2178 |
| 10 | DOE / NNSA LLNL, USA | Vulcan, BlueGene/Q, Power BQC 16C 1.6GHz + Custom | 393,216 | 4.29 | 85 | 1.97 | 2177 |
| 500 (422) | Software Comp, USA | HP Cluster | 18,896 | 0.309 | 48 | | |
HPC is a Top National Priority!
Establishment of a National Strategic Computing Initiative (NSCI) – 29 July 2015
Executive Order from the White House
National Strategic Computing Initiative
Five strategic themes of the NSCI:
1) Create systems that can apply exaflops of computing power to exabytes of data
2) Keep the United States at the forefront of HPC capabilities
3) Improve HPC application developer productivity
4) Make HPC readily available
5) Establish hardware technology for future HPC systems
Future/Investments - International Exascale HPC Programs
| Country | Funding | Year(s) | Remarks |
|---|---|---|---|
| European Union | €700M | 2014-20 | Private-Public Partnership commitment through European Tech Platform for HPC (ETP4HPC); €143.4M in 2014-15 |
| | €74M | 2011- | 6 dedicated FP7 Exascale projects |
| India | $2B | 2014-20 | Led by IISc (Indian Institute of Science) and ISRO (Indian Space Research Organization). Targeting a 132 ExaFLOP/s machine |
| | $750M | 2014-19 | C-DAC (Center for Development of Advanced Computing) to set up 70 supercomputers over 5 years |
| Japan | $1.38B | 2013-20 | Post-K computer to be installed at RIKEN; tentatively based on Extreme SIMD chip “PACS-G” |
| China | - | - | Due to U.S./DoC ban, will use Chinese parts to upgrade current #1 system |
Why is HPC Important?
• Critical for economic competitiveness (highlighted by Minister Daoudi) because of its wide applications (through simulations and intensive data analyses)
• Drives computer hardware and software innovations for future conventional computing
• Is becoming ubiquitous, i.e. all computing/information technology is turning parallel!
Is that why it is turning into an international HPC muscle-flexing contest?
Why is HPC Important? (1) Competitiveness
[Figure: two product-development cycles, labeled Design / Build / Test and Design / Model / Simulate / Build]
Why is HPC Important? Competitiveness
HPC Application Examples
• Molecular Dynamics (HIV-1 Protease Inhibitor Drug), simulation for 2 ns: 2 weeks on a desktop, 6 hours on a supercomputer
• Gene Sequence Alignment (Phylogenetic Analysis): 32 days on a desktop, 1.5 hours on a supercomputer
• Car Crash Simulations (2 million elements simulation): 4 days on a desktop, 25 minutes on a supercomputer
• Understanding Fundamental Structure of Matter: requires a billion-billion calculations per second
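The desktop-versus-supercomputer timings above translate into speedup factors by simple unit conversion. The following is a minimal Python sketch, assuming a 7-day week and 24-hour day; the dictionary keys and function name are illustrative and not from the slides.

```python
# Wall-clock times quoted on the slide, converted to hours.
HOURS_PER_DAY = 24

timings_hours = {
    # application: (desktop time, supercomputer time)
    "Molecular dynamics (2 ns)": (14 * HOURS_PER_DAY, 6.0),
    "Phylogenetic analysis": (32 * HOURS_PER_DAY, 1.5),
    "Car crash (2M elements)": (4 * HOURS_PER_DAY, 25 / 60),
}


def speedup(desktop_hours, supercomputer_hours):
    """Return how many times faster the supercomputer run is."""
    return desktop_hours / supercomputer_hours


speedups = {name: speedup(d, s) for name, (d, s) in timings_hours.items()}

for name, factor in speedups.items():
    print(f"{name}: about {factor:.0f}x faster")
```

On these figures the three examples come out at roughly 56x, 512x, and 230x respectively.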
Why is HPC Important? (2) HPC of Today is Conventional Computing for Tomorrow
• The ASCI Red Supercomputer: 9000 chips for 3 TeraFLOPs in 1997
• Intel 80 Core Chip: 1 chip and 1 TeraFLOPs in 2007
Why is HPC Important? (3) HPC Concepts are Becoming Ubiquitous
• Sony PS3: uses Cell processors!
• The Road Runner: was the fastest supercomputer in 2008; uses the Cell processors!
• Tile64: a 64 CPU chip, can be in your future laptop!
• Samsung S6: 8 cores
HPC is Ubiquitous! All Computing is becoming HPC. Can we become bystanders?
How Did we Get Here - Supercomputers in recent History

| Computer | Processor | # Pr. | Year | Rmax (TFlops) |
|---|---|---|---|---|
| Tianhe-2 (MilkyWay-2) | TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P | 3120000 | 2013-till now | 33,862 |
| Titan | Cray XK7, Opteron 16 Cores, 2.2GHz, Nvidia K20X | 560640 | 2012 | 17,600 |
| K-Computer, Japan | SPARC64 VIIIfx 2.0GHz | 705024 | 2011 | 10,510 |
| Tianhe-1A, China | Intel EM64T Xeon X56xx (Westmere-EP) 2930 MHz (11.72 Gflops) + NVIDIA GPU, FT-1000 8C | 186368 | 2010 | 2,566 |
| Jaguar, Cray | Cray XT5-HE Opteron Six Core 2.6 GHz | 224162 | 2009 | 1,759 |
| Roadrunner, IBM | PowerXCell 8i 3200 MHz (12.8 GFlops) | 122400 | 2008 | 1,026 |
| BlueGene/L - eServer Blue Gene Solution, IBM | PowerPC 440 700 MHz (2.8 GFlops) | 212992 | 2007 | 478 |
| BlueGene/L - eServer Blue Gene Solution, IBM | PowerPC 440 700 MHz (2.8 GFlops) | 131072 | 2005 | 280 |
| BlueGene/L beta-System, IBM | PowerPC 440 700 MHz (2.8 GFlops) | 32768 | 2004 | 70.7 |
| Earth-Simulator / NEC | NEC 1000 MHz (8 GFlops) | 5120 | 2002 | 35.8 |
| IBM ASCI White, SP | POWER3 375 MHz (1.5 GFlops) | 8192 | 2001 | 7.2 |
| IBM ASCI White, SP | POWER3 375 MHz (1.5 GFlops) | 8192 | 2000 | 4.9 |
| Intel ASCI Red | Intel IA-32 Pentium Pro 333 MHz (0.333 GFlops) | 9632 | 1999 | 2.4 |
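The table implies a doubling time for the Rmax of the fastest system. The following is a rough Python sketch that uses only the first and last rows (ASCI Red, 1999, 2.4 TFlops; Tianhe-2, 2013, 33,862 TFlops). Treating growth as a smooth exponential between those two points is an assumption made for illustration, and the function name is hypothetical.

```python
import math


def doubling_time_years(rmax_start, rmax_end, year_start, year_end):
    """Years per doubling, assuming exponential growth between two points."""
    doublings = math.log2(rmax_end / rmax_start)
    return (year_end - year_start) / doublings


# Endpoints taken from the table above (Rmax in TFlops).
growth_factor = 33862 / 2.4
years_per_doubling = doubling_time_years(2.4, 33862, 1999, 2013)

print(f"Growth factor 1999-2013: about {growth_factor:,.0f}x")
print(f"Doubling time: about {years_per_doubling:.2f} years")
```

That works out to roughly one doubling per year. The processor count grew over the same period (9632 to 3120000 in the table), so the gain reflects larger systems as well as faster chips.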
How Did we Get Here - Supercomputers in recent History
See: http://spectrum.ieee.org/tech-talk/computing/hardware/china-builds-worlds-fastest-supercomputer
How Did we Get Here - Supercomputers in recent History
[Figure: performance vs. time, rising from TeraFLOPS to PetaFLOPS through three eras: Vector Machines; Massively Parallel Processors (1993, HPCC); MPPs with Multicores and Heterogeneous Accelerators, discrete then integrated (2008-2011, end of Moore’s Law in clocking)]
How to Make Progress
Launch a competitive funding cycle or a large national project
Pose a system challenge: ~33.8 PFLOPS / 17.8 MW provides about 2 GF/Watt. To get to exascale (1,000 PFLOPS) using the same total power we need about 56 GF/Watt, roughly a 30x improvement
Pose an application challenge(s)
Let the community compete for government funding with innovative ideas
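The efficiency arithmetic behind the system challenge can be checked in a few lines. This is a sketch, assuming that exascale means exactly 1 EFLOPS sustained within the same 17.8 MW envelope; the constant and variable names are illustrative.

```python
PFLOPS = 1e15
EXAFLOPS = 1e18
MEGAWATT = 1e6

# Figures quoted on the slide for the current #1 system.
rmax_flops = 33.8 * PFLOPS
power_watts = 17.8 * MEGAWATT


def gflops_per_watt(flops, watts):
    """Energy efficiency in GFLOPS per watt."""
    return flops / watts / 1e9


current = gflops_per_watt(rmax_flops, power_watts)
# Assumption: the exascale target is exactly 1 EFLOPS at the same power.
required = gflops_per_watt(1 * EXAFLOPS, power_watts)
improvement = required / current

print(f"Current efficiency:  {current:.2f} GF/Watt")
print(f"Required efficiency: {required:.1f} GF/Watt")
print(f"Improvement needed:  {improvement:.1f}x")
```

Under a different power budget or performance target, the required figure scales proportionally.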
Challenges - The End of Moore’s Law
The phenomenon of exponential improvements in processors was observed in 1965 by Intel co-founder Gordon Moore. It is commonly stated in three forms:
• The speed of a microprocessor doubles every 18-24 months, assuming the price of the processor stays the same. (Wrong, not anymore!)
• The price of a microchip drops about 48% every 18-24 months, assuming the same processor speed and on-chip memory capacity. (Ok, for now)
• The number of transistors on a microchip doubles every 18-24 months, assuming the price of the chip stays the same. (Ok, for now)
No faster clocking but more Cores?
Source: Ed Davis, Intel
Accelerators and Dealing with the Moore’s Law Challenge Through Parallelism

| Device | Fab. Process (nm) | Freq (GHz) | # Cores | Peak SPFP (GFlops) | Peak DPFP (GFlops) | Peak Power (W) | DP GFlops/W | Memory BW (GB/s) | Memory type |
|---|---|---|---|---|---|---|---|---|---|
| PowerXCell 8i | 65 | 3.2 | 1 + 8 | 204 | 102.4 | 92 | 1.11 | 25.6 | XDR |
| Nvidia Kepler K40 | 28 | 0.75 | 2880 | 4290 | 1430 | 235 | 6.1 | 288 | GDDR5 |
| Intel Xeon Phi 7120P | 22 | 1.24 | 61 (244 threads) | 2417 | 1208 | 300 | 4.0 | 352 | GDDR5 |
| Intel Xeon 12-core 2.7 GHz E5-2697v2 | 22 | 2.7 | 12 | 518.4 | 259.2 | 130 | 1.99 | 59.7 | DDR3-1866 |
| AMD Opteron 6370P Interlagos | 32 | 2.5 | 16 | 320 | 160 | 99 | 1.62 | 42.7 | DDR3-1333 |
| Xilinx XC7VX1140T | 28 | - | - | 801 | 241 | 43 | 5.6 | - | - |
| Xilinx XCUV440 | 20 | - | - | 1306 | 402 | 80* | 5.0* | | |
| Altera Stratix V GSB8 | 28 | - | - | 604 | 296 | 59 | 5.0 | - | - |
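The efficiency column in the table is the peak double-precision rate divided by peak power. The short Python sketch below recomputes it from the other two columns so the ranking can be checked. The starred Xilinx XCUV440 row is left out because its power figure is marked as an estimate; the dictionary layout is illustrative only.

```python
# (peak DP GFlops, peak power in watts) as listed in the table above.
devices = {
    "PowerXCell 8i": (102.4, 92),
    "Nvidia Kepler K40": (1430, 235),
    "Intel Xeon Phi 7120P": (1208, 300),
    "Intel Xeon E5-2697v2": (259.2, 130),
    "AMD Opteron 6370P": (160, 99),
    "Xilinx XC7VX1140T": (241, 43),
    "Altera Stratix V GSB8": (296, 59),
}

# Derived metric: double-precision GFlops per watt.
efficiency = {name: gflops / watts for name, (gflops, watts) in devices.items()}

# Print devices from most to least energy efficient.
for name, value in sorted(efficiency.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:24s} {value:5.2f} DP GFlops/W")

best = max(efficiency, key=efficiency.get)
```

On these peak figures the GPU and the FPGAs lead, with the conventional multicore CPUs at under 2 DP GFlops/W.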
Accelerators/Heterogeneous Computing
Microprocessor + accelerators: FPGAs, Cell, GPUs, Phi, …

| Application | Speedup | Cost savings | Power savings | Size savings |
|---|---|---|---|---|
| DNA Match | 8723 | 22x | 779x | 253x |
| DES Breaker | 38514 | 96x | 3439x | 1116x |

El-Ghazawi et al., The Promise of HPRCs, IEEE Computer, February 2008
A General Execution Model for Heterogeneous Computers
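[Figure: a PC in which the microprocessor (µP) works with an accelerator: FPGA, GPU, Clearspeed, CELL B.E., or Intel Xeon Phi]
µP to accelerator: • Transfer of Control • Input Data
Accelerator to µP: • Output Data • Transfer of Control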
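The hand-off in this execution model can be sketched in plain Python. This is an illustration only: a single worker thread stands in for the accelerator, and the function names are hypothetical. Real accelerators are driven through vendor-specific programming interfaces that are not shown here.

```python
from concurrent.futures import ThreadPoolExecutor


def accelerator_kernel(input_data):
    """Stand-in for work done on the accelerator (here: sum of squares)."""
    return sum(x * x for x in input_data)


def host_program(data):
    """Host (microprocessor) side: offload one kernel, then resume."""
    # A single worker thread plays the role of the accelerator.
    with ThreadPoolExecutor(max_workers=1) as accelerator:
        # Transfer of control + input data: host -> accelerator.
        pending = accelerator.submit(accelerator_kernel, data)
        # Output data + transfer of control: accelerator -> host.
        output = pending.result()
    return output


result = host_program([1, 2, 3, 4])
print(result)
```

The host blocks on `result()` here; overlapping host work with the offloaded kernel would be done between `submit` and `result`.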
Challenges for Accelerators
1. Application must lend itself to the 90-10 rule, and different accelerators suit different types of computations
2. Programmer partitions the code across the CPU and accelerator
3. Programmer co-schedules CPU and accelerator, and ensures good utilization of the expensive accelerator resources
4. Programmer explicitly transfers data between CPU and accelerator
5. Accelerators are fast compared to the link, and the transfer overhead can render the use of the accelerator useless or harmful
6. Multiple programming paradigms are needed
7. New accelerator means learning/porting to a new programming interface
8. Changing the ratio of CPUs to accelerators also requires substantial programming unless accelerators are virtualized
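Challenges 1 and 5 can be made concrete with an Amdahl's-law style estimate that adds a term for data transfer over the link. This is a sketch under stated assumptions: the 90% kernel share, the 50x kernel speedup, and the transfer costs are illustrative numbers, not measurements, and the function name is hypothetical.

```python
def offload_speedup(accel_fraction, kernel_speedup, transfer_fraction):
    """Overall speedup when part of a program is offloaded.

    accel_fraction:    share of original run time spent in the offloaded code
    kernel_speedup:    how much faster that code runs on the accelerator
    transfer_fraction: data-transfer time as a share of original run time
    """
    new_time = (
        (1 - accel_fraction)                # code that stays on the CPU
        + accel_fraction / kernel_speedup   # offloaded code, accelerated
        + transfer_fraction                 # cost of moving data over the link
    )
    return 1 / new_time


# 90% of the time in the kernel, kernel runs 50x faster, free transfers.
ideal = offload_speedup(0.90, 50, 0.0)
# Same kernel, but moving data costs as much as the whole original run.
link_bound = offload_speedup(0.90, 50, 1.0)

print(f"No transfer cost:    {ideal:.2f}x")
print(f"Heavy transfer cost: {link_bound:.2f}x")
```

With free transfers the 50x kernel yields under 9x overall because of the 10% left on the CPU; with heavy transfers the result drops below 1x, meaning the accelerated program is slower than the original.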
Challenges for Advancing or for Exascale
1. Energy Efficiency
2. Interconnect Technology
3. Memory Technology
4. Scalable System Software
5. Programming Systems
6. Data Management
7. Exascale Algorithms
8. Algorithms for Discovery, Design & Decision
9. Resilience and Correctness
10. Scientific Productivity
DoE ASCAC Subcommittee Report Feb 2014
Data movement and/or programming related
Exascale Technological Challenges
The Power Wall
• Frequency scaling is no longer possible; power increases rapidly
The Memory Wall
• Gap between processor speed and memory speed is widening
The Interconnect Wall
• Available bandwidth per compute operation is dropping
• Power needed for data movement is increasing
Programmability Wall, Resilience Wall, ...
The Data Movement Challenge
Locality matters a lot, cost (energy and time) rapidly increases with distance
Locality should be exploited at short distances, and is needed even more at far distances
[Figures: bandwidth density vs. system distance; energy vs. system distance. Source: ASCAC 14]
Data Movement and the Hierarchical Locality Challenge
Locality is Not Flat Anymore – Chip and System
Locality is Not Flat in Extreme Scale – Chip and System
[Figure: Cray XC40]
Locality in Extreme Scale – Chip and System Perspectives
[Figures: Cray XC40 (system level) and TILE64 (chip level)]
What Does that Mean for Programmers
Exploiting Hierarchical Locality
Machine level and Chip level
Hierarchical Tiled Data Structures
Hierarchical Locality Exploitation with RTS
MPI+X
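One way to picture hierarchical tiling is a matrix multiply whose loop nest is split into tiles within tiles, so that each level of the machine works on a block that fits its own level of locality. The following Python sketch is an illustration only, not a method described in the slides: the function names and the tile sizes are assumptions, and plain Python lists show the loop structure rather than any real memory-hierarchy effect.

```python
def tiled_matmul(a, b, n, tile_sizes):
    """Multiply two n x n matrices (lists of lists) with nested tiling.

    tile_sizes lists tile edge lengths from the outermost level inward,
    e.g. [4, 2]: 4x4 tiles (say, per node), each split into 2x2 tiles
    (say, per core). Each size must divide the one before it, and n.
    """
    c = [[0.0] * n for _ in range(n)]

    def multiply_block(i0, j0, k0, size, levels):
        if not levels:
            # Innermost tile: plain triple loop over a small, local block.
            for i in range(i0, i0 + size):
                for k in range(k0, k0 + size):
                    a_ik = a[i][k]
                    for j in range(j0, j0 + size):
                        c[i][j] += a_ik * b[k][j]
            return
        tile = levels[0]
        # Walk the sub-tiles of this block and recurse one level down.
        for i in range(i0, i0 + size, tile):
            for j in range(j0, j0 + size, tile):
                for k in range(k0, k0 + size, tile):
                    multiply_block(i, j, k, tile, levels[1:])

    multiply_block(0, 0, 0, n, tile_sizes)
    return c


def naive_matmul(a, b, n):
    """Reference implementation without tiling."""
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


n = 8
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i - j) for j in range(n)] for i in range(n)]
same = tiled_matmul(a, b, n, [4, 2]) == naive_matmul(a, b, n)
print(same)
```

In an MPI+X setting, the outer tile level would typically map to distribution across nodes and the inner level to threads or cores within a node; that mapping is an assumption of this sketch.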
General Implications
• Short-term programming challenge
• Golden opportunity for smart programmers
• New hardware advances are needed first, and they will influence software
• These may be silicon based or nanotechnologies such as IBM's carbon nanotube transistors (9 nm), which may keep things the way they are on the software side for a while
General Implications- Longer Run
Long-term hardware technology may move toward:
• Nano-photonics for computing
• Quantum Computing
Many of the new hardware computing innovations may show first as discrete accelerators, then as on-chip accelerators, then move closer to the processor internal circuitry (data path)
Longer term
The bad news: as the limits of silicon are approached, we may see departures from conventional methods of computing, which may dramatically change the way we conceive software
The good news: history has shown that good ideas from the past get resurrected in new ways
Conclusions
• Graduating an intelligent IT workforce can be a golden egg for countries like Morocco
• You can teach skills, but it is imperative to teach and stress concepts in the curriculum
  • Stress Parallelism
  • Stress Locality
• See the recommendations by IEEE/NSF and SIAM for incorporating parallelism in Computer Science, Computer Engineering, and Computational Science and Engineering curricula, and add locality
• For the very long term, there is nothing better than having good foundations in Physics and Math, even for CS and CE majors
Conclusions cont.
Integrate teaching soft skills, as President Ouaouicha said:
• Communications
• Entrepreneurism and marketing, individually and in groups
• Patenting and legal