Supercomputers
Special Course of Computer Architecture
H. Amano
Contents
• What are supercomputers?
• Architecture of Supercomputers
• Representative supercomputers
• Exa-Scale supercomputer project
Defining Supercomputers
• High-performance computers mainly for scientific computation.
– A huge amount of computation for biochemistry, physics, astronomy, meteorology, etc.
– Very expensive: developed and operated with national funds.
– High-level techniques are required to develop and operate them.
– The USA, Japan, and China compete for the top-1 supercomputer.
– A large amount of national funding is used, so it tends to become political news.
→ In Japan, the supercomputer project became a target of the budget review in Dec. 2009.
"K" achieved 10 PFLOPS and became the top 1 last year, but Sequoia took the position back last month.
FLOPS
• Floating-Point Operations Per Second
• Floating-point number
– (mantissa) × 2^(exponent)
– Double precision: 64 bits; single precision: 32 bits.
– The IEEE standard defines the format and rounding.
[Figure] IEEE 754 bit layout (sign | exponent | mantissa): single = 1 + 8 + 23 bits, double = 1 + 11 + 52 bits.
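To see the layout concretely, here is a small sketch in C (the helper decode_double is our own illustration, not from the slides) that extracts the three fields of a 64-bit double:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Decode the IEEE 754 double-precision fields: 1 sign bit,
   11 exponent bits, 52 mantissa (fraction) bits. */
static void decode_double(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);           /* reinterpret the bytes safely */
    uint64_t sign     = bits >> 63;
    uint64_t exponent = (bits >> 52) & 0x7FF; /* stored with a bias of 1023 */
    uint64_t mantissa = bits & ((1ULL << 52) - 1);
    printf("%g: sign=%llu exp=%lld mantissa=0x%013llx\n",
           x, (unsigned long long)sign,
           (long long)exponent - 1023,
           (unsigned long long)mantissa);
}

int main(void) {
    decode_double(1.0);   /* sign=0, exponent 0, mantissa 0 */
    decode_double(-2.5);  /* sign=1, exponent 1, mantissa 0x4000000000000 */
    return 0;
}
```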
The range of performance
• M (Mega) = 10^6 (one million; 100 万)
• G (Giga) = 10^9 (one billion; 10 億)
• T (Tera) = 10^12 (one trillion; 1 兆)
• P (Peta) = 10^15 (one quadrillion; 1000 兆)
• E (Exa) = 10^18 (one quintillion; 100 京)
• 10 PFLOPS = 10^16 operations per second = 1 kei (京) in Japanese → the name "K" comes from it.
[Figure] Performance scale from 10^9 to 10^18 FLOPS: iPhone 4S ≈ 140 MFLOPS; high-end PC ≈ 50-80 GFLOPS; powerful GPU ≈ TFLOPS class; supercomputers ≈ 10 TFLOPS-16 PFLOPS. Growth rate: ×1.9/year.
How to select the top 1?
• Top500/Green500: performance of executing Linpack
– Linpack is a kernel for matrix computation.
– Scale-free.
– Performance-centric.
• Gordon Bell Prize
– Peak Performance, Price/Performance, Special Achievement.
• HPC Challenge
– Global HPL: matrix computation → computation performance.
– Global RandomAccess: random memory access → communication performance.
– EP STREAM per system: heavy-load memory access → memory performance.
– Global FFT: a complicated problem requiring both memory and communication performance.
• Nov.: ACM/IEEE Supercomputing Conference (Top500, Gordon Bell Prize, HPC Challenge, Green500)
• Jun.: International Supercomputing Conference (Top500, Green500)
Top 5
[Figure] Rmax (PetaFLOPS) of the top-ranked machines over the 2010.6, 2010.11, 2011.6, and 2011.11 Top500 lists: K (Japan), Tianhe (天河, China), Jaguar (USA), Nebulae (China), Tsubame (Japan), Roadrunner (USA), Kraken (USA), Jugene (Germany); Sequoia (USA) reaches 16 PFLOPS.
From the SACSIS2012 invited speech.
Top 500, Nov. 2011 (developer; hardware; vendor; cores; Linpack TFLOPS (peak in parentheses); power in kW)
• K (京) (Japan): RIKEN AICS; SPARC64 VIIIfx 2.0 GHz, Tofu interconnect; Fujitsu; 705,024 cores; 10,510 (11,280); 12,659.9 kW
• Tianhe-1A (天河) (China): National Supercomputer Center, Tianjin; NUDT YH MPP, Xeon X5670 6C 2.93 GHz, NVIDIA 2050; NUDT; 186,368 cores; 2,566 (4,701); 4,040 kW
• Jaguar (USA): DOE/SC/Oak Ridge National Lab.; Cray XT5-HE, Opteron 6-core 2.6 GHz; Cray Inc.; 224,162 cores; 1,759 (2,331); 6,950 kW
• Nebulae (China): National Supercomputing Centre in Shenzhen; Dawning TC3600 Blade, Xeon X5650 6C 2.66 GHz, InfiniBand QDR, NVIDIA 2050; Dawning; 120,640 cores; 1,271 (2,974); 2,580 kW
• TSUBAME 2.0 (Japan): GSIC, Tokyo Inst. of Technology; HP ProLiant SL390s G7, Xeon 6C X5670, NVIDIA GPU; NEC/HP; 73,238 cores; 1,192 (2,287); 1,398.6 kW
Green 500, Nov. 2011 (machine; place; MFLOPS/W; total power in kW)
• 1: BlueGene/Q, Power BQC 16C 1.60 GHz, Custom; IBM Rochester; 2026.48; 85.12 kW
• 2-5: BlueGene/Q (Power BQC 16C 1.60 GHz, Custom) and BlueGene/Q Prototype; IBM Thomas J. Watson Research Center / Rochester; 1689.86-2026.48
• 6: DEGIMA Cluster, Intel i5, ATI Radeon GPU, InfiniBand QDR; Nagasaki Univ.; 1378.32; 47.05 kW
• 7: Bullx B505, Xeon E5649 6C 2.53 GHz, InfiniBand QDR, NVIDIA 2090; Barcelona Supercomputing Center; 1266.26; 81.50 kW
• 8: Curie Hybrid Nodes (Bullx B505, NVIDIA M2090, Xeon E5640 2.67 GHz, InfiniBand QDR); TGCC / GENCI; 1010.11; 108.80 kW
IBM BlueGene/Q took places 1 to 5.
Tsubame 2.0 (Tokyo Tech) ranked 10th.
Why Top 1?
• The Top500 top 1 is just a measure of matrix computation.
• There are also the top 1 of the Green500, the Gordon Bell Prize, and the top 1 of each HPC Challenge program.
→ All of these machines are valuable; TV and newspapers focus too much on the Top500.
• However, most top-1 computers also won the Gordon Bell Prize and the HPC Challenge top 1.
– e.g., K and Sequoia
• The impact of being top 1 is great!
Why are supercomputers so fast?
× Because they use a high-frequency clock.
[Figure] Clock frequency of high-end PCs vs. year (1992-2008), growing about 40%/year from the Alpha 21064 (150 MHz) to the Pentium 4 (3.2 GHz) and Nehalem (3.3 GHz); K runs at 2 GHz and Sequoia at 1.6 GHz.
The speed-up of the clock saturated around 2003 because of power and heat dissipation.
The clock frequency of K and Sequoia is lower than that of common PCs.
The 3 major methods of parallel processing in supercomputers
Supercomputer = massively parallel computer
– SIMD (Single Instruction Stream, Multiple Data Streams)
• Most accelerators
– Pipelined processing
• Vector computers
– MIMD (Multiple Instruction Streams, Multiple Data Streams)
• Homogeneous (vs. accelerators), scalar (vs. vector machines)
– Although all supercomputers use the three methods at various levels, they can be classified by which method dominates.
Key issues other than the computational nodes: large high-bandwidth memory, large disks, and high-speed interconnection networks.
SIMD (Single Instruction Stream, Multiple Data Streams)
[Figure] One instruction memory broadcasts each instruction to many processing units, each with its own data memory.
• All processing units execute the same instruction.
• Low degree of flexibility.
• Illiac-IV / MMX instructions / ClearSpeed / IMAP / GPGPU (coarse grain); CM-2 (fine grain)
– TSUBAME 2.0 (Xeon + Tesla, Top500 2010/11 4th)
– Tianhe-1 (天河一号, Xeon + FireStream, 2009/11 5th)
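As a concrete illustration of SIMD execution, here is a minimal sketch of our own (using x86 SSE intrinsics, not any machine on this slide) in which one instruction operates on four floats at once:

```c
#include <stdio.h>
#include <xmmintrin.h>  /* SSE intrinsics */

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    /* One SIMD instruction performs four multiplications in a single
       step: every "lane" executes the same operation on different data. */
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_mul_ps(va, vb);
    _mm_storeu_ps(c, vc);

    for (int i = 0; i < 4; i++)
        printf("c[%d] = %g\n", i, c[i]);
    return 0;
}
```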
GPGPU (General-Purpose computing on Graphics Processing Units)
[Figure] GeForce GTX280 (240 cores): the host feeds an input assembler and a thread execution manager, which drive many groups of thread processors; each group has per-block shared memories (PBSM), and all groups load/store to a common global memory.
GPU (NVIDIA's GTX580)
[Figure] Die photo: 512 GPU cores (128 × 4), 768 KB L2 cache, 40 nm CMOS, 550 mm².
Cell Broadband Engine
[Figure] One PPE (PXU with L1 and L2 caches) and eight SPEs (each an SXU with a local store and DMA), connected by four 16 B data rings at 1.6 GHz, together with the MIC and the BIF/IOIF0 and IOIF1 interfaces.
Used in the IBM Roadrunner and the PS3: a common platform for supercomputers and games.
Peak performance vs. Linpack performance
[Figure] Peak and Linpack performance (PetaFLOPS) of K (Japan), Tianhe (天河, China), Jaguar (USA), Nebulae (China), and Tsubame (Japan).
The difference is large in machines with accelerators (GPU-based) and small in homogeneous machines.
Still, the accelerator type is energy efficient.
Pipeline processing
[Figure] A six-stage pipeline (stages 1-6).
Each stage sends its result and receives its input every clock cycle, so N stages give up to N times the performance.
Data dependencies cause RAW hazards and degrade the performance.
When a large array is processed, many stages can work efficiently.
Vector computers
The computation: X[i] = A[i] * B[i]; Y = Y + X[i]
[Figure] Elements a0, a1, a2, ... and b0, b1, b2, ... stream from the vector registers through a pipelined multiplier and adder, one element pair per clock cycle (the slide animates several successive cycles).
The classic style of supercomputer since the Cray-1.
The Earth Simulator may be the last vector supercomputer.
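In scalar C code, the operation the vector pipeline streams through its multiplier and adder is simply the following loop (a sketch using the slide's array names):

```c
/* The vector operation from the slide: X[i] = A[i] * B[i]; Y += X[i].
   A vector machine streams A and B through the pipelined multiplier
   and adder, retiring one element per clock once the pipe is full. */
double vector_mul_sum(const double *A, const double *B, double *X, int n) {
    double Y = 0.0;
    for (int i = 0; i < n; i++) {
        X[i] = A[i] * B[i];
        Y += X[i];
    }
    return Y;
}
```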
MIMD (Multiple Instruction Streams, Multiple Data Streams)
• Multiple processors (cores) can work independently.
– Synchronization mechanism
– Data communication: shared memory
• All supercomputers are MIMD machines with multiple cores.
• However, K and Sequoia (BlueGene/Q) are typical massively parallel MIMD machines.
– Homogeneous computers
– Scalar processors
[Figure] Nodes 0-3 (processors which can work independently) connected through an interconnection network to a shared memory.
Multi-Core (Intel's Nehalem-EX)
[Figure] Die photo: 8 CPU cores, 24 MB L3 cache, 45 nm CMOS, 600 mm².
Intel 80-Core Chip
[Figure] Intel 80-core chip [Vangal, ISSCC'07].
How to program them?
• Can common PC programs be accelerated on supercomputers?
– Yes, to a certain degree, by parallelizing compilers.
• However, to use many cores efficiently, specialists must optimize the programs, as sketched below.
– Message passing with MPI
– OpenMP
– OpenCL/CUDA → GPU accelerator type
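For example, a minimal OpenMP sketch (our own illustration, not from the slides) that spreads a dot product over all available cores:

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    double sum = 0.0;
    /* Distribute the loop iterations over all cores; the reduction
       clause combines the per-thread partial sums safely. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i] * b[i];

    printf("dot = %g (max threads: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```

Compile with, e.g., gcc -fopenmp; the same source runs serially if OpenMP is disabled, which is part of OpenMP's appeal for incremental optimization.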
IBM's BlueGene/Q
[Photo] From the IBM web site. The fastest computer; also a simple NUMA.
• Successor of Blue Gene/L and Blue Gene/P.
• Sequoia consists of BlueGene/Q.
• 18 Power processors (16 computational, 1 control, and 1 redundant) and network interfaces are provided on a chip.
• The on-chip interconnect is a crossbar switch.
• 5-dimensional mesh/torus.
• 1.6 GHz clock.
Japanese supercomputers
• K supercomputer
– A homogeneous, scalar-type massively parallel computer.
• Earth Simulator
– A vector computer.
– The difference between peak and Linpack performance is small.
• TIT's Tsubame
– Uses a lot of GPUs. An energy-efficient supercomputer.
• Nagasaki University's DEGIMA
– Uses a lot of GPUs. A hand-made supercomputer with high cost-performance. Gordon Bell Prize cost-performance winner.
• GRAPE projects
– Dedicated supercomputers for astronomy. SIMD; various versions won the Gordon Bell Prize.
Supercomputer "K" (SACSIS2012 invited speech)
[Figure] SPARC64 VIIIfx chip: 8 cores with a shared L2 cache and memory, plus an interconnect controller for the Tofu interconnect (6-D torus/mesh).
RDMA mechanism; NUMA, or UMA + NORMA.
4 nodes/board, 24 boards/rack, 96 nodes/rack.
[Photo] Racks of K, with the water-cooling system (SACSIS2012 invited speech).
6-dimensional torus: Tofu
[Figure] k-ary n-cube examples, with base-k digit coordinates at each node: a 3-ary 1-cube (a ring of 3 nodes), a 3-ary 2-cube (a 3×3 grid), a 3-ary 3-cube (a 3×3×3 three-dimensional mesh), and a 4-dimensional mesh whose node addresses are 0***, 1***, 2***.
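To make the addressing concrete: a node of a k-ary n-cube has n base-k digits, and its torus neighbors differ by ±1 (mod k) in exactly one digit. A small sketch of our own, using the 3-ary 3-cube from the figure:

```c
#include <stdio.h>

/* Print the 2n torus neighbors of a node in a k-ary n-cube.
   A node is an n-digit base-k coordinate; each neighbor differs
   by +/-1 (mod k) in exactly one dimension. */
static void print_neighbors(const int *coord, int n, int k) {
    for (int d = 0; d < n; d++) {
        for (int step = -1; step <= 1; step += 2) {
            printf("(");
            for (int i = 0; i < n; i++) {
                int c = (i == d) ? (coord[i] + step + k) % k : coord[i];
                printf("%d%s", c, i + 1 < n ? "," : "");
            }
            printf(") ");
        }
    }
    printf("\n");
}

int main(void) {
    int node[3] = {0, 1, 2};      /* a node in a 3-ary 3-cube */
    print_neighbors(node, 3, 3);  /* its 6 neighbors */
    return 0;
}
```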
Why K could get top 1
• The delay of BlueGene/Q (Sequoia)
– The financial crisis in the USA.
• The withdrawal of NEC/Hitachi
– At the start, a complex of a vector machine and a scalar machine was planned.
– The whole budget could then be used for the scalar machine alone.
• The budget review made the project famous.
– Enough funding was poured in within a short period.
• The engineers at Fujitsu did a really good job.
The Earth Simulator (2002): a simple NUMA (SACSIS2012 invited talk)
[Figure] 640 nodes (Node 0 to Node 639), each holding 8 vector processors (0-7) that share a 16 GB memory; the nodes are connected by an interconnection network (16 GB/s × 2).
Peak performance: 40 TFLOPS.
TIT's Tsubame: a well-balanced supercomputer with GPUs
Nagasaki Univ.'s DEGIMA
Exa-scale computer
• The Japanese national project for an exa-scale computer has started.
• A feasibility study has started.
– U. Tokyo, Tsukuba Univ., Tohoku Univ., and RIKEN.
• It is difficult to produce supercomputers with original Japanese chips.
• In Japan, a vendor takes a loss when developing supercomputers.
• The vendor may recover the development cost later by selling smaller systems.
• However, Japanese semiconductor companies will not be able to bear such large development costs.
• If Intel's CPUs or NVIDIA's GPUs are used, a huge amount of national money will flow to US companies.
• For exa-scale, 70,000,000 cores are needed.
– The budget limitation is more severe than the technical limits.
Amdahl's law
[Figure] A program with a 1% serial part and a 99% parallel part; only the parallel part is accelerated by parallel processing.
Speedup with p cores = 1 / (0.01 + 0.99/p)
→ about 50 times with 100 cores, about 91 times with 1000 cores.
If there is even a small serial execution part, the performance improvement is limited.
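Checking the slide's numbers with a few lines of C (the function name amdahl is our own):

```c
#include <stdio.h>

/* Amdahl's law: speedup with p cores when a fraction s of the
   program is serial and the rest parallelizes perfectly. */
static double amdahl(double s, double p) {
    return 1.0 / (s + (1.0 - s) / p);
}

int main(void) {
    /* 1% serial part, as on the slide */
    printf("p=100:  %.1fx\n", amdahl(0.01, 100.0));   /* ~50.3x */
    printf("p=1000: %.1fx\n", amdahl(0.01, 1000.0));  /* ~91.0x */
    return 0;
}
```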
Why exa-scale supercomputers?
• The ratio of the serial part becomes small for large-scale problems.
– Linpack is a scale-free benchmark.
– Serial execution part 1 day + parallel execution part 10 years → 1 day + 1 day: a big impact.
• Are there big programs which cannot be solved by K but can be solved by exa-scale supercomputers?
– The number of such programs will decrease.
– Can we find new areas of application?
• It is important that such big computing power is open for research.
Should we develop floating-point-computation-centric supercomputers?
• What do people want a big supercomputer to do?
– Finding new medicines: pattern matching.
– Earthquake simulation; meteorology for analyzing global warming.
– Big data.
– Artificial intelligence.
• Most of these are not suited to floating-point-centric supercomputers.
• "Supercomputers for big data" or "super-cloud computers" might be required.
Motivation and limitations
• Integrates computer technologies including architecture, hardware, software, dependability techniques, semiconductors, and applications.
• A flagship and a symbol.
• No computer development remains in Japan other than supercomputers.
• A huge computing power is open for peaceful research.
• It is a tool that makes previously impossible analyses possible.
• What needs infinite computing power?
• Is it a Japanese supercomputer if all the cores and accelerators are made in the USA?
• Does a floating-point-centric supercomputer built to solve Linpack as fast as possible really fit the demand?
Look at the Exa-scale computer project!
Exercise
• A target program: serial computation part: 1; parallel computation part: N^3.
• K: 700,000 cores
• Exa: 70,000,000 cores
• What N makes Exa 10 times faster than K? (See the sketch below.)
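One way to explore the exercise numerically (our own sketch): model the execution time on p cores as 1 + N³/p, following Amdahl's law as above, and scan for the N at which Exa becomes 10 times faster than K.

```c
#include <stdio.h>

/* Execution-time model from the exercise: a serial part of 1 plus
   a parallel part of N^3 work spread over p cores. */
static double t(double n, double p) {
    return 1.0 + n * n * n / p;
}

int main(void) {
    double pK = 700000.0, pExa = 70000000.0;
    /* Scan N and report when the speedup of Exa over K reaches 10. */
    for (double n = 100.0; n <= 100000.0; n *= 1.1) {
        double ratio = t(n, pK) / t(n, pExa);
        if (ratio >= 10.0) {
            printf("N ~ %.0f gives Exa/K speedup %.2f\n", n, ratio);
            break;
        }
    }
    return 0;
}
```

Under this model the speedup approaches 100 (the core ratio) as N grows, so a threshold of 10 is crossed at a moderate N; the scan above locates it without solving the cubic by hand.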