All Rights Reserved. Copyright © 2000 Hitachi Europe Ltd.
SR8000 Concept
Tim Lanfear
Hitachi Europe GmbH.
t-lanfear@hpcc.hitachi-eu.co.uk
SR8000 Model Range

System

Number of nodes               4      8      16     32     64     128      256      512
Peak performance (GFlops)
  SR8000                      32     64     128    256    512    1,024    -        -
  SR8000 Model E1             38.4   76.8   153.6  307.2  614.4  1,228.8  2,457.6  4,915.2
  SR8000 Model F1             48     96     192    384    768    1,536    3,072    6,144
Max. memory capacity (GB)
  SR8000                      32     64     128    256    512    1,024    -        -
  SR8000 Models E1, F1        64     128    256    512    1,024  2,048    4,096    8,192

Inter-node network: one-, two- or three-dimensional crossbar, depending on the number of nodes
Inter-node transfer speed
  SR8000                      1 GB/s (single direction) x2
  SR8000 Model E1             1.2 GB/s (single direction) x2
  SR8000 Model F1             1 GB/s (single direction) x2
External interfaces: Ultra SCSI, Ethernet/Fast Ethernet, Gigabit Ethernet, ATM, HIPPI, Fibre Channel

Node

Peak performance (GFlops)
  SR8000                      8
  SR8000 Model E1             9.6
  SR8000 Model F1             12
Memory capacity (GB)
  SR8000                      2 / 4 / 8
  SR8000 Models E1, F1        2 / 4 / 8 / 16
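The system peak figures on this slide are the node peak (8, 9.6 or 12 GFlops) multiplied by the number of nodes. A minimal Python sketch of that arithmetic follows; the dictionary and function names are illustrative only and do not come from any Hitachi tool.

```python
# Node peak performance in GFlops, as listed on the Model Range slide.
NODE_PEAK_GFLOPS = {
    "SR8000": 8.0,
    "SR8000 Model E1": 9.6,
    "SR8000 Model F1": 12.0,
}

def system_peak_gflops(model: str, nodes: int) -> float:
    """Peak of a system = node peak x number of nodes (illustrative helper)."""
    return NODE_PEAK_GFLOPS[model] * nodes

if __name__ == "__main__":
    # The slide lists no base SR8000 configuration at 256 or 512 nodes;
    # the product is still printed here purely as arithmetic.
    for model in NODE_PEAK_GFLOPS:
        row = [round(system_peak_gflops(model, n), 1)
               for n in (4, 8, 16, 32, 64, 128, 256, 512)]
        print(model, row)
```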
SR8000 Appearance
Compact Model

Hardware
  Model                      Model A     Model B     Model C
  Peak performance           4 GFlops    8 GFlops    12 GFlops
  Memory capacity            2GB / 4GB / 8GB or 2GB / 4GB / 8GB / 16GB, depending on model
  External interfaces        Ultra SCSI, Fibre Channel, Ethernet/Fast Ethernet, Gigabit Ethernet, HIPPI, ATM
  Number of I/O interfaces   8 (maximum)
  System expandability       Available

Physical Planning
  Input current (single phase, 200V-240V)   18 A or 17 A, depending on model
  Power consumption                         3.3 kW or 3.0 kW, depending on model
  Noise level                               57 dB
  Dimension (mm) <W x D x H>                500 x 910 x 1500

Software
  OS                           HI-UX/MPP for SR8000
  Operation management         NQS, NFS, RealTime Monitor, ADSM (ADSTAR Distributed Storage Manager)
  Languages                    FORTRAN77, FORTRAN90, Parallel FORTRAN, C, C++, Kuck and Associates C++
  Development support          Application Development Environment, Parallel Debugger
  Matrix calculation library   MATRIX/MPP, MATRIX/MPP/SSS, MSL2
  Graphics                     X Window System, OSF/Motif, GKS, PHIGS, PEXlib
Vector vs SMP vs MPP
Feature                    Vector    SMP    MPP
Single Node Performance
Scalability
Programming Effort
Development Cycle
Energy Requirements
System Architecture
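[Diagram: processing nodes (PRN) and an I/O node (ION) attached to the cross-bar inter-node network. A node contains CPUs, main memory, system control and network control. The ION connects through PCI to Ether, ATM, HIPPI and RAID disk. A service processor and console are also shown.]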
Programming Models
Hardware         Programming Model                                    Example
Single CPU       Pseudo-vector processing                             Vector application
Single Node      Independent processing on each IP                    Compilation, parallel make
                 Message passing                                      MPP application
                 DO loop distribution with COMPAS                     Vector application
                 Parallel processing of independent blocks of code
Multiple Nodes   Message passing                                      MPP application
                 COMPAS and message passing                           Vector parallel application
CPU Architecture
• 16 bytes/cycle memory BW
• 128 Kbyte L1 cache
• Pre-fetch and pre-load instructions
• 160 f.p. registers
• 2 f.p. pipelines
• 4 flops/cycle
[Diagram: main memory, memory switch, cache, floating point registers and arithmetic unit. Pre-fetch moves data from main memory to the cache, load moves it from the cache to the registers, and pre-load moves it from main memory straight to the registers.]
Slide Window Registers
• Fixed registers: 4, 8, 16, 32 (16 illustrated)
• Fixed + sliding = 128
Legend: registers for all instructions; registers for extended instructions only
Physical registers: sliding part 0 to 127, global part 128 to 159
[Diagram: logical register ranges for two window positions. Base=2: 0 to 15, 32 to 125, 16 to 31, 126-7. Base=4: 0 to 15, 32 to 123, 16 to 31, 124-7.]
Instruction Set Extensions
• Load and store with extended registers
• Floating point arithmetic with extended registers
• Slide window control
• Pre-fetch and pre-load
• Thread start-up and finish
• Predicate instructions
SR8000 Programming
Instruction Level Parallelism
(Pseudo-vector Processing: PVP)
Pre-fetch and Pre-load
• Pre-fetch: load cache line from memory to cache
• Pre-load: load one word from memory to register
• 16 streams
Pre-fetch
• Pre-fetch 128 bytes to cache
• Follow by LD to register
[Timing diagram, iterations 1 to 6: a pre-fetch (PF) is issued and, once its latency has elapsed, the following iterations each load (LD) and use the data. A second PF serves the later iterations.]
Pre-load
• Pre-load 8 bytes to register
• LD not required
[Timing diagram, iterations 1 to 6: every iteration issues its own pre-load (PL) and uses the data once the latency has elapsed.]
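The difference between the two mechanisms can be sketched by counting instructions: a pre-fetch brings a whole 128-byte line into the cache and each element still needs an LD, while a pre-load brings one 8-byte word straight into a register. The Python below is only a counting model under stated assumptions (8-byte elements, first element aligned to a cache line); the function names are hypothetical.

```python
LINE_BYTES = 128  # pre-fetch moves one 128-byte cache line (from the slide)
WORD_BYTES = 8    # pre-load moves one 8-byte word (from the slide)

def prefetch_ops(n_elements, stride_words=1):
    """Return (pre-fetches, loads) needed to read n_elements through the cache.

    Assumption: the first element is line-aligned and consecutive elements
    are stride_words words apart.
    """
    lines = {(i * stride_words * WORD_BYTES) // LINE_BYTES
             for i in range(n_elements)}
    # One PF per distinct cache line touched, one LD per element.
    return len(lines), n_elements

def preload_ops(n_elements):
    """Return (pre-loads, loads): one PL per element and no LD."""
    return n_elements, 0

if __name__ == "__main__":
    print("pre-fetch, unit stride:", prefetch_ops(64))
    print("pre-fetch, stride 16  :", prefetch_ops(64, 16))
    print("pre-load              :", preload_ops(64))
```

In this model a unit-stride loop needs one pre-fetch per 16 elements, whereas a stride of 16 words or more needs one pre-fetch per element, at which point pre-load issues no more instructions than pre-fetch plus LD.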
Software Pipelining
[Diagram: scheduling of iterations I=1, I=2, I=3 in four cases]
• No SWPL
• Infinite resource
• Recurrence (a= in one iteration, =a in the next)
• Finite resource, with successive iterations separated by the initiation interval
Resources: registers, f.p. units, instruction issue, memory bandwidth etc
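The effect of the initiation interval can be sketched with the usual textbook model of software pipelining: the interval is bounded below by the busiest resource and by any recurrence, and a pipelined loop of N iterations takes roughly one iteration latency plus (N-1) intervals. The Python below is a minimal sketch of that model, not a description of the Hitachi compiler; resource names and numbers are hypothetical, and the recurrence bound assumes a dependence distance of one iteration.

```python
import math

def min_initiation_interval(uses, units, recurrence_latency=0):
    """Lower bound on the initiation interval (in cycles).

    uses:  operations per iteration for each resource, e.g. {"fp": 4}
    units: how many of each resource can start per cycle, e.g. {"fp": 2}
    recurrence_latency: cycles between producing a value and the next
        iteration consuming it (0 if there is no loop-carried dependence).
    """
    resource_bound = max(math.ceil(uses[r] / units[r]) for r in uses)
    return max(resource_bound, recurrence_latency, 1)

def loop_cycles(n_iterations, iteration_latency, interval=None):
    """Cycle count without pipelining (interval=None) or with it."""
    if n_iterations == 0:
        return 0
    if interval is None:
        return n_iterations * iteration_latency
    return iteration_latency + (n_iterations - 1) * interval

if __name__ == "__main__":
    ii = min_initiation_interval({"fp": 4, "mem": 2}, {"fp": 2, "mem": 1})
    print("initiation interval:", ii)
    print("no SWPL  :", loop_cycles(100, 10))
    print("with SWPL:", loop_cycles(100, 10, ii))
```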
Pseudo-vector Processing
A(:) = A(:) + N
[Diagram comparing two executions of the statement. Vector: VLD, VADD, VST. Pseudo-Vector: PF, latency, then LD + ST for each iteration, with a further PF serving the later iterations.]
Effect of PVP
[Plot: Mflops (0 to 900) against loop length N (1 to 100000, logarithmic axis), with PVP off and PVP on]
Dot product: S = A(1:N)*B(1:N)
SR8000 Programming
Multi-thread Parallelism
(Cooperative Microprocessors in a Single Address Space: COMPAS)
COMPAS
COMPAS: Co-operative Micro-Processors in single Address Space
IP: Instruction Processor
Automatic Parallel Processing
[Diagram: nodes attached to the multi-dimensional crossbar network. Within a node, a process is split into threads, one per IP, and the IPs share main memory. Between the COMPAS start instruction and the COMPAS end instruction each IP runs its own pre-fetch, load, arithmetic, store and branch stream.]
Barrier Synchronization Mechanism
IP: Instruction Processor, SC: Storage Controller, MS: Main Storage
[Diagram: one IP executes the scalar part while the other IPs wait for startup. The start parallel instruction starts the loop part on every IP, and the end parallel instruction returns control to the scalar part. The diagram is labelled Software and Hardware Support, and shows the IPs connected to the SC and MS.]
Loop Parallelisation
DO i=1,N
  A(i)=B(i)+C(i)
ENDDO

[fork]
DO i=start,end
  A(i)=B(i)+C(i)
ENDDO
[join]
i loop parallelisation

DO j=1,M
  W(j)=C(j)+D(j)
  DO i=1,N
    A(i,j)=B(i,j)+W(j)
  ENDDO
ENDDO

[fork]
DO j=start,end
  W(j)=C(j)+D(j)
  DO i=1,N
    A(i,j)=B(i,j)+W(j)
  ENDDO
ENDDO
[join]
j loop parallelisation
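The [fork] / DO i=start,end / [join] pattern can be sketched in Python as follows. This is an illustration of the structure only: the block distribution in chunk_bounds is an assumption (the slide does not say how start and end are chosen), the thread count of 4 is arbitrary, and Python threads do not give the speed-up that COMPAS threads on separate IPs would.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_bounds(n, n_threads, t):
    """1-based inclusive (start, end) of thread t's share of iterations 1..n.

    Assumed block distribution: the first n % n_threads threads get one
    extra iteration. An empty share is returned as start > end.
    """
    base, extra = divmod(n, n_threads)
    start = t * base + min(t, extra) + 1
    end = start + base - 1 + (1 if t < extra else 0)
    return start, end

def parallel_add(b, c, n_threads=4):
    """A(i) = B(i) + C(i) for i = 1..N, with the i loop split across threads."""
    n = len(b)
    a = [0.0] * n

    def loop_part(t):
        start, end = chunk_bounds(n, n_threads, t)
        for i in range(start, end + 1):      # DO i=start,end
            a[i - 1] = b[i - 1] + c[i - 1]   # 1-based i mapped to 0-based list

    with ThreadPoolExecutor(max_workers=n_threads) as pool:  # [fork]
        list(pool.map(loop_part, range(n_threads)))
    # Leaving the with-block waits for all threads: [join]
    return a

if __name__ == "__main__":
    print(parallel_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], n_threads=3))
```

Each thread writes a disjoint range of A, which is what makes the loop safe to split without locking.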
Loop Parallelisation
DO j=2,M
  DO i=1,N
    A(i,j) = A(i,j-1)+A(i,j)
  ENDDO
ENDDO

[fork]
DO j=2,M
  DO i=start,end
    A(i,j) = A(i,j-1)+A(i,j)
  ENDDO
ENDDO
[join]
i loop parallelisation

DO i=1,N
  A(i) = B(i)+C(i)
ENDDO
DO j=1,M
  D(j) = E(j)*F(j)
ENDDO

[fork]
DO i=start,end
  A(i) = B(i)+C(i)
ENDDO
DO j=start,end
  D(j) = E(j)*F(j)
ENDDO
[join]
i loop parallelisation
j loop parallelisation
Loop Parallelisation
DO i = 1,N
CALL sub(a,b,i)
ENDDO
*poption parallel            (force parallelisation)
*poption tlocal(a,b,i)       (thread local variables)
[fork]
DO i = 1,N
CALL sub(a,b,i)
ENDDO
[join]
Section Parallelisation
*poption parallel_sections
*poption section
CALL SUB1
*poption section
CALL SUB2
*poption end_parallel_sections
Execution of independent blocks of code in different threads
(sections are always single threaded)
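Section parallelism runs whole independent blocks side by side, with each block itself executed by a single thread. A minimal Python sketch of the idea follows; run_sections, sub1 and sub2 are hypothetical names standing in for the *poption sections and the CALL SUB1 / CALL SUB2 blocks on the slide.

```python
from concurrent.futures import ThreadPoolExecutor

def run_sections(*sections):
    """Run each callable in its own thread and return results in order.

    Mirrors parallel_sections / end_parallel_sections: every section is
    executed by exactly one thread, and the call returns only when all
    sections have finished.
    """
    if not sections:
        return []
    with ThreadPoolExecutor(max_workers=len(sections)) as pool:
        futures = [pool.submit(section) for section in sections]
        return [future.result() for future in futures]

def sub1():
    # Hypothetical stand-in for CALL SUB1.
    return sum(range(10))

def sub2():
    # Hypothetical stand-in for CALL SUB2.
    return max(range(10))

if __name__ == "__main__":
    print(run_sections(sub1, sub2))
```

The sections must not depend on each other's results, since no ordering between them is guaranteed.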
Effect of COMPAS
[Plot: Mflops (0 to 5000) against loop length N (1 to 100000, logarithmic axis), with COMPAS off and COMPAS on]
Dot product: S = A(1:N)*B(1:N)
SR8000 Programming
Message Passing
(MPI)
Remote DMA
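Normal transfer: protocol processing, context switch, interrupt handling; data passes through the OS with a memory copy on each node
Remote DMA transfer: no buffering in kernel, no OS system call
[Diagram: two nodes joined by the crossbar network. Data travels from the program's send buffer on one node to the program's receive buffer on the other, either through the OS on both sides (normal transfer) or directly (remote DMA transfer).]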
Inter-node MPI
One MPI process per node; RDMA transfer possible
[Diagram: nodes on the cross-bar inter-node network, each running a single MPI process.]
Intra-node MPI
One MPI process per IP; RDMA transfer not possible
[Diagram: nodes on the cross-bar inter-node network, each running several MPI processes that communicate through the node's shared memory.]
MPI Ping-pong
[Plot: bandwidth (Mbytes/sec, 0 to 1400) against message length (1 to 1000000, logarithmic axis) for intra-node, inter-node RDMA and inter-node no RDMA]
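A ping-pong test times a message sent to a partner and echoed back, and reports the message length divided by the one-way time. The Python below sketches that calculation together with the common latency-plus-bandwidth model that produces curves of this shape. All numbers in the example are hypothetical and are not read from the plot; the function names are illustrative.

```python
def pingpong_bandwidth_mbytes(message_bytes, round_trip_seconds):
    """Bandwidth in Mbytes/sec from one measured round trip.

    The one-way time is taken as half the round-trip time.
    """
    one_way = round_trip_seconds / 2.0
    return message_bytes / one_way / 1.0e6

def modelled_bandwidth_mbytes(message_bytes, latency_seconds, peak_mbytes):
    """Effective bandwidth under the model t = latency + bytes / peak."""
    transfer_time = latency_seconds + message_bytes / (peak_mbytes * 1.0e6)
    return message_bytes / transfer_time / 1.0e6

if __name__ == "__main__":
    # Hypothetical link: 10 microseconds latency, 1000 Mbytes/sec peak.
    for size in (1, 100, 10_000, 1_000_000):
        print(size, round(modelled_bandwidth_mbytes(size, 10e-6, 1000.0), 2))
```

In this model the bandwidth reaches half its peak when the message length equals latency times peak bandwidth, which is why short messages are dominated by start-up cost.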
SR8000 Parallelism
Instruction level (PVP)
Multi-thread (COMPAS)
Node 1 Node 2
Message passing (MPI)
SR8000 Programming
Memory Architecture
Memory Hierarchy
• fp registers (128+32)
• L1 cache (128 Kbyte, 4-way)
• Store buffer (16 entries)
• Switch, shared with the other IPs
• Memory (2 to 16 GB, 512 banks)
Data path widths shown in the diagram: 16 bytes/cycle, 16 bytes/cycle, 32 bytes/cycle
Address Translation
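[Diagram: the virtual address is split into a virtual page number and a page offset. The virtual page number is looked up in the page table, and the result is combined with the page offset to address main memory.]
Cache recently used entries of page table in TLB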
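The translation step can be sketched as a page-table lookup with a small cache of recent entries in front of it. The Python below is a minimal model, not the SR8000 hardware: the 4 Kbyte page size, the TLB capacity of 4 entries and the least-recently-used replacement policy are all assumptions chosen to keep the example small.

```python
from collections import OrderedDict

PAGE_BITS = 12  # hypothetical 4 Kbyte pages, for illustration only

class Translator:
    """Page table lookup with a small LRU cache of entries (the TLB)."""

    def __init__(self, page_table, tlb_entries=4):
        self.page_table = page_table      # virtual page number -> physical page
        self.tlb = OrderedDict()          # recently used page table entries
        self.tlb_entries = tlb_entries
        self.hits = 0
        self.misses = 0

    def translate(self, virtual_address):
        vpn = virtual_address >> PAGE_BITS
        offset = virtual_address & ((1 << PAGE_BITS) - 1)
        if vpn in self.tlb:
            self.hits += 1
            self.tlb.move_to_end(vpn)     # mark as most recently used
        else:
            self.misses += 1              # walk the page table in memory
            self.tlb[vpn] = self.page_table[vpn]
            if len(self.tlb) > self.tlb_entries:
                self.tlb.popitem(last=False)  # evict least recently used
        return (self.tlb[vpn] << PAGE_BITS) | offset

if __name__ == "__main__":
    t = Translator({0: 5, 1: 7})
    print(hex(t.translate(0x0010)), hex(t.translate(0x1004)), hex(t.translate(0x0020)))
    print("hits:", t.hits, "misses:", t.misses)
```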
Large TLB
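[Diagram: the virtual address is split into a virtual page number and a page offset. The virtual page number is looked up in the large page table, and the result is combined with the page offset to address main memory.]
Large TLB covers whole address space with 256 entries.
Page size 16 MB to 128 MB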
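The memory covered by the large TLB is simply the number of entries times the page size. A short Python check of that arithmetic, assuming the page sizes on the slide are in megabytes of 2^20 bytes (the function name is illustrative):

```python
def tlb_reach_gbytes(entries, page_mbytes):
    """Memory covered without a TLB miss, in Gbytes (1 Gbyte = 1024 Mbytes)."""
    return entries * page_mbytes / 1024

if __name__ == "__main__":
    for page in (16, 128):
        print(f"256 entries x {page} Mbyte pages:", tlb_reach_gbytes(256, page), "Gbytes")
```

With 256 entries this gives 4 Gbytes at the smallest page size and 32 Gbytes at the largest, so the larger page sizes cover even the biggest node memory of 16 GB.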
Memory Address Hashing
[Diagram: address bits 32 to 63. One group of bits is combined by xor to select the memory controller data path, and another group is combined by xor to select the storage controller data path.]
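The purpose of hashing address bits with xor is to spread regular access patterns over many memory banks. The Python below sketches the general technique only: the choice of which bits to combine is hypothetical and is not the SR8000's actual hash. It uses 512 banks, as listed on the Memory Hierarchy slide, and 8-byte words.

```python
BANK_BITS = 9                      # 512 banks
BANK_MASK = (1 << BANK_BITS) - 1
WORD_BYTES = 8

def plain_bank(address):
    """Bank chosen from the low-order word address bits, with no hashing."""
    return (address // WORD_BYTES) & BANK_MASK

def hashed_bank(address):
    """Bank chosen by xor-ing three 9-bit fields of the word address.

    The fields used here are an illustrative assumption.
    """
    word = address // WORD_BYTES
    return (word ^ (word >> BANK_BITS) ^ (word >> (2 * BANK_BITS))) & BANK_MASK

def banks_touched(bank_function, stride_words, count=512):
    """Number of distinct banks hit by `count` accesses at a fixed stride."""
    return len({bank_function(i * stride_words * WORD_BYTES) for i in range(count)})

if __name__ == "__main__":
    for stride in (1, 512):
        print("stride", stride,
              "plain:", banks_touched(plain_bank, stride),
              "hashed:", banks_touched(hashed_bank, stride))
```

Without hashing, a stride of 512 words sends every access to the same bank; with the xor the same accesses are spread over all banks, while unit-stride access is unaffected.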
Key Features of SR8000
• High performance RISC CPU with PVP
• High performance node with COMPAS
• High sustained memory bandwidth
• High scalability with fast network
• Low energy and space requirements
SR8000 Programming
Performance
Top 500 – June 2000
Rank  Manufacturer  Computer              Rmax (GFlops)  Installation Site
1     Intel         ASCI Red              2379           Sandia National Lab
2     IBM           ASCI Blue Pacific     2144           Lawrence Livermore National Lab
3     SGI           ASCI Blue Mountain    1608           Los Alamos National Lab
4     IBM           SP Power3 375 MHz     1417           NAVOCEANO
5     Hitachi       SR8000-F1/112         1035           LRZ Munich
6     Hitachi       SR8000-F1/100         917            KEK Tsukuba
7     Cray Inc      T3E/1200              891            US Government
8     Cray Inc      T3E/1200              891            US Army HPC Research Center
9     Hitachi       SR8000/128            873            University of Tokyo
10    Cray Inc      T3E/1200              815            US Government
Linpack Performance
[Plot: GFlops (0 to 1000) against number of nodes (0 to 120)]

Number of nodes   GFlops
1                 10.88
2                 20.50
4                 40.76
8                 80.25
16                159.51
32                313.32
60                577.49
64                605.30
100               917.15
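The Linpack figures on this slide can be turned into a scaling efficiency by dividing the measured rate by the node count times the single-node rate. The Python below does that with the values from the slide; it assumes all points were measured on the same model, which the slide does not state, and the names are illustrative.

```python
# Linpack results from the slide: number of nodes -> GFlops.
LINPACK_GFLOPS = {
    1: 10.88, 2: 20.50, 4: 40.76, 8: 80.25, 16: 159.51,
    32: 313.32, 60: 577.49, 64: 605.30, 100: 917.15,
}

def scaling_efficiency(nodes):
    """Measured rate relative to perfect scaling of the 1-node rate."""
    return LINPACK_GFLOPS[nodes] / (nodes * LINPACK_GFLOPS[1])

if __name__ == "__main__":
    for nodes in sorted(LINPACK_GFLOPS):
        print(f"{nodes:4d} nodes: {scaling_efficiency(nodes):.1%}")
```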
NAS Parallel FT
[Bar chart: GFlops (0 to 35) against number of nodes (1, 2, 4, 8) for Class A, Class B and Class C. Data labels: 5.14, 7.92, 26.16, 14.01, 8.31, 27.95, 14.84, 5.39, 28.78, 15.10, 8.37.]
NAS Parallel CG
[Bar chart: GFlops (0 to 30) against number of nodes (1, 2, 4, 8) for Class B and Class C. Data labels: 7.98, 14.54, 6.35, 3.67, 10.52, 16.46, 4.14, 27.86.]