POWER5, Federation and Linux: System Architecture
Charles Grassl, IBM
August, 2004
© 2004 IBM
Agenda
- Architecture
  - Processors: POWER4, POWER5
  - System: nodes, systems
- High Performance Switch
Processor Design
POWER4: two processors per chip
- L1: 32 kbyte data, 64 kbyte instructions
- L2: 1.45 Mbyte per processor pair
- L3: 32 Mbyte per processor pair, shared by all processors

POWER5: two processors per chip
- L1: 32 kbyte data, 64 kbyte instructions
- L2: 1.9 Mbyte per processor pair
- L3: 36 Mbyte per processor pair, shared by the processor pair, an extension of the L2 cache
- Simultaneous multithreading
- Enhanced memory loads and stores
- Additional rename registers
POWER4: Multi-processor Chip
Each processor: L1 caches, registers, functional units
Each chip (shared): L2 cache; L3 cache (shared across the node; on the path to memory)
[Diagram: two processors on a chip sharing the L2 cache]
POWER5: Multi-processor Chip
Each processor: L1 caches, registers, functional units
Each chip (shared): L2 cache; L3 cache (NOT shared across the node; on the path to memory); memory controller
[Diagram: two processors on a chip sharing the L2 cache and the on-chip memory controller]
POWER4: Features
- Multi-processor chip
- High clock rate: high computation burst rate
- Three cache levels: bandwidth; latency hiding
- Shared memory: large memory size
POWER5 Features
Performance:
- Registers: additional rename registers
- Cache improvements: L3 runs at 1/2 frequency (POWER4: 1/3 frequency)
- Simultaneous Multi-Threading (SMT)
- Memory bandwidth
Virtualization:
- Partitioning
Dynamic power management
POWER4, POWER5 Differences
| Feature                      | POWER4 Design                              | POWER5 Design                               | Benefit                                   |
|------------------------------|--------------------------------------------|---------------------------------------------|-------------------------------------------|
| L1 Cache                     | 2-way associative, FIFO                    | 4-way associative, LRU                      | Improved L1 cache performance             |
| L2 Cache                     | 1.44 MB, 8-way associative                 | 1.9 MB, 10-way associative                  | Fewer L2 cache misses; better performance |
| L3 Cache                     | 32 MB, 8-way associative, 118 clock cycles | 36 MB, 12-way associative, ~80 clock cycles | Better cache performance                  |
| Memory bandwidth             | 4 Gbyte/s per chip                         | ~16 Gbyte/s per chip                        | 4X improvement; faster memory access      |
| Simultaneous Multi-Threading | No                                         | Yes                                         | Better processor utilization              |
| Size                         | 412 mm²                                    | 389 mm²                                     | 50% more transistors in the same space    |
| Processor addressing         | 1 processor                                | 1/10 of processor                           | Better usage of processor resources       |
Modifications to POWER4 System Structure
[Diagram: block diagrams of two processors (P) with shared L2, L3, fabric controller (Fab Ctl), memory controllers (Mem Ctl), and memory, showing how the POWER4 system structure was modified for POWER5]
POWER5 Challenges
- Scaling from 32-way to logical 128-way: large system effects, 2 Tbyte of memory
- 1.9 MB L2 shared across 4 threads (SMT): L2 miss rates
- 36 Mbyte private L3 versus 128 Mbyte shared L3: will have to watch the multi-programming level; read-only data will get replicated
- Local/remote memory latencies: worst-case remote memory latency is 60% higher than local
Effect of Features: Registers
- Additional rename registers: POWER4: 72 total registers; POWER5: 110 total registers
- Matrix multiply: POWER4: 65% of burst rate; POWER5: 90% of burst rate
- Enhances the performance of computationally intensive kernels
Effect of Rename Registers
[Chart: Mflop/s (0-7000) for polynomial and MATMUL kernels on a 1.5 GHz POWER4 versus a 1.65 GHz POWER5]
Effect of Rename Registers
- "Tight" algorithms were previously register bound
- Increased floating point efficiency
- Examples:
  - Matrix multiply (LINPACK)
  - Polynomial calculation (Horner's rule)
  - FFTs
Memory Bandwidth
[Chart: memory bandwidth (Mbyte/s, 0-14000) of the sum kernel s = s + A(i) versus working-set size (0-400,000 bytes) on a 1.65 GHz POWER5 and a 1.5 GHz POWER4]
Memory Bandwidth: L1 – L2 Cache Regime
[Chart: the same bandwidth curves (Mbyte/s, 0-14000; 0-400,000 bytes) with the L1 cache and L2 cache regimes marked]
Memory Bandwidth: L2 Cache Regime
[Chart: bandwidth (Mbyte/s, 0-14000) versus working-set size (0-4,000,000 bytes); the L2 cache regime is marked]
L2 and L3 Cache Improvement
- L3 cache enhances bandwidth; previously it only hid latency
- The new L3 cache is like a "2.5 cache"
- Larger blocking sizes
Simultaneous Multi-Threading
- Two threads per processor; threads execute concurrently
- Threads share: caches, registers, functional units
POWER5 Simultaneous Multi-Threading (SMT)
[Diagram: per-cycle occupancy of the functional units (FX0, FX1, LS0, LS1, FP0, FP1, BRX, CRL) for a single-threaded POWER4 versus a simultaneous multi-threaded POWER5; legend: thread 0 active, thread 1 active, no thread active]
- Presents an SMP programming model to software
- Natural fit with the superscalar out-of-order execution core
Conventional Multi-Threading
- Threads alternate; nothing is shared
[Diagram: functional units (FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL) over time, alternating between thread 0 and thread 1]
Simultaneous Multi-Threading
- Simultaneous execution
- Shared registers
- Shared functional units
[Diagram: both threads issuing to the functional units in the same cycle]
Manipulating SMT
- Administrator function
- Dynamic
- System-wide

| Function | Command                     |
|----------|-----------------------------|
| On       | /usr/sbin/smtctl -m on now  |
| Off      | /usr/sbin/smtctl -m off now |
| View     | /usr/sbin/smtctl            |
Dynamic Power Management
[Thermal-camera photos of a prototype POWER5 chip under test: single thread versus simultaneous multi-threading, with no power management versus dynamic power management]
Multi-Chip Modules (MCMs)
Four processor chips are mounted on a module
- Unit of manufacture
- Smallest freestanding system
Multiple modules form a node
Also available: Single Chip Modules
- Lower cost
- Lower bandwidth
POWER4 Multi-Chip Module (MCM)
- 4 chips: 8 processors, 2 processors per chip
- >35 Gbyte/s in each chip-to-chip connection
- 2:1 MHz chip-to-chip busses, 6 connections per MCM
- Logically shared L2s and L3s
- Data bus (memory, inter-module): 3:1 MHz MCM to MCM
POWER5 Multi-Chip Module (MCM)
- 95 mm x 95 mm
- Four POWER5 chips
- Four cache chips
- 4,491 signal I/Os
- 89 layers of metal
Multi Chip Module (MCM) Architecture
POWER4:
- 4 processor chips, 2 processors per chip
- 8 off-module L3 chips; the L3 cache is controlled by the MCM and logically shared across the node
- 4 memory control chips
- Total: 16 chips

POWER5:
- 4 processor chips, 2 processors per chip
- 4 L3 cache chips; the L3 cache is used by a processor pair as an "extension" of its L2
- Total: 8 chips
System Design with POWER4 and with POWER5
Multi Chip Module (MCM)
- Multiple MCMs per system node
Other issues:
- L3 cache sharing
- Memory controller
POWER4 Processor Chip
[Diagram: two cores with store caches, a shared 1.4M L2 cache, fabric bus controller (FBC) with A/X/Y/Z ports, GX and NCU, memory controller (MC), and an off-chip 32 Mbyte L3 cache (a portion of 128M shared) on the path to memory]
POWER4 Multi-chip Module
[Diagram: four chips (CPUs 0-7), each pair sharing a 1.5MB L2; chip-chip communication busses; L2 and L3 directories; a shared 128MB L3; memory; GX buses; and MCM-to-MCM links]
POWER4 32-way Plane Topology
[Diagram: four MCMs (0-3), each containing chips S, T, U and V, connected by uni-directional links]
- Uni-directional links
- 2-beat address @ 850 MHz = 425M addresses/s
- 8-byte data @ 850 MHz (2:1) = 7 GB/s
POWER5 Processor Chip
[Diagram: two cores (each running threads 0 and 1) with store caches, a shared 1.9M L2 cache, fabric bus controller (FBC) with A/B/X/Y/Z ports, MC, GX and NCU, and a 36M L3 victim cache on the path to memory]
POWER5 Multi Chip Module
[Diagram: four chips (CPUs 0-15, two per chip), each with a shared L2, memory controller, L3 directory, and a 36 MB L3 chip; chip-chip communication busses; memory attached per chip; GX buses to RIO-2, the High Performance Switch, and InfiniBand; and MCM-to-MCM / book-to-book links]
POWER5 64-way Plane Topology
[Diagram: eight MCMs (0-7), each containing chips S, T, U and V, connected by uni-directional links, with vertical node links between the rows]
- Uni-directional links
- 2-beat address @ 1 GHz = 500M addresses/s
- 8-byte data @ 1 GHz (2:1) = 8 GB/s
- Vertical node links reduce latency and increase bandwidth for data transfers
- They also serve as an alternate route for address operations during dynamic ring re-configuration (e.g. concurrent repair/upgrade)
Upcoming POWER5 Systems
- Early introductions are iSeries (commercial)
- Next: small nodes, 2, 4 or 8 processors (SF, ML)
- Next, next: iH, H, up to 64 processors
Planned POWER5 iH
- 2/4/6/8-way POWER5
- 2U rack chassis
- Rack: 24" x 43" deep
Auxiliary
Switch Technology
Internal network
- In lieu of, e.g., Gigabit Ethernet
Multiple links per node
- Match the number of links to the number of processors

| Generation      | Switch                 |
|-----------------|------------------------|
| POWER2          | HPS Switch             |
| POWER2 → POWER3 | SP Switch              |
| POWER3 → POWER4 | SP Switch 2 ("Colony") |
| POWER4 → POWER5 | HPS ("Federation")     |
HPS Software
MPI-LAPI (PE V4.1)
- Uses LAPI as the reliable transport
- The library uses threads, not signals, for async activities
Existing applications are binary compatible
New performance characteristics
- Improved collective communication
High Performance Switch (HPS)
Also known as "Federation"
Follow-on to SP Switch2 (also known as "Colony")
Specifications:
- 2 Gbyte/s (bidirectional)
- 5 microsecond latency
Configuration:
- Four adaptors per node
- 2 links per adaptor
- 16 Gbyte/s per node
HPS Specifications
| Switch                 | Latency [microsec.] | Bandwidth, single [Mbyte/s] | Bandwidth, multiple [Mbyte/s] |
|------------------------|---------------------|-----------------------------|-------------------------------|
| SP Switch 2 ("Colony") | 15                  | 350                         | 550                           |
| HPS ("Federation")     | 5                   | 1700                        | 2000                          |
HPS Packaging
4U, 24-inch drawer
- Fits in the bottom position of a p655 or p690 frame
Switch-only frame (7040-W42)
- Up to eight HPS switches
16 ports for server-to-switch and 16 ports for switch-to-switch connections
Host attachment directly to the server bus via a 2-link or 4-link Switch Network Interface (SNI)
- Up to two links per pSeries 655
- Up to eight links per pSeries 690
HPS Scalability
- Supports up to 16 p690 or p655 servers and 32 links at GA
- Increased support for up to 64 servers (32 can be pSeries 690 servers); 128 links planned for July, 2004
- Higher scalability limits available by special order
HPS Performance: Single and Multiple Messages
[Chart: bandwidth (Mbyte/s, 0-2500) versus message size (0-1,000,000 bytes) for single and multiple messages]
HPS Performance: POWER5 Shared Memory MPI
[Chart: bandwidth (Mbyte/s, 0-4000) versus message size (0-1,000,000 bytes) for shared-memory MPI on POWER5]
Summary
Processor technology: POWER4 → POWER5
- Multiprocessor chips
- Enhanced caches
- SMT
Cluster systems
- p655
- p690