Cluster Hardware Overview (IA-32 Pentium)
Kent Milfeld
10/31/2002
The University of Texas at Austin
Texas Advanced Computing Center
Outline: Cluster Systems
• Cluster Architecture
– Nodes -- 2-way SMP (Dell Xeon Pentium 4)
– Motherboard -- 2-way SMP (ServerWorks)
– Interconnect -- Switch (Myrinet)
Cluster Architecture
[Figure: cluster architecture diagram. Labels: internet; Switch (GigE, Myrinet); Server; compute nodes (PC); PC+; File Server; links: ethernet, Myrinet, …; FCAL, SCSI, …]
Node
System: Dell PowerEdge 2650 (2U)
Processors: Two 2.4GHz Intel® Xeon Processors
Chipset: ServerWorks Grand Champion LE (GC-LE) chipset
Memory: 2GB, 2:1 memory interleave (200MHz DDR SDRAM)
FSB: 400MHz (Front Side Bus)
Cache: 512KB L2 Advanced Transfer Cache
Disk: Dual-channel integrated Ultra3 (Ultra160) SCSI Adaptec® AIC-7899 (160MB/s) controller
Motherboard
[Figure: motherboard block diagram. Labels: Pentium 4 Xeon Processors; GC-LE; DDR 200 (x2); Interleaved Memory; 3.2GB/s Memory Subsystem; IMBus; thin IMBus; 8B (x2); CIOB-X (x2); PCI-X; PCI64C; BCM5701; Gigabit NIC; Myrinet Adapter; CSB5; Legacy I/O; 32-bit PCI]

Memory: dual-channel, up to 16GB of DDR200 memory
Bandwidth: 3.2GB/s of memory bandwidth
RAS: ECC, redundant spare memory support, memory scrubbing & chipkill
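The 3.2GB/s figure can be checked with a little arithmetic. The following is a minimal sketch, assuming each DDR200 channel is 8 bytes (64 bits) wide and performs 200 million transfers per second, and that the 400MHz front side bus is also 8 bytes wide. The function name is illustrative and not part of the original slides.

```python
def peak_bandwidth_bytes_per_s(channels, transfers_per_s, bytes_per_transfer):
    """Peak bandwidth = channels x transfer rate x bus width."""
    return channels * transfers_per_s * bytes_per_transfer

# Assumed: 2 channels of DDR200, 200e6 transfers/s each, 8 bytes per transfer.
memory_bw = peak_bandwidth_bytes_per_s(2, 200_000_000, 8)

# Assumed: one 400MHz front side bus, 8 bytes per transfer.
fsb_bw = peak_bandwidth_bytes_per_s(1, 400_000_000, 8)

print(memory_bw / 1e9, "GB/s memory")
print(fsb_bw / 1e9, "GB/s front side bus")
```

Under these assumptions the two interleaved memory channels together match the peak rate of the front side bus.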
Interconnect (Myrinet Bandwidth)
http://www.myri.com/myrinet/performance/index.html
Interconnect (MPI Bandwidth)
[Figure: MPI Bandwidth (P-2-P). x-axis: Payload (Bytes), 0 to 120000; y-axis: Bandwidth (MB/sec.), 0 to 250. Legend (Machine): longhorn, tejas, jupiter, q, santarita; interconnects: GigE (IBM), Myrinet]
Outline: Pentium 4 Microarchitecture
• Features
• Block Diagrams (data flow / hardware)
• Out-of-Order (OO) execution
• Speeds & Feeds
• Floating Point & Memory Performance
• Registers / Caches
• SIMD
• Compiler Design
• Optimizations
Architecture Features
• NetBurst Microarchitecture
• Instruction Cache (Execution Trace Cache)
• Out-of-Order (OO) execution engine
• Double-pumped Arithmetic Logic Unit
• Memory Subsystem (L1 access in 2 CP)
• Floating Point/Multi-Media performance
Basic Features
• 42 million transistors (0.18µ), 217 mm², 55 watts @ 1.5GHz, 6 levels of aluminum interconnect
• Up to 3.0GHz
• 400/533 MHz FSB
• 144 128/64-bit SIMD instructions
– SSE2 (Streaming SIMD Extensions 2)
Data and Instruction Flow
[Figure: Pentium 4 data and instruction flow block diagram. Blocks: Front End (Fetch/Decode, Trace Cache, code ROM, BTB/Branch Prediction); Out-of-Order Engine (Execution Out-of-Order Core, Retirement, Branch History Update); Int & FP Execution (Execution Units, Registers); Memory Subsystem (Level 1 Data Cache, Level 2 Cache, Bus Unit, 100MHz Bus)]
Out-of-Order Execution
• Execution timing is non-deterministic because of out-of-order execution.
• Stalls overcome by parallel execution, buffering, and speculation.
In Order Issue
Out of Order Execution
In Order Retirement
Out-of-Order Execution -- Pipeline
Pentium III processor misprediction pipeline (10 stages):
1 Fetch, 2 Fetch, 3 Decode, 4 Decode, 5 Decode, 6 Rename, 7 ROB Rd, 8 Rdy/Sch, 9 Dispatch, 10 Exec

Pentium 4 processor misprediction pipeline (20 stages). Stage labels recoverable from the figure:
TC Fetch, Drive, Alloc, Rename, Que, Sch, Disp, FR, Ex, Flgs, Br Ck, Drive
Pentium 4 Speeds & Feeds
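[Figure: Pentium 4 speeds and feeds between Regs., L1 Data, L2, Memory, Trace Cache and Exec, for a 2.4GHz CPU with 400MHz FSB and PC800 RDRAM]

Legend: W = FP Word (64 bit); Int = Integer (64 bit); CP = Clock Period

Labels recoverable from the figure:
• L1 Data: 8KB
• L2: 256/512KB, on die, 32B wide
• Line size L1/L2 = 8/16 W
• Bandwidths: 4 W/CP; 1 W (load)/CP; 1 W (store)/CP; 1 W per 6 CP
• Latencies: 2 CP; 2-7 CP; ~90 CP
• Trace Cache: ~4 CP (3 uops/CP stream)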
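As a worked example of how such a "feed" can be derived, the sketch below converts the peak rate of a 400MHz front side bus into 64-bit words per 2.4GHz clock period. The 8-byte bus width is an assumption (it is not stated on this slide), and the constant and function names are illustrative.

```python
from fractions import Fraction

FSB_HZ = 400_000_000      # front side bus transfer rate (from the slide)
CPU_HZ = 2_400_000_000    # CPU clock (from the slide)
BUS_BYTES = 8             # assumed bus width in bytes
WORD_BYTES = 8            # W = one 64-bit word

def words_per_clock(bus_hz, bus_bytes, cpu_hz, word_bytes=WORD_BYTES):
    """Peak words delivered per CPU clock period (CP), as an exact fraction."""
    bytes_per_cp = Fraction(bus_hz * bus_bytes, cpu_hz)
    return bytes_per_cp / word_bytes

rate = words_per_clock(FSB_HZ, BUS_BYTES, CPU_HZ)
print(f"{rate} W/CP")
```

Under these assumptions the bus delivers one word every six clock periods, which is consistent with the "1 W 6 CP" label in the figure.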
Performance Comparison
Scott Wasson, “Intel’s Pentium 4 Processor, Radical Chic”, www.tech-report.com/reviews/2001q3/pentium4-2ghz/
Processor Speed vs Memory Bandwidth
Scott Wasson, “Intel’s Pentium 4 Processor, Radical Chic”, www.tech-report.com/reviews/2001q3/pentium4-2ghz/
Registers
Register set | Name | Count | Width
General Purpose Registers | GPR | 8 | 32-bit
Segment Registers | SEG | 6 | 16-bit
Floating Point Registers | FPU | 8 | 80-bit
MMX/SSE Registers | MMX | 8 | 64-bit
SSE2 Registers (FP/Int…) | XMM | 8 | 128-bit

Also shown: EFLAGS Register, Control Register
Pentium 4 Cache
Level | Capacity | Associativity | Line Size (bytes) | Latency int/float (clocks) | Write Update Policy
First | 8KB | 4 | 64 | 2/9 | write through
TC | 12K uops | 8 | N/A | N/A | N/A
Second | 256KB, 512KB | 8 | 128 read, 64 write | 7/7 | write back
Third | 0, 512KB or 1MB | 8 | 128 read, 64 write | 14/14 | write back
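The capacity, associativity and line size in the table determine how many sets a cache has and which addresses compete for the same set. The sketch below works this out for the first-level cache. It assumes a simple modulo mapping from address to set; the function names are illustrative.

```python
def cache_sets(capacity_bytes, associativity, line_bytes):
    """Number of sets = capacity / (ways x line size)."""
    return capacity_bytes // (associativity * line_bytes)

def set_index(address, sets, line_bytes):
    """Set that a byte address maps to, assuming simple modulo indexing."""
    return (address // line_bytes) % sets

# First-level data cache from the table: 8KB, 4-way, 64-byte lines.
l1_sets = cache_sets(8 * 1024, 4, 64)
print("L1 sets:", l1_sets)

# Two addresses 2KB apart (8KB / 4 ways) fall into the same set.
print(set_index(0x1000, l1_sets, 64), set_index(0x1800, l1_sets, 64))
```

With only four ways per set, more than four hot arrays whose starting addresses differ by a multiple of 2KB would evict one another under this model.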
SIMD
• Beginning with the Pentium II, SIMD technology was integrated into the hardware and instruction set. SSE2 was implemented in the Pentium 4.
Instructions | Packed Data | MMX Registers (64-bit) | XMM Registers (128-bit) | Apps
MMX | INT B, W, Q | Yes | --- | Imaging, MM, comm.
SSE | SP Float | Yes | --- | 3-D geo/rendering, video en/decode
SSE2 | INT, SP/DP Float | Yes | Yes | 4-D graphics, Scientific Comp
• Intel Hyper-Threading Technology
– Use OpenMP pragmas & directives with the Intel compiler.
– Higher performance is realized with Multi-Entry Threading (MET).
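The packed data types in the SIMD table translate directly into the number of elements processed per instruction. A minimal sketch of that arithmetic follows; the function name is illustrative.

```python
def lanes(register_bits, element_bits):
    """How many packed elements of a given width fit in one register."""
    if register_bits % element_bits:
        raise ValueError("element width must divide register width")
    return register_bits // element_bits

print("64-bit MMX register, bytes:          ", lanes(64, 8))
print("64-bit MMX register, 16-bit words:   ", lanes(64, 16))
print("128-bit XMM register, SP floats:     ", lanes(128, 32))
print("128-bit XMM register, DP floats:     ", lanes(128, 64))
```

For double-precision scientific code, a 128-bit XMM register therefore holds two operands per instruction.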
Compiler Design “all for one” with SIMD
[Figure: compiler structure. Stages, top to bottom:]
C++/C Front End, Fortran 95 Front End
Code Restructuring & IPO
OpenMP/Automatic Parallelization & Vectorization
HLO & Scalar Opt.
Lower Level Code Gen. & Optimization
IA-32, IA-64

Threading approaches shown: outlining, Multi-Entry Threading
Uses Guide-based multi-threaded run-time library from Intel KAI Software Laboratory (KSL)
OPT: Avoid Unpredictable Branches
• Simple branches in often-traversed loops can be eliminated by the compiler:

    C = (A<B) ? C1 : C2    or    IF(A.LT.B) THEN; C=C1; ELSE; C=C2; ENDIF

Example 1. Optimization Eliminates Branch

Assembly Branch:

        cmp A, B
        jge L0
        mov ebx, C1
        jmp L1
    L0: mov ebx, C2
    L1:

Assembly No-Branch (pseudo-assembly):

    Compare A with B
    C3 = C1-C2
    Set register to all 1s (A<B) or all 0s (A>=B), according to compare
    AND C3 with register
    ADD C2 to register

Register contents after each step:

    Step | A<B | A>=B
    Set register | 111...1 | 000...0
    AND C3 | C3 | 0
    ADD C2 | C1 | C2 (result)
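The mask, AND and ADD steps of the no-branch version can be checked with ordinary integer arithmetic. The sketch below is only an illustration of the logic, not of generated machine code; both function names are hypothetical, and Python's unbounded integers stand in for a 32-bit register.

```python
def select_with_branch(a, b, c1, c2):
    """Reference version: C = (A < B) ? C1 : C2, using a conditional."""
    if a < b:
        return c1
    return c2

def select_branchless(a, b, c1, c2):
    """Same result using a mask, an AND, and an ADD instead of a branch."""
    c3 = c1 - c2            # C3 = C1 - C2
    mask = -int(a < b)      # -1 (all ones) when A < B, otherwise 0
    return (c3 & mask) + c2 # C3 + C2 = C1 when A < B, otherwise 0 + C2 = C2

for a, b in [(1, 2), (2, 1), (2, 2)]:
    print(a, b, select_with_branch(a, b, 10, 20), select_branchless(a, b, 10, 20))
```

Both functions agree for every input, including negative constants, because ANDing with all ones leaves C3 unchanged and ANDing with zero clears it.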
OPT: Make code consistent with static prediction algorithm
• Predict backward conditional branches to be taken (loops).
• Predict forward conditional branches to be NOT taken.
• Predict indirect branches to be NOT taken.
Forward conditional branches not taken (fall through):

    if <condition> { … }
    for <condition> { … }

Backward conditional branches taken:

    loop { … } <condition>
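The static rule for conditional branches can be modelled in a few lines. The sketch below covers only the static prediction described on this slide and ignores the dynamic branch history hardware; the addresses and function names are hypothetical.

```python
def static_prediction_taken(branch_addr, target_addr):
    """Static rule: backward branches are predicted taken, forward not taken."""
    return target_addr < branch_addr

def loop_mispredictions(iterations):
    """Static mispredictions for a loop closed by a backward branch.

    The closing branch is taken on every iteration except the last,
    so only the loop exit is mispredicted. Assumes iterations >= 1.
    """
    predicted = static_prediction_taken(branch_addr=0x120, target_addr=0x100)
    outcomes = [True] * (iterations - 1) + [False]
    return sum(outcome != predicted for outcome in outcomes)

print(static_prediction_taken(0x120, 0x100))  # backward branch
print(static_prediction_taken(0x100, 0x140))  # forward branch
print(loop_mispredictions(10))
```

Code laid out so that the common case falls through on forward branches, and loops close with a backward branch, agrees with this rule most of the time.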
OPT: Make code consistent with static prediction algorithm
• Inline functions with branch structure: a mispredicted branch can lead to larger performance penalties inside a small function than if that function is inlined.
• Be careful not to increase “working set” beyond what will fit in the trace cache.
• Indirect branches degrade performance if they are non-predictable. (switches, computed GOTOs, call through pointers)
OPT: Unrolling
• If the loop count is small, unroll Pentium 4 code so that at most 16 iterations are performed; their branches will then all be predicted. (Only 4 are suggested for the PII and PIII.)
• Other concerns: registers, working set size in trace cache, and prefetching may be more important.
OPT: Memory
• Inappropriate alignment and store forwarding are sources of large delays.
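One common alignment problem is a data access that straddles two cache lines. The sketch below shows how such a split can be detected and how an address can be rounded up to an aligned boundary. It assumes the 64-byte line size given elsewhere in these slides; the function names are illustrative.

```python
CACHE_LINE_BYTES = 64  # assumed line size, taken from the cache slide

def crosses_cache_line(address, size, line_bytes=CACHE_LINE_BYTES):
    """True if an access of `size` bytes starting at `address` spans two lines."""
    return (address % line_bytes) + size > line_bytes

def align_up(address, alignment):
    """Round an address up to the next multiple of `alignment` (a power of two)."""
    return (address + alignment - 1) & ~(alignment - 1)

print(crosses_cache_line(56, 8))   # 8-byte value ending exactly at the line end
print(crosses_cache_line(60, 8))   # 8-byte value split across two lines
print(align_up(60, 16))
```

Placing 8-byte data on 8-byte boundaries and 16-byte SIMD data on 16-byte boundaries avoids such splits under this model.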
OPT: General Optimization Concerns
• Instruction decoding is less important (than with the Pentium III).
• Some latencies of simple arithmetic ops have decreased (2x faster local clock).
• Memory latency hiding is better. (Hardware Prefetching)
• New Cacheability Instructions (streamline stores and manage cache usage)
• Fewer prefetches required (64-byte cache lines compared to 32 bytes on the PII and PIII), but false sharing is more important.
• L2 code misses should be fewer. (The Trace Cache is used in lieu of an L1 code cache.)
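The effect of the larger cache line on both prefetching and false sharing can be illustrated with simple arithmetic. This is a sketch under the assumption of a contiguous, line-aligned array of 8-byte values; the addresses and function names are hypothetical.

```python
def lines_touched(num_bytes, line_bytes):
    """Cache lines covered by a contiguous, line-aligned block of bytes."""
    return -(-num_bytes // line_bytes)  # ceiling division

def share_line(addr_a, addr_b, line_bytes):
    """True if two byte addresses fall in the same cache line."""
    return addr_a // line_bytes == addr_b // line_bytes

array_bytes = 1000 * 8  # 1000 double-precision values
print(lines_touched(array_bytes, 32), "lines of 32 bytes")
print(lines_touched(array_bytes, 64), "lines of 64 bytes")

# Two variables 16 bytes apart, e.g. counters updated by different threads.
print(share_line(0x18, 0x28, 32), share_line(0x18, 0x28, 64))
```

Streaming through the array touches half as many 64-byte lines, so fewer prefetches are needed, while the two nearby variables now land in one line and can suffer false sharing.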
OPT: X87/SSE2 Instructions
• Avoid changing between 3 (or more) floating-point modes. An FLDCW mode change (precision and rounding control, etc., e.g. when converting to int) must flush the instruction pipe.
• Masked floating-point exceptions require “assistance” from slower microcode operations. Avoid propagation of overflow, underflow and denormalized operands.
OPT: X87/SSE2 Instructions
• Set mode to convert underflows to zero (FTZ mode).
• Set mode to convert denormalized floats to zero (DAZ mode).
-- Use FTZ and DAZ when speed is important and a slight loss in precision is acceptable.
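The precision that FTZ and DAZ give up can be seen in a small emulation. In hardware these modes are control bits in the MXCSR register; Python offers no portable way to set them, so the sketch below only imitates their effect on a single multiply. The function names are hypothetical.

```python
import math
import sys

MIN_NORMAL = sys.float_info.min  # smallest positive normalized double

def is_denormal(x):
    """True for nonzero values smaller in magnitude than the smallest normal."""
    return x != 0.0 and abs(x) < MIN_NORMAL

def flush(x):
    """Replace a denormal by a zero of the same sign; pass other values through."""
    return math.copysign(0.0, x) if is_denormal(x) else x

def multiply(a, b, ftz=False, daz=False):
    """Emulated multiply: DAZ zeroes denormal inputs, FTZ zeroes a denormal result."""
    if daz:
        a, b = flush(a), flush(b)
    result = a * b
    return flush(result) if ftz else result

denormal = MIN_NORMAL / 2
print(multiply(MIN_NORMAL, 0.5))            # gradual underflow: a denormal result
print(multiply(MIN_NORMAL, 0.5, ftz=True))  # underflowing result flushed to zero
print(multiply(denormal, 4.0))              # denormal input still contributes
print(multiply(denormal, 4.0, daz=True))    # denormal input treated as zero
```

The values lost are all below about 2.2e-308 in magnitude, which is why the modes are usually acceptable when speed matters more than the last bits of very small results.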