Advanced Computer Architecture 5MD00 / 5Z033
EPIC / Itanium architecture: best of both worlds
Henk Corporaal, www.ics.ele.tue.nl/~heco
TU Eindhoven, 2009
04/22/23 ACA H.Corporaal 2
Avoiding superscalar complexity
• An alternative:
  – EPIC (Explicitly Parallel Instruction Computing)
• EPIC: Best of both worlds?
  – Superscalar: expensive but binary compatible
  – VLIW: simple, but not compatible
• Or: use VLIW with binary translation at run-time
  – Transmeta: Crusoe VLIW processor
  – Runs x86 code on a VLIW!
EPIC Architecture: IA-64 / Itanium
Explicitly Parallel Instruction Computing
• IA-64
• Implementations: Merced (2001), McKinley (2002), Montecito (2-core, 2006), Tukwila (4-core, 2009), Poulson (Q4 2009, 8-core)
• The architecture is now called Itanium

Register model:
• 128 64-bit integer registers, each with an extra NaT bit; stacked, rotating
• 128 82-bit floating-point registers, rotating
• 64 1-bit predicate (boolean) registers
• 8 64-bit branch target address registers
• System control registers
Itanium Instruction format
• Instructions grouped in 128-bit bundles
  – 3 × 41-bit instructions
  – 5 template bits, which indicate the instruction types and the stop locations
• Each 41-bit instruction
  – starts with a 4-bit opcode, and
  – ends with a 6-bit guard (boolean) register-id
Bundle layout (field widths in bits): 5 (template) | 41 | 41 | 41
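The bundle layout above can be sketched as plain bit manipulation. This is a minimal illustration, not a decoder: the helper names are hypothetical, and the bit placement is an assumption (template in the lowest 5 bits of the bundle, opcode in the top 4 bits of each instruction, guard register-id in the lowest 6 bits).

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit instructions into one 128-bit integer."""
    assert 0 <= template < (1 << TEMPLATE_BITS) and len(slots) == 3
    bundle = template  # assumption: template occupies the lowest 5 bits
    for i, slot in enumerate(slots):
        assert 0 <= slot <= SLOT_MASK
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
             for i in range(3)]
    return template, slots


def instruction_fields(slot):
    """Extract the two fields named on the slide from one 41-bit instruction."""
    return {
        "opcode": slot >> (SLOT_BITS - 4),  # 4-bit opcode at the start (top bits)
        "guard": slot & 0x3F,               # 6-bit guard register-id at the end
    }
```

The widths add up as 5 + 3 × 41 = 128, so a packed bundle always fits in 128 bits.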
Predication
• Predicated execution of virtually all instructions
  – (p) add r1 = r2, r3
    • If p is true, normal add operation. Otherwise, NOP
  – 64 1-bit predicate registers
• Advantages of predicated execution:
  – Remove branches
    • Convert control dependence to data dependence
    • Reduce misprediction penalties
  – Increase the size of basic blocks
    • Code from both the taken and the not-taken path can be scheduled in the same cycle
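To make the branch removal concrete, here is a small sketch of if-conversion. The function names and the absolute-difference example are illustrative assumptions, not taken from the slides; `guarded` only models the "(p) op" rule stated above.

```python
def guarded(p, old_value, compute):
    """Model '(p) op': commit the result if p is true, otherwise behave as a NOP."""
    return compute() if p else old_value


def abs_diff_branchy(a, b):
    """Reference version with a control dependence (a branch)."""
    if a > b:
        return a - b
    return b - a


def abs_diff_predicated(a, b):
    """If-converted version: both paths are issued, predicates pick the result."""
    p1 = a > b      # a compare sets a predicate ...
    p2 = not p1     # ... and its complement
    r1 = 0
    r1 = guarded(p1, r1, lambda: a - b)  # (p1) sub r1 = a, b
    r1 = guarded(p2, r1, lambda: b - a)  # (p2) sub r1 = b, a
    return r1
```

Because the two guarded operations no longer depend on a branch outcome, only on p1 and p2, a scheduler is free to place them in the same cycle.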
Control Speculation
• Loads incur high latency
  – Need to schedule loads as early as possible
  – Two barriers: branches and stores
• Control speculation: move loads above branches
Control speculation: move loads above branches
Problem: loads can cause exceptions
• Separate load behavior from exception behavior
  – Speculative load (ld.s) initiates the load operation and detects exceptions, but does not raise them
  – On an exception, hardware propagates an exception token (the NaT bit, stored with the destination register) from ld.s to chk.s
  – Speculative check (chk.s) delivers the exception detected by ld.s
Control Speculation
• Speculating the uses of a control-speculative load further increases ILP
  – Instructions that depend on the load can also be speculated above branches
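The ld.s / chk.s split, including speculated uses of the loaded value, can be modeled in a few lines. This is a sketch under simplifying assumptions: memory is a dictionary, a missing address stands in for a faulting load, and all function names are hypothetical.

```python
NAT = object()  # stands in for the exception token carried with the register


def ld_s(memory, addr):
    """Speculative load: on a bad address, return the token instead of raising."""
    return memory[addr] if addr in memory else NAT


def add(x, y):
    """Dependent instruction: the token propagates to the result."""
    return NAT if x is NAT or y is NAT else x + y


def chk_s(value, recovery):
    """Speculative check: run recovery code only if the token arrived."""
    return recovery() if value is NAT else value


def load_plus_one(memory, addr, take_path):
    t = ld_s(memory, addr)   # load hoisted above the branch
    u = add(t, 1)            # its use is hoisted as well
    if not take_path:
        return None          # result unused: a deferred exception is never seen
    # Recovery re-executes the load non-speculatively, so a real fault surfaces here.
    return chk_s(u, lambda: memory[addr] + 1)
```

In this model a bad address is harmless on the path that does not use the value, and raises only on the path where the original, unhoisted load would have raised.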
Data Speculation
• Loads and previous stores can conflict
  – When a load and an earlier store overlap (access the same memory location), the load must wait for the store because of the RAW dependence
• IA-64 enables data speculation with ld.a and ld.c/chk.a, backed by the ALAT (Advanced Load Address Table)
  – ld.a performs a normal load and inserts the address into the ALAT
  – Any intervening store removes the overlapping entries from the ALAT
  – The advanced load check (ld.c) looks up the ALAT to see whether there was a violation and reissues the load if necessary
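The ALAT protocol described above can be sketched as a small table keyed by destination register. This is a simplified model, not the hardware: it assumes exact address matches (no access sizes or partial address tags), and the class and method names are hypothetical.

```python
class Alat:
    """Toy Advanced Load Address Table: destination register -> load address."""

    def __init__(self):
        self.entries = {}

    def ld_a(self, memory, reg, addr):
        """Advanced load: perform the load and record its address."""
        self.entries[reg] = addr
        return memory[addr]

    def store(self, memory, addr, value):
        """Store: write memory and drop every entry with an overlapping address."""
        memory[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def ld_c(self, memory, reg, addr, speculative_value):
        """Check load: keep the early value if the entry survived, else reload."""
        if reg in self.entries:
            return speculative_value
        return memory[addr]
```

Usage: issue `ld_a` before a store whose address is not yet known, then `ld_c` at the load's original position. If the store hit the same address, the entry is gone and the value is reloaded; otherwise the early value is used without touching memory again.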
Data Speculation
• Move loads above potentially overlapping stores
Data Speculation
• Uses of speculative data can be further speculated
• Also, control and data speculation can be combined
  – Schedule loads across branches and across stores at the same time
  – Speculative advanced load: ld.sa combines the semantics of ld.a and ld.s
Register Stack
• Procedure call overhead
  – Spill registers to memory on call
  – Restore them on procedure return
• Register Stack
  – Register stack is used to save/restore procedure contexts across calls
  – Stack area in memory to save/restore procedure context
  – Explicit allocation of stack frames
    • Effective use of 96 registers
    • Allocate only what is needed
  – Overlapping stack frames avoid parameter copying
  – Mechanism implemented by renaming register addresses
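The renaming mechanism and the overlapping frames can be sketched as follows. This is a minimal model with hypothetical names; it assumes stacked registers are named from r32 upward, that a call makes the caller's output registers the start of the callee's frame, and it ignores overflow of the physical file.

```python
class RegisterStack:
    """Toy register stack: virtual names r32.. are renamed onto a physical file."""

    def __init__(self, n_physical=96):
        self.phys = [0] * n_physical
        self.base = 0     # physical index currently named r32
        self.sol = 0      # size of locals (inputs + locals)
        self.sof = 0      # size of frame (locals + outputs)
        self.saved = []   # frame markers of the callers

    def alloc(self, n_in, n_local, n_out):
        """Explicitly size the current frame: allocate only what is needed."""
        self.sol = n_in + n_local
        self.sof = self.sol + n_out

    def _index(self, reg):
        if not 32 <= reg < 32 + self.sof:
            raise IndexError(f"r{reg} is outside the current frame")
        return (self.base + reg - 32) % len(self.phys)

    def read(self, reg):
        return self.phys[self._index(reg)]

    def write(self, reg, value):
        self.phys[self._index(reg)] = value

    def call(self):
        """Rename so that the caller's first output register becomes r32."""
        self.saved.append((self.base, self.sol, self.sof))
        self.base += self.sol
        self.sof -= self.sol
        self.sol = 0

    def ret(self):
        """Restore the caller's naming; no register contents are copied."""
        self.base, self.sol, self.sof = self.saved.pop()
```

A caller that allocates two locals and one output writes its argument to r34; after `call()` the callee reads the same physical register as r32, so no parameter copy takes place.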
Register Stack
Register Stack Engine (RSE)
• Automatically saves/restores stack registers without software intervention
  – Avoids explicit spill/fill (eliminates stack management overhead)
  – Provides the illusion of infinite physical registers
• RSE uses unused memory bandwidth (cycle stealing) to perform register spill and fill operations in the background
  – Overflow: alloc needs more registers than available
  – Underflow: return needs to restore a frame saved in memory
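The overflow and underflow cases can be illustrated with a deliberately coarse model. The assumptions are significant: this sketch spills and fills whole frames, and only on demand, whereas the slide describes spill and fill operations performed in the background. All names are hypothetical.

```python
class RseModel:
    """Toy register stack engine working at frame granularity."""

    def __init__(self, capacity=96):
        self.capacity = capacity
        self.resident = []        # frames held in physical registers, oldest first
        self.backing_store = []   # frames spilled to memory, oldest first
        self.spills = 0
        self.fills = 0

    def _used(self):
        return sum(len(frame) for frame in self.resident)

    def call(self, frame_size):
        """Enter a procedure that allocates frame_size stacked registers."""
        # Overflow: spill the oldest resident frames until the new one fits.
        while self.resident and self._used() + frame_size > self.capacity:
            self.backing_store.append(self.resident.pop(0))
            self.spills += 1
        self.resident.append([0] * frame_size)

    def ret(self):
        """Leave the current procedure and make the caller's frame current."""
        self.resident.pop()
        # Underflow: the caller's frame was spilled earlier, so fill it back.
        if not self.resident and self.backing_store:
            self.resident.append(self.backing_store.pop())
            self.fills += 1

    def current(self):
        return self.resident[-1]
```

From the program's point of view every `call` succeeds and every `ret` finds its frame intact, which is the "illusion of infinite physical registers"; only the spill and fill counters reveal that memory was involved.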
Software Pipelining Support
• High performance loops without code size overhead
  – No prologue and epilogue
• Rotating registers
  – Provide automatic renaming
• Rotating predicates (stage predicates)
  – Unify prologue, kernel, and epilogue
• Loop control registers (LC, EC)
• Loop branches
  – Counted loop (br.ctop)
  – While loop (br.wtop)
• Especially valuable for integer loops with small trip counts
Software Pipelining Example

Original loop:
  L1: ld4 r4 = [r5], 4      // Cycle 0
      add r7 = r4, r9       // Cycle 2
      st4 [r6] = r7, 4      // Cycle 3
      br.cloop L1;;

Software-pipelined loop:
  L1: (p16) ld4 r32 = [r5], 4    // Cycle 0
      (p18) add r35 = r34, r9    // Cycle 0
      (p19) st4 [r6] = r36, 4    // Cycle 0
            br.ctop L1           // Cycle 0

Resulting schedule (one row per pass through the loop body):
  Prolog   ld
           ld
           add ld
  Kernel   st add ld
           st add ld
  Epilog   st add
           st add
           st

Register names and predicate values per iteration:
  Iteration 1:  r32 r33 r34 r35 …    p16 p17 p18 p19 ..
                                      1   0   0   0  ..
  Iteration 2:  r33 r34 r35 r36 …    p17 p18 p19 ..  p16
                                      1   0   0  ..   1
  Iteration 3:  r34 r35 r36 r37 …    p18 p19 ..  p16 p17
                                      1   0  ..   1   1

What happens during runtime?
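One way to answer the runtime question is to simulate the pipelined loop. The sketch below is a model under stated assumptions, not the hardware definition: rotation moves the value named rN to the name rN+1 (and pN to pN+1), the counted-loop branch sets the new p16 to 1 while LC > 0 and to 0 while the pipeline drains under EC, the loop is entered with LC = n - 1, EC = 4 and only p16 set, and the whole r32.. range rotates. The function name is hypothetical.

```python
def pipelined_add(a, c):
    """Compute [x + c for x in a] with a simulated rotating-register loop."""
    n = len(a)
    if n == 0:
        return []
    r = [0] * 96          # r[0] models r32, r[1] models r33, ...
    p = [0] * 48          # p[0] models p16, p[1] models p17, ...
    p[0] = 1              # assumed setup: only the first stage is enabled
    lc, ec = n - 1, 4     # loop count and epilog count (4 stages: p16..p19)
    src, out = 0, []
    while True:
        if p[0]:                      # (p16) ld4 r32 = [r5], 4
            r[0] = a[src]
            src += 1
        if p[2]:                      # (p18) add r35 = r34, r9
            r[3] = r[2] + c
        if p[3]:                      # (p19) st4 [r6] = r36, 4
            out.append(r[4])
        # br.ctop L1 (assumed semantics, see lead-in)
        if lc > 0:
            lc -= 1
            new_p16, taken = 1, True   # kernel: start another iteration
        elif ec > 1:
            ec -= 1
            new_p16, taken = 0, True   # epilog: drain without new loads
        else:
            ec -= 1
            new_p16, taken = 0, False  # pipeline empty: fall through
        r = [r[-1]] + r[:-1]           # rotate: old r32 is now named r33
        p = [p[-1]] + p[:-1]           # rotate: old p16 is now named p17
        p[0] = new_p16
        if not taken:
            return out
```

For a three-element input the body runs six times (n + 3): the first passes only load, the middle ones load, add and store for different source iterations at once, and the last ones only drain. This is the prolog, kernel and epilog produced by a single copy of the loop body.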
IA-64 / Itanium architecture: a VLIW?
• Yes, but:
  – Instructions contain only one operation; the compiler can indicate that successive instructions can be executed in parallel
  – HW does the operation-to-FU binding
  – Pipeline latencies are not visible in the ISA
  – These measures make the ISA independent of the number of FUs and of the pipeline latencies, so the ISA supports multiple implementations
Montecito 2006: dual 11-issue cores
Tukwila 4 core Itanium, 2009
How further?
Burton Smith, Microsoft, 2005