Understanding The Nehalem Core
Note: The examples herein are mostly illustrative. They have shortcomings compared to the real implementation in favour of easier understanding.
April 29 2008
Agenda
•Pipelines
•Branch Prediction
•Superscalar execution
•Out-Of-Order execution
Understanding CPUs
•Optimizing performance is pretty much tied to understanding what a CPU actually does.
• VTune, Thread Profiler and other tools are merely helpers for understanding the operation of the CPU.
• The following gives some illustrative ideas of how the most important concepts of modern CPU design work. It does not necessarily correspond to the actual implementation, but rather outlines some basic ideas.
Architecture Block Diagram
Disclaimer: This block diagram is for example purposes only. Significant hardware blocks have been arranged or omitted for clarity.
[Block diagram: a Fetch/Decode front end with Next IP, Branch Target Buffer, 32 KB Instruction Cache, Instruction Decode (4 issue), Microcode Sequencer and Register Allocation Table (RAT); Reservation Stations (RS, 32 entries) feeding Scheduler/Dispatch Ports; an Execute stage with ports for FP Add, FP Div/Mul, Integer Arithmetic, Integer Shift/Rotate and SIMD Integer Arithmetic, plus Load, Store Address and Store Data ports; a Memory Order Buffer (MOB), 32 KB Data Cache and Bus Unit connecting to L2 Cache/Memory; a Retire stage with the Re-Order Buffer (ROB, 128 entries) and the IA Register Set.]
This is too complicated for the beginning! The best way to understand it is to construct the CPU yourself.
Our CPU Construction
[Diagram: a black box labelled CPU, with data flowing in and data flowing out.]
Let's assume nothing for the moment: the CPU is just a black box with data coming in and data going out.
Modern Processor Technology
In order to make use of modern CPUs, some concepts must be clear:
1. Pipelines
2. Prediction
3. Superscalarity
4. Out-Of-Order Execution
5. Vectorization
We will go through these concepts step by step and relate them to indicators of sub-optimal performance on the Nehalem architecture.
Pipelines - What a CPU really understands …
A CPU distinguishes between two concepts:
Instructions – what to do with the arguments
Arguments – usually registers that contain numbers or addresses
A register can be thought of as a piece of memory directly in the CPU. It is as big as the architecture is wide (64 bit). Specialized registers can be wider (SSE registers: 128 bit).
The ALU can only operate on registers!
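As a minimal sketch (the variable names are invented, and the assembly in the comment is only roughly what a compiler might emit), even a one-line C statement turns into load, ALU operation on registers, store:

    /* c = a * b at machine level: load -> ALU -> store (illustrative). */
    long a = 6, b = 7, c;

    void compute(void) {
        c = a * b;
        /* roughly:
           mov  a(%rip),%rax   ; load a into register %rax
           mov  b(%rip),%rdx   ; load b into register %rdx
           imul %rdx,%rax      ; the ALU multiplies the two registers
           mov  %rax,c(%rip)   ; store the result register to memory */
    }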
Pipelines - What a CPU really understands …
1001101011000111101101110010101000110100000100100111111001111100011010001110010011010111100101000110010001000101001100111111111110001010011001111110110100010100110011111011011000010111110110
4d 63 db        movslq %r11d,%r11
4a 8d 04 9f     lea    (%rdi,%r11,4),%rax
4f 8d 1c 9a     lea    (%r10,%r11,4),%r11
0f 28 c8        movaps %xmm0,%xmm1
45 33 ff        xor    %r15d,%r15d
45 33 f6        xor    %r14d,%r14d
45 33 ed        xor    %r13d,%r13d
85 f6           test   %esi,%esi
This is the output of "objdump -d a.out" for some binary a.out. Try it yourself: compile any small C file (e.g. gcc -O2 -o a.out test.c) and disassemble it with objdump -d a.out.
1. FETCH: get a binary number from memory
2. DECODE: translate the binary number into a meaningful instruction
Pipelines – Execution and Memory Access
So we can fetch the next number and translate it into a meaningful instruction. What now?
If the instruction is arithmetic, just execute it, e.g.
mul %r14d,%r15d (multiply register %r14d by register %r15d)
If it is a memory reference, load the data, e.g.
mov <memory address>,%r15d
Some memory references involve computation (arithmetic), e.g.
lea (%r10,%r11,4),%r11 (computes the address %r10 + 4 * %r11 into %r11)
3. EXECUTE: execute the instruction
4. MEMORY ACCESS: load and store data from/to memory; the address may be computed in step 3.
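Where such an lea comes from, as a hedged sketch (the function and names are invented): indexing an int array is exactly a base-plus-scaled-index address computation:

    /* base[i] needs the address base + 4*i (ints are 4 bytes wide). */
    int element(const int *base, long i) {
        return base[i];
        /* e.g.  lea (%rdi,%rsi,4),%rax  ; address = %rdi + 4 * %rsi
                 mov (%rax),%eax         ; load from that address
           (a compiler may fuse both into a single
            mov (%rdi,%rsi,4),%eax) */
    }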
Pipelines – Write back
• This is the final step of our primitive pipeline.
Once the result is computed or loaded, write it into the register specified, e.g.
mul %r14d,%r15d (the result is put into %r15d)
The need for this step is not immediately clear, but it will become important later on …
5. WRITE BACK: Save the result as specified
Our CPU construction – functional parts
1. Fetch a new instruction
2. Decode instruction
3. Execution
4. Memory access
5. Write register
[Diagram: the black box from before, now containing the five functional parts listed above, with data flowing in and out.]
… a CPU with 5 pipeline stages
[Diagram: instructions and data flow through five stages: Instruction Fetch, Instruction Decode, Execute, Memory Access, Write Back.]
Pipelining 1 – No Pipelining
[Diagram: without pipelining, instructions 1-5 pass one after another through the stages F D E M W (Fetch Instruction, Decode Instruction, Execute, Memory Access, Write Back); a new instruction starts only when the previous one has completely finished.]
Execution time = N_instructions * T_p
(T_p = time for one instruction to pass through all stages)
Processor Technology Pipelining 1 – No Pipelining
• The concept of a non-pipelined CPU is very inefficient!
• While, e.g., a comparison is performed in the ALU, the other functional units are running idle.
• With very little effort we can use all functional units at the same time and increase performance tremendously.
• Pipelining lets different instructions use different functional units at the same time.
• There are problems connected to pipelined operation. We will address them later.
Pipelining 2 - Pipelining
[Diagram: with pipelining, instructions 1-5 enter the stages F D E M W staggered by one stage; while instruction 1 is being decoded, instruction 2 is already being fetched, and so on.]
Execution time = N_instructions * T_s
(T_s = duration of a single stage, roughly T_p / 5)
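A small numeric sketch of the two formulas (the instruction count and timings are invented, and real pipelines also stall):

    #include <stdio.h>

    int main(void) {
        const long   n_instructions = 1000;
        const int    n_stages       = 5;     /* F D E M W */
        const double t_stage        = 1.0;   /* one cycle per stage */

        /* no pipelining: each instruction takes T_p = 5 * T_s */
        double t_serial    = n_instructions * n_stages * t_stage;
        /* pipelining: one instruction completes per cycle once the
           pipeline is full (n_stages - 1 cycles of initial fill)  */
        double t_pipelined = (n_instructions + n_stages - 1) * t_stage;

        printf("serial:    %.0f cycles\n", t_serial);
        printf("pipelined: %.0f cycles (speedup %.2fx)\n",
               t_pipelined, t_serial / t_pipelined);
        return 0;
    }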
Our Core2 Construction
[Diagram: instructions and data flow through Instruction Fetch, Instruction Decode, ALU, Memory Access and Write Registers.]
Pipelining 3 – Pipeline Stalls
[Diagram: instruction 2 stalls (R) in the pipeline F D E M W until the result of instruction 1 has been written back.]
Example code:
...
c = a / b
d = c / b
e = c / d
f = e / d
g = f / e
...
Pipeline stalls can have two causes: competition for resources and data dependencies. Here, every line needs the result of the previous one (see the sketch below).
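A hedged C sketch of this dependency problem (the array and function names are invented): the first loop serializes on acc exactly like the example code above; the second keeps two independent chains in flight, which the pipeline can overlap:

    /* One serial dependency chain ... */
    double chain(const double *a, int n) {
        double acc = 1.0;
        for (int i = 0; i < n; ++i)
            acc = acc / a[i];       /* each division waits for the last */
        return acc;
    }

    /* ... versus two independent chains (n assumed even for brevity). */
    double split(const double *a, int n) {
        double acc0 = 1.0, acc1 = 1.0;
        for (int i = 0; i < n; i += 2) {
            acc0 = acc0 / a[i];     /* independent of acc1 ... */
            acc1 = acc1 / a[i + 1]; /* ... so both can overlap */
        }
        return acc0 * acc1;         /* combine the partial results */
    }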
Pipelining 4 – Branches
[Diagram: a CMP instruction compares two values and writes the result into a flag F; the following JZ ("jump if flag F is 0") cannot be resolved until CMP has executed, so instructions 4 and 5 after the branch cannot be fetched in the meantime.]
Branches are a considerable obstacle for performance!
Processor Technology Avoiding Pipeline Stalls – Branch Prediction
Pipeline stalls due to branches can be avoided by introducing a functional unit that "guesses" the outcome of the next branch.
[State diagram: a 2-bit predictor with four states: 00 "no jump predicted", 01 "no jump predicted (weakly)", 10 "jump predicted (weakly)", 11 "jump predicted". Every taken jump moves the state towards 11, every not-taken jump towards 00, so a single outlier does not flip the prediction.]
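A minimal sketch of such a 2-bit saturating counter in C (a real predictor keeps one counter per branch in the Branch Target Buffer; this models a single branch):

    #include <stdbool.h>
    #include <stdio.h>

    static unsigned state = 0;             /* 0..3, start: no jump predicted */

    bool predict(void) { return state >= 2; }  /* states 10, 11 predict a jump */

    void update(bool taken) {
        if (taken  && state < 3) state++;  /* move towards 11, saturate */
        if (!taken && state > 0) state--;  /* move towards 00, saturate */
    }

    int main(void) {
        const bool outcomes[] = { true, true, false, true, true };
        for (int i = 0; i < 5; ++i) {
            printf("predicted %s, actually %s\n",
                   predict()   ? "jump" : "no jump",
                   outcomes[i] ? "jump" : "no jump");
            update(outcomes[i]);
        }
        return 0;
    }

Note how the single not-taken outcome does not flip the prediction: two wrong guesses in a row are needed.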
Pipelining 4 – Branch Prediction
[Diagram: as before, CMP compares two values and writes the result into flag F, and JZ jumps if flag F is 0; but a Branch Target Buffer now predicts the branch, so that instructions 4 and 5 can be fetched speculatively.]
Branch prediction units predict correctly with a probability of well over 90%. On a wrongly predicted branch, however, the pipeline needs to be flushed!
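A hedged illustration of why this matters (the classic sorted-versus-unsorted case; names invented): the branch below is almost free on sorted data, where the predictor is nearly always right, and expensive on random data, where roughly every second prediction flushes the pipeline:

    /* The branch outcome pattern, not the branch itself, sets the cost. */
    long sum_above(const int *a, int n, int threshold) {
        long sum = 0;
        for (int i = 0; i < n; ++i)
            if (a[i] >= threshold)   /* predictable iff the data is sorted */
                sum += a[i];
        return sum;
    }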
Our CPU Construction
[Diagram: Instruction Fetch, Decode, ALU, Memory Access and Write Registers as before, now with a Branch Target Buffer feeding the fetch stage.]
Superscalar Processors 1
• The superscalar concept is an extension of the pipeline concept
• Pipelining allows for "hiding" the latency of an instruction
• Pipelining ideally achieves 1 instruction per clock cycle
• Higher rates can only be achieved if more than one instruction can be executed at the same time (see the sketch below)
1. Superscalar architecture – Xeon
2. Other idea: Very Long Instruction Word (VLIW) – EPIC/Itanium
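As a hedged sketch (the function is invented), the loop below offers the kind of instruction mix a superscalar core can spread over different pipelines in the same cycle: the integer loop bookkeeping, the loads, and the floating-point arithmetic are independent of each other:

    /* Integer unit: i++ and the compare; load units: x[i] and y[i];
       FP units: multiply and add. All of these can issue in parallel. */
    double dot(const double *x, const double *y, int n) {
        double s = 0.0;
        for (int i = 0; i < n; ++i)
            s += x[i] * y[i];
        return s;
    }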
Superscalar Processors 2
[Diagram: one Fetch Instruction + Decode front end and a Dispatch stage feeding two parallel pipelines, each with its own ALU, Memory Access and Write Register stages (F D E M W).]
Execution time = N_instructions * T_s / N_pipelines
Problems of the pipeline remain!
Processor Technology Superscalar Processors 3
• The superscalar concept dramatically improves performance (2x at best with two pipelines)
• The different pipelines don't need to have the same functionality
• The superscalar concept can be extended easily
Our CPU Construction
[Diagram: Instruction Fetch + Decode and Dispatch feed three parallel units: two ALUs and one Memory Access unit, each with its own Read Registers and Write Registers stage; the Branch Target Buffer assists the front end.]
Processor Technology Out-Of-Order Execution
• Speculative execution provides an important way of reducing pipeline stalls in case of branches.
• Still, data dependencies cause pipeline stalls.
• There is no good reason why the CPU shouldn't execute instructions that have no dependencies, or whose dependencies have already been computed.
• Out-of-order execution minimizes pipeline stalls by reordering the instructions at entry to the pipeline and restoring the original order when exiting the pipeline (see the sketch below).
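A hedged sketch of what the reordering buys (the values and names are invented): in program order, line 2 would stall behind the load in line 1; an out-of-order core executes lines 3 and 4 in the meantime and retires all results in the original order:

    double ooo_demo(const double *p, double x, double y) {
        double a = p[100];    /* 1: load, possibly a cache miss    */
        double b = a * 2.0;   /* 2: depends on the load, must wait */
        double c = x + y;     /* 3: independent, can execute early */
        double d = x * y;     /* 4: independent, can execute early */
        return b + c + d;
    }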
Processor Technology Out-Of-Order Execution
[Diagram: pipeline with stages F R A M W; an instruction queue lets instructions execute as soon as their operands are ready, and the Reorder Buffer restores program order before write back.]
Our Construction
[Diagram: Instruction Fetch + Decode and Dispatch into two ALUs and a Memory Access unit (each with Read Registers), a Branch Target Buffer at the front end, and a Reorder Buffer followed by a Retire / Write Registers stage at the back end.]
Processor Technology Register Allocation
• IA32 provides a very limited number of registers for local storage.
• This limit exists only in the definition of the instruction set architecture (ISA).
• Internally, more registers are available.
• The CPU can internally re-assign (rename) the registers in order to make best use of the given resources (see the sketch below).
• This re-allocation is tracked in the Register Allocation Table.
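A hedged sketch of why renaming helps (the function is invented): every iteration below writes the same architectural register for tmp; with renaming, each iteration gets its own physical register, so successive iterations can overlap instead of serializing on that register name:

    long rename_demo(const int *p, int n) {
        long sum = 0;
        for (int i = 0; i < n; ++i) {
            int tmp = p[i] * 3;   /* same register name every iteration ... */
            sum += tmp;           /* ... mapped to fresh physical registers */
        }
        return sum;
    }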
Our CPU Construction
[Diagram: the complete construction: Instruction Fetch, Dispatch into two ALUs and a Memory Access unit with Read Registers, Branch Target Buffer, Register Allocation, Reorder Buffer, and Retire / Write Registers.]
Now let's compare to the original ...
Block Diagram
[Block diagram of the real core: Fetch/Decode (Next IP, Branch Target Buffer, 32 KB Instruction Cache, Instruction Decode (4 issue), Microcode Sequencer, Register Allocation Table (RAT)); Dispatch (Reservation Stations (RS, 32 entries), Scheduler/Dispatch Ports); Execute (ports for FP Add, FP Div/Mul, Integer Arithmetic, Integer Shift/Rotate, SIMD Integer Arithmetic); Memory (Load, Store Address and Store Data ports, Memory Order Buffer (MOB), 32 KB Data Cache, Bus Unit to L2 Cache); Retire (Write back) with the Re-Order Buffer (ROB, 128 entries) and the IA Register Set.]
Not very different, right?
Summary
Nehalem can be thought of very much in the same way as standard textbooks present a CPU:
1. Instruction Fetch and Decode (Frontend)
2. Dispatch
3. Execute
4. Memory Access
5. Retire (Write Back)
If you want to dive deeper into the subject: Hennessy and Patterson, Computer Architecture – A Quantitative Approach.