www.company.com
Introduction to Processor Architecture
Contents
• Introduction
• Processor architecture overview
• ISA (Instruction Set Architecture)
– RISC example (SMIPs)
– CISC example (Y86)
• Processor architecture
– Single-cycle processor example (SMIPs)
• Pipelining
– Control hazard
– Branch predictor
– Data hazard
• Cache memory
Introduction
Processors
• What is a processor?
• What are the differences among them?
Processor architecture and program
• If you understand the architecture, you have more opportunities to optimize your program.
• Let's look at some examples.
Example1
• Version 1:
for (i = 0; i < size; i++) {
    for (j = 0; j < size; j++) {
        sum += array[i][j];
    }
}
• Version 2:
for (j = 0; j < size; j++) {
    for (i = 0; i < size; i++) {
        sum += array[i][j];
    }
}
Keyword: Cache
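The two loop orders in Example1 compute the same sum but touch memory in a different order. Here is a minimal sketch, assuming the matrix is stored as one flat row-major array; the function names `sum_row_order` and `sum_col_order` are illustrative and not from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Version 1: the inner loop walks along a row, so consecutive
 * iterations read neighbouring addresses (stride of 1 element). */
long sum_row_order(const int *a, size_t n) {
    assert(a != NULL);
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            sum += a[i * n + j];
    return sum;
}

/* Version 2: the inner loop walks down a column, so consecutive
 * iterations read addresses that are n elements apart. */
long sum_col_order(const int *a, size_t n) {
    assert(a != NULL);
    long sum = 0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            sum += a[i * n + j];
    return sum;
}
```

Both functions return the same value. For large `n`, the row-order version is usually faster because each fetched cache line is fully used before moving on.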
Example2 (1/2)
• Version 1:
for (i = 0; i < size; i++) {
    if (i % 2 == 0) {
        action_even();
    } else {
        action_odd();
    }
}
Example2 (2/2)
• Version 2:
for (i = 0; i < size; i += 2) {
    action_even();
}
for (i = 1; i < size; i += 2) {
    action_odd();
}
Keyword: Branch predictor and pipeline
Processor architecture overview
Von Neumann Architecture
• Input -> process -> output model
• Integrated Instruction Memory and Data Memory
Basic components of x86 CPU
• Register file: %eax, %ecx, %edx, %ebx, %esp, %ebp, %esi, %edi
• Status registers: Zero flag, Sign flag, Overflow flag, Carry flag
• Program counter: %eip
• CPU pipeline: Fetch -> Decode -> Execution Units -> Commit
• Cache memory ($)
• Memory (external)
Register file
• What is a register?
A simple memory element (e.g., built from edge-triggered flip-flops)
Register file
• A collection of registers
– 8 general-purpose registers are visible to the programmer
• In fact, many more registers are hidden and used for other purposes.
ex) Intel's Haswell has 168 physical integer registers
Program counter
• Holds the address of the instruction that the processor should execute next.
• %eip is the name of the program counter register in x86.
• The name differs from one ISA (Instruction Set Architecture) to another.
Status registers
• Also a collection of registers
• Boolean registers that represent the processor's status
• Used to evaluate conditions
• Flags: Zero flag, Sign flag, Overflow flag, Carry flag
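As an illustration of how the four flags relate to an ALU result, the sketch below computes them for a 32-bit addition. The struct `Flags` and the function `add_flags` are hypothetical names. The flag definitions follow the usual x86 convention for `add`, which is an assumption because the slides do not spell them out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    bool zf;  /* zero flag: result is 0 */
    bool sf;  /* sign flag: top bit of the result */
    bool of;  /* overflow flag: signed overflow */
    bool cf;  /* carry flag: unsigned overflow */
} Flags;

/* Add a and b, store the 32-bit result, and return the flags. */
Flags add_flags(uint32_t a, uint32_t b, uint32_t *result) {
    assert(result != NULL);
    uint32_t r = a + b;                 /* wraps modulo 2^32 */
    Flags f;
    f.zf = (r == 0);
    f.sf = ((r >> 31) & 1u) != 0;
    f.cf = r < a;                       /* a carry out happened if the sum wrapped */
    /* Signed overflow: operands have the same sign, result has the other. */
    f.of = (((~(a ^ b)) & (a ^ r)) >> 31) != 0;
    *result = r;
    return f;
}
```

A conditional jump such as `je` then only needs to look at these Boolean registers (here, `zf`) instead of recomputing the comparison.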
Memory
• Main memory, usually DRAM
• In the Von Neumann architecture, instructions (code) and data reside in the same memory module
CPU pipeline
• Where the actual operations occur
• Stages: Fetch -> Decode -> Execution Units -> Commit
• Details will be explained later
Instruction Set Architecture
Instruction Set Architecture (ISA)
• How you actually talk to a processor
Instruction Set Architecture (ISA)
• Mapping between assembly code and machine code
– Which assembly instructions are included?
– How are assembly instructions represented as byte codes?
Instruction
• A command that makes the processor perform one or more specific tasks
– Ex1) mov 4(%eax), %esp (x86) -> load the data at address %eax + 4 into %esp
– Ex2) irmovl $256, %eax (Y86) -> store the value 256 into register %eax
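Representation of instructions
• Instructions are represented as byte codes
– pushl %ebx => 0xa03f
– irmovl $256, %eax => 0x30f000010000
• Byte layout (two hex digits per byte; x = unused register field, filled with 0xF)
– pushl: byte 0 = A 0, byte 1 = rA x
– popl: byte 0 = B 0, byte 1 = rA x
– irmovl: byte 0 = 3 0, byte 1 = x rB, bytes 2-5 = immediate value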
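The irmovl layout can be reproduced with a few lines of C. This is a sketch under two assumptions that are not stated on the slide: the register numbers follow the common Y86 convention (%eax = 0, %ecx = 1, %edx = 2, %ebx = 3), and the immediate is stored little-endian. The function name `encode_irmovl` is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed Y86 register numbering; 0xF marks an unused register field. */
enum { Y86_EAX = 0, Y86_ECX = 1, Y86_EDX = 2, Y86_EBX = 3, Y86_NONE = 0xF };

/* Encode `irmovl $imm, rB` into out[0..5] and return the length in bytes. */
size_t encode_irmovl(uint32_t imm, unsigned rB, uint8_t out[6]) {
    assert(rB < 8);
    out[0] = 0x30;                                  /* iCd = 3, iFun = 0 */
    out[1] = (uint8_t)((Y86_NONE << 4) | rB);       /* rA unused, then rB */
    for (int i = 0; i < 4; i++)
        out[2 + i] = (uint8_t)(imm >> (8 * i));     /* little-endian immediate */
    return 6;
}
```

Encoding `irmovl $256, %eax` this way yields the bytes 30 f0 00 01 00 00, which matches the byte code shown on the slide.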
CISC vs RISC
CISC(Y86) RISC(sMips)
CISC
• Basic idea: give programmers powerful instructions, so that fewer instructions are needed to complete a task
• One instruction can do several pieces of work
• A lot of instructions! (over 300 in x86)
• Many instructions can access memory
• Variable instruction length
RISC
• Basic idea: write complex programs using simple instructions
• Each instruction does only one task
• Small instruction set (about 100 in MIPS)
• Only load and store instructions can access memory
• Fixed instruction length
RISC example: SMIPs ISA
Instruction formats
• Only three formats, but the fields are used differently by different types of instructions
– R-type: opcode (6) | rs (5) | rt (5) | rd (5) | shamt (5) | func (6)
– I-type: opcode (6) | rs (5) | rt (5) | immediate (16)
– J-type: opcode (6) | target (26)
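Because every format is 32 bits wide with fixed field positions, decoding starts with plain shifts and masks. The sketch below extracts every field of all three formats at once; which ones are meaningful depends on the opcode. The struct `Fields` and the function `split_fields` are illustrative names, not part of SMIPs.

```c
#include <stdint.h>

typedef struct {
    unsigned opcode;    /* bits 31-26, all formats */
    unsigned rs;        /* bits 25-21, R- and I-type */
    unsigned rt;        /* bits 20-16, R- and I-type */
    unsigned rd;        /* bits 15-11, R-type */
    unsigned shamt;     /* bits 10-6,  R-type */
    unsigned func;      /* bits 5-0,   R-type */
    uint32_t imm16;     /* bits 15-0,  I-type */
    uint32_t target26;  /* bits 25-0,  J-type */
} Fields;

/* Slice a 32-bit instruction word into its candidate fields. */
Fields split_fields(uint32_t inst) {
    Fields f;
    f.opcode   = (inst >> 26) & 0x3Fu;
    f.rs       = (inst >> 21) & 0x1Fu;
    f.rt       = (inst >> 16) & 0x1Fu;
    f.rd       = (inst >> 11) & 0x1Fu;
    f.shamt    = (inst >> 6)  & 0x1Fu;
    f.func     = inst & 0x3Fu;
    f.imm16    = inst & 0xFFFFu;
    f.target26 = inst & 0x03FFFFFFu;
    return f;
}
```

For example, the MIPS word 0x00221820 (an R-type add) splits into opcode 0, rs 1, rt 2, rd 3, shamt 0 and func 0x20.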
Instruction formats
• Computational instructions
– Register-register: 0 (6) | rs (5) | rt (5) | rd (5) | 0 (5) | func (6); rd <- (rs) func (rt)
– Register-immediate: opcode (6) | rs (5) | rt (5) | immediate (16); rt <- (rs) op immediate
• Load/Store instructions
– opcode (6, bits 31-26) | rs (5, bits 25-21) | rt (5, bits 20-16) | displacement (16, bits 15-0)
– Addressing mode: (rs) + displacement
– rs is the base register; rt is the destination of a Load or the source for a Store
Control instructions
• Conditional (on GPR) PC-relative branch: BEQZ, BNEZ
– Format: opcode (6) | rs (5) | unused (5) | offset (16)
– target address = (offset in words) × 4 + (PC + 4)
– range: 128 KB
• Unconditional register-indirect jumps: JR, JALR
– Format: opcode (6) | rs (5) | unused (5) | unused (16)
• Unconditional absolute jumps: J, JAL
– Format: opcode (6) | target (26)
– target address = {PC<31:28>, target × 4}
– range: 256 MB
• Jump-&-link stores PC+4 into the link register (R31)
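The two target-address formulas can be written out directly. This is a sketch that assumes the 16-bit branch offset is a signed (two's complement) word count, as in standard MIPS. The function names `branch_target` and `jump_target` are illustrative.

```c
#include <stdint.h>

/* PC-relative branch: target = (offset in words) * 4 + (PC + 4). */
uint32_t branch_target(uint32_t pc, uint16_t offset16) {
    /* Sign-extend the 16-bit offset to 32 bits. */
    int32_t off = (offset16 & 0x8000u) ? (int32_t)offset16 - 0x10000
                                       : (int32_t)offset16;
    return pc + 4u + ((uint32_t)off << 2);
}

/* Absolute jump: target = {PC<31:28>, target26 * 4}. */
uint32_t jump_target(uint32_t pc, uint32_t target26) {
    return (pc & 0xF0000000u) | ((target26 & 0x03FFFFFFu) << 2);
}
```

A signed 16-bit word offset covers 2^15 words in either direction, which is where the 128 KB branch range comes from. The 26-bit target shifted left by 2 gives the 256 MB jump region.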
CISC example: Y86 ISA
Instruction formats
• 1 byte: iCd iFun
• 2 bytes: iCd iFun | rA rB
• 5 bytes: iCd iFun | Destination (4 bytes)
• 6 bytes: iCd iFun | rA rB | Immediate/Offset (4 bytes)
• iCd: instruction code; iFun: function code; rA, rB: register index
1 byte instructions – halt, nop
• halt (encoding 0 0): used as a sign of program termination
– Changes the processor state to halt (HLT)
• nop (encoding 1 0): no operation; used as a bubble
2 byte instruction – OPl
• OPl (encoding 6 Op | rA rB): performs one of 4 basic ALU operations: add, sub, and, xor
– R[rB] <- R[rB] Op R[rA]
– Condition flags are set depending on the result
5 byte instruction – call
• call dest (encoding 8 0 | Destination)
– R[esp] <- R[esp] - 4 (update the stack pointer; move the stack top)
– M[R[esp]] <- pc + 5 (store the return address at the stack top)
– pc <- Destination (jump to the destination address)
6 byte instructions – rmmovl, mrmovl
• rmmovl rA, Offset(rB) (encoding 4 0 | rA rB | Offset): store
– target address = R[rB] + Offset
– M[target address] <- R[rA]
• mrmovl Offset(rB), rA (encoding 5 0 | rA rB | Offset): load
– source address = R[rB] + Offset
– R[rA] <- M[source address]
Processor Architecture
Simple processor architecture
• Simplified version (a lot..)
• Large sequential logic, driven by a clock
• Memory: program code is loaded from it, data is stored to it
• Output: register values
Sequential design
• Stages: Fetch -> Decode -> Execution Units -> Commit
• Other components: Memory, Register File, %EIP
Fetch unit
• Components involved: Fetch, %EIP, Memory
1) Get the PC
2) Request the next instruction
3) Get the next instruction
4) Update the PC
5) Pass the next instruction (byte code) on
Decode unit (1/2)
• Input instruction fields: iCd, fCd, rA, rB, imm
1) Slice the input instruction into its fields
2) Fill the information structure for execution (decode combinational logic)
• Decoded instruction: Inst Type, Target Register A, Target Register B, Immediate value, Register value A, Register value B, … (depends on ISA)
Decode unit (2/2)
3) Read register values (register read, using the Register File)
• The decoded instruction keeps the same fields: Inst Type, Target Register A, Target Register B, Immediate value, Register value A, Register value B, … (depends on ISA)
Execution unit (1/2)
• Input: decoded instruction (Inst Type, Target Register A, Target Register B, Immediate value, Register value A, Register value B)
1) Select the inputs for the ALU
2) Perform the appropriate ALU operation
3) Using the ALU result, fill the information structure for memory & register update (execute combinational logic)
• Output: executed instruction (Inst Type, Memory Data, Register Data, Target Register, Memory Addr)
Execution unit (2/2)
4) Perform memory operations (Ld, St) (memory operation logic, using Memory)
5) Update the field (if a load instruction was executed)
• Output: executed instruction (updated), with the same fields: Inst Type, Memory Data, Register Data, Target Register, Memory Addr
Commit unit
• Input: executed instruction (Inst Type, Memory Data, Register Data, Target Register, Memory Addr)
• The register update logic writes to the Register File
Single-cycle processor example: SMIPs
Single-Cycle SMIPS
• Components: PC, Inst Memory, Decode, Register File, Execute, Data Memory, and a +4 adder for the next PC
• Register File: 2 read & 1 write ports
• Separate Instruction & Data memories
• SMIPs instructions are all 4 bytes long
Single-Cycle SMIPS
module mkProc(Proc);
  Reg#(Addr) pc <- mkRegU;
  RFile rf <- mkRFile;
  IMemory iMem <- mkIMemory;
  DMemory dMem <- mkDMemory;
  rule doProc;
    let inst = iMem.req(pc);
    let dInst = decode(inst);
    let rVal1 = rf.rd1(validRegValue(dInst.src1));
    let rVal2 = rf.rd2(validRegValue(dInst.src2));
    let eInst = exec(dInst, rVal1, rVal2, pc);
    if (eInst.iType == Ld)
      eInst.data <- dMem.req(MemReq{op: Ld, addr: eInst.addr, data: ?});
    else if (eInst.iType == St)
      let dummy <- dMem.req(MemReq{op: St, addr: eInst.addr, data: eInst.data});
    if (isValid(eInst.dst))
      rf.wr(validRegValue(eInst.dst), eInst.data);
    pc <= eInst.brTaken ? eInst.addr : pc + 4;
  endrule
endmodule
Single-Cycle SMIPS, step by step
• Declaration of components
  Reg#(Addr) pc <- mkRegU; RFile rf <- mkRFile; IMemory iMem <- mkIMemory; DMemory dMem <- mkDMemory;
• Inst Memory
  let inst = iMem.req(pc);
• Decode
  let dInst = decode(inst);
• Register File (read)
  let rVal1 = rf.rd1(validRegValue(dInst.src1));
  let rVal2 = rf.rd2(validRegValue(dInst.src2));
• Execute
  let eInst = exec(dInst, rVal1, rVal2, pc);
• Data Memory
  if (eInst.iType == Ld) eInst.data <- dMem.req(MemReq{op: Ld, addr: eInst.addr, data: ?});
  else if (eInst.iType == St) let dummy <- dMem.req(MemReq{op: St, addr: eInst.addr, data: eInst.data});
• Register File (write)
  if (isValid(eInst.dst)) rf.wr(validRegValue(eInst.dst), eInst.data);
• PC update
  pc <= eInst.brTaken ? eInst.addr : pc + 4;
Improve processor performance - Pipelining
Pipelining
• Apply the idea of a conveyor belt process
• Stages: Fetch -> Decode -> Execution Units -> Commit, with a FIFO or register between adjacent stages
• In this case, 4 instructions are executing at the same time
Pipelining
• Is the throughput the same?
– Sequential: 1 instruction / 1 cycle
– Pipelined: 4 instructions / 4 cycles
• No, the pipelined design allows a faster clock
• Because less work is done per cycle, we can use a shorter clock period
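The clock argument can be made concrete with a small calculation. This sketch assumes an idealized model in which the single-cycle clock period is the sum of all stage delays and the pipelined clock period is the slowest stage delay, ignoring the overhead of the pipeline registers. The function names and any delay values are hypothetical.

```c
#include <assert.h>

/* Single-cycle design: one instruction passes through every stage per cycle. */
double single_cycle_period_ns(const double *stage_ns, int n) {
    assert(n > 0);
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += stage_ns[i];
    return total;
}

/* Pipelined design: the clock only has to cover the slowest stage. */
double pipelined_period_ns(const double *stage_ns, int n) {
    assert(n > 0);
    double slowest = stage_ns[0];
    for (int i = 1; i < n; i++)
        if (stage_ns[i] > slowest)
            slowest = stage_ns[i];
    return slowest;
}
```

With made-up stage delays of 2, 1, 3 and 1 ns, the single-cycle period is 7 ns and the pipelined period is 3 ns. Both designs finish one instruction per cycle in steady state, but the pipelined cycle is shorter.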
Pipeline hazard - control hazard
Control hazard (1/5)
• Where will execution go after je B in this assembly code?
    mov %ecx, %ebx
    subl %eax, (%ebx)
    je B
A:  addl %ecx, %edi
B:  leave
    ret
• We don't know the condition before we run the code
Control hazard (2/5)
• Pipeline: Fetch -> Decode -> Execution Units -> Commit
• Execution flow so far: mov %ecx, %ebx; subl %eax, (%ebx); je B
• What's next? addl? leave?
Control hazard (3/5)
• Where will execution go after je B in this assembly code?
    mov %ecx, %ebx
    subl %eax, (%ebx)
    je B
A:  addl %ecx, %edi
    mov %edi, %eax
B:  leave
    ret
• Since we can't know the future, fetch the instruction at the next position
Control hazard (4/5)
• Execution flow in the pipeline: subl %eax, (%ebx); je B; addl %ecx, %edi
• What if the jump is taken?
Control hazard (5/5)
• Execution flow in the pipeline: je B; addl %ecx, %edi; mov %edi, %eax
• The wrong instructions were being executed
Control hazard - analysis
• We must discard some instructions when we mispredict the branch direction…
• The longer the pipeline, the more instructions must be discarded on a branch misprediction.
Epoch method
• Add an epoch register to the processor state
• The Execute stage changes the epoch whenever the pc prediction is wrong and sets the pc to the correct value
• The Fetch stage associates the current epoch with every instruction when it is fetched
• The epoch of the instruction is examined when it is ready to execute. If the processor epoch has changed, the instruction is thrown away
• Diagram labels: PC, iMem, pred, f2d, Epoch, Fetch, Execute, inst, targetPC
Epoch method - Summary
• Add a 'tag' to each instruction
• There are two tag machines: one in Fetch and one in Execute
• If a tag machine detects that something went wrong, it changes the tag it tests against
Epoch method example (SMIPs)
• Components: PC, Inst Memory, Decode, Register File, Execute, Data Memory, +4 adder
• An f2d FIFO carries instructions from Fetch to Execute, and a redirect FIFO carries redirections from Execute back to Fetch
• Fetch keeps fEpoch; Execute keeps eEpoch
• Execute sends information about the target pc to Fetch, which updates fEpoch and pc whenever it looks at the redirect PC FIFO
Epoch method example (SMIPs)
rule doFetch;
  let inst = iMem.req(pc);
  let ppc = nextAddr(pc);
  pc <= ppc;
  f2d.enq(Fetch2Decode{pc: pc, ppc: ppc, epoch: epoch, inst: inst});
endrule
rule doExecute;
  let x = f2d.first;
  let inpc = x.pc;
  let inEp = x.epoch;
  let ppc = x.ppc;
  let inst = x.inst;
  if (inEp == epoch) begin
    let dInst = decode(inst);
    ... register fetch ...;
    let eInst = exec(dInst, rVal1, rVal2, inpc, ppc);
    ... memory operation ...
    ... rf update ...
    if (eInst.mispredict) begin
      pc <= eInst.addr;
      epoch <= inEp + 1;
    end
  end
  f2d.deq;
endrule
Branch Prediction
Need to predict next PC
• We must fetch an instruction every cycle
• But as we saw in the control hazard part, we can't know exactly what the next instruction is
• So we must predict what the next instruction is
How to predict next PC?
• We can rely on history
– Record the history and make use of it
2-bit counter branch predictor
• States: Strongly taken, Weakly taken, Weakly not taken, Strongly not taken
2-bit counter branch predictor
• Assume 2 BP bits per instruction
• Use a saturating counter: move up one state on taken, down one state on ¬taken
– 1 1: Strongly taken
– 1 0: Weakly taken
– 0 1: Weakly ¬taken
– 0 0: Strongly ¬taken
• Direction prediction changes only after two successive bad predictions
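The saturating counter fits in a few lines of C. This is a minimal sketch using the state encoding from the slide (3 = strongly taken down to 0 = strongly not taken). The type and function names are illustrative.

```c
#include <stdbool.h>

typedef unsigned char Counter2;   /* only the values 0..3 are used */

/* Predict taken in the two upper states (1 0 and 1 1). */
bool counter_predict(Counter2 c) {
    return c >= 2;
}

/* Move one state towards the actual outcome, saturating at 0 and 3. */
Counter2 counter_update(Counter2 c, bool taken) {
    if (taken)
        return (Counter2)(c < 3 ? c + 1 : 3);
    return (Counter2)(c > 0 ? c - 1 : 0);
}
```

Starting from strongly taken (3), a single not-taken outcome moves the counter to weakly taken (2) and the prediction stays "taken". Only a second consecutive not-taken outcome flips the prediction.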
Branch History Table (BHT)
• k bits of the Fetch PC (above the two low-order bits, which are always 0 0) form the BHT index
• 2^k-entry BHT, 2 bits/entry, answers Taken/¬Taken?
• The instruction's opcode tells whether it is a branch (Branch?); its offset is added (+) to the PC to form the Target PC
• 4K-entry BHT, 2 bits/entry, ~80-90% correct direction predictions
• After decoding the instruction, if it turns out to be a branch, we can consult the BHT using the pc; if this prediction is different from the incoming predicted pc, we can redirect Fetch
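A table of such counters indexed by the PC might be sketched as follows. The 4K-entry size comes from the slide. The initial counter value (weakly not taken), the choice of PC bits 13..2 as the index, and all names are assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BHT_BITS 12                      /* k = 12 gives 4K entries */
#define BHT_SIZE (1u << BHT_BITS)

typedef struct { uint8_t counter[BHT_SIZE]; } Bht;

void bht_init(Bht *t) {
    assert(t != NULL);
    for (unsigned i = 0; i < BHT_SIZE; i++)
        t->counter[i] = 1;               /* assumed start: weakly not taken */
}

/* Skip the two low PC bits (always 0 for 4-byte instructions). */
static unsigned bht_index(uint32_t pc) {
    return (pc >> 2) & (BHT_SIZE - 1u);
}

bool bht_predict(const Bht *t, uint32_t pc) {
    return t->counter[bht_index(pc)] >= 2;
}

void bht_update(Bht *t, uint32_t pc, bool taken) {
    uint8_t *c = &t->counter[bht_index(pc)];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
}

/* Replay the outcomes of one branch and count correct predictions. */
int bht_run(Bht *t, uint32_t pc, const bool *outcomes, int n) {
    int correct = 0;
    for (int i = 0; i < n; i++) {
        if (bht_predict(t, pc) == outcomes[i])
            correct++;
        bht_update(t, pc, outcomes[i]);
    }
    return correct;
}
```

Under these assumptions, a loop branch that is taken nine times and then falls through is predicted correctly 8 times out of 10. A strictly alternating branch, like the even/odd test in Example2, is mispredicted every time because the counter keeps bouncing between the two weak states.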
Let's look at the program example again
for (i = 0; i < size; i++) {
    if (i % 2 == 0) {
        action_even();
    } else {
        action_odd();
    }
}
Pipeline hazard- data hazard
Data hazard by flow dependence
• Sometimes an instruction uses the result of an earlier instruction
I1: addl %eax, %ebx
I2: subl %ebx, %ecx
• I2 must wait until I1 updates the register file, so that I2 can see the result.
Dealing with data hazards
• We can wait until the desired value has been updated
– Stall method
• Or, we can send the value directly
– Bypass method
Pipeline stalling example
Data bypass example
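Both methods rely on the same check: does a younger instruction read a register that an older, still unfinished instruction writes? The sketch below shows that check and a bypass selection. The struct `Inst`, the use of -1 for "no register" and the function names are assumptions for illustration. Register numbers follow the x86 order %eax = 0, %ecx = 1, %edx = 2, %ebx = 3.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int src1, src2;   /* source register numbers, -1 if unused */
    int dst;          /* destination register number, -1 if none */
} Inst;

/* Flow (read-after-write) dependence: younger reads what older writes. */
bool raw_hazard(const Inst *older, const Inst *younger) {
    assert(older != NULL && younger != NULL);
    if (older->dst < 0)
        return false;
    return younger->src1 == older->dst || younger->src2 == older->dst;
}

/* Bypass method: take the older instruction's result directly when the
 * register file still holds a stale value for `reg`. */
int bypass_operand(int reg, int rf_value, const Inst *older, int older_result) {
    assert(older != NULL);
    return (older->dst >= 0 && older->dst == reg) ? older_result : rf_value;
}
```

With the stall method, the pipeline would hold I2 while `raw_hazard` is true. With the bypass method, `bypass_operand` supplies I1's result to I2 without waiting for the register file update.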
Improve processor performance - Cache Memory
Memory operations are the bottleneck!
• Memory transfer rate of DDR3 RAM
– Peak transfer rate: 6400 MB/s
• Assume that we have a single-core processor with a clock speed of 3.0 GHz, and that we process one word (32 bits) every cycle.
– Then we process approximately 12 GB/s
• 3.0 GHz × 32 bits (word size) / 8 bits per byte = 12 GB/s, about twice the DDR3 peak rate
Memory hierarchy
Locality of reference
• Temporal locality
– If a value is used, it is likely to be used again soon.
      mov $100, %ecx
      mov $Array, %ebx     // %ebx = &Array
      xorl %eax, %eax      // %eax = 0
Loop: mov (%ebx), %esi     // %esi = *(%ebx)
      addl %esi, %eax      // %eax += %esi
      addl $4, %ebx        // %ebx += 4
      subl $1, %ecx        // %ecx = %ecx - 1
      jne Loop             // repeat while %ecx != 0
Fin:  leave
      ret
Locality of reference
• Spatial locality
– If a value is used, nearby values are likely to be used
for (i = 0; i < size; i++) {
    for (j = 0; j < size; j++) {
        sum += array[i][j];
    }
}
Cache memory
• A buffer between processor and memory
– Often several levels of caches
• Small but fast
– Old values will be removed from cache to make space for new values
• Capitalizes on spatial locality and temporal locality
• Parameters vary by system – unknown to programmer
Cache memory
• Cache memories are small, fast SRAM-based memories managed automatically in hardware.
– Hold frequently accessed blocks of main memory
• CPU looks first for data in L1, then in L2, then in main memory.
• Typical bus structure:
– CPU chip: register file, ALU, L1 cache, bus interface
– L2 cache, connected through the cache bus
– I/O bridge, connected through the system bus
– Main memory, connected through the memory bus
Structure of cache memory
• Basically, an array of memory elements
• Each entry pairs an address tag with a data block (several data bytes), e.g. the entry with tag 100 holds a copy of main memory locations 100, 101, ...
• Example address tags: 100, 304, 6848, 416
• How many bits are needed for the tag? Enough to uniquely identify the block
Read behavior
• Search the cache tags to find a match for the processor-generated address
• Found in cache, a.k.a. HIT
– Return a copy of the data from the cache
• Not in cache, a.k.a. MISS
– Read a block of data from main memory – may require writing back a cache line
– Which line do we replace?
– Return the data to the processor and update the cache
Write behavior
• On a write hit
– Write-through: write to both the cache and the next level memory
– Write-back: write only to the cache and update the next level memory when the line is evicted
• On a write miss
– Allocate: because of multi-word lines we first fetch the line, and then update a word in it
– Not allocate: the word is modified in memory
Direct mapped cache
Cache line size
• A cache line usually holds more than one word (32 bits)
– Reduces the number of tags and the tag size needed to identify memory locations
– Exploits spatial locality
Types of misses
• Compulsory misses (cold start): a cold fact of life!
– First time data is referenced
– After running billions of instructions, they become insignificant
• Capacity misses
– Working set is larger than cache size
– Solution: increase cache size
• Conflict misses
– Usually multiple memory locations are mapped to the same cache location to simplify implementations
– Thus it is possible that the designated cache location is full while there are empty locations in the cache.
– Solution: set-associative caches
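Direct-mapped cache
• The req address is split into Tag (t bits), Index (k bits) and Offset (b bits); Tag and Index form the block number, Offset is the block offset
• The Index selects one of 2^k lines; each line holds a valid bit (V), a Tag and a Data Block
• HIT if the selected line is valid and its tag equals (=) the address tag; the Offset then selects the data word or byte
• What is a bad reference pattern? Strided accesses with stride = size of cache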
Addressing example
• Bitwise truncation of the req address 00000100001100 00000100001110 01 00
– Tag = 524
– Index = 270
– Block offset = 1
• Go to the 270th cache entry and compare the Tag
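The address split and the tag comparison can be simulated in a few lines. This is a sketch of a direct-mapped cache that tracks only tags and valid bits, not the data. The field widths (b = 4 offset bits, k = 8 index bits, so 256 lines of 16 bytes) and all names are assumptions chosen for the example; they are not the widths used on the slide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define OFFSET_BITS 4                        /* b: 16-byte blocks */
#define INDEX_BITS  8                        /* k: 2^8 = 256 lines */
#define NUM_LINES   (1u << INDEX_BITS)

typedef struct { bool valid; uint32_t tag; } Line;
typedef struct { Line line[NUM_LINES]; } Cache;   /* zero-initialize before use */

uint32_t addr_offset(uint32_t a) { return a & ((1u << OFFSET_BITS) - 1u); }
uint32_t addr_index(uint32_t a)  { return (a >> OFFSET_BITS) & (NUM_LINES - 1u); }
uint32_t addr_tag(uint32_t a)    { return a >> (OFFSET_BITS + INDEX_BITS); }

/* Look up addr; on a miss, install the block in its one possible line.
 * Returns true on a hit. */
bool cache_access(Cache *c, uint32_t addr) {
    assert(c != NULL);
    Line *l = &c->line[addr_index(addr)];
    bool hit = l->valid && l->tag == addr_tag(addr);
    if (!hit) {
        l->valid = true;
        l->tag = addr_tag(addr);
    }
    return hit;
}
```

With these widths the cache holds 4096 bytes, so addresses 0x1234 and 0x2234 (4096 bytes apart) share index 0x23 and keep evicting each other. This is the stride-equals-cache-size pattern that a direct-mapped cache handles badly.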
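Address selection
• Alternative: take the Index (k bits) from the high-order address bits and the Tag (t bits) from the bits below it, followed by the Offset (b bits)
• The lookup is otherwise the same: 2^k lines of V, Tag and Data Block; HIT if the tags are equal (=)
• Why might this be undesirable? Spatially local blocks conflict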
Reduce conflict misses
• Memory time = Hit time + Prob(miss) × Miss penalty
• Associativity: allow a block to go to any of several frames in the cache
– 2-way set associative: each block maps to either of the 2 cache frames of its set
– Fully associative: each block maps to any cache frame
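The memory-time formula is easy to evaluate for concrete numbers. The sketch below is a direct transcription; the function name and the sample values in the note after it are made up for illustration.

```c
#include <assert.h>

/* Memory time = Hit time + Prob(miss) * Miss penalty.
 * hit_time and miss_penalty share one unit (e.g. cycles);
 * miss_rate is a probability between 0 and 1. */
double avg_memory_time(double hit_time, double miss_rate, double miss_penalty) {
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}
```

For instance, with a 1-cycle hit time, a 25% miss rate and a 100-cycle miss penalty, the average access costs 26 cycles. Halving the miss rate, for example by removing conflict misses, brings it down to 13.5 cycles.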
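2-way Set-Associative cache
• The address is split into Tag (t bits), Index (k bits) and Block Offset (b bits)
• The Index selects one set; each of its 2 ways holds a valid bit (V), a Tag and a Data Block
• Both stored tags are compared (=) with the address tag; HIT if either matches, and the matching way supplies the data word or byte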
Replacement policy
• In order to bring in a new cache line, usually another cache line has to be thrown out. Which one?
– No choice in replacement if the cache is direct mapped
• Replacement policy for set-associative caches
– One that is not dirty, i.e., has not been modified
• In I-cache all lines are clean
• In D-cache if a dirty line has to be thrown out then it must be written back first
– Least recently used?
– Most recently used?
– Random?
• How much is performance affected by the choice? Difficult to know without quantitative measurements
Implementing LRU
• We need time stamps for all lines
• We also need to compare the time stamps!
– Logarithmic-scale overhead
Pseudo LRU example
• So use pseudo LRU instead of true LRU
• We'll use an 8-way set-associative cache
• Use three history bits
• If a line is referenced, record the following code in the history bits
– Line 0: 000
– Line 1: 001
– Line 2: 010
– Line 3: 011
…