Introduction to Processor Architecture

www.company.com

Contents

• Introduction
• Processor architecture overview
• ISA (Instruction Set Architecture)
– RISC example (SMIPS)
– CISC example (Y86)
• Processor architecture
– Single-cycle processor example (SMIPS)
• Pipelining
– Control hazard
– Branch predictor
– Data hazard
• Cache memory

Introduction

Processors

• What is a processor?

• What are the differences among them?

Processor architecture and program

• If you understand the architecture, you have more opportunities to optimize your program.

• Let's see some examples.

Example 1

• Version 1

    for (i = 0; i < size; i++) {
        for (j = 0; j < size; j++) {
            sum += array[i][j];
        }
    }

• Version 2

    for (j = 0; j < size; j++) {
        for (i = 0; i < size; i++) {
            sum += array[i][j];
        }
    }

Keyword: Cache
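
The two loop orders of Example 1 can be sketched as a pair of C functions. This is a minimal sketch, assuming a square matrix stored row by row in one flat array; the function names are illustrative only. Both functions return the same sum, and only the order of memory accesses differs.

```c
#include <assert.h>
#include <stddef.h>

/* Version 1: the inner loop walks consecutive addresses. */
long sum_row_major(const int *a, size_t n) {
    assert(a != NULL);
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            sum += a[i * n + j];   /* stride: sizeof(int) bytes */
    return sum;
}

/* Version 2: the inner loop jumps a whole row per step. */
long sum_col_major(const int *a, size_t n) {
    assert(a != NULL);
    long sum = 0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            sum += a[i * n + j];   /* stride: n * sizeof(int) bytes */
    return sum;
}
```

On most machines the first version tends to be faster for large n, because consecutive accesses fall into the same cache line; the actual difference depends on the cache parameters of the system.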

Example 2 (1/2)

• Version 1

    for (i = 0; i < size; i++) {
        if (i % 2 == 0) {
            action_even();
        } else {
            action_odd();
        }
    }

Example 2 (2/2)

• Version 2

    for (i = 0; i < size; i += 2) {
        action_even();
    }

    for (i = 1; i < size; i += 2) {
        action_odd();
    }

Keyword: Branch predictor and pipeline
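
The two versions of Example 2 can be compared with the sketch below. It assumes that action_even() and action_odd() are independent of each other, so that running all even actions first is acceptable; counters stand in for the two actions, and all names are illustrative.

```c
#include <assert.h>

typedef struct { int even; int odd; } Counts;

/* Version 1: one loop with a data-dependent branch in every iteration. */
Counts run_branchy(int size) {
    assert(size >= 0);
    Counts c = {0, 0};
    for (int i = 0; i < size; i++) {
        if (i % 2 == 0) c.even++;   /* stands in for action_even() */
        else            c.odd++;    /* stands in for action_odd()  */
    }
    return c;
}

/* Version 2: two loops; only the loop-closing branches remain. */
Counts run_split(int size) {
    assert(size >= 0);
    Counts c = {0, 0};
    for (int i = 0; i < size; i += 2) c.even++;
    for (int i = 1; i < size; i += 2) c.odd++;
    return c;
}
```

Both versions perform the same number of even and odd actions, but the second one changes their order, so the rewrite is only valid when that order does not matter.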

Processor architecture overview

Von Neumann Architecture

• Input -> process -> output model

• Integrated Instruction Memory and Data Memory

Basic components of an x86 CPU

[Figure: block diagram of an x86 CPU and its external memory]

• Register file: %eax, %ecx, %edx, %ebx, %esp, %ebp, %esi, %edi
• Status registers: zero flag, sign flag, overflow flag, carry flag
• Program counter: %eip
• CPU pipeline: Fetch, Decode, Execution Units, Commit
• Cache memory ($)
• Memory (external)

Register file

• What is a register?

A simple memory element (e.g., a group of edge-triggered flip-flops)

Register file

• A collection of registers
– 8 registers are visible

• In fact, many more registers are hidden and used for other purposes.

ex) There are 168 registers in Intel's Haswell

Program counter

• Holds the address of the instruction that the processor should execute in the next cycle.

• %eip is the name of the program counter register in x86.

• The naming convention differs by ISA (Instruction Set Architecture).

Status registers

• Also a collection of registers
• Boolean registers that represent the processor's status
• Used to evaluate conditions
• Flags: zero flag, sign flag, overflow flag, carry flag

Memory

• Main memory, usually DRAM
• In the Von Neumann architecture, instructions (code) and data are in the same memory module

CPU pipeline

• Where the actual operations occur

• Details will be explained later

• Stages: Fetch, Decode, Execution Units, Commit

Instruction Set Architecture

Instruction Set Architecture (ISA)

• How you actually talk to a processor

Instruction Set Architecture (ISA)

• Mapping between assembly code and machine code
– Which assembly instructions are included?
– How are assembly instructions represented in byte code?

Instruction

• A command that makes the processor perform a specific task (or tasks)
– Ex1) mov 4(%eax), %esp (x86) -> load the data at address %eax + 4 into %esp
– Ex2) irmovl $256, %eax (Y86) -> store the value 256 in register %eax

Representation of instructions

• Instructions are represented in byte code
– pushl %ebx => 0xa03f
– irmovl $256, %eax => 0x30f000010000

Instruction     Byte 0    Byte 1    Bytes 2-5
pushl rA        A 0       rA x
popl rA         B 0       rA x
irmovl V, rB    3 0       x rB      Immediate value

(x: unused field)
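
The irmovl encoding above can be reproduced with a short C sketch. It assumes the usual Y86 register numbering (%eax = 0 through %edi = 7), an unused register field filled with F as in the 0x30f000010000 example, and a little-endian immediate; the enum and function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed Y86 register numbering. */
enum { Y86_EAX, Y86_ECX, Y86_EDX, Y86_EBX, Y86_ESP, Y86_EBP, Y86_ESI, Y86_EDI };

/* Encode "irmovl $value, rB" into the 6 bytes out[0..5]. */
void encode_irmovl(uint32_t value, unsigned rB, uint8_t out[6]) {
    assert(rB <= Y86_EDI);
    out[0] = 0x30;                      /* iCd = 3, iFun = 0 */
    out[1] = (uint8_t)(0xF0 | rB);      /* unused field = F, then rB */
    for (int i = 0; i < 4; i++)         /* immediate, least significant byte first */
        out[2 + i] = (uint8_t)(value >> (8 * i));
}
```

Calling encode_irmovl(256, Y86_EAX, buf) fills buf with the bytes 30 f0 00 01 00 00, which matches the irmovl example on the slide.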

CISC vs RISC

CISC (Y86) vs. RISC (SMIPS)

CISC

• Basic idea: give programmers powerful instructions, so that fewer instructions are needed to complete a task

• One instruction can do multiple things
• A lot of instructions! (over 300 in x86)
• Many instructions can access memory
• Variable instruction length

RISC

• Basic idea: write complex programs using simple instructions

• Each instruction does only one task
• Small instruction set (about 100 in MIPS)
• Only load and store instructions can access memory
• Fixed instruction length

RISC example: SMIPS ISA

Instruction formats

• Only three formats, but the fields are used differently by different types of instructions

Format    Fields (width in bits)
I-type    opcode (6) | rs (5) | rt (5) | immediate (16)
R-type    opcode (6) | rs (5) | rt (5) | rd (5) | shamt (5) | func (6)
J-type    opcode (6) | target (26)

Instruction formats

• Computational instructions
– opcode (6) | rs (5) | rt (5) | immediate (16): rt <- (rs) op immediate
– 0 (6) | rs (5) | rt (5) | rd (5) | 0 (5) | func (6): rd <- (rs) func (rt)

• Load/Store instructions
– opcode (6) | rs (5) | rt (5) | displacement (16), in bits 31-26 | 25-21 | 20-16 | 15-0
– Addressing mode: (rs) + displacement
– rs is the base register
– rt is the destination of a load or the source for a store

Control instructions

• Conditional (on GPR) PC-relative branch: BEQZ, BNEZ
– Format: opcode (6) | rs (5) | unused (5) | offset (16)
– target address = (offset in words) × 4 + (PC + 4)
– range: 128 KB

• Unconditional register-indirect jumps: JR, JALR
– Format: opcode (6) | rs (5) | unused (5) | unused (16)

• Unconditional absolute jumps: J, JAL
– Format: opcode (6) | target (26)
– target address = {PC<31:28>, target × 4}
– range: 256 MB

• Jump-and-link stores PC + 4 into the link register (R31)
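
The two target-address formulas for SMIPS control instructions can be written out in C. This is a sketch under the slide's definitions: a 16-bit signed word offset for branches and a 26-bit word target for absolute jumps; the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Conditional branch: target = (offset in words) * 4 + (PC + 4). */
uint32_t branch_target(uint32_t pc, int16_t offset_words) {
    assert((pc & 3u) == 0);                        /* instructions are word-aligned */
    uint32_t byte_offset = (uint32_t)(int32_t)offset_words << 2;
    return pc + 4u + byte_offset;                  /* wraps modulo 2^32 */
}

/* Absolute jump: target = {PC<31:28>, target * 4}. */
uint32_t jump_target(uint32_t pc, uint32_t target26) {
    assert(target26 < (1u << 26));
    return (pc & 0xF0000000u) | (target26 << 2);
}
```

A 16-bit signed offset scaled by 4 covers ±128 KB around PC + 4, and a 26-bit target scaled by 4 covers a 256 MB region, which is where the two ranges on the slide come from.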

CISC example: Y86 ISA

Instruction formats

Length     Byte 0      Byte 1    Bytes 1-4 or 2-5
1 byte     iCd iFun
2 bytes    iCd iFun    rA rB
5 bytes    iCd iFun              Destination (bytes 1-4)
6 bytes    iCd iFun    rA rB     Immediate/Offset (bytes 2-5)

iCd: instruction code, iFun: function code, rA, rB: register index

1-byte instructions – halt, nop

• halt: encoded as 0 0
– Used as a sign of program termination
– Changes the processor state to halt (HLT)

• nop: encoded as 1 0
– No operation. Used as a bubble.

2-byte instruction – OPl

• OPl rA, rB: encoded as 6 Op | rA rB
– Performs one of 4 basic ALU operations: add, sub, and, xor
– R[rB] <- R[rB] Op R[rA]
– Condition flags are set depending on the result

5-byte instruction – call

• call Dest: encoded as 8 0 | Destination (bytes 1-4)
– R[esp] <- R[esp] - 4 (update the stack pointer; move the stack top)
– M[R[esp]] <- pc + 5 (store the return address at the stack top)
– pc <- Destination (jump to the destination address)

6-byte instructions – rmmovl, mrmovl

• rmmovl rA, Offset(rB): store, encoded as 4 0 | rA rB | Offset (bytes 2-5)
– target address = R[rB] + Offset
– M[target address] <- R[rA]

• mrmovl Offset(rB), rA: load, encoded as 5 0 | rA rB | Offset (bytes 2-5)
– source address = R[rB] + Offset
– R[rA] <- M[source address]
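
The register-transfer descriptions of call, rmmovl, and mrmovl can be sketched as operations on a toy machine state. The Machine struct, the 1 KB memory size, the register numbering, and the little-endian word layout are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_SIZE 1024u
enum { EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI, NUM_REGS };

typedef struct {
    uint32_t reg[NUM_REGS];
    uint32_t pc;
    uint8_t  mem[MEM_SIZE];
} Machine;

/* Write a 32-bit word to memory, least significant byte first. */
void store32(Machine *m, uint32_t addr, uint32_t value) {
    assert(addr <= MEM_SIZE - 4u);
    for (int i = 0; i < 4; i++)
        m->mem[addr + i] = (uint8_t)(value >> (8 * i));
}

/* Read a 32-bit word from memory, least significant byte first. */
uint32_t load32(const Machine *m, uint32_t addr) {
    assert(addr <= MEM_SIZE - 4u);
    uint32_t value = 0;
    for (int i = 0; i < 4; i++)
        value |= (uint32_t)m->mem[addr + i] << (8 * i);
    return value;
}

/* call Dest: push the return address (pc + 5), then jump. */
void do_call(Machine *m, uint32_t dest) {
    m->reg[ESP] -= 4;
    store32(m, m->reg[ESP], m->pc + 5);
    m->pc = dest;
}

/* rmmovl rA, Offset(rB): M[R[rB] + Offset] <- R[rA] */
void do_rmmovl(Machine *m, int rA, int rB, uint32_t offset) {
    store32(m, m->reg[rB] + offset, m->reg[rA]);
}

/* mrmovl Offset(rB), rA: R[rA] <- M[R[rB] + Offset] */
void do_mrmovl(Machine *m, int rA, int rB, uint32_t offset) {
    m->reg[rA] = load32(m, m->reg[rB] + offset);
}
```

The return address is pc + 5 because call is a 5-byte instruction, so the next instruction starts 5 bytes after it.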

Processor Architecture

Simple processor architecture

Simplified version (a lot...)

[Figure: a large block of sequential logic driven by a clock; it loads program code from memory, stores data to memory, and outputs register values]

Sequential design


[Figure: sequential design connected to Memory, the Register File, and %EIP]

Fetch unit

[Figure: the Fetch unit connected to %EIP and Memory]

1) Get the PC
2) Request the next instruction
3) Get the next instruction
4) Update the PC
5) Pass on the next instruction (byte code)

Decode unit (1/2)

1) Split (truncate) the input instruction into its fields: iCd, fCd, rA, rB, imm

2) Fill the information structure for execution (Decode combinational logic)
– Inst type
– Target register A
– Target register B
– Immediate value
– Register value A
– Register value B
– … (depends on ISA)

Decode unit (2/2)

3) Read register values (RegisterRead)
– Input: the decoded instruction and the Register File
– Output: the same structure (inst type, target register A, target register B, immediate value, register value A, register value B, …, depending on ISA) with the register values filled in

Execution unit (1/2)

• Input: inst type, target register A, target register B, immediate value, register value A, register value B

1) Select the inputs for the ALU
2) Perform the appropriate ALU operation
3) Using the ALU result, fill the information structure for the memory and register update (Execute combinational logic)

• Output (executed instruction): inst type, memory data, register data, target register, memory addr

Execution unit (2/2)

4) Perform memory operations (Ld, St) (memory operation logic, connected to Memory)
5) Update the field (if a load instruction was executed)

• Output (updated executed instruction): inst type, memory data, register data, target register, memory addr

Commit unit

[Figure: the register update logic takes the executed instruction (inst type, memory data, register data, target register, memory addr) and writes to the Register File]

Single-cycle processor example: SMIPS

Single-Cycle SMIPS

[Figure: single-cycle datapath with PC (+4), Inst Memory, Decode, Register File, Execute, and Data Memory]

• Register file with 2 read ports and 1 write port
• Separate instruction and data memories
• SMIPS instructions are all 4 bytes long

Single-Cycle SMIPS

    module mkProc(Proc);
      Reg#(Addr) pc   <- mkRegU;
      RFile      rf   <- mkRFile;
      IMemory    iMem <- mkIMemory;
      DMemory    dMem <- mkDMemory;

      rule doProc;
        let inst  = iMem.req(pc);
        let dInst = decode(inst);
        let rVal1 = rf.rd1(validRegValue(dInst.src1));
        let rVal2 = rf.rd2(validRegValue(dInst.src2));
        let eInst = exec(dInst, rVal1, rVal2, pc);

        if (eInst.iType == Ld)
          eInst.data <- dMem.req(MemReq{op: Ld, addr: eInst.addr, data: ?});
        else if (eInst.iType == St)
          let dummy <- dMem.req(MemReq{op: St, addr: eInst.addr, data: eInst.data});
        if (isValid(eInst.dst))
          rf.wr(validRegValue(eInst.dst), eInst.data);
        pc <= eInst.brTaken ? eInst.addr : pc + 4;
      endrule
    endmodule

Single-Cycle SMIPS: step by step

[Figure: the single-cycle datapath (PC with +4, Inst Memory, Decode, Register File, Execute, Data Memory), shown once per step]

• Declaration of components

    module mkProc(Proc);
      Reg#(Addr) pc   <- mkRegU;
      RFile      rf   <- mkRFile;
      IMemory    iMem <- mkIMemory;
      DMemory    dMem <- mkDMemory;

• Instruction fetch (PC, Inst Memory)

    rule doProc;
      let inst = iMem.req(pc);

• Decode

    let dInst = decode(inst);

• Register read (Register File)

    let rVal1 = rf.rd1(validRegValue(dInst.src1));
    let rVal2 = rf.rd2(validRegValue(dInst.src2));

• Execute

    let eInst = exec(dInst, rVal1, rVal2, pc);

• Memory access (Data Memory)

    if (eInst.iType == Ld)
      eInst.data <- dMem.req(MemReq{op: Ld, addr: eInst.addr, data: ?});
    else if (eInst.iType == St)
      let dummy <- dMem.req(MemReq{op: St, addr: eInst.addr, data: eInst.data});

• Register write (Register File)

    if (isValid(eInst.dst))
      rf.wr(validRegValue(eInst.dst), eInst.data);

• PC update (PC, +4)

    pc <= eInst.brTaken ? eInst.addr : pc + 4;

Improving processor performance: Pipelining

Pipelining

• Introduces the idea of a conveyor-belt process

[Figure: the sequential design split into stages by FIFOs or registers, connected to Memory, the Register File, and %EIP]

• In this case, 4 instructions are executing at the same time

• Is the throughput the same?
– Sequential: 1 instruction / 1 cycle
– Pipelined: 4 instructions / 4 cycles

• No, a pipelined design allows a faster clock

• Because the amount of work per cycle decreases, we can use a shorter clock period
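
The throughput argument can be made concrete with a small timing model. This is a sketch under idealized assumptions: the logic splits into perfectly balanced stages, pipeline registers add no delay, and no hazards occur. The function names and the picosecond unit are illustrative.

```c
#include <assert.h>

/* Sequential design: every instruction takes the full logic delay. */
unsigned long sequential_time_ps(unsigned long n, unsigned long logic_ps) {
    return n * logic_ps;
}

/* Pipelined design: the clock period is logic_ps / stages, and the first
   instruction needs `stages` cycles before one completes per cycle. */
unsigned long pipelined_time_ps(unsigned long n, unsigned long stages,
                                unsigned long logic_ps) {
    assert(stages > 0 && logic_ps % stages == 0);
    if (n == 0) return 0;
    unsigned long cycle_ps = logic_ps / stages;
    return (n + stages - 1) * cycle_ps;
}
```

With a hypothetical 400 ps of logic and 4 stages, 1000 instructions take 400000 ps sequentially and 100300 ps pipelined, so the speedup approaches the number of stages as the instruction count grows.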

Pipeline hazard: control hazard

Control hazard (1/5)

• Which path will this assembly code take?

        mov %ecx, %ebx
        subl %eax, (%ebx)
        je B
    A:  addl %ecx, %edi
    B:  leave
        ret

• We don't know the condition before running the code



[Figure: pipeline connected to Memory, the Register File, and %EIP; in execution order it holds mov %ecx, %ebx; subl %eax, (%ebx); je B]

• What's next? addl? leave?


• Which path will this assembly code take?

        mov %ecx, %ebx
        subl %eax, (%ebx)
        je B
    A:  addl %ecx, %edi
        mov %edi, %eax
    B:  leave
        ret

• As we can't know the future, fetch the instruction at the next position



[Figure: pipeline connected to Memory, the Register File, and %EIP; in execution order it holds subl %eax, (%ebx); je B; addl %ecx, %edi]

• What if the jump is taken?



[Figure: pipeline connected to Memory, the Register File, and %EIP; in execution order it holds je B; addl %ecx, %edi; mov %edi, %eax]

• The wrong instructions were being executed

Control hazard - analysis

• We must discard some instructions when we mispredict the branch direction

• The longer the pipeline, the more instructions must be discarded on a branch misprediction

Epoch method

• Add an epoch register to the processor state
• The Execute stage changes the epoch whenever the pc prediction is wrong and sets the pc to the correct value
• The Fetch stage associates the current epoch with every instruction when it is fetched
• The epoch of the instruction is examined when it is ready to execute. If the processor epoch has changed, the instruction is thrown away

[Figure: Fetch (PC, iMem, pred) sends inst through f2d to Execute; Execute sends targetPC back to Fetch; both stages share the Epoch register]

Epoch method - Summary

• Add a 'tag' to each instruction

• There are two tag machines: one in Fetch and one in Execute

• If the tag machines recognize that something went wrong, they change their testing tag

Epoch method example (SMIPS)

[Figure: pipelined SMIPS datapath with PC (+4), Inst Memory, the f2d FIFO, Decode, Register File, Execute, Data Memory, and a redirect FIFO from Execute back to Fetch; fEpoch is kept in Fetch and eEpoch in Execute]

• Execute sends information about the target pc to Fetch, which updates fEpoch and pc whenever it looks at the redirect PC FIFO

Epoch method example (SMIPS)

    rule doFetch;
      let inst = iMem.req(pc);
      let ppc  = nextAddr(pc);
      pc <= ppc;
      f2d.enq(Fetch2Decode{pc: pc, ppc: ppc, epoch: epoch, inst: inst});
    endrule

    rule doExecute;
      let x    = f2d.first;
      let inpc = x.pc;
      let inEp = x.epoch;
      let ppc  = x.ppc;
      let inst = x.inst;
      if (inEp == epoch) begin
        let dInst = decode(inst);
        ... register fetch ...;
        let eInst = exec(dInst, rVal1, rVal2, inpc, ppc);
        ... memory operation ...
        ... rf update ...
        if (eInst.mispredict) begin
          pc <= eInst.addr;
          epoch <= inEp + 1;
        end
      end
      f2d.deq;
    endrule
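
The filtering done by the Execute stage in the epoch method can be sketched in C. This is a simplified software model, not the hardware design: the fetched instructions are given as an array already tagged with the epoch under which they were fetched, and the Fetched struct and function name are made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    unsigned pc;          /* address the instruction was fetched from */
    unsigned epoch;       /* epoch attached by Fetch */
    bool     mispredict;  /* true if Execute finds the next-pc prediction wrong */
} Fetched;

/* Returns how many instructions are actually executed; instructions whose
   epoch does not match the current one are thrown away. */
size_t execute_stream(const Fetched *q, size_t n, unsigned *epoch) {
    assert(epoch != NULL);
    size_t executed = 0;
    for (size_t i = 0; i < n; i++) {
        if (q[i].epoch != *epoch)
            continue;                 /* stale: fetched down the wrong path */
        executed++;
        if (q[i].mispredict)
            (*epoch)++;               /* later old-epoch instructions are now stale */
    }
    return executed;
}
```

After a misprediction, everything still tagged with the old epoch is dropped without any explicit flush signal, and execution resumes with the first instruction fetched under the new epoch.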

Branch Prediction

Need to predict next PC

• We must fetch an instruction every cycle

• But as we saw in the control hazard part, we can't know exactly which instruction comes next

• So we must predict the next instruction

How to predict next PC?

• We can rely on history
– Record the history and make use of it

2-bit counter branch predictor

[Figure: state diagram with four states: strongly taken, weakly taken, weakly not taken, strongly not taken]

2-bit counter branch predictor

• Assume 2 BP bits per instruction
• Use a saturating counter: move up on taken, down on ¬taken

Counter    State
1 1        Strongly taken
1 0        Weakly taken
0 1        Weakly ¬taken
0 0        Strongly ¬taken

• The direction prediction changes only after two successive bad predictions
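
The saturating counter can be sketched in a few lines of C. The state encoding follows the table (0 = strongly ¬taken up to 3 = strongly taken); the function names and the helper that counts correct predictions are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

/* Predict taken in the two "taken" states (2 and 3). */
bool predict_taken(unsigned counter) {
    assert(counter <= 3);
    return counter >= 2;
}

/* Saturating update: up on taken, down on not taken. */
unsigned update_counter(unsigned counter, bool taken) {
    assert(counter <= 3);
    if (taken)
        return counter < 3 ? counter + 1 : 3;
    return counter > 0 ? counter - 1 : 0;
}

/* Count correct predictions for one branch over a sequence of outcomes. */
int count_correct(const bool *outcomes, int n, unsigned counter) {
    int correct = 0;
    for (int i = 0; i < n; i++) {
        if (predict_taken(counter) == outcomes[i])
            correct++;
        counter = update_counter(counter, outcomes[i]);
    }
    return correct;
}
```

For a loop-like pattern (taken, taken, taken, not taken, repeated twice) starting from strongly taken, the predictor is right 6 times out of 8. For a strictly alternating pattern it is right only 4 times out of 8 when starting from strongly ¬taken, and 0 times out of 8 when starting from weakly ¬taken.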

Branch History Table (BHT)

• 4K-entry BHT, 2 bits/entry, ~80-90% correct direction predictions

[Figure: k bits of the fetch PC (above the two low-order 0 bits) form the BHT index into a 2^k-entry BHT with 2 bits per entry, which gives Taken/¬Taken; the opcode of the instruction tells whether it is a branch, and its offset is added to the PC from Fetch to form the target PC]

• After decoding the instruction, if it turns out to be a branch, we can consult the BHT using the pc; if this prediction is different from the incoming predicted pc, we can redirect Fetch
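
Indexing the BHT with the pc can be sketched as follows. The sketch assumes k = 12 (the 4K entries quoted on the slide), word-aligned instructions whose two low pc bits are always 0, and one 2-bit counter stored per byte; the type and function names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BHT_INDEX_BITS 12u                      /* k */
#define BHT_ENTRIES    (1u << BHT_INDEX_BITS)   /* 2^k = 4096 entries */

typedef struct {
    uint8_t counter[BHT_ENTRIES];               /* each holds a 2-bit value, 0..3 */
} Bht;

/* Drop the two always-zero low bits, then keep k bits as the index. */
uint32_t bht_index(uint32_t pc) {
    assert((pc & 3u) == 0);
    return (pc >> 2) & (BHT_ENTRIES - 1u);
}

/* Direction prediction for the branch at pc. */
bool bht_predict_taken(const Bht *bht, uint32_t pc) {
    return bht->counter[bht_index(pc)] >= 2;
}
```

Because only k bits of the pc are used and no tag is stored, two branches whose addresses differ by a multiple of 2^k words share one entry and influence each other's prediction.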

Let’s see program example again

    for (i = 0; i < size; i++) {
        if (i % 2 == 0) {
            action_even();
        } else {
            action_odd();
        }
    }

Pipeline hazard: data hazard

Data hazard by flow dependence

• Sometimes an instruction uses the result of an earlier instruction

    I1: addl %eax, %ebx
    I2: subl %ebx, %ecx

• I2 must wait until I1 updates the register file, so that I2 can see the result.

Dealing with data hazards

• We can wait until the desired value is updated
– Stall method

• Or we can send the value directly
– Bypass method
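
Both the stall method and the bypass method first have to detect the dependence. A minimal sketch of that check is given below; the Instr struct, the register numbering, and the NO_REG marker are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

enum { EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI, NUM_REGS, NO_REG = -1 };

typedef struct {
    int src1, src2;   /* registers read (NO_REG if unused) */
    int dst;          /* register written (NO_REG if none) */
} Instr;

/* True if the younger instruction reads a register that the older one
   writes, i.e. a flow (read-after-write) dependence. */
bool raw_hazard(Instr older, Instr younger) {
    assert(older.dst < NUM_REGS);
    if (older.dst == NO_REG)
        return false;
    return younger.src1 == older.dst || younger.src2 == older.dst;
}
```

For I1 = addl %eax, %ebx and I2 = subl %ebx, %ecx the check reports a hazard on %ebx. A stalling pipeline would hold I2 back until I1 has written the register file, while a bypassing pipeline would forward I1's result to I2 directly.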

Pipeline stalling example

Data bypass example

Improving processor performance: Cache memory

Memory operations are the bottleneck!

• Memory transfer rate of DDR3 RAM
– Peak transfer rate: 6400 MB/s

• Assume a single-core processor with a 3.0 GHz clock that processes one word (32 bits) every cycle
– We process approximately 12 GB/s
– 3.0 GHz × 32 bits (word size) / 8 bits per byte = 12 GB/s, almost twice the peak transfer rate of the memory

Memory hierarchy

Locality of reference

• Temporal locality
– If a value is used, it is likely to be used again soon.

        mov $100, %ecx
        mov $Array, %ebx      // %ebx = &Array
        xorl %eax, %eax       // %eax = 0

    Loop:
        mov (%ebx), %esi      // %esi = *(%ebx)
        addl %esi, %eax       // %eax += %esi
        addl $4, %ebx         // %ebx += 4
        subl $1, %ecx         // %ecx = %ecx - 1
        jne Loop              // repeat until %ecx reaches 0

    Fin:
        leave
        ret

Locality of reference

• Spatial locality
– If a value is used, nearby values are likely to be used

    for (i = 0; i < size; i++) {
        for (j = 0; j < size; j++) {
            sum += array[i][j];
        }
    }

Cache memory

• A buffer between the processor and memory
– Often several levels of caches

• Small but fast
– Old values are removed from the cache to make space for new values

• Capitalizes on spatial locality and temporal locality

• Parameters vary by system and are unknown to the programmer

Cache memory

• Cache memories are small, fast SRAM-based memories managed automatically in hardware
– Hold frequently accessed blocks of main memory

• The CPU looks first for data in L1, then in L2, then in main memory

• Typical bus structure:

[Figure: the CPU chip contains the register file, the ALU, the L1 cache, and the bus interface; the cache bus connects it to the L2 cache, and the system bus connects it through the I/O bridge and the memory bus to main memory]

Structure of cache memory

• Basically, an array of memory elements

[Figure: cache lines, each with an address tag (100, 304, 6848, 416) and a data block made of data bytes; the line tagged 100 holds a copy of main memory locations 100, 101, ...]

• How many bits are needed for the tag? Enough to uniquely identify the block

Read behavior

• Search the cache tags to find a match for the processor-generated address

• Found in cache (a.k.a. HIT)
– Return a copy of the data from the cache

• Not in cache (a.k.a. MISS)
– Read the block of data from main memory; this may require writing back a cache line
– Which line do we replace?
– Return the data to the processor and update the cache

Write behavior

• On a write hit
– Write-through: write to both the cache and the next-level memory
– Write-back: write only to the cache, and update the next-level memory when the line is evicted

• On a write miss
– Allocate: because of multi-word lines, we first fetch the line and then update a word in it
– No allocate: the word is modified in memory

Direct mapped cache


Cache line size

• A cache line usually holds more than one word (32 bits)
– Reduces the number of tags and the tag size needed to identify memory locations
– Exploits spatial locality

Types of misses

• Compulsory misses (cold start): a cold fact of life!
– The first time data is referenced
– Become insignificant after billions of instructions have run

• Capacity misses
– The working set is larger than the cache size
– Solution: increase the cache size

• Conflict misses
– Usually multiple memory locations are mapped to the same cache location to simplify implementations
– Thus it is possible that the designated cache location is full while there are empty locations in the cache
– Solution: set-associative caches

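Direct-mapped cache

[Figure: the requested address is split into a tag (t bits), an index (k bits), and a block offset (b bits); tag and index together form the block number. The index selects one of 2^k lines, each holding a valid bit (V), a tag, and a data block. The stored tag is compared (=) with the address tag to produce HIT, and the offset selects the data word or byte]

• What is a bad reference pattern? A stride equal to the size of the cache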

Addressing example

• Bitwise truncation of the requested address

    req address = 00000100001100 00000100001110 01 00

– Tag = 268 (upper 14 bits)
– Index = 270 (next 14 bits)
– Block offset = 1 (next 2 bits)

• Go to the 270th cache entry and compare the tag
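
The bitwise truncation can be written as shifts and masks. The field widths below follow the grouping of the example address (14-bit tag, 14-bit index, 2-bit block offset, 2 low bits for the byte within a word); treating the last two bits as a byte offset is an assumption, and the names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define BYTE_BITS  2u    /* byte within a word (assumed) */
#define WORD_BITS  2u    /* block offset: word within a block */
#define INDEX_BITS 14u
#define TAG_BITS   14u

typedef struct {
    uint32_t tag, index, word_offset, byte_offset;
} AddrFields;

/* Split a 32-bit address into its cache fields. */
AddrFields split_address(uint32_t addr) {
    assert(BYTE_BITS + WORD_BITS + INDEX_BITS + TAG_BITS == 32u);
    AddrFields f;
    f.byte_offset = addr & ((1u << BYTE_BITS) - 1u);
    f.word_offset = (addr >> BYTE_BITS) & ((1u << WORD_BITS) - 1u);
    f.index       = (addr >> (BYTE_BITS + WORD_BITS)) & ((1u << INDEX_BITS) - 1u);
    f.tag         = addr >> (BYTE_BITS + WORD_BITS + INDEX_BITS);
    return f;
}
```

The index selects the cache entry, and the stored tag of that entry is then compared with the tag field to decide between hit and miss.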

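Address selection

[Figure: the same direct-mapped cache, but the index (k bits) is taken from the high-order address bits and the tag (t bits) from the bits between the index and the block offset (b bits); the index selects one of 2^k lines, and the stored tag is compared (=) to produce HIT and the data word or byte]

• Why might this be undesirable? Spatially local blocks conflict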

Reduce conflict misses

• Memory time = Hit time + Prob(miss) × Miss penalty

• Associativity: allow a block to go to one of several frames in the cache
– 2-way set associative: each block maps to either of the 2 frames in its set
– Fully associative: each block maps to any cache frame
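
The memory-time formula is simple enough to evaluate directly. The sketch below is a direct transcription of it; the numbers in the usage note are hypothetical and not measurements.

```c
#include <assert.h>

/* Memory time = Hit time + Prob(miss) * Miss penalty (all in cycles). */
double memory_time(double hit_time, double miss_prob, double miss_penalty) {
    assert(miss_prob >= 0.0 && miss_prob <= 1.0);
    return hit_time + miss_prob * miss_penalty;
}
```

With a hypothetical 1-cycle hit time and a 100-cycle miss penalty, lowering the miss probability from 5% to 2% brings the average memory time from about 6 cycles down to about 3, which is why reducing conflict misses pays off.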

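2-way set-associative cache

[Figure: the address is split into a tag (t bits), an index (k bits), and a block offset (b bits); the index selects one set holding two lines, each with a valid bit (V), a tag, and a data block. Both stored tags are compared (=) with the address tag in parallel; a match in either way produces HIT, and the offset selects the data word or byte]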

Replacement policy

• In order to bring in a new cache line, usually another cache line has to be thrown out. Which one?
– No choice in replacement if the cache is direct mapped

• Replacement policy for set-associative caches
– One that is not dirty, i.e., has not been modified
   • In the I-cache all lines are clean
   • In the D-cache, if a dirty line has to be thrown out, it must be written back first
– Least recently used?
– Most recently used?
– Random?

• How much is performance affected by the choice? Difficult to know without quantitative measurements

Implementing LRU..

• We need time stamps for all lines

• And we also need time stamp comparisons!
– Log-scale overhead
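
True LRU with time stamps can be sketched for a single cache set as follows. The 4-way set, the 64-bit counter used as a clock, and the names are assumptions made for this illustration; real hardware would use far narrower stamps.

```c
#include <assert.h>
#include <stdint.h>

#define WAYS 4u

typedef struct {
    uint64_t last_used[WAYS];   /* time stamp of the last reference per way */
    uint64_t clock;             /* advances on every reference */
} LruSet;

/* Record a reference to one way of the set. */
void lru_touch(LruSet *s, unsigned way) {
    assert(way < WAYS);
    s->last_used[way] = ++s->clock;
}

/* The victim is the way with the oldest time stamp. */
unsigned lru_victim(const LruSet *s) {
    unsigned victim = 0;
    for (unsigned w = 1; w < WAYS; w++)
        if (s->last_used[w] < s->last_used[victim])
            victim = w;
    return victim;
}
```

Every replacement needs a comparison across all ways, and every line needs storage for its stamp, which is the overhead that motivates cheaper approximations of LRU.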

Pseudo LRU example

• So use pseudo-LRU instead of true LRU

• We'll use an 8-way set-associative cache
• Use three history bits

• If a line is referenced, record the following code in the history bits
– Line 0: 000
– Line 1: 001
– Line 2: 010
– Line 3: 011
– …
