www.company.com Introduction to Processor Architecture

Introduction to Processor Architecture


Introduction to Processor Architecture. Contents: Introduction; Processor architecture overview; ISA (Instruction Set Architecture), with a RISC example (SMIPs) and a CISC example (Y86); Processor architecture, with a single-cycle processor example (SMIPs); Pipelining (control hazard, branch predictor, data hazard).

Page 1: Introduction to Processor Architecture


Page 2: Introduction to Processor Architecture

Contents

• Introduction
• Processor architecture overview
• ISA (Instruction Set Architecture)
  – RISC example (SMIPs)
  – CISC example (Y86)
• Processor architecture
  – Single-cycle processor example (SMIPs)
• Pipelining
  – Control hazard
  – Branch predictor
  – Data hazard
• Cache memory

Page 3: Introduction to Processor Architecture


Introduction

Page 4: Introduction to Processor Architecture


Processors

• What is the processor?

• What’s the difference among them?

Page 5: Introduction to Processor Architecture

Processor architecture and program

• If you understand the architecture, you have more opportunities to optimize your program.

• Let’s see some examples

Page 6: Introduction to Processor Architecture

Example1

• Version 1

    for (i = 0; i < size; i++) {
        for (j = 0; j < size; j++) {
            sum += array[i][j];
        }
    }

• Version 2

    for (j = 0; j < size; j++) {
        for (i = 0; i < size; i++) {
            sum += array[i][j];
        }
    }

Keyword: Cache
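The two loop orders can be written as a pair of C functions to make the point concrete. This is a minimal sketch: the function names and the fixed SIZE are assumptions made for illustration and are not part of the slides. Both functions return the same sum; only the order in which memory is touched differs.

```c
#include <stddef.h>

#define SIZE 4  /* illustrative size, not taken from the slides */

/* Version 1: the inner loop walks along a row. C stores arrays in
   row-major order, so consecutive accesses touch adjacent addresses. */
long sum_row_major(int array[SIZE][SIZE]) {
    long sum = 0;
    for (size_t i = 0; i < SIZE; i++)
        for (size_t j = 0; j < SIZE; j++)
            sum += array[i][j];
    return sum;
}

/* Version 2: the inner loop walks down a column, so consecutive
   accesses are SIZE * sizeof(int) bytes apart. */
long sum_col_major(int array[SIZE][SIZE]) {
    long sum = 0;
    for (size_t j = 0; j < SIZE; j++)
        for (size_t i = 0; i < SIZE; i++)
            sum += array[i][j];
    return sum;
}
```

With a large array, version 1 would be expected to use each fetched cache line fully before moving on, while version 2 may touch a different line on every access. The actual difference depends on the cache parameters of the machine.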

Page 7: Introduction to Processor Architecture

Example2 (1/2)

• Version 1

    for (i = 0; i < size; i++) {
        if (i % 2 == 0) {
            action_even();
        } else {
            action_odd();
        }
    }

Page 8: Introduction to Processor Architecture

Example2 (2/2)

• Version 2

    for (i = 0; i < size; i += 2) {
        action_even();
    }

    for (i = 1; i < size; i += 2) {
        action_odd();
    }

Keyword: Branch predictor and pipeline
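The two versions of Example2 can be compared with a small C sketch. The counters and the bodies of action_even() and action_odd() are hypothetical stand-ins, since the slides do not define them.

```c
/* Hypothetical stand-ins for the actions; they only count calls. */
static int even_calls, odd_calls;
static void action_even(void) { even_calls++; }
static void action_odd(void)  { odd_calls++; }

/* Version 1: one loop with a data-dependent branch in every iteration. */
void run_branchy(int size) {
    for (int i = 0; i < size; i++) {
        if (i % 2 == 0) {
            action_even();
        } else {
            action_odd();
        }
    }
}

/* Version 2: two loops; the only branches left are the loop conditions. */
void run_split(int size) {
    for (int i = 0; i < size; i += 2) {
        action_even();
    }
    for (int i = 1; i < size; i += 2) {
        action_odd();
    }
}
```

Both versions call each action the same number of times, but version 2 runs all even actions before any odd action. The rewrite is therefore only a valid replacement when the actions do not depend on being interleaved.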

Page 9: Introduction to Processor Architecture


Processor architecture overview

Page 10: Introduction to Processor Architecture


Von Neumann Architecture

• Input -> process -> output model

• Integrated Instruction Memory and Data Memory

Page 11: Introduction to Processor Architecture

Basic components of x86 CPU

[Figure: block diagram of an x86 CPU. Components shown: register file (%eax, %ecx, %edx, %ebx, %esp, %ebp, %esi, %edi), status registers (zero flag, sign flag, overflow flag, carry flag), program counter (%eip), CPU pipeline (Fetch, Decode, Execution Units, Commit), cache memory ($), and external memory.]

Page 12: Introduction to Processor Architecture

Register file

• What is a register?

A simple memory element (e.g., edge-triggered flip-flops)

Page 13: Introduction to Processor Architecture

Register file

• A collection of registers
  – 8 registers are visible
• In fact, many more registers are hidden and used for other purposes.

Ex) There are 168 registers in Intel's Haswell

Page 14: Introduction to Processor Architecture

Program counter

• Holds the address of the instruction that the processor should execute next.
• %eip is the name of the program counter register in x86.
• The naming convention differs between ISAs (Instruction Set Architectures).

Page 15: Introduction to Processor Architecture

Status registers

• Also a collection of registers
• Boolean registers that represent the processor's status
• Used to evaluate conditions

Flags: zero flag, sign flag, overflow flag, carry flag
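As an illustration of how the four flags relate to an ALU result, the sketch below computes them for a 32-bit addition. The struct and function names are assumptions made for this example; it models the usual definitions of the flags and is not a description of any specific processor's circuitry.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool zf;  /* zero flag: result is 0 */
    bool sf;  /* sign flag: top bit of the result is 1 */
    bool of;  /* overflow flag: signed overflow occurred */
    bool cf;  /* carry flag: unsigned carry out of bit 31 */
} Flags;

Flags flags_after_add(uint32_t a, uint32_t b) {
    uint32_t r = a + b;  /* unsigned addition wraps modulo 2^32 */
    Flags f;
    f.zf = (r == 0);
    f.sf = ((r >> 31) & 1u) != 0;
    f.cf = (r < a);      /* the sum wrapped around */
    /* Signed overflow: both operands have the same sign bit and the
       result's sign bit differs from it. */
    f.of = ((((a ^ r) & (b ^ r)) >> 31) & 1u) != 0;
    return f;
}
```

A conditional jump such as je then only needs to look at the flags (here, zf) rather than recompute the comparison.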

Page 16: Introduction to Processor Architecture

Memory

• Main memory, usually DRAM
• In the von Neumann architecture, instructions (code) and data reside in the same memory module

Page 17: Introduction to Processor Architecture

CPU pipeline

• Where actual operation occurs
• Details will be explained later

[Figure: CPU pipeline stages: Fetch, Decode, Execution Units, Commit.]

Page 18: Introduction to Processor Architecture


Instruction Set Architecture

Page 19: Introduction to Processor Architecture

Instruction Set Architecture (ISA)

• How you actually talk to a processor

Page 20: Introduction to Processor Architecture

Instruction Set Architecture (ISA)

• Mapping between assembly code and machine code
  – Which assembly instructions are included?
  – How are assembly instructions represented as byte codes?

Page 21: Introduction to Processor Architecture

Instruction

• A command that makes the processor perform specific task(s)
  – Ex1) movl 4(%eax), %esp (x86) -> load the data at address %eax + 4 into %esp
  – Ex2) irmovl $256, %eax (Y86) -> store the value 256 in register %eax

Page 22: Introduction to Processor Architecture

Representation of instructions

• Instructions are represented as byte codes
  – pushl %ebx => 0xa03f
  – irmovl $256, %eax => 0x30f000010000

Instruction     Byte 0   Byte 1   Bytes 2-5
pushl rA        A 0      rA x
popl rA         B 0      rA x
irmovl V, rB    3 0      x rB     Immediate value
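The byte layouts above can be reproduced with a short C sketch. It assumes the register numbering used for Y86 in the CS:APP textbook (%eax = 0, %ecx = 1, %edx = 2, %ebx = 3, with F meaning "no register") and a little-endian immediate; the function names are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed Y86 register IDs (CS:APP numbering). */
enum { R_EAX = 0, R_ECX = 1, R_EDX = 2, R_EBX = 3, R_NONE = 0xF };

/* pushl rA: byte 0 is A0; byte 1 holds rA in the high nibble and
   F (no register) in the low nibble. Returns the length in bytes. */
size_t encode_pushl(uint8_t ra, uint8_t out[2]) {
    out[0] = 0xA0;
    out[1] = (uint8_t)((ra << 4) | R_NONE);
    return 2;
}

/* irmovl $imm, rB: byte 0 is 30; byte 1 holds F and rB; bytes 2-5
   hold the immediate, least significant byte first. */
size_t encode_irmovl(uint32_t imm, uint8_t rb, uint8_t out[6]) {
    out[0] = 0x30;
    out[1] = (uint8_t)((R_NONE << 4) | rb);
    for (int i = 0; i < 4; i++)
        out[2 + i] = (uint8_t)(imm >> (8 * i));
    return 6;
}
```

Under these assumptions, irmovl $256, %eax yields the bytes 30 f0 00 01 00 00 and pushl %ebx yields a0 3f.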

Page 23: Introduction to Processor Architecture


CISC vs RISC

CISC(Y86) RISC(sMips)

Page 24: Introduction to Processor Architecture

CISC

• Basic idea: give programmers powerful instructions, so that fewer instructions are needed to complete a task
• One instruction does multiple things
• A lot of instructions! (over 300 in x86)
• Many instructions can access memory
• Variable instruction length

Page 25: Introduction to Processor Architecture

RISC

• Basic idea: write complex programs using simple instructions
• Each instruction does only one task
• Small instruction set (about 100 in MIPS)
• Only load and store instructions can access memory
• Fixed instruction length

Page 26: Introduction to Processor Architecture

RISC example: SMIPs ISA

Page 27: Introduction to Processor Architecture

Instruction formats

• Only three formats, but the fields are used differently by different types of instructions

R-type:  opcode (6) | rs (5) | rt (5) | rd (5) | shamt (5) | func (6)
I-type:  opcode (6) | rs (5) | rt (5) | immediate (16)
J-type:  opcode (6) | target (26)
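Because every field sits at a fixed bit position, splitting a 32-bit instruction word into the fields of the three formats takes only shifts and masks. The sketch below extracts all fields at once; the struct and function names are assumptions for illustration, and which fields are meaningful depends on the instruction type.

```c
#include <stdint.h>

typedef struct {
    uint32_t opcode;  /* bits 31-26, all formats */
    uint32_t rs;      /* bits 25-21, R-type and I-type */
    uint32_t rt;      /* bits 20-16, R-type and I-type */
    uint32_t rd;      /* bits 15-11, R-type */
    uint32_t shamt;   /* bits 10-6,  R-type */
    uint32_t func;    /* bits 5-0,   R-type */
    uint32_t imm;     /* bits 15-0,  I-type */
    uint32_t target;  /* bits 25-0,  J-type */
} Fields;

Fields decode_fields(uint32_t inst) {
    Fields f;
    f.opcode = (inst >> 26) & 0x3Fu;
    f.rs     = (inst >> 21) & 0x1Fu;
    f.rt     = (inst >> 16) & 0x1Fu;
    f.rd     = (inst >> 11) & 0x1Fu;
    f.shamt  = (inst >> 6)  & 0x1Fu;
    f.func   = inst & 0x3Fu;
    f.imm    = inst & 0xFFFFu;
    f.target = inst & 0x03FFFFFFu;
    return f;
}
```

The fixed layout is what keeps this step simple compared with a variable-length encoding, where the opcode must be examined before the positions of the remaining fields are known.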

Page 28: Introduction to Processor Architecture

Instruction formats

• Computational instructions

  R-type:  0 (6) | rs (5) | rt (5) | rd (5) | 0 (5) | func (6)      rd <- (rs) func (rt)
  I-type:  opcode (6) | rs (5) | rt (5) | immediate (16)            rt <- (rs) op immediate

• Load/Store instructions

  opcode (6, bits 31-26) | rs (5, bits 25-21) | rt (5, bits 20-16) | displacement (16, bits 15-0)
  Addressing mode: (rs) + displacement

  rs is the base register
  rt is the destination of a Load or the source for a Store

Page 29: Introduction to Processor Architecture

Control instructions

• Conditional (on GPR) PC-relative branch: BEQZ, BNEZ

  opcode (6) | rs (5) | (5) | offset (16)

  – target address = (offset in words) × 4 + (PC + 4)
  – range: 128 KB

• Unconditional register-indirect jumps: JR, JALR

  opcode (6) | rs (5) | (5) | (16)

• Unconditional absolute jumps: J, JAL

  opcode (6) | target (26)

  – target address = {PC<31:28>, target × 4}
  – range: 256 MB

Jump-and-link stores PC + 4 into the link register (R31).
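The two target-address rules can be checked with a short C sketch. The function names are assumptions; the arithmetic follows the formulas on this slide, with the 16-bit offset treated as a signed number of words.

```c
#include <stdint.h>

/* PC-relative branch: (offset in words) * 4 + (PC + 4).
   The offset is signed, so branches can go backwards. */
uint32_t branch_target(uint32_t pc, int16_t offset_words) {
    return (pc + 4u) + (uint32_t)((int32_t)offset_words * 4);
}

/* Absolute jump: the top 4 bits come from the PC, the remaining
   28 bits are the 26-bit target shifted left by 2. */
uint32_t jump_target(uint32_t pc, uint32_t target26) {
    return (pc & 0xF0000000u) | ((target26 & 0x03FFFFFFu) << 2);
}
```

The ranges on the slide follow from the field widths: a signed 16-bit word offset reaches about 2^15 × 4 bytes = 128 KB in either direction, and a 26-bit word target covers 2^26 × 4 bytes = 256 MB.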

Page 30: Introduction to Processor Architecture

CISC example: Y86 ISA

Page 31: Introduction to Processor Architecture

Instruction formats

Length    Layout
1 byte    iCd iFun (byte 0)
2 bytes   iCd iFun (byte 0) | rA rB (byte 1)
5 bytes   iCd iFun (byte 0) | Destination (bytes 1-4)
6 bytes   iCd iFun (byte 0) | rA rB (byte 1) | Immediate/Offset (bytes 2-5)

iCd: instruction code; iFun: function code; rA, rB: register index

Page 32: Introduction to Processor Architecture

1-byte instructions: halt, nop

• halt (byte 0 = 0 0): used as a sign of program termination
  – Changes the processor state to halt (HLT)
• nop (byte 0 = 1 0): no operation; used as a bubble

Page 33: Introduction to Processor Architecture

2-byte instruction: OPl

Encoding: 6 Op (byte 0) | rA rB (byte 1)

• OPl: performs one of 4 basic ALU operations: add, sub, and, xor
  – R[rB] <- R[rB] Op R[rA]
  – Condition flags are set depending on the result

Page 34: Introduction to Processor Architecture

5-byte instruction: call

Encoding: 8 0 (byte 0) | Destination (bytes 1-4)

• call dest
  – R[esp] <- R[esp] - 4 (update the stack pointer; move the stack top)
  – M[R[esp]] <- pc + 5 (store the return address at the stack top)
  – pc <- Destination (jump to the destination address)

Page 35: Introduction to Processor Architecture

6-byte instructions: rmmovl, mrmovl

rmmovl rA, Offset(rB)   encoding: 4 0 (byte 0) | rA rB (byte 1) | Offset (bytes 2-5)
mrmovl Offset(rB), rA   encoding: 5 0 (byte 0) | rA rB (byte 1) | Offset (bytes 2-5)

• rmmovl: store
  – target address = R[rB] + offset
  – M[target address] <- R[rA]
• mrmovl: load
  – source address = R[rB] + offset
  – R[rA] <- M[source address]

Page 36: Introduction to Processor Architecture


Processor Architecture

Page 37: Introduction to Processor Architecture


Simple processor architecture

Page 38: Introduction to Processor Architecture

Simplified version (a lot..)

[Figure: a large block of sequential logic driven by a clock. It loads program code from memory and stores data to memory; its output is the register values.]

Page 39: Introduction to Processor Architecture

Sequential design

[Figure: Fetch, Decode, Execution Units and Commit connected in sequence, together with Memory, the Register File and %EIP.]

Page 40: Introduction to Processor Architecture

Fetch unit

1) Get the PC (from %EIP)
2) Request the next instruction (from memory)
3) Get the next instruction
4) Update the PC
5) Pass the next instruction (byte code) on

Page 41: Introduction to Processor Architecture

Decode unit (1/2)

1) Split the input instruction into its fields (iCd, fCd, rA, rB, imm)
2) The decode combinational logic fills in the information structure for execution:
   – Inst type
   – Target register A
   – Target register B
   – Immediate value
   – Register value A
   – Register value B
   – … (depends on ISA)

Page 42: Introduction to Processor Architecture

Decode unit (2/2)

3) Read register values: the register read logic reads the register file and fills in the register values, producing the decoded instruction

[Figure: the information structure (inst type, target register A, target register B, immediate value, register value A, register value B, …, depending on the ISA) before and after the register read step.]

Page 43: Introduction to Processor Architecture

Execution unit (1/2)

1) Select the inputs for the ALU from the decoded instruction (inst type, target register A, target register B, immediate value, register value A, register value B)
2) Perform the appropriate ALU operation
3) Using the ALU result, the execute combinational logic fills in the information structure for the memory and register update (the executed instruction):
   – Inst type
   – Memory data
   – Register data
   – Target register
   – Memory addr

Page 44: Introduction to Processor Architecture

Execution unit (2/2)

4) Perform memory operations (Ld, St): the memory operation logic accesses memory using the executed instruction (inst type, memory data, register data, target register, memory addr)
5) Update the field (if a load instruction was executed), producing the updated executed instruction

Page 45: Introduction to Processor Architecture

Commit unit

[Figure: the register update logic takes the executed instruction (inst type, memory data, register data, target register, memory addr) and updates the register file.]

Page 46: Introduction to Processor Architecture

Single-cycle processor example: SMIPs

Page 47: Introduction to Processor Architecture

Single-Cycle SMIPS

[Figure: datapath with PC (+4), instruction memory, Decode, register file (2 read & 1 write ports), Execute and data memory. The instruction and data memories are separate.]

SMIPs instructions are all 4 bytes long

Page 48: Introduction to Processor Architecture

Single-Cycle SMIPS

    module mkProc(Proc);
      Reg#(Addr) pc   <- mkRegU;
      RFile      rf   <- mkRFile;
      IMemory    iMem <- mkIMemory;
      DMemory    dMem <- mkDMemory;

      rule doProc;
        let inst  = iMem.req(pc);
        let dInst = decode(inst);
        let rVal1 = rf.rd1(validRegValue(dInst.src1));
        let rVal2 = rf.rd2(validRegValue(dInst.src2));
        let eInst = exec(dInst, rVal1, rVal2, pc);

        if (eInst.iType == Ld)
          eInst.data <- dMem.req(MemReq{op: Ld, addr: eInst.addr, data: ?});
        else if (eInst.iType == St)
          let dummy <- dMem.req(MemReq{op: St, addr: eInst.addr, data: eInst.data});
        if (isValid(eInst.dst))
          rf.wr(validRegValue(eInst.dst), eInst.data);
        pc <= eInst.brTaken ? eInst.addr : pc + 4;
      endrule
    endmodule

Page 49: Introduction to Processor Architecture

Single-Cycle SMIPS

• Declaration of components

    module mkProc(Proc);
      Reg#(Addr) pc   <- mkRegU;
      RFile      rf   <- mkRFile;
      IMemory    iMem <- mkIMemory;
      DMemory    dMem <- mkDMemory;

[Figure: single-cycle SMIPS datapath (PC, +4, instruction memory, Decode, register file, Execute, data memory).]

Page 50: Introduction to Processor Architecture

Single-Cycle SMIPS

    rule doProc;
      let inst = iMem.req(pc);

[Figure: single-cycle SMIPS datapath.]

Page 51: Introduction to Processor Architecture

Single-Cycle SMIPS

    let dInst = decode(inst);

[Figure: single-cycle SMIPS datapath.]

Page 52: Introduction to Processor Architecture

Single-Cycle SMIPS

    let rVal1 = rf.rd1(validRegValue(dInst.src1));
    let rVal2 = rf.rd2(validRegValue(dInst.src2));

[Figure: single-cycle SMIPS datapath.]

Page 53: Introduction to Processor Architecture

Single-Cycle SMIPS

    let eInst = exec(dInst, rVal1, rVal2, pc);

[Figure: single-cycle SMIPS datapath.]

Page 54: Introduction to Processor Architecture

Single-Cycle SMIPS

    if (eInst.iType == Ld)
      eInst.data <- dMem.req(MemReq{op: Ld, addr: eInst.addr, data: ?});
    else if (eInst.iType == St)
      let dummy <- dMem.req(MemReq{op: St, addr: eInst.addr, data: eInst.data});

[Figure: single-cycle SMIPS datapath.]

Page 55: Introduction to Processor Architecture

Single-Cycle SMIPS

    if (isValid(eInst.dst))
      rf.wr(validRegValue(eInst.dst), eInst.data);

[Figure: single-cycle SMIPS datapath.]

Page 56: Introduction to Processor Architecture

Single-Cycle SMIPS

    pc <= eInst.brTaken ? eInst.addr : pc + 4;

[Figure: single-cycle SMIPS datapath.]

Page 57: Introduction to Processor Architecture

Improve processor performance: Pipelining

Page 58: Introduction to Processor Architecture

Pipelining

• Apply the idea of a conveyor belt process

Page 59: Introduction to Processor Architecture

Pipelining

• Apply the idea of a conveyor belt process

[Figure: Fetch, Decode, Execution Units and Commit separated by FIFOs or registers, together with Memory, the Register File and %EIP.]

Page 60: Introduction to Processor Architecture

Pipelining

• In this case, 4 instructions are executing at the same time

[Figure: pipeline with Fetch, Decode, Execution Units and Commit, each holding a different instruction, together with Memory, the Register File and %EIP.]

Page 61: Introduction to Processor Architecture

Pipelining

• Is the throughput the same?
  – Sequential: 1 instruction / 1 cycle
  – Pipelined: 4 instructions / 4 cycles
• No, the pipelined design allows a faster clock
• Because the amount of work per cycle is smaller, we can use a shorter clock period
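A small C sketch makes the throughput argument concrete. It is an idealized model with assumptions that are not on the slide: the stages are perfectly balanced, pipeline registers add no delay, and there are no hazards. The function names are hypothetical.

```c
/* Single-cycle design: every instruction takes one long cycle. */
double sequential_time_ns(long n, double cycle_ns) {
    return (double)n * cycle_ns;
}

/* Ideal pipeline: the long cycle is split into `stages` equal parts.
   The first instruction needs `stages` short cycles to pass through;
   after that, one instruction completes every short cycle. */
double pipelined_time_ns(long n, int stages, double cycle_ns) {
    double stage_ns = cycle_ns / stages;
    return (double)(n + stages - 1) * stage_ns;
}
```

With 1000 instructions, a 4 ns single-cycle clock and 4 stages, this model gives 4000 ns for the sequential design and 1003 ns for the pipelined one. A real pipeline would gain less because of unbalanced stages, register overhead and the hazards discussed next.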

Page 62: Introduction to Processor Architecture

Pipeline hazard: control hazard

Page 63: Introduction to Processor Architecture

Control hazard (1/5)

• Which path will this assembly code take?

          mov  %ecx, %ebx
          subl %eax, (%ebx)
          je   B
    A:    addl %ecx, %edi
    B:    leave
          ret

• We don't know the outcome of the condition before running the code

Page 64: Introduction to Processor Architecture

Control hazard (2/5)

[Figure: pipeline (Fetch, Decode, Execution Units, Commit, with Memory, Register File and %EIP) holding mov %ecx, %ebx, then subl %eax, (%ebx), then je B in execution order.]

• What's next? addl? leave?

Page 65: Introduction to Processor Architecture

Control hazard (3/5)

• Which path will this assembly code take?

          mov  %ecx, %ebx
          subl %eax, (%ebx)
          je   B
    A:    addl %ecx, %edi
          mov  %edi, %eax
    B:    leave
          ret

• As we can't know the future, fetch the instruction in the next position

Page 66: Introduction to Processor Architecture

Control hazard (4/5)

[Figure: pipeline holding subl %eax, (%ebx), then je B, then addl %ecx, %edi in execution order.]

• What if the jump is taken?

Page 67: Introduction to Processor Architecture

Control hazard (5/5)

[Figure: pipeline holding je B, then addl %ecx, %edi, then mov %edi, %eax in execution order.]

• The wrong instructions were being executed

Page 68: Introduction to Processor Architecture

Control hazard - analysis

• We must discard some instructions when we mispredict the branch direction
• The longer the pipeline, the more instructions must be discarded on a branch misprediction

Page 69: Introduction to Processor Architecture

Epoch method

• Add an epoch register to the processor state
• The Execute stage changes the epoch whenever the pc prediction is wrong and sets the pc to the correct value
• The Fetch stage associates the current epoch with every instruction when it is fetched
• The epoch of the instruction is examined when it is ready to execute. If the processor epoch has changed, the instruction is thrown away

[Figure: Fetch reads iMem at PC, predicts the next pc (pred) and passes inst through f2d to Execute; Execute sends targetPC back to the PC; the Epoch register sits between the two stages.]

Page 70: Introduction to Processor Architecture

Epoch method - Summary

• Add a 'tag' to each instruction
• There are two tag machines: one in Fetch and one in Execute
• If the tag machines recognize that something is wrong, they change the tag they test against

Page 71: Introduction to Processor Architecture

Epoch method example (SMIPs)

[Figure: SMIPs datapath (PC, +4, instruction memory, Decode, register file, Execute, data memory) with the f2d FIFO after Fetch, a redirect FIFO from Execute back to Fetch, and the fEpoch and eEpoch registers.]

Execute sends information about the target pc to Fetch, which updates fEpoch and pc whenever it looks at the redirect PC FIFO

Page 72: Introduction to Processor Architecture

Epoch method example (SMIPs)

    rule doFetch;
      let inst = iMem.req(pc);
      let ppc = nextAddr(pc);
      pc <= ppc;
      f2d.enq(Fetch2Decode{pc: pc, ppc: ppc, epoch: epoch, inst: inst});
    endrule

    rule doExecute;
      let x = f2d.first;
      let inpc = x.pc;
      let inEp = x.epoch;
      let ppc = x.ppc;
      let inst = x.inst;
      if (inEp == epoch) begin
        let dInst = decode(inst);
        ... register fetch ...;
        let eInst = exec(dInst, rVal1, rVal2, inpc, ppc);
        ... memory operation ...
        ... rf update ...
        if (eInst.mispredict) begin
          pc <= eInst.addr;
          epoch <= inEp + 1;
        end
      end
      f2d.deq;
    endrule

Page 73: Introduction to Processor Architecture


Branch Prediction

Page 74: Introduction to Processor Architecture

Need to predict next PC

• We must fetch an instruction every cycle
• But as we saw in the control hazard part, we can't know exactly which instruction comes next
• So we must predict the next instruction

Page 75: Introduction to Processor Architecture

How to predict next PC?

• We can rely on the history
  – Record the history and make use of it
• So we must predict the next instruction

Page 76: Introduction to Processor Architecture

2-bit counter branch predictor

[Figure: state diagram with four states: strongly taken, weakly taken, weakly not taken, strongly not taken.]

Page 77: Introduction to Processor Architecture

2-bit counter branch predictor

• Assume 2 BP bits per instruction
• Use a saturating counter (counts up on taken, down on ¬taken)

BP bits   State
1 1       Strongly taken
1 0       Weakly taken
0 1       Weakly ¬taken
0 0       Strongly ¬taken

Direction prediction changes only after two successive bad predictions
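The saturating counter can be sketched in a few lines of C. The type and function names are assumptions for illustration; the state encoding follows the table (0 = strongly ¬taken up to 3 = strongly taken).

```c
#include <stdbool.h>

/* 0 = strongly not taken, 1 = weakly not taken,
   2 = weakly taken,       3 = strongly taken */
typedef unsigned char Counter2;

/* The upper bit of the counter is the direction prediction. */
bool predict_taken(Counter2 c) {
    return c >= 2;
}

/* Count up on taken, down on not taken, saturating at 0 and 3. */
Counter2 update_counter(Counter2 c, bool taken) {
    if (taken)
        return (Counter2)(c < 3 ? c + 1 : 3);
    return (Counter2)(c > 0 ? c - 1 : 0);
}
```

Starting from strongly taken, a single not-taken outcome moves the counter to weakly taken, which still predicts taken; only a second consecutive not-taken outcome flips the prediction.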

Page 78: Introduction to Processor Architecture

Branch History Table (BHT)

[Figure: k bits of the fetch PC (above the two low-order 0 0 bits) form the BHT index into a 2^k-entry BHT with 2 bits/entry, which outputs Taken/¬Taken. The instruction's opcode tells whether it is a branch, and its offset is added to the PC to form the target PC.]

• 4K-entry BHT, 2 bits/entry, ~80-90% correct direction predictions
• After decoding the instruction, if it turns out to be a branch, we can consult the BHT using the pc; if this prediction is different from the incoming predicted pc from Fetch, we can redirect Fetch
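Indexing the table amounts to dropping the two low-order PC bits (always 0 for 4-byte instructions) and keeping the next k bits. The sketch below assumes k = 12 to match the 4K-entry example; the macro and function names are made up for illustration.

```c
#include <stdint.h>

#define BHT_BITS    12u                /* k = 12 gives 4096 entries */
#define BHT_ENTRIES (1u << BHT_BITS)

/* Skip the two low-order bits, then keep k bits as the index. */
uint32_t bht_index(uint32_t pc) {
    return (pc >> 2) & (BHT_ENTRIES - 1u);
}
```

Because only k bits are used, two branches whose addresses differ only in the higher bits (for example 0x0004 and 0x4004 here) share one entry and can disturb each other's history.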

Page 79: Introduction to Processor Architecture

Let's see program example again

    for (i = 0; i < size; i++) {
        if (i % 2 == 0) {
            action_even();
        } else {
            action_odd();
        }
    }
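To see why this loop is hard on a 2-bit counter, the sketch below replays the outcome of the if (i % 2 == 0) test against a single saturating counter and counts the mispredictions. It is a simplified model with assumptions of its own: "taken" stands for the even path, only this one branch is modelled (not the loop branch), and the function name is hypothetical.

```c
#include <stdbool.h>

/* Returns the number of mispredictions of one 2-bit saturating
   counter (state 0..3, predicts taken when state >= 2) for an
   outcome stream that alternates taken, not taken, taken, ... */
int mispredictions_alternating(int size, int initial_state) {
    int state = initial_state;
    int wrong = 0;
    for (int i = 0; i < size; i++) {
        bool taken = (i % 2 == 0);
        bool predicted = (state >= 2);
        if (predicted != taken)
            wrong++;
        if (taken) {
            if (state < 3) state++;
        } else {
            if (state > 0) state--;
        }
    }
    return wrong;
}
```

In this model the counter is wrong on about half of the iterations when it starts in a strong state, and on every iteration when it starts in the weakly not taken state, because it keeps flipping between the two weak states. Splitting the loop as in Example2 (2/2) removes this branch altogether.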

Page 80: Introduction to Processor Architecture

Pipeline hazard: data hazard

Page 81: Introduction to Processor Architecture

Data hazard by flow dependence

• Sometimes an instruction uses the result of an earlier instruction

    I1  addl %eax, %ebx
    I2  subl %ebx, %ecx

I2 must wait until I1 updates the register file, so that I2 can see the result.
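Detecting this kind of dependence is a comparison between the destination of the older instruction and the sources of the younger one. The sketch below uses a hypothetical instruction record and hypothetical register numbers; it is not the decode structure from the earlier slides.

```c
#include <stdbool.h>

typedef struct {
    int dst;     /* register written, or -1 if none */
    int src[2];  /* registers read, -1 for an unused slot */
} Inst;

/* Flow (read-after-write) dependence: the younger instruction reads
   a register that the older instruction writes. */
bool has_raw_hazard(Inst older, Inst younger) {
    if (older.dst < 0)
        return false;
    return younger.src[0] == older.dst || younger.src[1] == older.dst;
}
```

For the pair above, I1 writes %ebx and I2 reads %ebx, so the check reports a hazard. A pipeline can react to that result either by stalling I2 or by bypassing the value.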

Page 82: Introduction to Processor Architecture

Dealing with data hazards

• We can wait until the desired value is updated
  – Stall method
• Or, we can send the value directly
  – Bypass method

Page 83: Introduction to Processor Architecture


Pipeline stalling example

Page 84: Introduction to Processor Architecture


Data bypass example

Page 85: Introduction to Processor Architecture

Improve processor performance: Cache memory

Page 86: Introduction to Processor Architecture

Memory operations are the bottleneck!

• Memory transfer rate of DDR3 RAM
  – Peak transfer rate: 6400 MB/s
• Assume that we have a single-core processor with a clock speed of 3.0 GHz, and that we process one word (32 bits) every cycle.
  – We then process approximately 12 GB/s
    • 3.0 GHz × 32 bits (word size) / 8 bits per byte = 12 GB/s
  – This is almost twice the peak transfer rate of the memory

Page 87: Introduction to Processor Architecture


Memory hierarchy

Page 88: Introduction to Processor Architecture

Locality of reference

• Temporal locality
  – If a value is used, it is likely to be used again soon.

            mov  $100, %ecx      // %ecx = 100
            mov  $Array, %ebx    // %ebx = &Array
            xorl %eax, %eax      // %eax = 0

    Loop:   mov  (%ebx), %esi    // %esi = *(%ebx)
            addl %esi, %eax      // %eax += %esi
            addl $4, %ebx        // %ebx += 4
            subl $1, %ecx        // %ecx = %ecx - 1
            jne  Loop            // repeat until %ecx reaches 0

    Fin:    leave
            ret

Page 89: Introduction to Processor Architecture

Locality of reference

• Spatial locality
  – If a value is used, nearby values are likely to be used

    for (i = 0; i < size; i++) {
        for (j = 0; j < size; j++) {
            sum += array[i][j];
        }
    }

Page 90: Introduction to Processor Architecture

Cache memory

• A buffer between processor and memory
  – Often several levels of caches
• Small but fast
  – Old values will be removed from cache to make space for new values
• Capitalizes on spatial locality and temporal locality
• Parameters vary by system (unknown to programmer)

Page 91: Introduction to Processor Architecture

Cache memory

• Cache memories are small, fast SRAM-based memories managed automatically in hardware.
  – Hold frequently accessed blocks of main memory
• CPU looks first for data in L1, then in L2, then in main memory.
• Typical bus structure:

[Figure: CPU chip (register file, ALU, L1 cache, bus interface), L2 cache on the cache bus, I/O bridge on the system bus, main memory on the memory bus.]

Page 92: Introduction to Processor Architecture

Structure of cache memory

• Basically, an array of memory elements

[Figure: cache lines, each with an address tag (100, 304, 6848, 416) and a data block made of several data bytes; the line tagged 100 holds a copy of main memory locations 100, 101, ...]

How many bits are needed for the tag? Enough to uniquely identify the block

Page 93: Introduction to Processor Architecture

Read behavior

Search the cache tags to find a match for the processor-generated address

• Found in cache, a.k.a. HIT
  – Return a copy of the data from the cache
• Not in cache, a.k.a. MISS
  – Read the block of data from main memory (may require writing back a cache line)
  – Return the data to the processor and update the cache
  – Which line do we replace?

Page 94: Introduction to Processor Architecture

Write behavior

• On a write hit
  – Write-through: write to both the cache and the next-level memory
  – Write-back: write only to the cache, and update the next-level memory when the line is evicted
• On a write miss
  – Allocate: because of multi-word lines, we first fetch the line and then update a word in it
  – Not allocate: the word is modified in memory

Page 95: Introduction to Processor Architecture


Direct mapped cache


Page 96: Introduction to Processor Architecture

Cache line size

• A cache line usually holds more than one word (32 bits)
  – Reduces the number of tags and the tag size needed to identify memory locations
  – Exploits spatial locality

Page 97: Introduction to Processor Architecture

Types of misses

• Compulsory misses (cold start): a cold fact of life!
  – First time data is referenced
  – Become insignificant when running billions of instructions
• Capacity misses
  – Working set is larger than cache size
  – Solution: increase cache size
• Conflict misses
  – Usually multiple memory locations are mapped to the same cache location to simplify implementations
  – Thus it is possible that the designated cache location is full while there are empty locations in the cache
  – Solution: set-associative caches

Page 98: Introduction to Processor Architecture

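Direct-mapped cache

[Figure: the request address is split into a tag (t bits), an index (k bits) and a block offset (b bits); tag and index together form the block number. The index selects one of 2^k lines, each holding a valid bit (V), a tag and a data block. The stored tag is compared (=) with the address tag to produce HIT, and the offset selects the data word or byte.]

What is a bad reference pattern? Strided access with stride = size of cache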

Page 99: Introduction to Processor Architecture

Addressing example

• Bitwise truncation

    req address:  00000100001100 | 00000100001110 | 01 | 00
    Tag = 268, Index = 270, Block offset = 1

• Go to the 270th cache entry and compare the tag
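The bitwise truncation can be written out in C. The field widths below are read off the bit pattern of the request address in this example (14 tag bits, 14 index bits, 2 block-offset bits, 2 byte-offset bits); the names are assumptions for illustration.

```c
#include <stdint.h>

enum { BYTE_BITS = 2, BLOCK_BITS = 2, INDEX_BITS = 14 };  /* tag: remaining 14 bits */

typedef struct {
    uint32_t tag, index, block_offset, byte_offset;
} CacheAddr;

CacheAddr split_address(uint32_t addr) {
    CacheAddr a;
    a.byte_offset  = addr & ((1u << BYTE_BITS) - 1u);
    a.block_offset = (addr >> BYTE_BITS) & ((1u << BLOCK_BITS) - 1u);
    a.index        = (addr >> (BYTE_BITS + BLOCK_BITS)) & ((1u << INDEX_BITS) - 1u);
    a.tag          = addr >> (BYTE_BITS + BLOCK_BITS + INDEX_BITS);
    return a;
}
```

The bit pattern of the request address corresponds to 0x043010E4. Splitting it this way gives index 270 and block offset 1, and the 14 tag bits 00000100001100 evaluate to 268.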

Page 100: Introduction to Processor Architecture

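Address selection

[Figure: the same direct-mapped cache (2^k lines, each with V, tag and data block, tag comparison producing HIT), but with the index (k bits) taken from the higher-order address bits and the tag (t bits) from the bits above the block offset (b bits).]

Why might this be undesirable? Spatially local blocks conflict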

Page 101: Introduction to Processor Architecture

Reduce conflict misses

• Memory time = Hit time + Prob(miss) * Miss penalty
• Associativity: allow blocks to go to several sets in cache
  – 2-way set associative: each block maps to either of 2 cache sets
  – Fully associative: each block maps to any cache frame
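The memory-time formula is easy to evaluate for concrete numbers. The function name and the sample values below are assumptions chosen for illustration, not measurements.

```c
/* Memory time = Hit time + Prob(miss) * Miss penalty.
   All times must be in the same unit (for example, cycles). */
double memory_time(double hit_time, double miss_prob, double miss_penalty) {
    return hit_time + miss_prob * miss_penalty;
}
```

For example, with a 1-cycle hit time and a 100-cycle miss penalty, lowering the miss probability from 0.25 to 0.125 reduces the average memory time from 26 cycles to 13.5 cycles, which is why reducing conflict misses pays off.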

Page 102: Introduction to Processor Architecture

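2-way Set-Associative cache

[Figure: the address is split into a tag (t bits), an index (k bits) and a block offset (b bits). The index selects a set of two lines, each with V, tag and data block; both stored tags are compared (=) with the address tag, and a match produces HIT and selects the data word or byte.]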

Page 103: Introduction to Processor Architecture

Replacement policy

• In order to bring in a new cache line, usually another cache line has to be thrown out. Which one?
  – No choice in replacement if the cache is direct mapped
• Replacement policy for set-associative caches
  – One that is not dirty, i.e., has not been modified
    • In I-cache all lines are clean
    • In D-cache if a dirty line has to be thrown out then it must be written back first
  – Least recently used?
  – Most recently used?
  – Random?

How much is performance affected by the choice? Difficult to know without quantitative measurements

Page 104: Introduction to Processor Architecture

Implementing LRU..

• We need time stamps for all lines
• We also need to compare the time stamps!
  – Log-scale overhead
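A software sketch of true LRU for one set shows where the cost comes from: one time stamp per line, plus a comparison across all of them to pick the victim. The struct, the function names and the 8-way size are assumptions for illustration; hardware would do the comparison with a comparator tree rather than a loop.

```c
#include <stdint.h>

#define WAYS 8  /* lines per set; illustrative */

typedef struct {
    uint64_t last_used[WAYS];  /* time stamp of the last reference */
    uint64_t now;              /* per-set reference counter */
} LruSet;

/* Record a reference to the given line. */
void lru_touch(LruSet *s, int way) {
    s->last_used[way] = ++s->now;
}

/* The victim is the line with the oldest time stamp. */
int lru_victim(const LruSet *s) {
    int victim = 0;
    for (int w = 1; w < WAYS; w++) {
        if (s->last_used[w] < s->last_used[victim])
            victim = w;
    }
    return victim;
}
```

Keeping and comparing full time stamps for every line is what makes true LRU expensive in hardware.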

Page 105: Introduction to Processor Architecture

Pseudo LRU example

• So use pseudo LRU instead of true LRU
• We'll use an 8-way set-associative cache
• Use three history bits
• If a line is referenced, record the following code in the history bits
  – Line 0: 000
  – Line 1: 001
  – Line 2: 010
  – Line 3: 011
  …

Page 106: Introduction to Processor Architecture


Pseudo LRU example

Page 107: Introduction to Processor Architecture


Pseudo LRU example