ECE729 : Advanced Computer Architecture
Lectures 20-23 : Memory Hierarchy - Cache Performance
2nd – 12th March, 2010
Anshul Kumar, CSE IITD


Page 1: ECE729 : Advanced Computer Architecture

ECE729 : Advanced Computer Architecture

Lectures 20-23 : Memory Hierarchy -Cache Performance

2nd – 12th March, 2010

Page 2: ECE729 : Advanced Computer Architecture

Lecture 20

2nd March, 2010

Page 3: ECE729 : Advanced Computer Architecture

Performance

Time to access cache = τ1 (usually 1 CPU cycle)

Time to access main memory = τ2 (one order of magnitude higher than τ1)

Hit probability (hit ratio or hit rate) = h

Miss probability (miss ratio or miss rate) = m = 1 - h

Time spent when hit occurs = τ1 (hit time)

Time spent when miss occurs = τ1 + τ2 (τ2 = miss penalty)

Teff = h τ1 + m (τ1 + τ2) = τ1 + m τ2

Page 4: ECE729 : Advanced Computer Architecture

Performance contd.

Teff = τ1 + m τ2

Average memory access time =

Hit time + Miss rate * Miss penalty

Program execution time =

IC * Cycle time * (CPIexec + Mem stalls / instr)

Mem stalls / instr =

Miss rate * Miss Penalty * Mem accesses / instr

Miss Penalty in OOO processor =

Total miss latency - Overlapped miss latency

(this gives Mem stalls / access)

Page 5: ECE729 : Advanced Computer Architecture

Transferring blocks to/from memory

[Figure: CPU – cache – bus – memory, in three organizations]

a. one-word-wide memory

b. four-word-wide memory

c. interleaved memory (four banks: mem bank 0 … mem bank 3)

Page 6: ECE729 : Advanced Computer Architecture

Miss penalty example

• 1 clock cycle to send address
• 15 cycles for RAM access
• 1 cycle for sending data
• block size = 4 words

Miss penalty:

case (a): 4 × (1 + 15 + 1) = 68 (address sent per word), or 1 + 4 × (15 + 1) = 65 (address sent once)

case (b): 1 + 1 × (15 + 1) = 17

case (c): 1 + 15 + 4 × 1 = 20

Page 7: ECE729 : Advanced Computer Architecture

DRAM with page mode

• Memory cells are organized as a 2-D structure

• Entire row is accessed at a time internally and kept in a buffer

• Reading multiple bits from a row can be done very fast
  – sequentially, without giving the address again
  – randomly, giving only the column addresses

Page 8: ECE729 : Advanced Computer Architecture

Performance analysis example

CPIeff = CPI + Miss rate * Miss Penalty * Mem accesses / Instr

CPI = 1.2, Miss rate = 0.5%, Block size = 16 words

Miss penalty = ?

Mem accesses / Instr = 1 (assumption)

Page 9: ECE729 : Advanced Computer Architecture

Miss penalty calculation

Data / address transfer time = 1 cycle

Memory latency = 10 cycles

a) Miss penalty = 16*(1+10+1) = 192

b) Miss penalty = 4*(1+10+1) = 48

c) Miss penalty = 4*(1+10+4*1) = 60

Page 10: ECE729 : Advanced Computer Architecture

Back to CPI calculation

CPIeff = 1.2 + .005 * miss penalty * 1.0

a) 1.2 + .005 * 192 * 1.0 = 1.2 + .96 = 2.16
b) 1.2 + .005 * 48 * 1.0 = 1.2 + .24 = 1.44
c) 1.2 + .005 * 60 * 1.0 = 1.2 + .30 = 1.50

Page 11: ECE729 : Advanced Computer Architecture

Performance Improvement

• Reducing miss penalty

• Reducing miss rate

• Reducing miss penalty * miss rate

• Reducing hit time

Page 12: ECE729 : Advanced Computer Architecture

Reducing Miss Penalty

• Multi level caches

• Critical word first and early restart

• Write Through policy

• Giving priority to read misses over write

• Merging write buffer

• Victim caches

Page 13: ECE729 : Advanced Computer Architecture

Multi Level Caches

Average memory access time =

Hit time(L1) + Miss rate(L1) * Miss penalty(L1)

Miss penalty(L1) =

Hit time(L2) + Miss rate(L2) * Miss penalty(L2)

Multi level inclusion

and

Multi level exclusion

Page 14: ECE729 : Advanced Computer Architecture

Misses in Multilevel Cache

• Local miss rate
  – no. of misses / no. of requests, as seen at a level

• Global miss rate
  – no. of misses / no. of requests, on the whole

• Solo miss rate
  – miss rate if only this cache were present

Page 15: ECE729 : Advanced Computer Architecture

Two level cache miss example

A: L1, L2 (present in both)

B: ~L1, L2

C: L1, ~L2

D: ~L1, ~L2 (present in neither; ~ denotes absence)

Local miss (L1) = (B+D)/(A+B+C+D)
Local miss (L2) = D/(B+D)
Global miss = D/(A+B+C+D)
Solo miss (L2) = (C+D)/(A+B+C+D)

Page 16: ECE729 : Advanced Computer Architecture

Multi-level cache example

CPI with no miss = 1.0 Clock = 500 MHz

Main mem access time = 200 ns

Miss rate = 5%

Adding L2 cache (20 ns) reduces miss to 2%. Find performance improvement.

Miss penalty (mem) = 200/2 = 100 cycles

Effective CPI with L1 = 1+5%*100 = 6

Page 17: ECE729 : Advanced Computer Architecture

Example continued

Miss penalty ( L2 ) = 20/2 = 10 cycles

Total CPI = Base CPI + stalls due to L1 miss

+ stalls due to L2 miss

= 1.0 + 5% * 10 + 2% * 100

= 1.0 + 0.5 + 2.0 = 3.5

Performance ratio = 6.0/3.5 = 1.7

Page 18: ECE729 : Advanced Computer Architecture

Lecture 21

3rd March, 2010

Page 19: ECE729 : Advanced Computer Architecture

Critical Word First and Early Restart

• Read policy
  – initiate memory access along with cache access, in anticipation of a miss
  – forward data to CPU as it gets filled in cache

• Load policy
  – wrap-around load

More effective when block size is large

Page 20: ECE729 : Advanced Computer Architecture

Write Through Policy

• Write Through policy reduces block traffic at the cost of
  – increased word traffic
  – increased miss rate

Page 21: ECE729 : Advanced Computer Architecture

Read Miss Priority Over Write

• Provide write buffers

• Processor writes into buffer and proceeds (for write through as well as write back)

On read miss
  – wait for buffer to be empty, or
  – check addresses in buffer for conflict

Page 22: ECE729 : Advanced Computer Architecture

Merging Write Buffer

Merge writes belonging to the same block in case of write through

Page 23: ECE729 : Advanced Computer Architecture

Victim Cache (proposed by Jouppi)

• Evicted blocks are recycled

• Much faster than getting a block from the next level

• Size = a few blocks only

• A significant fraction of misses may be found in victim cache

[Figure: a small victim cache sits beside the main cache; blocks evicted from the cache go to the victim cache, refills come from memory, and data is supplied to the processor from either]

Page 24: ECE729 : Advanced Computer Architecture

Reducing Miss Rate

• Large block size
• Larger cache
• LRU replacement
• WB, WTWA write policies
• Higher associativity
• Way prediction, pseudo-associative cache
• Warm start in multi-tasking
• Compiler optimizations

Page 25: ECE729 : Advanced Computer Architecture

Large Block Size

• Reduces compulsory misses

• Too large a block size – misses increase again

• Miss Penalty increases

Page 26: ECE729 : Advanced Computer Architecture

Large Cache

• Reduces capacity misses

• Hit time increases

• Keep small L1 cache and large L2 cache

Page 27: ECE729 : Advanced Computer Architecture

LRU Replacement Policy

• Choice of replacement policy
  – optimal: replace the block whose next reference is farthest away in the future
  – practical (LRU): replace the block whose last reference is farthest away in the past

Page 28: ECE729 : Advanced Computer Architecture

WB, WTWA Policies

• WB (write back) and WTWA (write through with write allocate) policies tend to reduce the miss rate as compared to WTNWA (write through, no write allocate) at the cost of
  – increased block traffic

Page 29: ECE729 : Advanced Computer Architecture

Higher Associativity

• Reduces conflict misses

• 8-way is almost like fully associative

• Hit time increases

Page 30: ECE729 : Advanced Computer Architecture

Associative cache example

Cache  mapping     block size  I-miss  D-miss  CPI
1      direct      1 word      4%      8%      2.0
2      direct      4 words     2%      5%      ?
3      2-way s.a.  4 words     2%      4%      ?

Miss penalty = 6 + block size

50% of instructions have a data reference

Stall cycles per instruction:
cache 1: 7 * (.04 + .08 * .5) = .56
cache 2: 10 * (.02 + .05 * .5) = .45
cache 3: 10 * (.02 + .04 * .5) = .40

Page 31: ECE729 : Advanced Computer Architecture

Example continued

Cache  CPI                     clock period  time/instr
1      2.0                     2.0           4.0
2      2.0 - .56 + .45 = 1.89  2.0           3.78
3      2.0 - .56 + .40 = 1.84  2.4           4.416

Page 32: ECE729 : Advanced Computer Architecture

Way Prediction and Pseudo-associative Cache

Way prediction: low miss rate of SA cache with hit time of DM cache
• Only one tag is compared initially
• Extra bits are kept for prediction
• Hit time in case of mis-prediction is high

Pseudo-associative or column-associative cache: get the advantage of an SA cache in a DM cache
• Check sequentially in a pseudo-set
• Fast hit and slow hit

Page 33: ECE729 : Advanced Computer Architecture

Warm Start in Multi-tasking

• Cold start
  – process starts with empty cache
  – blocks of previous process invalidated

• Warm start
  – some blocks from previous activation are still available

Page 34: ECE729 : Advanced Computer Architecture

Lecture 22

6th March, 2010

Page 35: ECE729 : Advanced Computer Architecture

Compiler optimizations

Loop interchange

• Improve spatial locality by scanning arrays row-wise

Blocking

• Improve temporal and spatial locality

Page 36: ECE729 : Advanced Computer Architecture

Improving Locality

Matrix multiplication example: C (L×M) = A (L×N) × B (N×M)

Page 37: ECE729 : Advanced Computer Architecture

Cache Organization for the example

• Cache line (or block) = 4 matrix elements.
• Matrices are stored row-wise.
• Cache can't accommodate a full row/column.
  (In other words, L, M and N are so large w.r.t. the cache size that after an iteration along any of the three indices, when an element is accessed again, it results in a miss.)
• Ignore misses due to conflict between matrices (as if there were a separate cache for each matrix).

Page 38: ECE729 : Advanced Computer Architecture

Matrix Multiplication : Code I

for (i = 0; i < L; i++)
  for (j = 0; j < M; j++)
    for (k = 0; k < N; k++)
      C[i][j] += A[i][k] * B[k][j];

          C     A      B
accesses  LM    LMN    LMN
misses    LM/4  LMN/4  LMN

Total misses = LM(5N+1)/4

Page 39: ECE729 : Advanced Computer Architecture

Matrix Multiplication : Code II

for (k = 0; k < N; k++)
  for (i = 0; i < L; i++)
    for (j = 0; j < M; j++)
      C[i][j] += A[i][k] * B[k][j];

          C      A   B
accesses  LMN    LN  LMN
misses    LMN/4  LN  LMN/4

Total misses = LN(2M+4)/4

Page 40: ECE729 : Advanced Computer Architecture

Matrix Multiplication : Code III

for (i = 0; i < L; i++)
  for (k = 0; k < N; k++)
    for (j = 0; j < M; j++)
      C[i][j] += A[i][k] * B[k][j];

          C      A     B
accesses  LMN    LN    LMN
misses    LMN/4  LN/4  LMN/4

Total misses = LN(2M+1)/4

Page 41: ECE729 : Advanced Computer Architecture

Blocking

[Figure: blocked matrix multiplication – C, A and B are traversed in sub-blocks indexed by jj and kk, with inner loops over i, j, k]

5 nested loops, blocking factor = b

          C       A       B
accesses  LMN/b   LMN     LMN
misses    LMN/4b  LMN/4b  MN/4

Total misses = MN(2L/b+1)/4

Page 42: ECE729 : Advanced Computer Architecture

Loop Blocking

for (k = 0; k < N; k += 4)
  for (i = 0; i < L; i++)
    for (j = 0; j < M; j++)
      C[i][j] += A[i][k]   * B[k][j]
               + A[i][k+1] * B[k+1][j]
               + A[i][k+2] * B[k+2][j]
               + A[i][k+3] * B[k+3][j];

          C       A     B
accesses  LMN/4   LN    LMN
misses    LMN/16  LN/4  LMN/4

Total misses = LN(5M/4+1)/4

Page 43: ECE729 : Advanced Computer Architecture

Reducing Miss Penalty * Miss Rate

• Non-blocking cache

• Write allocate with no fetch

• Hardware prefetching

• Compiler controlled prefetching

Page 44: ECE729 : Advanced Computer Architecture

Non-blocking Cache

In an OOO (out-of-order) processor:

• Hit under a miss
  – complexity of cache controller increases

• Hit under multiple misses, or miss under a miss
  – memory should be able to handle multiple misses

Page 45: ECE729 : Advanced Computer Architecture

Write allocate with no fetch

Write miss in a Write Through cache:
– allocate a block in cache
– fetch contents of block from memory
– write into cache
– write into memory

– reduces miss rate (WTWA)
– increases miss penalty (avoid for clustered writes)

Page 46: ECE729 : Advanced Computer Architecture

Hardware Prefetching

• Prefetch items before they are requested
  – both data and instructions

• What and when to prefetch?
  – fetch two blocks on a miss (requested + next)

• Where to keep prefetched information?
  – in cache
  – in a separate buffer (most common case)

Page 47: ECE729 : Advanced Computer Architecture

Prefetch Buffer / Stream Buffer

[Figure: a prefetch buffer sits beside the cache; prefetched blocks arrive from memory into the buffer and are supplied to the processor on a hit]

Page 48: ECE729 : Advanced Computer Architecture

Hardware prefetching: Stream buffers

Jouppi's experiment [1990]:
• A single instruction stream buffer catches 15% to 25% of misses from a 4 KB direct mapped instruction cache with 16 byte blocks
• 4-block buffer – 50%, 16-block – 72%
• A single data stream buffer catches 25% of misses from a 4 KB direct mapped cache
• 4 data stream buffers (each prefetching at a different address) – 43%

Page 49: ECE729 : Advanced Computer Architecture

HW prefetching: UltraSPARC III example

64 KB data cache, 36.9 misses per 1000 instructions

22% instructions make data reference

hit time = 1, miss penalty = 15

prefetch hit rate = 20%

1 cycle to get data from prefetch buffer

What size of cache will give same performance?

miss rate = 36.9/220 = 16.7%

av mem access time = 1 + (.167 * .2 * 1) + (.167 * .8 * 15) = 3.046

effective miss rate = (3.046 - 1)/15 = 13.6% ⇒ 256 KB cache

Page 50: ECE729 : Advanced Computer Architecture

Compiler Controlled Prefetching

• Register prefetch / Cache prefetch

• Faulting / non-faulting (non-binding)

• Semantically invisible (no change in registers or memory contents)

• Makes sense if processor doesn’t stall while prefetching (non-blocking cache)

• Overhead of prefetch instruction should not exceed the benefit

Page 51: ECE729 : Advanced Computer Architecture

Lecture 23

12th March, 2010

Page 52: ECE729 : Advanced Computer Architecture

SW Prefetch Example

• 8 KB direct mapped, write back data cache with 16 byte blocks.

• a is 3 × 100, b is 101 × 3

for (i = 0; i < 3; i++)
  for (j = 0; j < 100; j++)
    a[i][j] = b[j][0] * b[j+1][0];

Each array element is 8 bytes
misses in array a = 3 * 100 / 2 = 150
misses in array b = 101
total misses = 251

Page 53: ECE729 : Advanced Computer Architecture

SW Prefetch Example – contd.

Suppose we need to prefetch 7 iterations in advance:

for (j = 0; j < 100; j++) {
  prefetch(b[j+7][0]);
  prefetch(a[0][j+7]);
  a[0][j] = b[j][0] * b[j+1][0];
}
for (i = 1; i < 3; i++)
  for (j = 0; j < 100; j++) {
    prefetch(a[i][j+7]);
    a[i][j] = b[j][0] * b[j+1][0];
  }

misses in first loop = 7 (for b[0..6][0]) + 4 (for a[0][0..6])
misses in second loop = 4 (for a[1][0..6]) + 4 (for a[2][0..6])
total misses = 19, total prefetches = 400

Page 54: ECE729 : Advanced Computer Architecture

SW Prefetch Example – contd.

Performance improvement?
Assume no capacity and conflict misses; prefetches overlap with each other and with misses.
Original loop: 7 cycles/iteration; prefetch loops: 9 and 8 cycles/iteration
Miss penalty = 100 cycles

Original loop = 300*7 + 251*100 = 27,200 cycles
1st prefetch loop = 100*9 + 11*100 = 2,000 cycles
2nd prefetch loop = 200*8 + 8*100 = 2,400 cycles
Speedup = 27200/(2000+2400) = 6.2

Page 55: ECE729 : Advanced Computer Architecture

Reducing Hit Time

• Small and simple caches

• Avoid time loss in address translation

• Pipelined cache access

• Write before hit

• Trace caches

Page 56: ECE729 : Advanced Computer Architecture

Small and Simple Caches

• Small size => faster access

• Small size => fit on the chip, lower delay

• Simple (direct mapped) => lower delay

• Second level – tags may be kept on chip

Page 57: ECE729 : Advanced Computer Architecture

Cache access time estimates using CACTI

0.8 micron technology, 1 R/W port, 32-bit address, 64-bit output, 32 B block [access-time table not reproduced]

Page 58: ECE729 : Advanced Computer Architecture

Avoid time loss in addr translation

Cache addressing:

• Physically addressed cache
  – first convert virtual address into physical address, then access cache (indexing, tag matching)
  – can indexing start before address translation?

• Virtually addressed cache
  – access cache directly using the virtual address

Page 59: ECE729 : Advanced Computer Architecture

Indexing before address translation

• Virtually indexed, physically tagged cache
  – simple and effective approach
  – indexing overlaps with address translation; tag matching after address translation
  – valid only if the index field does not change during address translation (possible for caches that are not too large)

[Address layout: Virtual Page No. | Offset in page, with Tag | Index aligned below]

Page 60: ECE729 : Advanced Computer Architecture

Virtually addressed

• Blocks in virtual address space are mapped to cache blocks

• Tag as well as index are taken from the virtual address

• Problems with virtually addressed cache
  – protection?
  – multiple processes?
  – aliasing?
  – I/O?

Page 61: ECE729 : Advanced Computer Architecture

Problems with virtually addressed cache

• Page level protection?
  – copy protection info from TLB

• Same virtual address from two different processes needs to be distinguished
  – purge cache blocks on context switch, or use PID tags along with other address tags

• Aliasing (different virtual addresses from two processes pointing to the same physical address)
  – inconsistency?

• I/O uses physical addresses

Page 62: ECE729 : Advanced Computer Architecture

Multi processes in virtually addr cache

• purge cache blocks on context switch
• use PID tags along with other address tags

Page 63: ECE729 : Advanced Computer Architecture

Inconsistency in virtually addr cache

• Hardware solution (Alpha 21264)
  – 64 KB cache, 2-way set associative, 8 KB page
  – a block with a given offset in a page can map to 8 locations in cache
  – check all 8 locations, invalidate duplicate entries

• Software solution (page coloring)
  – make 18 lsbs of all aliases the same – ensures that a 256 KB direct mapped cache has no duplicates
  – i.e., 4 KB pages are mapped to 64 sets (or colors)

Page 64: ECE729 : Advanced Computer Architecture

Pipelined Cache Access

• Cache access (hit) may be multi-cycle. e.g. Pentium 4 takes 4 cycles

• Multi-cycle access can be pipelined

• Throughput is increased though hit time remains more than one cycle

• Greater penalty on branch misprediction

• More clock cycles between issue of load and use of data

Page 65: ECE729 : Advanced Computer Architecture

Write before hit

• Write into cache without checking for hit/miss; check the tag and valid bit while writing.
• If it turns out to be a hit, reduced write hit time is obtained.
• If it turns out to be a miss, the corrupted block needs to be taken care of – increased miss penalty.
• Possible for direct mapped cache only

Page 66: ECE729 : Advanced Computer Architecture

Trace Caches

• What maps to a cache block?
  – not statically determined
  – decided by the dynamic sequence of instructions, including predicted branches

• Used in Pentium 4 (NetBurst architecture)
• Starting addresses are not word-size multiples of powers of 2
• Better utilization of cache space
• Downside – the same instruction may be stored multiple times

Page 67: ECE729 : Advanced Computer Architecture

Address Mapping in Trace Cache

[Figure: mapping of instruction sequences from main memory into the trace cache]