CS152 / Kubiatowicz Lec21.1 4/21/03©UCB Spring 2003 CS152 Computer Architecture and Engineering Lecture 21 Memory Systems (recap) Caches April 21, 2003

4/21/03 ©UCB Spring 2003 CS152 / Kubiatowicz

Lec21.1

CS152Computer Architecture and Engineering

Lecture 21

Memory Systems (recap)Caches

April 21, 2003

John Kubiatowicz (www.cs.berkeley.edu/~kubitron)

lecture slides: http://inst.eecs.berkeley.edu/~cs152/


Lec21.2

° The Five Classic Components of a Computer

° Today’s Topics: • Recap last lecture

• Simple caching techniques

• Many ways to improve cache performance

• Virtual memory?

Recap: The Big Picture: Where are We Now?

Control

Datapath

Memory

Processor

Input

Output


Lec21.3

µProc60%/yr.(2X/1.5yr)

DRAM9%/yr.(2X/10 yrs)1

10

100

1000

198

0198

1 198

3198

4198

5 198

6198

7198

8198

9199

0199

1 199

2199

3199

4199

5199

6199

7199

8 199

9200

0

DRAM

CPU198

2

Processor-MemoryPerformance Gap:(grows 50% / year)

Per

form

ance

Time

“Moore’s Law”

Processor-DRAM Memory Gap (latency)

Recap: Who Cares About the Memory Hierarchy?

“Less’ Law?”


Lec21.4

Recap: Memory Hierarchy: Why Does it Work? Locality!

° Temporal Locality (Locality in Time):=> Keep most recently accessed data items closer to the processor

° Spatial Locality (Locality in Space):=> Move blocks consists of contiguous words to the upper levels

Lower LevelMemoryUpper Level

MemoryTo Processor

From ProcessorBlk X

Blk Y

Address Space0 2^n - 1

Probabilityof reference


Lec21.5

Recap: Static RAM Cell

6-Transistor SRAM Cell

bit bit

word(row select)

bit bit

word

° Write:1. Drive bit lines (bit=1, bit=0)

2.. Select row

° Read:1. Precharge bit and bit to Vdd or Vdd/2 => make sure equal!

2.. Select row

3. Cell pulls one line low

4. Sense amp on column detects difference between bit and bit

replaced with pullupto save area

10

0 1


Lec21.6

Recap: 1-Transistor Memory Cell (DRAM)

° Write:• 1. Drive bit line

• 2.. Select row

° Read:• 1. Precharge bit line to Vdd/2

• 2.. Select row

• 3. Cell and bit line share charges

- Very small voltage changes on the bit line

• 4. Sense (fancy sense amp)

- Can detect changes of ~1 million electrons

• 5. Write: restore the value

° Refresh• 1. Just do a dummy read to every cell.

row select

bit

Trench Capacitor


Lec21.7

Recap: Classical DRAM Organization (square)

° Row and Column Address Select 1 bit at a time° Act of reading refreshes one complete row

• Sense amps detect slight variations from VDD/2 and amplify them

row

decoder

rowaddress

Sense-AMPS, Column Selector & I/O Column

Address

data

RAM Cell Array

word (row) select

bit (data) lines

Each intersection representsa 1-T DRAM Cell


Lec21.8

AD

OE_L

256K x 8DRAM9 8

WE_LCAS_LRAS_L

OE_L

A Row Address

WE_L

Junk

Read AccessTime

Output EnableDelay

CAS_L

RAS_L

Col Address Row Address JunkCol Address

D High Z Data Out

DRAM Read Cycle Time

Early Read Cycle: OE_L asserted before CAS_L Late Read Cycle: OE_L asserted after CAS_L

° Every DRAM access begins at:

• The assertion of the RAS_L

• 2 ways to read: early or late v. CAS

Junk Data Out High Z

Recap: Traditional “asynchronous” DRAM Read Timing


Lec21.9

Recap: “Synchronous timing”: SDRAM timing for Lab6

° Micron 128M-bit dram (using 2Meg16bit4bank ver)• Row (12 bits), bank (2 bits), column (9 bits)

RAS(New Bank)

CAS End RASx

BurstREADCAS Latency


Lec21.10

Processor

$

MEM

Memory

reference stream <op,addr>, <op,addr>,<op,addr>,<op,addr>, . . .

op: i-fetch, read, write

Optimize the memory system organizationto minimize the average memory access timefor typical workloads

Workload orBenchmarkprograms

The Art of Memory System Design


Lec21.11

Impact of Memory Hierarchy on Algorithms

° Today CPU time is a function of (ops, cache misses)

° What does this mean to Compilers, Data structures, Algorithms?

• Quicksort: fastest comparison based sorting algorithm when keys fit in

memory

• Radix sort: also called “linear time” sortFor keys of fixed length and fixed radix a constant number of

passes over the data is sufficient independent of the number of keys

° “The Influence of Caches on the Performance of Sorting” by A. LaMarca and R.E. Ladner. Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, January, 1997, 370-379.

• For Alphastation 250, 32 byte blocks, direct mapped L2 2MB cache, 8 byte keys, from 4000 to 4000000


Lec21.12

Quicksort vs. Radix as vary number keys: Instructions

Job size in keys

Instructions/key

Radix sort

Quicksort


Lec21.13

Quicksort vs. Radix as vary number keys: Instrs & Time

Time

Job size in keys

Instructions

Radix sort

Quicksort


Lec21.14

Quicksort vs. Radix as vary number keys: Cache misses

Cache misses

Job size in keys

Radix sort

Quicksort

What is proper approach to fast algorithms?


Lec21.15

Example: 1 KB Direct Mapped Cache with 32 B Blocks° For a 2 ** N byte cache:

• The uppermost (32 - N) bits are always the Cache Tag

• The lowest M bits are the Byte Select (Block Size = 2M)

• One cache miss, pull in complete “Cache Block” (or “Cache Line”)

Cache Index

0

1

2

3

:

Cache Data

Byte 0

0431

:

Cache Tag Example: 0x50

Ex: 0x01

0x50

Stored as partof the cache “state”

Valid Bit

:31

Byte 1Byte 31 :

Byte 32Byte 33Byte 63 :Byte 992Byte 1023 :

Cache Tag

Byte Select

Ex: 0x00

9Block address


Lec21.16

Set Associative Cache° N-way set associative: N entries for each Cache Index

• N direct mapped caches operates in parallel

° Example: Two-way set associative cache• Cache Index selects a “set” from the cache

• The two tags in the set are compared to the input in parallel

• Data is selected based on the tag result

Cache Data

Cache Block 0

Cache TagValid

:: :

Cache Data

Cache Block 0

Cache Tag Valid

: ::

Cache Index

Mux 01Sel1 Sel0

Cache Block

CompareAdr Tag

Compare

OR

Hit


Lec21.17

Disadvantage of Set Associative Cache

° N-way Set Associative Cache versus Direct Mapped Cache:

• N comparators vs. 1• Extra MUX delay for the data• Data comes AFTER Hit/Miss decision and set selection

° In a direct mapped cache, Cache Block is available BEFORE Hit/Miss:

• Possible to assume a hit and continue. Recover later if miss.

Cache Data

Cache Block 0

Cache Tag Valid

: ::

Cache Data

Cache Block 0

Cache TagValid

:: :

Cache Index

Mux 01Sel1 Sel0

Cache Block

CompareAdr Tag

Compare

OR

Hit


Lec21.18

Example: Fully Associative

° Fully Associative Cache• Forget about the Cache Index

• Compare the Cache Tags of all cache entries in parallel

• Example: Block Size = 32 B blocks, we need N 27-bit comparators

° By definition: Conflict Miss = 0 for a fully associative cache

:

Cache Data

Byte 0

0431

:

Cache Tag (27 bits long)

Valid Bit

:

Byte 1Byte 31 :

Byte 32Byte 33Byte 63 :

Cache Tag

Byte Select

Ex: 0x01

=

=

=

=

=


Lec21.19

° Should be reading Chapter 7 of your book ° Second midterm: Monday May 5th

• Pipelining - Hazards, branches, forwarding, CPI calculations- (may include something on dynamic scheduling)

• Memory Hierarchy• Possibly something on I/O (see where we get in lectures)• Possibly something on power (Broderson Lecture)

° Lab 6: Hopefully all is well!• How about a “fake processor” written in Verilog to debug memory

system?

• Make sure to try writes to individual words and reads from adjacent words in the cache

• Careful about 2-clocks for DRAM module

- I have sent out a couple of messages about this: you should be able to single-step your processor while DRAM continues to run at full speed….

Administrative Issues


Lec21.20

° Compulsory (cold start or process migration, first reference): first access to a block

• “Cold” fact of life: not a whole lot you can do about it

• Note: If you are going to run “billions” of instruction, Compulsory Misses are insignificant

° Capacity:• Cache cannot contain all blocks access by the program

• Solution: increase cache size

° Conflict (collision):• Multiple memory locations mapped

to the same cache location

• Solution 1: increase cache size

• Solution 2: increase associativity

° Coherence (Invalidation): other process (e.g., I/O) updates memory

A Summary on Sources of Cache Misses


Lec21.21

Design options at constant cost

Direct Mapped N-way Set Associative Fully Associative

Compulsory Miss

Cache Size

Capacity Miss

Coherence Miss

Big Medium Small

Note:If you are going to run “billions” of instruction, Compulsory Misses are insignificant (except for streaming media types of programs).

Same Same Same

Conflict Miss High Medium Zero

Low Medium High

Same Same Same


Lec21.22

° Q1: Where can a block be placed in the upper level? (Block placement)

° Q2: How is a block found if it is in the upper level? (Block identification)

° Q3: Which block should be replaced on a miss? (Block replacement)

° Q4: What happens on a write? (Write strategy)

Recap: Four Questions for Caches and Memory Hierarchy


Lec21.23

° Block 12 placed in 8 block cache:• Fully associative, direct mapped, 2-way set associative• S.A. Mapping = Block Number Modulo Number Sets

0 1 2 3 4 5 6 7Blockno.

Fully associative:block 12 can go anywhere

0 1 2 3 4 5 6 7Blockno.

Direct mapped:block 12 can go only into block 4 (12 mod 8)

0 1 2 3 4 5 6 7Blockno.

Set associative:block 12 can go anywhere in set 0 (12 mod 4)

Set0

Set1

Set2

Set3

0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1

Block-frame address

1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3Blockno.

Q1: Where can a block be placed in the upper level?


Lec21.24

° Direct indexing (using index and block offset), tag compares, or combination

° Increasing associativity shrinks index, expands tag

Blockoffset

Block AddressTag Index

Q2: How is a block found if it is in the upper level?

Set Select

Data Select


Lec21.25

° Easy for Direct Mapped

° Set Associative or Fully Associative:• Random

• LRU (Least Recently Used)

Associativity: 2-way 4-way 8-way

Size LRU Random LRU Random LRU Random

16 KB 5.2% 5.7% 4.7% 5.3% 4.4% 5.0%

64 KB 1.9% 2.0% 1.5% 1.7% 1.4% 1.5%

256 KB 1.15% 1.17% 1.13% 1.13% 1.12% 1.12%

Q3: Which block should be replaced on a miss?


Lec21.26

° Write through—The information is written to both the block in the cache and to the block in the lower-level memory.

° Write back—The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.

• is block clean or dirty?

° Pros and Cons of each?• WT: read misses cannot result in writes

• WB: no writes of repeated writes

° WT always combined with write buffers so that don’t wait for lower level memory

Q4: What happens on a write?


Lec21.27

° A Write Buffer is needed between the Cache and Memory

• Processor: writes data into the cache and the write buffer

• Memory controller: write contents of the buffer to memory

° Write buffer is just a FIFO:• Typical number of entries: 4

• Must handle bursts of writes

• Works fine if: Store frequency (w.r.t. time) << 1 / DRAM write cycle

ProcessorCache

Write Buffer

DRAM

Write Buffer for Write Through


Lec21.28

Write Buffer Saturation

° Store frequency (w.r.t. time) > 1 / DRAM write cycle• If this condition exist for a long period of time (CPU cycle time too

quick and/or too many store instructions in a row):

- Store buffer will overflow no matter how big you make it

- The CPU Cycle Time <= DRAM Write Cycle Time

° Solution for write buffer saturation:• Use a write back cache

• Install a second level (L2) cache: (does this always work?)

ProcessorCache

Write Buffer

DRAM

ProcessorCache

Write Buffer

DRAML2Cache


Lec21.29

° Write-Buffer Issues: Could introduce RAW Hazard with memory!

• Write buffer may contain only copy of valid data Reads to memory may get wrong result if we ignore write buffer

° Solutions:• Simply wait for write buffer to empty before servicing reads:

- Might increase read miss penalty (old MIPS 1000 by 50% )• Check write buffer contents before read (“fully associative”);

- If no conflicts, let the memory access continue- Else grab data from buffer

° Can Write Buffer help with Write Back?• Read miss replacing dirty block

- Copy dirty block to write buffer while starting read to memory• CPU stall less since restarts as soon as do read

RAW Hazards from Write Buffer!


Lec21.30

° Assume: a 16-bit write to memory location 0x0 and causes a miss

• Do we allocate space in cache and possibly read in the block?

- Yes: Write Allocate

- No: Not Write Allocate

Cache Index

0

1

2

3

:

Cache Data

Byte 0

0431

:

Cache Tag Example: 0x00

Ex: 0x00

0x50

Valid Bit

:

31

Byte 1Byte 31 :

Byte 32Byte 33Byte 63 :Byte 992Byte 1023 :

Cache Tag

Byte Select

Ex: 0x00

9

Write-miss Policy: Write Allocate versus Not Allocate


Lec21.31

° Set of Operations that must be supported• read: data <= Mem[Physical Address]• write: Mem[Physical Address] <= Data

° Determine the internal register transfers° Design the Datapath° Design the Cache Controller

Physical Address

Read/Write

Data

Memory“Black Box”

Inside it has:Tag-Data Storage,Muxes,Comparators, . . .

CacheController

CacheDataPathAddress

Data In

Data Out

R/WActive

ControlPoints

Signalswait

How Do you Design a Memory System?


Lec21.32

Impact on Cycle Time

IR

PCI -Cache

D Cache

A B

R

T

IRex

IRm

IRwb

miss

invalid

Miss

Cache Hit Time:directly tied to clock rateincreases with cache sizeincreases with associativity

Average Memory Access time = Hit Time + Miss Rate x Miss Penalty

Time = IC x CT x (ideal CPI + memory stalls)


Lec21.33

Options to reduce AMAT:

1. Reduce the miss rate,

2. Reduce the miss penalty, or

3. Reduce the time to hit in the cache.

Time = IC x CT x (ideal CPI + memory stalls)

Average Memory Access time = Hit Time + (Miss Rate x Miss Penalty) =

(Hit Rate x Hit Time) + (Miss Rate x Miss Time)

Improving Cache Performance: 3 general options


Lec21.34




Improving Cache Performance


Lec21.35

Cache Size (KB)

Mis

s R

ate

per

Typ

e

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

1 2 4 8

16

32

64

12

8

1-way

2-way

4-way

8-way

Capacity

Compulsory

Conflict

3Cs Absolute Miss Rate (SPEC92)


Lec21.36

Cache Size (KB)

Mis

s R

ate

per

Typ

e

0

0.02

0.04

0.06

0.08

0.1

0.12

0.141 2 4 8

16

32

64

12

8

1-way

2-way

4-way

8-way

Capacity

Compulsory

Conflict

miss rate 1-way associative cache size X = miss rate 2-way associative cache size X/2

2:1 Cache Rule


Lec21.37

Cache Size (KB)

Mis

s R

ate

per

Typ

e

0%

20%

40%

60%

80%

100%

1 2 4 8

16

32

64

12

8

1-way

2-way4-way

8-way

Capacity

Compulsory

Conflict

3Cs Relative Miss Rate


Lec21.38

Block Size (bytes)

Miss Rate

0%

5%

10%

15%

20%

25%

16

32

64

12

8

25

6

1K

4K

16K

64K

256K

1. Reduce Misses via Larger Block Size


Lec21.39

° 2:1 Cache Rule: • Miss Rate DM cache size N Miss Rate 2-way cache size N/2

° Beware: Execution time is only final measure!• Will Clock Cycle time increase?

• Hill [1988] suggested hit time for 2-way vs. 1-way external cache +10%, internal + 2%

2. Reduce Misses via Higher Associativity


Lec21.40

° Assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. CCT direct mapped

Cache Size Associativity

(KB) 1-way 2-way 4-way 8-way

1 2.33 2.15 2.07 2.01

2 1.98 1.86 1.76 1.68

4 1.72 1.67 1.61 1.53

8 1.46 1.48 1.47 1.43

16 1.29 1.32 1.32 1.32

32 1.20 1.24 1.25 1.27

64 1.14 1.20 1.21 1.23

128 1.10 1.17 1.18 1.20

(Red means A.M.A.T. not improved by more associativity)

Example: Avg. Memory Access Time vs. Miss Rate


Lec21.41

To Next Lower Level InHierarchy

DATATAGS

One Cache line of DataTag and Comparator




3. Reducing Misses via a “Victim Cache”

° How to combine fast hit time of direct mapped yet still avoid conflict misses?

° Add buffer to place data discarded from cache

° Jouppi [1990]: 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache

° Used in Alpha, HP machines


Lec21.42

° E.g., Instruction Prefetching• Alpha 21064 fetches 2 blocks on a miss

• Extra block placed in “stream buffer”

• On miss check stream buffer

° Works with data blocks too:• Jouppi [1990] 1 data stream buffer got 25% misses from 4KB cache; 4

streams got 43%

• Palacharla & Kessler [1994] for scientific programs for 8 streams got 50% to 70% of misses from 2 64KB, 4-way set associative caches

° Prefetching relies on having extra memory bandwidth that can be used without penalty

• Could reduce performance if done indiscriminantly!!!

4. Reducing Misses by Hardware Prefetching


Lec21.43

° Data Prefetch• Load data into register (HP PA-RISC loads)

• Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9)

• Special prefetching instructions cannot cause faults;a form of speculative execution

° Issuing Prefetch Instructions takes time• Is cost of prefetch issues < savings in reduced misses?

• Higher superscalar reduces difficulty of issue bandwidth

5. Reducing Misses by Software Prefetching Data


Lec21.44

° McFarling [1989] reduced caches misses by 75% on 8KB direct mapped cache, 4 byte blocks in software

° Instructions• Reorder procedures in memory so as to reduce conflict misses

• Profiling to look at conflicts(using tools they developed)

° Data• Merging Arrays: improve spatial locality by single array of compound elements

vs. 2 arrays

• Loop Interchange: change nesting of loops to access data in order stored in memory

• Loop Fusion: Combine 2 independent loops that have same looping and some variables overlap

• Blocking: Improve temporal locality by accessing “blocks” of data repeatedly vs. going down whole columns or rows

6. Reducing Misses by Compiler Optimizations


Lec21.45




Improving Cache Performance (Continued)


Lec21.46

0. Reducing Penalty: Faster DRAM / Interface

° New DRAM Technologies • RAMBUS - same initial latency, but much higher bandwidth

• Synchronous DRAM

• TMJ-RAM (Tunneling magnetic-junction RAM) from IBM??

• Merged DRAM/Logic - IRAM project here at Berkeley

° Better BUS interfaces

° CRAY Technique: only use SRAM


Lec21.47

° A Write Buffer is needed between the Cache and Memory• Processor: writes data into the cache and the write buffer

• Memory controller: write contents of the buffer to memory

° Write buffer is just a FIFO:• Typical number of entries: 4

• Works fine if:Store frequency (w.r.t. time) << 1 / DRAM write cycle

• Must handle burst behavior as well!

ProcessorCache

Write Buffer

DRAM

1. Reducing Penalty: Read Priority over Write on Miss


Lec21.48

° Write-Buffer Issues: Could introduce RAW Hazard with memory!• Write buffer may contain only copy of valid data

Reads to memory may get wrong result if we ignore write buffer

° Solutions:• Simply wait for write buffer to empty before servicing reads:

- Might increase read miss penalty (old MIPS 1000 by 50% )• Check write buffer contents before read (“fully associative”);

- If no conflicts, let the memory access continue- Else grab data from buffer

° Can Write Buffer help with Write Back?• Read miss replacing dirty block

- Copy dirty block to write buffer while starting read to memory• CPU stall less since restarts as soon as do read

RAW Hazards from Write Buffer!


Lec21.49

° Don’t wait for full block to be loaded before restarting CPU

• Early restart—As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution

• Critical Word First—Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first

• DRAM FOR LAB 6 can do this in burst mode! (Check out sequential timing)

° Generally useful only in large blocks,

° Spatial locality a problem; tend to want next sequential word, so not clear if benefit by early restart

block

2. Reduce Penalty: Early Restart and Critical Word First


Lec21.50

° Non-blocking cache or lockup-free cache allow data cache to continue to supply cache hits during a miss

• requires F/E bits on registers or out-of-order execution• requires multi-bank memories

° “hit under miss” reduces the effective miss penalty by working during miss vs. ignoring CPU requests

° “hit under multiple miss” or “miss under miss” may further lower the effective miss penalty by overlapping multiple misses

• Significantly increases the complexity of the cache controller as there can be multiple outstanding memory accesses

• Requires muliple memory banks (otherwise cannot support)• Penium Pro allows 4 outstanding memory misses

3. Reduce Penalty: Non-blocking Caches


Lec21.51

° For in-order pipeline, 2 options:• Freeze pipeline in Mem stage (popular early on: Sparc, R4000)

IF ID EX Mem stall stall stall … stall Mem Wr IF ID EX stall stall stall … stall stall Ex Wr

• Use Full/Empty bits in registers + MSHR queue

- MSHR = “Miss Status/Handler Registers” (Kroft)Each entry in this queue keeps track of status of outstanding memory requests to one complete memory line.– Per cache-line: keep info about memory address.– For each word: register (if any) that is waiting for result.– Used to “merge” multiple requests to one memory line

- New load creates MSHR entry and sets destination register to “Empty”. Load is “released” from pipeline.

- Attempt to use register before result returns causes instruction to block in decode stage.

- Limited “out-of-order” execution with respect to loads. Popular with in-order superscalar architectures.

° Out-of-order pipelines already have this functionality built in… (load queues, etc).

What happens on a Cache miss?


Lec21.52

° FP programs on average: AMAT= 0.68 -> 0.52 -> 0.34 -> 0.26

° Int programs on average: AMAT= 0.24 -> 0.20 -> 0.19 -> 0.19

° 8 KB Data Cache, Direct Mapped, 32B block, 16 cycle miss

Hit Under i Misses

Av

g.

Me

m.

Acce

ss T

ime

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

eqntott

espresso

xlisp

compress

mdljsp2

ear

fpppp

tomcatv

swm256

doduc

su2cor

wave5

mdljdp2

hydro2d

alvinn

nasa7

spice2g6

ora

0->1

1->2

2->64

Base

Integer Floating Point

“Hit under n Misses”

0->11->22->64Base

Value of Hit Under Miss for SPEC


Lec21.53

° L2 Equations

AMAT = Hit TimeL1 + Miss RateL1 x Miss PenaltyL1

Miss PenaltyL1 = Hit TimeL2 + Miss RateL2 x Miss PenaltyL2

AMAT = Hit TimeL1 +

Miss RateL1 x (Hit TimeL2 + Miss RateL2 x Miss PenaltyL2)

° Definitions:• Local miss rate— misses in this cache divided by the total number of

memory accesses to this cache (Miss rateL2)

• Global miss rate—misses in this cache divided by the total number of memory accesses generated by the CPU (Miss RateL1 x Miss RateL2)

• Global Miss Rate is what matters

4. Reduce Penalty: Second-Level Cache

Proc

L1 Cache

L2 Cache


Lec21.54

° Reducing Miss Rate1. Reduce Misses via Larger Block Size

2. Reduce Conflict Misses via Higher Associativity

3. Reducing Conflict Misses via Victim Cache

4. Reducing Misses by HW Prefetching Instr, Data

5. Reducing Misses by SW Prefetching Data

6. Reducing Capacity/Conf. Misses by Compiler Optimizations

Reducing Misses: which apply to L2 Cache?


Lec21.55

Relative CPU Time

Block Size

11.11.21.31.41.51.61.71.81.9

2

16 32 64 128 256 512

1.361.28 1.27

1.34

1.54

1.95

° 32KB L1, 8 byte path to memory

L2 cache block size & A.M.A.T.


Lec21.56



3. Reduce the time to hit in the cache:

Lower Associativity (+victim caching or 2nd-level cache)?

Multiple cycle Cache access (e.g. R4000)

Harvard Architecture

Careful Virtual Memory Design (rest of lecture!)

Improving Cache Performance (Continued)


Lec21.57

° Sample Statistics:• 16KB I&D: Inst miss rate=0.64%, Data miss rate=6.47%• 32KB unified: Aggregate miss rate=1.99%

° Which is better (ignore L2 cache)?• Assume 33% loads/store, hit time=1, miss time=50• Note: data hit has 1 stall for unified cache (only one port)

AMATHarvard=(1/1.33)x(1+0.64%x50)+(0.33/1.33)x(1+6.47%x50) = 2.05AMATUnified=(1/1.33)x(1+1.99%x50)+(0.33/1.33)X(1+1+1.99%x50)= 2.24

ProcI-Cache-1

Proc

UnifiedCache-1

UnifiedCache-2

D-Cache-1

Proc

UnifiedCache-2

Example: Harvard Architecture

Unified

Harvard Architecture


Lec21.58

Summary #1/ 2:

° The Principle of Locality:• Program likely to access a relatively small portion of the address

space at any instant of time.

- Temporal Locality: Locality in Time

- Spatial Locality: Locality in Space

° Three (+1) Major Categories of Cache Misses:• Compulsory Misses: sad facts of life. Example: cold start misses.

• Conflict Misses: increase cache size and/or associativity.Nightmare Scenario: ping pong effect!

• Capacity Misses: increase cache size

• Coherence Misses: Caused by external processors or I/O devices

° Cache Design Space• total size, block size, associativity

• replacement policy

• write-hit policy (write-through, write-back)

• write-miss policy


Lec21.59

Summary #2 / 2: The Cache Design Space

° Several interacting dimensions• cache size

• block size

• associativity

• replacement policy

• write-through vs write-back

• write allocation

° The optimal choice is a compromise• depends on access characteristics

- workload

- use (I-cache, D-cache, TLB)

• depends on technology / cost

° Simplicity often wins

Associativity

Cache Size

Block Size

Bad

Good

Less More

Factor A Factor B

Documents

CS152 / Kubiatowicz Lec21.1 4/21/03©UCB Spring 2003 CS152 Computer Architecture and Engineering Lecture 21 Memory Systems (recap) Caches April 21, 2003