Appendix C Memory Hierarchy


Page 1

Appendix C: Memory Hierarchy

Page 2

Why care about memory hierarchies?

[Figure: processor vs. memory performance, 1980-2010, log scale from 1 to 100,000. The processor curve climbs far faster than the memory curve: the processor-memory performance gap is growing.]

Major source of stall cycles: memory accesses

Page 3

Levels of the Memory Hierarchy

Level         Capacity    Access Time              Cost
Registers     100s bytes  < 0.5 ns                 --
Cache         KBs         1 ns                     1-0.1 cents/bit
Main memory   MBs         100 ns                   0.0001-0.00001 cents/bit
Disk          GBs         10 ms (10,000,000 ns)    10^-5 - 10^-6 cents/bit
Tape          infinite    sec-min                  10^-8 cents/bit

Staging/transfer unit between adjacent levels:
  Registers <-> Cache:   instruction operands, 1-8 bytes (program/compiler)
  Cache <-> Memory:      blocks, 8-128 bytes (cache controller)
  Memory <-> Disk:       pages, 512 bytes-4 KB (OS)
  Disk <-> Tape:         files, MBs (user/operator)

Upper levels are faster; lower levels are larger.

Page 4

Motivating memory hierarchies
Two structures hold data:
  Registers: a small array of storage
  Memory: a large array of storage
What characteristics would we like memory to have?
  High capacity
  Low latency
  Low cost
We can't satisfy all of these requirements with a single memory technology.

Page 5

Memory hierarchy
Solution: use a little bit of everything!
  Small SRAM array (cache): small means fast and cheap
  Larger DRAM array (main memory): hope you rarely have to use it
  Extremely large disk: costs are decreasing at a faster rate than we fill the capacity

Page 6

Terminology
  Hit: you find the data you want at a given level
  Miss: the data is not present at that level; in this case, check the next lower level
  Hit rate: fraction of accesses that hit at a given level
  Miss rate = 1 - hit rate
Another performance measure is average memory access time:
  AMAT = (hit time) + (miss rate) x (miss penalty)
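As a quick illustration (mine, not from the slides), here is the AMAT formula as a one-line Python helper; the example numbers are made up:

# Average memory access time: hit time plus the expected miss cost.
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

print(amat(1, 0.05, 100))   # 1-cycle hit, 5% misses, 100-cycle penalty -> 6.0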

Page 7

Memory hierarchy operation
We'd like most accesses to use the cache, the fastest level of the hierarchy.
But the cache is much smaller than the address space.
Most caches have a hit rate > 80%. How is that possible?
  The cache holds the data most likely to be accessed.

Page 8

Principle of locality
Programs don't access data randomly; they display locality in two forms:
  Temporal locality: if you access a memory location (e.g., 1000), you are more likely to re-access that location than some random location.
  Spatial locality: if you access a memory location (e.g., 1000), you are more likely to access a location near it (e.g., 1001) than some random location.
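A minimal sketch (my example, not the slides') showing the two access patterns in one loop:

data = list(range(1024))
total = 0
for i in range(len(data)):   # 'total' and 'i' are touched every iteration: temporal locality
    total += data[i]         # conceptually consecutive elements of 'data': spatial locality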

Page 9

Cache Basics
A fast (but small) memory close to the processor. When data is referenced:
  If it is in the cache, use the cache instead of memory
  If it is not in the cache, bring it into the cache (actually, bring in the entire block of data)
  Maybe have to kick something else out to do it!
Important decisions:
  Placement: where in the cache can a block go?
  Identification: how do we find a block in the cache?
  Replacement: what do we kick out to make room in the cache?
  Write policy: what do we do about stores?

Page 10

4 Questions for Memory Hierarchy
  Q1: Where can a block be placed in the upper level? (Block placement)
  Q2: How is a block found if it is in the upper level? (Block identification)
  Q3: Which block should be replaced on a miss? (Block replacement)
  Q4: What happens on a write? (Write strategy)

Page 11

Q1: Cache Placement
Placement: which memory blocks are allowed into which cache lines.
Placement policies:
  Direct mapped: a block can go to only one line
  Fully associative: a block can go to any line
  Set-associative: a block can go to one of N lines
    E.g., if N = 4, the cache is 4-way set associative
    The other two policies are extremes of this (e.g., N = 1 gives a direct-mapped cache)

Page 12

Q1: Block placement
Where can memory block 12 go in an 8-block cache (memory blocks numbered 0-31)?
  Fully associative: block 12 can go in any of the 8 lines
  Direct mapped: block 12 goes in line (12 mod 8) = 4
  2-way set associative: block 12 goes in set (12 mod 4) = 0, i.e., either line of set 0
Set-associative mapping: set = block number mod number of sets
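A small sketch of the mapping (my helper, not from the slides); associativity is expressed as ways per set:

def candidate_lines(block, num_lines, ways):
    """Cache lines allowed to hold a given memory block."""
    num_sets = num_lines // ways
    s = block % num_sets                           # set index
    return list(range(s * ways, (s + 1) * ways))   # lines in that set

print(candidate_lines(12, 8, 1))   # direct mapped -> [4]
print(candidate_lines(12, 8, 2))   # 2-way         -> [0, 1]
print(candidate_lines(12, 8, 8))   # fully assoc.  -> [0, 1, 2, 3, 4, 5, 6, 7]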

Page 13

Q2: Cache Identification
When an address is referenced, we need to:
  Find whether its data is in the cache
  If it is, find where in the cache
This is called a cache lookup.
Each cache line must have:
  A valid bit: 1 if the line has data, 0 if the line is empty (we also say the line is valid or invalid)
  A tag to identify which block is in the line (if the line is valid)

Page 14

Q2: Block identification
A tag on each block, so there is no need to check the index or block offset bits.
Increasing associativity shrinks the index and expands the tag.

Address layout: [ Tag | Index | Block Offset ], where Tag + Index form the block address.

Page 15

Address breakdown
  Block offset: byte address within the block; # block offset bits = log2(block size)
  Index: line (or set) number within the cache; # index bits = log2(# of cache lines)
  Tag: the remaining bits

[ Tag | Index | Block offset ]

Page 16

Address breakdown example
Given the following:
  32-bit address
  32 KB direct-mapped cache
  64-byte blocks
What are the sizes of the tag, index, and block offset fields?
  Index = 9 bits, since there are 32 KB / 64 B = 2^9 cache lines
  Block offset = 6 bits, since each block holds 64 B = 2^6 bytes
  Tag = 32 - 9 - 6 = 17 bits
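The same arithmetic as a short sketch (my helper, with the slide's parameters):

from math import log2

def field_bits(addr_bits, cache_bytes, block_bytes):
    offset = int(log2(block_bytes))                 # log2(block size)
    index = int(log2(cache_bytes // block_bytes))   # log2(# of cache lines)
    tag = addr_bits - index - offset                # remaining bits
    return tag, index, offset

print(field_bits(32, 32 * 1024, 64))   # -> (17, 9, 6)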

Page 17

Q3: Block replacement
When we need to evict a line, what do we choose?
  Easy choice for direct-mapped: there is only one candidate
  What about set-associative or fully associative?
We want to evict the data least likely to be used next.
  Temporal locality suggests that's the line that was accessed farthest in the past: least recently used (LRU)
    Hard to implement exactly in hardware; often approximated
  Random: evict a randomly selected line
  FIFO: evict the line that has been in the cache the longest
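A minimal software sketch of LRU for one set (illustrative only; real hardware uses approximations, as noted above):

from collections import OrderedDict

class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()       # tag -> block, oldest first

    def access(self, tag):
        """Return True on hit; on a miss, evict the LRU line if the set is full."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # hit: mark most recently used
            return True
        if len(self.lines) == self.ways:
            self.lines.popitem(last=False)   # evict least recently used
        self.lines[tag] = None           # fill with the new block
        return False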

Page 18

Q4: What happens on a write?

                                   Write-Through                      Write-Back
Policy                             Data written to the cache block    Write data only to the cache;
                                   is also written to lower-level     update the lower level when a
                                   memory                             block falls out of the cache
Debug                              Easy                               Hard
Do read misses produce writes?     No                                 Yes
Do repeated writes make it
to the lower level?                Yes                                No

Page 19

Write Policy
Do we allocate cache lines on a write?
  Write-allocate: a write miss brings the block into the cache
  No-write-allocate: a write miss leaves the cache as it was
Do we update memory on writes?
  Write-through: memory is immediately updated on each write
  Write-back: memory is updated when the line is replaced

Page 20

Write Buffers for Write-Through Caches

Processor -> Cache -> Write Buffer -> Lower-Level Memory
The write buffer holds data awaiting write-through to lower-level memory.

Q. Why a write buffer?
A. So the CPU doesn't stall.
Q. Why a buffer, and not just one register?
A. Bursts of writes are common.

Page 21

Write-Back Caches
Need a dirty bit for each line:
  A dirty line has more recent data than memory
  A line starts clean (not dirty) and becomes dirty on the first write to it
  Memory is not updated yet, so the cache has the only up-to-date copy of the data for a dirty line
Replacing a dirty line: must write its data back to memory (write-back)

Page 22

Basic cache design
Cache memory can copy data from any part of main memory; each entry holds
  Tag: the memory address
  Block: the actual data
On each access, compare the address with the tag:
  If they match: hit! Get the data from the cache block.
  If they don't: miss. Get the data from main memory.

Page 23

Cache organization
A cache consists of multiple tag/block pairs, called cache lines (or cache blocks).
  Lines can be searched in parallel (within reason)
  Each line also has a valid bit
  Write-back caches also have a dirty bit
Note that block sizes can vary:
  Most systems use between 32 and 128 bytes
  Larger blocks exploit spatial locality
  A larger block size means fewer lines, and therefore less total tag storage

Page 24

Direct-mapped cache example
Assume the following simple setup:
  Only 2 levels to the hierarchy
  16-byte memory, so 4-bit addresses
  Cache organization: direct-mapped, 8 total bytes, 2 bytes per block (4 lines), write-back
This leads to the following address breakdown:
  Offset: 1 bit, Index: 2 bits, Tag: 1 bit

Page 25

Direct-mapped cache example: initial state

Instructions:
  lb $t0, 1($zero)
  lb $t1, 8($zero)
  sb $t1, 4($zero)
  sb $t0, 13($zero)
  lb $t1, 9($zero)

Memory (address: value; two-byte blocks 0-7):
   0: 78     4: 71     8: 18     12: 19
   1: 29     5: 150    9: 21     13: 200
   2: 120    6: 162   10: 33     14: 210
   3: 123    7: 173   11: 28     15: 225

Cache (per line: V D Tag Data):
  Line 0: 0 0 0 [0, 0]
  Line 1: 0 0 0 [0, 0]
  Line 2: 0 0 0 [0, 0]
  Line 3: 0 0 0 [0, 0]

Registers: $t0 = ?, $t1 = ?

Page 26

Direct-mapped cache example: access #1
Instruction: lb $t0, 1($zero)

Address = 1 = 0001 (binary) -> Tag = 0, Index = 00, Offset = 1

Cache (per line: V D Tag Data): all four lines are still invalid.

Hits: 0, Misses: 0
Registers: $t0 = ?, $t1 = ?

Page 27

Direct-mapped cache example: access #1 (result)
Instruction: lb $t0, 1($zero)

Address = 1 = 0001 (binary) -> Tag = 0, Index = 00, Offset = 1
Miss: line 0 was invalid. Block 0 (memory bytes 0-1) is brought into line 0, and the load reads the byte at offset 1.

Cache (per line: V D Tag Data):
  Line 0: 1 0 0 [78, 29]
  Line 1: 0 0 0 [0, 0]
  Line 2: 0 0 0 [0, 0]
  Line 3: 0 0 0 [0, 0]

Hits: 0, Misses: 1
Registers: $t0 = 29, $t1 = ?

Page 28

Direct-mapped cache example: access #2
Instruction: lb $t1, 8($zero)

Address = 8 = 1000 (binary) -> Tag = 1, Index = 00, Offset = 0

Cache (per line: V D Tag Data):
  Line 0: 1 0 0 [78, 29]
  Line 1: 0 0 0 [0, 0]
  Line 2: 0 0 0 [0, 0]
  Line 3: 0 0 0 [0, 0]

Hits: 0, Misses: 1
Registers: $t0 = 29, $t1 = ?

Page 29

Direct-mapped cache example: access #2 (result)
Instruction: lb $t1, 8($zero)

Address = 8 = 1000 (binary) -> Tag = 1, Index = 00, Offset = 0
Miss: line 0 holds tag 0, not tag 1. Block 4 (memory bytes 8-9) replaces it; the old block was clean, so no write-back is needed.

Cache (per line: V D Tag Data):
  Line 0: 1 0 1 [18, 21]
  Line 1: 0 0 0 [0, 0]
  Line 2: 0 0 0 [0, 0]
  Line 3: 0 0 0 [0, 0]

Hits: 0, Misses: 2
Registers: $t0 = 29, $t1 = 18

Page 30

Direct-mapped cache example: access #3
Instruction: sb $t1, 4($zero)

Address = 4 = 0100 (binary) -> Tag = 0, Index = 10, Offset = 0

Cache (per line: V D Tag Data):
  Line 0: 1 0 1 [18, 21]
  Line 1: 0 0 0 [0, 0]
  Line 2: 0 0 0 [0, 0]
  Line 3: 0 0 0 [0, 0]

Hits: 0, Misses: 2
Registers: $t0 = 29, $t1 = 18

Page 31

Direct-mapped cache example: access #3 (result)
Instruction: sb $t1, 4($zero)

Address = 4 = 0100 (binary) -> Tag = 0, Index = 10, Offset = 0
Miss: line 2 was invalid. Block 2 (memory bytes 4-5) is brought in (write-allocate); the store then writes $t1 = 18 into the byte at offset 0 and sets the dirty bit. Memory still holds 71 at address 4; this write-back cache has not updated it yet.

Cache (per line: V D Tag Data):
  Line 0: 1 0 1 [18, 21]
  Line 1: 0 0 0 [0, 0]
  Line 2: 1 1 0 [18, 150]
  Line 3: 0 0 0 [0, 0]

Hits: 0, Misses: 3
Registers: $t0 = 29, $t1 = 18

Page 32

Direct-mapped cache example: access #4
Instruction: sb $t0, 13($zero)

Address = 13 = 1101 (binary) -> Tag = 1, Index = 10, Offset = 1

Cache (per line: V D Tag Data):
  Line 0: 1 0 1 [18, 21]
  Line 1: 0 0 0 [0, 0]
  Line 2: 1 1 0 [18, 150]
  Line 3: 0 0 0 [0, 0]

Hits: 0, Misses: 3
Registers: $t0 = 29, $t1 = 18

Page 33

Direct-mapped cache example: access #4 (write-back)
Instruction: sb $t0, 13($zero)

Address = 13 = 1101 (binary) -> Tag = 1, Index = 10, Offset = 1
Miss: line 2 holds tag 0, not tag 1, and the line is dirty. Must write back the dirty block first: memory bytes 4-5 are updated to [18, 150] (address 4 now holds 18 instead of 71).

Cache (per line: V D Tag Data):
  Line 0: 1 0 1 [18, 21]
  Line 1: 0 0 0 [0, 0]
  Line 2: 1 1 0 [18, 150]
  Line 3: 0 0 0 [0, 0]

Hits: 0, Misses: 4
Registers: $t0 = 29, $t1 = 18

Page 34

Direct-mapped cache example: access #4 (result)
Instruction: sb $t0, 13($zero)

Address = 13 = 1101 (binary) -> Tag = 1, Index = 10, Offset = 1
Block 6 (memory bytes 12-13) is brought into line 2; the store then writes $t0 = 29 into the byte at offset 1 and sets the dirty bit again.

Cache (per line: V D Tag Data):
  Line 0: 1 0 1 [18, 21]
  Line 1: 0 0 0 [0, 0]
  Line 2: 1 1 1 [19, 29]
  Line 3: 0 0 0 [0, 0]

Hits: 0, Misses: 4
Registers: $t0 = 29, $t1 = 18

Page 35

Direct-mapped cache example: access #5
Instruction: lb $t1, 9($zero)

Address = 9 = 1001 (binary) -> Tag = 1, Index = 00, Offset = 1

Cache (per line: V D Tag Data):
  Line 0: 1 0 1 [18, 21]
  Line 1: 0 0 0 [0, 0]
  Line 2: 1 1 1 [19, 29]
  Line 3: 0 0 0 [0, 0]

Hits: 0, Misses: 4
Registers: $t0 = 29, $t1 = 18

Page 36

Direct-mapped cache example: access #5 (result)
Instruction: lb $t1, 9($zero)

Address = 9 = 1001 (binary) -> Tag = 1, Index = 00, Offset = 1
Hit: line 0 is valid and holds tag 1. The load reads the byte at offset 1, which is 21.

Cache (per line: V D Tag Data):
  Line 0: 1 0 1 [18, 21]
  Line 1: 0 0 0 [0, 0]
  Line 2: 1 1 1 [19, 29]
  Line 3: 0 0 0 [0, 0]

Hits: 1, Misses: 4
Registers: $t0 = 29, $t1 = 21
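The whole walkthrough can be reproduced with a tiny simulator. This is a sketch of my own (not course code), assuming the slides' parameters: direct-mapped, 2-byte blocks, 4 lines, write-back, write-allocate:

BLOCK = 2    # bytes per block
LINES = 4    # cache lines

memory = [78, 29, 120, 123, 71, 150, 162, 173,
          18, 21, 33, 28, 19, 200, 210, 225]
cache = [{"v": 0, "d": 0, "tag": 0, "data": [0] * BLOCK} for _ in range(LINES)]
hits = misses = 0

def lookup(addr):
    """Return (line, offset) for addr, filling the line on a miss."""
    global hits, misses
    offset = addr % BLOCK
    index = (addr // BLOCK) % LINES
    tag = addr // (BLOCK * LINES)
    line = cache[index]
    if line["v"] and line["tag"] == tag:
        hits += 1
    else:
        misses += 1
        if line["v"] and line["d"]:          # write back the dirty victim
            old = (line["tag"] * LINES + index) * BLOCK
            memory[old:old + BLOCK] = line["data"]
        new = (tag * LINES + index) * BLOCK  # fetch the new block
        line.update(v=1, d=0, tag=tag, data=memory[new:new + BLOCK])
    return line, offset

def lb(addr):                                # load byte
    line, offset = lookup(addr)
    return line["data"][offset]

def sb(value, addr):                         # store byte (write-allocate)
    line, offset = lookup(addr)
    line["data"][offset] = value
    line["d"] = 1                            # mark the line dirty

t0 = lb(1); t1 = lb(8); sb(t1, 4); sb(t0, 13); t1 = lb(9)
print(hits, misses, t0, t1)                  # -> 1 4 29 21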

Page 37

Cache performance
Simplified model:
  CPU time = (CPU clock cycles + memory stall cycles) x cycle time
  Memory stall cycles = # of misses x miss penalty
                      = IC x misses/instruction x miss penalty
                      = IC x memory accesses/instruction x miss rate x miss penalty
  Average CPI = CPI(without stalls) + memory accesses/instruction x miss rate x miss penalty
  AMAT = hit time + miss rate x miss penalty

Page 38

Example
A computer has CPI = 1 when all accesses hit. Loads and stores are 50% of instructions. If the miss penalty is 25 cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits?
  All hits: CPU time = (IC x CPI + 0) x CCT = IC x 1.0 x CCT
  Real cache with stalls:
    Memory stall cycles = IC x (1 + 0.5) x 0.02 x 25 = IC x 0.75
      (1 instruction fetch + 0.5 data accesses per instruction)
    CPU time = (IC x 1.0 + IC x 0.75) x CCT = 1.75 x IC x CCT
  Speedup = (1.75 x IC x CCT) / (IC x CCT) = 1.75
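The same arithmetic in a few lines (IC and CCT cancel in the speedup ratio):

base_cpi = 1.0
accesses_per_instr = 1 + 0.5                 # 1 instruction fetch + 0.5 data accesses
stall_cpi = accesses_per_instr * 0.02 * 25   # = 0.75
print((base_cpi + stall_cpi) / base_cpi)     # -> 1.75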

Page 39

Average memory access time
For a unified cache:
  AMAT = (hit time) + (miss rate) x (miss penalty)
For a split cache:
  AMAT = %instructions x (hit time + instruction miss rate x miss penalty)
       + %data x (hit time + data miss rate x miss penalty)
For a multi-level cache:
  AMAT = hit time(L1) + miss rate(L1) x miss penalty(L1)
       = hit time(L1) + miss rate(L1) x (hit time(L2) + miss rate(L2) x miss penalty(L2))
Note that miss rate(L2) is a local miss rate: it is measured on the accesses left over from the L1 cache.

Page 40

Example (split cache vs. unified cache)
Which has the lower miss rate: a 16 KB instruction cache with a 16 KB data cache, or a 32 KB unified cache? The misses per 1000 instructions for the instruction, data, and unified caches are 3.82, 40.9, and 43.3, respectively. Assume 36% of instructions are data transfer instructions, a hit takes 1 clock cycle, and the miss penalty is 200 clock cycles. A load or store hit takes 1 extra cycle on the unified cache. What is the AMAT of each?

Find miss rate = (misses/instruction) / (memory accesses/instruction):
  Miss rate(I) = (3.82/1000) / 1 = 0.004
  Miss rate(D) = (40.9/1000) / 0.36 = 0.114
  Miss rate(U) = (43.3/1000) / (1 + 0.36) = 0.0318
  Miss rate(split) = 74% x 0.004 + 26% x 0.114 = 0.0326
    (74% and 26% are the instruction and data shares of the 1.36 accesses per instruction)
The 32 KB unified cache has a slightly lower miss rate.

Page 41

Example (cont.)
AMAT = %instructions x (hit time + instruction miss rate x miss penalty)
     + %data x (hit time + data miss rate x miss penalty)

AMAT(split) = 74% x (1 + 0.004 x 200) + 26% x (1 + 0.114 x 200) = 7.52
AMAT(unified) = 74% x (1 + 0.0318 x 200) + 26% x (1 + 1 + 0.0318 x 200) = 7.62
Despite its higher miss rate, the split cache has the lower AMAT.
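Checking the arithmetic in code (all numbers from the slides):

f_instr, f_data, penalty = 0.74, 0.26, 200
amat_split = f_instr * (1 + 0.004 * penalty) + f_data * (1 + 0.114 * penalty)
amat_unified = f_instr * (1 + 0.0318 * penalty) + f_data * (2 + 0.0318 * penalty)
print(round(amat_split, 2), round(amat_unified, 2))   # -> 7.52 7.62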

Page 42

Another example (multilevel cache)
Suppose that in 1000 memory references there are 40 misses in the L1 cache and 20 misses in the L2 cache. What are the miss rates? Assume the miss penalty from the L2 cache to memory is 200 clock cycles, the hit time of the L2 cache is 10 cycles, and the hit time of L1 is 1 cycle. What is the AMAT?

  Miss rate(L1) = 40/1000 = 4%
  Miss rate(L2) = 20/40 = 50% (local: measured only on the 40 references that miss in L1)

  AMAT = hit time(L1) + miss rate(L1) x (hit time(L2) + miss rate(L2) x miss penalty(L2))
       = 1 + 4% x (10 + 50% x 200) = 5.4 clock cycles
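And the same two-level formula in code:

l1_hit, l1_miss = 1, 0.04
l2_hit, l2_local_miss, l2_penalty = 10, 0.5, 200
print(l1_hit + l1_miss * (l2_hit + l2_local_miss * l2_penalty))   # -> 5.4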

Page 43

Reasons for cache misses
AMAT = (hit time) + (miss rate) x (miss penalty); reducing misses improves performance.
The three C's:
  Compulsory miss: the first reference to an address
    Reduced by increasing the block size
  Capacity miss: the cache is too small to hold the data
    Reduced by increasing the cache size
  Conflict miss: the block was replaced from a busy line or set, but would have hit in a fully associative cache
    Reduced by increasing associativity

Page 44

Six Basic Cache Optimizations
Reducing miss rate:
  1. Larger block size (compulsory misses)
  2. Larger cache size (capacity misses)
  3. Higher associativity (conflict misses)
Reducing miss penalty:
  4. Multilevel caches
  5. Giving read misses priority over writes
     (e.g., a read completes before earlier writes still waiting in the write buffer)
Reducing hit time:
  6. Avoiding address translation during cache indexing

Page 45

Problems with memory
DRAM is too expensive to buy many gigabytes.
We need our programs to work even if they require more memory than we have:
  A program that works on a machine with 512 MB should still work on a machine with 256 MB.
Most systems run multiple programs.

Page 46

Solutions
  Leave the problem to the programmer: assume the programmer knows the exact configuration
  Overlays: the compiler identifies mutually exclusive regions of the program
  Virtual memory: use hardware and software to automatically translate references from virtual addresses (what the programmer sees) to physical addresses (the index into DRAM or disk)

Page 47

Benefits of virtual memory

[Diagram: the CPU issues "virtual addresses"; address translation hardware, managed by the operating system (OS), maps each virtual address to a "physical address" that is presented to memory over the address and data buses.]

User programs run in a standardized virtual address space.
Hardware supports "modern" OS features: protection, translation, sharing.

Page 48

Managing virtual memory
Effectively treat main memory as a cache:
  Blocks are called pages
  Misses are called page faults
A virtual address consists of a virtual page number and a page offset:

  [ Virtual page number (bits 31-12) | Page offset (bits 11-0) ]
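A one-line split, assuming the 4 KB pages (12 offset bits) shown above:

PAGE_OFFSET_BITS = 12

def split_vaddr(vaddr):
    return vaddr >> PAGE_OFFSET_BITS, vaddr & ((1 << PAGE_OFFSET_BITS) - 1)

vpn, offset = split_vaddr(0x12345678)
print(hex(vpn), hex(offset))   # -> 0x12345 0x678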

Page 49

Page tables encode virtual address spaces
A virtual address space is divided into blocks of memory called pages.
A valid page table entry codes the physical memory "frame" address for the page.
A machine usually supports pages of a few sizes (e.g., the MIPS R4000).

[Diagram: pages of the virtual address space mapped onto frames of the physical address space.]

Page 50

Page tables encode virtual address spaces (cont.)
A page table is indexed by the virtual address.

[Diagram: a virtual address indexes the page table; each valid entry points to a frame of physical memory.]

Page 51

Details of the page table
The page table maps virtual page numbers to physical frames ("PTE" = page table entry).
Virtual memory treats main memory as a cache for disk.

[Diagram: the virtual address (virtual page number + 12-bit offset) and the page table base register together index into the page table, which is itself located in physical memory. Each entry holds a valid bit, access rights, and a physical address. The physical page number, concatenated with the unchanged 12-bit offset, forms the physical address.]
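A toy translation walk under these assumptions (single-level table, hypothetical PTE fields, 12-bit offset as above):

PAGE_OFFSET_BITS = 12
page_table = {0x12345: {"valid": True, "frame": 0x00042}}   # VPN -> PTE (made-up entry)

def translate(vaddr):
    vpn = vaddr >> PAGE_OFFSET_BITS
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    pte = page_table.get(vpn)
    if pte is None or not pte["valid"]:
        raise RuntimeError("page fault")   # the OS would bring the page in from disk
    return (pte["frame"] << PAGE_OFFSET_BITS) | offset

print(hex(translate(0x12345678)))   # -> 0x42678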

Page 52

Paging the page table
A table for 4 KB pages in a 32-bit address space has 1M entries, and each process needs its own address space!
Two-level page tables:
  32-bit virtual address = [ P1 index (bits 31-22) | P2 index (bits 21-12) | page offset (bits 11-0) ]
  The top-level table is wired into main memory.
  Only a subset of the 1024 second-level tables is in main memory; the rest are on disk or unallocated.

Page 53

VM and disk: page replacement policy
Each page table entry carries two bits:
  Dirty bit: set when the page has been written.
  Used bit: set to 1 on any reference.
Clock-style sweep over the set of all pages in memory:
  Tail pointer: clears the used bit in the page table.
  Head pointer: places pages on the free list if the used bit is still clear; schedules pages with the dirty bit set to be written to disk.
  Freed pages go on the free list of free pages.
The architect's role: support setting the dirty and used bits.

Page 54

Virtual memory performance
Address translation requires a physical memory access to read the page table; we must then access physical memory again to actually get the data.
  Each load performs at least 2 memory reads.
  Each store performs at least 1 memory read followed by a write.

Page 55

Improving virtual memory performance
Use a cache for common translations: the translation lookaside buffer (TLB).

[Diagram: the virtual page number is compared against TLB entries (valid bit, tag, physical page); on a hit, the physical page is concatenated with the unchanged page offset to form the physical address.]
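A sketch of the idea in software terms, reusing the hypothetical translate() helper from the page table sketch above (a real TLB is a small hardware cache, not a dictionary):

tlb = {}   # VPN -> frame; small, so lookups are fast

def translate_with_tlb(vaddr):
    vpn = vaddr >> PAGE_OFFSET_BITS
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    if vpn in tlb:                              # TLB hit: no page table access
        return (tlb[vpn] << PAGE_OFFSET_BITS) | offset
    paddr = translate(vaddr)                    # TLB miss: walk the page table
    tlb[vpn] = paddr >> PAGE_OFFSET_BITS        # cache the translation
    return paddr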

Page 56

Caches and virtual memory
We now have two different addresses, virtual and physical. Which should we use to access the cache?
  Physical address. Pros: simpler to manage. Cons: slower access (we must translate before the lookup).
  Virtual address. Pros: faster access. Cons: aliasing, difficult management.
  Use both: virtually indexed, physically tagged.

Page 57

Three Advantages of Virtual Memory
Translation:
  A program can be given a consistent view of memory, even though physical memory is scrambled.
  Makes multithreading reasonable (now used a lot!).
  Only the most important part of a program (the "working set") must be in physical memory.
  Contiguous structures (like stacks) use only as much physical memory as necessary, yet can still grow later.
Protection:
  Different threads (or processes) are protected from each other.
  Different pages can be given special behavior (read only, invisible to user programs, etc.).
  Kernel data is protected from user programs.
  Very important for protection from malicious programs.
Sharing:
  The same physical page can be mapped for multiple users ("shared memory").
  Allows programs to share the same physical memory without knowing what else is there.
Virtual memory also makes memory appear larger than it actually is.

Page 58

Average memory access time
AMAT = (hit time) + (miss rate) x (miss penalty)
Given the following:
  Cache: 1-cycle access time
  Memory: 100-cycle access time
  Disk: 10,000-cycle access time
What is the average memory access time if the cache hit rate is 90% and the memory hit rate is 80%?
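One way to work the exercise (the slide leaves it open; this applies the two-level AMAT form from the multilevel-cache slide, treating memory's 80% hit rate as a local rate on cache misses):

cache_time, mem_time, disk_time = 1, 100, 10_000
cache_miss, mem_miss = 1 - 0.90, 1 - 0.80
print(cache_time + cache_miss * (mem_time + mem_miss * disk_time))   # -> 211.0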