
Maurizio Palesi 1

Memory Hierarchy

Maurizio Palesi

Maurizio Palesi 2

References

John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, second edition, Morgan Kaufmann, Chapter 5.


Maurizio Palesi 3

Who Cares About the Memory Hierarchy?

[Figure: Processor-DRAM memory gap (latency). Log-scale performance, 1980-2000. Processor performance ("Moore's Law") improves ~60%/yr (2x every 1.5 years), while DRAM improves only ~9%/yr (2x every 10 years); the processor-memory performance gap grows about 50% per year.]

Maurizio Palesi 4


Maurizio Palesi 5

Maurizio Palesi 6


Maurizio Palesi 7

Levels of the Memory Hierarchy

CPU Registers: 100s of bytes, <10s ns; staging unit: instruction operands (managed by program/compiler, 1-8 bytes)
Cache: KBytes, 10-100 ns, 1-0.1 cents/bit; staging unit: blocks (managed by cache controller, 8-128 bytes)
Main Memory: MBytes, 200-500 ns, $0.0001-0.00001 cents/bit; staging unit: pages (managed by OS, 512-4K bytes)
Disk: GBytes, 10 ms (10,000,000 ns), 10^-5 - 10^-6 cents/bit; staging unit: files (managed by user/operator, MBytes)
Tape: infinite capacity, sec-min access, 10^-8 cents/bit

Moving up the hierarchy, levels get faster; moving down, they get larger.

Maurizio Palesi 8

What is a Cache?

Small, fast storage used to improve the average access time to slow memory. It exploits spatial and temporal locality. In computer architecture, almost everything is a cache!

Registers: a cache on variables
First-level cache: a cache on the second-level cache
Second-level cache: a cache on memory
Memory: a cache on disk (virtual memory)
TLB: a cache on the page table
Branch prediction: a cache on prediction information?


Maurizio Palesi 9

The Principle of Locality

Programs access a relatively small portion of the address space at any instant of time.

Two different types of locality:
Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, data reuse).
Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array accesses).

For the last 15 years, hardware has relied on locality for speed.

Maurizio Palesi 10

Exploit Locality

By taking advantage of the principle of locality:
Present the user with as much memory as is available in the cheapest technology.
Provide access at the speed offered by the fastest technology.

DRAM is slow but cheap and dense: a good choice for presenting the user with a BIG memory system.
SRAM is fast but expensive and not very dense: a good choice for providing the user with FAST access time.


Maurizio Palesi 11

General Principles

Locality:
Temporal locality: referenced items tend to be referenced again soon.
Spatial locality: nearby items tend to be referenced soon.

Locality + "smaller hardware is faster" = memory hierarchy.
Levels: each is smaller, faster, and more expensive per byte than the level below.
Inclusive: data found in the top level is also found in the level below.

Definitions:
Upper level: the level closer to the processor.
Block: the minimum unit of information that can be present or not present in the upper level.
Address = block frame address + block offset address.

Maurizio Palesi 12

Memory Hierarchy: Terminology

Hit: the data appears in some block in the upper level (example: block X).
Hit rate: the fraction of memory accesses found in the upper level.
Hit time: time to access the upper level, which consists of RAM access time + time to determine hit/miss.

Miss: the data must be retrieved from a block in the lower level (block Y).
Miss rate = 1 - (hit rate).
Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor.

Hit time << miss penalty (500 instructions on the Alpha 21264!)

[Figure: block X is exchanged between the upper-level memory and the processor; block Y is brought from the lower-level memory into the upper level.]


Maurizio Palesi 13

Cache Measures

Average memory access time = hit time + miss rate × miss penalty [ns or clocks]

Miss penalty: time to replace a block from the lower level, including the time to deliver it to the CPU.
Access time: time to reach the lower level = f(latency to lower level).
Transfer time: time to transfer the block = f(bandwidth between upper and lower levels).
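The average-memory-access-time formula above is simple enough to evaluate directly. A minimal sketch (function name and example numbers are illustrative, not from the slides):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty.

    All times in the same units (ns or clock cycles)."""
    return hit_time + miss_rate * miss_penalty

# e.g., 1-cycle hit, 5% miss rate, 100-cycle miss penalty:
print(amat(1, 0.05, 100))  # 6.0 cycles
```

Note how a small miss rate still dominates: even at 5% misses, the average access time is six times the hit time.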

Maurizio Palesi 14

Block Size Tradeoff

In general, a larger block size takes advantage of spatial locality, BUT:
A larger block size means a larger miss penalty: it takes longer to fill the block.
If the block size is too big relative to the cache size, the miss rate goes up: there are too few cache blocks.

In general, average access time = hit time + miss penalty × miss rate.

[Figure: three curves versus block size. Miss penalty grows with block size; miss rate first falls (exploiting spatial locality) and then rises once there are too few blocks (compromising temporal locality); average access time is therefore U-shaped, increasing at large block sizes due to the increased miss penalty and miss rate.]


Maurizio Palesi 15

Four Questions for Memory Hierarchy Designers

Q1: Where can a block be placed in the upper level? (Block placement)

Q2: How is a block found if it is in the upper level? (Block identification)

Q3: Which block should be replaced on a miss? (Block replacement)

Q4: What happens on a write? (Write strategy)

Maurizio Palesi 16

Q1: Where can a block be placed in the upper level?

[Figure: block-frame address 12 of a 32-block memory placed in an 8-block cache.
Fully associative: block 12 can go anywhere.
Direct mapped: block 12 can go only into block 4 (12 mod 8).
2-way set associative (4 sets): block 12 can go anywhere in set 0 (12 mod 4).]
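The three placement policies differ only in how many cache frames are candidates for a given block. A small sketch computing the candidate frames (function name is illustrative; sets are assumed to occupy contiguous frames, as in the figure):

```python
def placements(block_addr, num_blocks, assoc):
    """Cache frames where a memory block may reside.

    assoc = 1          -> direct mapped
    assoc = num_blocks -> fully associative
    otherwise          -> assoc-way set associative."""
    num_sets = num_blocks // assoc
    s = block_addr % num_sets            # set index = block number mod #sets
    return list(range(s * assoc, s * assoc + assoc))

# Block 12 in an 8-block cache:
print(placements(12, 8, 1))   # direct mapped   -> [4]          (12 mod 8)
print(placements(12, 8, 2))   # 2-way, 4 sets   -> [0, 1]       (set 0 = 12 mod 4)
print(placements(12, 8, 8))   # fully assoc.    -> [0, 1, ..., 7]
```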


Maurizio Palesi 17

Q2: How Is a Block Found If It Is in the Upper Level?

Tag on each block: no need to check the index or block offset.
Increasing associativity shrinks the index and expands the tag.
Fully associative: no index.
Direct mapped: large index.

Maurizio Palesi 18

Cache Direct Mapped

[Figure: a 32-block main memory (addresses 00000 (0) through 11111 (31)) divided into four partitions of eight blocks each; every partition maps onto the same 8-entry direct-mapped cache (entries 0-7).]


Maurizio Palesi 19

Cache Direct Mapped

[Figure: the same 32-block memory and 8-entry cache; a tag stored with each cache entry identifies which of the four memory partitions the cached block came from.]

Maurizio Palesi 20

Q3: Which Block Should Be Replaced on a Miss?

Easy for direct mapped. For set associative or fully associative:
Random (simple; used with large associativities)
LRU (better for smaller associativities)

Miss rates, LRU vs. random:

            2-way           4-way           8-way
Size        LRU    RND      LRU    RND      LRU    RND
16 KB       5.2%   5.7%     4.7%   5.3%     4.4%   5.0%
64 KB       1.9%   2.0%     1.5%   1.7%     1.4%   1.5%
256 KB      1.15%  1.17%    1.13%  1.13%    1.12%  1.12%


Maurizio Palesi 21

Q4: What Happens on a Write?

Write through: the information is written to both the block in the cache and the block in the lower-level memory.
Write back: the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced.
Is the block clean or dirty?

Pros and cons of each:
WT: read misses never cause writes to the lower level (no writeback on replacement).
WB: repeated writes to the same block cause no memory traffic.

WT is always combined with write buffers so the processor doesn't wait for the lower-level memory.

Maurizio Palesi 22

Write Buffer for Write Through

A write buffer is needed between the cache and memory:
The processor writes data into the cache and the write buffer.
The memory controller writes the contents of the buffer to memory.

The write buffer is just a FIFO:
Typical number of entries: 4.
Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle.

Memory system designer's nightmare:
Store frequency (w.r.t. time) approaches 1 / DRAM write cycle: write buffer saturation.

[Figure: Processor and Cache, with a write buffer between them and DRAM.]
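The FIFO behavior and the saturation condition above can be sketched behaviorally (class and method names are illustrative, not from the slides):

```python
from collections import deque

class WriteBuffer:
    """Minimal model of a FIFO write buffer between cache and DRAM."""
    def __init__(self, entries=4):
        self.entries = entries
        self.fifo = deque()

    def store(self, addr, data):
        """Processor side: returns False if the buffer is saturated
        (the processor would have to stall)."""
        if len(self.fifo) >= self.entries:
            return False
        self.fifo.append((addr, data))
        return True

    def drain_one(self, memory):
        """Memory-controller side: retire the oldest pending write."""
        if self.fifo:
            addr, data = self.fifo.popleft()
            memory[addr] = data

wb = WriteBuffer(entries=4)
mem = {}
accepted = [wb.store(a, a * 10) for a in range(5)]
print(accepted)      # [True, True, True, True, False]: 5th store saturates
wb.drain_one(mem)    # memory controller retires the write to address 0
print(mem)           # {0: 0}
```

If stores arrive faster than `drain_one` retires them, the buffer fills and every subsequent store stalls: exactly the saturation nightmare described above.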


Maurizio Palesi 23

How a Block Is Found in Cache

[Figure: the CPU address is split into tag, index, and offset fields. The index selects a cache entry; the stored cache tag is compared against the address tag to produce hit/miss, and the cache data is delivered on the CPU data bus.]

Maurizio Palesi 24

How a Block Is Found in Cache

[Figure: a two-way set-associative lookup. Two sets of address tags and data RAM are probed in parallel; address bits select the correct RAM entry, and a 2:1 mux selects the way whose tag matches.]


Maurizio Palesi 25

Simplest Cache: Direct Mapped

[Figure: a 16-location memory (addresses 0-F) and a 4-byte direct-mapped cache (indices 0-3).]

Cache location 0 can be occupied by data from memory locations 0, 4, 8, etc.
In general: any memory location whose 2 LSBs of the address are 0s.
Address<1:0> => cache index.

Which one should we place in the cache?
How can we tell which one is in the cache?

Maurizio Palesi 26

1 KB Direct Mapped Cache, 32 B blocks

For a 2^N-byte cache with a block size of 2^M bytes:
The uppermost (32 - N) bits of the address are the cache tag.
The lowest M bits are the byte select.

[Figure: a 32-bit address split into cache tag (bits 31-10, e.g. 0x50), cache index (bits 9-5, e.g. 0x01), and byte select (bits 4-0, e.g. 0x00). Each of the 32 cache entries holds a valid bit, a tag stored as part of the cache "state", and a 32-byte block (bytes 0-31, 32-63, ..., 992-1023).]
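The field split for this 1 KB, 32 B-block cache can be computed directly from the two sizes. A sketch (function name is illustrative; the example address is chosen to reproduce the tag/index/byte-select values in the figure):

```python
def split_address(addr, cache_bytes=1024, block_bytes=32):
    """Split a 32-bit address into (tag, index, byte_select)
    for a direct-mapped cache of the given geometry."""
    m = block_bytes.bit_length() - 1        # 5 offset bits for 32 B blocks
    n = cache_bytes.bit_length() - 1        # 10 bits cover a 1 KB cache
    index_bits = n - m                      # 5 index bits -> 32 entries
    byte_sel = addr & (block_bytes - 1)     # bits <4:0>
    index = (addr >> m) & ((1 << index_bits) - 1)  # bits <9:5>
    tag = addr >> n                         # bits <31:10>
    return tag, index, byte_sel

# Address 0x14020 has tag 0x50, index 0x01, byte select 0x00, as in the figure:
print([hex(f) for f in split_address(0x14020)])  # ['0x50', '0x1', '0x0']
```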


Maurizio Palesi 27

Two-way Set Associative Cache

N-way set associative: N entries for each cache index.
N direct-mapped caches operate in parallel (N is typically 2 to 4).

Example: two-way set associative cache.
The cache index selects a "set" from the cache.
The two tags in the set are compared in parallel.
Data is selected based on the tag comparison result.

[Figure: two banks of valid/tag/data arrays indexed by the cache index; the address tag is compared against both stored tags in parallel, the compare results are ORed into Hit, and a mux (Sel1/Sel0) selects the matching cache block.]
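The parallel compare-and-select datapath above can be modeled behaviorally. A sketch, with illustrative names (real hardware does the two tag compares and the OR in parallel; here they are just sequential Python):

```python
class TwoWaySetAssocCache:
    """Behavioral model of a 2-way set-associative lookup."""
    def __init__(self, num_sets):
        self.num_sets = num_sets
        # Each way holds per-set (valid, tag, data) entries.
        self.ways = [[(False, None, None)] * num_sets for _ in range(2)]

    def fill(self, way, index, tag, data):
        self.ways[way][index] = (True, tag, data)

    def lookup(self, tag, index):
        """Both stored tags are compared against the address tag;
        the 'mux' then picks the data of the matching way."""
        entries = (self.ways[0][index], self.ways[1][index])
        hits = [valid and t == tag for (valid, t, _) in entries]
        hit = hits[0] or hits[1]          # OR of the two comparators
        data = None
        if hits[0]:
            data = entries[0][2]          # Sel0: way 0 matched
        elif hits[1]:
            data = entries[1][2]          # Sel1: way 1 matched
        return hit, data

cache = TwoWaySetAssocCache(num_sets=4)
cache.fill(0, index=2, tag=0x1A, data="blk A")
cache.fill(1, index=2, tag=0x3C, data="blk B")   # same set, different tag
print(cache.lookup(0x3C, index=2))   # (True, 'blk B')
print(cache.lookup(0x77, index=2))   # (False, None)
```

Two blocks with the same index can now coexist, which is precisely what eliminates the conflict miss a direct-mapped cache would take here.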

Maurizio Palesi 28

Disadvantage of Set Associative Cache

N-way set associative cache vs. direct mapped cache:
N comparators vs. 1.
Extra mux delay for the data.
Data arrives AFTER hit/miss is known.

In a direct mapped cache, the cache block is available BEFORE hit/miss: it is possible to assume a hit and continue, recovering later on a miss.

[Figure: the same two-way lookup datapath as on the previous slide, highlighting that the data mux must wait for the tag comparison.]


Maurizio Palesi 29

4 Questions for Memory Hierarchy

Q1: Where can a block be placed in the upper level? (Block placement)
Fully associative, set associative, direct mapped.

Q2: How is a block found if it is in the upper level? (Block identification)
Tag / block.

Q3: Which block should be replaced on a miss? (Block replacement)
Random, LRU.

Q4: What happens on a write? (Write strategy)
Write back or write through (with write buffer).

Maurizio Palesi 30

Q1: Where can a block be placed in the upper level?

Block 12 placed in an 8-block cache: fully associative, direct mapped, or 2-way set associative.
Set-associative mapping = block number modulo number of sets.

[Figure: a 32-block memory (block-frame addresses 0-31) and an 8-block cache. Fully mapped: anywhere. Direct mapped: frame 4 (12 mod 8). 2-way set associative: set 0 (12 mod 4).]


Maurizio Palesi 31

Q2: How is a block found if it is in the upper level?

Tag on each block: no need to check the index or block offset.
Increasing associativity shrinks the index and expands the tag.

Address layout: [ Tag | Index | Block Offset ], where tag and index together form the block address.

Maurizio Palesi 32

Q3: Which block should be replaced on a miss?

Easy for direct mapped. For set associative or fully associative:
Random
LRU (Least Recently Used)

Miss rates, LRU vs. random:

            2-way           4-way           8-way
Size        LRU    RND      LRU    RND      LRU    RND
16 KB       5.2%   5.7%     4.7%   5.3%     4.4%   5.0%
64 KB       1.9%   2.0%     1.5%   1.7%     1.4%   1.5%
256 KB      1.15%  1.17%    1.13%  1.13%    1.12%  1.12%
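LRU itself is easy to state: on a hit, mark the block most recently used; on a miss with a full set, evict the least recently used block. A minimal sketch for a fully associative cache (function name and trace are illustrative):

```python
from collections import OrderedDict

def lru_misses(trace, capacity):
    """Count misses for a fully associative cache of `capacity` blocks
    with LRU replacement (behavioral model, not a hardware design)."""
    cache = OrderedDict()   # keys ordered least- to most-recently used
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)       # hit: now most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the LRU block
            cache[block] = True
    return misses

# A, B, C miss; A hits; D evicts B (the LRU block); then B misses again:
print(lru_misses(["A", "B", "C", "A", "D", "B"], capacity=3))  # 5
```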


Maurizio Palesi 33

Q4: What happens on a write?

Write through: the information is written to both the block in the cache and the block in the lower-level memory.
Write back: the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced.
Is the block clean or dirty?

Pros and cons of each:
WT: read misses never cause writes to the lower level.
WB: no repeated writes to the same location reach memory.

WT is always combined with write buffers so the processor doesn't wait for the lower-level memory.
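The "no repeated writes" advantage of write back is easy to quantify with a toy single-block model (function name and trace are illustrative, not from the slides):

```python
def memory_writes(trace, policy):
    """Lower-level memory writes caused by repeated stores to ONE block.

    Toy model: write-through sends every store to memory;
    write-back writes the dirty block once, when it is evicted."""
    if policy == "write-through":
        return len(trace)
    elif policy == "write-back":
        return 1 if trace else 0
    raise ValueError(policy)

stores = ["blk0"] * 10             # ten stores to the same block
print(memory_writes(stores, "write-through"))  # 10 memory writes
print(memory_writes(stores, "write-back"))     # 1 (on eviction)
```

Ten stores to one hot block cost ten DRAM writes under write through but only a single writeback under write back, which is why write back wins on write-intensive, high-locality workloads.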

Maurizio Palesi 34

Write Buffer for Write Through

A write buffer is needed between the cache and memory:
The processor writes data into the cache and the write buffer.
The memory controller writes the contents of the buffer to memory.

The write buffer is just a FIFO:
Typical number of entries: 4.
Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle.

Memory system designer's nightmare:
Store frequency (w.r.t. time) → 1 / DRAM write cycle: write buffer saturation.

[Figure: Processor and Cache, with a write buffer between them and DRAM.]


Maurizio Palesi 35

Reducing Misses

Classifying misses: the 3 Cs.

Compulsory: the first access to a block cannot be in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (These occur even in an infinite cache.)

Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses occur because blocks are discarded and later retrieved. (Misses in a fully associative cache of size X.)

Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) occur because a block can be discarded and later retrieved when too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache of size X.)

Maurizio Palesi 36

3Cs Absolute Miss Rate (SPEC92)

[Figure: absolute miss rate (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and 8-way associativity, with the capacity and compulsory components marked; the conflict component shrinks as associativity grows.]


Maurizio Palesi 37

2:1 Cache Rule

Miss rate of a 1-way associative cache of size X ≈ miss rate of a 2-way associative cache of size X/2.

[Figure: the same miss-rate vs. cache-size plot (1 KB to 128 KB, 1-way through 8-way), illustrating the rule.]

Maurizio Palesi 38

3Cs Relative Miss Rate

[Figure: miss-rate components normalized to 100% vs. cache size (1 KB to 128 KB) for 1-way through 8-way associativity, showing the relative shares of conflict, capacity, and compulsory misses.]


Maurizio Palesi 39

How Can We Reduce Misses?

3 Cs: compulsory, capacity, conflict. In all cases, assume the total cache size is unchanged. What happens if we:
Change the block size: which of the 3 Cs is obviously affected?
Change the associativity: which of the 3 Cs is obviously affected?
Change the compiler: which of the 3 Cs is obviously affected?

Maurizio Palesi 40

1. Reduce Misses via Larger Block Size

[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes from 1 KB to 256 KB; larger blocks reduce misses up to a point, after which the miss rate rises again, most sharply for the small caches.]


Maurizio Palesi 41

2. Reduce Misses via Higher Associativity

2:1 cache rule: miss rate of a direct-mapped cache of size N ≈ miss rate of a 2-way cache of size N/2.

Beware: execution time is the only final measure!
Will the clock cycle time increase?
Hill [1988] suggested hit times for 2-way vs. 1-way:
external cache: +10%
internal cache: +2%

Maurizio Palesi 42

3. Reducing Misses via a "Victim Cache"

How to combine the fast hit time of direct mapped yet still avoid conflict misses? Add a small buffer holding data discarded from the cache.
Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflicts for a 4 KB direct-mapped data cache.
Used in Alpha and HP machines.

[Figure: a direct-mapped cache (tags + data) backed by a small fully associative victim cache of four cache lines, each with its own tag and comparator, sitting between the cache and the next lower level of the hierarchy.]
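The victim-cache idea can be sketched behaviorally: on a main-cache miss, check the small victim buffer before going to memory, and on every eviction push the displaced line into it. Class and method names below are illustrative (real victim caches also swap lines and compare tags in parallel):

```python
class VictimCache:
    """Direct-mapped cache backed by a tiny FIFO victim buffer
    holding recently evicted lines (behavioral model)."""
    def __init__(self, num_lines, victim_entries=4):
        self.lines = [None] * num_lines        # direct-mapped: one block per line
        self.num_lines = num_lines
        self.victims = []                      # FIFO of evicted blocks
        self.victim_entries = victim_entries

    def access(self, block):
        idx = block % self.num_lines
        if self.lines[idx] == block:
            return "hit"
        if block in self.victims:              # conflict miss avoided
            self.victims.remove(block)
            self._evict_to_victim(idx)
            self.lines[idx] = block
            return "victim hit"
        self._evict_to_victim(idx)             # true miss: go to lower level
        self.lines[idx] = block
        return "miss"

    def _evict_to_victim(self, idx):
        if self.lines[idx] is not None:
            self.victims.append(self.lines[idx])
            if len(self.victims) > self.victim_entries:
                self.victims.pop(0)            # oldest victim is dropped

c = VictimCache(num_lines=4)
# Blocks 0 and 4 conflict: both map to line 0 of the direct-mapped cache.
print(c.access(0))   # miss
print(c.access(4))   # miss (block 0 evicted into the victim cache)
print(c.access(0))   # victim hit: the conflict never reaches memory
```

Without the victim buffer, the third access would be a full conflict miss; this is exactly the class of misses Jouppi's 4-entry buffer removes.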


Maurizio Palesi 43

4. Reducing Misses via "Pseudo-Associativity"

How to combine the fast hit time of direct mapped with the lower conflict misses of a 2-way set-associative cache? Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, we have a pseudo-hit (a slow hit).

Drawback: the CPU pipeline is hard to design if a hit can take 1 or 2 cycles.
Better for caches not tied directly to the processor (L2).
Used in the MIPS R10000 L2 cache; similar in UltraSPARC.

[Timing: hit time < pseudo-hit time < miss penalty.]