Maurizio Palesi 1
Memory Hierarchy
Maurizio Palesi
References
John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, second edition, Morgan Kaufmann
Chapter 5
Who Cares About the Memory Hierarchy?
[Figure: Processor-DRAM memory gap (latency), 1980-2000. CPU performance improves ~60%/yr (2x every 1.5 years, "Moore's Law"), while DRAM improves ~9%/yr (2x every 10 years); the processor-memory performance gap grows ~50% per year.]
Levels of the Memory Hierarchy

Level         Capacity     Access time            Cost/bit             Managed by        Staging/Xfer unit   Staged as
Registers     100s bytes   <10s ns                --                   prog./compiler    1-8 bytes           instr. operands
Cache         K bytes      10-100 ns              1-0.1 cents          cache controller  8-128 bytes         blocks
Main memory   M bytes      200-500 ns             $.0001-.00001 cents  OS                512-4K bytes        pages
Disk          G bytes      10 ms (10,000,000 ns)  10^-5 - 10^-6 cents  user/operator     Mbytes              files
Tape          infinite     sec-min                10^-8 cents          --                --                  --

Upper levels are faster; lower levels are larger.
What is a Cache?
Small, fast storage used to improve average access time to slow memory. It exploits spatial and temporal locality. In computer architecture, almost everything is a cache!
- Registers: a cache on variables
- First-level cache: a cache on the second-level cache
- Second-level cache: a cache on memory
- Memory: a cache on disk (virtual memory)
- TLB: a cache on the page table
- Branch prediction: a cache on prediction information?
The Principle of Locality
Programs access a relatively small portion of the address space at any instant of time.
Two different types of locality:
- Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
- Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
For the last 15 years, hardware has relied on locality for speed.
Exploit Locality
By taking advantage of the principle of locality:
- Present the user with as much memory as is available in the cheapest technology
- Provide access at the speed offered by the fastest technology
DRAM is slow but cheap and dense: a good choice for presenting the user with a BIG memory system.
SRAM is fast but expensive and not very dense: a good choice for providing the user with FAST access time.
General Principles
Locality:
- Temporal locality: referenced again soon
- Spatial locality: nearby items referenced soon
Locality + "smaller hardware is faster" = memory hierarchy
- Levels: each smaller, faster, and more expensive per byte than the level below
- Inclusive: data found in the top level is also found in the bottom
Definitions:
- Upper level is the one closer to the processor
- Block: the minimum unit that is present or not in the upper level
- Address = block frame address + block offset address
Memory Hierarchy: Terminology
Hit: the data appears in some block in the upper level (example: Block X)
- Hit rate: the fraction of memory accesses found in the upper level
- Hit time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
Miss: the data needs to be retrieved from a block in the lower level (Block Y)
- Miss rate = 1 - (hit rate)
- Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor
Hit time << miss penalty (500 instructions on the 21264!)
[Figure: block Blk X in the upper-level memory, block Blk Y in the lower-level memory; data moves to/from the processor through the upper level.]
Cache Measures
Average memory-access time = Hit time + Miss rate x Miss penalty [ns or clocks]
Miss penalty: time to replace a block from the lower level, including the time to deliver it to the CPU
- Access time: time to reach the lower level = f(latency to lower level)
- Transfer time: time to transfer the block = f(bandwidth between upper and lower levels)
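The average-access-time formula above can be evaluated directly; a minimal sketch with illustrative numbers (the values are not from the slides):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory-access time = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Illustrative: 1-cycle hit, 5% miss rate, 100-cycle miss penalty
print(amat(1, 0.05, 100))  # -> 6.0 cycles on average
```

Even a small miss rate dominates the average when the miss penalty is hundreds of cycles, which is why the rest of the lecture focuses on reducing misses.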
Block Size Tradeoff
In general, a larger block size takes advantage of spatial locality, BUT:
- A larger block size means a larger miss penalty: it takes longer to fill up the block
- If the block size is too big relative to the cache size, the miss rate will go up: too few cache blocks compromise temporal locality
In general, Average Access Time = Hit Time + Miss Penalty x Miss Rate.
[Figure: as block size grows, miss penalty rises steadily; miss rate first falls (exploiting spatial locality) then rises (fewer blocks compromise temporal locality); average access time is therefore U-shaped, with increased miss penalty and miss rate past the minimum.]
Four Questions for Memory Hierarchy Designers
Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)
Q1: Where can a block be placed in the upper level?
For block-frame address 12 in an 8-block cache (block no. 0-7):
- Fully associative: block 12 can go anywhere
- Direct mapped: block 12 can go only into block 4 (12 mod 8)
- Set associative (2-way, sets 0-3): block 12 can go anywhere in set 0 (12 mod 4)
[Figure: a 32-block memory (block-frame addresses 0-31) mapped into the 8-block cache under the three placement policies.]
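The three placement rules can be expressed as one function of the associativity; a sketch (the function name and interface are mine, not from the slides):

```python
def placement(block_addr, num_blocks, assoc):
    """Return the cache block frames where block_addr may be placed.
    assoc = 1 -> direct mapped; assoc = num_blocks -> fully associative."""
    num_sets = num_blocks // assoc
    s = block_addr % num_sets              # set index = block number mod number of sets
    return list(range(s * assoc, (s + 1) * assoc))

print(placement(12, 8, 8))  # fully associative: anywhere, frames 0-7
print(placement(12, 8, 1))  # direct mapped: frame 4 (12 mod 8)
print(placement(12, 8, 2))  # 2-way: set 0 (12 mod 4) -> frames [0, 1]
```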
Q2: How Is a Block Found If It Is in the Upper Level?
A tag on each block: there is no need to check the index or block offset.
Increasing associativity shrinks the index and expands the tag:
- Fully associative: no index
- Direct mapped: large index
Cache Direct Mapped
[Figure: a 32-word main memory (addresses 00000 (0) to 11111 (31)) divided into four partitions (Part. 0-3), each mapping onto the same 8-block direct-mapped cache (blocks 0-7).]
Cache Direct Mapped
[Figure: the same four memory partitions (Part. 0-3) map onto cache blocks 0-7; a tag stored with each cache block records which partition the block came from.]
Q3: Which Block Should be Replaced on a Miss?
Easy for direct mapped. For set associative or fully associative:
- Random (large associativities)
- LRU (smaller associativities)

Miss rates by replacement policy and associativity:

              2-way            4-way            8-way
Size          LRU     RND      LRU     RND      LRU     RND
16 KB         5.2%    5.7%     4.7%    5.3%     4.4%    5.0%
64 KB         1.9%    2.0%     1.5%    1.7%     1.4%    1.5%
256 KB        1.15%   1.17%    1.13%   1.13%    1.12%   1.12%
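A minimal sketch of LRU replacement within one set, using a simple recency-ordered list (class and method names are mine, not from the slides; real hardware approximates this with a few state bits):

```python
class LRUSet:
    """One cache set with LRU replacement; tags ordered from LRU to MRU."""
    def __init__(self, assoc):
        self.assoc = assoc
        self.tags = []              # index 0 = least recently used

    def access(self, tag):
        """Return True on hit; on miss, insert tag, evicting the LRU if full."""
        if tag in self.tags:
            self.tags.remove(tag)   # hit: move to MRU position
            self.tags.append(tag)
            return True
        if len(self.tags) == self.assoc:
            self.tags.pop(0)        # evict the least recently used tag
        self.tags.append(tag)
        return False

s = LRUSet(2)
print([s.access(t) for t in [1, 2, 1, 3, 2]])
# -> [False, False, True, False, False]: tag 2 was the LRU entry when 3
#    arrived, so it was evicted and the final access to 2 misses
```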
Q4: What Happens on a Write?
- Write through: the information is written to both the block in the cache and the block in the lower-level memory
- Write back: the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced
  - Is the block clean or dirty?
Pros and cons of each:
- WT: read misses cannot result in writes (because of replacements)
- WB: no repeated writes to the same location
WT is always combined with write buffers so that the processor doesn't wait for the lower-level memory.
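The two policies can be contrasted on a store; a sketch with invented names (not the authors' code), modeling a cache block as a dict:

```python
def store(cache_block, value, memory, addr, policy):
    """Write 'value' into a cache block under the given write policy."""
    cache_block["data"] = value
    if policy == "write-through":
        memory[addr] = value          # lower level is updated immediately
    else:                             # write-back
        cache_block["dirty"] = True   # memory is updated only on replacement

def replace(cache_block, memory, addr):
    """On replacement, a dirty write-back block must be written to memory."""
    if cache_block.get("dirty"):
        memory[addr] = cache_block["data"]
```

Repeated stores to the same block cost one memory write each under write-through, but only one write at replacement time under write-back.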
Write Buffer for Write Through
A write buffer is needed between the cache and memory:
- Processor: writes data into the cache and the write buffer
- Memory controller: writes the contents of the buffer to memory
The write buffer is just a FIFO:
- Typical number of entries: 4
- Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
The memory system designer's nightmare:
- Store frequency (w.r.t. time) > 1 / DRAM write cycle
- Write buffer saturation
[Figure: Processor -> Cache and Write buffer -> DRAM.]
How a Block is Found in Cache
[Figure: the CPU address is split into tag, index, and offset; the index selects a cache entry, the stored cache tag is compared against the address tag to produce hit/miss, and the matching cache data is driven onto the CPU data bus.]
How a Block is Found in Cache
[Figure: a two-way set-associative organization with two sets of address tags and data RAM, a 2:1 mux to select the way, and address bits used to select the correct RAM.]
Simplest Cache: Direct Mapped
[Figure: a 16-location memory (addresses 0 to F) alongside a 4-byte direct-mapped cache (cache index 0 to 3).]
Location 0 can be occupied by data from memory locations 0, 4, 8, ... etc.: in general, any memory location whose 2 LSBs of the address are 0s (Address<1:0> => cache index).
- Which one should we place in the cache?
- How can we tell which one is in the cache?
1 KB Direct Mapped Cache, 32B blocks
For a 2^N byte cache:
- The uppermost (32 - N) bits are always the Cache Tag
- The lowest M bits are the Byte Select (Block Size = 2^M)
[Figure: a 32-bit address split into Cache Tag (bits 31-10, e.g. 0x50), Cache Index (bits 9-5, e.g. 0x01), and Byte Select (bits 4-0, e.g. 0x00). The tag is stored with a valid bit as part of the cache "state"; each of the 32 entries holds bytes 0-31 of a block (byte 0 through byte 1023 over the whole cache).]
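The field widths for this example can be computed directly; a sketch (the helper name is mine):

```python
from math import log2

def field_bits(cache_bytes, block_bytes, addr_bits=32):
    """Return (tag, index, byte_select) bit widths for a direct-mapped cache."""
    byte_select = int(log2(block_bytes))             # M bits select the byte
    index = int(log2(cache_bytes // block_bytes))    # selects one of the blocks
    tag = addr_bits - index - byte_select            # everything above the index
    return tag, index, byte_select

print(field_bits(1024, 32))  # 1 KB cache, 32 B blocks -> (22, 5, 5)
```

This matches the figure: a 22-bit tag (bits 31-10), a 5-bit index (bits 9-5), and a 5-bit byte select (bits 4-0).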
Two-way Set Associative Cache
N-way set associative: N entries for each cache index
- N direct-mapped caches operate in parallel (N is typically 2 to 4)
Example: two-way set-associative cache
- The cache index selects a "set" from the cache
- The two tags in the set are compared in parallel
- Data is selected based on the tag comparison result
[Figure: two ways, each with a valid bit, cache tag, and cache data (Cache Block 0); the address tag is compared against both ways, the results are ORed into Hit, and a mux (Sel1/Sel0) selects the cache block.]
Disadvantage of Set Associative Cache
N-way set-associative cache vs. direct-mapped cache:
- N comparators vs. 1
- Extra MUX delay for the data
- Data comes AFTER hit/miss
In a direct-mapped cache, the cache block is available BEFORE hit/miss: it is possible to assume a hit and continue, recovering later on a miss.
[Figure: the same two-way organization as the previous slide, highlighting the compare-then-mux path the data must traverse.]
4 Questions for Memory Hierarchy
Q1: Where can a block be placed in the upper level? (Block placement)
- Fully associative, set associative, direct mapped
Q2: How is a block found if it is in the upper level? (Block identification)
- Tag / block
Q3: Which block should be replaced on a miss? (Block replacement)
- Random, LRU
Q4: What happens on a write? (Write strategy)
- Write back or write through (with write buffer)
Q1: Where can a block be placed in the upper level?
Block 12 placed in an 8-block cache:
- Fully associative: anywhere
- Direct mapped: only block (12 mod 8) = 4
- 2-way set associative: anywhere in set (12 mod 4) = 0
S.A. mapping = block number modulo number of sets.
[Figure: a 32-block memory (block-frame addresses 0-31) and the 8-block cache under the three placement policies.]
Q2: How is a block found if it is in the upper level?
A tag on each block: there is no need to check the index or block offset. Increasing associativity shrinks the index and expands the tag.

Address layout:  | Tag | Index | Block Offset |   (Tag + Index = Block Address)
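Splitting an address into these fields is a pair of shifts and masks; a sketch (the function and parameter names are mine):

```python
def split_address(addr, index_bits, offset_bits):
    """Split an address into (tag, index, block offset)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# Reusing the earlier 1 KB / 32 B example: 5 index bits, 5 offset bits
print(split_address((0x50 << 10) | (0x01 << 5), 5, 5))  # tag=0x50, index=1, offset=0
```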
Q3: Which block should be replaced on a miss?
Easy for direct mapped. For set associative or fully associative:
- Random
- LRU (Least Recently Used)

Miss rates by replacement policy and associativity:

              2-way            4-way            8-way
Size          LRU     RND      LRU     RND      LRU     RND
16 KB         5.2%    5.7%     4.7%    5.3%     4.4%    5.0%
64 KB         1.9%    2.0%     1.5%    1.7%     1.4%    1.5%
256 KB        1.15%   1.17%    1.13%   1.13%    1.12%   1.12%
Q4: What happens on a write?
- Write through: the information is written to both the block in the cache and the block in the lower-level memory
- Write back: the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced
  - Is the block clean or dirty?
Pros and cons of each:
- WT: read misses cannot result in writes
- WB: no repeated writes to the same location
WT is always combined with write buffers so that the processor doesn't wait for the lower-level memory.
Write Buffer for Write Through
A write buffer is needed between the cache and memory:
- Processor: writes data into the cache and the write buffer
- Memory controller: writes the contents of the buffer to memory
The write buffer is just a FIFO:
- Typical number of entries: 4
- Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
The memory system designer's nightmare:
- Store frequency (w.r.t. time) -> 1 / DRAM write cycle
- Write buffer saturation
[Figure: Processor -> Cache and Write Buffer -> DRAM.]
Reducing Misses
Classifying misses: the 3 Cs
- Compulsory: the first access to a block is not in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (Misses in even an infinite cache)
- Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a fully associative cache of size X)
- Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache of size X)
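A rough sketch that separates compulsory misses (first touch of a block) from the remaining misses in a direct-mapped cache. Fully separating capacity from conflict would additionally require a fully associative reference run, as the definitions above imply; everything here (names, trace) is illustrative, not the authors' methodology:

```python
def classify_misses(block_trace, num_blocks):
    """Count (compulsory, other) misses for a direct-mapped cache of block addresses."""
    cache = [None] * num_blocks
    seen = set()
    compulsory = other = 0
    for b in block_trace:
        idx = b % num_blocks
        if cache[idx] != b:
            if b in seen:
                other += 1          # capacity or conflict miss
            else:
                compulsory += 1     # first reference to this block
            cache[idx] = b
        seen.add(b)
    return compulsory, other

print(classify_misses([0, 8, 0, 8], 8))  # blocks 0 and 8 conflict on index 0 -> (2, 2)
```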
3Cs Absolute Miss Rate (SPEC92)
[Figure: miss rate (0 to 0.14) vs. cache size (1 to 128 KB), with curves for 1-way, 2-way, 4-way, and 8-way associativity above the capacity and compulsory components; the gap between each associativity curve and the capacity curve is the conflict component.]
2:1 Cache Rule
Miss rate of a 1-way associative cache of size X = miss rate of a 2-way associative cache of size X/2.
[Figure: the same miss rate vs. cache size plot (1 to 128 KB; 1-way through 8-way, capacity and compulsory components), illustrating the rule.]
3Cs Relative Miss Rate
[Figure: the same data normalized to 100%: the relative shares of conflict, capacity, and compulsory misses vs. cache size (1 to 128 KB) for 1-way through 8-way caches.]
How Can We Reduce Misses?
3 Cs: compulsory, capacity, conflict. In all cases, assume the total cache size is not changed. What happens if we:
- Change the block size: which of the 3Cs is obviously affected?
- Change the associativity: which of the 3Cs is obviously affected?
- Change the compiler: which of the 3Cs is obviously affected?
1. Reduce Misses via Larger Block Size
[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes of 1K, 4K, 16K, 64K, and 256K; larger blocks reduce the miss rate up to a point, beyond which small caches see the miss rate rise again.]
2. Reduce Misses via Higher Associativity
2:1 cache rule: miss rate of a DM cache of size N = miss rate of a 2-way cache of size N/2.
Beware: execution time is the only final measure!
- Will clock cycle time increase?
- Hill [1988] suggested hit time for 2-way vs. 1-way: external cache +10%, internal +2%
3. Reducing Misses via a "Victim Cache"
How to combine the fast hit time of direct mapped yet still avoid conflict misses? Add a small buffer to hold data discarded from the cache. Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache. Used in Alpha and HP machines.
[Figure: a fully associative victim cache of four lines, each with its own tag and comparator, sitting between the direct-mapped cache (TAGS/DATA) and the next lower level in the hierarchy.]
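A minimal sketch of a victim-cache lookup on a main-cache miss (class and method names are mine; this is the idea behind Jouppi's buffer, not his design verbatim):

```python
class VictimCache:
    """Small fully associative buffer of blocks evicted from the main cache."""
    def __init__(self, entries=4):
        self.entries = entries
        self.blocks = []                    # list of (tag, data), oldest first

    def insert(self, tag, data):
        """Hold a block just evicted from the direct-mapped cache."""
        if len(self.blocks) == self.entries:
            self.blocks.pop(0)              # discard the oldest victim
        self.blocks.append((tag, data))

    def lookup(self, tag):
        """On a main-cache miss, check the victims; a hit avoids going to memory."""
        for i, (t, d) in enumerate(self.blocks):
            if t == tag:
                self.blocks.pop(i)          # the block returns to the main cache
                return d
        return None

vc = VictimCache()
vc.insert(0x12, "evicted block")
print(vc.lookup(0x12))  # -> "evicted block"; a second lookup would miss
```

Blocks that ping-pong between two conflicting addresses in a direct-mapped cache are caught by the victim buffer instead of paying the full miss penalty.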
4. Reducing Misses via "Pseudo-Associativity"
How to combine the fast hit time of direct mapped with the lower conflict misses of a 2-way SA cache? Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, we have a pseudo-hit (slow hit).
- Drawback: the CPU pipeline is hard to design if a hit can take 1 or 2 cycles
- Better for caches not tied directly to the processor (L2)
- Used in the MIPS R10000 L2 cache; similar in UltraSPARC
[Timing: Hit Time, then Pseudo Hit Time, then Miss Penalty.]
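Under pseudo-associativity, the average access time picks up a middle term for the slow pseudo-hits; a sketch with illustrative numbers (the rates and latencies are not from the slides):

```python
def amat_pseudo(hit_time, pseudo_hit_rate, pseudo_hit_time,
                miss_rate, miss_penalty):
    """Average access time with fast hits, slow pseudo-hits, and misses."""
    return (hit_time
            + pseudo_hit_rate * pseudo_hit_time   # extra cycles for slow hits
            + miss_rate * miss_penalty)

# Illustrative: 1-cycle fast hit, 3% of accesses pseudo-hit costing 3 extra
# cycles, 2% miss with a 100-cycle penalty
print(amat_pseudo(1, 0.03, 3, 0.02, 100))  # ~3.09 cycles
```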