Memory Hierarchy: Cache

Memory hierarchy
Cache basics
Locality
Cache organization
Cache-aware programming

How does execution time grow with SIZE?

    int[] array = new int[SIZE];
    fillArrayRandomly(array);
    int s = 0;
    for (int i = 0; i < 200000; i++) {
        for (int j = 0; j < SIZE; j++) {
            s += array[j];
        }
    }

[Figure: TIME versus SIZE]

Reality beyond O(...)

[Figure: measured Time (axis 0 to 45) versus SIZE (axis 0 to 9000)]

Processor-Memory Bottleneck

Processor performance doubled about every 18 months.
Bus bandwidth evolved much slower.

Example:
  CPU and registers: bandwidth 256 bytes/cycle, latency 1 to a few cycles
  Main memory: bandwidth 2 bytes/cycle, latency 100 cycles

Solution: caches (a cache sits between the CPU and main memory)

Cache

English:
  n. a hidden storage space for provisions, weapons, or treasures
  v. to store away in hiding for future use

Computer Science:
  n. a computer memory with short access time used to store frequently or recently used instructions or data
  v. to store [data/instructions] temporarily for later quick retrieval

Also used more broadly in CS: software caches, file caches, etc.

General Cache Mechanics

CPU

Cache: smaller, faster, more expensive. Stores a subset of memory blocks (lines).
  Example contents: blocks 8, 9, 14, 3

Data is moved in block units.

Memory: larger, slower, cheaper. Partitioned into blocks (lines).
  Example contents: blocks 0 to 15

Block: unit of data in cache and memory (a.k.a. line).

Cache Hit

Cache holds blocks 8, 9, 14, 3. Memory holds blocks 0 to 15.

1. Request data in block b. (Request: 14)
2. Cache hit: block b is in cache. (Block 14 is delivered to the CPU.)

Cache Miss

Cache holds blocks 8, 9, 14, 3. Memory holds blocks 0 to 15.

1. Request data in block b. (Request: 12)
2. Cache miss: block is not in cache.
3. Cache eviction: evict a block to make room, maybe store it to memory.
4. Cache fill: fetch block from memory, store it in cache.

Placement policy: where to put the block in the cache.
Replacement policy: which block to evict.

Locality: why caches work

Programs tend to use data and instructions at addresses near or equal to those they have used recently.

Temporal locality: recently referenced items are likely to be referenced again in the near future.

Spatial locality: items with nearby addresses are likely to be referenced close together in time.

How do caches exploit temporal and spatial locality?

Locality #1

    sum = 0;
    for (i = 0; i < n; i++) {
        sum += a[i];
    }
    return sum;

What is stored in memory?

Data:
  Temporal: sum referenced in each iteration
  Spatial: array a[] accessed in stride-1 pattern

Instructions:
  Temporal: execute loop repeatedly
  Spatial: execute instructions in sequence

Assessing locality in code is an important programming skill.

Locality #2

    int sum_array_rows(int a[M][N]) {
        int sum = 0;
        for (int i = 0; i < M; i++) {
            for (int j = 0; j < N; j++) {
                sum += a[i][j];
            }
        }
        return sum;
    }

Row-major M x N 2D array in C (shown for M = 3, N = 4):

    a[0][0] a[0][1] a[0][2] a[0][3]
    a[1][0] a[1][1] a[1][2] a[1][3]
    a[2][0] a[2][1] a[2][2] a[2][3]

Access order (stride 1):

     1: a[0][0]   2: a[0][1]   3: a[0][2]   4: a[0][3]
     5: a[1][0]   6: a[1][1]   7: a[1][2]   8: a[1][3]
     9: a[2][0]  10: a[2][1]  11: a[2][2]  12: a[2][3]

Locality #3

    int sum_array_cols(int a[M][N]) {
        int sum = 0;
        for (int j = 0; j < N; j++) {
            for (int i = 0; i < M; i++) {
                sum += a[i][j];
            }
        }
        return sum;
    }

Row-major M x N 2D array in C (shown for M = 3, N = 4):

    a[0][0] a[0][1] a[0][2] a[0][3]
    a[1][0] a[1][1] a[1][2] a[1][3]
    a[2][0] a[2][1] a[2][2] a[2][3]

Access order (stride N):

     1: a[0][0]   2: a[1][0]   3: a[2][0]
     4: a[0][1]   5: a[1][1]   6: a[2][1]
     7: a[0][2]   8: a[1][2]   9: a[2][2]
    10: a[0][3]  11: a[1][3]  12: a[2][3]

Locality #4

What is "wrong" with this code? How can it be fixed?

    int sum_array_3d(int a[M][N][N]) {
        int sum = 0;
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                for (int k = 0; k < M; k++) {
                    sum += a[k][i][j];
                }
            }
        }
        return sum;
    }

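One possible fix, as a sketch: reorder the loops so the innermost loop walks the last (fastest-varying) index of the row-major array, turning the stride-N*N pattern into stride 1. The function name sum_array_3d_fixed and the concrete values of M and N are assumptions for illustration only.

```c
/* Assumed dimensions for illustration; the slide leaves M and N unspecified. */
#define M 3
#define N 4

/* Visit a[k][i][j] in the order it is laid out in memory (stride 1):
 * k indexes the outermost dimension, j the innermost. */
int sum_array_3d_fixed(int a[M][N][N]) {
    int sum = 0;
    for (int k = 0; k < M; k++) {
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                sum += a[k][i][j];
            }
        }
    }
    return sum;
}
```

The result is the same as the original; only the order of memory accesses changes.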
Cost of Cache Misses

Huge difference between a hit and a miss.
Could be 100x, if just L1 and main memory.

99% hits could be twice as good as 97%. How?
Assume cache hit time of 1 cycle, miss penalty of 100 cycles.

Mean access time:
  97% hits (3% misses): 1 cycle + 0.03 * 100 cycles = 4 cycles
  99% hits (1% misses): 1 cycle + 0.01 * 100 cycles = 2 cycles

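The mean access time arithmetic above can be written as a one-line function. This is a minimal sketch; the name mean_access_time and its parameter names are chosen here for illustration.

```c
/* Mean access time = hit time + miss rate * miss penalty (all in cycles). */
double mean_access_time(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}
```

Calling it with (1, 0.03, 100) and (1, 0.01, 100) gives the 4-cycle and 2-cycle figures from the slide.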
Cache Performance Metrics

Miss Rate
  Fraction of memory accesses to data not in cache (misses / accesses).
  Typically: 3% - 10% for L1; maybe < 1% for L2, depending on size, etc.

Hit Time
  Time to find and deliver a block in the cache to the processor.
  Typically: 1 - 2 clock cycles for L1; 5 - 20 clock cycles for L2.

Miss Penalty
  Additional time required on cache miss = main memory access time.
  Typically 50 - 200 cycles for L2 (trend: increasing!)

Memory hierarchy: why does it work?

From small, fast, power-hungry, expensive (top) to large, slow, power-efficient, cheap (bottom):

  registers
  L1 cache (SRAM, on-chip)
  L2 cache (SRAM, on-chip)
  L3 cache (SRAM, off-chip)
  main memory (DRAM)
  persistent storage (hard disk, flash, over network, cloud, etc.)

Registers: explicitly program-controlled.
Caches and main memory: program sees "memory"; hardware manages caching transparently.

Cache Organization: Key Points

Block
  Fixed-size unit of data in memory/cache.

Placement Policy
  Where should a given block be stored in the cache?
  • direct-mapped, set associative

Replacement Policy
  What if there is no room in the cache for requested data?
  • least recently used, most recently used

Write Policy
  When should writes update lower levels of memory hierarchy?
  • write back, write through, write allocate, no write allocate

Blocks

Divide memory into fixed-size aligned blocks (block size is a power of 2).

Example: block size = 8

  Memory (byte) address   Block
  00000000                block 0
  00001000                block 1
  00010000                block 2
  00011000                block 3

Full byte address: 00010010
  Block ID: 00010 (address bits - offset bits)
  Offset within block: 010 (log2(block size) bits)

Block 2 holds byte addresses 00010000, 00010001, 00010010, 00010011, 00010100, 00010101, 00010110, 00010111.

Remember withinSameBlock? (Pointers Lab) ...

Note: drawing address order differently from here on!

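As a small illustration of the split described above, the block ID and the offset can be extracted with a shift and a mask. This is a sketch; the function names block_id and block_offset are chosen here, and b stands for log2(block size).

```c
#include <stdint.h>

/* Offset within block: the low b bits, where block size = 2^b bytes. */
uint32_t block_offset(uint32_t addr, unsigned b) {
    return addr & ((1u << b) - 1u);
}

/* Block ID: the remaining high bits of the address. */
uint32_t block_id(uint32_t addr, unsigned b) {
    return addr >> b;
}
```

For the example address 00010010 with 8-byte blocks (b = 3), block_id returns 2 (binary 00010) and block_offset returns 2 (binary 010).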
Placement Policy

Memory: large, fixed number of block slots (Block IDs 0000 to 1111).
Cache: small, fixed number of block slots. S = # slots = 4 (Index 00, 01, 10, 11).

Mapping: index(Block ID) = ???

Placement: Direct-Mapped

Memory: Block IDs 0000 to 1111.
Cache: S = # slots = 4 (Index 00, 01, 10, 11).

Mapping: index(Block ID) = Block ID mod S
(easy for power-of-2 block sizes...)

Placement: mapping ambiguity

Memory: Block IDs 0000 to 1111.
Cache: S = # slots = 4 (Index 00, 01, 10, 11).

Mapping: index(Block ID) = Block ID mod S

Which block is in slot 2?

Placement: Tags resolve ambiguity

Memory: Block IDs 0000 to 1111.

Mapping: index(Block ID) = Block ID mod S

Tag: Block ID bits not used for index.

Cache (S slots), each slot storing a tag alongside its data:

  Index  Tag  Data
  00     00
  01     11
  10     01
  11     01

Address = Tag, Index, Offset

Full byte address: 00010010 (a-bit address)

  | Tag: (a-s-b) bits | Index: s bits | Offset: b bits |

  Offset: where within a block? b = log2(block size) bits.
  Block ID: address bits - offset bits.
  Index: what slot in the cache? s = log2(# cache slots) bits.
  Tag: disambiguates slot contents. Block ID bits - index bits.

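The three fields can be pulled out of an address with shifts and masks. The sketch below assumes 2^s cache slots and 2^b-byte blocks; the struct and function names (addr_fields, split_address) are made up for this illustration.

```c
#include <stdint.h>

/* Fields of an address for a cache with 2^s slots and 2^b-byte blocks. */
struct addr_fields {
    uint32_t tag;
    uint32_t index;
    uint32_t offset;
};

struct addr_fields split_address(uint32_t addr, unsigned s, unsigned b) {
    struct addr_fields f;
    f.offset = addr & ((1u << b) - 1u);        /* low b bits */
    f.index = (addr >> b) & ((1u << s) - 1u);  /* next s bits: Block ID mod 2^s */
    f.tag = addr >> (s + b);                   /* remaining high bits */
    return f;
}
```

With the address 00010010, 8-byte blocks (b = 3), and 4 slots (s = 2), this yields offset 010, index 10, and tag 000.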
Placement: Direct-Mapped

Memory: Block IDs 0000 to 1111.
Cache: Index 00, 01, 10, 11.

Why not this mapping?
  index(Block ID) = Block ID / S
(still easy for power-of-2 block sizes...)

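A puzzle.

Cache starts empty. Access (address, hit/miss) stream:
  (10, miss), (11, hit), (12, miss)

What could the block size be?
  block size >= 2 bytes (11 hits, so it shares a block with 10)
  block size < 8 bytes (12 misses, so it is not in the same block as 10)
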
Placement: direct mapping conflicts

What happens when accessing in repeated pattern: 0010, 0110, 0010, 0110, 0010...?

Cache conflict: every access suffers a miss and evicts the cache line needed by the next access.

Placement: Set Associative

Index per set of block slots. Store block in any slot within set.

Mapping: index(Block ID) = Block ID mod S
S = # sets in cache

Ways to organize 8 block slots:
  1-way: 8 sets, 1 block each (direct mapped)
  2-way: 4 sets, 2 blocks each
  4-way: 2 sets, 4 blocks each
  8-way: 1 set, 8 blocks (fully associative)

Replacement policy: if set is full, what block should be replaced?
Common: least recently used (LRU), but hardware usually implements "not most recently used".

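A small simulator can make the effect of associativity concrete. The sketch below models only hits and misses (no data), stores the full block ID in place of a tag for simplicity, and uses true LRU, which real hardware often only approximates. All names and the size limits (MAX_SETS, MAX_WAYS) are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_SETS 64   /* assumed limits for this sketch */
#define MAX_WAYS 8

struct line {
    bool valid;
    uint32_t block_id;   /* full block ID kept in place of a tag */
    unsigned last_used;  /* time of most recent access, for LRU */
};

struct cache {
    unsigned num_sets;   /* S */
    unsigned ways;       /* E */
    unsigned clock;
    struct line lines[MAX_SETS][MAX_WAYS];
};

void cache_init(struct cache *c, unsigned num_sets, unsigned ways) {
    c->num_sets = num_sets;
    c->ways = ways;
    c->clock = 0;
    for (unsigned s = 0; s < MAX_SETS; s++)
        for (unsigned w = 0; w < MAX_WAYS; w++)
            c->lines[s][w].valid = false;
}

/* Access one block; returns true on a hit, false on a miss (after filling). */
bool cache_access(struct cache *c, uint32_t block_id) {
    struct line *set = c->lines[block_id % c->num_sets];
    struct line *victim = &set[0];
    c->clock++;
    for (unsigned w = 0; w < c->ways; w++) {
        if (set[w].valid && set[w].block_id == block_id) {
            set[w].last_used = c->clock;
            return true;
        }
        /* Prefer an empty slot; otherwise pick the least recently used line. */
        if (!set[w].valid) {
            if (victim->valid) victim = &set[w];
        } else if (victim->valid && set[w].last_used < victim->last_used) {
            victim = &set[w];
        }
    }
    victim->valid = true;
    victim->block_id = block_id;
    victim->last_used = c->clock;
    return false;
}

/* Count misses over a stream of block IDs. */
unsigned count_misses(struct cache *c, const uint32_t *blocks, unsigned n) {
    unsigned misses = 0;
    for (unsigned i = 0; i < n; i++)
        if (!cache_access(c, blocks[i])) misses++;
    return misses;
}
```

For the repeated pattern 0010, 0110, 0010, 0110, ... a 4-slot direct-mapped cache (S = 4, E = 1) misses on every access, while the same 4 slots organized as 2 sets of 2 ways miss only on the first two accesses.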
Example: Tag, Index, Offset?

Direct-mapped, 4 slots, 2-byte blocks.

4-bit address: | Tag | Index | Offset |

  tag bits ____
  set index bits ____
  block offset bits ____
  index(1101) = ____

Example: Tag, Index, Offset?

E-way set-associative, S sets, 16-byte blocks.

16-bit address: | Tag | Index | Offset |

  E = 1-way, S = 8 sets:
    tag bits ____  set index bits ____  block offset bits ____  index(0x1833) ____

  E = 2-way, S = 4 sets:
    tag bits ____  set index bits ____  block offset bits ____  index(0x1833) ____

  E = 4-way, S = 2 sets:
    tag bits ____  set index bits ____  block offset bits ____  index(0x1833) ____

Replacement Policy

If set is full, what block should be replaced?
Common: least recently used (LRU) (but hardware usually implements "not most recently used").

Another puzzle: cache starts empty, uses LRU. Access (address, hit/miss) stream:
  (10, miss); (12, miss); (10, miss)

  12 is not in the same block as 10.
  12's block replaced 10's block.

Associativity of cache? Direct-mapped cache.

General Cache Organization (S, E, B)

S sets
E lines per set ("E-way")
Each block/line: v (valid bit) | tag | bytes 0, 1, 2, ..., B-1
B = 2^b bytes of data per cache line (the data block)

Cache capacity: S x E x B data bytes
Address size: t + s + b address bits

Powers of 2

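Because S, E, and B are powers of 2, the cache geometry follows from simple arithmetic. The helper names below (num_sets, log2_pow2, tag_bits) and the 32-bit address width used in the usage note are assumptions for illustration.

```c
#include <stdint.h>

/* Number of sets S, given capacity = S x E x B data bytes. */
uint32_t num_sets(uint32_t capacity_bytes, uint32_t ways, uint32_t block_bytes) {
    return capacity_bytes / (ways * block_bytes);
}

/* log2 of a power of two (number of bits needed to index x items). */
unsigned log2_pow2(uint32_t x) {
    unsigned bits = 0;
    while (x > 1) {
        x >>= 1;
        bits++;
    }
    return bits;
}

/* Tag bits t = address bits - s - b. */
unsigned tag_bits(unsigned address_bits, uint32_t sets, uint32_t block_bytes) {
    return address_bits - log2_pow2(sets) - log2_pow2(block_bytes);
}
```

For example, a 32 KB, 8-way cache with 64-byte blocks has 64 sets, so s = 6 and b = 6; with an assumed 32-bit address that leaves t = 20 tag bits.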
Cache Read

E = 2^e lines per set
S = 2^s sets
B = 2^b bytes of data per cache line (the data block)
Each line: valid bit | tag | bytes 0, 1, 2, ..., B-1

Address of byte in memory:
  | tag: t bits | set index: s bits | block offset: b bits |

1. Locate set by index.
2. Hit if any block in set is valid and has matching tag.
3. Get data at offset in block (data begins at this offset).

Cache Read: Direct-Mapped (E = 1)

This cache:
• Block size: 8 bytes
• Associativity: 1 block per set (direct mapped)

S = 2^s sets, each with one line: v | tag | bytes 0 1 2 3 4 5 6 7

Address of int: | t bits | 0…01 | 100 |

1. Find set (index bits 0…01).
2. valid? + tag match?: yes = hit.
3. Use the block offset (100): the int (4 bytes) is in bytes 4 to 7.

If no match: old line is evicted and replaced.

Direct-Mapped Cache Practice

12-bit address
16 lines, 4-byte block size
Direct mapped

Address bits: 11 10 9 8 7 6 5 4 3 2 1 0
Offset bits? Index bits? Tag bits?

  Index  Tag  Valid  B0  B1  B2  B3
  0      19   1      99  11  23  11
  1      15   0      --  --  --  --
  2      1B   1      00  02  04  08
  3      36   0      --  --  --  --
  4      32   1      43  6D  8F  09
  5      0D   1      36  72  F0  1D
  6      31   0      --  --  --  --
  7      16   1      11  C2  DF  03
  8      24   1      3A  00  51  89
  9      2D   0      --  --  --  --
  A      2D   1      93  15  DA  3B
  B      0B   0      --  --  --  --
  C      12   0      --  --  --  --
  D      16   1      04  96  34  15
  E      13   1      83  77  1B  D3
  F      14   0      --  --  --  --

Addresses to look up: 0x354, 0xA20

Example (E = 1)

    double sum_array_rows(double a[16][16]) {
        double sum = 0;
        for (int r = 0; r < 16; r++) {
            for (int c = 0; c < 16; c++) {
                sum += a[r][c];
            }
        }
        return sum;
    }

    double sum_array_cols(double a[16][16]) {
        double sum = 0;
        for (int c = 0; c < 16; c++) {
            for (int r = 0; r < 16; r++) {
                sum += a[r][c];
            }
        }
        return sum;
    }

Locals in registers. Assume a is aligned such that &a[r][c] is aa...a rrrr cccc 000.

Assume: cold (empty) cache, 3-bit set index, 5-bit offset. Block = 32 bytes = 4 doubles.

Address split into tag | set index | offset:
  aa...a rrr | r cc | cc 000

  a[0][0]: aa...a000 | 000 | 00000
  a[0][4]: aa...a000 | 001 | 00000
  a[1][0]: aa...a000 | 100 | 00000
  a[2][0]: aa...a001 | 000 | 00000

sum_array_rows: each block holds 4 consecutive elements of a row (a[0][0..3], a[0][4..7], ...).
  4 misses per row of array; 4 * 16 = 64 misses.

sum_array_cols: consecutive accesses a[0][c], a[1][c], a[2][c], ... fall in different blocks that compete for the same two sets.
  Every access a miss; 16 * 16 = 256 misses.

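The two miss counts can be checked with a small model of this direct-mapped cache. The sketch below tracks only which block each of the 8 sets holds and assumes the array starts at address 0, which satisfies the stated alignment. All names (dm_cache, access_misses, count_traversal_misses) are made up for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define DIM 16          /* a is double a[16][16] */
#define ELEM_BYTES 8    /* size of a double, as on the slide (32 bytes = 4 doubles) */
#define NUM_SETS 8      /* 3-bit set index */
#define BLOCK_BYTES 32  /* 5-bit offset */

/* Direct-mapped cache model: one block ID per set, cold at start. */
struct dm_cache {
    bool valid[NUM_SETS];
    uint32_t block[NUM_SETS];
};

/* Returns true if accessing byte address addr misses; fills the line. */
static bool access_misses(struct dm_cache *c, uint32_t addr) {
    uint32_t block_id = addr / BLOCK_BYTES;
    uint32_t set = block_id % NUM_SETS;
    if (c->valid[set] && c->block[set] == block_id)
        return false;
    c->valid[set] = true;
    c->block[set] = block_id;
    return true;
}

/* Count misses for a row-major (true) or column-major (false) traversal. */
unsigned count_traversal_misses(bool row_major) {
    struct dm_cache c = {0};
    unsigned misses = 0;
    for (unsigned outer = 0; outer < DIM; outer++) {
        for (unsigned inner = 0; inner < DIM; inner++) {
            unsigned r = row_major ? outer : inner;
            unsigned col = row_major ? inner : outer;
            uint32_t addr = (r * DIM + col) * ELEM_BYTES;
            if (access_misses(&c, addr))
                misses++;
        }
    }
    return misses;
}
```

Under these assumptions the model reproduces the counts on the slide: 64 misses for the row-wise loop and 256 for the column-wise loop.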
Example (E = 1)

    int dotprod(int x[8], int y[8]) {
        int sum = 0;
        for (int i = 0; i < 8; i++) {
            sum += x[i] * y[i];
        }
        return sum;
    }

Block = 16 bytes = 4 ints; 8 sets in cache.
How many block offset bits? How many set index bits?

Address bits: ttt....t sss bbbb
  B = 16 = 2^b: b = 4 offset bits
  S = 8 = 2^s: s = 3 index bits

Addresses as bits:
  0x00000000: 000....0 000 0000
  0x00000080: 000....1 000 0000
  0x000000A0: 000....1 010 0000

If x and y are mutually aligned, e.g., 0x00, 0x80: both map to the same set, so the line alternates between
  x[0] x[1] x[2] x[3]
  y[0] y[1] y[2] y[3]
  x[0] x[1] x[2] x[3]
  ...

If x and y are mutually unaligned, e.g., 0x00, 0xA0: the blocks land in different sets and all fit:
  x[0] x[1] x[2] x[3]
  x[4] x[5] x[6] x[7]
  y[0] y[1] y[2] y[3]
  y[4] y[5] y[6] y[7]

Cache Read: Set-Associative (Example: E = 2)

This cache:
• Block size: 8 bytes
• Associativity: 2 blocks per set

Each set holds two lines, each of the form: v | tag | bytes 0 1 2 3 4 5 6 7

Address of int: | t bits | 0…01 | 100 |

1. Find set (index bits 0…01).
2. Compare both lines in the set: valid? + tag match: yes = hit.
3. Use the block offset (100): the int (4 bytes) is in bytes 4 to 7.

If no match: evict and replace one line in set.

Example (E = 2)

    float dotprod(float x[8], float y[8]) {
        float sum = 0;
        for (int i = 0; i < 8; i++) {
            sum += x[i] * y[i];
        }
        return sum;
    }

4 sets, 2 blocks/lines per set.

If x and y aligned, e.g. &x[0] = 0, &y[0] = 128, can still fit both because each set has space for two blocks/lines:

  x[0] x[1] x[2] x[3] | y[0] y[1] y[2] y[3]
  x[4] x[5] x[6] x[7] | y[4] y[5] y[6] y[7]

Types of Cache Misses

Cold (compulsory) miss
Conflict miss
Capacity miss

Which ones can we mitigate/eliminate? How?

Writing to cache

Multiple copies of data exist, must be kept in sync.

Write-hit policy
  Write-through:
  Write-back: needs a dirty bit

Write-miss policy
  Write-allocate:
  No-write-allocate:

Typical caches:
  Write-back + Write-allocate, usually
  Write-through + No-write-allocate, occasionally

Write-back, write-allocate example

Initial state:
  Cache: tag U, dirty bit 0, data 0xCAFE
  Memory: T: 0xFACE, U: 0xCAFE

1. mov $T, %ecx            (cache/memory not involved)
2. mov $U, %edx            (cache/memory not involved)
3. mov $0xFEED, (%ecx)
   a. Miss on T.
   b. Evict U (clean: discard).
   c. Fill T (write-allocate).
   d. Write T in cache (dirty).
   Cache now: tag T, dirty bit 1, data 0xFEED. Memory still: T: 0xFACE, U: 0xCAFE.
4. mov (%edx), %eax
   a. Miss on U.
   b. Evict T (dirty: write back).
   c. Fill U.
   d. Set %eax.
5. DONE.

Final state:
  Registers: eax = 0xCAFE, ecx = T, edx = U
  Cache: tag U, dirty bit 0, data 0xCAFE
  Memory: T: 0xFEED, U: 0xCAFE

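The same sequence can be traced with a toy one-line cache. This is a sketch of write-back plus write-allocate behavior under simplifying assumptions: memory is a two-entry array, and ADDR_T and ADDR_U are hypothetical stand-ins for the addresses T and U. All type and function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

enum { ADDR_T = 0, ADDR_U = 1 };  /* hypothetical stand-ins for addresses T and U */

struct one_line_cache {
    bool valid;
    bool dirty;
    int tag;         /* which address the line currently holds */
    uint16_t data;
};

/* Bring addr into the single line, writing back the old contents if dirty. */
static void ensure_cached(struct one_line_cache *c, uint16_t memory[], int addr) {
    if (c->valid && c->tag == addr)
        return;                        /* hit: nothing to do */
    if (c->valid && c->dirty)
        memory[c->tag] = c->data;      /* evict dirty line: write back */
    c->valid = true;                   /* fill from memory */
    c->dirty = false;
    c->tag = addr;
    c->data = memory[addr];
}

/* Store: write-allocate on a miss, then update only the cached copy. */
void cache_store(struct one_line_cache *c, uint16_t memory[], int addr, uint16_t value) {
    ensure_cached(c, memory, addr);
    c->data = value;
    c->dirty = true;
}

/* Load: fill on a miss, then read the cached copy. */
uint16_t cache_load(struct one_line_cache *c, uint16_t memory[], int addr) {
    ensure_cached(c, memory, addr);
    return c->data;
}
```

Starting with U cached clean, a store of 0xFEED to T leaves memory unchanged until the later load of U evicts the dirty line, at which point 0xFEED reaches memory.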
Example Memory Hierarchy

Typical laptop/desktop processor (always changing)

Processor package:
  Core 0 … Core 3, each with:
    Regs
    L1 i-cache and d-cache: 32 KB, 8-way, access: 4 cycles
    L2 unified cache: 256 KB, 8-way, access: 11 cycles
  L3 unified cache (shared by all cores): 8 MB, 16-way, access: 30-40 cycles
Main memory

Block size: 64 bytes for all caches.

Lower levels: slower, but more likely to hit.

Aside: software caches

Examples
  File system buffer caches, web browser caches, database caches, network CDN caches, etc.

Some design differences
  Almost always fully-associative
  Often use complex replacement policies
  Not necessarily constrained to single "block" transfers

Cache-Friendly Code

Locality, locality, locality.

Programmer can optimize for cache performance
  Data structure layout
  Data access patterns
    Nested loops
    Blocking (see CSAPP 6.5)

All systems favor "cache-friendly code"
  Performance is hardware-specific
  Generic rules capture most advantages
    Keep working set small (temporal locality)
    Use small strides (spatial locality)
    Focus on inner loop code

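As a rough illustration of blocking, the sketch below multiplies two square matrices tile by tile, so that a small working set is reused while it is likely still cached. The dimension DIM, the tile size BSIZE, and the function name matmul_blocked are assumptions chosen for this example; a good tile size depends on the hardware.

```c
#define DIM 8     /* assumed matrix dimension for illustration */
#define BSIZE 4   /* assumed tile size; must divide DIM */

/* c += a * b, computed tile by tile so each tile is reused while cached. */
void matmul_blocked(double a[DIM][DIM], double b[DIM][DIM], double c[DIM][DIM]) {
    for (int ii = 0; ii < DIM; ii += BSIZE)
        for (int jj = 0; jj < DIM; jj += BSIZE)
            for (int kk = 0; kk < DIM; kk += BSIZE)
                /* Multiply one BSIZE x BSIZE tile of a by one tile of b. */
                for (int i = ii; i < ii + BSIZE; i++)
                    for (int j = jj; j < jj + BSIZE; j++) {
                        double sum = c[i][j];
                        for (int k = kk; k < kk + BSIZE; k++)
                            sum += a[i][k] * b[k][j];
                        c[i][j] = sum;
                    }
}
```

The caller is expected to zero c first if a plain product is wanted. The arithmetic is the same as in the usual triple loop; only the order of accesses changes.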