SOFTENG 363
Computer Architecture
Cache
John Morris
ECE/CS, The University of Auckland
Iolanthe I at 13 knots on Cockburn Sound, WA
Cache
• Small, fast memory
  • Typically ~64 kbytes (Level 1 – 1998+)
  • 2 cycle access time
  • Same die as processor
• ‘High-end’ CPUs
  • 2 levels of cache on main die
  • L1 – 64 kbytes, L2 – ~1 Mbyte (but slower!)
• “Off-chip” cache possible
  • Custom cache chip closely coupled to processor
  • Use fast static RAM (SRAM) rather than slower dynamic RAM
• 2nd level of the memory hierarchy
• “Caches” most recently used memory locations “closer” to the processor
  • closer = closer in time
Cache
• Etymology
  • cacher (French) = “to hide”
• Transparent to a program
  • Programs simply run slower without it
• Modern processors rely on it
  • Reduces the cost of main memory access
  • Enables instruction/cycle throughput
• Typical program
  • ~25% memory accesses
Cache
• Relies upon locality of reference
  • Programs continually use - and re-use - the same locations
  • Instructions
    • loops
    • common subroutines
  • Data
    • look-up tables
    • “working” data sets
Cache - operation
• Memory requests checked in cache first
  • If the word sought is in the cache, it’s read from cache (or updated in cache): cache hit
  • If not, request is passed to main memory and data is read (written) there: cache miss
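[Figure: CPU, MMU, Cache and Main Mem in sequence. The CPU issues a virtual address (VA); the MMU passes a physical address (PA) to the cache and on to main memory; data or instructions (D or I) are returned.]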
Cache - operation
• Hit rates of 96% are usual
  • Cache: 64 kbytes
• Effective Memory Access Time
  • Cache: 2 cycles
  • Main memory: 100 cycles
  • Average access: 0.96*2 + 0.04*100 = 5.92 cycles
• In general, if there are n levels of memory,
    Avg memory access time = $\sum_{j=1}^{n} f_j \, t_j^{acc}$
  where $f_j$ = fraction of accesses to memory level $j$
        $t_j^{acc}$ = access time for memory level $j$
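The weighted sum above can be checked with a few lines of Python. This is only a sketch: the function name `average_access_time` is made up for illustration, and the inputs are the 96% hit rate, 2-cycle cache and 100-cycle main memory quoted on the slide.

```python
def average_access_time(fractions, access_times):
    """Weighted mean of per-level access times: sum over j of f_j * t_j."""
    if len(fractions) != len(access_times):
        raise ValueError("need one access time per memory level")
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("access fractions must sum to 1")
    return sum(f * t for f, t in zip(fractions, access_times))


# Two-level case: 96% of accesses are served by the 2-cycle cache,
# the remaining 4% by 100-cycle main memory.
two_level = average_access_time([0.96, 0.04], [2, 100])
print(f"{two_level:.2f} cycles")  # prints "5.92 cycles"
```

Even a 4% miss rate roughly triples the average access time compared with the 2-cycle cache alone, because each miss is so expensive.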
Cache - organisation
• Direct-mapped cache
  • Each word in the cache has a tag
  • Assume
    • cache size - 2^k words
    • machine words - p bits
    • byte-addressed memory
      • m = log2 ( p/8 ) bits not used to address words
      • m = 2 for 32-bit machines

Address format (p bits):
  | tag (p-k-m bits) | cache address (k bits) | byte address (m bits) |
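As an illustration of the address format, the three fields can be extracted with shifts and masks. The sketch below assumes a 32-bit machine (m = 2) and k = 14; the function name `split_address` and the sample address are chosen only for this example.

```python
def split_address(addr, k, m):
    """Split a byte address into (tag, cache_address, byte_address).

    k = number of cache-address bits (the cache holds 2**k words)
    m = number of byte-address bits (2**m bytes per word)
    """
    byte_address = addr & ((1 << m) - 1)           # lowest m bits
    cache_address = (addr >> m) & ((1 << k) - 1)   # next k bits
    tag = addr >> (k + m)                          # remaining p-k-m bits
    return tag, cache_address, byte_address


# Assumed example values: m = 2 (32-bit words), k = 14 (16k-word cache).
tag, index, offset = split_address(0x12345678, k=14, m=2)
print(hex(tag), hex(index), offset)  # prints "0x1234 0x159e 0"
```

With k + m = 16, the tag is simply the upper 16 bits of the 32-bit address.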
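Cache - organisation
• Direct-mapped cache
[Figure: direct-mapped cache between CPU and memory. The memory address is split into tag (p-k-m bits), cache address (k bits) and byte address (m bits). The cache has 2^k lines; a cache line holds a tag (p-k-m bits) and data (p bits). The stored tag is compared with the address tag to give the "Hit?" signal.]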
Cache - Direct Mapped
• Conflicts
  • Two addresses separated by 2^(k+m) bytes will map to the same cache location
  • 32-bit machine, 64 kbyte (16 kword) cache
    • m = 2, k = 14
    • Any program or data set larger than 64 kbytes will generate conflicts
  • On a conflict, the ‘old’ word is flushed
    • Unmodified word ( Program, constant data ) overwritten by the new data from memory
    • Modified data needs to be written back to memory before being overwritten
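A small model makes the conflict visible. The sketch below assumes the slide's 64 kbyte cache on a 32-bit machine (k = 14, m = 2); the names `cache_index` and `count_misses` and the two sample addresses are hypothetical, and only tags are tracked, not data.

```python
K, M = 14, 2  # 16k-word (64 kbyte) cache, 4-byte words


def cache_index(addr, k=K, m=M):
    """Cache line selected by a byte address in a direct-mapped cache."""
    return (addr >> m) & ((1 << k) - 1)


def count_misses(addresses, k=K, m=M):
    """Count misses for a sequence of byte addresses, starting from an empty cache."""
    stored_tags = {}  # cache index -> tag currently held in that line
    misses = 0
    for addr in addresses:
        idx, tag = cache_index(addr, k, m), addr >> (k + m)
        if stored_tags.get(idx) != tag:
            misses += 1
            stored_tags[idx] = tag  # the 'old' word is replaced
    return misses


a = 0x00010040
b = a + (1 << (K + M))  # exactly 2**(k+m) bytes = 64 kbytes further on

print(cache_index(a) == cache_index(b))  # prints "True": same cache line
print(count_misses([a, b] * 4))          # prints "8": every access misses
print(count_misses([a, a + 4] * 4))      # prints "2": neighbours do not conflict
```

Alternating between the two conflicting addresses never produces a hit, even though the rest of the cache is empty.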
Cache - Conflicts
• Modified or dirty words
  • When a word is modified in cache
• Write-back cache
  • Only writes data back when needed
  • Misses: two memory accesses
    • Write modified word back
    • Read new word
• Write-through cache
  • Low priority write to main memory is queued
  • Processor is delayed by read only
    • Memory write occurs in parallel with other work
    • Instruction and necessary data fetches take priority
Cache - Write-through or write-back?
• Write-through
  • Allows an intelligent bus interface unit to make efficient use of a serious bottle-neck
    • Processor - memory interface (main memory bus)
  • Reads (instruction and data) need priority!
    • They stall the processor
  • Writes can be delayed
    • At least until the location is needed!
  • More on intelligent system interface units later
  but ...
Cache - Write-through or write-back?
• Write-through
  • Seems a good idea! but ...
  • Multiple writes to the same location waste memory bus bandwidth
• Typical programs run better with write-back caches
  however
  • Often you can easily predict which will be best
  • Some processors (eg PowerPC) allow you to classify memory regions as write-back or write-through
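The bandwidth argument can be made concrete with a deliberately simplified count. The sketch assumes a single cached word that receives a number of stores and is then evicted; the function name `memory_writes` and the figure of 1000 stores are illustrative, and write queues and bus timing are ignored.

```python
def memory_writes(n_stores, policy):
    """Main-memory writes caused by n_stores stores to one cached word,
    followed by eviction of that word (simplified model)."""
    if policy == "write-through":
        return n_stores                   # every store is also sent to memory
    if policy == "write-back":
        return 1 if n_stores > 0 else 0   # dirty word written once, on eviction
    raise ValueError(f"unknown policy: {policy}")


# e.g. a loop counter held in memory and updated 1000 times
for policy in ("write-through", "write-back"):
    print(policy, memory_writes(1000, policy))
# prints "write-through 1000" then "write-back 1"
```

For data that is written once and not touched again, the two policies generate the same traffic, which is one reason a per-region choice can be useful.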
Cache - more bits
• Cache lines need some status bits
  • Tag bits + ..
  • Valid
    • All set to false on power up
    • Set to true as words are loaded into cache
  • Dirty
    • Needed by write-back cache
    • Write-through cache always queues the write, so lines are never ‘dirty’
Cache - Improving Performance
• Conflicts ( addresses 2^(k+m) bytes apart )
  • Degrade cache performance
  • Lower hit rate
• Murphy’s Law operates
  • Addresses are never random!
  • Some locations ‘thrash’ in cache
    • Continually replaced and restored
Cache - Fully Associative
• All tags are compared at the same time
• Words can use any cache line
Cache - Fully Associative
• Associative
  • Each tag is compared at the same time
  • Any match gives a hit
  • Avoids ‘unnecessary’ flushing
• Replacement
  • Least Recently Used - LRU
  • Needs extra status bits
    • Cycles since last accessed
• Hardware cost high
  • Extra comparators
  • Wider tags
    • p-m bits vs p-k-m bits
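A software model of a fully associative cache with LRU replacement might look like the sketch below. It is only a behavioural model: the class name `FullyAssociativeCache` is hypothetical, a dictionary lookup stands in for the parallel tag comparators, and the insertion order of an `OrderedDict` stands in for the "cycles since last accessed" status bits.

```python
from collections import OrderedDict


class FullyAssociativeCache:
    """Behavioural model: any word may occupy any line; LRU replacement."""

    def __init__(self, n_lines):
        self.n_lines = n_lines
        self.lines = OrderedDict()  # word address -> None, least recently used first

    def access(self, word_addr):
        """Return True on a hit, False on a miss (the word is then loaded)."""
        if word_addr in self.lines:
            self.lines.move_to_end(word_addr)  # now the most recently used
            return True
        if len(self.lines) >= self.n_lines:
            self.lines.popitem(last=False)     # evict the least recently used
        self.lines[word_addr] = None
        return False


# Two word addresses 2**14 words apart would conflict in a 16k-word
# direct-mapped cache; here they simply occupy different lines.
cache = FullyAssociativeCache(n_lines=2)
print([cache.access(a) for a in (0, 0x4000, 0, 0x4000)])
# prints "[False, False, True, True]"
```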
Cache - Set Associative
• 2-way set associative
  • Each line - two words
  • Two comparators only
Cache - Set Associative
• n-way set associative caches
  • n can be small: 2, 4, 8
  • Best performance
  • Reasonable hardware cost
  • Most high performance processors
• Replacement policy
  • LRU choice from n
  • Reasonable LRU approximation
    • 1 or 2 bits
    • Set on access
    • Cleared / decremented by timer
    • Choose cleared word for replacement
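The 1-bit LRU approximation can be sketched for a single set as follows. The class name `SetWithUseBits`, the explicit `timer_tick` call and the fallback to way 0 when every bit is set are assumptions made for this example; real hardware differs in detail.

```python
class SetWithUseBits:
    """One set of an n-way set-associative cache with a 1-bit LRU approximation."""

    def __init__(self, n_ways):
        self.tags = [None] * n_ways   # None marks an empty way
        self.used = [False] * n_ways  # one status bit per way

    def access(self, tag):
        """Return True on a hit, False on a miss (the tag is then loaded)."""
        if tag in self.tags:
            self.used[self.tags.index(tag)] = True  # set on access
            return True
        victim = self._choose_victim()
        self.tags[victim] = tag
        self.used[victim] = True
        return False

    def timer_tick(self):
        """Clear all use bits; called periodically by a timer."""
        self.used = [False] * len(self.used)

    def _choose_victim(self):
        # Choose a way whose bit is cleared; if all bits are set,
        # fall back to way 0 (an assumption of this sketch).
        for way, used in enumerate(self.used):
            if not used:
                return way
        return 0


s = SetWithUseBits(n_ways=2)
s.access(1)
s.access(2)        # both ways now filled and marked used
s.timer_tick()     # timer clears the bits
s.access(1)        # tag 1 touched again since the tick
s.access(3)        # miss: tag 2 has a cleared bit, so it is replaced
print(s.tags)      # prints "[1, 3]"
```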
Cache - Locality of Reference
• Temporal Locality
  • Same location will be referenced again soon
  • Access same data again
  • Program loops - access same instruction again
  • Caches described so far exploit temporal locality
• Spatial Locality
  • Nearby locations will be referenced soon
  • Next element of an array
  • Next instruction of a program
Cache - Line Length
• Spatial Locality
  • Use very long cache lines
  • Fetch one datum: neighbours fetched also
• PowerPC 601 (Motorola/Apple/IBM), first of the single chip Power processors
  • 64 sets
  • 8-way set associative
  • 32 bytes per line
  • 32 bytes (8 instructions) fetched into instruction buffer in one cycle
  • 64 x 8 x 32 = 16 kbyte total
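The benefit of longer lines for sequential access can be estimated with a short count. The sketch assumes 4-byte words, a scan that starts on a line boundary, and a cache large enough that nothing is evicted during the scan; `sequential_scan_misses` is a name invented for this example.

```python
def sequential_scan_misses(n_bytes, line_bytes, word_bytes=4):
    """Misses when reading n_bytes of contiguous data word by word into an
    initially empty cache (no evictions modelled)."""
    lines_loaded = set()
    misses = 0
    for addr in range(0, n_bytes, word_bytes):
        line = addr // line_bytes      # which cache line holds this word
        if line not in lines_loaded:   # first touch fetches the whole line
            misses += 1
            lines_loaded.add(line)
    return misses


print(sequential_scan_misses(1024, line_bytes=4))   # prints "256": one miss per word
print(sequential_scan_misses(1024, line_bytes=32))  # prints "32": one miss per 8 words
```

With 32-byte lines, seven of every eight word accesses in the scan hit on a neighbour fetched by an earlier miss.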
Cache - Separate I- and D-caches
• Unified cache
  • Instructions and Data in same cache
• Two caches
  • Instructions
  • Data
  • Increases total bandwidth
• MIPS R10000
  • 32 Kbyte Instruction; 32 Kbyte Data
  • Instruction cache is pre-decoded! (32 → 36 bits)
  • Data
    • 8-word (64 byte) line, 2-way set associative
    • 256 sets
    • Replacement policy?