SOFTENG 363

Computer Architecture: Cache

John Morris, ECE/CS, The University of Auckland

Iolanthe I at 13 knots on Cockburn Sound, WA

Cache
• Small, fast memory
  • Typically ~64 kbytes (Level 1 - 1998+)
  • 2 cycle access time
  • Same die as processor
• ‘High-end’ CPUs
  • 2 levels of cache on main die
  • L1 - 64 kbytes, L2 - ~1 Mbyte (but slower!)
• “Off-chip” cache possible
  • Custom cache chip closely coupled to processor
  • Use fast static RAM (SRAM) rather than slower dynamic RAM
• 2nd level of the memory hierarchy
• “Caches” most recently used memory locations “closer” to the processor
  • closer = closer in time

Cache
• Etymology
  • cacher (French) = “to hide”
  • Transparent to a program
• Programs simply run slower without it
• Modern processors rely on it
  • Reduces the cost of main memory access
  • Enables instruction-per-cycle throughput
  • Typical program
    • ~25% of instructions are memory accesses

Cache
• Relies upon locality of reference
  • Programs continually use, and re-use, the same locations
  • Instructions
    • loops
    • common subroutines
  • Data
    • look-up tables
    • “working” data sets

Cache - operation
• Memory requests checked in cache first
  • If the word sought is in the cache, it’s read from cache (or updated in cache): cache hit
  • If not, the request is passed to main memory and data is read (written) there: cache miss

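[Figure: CPU → MMU → Cache → Main Mem. The CPU issues a virtual address (VA); the MMU passes a physical address (PA) to the cache, and the cache passes a PA on to main memory. Data or instructions (D or I) are returned.]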

Cache - operation
• Hit rates of 96% are usual
  • Cache: 64 kbytes
• Effective Memory Access Time
  • Cache: 2 cycles
  • Main memory: 100 cycles
  • Average access: 0.96*2 + 0.04*100 = 5.92 cycles
• In general, if there are n levels of memory,

$$\text{Avg memory access time} = \sum_{j=1}^{n} f_j \, t_j^{acc}$$

where $f_j$ = fraction of accesses to memory level $j$ and $t_j^{acc}$ = access time for memory level $j$
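The weighted sum above can be sketched in a few lines of Python. This is only an illustration: the function name `average_access_time` and the list-of-pairs input format are assumptions, not part of the course material. It evaluates the sum for a 96% hit rate, a 2-cycle cache and a 100-cycle main memory.

```python
def average_access_time(levels):
    """Sum f_j * t_j over memory levels.

    levels: iterable of (fraction_of_accesses, access_time_in_cycles) pairs.
    The fractions are expected to add up to 1.
    """
    levels = list(levels)
    total_fraction = sum(f for f, _ in levels)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("fractions of accesses must sum to 1")
    return sum(f * t for f, t in levels)


# Cache (2 cycles, 96% of accesses) and main memory (100 cycles, 4%).
print(round(average_access_time([(0.96, 2), (0.04, 100)]), 2))  # 5.92
```

Adding a third pair, for example a second-level cache, only needs another entry in the list, provided the fractions still sum to 1.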

Cache - organisation
• Direct-mapped cache
  • Each word in the cache has a tag
• Assume
  • cache size - $2^k$ words
  • machine words - p bits
  • byte-addressed memory
    • $m = \log_2(p/8)$ bits not used to address words
    • m = 2 for 32-bit machines

Address format (p bits): | tag (p-k-m bits) | cache address (k bits) | byte address (m bits) |
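As a rough illustration of the address format, the sketch below splits an address into its three fields. The function name `split_address` is hypothetical, and the code assumes the word size p is a power of two so that m = log2(p/8) is an integer.

```python
def split_address(addr, k, p=32):
    """Split a byte address into (tag, cache address, byte address).

    k: log2 of the number of words in the cache
    p: machine word size in bits (assumed to be a power of two)
    """
    m = (p // 8).bit_length() - 1          # m = log2(p/8); 2 for p = 32
    byte_address = addr & ((1 << m) - 1)   # lowest m bits
    cache_address = (addr >> m) & ((1 << k) - 1)  # next k bits
    tag = addr >> (k + m)                  # remaining p-k-m bits
    return tag, cache_address, byte_address


tag, index, offset = split_address(0x12345678, k=14)
print(hex(tag), hex(index), offset)  # 0x1234 0x159e 0
```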

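Cache - organisation
• Direct-mapped cache

[Figure: direct-mapped cache. The memory address from the CPU is split into tag (p-k-m bits), cache address (k bits) and byte address (m bits). The cache address selects one of $2^k$ lines; a cache line holds a tag (p-k-m bits) and data (p bits). The stored tag is compared with the address tag to give "Hit?"; data moves between CPU, cache and memory.]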

Cache - Direct Mapped
• Conflicts
  • Two addresses separated by $2^{k+m}$ bytes will hit the same cache location
  • 32-bit machine, 64 kbyte (16 kword) cache
    • m = 2, k = 14
    • Any program or data set larger than 64 kbytes will generate conflicts
• On a conflict, the ‘old’ word is flushed
  • Unmodified word (program, constant data) is overwritten by the new data from memory
  • Modified data needs to be written back to memory before being overwritten
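A minimal simulation can make the conflict visible. The class below is a sketch, not a model of any real processor: `DirectMappedCache` and its attribute names are invented for illustration, only tags are stored (no data), and writes are ignored. It uses the slide's parameters, k = 14 and m = 2, so two addresses 2^16 bytes apart map to the same line.

```python
class DirectMappedCache:
    """Tag-only model of a direct-mapped cache with 2**k one-word lines."""

    def __init__(self, k, m=2):
        self.k, self.m = k, m
        self.tags = [None] * (1 << k)   # None marks an empty (invalid) line
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        index = (addr >> self.m) & ((1 << self.k) - 1)
        tag = addr >> (self.k + self.m)
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag          # the 'old' word is flushed
        self.misses += 1
        return False


cache = DirectMappedCache(k=14)
a = 0x1000
b = a + (1 << 16)                       # 2**(k+m) bytes away: same line
for _ in range(4):
    cache.access(a)
    cache.access(b)
print(cache.hits, cache.misses)         # 0 8
```

Every access misses because each address evicts the other. Moving `b` by one word (for example `a + 4`) gives two initial misses followed by hits.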

Cache - Conflicts
• Modified or dirty words
  • When a word is modified in cache:
• Write-back cache
  • Only writes data back when needed
  • Misses that replace a modified word need two memory accesses
    • Write modified word back
    • Read new word
• Write-through cache
  • Low priority write to main memory is queued
  • Processor is delayed by read only
    • Memory write occurs in parallel with other work
    • Instruction and necessary data fetches take priority

Cache - Write-through or write-back?
• Write-through
  • Allows an intelligent bus interface unit to make efficient use of a serious bottleneck: the processor-memory interface (main memory bus)
  • Reads (instruction and data) need priority!
    • They stall the processor
  • Writes can be delayed
    • At least until the location is needed!
  • More on intelligent system interface units later

but ...

Cache - Write-through or write-back?
• Write-through
  • Seems a good idea! but ...
  • Multiple writes to the same location waste memory bus bandwidth
• Typical programs run better with write-back caches
• However
  • Often you can easily predict which will be best
  • Some processors (eg PowerPC) allow you to classify memory regions as write-back or write-through
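The bandwidth argument can be illustrated with a simple count of main-memory writes. This is a deliberately crude sketch: the function `memory_writes` is hypothetical, and it assumes the cache is large enough that no modified word is evicted early, so a write-back cache writes each modified word to memory exactly once.

```python
def memory_writes(write_addresses, policy):
    """Count writes that reach main memory for a stream of word writes."""
    writes = list(write_addresses)
    if policy == "write-through":
        return len(writes)          # every write is queued to memory
    if policy == "write-back":
        return len(set(writes))     # each dirty word written back once
    raise ValueError(f"unknown policy: {policy}")


# Ten writes to one location, then five to another.
stream = [0x100] * 10 + [0x104] * 5
print(memory_writes(stream, "write-through"))  # 15
print(memory_writes(stream, "write-back"))     # 2
```

A stream with no repeated addresses gives the same count for both policies, which is the kind of case where write-through costs nothing extra.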

Cache - more bits
• Cache lines need some status bits
  • Tag bits + ..
  • Valid
    • All set to false on power up
    • Set to true as words are loaded into cache
  • Dirty
    • Needed by write-back cache
    • Write-through cache always queues the write, so lines are never ‘dirty’
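The sketch below shows one way the valid and dirty bits might be used in a direct-mapped write-back cache. The names `Line` and `WriteBackCache` are invented for this example, data storage is omitted, and the code assumes a write miss loads the word into the cache before modifying it (write-allocate), which the slides do not specify.

```python
from dataclasses import dataclass


@dataclass
class Line:
    valid: bool = False   # false on power up
    dirty: bool = False   # set when the cached word is modified
    tag: int = 0


class WriteBackCache:
    def __init__(self, k, m=2):
        self.k, self.m = k, m
        self.lines = [Line() for _ in range(1 << k)]
        self.write_backs = 0

    def access(self, addr, is_write):
        index = (addr >> self.m) & ((1 << self.k) - 1)
        tag = addr >> (self.k + self.m)
        line = self.lines[index]
        hit = line.valid and line.tag == tag
        if not hit:
            if line.valid and line.dirty:
                self.write_backs += 1   # modified word goes to memory first
            line.valid, line.dirty, line.tag = True, False, tag
        if is_write:
            line.dirty = True
        return hit


cache = WriteBackCache(k=2)
cache.access(0x00, is_write=True)    # miss; line becomes dirty
cache.access(0x10, is_write=False)   # conflict: dirty word written back
cache.access(0x00, is_write=False)   # conflict again, but the line was clean
print(cache.write_backs)             # 1
```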

Cache - Improving Performance
• Conflicts (addresses $2^{k+m}$ bytes apart)
  • Degrade cache performance
  • Lower hit rate
• Murphy’s Law operates
  • Addresses are never random!
  • Some locations ‘thrash’ in cache
    • Continually replaced and restored

Cache - Fully Associative
• All tags are compared at the same time
• Words can use any cache line

Cache - Fully Associative
• Associative
  • Each tag is compared at the same time
  • Any match is a hit
  • Avoids ‘unnecessary’ flushing
• Replacement
  • Least Recently Used - LRU
  • Needs extra status bits
    • Cycles since last accessed
• Hardware cost high
  • Extra comparators
  • Wider tags
    • p-m bits vs p-k-m bits
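A software sketch of a fully associative cache with LRU replacement is shown below. It is an illustration only: the class name is made up, the hardware's parallel tag comparison is modelled by a dictionary lookup, and recency is tracked by the ordering of an `OrderedDict` rather than by per-line cycle counters.

```python
from collections import OrderedDict


class FullyAssociativeCache:
    """Tag-only model: any word can occupy any line; LRU replacement."""

    def __init__(self, num_lines, m=2):
        self.num_lines, self.m = num_lines, m
        self.lines = OrderedDict()      # tag -> None, least recent first

    def access(self, addr):
        tag = addr >> self.m            # the full p-m bit word address
        if tag in self.lines:
            self.lines.move_to_end(tag) # mark as most recently used
            return True
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)  # evict least recently used
        self.lines[tag] = None
        return False


cache = FullyAssociativeCache(num_lines=2)
a, b = 0x1000, 0x1000 + (1 << 16)
print([cache.access(x) for x in (a, b, a, b)])  # [False, False, True, True]
```

The same two addresses that evict each other in a direct-mapped cache with k = 14 coexist here, because the tag is the whole word address and no index bits tie a word to one line.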

Cache - Set Associative
• 2-way set associative
  • Each line - two words
  • Two comparators only

Cache - Set Associative
• n-way set associative caches
  • n can be small: 2, 4, 8
  • Best performance
  • Reasonable hardware cost
  • Most high performance processors
• Replacement policy
  • LRU choice from n
  • Reasonable LRU approximation
    • 1 or 2 bits
    • Set on access
    • Cleared / decremented by timer
    • Choose cleared word for replacement
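An n-way set associative cache can be sketched as an array of small LRU lists. Note the simplification: this example uses exact LRU within each set, not the 1 or 2 bit timer-based approximation described above, and the class and parameter names are assumptions made for illustration.

```python
class SetAssociativeCache:
    """Tag-only model of an n-way set associative cache, exact LRU per set."""

    def __init__(self, num_sets, ways, line_bytes=4):
        self.num_sets, self.ways, self.line_bytes = num_sets, ways, line_bytes
        # One list of tags per set, least recently used first.
        self.sets = [[] for _ in range(num_sets)]

    def access(self, addr):
        line_addr = addr // self.line_bytes
        index = line_addr % self.num_sets
        tag = line_addr // self.num_sets
        tags = self.sets[index]
        if tag in tags:
            tags.remove(tag)
            tags.append(tag)            # now the most recently used
            return True
        if len(tags) == self.ways:
            tags.pop(0)                 # evict least recently used of the n
        tags.append(tag)
        return False


# 8192 sets x 2 ways x 4 bytes = 64 kbytes
cache = SetAssociativeCache(num_sets=8192, ways=2)
a = 0x1000
b = a + (1 << 16)                       # maps to the same set as a
print([cache.access(x) for x in (a, b, a, b)])  # [False, False, True, True]
```

With two ways, the pair of conflicting addresses no longer thrashes; a third address mapping to the same set would evict whichever of the two was used least recently.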

Cache - Locality of Reference
Temporal Locality
• Same location will be referenced again soon
  • Access same data again
  • Program loops - access same instruction again
• Caches described so far exploit temporal locality
Spatial Locality
• Nearby locations will be referenced soon
  • Next element of an array
  • Next instruction of a program

Cache - Line Length
• Spatial Locality
  • Use very long cache lines
  • Fetch one datum: neighbours fetched also
• PowerPC 601 (Motorola/Apple/IBM), first of the single chip Power processors
  • 64 sets
  • 8-way set associative
  • 32 bytes per line
  • 32 bytes (8 instructions) fetched into instruction buffer in one cycle
  • 64 x 8 x 32 = 16 kbytes total
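To get a feel for why long lines help with spatial locality, the sketch below counts hits for a sequential walk through an array. It is a simplified illustration: `sequential_hit_rate` is a hypothetical helper, words are assumed to be 4 bytes, and the cache is assumed large enough that nothing is ever evicted.

```python
def sequential_hit_rate(num_words, line_bytes, word_bytes=4):
    """Hit rate when reading num_words consecutive words, one pass."""
    cached_lines = set()
    hits = 0
    for i in range(num_words):
        line = (i * word_bytes) // line_bytes   # line holding word i
        if line in cached_lines:
            hits += 1                   # a neighbour was fetched with it
        else:
            cached_lines.add(line)      # miss: fetch the whole line
    return hits / num_words


print(sequential_hit_rate(1024, line_bytes=4))   # 0.0
print(sequential_hit_rate(1024, line_bytes=32))  # 0.875
```

With one-word lines a single pass never hits; with 32-byte lines, seven of every eight words arrive as neighbours of an earlier miss.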

Cache - Separate I- and D-caches
• Unified cache
  • Instructions and Data in same cache
• Two caches
  • Instructions
  • Data
  • Increases total bandwidth
• MIPS R10000
  • 32 Kbyte Instruction; 32 Kbyte Data
  • Instruction cache is pre-decoded! (32 → 36 bits)
  • Data
    • 8-word (64 byte) line, 2-way set associative
    • 256 sets
    • Replacement policy?
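The quoted cache sizes can be checked from sets, ways and line length. The helper below is a sketch with an invented name, `cache_geometry`, and it assumes the number of sets and the line length are powers of two so that the bit counts are exact.

```python
def cache_geometry(num_sets, ways, line_bytes):
    """Return (total bytes, byte-offset bits, set-index bits)."""
    total_bytes = num_sets * ways * line_bytes
    offset_bits = line_bytes.bit_length() - 1   # log2(line_bytes)
    index_bits = num_sets.bit_length() - 1      # log2(num_sets)
    return total_bytes, offset_bits, index_bits


# MIPS R10000 data cache: 256 sets, 2-way, 64-byte lines
print(cache_geometry(256, 2, 64))   # (32768, 6, 8)
```

The result, 32768 bytes, matches the 32 Kbyte data cache quoted above; the remaining address bits above the 6 offset bits and 8 index bits form the tag.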
