Computer Architecture: Cache. John Morris, Electrical & Computer Engineering / Computer Science, The University of Auckland. Iolanthe at 13 knots on Cockburn Sound, WA


Page 1:

Computer Architecture

Cache

John Morris, Electrical & Computer Engineering / Computer Science, The University of Auckland

Iolanthe at 13 knots on Cockburn Sound, WA

Page 2:

Memory Bottleneck

• State-of-the-art processor
  • f = 3 GHz
  • tclock = 330 ps
  • 1-2 instructions per cycle
  • ~25% of instructions reference memory
• Memory response needed
  • One reference every ~4 instructions
  • 4 * 330 ps: a response every ~1.3 ns is needed!
• Bulk semiconductor RAM
  • 100 ns+ for a 'random' access on the system bus!
  • The processor will spend most of its time waiting for memory!
  • DDR RAM is faster when streaming data over its special bus, but random access is still slow!
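As a rough illustration, here is a minimal C sketch (not from the slides; it just plugs in the figures quoted above) of how badly a 100 ns memory stalls a 3 GHz processor:

    /* Back-of-the-envelope estimate of the memory bottleneck, using the
       figures above: 330 ps clock, ~25% memory references, 100 ns RAM. */
    #include <stdio.h>

    int main(void) {
        double t_clock = 0.33;   /* ns per cycle at ~3 GHz */
        double t_mem   = 100.0;  /* ns per 'random' access to bulk RAM */
        double f_mem   = 0.25;   /* fraction of instructions touching memory */

        /* Average time per instruction if every reference went to main memory */
        double t_instr = t_clock + f_mem * t_mem;
        printf("avg time per instruction: %.2f ns (ideal: %.2f ns)\n",
               t_instr, t_clock);
        printf("slowdown: ~%.0fx\n", t_instr / t_clock);   /* roughly 77x */
        return 0;
    }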

Page 3:

Cache

• Small, fast memory
  • Typically ~50 kbytes (1998); 2007 Pentium: 16 kbytes (Level 1)
  • 2-cycle access time
  • Same die as the processor
• "Off-chip" cache possible
  • Custom cache chip closely coupled to the processor
  • Uses fast static RAM (SRAM) rather than slower dynamic RAM
• Several levels possible
  • 2nd level of the memory hierarchy
• "Caches" the most recently used memory locations "closer" to the processor
  • closer = closer in time

Page 4:

Cache

• Etymology
  • cacher (French) = "to hide"
• Transparent to a program
  • Programs simply run slower without it
• Modern processors rely on it
  • Reduces the cost of main memory access
  • Enables the 1-2 instructions/cycle throughput
• Typical program
  • ~25% memory accesses
• Reference: Wikipedia
  • Don't try Google!

Page 5:

Cache

• Relies upon locality of reference
  • Programs continually use, and re-use, the same locations
• Instructions
  • loops
  • common subroutines
• Data
  • look-up tables
  • "working" data sets

Page 6:

Cache - operation

• Memory requests are checked in the cache first
• If the word sought is in the cache, it is read from (or updated in) the cache: a cache hit
• If not, the request is passed to main memory and the data is read (written) there: a cache miss

[Diagram: the CPU issues a virtual address (VA) to the MMU; the physical address (PA) goes to the cache and, on a miss, to main memory; data or instructions (D or I) are returned to the CPU.]

Page 7:

Cache - operation

• Hit rates of 95% are usual (for a 16-kbyte cache)
• Effective memory access time
  • Cache: 2 cycles
  • Main memory: 10 cycles
  • Average access: 0.95*2 + 0.05*10 = 2.4 cycles

• In general, if there are n levels of memory,

  Avg memory access time = Σj fj * tacc,j

  where fj = fraction of accesses 'hitting' at level j
        tacc,j = access time for level j
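This formula is easy to check in code; a minimal C sketch, using the two-level example above:

    #include <stdio.h>

    /* Average access time over n levels: sum of f[j] * t_acc[j], where
       f[j] is the fraction of accesses that 'hit' at level j. */
    double avg_access(const double f[], const double t_acc[], int n) {
        double t = 0.0;
        for (int j = 0; j < n; j++)
            t += f[j] * t_acc[j];
        return t;
    }

    int main(void) {
        double f[]     = { 0.95, 0.05 };  /* 95% cache hits, 5% misses */
        double t_acc[] = { 2.0, 10.0 };   /* cycles: cache, main memory */
        printf("avg = %.1f cycles\n", avg_access(f, t_acc, 2));  /* 2.4 */
        return 0;
    }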

[Diagram: memory hierarchy. CPU -> L1 Cache -> L2 Cache -> Bulk memory -> Paging Disc; levels get faster towards the CPU and larger away from it. Small is fast!]

Page 8:

Cache - operation

• Key problem
  • Cache: 16 kbytes
  • Memory: 2 Mbytes min
• What goes in the cache?
  • 'Common' instructions and data
• We'd like
  • very common data in L1
  • less common data in L2
• How do we
  • ensure that the very common data is in L1?
  • find out whether the data item that we want now is in the cache?


Page 9:

Cache - organisation

• Three basic types
  • Direct-mapped cache
  • Fully associative cache
  • Set associative cache
• Note that all of these are possible for other caches too
  • Disc cache
  • Web page cache
    • frequently accessed pages
  • Database cache
    • frequently accessed records

Page 10:

Cache - organisation

• Direct-mapped cache
  • Each word in the cache has a tag
• Assume
  • cache size: 2^k words
  • machine words: p bits
  • byte-addressed memory
  • m = log2(p/8) bits are not used to address words
    • m = 2 for 32-bit machines

Address format (p bits):
| tag (p-k-m bits) | cache address (k bits) | byte address (m bits) |
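A minimal C sketch of this address split (assumed constants: k = 14 and m = 2, matching the 64-kbyte example on a later slide):

    #include <stdio.h>
    #include <stdint.h>

    enum { M = 2, K = 14 };   /* byte-address and cache-address field widths */

    /* Extract the three fields of a 32-bit byte address. */
    static unsigned byte_addr(uint32_t a)  { return a & ((1u << M) - 1); }
    static unsigned cache_addr(uint32_t a) { return (a >> M) & ((1u << K) - 1); }
    static unsigned tag_bits(uint32_t a)   { return a >> (M + K); }

    int main(void) {
        uint32_t a = 0x12345678;
        printf("tag=%x cache=%x byte=%x\n",
               tag_bits(a), cache_addr(a), byte_addr(a));
        return 0;
    }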

Page 11:

Cache - organisation

• Direct-mapped cache

[Diagram: a p-bit memory address splits into tag (p-k-m bits), cache address (k bits) and byte address (m bits). The cache address selects one of 2^k cache lines; each line stores a tag (p-k-m bits) and data (p bits). The stored tag is compared with the tag field of the address ('Hit?'); on a hit the data goes to the CPU, otherwise the request goes to memory.]
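The lookup itself can be sketched in a few lines of C (an illustrative model, not any particular processor; single-word lines, using the same k/m split as above):

    #include <stdint.h>
    #include <stdbool.h>

    enum { M = 2, K = 14, LINES = 1 << K };   /* 2^k single-word lines */

    typedef struct { uint32_t tag; uint32_t data; bool valid; } Line;
    static Line cache[LINES];

    /* Direct-mapped read: the k index bits select exactly one line, and
       its stored tag is compared with the tag bits of the address. */
    bool cache_read(uint32_t addr, uint32_t *out) {
        Line *line = &cache[(addr >> M) & (LINES - 1)];
        uint32_t tag = addr >> (M + K);
        if (line->valid && line->tag == tag) {  /* hit */
            *out = line->data;
            return true;
        }
        return false;   /* miss: the request is passed to main memory */
    }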

Page 12:

Cache - Conflicts

• Conflicts
  • Two addresses separated by 2^(k+m) will hit the same cache location

Address format (p bits):
| tag (p-k-m bits) | cache address (k bits) | byte address (m bits) |
Addresses in which these k bits are the same will map to the same cache line.

Page 13:

Cache - Direct Mapped

• Conflicts
  • Two addresses separated by 2^(k+m) will hit the same cache location
  • 32-bit machine, 64-kbyte (16-kword) cache: m = 2, k = 14
  • Any program or data set larger than 64 kbytes will generate conflicts
• On a conflict, the 'old' word is flushed
  • An unmodified word (program, constant data) is simply overwritten by the new data from memory
  • Modified data needs to be written back to memory before being overwritten
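A quick check of the conflict condition for this example (a sketch; any two addresses 2^(k+m) = 64 kbytes apart land on the same line):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        const int m = 2, k = 14;            /* 64-kbyte direct-mapped cache */
        uint32_t a = 0x00001040;
        uint32_t b = a + (1u << (k + m));   /* 64 kbytes further on */
        unsigned ia = (a >> m) & ((1u << k) - 1);
        unsigned ib = (b >> m) & ((1u << k) - 1);
        printf("line(a)=%x line(b)=%x -> %s\n",
               ia, ib, ia == ib ? "conflict!" : "no conflict");
        return 0;
    }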

Page 14:

Cache - Conflicts

• Modified or 'dirty' words: when a word is modified in cache ...
• Write-back cache
  • Only writes data back when needed
  • Misses cost two memory accesses:
    • write the modified word back
    • read the new word
• Write-through cache
  • A low-priority write to main memory is queued
  • The processor is delayed by reads only
    • the memory write occurs in parallel with other work
    • instruction and necessary data fetches take priority
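The two policies differ only in when memory is updated; a minimal C sketch (single-word lines; memory_write and queue_memory_write are hypothetical stand-ins for the bus interface unit):

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct { uint32_t tag, data; bool valid, dirty; } Line;

    /* Hypothetical bus-interface stubs. */
    static void memory_write(uint32_t tag, uint32_t data)       { (void)tag; (void)data; }
    static void queue_memory_write(uint32_t tag, uint32_t data) { (void)tag; (void)data; }

    /* Write-back: memory is updated only when a dirty line is evicted. */
    void store_write_back(Line *l, uint32_t tag, uint32_t value) {
        if (l->valid && l->dirty && l->tag != tag)
            memory_write(l->tag, l->data);   /* miss: write the old word back first */
        l->tag = tag; l->data = value;
        l->valid = true;
        l->dirty = true;                     /* memory is now stale */
    }

    /* Write-through: every store is also queued to memory at low priority,
       so the line is never dirty. */
    void store_write_through(Line *l, uint32_t tag, uint32_t value) {
        l->tag = tag; l->data = value;
        l->valid = true;
        queue_memory_write(tag, value);
    }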

Page 15:

Cache - Write-through or write-back?

• Write-through
  • Allows an intelligent bus interface unit to make efficient use of a serious bottleneck: the processor-memory interface (main memory bus)
  • Reads (instruction and data) need priority!
    • they stall the processor
  • Writes can be delayed
    • at least until the location is needed!
• More on intelligent system interface units later, but ...

Page 16:

Cache - Write-through or write-back?

• Write-through
  • Seems a good idea! but ...
  • Multiple writes to the same location waste memory bus bandwidth
  • Typical programs run better with write-back caches
• However
  • Often you can easily predict which will be best
  • Some processors (e.g. PowerPC) allow you to classify memory regions as write-back or write-through

Page 17:

Cache - more bits

• Cache lines need some status bits
  • Tag bits + ...
  • Valid
    • all set to false on power-up
    • set to true as words are loaded into the cache
  • Dirty
    • needed by a write-back cache
    • a write-through cache always queues the write, so lines are never 'dirty'

Cache line: | Tag (p-k-m bits) | V (1 bit) | M (1 bit) | Data (p bits) |
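In C, such a line might be sketched with bit-fields (widths assume a 32-bit machine with k = 14, m = 2, so the tag is p-k-m = 16 bits):

    /* One cache line: tag + Valid + Modified (dirty) bits + the data word. */
    typedef struct {
        unsigned tag   : 16;   /* p-k-m tag bits */
        unsigned valid : 1;    /* false at power-up, true once loaded */
        unsigned dirty : 1;    /* set on write; only a write-back cache uses it */
        unsigned data;         /* the cached word (p bits) */
    } CacheLine;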

Page 18:

Cache - Improving Performance

• Conflicts (addresses 2^(k+m) bytes apart)
  • Degrade cache performance
    • lower hit rate
  • Murphy's Law operates
    • addresses are never random!
    • some locations 'thrash' in the cache
      • continually replaced and restored
• Alternatively
  • Ideal cache performance depends on uniform access to all parts of memory
  • Never happens in real programs!

Page 19:

Cache access pattern

• Ideally, each cache location is ‘hit’ the same number of times

• In reality, hot spots occur

• Some cache lines ‘thrash’

• Values ping-pong between cache and next level memory

• Cache provides no benefit
  • Could even slow down the processor (write-backs generate an uneven bus load)

Page 20:

Cache - Fully Associative

• All tags are compared at the same time
• Words can use any cache line

Page 21:

Cache - Fully Associative

• Associative
  • Each tag is compared at the same time
  • Any match = a hit
  • Avoids 'unnecessary' flushing
• Replacement
  • 'Capacity' conflicts will still occur
    • cache size << working data set size!
  • Use a Least Recently Used (LRU) algorithm (sketched below)
    • needs extra status bits: cycles since last accessed
• Hardware cost high
  • Extra comparators
  • Wider tags: p-m bits vs p-k-m bits
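A minimal software sketch of an associative lookup with LRU replacement (an access counter stands in for the 'cycles since last accessed' status bits; in hardware all the tag comparisons happen in parallel, one comparator per line):

    #include <stdint.h>
    #include <stdbool.h>

    enum { LINES = 128 };

    typedef struct { uint32_t tag, data, last_used; bool valid; } Line;
    static Line cache[LINES];
    static uint32_t now;   /* access counter, standing in for a cycle count */

    /* Every line's tag is checked; on a miss the least recently used
       line is reported as the replacement victim. */
    int lookup(uint32_t tag) {
        int victim = 0;
        for (int i = 0; i < LINES; i++) {
            if (cache[i].valid && cache[i].tag == tag) {
                cache[i].last_used = ++now;   /* hit: refresh LRU stamp */
                return i;
            }
            if (cache[i].last_used < cache[victim].last_used)
                victim = i;                   /* track the LRU candidate */
        }
        return -1 - victim;   /* miss: caller refills cache[victim] */
    }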

Page 22:

Cache - Organisation

• Direct-mapped cache
  + Simple, fast
  - Conflicts ('hot spots') degrade performance
• Fully associative
  + Avoids the conflict problem: any number of hits on addresses a + n*2^m, for any set of values of n, is possible
  ? Depends on the precise LRU algorithm
  - Extra hardware: additional comparators (one per line!) make the hardware cost very high!
• Hybrid organisation developed: set associative caches

Page 23:

Cache - Set Associative

Example: 2-way set associative
• Each line: two words
• Two comparators only

Page 24:

Cache - Set Associative

• n-way set associative caches
  • n can be small: 2, 4, 8 give the best performance
  • Reasonable hardware cost
  • Used in most high-performance processors
• Replacement policy
  • LRU choice from n candidates
  • Reasonable LRU approximation: 1 or 2 bits (see the sketch below)
    • set on access
    • cleared / decremented by a timer
    • choose a cleared word for replacement
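A minimal C sketch of such a cache (illustrative sizes; only the n lines of one set are compared, and a single 'referenced' bit approximates LRU):

    #include <stdint.h>
    #include <stdbool.h>

    enum { WAYS = 4, SETS = 256, M = 2 };

    typedef struct { uint32_t tag, data; bool valid, referenced; } Line;
    static Line cache[SETS][WAYS];

    /* n-way lookup: only WAYS comparators work on one set. */
    bool read_word(uint32_t addr, uint32_t *out) {
        uint32_t set = (addr >> M) % SETS;
        uint32_t tag = addr >> M;      /* whole word address as tag, for brevity */
        for (int w = 0; w < WAYS; w++) {
            Line *l = &cache[set][w];
            if (l->valid && l->tag == tag) {
                l->referenced = true;  /* set on access ... */
                *out = l->data;
                return true;
            }
        }
        return false;   /* miss: replace a line whose referenced bit is clear */
    }

    /* ... and cleared periodically by a timer. */
    void timer_tick(void) {
        for (int s = 0; s < SETS; s++)
            for (int w = 0; w < WAYS; w++)
                cache[s][w].referenced = false;
    }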

Page 25:

Cache - Locality of Reference

• Locality of reference is the fundamental principle which permits caches to work
  • Almost all programs exhibit it to some degree!
• Temporal locality
  • The same location will be referenced again soon
    • access the same data again
    • program loops: access the same instructions again
  • The caches described so far exploit temporal locality
• Spatial locality
  • Nearby locations will be referenced soon
    • next element of an array
    • next instruction of a program
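Both kinds are visible in even a trivial loop; an illustrative C fragment:

    /* Summing an array exercises both kinds of locality. */
    double sum(const double *a, int n) {
        double s = 0.0;           /* s, i and the loop code itself: temporal
                                     locality (reused on every iteration) */
        for (int i = 0; i < n; i++)
            s += a[i];            /* a[i]: spatial locality (the next element
                                     is in the same or the next cache line) */
        return s;
    }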

Page 26:

Cache - Line Length

• Spatial locality
  • Use very long cache lines
    • fetch one datum and its neighbours are fetched too
• Common in all types of program
  • Programs: the next instruction
    • branches are only about 10% of instructions
  • Data
    • scientific, engineering: arrays
    • commercial, information processing: character strings

Page 27:

Cache - Line Length

• Spatial locality
  • Allows efficient use of the bus
    • blocks of data are 'burst' across the bus
    • more efficient, reduced bus overheads
  • Modern RAM (RAMbus, DDR, SDDR, etc.) relies on it
• Example: PowerPC 601 (Motorola/Apple/IBM), the first of the single-chip Power processors
  • 64 sets
  • 8-way set associative
  • 32 bytes per line
  • 32 bytes (8 instructions) fetched into the instruction buffer in one cycle, or
  • data: a 32-byte string, 8 floats or 4 doubles
  • 64 x 8 x 32 = 16 kbytes total
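A quick check of that geometry (a sketch; the field widths follow from the stated sets and line size):

    #include <stdio.h>

    int main(void) {
        int sets = 64, ways = 8, line = 32;   /* PowerPC 601 figures above */
        printf("total = %d bytes\n", sets * ways * line);   /* 16384 = 16k */
        /* Address split: log2(32) = 5 byte-offset bits, log2(64) = 6 set-index
           bits; the remaining address bits form the tag. */
        return 0;
    }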

Page 28:

Cache - Separate I- and D-caches

• Unified cache
  • Instructions and data in the same cache
• Two caches: instructions + data
  • Increases total bandwidth
• MIPS R10000
  • 32-kbyte instruction; 32-kbyte data
  • Instruction cache is pre-decoded! (32 bits → 36 bits)
  • Data cache
    • 8-word (64-byte) line, 2-way set associative
    • 256 sets
  • Replacement policy?