Computer Architecture: Cache — John Morris, Electrical & Computer Engineering / Computer Science, The University of Auckland. Iolanthe at 13 knots on Cockburn Sound, WA


  • Computer Architecture: Cache

    John Morris
    Electrical & Computer Engineering / Computer Science, The University of Auckland
    Iolanthe at 13 knots on Cockburn Sound, WA

  • Memory Bottleneck
    - State-of-the-art processor: f = 3 GHz, t(clock) = 330 ps
      - 1-2 instructions per cycle; ~25% of instructions reference memory
      - One memory reference every ~4 instructions = every 4 x 330 ps, so a ~1.2 ns response is needed!
    - Bulk semiconductor RAM: 100 ns+ for a random access on the system bus!
      - The processor will spend most of its time waiting for memory!
      - DDR RAM is faster when streaming data over a special bus, but random access is still slow!

  • Cache
    - Small, fast memory: typically ~50 kbytes (1998); 2007 Pentium: 16 kbytes (Level 1)
    - 2-cycle access time; same die as the processor
    - Off-chip cache possible: a custom cache chip closely coupled to the processor, using fast static RAM (SRAM) rather than slower dynamic RAM
    - Several levels possible
    - The 2nd level of the memory hierarchy: caches the most recently used memory locations "closer" to the processor (closer = closer in time)

  • Cache
    - Etymology: cacher (French) = to hide
    - Transparent to a program: programs simply run slower without it
    - Modern processors rely on it: it reduces the cost of main memory access and enables instruction-per-cycle throughput
    - A typical program makes ~25% memory accesses
    - Reference: Wikipedia (don't just try Google)

  • Cache
    - Relies upon locality of reference: programs continually use - and re-use - the same locations
    - Instructions: loops, common subroutines
    - Data: look-up tables, working data sets

  • Cache - operation
    - Memory requests are checked in the cache first
    - If the word sought is in the cache, it's read from (or updated in) the cache: a cache hit
    - If not, the request is passed to main memory and the data is read (written) there: a cache miss
    [Diagram: CPU → MMU → Cache → Main Mem, carrying data or instructions (D or I); the MMU translates virtual addresses (VA) to physical addresses (PA)]

  • Cache - operation
    - Hit rates of 95% are usual
    - Effective memory access time (16-kbyte cache): cache 2 cycles, main memory 10 cycles
      Average access = 0.95 x 2 + 0.05 x 10 = 2.4 cycles
    - In general, if there are n levels of memory:
      average memory access time = Σj fj · tj,acc
      where fj = fraction of accesses hitting at level j, and tj,acc = access time for level j
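The two-level example above generalises directly; a minimal sketch (the function name `avg_access_time` is my own, not from the slides):

```python
def avg_access_time(levels):
    """Effective memory access time: sum of f_j * t_j over all levels,
    where f_j is the fraction of accesses that hit at level j and
    t_j is that level's access time in cycles."""
    return sum(f * t for f, t in levels)

# Slide's two-level case: 95% hit a 2-cycle cache, 5% go to 10-cycle memory.
print(avg_access_time([(0.95, 2), (0.05, 10)]))  # 2.4 cycles
```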

    [Diagram: Memory hierarchy - CPU → L1 Cache → L2 Cache → Bulk memory → Paging Disc; faster toward the CPU, larger toward the disc. Small is fast!]

  • Cache - operation
    - Key problem: the cache is 16 kbytes but memory is 2 Mbytes minimum - what goes in the cache?
    - Common instructions and data: we'd like the very common data in L1 and less common data in L2
    - How do we:
      - ensure that the very common data is in L1?
      - find out whether the data item that we want now is in the cache?


  • Cache - organisation
    - Three basic types:
      - Direct-mapped cache
      - Fully associative cache
      - Set associative cache
    - Note that all of these are possible for other caches too:
      - Disc cache
      - Web page cache (frequently accessed pages)
      - Database cache (frequently accessed records)

  • Cache - organisation
    - Direct-mapped cache: each word in the cache has a tag
    - Assume: cache size = 2^k words, machine words = p bits, byte-addressed memory
    - m = log2(p/8) bits are not used to address words (m = 2 for a 32-bit machine)
    - Address format (p bits): | tag (p-k-m bits) | cache address (k bits) | byte address (m bits) |

  • Cache - organisation
    - Direct-mapped cache: the k-bit cache address selects one of 2^k cache lines; the stored tag (p-k-m bits) is compared with the tag field of the memory address to generate Hit?, and the p-bit data word is delivered to the CPU
    [Diagram: memory address split into tag / cache address / byte address; each cache line holds a tag and a data word]
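The address split above is just bit manipulation; a hypothetical helper (`split_address` is my own name) for the slide's 32-bit, 16-kword case:

```python
def split_address(addr, k, m):
    """Split a byte address into (tag, cache line index, byte-in-word),
    following the | tag (p-k-m) | cache address (k) | byte (m) | format."""
    byte = addr & ((1 << m) - 1)          # low m bits: byte within word
    line = (addr >> m) & ((1 << k) - 1)   # next k bits: cache line index
    tag = addr >> (k + m)                 # remaining high bits: tag
    return tag, line, byte

# 32-bit machine, 2^14-word cache: m = 2, k = 14.
print(split_address(0x0001_2344, k=14, m=2))  # (1, 2257, 0)
```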

  • Cache - Conflicts
    - Conflicts: two addresses separated by 2^(k+m) will hit the same cache location

    Address format (p bits): | tag (p-k-m bits) | cache address (k bits) | byte address (m bits) |
    Addresses in which these k bits are the same will map to the same cache line

  • Cache - Direct Mapped
    - Conflicts: two addresses separated by 2^(k+m) will hit the same cache location
    - 32-bit machine, 64-kbyte (16-kword) cache: m = 2, k = 14
    - Any program or data set larger than 64 kbytes will generate conflicts
    - On a conflict, the old word is flushed:
      - an unmodified word (program, constant data) is simply overwritten by the new data from memory
      - modified data needs to be written back to memory before being overwritten
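The conflict rule can be checked numerically; an illustrative sketch (the `line_index` helper and example address are my own) using the slide's k = 14, m = 2:

```python
K, M = 14, 2   # 16k-word direct-mapped cache on a 32-bit machine

def line_index(addr):
    """Cache line index: the k bits above the byte-address bits."""
    return (addr >> M) & ((1 << K) - 1)

a = 0x0000_1000
b = a + (1 << (K + M))                 # exactly 2^(k+m) = 64 kbytes away
print(line_index(a) == line_index(b))  # True: both map to the same line,
                                       # so the second access evicts the first
```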

  • Cache - Conflicts
    - Modified or "dirty" words: a word has been modified in the cache
    - Write-back cache: only writes data back when needed
      - A miss then costs two memory accesses: write the modified word back, then read the new word
    - Write-through cache: a low-priority write to main memory is queued
      - The processor is delayed by reads only; the memory write occurs in parallel with other work
      - Instruction and necessary data fetches take priority

  • Cache - Write-through or write-back?
    - Write-through allows an intelligent bus interface unit to make efficient use of a serious bottleneck: the processor-memory interface (main memory bus)
    - Reads (instruction and data) need priority! They stall the processor.
    - Writes can be delayed - at least until the location is needed!
    - More on intelligent system interface units later, but ...

  • Cache - Write-through or write-back?
    - Write-through seems a good idea! but ...
    - Multiple writes to the same location waste memory bus bandwidth
    - Typical programs run better with write-back caches
    - However, often you can easily predict which will be best
    - Some processors (e.g. PowerPC) allow you to classify memory regions as write-back or write-through
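A toy model of the queued-write idea above (the `write_buffer` structure is a hypothetical illustration, not real hardware): writes go into a low-priority buffer that the bus drains while the processor keeps running, so only reads stall the CPU; the queue also exposes the drawback just noted, since repeated writes to one location each cost a bus transaction.

```python
from collections import deque

write_buffer = deque()

def write_through(addr, value):
    """Queue a write to main memory; drained when the bus is otherwise idle."""
    write_buffer.append((addr, value))

for i in range(3):
    write_through(0x100, i)   # three writes to the *same* location
print(len(write_buffer))      # 3 queued bus transactions - wasted bandwidth
```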

  • Cache - more bits
    - Cache lines need some status bits in addition to the tag bits:
    - Valid: all set to false on power-up; set to true as words are loaded into the cache
    - Dirty: needed by a write-back cache; a write-through cache always queues the write, so lines are never dirty
    [Cache line: tag (p-k-m bits) | V (1 bit) | M (1 bit) | data (p bits)]
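The valid and dirty bits can be shown in a minimal write-back, direct-mapped cache model (a hypothetical, word-granularity sketch; the class and its names are my own): a dirty line must be written back before it is replaced.

```python
class Line:
    def __init__(self):
        self.valid = False   # false on power-up
        self.dirty = False   # set when the word is modified in cache
        self.tag = None

class WriteBackCache:
    def __init__(self, k, m):
        self.k, self.m = k, m
        self.lines = [Line() for _ in range(1 << k)]
        self.writebacks = 0

    def access(self, addr, write=False):
        idx = (addr >> self.m) & ((1 << self.k) - 1)
        tag = addr >> (self.k + self.m)
        line = self.lines[idx]
        hit = line.valid and line.tag == tag
        if not hit:
            if line.valid and line.dirty:
                self.writebacks += 1      # modified word flushed to memory first
            line.valid, line.dirty, line.tag = True, False, tag
        if write:
            line.dirty = True             # modified in cache only
        return hit

cache = WriteBackCache(k=2, m=2)
cache.access(0x00, write=True)   # miss; line 0 is now dirty
cache.access(0x10, write=False)  # conflict: evicting the dirty line costs a write-back
print(cache.writebacks)          # 1
```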

  • Cache - Improving Performance
    - Conflicts (addresses 2^(k+m) bytes apart) degrade cache performance: lower hit rate
    - Murphy's Law operates: addresses are never random!
    - Some locations "thrash" in the cache: continually replaced and restored
    - Put another way: ideal cache performance depends on uniform access to all parts of memory, which never happens in real programs!

  • Cache access pattern
    - Ideally, each cache location is hit the same number of times
    - In reality, hot spots occur: some cache lines thrash, with values ping-ponging between the cache and the next-level memory
    - The cache then provides no benefit - it could even slow down the processor (write-backs generate an uneven bus load)

  • Cache - Fully Associative
    - All tags are compared at the same time
    - Words can use any cache line

  • Cache - Fully Associative
    - Associative: each tag is compared at the same time; any match → hit
    - Avoids unnecessary flushing
    - Replacement: capacity conflicts will still occur when the working set exceeds the cache size
  • Cache Organization
    - Direct Mapped Cache: simple, fast, but conflicts (hot spots) degrade performance
    - Fully Associative: avoids the conflict problem - any number of hits on addresses a + j·2^(k+m), for any set of values of j, is possible
      - Performance depends on a precise LRU algorithm
      - Extra hardware: additional comparators (one per line!) make the hardware cost very high!
    - A hybrid organization was developed: set associative caches

  • Cache - Set Associative
    - Example: 2-way set associative
    - Each line holds two words; only two comparators are needed

  • Cache - Set Associative
    - n-way set associative caches: n can be small (2, 4, 8)
      - Best performance at reasonable hardware cost; used in most high-performance processors
    - Replacement policy: LRU choice from the n ways
      - A reasonable LRU approximation: 1 or 2 bits per line, set on access, cleared/decremented by a timer; choose a cleared word for replacement
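A small model of the n-way lookup with true LRU replacement (a hypothetical sketch; the class models tags only, and the parallel tag comparison is shown as a loop):

```python
class SetAssocCache:
    """n-way set-associative cache with true LRU (tags only, no data)."""
    def __init__(self, n_ways, n_sets):
        self.n_ways, self.n_sets = n_ways, n_sets
        self.sets = [[] for _ in range(n_sets)]  # each list: LRU first, MRU last

    def access(self, block_addr):
        s = self.sets[block_addr % self.n_sets]  # set index
        tag = block_addr // self.n_sets          # tag compared against all ways
        hit = tag in s
        if hit:
            s.remove(tag)                        # re-inserted below as MRU
        elif len(s) == self.n_ways:
            s.pop(0)                             # evict the least recently used
        s.append(tag)                            # mark most recently used
        return hit

c = SetAssocCache(n_ways=2, n_sets=4)
c.access(0); c.access(4)   # both map to set 0 - a 2-way set holds both
print(c.access(0))         # True: no conflict, unlike a direct-mapped cache
c.access(8)                # a third block in set 0 evicts the LRU one (block 4)
print(c.access(4))         # False: block 4 was replaced
```

Setting n_sets = 1 makes every block map to the same set, which is exactly the fully associative case from the previous slides.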

  • Cache - Locality of Reference
    - Locality of reference is the fundamental principle which permits caches to work; almost all programs exhibit it to some degree!
    - Temporal locality: the same location will be referenced again soon
      - Access the same data again; program loops access the same instructions again
      - The caches described so far exploit temporal locality
    - Spatial locality: nearby locations will be referenced soon
      - The next element of an array; the next instruction of a program

  • Cache - Line Length
    - Spatial locality: use very long cache lines - fetch one datum and its neighbours are fetched too
    - Common in all types of program:
      - Programs: the next instruction (branches are only about 10% of instructions)
      - Data: scientific/engineering - arrays; commercial/information processing - character strings

  • Cache - Line Length
    - Spatial locality allows efficient use of the bus: blocks of data are burst across it - more efficient, reduced bus overheads
    - Modern RAM (RAMbus, DDR, SDDR, etc) relies on it
    - Example: PowerPC 601 (Motorola/Apple/IBM), first of the single-chip Power processors
      - 64 sets, 8-way set associative, 32 bytes per line
      - 32 bytes (8 instructions) fetched into the instruction buffer in one cycle; or data: a 32-byte string, 8 floats, or 4 doubles
      - 64 x 8 x 32 = 16 kbytes total
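The quoted geometry checks out with a line of arithmetic (the example address below is my own, for illustration only):

```python
# PowerPC 601 cache geometry from the slide:
# total cache size = sets x ways x bytes per line.
sets, ways, line_bytes = 64, 8, 32
print(sets * ways * line_bytes)   # 16384 bytes = 16 kbytes

# With 32-byte lines, the low log2(32) = 5 address bits select the byte
# within a line, and the next log2(64) = 6 bits select the set.
addr = 0x1234
print((addr >> 5) & 0x3F)         # set index for this (hypothetical) address: 17
```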

  • Cache - Separate I- and D-caches
    - Unified cache: instructions and data in the same cache
    - Two caches - one for instructions, one for data - increase the total bandwidth
    - MIPS R10000:
      - 32-kbyte instruction cache; 32-kbyte data cache
      - The instruction cache is pre-decoded! (32 bits → 36 bits)
      - Data: 8-word (64-byte) lines, 2-way set associative, 256 sets
      - Replacement policy?