Cache Memory Chapter 17 S. Dandamudi

© 2003 S. Dandamudi. To be used with S. Dandamudi, “Fundamentals of Computer Organization and Design,” Springer, 2003.



  • Cache Memory
    Chapter 17
    S. Dandamudi

  • Outline
    – Introduction
    – How cache memory works
    – Why cache memory works
    – Cache design basics
    – Mapping function
      · Direct mapping
      · Associative mapping
      · Set-associative mapping
    – Replacement policies
    – Write policies
    – Space overhead
    – Types of cache misses
    – Types of caches
    – Example implementations
      · Pentium
      · PowerPC
      · MIPS
    – Cache operation summary
    – Design issues
      · Cache capacity
      · Cache line size
      · Degree of associativity

  • Introduction
    – Memory hierarchy: registers, memory, disk
    – Cache memory is a small amount of fast memory
      · Placed between two levels of the memory hierarchy to bridge the gap in access times
      · Between processor and main memory (our focus)
      · Between main memory and disk (disk cache)
    – Expected to behave like a large amount of fast memory

  • Introduction (contd)

  • How Cache Memory Works
    – Prefetch data into the cache before the processor needs it
    – Requires predicting the processor's future access requirements
      · Not difficult owing to locality of reference
    – Important terms
      · Miss penalty
      · Hit ratio
      · Miss ratio = (1 - hit ratio)
      · Hit time

  • How Cache Memory Works (contd)
    – Cache read operation

  • How Cache Memory Works (contd)
    – Cache write operation

  • Why Cache Memory Works
    – Example: a doubly nested for loop that visits every element of a matrix, in column order versus row order
      · Row-order traversal accesses consecutive elements (for row-major storage); column-order traversal strides a full row ahead on each access
    [Figures: execution time (ms) versus matrix size, for column-order and row-order traversal]
  • Cache Design Basics
    – On every read miss, a fixed number of bytes is transferred
      · More than what the processor needs
      · Effective due to spatial locality
    – Cache is divided into blocks of B bytes
      · b bits are needed as the offset into the block: b = log2 B
      · Blocks are called cache lines
    – Main memory is also divided into blocks of the same size
    – Address is divided into two parts

  • Cache Design Basics (contd)
    – B = 4 bytes, b = 2 bits

  • Cache Design Basics (contd)
    – Transfer between main memory and cache
      · In units of blocks
      · Implements spatial locality
    – Transfer between cache and processor
      · In units of words
    – Need policies for
      · Block placement (mapping function)
      · Block replacement
      · Write policies

  • Cache Design Basics (contd)
    – Read cycle operations

  • Mapping Function
    – Determines how memory blocks are mapped to cache lines
    – Three types
      · Direct mapping: specifies a single cache line for each memory block
      · Set-associative mapping: specifies a set of cache lines for each memory block
      · Associative mapping: no restrictions; any cache line can be used for any memory block

  • Mapping Function (contd)
    – Direct mapping example

  • Mapping Function (contd)
    – Implementing direct mapping
      · Easier than the other two
    – Maintains three pieces of information
      · Cache data: the actual data
      · Cache tag
        Problem: more memory blocks than cache lines, so several memory blocks map to a cache line
        The tag stores the address of the memory block in the cache line
      · Valid bit: indicates whether the cache line contains a valid block

  • Mapping Function (contd)

  • Mapping Function (contd)
    – Reference pattern: 0, 4, 0, 8, 0, 8, 0, 4, 0, 4, 0, 4
    – Direct mapping: hit ratio = 0%

  • Mapping Function (contd)
    – Reference pattern: 0, 7, 9, 10, 0, 7, 9, 10, 0, 7, 9, 10
    – Direct mapping: hit ratio = 67%

  • Mapping Function (contd)
    – Associative mapping

  • Mapping Function (contd)
    – Associative mapping
    – Reference pattern: 0, 4, 0, 8, 0, 8, 0, 4, 0, 4, 0, 4
    – Hit ratio = 75%

  • Mapping Function (contd)
    – Address match logic for associative mapping

  • Mapping Function (contd)
    – Associative cache with address match logic

  • Mapping Function (contd)
    – Set-associative mapping

  • Mapping Function (contd)
    – Address partition in set-associative mapping

  • Mapping Function (contd)
    – Set-associative mapping
    – Reference pattern: 0, 4, 0, 8, 0, 8, 0, 4, 0, 4, 0, 4
    – Hit ratio = 67%

  • Replacement Policies
    – The replacement policy is invoked when there is no room in the cache to load the memory block
    – Depends on the placement policy in effect
      · Direct mapping needs no special replacement policy: replace the mapped cache line
    – Several policies exist for the other two mapping functions
      · Popular: LRU (least recently used)
      · Random replacement
      · Of less interest: FIFO, LFU

  • Replacement Policies (contd)
    – LRU
      · Expensive to implement, particularly for set sizes greater than four
      · Implementations resort to approximation
    – Pseudo-LRU
      · Partitions a set into two groups and tracks which group was accessed more recently, which requires only one bit
      · Applied recursively, requires only (W-1) bits in total (W = degree of associativity)
      · PowerPC is an example (details later)

  • Replacement Policies (contd)
    – Pseudo-LRU implementation

  • Write Policies
    – Memory write requires special attention: we have two copies
      · A memory copy
      · A cached copy
    – Write policy determines how a memory write operation is handled
    – Two policies
      · Write-through: update both copies
      · Write-back: update only the cached copy; the memory copy must be updated later

  • Write Policies (contd)
    – Cache hit in a write-through cache (Figure 17.3a)

  • Write Policies (contd)
    – Cache hit in a write-back cache

  • Write Policies (contd)
    – Write-back policy
      · Updates the memory copy when the cache copy is being replaced
      · The cache copy is first written back to update the memory copy
      · The number of write-backs can be reduced by writing back only when the cache copy differs from the memory copy
        Done by associating a dirty bit (or update bit); write back only when the dirty bit is 1
    – Write-back caches thus require two bits per line
      · A valid bit
      · A dirty (or update) bit

  • Write Policies (contd)
    – Needed only in write-back caches

  • Write Policies (contd)
    – Other ways to reduce write traffic
    – Buffered writes
      · Especially useful for write-through policies
      · Writes to memory are buffered and written at a later time
      · Allows write combining: multiple writes are caught in the buffer itself
    – Example: Pentium
      · Uses a 32-byte write buffer
      · Buffer is written at several trigger points
      · An example trigger point: buffer-full condition

  • Write Policies (contd)
    – Write-through versus write-back
    – Write-through
      · Advantage: both cache and memory copies are consistent
        Important in multiprocessor systems
      · Disadvantage: tends to waste bus and memory bandwidth
    – Write-back
      · Advantage: reduces write traffic to memory
      · Disadvantages: takes longer to load new cache lines; requires an additional dirty bit

  • Space Overhead
    – The three mapping functions introduce different space overheads
    – Overhead increases with increasing degree of associativity (tags get longer)
    – Several examples in the text
      · 4 GB address space, 32 KB cache

  • Types of Cache Misses
    – Three types
    – Compulsory misses
      · Due to first-time access to a block
      · Also called cold-start misses or compulsory line fills
    – Capacity misses
      · Induced by the cache's capacity limitation
      · Can be avoided by increasing cache size
    – Conflict misses
      · Due to conflicts caused by direct and set-associative mappings
      · Can be completely eliminated by fully associative mapping
      · Also called collision misses

  • Types of Cache Misses (contd)
    – Compulsory misses
      · Reduced by increasing block size (we prefetch more)
      · Block size cannot be increased beyond a limit, or cache misses increase
    – Capacity misses
      · Reduced by increasing cache size (law of diminishing returns)
    – Conflict misses
      · Reduced by increasing the degree of associativity
      · Fully associative mapping: no conflict misses

  • Types of Caches
    – Separate instruction and data caches
      · Initial cache designs used unified caches
      · Current trend is to use separate caches (for level 1)

  • Types of Caches (contd)
    – Several reasons for preferring separate caches
      · Locality tends to be stronger
      · Can use different designs for data and instruction caches
    – Instruction caches
      · Read-only, dominantly sequential access
      · No need for write policies
      · Can use a simple direct-mapped cache implementation
    – Data caches
      · Can use a set-associative cache
      · An appropriate write policy can be implemented
    – Disadvantage
      · Rigid boundary between data and instruction caches

  • Types of Caches (contd)
    – Number of cache levels
      · Most use two levels
      · Primary (level 1 or L1): on-chip
      · Secondary (level 2 or L2): off-chip
    – Examples
      · Pentium: L1 32 KB, L2 up to 2 MB
      · PowerPC: L1 64 KB, L2 up to 1 MB

  • Types of Caches (contd)
    – Two-level caches work as follows:
      · First attempt to get the data from the L1 cache
      · If present in L1, get the data from L1 (L1 cache hit)
      · If not, the data must come from the L2 cache or main memory (L1 cache miss)
      · On an L1 cache miss, try the L2 cache: if the data are in L2, get them from L2 (L2 cache hit) and write the block into L1
      · If not, the data come from main memory (L2 cache miss); the main memory block is written into both L1 and L2
    – Variations on this basic scheme are possible

  • Types of Caches (contd)
    – Virtual and physical caches

  • Example Implementations
    – We look at three processors: Pentium, PowerPC, MIPS
    – Pentium implementation
      · Two levels
      · L1 cache: split cache design (separate data and instruction caches)
      · L2 cache: unified cache design

  • Example Implementations (contd)
    – Pentium allows each page/memory region to have its own caching attributes
    – Uncacheable
      · All reads and writes go directly to main memory
      · Useful for memory-mapped I/O devices, large data structures that are read once, and write-only data structures
    – Write combining
      · Not cached; writes are buffered to reduce accesses to main memory
      · Useful for video frame buffers

  • Example Implementations (contd)
    – Write-through
      · Uses the write-through policy
      · Writes are delayed as they go through a write buffer, as in write-combining mode
    – Write-back
      · Uses the write-back policy
      · Writes are delayed as in the write-through mode
    – Write protected
      · Inhibits cache writes
      · Writes are done directly to memory

  • Example Implementations (contd)
    – Two bits in control register CR0 determine the mode
      · Cache disable (CD) bit
      · Not write-through (NW) bit

  • Example Implementations (contd)
    – PowerPC cache implementation: two levels
    – L1 cache
      · Split cache; each 32 KB, eight-way set associative
      · Uses pseudo-LRU replacement
      · Instruction cache: read-only
      · Data cache: read/write, with a choice of write-through or write-back
    – L2 cache
      · Unified cache, as in the Pentium
      · Two-way set associative

  • Example Implementations (contd)
    – Write policy type and caching attributes can be set by the OS at the block or page level
    – L2 cache requires only a single bit to implement LRU (because it is two-way associative)
    – L1 cache implements pseudo-LRU
      · Each set maintains seven PLRU bits (B0–B6)

  • Example Implementations (contd)
    – PowerPC placement policy (incl. PLRU)

  • Example Implementations (contd)
    – MIPS implementation: two-level cache
    – L1 cache: split organization
      · Instruction cache: virtual cache, direct mapped, read-only
      · Data cache: virtual cache, direct mapped, uses write-back policy
      · L1 line size: 16 or 32 bytes

  • Example Implementations (contd)
    – L2 cache
      · Physical cache
      · Either unified or split, configured at boot time
      · Direct mapped; uses write-back policy
      · Cache block size: 16, 32, 64, or 128 bytes, set at boot time
      · L1 cache line size must not exceed the L2 cache block size
    – Direct mapping simplifies replacement
      · No need for a complex LRU-type implementation

  • Cache Operation Summary
    – Various policies used by a cache
    – Placement of a block
      · Direct mapping
      · Fully associative mapping
      · Set-associative mapping
    – Location of a block
      · Depends on the placement policy
    – Replacement policy
      · LRU is the most popular; pseudo-LRU is often implemented
    – Write policy
      · Write-through
      · Write-back

  • Design Issues
    – Several design issues
      · Cache capacity (law of diminishing returns)
      · Cache line size/block size
      · Degree of associativity
      · Unified/split
      · Single/two-level
      · Write-through/write-back
      · Logical/physical

  • Design Issues (contd)