CMPE 421 Parallel Computer Architecture
PART 2: CACHING
2
Caching Principles
• The idea is to use a small amount of fast memory near the processor (in a cache)
• The cache holds frequently needed memory locations
– When an instruction references a memory location, we want that value to be in the cache
• For the time being, we will focus on a 2-level hierarchy
– Cache (small, fast memory directly connected to the processor; upper level)
– Main memory (large, slow memory at level 2 in the hierarchy)
3
Caching Principles
[Figure: levels in the memory hierarchy — the CPU sits at the top (Level 1), with Level 2 down to Level n below; distance from the CPU increases access time, the size of the memory grows at each level, and a block of data is the unit of data copy between levels]
- Transfer of data is done between adjacent levels in the hierarchy only
- All access by the processor is to the topmost level
- Performance depends on hit rates
4
Caching Examples
• Principle: Results of operations that are expensive should be kept around for reuse
• Examples:
– CPU caching
– Forwarding table caching
– File caching
– Web caching
– Query caching
– Computation caching
5
Cache Levels
• Register: a cache on variables
• First-level cache: a cache on the second-level cache
• Second-level cache: a cache on memory
• Memory: a cache on disk (virtual memory)
• TLB: a cache on the page table
• Branch prediction: a cache on prediction information?
6
Terminology
• Block: The minimum unit of information transferred between the cache and main memory. Typically measured in bytes or words
– Block addressing varies by technology at each level
– Blocks are moved one level at a time
• HIT: Data appears in a block in the upper level. When a program needs a particular data object d from the lower level, it first looks for d in one of the blocks currently stored at the upper level. If d happens to be cached at the upper level, we have what is called a cache hit
– For example, a program with good temporal locality might read a data object from block d, resulting in a cache hit from the upper level
• Ex: remote HTML files stored on WEB servers
Ex: Finding the information in one of the books on your desk
7
Terminology
• MISS: Data was not in the upper level and had to be fetched from a lower level. When there is a miss, the cache at the upper level fetches the block containing d, possibly overwriting an existing block if the upper level is already full
• HIT RATE: The ratio of hits to memory accesses found in the upper level
– Used as a measure of the performance of the memory hierarchy
8
MISS EXAMPLE
Miss Example: Reading the data object from block 12 in the upper-level cache would result in a cache miss, because block 12 is not currently stored in the upper-level cache. Once it has been copied from the lower level to the upper level, block 12 will remain there in expectation of later accesses
9
Terminology
• MISS RATE: The ratio of misses to memory accesses found in the upper level
Miss rate = 1 – Hit rate
• HIT TIME: Time to access the upper level (cache)
Hit time = tc = Access time + Time to determine hit/miss (i.e., to find out if it is in the cache, plus cache-to-processor delivery)
Ex: The time needed to look through the books on the desk
10
Terminology
• MISS PENALTY: The time to replace a block in the cache with a block from main memory and to deliver the element to the processor
Miss penalty = tc + tm = Lower-level access time + Replacement time + Time to deliver to upper level
Ex: The time to get another book from the shelves and place it on the desk
• Miss penalty is usually much larger than the hit time
– Because the upper level is smaller and built using faster memory parts
– The time to examine the books on the desk is much smaller than the time to get up and get a new book from the shelves
Ex: HIT_RATIO = 0.9, MISS_RATIO = 1.0 – 0.9 = 0.1
Ideally hit_ratio = 1.0 and miss_ratio = 0.0; in practice, hit_ratio < 1.0, typically 0.95 or better
11
Handling a Cache Miss
• A cache hit, if it happens in 1 cycle, has no effect on our pipeline, but a cache miss does
• The action required depends on whether we have an instruction miss or a data miss
• For an instruction miss:
1. Send the original PC value to the memory
2. Instruct main memory to perform a read and wait for the memory to complete the access
3. Write the result into the appropriate cache entry
4. Restart the instruction
• For a data miss:
1. Stall the pipeline
2. Instruct main memory to perform a read and wait for the memory to complete the access
3. Return the result from the memory unit and allow the pipeline to continue
12
Exploiting Locality
• Need to update the contents of the cache with useful stuff
– Leverage locality
• Spatial locality
– Rather than fetching just the word that missed, fetch a block of data around the word that missed
• If you need these words (and you often do) they will now hit
• This is also good since you can build memory systems that deliver large blocks of data once they access it (disk/DRAM)
• Temporal locality
– Keep more recently accessed data items closer to the processor
– So when we need space in the cache, evict the old ones
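The spatial-locality payoff can be sketched with a toy model (the `miss_count` helper, block sizes, and access trace below are illustrative assumptions, not from the slides): fetching a whole block on each miss turns later accesses to neighboring words into hits.

```python
# Toy model: count misses for a sequential scan, assuming every
# fetched block stays cached (no capacity limit, for illustration).
def miss_count(addresses, block_size):
    cached_blocks = set()
    misses = 0
    for addr in addresses:
        block = addr // block_size  # fetching one word brings in its whole block
        if block not in cached_blocks:
            misses += 1
            cached_blocks.add(block)
    return misses

scan = list(range(64))            # sequential word addresses
print(miss_count(scan, 1))        # 64 misses: every word misses
print(miss_count(scan, 4))        # 16 misses: one miss per 4-word block
```

With 4-word blocks, only the first access in each block misses; the next three hit, cutting the miss rate to 1/4 for this sequential trace.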
13
Access Times
• Average access time
Access time = (hit time)(hit rate) + (miss penalty)(miss rate)
– The hope is that the hit time will be low and the hit rate high, since the miss penalty is so much larger than the hit time
• Average Memory Access Time (AMAT)
– The formula can be applied to any level of the hierarchy
– It can be generalized for the entire hierarchy
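The slide's formula can be evaluated directly; as a sketch, using the tc/tm notation from the miss-penalty slide (the numeric values below are assumed for illustration):

```python
def amat(hit_time, miss_penalty, hit_rate):
    """Average access time per the slide's formula:
    (hit time)(hit rate) + (miss penalty)(miss rate),
    where the miss penalty already includes the upper-level access (tc + tm)."""
    miss_rate = 1.0 - hit_rate
    return hit_time * hit_rate + miss_penalty * miss_rate

tc, tm = 1, 100                  # assumed cache and main-memory times (cycles)
print(amat(tc, tc + tm, 0.95))   # ~6 cycles: 0.95*1 + 0.05*101
```

Even a 5% miss rate multiplies the effective access time by about six here, which is why the hit rate matters so much.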
14
Simple Cache Model
• Assume that the processor accesses memory one word at a time
• A block consists of one word
• When a word is referenced and is not in the cache, it is put in the cache (copied from main memory)
15
Cache Usage
• At some point in time the cache holds memory items X1, X2, …, Xn-1
• The processor next accesses memory item Xn, which is not in the cache
• How do we know if an item is in the cache?
• If it is in the cache, how do we know where it is?
16
17
Cache Arrangement
• How should the data in the cache be arranged?
• Several different approaches
– Direct mapped: memory addresses map to a particular location in the cache
– Fully associative: data can be placed anywhere in the cache
– N-way set associative: data can be placed in a limited number of places in the cache, depending upon the memory address
18
Direct Mapped Cache Organization
• Each memory location is mapped to a single location in the cache
– there is only one place it can be!
• Remember that the cache is smaller than memory, so many memory locations will be mapped to the same location in the cache
19
Mapping Function
• The simplest mapping is based on the LS bits of the address
• For example, all memory locations whose address ends in 001 will be mapped to the same location in the cache
• This requires a cache size of 2^n locations (a power of 2)
20
A Direct Mapped Cache
• Memory addresses are mapped to a cache index
– The index is given by (block address) modulo (number of blocks in cache)
• If the cache size is a power of 2, the modulo operation simply throws away the high-order bits of the address
Ex: Direct-mapped cache of 8 words, memory size 32 words
Use log2(8) = 3 bits for the cache address XXX
As the cache size is a power of 2, throw away the higher bits
Ex: 00 001 => cache addr = 001
    11 101 => cache addr = 101
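The index/tag split can be sketched in a few lines of bit arithmetic (the `split_address` helper is a hypothetical name; it just reproduces the modulo mapping above for a power-of-two slot count):

```python
def split_address(block_addr, num_slots):
    """Split a block address into (index, tag) for a direct-mapped
    cache with a power-of-two number of slots."""
    index_bits = num_slots.bit_length() - 1  # log2(num_slots)
    index = block_addr & (num_slots - 1)     # same as block_addr % num_slots
    tag = block_addr >> index_bits           # remaining high-order bits
    return index, tag

# The slide's examples: 8-slot cache, 5-bit block addresses
print(split_address(0b00001, 8))  # (0b001, 0b00) -> index 1, tag 0
print(split_address(0b11101, 8))  # (0b101, 0b11) -> index 5, tag 3
```

Masking with `num_slots - 1` is exactly the "throw away the higher bits" step: for 8 slots the mask is 0b111, keeping the low 3 bits as the index.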
21
Problem With Direct Mapped Cache
• We still need a way to find out which of the many possible memory elements is currently in a cache slot
– slot: a location in the cache that can hold a block
• We need to store the address of the item currently using cache slot 001
• We therefore add a tag to each cache entry that identifies which address it currently contains, by storing the MSBs that uniquely identify that memory address (the LSBs already select a particular cache entry)
• The tag associated with a cache slot tells who is currently using the slot
• We don't need to store the entire memory location address, just those bits that are not used to determine the slot number (the mapping)
22
Solution: TAG
A field in a table used for a memory hierarchy that contains the address information required to identify whether the associated block in the hierarchy corresponds to the required word
23
Initialization Problem
• Initially the cache is empty
– all the bits in the cache (including the tags) will have random values
• After some number of accesses, some of the tags are real and some are still just random junk
• How do we know which cache slots are junk and which really mean something?
24
Answer: Introduce Valid Bits
• Include one more bit with each cache slot that indicates whether the tag is valid or not
• Provide hardware to initialize these bits to 0 (one bit per cache slot)
• When checking a cache slot for a specific memory location, ignore the tag if the valid bit is 0
• Change a slot's valid bit to 1 when putting something in the slot (from main memory)
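Putting the tag, index, and valid bit together, a minimal sketch of the one-word-block direct-mapped cache from these slides might look like this (the class name, slot count, and access trace are assumed for illustration):

```python
class DirectMappedCache:
    """Toy direct-mapped cache with one-word blocks, a tag and a
    valid bit per slot, as described in the slides."""

    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.valid = [False] * num_slots  # hardware initializes valid bits to 0
        self.tags = [0] * num_slots

    def access(self, addr):
        """Return True on a hit, False on a miss (and fill the slot)."""
        index = addr % self.num_slots     # LS bits pick the slot
        tag = addr // self.num_slots      # MS bits identify the address
        # Hit only if the slot is valid AND the stored tag matches
        if self.valid[index] and self.tags[index] == tag:
            return True
        # Miss: fetch the word, overwriting whoever was using this slot
        self.valid[index] = True
        self.tags[index] = tag
        return False

cache = DirectMappedCache(8)
hits = [cache.access(a) for a in [1, 9, 1, 1]]
print(hits)  # [False, False, False, True]
```

Addresses 1 and 9 both map to slot 1 (same LS bits), so they keep evicting each other: three misses before the repeated access to 1 finally hits. The valid bit makes the very first access a miss even though the tag array happens to contain 0.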
25
Direct Mapped Cache with Valid Bit