CMPE 421 Parallel Computer Architecture PART 2 CACHING



Page 1: CMPE 421 Parallel Computer Architecture

CMPE 421 Parallel Computer Architecture

PART 2: CACHING

Page 2: CMPE 421 Parallel Computer Architecture


Caching Principles

• The idea is to use a small amount of fast memory near the processor (in a cache)
• The cache holds frequently needed memory locations
  – When an instruction references a memory location, we want that value to be in the cache
• For the time being, we will focus on a 2-level hierarchy:
  – Cache (small, fast memory directly connected to the processor; the upper level)
  – Main memory (large, slow memory at level 2 in the hierarchy)

Page 3: CMPE 421 Parallel Computer Architecture


Caching Principles

[Figure: levels in the memory hierarchy. The CPU sits at Level 1; each level below (Level 2 down to Level n) is larger and at an increasing distance from the CPU in access time; blocks of data (the unit of data copy) are transferred between levels.]

– Transfer of data is done between adjacent levels in the hierarchy only
– All access by the processor is to the topmost level
– Performance depends on hit rates

Page 4: CMPE 421 Parallel Computer Architecture


Caching Examples

• Principle: results of operations that are expensive should be kept around for reuse
• Examples:
  – CPU caching
  – Forwarding table caching
  – File caching
  – Web caching
  – Query caching
  – Computation caching

Page 5: CMPE 421 Parallel Computer Architecture


Cache Levels

• Registers: a cache on variables
• First-level cache: a cache on the second-level cache
• Second-level cache: a cache on memory
• Memory: a cache on disk (virtual memory)
• TLB: a cache on the page table
• Branch prediction: a cache on prediction information?

Page 6: CMPE 421 Parallel Computer Architecture


Terminology

• BLOCK: the minimum unit of information transferred between the cache and main memory, typically measured in bytes or words
  – Block addressing varies by technology at each level
  – Blocks are moved one level at a time
• HIT: the data appears in a block in the upper level. When a program needs a particular data object d from the lower level, it first looks for d in one of the blocks currently stored at the upper level. If d happens to be cached at the upper level, we have what is called a cache hit.
  – For example, a program with good temporal locality might read a data object from block d, resulting in a cache hit in the upper level
  – Ex: remote HTML files stored on web servers

  – Ex: finding the information in one of the books on your desk

Page 7: CMPE 421 Parallel Computer Architecture


Terminology

• MISS: the data was not in the upper level and had to be fetched from a lower level. When there is a miss, the cache at the upper level fetches the block containing the requested data, possibly overwriting an existing block if the upper level is already full.
• HIT RATE: the ratio of hits to memory accesses found in the upper level
  – Used as a measure of the performance of the memory hierarchy

Page 8: CMPE 421 Parallel Computer Architecture


MISS EXAMPLE

Reading a data object from block 12 in the upper-level cache would result in a cache miss because block 12 is not currently stored there. Once it has been copied from the lower level to the upper level, block 12 will remain there in expectation of later accesses.

Page 9: CMPE 421 Parallel Computer Architecture


Terminology

• MISS RATE: the ratio of misses to memory accesses found in the upper level
  Miss rate = 1 - Hit rate
• HIT TIME: the time to access the upper level (cache)
  Hit time = t_c = Access time (cache to processor) + Time to determine hit/miss (time to find out if it is in the cache)
  Ex: the time needed to look through the books on the desk

Page 10: CMPE 421 Parallel Computer Architecture


Terminology

• MISS PENALTY: the time to replace a block in the cache with a block from main memory and to deliver the element to the processor
  Miss penalty = t_c + t_m = Lower-level access time + Replacement time + Time to deliver to the upper level
  Ex: the time to get another book from the shelves and place it on the desk
• The miss penalty is usually much larger than the hit time
  – Because the upper level is smaller and built using faster memory parts
  – The time to examine the books on the desk is much smaller than the time to get up and get a new book from the shelves

Ex: HIT_RATIO = 0.9, so MISS_RATIO = 1.0 - 0.9 = 0.1
Ideally hit_ratio = 1.0 and miss_ratio = 0.0; in practice, hit_ratio < 1.0, typically 0.95 or better

Page 11: CMPE 421 Parallel Computer Architecture


Handling a Cache Miss

• A cache hit, if it happens in 1 cycle, has no effect on our pipeline, but a cache miss does
• The action required depends on whether we have an instruction miss or a data miss
• For an instruction miss (sketched in code after this list):
  1. Send the original PC value to the memory
  2. Instruct main memory to perform a read and wait for the memory to complete its access
  3. Write the result into the appropriate cache entry
  4. Restart the instruction
• For a data miss:
  1. Stall the pipeline
  2. Instruct main memory to perform a read and wait for the memory to complete its access
  3. Return the result from the memory unit and allow the pipeline to continue
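As a rough illustration (ours, not the slides'), the instruction-miss steps can be written as straight-line simulator code. Every name below, including the stub main_memory_read and the 8-entry icache_data array, is a hypothetical stand-in for the real hardware:

```c
#include <stdint.h>
#include <stdio.h>

static uint32_t icache_data[8];                 /* pretend instruction cache */

static uint32_t main_memory_read(uint32_t addr) /* slow lower level (stub)   */
{
    return addr * 2u;                           /* dummy "instruction word"  */
}

/* The four instruction-miss steps from the slide, as straight-line code. */
static void on_instruction_miss(uint32_t pc)
{
    uint32_t word = main_memory_read(pc);  /* steps 1-2: send the PC to main */
                                           /* memory and wait for the read   */
    icache_data[pc % 8u] = word;           /* step 3: write the cache entry  */
    printf("restart fetch at PC=%u\n", (unsigned)pc); /* step 4: restart     */
}

int main(void)
{
    on_instruction_miss(4u);
    return 0;
}
```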

Page 12: CMPE 421 Parallel Computer Architecture


Exploiting Locality

• We need to fill the cache with useful contents
  – Leverage locality (see the example after this list)
• Spatial locality
  – Rather than fetching just the word that missed, fetch a block of data around the word that missed
  – If you need these nearby words (and you often do), they will now hit
  – This is also good because you can build memory systems (disk/DRAM) that deliver large blocks of data cheaply once the access has started
• Temporal locality
  – Keep more recently accessed data items closer to the processor
  – When we need space in the cache, evict the old ones
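As a concrete illustration (ours, not the slides'), the loop below benefits from both kinds of locality: consecutive elements of a land in the same fetched block (spatial), while sum is reused on every iteration (temporal):

```c
#include <stdio.h>

#define N 1024

int main(void)
{
    static int a[N];   /* zero-initialized array to sum                   */
    long sum = 0;      /* reused on every iteration: temporal locality    */

    for (int i = 0; i < N; i++) {
        sum += a[i];   /* consecutive addresses: a miss on a[i] also brings
                          its neighbors into the cache (spatial locality) */
    }
    printf("sum = %ld\n", sum);
    return 0;
}
```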

Page 13: CMPE 421 Parallel Computer Architecture


Access Times

• Average access time (evaluated in the sketch below)
  Access time = (hit time)(hit rate) + (miss penalty)(miss rate)
  – The hope is that the hit time will be low and the hit rate high, since the miss penalty is so much larger than the hit time
• Average Memory Access Time (AMAT)
  – The formula can be applied to any level of the hierarchy
  – It can be generalized for the entire hierarchy
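A minimal sketch that evaluates the formula above. The 1-cycle hit time and 100-cycle miss penalty are illustrative assumptions; the 0.95 hit rate echoes the earlier "0.95 or better" figure:

```c
#include <stdio.h>

int main(void)
{
    double hit_time     = 1.0;            /* cycles, assumed              */
    double miss_penalty = 100.0;          /* cycles, assumed              */
    double hit_rate     = 0.95;           /* "0.95 or better" in practice */
    double miss_rate    = 1.0 - hit_rate;

    /* Access time = (hit time)(hit rate) + (miss penalty)(miss rate) */
    double amat = hit_time * hit_rate + miss_penalty * miss_rate;

    printf("average access time = %.2f cycles\n", amat); /* prints 5.95 */
    return 0;
}
```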

Page 14: CMPE 421 Parallel Computer Architecture


Simple Cache Model

• Assume that the processor accesses memory one word at a time
• A block consists of one word
• When a word is referenced and is not in the cache, it is put in the cache (copied from main memory)

Page 15: CMPE 421 Parallel Computer Architecture


Cache Usage

• At some point in time the cache holds memory items X1, X2, ..., Xn-1
• The processor next accesses memory item Xn, which is not in the cache
• How do we know if an item is in the cache?
• If it is in the cache, how do we know where it is?

Page 16: CMPE 421 Parallel Computer Architecture


Page 17: CMPE 421 Parallel Computer Architecture


Cache Arrangement

• How should the data in the cache be arranged?
• There are several different approaches:
  – Direct mapped: each memory address maps to one particular location in the cache
  – Fully associative: data can be placed anywhere in the cache
  – N-way set associative: data can be placed in a limited number of places in the cache, depending on the memory address

Page 18: CMPE 421 Parallel Computer Architecture


Direct Mapped Cache Organization

• Each memory location is mapped to a single location in the cache
  – there is only one place it can be!
• Remember that the cache is smaller than memory, so many memory locations will be mapped to the same location in the cache

Page 19: CMPE 421 Parallel Computer Architecture


Mapping Function

• The simplest mapping is based on the least significant (LS) bits of the address
• For example, all memory locations whose address ends in 001 will be mapped to the same location in the cache
• This requires a cache size of 2^n locations (a power of 2)

Page 20: CMPE 421 Parallel Computer Architecture


A Direct Mapped Cache

• Memory addresses are mapped to a cache index
  – The index is given by (block address) modulo (number of blocks in the cache)
• If the cache size is a power of 2, the modulo operation simply throws away some high-order bits of the address (see the sketch below)
Ex: a direct-mapped cache of 8 words, with a memory size of 32 words
  Use log2(8) = 3 bits for the cache address: XXX
  Since the cache size is a power of 2, throw away the higher bits:
  00 001 => cache addr = 001
  11 101 => cache addr = 101
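A minimal sketch of the slide's mapping (the helper name cache_index is ours): for a power-of-two cache, the modulo keeps only the low-order index bits.

```c
#include <stdint.h>
#include <stdio.h>

#define CACHE_WORDS 8u   /* 2^3 one-word slots, as in the slide's example */

static uint32_t cache_index(uint32_t addr)
{
    /* (block address) modulo (number of blocks in the cache); for a
       power-of-two cache this keeps only the low log2(8) = 3 bits,
       i.e. it is the same as addr & 0x7. */
    return addr % CACHE_WORDS;
}

int main(void)
{
    /* Binary 00001 -> 001 and 11101 -> 101, as on the slide. */
    printf("address  1 (00001) -> slot %u\n", (unsigned)cache_index(1u));
    printf("address 29 (11101) -> slot %u\n", (unsigned)cache_index(29u));
    return 0;
}
```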

Page 21: CMPE 421 Parallel Computer Architecture


Problem With Direct Mapped Cache

• We still need a way to find out which of the many possible memory elements is currently in a cache slot
  – Slot: a location in the cache that can hold a block
• We need to store the address of the item currently using, say, cache slot 001
• We therefore add a tag to each cache entry that identifies which address it currently contains, by storing the MSBs that uniquely identify that memory address (the LSBs already select a particular cache entry)
• The tag associated with a cache slot tells who is currently using the slot
• We don't need to store the entire memory address, just those bits that are not used to determine the slot number (the mapping)

Page 22: CMPE 421 Parallel Computer Architecture


Solution: The TAG

A field in a table, used in a memory hierarchy, that contains the address information required to identify whether the associated block in the hierarchy corresponds to the requested word
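A hedged sketch (helper names ours) of splitting an address into tag and index for the same hypothetical 8-slot cache; on a lookup, a hit requires the stored tag to match the high-order bits of the address:

```c
#include <assert.h>
#include <stdint.h>

#define INDEX_BITS 3u   /* the 8-slot direct-mapped cache from the example */

static uint32_t slot_of(uint32_t addr) { return addr & ((1u << INDEX_BITS) - 1u); }
static uint32_t tag_of(uint32_t addr)  { return addr >> INDEX_BITS; }

int main(void)
{
    /* Address 29 (binary 11101): the low bits 101 select the slot and
       the high bits 11 are stored as the tag, identifying which of the
       many addresses ending in 101 currently occupies that slot. */
    assert(slot_of(29u) == 5u);  /* binary 101 */
    assert(tag_of(29u)  == 3u);  /* binary 11  */
    return 0;
}
```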

Page 23: CMPE 421 Parallel Computer Architecture


Initialization Problem

• Initially the cache is empty
  – All the bits in the cache (including the tags) will have random values
• After some number of accesses, some of the tags are real and some are still just random junk
• How do we know which cache slots are junk and which really mean something?

Page 24: CMPE 421 Parallel Computer Architecture


Answer: Introduce Valid Bits

• Include one more bit with each cache slot that indicates whether the tag is valid or not
• Provide hardware to initialize these bits to 0 (one bit per cache slot)
• When checking a cache slot for a specific memory location, ignore the tag if the valid bit is 0
• Change a slot's valid bit to 1 when putting something in the slot (from main memory); a full lookup sketch follows
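Putting valid bit, tag, and index together: a minimal direct-mapped lookup sketch under the one-word-block model from the earlier slides. The names (read_word, main_memory, Slot) are ours, not the slides':

```c
#include <stdbool.h>
#include <stdint.h>

#define SLOTS      8u           /* direct-mapped, one word per block    */
#define INDEX_MASK (SLOTS - 1u)
#define INDEX_BITS 3u           /* log2(SLOTS)                          */

typedef struct {
    bool     valid;   /* 0 after initialization: the slot holds junk    */
    uint32_t tag;     /* high-order address bits of the cached word     */
    uint32_t data;    /* the one-word block                             */
} Slot;

static Slot     cache[SLOTS];    /* statics start zeroed, so valid = 0  */
static uint32_t main_memory[32]; /* stand-in lower level (32 words)     */

static uint32_t read_word(uint32_t addr)
{
    Slot    *s   = &cache[addr & INDEX_MASK];
    uint32_t tag = addr >> INDEX_BITS;

    if (s->valid && s->tag == tag)  /* ignore the tag unless valid      */
        return s->data;             /* HIT                              */

    s->valid = true;                /* MISS: fill the slot and set its  */
    s->tag   = tag;                 /* valid bit to 1                   */
    s->data  = main_memory[addr];
    return s->data;
}

int main(void)
{
    read_word(29u);              /* miss: fills slot 101 with tag 11    */
    return (int)read_word(29u);  /* hit: valid bit set and tags match   */
}
```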

Page 25: CMPE 421 Parallel Computer Architecture


Direct Mapped Cache with Valid Bit