
Page 1: CMPE 421  Parallel Computer Architecture

CMPE 421 Parallel Computer Architecture – PART 3

Accessing a Cache

Page 2: CMPE 421  Parallel Computer Architecture

Direct Mapped Cache Example 2 (4 one-word blocks)

Consider the main memory word reference string 0 1 2 3 4 3 4 15

Start with an empty cache – all blocks initially marked as not valid. The index is the word address modulo 4; each entry below shows "tag Mem(word)":

0   miss  block 0 <- 00 Mem(0)
1   miss  block 1 <- 00 Mem(1)
2   miss  block 2 <- 00 Mem(2)
3   miss  block 3 <- 00 Mem(3)
4   miss  block 0 <- 01 Mem(4)   (replaces Mem(0))
3   hit
4   hit
15  miss  block 3 <- 11 Mem(15)  (replaces Mem(3))

8 requests, 6 misses, 2 hits

Page 3: CMPE 421  Parallel Computer Architecture

Direct Mapped Caching: A Simple First Example

[Figure: a 4-entry cache (indices 00, 01, 10, 11; each entry has a valid bit, a tag, and a data word) mapped onto a 16-word main memory, word addresses 0000xx through 1111xx, where xx is the byte within the word.]

The two low order bits define the byte in the word (32-bit words).

Index = (block address) modulo (# of blocks in the cache)

Q1: How do we find it?

Use the next 2 low order memory address bits – the index – to determine which cache block (i.e., modulo the number of blocks in the cache).

Q2: Is it there?

Compare the cache tag to the high order 2 memory address bits to tell if the memory block is in the cache.

The valid bit indicates whether an entry contains valid information – if the bit is not set, there cannot be a match for this block.
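To make the address split concrete, here is a minimal C sketch (mine, not from the slides; the constants follow the 4-block, one-word-block example above) that extracts the byte offset, index, and tag from a 32-bit byte address:

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* 4 one-word (4-byte) blocks: byte offset = bits 1-0,
       index = bits 3-2, tag = the remaining high bits. */
    uint32_t addr        = 0x0000002C;    /* hypothetical byte address */
    uint32_t byte_offset = addr & 0x3;    /* byte within the word      */
    uint32_t index       = (addr >> 2) & 0x3;  /* (block addr) mod 4   */
    uint32_t tag         = addr >> 4;     /* high-order bits           */
    printf("offset=%u index=%u tag=0x%X\n", byte_offset, index, tag);
    return 0;
}
```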

Page 4: CMPE 421  Parallel Computer Architecture

Direct Mapped Cache Example 1: 8 blocks, 1 word/block, direct mapped – initial state

Index V Tag Data

000 N

001 N

010 N

011 N

100 N

101 N

110 N

111 N

Page 5: CMPE 421  Parallel Computer Architecture


Direct Mapped Cache Example 1

Index V Tag Data

000 N

001 N

010 N

011 N

100 N

101 N

110 Y 10 Mem[10110]

111 N

Word addr Binary addr Hit/miss Cache block

22 10 110 Miss 110

Page 6: CMPE 421  Parallel Computer Architecture


Direct Mapped Cache Example 1

Index V Tag Data

000 N

001 N

010 Y 11 Mem[11010]

011 N

100 N

101 N

110 Y 10 Mem[10110]

111 N

Word addr Binary addr Hit/miss Cache block

26 11 010 Miss 010

Page 7: CMPE 421  Parallel Computer Architecture


Direct Mapped Cache Example 1

Index V Tag Data

000 N

001 N

010 Y 11 Mem[11010]

011 N

100 N

101 N

110 Y 10 Mem[10110]

111 N

Word addr Binary addr Hit/miss Cache block

22 10 110 Hit 110

26 11 010 Hit 010

Page 8: CMPE 421  Parallel Computer Architecture


Direct Mapped Cache Example 1

Index V Tag Data

000 Y 10 Mem[10000]

001 N

010 Y 11 Mem[11010]

011 Y 00 Mem[00011]

100 N

101 N

110 Y 10 Mem[10110]

111 N

Word addr Binary addr Hit/miss Cache block

16 10 000 Miss 000

3 00 011 Miss 011

16 10 000 Hit 000

Page 9: CMPE 421  Parallel Computer Architecture


Direct Mapped Cache Example 1

Index V Tag Data

000 Y 10 Mem[10000]

001 N

010 Y 10 Mem[10010]

011 Y 00 Mem[00011]

100 N

101 N

110 Y 10 Mem[10110]

111 N

Word addr Binary addr Hit/miss Cache block

18 10 010 Miss 010

Page 10: CMPE 421  Parallel Computer Architecture

Address Subdivision: Direct Mapped Cache

One word/block, cache size = 1K words

[Figure: the 32-bit address is split into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2), and a 2-bit byte offset (bits 1-0). The index selects one of 1024 entries (0 to 1023), each holding a valid bit, a 20-bit tag, and a 32-bit data word; Hit is asserted when the entry is valid and its tag equals the address tag, and Data delivers the 32-bit word.]

What kind of locality are we taking advantage of?

FIGURE 7.7: For this cache, the lower portion of the address is used to select a cache entry consisting of a data word and a tag.
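A hedged C sketch of the hit check in Figure 7.7 (structure and names are illustrative, not from the slides): the 10-bit index selects an entry, and a hit requires the valid bit to be set and the stored tag to equal bits 31-12 of the address.

```c
#include <stdint.h>
#include <stdbool.h>

#define NBLOCKS 1024                 /* 1K one-word blocks */

struct line { bool valid; uint32_t tag; uint32_t data; };
static struct line cache[NBLOCKS];

/* Returns true on a hit and delivers the cached word through *word. */
bool cache_lookup(uint32_t addr, uint32_t *word) {
    uint32_t index = (addr >> 2) & (NBLOCKS - 1);  /* addr bits 11-2  */
    uint32_t tag   = addr >> 12;                   /* addr bits 31-12 */
    struct line *l = &cache[index];
    if (l->valid && l->tag == tag) {               /* valid AND match */
        *word = l->data;
        return true;
    }
    return false;                                  /* miss */
}
```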

Page 11: CMPE 421  Parallel Computer Architecture

Another Example for Direct Mapping

Consider the main memory word reference string 0 4 0 4 0 4 0 4

Start with an empty cache – all blocks initially marked as not valid. Words 0 and 4 both map to cache block 0 (4 modulo 4 = 0), so each reference evicts the other word:

0  miss  block 0 <- 00 Mem(0)
4  miss  block 0 <- 01 Mem(4)  (replaces Mem(0))
0  miss  block 0 <- 00 Mem(0)  (replaces Mem(4))
4  miss  block 0 <- 01 Mem(4)
0  miss  block 0 <- 00 Mem(0)
4  miss  block 0 <- 01 Mem(4)
0  miss  block 0 <- 00 Mem(0)
4  miss  block 0 <- 01 Mem(4)

Ping pong effect due to conflict misses - two memory locations that map into the same cache block

8 requests, 8 misses

Page 12: CMPE 421  Parallel Computer Architecture

Handling Cache Misses

The control unit must detect a miss and process it by fetching the data from memory (or from a lower-level cache).

Approach:
– stall the CPU, freezing the contents of all the registers
– a separate controller fetches the data from memory
– once the data is present, execution of the datapath is resumed

Page 13: CMPE 421  Parallel Computer Architecture

Instruction Cache Misses

1. Send the original PC value (current PC – 4) to the memory.
2. Instruct main memory to perform a read and wait for the memory to complete its access.
3. Write the cache entry: put the data from memory in the data portion of the entry, write the upper bits of the address (from the ALU) into the tag field, and set the valid bit.
4. Restart the instruction execution at the first step, which will re-fetch the instruction (now in the cache).

Page 14: CMPE 421  Parallel Computer Architecture

Example Machine

The Digital DECStation 3100, one of the first commercially available RISC-architecture machines, used a MIPS R2000 processor:
– 5-stage pipeline
– requested an instruction word and a data word on every clock cycle
– static branch prediction; delayed branch instruction
– 64 KB data cache and 64 KB instruction cache

Page 15: CMPE 421  Parallel Computer Architecture

DECStation 3100 Cache (16K words)

This cache has 2^14 (16K) words with a block size of 1 word. 14 bits are used to index into the cache, and 16 bits are compared against the tag. A hit results if the upper 16 address bits match the tag AND the valid bit is set.

Page 16: CMPE 421  Parallel Computer Architecture

DECStation 3100 Cache

64 KB = 16K words with a 1-word block.

Read requests (lw instruction):

1. Send the address to the appropriate cache (instruction address – from the PC; data address – from the ALU).
2. If the cache signals a hit, read the requested word from the data lines. Else, send the address to main memory; when the requested word comes back from main memory, write it into the cache.
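Step 2 maps naturally to code. A sketch under assumed names (the toy main_mem array stands in for main memory; this is not the DECStation's actual interface):

```c
#include <stdint.h>
#include <stdbool.h>

#define NBLOCKS 16384                    /* 16K one-word blocks */

struct line { bool valid; uint32_t tag; uint32_t data; };
static struct line cache[NBLOCKS];
static uint32_t main_mem[1 << 20];       /* toy word-addressed memory */

uint32_t cache_read(uint32_t addr) {
    uint32_t index = (addr >> 2) & (NBLOCKS - 1);   /* addr bits 15-2  */
    uint32_t tag   = addr >> 16;                    /* addr bits 31-16 */
    struct line *l = &cache[index];
    if (!(l->valid && l->tag == tag)) {             /* miss: fetch + fill */
        l->data  = main_mem[(addr >> 2) & ((1 << 20) - 1)];
        l->tag   = tag;
        l->valid = true;
    }
    return l->data;                                 /* hit (or just filled) */
}
```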

Page 17: CMPE 421  Parallel Computer Architecture

DECStation 3100 Cache

Write requests: on a sw instruction, the DECStation 3100 used a scheme called write-through.

Write-through stores the word in the data cache AND in main memory. This is done to keep the data cache and main memory consistent.

Don't bother to check for hits and misses; just write the word into the cache and into main memory.

Example: initially both the cache and memory hold A = 5. The processor wants to update A to 10. If it wrote only the cache, the cache would hold A = 10 while memory still held A = 5 – the cache copy of A would no longer equal the memory copy (inconsistency). With the write-through policy the processor writes into both, so cache and memory both end up with A = 10.

Page 18: CMPE 421  Parallel Computer Architecture

DECStation 3100 Cache

Write requests using a write-through scheme (see page 17):

1. Index the cache using bits 2-15 of the address.
2. Write bits 31-16 of the address into the tag, write the data word into the data portion, and set the valid bit.
3. Also write the word to main memory using the entire address.

This simple approach slows our performance down because of the long write to main memory.
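The three steps, as a hedged C sketch (same illustrative structure as the read sketch earlier; the toy memory array is an assumption):

```c
#include <stdint.h>
#include <stdbool.h>

#define NBLOCKS 16384                    /* 16K one-word blocks */

struct line { bool valid; uint32_t tag; uint32_t data; };
static struct line cache[NBLOCKS];
static uint32_t main_mem[1 << 20];       /* toy word-addressed memory */

void cache_write_through(uint32_t addr, uint32_t value) {
    uint32_t index = (addr >> 2) & (NBLOCKS - 1);    /* step 1: bits 2-15  */
    cache[index].tag   = addr >> 16;                 /* step 2: bits 31-16 */
    cache[index].data  = value;
    cache[index].valid = true;                       /* no hit check needed */
    main_mem[(addr >> 2) & ((1 << 20) - 1)] = value; /* step 3: memory too  */
}
```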

Page 19: CMPE 421  Parallel Computer Architecture

EXAMPLE: the cost of cache misses on writes

Suppose 10% of the instructions are stores, the CPI without cache misses is 1.0, and every write spends 100 extra cycles:

⇒ CPI = 1.0 + 100 × 10% = 11
⇒ performance drops by more than a factor of 10
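The arithmetic, checked in a few lines of C (values taken from the example above):

```c
#include <stdio.h>

int main(void) {
    double base_cpi   = 1.0;    /* CPI without cache misses       */
    double store_frac = 0.10;   /* 10% of instructions are stores */
    double write_cost = 100.0;  /* extra cycles per write         */
    printf("effective CPI = %.1f\n", base_cpi + store_frac * write_cost);
    return 0;                   /* prints 11.0 */
}
```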

⇒ SOLUTION: a write buffer (next slide)

Page 20: CMPE 421  Parallel Computer Architecture

Write Buffer Solution

Use a write buffer to store the data while it is waiting to be written to memory:
– Write the data into the cache and into the write buffer
– The processor continues with execution
– The write buffer (memory controller) copies the data into main memory; when a write to main memory completes, its entry in the write buffer is freed

Hopefully, the processor does not generate writes faster than the write buffer can accept them. If the write buffer becomes full, stalls occur.
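A minimal C sketch of such a FIFO write buffer (the 4-entry depth and all names are assumptions for illustration; the slides do not specify a size):

```c
#include <stdint.h>
#include <stdbool.h>

#define WB_DEPTH 4

struct wb_entry { uint32_t addr, data; };
static struct wb_entry wb[WB_DEPTH];
static int wb_head = 0, wb_count = 0;

/* CPU side: returns false when the buffer is full (the CPU must stall). */
bool wb_enqueue(uint32_t addr, uint32_t data) {
    if (wb_count == WB_DEPTH)
        return false;
    wb[(wb_head + wb_count) % WB_DEPTH] = (struct wb_entry){ addr, data };
    wb_count++;
    return true;                 /* CPU keeps executing */
}

/* Memory-controller side: drain one entry per completed DRAM write. */
bool wb_drain(struct wb_entry *out) {
    if (wb_count == 0)
        return false;
    *out = wb[wb_head];          /* this entry is now free */
    wb_head = (wb_head + 1) % WB_DEPTH;
    wb_count--;
    return true;
}
```

The CPU and the memory controller touch opposite ends of the queue, which is what lets the processor continue while writes drain.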

Page 21: CMPE 421  Parallel Computer Architecture

Write Buffer Saturation

PROBLEM – the memory system designer's nightmare:

– If the store frequency (w.r.t. time) << 1 / DRAM write cycle, the write buffer works fine.
– If the store frequency approaches 1 / DRAM write cycle and this condition persists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row), the store buffer will overflow no matter how big you make it, because you are simply feeding it data faster than the memory can empty it (CPU cycle time << DRAM write cycle time).

Page 22: CMPE 421  Parallel Computer Architecture

Solutions for write buffer saturation:
– use a write-back cache
– install a second-level (L2) cache
– store compression

Page 23: CMPE 421  Parallel Computer Architecture

Alternative: Write-Back Solution

The write-back scheme doesn't automatically write to the cache AND to main memory; instead, it writes only to the cache.

The data does not get written to memory until that cache block has to be replaced with a different block (i.e., when the block is evicted on a cache miss).

– Can improve performance: greatly reduces the memory bandwidth requirement
– More complex to implement than write-through: the control can be complex
– Needs a "dirty bit" for each cache block
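A hedged sketch of the dirty-bit mechanism in C (structure assumed; the slide states the policy, not an implementation). A store updates only the cache and marks the block dirty; the old contents are written back to memory only when the block is evicted:

```c
#include <stdint.h>
#include <stdbool.h>

#define NBLOCKS 1024

struct line { bool valid, dirty; uint32_t tag, data; };
static struct line cache[NBLOCKS];
static uint32_t main_mem[1 << 20];       /* toy word-addressed memory */

void cache_write_back(uint32_t addr, uint32_t value) {
    uint32_t index = (addr >> 2) & (NBLOCKS - 1);   /* addr bits 11-2  */
    uint32_t tag   = addr >> 12;                    /* addr bits 31-12 */
    struct line *l = &cache[index];
    if (l->valid && l->tag != tag && l->dirty) {
        /* Evicting a dirty block: write the old data back first.
           Reconstruct the victim's address from its tag and index. */
        uint32_t old_addr = (l->tag << 12) | (index << 2);
        main_mem[(old_addr >> 2) & ((1 << 20) - 1)] = l->data;
    }
    l->tag   = tag;
    l->data  = value;        /* write only the cache ...          */
    l->valid = true;
    l->dirty = true;         /* ... and remember it was modified  */
}
```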

Page 24: CMPE 421  Parallel Computer Architecture

Sources of Cache Misses

Compulsory (cold start or process migration; first reference):
– Caused when we first start the program: the first access to a block is a "cold" fact of life, and there is not a whole lot you can do about it
– If you are going to run millions of instructions, compulsory misses are insignificant

Conflict (collision):
– Multiple memory locations map to the same cache location
– Solution 1: increase cache size
– Solution 2: increase associativity (next lecture)

Capacity:
– The cache cannot contain all the blocks accessed by the program
– Solution: increase cache size

Page 25: CMPE 421  Parallel Computer Architecture

Multiword (4-word) Block Direct Mapped Cache – Taking Advantage of Spatial Locality

The cache organization we have discussed so far (1 block = 1 word) does not exploit spatial locality, so we want a cache block size greater than one word. Then, when a cache miss occurs, we fetch multiple adjacent words, and the probability that one of these words will be needed shortly is high.

Example: choose a block size of 4 words (4 × 4 = 16 bytes).
– We now need an extra "block offset" field, which selects one of the four words in the indexed block according to the request.
– The total size of the tag field per word is also reduced, because each tag is shared by 4 words (only 25% of the tag overhead of a cache with 1-word blocks).

Page 26: CMPE 421  Parallel Computer Architecture

Multiword (4-word) Block Direct Mapped Cache

Cache size: 4K blocks × 4 words/block = 16K words = 16K × 4 bytes = 64 KB cache

Page 27: CMPE 421  Parallel Computer Architecture


Formulas During Lecture Period

Page 28: CMPE 421  Parallel Computer Architecture

Multiword (4-word) Block Direct Mapped Cache

Four words/block, cache size = 16K words

[Figure: the 32-bit address is split into a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4), a 2-bit block offset (bits 3-2), and a 2-bit byte offset (bits 1-0). The index selects one of 4096 entries (0 to 4095), each holding a valid bit, a 16-bit tag, and four 32-bit words; the block offset drives a multiplexor that selects one of the four words, and Hit is asserted on a valid entry with a matching tag.]

What kind of locality are we taking advantage of?
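The field extraction for this 4-word-block cache, as a small C sketch (bit positions from the figure; the address value is hypothetical):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t addr = 0x12345678;                   /* hypothetical address */
    uint32_t byte_offset  = addr & 0x3;           /* bits 1-0             */
    uint32_t block_offset = (addr >> 2) & 0x3;    /* bits 3-2: word in block */
    uint32_t index        = (addr >> 4) & 0xFFF;  /* bits 15-4: 4K entries */
    uint32_t tag          = addr >> 16;           /* bits 31-16           */
    printf("tag=0x%04X index=%u word=%u byte=%u\n",
           tag, index, block_offset, byte_offset);
    return 0;
}
```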

Page 29: CMPE 421  Parallel Computer Architecture

Taking Advantage of Spatial Locality

Let the cache block hold more than one word. Consider the reference string 0 1 2 3 4 3 4 15 again, now with two-word blocks (two blocks shown).

Start with an empty cache – all blocks initially marked as not valid. Each entry below shows "tag Mem(odd word) Mem(even word)":

0   miss  block 0 <- 00 Mem(1) Mem(0)
1   hit
2   miss  block 1 <- 00 Mem(3) Mem(2)
3   hit
4   miss  block 0 <- 01 Mem(5) Mem(4)    (replaces Mem(1) Mem(0))
3   hit
4   hit
15  miss  block 1 <- 11 Mem(15) Mem(14)  (replaces Mem(3) Mem(2))

8 requests, 4 misses

Page 30: CMPE 421  Parallel Computer Architecture

Cache Hits and Misses

Read misses are processed the same as with a 1-word block cache: read the entire block from memory into the cache.

Write hits and misses must be handled differently with a multiword block cache. Consider:
– Memory addresses X and Y both map to cache block C
– C is a 4-word block currently containing Y
– What would happen if we did a write to address X by simply overwriting the data and tag in cache block C?

Answer: cache block C would contain the tag for X, 1 word of X, and 3 words of Y.

Page 31: CMPE 421  Parallel Computer Architecture

Cache Hits and Misses

Solution 1 (when a write miss occurs): ignore the cache.
– Do not change the tag for X and do not update the data; just write the word to memory.
– But then where is the benefit of using a cache, if the data resides only in memory? In this case we cannot take advantage of locality.

... another solution?

Page 32: CMPE 421  Parallel Computer Architecture

Cache Hits and Misses

Solution 2 (first perform a read miss, then write):
– Perform a tag comparison while writing the data.
– If the tags are equal, we have a write hit: no problem.
– If they are unequal, we have a write miss: fetch the block from memory, then rewrite the word that caused the miss (write through).

So, with a multi-word cache, a write miss causes a read from memory (sketched in code below).
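A sketch of Solution 2 in C for a 4-word-block, write-through cache (names, sizes, and the toy memory are assumptions for illustration):

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define NBLOCKS 4096
#define WORDS_PER_BLOCK 4

struct line { bool valid; uint32_t tag; uint32_t data[WORDS_PER_BLOCK]; };
static struct line cache[NBLOCKS];
static uint32_t main_mem[1 << 20];            /* toy word-addressed memory */

void cache_write(uint32_t addr, uint32_t value) {
    uint32_t word   = addr >> 2;
    uint32_t offset = word & (WORDS_PER_BLOCK - 1);   /* bits 3-2   */
    uint32_t index  = (word >> 2) & (NBLOCKS - 1);    /* bits 15-4  */
    uint32_t tag    = addr >> 16;                     /* bits 31-16 */
    struct line *l  = &cache[index];
    if (!(l->valid && l->tag == tag)) {
        /* Write miss: first fetch the whole block from memory ... */
        uint32_t base = (word & ~(uint32_t)(WORDS_PER_BLOCK - 1))
                        & ((1 << 20) - 1);
        memcpy(l->data, &main_mem[base], sizeof l->data);
        l->tag   = tag;
        l->valid = true;
    }
    l->data[offset] = value;                  /* ... then rewrite the word */
    main_mem[word & ((1 << 20) - 1)] = value; /* write through to memory   */
}
```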

Page 33: CMPE 421  Parallel Computer Architecture

Miss Penalty and Miss Rate vs Block Size

In general, larger block sizes take advantage of spatial locality, BUT:
– a larger block size means a larger miss penalty: it takes longer to fill up the block
– if the block size is too big relative to the cache size, the miss rate will go up: too few cache blocks
– 16-64 bytes works fine in practice

In general, Average Access Time = Hit Time + Miss Rate × Miss Penalty

Need to find a middle ground (good design needs compromise).
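The formula, evaluated with illustrative numbers (the hit time and miss rate below are assumptions; the 65-cycle penalty is taken from the memory access time example near the end of this deck):

```c
#include <stdio.h>

int main(void) {
    double hit_time     = 1.0;   /* cycles (assumed)  */
    double miss_rate    = 0.05;  /* 5% (assumed)      */
    double miss_penalty = 65.0;  /* cycles, from the later example */
    printf("average access time = %.2f cycles\n",
           hit_time + miss_rate * miss_penalty);  /* 1 + 0.05*65 = 4.25 */
    return 0;
}
```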

Page 34: CMPE 421  Parallel Computer Architecture

Miss Rate vs Block Size vs Cache Size

[Figure: miss rate (%) versus block size (8 to 256 bytes) for cache sizes of 8 KB, 16 KB, 64 KB, and 256 KB.]

The miss rate goes up if the block size is too large relative to the cache size, because the number of blocks that can be held in a cache of the same size becomes smaller (increasing capacity misses).

Page 35: CMPE 421  Parallel Computer Architecture

Spatial Locality

Does increasing the block size help the miss rate?
• Yes, until the number of blocks in the cache becomes small. Then a cache block may be swapped out before many of its words are accessed, losing any spatial locality benefit.

Program                 Instruction miss rate   Data miss rate   Effective combined miss rate
gcc (1-word blocks)     6.1%                    2.1%             5.4%
gcc (4-word blocks)     2.0%                    1.7%             1.9%
spice (1-word blocks)   1.2%                    1.3%             1.2%
spice (4-word blocks)   0.3%                    0.6%             0.4%

Page 36: CMPE 421  Parallel Computer Architecture

Miss Penalty

Increasing the block size also increases the miss penalty, since we must read more words from memory on each miss, and reading more words takes more time.

Miss Penalty = latency to the first word + transfer time for the rest

One way around this problem is to design the memory system to transfer blocks of memory efficiently:
– One common way is to increase the width of the memory and the bus (transfer 2 or 4 words at a time, paying the latency once).
– Another common way is interleaving. This technique uses banks of memory: the requested address is sent to each bank in parallel, the memory latency is incurred once, and then the banks take turns sending the requested words to the cache.

Page 37: CMPE 421  Parallel Computer Architecture

Cache Summary

The Principle of Locality: a program is likely to access a relatively small portion of the address space at any instant of time.
– Temporal locality: locality in time
– Spatial locality: locality in space

Three major categories of cache misses:
– Compulsory misses: sad facts of life (e.g., cold start misses)
– Conflict misses: increase cache size and/or associativity. Nightmare scenario: the ping-pong effect!
– Capacity misses: increase cache size

Cache design space: total size, block size, associativity (replacement policy), write policy (write-through, write-back)

Page 38: CMPE 421  Parallel Computer Architecture

Main Memory Organizations

[Figure: three organizations, each with a CPU, a cache, and a bus: (a) one-word-wide memory organization; (b) wide memory organization, with a multiplexor between the wide memory and the cache; (c) interleaved memory organization with four memory banks (bank 0 through bank 3).]

DRAM access time >> bus transfer time

Processing a cache miss involves:
• the latency to fetch the first word from memory (finding the address for word 0)
• the block transfer time (bringing all the words of the block from memory)

It is difficult to reduce the latency to fetch the first word from memory. However, we can reduce the miss penalty if we increase the bandwidth from memory to the cache.

Page 39: CMPE 421  Parallel Computer Architecture

Memory Access Time Example

Assume it takes 1 cycle to send the address, 15 cycles for each DRAM access, and 1 cycle to send a word of data. Assume a cache block of 4 words.

– With a one-word-wide DRAM (page 41, fig. a): miss penalty = 1 + 4×15 + 4×1 = 65 cycles.
– With a main memory and bus 2 words wide (page 41, fig. b): miss penalty = 1 + 2×15 + 2×1 = 33 cycles; for a 4-word-wide memory the miss penalty is 1 + 15 + 1 = 17 cycles. This is expensive due to the wide bus and control circuits.
– With interleaved memory of 4 banks and a one-word-wide bus (page 41, fig. c): miss penalty = 1 + 1×15 + 4×1 = 20 cycles. The memory controller must supply consecutive addresses to different memory banks. Interleaving is universally adopted in high-performance computers.
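The same arithmetic in a few lines of C (parameters straight from the example):

```c
#include <stdio.h>

int main(void) {
    int addr_cycles = 1, dram_cycles = 15, bus_cycles = 1, block_words = 4;

    int one_word_wide  = addr_cycles + block_words * dram_cycles
                                     + block_words * bus_cycles;       /* 65 */
    int two_word_wide  = addr_cycles + (block_words / 2) * dram_cycles
                                     + (block_words / 2) * bus_cycles; /* 33 */
    int four_word_wide = addr_cycles + dram_cycles + bus_cycles;       /* 17 */
    int interleaved_4  = addr_cycles + dram_cycles
                                     + block_words * bus_cycles;       /* 20 */

    printf("one-word wide: %d\ntwo-word wide: %d\n"
           "four-word wide: %d\n4-bank interleaved: %d\n",
           one_word_wide, two_word_wide, four_word_wide, interleaved_4);
    return 0;
}
```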

Page 40: CMPE 421  Parallel Computer Architecture


Cache Performance During Lecture Period