CSCI 2510 Computer Organization: Memory System II, Cache In Action

Page 1

CSCI 2510 Computer Organization

Memory System II

Cache In Action

Page 2

Cache-Main Memory Mapping

A way to record which part of the Main Memory is now in the cache.

Design concerns:
Be efficient: fast determination of cache hits/misses
Be effective: make full use of the cache; increase the probability of cache hits

In the following discussion, we assume:
16-bit addresses, i.e. 2^16 = 64K bytes in the Main Memory
One word = one byte
Synonym: cache line == cache block

Page 3

Imagine: Trivial Conceptual Case

Cache size == Main Memory size == 64KB
Trivial one-to-one mapping
Do we need Main Memory any more?

[Figure: CPU (fastest) connected to a 64KB Cache (fast), which is connected to the 64KB Main Memory (slow)]

Page 4

Reality: Cache Block/ Cache Line

Cache size is much smaller than the Main Memory size

A block in the Main Memory maps to a block in the Cache

Many-to-One Mapping

[Figure: Main Memory blocks 0, 1, ..., 127, 128, 129, ..., 255, 256, 257, ..., 4095 map many-to-one onto Cache blocks 0, 1, ..., 127; each cache block carries a tag]

Page 5

Direct Mapping

Direct mapped cache

Block j of Main Memory maps to block (j mod 128) of Cache[same colour in figure]

Cache hit occurs if tag matches desired address

• there are 2^4 = 16 words (bytes) in a block
• 2^7 = 128 Cache blocks
• 2^(7+5) = 2^7 x 2^5 = 4096 Main Memory blocks

[Figure: the same many-to-one Main Memory to Cache block diagram as on the previous page]

16-bit Main Memory address fields:

| Cache tag (5 bits) | Cache Block No (7 bits) | Word/Byte address within block (4 bits) |

The Cache tag and Cache Block No together form the 12-bit Main Memory block number/address.

Page 6

Direct Mapping

Memory address divided into 3 fields:

Main Memory block number determines the position of the block in the cache

Tag is used to keep track of which block is in the cache (as many MM blocks can map to the same position in the cache)

The last 4 bits in the address select the target word in the block

Given an address t,b,w (16-bit): see if it is already in the cache by comparing t with the tag in block b

If not, cache miss! Replace the current block at b with a new one from memory block t,b (12-bit)

E.g. CPU is looking for [A7B4]:
MAR = 1010011110110100
Go to cache block 1111011
See if the tag is 10100
If YES, cache hit! Get the word/byte 0100 in the cache block 1111011

[Figure: the same Main Memory to Cache mapping and 5/7/4-bit address-field diagram as on the previous page]

Page 7

Associative Mapping

In direct mapping, a Main Memory block is restricted to reside in a given position in the Cache (determined by mod)

Associative mapping allows a MM block to reside in an arbitrary Cache block location

In this example, all 128 tag entries must be compared with the address Tag in parallel (by hardware)

[Figure: any Main Memory block 0, 1, ..., i, ..., 4095 may reside in any Cache block 0, 1, ..., 127; each cache block carries a tag]

16-bit Main Memory address fields:

| Tag (12 bits) | Word/Byte (4 bits) |

E.g. CPU is looking for [A7B4]:
MAR = 1010011110110100
See if the tag 101001111011 matches one of the 128 cache tags
If YES, cache hit! Get the word/byte 0100 in the BINGO cache block

Tag width is 12 bits

Page 8

Set Associative Mapping

Combination of direct and associative

Same colour Main Memory blocks 0,64,128,…,4032 map to cache set 0 and can occupy either of 2 positions within that set

A cache with k blocks per set is called a k-way set-associative cache.

(j mod 64) derives the set number. Within the target set, compare the 2 tag entries with the address Tag in parallel (by hardware).

[Figure: Main Memory blocks 0, 1, ..., 63, 64, 65, ..., 127, 128, 129, ..., 4095 map onto a cache of 64 sets; Set 0 holds Blocks 0-1, Set 1 holds Blocks 2-3, ..., Set 63 holds Blocks 126-127, each block with its own tag]

16-bit Main Memory address fields:

| Tag (6 bits) | Set Number (6 bits) | Word/Byte (4 bits) |

Tag width is 6 bits

Page 9

Set Associative Mapping

[Figure: the same Main Memory to 2-way set-associative cache diagram and 6/6/4-bit address fields as on the previous page; tag width is 6 bits]

E.g. 2-Way Set Associative

CPU is looking for [A7B4]

MAR = 1010011110110100

Go to cache Set 111011 (59 in decimal), i.e. Block 1110110 (118) and Block 1110111 (119)

See if ONE of the TWO tags in Set 111011 is 101001

If YES, cache hit! Get the word/byte 0100 in the BINGO cache block

Page 10

Replacement Algorithms

Direct mapped cache: the position of each block is fixed.

Whenever replacement is needed (i.e. a cache miss means a new block must be loaded), the choice is obvious and thus no "replacement algorithm" is needed.

Associative and set associative: need to decide which block to replace (and thus keep/retain the ones likely to be used again in the near future).

One strategy is least recently used (LRU):
e.g. for a 4-block/set cache, use a log2(4) = 2-bit counter for each block
Reset the counter to 0 whenever the block is accessed; counters of the other blocks in the same set are incremented
On a cache miss, replace/uncache the block whose counter has reached 3

Another is random replacement (choose a random block). Its advantage is that it is easier to implement at high speed.

Page 11

Cache Example

Assume separate instruction and data caches; we consider only the data cache.

Cache has space for 8 blocks

A block contains one 16-bit word (i.e. 2 bytes)

A[10][4] is an array of words located at 7A00-7A27 in row-major order

The program is to normalize first column of the array (a vertical vector) by its average

short A[10][4];
int sum = 0;
int j, i;
double mean;

// forward loop
for (j = 0; j <= 9; j++)
    sum += A[j][0];

mean = sum / 10.0;

// backward loop
for (i = 9; i >= 0; i--)
    A[i][0] = A[i][0] / mean;

Page 12

Cache Example

Array contents (40 elements) and their memory word addresses:

Element   Word address (binary)   (hex)
A[0][0]   0111101000000000        7A00
A[0][1]   0111101000000001        7A01
A[0][2]   0111101000000010        7A02
A[0][3]   0111101000000011        7A03
A[1][0]   0111101000000100        7A04
...
A[9][0]   0111101000100100        7A24
A[9][1]   0111101000100101        7A25
A[9][2]   0111101000100110        7A26
A[9][3]   0111101000100111        7A27

Tag for Direct Mapped: the upper 13 address bits (8 blocks in cache, so 3 bits encode the cache block number)
Tag for Set-Associative: the upper 15 address bits (4 blocks/set, 2 cache sets, so 1 bit encodes the cache set number)
Tag for Associative: the entire 16-bit word address

To simplify the discussion in this example:
16-bit word addresses; byte addresses are not shown
One item == one block == one word == two bytes

Page 13

Direct Mapping

Least significant 3 bits of the address determine the location in the cache. No replacement algorithm is needed in direct mapping. When i = 9 and i = 8, we get a cache hit (2 hits in total). Only 2 out of the 8 cache positions are used: very inefficient cache utilization.

Content of data cache after each loop pass (time line j = 0 .. 9, then i = 9 .. 0); tags not shown but are needed:

Block 0: A[0][0] (j = 0-1), A[2][0] (j = 2-3), A[4][0] (j = 4-5), A[6][0] (j = 6-7), A[8][0] (j = 8 through i = 7), A[6][0] (i = 6-5), A[4][0] (i = 4-3), A[2][0] (i = 2-1), A[0][0] (i = 0)

Block 4: A[1][0] (j = 1-2), A[3][0] (j = 3-4), A[5][0] (j = 5-6), A[7][0] (j = 7-8), A[9][0] (j = 9 through i = 8), A[7][0] (i = 7-6), A[5][0] (i = 5-4), A[3][0] (i = 3-2), A[1][0] (i = 1-0)

Blocks 1, 2, 3, 5, 6, 7: never used

Page 14

Associative Mapping

LRU replacement policy: we get cache hits for i = 9, 8, …, 2. If the i loop was a forward one, we would get no hits!

Content of data cache after each loop pass (time line); tags and LRU counters not shown but are needed:

Block k is filled with A[k][0] at pass j = k (k = 0 .. 7).

At j = 8, A[8][0] replaces A[0][0] (the least recently used block) in block 0; at j = 9, A[9][0] replaces A[1][0] in block 1.

All of A[2][0] .. A[9][0] are then resident during the backward passes i = 9 down to i = 2 (8 hits). At i = 1, A[1][0] replaces A[9][0] (now the LRU block) in block 1; at i = 0, A[0][0] replaces A[8][0] in block 0.

Page 15

Set Associative Mapping

Since all accessed blocks have even addresses (7A00, 7A04, 7A08, ...), only half of the cache is used, i.e. they all map to set 0

LRU replacement policy: we get hits for i = 9, 8, 7 and 6. Random replacement would have better average performance here. If the i loop was a forward one, we would get no hits!

Content of data cache after each loop pass (time line); tags and LRU counters not shown but are needed:

Set 0 (blocks 0-3): A[0][0] .. A[3][0] fill the set at j = 0 .. 3; A[4][0] .. A[7][0] then evict them one by one (in LRU order) at j = 4 .. 7, and A[8][0], A[9][0] evict A[4][0], A[5][0] at j = 8, 9.

The backward loop hits for i = 9, 8, 7, 6; from i = 5 downwards every access misses, because each needed block was evicted by LRU a few passes earlier.

Set 1 (blocks 4-7): never used

Page 16

Comments on the Example

In this example, Associative is best, then Set-Associative, lastly Direct Mapping

What are the advantages and disadvantages of each scheme?

In practice, hit rates as low as in this example are very rare; usually set-associative mapping with an LRU replacement scheme is used.

Larger blocks and more blocks, i.e. more cache memory, greatly improve the cache hit rate.

Page 17

Real-life Example: Intel Core 2 Duo

[Figure: each core has its own processing unit, a 32KB L1 instruction cache and a 32KB L1 data cache; the two cores share a 4MB L2 cache, which connects via the system bus to main memory and input/output]

Page 18

Real-life Example: Intel Core 2 Duo

Number of processors : 1

Number of cores : 2 per processor

Number of threads : 2 per processor

Name : Intel Core 2 Duo E6600

Code Name : Conroe

Specification : Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz

Technology : 65 nm

Core Speed : 2400 MHz

Multiplier x Bus speed : 9.0 x 266.0 MHz = 2400 MHz

Front-Side-Bus speed : 4 x 266.0MHz = 1066 MHz

Instruction sets : MMX, SSE, SSE2, SSE3, SSSE3, EM64T

L1 Data cache 2 x 32 KBytes, 8-way set associative, 64-byte line size

L1 Instruction cache 2 x 32 KBytes, 8-way set associative, 64-byte line size

L2 cache 4096 KBytes, 16-way set associative, 64-byte line size

Page 19

Memory Module Interleaving

Processor and cache are fast; main memory is slow.

Try to hide the access latency by interleaving memory accesses across several memory modules.

Each memory module has its own Address Buffer Register (ABR) and Data Buffer Register (DBR).

Which scheme below can be better interleaved?

[Figure: (a) consecutive words in a module: the high-order k bits of the MM address select module 0 .. n-1, the remaining m bits the address within the module; (b) consecutive words in consecutive modules: the low-order k bits select module 0 .. 2^k - 1, the high-order m bits the address within the module. Each module has its own ABR and DBR.]

Page 20

Memory Module Interleaving

Memory interleaving can be realized in technology such as “Dual Channel Memory Architecture.”

Two or more compatible (identical the best) memory modules are used in matching banks.

Within a memory module, in fact, 8 chips are used in "parallel" to achieve an 8 x 8 = 64-bit memory bus. This is also a kind of memory interleaving.

Page 21

Memory Interleaving Example

Suppose we have a cache read miss and need to load from main memory

Assume cache with 8-word block (cache line size = 8 bytes)

Assume it takes one clock to send address to DRAM memory and one clock to send data back.

In addition, DRAM has a 6-cycle latency for the first word; fortunately, each subsequent word in the same row takes only 4 cycles.

Read 8 bytes without interleaving: 1 + (1x6) + (7x4) + 1 = 36 cycles

For the 4-module interleaved scheme: 1 + (1x6) + (1x8) = 15 cycles. Transfer of the first 4 words is overlapped with the access of the second 4 words.

Page 22

Memory Without Interleaving

[Timing diagram: address (1 cycle), DRAM latency (6 cycles), transfer (1 cycle) for the first word; each subsequent word adds 4 + 1 cycles, all serialized]

Single memory read: 1 + 6 + 1 = 8 cycles

Read 8 bytes (non-interleaving): 1 + 1x6 + 7x4 + 1 = 36 cycles

Page 23

Four Memory Modules Interleaved

[Timing diagram: the four modules start their 1 + 6 + 1 cycle accesses staggered, so the accesses overlap and one word is transferred per cycle once the first module responds]

Read 8 bytes (interleaving): 1 + 6 + 1x8 = 15 cycles

Page 24

Cache Hit Rate and Miss Penalty

Goal is to have a big memory system with speed of cache

High cache hit rates (> 90%) are essential. The miss penalty must also be reduced.

Example:
h is the hit rate
M is the miss penalty
C is the cache access time
Average memory access time = h x C + (1 - h) x M

Page 25

Cache Hit Rate and Miss Penalty

Optimistically assume 8 cycles to read a single memory item; 15 cycles to load an 8-byte block from main memory (previous example); cache access time = 1 cycle.

For every 100 instructions, statistically 30 instructions are data read/ write

Instruction fetch: 100 memory accesses; assume hit rate = 0.95

Data read/write: 30 memory accesses; assume hit rate = 0.90

Execution cycles without cache = (100 + 30) x 8

Execution cycles with cache = 100(0.95x1+0.05x15)+30(0.9x1+0.1x15)

Ratio = 4.30, i.e. cache speeds up execution more than 4 times!

Page 26

Summary

Cache Organizations: Direct, Associative, Set-Associative

Cache Replacement Algorithms: Random, Least Recently Used

Memory Interleaving

Cache Hit Rate and Miss Penalty