
Dandamudi Solution Ch17


Chapter 17

Cache Memory



17–1 The two components of locality are spatial and temporal. The spatial locality principle states that programs tend to access data and instructions sequentially. This is fairly easy to understand. Most of the time, programs also tend to execute instructions sequentially. Of course, occasionally procedure calls and if-then-else types of conditional statements alter the sequential program execution path. The spatial locality characteristic allows us to successfully perform predictive preloading.

Temporal locality refers to the tendency of programs to repeatedly use a small part of code/data over a certain time. The loop is a prime example that contributes to this behavior. Temporal locality helps improve performance because repeated access can be satisfied from the cache.

As an example, consider the following matrix operation, which adds constant K to all elements of an M × N matrix X. Let us suppose that each element is a double data type requiring eight bytes.

for (i = 0; i < M; i++)
  for (j = 0; j < N; j++)
    X[i][j] = X[i][j] + K;

The statement X[i][j] = X[i][j] + K is executed M * N times. If we place the instructions for this loop in the cache, we avoid accessing them from the main memory. We can expect a performance benefit in this case because we load the instructions once into the cache and use them repeatedly (exhibiting temporal locality).

Continuing with our example, there will be a read miss when the matrix element X[0][0] is accessed. We have to access the main memory to get X[0][0], which will also be placed in the cache. Now suppose that we not only get X[0][0] but also the next three elements in the first row (i.e., we get the four elements X[0][0], X[0][1], X[0][2], and X[0][3]) for a total of 32 bytes of data movement between the main memory and the cache. Then we can avoid accessing the memory for the next three iterations. Notice that, in this example, the data brought into the cache are not reused. Thus, we do not derive any benefit from the temporal locality component as we did for the instructions. However, we still get the benefit of using the cache because our prediction works well, i.e., the spatial locality component gives us the benefit.
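The miss count implied by this argument can be sketched in C. This is an illustrative model (the function name and parameters are ours, not from the text): with a cache line holding four doubles, only the first touch of each line misses, so a row-major traversal incurs M * N / 4 read misses.

```c
/* Hypothetical sketch: count read misses for a row-major traversal of an
 * M x N matrix of doubles, assuming each cache line holds elems_per_line
 * elements (4 for 32-byte lines), each row starts on a line boundary, and
 * no line is evicted before the traversal moves past it. */
int matrix_read_misses(int m, int n, int elems_per_line) {
    int misses = 0;
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            if (j % elems_per_line == 0)  /* first element of a line: miss */
                misses++;
    return misses;
}
```

Under these assumptions, an 8 × 16 matrix with 32-byte lines incurs 8 × 16 / 4 = 32 misses instead of 128 memory accesses.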


17–2 As an example, consider the following matrix operation, which adds constant K to all elements of an M × N matrix X. Let us suppose that each element is a double data type requiring eight bytes.

for (i = 0; i < M; i++)
  for (j = 0; j < N; j++)
    X[i][j] = X[i][j] + K;

There will be a read miss when the matrix element X[0][0] is accessed. We have to access the main memory to get X[0][0], which will also be placed in the cache. Now suppose that we not only get X[0][0] but also the next three elements in the first row (i.e., we get the four elements X[0][0], X[0][1], X[0][2], and X[0][3]) for a total of 32 bytes of data movement between the main memory and the cache. Then we can avoid accessing the memory for the next three iterations. Notice that, in this example, the data brought into the cache are not reused. However, we still get the benefit of using the cache because our prediction works well, i.e., the spatial locality component gives us the benefit.


17–3 As an example, consider the following matrix operation, which adds constant K to all elements of an M × N matrix X. Let us suppose that each element is a double data type requiring eight bytes.

for (i = 0; i < M; i++)
  for (j = 0; j < N; j++)
    X[i][j] = X[i][j] + K;

The statement X[i][j] = X[i][j] + K is executed M * N times. If we place the instructions for this loop in the cache, we avoid accessing them from the main memory. We can expect a performance benefit in this case because we load the instructions once into the cache and use them repeatedly (exhibiting temporal locality).


17–4 The valid bit keeps information on whether a cache line contains a valid block. This information is necessary so that we do not treat garbage as valid data.


17–5 We can reduce the number of writes in the write-back policy by noting that if a cache line has not been updated by the processor, it does not have to be written back to the memory. This would avoid unnecessary writes to the memory. The dirty bit indicates whether the cache copy has been updated. If the dirty bit is set for a cache line, that block has to be written back to memory before replacing it.
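The role of the two status bits in this policy can be sketched in C. This is a minimal illustration (struct and function names are ours): on replacement, a write-back to memory is needed only when the line is both valid and dirty.

```c
#include <stdbool.h>
#include <string.h>

/* Illustrative cache line with valid and dirty bits, as discussed above. */
typedef struct {
    bool valid;              /* line holds a valid block            */
    bool dirty;              /* cache copy differs from memory copy */
    int  tag;
    unsigned char data[32];
} cache_line;

/* Load a new block into the line.  Returns true if the old block had to be
 * written back to memory first (i.e., it was valid and dirty). */
bool replace_line(cache_line *line, int new_tag, const unsigned char *new_data) {
    bool wrote_back = line->valid && line->dirty;
    /* ... if wrote_back, line->data would be flushed to memory here ... */
    line->valid = true;
    line->dirty = false;     /* freshly loaded copy matches memory */
    line->tag   = new_tag;
    memcpy(line->data, new_data, sizeof line->data);
    return wrote_back;
}
```

A clean (not dirty) line can simply be overwritten, which is exactly the saving the dirty bit provides.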


17–6 No. In the write-through policy, every time the processor writes to the cache copy, it also updates the main memory copy.


17–7 Direct mapping does not need a replacement policy, as it specifies a single cache line into which a memory block should be placed.


17–8 Direct mapping results in a simple implementation of the cache. In direct mapping, placement and replacement policies are straightforward: a single cache line is specified to place the memory block in the case of a cache miss.

The simplicity of the direct mapping function, although desirable from the implementation viewpoint, is also its disadvantage. By mapping each memory block to a single cache line, we lose flexibility, which can lead to performance degradation. In contrast, fully associative mapping gives maximum flexibility in terms of placement at the cost of increased implementation complexity.
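The simplicity of direct mapping comes from how little arithmetic the placement takes. A minimal sketch (names are ours): with 2^c cache lines, the line number is the block number modulo the number of lines, and the quotient is the tag that distinguishes blocks sharing that line.

```c
/* Illustrative direct-mapped placement: each memory block maps to exactly
 * one cache line, so no replacement decision is ever needed. */
typedef struct {
    unsigned line;   /* the single candidate cache line            */
    unsigned tag;    /* distinguishes the blocks sharing that line */
} placement;

placement direct_map(unsigned block_number, unsigned num_lines) {
    placement p;
    p.line = block_number % num_lines;
    p.tag  = block_number / num_lines;
    return p;
}
```

For example, with 4 cache lines, block 10 can only go to line 2 (10 mod 4), with tag 2.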


17–9 In fully associative mapping, the incoming block can take any free cache line. In this mapping, compared to direct mapping, the tag field is longer. This is not the major problem with this mapping scheme. The major drawback is locating a block in the cache. Since a block can be in any cache line, we have to search all tag fields in parallel to locate a block. This means we need hardware to perform 2^c comparisons, where 2^c is the number of cache lines.
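The search cost can be modeled in C. This sequential loop is only an illustration (names are ours): real hardware performs all the tag comparisons in parallel, which is exactly why fully associative caches need a comparator per line.

```c
/* Illustrative fully associative lookup: the block could be anywhere, so
 * every valid line's tag must be compared against the address tag. */
int find_line(const unsigned *tags, const int *valid, int num_lines,
              unsigned tag) {
    for (int i = 0; i < num_lines; i++)
        if (valid[i] && tags[i] == tag)
            return i;          /* hit: index of the matching line */
    return -1;                 /* miss: no line holds the block   */
}
```

Note that an invalid line never matches, regardless of the garbage in its tag field, which is the valid bit's job from 17–4.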


17–10 Direct mapping results in a simple implementation of the cache. The simplicity of the direct mapping function, although desirable from the implementation viewpoint, is also its disadvantage. By mapping each memory block to a single cache line, we lose flexibility, which can lead to performance degradation. In contrast, fully associative mapping gives maximum flexibility in terms of placement at the cost of increased implementation complexity.

Set-associative mapping is a compromise between direct and associative mapping. It divides the cache lines into disjoint sets. Mapping of a memory block is done as in direct mapping, except that it maps the block to a set of cache lines rather than to a single cache line. The block can be placed in any cache line within the assigned set, as in associative mapping. This gives it more flexibility compared to the direct mapping scheme. Set-associative mapping thus reduces the search complexity to the number of cache lines within a set. Typically, small set sizes such as 2, 4, or 8 are used. Of course, this flexibility is obtained at the cost of increased implementation complexity. However, this complexity is far less than that associated with fully associative mapping.
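The compromise is visible in the mapping arithmetic. A minimal sketch (names are ours): the set is chosen as in direct mapping, and only the lines within that set need tag comparisons.

```c
/* Illustrative set-associative placement: the block number selects a set
 * (as in direct mapping); the block may occupy any line within that set,
 * so only as many tag comparisons as the set size are needed. */
typedef struct {
    unsigned set;   /* which set the block must go into          */
    unsigned tag;   /* distinguishes blocks that share that set  */
} sa_place;

sa_place set_assoc_map(unsigned block_number, unsigned num_sets) {
    sa_place p;
    p.set = block_number % num_sets;
    p.tag = block_number / num_sets;
    return p;
}
```

With a 4-line cache and set size 2 (two sets), even-numbered blocks map to set 0 and odd-numbered blocks to set 1, matching the mapping used in problem 17–15 below.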


17–11 By loading a block of data, rather than what is required by the processor, we are prefetching data in anticipation of its use. This is beneficial because if the processor needs this prefetched data, we avoid accessing the main memory. This is the spatial locality component discussed in this chapter.


17–12 The impact of the cache line size or block size is shown in the following figure. Initially, the miss rate decreases as we increase the block size. This is mainly due to the prefetching achieved with larger block sizes. Since programs exhibit spatial locality, prefetching reduces the miss rate. However, as we continue to increase the block size, the miss rate starts to increase. This is due to block replacements as we run out of space in the cache memory. When we replace a larger block to make room for an incoming block, we also throw away data/instructions that the processor might reference in the future. This sensitivity can best be explained by considering the two extreme block size values.

[Figure: miss rate versus block size. The miss rate first decreases and then increases as the block size grows.]

At one end of the spectrum, consider a block size of one word. In this case, we are not prefetching data/instructions from the main memory. Thus, if we go from one-word to two-word cache lines, we reduce the miss rate. We can extend this argument to larger block sizes. A larger block size also has the benefit of taking less time to load from main memory in a burst rather than by reading individual words. Most processors support burst transfers. For example, the Pentium requires three clocks to transfer 64-bit noncacheable data from the memory. However, in the burst mode, after an initial delay of two clocks, one 64-bit word is transferred each clock cycle. In this mode, it can transfer up to four 64-bit words in six clocks. In contrast, we need 12 cycles to transfer the same data using single cycles. This is true whether it is a memory read or a write. Notice that the Pentium uses a 32-byte cache line size, which is exactly the amount of data transferred in one burst cycle (four transfers of 64 bits each).

At the other extreme, consider a block size equal to the whole cache. In this case, whenever there is a miss, we throw away the entire cache contents. In the process, we lose a lot of data that the processor might reference in the future. In addition, such large block sizes affect performance because the miss penalty increases with the block size. Most processors tend to use a block size in the range of 8 to 64 bytes, with 32 bytes being the most common. Some processors, such as the PowerPC and MIPS R4000, allow the system designer to program the cache line size at boot time.
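The Pentium clock counts quoted above can be checked with two one-line helpers (a back-of-the-envelope sketch; the function names are ours, and the constants are the figures stated in the text: 2 setup clocks plus 1 clock per burst transfer, versus 3 clocks per individual transfer).

```c
/* Clocks for a burst of n 64-bit transfers: 2 setup clocks, then one
 * transfer per clock, per the Pentium figures quoted in the text. */
int burst_clocks(int transfers)  { return 2 + transfers; }

/* Clocks for the same data moved as individual 3-clock transfers. */
int single_clocks(int transfers) { return 3 * transfers; }
```

A 32-byte line (four 64-bit transfers) thus costs 6 clocks in burst mode versus 12 clocks as individual transfers, a 2x saving.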


17–13 The cache has four lines and the set size is 2. Since the reference pattern must have four distinct blocks, we will incur at least 4 misses, corresponding to the first-time accesses to these four blocks. After these blocks are in the cache, it does not matter which of the four are referenced. We get a hit ratio of (12 − 4)/12 ≈ 67%. An example reference pattern is given below:

1, 2, 3, 4, 4, 4, 4, 3, 3, 2, 1, 1


17–14 The following table shows the cache state for the reference string. As shown in this table, there are 6 misses and 8 hits. Thus, the hit ratio is 8/14 ≈ 57%.

Block accessed | Hit or miss | Cache line 0 | Cache line 1 | Cache line 2 | Cache line 3

0 Miss Block 0 ??? ??? ???

4 Miss Block 4 ??? ??? ???

3 Miss Block 4 ??? ??? Block 3

10 Miss Block 4 ??? Block 10 Block 3

10 Hit Block 4 ??? Block 10 Block 3

3 Hit Block 4 ??? Block 10 Block 3

4 Hit Block 4 ??? Block 10 Block 3

0 Miss Block 0 ??? Block 10 Block 3

0 Hit Block 0 ??? Block 10 Block 3

4 Miss Block 4 ??? Block 10 Block 3

3 Hit Block 4 ??? Block 10 Block 3

10 Hit Block 4 ??? Block 10 Block 3

4 Hit Block 4 ??? Block 10 Block 3

10 Hit Block 4 ??? Block 10 Block 3
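The table above can be reproduced with a short direct-mapped hit counter. This is an illustrative sketch (the function name is ours): the line is the block number modulo the number of lines, and a reference hits when that line already holds the block.

```c
/* Illustrative direct-mapped hit counter.  lines[] holds the block number
 * currently cached in each line, -1 meaning invalid.  Assumes at most 64
 * cache lines for this sketch. */
int direct_mapped_hits(const int *refs, int n, int num_lines) {
    int lines[64];
    for (int i = 0; i < num_lines; i++)
        lines[i] = -1;                       /* all lines start invalid */
    int hits = 0;
    for (int i = 0; i < n; i++) {
        int line = refs[i] % num_lines;      /* direct mapping          */
        if (lines[line] == refs[i])
            hits++;
        else
            lines[line] = refs[i];           /* miss: load the block    */
    }
    return hits;
}
```

Running it on the reference string 0, 4, 3, 10, 10, 3, 4, 0, 0, 4, 3, 10, 4, 10 with four lines yields 8 hits and 6 misses, matching the table.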


17–15 Since the cache size is four lines and the set size is 2, we have two sets. Therefore, all blocks with an even number are mapped to set 0, and odd-numbered blocks to set 1. The following table shows the cache state for the reference string. As shown in this table, there are 6 misses and 8 hits. Thus, the hit ratio is 8/14 ≈ 57%.

Block accessed | Hit or miss | Set 0, line 0 | Set 0, line 1 | Set 1, line 0 | Set 1, line 1

0 Miss Block 0 ??? ??? ???

4 Miss Block 0 Block 4 ??? ???

3 Miss Block 0 Block 4 Block 3 ???

10 Miss Block 10 Block 4 Block 3 ???

10 Hit Block 10 Block 4 Block 3 ???

3 Hit Block 10 Block 4 Block 3 ???

4 Hit Block 10 Block 4 Block 3 ???

0 Miss Block 0 Block 4 Block 3 ???

0 Hit Block 0 Block 4 Block 3 ???

4 Hit Block 0 Block 4 Block 3 ???

3 Hit Block 0 Block 4 Block 3 ???

10 Miss Block 10 Block 4 Block 3 ???

4 Hit Block 10 Block 4 Block 3 ???

10 Hit Block 10 Block 4 Block 3 ???


17–16 Since we have four distinct blocks (blocks 0, 3, 4, and 10) and four cache lines, a fully associative cache experiences only four misses. Thus, the hit ratio is 10/14 ≈ 71.4%. This is the best hit ratio, as we have to experience these four misses (i.e., they are the compulsory misses).


17–17 The following table shows the cache state with the FCFS replacement policy. As shown in this table, there are 9 misses and 5 hits. Thus, the hit ratio is 5/14 ≈ 35.7%.

Block accessed | Hit or miss | Cache line 0 | Cache line 1 | Cache line 2 | Cache line 3

0 Miss Block 0 ??? ??? ???

1 Miss Block 0 Block 1 ??? ???

2 Miss Block 0 Block 1 Block 2 ???

3 Miss Block 0 Block 1 Block 2 Block 3

4 Miss Block 4 Block 1 Block 2 Block 3

5 Miss Block 4 Block 5 Block 2 Block 3

5 Hit Block 4 Block 5 Block 2 Block 3

4 Hit Block 4 Block 5 Block 2 Block 3

3 Hit Block 4 Block 5 Block 2 Block 3

2 Hit Block 4 Block 5 Block 2 Block 3

1 Miss Block 4 Block 5 Block 1 Block 3

0 Miss Block 4 Block 5 Block 1 Block 0

1 Hit Block 4 Block 5 Block 1 Block 0

10 Miss Block 10 Block 5 Block 1 Block 0

The following table shows the cache state with the LRU replacement policy. As shown in this table, the hit ratio is 5/14 ≈ 35.7%, which is the same as with the FCFS policy.


Block accessed | Hit or miss | Cache line 0 | Cache line 1 | Cache line 2 | Cache line 3

0 Miss Block 0 ??? ??? ???

1 Miss Block 0 Block 1 ??? ???

2 Miss Block 0 Block 1 Block 2 ???

3 Miss Block 0 Block 1 Block 2 Block 3

4 Miss Block 4 Block 1 Block 2 Block 3

5 Miss Block 4 Block 5 Block 2 Block 3

5 Hit Block 4 Block 5 Block 2 Block 3

4 Hit Block 4 Block 5 Block 2 Block 3

3 Hit Block 4 Block 5 Block 2 Block 3

2 Hit Block 4 Block 5 Block 2 Block 3

1 Miss Block 4 Block 1 Block 2 Block 3

0 Miss Block 0 Block 1 Block 2 Block 3

1 Hit Block 0 Block 1 Block 2 Block 3

10 Miss Block 0 Block 1 Block 2 Block 10
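Both replacement policies above can be simulated with one fully associative hit counter. This is an illustrative sketch (names are ours): each line carries a timestamp that records its load time under FCFS (FIFO) or its last-use time under LRU; the victim is always the line with the smallest timestamp.

```c
#include <stdbool.h>

/* Illustrative fully associative hit counter for FCFS (lru = false) and
 * LRU (lru = true).  Assumes at most 64 cache lines for this sketch. */
int assoc_hits(const int *refs, int n, int num_lines, bool lru) {
    int block[64], stamp[64], used = 0, hits = 0;
    for (int t = 0; t < n; t++) {
        int found = -1;
        for (int i = 0; i < used; i++)
            if (block[i] == refs[t]) { found = i; break; }
        if (found >= 0) {
            hits++;
            if (lru)
                stamp[found] = t;   /* LRU refreshes on use; FIFO does not */
        } else if (used < num_lines) {
            block[used] = refs[t];  /* fill a free line first */
            stamp[used] = t;
            used++;
        } else {
            int victim = 0;         /* evict the oldest timestamp */
            for (int i = 1; i < used; i++)
                if (stamp[i] < stamp[victim]) victim = i;
            block[victim] = refs[t];
            stamp[victim] = t;
        }
    }
    return hits;
}
```

On the reference string of 17–17 (0, 1, 2, 3, 4, 5, 5, 4, 3, 2, 1, 0, 1, 10) both policies give 5 hits, as the tables show.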


17–18 The following table shows the cache state with the FCFS replacement policy. As shown in this table, there are 9 misses and 3 hits. Thus, the hit ratio is 3/12 = 25%.

Block accessed | Hit or miss | Cache line 0 | Cache line 1 | Cache line 2 | Cache line 3

0 Miss Block 0 ??? ??? ???

1 Miss Block 0 Block 1 ??? ???

2 Miss Block 0 Block 1 Block 2 ???

0 Hit Block 0 Block 1 Block 2 ???

3 Miss Block 0 Block 1 Block 2 Block 3

2 Hit Block 0 Block 1 Block 2 Block 3

5 Miss Block 5 Block 1 Block 2 Block 3

3 Hit Block 5 Block 1 Block 2 Block 3

6 Miss Block 5 Block 6 Block 2 Block 3

0 Miss Block 5 Block 6 Block 0 Block 3

2 Miss Block 5 Block 6 Block 0 Block 2

1 Miss Block 1 Block 6 Block 0 Block 2

The following table shows the cache state with the LRU replacement policy. As shown in this table, the hit ratio is 3/12 = 25%, which is the same as with the FCFS policy.

Block accessed | Hit or miss | Cache line 0 | Cache line 1 | Cache line 2 | Cache line 3

0 Miss Block 0 ??? ??? ???

1 Miss Block 0 Block 1 ??? ???

2 Miss Block 0 Block 1 Block 2 ???

0 Hit Block 0 Block 1 Block 2 ???

3 Miss Block 0 Block 1 Block 2 Block 3

2 Hit Block 0 Block 1 Block 2 Block 3

5 Miss Block 5 Block 0 Block 2 Block 3

3 Hit Block 5 Block 0 Block 2 Block 3

6 Miss Block 5 Block 6 Block 2 Block 3

0 Miss Block 5 Block 6 Block 0 Block 3

2 Miss Block 3 Block 6 Block 0 Block 2

1 Miss Block 1 Block 6 Block 0 Block 2


17–19 A memory address space of 4 GB implies 32-bit addresses. These 32 bits are partitioned into three groups: byte offset, cache line number, and cache tag.

(a) 16-byte blocks

The byte offset is

b = log2(16) = 4 bits.

The number of cache lines is 64 KB/16 = 4096 cache lines. Thus, the number of bits required to identify a cache line is

c = log2(4096) = 12 bits.

The remaining 16 higher-order bits of the 32-bit address are used for the tag. To calculate the cache bit capacity, we note that each cache line consists of 16 bytes of data, a 16-bit tag field, a valid bit, and a dirty bit (because it is a write-back cache). Thus, the overhead is 16 + 1 + 1 = 18 bits. Since the cache line (data) size is 16 × 8 = 128 bits, the overhead is 18/128, which is about 14.06%.

(b) 32-byte blocks

Note that the number of bits required for the byte offset and the cache line is independent of the block size. In our example, their sum is always 16 bits. The remaining 16 higher-order bits of the 32-bit address are used for the tag. Thus, the overhead is 16 + 1 + 1 = 18 bits. Since the cache line (data) size is 32 × 8 = 256 bits, the overhead is 18/256, which is about 7.03%.

(c) 64-byte blocks

As in the previous cases, the overhead is 16 + 1 + 1 = 18 bits. Since the cache line (data) size is 64 × 8 = 512 bits, the overhead is 18/512, which is about 3.52%.
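The three cases follow one formula, which can be sketched in C (the function name is ours): the per-line overhead is tag + valid + dirty bits, expressed as a percentage of the data bits in the line.

```c
/* Illustrative per-line overhead for a write-back cache: overhead bits are
 * the tag plus a valid bit plus a dirty bit; the result is a percentage of
 * the data bits per line. */
double line_overhead_percent(int tag_bits, int block_bytes) {
    int overhead_bits = tag_bits + 1 + 1;   /* tag + valid + dirty */
    int data_bits = block_bytes * 8;
    return 100.0 * overhead_bits / data_bits;
}
```

With a 16-bit tag, this gives 18/128 ≈ 14.06% for 16-byte blocks, 18/256 ≈ 7.03% for 32-byte blocks, and 18/512 ≈ 3.52% for 64-byte blocks, as computed above. Doubling the block size halves the relative overhead, which is one argument for larger lines.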


17–20 A memory address space of 4 GB implies 32-bit addresses. In fully associative mapping, the address is divided into two fields: byte offset and tag.

(a) 16-byte blocks

Since the byte offset requires 4 bits for 16-byte blocks, the tag field is 32 − 4 = 28 bits. In this case, there is an overhead of 30 bits (28 tag bits, a valid bit, and a dirty bit) for every cache line with 16 × 8 = 128 bits of data. Thus, it represents an overhead of about 23.44%.

(b) 32-byte blocks

Since the byte offset requires 5 bits for 32-byte blocks, the tag field is 32 − 5 = 27 bits. In this case, there is an overhead of 29 bits (27 tag bits, a valid bit, and a dirty bit) for every cache line with 32 × 8 = 256 bits of data. Thus, it represents an overhead of about 11.33%.

(c) 64-byte blocks

Since the byte offset requires 6 bits for 64-byte blocks, the tag field is 32 − 6 = 26 bits. In this case, there is an overhead of 28 bits (26 tag bits, a valid bit, and a dirty bit) for every cache line with 64 × 8 = 512 bits of data. Thus, it represents an overhead of about 5.47%.


17–21 A memory address space of 4 GB implies 32-bit addresses. These 32 bits are partitioned into three groups: byte offset, set number, and cache tag.

(a) 16-byte blocks

The byte offset field requires 4 bits for 16-byte blocks. The number of bits s needed to identify a set is log2(S), where S is the number of sets in the cache. Since it is an 8-way set-associative cache, the number of sets is

S = 64 KB/(8 × 16) = 512 sets.

This gives us s = log2(512) = 9 bits. The remaining 32 − 9 − 4 = 19 bits are used for the tag field.

Therefore, each cache line consists of 16 × 8 = 128 bits of data, 19 bits of tag, a valid bit, and a dirty bit. Thus, it represents an overhead of about 21/128 ≈ 16.4%.

(b) 32-byte blocks

The byte offset field requires 5 bits for 32-byte blocks. The total number of bits for the set number and the byte offset is independent of the block size. In our example, it is 13 bits. The remaining 32 − 13 = 19 bits are used for the tag field. Therefore, each cache line consists of 32 × 8 = 256 bits of data, 19 bits of tag, a valid bit, and a dirty bit. Thus, it represents an overhead of about 21/256 ≈ 8.2%.

(c) 64-byte blocks

Each cache line consists of 64 × 8 = 512 bits of data, 19 bits of tag, a valid bit, and a dirty bit. Thus, it represents an overhead of about 21/512 ≈ 4.1%.


17–22 A memory address space of 4 GB implies 32-bit addresses. These 32 bits are partitioned into three groups: byte offset, cache line number, and cache tag.

Since 32-byte blocks are used, the byte offset is

b = log2(32) = 5 bits.

The number of cache lines is 32 KB/32 = 1024 cache lines. Thus, the number of bits required to identify a cache line is

c = log2(1024) = 10 bits.

The remaining 32 − 5 − 10 = 17 higher-order bits are used for the tag.


17–23 A memory address space of 4 GB implies 32-bit addresses. In fully associative mapping, the address is divided into two fields: byte offset and tag.

Since 32-byte blocks are used, the byte offset is

b = log2(32) = 5 bits.

The remaining 32 − 5 = 27 higher-order bits are used for the tag.


17–24 A memory address space of 4 GB implies 32-bit addresses. These 32 bits are partitioned into three groups: byte offset, set number, and cache tag.

Since 64-byte blocks are used, the byte offset is

b = log2(64) = 6 bits.

The number of bits s needed to identify a set is log2(S), where S is the number of sets in the cache. Since it is an 8-way set-associative cache, the number of sets is

S = 32 KB/(8 × 64) = 64 sets.

This gives us s = log2(64) = 6 bits. The remaining 32 − 6 − 6 = 20 bits are used for the tag field.
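The address partitions computed in problems 17–19 through 17–24 all follow one recipe, which can be sketched in C (names are ours): the byte offset is log2 of the block size, the index is log2 of the number of sets, and the tag takes the remaining bits of the 32-bit address. Direct mapping is the special case of one way per set, and fully associative mapping the case of a single set.

```c
/* Illustrative 32-bit address split for a set-associative cache.
 * ways = 1 gives direct mapping. */
typedef struct {
    int offset;   /* byte offset within the block */
    int index;    /* set (or cache line) number   */
    int tag;      /* remaining higher-order bits  */
} addr_fields;

/* Integer log2 for the powers of two used in these problems. */
static int log2i(unsigned x) {
    int b = 0;
    while (x > 1) { x >>= 1; b++; }
    return b;
}

addr_fields split_address(unsigned cache_bytes, unsigned block_bytes,
                          unsigned ways) {
    addr_fields f;
    unsigned sets = cache_bytes / (ways * block_bytes);
    f.offset = log2i(block_bytes);
    f.index  = log2i(sets);
    f.tag    = 32 - f.index - f.offset;
    return f;
}
```

For example, the 32 KB direct-mapped cache of 17–22 splits as 5 offset bits, 10 index bits, and 17 tag bits, and the 32 KB 8-way cache of 17–24 as 6, 6, and 20.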
