Download pptx - Memory Hierarchy and Cache Management

Speculate AmbitiouslyTo speculate ambitiously requires three capabilities:• The ability of the complier to find instructions that, with the

possible use of register renaming, can be speculatively moved and not affect the program data flow.

• The ability to ignore exceptions in speculated instructions, until we know that such exceptions should really occur.

• The ability to speculatively interchange loads and stores, or stores and stores, which may have address conflicts.

• The first of these is a complier capability, while the last two require hardware support .

Hardware Support for Preserving Exception Behavior

• To speculate ambitiously, we must be able to move any type

of instruction and still preserve its exception behavior.

• The key to being able to do this is to observe that the results

of a speculated sequence that is mispredicted will not be used

in the final computation , and such a speculated instruction

should not cause an exception.

Investigated for supporting more ambitious speculation

• There are four methods that have been investigated for supporting more ambitious speculation without introducing erroneous behavior:– The hardware and operating system cooperatively ignore

exceptions for speculative instructions. This approach preserves exception behavior for correct programs, but not for incorrect ones. This approach may viewed as unacceptable for some programs, but it has been used, under program control , as a “fast mode” in several processors.

– Speculative instructions that never raise exceptions are used, and checks are introduced to determine when an exception should occur.

Investigated for supporting more ambitious speculation

– A set of status bits, called poison bits, are attached to the

result registers written by speculated instructions when the

instructions cause exceptions. The poison bits cause a fault

when a normal instructions attempts to use the register.

– A mechanism is provided to indicate that an instruction is

speculative , and the hardware buffers the instruction result

until it is certain that the instruction is no longer speculative.

Schemes• To know the schemes, we need to distinguish between

exceptions that indicate a program error and would normally cause termination, such as a memory protection violation, and those that are handled and normally resumed, such as a page default.

• Exception that can be resumed can be accepted and processed for speculative instructions just as if they were normal instructions.

• If the speculative instruction should not have been executed, handling the unneeded exception may have some negative performance effects, but it cannot cause incorrect execution.

Schemes..• The cost of these exceptions may be high, however, and some processors

use hardware support to avoid taking such exceptions, just as processors with hardware speculation may take some exceptions in speculative mode, while avoiding others until an instruction is known not to be speculative.

• Exceptions that indicate a program error should not occur in correct programs, and the result of a program that gets such a exception is not well defined, except perhaps when the program is running in a debugging mode.

• If such exceptions arise in speculated instructions, we cannot take the exception until we know that the instruction is no longer speculative.

Schemes …• In the simplest method for preserving exceptions, the

hardware and the operating system handle all resumable exceptions when the exception occurs and simply return an undefined value for any exception that would cause termination.

• If the instruction generating the terminating exception was not speculative , then the program is in error.

• Note the instead of terminating the program, the program is allowed to continue, although it will almost certainly generate incorrect results.

• If the instruction generating the terminating exception is speculative, then the program may be correct and the speculative result will simply be unused.

Schemes …• Thus, returning an undefined value for the instruction cannot

be harmful.• This scheme can never cause a correct program to fail, no

matter how much speculation is done.• An incorrect program, which formerly might have received a

terminating exception, will get an incorrect result.• This a acceptable for some programs, assuming the complier

can also generate a normal version of the program, which does not speculate and can receive a terminating exception.

Schemes …• In such a scheme, it is not necessary to know that an

instruction is speculative. Indeed, it is helpful only when a program is in error and receives a terminating exception on a normal instruction; in such cases, if the instruction were not marked as speculative, the program could be terminated.

• A second approach to preserving exception behavior when speculating introduces speculative versions of instructions that do not generate terminating exceptions and instructions to check for such exceptions. This combines preserves the exception behavior exactly.

Schemes …• A third approach for preserving exception tracks exceptions as they occur

but postpones of the exception, although not in a completely precise fashion.

• The scheme is simple: a poison bit is added to every register, and another bit is added to every instruction to indicate whether a speculative instructions results in a terminating exception; all other exceptions are handled immediately.

• If a speculative instruction uses a register with a poison bit turned on, the destination register of the instruction simply has its poison bit turned on,

• If a normal instruction attempts to use a register source with its poison bit turned on, the instruction causes a fault .

• In this way, any program that would have generated an exception still generates one, albeit at the first instance where a result is used by an instruction that is not speculative.

Schemes …• Since poison bits exist only on register values and not memory values,

stores are never speculative and thus trap if either operand is “poison”.• One complication thus must be overcome is how the OS saves the user

registers on a context switch if the poison bit is set.• A special instruction is needed to save and reset the state of the poison

bits to the avoid this problem.• The fourth and final approach listed in earlier relies on a hardware

mechanism that operates like a reorder buffer.• In such an approach, instructions are marked by the complier as

speculative and include an indicator of how many branches the instruction was speculatively moved across and what branch action (taken/not taken) the complier assumed.

• The last piece of information basically tells the hardware the location of the code block where the speculated instruction originally was .

Schemes …• In practice, most of the benefit of speculative instruction is

marked by a sentinel, which tells the hardware that the earlier speculative instruction is no longer speculative and values may be committed.

• All instructions are placed in a reorder buffer when issued and are forced to commit in order, as in hardware speculation approach.(Notice , though, that no actual speculative branch prediction or dynamic scheduling occurs).

• The reorder buffer tracks when instructions are ready to commit and delays the “write-back” portion of any speculative instruction.

Schemes …• Speculative instructions are not allowed to commit until the

branches that have been speculatively moved over are also ready to commit, or ,alternatively until the corresponding sentinel is reached.

• At that point, we know whether the speculated instructions should have been executed or not.

• If it should have been executed and it generated a terminating exception, then we know that the program should be terminated.

• If the instruction should not have been executed then the can be ignored.

Example 1• Consider the code fragment from if-the-else statement of the

form:If (A==0) A=B; else A=a+4;Where A is at 0(R3) and B is at 0(R2). Assume

the then clause is almost always executed. Compile the code using complier-based speculation. Assume R14 is unused and available.

LD R1,0(R3) ; load ABNEZ R1,L1 ;test ALD R1,0(R2) ; then clauseJ L2 ; Skip else

L1: DADDI R1,R1, #4 ; else clauseL2: SD R0,0(R3) ; store A

Example 2• Show how the code using a speculative load (sLD) and a

speculation check instruction (SPECCK) to completely preserve exception behavior. Assume R14 is unused and available.



Example 3• Consider the code fragment and show how it would be

complied with speculative instructions and poison bits. Show where an exception for the speculative memory reference would be recognized. Assume R14 is unused and available.



Hardware support for memory Reference Speculation

• Moving loads across stores is usually done when the complier is certain the address do not conflict.

• A special instruction to check for address conflicts can be included in the architecture,

• The special instruction is left at the original location of the load instruction (and acts like a guardian), and the load is moved up across or more stores.

• When a speculated load is executed, the hardware saves the address of the accessed memory location.

• If a subsequent store the address of the location before the check instruction, then the speculation has failed.

Hardware support for memory Reference Speculation …

• If the location has not been touched, then the speculation is successful.

• Speculation failure can be handled in two ways.– If only the load instruction was speculated, then it suffices

to redo the load at the point of the check instruction (which could supply the target register in addition to the memory address).

– If additional instructions that depend on the load were also speculated , then a fix-up sequence that reexecutes all the speculation instructions starting with the load is needed.

– In this case, the check instruction specifies the address where the fix-up code is loaded.

Sections To be covered from Chapter 4

• Section 4.5• Software versus Hardware based scheduling

Memory Hierarchy

Memory Hierarchy & Cache Memory

Entry Quiz1. Primary cache or level 1 cache is

implemented in a separate chip

a. True b. False

Entry Quiz2. SRAM is implemented using

a. Flip-Flopb. Magnetic corec. Capacitord. Non-volatile Technology

Entry Quiz3. Main memory (200 ns) is slower compared

register (0.2 ns) by an order of

a. 3b. 4c. 1000d. 10000

Entry Quiz4. Virtual Memory is

a. Same as cachingb. Same as associative memoryc. Different from cachingd. Same as disk memory

Entry Quiz5. Cache Miss occurs when the

a. Required instruction is not found in the cacheb. Required data is not found in the cachec. Required instruction or data is not found in the

cached. Required instruction or data is not found in the

main memorye. For all of the above conditions

Module Objective• To understand

1. Memory requirements of different computers

2. Memory hierarchy and the motivation behind it 3. Moore’s Law 4. Principles of Locality 5. Cache Memory and its implementation6. Cache Performance7. Terms: Cache, Cache Miss, Cache Hit, Latency,

Bandwidth, SRAM, DRAM, by an order of, Direct Mapping, Associative Mapping, Set Mapping, Write Through, Write Allocated, Write Back, Dirty Bit, and Valid Bit

Memory Requirements• In general we would like to have

– Faster Memory (lower access time or latency)

– Larger (capacity and bandwidth) Memory– Simpler Memory

Memory Requirements - Server, Desktop, and Embedded devices

• Server– Lower Access time– Higher Bandwidth*– Better Protection*– Larger Memory

Desktop Lower Access

time Larger Memory

Embedded Lower Access

time Simpler

Memory*

Processor-Memory Gap

http://images.google.co.in/imgres?imgurl=http://www.joe-ks.com/archives_jan2004/DavidGoliath.jpg&imgrefurl=http://www.joe-ks.com/art.htm&h=511&w=467&sz=46&tbnid=WWfHliE4O3sJ:&tbnh=128&tbnw=116&hl=en&start=9&prev=/images%3Fq%3DDavid%2BGoliath%26svnum%3D10%26hl%3Den%26lr%3D

Moore’s Law• Transistor density on a chip dye doubles every

couple (1.5) of years.

• Short reference: http://en.wikipedia.org/wiki/Moore's_law

http://en.wikipedia.org/wiki/Moore's_law

What is Memory Hierarchy and Why?

CPU

(Registers)

MainMemory

0.25 ns

250 ns

Bus Adapter

Storage & I/O devices

2,500,000 ns!

Memory Hierarchy & Cache

Cache• Cache is a smaller, faster, and expensive memory.• Improves the througput/latency of slower memory

next to it in the memory hierarchy.• Blocking reads and delaying the writes to slower

memory offers better performance.• There are two cache memories L1 and L2 between

CPU and main memory.• L1 is built into CPU.• L2 is an SRAM.• Data, Instructions, and Addresses are cached.

Cache Operation• Cache Hit: CPU finds the required data item (or

instruction) in the cache.• Cache Miss: CPU does not find the required item

in the cache.– CPU Stalled– Hardware loads the entire block that contains the

required data item from the memory into cache.– CPU continues to execute once the cache loaded.

• Spatial Locality• Principle of Locality

Hit and Miss• Hit

– Hit Rate– Hit Time

• Miss– Miss Rate– Miss Penalty

• Hit Time << Miss Penalty

Higher Level

Lower Level

Cache Performance Program Execution Time

• CPU Clock Cycle• Cycle Time• IC – Instruction Count • Program Execution Time (Simple model)

= (Useful CPU Cycles + Stalled CPU Cycles) x

Cycle Time

Stalled CPU CyclesStalled CPU Cycles

= Number of Cache Misses X Miss Penalty

= IC X Memory Access X Miss Rate X Miss Penalty Instruction

Note: Unit of Miss Penalty is in CPU cycles

Number of Cache reference

http://images.google.co.in/imgres?imgurl=http://www.missnzuniverse.co.nz/tiara_oval.jpg&imgrefurl=http://www.missnzuniverse.co.nz/header.html&h=89&w=166&sz=5&tbnid=gDht-mfhZiwJ:&tbnh=49&tbnw=93&hl=en&start=4&prev=/images%3Fq%3DMiss%2BUniverse%2BTiara%26svnum%3D10%26hl%3Den%26lr%3D

http://images.google.co.in/imgres?imgurl=http://imagescommerce.bcentral.com/merchantfiles/4873862/tiara%2520headpieces%2520combs%252012335.jpg&imgrefurl=http://www.a1weddinginvitations.net/Tiaras%25203.htm&h=278&w=350&sz=26&tbnid=zmnFBDPhkOYJ:&tbnh=92&tbnw=116&hl=en&start=2&prev=/images%3Fq%3DTiara%26svnum%3D10%26hl%3Den%26lr%3D

Separating Read Miss from Write Miss

Stalled CPU Cycles

= IC X Memory Access X Miss Rate X Miss Penalty

Instruction

= IC X Read Access X Read Miss Rate X Read Penalty +

Instruction

IC X Write Access X Write Miss Rate X Write Penalty

Instruction

Example 1• Assume we have a computer where Clock cycles Per

Instruction (CPI) is 1.0 when all memory accesses are cache hits. The only data accesses are loads and stores and these total 50% of the instructions. If the miss penalty is 25 clock cycles and miss rate is 2%, how much faster would the computer be if all instructions were cache hits?

Example 1 …1. Execution time for the program in a

computer with 100 % hits

= (Useful CPU Cycles + Stalled CPU Cycles) x Cycle Time

= (IC X CPI + 0) X Cycle Time= IC X 1.0 X Cycle Time

Example 1 …2. Stall Cycle when there is one or more Cache

Misses: = IC X Memory Access X Miss Rate X Miss Penalty

Instruction

= IC X (1 + 0.5) X 0.02 X 25 Cycle Time = IC X 0.75 Cycle Time

3. Total Execution Time: = (IC X 1 + IC X 0.75) Clock Cycles = 1.75 IC Clock Cycles

Example 1 …

4. Ratios = 1.75 IC Clock Cycles / IC X 1.0 X Cycle Time

= 1.75

Execution time is 1.75 faster in a computer with no Misses.

Example 2

• Assume we have a computer where Clock cycles Per Instruction (CPI) is 1.0 when all memory accesses are cache hits. The only data accesses are loads and stores and these total 50% of the instructions. Miss penalty for read miss is 25 clock cycles and miss penalty for write miss is 50 clock cycles. Miss rate is 2% and out of this 80% is Read miss. How much faster would the computer be if all instructions were cache hits?

Cache Operation

Word Length(K Words)

...

Block

Block(K Words)

0123

2N - 1

Line Number

0

1

2

C= 16

Block

Block Length(K Words)

Tag

...

Cache

Main Memory

CPU

Elements of Cache Design• Cache Size• Block Size• Mapping Function• Replacement Algorithm• Write Policy• Write Miss• Number of caches• Split versus Unified/Mixed Cache

Mapping FunctionsTag

CacheMain Memory

Tag Line WordBlock 0

Block 1

Block n-1

Block n

Line 0

Line 1

Line 2

Line 3 (m)

3.Compare

1. Select

+4. Hit

Miss

To main memory

From CPUAddress

2. Copy

5. Load

Mapping Function• Direct

– Line value in address uniquely points to a line in cache. 1 tag Comparison

• Set Associative – Line value in address points to a set of lines in cache (typically

2/4/8, so 2/4/8 tag comparisons). This is known as 2/4/8 way Set Associative.

• Associative– Line value is always 0. This means Line points to all the lines

in cache (4 (m) tag Comparisons)– Uses Content Addressable Memory (CAM) for comparison.– Needs non-trivial replacement algorithm

Tag

Cache

Tag Line WordBlock 0

Block 1

Block 126

Block 127

Line 0

Line 1

Line 2

Line 3

3.Compare

1. Select

+4. Hit

Miss

To main memory

From CPUAddress

2. Copy

5. Load

0

1

23

Memory

508

509

510511

4

5

67

5 2 2 3

504

505

506507

SET 1

SET 0

Mapping Function Comparison

Cache Type Hit Ratio Search Speed

Direct Mapping Good Best

Fully Associative Best Moderate

N-Way Set AssociativeVery Good, Better as N Increases

Good, Worse as N Increases

Replacement Algorithm• Least Recently Used• First In First Out• Least Frequently Used• Random

Write Policy• Write-through

– Information written to cache and the memory

• Write back– Information written only to cache. Content of the

cache is written to the main memory only when this cache block is replaced or the program terminates.

• Dirty bit is used to indicate that a cache block needs “Write back”

Write Miss• Write-Allocate• No Write-Allocate

Number of Caches• Primary Cache (CPU)• Secondary Cache (SRAM)• L3 Cache (Cheaper SRAM)

Split versus Unified/Mixed Cache

Size (KB)Instruction Cache Data Cache Unified Cache

8 8.16 44 6316 3.82 40.9 5132 1.36 38.4 43.364 0.61 36.9 39.4128 0.3 35.3 36.2256 0.02 32.6 32.9

Single or two Level Unified or Split Misses per 1000 instructions with various Cache size

Improving Cache Performance• Reducing Penalty• Reducing Misses

– Compiler optimization attempts to reduce the Cache Misses falls under this category

Stalled Cycles

= IC X Memory Access X Miss Rate X Miss Penalty

Instruction

Improving Cache Performance Using Compiler

• Compilers are built with the following optimization:Instructions:

Reordering instructions to avoid conflict misses

Data :Merging Arrays

Loop Interchange

Loop Fusion

Blocking

Merging Arrays/* Conflict */

int key[size];

int value [size];

/* Instead no conflict*/

struct merge {

int key;

int value;

} ;

Key

value

Key, value pairs

Loop Interchangefor (k = 0; k < 100; k= k+1)

for (j = 0; j < 100; j= j+1)

for (i = 0; i < 5000; i= i+1)

x[i][j] = 2*x[i][j]

/*instead */

for (k = 0; k < 100; k= k+1)

for (i = 0; i < 5000; i= i+1)

for (j = 0; j < 100; j= j+1)

x[i][j] = 2*x[i][j]

X(0,0)X(0,1)X(0,2)X(0,3)X(0,4)X(0,5)…X(0,4999)X(1,0)X(1,1)…

MemoryAddress

Blocking

Loop Fusionfor (i = 0; i < 5000; i= i+1)

a[i] = i;…for (i = 0; i < 5000; i= i+1)

b[i] = b[i]+a[i];

/*instead */

for (i = 0; i < 5000; i= i+1)a[i]= i;b[i] = b[i]+a[i];

Example 2• Assume a fully associative write-back cache with many cache

entries that starts empty. Below is a sequence of five memory operations (the address is in square brackets)

• What are the number of hits and misses using no-write allocate versus write allocate?

WriteMem[100];WriteMem[100];ReadMem[200];WriteMem[200];WriteMem[100];

Exit Quiz1. In memory hierarchy top layer is occupied by

the

a. Fastest and the most expensive memoryb. Slowest and the most expensive memoryc. High band width and the fastest memoryd. Slowest and the least expensive memory

Exit Quiz2. The memory technology suitable for L2 cache

is

a. EPROMb. DRAMc. SRAMd. Flash

Exit Quiz3. Server’s memory requirements are different

from desktop’s memory requirementsa. Trueb. False

Exit Quiz4. Hit Rate is (1 – Miss Rate)

a. Trueb. False

5. Gordon Moore’s law states that the transistor density

a. Triples every yearb. Doubles every two yearsc. Doubles every 1.5 yearsd. Doubles every year

Improving Cache Performance• Two options

– By Reducing Miss Penalty– By Reducing Miss Rate

Reducing Cache Miss Penalty1. Multilevel caches2. Critical word first & early restart3. Priority for Read misses over Write4. Merging Write Buffers5. Victim Cache

Reducing Miss Penalty - Multilevel caches (1)

Cache1 CPU MainMemory Cache 1CPU Main

Memory

Faster but smaller Slower but larger

CPU MainMemory

Faster and Larger

Cache 1 Cache 2

2 ns, 16K 10 ns, 512K

? ns, 528 K

0.020.04

0.040.02

100 ns, 128M100 ns, 128M

100 ns, 128M

Reducing Miss Rate & Global Miss Rate (1)

CPU MainMemory

Faster and Larger

Cache 1 Cache 2

Average memory access timeeff = Hit TimeL1 + Miss RateL1 x Miss

PenaltyL1

Miss PenaltyL1 = Hit TimeL2 + Miss RateL2 x Miss PenaltyL2

Global Miss Rate = Miss RateL1 x Miss RateL2

0.050.02

2 ns, 16K 10 ns, 512K

100 ns, 128M2.6 ns, 528K

• Critical word first • Early restart • Lets discuss the following:

– How is Miss Penalty improved?– Under what conditions this is useful?– Is it worth going through all the trouble?

Reducing Miss Penalty – Critical Word First and Early Restart (2)

Reducing Miss Penalty – Priority To Read Miss (3)

• Read Miss takes priority• Write to Memory is put on hold• Lets discuss the following:

– Is there a problem?– How is Miss Penalty improved? – How this is done in Write-through?– How this is done in Write-back?– Under what conditions this is useful?– Is it worth going through all the trouble?

Reducing Miss Penalty – Merging Write Buffer (4)

• Write-through is sent to a write buffer• Buffer & Write possible options:

1. Buffer is empty – write address and data to buffer2. Buffer is not full – write or merge data3. Buffer is full – Stall until buffer is written to memory. – Note: Block write to memory is more efficient then multiple writes.

• Lets Discuss the following:– How many different ways do we optimize Miss Penalty in this

scheme?– Explain the Fig 5.12 in the book– When there is no block write (like I/O device), no merging

Reducing Miss Penalty – Fully Associative Victim Cache (5)

• Key word is Recycle• Another implicit key word is complex

(hardware)• Aimed at handling conflict misses• 1 to 6 victim caches are ideal

Reducing Misses• Misses due to 3Cs: Compulsory, Capacity, and

Conflict• Reducing Misses Techniques

1. Larger Lines/Blocks2. Larger Caches3. Higher Associativity4. Way Prediction Pseudoassociative caches5. Compiler Techniques

Notes: copy tables 5.14 and 5.17 to two slides

Reducing Cache Misses – Larger Block Size (1)

• Larger block improves number of misses up to a point!

• Points to discuss– Why number of misses starts increasing for larger blocks?– Low latency encourages smaller blocks and higher latency

encourages larger blocks.– Low bandwidth encourages smaller blocks and higher

bandwidth encourages larger blocks.– Effect on Miss Penalty with larger blocks

• What is the other name for 1-way set associative mapping?

Reducing Cache Misses – Larger Caches (2) • Obviously!

Reducing Cache Misses – Higher Associativity (3)

• Miss rate improves with associativity• Points to discuss

– Complexity of set associative mapping versus the improvement.

– Rule of thumb 1: 8-way associative is as good as fully associative.

– Rule of thumb 2: 2-way associative is as good as direct mapping.

– Greater associativity increases Hit time

Reducing Cache Misses – Pseudo-Associative Caches (4)

• Lets understand this using an 8-way set associative Instruction Cache.

• Instruction exhibits better locality• Each access to instruction cache

normally needs 8 comparisons.• Using locality predict the next block

to access in the set to reduce the number of comparisons.

• Effect on Hit-time and Cache Miss

Cache Memory – ith set

Block 0

Block 1

Block 6

Block 8

tags

Reducing Hit Time – Small and Simple Cache (1)

• Tag comparison is complex, specifically in associative mapping

• Tag checking can be overlapped with data transfer to reduce the hit time.

• Tag checking in CPU chip, cache in a chip by itself. Provides better hit time and larger cache capacity

Reducing Hit Time – Virtual Cache (2)

• Addressing– VA -> Physical Address -> cache

• Skip two levels, VA maps to cache• Problems:

– No page boundary checks– Building direct mapping between VA and Cache

for every process is not easy.

Reducing Hit Time – Pipelined Cache (3)

• Increases the cache throughput not access time

Reducing Hit Time – Trace Cache

• Increasing instruction level parallelism• Instead of 4 consecutive locations of cache,

load the next 4 instruction required by the CPU using trace.

• Folding branch prediction into cache

Improving Parallelism with Cache• Cache Hit under Cache Miss• Cache Miss under Cache Miss• Both require non-blocking cache, out of order

execution CPU• Other methods to improve performance using

parallelism:– Hardware pre-fetching of instruction– Software (compiler) pre-fetching of data

PreparationWrite-back• Complicates cache coherency

problem• Low overhead memory access

overhead• Better cache access time than

write -through• Requires higher memory

bandwidth if blocking

Write Through• Simplifies cache coherency

problem• High overhead• If cache is blocking then higher

access time • Requires Lower memory

bandwidth

Note: No need to describe these two policies Write-through does not buy anything extra for a single

processor system due to absence of cache coherency

CPU Execution TimeCPU Execution Time

= (CPU Cycle time + Stalled Cycle) X Cycle Time

• Stalled Cycle = misses x penalty• Misses given either as misses/1000 instruction or

misses/memory-access AKA miss rate.• Instruction Count , Cycles per Instruction, Miss are also

required to compute CPU execution time.

Average Access Time with CacheAverage Access Time= Hit Time + Miss Rate X Penalty

Multi-level CacheAvg Access Time = Hit TimeL1+ Miss RateL1X PenaltyL1

PenaltyL1 = Hit TimeL2+ Miss RateL2X PenaltyL2

AddressingTag

Cache

Main Memory

Tag Set WordBlock 0

Block 1

Block 126

Block 127

Line 0

Line 1

Line 2

Line 3 (m)

3.Compare

1. Select

+4. Hit

4. Miss

To main memory

From CPUAddress

2. Copy

5. Load

0123

508509510511

4567

Set 0

Set 1

Assignment I – Due same day next week

• Mapping functions• Replacement algorithms• Write policies• Write Miss policies• Split Cache versus Unified Cache• Primary Cache versus Secondary Cache• Compiler cache optimization techniques with

examples

Assignment II - Due same day next week

• Multilevel Cache• Cache Inclusion/Exclusion Property• Thumb rules of cache• Compiler pre-fetch• Multi-level Caching + one another Miss

Penalty Optimization technique• Two miss Rate Optimization Techniques

Assignment III - Due 2nd class of next week

• All odd numbered problems from cache module of your text book.

Assignment IV - Due 2nd class of next week

• All even numbered problems from cache module of your text book.

CPU Execution Time & Average Access Time

Memory

CPU

Cache

1 CC 100 CC

With Multi-level Cache

1000

Memory

CPU

Cache

1 CC 100 CC10 CC

Memory Hierarchy

Main Memory

Main Memory• Module Objective

– To understand Main memory latency and bandwidth

– Techniques to improve latency and bandwidth

Memory Hierarchy & Cache

Main Memory – Cache – I/O

MainMemory

0.25 ns

250 ns

Bus Adapter

Storage & I/O devices

2,500,000 ns!

cacheCPU

Cache prefers low latency main memory I/O & Multiprocessors prefer high bandwidth/throughput

main memory

Main Memory Access Time Main

Memory

cacheCPU

DataBus Address

Bus

4 cc

4 cc

56 cc

Access Time per word = 4+56+4 CCOne word is 8 BytesLatency is 1 bit/CC

CC – Clock Cycle

Improving Memory Performance• Improving Latency ( time to access 1 memory

unit - word)• Improving bandwidth (bytes accessed in unit

time)

Improving Memory BandwidthCPU

Memory

Cache

64 bits

64 bits

CPU

Memory

Cache

64 bits

4x64 bits

Bandwidth, Latency, Penalty Interleaving Factor Address allocation with Interleaving

CPU

Cache

64 bits

64 bits

048

12

159

13

37

1115

26

1014

Bank 0 Bank 1 Bank 2 Bank 3Cache Block 4 wordsOne word 8 bytes

Simple Design Wider Bus Interleaved