Speculate Ambitiously
To speculate ambitiously requires three capabilities:
• The ability of the compiler to find instructions that, with the possible use of register renaming, can be speculatively moved and not affect the program data flow.
• The ability to ignore exceptions in speculated instructions, until we know that such exceptions should really occur.
• The ability to speculatively interchange loads and stores, or stores and stores, which may have address conflicts.
The first of these is a compiler capability, while the last two require hardware support.
Hardware Support for Preserving Exception Behavior
• To speculate ambitiously, we must be able to move any type of instruction and still preserve its exception behavior.
• The key to being able to do this is to observe that the results of a speculated sequence that is mispredicted will not be used in the final computation, and such a speculated instruction should not cause an exception.
Methods Investigated for Supporting More Ambitious Speculation
• There are four methods that have been investigated for supporting more ambitious speculation without introducing erroneous behavior:
– The hardware and operating system cooperatively ignore exceptions for speculative instructions. This approach preserves exception behavior for correct programs, but not for incorrect ones. It may be viewed as unacceptable for some programs, but it has been used, under program control, as a “fast mode” in several processors.
– Speculative instructions that never raise exceptions are used, and checks are introduced to determine when an exception should occur.
– A set of status bits, called poison bits, is attached to the result registers written by speculated instructions when the instructions cause exceptions. The poison bits cause a fault when a normal instruction attempts to use the register.
– A mechanism is provided to indicate that an instruction is speculative, and the hardware buffers the instruction result until it is certain that the instruction is no longer speculative.
Schemes
• To understand the schemes, we need to distinguish between exceptions that indicate a program error and would normally cause termination, such as a memory protection violation, and those that are handled and normally resumed, such as a page fault.
• Exceptions that can be resumed can be accepted and processed for speculative instructions just as if they were normal instructions.
• If the speculative instruction should not have been executed, handling the unneeded exception may have some negative performance effects, but it cannot cause incorrect execution.
• The cost of these exceptions may be high, however, and some processors use hardware support to avoid taking such exceptions, just as processors with hardware speculation may take some exceptions in speculative mode, while avoiding others until an instruction is known not to be speculative.
• Exceptions that indicate a program error should not occur in correct programs, and the result of a program that gets such an exception is not well defined, except perhaps when the program is running in a debugging mode.
• If such exceptions arise in speculated instructions, we cannot take the exception until we know that the instruction is no longer speculative.
• In the simplest method for preserving exceptions, the hardware and the operating system handle all resumable exceptions when the exception occurs and simply return an undefined value for any exception that would cause termination.
• If the instruction generating the terminating exception was not speculative, then the program is in error.
• Note that instead of terminating the program, the program is allowed to continue, although it will almost certainly generate incorrect results.
• If the instruction generating the terminating exception is speculative, then the program may be correct and the speculative result will simply be unused.
• Thus, returning an undefined value for the instruction cannot be harmful.
• This scheme can never cause a correct program to fail, no matter how much speculation is done.
• An incorrect program, which formerly might have received a terminating exception, will get an incorrect result.
• This is acceptable for some programs, assuming the compiler can also generate a normal version of the program, which does not speculate and can receive a terminating exception.
• In such a scheme, it is not necessary to know that an instruction is speculative. Indeed, it is helpful only when a program is in error and receives a terminating exception on a normal instruction; in such cases, if the instruction were not marked as speculative, the program could be terminated.
• A second approach to preserving exception behavior when speculating introduces speculative versions of instructions that do not generate terminating exceptions and instructions to check for such exceptions. This combination preserves the exception behavior exactly.
• A third approach for preserving exception behavior tracks exceptions as they occur but postpones any terminating exception until a value is actually used, although not in a completely precise fashion.
• The scheme is simple: a poison bit is added to every register, and another bit is added to every instruction to indicate whether the instruction is speculative. The poison bit of the destination register is set whenever a speculative instruction results in a terminating exception; all other exceptions are handled immediately.
• If a speculative instruction uses a register with a poison bit turned on, the destination register of the instruction simply has its poison bit turned on.
• If a normal instruction attempts to use a register source with its poison bit turned on, the instruction causes a fault.
• In this way, any program that would have generated an exception still generates one, albeit at the first instance where a result is used by an instruction that is not speculative.
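The poison-bit rules can be sketched in C. This is only an illustration under stated assumptions, not a model of any real processor: the `RegFile` type, the `execute` function, and the single add-immediate operation are hypothetical names chosen for the sketch, and the `raises_terminating` flag stands in for whatever terminating exception the operation would have raised.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_REGS 32

/* Hypothetical register file: one poison bit per register. */
typedef struct {
    int64_t value[NUM_REGS];
    bool    poison[NUM_REGS];
} RegFile;

typedef enum { EXEC_OK, EXEC_FAULT } ExecStatus;

/*
 * Execute "dst = src + imm".
 * speculative:        the per-instruction bit set by the compiler.
 * raises_terminating: the operation itself hit a terminating exception.
 */
ExecStatus execute(RegFile *rf, int dst, int src, int64_t imm,
                   bool speculative, bool raises_terminating)
{
    assert(dst >= 0 && dst < NUM_REGS && src >= 0 && src < NUM_REGS);

    if (speculative) {
        /* Never fault: record a new exception or propagate old poison. */
        bool poisoned = raises_terminating || rf->poison[src];
        if (!poisoned)
            rf->value[dst] = rf->value[src] + imm;
        rf->poison[dst] = poisoned;
        return EXEC_OK;
    }

    /* Normal instruction: fault on its own exception or a poisoned source. */
    if (raises_terminating || rf->poison[src])
        return EXEC_FAULT;

    rf->value[dst] = rf->value[src] + imm;
    rf->poison[dst] = false;
    return EXEC_OK;
}
```

In this sketch a speculative instruction that would have terminated the program only poisons its destination, a later speculative instruction reading that register poisons its own destination, and the fault surfaces at the first normal instruction that reads a poisoned register.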
• Since poison bits exist only on register values and not memory values, stores are never speculative and thus trap if either operand is “poison”.
• One complication that must be overcome is how the OS saves the user registers on a context switch if the poison bit is set.
• A special instruction is needed to save and reset the state of the poison bits to avoid this problem.
• The fourth and final approach listed earlier relies on a hardware mechanism that operates like a reorder buffer.
• In such an approach, instructions are marked by the compiler as speculative and include an indicator of how many branches the instruction was speculatively moved across and what branch action (taken/not taken) the compiler assumed.
• The last piece of information basically tells the hardware the location of the code block where the speculated instruction originally was.
• In practice, most of the benefit of speculation is gained by allowing movement of an instruction across a single branch, so only 1 bit saying whether the speculated instruction came from the taken or not-taken path is required.
• Alternatively, the original location of the speculative instruction is marked by a sentinel, which tells the hardware that the earlier speculative instruction is no longer speculative and values may be committed.
• All instructions are placed in a reorder buffer when issued and are forced to commit in order, as in a hardware speculation approach. (Notice, though, that no actual speculative branch prediction or dynamic scheduling occurs.)
• The reorder buffer tracks when instructions are ready to commit and delays the “write-back” portion of any speculative instruction.
• Speculative instructions are not allowed to commit until the branches that have been speculatively moved over are also ready to commit, or, alternatively, until the corresponding sentinel is reached.
• At that point, we know whether the speculated instructions should have been executed or not.
• If it should have been executed and it generated a terminating exception, then we know that the program should be terminated.
• If the instruction should not have been executed, then the exception can be ignored.
Example 1
• Consider the code fragment from an if-then-else statement of the form:
if (A==0) A=B; else A=A+4;
where A is at 0(R3) and B is at 0(R2). Assume the then clause is almost always executed. Compile the code using compiler-based speculation. Assume R14 is unused and available.
      LD    R1,0(R3)    ; load A
      BNEZ  R1,L1       ; test A
      LD    R1,0(R2)    ; then clause
      J     L2          ; skip else
L1:   DADDI R1,R1,#4    ; else clause
L2:   SD    R1,0(R3)    ; store A
Example 2
• Show how the code would look using a speculative load (sLD) and a speculation check instruction (SPECCK) to completely preserve exception behavior. Assume R14 is unused and available.
      LD    R1,0(R3)    ; load A
      BNEZ  R1,L1       ; test A
      LD    R1,0(R2)    ; then clause
      J     L2          ; skip else
L1:   DADDI R1,R1,#4    ; else clause
L2:   SD    R1,0(R3)    ; store A
Example 3
• Consider the code fragment and show how it would be compiled with speculative instructions and poison bits. Show where an exception for the speculative memory reference would be recognized. Assume R14 is unused and available.
      LD    R1,0(R3)    ; load A
      BNEZ  R1,L1       ; test A
      LD    R1,0(R2)    ; then clause
      J     L2          ; skip else
L1:   DADDI R1,R1,#4    ; else clause
L2:   SD    R1,0(R3)    ; store A
Hardware Support for Memory Reference Speculation
• Moving loads across stores is usually done when the compiler is certain the addresses do not conflict.
• To allow such code motion when the compiler cannot be certain, a special instruction to check for address conflicts can be included in the architecture.
• The special instruction is left at the original location of the load instruction (and acts like a guardian), and the load is moved up across one or more stores.
• When a speculated load is executed, the hardware saves the address of the accessed memory location.
• If a subsequent store changes the location at that address before the check instruction, then the speculation has failed.
• If the location has not been touched, then the speculation is successful.
• Speculation failure can be handled in two ways.
– If only the load instruction was speculated, then it suffices to redo the load at the point of the check instruction (which could supply the target register in addition to the memory address).
– If additional instructions that depend on the load were also speculated, then a fix-up sequence that reexecutes all the speculated instructions starting with the load is needed.
– In this case, the check instruction specifies the address where the fix-up code is located.
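The address-check mechanism for a single speculated load can be sketched in C. Everything named here is an assumption made for illustration: the one-entry `SpecLoadTracker`, the word-indexed `mem` array standing in for memory, and the function names. The sketch covers only the simple recovery case, where the load alone was speculated and is redone at the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-entry tracker for a single speculated load. */
typedef struct {
    size_t addr;   /* word index read by the speculated load */
    bool   intact; /* true while no store has touched that word */
} SpecLoadTracker;

/* The speculated load executes early: hardware saves its address. */
int64_t spec_load(SpecLoadTracker *t, const int64_t *mem, size_t addr)
{
    assert(t != NULL && mem != NULL);
    t->addr = addr;
    t->intact = true;
    return mem[addr];
}

/* Every later store is compared against the saved address. */
void store(SpecLoadTracker *t, int64_t *mem, size_t addr, int64_t value)
{
    assert(t != NULL && mem != NULL);
    mem[addr] = value;
    if (t->intact && t->addr == addr)
        t->intact = false;   /* conflict: the speculation has failed */
}

/*
 * Check instruction at the load's original position. On success the
 * speculated value is kept; on failure the load is simply redone.
 */
int64_t check_load(SpecLoadTracker *t, const int64_t *mem, int64_t spec_value)
{
    assert(t != NULL && mem != NULL);
    int64_t result = t->intact ? spec_value : mem[t->addr];
    t->intact = false;       /* tracking ends at the check */
    return result;
}
```

When dependent instructions were speculated along with the load, redoing the load is not enough; the check would instead branch to compiler-generated fix-up code, which this sketch does not model.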
Sections To Be Covered from Chapter 4
• Section 4.5
• Software versus Hardware based scheduling
Memory Hierarchy
Memory Hierarchy & Cache Memory
Entry Quiz
1. Primary cache or level 1 cache is implemented in a separate chip
a. True
b. False
2. SRAM is implemented using
a. Flip-Flop
b. Magnetic core
c. Capacitor
d. Non-volatile Technology
3. Main memory (200 ns) is slower compared to a register (0.2 ns) by an order of
a. 3
b. 4
c. 1000
d. 10000
4. Virtual Memory is
a. Same as caching
b. Same as associative memory
c. Different from caching
d. Same as disk memory
5. Cache Miss occurs when the
a. Required instruction is not found in the cache
b. Required data is not found in the cache
c. Required instruction or data is not found in the cache
d. Required instruction or data is not found in the main memory
e. For all of the above conditions
Module Objective
• To understand
1. Memory requirements of different computers
2. Memory hierarchy and the motivation behind it
3. Moore’s Law
4. Principles of Locality
5. Cache Memory and its implementation
6. Cache Performance
7. Terms: Cache, Cache Miss, Cache Hit, Latency, Bandwidth, SRAM, DRAM, by an order of, Direct Mapping, Associative Mapping, Set Mapping, Write Through, Write Allocate, Write Back, Dirty Bit, and Valid Bit
Memory Requirements
• In general we would like to have
– Faster Memory (lower access time or latency)
– Larger (capacity and bandwidth) Memory
– Simpler Memory
Memory Requirements - Server, Desktop, and Embedded devices
• Server
– Lower Access time
– Higher Bandwidth*
– Better Protection*
– Larger Memory
• Desktop
– Lower Access time
– Larger Memory
• Embedded
– Lower Access time
– Simpler Memory*
Moore’s Law
• Transistor density on a chip die doubles every couple of years (about every 1.5 years).
• Short reference: http://en.wikipedia.org/wiki/Moore's_law
What is Memory Hierarchy and Why?
[Figure: CPU (registers, 0.25 ns), main memory (250 ns), bus adapter, storage & I/O devices (2,500,000 ns!)]
Memory Hierarchy & Cache
Cache
• Cache is a smaller, faster, and more expensive memory.
• Improves the throughput/latency of the slower memory next to it in the memory hierarchy.
• Blocking reads and delaying the writes to slower memory offers better performance.
• There are two cache memories, L1 and L2, between CPU and main memory.
• L1 is built into the CPU.
• L2 is an SRAM.
• Data, Instructions, and Addresses are cached.
Cache Operation
• Cache Hit: CPU finds the required data item (or instruction) in the cache.
• Cache Miss: CPU does not find the required item in the cache.
– CPU is stalled.
– Hardware loads the entire block that contains the required data item from the memory into cache.
– CPU continues to execute once the cache is loaded.
• Spatial Locality
• Principle of Locality
Hit and Miss
• Hit
– Hit Rate
– Hit Time
• Miss
– Miss Rate
– Miss Penalty
• Hit Time << Miss Penalty
[Figure: higher level and lower level of the memory hierarchy]
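Cache Performance
Program Execution Time
• CPU Clock Cycle
• Cycle Time
• IC – Instruction Count
• Program Execution Time (simple model)
$$\text{Program Execution Time} = (\text{Useful CPU Cycles} + \text{Stalled CPU Cycles}) \times \text{Cycle Time}$$
Stalled CPU Cycles
$$\text{Stalled CPU Cycles} = \text{Number of Cache Misses} \times \text{Miss Penalty} = IC \times \frac{\text{Memory Accesses}}{\text{Instruction}} \times \text{Miss Rate} \times \text{Miss Penalty}$$
Note: The unit of Miss Penalty is CPU cycles. The product IC × (Memory Accesses / Instruction) is the number of cache references.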
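Separating Read Miss from Write Miss
$$\text{Stalled CPU Cycles} = IC \times \frac{\text{Memory Accesses}}{\text{Instruction}} \times \text{Miss Rate} \times \text{Miss Penalty}$$
$$= IC \times \frac{\text{Read Accesses}}{\text{Instruction}} \times \text{Read Miss Rate} \times \text{Read Penalty} + IC \times \frac{\text{Write Accesses}}{\text{Instruction}} \times \text{Write Miss Rate} \times \text{Write Penalty}$$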
Example 1
• Assume we have a computer where Clock cycles Per Instruction (CPI) is 1.0 when all memory accesses are cache hits. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits?
Example 1 …
1. Execution time for the program in a computer with 100% hits:
$$(\text{Useful CPU Cycles} + \text{Stalled CPU Cycles}) \times \text{Cycle Time} = (IC \times CPI + 0) \times \text{Cycle Time} = IC \times 1.0 \times \text{Cycle Time}$$
2. Stall cycles when there are cache misses (each instruction makes 1 instruction fetch plus 0.5 data accesses):
$$IC \times \frac{\text{Memory Accesses}}{\text{Instruction}} \times \text{Miss Rate} \times \text{Miss Penalty} = IC \times (1 + 0.5) \times 0.02 \times 25 = 0.75 \times IC \text{ clock cycles}$$
3. Total execution time:
$$(IC \times 1.0 + IC \times 0.75) \times \text{Cycle Time} = 1.75 \times IC \times \text{Cycle Time}$$
4. Ratio:
$$\frac{1.75 \times IC \times \text{Cycle Time}}{IC \times 1.0 \times \text{Cycle Time}} = 1.75$$
The computer with no misses is 1.75 times faster.
Example 2
• Assume we have a computer where Clock cycles Per Instruction (CPI) is 1.0 when all memory accesses are cache hits. The only data accesses are loads and stores and these total 50% of the instructions. Miss penalty for read miss is 25 clock cycles and miss penalty for write miss is 50 clock cycles. Miss rate is 2% and out of this 80% is Read miss. How much faster would the computer be if all instructions were cache hits?
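One way to work this example is sketched below in C. It rests on two readings of the problem that are assumptions rather than givens: that there are 1.5 memory accesses per instruction (one instruction fetch plus 0.5 data accesses, as in Example 1), and that "80% is Read miss" means 80% of all misses are read misses. The function names are hypothetical.

```c
#include <assert.h>

/*
 * Stall cycles per instruction when read and write misses have
 * different penalties. read_fraction is the share of misses that
 * are read misses (assumed interpretation of the problem text).
 */
double stall_cpi(double accesses_per_instr, double miss_rate,
                 double read_fraction,
                 double read_penalty, double write_penalty)
{
    assert(read_fraction >= 0.0 && read_fraction <= 1.0);
    double avg_penalty = read_fraction * read_penalty
                       + (1.0 - read_fraction) * write_penalty;
    return accesses_per_instr * miss_rate * avg_penalty;
}

/* How much faster the all-hits machine is than the one with stalls. */
double speedup_all_hits(double base_cpi, double stall)
{
    assert(base_cpi > 0.0);
    return (base_cpi + stall) / base_cpi;
}
```

Under those assumptions, `stall_cpi(1.5, 0.02, 0.8, 25.0, 50.0)` gives about 0.9 stall cycles per instruction (1.5 × 0.02 × (0.8 × 25 + 0.2 × 50)), so the all-hits computer would be about 1.9 times faster. A different reading of the 80% figure would change the result.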
Cache Operation
[Figure: main memory of 2^N words (addresses 0 to 2^N - 1) grouped into blocks of K words; cache of C = 16 lines (line numbers 0, 1, 2, ...), each holding a tag and a block of K words; CPU connected to the cache]
Elements of Cache Design
• Cache Size
• Block Size
• Mapping Function
• Replacement Algorithm
• Write Policy
• Write Miss
• Number of caches
• Split versus Unified/Mixed Cache
Mapping Functions
[Figure: address from the CPU split into Tag, Line, and Word fields; main memory blocks 0 to n; cache lines 0 to 3 (m), each with a tag. Steps: 1. Select, 2. Copy, 3. Compare, 4. Hit or Miss (a miss goes to main memory), 5. Load]
Mapping Function
• Direct
– Line value in address uniquely points to a line in cache. 1 tag comparison.
• Set Associative
– Line value in address points to a set of lines in cache (typically 2/4/8, so 2/4/8 tag comparisons). This is known as 2/4/8-way Set Associative.
• Associative
– Line value is always 0. This means Line points to all the lines in cache (4 (m) tag comparisons).
– Uses Content Addressable Memory (CAM) for comparison.
– Needs non-trivial replacement algorithm.
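The three mapping functions differ only in how many address bits select a line or set. A small C sketch, assuming word addresses and power-of-two sizes, makes that concrete; `AddrFields` and `split_address` are hypothetical names, and the field widths are parameters rather than values taken from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* The three fields of a memory address as seen by the cache. */
typedef struct {
    uint32_t tag;
    uint32_t index;   /* line number (direct) or set number (set associative) */
    uint32_t offset;  /* word within the block */
} AddrFields;

/*
 * Split a word address for a cache with 2^offset_bits words per block
 * and 2^index_bits lines (direct mapped) or sets (set associative).
 * index_bits = 0 models a fully associative cache: the index is always 0
 * and every line's tag must be compared.
 */
AddrFields split_address(uint32_t addr, unsigned offset_bits, unsigned index_bits)
{
    assert(offset_bits + index_bits < 32);
    AddrFields f;
    f.offset = addr & ((1u << offset_bits) - 1u);
    f.index  = (addr >> offset_bits) & ((1u << index_bits) - 1u);
    f.tag    = addr >> (offset_bits + index_bits);
    return f;
}
```

For example, with 4-word blocks and a 4-line cache, word address 509 splits as `split_address(509, 2, 2)` for direct mapping (index 3, tag 31), `split_address(509, 2, 1)` for 2-way set associative (set 1, tag 63), and `split_address(509, 2, 0)` for fully associative (index 0, tag 127). Fewer index bits mean a longer tag and more lines to compare.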
[Figure: set-associative mapping example. Memory words 0 to 511 form blocks 0 to 127; cache lines 0 to 3 are grouped into SET 0 and SET 1; the address from the CPU is split into Tag, Line, and Word fields. Steps: 1. Select, 2. Copy, 3. Compare, 4. Hit or Miss (a miss goes to main memory), 5. Load]
Mapping Function Comparison
Cache Type            | Hit Ratio                        | Search Speed
Direct Mapping        | Good                             | Best
Fully Associative     | Best                             | Moderate
N-Way Set Associative | Very Good, Better as N Increases | Good, Worse as N Increases
Replacement Algorithm
• Least Recently Used
• First In First Out
• Least Frequently Used
• Random
Write Policy
• Write-through
– Information is written to both the cache and the memory.
• Write-back
– Information is written only to the cache. The content of the cache is written to main memory only when this cache block is replaced or the program terminates.
• A dirty bit is used to indicate that a cache block needs “Write back”.
Write Miss
• Write-Allocate
• No Write-Allocate
Number of Caches
• Primary Cache (CPU)
• Secondary Cache (SRAM)
• L3 Cache (Cheaper SRAM)
Split versus Unified/Mixed Cache
Misses per 1000 instructions with various cache sizes
Size (KB) | Instruction Cache | Data Cache | Unified Cache
8         | 8.16              | 44         | 63
16        | 3.82              | 40.9       | 51
32        | 1.36              | 38.4       | 43.3
64        | 0.61              | 36.9       | 39.4
128       | 0.3               | 35.3       | 36.2
256       | 0.02              | 32.6       | 32.9
Single or two Level; Unified or Split
Improving Cache Performance
• Reducing Penalty
• Reducing Misses
– Compiler optimizations that attempt to reduce cache misses fall under this category.
$$\text{Stalled Cycles} = IC \times \frac{\text{Memory Accesses}}{\text{Instruction}} \times \text{Miss Rate} \times \text{Miss Penalty}$$
Improving Cache Performance Using Compiler
• Compilers are built with the following optimizations:
Instructions:
– Reordering instructions to avoid conflict misses
Data:
– Merging Arrays
– Loop Interchange
– Loop Fusion
– Blocking
Merging Arrays
/* Conflict: two separate arrays */
int key[size];
int value[size];
/* Instead, no conflict: one array of key, value pairs */
struct merge {
    int key;
    int value;
};
struct merge merged_array[size];
Loop Interchange
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2*x[i][j];
/* instead */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2*x[i][j];
[Figure: memory addresses of consecutive elements X(0,0), X(0,1), X(0,2), …, followed by X(1,0), X(1,1), …]
Blocking
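Blocking can be illustrated with a tiled matrix multiply in C. This is a generic sketch of the technique, not code from the course material: the dimension `N`, the blocking factor `B`, and the function name are assumptions chosen for illustration, with `B` meant to be small enough that the tiles being reused stay in the cache.

```c
#include <assert.h>
#include <stddef.h>

#define N 8   /* matrix dimension (hypothetical) */
#define B 4   /* blocking factor (hypothetical) */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/*
 * x = x + y * z, computed one B x B tile of z at a time so that the
 * tile is reused while it is still cached. The caller zeroes x first.
 */
void matmul_blocked(double x[N][N], double y[N][N], double z[N][N])
{
    assert(x != NULL && y != NULL && z != NULL);
    for (size_t jj = 0; jj < N; jj += B)
        for (size_t kk = 0; kk < N; kk += B)
            for (size_t i = 0; i < N; i++)
                for (size_t j = jj; j < min_sz(jj + B, N); j++) {
                    double r = 0.0;
                    for (size_t k = kk; k < min_sz(kk + B, N); k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;   /* accumulate the partial sum for this tile */
                }
}
```

The two outer loops pick a tile; the inner loops then touch only that tile of `z` and a strip of `y`, instead of sweeping all of `z` for every row of `x`. The result is the same as the plain three-loop version; only the order of memory accesses changes.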
Loop Fusion
for (i = 0; i < 5000; i = i+1)
    a[i] = i;
…
for (i = 0; i < 5000; i = i+1)
    b[i] = b[i]+a[i];
/* instead */
for (i = 0; i < 5000; i = i+1) {
    a[i] = i;
    b[i] = b[i]+a[i];
}
Example 2
• Assume a fully associative write-back cache with many cache entries that starts empty. Below is a sequence of five memory operations (the address is in square brackets).
• What are the number of hits and misses using no-write allocate versus write allocate?
WriteMem[100];
WriteMem[100];
ReadMem[200];
WriteMem[200];
WriteMem[100];
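The sequence can be traced with a small C simulation. It is a sketch under simplifying assumptions: each address is treated as its own block, the cache never fills (so no replacement is modelled), and `MemOp`, `Counts`, and `simulate` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_BLOCKS 64   /* "many entries": hypothetical capacity, never filled here */

typedef struct { bool is_write; unsigned addr; } MemOp;
typedef struct { int hits; int misses; } Counts;

/* Count hits and misses for a fully associative cache that starts empty. */
Counts simulate(const MemOp *ops, size_t n, bool write_allocate)
{
    unsigned cached[MAX_BLOCKS];
    size_t used = 0;
    Counts c = {0, 0};

    for (size_t i = 0; i < n; i++) {
        bool hit = false;
        for (size_t j = 0; j < used; j++)
            if (cached[j] == ops[i].addr) { hit = true; break; }

        if (hit) { c.hits++; continue; }

        c.misses++;
        /* Read misses always allocate; write misses only with write allocate. */
        if (!ops[i].is_write || write_allocate) {
            assert(used < MAX_BLOCKS);
            cached[used++] = ops[i].addr;
        }
    }
    return c;
}
```

With no-write allocate, the writes to address 100 never bring the block in, so only the write to 200 (after the read miss) hits: 4 misses and 1 hit. With write allocate, the first write to 100 and the read of 200 miss and everything else hits: 2 misses and 3 hits.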
Exit Quiz
1. In memory hierarchy top layer is occupied by the
a. Fastest and the most expensive memory
b. Slowest and the most expensive memory
c. High bandwidth and the fastest memory
d. Slowest and the least expensive memory
2. The memory technology suitable for L2 cache is
a. EPROM
b. DRAM
c. SRAM
d. Flash
3. Server’s memory requirements are different from desktop’s memory requirements
a. True
b. False
4. Hit Rate is (1 – Miss Rate)
a. True
b. False
5. Gordon Moore’s law states that the transistor density
a. Triples every year
b. Doubles every two years
c. Doubles every 1.5 years
d. Doubles every year
Improving Cache Performance
• Two options
– By Reducing Miss Penalty
– By Reducing Miss Rate
Reducing Cache Miss Penalty
1. Multilevel caches
2. Critical word first & early restart
3. Priority for Read misses over Write
4. Merging Write Buffers
5. Victim Cache
Reducing Miss Penalty - Multilevel caches (1)
[Figure: three organizations between CPU and main memory (100 ns, 128M): a single faster but smaller cache, a single slower but larger cache, and Cache 1 plus Cache 2 together (faster and larger). Values shown: 2 ns, 16K; 10 ns, 512K; ? ns, 528K; miss rates 0.02 and 0.04]
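Reducing Miss Rate & Global Miss Rate (1)
[Figure: CPU, Cache 1 (2 ns, 16K, miss rate 0.05), Cache 2 (10 ns, 512K, miss rate 0.02), main memory (100 ns, 128M); combined: faster and larger (2.6 ns, 528K)]
$$\text{Average memory access time}_{eff} = \text{Hit Time}_{L1} + \text{Miss Rate}_{L1} \times \text{Miss Penalty}_{L1}$$
$$\text{Miss Penalty}_{L1} = \text{Hit Time}_{L2} + \text{Miss Rate}_{L2} \times \text{Miss Penalty}_{L2}$$
$$\text{Global Miss Rate} = \text{Miss Rate}_{L1} \times \text{Miss Rate}_{L2}$$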
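As a worked check of these formulas, the figure's numbers can be plugged in. The assignment of values is an assumption read off the figure: Cache 1 has a 2 ns hit time and a 0.05 miss rate, Cache 2 has a 10 ns hit time and a 0.02 miss rate, and main memory takes 100 ns.

```latex
\begin{align*}
\text{Miss Penalty}_{L1} &= \text{Hit Time}_{L2} + \text{Miss Rate}_{L2} \times \text{Miss Penalty}_{L2}
  = 10 + 0.02 \times 100 = 12 \text{ ns} \\
\text{Average memory access time}_{eff} &= \text{Hit Time}_{L1} + \text{Miss Rate}_{L1} \times \text{Miss Penalty}_{L1}
  = 2 + 0.05 \times 12 = 2.6 \text{ ns} \\
\text{Global Miss Rate} &= \text{Miss Rate}_{L1} \times \text{Miss Rate}_{L2}
  = 0.05 \times 0.02 = 0.001
\end{align*}
```

The 2.6 ns result matches the effective access time shown for the combined 528K cache, which supports this reading of the figure.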
Reducing Miss Penalty – Critical Word First and Early Restart (2)
• Critical word first
• Early restart
• Let's discuss the following:
– How is Miss Penalty improved?
– Under what conditions is this useful?
– Is it worth going through all the trouble?
Reducing Miss Penalty – Priority To Read Miss (3)
• Read Miss takes priority
• Write to Memory is put on hold
• Let's discuss the following:
– Is there a problem?
– How is Miss Penalty improved?
– How is this done in Write-through?
– How is this done in Write-back?
– Under what conditions is this useful?
– Is it worth going through all the trouble?
Reducing Miss Penalty – Merging Write Buffer (4)
• Write-through is sent to a write buffer
• Buffer & Write possible options:
1. Buffer is empty – write address and data to buffer
2. Buffer is not full – write or merge data
3. Buffer is full – stall until buffer is written to memory.
– Note: A block write to memory is more efficient than multiple writes.
• Let's discuss the following:
– How many different ways do we optimize Miss Penalty in this scheme?
– Explain Fig 5.12 in the book
– When there is no block write (like I/O device), no merging
Reducing Miss Penalty – Fully Associative Victim Cache (5)
• Key word is Recycle
• Another implicit key word is complex (hardware)
• Aimed at handling conflict misses
• 1 to 6 victim caches are ideal
Reducing Misses
• Misses due to 3Cs: Compulsory, Capacity, and Conflict
• Reducing Misses Techniques
1. Larger Lines/Blocks
2. Larger Caches
3. Higher Associativity
4. Way Prediction and Pseudoassociative caches
5. Compiler Techniques
Notes: copy tables 5.14 and 5.17 to two slides
Reducing Cache Misses – Larger Block Size (1)
• Larger block improves number of misses up to a point!
• Points to discuss
– Why does the number of misses start increasing for larger blocks?
– Low latency encourages smaller blocks and higher latency encourages larger blocks.
– Low bandwidth encourages smaller blocks and higher bandwidth encourages larger blocks.
– Effect on Miss Penalty with larger blocks
• What is the other name for 1-way set associative mapping?
Reducing Cache Misses – Larger Caches (2)
• Obviously!
Reducing Cache Misses – Higher Associativity (3)
• Miss rate improves with associativity
• Points to discuss
– Complexity of set associative mapping versus the improvement.
– Rule of thumb 1: 8-way set associative is, for practical purposes, as good as fully associative.
– Rule of thumb 2 (2:1 rule): a direct-mapped cache of size N has about the same miss rate as a 2-way set associative cache of size N/2.
– Greater associativity increases Hit time
Reducing Cache Misses – Pseudo-Associative Caches (4)
• Let's understand this using an 8-way set associative Instruction Cache.
• Instructions exhibit better locality.
• Each access to the instruction cache normally needs 8 comparisons.
• Using locality, predict the next block to access in the set to reduce the number of comparisons.
• Effect on Hit-time and Cache Miss
[Figure: the ith set of the cache memory, showing its blocks and their tags]
Reducing Hit Time – Small and Simple Cache (1)
• Tag comparison is complex, specifically in associative mapping
• Tag checking can be overlapped with data transfer to reduce the hit time.
• Tag checking in CPU chip, cache in a chip by itself. Provides better hit time and larger cache capacity
Reducing Hit Time – Virtual Cache (2)
• Addressing
– VA -> Physical Address -> cache
• Skip two levels, VA maps to cache
• Problems:
– No page boundary checks
– Building direct mapping between VA and Cache for every process is not easy.
Reducing Hit Time – Pipelined Cache (3)
• Increases the cache throughput not access time
Reducing Hit Time – Trace Cache
• Increasing instruction level parallelism
• Instead of 4 consecutive locations of cache, load the next 4 instructions required by the CPU using trace.
• Folding branch prediction into cache
Improving Parallelism with Cache
• Cache Hit under Cache Miss
• Cache Miss under Cache Miss
• Both require non-blocking cache, out of order execution CPU
• Other methods to improve performance using parallelism:
– Hardware pre-fetching of instruction
– Software (compiler) pre-fetching of data
Preparation
Write-back
• Complicates cache coherency problem
• Low memory access overhead
• Better cache access time than write-through
• Requires higher memory bandwidth if blocking
Write Through
• Simplifies cache coherency problem
• High overhead
• If cache is blocking then higher access time
• Requires lower memory bandwidth
Note: No need to describe these two policies. Write-through does not buy anything extra for a single processor system due to absence of cache coherency.
CPU Execution Time
$$\text{CPU Execution Time} = (\text{CPU Cycles} + \text{Stalled Cycles}) \times \text{Cycle Time}$$
• Stalled Cycles = misses × penalty
• Misses are given either as misses/1000 instructions or as misses/memory access (AKA miss rate).
• Instruction Count, Cycles per Instruction, and miss figures are also required to compute CPU execution time.
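Average Access Time with Cache
$$\text{Average Access Time} = \text{Hit Time} + \text{Miss Rate} \times \text{Penalty}$$
Multi-level Cache
$$\text{Avg Access Time} = \text{Hit Time}_{L1} + \text{Miss Rate}_{L1} \times \text{Penalty}_{L1}$$
$$\text{Penalty}_{L1} = \text{Hit Time}_{L2} + \text{Miss Rate}_{L2} \times \text{Penalty}_{L2}$$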
Addressing
[Figure: address from the CPU split into Tag, Set, and Word fields; main memory words 0 to 511 form blocks 0 to 127; cache lines 0 to 3 (m) are grouped into Set 0 and Set 1. Steps: 1. Select, 2. Copy, 3. Compare, 4. Hit or Miss (a miss goes to main memory), 5. Load]
Assignment I – Due same day next week
• Mapping functions
• Replacement algorithms
• Write policies
• Write Miss policies
• Split Cache versus Unified Cache
• Primary Cache versus Secondary Cache
• Compiler cache optimization techniques with examples
Assignment II - Due same day next week
• Multilevel Cache
• Cache Inclusion/Exclusion Property
• Thumb rules of cache
• Compiler pre-fetch
• Multi-level Caching + one another Miss Penalty Optimization technique
• Two miss Rate Optimization Techniques
Assignment III - Due 2nd class of next week
• All odd numbered problems from cache module of your text book.
Assignment IV - Due 2nd class of next week
• All even numbered problems from cache module of your text book.
CPU Execution Time & Average Access Time
[Figure: CPU, Cache (1 CC), Memory (100 CC)]
With Multi-level Cache
[Figure: CPU, first cache (1 CC), second cache (10 CC), Memory (100 CC)]
Memory Hierarchy
Main Memory
Main Memory
• Module Objective
– To understand Main memory latency and bandwidth
– Techniques to improve latency and bandwidth
Memory Hierarchy & Cache
Main Memory – Cache – I/O
[Figure: CPU and cache (0.25 ns), main memory (250 ns), bus adapter, storage & I/O devices (2,500,000 ns!)]
• Cache prefers low latency main memory
• I/O & Multiprocessors prefer high bandwidth/throughput main memory
Main Memory Access Time
[Figure: CPU and cache connected to main memory by an address bus (4 cc) and a data bus (4 cc); the memory access itself takes 56 cc]
• Access time per word = 4 + 56 + 4 = 64 CC
• One word is 8 bytes
• Bandwidth is 1 bit/CC
• CC – Clock Cycle
Improving Memory Performance
• Improving Latency (time to access 1 memory unit - word)
• Improving bandwidth (bytes accessed in unit time)
Improving Memory Bandwidth
[Figure: three memory organizations.
Simple Design: CPU, cache, and memory connected by 64-bit paths.
Wider Bus: 64 bits between CPU and cache, 4x64 bits between cache and memory.
Interleaved: 64-bit paths between CPU, cache, and four memory banks.]
• Bandwidth, Latency, Penalty
• Interleaving Factor
• Address allocation with Interleaving
Bank 0: 0, 4, 8, 12
Bank 1: 1, 5, 9, 13
Bank 2: 2, 6, 10, 14
Bank 3: 3, 7, 11, 15
• Cache block: 4 words
• One word: 8 bytes
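The three organizations can be compared with a short C sketch of the miss penalty for one cache block. The timing values come from the Main Memory Access Time slide (4 CC to send the address, 56 CC to access memory, 4 CC to transfer a word); the function names, the `MemTiming` type, and the assumption that the interleaved design has at least as many banks as words in a block are illustrative choices, not part of the slides.

```c
#include <assert.h>

/* Per-access timing in clock cycles. */
typedef struct {
    unsigned addr_cc;    /* send the address */
    unsigned access_cc;  /* access the memory */
    unsigned xfer_cc;    /* transfer one bus-width of data */
} MemTiming;

/* Simple design: every word pays the full address + access + transfer cost. */
unsigned penalty_simple(MemTiming t, unsigned words)
{
    return words * (t.addr_cc + t.access_cc + t.xfer_cc);
}

/* Wider bus: bus_words words move per trip, so fewer full trips are needed. */
unsigned penalty_wide(MemTiming t, unsigned words, unsigned bus_words)
{
    assert(bus_words > 0);
    unsigned trips = (words + bus_words - 1) / bus_words;
    return trips * (t.addr_cc + t.access_cc + t.xfer_cc);
}

/*
 * Interleaved: all banks are accessed in parallel (assumes banks >= words),
 * then the words are transferred one at a time over the narrow bus.
 */
unsigned penalty_interleaved(MemTiming t, unsigned words)
{
    return t.addr_cc + t.access_cc + words * t.xfer_cc;
}

/* Word-interleaved address allocation: consecutive words go to consecutive banks. */
unsigned bank_of(unsigned word_addr, unsigned banks)
{
    assert(banks > 0);
    return word_addr % banks;
}
```

For a 4-word block with these timings, the sketch gives 4 × 64 = 256 CC for the simple design, 64 CC for a 4-word-wide bus, and 4 + 56 + 4 × 4 = 76 CC for four interleaved banks. `bank_of(9, 4)` returns 1, matching the address allocation listed for Bank 1.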