EENG 449bG/CPSC 439bG Computer Systems
Lecture 17: Memory Hierarchy Design Part I
Prof. Andreas Savvides, Spring 2004 (April 1, 2004)
http://www.eng.yale.edu/courses/eeng449bG




Announcements

• HWK 2 is out:
  – Text problems 3.6, 4.2, 4.13, 5.3, 5.4, 5.23
  – Due date: April 20th
• Midterm 2 on April 22
• 2 sets of final project presentations:
  – Set 1 – Friday, April 23
  – Set 2 – Tuesday, April 27
• Final project reports due May 4, midnight


Who Cares About the Memory Hierarchy? The CPU-DRAM Gap

• 1980: no cache in µproc; 1995: 2-level cache on chip (1989: first Intel µproc with an on-chip cache)

[Figure: performance (log scale, 1 to 1000) vs. year, 1980 to 2000. CPU performance ("Moore's Law") grows ~60%/yr while DRAM performance ("Less' Law?") grows ~7%/yr, so the processor-memory performance gap grows ~50%/yr.]


Review of Caches

• Cache is the name given to the first level of the memory hierarchy encountered one the address leaves the CPU

– Cache hit / cache miss when data is found / not found

– Block – a fixed size collection of data containing the requested word

– Spatial / temporal localities– Latency – time to retrieve the first word in the

block– Bandwidth – time to retrieve the rest of the block– Address space is broken into fixed-size blocks

called pages– Page fault – when CPU references something that

is not on cache or main memory


Generations of Microprocessors

• Time of a full cache miss in instructions executed:
  – 1st Alpha: 340 ns / 5.0 ns = 68 clks × 2 instr/clk, or 136 instructions
  – 2nd Alpha: 266 ns / 3.3 ns = 80 clks × 4 instr/clk, or 320 instructions
  – 3rd Alpha: 180 ns / 1.7 ns = 108 clks × 6 instr/clk, or 648 instructions
• 1/2X latency × 3X clock rate × 3X instr/clock ⇒ ~5X more instructions lost per miss


Processor-Memory Performance Gap “Tax”

Processor % Area %Transistors

(­cost) (­power)• Alpha 21164 37% 77%• StrongArm SA110 61% 94%• Pentium Pro 64% 88%

– 2 dies per package: Proc/I$/D$ + L2$

• Caches have no “inherent value”, only try to close performance gap


What is a cache?
• Small, fast storage used to improve average access time to slow memory
• Exploits spatial and temporal locality
• In computer architecture, almost everything is a cache!
  – Registers are "a cache" on variables – software managed
  – First-level cache is a cache on second-level cache
  – Second-level cache is a cache on memory
  – Memory is a cache on disk (virtual memory)
  – TLB is a cache on the page table
  – Branch prediction is a cache on prediction information?

[Figure: memory hierarchy – Proc/Regs → L1-Cache → L2-Cache → Memory → Disk, Tape, etc.; levels get bigger going down the hierarchy and faster going up.]


Review: Cache Performance

• Miss-oriented approach to memory access:

  CPUtime = IC × (CPI_Execution + MemAccess/Inst × MissRate × MissPenalty) × CycleTime

  CPUtime = IC × (CPI_Execution + MemMisses/Inst × MissPenalty) × CycleTime

  – CPI_Execution includes ALU and memory instructions

• Separating out the memory component entirely:

  CPUtime = IC × (AluOps/Inst × CPI_AluOps + MemAccess/Inst × AMAT) × CycleTime

  – AMAT = Average Memory Access Time
  – CPI_AluOps does not include memory instructions

  AMAT = HitTime + MissRate × MissPenalty
       = (HitTime_Inst + MissRate_Inst × MissPenalty_Inst)
       + (HitTime_Data + MissRate_Data × MissPenalty_Data)
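The formulas above can be sketched numerically. A minimal Python sketch follows; the parameter values in the example call are made-up placeholders, not figures from the lecture:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time in cycles: AMAT = HitTime + MissRate * MissPenalty."""
    return hit_time + miss_rate * miss_penalty

def cpu_time_miss_oriented(ic, cpi_exec, mem_acc_per_inst, miss_rate, miss_penalty, cycle_time):
    """Miss-oriented form:
    CPUtime = IC * (CPI_exec + MemAccess/Inst * MissRate * MissPenalty) * CycleTime."""
    return ic * (cpi_exec + mem_acc_per_inst * miss_rate * miss_penalty) * cycle_time

# Hypothetical machine: 1.0 base CPI, 1.5 accesses/inst, 2% miss rate,
# 100-cycle penalty, 1e9 instructions at a 5 ns cycle time
print(round(amat(1, 0.02, 100), 2))  # 3.0 (cycles)
print(round(cpu_time_miss_oriented(1e9, 1.0, 1.5, 0.02, 100, 5e-9), 2))  # 20.0 (seconds)
```

Note how the miss term dominates: three of the four CPI cycles per instruction come from misses, not from execution.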


Impact on Performance
• Suppose a processor executes at
  – Clock rate = 200 MHz (5 ns per cycle), ideal (no misses) CPI = 1.1
  – 50% arith/logic, 30% ld/st, 20% control
• Suppose that 10% of memory operations get a 50-cycle miss penalty
• Suppose that 1% of instructions get the same miss penalty
• CPI = ideal CPI + average stalls per instruction
      = 1.1 (cycles/ins)
      + [0.30 (DataMops/ins) × 0.10 (miss/DataMop) × 50 (cycles/miss)]
      + [1 (InstMop/ins) × 0.01 (miss/InstMop) × 50 (cycles/miss)]
      = (1.1 + 1.5 + 0.5) cycles/ins = 3.1
• About 65% of the time (2.0 of 3.1 cycles) the processor is stalled waiting for memory!
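The CPI arithmetic above can be checked with a few lines of Python, using the slide's own numbers:

```python
ideal_cpi = 1.1
data_stall = 0.30 * 0.10 * 50   # ld/st fraction * miss rate * penalty = 1.5
inst_stall = 1.0 * 0.01 * 50    # fetches/inst * miss rate * penalty = 0.5

cpi = ideal_cpi + data_stall + inst_stall
stall_fraction = (data_stall + inst_stall) / cpi

print(round(cpi, 2))             # 3.1
print(round(stall_fraction, 3))  # 0.645
```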


Traditional Four Questions for Memory Hierarchy Designers

• Q1: Where can a block be placed in the upper level? (block placement)
  – Fully associative, set associative, direct mapped
• Q2: How is a block found if it is in the upper level? (block identification)
  – Tag/block
• Q3: Which block should be replaced on a miss? (block replacement)
  – Random, LRU
• Q4: What happens on a write? (write strategy)
  – Write back or write through (with write buffer)


Q1: Where can a Block Be Placed in a Cache?


Set Associativity

• Direct mapped = one-way set associative
• Fully associative = set associative with a single set
• Most popular cache configurations in today's processors:
  – Direct mapped, 2-way set associative, 4-way set associative
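The placement rule can be made concrete. This sketch (the block address and cache sizes are made-up examples) shows how one block maps to a set under each organization:

```python
def set_index(block_addr, num_blocks, ways):
    """Return the set a block address maps to: (block address) mod (number of sets).
    Direct mapped: ways = 1. Fully associative: ways = num_blocks (one set)."""
    num_sets = num_blocks // ways
    return block_addr % num_sets

block = 12  # hypothetical block address, cache holds 8 blocks
print(set_index(block, 8, 1))  # direct mapped: 12 mod 8 = 4
print(set_index(block, 8, 2))  # 2-way: 12 mod 4 = 0
print(set_index(block, 8, 8))  # fully associative: 12 mod 1 = 0
```

With one set (fully associative), every block can go anywhere, so the "index" is always 0 and identification relies entirely on the tag.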


Q2: How is a block found if it is in the cache?

[Figure: the address is split into Tag | Index | Block offset. The block offset selects the desired data from the block, the index selects the set, and the tag is compared against the stored tags for a hit.]

• If the cache size remains the same, increasing associativity increases the number of blocks per set => the index shrinks and the tag grows


Q3: Which Block Should be Replaced on a Cache Miss?

• Direct-mapped cache
  – No choice – a single block is checked for a hit; if there is a miss, data is fetched into that block
• Fully associative and set associative
  – Random
  – Least Recently Used (LRU) – exploits locality principles
  – First In, First Out (FIFO) – approximates LRU


Q4: What Happens on a Write?

• Cache accesses are dominated by reads
• E.g., on MIPS, 10% stores and 37% loads ⇒ writes are 10/(10+37) ≈ 21% of cache traffic
• Writes are much slower than reads
  – On a read, the block can be read from the cache at the same time the tag is read and compared
    » If the read is a hit, the data is passed to the CPU immediately
  – Writing cannot begin unless the address is a hit
• Write through – information is written to both the cache and the lower-level memory
• Write back – information is written only to the cache, and to memory only when the block is replaced
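The traffic difference between the two policies can be illustrated with a toy single-block count (illustrative only, not an example from the lecture): repeated write hits to one cached block cost one memory write each under write through, but only one total under write back:

```python
def memory_writes(write_hits_to_block, policy):
    """Count lower-level memory writes caused by repeated write hits to one cached block.
    write-through: every write also goes to memory.
    write-back: the dirty block is written once, when it is eventually replaced."""
    if policy == "write-through":
        return write_hits_to_block
    if policy == "write-back":
        return 1 if write_hits_to_block > 0 else 0
    raise ValueError(f"unknown policy: {policy}")

print(memory_writes(10, "write-through"))  # 10
print(memory_writes(10, "write-back"))     # 1
```

This is why write back saves memory bandwidth when the same block is written many times, at the cost of a dirty bit and a slower block replacement.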


Example: Alpha 21264 Data Cache

• 2-way set associative
• Write back
• Each block has 64 bytes of data
  – The offset points to the data we want
  – Total cache size is 65,536 bytes
• Index: 2^9 = 512 sets, pointing to the block
• Tag comparison determines if we have a hit
• A victim buffer helps with write back


Address Breakdown

• The physical address is 44 bits wide: a 38-bit block address and a 6-bit block offset
• Calculating the cache index size:

  2^Index = Cache size / (Block size × Set associativity) = 65,536 / (64 × 2) = 512 ⇒ Index = 9 bits

• Blocks are 64 bytes, so the offset needs 6 bits
• Tag size = 38 – 9 = 29 bits
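The bit-field arithmetic above can be reproduced directly for the 21264 data cache parameters:

```python
import math

cache_size = 65_536   # bytes (Alpha 21264 D-cache)
block_size = 64       # bytes
ways = 2
phys_addr_bits = 44

offset_bits = int(math.log2(block_size))                        # 6
index_bits = int(math.log2(cache_size // (block_size * ways)))  # log2(512) = 9
tag_bits = phys_addr_bits - offset_bits - index_bits            # 44 - 6 - 9 = 29

print(offset_bits, index_bits, tag_bits)  # 6 9 29
```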


Unified vs Split Caches
• Unified vs separate instruction and data caches

[Figure: split organization – Proc with I-Cache-1 and D-Cache-1, both backed by Unified Cache-2; unified organization – Proc with Unified Cache-1 backed by Unified Cache-2.]

• Example:
  – 16KB I&D: inst miss rate = 0.64%, data miss rate = 6.47%
  – 32KB unified: aggregate miss rate = 1.99%
  – Using miss rate alone in the evaluation may be misleading!
• Which is better (ignoring the L2 cache)?
  – Assume 25% data ops, 75% of accesses from instructions (1.0/1.33)
  – Hit time = 1, miss time = 50
  – Note that a data hit has 1 extra stall for the unified cache (only one port)

  AMAT_Harvard = 75% × (1 + 0.64% × 50) + 25% × (1 + 6.47% × 50) = 2.05
  AMAT_Unified = 75% × (1 + 1.99% × 50) + 25% × (1 + 1 + 1.99% × 50) = 2.24
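The AMAT comparison works out as follows, using the same numbers as the slide:

```python
hit, penalty = 1, 50
f_inst, f_data = 0.75, 0.25

# Split (Harvard): separate I- and D-caches, each with its own miss rate
amat_split = f_inst * (hit + 0.0064 * penalty) + f_data * (hit + 0.0647 * penalty)

# Unified: a lower miss rate, but one port means a data access
# stalls one extra cycle behind the instruction fetch
amat_unified = f_inst * (hit + 0.0199 * penalty) + f_data * (hit + 1 + 0.0199 * penalty)

print(round(amat_split, 2))  # 2.05
print(round(amat_unified, 2))
```

Despite its much worse aggregate miss rate, the split design wins here because of the structural hazard on the unified cache's single port.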


Impact of Caches on Performance

• Consider an in-order execution computer
  – Cache miss penalty 100 clock cycles, CPI = 1
  – Average miss rate 2% and an average of 1.5 memory references per instruction
  – Average of 30 cache misses per 1000 instructions

Performance with cache misses:

  CPU time = IC × (CPI_execution + Memory stall cycles / Instruction) × Clock cycle time

  CPU time with cache = IC × (1.0 + (30/1000) × 100) × Clock cycle time = IC × 4.00 × Clock cycle time


Impact of Caches on Performance

• Calculating the performance using the miss rate:

  CPU time = IC × (CPI_execution + Miss rate × Memory accesses / Instruction × Miss penalty) × Clock cycle time

  CPU time with cache = IC × (1.0 + 1.5 × 2% × 100) × Clock cycle time = IC × 4.00 × Clock cycle time

• 4x increase in CPU time over a "perfect cache"
• No cache: 1.0 + 100 × 1.5 = 151 – a factor of ~40 compared to a system with a cache
• Minimizing memory accesses does not always imply a reduction in CPU time
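Both formulations above give the same answer, which a few lines of Python confirm:

```python
cpi_exec, penalty = 1.0, 100
refs_per_inst, miss_rate = 1.5, 0.02

cpi_via_misses = cpi_exec + (30 / 1000) * penalty              # 30 misses per 1000 instructions
cpi_via_rate = cpi_exec + refs_per_inst * miss_rate * penalty  # 1.5 refs/inst * 2% * 100
cpi_no_cache = cpi_exec + refs_per_inst * penalty              # every reference pays the full penalty

print(round(cpi_via_misses, 2), round(cpi_via_rate, 2))  # 4.0 4.0
print(cpi_no_cache)                                      # 151.0
print(round(cpi_no_cache / cpi_via_rate, 1))             # factor of ~38, i.e. roughly 40
```

The two forms agree because 1.5 refs/inst × 2% miss rate = 30 misses per 1000 instructions.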


How to Improve Cache Performance?

  AMAT = HitTime + MissRate × MissPenalty

Four main categories of optimizations:
1. Reducing miss penalty
   – multilevel caches, critical word first, read miss before write miss, merging write buffers, and victim caches
2. Reducing miss rate
   – larger block size, larger cache size, higher associativity, way prediction and pseudoassociativity, and compiler optimizations
3. Reducing the miss penalty or miss rate via parallelism
   – non-blocking caches, hardware prefetching, and compiler prefetching
4. Reducing the time to hit in the cache
   – small and simple caches, avoiding address translation, pipelined cache access


Where do misses come from?
• Classifying misses: the 3 Cs
  – Compulsory – the first access to a block cannot be in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (Misses in even an infinite cache.)
  – Capacity – if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a fully associative, size-X cache.)
  – Conflict – if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative, size-X cache.)
• 4th "C":
  – Coherence – misses caused by cache coherence.

3Cs Absolute Miss Rate (SPEC92)

[Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and 8-way associativity. Conflict misses shrink with associativity, capacity misses shrink with cache size, and compulsory misses form a small constant floor.]

Cache Size

• Old rule of thumb: 2x size => 25% cut in miss rate
• What does it reduce?

[Figure: the same miss-rate-per-type plot vs. cache size (1 KB to 128 KB) for 1-way through 8-way associativity; growing the cache cuts capacity (and conflict) misses but not compulsory misses.]


Cache Organization?

• Assume total cache size is not changed. What happens if we:
  1) Change block size
  2) Change associativity
  3) Change compiler
• Which of the 3Cs is obviously affected?

Larger Block Size (fixed cache size & associativity)

[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for 1K, 4K, 16K, 64K, and 256K caches. Larger blocks reduce compulsory misses, but past a point they increase conflict misses, especially in small caches.]

What else drives up block size?

Associativity

[Figure: miss rate per type vs. cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and 8-way associativity; higher associativity reduces conflict misses while leaving capacity and compulsory misses unchanged.]

3Cs Relative Miss Rate

[Figure: miss rate per type, normalized to 100%, vs. cache size (1 KB to 128 KB) for 1-way through 8-way associativity; the conflict share shrinks with associativity, while the compulsory share grows relative to the others as total misses drop.]

Flaws: for fixed block size
Good: insight => invention


Associativity vs Cycle Time

• Beware: execution time is the only final measure!
• Why is cycle time tied to hit time?
• Will clock cycle time increase?
  – Hill [1988] suggested hit time for 2-way vs. 1-way: external cache +10%, internal +2%
  – suggested big and dumb caches

[Figure: effective cycle time vs. associativity – Przybylski, ISCA]


Next Time

• Cache tradeoffs for performance
• Reducing miss penalties
• Reducing hit times