Memory Hierarchy
– Reasons
– Virtual Memory
Cache Memory
Translation Lookaside Buffer
– Address translation
– Demand paging

Why Care About the Memory Hierarchy?
[Figure: the processor-DRAM memory gap (latency), plotted as performance versus time, 1980-2000, on a logarithmic scale. "Moore's Law": processor performance grows about 60% per year (2x every 1.5 years) while DRAM performance grows about 9% per year (2x every 10 years), so the processor-memory performance gap grows about 50% per year.]

DRAMs over Time
DRAM generation (1st gen. sample):   '84     '87    '90    '93    '96     '99
Memory size:                         1 Mb    4 Mb   16 Mb  64 Mb  256 Mb  1 Gb
Die size (mm2):                      55      85     130    200    300     450
Memory area (mm2):                   30      47     72     110    165     250
Memory cell area (µm2):              28.84   11.1   4.26   1.64   0.61    0.23
(from Kazuhiro Sakashita, Mitsubishi)

Recap:
Two different types of locality:
– Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
– Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon.
By taking advantage of the principle of locality:
– Present the user with as much memory as is available in the cheapest technology.
– Provide access at the speed offered by the fastest technology.
DRAM is slow but cheap and dense:
– Good choice for presenting the user with a BIG memory system.
SRAM is fast but expensive and not very dense:
– Good choice for providing the user FAST access time.
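
The loop below is a minimal sketch of where the two kinds of locality come from in ordinary code (my own illustration, not from the slides):

```python
# Illustrative only: temporal and spatial locality in a simple array sum.

data = list(range(1024))        # consecutive elements -> spatial locality
total = 0                       # 'total' is touched every iteration -> temporal locality

for i in range(len(data)):      # the loop instructions are re-fetched -> temporal locality
    total += data[i]            # data[i], data[i+1], ... lie at neighbouring addresses -> spatial locality
```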

Memory Hierarchy of a Modern Computer
[Figure: the processor (control, datapath, registers) is backed by an on-chip cache, a second-level cache (SRAM), main memory (DRAM), secondary storage (disk) and tertiary storage (disk/tape). Speed falls from about 1 ns at the registers, X ns for the on-chip cache and 10 ns for the second-level cache, to 100 ns for DRAM, 10 ms for disk and 10 s for tape, while size grows from about 100 bytes at the registers through K..64K and M bytes in the caches to G bytes in secondary storage and T bytes in tertiary storage.]

Levels of the Memory Hierarchy
[Figure: the staging/transfer units between levels, getting faster towards the top and larger towards the bottom:
Registers <-> Cache:   instruction operands, moved by the program, 1-8 bytes
Cache <-> Memory:      blocks, moved by the cache controller, 8-128 bytes
Memory <-> Disk:       pages, moved by the OS, 512-4K bytes
Disk <-> Tape:         files, moved by the user/operator, Mbytes]

The Art of Memory System Design
The processor issues a reference stream to the memory system: <op,addr>, <op,addr>, <op,addr>, <op,addr>, ...
op: i-fetch, read, write
Optimize the memory system organization (the cache "$" in front of MEM) to minimize the average memory access time for typical workloads or benchmark programs.

Virtual Memory System Design
– size of the information blocks that are transferred from secondary to main storage (M)
– if a block of information is brought into M and M is full, then some region of M must be released to make room for the new block --> replacement policy
– which region of M is to hold the new block --> placement policy
– missing items are fetched from secondary memory only on the occurrence of a fault --> demand load policy
Paging organization: the virtual and physical address spaces are partitioned into blocks of equal size (pages).

Address Map
V = {0, 1, ..., n - 1}   virtual address space
M = {0, 1, ..., m - 1}   physical address space, with n > m
MAP: V --> M ∪ {0}       address mapping function
MAP(a) = a'  if data at virtual address a is present at physical address a' in M
MAP(a) = 0   if data at virtual address a is not present in M
[Figure: the processor issues virtual address a in name space V; the address translation mechanism either delivers the physical address a' to main memory, or raises a missing item fault (a' = 0) so the fault handler has the OS transfer the item from secondary memory to main memory.]
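
As a sketch of the definition above (my own Python, with a plain dictionary standing in for the resident set), MAP is just a partial lookup that returns 0 on a missing item:

```python
# Minimal sketch of MAP: V -> M ∪ {0}. 'resident' stands in for the translations
# of items currently in M; names and values are illustrative.

FAULT = 0   # the slide uses 0 to mean "not present in M"

def MAP(a, resident):
    """Return the physical address a' for virtual address a, or 0 (missing item fault)."""
    return resident.get(a, FAULT)

resident = {0x1000: 0x3400, 0x2000: 0x0400}
print(hex(MAP(0x1000, resident)))   # 0x3400 -> present in M
print(MAP(0x5000, resident))        # 0      -> fault; the OS must fetch the item
```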

Paging Organization – Address Mapping
[Figure: the virtual address (V.A.) is split into a virtual page number and a 10-bit displacement. The page number indexes the page table (located in physical memory, found through the Page Table Base Register); each entry holds a Valid bit, Access Rights and a physical page address (PA). The physical memory address is formed from the PA and the displacement with a "+" (actually, concatenation is more likely). With 1K pages the virtual address space holds pages 0..31 (at 0, 1024, ..., 31744) and physical memory holds frames 0..7 (at 0, 1024, ..., 7168); the page is the unit of mapping and also the unit of transfer from virtual to physical memory.]
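
A bit-level sketch of the 1K-page translation in the figure (my own illustration; the page-table contents and the example address are made up):

```python
# Sketch of the 1K-page translation: VA = virtual page number | 10-bit displacement,
# and the physical address is the frame's base address concatenated with the displacement.

PAGE_BITS = 10                       # 1K pages, as on the slide

def translate(va, page_table):
    vpn  = va >> PAGE_BITS           # index into the page table
    disp = va & ((1 << PAGE_BITS) - 1)
    frame = page_table[vpn]          # physical frame number (V / access-rights bits omitted)
    return (frame << PAGE_BITS) | disp   # concatenation rather than addition

pt = [7, 0, 3, 1]                    # toy page table: virtual pages 0..3 -> frames 7, 0, 3, 1
print(hex(translate(0x0C04, pt)))    # virtual page 3, offset 4 -> frame 1 -> 0x404
```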

Address Mapping
[Figure: the MIPS pipeline issues 32-bit virtual instruction and data addresses; CP0 maps them to 24-bit physical addresses. Kernel memory holds one page table per process (page table 1, page table 2, ..., page table n) next to user memory. With user process 2 running, we need page table 2 for the address mapping.]

Translation Lookaside Buffer (TLB)
[Figure: CP0 now holds a TLB; each entry holds D and R bits, the virtual address (page number) and the physical address PA[23:10]. The per-process page tables stay in kernel memory.]
On a TLB hit, the 32-bit virtual address is translated into a 24-bit physical address by hardware.
We never call the Kernel!

So Far, NO GOOD
[Figure: the MIPS pipeline (IM, DE, EX, DM stages) with the TLB in CP0 and the page tables in kernel memory.]
The MIPS pipe is clocked at 50 MHz, so the critical path is 20 ns, and the TLB lookup takes only about 5 ns.
But the RAM takes 60 ns, so it needs 3 cycles per read/write, which STALLS the pipe.
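
The stall count follows directly from the numbers on the slide; a quick check of the arithmetic:

```python
# 50 MHz clock vs. 60 ns RAM: how many cycles does one memory access need?
import math

cycle_ns = 1000 / 50                      # 50 MHz -> 20 ns per cycle (the critical path)
ram_ns = 60                               # main RAM access time
print(math.ceil(ram_ns / cycle_ns))       # 3 -> every RAM access stalls the pipe for 2 extra cycles
```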

Let's put in a Cache
[Figure: a 15 ns cache is placed between the pipeline (IM/DM stages) and the 60 ns RAM; the TLB (5 ns) sits in CP0 and the page tables stay in kernel memory.]
The MIPS pipe is clocked at 50 MHz, so the critical path is 20 ns.
A cache Hit never STALLS the pipe, since the 15 ns cache access (plus the 5 ns TLB lookup) fits within one 20 ns cycle.

Fully Associative Cache
[Figure: the 24-bit PA is split into Tag = PA[23:2] and word offset PA[1:0]. ALL 2^16 cache lines are checked: Cache Hit if PA[23:2] = TAG in any line. 2^16 lines x 4 bytes = 256 KB.]

Fully Associative Cache
Very good hit ratio (nr hits / nr accesses)
But! Too expensive to check all 2^16 Cache lines concurrently
– A comparator for each line! A lot of hardware
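
In software terms the lookup looks like the sketch below (my own illustration; real hardware does all 2^16 tag comparisons in parallel, one comparator per line):

```python
# Sketch of a fully associative lookup (illustrative only).

def fa_lookup(pa, lines):
    """lines: list of (valid, tag, word) entries. Tag is PA[23:2] as on the slide."""
    tag = pa >> 2
    for valid, line_tag, word in lines:          # software stand-in for 2**16 comparators
        if valid and line_tag == tag:
            return word                          # cache hit
    return None                                  # cache miss
```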

Direct Mapped Cache
[Figure: the 24-bit PA is split into Tag = PA[23:18], index = PA[17:2] and word offset PA[1:0]. The index selects ONE cache line: Cache Hit if PA[23:18] = TAG. 2^16 lines x 4 bytes = 256 KB.]

Direct Mapped Cache
Not so good hit ratio
– Each line can hold only certain addresses, less freedom
But! Much cheaper to implement, only one line checked
– Only one comparator
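
The corresponding direct-mapped sketch (same caveats as above; the index picks exactly one line, so only one comparison is needed):

```python
# Sketch of a direct-mapped lookup (illustrative only): the index selects exactly ONE line.

def dm_lookup(pa, lines):
    """lines: list of 2**16 (valid, tag, word) entries, indexed by PA[17:2]."""
    index = (pa >> 2) & 0xFFFF          # PA[17:2], 16 bits -> 2**16 lines
    tag = pa >> 18                      # PA[23:18]
    valid, line_tag, word = lines[index]
    return word if (valid and line_tag == tag) else None   # one comparator
```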

Set Associative Cache
[Figure: the 24-bit PA is split into Tag = PA[23:18-z], index = PA[17-z:2] and word offset PA[1:0]. The index selects ONE set of 2^z lines: Cache Hit if PA[23:18-z] = TAG for some line in the set (a 2^z-way set associative cache). 2^16 lines x 4 bytes = 256 KB.]

Set Associative Cache
Quite good hit ratio
– The number (set) of different addresses for each line is greater than for a directly mapped cache
The larger z, the better the hit ratio, but the more expensive the cache
– 2^z comparators
– Cost-performance tradeoff
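
A generic 2^z-way sketch ties the three organizations together (my own illustration): with z = 0 it behaves like the direct-mapped cache above, and with z = 16 like the fully associative one.

```python
# Sketch of a 2**z-way set-associative lookup (illustrative only).

def sa_lookup(pa, sets, z):
    """sets: list of 2**(16-z) sets, each a list of 2**z (valid, tag, word) lines."""
    index_bits = 16 - z
    index = (pa >> 2) & ((1 << index_bits) - 1)      # PA[17-z:2]
    tag = pa >> (2 + index_bits)                     # PA[23:18-z]
    for valid, line_tag, word in sets[index]:        # 2**z comparators work in parallel
        if valid and line_tag == tag:
            return word
    return None
```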

Cache Miss
A Cache Miss should be handled by the hardware
– If handled by the OS it would be very slow (>> 60 ns)
On a Cache Miss
– Stall the pipe
– Read in new data to the cache
– Release the pipe; now we get a Cache Hit

A Summary on Sources of Cache Misses
Compulsory (cold start or process migration, first reference): first access to a block
– "Cold" fact of life: not a whole lot you can do about it
– Note: if you are going to run "billions" of instructions, Compulsory Misses are insignificant
Conflict (collision):
– Multiple memory locations mapped to the same cache location
– Solution 1: increase cache size
– Solution 2: increase associativity
Capacity:
– Cache cannot contain all blocks accessed by the program
– Solution: increase cache size
Invalidation: other process (e.g., I/O) updates memory

Example: 1 KB Direct Mapped Cache with 32 Byte Blocks
For a 2^N byte cache:
– The uppermost (32 - N) bits are always the Cache Tag
– The lowest M bits are the Byte Select (Block Size = 2^M)
[Figure: the 32-bit address is split into the Cache Tag, bits [31:10] (example: 0x50), stored as part of the cache "state" together with the Valid bit; the Cache Index, bits [9:5] (example: 0x01); and the Byte Select, bits [4:0] (example: 0x00). The data array holds 32 lines of 32 bytes each: Byte 0..31, Byte 32..63, ..., Byte 992..1023.]
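
A quick bit-slicing check of the example (my own code; the address 0x14020 is made up so that it produces exactly the slide's field values):

```python
# Splitting a 32-bit address for the 1 KB, 32-byte-block direct-mapped cache (N = 10, M = 5).

def split(addr, n=10, m=5):
    byte_select = addr & ((1 << m) - 1)              # lowest M bits
    index       = (addr >> m) & ((1 << (n - m)) - 1) # next N - M bits
    tag         = addr >> n                          # uppermost 32 - N bits
    return tag, index, byte_select

print([hex(x) for x in split(0x14020)])              # ['0x50', '0x1', '0x0'] as in the example
```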

Block Size Tradeoff
In general, a larger block size takes advantage of spatial locality, BUT:
– Larger block size means larger miss penalty: it takes a longer time to fill up the block
– If the block size is too big relative to the cache size, the miss rate will go up: too few cache blocks
In general, the Average Access Time is:
TimeAv = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate
[Figure: as the block size grows, the miss penalty increases; the miss rate first falls (exploits spatial locality) and then rises again (too few blocks compromises temporal locality); the average access time therefore has a minimum before the increased miss penalty and miss rate drive it up.]
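
Plugging illustrative numbers into the slide's formula shows how quickly the miss rate dominates (the 20 ns hit time and 60 ns miss penalty are my own assumptions, not values from the lecture):

```python
# The slide's formula with made-up numbers, for illustration only.

def avg_access_time(hit_time, miss_penalty, miss_rate):
    return hit_time * (1 - miss_rate) + miss_penalty * miss_rate

for miss_rate in (0.02, 0.10, 0.30):
    print(miss_rate, avg_access_time(20, 60, miss_rate))
# 0.02 -> 20.8 ns, 0.10 -> 24.0 ns, 0.30 -> 32.0 ns
```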

Extreme Example: single big line
Cache Size = 4 bytes, Block Size = 4 bytes
– Only ONE entry in the cache
If an item is accessed, it is likely that it will be accessed again soon
– But it is unlikely that it will be accessed again immediately!!!
– The next access will likely be a miss again: we continually load data into the cache but discard (force out) it before it is used again
Worst nightmare of a cache designer: the Ping Pong Effect
Conflict Misses are misses caused by:
– Different memory locations mapped to the same cache index
Solution 1: make the cache size bigger
Solution 2: multiple entries for the same Cache Index
[Figure: a single cache line with a Valid bit, Cache Tag and Cache Data bytes 0-3.]

Hierarchy
Small, fast and expensive VS slow, big and inexpensive
[Figure: a 2 GB hard disk, 16 MB of RAM, and 256 KB I- and D-caches. The cache contains copies. What if the copies are changed? INCONSISTENCY!]

Cache Miss, Write Through/Back
To avoid INCONSISTENCY we can Write Through
– Always write data to RAM
– Not so good performance (write 60 ns)
– Therefore, WT is always combined with write buffers so that we don't wait for the lower level memory
Write Back
– Write data to memory only when the cache line is replaced
– We need a Dirty bit (D) for each cache line; the D-bit is set by hardware on a write operation
– Much better performance, but more complex hardware

Write Buffer for Write Through
A Write Buffer is needed between the Cache and Memory
– Processor: writes data into the cache and the write buffer
– Memory controller: writes contents of the buffer to memory
The write buffer is just a FIFO:
– Typical number of entries: 4
– Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
Memory system designer's nightmare:
– Store frequency (w.r.t. time) -> 1 / DRAM write cycle
– Write buffer saturation
[Figure: Processor -> Cache, with a Write Buffer between the processor/cache and DRAM.]

Write Buffer Saturation
Store frequency (w.r.t. time) -> 1 / DRAM write cycle
– If this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row), the store buffer will overflow no matter how big you make it (the CPU Cycle Time <= DRAM Write Cycle Time)
Solutions for write buffer saturation:
– Use a write back cache
– Install a second level (L2) cache
[Figure: Processor -> Cache -> Write Buffer -> L2 Cache -> DRAM.]

Replacement Strategy in Hardware
A Direct Mapped cache selects ONE cache line
– No replacement strategy needed
A Set/Fully Associative cache selects a set of lines, so we need a strategy to select one cache line:
– Random, Round Robin: not so good, spoils the idea of an Associative Cache
– Least Recently Used (move to top strategy): good, but complex and costly for large z
– We could use an approximation (heuristic): Not Recently Used (replace a line not used for a certain time)
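
The "move to top" bookkeeping can be sketched as below (my own illustration; hardware would keep this ordering with a few extra bits per line):

```python
# Sketch of LRU "move to top" bookkeeping for one cache set (illustrative only).
# The most recently used line is kept first; the victim is always the last one.

class LRUSet:
    def __init__(self, ways):
        self.order = list(range(ways))   # line numbers, most recently used first

    def touch(self, line):               # on a hit: move the line to the top
        self.order.remove(line)
        self.order.insert(0, line)

    def victim(self):                    # on a miss: replace the least recently used line
        return self.order[-1]

s = LRUSet(4)
s.touch(2); s.touch(0)
print(s.victim())                        # 3 (never touched, so least recently used)
```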

Sequential RAM Access
Accessing sequential words from RAM is faster than accessing RAM randomly
– Only the lower address bits will change
How could we exploit this?
– Let each Cache Line hold an array of data words
– Give the base address and array size: "Burst Read" the array from RAM to the Cache, and "Burst Write" the array from the Cache to RAM

System Startup, RESET
Random Cache Contents
– We might read incorrect values from the Cache
We need to know if the contents are valid: a V-bit for each cache line
– Let the hardware clear all V-bits on RESET
– Set the V-bit and clear the D-bit for a line copied from RAM to the Cache

Final Cache Model
[Figure: the 24-bit PA is now split into Tag = PA[23:18-z], an index that selects ONE set of 2^z lines, and a word/byte offset PA[1+j:0] for lines of 2^j words. Each line also has a Valid (V) and a Dirty (D) bit. Cache Hit if (PA[23:18-z] = TAG) and V is set for a line in the set; the D bit is set on a Write.]
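
Putting the pieces together, a read/write access against this final model might be sketched as follows (my own illustration, for the one-word-per-line case and the field widths assumed above):

```python
# Sketch of the final cache model (illustrative only). Each line is a dict with
# V (valid), D (dirty), tag and data; a write hit marks the line dirty.

def access(pa, sets, z, write=False, data=None):
    index_bits = 16 - z
    index = (pa >> 2) & ((1 << index_bits) - 1)
    tag = pa >> (2 + index_bits)
    for line in sets[index]:
        if line["V"] and line["tag"] == tag:         # hit: tag matches AND V is set
            if write:
                line["data"], line["D"] = data, 1    # write hit sets the D bit
            return line["data"]
    return None                                      # miss: hardware refills the line
```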

Translation Lookaside Buffer (TLB)
[Figure: as before, CP0 holds the TLB; each entry holds D and R bits, the virtual address (page number) and the physical address PA[23:10], with the per-process page tables in kernel memory.]
On a TLB hit, the 32-bit virtual address is translated into a 24-bit physical address by hardware.
We never call the Kernel!

Virtual Address and a Cache
[Figure: the CPU sends the VA to a translation unit; the resulting PA accesses the cache, and main memory on a miss.]
It takes an extra memory access to translate the VA to a PA.
This makes cache access very expensive, and this is the "innermost loop" that you want to go as fast as possible.
ASIDE: Why access the cache with a PA at all? VA caches have a problem!
– synonym / alias problem: two different virtual addresses map to the same physical address => two different cache entries holding data for the same physical address!
– for updates: we must update all cache entries with the same physical address, or memory becomes inconsistent
– determining this requires significant hardware, essentially an associative lookup on the physical address tags to see if you have multiple hits; or
– a software enforced alias boundary: same lsb of VA & PA > cache size

Translation Look-Aside Buffers
Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped.
TLBs are usually small, typically not more than 128-256 entries even on high end machines. This permits fully associative lookup on these machines. Most mid-range machines use small n-way set associative organizations.
[Figure: translation with a TLB: the CPU presents the VA to the TLB; on a TLB hit the PA goes straight to the cache (and to main memory on a cache miss); on a TLB miss the OS page table supplies the translation.]

Reducing Translation Time
Machines with TLBs go one step further to reduce cycles per cache access:
they overlap the cache access with the TLB access.
This works because the high order bits of the VA are used to look in the TLB while the low order bits are used as the index into the cache.

Overlapped Cache & TLB Access
[Figure: the 32-bit VA is split into a 20-bit virtual page number (associative TLB lookup) and a 12-bit displacement; the 10-bit cache index and 2-bit byte offset come entirely from the displacement, so the 1K-line, 4-bytes-per-line cache can be indexed in parallel with the TLB lookup, and the PA delivered by the TLB is compared with the cache tag.]
IF cache hit AND (cache tag = PA) THEN deliver data to the CPU
ELSE IF [cache miss OR (cache tag ≠ PA)] AND TLB hit THEN access memory with the PA from the TLB
ELSE do the standard VA translation

Problems With Overlapped TLB Access
Overlapped access only works as long as the address bits used to index into the cache do not change as the result of VA translation.
This usually limits things to small caches, large page sizes, or high n-way set associative caches if you want a large cache.
Example: suppose everything is the same except that the cache is increased to 8 K bytes instead of 4 K:
[Figure: with an 8 KB cache the index plus byte offset needs 13 bits (11 + 2), so the top cache-index bit now falls inside the 20-bit virtual page number. This bit is changed by VA translation, but is needed for the cache lookup.]
Solutions: go to 8 K byte page sizes; go to a 2-way set associative cache (two 1K x 4-byte banks); or SW guarantee VA[13]=PA[13]
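
A small check of why the 8 KB cache breaks the overlap (my own code, assuming a direct-mapped cache with 4-byte lines and 4 KB pages as in the example):

```python
# All cache-index (and offset) bits must lie inside the untranslated page offset.
import math

def overlap_ok(cache_bytes, line_bytes=4, page_bytes=4096):
    index_plus_offset = int(math.log2(cache_bytes // line_bytes)) + int(math.log2(line_bytes))
    page_offset_bits  = int(math.log2(page_bytes))
    return index_plus_offset <= page_offset_bits

print(overlap_ok(4 * 1024))   # True  : 10 + 2 = 12 bits, all inside the page offset
print(overlap_ok(8 * 1024))   # False : 11 + 2 = 13 bits, one bit comes from the page number
```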

Startup a User process
[Figure: the TLB starts with all V-bits cleared; the new page table in kernel memory marks each page as Instruction (I), global Data (D) or Stack (S), with the Resident (R) and Dirty (D) bits cleared, and the page contents are placed on the hard disk.]
Allocate Stack pages and make a Page Table:
– Set the Instruction (I), Global Data (D) and Stack (S) pages
– Clear the Resident (R) and Dirty (D) bits
Clear the V-bits in the TLB

Demand Paging
IM Stage: we get a TLB Miss and a Page Fault (page 0 not resident)
– The Page Table (kernel memory) holds the HD address for page 0 (P0)
– Read the page into RAM page X and update PA[23:10] in the Page Table
– Update the TLB: set V, clear D, and write the 22-bit Page # and PA[23:10]
Restart the failing instruction: TLB hit!
[Figure: page 0 now resides in RAM at frame X; its TLB entry holds V = 1, D = 0, page number 00...0 and physical address XX...X00..0.]
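
The fault-handling sequence on the slide can be sketched as follows (my own Python; the page table, TLB, RAM and disk are toy stand-ins):

```python
# Sketch of the demand-paging sequence (data structures are illustrative only).

def handle_page_fault(vpn, page_table, tlb, ram, disk, free_frame):
    entry = page_table[vpn]
    ram[free_frame] = disk[entry["hd_addr"]]             # read the page from HD into RAM page X
    entry["resident"], entry["frame"] = 1, free_frame    # update PA[23:10] in the Page Table
    tlb[vpn] = {"V": 1, "D": 0, "frame": free_frame}     # update the TLB: set V, clear D, page #, PA
    # the failing instruction is then restarted and now gets a TLB hit

page_table = [{"hd_addr": "hd:p0", "resident": 0, "frame": None}]
disk = {"hd:p0": b"...page 0 contents..."}
ram, tlb = [None] * 8, {}
handle_page_fault(0, page_table, tlb, ram, disk, free_frame=3)
print(tlb[0])    # {'V': 1, 'D': 0, 'frame': 3}
```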

Demand Paging
DM Stage: we get a TLB Miss and a Page Fault (page 3 not resident)
– The Page Table (kernel memory) holds the HD address for page 3 (P3)
– Read the page into RAM page Y and update PA[23:10] in the Page Table
– Update the TLB: set V, clear D, and write the 22-bit Page # and PA[23:10]
Restart the failing instruction: TLB hit!
[Figure: pages 0 and 3 now reside in RAM (frames X and Y); the TLB holds one entry for page 0 (00...0 -> XX...X) and one for page 3 (00...11 -> YY...Y), each with its own D bit.]

Spatial and Temporal Locality
Spatial Locality: the TLB now holds the page translation for 1024 bytes, i.e. 256 instructions
– The next instruction (PC+4) will cause a TLB Hit
– Accessing a data array, e.g., 0($t0), 4($t0), etc.
Temporal Locality: the TLB holds the translation
– Branch within the same page, access the same instruction address
– Access the array again, e.g., 0($t0), 4($t0), etc.
THIS IS THE ONLY REASON A SMALL TLB WORKS

Replacement Strategy
If the TLB is full the OS selects the TLB line to replace. Any line will do; they are all the same and are checked concurrently.
Strategy to select one:
– Random: not so good
– Round Robin: not so good, about the same as random
– Least Recently Used (move to top strategy): much better (the best we can do without knowing or predicting page accesses). Based on temporal locality.

Hierarchy
Small, fast and expensive VS slow, big and inexpensive
[Figure: a 64-line TLB in front of the page table in kernel memory, 256 MB of RAM and a 32 GB hard disk; the number of pages is >> 64.]
The TLB and RAM contain copies.
What if the copies are changed? INCONSISTENCY!

Inconsistency
Replacing a TLB entry (caused by a TLB Miss):
– If the old TLB entry is dirty (D-bit set) we update the Page Table (kernel memory)
Replacing a page in RAM (swapping, caused by a Page Fault):
– If the old page is in the TLB: check the old page's TLB D-bit, and if dirty write the page to the HD; clear the TLB V-bit and the Page Table R-bit (now not resident)
– If the old page is not in the TLB: check the old page's Page Table D-bit, and if dirty write the page to the HD; clear the Page Table R-bit (the page is not resident any more)

Current Working Set
If RAM is full the OS selects a page to replace (Page Fault). NOTE: the RAM is shared by many user processes.
Least Recently Used (move to top strategy)
– Much better (the best we can do without knowing or predicting page accesses)
Swapping is VERY expensive (maybe > 100 ms)
– Why not try harder to keep the pages needed (the working set) in RAM, using advanced memory paging algorithms?
[Figure: the current working set of process P at time "now" is the set of pages {p0, p3, ...} used during the last interval t.]

Thrashing
[Figure: the probability of a Page Fault (0..1) plotted against the fragment of the working set that is not resident (0..1). When too much of the working set is missing the system thrashes: no useful work is done! This is what we want to avoid.]

Summary: Cache, TLB, Virtual Memory
Caches, TLBs and Virtual Memory can all be understood by examining how they deal with four questions:
– Where can a page be placed?
– How is a page found?
– What page is replaced on a miss?
– How are writes handled?
Page tables map virtual addresses to physical addresses.
TLBs are important for fast translation.
TLB misses are significant in processor performance:
– (some systems can't access all of the 2nd level cache without TLB misses!)

Summary: Memory Hierarchy
Virtual memory was controversial at the time: can SW automatically manage 64 KB across many programs?
– 1000X DRAM growth removed the controversy
Today VM allows many processes to share a single memory without having to swap all processes to disk;
VM protection is more important than the memory space increase.
Today CPU time is a function of (ops, cache misses) rather than just of (ops):
– What does this mean to Compilers, Data structures, Algorithms?
– VTune performance analyzer, cache misses