
Main Memory and Virtual Memory


Page 1: Main Memory and Virtual Memory

ENGS 116 Lecture 14 1

Main Memory and Virtual Memory

Vincent H. Berk

October 26, 2005

Reading for today: Sections 5.1 – 5.4, (Jouppi article)

Reading for Friday: Sections 5.5 – 5.8 Reading for Monday: Sections 5.8 – 5.12 and 5.16

Page 2: Main Memory and Virtual Memory

ENGS 116 Lecture 14 2

Main Memory Background

• Performance of Main Memory:

– Latency: Cache miss penalty

• Access Time: time between request and word arrives

• Cycle Time: time between requests

– Bandwidth: I/O & large block miss penalty (L2)

• Main Memory is DRAM: dynamic random access memory

– Dynamic since needs to be refreshed periodically (1% time)

– Addresses divided into 2 halves (memory as a 2-D matrix):

• RAS or Row Access Strobe

• CAS or Column Access Strobe

• Cache uses SRAM: static random access memory

– No refresh needed; 6 transistors/bit vs. 1 transistor/bit; Size: DRAM/SRAM ≈ 4–8×; Cost/Cycle time: SRAM/DRAM ≈ 8–16×

Page 3: Main Memory and Virtual Memory

ENGS 116 Lecture 14 3

4 Key DRAM Timing Parameters

• tRAC: minimum time from RAS line falling to the valid data output.

– Quoted as the speed of a DRAM when buying

– A typical 512 Mbit DRAM has tRAC = 40–60 ns

• tRC: minimum time from the start of one row access to the start of the next.

– tRC = 80 ns for a 512 Mbit DRAM with a tRAC of 40–60 ns

• tCAC: minimum time from CAS line falling to valid data output.

– 5 ns for a 512 Mbit DRAM with a tRAC of 40–60 ns

• tPC: minimum time from the start of one column access to the start of the next.

– 15 ns for a 512 Mbit DRAM with a tRAC of 40–60 ns

Page 4: Main Memory and Virtual Memory

ENGS 116 Lecture 14 4

DRAM Performance

• A 40 ns (tRAC) DRAM can

– perform a row access only every 80 ns (tRC)

– perform column access (tCAC) in 5 ns, but time between column accesses is at least 15 ns (tPC).

• In practice, external address delays and turning around buses make it 20 to 25 ns

• These times do not include the time to drive the addresses off the microprocessor or the memory controller overhead!

Page 5: Main Memory and Virtual Memory

ENGS 116 Lecture 14 5

DRAM History

• DRAMs: capacity + 60%/yr, cost – 30%/yr

– 2.5X cells/area, 1.5X die size in ≈ 3 years

• ‘98 DRAM fab line costs $2B

• Rely on increasing numbers of computers & memory per computer (60% market)

– SIMM or DIMM is the replaceable unit → computers can use any generation DRAM

• Commodity, second-source industry → high volume, low profit, conservative

– Little organization innovation in 20 years

• Order of importance: 1) Cost/bit, 2) Capacity

– First RAMBUS: 10X BW, +30% cost → little impact

• Current SDRAM yield very high: > 80%

Page 6: Main Memory and Virtual Memory

ENGS 116 Lecture 14 6

Main Memory Performance

• Simple:

– CPU, Cache, Bus, Memory same width (32 or 64 bits)

• Wide:

– CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits; UltraSPARC 512)

• Interleaved:

– CPU, Cache, Bus 1 word; Memory N modules (4 modules); example is word interleaved

Page 7: Main Memory and Virtual Memory

ENGS 116 Lecture 14 7

Main Memory Performance

• Timing model (word size is 32 bits)

– 1 cycle to send address, 6 for access time, 1 to send data

– Cache block is 4 words

• Simple memory: 4 × (1 + 6 + 1) = 32
• Wide memory: 1 + 6 + 1 = 8
• Interleaved memory: 1 + 6 + 4 × 1 = 11

[Diagram: word-interleaved memory with 4 banks — Bank 0 holds words 0, 4, 8, 12; Bank 1: 1, 5, 9, 13; Bank 2: 2, 6, 10, 14; Bank 3: 3, 7, 11, 15]

Page 8: Main Memory and Virtual Memory

ENGS 116 Lecture 14 8

Independent Memory Banks

• Memory banks for independent accesses vs. faster sequential accesses

– Multiprocessor

– I/O (DMA)

– CPU with Hit under n Misses, Non-blocking Cache

• Superbank: all memory active on one block transfer (or Bank)

• Bank: portion within a superbank that is word interleaved (or subbank)

[Diagram: address split as Superbank # | Bank # | Bank offset — the superbank offset (bank number plus bank offset) selects within a superbank]

Page 9: Main Memory and Virtual Memory

ENGS 116 Lecture 14 9

Independent Memory Banks

• How many banks?

number banks ≥ number clocks to access word in bank

– For sequential accesses; otherwise the CPU returns to the original bank before it has the next word ready

– (like in vector case)

• Increasing DRAM density → fewer chips → harder to have many banks

Page 10: Main Memory and Virtual Memory

ENGS 116 Lecture 14 10

Avoiding Bank Conflicts

• Lots of banks

int x[256][512];
for (j = 0; j < 512; j = j+1)
  for (i = 0; i < 256; i = i+1)
    x[i][j] = 2 * x[i][j];

• Even with 128 banks, since 512 is a multiple of 128, word accesses conflict on the same bank

• SW: loop interchange or declaring the array not a power of 2 (“array padding”)

• HW: prime number of banks

– bank number = address mod number of banks

– address within bank = address / number of words in bank

– modulo & divide per memory access with prime number of banks?

– address within bank = address mod number of words in bank

– bank number? easy if 2^N words per bank

Page 11: Main Memory and Virtual Memory

ENGS 116 Lecture 14 11

Fast Memory Systems: DRAM specific

• Multiple CAS accesses: several names (page mode)

– Extended Data Out (EDO): 30% faster in page mode

• New DRAMs to address the gap; what will they cost, will they survive?

– RAMBUS: startup company; reinvented the DRAM interface

>> Each chip a module vs. slice of memory
>> Short bus between CPU and chips
>> Does own refresh
>> Variable amount of data returned
>> 1 byte / 2 ns (500 MB/s per chip)

– Synchronous DRAM: 2 banks on chip, a clock signal to DRAM, transfer synchronous to system clock (66–150 MHz)

– Intel claims RAMBUS Direct is the future of PC memory

• Niche memory or main memory?

– e.g., Video RAM for frame buffers: DRAM + fast serial output

Page 12: Main Memory and Virtual Memory

ENGS 116 Lecture 14 12

Virtual Memory

• Virtual address (2^32, 2^64) to physical address (2^28) mapping

• Virtual memory in terms of cache:

– Cache block?

– Cache miss?

• How is virtual memory different from caches?

– What controls replacement

– Size (transfer unit, mapping mechanisms)

– Lower-level use

Page 13: Main Memory and Virtual Memory

ENGS 116 Lecture 14 13

Figure 5.36 The logical program in its contiguous virtual address space is shown on the left; it consists of four pages A, B, C, and D.

[Diagram: virtual pages A, B, C, D at virtual addresses 0, 4K, 8K, 12K; A, B, C map to scattered locations in physical main memory (0–28K) while D resides on disk]

Page 14: Main Memory and Virtual Memory

ENGS 116 Lecture 14 14

Figure 5.37 Typical ranges of parameters for caches and virtual memory.

Parameter          First-level cache      Virtual memory
Block (page) size  16–128 bytes           4096–65,536 bytes
Hit time           1–2 clock cycles       40–100 clock cycles
Miss penalty       8–100 clock cycles     700,000–6,000,000 clock cycles
  (Access time)    (6–60 clock cycles)    (500,000–4,000,000 clock cycles)
  (Transfer time)  (2–40 clock cycles)    (200,000–2,000,000 clock cycles)
Miss rate          0.5–10%                0.00001–0.001%
Data memory size   0.016–1 MB             16–8192 MB

Page 15: Main Memory and Virtual Memory

ENGS 116 Lecture 14 15

Virtual Memory

• 4 Questions for Virtual Memory (VM)?

– Q1: Where can a block be placed in the upper level?

fully associative, set associative, or direct mapped?

– Q2: How is a block found if it is in the upper level?

– Q3: Which block should be replaced on a miss?

random or LRU?

– Q4: What happens on a write?

write back or write through?

• Other issues: size; pages or segments or hybrid

Page 16: Main Memory and Virtual Memory

ENGS 116 Lecture 14 16

Figure 5.40 The mapping of a virtual address to a physical address via a page table.

[Diagram: the virtual address splits into virtual page number and page offset; the virtual page number indexes the page table, which supplies the physical page number; concatenated with the page offset, this forms the physical address into main memory]

Page 17: Main Memory and Virtual Memory

ENGS 116 Lecture 14 17

Fast Translation: Translation Buffer (TLB)

• Cache of translated addresses

• Data portion usually includes physical page frame number, protection field, valid bit, use bit, and dirty bit

• Alpha 21064 data TLB: 32-entry fully associative

[Diagram: Alpha 21064 data TLB — the 43-bit virtual address splits into a page-frame address <30> and page offset <13>; the 30-bit tag is compared against all 32 entries (32:1 MUX), each carrying V <1>, R <2>, W <2> protection bits; the selected 21-bit physical page # (high-order bits) joins the 13-bit offset (low-order bits) to form the 34-bit physical address]

Page 18: Main Memory and Virtual Memory

ENGS 116 Lecture 14 18

Selecting a Page Size

• Reasons for a larger page size

– Page table size is inversely proportional to the page size; therefore memory is saved

– Fast cache hit time is easy when the cache ≤ page size (virtually addressed caches); a bigger page makes this feasible as the cache grows in size

– Transferring larger pages to or from secondary storage, possibly over a network, is more efficient

– Number of TLB entries is restricted by clock cycle time, so a larger page size maps more memory, thereby reducing TLB misses

• Reasons for a smaller page size

– Fragmentation: don’t waste storage; data must be contiguous within a page

– Quicker process start for small processes

• Hybrid solution: multiple page sizes

– Alpha: 8 KB, 16 KB, 32 KB, 64 KB pages (43, 47, 51, 55 virtual address bits)

Page 19: Main Memory and Virtual Memory

ENGS 116 Lecture 14 19

Alpha VM Mapping

• “64-bit” address divided into 3 segments

– seg0 (bit 63 = 0): user code/heap

– seg1 (bit 63 = 1, bit 62 = 1): user stack

– kseg (bit 63 = 1, bit 62 = 0): kernel segment for OS

• Three-level page table, each level one page

– Alpha uses only 43 bits of VA

– (future: min page size up to 64 KB → 55 bits of VA)

• PTE bits: valid, kernel & user, read & write enable (no reference, use, or dirty bit)

– What do you do?

[Diagram: the Page Table Base Register plus the level-1 field selects an L1 page table entry; that entry points to the L2 page table, indexed by the level-2 field; the L2 entry points to the L3 page table, indexed by the level-3 field; the resulting physical page-frame number is concatenated with the page offset to form the physical address. Virtual address fields: seg0/seg1 selector (21 bits, 000…0 or 111…1), level 1 (10 bits), level 2 (10 bits), level 3 (10 bits), page offset (13 bits); page table entries are 8 bytes with 32-bit address fields]

Page 20: Main Memory and Virtual Memory

ENGS 116 Lecture 14 20

Protection

• Prevent separate processes from accessing each other’s memory

– Violations cause a segmentation fault: SIGSEGV

– Useful for multitasking systems

– Operating system issue

• At least two levels of protection:

– Supervisor (kernel) mode (privileged)

• Creates page tables, sets process bounds, handles exceptions

– User mode (non-privileged)

• Can only make requests to the kernel, called SYSCALLs

• Shared memory

• SYSCALL parameter passing

Page 21: Main Memory and Virtual Memory

ENGS 116 Lecture 14 21

Protection 2

• Each page needs:

– PID bit

– Read/Write/Execute bit

• Each process needs:

– Stack frame page(s)

– Text or code pages

– Data or heap pages

– State table keeping:

• PC and other CPU status registers

• State of all registers

Page 22: Main Memory and Virtual Memory

ENGS 116 Lecture 14 22

Alpha 21064

• Separate Instruction & Data TLBs & Caches

• TLBs fully associative

• TLB updates in SW (“Privileged Architecture Library”, PALcode)

• Caches 8KB direct mapped, write through

• Critical 8 bytes first

• Prefetch instr. stream buffer

• 2 MB L2 cache, direct mapped, WB (off-chip)

• 256 bit path to main memory, 4 64-bit modules

• Victim buffer: to give read priority over write

• 4-entry write buffer between D$ & L2$


Page 23: Main Memory and Virtual Memory

ENGS 116 Lecture 14 23

[Bar chart: CPI (0.0–5.0) broken into components for benchmarks AlphaSort, TPC-B (db2), TPC-B (db1), Espresso, Li, Eqntott, Sc, Gcc, Compress, Mdljsp2, Ora, Fpppp, Ear, Swm256, Doduc, Alvinn, Tomcatv, Wave5, Mdljp2, Hydro2d]

Alpha CPI Components

• Instruction stall: branch mispredict (green)

• Data cache (blue); instruction cache (yellow); L2$ (pink); other: compute + register conflicts, structural conflicts

Page 24: Main Memory and Virtual Memory

ENGS 116 Lecture 14 24

Pitfall: Predicting Cache Performance of One Program from Another (ISA, compiler, ...)

• 4KB data cache: miss rate 8%, 12%, or 28%?

• 1KB instruction cache: miss rate 0%, 3%, or 10%?

• Alpha vs. MIPS for 8 KB data $: 17% vs. 10%

• Why 2X Alpha vs. MIPS?

[Line chart: miss rate (0%–35%) vs. cache size (1–128 KB) for the data and instruction caches of tomcatv, gcc, and espresso — D$ tomcatv, D$ gcc, D$ espresso, I$ gcc, I$ espresso, I$ tomcatv]

Page 25: Main Memory and Virtual Memory

ENGS 116 Lecture 14 25

Pitfall: Simulating Too Small an Address Trace

[Line chart: cumulative average memory access time (1–4.5) vs. instructions executed (0–12 billion). Parameters: I$ = 4 KB, B = 16 B; D$ = 4 KB, B = 16 B; L2 = 512 KB, B = 128 B; MP = 12, 200 (miss penalties)]

Page 26: Main Memory and Virtual Memory

ENGS 116 Lecture 14 26

Additional Pitfalls

• Having too small an address space

• Ignoring the impact of the operating system on the performance of the memory hierarchy

Page 27: Main Memory and Virtual Memory

ENGS 116 Lecture 14 27

Figure 5.53 Summary of the memory-hierarchy examples in Chapter 5.

                        TLB                   First-level cache    Second-level cache           Virtual memory
Block size              4–8 bytes (1 PTE)     4–32 bytes           32–256 bytes                 4096–16,384 bytes
Hit time                1 clock cycle         1–2 clock cycles     6–15 clock cycles            10–100 clock cycles
Miss penalty            10–30 clock cycles    8–66 clock cycles    30–200 clock cycles          700,000–6,000,000 clock cycles
Miss rate (local)       0.1–2%                0.5–20%              15–30%                       0.00001–0.001%
Size                    32–8192 bytes         1–128 KB             256 KB – 16 MB               16–8192 MB
                        (8–1024 PTEs)
Backing store           First-level cache     Second-level cache   Page-mode DRAM               Disks
Q1: block placement     Fully associative     Direct mapped        Direct mapped or             Fully associative
                        or set associative                         set associative
Q2: block identification  Tag/block           Tag/block            Tag/block                    Table
Q3: block replacement   Random                N.A. (direct mapped) Random                       LRU
Q4: write strategy      Flush on a write      Write through or     Write back                   Write back
                        to page table         write back