CS/EE 5810 / CS/EE 6810
F00: 1
Virtual Memory
Virtual Memory: An Address Remapping Scheme
• Permits applications to grow bigger than main memory size
• Helps with multiple process management
– Each process gets its own chunk of memory
– Permits protection of one process’ chunk from another
– Mapping of multiple chunks onto shared physical memory
– Mapping also facilitates relocation
• Think of it as another level of cache in the hierarchy:
– Caches disk pages into DRAM
» Miss becomes a page fault
» Block becomes a page (or segment)
Virtual Memory Basics
• Programs reference “virtual” addresses in a non-existent memory
– These are then translated into real “physical” addresses
– Virtual address space may be bigger than physical address space
• Divide physical memory into blocks, called pages
– Anywhere from 512 bytes to 16MB (4KB typical)
• Virtual-to-physical translation by indexed table lookup
– Add another cache for recent translations (the TLB)
• Invisible to the programmer
– Looks to your application like you have a lot of memory!
– Anyone remember overlays?
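The indexed table lookup described above can be sketched in a few lines of Python (a toy model: the 4 KB page size and the dictionary standing in for the page table are illustrative assumptions, not any particular machine's format):

```python
# Minimal sketch of virtual-to-physical translation by indexed lookup.
PAGE_SIZE = 4096
OFFSET_BITS = 12  # log2(4096)

def translate(vaddr, page_table):
    """Translate a virtual address via a simple indexed page table.

    page_table maps virtual page number -> physical page number;
    a missing entry models a page fault.
    """
    vpn = vaddr >> OFFSET_BITS          # virtual page number
    offset = vaddr & (PAGE_SIZE - 1)    # byte offset within the page
    if vpn not in page_table:
        raise KeyError(f"page fault on VPN {vpn:#x}")
    ppn = page_table[vpn]
    return (ppn << OFFSET_BITS) | offset  # PPN concatenated with offset
```

Note that the offset passes through untranslated; only the page number is remapped.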
VM: Page Mapping
[Figure: Process 1's and Process 2's virtual address spaces are each mapped onto shared physical memory page frames; pages not resident in memory live on disk.]
VM: Address Translation
[Figure: a virtual address splits into a virtual page number (20 bits for a 32-bit address with 4KB pages) and a page offset (12 bits, log2 of the page size). The VPN indexes a per-process page table located via a page-table base register; each entry holds the physical page number plus valid, protection, dirty, and reference bits. The physical page number concatenated with the unchanged offset forms the address sent to physical memory.]
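A hedged sketch of the page-table entry just described, carrying the status bits named on the slide (valid, protection, dirty, reference); the field layout and the single `writable` protection bit are illustrative assumptions:

```python
# Toy page-table entry with the status bits a VM system needs.
from dataclasses import dataclass

@dataclass
class PTE:
    ppn: int = 0
    valid: bool = False
    dirty: bool = False
    referenced: bool = False
    writable: bool = False   # one example protection bit

def access(pte, write=False):
    """Apply one access to a PTE; returns the physical page number."""
    if not pte.valid:
        raise RuntimeError("page fault: PTE not valid")
    if write and not pte.writable:
        raise RuntimeError("protection fault: read-only page")
    pte.referenced = True    # set on every access; used to approximate LRU
    if write:
        pte.dirty = True     # page must be written back before eviction
    return pte.ppn
```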
Typical Page Parameters
• It’s a lot like what happens in a cache
– But everything (except miss rate) is a LOT worse
Parameter                            Value
Page Size                            4KB – 64KB
L1 Cache Hit Time                    1-2 clock cycles
Virtual Hit (e.g. mapped to DRAM)    50-400 clock cycles
Miss Penalty (all the way to disk)   700k-6M clock cycles
  Disk Access Time                   500k-4M clock cycles
  Page Transfer Time                 200k-2M clock cycles
Page Fault Rate                      .001% - .00001%
Main Memory Size                     4MB – 4GB
Cache vs. VM Differences
• Replacement
– cache miss handled by hardware
– page fault usually handled by the OS
» this is OK, since the fault penalty is so horrific
» hence some strategy about what to replace makes sense
• Addresses
– VM space is determined by the address size of the CPU
– cache size is independent of the CPU address size
• Lower-level memory
– for caches, the main memory is not shared by something else (well, there is I/O…)
– for VM, most of the disk contains the file system
» the file system is addressed differently, usually in I/O space
» the VM lower level is usually called Swap Space
Paging vs. Segmentation
• Pages are fixed-size blocks
• Segments vary from 1 byte to 2^32 bytes (for 32-bit addresses)

Aspect                Page                                 Segment
Words per address     One – contains page and offset       Two – possibly large max size, so need segment and offset words
Programmer visible?   No                                   Sometimes
Replacement           Trivial – because of fixed size      Hard – need to find contiguous space; use garbage collection
Memory Efficiency     Internal fragmentation               External fragmentation
Disk Efficiency       Yes – adjust page size to balance    Not always – segment size varies
                      access and transfer time
Pages are Cached in a Virtual Memory System
We can ask the same four questions we did about caches
• Q1: Block Placement
– choice: lower miss rates and complex placement or vice versa
» miss penalty is huge
» so choose low miss rate ==> place page anywhere in physical memory
» similar to fully associative cache model
• Q2: Block Addressing – use an additional data structure
– fixed-size pages – use a page table
» virtual page number ==> physical page number, then concatenate the offset
» tag bit to indicate presence in main memory
Normal Page Tables
• Size is number of virtual pages
• Purpose is to hold the translation of VPN to PPN
– Permits ease of page relocation
– Make sure to keep tags to indicate page is mapped
• Potential problem:
– Consider a 32-bit virtual address and 4KB pages
– 4GB/4KB = 1M entries (1 MW) required just for the page table!
– Might have to page in the page table itself…
» Consider how the problem gets worse on 64-bit machines with even larger virtual address spaces!
» The Alpha has a 43-bit virtual address with 8KB pages…
– Might have multi-level page tables
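The page-table-size arithmetic above is easy to check (a sketch; the helper name is ours, not from any toolchain):

```python
# Back-of-the-envelope size of a flat (single-level) page table.
def flat_table_entries(va_bits, page_bytes):
    """Number of entries = (size of virtual address space) / (page size)."""
    return (1 << va_bits) // page_bytes

# Slide's example: 32-bit VA, 4KB pages -> 4GB/4KB = 1M entries.
# Alpha's case: 43-bit VA, 8KB pages -> 1G entries, far worse.
```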
Inverted Page Tables
Similar to a set-associative mechanism
• Make the page table reflect the # of physical pages (not virtual)
• Use a hash mechanism
– virtual page number ==> hash ==> index into inverted page table
– Compare the virtual page number with the tag to make sure it is the one you want
– if yes
» check that it is in memory – OK if yes; page fault if not
– if not, a miss
» go to the full page table on disk to get the new entry
» implies 2 disk accesses in the worst case
» trades an increased worst-case penalty for a decrease in capacity-induced miss rate, since there is now more room for real pages with a smaller page table
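The hash-then-tag-compare lookup can be sketched as follows (a toy model: the table size and the modulo "hash" are illustrative assumptions):

```python
# Sketch of an inverted page table: one entry per physical frame,
# indexed by a hash of the VPN, with a tag compare to confirm the match.
NUM_FRAMES = 8

def ipt_lookup(ipt, vpn):
    """ipt is a NUM_FRAMES-long list of (vpn_tag, frame) entries or None."""
    idx = vpn % NUM_FRAMES                       # stand-in hash function
    entry = ipt[idx]
    if entry is not None and entry[0] == vpn:    # tag matches: page resident
        return entry[1]                          # physical frame number
    return None   # miss: fall back to the full page table on disk
```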
Inverted Page Table
[Figure: the page number is hashed to index the table; the stored page tag (with its valid bit) is compared against the page number, and on a match the frame number is concatenated with the page offset. Only entries for pages in physical memory are stored.]
Address Translation Reality
• The translation process using page tables takes too long!
• Use a cache to hold recent translations
– Translation Lookaside Buffer
» Typically 8-1024 entries
» Block size same as a page table entry (1 or 2 words)
» Only holds translations for pages in memory
» 1 cycle hit time
» Highly or fully associative
» Miss rate < 1%
» Miss goes to main memory (where the whole page table lives)
» Must be purged on a process switch
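A minimal sketch of such a translation cache (the entry count and the OrderedDict-based LRU approximation are our assumptions, not the slide's hardware):

```python
# Toy TLB: a small fully associative cache of recent VPN -> PPN
# translations, purged on a process switch.
from collections import OrderedDict

class TLB:
    def __init__(self, entries=64):
        self.entries = entries
        self.map = OrderedDict()   # access order approximates LRU

    def lookup(self, vpn):
        if vpn in self.map:              # hit: ~1 cycle
            self.map.move_to_end(vpn)
            return self.map[vpn]
        return None                      # miss: walk the page table in memory

    def insert(self, vpn, ppn):
        if len(self.map) >= self.entries:
            self.map.popitem(last=False)  # evict least recently used
        self.map[vpn] = ppn

    def flush(self):
        """Purge all translations on a process switch."""
        self.map.clear()
```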
Back to the 4 Questions
• Q3: Block Replacement (pages in physical memory)
– LRU is best
» So use it to minimize the horrible miss penalty
– However, real LRU is expensive
» The page table contains a use tag
» On access, the use tag is set
» The OS checks them every so often, records what it sees, and resets them all
» On a miss, the OS decides who has been used the least
– Basic strategy: the miss penalty is so huge that you can spend a few OS cycles to help reduce the miss rate
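The sample-and-reset scheme above can be sketched as follows (a toy model: the age counter and dictionaries are our illustrative stand-ins for the OS bookkeeping):

```python
# Approximate LRU via use (reference) bits: hardware sets a bit on
# access; the OS periodically records and clears all bits, keeping a
# rough "samples since last use" age per page.
def sample_use_bits(use_bits, age):
    """Called 'every so often' by the OS: record and reset the use tags."""
    for page, used in use_bits.items():
        if used:
            age[page] = 0                        # recently used
        else:
            age[page] = age.get(page, 0) + 1     # unused since last sample
        use_bits[page] = False                   # reset all use tags

def pick_victim(age):
    """On a miss, evict the page that has gone unused the longest."""
    return max(age, key=age.get)
```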
Last Question
• Q4: Write Policy
– Always write-back
» Due to the access time of the disk
» So you need to keep tags to show when pages are dirty and need to be written back to disk when they're swapped out
– Anything else is pretty silly
– Remember – the disk is SLOW!
Page Sizes: An Architectural Choice
• Large pages are good:
– reduces page table size
– amortizes the long disk access
– if spatial locality is good then hit rate will improve
• Large pages are bad:
– more internal fragmentation
» if everything is random, each structure's last page is only half full
» half of bigger is still bigger
» if there are 3 structures per process (text, heap, and control stack)
» then 1.5 pages are wasted per process
– process start-up time takes longer
» since at least 1 page of each type is required prior to start
» the transfer-time penalty is higher
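The 1.5-pages-wasted estimate is simple arithmetic (the helper name is ours):

```python
# Expected internal fragmentation: each structure's last page is on
# average half full, so waste = (number of structures) * (page size) / 2.
def expected_waste_bytes(page_bytes, num_structures=3):
    return num_structures * page_bytes // 2

# 3 structures (text, heap, control stack) at 4KB pages -> 6KB wasted;
# at 64KB pages the same fraction is 96KB: half of bigger is still bigger.
```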
More on TLBs
• The TLB must be on chip
– otherwise it is worthless
– small TLBs are worthless anyway
– large TLBs are expensive
» high associativity is likely
• ==> The price of CPUs is going up!
– OK as long as performance goes up faster
Alpha AXP 21064 TLB
[Figure: a 43-bit virtual address splits into a 30-bit page frame number and a 13-bit page offset. The TLB holds 32 entries of 56 bits each: a 30-bit VPN tag, a 21-bit physical page number, and V/R/W protection bits. Steps: (1) present the VPN to all entries; (2) a 32:1 mux selects the hit location; (3) protection is checked; (4) the 21-bit physical page number is concatenated with the offset to form a 34-bit physical address.]
Protection
• Multiprogramming forces us to worry about it
– think about what happens on your workstation
» it would be annoying if your program clobbered your email files
• There are lots of processes
– Implies lots of task switch overhead
– HW must provide savable state
– OS must promise to save and restore properly
– most machines task switch every few milliseconds
– a task switch typically takes several microseconds
– also implies inter-process communication
» which implies OS intervention
» which implies a task switch
» which implies less of the duty cycle gets spent on the application
Protection Options
• Simplest – base and bound
– 2 registers; check that each address falls between their values
» these registers must be changeable by the OS but not the app
– need for 2 modes: regular & privileged
» hence the need to privilege-trap and provide a mode-switch ability
• VM provides another option
– check as part of the VA --> PA translation process
» the protection bits reside in the page table & TLB
• Other options
– rings, a la MULTICS and now the Pentium
» inner is most privileged, outer is least
– capabilities (i432), similar to a key or password model
» the OS hands them out, so they're difficult to forge
» in some cases they can be passed between apps
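The base-and-bound check is the simplest of these to sketch (illustrative only; real hardware does this comparison on every access, not in software):

```python
# Base-and-bound protection: two OS-managed registers, and every
# address the application issues must fall between their values.
def check_access(addr, base, bound):
    """Allow the access only if base <= addr < bound."""
    if not (base <= addr < bound):
        raise MemoryError(f"protection violation at {addr:#x}")
    return addr
```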
VAX 11/780 Example
• Hybrid – segmented + paged
– segments used for system/process separation
– paging used for VM, relocation, and protection
– reasonably common again (was common in the late 60's too)
• Segments – 1 for system and 1 for processes
– high-order address bit 31: 1 = system, 0 = process
– all processes have their own space but share the process segment
– bit 30 divides the process segment into 2 regions
– =1: P1 grows downward (stack); =0: P0 grows upward (text, heap)
• Protection
– pair of base and bound registers for S, P0, and P1
– saves page table space, since the page size was 512 bytes
VAX-11/780 Address Mapping
[Figure: bits 31:30 of the virtual address select one of 3 separate page tables (SPT, P0PT, P1PT); the 21-bit virtual page number (bits 29:9) indexes the selected table, with a bounds check raising a page-index fault; the resulting 21-bit page frame number is concatenated with the 9-bit page offset and sent to the TLB.]
More VAX-11/780
• System page tables are pinned (frozen) in physical memory
– virtual and physical page numbers are the same
– the OS handles replacement, so it never moves these pages
• P0 and P1 tables are in virtual memory
– hence they can be missed as well
» a double page fault is therefore possible
• Page Table Entry
– M – modify, indicating a dirty page
– V – valid, indicating a real entry
– PROT – 4 protection bits
– 21-bit physical page number
– no reference or use bit, hence hokey LRU accounting (use M)
VAX PROT bits
• 4 levels of use - each with its own stack and stack pointer R15
– kernel - most trusted
– executive
– supervisor
– user - least trusted
• 3 types of access
– no access, read, or write
• Bizarre encoding
– all 16 4-bit patterns meant something
» if there was a model to their encoding method, it eludes everybody I know
– 1001 – R/W for kernel and exec, R for sup, zip for user
VAX 11/780 TLB Parameters
• 2-way set associative but partitioned
» two 64-entry banks; the top 32 entries of each bank are reserved for the system
» on a task switch, only half the TLB needs to be invalidated
» the high-order address bit selects the bank (corresponds to the P0/P1 distinction)
» the split increases the miss rate, but under a high task-switch rate it may still be a win

Parameter              Value
Block Size             1 page table entry (4 bytes)
Hit Time               1 cycle
Average Miss Penalty   22 clock cycles
Miss Rate              1% - 2%
TLB Size               128 entries
Block Replacement      Random, but not last used
Block Placement        2-way set-associative
Write Strategy         Explicit by OS
VAX 11/780 TLB Action
• Steps 1 & 2
» bit 31 ## index passed to both banks (both set members)
» V must be 1 for anything to continue – TLB-miss trap if V=0
» PROT bits checked against the R/W access type and which kind of user (from the PSW) – TLB-protection trap on a violation
• Step 3
» page table tag compared with the TLB tag
» both banks done in parallel
» both cannot match – if no match, then TLB-miss trap
• Step 4
» the matching side's 21-bit physical page address passes through the MUX
• Step 5
» 21-bit physical page addr ## 9-bit page offset = physical address
» if P=1, cache hit and the address goes to the cache – note the cache hit-time stretch
» if P=0, page-fault trap
The Alpha AXP 21064
Also a page/segment combo
• Segmented 64-bit address space
– Seg0 (addr[63] = 0) & Seg1 (addr[63:62] = 11)
» Mapped into pages for user code
» Seg0 for text and heap sections, grows upward
» Seg1 for the stack, which grows downward
– Kseg (addr[63:62] = 10)
» Reserved for the O.S.
» Uniformly protected space, no memory management
• Advantages of the split model
– Segmentation conserves page table space
– Paging provides VM, protection, and relocation
– The O.S. tends to need to be resident anyway
Page Table Problems
… the 64-bit address is the first-order cause
• Reduction approach – go hierarchical
– 3 levels, each page table limited to a single page in size
» the current page size is 8KB, but 16, 32, and 64KB are claimed for the future
» the superpage model extends TLB reach – used in the MIPS R10k
» uses 34-bit physical addresses (the max could be 41)
– Virtual address = [seg-selector | Lvl1 | Lvl2 | Lvl3 | Offset]
• Mapping
– The page-table base register points to the base of LVL1-TBL
– LVL1-TBL[Lvl1] + Lvl2 points to the LVL2-TBL entry
– The LVL2-TBL entry + Lvl3 points to the LVL3-TBL entry
– The LVL3-TBL entry finally provides the physical page number
– PPN ## Offset => physical address for main memory
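The three-level walk above can be sketched with toy tables (the dictionary-of-lists representation and table names are illustrative assumptions, not the Alpha's actual in-memory format):

```python
# Sketch of a 3-level page-table walk: each index selects an entry in
# the current table; levels 1-2 name the next table, level 3 holds the
# physical page number.
def walk(ptbr, tables, l1, l2, l3, offset, offset_bits=13):  # 8KB pages
    """tables maps a table id -> list of entries."""
    lvl2_tbl = tables[ptbr][l1]        # LVL1 entry -> level-2 table
    lvl3_tbl = tables[lvl2_tbl][l2]    # LVL2 entry -> level-3 table
    ppn = tables[lvl3_tbl][l3]         # LVL3 entry -> physical page number
    return (ppn << offset_bits) | offset   # PPN ## offset
```

Note the delay problem the next figure points out: the three lookups are sequential, each depending on the previous one's result.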
Alpha VPN => PPN Mapping
[Figure: the virtual address fields (seg0/1 selector, level-1, level-2, and level-3 indices, page offset) walk three page tables in sequence: the page-table base register locates the L1 table, whose entry locates the L2 table, whose entry locates the L3 table, whose entry supplies the physical page frame number; that PPN concatenated with the page offset forms the physical address. Note the potential delay problem: 3 sequential lookups.]
Page Table Entries
64-bit (8-byte) entries = 1k entries per table
• low-order 32 bits = page frame number
• 5 protection fields
– valid – e.g. OK to do address translation
– user Read Enable
– kernel Read Enable
– user Write Enable
– kernel Write Enable
• Other fields
– system accounting, like a USE field
– high-order bits basically unused; the real virtual address = 43 bits
» a common hack to save chip area in the early implementations – different now
» interesting to see OS problems as VM goes to bigger pages
Alpha TLB Stats
• Contiguous pages mapped as 1
– Option of 8, 64, or 512 pages – also extends TLB reach
– Complicates the TLB somewhat…
» Separate ITLB and DTLB

Parameter              Description
Block Size             1 PTE (8 bytes)
Hit Time               1 cycle
Average Miss Penalty   20 cycles
ITLB Size              96 bytes (12 entries for 8KB pages, 4 entries for max-4MB superpages)
DTLB Size              32 PTEs for any size superpage
Block Replacement      Random, but not last used
Block Placement        Fully associative
The Whole Alpha Memory System
• Boot
– initial Icache fill from boot PROM
– hence no Valid bit needed - Vbits in Dcache are cleared
– PC set to kseg, so no address translation is needed and the TLB can be bypassed
– then start loading the TLB for real from the kseg OS code
» subsequent misses call PAL (Priv. Arch. Library) code to remap and fill the miss
• Once ready to run user code
– OS sets PC to appropriate seg0 address
– then the memory hierarchy is ready to go
Things to Note
• Prefetch stream buffer
– holds next Ifetch access so an Icache miss checks there first
• L1 is write through with a write buffer
– avoids CPU stall on a write miss
– 4 block capacity
– write merge – if the block is the same, the writes are merged
• L2 is write back with a victim buffer
– holds 1 block, so a dirty L2 miss doesn't stall until the 2nd miss
– normally this is safe
• Full datapath shown in Figure 5.47
– worth the time to follow the numbered balls
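The write-merge behavior of the L1 write buffer can be sketched as follows (a toy model: the 32-byte block size, 4-entry capacity, and byte-granularity bookkeeping are illustrative assumptions):

```python
# Write buffer with write merging: a store to a block already waiting
# in the buffer joins that entry instead of taking a new slot.
BLOCK = 32

def buffer_write(write_buf, addr, data, capacity=4):
    """write_buf maps block address -> {offset: byte}."""
    block = addr - (addr % BLOCK)
    if block in write_buf:                     # write merge: same block pending
        write_buf[block][addr % BLOCK] = data
        return "merged"
    if len(write_buf) >= capacity:
        return "stall"                         # buffer full: CPU must wait
    write_buf[block] = {addr % BLOCK: data}
    return "new entry"
```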
Alpha 21064 I-fetch
• I-Cache hit (1 cycle)
» Step 1: Send address to TLB and cache
» Step 2: I-cache lookup (8KB, DM, 32-byte blocks)
» Step 3: ITLB lookup (12 entries, FA)
» Step 4: Cache hit and valid PTE
» Step 5: Send 8 bytes to CPU
• I-Cache miss, prefetch buffer (PFB) hit (1 cycle)
» Step 6: Start L2 access, just in case
» Step 7: Check prefetch buffer
» Step 8: Prefetch buffer hit; send 8 bytes to CPU
» Step 9: Refill I-cache from PFB, and cancel the L2 access
• I-Cache miss, prefetch buffer miss, L2 cache hit
» Step 10: Check L2 cache tag
» Step 11: Return critical 16B (5 cycles)
» Step 12: Return other 16B (5 cycles)
» Step 13: Prefetch next sequential block to PFB (10 cycles)
Alpha 21064 L2 Cache Miss
• L2 Cache Miss
» Step 14: Send new request to main memory
» Step 15: Put dirty victim block in victim buffer
» Step 16: Load new block in L2 cache, 16B at a time
» Step 17: Write Victim buffer to memory
• Data loads are like instruction fetches, except use DTLB and D-Cache instead of the ITLB and I-Cache
• Allows hits under miss
• On a read miss, the write buffer is flushed first to avoid RAW hazards
Alpha 21064 Data Store
• Data Store
» Step 18: DTLB lookup and protection-violation check
» Step 19: D-Cache lookup (8KB, DM, write through)
» Step 22: Check D-Cache tag
» Step 24: Send data to write buffer
» Step 25: Send data to delayed write buffer in front of D-Cache
• Write hits are pipelined
» Step 26: Write previous delayed write buffer to D-Cache
» Step 27: Merge data into write buffer
» Step 28: Write data at the head of write buffer to L2 cache (15 cycles)
AXP Memory is Quite Complex
How well does it work?
• Table 5.48 in the book tells…
• Summary – interesting to note:
– The Alpha 21064 is a dual-issue machine, but nowhere is the CPI < 1
– The worst cases are the TPC benchmarks
» Memory intensive, so more stalls
» CPI bloats to 4.3 (1.8 of which is from the caches)
» 1.67 due to other stalls (i.e. load dependencies)
– Average SPECint CPI is 1.86
» .77 contributed by caches, .74 by other stalls
– Average SPECfp CPI is 2.14
» .45 by caches, .98 by other stalls
Superscalar issue is sorely limited!
Summary 1
• CPU performance is outpacing main memory performance
• Principle of locality saves us: thus the Memory Hierarchy
• The memory hierarchy should be designed as a system
• Key old ideas:
– Bigger caches
– Higher set-associativity
• Key new ideas:
– Non-blocking caches
– Ancillary caches
– Multi-port caches
– Prefetching (software and/or hardware controlled)
Summary 2: Typical Choices

Option                 TLB                        L1 Cache          L2 Cache         VM (page)
Block Size             4-8 bytes (1 PTE)          4-32 bytes        32-256 bytes     4k-16k bytes
Hit Time               1 cycle                    1-2 cycles        6-15 cycles      10-100 cycles
Miss Penalty           10-30 cycles               8-66 cycles       30-200 cycles    700k-6M cycles
Local Miss Rate        .1 - 2%                    .5 - 20%          13 - 15%         .00001 - .001%
Size                   32B - 8KB                  1 - 128 KB        256KB - 16MB
Backing Store          L1 Cache                   L2 Cache          DRAM             Disks
Q1: Block Placement    Fully or set associative   DM                DM or SA         Fully associative
Q2: Block ID           Tag/block                  Tag/block         Tag/block        Table
Q3: Block Replacement  Random (not last)          N.A. for DM       Random (if SA)   LRU/LFU
Q4: Writes             Flush on PTE write         Through or back   Write-back       Write-back