Memory Hierarchy
– Reasons
– Virtual Memory
Cache Memory
Translation Lookaside Buffer
– Address translation
– Demand paging

Why Care About the Memory Hierarchy?
[Figure: the processor-DRAM memory gap (latency), plotted as performance versus time, 1980-2000, on a logarithmic scale. "Moore's Law": processor performance grows about 60% per year (2x every 1.5 years) while DRAM performance grows about 9% per year (2x every 10 years), so the processor-memory performance gap grows about 50% per year.]

DRAMs over Time
DRAM generation (1st gen. sample):   '84     '87    '90    '93    '96     '99
Memory size:                         1 Mb    4 Mb   16 Mb  64 Mb  256 Mb  1 Gb
Die size (mm2):                      55      85     130    200    300     450
Memory area (mm2):                   30      47     72     110    165     250
Memory cell area (µm2):              28.84   11.1   4.26   1.64   0.61    0.23
(from Kazuhiro Sakashita, Mitsubishi)

Recap:
Two different types of locality:
– Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
– Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon.
By taking advantage of the principle of locality:
– Present the user with as much memory as is available in the cheapest technology.
– Provide access at the speed offered by the fastest technology.
DRAM is slow but cheap and dense:
– Good choice for presenting the user with a BIG memory system.
SRAM is fast but expensive and not very dense:
– Good choice for providing the user FAST access time.
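
The loop below is a minimal sketch of where the two kinds of locality come from in ordinary code (my own illustration, not from the slides):

```python
# Illustrative only: temporal and spatial locality in a simple array sum.

data = list(range(1024))        # consecutive elements -> spatial locality
total = 0                       # 'total' is touched every iteration -> temporal locality

for i in range(len(data)):      # the loop instructions are re-fetched -> temporal locality
    total += data[i]            # data[i], data[i+1], ... lie at neighbouring addresses -> spatial locality
```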

Memory Hierarchy of a Modern Computer
[Figure: the processor (control, datapath, registers) is backed by an on-chip cache, a second-level cache (SRAM), main memory (DRAM), secondary storage (disk) and tertiary storage (disk/tape). Speed falls from about 1 ns at the registers, X ns for the on-chip cache and 10 ns for the second-level cache, to 100 ns for DRAM, 10 ms for disk and 10 s for tape, while size grows from about 100 bytes at the registers through K..64K and M bytes in the caches to G bytes in secondary storage and T bytes in tertiary storage.]

Levels of the Memory Hierarchy
[Figure: the staging/transfer units between levels, getting faster towards the top and larger towards the bottom:
Registers <-> Cache:   instruction operands, moved by the program, 1-8 bytes
Cache <-> Memory:      blocks, moved by the cache controller, 8-128 bytes
Memory <-> Disk:       pages, moved by the OS, 512-4K bytes
Disk <-> Tape:         files, moved by the user/operator, Mbytes]

The Art of Memory System Design
The processor issues a reference stream to the memory system: <op,addr>, <op,addr>, <op,addr>, <op,addr>, ...
op: i-fetch, read, write
Optimize the memory system organization (the cache "$" in front of MEM) to minimize the average memory access time for typical workloads or benchmark programs.

Virtual Memory System Design
– size of the information blocks that are transferred from secondary to main storage (M)
– if a block of information is brought into M and M is full, then some region of M must be released to make room for the new block --> replacement policy
– which region of M is to hold the new block --> placement policy
– missing items are fetched from secondary memory only on the occurrence of a fault --> demand load policy
Paging organization: the virtual and physical address spaces are partitioned into blocks of equal size (pages).

Address Map
V = {0, 1, ..., n - 1}   virtual address space
M = {0, 1, ..., m - 1}   physical address space, with n > m
MAP: V --> M ∪ {0}       address mapping function
MAP(a) = a'  if data at virtual address a is present at physical address a' in M
MAP(a) = 0   if data at virtual address a is not present in M
[Figure: the processor issues virtual address a in name space V; the address translation mechanism either delivers the physical address a' to main memory, or raises a missing item fault (a' = 0) so the fault handler has the OS transfer the item from secondary memory to main memory.]
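
As a sketch of the definition above (my own Python, with a plain dictionary standing in for the resident set), MAP is just a partial lookup that returns 0 on a missing item:

```python
# Minimal sketch of MAP: V -> M ∪ {0}. 'resident' stands in for the translations
# of items currently in M; names and values are illustrative.

FAULT = 0   # the slide uses 0 to mean "not present in M"

def MAP(a, resident):
    """Return the physical address a' for virtual address a, or 0 (missing item fault)."""
    return resident.get(a, FAULT)

resident = {0x1000: 0x3400, 0x2000: 0x0400}
print(hex(MAP(0x1000, resident)))   # 0x3400 -> present in M
print(MAP(0x5000, resident))        # 0      -> fault; the OS must fetch the item
```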

Paging Organization – Address Mapping
[Figure: the virtual address (V.A.) is split into a virtual page number and a 10-bit displacement. The page number indexes the page table (located in physical memory, found through the Page Table Base Register); each entry holds a Valid bit, Access Rights and a physical page address (PA). The physical memory address is formed from the PA and the displacement with a "+" (actually, concatenation is more likely). With 1K pages the virtual address space holds pages 0..31 (at 0, 1024, ..., 31744) and physical memory holds frames 0..7 (at 0, 1024, ..., 7168); the page is the unit of mapping and also the unit of transfer from virtual to physical memory.]
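
A bit-level sketch of the 1K-page translation in the figure (my own illustration; the page-table contents and the example address are made up):

```python
# Sketch of the 1K-page translation: VA = virtual page number | 10-bit displacement,
# and the physical address is the frame's base address concatenated with the displacement.

PAGE_BITS = 10                       # 1K pages, as on the slide

def translate(va, page_table):
    vpn  = va >> PAGE_BITS           # index into the page table
    disp = va & ((1 << PAGE_BITS) - 1)
    frame = page_table[vpn]          # physical frame number (V / access-rights bits omitted)
    return (frame << PAGE_BITS) | disp   # concatenation rather than addition

pt = [7, 0, 3, 1]                    # toy page table: virtual pages 0..3 -> frames 7, 0, 3, 1
print(hex(translate(0x0C04, pt)))    # virtual page 3, offset 4 -> frame 1 -> 0x404
```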

Address Mapping
[Figure: the MIPS pipeline issues 32-bit virtual instruction and data addresses; CP0 maps them to 24-bit physical addresses. Kernel memory holds one page table per process (page table 1, page table 2, ..., page table n) next to user memory. With user process 2 running, we need page table 2 for the address mapping.]

Translation Lookaside Buffer (TLB)
[Figure: CP0 now holds a TLB; each entry holds D and R bits, the virtual address (page number) and the physical address PA[23:10]. The per-process page tables stay in kernel memory.]
On a TLB hit, the 32-bit virtual address is translated into a 24-bit physical address by hardware.
We never call the Kernel!

So Far, NO GOOD
[Figure: the MIPS pipeline (IM, DE, EX, DM stages) with the TLB in CP0 and the page tables in kernel memory.]
The MIPS pipe is clocked at 50 MHz, so the critical path is 20 ns, and the TLB lookup takes only about 5 ns.
But the RAM takes 60 ns, so it needs 3 cycles per read/write, which STALLS the pipe.
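
The stall count follows directly from the numbers on the slide; a quick check of the arithmetic:

```python
# 50 MHz clock vs. 60 ns RAM: how many cycles does one memory access need?
import math

cycle_ns = 1000 / 50                      # 50 MHz -> 20 ns per cycle (the critical path)
ram_ns = 60                               # main RAM access time
print(math.ceil(ram_ns / cycle_ns))       # 3 -> every RAM access stalls the pipe for 2 extra cycles
```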

Let's put in a Cache
[Figure: a 15 ns cache is placed between the pipeline (IM/DM stages) and the 60 ns RAM; the TLB (5 ns) sits in CP0 and the page tables stay in kernel memory.]
The MIPS pipe is clocked at 50 MHz, so the critical path is 20 ns.
A cache Hit never STALLS the pipe, since the 15 ns cache access (plus the 5 ns TLB lookup) fits within one 20 ns cycle.

Fully Associative Cache
[Figure: the 24-bit PA is split into Tag = PA[23:2] and word offset PA[1:0]. ALL 2^16 cache lines are checked: Cache Hit if PA[23:2] = TAG in any line. 2^16 lines x 4 bytes = 256 KB.]

Fully Associative Cache
Very good hit ratio (nr hits / nr accesses)
But! Too expensive to check all 2^16 Cache lines concurrently
– A comparator for each line! A lot of hardware
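
In software terms the lookup looks like the sketch below (my own illustration; real hardware does all 2^16 tag comparisons in parallel, one comparator per line):

```python
# Sketch of a fully associative lookup (illustrative only).

def fa_lookup(pa, lines):
    """lines: list of (valid, tag, word) entries. Tag is PA[23:2] as on the slide."""
    tag = pa >> 2
    for valid, line_tag, word in lines:          # software stand-in for 2**16 comparators
        if valid and line_tag == tag:
            return word                          # cache hit
    return None                                  # cache miss
```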

Direct Mapped Cache
[Figure: the 24-bit PA is split into Tag = PA[23:18], index = PA[17:2] and word offset PA[1:0]. The index selects ONE cache line: Cache Hit if PA[23:18] = TAG. 2^16 lines x 4 bytes = 256 KB.]

Direct Mapped Cache
Not so good hit ratio
– Each line can hold only certain addresses, less freedom
But! Much cheaper to implement, only one line checked
– Only one comparator
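
The corresponding direct-mapped sketch (same caveats as above; the index picks exactly one line, so only one comparison is needed):

```python
# Sketch of a direct-mapped lookup (illustrative only): the index selects exactly ONE line.

def dm_lookup(pa, lines):
    """lines: list of 2**16 (valid, tag, word) entries, indexed by PA[17:2]."""
    index = (pa >> 2) & 0xFFFF          # PA[17:2], 16 bits -> 2**16 lines
    tag = pa >> 18                      # PA[23:18]
    valid, line_tag, word = lines[index]
    return word if (valid and line_tag == tag) else None   # one comparator
```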

Set Associative Cache
[Figure: the 24-bit PA is split into Tag = PA[23:18-z], index = PA[17-z:2] and word offset PA[1:0]. The index selects ONE set of 2^z lines: Cache Hit if PA[23:18-z] = TAG for some line in the set (a 2^z-way set associative cache). 2^16 lines x 4 bytes = 256 KB.]

Set Associative Cache
Quite good hit ratio
– The number (set) of different addresses for each line is greater than for a directly mapped cache
The larger z, the better the hit ratio, but the more expensive the cache
– 2^z comparators
– Cost-performance tradeoff
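
A generic 2^z-way sketch ties the three organizations together (my own illustration): with z = 0 it behaves like the direct-mapped cache above, and with z = 16 like the fully associative one.

```python
# Sketch of a 2**z-way set-associative lookup (illustrative only).

def sa_lookup(pa, sets, z):
    """sets: list of 2**(16-z) sets, each a list of 2**z (valid, tag, word) lines."""
    index_bits = 16 - z
    index = (pa >> 2) & ((1 << index_bits) - 1)      # PA[17-z:2]
    tag = pa >> (2 + index_bits)                     # PA[23:18-z]
    for valid, line_tag, word in sets[index]:        # 2**z comparators work in parallel
        if valid and line_tag == tag:
            return word
    return None
```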

Cache Miss
A Cache Miss should be handled by the hardware
– If handled by the OS it would be very slow (>> 60 ns)
On a Cache Miss
– Stall the pipe
– Read in new data to the cache
– Release the pipe; now we get a Cache Hit

A Summary on Sources of Cache Misses
Compulsory (cold start or process migration, first reference): first access to a block
– "Cold" fact of life: not a whole lot you can do about it
– Note: if you are going to run "billions" of instructions, Compulsory Misses are insignificant
Conflict (collision):
– Multiple memory locations mapped to the same cache location
– Solution 1: increase cache size
– Solution 2: increase associativity
Capacity:
– Cache cannot contain all blocks accessed by the program
– Solution: increase cache size
Invalidation: other process (e.g., I/O) updates memory

Example: 1 KB Direct Mapped Cache with 32 Byte Blocks
For a 2^N byte cache:
– The uppermost (32 - N) bits are always the Cache Tag
– The lowest M bits are the Byte Select (Block Size = 2^M)
[Figure: the 32-bit address is split into the Cache Tag, bits [31:10] (example: 0x50), stored as part of the cache "state" together with the Valid bit; the Cache Index, bits [9:5] (example: 0x01); and the Byte Select, bits [4:0] (example: 0x00). The data array holds 32 lines of 32 bytes each: Byte 0..31, Byte 32..63, ..., Byte 992..1023.]
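
A quick bit-slicing check of the example (my own code; the address 0x14020 is made up so that it produces exactly the slide's field values):

```python
# Splitting a 32-bit address for the 1 KB, 32-byte-block direct-mapped cache (N = 10, M = 5).

def split(addr, n=10, m=5):
    byte_select = addr & ((1 << m) - 1)              # lowest M bits
    index       = (addr >> m) & ((1 << (n - m)) - 1) # next N - M bits
    tag         = addr >> n                          # uppermost 32 - N bits
    return tag, index, byte_select

print([hex(x) for x in split(0x14020)])              # ['0x50', '0x1', '0x0'] as in the example
```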

Block Size Tradeoff
In general, a larger block size takes advantage of spatial locality, BUT:
– Larger block size means larger miss penalty: it takes a longer time to fill up the block
– If the block size is too big relative to the cache size, the miss rate will go up: too few cache blocks
In general, the Average Access Time is:
TimeAv = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate
[Figure: as the block size grows, the miss penalty increases; the miss rate first falls (exploits spatial locality) and then rises again (too few blocks compromises temporal locality); the average access time therefore has a minimum before the increased miss penalty and miss rate drive it up.]
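
Plugging illustrative numbers into the slide's formula shows how quickly the miss rate dominates (the 20 ns hit time and 60 ns miss penalty are my own assumptions, not values from the lecture):

```python
# The slide's formula with made-up numbers, for illustration only.

def avg_access_time(hit_time, miss_penalty, miss_rate):
    return hit_time * (1 - miss_rate) + miss_penalty * miss_rate

for miss_rate in (0.02, 0.10, 0.30):
    print(miss_rate, avg_access_time(20, 60, miss_rate))
# 0.02 -> 20.8 ns, 0.10 -> 24.0 ns, 0.30 -> 32.0 ns
```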

Extreme Example: single big line
Cache Size = 4 bytes, Block Size = 4 bytes
– Only ONE entry in the cache
If an item is accessed, it is likely that it will be accessed again soon
– But it is unlikely that it will be accessed again immediately!!!
– The next access will likely be a miss again: we continually load data into the cache but discard (force out) it before it is used again
Worst nightmare of a cache designer: the Ping Pong Effect
Conflict Misses are misses caused by:
– Different memory locations mapped to the same cache index
Solution 1: make the cache size bigger
Solution 2: multiple entries for the same Cache Index
[Figure: a single cache line with a Valid bit, Cache Tag and Cache Data bytes 0-3.]

Hierarchy
Small, fast and expensive VS slow, big and inexpensive
[Figure: a 2 GB hard disk, 16 MB of RAM, and 256 KB I- and D-caches. The cache contains copies. What if the copies are changed? INCONSISTENCY!]

Cache Miss, Write Through/Back
To avoid INCONSISTENCY we can Write Through
– Always write data to RAM
– Not so good performance (write 60 ns)
– Therefore, WT is always combined with write buffers so that we don't wait for the lower level memory
Write Back
– Write data to memory only when the cache line is replaced
– We need a Dirty bit (D) for each cache line; the D-bit is set by hardware on a write operation
– Much better performance, but more complex hardware

Write Buffer for Write Through
A Write Buffer is needed between the Cache and Memory
– Processor: writes data into the cache and the write buffer
– Memory controller: writes contents of the buffer to memory
The write buffer is just a FIFO:
– Typical number of entries: 4
– Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
Memory system designer's nightmare:
– Store frequency (w.r.t. time) -> 1 / DRAM write cycle
– Write buffer saturation
[Figure: Processor -> Cache, with a Write Buffer between the processor/cache and DRAM.]

Write Buffer Saturation
Store frequency (w.r.t. time) -> 1 / DRAM write cycle
– If this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row), the store buffer will overflow no matter how big you make it (the CPU Cycle Time <= DRAM Write Cycle Time)
Solutions for write buffer saturation:
– Use a write back cache
– Install a second level (L2) cache
[Figure: Processor -> Cache -> Write Buffer -> L2 Cache -> DRAM.]

Replacement Strategy in Hardware
A Direct Mapped cache selects ONE cache line
– No replacement strategy needed
A Set/Fully Associative cache selects a set of lines, so we need a strategy to select one cache line:
– Random, Round Robin: not so good, spoils the idea of an Associative Cache
– Least Recently Used (move to top strategy): good, but complex and costly for large z
– We could use an approximation (heuristic): Not Recently Used (replace a line not used for a certain time)
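
The "move to top" bookkeeping can be sketched as below (my own illustration; hardware would keep this ordering with a few extra bits per line):

```python
# Sketch of LRU "move to top" bookkeeping for one cache set (illustrative only).
# The most recently used line is kept first; the victim is always the last one.

class LRUSet:
    def __init__(self, ways):
        self.order = list(range(ways))   # line numbers, most recently used first

    def touch(self, line):               # on a hit: move the line to the top
        self.order.remove(line)
        self.order.insert(0, line)

    def victim(self):                    # on a miss: replace the least recently used line
        return self.order[-1]

s = LRUSet(4)
s.touch(2); s.touch(0)
print(s.victim())                        # 3 (never touched, so least recently used)
```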

Sequential RAM Access
Accessing sequential words from RAM is faster than accessing RAM randomly
– Only the lower address bits will change
How could we exploit this?
– Let each Cache Line hold an array of data words
– Give the base address and array size: "Burst Read" the array from RAM to the Cache, and "Burst Write" the array from the Cache to RAM

System Startup, RESET
Random Cache Contents
– We might read incorrect values from the Cache
We need to know if the contents are valid: a V-bit for each cache line
– Let the hardware clear all V-bits on RESET
– Set the V-bit and clear the D-bit for a line copied from RAM to the Cache

Final Cache Model
[Figure: the 24-bit PA is now split into Tag = PA[23:18-z], an index that selects ONE set of 2^z lines, and a word/byte offset PA[1+j:0] for lines of 2^j words. Each line also has a Valid (V) and a Dirty (D) bit. Cache Hit if (PA[23:18-z] = TAG) and V is set for a line in the set; the D bit is set on a Write.]
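
Putting the pieces together, a read/write access against this final model might be sketched as follows (my own illustration, for the one-word-per-line case and the field widths assumed above):

```python
# Sketch of the final cache model (illustrative only). Each line is a dict with
# V (valid), D (dirty), tag and data; a write hit marks the line dirty.

def access(pa, sets, z, write=False, data=None):
    index_bits = 16 - z
    index = (pa >> 2) & ((1 << index_bits) - 1)
    tag = pa >> (2 + index_bits)
    for line in sets[index]:
        if line["V"] and line["tag"] == tag:         # hit: tag matches AND V is set
            if write:
                line["data"], line["D"] = data, 1    # write hit sets the D bit
            return line["data"]
    return None                                      # miss: hardware refills the line
```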

Translation Lookaside Buffer (TLB)
[Figure: as before, CP0 holds the TLB; each entry holds D and R bits, the virtual address (page number) and the physical address PA[23:10], with the per-process page tables in kernel memory.]
On a TLB hit, the 32-bit virtual address is translated into a 24-bit physical address by hardware.
We never call the Kernel!

Virtual Address and a Cache
[Figure: the CPU sends the VA to a translation unit; the resulting PA accesses the cache, and main memory on a miss.]
It takes an extra memory access to translate the VA to a PA.
This makes cache access very expensive, and this is the "innermost loop" that you want to go as fast as possible.
ASIDE: Why access the cache with a PA at all? VA caches have a problem!
– synonym / alias problem: two different virtual addresses map to the same physical address => two different cache entries holding data for the same physical address!
– for updates: we must update all cache entries with the same physical address, or memory becomes inconsistent
– determining this requires significant hardware, essentially an associative lookup on the physical address tags to see if you have multiple hits; or
– a software enforced alias boundary: same lsb of VA & PA > cache size

Translation Look-Aside Buffers
Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped.
TLBs are usually small, typically not more than 128-256 entries even on high end machines. This permits fully associative lookup on these machines. Most mid-range machines use small n-way set associative organizations.
[Figure: translation with a TLB: the CPU presents the VA to the TLB; on a TLB hit the PA goes straight to the cache (and to main memory on a cache miss); on a TLB miss the OS page table supplies the translation.]

Reducing Translation Time
Machines with TLBs go one step further to reduce cycles per cache access:
they overlap the cache access with the TLB access.
This works because the high order bits of the VA are used to look in the TLB while the low order bits are used as the index into the cache.

Overlapped Cache & TLB Access
[Figure: the 32-bit VA is split into a 20-bit virtual page number (associative TLB lookup) and a 12-bit displacement; the 10-bit cache index and 2-bit byte offset come entirely from the displacement, so the 1K-line, 4-bytes-per-line cache can be indexed in parallel with the TLB lookup, and the PA delivered by the TLB is compared with the cache tag.]
IF cache hit AND (cache tag = PA) THEN deliver data to the CPU
ELSE IF [cache miss OR (cache tag ≠ PA)] AND TLB hit THEN access memory with the PA from the TLB
ELSE do the standard VA translation

Problems With Overlapped TLB Access
Overlapped access only works as long as the address bits used to index into the cache do not change as the result of VA translation.
This usually limits things to small caches, large page sizes, or high n-way set associative caches if you want a large cache.
Example: suppose everything is the same except that the cache is increased to 8 K bytes instead of 4 K:
[Figure: with an 8 KB cache the index plus byte offset needs 13 bits (11 + 2), so the top cache-index bit now falls inside the 20-bit virtual page number. This bit is changed by VA translation, but is needed for the cache lookup.]
Solutions: go to 8 K byte page sizes; go to a 2-way set associative cache (two 1K x 4-byte banks); or SW guarantee VA[13]=PA[13]
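
A small check of why the 8 KB cache breaks the overlap (my own code, assuming a direct-mapped cache with 4-byte lines and 4 KB pages as in the example):

```python
# All cache-index (and offset) bits must lie inside the untranslated page offset.
import math

def overlap_ok(cache_bytes, line_bytes=4, page_bytes=4096):
    index_plus_offset = int(math.log2(cache_bytes // line_bytes)) + int(math.log2(line_bytes))
    page_offset_bits  = int(math.log2(page_bytes))
    return index_plus_offset <= page_offset_bits

print(overlap_ok(4 * 1024))   # True  : 10 + 2 = 12 bits, all inside the page offset
print(overlap_ok(8 * 1024))   # False : 11 + 2 = 13 bits, one bit comes from the page number
```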

Startup a User process
[Figure: the TLB starts with all V-bits cleared; the new page table in kernel memory marks each page as Instruction (I), global Data (D) or Stack (S), with the Resident (R) and Dirty (D) bits cleared, and the page contents are placed on the hard disk.]
Allocate Stack pages and make a Page Table:
– Set the Instruction (I), Global Data (D) and Stack (S) pages
– Clear the Resident (R) and Dirty (D) bits
Clear the V-bits in the TLB

Demand Paging
IM Stage: we get a TLB Miss and a Page Fault (page 0 not resident)
– The Page Table (kernel memory) holds the HD address for page 0 (P0)
– Read the page into RAM page X and update PA[23:10] in the Page Table
– Update the TLB: set V, clear D, and write the 22-bit Page # and PA[23:10]
Restart the failing instruction: TLB hit!
[Figure: page 0 now resides in RAM at frame X; its TLB entry holds V = 1, D = 0, page number 00...0 and physical address XX...X00..0.]
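
The fault-handling sequence on the slide can be sketched as follows (my own Python; the page table, TLB, RAM and disk are toy stand-ins):

```python
# Sketch of the demand-paging sequence (data structures are illustrative only).

def handle_page_fault(vpn, page_table, tlb, ram, disk, free_frame):
    entry = page_table[vpn]
    ram[free_frame] = disk[entry["hd_addr"]]             # read the page from HD into RAM page X
    entry["resident"], entry["frame"] = 1, free_frame    # update PA[23:10] in the Page Table
    tlb[vpn] = {"V": 1, "D": 0, "frame": free_frame}     # update the TLB: set V, clear D, page #, PA
    # the failing instruction is then restarted and now gets a TLB hit

page_table = [{"hd_addr": "hd:p0", "resident": 0, "frame": None}]
disk = {"hd:p0": b"...page 0 contents..."}
ram, tlb = [None] * 8, {}
handle_page_fault(0, page_table, tlb, ram, disk, free_frame=3)
print(tlb[0])    # {'V': 1, 'D': 0, 'frame': 3}
```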

Demand Paging
DM Stage: we get a TLB Miss and a Page Fault (page 3 not resident)
– The Page Table (kernel memory) holds the HD address for page 3 (P3)
– Read the page into RAM page Y and update PA[23:10] in the Page Table
– Update the TLB: set V, clear D, and write the 22-bit Page # and PA[23:10]
Restart the failing instruction: TLB hit!
[Figure: pages 0 and 3 now reside in RAM (frames X and Y); the TLB holds one entry for page 0 (00...0 -> XX...X) and one for page 3 (00...11 -> YY...Y), each with its own D bit.]

Spatial and Temporal Locality
Spatial Locality: the TLB now holds the page translation for 1024 bytes, i.e. 256 instructions
– The next instruction (PC+4) will cause a TLB Hit
– Accessing a data array, e.g., 0($t0), 4($t0), etc.
Temporal Locality: the TLB holds the translation
– Branch within the same page, access the same instruction address
– Access the array again, e.g., 0($t0), 4($t0), etc.
THIS IS THE ONLY REASON A SMALL TLB WORKS

Replacement Strategy
If the TLB is full the OS selects the TLB line to replace. Any line will do; they are all the same and are checked concurrently.
Strategy to select one:
– Random: not so good
– Round Robin: not so good, about the same as random
– Least Recently Used (move to top strategy): much better (the best we can do without knowing or predicting page accesses). Based on temporal locality.

Hierarchy
Small, fast and expensive VS slow, big and inexpensive
[Figure: a 64-line TLB in front of the page table in kernel memory, 256 MB of RAM and a 32 GB hard disk; the number of pages is >> 64.]
The TLB and RAM contain copies.
What if the copies are changed? INCONSISTENCY!

Inconsistency
Replacing a TLB entry (caused by a TLB Miss):
– If the old TLB entry is dirty (D-bit set) we update the Page Table (kernel memory)
Replacing a page in RAM (swapping, caused by a Page Fault):
– If the old page is in the TLB: check the old page's TLB D-bit, and if dirty write the page to the HD; clear the TLB V-bit and the Page Table R-bit (now not resident)
– If the old page is not in the TLB: check the old page's Page Table D-bit, and if dirty write the page to the HD; clear the Page Table R-bit (the page is not resident any more)

Current Working Set
If RAM is full the OS selects a page to replace (Page Fault). NOTE: the RAM is shared by many user processes.
Least Recently Used (move to top strategy)
– Much better (the best we can do without knowing or predicting page accesses)
Swapping is VERY expensive (maybe > 100 ms)
– Why not try harder to keep the pages needed (the working set) in RAM, using advanced memory paging algorithms?
[Figure: the current working set of process P at time "now" is the set of pages {p0, p3, ...} used during the last interval t.]

Thrashing
[Figure: the probability of a Page Fault (0..1) plotted against the fragment of the working set that is not resident (0..1). When too much of the working set is missing the system thrashes: no useful work is done! This is what we want to avoid.]

Summary: Cache, TLB, Virtual Memory
Caches, TLBs and Virtual Memory can all be understood by examining how they deal with four questions:
– Where can a page be placed?
– How is a page found?
– What page is replaced on a miss?
– How are writes handled?
Page tables map virtual addresses to physical addresses.
TLBs are important for fast translation.
TLB misses are significant in processor performance:
– (some systems can't access all of the 2nd level cache without TLB misses!)

Summary: Memory Hierarchy
Virtual memory was controversial at the time: can SW automatically manage 64 KB across many programs?
– 1000X DRAM growth removed the controversy
Today VM allows many processes to share a single memory without having to swap all processes to disk;
VM protection is more important than the memory space increase.
Today CPU time is a function of (ops, cache misses) rather than just of (ops):
– What does this mean to Compilers, Data structures, Algorithms?
– VTune performance analyzer, cache misses