Lecture on High Performance Processor Architecture (CS05162)

Review of Memory Hierarchy

An Hong ([email protected])
Fall 2009
School of Computer Science and Technology
University of Science and Technology of China
Quick review of everything you should have learned.
Who Cares About the Memory Hierarchy?

[Figure: Processor-DRAM Memory Gap (latency). Performance vs. time, 1980-2000: processor performance grows 60%/yr (2X/1.5 yr, "Moore's Law"), while DRAM performance grows 9%/yr (2X/10 yrs), so the processor-memory performance gap grows about 50% per year.]
Levels of the Memory Hierarchy

Upper levels are smaller and faster; lower levels are larger and slower. Capacity, access time, cost, and staging/transfer unit per level:

- Registers: 100s of bytes, <10s ns. Transfer unit: instruction operands (1-8 bytes), managed by the program/compiler.
- Cache: KBytes, 10-100 ns, 1-0.1 cents/bit. Transfer unit: blocks (8-128 bytes), managed by the cache controller.
- Main Memory: MBytes, 200-500 ns, 10^-4 to 10^-5 cents/bit. Transfer unit: pages (512-4K bytes), managed by the OS.
- Disk: GBytes, 10 ms (10,000,000 ns), 10^-5 to 10^-6 cents/bit. Transfer unit: files (MBytes), managed by the user/operator.
- Tape: TBytes or effectively infinite, sec-min, 10^-8 cents/bit.
The Principle of Locality

The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.

Two different types of locality:
- Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
- Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).

For the last 20 years, hardware has relied on locality for speed and cost. Locality is a property of programs that is exploited in machine design.
Memory Hierarchy: Terminology

[Figure: an upper-level memory backed by a lower-level memory; Block X is delivered from the upper level to the processor, while Block Y must be brought from the lower level into the upper level.]

- Hit: the information the CPU needs (e.g., Block X) is found in the upper level.
  - Hit Rate: the fraction of accesses found in the upper level.
  - Hit Time: the time to access the upper level on a hit = RAM access time + time to determine hit/miss.
- Miss: the information the CPU needs (e.g., Block Y) must be brought from the lower level into the upper level.
  - Miss Rate = 1 - Hit Rate.
  - Miss Penalty: the time to find the block in the lower level + the time to transfer it to the upper level.
- Hit Time << Miss Penalty
Cache Measures

- Hit rate: usually so high that we talk about the miss rate instead.
  - Miss-rate fallacy: miss rate is as misleading a proxy for average memory access time as MIPS is for CPU performance.
- Miss penalty: the time to replace a block from the lower level, including the time to place it in the CPU:
  - Access time: the time to find the block in the lower level = f(latency to access the lower level).
  - Transfer time: the time to move the block to the upper level = f(bandwidth between the upper and lower levels).
- Average memory access time (AMAT) = Hit time + Miss rate x Miss penalty (in ns or clock cycles).
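To make the formula concrete, here is a minimal worked example in C; the numbers (1 ns hit time, 2% miss rate, 100 ns miss penalty) are illustrative assumptions, not figures from the lecture:

```c
#include <stdio.h>

/* Average memory access time: AMAT = hit_time + miss_rate * miss_penalty */
static double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}

int main(void)
{
    /* Illustrative numbers only: 1 ns hit, 2% misses, 100 ns penalty. */
    printf("AMAT = %.2f ns\n", amat(1.0, 0.02, 100.0)); /* 1 + 0.02*100 = 3 ns */
    return 0;
}
```

Even a 2% miss rate triples the effective access time here, which is why the miss rate alone understates the problem.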
Four Questions for Memory Hierarchy Designers

- Q1: Where can a block be placed in the upper level? (Block placement) - Fully Associative, Set Associative, Direct Mapped.
- Q2: How is a block found if it is in the upper level? (Block identification) - Tag/Block.
- Q3: Which block should be replaced on a miss? (Block replacement) - Random, LRU.
- Q4: What happens on a write? (Write strategy) - Write Through (with a write buffer) or Write Back.
Q1: Where can a block be placed?

Block 12 placed in an 8-block cache under the three placement schemes (set-associative mapping = block number modulo number of sets); see the code sketch below:

- Fully associative: block 12 can go anywhere.
- Direct mapped: block 12 can go only into block 4 (12 mod 8).
- 2-way set associative: block 12 can go anywhere in set 0 (12 mod 4), where the 8 blocks form 4 sets of 2.

[Figure: a 32-block memory (block-frame addresses 0-31) and the three cache organizations above, each with cache block frames numbered 0-7.]
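A minimal sketch of the three placement rules for the slide's example (8-block cache, memory block 12); the code just evaluates the modulo mappings described above:

```c
#include <stdio.h>

int main(void)
{
    const unsigned cache_blocks = 8;
    const unsigned ways = 2;                    /* 2-way set associative */
    const unsigned sets = cache_blocks / ways;  /* 4 sets */
    const unsigned block = 12;                  /* memory block number */

    /* Direct mapped: exactly one candidate frame. */
    printf("direct mapped   -> frame %u\n", block % cache_blocks); /* 4 */

    /* Set associative: one candidate set, any way within it. */
    printf("2-way set assoc -> set %u (either way)\n", block % sets); /* 0 */

    /* Fully associative: any of the 8 frames. */
    printf("fully assoc     -> any frame 0..%u\n", cache_blocks - 1);
    return 0;
}
```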
Q2: How is a block found?

Every cache block has an address tag that gives its block address. The tags of all cache blocks that might contain the requested information are checked (to decide hit or miss) against the block address coming from the CPU.

Increasing the associativity decreases the number of sets and increases the number of blocks per set, so the index field gets narrower and the tag field gets wider.

Address fields: the block frame address consists of the Tag (used for the hit/miss comparison) and the Index (set select); the Block offset selects the data within the block.
1 KB Direct Mapped Cache, 32B blocks

For a 2^N-byte cache with 2^M-byte blocks (on a 32-bit address):
- The uppermost (32 - N) bits are always the Cache Tag.
- The lowest M bits are the Byte Select (block size = 2^M).
- The bits in between are the Cache Index.

[Figure: a 1 KB direct-mapped cache with 32-byte blocks. Address bits 31-10 are the cache tag (example: 0x50), bits 9-5 the cache index (example: 0x01), bits 4-0 the byte select (example: 0x00). Each of the 32 entries holds a valid bit, a cache tag, and a 32-byte data block (bytes 0-31, 32-63, ..., 992-1023).]
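The field split can be checked with a few lines of C; the example address below is chosen (as an assumption) so that it reproduces the slide's example values of tag 0x50, index 0x01, byte select 0x00:

```c
#include <stdio.h>
#include <stdint.h>

#define CACHE_BYTES 1024u   /* 2^10: 1 KB cache */
#define BLOCK_BYTES 32u     /* 2^5: 32 B blocks */

int main(void)
{
    uint32_t addr = 0x00014020u;  /* = (0x50 << 10) | (0x01 << 5) | 0x00 */

    uint32_t offset = addr & (BLOCK_BYTES - 1);                           /* bits 4..0  */
    uint32_t index  = (addr / BLOCK_BYTES) % (CACHE_BYTES / BLOCK_BYTES); /* bits 9..5  */
    uint32_t tag    = addr / CACHE_BYTES;                                 /* bits 31..10 */

    printf("tag=0x%x index=%u offset=%u\n", (unsigned)tag, index, offset);
    return 0;                     /* prints: tag=0x50 index=1 offset=0 */
}
```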
Two-way Set Associative Cache

N-way set associative: N entries for each Cache Index; N direct-mapped caches operate in parallel (N is typically 2 to 4).

Example: two-way set associative cache (sketched in code below):
- The Cache Index selects a set from the cache.
- The two tags in the set are compared in parallel.
- Data is selected based on the tag comparison result.

[Figure: two banks of (valid, tag, data) entries addressed by the same Cache Index; two comparators check the address tag against both stored tags, the results are ORed into the Hit signal, and a mux (Sel1/Sel0) picks the matching cache block.]
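A minimal sketch of the lookup path in C, with an illustrative 16-set geometry; the loop plays the role of the two parallel comparators, the OR, and the select mux:

```c
#include <stdint.h>
#include <stdbool.h>

#define SETS 16     /* illustrative geometry, not from the slide */
#define WAYS 2

struct line { bool valid; uint32_t tag; uint8_t data[32]; };
static struct line cache[SETS][WAYS];

/* The index selects a set; each way's tag is compared against the address
   tag; any match is ORed into the hit signal and selects that way's data. */
bool lookup(uint32_t tag, uint32_t index, uint8_t **block)
{
    for (int way = 0; way < WAYS; way++) {
        struct line *l = &cache[index][way];
        if (l->valid && l->tag == tag) {  /* compare + OR into Hit */
            *block = l->data;             /* select via the mux */
            return true;                  /* hit */
        }
    }
    return false;                         /* miss */
}
```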
N-way Set Associative Cache vs. Direct Mapped Cache

Higher associativity gives better utilization of the cache space and a lower probability of block conflicts, and therefore a lower miss rate: fully associative has the lowest miss rate, direct mapped the highest. But higher associativity also increases the implementation complexity and access latency of the cache. Most processors therefore use direct-mapped, 2-way set-associative, or 4-way set-associative caches.
Q3: Which block should be replaced on a miss?

Easy for direct mapped: there is only one candidate. For set associative or fully associative:
- Random
- LRU (Least Recently Used; see the sketch after the table below)

Miss rates, LRU vs. Random:

Associativity:   2-way            4-way            8-way
Size             LRU     Random   LRU     Random   LRU     Random
16 KB            5.2%    5.7%     4.7%    5.3%     4.4%    5.0%
64 KB            1.9%    2.0%     1.5%    1.7%     1.4%    1.5%
256 KB           1.15%   1.17%    1.13%   1.13%    1.12%   1.12%
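As a side note, for 2-way sets LRU state is especially cheap; the following is a minimal sketch (an illustration, not from the slides) using one LRU bit per set:

```c
#include <stdbool.h>

#define SETS 16                /* illustrative geometry */
static bool lru_way[SETS];     /* the way to victimize next, per set */

/* Call on every hit or fill to way 'used' (0 or 1) of set 'set'. */
void touch(int set, int used)
{
    lru_way[set] = !used;      /* the other way is now least recently used */
}

/* Which way to replace on a miss in 'set'. */
int victim(int set)
{
    return lru_way[set];
}
```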
Q4: What happens on a write?

- Write through (WT): the information is written both to the block in the cache and to the block in the lower-level memory. WT is always combined with write buffers so the processor doesn't wait for the lower-level memory.
- Write back (WB): the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced. Is the block clean or dirty?
Write Buffer for Write Through

A write buffer is needed between the cache and memory:
- Processor: writes data into the cache and the write buffer.
- Memory controller: writes the contents of the buffer to memory.

The write buffer is just a FIFO:
- Typical number of entries: 4.
- Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle.

The memory system designer's nightmare:
- Store frequency (w.r.t. time) approaches 1 / DRAM write cycle.
- Write buffer saturation.

[Figure: Processor and Cache, with a Write Buffer between the processor and DRAM.]
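A minimal sketch of such a FIFO write buffer with the typical 4 entries; the names and the stall-on-saturation policy are illustrative assumptions (a real buffer would also merge entries and forward data to reads):

```c
#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES 4

struct wb_entry { uint32_t addr; uint32_t data; };

static struct wb_entry buf[WB_ENTRIES];
static unsigned head, tail, count;

/* Processor side: returns false (stall) when the buffer is saturated. */
bool wb_enqueue(uint32_t addr, uint32_t data)
{
    if (count == WB_ENTRIES) return false;   /* write buffer saturation */
    buf[tail] = (struct wb_entry){ addr, data };
    tail = (tail + 1) % WB_ENTRIES;
    count++;
    return true;
}

/* Memory-controller side: drain one entry per DRAM write cycle. */
bool wb_dequeue(struct wb_entry *out)
{
    if (count == 0) return false;
    *out = buf[head];
    head = (head + 1) % WB_ENTRIES;
    count--;
    return true;
}
```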
Write-miss Policy: Write Allocate versus No-write Allocate

- Write allocate: on a write miss, the block is read from main memory into the cache and the write then proceeds as a write hit. Usually paired with write back.
- No-write allocate (write around): only the block in the lower-level memory is modified; the block is not brought into the cache. Usually paired with write through.
Cache history

- Caches were introduced (commercially) more than 30 years ago in the IBM 360/85; there was already a processor-memory gap.
- Oblivious to the ISA: caches were organization, not architecture.
- Many different organizations: direct-mapped, set-associative, skewed-associative, sector, decoupled sector, etc.
- Caches are ubiquitous: on-chip and off-chip, but also disk caches, web caches, trace caches, etc.
- Multilevel cache hierarchies: with inclusion or exclusion; 4+ levels.
Cache history (cont.)

- Cache exposed to the ISA: prefetch, fence, purge, etc.
- Cache exposed to the compiler: code and data placement.
- Cache exposed to the O.S.: page coloring.
- Many different write policies: copy-back, write-through, fetch-on-write, write-around, write-allocate, etc.
Cache history (cont.)

Numerous cache assists, for example:
- For storage: write buffers, victim caches, temporal/spatial caches.
- For overlap: lockup-free (non-blocking) caches.
- For latency reduction: prefetch.
- For better cache utilization: bypass mechanisms, dynamic line sizes.
- etc.
Caches and Parallelism

- Cache coherence: directory schemes, snoopy protocols.
- Synchronization: test-and-test-and-set (sketched below), load-linked / store-conditional.
- Models of memory consistency: TCC, LogTM (transactional memory).
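For the synchronization bullet, here is a minimal sketch of a test-and-test-and-set spinlock using C11 atomics (an illustration, not code from the lecture):

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool lock_flag = false;

void lock(void)
{
    for (;;) {
        /* "Test": spin on a plain read so the cache line stays shared
           (no coherence traffic from failed atomics) while the lock is held. */
        while (atomic_load_explicit(&lock_flag, memory_order_relaxed))
            ;
        /* "Test-and-set": one atomic exchange; retry if another core won. */
        if (!atomic_exchange_explicit(&lock_flag, true, memory_order_acquire))
            return;
    }
}

void unlock(void)
{
    atomic_store_explicit(&lock_flag, false, memory_order_release);
}
```

The inner read-only loop is exactly what distinguishes test-and-test-and-set from plain test-and-set: waiters spin in their own caches instead of hammering the bus with atomic operations.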
When were the 2K papers being written?

A few facts:
- 1980 textbook: < 10 pages on caches (2%).
- 1996 textbook: > 120 pages on caches (20%).
- Smith survey (1982): about 40 references on caches.
- Uhlig and Mudge survey on trace-driven simulation (1997): about 25 references specific to cache performance alone, and many more on tools for performance, etc.
Cache research vs. time

[Figure: percentage of ISCA papers dealing principally with caches over time (y-axis 0-35%), marking the first ISCA session on caches and the year with the largest number of cache papers (14).]
Present Latency Solutions and Limitations

- Larger caches: slow; work well only if the working set fits in the cache and there is temporal locality.
- Hardware prefetching: cannot be tailored for each application; behavior is based on past and present execution-time behavior.
- Software prefetching: must ensure the overheads of prefetching do not outweigh the benefits (hence conservative prefetching); adaptive software prefetching is required to change the prefetch distance at run time; hard to insert prefetches for irregular access patterns.
- Multithreading: focuses on solving the throughput problem, not the memory latency problem.
The Memory Bandwidth Problem

Bandwidth is expensive, and often ignored. The processor-centric optimizations that bridge the latency gap (prefetching, speculation, multithreading to hide latency) all lead to memory-bandwidth problems.

Can we always just trade bandwidth for latency?
Present Bandwidth Solutions

- Wider/faster connections to memory: Rambus DRAM; higher signaling rates on existing pins; more pins for the memory interface.
- Larger on-chip caches: fewer requests to DRAM; only effective if the larger caches improve the hit rate.
- Traffic-efficient requests: only request what you need; caches are "guessing" that you might need adjacent data; compression?
Present Bandwidth Solutions (cont.)

- More efficient on-chip caches: only 1/20 to 1/3 of the data in a cache is live; again, caches are "guessing" what will be used again; spatial vs. temporal vs. no locality.
- Logic/DRAM integration: put the memory on the processor; on-chip bandwidth is cheaper than pin bandwidth; you will still probably have external DRAM as well.
- Memory-centric architectures: "smart" memory (processing-in-memory, PIM); put processing elements wherever there is memory.
Metrics for Evaluating Memory-System Performance

Review: the CPU performance equation when memory-stall cycles are ignored:

CPU time = (clock cycles to execute the program) x clock cycle time
         = IC x CPI x Clock cycle time

where IC = instruction count, CCT = clock cycle time, A.M.A.T. = average memory access time.

(1) Average memory access time:

A.M.A.T. = (Hit rate x Hit time) + (Miss rate x Miss time)
         = (Hit rate x Hit time) + (1 - Hit rate) x (Hit time + Miss penalty)
         = Hit time + (Miss rate x Miss penalty)

(2) CPU performance equation extended with the memory hierarchy:
First extended form:

CPU time = (CPU execution clock cycles + Memory stall clock cycles) x CCT

Memory stall clock cycles
  = (Reads x Read miss rate x Read miss penalty) + (Writes x Write miss rate x Write miss penalty)
  = Memory accesses x Miss rate x Miss penalty

Second extended form:

CPU time = IC x (CPI_execution + Memory accesses per instruction x Miss rate x Miss penalty) x CCT
         = IC x (CPI_execution + Misses per instruction x Miss penalty) x CCT
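A minimal numeric sketch of the second extended form; every parameter value below is an illustrative assumption:

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative assumptions, not figures from the lecture: */
    double ic            = 1e9;    /* instruction count */
    double cpi_execution = 1.0;    /* base CPI with no memory stalls */
    double mem_per_instr = 1.3;    /* memory accesses per instruction */
    double miss_rate     = 0.02;   /* 2% */
    double miss_penalty  = 100.0;  /* cycles */
    double cct           = 0.5e-9; /* 0.5 ns clock cycle (2 GHz) */

    double cpi = cpi_execution + mem_per_instr * miss_rate * miss_penalty;
    double cpu_time = ic * cpi * cct;

    /* 1.0 + 1.3 * 0.02 * 100 = 3.6 CPI: memory stalls dominate. */
    printf("effective CPI = %.2f, CPU time = %.3f s\n", cpi, cpu_time);
    return 0;
}
```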
Improving Cache Performance: 3 general options

Average memory access time = Hit time + (Miss rate x Miss penalty)
                           = (Hit rate x Hit time) + (Miss rate x Miss time)

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
A Modern Memory Hierarchy

[Figure: the processor (control + datapath + registers) backed by an on-chip cache, a second-level cache (SRAM), main memory (DRAM), secondary storage (disk), and tertiary storage (disk/tape). Speeds range from ~1 ns in registers and 10s-100s of ns in the caches and DRAM to 10,000,000s of ns (10s of ms) for disk and 10,000,000,000s of ns (10s of seconds) for tape; sizes range from 100s of bytes through KBytes, MBytes, and GBytes up to TBytes.]

By taking advantage of the principle of locality we can:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.
Recall: Levels of the Memory Hierarchy

Upper levels are smaller and faster; lower levels are larger and slower:

- Registers: 100s of bytes, <10s ns. Transfer unit: instruction operands (1-8 bytes), managed by the program/compiler.
- Cache: KBytes, 10-100 ns, $.01-.001/bit. Transfer unit: blocks (8-128 bytes), managed by the cache controller.
- Main Memory: MBytes, 100 ns - 1 us, $.01-.001. Transfer unit: pages (512-4K bytes), managed by the OS.
- Disk: GBytes, ms, 10^-3 to 10^-4 cents. Transfer unit: files (MBytes), managed by the user/operator.
- Tape: effectively infinite, sec-min, 10^-6 cents.
How is the hierarchy managed?

- Registers <-> Memory: by the compiler (programmer?).
- Cache <-> Memory: by the hardware.
- Memory <-> Disks: by the hardware and operating system (virtual memory); by the programmer (files).
What is virtual memory?

Virtual memory treats main memory as a cache for disk. Typical page size: 1K - 8K bytes. The page table maps "virtual pages" in the programming (virtual) address space to "physical pages" in main memory (the physical address space).

[Figure: the virtual address format is a virtual page number plus a page offset; the page-table base register plus the virtual page number index the page table in memory; each entry holds a valid bit (V), access rights, and the physical page number, which together with the page offset forms the physical address.]
Virtual memory vs. cache

- Unit of access: a "block" in the cache, a "page" in virtual memory.
- Replacement on a miss: handled by hardware for the cache; handled by the OS for virtual memory.
- Capacity: the size of virtual memory is determined by the processor's address width; the cache size is independent of the address width.
- Next-lower level: main memory is the level below the cache; secondary storage serves not only as the level below main memory but also as the file system.
Address mapping

V = {0, 1, ..., n - 1}: virtual address space
M = {0, 1, ..., m - 1}: physical address space (n > m)

Mapping: MAP: V -> M U {0}

MAP(a) = a' if the data at virtual address a is present at physical address a' in M
MAP(a) = 0  if the data at virtual address a is not present in M

One map per process!

[Figure: the processor issues a virtual address a in name space V; the address-translation mechanism either produces physical address a' in main memory or raises a missing-item (page) fault, which the OS handles by bringing the page in from secondary storage.]
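A minimal sketch of MAP as a single-level page-table lookup; the page size, table size, and names are illustrative assumptions:

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS 12u                  /* assumed 4 KB pages */
#define NUM_VPAGES 1024u               /* assumed table size */

struct pte { uint32_t frame; bool valid; };
static struct pte page_table[NUM_VPAGES];

/* Returns true and fills *pa on a mapping hit; false is the "MAP(a) = 0"
   case, a missing-item (page) fault to be handled by the OS. */
bool map(uint32_t va, uint32_t *pa)
{
    uint32_t vpn = va >> PAGE_BITS;
    uint32_t off = va & ((1u << PAGE_BITS) - 1);
    if (vpn >= NUM_VPAGES || !page_table[vpn].valid)
        return false;                  /* page fault */
    *pa = (page_table[vpn].frame << PAGE_BITS) | off;  /* concatenation */
    return true;
}
```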
Advantages of virtual memory

- Translation: simplifies program loading.
  - Provides a large, uniform programming (virtual) address space on top of a small physical address space.
  - Each of multiple threads can run with only some of its memory blocks allocated.
  - Only the most important parts of a program (its "working set") must be kept in memory.
- Protection: managed automatically; the programmer does no storage management.
  - Specific protection can be attached to individual pages (e.g., read-only, invisible to user programs).
  - Protects threads (or processes) from interfering with one another.
  - Protects kernel code from user programs; guards against viruses and malicious programs.
- Sharing: multiple processes can share main memory; the same physical page can be mapped into several users' address spaces ("shared memory").
Design issues in a virtual memory system

- Where can a new block be placed in main memory? Page placement (placement policy): a page may be placed anywhere in main memory (fully associative).
- If a block is in main memory, how is it found? Page identification (block identification): TLB and page table (or segment table).
- Which page is replaced on a page fault? Page replacement (replacement policy): reference bits / LRU algorithm.
- What write policy is used? Write policy: write back.
Address Mapping: Paging

[Figure: a virtual address splits into a virtual page number and a 10-bit displacement (1K pages). The page number indexes the page table, which is located in physical memory and found via the page-table base register; each entry holds a valid bit (V), access rights, and a physical frame number that is combined with the displacement (actually, concatenation is more likely than addition) to form the physical memory address. In the example, virtual pages 0-31 (V.A. 0, 1024, ..., 31744) map onto physical frames 0-7 (P.A. 0, 1024, ..., 7168); the 1K page is both the unit of mapping and the unit of transfer from virtual to physical memory.]
Large address spaces: two-level page tables

A 32-bit address is divided as: P1 index (10 bits) | P2 index (10 bits) | page offset (12 bits)

- 4 GB virtual address space with 4 KB pages and 4-byte PTEs.
- First-level page table: 1K PTEs (4 KB of PTE1).
- Second-level page tables: 1K PTEs each (4 MB of PTE2 in total).

What about a 48-64 bit address space?
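The 10/10/12 split can be expressed directly in C (the example address is an arbitrary assumption):

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t va = 0xC0801234u;             /* arbitrary example address */

    uint32_t p1     = va >> 22;            /* top 10 bits: 1st-level index */
    uint32_t p2     = (va >> 12) & 0x3FFu; /* next 10 bits: 2nd-level index */
    uint32_t offset = va & 0xFFFu;         /* low 12 bits: page offset */

    printf("p1=%u p2=%u offset=0x%x\n", p1, p2, (unsigned)offset);
    return 0;
}
```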
Virtual Address and a Cache: a Step Backward???

Virtual memory seems to be really slow:
- We have to access memory (for the page table) on every access, even on cache hits!
- Worse, if the translation is not completely in memory, we may need to go to disk before we can hit in the cache!

Solution: cache the page table!
- Keep track of the most common translations and place them in a "Translation Lookaside Buffer" (TLB).

[Figure: CPU -> translation -> cache -> main memory; the virtual address (VA) is translated to a physical address (PA) before the cache lookup, with data returned on a hit and the miss path going to main memory.]
Making Address Translation Practical: the TLB

A Translation Lookaside Buffer (TLB) is a cache of recent translations.

[Figure: a virtual address (page number + offset) is looked up in the on-chip TLB; on a TLB miss the page table in memory supplies the virtual-to-physical mapping; the resulting physical frame number plus the unchanged offset form the physical address.]
TLB organization: including protection

The TLB is usually organized as a fully-associative cache:
- Lookup is by virtual address.
- It returns the physical address plus other info: Dirty (page modified?), Ref (page touched?), Valid (TLB entry valid?), Access (read? write?), ASID (which user?).

Virtual Address   Physical Address   Dirty   Ref   Valid   Access   ASID
0xFA00            0x0003             Y       N     Y       R/W      34
0x0040            0x0010             N       Y     Y       R        0
0x0041            0x0011             N       Y     Y       R        0
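A minimal sketch of a fully-associative TLB lookup keyed by virtual page number and ASID; the entry layout follows the table above, while the sizes and names are illustrative assumptions:

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 16   /* illustrative size */

struct tlb_entry {
    uint32_t vpn, pfn;   /* virtual page number -> physical frame number */
    bool dirty, ref, valid;
    uint8_t access;      /* e.g., bit 0 = read allowed, bit 1 = write allowed */
    uint8_t asid;        /* address-space (user) identifier */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Fully associative: compare the tag (vpn + asid) against every entry. */
bool tlb_lookup(uint32_t vpn, uint8_t asid, uint32_t *pfn)
{
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn && tlb[i].asid == asid) {
            tlb[i].ref = true;   /* page touched */
            *pfn = tlb[i].pfn;
            return true;         /* TLB hit */
        }
    }
    return false;                /* TLB miss: walk the page table */
}
```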
Page fault handling

A page fault means the page is not in memory. Detection is done by hardware; handling traps into the OS, which performs the page-fault processing:
- Choose a page to evict (possibly writing it back to secondary storage).
- Load the faulting page from secondary storage.
- Schedule some other program onto the processor in the meantime.

Later (when the page has come back from disk):
- Update the page table.
- Resume the program and continue execution!

What is in the page fault handler? See an OS textbook. What can hardware do to help it do a good job?
Reducing translation time further

As described, the TLB lookup is in series with the cache lookup. Machines with TLBs go one step further: they overlap the TLB lookup with the cache access. This works because the low-order bits of the address (the page offset) are untranslated and available early, so they can index the cache while the TLB translates the page number.

[Figure: the virtual page number goes through the TLB lookup (valid bit, access rights, physical page number) in parallel with the cache access driven by the offset; the physical page number then completes the physical address for the tag comparison.]
Summary #1/5: Control and Pipelining

- Control via state machines and microprogramming.
- Pipelining just overlaps tasks; easy if the tasks are independent.
- Speedup <= pipeline depth; if the ideal CPI is 1, then:

  Speedup = (Pipeline depth / (1 + Pipeline stall CPI)) x (Cycle time_unpipelined / Cycle time_pipelined)

- Hazards limit performance on computers:
  - Structural: need more hardware resources.
  - Data (RAW, WAR, WAW): need forwarding, compiler scheduling.
  - Control: delayed branch, prediction.
Summary #2/5: Caches

- The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
  - Temporal locality: locality in time.
  - Spatial locality: locality in space.
- Three major categories of cache misses:
  - Compulsory misses: sad facts of life; example: cold-start misses.
  - Capacity misses: increase the cache size.
  - Conflict misses: increase the cache size and/or associativity. Nightmare scenario: the ping-pong effect!
- Write policy:
  - Write through: needs a write buffer. Nightmare: write-buffer saturation.
  - Write back: control can be complex.
Summary #3/5: The Cache Design Space

[Figure: the cache design space; varying one dimension (cache size, block size, associativity) from "less" to "more" trades off factor A against factor B, with a "good" region between the "bad" extremes.]

- Several interacting dimensions: cache size, block size, associativity, replacement policy, write-through vs. write-back, write allocation.
- The optimal choice is a compromise: it depends on access characteristics (workload; use: I-cache, D-cache, TLB) and on technology/cost.
- Simplicity often wins.
Summary #4/5: TLB, Virtual Memory

- Caches, TLBs, and virtual memory can all be understood by examining how they deal with 4 questions: 1) Where can a block be placed? 2) How is a block found? 3) What block is replaced on a miss? 4) How are writes handled?
- Page tables map virtual addresses to physical addresses; TLBs are important for fast translation.
- TLB misses are significant in processor performance: funny times, as most systems can't access all of the second-level cache without TLB misses!
Summary #5/5: Memory Hierarchy

- Virtual memory was controversial at the time: can software automatically manage 64 KB across many programs? 1000X DRAM growth removed the controversy.
- Today VM allows many processes to share a single memory without having to swap all processes to disk; today VM protection is more important than the memory hierarchy itself.
- Today CPU time is a function of (ops, cache misses), not just f(ops). What does this mean for compilers, data structures, and algorithms?