Lecture on High Performance Processor Architecture (CS05162)

Review of Memory Hierarchy

An Hong ([email protected])
Fall 2009
School of Computer Science and Technology
University of Science and Technology of China
Quick review of everything you should have learned.
Who Cares About the Memory Hierarchy?

[Figure: Processor-DRAM Memory Gap (latency). Performance vs. time, 1980-2000: processor performance grows 60%/yr (2X/1.5 yr, "Moore's Law"), while DRAM performance grows 9%/yr (2X/10 yrs), so the processor-memory performance gap grows about 50% per year.]
Levels of the Memory Hierarchy

Upper levels are smaller and faster; lower levels are larger and slower. Capacity, access time, cost, and staging/transfer unit per level:

- Registers: 100s of bytes, <10s ns. Transfer unit: instruction operands (1-8 bytes), managed by the program/compiler.
- Cache: KBytes, 10-100 ns, 1-0.1 cents/bit. Transfer unit: blocks (8-128 bytes), managed by the cache controller.
- Main Memory: MBytes, 200-500 ns, 10^-4 to 10^-5 cents/bit. Transfer unit: pages (512-4K bytes), managed by the OS.
- Disk: GBytes, 10 ms (10,000,000 ns), 10^-5 to 10^-6 cents/bit. Transfer unit: files (MBytes), managed by the user/operator.
- Tape: TBytes or effectively infinite, sec-min, 10^-8 cents/bit.
The Principle of Locality

The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.

Two different types of locality:
- Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
- Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).

For the last 20 years, hardware has relied on locality for speed and cost. Locality is a property of programs that is exploited in machine design.
Memory Hierarchy: Terminology

[Figure: an upper-level memory backed by a lower-level memory; Block X is delivered from the upper level to the processor, while Block Y must be brought from the lower level into the upper level.]

- Hit: the information the CPU needs (e.g., Block X) is found in the upper level.
  - Hit Rate: the fraction of accesses found in the upper level.
  - Hit Time: the time to access the upper level on a hit = RAM access time + time to determine hit/miss.
- Miss: the information the CPU needs (e.g., Block Y) must be brought from the lower level into the upper level.
  - Miss Rate = 1 - Hit Rate.
  - Miss Penalty: the time to find the block in the lower level + the time to transfer it to the upper level.
- Hit Time << Miss Penalty
Cache Measures

- Hit rate: usually so high that we talk about the miss rate instead.
  - Miss-rate fallacy: miss rate is as misleading a proxy for average memory access time as MIPS is for CPU performance.
- Miss penalty: the time to replace a block from the lower level, including the time to place it in the CPU:
  - Access time: the time to find the block in the lower level = f(latency to access the lower level).
  - Transfer time: the time to move the block to the upper level = f(bandwidth between the upper and lower levels).
- Average memory access time (AMAT) = Hit time + Miss rate x Miss penalty (in ns or clock cycles).
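To make the formula concrete, here is a minimal worked example in C; the numbers (1 ns hit time, 2% miss rate, 100 ns miss penalty) are illustrative assumptions, not figures from the lecture:

```c
#include <stdio.h>

/* Average memory access time: AMAT = hit_time + miss_rate * miss_penalty */
static double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}

int main(void)
{
    /* Illustrative numbers only: 1 ns hit, 2% misses, 100 ns penalty. */
    printf("AMAT = %.2f ns\n", amat(1.0, 0.02, 100.0)); /* 1 + 0.02*100 = 3 ns */
    return 0;
}
```

Even a 2% miss rate triples the effective access time here, which is why the miss rate alone understates the problem.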
Four Questions for Memory Hierarchy Designers

- Q1: Where can a block be placed in the upper level? (Block placement) - Fully Associative, Set Associative, Direct Mapped.
- Q2: How is a block found if it is in the upper level? (Block identification) - Tag/Block.
- Q3: Which block should be replaced on a miss? (Block replacement) - Random, LRU.
- Q4: What happens on a write? (Write strategy) - Write Through (with a write buffer) or Write Back.
Q1: Where can a block be placed?

Block 12 placed in an 8-block cache under the three placement schemes (set-associative mapping = block number modulo number of sets); see the code sketch below:

- Fully associative: block 12 can go anywhere.
- Direct mapped: block 12 can go only into block 4 (12 mod 8).
- 2-way set associative: block 12 can go anywhere in set 0 (12 mod 4), where the 8 blocks form 4 sets of 2.

[Figure: a 32-block memory (block-frame addresses 0-31) and the three cache organizations above, each with cache block frames numbered 0-7.]
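A minimal sketch of the three placement rules for the slide's example (8-block cache, memory block 12); the code just evaluates the modulo mappings described above:

```c
#include <stdio.h>

int main(void)
{
    const unsigned cache_blocks = 8;
    const unsigned ways = 2;                    /* 2-way set associative */
    const unsigned sets = cache_blocks / ways;  /* 4 sets */
    const unsigned block = 12;                  /* memory block number */

    /* Direct mapped: exactly one candidate frame. */
    printf("direct mapped   -> frame %u\n", block % cache_blocks); /* 4 */

    /* Set associative: one candidate set, any way within it. */
    printf("2-way set assoc -> set %u (either way)\n", block % sets); /* 0 */

    /* Fully associative: any of the 8 frames. */
    printf("fully assoc     -> any frame 0..%u\n", cache_blocks - 1);
    return 0;
}
```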
Q2: How is a block found?

Every cache block has an address tag that gives its block address. The tags of all cache blocks that might contain the requested information are checked (to decide hit or miss) against the block address coming from the CPU.

Increasing the associativity decreases the number of sets and increases the number of blocks per set, so the index field gets narrower and the tag field gets wider.

Address fields: the block frame address consists of the Tag (used for the hit/miss comparison) and the Index (set select); the Block offset selects the data within the block.
1 KB Direct Mapped Cache, 32B blocks

For a 2^N-byte cache with 2^M-byte blocks (on a 32-bit address):
- The uppermost (32 - N) bits are always the Cache Tag.
- The lowest M bits are the Byte Select (block size = 2^M).
- The bits in between are the Cache Index.

[Figure: a 1 KB direct-mapped cache with 32-byte blocks. Address bits 31-10 are the cache tag (example: 0x50), bits 9-5 the cache index (example: 0x01), bits 4-0 the byte select (example: 0x00). Each of the 32 entries holds a valid bit, a cache tag, and a 32-byte data block (bytes 0-31, 32-63, ..., 992-1023).]
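The field split can be checked with a few lines of C; the example address below is chosen (as an assumption) so that it reproduces the slide's example values of tag 0x50, index 0x01, byte select 0x00:

```c
#include <stdio.h>
#include <stdint.h>

#define CACHE_BYTES 1024u   /* 2^10: 1 KB cache */
#define BLOCK_BYTES 32u     /* 2^5: 32 B blocks */

int main(void)
{
    uint32_t addr = 0x00014020u;  /* = (0x50 << 10) | (0x01 << 5) | 0x00 */

    uint32_t offset = addr & (BLOCK_BYTES - 1);                           /* bits 4..0  */
    uint32_t index  = (addr / BLOCK_BYTES) % (CACHE_BYTES / BLOCK_BYTES); /* bits 9..5  */
    uint32_t tag    = addr / CACHE_BYTES;                                 /* bits 31..10 */

    printf("tag=0x%x index=%u offset=%u\n", (unsigned)tag, index, offset);
    return 0;                     /* prints: tag=0x50 index=1 offset=0 */
}
```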
Two-way Set Associative Cache

N-way set associative: N entries for each Cache Index; N direct-mapped caches operate in parallel (N is typically 2 to 4).

Example: two-way set associative cache (sketched in code below):
- The Cache Index selects a set from the cache.
- The two tags in the set are compared in parallel.
- Data is selected based on the tag comparison result.

[Figure: two banks of (valid, tag, data) entries addressed by the same Cache Index; two comparators check the address tag against both stored tags, the results are ORed into the Hit signal, and a mux (Sel1/Sel0) picks the matching cache block.]
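A minimal sketch of the lookup path in C, with an illustrative 16-set geometry; the loop plays the role of the two parallel comparators, the OR, and the select mux:

```c
#include <stdint.h>
#include <stdbool.h>

#define SETS 16     /* illustrative geometry, not from the slide */
#define WAYS 2

struct line { bool valid; uint32_t tag; uint8_t data[32]; };
static struct line cache[SETS][WAYS];

/* The index selects a set; each way's tag is compared against the address
   tag; any match is ORed into the hit signal and selects that way's data. */
bool lookup(uint32_t tag, uint32_t index, uint8_t **block)
{
    for (int way = 0; way < WAYS; way++) {
        struct line *l = &cache[index][way];
        if (l->valid && l->tag == tag) {  /* compare + OR into Hit */
            *block = l->data;             /* select via the mux */
            return true;                  /* hit */
        }
    }
    return false;                         /* miss */
}
```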
N-way Set Associative Cache vs. Direct Mapped Cache

Higher associativity gives better utilization of the cache space and a lower probability of block conflicts, and therefore a lower miss rate: fully associative has the lowest miss rate, direct mapped the highest. But higher associativity also increases the implementation complexity and access latency of the cache. Most processors therefore use direct-mapped, 2-way set-associative, or 4-way set-associative caches.
Q3: Which block should be replaced on a miss?

Easy for direct mapped: there is only one candidate. For set associative or fully associative:
- Random
- LRU (Least Recently Used; see the sketch after the table below)

Miss rates, LRU vs. Random:

Associativity:   2-way            4-way            8-way
Size             LRU     Random   LRU     Random   LRU     Random
16 KB            5.2%    5.7%     4.7%    5.3%     4.4%    5.0%
64 KB            1.9%    2.0%     1.5%    1.7%     1.4%    1.5%
256 KB           1.15%   1.17%    1.13%   1.13%    1.12%   1.12%
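As a side note, for 2-way sets LRU state is especially cheap; the following is a minimal sketch (an illustration, not from the slides) using one LRU bit per set:

```c
#include <stdbool.h>

#define SETS 16                /* illustrative geometry */
static bool lru_way[SETS];     /* the way to victimize next, per set */

/* Call on every hit or fill to way 'used' (0 or 1) of set 'set'. */
void touch(int set, int used)
{
    lru_way[set] = !used;      /* the other way is now least recently used */
}

/* Which way to replace on a miss in 'set'. */
int victim(int set)
{
    return lru_way[set];
}
```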
Q4: What happens on a write?

- Write through (WT): the information is written both to the block in the cache and to the block in the lower-level memory. WT is always combined with write buffers so the processor doesn't wait for the lower-level memory.
- Write back (WB): the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced. Is the block clean or dirty?
Write Buffer for Write Through

A write buffer is needed between the cache and memory:
- Processor: writes data into the cache and the write buffer.
- Memory controller: writes the contents of the buffer to memory.

The write buffer is just a FIFO:
- Typical number of entries: 4.
- Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle.

The memory system designer's nightmare:
- Store frequency (w.r.t. time) approaches 1 / DRAM write cycle.
- Write buffer saturation.

[Figure: Processor and Cache, with a Write Buffer between the processor and DRAM.]
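A minimal sketch of such a FIFO write buffer with the typical 4 entries; the names and the stall-on-saturation policy are illustrative assumptions (a real buffer would also merge entries and forward data to reads):

```c
#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES 4

struct wb_entry { uint32_t addr; uint32_t data; };

static struct wb_entry buf[WB_ENTRIES];
static unsigned head, tail, count;

/* Processor side: returns false (stall) when the buffer is saturated. */
bool wb_enqueue(uint32_t addr, uint32_t data)
{
    if (count == WB_ENTRIES) return false;   /* write buffer saturation */
    buf[tail] = (struct wb_entry){ addr, data };
    tail = (tail + 1) % WB_ENTRIES;
    count++;
    return true;
}

/* Memory-controller side: drain one entry per DRAM write cycle. */
bool wb_dequeue(struct wb_entry *out)
{
    if (count == 0) return false;
    *out = buf[head];
    head = (head + 1) % WB_ENTRIES;
    count--;
    return true;
}
```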
Write-miss Policy: Write Allocate versus No-write Allocate

- Write allocate: on a write miss, the block is read from main memory into the cache and the write then proceeds as a write hit. Usually paired with write back.
- No-write allocate (write around): only the block in the lower-level memory is modified; the block is not brought into the cache. Usually paired with write through.
Cache history

- Caches were introduced (commercially) more than 30 years ago in the IBM 360/85; there was already a processor-memory gap.
- Oblivious to the ISA: caches were organization, not architecture.
- Many different organizations: direct-mapped, set-associative, skewed-associative, sector, decoupled sector, etc.
- Caches are ubiquitous: on-chip and off-chip, but also disk caches, web caches, trace caches, etc.
- Multilevel cache hierarchies: with inclusion or exclusion; 4+ levels.
Cache history (cont.)

- Cache exposed to the ISA: prefetch, fence, purge, etc.
- Cache exposed to the compiler: code and data placement.
- Cache exposed to the O.S.: page coloring.
- Many different write policies: copy-back, write-through, fetch-on-write, write-around, write-allocate, etc.
Cache history (cont.)

Numerous cache assists, for example:
- For storage: write buffers, victim caches, temporal/spatial caches.
- For overlap: lockup-free (non-blocking) caches.
- For latency reduction: prefetch.
- For better cache utilization: bypass mechanisms, dynamic line sizes.
- etc.
Caches and Parallelism

- Cache coherence: directory schemes, snoopy protocols.
- Synchronization: test-and-test-and-set (sketched below), load-linked / store-conditional.
- Models of memory consistency: TCC, LogTM (transactional memory).
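For the synchronization bullet, here is a minimal sketch of a test-and-test-and-set spinlock using C11 atomics (an illustration, not code from the lecture):

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool lock_flag = false;

void lock(void)
{
    for (;;) {
        /* "Test": spin on a plain read so the cache line stays shared
           (no coherence traffic from failed atomics) while the lock is held. */
        while (atomic_load_explicit(&lock_flag, memory_order_relaxed))
            ;
        /* "Test-and-set": one atomic exchange; retry if another core won. */
        if (!atomic_exchange_explicit(&lock_flag, true, memory_order_acquire))
            return;
    }
}

void unlock(void)
{
    atomic_store_explicit(&lock_flag, false, memory_order_release);
}
```

The inner read-only loop is exactly what distinguishes test-and-test-and-set from plain test-and-set: waiters spin in their own caches instead of hammering the bus with atomic operations.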
When were the 2K papers being written?

A few facts:
- 1980 textbook: < 10 pages on caches (2%).
- 1996 textbook: > 120 pages on caches (20%).
- Smith survey (1982): about 40 references on caches.
- Uhlig and Mudge survey on trace-driven simulation (1997): about 25 references specific to cache performance alone, and many more on tools for performance, etc.
Cache research vs. time

[Figure: percentage of ISCA papers dealing principally with caches over time (y-axis 0-35%), marking the first ISCA session on caches and the year with the largest number of cache papers (14).]
Present Latency Solutions and Limitations

- Larger caches: slow; work well only if the working set fits in the cache and there is temporal locality.
- Hardware prefetching: cannot be tailored for each application; behavior is based on past and present execution-time behavior.
- Software prefetching: must ensure the overheads of prefetching do not outweigh the benefits (hence conservative prefetching); adaptive software prefetching is required to change the prefetch distance at run time; hard to insert prefetches for irregular access patterns.
- Multithreading: focuses on solving the throughput problem, not the memory latency problem.
The Memory Bandwidth Problem

Bandwidth is expensive, and often ignored. The processor-centric optimizations that bridge the latency gap (prefetching, speculation, multithreading to hide latency) all lead to memory-bandwidth problems.

Can we always just trade bandwidth for latency?
Present Bandwidth Solutions

- Wider/faster connections to memory: Rambus DRAM; higher signaling rates on existing pins; more pins for the memory interface.
- Larger on-chip caches: fewer requests to DRAM; only effective if the larger caches improve the hit rate.
- Traffic-efficient requests: only request what you need; caches are "guessing" that you might need adjacent data; compression?
Present Bandwidth Solutions (cont.)

- More efficient on-chip caches: only 1/20 to 1/3 of the data in a cache is live; again, caches are "guessing" what will be used again; spatial vs. temporal vs. no locality.
- Logic/DRAM integration: put the memory on the processor; on-chip bandwidth is cheaper than pin bandwidth; you will still probably have external DRAM as well.
- Memory-centric architectures: "smart" memory (processing-in-memory, PIM); put processing elements wherever there is memory.
Metrics for Evaluating Memory-System Performance

Review: the CPU performance equation when memory-stall cycles are ignored:

CPU time = (clock cycles to execute the program) x clock cycle time
         = IC x CPI x Clock cycle time

where IC = instruction count, CCT = clock cycle time, A.M.A.T. = average memory access time.

(1) Average memory access time:

A.M.A.T. = (Hit rate x Hit time) + (Miss rate x Miss time)
         = (Hit rate x Hit time) + (1 - Hit rate) x (Hit time + Miss penalty)
         = Hit time + (Miss rate x Miss penalty)

(2) CPU performance equation extended with the memory hierarchy:
First extended form:

CPU time = (CPU execution clock cycles + Memory stall clock cycles) x CCT

Memory stall clock cycles
  = (Reads x Read miss rate x Read miss penalty) + (Writes x Write miss rate x Write miss penalty)
  = Memory accesses x Miss rate x Miss penalty

Second extended form:

CPU time = IC x (CPI_execution + Memory accesses per instruction x Miss rate x Miss penalty) x CCT
         = IC x (CPI_execution + Misses per instruction x Miss penalty) x CCT
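A minimal numeric sketch of the second extended form; every parameter value below is an illustrative assumption:

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative assumptions, not figures from the lecture: */
    double ic            = 1e9;    /* instruction count */
    double cpi_execution = 1.0;    /* base CPI with no memory stalls */
    double mem_per_instr = 1.3;    /* memory accesses per instruction */
    double miss_rate     = 0.02;   /* 2% */
    double miss_penalty  = 100.0;  /* cycles */
    double cct           = 0.5e-9; /* 0.5 ns clock cycle (2 GHz) */

    double cpi = cpi_execution + mem_per_instr * miss_rate * miss_penalty;
    double cpu_time = ic * cpi * cct;

    /* 1.0 + 1.3 * 0.02 * 100 = 3.6 CPI: memory stalls dominate. */
    printf("effective CPI = %.2f, CPU time = %.3f s\n", cpi, cpu_time);
    return 0;
}
```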
Improving Cache Performance: 3 general options

Average memory access time = Hit time + (Miss rate x Miss penalty)
                           = (Hit rate x Hit time) + (Miss rate x Miss time)

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
A Modern Memory Hierarchy

[Figure: the processor (control + datapath + registers) backed by an on-chip cache, a second-level cache (SRAM), main memory (DRAM), secondary storage (disk), and tertiary storage (disk/tape). Speeds range from ~1 ns in registers and 10s-100s of ns in the caches and DRAM to 10,000,000s of ns (10s of ms) for disk and 10,000,000,000s of ns (10s of seconds) for tape; sizes range from 100s of bytes through KBytes, MBytes, and GBytes up to TBytes.]

By taking advantage of the principle of locality we can:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.
Recall: Levels of the Memory Hierarchy

Upper levels are smaller and faster; lower levels are larger and slower:

- Registers: 100s of bytes, <10s ns. Transfer unit: instruction operands (1-8 bytes), managed by the program/compiler.
- Cache: KBytes, 10-100 ns, $.01-.001/bit. Transfer unit: blocks (8-128 bytes), managed by the cache controller.
- Main Memory: MBytes, 100 ns - 1 us, $.01-.001. Transfer unit: pages (512-4K bytes), managed by the OS.
- Disk: GBytes, ms, 10^-3 to 10^-4 cents. Transfer unit: files (MBytes), managed by the user/operator.
- Tape: effectively infinite, sec-min, 10^-6 cents.
How is the hierarchy managed?

- Registers <-> Memory: by the compiler (programmer?).
- Cache <-> Memory: by the hardware.
- Memory <-> Disks: by the hardware and operating system (virtual memory); by the programmer (files).
What is virtual memory?

Virtual memory treats main memory as a cache for disk. Typical page size: 1K - 8K bytes. The page table maps "virtual pages" in the programming (virtual) address space to "physical pages" in main memory (the physical address space).

[Figure: the virtual address format is a virtual page number plus a page offset; the page-table base register plus the virtual page number index the page table in memory; each entry holds a valid bit (V), access rights, and the physical page number, which together with the page offset forms the physical address.]
Virtual memory vs. cache

- Unit of access: a "block" in the cache, a "page" in virtual memory.
- Replacement on a miss: handled by hardware for the cache; handled by the OS for virtual memory.
- Capacity: the size of virtual memory is determined by the processor's address width; the cache size is independent of the address width.
- Next-lower level: main memory is the level below the cache; secondary storage serves not only as the level below main memory but also as the file system.
Address mapping

V = {0, 1, ..., n - 1}: virtual address space
M = {0, 1, ..., m - 1}: physical address space (n > m)

Mapping: MAP: V -> M U {0}

MAP(a) = a' if the data at virtual address a is present at physical address a' in M
MAP(a) = 0  if the data at virtual address a is not present in M

One map per process!

[Figure: the processor issues a virtual address a in name space V; the address-translation mechanism either produces physical address a' in main memory or raises a missing-item (page) fault, which the OS handles by bringing the page in from secondary storage.]
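A minimal sketch of MAP as a single-level page-table lookup; the page size, table size, and names are illustrative assumptions:

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS 12u                  /* assumed 4 KB pages */
#define NUM_VPAGES 1024u               /* assumed table size */

struct pte { uint32_t frame; bool valid; };
static struct pte page_table[NUM_VPAGES];

/* Returns true and fills *pa on a mapping hit; false is the "MAP(a) = 0"
   case, a missing-item (page) fault to be handled by the OS. */
bool map(uint32_t va, uint32_t *pa)
{
    uint32_t vpn = va >> PAGE_BITS;
    uint32_t off = va & ((1u << PAGE_BITS) - 1);
    if (vpn >= NUM_VPAGES || !page_table[vpn].valid)
        return false;                  /* page fault */
    *pa = (page_table[vpn].frame << PAGE_BITS) | off;  /* concatenation */
    return true;
}
```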
Advantages of virtual memory

- Translation: simplifies program loading.
  - Provides a large, uniform programming (virtual) address space on top of a small physical address space.
  - Each of multiple threads can run with only some of its memory blocks allocated.
  - Only the most important parts of a program (its "working set") must be kept in memory.
- Protection: managed automatically; the programmer does no storage management.
  - Specific protection can be attached to individual pages (e.g., read-only, invisible to user programs).
  - Protects threads (or processes) from interfering with one another.
  - Protects kernel code from user programs; guards against viruses and malicious programs.
- Sharing: multiple processes can share main memory; the same physical page can be mapped into several users' address spaces ("shared memory").
Design issues in a virtual memory system

- Where can a new block be placed in main memory? Page placement (placement policy): a page may be placed anywhere in main memory (fully associative).
- If a block is in main memory, how is it found? Page identification (block identification): TLB and page table (or segment table).
- Which page is replaced on a page fault? Page replacement (replacement policy): reference bits / LRU algorithm.
- What write policy is used? Write policy: write back.
Address Mapping: Paging

[Figure: a virtual address splits into a virtual page number and a 10-bit displacement (1K pages). The page number indexes the page table, which is located in physical memory and found via the page-table base register; each entry holds a valid bit (V), access rights, and a physical frame number that is combined with the displacement (actually, concatenation is more likely than addition) to form the physical memory address. In the example, virtual pages 0-31 (V.A. 0, 1024, ..., 31744) map onto physical frames 0-7 (P.A. 0, 1024, ..., 7168); the 1K page is both the unit of mapping and the unit of transfer from virtual to physical memory.]
Large address spaces: two-level page tables

A 32-bit address is divided as: P1 index (10 bits) | P2 index (10 bits) | page offset (12 bits)

- 4 GB virtual address space with 4 KB pages and 4-byte PTEs.
- First-level page table: 1K PTEs (4 KB of PTE1).
- Second-level page tables: 1K PTEs each (4 MB of PTE2 in total).

What about a 48-64 bit address space?
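The 10/10/12 split can be expressed directly in C (the example address is an arbitrary assumption):

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t va = 0xC0801234u;             /* arbitrary example address */

    uint32_t p1     = va >> 22;            /* top 10 bits: 1st-level index */
    uint32_t p2     = (va >> 12) & 0x3FFu; /* next 10 bits: 2nd-level index */
    uint32_t offset = va & 0xFFFu;         /* low 12 bits: page offset */

    printf("p1=%u p2=%u offset=0x%x\n", p1, p2, (unsigned)offset);
    return 0;
}
```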
Virtual Address and a Cache: a Step Backward???

Virtual memory seems to be really slow:
- We have to access memory (for the page table) on every access, even on cache hits!
- Worse, if the translation is not completely in memory, we may need to go to disk before we can hit in the cache!

Solution: cache the page table!
- Keep track of the most common translations and place them in a "Translation Lookaside Buffer" (TLB).

[Figure: CPU -> translation -> cache -> main memory; the virtual address (VA) is translated to a physical address (PA) before the cache lookup, with data returned on a hit and the miss path going to main memory.]
Making Address Translation Practical: the TLB

A Translation Lookaside Buffer (TLB) is a cache of recent translations.

[Figure: a virtual address (page number + offset) is looked up in the on-chip TLB; on a TLB miss the page table in memory supplies the virtual-to-physical mapping; the resulting physical frame number plus the unchanged offset form the physical address.]
TLB organization: including protection

The TLB is usually organized as a fully-associative cache:
- Lookup is by virtual address.
- It returns the physical address plus other info: Dirty (page modified?), Ref (page touched?), Valid (TLB entry valid?), Access (read? write?), ASID (which user?).

Virtual Address   Physical Address   Dirty   Ref   Valid   Access   ASID
0xFA00            0x0003             Y       N     Y       R/W      34
0x0040            0x0010             N       Y     Y       R        0
0x0041            0x0011             N       Y     Y       R        0
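A minimal sketch of a fully-associative TLB lookup keyed by virtual page number and ASID; the entry layout follows the table above, while the sizes and names are illustrative assumptions:

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 16   /* illustrative size */

struct tlb_entry {
    uint32_t vpn, pfn;   /* virtual page number -> physical frame number */
    bool dirty, ref, valid;
    uint8_t access;      /* e.g., bit 0 = read allowed, bit 1 = write allowed */
    uint8_t asid;        /* address-space (user) identifier */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Fully associative: compare the tag (vpn + asid) against every entry. */
bool tlb_lookup(uint32_t vpn, uint8_t asid, uint32_t *pfn)
{
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn && tlb[i].asid == asid) {
            tlb[i].ref = true;   /* page touched */
            *pfn = tlb[i].pfn;
            return true;         /* TLB hit */
        }
    }
    return false;                /* TLB miss: walk the page table */
}
```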
Page fault handling

A page fault means the page is not in memory. Detection is done by hardware; handling traps into the OS, which performs the page-fault processing:
- Choose a page to evict (possibly writing it back to secondary storage).
- Load the faulting page from secondary storage.
- Schedule some other program onto the processor in the meantime.

Later (when the page has come back from disk):
- Update the page table.
- Resume the program and continue execution!

What is in the page fault handler? See an OS textbook. What can hardware do to help it do a good job?
Reducing translation time further

As described, the TLB lookup is in series with the cache lookup. Machines with TLBs go one step further: they overlap the TLB lookup with the cache access. This works because the low-order bits of the address (the page offset) are untranslated and available early, so they can index the cache while the TLB translates the page number.

[Figure: the virtual page number goes through the TLB lookup (valid bit, access rights, physical page number) in parallel with the cache access driven by the offset; the physical page number then completes the physical address for the tag comparison.]
Summary #1/5: Control and Pipelining

- Control via state machines and microprogramming.
- Pipelining just overlaps tasks; easy if the tasks are independent.
- Speedup <= pipeline depth; if the ideal CPI is 1, then:

  Speedup = (Pipeline depth / (1 + Pipeline stall CPI)) x (Cycle time_unpipelined / Cycle time_pipelined)

- Hazards limit performance on computers:
  - Structural: need more hardware resources.
  - Data (RAW, WAR, WAW): need forwarding, compiler scheduling.
  - Control: delayed branch, prediction.
Summary #2/5: Caches

- The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
  - Temporal locality: locality in time.
  - Spatial locality: locality in space.
- Three major categories of cache misses:
  - Compulsory misses: sad facts of life; example: cold-start misses.
  - Capacity misses: increase the cache size.
  - Conflict misses: increase the cache size and/or associativity. Nightmare scenario: the ping-pong effect!
- Write policy:
  - Write through: needs a write buffer. Nightmare: write-buffer saturation.
  - Write back: control can be complex.
Summary #3/5: The Cache Design Space

[Figure: the cache design space; varying one dimension (cache size, block size, associativity) from "less" to "more" trades off factor A against factor B, with a "good" region between the "bad" extremes.]

- Several interacting dimensions: cache size, block size, associativity, replacement policy, write-through vs. write-back, write allocation.
- The optimal choice is a compromise: it depends on access characteristics (workload; use: I-cache, D-cache, TLB) and on technology/cost.
- Simplicity often wins.
Summary #4/5: TLB, Virtual Memory

- Caches, TLBs, and virtual memory can all be understood by examining how they deal with 4 questions: 1) Where can a block be placed? 2) How is a block found? 3) What block is replaced on a miss? 4) How are writes handled?
- Page tables map virtual addresses to physical addresses; TLBs are important for fast translation.
- TLB misses are significant in processor performance: funny times, as most systems can't access all of the second-level cache without TLB misses!
Summary #5/5: Memory Hierarchy

- Virtual memory was controversial at the time: can software automatically manage 64 KB across many programs? 1000X DRAM growth removed the controversy.
- Today VM allows many processes to share a single memory without having to swap all processes to disk; today VM protection is more important than the memory hierarchy itself.
- Today CPU time is a function of (ops, cache misses), not just f(ops). What does this mean for compilers, data structures, and algorithms?