
EE282 – Fall 2008 Christos Kozyrakis Lecture 5 - 1

Department of Electrical Engineering

Stanford University

http://eeclass.stanford.edu/ee282

Lecture 5

Main Memory Systems


Announcements

• If you don’t have a group of 3, contact us ASAP

• HW-1 is due on 10/15, 5pm (no extensions, no exceptions)

– Bring to lecture or drop off in box outside Gates Hall 310

• PA-1 will be out on Thu

– Discussion session on PA1

• Fri 10/10, 11am, Skilling 193

– START EARLY!


Review: Prefetching

• Idea: fetch data into the cache before processors request them

– Can address cold misses

– Can be done by the programmer, compiler, or hardware

• Characteristics of ideal prefetching

– You only prefetch data that are truly needed

• Avoid bandwidth waste

– You issue prefetch requests early enough

• To hide the memory latency

– You don’t issue prefetch requests too early

• To avoid cache pollution
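As an illustration of programmer- or compiler-directed prefetching, the sketch below uses GCC's __builtin_prefetch to request data a fixed distance ahead of the current iteration. The prefetch distance is a hypothetical tuning value; it must be large enough to hide memory latency but not so large that prefetched lines are evicted before use.

```c
/* Minimal sketch of software prefetching in a reduction loop.
 * PREFETCH_DIST is an assumed tuning knob, not a recommended value. */
#define PREFETCH_DIST 64   /* elements ahead of the current iteration */

double sum_array(const double *a, long n)
{
    double sum = 0.0;
    for (long i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
            __builtin_prefetch(&a[i + PREFETCH_DIST], 0 /* read */, 3 /* high locality */);
        sum += a[i];
    }
    return sum;
}
```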


Review: Stream Prefetching or Stream Buffers

• Sequential prefetching problem:

– Performance slows down once every N cache lines

• Stream prefetching is a continuous version of sequential prefetching

– Stream buffer can fit N cache lines

– On a miss, start fetching N sequential cache lines

– On a stream buffer hit:

• Move cache line to cache, start fetching line (N+1)

• In other words, stream buffer tries to stay N cache lines ahead

• Design issues

– When is a stream buffer released?

• When we miss both in the cache and the stream buffer

– Can use multiple stream buffers to capture multiple streams

• E.g. a program operating on 2 arrays
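A minimal sketch of the stream-buffer policy described above, assuming a buffer of N sequential lines; install_in_cache() and start_fetch() are hypothetical placeholders for the cache-fill and memory-request paths, not a real API.

```c
#define N 4            /* stream buffer depth (assumption) */
#define LINE_SIZE 64   /* cache line size in bytes (assumption) */

extern void install_in_cache(unsigned long line_addr);
extern void start_fetch(unsigned long line_addr);

struct stream_buffer {
    unsigned long line[N];   /* addresses of prefetched cache lines */
    int valid[N];
    unsigned long next;      /* next sequential line to prefetch */
};

/* Called on an L1 miss for cache line address 'line_addr'. */
void on_cache_miss(struct stream_buffer *sb, unsigned long line_addr)
{
    if (sb->valid[0] && sb->line[0] == line_addr) {
        /* Stream buffer hit: move the head line to the cache, shift the
           buffer, and fetch one more line so we stay N lines ahead. */
        install_in_cache(line_addr);
        for (int i = 0; i < N - 1; i++) {
            sb->line[i]  = sb->line[i + 1];
            sb->valid[i] = sb->valid[i + 1];
        }
        sb->line[N - 1]  = sb->next;
        sb->valid[N - 1] = 1;
        start_fetch(sb->next);
        sb->next += LINE_SIZE;
    } else {
        /* Miss in cache and buffer: release the buffer and start a new
           stream of N sequential lines after the missing one. */
        for (int i = 0; i < N; i++) {
            sb->line[i]  = line_addr + (unsigned long)(i + 1) * LINE_SIZE;
            sb->valid[i] = 1;
            start_fetch(sb->line[i]);
        }
        sb->next = line_addr + (unsigned long)(N + 1) * LINE_SIZE;
    }
}
```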


Stream Buffer Design


Strided Prefetching

• Idea: detect and prefetch strided accesses

– for (i=0; i&lt;N; i++) A[i*1024]++;

• Stride detected using a PC-based table

– For each PC, remember the stride

– Stride detection

• Remember the last address used for this PC

• Compare to currently used address for this PC

– Track confidence using a two bit saturating counter

• Increment when stride correct, decrement when incorrect

• How to use the PC-based table

– When stream prefetching is initiated, direct it to prefetch at the detected stride (see the sketch after this list)

– Everything else remains the same
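A sketch of the PC-indexed stride table with a 2-bit saturating confidence counter, as outlined above. The table size, hash, and issue_prefetch() helper are assumptions.

```c
/* Minimal stride-detection table sketch. */
#define TABLE_ENTRIES 256
extern void issue_prefetch(unsigned long addr);

struct stride_entry {
    unsigned long last_addr;  /* last data address seen for this PC   */
    long stride;              /* last observed stride                  */
    unsigned conf;            /* 2-bit saturating counter, 0..3        */
};

static struct stride_entry table[TABLE_ENTRIES];

/* Called for every load, with the load's PC and data address. */
void on_load(unsigned long pc, unsigned long addr)
{
    struct stride_entry *e = &table[(pc >> 2) % TABLE_ENTRIES];
    long stride = (long)(addr - e->last_addr);

    if (stride == e->stride) {
        if (e->conf < 3) e->conf++;            /* stride confirmed: increment */
    } else {
        if (e->conf > 0) e->conf--;            /* mispredicted: decrement     */
        else e->stride = stride;               /* counter at 0: retrain       */
    }
    e->last_addr = addr;

    if (e->conf >= 2 && e->stride != 0)        /* confident: prefetch ahead   */
        issue_prefetch(addr + (unsigned long)e->stride);
}
```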


Other Ideas in Prefetching

• Prefetch engines for pointer-based data structures

– Predict if fetched data contain a pointer & follow it

– Works for linked-lists, graphs, etc

– Must be very careful:
• What is a pointer?

• How far to prefetch?

• Correlating prefetchers

– Learn about address correlation (A, B, C always accessed in order)
• When A is accessed, immediately fetch B & C

– Can use a PC-based table or a Markov prefetcher

• Pre-execution or run-ahead

– Distill the part of the program that generates addresses

– Run this program on other processor/thread to generate prefetches
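For the correlating prefetcher above, a minimal sketch of a one-successor (Markov) table: it remembers which miss address followed each miss address last time and, on a repeat miss, prefetches the recorded successor. Table size, hashing, and issue_prefetch() are assumptions.

```c
/* Minimal Markov/correlation prefetcher sketch (one successor per entry). */
#define ENTRIES 1024
extern void issue_prefetch(unsigned long addr);

static struct { unsigned long miss, next; } corr[ENTRIES];
static unsigned long prev_miss;

/* Called on every cache miss with the missing line address. */
void on_miss(unsigned long addr)
{
    unsigned prev_idx = (unsigned)((prev_miss >> 6) % ENTRIES);
    unsigned idx      = (unsigned)((addr      >> 6) % ENTRIES);

    /* Learn: record that 'addr' followed 'prev_miss' last time. */
    corr[prev_idx].miss = prev_miss;
    corr[prev_idx].next = addr;
    prev_miss = addr;

    /* Predict: if this miss has been seen before, prefetch its successor. */
    if (corr[idx].miss == addr && corr[idx].next != 0)
        issue_prefetch(corr[idx].next);
}
```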


Today’s Menu: Main Memory Systems

• Memory basics

– DRAM vs. SRAM

• DRAM

– Basic operation

– System organization

– DRAM chip architectures

• DRAM controller

• How to improve the memory system bandwidth and latency

• Acknowledgements: Bruce Jacob, University of Maryland

– Extensive research & teaching on modern DRAMs

– http://www.ece.umd.edu/~blj/

• See two optional papers online


Computer System (PC) Overview


General Memory Background

Read access sequence:

1. Decode row address &

drive word-lines

2. Selected bits drive bit-lines

• Entire row read

3. Amplify row data

4. Decode column address &

select subset of row

• Send to output

5. Precharge bit-lines

• For next access

[Figure: 2D storage array. An address register feeds the row decoder (MS bits of the address) and the column decoder (LS bits); the column decoder selects a subset of the amplified row and drives it to the data output.]
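A tiny sketch of the address split implied by the access sequence above: the MS bits of the address select the row, and the LS bits select the column within that row. The column-bit width is hypothetical.

```c
/* Minimal sketch of splitting a memory-array address into row and column. */
#define COL_BITS 10   /* 1024 columns per row (assumption) */

struct rc { unsigned row, col; };

struct rc decode(unsigned long addr)
{
    struct rc d;
    d.col = (unsigned)(addr & ((1ul << COL_BITS) - 1));  /* LS bits -> column */
    d.row = (unsigned)(addr >> COL_BITS);                /* MS bits -> row    */
    return d;
}
```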


Memory Terminology

• Access time (latency)

– Time from issuing an address to data out

• Cycle time

– Minimum time between two requests (repeat rate)

• Bandwidth

– Bytes/unit of time we can extract from the memory

• Peak: ignore initial latency

• Sustained: include initial latency

• Concurrency

– Number of accesses executing in parallel or overlapped manner

– Can help increase bandwidth or improve latency
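A worked example of the peak vs. sustained distinction, with hypothetical numbers: the sustained rate for isolated block transfers includes the initial access latency, so it sits well below the peak streaming rate.

```c
/* Worked example (hypothetical numbers): peak vs. sustained bandwidth for
 * isolated 64-byte block transfers. 1 GB/s corresponds to 1 byte/ns here. */
#include <stdio.h>

int main(void)
{
    double latency_ns  = 50.0;  /* time until the first byte arrives (assumption) */
    double peak_gbps   = 3.2;   /* streaming rate once data is flowing            */
    double block_bytes = 64.0;  /* one cache line                                  */

    double transfer_ns = block_bytes / peak_gbps;           /* 20 ns            */
    double sustained   = block_bytes / (latency_ns + transfer_ns);

    printf("peak      = %.2f GB/s\n", peak_gbps);
    printf("sustained = %.2f GB/s per isolated access\n", sustained);  /* ~0.91 */
    return 0;
}
```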


SRAM vs DRAM

SRAM:
• 6-transistor storage cell
  – Retains value while power is on
  – Non-destructive reads
• Cycle time == access time
• Wide interfaces
• Typical product today: 1-16 Mbit, 2-15 ns access time

DRAM:
• 1-transistor + 1-capacitor storage cell
  – Requires refresh
  – Destructive reads
• Cycle time > access time
• Narrower interfaces (4b to 32b)
• Typical product today: 64 Mb-1 Gb, 5-40 ns access time, 8-60 ns cycle time

[Figure: DRAM storage cell. Each bit is a capacitor C connected to the bit line through an access transistor controlled by the word line; a sense amp at the end of the bit line detects and restores the stored value.]


SRAM vs. DRAM: Considerations

• SRAM is preferable for register files & L1/L2 caches

– Fast access

– No refreshes

– Simpler manufacturing

• DRAM is preferable for stand-alone memory chips

– Much higher capacity

• 10x and growing

– Better immunity to soft errors

– Latency dominated by board traces anyway

• There is some gray area in the middle


DRAM Basic Operation


Basic DRAM Operation (1)


Basic DRAM Operation (2)


Basic DRAM Operation (3)


Basic DRAM Operation (4)

• Not shown: precharge time, refresh time


Latency Components: Basic DRAM Operation

• CPU → controller transfer time

• Controller latency

– Queuing & scheduling delay at the controller

– Access converted to basic commands

• Controller → DRAM transfer time

• DRAM latency

– Simple CAS if row is “open”, OR

– RAS + CAS if array precharged OR

– PRE + RAS + CAS (worst case)

• DRAM → CPU transfer time (through the controller)
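A sketch of the three DRAM-array latency cases listed above, with hypothetical timing values in controller cycles: a column access alone when the row is open, row activate plus column access when the bank is precharged, and precharge plus activate plus column access on a row conflict.

```c
/* Minimal sketch of per-access DRAM array latency. Timings are assumptions. */
enum row_state { ROW_HIT, ROW_CLOSED, ROW_CONFLICT };

int dram_latency(enum row_state s)
{
    const int tCAS = 3;   /* column access (CAS)        */
    const int tRAS = 4;   /* row activate (RAS)         */
    const int tPRE = 4;   /* precharge (PRE)            */

    switch (s) {
    case ROW_HIT:      return tCAS;                /* row already "open"      */
    case ROW_CLOSED:   return tRAS + tCAS;         /* array precharged        */
    case ROW_CONFLICT: return tPRE + tRAS + tCAS;  /* must close another row  */
    }
    return 0;
}
```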


DRAM Latency Examples

• Often quoted

– tRC = RAS + PRE

– tRAC = RAS + CAS

• Faster DRAMs are possible, but are more expensive

– Non-commodity parts


DRAM DIMMs

• Dual Inline Memory Module (DIMM)

– A PCB with 8 to 16 DRAM chips
– All chips receive identical control and addresses
– Data pins from all chips are directly connected to PCB pins

• Advantages:

– A DIMM acts like a high-capacity DRAM chip with a wide interface
• E.g. use 8 chips with 8-bit interfaces to connect to a 64-bit memory bus

– Easier to replace/add memory in a system
• No need to solder/remove individual chips

• Disadvantage: memory granularity problem (capacity can only be added or replaced in whole-DIMM increments)


Multi-DIMM SDRAM Memory System


DRAM Banks

• Banks are independent arrays WITHIN a chip

– DRAMs today have 4 to 32 banks

• SDRAM/DDR SDRAM system: 4 banks

• RDRAM system: 16-32 banks

• Advantages

– Lower latency

– Higher bandwidth by overlapping

– Finer-grain power management

• Disadvantages

– Bank area overhead

– More complicated control


How Do Multiple Banks Help

[Timing diagram with two timelines showing the address bus, the DRAM array, and the data bus.
Before (no overlapping, assuming accesses to different DRAM rows): each address A0, A1, A2 must wait for the previous request's full DRAM access before its data D0, D1 can appear on the data bus.
After (overlapped accesses, assuming no bank conflicts): addresses A0-A3 go to different banks and are issued back to back, the DRAM accesses overlap, and data D0-D3 stream out with only the data bus as the shared resource.]
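A worked example of the two timelines above, using hypothetical timings: with a single bank the requests serialize on the full DRAM access time, while with independent banks only the data-bus transfers serialize.

```c
/* Worked example (hypothetical timings): 4 back-to-back reads with and
 * without bank-level overlap. */
#include <stdio.h>

int main(void)
{
    int n        = 4;    /* independent read requests                */
    int t_access = 60;   /* ns per DRAM row access (assumption)      */
    int t_data   = 10;   /* ns per burst on the data bus (assumption)*/

    /* One bank: each access must complete before the next one starts. */
    int serial = n * (t_access + t_data);

    /* Different banks, no conflicts: accesses overlap; only the shared
       data bus serializes the bursts. */
    int overlapped = t_access + n * t_data;

    printf("single bank : %d ns\n", serial);       /* 280 ns */
    printf("4 banks     : %d ns\n", overlapped);   /* 100 ns */
    return 0;
}
```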


DRAM Ranks

• A group of chips that responds to a single command & returns data

– E.g. half the chips on a two-sided DIMM

– SDRAM/DDR SDRAM system: 4~6 ranks

– RDRAM system: 32 ranks


DIMMS Revisited


DRAM Channels (Physical & Logical)

• Why more channels?
  – Increase bandwidth
• Cost
  – More board wires
  – More resources in the controller
    • Less if there is a single logical channel
• Multiple physical channels, one logical channel
  – More overlapping across banks
  – No parallel accesses


How Do Multiple Banks/Ranks/Channels Help

[Same timing diagram as on the earlier "How Do Multiple Banks Help" slide: without overlap, each address must wait for the previous DRAM access to finish; with independent banks/ranks/channels the accesses overlap and the data bursts stream out back to back.]


Address Mapping Examples (aka Address Interleaving)

• What are the tradeoffs?

– Think about sequential patterns initially…

• What is fast and what is

slow in memory accesses?

• What about non-sequential

accesses?

• Other issues?
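A sketch of two possible address mappings, with hypothetical field widths. Mapping A places the bank bits just above the cache-line offset, so consecutive lines rotate across banks and sequential streams overlap well; mapping B places the bank bits above the row bits, so a sequential stream stays within one bank's open row but gains no bank overlap.

```c
/* Minimal sketch of two address-interleaving choices. Field widths are
 * assumptions for illustration only. */
#define OFFSET_BITS 6    /* 64 B cache line */
#define COL_BITS    10
#define BANK_BITS   2
#define ROW_BITS    14

static unsigned field(unsigned long a, int shift, int bits)
{
    return (unsigned)((a >> shift) & ((1ul << bits) - 1));
}

/* Mapping A: row | column | bank | offset  (line-interleaved across banks) */
void map_a(unsigned long a, unsigned *row, unsigned *bank, unsigned *col)
{
    *bank = field(a, OFFSET_BITS, BANK_BITS);
    *col  = field(a, OFFSET_BITS + BANK_BITS, COL_BITS);
    *row  = field(a, OFFSET_BITS + BANK_BITS + COL_BITS, ROW_BITS);
}

/* Mapping B: bank | row | column | offset  (large contiguous region per bank) */
void map_b(unsigned long a, unsigned *row, unsigned *bank, unsigned *col)
{
    *col  = field(a, OFFSET_BITS, COL_BITS);
    *row  = field(a, OFFSET_BITS + COL_BITS, ROW_BITS);
    *bank = field(a, OFFSET_BITS + COL_BITS + ROW_BITS, BANK_BITS);
}
```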


DRAM Controllers

• Their role

– Generate proper controls for DRAM DIMMs for each access

– Schedule across banks & potentially reorder DRAM accesses

• Involves queuing & buffering

• Their location

– In the chipset/memory controller/north bridge

– In the processor chip

• Reduces latency & improve BW between CPU & controller

• What makes them complicated

– Variability of timings across different systems/DRAM chips

– Ordering requirements

– Trade-off between latency and bandwidth


DRAM Controller Topologies

• Tradeoffs?

• See optional paper for

examples

– Available on-line…


DRAM Controller Scheduling Policies

• Bank precharging: open or closed

– Open: leave row open until new row request

– Closed: precharge bitlines as soon as current burst satisfied

• Power mode

– Active, stand-by, self-refresh, power-down

• Basic ordering:

– In-order, load-over-store, bank-ready, age-threshold, …

– Remember that ordering matters across banks as well

– All banks share same IO pins

• Advanced ordering:

– Open row first, row with most pending, row with fewest pending, …
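A sketch of how the open- vs. closed-page choice shows up in the controller's per-bank command generation; the command helpers and state are hypothetical placeholders, not a real controller interface.

```c
/* Minimal sketch of open- vs. closed-page policy in a per-bank state machine. */
enum policy { OPEN_PAGE, CLOSED_PAGE };

extern void send_precharge(int bank);
extern void send_activate(int bank, unsigned row);
extern void send_cas(int bank, unsigned col);

struct bank_state { int row_open; unsigned open_row; };

void issue_read(struct bank_state *b, int bank, unsigned row, unsigned col,
                enum policy p)
{
    if (!b->row_open) {
        send_activate(bank, row);        /* RAS: bank was precharged        */
    } else if (b->open_row != row) {
        send_precharge(bank);            /* PRE: row conflict, close old row */
        send_activate(bank, row);
    }
    send_cas(bank, col);                 /* CAS: move the data               */

    if (p == CLOSED_PAGE) {
        send_precharge(bank);            /* close as soon as burst is done   */
        b->row_open = 0;
    } else {
        b->row_open = 1;                 /* leave row open for locality      */
        b->open_row = row;
    }
}
```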


DRAM Evolution: SDRAM & DDR

• SDRAM: 1st synchronous DRAM

– 66 to 133MHz with multiplexed address bus

– 4 banks

– Programmable burst (1 to 8)

• DDR SDRAM: double data rate (both clock edges)

– 100 to 266MHz with multiplexed address bus

– 4 banks

– Programmable burst (2 to 8)

• DDR2

– 200 to 333MHz, 4 banks, 4-8 burst, …

• Over time:

– Clock rate ↑, minimum burst ↑, banks ↑, …


DDR vs. Rambus

DDR: 200 MHz, 64-bit bus

Rambus (RDRAM): 800 MHz, 16-bit bus

• Many banks/chip (4-32)

• Narrow fast interconnect (pipelined)

• High bandwidth

• Latency & area penalty
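A worked peak-bandwidth comparison of the two buses shown above, assuming the DDR figure is a 200 MHz clock with two transfers per cycle and the Rambus figure is 800 M transfers/s on a 16-bit bus (these interpretations are assumptions). The per-pin numbers illustrate why the narrow, fast, pipelined Rambus interconnect delivers high bandwidth per pin despite the smaller total.

```c
/* Worked example: peak bandwidth and bandwidth per data pin. */
#include <stdio.h>

int main(void)
{
    double ddr_peak    = 200e6 * 2 * 8;   /* 200 MHz x 2 transfers x 8 B = 3.2 GB/s */
    double rambus_peak = 800e6 * 2;       /* 800 MT/s x 2 B              = 1.6 GB/s */

    printf("DDR    64-bit @ 200 MHz : %.1f GB/s (%.0f MB/s per data pin)\n",
           ddr_peak / 1e9, ddr_peak / 64 / 1e6);      /* ~50 MB/s per pin  */
    printf("Rambus 16-bit @ 800 MT/s: %.1f GB/s (%.0f MB/s per data pin)\n",
           rambus_peak / 1e9, rambus_peak / 16 / 1e6); /* ~100 MB/s per pin */
    return 0;
}
```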


Other DRAM Options

• GDDRx: DRAM specialized for graphics

– Unidirectional signaling, higher clock rate, lower tRC, …

• RLDRAM/FCDRAM: reduced latency / fast cycle DRAM

– Mostly targeted toward L3 caches & telecom gear

– Wider bus, low tRC/tRAC, non multiplexed address bus, small bursts

• ESDRAM: 1T SRAM (SRAM replacement)

– 16 banks, hidden refresh, 4-6 cycle latency, large bursts

• VCDRAM: virtual channel DRAM

– Includes a small SRAM cache

• Mobile DRAM

– Low cost and low power design, hidden refresh


Fully Buffered DIMM (FB-DIMM)

• The DDR problem

– Higher capacity → more DIMMs on the bus → lower data rate (multidrop bus)

• FBDIMM approach: use point-to-point links

– While still using commodity DRAM chips

– Network with 12-beat packets, separate up/downstream wires


Advanced Memory Buffer


Fully Buffered DIMM (FB-DIMM)

• Watch out for:

– Asymmetric upstream/downstream

– Requires deep channel for maximum bandwidth efficiency

– Power overhead of current generation AMBs


System Level Choices for DRAM


What Processor Vendors Are Currently Supporting


How to Select a DRAM Architecture

Don’t just make a decision based on specs!

• Bandwidth: measure for your own workload

– Mix of reads/writes, bursts, locality, strides, …

– Different architectures/chips are optimized for different cases

• Latency: typically not critical but…

– Don’t forget other latency contributors (e.g. DRAM controller)

• Cost:

– pins (board traces), signaling, cost/DRAM bit

• Power:

– Voltage, power modes, …

• Risk:

– Number of suppliers


DRAM Trends to Keep in Mind

• DRAMs: capacity +60%, cost –30% per year

– 2.5x cells/area, 1.5x die size in ~3 years

• ‘98 DRAM fabrication line costs $2B

– DRAM only: density, leakage v. speed

• Rely on increasing number of computers & memory per computer (60% market)

– DIMM is the replaceable unit → computers can use any generation DRAM

• Commodity, second-source industry → high volume, low profit, conservative

– Little organizational innovation in 20 years

• Order of importance: 1) Cost/bit 2) Capacity

– First Rambus: 10x BW, +30% cost → little impact


Embedded DRAM

• The inevitable: CPU & DRAM integration

• Embedded DRAM, Merged-DRAM-logic, intelligent RAM, …

– Allows for high bandwidth

• Multiple wide busses, switched interconnect

– Allows for low latency

– Current set of problems

• Cost and capacity of single chip

• Alternatives

– MCM packaging

– 3D packaging


Embedded DRAM Example

• VIRAM media processor

– 125M transistors

– 200MHz, 2 Watt

• Embedded DRAM

– 13 Mbytes

– 8 banks

– 6.4GB/sec per bank (peak)

• Processor

– 4-lane vector processor

• 6.4 Gop/sec

– 64-bit MIPS core

[Figure: VIRAM floorplan. Eight DRAM banks sit on either side of a crossbar that connects them to the multimedia (vector) CPU, the MIPS CPU, and the I/O interface.]


Non-volatile Memory (Flash)

• Storage technology

– Charge trapped in a floating gate

– Retains information even without power supply

• Two design alternatives

– NOR: used primarily for code

• better E/W endurance (100K vs 10K), fast reads (100ns), slow writes (10usec)

– NAND: used primarily for data

• Smaller cell (~40%), reads and writes are 1usec

• Applications

– MP3 players, cameras, …

– Hard disk replacement

– Main memory replacement or assist?