
EE282 – Fall 2008 Christos Kozyrakis Lecture 5 - 1

Department of Electrical Engineering

Stanford University

http://eeclass.stanford.edu/ee282

Lecture 5

Main Memory Systems


Announcements

• If you don’t have a group of 3, contact us ASAP

• HW-1 is due on 10/15, 5pm (no extensions, no exceptions)

– Bring to lecture or drop off in box outside Gates Hall 310

• PA-1 will be out on Thu

– Discussion session on PA1

• Fri 10/10, 11am, Skilling 193

– START EARLY!


Review: Prefetching

• Idea: fetch data into the cache before processors request them

– Can address cold misses

– Can be done by the programmer, compiler, or hardware

• Characteristics of ideal prefetching

– You only prefetch data that are truly needed

• Avoid bandwidth waste

– You issue prefetch requests early enough

• To hide the memory latency

– You don’t issue prefetch requests too early

• To avoid cache pollution
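As an illustration of programmer- or compiler-directed prefetching, the sketch below uses GCC's __builtin_prefetch to request data a fixed distance ahead of the current iteration. The prefetch distance is a hypothetical tuning value; it must be large enough to hide memory latency but not so large that prefetched lines are evicted before use.

```c
/* Minimal sketch of software prefetching in a reduction loop.
 * PREFETCH_DIST is an assumed tuning knob, not a recommended value. */
#define PREFETCH_DIST 64   /* elements ahead of the current iteration */

double sum_array(const double *a, long n)
{
    double sum = 0.0;
    for (long i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
            __builtin_prefetch(&a[i + PREFETCH_DIST], 0 /* read */, 3 /* high locality */);
        sum += a[i];
    }
    return sum;
}
```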


Review: Stream Prefetching or Stream Buffers

• Sequential prefetching problem:

– Performance slows down once every N cache lines

• Stream prefetching is a continuous version of sequential prefetching

– Stream buffer can fit N cache lines

– On a miss, start fetching N sequential cache lines

– On a stream buffer hit:

• Move cache line to cache, start fetching line (N+1)

• In other words, stream buffer tries to stay N cache lines ahead

• Design issues

– When is a stream buffer released?

• When we miss both in the cache and the stream buffer

– Can use multiple stream buffers to capture multiple streams

• E.g. a program operating on 2 arrays
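A minimal sketch of the stream-buffer policy described above, assuming a buffer of N sequential lines; install_in_cache() and start_fetch() are hypothetical placeholders for the cache-fill and memory-request paths, not a real API.

```c
#define N 4            /* stream buffer depth (assumption) */
#define LINE_SIZE 64   /* cache line size in bytes (assumption) */

extern void install_in_cache(unsigned long line_addr);
extern void start_fetch(unsigned long line_addr);

struct stream_buffer {
    unsigned long line[N];   /* addresses of prefetched cache lines */
    int valid[N];
    unsigned long next;      /* next sequential line to prefetch */
};

/* Called on an L1 miss for cache line address 'line_addr'. */
void on_cache_miss(struct stream_buffer *sb, unsigned long line_addr)
{
    if (sb->valid[0] && sb->line[0] == line_addr) {
        /* Stream buffer hit: move the head line to the cache, shift the
           buffer, and fetch one more line so we stay N lines ahead. */
        install_in_cache(line_addr);
        for (int i = 0; i < N - 1; i++) {
            sb->line[i]  = sb->line[i + 1];
            sb->valid[i] = sb->valid[i + 1];
        }
        sb->line[N - 1]  = sb->next;
        sb->valid[N - 1] = 1;
        start_fetch(sb->next);
        sb->next += LINE_SIZE;
    } else {
        /* Miss in cache and buffer: release the buffer and start a new
           stream of N sequential lines after the missing one. */
        for (int i = 0; i < N; i++) {
            sb->line[i]  = line_addr + (unsigned long)(i + 1) * LINE_SIZE;
            sb->valid[i] = 1;
            start_fetch(sb->line[i]);
        }
        sb->next = line_addr + (unsigned long)(N + 1) * LINE_SIZE;
    }
}
```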


Stream Buffer Design


Strided Prefetching

• Idea: detect and prefetch strided accesses

– for (i=0; i&lt;N; i++) A[i*1024]++;

• Stride detected using a PC-based table

– For each PC, remember the stride

– Stride detection

• Remember the last address used for this PC

• Compare to currently used address for this PC

– Track confidence using a two bit saturating counter

• Increment when stride correct, decrement when incorrect

• How to use the PC-based table

– When stream prefetching is initiated, direct it to prefetch at the detected stride (see the sketch after this list)

– Everything else remains the same
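A sketch of the PC-indexed stride table with a 2-bit saturating confidence counter, as outlined above. The table size, hash, and issue_prefetch() helper are assumptions.

```c
/* Minimal stride-detection table sketch. */
#define TABLE_ENTRIES 256
extern void issue_prefetch(unsigned long addr);

struct stride_entry {
    unsigned long last_addr;  /* last data address seen for this PC   */
    long stride;              /* last observed stride                  */
    unsigned conf;            /* 2-bit saturating counter, 0..3        */
};

static struct stride_entry table[TABLE_ENTRIES];

/* Called for every load, with the load's PC and data address. */
void on_load(unsigned long pc, unsigned long addr)
{
    struct stride_entry *e = &table[(pc >> 2) % TABLE_ENTRIES];
    long stride = (long)(addr - e->last_addr);

    if (stride == e->stride) {
        if (e->conf < 3) e->conf++;            /* stride confirmed: increment */
    } else {
        if (e->conf > 0) e->conf--;            /* mispredicted: decrement     */
        else e->stride = stride;               /* counter at 0: retrain       */
    }
    e->last_addr = addr;

    if (e->conf >= 2 && e->stride != 0)        /* confident: prefetch ahead   */
        issue_prefetch(addr + (unsigned long)e->stride);
}
```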


Other Ideas in Prefetching

• Prefetch engines for pointer-based data structures

– Predict if fetched data contain a pointer & follow it

– Works for linked-lists, graphs, etc

– Must be very careful:
• What is a pointer?

• How far to prefetch?

• Correlating prefetchers

– Learn about address correlation (A, B, C always accessed in order)
• When A is accessed, immediately fetch B & C

– Can use a PC-based table or a Markov prefetcher

• Pre-execution or run-ahead

– Distill the part of the program that generates addresses

– Run this program on other processor/thread to generate prefetches
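For the correlating prefetcher above, a minimal sketch of a one-successor (Markov) table: it remembers which miss address followed each miss address last time and, on a repeat miss, prefetches the recorded successor. Table size, hashing, and issue_prefetch() are assumptions.

```c
/* Minimal Markov/correlation prefetcher sketch (one successor per entry). */
#define ENTRIES 1024
extern void issue_prefetch(unsigned long addr);

static struct { unsigned long miss, next; } corr[ENTRIES];
static unsigned long prev_miss;

/* Called on every cache miss with the missing line address. */
void on_miss(unsigned long addr)
{
    unsigned prev_idx = (unsigned)((prev_miss >> 6) % ENTRIES);
    unsigned idx      = (unsigned)((addr      >> 6) % ENTRIES);

    /* Learn: record that 'addr' followed 'prev_miss' last time. */
    corr[prev_idx].miss = prev_miss;
    corr[prev_idx].next = addr;
    prev_miss = addr;

    /* Predict: if this miss has been seen before, prefetch its successor. */
    if (corr[idx].miss == addr && corr[idx].next != 0)
        issue_prefetch(corr[idx].next);
}
```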


Today’s Menu: Main Memory Systems

• Memory basics

– DRAM vs. SRAM

• DRAM

– Basic operation

– System organization

– DRAM chip architectures

• DRAM controller

• How to improve the memory system bandwidth and latency

• Acknowledgements: Bruce Jacob, University of Maryland

– Extensive research & teaching on modern DRAMs

– http://www.ece.umd.edu/~blj/

• See two optional papers online


Computer System (PC) Overview


General Memory Background

Read access sequence:

1. Decode row address &

drive word-lines

2. Selected bits drive bit-lines

• Entire row read

3. Amplify row data

4. Decode column address &

select subset of row

• Send to output

5. Precharge bit-lines

• For next access

[Figure: 2D storage array. An address register feeds the row decoder (MS bits of the address) and the column decoder (LS bits); the column decoder selects a subset of the amplified row and drives it to the data output.]
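A tiny sketch of the address split implied by the access sequence above: the MS bits of the address select the row, and the LS bits select the column within that row. The column-bit width is hypothetical.

```c
/* Minimal sketch of splitting a memory-array address into row and column. */
#define COL_BITS 10   /* 1024 columns per row (assumption) */

struct rc { unsigned row, col; };

struct rc decode(unsigned long addr)
{
    struct rc d;
    d.col = (unsigned)(addr & ((1ul << COL_BITS) - 1));  /* LS bits -> column */
    d.row = (unsigned)(addr >> COL_BITS);                /* MS bits -> row    */
    return d;
}
```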


Memory Terminology

• Access time (latency)

– Time from issuing an address to data out

• Cycle time

– Minimum time between two requests (repeat rate)

• Bandwidth

– Bytes/unit of time we can extract from the memory

• Peak: ignore initial latency

• Sustained: include initial latency

• Concurrency

– Number of accesses executing in parallel or overlapped manner

– Can help increase bandwidth or improve latency
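A worked example of the peak vs. sustained distinction, with hypothetical numbers: the sustained rate for isolated block transfers includes the initial access latency, so it sits well below the peak streaming rate.

```c
/* Worked example (hypothetical numbers): peak vs. sustained bandwidth for
 * isolated 64-byte block transfers. 1 GB/s corresponds to 1 byte/ns here. */
#include <stdio.h>

int main(void)
{
    double latency_ns  = 50.0;  /* time until the first byte arrives (assumption) */
    double peak_gbps   = 3.2;   /* streaming rate once data is flowing            */
    double block_bytes = 64.0;  /* one cache line                                  */

    double transfer_ns = block_bytes / peak_gbps;           /* 20 ns            */
    double sustained   = block_bytes / (latency_ns + transfer_ns);

    printf("peak      = %.2f GB/s\n", peak_gbps);
    printf("sustained = %.2f GB/s per isolated access\n", sustained);  /* ~0.91 */
    return 0;
}
```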


SRAM vs DRAM

SRAM:
• 6-transistor storage cell
  – Retains value while power is on
  – Non-destructive reads
• Cycle time == access time
• Wide interfaces
• Typical product today: 1-16 Mbit, 2-15 ns access time

DRAM:
• 1-transistor + 1-capacitor storage cell
  – Requires refresh
  – Destructive reads
• Cycle time > access time
• Narrower interfaces (4b to 32b)
• Typical product today: 64 Mb-1 Gb, 5-40 ns access time, 8-60 ns cycle time

[Figure: DRAM storage cell. Each bit is a capacitor C connected to the bit line through an access transistor controlled by the word line; a sense amp at the end of the bit line detects and restores the stored value.]


SRAM vs. DRAM: Considerations

• SRAM is preferable for register files & L1/L2 caches

– Fast access

– No refreshes

– Simpler manufacturing

• DRAM is preferable for stand-alone memory chips

– Much higher capacity

• 10x and growing

– Better immunity to soft errors

– Latency dominated by board traces anyway

• There is some gray area in the middle


DRAM Basic Operation


Basic DRAM Operation (1)


Basic DRAM Operation (2)


Basic DRAM Operation (3)


Basic DRAM Operation (4)

• Not shown: precharge time, refresh time


Latency Components: Basic DRAM Operation

• CPU → controller transfer time

• Controller latency

– Queuing & scheduling delay at the controller

– Access converted to basic commands

• Controller → DRAM transfer time

• DRAM latency

– Simple CAS if row is “open”, OR

– RAS + CAS if array precharged OR

– PRE + RAS + CAS (worst case)

• DRAM → CPU transfer time (through the controller)
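A sketch of the three DRAM-array latency cases listed above, with hypothetical timing values in controller cycles: a column access alone when the row is open, row activate plus column access when the bank is precharged, and precharge plus activate plus column access on a row conflict.

```c
/* Minimal sketch of per-access DRAM array latency. Timings are assumptions. */
enum row_state { ROW_HIT, ROW_CLOSED, ROW_CONFLICT };

int dram_latency(enum row_state s)
{
    const int tCAS = 3;   /* column access (CAS)        */
    const int tRAS = 4;   /* row activate (RAS)         */
    const int tPRE = 4;   /* precharge (PRE)            */

    switch (s) {
    case ROW_HIT:      return tCAS;                /* row already "open"      */
    case ROW_CLOSED:   return tRAS + tCAS;         /* array precharged        */
    case ROW_CONFLICT: return tPRE + tRAS + tCAS;  /* must close another row  */
    }
    return 0;
}
```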


DRAM Latency Examples

• Often quoted

– tRC = RAS + PRE

– tRAC = RAS + CAS

• Faster DRAMs are possible, but are more expensive

– Non-commodity parts


DRAM DIMMs

• Dual Inline Memory Module (DIMM)

– A PCB with 8 to 16 DRAM chips
– All chips receive identical control and addresses
– Data pins from all chips are directly connected to PCB pins

• Advantages:

– A DIMM acts like a high-capacity DRAM chip with a wide interface
• E.g. use 8 chips with 8-bit interfaces to connect to a 64-bit memory bus

– Easier to replace/add memory in a system
• No need to solder/remove individual chips

• Disadvantage: memory granularity problem (capacity can only be added or replaced in whole-DIMM increments)


Multi-DIMM SDRAM Memory System


DRAM Banks

• Banks are independent arrays WITHIN a chip

– DRAMs today have 4 to 32 banks

• SDRAM/DDR SDRAM system: 4 banks

• RDRAM system: 16-32 banks

• Advantages

– Lower latency

– Higher bandwidth by overlapping

– Finer-grain power management

• Disadvantages

– Bank area overhead

– More complicated control


How Do Multiple Banks Help

[Timing diagram with two timelines showing the address bus, the DRAM array, and the data bus.
Before (no overlapping, assuming accesses to different DRAM rows): each address A0, A1, A2 must wait for the previous request's full DRAM access before its data D0, D1 can appear on the data bus.
After (overlapped accesses, assuming no bank conflicts): addresses A0-A3 go to different banks and are issued back to back, the DRAM accesses overlap, and data D0-D3 stream out with only the data bus as the shared resource.]
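A worked example of the two timelines above, using hypothetical timings: with a single bank the requests serialize on the full DRAM access time, while with independent banks only the data-bus transfers serialize.

```c
/* Worked example (hypothetical timings): 4 back-to-back reads with and
 * without bank-level overlap. */
#include <stdio.h>

int main(void)
{
    int n        = 4;    /* independent read requests                */
    int t_access = 60;   /* ns per DRAM row access (assumption)      */
    int t_data   = 10;   /* ns per burst on the data bus (assumption)*/

    /* One bank: each access must complete before the next one starts. */
    int serial = n * (t_access + t_data);

    /* Different banks, no conflicts: accesses overlap; only the shared
       data bus serializes the bursts. */
    int overlapped = t_access + n * t_data;

    printf("single bank : %d ns\n", serial);       /* 280 ns */
    printf("4 banks     : %d ns\n", overlapped);   /* 100 ns */
    return 0;
}
```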


DRAM Ranks

• A group of chips that responds to a single command & returns data

– E.g. half the chips on a two-sided DIMM

– SDRAM/DDR SDRAM system: 4~6 ranks

– RDRAM system: 32 ranks


DIMMS Revisited


DRAM Channels (Physical & Logical)

• Why more channels?
  – Increase bandwidth
• Cost
  – More board wires
  – More resources in the controller
    • Less if there is a single logical channel
• Multiple physical channels, one logical channel
  – More overlapping across banks
  – No parallel accesses


How Do Multiple Banks/Ranks/Channels Help

[Same timing diagram as on the earlier "How Do Multiple Banks Help" slide: without overlap, each address must wait for the previous DRAM access to finish; with independent banks/ranks/channels the accesses overlap and the data bursts stream out back to back.]


Address Mapping Examples (aka Address Interleaving)

• What are the tradeoffs?

– Think about sequential patterns initially…

• What is fast and what is

slow in memory accesses?

• What about non-sequential

accesses?

• Other issues?
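A sketch of two possible address mappings, with hypothetical field widths. Mapping A places the bank bits just above the cache-line offset, so consecutive lines rotate across banks and sequential streams overlap well; mapping B places the bank bits above the row bits, so a sequential stream stays within one bank's open row but gains no bank overlap.

```c
/* Minimal sketch of two address-interleaving choices. Field widths are
 * assumptions for illustration only. */
#define OFFSET_BITS 6    /* 64 B cache line */
#define COL_BITS    10
#define BANK_BITS   2
#define ROW_BITS    14

static unsigned field(unsigned long a, int shift, int bits)
{
    return (unsigned)((a >> shift) & ((1ul << bits) - 1));
}

/* Mapping A: row | column | bank | offset  (line-interleaved across banks) */
void map_a(unsigned long a, unsigned *row, unsigned *bank, unsigned *col)
{
    *bank = field(a, OFFSET_BITS, BANK_BITS);
    *col  = field(a, OFFSET_BITS + BANK_BITS, COL_BITS);
    *row  = field(a, OFFSET_BITS + BANK_BITS + COL_BITS, ROW_BITS);
}

/* Mapping B: bank | row | column | offset  (large contiguous region per bank) */
void map_b(unsigned long a, unsigned *row, unsigned *bank, unsigned *col)
{
    *col  = field(a, OFFSET_BITS, COL_BITS);
    *row  = field(a, OFFSET_BITS + COL_BITS, ROW_BITS);
    *bank = field(a, OFFSET_BITS + COL_BITS + ROW_BITS, BANK_BITS);
}
```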


DRAM Controllers

• Their role

– Generate proper controls for DRAM DIMMs for each access

– Schedule across banks & potentially reorder DRAM accesses

• Involves queuing & buffering

• Their location

– In the chipset/memory controller/north bridge

– In the processor chip

• Reduces latency & improve BW between CPU & controller

• What makes them complicated

– Variability of timings across different systems/DRAM chips

– Ordering requirements

– Trade-off between latency and bandwidth


DRAM Controller Topologies

• Tradeoffs?

• See optional paper for

examples

– Available on-line…


DRAM Controller Scheduling Policies

• Bank precharging: open or closed

– Open: leave row open until new row request

– Closed: precharge bitlines as soon as current burst satisfied

• Power mode

– Active, stand-by, self-refresh, power-down

• Basic ordering:

– In-order, load-over-store, bank-ready, age-threshold, …

– Remember that ordering matters across banks as well

– All banks share same IO pins

• Advanced ordering:

– Open row first, row with most pending, row with fewest pending, …
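A sketch of how the open- vs. closed-page choice shows up in the controller's per-bank command generation; the command helpers and state are hypothetical placeholders, not a real controller interface.

```c
/* Minimal sketch of open- vs. closed-page policy in a per-bank state machine. */
enum policy { OPEN_PAGE, CLOSED_PAGE };

extern void send_precharge(int bank);
extern void send_activate(int bank, unsigned row);
extern void send_cas(int bank, unsigned col);

struct bank_state { int row_open; unsigned open_row; };

void issue_read(struct bank_state *b, int bank, unsigned row, unsigned col,
                enum policy p)
{
    if (!b->row_open) {
        send_activate(bank, row);        /* RAS: bank was precharged        */
    } else if (b->open_row != row) {
        send_precharge(bank);            /* PRE: row conflict, close old row */
        send_activate(bank, row);
    }
    send_cas(bank, col);                 /* CAS: move the data               */

    if (p == CLOSED_PAGE) {
        send_precharge(bank);            /* close as soon as burst is done   */
        b->row_open = 0;
    } else {
        b->row_open = 1;                 /* leave row open for locality      */
        b->open_row = row;
    }
}
```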


DRAM Evolution: SDRAM & DDR

• SDRAM: 1st synchronous DRAM

– 66 to 133MHz with multiplexed address bus

– 4 banks

– Programmable burst (1 to 8)

• DDR SDRAM: double data rate (both clock edges)

– 100 to 266MHz with multiplexed address bus

– 4 banks

– Programmable burst (2 to 8)

• DDR2

– 200 to 333MHz, 4 banks, 4-8 burst, …

• Over time:

– Clock rate ↑, minimum burst ↑, banks ↑, …


DDR vs. Rambus

DDR: 200 MHz, 64-bit bus

Rambus (RDRAM): 800 MHz, 16-bit bus

• Many banks/chip (4-32)

• Narrow fast interconnect (pipelined)

• High bandwidth

• Latency & area penalty
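A worked peak-bandwidth comparison of the two buses shown above, assuming the DDR figure is a 200 MHz clock with two transfers per cycle and the Rambus figure is 800 M transfers/s on a 16-bit bus (these interpretations are assumptions). The per-pin numbers illustrate why the narrow, fast, pipelined Rambus interconnect delivers high bandwidth per pin despite the smaller total.

```c
/* Worked example: peak bandwidth and bandwidth per data pin. */
#include <stdio.h>

int main(void)
{
    double ddr_peak    = 200e6 * 2 * 8;   /* 200 MHz x 2 transfers x 8 B = 3.2 GB/s */
    double rambus_peak = 800e6 * 2;       /* 800 MT/s x 2 B              = 1.6 GB/s */

    printf("DDR    64-bit @ 200 MHz : %.1f GB/s (%.0f MB/s per data pin)\n",
           ddr_peak / 1e9, ddr_peak / 64 / 1e6);      /* ~50 MB/s per pin  */
    printf("Rambus 16-bit @ 800 MT/s: %.1f GB/s (%.0f MB/s per data pin)\n",
           rambus_peak / 1e9, rambus_peak / 16 / 1e6); /* ~100 MB/s per pin */
    return 0;
}
```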


Other DRAM Options

• GDDRx: DRAM specialized for graphics

– Unidirectional signaling, higher clock rate, lower tRC, …

• RLDRAM/FCDRAM: reduced latency / fast cycle DRAM

– Mostly targeted toward L3 caches & telecom gear

– Wider bus, low tRC/tRAC, non multiplexed address bus, small bursts

• ESDRAM: 1T SRAM (SRAM replacement)

– 16 banks, hidden refresh, 4-6 cycle latency, large bursts

• VCDRAM: virtual channel DRAM

– Includes a small SRAM cache

• Mobile DRAM

– Low cost and low power design, hidden refresh


Fully Buffered DIMM (FB-DIMM)

• The DDR problem

– Higher capacity → more DIMMs on the bus → lower data rate (multidrop bus)

• FBDIMM approach: use point-to-point links

– While still using commodity DRAM chips

– Network with 12-beat packets, separate up/downstream wires


Advanced Memory Buffer


Fully Buffered DIMM (FB-DIMM)

• Watch out for:

– Asymmetric upstream/downstream

– Requires deep channel for maximum bandwidth efficiency

– Power overhead of current generation AMBs


System Level Choices for DRAM


What Processor Vendors Are Currently Supporting


How to Select a DRAM Architecture

Don’t just make a decision based on specs!

• Bandwidth: measure for your own workload

– Mix of reads/writes, bursts, locality, strides, …

– Different architectures/chips are optimized for different cases

• Latency: typically not critical but…

– Don’t forget other latency contributors (e.g. DRAM controller)

• Cost:

– pins (board traces), signaling, cost/DRAM bit

• Power:

– Voltage, power modes, …

• Risk:

– Number of suppliers


DRAM Trends to Keep in Mind

• DRAMs: capacity +60%, cost –30% per year

– 2.5x cells/area, 1.5x die size in ~3 years

• ‘98 DRAM fabrication line costs $2B

– DRAM only: density, leakage v. speed

• Rely on increasing number of computers & memory per computer (60% market)

– DIMM is the replaceable unit → computers can use any generation DRAM

• Commodity, second-source industry → high volume, low profit, conservative

– Little organizational innovation in 20 years

• Order of importance: 1) Cost/bit 2) Capacity

– First Rambus: 10x BW, +30% cost → little impact


Embedded DRAM

• The inevitable: CPU & DRAM integration

• Embedded DRAM, Merged-DRAM-logic, intelligent RAM, …

– Allows for high bandwidth

• Multiple wide busses, switched interconnect

– Allows for low latency

– Current set of problems

• Cost and capacity of single chip

• Alternatives

– MCM packaging

– 3D packaging


Embedded DRAM Example

• VIRAM media processor

– 125M transistors

– 200MHz, 2 Watt

• Embedded DRAM

– 13 Mbytes

– 8 banks

– 6.4GB/sec per bank (peak)

• Processor

– 4-lane vector processor

• 6.4 Gop/sec

– 64-bit MIPS core

[Figure: VIRAM floorplan. Eight DRAM banks sit on either side of a crossbar that connects them to the multimedia (vector) CPU, the MIPS CPU, and the I/O interface.]


Non-volatile Memory (Flash)

• Storage technology

– Charge trapped in a floating gate

– Retains information even without power supply

• Two design alternatives

– NOR: used primarily for code

• better E/W endurance (100K vs 10K), fast reads (100ns), slow writes (10usec)

– NAND: used primarily for data

• Smaller cell (~40%), reads and writes are 1usec

• Applications

– MP3 players, cameras, …

– Hard disk replacement

– Main memory replacement or assist?