
EECC722 - Shaaban #1 lec # 12 Fall 2001 10-29-2001

Computer System Components

[System block diagram with component annotations:]

CPU Core: 500 MHZ - 2.0 GHZ, 4-way superscalar RISC or RISC-core (x86): deep instruction pipelines, dynamic scheduling, multiple FP/integer FUs, dynamic branch prediction, hardware speculation.

Caches (all non-blocking):
L1: 16-64K, 1-2 way set associative (on chip), separate or unified
L2: 128K-1M, 4-16 way set associative (on chip), unified
L3: 1-16M, 8-16 way set associative (off chip), unified

System Bus examples: Alpha, AMD K7: EV6, 200-266MHZ; Intel PII, PIII: GTL+, 100MHZ; Intel P4: 400MHZ.

Main memory (reached over the memory bus through the memory controller):
SDRAM PC100/PC133: 100-133MHZ, 64-128 bits wide, 2-way interleaved, ~900 MBYTES/SEC (64 bit)
Double Data Rate (DDR) SDRAM PC2100: 266MHZ, 64-128 bits wide, 4-way interleaved, ~2.1 GBYTES/SEC (64 bit)
RAMbus DRAM (RDRAM): 400-800MHZ, 16 bits wide, ~1.6 GBYTES/SEC

I/O buses (example: PCI, 33MHZ, 32 bits wide, 133 MBYTES/SEC) connect controllers and adapters for the I/O devices: disks, displays, keyboards, networks (NICs).
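The quoted peak rates follow directly from bus width times transfer rate. A minimal sketch (the helper name and the transfers-per-clock figures are illustrative assumptions; the ~900 MBYTES/SEC figure above is a sustained, interleaved estimate that this peak calculation does not model):

```python
# Peak bandwidth = bus width (bytes) x clock rate (MHZ) x transfers per clock.
def peak_bw_mbytes(width_bits: int, clock_mhz: float, transfers_per_clock: int = 1) -> float:
    return width_bits / 8 * clock_mhz * transfers_per_clock

print(peak_bw_mbytes(32, 33))        # PCI: ~133 MBYTES/SEC
print(peak_bw_mbytes(64, 133))       # PC133 SDRAM, 64-bit: ~1064 MBYTES/SEC peak
print(peak_bw_mbytes(64, 133, 2))    # DDR PC2100 (133 MHZ x 2): ~2128, i.e. ~2.1 GBYTES/SEC
print(peak_bw_mbytes(16, 400, 2))    # RDRAM (400 MHZ x 2): ~1600, i.e. ~1.6 GBYTES/SEC
```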

EECC722 - Shaaban #2 lec # 12 Fall 2001 10-29-2001

Main Memory
• Main memory generally utilizes Dynamic RAM (DRAM), which uses a single transistor to store a bit but requires a periodic data refresh by reading every row (~every 8 msec).

• Static RAM may be used if the added expense, lower density, power consumption, and complexity are acceptable (e.g. Cray vector supercomputers).

• Main memory performance is affected by:

– Memory latency: Affects cache miss penalty. Measured by (see the sketch after this list):

• Access time: The time between when a memory access request is issued to main memory and when the requested information is available to the cache/CPU.

• Cycle time: The minimum time between requests to memory (greater than the access time in DRAM, to allow the address lines to stabilize between accesses).

– Memory bandwidth: The sustained data transfer rate between main memory and cache/CPU.
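A toy numeric illustration of the two latency measures (the timing values are assumed for illustration, not taken from the lecture): back-to-back requests are paced by the cycle time, not the access time, so the sustained request rate is lower than the latency alone suggests.

```python
# Assumed DRAM timing, purely to illustrate access time vs. cycle time.
access_time_ns = 60.0    # request issued -> data available to the cache/CPU
cycle_time_ns  = 90.0    # minimum spacing between successive requests (> access time)

latency_only_rate = 1e9 / access_time_ns    # requests/sec if only access time mattered
sustained_rate    = 1e9 / cycle_time_ns     # rate the DRAM can actually sustain
print(f"{latency_only_rate/1e6:.1f} vs {sustained_rate/1e6:.1f} million requests/sec")
```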

EECC722 - Shaaban #3 lec # 12 Fall 2001 10-29-2001

Processor-Memory (DRAM) Performance Gap

[Plot: relative performance (log scale, 1 to 1000) versus year, 1980-2000. Processor performance improves ~60% per year (µProc 60%/yr.) while DRAM performance improves ~10% per year (DRAM 10%/yr.), so the processor-memory performance gap grows about 50% per year.]

Current memory access latency: 100 or more CPU cycles
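A minimal sketch of how the per-year rates quoted in the plot compound into this gap (only the ~60%/yr and ~10%/yr rates come from the slide; the absolute multiples are illustrative):

```python
# Compound ~60%/yr processor and ~10%/yr DRAM improvement from a 1980 baseline.
cpu = dram = 1.0
for year in range(1980, 2001):
    if year % 5 == 0:
        print(f"{year}: CPU {cpu:9.1f}x   DRAM {dram:5.1f}x   gap {cpu/dram:7.1f}x")
    cpu  *= 1.60   # processor performance: ~60% per year
    dram *= 1.10   # DRAM performance: ~10% per year
# The gap ratio grows by 1.60/1.10 each year, i.e. roughly 50% per year, as the plot notes.
```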

EECC722 - Shaaban #4 lec # 12 Fall 2001 10-29-2001

X86 CPU Cache/Memory Performance Gap Example: AMD Athlon T-Bird Vs. Intel PIII Vs. P4

AMD Athlon T-Bird, 1GHZ: L1: 64K INST, 64K DATA (3 cycle latency), both 2-way; L2: 256K, 16-way, 64 bit, latency 7 cycles; L1, L2 on-chip.

Intel PIII, 1 GHZ: L1: 16K INST, 16K DATA (3 cycle latency), both 4-way; L2: 256K, 8-way, 256 bit, latency 7 cycles; L1, L2 on-chip.

Intel P4, 1.5 GHZ: L1: 8K INST, 8K DATA (2 cycle latency), both 4-way, plus 96KB Execution Trace Cache; L2: 256K, 8-way, 256 bit, latency 7 cycles; L1, L2 on-chip.

Source: http://www1.anandtech.com/showdoc.html?i=1360&p=15

High L1, L2 data miss rates mean main memory is frequently accessed for data.

EECC722 - Shaaban #5 lec # 12 Fall 2001 10-29-2001

Logical DRAM Organization

• The address is multiplexed: the square root of the total number of bits is selected per RAS/CAS phase (row address first, then column address, over the same pins).

[Figure: an 11-bit address A0-A10 feeds the row decoder (RAS latches the row, which drives a word line of the 2,048 x 2,048 memory array) and then the column decoder (CAS latches the column, which selects one bit through the sense amps & I/O onto the D/Q data pins). Each storage cell sits at a word line/bit line intersection.]
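A small sketch of the address multiplexing described above (the linear bit-address numbering is an assumption made for illustration; only the 2,048 x 2,048 array size and the A0-A10 pins come from the slide):

```python
ROWS, COLS = 2048, 2048     # 4 Mbit square array: 11 row bits + 11 column bits

def split_address(bit_address: int) -> tuple[int, int]:
    """Row is presented first on A0-A10 (latched by RAS), then the column (latched by CAS)."""
    assert 0 <= bit_address < ROWS * COLS
    return bit_address // COLS, bit_address % COLS

print(split_address(1_000_000))   # -> (488, 576)
```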

EECC722 - Shaaban #6 lec # 12 Fall 2001 10-29-2001

CPU/Memory Performance Gap Reduction Techniques

• Latency Reduction/Higher Bandwidth:

– Memory hierarchy with one or more levels of cache + software/hardware cache performance enhancement techniques (a high level of locality is essential).

– Low latency, wider, faster system bus.

– Independent memory bank interleaving.

– Lower latency, high-bandwidth memory interfaces:
• Current: DDR SDRAM, Direct Rambus (DRDRAM)
• Future: Magnetic RAM (MRAM).

– Chip-level integration of memory controller/main memory (IRAM-like).

• Latency-tolerant Architectures:

– Simultaneous Multithreaded (SMT) Architectures.

– Decoupled Architectures: separate memory access from normal processor operation (e.g. HiDISC).

EECC722 - Shaaban #7 lec # 12 Fall 2001 10-29-2001

Latency Reduction/Higher Bandwidth: The Memory Hierarchy

[Pyramid figure; farther away from the CPU: lower cost/bit, higher capacity, increased access time/latency, lower throughput.]

Registers: part of the on-chip CPU datapath, 16-256 registers.
Cache: one or more levels (static RAM). Level 1: on-chip, 16-64K. Level 2: on- or off-chip, 128-512K. Level 3: off-chip, 128K-8M.
Main Memory: DRAM, RDRAM, 16M-16G.
Magnetic Disk: interface SCSI, RAID, IDE, 1394; 4G-100G.
Optical Disk or Magnetic Tape.

EECC722 - Shaaban #8 lec # 12 Fall 2001 10-29-2001

A Typical Memory Hierarchy (With Two Levels of Cache)

[Figure: Processor (control, datapath, registers) -> on-chip Level One Cache (L1) -> Second Level Cache (SRAM, L2) -> Main Memory (DRAM) -> Virtual Memory / Secondary Storage (Disk) -> Tertiary Storage (Tape). Faster toward the processor, larger capacity toward storage.

Speed (ns): registers ~1s; caches ~10s; main memory ~100s; disk ~10,000,000s (10s ms); tape ~10,000,000,000s (10s sec).
Size (bytes): registers 100s; caches Ks; main memory Ms; disk Gs; tape Ts.]

EECC722 - Shaaban #9 lec # 12 Fall 2001 10-29-2001

Improving Cache Performance
• Miss Rate Reduction Techniques:

* Increased cache capacity * Larger block size

* Higher associativity * Victim caches

* Hardware prefetching of instructions and data * Pseudo-associative Caches

* Compiler-controlled prefetching * Compiler optimizations

* Trace cache

• Cache Miss Penalty Reduction Techniques:
* Giving priority to read misses over writes * Sub-block placement

* Early restart and critical word first * Non-blocking caches

* Second-level cache (L2)

• Cache Hit Time Reduction Techniques:
* Small and simple caches

* Avoiding address translation during cache indexing

* Pipelining writes for fast write hits
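The three groups above attack the three terms of the usual average memory access time expression, AMAT = hit time + miss rate x miss penalty. A small sketch with assumed example numbers (none of these figures are from the lecture):

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time, all values in CPU cycles."""
    return hit_time + miss_rate * miss_penalty

print(amat(2, 0.05, 60))   # 5.0 cycles: assumed baseline
print(amat(2, 0.03, 60))   # 3.8 cycles: a miss-rate reduction technique applied
print(amat(2, 0.05, 30))   # 3.5 cycles: a miss-penalty reduction (e.g. hits in an L2)
print(amat(1, 0.05, 60))   # 4.0 cycles: a hit-time reduction
```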

EECC722 - Shaaban #10 lec # 12 Fall 2001 10-29-2001

Cache Optimization Summary

Technique                            MR   MP   HT   Complexity
Larger Block Size                    +    –         0
Higher Associativity                 +         –    1
Victim Caches                        +              2
Pseudo-Associative Caches            +              2
HW Prefetching of Instr/Data         +              2
Compiler Controlled Prefetching      +              3
Compiler Reduce Misses               +              0
Trace Cache                          +              3
Priority to Read Misses                   +         1
Subblock Placement                        +    +    1
Early Restart & Critical Word 1st         +         2
Non-Blocking Caches                       +         3
Second Level Caches                       +         2
Small & Simple Caches                –         +    0
Avoiding Address Translation                   +    2
Pipelining Writes                              +    1

(MR = miss rate, MP = miss penalty, HT = hit time; + = improves the factor, – = hurts it.)

EECC722 - Shaaban #11 lec # 12 Fall 2001 10-29-2001

Latency Reduction/Higher Bandwidth

• Wider Main Memory System Bus: Memory width is increased to a number of words (usually the size of a cache block). Memory bandwidth is proportional to memory width: e.g. doubling the width of the cache and memory doubles memory bandwidth.

• Simple Interleaved Memory: Memory is organized as a number of banks, each one word wide.

– Simultaneous multiple-word memory reads or writes are accomplished by sending memory addresses to several memory banks at once.

– Interleaving factor: Refers to the mapping of memory addresses to memory banks. e.g. using 4 banks, bank 0 has all words whose address satisfies (word address) mod 4 = 0, as in the sketch below.
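A minimal sketch of that mapping (low-order interleaving; the function name is just for illustration):

```python
NUM_BANKS = 4   # interleaving factor

def bank_of(word_address: int) -> int:
    """Low-order interleaving: consecutive word addresses rotate across the banks."""
    return word_address % NUM_BANKS

for addr in range(8):
    print(f"word {addr} -> bank {bank_of(addr)}")
# Bank 0 holds words 0, 4, 8, ...; bank 1 holds 1, 5, 9, ...; and so on.
```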

EECC722 - Shaaban #12 lec # 12 Fall 2001 10-29-2001

Three examples of bus width, memory width, and memory interleaving to achieve higher memory bandwidth:

– Simplest design: everything is the width of one word.

– Wider memory, bus and cache.

– Narrow bus and cache with interleaved memory.

EECC722 - Shaaban #13 lec # 12 Fall 2001 10-29-2001

Memory Interleaving

EECC722 - Shaaban #14 lec # 12 Fall 2001 10-29-2001

Memory Width, Interleaving: An Example
Given the following system parameters with a single cache level L1:
Block size = 1 word; memory bus width = 1 word; miss rate = 3%; miss penalty = 32 cycles
(4 cycles to send the address, 24 cycles access time/word, 4 cycles to send a word).

Memory accesses/instruction = 1.2; ideal CPI (ignoring cache misses) = 2.

Miss rate (block size = 2 words) = 2%; miss rate (block size = 4 words) = 1%.

• The CPI of the base machine with 1-word blocks = 2 + (1.2 x 0.03 x 32) = 3.15

• Increasing the block size to two words gives the following CPI:

– 32-bit bus and memory, no interleaving = 2 + (1.2 x 0.02 x 2 x 32) = 3.54

– 32-bit bus and memory, interleaved = 2 + (1.2 x 0.02 x (4 + 24 + 8)) = 2.86

– 64-bit bus and memory, no interleaving = 2 + (1.2 x 0.02 x 1 x 32) = 2.77

• Increasing the block size to four words; resulting CPI:

– 32-bit bus and memory, no interleaving = 2 + (1.2 x 0.01 x 4 x 32) = 3.54

– 32-bit bus and memory, interleaved = 2 + (1.2 x 0.01 x (4 + 24 + 16)) = 2.53

– 64-bit bus and memory, no interleaving = 2 + (1.2 x 0.01 x 2 x 32) = 2.77
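A sketch that reproduces the CPI figures above from the stated parameters (the penalty model mirrors the slide: 4 cycles to send the address, 24 cycles of access time, 4 cycles per bus-wide transfer; interleaved banks overlap their accesses and then return one word per transfer):

```python
SEND_ADDR, ACCESS, TRANSFER = 4, 24, 4    # cycles, from the stated parameters
MEM_REFS, IDEAL_CPI = 1.2, 2.0

def cpi(miss_rate: float, block_words: int, bus_words: int = 1, interleaved: bool = False) -> float:
    if interleaved:
        # One overlapped access across the banks, then the block words sent back to back.
        penalty = SEND_ADDR + ACCESS + block_words * TRANSFER
    else:
        # One full (address + access + transfer) transaction per bus-wide chunk of the block.
        penalty = (block_words // bus_words) * (SEND_ADDR + ACCESS + TRANSFER)
    return IDEAL_CPI + MEM_REFS * miss_rate * penalty

print(round(cpi(0.03, 1), 2))                                                           # 3.15
print(round(cpi(0.02, 2), 2), round(cpi(0.02, 2, interleaved=True), 2), round(cpi(0.02, 2, 2), 2))  # 3.54 2.86 2.77
print(round(cpi(0.01, 4), 2), round(cpi(0.01, 4, interleaved=True), 2), round(cpi(0.01, 4, 2), 2))  # 3.54 2.53 2.77
```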

EECC722 - Shaaban #15 lec # 12 Fall 2001 10-29-2001

Simplified Asynchronous DRAM Read Timing

Source: http://arstechnica.com/paedia/r/ram_guide/ram_guide.part2-1.html

EECC722 - Shaaban #16 lec # 12 Fall 2001 10-29-2001

DRAM Memory Interfaces: Page Mode DRAM Operation

EECC722 - Shaaban #17 lec # 12 Fall 2001 10-29-2001

DRAM Memory Interfaces: Simplified Asynchronous Fast Page Mode (FPM) DRAM Read Timing

Typical timing at 66 MHZ: 5-3-3-3.
For bus width = 64 bits = 8 bytes and cache block size = 32 bytes:
It takes 5+3+3+3 = 14 memory cycles, or 15 ns x 14 = 210 ns, to read the 32-byte block.
Read miss penalty for a CPU running at 1 GHZ = 15 x 14 = 210 CPU cycles.

FPM DRAM speed rated using tRAC ~ 50-70ns

EECC722 - Shaaban #18 lec # 12 Fall 2001 10-29-2001

• Extended Data Out DRAM operates in a similar fashion to Fast Page Mode DRAM, except that the data from one read is on the output pins at the same time the column address for the next read is being latched in.

Simplified Asynchronous Extended Data Out (EDO) DRAM Read Timing

Source: http://arstechnica.com/paedia/r/ram_guide/ram_guide.part2-1.html

Typical timing at 66 MHZ: 5-2-2-2.
For bus width = 64 bits = 8 bytes, max. bandwidth = 8 x 66 / 2 = 264 Mbytes/sec.
It takes 5+2+2+2 = 11 memory cycles, or 15 ns x 11 = 165 ns, to read a 32-byte cache block.
Read miss penalty for a CPU running at 1 GHZ = 11 x 15 = 165 CPU cycles.

EDO DRAM speed rated using tRAC ~ 40-60ns

EECC722 - Shaaban #19 lec # 12 Fall 2001 10-29-2001

Characteristics of Synchronous DRAM Interface Architectures

EECC722 - Shaaban #20 lec # 12 Fall 2001 10-29-2001

Synchronous Dynamic RAM (SDRAM) Organization

DDR SDRAM: similar organization, but using four banks to allow data transfer on both rising and falling edges of the clock.

EECC722 - Shaaban #21 lec # 12 Fall 2001 10-29-2001

Simplified SDRAM Read Timing

Typical timing at 133 MHZ (PC133 SDRAM): 4-1-1-1.
For bus width = 64 bits = 8 bytes, max. bandwidth = 133 x 8 = 1064 Mbytes/sec.
It takes 4+1+1+1 = 8 memory cycles, or 7.5 ns x 8 = 60 ns, to read a 32-byte cache block.
Read miss penalty for a CPU running at 1 GHZ = 7.5 x 8 = 60 CPU cycles.
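A small sketch generalizing the block-read arithmetic used on the FPM, EDO, and SDRAM timing slides (a 64-bit bus and a 32-byte block are assumed throughout, i.e. four bus-wide transfers per block; the rounded cycle times of 15 ns and 7.5 ns are the ones the slides use):

```python
def block_read(name: str, cycle_ns: float, burst: tuple, cpu_ghz: float = 1.0) -> None:
    """Time to read one cache block given a DRAM burst pattern such as 5-3-3-3."""
    mem_cycles = sum(burst)                 # e.g. 5+3+3+3 = 14 memory-bus cycles
    block_ns   = mem_cycles * cycle_ns
    cpu_cycles = block_ns * cpu_ghz         # at 1 GHZ, 1 CPU cycle = 1 ns
    print(f"{name}: {mem_cycles} mem cycles = {block_ns:.0f} ns = {cpu_cycles:.0f} CPU cycles")

block_read("FPM DRAM    66 MHZ  5-3-3-3", 15.0, (5, 3, 3, 3))   # 210 ns -> 210 cycles
block_read("EDO DRAM    66 MHZ  5-2-2-2", 15.0, (5, 2, 2, 2))   # 165 ns -> 165 cycles
block_read("PC133 SDRAM 133 MHZ 4-1-1-1",  7.5, (4, 1, 1, 1))   #  60 ns ->  60 cycles
```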