
1

Efficient Data Access in Future Memory Hierarchies

Rajeev Balasubramonian

School of Computing, Research Buffet, Fall 2010


2

Acks

• Terrific students in the Utah Arch group

• Collaborators at HP, Intel, IBM

• Prof. Al Davis, who re-introduced us to memory systems


3

Current Trends

• Continued device scaling

• Multi-core processors

• The power wall

• Pin limitations

• Problematic interconnects

• Need for high reliability


4

Anatomy of Future High-Perf Processors

[Figure: a chip combining high-performance and low-performance cores]

• Designs well understood
• Combo of hi- and lo-perf
• Risky Ph.D.!!


5

Anatomy of Future High-Perf Processors

[Figure: a core paired with its L2/L3 cache bank]

• Large shared L3
• Partitioned into many banks
• Assume one bank per core


6

Anatomy of Future High-Perf Processors

[Figure: a 4x4 tiled layout of cores (C), each with a cache bank ($)]

• Many cores!
• Large distributed L3 cache


7

Anatomy of Future High-Perf Processors

[Figure: the tiled cores and cache banks connected by an on-chip network]

• On-chip network
• Includes routers and long wires
• Used for cache coherence
• Used for off-chip requests/responses


8

Anatomy of Future High-Perf Processors

[Figure: the tiled multi-core with a memory controller linking it to off-chip DIMMs]

• Memory controller handles off-chip requests to memory


9

Anatomy of Future High-Perf Processors

[Figure: multiple CPU sockets, each connected to its own set of DIMMs]

• Multi-socket motherboard
• Lots of cores, lots of memory, all connected


10

Anatomy of Future High-Perf Processors

[Figure: the multi-socket system with non-volatile PCM behind the DIMMs and disks behind the PCM]

• DRAM backed up by slower, higher-capacity emerging non-volatile memories
• Eventually backed up by disk… maybe


11

Life of a Cache Miss

[Figure: Core/L1 miss → on-chip network → look up L2/L3 bank → on- and off-chip network → wait in MC queue → access DRAM (DIMM) → access PCM (non-volatile) → access disk]
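The flow above can be read as a simple lookup loop. Below is a minimal Python sketch of that walk; the level names and per-level latencies are illustrative assumptions, not measurements.

```python
# Minimal sketch of the cache-miss walk above; level names and per-level
# latencies (in CPU cycles) are illustrative assumptions, not measurements.
HIERARCHY = [
    ("L2/L3 bank (via on-chip network)", 40),
    ("DRAM DIMM (via memory controller queue)", 200),
    ("Non-volatile PCM", 1000),
    ("Disk", 10_000_000),
]

def service_miss(levels_holding_line):
    """Walk down the hierarchy until some level holds the line; return (level, cycles)."""
    total = 0
    for level, latency in HIERARCHY:
        total += latency
        if level in levels_holding_line:
            return level, total
    raise RuntimeError("the line must reside somewhere in the hierarchy")

# Example: the L1 miss is eventually served by DRAM.
print(service_miss({"DRAM DIMM (via memory controller queue)"}))
```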


12

Research Topics

[Figure: the same cache-miss path: L1 miss → on-chip network → L2/L3 bank → off-chip network → MC queue → DRAM → PCM → disk]

Not very hot topics!


13

Research Topics

[Figure: the cache-miss path again, with each stage numbered 1 through 6 to mark the research topics covered next]


14

Problems with DRAM

• DRAM main memory contributes 1/3rd of total energy in datacenters

• Long latencies; high bandwidth needs

• Error resilience is very expensive

• DRAM is a commodity and chips have to be compliant with standards

 – Initial designs instituted in the 1990s
 – Innovations are evolutionary
 – Traditional focus on density


15

Time for a Revolutionary Change?

• Energy is far more critical today

• Cost-per-bit perhaps not as relevant today

• Memory reliability is increasing in importance

• Multi-core access streams have poor locality

• Queuing delays are starting to dominate

• Potential migration to new memory technologies and interconnects


16

Key Idea

• Incandescent light bulb ($0.30, 60W)
 – Low purchase cost
 – High operating cost
 – Commodity

• Energy-efficient light bulb ($3.00, 13W)
 – Higher purchase cost
 – Much lower operating cost
 – Value-addition

It’s worth a small increase in capital costs to gain large reductions in operating costs
And not 10X, just 15-20%!


17

DRAM Basics

[Figure: on-chip memory controller → memory bus or channel → DIMM → rank → DRAM chip or device → bank → array (1/8th of the row buffer) → one word of data output]


18

DRAM Operation

[Figure: four DRAM chips, one bank shown in each; RAS brings a row into the row buffer and CAS selects the cache line from it]
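To make the RAS/CAS split concrete, here is a hedged sketch of how a controller might carve a physical address into bank, row, and column fields; the geometry (64-byte lines, 128 line-sized columns per row, 8 banks, 32K rows) is an assumption for illustration, not any particular device's mapping.

```python
# Sketch of address decoding for the RAS/CAS sequence above. The geometry
# (64B lines, 128 line-sized columns per row, 8 banks, 32K rows) is assumed.
def decode(addr, line_bytes=64, columns=128, banks=8, rows=32768):
    line = addr // line_bytes                    # cache-line index
    col  = line % columns                        # CAS: column within the open row
    bank = (line // columns) % banks             # bank that receives the commands
    row  = (line // (columns * banks)) % rows    # RAS: row latched into the row buffer
    return bank, row, col

print(decode(0x3A7F40))   # -> (bank, row, col) for one cache-line address
```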


19

New Design Philosophy

• Eliminate overfetch; activate a single chip and a single small array → much lower energy, slightly higher cost

• Provide higher parallelism

• Add features for error detection

[ Appears in ISCA’10 paper, Udipi et al. ]


20

Single Subarray Access (SSA)

[Figure: the memory controller drives an addr/cmd bus and eight 8-bit data buses to the DIMM; within one DRAM chip, a bank is divided into subarrays, each with its own bitlines and row buffer, connected by a global interconnect to the I/O; a full 64-byte cache line comes from a single subarray]


21

SSA Operation

[Figure: the address selects one subarray in one DRAM chip, which supplies the entire cache line; the remaining chips and subarrays stay in sleep mode or serve other parallel accesses]
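For contrast with the conventional layout, the sketch below shows one possible address-to-location mapping under SSA; the chip and subarray counts and the bit assignment are assumptions, not the exact mapping from the ISCA'10 paper.

```python
# Contrast between a conventional rank (line striped across all chips) and SSA
# (line held entirely by one small subarray of one chip). Counts are assumed.
CHIPS, SUBARRAYS_PER_CHIP = 8, 128

def baseline_activation(line_addr):
    # every chip supplies 8 of the 64 bytes, so all chips open a large row
    return list(range(CHIPS))

def ssa_location(line_addr):
    # low-order line-address bits choose the chip, the next bits the subarray;
    # only that chip and subarray are activated, the rest can stay asleep
    chip = line_addr % CHIPS
    subarray = (line_addr // CHIPS) % SUBARRAYS_PER_CHIP
    return chip, subarray

print(baseline_activation(42))   # all 8 chips activate
print(ssa_location(42))          # (2, 5): only chip 2, subarray 5 activates
```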


22

Consequences

• Why is this good?
 – Minimal activation energy for a line
 – More circuits can be placed in low-power sleep
 – Can perform multiple operations in parallel

• Why is this bad?
 – Higher area and cost (roughly 5-10%)
 – Longer data transfer time
 – Not compatible with today’s standards
 – No opportunity for row buffer hits


23

Narrow vs. Wide Buses?

• What provides higher utilization? 1 wide bus or 8 narrow buses?

• Must worry about load imbalance and long data transfer time in the latter

• Must worry about many bubbles in the former because of dependences

[ Ongoing work, Chatterjee et al. ]

[Figure: timing diagrams for back-to-back requests to the same bank. With one 64-bit wide bus, 64 bits sit idle between the short data transfers (DT) while the bank access completes; with eight 8-bit wide buses, the data transfers are longer and only 8 bits sit idle]
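A back-of-the-envelope way to see the trade-off, assuming a 40-cycle bank access and a 64-byte line (illustrative numbers, not from this study):

```python
# Utilization sketch for the timing diagrams above: back-to-back dependent
# requests to the same bank. Bank access time and line size are assumptions.
ACCESS = 40         # bank access time, in bus cycles
LINE_BITS = 512     # one 64-byte cache line

def same_bank_utilization(requests, bus_width_bits):
    dt = LINE_BITS // bus_width_bits       # data transfer time on this bus
    busy = requests * dt                   # cycles the bus carries data
    total = requests * (ACCESS + dt)       # each request waits for the bank, then transfers
    return busy / total

print(same_bank_utilization(4, 64))   # one wide bus: ~17% busy, mostly idle gaps
print(same_bank_utilization(4, 8))    # one narrow bus: ~62% busy; seven more such
                                      # buses can serve other banks in parallel
```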


24

Methodology

• Evaluated with the Simics simulator and multi-threaded benchmark suites

• 8-core simulations

• DRAM energy and latency from Micron datasheets


25

Results – Energy

[Chart: relative DRAM energy consumption for blackscholes, canneal, fluidanimate, streamcluster, x264, and cg under Baseline Open Row, Baseline Close Row, SBA, and SSA]

• Moving to a close-page policy – 73% energy increase on average
• Compared to open page, 3X reduction with SBA, 6.4X with SSA


26

Results – Energy – Breakdown

[Chart: DRAM energy breakdown into decoder + wordline + senseamps, bitlines, global interconnect, and termination resistors, for Baseline (open page, FR-FCFS), Baseline (closed row, FCFS), SBA, and SSA]


27

Results – Performance

[Chart: cycles for blackscholes, bodytrack, canneal, ferret, fluidanimate, freqmine, streamcluster, vips, x264, stream, cg, and is (plus the average) under Baseline Open Page, Baseline Close Page, SBA, and SSA]

• Serialization/queuing delay balance in SSA – 30% decrease (6 of 12 benchmarks) or 40% increase (6 of 12)


28

Results – Performance – Breakdown

[Chart: latency breakdown into queuing delay, command/addr transfer, rank switching delay (ODT), DRAM core access, and data transfer, for Baseline (open page, FR-FCFS), Baseline (closed row, FCFS), SBA, and SSA]


29

Error Resilience in DRAM

• Important to withstand not only a single error but also an entire chip failure – referred to as chipkill

• DRAM chips do not include error correction features -- error tolerance must be built on top

• Example: 8-bit ECC code for a 64-bit word; for chipkill correctness, each of the 72 bits must be read from a separate DRAM chip → significant overfetch!

[Figure: a 72-bit word on every bus transfer, with bits 0 through 71 each coming from a different DRAM chip]


30

Two-Tiered Solution

• Add a small (8-bit) checksum for every cache line

• Maintain one extra DRAM chip for parity across 8 DRAM chips

• When the checksum flags an error, use the parity to re-construct the corrupted cache line

• Writes will require updates to parity as well

[Figure: data chips 0 through 7 plus one parity chip]
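A minimal sketch of the two tiers, assuming a simple byte-sum checksum (the slide does not specify the detection code): the per-line checksum detects the error, and XOR parity across the 8 data chips reconstructs the failed chip's slice.

```python
# Sketch of the two-tier scheme above: a small per-line checksum detects corruption
# locally (tier 1), and a ninth parity chip across the 8 data chips reconstructs the
# failed chip's slice (tier 2). The byte-sum checksum is an assumption for illustration.
def checksum(line: bytes) -> int:
    return sum(line) & 0xFF                      # 8-bit detection code stored with the line

def parity(chunks):
    # bytewise XOR across the given chunks
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

def reconstruct(chips, parity_chip, failed_idx):
    # XOR of the surviving chips and the parity chip regenerates the failed slice
    survivors = [c for i, c in enumerate(chips) if i != failed_idx]
    return parity(survivors + [parity_chip])

chips = [bytes([i] * 8) for i in range(8)]       # each chip holds 8 bytes of one cache line
p = parity(chips)                                # kept on the extra parity chip
stored = checksum(b"".join(chips))               # kept alongside the cache line

corrupted = chips[:3] + [bytes(8)] + chips[4:]   # chip 3 returns garbage
assert checksum(b"".join(corrupted)) != stored   # tier 1: the checksum flags the error
assert reconstruct(corrupted, p, failed_idx=3) == chips[3]   # tier 2: parity rebuilds chip 3
```

As the last bullet notes, every write also has to update the stored checksum and the parity chip.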


31

Research Topics

[Figure: the cache-miss path with the six numbered research topics, shown again]


32

Topic 2 – On-chip Networks

• Conventional wisdom: buses are not scalable; need routed packet-switched networks

• But routed networks require bulky, energy-hungry routers

• Results:
 – Buses can be made scalable by using a hierarchy of buses and Bloom filters to stifle broadcasts (see the sketch below)
 – Low-swing buses can yield low energy, simpler coherence, and scalable operation

[ Appears in HPCA’10 paper, Udipi et al. ]
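A toy illustration of the broadcast-stifling idea; the filter size and hash functions are assumptions, not the HPCA'10 design.

```python
# Toy Bloom filter for stifling broadcasts, as in the results bullet above: each
# bus segment tracks block addresses it may cache; a snoop is forwarded only if
# the filter might contain the block. Sizes/hashes are illustrative assumptions.
import hashlib

class BloomFilter:
    def __init__(self, bits=4096, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits)

    def _positions(self, block_addr):
        for k in range(self.hashes):
            digest = hashlib.blake2b(f"{k}:{block_addr}".encode(), digest_size=4).digest()
            yield int.from_bytes(digest, "little") % self.bits

    def add(self, block_addr):
        for p in self._positions(block_addr):
            self.array[p] = 1

    def may_contain(self, block_addr):          # no false negatives, rare false positives
        return all(self.array[p] for p in self._positions(block_addr))

segment_filter = BloomFilter()
segment_filter.add(0x400040)
print(segment_filter.may_contain(0x400040))    # True -> forward the broadcast here
print(segment_filter.may_contain(0x800000))    # almost certainly False -> stifled
```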


33

Topic 3: Data Placement in Caches, Memory

• In a large distributed NUCA cache, or in a large distributed NUMA memory, data placement must be controlled with heuristics that are aware of:

 – capacity pressure in the cache bank
 – distance between the CPU and the cache bank and DIMM
 – queuing delays at the memory controller
 – potential for row buffer conflicts at each DIMM

(A toy cost function along these lines is sketched below.)

[ Appears in HPCA’09, Awasthi et al. and in PACT’10, Awasthi et al. (Best paper!) ]
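One way to picture such a heuristic is as a weighted cost over the factors above; the weights and inputs in this sketch are hypothetical and only show the shape of the decision.

```python
# Hypothetical cost function for placing a page/line, scoring each candidate bank
# (or DIMM) on the four factors listed in the bullet above. Weights are made up.
def placement_cost(candidate, weights=(1.0, 0.5, 2.0, 1.5)):
    w_cap, w_dist, w_queue, w_rowconf = weights
    return (w_cap     * candidate["capacity_pressure"]    # how full the cache bank is
          + w_dist    * candidate["distance_hops"]        # CPU -> bank/DIMM distance
          + w_queue   * candidate["mc_queue_length"]      # queuing at the memory controller
          + w_rowconf * candidate["row_conflict_rate"])   # likelihood of row-buffer conflicts

candidates = [
    {"id": 0, "capacity_pressure": 0.9, "distance_hops": 1, "mc_queue_length": 4, "row_conflict_rate": 0.2},
    {"id": 1, "capacity_pressure": 0.3, "distance_hops": 3, "mc_queue_length": 1, "row_conflict_rate": 0.1},
]
best = min(candidates, key=placement_cost)
print(best["id"])   # place the data where the combined cost is lowest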


34

Topic 4: Silicon Photonic Interconnects

• Silicon photonics can provide abundant bandwidth and makes sense for off-chip communication

• Can help multi-cores overcome the bandwidth problem

• Problems: DRAM design that best interfaces with silicon photonics, protocols that allow scalable operation

[ On-going work, Udipi et al. ]


35

Topic 5: Memory Controller Design

• Problem: abysmal row buffer hit rates, quality of service

• Solutions:

 – Co-locate hot cache lines in the same page

 – Predict row buffer usage and “prefetch” row closure (see the sketch below)

 – QoS policies that leverage multiple memory “knobs”

[ Appears in ASPLOS’10 paper, Sudan et al. and on-going work, Awasthi et al., Sudan et al. ]
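To illustrate the "prefetch row closure" idea, here is a toy predictor that remembers how many hits a row served during its last activation and closes it once that count recurs; this is a simplified stand-in, not the mechanism from the cited work.

```python
# Toy row-closure predictor for the bullet above: remember how many accesses a row
# served the last time it was open, and close it early once that count recurs.
class RowClosurePredictor:
    def __init__(self):
        self.last_use_count = {}        # row -> accesses seen during its previous activation
        self.open_rows = {}             # bank -> (row, accesses so far this activation)

    def on_access(self, bank, row):
        cur_row, count = self.open_rows.get(bank, (None, 0))
        if cur_row == row:
            self.open_rows[bank] = (row, count + 1)
        else:
            if cur_row is not None:
                self.last_use_count[cur_row] = count   # record how useful the old row was
            self.open_rows[bank] = (row, 1)

    def should_close(self, bank):
        cur_row, count = self.open_rows.get(bank, (None, 0))
        expected = self.last_use_count.get(cur_row, float("inf"))
        return cur_row is not None and count >= expected

mc = RowClosurePredictor()
for row in (7, 7, 7, 12):           # row 7 serves three hits, then a conflict opens row 12
    mc.on_access(bank=0, row=row)
for row in (7, 7, 7):               # row 7 is opened again and reaches its previous count
    mc.on_access(bank=0, row=row)
print(mc.should_close(bank=0))      # True: close the row early, before the next conflict
```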


36

Topic 6: Non Volatile Memories

• Emerging memories (PCM):
 – can provide higher densities at smaller feature sizes
 – are based on resistance, not charge (hence non-volatile)
 – can serve as replacements for DRAM/Flash/disk

• Problem: when a cell is programmed to a given resistance, the resistance tends to drift over time → may require efficient refresh or error correction (a drift sketch follows below)

[ On-going work, Awasthi et al. ]
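Drift is commonly modeled as a power law in time; the sketch below, with assumed constants and an assumed read threshold, shows how a programmed level can cross the threshold, which is what motivates scrubbing or stronger error correction.

```python
# Power-law drift model R(t) = R0 * (t / t0)**nu, a common way to describe PCM
# resistance drift. The drift exponent, threshold, and levels are assumptions.
def drifted_resistance(r0, t, t0=1.0, nu=0.1):
    return r0 * (t / t0) ** nu

READ_THRESHOLD = 2.0e5      # ohms separating two stored levels (assumed)
r0 = 1.0e5                  # programmed resistance of the lower level (assumed)
for t in (1, 1e3, 1e6, 1e9):    # seconds since programming
    r = drifted_resistance(r0, t)
    print(f"t={t:>12}s  R={r:.2e}  {'misread -> scrub/correct' if r > READ_THRESHOLD else 'ok'}")
```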

