Page 1: Not only Faster but Accurater, too

Faster and Accurater: The Future of Memory-System Modeling and Simulation
Bruce Jacob (with Ph.D. results of Shang Li)
Keystone Professor, University of Maryland

Page 2: The Bottom Line

[Chart: simulation accuracy vs. simulation speed, placing trace-based, HDL-based, and cycle-accurate simulators; annotations: "[don't actually go here]" and "We can get up here (e.g., via prediction)".]

Page 3: Background: The Root of the Problem

[Timing diagram of one DRAM read to a bank, laid out over TIME: Bank Precharge (tRP = 15ns), Row Activate (15ns) and Data Restore (another 22ns) (tRCD = 15ns, tRAS = 37.5ns), Column Read (CL = 8), then DATA on the bus (BL = 8).]

Cost of access is high; requires significant effort to amortize this over the (increasingly short) payoff.
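To put rough numbers on that cost, here is a back-of-the-envelope calculation using the slide's parameters. The clock period is my assumption (a DDR3-1600-style 1.25 ns tCK) so that CL = 8 and BL = 8 can be converted to nanoseconds; this is a sketch, not data from the slide.

# Rough time-to-first-data for one DRAM read, using the slide's timing parameters.
# tCK is an assumed DDR3-1600-like bus clock; CL and BL are given in bus clocks.
tCK   = 1.25               # ns per bus clock (assumption)
tRP   = 15.0               # precharge, ns
tRCD  = 15.0               # row activate to column command, ns
CL    = 8 * tCK            # column read to first data: 10 ns
burst = (8 // 2) * tCK     # BL = 8 occupies 4 clocks on a DDR bus: 5 ns

row_hit       = CL                 # row already open: ~10 ns to first data
idle_bank     = tRCD + CL          # bank precharged: ~25 ns
bank_conflict = tRP + tRCD + CL    # wrong row open: ~40 ns, before any queueing
print(row_hit, idle_bank, bank_conflict, burst)

The payoff per access is only the short data burst, which is why the controller spends so much effort trying to amortize the setup cost.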

Page 4: Background ('significant effort')

[Diagram: the CPU/$ issues outgoing bus requests to the memory controller (MC), which queues reads and writes (Read B; Write X, data; Read Z; Write Q, data; Read A; Write A, data; Read W; Read Z; Read Y) and schedules them onto the DRAM command bus as interleaved ACT, RD, WR, and PRE commands, with read-data beats returning on the data bus.]

Page 5: Faster?

CPU Simulators

[Chart (Background & Motivation: Performance & Scalability): simulation speed vs. accuracy for CPU simulators, plotting SST-Ariel, Gem5-O3, Gem5-Timing, Graphite, Sniper, ZSim, CMP$im, and Marss-O3.]

• Simulation speed: 100X faster
• Error: < 20%
• 10s of cores simulated on 10s of cores

Page 6: Faster?

DRAM vs CPU simulator performance

[Charts (Background & Motivation: Performance & Scalability): DRAM vs. CPU simulator performance for (a) a cycle-accurate, traditional CPU simulator and (b) a faster CPU simulator w/ DRAM simulator; annotation: "Easily Predictable".]

Result: Memory-System Simulation is now the Limiting Factor.

Page 7: I'll just leave this here …

[Chart (Cycle Accurate Memory Simulation: Simulator Comparison): trace simulation of 10M random requests and 10M stream requests.]

Page 8: Even Faster via Prediction

Turning DRAM timing simulation into a classification problem

Statistical DRAM Model: Proposed Approach

Classification (request → class), then Recovery (class → latency):

Clock  Address      OP     Class      Latency
0      0x01230000   READ   Idle       36
12     0x01230020   READ   Row-Hit    22
40     0x0123003C   READ   Row-Hit    22
65     0x06340000   WRITE  Row-Miss   56
...    ...          ...    ...        ...

Page 9: Latency ← Queue Contents

[Same DRAM-read timing diagram as Page 3 (Precharge, Row Activate and Data Restore, Column Read, DATA on bus), annotated with the cases that determine latency:]

• Bank Conflict
• Idle Bank
• Row Hit
• … plus any Queueing Delays
• Refresh Delay
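A rule-of-thumb sketch of that mapping: given the last activity on the target bank, each request falls into one of the classes above, and queueing and refresh delays are added on top. The bank-state bookkeeping and the refresh test are simplified placeholders, not the thesis code.

# Assign a latency class to a request from the target bank's last known state.
# bank_state: {bank: {"open_row": row or None}}; in_refresh: (time, bank) -> bool
def classify(now_ns, bank, row, bank_state, in_refresh):
    if in_refresh(now_ns, bank):
        return "refresh-delay"          # caught behind a refresh
    st = bank_state.get(bank)
    if st is None or st["open_row"] is None:
        return "idle-bank"              # precharged: pay tRCD + CL
    if st["open_row"] == row:
        return "row-hit"                # only CL
    return "bank-conflict"              # wrong row open: tRP + tRCD + CL

state = {0: {"open_row": 0x12}}
print(classify(100, 0, 0x12, state, lambda t, b: False))   # row-hit

Whatever queueing delay the request sees in the controller is then added to the class's base latency.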

Page 10: Training Process

Statistical DRAM Model
Training: Supervised Learning

Synthetic trace: ~7000 requests, with various access patterns and inter-arrival timings to cover all kinds of workloads.
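One way such a labeled set could be assembled, as a sketch: generate a synthetic trace that sweeps access patterns and inter-arrival gaps, run each request through a cycle-accurate DRAM simulator once to get its ground-truth class, and featurize it. The pattern mix, the featurize() hook, and the cycle_accurate_label() hook are placeholders, not the thesis's actual generator.

# Build (features, label) pairs for supervised training from a synthetic trace.
import random

def synthetic_trace(n=7000, seed=0):
    """Mix streaming and random accesses with varied inter-arrival gaps."""
    rng = random.Random(seed)
    t, addr = 0, 0
    for _ in range(n):
        t += rng.choice([4, 16, 64, 256])             # sweep arrival rates
        if rng.random() < 0.5:
            addr += 64                                 # streaming: next cache line
        else:
            addr = rng.randrange(0, 1 << 32, 64)       # random access
        yield t, addr, rng.random() < 0.3              # ~30% writes

def build_dataset(trace, featurize, cycle_accurate_label):
    X, y = [], []
    for now, addr, is_write in trace:
        X.append(featurize(now, addr, is_write))               # Table 6.1-style features
        y.append(cycle_accurate_label(now, addr, is_write))    # ground truth from the slow simulator
    return X, y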

Page 11: Models (performed the same): Decision Tree & Random Forest

[Figure (Statistical DRAM Model, Training): example decision tree splitting on features such as same-row-last, ref-after-last, and num-recent-rank, with leaves labeled Idle, Idle, and row-miss.]
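With the labeled trace in hand, both models are one-liners in scikit-learn; the point of the slide is that they perform about the same, so the cheaper decision tree is the natural choice inside a simulator. A sketch, assuming X and y were built as above:

# Train a decision tree and a random forest on the labeled trace and compare them.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def compare_models(X, y):
    candidates = [
        ("decision tree", DecisionTreeClassifier(max_depth=8)),     # cheap to evaluate per request
        ("random forest", RandomForestClassifier(n_estimators=50)),
    ]
    for name, model in candidates:
        acc = cross_val_score(model, X, y, cv=5).mean()
        print(f"{name}: mean CV accuracy {acc:.3f}")
    # Keep the simpler model; per the slide, the two perform the same.
    return DecisionTreeClassifier(max_depth=8).fit(X, y)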

Page 12: Results: Way Faster


Page 14: But Wait — and Accurater? A Little Background:

MEMSYS'18, October 2018, Alexandria, Washington DC, USA (Anon.)

Figure 1: When CPU and memory simulators are coupled, the timings of the memory request between the LLC and the memory controller could be easily overlooked.

better simulation accuracy. Finally, we quantify the LLC-to-memory latency for various high-end and emerging platforms and we show its significant range, between 30ns (POWER8) and 277ns (Knights Landing); therefore, it is really important to properly adjust and validate this parameter in system simulators before any measurements are performed. Overall, we believe that the issues addressed in this paper would help researchers of the computer architecture community to improve main memory system simulation.

The rest of the paper is organized as follows. Section 2 explains the simulation environment and evaluates main memory latency with a microbenchmark for real and simulated systems. This section also proposes approaches to fix the deviation identified between real and simulated main memory latency measurements. Section 3 details the validation of the proposed approaches with SPEC CPU2006 benchmarks, while Section 4 discusses the LLC-to-main-memory latency of various high-end and emerging High Performance Computing (HPC) platforms. Section 5 analyzes the validation procedure of state-of-the-art system and memory simulators. Finally, Section 6 presents the conclusions of the study.

2 MAIN MEMORY LATENCY EVALUATION AND SIMULATION ENHANCEMENTS

In this section we detail the methodology used to model a targeted system in a simulation infrastructure, and we describe the microbenchmarks used to discover the main memory access latency. The targeted system we aimed to model is an Intel Xeon E5-2670 Sandy Bridge-EP processor [8] operating at 3.0GHz. The main memory comprises four 4GiB DIMM devices [21] connected to the processor using four DDR3-1600 channels. Each processor runs eight cores, where the hyper-threading feature has been disabled, as in most HPC systems [22].

2.1 Simulation environment

The simulator infrastructure we chose to use is an integration of two simulators: ZSim [20] as the CPU simulator and DRAMsim2 as the main memory simulator.

ZSim is a user-level, execution-driven CPU simulator widely used in the computer architecture research community. Developed by researchers from MIT and Stanford University, ZSim is designed for simulation of large-scale systems. However, ZSim was originally developed to simulate the Intel Westmere architecture, which is no longer being used in the HPC domain. One of the tasks that we had to perform was to upgrade and validate ZSim for the Intel Sandy Bridge processor. The work to upgrade ZSim consisted of the following steps: First, we adjusted the simulator by updating the instruction latencies obtained through the execution of CPU microbenchmarks [23] on the real hardware; second, we improved the micro-operation fusion and increased the number of entries in the Reorder Buffer (ROB) from 128 (Westmere) to 168 (Sandy Bridge); third, we configured the cache hierarchy according to the Intel documentation [8] for a Sandy Bridge EP class processor, summarized in Table 1. Finally, we updated the L3 caching mechanism, implementing the hashing function described in the work by Maurice et al. [13].

Table 1: Cache parameters of the Sandy Bridge EP class processor used in the study.

                           L1-D     L2       L3
Size                       32 KiB   256 KiB  20 MiB
Latency (in CPU cycles)    4        8        28
Cache line size            64 B     64 B     64 B
Set associativity          8-way    8-way    20-way

ZSim is easily integrated with a main memory simulator such as DRAMsim2. DRAMsim2 is a cycle-accurate simulator validated against Verilog models for memory devices. We configured DRAMsim2 following the manufacturer's documentation, with specific timings for the memory device part [21].

2.2 Memory latency microbenchmark

State-of-the-art memory benchmarks such as LMbench [15], stream [14], and Intel's Memory Latency Checker (imlc) [24] can be used for main memory latency measurements. However, they are not a good fit for our study because it is very difficult to use them in ZSim simulation. LMbench and stream rely on compiler optimization, and imlc is a binary-only distributed program; hence, no tailored analysis nor modification to the code could be made. Therefore, as none of the existing open-source benchmarks was appropriate for our analysis, we had to design a specific microbenchmark to use for our experiments.

Our microbenchmark is designed to stress the caches and main memory, implementing the concept of pointer chasing. Because the microbenchmarks are designed to run on top of an Operating System, a C program wraps all functionality outside the microbenchmark design, such as memory initialization, metrics collection, and program cleanup. By doing so, the microarchitectural implications of running on top of an OS are diminished.

In the microbenchmark's prologue, we allocate a contiguous section of memory that stores an array of pointers. The elements of the array are initialized as a circular linked list that follows a pointer-chase pattern; Figure 2 portrays an example of such an ordering. Our design goals for the microbenchmarks are summarized as: (1) iteratively traverse the whole array; (2) access different cache lines for every memory access; (3) the memory accesses have a random pattern, preventing data prefetchers from bringing data to any level of cache.

Page 15: But Wait — and Accurater? The Real Culprit (took 2 yrs to find):

Conference'17, July 2017, Washington, DC, USA (Anon.)

[Figure 3: Simulator memory latency analysis. (a) DRAM latency and overall latency reported by Gem5 and ZSim. (b) ZSim 2-phase memory model timeline diagram compared with a real hardware/cycle-accurate model: three back-to-back memory requests (0, 1, 2) are issued to the memory model. The timeline marks each request's issue and return, the fixed "Min Latency" charged per request in ZSim Phase 1 (where the program's timing is instrumented), the per-request latencies (Request 0/1/2 Latency) attached in ZSim Phase 2, and the CPU-to-Mem / Mem-to-CPU transfers.]

main memory backend at all. [17] supports a cycle accurate memory backend, but as we will see soon, it has its issues when integrating a cycle accurate memory backend.

The problem was first discovered by [19], who observed a memory latency error of about 20ns when they tested a memory latency benchmark. But [19] did not answer where this 20ns missing latency comes from, as they suspected it came from the cycle accurate DRAM simulator they were using. We will analyze this situation and provide a conclusive answer to this question.

To replicate the issue independently, we developed a simplified version of LMBench (ram_lat, referred to in Table 1) that randomly traverses a huge array and measures the average latency of each access. When the array is too large to fit in the cache and most accesses go to DRAM, the average access latency will include the DRAM latency. The benchmark inserts time stamps before and after the memory traversal, uses them to determine the overall latency of a certain number of memory requests, and divides by the number of requests to obtain the average memory latency. This average memory latency consists of cache latency and DRAM latency, and thus we use the term overall latency in the following discussion.

Like [19], we ran this benchmark natively on our machine to obtain the "hardware measured" latency (72ns), then ran it in ZSim along with DRAMSim2 as the DRAM backend, and we were able to reproduce results similar to [19]. That is, the overall latency (43ns) is 29ns lower than the hardware measurement (72ns). To determine whether this is a ZSim-specific issue or a DRAM simulator issue, we ran the same benchmark in Gem5 with the same cache and DRAM parameters, and this time the overall latency is 78ns, much closer to our hardware measurement. So we conclude this is a ZSim-specific issue, not a DRAM simulator issue. We then further looked into the simulator statistics and found that the DRAM latency reported by the DRAM simulator in Gem5 is 55ns, which makes sense, as the overall latency (78ns) should be a combination of DRAM latency (55ns) and cache latency (23ns). However, in ZSim, the DRAM latency reported by the DRAM simulator is 73ns, much higher than the overall latency, which makes no sense. Figure 3a visualizes these results. This again confirms that the issue lies within the ZSim memory model.

The way the ZSim memory model works is that it has two phases of memory models. The first phase is a fixed-latency model that assumes a fixed "minimum latency" for all memory events. The purpose is to simulate instructions as fast as possible and generate a trace of memory events. After the memory event trace is generated, the second phase kicks in, and that is when the cycle accurate DRAM simulator actually works: the cycle accurate simulation uses the event trace as input and updates the latency timings associated with these events.

For instance, Figure 3b demonstrates how the ZSim memory model handles memory requests differently from hardware/cycle accurate models. Suppose there are 3 back-to-back memory requests (each relies on the finishing of the previous one). In real hardware or a cycle accurate model, each memory request's latency may vary, and the next request cannot be issued until the previous request is returned. In ZSim Phase 1, all requests are assumed to be finished with "minimum latency", and therefore finish earlier than they should. Then in ZSim Phase 2, cycle accurate simulation is performed, more accurate latency timing is produced by the cycle accurate simulator, and all 3 requests update their timings. But even if all memory requests obtain correct timings in Phase 2, unfortunately, when the simulated program, like our benchmark, has instrumenting instructions such as reading the system clock, it will obtain the timing numbers during Phase 1, which can be substantially smaller. This is why the overall latency is much smaller than the DRAM latency.

So in other words, the "minimum latency" ZSim parameter will dictate the latency observed by the simulated program. To verify this claim, we run the same simulation with different "minimum latency" parameters, and plot them against the benchmark reported


ZSim 2-phase memory model timeline diagram compared with real hardware/cycle accurate model. Three back-to-back memory requests (0, 1, 2) are issued to the memory model.

First phase of memory access aggressively schedules requests for performance; second phase fails to take into account dependence information.
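To make the failure mode concrete, here is a minimal sketch of the two-phase behavior; it is not ZSim code, and the latency numbers are illustrative only.

# Two-phase memory model vs. a dependence-aware model, for three back-to-back
# (dependent) requests. Numbers are made up for illustration.
MIN_LATENCY_NS = 20
true_latency_ns = [55, 70, 60]     # what a cycle-accurate backend would report

# Hardware / cycle-accurate: request N+1 issues only after request N returns,
# so the elapsed time reflects the true latencies.
elapsed_accurate = sum(true_latency_ns)                        # 185 ns

# Phase 1: every request retires after the fixed minimum latency, and the
# simulated program's clock reads happen here.
elapsed_phase1 = len(true_latency_ns) * MIN_LATENCY_NS         # 60 ns

# Phase 2: the cycle-accurate backend later attaches correct per-request
# latencies to the trace, but the program's timestamps were already taken.
mean_dram_latency = sum(true_latency_ns) / len(true_latency_ns)   # ~61.7 ns

print(elapsed_accurate, elapsed_phase1, mean_dram_latency)
# The program "measures" 60 ns for three accesses while the DRAM model reports
# ~62 ns per access: the same kind of inconsistency described above.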

Page 16: But Wait — and Accurater? What Programmers WANT: (and if you can do it → accurate, parallel sims)

if (INSTR.isMemOp) {
    if (L1_cache_miss(INSTR.dAddr)) {
        if (L2_cache_miss(INSTR.dAddr)) {
            /* ideal: DRAM_request() returns the access latency immediately */
            INSTR.valid = now + DRAM_request(INSTR.dAddr);
        }
    }
}

Page 17: But Wait — and Accurater? What Programmers WANT: (and if you can do it → accurate, parallel sims)

Prediction gives it to them
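A minimal sketch of what a prediction-backed DRAM_request() could look like. The class, its feature hook, the address mapping, and the latency table are all hypothetical, shown only to illustrate the shape of the interface; this is not the thesis implementation.

# Hypothetical predictor-backed memory model: returns an estimated latency
# immediately, so the CPU model never hands control to a coupled cycle-accurate
# memory event loop.
class PredictedDRAM:
    def __init__(self, classifier, class_latency_ns, extract_features):
        self.clf = classifier              # e.g., the trained decision tree
        self.latency = class_latency_ns    # class label -> recovered latency (ns)
        self.features = extract_features   # (now, addr, is_write, state) -> feature vector
        self.bank_state = {}               # purely local bookkeeping

    def request(self, now_ns, addr, is_write=False):
        x = self.features(now_ns, addr, is_write, self.bank_state)
        label = self.clf.predict([x])[0]          # classification step
        self._update_state(now_ns, addr)          # keep local bank state current
        return self.latency[label]                # recovery step: class -> latency

    def _update_state(self, now_ns, addr):
        bank = (addr >> 13) & 0x7                 # illustrative address mapping only
        self.bank_state[bank] = {"open_row": addr >> 16, "last_time": now_ns}

A CPU model would then write INSTR.valid = now + mem.request(now, INSTR.dAddr), exactly the pseudocode above; because the predictor keeps only local state, every CPU model can own an independent instance.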

Page 18: Bottom Line (scalability): The Future

[Figure 5.1: Parallel DRAM simulator architecture. A serial region (input interface: configuration, request inputs, address mapper, ...) feeds multiple controllers in a parallel region, each with its own bank states, scheduler, statistics, ...; a final serial region (output interface) produces aggregated statistics and request callbacks.]

5.1.2 Evaluation

First, we examine the execution speedup of the parallel DRAM simulator over trace inputs. The trace frontend contributes little overhead to the overall simulation time. We can also load traces to keep the DRAM simulator busy all the time to maximize the simulation time spent in the DRAM simulator. Therefore it is the ideal scenario for testing parallelization speedup.

We run two types of trace, stream and random, for 10 million cycles on an 8-channel HBM configuration. We compare the simulation time of running the simulation in 1, 2, 4, and 8 threads respectively.

Large parallel simulations enabled, wherein each CPU model can have its own memory-system predictor to provide estimates of main memory-system latency. None of the memory models need interact to provide their predictions. Moreover, the CPU models can be written in a FAR simpler way than they are now, making them faster and less likely to contain "gotcha" assumptions.
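A structural sketch of the serial-parallel-serial organization in Figure 5.1. The Controller here is a toy stand-in, and Python threads merely illustrate the structure (the real simulator's per-channel controllers and threading differ), but it shows why channels can advance independently while only the input and output interfaces are serial.

# Illustrative only: serial dispatch -> parallel per-channel ticking -> serial aggregation.
from concurrent.futures import ThreadPoolExecutor

class Controller:
    """Toy per-channel controller (stand-in for bank states, scheduler, statistics)."""
    def __init__(self, channel):
        self.channel, self.queue, self.served = channel, [], 0

    def add(self, addr, is_write):
        self.queue.append((addr, is_write))

    def tick(self, n_cycles):
        # Stand-in for cycle-by-cycle DRAM state updates; shares no state with other channels.
        for _ in range(n_cycles):
            if self.queue:
                self.queue.pop(0)
                self.served += 1
        return self.served

def simulate(requests, n_channels=8, n_cycles=10_000, n_threads=4):
    ctrls = [Controller(c) for c in range(n_channels)]
    for addr, is_write in requests:                  # serial region: address mapping / dispatch
        ctrls[(addr // 64) % n_channels].add(addr, is_write)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:   # parallel region
        list(pool.map(lambda c: c.tick(n_cycles), ctrls))
    return sum(c.served for c in ctrls)              # serial region: aggregate statistics

print(simulate([(i * 64, False) for i in range(4000)]))   # 4000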

Page 19: Shameless Plug

Call For Participation (www.memsys.io)

The International Symposium on Memory Systems, October 1–4, Washington DC

MEMSYS 2018: Keynote Addresses

Hardware Keynote: Steve Wallach, Micron
Software Keynote: Brian Barrett, Amazon
Postamble: J. Thomas Pawlowski, Micron

Panelists

Keren Bergman, Columbia; Wendy Elsasser, Arm; Philip Emma, Systems Technology & Architecture Consulting; Michael Healy, IBM; Adolfy Hoisie, BNL; Dave Resnick, Consultant; Jeffrey Vetter, ORNL

Organizers

Bruce Jacob, U. Maryland; Kathy Smiley, Memory Systems

Rajat Agarwal, Intel; Abdel-Hameed Badawy, NMSU; Jonathan Beard, Arm; Ishwar Bhati, Intel; Bruce Christenson, Intel; Zeshan Chishti, Intel; Zhaoxia (Summer) Deng, Facebook; Chen Ding, U. Rochester; David Donofrio, Berkeley Lab; Dietmar Fey, FAU Erlangen-Nürnberg; Maya Gokhale, LLNL; Xiaochen Guo, Lehigh U.; Manish Gupta, NVIDIA; Fazal Hameed, TU Dresden; Matthias Jung, Fraunhofer IESE; Kurt Keville, MIT; Hyesoon Kim, Georgia Tech; Scott Lloyd, LLNL; Sally A. McKee, Clemson; Moinuddin Qureshi, Georgia Tech; Petar Radojkovic, BSC; Arun Rodrigues, Sandia National Labs; Robert Voigt, Northrop Grumman; Gwendolyn Voskuilen, Sandia; David T. Wang, Samsung; Vincent Weaver, U. Maine; Norbert Wehn, U. Kaiserslautern; Yuan Xie, UC Santa Barbara; Ke Zhang, Chinese Acad. of Sciences; Xiaodong Zhang, Ohio State; Jishen Zhao, UC San Diego

Memory-device manufacturing, memory-architecture design, and the use of memory technologies by application software all profoundly impact today’s and tomorrow’s computing systems, in terms of their performance, function, reliability, predictability, power dissipation, and cost. Existing memory technologies are seen as limiting in terms of power, capacity, and bandwidth. Emerging memory technologies offer the potential to overcome both technology- and design-related limitations to answer the requirements of many different applications. Our goal is to bring together researchers, practitioners, and others interested in this exciting and rapidly evolving field, to update each other on the latest state of the art, to exchange ideas, and to discuss future challenges.

Conference Schedule and Venue

The conference will be held at the Gaylord National Resort & Convention Center at The National Harbor, Maryland. An opening reception will be held on Monday evening, followed by 2 1/2 days of technical presentations (full days on Tuesday and Wednesday, a half-length technical day on Thursday), Conference Dinner Wednesday evening, and Awards Luncheon Tuesday afternoon. A discounted room block is still available on the registration site, with only a few rooms left.

Tracks and Topics

The following topics will be presented over the 3-day conference:
• Memory-system design from both hardware and software perspectives
• Memory failure modes and mitigation strategies
• Memory-system resilience, especially at large scale
• Memory and system security issues
• Operating system design for hybrid/nonvolatile memories
• Technologies like flash, DRAM, STT-MRAM, 3DXP, memristors, etc.
• Memory-centric programming models, languages, optimization
• Compute-in-memory and compute-near-memory technologies
• Large-scale data movement: networks, hardware, software, mitigation
• Virtual memory redesign for unifying storage/memory/accelerators
• Algorithmic & software memory-management techniques
• Emerging memory technologies, both hardware and software, including memory-related blockchain applications
• Interference at the memory level across datacenter applications
• Issues in the design and operation of large-memory machines
• In-memory databases and NoSQL stores
• Post-CMOS scaling efforts and memory technologies to support them, including cryogenic, neural, quantum, and heterogeneous memories
• The conference focuses on these and other related topics.

Publications & Presentations

All accepted papers will be published in the ACM & IEEE Digital Libraries. Our primary goal is to showcase interesting ideas that will spark conversation between disparate groups—to get applications people, operating systems people, system architecture people, interconnect people and circuits people to talk to each other. Thus, we try to showcase interesting ideas in a format that will facilitate this. The talks are short, to encourage participation and discussion. Every evening we host a panel discussion of invited speakers, with beer, wine, and hot hors d'oeuvres.

2018 Conference Sponsors

Washington DC, Sep 30 – Oct 3, 2019

www.memsys.io

Page 20: Thank You!

Bruce Jacob

[email protected] www.ece.umd.edu/~blj


Page 21: Backup Slides

Page 22: Nomenclature

[System-level diagram, side and top views: a memory controller drives PCB bus traces through package pins to edge connectors on DIMM 0, DIMM 1, and DIMM 2; labels distinguish DRAMs from DIMMs and show Rank 0, Rank 1, or even Rank 0/1, Rank 2/3 …]

One DIMM can have one RANK, two RANKs, or even more depending on its configuration.

[Device-level diagram: I/O MUX, DRAM Array, one BANK shown as four ARRAYS.]

One DRAM device with eight internal BANKS, each of which connects to the shared I/O bus.

One DRAM bank is comprised of many DRAM ARRAYS, depending on the part's configuration. This example shows four arrays, indicating a x4 part (4 data pins).

Page 23: Background

[Excerpt from Memory Systems: Cache, DRAM, Disk, Chapter 15, "Memory System Design Analysis," pp. 589–590.]

[FIGURE 15.36: Bandwidth improvement—16-bank versus 8-bank DDR3 devices; relaxed tFAW and tWTR. Bandwidth-efficiency improvement percentage vs. per-bank (DRAM command) queue depth, for 2R8B vs. 1R8B, 1R16B vs. 1R8B, 2R16B vs. 2R8B, and 2R8B vs. 1R16B, with tRTRS = 1, 2, and 3 cycles; 1R16B outperforms 2R8B with a queue depth of 16.]

[FIGURE 15.37: Impact of scheduling policy on memory-access latency distribution: 179.art. Number of accesses at a given latency value vs. memory-access latency (ns), FCFS vs. CPRH.]

problems with the CPRH scheduling algorithm for other workloads. Figure 15.38 shows the latency distribution curve for 188.ammp, and 188.ammp was one workload that points to possible issues with the CPRH algorithm. Figure 15.38 shows that the CPRH scheduling algorithm resulted in longer latencies for a number of transactions, and the number of transactions with memory-access latency greater than 400 ns actually increased. Figure 15.38 also shows that the increase of a small number of transactions with memory-access latency greater than 400 ns is offset by the reduction of the number of transactions with memory transaction latency around 200 ns and the increase of the number of transactions with memory-access latency less than 100 ns. In other words, the CPRH scheduling algorithm redistributed the memory-access latency curve so that most memory transactions received a modest reduction in access latency, but a few memory transactions suffered a substantial increase in access latency. The net result is that the changes in access latency cancelled each other out, resulting in limited speedup for the CPRH algorithm over the FCFS algorithm for 188.ammp.

15.5 A Latency-Oriented Study

In the previous section, we examined the impact of transaction ordering on the memory-access latency distribution for various applications. Memory controller schedulers typically attempt to maximize performance by taking advantage of memory application access patterns to hide DRAM-access penalties. In this section, we provide insight into the impact that DRAM architectural choices make on the average read latency or memory-access latency. We briefly examine how the choice of DRAM protocol impacts memory system performance and then discuss in detail how aspects of the memory system protocol and configuration contribute to the observed access latency.4

15.5.1 Experimental Framework

This study uses DRAMSim, a stand-alone memory subsystem simulator. DRAMSim provides a detailed execution-driven model of a Fully Buffered (FB) DIMM memory system. The simulator also supports the variation of memory system parameters of interest, including scheduling policies and memory

[FIGURE 15.38: Impact of scheduling policy on memory-access latency distribution: 188.ammp. Number of accesses at a given latency value vs. memory-access latency (ns), FCFS vs. CPRH.]

4 Some of this section's material appears in "Fully-Buffered DIMM memory architectures: Understanding mechanisms, overheads and scaling," by B. Ganesh, A. Jaleel, D. Wang, and B. Jacob. In Proc. 13th International Symposium on High Performance Computer Architecture (HPCA 2007), Phoenix, AZ, February 2007. Copyright IEEE. Used with permission.

Page 24: Features Extracted

Table 6.1: Features with Descriptions (Feature / Values / Description / Intuition)

• same-row-last (0/1): whether the last request that goes to the same bank has the same row (as this one). Intuition: key factor for the most recent bank state.
• is-last-recent (0/1): whether the last request to the same bank was added recently (tRC). Intuition: relevancy of the last request to the same bank.
• is-last-far (0/1): whether the last request to the same bank was added long ago (tRFC). Intuition: relevancy of the last request to the same bank.
• op (0/1): operation (read/write). Intuition: for potential R/W scheduling.
• last-op (0/1): operation of the last request to the same bank. Intuition: for potential R/W scheduling.
• ref-after-last (0/1): whether there is a refresh since the last request to the same bank. Intuition: refresh resets the bank to idle.
• near-ref (0/1): whether this cycle is near a refresh cycle. Intuition: latency can be really high if it's near a refresh.
• same-row-prev (int): number of previous requests with the same row to the same bank. Intuition: if there is a same-row request then OOO may be possible.
• num-recent-bank (int): number of requests added recently to the same bank. Intuition: contention/queuing in the bank.
• num-recent-rank (int): number of recent requests added recently to the same rank. Intuition: contention.
• num-recent-all (int): number of recent requests added recently to all ranks. Intuition: contention.
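As a sketch, here is how a few of the Table 6.1 features could be computed on the fly for each incoming request. The address mapping, the tRC/tRFC windows, and the refresh schedule below are placeholders; the thesis derives them from the simulated device's actual timing parameters.

# Extract a subset of the Table 6.1 features for one incoming request.
from collections import deque

TRC_NS, TRFC_NS, TREFI_NS = 45, 350, 7800     # illustrative DDR-like timings

def bank_of(addr): return (addr >> 13) & 0x7  # toy address mapping
def row_of(addr):  return addr >> 16

def extract_features(now, addr, is_write, history):
    """history: recent (time, addr, is_write) requests, oldest first."""
    bank, row = bank_of(addr), row_of(addr)
    last = next(((t, a, w) for t, a, w in reversed(history) if bank_of(a) == bank), None)

    same_row_last   = int(last is not None and row_of(last[1]) == row)
    is_last_recent  = int(last is not None and now - last[0] <= TRC_NS)
    is_last_far     = int(last is not None and now - last[0] >= TRFC_NS)
    last_op         = int(last is not None and last[2])
    ref_after_last  = int(last is not None and (now // TREFI_NS) > (last[0] // TREFI_NS))
    near_ref        = int(TREFI_NS - (now % TREFI_NS) < TRFC_NS)
    num_recent_bank = sum(1 for t, a, _ in history if now - t <= TRC_NS and bank_of(a) == bank)
    num_recent_all  = sum(1 for t, _, _ in history if now - t <= TRC_NS)

    return [same_row_last, is_last_recent, is_last_far, int(is_write), last_op,
            ref_after_last, near_ref, num_recent_bank, num_recent_all]

# Usage: keep a sliding window of recent requests and featurize each new one.
history = deque(maxlen=64)
print(extract_features(100, 0x01230020, False, history))
history.append((100, 0x01230020, False))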