Solar-DRAM: Reducing DRAM Access Latency by Exploiting the Variation in Local Bitlines
Jeremie S. Kim, Minesh Patel, Hasan Hassan, Onur Mutlu


Page 1:
Solar-DRAM: Reducing DRAM Access Latency by Exploiting the Variation in Local Bitlines
Jeremie S. Kim, Minesh Patel, Hasan Hassan, Onur Mutlu

Page 2: Executive Summary
Motivation: DRAM latency is a major performance bottleneck.
Problem: Many important workloads exhibit bank conflicts in DRAM, which result in even longer latencies.
Goal:
1. Rigorously characterize access latency on LPDDR4 DRAM
2. Exploit findings to robustly reduce DRAM access latency
Solar-DRAM:
• Categorizes local bitlines as "weak (slow)" or "strong (fast)"
• Robustly reduces DRAM access latency for reads and writes to data contained in "strong" local bitlines
Evaluation:
1. Experimentally characterize 282 real LPDDR4 DRAM chips
2. In simulation, Solar-DRAM provides a 10.87% system performance improvement over LPDDR4 DRAM

Page 3: Solar-DRAM Outline
Motivation and Goal
DRAM Background
Experimental Methodology
Characterization Results
Mechanism: Solar-DRAM
Evaluation
Conclusion

Page 4: Solar-DRAM Outline
Motivation and Goal
DRAM Background
Experimental Methodology
Characterization Results
Mechanism: Solar-DRAM
Evaluation
Conclusion

Page 5: Motivation and Goal
• Many important workloads exhibit many bank conflicts
• Bank conflicts result in an additional delay of tRCD
• This negatively impacts overall system performance

• A prior work (FLY-DRAM) finds weak (slow) cells and uses a variable tRCD depending on cell strength; however:
  • They do not show the viability of a static profile of cell strength
  • They characterize an older generation (DDR3) of DRAM

• Our goal is to:
  • Rigorously characterize state-of-the-art LPDDR4 DRAM
  • Demonstrate the viability of using a static profile of cell strength
  • Devise a mechanism to exploit more activation failure (tRCD) characteristics and further reduce DRAM latency

Page 6: Solar-DRAM Outline
Motivation and Goal
DRAM Background
Experimental Methodology
Characterization Results
Mechanism: Solar-DRAM
Evaluation
Conclusion

Page 7: DRAM Background
[Figure: a DRAM cell, consisting of a capacitor and an access transistor, connected to a wordline, a bitline, and a sense amplifier]
Each DRAM cell is made of 1 capacitor and 1 transistor.
The wordline enables reading/writing data in the cell.
The bitline moves data between cells and the I/O circuitry.

Page 8: DRAM Background
[Figure: a DRAM bank, organized as a set of subarrays; each subarray has a local row decoder and a local row buffer, and the subarrays share a global row decoder and a global row buffer. Each DRAM cell sits at the intersection of a wordline and a local bitline; a DRAM row spans a subarray]
A DRAM bank is organized hierarchically with subarrays.
Columns of cells in subarrays share a local bitline.
Rows of cells in a subarray share a wordline.

Page 9: DRAM Operation
[Figure: READs to cache lines in a row are served from the local row buffer selected by the row decoder. Example command sequence: ACT R0, RD, RD, RD, PRE R0, ACT R1, RD, RD, RD]

Page 10: DRAM Accesses and Failures [Kim+ HPCA'18]
[Figure: bitline voltage over time for an access. After ACTIVATE, bitline charge sharing raises the bitline voltage from 0.5 Vdd; once the sense amplifier (SA) is enabled, the voltage is amplified toward Vdd. tRCD is set with a guardband so that READ occurs only after the bitline reaches the ready-to-access voltage level (Vmin). Weak (slow) bitlines reach Vmin later than strong (fast) ones]
Process variation during manufacturing results in cells having unique behavior.

Page 11: DRAM Accesses and Failures [Kim+ HPCA'18]
[Figure: the same bitline voltage plot as on Page 10]
Weaker cells have a higher probability to fail because they are slower.

Page 12: Recap of Goals
To identify the opportunity for reliably reducing tRCD, we want to:
1. Rigorously characterize state-of-the-art LPDDR4 DRAM
2. Demonstrate the viability of using a static profile of cell strength
3. Devise a mechanism to exploit more activation failure (tRCD) characteristics and further reduce DRAM latency

Page 13: Solar-DRAM Outline
Motivation and Goal
DRAM Background
Experimental Methodology
Characterization Results
Mechanism: Solar-DRAM
Evaluation
Conclusion

Page 14: Experimental Methodology
• 282 2y-nm LPDDR4 DRAM modules
  • 2GB device size
  • From 3 major DRAM manufacturers
• Thermally controlled testing chamber
  • Ambient temperature range: 40°C to 55°C, controlled to ±0.25°C
  • DRAM temperature is held at 15°C above ambient
• Precise control over DRAM commands and timing parameters
  • Test reduced-latency effects by reducing the tRCD parameter
• Ramulator DRAM simulator [Kim+, CAL'15]
  • Access latency characterization in real workloads

Page 15: Solar-DRAM Outline
Motivation and Goal
DRAM Background
Experimental Methodology
Characterization Results
Mechanism: Solar-DRAM
Evaluation
Conclusion

Page 16: Characterization Results
1. Spatial distribution of activation failures
2. Spatial locality of activation failures
3. Distribution of cache accesses in real workloads
4. Short-term variation of activation failure probability
5. Effects of reduced tRCD on write operations

Page 17: Spatial Distribution of Failures
How are activation failures spatially distributed in DRAM?
[Figure: bitmap of activation failures over a 1024x1024 region of DRAM cells (DRAM row number vs. DRAM column number); failures cluster along local bitlines, end at a subarray edge, and skip a remapped row]
Activation failures are highly constrained to local bitlines (i.e., subarrays).

Page 18: Spatial Locality of Failures
Where does a single access induce activation failures?
[Figure: a READ to one cache line of a newly activated row; failures (✘) occur only in that cache line, in the cells on weak bitlines]
Activation failures are constrained to the cache line first accessed immediately following an activation.

Page 19: Spatial Locality of Failures
Where does a single access induce activation failures?
[Figure: same as Page 18]
Activation failures are constrained to the cache line first accessed immediately following an activation.
This shows that we can rely on a static profile of weak bitlines to determine whether an access will cause failures.
We can profile regions of DRAM at the granularity of cache lines within subarrays (i.e., subarray columns).

Page 20: Distribution of Cache Accesses
Which cache line is most likely to be accessed first immediately following an activation?
[Figure: a DRAM row holds multiple cache lines, at cache line offsets 0, 1, 2, ...]

Page 21: Distribution of Cache Accesses
Which cache line is most likely to be accessed first immediately following an activation?
[Figure: probability of first access (%) vs. cache line offset in a newly-activated DRAM row]
In some applications, up to 22.2% of first accesses to a newly-activated DRAM row request cache line 0 in the row.

Page 22: Distribution of Cache Accesses
Which cache line is most likely to be accessed first immediately following an activation?
[Figure: same as Page 21]
In some applications, up to 22.2% of first accesses to a newly-activated DRAM row request cache line 0 in the row.
Reduced tRCD generally affects cache line 0 in a row more than the other cache line offsets.

Page 23: Short-term Variation
Does a bitline's probability of failure (i.e., latency characteristics) change over time?

F_prob = ( Σ_{n=1}^{cells_in_SA_bitline} num_iters_failed(cell_n) ) / ( num_iters × cells_in_SA_bitline )

cells_in_SA_bitline: number of cells in a local bitline
num_iters: number of iterations in which we try to induce failures in each cell
num_iters_failed(cell_n): number of iterations in which cell n fails

We sample F_prob many times over a long period and plot how F_prob varies across all samples.
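As a concrete illustration, the following minimal Python sketch computes F_prob from per-cell failure counts (the function and variable names are ours, for illustration only; they are not from the talk):

def f_prob(num_iters_failed, num_iters):
    # num_iters_failed[n]: iterations in which cell n of the local bitline failed
    # num_iters: iterations in which we tried to induce a failure in each cell
    cells_in_sa_bitline = len(num_iters_failed)
    return sum(num_iters_failed) / (num_iters * cells_in_sa_bitline)

# Example: a 512-cell local bitline tested for 100 iterations, in which
# three cells failed in 7, 2, and 1 iterations, respectively.
counts = [0] * 512
counts[10], counts[200], counts[313] = 7, 2, 1
print(f_prob(counts, 100))  # ~0.000195, i.e., ~0.02% failure probability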

Page 24: Short-term Variation
Does a bitline's probability of failure (i.e., latency characteristics) change over time?
[Figure: F_prob at time t2 (%) vs. F_prob at time t1 (%) for the same bitline; the points cluster along the x=y line]
A weak bitline is likely to remain weak and a strong bitline is likely to remain strong over time.

Page 25: Short-term Variation
Does a bitline's probability of failure (i.e., latency characteristics) change over time?
[Figure: same as Page 24]
A weak bitline is likely to remain weak and a strong bitline is likely to remain strong over time.
This shows that we can statically profile weak bitlines and rely on that profile to determine whether a future access will cause failures.

Page 26: Write Operations
How are write operations affected by reduced tRCD?
[Figure: a WRITE to a cache line of a newly activated row; the write succeeds even on weak bitlines]
We can reliably issue write operations with a significantly reduced tRCD (e.g., reduced by 77%).

Page 27: Solar-DRAM Outline
Motivation and Goal
DRAM Background
Experimental Methodology
Characterization Results
Mechanism: Solar-DRAM
Evaluation
Conclusion

Page 28: Solar-DRAM
Identifies subarray columns as "weak (slow)" or "strong (fast)" and accesses cache lines in strong subarray columns with a reduced tRCD.
Uses a static profile of weak subarray columns
• Obtained in a one-time profiling step
Three Components:
1. Variable-latency cache lines (VLC)
2. Reordered subarray columns (RSC)
3. Reduced latency for writes (RLW)

Page 29: Solar-DRAM
Identifies subarray columns as "weak (slow)" or "strong (fast)" and accesses cache lines in strong subarray columns with a reduced tRCD.
Uses a static profile of weak subarray columns
• Obtained in a one-time profiling step
Three Components:
1. Variable-latency cache lines (VLC)
2. Reordered subarray columns (RSC)
3. Reduced latency for writes (RLW)

Page 30: Solar-DRAM: VLC (I)
[Figure: a subarray with weak and strong bitlines; a strong subarray column is a cache-line-wide column that contains no weak bitlines]
Identifies subarray columns comprised of strong bitlines.
Accesses cache lines in strong subarray columns with a reduced tRCD.

Page 31: Solar-DRAM
Identifies subarray columns as "weak (slow)" or "strong (fast)" and accesses cache lines in strong subarray columns with a reduced tRCD.
Uses a static profile of weak subarray columns
• Obtained in a one-time profiling step
Three Components:
1. Variable-latency cache lines (VLC)
2. Reordered subarray columns (RSC)
3. Reduced latency for writes (RLW)

Page 32: Solar-DRAM: RSC (II)
[Figure: cache line 0 and cache line 1 of each row are remapped so that cache line 0 falls on a strong subarray column]
Remap cache lines across DRAM at the memory controller level so cache line 0 will likely map to a strong cache line.
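A minimal sketch of one way such a remapping could work in the memory controller, assuming a hypothetical per-bank profile that identifies the strongest subarray column (the function name and swap-based scheme are illustrative, not from the talk):

def rsc_remap(cache_line_idx, strongest_col):
    # Swap cache line 0 with the bank's strongest subarray column, so the
    # most frequently first-accessed cache line (offset 0) is likely served
    # with reduced tRCD. Swapping keeps the mapping a bijection.
    if cache_line_idx == 0:
        return strongest_col
    if cache_line_idx == strongest_col:
        return 0
    return cache_line_idx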

Page 33: Solar-DRAM
Identifies subarray columns as "weak (slow)" or "strong (fast)" and accesses cache lines in strong subarray columns with a reduced tRCD.
Uses a static profile of weak subarray columns
• Obtained in a one-time profiling step
Three Components:
1. Variable-latency cache lines (VLC)
2. Reordered subarray columns (RSC)
3. Reduced latency for writes (RLW)

Page 34: Solar-DRAM: RLW (III)
[Figure: a WRITE to a cache line of a newly activated row; the write succeeds regardless of whether the subarray column is strong]
Write to all locations in DRAM with a significantly reduced tRCD (e.g., reduced by 77%).
Cache lines do not fail when written with reduced tRCD.

Page 35: Solar-DRAM: Putting it all Together
Each component increases the number of accesses that can be issued with a reduced tRCD.
The components combine to further increase the number of cases where tRCD can be reduced.
Solar-DRAM utilizes each component (VLC, RSC, and RLW) in concert to reduce DRAM latency and significantly improve system performance.

Page 36: Solar-DRAM Outline
Motivation and Goal
DRAM Background
Experimental Methodology
Characterization Results
Mechanism: Solar-DRAM
Evaluation
Conclusion

Page 37: Evaluation Methodology
• Cycle-level simulator: Ramulator [Kim+, CAL'15] https://github.com/CMU-SAFARI/ramulator
• 4-core system with LPDDR4-3200 memory
• Benchmarks: SPEC CPU2006
• 40 4-core workloads
• Performance metric: Weighted Speedup (WS)
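For reference, weighted speedup for an N-core workload is commonly defined as WS = Σ_{i=1}^{N} IPC_i^shared / IPC_i^alone, where IPC_i^shared is application i's IPC when all applications run together and IPC_i^alone is its IPC when running alone on the same system. This is the metric's standard definition in the literature; the slide does not spell it out.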

Page 38: Evaluation: Homogeneous Workloads
[Figure: weighted speedup improvement (%) across 4-core homogeneous workload mixes for FLY-DRAM]

Page 39: Evaluation: Homogeneous Workloads
[Figure: weighted speedup improvement (%) across 4-core homogeneous workload mixes for FLY-DRAM and VLC]

Page 40: Evaluation: Homogeneous Workloads
[Figure: weighted speedup improvement (%) across 4-core homogeneous workload mixes for FLY-DRAM, VLC, and RSC]

Page 41: Evaluation: Homogeneous Workloads
[Figure: weighted speedup improvement (%) across 4-core homogeneous workload mixes for FLY-DRAM, VLC, RSC, and RLW]

Page 42: Evaluation: Homogeneous Workloads
[Figure: weighted speedup improvement (%) across 4-core homogeneous workload mixes for FLY-DRAM, VLC, RSC, RLW, and Solar-DRAM]
Solar-DRAM reduces tRCD for more DRAM accesses and provides a 10.87% performance benefit.

Page 43: Other Results in the Paper
• A detailed analysis on:
  - Devices of the three major DRAM manufacturers
  - Data pattern dependence of activation failures
    - A random data pattern finds the highest coverage of weak bitlines
  - Temperature effects on activation failure probability
    - Fprob generally increases with higher temperatures
  - Evaluation with heterogeneous workloads
    - Solar-DRAM provides an 8.79% performance benefit
• Further discussion on:
  - Implementation details
  - Finding a comprehensive profile of weak subarray columns

Page 44: Solar-DRAM Outline
Motivation and Goal
DRAM Background
Experimental Methodology
Characterization Results
Mechanism: Solar-DRAM
Evaluation
Conclusion

Page 45: Executive Summary
Motivation: DRAM latency is a major performance bottleneck.
Problem: Many important workloads exhibit bank conflicts in DRAM, which result in even longer latencies.
Goal:
1. Rigorously characterize access latency on LPDDR4 DRAM
2. Exploit findings to robustly reduce DRAM access latency
Solar-DRAM:
• Categorizes local bitlines as "weak (slow)" or "strong (fast)"
• Robustly reduces DRAM access latency for reads and writes to data contained in "strong" local bitlines
Evaluation:
1. Experimentally characterize 282 real LPDDR4 DRAM chips
2. In simulation, Solar-DRAM provides a 10.87% system performance improvement over LPDDR4 DRAM

Page 46:
Solar-DRAM: Reducing DRAM Access Latency by Exploiting the Variation in Local Bitlines
Jeremie S. Kim, Minesh Patel, Hasan Hassan, Onur Mutlu

Page 47: Evaluation: Heterogeneous Workloads
[Figure: weighted speedup improvement (%) across 4-core heterogeneous workload mixes for FLY-DRAM, VLC, RSC, RLW, and Solar-DRAM]
Solar-DRAM reduces tRCD for more DRAM accesses and provides an 8.79% performance benefit.

Page 48: Temperature
[Figure: box-and-whiskers plots of F_prob at temperature T+5°C (%) vs. F_prob at temperature T (%) for vendors A, B, and C]
We study the effects of changing temperature on F_prob. The x-axis shows F_prob at a given temperature T, and the y-axis plots the distribution (box-and-whiskers) of F_prob at a higher temperature for the same bitline.
Since a majority of the data points are above the x=y line, F_prob generally increases with higher temperatures.

Page 49: Data Pattern Dependence
[Figure: coverage of weak local bitlines over testing iterations for vendors A, B, and C]
We study how using different data patterns affects the number of weak bitlines found over multiple iterations.

Page 50: DRAM Background
[Figure: a CPU with multiple cores and a memory controller connects to a DRAM module over a 64-bit channel; the module contains DRAM ranks, each rank consists of DRAM chips 0 to N-1, and each chip contains DRAM banks 0 to B-1 that share I/O circuitry and an internal data/command bus]
DRAM chips are organized into DRAM ranks and modules.
The CPU interfaces with DRAM at the granularity of a module, with a memory controller that has a 64-bit channel connection.

Page 51: Evaluation Methodology (excerpt from the paper)
...reduced tRCD (i.e., 4 ns, as measured with our experimental infrastructure) for all write operations to DRAM.

6.2. Static Profile of Weak Subarray Columns
To obtain the static profile of weak subarray columns, we run multiple iterations of Algorithm 1, recording all subarray columns containing observed activation failures. As we observe in Section 5, there are various factors that affect a local bitline's probability of failure (Fprob). We use these factors to determine a method for identifying a comprehensive profile of weak subarray columns for a given DRAM module. First, we use our observation on the accumulation rate of finding weak local bitlines (Section 5.5) to determine the number of iterations we expect to test each DRAM module. However, since there is such high variation across DRAM modules (as seen in the standard deviations of the distributions in Observation 11), we can only provide the expected number of iterations needed to find a comprehensive profile for DRAM modules of a manufacturer, and the time to profile depends on the module. We show in Section 5.2 that no single data pattern alone finds a high coverage of weak local bitlines. This indicates that we must test each data pattern (40 data patterns) for the expected number of iterations needed to find a comprehensive profile of a DRAM module for a range of temperatures (Section 5.3). While this could result in many iterations of testing (on the order of a few thousand; see Section 5.5), this is a one-time process on the order of half a day per bank that results in a reliable profile of weak subarray columns. The required one-time profiling can be performed in two ways: 1) the system running Solar-DRAM can profile a DRAM module when the memory controller detects a new DRAM module at bootup, or 2) the DRAM manufacturer can profile each DRAM module and provide the profile within the Serial Presence Detect (SPD) circuitry (a read-only memory present in each DIMM) [20].

To minimize the storage overhead of the weak subarray column profile in the memory controller, we encode each subarray column with a bit indicating whether or not to issue accesses to it with a reduced tRCD. After profiling DRAM, the memory controller loads the weak subarray column profile once into a small lookup table in the DRAM channel's memory controller.³ For any DRAM request, the memory controller references the lookup table with the subarray column that is being accessed. The memory controller determines the tRCD timing parameter according to the value of the bit found in the lookup table.
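A minimal sketch of this lookup path, assuming a hypothetical flat bit-vector indexed by (bank, subarray, subarray column); the class name, constants, and indexing scheme are illustrative, not from the paper:

TRCD_DEFAULT = 29  # cycles (18.13 ns)
TRCD_REDUCED = 18  # cycles (11.25 ns)

class WeakColumnTable:
    def __init__(self, weak_bits, subarrays_per_bank, cols_per_subarray):
        self.weak_bits = weak_bits  # one entry per subarray column; True = weak
        self.spb = subarrays_per_bank
        self.cps = cols_per_subarray

    def trcd_for(self, bank, subarray, col):
        # One bit decides which tRCD a request to this subarray column uses.
        idx = (bank * self.spb + subarray) * self.cps + col
        return TRCD_DEFAULT if self.weak_bits[idx] else TRCD_REDUCED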

7. Solar-DRAM Evaluation
We first discuss our evaluation methodology and evaluated system configurations. We then present our multi-core simulation results for our chosen system configurations.

³To store the lookup table for a DRAM channel, we require (num_banks × num_subarrays_per_bank × row_size / cacheline_size) bits, where num_subarrays_per_bank is the number of subarrays in a bank, row_size is the size of a DRAM row in bits, and cacheline_size is the size of a cache line in bits. For a 4GB DRAM module with 8 banks, 64 subarrays per bank, 32-byte cache lines, and 2KB per row, the lookup table requires 4KB of storage.
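Plugging the footnote's example numbers into this formula reproduces the 4KB figure (a quick sanity check, not code from the paper):

num_banks, subarrays_per_bank = 8, 64
row_size_bits = 2 * 1024 * 8        # 2KB DRAM row
cacheline_size_bits = 32 * 8        # 32-byte cache line
bits = num_banks * subarrays_per_bank * (row_size_bits // cacheline_size_bits)
print(bits, "bits =", bits // 8 // 1024, "KB")  # 32768 bits = 4 KB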

7.1. Evaluation Methodology
System Configurations. We evaluate the performance of Solar-DRAM on a 4-core system using Ramulator [1, 32], an open-source cycle-accurate DRAM simulator, in CPU-trace-driven mode. We analyze various real workloads with traces from the SPEC CPU2006 benchmark [2] that we collect using Pintool [43]. Table 1 shows the configuration of our evaluated system. We use the standard LPDDR4-3200 [18] timing parameters as our baseline. To give a conservative estimate of Solar-DRAM's performance improvement, we simulate with a 64B cache line and a subarray size of 1024 rows.⁴

Table 1: Evaluated system configuration.
Processor: 4 cores, 4 GHz, 4-wide issue, 8 MSHRs/core, OoO 128-entry window
LLC: 8 MiB shared, 64B cache line, 8-way associative
Memory Controller: 64-entry R/W queue, FR-FCFS [55, 74]
DRAM: LPDDR4-3200 [18], 2 channels, 1 rank/channel, 8 banks/rank, 64K rows/bank, 1024 rows/subarray, 8 KiB row-buffer; Baseline: tRCD/tRAS/tWR = 29/67/29 cycles (18.125/41.875/18.125 ns)
Solar-DRAM: reduced tRCD for requests to strong cache lines: 18 cycles (11.25 ns); reduced tRCD for write requests: 7 cycles (4.375 ns)

Solar-DRAM Configuration. To evaluate Solar-DRAM and FLY-DRAM [6] on a variety of different DRAM modules with unique properties, we simulate varying 1) the number of weak subarray columns per bank between n = 1 and 512, and 2) the chosen weak subarray columns in each bank. For a given n, i.e., weak subarray column count, we generate 10 unique profiles with n randomly chosen weak subarray columns per bank. The profile indicates whether a subarray column should be accessed with the default tRCD (29 cycles; 18.13 ns) or the reduced tRCD (18 cycles; 11.25 ns). We use these profiles to evaluate 1) Solar-DRAM's three components (described in Section 6.1) independently, 2) Solar-DRAM with all three of its components, 3) FLY-DRAM [6], and 4) our baseline LPDDR4 DRAM.

Variable-latency cache lines (VLC) directly uses a weak subarray column profile to determine whether an access should be issued with a reduced or default tRCD value. Reordered subarray columns (RSC) takes a profile and maps the 0th cache line to the strongest global column in each bank. For a given profile, this maximizes the probability that any access to the 0th cache line of a row will be issued with a reduced tRCD. Reduced latency for writes (RLW) reduces tRCD to 7 cycles (4.38 ns) (Section 5.6) for all write operations to DRAM. Solar-DRAM (Section 6.1) combines all three components (VLC, RSC, and RLW). Since FLY-DRAM [6] issues read requests at the granularity of the global column depending on whether a global column contains weak bits, we evaluate FLY-DRAM by taking a weak subarray column profile and extending each weak subarray column to the global column containing it. Baseline LPDDR4 uses a fixed tRCD of 29 cycles (18.13 ns) for all accesses. We present performance improvement of the different mechanisms over this LPDDR4 baseline.
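The random-profile generation described above could be sketched as follows (assuming 512 subarray columns per bank, as in the sweep's upper bound; this is an illustrative sketch, not the authors' script):

import random

def make_profiles(n_weak, num_profiles=10, num_banks=8, cols_per_bank=512):
    # Each profile marks n_weak randomly chosen subarray columns per bank
    # as weak (True); all other subarray columns are strong (False).
    profiles = []
    for _ in range(num_profiles):
        profile = []
        for _ in range(num_banks):
            weak = set(random.sample(range(cols_per_bank), n_weak))
            profile.append([col in weak for col in range(cols_per_bank)])
        profiles.append(profile)
    return profiles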

⁴Using the typical upper-limit values for these configuration variables reduces the total number of subarray columns that comprise DRAM (to 8,192 subarray columns per bank). A smaller number of subarray columns reduces the granularity at which we can issue DRAM accesses with reduced tRCD, which reduces Solar-DRAM's potential for performance benefit. This is because a single activation failure requires the memory controller to access larger regions of DRAM with the default tRCD.

Page 52: Testing Methodology (excerpt from the paper)
4. Testing Methodology
To analyze DRAM behavior under reduced tRCD values, we developed an infrastructure to characterize state-of-the-art LPDDR4 DRAM chips [19] in a thermally-controlled chamber. Our testing environment gives us precise control over DRAM commands and tRCD, as verified via a logic analyzer probing the command bus. In addition, we determined the address mapping for internal DRAM row scrambling so that we could study the spatial locality of activation failures in the physical DRAM chip. We test for activation failures across a DRAM module using Algorithm 1. The key idea is to access every cache line across DRAM, and open a closed row on each access. This guarantees that we test every DRAM cell's propensity for activation failure.

Algorithm 1: DRAM Activation Failure Testing
1  DRAM_ACT_fail_testing(data_pattern, reduced_tRCD):
2    write data_pattern (e.g., solid 1s) into all DRAM cells
3    foreach col in DRAM module:
4      foreach row in DRAM module:
5        refresh(row)  // replenish cell voltage
6        precharge(row)  // ensure next access activates row
7        read(col) with reduced_tRCD  // induce activation failures on col
8        find and record activation failures

We first write a known data pattern to DRAM (Line 2) for consistent testing conditions. The for loops (Lines 3-4) ensure that we test all DRAM cache lines. For each cache line, we 1) refresh the row containing it (Line 5) to induce activation failures in cells with similar levels of charge, 2) precharge the row (Line 6), and 3) activate the row again with a reduced tRCD (Line 7) to induce activation failures. We then find and record the activation failures in the row (Line 8), by comparing the read data to the data pattern the row was initialized with. We experimentally determine that Algorithm 1 takes approximately 200ms to test a single bank.
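For clarity, here is a Python rendering of Algorithm 1's control flow, assuming hypothetical low-level helpers (write_all, refresh, precharge, read_cacheline) exposed by an FPGA-based testing infrastructure; this is a sketch, not the authors' actual test code:

def act_fail_testing(dram, data_pattern, reduced_trcd):
    failures = []
    dram.write_all(data_pattern)       # Line 2: known data everywhere
    for col in dram.columns():         # Line 3
        for row in dram.rows():        # Line 4
            dram.refresh(row)          # Line 5: replenish cell voltage
            dram.precharge(row)        # Line 6: next access re-activates row
            data = dram.read_cacheline(row, col, trcd=reduced_trcd)  # Line 7
            if data != data_pattern:   # Line 8: record failing locations
                failures.append((row, col))
    return failures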

Unless otherwise specified, we perform all tests using 2y-nm LPDDR4 DRAM chips from three major manufacturers in a thermally-controlled chamber held at 55°C. We control the ambient temperature precisely using heaters and fans. A microcontroller-based PID loop controls the heaters and fans to within an accuracy of 0.25°C and a reliable range of 40°C to 55°C. We keep the DRAM temperature at 15°C above ambient temperature using a separate local heating source. This local heating source probes local on-chip temperature sensors to smooth out temperature variations due to self-induced heating.

5. Activation Failure Characterization
We present our extensive characterization of activation failures in modern LPDDR4 DRAM modules from three major DRAM manufacturers. We make a number of key observations that 1) support the viability of a mechanism that uses a static profile of weak cells to exploit variation in access latencies of DRAM cells, and 2) enable us to devise new mechanisms that exploit more activation failure characteristics to further reduce DRAM latency.

5.1. Spatial Distribution of Activation Failures
We first analyze the spatial distribution of activation failures across DRAM modules by visually inspecting bitmaps of activation failures across many DRAM banks. A representative 1024x1024 array of DRAM cells with a significant number of activation failures is shown in Figure 3. Using these bitmaps, we make three key observations. Observation 1: Activation failures are highly constrained to local bitlines.
[Figure 3: activation failure bitmap in a 1024x1024 cell array (DRAM row number vs. DRAM column number), showing failures along local bitlines, a subarray edge, and a remapped row]

We infer that the granularity at which we see bitline-wide activation failures is a subarray. This is because the number of consecutive rows with activation failures on the same bitline falls within the range of expected modern subarray sizes of 512 to 1024 [31, 37]. We hypothesize that this occurs as a result of process manufacturing variation at the level of the local sense amplifiers. Some sense amplifiers are manufactured "weaker" and cannot amplify data on the local bitline as quickly. This results in a higher probability of activation failures in DRAM cells attached to the same "weak" local bitline. While manufacturing process variation dictates the local bitlines that contain errors, the manufacturer's design decision on subarray size dictates the number of cells attached to the same local bitline, and thus, the number of consecutive rows that contain activation failures in the same local bitline. Observation 2: Subarrays from Vendor B and C's DRAM modules consist of 512 DRAM rows, while subarrays from Vendor A's DRAM modules consist of 1024 DRAM rows. Observation 3: We find that within a set of subarray rows, very few rows (<0.001%) exhibit a significantly different set of cells that experience activation failures compared to the expected set of cells. We hypothesize that the rows with significantly different failures are rows that are remapped to redundant rows (see [25, 40]) after the DRAM module was manufactured (indicated in Figure 3).

We next study the granularity at which activation failures can be induced when accessing a row. We make two observations (also seen in prior work [6]). Observation 4: When accessing a row with low tRCD, the errors in the row are constrained to the DRAM cache line granularity (typically 32 or 64 bytes), and only occur in the aligned 32 bytes that are first accessed in a closed row (i.e., up to 32 bytes are affected by a single low-tRCD access). Prior work [6] also observes that failures are constrained to cache lines on a system with 64-byte cache lines. Observation 5: The first cache line accessed in a closed DRAM row is the only cache line in the row that we observe to exhibit activation failures. We hypothesize that DRAM cells that are subsequently accessed in the same row have enough time to have their charge amplified and completely restored for correct sensing. We next study the proportion of weak subarray columns per bank across many DRAM banks from all 282 of our DRAM modules. We collect the proportion of weak subarray columns


Page 53: Implementation Overhead
This backup slide shows the same paper excerpt as Page 51 (Section 6.2 and footnote 3, which describe the weak subarray column profile and its storage cost).
The lookup table is stored in the memory controller that interfaces with the DRAM channel.