LA-UR-14-27913
Supercomputer Field Data –
DRAM, SRAM, and Projections for Future Systems
Nathan DeBardeleben, Ph.D. (LANL), Ultrascale Systems Research Center (USRC)
6th Soft Error Rate (SER) Workshop
Santa Clara, October 16, 2014
Acknowledgements
n Collaborators
• Vilas Sridharan, AMD RAS, Advanced Micro Devices Inc.
• Sudhanva Gurumurthi, AMD Research, Advanced Micro Devices Inc.
• Jon Stearley, Scalable Architectures, Sandia National Laboratories
• Kurt Ferreira, Scalable Architectures, Sandia National Laboratories
• John Shalf, Computational Research Division, Lawrence Berkeley National Laboratory
n Thanks to
• Many folks at Cray
• Many folks at LANL, LBNL, SNL, ORNL
n This is a team effort
• By no means is this entirely my own work!
Cielo Data – A More Detailed Analysis Than Is Often Possible
n Cielo affords us more data sources than are often available
n It’s very easy to jump to wrong conclusions with the kind of data usually available
n Such as . . .
Vendor effects
n A correlation to physical location... is due to a non-uniform distribution of vendors... and disappears when examined by vendor.
n DRAM reliability studies must account for DRAM vendor or risk inaccurate conclusions
Sridharan et al., Feng Shui of Supercomputer Memory, SC 2013
[Figure annotations: South? Hotter? Why?]
Field Data – It Matters
[Figure: billions of DRAM hours by manufacturer (A, B, C): 3.14, 14.48, 5.41]
[Figure: FITs per DRAM by manufacturer (A, B, C), split into permanent and transient faults: 73.6, 41.2, 18.8]
n Not all vendors are created equal
n As much as a 4x difference in FIT rate depending on the DRAM vendor used in Cielo nodes (a worked FIT definition follows below)
n While vendors B and C are split roughly 50/50 between permanent and transient faults, vendor A is closer to 30/70 permanent/transient
Sridharan et al., Feng Shui of Supercomputer Memory, SC 2013
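For reference, FIT (failures in time) counts faults per billion device-hours, so the per-vendor rates above come from dividing observed fault counts by accumulated DRAM-hours. A minimal worked definition, using a hypothetical fault count rather than the actual Cielo data:

    \mathrm{FIT} = \frac{N_{\text{faults}}}{\text{device-hours}} \times 10^{9}

    \text{Hypothetical example: } \frac{400\ \text{faults}}{5.4 \times 10^{9}\ \text{DRAM-hours}} \times 10^{9} \approx 74\ \text{FIT per DRAM}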
What About Hopper?
[Figure (Cielo): FITs per DRAM by manufacturer (A, B, C), permanent vs. transient faults: 73.6, 41.2, 18.8]
n Cielo = 8.5k nodes, Hopper = 6k nodes
n Cielo at 7.3k feet elevation
n Hopper at 43 feet elevation
n Same DRAM vendor IDs, approximately same relative concentration in entire system
n Vendor A higher fault rate in Cielo, likely attributable to altitude
Currently under peer review . . .
[Figure (Hopper): FITs per DRAM by vendor (A, B, C), transient vs. permanent faults: 32.3, 22.2, 21.5]
Positional Effects
n Fault rates increase as you go vertically in a rack
n Shielding? Temperature?
n We found a similar correlation in DRAM
n More studies are needed to explain why we see this
Sridharan et al., Feng Shui of Supercomputer Memory, SC 2013
[Figures: relative SRAM fault rate by chassis location in rack (bottom, middle, top). Cielo: 1, +11%, +22%. Jaguar: 1, +14%, +19%.]
DDR Command and Address Parity – It’s a Good Thing
n A key feature of DDR3 (and later) is the ability to add parity-check logic to the command and address bus (a toy parity sketch follows below).
n Can have a significant positive impact on DDR memory reliability • Not previously shown empirically
n DDR3 sub-system on Cielo includes command and address parity checking.
n The rate of command/address parity errors was 75% of the rate of uncorrected ECC errors.
n Increasing DDR memory channel speeds may cause an increase in signaling-related errors.
DeBardeleben et al., Extra Bits on SRAM and DRAM Errors, SELSE 2014
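As a toy illustration of the mechanism (a sketch only; not the actual DDR3 electrical protocol, pin assignments, or register logic): the controller transmits an even-parity bit computed over the command/address lines, and the receiver recomputes it on arrival, so a single-bit signaling error on the bus is flagged rather than executed as a corrupted command.

    # Toy sketch of an even-parity check over DDR command/address bits
    # (illustrative only; not the actual DDR3 electrical protocol or pin layout).

    def even_parity(bits):
        """Return the even-parity bit of an iterable of 0/1 values."""
        p = 0
        for b in bits:
            p ^= b
        return p

    # Controller side: append a parity bit to the command/address word.
    cmd_addr = [1, 0, 1, 1, 0, 0, 1, 0]              # hypothetical command/address bits
    transmitted = cmd_addr + [even_parity(cmd_addr)]

    # Receiver side: recompute parity over everything received; a nonzero
    # result means a signaling error occurred on the bus.
    received = list(transmitted)
    received[3] ^= 1                                  # inject a single-bit bus error
    print("parity error detected:", even_parity(received) != 0)   # True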
Where Do Faults Occur?
n Data from Hopper
• 12,000 CPU sockets
• 12MB L3 cache / socket
• 3MB L2 cache / socket
• ~1.5 years of observation
n Faults occur often
• Even small structures (TLBs) see faults
n Exascale systems will:
• Have 4-5x the number of compute sockets
• Have much more SRAM per socket
• Have more faults!
Caveat: vendors must pay attention to reliable design
Attribution: Vilas Sridharan, AMD
Accelerated Testing Comparison
n Studies of DOE supercomputers compared to AMD accelerated testing
n Accelerated testing remains a good proxy for what is seen in the field
n We would expect lower field FIT rates than accelerated testing due to workload differences, faults being overwritten, etc.
[Figure: fault rate per bit (a.u.)]
What Will SRAM Errors Look Like at Exascale?
SRAM uncorrected error rate relative to Cielo
n Two potential systems
• Small: 50k nodes
• Large: 100k nodes
n Same fault rate as 45nm
• Sky is falling
n Scale faults per current trend
• Sky falls more slowly
• Switch to FinFETs may make this even better
n Add some engineering effort
• Sky stops falling
SRAM faults are unlikely to be a significantly larger problem than today (a rough scaling sketch follows below)
Attribution: Vilas Sridharan, AMD
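A back-of-envelope version of the scaling argument above, in Python. All rates and scaling factors here are invented for illustration (the actual projections use AMD's measured per-socket fault data, and the 2-sockets-per-node figure is an assumption):

    # Back-of-envelope sketch of the SRAM scaling argument (all rates and scaling
    # factors below are made up for illustration; the real projections use
    # measured per-socket fault data).

    cielo_sockets = 8_500 * 2            # 8.5k Cielo nodes, assuming 2 sockets per node

    exascale_sockets = {
        "small (50k nodes)": 50_000 * 2,   # assumed 2 sockets per node
        "large (100k nodes)": 100_000 * 2,
    }

    # Hypothetical per-socket SRAM uncorrected-error rate relative to a Cielo socket.
    per_socket_factor = {
        "same fault rate as 45nm": 3.0,   # much more SRAM per socket, same per-bit rate
        "scale per current trend": 1.0,   # per-bit improvements offset capacity growth
        "plus engineering effort": 0.3,   # stronger protection of the large arrays
    }

    for scenario, factor in per_socket_factor.items():
        for system, sockets in exascale_sockets.items():
            relative = factor * sockets / cielo_sockets
            print(f"{scenario:26s} {system:20s} ~{relative:5.1f}x Cielo uncorrected-error rate")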
What About DRAM?
And Other External Memory Subsystems
n DRAM faults are… weird
• Affect multiple rows/columns/chips (a classification sketch follows the fault-mode table below)
• Not just simple particle strikes…
n Many permanent faults
• Entirely unlike SRAM
[Figures: DRAM fault maps plotted by row and column address (0–2048)]
Fault Mode       % Faulty DRAMs
Single-bit       67.7%
Single-word      0.2%
Single-column    8.7%
Single-row       11.8%
Single-bank      9.6%
Multi-bank       1.0%
Multi-rank       1.1%
Attribution: Vilas Sridharan, AMD
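A sketch of how a fault-mode breakdown like the table above can be derived from the error addresses logged against each faulty DRAM. The rules below are a simplification (single-bit vs. single-word needs bit positions and multi-rank needs rank information, neither of which is modeled), and the function and address layout are invented for illustration:

    # Sketch: classify one faulty DRAM's error footprint into fault modes like
    # those in the table above, from the set of (bank, row, column) addresses
    # that logged errors on that device.

    def classify_fault(error_addresses):
        addrs = set(error_addresses)
        banks = {bank for bank, _, _ in addrs}
        rows  = {(bank, row) for bank, row, _ in addrs}
        cols  = {(bank, col) for bank, _, col in addrs}

        if len(addrs) == 1:
            return "single-bit"
        if len(banks) > 1:
            return "multi-bank"
        if len(rows) == 1:
            return "single-row"       # many columns within one row of one bank
        if len(cols) == 1:
            return "single-column"    # many rows within one column of one bank
        return "single-bank"          # scattered errors within a single bank

    # Example: errors spread along one row of bank 2.
    print(classify_fault([(2, 100, 17), (2, 100, 530), (2, 100, 1999)]))   # single-row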
Projecting to Exascale
[Figure: relative rate of uncorrected errors by month (log scale), SEC-DED vs. Chipkill]
Sridharan and Liberty, A Study of DRAM Failures in the Field, SC12
n Uncorrected error rate
• 10-70x the error rate of current systems
• Is the sky falling?
n This is not just a problem for exascale
• Cost problem for data centers / cloud
• Reliability problem in client?
n Solutions are out there
• Including for die-stacked DRAM
• Lots of people working on this…
n Historical example
• Chipkill vs. SEC-DED (a rough coverage estimate follows below)
Caveat: DRAM subsystems need higher reliability than today, but will likely get it
Attribution: Vilas Sridharan, AMD
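To make the Chipkill vs. SEC-DED comparison concrete, here is a rough, hedged coverage estimate built from the fault-mode distribution in the earlier table. It assumes (simplistically) that SEC-DED reliably corrects only single-bit faults while chipkill tolerates any fault confined to a single DRAM chip; real coverage depends on the specific code, symbol size, and data layout.

    # Rough coverage estimate using the fault-mode distribution from the earlier
    # table. Simplistic assumptions: SEC-DED corrects only single-bit faults;
    # chipkill tolerates anything confined to one DRAM chip (so only multi-rank
    # faults fall outside it).

    fault_mode_share = {          # % of faulty DRAMs, from the table above
        "single-bit":    67.7,
        "single-word":    0.2,
        "single-column":  8.7,
        "single-row":    11.8,
        "single-bank":    9.6,
        "multi-bank":     1.0,
        "multi-rank":     1.1,
    }

    covered_by = {
        "SEC-DED":  {"single-bit"},
        "chipkill": set(fault_mode_share) - {"multi-rank"},
    }

    for scheme, modes in covered_by.items():
        covered = sum(share for mode, share in fault_mode_share.items() if mode in modes)
        print(f"{scheme:8s} tolerates ~{covered:.1f}% of faulty DRAMs; "
              f"~{100 - covered:.1f}% can escalate to uncorrected errors")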
Conclusions
n It is not often one gets to see field studies in HPC
n We have shown the value of:
• Collaborating with vendors to interpret the data
• Analyzing reliability based on vendor choice
• Studying positional effects of faults in a data center
n SRAM would benefit from more advanced ECC
n Accelerated testing is useful
n DDR3 address and command parity check is useful
n Exascale trends are a mixed bag:
• Sky is probably not falling
• But there is no doubt that user-experienced uncorrectable error rates will increase
n Always interested in collaboration!
n I haven’t said anything about silent data corruption here