Neutron Sensitivity and Software Hardening Strategies for
Matrix Multiplication and FFT on Graphics Processing Units
June 18th, 2013 – New York City, NY, USA
P. Rech, L. Pilla, F. Silvestri, P. O. Navaux, and Luigi Carro
Paolo Rech – FTXS 2013, New York City, NY
Outline
Radiation Effects on Graphics Processing Units
Experimental Setup
Matrix Multiplication
- Error Rate at Sea Level
- Hardening Techniques
Fast Fourier Transform
- Error Rate at Sea Level
- Hardening Techniques
Conclusions
Terrestrial Radiation Environment
Galactic cosmic rays interact with the atmosphere, producing a shower of energetic particles:
- Muons
- Pions
- Protons
- Gamma rays
- Neutrons
13 n/(cm²·h) at sea level
Radiation is an issue at sea level!
GPU Internal Structure
[Figure: GPU block diagram: Streaming Multiprocessors, each with threads, registers (Reg) and shared memory, connected to DRAM]
A GPU is an array of Streaming Multiprocessors (SMs)
The SMs share DRAM
An SM executes various threads in parallel
Threads have access to registers and shared memory
Radiation Effects on a GPU
[Figure: GPU block diagram (Streaming Multiprocessor with threads, registers and shared memory; DRAM) annotated with three SEUs and one SET]
Radiation can corrupt memory resources (Single Event Upset, SEU)...
...but also logic (Single Event Transient, SET) and control circuitry: a scheduler failure may have severe repercussions
Why Radiation Test on GPUs?
Titan (Oak Ridge National Lab): 18,000 GPUs
High probability of having a GPU corrupted
Pedestrian Detection*
NVIDIA Tegra
High reliability is required
*From 2015, Euro NCAP grants its five-star safety rating only to cars with pedestrian detection
Tested Devices
NVIDIA GeForce GTX480 (desktop board)
NVIDIA Tesla C2050 (built-in ECC)
Radiation Test Facilities
Weapon Nuclear Research
Neutron Spectrum
1 s at ISIS ≈ 10⁷ s (about 110 days) of natural irradiation in NYC
GPU Radiation Test Setup
[Figure: PC connected to the GPU through a 20 cm PCI-E bus extension; beam spot on the GPU]
PC inside the room but out of the beam
PCI-E bus extension between PC and GPU
Extension with fuses on the power lines to prevent GPU latch-ups from affecting the PC
GPU power control circuitry is out of the beam: a power control circuitry failure could compromise the experiment and the GPU
DDR memories are out of the beam
Beam spot is 3 cm wide: GPU fully irradiated
Matrix Multiplication
[Figure: A (2048 × 2048 elements) × B (2048 × 2048 elements) = M; 2048 × 2048 threads, 2048 sum & mult each]
Matrix Multiplication Results
Experimental cross section* at ISIS = 2.01 × 10⁻⁶ cm²
The neutron spectrum at ISIS resembles the atmospheric one, so the cross section at ISIS resembles the cross section at sea level
Cross section × particle flux (at sea level) = error rate
2.01 × 10⁻⁶ cm² × 13 n/(cm²·h) = 2.60 × 10⁴ FIT: 1 error every 4.5 years
Titan (GTX): 18,000 errors every 4.5 years, about 10 errors per day!
*with double-precision data
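The arithmetic behind these figures can be sketched as follows. This is a minimal illustration, assuming the usual definition of FIT as failures per 10⁹ device-hours and a 365-day year; the function names are hypothetical and only the cross section, flux and GPU count come from the slide.

```python
FIT_HOURS = 1e9            # FIT counts failures per 10^9 device-hours
HOURS_PER_YEAR = 24 * 365  # assumption: non-leap year


def errors_per_hour(cross_section_cm2, flux_n_per_cm2_h):
    """Error rate = cross section x particle flux."""
    return cross_section_cm2 * flux_n_per_cm2_h


def fit(cross_section_cm2, flux_n_per_cm2_h):
    """Error rate expressed in FIT."""
    return errors_per_hour(cross_section_cm2, flux_n_per_cm2_h) * FIT_HOURS


def years_between_errors(cross_section_cm2, flux_n_per_cm2_h):
    """Mean time between errors on one device, in years."""
    return 1.0 / errors_per_hour(cross_section_cm2, flux_n_per_cm2_h) / HOURS_PER_YEAR


def fleet_errors_per_day(cross_section_cm2, flux_n_per_cm2_h, devices):
    """Expected errors per day over a fleet of identical devices."""
    return errors_per_hour(cross_section_cm2, flux_n_per_cm2_h) * 24 * devices


if __name__ == "__main__":
    sigma, flux = 2.01e-6, 13.0  # values quoted on the slide
    print(f"{fit(sigma, flux):.3g} FIT")
    print(f"{years_between_errors(sigma, flux):.2f} years between errors")
    print(f"{fleet_errors_per_day(sigma, flux, 18000):.1f} errors/day on 18,000 GPUs")
```

With the slide's inputs the results land close to the rounded figures quoted above (about 2.6 × 10⁴ FIT, a little under 4.5 years per device, and on the order of 10 errors per day for 18,000 GPUs).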
Multiple Output Errors
It was commonly believed that radiation causes just a single output error
Experimental results:
Single: 42.2%
Multiple: 58.8%
The majority of errors are multiple output errors
Multiple Output Errors Analysis
Three different multiple-error patterns are detected:
1) 22.8% on the same Row
2) 26.8% on the same Column
3) 8% Cluster Errors
[Figure: bar chart of Output Errors [%] by type: Single, Multiple (Row, Column, RND)]
Errors on Row/Column Causes
[Figure: A × B = M, with operands held in the GPU cache; one corrupted cached value (x) yields a full row or column of erroneous elements in M]
Each column of M is calculated using the rows of A and one column of B, which is stored in the GPU cache.
Cache corruption therefore causes errors along a whole row or column.
Threads on an SM share the cache
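A small simulation can make the row/column pattern concrete. The sketch below models a cache upset as a change to one operand element before the multiplication, which is an assumption made for illustration (the slides do not detail the mechanism), and uses small integer matrices with hypothetical helper names.

```python
def matmul(a, b):
    """Plain matrix product; each output element plays the role of one thread."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def with_fault(matrix, i, j, delta=1):
    """Return a copy of `matrix` whose element (i, j) is off by `delta`."""
    copy = [row[:] for row in matrix]
    copy[i][j] += delta
    return copy


def wrong_outputs(good, bad):
    """List the (row, column) positions where two result matrices differ."""
    return [(i, j) for i, row in enumerate(good)
            for j, value in enumerate(row) if bad[i][j] != value]


if __name__ == "__main__":
    n = 4
    a = [[i + k + 1 for k in range(n)] for i in range(n)]
    b = [[2 * k + j + 1 for j in range(n)] for k in range(n)]
    good = matmul(a, b)
    # One corrupted element of B hits every output in the same column.
    print(wrong_outputs(good, matmul(a, with_fault(b, 2, 1))))
    # One corrupted element of A hits every output in the same row.
    print(wrong_outputs(good, matmul(with_fault(a, 2, 1), b)))
```

Because every thread computing column j reads the same column of B, a single shared corrupted value is enough to produce N wrong outputs.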
Error Correction
1) ECC on cache memory
- Corrects multiple errors on a row/column, which are almost 50% of the total (tested on C2050)
- Memory availability is reduced by 12.5%*
- Execution time is increased by up to 30%*
*NVIDIA datasheet
2) Algorithm-Based Fault Tolerance (ABFT): a technique specifically designed for an algorithm
[Figure: A and B extended with checksums; their product M carries col-check and row-check vectors]
*Freivalds '79
Matrix Multiplication ABFT
[Figure: M with its col-check and row-check vectors compared against recomputed col-sum and row-sum vectors; mismatches (X) locate the errors]
Single errors* are detected in O(N) and corrected in O(1)
Errors on a row/column* are detected in O(N) and corrected in O(1)
*Huang and Abraham '84
*P. Rech et al., '12
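The checksum idea can be sketched in a few lines. This is a simplified sequential illustration in the spirit of checksum-based ABFT, not the authors' GPU implementation: it uses integer data so that comparisons are exact, assumes the checksum vectors themselves are not corrupted, and all function names are hypothetical.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def checksum_product(a, b):
    """Multiply A (plus a row of column sums) by B (plus a column of row sums).

    The result holds M with a row-check column and a col-check row appended.
    """
    a_c = a + [[sum(col) for col in zip(*a)]]
    b_r = [row + [sum(row)] for row in b]
    return matmul(a_c, b_r)


def data_part(mf):
    """Strip the checksum row and column."""
    return [row[:-1] for row in mf[:-1]]


def check_and_correct(mf):
    """Compare carried checksums with recomputed sums and fix what can be fixed.

    Handles a single error, or several errors confined to one row or one column.
    Returns the lists of mismatching rows and columns found before correction.
    """
    n = len(mf) - 1
    row_err = {i: mf[i][n] - sum(mf[i][:n]) for i in range(n)}
    col_err = {j: mf[n][j] - sum(mf[i][j] for i in range(n)) for j in range(n)}
    bad_rows = [i for i, e in row_err.items() if e != 0]
    bad_cols = [j for j, e in col_err.items() if e != 0]
    if len(bad_rows) == 1:        # errors on one row: fix each with its col-check
        for j in bad_cols:
            mf[bad_rows[0]][j] += col_err[j]
    elif len(bad_cols) == 1:      # errors on one column: fix each with its row-check
        for i in bad_rows:
            mf[i][bad_cols[0]] += row_err[i]
    return bad_rows, bad_cols


if __name__ == "__main__":
    a = [[1, 2, 3, 4], [5, 6, 7, 8], [2, 0, 1, 3], [4, 4, 1, 2]]
    b = [[2, 1, 0, 3], [1, 1, 2, 2], [0, 5, 1, 1], [3, 2, 2, 0]]
    mf = checksum_product(a, b)
    mf[1][0] += 5                 # two injected errors on the same row
    mf[1][3] -= 2
    print(check_and_correct(mf), data_part(mf) == matmul(a, b))
```

Errors on one row that happen to cancel in the row sum are not handled by this sketch, and floating-point data would need a tolerance instead of exact comparison.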
Cluster Errors Causes
Cluster errors can be caused by:
- Cache cross-talk
- Errors in dirty cache flags
- Pairwise bit flips in cache
- Scheduler failure
A scheduler failure affects the synchronization of some threads or yields incomplete results
Random locations of M then turn out erroneous
[Figure: M with erroneous elements (x) at scattered locations]
Cluster Errors Criticality
Cluster errors:
- are not corrected by ECC (tested on C2050)
- the scheduler cannot be physically hardened
- scheduler SW hardening* is not yet proven on GPUs
*Rossi et al. '10
*Karimi et al. '10
[Figure: bar chart of Output Errors [%] by type: Single, Multiple (Row, Column)]
Cluster errors are less likely to occur; however, their FIT is 1.13 × 10³, which is not negligible!
Cluster Errors Correction
[Figure: M with col-check/row-check and col-sum/row-sum vectors; several mismatches (X) in both directions]
Various mismatches between row-check and row-sum
Various mismatches between col-check and col-sum
Checksum information is not enough to tell the errors apart, but...
...we can try to correct errors with the row-checksums or the col-checksums and check whether the correction succeeds
Experimentally observed: corrupted locations in a cluster ≤ 4, so at most 16 checks are needed!
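One way to read the bound of 16 checks (an interpretation, not spelled out on the slide) is that k corrupted elements can raise at most k row mismatches and k column mismatches, so the errors must lie among at most k × k row/column intersections. The sketch below only enumerates those candidate locations; the trial-correction step is left out, and the function names and integer test data are assumptions.

```python
from itertools import product


def mismatches(m, row_check, col_check):
    """Rows and columns whose recomputed sum disagrees with the stored checksum."""
    bad_rows = [i for i, row in enumerate(m) if sum(row) != row_check[i]]
    bad_cols = [j for j, col in enumerate(zip(*m)) if sum(col) != col_check[j]]
    return bad_rows, bad_cols


def candidate_locations(m, row_check, col_check):
    """Intersections of mismatching rows and columns: where errors can hide."""
    bad_rows, bad_cols = mismatches(m, row_check, col_check)
    return list(product(bad_rows, bad_cols))


if __name__ == "__main__":
    m = [[(6 * i + j) % 11 for j in range(6)] for i in range(6)]
    row_check = [sum(row) for row in m]
    col_check = [sum(col) for col in zip(*m)]
    for i, j in [(0, 1), (2, 3), (4, 0), (5, 5)]:  # four scattered errors
        m[i][j] += 1
    print(len(candidate_locations(m, row_check, col_check)))
```

Four errors in distinct rows and columns give the worst case of 16 candidates; errors packed into a 2 × 2 block give only 4.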
Fast Fourier Transform
[Figure: 512 × 512 grid of 64-point FFTs]
512 × 512 threads, each executing the Stockham algorithm on a 64-point FFT
log₂ 64 = 6 iterations required
At each iteration a thread updates the 64 elements two by two
A thread in one iteration uses the output of previous threads as input
Threads are not independent, so errors are likely to spread
FFT cross section = 3.69 × 10⁻⁶ cm² (5.17 × 10⁵ FIT)
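How a single corrupted value spreads through the iterations can be illustrated with a small radix-2 Stockham FFT. This is a sequential sketch written for illustration, not the benchmark code used in the experiments; the `fault` parameter (stage, index, delta) is a hypothetical fault-injection hook.

```python
import cmath


def stockham_fft(x, fault=None):
    """Radix-2 Stockham autosort FFT of a power-of-two-length sequence.

    `fault` is an optional (stage, index, delta) tuple: once `stage` iterations
    have completed (0 means before the first), `delta` is added to one element.
    """
    n = len(x)
    src = [complex(v) for v in x]
    dst = [0j] * n
    length, stride, stage = n, 1, 0
    if fault and fault[0] == 0:
        src[fault[1]] += fault[2]
    while length > 1:
        half = length // 2
        for p in range(half):
            w = cmath.exp(-2j * cmath.pi * p / length)
            for q in range(stride):
                a = src[q + stride * p]
                b = src[q + stride * (p + half)]
                dst[q + stride * 2 * p] = a + b            # elements are updated
                dst[q + stride * (2 * p + 1)] = (a - b) * w  # two at a time
        src, dst = dst, src                                # ping-pong buffers
        length, stride, stage = half, stride * 2, stage + 1
        if fault and fault[0] == stage:
            src[fault[1]] += fault[2]
    return src


def corrupted_outputs(n, fault, tol=1e-9):
    """Count output elements that differ between a clean and a faulty run."""
    x = [complex(i % 7, -(i % 3)) for i in range(n)]
    good, bad = stockham_fft(x), stockham_fft(x, fault)
    return sum(abs(g - b) > tol for g, b in zip(good, bad))


if __name__ == "__main__":
    for stage in range(7):
        print(stage, corrupted_outputs(64, (stage, 5, 1.0)))
```

Each remaining iteration doubles the number of affected elements, so a value corrupted after the first of the six iterations reaches 32 of the 64 outputs, and a corrupted input reaches all 64.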
FFT Multiple Errors
[Figure: histogram of the percentage of faulty FFTs (0 to 10%) versus the number of multiple errors per execution]
Fewer than 4% of executions have a single error
Few executions have an odd number of errors
Most executions have fewer than 32 errors, or exactly 64 (a thread failure leads to the wrong update of all the 64 elements in the FFT)
Software hardening idea: prevent error propagation
FFT Hardening
[Figure: ABFT-hardened FFT: input coding, FFT, output decoding, checksum generation, check]
All errors are detected with a wise coding-decoding scheme*...
...but only when all iterations are completed: errors do propagate and the FFT has to be recomputed
*J.Y. Jou and Abraham '88
*P. Rech et al. '13
Divide the N-FFT into N2-FFTs and N1-FFTs (N = N1 × N2), performing coding-decoding-checksum on each smaller FFT...
...only the small FFT found corrupted has to be recomputed
[Figure: smaller FFTs, each followed by a check; trade-off between error propagation and computational overhead]
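The decomposition with per-block checks can be sketched as follows. Note the substitution: instead of the coding-decoding scheme cited above, whose details are not given on the slides, this sketch checks each small transform with the DFT identity Σ X[k] = N · x[0], which misses errors that cancel in the sum. The small transforms are plain DFTs for brevity, recomputation is assumed to be fault-free, and all names are hypothetical.

```python
import cmath


def dft(x):
    """Reference O(n^2) DFT, used here for the small transforms."""
    n = len(x)
    return [sum(v * cmath.exp(-2j * cmath.pi * j * k / n) for j, v in enumerate(x))
            for k in range(n)]


def checked(x, transform, tol=1e-6):
    """Run one small transform, verify sum(X) == n * x[0], redo it on mismatch."""
    out = transform(x)
    if abs(sum(out) - len(x) * x[0]) > tol:
        return dft(x), 1          # only this small transform is recomputed
    return out, 0


def hardened_fft(x, n1, n2, transform=dft):
    """N-point FFT (N = n1 * n2) built from checked n1- and n2-point transforms."""
    n, redone = n1 * n2, 0
    inner = []
    for j2 in range(n2):          # n2 transforms of length n1 on strided inputs
        col, r = checked([x[n2 * j1 + j2] for j1 in range(n1)], transform)
        redone += r
        inner.append([col[k1] * cmath.exp(-2j * cmath.pi * j2 * k1 / n)
                      for k1 in range(n1)])       # twiddle factors
    out = [0j] * n
    for k1 in range(n1):          # n1 transforms of length n2
        row, r = checked([inner[j2][k1] for j2 in range(n2)], transform)
        redone += r
        for k2 in range(n2):
            out[k1 + n1 * k2] = row[k2]
    return out, redone


def faulty_transform(fail_on_call):
    """A transform that corrupts one output element on its n-th invocation."""
    calls = 0

    def transform(x):
        nonlocal calls
        calls += 1
        out = dft(x)
        if calls == fail_on_call:
            out[1] += 1.0
        return out
    return transform


if __name__ == "__main__":
    x = [complex(i % 5, i % 3) for i in range(64)]
    out, redone = hardened_fft(x, 8, 8, faulty_transform(3))
    print(redone, max(abs(a - b) for a, b in zip(out, dft(x))) < 1e-6)
```

Smaller blocks confine an error to fewer elements but add more checks, which is the propagation versus overhead trade-off noted above.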
Conclusions
- GPUs are very prone to corruption by neutrons
- The radiation response depends on the executed algorithm
- The corruption of shared and critical resources leads to multiple output errors
- ECC is not sufficient to guarantee high reliability
- Software-based hardening strategies can be built by analyzing the algorithm and the experimental data
Work in Progress:
- Reduce scheduler strain by optimizing thread distribution
- Analyze cache flag corruption
- Evaluate error criticality (precision of data)