
FT-CNN: Algorithm-Based Fault Tolerance for Convolutional Neural Networks

Kai Zhao, Sheng Di, Senior Member, IEEE, Sihuan Li, Xin Liang, Yujia Zhai, Jieyang Chen, Kaiming Ouyang, Franck Cappello, Fellow, IEEE, and Zizhong Chen, Senior Member, IEEE

Abstract—Convolutional neural networks (CNNs) are becoming more and more important for solving challenging and critical problems in many fields. CNN inference applications have been deployed in safety-critical systems, which may suffer from soft errors caused by high-energy particles, high temperature, or abnormal voltage. Of critical importance is ensuring the stability of the CNN inference process against soft errors. Traditional fault tolerance methods are not suitable for CNN inference because error-correcting code is unable to protect computational components, instruction duplication techniques incur high overhead, and existing algorithm-based fault tolerance (ABFT) techniques cannot protect all convolution implementations. In this paper, we focus on how to protect the CNN inference process against soft errors as efficiently as possible, with the following three contributions. (1) We propose several systematic ABFT schemes based on checksum techniques and analyze their fault protection ability and runtime thoroughly. Unlike traditional ABFT based on matrix-matrix multiplication, our schemes support any convolution implementation. (2) We design a novel workflow integrating all the proposed schemes to obtain a high detection/correction ability with limited total runtime overhead. (3) We perform our evaluation using ImageNet with well-known CNN models including AlexNet, VGG-19, ResNet-18, and YOLOv2. Experimental results demonstrate that our implementation can handle soft errors with very limited runtime overhead (4%∼8% in both error-free and error-injected situations).

Index Terms—Algorithm-Based Fault Tolerance, Deep Learning, Silent Data Corruption, Reliability, High-Performance Computing


1 INTRODUCTION

Deep learning using convolutional neural networks (CNNs) is becoming the key state-of-the-art technique in science and technology fields such as image classification [1], [2], [3], object detection [4], natural language processing [5], medical image analysis [6], and drug design [7]. More and more scientific research (such as cosmological simulation and materials analysis) also is addressing the great potential of leveraging CNN techniques to analyze extremely large amounts of data in a supercomputer environment, achieving unprecedented discoveries in their domains [8].

The reliability of the CNN inference is becoming a critical concern [9] because CNN inference applications are being widely utilized in different scenarios, including high-performance scientific simulations and safety-critical systems [10], [11] such as aerospace and autonomous vehicles. CNN inference applications usually run for a long time or continuously to process many inference tasks. For example, the inference engine in autonomous vehicles runs continuously to predict road conditions. As a result, even if a single inference task for one input finishes in seconds, the reliability of CNN inference is still critically important given the long execution time of the inference applications.

• Kai Zhao, Sihuan Li, Yujia Zhai, and Zizhong Chen are with the Department of Computer Science and Engineering at University of California, Riverside, CA 92521.

• Sheng Di and Franck Cappello are with the Mathematics and Computer Science Division at Argonne National Laboratory, Lemont, IL 60439.

• Xin Liang and Jieyang Chen are with the Computer Science and Mathematics Division at Oak Ridge National Laboratory, Oak Ridge, TN 37831.

In the domain of CNN inference, machine learning applications could be very error prone for two reasons. On the one hand, recent literature indicates that soft errors are inevitable in modern systems, from edge computing devices to supercomputers [12], [13], because of multiple factors [14] such as high-energy cosmic radiation [15], aging, and wear of devices [16]. On the other hand, CNN inference applications often call for power-efficient and cost-efficient machine learning accelerators [17], [18], which may adopt overclocking with voltage underscaling, incurring more soft errors than common hardware does.

Soft errors may cause serious consequences to CNN inference systems. Recent studies [19], [20], [21] indicate that resilient convolutional neural networks are essential for guaranteeing the correctness of inference applications. Researchers [19] demonstrate that a single bit flip occurring during CNN image classification could result in as much as a 40% and 70% SDC rate in the datapath and memory, respectively. Such high SDC rates would downgrade the CNN prediction accuracy dramatically. Furthermore, the neutron beam test [20] shows that when running YOLO [4] classification, the Failure In Time (FIT) caused by SDCs could be as much as 38 for the Nvidia K40 and 96 for the Nvidia Tegra X1, which fail to meet the ISO 26262 standard for functional safety of road vehicles [22].

Existing resilient solutions are insufficient for protecting CNN inference applications against these soft errors. Error-correcting code (ECC), for example, suffers from memory area cost and relatively high latency and power consumption. According to [23], ECC with chip-kill applied to all data, compared with no ECC protection, has an average of 40% overhead in memory energy, 20% overhead in system energy, and 20% overhead in performance for computation-bounded applications. Moreover, ECC cannot handle multiple bit flips or computational errors. Techniques based on instruction duplication (ID) [24] incur high overhead and require both application-specific and hardware-specific optimization; and optimizing and deploying ID techniques on all CNN accelerators is difficult.

Considering all the drawbacks and limitations of ECC and ID, algorithm-based fault tolerance (ABFT) [25] is an attractive solution to realize resilient CNN. It has much lower overhead than other techniques have, and it is architecture independent, meaning that it supports any hardware accelerator. The idea of ABFT is to detect and/or correct soft errors based on the known invariants of the algorithm. Over the past thirty years, ABFT schemes have been successful in detecting errors for matrix operations [26], [27], [28], [29], [30], [31], iterative methods [32], [33], [34], [35], data transformation kernels [36], and sorting algorithms [37]. However, the existing ABFT schemes for matrix operations focus mainly on large and square matrices. Moreover, they incur more than 50% overhead when applied for CNN soft error protection (shown by our experiments in Section 6.3).

In this paper, we propose a strategy comprising a series of ABFT schemes for protecting the CNN inference stage against soft errors. We focus on the convolutional layers in CNN because they consume the major portion of the computation time [38], [39], [40], [41].

The main contributions of this paper are summarized as follows.

• We design several ABFT schemes that can be applied to any convolution implementation on any hardware. They can detect and correct errors at runtime. We provide an in-depth analysis of the ABFT schemes in terms of fault protection ability and runtime.

• We design a multischeme workflow for soft error protection with layerwise optimization to obtain a high detection/correction ability with limited runtime overhead. Additionally, our solution can protect the bias operation, grouped convolution, and back propagation.

• We implement an efficient soft error detection library for CNN, called FT-Caffe, and evaluate FT-Caffe on ImageNet [42] using four popular CNN models: AlexNet [1], VGG-19 [2], ResNet-18 [3], and YOLOv2 [4]. Experimental results on the Bebop supercomputer [43] using up to 128 nodes demonstrate that FT-Caffe can keep the correctness of the inferences with 4%∼8% overhead in both error-free and erroneous cases.

In the rest of the paper, we first introduce background about convolutional layers and existing ABFT techniques applicable to matrix-matrix multiplication (MM)-based convolution implementations. In Section 3, we propose four novel ABFT schemes that can be applied to any convolution implementation. In Section 4, we analyze the fault protection ability and runtime of the four schemes and propose an efficient multischeme workflow integrating all four schemes. In Section 5, we discuss how to support bias, grouped convolution, and back propagation. In Section 6, we evaluate our solutions for both the error-free case and the erroneous case. In Section 7, we discuss related work on fault tolerance in convolutional neural networks. We present our concluding remarks in Section 8.

2 BACKGROUND

This section introduces the high-level ideas of convolutional layers and the existing ABFT techniques applicable to MM-based convolution algorithms. The notations and symbols used in this paper are summarized in Table 1.

2.1 Definition of Convolutional Layer

The convolutional layer can be represented as the following convolution operation.

O[n][m][x][y] = B[m] + Σ_{k=0}^{Ch−1} Σ_{i=0}^{R−1} Σ_{j=0}^{R−1} D[n][k][Ux+i][Uy+j] × W[m][k][i][j],

0 ≤ n < N, 0 ≤ m < M, 0 ≤ x, y < E, E = (H − R + U)/U    (1)

The convolution operation involves two significant inputs: the feature map (fmap) D, D ∈ R^{N×Ch×H×H}, and the convolutional kernels W, W ∈ R^{M×Ch×R×R}. Note that all the matrices and vectors in this paper are highlighted in bold in order to differentiate them from scalar numbers, according to the naming convention. The bias, denoted as B, is applied to the output after convolution, and the final result is denoted as O, O ∈ R^{N×M×E×E}. Since the bias operation is independent of the convolution computation, in the rest of this section we describe only the protection for the convolution computation. In Section 5.1, we will discuss the protection for bias.

TABLE 1
Notations and Symbols Used in This Paper

Notation  Description
D         Feature map, dimension is 4D
W         Kernels, also called filters, dimension is 4D
O         Output, dimension is 4D
B         Bias, dimension is 1D
C         Checksums
S         Block summations of O, corresponding to checksums
⊗         Convolution operation
N         First dimension of D and O
M         First dimension of W and second dimension of O
Ch        Second dimension of D and W, also called channels
H         Third and fourth dimension of D
R         Third and fourth dimension of W
E         Third and fourth dimension of O
U         Stride size
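To make the notation concrete, the following minimal NumPy sketch evaluates Equation (1) directly with explicit loops. It is only a naive reference for illustration, not one of the optimized implementations discussed in Section 2.2; the function name and the use of NumPy are our own assumptions, and the array shapes follow Table 1.

```python
import numpy as np

def conv_forward(D, W, B, U=1):
    """Naive reference for Equation (1); D: N x Ch x H x H, W: M x Ch x R x R, B: M."""
    N, Ch, H, _ = D.shape
    M, _, R, _ = W.shape
    E = (H - R + U) // U                      # output spatial size, E = (H - R + U) / U
    O = np.zeros((N, M, E, E), dtype=D.dtype)
    for n in range(N):
        for m in range(M):
            for x in range(E):
                for y in range(E):
                    # window of D[n] aligned with kernel W[m] at stride U
                    patch = D[n, :, U * x:U * x + R, U * y:U * y + R]
                    O[n, m, x, y] = B[m] + np.sum(patch * W[m])
    return O
```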

2.2 Implementation of Convolutional Layer

Convolution can be implemented efficiently in several ways [44]. The first option is MM-based convolution [45], which reshapes the kernel and feature map into two temporary matrices and then applies matrix-matrix multiplication (MM) on them. Another way to implement convolution is called direct convolution, which performs the convolution operation directly. It is widely used in AI accelerators including Eyeriss [39], DianNao [46], and the NVIDIA Deep Learning Accelerator [47]. Fast Fourier transform–based convolution [48] leverages FFT to compute the convolution. It is particularly suitable for the relatively large feature map and kernel. However, it is inferior to the Winograd convolution [44] when the sizes of the feature map and kernel are relatively small.


Modern CNN frameworks and accelerators generally automatically choose the best implementation of convolution based on hardware resources and model structure, because various implementations have different constraints on memory, architecture, and CNN model.

2.3 ABFT for Matrix-Matrix Multiplication

Traditional ABFT designed for matrix-matrix multiplication can be applied to the MM calculation of MM-based convolution [20], but it has at least three limitations. (1) It supports only the MM-based convolution implementation, which is not always the best-fit implementation selected by the CNN framework and accelerator. (2) It incurs high overhead (more than 50%, as shown in Section 6.3) due to the small and irregular shape of the matrices used by MM-based convolution. (3) It cannot cover the reorganization operations of the feature map before the MM calculation. Therefore, new ABFT schemes are needed in order to protect the convolutional layer more effectively.

3 NOVEL ABFT SCHEMES FOR CONVOLUTION

In this section, we present four novel ABFT schemes, each supporting any convolution implementation and being able to protect the whole convolution process. In Section 4, we propose a multischeme workflow using all the schemes in different stages to maximize the soft error protection ability with minimized performance overhead.

3.1 Preliminary Analysis – Convolution

For a clear description, we interpret convolution at the block level. Specifically, in Equation (1), D, W, and O are all 4D matrices. They can be represented as being composed of multiple blocks, as shown in Figure 1. For any n and m (0 ≤ n < N, 0 ≤ m < M), Dn, Wm, and Onm are blocks. The dimensions of Dn, Wm, and Onm are Ch×H×H, Ch×R×R, and E×E, respectively. The notation ⊗ is used to represent the convolution computation between blocks Dn and Wm. The convolution operation defined by Equation (1) can be simplified at the block level as follows.

Onm = Dn ⊗ Wm, 0 ≤ n < N, 0 ≤ m < M    (2)

Fig. 1. Interpretation of Convolution at the Block Level

Equation (2) can be interpreted by using the blue part in Figure 1. Since each of the 3D substructures of D and W is treated as a block, D and W can be thought of as two 1D vectors of blocks. At the block level, the convolution operation is similar to matrix-matrix multiplication. The element (i, j) of O is calculated by using the ith element of D and the jth element of W. As illustrated in Figure 1, the elements involved in the convolutional layers can be split into two levels, which are covered by our protection solution, respectively.

We can derive that the convolution operation (denoted by ⊗) has a distributive property as follows.

D1 ⊗ W1 + D2 ⊗ W1
= Σ_{k=0}^{Ch−1} Σ_{i=0}^{R−1} Σ_{j=0}^{R−1} D1[k][Ux+i][Uy+j] × W1[k][i][j] + Σ_{k=0}^{Ch−1} Σ_{i=0}^{R−1} Σ_{j=0}^{R−1} D2[k][Ux+i][Uy+j] × W1[k][i][j]
= Σ_{k=0}^{Ch−1} Σ_{i=0}^{R−1} Σ_{j=0}^{R−1} (D1 + D2)[k][Ux+i][Uy+j] × W1[k][i][j]
= (D1 + D2) ⊗ W1

Similarly, we can get the following equation.

D1 ⊗ W1 + D1 ⊗ W2 = D1 ⊗ (W1 + W2)    (3)

The distributive property, Formula (3), is the key to proving the equivalence between the sum of the output and the output checksum. This property will be used later to prove the correctness of our design.

3.2 Preliminary Analysis – CNN Checksums

In general, we compute checksums for D and W and then use them to derive the checksums for O. Soft errors can be detected and corrected by comparing O with its checksums.

We introduce all the checksums (as shown in Table 2 and as yellow blocks in Figure 1) that are necessary for our ABFT schemes.

TABLE 2
Checksums Used by Schemes

Scheme                        Checksums of D and W    Checksums of O
Full Checksum (FC)            Cd1, Cw1                Co1, Co2
Row Checksum (RC)             Cd1, Cd2                Co1, Co3
Column Checksum (ClC)         Cw1, Cw2                Co2, Co4
Checksum-of-Checksum (CoC)    Cd1, Cw1, Cd2, Cw2      Co5, Co6, Co7
CoC Detection Only (CoC-D)    Cd1, Cw1, Cd2, Cw2      Co5

We define the checksums of D and W as follows.

Cd1 = Σ_{n=0}^{N−1} Dn
Cd2 = Σ_{n=0}^{N−1} n × Dn
Cw1 = Σ_{m=0}^{M−1} Wm
Cw2 = Σ_{m=0}^{M−1} m × Wm    (4)

The four checksums (denoted as input checksums) can be treated as four blocks of D and W. The checksums of O (denoted as output checksums) are defined as the convolution result of input checksums and/or inputs.

Co1 = Cd1 ⊗ W
Co2 = D ⊗ Cw1
Co3 = Cd2 ⊗ W
Co4 = D ⊗ Cw2
Co5 = Cd1 ⊗ Cw1
Co6 = Cd1 ⊗ Cw2
Co7 = Cd2 ⊗ Cw1    (5)
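The following sketch (a NumPy illustration of ours, not the FT-Caffe code) shows how the input checksums of Equation (4) and the output checksums of Equation (5) could be formed. The helper block_conv is a stand-in for the block-level ⊗ operator; in practice the checksum blocks would be convolved by the layer's own convolution routine, whichever implementation it uses.

```python
import numpy as np

def block_conv(d, w, U=1):
    """Block-level operator d ⊗ w: d is Ch x H x H, w is Ch x R x R, result is E x E."""
    Ch, H, _ = d.shape
    R = w.shape[-1]
    E = (H - R + U) // U
    out = np.zeros((E, E), dtype=d.dtype)
    for x in range(E):
        for y in range(E):
            out[x, y] = np.sum(d[:, U * x:U * x + R, U * y:U * y + R] * w)
    return out

def checksums(D, W, U=1):
    """Input checksums of Equation (4) and output checksums of Equation (5)."""
    N, M = D.shape[0], W.shape[0]
    n_idx = np.arange(N).reshape(N, 1, 1, 1)
    m_idx = np.arange(M).reshape(M, 1, 1, 1)
    Cd1, Cd2 = D.sum(0), (n_idx * D).sum(0)      # fmap checksums
    Cw1, Cw2 = W.sum(0), (m_idx * W).sum(0)      # kernel checksums (precomputable)
    Co1 = np.stack([block_conv(Cd1, W[m], U) for m in range(M)])   # Cd1 ⊗ W
    Co2 = np.stack([block_conv(D[n], Cw1, U) for n in range(N)])   # D ⊗ Cw1
    Co3 = np.stack([block_conv(Cd2, W[m], U) for m in range(M)])   # Cd2 ⊗ W
    Co4 = np.stack([block_conv(D[n], Cw2, U) for n in range(N)])   # D ⊗ Cw2
    Co5 = block_conv(Cd1, Cw1, U)                                   # Cd1 ⊗ Cw1
    Co6 = block_conv(Cd1, Cw2, U)                                   # Cd1 ⊗ Cw2
    Co7 = block_conv(Cd2, Cw1, U)                                   # Cd2 ⊗ Cw1
    return (Cd1, Cd2, Cw1, Cw2), (Co1, Co2, Co3, Co4, Co5, Co6, Co7)
```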


The output O is represented in the form of blocks (i.e., Level 1 in Figure 1). Elements inside the same block are independent with respect to checksums (Level 2 in Figure 1). That is, we perform the checksum comparison independently for each element across blocks. Therefore, multiple soft errors in the same block can be detected and corrected independently.

In what follows, we describe the four schemes we proposed, each involving one or more input and output checksums. The required checksums used by each scheme are summarized in Table 2.

3.3 Full Checksum Scheme (FC)

The first scheme we designed is called the full checksum scheme, or FC, because it is based on checksums from both D and W, as shown in Figure 1 and Table 2.

Cd1 and Cw1 are calculated before the convolution operation, so any memory error striking D or W during the convolution would not affect Cd1 or Cw1. As for the output checksums, we can get the following equations by applying the distributive property of ⊗.

Co1[m] = (Σ_{n=0}^{N−1} Dn) ⊗ Wm = Σ_{n=0}^{N−1} (Dn ⊗ Wm) = Σ_{n=0}^{N−1} Onm

Co2[n] = Dn ⊗ (Σ_{m=0}^{M−1} Wm) = Σ_{m=0}^{M−1} (Dn ⊗ Wm) = Σ_{m=0}^{M−1} Onm

These equations show the equality between the sum of the output and the output checksums. Let So1 and So2 be the summations of the output, where So1[m] = Σ_{n=0}^{N−1} Onm and So2[n] = Σ_{m=0}^{M−1} Onm. We can compare Co1, Co2 with So1, So2 to detect, locate, and correct soft errors if they exist.
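As an illustration, a hedged NumPy sketch of the full checksum comparison follows. It assumes the corrupted blocks are confined to one block row or one block column (the fault model analyzed in Section 4.1), takes Co1 and Co2 as arrays stacked over m and n respectively (as in the earlier checksum sketch), and uses a hypothetical tolerance tol to absorb floating-point round-off.

```python
import numpy as np

def fc_correct(O, Co1, Co2, tol=1e-4):
    """Full checksum (FC) sketch: repair soft errors confined to one block row or column."""
    So1 = O.sum(axis=0)                                   # So1[m] = sum_n O[n][m]
    So2 = O.sum(axis=1)                                   # So2[n] = sum_m O[n][m]
    bad_cols = np.where(np.abs(Co1 - So1).max(axis=(1, 2)) > tol)[0]
    bad_rows = np.where(np.abs(Co2 - So2).max(axis=(1, 2)) > tol)[0]
    if bad_cols.size == 0 and bad_rows.size == 0:
        return True                                       # nothing to correct
    if bad_rows.size == 1:                                # errors within one block row i
        i = bad_rows[0]
        for j in bad_cols:
            O[i, j] += Co1[j] - So1[j]                    # column-wise difference repairs each block
        return True
    if bad_cols.size == 1:                                # errors within one block column j
        j = bad_cols[0]
        for i in bad_rows:
            O[i, j] += Co2[i] - So2[i]                    # row-wise difference repairs each block
        return True
    return False                                          # not correctable by FC; recompute the layer
```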

3.4 Row Checksum Scheme (RC)

Compared with the full checksum scheme, the second ABFT scheme we designed involves only the row checksums of the output O, so we call it the row checksum scheme.

The row checksums used in this scheme are Co1 and Co3. Co3 is computed from the convolution operation between Cd2 and W, and the related output summation is defined by So3[m] = Σ_{n=0}^{N−1} n × Onm.

For the detection of soft errors, we need to compare Co1 with So1. If they are not equal to each other at location j, the error can be located by i = (Co3[j] − So3[j]) / (Co1[j] − So1[j]) and j, and it can be corrected by adding Co1[j] − So1[j] to the block (i, j).
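A corresponding sketch for the row checksum scheme is given below (our illustration, not the FT-Caffe code). The row index is recovered element-wise from the ratio above; if the recovered index is not a legal block row, the errors are not correctable by RC and the workflow of Section 4.3 would fall through to the next scheme. The tolerance and the choice of the largest-magnitude element are illustrative assumptions.

```python
import numpy as np

def rc_correct(O, Co1, Co3, tol=1e-4):
    """Row checksum (RC) sketch: locate corrupted blocks via Co1/Co3 and add back the difference."""
    N = O.shape[0]
    n_idx = np.arange(N).reshape(N, 1, 1, 1)
    So1 = O.sum(axis=0)                              # sum_n O[n][m]
    So3 = (n_idx * O).sum(axis=0)                    # sum_n n * O[n][m]
    d1 = Co1 - So1
    bad_cols = np.where(np.abs(d1).max(axis=(1, 2)) > tol)[0]
    if bad_cols.size == 0:
        return True                                  # no error detected
    for j in bad_cols:
        p = np.unravel_index(np.argmax(np.abs(d1[j])), d1[j].shape)   # an element hit by the error
        i = int(round((Co3[j][p] - So3[j][p]) / d1[j][p]))            # i = (Co3 - So3) / (Co1 - So1)
        if not (0 <= i < N):
            return False                             # illegal location: defer to the next scheme
        O[i, j] += d1[j]
    return True
```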

3.5 Column Checksum Scheme (ClC)

The third scheme we proposed is called the column checksum scheme because it involves only the column checksums of the output O. The column checksums used in this scheme are Co2 and Co4. Co4 is defined by performing the convolution operation between D and Cw2, and the related output summation is defined as So4[n] = Σ_{m=0}^{M−1} m × Onm. To detect soft errors, we compare Co2 with So2 first. If they are not equal to each other at location i, the error can be located by i and j = (Co4[i] − So4[i]) / (Co2[i] − So2[i]), and it can be recovered by adding Co2[i] − So2[i] to the block (i, j).

3.6 Checksum-of-Checksum Scheme (CoC/CoC-D)

Unlike the three schemes above, which all need D and/or W to calculate output checksums, the last scheme we proposed involves neither D nor W but only their checksums, so it is named the checksum-of-checksum scheme (or CoC scheme for short). Specifically, Co5, Co6, and Co7 are the output checksums we will use in this scheme. Similar to Co1, applying the distributive property yields three equations between the output checksums and the output as follows.

Co5 = Σ_{n=0}^{N−1} Σ_{m=0}^{M−1} Onm = So5
Co6 = Σ_{n=0}^{N−1} Σ_{m=0}^{M−1} m × Onm = So6
Co7 = Σ_{n=0}^{N−1} Σ_{m=0}^{M−1} n × Onm = So7

So5, So6, and So7 are defined as the output summations corresponding to Co5, Co6, and Co7. Let Oij be the corrupted output block, O′ be the correct output, and let δ = O′ij − Oij be the difference. Using the output checksums, we can get the following.

Co5 − So5 = Σ_{n=0}^{N−1} Σ_{m=0}^{M−1} (O′nm − Onm) = δ
Co6 − So6 = Σ_{n=0}^{N−1} Σ_{m=0}^{M−1} m × (O′nm − Onm) = j × δ
Co7 − So7 = Σ_{n=0}^{N−1} Σ_{m=0}^{M−1} n × (O′nm − Onm) = i × δ

The location (i, j) can be obtained by i = (Co7 − So7)/δ and j = (Co6 − So6)/δ. Then the soft error can be fixed by adding δ to Oij.

If only soft error detection is required, we do not need to compute Co6 and Co7, thus reducing the number of computations. The input checksums Cd1, Cd2 and Cw1, Cw2, however, are still required for soft error detection. We denote such a detection-only scheme by CoC-D.
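The CoC detection and correction steps could be sketched as follows (again an illustrative NumPy version of ours, with tol as a hypothetical round-off tolerance). Stopping after the Co5 comparison corresponds to CoC-D.

```python
import numpy as np

def coc_correct(O, Co5, Co6, Co7, tol=1e-4):
    """CoC sketch: detect with Co5 and, for a single corrupted block, locate via Co6/Co7."""
    N, M = O.shape[0], O.shape[1]
    n_idx = np.arange(N).reshape(N, 1, 1, 1)
    m_idx = np.arange(M).reshape(1, M, 1, 1)
    So5 = O.sum(axis=(0, 1))
    So6 = (m_idx * O).sum(axis=(0, 1))
    So7 = (n_idx * O).sum(axis=(0, 1))
    delta = Co5 - So5                                # equals the per-element corruption
    if np.abs(delta).max() <= tol:
        return True                                  # no error detected (CoC-D would stop here)
    p = np.unravel_index(np.argmax(np.abs(delta)), delta.shape)
    i = int(round((Co7[p] - So7[p]) / delta[p]))     # i = (Co7 - So7) / delta
    j = int(round((Co6[p] - So6[p]) / delta[p]))     # j = (Co6 - So6) / delta
    if not (0 <= i < N and 0 <= j < M):
        return False                                 # errors span blocks: hand over to RC/ClC/FC
    O[i, j] += delta
    return True
```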

4 MULTISCHEME WORKFLOW

In this section, we first discuss the fault protection abilities and runtimes of the four schemes we proposed in Section 3. Then, we propose a multischeme workflow, powered by a calibrated arrangement of the four schemes and layerwise optimization.

4.1 Analysis of Protection Ability for Convolution Checksum Schemes

In this section, we analyze the fault protection ability of all the schemes.

4.1.1 Fault Model

The fault model for soft errors that we discuss in this paper includes transient faults in computational units and data corruption faults (both transient and persistent) in memory (including cache). In the following text, we use fault to represent a malfunction event, and we denote its corresponding symptom as a soft error.

Soft error protection includes error detection and error correction. Error detection means that the scheme can detect soft errors without knowing the exact location. Error correction means that the scheme can locate the soft error locations and recover the incorrect result.

Without loss of generality, in the following analysis we consider at most one fault per convolution. One convolutional neural network contains several or even tens of convolutional layers, and the total forward execution time of a CNN model is usually within seconds. Thus, we can reasonably assume that at most one fault may strike one convolutional layer, considering the short execution time of a single layer. Multiple faults per convolution can also be detected by our schemes and recovered by recomputing the corrupted convolutional layer.

4.1.2 Analysis of Soft Error in D and W

One fault occurring during the convolution execution can result in multiple soft errors in W and D. The soft errors in W can be detected by comparing the checksum of W with Cw1 and corrected by reloading the weights from the CNN model. The soft errors in D do not need correction because D will be discarded after the convolution computation; the resulting errors in the output can be detected and corrected by the checksums of the output, as demonstrated below.

4.1.3 Analysis of Soft Error in O

One fault during the convolution execution can result in corruption of one block row or column of O. By definition, row i of O is computed by the ith block of D with W. Thus, one fault in D would result in at most one corrupted row. Column j of O is computed by D with the jth block of W. Thus, one fault in W would result in at most one corrupted column. Moreover, the intermediate result will be reused only by the same row or column, such that one fault in the computational units would corrupt only values in the same row or column. Accordingly, in the following sections we discuss the soft error protection ability in the context of at most one corrupted row or column of O.

4.1.4 Soft Error Protection Ability of CoC Scheme

Figure 2 demonstrates the protection ability of the CoC scheme when soft errors strike the input or output data. As shown in Figure 2(a), multiple soft errors can be detected by using only Co5. A single soft error in O can be corrected by CoC using all checksums including Co5, Co6, and Co7, as shown in Figure 2(b). However, CoC cannot correct soft errors across multiple blocks in O.

Figure 3 illustrates the protection ability of the CoC scheme when soft errors happen inside the checksums. Such soft errors can cause inconsistency among the output checksums of CoC, which can be used for error detection. For example, in Figure 3(a), Cd1 is corrupted, leading to corrupted Co5 and Co6 with correct Co7. We can detect this abnormal pattern when comparing the checksums with the summations of O and thereby detect the input checksum corruption. The inputs D, W, and output O are clean and without soft errors since the fault frequency is at most once per convolution. Thus, we can safely discard all the checksums and finish this convolution computation.

Fig. 2. Soft Error Protection Ability of CoC Scheme (soft error happens in inputs and outputs)

Fig. 3. Soft Error Protection Ability of CoC Scheme (soft error happens in checksums): (a) SDC in Cd1; (b) SDC in Cw1; (c) SDC in Cd2; (d) SDC in Cw2.

4.1.5 Soft Error Protection Ability of Row Checksum Scheme and Column Checksum Scheme

Since the row checksum scheme and column checksum scheme are symmetric with each other, we discuss them together in this section. As shown in Figure 4(a), the row checksum scheme can detect and correct soft errors if they are in the same row. If the soft errors are in the same column, as shown in Figure 4(b), the row checksum scheme can only detect them; it has no correction ability. The column checksum scheme, on the contrary, can detect and correct errors located in the same column but fails to correct those appearing in the same row.

Fig. 4. Soft Error Protection Ability of Row/Column Checksum Schemes: (a) row checksum scheme; (b) column checksum scheme.

4.1.6 Soft Error Protection Ability of Full Checksum Scheme

The full checksum scheme has the highest ability to correct soft errors. The scheme uses both the row checksum Co1 and the column checksum Co2 so that it can correct soft errors in both directions, as shown in Figure 5(a)(b). If soft errors exist in Co1 (Figure 5(d)), however, Co1 can no longer be used to locate or correct soft errors. To support error correction in this situation, we use checksums Co5 and Co6 from the CoC scheme to locate the corrupted column, and we then use Co2 to correct the soft errors. If soft errors exist in Co2 (Figure 5(c)), Co5 and Co7 are used to locate the corrupted row, and Co1 is used to correct the soft errors.

Fig. 5. Soft Error Protection Ability of Full Checksum Scheme: (a) soft error in the same row; (b) soft error in the same column; (c) soft error in the same row (including Co2); (d) soft error in the same column (including Co1).

4.1.7 Conclusion

In this section, we define our fault model and analyze the soft error protection ability of the four schemes. We conclude that the CoC scheme has the lowest error correction ability and that the full checksum scheme has the best error correction ability. The abilities of the row checksum scheme and column checksum scheme are higher than that of the CoC scheme but lower than that of the full checksum scheme. CoC-D (discussed in Section 3.6) can detect multiple soft errors but has no correction ability. The analysis here serves as the fundamental basis of our low-overhead, high-protection design, which will be presented in Section 4.3.

4.2 Runtime Analysis

In this section, we analyze the time complexity theoretically and present runtimes of all schemes based on experiments.

Table 3 shows the time complexity of some basic checksum operations, where α is the coefficient of CPU-intensive operations and β represents the coefficient for memory-intensive operations.

Table 4 shows the theoretical time complexity of all the schemes. The full checksum scheme has the best soft error correction ability; however, its runtime is relatively long. Although the CoC scheme has lower ability than the other three schemes in correcting soft errors, it has the shortest runtime. Note that the kernel checksums Cw1 and Cw2 can be precalculated before the application; there is no cost in generating the kernel checksums in the row, column, and CoC schemes.

TABLE 3
Runtimes of Basic Operations

Operation                          Derived Runtime
Block-level convolution Dn ⊗ Wm    αChR²E²
Total convolution operations       αNMChR²E²
Compute the checksum of D          βNChH²
Compute the checksum of O          βNME²

Fig. 6. Worst-Case Normalized Runtime, Baseline is CoC-D: (a) AlexNet; (b) YOLOv2; (c) YOLOv2 (Conv8); (d) VGG-19; (e) ResNet-18; (f) ResNet-18 (Conv1).


To verify the correctness of the derived time complexity of the four schemes, we execute them on a supercomputer using four CNN models. We show the normalized worst-case runtime of the four schemes in the "Separate" column of Figure 6. The other columns of this figure represent the worst-case runtime of multischeme workflows and will be discussed in the next section. Experiments confirm our conclusion that CoC and CoC-D have the shortest runtime and that the runtime of the full checksum scheme is relatively long. We also see that the column checksum scheme has a much longer runtime than the row checksum scheme does. The reason is twofold. On the one hand, W blocks have smaller sizes than D blocks have, leading to a longer time to compute D ⊗ Cw2 in the column checksum scheme than to compute Cd2 ⊗ W in the row checksum scheme. On the other hand, computing the row checksums (Co1 and Co3) is more efficient than computing the column checksums (Co2 and Co4), because the row checksum calculation can be reduced to efficient column-summation operations.

TABLE 4
ABFT Schemes Runtime

Scheme  Derived Runtime                        Soft Error Correction Ability
FC      α(N + M)ChR²E² + β(NChH² + 2NME²)      High
RC      2αMChR²E² + 2β(NChH² + NME²)           Middle
ClC     2αNChR²E² + 2β(NME²)                   Middle
CoC     3αChR²E² + β(2NChH² + 3NME²)           Low


4.3 Multischeme Workflow for Soft Error Protection

The four schemes we proposed have pros and cons in terms of their soft error correction ability and runtime overhead. To achieve the highest protection ability and lowest overhead, we propose a multischeme workflow integrating the four schemes, as shown in Figure 7. The workflow is made up of two modules: error detection and error correction. In our designed workflow, we use CoC-D to detect errors because it has the lowest overhead. For error correction, we put CoC at the beginning because it is the most lightweight method. By comparison, FC has the highest correction ability but also the highest time overhead, so we put it at the end of the workflow.

Fig. 7. Multischeme Workflow Designed to Detect/Correct Soft Errors

The error detection module will be executed for every convolution whether there is a soft error or not. Thus, any unnecessary computations should be avoided in order to reduce the overall overhead. For instance, both CoC-D and FC are able to detect all the soft errors, but we adopt only CoC-D in the workflow for error detection because FC has a much higher overhead. RC and ClC cannot detect soft errors correctly if the checksum is corrupted.

The error correction module will not be executed until some soft errors are detected. The schemes in this module will be invoked to fix soft errors according to the workflow. If a scheme fails to correct the errors because of inconsistency of the checksum blocks or an illegal error location, the next-level scheme will be invoked.

Since the checksums can be reused among different schemes in the workflow, the runtime of the workflow is actually lower than the sum of all the schemes' runtimes. For example, both CoC-D and CoC use Co5; if CoC-D detects soft errors and CoC is invoked to correct them, CoC can save the time of computing Co5 and its corresponding summation So5, since they have been computed by CoC-D. This analysis is also confirmed by our experiments. As shown in Figure 6, the relative runtime of CoC in the second column is reduced compared with that of CoC in the first column. The relative runtime of RC in the third column is reduced compared with that of RC in the first column.

The decision whether to put RC or ClC in the workflow between CoC and FC is controlled by each layer. The reason to control RC/ClC for each layer is that their relative runtimes differ across layers. Since RC and ClC are symmetric, in the following we present our analysis based mainly on RC, without loss of generality.

We denote the runtime of the workflow CoC+FC as t0, the runtime of the workflow CoC+RC as t1, and the runtime of the workflow CoC+RC+FC as t2. Enabling RC can fix some soft errors before FC, thus changing the runtime from t0 to t1. When RC fails to correct soft errors, however, FC still needs to be invoked, and the runtime will increase from t0 to t2. Denoting the probability of row soft errors by pr and the probability of column soft errors by pc, we can derive the average time saved by RC as ty = pr(t0 − t1) and the average time increase caused by RC as tn = pc(t2 − t0). In order to minimize the total runtime, RC should be enabled when ty > tn.

We give an example to further illustrate when RC should be enabled. Figure 6(b) shows the average runtime among all the convolutional layers in YOLOv2. In this figure, the runtime of CoC+RC is much lower than that of CoC+FC, and the runtime of CoC+RC+FC is only slightly higher than that of CoC+FC. Therefore, enabling RC can save significant runtime when the soft errors are able to be corrected by RC. On the other hand, only a small runtime penalty is incurred if RC fails to correct the soft errors. However, for the conv8 layer in YOLOv2 (shown in Figure 6(c)), CoC+RC's runtime is close to that of CoC+FC. Thus, enabling RC in this layer would barely reduce the overall runtime even if the soft errors can be corrected by RC. Moreover, CoC+RC+FC's runtime is much higher than CoC+RC's. As a result, the total runtime will increase significantly if the soft errors cannot be corrected by RC. Hence, for this layer, it is better to use CoC+FC for error correction with RC disabled.

In practice, the runtimes t0, t1, and t2 can be computed by offline profiling. The probability values pc and pr can be estimated based on the size of D and the size of W. For instance, soft errors often strike each element in the input under an independent and identical distribution. In this situation, it is easy to derive that the probability of soft errors occurring in D is proportional to that of W, i.e., pr / pc = (number of elements in D) / (number of elements in W).
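A minimal sketch of this per-layer decision is shown below. It assumes t0, t1, and t2 come from offline profiling and that pr and pc are estimated from the element counts of D and W as described above (normalized so they sum to one, which does not affect the comparison since only their ratio matters).

```python
def enable_rc_for_layer(t0, t1, t2, d_elems, w_elems):
    """Layerwise decision sketch: enable RC when the expected saving outweighs the penalty.

    t0, t1, t2: profiled runtimes of CoC+FC, CoC+RC, and CoC+RC+FC for this layer.
    d_elems, w_elems: element counts of D and W, used to estimate pr / pc.
    """
    pr = d_elems / (d_elems + w_elems)    # probability the corruption forms a row pattern
    pc = w_elems / (d_elems + w_elems)    # probability it forms a column pattern
    ty = pr * (t0 - t1)                   # average time saved when RC succeeds
    tn = pc * (t2 - t0)                   # average time lost when RC fails and FC still runs
    return ty > tn
```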

5 RESOLVING BIAS, GROUPED CONVOLUTION, AND BACK PROPAGATION

In this section, we extend our solution to support bias, grouped convolution, and the back propagation of convolutional layers.

5.1 Bias

Bias is a 1D vector that needs to be added to the output of the convolutional layers. FT-Caffe provides protection for the bias operation.

Many CNN frameworks add the bias on the fly with the convolution calculation. As a result, the output O already contains the bias, whereas the output checksums do not contain it since they are calculated from the inputs and input checksums without bias. In order to compare the output checksums with the output O, the bias has to be subtracted from the output summations before comparison. Subtracting the bias from the output O directly before verification and then adding it back to O after verification is not feasible, however, because of the overhead of modifying every element in O. Table 5 shows the output checksums and the adjusted output summations used for comparison in order to detect errors. The bias part of the formulations can be precomputed.

TABLE 5
Bias Adjustments for Output Checksums Comparison

Checksum  Adjusted Summation
Co1       So1[m][i][j] − N × Bias[m]
Co3       So3[m][i][j] − (Σ_{i=1}^{N} i) × Bias[m]
Co2       So2[n][i][j] − Σ_m Bias[m]
Co4       So4[n][i][j] − Σ_m m × Bias[m]
Co5       So5[i][j] − N × Σ_m Bias[m]
Co6       So6[i][j] − N × Σ_m m × Bias[m]
Co7       So7[i][j] − (Σ_{i=1}^{N} i) × Σ_m Bias[m]
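As an illustration of Table 5, the sketch below adjusts the output summations for two of the checksums (Co1 and Co5) before comparison; the bias terms could be precomputed once per layer as noted above. This is our own NumPy sketch, not the FT-Caffe implementation.

```python
import numpy as np

def adjusted_summations_with_bias(O, bias):
    """Bias adjustment sketch for two entries of Table 5 (Co1 and Co5).

    O already contains the bias (added during convolution), so the bias
    contribution is subtracted from the summations before comparing them
    with the bias-free output checksums.
    """
    N, M = O.shape[0], O.shape[1]
    So1 = O.sum(axis=0) - N * bias.reshape(M, 1, 1)    # So1[m] - N * Bias[m]
    So5 = O.sum(axis=(0, 1)) - N * bias.sum()          # So5 - N * sum_m Bias[m]
    return So1, So5
```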

5.2 ABFT for Grouped Convolution

Grouped convolution is a special kind of convolution. Our schemes need to be modified to support this convolution. Define the number of groups as G. Each fmap basic block has Ch/G instead of Ch channels. All the M kernel basic blocks are divided into G groups, each having M/G 3D basic blocks. The kernel block in the gth group does convolution only with the gth channel group of every fmap block. Figure 8 shows this process for N=2, M=4, and G=2.

Fig. 8. Demonstration of Grouped Convolution, Groups = 2

The checksums for the fmap, Cd1 and Cd2, stay the same. The checksums for the kernel are redefined as

Cw1 = [ Σ_{m=0}^{M/G−1} Wm, Σ_{m=M/G}^{2M/G−1} Wm, ..., Σ_{m=(G−1)M/G}^{M−1} Wm ]

Cw2 = [ Σ_{m=0}^{M/G−1} m × Wm, Σ_{m=M/G}^{2M/G−1} m × Wm, ..., Σ_{m=(G−1)M/G}^{M−1} m × Wm ]

where Cw1 and Cw2 are the combination of G checksums, one from each kernel group. Each per-group checksum has Ch/G channels, so Cw1 and Cw2 each have G × Ch/G = Ch channels, which is the same as every fmap block.

The definition of the output checksums Co1, Co2, ..., Co7 stays the same. Let X[l..r] represent the channels from l to r in matrix X. We can prove the following property for any Dn and Wm according to Equation (1):

Dn ⊗ Wm = Dn[1..k−1] ⊗ Wm[1..k−1] + Dn[k..Ch] ⊗ Wm[k..Ch], 0 ≤ k < Ch

Using this equation, we can prove the relation between Co2 and O as follows.

Co2[n] = Dn ⊗ [ Σ_{m=0}^{M/G−1} Wm, Σ_{m=M/G}^{2M/G−1} Wm, ..., Σ_{m=(G−1)M/G}^{M−1} Wm ]
= Dn[0..Ch/G−1] ⊗ Σ_{m=0}^{M/G−1} Wm + Dn[Ch/G..2Ch/G−1] ⊗ Σ_{m=M/G}^{2M/G−1} Wm + ... + Dn[(G−1)Ch/G..Ch−1] ⊗ Σ_{m=(G−1)M/G}^{M−1} Wm
= Σ_{m=0}^{M−1} Dn ⊗ Wm = Σ_{m=0}^{M−1} Onm

Similar equations can be proved for Co1, Co3, Co4, Co5, Co6, and Co7. Therefore, all the ABFT schemes we proposed can be applied to grouped convolution.

5.3 ABFT for Convolution Back Propagation

Our schemes can also be applied to back propagation together with the forward pass so that the convolutional layers can be fully protected in the training phase.

During back propagation, the gradient of the kernel, ∇W, is used by methods such as gradient descent in order to update W. The gradient of the fmap, ∇D, is used to get ∇O of the previous layer. As shown in Figure 9, the gradients are calculated as D ⊗ ∇O = ∇W and WT ⊗ ∇O = ∇D. Checksums for ∇O are used in this situation to protect the two convolution operations.

Fig. 9. Demonstration of Checksum Design for Back Propagation

Since CNN models are usually trained in a more stable environment than the inference stage and since the training stage can tolerate some soft errors because of its iterative-convergent nature, we focus our experiments on the inference stage.

6 EXPERIMENTAL EVALUATION

In this section, we evaluate our multischeme workflow using our FT-Caffe fault tolerance CNN framework.

6.1 Experimental Setup

FT-Caffe. Our FT-Caffe framework is based on Intel-Caffe. MKL-DNN is enabled to support dynamic selection of the convolution execution. MKL-DNN contains all the convolution implementations we discussed in Section 2.2. It automatically chooses the most suitable implementation to use for each convolutional layer. To compare the runtime overhead of our solution with that of the ABFT designed for matrix-matrix multiplication, we also perform the experiments based on the MM-based convolution implementation.


CNN models and dataset. We tested our FT-Caffe with four widely used networks: AlexNet, VGG-19, ResNet-18, and YOLOv2. Pretrained Caffe models are used together with model prototypes for deployment. We adopt the ImageNet validation set, which contains 50k images. The images are preprocessed to a smaller size in order to save picture processing time when the program starts. The batch size is set to 64.

Experimental platforms. We conducted our experiments on the Bebop supercomputer [43] at Argonne National Laboratory using up to 128 nodes. Each node is equipped with 128 GB of memory and two Intel Xeon E5-2695 v4 processors (each with 16 cores).

Error injection. To demonstrate the overhead of our fault-tolerant solutions, we inject soft errors at the source code level, as most ABFT works did [29], [36]. The consequences of one computational fault or memory fault are simulated by randomly corrupting a selected row or column of the output. We denote the total number of convolutional layers of a CNN model as L. To assess the overhead accurately, we run the experiments for L epochs corresponding to the number of convolutional layers of each network (L = 5, 9, 16, 21 for AlexNet, YOLOv2, VGG-19, and ResNet-18, respectively). For the ith epoch, we inject errors into the ith convolutional layer. The final overhead is the arithmetic mean of all the inference executions, and the standard deviation in our experiments is within 5%.
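A sketch of this kind of source-level injection is shown below; the perturbation magnitude and distribution are arbitrary choices for illustration, and only the corrupted-row/corrupted-column pattern follows the fault model of Section 4.1.

```python
import numpy as np

def inject_block_error(O, rng=None):
    """Injection sketch: corrupt one randomly selected block row or column of O,
    mimicking the consequence of a single computational or memory fault."""
    rng = np.random.default_rng() if rng is None else rng
    N, M = O.shape[0], O.shape[1]
    if rng.random() < 0.5:                       # corrupt a block row
        n = rng.integers(N)
        O[n] += rng.normal(scale=O.std(), size=O[n].shape)
    else:                                        # corrupt a block column
        m = rng.integers(M)
        O[:, m] += rng.normal(scale=O.std(), size=O[:, m].shape)
    return O
```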

6.2 Experimental Results with MKL-DNN

In this section, we present our evaluation results with MKL-DNN. We analyze the results from the perspective of execution time overhead for both error-free cases and erroneous cases.

Error-free cases. The experimental results in the error-free cases are presented in Figure 10(a). We can see from the figure that our FT-Caffe can protect the inference stage with less than 4%, 4.5%, 8%, and 5% overhead for AlexNet, VGG-19, YOLOv2, and ResNet-18, respectively, regardless of the convolution implementation. These results show that our CoC-D error detection scheme has a relatively short runtime compared with the convolution execution, which is attributed to the design of avoiding unnecessary computations in our solution (see Section 4.3). The reason ResNet-18 has higher overhead than the other models is that ResNet-18 has small convolution kernels (the W size is M×Ch×3×3) in all the convolutional layers, which have relatively short computing time; thus, the checksum computation and verification time percentage is relatively large.

Erroneous cases – RC/ClC disabled. To show the effectiveness of our layerwise optimization for RC/ClC, we first test our multischeme workflow with RC/ClC disabled in erroneous cases. Figure 10(b) demonstrates that the runtime overheads (including both error detection and error correction) of the four CNN models are all below 9%. The error detection overhead is higher than the error correction overhead because the error detection scheme is executed for every convolution operation whereas the error correction schemes are invoked only when errors are detected. The full checksum scheme dominates the error correction overhead, thus confirming our analysis in Section 4 that FC has high protection ability and a relatively long runtime.

Erroneous cases – layerwise RC/ClC optimization. Figure 10(c) demonstrates the runtime overhead with layerwise optimization enabled. Every layer decides whether to use RC/ClC independently, as described in Section 4.3. Compared with Figure 10(b), the error correction overhead decreases by 40%∼60% (e.g., 1.55% → 0.72% for YOLOv2, as shown in Figure 10(b) vs. (c)) in all CNN models because of the effectiveness of RC. Figure 11(a) shows the distribution of the various workflows resulting from the layerwise RC/ClC optimization. We can see that RC is enabled in all layers of AlexNet and VGG-19, while it is disabled in 30% to 40% of the layers in ResNet-18 and YOLOv2. The results demonstrate the need for layerwise RC optimization since RC is not suitable for all layers in the same CNN model. Figure 11(b) shows the distribution of soft errors by the schemes that correct them. Less than 5% of the soft errors are corrected by CoC because of the low correction ability of CoC. RC corrects nearly 90% of the soft errors in AlexNet and VGG-19 because RC is enabled in all layers of the two CNN models and the probability of soft errors striking a row in O is higher than the probability of soft errors striking a column.

Fig. 10. Runtime Overhead with MKL-DNN: (a) error-free; (b) erroneous (RC/ClC disabled); (c) erroneous (RC/ClC enabled).

Fig. 11. Breakdown Analysis of Multischeme Workflow with MKL-DNN: (a) distribution of different workflows; (b) distribution of soft errors corrected by schemes.

Erroneous cases – breakdown of error correction overhead by layer. To better illustrate the overhead of our solution for each model, we present in Figure 12 the breakdown of the overhead by layer. The figure demonstrates that layers have diverse error protection overheads due to the different shapes of D and W. We also notice that the overhead of RC differs among layers in the same model, thus confirming the functionality of our layerwise RC/ClC optimization.


Fig. 12. Breakdown of Runtime Overhead by Layer with MKL-DNN: (a) AlexNet; (b) YOLOv2; (c) VGG-19; (d) ResNet-18.

6.3 Experimental Results with MM-Based Convolution

In this section, we evaluate the runtime overhead of our multischeme workflow and the traditional MM-based ABFT. Since the MM-based ABFT supports only the MM-based convolution implementation, we set the convolution implementation to the MM-based mode in MKL-DNN. We implemented the MM-based ABFT rigorously based on [26], which has ≤1% overhead for large and square matrices as claimed by the authors of that work. The overhead of the MM-based ABFT in convolution execution is shown in Table 6. The MM-based ABFT incurs up to 60% overhead even without error injection for the four CNN models. This result is consistent with our analysis in Section 2.3. Considering that the MM-based ABFT cannot protect the whole process of MM-based convolution and cannot protect other convolution implementations, we conclude that the MM-based ABFT is unsuitable for soft error protection of CNN applications.

Figure 13 shows the overhead of our multischeme workflow for MM-based convolution. The overhead of our solution is below 6% in the error-free cases and below 6.5% in the cases with injected errors for all CNN models. The layerwise RC/ClC optimization reduces the overhead for error correction by as much as 77%. Figure 14(a) shows the fractions of different workflows chosen by the convolutional layers. Figure 14(b) shows the distribution of soft errors that are corrected by the different schemes. Compared with the MKL-DNN implementation, more layers adopt RC for error correction in the MM-based convolution (see Figure 11 versus Figure 14). The reason is that the relative runtime of RC compared with FC is lower in the MM-based convolution implementation than in the other implementations.

TABLE 6
Overhead of MM-Based ABFT for MM-Based Convolution, No Error Injection

Model     AlexNet  YOLOv2  VGG-19  ResNet-18
Overhead  27.9%    57.5%   45.8%   61.2%

Fig. 13. Runtime Overhead with MM-Based Convolution: (a) error-free; (b) erroneous (RC/ClC disabled); (c) erroneous (RC/ClC enabled).

Fig. 14. Breakdown Analysis of Multischeme Workflow with MM-Based Convolution: (a) distribution of different workflows; (b) distribution of soft errors corrected by schemes.

6.4 Parallel Performance Evaluation

In this section, we present the parallel performance evaluation results of AlexNet, YOLOv2, VGG-19, and ResNet-18. The original images of the ImageNet validation dataset are used without preprocessing in order to better demonstrate the process of a parallel CNN inference application. At the beginning of the parallel process, the images are distributed to the local disk of each node; each node then performs the data processing step first to convert the images to the size required by the CNN models and then executes the inference step under the protection of our multischeme workflow.

We conducted the parallel evaluation in both error-free and erroneous cases. However, because of space limits, we present only the parallel performance evaluation results in the situation with injected errors (as shown in Figure 15). In fact, the evaluation results in the error-free situation are similar. Specifically, the experiments show that our multischeme workflow has very good scalability: that is, the soft error protection overhead does not increase with the number of nodes at all. In absolute terms, the overhead stays around 2%∼6% in the erroneous cases and is only 1%∼4% in the error-free cases.

7 RELATED WORK

The importance of fault tolerance for convolution has been emerging in recent years. Guaranteeing the correctness of inference is vital in safety-critical use cases [19]. To achieve better resiliency for CNNs, researchers have been exploring solutions from different perspectives including hardware, system, and software. For hardware, Kim et al. [49] proposed a hardened 3D die-stacked memory based on the fault characteristics in convolutional DNNs. Li et al. [19] proposed to add redundant circuits selectively to harden the latches based on an analysis of data resiliency. Compared with traditional full-hardware redundancy techniques, those partial-hardware redundancy techniques may not double the power usage. However, hardware modification incurs significant effort considering the varied CNN models and their accelerators. At the system level, other than DMR/TMR protection, checkpoint/restart (C/R) is also applied to large-scale machine learning systems. Subsequently, Qiao et al. proposed a more efficient C/R scheme based on their derived upper bound on the extra iteration cost with perturbations [50]. While those C/R techniques are promising for protecting model training from soft errors, they are not good fits for inference since one inference execution could be very fast and applying C/R incurs significant overhead. Researchers have therefore pursued lightweight software-level solutions. By applying ABFT techniques for MM-based convolution, Santos et al. [20] reported that 50%∼60% of radiation-induced corruptions could be corrected. Unfortunately, the traditional ABFT works only for MM-based convolution, which is inefficient in most cases. In contrast, our solutions can work for any convolution implementation.


Fig. 15. Parallel Performance Evaluation of Our Solution with Injected Errors on Bebop Supercomputer: (a) AlexNet, (b) YOLOv2, (c) VGG-19, (d) ResNet-18. Each panel plots the time (seconds) spent on image processing, inference, and soft error protection at 8, 16, 32, 64, and 128 nodes.

Compared with traditional full-hardware redundancy techniques, these partial-hardware redundancy techniques may not double the power usage. However, hardware modification requires significant effort considering the variety of CNN models and their accelerators. At the system level, besides DMR/TMR protection, checkpoint/restart (C/R) has also been applied to large-scale machine learning systems. Qiao et al. proposed a more efficient C/R scheme based on their derived upper bound on the extra iteration cost with perturbations [50]. While those C/R techniques are promising for protecting model training from soft errors, they are not a good fit for inference, since a single inference execution can be very fast and applying C/R incurs significant overhead. Researchers have therefore pursued lightweight software-level solutions. By applying ABFT techniques to MM-based convolution, Santos et al. [20] reported that 50%∼60% of radiation-induced corruptions could be corrected. Unfortunately, the traditional ABFT works only for MM-based convolution, which is inefficient in most cases. In contrast, our solutions work with any convolution implementation.

8 CONCLUSION AND FUTURE WORK

This work focuses on extending ABFT to convolution operations in convolutional neural networks. We propose four ABFT schemes and a multischeme workflow to protect the convolutional layers. We further extend our schemes to support bias, grouped convolution, and convolution backpropagation. We implement an efficient CNN framework, FT-Caffe, that is resilient to silent data corruption.

Experiments demonstrate that our proposed fault-tolerant solutions incur negligible overhead. In absolute terms, FT-Caffe achieves less than 8% overhead for the most widely used CNN models, including AlexNet, YOLO, VGG-19, and ResNet-18, in both error-free and erroneous cases.

We plan to extend the implementation to more CNN frameworks and to design architecture-specific optimizations for different hardware, including GPUs, FPGAs, and AI accelerators.

ACKNOWLEDGMENTS

This research was supported by the Exascale Computing Project (ECP), Project Number 17-SC-20-SC, a collaborative effort of two DOE organizations, the Office of Science and the National Nuclear Security Administration, responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, to support the nation's exascale computing imperative. The material was supported by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357. This work was also supported by the National Science Foundation under Grants CCF-1513201, CCF-1619253, and OAC-2034169. We acknowledge the computing resources provided on Bebop, which is operated by the Laboratory Computing Resource Center at Argonne National Laboratory.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[4] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7263–7271.
[5] Y. Goldberg, “Neural network methods for natural language processing,” Synthesis Lectures on Human Language Technologies, vol. 10, no. 1, pp. 1–309, 2017.
[6] S.-C. B. Lo, H.-P. Chan, J.-S. Lin, H. Li, M. T. Freedman, and S. K. Mun, “Artificial convolution neural network for medical image pattern recognition,” Neural Networks, vol. 8, no. 7-8, pp. 1201–1214, 1995.
[7] E. Gawehn, J. A. Hiss, and G. Schneider, “Deep learning in drug discovery,” Molecular Informatics, vol. 35, no. 1, pp. 3–14, 2016.
[8] J. M. Wozniak et al., “CANDLE/Supervisor: A workflow framework for machine learning applied to cancer research,” BMC Bioinformatics, vol. 19, no. 18, p. 491, 2018.
[9] J. J. Zhang et al., “Building robust machine learning systems: Current progress, research challenges, and opportunities,” in Proceedings of the 56th Annual Design Automation Conference 2019, ser. DAC ’19. New York, NY, USA: ACM, 2019.
[10] M. T. Le, F. Diehl, T. Brunner, and A. Knol, “Uncertainty estimation for deep neural object detectors in safety-critical applications,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC), 2018, pp. 3873–3878.
[11] S. Burton, L. Gauerhof, and C. Heinzemann, “Making the case for safety of machine learning in highly automated driving,” in Computer Safety, Reliability, and Security. Cham: Springer International Publishing, 2017, pp. 5–16.
[12] M. Snir et al., “Addressing failures in exascale computing,” International Journal of High Performance Computing Applications, 2014.
[13] L. Bautista-Gomez, F. Zyulkyarov, O. Unsal, and S. McIntosh-Smith, “Unprotected computing: A large-scale study of DRAM raw error rate on a supercomputer,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’16. IEEE Press, 2016.


[14] L. Tan, S. L. Song, P. Wu, Z. Chen, R. Ge, and D. J. Kerbyson, “Investigating the interplay between energy efficiency and resilience in high performance computing,” in 2015 IEEE International Parallel and Distributed Processing Symposium. IEEE, 2015, pp. 786–796.
[15] J. P. Walters, K. M. Zick, and M. French, “A practical characterization of a NASA SpaceCube application through fault emulation and laser testing,” in Proceedings of the 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, ser. DSN ’13. USA: IEEE Computer Society, 2013, pp. 1–8.
[16] A. Geist, “How to kill a supercomputer: Dirty power, cosmic rays, and bad solder,” IEEE Spectrum, vol. 10, pp. 2–3, 2016.
[17] J. Zhang, K. Rangineni, Z. Ghodsi, and S. Garg, “ThunderVolt: Enabling aggressive voltage underscaling and timing error resilience for energy efficient deep learning accelerators,” in Proceedings of the 55th Annual Design Automation Conference, ser. DAC ’18. New York, NY, USA: ACM, 2018.
[18] W. Choi, D. Shin, J. Park, and S. Ghosh, “Sensitivity based error resilient techniques for energy efficient deep neural network accelerators,” in Proceedings of the 56th Annual Design Automation Conference 2019, ser. DAC ’19. New York, NY, USA: ACM, 2019.
[19] G. Li et al., “Understanding error propagation in deep learning neural network (DNN) accelerators and applications,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’17. New York, NY, USA: ACM, 2017, pp. 8:1–8:12.
[20] F. F. d. Santos, L. Draghetti, L. Weigel, L. Carro, P. Navaux, and P. Rech, “Evaluation and mitigation of soft-errors in neural network-based object detection in three GPU architectures,” in 2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), 2017, pp. 169–176.
[21] B. Reagen et al., “Ares: A framework for quantifying the resilience of deep neural networks,” in Proceedings of the 55th Annual Design Automation Conference, ser. DAC ’18. New York, NY, USA: ACM, 2018, pp. 17:1–17:6.
[22] R. Salay, R. Queiroz, and K. Czarnecki, “An analysis of ISO 26262: Using machine learning safely in automotive software,” arXiv preprint arXiv:1709.02435, 2017.
[23] D. Li, Z. Chen, P. Wu, and J. S. Vetter, “Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach,” in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC ’13. New York, NY, USA: ACM, 2013.
[24] X. Vera, J. Abella, J. Carretero, and A. Gonzalez, “Selective replication: A lightweight technique for soft errors,” ACM Transactions on Computer Systems (TOCS), vol. 27, no. 4, p. 8, 2009.
[25] K.-H. Huang and J. A. Abraham, “Algorithm-based fault tolerance for matrix operations,” IEEE Transactions on Computers, vol. 100, no. 6, pp. 518–528, 1984.
[26] P. Wu, Q. Guan, N. DeBardeleben, S. Blanchard, D. Tao, X. Liang, J. Chen, and Z. Chen, “Towards practical algorithm based fault tolerance in dense linear algebra,” in Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, ser. HPDC ’16. New York, NY, USA: ACM, 2016, pp. 31–42.
[27] P. Wu, D. Li, Z. Chen, J. S. Vetter, and S. Mittal, “Algorithm-directed data placement in explicitly managed non-volatile memory,” in Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, ser. HPDC ’16. New York, NY, USA: Association for Computing Machinery, 2016, pp. 141–152.
[28] L. Tan, S. Kothapalli, L. Chen, O. Hussaini, R. Bissiri, and Z. Chen, “A survey of power and energy efficient techniques for high performance numerical linear algebra operations,” Parallel Computing, vol. 40, no. 10, pp. 559–573, 2014.
[29] J. Chen, H. Li, S. Li, X. Liang, P. Wu, D. Tao, K. Ouyang, Y. Liu, K. Zhao, Q. Guan, and Z. Chen, “Fault tolerant one-sided matrix decompositions on heterogeneous systems with GPUs,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, ser. SC ’18. Piscataway, NJ, USA: IEEE Press, 2018, pp. 68:1–68:12.
[30] J. Chen, N. Xiong, X. Liang, D. Tao, S. Li, K. Ouyang, K. Zhao, N. DeBardeleben, Q. Guan, and Z. Chen, “TSM2: Optimizing tall-and-skinny matrix-matrix multiplication on GPUs,” in Proceedings of the ACM International Conference on Supercomputing, 2019, pp. 106–116.
[31] C. Rivera, J. Chen, N. Xiong, S. L. Song, and D. Tao, “TSM2X: High-performance tall-and-skinny matrix-matrix multiplication on GPUs,” 2020.
[32] Z. Chen, “Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods,” in Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’13. New York, NY, USA: ACM, 2013, pp. 167–176.
[33] D. Tao, S. L. Song, S. Krishnamoorthy, P. Wu, X. Liang, E. Z. Zhang, D. Kerbyson, and Z. Chen, “New-Sum: A novel online ABFT scheme for general iterative methods,” in Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, ser. HPDC ’16. New York, NY, USA: ACM, 2016, pp. 43–55.
[34] D. Tao, S. Di, X. Liang, Z. Chen, and F. Cappello, “Improving performance of iterative methods by lossy checkpointing,” in Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, 2018, pp. 52–65.
[35] D. Tao, “Fault tolerance for iterative methods in high-performance computing,” Ph.D. dissertation, UC Riverside, 2018.
[36] X. Liang, J. Chen, D. Tao, S. Li, P. Wu, H. Li, K. Ouyang, Y. Liu, F. Song, and Z. Chen, “Correcting soft errors online in fast Fourier transform,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’17. New York, NY, USA: ACM, 2017, pp. 30:1–30:12.
[37] S. Li, H. Li, X. Liang, J. Chen, E. Giem, K. Ouyang, K. Zhao, S. Di, F. Cappello, and Z. Chen, “FT-iSort: Efficient fault tolerance for introsort,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’19. New York, NY, USA: Association for Computing Machinery, 2019.
[38] J. Cong and B. Xiao, “Minimizing computation in convolutional neural networks,” in Artificial Neural Networks and Machine Learning – ICANN 2014. Springer International Publishing, 2014, pp. 281–290.
[39] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2016.
[40] S. Jin, S. Di, X. Liang, J. Tian, D. Tao, and F. Cappello, “DeepSZ: A novel framework to compress deep neural networks by using error-bounded lossy compression,” in Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, 2019, pp. 159–170.
[41] Z. Hu, X. Zou, W. Xia, S. Jin, D. Tao, Y. Liu, W. Zhang, and Z. Zhang, “Delta-DNN: Efficiently compressing deep neural networks via exploiting floats similarity,” in The 49th International Conference on Parallel Processing (ICPP 2020), 2020.
[42] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
[43] Bebop supercomputer. Available at https://www.lcrc.anl.gov/systems/resources/bebop, 2019, online.
[44] A. Lavin and S. Gray, “Fast algorithms for convolutional neural networks,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4013–4021.
[45] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, “cuDNN: Efficient primitives for deep learning,” arXiv preprint arXiv:1410.0759, 2014.
[46] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” in Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’14. New York, NY, USA: ACM, 2014, pp. 269–284.
[47] NVIDIA, http://nvdla.org, 2019, online.
[48] S. Liu, Q. Wang, and G. Liu, “A versatile method of discrete convolution and FFT (DC-FFT) for contact analyses,” Wear, vol. 243, no. 1, pp. 101–111, 2000.
[49] J.-S. Kim and J.-S. Yang, “DRIS-3: Deep neural network reliability improvement scheme in 3D die-stacked memory based on fault analysis,” in Proceedings of the 56th Annual Design Automation Conference 2019, ser. DAC ’19. New York, NY, USA: ACM, 2019, pp. 129:1–129:6.
[50] A. Qiao, B. Aragam, B. Zhang, and E. Xing, “Fault tolerance in iterative-convergent machine learning,” in Proceedings of the 36th International Conference on Machine Learning, ser. ICML, vol. 97. Long Beach, California, USA: PMLR, 2019, pp. 5220–5230.


Kai Zhao received his bachelor's degree from Peking University in 2014 and will receive his Ph.D. degree from the University of California, Riverside in 2022. He is a long-term intern at Argonne National Laboratory. His research interests include high-performance computing, scientific data management and reduction, and resilient machine learning. Email: [email protected].

Sheng Di (Senior Member, IEEE) received his master's degree from Huazhong University of Science and Technology in 2007 and his Ph.D. degree from the University of Hong Kong in 2011. He is currently a computer scientist at Argonne National Laboratory. Dr. Di's research interests involve resilience on high-performance computing (such as silent data corruption, optimization of checkpoint models, and in-situ data compression) and broad research topics on cloud computing (including optimization of resource allocation, cloud network topology, and prediction of cloud workload/hostload). He is working on multiple HPC projects, such as detection of silent data corruption, characterization of failures and faults for HPC systems, and optimization of multilevel checkpoint models. Email: [email protected].

Sihuan Li is a Ph.D. student in computer science at the University of California, Riverside. He obtained his bachelor's degree in math from Huazhong University of Science and Technology, China. He did a long-term internship at Argonne National Laboratory. Broadly speaking, his research interests fall into high-performance computing. Specifically, he mainly studies algorithm-based fault tolerance (ABFT), lossy compression, and their applications in large-scale scientific simulations. He is an IEEE student member. Email: [email protected].

Xin Liang is a Computer/Data Scientist at Oak Ridge National Laboratory. He received his Ph.D. degree from the University of California, Riverside in 2019 and his bachelor's degree from Peking University in 2014. His research interests include high-performance computing, parallel and distributed systems, scientific data management and reduction, big data analytics, scientific visualization, and cloud computing. He has interned in multiple national laboratories and worked on several exascale computing projects. He is a member of the IEEE. Email: [email protected].

Yujia Zhai received his bachelor's degree from the University of Science and Technology of China in 2016, a master's degree from Duke University in 2018, and will receive his Ph.D. degree from the University of California, Riverside in 2023. His research interests include high-performance computing, parallel and distributed systems, and numerical linear algebra software. Email: [email protected].

Jieyang Chen (Member, IEEE) is a Computer Scientist in the Computer Science and Mathematics Division at Oak Ridge National Laboratory. He received his master's and Ph.D. degrees in computer science from the University of California, Riverside in 2014 and 2019. He received a bachelor's degree in computer science and engineering from Beijing University of Technology in 2012. His research interests include high-performance computing, parallel and distributed systems, and big data analytics. Email: [email protected].

Kaiming Ouyang received his bachelor's degree from the University of Electronic Science and Technology of China and joined the University of California, Riverside SuperLab in Fall 2016. He will receive his Ph.D. degree from the University of California, Riverside in 2021. He is a long-term intern in the Argonne National Laboratory PMRS group led by Dr. Balaji and supervised by Dr. Si. His research interest is parallel runtime systems. Email: [email protected].

Franck Cappello (Fellow, IEEE) is the director of the Joint-Laboratory on Extreme Scale Computing, gathering six of the leading high-performance computing institutions in the world: Argonne National Laboratory, National Center for Supercomputing Applications, Inria, Barcelona Supercomputing Center, Julich Supercomputing Center, and Riken AICS. He is a senior computer scientist at Argonne National Laboratory and an adjunct associate professor in the Department of Computer Science at the University of Illinois at Urbana-Champaign. He is an expert in resilience and fault tolerance for scientific computing and data analytics. Recently he started investigating lossy compression for scientific data sets to respond to the pressing needs of scientists performing large-scale simulations and experiments. His contribution to this domain is one of the best lossy compressors for scientific data sets respecting user-set error bounds. He is a member of the editorial board of the IEEE Transactions on Parallel and Distributed Computing and of the ACM HPDC and IEEE CCGRID steering committees. He is a fellow of the IEEE. Email: [email protected].

Zizhong Chen (Senior Member, IEEE) received a bachelor's degree in mathematics from Beijing Normal University, a master's degree in economics from Renmin University of China, and a Ph.D. degree in computer science from the University of Tennessee, Knoxville. He is a professor of computer science at the University of California, Riverside. His research interests include high-performance computing, parallel and distributed systems, big data analytics, cluster and cloud computing, algorithm-based fault tolerance, power and energy efficient computing, numerical algorithms and software, and large-scale computer simulations. His research has been supported by the National Science Foundation, Department of Energy, CMG Reservoir Simulation Foundation, Abu Dhabi National Oil Company, Nvidia, and Microsoft Corporation. He received a CAREER Award from the US National Science Foundation and a Best Paper Award from the International Supercomputing Conference. He is a Senior Member of the IEEE and a Life Member of the ACM. He currently serves as a subject area editor for the Elsevier Parallel Computing journal and an associate editor for the IEEE Transactions on Parallel and Distributed Systems. Email: [email protected].