
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR PAPERS, VOL. 65, NO. 12, DECEMBER 2018 4285

Energy-Efficient Neural Network Acceleration in the Presence of Bit-Level Memory Errors

Sung Kim, Student Member, IEEE, Patrick Howe, Thierry Moreau, Student Member, IEEE,

Armin Alaghi, Member, IEEE, Luis Ceze, Senior Member, IEEE, and Visvesh S. Sathe, Member, IEEE

Abstract— As a result of the increasing demand for deep neural network (DNN)-based services, efforts to develop hardware accelerators for DNNs are growing rapidly. However, while highly efficient accelerators for convolutional DNNs (Conv-DNNs) have been developed, less progress has been made with regards to fully-connected DNNs. Based on analysis of bit-level SRAM errors, we propose memory adaptive training with in-situ canaries (MATIC), a methodology that enables aggressive voltage scaling of accelerator weight memories to improve the energy-efficiency of DNN accelerators. To enable accurate operation with voltage overscaling, MATIC combines characteristics of SRAM bit failures with the error resilience of neural networks in a memory-adaptive training (MAT) process. Furthermore, PVT-related voltage margins are eliminated using bit-cells from synaptic weights as in-situ canaries to track runtime environmental variation. Demonstrated on a low-power DNN accelerator fabricated in 65 nm CMOS, MATIC enables up to 3.3× energy reduction versus the nominal voltage, or 18.6× application error reduction. We also perform a simulation study that extends MAT to Conv-DNNs, and characterize the accuracy impact of bit failure statistics. Finally, we develop a weight refinement algorithm to improve the performance of MAT, and show that it improves absolute accuracy by 0.8–1.3% or reduces training time by 5–10×.

Index Terms— Neural networks, deep learning, voltage scaling, SRAM, machine learning acceleration.

I. INTRODUCTION

DEEP neural networks (DNNs) have demonstrated state-of-the-art performance on a variety of signal processing tasks, and there is growing interest in developing efficient DNN accelerators for application domains ranging from IoT to the datacenter. However, much of the recent success with DNN-based approaches has been attributed to the use of larger

Manuscript received January 15, 2018; revised April 2, 2018 and April 23, 2018; accepted May 13, 2018. Date of publication June 7, 2018; date of current version October 23, 2018. This work was supported in part by the National Science Foundation through Oracle Labs, Nvidia, under Grant CCF-1518703, and in part by C-FAR, one of the six SRC STARnet Centers through MARCO and DARPA. This paper was recommended by Associate Editor I. Kale. (Corresponding author: Sung Kim.)

S. Kim, P. Howe, and V. S. Sathe are with the Department of Electrical Engineering, University of Washington, Seattle, WA 98195 USA (e-mail: [email protected]; [email protected]; [email protected]).

T. Moreau, A. Alaghi, and L. Ceze are with the Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195 USA (e-mail: [email protected]; [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSI.2018.2839613

Fig. 1. (left) The fraction of total power dissipated by weight storage SRAMs, and (right) the fraction of total SRAM used to store fully-connected weights. On-chip weight storage accounts for a significant fraction of the total power dissipation in state-of-the-art DNN accelerators. Even for Conv-DNNs such as AlexNet, weight storage is dominated by fully-connected layers.

model architectures (i.e., models with more layers), and state-of-the-art model architectures can have millions or billions of trainable weights. As a result of weight storage requirements, neural network algorithms are particularly demanding on memory systems; for instance, Chen et al. [1] found that main memory accesses dominated the total power consumption for their accelerator design. Subsequently, Du et al. [2] developed an accelerator that leveraged large quantities of on-chip SRAM, such that expensive off-chip DRAM accesses could be eliminated. To a similar end, Han et al. [3] used pruning and compression techniques to yield sparse weight matrices, and Chen et al. [4] leveraged data-reuse techniques and run-length compression to reduce off-chip memory access. More recently, other works have considered alternative data-reuse strategies to further reduce data bandwidth [5], as well as computation latency [6]. Nevertheless, once off-chip memory accesses are largely eliminated, on-chip SRAM dedicated to synaptic weights can account for greater than 50% of total system power [3] (Figure 1). The on-chip memory problem is particularly acute in DNNs with dense classifier layers, since classifier weights are typically unique and can constitute greater than 90% of total weight parameters [7]. As a result, convolutional data-reuse techniques like those of Chen et al. [4] and Wang et al. [8] lose effectiveness.

Voltage scaling can enable significant reduction in static and dynamic power dissipation; however, read and write stability constraints have historically prevented aggressive voltage scaling on SRAM. While alternative bit-cell topologies can be used, they typically trade off bit-cell stability for non-trivial overheads in terms of area, power, or speed [9]. Hence, designs that employ SRAM either place on-chip memories


Fig. 2. (left) MATIC increases energy-efficiency by aggressively scaling supply voltages of on-chip weight SRAMs. (right) Compared to hardware paired with conventionally-trained neural network models, MATIC leverages adaptive training to recover from errors caused by voltage overscaling.

on dedicated supply-rails, allowing them to operate hundreds of millivolts higher than the rest of the design, or the system shares a unified voltage domain. In either case, significant energy savings from voltage scaling remain unrealized due to SRAM operating voltage constraints; this translates to either shorter operating lifetime for battery-powered devices, or higher operating costs. Furthermore, to account for PVT variation, designers either apply additional static voltage margin, or add dummy logic circuits that detect imminent failure (canaries).

Several works have explored methods for resilience to timing errors that result from voltage scaling, or techniques that leverage algorithmic resilience to enable greater SRAM voltage scalability. Recently, Reagen et al. [10] used Razor fault detection [11], [12] combined with masking of individual faulty bits to increase the tolerable fault rate in weight SRAMs compared to full word masking. Yang and Murmann [13] demonstrated that Conv-DNNs become more robust to bit errors when trained on images corrupted by voltage scaling, or by uniform bit errors with representative statistics. We refer the reader to Section VIII for a more comprehensive discussion of related works.

This paper presents Memory Adaptive Training and In-Situ Canaries (MATIC, Figure 2), a hardware/algorithm co-design methodology that enables aggressive voltage scaling of weight SRAMs with tuneable accuracy-energy tradeoffs [14]. To achieve this end, MATIC jointly exploits the inherent error tolerance of DNNs with the specific characteristics of SRAM read-stability failures. To evaluate the effectiveness of MATIC, we also design and implement SNNAC (Systolic Neural Network AsiC), a low-power DNN accelerator for mobile devices fabricated in 65 nm CMOS, and demonstrate state-of-the-art energy-efficiency on classification and regression tasks. In addition to our hardware results, we perform a simulation study to evaluate how bit error statistics, such as bias and failure polarity, impact the accuracy of Conv-DNNs trained with memory-adaptive training (MAT). We also develop improvements to the baseline MAT algorithm to prevent divergence during training, reduce training time by 5-10×, and recover accuracy by 0.8-1.3% despite significant bit error rates.

The remaining sections are organized as follows. Section II provides background on DNN inference and training, and an overview of 6T SRAM failure modes. Section III describes the algorithmic details of MATIC. Sections IV and V describe the accelerator prototype, hardware results, and compare MATIC to prior works. Sections VI and VII present the characterization of MAT with Conv-DNNs, improvements to the MAT algorithm, and describe the simulation results.

II. BACKGROUND

This section briefly reviews relevant background on DNN inference and training, and SRAM read-stability failure.

A. Deep Neural Networks

Deep neural networks (DNNs) are a class of bio-inspired machine learning models that are represented as a directed graph of neurons [15]. During inference, a neuron k in layer j implements the composite function:

$z_k^{(j)} = f\bigl(x_k^{(j)}\bigr) = f\Bigl(\sum_{i=1}^{N^{(j-1)}} w_{k,i}^{(j)}\, z_i^{(j-1)}\Bigr)$,   (1)

where $z_i^{(j-1)}$ denotes the output from neuron i in the previous layer, and $w_{k,i}^{(j)}$ denotes the weight in layer j from neuron i in the previous layer to neuron k. f(x) is a non-linear function, typically a sigmoidal function or rectified linear unit (ReLU) [7]. Since the computation of a DNN layer can be represented as a matrix-vector dot product (with f(x) computed element-wise), DNN execution is especially amenable to dataflow hardware architectures designed for linear algebra.
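To make Eq. (1) concrete, the following minimal NumPy sketch (ours, not part of the original paper; names are illustrative) evaluates one fully-connected layer as a matrix-vector product followed by an element-wise non-linearity:

```python
import numpy as np

def layer_forward(z_prev, W, f=lambda x: np.maximum(x, 0.0)):
    """Eq. (1): z_k^(j) = f( sum_i w_{k,i}^(j) * z_i^(j-1) ) for every neuron k.

    z_prev : activations z^(j-1) from the previous layer, shape (N_{j-1},)
    W      : weight matrix with W[k, i] = w_{k,i}^(j), shape (N_j, N_{j-1})
    f      : element-wise non-linearity (ReLU here; a sigmoid also fits Eq. 1)
    """
    x = W @ z_prev      # pre-activations x_k^(j), one dot product per neuron
    return f(x)         # element-wise activation

# Example: a 3-neuron layer fed by 4 inputs
z_prev = np.array([0.2, -0.1, 0.5, 1.0])
W = 0.1 * np.random.randn(3, 4)
print(layer_forward(z_prev, W))
```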

Training involves iteratively solving for weight parameters using either batch or stochastic gradient descent [15]. Given a weight $w_{k,i}^{(j)}$ from training iteration n, its value at training iteration n + 1 is given by:

$w_{k,i}^{(j)}[n+1] = w_{k,i}^{(j)}[n] - \alpha \dfrac{\partial J}{\partial w_{k,i}^{(j)}[n]}$,   (2)

where α is the step size and J is a suitable loss function (e.g., cross-entropy) [15]. The partial derivatives of the loss function with respect to the weights are computed by propagating error backwards via partial differentiation (backprop) [15]. For example, for a network with a single hidden layer, the error gradient with respect to a weight $w_{k,i}^{(2)}$ is given by $\dfrac{\partial J}{\partial w_{k,i}^{(2)}} = \dfrac{\partial J}{\partial z_k^{(2)}} \dfrac{\partial z_k^{(2)}}{\partial x_k^{(2)}} \dfrac{\partial x_k^{(2)}}{\partial w_{k,i}^{(2)}}$. MATIC relies on the observation that backprop makes error at the output, including artificial weight perturbations, observable and subject to compensation by all weights in the network.
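As an illustration of Eq. (2) and the chain-rule expansion above, the sketch below (ours; it assumes a squared-error loss and a single sigmoid output neuron purely for concreteness) computes the gradient and applies one gradient-descent step:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
sigmoid_prime = lambda x: sigmoid(x) * (1.0 - sigmoid(x))

def single_neuron_grad(w, z_prev, target):
    """dJ/dw_i = (dJ/dz) * (dz/dx) * (dx/dw_i) for J = 0.5 * (z - target)^2."""
    x = w @ z_prev              # pre-activation
    z = sigmoid(x)              # neuron output
    dJ_dz = z - target
    dz_dx = sigmoid_prime(x)
    dx_dw = z_prev
    return dJ_dz * dz_dx * dx_dw

def sgd_step(w, grad, alpha=0.1):
    """Eq. (2): w[n+1] = w[n] - alpha * dJ/dw[n]."""
    return w - alpha * grad

w = np.array([0.1, -0.2, 0.3])
z_prev = np.array([1.0, 0.5, -1.5])
w = sgd_step(w, single_neuron_grad(w, z_prev, target=1.0))
```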

Conv-DNNs are a popular type of neural network that achieve state-of-the-art performance on image and video data [7], [16], [17]. In addition to fully-connected (FC) layers, Conv-DNNs typically contain non-linear pooling operators and convolution filters [15]. However, compared to FC layers, Conv layers typically account for a small fraction (e.g., <10%) of the total weights in the network (Figure 1).

B. SRAM Read Failures

Figure 3 shows a 6T SRAM bit-cell, composed of two cross-coupled inverters (M2/M3 and M4/M5) and associated access transistors (M1 and M6). Variation-induced mismatch between devices in the bit-cell creates an inherent, state-independent offset [18], [19]. This offset results in each bit-cell having a “preferred state.” For instance, the bit-cell depicted in Figure 3b favors driving Q and Q̄ to logic ‘1’ and ‘0’, respectively. Due to statistical variation, larger memories


Fig. 3. (a) An SRAM 6T bit-cell with mismatch-induced, input-referred static offsets. During a read, the active pulldown device (either M2 or M4) may be overcome by the current sourced from the precharged bit-line (via the access device) if there is insufficient static noise margin. (b) Equivalent cross-coupled inverter schematic, illustrating bit-cell state preference. In this example, variation-induced offsets result in a preference towards Q = 1.

Fig. 4. Overview of the MATIC compilation and deployment flow.

are likely to see greater numbers of cells with significant offset error.

As supply voltage scales, the diminished noise margin allows the bit-cell to be flipped to its preferred state during a read [18]. This type of read-disturbance failure, which occurs at the voltage Vmin,read, is a result of insufficient pulldown strength in either M2 or M4 (whichever side of the cell stores 0), leading to a voltage upset from the pass-gate and pre-charged bitline. Once flipped, the bit-cell retains state in subsequent repeated reads, favoring its (now incorrect) bit value due to the persistence of the built-in offset. Consequently, the occurrence of bit-cell read-stability failure at low supply voltages is random in space, but essentially provides stable read outputs consistent with bit-cells' preferred states [18]. Notably, the read failures described above are distinct from bitline access-time failures, which can be corrected with ample timing margin.

III. VOLTAGE SCALING FOR DNN ACCELERATORS

We now present MATIC, a voltage scaling methodology that leverages memory-adaptive training (MAT) to operate SRAMs past their point of failure, and in-situ synaptic canaries to remove static voltage margins for accurate operation across PVT variation. In conjunction, the two techniques enable system-wide voltage scaling for energy-efficient operation, with tuneable accuracy-energy tradeoffs. Figure 4 shows an overview of the processing and deployment flow, which is detailed below.

A. SRAM Profiling

SRAM read failures are profiled post-silicon, and are used for both in-situ canary selection and the memory-adaptive training process. The SRAM profiling procedure takes place once at compile time, and consists of a read-after-write and read-after-read operation on each SRAM address, at the target

Fig. 5. The modified DNN training algorithm and injection masking process. OR and AND bit masks annotate information on SRAM bit errors that occur due to voltage overscaling, and quantization convergence issues are avoided by maintaining both float and fixed-point weights.

DNN accuracy level (bit error proportion). The word address, bit index, and error polarity of each bit-cell failure are then collected to form a complete failure map for each voltage-scalable weight memory in the hardware design. Failure maps are stored by the profiling software in the form of two arrays that contain bit masks indicating bit failure state (fail/non-fail) and the failure polarity (0/1) for each weight. Hence the storage required for failure maps is 2NB bits, where N and B are the number of weights and bit-precision for the network, respectively. The entire profiling process and failure-map generation is automated with a host PC that controls on-chip debug software, off-chip profiling software, and external digitally-programmable voltage regulators.
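The sketch below (our own illustration; the data layout and names are not from the authors' toolchain) shows how such a failure map might be assembled from the profiled fault list as the two per-weight OR/AND masks used later by MAT, matching the 2NB-bit storage cost:

```python
import numpy as np

def build_failure_maps(faults, num_weights, bits=16):
    """Build per-word OR/AND masks from a profiled fault list.

    faults : iterable of (word_addr, bit_index, stuck_value) tuples collected
             by the read-after-write / read-after-read profiling pass
    Returns (B_or, B_and): B_or forces stuck-at-1 bits, B_and clears
    stuck-at-0 bits; together they occupy 2 * num_weights * bits bits.
    """
    B_or = np.zeros(num_weights, dtype=np.uint32)
    B_and = np.full(num_weights, (1 << bits) - 1, dtype=np.uint32)
    for addr, bit, stuck_value in faults:
        if stuck_value == 1:
            B_or[addr] |= np.uint32(1 << bit)        # cell always reads 1
        else:
            B_and[addr] &= ~np.uint32(1 << bit)      # cell always reads 0
    return B_or, B_and

# Hypothetical profile: bit 3 of word 7 stuck at 1, bit 0 of word 2 stuck at 0
B_or, B_and = build_failure_maps([(7, 3, 1), (2, 0, 0)], num_weights=1024)
```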

B. Memory-Adaptive Training

MAT augments the vanilla backprop algorithm by injecting profiled SRAM bit errors throughout the offline training process, enabling the neural network to compensate via adaptation. As described in Section II, random mismatch results in bit-cells that are statically biased towards a particular storage state. If a bit-cell stores the complement of its “preferred” state, performing a read at a sufficiently low voltage flips the cell and subsequent reads will be incorrect, but stable. MAT leverages this stability during training with an injection masking process (Figure 5). The injection mask applies bit masks corresponding to profiled bit errors to each DNN weight before the forward training pass. As a result, the network error propagated in the backward pass reflects the impact of the bit errors, leading to compensatory weight updates in the entire network. Since the injection masking process operates on weights that correspond to real hardware, weights are necessarily quantized during training to match the SRAM word length. However, previous work has shown that unmitigated quantization during training (in contrast to post-training quantization) can lead to significant accuracy degradation [21]. MAT counteracts the effects of quantization during training by preserving the fractional quantization error, in effect performing floating point training to enable gradual weight updates that occur over multiple backprop iterations. The augmented weight update rule for MAT is given by

$w_{k,i}^{(j)}[n+1] = m_{k,i}^{(j)}[n] - \alpha \dfrac{\partial J}{\partial m_{k,i}^{(j)}[n]} + \Delta_q$,   (3)

and

$m_{k,i}^{(j)}[n] = B_{or} \vee \bigl(B_{and} \wedge Q\bigl(w_{k,i}^{(j)}[n]\bigr)\bigr)$,   (4)

Page 4: Energy-Efficient Neural Network Acceleration in the ... · constitute greater than 90% of total weight parameters [7]. As a result, convolutional data-reuse techniques like those

4288 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR PAPERS, VOL. 65, NO. 12, DECEMBER 2018

Fig. 6. Simulated performance of memory-adaptive training on MNIST [20].

in which $m_{k,i}^{(j)}$ is the quantized weight that corresponds to $w_{k,i}^{(j)}$, $B_{or}$ and $B_{and}$ are the bit masks corresponding to bit-cell faults in the physical SRAM word, Q is the quantization function, and $\Delta_q$ is the quantization error in the fractional part of the quantized weight (fractional quantization error). The definitions of the indices i, j, k and n are the same as in Equations 1 and 2. We note that the characteristics of the bit masks and quantization function, such as bit-precision, are hardware dependent.
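A minimal sketch of Equations 3 and 4 is given below (our own; a 16-bit signed fixed-point format for weights in [−1, 1) is assumed for concreteness). The forward and backward passes would use the fault-injected quantized weight m, while the floating-point master copy keeps the fractional quantization error so that small updates accumulate across iterations:

```python
import numpy as np

BITS = 16
SCALE = (1 << (BITS - 1)) - 1        # assumed fixed-point scale for weights in [-1, 1)

def quantize(w_float):
    """Q(w): float weights -> signed fixed-point codes, plus the fractional
    quantization error Delta_q that Eq. (3) adds back into the update."""
    code = np.clip(np.round(w_float * SCALE), -SCALE - 1, SCALE).astype(np.int32)
    delta_q = w_float - code / SCALE
    return code, delta_q

def inject_faults(code, B_or, B_and):
    """Eq. (4): apply the profiled stuck-bit masks to the quantized weights."""
    raw = code.astype(np.int64) & ((1 << BITS) - 1)           # 2's complement view
    raw = (raw | B_or) & B_and                                # stuck-at-1, stuck-at-0
    raw = np.where(raw >= (1 << (BITS - 1)), raw - (1 << BITS), raw)
    return raw

def mat_update(w_float, grad_wrt_m, B_or, B_and, alpha=0.01):
    """Eq. (3): w[n+1] = m[n] - alpha * dJ/dm[n] + Delta_q."""
    code, delta_q = quantize(w_float)
    m = inject_faults(code, B_or, B_and) / SCALE              # faulty weights seen by hardware
    return m - alpha * grad_wrt_m + delta_q
```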

To evaluate the feasibility of memory-adaptive training, we first examine the memory-adaptive training flow with simulated SRAM failure patterns. We implement the training modifications described above in the open-source FANN [22] and Caffe [23] frameworks. A proportion of randomly selected weight bits are statically flipped at each voltage, where the proportion of faulty bits is determined from SPICE Monte Carlo simulations of a 6T bit-cell. Figure 6 shows that a significant fraction of bit errors can be tolerated, and that MATIC provides a reasonably smooth energy-error tradeoff curve.

C. Model Pre-Training

Pre-training is a promising technique that has been leveraged for model fine-tuning and transfer learning [24]–[26]. In the pre-training paradigm, specific layers or entire networks that have already been trained are reused as a starting point for training on a different task, or as fixed feature extractors. The use of pre-trained network components is often advantageous in terms of training time, or when a scarcity of training data prevents training a neural network from scratch. For instance, Donahue et al. [25] fix pre-trained convolutional layers and retrain only classifier layers for adaptation to multiple recognition tasks, and Razavian et al. [26] use pre-trained features in conjunction with an SVM classifier. We show that for MAT, using pre-trained networks for initialization enables better classification performance, or significantly reduced training time.

Figure 7 shows the impact of pre-training with MNIST using the FC-DNN topology from Section V. For the pre-trained case, MAT+pretrain, we first train the model for N0 epochs with zero bit failures, and use the resulting model as a seed model for subsequent post-training with MAT. The first notable improvement is with training time: with a pre-trained seed model, post-training can be terminated early with additional training epochs N′ ≪ N0 to obtain the same error performance as the MAT baseline. On the other hand, if increased training time can be tolerated, more

Fig. 7. Training convergence with and without pre-training on MNIST, with bit failure rate of (a) 0.1 and (b) 0.6. The steep decrease in training error at iteration 100 results from the learning rate schedule; in this case the learning rate is scaled by 0.1 every 100 epochs.

Fig. 8. (left) Overview of the runtime SRAM voltage control scheme. Between inferences, the integrated μC polls canary bits to determine if voltage modifications should be applied. (right) In-situ canary bit-cells are selected from the boundary of the SRAM failure CDF. This effectively forms a margin, with faulty (but compensated) and protected bits on either side of the boundary.

epochs of post-training can be executed such that N′ ≥ N0. When post-training is terminated early to match the accuracy of the baseline MAT model, we observe training time improvements of 10-30×. On the other hand, post-training with epochs N′ = N0 results in relative error improvement of 1.1-1.3×, which corresponds to 0.1-1% improvement in absolute accuracy. Since pre-training only needs to be executed once per topology, the cost of pre-training can be amortized across chips that share a seed model. Hence, the pre-training approach is advantageous to reduce training time so long as N0 · M ≥ N0 + M · N′, where M is the number of chips that share a seed model.
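To make the break-even condition concrete, rearranging N0 · M ≥ N0 + M · N′ (for N′ < N0) gives M(N0 − N′) ≥ N0, i.e., M ≥ N0/(N0 − N′). For example, with purely illustrative values of N0 = 500 epochs and N′ = 50 epochs, M ≥ 500/450 ≈ 1.1, so sharing a single pre-trained seed model already pays off once two or more chips are trained.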

D. In-Situ Synaptic Canaries

Canary circuits are hardware mechanisms that are used to detect failure when supply voltage is varied at runtime. Traditional canary circuits replicate critical circuits to detect imminent failure, but require added margin and are vulnerable to PVT-induced mismatch [27]. In contrast, the proposed in-situ canary circuits are bit-cells selected directly from synaptic weight SRAMs that facilitate SRAM supply-voltage control (Figure 8). A number of bit-cells are selected for monitoring based on whether they fail at the margin of the target operating voltage. We define “marginal” bit-cells as bits that are non-failing at voltage Vtarget + Vstep and fail at voltage Vtarget, where Vtarget is the operating voltage corresponding to a desired bit failure pattern, and Vstep is the discrete voltage step size.

At runtime, in-situ canary bits are polled by a controller to determine whether supply voltage modifications should be applied. To verify the state of a canary bit, the runtime


Algorithm 1 In-Situ Canary-Based Voltage Control

controller performs a read on the SRAM location containing the canary bit, and checks whether the read result matches the expected value. If any of the in-situ canary bits are failing, the runtime controller boosts the SRAM supply voltage; otherwise the controller reduces the supply voltage. This process of setting the SRAM supply voltage based on the failure states of the in-situ canaries maintains the desired bit-cell fault pattern, and in turn maintains the level of classification accuracy. The in-situ canary-based voltage control routine is summarized in Algorithm 1. In contrast to a static voltage margin, the in-situ canary-based margin provides reliability tailored to the specific failure patterns of the test chip. The in-situ canary technique relies on two key observations:

1. Since the most marginal, failure-prone bit-cells are chosen as canaries, canaries fail before other accuracy-critical bit-cells and protect their storage states.

2. Neural networks are inherently robust to a limited number of uncompensated errors [28]. As a result, network accuracy is not dependent on the failure states of canary bit-cells, and the actual operating voltage can be brought directly to the Vmin,read boundary of the canaries.

The runtime controller is implemented with an integrated microcontroller in the test chip described in Section V; however, faster or more efficient circuits can be used if required. To be minimally invasive to the operation of the accelerator, the runtime controller performs the voltage control routine in between invocations of the accelerator. For canary selection and voltage adjustment, we conservatively select eight marginal canary bit-cells from each weight-storage SRAM. Vmin,read data for each bit-cell in the design is reused from the memory failure maps obtained during memory profiling. We note that a small voltage margin may be retained to compensate for aging effects, which can result in Vmin drift of a few percent over the lifetime of an SRAM [29]. Aging effects could also be compensated with more frequent SRAM profiling, where the memory failure maps are updated periodically. We leave an in-depth analysis of aging compensation for future work.
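Since the body of Algorithm 1 is not reproduced here, the following sketch (ours) captures the control step described above; read_bit and set_sram_voltage stand in for the chip's debug/regulator interface and are hypothetical names:

```python
V_STEP = 0.01   # assumed regulator step size (V)
V_NOM = 0.90    # nominal SRAM supply (V)
V_MIN = 0.40    # assumed lower bound of the scaling range (V)

def canary_control_step(canaries, v_now, read_bit, set_sram_voltage):
    """One invocation of the in-situ canary routine, run between inferences.

    canaries : list of (word_addr, bit_index, expected_value) tuples for the
               marginal bit-cells chosen from the SRAM failure CDF boundary
    Returns the updated SRAM supply voltage.
    """
    any_failing = any(read_bit(addr, bit) != expected
                      for addr, bit, expected in canaries)
    if any_failing:
        v_next = min(v_now + V_STEP, V_NOM)    # boost until canaries are restored
    else:
        v_next = max(v_now - V_STEP, V_MIN)    # relax toward the Vmin,read boundary
    set_sram_voltage(v_next)
    return v_next
```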

IV. DNN ACCELERATOR ARCHITECTURE

To demonstrate the effectiveness of MATIC on real hardware, we implement SNNAC (Systolic Neural Network AsiC)

Fig. 9. (a) Microphoto of a fabricated SNNAC test chip, and (b) summary of test chip characteristics. The baseline voltage, power, frequency, and energy efficiency are reported.

Fig. 10. Architecture of the SNNAC DNN accelerator.

in 65 nm CMOS technology (Figure 9). The SNNAC architecture (Figure 10) is based on the open-source systolic dataflow design from SNNAP [30], optimized for integration with a light-weight SoC.

The SNNAC core consists of a fully-programmable Neural Processing Unit (NPU) that contains eight multiply-accumulate (MAC)-based Processing Elements (PEs). The PEs are arranged in a 1D systolic ring that maintains high compute utilization during inner-product operations. Energy-efficient arithmetic in the PEs is achieved with 8-22 bit fixed-point operands, and each PE includes a dedicated voltage-scalable SRAM bank to enable on-chip storage of all synaptic weights. The systolic ring is attached to an activation function unit (AFU), which minimizes energy and area footprint with piecewise-linear approximation of activation functions (e.g., sigmoid or ReLU).

The operation of the PEs is coordinated by a lightweight control core that executes statically compiled microcode. To achieve programmability and support for a wide range of layer configurations, the computation of wide DNN layers is time-multiplexed onto the PEs in the systolic ring. When the layer width exceeds the number of physical PEs, PE results are buffered to an accumulator that computes the sum of all atomic MAC operations in the layer. SNNAC also includes a sleep-enabled OpenMSP430-based microcontroller (μC) [31] to handle runtime control, debugging functions, and off-chip communication with a UART serial interface. To minimize data movement, NPU input and output data buffers are memory-mapped directly to the μC data-memory address space.
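The sketch below is a simplified behavioral model (ours, not the accelerator microcode) of how a layer wider than the eight physical PEs could be time-multiplexed, with each pass producing one group of outputs and the activation applied at the end as in the AFU:

```python
import numpy as np

NUM_PES = 8   # physical processing elements in the systolic ring

def wide_layer_forward(W, z_prev, f=lambda x: np.maximum(x, 0.0)):
    """Behavioral model: output neurons are computed NUM_PES at a time, and
    each PE accumulates the MACs for one output neuron per pass."""
    num_out = W.shape[0]
    out = np.zeros(num_out)
    for base in range(0, num_out, NUM_PES):            # time-multiplex over PE groups
        rows = W[base:base + NUM_PES]                   # weights held in each PE's SRAM bank
        out[base:base + NUM_PES] = rows @ z_prev        # one MAC sweep per PE
    return f(out)                                       # AFU (ReLU stand-in here)

# Example: a 20-wide layer over 16 inputs takes ceil(20 / 8) = 3 PE passes
y = wide_layer_forward(np.random.randn(20, 16), np.random.randn(16))
```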


Fig. 11. (a) Measured SRAM read-failure rate at 25°C. The first SRAM bit failures are observed at 0.53 V. (b) Number of DNN weight parameters versus test error during topology search. DNN topologies are selected in order to minimize redundancy and over-parameterization. Networks that are over-parameterized could unfairly bias the application error analysis.

V. EXPERIMENTAL SETUP AND HARDWARE RESULTS

At 25°C and 0.9 V, a nominal SNNAC implementation operates at 250 MHz and dissipates 16.8 mW, achieving a 90.6% classification rate on MNIST handwritten character recognition [20]. MNIST images are resized to 10×10 in order to fit the network into on-chip SRAM; while this has a negative impact on classifier performance, the topologies are sufficient to demonstrate MATIC on real hardware. In addition to MNIST, we evaluate face detection on the MIT CBCL face database [32], and two approximate computing benchmarks from [33]. For all of the benchmark tasks, we divide the datasets into training and test subsets with either a 7-to-1 or 10-to-1 train/test split.

To avoid unfairly biasing the application error analysis, all benchmarks use compact DNN topologies that minimize intrinsic over-parameterization. DNN topologies are evaluated with a parameter search, where the number of hidden layers and neurons per layer are swept over the ranges [1,3] (in increments of 1) and [1,512] (in multiples of 2), respectively. Topologies are empirically selected at the knee of the error curve for each benchmark, as shown in Figure 11b.

The off-the-shelf SRAM macros (rated at 0.9 V) for the test chip exhibit bit errors starting from 0.52 V at room temperature, with all reads failing at ∼0.4 V (Figure 11a). Since the point of first failure is dictated by the tails of the Vmin,read statistical distribution, we expect Vmin,read to increase in more advanced process nodes (which have greater variation), and with larger memories (where there is increased probability of sampling from the tails of the Vmin,read distribution). For instance, the SRAM variability study from [18] exhibits Vmin,read failures starting at ∼0.66 V with a 45 nm, 8 KB array. In practice, this extends the range of voltages where memory-adaptive training is useful, and increases the relative voltage scalability compared to a system that does not apply overscaling.

A. Application Error

Figure 12 and Table I show how MATIC recovers application accuracy after the point of first failure. Table I lists the benchmarks along with model descriptions, DNN topologies, and application error for the baseline (naive) and memory-adaptive evaluations at select voltages. Table I also lists the nominal application error (no SRAM failures) for each benchmark, which applies to voltages [0.9 V, 0.53 V) at room temperature. The baseline and memory-adaptive models use

Fig. 12. Error performance of SNNAC, with and without MATIC deployed, as a function of weight SRAM supply voltage. Subplots correspond to benchmarks for digit recognition, face detection, inverse kinematics, and Black-Scholes option pricing (from top to bottom, left to right).

the same DNN model topologies (e.g., layer depth and width configurations), but memory-adaptive training modifications are disabled for the naive case. Compared to a voltage-scaled naive system (the SNNAC accelerator operating with baseline DNN models), MATIC demonstrates significantly lower application error. We define the average error increase (AEI) as the error increase across voltage compared to the nominal application error. Then between 0.46 V and 0.53 V, the use of MATIC results in 6.7× to 28.4× application error reduction versus naive hardware. This voltage range for evaluating the AEI reflects the range that a designer could leverage with tolerable accuracy degradation; i.e., there is a steep decrease in accuracy below 0.46 V that a designer would be unlikely to consider as useful. When averaged across both voltage and all benchmarks, the AEI is reduced by 18.6×. Hence, MATIC enables accurate operation at low voltages, where the accuracy using naive hardware quickly degrades after the onset of SRAM bit errors.

B. Energy-Efficiency

For energy-efficiency we consider the operation of SNNAC in three feasible operating scenarios: HighPerf (high performance, maximum frequency), EnOpt_split (energy optimal, disjoint logic and SRAM voltages), and EnOpt_joint (energy optimal, unified voltage domains). Figure 13 shows the energy-per-cycle measurements on SNNAC for both logic and weight SRAMs, derived from test chip leakage and dynamic current measurements. In HighPerf, operating frequency determines voltage settings, while frequency settings for EnOpt_split and EnOpt_joint are determined by the minimum-energy point (MEP) subject to the voltage domain configuration. The baselines for each operating scenario use the same clock frequencies and logic voltages as the optimized cases, but with SRAM operating at the nominal voltage.

In HighPerf, we observe that timing limitations prevent voltage scaling of the logic and memory in both the baseline and optimized configurations. However, while the baseline is unable to scale SRAM voltage due to stability margins, the optimized case (with MATIC) is able to scale SRAM


TABLE I

DNN BENCHMARKS AND APPLICATION ERROR MEASUREMENTS

Fig. 13. Energy-per-cycle measurements for (left) logic and (right) weight SRAM in SNNAC, obtained from test chip current measurements. The total energy-per-cycle shows a characteristic “bowl” shape, caused by increasing/decreasing leakage/dynamic current as a function of voltage.

down to 0.65 V, resulting in 1.4× energy savings; timing requirements in the SRAM periphery prevent further scaling.

In EnOpt_split, where SRAM and logic voltage domains are separated, the baseline is able to scale logic to the MEP but SRAM remains at the nominal voltage. While the baseline is unable to voltage-scale its weight memories, with MATIC we are able to scale both logic and SRAM to the MEP, leading to 2.5× energy savings. Furthermore, SRAM energy is minimized at 0.5 V with a 28% SRAM bit-cell failure rate, which corresponds to an 87% classification accuracy on MNIST (versus 29.3% accuracy with naive hardware).

Finally, in EnOpt_joint, where voltage domains are unified to match a system with stringent power grid requirements, the baseline is unable to scale both SRAM and logic voltages. While logic voltage in the HighPerf scenario was limited due to timing requirements, logic in the baseline case for EnOpt_joint is limited by SRAM Vmin,read since the voltage domains are shared; in this case, SRAM PVT and read stability margins prevent system-wide voltage scaling. The MATIC-SNNAC combination, on the other hand, is able to scale both logic and SRAM voltages to the unified energy-optimal voltage, 0.55 V, which results in 3.3× energy savings. Note that the baseline design in EnOpt_split is more efficient than the baseline design in EnOpt_joint: although the relative savings versus baseline are larger for EnOpt_joint, the EnOpt_split configuration provides the highest absolute energy efficiency. The energy-per-cycle measurements for the scenarios described above are summarized in Table II.

In general, the breakdown of voltage savings between MAT and the in-situ canary system depends on the level of process variation. We can write the target operating voltage as Vtarget = Vnom − Vmargin − Voverscale, where Vnom is the nominal supply voltage, and Vmargin and Voverscale are the voltage savings from applying the canary technique and MAT, respectively. Then for chips where variation is more severe, we expect Vmargin to decrease relative to Voverscale, since the points of first failure will occur at higher supply voltages. For our particular test setup, variation is relatively low and

TABLE II

ENERGY-EFFICIENCY WITH MATIC-ENABLED SCALING

Fig. 14. Runtime closed-loop SRAM voltage control enabled by the in-situ canary system, in response to ambient temperature variation. The actual distance between time steps varies from 2-10 minutes due to the variable heating/cooling rate of the test chamber.

the point of first failure is 0.52 V; as a result, the majority of voltage savings in this case occur from the use of the in-situ canaries. For example, for a target voltage of 0.5 V, Vmargin = 350 mV and Voverscale = 20 mV. As described earlier in Section V, this breakdown would likely shift in more advanced process nodes, and with larger memories.

C. Temperature Variation

To demonstrate system stability over temperature, we execute the application benchmarks in a chamber with ambient temperature control, and sweep temperature from −15°C to 90°C for a given nominal voltage. After initialization at the nominal voltage and temperature, we sweep the temperature down to −15°C, and then up from −15°C to 90°C in steps of 15°C, letting the chamber stabilize at each temperature point. Figure 14 shows the SRAM voltage settings dictated by the in-situ canary system for an initial setting at 0.5 V on inversek2j. The results illustrate how the in-situ canary technique adjusts SRAM voltage to track temperature variation, while conventional systems would require static voltage margins. We note that the operating voltages for the temperature chamber experiments are below the temperature inversion point for the 65 nm process; this is illustrated by the inverse relationship between temperature and SRAM voltage.

D. Performance Comparison

A comparison with recent DNN accelerator designs is listed in Table III. The performance comparison shows that the MATIC-SNNAC combination is comparable to state-of-the-art accelerators despite modest nominal performance, and


TABLE III

COMPARISON WITH STATE-OF-THE-ART DNN ACCELERATORS

enables a comparatively wider operating voltage range. While the algorithmic characteristics (and hardware requirements) of networks containing convolutional layers are vastly different from FC-oriented DNNs [35], we include two recent Conv accelerators to show that SNNAC is competitive despite the lack of convolution-oriented optimization techniques.

This section concludes the hardware results for this paper. We have shown that MATIC, the first hardware/algorithm co-design methodology that addresses the energy-efficiency bottleneck imposed by synaptic weight SRAMs, enables 1.4× to 3.3× total energy reduction depending on the performance and voltage constraints on the system. We also demonstrate that while naive hardware is unable to handle SRAM bit failures, the use of MATIC enables between 6.7-28.4× reduction in application error. Finally, we show that the in-situ canary technique compensates for temperature variation, eliminating the need for large static voltage margins. The rest of this paper is dedicated to the description and analysis of the extension of MAT to Conv-DNNs, and improvements to the MAT algorithm.

VI. MAT WITH DEEP CONVOLUTIONAL NETWORKS

As described in Section I, recent Conv-DNN accelerators have leveraged a variety of techniques to increase the computational efficiency of convolution operations. However, many popular Conv-DNN models contain dense FC layers near the network output for classification [7], [16], [17]. Hence, once the energy efficiency of computation for Conv-DNNs is optimized, the relative dissipation due to memory operations becomes significant. For instance, Moons and Verhelst design a Conv-DNN accelerator that achieves a 2.6× reduction in power (from 142 mW to 55 mW) by using a combination of voltage scaling on hardware MACs and operation guarding [36]. Notably, the accelerator's on-chip SRAMs are fixed to the nominal operating voltage in a separate voltage domain. As a result, in its most optimized configuration, approximately 40% of the accelerator power is consumed by the fixed voltage domain that contains on-chip weight and data SRAM buffers. Hence, although energy reduction with MAT is maximized for FC-DNNs, significant gains in energy efficiency can also be realized for compute-optimized Conv-DNN accelerators.

The following sections examine the extension of MAT to Conv-DNNs, which are state-of-the-art for many image processing and recognition tasks. In particular, we describe

Fig. 15. Conv-DNN weight and training characteristics. (a) Weight distribution of the seed model (0% failed bits). (b)-(c) Weight distribution with 50% failed bits after 500 epochs without and with gradient clipping, respectively. (d) Test accuracy over time without and with gradient clipping, respectively.

solutions to training convergence problems that arise due to exploding gradients [37], [38], study the impact of bit errors as a function of layer type, and characterize the performance of MAT under different assumptions on bit error statistics. We perform all evaluations with a modified Caffe framework, using LeNet-5 [16] with 16-bit quantization in the range [−1.0, 1.0] for all weights. The pre-training methodology described in Section III-C is used throughout, with the reference LeNet model from the Caffe model zoo, trained for 5000 epochs, serving as the reference model. Unless explicitly stated, all training hyperparameters are unchanged from the reference model.

A. Training Convergence

Compared to smaller FC-DNNs, we observe that Conv-DNNs are more sensitive to bit-level perturbations, and can be susceptible to divergence during training. This divergence can be attributed to the introduction of large errors at the onset of training, which leads to correspondingly large error gradients and weight updates. This is akin to the exploding gradients problem [37], [38], where steep curvatures in the surface of the objective function lead to weight updates that overshoot minima. Since weights are quantized in MAT, the large weight updates caused by exploding gradients result in weights that “snap” to the quantization boundaries (Figure 15b). To solve this problem we employ the gradient clipping method proposed by Pascanu et al. [38], where the error-gradient tensor G = ∂J/∂θ is scaled by k = T/‖G‖₂ if the L2-norm exceeds the threshold T. In practice, we found that training converged for T < 200. However, making T too small (<10) yielded marginal improvements in final test accuracy compared to models trained with larger values of T, and converged to the final test accuracy at a slower rate. Figure 15 illustrates the impact of bit failures on Conv-DNN training, with and without the application of gradient clipping described above.
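A minimal sketch of the clipping step is shown below (ours; the actual experiments use a modified Caffe, and the threshold value here is only illustrative):

```python
import numpy as np

def clip_gradient_norm(grads, threshold):
    """If the L2 norm of the full gradient tensor G exceeds T, rescale G by
    k = T / ||G||_2 (gradient clipping in the style of Pascanu et al.)."""
    flat = np.concatenate([g.ravel() for g in grads])
    norm = float(np.linalg.norm(flat))
    if norm > threshold:
        scale = threshold / norm
        grads = [g * scale for g in grads]
    return grads, norm

# Illustrative threshold; the text reports convergence for T < 200
grads = [50.0 * np.random.randn(32, 16), 50.0 * np.random.randn(16)]
clipped, g_norm = clip_gradient_norm(grads, threshold=100.0)
```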


Fig. 16. Conv-DNN accuracy versus failure proportion, with failures isolated to specific layer types. Error bars show the min, numerical mean, and max of 5 trials. (a) MAT with the proportion of failed bits normalized to the number of bits for the layer type. (b)-(c) MAT with the proportion of failed bits normalized to the number of bits for the entire network. (d) Models trained without MAT, with same normalization as (a).

B. Convolutional vs. Fully-Connected Layers

Since Conv-DNN accelerators contain weights for both Conv and FC layers, accelerator designers may consider whether to voltage scale memories on a per-layer or layer-type basis. However, since on-chip memory for Conv-DNN accelerators is typically dominated by weights from FC layers, the accuracy-energy tradeoff obtained by voltage scaling Conv weight memories may not be favorable. To quantify the impact of voltage scaling on Conv and FC layers, we perform an experiment where bit errors are isolated to all layers for the specified layer types, followed by training with MAT. We refer to the proportion of failures that are isolated and normalized to the number of bits in a layer type as the layer-type failure proportion, and the same proportion normalized to the number of bits in the entire network as the total failure proportion.

Figure 16a shows Conv-DNN accuracy when failures are isolated to all layers of either Conv, FC, or both Conv and FC layer types. The results from Figure 16 that are described below are also summarized in Table IV. We note that error bars in Figures 16 and 17 represent min/max, while the σ in Table IV represents standard deviation. While it appears that Conv layers are more robust to failure compared to FC layers (Figure 16a), recall that Conv weights generally account for a small fraction of the total weights in the network. Figure 16b shows that when normalized to the total number of bits in the network (as opposed to total bits for the layer type), failures in Conv layers result in significant accuracy degradation compared to failures in FC layers. For instance, a 100% failure proportion in Conv layers corresponds to only 6% failure proportion considering all weights in the network (total failure proportion), and 96.1% accuracy. On the other hand, a 20% failure proportion in FC layers corresponds to 19% failure in the entire network, and 96.8% accuracy. This difference in accuracy is apparent in the absolute accuracy

TABLE IV

AVERAGE ERROR WITH ISOLATED BIT FAILURES

degradation per failed bit (Table IV), which is 5-17× lower when failures are isolated to FC layers (versus isolated to Conv layers).

Notably, when failures are isolated to either Conv or FC layers, the accuracy stays above 96% even when the layer-type failure proportion is 100% (Figure 16a-b). In this case, all of the weights for the failing layer type are fixed, indicating significant adaptation in the non-failing weights to preserve classifier behavior.

Figure 16c shows that bit errors that are distributed across both FC and Conv layers result in greater accuracy degradation compared to isolated bit failures, even when the proportion of failed bits is comparable. However, accuracy obtained with MAT is an order of magnitude better compared to the baseline, where regular training is used (Figure 16d). The sensitivity of network accuracy to failures in Conv layers strongly motivates the use of a mechanism that enables independent voltage scaling of FC and Conv weights. For instance, a simple implementation strategy would be to store FC and Conv weights in separate voltage-scalable SRAM banks.

C. Bias in Bit-Failure Polarity

Specific circuit implementations and process technologies may result in bit failures that are biased. For instance, in SRAM, the failure of sense amplifiers or other periphery circuitry can result in systematic failures that are unipolar, resulting in a bit error distribution that is strongly biased. Another example is with DRAM, where the bit-cell is composed of a storage capacitor and single-ended access transistor; in this case, failures from voltage scaling or refresh throttling may exhibit strong bias towards the state that is physically stored as a 0. To examine the impact of bit failure bias on overall accuracy, we perform a limit analysis where failures are unbiased, 100% biased towards 0, or 100% biased towards 1 (Figure 17). We see that the accuracy performance of the three cases can be ranked from best to worst in the order


Fig. 17. Conv-DNN accuracy after MAT, with biased bit failures isolated to either (a) Conv or (b) FC layers. Bit-failure proportion is normalized to the number of bits for the given layer type. Error bars show the min, numerical mean, and max of 5 trials. Note that the unbiased case is identical to that in Figure 16a-b.

of (1) unbiased, (2) 100% bias towards 0, and (3) 100% bias towards 1. We determine the reason for this discrepancy by profiling the layer activations; in the case of bias towards 1, we observe that activation sparsity is lost, i.e., there are a large number of false activations in the network. On the other hand, activations are suppressed in the case where failures are biased towards 0, but largely unchanged in the unbiased case. Hence, the application of MAT is most favorable for memory technologies that have relatively unbiased failures, or failure modes that preserve activation sparsity.

To summarize the section above, we showed that MAT can be extended to Conv-DNNs, and that gradient clipping is an effective countermeasure against divergence during training. Furthermore, we characterized the impact of bit errors on layer types, and showed that efficiency gains from voltage scaling are maximized if Conv and FC weights can be voltage scaled independently. Finally, we showed that MAT performs best when bit errors are unbiased, motivating the use of design techniques and circuit topologies that minimize biased failures. The following section describes additional enhancements to the MAT algorithm that improve both accuracy and training time.

VII. NEAREST WEIGHT REFINEMENT

During a weight update, bit-level failures in a weight introduce mismatch between the ideal updated weight $w_t = w + \Delta w$ and the erroneous updated weight $w' = B_{or} \vee \bigl(B_{and} \wedge (w + \Delta w)\bigr)$, where $B_{and}$ and $B_{or}$ are the masks corresponding to bit errors. For instance, given $w = 0100_2$ (4) and $w_t = 1000_2$ (8), if bit 3 (zero-indexed) fails towards zero, the weight update will result in $w' = 0$. Notably, training with the baseline MAT algorithm converges with reasonable accuracy in spite of the potential mismatch between $w'$ and $w_t$, but an update policy that is cognizant of such mismatch should result in better performance. As such, we develop a nearest weight refinement (NWR) scheme to minimize the Euclidean distance between $w'$ and $w_t$, and improve the overall accuracy of networks trained with MAT. Furthermore, we show that a simple NWR implementation that runs in constant time enables up to 1.5× reduction in classification error, or 5-10× reduction in training time.

A. Algorithm Description

The intuition behind the algorithm is illustrated in Figure 18: for a given target $w_t$, the nearest refined

Fig. 18. NWR minimizes the distance between an ideal target weight and its actual, error-prone value during training. To find the nearest refined weight $w^*$, NWR solves for the greatest lower bound or least upper bound of the error region (colored in red) that intersects $w_t$, using only constant-time bit-wise logical operations.

Algorithm 2 Nearest Weight Refinement

weight $w^*$ must be at one of the boundaries of the error region intersecting with $w_t$. However, a bit-cell whose value is erroneously fixed creates observable error only if its fixed polarity does not match that of the bit in $w_t$. We distinguish between bit error types by defining biased bit-cells that are fixed but do not mismatch as latent bit errors, and biased bit-cells that do mismatch as active bit errors. Thus, we define an error region more precisely as a set of numbers corresponding to a contiguous segment of active or latent bit errors on the 2's complement number line. For instance, given $w_t = 01000_2$, $B_{and} = 01111_2$, and $B_{or} = 00011_2$ (which implies $w' = 01011_2$), the error regions corresponding to any numbers with 0 at bit-indices 0 or 1 are active, while the regions corresponding to negative numbers (due to $B_{and}$) are latent. Note that a single error region can be composed of both active and latent sub-segments.

Clearly, an error region that intersects with w_t is problematic if it is active, and in such a case we can minimize the distance from w_t by approaching w∗ from either side of the intersecting error region. We refer to the endpoints of the intersecting error region as the greatest lower bound (GLB) and least upper bound (LUB).


Fig. 19. Examples illustrating NWR. (a) An instance of undershoot (w′ < w_t, p_c = 1). In this case, the GLB is nearest to w_t with a distance of 3. (b) An instance of overshoot (w′ > w_t, p_c = 0). The zero-crossing check finds the true GLB across the origin with a distance of 9.

To find the GLB and LUB, NWR performs bit-wise logical operations on w′ based on a single bit index. Specifically, NWR takes the index of the greatest failing bit in the active error region, i_e,max = ⌊log₂(w′ ⊕ w_t)⌋, where ⊕ is the bit-wise XOR operator, and selects the best refined weight produced by the following routines, where p_c is a polarity for corrective bit flips:

1) Set bits at valid indices less than i_e,max to polarity p_c
2) Set the bit at i_e,max + 1 to p_c, and all bits at valid indices less than i_e,max to p_c
3) Flip the MSB to p_msb, and set all bits at valid indices less than i_e,max to p_msb

This routine (Algorithm 2) works irrespective of the signs of w′ and w_t (it depends only on the comparison against w_t), and runs in constant time. The third case, which we denote the zero-crossing (ZC) check, accounts for the case when the GLB or LUB lies across the origin from w_t. Figure 19 illustrates the operation of NWR and the three cases above. Note that w∗ may lie on the boundary of a latent error region, i.e., the GLB and LUB do not necessarily correspond to the endpoints of the active error region used to find i_e,max.

In practice, the procedures SetLess and Set do not need to check the validity of an individual bit before setting its value. That is, SetLess and Set can be implemented efficiently by applying shifted masks of all 1's or 0's, followed by the masks corresponding to B_or and B_and. Hence, the runtime of NWR per weight is O(1) regardless of weight bit-width, and there is no additional memory overhead compared to the baseline MAT algorithm.
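Since the pseudocode of Algorithm 2 is not reproduced here, the sketch below gives only a brute-force reference for the quantity NWR computes under our reading of the description above: the code nearest to w_t that the faulty cells can store exactly. It is exponential in the bit-width and purely illustrative; the constant-time routine described above reaches the same result with bit-wise operations.

```python
def nearest_storable_weight(w_t, b_and, b_or, n_bits):
    """Exhaustive reference for the refined weight w* targeted by NWR.

    A code x is storable iff writing it through the faulty cells leaves it
    unchanged, i.e., x == (b_or | x) & b_and. Among all storable codes, we
    return the one nearest to w_t under two's-complement interpretation.
    """
    def signed(x):  # interpret an n_bits code as two's complement
        return x - (1 << n_bits) if x & (1 << (n_bits - 1)) else x

    storable = [x for x in range(1 << n_bits) if x == (b_or | x) & b_and]
    return min(storable, key=lambda x: abs(signed(x) - signed(w_t)))

# Example from Section VII-A: bits 0 and 1 stuck at 1, bit 4 stuck at 0.
w_star = nearest_storable_weight(0b01000, b_and=0b01111, b_or=0b00011, n_bits=5)
assert w_star == 0b00111   # the GLB of the intersecting error region, at distance 1
```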

Fig. 20. Conv-DNN (a) accuracy after 5000 epochs, and (b) accuracy versus training time. Error bars show the min, numerical mean, and max of 5 trials. Bit errors are segregated to FC layers as described in Section VI-B.

TABLE V

AVERAGE APPLICATION ERROR WITH NWR


B. Impact of NWR on Training Time and Accuracy

We evaluate the MAT-NWR combination developed above against the baseline MAT algorithm from Section III-B. Since we saw in Section VI-B that bit failures in Conv layers result in marginal energy benefits relative to the accuracy loss, we consider the case where unbiased failures occur in FC layers. Figure 20a shows the accuracy of MAT-NWR and the baseline MAT as a function of failure proportion. For both MAT-NWR and the baseline, we train and test 5 instances of the reference LeNet model at each failure proportion. Figure 20a and Table V show that the improvement in absolute accuracy with MAT-NWR is between 0.8-1.3%, with an average of 1.1% across all failure proportions. This corresponds to a relative improvement in error of 1.5× on average compared to the baseline MAT algorithm.

In addition to the final classification accuracy, we record the accuracy and loss on a separate validation dataset every 100 training epochs. Figure 20b shows the accuracy on the validation dataset throughout training. Although the MAT-NWR combination also achieves a lower error floor, the enhanced algorithm offers an order-of-magnitude reduction in training time to reach the same accuracy as the baseline MAT. Across failure proportions, we observe training speedups between 5-10×.

VIII. RELATED WORK

There has been a vast body of recent work on hardware for machine learning. This section summarizes recent efforts related to accelerating neural networks in hardware, defect tolerance, and computation using analog and mixed-signal circuits.


A. Digital Accelerators for Neural Networks

There has been a vast body of work on DNN hardware accelerators [1]-[4], [8], [30], [34], [39], [40] (see Section I). Accelerators such as those developed by Du et al. [2] and Bang et al. [34] perform significant architectural redesign, integrating large amounts of on-chip SRAM to minimize off-chip memory bandwidth. This design pattern is used by many subsequent designs, and while effective, it results in significant power dissipation from on-chip SRAM. Moons et al. [41] use various circuit techniques, including dynamic voltage, frequency, and precision scaling, along with data-reuse techniques, to enable up to 10 TOPS/W on sparse Conv-DNNs with 4-bit weight precision. These works show promising approaches to improving the accuracy, performance, and energy efficiency of DNN hardware, but they are largely orthogonal efforts since they have different design goals (e.g., better classification accuracy), or leverage architectural techniques, network sparsity, and data-reuse techniques that are only amenable to Conv-DNNs. In contrast, our work is applicable to both FC- and Conv-DNNs, focuses on increasing the energy efficiency of DNN accelerators by enabling aggressive system-wide voltage scaling, and leverages the characteristics of SRAM circuit structures to develop a hardware-algorithm co-design solution.

B. Resilient Hardware and Defect Tolerance

Temam [28] explores the impact of defects in logic gates, and is among the first to develop fault-tolerant hardware for neural networks. Srinivasan et al. [42] exploit DNN resilience with a mixed 8T-6T SRAM, where weight MSBs that have a greater impact on overall classification accuracy are stored in 8T cells. Xia et al. [43] and Liu et al. [44] design variability-tolerant training schemes, and show that adaptation improves the classification accuracy of DNNs implemented with resistive-RAM-based crossbars. Yang and Murmann [13] perform experiments using fabricated SRAMs to demonstrate that Conv-DNNs become more robust to error when trained on input images that have been corrupted by voltage scaling. Interestingly, Yang and Murmann find that classification performance is improved compared to their baseline even when the errors in the images used for training and testing differ (i.e., errors need not be static). Reagen et al. [10] present a framework to develop and optimize highly-accurate DNN hardware accelerators. Microarchitectural techniques from this work include Razor error detection [11], [12], and a technique to mask individual faulty bits at runtime that provides better error tolerance compared to full-word masking. Whatmough et al. [45] develop a fabricated prototype with support for highly-accurate sparse FC-DNNs, and leverage Razor error detection to allow a controlled number of timing violations to occur without compensation or significant loss in accuracy. Our work leverages algorithm/hardware co-design with training techniques to increase the efficiency of DNN weight storage, incorporates the DNN accelerator and SRAM weight memories in the training loop for greater error tolerance, and presents results from a fabricated accelerator prototype. In addition, our work can be applied to a broad class of DNN accelerators without significant architectural modification, and is amenable to standard CMOS technologies and IP libraries.

C. Analog, Mixed-Signal, and In-Memory Techniques

There has also been recent work on in-memory and analog/mixed-signal computing techniques [46]-[48], which have demonstrated energy efficiencies of multiple TOPS/W or hundreds of pJ per classification. These efforts show promising energy efficiency, although they may require some redesign in terms of architecture, circuits, and CAD tools. Our work aims to leverage mature digital CMOS circuits and technologies, so that it remains applicable to a variety of existing hardware accelerators.

IX. CONCLUSION

This paper presents a methodology and algorithms that enable DNN accelerators to gracefully tolerate bit errors resulting from memory supply-voltage overscaling. The cornerstones of our approach are (1) Memory-Adaptive Training, a technique that leverages the adaptability of neural networks to train around errors resulting from SRAM voltage scaling, and (2) in-situ synaptic canaries, the use of bit-cells directly from weight SRAMs for voltage control and variation tolerance.

To validate the effectiveness of MATIC, we designed and implemented SNNAC, a low-power DNN accelerator fabricated in 65 nm CMOS. As demonstrated on SNNAC, the application of MATIC results in either 3.3× total energy reduction and 5.1× energy reduction in SRAM, or 18.6× reduction in application error. In addition to hardware results with FC-DNNs, we extend MAT to Conv-DNNs and show how training divergence due to weight snapping can be prevented. We also characterize the impact of varying bit-error statistics on Conv-DNN layer types, and show that the best energy-error tradeoff is achieved when bit errors are unbiased and isolated to FC layers. Finally, we develop Nearest Weight Refinement, which results in a 1.5× reduction in error relative to the baseline algorithm, or training-time improvements of 5-10×. Thus, MATIC and its variants enable accurate inference on aggressively voltage-scaled hardware, providing robust and efficient operation for a broad class of DNN accelerators.

ACKNOWLEDGMENT

The authors would like to thank Fahim ur Rahman, Venkata Rajesh Pamula, and John Uehlin for design support, and Eric Karl for helpful discussions. We also thank the reviewers for their valuable input during the preparation of the manuscript.

REFERENCES

[1] T. Chen et al., “DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” in Proc. ASPLOS, 2014, pp. 269–284.
[2] Z. Du et al., “ShiDianNao: Shifting vision processing closer to the sensor,” ACM SIGARCH Comput. Archit. News, vol. 43, no. 3, pp. 92–104, 2015.
[3] S. Han et al., “EIE: Efficient inference engine on compressed deep neural network,” in Proc. ISCA, 2016, pp. 243–254.


[4] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Jan./Feb. 2016, pp. 262–263.
[5] Y.-J. Lin and T. S. Chang, “Data and hardware efficient design for convolutional neural network,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 5, pp. 1642–1651, May 2018.
[6] A. Ardakani, C. Condo, M. Ahmadi, and W. J. Gross, “An architecture to accelerate convolution in deep neural networks,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 4, pp. 1349–1362, Apr. 2018.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. NIPS, 2012, pp. 1097–1105.
[8] S. Wang, D. Zhou, X. Han, and T. Yoshimura, “Chain-NN: An energy-efficient 1D chain architecture for accelerating deep convolutional neural networks,” in Proc. DATE, 2017, pp. 1032–1037.
[9] M. Qazi, M. E. Sinangil, and A. P. Chandrakasan, “Challenges and directions for low-voltage SRAM,” IEEE Des. Test Comput., vol. 28, no. 1, pp. 32–43, Jan./Feb. 2011.
[10] B. Reagen et al., “Minerva: Enabling low-power, highly-accurate deep neural network accelerators,” ACM SIGARCH Comput. Archit. News, vol. 44, no. 3, pp. 267–278, 2016.
[11] S. Das et al., “A self-tuning DVS processor using delay-error detection and correction,” IEEE J. Solid-State Circuits, vol. 41, no. 4, pp. 792–804, Apr. 2006.
[12] J. P. Kulkarni et al., “A 409 GOPS/W adaptive and resilient domino register file in 22 nm Tri-gate CMOS featuring in-situ timing margin and error detection for tolerance to within-die variation, voltage droop, temperature and aging,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2015, pp. 1–3.
[13] L. Yang and B. Murmann, “SRAM voltage scaling for energy-efficient convolutional neural networks,” in Proc. ISQED, 2017, pp. 7–12.
[14] S. Kim et al., “MATIC: Learning around errors for efficient low-voltage neural network accelerators,” in Proc. DATE, 2018, pp. 1–6.
[15] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). New York, NY, USA: Springer-Verlag, 2006.
[16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[17] K. Simonyan and A. Zisserman. (2014). “Very deep convolutional networks for large-scale image recognition.” [Online]. Available: http://arxiv.org/abs/1409.1556
[18] Z. Guo, A. Carlson, L.-T. Pang, K. T. Duong, T.-J. K. Liu, and B. Nikolic, “Large-scale SRAM variability characterization in 45 nm CMOS,” IEEE J. Solid-State Circuits, vol. 44, no. 11, pp. 3174–3192, Nov. 2009.
[19] E. Grossar, M. Stucchi, K. Maex, and W. Dehaene, “Read stability and write-ability analysis of SRAM cells for nanometer technologies,” IEEE J. Solid-State Circuits, vol. 41, no. 11, pp. 2577–2588, Nov. 2006.
[20] Y. LeCun and C. Cortes. (2010). MNIST Handwritten Digit Database. [Online]. Available: http://yann.lecun.com/exdb/mnist/
[21] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with limited numerical precision,” in Proc. 32nd Int. Conf. Mach. Learn. (ICML), vol. 37, Lille, France, 2015, pp. 1737–1746. [Online]. Available: http://dl.acm.org/citation.cfm?id=3045118.3045303
[22] S. Nissen, “Implementation of a fast artificial neural network library (FANN),” Dept. Comput. Sci., Univ. Copenhagen, København, Denmark, Tech. Rep., 2003. [Online]. Available: http://fann.sf.net
[23] Y. Jia et al., “Caffe: Convolutional architecture for fast feature embedding,” in Proc. 22nd ACM Int. Conf. Multimedia, 2014, pp. 675–678.
[24] L. Torrey and J. Shavlik, “Transfer learning,” in Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods and Techniques, vol. 1. Hershey, PA, USA: IGI Global, 2009, p. 242.
[25] J. Donahue et al., “DeCAF: A deep convolutional activation feature for generic visual recognition,” in Proc. 31st Int. Conf. Mach. Learn., Beijing, China, Jun. 2014, pp. 647–655.
[26] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN features off-the-shelf: An astounding baseline for recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), 2014, pp. 512–519, doi: 10.1109/CVPRW.2014.131.
[27] J. Wang and B. H. Calhoun, “Canary replica feedback for near-DRV standby VDD scaling in a 90 nm SRAM,” in Proc. CICC, 2007, pp. 29–32.
[28] O. Temam, “A defect-tolerant accelerator for emerging high-performance applications,” in Proc. ISCA, 2012, pp. 356–367.
[29] E. H. Cannon, A. KleinOsowski, R. Kanj, D. D. Reinhardt, and R. V. Joshi, “The impact of aging effects and manufacturing variation on SRAM soft-error rate,” IEEE Trans. Device Mater. Rel., vol. 8, no. 1, pp. 145–152, Mar. 2008.
[30] T. Moreau et al., “SNNAP: Approximate computing on programmable SoCs via neural acceleration,” in Proc. HPCA, 2015, pp. 603–614.
[31] O. Girard. OpenMSP430. Accessed: Feb. 2017. [Online]. Available: http://opencores.org/project,openmsp430
[32] M. Alvira and R. Rifkin, “An empirical comparison of SNoW and SVMs for face detection,” Dept. Elect. Eng. Comput. Sci., Massachusetts Inst. Technol., Cambridge, MA, USA, Tech. Rep. AIM-2001-004, 2001.
[33] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Neural acceleration for general-purpose approximate programs,” in Proc. MICRO, 2012, pp. 449–460.
[34] S. Bang et al., “A 288 μW programmable deep-learning processor with 270 KB on-chip weight storage using non-uniform memory hierarchy for mobile intelligence,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2017, pp. 250–251.
[35] N. P. Jouppi et al., “In-datacenter performance analysis of a tensor processing unit,” in Proc. ISCA, 2017, pp. 1–12.
[36] B. Moons and M. Verhelst, “A 0.3–2.6 TOPS/W precision-scalable processor for real-time large-scale ConvNets,” in Proc. VLSIC, 2016, pp. 1–2.
[37] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans. Neural Netw., vol. 5, no. 2, pp. 157–166, Mar. 1994.
[38] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in Proc. ICML, 2013, pp. III-1310–III-1318.
[39] D. Liu et al., “PuDianNao: A polyvalent machine learning accelerator,” ACM SIGARCH Comput. Archit. News, vol. 43, no. 1, pp. 369–381, 2015.
[40] Y. Chen et al., “DaDianNao: A machine-learning supercomputer,” in Proc. MICRO, 2014, pp. 609–622.
[41] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, “Envision: A 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28 nm FDSOI,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2017, pp. 246–247.
[42] G. Srinivasan, P. Wijesinghe, S. S. Sarwar, A. Jaiswal, and K. Roy, “Significance driven hybrid 8T-6T SRAM for energy-efficient synaptic storage in artificial neural networks,” in Proc. DATE, 2016, pp. 151–156.
[43] L. Xia, M. Liu, X. Ning, K. Chakrabarty, and Y. Wang, “Fault-tolerant training with on-line fault detection for RRAM-based neural computing systems,” in Proc. DAC, 2017, pp. 33-1–33-6.
[44] C. Liu, M. Hu, J. P. Strachan, and H. H. Li, “Rescuing memristor-based neuromorphic design with high defects,” in Proc. DAC, 2017, pp. 87-1–87-6.
[45] P. N. Whatmough, S. K. Lee, H. Lee, S. Rama, D. Brooks, and G.-Y. Wei, “A 28 nm SoC with a 1.2 GHz 568 nJ/prediction sparse deep-neural-network engine with >0.1 timing error rate tolerance for IoT applications,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2017, pp. 242–243.
[46] J. Zhang, Z. Wang, and N. Verma, “A machine-learning classifier implemented in a standard 6T SRAM array,” in Proc. VLSIC, 2016, pp. 1–2.
[47] F. N. Buhler, P. Brown, J. Li, T. Chen, Z. Zhang, and M. P. Flynn, “A 3.43 TOPS/W 48.9 pJ/pixel 50.1 nJ/classification 512 analog neuron sparse coding neural network with on-chip learning and classification in 40 nm CMOS,” in Proc. VLSIC, 2017, pp. C30–C31.
[48] E. H. Lee and S. S. Wong, “A 2.5 GHz 7.7 TOPS/W switched-capacitor matrix multiplier with co-designed local memory in 40 nm,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Jan./Feb. 2016, pp. 418–419.


Sung Kim received the B.S. degree in electrical engineering from the University of Washington in 2015, where he is currently pursuing the Ph.D. degree and is a member of the Processing Systems Lab. His research interests include computer architecture, digital VLSI, and application-specific hardware design.

Patrick Howe received the B.S. degree in electrical engineering from the University of Washington, Seattle, WA, USA, in 2015. He is currently an FPGA Design Engineer with Spectranetix Inc. His research contributions focused on pre- and post-silicon verification for digital ICs.

Thierry Moreau received the B.A.Sc. degree in computer engineering from the University of Toronto, Canada, in 2012, and the M.S. degree in computer science from the University of Washington, USA, in 2015, where he is currently pursuing the Ph.D. degree in computer science and engineering and is a member of the Sampa Computer Architecture Lab. His research interests include FPGA acceleration, deep learning, compilers, approximate computing, and high-performance computing.

Armin Alaghi (S’06–M’15) received the B.Sc. degree in electrical engineering and the M.Sc. degree in computer architecture from the University of Tehran, Iran, in 2006 and 2009, respectively, and the Ph.D. degree from the Electrical Engineering and Computer Science Department, University of Michigan, in 2015. From 2005 to 2009, he was a Research Assistant in the Field-Programmable Gate-Array (FPGA) Laboratory and the Computer-Aided Design Laboratory, University of Tehran, where he was involved in FPGA testing and network-on-chip testing. From 2009 to 2015, he was with the Advanced Computer Architecture Laboratory, University of Michigan. He is currently an affiliate Assistant Professor with the University of Washington. His research interests include digital system design, embedded systems, VLSI circuits, computer architecture, and electronic design automation.

Luis Ceze received the Ph.D. degree in computer science from the University of Illinois at Urbana–Champaign. He is currently a Professor with the Paul G. Allen School of Computer Science and Engineering, University of Washington. His research focuses on the intersection between computer architecture, programming languages, and biology.

Visvesh S. Sathe is currently an Assistant Professor with the Department of Electrical Engineering, University of Washington. Until 2013, he served as a Member of Technical Staff with the Low-Power Advanced Development Group, AMD, where his research focused on digital, mixed-signal, and power technologies for next-generation microprocessors. He led the research and development effort at AMD that resulted in the first-ever resonant-clocked commercial microprocessor. His research interests lie at the intersection of digital and mixed-signal circuits and architectures for energy-efficient computing and biomedical electronics.