

ISSCC 2014 / SESSION 30 / TECHNOLOGIES FOR NEXT-GENERATION SYSTEMS / 30.10

30.10 A 1TOPS/W Analog Deep Machine-Learning Engine with Floating-Gate Storage in 0.13µm CMOS

Junjie Lu, Steven Young, Itamar Arel, Jeremy Holleman

University of Tennessee, Knoxville, TN

Direct processing of raw high-dimensional data such as images and video by machine learning systems is impractical both due to prohibitive power consumption and the “curse of dimensionality,” which makes learning tasks exponentially more difficult as dimension increases. Deep machine learning (DML) mimics the hierarchical representation of information in the human brain to achieve robust automated feature extraction, reducing the dimension of such data. However, the computational complexity of DML systems limits large-scale implementations in standard digital computers. Custom analog or mixed-mode signal processors have been reported to yield much higher energy efficiency than DSP [1-4], presenting a means of overcoming these limitations. However, the use of volatile digital memory in [1-3] precludes their use in intermittently-powered devices, and the required interfacing and internal A/D/A conversions add power and area overhead. Nonvolatile storage is employed in [4], but the lack of learning capability requires task-specific programming before operation, and precludes online adaptation.

The feasibility of analog clustering, a key component of DML, has been demonstrated in [5]. In this paper, we present an analog DML engine (ADE) implementing DeSTIN [6], a state-of-the-art DML framework, and featuring online unsupervised trainability. Floating-gate nonvolatile memory facilitates operation with intermittent harvested energy. An energy efficiency of 1TOPS/W is achieved through massive parallelism, deep weak-inversion biasing, current-mode analog arithmetic, distributed memory, and power gating applied to per-operation partitions. Additionally, algorithm-level feedback desensitizes the system to errors such as offset and noise, allowing reduced device sizes and bias currents.

Figure 30.10.1 shows the architecture of the ADE, in which seven identical cortical circuits (nodes) form a 4-2-1 hierarchy. Each node captures regularities in its inputs through an unsupervised learning process. The lowest layer receives raw data (e.g. the pixels of an image) and continuously constructs belief states that characterize the sequence observed. The inputs of nodes on the 2nd and 3rd layers are the belief states of nodes in their respective lower layers. The beliefs of the top layer are then used as rich features for a classifier.
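
To make the data flow concrete, the following Python sketch (our illustration, not the authors' implementation) wires seven behavioral nodes into the 4-2-1 hierarchy: each node maps an 8-D input to 4 belief states, and lower-layer beliefs are concatenated pairwise to feed the next layer. The node internals are sketched after the next paragraph; here node_step is a hypothetical stand-in.

    import numpy as np

    def node_step(obs):
        """Stand-in node: maps an 8-D observation to 4 belief states."""
        d = np.random.rand(4) + 1e-9   # placeholder for the 4 centroid distances
        s = 1.0 / d                    # invert distances into similarities
        return s / s.sum()             # normalize into a valid distribution

    def ade_step(raw):
        """raw: 32-D input (e.g. an 8x4 pixel window, split over 4 nodes)."""
        l1 = [node_step(raw[8*i:8*i+8]) for i in range(4)]                 # 1st layer: 4 nodes
        l2 = [node_step(np.concatenate(l1[2*i:2*i+2])) for i in range(2)]  # 2nd layer: 2 nodes
        return node_step(np.concatenate(l2))                               # 3rd layer: 4 rich features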

The node (Fig. 30.10.2) incorporates an 8×4 array of reconfigurable analog computation cells (RAC), grouped into 4 centroids, each with an 8-dimensional input. The centroids are characterized by their mean μ and variance σ² in each dimension, stored in their respective floating-gate memories (FGM). In a training cycle, the analog arithmetic elements (AAE) calculate a simplified Mahalanobis distance (assuming a diagonal covariance matrix) DMAH between the input observation o and each centroid. The 8-D distances are built by joining the output currents. A distance processing unit (DPU) performs an inverse-normalization (IN) operation on the 4 distances to construct the belief states, which are the likelihoods that the input belongs to each centroid. Then the centroid parameters μ and σ² are adapted using the online clustering algorithm. The centroid with the smallest Euclidean distance DEUC to the input is selected (classification). The errors between the selected centroid and the input are loaded into the training control (TC), and its memories are then updated proportionally. In recognition mode, only the belief states are constructed and the memories are not adapted. Intra-cycle power gating is applied to reduce the power consumption by up to 37%.
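
The following Python sketch summarizes one training cycle under the update rules annotated in Figs. 30.10.3 and 30.10.4; the learning rates alpha and beta and the constant C are illustrative values, not the chip's.

    import numpy as np

    def train_cycle(o, mu, var, alpha=0.05, beta=0.05, C=4.0):
        """o: 8-D observation; mu, var: 4x8 arrays of centroid means/variances."""
        d_mah = np.sum((o - mu)**2 / var, axis=1)    # simplified Mahalanobis distance
        s = 1.0 / (d_mah + 1e-12)                    # invert (IN)...
        belief = s / s.sum()                         # ...and normalize into belief states
        d_euc = np.sum((o - mu)**2, axis=1) / C      # Euclidean distance, C a constant
        w = np.argmin(d_euc)                         # select the nearest centroid
        err_mu = o - mu[w]                           # Err_mu = o - mu
        err_var = (o - mu[w])**2 / C - var[w]        # Err_var = (o - mu)^2/C - var
        mu[w] += alpha * err_mu                      # proportional update: mu -> MEAN(o)
        var[w] += beta * err_var                     # var -> VAR(o)/C
        return belief, w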

Figure 30.10.3 shows the schematic of the RAC, which performs three different operations through reconfigurable current routing. Two embedded FGMs provide nonvolatile storage for the centroid parameters. Capacitive feedback stabilizes the floating-gate voltage (VFG) to yield pulse-width-controlled updates. Tunneling is enabled by lowering the FGM's supply to bring down VFG, increasing the voltage across the tunneling junction. Injection is enabled by raising the source of the injection transistor. This allows random-accessible bidirectional updates without the need for on-chip high-voltage switches or charge pumps. A 2-T V-I converter then provides a current output and a sigmoid update rule. The FGM consumes 0.5nA of bias current, and shows 8b programming accuracy and 46dB SNR at full scale. The absolute value circuit (ABS) in the AAE rectifies the bidirectional difference current between o and μ. Class-B operation and the virtual ground provided by amplifier A allow high-speed resolution of small current differences. The rectified currents are then fed into a translinear X²/Y circuit, which simulations indicate operates with more than an order of magnitude higher energy efficiency than its digital equivalent.
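
The X²/Y operation follows from the translinear principle (our restatement of a standard weak-inversion result, using the notation of Fig. 30.10.3): the gate-source voltages around the loop sum to zero, and since VGS is logarithmic in current, the loop-current products balance:

    IX · IX = IY · IZ  =>  IZ = IX²/IY

Setting IX = |o − μ| and IY ∝ σ² yields IZ ∝ (o − μ)²/σ², the per-dimension term of DMAH.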

In the belief construction phase, the DPU (Fig. 30.10.4) inverts the distance outputs from the 4 centroids to calculate similarities, and normalizes them to yield a valid probability distribution. The output belief states are sampled and then held for the rest of the cycle to allow parallel operation of all layers. The sampling switch reduces current sampling error due to charge injection: a diode-connected PMOS provides a reduced VGS to the switch NMOS to turn it on with minimal channel charge. The S/H achieves less than 0.7mV of charge injection error (2% current error) and less than 14μV of droop, with parasitic capacitance serving as the holding capacitor. In the classification phase, the IN circuits are reused together with the winner-take-all network (WTA) to classify the observation to the nearest centroid. A starvation trace (ST) circuit is implemented to address unfavorable initial conditions wherein some centroids are starved of nearby inputs and never updated. The ST provides starved centroids with a small but increasing additional current to force their occasional selection and pull them into more populated areas of the input space. The lower right of Fig. 30.10.4 shows the TC circuit, which performs current-to-pulse-width conversion using a VDD-referenced comparison. Proportional updates cause the mean and variance memories to converge to the sample mean and variance, respectively.
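
A plausible behavioral model of the classification phase with the starvation trace, in Python; the trace growth rate and the pulse-width gain K are illustrative assumptions, not measured values.

    import numpy as np

    def classify(d_euc, st, st_rate=0.01):
        """d_euc: 4 centroid distances; st: persistent starvation-trace state."""
        w = np.argmin(d_euc - st)   # the trace acts as a small, growing bonus current
        st += st_rate               # every centroid's trace keeps growing...
        st[w] = 0.0                 # ...until it wins, which resets its trace
        return w

    def error_to_pulse_width(err, K=1.0):
        """TC behavior: update pulse width proportional to the loaded error
        (PW = K*Err, Fig. 30.10.4); the sign selects tunneling vs. injection."""
        return K * np.abs(err), np.sign(err)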

The ADE is evaluated on a custom test board with data acquisition hardware connecting to a host PC. The waveforms in Fig. 30.10.5 show the measured beliefs, one from each layer. The sampling of beliefs proceeds from the top layer to the bottom to avoid delays due to output settling. The performance of the node is demonstrated by solving a clustering problem. The input data consist of 4 underlying clusters, each drawn from a Gaussian distribution with a different mean and variance. The node achieves accurate extraction of the cluster parameters (μ and σ²), and the ST ensures robust operation under unfavorable initial conditions.
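
The test stimulus can be reproduced in simulation with the cluster parameters annotated in Fig. 30.10.5 (the sample count is our assumption, suggested by the ~80k-sample convergence axis):

    import numpy as np

    rng = np.random.default_rng(0)
    means = np.array([[3.0, 3.0], [7.0, 7.0], [7.0, 3.0], [3.0, 7.0]])
    var = np.array([[1.0, 1.0], [1.5, 1.5], [1.0, 1.0], [0.6, 0.6]])
    labels = rng.integers(0, 4, size=80_000)                # which cluster each sample comes from
    data = rng.normal(means[labels], np.sqrt(var[labels]))  # 2-D Gaussian draws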

We demonstrate feature extraction for pattern recognition with the setup shown in Fig. 30.10.6. The input patterns are 16×16 symbol bitmaps corrupted by random pixel errors. An 8×4 moving window defines the pixels applied to the ADE's 32-D input. First, the ADE is trained unsupervised on example patterns. After the training converges, the 4 belief states from the top layer are used as rich features and classified with a neural network implemented in software, achieving a dimension reduction from 32 to 4. Recognition accuracies of 100% with corruption below 10%, and 95.4% with 20% corruption, are obtained, comparable to a software baseline, demonstrating robustness to the non-idealities of analog computation.
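
A sketch of the input conditioning described here, in Python; the unit-stride scan and the bit-flip corruption model are our assumptions:

    import numpy as np

    def corrupt(bitmap, p, rng):
        """Flip each pixel of a binary bitmap with probability p."""
        flips = rng.random(bitmap.shape) < p
        return np.logical_xor(bitmap.astype(bool), flips).astype(float)

    def windows_32d(bitmap, h=8, w=4):
        """Slide an 8x4 window over a 16x16 bitmap, yielding 32-D observations."""
        rows, cols = bitmap.shape
        for r in range(rows - h + 1):
            for c in range(cols - w + 1):
                yield bitmap[r:r+h, c:c+w].ravel()   # 8*4 = 32 pixels per observation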

The ADE was fabricated in a 0.13μm CMOS process with thick-oxide IO FETs. The die micrograph is shown in Fig. 30.10.7, together with a performance summary and a comparison with state-of-the-art bio-inspired parallel processors utilizing analog computation. We achieve 1TOPS/W peak energy efficiency in recognition mode. Compared to the state of the art, this work achieves very high energy efficiency in both modes. This, combined with the advantages of nonvolatile memory and unsupervised online trainability, makes it a general-purpose feature extraction engine ideal for autonomous sensory applications or as a building block for large-scale learning systems.

References:
[1] J. Park, et al., “A 646GOPS/W Multi-Classifier Many-Core Processor with Cortex-Like Architecture for Super-Resolution Recognition,” ISSCC Dig. Tech. Papers, pp. 168-169, Feb. 2013.
[2] J. Oh, et al., “A 57mW Embedded Mixed-Mode Neuro-Fuzzy Accelerator for Intelligent Multi-Core Processor,” ISSCC Dig. Tech. Papers, pp. 130-132, Feb. 2011.
[3] J.-Y. Kim, et al., “A 201.4GOPS 496mW Real-Time Multi-Object Recognition Processor With Bio-Inspired Neural Perception Engine,” ISSCC Dig. Tech. Papers, pp. 150-151, Feb. 2009.
[4] S. Chakrabartty and G. Cauwenberghs, “Sub-Microwatt Analog VLSI Trainable Pattern Classifier,” IEEE J. Solid-State Circuits, vol. 42, no. 5, pp. 1169-1179, May 2007.
[5] J. Lu, et al., “An Analog Online Clustering Circuit in 130nm CMOS,” IEEE Asian Solid-State Circuits Conference, Nov. 2013.
[6] S. Young, et al., “Hierarchical Spatiotemporal Feature Extraction using Recurrent Online Clustering,” Pattern Recognition Letters, Oct. 2013.


Figure 30.10.1: The analog deep learning engine (ADE) architecture.

Figure 30.10.2: The node architecture and its timing diagram showing power gating.

Figure 30.10.3: The reconfigurable current routing of the RAC, the schematics of the FGM and AAE, and measurement results.

Figure 30.10.4: The schematic of the DPU and its sub-blocks. The training control is shown on the lower right.

Figure 30.10.5: Measured output waveforms, clustering and ST test results. For clustering, 2-D results are shown for better visualization.

Figure 30.10.6: Pattern recognition test setup and results, demonstrating accuracy comparable to baseline software simulation.

[Annotations recovered from the figure graphics:]
Fig. 30.10.2 timing: a training cycle runs Belief Construction, Classification, Training Load, then Memory Write; intra-cycle power gating disables idle blocks, saving 22% in training mode and 37% in recognition mode.
Fig. 30.10.3 RAC modes: belief construction computes DMAH = Σ[(o−μ)²/σ²]; classification computes DEUC = Σ(o−μ)²/C (C a constant); training load computes the memory errors Err_μ = o−μ and Err_σ² = (o−μ)²/C − σ². Tunneling uses a 3V-to-1V supply drop, injection a 0-to-3V source pulse.
Fig. 30.10.4 TC and S/H: pulse widths PW_μ = K·Err_μ and PW_σ² = K·Err_σ² yield the updates μ(n+1) = μ(n) + α·Err_μ and σ²(n+1) = σ²(n) + β·Err_σ², so that μ → MEAN(o) and σ² → VAR(o)/C. The S/H uses wiring capacitance as CHOLD, shielded from noise; |Vt,P| > Vt,N due to the body effect.
Fig. 30.10.5 clustering test: four 2-D Gaussian clusters with μ = (3,3), (7,7), (7,3), (3,7) and σ² = (1,1), (1.5,1.5), (1,1), (0.6,0.6); the centroid means converge within ~80k training samples; training cycle 220μs. The ST test shows a starved centroid pulled onto the data, correcting the clustering.
Fig. 30.10.6 recognition test: 32-D raw windows reduce to 4-D rich features classified by a software neural network; 100% accuracy at 10% corruption and 95.4% at 20%, closely tracking the software baseline over 0-50% corruption.


Figure 30.10.7: Die micrograph, performance summary and comparison table.

[Die micrograph labels: RAC, DPU, TC; 7 nodes (four L1, two L2, one L3); die 0.9mm × 0.4mm.]

Performance summary:
  Technology:            1P8M 0.13μm CMOS
  Power supply:          3V
  Active area:           0.9mm × 0.4mm
  Memory:                Non-volatile floating gate, 46dB SNR
  Training algorithm:    Unsupervised online clustering
  Output feature:        Inverse-normalized Mahalanobis distances
  Input-referred noise:  56.23pArms
  System SNR:            45dB
  I/O type:              Analog current
  Operating frequency:   4.5kHz (training), 8.3kHz (recognition)
  Power consumption:     27μW (training), 11.4μW (recognition)
  Energy efficiency:     480GOPS/W (training), 1.04TOPS/W (recognition)

Comparison table:
                           This work               ISSCC'13 [1]        ISSCC'11 [2]             ISSCC'09 [3]
  Process                  0.13μm                  0.13μm              0.13μm                   0.13μm
  Purpose                  DML feature extraction  Object recognition  Neuro-fuzzy accelerator  Object recognition
  Non-volatile memory      Floating gate           N/A                 N/A                      N/A
  Power                    11.4μW                  260mW               57mW                     496mW
  Peak energy efficiency   1.04TOPS/W              646GOPS/W           655GOPS/W                290GOPS/W