Upload
nguyendat
View
220
Download
2
Embed Size (px)
Citation preview
Nanoelectronic Neurocomputing: Status and Prospects
L. Ceze,1 J. Hasler,2 K. K. Likharev,3 J.-s. Seo,4 T. Sherwood,5 D. Strukov5*, Y. Xie,5 and S. Yu4
1 University of Washington, Seattle WA 98195-2350, U.S.A., 2 Georgia Institute of Technology, Atlanta GA 30332-
0250, U.S.A., 3 Stony Brook University, Stony Brook, NY 11794-3800, U.S.A., 4 Arizona State University, Tempe,
AZ 85281-9309, U.S.A., 5 UC Santa Barbara, Santa Barbara, CA 93106-9560, U.S.A.
Email: *[email protected]
Potential advantages of specialized hardware for neuromorphic computing had been recognized several decades
ago (see, e.g., Refs. [1, 2]), but the need for it became especially acute recently, due to significant advances of the
computational neuroscience and machine learning. The most vivid example is given by the deep learning in
convolution neuromorphic networks [3]: the recent dramatic progress of this technology, with it's rapid extension to
several important applications, was enabled by the use of modern GPU clusters [4, 5]. Even higher performance and
lower power consumption has been recently demonstrated using FPGAs [5-7] and custom digital circuits [5, 8].
However, the energy efficiency of these fully digital implementations of the convolutional networks and other
neuromorphic systems (see, e.g., Ref. 9) is still well below that of their biological prototypes [2, 10], even if the
most advanced CMOS technology is used. The main reason for this efficiency gap is that the use of digital operations,
with their high precision, for mimicking high-noise biological neural networks is inherently unnatural. Theoretical
estimates show that this gap may be closed by using purely analog and mixed-signal neuromorphic networks [1, 2,
10]. For instance, analog crossbar circuits may perform a key operation, the vector-by-matrix multiplication, by
utilizing the fundamental Ohm and Kirchhoff laws. In addition, nanoelectronic implementation of such analog and
mixed-signal neuromorphic circuits may allow them to approach the areal density of their biological prototypes, far
outperforming them (and their digital counterparts) in speed performance, all at manageable power consumption [2,
10] - see Fig. 1.
The key analog component of the circuits is a compact electron device with adjustable conductance - essentially
an analog nonvolatile memory cell - typically used at each crosspoint of a crossbar array, and mimicking the biological
synapse. Up until recently, such devices were implemented mostly as “synaptic transistors” [2, 11], which may be
fabricated using the standard CMOS technology. This approach has resulted in several sophisticated, efficient systems
- see, e.g., Fig. 2 [11]. However, these devices have relatively large areas, leading to higher interconnect capacitance
and hence larger time delays. Figures 3 and 4 illustrate two promising candidates for addressing this challenge: two-
terminal nonvolative resistive devices (“memristors”) [12, 13], and compact floating-gate memory cells, redesigned
from commercial NOR flash memory to allow their high-precision analog tuning [14, 15]. The advantage of the
floating-gate memory devices is their mature, CMOS-compatible fabrication technology. On the other hand, the tuning
of memristors is based on reversible displacements of just a few atoms, so that even the best technologies of their
fabrication developed by now do not yet provide the device variability low enough for VLSI circuits. However, the
memristors may have inherently lower relative area than the floating-gate cells, and may be scaled down below 10 nm
without sacrificing their endurance, retention, and tuning accuracy [12]. Moreover, these devices are naturally suitable
for 3D integration (Fig. 5) [12, 13]. (Certain recent developments in deep learning algorithms give a hope for using
novel 3D NAND technologies for neurocomputing as well.)
The further progress of nanoelectronic neurocomputing circuits may lead not only to high-performance
convolutional networks, but eventually to other intelligent, high-speed, low-power information processing systems for
emerging applications. (These applications notably include flexible, high-speed platforms for testing large-scale
models of biological neural systems including cerebral cortex.) The important research directions toward these goals
include: (i) the development of hardware-friendly neuromorphic network algorithms to enable the use of advanced
nanoelectronic devices without compromising system’s functional performance; (ii) the device optimization for
particular neuromorphic applications; (iii) the progress toward understanding of the optimal partitioning between the
digital and analog domains in mixed-signal systems (such as shown in Fig. 2); (iv) the development of block-based
architectures for general purpose neuromorphic systems; and (v) the development of suitable models for programming
of such systems (see, e.g., Fig. 6).
[1] C. Mead, Analog VLSI and neural systems, 1989.
[2] J. Hasler et al., Front. Neurosci., vol. 7, p. 119, 2013.
[3] Y. LeCun et al., Nature, vol. 521, p. 436, 2015.
[4] A. Krizhevsky et al., NIPS, Lake Tahoe, CA, 2012, p.1097.
[5] C. Farabet et al., Machine learning on very large data sets,
ed. by R. Bekkerman et al., 2011, p. 399.
[6] A. Putnam et al., ISCA, Minneapolis, MN, 2014, p. 13.
[7] T. Moreau et al., HPCA, Burlingame, CA, 2015, p. 603.
[8] T. Chen et al., ASPLOS, Salt Lake City, UT, 2014, p. 269.
[9] P. Merolla et al., Science, vol. 345, p. 668, 2014.
[10] K. Likharev, Sci. Advanced. Mat., vol. 3, p. 322, 2011.
[11] S. George et al., IEEE Trans. VLSI, 2016 (accepted).
[12] J. Yang et al., Nature Nano, vol. 8, p. 13, 2013.
[13] S. Yu et al., ACS Nano, vol. 7, p. 2320, 2013.
[14] F. Merrikh Bayat et al., IMW, Monterey, CA, 2015, p.1.
[15] F. Merrikh Bayat et al., DRC, Newark, DE, 2016.
Fig. 1. Energy efficiency and performance comparison for a 64×64-pixel deep learning neural network image classification. The
estimates for the state-of-the-art digital and emerging analog nanodevice implementations are based on the data from Refs. [5] and
[12, 14], respectively.
Digital AnalogCPU 2.66 GHz GPU 1 GHz FPGA 200 MHz ASIC 400 MHz NOR 180 nm NOR 55 nm Memristors 200 nm 3D memristors 10 nm
Time (s) ~810-3 ~310-4 ~1.510-4 ~510-5 ~210-6 ~710-7 ~510-8 ~110-8
Power (W) ~30 to 40 ~40 ~10 ~3 ~1 ~1 ~1 ~0.1Energy (J) ~310-1 ~110-2 ~110-3 ~110-4 ~210-6 ~710-7 ~510-8 ~110-9
Fig. 3. Memristor circuits for neurocomputing
[14]: (a) 12×12 crossbar with integrated metal-
oxide memristors, (b) I-V switching curve of a
memristor and the device stack. (c) Pattern
classification results using a single layer
perceptron (bottom inset) for 3×3 binary patterns
(top inset).
-2.0-1.5-1.0-0.5 0.0 0.5 1.0 1.5 2.0
-600
-500
-400
-300
-200
-100
0
100
200
300
Cu
rre
nt (
A)
Voltage (V)
Reset
SetForm
SiO2/Si
Pt (60 nm)
TiO2-x (30 nm)
Ti (15 nm)
Al2O3 (4 nm)
Ta (5 nm)
Pt (60 nm)
Voltage (V)-2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0
300
200
100
0
-100
-200
-300
-400
-500
-600
Cu
rren
t (μ
A)
(b)
2 μm
(a)
0 5 10 15 20 25 30 35
02468
1012141618202224262830
# m
iscl
assi
fied
pat
tern
s
Epoch
1
2
1
2
3
pre-synaptic neurons
post-synaptic neurons
w10,3
w1,1
10
z
n
v
(c)
Fig. 4. Gate-coupled vector-by-
matrix multiplier implemented with
180-nm embedded NOR flash
memory redesigned for analog
computing applications [14, 15]: (a)
Layout, (b) equivalent circuit, and
(c) experimental results.
-1 -0.5 0 0.5 1 -0.1
-0.05
0
0.05
0.1
Ibias = 500 nA
Vdd = 1V
w1+ ≈ 10-5
Iout+ - Iout
- = (w1+ - w1
-)(Iin+ - Iin
- )
I out+
-I o
ut-
(µA
)
Iin+ - Iin
- (µA)
(c)So
urc
e
Gate and drain
Gate and drain
Peripheral devices
High voltage
pass gates
(a) (b)
supercell
Fig. 5. 3D memristor circuits: (a)
circuit, (b) possible memristor
structure, and (c) TEM cross-section
of side-wall integrated HfOx devices
[13]. (d) General idea of CMOS/
memristor hybrids [10, 12, 14], and (e)
top view of a TiOx memristors back-
integrated on 0.25-μm CMOS circuit.
SiO2
Pt
5nm
HfOx
SiO2
TiN
5 nm
(a)
(b)
add-on
CMOS
stack
(d) (e)
(c)
Fig. 2. RASP 3.0 field programmable
analog array [11]: (a) general
architecture, and (b) chip layout in 350-
nm CMOS process. The chip consists
of an embedded microprocessor, a
static random access memory, digital
and analog converters, circuitry to
program nonvolatile memory, and
digital/analog configurable blocks
(denoted with D/A labels).
(a) (b)
Fig. 6. FPGA-based neural network accelerator for approximate programs [7]: (a) Top level architecture, (b) a fragment of code
written in approximation-aware programming language, which filters pixels in the image. (c) Performance speed-up and (d) energy
savings evaluated on the set of studied benchmarks.
(a) (b)
(c)
(d)