A Custom 7nm CMOS Standard Cell Library for Implementing TNN-based Neuromorphic Processors

Harideep Nair, Prabhu Vellaisamy, Santha Bhasuthkar, and John Paul Shen

Neuromorphic Computer Architecture Lab, ECE Department, Carnegie Mellon University Email: [email protected] Website: https://www.ncal.sv.cmu.edu

Abstract— A set of highly optimized custom macro extensions is developed for a 7nm CMOS cell library for implementing Temporal Neural Networks (TNNs) that can mimic brain-like sensory processing with extreme energy efficiency. A TNN prototype (13,750 neurons and 315,000 synapses) for MNIST requires only 1.56 mm2 of die area and consumes only 1.69 mW.

I. INTRODUCTION

Temporal Neural Networks (TNNs) are a special class of spiking neural networks (SNNs) that use precise spike timings to represent and process information, mimicking biological neural networks [1]. The key building blocks of TNNs are multi-neuron columns and multi-column layers, as shown in Fig. 1 [2]. A pxq TNN column consists of q excitatory neurons, each with p synapses as inputs, and is capable of Spike Timing Dependent Plasticity (STDP) learning and Winner-Take-All (WTA) inhibition. Unlike deep artificial neural networks (ANNs), which rely on high-dimensional tensor processing with compute-intensive backpropagation, TNNs employ neither complex linear algebraic computation nor backpropagation. Hence, TNNs are truly neuromorphic and capable of brain-like sensory processing with brain-like extreme energy efficiency. Direct CMOS implementation of TNNs was first proposed in [2]; this is a largely unexplored area and is the focus of this paper.

This paper builds on the work in [2], which demonstrates the feasibility of implementing TNNs in CMOS using a 45nm standard cell library with regular standard cells. This paper differs from [2] in two key contributions: 1) we replicate the TNN design process in 7nm CMOS using the ASAP7 PDK [3] and take it a step further through post-layout Power-Performance-Area (PPA) analysis, and 2) we propose new highly optimized custom macro extensions to ASAP7 that can be used to implement highly energy-efficient TNNs. A set of 11 macros is proposed: syn_weight_update, syn_output, pac_adder, less_equal, pulse2edge, stdp_case_gen, stabilize_func, incdec, mux2to1gdi, edge2pulse and spike_gen. We demonstrate their utility by implementing the key TNN building blocks of multi-neuron columns. We also assess the hardware complexity of a multi-column and multi-layer TNN prototype from [2], which can achieve 93% (with potential for 98%) accuracy on MNIST.

II. TNN STANDARD CELL MACROS

A. Framework and Methodology

ASAP7 [3] is a 7nm finFET-based predictive, foundry-agnostic process design kit (PDK) containing a standard cell library and a complete physical verification suite: design rule checks (DRC), layout vs. schematic (LVS), and parasitic extraction (PEX) rule sets. The transistor models' electrical behavior is scaled from the BSIM-CMG SPICE models [4] that capture advanced finFET industry trends. ASAP7 provides transistor device models at four threshold voltage levels (SLVT, LVT, RVT and SRAM) and three process corners, namely, typical-typical (TT), slow-slow (SS) and fast-fast (FF).

In this work, we use the following choices for standard cell development: RVT device models at the TT process corner, with nominal operating conditions of 0.7V supply voltage and 25°C temperature. Liberty (LIB) files are developed using composite current source (CCS) delay modeling. The Cadence toolchain is used throughout: Genus for register-transfer level (RTL) synthesis, Virtuoso for schematics and layouts, Liberate for characterization of the macro cells, and Abstract to build Library Exchange Format (LEF) files. Hard macros are constructed with aggressive PPA optimizations to implement the fundamental functional modules needed for designing TNNs.

B. Optimization Strategies

Since TNNs are envisioned as online, always-on, edge-native sensory processing units operating at real-time kHz frequencies, PPA optimization prioritizes power consumption first, followed by area and delay. We use Gate Diffusion Input (GDI)-based designs for maximal area and power optimization, since they provide fast, low-power circuits with a reduced number of transistors [5]. The tradeoff, however, is degraded output voltage levels, which are corrected by placing level restorers at the outputs. Diffusion sharing is used consistently across all macros to reduce cell area.

C. Cell Macros

Standard cell macros are designed targeting the following TNN components proposed in [2]: excitatory SRM0 neurons with a ramp-no-leak (RNL) response function, WTA inhibition, and STDP learning. TNN designs use two clocks: a unit clock (aclk) for temporal encoding and a gamma clock (gclk) for separating and synchronizing different computational waves. The rest of the sections follow the work in [2] closely.

The neuron has two main components, leading to the following three macros: 1) the synapse, implemented as an FSM that stores and updates weights (Fig. 2, syn_weight_update macro) and converts an 8-cycle-wide input spike into a thermometer-coded RNL response function (Fig. 3, syn_output macro); and 2) the neuron body, implemented as a parallel accumulative counter that accumulates the response functions from all synapses into a body potential and outputs a spike when the body potential crosses a given threshold (pac_adder).

Note that Genus synthesizes the adder modules for the parallel accumulative counter with ASAP7 majority cells. Architecturally, the use of ripple-carry adder chain propagation provides noticeable optimization. A single-bit adder (Fig. 4), built from ASAP7 full adder and inverter cells, is used to infer pac_adder.
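For clarity, the following is a minimal behavioral sketch of this neuron datapath (in Python, purely illustrative and not the RTL of this work); the gamma period of 8 unit cycles matches the 8-cycle-wide spikes above, while the function names and example values are assumptions.

```python
# Behavioral sketch (illustrative, not the actual RTL): one SRM0 neuron with
# ramp-no-leak (RNL) synaptic responses, evaluated over one gamma cycle.
# Spike times are encoded as the unit-clock (aclk) cycle at which a spike arrives.

GAMMA_CYCLES = 8  # assumed gamma period, matching the 8-cycle-wide spikes in the text

def rnl_response(spike_time, weight, t):
    """Ramp-no-leak response: after the input spike arrives, the contribution
    ramps up by 1 per cycle until it saturates at the synaptic weight."""
    if spike_time is None or t < spike_time:
        return 0
    return min(t - spike_time, weight)

def srm0_neuron(spike_times, weights, threshold):
    """Accumulate all synaptic responses into a body potential each cycle
    (the role of pac_adder) and return the cycle at which the potential
    first crosses the threshold; None means no output spike."""
    for t in range(GAMMA_CYCLES):
        body_potential = sum(rnl_response(s, w, t)
                             for s, w in zip(spike_times, weights))
        if body_potential >= threshold:
            return t  # earlier output spike <=> stronger response
    return None

# Example: 4 synapses; earlier/heavier inputs drive an earlier output spike.
print(srm0_neuron(spike_times=[0, 1, 3, None], weights=[4, 3, 2, 5], threshold=6))
```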

WTA inhibition allows the output spike from the first spiking neuron of a column to pass through intact, while nullifying the other neurons' outputs. Ties are broken by systematically choosing the lowest neuron index. A less_equal macro (Fig. 5) based on pass transistors is designed to perform this inhibition. Furthermore, pulse2edge macros (Fig. 6, Fig. 7) are developed to convert the pulses representing spikes into edges, i.e., signals that stay asserted until a gamma reset. We construct two variants: a power-optimized version (with an asynchronous active-high reset register) and an area-optimized version (with a synchronous active-low reset register).
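For reference, the intended 1-WTA behavior (not the pass-transistor circuit itself) can be summarized by the sketch below; the spike-time encoding and function name are assumptions.

```python
# Behavioral sketch of 1-WTA inhibition with lowest-index tie-breaking
# (illustrative; the hardware uses the pass-transistor less_equal macro).

def wta_inhibit(spike_times):
    """spike_times[i] is the output spike time of neuron i within a gamma
    cycle, or None if it did not spike. Only the earliest spike survives;
    ties go to the lowest neuron index."""
    spiking = [(t, i) for i, t in enumerate(spike_times) if t is not None]
    if not spiking:
        return [None] * len(spike_times)          # no neuron spiked
    _, winner = min(spiking)                      # earliest time, then lowest index
    return [t if i == winner else None for i, t in enumerate(spike_times)]

# Example: neurons 1 and 2 tie at time 3; neuron 1 (lower index) wins.
print(wta_inhibit([5, 3, 3, None]))   # -> [None, 3, None, None]
```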

STDP learning rules, as defined in [2], update weights locally according to the timing relationships between input and output spikes, stochastically based on Bernoulli random variables (BRVs). The STDP implementation leads to three new macros: 1) the stdp_case_gen macro (Fig. 8), which generates control signals corresponding to four different cases of input-output spike timing relationships; 2) the stabilize_func macro (Fig. 9), which selects the appropriate BRV based on the synaptic weight, as per the stabilization function in [2], to stabilize weight convergence; and 3) the incdec macro (Fig. 10), which generates the increment/decrement control signals that drive the weight update. Note that stabilize_func replicates the functionality of an 8-to-1 multiplexer. To optimize the design, a 2-to-1 GDI-based multiplexer (Fig. 11, mux2to1gdi macro) is developed; seven such multiplexers are combined into a combinational block that functions as an 8-to-1 multiplexer.
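The exact STDP rules and stabilization function are specified in [2] and not reproduced here; the sketch below only illustrates the overall flow implied by the three macros (case generation, BRV gating, increment/decrement), with the case definitions, actions, and probability being placeholders.

```python
# Structural sketch of the STDP weight-update flow: classify the spike timing
# case (stdp_case_gen), gate the update with a Bernoulli random variable whose
# probability would come from the stabilization function in [2] (stabilize_func),
# then apply a +/-1 step (incdec). All specific cases/values are placeholders.
import random

W_MAX = 7  # assumed 3-bit weight range

def stdp_case(in_spike, out_spike):
    """Classify the input/output spike timing relationship; the real case
    definitions come from the STDP rules in [2]."""
    if in_spike is not None and out_spike is not None:
        return "capture" if in_spike <= out_spike else "backoff"
    if in_spike is not None:
        return "search"
    return "none"

# Placeholder action per case (the actual increments/decrements differ in [2]).
ACTIONS = {"capture": +1, "backoff": -1, "search": +1, "none": 0}

def weight_update(weight, in_spike, out_spike, p_update=0.5):
    """Apply one stochastically gated +/-1 weight step within [0, W_MAX]."""
    step = ACTIONS[stdp_case(in_spike, out_spike)]
    if step == 0 or random.random() >= p_update:
        return weight                    # no update this gamma cycle
    return max(0, min(weight + step, W_MAX))
```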

We also develop two more macros: 1) the spike_gen macro (Fig. 12), which generates the 8-cycle-wide pulses representing spikes required by syn_output, and 2) the edge2pulse macro (Fig. 13), which generates reset pulses (grst) from gclk to perform the essential computational reset between consecutive computation cycles.
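A behavioral view of these two helpers, again as an illustrative sketch with assumed signal representations rather than the actual macros, is:

```python
# Behavioral sketch of the two pulse-shaping helpers, modeled on sampled
# waveforms (lists of 0/1 per aclk cycle); widths and names are assumptions.

def spike_gen(start_times, length, width=8):
    """Expand each spike start time into an 8-cycle-wide pulse (spike_gen)."""
    wave = [0] * length
    for t in start_times:
        for k in range(t, min(t + width, length)):
            wave[k] = 1
    return wave

def edge2pulse(gclk_wave):
    """Emit a single-cycle reset pulse (grst) on each rising edge of gclk."""
    prev, pulses = 0, []
    for level in gclk_wave:
        pulses.append(1 if level == 1 and prev == 0 else 0)
        prev = level
    return pulses

print(spike_gen([2], length=12))        # 8-cycle pulse starting at cycle 2
print(edge2pulse([0, 0, 1, 1, 0, 1]))   # -> [0, 0, 1, 0, 0, 1]
```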

III. BENCHMARKING AND RESULTS

In this section, we provide 7nm post-layout PPA results generated using the Cadence toolchain to demonstrate the efficacy of the proposed standard cell macros. Using these results, we perform three types of benchmarking analysis: 1) compare the layouts of two custom macros, namely less_equal and mux2to1gdi, with the corresponding ASAP7 standard cell-based modules to demonstrate the efficiency of our approach; 2) build three benchmark TNN columns of sizes 64x8, 128x10 and 1024x16, as in [2], using both the proposed custom macros and ASAP7 standard cells, and compare their PPA metrics with the corresponding 45nm values in Table IV of [2]; and 3) assess PPA metrics for the 2-layer prototype TNN in [2] using synaptic scaling, and evaluate the improvement in PPA and Energy-Delay Product (EDP) obtained with our custom macros.

A. Layout Comparison of Two Macros

As can be seen from the layouts in Fig. 14 and Fig. 15, the standard cell-based less_equal module is significantly more complex than the custom less_equal macro implementing the same functionality. Fig. 16 and Fig. 17 show the layouts for a 2-to-1 mux designed using ASAP7 standard cells (12 transistors) and using the custom GDI-based macro (2 transistors). Fig. 18 shows the stabilize_func logic built from seven custom mux2to1gdi cells, which incurs complexity comparable to that of a single standard cell-based mux.

B. PPA Analysis for Benchmark TNN Columns

Table I shows the power, computation time, and area values for the benchmark columns in both the standard cell-based and custom cases. The custom designs consume about 45% less power and 35% less area than the standard cell-based designs, and are about 20% faster. Compared to the 45nm values in Table IV of [2], we see close to two orders of magnitude improvement in power and over 20x improvement in area. The largest custom column (1024x16) in 7nm consumes merely 0.08 mm2 of area, 74 uW of power and 29.5 ns of computation time, compared to the corresponding 45nm values of 1.65 mm2, 7.96 mW and 42.3 ns, respectively.

C. PPA Analysis for TNN Prototype

Table II shows a 35% improvement in area and power, and a 20% improvement in computation time, for the custom macro-based TNN prototype (Fig. 19) relative to the standard cell-based one; EDP reduces by about 58%. Compared to the 45nm values in Table VI of [2], power consumption for the standard cell-based prototype is almost 60x lower, whereas area and computation time reduce by almost 14x and 2x, respectively. This leads to almost two orders of magnitude difference in power between the 45nm and custom 7nm designs. As shown in Table II, the 2-layer TNN, providing 93% MNIST accuracy, consumes only 1.69 mW of power and 1.56 mm2 of die area, and can process each image in 19 ns.
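As a consistency check, the EDP column of Table II follows directly from the other two columns (energy x delay = power x time squared); the snippet below merely reproduces the reported values and is not part of the design flow.

```python
# Sanity check: EDP = energy x delay = power x time^2, using the Table II values.
# 1 nJ-ns = 1e-18 J-s.
for name, power_mw, time_ns in [("standard cell-based", 2.54, 24.14),
                                ("custom macro-based", 1.69, 19.15)]:
    edp_nj_ns = (power_mw * 1e-3) * (time_ns * 1e-9) ** 2 / 1e-18
    print(f"{name}: EDP = {edp_nj_ns:.2f} nJ-ns")
# Prints ~1.48 and ~0.62 nJ-ns, matching Table II; their ratio gives the ~58% reduction.
```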

IV. CONCLUDING REMARKS

Prior works [1], [2] have shown that TNNs can achieve brain-like sensory processing with brain-like efficiency in complexity and power. This paper demonstrates that a 2-layer TNN prototype with 13,750 neurons and 315,000 synapses can be implemented in 7nm CMOS. Using the customized standard cell macros developed in this paper, significant (~2x) power and area reductions can be achieved. We believe brain-like, edge-native sensory processing units can be effectively implemented in standard digital CMOS, consuming only a few mW of power and a few mm2 of die area.

REFERENCES

[1] J. E. Smith, "Space-time algebra: A model for neocortical computation," 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), IEEE, 2018.

[2] H. Nair, J. P. Shen and J. E. Smith, "Direct CMOS implementation of neuromorphic temporal neural networks for sensory processing," under review for an IEEE conference. (arXiv submission #3342901).

[3] L. T. Clark, V. Vashishtha, L. Shifren, A. Gujja, S. Sinha, B. Cline, C. Ramamurthy and G. Yeric, "ASAP7: A 7-nm finFET predictive process design kit." Microelectronics Journal 53 (2016): 105-115.

[4] J. P. Duarte, S. Khandelwal, A. Medury, C. Hu, P. Kushwaha, H. Agarwal, A. Dasgupta, and Y. S. Chauhan. "BSIM-CMG: Standard FinFET compact model for advanced circuit design." 41st European Solid-State Circuits Conference (ESSCIRC), pp. 196-201. IEEE, 2015.

[5] A. Morgenshtein, A. Fish, and A. Wagner, "Gate-diffusion input (GDI) - a novel power efficient method for digital circuits: a design methodology," Proceedings 14th Annual IEEE International ASIC/SOC Conference, IEEE, 2001.

Fig. 1. Temporal Neural Network Organization
Fig. 2. syn_weight_update macro
Fig. 3. syn_output macro
Fig. 4. Single-bit adder for pac_adder macro
Fig. 5. less_equal macro (ports: DATA_IN, INHIBIT, OUT)
Fig. 6. pulse2edge macro – power optimized
Fig. 7. pulse2edge macro – area optimized
Fig. 8. stdp_case_gen macro
Fig. 9. stabilize_func macro
Fig. 10. incdec macro
Fig. 11. mux2to1gdi macro (inputs P, N, G; output OUT)
Fig. 12. spike_gen macro
Fig. 13. edge2pulse macro
Fig. 14. Layout for ASAP7 standard cell-based less_equal module
Fig. 15. Layout for custom less_equal macro
Fig. 16. Layout for ASAP7 standard cell-based 2-to-1 mux
Fig. 17. Layout for custom mux2to1gdi cell
Fig. 18. Layout for custom stabilize_func macro using mux2to1gdi cells
Fig. 19. 2-layer TNN prototype in [2] with 625 32x12 columns in the first layer and 625 12x10 columns in the second layer (Complexity: 32M gates, 128M transistors)

Table I. Standard vs Custom PPA comparison in 7nm CMOS for three benchmark column sizes, namely 64x8, 128x10 and 1024x16, as in [2]

                      Column Size    Power     Computation    Area
                      (p x q)        (uW)      Time (ns)      (mm2)
Standard Cell-Based   64x8           3.89      26.92          0.004
                      128x10         10.27     28.52          0.009
                      1024x16        131.46    36.52          0.124
Custom Macro-Based    64x8           2.73      20.59          0.003
                      128x10         5.76      22.79          0.006
                      1024x16        73.73     29.49          0.079

Table II. Standard vs Custom PPA comparison in 7nm CMOS for the multilayer prototype TNN in [2]

                      Power     Computation    Cell Area    EDP
                      (mW)      Time (ns)      (mm2)        (nJ-ns)
Standard Cell-Based   2.54      24.14          2.36         1.48
Custom Macro-Based    1.69      19.15          1.56         0.62