Model and Optimized Fully Binary Neural Network Hardware ...ziyang.eecs.umich.edu/iesr/lectures/coussy15apr-present.pdfFully Binary Neural Network Model and Optimized Hardware Architectures

Fully Binary Neural Network Model and Optimized Hardware Architectures for Associative Memories

PHILIPPE COUSSY, CYRILLE CHAVET, HUGUES NONO WOUAFO, and LAURA CONDE-CANENCIA

Presented by: Stefany Escobedo, Joshua Kallus, and Alyssa Scheske

March 26, 2020

Introduction

● The goal is to develop associative memories based on neural networks which can store information and retrieve it in a similar manner as the human brain does

○ Robust against input noise

○ Constant retrieval time independent of the number of stored associations

Outline: Introduction, GBNN Model Overview, Proposed Optimizations, Experiments, Conclusion

GBNN Model

● Abstract neural network model

● Based on sparse clustered networks used to design associative memories


GBNN Model

● N binary neurons

● C equally-partitioned clusters

● L = N/C neurons per cluster

● Each cluster is associated, through one of its neurons, with a portion of an input message

● Message m of K bits

● X = K/C = log_2(L): length in bits of each cluster's sub-message

● Clique: set of activated neurons that are all connected to each other
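The relations between N, C, L, K and X can be checked with a short sketch. The function names and the bit ordering (cluster 0 takes the most significant bits) are assumptions made for illustration; the slides do not specify them.

```python
def gbnn_dimensions(n_neurons, n_clusters):
    """Return (L, X, K) for N neurons split into C equal clusters."""
    if n_neurons % n_clusters:
        raise ValueError("N must be divisible by C")
    L = n_neurons // n_clusters          # neurons per cluster
    X = L.bit_length() - 1               # X = log_2(L)
    if 1 << X != L:
        raise ValueError("L must be a power of two so that X is an integer")
    K = X * n_clusters                   # message length in bits
    return L, X, K


def split_message(message, n_clusters, x_bits):
    """Split a K-bit message into C sub-messages of X bits each.

    Assumption: cluster 0 receives the most significant X bits.
    Each sub-message is the index of the neuron to activate in its cluster.
    """
    mask = (1 << x_bits) - 1
    return [(message >> (x_bits * (n_clusters - 1 - i))) & mask
            for i in range(n_clusters)]


L, X, K = gbnn_dimensions(n_neurons=64, n_clusters=4)
neurons = split_message(0xA3F0, n_clusters=4, x_bits=X)
print(L, X, K, neurons)
```

With N = 64 and C = 4, each cluster holds L = 16 neurons, so each sub-message is X = 4 bits and a full message is K = 16 bits.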


GBNN Model

● Learns by memorizing that the set of neurons that constitute the input message are connected to each other and form a clique

● Retrieves by detecting which neuron is the most “stimulated”
  ○ Scoring step using Eq. 1
  ○ Winner Takes All (WTA) step using Eq. 2

[Eq. 1 (scoring) and Eq. 2 (WTA) appear as images on the original slide.]
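Since Eq. 1 and Eq. 2 are not reproduced in this text, the following is only a minimal sketch of clique learning followed by a score-then-WTA retrieval. It assumes a common formulation: a neuron's score is the number of active distant neurons connected to it plus a memory term `gamma` for its own previous value, and WTA keeps the neurons with the highest non-zero score in each cluster. The class name, `gamma`, and the handling of erased clusters (all neurons initially inactive) are assumptions, not the authors' definitions.

```python
import numpy as np


class GBNN:
    """Toy GBNN with C clusters of L binary neurons (illustrative only)."""

    def __init__(self, C, L, gamma=1):
        self.C, self.L, self.gamma = C, L, gamma
        # w[i, j, i2, j2] = 1 if neuron j of cluster i is connected
        # to neuron j2 of cluster i2.
        self.w = np.zeros((C, L, C, L), dtype=np.int32)

    def learn(self, neurons):
        """Store a clique; neurons[i] is the active neuron of cluster i."""
        for i, j in enumerate(neurons):
            for i2, j2 in enumerate(neurons):
                if i != i2:
                    self.w[i, j, i2, j2] = 1

    def retrieve(self, partial, iterations=4):
        """partial[i] is a neuron index, or None for an erased cluster."""
        v = np.zeros((self.C, self.L), dtype=np.int32)
        for i, j in enumerate(partial):
            if j is not None:
                v[i, j] = 1
        for _ in range(iterations):
            # Scoring: memory term plus count of connected active neurons.
            scores = self.gamma * v + np.einsum("ijkl,kl->ij", self.w, v)
            # WTA: keep, per cluster, the neurons with the top non-zero score.
            best = scores.max(axis=1, keepdims=True)
            v = ((scores == best) & (best > 0)).astype(np.int32)
        return [np.flatnonzero(row).tolist() for row in v]


net = GBNN(C=4, L=8)
net.learn([1, 2, 3, 4])
net.learn([1, 5, 6, 7])
recovered = net.retrieve([1, 2, 3, None])  # last cluster erased
print(recovered)
```

In this example the erased cluster recovers neuron 4, because it is the only neuron there connected to all three known neurons.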


GBNN Model: HW Architecture

● Fully parallel HW implementation

● Modules
  ○ Decoding
  ○ Learning (memory)
  ○ Computing

● Crossbar network dedicated to interchanges of neuron values between clusters


GBNN Model: HW Architecture

● Learning process
  ○ Cluster receives a K-bit binary word
  ○ Decoding module splits the word into C sub-words (one per cluster)
  ○ The local sub-word determines which neuron must be activated
    ■ The remaining sub-words determine which distant neurons must be connected to the locally activated neuron
  ○ Memory is updated with the selected weights to store the clique

● Retrieval process
  ○ Scoring step is processed
  ○ WTA step elects a neuron or group of neurons
  ○ Local neuron values are updated with the new information
  ○ The information is broadcast to all distant neurons


GBNN Model: Discussion

● Advantage
  ○ Strongly enhances the performance of associative memories compared to Hopfield networks

● Disadvantage
  ○ Complex hardware architectures whose area and timing performance do not scale well

● Further optimizations
  ○ Transform into a fully binary model to simplify scoring and remove the WTA step (area reduction)
  ○ Memorize only half of the synaptic weights to reduce the number of storage elements and the cost of the learning logic
  ○ Serialize communications (area reduction)

● Overall goal
  ○ Simplify the processing performed by the neurons
  ○ Optimize the hardware implementation
  ○ Keep the functionality and performance of the original model


Proposed Simplified Neural Network

The optimizations proposed include the following:

● Fully binary semantics vs arithmetical-integer semantics

● Reduced memory complexity

● Serialized communications


Fully Binary Semantics

Replacing all arithmetical-integer computations with logical equations makes it possible to remove the Winner Takes All (WTA) step while achieving the same performance as the enhanced GBNN model.


Unanimous Vote

● A neuron n_{i,j} is active in a given cluster if at least one active neuron in each other active cluster (the distant active neurons) indicates that it is connected with neuron n_{i,j}

● This changes how the decoding module works and enables removal of the WTA step. Values of neurons can be calculated with only logical equations now.
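The unanimous vote can be sketched as an AND over the other active clusters of an OR over their active neurons. The function names and array layout are assumptions, and so is the handling of corner cases (inactive clusters are skipped, and at least one distant active cluster is required); the slide states only the rule itself.

```python
import numpy as np


def store_clique(w, neurons):
    """Connect every pair of neurons of the clique (one neuron per cluster)."""
    for i, j in enumerate(neurons):
        for i2, j2 in enumerate(neurons):
            if i != i2:
                w[i, j, i2, j2] = True


def unanimous_vote_step(w, v):
    """One retrieval iteration using only logical operations.

    w: bool array (C, L, C, L) of synaptic weights.
    v: bool array (C, L) of current neuron values.
    """
    C, L = v.shape
    active = v.any(axis=1)               # which clusters have an active neuron
    new_v = np.zeros_like(v)
    for i in range(C):
        for j in range(L):
            # One vote per distant active cluster: OR over its active
            # neurons that are connected to n_{i,j}.
            votes = [bool((w[i, j, i2] & v[i2]).any())
                     for i2 in range(C) if i2 != i and active[i2]]
            # AND over all votes; no WTA and no integer score needed.
            new_v[i, j] = bool(votes) and all(votes)
    return new_v


C, L = 4, 8
w = np.zeros((C, L, C, L), dtype=bool)
store_clique(w, [1, 2, 3, 4])
store_clique(w, [1, 5, 6, 7])

v = np.zeros((C, L), dtype=bool)
for i, j in enumerate([1, 2, 3]):        # cluster 3 is erased
    v[i, j] = True

v_next = unanimous_vote_step(w, v)
active_neurons = [np.flatnonzero(row).tolist() for row in v_next]
print(active_neurons)
```

Neuron 7 of the erased cluster is connected to the active neuron of cluster 0 but not to those of clusters 1 and 2, so the vote is not unanimous and only neuron 4 is activated.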


Reduced Memory

Stored synaptic weights represent the connections between each neuron and the neurons in distant clusters. The original GBNN model calculates and stores redundant information, which can be optimized out to save space with no performance cost.
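One way to read this redundancy is that a connection between two neurons is stored once per direction. The sketch below assumes symmetric connections and keeps a single L x L block per unordered pair of clusters (a "triangular" layout); the class and method names are hypothetical.

```python
import numpy as np


class TriangularWeights:
    """Store each inter-cluster connection once (assumes symmetric weights)."""

    def __init__(self, C, L):
        self.C, self.L = C, L
        # One L x L binary block per unordered cluster pair (i < i2).
        self.blocks = {(i, i2): np.zeros((L, L), dtype=bool)
                       for i in range(C) for i2 in range(i + 1, C)}

    @staticmethod
    def _ordered(i, j, i2, j2):
        # Always address the block with the smaller cluster index first.
        return (i, j, i2, j2) if i < i2 else (i2, j2, i, j)

    def connect(self, i, j, i2, j2):
        if i == i2:
            raise ValueError("no connections inside a cluster")
        a, b, c, d = self._ordered(i, j, i2, j2)
        self.blocks[(a, c)][b, d] = True

    def connected(self, i, j, i2, j2):
        if i == i2:
            return False
        a, b, c, d = self._ordered(i, j, i2, j2)
        return bool(self.blocks[(a, c)][b, d])

    def storage_bits(self):
        return len(self.blocks) * self.L * self.L


tri = TriangularWeights(C=4, L=16)
tri.connect(2, 5, 0, 3)
full_bits = 4 * 3 * 16 * 16              # C(C-1)L^2 bits in the full layout
print(tri.storage_bits(), full_bits)
```

Under this assumption the triangular layout needs C(C-1)L²/2 storage elements instead of C(C-1)L², which matches the "memorize half of the synaptic weights" goal.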


Serialized Communications

In the fully parallel GBNN design, a very large number of wires and a large amount of logic are needed to connect every neuron to every other neuron. Serializing data transfers offers several benefits:

● Higher clock frequencies

● Significantly reduced area

● Lower power consumption


Serialization Implementation

Cluster Based:

● Clusters take turns to broadcast the values of all their neurons

● Takes C (# clusters) cycles to complete

Neuron Based:

● Clusters concurrently broadcast the value of one of their neurons

● Takes L = N/C cycles to complete
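The two schedules can be compared with a small sketch that lists which (cluster, neuron) values are broadcast in each cycle. The function names are hypothetical and the sketch models only the ordering, not the hardware.

```python
def cluster_based_schedule(C, L):
    """Cycle t: cluster t broadcasts the values of all L of its neurons."""
    return [[(t, j) for j in range(L)] for t in range(C)]


def neuron_based_schedule(C, L):
    """Cycle t: every cluster broadcasts the value of its neuron t."""
    return [[(i, t) for i in range(C)] for t in range(L)]


C, L = 4, 16
by_cluster = cluster_based_schedule(C, L)
by_neuron = neuron_based_schedule(C, L)

print(len(by_cluster), "cycles of", len(by_cluster[0]), "values")  # C cycles
print(len(by_neuron), "cycles of", len(by_neuron[0]), "values")    # L cycles
```

Both schedules transfer the same C x L values; the cluster-based one uses fewer cycles with wider transfers, while the neuron-based one uses more cycles with only C values in flight per cycle.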

Serialization: Hardware Implementation

● Steering logic for synaptic weights has large overhead in multiplexers

● Area cost of this design is high

Serialization: Hardware Implementation

● Flip-flop ring buffer logic

● Requires only one MUX instead of L-1 2:1 MUXes

● Can be used with either neuron-based or cluster-based serialization
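The ring buffer idea can be illustrated in software, under the assumption that the weights circulate through a ring of flip-flops so that the weight needed in the current cycle always sits at one fixed tap. The class below is a behavioral sketch only; the role given to the single MUX (choosing between the recirculated bit and a newly learned bit) is an assumption, not a detail taken from the slides.

```python
from collections import deque


class WeightRing:
    """Behavioral model of a ring of flip-flops holding synaptic weights."""

    def __init__(self, weights):
        self.ring = deque(weights)

    def step(self, load=None):
        """Advance one cycle and return the weight at the fixed tap.

        Assumption: a single 2:1 MUX selects what re-enters the ring,
        either the recirculated bit (load is None) or a new bit (learning).
        """
        out = self.ring.popleft()
        self.ring.append(out if load is None else load)
        return out


ring = WeightRing([1, 0, 0, 1])
reads = [ring.step() for _ in range(4)]   # one full turn, no selection tree
print(reads, list(ring.ring))
```

After one full turn every weight has passed the tap once and the ring is back in its initial state, which is why the read side needs no per-position selection logic in this model.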

Experiments

● Performance Analysis

● Complexity Analysis

● Hardware Synthesis Analysis
  ○ FPGA Target - Stratix IV FPGA Platform
  ○ ASIC Target - Altera HardCopy Platform


Architecture                                 Label
Original GBNN model                          V0
Fully binary model                           V1.0
Binary + triangular synaptic weight model    V1.1
Binary + cluster-based serialization         V1.2
Binary + neuron-based serialization          V1.3

Experiments

The performance of the proposed architectures matches that of the original GBNN architecture (the curves are superimposed).



Experiments

Controller resources decompose into decoding, memorizing, and computing tasks.

The fully binary model (V1.0) reduces the total area by 50% compared with V0.

V1.1 reduces architecture complexity to about ⅓ of V0 (70% area reduction).

V1.2 and V1.3 reduce architecture complexity to about ⅙ of V0 (83% area reduction).





[Figure legend: V0, V1.0, V1.1, V1.2]

Experiments - Area

Largest improvement: from the original V0 architecture to the triangular synaptic weight matrix architecture (V1.1), by 50% for all configurations.

Largest improvement: from the original V0 architecture to V1.2/V1.3, with 87% area savings.

Experiments

Average look-up table (LUT) area reductions range from 62% for V1.0 up to 86% for V1.2.

The larger the network, the more impactful the reductions!



Experiments - Clock Frequencies



Conclusion & Comments

● Full binary computation strongly reduces the cost of the computation module

● Memory reduction limits the cost of both the memory and the decoding modules

● Serialization optimizes the computation and the decoding modules

● Future work:
  ○ Further optimize architectures for timing performance (not just area)

