Soft Computing Lecture 23 Hardware neural networks


Page 1: Soft Computing Lecture 23 Hardware neural networks

Soft Computing

Lecture 23

Hardware neural networks

Page 2: Soft Computing Lecture 23 Hardware neural networks


Kinds of hardware support for neural networks

• Use of parallel general-purpose hardware

• Use of signal processors

• Development of special VLSI-based hardware for implementing neural networks

Page 3: Soft Computing Lecture 23 Hardware neural networks


Technologies for implementing neural networks in VLSI

Kind of elements | Advantages | Disadvantages
Analog optical | Possibility of massive interconnection | A complete technology for optical computing does not yet exist
Analog electrical | Simple concepts; fast data processing | Demanding fabrication requirements; sensitivity to defects and external influences; low computational accuracy; massive interconnection is difficult to implement
Digital electrical | Mature, complete technology; high computational accuracy; robustness to process variations | Complex circuit design; basic operations take multiple clock cycles; massive interconnection is difficult to implement
Hybrid | Analog acceleration of basic operations with digital interfaces to external devices; possibility of optical interconnection | Further technological development is needed

Page 4: Soft Computing Lecture 23 Hardware neural networks


Some early neural network chips in VLSI

Name (producer) | Kind of elements | Neurons / synapses | Multiply-accumulate operations per second
Silicon Retina (Synaptics) | Analog | 48 x 48 | ?
ETANN (Intel) | Analog | 64 / 10^4 | 2·10^9
N64000 (Inova) | Digital | 64 / 10^5 | 9·10^8
MA-16 (Siemens) | Digital | 16 / 256 | 4·10^8
RN-200 (Ricoh) | Hybrid | 16 / 256 | 3·10^9
NeuroClassifier (Mesa Research Institute) | Hybrid | 7 / 426 | 2·10^10

Page 5: Soft Computing Lecture 23 Hardware neural networks


Page 6: Soft Computing Lecture 23 Hardware neural networks


Some neurocomputers

Name (producer) | Type | Processing elements | Multiply-accumulate operations per second
CNAPS/PC (Adaptive Solutions) | PC accelerator card | 2 CNAPS-1016 processors (128 neurons) | 2.5·10^9
CNAPS (Adaptive Solutions) | Neurocomputer | 8 CNAPS-1016 processors (512 neurons) | 10^10
SYNAPSE-1 (Siemens) | Neurocomputer | 8 MA-16 processors (512 neurons) | 3·10^9

Page 7: Soft Computing Lecture 23 Hardware neural networks


Page 8: Soft Computing Lecture 23 Hardware neural networks


Examples of Neurocomputers

• Three neurocomputer systems are listed:

• The Adaptive Solutions CNAPS uses the Inova N64000 chip on VME boards in a custom cabinet run from a UNIX host. Boards come with 1 to 4 chips, and two boards can process the same network, giving a total of 512 PEs. The software includes a C-language library, assembler, compiler, and a package of NN algorithms.

• Similarly, the HNC SNAP Neurocomputer typically includes 2 VME boards, each with four HNC 100 NAP chips, providing 32 PEs in total. The boards are controlled from a PC by the HNC Balboa accelerator card.

• The Siemens SYNAPSE-1 uses a systolic array of 8 MA-16 chips in a custom cabinet with a UNIX host.

Page 9: Soft Computing Lecture 23 Hardware neural networks


Slice architecture

• Following the bit-slice concept of conventional digital processors, neural network slice chips provide building blocks for constructing networks of arbitrary size and precision. Such chips typically cost only about $50/chip, perform at moderate speeds, and have no on-chip learning.

• The Micro Devices MD1220 was probably the first commercial neural network chip. Each chip has eight neurons with hard-limit thresholds and eight 16-bit synapses with 1-bit inputs. With bit-serial multipliers in the synapses, the chip provides about 9 MCPS. Bigger networks and networks with higher-precision inputs can be constructed from multiple chips. A 16-bit accumulator limits the total number of inputs because of overflow.

• A similar chip is the Neuralogix NLX-420 Neural Processor Slice [9], which has 16 processing elements (PEs). A common 16-bit input is multiplied by a weight in each PE in parallel; new weights are read from off-chip. The 16-bit weights and inputs can be user-selected as 16 1-bit, 4 4-bit, 2 8-bit, or 1 16-bit value(s). The 16 neuron sums are multiplexed through a user-defined piecewise continuous threshold function to produce a 16-bit output (see the sketch after this list). Internal feedback allows for multi-layer networks, and multiple chips can build large networks.

• The Philips Lneuro 1.0 chip, which is designed to be easily interfaced to Transputers, also has 16-bit processing in which the neuron values can be interpreted as 8 2-bit, 4 4-bit, etc., sub-values. Unlike the NLX-420, there is a sizeable (1 kByte) on-chip cache to hold the weights. The transfer function is done off-chip, which allows multiple chips to provide synapse-input products to the neurons and thus build very large networks.
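
To make the slice-PE operation concrete, here is a minimal C sketch of the 4 x 4-bit mode described above: a common 16-bit input word is treated as four 4-bit sub-values, each PE multiplies them by its own packed sub-weights and accumulates, and the neuron sums pass through a user-defined threshold. The function names, the simple clamp used as the threshold, and the example weights are illustrative assumptions, not actual NLX-420 behaviour.

```c
/* Illustrative sketch of a slice-style PE in 4x4-bit mode (not actual
 * NLX-420 microcode). A common 16-bit input word is treated as four
 * 4-bit sub-values, multiplied by packed per-PE sub-weights, accumulated,
 * and pushed through a stand-in piecewise threshold. */
#include <stdint.h>
#include <stdio.h>

/* Extract the k-th 4-bit field (k = 0..3) from a 16-bit word. */
static int nibble(uint16_t w, int k) { return (w >> (4 * k)) & 0xF; }

/* One processing element: multiply-accumulate over the four sub-values. */
static int32_t pe_mac(uint16_t input, uint16_t weight, int32_t acc)
{
    for (int k = 0; k < 4; ++k)
        acc += nibble(input, k) * nibble(weight, k);
    return acc;
}

/* Stand-in threshold: clamp the sum into the 16-bit output range. */
static int16_t threshold(int32_t sum)
{
    if (sum < 0)     return 0;
    if (sum > 32767) return 32767;
    return (int16_t)sum;
}

int main(void)
{
    uint16_t input = 0x1234;          /* common input broadcast to all PEs */
    uint16_t weights[16];             /* one packed weight word per PE     */
    for (int pe = 0; pe < 16; ++pe)
        weights[pe] = (uint16_t)(0x1111 * (pe % 4));

    /* The 16 neuron outputs would be multiplexed out one after another. */
    for (int pe = 0; pe < 16; ++pe) {
        int32_t sum = pe_mac(input, weights[pe], 0);
        printf("PE %2d -> %d\n", pe, threshold(sum));
    }
    return 0;
}
```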

Page 10: Soft Computing Lecture 23 Hardware neural networks


Multi-processor chips

• A far more elaborate approach is to put many small processors on a chip. Two architectures dominate such designs: single instruction, multiple data (SIMD) and systolic arrays. In a SIMD design, each processor executes the same instruction in parallel but on different data. In a systolic array, a processor does one step of a calculation (always the same step) before passing its result on to the next processor in a pipelined manner.

• SIMD chips include the Inova N64000 and the HNC 100 NAP. The Adaptive Solutions CNAPS system uses the Inova N64000 to build a SIMD array. The chip contains 64 PEs, each with a 9x16-bit integer multiplier, a 32-bit accumulator, and 4 KBytes of on-chip memory for weight storage. All chips execute the same instruction, and common control and data buses allow multiple chips to be combined. The Hecht-Nielsen Computers 100 NAP (Neurocomputer Array Processor) contains only 4 PEs, but each PE performs true 32-bit floating-point arithmetic. Weights are stored in off-chip memory, and multiple chips can be cascaded.

• A systolic array system can be built with the Siemens MA-16. The MA-16 provides fast matrix-matrix operations (multiply, subtract, or add) on 4x4 matrices with 16-bit elements; the multiplier outputs and accumulators have 48-bit precision (a plain software sketch of this operation follows this list). Weights are stored off-chip, and neuron transfer functions are implemented off-chip via lookup tables. Multiple chips can be cascaded.
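
As a plain (non-pipelined) illustration of the core operation an MA-16-style systolic stage performs, the C sketch below multiply-accumulates two 4x4 matrices of 16-bit elements into a wide accumulator. The int64_t type stands in for the chip's 48-bit accumulator path, and the simple loop is an assumption for clarity rather than a model of the actual systolic pipeline.

```c
#include <stdint.h>

/* c += a * b for 4x4 matrices of 16-bit elements, accumulated in a wide
 * type (int64_t here, standing in for 48-bit accumulator precision). */
void mat4_mac(const int16_t a[4][4], const int16_t b[4][4], int64_t c[4][4])
{
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                c[i][j] += (int64_t)a[i][k] * (int64_t)b[k][j];
}
```

Larger weight matrices would then be handled by tiling them into 4x4 blocks and cascading such stages across chips.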

Page 11: Soft Computing Lecture 23 Hardware neural networks


RBF networks

• RBF networks provide fast learning and straightforward interpretation. The comparison of input vectors to stored training vectors can be done quickly if non-Euclidean distances, such as the Manhattan block norm (sum of absolute element differences), are used, since these require no multiplication operations.

• Two commercial RBF products are now available: the IBM ZISC036 (Zero Instruction Set Computer) chip and the Nestor Ni1000 chip.

• The ZISC036 contains 36 prototype-vector neurons, where the vectors have 64 8-bit elements and can be assigned to categories from 1 to 16383. Multiple chips can be easily cascaded to provide additional prototypes. The distance norm is selectable between the Manhattan block norm and the largest element difference. The chip implements a Region of Influence (ROI) learning algorithm using signum basis functions with radii of 0 to 16383. Recall is according to the ROI identification or via nearest-neighbour readout, at a pattern presentation rate of about 250k patterns/sec (see the sketch after this list).

• The Nestor Ni1000, developed jointly by Intel and Nestor, contains 1024 prototypes of 256 5-bit elements. The chip has two on-chip learning algorithms, RCE and PNN, and other algorithms can be microcoded. The processing rate is about 40k patterns/sec with a 40 MHz clock.
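
A minimal C sketch of ZISC-style recall with the Manhattan (L1) norm is given below: an input vector of 64 8-bit elements is compared against the stored prototypes and labelled by the nearest one. The type and function names and the plain nearest-neighbour readout (rather than the ROI mechanism) are assumptions made for the example.

```c
#include <stdint.h>
#include <stdlib.h>

#define DIM        64   /* elements per vector, 8 bits each        */
#define PROTOTYPES 36   /* prototype neurons per ZISC036-like chip */

typedef struct {
    uint8_t  v[DIM];    /* stored prototype vector   */
    uint16_t category;  /* category label (1..16383) */
} prototype_t;

/* Manhattan block norm: sum of absolute element differences, no multiplies. */
static uint32_t l1_dist(const uint8_t *a, const uint8_t *b)
{
    uint32_t d = 0;
    for (int i = 0; i < DIM; ++i)
        d += (uint32_t)abs((int)a[i] - (int)b[i]);
    return d;
}

/* Nearest-neighbour recall over all prototypes. */
uint16_t classify(const prototype_t *p, const uint8_t *x)
{
    uint32_t best = UINT32_MAX;
    uint16_t cat  = 0;
    for (int i = 0; i < PROTOTYPES; ++i) {
        uint32_t d = l1_dist(p[i].v, x);
        if (d < best) { best = d; cat = p[i].category; }
    }
    return cat;
}
```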

Page 12: Soft Computing Lecture 23 Hardware neural networks


Other digital designs

• Some digital neural network chips do not quite fit into the above three sub-categories.

• Examples include the Micro Circuit Engineering MT19003 NISP (Neural Instruction Set Processor) and the Hitachi Wafer Scale Integration chips.

• The NISP is basically a very simple RISC processor with seven instructions, optimized for implementing multi-layer networks, and is loaded with small programs that direct the processing. Feed-forward processing reaches 40 MCPS.

• At the other end of the complexity scale are the Hitachi Wafer Scale Integration chips. Both Hopfield and back-propagation wafers have been built. A neurocomputer with 8 of the back-propagation wafers, each with 144 neurons, achieved 2.3 GCUPS.

Page 13: Soft Computing Lecture 23 Hardware neural networks


NeuroMatrix® NM6403 RISC/DSP Microprocessor (Research Center Module, Russia)

• NeuroMatrix® NM6403 is a high-performance dual-core microprocessor combining VLIW and SIMD architectures. The architecture includes two main units:
  – a 32-bit RISC core;
  – a 64-bit vector co-processor that supports vector operations on elements of variable bit length (Patent US 6,539,368 B1).

There are two identical programmable interfaces for working with any memory type, as well as two communication ports hardware-compatible with the TI TMS320C4x DSP, which make it possible to build multiprocessor systems.

Page 14: Soft Computing Lecture 23 Hardware neural networks


RISC Core

• 5-stage pipelined 32-bit RISC;
• processor instructions are 32 and 64 bits wide (usually two operations are executed by each instruction);
• two address generation units; 16 GB address space;
• two 64-bit programmable interfaces to SRAM/DRAM shared memory;
• data format: 32-bit integers;
• registers:
  – 8 general-purpose 32-bit registers;
  – 8 address 32-bit registers;
  – special control and state registers;
• two high-speed byte-wide I/O communication ports, hardware-compatible with those of the TMS320C4x.

Page 15: Soft Computing Lecture 23 Hardware neural networks


VECTOR co-processor

• 1- to 64-bit word length for vector operands and products;

• data format: integer data packed into 64-bit blocks in the form of variable-length words of 1 to 64 bits each;

• hardware support for vector-matrix and matrix-matrix multiplication;

• on-chip saturation functions (see the sketch after this list);

• three on-chip 32 x 64-bit RAM blocks.
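
As a rough software picture of what packed arithmetic with saturation means, the C sketch below emulates a multiply-accumulate over a 64-bit block interpreted as eight signed 8-bit elements, clamping each running sum to the 8-bit range. This only illustrates the idea under that assumed packing; it is not the NM6403 instruction set or its actual element widths.

```c
#include <stdint.h>

/* Clamp a wide intermediate result to the signed 8-bit range. */
static int8_t sat8(int32_t x)
{
    if (x >  127) return  127;
    if (x < -128) return -128;
    return (int8_t)x;
}

/* Element-wise acc += a * b over two packed 64-bit words, assuming eight
 * signed 8-bit elements per word, with per-element saturation. */
uint64_t packed_mac_sat(uint64_t acc, uint64_t a, uint64_t b)
{
    uint64_t out = 0;
    for (int i = 0; i < 8; ++i) {
        int8_t ae = (int8_t)(a   >> (8 * i));
        int8_t be = (int8_t)(b   >> (8 * i));
        int8_t ce = (int8_t)(acc >> (8 * i));
        int8_t r  = sat8((int32_t)ce + (int32_t)ae * (int32_t)be);
        out |= (uint64_t)(uint8_t)r << (8 * i);
    }
    return out;
}
```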

Page 16: Soft Computing Lecture 23 Hardware neural networks


Applications

• accelerators for PCs and workstations, for:
  – neural network emulation;
  – signal processing;
  – image processing;
  – acceleration of vector and matrix calculations;

• telecommunications;

• embedded systems;

• a basic building block for large, highly parallel computing systems.

Page 17: Soft Computing Lecture 23 Hardware neural networks


Performance

• scalar operations:
  – 40 MIPS;
  – 120 MOPS for 32-bit data;

• vector operations:
  – from 40 to 11,500+ MMAC (million multiply-accumulate operations per second) (see the worked example after this list);

• I/O and interfaces:
  – two programmable 64-bit external memory interfaces with up to 800 MB/sec throughput;
  – I/O communication ports with up to 20 MB/sec throughput each.
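
For a sense of scale (a hypothetical figure, not from the slides): a fully connected layer with 256 inputs and 256 neurons needs 256 x 256 = 65,536 multiply-accumulate operations per pass, so at the peak rate of 11,500 MMAC the vector unit could in principle evaluate roughly 11.5·10^9 / 65,536 ≈ 175,000 such layers per second, ignoring I/O and memory overheads.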

Page 18: Soft Computing Lecture 23 Hardware neural networks


NeuroMatrix® MC431 Single-DSP PCI Evaluation Board

• The MC431 is a single-DSP PCI board designed for software evaluation and system prototyping on the NM6403 DSP. It is a low-cost solution that can be used for learning the NeuroMatrix® architecture and Software Development Kit. It has one NM6403 DSP, 4 MB of SRAM, and two communication ports.

• The NM6403 DSP has access to two 2 MB SRAM banks (one bank per bus). One bank is accessible for reading and writing both from the processor and from the PCI bus. The MC431 has two external communication ports for connecting input/output devices.

Page 19: Soft Computing Lecture 23 Hardware neural networks


NeuroMatrix® MC431 Single-DSP PCI Evaluation Board (2)

• low-cost, high-performance vector-matrix engine: the NM6403 DSP
• 4 MB SRAM
• two I/O communication ports hardware-compatible with the TI 'C40 DSP
• PCI host interface

Page 20: Soft Computing Lecture 23 Hardware neural networks


Page 21: Soft Computing Lecture 23 Hardware neural networks


Neural networks target traffic management

Using a PCI-based video input board, image processor, and video display board, each with an on-board NM6403 neural-network processor, RC Module's imaging system can monitor and analyze traffic flow on six-lane highways in real time.