
Neural chip for fast pattern matching

Edoardo Franzi

The AT&T Bell Labs' neural circuit NET32K is used to increase the speed of image processing in association with a small mobile robot. The neural accelerator was built around an MC68340 microcontroller and the NET32K circuit. Use of field programmable gate arrays minimizes board size and power consumption. Extracting contours, recognizing objects and obtaining a usable measure of distance were achieved with this interface, making robot guidance possible.

Keywords: neural chips, pattern matching, robot guidance

Image processing is an essential part of robotics. Smaller CCD cameras and fast dedicated integrated circuits for processing information allow new applications which were inconceivable only a few years ago: guidance of mobile robots, feature extraction and recognition can be processed in real time with an acceptable hardware requirement.

The basic operation for digital correlation is rather simple. To decide whether two groups of bits contain the same information, it is natural to count the bits that have the same state and position inside both groups; the ratio between the resulting count and the total number of bits processed determines whether the two groups are similar. Only a few integrated circuits that are easy to interface with microprocessors are available for image processing (correlation, pattern matching, etc.).

We used the AT&T Bell Labs' NET32K neural chip [1, 2], since it includes 33k synapses (connections between inputs and neurons). This circuit includes 412 000 transistors in a 0.9 µm CMOS technology and can be programmed to have 256 or 64 neurons of 128 or 512 analogue synapses with an external digital interface. Analogue processing enables fully parallel computation; the 32 768 synapses are evaluated in 100 ns (peak performance). The average speed is 1.5 µs for a correlation between a 16 x 16 monochrome pixel bitmap and 32 binary patterns of the same resolution. No known digital device can approach this performance.

The design of the circuit provides many features to increase the flexibility of the chip, but pattern matching and contour extraction are the most realistic applications.

Thanks to the good relations between LAMI-EPFL and AT&T Bell Labs, several NET32K circuits have been made available for a student project [3, 4] and this paper describes the improved interface that resulted [5]. The architecture of the circuit and the correlation method are described first.
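As an illustration of the bit-counting decision described above, the following C sketch compares two 16 x 16 monochrome bitmaps stored as sixteen 16 bit words and reports the ratio of matching bits; the example data and the 50% decision threshold are illustrative assumptions, not values taken from the NET32K.

#include <stdint.h>
#include <stdio.h>

/* Count the bits that have the same state and position in two
 * 16 x 16 bitmaps (one 16 bit word per row) and decide on
 * similarity from the ratio of matching bits to total bits. */
static double similarity(const uint16_t a[16], const uint16_t b[16])
{
    int matches = 0;
    for (int row = 0; row < 16; row++) {
        uint16_t same = (uint16_t)~(a[row] ^ b[row]);  /* 1 where bits agree */
        for (int bit = 0; bit < 16; bit++)
            matches += (same >> bit) & 1;
    }
    return (double)matches / 256.0;                    /* 16 x 16 = 256 bits */
}

int main(void)
{
    uint16_t image[16]  = { 0x0FF0, 0x0FF0, 0x0C30, 0x0C30 };  /* remaining rows are 0 */
    uint16_t kernel[16] = { 0x0FF0, 0x07E0, 0x0C30, 0x0810 };
    double r = similarity(image, kernel);
    printf("matching ratio: %.2f -> %s\n", r,
           r > 0.5 ? "similar" : "different");         /* illustrative threshold */
    return 0;
}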

Laboratoire de Microinformatique (LAMI), Ecole Polytechnique Fédérale de Lausanne, INF-Ecublens, CH-1015 Lausanne, Switzerland. Paper received: 25 March 1992. Revised: 13 October 1992

0141-9331/93/060325-08 © 1993 Butterworth-Heinemann Ltd

NET32K ARCHITECTURE

A neuron

The basic function of each neuron is the correlation between the input vector (coming from an external bitmap) and the synaptic weight vector (representing the kernel to be recognized). Figure 1 shows the simplified schematic of the McCulloch-Pitts model and a NET32K electronic neuron. In both cases, there are two basic operations: product and sum.

The product (function AND or XOR for the NET32K) is achieved with two 128 bit vectors: the input X and the synaptic weight W. The sum is determined by accumulating all the multipliers' output currents on the sum wire; this intermediate result corresponds to the degree of similarity (Hamming distance) between the vectors. The total current on the wire can be weighted by some constant (1, 1/2, 1/4, 1/8) defined by programmable switches; this ensures scalability when large numbers of synapses are necessary.

The sum represents the neuron's potential P; it is fed through a hard limiter function with adjustable threshold to produce the neuron binary output activity Y. The hard limiter is implemented with a comparator. The threshold level, called the reference, is generated on the sum wire of another neuron, dedicated only to that purpose.
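A software analogue of one such neuron is sketched below; this is a minimal model rather than the chip's behaviour: the analogue current sum is replaced by an integer count, the product is taken as an XOR-style bit match, and the scale and reference values mentioned in the usage comment are illustrative assumptions.

#include <stdint.h>
#include <stdbool.h>

/* One 128-synapse neuron modelled in software: per-bit product of the
 * input X and weight W, accumulation into the potential P, scaling by
 * a programmable factor and a hard limiter comparing P with the
 * reference produced by a dedicated reference neuron. */
typedef struct {
    uint8_t w[16];      /* synaptic weight vector W (128 bits)          */
    double  scale;      /* programmable factor: 1, 1/2, 1/4 or 1/8      */
    double  reference;  /* threshold level generated by another neuron  */
} neuron128;

static bool neuron_output(const neuron128 *n, const uint8_t x[16])
{
    int p = 0;                                             /* potential P */
    for (int byte = 0; byte < 16; byte++) {
        uint8_t match = (uint8_t)~(x[byte] ^ n->w[byte]);  /* XOR-style product */
        for (int bit = 0; bit < 8; bit++)
            p += (match >> bit) & 1;
    }
    return (n->scale * p) > n->reference;                  /* binary activity Y */
}

/* Usage (illustrative values): a neuron with scale 1 and reference 100
 * fires when more than 100 of its 128 synapses match the input. */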

It is important to note that the chip does not include any automatic learning function to modify the reference values of the neurons. This task has to be achieved off-line, outside the chip, by successive attempts to achieve the best threshold level for a particular pattern.

Figure 1. a, McCulloch-Pitts model; b, one neuron of the chip (left part: neuron, right part: reference). In the model the potential is P = Σ W_j X_j (j = 0 ... 127), with output Y = 1 if P > 0 and Y = 0 otherwise

Distribution of neurons inside the chip

The chip is subdivided into two identical parts (left and right) of eight blocks of 16 neurons each. As shown in Figure 1b, these two parts are used to create the 'neuron' function and the 'reference' function. It is possible to combine several neurons (one to eight) to create a larger one with more synapses (128 to 1024).

The combination is realized during the configuration step and is only possible between neurons having the same position inside different blocks; e.g. neurons 1 of blocks 0, 1, ..., 7 can form a large neuron with 1024 synapses (Figure 2).

Figure 2. Combination of neurons occupying the same position in different blocks (left part: neuron, right part: reference) to form large neurons with 1024 synapses

Input/output registers

The particular input/output structure of the chip is optimized for image correlation (Figure 3). Each of the 16 inputs is connected to two cascaded 8 bit shift registers. All the outputs of the first-stage (main) shift registers are multiplexed with the second (additional) stage and connected to an internal 128 bit bus. Access to the internal 128 bit bus is possible only through the main 8 bit shift registers (16 x 8 = 128 bit). The additional registers are used to increase performance for kernels of 16 x 16 bits; in this case two cycles are necessary to transfer both arrays of shift registers onto the internal 128 bit bus.

Figure 3. Input/output structure of the chip: the 16 chip inputs feed the additional and main shift registers, which load the internal 128 bit bus; the configuration registers and synaptic weights are also reached over this bus, and the neuron outputs (Out 0 to Out 31) are read through the output registers

The 32 outputs are connected to the internal 128 bit bus by 4 bit shift registers; in this way the internal bus is read in four cycles (four vectors of 32 bits at a time). This configuration is particularly efficient when kernels of 16 X 16 monochrome pixels must be used by neurons in correlation cycles. In this situation, during the scanning of the image to be processed, we need to send the chip only one 16 bit wide word at a time, because it still has the trace of the previous 15 words (scanning window).
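The trace kept by the shift registers can be mimicked on the host side as a sliding 16 x 16 window; the sketch below is an assumed software model of that behaviour, not a description of the chip's registers.

#include <stdint.h>
#include <string.h>

/* 16 x 16 scanning window held as 16 rows of 16 bits.  Pushing one
 * new pixel row discards the oldest one, so a full window is always
 * available after the first 16 words, as with the cascaded shift
 * registers. */
typedef struct {
    uint16_t row[16];
} scan_window;

static void window_push(scan_window *w, uint16_t new_row)
{
    memmove(&w->row[0], &w->row[1], 15 * sizeof(uint16_t));
    w->row[15] = new_row;                  /* most recent pixel row */
}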

PATTERN MATCHING

Ternary kernels

To recognize patterns inside a bitmap, we need to distinguish between significant and non-significant pixels. Significant pixels describe the part of the image that we want to recognize, while non-significant ones are considered as general noise (for example, noise coming from the digitization step).

To achieve this task with the NET32K we have to describe elementary kernels (prototypes used to find shapes inside the bitmap) in which all this information can be coded. Three-state logic is needed to describe all the different situations. We name 'on' and 'off' the primary codes that describe the significant shape of the kernels, and 'do not care' the remaining ones. The real implementation of this three-state logic is carried out using two binary kernels, as shown in Figure 4a.
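A possible host-side encoding of a ternary kernel into its two binary parts is sketched below; the enum names and the bit-packing order are assumptions made for illustration, not the chip's actual data format.

#include <stdint.h>

/* The three pixel codes of a ternary kernel. */
typedef enum { PIX_OFF, PIX_ON, PIX_DONT_CARE } ternary_pixel;

/* Encode a 16 x 16 ternary kernel into two binary kernels: the
 * positive part marks the 'on' pixels, the negative part marks the
 * 'off' pixels, and 'do not care' pixels are set in neither. */
static void encode_ternary(const ternary_pixel k[16][16],
                           uint16_t positive[16], uint16_t negative[16])
{
    for (int r = 0; r < 16; r++) {
        positive[r] = 0;
        negative[r] = 0;
        for (int c = 0; c < 16; c++) {
            if (k[r][c] == PIX_ON)
                positive[r] |= (uint16_t)(1u << c);
            else if (k[r][c] == PIX_OFF)
                negative[r] |= (uint16_t)(1u << c);
        }
    }
}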

Thanks to its internal architecture and the possibility of building neurons with a larger synaptic count, the chip is able to find ternary kernels of 16 x 16 pixels; in this case we need to connect four neurons together (Figure 4b). These three-state kernels have the advantage of offering a better shape description than the binary ones. On the other hand, this solution is implemented at the expense of the number of neurons (one ternary pattern is built using two binary ones).

Figure 4c is an example of white and black edge detection; it shows how the ternary kernels work during the correlation cycles. The influence of the image pixels on the neuron potentials during scanning is not the same inside the shape domains 'on' and 'off' as inside the 'do not care' domains. Three basic responses are possible:

• A group of 'on' pixels of size N in the picture is inside the domain 'do not care': in this case the potential of the neuron responsible for correlating the positive part is increased by N units and that of the neuron responsible for the negative part is decreased by N units.

• A group of 'on' pixels of size N in the picture is inside the domain 'on': in this case both neurons are concerned and their potential is increased by 2N units (maximum peak of the correlation response).

• A group of 'on' pixels of size N in the picture is inside the domain 'off': in this case both neurons are concerned and their potential is decreased by 2N units.

Of course these considerations are the same for 'off' pixels.
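These rules can be summarized in a simplified integer model of the analogue correlation; this is a sketch only, and the ±1 unit contributions and the way the two partial potentials are combined are assumptions consistent with the description above.

#include <stdint.h>

typedef enum { PIX_OFF, PIX_ON, PIX_DONT_CARE } ternary_pixel;

/* Correlation between one 16 x 16 image window and one ternary
 * kernel.  Pp and Pn are the potentials of the neurons handling the
 * positive and negative binary parts; the global potential
 * Pg = Pp + Pn is what is finally compared with the reference. */
static int ternary_potential(const int img[16][16],   /* 1 = 'on' pixel */
                             const ternary_pixel k[16][16],
                             int *pp, int *pn)
{
    *pp = 0;
    *pn = 0;
    for (int r = 0; r < 16; r++) {
        for (int c = 0; c < 16; c++) {
            int on = img[r][c];
            switch (k[r][c]) {
            case PIX_DONT_CARE:        /* net contribution to Pg is zero */
                *pp += on ? 1 : -1;
                *pn += on ? -1 : 1;
                break;
            case PIX_ON:               /* match raises Pg by 2, mismatch lowers it by 2 */
                *pp += on ? 1 : -1;
                *pn += on ? 1 : -1;
                break;
            case PIX_OFF:
                *pp += on ? -1 : 1;
                *pn += on ? -1 : 1;
                break;
            }
        }
    }
    return *pp + *pn;                  /* global potential Pg */
}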


Figure 4. Ternary kernels. a, a ternary kernel (shape domains of pixels 'on' and 'off' plus a 'do not care' domain) is implemented as two 16 x 16 bit binary kernels, the positive and negative parts, each handled as two 16 x 8 bit half kernels; b, the potentials Pp (positive part) and Pn (negative part), each over 256 units, are combined by the current switches and comparator into the global potential Pg, which is compared with the reference to produce the output; c, correlation response versus distance as the scanning window moves across the bitmap, showing the correlation peak at the matching position

The weakness of the chip lies in the generation of the references for the threshold of the comparator part of the neuron. In fact, we have to sacrifice the 128 neurons on the right-hand side to adjust each reference for the neurons on the left-hand side. The slight differences in the threshold values of the comparators and the different numbers of 'do not care' pixels present in each ternary kernel justify this choice. On the other hand, because of the configuration mechanism, 64 other neurons cannot be used. Therefore, only 25% of the total number of neurons is used for this application. Finally, the number of neurons at hand is 64 (16 large neurons with 512 synapses).

Configuration and scanning

The chip must be linked to a conventional data processing system (microprocessor, DSP, workstation, etc.), which provides the bitmap of the image to be processed. The NET32K is able to store up to 16 ternary kernels of 16 x 16 bits in the synaptic weights of the neurons.

There are two basic sequences for using the NET32K: configuration and scanning. The configuration sequence loads the reference values and the 16 kernels inside the chip and takes about 1 ms. The scanning is performed by sending one 16 bit word sample to the chip at a time; this step is processed in about 5 µs and is repeated for all the bits of the picture. Thanks to its massively parallel computation capability, all the kernels stored inside the chip are correlated at the same time with the supplied bitmap sample (scanning window). The result of each iteration is a bit vector in which the neurons' binary states indicate which kernels have been found.
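From the host's point of view the scanning sequence reduces to a simple loop; the helper functions below are hypothetical stand-ins for the register-level accesses handled by the interface (the configuration sequence is omitted), and their stub bodies exist only to keep the sketch self-contained.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for the accesses that the interface performs
 * through the NET32K control lines. */
static void net32k_write_word(uint16_t pixel_row_word)
{
    (void)pixel_row_word;                 /* placeholder: no hardware here */
}

static uint16_t net32k_read_response(void)
{
    return 0;                             /* placeholder: bit i set = kernel i found */
}

/* Scanning sequence: one 16 bit word (one bitmap sample) is sent per
 * step (about 5 us each); every step returns a bit vector telling
 * which of the 16 stored kernels matched the current scanning window. */
static void scan_image(const uint16_t *image_words, int n_words,
                       uint16_t *responses)
{
    for (int i = 0; i < n_words; i++) {
        net32k_write_word(image_words[i]);
        responses[i] = net32k_read_response();
    }
}

int main(void)
{
    uint16_t words[64] = { 0 };           /* a small slice of the bitmap */
    uint16_t resp[64];
    scan_image(words, 64, resp);
    printf("first response vector: 0x%04x\n", resp[0]);
    return 0;
}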

An example of contour extraction is shown in Figure 5. To realize this task we have to describe 16 kernels with different angles of black and white transitions. The image is scanned from bottom to top and from right to left, with a one pixel offset. At each step, we get information on the 16 neuron states for the current scanning window. If one of the 16 kernels which describe all the possible black and white transitions is detected (logical OR of the neuron responses), we turn 'on' the pixel in the processed image corresponding to the position of the centre of the scanning window.

Figure 5. Example of contour extraction with ternary patterns: the pattern matching window scans the source image, the ternary kernels stored inside the NET32K produce the neurons' response, and the corresponding pixel is turned 'on' in the processed image
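In software terms the decision at each scanning step reduces to a logical OR over the 16 response bits; the minimal sketch below assumes a 512-pixel-wide processed image and a response word coming from the accelerator.

#include <stdint.h>
#include <stdbool.h>

/* One step of contour extraction: if any of the 16 edge kernels has
 * fired for the current scanning window (logical OR of the response
 * bits), the pixel at the centre of the window is turned 'on' in the
 * processed image. */
static void mark_contour(uint16_t response_bits,
                         bool processed[][512],
                         int centre_row, int centre_col)
{
    if (response_bits != 0)               /* at least one transition kernel found */
        processed[centre_row][centre_col] = true;
}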

ARCHITECTURE OF THE NEURAL MACHINE

This neural machine consists of two basic boards: the main CPU board and the neural board. The architecture and the different components have been designed to achieve the best compromise between size, computational speed and power consumption. Mechanical and electrical interconnections are made by mounting the neural board (and other possible interfaces) piggy-back on the CPU board; this construction allows very small machines.

The CPU board

The CPU board is built around a new microcontroller from Motorola, the MC68340; its core is the CPU32 unit (similar to the well-known MC68020 core). The main features of this chip are two 24 bit timers, two serial communication channels and two powerful DMA interfaces. It is interesting to note that the chip is clocked by a 32 kHz crystal oscillator; an internal PLL controlled by software enables the CPU clock to be varied from 131 kHz to 16.77 MHz, making it easy to optimize the speed-power consumption ratio.

The board (Figure 6) includes the following main features:

• 256 kbyte to 1 Mbyte static RAM with 85 ns access time

• 256 kbyte EPROM (100 ns)
• 32 I/O lines directly controlled by two UPPs (universal pulse processors)
• 8 channels of A/D converters
• 2 x 4 kbyte FIFO communication interface

Figure 6. Schematic diagram of the CPU board: microcontroller, memory, universal pulse processors (HD63140/HD63143) for digital and analogue I/O, RS485 network, RS232, Mubus (8 bit bus for experiments) and the FIFO channel for high speed communications with a workstation or frame grabber

The communication interface is coupled with a DMA channel to minimize the time lost during image transfers. By software it is possible to configure the bandwidth of the DMA operations to share the bus between the CPU and the FIFOs.

Interface of the NET32K

Interfacing the NET32K circuit to a processor is not easy. Neither a command set nor a high-level programming model is available. To access the different registers, it is necessary to control about 40 lines with precise sequences. Microprogrammed sequences were used in the AT&T interface, as well as in our first prototype. FPGA (field programmable gate array) circuits make an implementation with only two circuits possible (Figure 7). The first FPGA (ACTEL 1) is used to produce all the sequences necessary for the NET32K (microsequence interface layer), while the second (ACTEL 2) encodes the output vectors to determine which neuron has fired. For each 16 bit word (pixel row) sent to the circuit, a response with four 32 bit vectors is available.

Figure 7. Schematic diagram of the interface: microprocessor bus signals (D[0..15], A[0..7], Reset, /CS2, /W, /IRQ3), command decoding and the NET32K control lines

Figure 8. MC68340 and NET32K boards

As stated before, only 16 neurons are employed and for this reason only a part of these vectors is useful. This new NET32K interface (Figure 8) saves about 30 standard TTL chips.
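As a software analogue of what the second FPGA does, the useful part of the response can be reduced to a 16 bit word; the assumption that the 16 used neurons appear in the low bits of the first output vector is made purely for illustration.

#include <stdint.h>

/* The chip answers each 16 bit input word with four 32 bit output
 * vectors (128 neuron outputs).  Only 16 large neurons are used in
 * this application, so only part of the response is meaningful; here
 * the useful outputs are assumed to occupy the low 16 bits of the
 * first vector. */
static uint16_t encode_response(const uint32_t out[4])
{
    return (uint16_t)(out[0] & 0xFFFFu);
}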

RESULTS AND PERFORMANCE

Figure 9 shows the architecture of the system [6] realized at AT&T Bell Labs (Figure 9a) compared with the one realized at LAMI-EPFL (Figure 9b). As can be seen, the essential difference is the control of the neural chip. The AT&T Bell Labs' board uses a powerful processor coupled with six EPROMs to generate the microsequences. The LAMI-EPFL board uses a slower processor, but the microsequences are generated by the FPGAs.

Figure 9. Architecture of the Bell Labs and LAMI systems: a, the AT&T Bell Labs system; b, the LAMI-EPFL system (processor board and neural board, with an optional 68020 16 MHz workstation, image acquisition board and camera for 256 x 256 images)

Speed performance

Table 1 compares the performance of the two neural machines, of a workstation and of the MC68340 without a NET32K chip. The test was made on a 512 x 512 pixel image (pattern matching with 16 ternary kernels of 16 x 16 bits).

Table 1. Characteristics of different machines for pattern matching

Machine            Software    Transfer time of one image       Scanning time    Total time    Power
                               (host-accelerator-host) (ms)     (ms)             (s)           (W)
AT&T (Figure 9a)   Assembler   700                              300              1             20
LAMI (Figure 9b)   Assembler   170                              1400*            1.57          3
SUN Sparc 2        C           -                                74000            74            -
MC68340 alone      Assembler   -                                33000            330           -

*A version of the MC68340 running at 25 MHz is now available; this decreases the scanning time to ~900 ms and the total time to ~1 s.

The SUN Sparc-2 does not use the NET32K chip: its operations are emulated with an optimized program written in C. The total time necessary for our machine to achieve a complete image process is similar to that of the AT&T one. The moderate speed of the MC68340 CPU is compensated for by a very good image transfer path (FIFOs controlled by a DMA channel) and by the FPGA-based NET32K control.

For this particular task, coupling the neural chip with the MC68340 reduces the necessary computational time by a factor of 210 compared with the MC68340 alone.

Power consumption

Using high performance microcontrollers coupled with dedicated programmable logic (field programmable gate arrays) allows us to reduce drastically the chip count and the power consumption (3 W). This makes small battery-powered systems possible, which is particularly relevant in mobile robotics, where autonomous vehicles have to be controlled.

APPLICATIONS

The accuracy of the sensor information and of the system computation is essential when a robot has to be guided in an unknown environment. Contour extraction (Figure 10a) can inform the robot about the environment topology (openings and discontinuities in walls, etc.). Image analysis software must be able to process the information coming from the neural accelerator and to translate it to guide the robot through possible openings.

The picture is processed column by column, from right to left. Each column is analysed from bottom to top until a discontinuity is detected. This sequence allows the extraction of the lowest point of the image. Signalling panels of particular shapes (targets) can be placed on the walls around the robot (Figure 10b). The information that we can get from these targets should be sufficient to tell the robot which action it has to perform (turn left, turn right, straight on, etc.).

Figure 10. Examples of the pictures to be processed: a, source image and processed image for contour extraction; b, a target seen along the scanning direction at distances of 5 m, 3 m and 1 m

Inside the NET32K we can store kernels of different sizes; this configuration is useful to estimate the distance between the robot and the signalling panel. In fact, a target is recognized by one neuron of the accelerator only at the distance where its apparent size is the same as that of one of the kernels stored in the chip. By a preliminary calibration of the correlation distances, the distance to the target can be estimated.
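Once this calibration is done, the estimate reduces to a lookup from the index of the neuron that fired to a distance; the three kernels and the 5 m, 3 m and 1 m values below are illustrative (they match the distances shown in Figure 10b), not calibration data from the paper.

#include <stdint.h>

/* Kernels describing the same target at different apparent sizes are
 * stored in the chip; after calibration, the index of the neuron that
 * fires maps directly to a distance. */
static const double kernel_distance_m[3] = { 5.0, 3.0, 1.0 };

static double estimate_distance(uint16_t response_bits)
{
    for (int k = 0; k < 3; k++)
        if (response_bits & (1u << k))
            return kernel_distance_m[k];
    return -1.0;                          /* target not recognized */
}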

A mobile robot (25 x 23 cm) equipped with a rotating ultrasonic sensor and a fixed camera was used for the experiments (Figure 11).

Figure 11. Mobile robot with neural accelerator

CONCLUSIONS

The high speed of this accelerator and its small dimensions allow us to mount a powerful image processing system on a small mobile robot. The estimated execution time for the pattern matching algorithm, running with only the MC68340 CPU and with the same test conditions, takes about 330 s, that is more than 200 times slower than with the NET32K accelerator. There is great potential in robotics for dedicated circuits performing image processing and intelligent tasks. The AT&T NET32K chip, developed for character recognition, already offers very interesting possibilities.

ACKNOWLEDGEMENTS

This project has been supported by the Swiss National Foundation for Scientific Research (PNR23) and has benefited from the Microinformatics Lab (LAMI). I thank Eric Cosatto and J D Nicoud who have contributed to the development of this accelerator and to the NET32K documentation.

REFERENCES

1 Graf, H P 'Reconfigurable neural net chip with 32K connections' in Touretzky, D (Ed) Advances in Neural Information Processing Systems 3 Morgan Kaufmann, San Mateo, CA (1991)

2 Graf, H P, Jackel, L D and Hubbard, W E 'VLSI implementation of a neural network model' IEEE Comput. (March 1988) pp 41-49

3 Cosatto, E 'Système neuronal VLSI reconfigurable' Travail de semestre, LAMI-EPFL (1990)

4 Cosatto, E 'Carte d'accélération pour le système neuronal des AT&T Bell Labs' Travail de diplôme, LAMI-EPFL (1990)

5 Franzi, E 'Analyse d'images par circuit neuronal NET32K' Rapport R91.74, LAMI-EPFL (1991)

6 Graf, H P, Nohl, C R and Ben, J 'A neural-net board system for machine vision applications' Int. Joint Conf. on Neural Networks, Seattle (June 1991)

Edoardo Franzi obtained a Bachelor's degree in electronics engineering from the Vaud Engineering College in 1980. For 10 years he worked in different industries on microprocessor applications. Since 1989 he has worked at LAMI-EPFL, where he is involved in a project (PNR23) for the Swiss National Foundation for Scientific Research. In 1991 he received Reg. A (EPF level) and a Postgraduate Certificate in Computer Science.
