[IEE 1st International Conference on Genetic Algorithms in Engineering Systems: Innovations and Applications (GALESIA) - Sheffield, UK (12-14 Sept. 1995)] 1st International Conference


An Architecture for Enhancing Image Processing via Parallel Genetic Algorithms & Data Compression

B.C.H Turton, T Arslan

University of Wales, UK

Introduction

Vision systems require processing techniques which are robust, fast and capable of dealing with large quantities of data. Genetic algorithms have been used with such systems because of the first of these criteria, as exemplified in the works of Fitzpatrick 1984, Mandava 1989 and McAulay 1989 [1-3]. Early work in genetic algorithms [1,2] demonstrated their usefulness for image registration; however, the timescales involved made real-time registration impractical. Using this technique, the genetic algorithm (GA) seeks to adapt the chromosomes to find the transform that best maps the reference image to an observed image. The transform can be used to determine the position, orientation and size of an object for use in manufacturing environments. Alternatively, the transform is used to obtain a 'best fit' before a comparison is made between the reference and observed images. However, many applications of this type require an algorithm that works in real time (sub-one-second timescale). Turton et al. [4] designed the first real-time genetic algorithm for this purpose by using a Parallel Genetic Algorithm and a specialist VLSI architecture. The VLSI architecture made maximum use of the simple bitwise manipulation involved in standard crossover operators, whilst the parallel genetic algorithm allows a linear improvement in the time required with the number of chromosomes.

A variety of forms of Parallel Genetic Algorithm (PGA) exist; a simple parallelising of a standard GA is the least efficient technique. The two main methods are coarse-grain and fine-grain genetic algorithms. The former method uses a single processor to run a population of chromosomes over several generations and occasionally, every 'epoch', exchanges fit individuals with other processors (populations). The latter method processes each chromosome individually [5]. Typically a fine-grain PGA has an order of magnitude more parallel processes than the coarse-grain PGA. For traditional applications the coarse-grain approach typically uses tens of standard processors to run populations in parallel. However, if a fine-grain approach is used with a specialist VLSI architecture, over a hundred chromosomes can be processed in parallel, thus dramatically decreasing the time required to converge to a solution. Normally interprocessor communications are a limiting factor; however, if a specialist architecture is implemented such limitations can be overcome by careful design. Consequently a fine-grain PGA was chosen due to its massively parallel nature. Results from earlier work [4] suggested that the black & white images tested (64x64 pixel, 1 bit/pixel) could be processed in less than 10 milliseconds, thus making the technique practical for real-time applications. However, it is desirable to extend the technique to larger images. In order to evaluate each chromosome independently a local copy of the image is required. Consequently the design is limited by the chip area available for storing an image per chromosome. In addition, the evaluation time per chromosome is proportional to the number of pixels in the image. This paper improves the original PGA by applying data compression techniques to the image in order to minimise the data manipulated by the chromosomes. Image registration is performed in the compressed domain and the chromosomes encode the transform in this domain. The transform must then be converted back to the 'real' world domain for practical use. Consequently the chip area can be decreased by the compression factor, and the processing time for the image can be improved.

A variety of compression techniques can be used, for example JPEG, Fractal, various forms of Discrete Cosine Transform (DCT), run length encoding, Huffman encoding and Arithmetic encoding. For image registration purposes the compression method must be fast and provide a good compression ratio. The compression method does not have to be lossless. Of the aforementioned techniques JPEG (which utilises a combination of techniques), Fractal & DCT are lossy techniques that have good compression ratios

Genetic Algorithms in Engineering Systems: Innovations and Applications, 12-14 September 1995, Conference Publication No. 414, © IEE, 1995


[6]. DCT is the easiest method to implement efficiently on-chip. Consequently the DCT compression method was chosen as an effective lossy compression technique for reducing the image size. An additional benefit of using a lossy algorithm is the ability to compress the image to a fixed compressed image size. This permits a variety of sizes of image to be processed on a chip with limited memory per chromosome. Consequently this system is far more flexible than the original method. The technique and its implications are described in this paper along with simulated results for a number of images. Conclusions and future developments are subsequently discussed.

Parallel Genetic Algorithm for vision

A brief background to Parallel Genetic Algorithms (PGAs) can be found in [7]. For this application each chromosome specifies a two-dimensional transform which maps a reference image to an external image. The images are 64x64 byte greyscale images. The transform contains information about scale, rotation and position. A measure of the accuracy of the transform is found by summing the absolute error between the transformed reference image and the external image. The known largest possible error would be 2^20. In order to make the largest number the best match, the 'fitness' of the chromosome is measured as 2^20 minus the absolute error. Once the transform which produces the best match (highest value) has been identified, the position, orientation and scale of the target can be determined.
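The fitness measure described above can be illustrated with a minimal sketch. The function name `fitness` and the list-of-rows image representation are assumptions made for this example; only the bound of 2^20 and the sum of absolute errors come from the text.

```python
MAX_ERROR = 2 ** 20  # 64 * 64 pixels * 256 grey levels, the bound quoted in the text


def fitness(transformed_reference, external):
    """Return 2**20 minus the summed absolute pixel error (higher is better)."""
    error = sum(
        abs(r - e)
        for ref_row, ext_row in zip(transformed_reference, external)
        for r, e in zip(ref_row, ext_row)
    )
    return MAX_ERROR - error


# Tiny usage example on hypothetical 64x64 images.
flat = [[100] * 64 for _ in range(64)]
shifted = [[110] * 64 for _ in range(64)]
print(fitness(flat, flat))     # identical images give the maximum, 1048576
print(fitness(flat, shifted))  # 4096 pixels, each off by 10, gives 1007616
```

Subtracting the error from a fixed bound keeps the fitness non-negative and lets the hardware treat "larger is better" uniformly.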

The transform used will be:

I = RT    (1)

where I is the real image, T is the transform and R is the reference image.

The Transform is a 3x2 matrix

Here S refers to a scaling factor, φ to rotation, and x0 & y0 are position offsets.

Each element of the matrix is stored in the chromosome as a six bit value, with the exception of the offsets, which are only stored to three bit accuracy. However, if the original image were used the memory requirements would be large; consequently a DCT compressed version of the external image and of the reference image are used.
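As an illustration of the encoding just described, the sketch below unpacks a chromosome into four six-bit matrix elements and two three-bit offsets (30 bits in total). The field order, the unsigned interpretation and the names `decode_chromosome` and `encode_chromosome` are assumptions for this example; the text does not specify the packing.

```python
FIELD_WIDTHS = (6, 6, 6, 6, 3, 3)  # four matrix elements, then x and y offsets (assumed order)


def decode_chromosome(bits):
    """Split an integer chromosome into its fields, most significant field first."""
    total = sum(FIELD_WIDTHS)
    if not 0 <= bits < (1 << total):
        raise ValueError("chromosome does not fit in %d bits" % total)
    fields = []
    shift = total
    for width in FIELD_WIDTHS:
        shift -= width
        fields.append((bits >> shift) & ((1 << width) - 1))
    return fields


def encode_chromosome(fields):
    """Inverse of decode_chromosome: pack the fields back into one integer."""
    bits = 0
    for value, width in zip(fields, FIELD_WIDTHS):
        if not 0 <= value < (1 << width):
            raise ValueError("field value out of range")
        bits = (bits << width) | value
    return bits


example = encode_chromosome([32, 0, 0, 32, 3, 5])
print(decode_chromosome(example))  # [32, 0, 0, 32, 3, 5]
```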

DCT Compression

The DCT takes the image in sets of 4x4 blocks. Each 4x4 block is separately transformed (figure 1). The transformed block is not necessarily stored to the same resolution as the original, which allows some compression of the image. Consequently the number produced after transformation is usually divided by some factor and then quantised before storage. The transformed block is related to the Discrete Fourier transform and can be regarded as the relative magnitudes of the two-dimensional spatial frequencies which make up the picture. Images concentrate most of the information in the lower spatial frequencies. Consequently an image can be compressed by not storing the higher frequencies as accurately as the lower frequencies. For this work the higher frequencies (top half of the spectrum) are set to zero and the lower frequencies are only stored to 4 bit accuracy. The consequence of this is to reduce the storage capacity required per chromosome by a factor of four. The equations for transforming the image are,

$$f(x,y) = \frac{1}{2} \sum_{u=0}^{3} \sum_{v=0}^{3} C(u)\,C(v)\,F(u,v)\,\cos\frac{(2x+1)u\pi}{8}\,\cos\frac{(2y+1)v\pi}{8} \qquad (3)$$

where $C(0) = 1/\sqrt{2}$, else $C(\cdot) = 1$.

Difficulties arise using this technique in dealing appropriately with the transform domain. The position of the 8x8 blocks conforms to the normal equation (1); however, the frequency distribution changes when scaling & rotation occur. Offsets have no effect on the frequency distributions within a block because only offsets corresponding to complete block movements have been permitted in this work. Scaling can be properly incorporated, as frequency changes inversely with the scale in each dimension. Further work will be required to properly compensate for rotational movement.
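The block transform and coefficient truncation described in this section can be sketched as follows. This is a plain, unoptimised illustration, not the off-chip implementation used by the authors. The block size is left as a parameter, and the choice of which half of the spectrum is discarded, the quantiser step `step` and the function names are all assumptions; the text gives only "top half of the spectrum" and "4 bit accuracy".

```python
import math


def dct_2d(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = (2 / n) * c(u) * c(v) * total
    return out


def compress_block(block, step=256.0):
    """Zero the upper half of the spectrum and keep 4-bit coefficients.

    Assumptions: 'upper half' means frequency index u >= n/2, and 'step' is a
    hypothetical quantiser step size; neither is given in the text.
    """
    n = len(block)
    coeffs = dct_2d(block)
    kept = []
    for u in range(n // 2):
        for v in range(n):
            q = int(round(coeffs[u][v] / step))
            kept.append(max(-8, min(7, q)))  # clamp to a signed 4-bit range
    return kept


block = [[128] * 8 for _ in range(8)]  # flat grey 8x8 block of 8-bit pixels
codes = compress_block(block)
original_bits = 8 * 8 * 8
stored_bits = len(codes) * 4
print(len(codes), original_bits // stored_bits)  # 32 coefficients, ratio 4
```

Halving the number of coefficients and halving the bits per coefficient (8 to 4) together give the factor of four quoted in the text, whatever the block size.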

Hardware Implementation

DCT has been widely recognised as the most effective technique for image and video compression and its single chip implementation has already been reported [8,9]. For the hardware implementation proposed in this paper DCT is only applied once, prior to commencing the genetic evolution for image registration. Hence, it was decided to perform the DCT off-chip since it is not a speed critical task in the case of this application. Incorporating the DCT algorithm on-chip would further increase both the complexity and the computational intensity of the proposed architecture.

The proposed hardware can be divided into four main blocks as shown in figure 2. The following is a brief description of each block:

Block A: performs the following functions: selection of the best of the four neighbours; crossover and mutation; deposition of the best individual, from the contents of registers (REG0, REG1, REG2), into REG2.

The genetic evolution commences when an external chromosome (CHROMOi) is deposited in REG0, which is subsequently duplicated in REG1 and REG2. The Register Control Logic (RCL) is the block responsible for handling the transfer of data among the registers above. An appropriate control signal on multiplexer MUX1 will select either the four neighbouring chromosomes (Cs ... Ce) or the chromosomes in the registers (REG0, REG1 and REG2). The same control signal will be used to select the corresponding fitness values via multiplexer MUX14. The ‘e logic will enable MUX2 to select the best chromosome and place it in REG2. RCL will ensure that the appropriate fitness value is passed to FREG2. The signal MIX is applied externally to the processor to indicate whether a crossover or mutation should take place, in which case a 16-bit random number (RNo) is used to indicate the appropriate position(s). In the case of crossover RNo is split into two individual 8-bit numbers by another logic block (MXCL) in order to provide the positions required for a two-point crossover. In the case of mutation only a single 8-bit number is extracted from RNo.
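The operations attributed to Block A can be mimicked in software as a rough sketch. The chromosome length `CHROMOSOME_BITS`, the reduction of each 8-bit position modulo that length, the use of the low byte of RNo for mutation and all function names are assumptions for illustration; the text states only that RNo is split into two 8-bit numbers for a two-point crossover and that one 8-bit number is used for mutation.

```python
CHROMOSOME_BITS = 30  # assumed length; 8-bit positions are reduced modulo this


def select_best(candidates):
    """Return the (chromosome, fitness) pair with the highest fitness."""
    return max(candidates, key=lambda pair: pair[1])


def split_random(rno):
    """Split a 16-bit random number into two 8-bit crossover positions."""
    return (rno >> 8) & 0xFF, rno & 0xFF


def two_point_crossover(parent_a, parent_b, rno):
    """Take bits p1..p2-1 from parent_b and all other bits from parent_a."""
    p1, p2 = sorted(pos % (CHROMOSOME_BITS + 1) for pos in split_random(rno))
    mask = ((1 << p2) - 1) ^ ((1 << p1) - 1)  # ones in bit positions p1..p2-1
    return (parent_a & ~mask) | (parent_b & mask)


def mutate(chromosome, rno):
    """Flip one bit, chosen by the low 8 bits of rno."""
    return chromosome ^ (1 << ((rno & 0xFF) % CHROMOSOME_BITS))


# Four hypothetical neighbours as (chromosome, fitness) pairs.
neighbours = [(0b1010, 5), (0b0110, 9), (0b0001, 2), (0b1111, 7)]
best, best_fitness = select_best(neighbours)
child = two_point_crossover(0, (1 << CHROMOSOME_BITS) - 1, 0x0308)
print(best_fitness, bin(child), bin(mutate(0, 0x0002)))  # 9 0b11111000 0b100
```

Both operators reduce to masks, shifts and XORs, which is the kind of simple bitwise manipulation the VLSI architecture is said to exploit.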

Block B: is responsible for fitness evaluation of chromosomes. This is performed either after possible crossover/mutation or during the optimisation process. The appropriate control signal on MUX8 will select a chromosome from one of the above two sources.

As can be seen from figure 2 the circuit consists of two similar sections. The top section is for the calculation of x' while the bottom section calculates y' (x' and y' are the x-y co-ordinate addresses for the transformed pixel of the image). The calculation of x' commences by depositing the value of a13 in register Rxi. This value is used until the end of the x-axis is reached (monitored by the counter C1), at which point RxO is loaded into Rxi. This operation continues until the complete image is processed. A similar procedure is followed for y' in the lower section. Both x' and y' are evaluated in parallel.

Block C: processes the twelve bit address corresponding to x' and y' so that it can be mapped to the compact frame buffer FB. This process involves separating the most significant three bits of the x' and y' addresses, i.e. the bits responsible for identifying a transform block, into a separate six bit bus. This allows the use of a simple logic function, DL3, to map the six bit address of an individual pixel (within a transform block) into a five bit address. The resulting eleven bit address is used to identify the individual pixel in the frame buffer for fitness comparison.
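The address mapping performed by Block C can be sketched in software as below. The bit layout of the block identifier and the behaviour of `dl3` are hypothetical: the text says only that DL3 is a simple logic function taking six bits to five, so the version here (dropping the most significant in-block row bit) is an assumption chosen to be consistent with half of each block's coefficients being stored.

```python
def dl3(in_block_addr):
    """Hypothetical stand-in for DL3: map a 6-bit in-block address to 5 bits.

    Assumption: the in-block address is (row << 3) | col and only rows 0..3
    are stored, so the most significant row bit is simply dropped.
    """
    return in_block_addr & 0b011111


def frame_buffer_address(x, y):
    """Build an 11-bit frame buffer address from 6-bit x' and y' values."""
    if not (0 <= x < 64 and 0 <= y < 64):
        raise ValueError("x' and y' must be 6-bit values")
    block_id = ((y >> 3) << 3) | (x >> 3)        # 3 MSBs of each -> 6-bit bus
    in_block = ((y & 0b111) << 3) | (x & 0b111)  # 3 LSBs of each -> 6 bits
    return (block_id << 5) | dl3(in_block)       # 6 + 5 = 11 bits


print(frame_buffer_address(0, 0))    # 0
print(frame_buffer_address(63, 59))  # last block, row 3, column 7 -> 2047
```

An eleven bit address spans 2048 locations, which matches the 2K depth of the frame buffer.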

Block E: After the calculation of x' and y', the corresponding pixel is extracted from a 2Kx4 Frame Buffer (FB) and is compared with the corresponding pixel of the reference image. If a match is detected then C3 is incremented, and the count is stored in register Rft. When the whole image is processed, one of the following is performed:

1. If the fitness is being calculated for a chromosome transferred from REG2 (after possible crossover), then the final fitness, calculated here, will be moved to FREG2 by enabling the appropriate control signals on DX5, MX12, and DXO.

2. If, however, the fitness is being evaluated for a chromosome during the optimisation phase, then it is compared with the previous fitness value in the register Rftmp (at the start of the evaluation phase Rftmp = 0), which stores the best fitness seen so far during optimisation; the corresponding chromosome is stored in RCtmp. If the fitness value is better than the one in Rftmp, then it is copied to Rftmp and its associated chromosome is transferred to RCtmp; otherwise the chromosome in RC is incremented or decremented depending on the state of the counter C4 (see below).

The optimisation phase commences by moving the chromosome in REG2 to RC. MX13 selects each parameter sequentially incrementing and then decrementing its value (decided by the code produced by the counter C4). The chromosome produced is then presented to the evaluation section by enabling MX8 which should be enabled with the appropriate code signalling the use of the evaluation section in the optimisation phase.
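The optimisation phase described above behaves like a simple local search, which can be sketched as follows. The names `local_optimise`, `evaluate` and `limits` are hypothetical, and seeding the best fitness with the score of the unmodified chromosome is an assumption (the text initialises Rftmp to 0).

```python
def local_optimise(params, evaluate, limits):
    """Try +1 and -1 on each parameter in turn, keeping the best result seen.

    'params' is a list of integer fields, 'evaluate' returns a fitness where
    higher is better, and 'limits' gives the exclusive upper bound per field.
    """
    best_params = list(params)        # plays the role of RCtmp
    best_fitness = evaluate(params)   # plays the role of Rftmp
    for index in range(len(params)):
        for delta in (+1, -1):
            trial = list(params)      # each trial starts from the chromosome in RC
            trial[index] += delta
            if not 0 <= trial[index] < limits[index]:
                continue              # skip values outside the field's range
            score = evaluate(trial)
            if score > best_fitness:
                best_fitness, best_params = score, trial
    return best_params, best_fitness


# Toy fitness: closeness to a hypothetical target parameter set.
target = [10, 20, 3]


def score(p):
    return 100 - sum(abs(a - b) for a, b in zip(p, target))


print(local_optimise([9, 20, 4], score, [64, 64, 8]))  # ([10, 20, 4], 99)
```

Only one parameter changes per trial, so each step needs a single increment or decrement and one fitness evaluation, which keeps the control logic small.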

The design was evaluated using a 1 µm ES2 CMOS process, in which an individual chromosome could be processed in approximately 2 milliseconds.

Results

The results of using this technique are given in figure 3. Each image has six separate PGA runs, with the average result for each generation plotted. In addition, for picture A the best result for each generation is plotted. Clearly the technique has managed to find the optimum solution in some cases (2^20 = 1048576). Investigation of the runs which do not reach the optimum result shows that a local minimum has been reached where one of the scaling factors has collapsed to a suboptimal value. Limitations on the permissible change in scale would substantially alleviate this problem. In addition, the coefficients found under transformation do have some limitations in this implementation. In particular the offsets are only coarsely calculated, to the nearest 8x8 block. This limitation is imposed because a simple offset in the compressed domain is not equivalent to an offset in the original domain unless it is by an integer number of blocks. A more advanced version of this algorithm would be able to adjust for this effect, thus permitting more accurate comparisons.

Conclusion

The suggested hardware provides a realistic method of comparing greyscale images within the limits of existing technology. Convergence can be expected within 2 milliseconds. Multiple GA runs can be initiated to decrease the chances of producing a suboptimal value from a local optimum. However, there are severe restrictions on the transformation algorithm used. More advanced algorithms are required to allow accurate measurement of offsets and rotations, and rules should be incorporated to guide the PGA into legitimate regions. These must all be provided for in hardware and consequently will require further research into algorithms which are both effective and simple to implement in hardware using commonplace technology.

References

1. Fitzpatrick J.M, Grefenstette J.J and Van Gucht D (1984) 'Image Registration by Genetic Search' IEEE SouthEastcon pp 460-64

2. Mandava V.R, Fitzpatrick J.M and Pickens D.R (1989) 'Adaptive Search Space Scaling in Digital Image Registration' IEEE Transactions on Medical Imaging MI-8 No 3 pp 251-62

3. McAulay A.D and Oh J.C (1989) 'Image Learning Classifier System Using Genetic Algorithms' IEEE pp 705-10

4. Turton B.C.H, Arslan T, Horrocks D.H (1994) 'A Hardware Architecture for a Parallel Genetic Algorithm for Image Registration' IEE Colloquium on "Genetic Algorithms in Image Processing and Vision" Digest No: 1994/193 pp 11/1-6

5. Petty C.C and Leuze M.R (1989) 'A Theoretical Investigation of a Parallel Genetic Algorithm' in Proceedings of the Third International Conference on Genetic Algorithms, Schaffer J.D (Ed), Morgan Kaufmann Publishers pp 398-405

6. Wallace G.K (1992) 'The JPEG still picture compression standard' IEEE Transactions on Consumer Electronics 38 No 1 pp 18-34

7. Turton B.C.H, Arslan T (1995) 'A Parallel Genetic VLSI Architecture for Combinatorial Real-Time Applications - Disc Scheduling' First IEE/IEEE International Conference on Genetic Algorithms in Engineering Systems: Innovations & Applications.

8. Sun M-T, Chen T-C, and Gottlieb A.M (1989) 'VLSI Implementation of a 16x16 Discrete Cosine Transform' IEEE Trans. Ccts & Sys Vol. 36, No. 4 pp 610-17.

9. Chiu C.T, Kolagotla R.K, and Jaja J.F (1994) 'VLSI Implementation of Real-Time Parallel DCT/DST Lattice Structures for Video Communications' VLSI Signal Processing V, IEEE, ISBN 0-7803-5.

[Figure: block diagram with registers REG0/FREG0 and REG1/FREG1]

Figure 2: DCT Conversion Process

[Plot: Parallel GA Results. Series: Picture A, B and C (average) and Picture A, B and C (best); vertical axis ticks from 750000 to 950000]

Figure 3: Results averaged over ten PGA Runs, 128 Chromosomes
