
IEEE 2009 Electronics, Robotics and Automotive Mechanics Conference, Cuernavaca, Morelos, Mexico, September 22-25, 2009




Abstract

This paper presents a parallel architecture for the solution of linear equation systems based on the division-free Gaussian elimination method [1]. The architecture handles single- and double-precision data in the IEEE 754 floating-point format and can be implemented in a Xilinx Spartan-3 FPGA. The mathematical algorithm is implemented in an array of processors. The main procedure inside each processor and the data distribution between processors are described. Furthermore, the synthesis of the designed modules for each processor of the proposed architecture is presented. The obtained algorithmic complexity is O(n²) using a scheme of n² processors that solve the linear equation system.

1. Introduction

Technological development in parallel processing, computer systems, information systems and digital electronic devices has driven the evolution of electrical engineering. The design of novel parallel architectures for the solution of engineering problems is an efficient computational way to process information by exploiting concurrent events [1].

Parallel architectures applied to linear algebra problems allow methods for the solution of equation systems to execute faster than on a sequential computer system. Due to the need for faster computation, the use of parallelism techniques has increased recently.

The proposed architecture can be implemented in a Field Programmable Gate Array (FPGA). FPGAs are composed of bidimensional arrays of reprogrammable digital modules. These devices contain millions of logic gates, have a good cost-benefit ratio and are suitable for the design of parallel circuits. The one-step division-free Gaussian elimination method was parallelized for implementation in an FPGA because of the characteristics of this method [2]. For instance, its main procedure uses only basic arithmetic operations, such as sums and multiplications.

The paper is organized as follows. In Section 2, the one-step division-free Gaussian method is presented in detail. Section 3 shows the proposed parallel architecture based on the method introduced in Section 2. Section 4 describes the data distribution of the proposed architecture. Section 5 presents the FPGA processing unit and the synthesis of each module that composes the processors. Section 6 shows the performed tests, the obtained results and comparisons with other architectures reported in the literature. Finally, Section 7 presents the conclusions.

2. One-Step Division Free Method

Mathematical algorithms for the solution of linear equation systems that avoid division have recently become more attractive, because the division operation requires greater computational effort and a special-purpose processor. It also implies a significant increase in computation time and memory space. Moreover, the accumulated rounding error of division operations produces numerical instability. These drawbacks of the division operation make the division-free Gaussian elimination method a good alternative for the solution of linear equation systems.

Consider a linear equation system in the form Ax = b, where A = (a_ij), 1 ≤ i, j ≤ n, is the coefficient matrix, b = (a_ij), 1 ≤ i ≤ n, n < j ≤ m, is the right-hand side stored as the extra column of the augmented matrix (m = n + 1), and x = (x_i), 1 ≤ i ≤ n, is the vector of unknowns.

In order to solve Ax = b, the matrix A is reduced to a diagonal or triangular form, and then a back substitution can be performed. In general, a systolic algorithm that reduces the matrix A to its diagonal form is more complex than an algorithm that produces its triangular form. In this paper, the one-step division-free Gaussian elimination method is used for the reduction of the matrix A to its diagonal form.

Parallel Architecture for the Solution of Linear Equation Systems Implemented in FPGA
R. Martínez, Student Member, IEEE; D. Torres, Member, IEEE
Instituto Tecnológico de Morelia, Michoacán, México
[email protected], [email protected]

2009 Electronics, Robotics and Automotive Mechanics Conference
978-0-7695-3799-3/09 $26.00 © 2009 IEEE. DOI 10.1109/CERMA.2009.14

The diagonalization process of the one-step division free algorithm is given by

    a_ij^(0) = a_ij,  1 ≤ i ≤ n, 1 ≤ j ≤ m;

    a_ij^(k) = a_ij^(k-1)                                        if i = k,
    a_ij^(k) = a_kk^(k-1) a_ij^(k-1) - a_ik^(k-1) a_kj^(k-1)     otherwise,      (1)

for 1 ≤ k ≤ n, 1 ≤ i ≤ n, k ≤ j ≤ m, where |A| denotes the determinant of matrix A.
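In software terms, the case analysis in (1) reduces, per element, to either a copy of the pivot row or a 2x2 determinant. A minimal C sketch of the per-element update (our illustration; the paper's implementation is in VHDL):

```c
#include <stdint.h>

/* One-step division-free update of element a_ij at iteration k,
   following Eq. (1): the pivot row (i == k) passes through unchanged;
   every other element becomes a 2x2 determinant. */
static int64_t update(int64_t a_kk, int64_t a_kj,
                      int64_t a_ik, int64_t a_ij,
                      int i, int k)
{
    if (i == k)
        return a_ij;                  /* pivot row is copied       */
    return a_kk * a_ij - a_ik * a_kj; /* 2x2 determinant, Eq. (1)  */
}
```

For example, eliminating column 0 of row 1 in the matrix [[1, 1], [1, -1]] gives update(1, 1, 1, -1, 1, 0) = 1*(-1) - 1*1 = -2.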

3. Parallel architecture for the one-step division free method

Equation (1) describes the one-step division-free Gaussian elimination method. The following characteristics observed in (1) make this algorithm suitable for parallelization. The 2x2 determinants can be obtained simultaneously at each iteration of the algorithm. Therefore, the complexity of the algorithm for the computation of the determinants is:

    O(n × m)      (2)

where n × m is the number of determinants to compute in each iteration, n is the number of rows, m is the number of columns, and m = n + 1; then

    O(n × (n + 1)) = O(n²)      (3)

Therefore, the complexity of the algorithm is O(n²). Also, data dependency exists at each iteration: iteration k + 1 cannot begin until iteration k has been completely computed, because the data of the previous iteration are needed to compute the current one.

According to the above, this paper proposes a parallel architecture composed of processors able to perform the one-step division-free algorithm with a considerable time reduction.

The processors are distributed in a grid. The grid size is n², and a central master processor coordinates the data distribution between the processors. The master processor collects the resulting data from each processor and sends the proper data to every processor for the next iteration.

The operations performed by the processors are two multiplications and a subtraction, which correspond to the computation of a 2x2 determinant. The corresponding algorithm to solve equation (1) is presented below.

Algorithm: One-step division-free method

    for (k = 0; k < n; k++) {
        for (i = 0; i < n; i++)            // row
            for (j = 0; j < n+1; j++)      // column
                if (k == i)
                    D[i][j] = C[k][j];     // pivot row is copied
                else
                    D[i][j] = C[k][k]*C[i][j] - C[i][k]*C[k][j];
        for (p = 0; p < n; p++)            // copy D back into matrix C
            for (q = 0; q < n+1; q++)
                C[p][q] = D[p][q];
    }   // end program
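The loop nest in the listing above translates directly to C. The following self-contained sketch (our illustration, not the authors' code) runs the elimination on a small augmented matrix; note that the final division used to read off the solution lies outside the division-free method itself:

```c
#define N 2   /* order of the example system */

/* One-step division-free elimination of the augmented matrix C,
   of size N x (N+1), as in the listing above; D holds the new iterate. */
static void division_free_eliminate(double C[N][N + 1])
{
    double D[N][N + 1];
    for (int k = 0; k < N; k++) {
        for (int i = 0; i < N; i++)           /* row    */
            for (int j = 0; j < N + 1; j++)   /* column */
                D[i][j] = (k == i) ? C[k][j]  /* pivot row is copied */
                        : C[k][k] * C[i][j] - C[i][k] * C[k][j];
        for (int p = 0; p < N; p++)           /* copy back for next iteration */
            for (int q = 0; q < N + 1; q++)
                C[p][q] = D[p][q];
    }
}
```

After the call, A is diagonal, so each unknown is x_i = C[i][N] / C[i][i]; for the system x + y = 3, x - y = 1 this yields x = 2, y = 1.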

4. Data distribution

The first task needed to perform the algorithm is the data distribution. The data is sent to each processor so that the processor can perform its internal operations. The master processor sequentially sends the data to the first n processors.

The position of each processor in the grid determines the data it computes. For instance, processor P11 computes the element a11, processor P12 computes the element a12, and so on: every processor Pij computes its corresponding element a_ij.

At the beginning of the algorithm, the data from the original matrix is taken as the input data for the iteration 0.

Consider k as the iteration number, i as the row number and j as the column number. The processors are distributed in a matrix form, as shown in Fig. 1.

Fig. 1. Distribution of the processors in matrix form



Each processor stores the data that corresponds to its position. For instance, processor P11 stores the element a11, processor P12 stores the element a12, and so on. The master processor sends the input data to every processor. Fig. 2 shows the data distribution.

Fig. 2. Data distribution of the elements a_ij into the processors P_ij

The elements a11, a12, a13, ..., a1n for the first iteration are exactly the same as for iteration 0. The first row of the new matrix corresponding to iteration 1 is equal to the first row of the original matrix; therefore, its computation is not required.

These elements are stored in the master processor, one by one, until the n elements are reached.

The element a_kk is stored in its corresponding processor P_ij when k = i. If k = 1, then all the processors need the term a11, because this element is used in the computation of all the determinants required for the new values in the next iteration of the algorithm. Moreover, this element is stored in all the processors in a single clock cycle. Fig. 3 shows this distribution.

For the rows, a special condition has to be considered. Each processor needs the element that corresponds to its row number and the current iteration. For instance, for row 2 and iteration 1, the element a21 is stored in all the processors located in row 2 of the matrix of processors. This special data distribution is shown in Fig. 4.

The columns also share a common datum that is stored in all the processors located in the same column. For example, for column 2 and iteration 1, the element a12 is stored in all the processors of that column. This behavior is present in every column. Fig. 5 depicts this data distribution.

Fig. 3. Data distribution of the elements a_kk into the processors P_ij

Fig. 4. Data distribution in the rows of the matrix of processors

Fig. 5. Data distribution in the columns of the architecture
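The three patterns above (pivot a_kk broadcast to all processors, a_ik shared along row i, a_kj shared along column j) give each processor P_ij exactly the four operands of its 2x2 determinant. A small C sketch of the addressing rule (our illustration; in the architecture this distribution is done by the master processor in hardware):

```c
/* Operand addresses (row, col) that processor P_ij must hold for
   iteration k of Eq. (1): the pivot a_kk, the row element a_ik,
   the column element a_kj, and its own element a_ij. */
static void operands_for(int i, int j, int k, int rc[4][2])
{
    rc[0][0] = k; rc[0][1] = k;  /* a_kk: broadcast to all processors */
    rc[1][0] = i; rc[1][1] = k;  /* a_ik: shared along row i          */
    rc[2][0] = k; rc[2][1] = j;  /* a_kj: shared along column j       */
    rc[3][0] = i; rc[3][1] = j;  /* a_ij: already stored locally      */
}
```

For processor P23 at iteration k = 1, this yields the operands a11, a21, a13 and a23, matching the row and column distributions of Figs. 3-5.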



The data distribution presented in this section is performed to compute the first iteration of the algorithm. The data stored in each processor is used to compute the determinants that produce the new input data required for the next iteration k+1.

5. Processing

An FPGA device, also named LCA (Logic Cell Array), is used for the processing phase. FPGAs consist of a bidimensional matrix of configurable blocks that can be connected through general interconnection resources [10].

The architecture was developed in the VHDL language using a top-down design, and it was programmed, compiled and simulated in the ModelSim SE 6.3f software.

The synthesis of the architecture was performed in the Xilinx ISE 8.1i software for FPGAs of the Xilinx Spartan-3 family, to obtain the number of gates, IOBs, CLBs, slices, shift registers, flip-flops, and LUTs. The connection diagrams between digital components at different abstraction levels were also obtained.

In this section, the simulation of the modules of the proposed parallel architecture, based on the parallelization of the one-step division-free Gaussian elimination, is presented. The proposed architecture is composed of the following components or modules:

- An adder-subtractor
- A multiplier
- A serial data input/output
- A counter
- Memory registers

The aforementioned components are part of every processor of the proposed architecture described in the earlier section. The designed processor uses the following components:

- An Arithmetic Logic Unit (ALU) to perform mathematical operations
- Memory for data storage
- A control module to manage its internal operation
- A serial module for sending and receiving data
- A data sending and receiving controller

Every processor has an ALU composed of a 64-bit adder-subtractor and a 32-bit multiplier. Moreover, every processor contains a memory of six 64-bit registers: four registers store the four elements needed to compute the 2x2 determinant, one register stores the resulting determinant, and the final register stores the processor number.

The data of every processor is sent sequentially. This task is performed by a component inside each processor. The master processor manages the data distribution between the processors by using a counter.

All the modules were programmed in the VHDL language and include standard IEEE libraries.

In the simulations of the designed VHDL modules presented in this paper, a clock period of 100 ns was considered. Fig. 6 shows the design of the processor.

Fig. 6. Designed processor

5.1. Adder-subtractor

The adder-subtractor contained in the ALU of the processor is used for the computation of the determinant in (1). It is able to add and subtract positive and negative numbers in a range of 0 to 2^64. Internally, this block is composed of an adder, a subtractor, a multiplexer, flip-flops and logic stages.

5.2. Multiplier

The designed multiplier is able to multiply two positive or negative numbers and is essential to the computation of the determinant. The multiplicand and the multiplier can be 32-bit numbers, i.e., in the range of 0 to 2^32, and the result is a 64-bit number. This module is composed of two D-type flip-flops and a multiplexer.

5.3. Memory registers of 64 bits

The 64-bit memory registers are used for data storage. Every processor has six registers that compose its memory. Only one register is fixed: it contains the processor number. The other registers store the data required in the computation of the determinant given in (1).

(Figure: processor EP_ij with its Adder, Multiplier and Memory blocks, taking the inputs a_kk^(k-1), a_kj^(k-1), a_ik^(k-1) and a_ij^(k-1).)


5.4. Serial data input/output

This module performs the sending and reception of data between processors in serial form. 64-bit data is sent and received for the mathematical operations. The processor number, which identifies each processor, is also sent using this module.

5.5. Data sending controller

The master processor manages the data sending to every processor. First, the master processor sends an identification number, called the processor number, to every processor. Once all the processors have been identified, the master processor can send the elements required for the computation of the determinant. The data sending controller module was designed to identify each processor.

The synthesis results of the modules described in this section are shown in Table 1.

Table 1. Number of components per processor

Module               Total Gates   CLBs   IOBs   Shift   Mult 18x18   FF    4-input LUTs
Adder-subtractor     765           32     193    -       -            64    64
Multiplier           17829         56     129    -       4            64    111
Memory registers     515           -      129    -       -            64    -
Data input/output    267           2      4      4       -            1     -
Counter              756           42     12     -       -            33    47
Total components     20132         132    467    4       4            226   222

Table 2 shows the obtained times of each designed module and the total time needed for the execution of a single iteration of the division free Gaussian elimination method per processor.

Table 2. Time measurements

Description          Operations   Time per operation (s)   Total time per operation (s)
Adder-subtractor     1            4000x10^-9               4000x10^-9
Multiplier           2            1500x10^-9               3000x10^-9
Memory registers     6            1500x10^-9               9000x10^-9
Data input/output    4            1325x10^-9               5300x10^-9
Counter              4            100x10^-9                400x10^-9
Total time                                                 21700x10^-9
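As a quick arithmetic check (ours), the total in Table 2 is the sum of the per-module totals, i.e., operations multiplied by time per operation:

```c
/* Per-module total times from Table 2, in nanoseconds. */
static long total_iteration_time_ns(void)
{
    long adder   = 1 * 4000;  /* adder-subtractor: 1 op  x 4000 ns */
    long mult    = 2 * 1500;  /* multiplier:       2 ops x 1500 ns */
    long mem     = 6 * 1500;  /* memory registers: 6 ops x 1500 ns */
    long io      = 4 * 1325;  /* data input/output: 4 ops x 1325 ns */
    long counter = 4 * 100;   /* counter:          4 ops x  100 ns */
    return adder + mult + mem + io + counter;  /* 21700 ns = 21.7 us */
}
```

The sum reproduces the 21700x10^-9 s total reported in Table 2.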

The obtained computation time for one processor is 21.7 μs. Since each iteration is executed simultaneously in parallel by all processors, the total time needed for the computation of the division-free method is 21.7 μs multiplied by the order of the matrix to be solved.

The block diagram of the designed processor is presented in Fig. 7. It can be observed that the processor is composed of the five modules described above.

Fig. 7. Processor block diagram.

6. Experimental Results

In this paper, a Division Free Parallel Architecture (DFPA) is presented. Experimental results are given after synthesis. A comparison of the obtained results using the DFPA and the results reported in [9], called here Parallel LU, is shown in Table 3. In this case, a frequency of 40 MHz was used.

Table 3. Comparison DFPA vs Parallel LU

Matrix   DFPA (s)    Parallel LU [9] (s)   Improvement
24       1.36E-04    6.84E-04              5.02
30       1.70E-04    1.41E-03              8.26
36       2.04E-04    1.17E-03              5.73
42       2.38E-04    1.66E-03              6.96
48       2.72E-04    3.00E-03              11.02
54       3.06E-04    4.89E-03              15.94
96       5.45E-04    1.67E-02              30.58

Fig. 8 shows a plot of the results presented in Table 3, where the DFPA shows a better performance than Parallel LU.

Another comparison of the obtained results, using the DFPA and the results reported in [10], named here Pipeline LU, is shown in Table 4. In this case, a frequency of 100 MHz was used. Table 4 also includes a comparison with another architecture based on a Pentium M processor, reported in [10].



Fig. 8. Comparison DFPA vs parallel LU

Table 4. Comparison DFPA vs Pipeline LU, time (ms)

Matrix   DFPA (100 MHz)   Pipeline LU (100 MHz)   Pentium M (1.6 GHz)   Improvement vs Pentium   Improvement vs Pipeline LU
100      0.227            0.46                    9.11                  40.13                    2.03
300      0.681            8.76                    134.20                197.06                   12.86
500      1.14             40.50                   661.00                579.82                   35.53
800      1.82             167.60                  2984.50               1639.84                  92.09
1000     2.27             328.40                  7871.50               3467.62                  144.67

Fig. 9 shows a plot of the results presented in Table 4, where DFPA shows a better performance than pipeline LU and Pentium M reported in [10].

Fig. 9. Comparison DFPA vs Pipeline LU

7. Conclusions

During the review of mathematical algorithms for the solution of linear equation systems, it was found that the methods that use division require greater processing time, and their hardware implementation produces complex architectures. However, the division-free method proposed by Bareiss [1] presents many advantages for parallelization. For this reason, this method was selected as the base of the proposed parallel architecture in an FPGA.

The parallelization of the division-free Gaussian elimination method produces simple independent processes that can be implemented in identical processors, and its hardware implementation is easily constructed using basic algebraic operations. The obtained algorithmic complexity is O(n²) under a scheme of n² processors that solve a linear equation system of order n.

The performed simulations of the modules that compose a processor show low execution times, on the order of nanoseconds, for this kind of computation.

Finally, the construction of VHDL modules of digital systems for the simulation and synthesis of a matricial processor is possible using the ModelSim and Xilinx ISE software.

References

[1] E. H. Bareiss, "Sylvester's Identity and Multistep Integer-Preserving Gaussian Elimination," Mathematics of Computation, vol. 22, 1968, pp. 565-578.
[2] D. Torres, H. Mathias, H. Rabah, and S. Weber, "SIMD/restricted MIMD Parallel Architecture for Image Processing Based on a New Design of a Multi-mode Access Memory," presented at the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'99), Las Vegas, Nevada, USA, 1999.
[3] R. Martínez Alonso, D. Torres Lucio, "Paralelización del algoritmo del método de Bareiss libre de división para solución de sistemas de ecuaciones lineales de ingeniería eléctrica" [Parallelization of the division-free Bareiss method algorithm for the solution of linear equation systems in electrical engineering], 4th International Congress and 2nd National Congress of Numerical Methods in Engineering and Applied Sciences, ISBN 978-84-96736-08-5, Morelia, Michoacán, México, January 17-19, 2007.
[4] F. Pardo Carpio, Arquitecturas Avanzadas [Advanced Architectures], Universidad de Valencia, Valencia, España, January 30, 2002.
[5] S. Peng, S. Sedukhin, "Parallel Algorithm and Architecture for Two-step Division-free Gaussian Elimination," 0-7803-4229-1/97, IEEE, 1997, pp. 489-502.
[6] M. J. Beauchamp, S. Hauck, K. D. Underwood, and K. S. Hemmert, "Architectural Modifications to Enhance the Floating-Point Performance of FPGAs," IEEE Transactions on VLSI Systems, vol. 16, no. 2, February 2008.
[7] R. Scrofano, L. Zhuo, V. K. Prasanna, "Area-Efficient Arithmetic Expression Evaluation Using Deeply Pipelined Floating-Point Cores," IEEE Transactions on VLSI Systems, vol. 16, no. 2, February 2008.
[8] Xilinx DS099, Spartan-3 Family: Complete Data Sheet, product specification, http://www.xilinx.com.
[9] X. Wang and S. G. Ziavras, "Parallel Direct Solution of Linear Equations on FPGA-Based Machines," Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2003).
[10] V. Daga, G. Govindu, V. Prasanna, "Efficient Floating-Point Based Block LU Decomposition on FPGAs," ERSA, Las Vegas, Nevada, USA, June 21-24, 2004, pp. 137-148.
