
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 1, NO. 1, MARCH 1993

SIGMA: A VLSI Systolic Array Implementation of a Galois Field GF($2^m$) Based Multiplication and Division Algorithm

Mario Kovac, Member, IEEE, N. Ranganathan, Senior Member, IEEE, and Murali Varanasi, Senior Member, IEEE

Abstract-Finite or Galois fields are used in numerous applications like error correcting codes, digital signal processing and cryptography. The design of efficient methods for Galois field arithmetic such as multiplication and division is critical for these applications. In this paper, we present a new algorithm based on a pattern matching technique for computing multiplication and division in GF($2^m$). An efficient systolic architecture is described for implementing the algorithm. It can produce a new result every clock cycle, and the multiplication and division operations can be interleaved. The architecture has been implemented using 2-μm CMOS technology. The chip yields a computational rate of 33.3 million multiplications/divisions per second.

I. INTRODUCTION

IN RECENT YEARS, finite fields or Galois fields have been extensively applied in i) error correcting codes such as BCH codes and RS codes [1], ii) digital signal processing [2], iii) pseudorandom number generation [13], and iv) encryption and decryption protocols in cryptography [3] and space object tracking applications [16]. In many of the above applications, the finite field GF($2^m$), a number system made of $2^m$ elements, is used. The practical use of GF($2^m$) in various applications requires arithmetic operations like addition, multiplication and division. Thus efficient algorithms are required to perform these arithmetic operations on-the-fly. Addition in GF($2^m$) is relatively simple and straightforward. However, multiplication and division are more complex operations. Division is performed by multiplying by the inverse of the denominator element, and inversion itself is achieved through repeated multiplications.

In this paper, we propose a new algorithm for multiplication and division in GF($2^m$). The algorithm is designed for elements of GF($2^m$) represented by the conventional basis $\{1, \alpha, \alpha^2, \alpha^3, \ldots, \alpha^{m-1}\}$. The algorithm is based on a pattern matching and recognition approach and is amenable to hardware implementation. A VLSI architecture is described for implementing the proposed algorithm. The architecture is systolic and uses the principles of pipelining and parallelism to obtain high speed and throughput. The overall architecture is a multistage linear pipeline and hence can yield a new result every clock cycle.

Manuscript received June 8, 1992; revised October 14, 1992. M. Kovac was supported by the Department of Science, Republic of Croatia. N. Ranganathan was supported in part by the National Science Foundation under Grant MIP-9010358 and by the Florida High Technology and Industry Council. M. Varanasi was supported in part by the AT&T Foundation.

M. Kovac is with the College of Electrical Engineering, University of Zagreb, Unska 3, 41000 Zagreb, Croatia.

N. Ranganathan is with the Center for Microelectronics Research, Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33620.

M. Varanasi is with the Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33620.

IEEE Log Number 9206232.

A prototype CMOS VLSI chip, SIGMA, implementing the architecture for Galois field GF($2^4$) has been designed, fabricated and tested. The prototype chip is operational at 33.3 MHz, yielding a throughput of 33.3 million multiplications/divisions per second. An important feature of our design is that the multiplication and division operations can be interleaved. The hardware is programmable for different primitive irreducible polynomials.

The outline of the paper is as follows. An overview of the various hardware approaches for Galois field GF($2^m$) is given in Section II. Section III develops the theoretical basis for our proposed algorithm for multiplication and division. The algorithm is described in Section IV. The SIGMA architecture is presented in Section V. Section VI describes the VLSI chip implementation and its performance. Conclusions are given in Section VII.

II. RELATED WORK

In recent years, many researchers have proposed hardware algorithms and architectures for performing arithmetic operations in Galois fields that can be implemented in VLSI [4]-[13]. Most approaches implement separate hardware for multiplication, inversion and division. In our approach, we propose a systolic hardware architecture that can perform both multiplication and division. We present here a brief overview of previous work on VLSI hardware methods for Galois field arithmetic.

A cellular array multiplier for GF($2^m$) was proposed by Laws and Rushforth [4]. In their design, the elements of the Galois field are represented using a conventional basis. The architecture consists of a two-dimensional array of $m^2$ identical cells which is a straightforward spatial iteration of the standard sequential shift register multiplier. The computation requires about $2m$ gate delays. The hardware used is programmable to implement fields based on different irreducible polynomials. Yeh, Reed, and Truong [5] proposed two systolic architectures for multiplication in Galois fields. Their design is also based on a Galois field represented by a conventional basis. The first architecture is a one-dimensional systolic array which is a serial-in-serial-out multiplier. The hardware has a through delay of $2m$ cycles and requires a minimum average time of $m$ cycles per computation. The second architecture is a two-dimensional systolic array with an average computation time of one cycle.

1063-8210/93$03.00 © 1993 IEEE

Wang et al. [6] proposed a set of architectures for implementing the Massey-Omura multiplication algorithm. The Massey-Omura algorithm uses a normal basis of the form $\{\alpha, \alpha^2, \alpha^4, \ldots, \alpha^{2^{m-1}}\}$. Two different architectures are proposed in [6] for multiplication and inversion. Inversion is done by performing multiplication repeatedly $m$ times. Although the multiplication hardware is efficient for small values of $m$, like $m = 4$, the authors point out that the method is impractical to realize for large $m$. Later, Wang and Pei [13] proposed a VLSI design for computing exponentiation in GF($2^m$) based on the multiplier described in [6].

Scott, Tavares, and Peppard [7] proposed a new algorithm and hardware for multiplication of elements in a Galois field represented by the conventional basis. The hardware is based on a bit-slice architecture and the data is input and output in a serial manner. The circuit complexity is $O(m)$ for both space and time. Okano and Imai [8] present methods of solving algebraic equations over the Galois field GF($2^m$). In this paper, they discuss how finite field arithmetic operations like multiplication and division can be performed using modulo addition and subtraction. The elements of a Galois field GF($2^m$) can be expressed as a power of $\alpha$ [1], which is called the "exponent expression". The exponents are obtained using table look-ups, and multiplication/division is achieved by performing modulo addition/subtraction on these exponent values.

A bit-serial linear systolic array is proposed for multiplication over GF($2^m$) by Zhou [9]. The hardware requires $(3m - 1)$ units of time per multiplication. Since the method is recursive, it cannot be pipelined for vector processing. The multiplication algorithm proposed by Pincin [10] is also based on the Massey-Omura algorithm. The algorithm is parallelizable, but implementation issues have not been considered. Furer and Mehlhorn [11] present some theoretical results analyzing the area-time complexities for VLSI implementation of Galois field multiplication in the general form GF($p^n$). A VLSI architecture for inversion in GF($2^m$) is proposed by Feng in [12]. The algorithm proposed by Feng requires $O(m \log_2 m)$ time for inversion. The inversion is achieved using a systolic serial-in parallel-out multiplier. Each multiplication itself requires $m$ clock pulses and the number of multiplications required per inversion is $O(\log_2 m)$.

Thus there exist different approaches for performing multiplication and division in hardware. Each approach has its merits and demerits. The purpose of this paper is to propose a single VLSI chip architecture that can perform both multiplication and division with an average computation time of one clock cycle. This is achieved by using a fully pipelined systolic array architecture. The multiplication and division operations can be interleaved. The proposed architecture and chip are targeted for applications that use Galois fields GF($2^m$), $m \le 8$.

III. MULTIPLICATION AND DIVISION IN GF($2^m$)

Finite fields with $2^m$ symbols are called Galois fields, GF($2^m$). Elements of the field can be represented as $m$-tuples over GF(2). It is conventional to represent each nonzero element as a power of a primitive element $\alpha$, where $\alpha$ is a root of $F(x)$, a primitive irreducible polynomial of degree $m$ over GF(2). The nonzero elements of GF($2^m$) can be represented as $1, \alpha, \alpha^2, \ldots, \alpha^{2^m-2}$. Each of these elements can be expressed as a sum of the elements $\{1, \alpha, \alpha^2, \ldots, \alpha^{m-1}\}$, which is commonly known as the conventional basis. A polynomial $F(x)$ of degree $m$ is said to be irreducible over the field GF(2) if $F(x)$ is not divisible by any polynomial of degree less than $m$ and greater than zero. An irreducible polynomial of degree $m$ over GF(2) is called primitive if it has a primitive element of GF($2^m$) as a root. Every Galois field has at least one such primitive element $\alpha$, the successive powers of which generate all the nonzero elements of the Galois field. For a complete overview of finite fields, the reader is referred to [1], [17]-[19]. In this section, we develop only the necessary background required for understanding the Galois field multiplication and division algorithm proposed in the next section.

Notation: The addition of two elements from GF($2^m$) as well as the addition of two integers will be expressed with the symbol "+", while bit-wise modulo 2 addition will be shown as "$\oplus$".

Definition 1: Let

$$\beta = \beta_{m-1}\alpha^{m-1} + \beta_{m-2}\alpha^{m-2} + \cdots + \beta_1\alpha + \beta_0$$

be a nonzero element of GF($2^m$), where $\alpha$ is a root of a primitive irreducible polynomial $F(x)$, where

$$F(x) = x^m + f_{m-1}x^{m-1} + f_{m-2}x^{m-2} + \cdots + f_1 x + 1$$

with $\beta_i, f_i \in$ GF(2).

For convenience, we define a function $\rho$ as follows:

$$\rho(\beta) = \alpha\beta \pmod{F(\alpha)}. \qquad (1)$$

Since $\beta$ is a power of $\alpha$, $\rho(\beta)$ is the element corresponding to the next power of $\alpha$. It is easy to verify that

$$\rho^i(\beta) = \rho(\rho^{i-1}(\beta)) = \rho^2(\rho^{i-2}(\beta)) = \cdots = \rho(\rho(\cdots\rho(\beta)\cdots)) = \alpha^i\beta. \qquad (2)$$

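Example 1: Let $\beta$ and $\delta = \rho(\beta)$ be two elements of GF($2^4$) and $F(x) = x^4 + x + 1$. Then $\beta$ and $\delta$ can be expressed as 4-tuples over GF(2). Further, the coefficients $\delta_3, \delta_2, \delta_1, \delta_0$ are related to $\beta_3, \beta_2, \beta_1, \beta_0$ as follows. Let

$$\beta = \beta_3\alpha^3 + \beta_2\alpha^2 + \beta_1\alpha + \beta_0$$

$$\delta = \delta_3\alpha^3 + \delta_2\alpha^2 + \delta_1\alpha + \delta_0.$$

Then,

$$\delta = \rho(\beta) = \alpha\beta \pmod{\alpha^4 + \alpha + 1}$$

$$= \beta_3\alpha^4 + \beta_2\alpha^3 + \beta_1\alpha^2 + \beta_0\alpha \pmod{\alpha^4 + \alpha + 1}$$

$$= \beta_3(\alpha + 1) + \beta_2\alpha^3 + \beta_1\alpha^2 + \beta_0\alpha$$

$$= \beta_2\alpha^3 + \beta_1\alpha^2 + (\beta_0 \oplus \beta_3)\alpha + \beta_3.$$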


Fig. 1. Implementation of circle rotation function.

Therefore,

$$\delta_3 = \beta_2, \quad \delta_2 = \beta_1, \quad \delta_1 = \beta_0 \oplus \beta_3, \quad \text{and} \quad \delta_0 = \beta_3.$$

Thus the elements of $\delta$ can be obtained by shifting the elements of $\beta$ circularly to the left and introducing $\beta_3$ into the two rightmost positions. In further discussion, we will refer to this function as circle rotation. The coefficients of $\delta$ can be computed from the elements of $\beta$ by implementing the simple circuit shown in Fig. 1.
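The circle rotation of Example 1 can be sketched in software as a shift followed by a conditional reduction. This is a minimal illustration, not the circuit of Fig. 1: the function name `circle_rotate` and the encoding of an element as an integer whose bit $i$ holds $\beta_i$ are assumptions made here.

```python
# Sketch of circle rotation (multiplication by alpha) in GF(2^4).
# Assumption: an element is an int whose bit i is the coefficient beta_i.
M = 4
POLY = 0b10011  # F(x) = x^4 + x + 1, the polynomial of Example 1


def circle_rotate(beta: int) -> int:
    """Return alpha * beta modulo F(alpha)."""
    shifted = beta << 1
    if shifted & (1 << M):   # beta_3 was 1, so alpha^4 appeared
        shifted ^= POLY      # replace alpha^4 by alpha + 1
    return shifted


# Rotating alpha^3 (1000) gives alpha^4 = alpha + 1 (0011).
print(format(circle_rotate(0b1000), "04b"))
```

The XOR with `POLY` clears the overflow bit and adds $\beta_3$ into the two rightmost positions, which matches $\delta_1 = \beta_0 \oplus \beta_3$ and $\delta_0 = \beta_3$.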

Theorem 1: Given two elements $\beta, \gamma \in$ GF($2^m$) with $\beta = \alpha^i$ and $\gamma = \alpha^j$, $0 \le i, j \le 2^m - 2$, $\beta$ can be expressed as

$$\beta = \rho^k(\gamma), \quad 0 \le k \le 2^m - 2. \qquad (3)$$

This expression implies that every nonzero element of GF($2^m$) can be obtained from another nonzero element of GF($2^m$) by a specific number of circle rotations.

Proof: For $i, j, k$ with $0 \le i, j, k \le 2^m - 2$, there exists a $k$ such that $|j + k| \bmod (2^m - 1) = i$. Consequently,

$$i = |k + j| \bmod (2^m - 1)$$

$$\alpha^i = \alpha^j\alpha^k \Rightarrow \alpha^i = \rho^k(\alpha^j) \Rightarrow \beta = \rho^k(\gamma).$$

Example 2: Let $m = 4$, $F(x) = x^4 + x + 1$, $\beta = \alpha^4$ and $\gamma = \alpha^{13}$. Then,

$$\beta = \alpha^4 = \rho^4(\alpha^0) = \rho^4(\alpha^2\alpha^{13}) = \rho^6(\alpha^{13}) = \rho^6(\gamma).$$

The number of circle rotations needed can always be computed by solving for $k$ in (3). Since $4 = |k + 13| \bmod 15$, $k = 6$.

Theorem 2: The element $\beta$ can also be obtained as

$$\beta = \rho^k(\alpha^{jn}), \quad n \in \{0, 1, 2, \ldots, \lfloor (2^m - 2)/j \rfloor\} \text{ and } k < j \qquad (4)$$

where $\lfloor x \rfloor$ denotes the largest integer $\le x$. In this case, there exist several elements $\alpha^{jn}$ that satisfy (4). Consecutive elements $\alpha^{jn}$ that are used in (4) differ by a power of $\alpha$ equal to $j$. Thus the number of rotations needed to obtain $\beta$ in the worst case is equal to $(j - 1)$. All the elements $\alpha^{jn}$ will be of great importance in our discussion and, for reasons that will be explained in context, will be called patterns.

Proof: It is clear that by choosing $\gamma = \alpha^j$ all the nonzero elements of GF($2^m$) can be divided into $\lfloor (2^m - 2)/j \rfloor + 1$ subsets, each starting with element $\alpha^{jn}$, $n \in \{0, 1, \ldots, \lfloor (2^m - 2)/j \rfloor\}$, and containing $j$ elements (except for the last subset, which may contain fewer than $j$). All the elements from any subset contain powers of $\alpha$ equal to $\{jn, jn + 1, jn + 2, \ldots, jn + j - 1\}$. Therefore, it follows that

$$i \in \begin{cases} \{jn, jn + 1, \ldots, jn + j - 1\} & \text{in general} \\ \{jn, jn + 1, \ldots, 2^m - 2\} & \text{for the last subset} \end{cases}$$

and $k$ must satisfy the relation $k = i - jn$, $k < j$, or equivalently $i = k + jn$. Thus, since $\beta = \alpha^i$,

$$\alpha^i = \alpha^k\alpha^{jn}$$

$$\beta = \rho^k(\alpha^{jn}), \quad k < j.$$

Example 3: Let $m = 4$, $F(x) = x^4 + x + 1$, $\beta = \alpha^3$, and $\gamma = \alpha^4$. By choosing $\gamma$ appropriately, four subsets of GF($2^m$) are formed. The first subset contains elements $\{\alpha^0, \alpha^1, \ldots, \alpha^3\}$, the second $\{\alpha^4, \alpha^5, \ldots, \alpha^7\}$, the third $\{\alpha^8, \alpha^9, \ldots, \alpha^{11}\}$ and the last subset $\{\alpha^{12}, \alpha^{13}, \alpha^{14}\}$. Initially, $\beta$ is compared to all the patterns $\{\alpha^0, \alpha^4, \alpha^8, \alpha^{12}\}$. If it does not match any of the patterns, then all the patterns are rotated. From (1), circle rotation of patterns increments their power by one, so after the first rotation $\beta$ will be compared to the set of patterns $\{\alpha^1, \alpha^5, \alpha^9, \alpha^{13}\}$. After three rotations, $\beta$ will be compared to $\{\alpha^3, \alpha^7, \alpha^{11}\}$. Now, $\beta$ will equal the rotated value of the first pattern and the procedure is complete. It is also important to notice that after a certain number of rotations, the last pattern will equal $\alpha^0$, which is the value of the first pattern in the initial step, and hence the comparisons to the last pattern will not be necessary.

Theorem 3: For two elements of GF($2^m$), $\beta_1$ and $\beta_2$, where $\beta_1 = \rho^{k_1}(\alpha^{jn_1})$ and $\beta_2 = \rho^{k_2}(\alpha^{jn_2})$, the product of the elements $\beta_1, \beta_2$ can be expressed as $\beta_1\beta_2 = \rho^{k_3}(\alpha^{jn_3})$ where

$$k_3 = \big|\,|(jn_1 + k_1) + (jn_2 + k_2)| \bmod (2^m - 1)\,\big| \bmod j$$

$$n_3 = \big\lfloor \big(|(jn_1 + k_1) + (jn_2 + k_2)| \bmod (2^m - 1)\big)/j \big\rfloor.$$

Proof:

$$\beta_1\beta_2 = \alpha^{k_1 + jn_1}\alpha^{k_2 + jn_2} = \alpha^{|k_1 + jn_1 + k_2 + jn_2| \bmod (2^m - 1)}.$$

The exponent of $\alpha$ corresponding to the result can be computed as

$$|k_1 + jn_1 + k_2 + jn_2| \bmod (2^m - 1)$$

$$= \big|\,|(jn_1 + k_1) + (jn_2 + k_2)| \bmod (2^m - 1)\,\big| \bmod j + j\big\lfloor \big(|(jn_1 + k_1) + (jn_2 + k_2)| \bmod (2^m - 1)\big)/j \big\rfloor$$

$$= k_3 + jn_3, \quad 0 \le k_3 < j, \quad 0 \le n_3 \le \lfloor (2^m - 2)/j \rfloor.$$

Therefore,

$$\beta_1\beta_2 = \alpha^{k_3 + jn_3} = \rho^{k_3}(\alpha^{jn_3}).$$

The significance of this result will be illustrated by the next example.

Example 4: Let $m = 4$, and let all the integers $k_1, n_1, k_2, n_2, k_3, n_3$ be represented in binary form. Let also $j$ be a power of 2, for example, $j = 2^2 = 4$. In this case every two nonzero elements $\beta_1, \beta_2 \in$ GF($2^4$) can be represented as

$$\beta_1 = \rho^{k_1}(\alpha^{4n_1}), \quad \beta_2 = \rho^{k_2}(\alpha^{4n_2}), \quad 0 \le k_1, k_2 < 4; \quad 0 \le n_1, n_2 \le 3.$$

If $k_1, n_1, k_2, n_2$ are represented as binary numbers, then the product $\beta_1\beta_2$ can be computed very easily. Because $j = 2^x$, the least significant $x$ bits in the $m$-bit representation of a product $jn$ will be zero, while the remaining $m - x$ bits will equal $n$. In this case, the patterns will be $\alpha^{0000}, \alpha^{0100}, \alpha^{1000}, \alpha^{1100}$. Further, the addition $k + jn$ can be done just by replacing the $x$ least significant zeros with $k$. After modulo $(2^m - 1)$ addition of the two binary numbers (operand powers), the reverse procedure to obtain $n_3$ and $k_3$ is performed by copying the higher and the lower portion of the result to $n_3$ and $k_3$, respectively.
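The bit-field view of Example 4 can be sketched as follows, assuming $m = 4$ and $j = 2^2$. The helper names `join`, `split` and `multiply_powers` are illustrative only and do not appear in the paper; the sketch works on exponents alone and does not touch field elements.

```python
# Sketch of Theorem 3 for m = 4, j = 4: exponents are handled as
# 4-bit numbers whose upper two bits are n and lower two bits are k.
M = 4
X = 2                      # j = 2^X = 4
ORDER = (1 << M) - 1       # 2^m - 1 = 15


def join(k: int, n: int) -> int:
    """Form the power jn + k by placing k in the X low-order bits."""
    return (n << X) | k


def split(power: int) -> tuple[int, int]:
    """Recover (k, n) by copying the lower and higher bit fields."""
    return power & ((1 << X) - 1), power >> X


def multiply_powers(k1: int, n1: int, k2: int, n2: int) -> tuple[int, int]:
    """Return (k3, n3) for the product of rho^k1(a^(4 n1)) and rho^k2(a^(4 n2))."""
    return split((join(k1, n1) + join(k2, n2)) % ORDER)


# alpha^6 * alpha^13: (k1, n1) = (2, 1), (k2, n2) = (1, 3)
print(multiply_powers(2, 1, 1, 3))
```

Here $6 + 13 = 19$, and $19 \bmod 15 = 4$, so the product is $\alpha^4$ with $k_3 = 0$ and $n_3 = 1$.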

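The binary representation of $k$ and $n$ leads to a simple method for computing the quotient $\beta_1/\beta_2$:

Theorem 4: For elements $\beta_1, \beta_2$ in GF($2^m$), $\beta_1/\beta_2 = \rho^{k_3}(\alpha^{jn_3})$, where

$$k_3 = \big|\,|(jn_1 + k_1) + \overline{(jn_2 + k_2)}| \bmod (2^m - 1)\,\big| \bmod j$$

$$n_3 = \big\lfloor \big(|(jn_1 + k_1) + \overline{(jn_2 + k_2)}| \bmod (2^m - 1)\big)/j \big\rfloor$$

where $\bar{x}$ denotes the one's complement of $x$.

Proof: From the properties of GF($2^m$), $\alpha^{(2^m - 1)} = \alpha^0$. Therefore,

$$\alpha^x = \alpha^x\alpha^0 = \alpha^x\alpha^{(2^m - 1)} = \alpha^{(x + 2^m - 1)}$$

$$\alpha^x/\alpha^y = \alpha^{(x - y)} = \alpha^{(x - y + 2^m - 1)} = \alpha^{(x + (-y + 2^m - 1))}.$$

If $(-y + 2^m - 1)$ is represented in binary form,

$$-y + 2^m - 1 = -(y_{m-1}2^{m-1} + y_{m-2}2^{m-2} + \cdots + y_1 2 + y_0) + 2^m - 1.$$

Since $2^m$ can be represented as $2^{m-1} + 2^{m-2} + \cdots + 2 + 1 + 1$,

$$-y + 2^m - 1 = -(y_{m-1}2^{m-1} + y_{m-2}2^{m-2} + \cdots + y_1 2 + y_0) + (2^{m-1} + 2^{m-2} + \cdots + 2 + 1 + 1) - 1$$

$$= 2^{m-1}(1 - y_{m-1}) + 2^{m-2}(1 - y_{m-2}) + \cdots + 2(1 - y_1) + (1 - y_0) = \bar{y}.$$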
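The identity $-y + 2^m - 1 = \bar{y}$ can be checked numerically. The sketch below assumes $m = 4$ and uses hypothetical helper names (`ones_complement`, `divide_powers`); it operates on exponents only.

```python
# Sketch of the division rule: subtracting an exponent modulo 2^m - 1
# is the same as adding its m-bit one's complement.
M = 4
ORDER = (1 << M) - 1   # 2^m - 1 = 15, also the m-bit all-ones mask


def ones_complement(y: int) -> int:
    """Invert the m low-order bits of y."""
    return y ^ ORDER


def divide_powers(x: int, y: int) -> int:
    """Exponent of alpha^x / alpha^y, computed with an addition only."""
    return (x + ones_complement(y)) % ORDER


# alpha^4 / alpha^13: the complement of 13 (1101) is 2 (0010), and 4 + 2 = 6.
print(divide_powers(4, 13))
```

The result agrees with $(4 - 13) \bmod 15 = 6$, that is, $\alpha^4/\alpha^{13} = \alpha^6$.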

IV. THE GMA ALGORITHM

In this section, we present a new algorithm, called the GMA algorithm, for the computation of Galois field based multiplication and division. The algorithm exploits the various properties described in Section III for efficient computation and leads to a simple hardware implementation. The arithmetic operations, i.e., the multiplication of two elements $\beta_1$ and $\beta_2$ from a Galois field GF($2^m$) or their division $\beta_1/\beta_2$, can be achieved using modulo $(2^m - 1)$ addition in the following manner: (i) for $\beta_1$ and $\beta_2$, find the corresponding values for the $k_1, n_1$ and $k_2, n_2$ pairs, (ii) using the $k_1, n_1$ and $k_2, n_2$ pairs, compute $k_3, n_3$, which correspond to $\beta_3$, the resultant product or quotient, and finally, (iii) transform $k_3$ and $n_3$ into the actual result $\beta_3$. The flowchart of the proposed algorithm is given in Fig. 2. The algorithm is described in the rest of this section.

The elements of GF($2^m$) are divided into subsets for a particular $\gamma$ value as in Theorem 2. Each subset $n$ corresponds to a pattern $\pi_n$. For every pattern $\pi_n = \alpha^{jn}$, the product $jn$ will be denoted as the pattern power. The given input $\beta_i$ is compared with each pattern $\pi_n$, where $n = 0$ to $\lfloor (2^m - 2)/j \rfloor$. As the input $\beta_i$ is compared with the first pattern $\pi_{n=0}$, if they do not match, the pattern is circle rotated once and then compared again. This process is repeated until a match occurs. If there is no match, the loop is executed $j$ times, since it is only possible to arrive at $j$ elements from any single pattern within a subset through circle rotation. The values of $k$ and $n$ that correspond to a successful match become the desired $k_i$ and $n_i$ values. Thus for the inputs $\beta_1$ and $\beta_2$ the corresponding $k_1, n_1$ and $k_2, n_2$ values are obtained. The above steps are repeated for every pattern $\pi_n$.

Fig. 2. Flowchart of the GMA algorithm.

The next step is to perform modulo $(2^m - 1)$ addition as explained in Theorem 4 in order to derive $k_3$ and $n_3$. It should be noted that the computation of $k_3$ and $n_3$ would have been complex if $\gamma = \alpha^j$ was chosen such that $j$ is not a power of two. Once the values $k_3$ and $n_3$ have been computed, they are used to find the resultant element $\beta_3$. The resultant element $\beta_3$ is determined using a method which is a reversal of the process described in the previous paragraph. The value $n_3$ is compared with each value of $n$ and then the pattern $\pi_n$ corresponding to the matching $n$ is selected as the intermediary result $\beta_i$. This intermediary result is circle rotated $k_3$ times in order to arrive at the final result $\beta_3$. It is important to observe that the same algorithm can be used to compute multiplication as well as division, with the only change being in the modulo addition step. For computing the quotient in a division operation, the values $k_2$ and $n_2$ are inverted before the addition step.

Fig. 3. SIGMA chip architecture, showing the pattern recognition systolic processor (P1), the power computation processor (P2), and the result rotation processor (P3).
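The three steps of the GMA algorithm can be sketched end to end in software for $m = 4$, $j = 4$ and $F(x) = x^4 + x + 1$. This is a sequential illustration under stated assumptions, not the pipelined hardware: the names `recognize` and `gma`, the integer encoding of elements, and the choice to return 0 when an operand is not recognized (for example a zero input) are all assumptions made here.

```python
# Sequential sketch of the GMA algorithm for GF(2^4), j = 4.
M, J = 4, 4
ORDER = (1 << M) - 1   # 15
POLY = 0b10011         # x^4 + x + 1


def circle_rotate(beta: int) -> int:
    """Multiply by alpha modulo F(alpha)."""
    shifted = beta << 1
    return shifted ^ POLY if shifted & (1 << M) else shifted


def build_patterns() -> list[int]:
    """Return the patterns alpha^0, alpha^4, alpha^8, alpha^12."""
    patterns, element = [], 1
    for power in range(ORDER):
        if power % J == 0:
            patterns.append(element)
        element = circle_rotate(element)
    return patterns


PATTERNS = build_patterns()


def recognize(beta: int):
    """Step (i): find (k, n) with beta = rho^k(pattern n), first match wins."""
    for n, pattern in enumerate(PATTERNS):
        candidate = pattern
        for k in range(J):
            if candidate == beta:
                return k, n
            candidate = circle_rotate(candidate)
    return None   # no nonzero element matched (beta == 0)


def gma(beta1: int, beta2: int, divide: bool = False) -> int:
    found1, found2 = recognize(beta1), recognize(beta2)
    if found1 is None or found2 is None:
        return 0   # assumption: unrecognized operand yields 0
    (k1, n1), (k2, n2) = found1, found2
    power1, power2 = J * n1 + k1, J * n2 + k2
    if divide:
        power2 ^= ORDER                    # one's complement before adding
    power3 = (power1 + power2) % ORDER     # step (ii)
    k3, n3 = power3 % J, power3 // J
    result = PATTERNS[n3]                  # step (iii): select, then rotate
    for _ in range(k3):
        result = circle_rotate(result)
    return result


print(gma(0b0011, 0b1000))                # alpha^4 * alpha^3
print(gma(0b1011, 0b1000, divide=True))   # alpha^7 / alpha^3
```

With this encoding $\alpha^7$ is `0b1011` and $\alpha^4$ is `0b0011`, so the two calls print 11 and 3.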

V. VLSI ARCHITECTURE FOR GMA ALGORITHM

In this section, we describe the architecture of SIGMA, a Systolic Implementation of the GMA algorithm. The architecture exploits the principles of pipelining and parallelism to obtain high speed and throughput. The architecture is systolic and implemented as a multistage linear static pipeline. Thus once the pipeline is filled, we can obtain a new result every clock cycle. The architecture has been implemented as a single chip using CMOS VLSI technology. The design and implementation of the SIGMA chip will be described in the next section.

A. The SIGMA Architecture

The system architecture of SIGMA is given in Fig. 3. The basic architecture consists of three computational processors: (i) the pattern recognition processor (P1), (ii) the power computation processor (P2), and (iii) the result rotation processor (P3). The three processors are organized as a pipeline structure and each processor itself is organized internally as a linear multistage pipe. The pattern recognition processor P1 is organized as a systolic array. It is named so since the input data is compared with the pattern that is stored in each processing element within this systolic array. The power computation processor performs modulo $(2^m - 1)$ addition or subtraction depending on whether multiplication or division is being computed. The purpose of the result rotation processor is to circle rotate the intermediary result pattern that is output by P1. Although the proposed architecture can be used to implement the GMA algorithm for Galois fields GF($2^m$) of higher values of $m$, we will describe the architecture using an example Galois field GF($2^4$). The architecture is programmable in order to choose from a set of irreducible polynomials.

Now, we will describe the proposed architecture with an example Galois field where $m = 4$, $j = 4$ and a choice from two irreducible polynomials, $x^4 + x + 1$ and $x^4 + x^3 + 1$. For this example, the pattern recognition processor will consist of four processing elements with each processing element consisting of four stages. Each stage corresponds to one hardware clock cycle. The power computation processor consists of two stages, which is the same for any choice of $m$. The result rotation processor consists of three stages since the result can be reached from the intermediary pattern with a maximum of three rotations.

The two input operands $x = \beta_1$ and $y = \beta_2$, along with a single bit signal op indicating the operation (multiplication/division) that is to be performed, are provided to the first processing element in the pattern recognition processor. Each processing element stores a pattern that corresponds to a subset of the elements of GF($2^4$). The two input operands are compared with this pattern in parallel. If they match, the corresponding values of $n$ and $k$ are selected. If the match is unsuccessful, the pattern is circle rotated once and compared again. This is repeated for all the three possible rotations. If any of the four comparisons is successful, the corresponding values of $n$ and $k$ are forwarded to the next processing element during the fifth clock cycle. A single-bit signal is sent along with the $n$ and $k$ values to indicate that a match has already occurred. This control signal is interpreted by the subsequent processors in deciding whether to perform comparison or not. After 16 cycles, the values of $n$ and $k$ for both the operands emerge out of P1 and are available to P2 during the 17th cycle. The power computation processor takes two cycles to perform modulo $(2^m - 1)$ addition or subtraction of the inputs. The resultant power computed by P2 is passed on to the last processing element in P1. This data passes through the systolic array in a backward direction during the next four cycles. The resultant power is compared with the powers of the patterns stored in the PE's. The pattern corresponding to the successful match is passed on as an intermediary result to P3 at the start of the 23rd cycle. This intermediary result is circle rotated $k_3$ times in P3, which is a three stage operation. Thus the final result is output by the system after 25 cycles.
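The cycle counts quoted above can be tallied with a small sketch. The function name `latency_schedule` and the dictionary keys are illustrative; the stage counts are those stated in the text for $m = 4$, $j = 4$, with one cycle per PE assumed for the backward pass (four cycles over four PE's).

```python
# Tally of the pipeline latency for the m = 4, j = 4 example.
def latency_schedule(pes: int = 4, stages_per_pe: int = 4,
                     p2_stages: int = 2, p3_stages: int = 3) -> dict:
    forward = pes * stages_per_pe          # operands traverse P1
    backward = pes                         # result power flows back through P1
    return {
        "p2_first_cycle": forward + 1,
        "p3_first_cycle": forward + p2_stages + backward + 1,
        "total_cycles": forward + p2_stages + backward + p3_stages,
    }


print(latency_schedule())
```

The tally reproduces the 17th cycle for P2, the 23rd cycle for P3, and a total latency of 25 cycles. Because the pipeline is static, this latency is paid once and a new result can then emerge every cycle.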

The SIGMA chip architecture shown in Fig. 3 indicates the flow of data and control signals through the various processors within the chip. The input signals px and py are always initialized to zero values; these buses are used to forward the powers selected as a result of successful pattern matches from one processor to the next. The single bit control signals rx and ry are initialized to zero. These will be set within some processing element to indicate that the buses px and py carry valid operand powers. The power computation processor P2 outputs the resultant power rp and a single bit rr which, when set, indicates that neither of the inputs x and y was zero. As the signals rp and rr flow through the systolic array, rp is compared with the pattern power stored in each processor. The input signal r is initialized to a zero value and the pattern recognition processor copies the selected intermediary result to the r bus. The r bus carries the intermediary result to the result rotation processor P3. The two least significant bits of rp are used in the result rotation processor in order to rotate the intermediary result r to get the final result. The final result is output on the r bus by P3. The set of input signals denoted as "Initialization and control" in Fig. 3 is used to preselect the polynomial and to preload the patterns and the corresponding pattern powers into the various PE's of the pattern recognition processor. This loading is done once at the beginning of the computation.

Fig. 4. Block diagram of pattern recognition systolic processor (P1), consisting of PE 0 through PE 3.

B. Pattern Recognition Processor (P1)

The pattern recognition processor is a systolic array of processors organized as a multistage linear static pipeline. The number of processors in the array is a function of $m$ and $j$, which decide the number of subsets within the chosen finite field GF($2^m$). In this subsection, we will describe the hardware organization of a single processing element that will be replicated in space to form the pattern recognition processor. The block diagram of the pattern recognition processor is given in Fig. 4. The pattern recognition processor consists of 4 PE's arranged as a systolic array. The flow of inputs and outputs through the various processing elements is shown in the figure. Each PE consists of three logic modules: the upper module, the central module and the lower module. The PE is divided into these modules on the basis of the function being performed in each of them. The detailed block diagram of a single PE in the pattern recognition processor is shown in Fig. 5. The central module in each PE stores (i) a pattern P corresponding to one of the subsets in the chosen Galois field GF($2^m$), (ii) a pattern power PP associated with the pattern P, and (iii) a polynomial selection bit PSB that is used to select one of the two polynomials implemented ($x^4 + x + 1$ or $x^4 + x^3 + 1$). The three values are stored in static registers P, PP, and PSB, respectively. The function of the lower module is to compare the inputs with the pattern stored in the central module (and its rotated values).

The Lower Module (LM) consists of four stages, LM0-LM3, organized as a linear pipeline. The block diagram of a single stage is shown in Fig. 6. The major components in a stage are (i) X logic, (ii) Y logic, and (iii) circle rotation logic. The X and Y logics are identical in terms of hardware. The circuitry for each consists of a comparator, some latches and multiplexers. The input element x is compared with the pattern P stored in the P register of the central module. If they match and the recognition bit rx is not set, then the pattern power PP is copied onto the output bus px and the recognition bit rx is set. If the recognition bit input rx has already been set, it indicates that there was a match in some previous stage, in which case the current match result is ignored. This logic helps to avoid the possible second match of the input element in the last PE of P1. A second match is possible if $j$ is chosen such that the last subset of GF($2^m$) has fewer than $j$ elements. If the match operation in the current stage is unsuccessful or the recognition bit has already been set, then the input values x, px, and rx are simply passed on to the next stage. The circle rotation logic implements the circle rotation function for the two primitive polynomials.

Fig. 5. Block diagram of a single PE in the pattern recognition systolic array P1.
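A behavioral sketch of the X logic across the four stages of one lower module, chained over four PE's, is given below. It is not a description of the actual circuit. In particular, the sketch assumes that the power driven onto px in stage $s$ is PP + $s$ (so that the low-order bits carry $k$), and it fixes $F(x) = x^4 + x + 1$; the names `lower_module` and `recognize_operand` are hypothetical.

```python
# Behavioral sketch of the X logic of P1 for GF(2^4), j = 4.
M = 4
POLY = 0b10011   # x^4 + x + 1 (assumed selected by PSB)


def circle_rotate(beta: int) -> int:
    shifted = beta << 1
    return shifted ^ POLY if shifted & (1 << M) else shifted


def lower_module(x: int, px: int, rx: bool, pattern: int, pattern_power: int):
    """Pass (x, px, rx) through stages LM0-LM3 of one PE."""
    for stage in range(4):
        if not rx and x == pattern:
            # First match only: copy the power and set the recognition bit.
            px, rx = pattern_power + stage, True   # assumption: PP + stage
        pattern = circle_rotate(pattern)           # next stage sees rotated P
    return x, px, rx


# (P, PP) register contents for PE 0 .. PE 3: alpha^0, alpha^4, alpha^8, alpha^12.
PES = [(0b0001, 0), (0b0011, 4), (0b0101, 8), (0b1111, 12)]


def recognize_operand(x: int):
    px, rx = 0, False          # buses initialized to zero
    for pattern, power in PES:
        x, px, rx = lower_module(x, px, rx, pattern, power)
    return px, rx


print(recognize_operand(0b1011))   # alpha^7
print(recognize_operand(0b0001))   # alpha^0, also reachable from the last PE
```

For $\alpha^0$ the last PE would match again after three rotations of $\alpha^{12}$, but the recognition bit is already set, so the power stays 0. A zero input never matches and leaves the recognition bit cleared.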

The Upper Module of P1 is shown in Fig. 7. The function of the upper module is to compare the input result power rp with the pattern power PP from the central module of the same PE. If the match is successful and the result recognition bit rr is set, then the pattern P from the P register in the central module is copied onto the r bus, which represents the intermediary result. If the match was not successful, the input values from the previous PE are passed on to the next PE.

C. The Power Computation Processor P2

The block diagram of the Power Computation Processor P2 is given in Fig. 8. The purpose of the power computation processor P2 is to compute the power of the result element, which corresponds to the product in the case of multiplication and the quotient in the case of division. This processor is a two-stage pipe that performs modulo (2^m - 1) addition/subtraction of the operand powers px and py. If the op bit is set low, indicating multiplication, modulo addition is performed; if the op bit is set high, modulo subtraction is performed. During the phi1 phase of the first cycle, the op bit is used to select the py value or its 1's complement. It should be noted that this is the only operation that differentiates between multiplication and division in the entire algorithm. Also, the recognition bits rx and ry are ANDed to set the value of rr, which indicates a valid result in the end. During the phi2 phase, the px and py values are added using a 4-bit carry look-ahead adder. During the phi1 phase of the second cycle, the result is checked to see whether it is larger than 14 (modulo 15 addition in our example). If it is larger, the result is incremented by one during the following phi2 phase in order to get the correct result in rp.

Fig. 6. Block diagram of one stage from the lower module.

Fig. 7. The upper module.

Fig. 8. The power computation processor (P2).

Fig. 9. Result recognition processor (P3).
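The modulo 15 power computation performed by P2 can be illustrated with a short Python sketch for the GF(2^4) example. The function names are hypothetical, the operand powers are assumed to lie in the range 0 to 14, and the adder's carry-out is modelled simply by keeping the full sum before the "larger than 14" check; the text does not spell out how the hardware handles that detail.

```python
M = 4                       # GF(2^4), as in the paper's example
MASK = (1 << M) - 1         # 0b1111
MODULUS = (1 << M) - 1      # powers are taken modulo 2^m - 1 = 15


def p2_power(px, py, op):
    """Return (px + py) mod 15 if op == 0, (px - py) mod 15 if op == 1.

    Assumes 0 <= px, py <= 14.
    """
    # Cycle 1, phi1: op selects py or its 1's complement (15 - py).
    operand = (~py & MASK) if op else py
    # Cycle 1, phi2: addition; the carry is kept in the Python int.
    total = px + operand
    # Cycle 2: if the sum exceeds 14, add one and keep the low 4 bits.
    if total > MODULUS - 1:
        total = (total + 1) & MASK
    return total


def p2_valid(rx, ry):
    """Result recognition bit rr: both operands must be recognised."""
    return rx & ry
```

For example, `p2_power(8, 9, 0)` gives 2 (17 mod 15) and `p2_power(3, 5, 1)` gives 13 (-2 mod 15). Complementing py is the only step that depends on op, in line with the remark that this selection is what distinguishes division from multiplication.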

D. The Result Rotation Processor (P3)

The architecture of the result rotation processor P3 is shown in Fig. 9. The hardware is organized as a three-stage pipeline and the computation takes a total of three cycles. The purpose of this processor is to circle rotate the intermediary result r that is output by P1. The number of circle rotations that will ever be required for an intermediary result is less than or equal to three. Each stage of the result rotation processor consists of (i) the circle rotation logic, (ii) a comparator, and (iii) a 2:1 multiplexer. The intermediary result r and the lower two bits of its power rp are input to the first stage of P3. The intermediary result r is circle rotated once in the phi1 phase of the first cycle. During the phi2 phase, the result power is checked using the comparator. If the value of rp is greater than zero, the rotated result is selected through the multiplexer. In the second stage, the intermediary result is rotated again if the value of rp[0:1] is greater than one. In the last stage, the result is rotated again if rp is greater than two. Thus, the result rotation processor P3 outputs the final resultant value r, which is the product or quotient depending on whether the operation performed was multiplication or division.

Fig. 10. Floorplan of SIGMA chip.
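The three-stage select-or-pass structure of P3 can be modelled as follows. This is a sketch under stated assumptions: the circle rotation itself is defined elsewhere in the paper, so a plain 4-bit cyclic left shift (`rotl4`) stands in for it purely as a placeholder, and both function names are hypothetical.

```python
def rotl4(value):
    """Placeholder for the circle rotation: 4-bit cyclic left shift.

    Assumption only; the paper's actual rotation logic may differ.
    """
    return ((value << 1) | (value >> 3)) & 0xF


def p3_rotate(r, rp, rotate=rotl4):
    """Apply 0 to 3 rotations to r, chosen by the lower two bits of rp."""
    count = rp & 0b11                 # only rp[0:1] enters P3
    for threshold in range(3):        # stages 1, 2, 3
        rotated = rotate(r)           # phi1: rotation logic always runs
        if count > threshold:         # phi2: comparator drives the 2:1 mux
            r = rotated               # select the rotated value
        # otherwise the unrotated value passes to the next stage
    return r
```

Each stage computes the rotated value unconditionally and lets the comparator choose between it and the unrotated input, which mirrors the rotation logic, comparator, and multiplexer listed for each stage. Passing a different `rotate` callable swaps in another rotation definition without changing the stage logic.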




Fig. 11. SIGMA chip microphotograph.

VI. VLSI IMPLEMENTATION AND PERFORMANCE

A prototype VLSI chip was designed using CMOS p-well

2-μm technology and was fabricated by MOSIS. The chip

implements the GMA algorithm and architecture described in

Sections IV and V. Since the entire architecture is a multistage linear static pipeline, the chip can produce one result every

clock cycle, after the pipe is filled. It should be noted that the

architecture has a through delay of O(2^m), which is equal to

the number of stages in the pipe and hence, is not suitable

for large m. Since many of the applications use only fields

of small m, the proposed architecture is targeted toward

such applications. The chip was designed using a 2-phase

nonoverlapping clocking scheme. The circuit was fitted on

a 6.68 × 4.48 mm² MOSIS standard frame. The floorplan of

the chip is given in Fig. 10. The placement of the various

processors and the pin assignments are shown in the floorplan.

The overall circuitry required a silicon area of 1.789 × 3.570 mm² and a total of 8799 transistors. As can be seen in

the floorplan, the pattern recognition processor P1 required

most of the silicon area. It occupies 1.789 × 2.318 mm² and

consists of 7184 transistors. Although the chip required a

total of only 26 pins, the circuitry was fitted in a @-pin

package with most of the remaining pins connected to various

internal nodes for extra testability. The prototype chip was




tested using the HP82000 high speed IC tester and was found

to be fully operational at 33.3 MHz. The performance can

be improved by omitting the extra connections and also,

additional speed-ups can be obtained by using sub-micron

technology. The critical delay of the chip depends only on

the four bit adder and the chip does not have any global

signals. Based on the performance of the prototype chip, the SIGMA chip can be improved in design so as to operate

at a speed as high as 40 MHz, yielding a throughput of 40

million multiplications/divisions per second. The SIGMA chip

microphotograph is shown in Fig. 11.
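The figures quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated above, except for `depth`, a hypothetical parameter standing for the number of pipeline stages, which this section does not state numerically.

```python
CLOCK_HZ = 33.3e6                       # measured operating frequency

# One result per clock once the pipe is full.
cycle_ns = 1e9 / CLOCK_HZ               # about 30 ns per result
results_per_second = CLOCK_HZ           # 33.3 million operations per second

# Share of the active area and transistor count taken by P1.
chip_area_mm2 = 1.789 * 3.570
p1_area_mm2 = 1.789 * 2.318
p1_area_share = p1_area_mm2 / chip_area_mm2     # about 0.65
p1_transistor_share = 7184 / 8799               # about 0.82


def total_cycles(n_ops, depth):
    """Cycles to complete n_ops back-to-back operations.

    depth is the number of pipeline stages (hypothetical parameter):
    the first result appears after depth cycles, then one per cycle.
    """
    return depth + n_ops - 1
```

The ratios are consistent with the statement that P1 takes most of the silicon area: roughly 65% of the active area and 82% of the transistors.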

VII. CONCLUSIONS

In this paper, we have presented a new algorithm for

performing multiplication as well as division of two elements

of GF(2^m). An efficient VLSI architecture for implementing

the proposed algorithm is described. The architecture is

systolic and exploits pipelining and parallelism possible in

order to obtain high speed and throughput. A CMOS VLSI chip, SIGMA, for GF(2^4) was designed, fabricated and

tested. The chip can yield a computation rate of 40 million

multiplications/divisions per second. The hardware can be

programmed for choosing different irreducible polynomials.

REFERENCES

[1] S. Lin, An Introduction to Error-Correcting Codes. Englewood Cliffs: Prentice-Hall, 1970.

[2] J. H. McClellan and C. M. Rader, Number Theory in Digital Signal Processing. Englewood Cliffs: Prentice-Hall, 1979.

[3] S. Berkovits, J. Kowaltchuk, and B. Schanning, "Implementing public key scheme," IEEE Commun. Mag., vol. 17, pp. 2-3, May 1979.

[4] B. A. Laws and C. K. Rushforth, "A cellular-array multiplier for GF(2^m)," IEEE Trans. Computers, vol. C-20, pp. 1573-1578, Dec. 1971.

[5] C. S. Yeh, I. S. Reed, and T. K. Truong, "Systolic multipliers for finite fields GF(2^m)," IEEE Trans. Computers, vol. C-33, pp. 357-360, Apr. 1984.

[6] C. S. Wang et al., "VLSI architecture for computing multiplications and inverses in GF(2^m)," IEEE Trans. Computers, vol. C-34, pp. 709-716, Aug. 1985.

[7] P. A. Scott, S. E. Tavares, and L. E. Peppard, "A fast VLSI multiplier for GF(2^m)," IEEE J. Select. Areas Commun., vol. SAC-4, pp. 62-65, Jan. 1986.

[8] H. Okano and H. Imai, "A construction method of high-speed decoders using ROM's for BCH and RS codes," IEEE Trans. Computers, vol. C-36, pp. 1165-1171, Oct. 1987.

[9] B. B. Zhou, "A new bit-serial systolic multiplier over GF(2^m)," IEEE Trans. Computers, vol. 37, pp. 749-751, June 1988.

[10] A. Pincin, "A new algorithm for multiplication in finite fields," IEEE Trans. Computers, vol. 38, pp. 1045-1049, July 1989.

[11] M. Furer and K. Mehlhorn, "AT^2 optimal Galois field multiplier for VLSI," IEEE Trans. Computers, vol. 38, pp. 1333-1336, Sept. 1989.

[12] G. L. Feng, "A VLSI architecture for fast inversion in GF(2^m)," IEEE Trans. Computers, vol. 38, pp. 1383-1386, Oct. 1989.

[13] C. C. Wang and D. Pei, "A VLSI design for computing exponentiations in GF(2^m)," IEEE Trans. Computers, vol. 39, pp. 258-262, Feb. 1990.

[14] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A Systems Perspective. Reading, MA: Addison-Wesley, 1988.

[15] K. Hwang and F. A. Briggs, Computer Architecture and Parallel Processing. McGraw Hill, 1984.

[16] T. C. Bartee and D. I. Schneider, "Computation with finite fields," Information and Control, no. 6, pp. 79-98, 1963.

[17] S. Lin and D. Costello, Error Control Coding. Englewood Cliffs, NJ: Prentice-Hall, 1983.

[18] W. W. Peterson and E. J. Weldon, Error Correcting Codes. Cambridge, MA: MIT Press, 1981.

[19] T. R. N. Rao and E. Fujiwara, Error-Control Coding for Computer Systems. Englewood Cliffs, NJ: Prentice-Hall, 1989.

Mario Kovac (S’WM’91) received the B.S. and M.S. degrees in computer science and engineering from the Faculty of Electrical Engineering, University of Zagreb, Croatia, in 1988 and 1991, respectively, where he is working toward the Ph.D. degree. He has been on the faculty of the University of Zagreb since 1989 and is currently holding a scientific assistant position. During 1990 and 1991, he was a visiting research scholar at the University of South Florida, Tampa. His research interests include computer architecture, parallel processing, VLSI, and implementation of algorithms and architectures in hardware (on both PCB and chip level).

N. Ranganathan (S’81-M’88-SM’92) was born in Tiruvaiyaru, India, in 1961. He received the B.E. (Hons) degree in electrical and electronics engineering from Regional Engineering College, Tiruchirapalli, University of Madras, India, in 1983, and the Ph.D. degree in computer science from the University of Central Florida, Orlando, in 1988.

His research interests include VLSI design and hardware algorithms, computer architecture, and parallel processing. He is currently involved in the design and implementation of VLSI architectures for computer vision, image processing, databases, data compression, and signal processing applications. He has been named the Program Co-chair for the 7th International Conference on VLSI Design to be held in Calcutta, India, in January 1994.

Dr. Ranganathan is a member of the IEEE Computer Society, the IEEE Computer Society Technical Committee on VLSI, the ACM, and the VLSI Society of India.

Murali R. Varanasi (S’72-M’73-SM’89) received the B.Sc. and D.M.I.T. degrees from Andhra University, India, and Madras Institute of Technology, India. He has also received the M.S. and Ph.D. degrees in electrical engineering from the University of Maryland, College Park, in 1972 and 1973, respectively.

From 1973 to 1980 he was with the Department of Electrical Engineering, Old Dominion University, Norfolk, VA. He is currently working as a Professor of Computer Science at the University of South Florida, Tampa. His research interests include coding theory, computer architecture, fault tolerant computing, and VLSI design.

Dr. Varanasi is a member of the IEEE Computer Society and is currently involved in educational and publications activities of that society. He is a member of ACM, Eta Kappa Nu, and Sigma Xi.
