22 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 1, NO. 1, MARCH 1993
SIGMA: A VLSI Systolic Array Implementation of a Galois Field $GF(2^m)$ Based Multiplication and Division Algorithm
Mario Kovac, Member, IEEE, N. Ranganathan, Senior Member, IEEE, and Murali Varanasi, Senior Member, IEEE
Abstract-Finite or Galois fields are used in numerous applications like error correcting codes, digital signal processing and cryptography. The design of efficient methods for Galois field arithmetic such as multiplication and division is critical for these applications. In this paper, we present a new algorithm based on a pattern matching technique for computing multiplication and division in $GF(2^m)$. An efficient systolic architecture is described for implementing the algorithm which can produce a new result every clock cycle, and the multiplication and division operations can be interleaved. The architecture has been implemented using 2-μm CMOS technology. The chip yields a computational rate of 33.3 million multiplications/divisions per second.
I. INTRODUCTION
IN RECENT YEARS, finite fields or Galois fields have been extensively applied in i) error correcting codes such as BCH codes and RS codes [1], ii) digital signal processing [2], iii) pseudorandom number generation [13], and iv) encryption and decryption protocols in cryptography [3] and space object tracking applications [16]. In many of the above applications, the finite field $GF(2^m)$, a number system made of $2^m$ elements, is used. The practical use of $GF(2^m)$ in various applications requires arithmetic operations like addition, multiplication and division. Thus efficient algorithms are required to perform these arithmetic operations on-the-fly. Addition in $GF(2^m)$ is relatively simple and straightforward. However, multiplication and division are more complex operations. Division is performed by multiplication by the inverse of the denominator element, and inversion itself is achieved through repeated multiplications.
In this paper, we propose a new algorithm for multiplication and division in $GF(2^m)$. The algorithm is designed for elements of $GF(2^m)$ represented by the conventional basis $\{1, \alpha, \alpha^2, \alpha^3, \ldots, \alpha^{m-1}\}$. The algorithm is based on a pattern matching and recognition approach and is amenable to hardware implementation. A VLSI architecture is described for implementing the proposed algorithm. The architecture is systolic and uses the principles of pipelining and parallelism to obtain high speed and throughput. The overall architecture is a multistage linear pipeline and hence can yield a new result every clock cycle.
Manuscript received June 8, 1992; revised October 14, 1992. M. Kovac was supported by Department of Science, Republic of Croatia. N. Ranganathan was supported in part by the National Science Foundation under Grant MIP-9010358 and by the Florida High Technology and Industry Council. M. Varanasi was supported in part by the AT&T Foundation.
M. Kovac is with the College of Electrical Engineering, University of Zagreb, Unska 3, 41000 Zagreb, Croatia.
N. Ranganathan is with the Center for Microelectronics Research, Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33620.
M. Varanasi is with the Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33620.
IEEE Log Number 9206232.
A prototype CMOS VLSI chip, SIGMA, implementing the architecture for Galois field $GF(2^4)$ has been designed, fabricated and tested. The prototype chip is operational at 33.3 MHz, yielding a throughput of 33.3 million multiplications/divisions per second. An important feature of our design is that the multiplication and division operations can be interleaved. The hardware is programmable for different primitive irreducible polynomials.
The outline of the paper is as follows. An overview of the various hardware approaches for Galois field $GF(2^m)$ is given in Section II. Section III develops the theoretical basis for our proposed algorithm for multiplication and division. The algorithm is described in Section IV. The SIGMA architecture is presented in Section V. Section VI describes the VLSI chip implementation and its performance. Conclusions are given in Section VII.
II. RELATED WORK
In recent years, many researchers have proposed hardware algorithms and architectures for performing arithmetic operations in Galois fields that can be implemented in VLSI [4]-[13]. Most approaches implement separate hardware for multiplication, inversion and division. In our approach, we propose a systolic hardware architecture that can perform both multiplication and division. We present here a brief overview of previous work on VLSI hardware methods for Galois field arithmetic.
A cellular array multiplier for $GF(2^m)$ was proposed by Laws and Rushforth [4]. In their design, the elements of the Galois field are represented using a conventional basis. The architecture consists of a two-dimensional array of $m^2$ identical cells which is a straightforward spatial iteration of the standard sequential shift register multiplier. The computation requires about $2m$ gate delays. The hardware used is programmable to implement fields based on different irreducible polynomials. Yeh, Reed, and Truong [5] proposed two systolic architectures for multiplication in Galois fields. Their design is also based on a Galois field represented by a conventional basis. The first architecture is a one-dimensional systolic array which is a serial-in-serial-out multiplier. The hardware has a through delay of $2m$ cycles and requires a minimum average time of $m$ cycles per computation. The second architecture is a two-dimensional systolic array with an average computation time of one cycle.
1063-8210/93$03.00 © 1993 IEEE
Wang et al. [6] proposed a set of architectures for implementing the Massey-Omura multiplication algorithm. The Massey-Omura algorithm uses a normal basis of the form $\{\alpha, \alpha^2, \alpha^4, \ldots, \alpha^{2^{m-1}}\}$. Two different architectures are proposed in [6] for multiplication and inversion. Inversion is done by performing multiplication repeatedly $m$ times. Although the multiplication hardware is efficient for small values of $m$, like $m = 4$, the authors point out that the method is impractical to realize for large $m$. Later, Wang and Pei [13] proposed a VLSI design for computing exponentiation in $GF(2^m)$ based on the multiplier described in [6].
Scott, Tavares, and Peppard [7] proposed a new algorithm and hardware for multiplication of elements in a Galois field represented by the conventional basis. The hardware is based on a bit-slice architecture and the data is input and output in a serial manner. The circuit complexity is $O(m)$ for both space and time. Okano and Imai [8] present methods of solving algebraic equations over the Galois field $GF(2^m)$. In that paper, they discuss how finite field arithmetic operations like multiplication and division can be performed using modulo addition and subtraction. The elements of a Galois field $GF(2^m)$ can be expressed as a power of $\alpha$ [1], which is called the "exponent expression". The exponents are obtained using table look-ups, and multiplication/division is achieved by performing modulo addition/subtraction on these exponent values.
A bit-serial linear systolic array is proposed for multiplication over $GF(2^m)$ by Zhou [9]. The hardware requires $(3m - 1)$ units of time per multiplication. Since the method is recursive, it cannot be pipelined for vector processing. The multiplication algorithm proposed by Pincin [10] is also based on the Massey-Omura algorithm. The algorithm is parallelizable, but implementation issues have not been considered. Furer and Mehlhorn [11] present some theoretical results analyzing the area-time complexities for VLSI implementation of Galois field multiplication in the general form $GF(p^n)$. A VLSI architecture for inversion in $GF(2^m)$ is proposed by Feng in [12]. The algorithm proposed by Feng requires $O(m \log_2 m)$ time for inversion. The inversion is achieved using a systolic serial-in parallel-out multiplier. Each multiplication itself requires $m$ clock pulses and the number of multiplications required per inversion is $O(\log_2 m)$.
Thus there exist different approaches for performing mul-
tiplication and division in hardware. Each approach has its
merits and demerits. The purpose of this paper is to propose
a single VLSI chip architecture that can perform both mul-
tiplication as well as division with an average computation
time of one clock cycle. This is achieved by using a fully
pipelined systolic array architecture. The multiplication and
division operations can be interleaved. The proposed archi-
tecture and chip are targeted for applications that use Galois
fields $GF(2^m)$, $m \le 8$.
III. MULTIPLICATION AND DIVISION IN $GF(2^m)$
Finite fields with $2^m$ symbols are called Galois fields, $GF(2^m)$. Elements of the field can be represented as $m$-tuples over $GF(2)$. It is conventional to represent each nonzero element as a power of a primitive element $\alpha$, where $\alpha$ is a root of $F(x)$, a primitive irreducible polynomial of degree $m$ over $GF(2)$. The nonzero elements of $GF(2^m)$ can be represented as $1, \alpha, \alpha^2, \ldots, \alpha^{2^m - 2}$. Each of these elements can be expressed as a sum of the elements $\{1, \alpha, \alpha^2, \ldots, \alpha^{m-1}\}$, which is commonly known as the conventional basis. A polynomial $F(x)$ of degree $m$ is said to be irreducible over the field $GF(2)$ if $F(x)$ is not divisible by any polynomial of degree less than $m$ and greater than zero. An irreducible polynomial of degree $m$ over $GF(2)$ is called primitive if it has a primitive element of $GF(2^m)$ as a root. Every Galois field has at least one such primitive element $\alpha$, the successive powers of which generate all the nonzero elements of the Galois field. For a complete overview of finite fields, the reader is referred to [1], [17]-[19]. In this section, we develop only the necessary background required for understanding the Galois field multiplication and division algorithm proposed in the next section.
Notation: The addition of two elements from $GF(2^m)$ as well as the addition of two integers will be expressed with the symbol "$+$", while bit-wise modulo 2 addition will be shown as "$\oplus$".
Definition 1: Let
$$\beta = \beta_{m-1}\alpha^{m-1} + \beta_{m-2}\alpha^{m-2} + \cdots + \beta_1\alpha + \beta_0$$
be a nonzero element of $GF(2^m)$ where $\alpha$ is a root of a primitive irreducible polynomial $F(x)$, where
$$F(x) = x^m + f_{m-1}x^{m-1} + f_{m-2}x^{m-2} + \cdots + f_1 x + 1$$
with $\beta_i, f_i \in GF(2)$.
For convenience, we define a function $\rho$ as follows:
$$\rho(\beta) = \alpha\beta \pmod{F(\alpha)}. \tag{1}$$
Since $\beta$ is a power of $\alpha$, $\rho(\beta)$ is the element corresponding to the next power of $\alpha$. It is easy to verify that
$$\rho^i(\beta) = \rho(\rho^{i-1}(\beta)) = \cdots = \rho(\rho(\cdots\rho(\beta)\cdots)) = \alpha^i\beta. \tag{2}$$
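Example 1: Let $\beta$ and $\delta = \rho(\beta)$ be two elements of $GF(2^4)$ and $F(x) = x^4 + x + 1$. Then $\beta$ and $\delta$ can be expressed as 4-tuples over $GF(2)$. Further, the coefficients $\delta_3, \delta_2, \delta_1, \delta_0$ are related to $\beta_3, \beta_2, \beta_1, \beta_0$ as follows. Let
$$\beta = \beta_3\alpha^3 + \beta_2\alpha^2 + \beta_1\alpha + \beta_0$$
$$\delta = \delta_3\alpha^3 + \delta_2\alpha^2 + \delta_1\alpha + \delta_0.$$
Then,
$$\delta = \rho(\beta) = \alpha\beta \pmod{\alpha^4 + \alpha + 1}$$
$$= \beta_3\alpha^4 + \beta_2\alpha^3 + \beta_1\alpha^2 + \beta_0\alpha \pmod{\alpha^4 + \alpha + 1}$$
$$= \beta_3(\alpha + 1) + \beta_2\alpha^3 + \beta_1\alpha^2 + \beta_0\alpha$$
$$= \beta_2\alpha^3 + \beta_1\alpha^2 + (\beta_0 \oplus \beta_3)\alpha + \beta_3.$$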
Fig. 1. Implementation of circle rotation function.
Therefore,
$$\delta_3 = \beta_2, \quad \delta_2 = \beta_1, \quad \delta_1 = \beta_0 \oplus \beta_3, \quad \text{and} \quad \delta_0 = \beta_3.$$
Thus the elements of $\delta$ can be obtained by shifting the elements of $\beta$ circularly to the left and introducing $\beta_3$ into the two rightmost positions. In further discussion, we will refer to this function as circle rotation. The coefficients of $\delta$ can be computed from the elements of $\beta$ by implementing the simple circuit shown in Fig. 1.
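The circle rotation of Example 1 can be sketched in software as a shift followed by a conditional XOR. This is only an illustrative model, not the circuit of Fig. 1: the function name `circle_rotate`, the integer encoding (bit $i$ holds $\beta_i$) and the parameter `poly_low` (the low $m$ coefficients of $F(x)$) are assumptions made for the sketch.

```python
def circle_rotate(beta: int, m: int = 4, poly_low: int = 0b0011) -> int:
    """Return rho(beta) = alpha * beta in GF(2^m).

    beta is an m-bit integer whose bit i is the coefficient beta_i.
    poly_low holds the coefficients f_{m-1}..f_0 of F(x) without the x^m term;
    0b0011 corresponds to F(x) = x^4 + x + 1 (assumed default).
    """
    overflow = (beta >> (m - 1)) & 1           # beta_{m-1}, the bit shifted out
    shifted = (beta << 1) & ((1 << m) - 1)     # multiply by alpha, keep m bits
    # Replace alpha^m by the lower-order terms of F(alpha) when it appears.
    return shifted ^ (poly_low if overflow else 0)


if __name__ == "__main__":
    # Successive rotations of 1 enumerate alpha^0, alpha^1, ..., alpha^14.
    value, powers = 1, []
    for _ in range(15):
        powers.append(value)
        value = circle_rotate(value)
    print(powers)
```

For $m = 4$ and $F(x) = x^4 + x + 1$ the XOR reproduces $\delta_1 = \beta_0 \oplus \beta_3$ and $\delta_0 = \beta_3$, and fifteen rotations of any nonzero element return it to its starting value.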
Theorem 1: Given two elements $\beta, \gamma \in GF(2^m)$ with $\beta = \alpha^i$ and $\gamma = \alpha^j$, $0 \le i, j \le 2^m - 2$, $\beta$ can be expressed as
$$\beta = \rho^k(\gamma), \quad 0 \le k \le 2^m - 2. \tag{3}$$
This expression implies that every nonzero element of $GF(2^m)$ can be obtained from another nonzero element of $GF(2^m)$ by a specific number of circle rotations.
Proof: For $i, j, k$, $0 \le i, j, k \le 2^m - 2$, there exists a $k$ such that $|j + k| \bmod (2^m - 1) = i$. Consequently,
$$i = |k + j| \bmod (2^m - 1)$$
$$\alpha^i = \alpha^j\alpha^k \Rightarrow \alpha^i = \rho^k(\alpha^j) \Rightarrow \beta = \rho^k(\gamma).$$
Example 2: Let $m = 4$, $F(x) = x^4 + x + 1$, $\beta = \alpha^4$ and $\gamma = \alpha^{13}$. Then,
$$\beta = \alpha^4 = \rho^4(\alpha^0) = \rho^4(\alpha^2\alpha^{13}) = \rho^6(\alpha^{13}) = \rho^6(\gamma).$$
The number of circle rotations needed can always be computed by solving for $k$ in (3). Since $4 = |k + 13| \bmod 15$, $k = 6$.
Theorem 2: The element $\beta$ can also be obtained as
$$\beta = \rho^k(\alpha^{jn}), \quad n \in \{0, 1, 2, \ldots, \lfloor (2^m - 2)/j \rfloor\} \text{ and } k < j \tag{4}$$
where $\lfloor x \rfloor$ denotes the largest integer $\le x$. In this case, there exist several elements $\alpha^{jn}$ that can be used in (4). Consecutive elements $\alpha^{jn}$ that are used in (4) differ by a power of $\alpha$ equal to $j$. Thus the number of rotations needed to obtain $\beta$ in the worst case is equal to $(j - 1)$. All the elements $\alpha^{jn}$ will be of great importance in our discussion and, for reasons that will be explained in context, will be called patterns.
Proof: It is clear that by choosing $\gamma = \alpha^j$ all the nonzero elements of $GF(2^m)$ can be divided into $\lfloor (2^m - 2)/j \rfloor + 1$ subsets, each starting with element $\alpha^{jn}$, $n \in \{0, 1, \ldots, \lfloor (2^m - 2)/j \rfloor\}$, and containing $j$ elements (except for the last subset, which may contain fewer than $j$). All the elements from any subset contain powers of $\alpha$ equal to $\{jn, jn + 1, jn + 2, \ldots, jn + j - 1\}$. Therefore, it follows that
$$\beta = \alpha^i, \quad i \in \begin{cases} \{jn, jn + 1, \ldots, jn + j - 1\} & \text{in general} \\ \{jn, jn + 1, \ldots, 2^m - 2\} & \text{for the last subset} \end{cases}$$
and $k$ must satisfy the relation $k = i - jn$, $k < j$, or equivalently $i = k + jn$. Thus
$$\alpha^i = \alpha^k\alpha^{jn} \Rightarrow \beta = \rho^k(\alpha^{jn}), \quad k < j.$$
Example 3: Let $m = 4$, $F(x) = x^4 + x + 1$, $\beta = \alpha^3$, and $\gamma = \alpha^4$. By choosing $\gamma$ appropriately, four subsets of $GF(2^m)$ are formed. The first subset contains the elements $\{\alpha^0, \alpha^1, \ldots, \alpha^3\}$, the second $\{\alpha^4, \alpha^5, \ldots, \alpha^7\}$, the third $\{\alpha^8, \alpha^9, \ldots, \alpha^{11}\}$ and the last subset $\{\alpha^{12}, \alpha^{13}, \alpha^{14}\}$. Initially, $\beta$ is compared to all the patterns $\{\alpha^0, \alpha^4, \alpha^8, \alpha^{12}\}$. If it does not match any of the patterns, then all the patterns are rotated. From (1), circle rotation of the patterns increments their power by one, so after the first rotation $\beta$ will be compared to the set of patterns $\{\alpha^1, \alpha^5, \alpha^9, \alpha^{13}\}$. After three rotations, $\beta$ will be compared to $\{\alpha^3, \alpha^7, \alpha^{11}\}$. Now, $\beta$ will equal the rotated value of the first pattern and the procedure is complete. It is also important to notice that after a certain number of rotations, the last pattern will equal $\alpha^0$, which is the value of the first pattern in the initial step, and hence the comparisons to the last pattern will not be necessary.
Theorem 3: For two elements of $GF(2^m)$, $\beta_1$ and $\beta_2$, where $\beta_1 = \rho^{k_1}(\alpha^{jn_1})$ and $\beta_2 = \rho^{k_2}(\alpha^{jn_2})$, the product of the elements $\beta_1, \beta_2$ can be expressed as $\beta_1\beta_2 = \rho^{k_3}(\alpha^{jn_3})$ where
$$k_3 = \bigl|\,|(jn_1 + k_1) + (jn_2 + k_2)| \bmod (2^m - 1)\bigr| \bmod j$$
$$n_3 = \bigl\lfloor \bigl(|(jn_1 + k_1) + (jn_2 + k_2)| \bmod (2^m - 1)\bigr) / j \bigr\rfloor.$$
Proof:
$$\beta_1\beta_2 = \alpha^{k_1 + jn_1}\alpha^{k_2 + jn_2} = \alpha^{|k_1 + jn_1 + k_2 + jn_2| \bmod (2^m - 1)}.$$
The exponent of $\alpha$ corresponding to the result can be computed as
$$|k_1 + jn_1 + k_2 + jn_2| \bmod (2^m - 1)$$
$$= \bigl|\,|(jn_1 + k_1) + (jn_2 + k_2)| \bmod (2^m - 1)\bigr| \bmod j + j\bigl\lfloor \bigl(|(jn_1 + k_1) + (jn_2 + k_2)| \bmod (2^m - 1)\bigr) / j \bigr\rfloor$$
$$= k_3 + jn_3, \quad 0 \le k_3 < j, \quad 0 \le n_3 \le \lfloor (2^m - 2)/j \rfloor.$$
Therefore,
$$\beta_1\beta_2 = \alpha^{k_3 + jn_3} = \rho^{k_3}(\alpha^{jn_3}).$$
The significance of this result will be illustrated by the next example.
Example 4: Let $m = 4$, and let all the integers $k_1, n_1, k_2, n_2, k_3, n_3$ be represented in binary form. Let also $j$ be a power of 2, for example, $j = 2^2 = 4$. In this case every two nonzero elements $\beta_1, \beta_2 \in GF(2^4)$ can be represented as
$$\beta_1 = \rho^{k_1}(\alpha^{4n_1}), \quad \beta_2 = \rho^{k_2}(\alpha^{4n_2}), \quad 0 \le k_1, k_2 < 4; \quad 0 \le n_1, n_2 \le 3.$$
If $k_1, n_1, k_2, n_2$ are represented as binary numbers, then the product $\beta_1\beta_2$ can be computed very easily. Because $j = 2^x$, the least significant $x$ bits in the $m$-bit representation of a product $jn$ will be zero, while the remaining $m - x$ bits will equal $n$. In this case, the patterns will be $\alpha^{0000}, \alpha^{0100}, \alpha^{1000}, \alpha^{1100}$. Further, the addition $k + jn$ can be done just by replacing the $x$ least significant zeros with $k$. After modulo $(2^m - 1)$ addition of the two binary numbers (the operand powers), the reverse procedure to obtain $n_3$ and $k_3$ is performed by copying the higher and the lower portion of the result to $n_3$ and $k_3$, respectively.
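The bit-field view of Example 4 can be illustrated with a short sketch, assuming $m = 4$ and $j = 2^x = 4$. The names `M`, `X`, `MOD` and `multiply_powers` are hypothetical and chosen only for this illustration; the sketch operates on the $(k, n)$ pairs, not on the field elements themselves.

```python
M = 4                      # field GF(2^M), assumed for the example
X = 2                      # j = 2**X = 4, so k occupies the X low bits
MOD = (1 << M) - 1         # 2^m - 1 = 15


def multiply_powers(k1: int, n1: int, k2: int, n2: int) -> tuple[int, int]:
    """Return (k3, n3) for the product of rho^k1(alpha^(j*n1)) and rho^k2(alpha^(j*n2))."""
    p1 = (n1 << X) | k1    # j*n1 + k1: n in the high bits, k replaces the low zeros
    p2 = (n2 << X) | k2    # j*n2 + k2
    p3 = (p1 + p2) % MOD   # modulo (2^m - 1) addition of the operand powers
    k3 = p3 & ((1 << X) - 1)   # lower portion of the result
    n3 = p3 >> X               # higher portion of the result
    return k3, n3


if __name__ == "__main__":
    # alpha^6 * alpha^13: (k1, n1) = (2, 1), (k2, n2) = (1, 3); 19 mod 15 = 4
    print(multiply_powers(2, 1, 1, 3))
```

Choosing $j$ as a power of two is what turns the `mod j` and division by $j$ of Theorem 3 into a mask and a shift.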
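The binary representation of $k$ and $n$ leads to a simple method for computing the quotient $\beta_1/\beta_2$:
Theorem 4: For elements $\beta_1, \beta_2$ in $GF(2^m)$, $\beta_1/\beta_2 = \rho^{k_3}(\alpha^{jn_3})$, where $k_3$ and $n_3$ are obtained as in Theorem 3 with the power $(jn_2 + k_2)$ replaced by $\overline{(jn_2 + k_2)}$, and where $\bar{x}$ denotes the one's complement of $x$.
Proof: From the properties of $GF(2^m)$, $\alpha^{(2^m - 1)} = \alpha^0$. Therefore,
$$\alpha^x = \alpha^x\alpha^0 = \alpha^x\alpha^{(2^m - 1)} = \alpha^{(x + 2^m - 1)}$$
$$\alpha^x / \alpha^y = \alpha^{(x - y)} = \alpha^{(x - y + 2^m - 1)} = \alpha^{(x + (-y + 2^m - 1))}.$$
If $(-y + 2^m - 1)$ is represented in binary form,
$$-y + 2^m - 1 = -(y_{m-1}2^{m-1} + y_{m-2}2^{m-2} + \cdots + y_1 2 + y_0) + 2^m - 1.$$
Since $2^m$ can be represented as
$$2^{m-1} + 2^{m-2} + \cdots + 2 + 1 + 1,$$
$$-y + 2^m - 1 = -(y_{m-1}2^{m-1} + y_{m-2}2^{m-2} + \cdots + y_1 2 + y_0) + (2^{m-1} + 2^{m-2} + \cdots + 2 + 1 + 1) - 1$$
$$= 2^{m-1}(1 - y_{m-1}) + 2^{m-2}(1 - y_{m-2}) + \cdots + 2(1 - y_1) + (1 - y_0) = \bar{y}.$$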
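The identity $-y + 2^m - 1 = \bar{y}$ used in the proof can be checked numerically. The sketch below assumes $m = 4$ by default; `divide_powers` is a hypothetical helper name that works on the exponents of $\alpha$ only.

```python
def divide_powers(p1: int, p2: int, m: int = 4) -> int:
    """Exponent of alpha^p1 / alpha^p2, computed by adding the one's complement of p2."""
    mask = (1 << m) - 1            # 2^m - 1, also the m-bit all-ones word
    p2_complement = ~p2 & mask     # one's complement of p2 in m bits = 2^m - 1 - p2
    return (p1 + p2_complement) % mask


if __name__ == "__main__":
    # alpha^2 / alpha^13 = alpha^(2 - 13 mod 15) = alpha^4
    print(divide_powers(2, 13))
```

No subtractor is needed: inverting the bits of the divisor power and reusing the modulo $(2^m - 1)$ adder gives the same exponent as $(p_1 - p_2) \bmod (2^m - 1)$.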
IV. THE GMA ALGORITHM
In this section, we present a new algorithm, called the GMA algorithm, for the computation of Galois field based multiplication and division. The algorithm exploits the various properties described in Section III for efficient computation and leads to a simple hardware implementation. The arithmetic operations, i.e., the multiplication of two elements $\beta_1$ and $\beta_2$ from a Galois field $GF(2^m)$ or their division $\beta_1/\beta_2$, can be achieved using modulo $(2^m - 1)$ addition in the following manner: (i) for $\beta_1$ and $\beta_2$, find the corresponding values for the $k_1, n_1$ and $k_2, n_2$ pairs, (ii) using the $k_1, n_1$ and $k_2, n_2$ pairs, compute $k_3, n_3$, which correspond to $\beta_3$, the resultant product or quotient, and finally, (iii) transform $k_3$ and $n_3$ into the actual result $\beta_3$. The flowchart of the proposed algorithm is given in Fig. 2. The algorithm is described in the rest of this section.
The elements of $GF(2^m)$ are divided into subsets for a particular $\gamma$ value as in Theorem 2. Each subset $n$ corresponds to a pattern $\pi_n$. For every pattern $\pi_n = \alpha^{jn}$, the product $jn$ will be denoted as the pattern power. The given input $\beta_i$ is compared with each pattern $\pi_n$, where $n = 0$ to $\lfloor (2^m - 2)/j \rfloor$. As the input $\beta_i$ is compared with the first pattern $\pi_{n=0}$, if they do not match, the pattern is circle rotated once and then compared again. This process is repeated until a match occurs. If there is no match, the loop is executed $j$ times, since it is only possible to arrive at $j$ elements from any single pattern within a subset through circle rotation. The values of $k$ and $n$ that correspond to a successful match become the desired $k_i$ and $n_i$ values. Thus for the inputs $\beta_1$ and $\beta_2$ the corresponding $k_1, n_1$ and $k_2, n_2$ values are obtained. The above steps are repeated for every pattern $\pi_n$.
Fig. 2. Flowchart of the GMA algorithm.
The next step is to perform modulo $(2^m - 1)$ addition as explained in Theorem 4 in order to derive $k_3$ and $n_3$. It should be noted that the computation of $k_3$ and $n_3$ would have been complex if $\gamma = \alpha^j$ was chosen such that $j$ is not a power of two. Once the values $k_3$ and $n_3$ have been computed, they are used to find the resultant element $\beta_3$. The resultant element $\beta_3$ is determined using a method which is a reversal of the process described in the previous paragraph. The value $n_3$ is compared with each value of $n$ and then the pattern $\pi_n$ corresponding to the matching $n$ is selected as the intermediary result. This intermediary result is circle rotated $k_3$ times in order to arrive at the final result $\beta_3$. It is important to observe that the same algorithm can be used to compute multiplication as well as division, with the only change being in the modulo addition step. For computing the quotient in a division operation, the values $k_2$ and $n_2$ are inverted before the addition step.
Fig. 3. SIGMA chip architecture.
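The three steps of the GMA algorithm can be sketched end to end in software. This is a sequential illustration under stated assumptions ($m = 4$, $j = 4$, $F(x) = x^4 + x + 1$, elements encoded as 4-bit integers), not a model of the pipelined hardware; all names (`rotate`, `PATTERNS`, `recognize`, `gma`) are hypothetical.

```python
M, J = 4, 4                      # assumed field size and subset length
MOD = (1 << M) - 1               # 2^m - 1 = 15, also the M-bit mask
POLY_LOW = 0b0011                # x^4 + x + 1 without the x^4 term


def rotate(beta: int) -> int:
    """Circle rotation: multiply beta by alpha."""
    overflow = (beta >> (M - 1)) & 1
    return ((beta << 1) & MOD) ^ (POLY_LOW if overflow else 0)


def build_patterns() -> list[tuple[int, int]]:
    """Return (pattern, pattern power) pairs for alpha^0, alpha^J, alpha^2J, ..."""
    patterns, value = [], 1
    for power in range(MOD):
        if power % J == 0:
            patterns.append((value, power))
        value = rotate(value)
    return patterns


PATTERNS = build_patterns()


def recognize(beta: int) -> int:
    """Step (i): find j*n + k by comparing beta with all patterns, rotating up to J-1 times."""
    current = [pattern for pattern, _ in PATTERNS]
    for k in range(J):
        for index, value in enumerate(current):
            if value == beta:            # first match wins, later matches are ignored
                return PATTERNS[index][1] + k
        current = [rotate(value) for value in current]
    raise ValueError("zero has no representation as a power of alpha")


def gma(beta1: int, beta2: int, divide: bool = False) -> int:
    """Multiply (or divide) two nonzero elements of GF(2^4)."""
    p1, p2 = recognize(beta1), recognize(beta2)
    if divide:
        p2 = ~p2 & MOD                   # one's complement of the divisor power
    p3 = (p1 + p2) % MOD                 # step (ii): modulo (2^m - 1) addition
    k3, n3 = p3 % J, p3 // J
    # Step (iii): select the pattern with power j*n3, then rotate it k3 times.
    result = next(pattern for pattern, power in PATTERNS if power == J * n3)
    for _ in range(k3):
        result = rotate(result)
    return result


if __name__ == "__main__":
    print(gma(0b0011, 0b1101))                 # alpha^4 * alpha^13
    print(gma(0b0100, 0b1101, divide=True))    # alpha^2 / alpha^13
```

Returning on the first match mirrors the remark in Example 3 that the last pattern eventually rotates back to $\alpha^0$; the later, redundant match is never reached.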
V. VLSI ARCHITECTURE FOR GMA ALGORITHM
In this section, we describe the architecture of SIGMA, a Systolic Implementation of the GMA algorithm. The architecture exploits the principles of pipelining and parallelism to
obtain high speed and throughput. The architecture is systolic
and implemented as a multistage linear static pipeline. Thus
once the pipeline is filled, we can obtain a new result every
clock cycle. The architecture has been implemented as a
single chip using CMOS VLSI technology. The design and
implementation of the SIGMA chip will be described in the
next section.
A. The SIGMA Architecture
The system architecture of SIGMA is given in Fig. 3. The basic architecture consists of three computational processors: (i) the pattern recognition processor (P1), (ii) the power computation processor (P2), and (iii) the result rotation processor (P3). The three processors are organized as a pipeline structure and each processor itself is organized internally as a linear multistage pipe. The pattern recognition processor P1 is organized as a systolic array. It is so named because the input data is compared with the pattern that is stored in each processing element within this systolic array. The power computation processor performs modulo $(2^m - 1)$ addition or subtraction depending on whether multiplication or division is being computed. The purpose of the result rotation processor is to circle rotate the intermediary result pattern that is output by P1. Although the proposed architecture can be used to implement the GMA algorithm for Galois fields $GF(2^m)$ of higher values of $m$, we will describe the architecture using an example Galois field $GF(2^4)$. The architecture is programmable in order to choose from a set of irreducible polynomials.
Now, we will describe the proposed architecture with an example Galois field where $m = 4$, $j = 4$ and a choice from two irreducible polynomials, $x^4 + x + 1$ and $x^4 + x^3 + 1$. For this example, the pattern recognition processor will consist of four processing elements with each processing element consisting of four stages. Each stage corresponds to one hardware clock cycle. The power computation processor consists of two stages, which is the same for any choice of $m$. The result rotation processor consists of three stages since the result can be arrived at from the intermediary pattern with a maximum of three rotations.
The two input operands $x = \beta_1$ and $y = \beta_2$, along with a single-bit signal op indicating the operation (multiplication/division) that is to be performed, are provided to the first processing element in the pattern recognition processor. Each processing element stores a pattern that corresponds to a subset of the elements of $GF(2^4)$. The two input operands are compared with this pattern in parallel. If they match, the corresponding values of $n$ and $k$ are selected. If the match is unsuccessful, the pattern is circle rotated once and compared again. This is repeated for all the three possible rotations. If any of the four comparisons is successful, the corresponding values of $n$ and $k$ are forwarded to the next processing element during the fifth clock cycle. A single-bit signal is sent along with the $n$ and $k$ values to indicate that a match has already occurred. This control signal is interpreted by the subsequent processors in deciding whether to perform the comparison or not. After 16 cycles, the values of $n$ and $k$ for both the operands emerge out of P1 and are available to P2 during the 17th cycle. The power computation processor takes two cycles to perform modulo $(2^m - 1)$ addition or subtraction of the inputs. The resultant power computed by P2 is passed on to the last processing element in P1. This data passes through the systolic array in a backward direction during the next four cycles. The resultant power is compared with the powers of the patterns stored in the PE's. The pattern corresponding to the successful match is passed on as an intermediary result to P3 at the start of the 23rd cycle. This intermediary result is circle rotated $k_3$ times in P3, which is a three stage operation. Thus the final result is output by the system after 25 cycles.
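The 25-cycle latency quoted above follows directly from the stage counts given in the text. The short sketch below reproduces that arithmetic; the function name `sigma_latency` and its parameters are hypothetical, and treating the stage counts as free parameters is an assumption made only for illustration.

```python
def sigma_latency(num_pes: int = 4, stages_per_pe: int = 4,
                  p2_stages: int = 2, p3_stages: int = 3) -> int:
    """Pipeline latency in clock cycles for one multiplication or division."""
    forward = num_pes * stages_per_pe   # operands traverse the lower modules of P1
    backward = num_pes                  # result power travels back, one cycle per PE
    return forward + p2_stages + backward + p3_stages


if __name__ == "__main__":
    # m = 4, j = 4: floor((2**4 - 2) / 4) + 1 = 4 processing elements
    print(sigma_latency())
```

With the defaults this gives 16 + 2 + 4 + 3 = 25 cycles. Since the pipeline is static, this is the fill latency only; once it is full, a new result emerges every clock cycle.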
The SIGMA chip architecture shown in Fig. 3 indicates the flow of data and control signals through the various processors within the chip. The input signals px and py are always initialized to zero values and are used to forward the powers selected as a result of successful pattern matches from one processor to the next. The single-bit control signals rx and ry are initialized to zero. These will be set within some processing element to indicate that the buses px and py carry valid operand powers. The power computation processor P2 outputs the resultant power rp and a single bit rr which, when set, indicates that none of the inputs x and y corresponded to a zero value. As the signals rp and rr flow through the systolic array, rp is compared with the pattern power stored in each processor. The input signal r is initialized to a zero value and the pattern recognition processor copies the selected intermediary result to the r bus. The r bus carries the intermediary result to the result rotation processor P3. The two least significant bits of rp are used in the result rotation processor in order to rotate the intermediary result r to get the final result. The final result is output on the r bus by P3. The set of input signals denoted as "Initialization and control" in Fig. 3 is used to preselect the polynomial and to preload the patterns and the corresponding pattern powers into the various PE's of the pattern recognition processor. This loading is done once at the beginning of the computation.
Fig. 4. Block diagram of pattern recognition systolic processor (P1).
B. Pattern Recognition Processor (P1)
The pattern recognition processor is a systolic array of processors organized as a multistage linear static pipeline. The number of processors in the array is a function of $m$ and $j$, which decide the number of subsets within the chosen finite field $GF(2^m)$. In this subsection, we will describe the hardware organization of a single processing element that will be replicated in space to form the pattern recognition processor. The block diagram of the pattern recognition processor is given in Fig. 4. The pattern recognition processor consists of 4 PE's arranged as a systolic array. The flow of inputs and outputs through the various processing elements is shown in the figure. Each PE consists of three logic modules: the upper module, the central module and the lower module. The PE is divided into these modules on the basis of the function being performed in each of them. The detailed block diagram of a single PE in the pattern recognition processor is shown in Fig. 5. The central module in each PE stores (i) a pattern P corresponding to one of the subsets in the chosen Galois field $GF(2^m)$, (ii) a pattern power PP associated with the pattern P, and (iii) a polynomial selection bit PSB that is used to select one of the two polynomials implemented ($x^4 + x + 1$ or $x^4 + x^3 + 1$). The three values are stored in static registers P, PP, and PSB, respectively. The function of the lower module is to compare the inputs with the pattern stored in the central module (and its rotated values).
The Lower Module (LM) consists of four stages, LM0-LM3, organized as a linear pipeline. The block diagram of a single stage is shown in Fig. 6. The major components in a stage are (i) X logic, (ii) Y logic, and (iii) circle rotation logic. The X and Y logics are identical in terms of hardware. The circuitry for each consists of a comparator, some latches and multiplexers. The input element x is compared with the pattern P stored in the P register of the central module. If they match and the recognition bit rx is not set, then the pattern power PP is copied onto the output bus px and the recognition bit rx is set. If the recognition bit input rx has already been set, it indicates that there was a match earlier in some previous stage, in which case the current match result is ignored. This logic helps to avoid the possible second match of the input element in the last PE of P1. A second match is possible if $j$ is chosen such that the last subset of $GF(2^m)$ has fewer than $j$ elements. If the match operation in the current stage is unsuccessful or the recognition bit has already been set, then the input values x, px, and rx are simply passed on to the next stage. The circle rotation logic implements the circle rotation function for the two primitive polynomials.
Fig. 5. Block diagram of a single PE in the pattern recognition systolic array P1.
The Upper Module of P1 is shown in Fig. 7. The function of the upper module is to compare the input result power rp with the pattern power PP from the central module of the same PE. If the match is successful and the result recognition bit rr is set, then the pattern P from the P register in the central module is copied onto the r bus, which represents the intermediary result. If the match was not successful, the input values from the previous PE are passed on to the next PE.
C . The Power C omputation Processor P2
The block diagram of the Power Computation Processor
P2 is given in Fig. 8. The purpose of the power computation
processor P2 is to compute the power of the result element
which corresponds to the product in case of multiplication
and the quotient in case of division. This processor is a two-stage
pipe that performs modulo (2^m - 1) addition/subtraction
of the operand powers px and py. If the op bit is set to low,
indicating multiplication, modulo addition is performed, and if
the op bit is set to high, modulo subtraction is performed.
During the phi1 phase of the first cycle, the op bit is used
to select the py value or its 1's complement. It should be
noted that this is the only operation that differentiates between
multiplication and division in the entire algorithm. Also, the
recognition bits rx and ry are ANDed to set the value of
rr, which indicates a valid result at the end. During the phi2
phase, the px and py values are added using a 4-bit carry look-ahead
adder. During the phi1 phase of the second cycle, the
result is checked to see whether it is larger than 14 (modulo 15 addition
in our example). If it is larger, this logic passes a signal which, during the following phi2 phase, increments the result by one
in order to get the correct result in rp.
Fig. 6. Block diagram of one stage from the lower module.
Fig. 7. The upper module.
Fig. 8. The power computation processor (P2).
Fig. 9. Result recognition processor (P3).
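The two-cycle computation described for P2 can be illustrated in software. The following is only a minimal sketch for the m = 4 example, not the chip's logic: the function names `p2_result_power` and `p2_result_valid` are hypothetical, the sum is held in an ordinary integer rather than in a 4-bit adder with a separate carry, and the input powers are assumed to lie in the range 0 to 14.

```python
M = 4                      # field GF(2^4), as in the prototype example
MASK = (1 << M) - 1        # 15 = 2^m - 1, also the 4-bit mask


def p2_result_power(px, py, op):
    """Return (px + py) mod 15 for op = 0, or (px - py) mod 15 for op = 1.

    Hypothetical helper; px and py are assumed to be in 0..14.
    """
    # Cycle 1, phi1: select py or its 1's complement (which equals 15 - py).
    operand = (~py & MASK) if op else py
    # Cycle 1, phi2: plain binary addition (kept wider than 4 bits here).
    total = px + operand
    # Cycle 2: a sum larger than 14 is corrected by adding one and keeping
    # only the low 4 bits, which has the same effect as subtracting 15.
    if total > MASK - 1:
        total += 1
    return total & MASK


def p2_result_valid(rx, ry):
    """rr is the AND of the two operand recognition bits."""
    return rx & ry
```

For instance, `p2_result_power(14, 14, 0)` gives 13, which is 28 mod 15, and `p2_result_power(3, 5, 1)` also gives 13, which is (3 - 5) mod 15. Adding the 1's complement works for subtraction because 15 - py is congruent to -py modulo 15.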
D. The Result Rotation Processor (P3)
The architecture of the result rotation processor P3 is shown
in Fig. 9. The hardware is organized as a three stage pipeline
and the computation takes a total of three cycles. The purpose
of this processor is to circle rotate the intermediary result r
that is output by P1. The number of circle rotations that will
ever be required for an intermediary result is less than or equal
to three. Each stage of the result rotation processor consists
of (i) the circle rotation logic, (ii) a comparator, and (iii) a
2:1 multiplexer. The intermediary result r and the lower two
bits of its power rp are input to the first stage of P3. The
intermediary result r is circle rotated once in the phi1 phase
of the first cycle. During the phi2 phase, the result power is
checked using the comparator. If the value of rp[0:1] is greater than zero, the rotated result is selected through the multiplexer. In the second stage, the intermediary result is rotated again if
the value of rp[0:1] is greater than one. In the last stage,
the result is rotated again if rp[0:1] is greater than two. Thus, the
result rotation processor P3 outputs the final resultant value r,
which is the product or quotient depending on whether the operation performed was multiplication or division.
Fig. 10. Floorplan of SIGMA chip.
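The three conditional rotations of P3 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' circuit: it assumes that one circle rotation advances an element to the next power of the primitive element (a left shift with feedback), that the primitive polynomial is x^4 + x + 1, and that P1 delivers the pattern whose power has the lower two bits cleared. The names `POLY`, `circle_rotate`, and `p3_rotate` are hypothetical.

```python
POLY = 0b10011  # x^4 + x + 1, an assumed primitive polynomial for GF(2^4)


def circle_rotate(r):
    """Assumed circle rotation: multiply a 4-bit element by the primitive element."""
    r <<= 1
    if r & 0b10000:        # overflow out of 4 bits: reduce by the polynomial
        r ^= POLY
    return r


def p3_rotate(r, rp_low2):
    """Apply up to three rotations, selected by the lower two bits of rp."""
    for stage in range(3):
        rotated = circle_rotate(r)   # circle rotation logic (phi1)
        if rp_low2 > stage:          # comparator drives the 2:1 multiplexer (phi2)
            r = rotated
    return r
```

Under these assumptions, an intermediary result with power 4k that passes through `p3_rotate` with `rp_low2` equal to 0, 1, 2, or 3 comes out as the element with power 4k, 4k + 1, 4k + 2, or 4k + 3, which is why three stages suffice.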
Fig. 11. SIGMA chip microphotograph.
VI. VLSI IMPLEMENTATION AND PERFORMANCE
A prototype VLSI chip was designed using CMOS p-well
2-μm technology and was fabricated by MOSIS. The chip
implements the GMA algorithm and architecture described in
Sections IV and V. Since the entire architecture is a multistage linear static pipeline, the chip can produce one result every
clock cycle after the pipe is filled. It should be noted that the
architecture has a through delay of O(2^m), which is equal to
the number of stages in the pipe, and hence is not suitable
for large m. Since many of the applications use only fields
of small m, the proposed architecture is targeted toward
such applications. The chip was designed using a 2-phase
nonoverlapping clocking scheme. The circuit was fitted on
a 6.68 x 4.48 mm² MOSIS standard frame. The floorplan of
the chip is given in Fig. 10. The placement of the various
processors and the pin assignments are shown in the floorplan.
The overall circuitry required a silicon area of 1.789 x 3.570 mm² and a total of 8799 transistors. As can be seen in
the floorplan, the pattern recognition processor P1 required
most of the silicon area. It occupies 1.789 x 2.318 mm² and
consists of 7184 transistors. Although the chip required a
total of only 26 pins, the circuitry was fitted in a @-pin
package with most of the remaining pins connected to various
internal nodes for extra testability. The prototype chip was
tested using the HP82000 high speed IC tester and was found
to be fully operational at 33.3 MHz. The performance can
be improved by omitting the extra connections, and
additional speed-ups can be obtained by using sub-micron
technology. The critical delay of the chip depends only on
the four-bit adder, and the chip does not have any global
signals. Based on the performance of the prototype chip, the SIGMA chip can be improved in design so as to operate
at a speed as high as 40 MHz, yielding a throughput of 40
million multiplications/divisions per second. The SIGMA chip
microphotograph is shown in Fig. 11.
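The area, transistor, and throughput figures quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text; the variable names are illustrative and do not come from the paper.

```python
# Figures quoted in the text for the SIGMA prototype.
total_area_mm2 = 1.789 * 3.570      # overall circuitry
p1_area_mm2 = 1.789 * 2.318         # pattern recognition processor P1
total_transistors = 8799
p1_transistors = 7184
clock_hz = 33.3e6                   # measured operating frequency

# Share of the design taken by P1.
p1_area_share = p1_area_mm2 / total_area_mm2
p1_transistor_share = p1_transistors / total_transistors

# One result per clock cycle once the pipeline is full.
results_per_second = clock_hz * 1

print(f"total area: {total_area_mm2:.2f} mm^2")
print(f"P1 area share: {p1_area_share:.1%}")
print(f"P1 transistor share: {p1_transistor_share:.1%}")
print(f"throughput: {results_per_second / 1e6:.1f} million results/s")
```

On these numbers, P1 accounts for roughly 65% of the circuit area and about 82% of the transistors, consistent with the remark that it required most of the silicon area.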
VII. CONCLUSIONS
In this paper, we have presented a new algorithm for
performing multiplication as well as division of two elements
of GF(2^m). An efficient VLSI architecture for implementing
the proposed algorithm is described. The architecture is
systolic and exploits pipelining and parallelism in
order to obtain high speed and throughput. A CMOS VLSI chip, SIGMA, for GF(2^4) was designed, fabricated and
tested. The chip can yield a computation rate of 40 million
multiplications/divisions per second. The hardware can be
programmed for choosing different irreducible polynomials.
REFERENCES
[1] S. Lin, An Introduction to Error-Correcting Codes. Englewood Cliffs: Prentice-Hall, 1970.
[2] J. H. McClellan and C. M. Rader, Number Theory in Digital Signal Processing. Englewood Cliffs: Prentice-Hall, 1979.
[3] S. Berkovits, J. Kowaltchuk, and B. Schanning, "Implementing public key scheme," IEEE Commun. Mag., vol. 17, pp. 2-3, May 1979.
[4] B. A. Laws and C. K. Rushfort, "A cellular-array multiplier for GF(2^m)," IEEE Trans. Computers, vol. C-20, pp. 1573-1578, Dec. 1971.
[5] C. S. Yeh, I. S. Reed, and T. K. Truong, "Systolic multipliers for finite fields GF(2^m)," IEEE Trans. Computers, vol. C-33, pp. 357-360, Apr. 1984.
[6] C. S. Wang et al., "VLSI architecture for computing multiplications and inverses in GF(2^m)," IEEE Trans. Computers, vol. C-34, pp. 709-716, Aug. 1985.
[7] P. A. Scott, S. E. Tavares, and L. E. Peppard, "A fast VLSI multiplier for GF(2^m)," IEEE J. Select. Areas Commun., vol. SAC-4, pp. 62-65, Jan. 1986.
[8] H. Okano and H. Imai, "A construction method of high-speed decoders using ROM's for BCH and RS codes," IEEE Trans. Computers, vol. C-36, pp. 1165-1171, Oct. 1987.
[9] B. B. Zhou, "A new bit-serial systolic multiplier over GF(2^m)," IEEE Trans. Computers, vol. 37, pp. 749-751, June 1988.
[10] A. Pincin, "A new algorithm for multiplication in finite fields," IEEE Trans. Computers, vol. 38, pp. 1045-1049, July 1989.
[11] M. Furer and K. Mehlhorn, "AT^2 optimal Galois field multiplier for VLSI," IEEE Trans. Computers, vol. 38, pp. 1333-1336, Sept. 1989.
[12] G. L. Feng, "A VLSI architecture for fast inversion in GF(2^m)," IEEE Trans. Computers, vol. 38, pp. 1383-1386, Oct. 1989.
[13] C. C. Wang and D. Pei, "A VLSI design for computing exponentiations in GF(2^m)," IEEE Trans. Computers, vol. 39, pp. 258-262, Feb. 1990.
[14] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A Systems Perspective. Reading, MA: Addison-Wesley, 1988.
[15] K. Hwang and F. A. Briggs, Computer Architecture and Parallel Processing. McGraw Hill, 1984.
[16] T. C. Bartee and D. I. Schneider, "Computation with finite fields," Information and Control, no. 6, pp. 79-98, 1963.
[17] S. Lin and D. Costello, Error Control Coding. Englewood Cliffs, NJ: Prentice-Hall, 1983.
[18] W. W. Peterson and E. J. Weldon, Error Correcting Codes. Cambridge, MA: MIT Press, 1981.
[19] T. R. N. Rao and E. Fujiwara, Error-Control Coding for Computer Systems. Englewood Cliffs, NJ: Prentice-Hall, 1989.
Mario Kovac (S’WM’91) received the B.S. and M.S. degrees in computer science and engineering from the Faculty of Electrical Engineering, University of Zagreb, Croatia, in 1988 and 1991, respectively, where he is working toward the Ph.D. degree. He has been on the faculty of the University of Zagreb since 1989 and is currently holding a scientific assistant position. During 1990 and 1991, he was a visiting research scholar at the University of South Florida, Tampa. His research interests include computer architecture, parallel processing, VLSI and implementation of algorithms and architectures in hardware (on both PCB and chip level).
N. Ranganathan (S’81-M’88-SM’92) was born in Tiivaiym, India, in 1961. He received the B.E. (Hons) degree in electrical and electronics engineering from Regional Engineering College, Tiruchirapalli, University of Madras, India, in 1983, and the Ph.D. degree in computer science from the University of Central Florida, Orlando, in 1988. His research interests include VLSI design and hardware algorithms, computer architecture and parallel processing. He is currently involved in the design and implementation of VLSI architectures for computer vision, image processing, databases, data compression, and signal processing applications. He has been named the Program Co-chair for the 7th International Conference on VLSI Design to be held in Calcutta, India, in January 1994. Dr. Ranganathan is a member of the IEEE Computer Society, the IEEE Computer Society Technical Committee on VLSI, the ACM, and the VLSI Society of India.
Murali R. Varanasi (S’72-M’73-SM’89) received the B.Sc. and D.M.I.T. degrees from Andhra University, India, and Madras Institute of Technology, India. He also received the M.S. and Ph.D. degrees in electrical engineering from the University of Maryland, College Park, in 1972 and 1973, respectively. From 1973 to 1980 he was with the Department of Electrical Engineering, Old Dominion University, Norfolk, VA. He is currently working as a Professor of Computer Science at the University of South Florida, Tampa. His research interests include coding theory, computer architecture, fault tolerant computing, and VLSI design. Dr. Varanasi is a member of the IEEE Computer Society and is currently involved in educational and publications activities of that society. He is a member of ACM, Eta Kappa Nu, and Sigma Xi.