
The Advanced Encryption Standard on the HC 36m Reconfigurable Computer

Pradeep Kancharla and Duncan A. Buell

Department of Computer Science and Engineering

University of South Carolina

Columbia, South Carolina 29208

{kancharl|buell}@cse.sc.edu

ABSTRACT

We describe an implementation of the Rijndael algorithm on the HC 36m reconfigurable computing machine developed by Star Bridge Systems. Rijndael is the symmetric-key encryption algorithm that NIST adopted as the Advanced Encryption Standard (AES). The HC 36m is a reconfigurable computer with two 2.4 GHz Xeon processors and a reconfigurable resource comprising five Xilinx Virtex-II 6000 and two Virtex-II 4000 FPGAs. The implementation is done in Viva, a development environment from Star Bridge Systems, the vendor of the HC 36m.

1. Introduction

In recognition of the fact that cryptography has become an essential part of communications security for electronic commerce, NIST has promulgated standards for cryptography for more than 20 years. With ever-expanding technology and the increasing speed of microprocessors, DES (the Data Encryption Standard) is no longer as secure as it once was. In 1997, when it became clear that DES would become obsolete, the US National Institute of Standards and Technology (NIST) began the process of selecting a new encryption standard, the Advanced Encryption Standard, to replace DES as the Federal Information Processing Standard (FIPS). In October 2000, after several rounds of evaluation, the block cipher “Rijndael” was selected as the new Advanced Encryption Standard. The algorithm was designed by Vincent Rijmen and Joan Daemen.

In this paper we discuss the implementation of this algorithm in hardware on the Star Bridge Systems HC 36m reconfigurable computing machine [3]. Other hardware implementations of AES have usually focused on minimizing the power consumed or the logic needed, as would be appropriate for a smart card application. Our goal is instead to use the relatively large FPGA resources of the HC 36m, so that the design is not constrained by logic, and to obtain a significant speedup in execution time, primarily through parallelism and pipelining.

2. HC 36m Reconfigurable Computer

The platform we are targeting is an HC 36m Hypercomputer® developed by Star Bridge Systems. The reconfigurable resources on the Hypercomputer comprise five Xilinx Virtex-II 6000 and two Virtex-II 4000 FPGA chips organized in a proprietary manner. The architecture is built around four Processing Elements (PEs). Each PE consists of a Xilinx Virtex-II 6000 FPGA chip connected to four 512 MB DDR RAM modules by a 90-bit-wide communication link. The four PEs are arranged in a “Quad Structure” and are connected, by 50-bit-wide communication links, through a cross-point that is itself another Virtex-II 6000 chip. The Virtex-II 4000 chips serve as a bus controller and a router. The 2.4 GHz Xeon processors are connected to the FPGA interface through a 64-bit bidirectional PCIX bus running at 66 MHz. If the data to be sent is wider than the available bit width, it is multiplexed over the PCIX bus.


Fig 1: Quad Structure *.

The HC 36m reconfigurable computer comes with a development environment called Viva®. Viva provides a graphical editor to design the applications, which are then synthesized by Viva and mapped onto hardware using Xilinx tools. The design need not be constrained to a single chip as Viva is capable of mapping designs onto more than one chip. Viva also comes with a rich library of objects which can be used in the design of applications.

3. The Advanced Encryption Standard

Rijndael is a block cipher with possible block and key lengths of 128, 192 and 256 bits. The block to be encrypted and the key can be of different lengths. Encryption consists of a variable number of rounds (determined by the key and block lengths), with each round containing four transformations: ByteSub, ShiftRow, MixColumn and Round Key Addition (in the last round, MixColumn is omitted). The initial key is expanded to form an Expanded Round Key whose length depends on the number of rounds. Since Rijndael is a symmetric cipher, decryption uses the same key and applies the inverses of the encryption transformations.

The block and the intermediate cipher state can be envisioned as a two-dimensional array with four rows. The number of columns depends on the block length. All operations in Rijndael are performed either on bytes, which represent elements of the finite field GF(2^8), or on 4-byte words. The 4-byte words are the columns of the state. The key is envisioned in the same format. The input to the cipher, also known as the plaintext, is a one-dimensional array of 16, 24, or 32 bytes, depending on the block size. These bytes are mapped into the state in column order, as in Fig 2.

* Fig 1 courtesy of Star Bridge Systems

Fig 2: Example of a 128-bit State

a1   a5   a9    a13
a2   a6   a10   a14
a3   a7   a11   a15
a4   a8   a12   a16
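The column-order mapping of Fig 2 can be sketched in a few lines of Python. This is an illustrative sketch only, not part of the Viva design; the function names `bytes_to_state` and `state_to_bytes` are hypothetical.

```python
def bytes_to_state(block):
    """Map a flat plaintext block into a 4-row state, filling column by column."""
    if len(block) not in (16, 24, 32):
        raise ValueError("block must be 16, 24, or 32 bytes")
    columns = len(block) // 4
    # Byte number 4*c + r of the input lands in row r, column c.
    return [[block[4 * c + r] for c in range(columns)] for r in range(4)]


def state_to_bytes(state):
    """Inverse mapping: read the state back out column by column."""
    columns = len(state[0])
    return bytes(state[r][c] for c in range(columns) for r in range(4))


# Bytes 1..16 play the role of a1..a16 in Fig 2.
example_state = bytes_to_state(bytes(range(1, 17)))
for row in example_state:
    print(row)
```

With the bytes numbered 1 to 16, the first row of the printed state holds bytes 1, 5, 9 and 13, matching the first row of Fig 2.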

ByteSub transformation works independently on each of the cells of the state. The transformation consists of two parts. First the multiplicative inverse of the byte is calculated followed by an affine transformation. In the case of Decryption, called InvByteSub, an inverse of the affine mapping done above is applied followed by taking the multiplicative inverse.
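As a minimal sketch of the two-step ByteSub computation, the Python below computes the multiplicative inverse and then applies the affine transformation. The reduction polynomial (0x11B), the affine constant (0x63) and the inverse-affine constant (0x05) are taken from the AES specification rather than from this paper, and the function names are hypothetical. The inverse is found by repeated multiplication for clarity, not speed.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inverse(a):
    """Multiplicative inverse in GF(2^8); zero is mapped to zero."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):  # a^254 equals a^-1 because a^255 = 1
        result = gf_mul(result, a)
    return result


def rotl8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def byte_sub(a):
    """Multiplicative inverse followed by the affine transformation."""
    inv = gf_inverse(a)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


def inv_byte_sub(s):
    """Inverse affine mapping followed by the multiplicative inverse."""
    b = rotl8(s, 1) ^ rotl8(s, 3) ^ rotl8(s, 6) ^ 0x05
    return gf_inverse(b)


print(hex(byte_sub(0x53)))
```

Evaluating `byte_sub` for all 256 inputs yields the single 256-entry lookup table that a table-based implementation would store.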

ShiftRow transformation is applied independently to all the four rows. Each row is cyclically shifted left by a different offset. The first row is not shifted at all. The offsets of each row are determined by the block length. In case of Decryption, called InvShiftRow, the rows are shifted back to nullify the effect.
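The row shifts can be illustrated as follows. The offset table is the one given in the Rijndael specification (it is not listed in this paper), indexed by the number of state columns; the names `SHIFT_OFFSETS`, `shift_row` and `inv_shift_row` are hypothetical.

```python
# Left-shift offsets for rows 0..3, keyed by the number of columns in the state
# (4, 6 or 8 columns for 128-, 192- or 256-bit blocks), per the Rijndael specification.
SHIFT_OFFSETS = {4: (0, 1, 2, 3), 6: (0, 1, 2, 3), 8: (0, 1, 3, 4)}


def shift_row(state):
    """Cyclically shift each row left by its offset."""
    offsets = SHIFT_OFFSETS[len(state[0])]
    return [row[k:] + row[:k] for row, k in zip(state, offsets)]


def inv_shift_row(state):
    """Cyclically shift each row right by the same offset, undoing shift_row."""
    offsets = SHIFT_OFFSETS[len(state[0])]
    return [row[len(row) - k:] + row[:len(row) - k] for row, k in zip(state, offsets)]


# The 128-bit state of Fig 2, with byte a_i represented by the integer i.
fig2_state = [[1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15], [4, 8, 12, 16]]
shifted = shift_row(fig2_state)
for row in shifted:
    print(row)
```

Because the transformation only rearranges bytes, it reduces to wiring in hardware.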

MixColumn transformation is applied independently to each column of the state. Each column is treated as a polynomial of degree less than four with coefficients in GF(2^8). For example, the first column in Fig 2 can be treated as a4 x^3 + a3 x^2 + a2 x + a1. This polynomial is multiplied, modulo x^4 + 1, by the fixed polynomial e(x) = 03 x^3 + 01 x^2 + 01 x + 02, which is coprime to x^4 + 1 and therefore invertible. The multiplication can be carried out as a matrix multiplication. In case of decryption, called InvMixColumn, each column is multiplied by the polynomial d(x) that satisfies e(x) d(x) = 1 modulo x^4 + 1, namely d(x) = 0B x^3 + 0D x^2 + 09 x + 0E.
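A sketch of the column multiplication modulo x^4 + 1 is given below. It uses the coefficients from the AES specification, with the top byte of the column as the constant term: 03 x^3 + 01 x^2 + 01 x + 02 for encryption and 0B x^3 + 0D x^2 + 09 x + 0E for decryption. The helper names are hypothetical, and the test column is a commonly used MixColumn example rather than data from this paper.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


# Coefficients listed from the constant term up to x^3.
ENCRYPT_POLY = (0x02, 0x01, 0x01, 0x03)
DECRYPT_POLY = (0x0E, 0x09, 0x0D, 0x0B)


def poly_mul_mod(coeffs, column):
    """Multiply a column by a fixed polynomial modulo x^4 + 1.

    Since x^4 = 1, the product is a cyclic convolution, which is the same as
    multiplying the column by a circulant 4x4 matrix.
    """
    out = []
    for i in range(4):
        acc = 0
        for j in range(4):
            acc ^= gf_mul(coeffs[(i - j) % 4], column[j])
        out.append(acc)
    return out


def mix_column(column):
    return poly_mul_mod(ENCRYPT_POLY, column)


def inv_mix_column(column):
    return poly_mul_mod(DECRYPT_POLY, column)


mixed = mix_column([0xDB, 0x13, 0x53, 0x45])
print([hex(b) for b in mixed])
```

Applying `inv_mix_column` to the result returns the original column, which is the statement that the two polynomials are inverses modulo x^4 + 1.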

In Round Key Addition the Round Key is added to the state. Addition in GF(2^8) is a simple bitwise X-or. The round key has the same length as the state. It is derived from the initial cipher key by means of the key schedule. The process is the same in the case of decryption.

Key schedule is the process of deriving the round key for each round from the initial cipher key. This involves expansion of the initial key followed by selection of the key for each round. Round Key Addition is done once every round, and an additional Round Key Addition is done before the rounds in the case of encryption and after the rounds in the case of decryption. Since each round key must have the same length as the block, the total number of round key bits needed is the block length multiplied by one more than the number of rounds. Key expansion is done differently for different key sizes.
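The key expansion can be sketched as below, following the procedure in the Rijndael specification rather than the Viva design described later. The round-count rule (the larger of the key and block word counts, plus six), the S-box construction and the round constants come from that specification; `expand_key` and the other names are hypothetical. The example key is the 128-bit key used in the key expansion example of FIPS 197.

```python
def xtime(a):
    """Multiply by x in GF(2^8) modulo 0x11B."""
    a <<= 1
    return a ^ 0x11B if a & 0x100 else a


def gf_mul(a, b):
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def build_sbox():
    """ByteSub table: multiplicative inverse followed by the affine mapping."""
    sbox = []
    for a in range(256):
        inv = 0 if a == 0 else next(b for b in range(1, 256) if gf_mul(a, b) == 1)
        s = inv
        for n in range(1, 5):
            s ^= ((inv << n) | (inv >> (8 - n))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def expand_key(key, block_bytes=16):
    """Return (number of rounds, list of 4-byte words of the expanded key)."""
    nk, nb = len(key) // 4, block_bytes // 4
    rounds = max(nk, nb) + 6
    words = [list(key[4 * i:4 * i + 4]) for i in range(nk)]
    rcon = 1
    for i in range(nk, nb * (rounds + 1)):
        temp = list(words[i - 1])
        if i % nk == 0:
            temp = temp[1:] + temp[:1]            # rotate the word by one byte
            temp = [SBOX[b] for b in temp]        # substitute each byte
            temp[0] ^= rcon                       # add the round constant
            rcon = xtime(rcon)
        elif nk > 6 and i % nk == 4:
            temp = [SBOX[b] for b in temp]        # extra step for 256-bit keys
        words.append([x ^ y for x, y in zip(words[i - nk], temp)])
    return rounds, words


rounds, words = expand_key(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
print(rounds, 32 * len(words))
```

For a 128-bit block and key this gives 10 rounds and 128 × (10 + 1) = 1408 round key bits, in line with the count stated above.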

4. Implementation Approaches

We initially implemented AES for a 128-bit block and key size. Note that AES was designed to be easily implemented on smart cards with limited computing capability, so most of its operations work on bytes independently.

There are two basic approaches for putting AES in hardware. AES can be viewed as a loop over the rounds, with four (or three in the last round) transformations per round. One could therefore implement each stage once in hardware and iterate over the rounds. This would take the least hardware but does not allow different blocks to be pipelined through the rounds. An alternative would be to unroll the loop and pipeline the computation using multiple instances in hardware of each stage. This allows pipelining the computations of different blocks. In either implementation one can, inside some of the stages, obtain further parallelism and speed, since the byte-oriented nature of AES allows for independent computations on the bytes. The computations on these bytes can be done in two ways: by implementing the actual arithmetic, or by using lookup tables. However, parallelizing with lookup tables can lead to a significant increase in hardware utilization for large block sizes due to the need, as described below, to replicate a large number of lookup tables.

Key scheduling can also be done in two ways. One way is to do it prior to the rounds and store the Expanded Key. The advantage with this approach is that the key need not be calculated again for different blocks. We implemented Decryption in this approach. The other way is to compute the key on the fly with the rounds. This approach is used in Encryption.

5. Implementation

The simplest of the four transformations is ShiftRow, which as its name implies is simply a row-by-row permutation of the block. Its hardware implementation is straightforward and full parallelism is obvious. Next simplest is Round Key Addition, which is essentially done with X-ors. ByteSub and MixColumn involve computations such as multiplication, multiplicative inverse and affine transformation, all in GF(2^8). In software and in our initial hardware implementations, this arithmetic is done with lookup tables. The additions are performed using X-ors. The multiplication is done using two lookup tables. The multiplicative inverse is done using a single lookup table. Each of these lookup tables contains 256 values, since the tables are indexed by bytes, and is stored in on-chip memory.

In these transformations we operate independently on each cell of the state. Since these operations are independent of each other, they can either be done iteratively using the same lookup tables or in parallel with separate lookup tables for each byte. In the iterative case, the output of all iterations is registered and passed to the next transformation after the completion of all the iterations. This implementation uses fewer resources in terms of silicon.

We did an initial VHDL implementation of the algorithm for a 128-bit block and key size to get an idea of the delay and resource usage. For this implementation we used lookup tables for performing the arithmetic. We replicated the lookup tables so that the operations on bytes could be performed in parallel. The results for the individual transformations and for a complete encryption round are tabulated in Fig 3.

Block                 Slices   Max Pin Delay (ns)
Lookup table              68    6.392
ByteSub                 1088   16.324
ShiftRow                   0    8.871
MixColumn               5056   25.510
Round Key Addition       128    9.989
Key Schedule             356    9.486
Last Round              1216   14.089
Round                   6150   22.888

Fig 3: Results for VHDL Implementation

All the results were obtained using Xilinx ISE, synthesizing to the Virtex-II device XC2V6000, package ff1152, with speed grade -4. Based on these results, the whole algorithm requires around 57,000 slices. Each Virtex-II 6000 chip on the Hypercomputer provides 33,792 slices. This implies that the whole algorithm can be accommodated on two chips if implemented with this approach, provided that the synthesis tools of Viva are as efficient as the standard Xilinx tools and that the necessary data transfer between chips can be made.
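The paper does not spell out how the figure of roughly 57,000 slices is reached. One plausible reading, shown below purely as an assumption, is nine full rounds, one last round and the initial Round Key Addition for a ten-round 128-bit encryption, using the slice counts of Fig 3.

```python
# Slice counts taken from Fig 3; the way they are combined is an assumption.
ROUND_SLICES = 6150
LAST_ROUND_SLICES = 1216
ROUND_KEY_ADDITION_SLICES = 128
SLICES_PER_CHIP = 33792      # one Virtex-II 6000
ROUNDS = 10                  # 128-bit block and key

total_slices = ((ROUNDS - 1) * ROUND_SLICES
                + LAST_ROUND_SLICES
                + ROUND_KEY_ADDITION_SLICES)
chips_needed = -(-total_slices // SLICES_PER_CHIP)   # ceiling division

print(total_slices, chips_needed)
```

Under this reading the total is 56,694 slices, which rounds to the quoted 57,000 and fits on two chips with room to spare for inter-chip transfer logic.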

On the HC 36m, we started with an iterative approach inside the transformations involving the lookup tables so as to get a feel for Viva’s synthesis efficiency. We use three lookup tables for encryption and four tables for decryption. The results of all iterations are registered and are passed to the next transformation after completion. The multiplicative inverse and affine transformation on a byte in ByteSub are done using a single lookup. The multiplication with a polynomial in MixColumn involves two lookups for each multiplication. The logarithm of a byte is obtained with one lookup and added to the logarithm of the polynomial coefficient, which is a constant. The antilogarithm of the sum is then obtained with another lookup. The Key Schedule is done on the fly in the case of encryption. The substituted values required in the Key Schedule are also calculated in ByteSub. In the case of decryption the Key Schedule is done prior to the rounds and the Expanded Round Key is registered. The designs of the round for both encryption and decryption are shown in Fig 4.

Fig 4: Design of Round and Key Schedule in encryption and decryption
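The two-lookup multiplication described above can be sketched as follows. The generator 0x03 used to build the tables is an assumption (the paper does not name the generator it uses); it is a standard choice for the AES field. The names `LOG`, `ANTILOG` and `mul_by_lookup` are hypothetical.

```python
def xtime(a):
    """Multiply by x in GF(2^8) modulo 0x11B."""
    a <<= 1
    return a ^ 0x11B if a & 0x100 else a


# Build the logarithm and antilogarithm tables for the generator 0x03.
LOG = [0] * 256          # LOG[0] is unused: zero has no logarithm
ANTILOG = [0] * 255
value = 1
for exponent in range(255):
    ANTILOG[exponent] = value
    LOG[value] = exponent
    value = xtime(value) ^ value     # multiply by 0x03 = x + 1


def mul_by_lookup(byte, coefficient):
    """Multiply via log lookup, addition of exponents, and antilog lookup."""
    if byte == 0 or coefficient == 0:
        return 0
    return ANTILOG[(LOG[byte] + LOG[coefficient]) % 255]


print(hex(mul_by_lookup(0x57, 0x83)))
```

Since the MixColumn coefficients are constants, their logarithms can be fixed at design time, leaving one log lookup and one antilog lookup per multiplication.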

Results were taken both for an iterative approach over the round described above and for an approach that unrolls the loop over the rounds. All the results tabulated in Fig 5 were taken with a 25 ns clock. The results also include the overhead in slices for input and output, which is done by multiplexing in Viva. A single lookup table that took 68 slices in the VHDL implementation synthesized with Xilinx ISE took 160 slices when synthesized with Viva.

Fig 5: Results for Iterative approach inside Round

Block                  Slices   Cycles
ByteSub                   794       50
InvByteSub                642       40
MixColumn                 983      392
Round Key Addition        329        1
Key Schedule             2895      130
Encryption Round         2082      444
Decryption Round         1715      433
Iterative Encryption     2285     4069
Iterative Decryption     4656     4480
Unrolled Encryption     16393     4056
Unrolled Decryption     14395     4077

In the iterative approach over the rounds, the round implementation includes some logic to skip MixColumn in the last round in the case of encryption and InvMixColumn in the first round in the case of decryption. In the unrolled approach, since we use a different object for each round, this logic is eliminated: the last round object is built without MixColumn in the case of encryption, and similarly for decryption. The Key Schedule results above correspond to the Key Schedule done prior to the rounds, as in decryption. The results imply that we cannot accommodate all ten rounds on the four chips available on the HC 36m if we completely unroll the loop in the ByteSub and MixColumn transformations.

We then unrolled the loop inside the encryption round to the extent of a single iteration so as to accommodate all the rounds on the four chips of the HC 36m. ShiftRow is done before ByteSub so that the substituted values required for the two polynomial multiplications in MixColumn can be calculated and registered for each iteration. A single round designed in this manner, shown in Fig 6, took around 12564 slices, which is around 37% of the chip. So if the rounds that straddle two chips are split with about half of their slices on each chip, the two and a half rounds per chip would occupy about 2*37% + 20% = 94% of a single chip. But Viva failed to synthesize two rounds of this design on a single chip because the excessive usage of X-ors caused its synthesis process to run out of memory.

Fig 6: Design of Round

The excessive usage of lookup tables in the MixColumn transformation resulted in excessive usage of X-ors. So we decided to replace the lookup tables with the actual implementation of the arithmetic. We note that the use of lookup tables facilitates a software implementation, since GF(2^8) arithmetic is not supported by the instruction set architecture of microprocessors and since much of the use of the lookup tables has to be done sequentially in software. Given the ability, on a reconfigurable computing machine, to design one’s own instructions, there may well be an advantage to implementing the arithmetic directly in logic.

We implemented the multiplication with the polynomial using shifts and X-ors. A multiplication by x is a left shift followed by a conditional X-or. Multiplication by the other coefficients is done in the same manner. Since much of the silicon usage was in MixColumn because of its many lookup tables, replacing them with the actual arithmetic allowed us to attain complete parallelism in the operations inside ByteSub and MixColumn. The results of encryption and decryption done in this manner are tabulated in Fig 7 for a 15 ns clock. From the results we can see that a complete implementation of the algorithm would fit on two chips for all the bit widths. We are currently working on this aspect. The current problem is the enormous time taken by Viva to synthesize the whole design.

Fig 7: Results

Block               Slices
ByteSub               2559
InvByteSub            2559
MixColumn              341
InvMixColumn           651
Encryption Round      3592
Decryption Round      3497
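The shift-and-X-or multiplication described above can be sketched in software as follows. This is only an illustration of the arithmetic, not the Viva design; the reduction constant 0x11B comes from the AES specification, and the function names are hypothetical.

```python
def xtime(a):
    """Multiply by x: left shift, then X-or with 0x11B if the shift overflowed."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a


def mul03(a):
    """03 = x + 1, used in MixColumn."""
    return xtime(a) ^ a


def mul09(a):
    """09 = x^3 + 1, used in InvMixColumn."""
    return xtime(xtime(xtime(a))) ^ a


def mul0b(a):
    """0B = x^3 + x + 1."""
    return xtime(xtime(xtime(a))) ^ xtime(a) ^ a


def mul0d(a):
    """0D = x^3 + x^2 + 1."""
    return xtime(xtime(xtime(a))) ^ xtime(xtime(a)) ^ a


def mul0e(a):
    """0E = x^3 + x^2 + x."""
    return xtime(xtime(xtime(a))) ^ xtime(xtime(a)) ^ xtime(a)


print(hex(xtime(0x57)), hex(mul03(0x13)))
```

The decryption coefficients need up to three chained shifts each, whereas the encryption coefficients need at most one, which is consistent with InvMixColumn using more slices than MixColumn in Fig 7.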

The FPGAs in the Quad structure are connected by 50-bit-wide communication links, and the PCIX bus, which transfers the data from the host to the FPGA architecture, can be connected to only one of the four chips. So we have to send the input data to, and collect the output data from, one of the four chips. Within these constraints, we are looking at the following design for a multi-chip implementation.

We are trying to implement encryption on two chips by placing half of the total number of aforementioned round blocks required as per bit widths of the input block and the key on each of the two chips. The input which is the State and the Key is to be passed through the round blocks on the first chip and will be sent to the second chip at a rate of 23 bits per clock cycle. The data takes two clock cycles to pass between any two chips. Once all the data is collected on the second chip the data is passed through the remaining round blocks. Once again the data is sent back to the first chip to be sent back to the host.

Decryption can also be done in a similar way. We place half of the required rounds and also the same number of KeySchedule (KS) blocks in each of the two chips. But since we are implementing the Key Schedule prior to the rounds and since the Key is taken in the reverse order from Expanded Key there is some variation in the data movement across the chips. First the data is sent to the second chip and passed through the KS blocks. After that the intermediate key is sent back to the first chip and passed through the remaining KS blocks to complete the Key Schedule. Once the Key Schedule is done the data is passed through the rounds as in encryption. But the only difference is that we need not send the key again to the second chip as we did in encryption. Data transfer is done at 23 bits per clock cycle.

The reason for transferring data at 23 bits per clock cycle is that, in each direction, two bits of every transfer are used as control bits. This makes 25 bits in each direction, for a total of 50 bits over the two directions. A 256-bit transfer in both directions takes 728 slices on the first chip and 576 on the second. This includes the overhead for input and output.
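The link arithmetic above can be sketched as follows. The payload width follows directly from the text; the cycle estimate assumes, as a simplification not stated in the paper, that transfers are issued back to back and that the two-cycle chip-to-chip latency is paid once. The function names are hypothetical.

```python
LINK_WIDTH_BITS = 50             # total width of the chip-to-chip link
CONTROL_BITS_PER_DIRECTION = 2   # control bits per transfer in each direction
CHIP_TO_CHIP_LATENCY = 2         # clock cycles for data to pass between chips

# Half of the link per direction, minus the control bits.
PAYLOAD_BITS = LINK_WIDTH_BITS // 2 - CONTROL_BITS_PER_DIRECTION


def transfers_needed(data_bits):
    """Number of payload-sized transfers needed to move data_bits (rounded up)."""
    return -(-data_bits // PAYLOAD_BITS)


def estimated_cycles(data_bits):
    """Rough cycle count, assuming pipelined transfers (an assumption)."""
    return transfers_needed(data_bits) + CHIP_TO_CHIP_LATENCY


print(PAYLOAD_BITS, transfers_needed(128), transfers_needed(256))
```

Under these assumptions a 128-bit state needs 6 transfers and a 256-bit state plus key needs 12, which is small compared with the cycle counts of the rounds themselves in Fig 5.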

6. Future Work

We are presently working on a multi-chip implementation of the algorithm. We are also in the process of importing a round object synthesized using VHDL into Viva in order to compare its performance with that of the Viva object. We are also implementing an end-to-end application that reads a file in blocks, encrypts the blocks, decrypts them in any of the several standard modes (Electronic Code Book, Cipher Block Chaining, etc.), and writes the output to a file. The file handling is done using Microsoft COM objects on the host processor, and the encryption and decryption are done on the HC 36m hardware.

7. References

1) J. Daemen and V. Rijmen. The Design of Rijndael: AES, The Advanced Encryption Standard, Springer Verlag, Berlin, 2001.

2) National Institute of Standards and Technology. FIPS 197: Announcing the Advanced Encryption Standard, Nov. 26, 2001. http://csrc.nist.gov/encryption/aes/index.html

3) Star Bridge Systems, Inc., web site, http://www.starbridgesystems.com
