Ronan Et Al. - 2007 - A Reconfigurable Processor for the Cryptographic η T Pairing in Characteristic 3-Annotated

  • Upload
    ilg1

  • View
    214

  • Download
    0

Embed Size (px)

Citation preview

  • 7/25/2019 Ronan Et Al. - 2007 - A Reconfigurable Processor for the Cryptographic T Pairing in Characteristic 3-Annotated

    1/6

    A Recongurable Processor for the Cryptographic T Pairingin Characteristic 3

    Robert Ronan , Colm O h Eigeartaigh , Colin Murphy , Tim Kerins and Paulo S. L. M. Barretto

    Department of Electrical Engineering, School of Computing, Escola Polit ecnica,University College Cork, Dublin City University, Universidade de S ao Paulo,

    Cork, Dublin, S ao Paulo, Ireland Ireland Brazil

    Abstract Recently, there have been many proposals for secureand novel cryptographic protocols that are built on bilinearpairings. The T pairing is one such pairing and is closelyrelated to the Tate pairing. In this paper we consider the efcienthardware implementation of this pairing in characteristic 3.All characteristic 3 operations required to compute the pairingare outlined in detail. An efcient, exible and recongurableprocessor for the T pairing in characteristic 3 is presented anddiscussed. The processor can easily be tailored for a low areaimplementation, for a high throughput implementation, or fora balance between the two. Results are provided for variouscongurations of the processor when implemented over the eldF 3 97 on an FPGA.

    Keywords T pairing, characteristic 3, elliptic curve,recongurable processor, FPGA

    I. INTRODUCTION

    Since the introduction of the Weil and Tate pairings forconstructive cryptographic applications, there have been nu-merous proposals for novel cryptographic protocols based onpairings, including identity-based encryption (IBE) schemes,key exchange schemes and short signature schemes.

    The practicality of these schemes relies on the efcientimplementation of bilinear pairings. This has led to manysuggestions for algorithmic optimisations. The Tate pairingwas originally computed using Millers Algorithm [1]. In

    2002, signicant improvements to this method were inde-pendently suggested in [2] and [3]. Subsequently, the Tatepairing implementation was further optimised in [4] for a smallclass of curves, resulting in the Duursma-Lee algorithm forpairing implementation. In [5] the Duursma-Lee results weregeneralised and it was also demonstrated that an even fasterbilinear pairing can be returned by using a smaller iterativeloop and a slightly more complicated nal exponentiation. Thispairing is known as the truncated eta pairing (denoted T ). Inthis paper we describe the hardware implementation of thispairing in characteristic 3.

    Even with these algorithmic optimisations, pairing calcula-tion remains a complicated operation. Pairings are very wellsuited to hardware implementation for two reasons. Firstly,much of the arithmetic performed on the extension eldF qk can be reduced to arithmetic on F q . Many of these F qarithmetic operations can be performed in parallel in hardware,providing a large saving over implementations on general pur-pose serial processors. Secondly, a pairing calculation consistsof an iterative loop followed by a nal exponentiation. If care is taken with the scheduling of operations through theloop, a hardware implementation can provide a large level of pipelining, thereby accelerating pairing computation.

    An FPGA implementation of the characteristic 3 pairingmethods described in [6] has appeared in [7], where theextension eld operations required for pairing computationare hardwired. However, this leads to a large design andprovides no level of exibility for different applications (theauthors concern themselves with high throughput only). In [8]a pairing processor for the implementation of the pairingsdescribed in [4] and [6] is presented. An F 3m ALU is usedto perform the operations required for pairing computation.A study on the scheduling of the Duursma-Lee algorithmin [4] has appeared in [9]. Recently, an implementation of the characteristic 3 T pairing has appeared in [10]. The

    operations necessary for the iterative loop of the T pairingare hardwired. This results in a high operational frequency.However, there is no means for exponentiation using thearchitecture in [10]. This exponentiation is required to ensurethat the pairing returns a unique value, which is required forcryptographic applications. Thus, the architecture in [10] mustbe supplemented with a coprocessor for exponentiation, which,at the time of writing, was a work in progress and details werenot included.

    I I . M ATHEMATICAL PRELIMINARIES

    Let F q = F pm be a nite eld of characteristic p. The groupof points on an elliptic curve E dened over F q is denoted by

    E (F q). A subgroup of E (F q) of prime order r has embeddingdegree k if r divides q k 1, but does not divide q i 1 forany 0 < i < k . A subgroup of order r is denoted E (F q)[r ].

    Following [2] and [3], the Tate pairing of order r is a bilinearmap between E (F q)[r ] and E (F qk )[r ] to an element of themultiplicative group F qk :

    er (P, Q ) : E (F q)[r ] E (F qk )[r ] F qk (1)

    The second argument can be generated over E (F q) and thenmapped to E (F qk ) using a distortion map (denoted ) [4].This leads to a reduction in cost since more operations can beperformed on the sub-eld. The incorporation of the distortionmap yields the modied Tate pairing

    er (P, Q ) = er (P, (Q)) (2)

    where now P, Q E (F q)[r ]. The value er (P, Q ) is onlydened up to r -th powers. A nal exponentiation must beperformed to obtain a unique value for cryptographic purposes.The reduced modied Tate pairing is dened as

    e(P, Q ) = er (P, Q )(qk 1) /r (3)

    In this paper we consider the computation of the T pairingon a characteristic 3 curve, which has an embedding degree of

    nternational Conference on Information Technology (ITNG'07)0-7695-2776-0/07 $20.00 2007

  • 7/25/2019 Ronan Et Al. - 2007 - A Reconfigurable Processor for the Cryptographic T Pairing in Characteristic 3-Annotated

    2/6

    6. The T pairing reduces the number of iterations required tobuild a bilinear pairing at the expense of a more complicatednal exponentiation, when compared to the Tate pairing. Thepairing

    T (P, Q )M (4)

    where M = (3 3m 1)(3m + 1)(3 m b.3(m +1) / 2 + 1) is

    itself bilinear and returns a non-degenerate and unique value.It is, therefore, suitable as a basis for secure cryptographicprotocols. The remainder of this paper deals with the efcientimplementation of T (P, Q )M in hardware. If compatibilitywith the Tate pairing is required, then a nal powering canbe performed with a very small number of calculations (thatare insignicant when compared to the overall cost of thepairing computation). The interested reader is referred to [5]for the derivation of the T pairing and for more details onits conversion to the Tate pairing.

    A. The Characteristic 3 T Pairing

    In this section we describe the calculation of the T pairingon the curve E : y2 = x3 x + b in afne coordinates denedon the eld F 3m in the m mod 12 1, m mod 6 1,b = 1 case. A suitable distortion map on this curve is(Q) = (xQ , yQ ) = ( xQ , yQ ) where F 32 and F 33 and 2 + 1 = 0 and 3 + 1 = 0 . Elements of F 36 m are represented using the basis (1,,,, 2 , 2). Anelement A F 36 m is then represented as

    A = ( a0 , a1 , a2 , a3 , a4 , a5)= a0 + a1 + a2 + a3 + a42 + a52 (5)= a0 + a1 + a22 (6)

    where a0 , a1 , a2 , a3 , a4 , a5 F 3m and a0 = a0 + a1, a1 =a2 + a3, a2 = a4 + a5 with a0 , a1 , a2 F 32 m .

    The operations required to compute T (P, Q )M , as de-scribed in [5], are shown in Algorithm 1. The algorithm beginswith a precomputation stage in which powers of cubes of xQand yQ are calculated and stored in memory. This means thatthe calculation of cube roots is avoided (as is the additionalcircuit cost) and the appropiate powers of xQ and yQ areinstead extracted on each iteration.

    The most costly part of the pairing is the loop iteration. Theastute scheduling of these operations through the hardwareprocessor will be vital to computation speed. The multipli-cation of f by g can be performed with a regular F 36 mmultiplication. However, this does not make use of the fact thatg has the sparse form g = ( g0 + g1 + g2+(0) +( 1)2 +(0)). If the F 36 m multiplication is unrolled and examined,it becomes apparent that f.g can be performed in only 13multiplications, providing a saving of ve multiplications.The F 3m operations required for the multiplication of f =(f 0 , f 1 , f 2 , f 3 , f 4 , f 5) by g = ( g0 , g1 , g2 , 0, 1, 0), includingthe required reductions are shown in (7), (8) and (9). Notethat all 13 F 3m multiplications can be performed in parallel.

    Algorithm 1 Computation of T (P, Q )M on E (F 3m ) : y2 =x3 x 1, m mod 12 1, m mod 6 1 caseINPUT : P = ( xP , yP ), Q = ( xQ , yQ ) , (P, Q ) E (F 3 m )OUTPUT : f = T (P, Q ) F 3 6 m

    INITIALISE:1: f F 3 6 m2: f 1PRECOMPUTE:3: for i m 1 downto ( m + 1) / 2 + 1 do4: xQ x Q 3 , yQ yQ 3 2F 3 m cubings

    5: end for6: for i (m + 1 ) / 2 downto 0 do7: xQ x Q 3 , yQ yQ 3 2F 3 m cubings8: xQ [i ] x Q , yQ [i ] yQ9: end forRUN:

    10: for i 1 to ( m + 1) / 2 do11: u x P + x Q [i ] + 112: c0 u.u 1F 3 m mul13: c1 yP .y Q [i ] 1F 3 m mul14: g c0 + ( c1 ) + ( u ) + (0) + ( 1) 2 + (0) 215: f f .g 13 F 3 m muls16: if i < ( m + 1) / 2 then17: xP x 3P , yP y

    3P 2F 3 m cubings

    18: end if 19: end for20: g yP . (x Q [i ] + x P + 1) + ( yQ [i ]) + ( yP ) 1F 3 m mul21: f f .g 10 F 3 m muls22: return f (3

    3 m 1)(3 m +1)(3 m +3 ( m +1) / 2 +1) See Algorithm 2

    mul 0 f 0 .g0mul 1 f 1 .g1mul 2 (f 0 + f 1 ).(g0 + g1 )mul 3 f 2 .g2mul 4 f 3 .g2mul 5 (f 0 + f 2 ).(g0 + g2 )mul 6 (f 1 + f 3 ).(g1 )mul 7 (f 0 + f 1 + f 2 + f 3 ) .(g0 + g1 + g2 )mul 8 (f 0 + f 4 ).(g0 + 2)mul 9 (f 1 + f 5 ).(g1 )mul 10 (f 0 + f 1 + f 4 + f 5 ).(g0 + g1 + 2)mul 11 (f 2 + f 4 ).(g2 + 2)mul 12 (f 3 + f 5 ).(g2 + 2)

    (7)

    t00r mul 0 mul 1 t00 i mul 2 mul 0 mul 1t11 r mul 3 t11 i mul 4t22r f 4 t22 i f 5t01r mul 5 mul 6 t01 i mul 7 mul 5 mul 6t02r mul 8 mul 9 t02 i mul 10 mul 8 mul 9t12r mul 11 t12 i mul 12

    (8)

    c0 t00r t12r + t11 r + t22rc1 t00i t12i + t11 i + t22ic2 t01r t00r + t11 r + t12r + t22 rc3 t01i t00i + t11 i + t12i + t22ic4 t02r t00r + t11 rc5 t02i t00i + t11 i

    (9)

    The nal exponentiation of f in Algorithm 1 is vital sinceit ensures a unique output value for the overall computation.The exponent M is unrolled and the exponentiation requires(m + 1) / 2 + 1 F 36 m cubings, seven powerings to q = 3m ,two powerings to q 2 , eleven F 24 m multiplications and an F 36 minversion. The computation of these operations will be detailedlater.

    III . C HARACTERISTIC 3 ARITHMETIC

    In this section we consider the hardware implementationof F 3 , F 3m and F 36 m arithmetic for the computation of T (P, Q )M . We provide results when the arithmetic operations

    nternational Conference on Information Technology (ITNG'07)0-7695-2776-0/07 $20.00 2007

  • 7/25/2019 Ronan Et Al. - 2007 - A Reconfigurable Processor for the Cryptographic T Pairing in Characteristic 3-Annotated

    3/6

    Op Description Cycles Area(Slices)Add/Sub Performed using m F 3 addition or

    subtraction units. Combinatorial logiconly.

    1 97

    Cube Created as an xor -gate array us-ing methods from [11]. Combinatoriallogic only.

    1 120

    Inv Performed using the Extended Euclid-ean Algorithm (EEA). The EEA oper-ations are modied for a characteristic3 implementation [12].

    2m 1939

    MultD 1 A serial Most Signicant Coefcient

    First (MSC) Multiplier [12] is used.Only one input coefcient operated onat a time.

    m 637

    D> 1 Digit Multipliers of type describedin [13] are used. D coefcients of theinputs are processed in parallel, whereD is known as the Digit Size . The areaof the multiplier increases with D .These multipliers offer a direct trade-off between speed and area.

    m/D D=4: 1224

    D=8: 2049

    D=16: 3737

    TABLE I

    IMPLEMENTATION OF A RITHMETIC ON F 3 m

    are implemented over a base eld F 397 . This base eld enablesperformance comparisons with others in the literature. Theirreducible polynomial used is x97 + x16 1. However, thecomponents described in this section can be recongured forany other suitable eld size or irreducible polynomial. TheFPGA on which the arithmetic is implemented is a Xilinx Virtex-II Pro FPGA (xc2vp100-6ff1696) with 45120 slices. Allcomponents have been realised using VHDL and synthesisedand placed & routed using Xilinx ISE Version 8.1i.

    A. F 3 Arithmetic

    Each element {0, 1, 2} F 3 must be represented using twohardware bits. We choose the representation 0 = {00, 01},1 = 10 and 2 = 11 . This representation means that the check if zero operation can be performed by simply checking thehigh bit of an F 3 element.

    On F 3 , the required arithmetic operations are addition,subtraction and multiplication (negation can be performed witha subtraction from zero). The underlying logic unit on FPGAsare function generators, which can be congured as 4 : 1Look-Up Tables (LUTs). We make explicit use of these LUTsin generating circuitry for the additive and multiplicative F 3operations. The input-output map for each of the arithmeticoperations can be placed on two 4 : 1 LUTs, or one FPGAslice.

    B. F 3m ArithmeticThe F 3m operations that are required for the computation

    of T (P, Q )M are addition, subtraction, cubing and multipli-cation. F 3m inversion is also required during exponentiation.Note that F 3m squaring is performed using multiplicationcircuitry. The arithmetic components used to implement F 3mare described in Table 1. Addition, subtraction and cubing inF 3m are combinatorial and incur a relatively small area cost.Multiplication can be performed using two different types of multipliers. The rst is a low area cost serial approach utilising

    a Most Signicant Coefcient First (MSC) multiplier. In thiscase, bits of the inputs are operated on one at a time, with aresult returned in m clock cycles. The second approach is touse Digit Multipliers such that D bits of the inputs are operatedon in parallel. These multipliers return results in m/D clock cycles and provide a speed/area trade-off.

    C. F 36 m Arithmetic

    During the exponentiation, various extension eld opera-tions must be performed. Using the tower of extensions idea,arithmetic on F 36 m can be reduced to operations on F 3m . Thismeans that the components described in the previous sectioncan be reused to perform the F 36 m operations. The requiredF 36 m operations are described in Table II.

    IV. T HE T PROCESSOR

    This section describes the recongurable processor designedto compute the secure bilinear pairing T (P, Q )M . The proces-sor was designed with exibility and versatility in mind andis shown in Figure 1.

    Rather than hardwiring the logic to compute the pairing, theprocessor is implemented with an ALU containing a number of F 3m arithmetic components. The ALU contains one arithmeticunit for F 3m addition, F 3m subtraction, F 3m cubing and F 3minversion. As multiplication is performed so often and is arelatively time consuming operation (requiring m/D clock cycles), the processor has been designed such that the numberof multipliers in the ALU can be recongured with littleimpact on the overall architecture. By varying the numberof multipliers, and indeed their digit sizes, the processorcan easily be tailored for a low area implementation, a highthroughput implementation, or a desired compromise betweenthe two.

    Dual port RAM is used to store the intermediate variablesrequired for computation. The required input coordinates areread serially into the RAM before computation begins. Duringcomputation, two 2m -bit data signals bring variables to theALU to be operated on. Tri-state buffers at the output of the arithmetic components are used to select which result iswritten to RAM when the F 3m operation has been completed.

    A control unit consisting of a ROM, a counter and a simplestate machine is used to perform the pairing operation. Aninstruction set sequencing the operations required for the T pairing is loaded into the ROM. A counter controls the addressof the ROM. When the processor is reset, the counter beginsto iterate through the instruction set. A simple state machinechecks bits of the instruction vector and halts the counter form/D clock cycles or m clock cycles when a multiplicationor an inversion are being performed, respectively, so thatthe correct result will be written to RAM. The counter alsocontains a load control bit such that the counter can jumpto particular addresses in the instruction set when required.This control methodology ensures exibility as it eliminatesthe requirement for a large xed state machine. When a newinstruction set is required, it can simply be loaded into theROM.

    nternational Conference on Information Technology (ITNG'07)0-7695-2776-0/07 $20.00 2007

  • 7/25/2019 Ronan Et Al. - 2007 - A Reconfigurable Processor for the Cryptographic T Pairing in Characteristic 3-Annotated

    4/6

    Op Method of Computation Cost on F 3 m

    Cube Let A = a 0 + a 1 + a 2 2 where a 0 = ( a 0 + a 1 ), a 1 = ( a 2 + a 3 ), a 2 = ( a 4 + a 5 ) A3 = a 30 + a

    31

    3 + a 2 3 6 . But 3 = 1 and 6 = 2 + 1 . Reduce: A3 = (a 30 a

    31 + a

    32 ) + ((a

    31 + a

    32 ) + a

    33

    2

    But 2 = 1, a 30 = ( a 03 a 1 3 ), a 31 = ( a 2

    3 a 3 3 ), a 32 = ( a 43 a 5 3 )

    A3 = ( a 0 3 a 2 3 + a 4 3 ) + ( a 3 3 a 1 3 a 5 3 ) + ( a 2 3 + a 4 3 ) + ( a 3 3 a 5 3 )+( a 4 3 )2 + ( a 5 3 ) 2

    6 cubings,8 sub / adds

    PowQ Let A = a 0 + a 1 + a 2 2 again: Aq = A 3

    m= a 3

    m0 + a

    3 m1

    3 m + a 2 3 (2 )3 m

    = a 0 + a 1 3 + (a 1 )6 Aq = ( a 0 a 2 + a4 ) + ( a 3 a 1 a 5 ) + ( a 2 + a 4 ) + ( a 3 a 5 )+( a 4 )2 + ( a 5 ) 2

    8 sub/adds

    PowQ 2 Reapply PowQ : Aq

    2= ( a 0 + a 2 + a 4 ) + ( a 3 + a 1 + a 5 ) + ( a 2 a 4 ) + ( a 3 a 5 )

    +( a 4 )2 + ( a 5 ) 2

    6 sub/adds

    Mul Given A = ( a 0 + a 1 + a 2 2 ) , B = ( b0 + b1 + b2 2 ) perform Karatsuba Multiplication of A.B overF 3 2 m followed by a reduction by 3 = 1. This costs six F 3 2 m multiplications. Each F 3 2 m multiplicationrequires three multiplications on F 3 m . This means that, in total, 18 F 3 m multiplications are required for F 3 6 mmultiplication. All multiplications can be performed in parallel if desired. See [7] for more details.

    18 muls,72 add / subs

    Inv Inversion can be carried out efciently using conjugate methods [7]. Only one F 3 m inversion is required. 33 muls, 4 cubings,67add / subs, 1 inv

    TABLE II

    IMPLEMENTATION OF A RITHMETIC ON F 3 6 m

    + -- b3 b -1 mul0 mulk-10 1 2 3 4

    douta

    dinb

    doutb

    dinax ,y ,x ,yp p q q

    addra

    e n

    l o a

    d

    d a

    t a i n

    dout

    dout

    r s

    t

    I n s t r u c t i o n s

    sta tem ach ine

    buff sel buff sel buff sel buff sel buff sel buff sel

    2m 2m

    2m

    2m

    Dout

    RAM

    ROMCOUNTER

    e n r am a

    en ram b

    add r r am a

    ad dr ram b

    bu ff sel

    rst ari th

    e n r a m a

    e n r a m b

    a d d r

    r a m a

    a d d r

    r a m b

    r s t ( 0 )

    r s t ( k )

    r s t ( 1 )

    Din

    clk

    rst

    2m

    ALU

    r s t

    c l k

    Data Line

    CONTROL do ne done

    Fig. 1. The characteristic 3 elliptic curve T processor

    To facilitate ease of processor reconguration and to ensureexibility, the VHDL code for the processor is generatedusing a C program. Using C code the eld size, the numberof multipliers and their digit sizes, the instruction set and,indeed, the size of the memory blocks, can be automaticallyrecongured according to the application. The instruction setimplementing the bilinear pairing is also generated in C codeand written to a le that is loaded into the ROM.

    It is vital that the scheduling of operations be as efcient aspossible to ensure that valuable clock cycles are not wasted.Operations are scheduled such that, if possible, additions, sub-tractions and cubings are performed and their results writtento RAM while multiplication or inversion is in progress. Inparticular, the scheduling of operations through the loop of Algorithm 1 is vital since the loop is performed (m + 1) / 2times. To achieve a fast throughput through the loop, a largelevel of parallelisation is achieved during loop computation asoperations on successive iterations of the loop are performed

    in parallel, where possible.

    V. R ESULTS

    Results returned by the processor when T (P, Q )M isimplemented with 1, 2, 3, 4, 5 and 8 multipliers are providedin Table III. Note that an implementation with either 6 or7 multipliers would not yield an efcient use of resources.This is because the multiplications required within the loop of Algorithm 1 dominate the pairing calculation time. The latencyof these 15 multiplications does not improve between 5 and7 (it remains at 3m/D until a conguration incorporating 8multipliers is instantiated). Operations required by the F 36 mmultiplication are also scheduled to make use of the variousnumbers of multipliers.

    Results are returned by the processor when implementedwith serial MSC multipliers ( D 1) and for digit multiplierswith digit sizes of 4, 8 and 16. Note that when conguredwith the MSC multiplier, the maximum attainable post place

    nternational Conference on Information Technology (ITNG'07)0-7695-2776-0/07 $20.00 2007

  • 7/25/2019 Ronan Et Al. - 2007 - A Reconfigurable Processor for the Cryptographic T Pairing in Characteristic 3-Annotated

    5/6

    (a) Total area for the T Processor in slices (b) Total number of clock cycles for T (P, Q )M

    (c) Total time for computation of T (P, Q )M in s (d) Area-Time Product for the T Processor in slices.s

    Fig. 2. T processor results when implemented with 1, 2, 3, 4, 5 and 8 multipliers with D=1, 4, 8, 16

    and route clock frequency of the processor is 91.2MHz. Themaximum frequencies for the processor when implementedwith digit multipliers of D=4, 8 and 16 are 84.8MHz, 70.4MHzand 61.6MHz respectively. The computation times in TableIII are returned by the processor at these post place & routefrequencies.

    Figure 2 shows trends in the area of the processor (slices),

    the total number of clock cycles required for T (P, Q )M

    , thetotal computation time ( s) and the area-time product returnedfor the various congurations ( slices.s ). From Table III andFigure 2(a) it can be seen that the processor can be conguredfor a large range of area costs, depending on the requiredcomputation time and application (361039553 slices).

    As seen in Figure 2(b) the total number of clock cyclesdecreases as both the digit size and number of multipliersincrease. An increase in the number of multipliers in the D 1 case provides for a dramatic reduction in the number of

    clock cycles required since, in this case, each multiplicativeoperation consumes a relatively large number ( m ) of clock cycles. The effect is less dramatic in the D > 1 cases sincenot only are the multiplicative operations less dominant butthere is less time to pipeline intermediate additions while themultiplications are being performed.

    Figure 2(c) illustrates trends in the time taken to compute

    T (P, Q )M and provides some interesting results. Here, thedecrease in clock frequency as the digit size is increased hasan effect on the computation. From Table III the fastest pairingis returned in 172s for the {D=8, number of muls=8 } case.However, note from Table III that increasing the number of multipliers from 3 to 8 for this digit size returns a very smalldecrease in computation time. A more efcient option for afast throughput architecture is the {D=8, number of muls=3 }case, which requires only 10, 000 slices (22.6% of the FPGA)and returns a cryptographically secure pairing in 178 s.

    nternational Conference on Information Technology (ITNG'07)0-7695-2776-0/07 $20.00 2007

  • 7/25/2019 Ronan Et Al. - 2007 - A Reconfigurable Processor for the Cryptographic T Pairing in Characteristic 3-Annotated

    6/6

    Number of Multipliers

    D m 1 2 3 4 5 8

    Area of the T Processor (slices)

    1 3610 4420 5190 6069 7093 9652

    4 4125 5657 7265 8887 10540 15401

    8 4995 7491 10000 12520 15056 22632

    16 7036 11635 16290 20943 25626 39553

    Clock Cycles for T (P, Q )M

    1 106714 60150 41453 35629 29653 24467

    4 36586 23286 18053 16885 15853 15529

    8 24898 17190 15113 15081 14615 14451

    16 19054 15417 14331 14385 14089 14445

    Execution Time for T (P, Q )M (s)

    1 1170 659 454 391 325 268

    4 432 275 213 199 187 183

    8 294 203 178 178 173 172

    16 309 250 232 233 228 227

    Area Time Product (slices. s)

    1 4.22 2.91 2.36 2.37 2.31 2.59

    4 1.78 1.55 1.55 1.77 1.97 2.82

    8 1.47 1.52 1.78 2.23 2.60 3.88

    16 2.17 2.90 3.78 4.88 5.85 8.97

    TABLE III

    RESULTS RETURNED BY THE T PROCESSOR

    Trends in the area-time (AT) product of the processor areillustrated in Figure 2(d) and are, perhaps, the most revealing

    since a measure of the efciency of the processor in the variouscongurations is provided. In the D 1 case, the AT productreduces continuously until the number of multipliers is 5.Increasing the number of multipliers to 8 yields a less efcientutilisation of area as the extra 3 multipliers do not provide alarge enough reduction in computation time to warrant theirinclusion. The AT product oor is hit when the number of multipliers is 3 in the D = 4 case. For digit sizes of 8 and16, the most efcient utilisation of resources is returned by aprocessor with just one multiplier as the latency through themultipliers is reduced and the additions begin to dominate thecomputation time. Overall, the most efcient use of resourcesis returned by the {D = 8 , number of muls= 2} case. This

    conguration costs only 7491 slices (17% of the FPGA) andreturns a fully secure pairing in 203s. A. Comparisons

    Consider, for example, our pairing result of 178s or 15,113clock cycles utilising 10,000 slices returning an area-timeproduct of 1.78 slices.s in the {D=8, number of muls=2 }case. The pairing computation time provides a large speedupover the software implementation time of 2.72ms on a PentiumIV processor running at 3GHz in [5]. It also provides alarge improvement on the estimated pairing time of 0.85ms

    (at 15MHz) reported for a characteristic 3 hardware pairingimplementation in [7]. In [8] the characteristic 3 pairing algo-rithms of [6] and [4] are performed in 64887 and 69543 clock cycles respectively with an area of 4481 slices. No post placeand route frequency values are provided. Our implementationcompares favourably with these results.

    VI . C ONCLUSIONS

    In this paper we have discussed the efcient implementationof the T pairing in characteristic 3. All F 3 , F 3m and F 36 moperations required to compute a secure cryptographic pairing(including nal exponentiation) have been discussed and theirhardware implementation considered. A recongurable hard-ware processor for the T pairing, which can be tailored forvarious applications by changing the eld size, the number of multipliers and the digit sizes of the multipliers has been pre-sented. Results returned by the processor when implementedover a eld size of F 397 have been presented.

    R EFERENCES

    [1] V. S. Miller, Short programs for functions on curves, Unpublishedmanuscript, 1986, http://crypto.stanford.edu/miller/miller.pdf.

    [2] P. S. L. M. Barreto, H. Y. Kim, B. Lynn, and M. Scott, Efcientalgorithms for pairing-based cryptosystems, in Advances in Cryptology Crypto2002 , ser. Lecture Notes in Computer Science, vol. 2442.Springer-Verlag, 2002, pp. 354368.

    [3] S. Galbraith, K. Harrison, and D. Soldera, Implementing the Tatepairing, in Algorithmic Number Theory ANTS V , ser. Lecture Notesin Computer Science, vol. 2369. Springer-Verlag, 2002, pp. 324337.

    [4] I. Duursma and H.-S. Lee, Tate pairing implementation for hyperellipticcurves y2 = xp x + d, in Advances in Cryptology Asiacrypt2003 ,ser. Lecture Notes in Computer Science, vol. 2894. Springer-Verlag,2003, pp. 111123.

    [5] P. S. L. M. Barreto, S. Galbraith, C. O. hEigeartaigh, andM. Scott, Pairing computation on supersingular abelian varieties,Cryptology ePrint Archive, Report 2004/375, 2004, available fromhttp://eprint.iacr.org/2004/375.

    [6] S. Kwon, Efcient Tate pairing computation for elliptic curves overbinary elds, in Australasian Conference on Information Security and Privacy ACISP 2005 , ser. Lecture Notes in Computer Science, vol.3574. Springer-Verlag, 2005, pp. 134145.

    [7] T. Kerins, W. Marnane, E. Popovici, and P. Barreto, Efcient hard-ware for the Tate pairing calculation in characteristic 3, in Crypto-graphic Hardware and Embedded Systems (CHES) , vol. 3659 of LNCS.Springer, 2005, pp. 412426.

    [8] P. Grabher and D. Page, Hardware acceleration of the tate pairing incharacteristic 3, in Cryptographic Hardware and Embedded Systems(CHES) , ser. Lecture Notes in Computer Science, vol. 3659. Springer-Verlag, 2005, pp. 398411.

    [9] G. Bertoni, L. Breveglieri, P. Fragneto, and G. Pelosi, Parallel Hard-ware Architectures for the Cryptographic Tate Pairing, in InformationTechnology: New Generations . IEEE Computer Society, 2006, pp. 186191.

    [10] J. Beuchat, M. Shirase, T. Takagi, and E. Okamoto, An algorithmfor the T pairing calculation in characteristic three and its hardwareimplementation, Cryptology ePrint Archive, Report 2006/327, 2006,http://eprint.iacr.org/2006/327.

    [11] G. Bertoni, J. Guajardo, S. Kumar, G. Orlando, C. Paar, and T. Wollinger,Efcient GF ( pm ) arithmetic architectures for cryptographic applica-tions, in Topics in Cryptology - CT RSA 2003 , ser. Lecture Notes inComputer Science, vol. 2612. Springer-Verlag, 2003, pp. 158175.

    [12] T. Kerins, E. M. Popovici, and W. P. Marnane, Algorithms and archi-tectures for use in FPGA implementations of identity based encryptionschemes. in FPL , 2004, pp. 7483.

    [13] L. Song and K. Parhi, Low-energy digit-serial/parallel nite eldmultipliers, Journal of VLSI Signal Processing Systems , vol. 2(22), pp.117, 1997.

    nternational Conference on Information Technology (ITNG'07)0-7695-2776-0/07 $20.00 2007