Pushing the Limits of High-Speed GF(2m) Elliptic Curve ... · PDF filePushing the Limits of High-Speed GF(2m) Elliptic Curve Scalar Multiplier on FPGAs ... ROM Control Unit (b) BlockDiagram

Pushing the Limits of High-Speed GF (2m) Elliptic

Curve Scalar Multiplier on FPGAs

Chester Rebeiro, Sujoy Sinha Roy, and Debdeep Mukhopadhyay

Secured Embedded Architecture LabIndian Institute of Technology Kharagpur

India

9/12/2012

CHES 2012, Leuven Belgium

Elliptic Curve Scalar Multiplication

• An elliptic curve over GF (2m) is a set of points which satisfiesthe equation

y2 + xy = x3 + ax2 + b ,

where a, b ∈ GF (2m) and b 6= 0. The points on the ellipticcurve form an additive group.

• The projective coordinate representation of the curve is

Y 2 + XYZ = X 3 + aX 2Z 2 + bZ 4

• Scalar Multiplication : Given a base point P = (XP ,YP ,ZP)on the elliptic curve and a scalar s compute Q = sP

(i.e. Q = P + P + P + · · · (stimes))

9/12/2012 CHES 2012, Leuven Belgium 2

Montgomery Ladder for Scalar MultiplicationInputs : scalar s = (st−1st−2 · · · s1s0)2

basepoint P

Output : Scalar Product Q = sP

1 P1 = (X1,Y1,Z1)← P and P2 = (X2,Y2,Z2)← 2 · P2 For each bit sk (for k = t − 2, t − 3, · · · , 0)

• if sk = 1 then P1 ← P1 + P2 ; P2 = 2 · P2

• if sk = 0 then P2 ← P1 + P2 ; P1 = 2 · P1

3 Q = Projective2Affine(P1)


Montgomery Ladder for Scalar MultiplicationInputs : scalar s = (st−1st−2 · · · s1s0)2

basepoint P

Output : Scalar Product Q = sP

1 P1 = (X1,Y1,Z1)← P and P2 = (X2,Y2,Z2)← 2 · P2 For each bit sk (for k = t − 2, t − 3, · · · , 0)

• if sk = 1 then P1 ← P1 + P2 ; P2 = 2 · P2

• if sk = 0 then P2 ← P1 + P2 ; P1 = 2 · P1

3 Q = Projective2Affine(P1)

Performing Pi ← Pi + Pj and Pj ← 2 · Pi

Xi ← Xi · Zj ; Zi ← Xj · Zi ; T ← Xj ; Xj ← X 4j + b · Z 4

j

Zj ← (T · Zj)2 ; T ← Xi · Zi ; Zi ← (Xi + Zi)

2 ; Xi ← x · Zi + T

. . . all operations are in GF (2m)


Engineering the Montgomery Ladder for Scalar

Multiplication

Multiplication

Elliptic Curve

Finite Field Operations

Group Operations

Scalar

(a) The ECC Pyramid

Regbank

sP

s

ArithmeticUnit

ROM Control Unit

(b) Block Diagram


Engineering the Montgomery Ladder for Scalar

Multiplication

Multiplication

Elliptic Curve

Finite Field Operations

Group Operations

Scalar

(a) The ECC Pyramid

Regbank

sP

s

ArithmeticUnit

ROM Control Unit

(b) Block Diagram

High-speed scalar multiplication on FPGAs

• Minimize area by maximizing utilization of available resources

• Optimal Pipelining

• Efficient Scheduling of Operations


Field Programmable Gate Arrays

• Provides the speed of hardware and the reconfigurablitity ofsoftware

• FPGA Architecture

Programmable Connection

Routing Switches

Logic Block

Programmable

Switch

(a) FPGA Island


Field Programmable Gate Arrays

• Provides the speed of hardware and the reconfigurablitity ofsoftware

• FPGA Architecture

Programmable Connection

Routing Switches

Logic Block

Programmable

Switch

(a) FPGA Island

CLK

CIN

COUT

F1

F2

F3

F4

CLK

CE

SR

BY

PRE

D

CE

Q

CLR

LUT &CarryLogic

Control

(b) Lookup Table


LUT Utilization

LUT

• Four (or six) input → one output

• Can implement any four (or six) input truth table

• y1 = x1 ⊕ x2 ⊕ x3 ⊕ x4 Requires one LUT.

• y2 = x1 ⊕ x2 Still requires one LUT.


LUT Utilization

LUT

• Four (or six) input → one output

• Can implement any four (or six) input truth table

• y1 = x1 ⊕ x2 ⊕ x3 ⊕ x4 Requires one LUT.

• y2 = x1 ⊕ x2 Still requires one LUT.

• y2 results in an under utilized LUT.

. . . need to maximize LUT utilization to minimize area.


Finite field Multiplier for Best LUT utilization

7 7 7 8 7 7 7 87 8 7 87 8 7 77 87 7 7 8 7 7 7 7 7 7 7 8 5 6

29 29 29 29

14 14 15 14 15

5858 59

116 117

58

14 15 14 15 14 15 15 15 1515 14

233 Karatusba Multiplier

29 29 29 30

(a) Karatsuba-Ofman Multiplication

[VLSID 2008]9/12/2012 CHES 2012, Leuven Belgium 7


7 7 7 8 7 7 7 87 8 7 87 8 7 77 87 7 7 8 7 7 7 7 7 7 7 8 5 6

29 29 29 29

14 14 15 14 15

5858 59

116 117

58

14 15 14 15 14 15 15 15 1515 14


29 29 29 30


233

29 29 29 29

14 14 15 14 15

5858 59

29 29 29 30

116 117

58

14 15 14 15 14 15 15 15 1515 14

Karatusba Multiplier

77 78 7778 7778 7778 7778 7778 77 78 7856

Classical Multiplier

(b) Hybrid Karatsuba Multiplication



7 7 7 8 7 7 7 87 8 7 87 8 7 77 87 7 7 8 7 7 7 7 7 7 7 8 5 6

29 29 29 29

14 14 15 14 15

5858 59

116 117

58

14 15 14 15 14 15 15 15 1515 14


29 29 29 30


233

29 29 29 29

14 14 15 14 15

5858 59

29 29 29 30

116 117

58

14 15 14 15 14 15 15 15 1515 14


77 78 7778 7778 7778 7778 7778 77 78 7856



8800

9600

4 6 8 10 12 14 16 18 20 22

LU

Ts

Threshold

(c) Finding the Right Threshold



7 7 7 8 7 7 7 87 8 7 87 8 7 77 87 7 7 8 7 7 7 7 7 7 7 8 5 6

29 29 29 29

14 14 15 14 15

5858 59

116 117

58

14 15 14 15 14 15 15 15 1515 14


29 29 29 30


233

29 29 29 29

14 14 15 14 15

5858 59

29 29 29 30

116 117

58

14 15 14 15 14 15 15 15 1515 14


77 78 7778 7778 7778 7778 7778 77 78 7856



8800

9600

4 6 8 10 12 14 16 18 20 22

LU

Ts

Threshold

(c) Finding the Right Threshold

0

100000

200000

300000

400000

500000

600000

700000

800000

900000

1e+06

1.1e+06

90

180

270

360

450

540

Area

* Ti

me

Number of bits

Karatsuba-OfmanHybrid Karatsuba

(d) Comparing Multipliers


Finite Field Inversion Using Itoh-Tsujii Algorithm

• Given a ∈ GF (2m), find a−1 ∈ GF (2m) such that a · a−1 = 1• Fermat’s Little Theorem : a−1 = a2

m−2

• Itoh-Tsujii Algorithm

1 Define the addition chain for m − 1(for example m = 233 : (1, 2, 3, 6, 7, 14, 28, 58, 116, 232))

2 Computea→ a2

2−1 → a2

3−1 → a2

6−1 → a2

7−1 → a2

14−1 · · · → a2

232−1

3 Square to get a2233

−2


Finite Field Inversion Using Itoh-Tsujii Algorithm

• Given a ∈ GF (2m), find a−1 ∈ GF (2m) such that a · a−1 = 1• Fermat’s Little Theorem : a−1 = a2

m−2

• Itoh-Tsujii Algorithm

1 Define the addition chain for m − 1(for example m = 233 : (1, 2, 3, 6, 7, 14, 28, 58, 116, 232))

2 Computea→ a2

2−1 → a2

3−1 → a2

6−1 → a2

7−1 → a2

14−1 · · · → a2

232−1

3 Square to get a2233

−2

• Exponentiation requires a series of cascaded squarers calledpowerblock along with a finite field multiplier

Square Square Square Square

Circuit−1 Circuit−11

Multiplexerqsel

Input

Output

Circuit−2 Circuit−3


Using Higher Exponents in the Itoh-Tsujii AlgorithmConsider using a quad circuit instead of a square.

• This requires an addition chain to m−12 instead of m − 1 thus

finishes faster.

[IEEE TVLSI 2011, DATE 2011]




finishes faster.• The frequency of operation is not affected and area used isless due to better LUT utilization.

Table: Comparison of Squarer and Quad Circuits on Xilinx Virtex 4 FPGA

Field Squarer Circuit Quad Circuit Size ratio

#LUTs Delay (ns) #LUTq Delay (ns)#LUTq

2(#LUTs )

GF (2193) 96 1.48 145 1.48 0.75

GF (2233) 153 1.48 230 1.48 0.75









2(#LUTs )

GF (2193) 96 1.48 145 1.48 0.75

GF (2233) 153 1.48 230 1.48 0.75

• Larger exponent circuits can similarly be used to obtain fasterresults.









2(#LUTs )

GF (2193) 96 1.48 145 1.48 0.75

GF (2233) 153 1.48 230 1.48 0.75

• Larger exponent circuits can similarly be used to obtain fasterresults.

• However there is an initial overhead of computing a2q−1,

which increases as the exponent circuit increases.



The Arithmetic Unit

2n2n

Circuit 12n

us

D_sel

C_sel

MUXD

E

MUXA0

E_sel

A2

A3

Circuit 2 Circuit

Pow_sel

B_sel

MUX

B

A2

A3

A1

Quad

A0

C0

Powerblock

SquarerC1

C2

Squarer

Quad

A_sel

MUX

A

A0

Field

(HBKM)(#11)

(#2) (#2)

(#2)

(#2)

(#2)

(#2)

MUXF

(#2)

(#1)

(#1) (#2)

(#1)C

MUX

ARITHMETIC UNIT

(#2) (#2) (#2)

MultiplierFinite Field Multiplication

Power Block


The Arithmetic Unit

2n2n

Circuit 12n

us

C_sel

Circuit 2 Circuit

Pow_sel

A0

C0

Powerblock

Field

MUXF

(#2)(#1)C

MUX

ARITHMETIC UNIT

(#2) (#2) (#2)

Multiplier

^4 ^2

^4

^2

Power Block


The Arithmetic Unit

2n2n

Circuit 12n

us

C_sel

Circuit 2 Circuit A0

Powerblock

Field

ARITHMETIC UNIT

Multiplier

^4 ^2

^4

^2


The Register Bank

2n2n

Circuit 12n

us

D_sel

C_sel

MUXD

E

MUXA0

E_sel

A2

A3

Circuit 2 Circuit

Pow_sel

B_sel

MUX

B

A2

A3

A1

Quad

A0

C0

Powerblock

SquarerC1

C2

A3

A2

A1

A0Squarer

Quad

A_sel

MUX

A

A0

FieldMultiplier

(HBKM)

C2C1C0

Qout

Qin

REGISTERBANK

MUXF C

MUX

ARITHMETIC UNIT

Six registers implemented as flip-flops to maximize CLButilization


Pipelining the Processor• How many stages of pipelining ?

• To many would increase the frequency of operation but makesit difficult to schedule instructions without bubbles

• To few would result in a low frequency of operation





• Where to place the pipeline stages?






2n2n

Circuit 12n

us

D_sel

C_sel

MUXD

E

MUXA0

E_sel

A2

A3

Circuit 2 Circuit

Pow_sel

B_sel

MUX

B

A2

A3

A1

Quad

A0

C0

Powerblock

SquarerC1

C2

A3

A2

A1

A0Squarer

Quad

A_sel

MUX

A

A0

FieldMultiplier

(HBKM)

C2C1C0

Qout

Qin

REGISTERBANK

Reg

iste

rs

Muxes

MuxesInput

Output

MUXF C

MUX

ARITHMETIC UNIT

CRITICAL PATH

(a) A Bad Pipline StagePlacement

2n2n

Circuit 12n

us

D_sel

C_sel

MUXD

E

MUXA0

E_sel

A2

A3

Circuit 2 Circuit

Pow_sel

B_sel

MUX

B

A2

A3

A1

Quad

A0

C0

Powerblock

SquarerC1

C2

A3

A2

A1

A0Squarer

Quad

A_sel

MUX

A

A0

FieldMultiplier

(HBKM)

C2C1C0

Qout

Qin

REGISTERBANK

Reg

iste

rs

Muxes

MuxesInput

Output

MUXF C

MUX

ARITHMETIC UNIT

CRITICAL PATH

(b) A Good Pipeline StagePlacement






2n2n

Circuit 12n

us

D_sel

C_sel

MUXD

E

MUXA0

E_sel

A2

A3

Circuit 2 Circuit

Pow_sel

B_sel

MUX

B

A2

A3

A1

Quad

A0

C0

Powerblock

SquarerC1

C2

A3

A2

A1

A0Squarer

Quad

A_sel

MUX

A

A0

FieldMultiplier

(HBKM)

C2C1C0

Qout

Qin

REGISTERBANK

Reg

iste

rs

Muxes

MuxesInput

Output

MUXF C

MUX

ARITHMETIC UNIT

CRITICAL PATH

(a) A Bad Pipline StagePlacement

2n2n

Circuit 12n

us

D_sel

C_sel

MUXD

E

MUXA0

E_sel

A2

A3

Circuit 2 Circuit

Pow_sel

B_sel

MUX

B

A2

A3

A1

Quad

A0

C0

Powerblock

SquarerC1

C2

A3

A2

A1

A0Squarer

Quad

A_sel

MUX

A

A0

FieldMultiplier

(HBKM)

C2C1C0

Qout

Qin

REGISTERBANK

Reg

iste

rs

Muxes

MuxesInput

Output

MUXF C

MUX

ARITHMETIC UNIT

CRITICAL PATH

(b) A Good Pipeline StagePlacement

Can we determine the best pipeline strategy at the designexploration phase?


Optimally Pipeline into L Stages Apriori

1 Model the delay of the design

2 Use the model to identify all critical paths (with delay t)

3 Place pipeline registers to ensure that no path has delaygreater than t

L


Modeling the Delay

• The delay of a circuit is proportional to the number of LUTsin the critical path

4 5 6 7 8 9

10

5 6 7 8 9 10 11No

. o

f L

UT

s in

Critic

al P

ath

Delay of Circuit (in ns)


Modeling the Delay

• The delay of a circuit is proportional to the number of LUTsin the critical path

4 5 6 7 8 9

10

5 6 7 8 9 10 11No

. o

f L

UT

s in

Critic

al P

ath

Delay of Circuit (in ns)

• The number of LUTs in the critical path of an n inputBoolean function f (x1, x2, x3, · · · xn) is ⌈logk(n)⌉, where k isthe number of inputs to the LUT.


Modeling LUTdelay of all Elements in the Processor

Component k− LUT Delay for k ≥ 4 m = 163, k = 4m bit field adder 1 1m bit n : 1 Mux ⌈logk(n + log2n)⌉ 2 (for n = 4)(Dn:1(m)) 1 (for n = 2)

Exponentiation max(LUTDelay(di )), where di is the i th 2 (for n = 1)Circuit (D2n (m)) output bit of the exponentiation circuit 2 (for n = 2)Powerblock us × D2n (m) + Dus :1(m) 4 (for us = 2)(Dpowerblk(m))

Modular Reduction 1 for irreducible trinomials 2(Dmod ) 2 for pentanomials (for pentanomials)HBKM 11 (for τ = 11)(DHBKM(m)) Dsplit + Dthreshold + Dcombine + Dmod

= ⌈logk (mτ)⌉ + ⌈logk (2τ)⌉

+⌈log2(

mτ

)

⌉ + Dmod


Critical Path : Placing Pipeline Registers Optimally

C_sel

B_sel

A3

A1

Powerblock

Squarer

C2

A3

A2

A1

A0

FieldCritical Path

Multiplier

(HBKM)

C2C1C0

Qout

Qin

REGISTERBANK

Reg

iste

rs

ARITHMETIC UNIT

(#2)

(#1)

(#1)(#1)

(#2)(#2)

(#11)

(#2)

(#2)

(#2)

(#2) (#2)

(#2)

(#2) (#2)(#2)

(#2)


Critical Path : Placing Pipeline Resgisters Optimally

Squarer Squarer

4:1 Input MuxRegister BankMerged Squarer and Adder

HBKM

(#1)

MUX

B

MUX

D

Pipeline Register

(#11)(#2)

(#6) (#6) (#5) (#6)

Register Bank

4:1 Output Mux

(#2)(#2)(#1)

(#2)(#2)

4 stage pipeline

• Critical path length : 23 LUTs

• Time Period : ⌈234 ⌉ = 6 LUTs


Scheduling of Operations (for 1 bit)

Table: Scheduling Instructions

ek1 : Xi ← Xi · Zj ek4 : Zj ← (T · Zj )2

ek2 : Zi ← Xj · Zi ek5 : T ← Xi · Zi ; Zi ← (Xi + Zi )2

ek3 : T ← Xj ; Xj ← X 4j + b · Z4

j ek6 : Xi ← x · Zi + T


Scheduling of Operations (for 1 bit)

Table: Scheduling Instructions

ek1 : Xi ← Xi · Zj ek4 : Zj ← (T · Zj )2

ek2 : Zi ← Xj · Zi ek5 : T ← Xi · Zi ; Zi ← (Xi + Zi )2

ek3 : T ← Xj ; Xj ← X 4j + b · Z4

j ek6 : Xi ← x · Zi + T

ek1 ek2 ek3

ek5

ek6

ek4

(a) Data Dependencies BetweenInstructions

1 2 3 4 5 6 7 8Clock Cycle

ek6

ek5

ek4

ek3

ek2or ek

1

ek1or ek

2

(b) Timing Diagram for 3 StagePipeline


Scheduling of 2 Consequtive bits : Overlapping two bits

When two consequtive bits are equal

sk−1

sk

ek1 ek2 ek3

ek5

ek6

ek4

ek−11ek−12 ek−13

ek−15

ek−16

ek−14

153 4 5 6 7 9 10 11 12 13 148

Clock Cycles

1 2

Stolen Clock Cycle

ek−16

ek−15

ek−14

ek−13

ek−12

ek−11

ek6

ek5

ek4

ek3

ek2 or ek1

ek1 or ek2sk completes

sk−2starts

sk−1starts


Scheduling of 2 Consequtive bits : Overlapping two bits

When two consequtive bits are NOT equal

sk−1

sk

ek1 ek2 ek3

ek4

ek6

ek5

ek−11ek−12 ek−13

ek−14

ek−15

ek−16

153 4 5 6 7 9 10 11 12 13 148

Clock Cycles

1 2

Stolen Clock Cycle

ek−16

ek−15

ek−14

ek−13

ek−12

ek−11

ek6

ek5

ek4

ek3

ek2 or ek1

ek1 or ek2 sk completes

sk−1 starts

sk−2 starts


Clock Cycles for L Stage Pipeline• Without overlapping 2L+ 2 clock cycles are required per bit• With overlapping, we save a clock cycle, so 2L+ 1• Additionally,

• If there is a pipline stage soon after the multiplier, then dataforwarding is feasible

• Clock cycles per bit is 2L

1 2 3 4 5 6 7 8Clock Cycle

9

Data Forwarding Path

10

ek6

ek5

ek4

ek3

sk−1 starts

ek1or ek

2

ek2or ek

1

Clock cycles saved are m or 2m9/12/2012 CHES 2012, Leuven Belgium 23

Modeling Computation Time to find the Right PipelineSo, we now know• How to place pipeline registers• and estimate the clock cycles required

Putting it together, we can estimate how long it takes to perform ascalar multiplication

Table: Computation Time Estimates for Various Values of L for an ECM over GF (2163) and FPGA with 4

input LUTs

L us DataForwarding Computation TimeFeasible

1 9 No 10192 4 No 5243 3 No 4124 2 Yes 3575 1 No 3956 1 Yes 3607 1 Yes 358


Architecture Details

1Z2Z X1X2

us

2n2n

Circuit 12n

A0

A1

A2

A3

A1

A1

A1

A3

C

MOD

Combine OutputsThreshold MultsSplit Operands

Combine Level 1

Combine Level 4

{

STAGE 1 STAGE 2 STAGE 3 STAGE 4

REGISTER BANK

ARITHMETIC UNIT

ClockReset

Scalar

B_selC_selD_selE_selPow_selA0_selA1_selA2_selA3_selQin_selX_selZ_sel

X1_wX2_wZ1_wZ2_wT_wA_w

Write EnableRegister

MultiplexerSelect Lines

y

Z_sel

Control Unit

MUX Z XX_sel

1

Base Point in ROMCurve Constant

x A TA_sel

Qin_sel

A2_sel A1_selA3_sel

A0_sel

Data Forwarding Path

PowerblockCircuit 2 Circuit

b

Qin

A0A1A2A3

A2 DMUX

MUX

C_sel

D_sel

Qout

A0

Pow_sel

A0

A3

A2

A0

A_sel

B_sel

E_sel

MUX F

C0

C1

C2

Squarer

Quad

Squarer

Quad

MUX

MUX

MUX

E

B

A

HBKM

MUX GMUX HMUX IMUX JK


Finite State Machine

Initialization Coordinate Conversion

Completion

CompletionAD2

0AD3

0AD4

0AD9

0

AD1

neq

AD4

1AD2

1AD3

1

AD1

eq

AD9

1

I0 I1 I5 AD1 C1 C2

sk−1 6= sk

sk−1 = sk

sk−1 = sk

sk−1 6= sk

sk−1 = 1

sk−1 = 0

st−2 = 1

st−2 = 0sk−1 = 0

sk−1 = 1

C125


Comparisons

Work Platform Field Slices LUTs Freq Comp.(m) (MHz) Time (µs)

Orlando XCV400E 163 - 3002 76.7 210Bednara XCV1000 191 - 48300 36 270Gura XCV2000E 163 - 19508 66.5 140Lutz XCV2000E 163 - 10017 66 233Saqib XCV3200 191 18314 - 10 56Pu XC2V1000 193 - 3601 115 167Ansari XC2V2000 163 - 8300 100 42Rebeiro XC4V140 233 19674 37073 60 31

Jarvinen1 Stratix II 163 (11800ALMs) - - 48.9

Kim 2 XC4VLX80 163 24363 - 143 10.1Chelton XCV2600E 163 15368 26390 91 33

XC4V200 163 16209 26364 153.9 19.5Azarderakhsh XC4CLX100 163 12834 22815 196 17.2

XC5VLX110 163 6536 17305 262 12.9Our Result (Virtex 4 FPGA) XC4VLX80 163 8070 14265 147 9.7

XC4V200 163 8095 14507 132 10.7XC4VLX100 233 13620 23147 154 12.5

Our Result (Virtex 5 FPGA) XC5VLX85t 163 3446 10176 167 8.6XC5VSX240 163 3513 10195 148 9.5XC5VLX85t 233 5644 18097 156 12.3

1. uses 4 field multipliers; 2. uses 3 field multipliers; 3. uses 2 field multipliers


Conclusions

• We show the implementation of a high-speed elliptic curvecrypto-processor for FPGA platforms

• The use of highly optimized finite field primitives, efficientutilization of FPGA primitives, help reduce the area, which inturn make routing easier

• A theoretically designed pipline strategy provides idealpipelining of the design to increase clock frequncy

• Efficient scheduling with data forwarding machanisms reduceclock cycle requirement

• All these result in one of the fastest elliptic curveimplementation on FPGAs

• Additionally, area required is significantly less compared toother high-speed designs


Thank You for your Attention


Documents

Pushing the Limits of High-Speed GF(2m) Elliptic Curve ... · PDF filePushing the Limits of High-Speed GF(2m) Elliptic Curve Scalar Multiplier on FPGAs ... ROM Control Unit (b) BlockDiagram