Upload
dangtuyen
View
218
Download
2
Embed Size (px)
Citation preview
Pushing the Limits of High-Speed GF (2m) Elliptic
Curve Scalar Multiplier on FPGAs
Chester Rebeiro, Sujoy Sinha Roy, and Debdeep Mukhopadhyay
Secured Embedded Architecture LabIndian Institute of Technology Kharagpur
India
9/12/2012
CHES 2012, Leuven Belgium
Elliptic Curve Scalar Multiplication
• An elliptic curve over GF (2m) is a set of points which satisfiesthe equation
y2 + xy = x3 + ax2 + b ,
where a, b ∈ GF (2m) and b 6= 0. The points on the ellipticcurve form an additive group.
• The projective coordinate representation of the curve is
Y 2 + XYZ = X 3 + aX 2Z 2 + bZ 4
• Scalar Multiplication : Given a base point P = (XP ,YP ,ZP)on the elliptic curve and a scalar s compute Q = sP
(i.e. Q = P + P + P + · · · (stimes))
9/12/2012 CHES 2012, Leuven Belgium 2
Montgomery Ladder for Scalar MultiplicationInputs : scalar s = (st−1st−2 · · · s1s0)2
basepoint P
Output : Scalar Product Q = sP
1 P1 = (X1,Y1,Z1)← P and P2 = (X2,Y2,Z2)← 2 · P2 For each bit sk (for k = t − 2, t − 3, · · · , 0)
• if sk = 1 then P1 ← P1 + P2 ; P2 = 2 · P2
• if sk = 0 then P2 ← P1 + P2 ; P1 = 2 · P1
3 Q = Projective2Affine(P1)
9/12/2012 CHES 2012, Leuven Belgium 3
Montgomery Ladder for Scalar MultiplicationInputs : scalar s = (st−1st−2 · · · s1s0)2
basepoint P
Output : Scalar Product Q = sP
1 P1 = (X1,Y1,Z1)← P and P2 = (X2,Y2,Z2)← 2 · P2 For each bit sk (for k = t − 2, t − 3, · · · , 0)
• if sk = 1 then P1 ← P1 + P2 ; P2 = 2 · P2
• if sk = 0 then P2 ← P1 + P2 ; P1 = 2 · P1
3 Q = Projective2Affine(P1)
Performing Pi ← Pi + Pj and Pj ← 2 · Pi
Xi ← Xi · Zj ; Zi ← Xj · Zi ; T ← Xj ; Xj ← X 4j + b · Z 4
j
Zj ← (T · Zj)2 ; T ← Xi · Zi ; Zi ← (Xi + Zi)
2 ; Xi ← x · Zi + T
. . . all operations are in GF (2m)
9/12/2012 CHES 2012, Leuven Belgium 3
Engineering the Montgomery Ladder for Scalar
Multiplication
Multiplication
Elliptic Curve
Finite Field Operations
Group Operations
Scalar
(a) The ECC Pyramid
Regbank
sP
s
ArithmeticUnit
ROM Control Unit
(b) Block Diagram
9/12/2012 CHES 2012, Leuven Belgium 4
Engineering the Montgomery Ladder for Scalar
Multiplication
Multiplication
Elliptic Curve
Finite Field Operations
Group Operations
Scalar
(a) The ECC Pyramid
Regbank
sP
s
ArithmeticUnit
ROM Control Unit
(b) Block Diagram
High-speed scalar multiplication on FPGAs
• Minimize area by maximizing utilization of available resources
• Optimal Pipelining
• Efficient Scheduling of Operations
9/12/2012 CHES 2012, Leuven Belgium 4
Field Programmable Gate Arrays
• Provides the speed of hardware and the reconfigurablitity ofsoftware
• FPGA Architecture
Programmable Connection
Routing Switches
Logic Block
Programmable
Switch
(a) FPGA Island
9/12/2012 CHES 2012, Leuven Belgium 5
Field Programmable Gate Arrays
• Provides the speed of hardware and the reconfigurablitity ofsoftware
• FPGA Architecture
Programmable Connection
Routing Switches
Logic Block
Programmable
Switch
(a) FPGA Island
CLK
CIN
COUT
F1
F2
F3
F4
CLK
CE
SR
BY
PRE
D
CE
Q
CLR
LUT &CarryLogic
Control
(b) Lookup Table
9/12/2012 CHES 2012, Leuven Belgium 5
LUT Utilization
LUT
• Four (or six) input → one output
• Can implement any four (or six) input truth table
• y1 = x1 ⊕ x2 ⊕ x3 ⊕ x4 Requires one LUT.
• y2 = x1 ⊕ x2 Still requires one LUT.
9/12/2012 CHES 2012, Leuven Belgium 6
LUT Utilization
LUT
• Four (or six) input → one output
• Can implement any four (or six) input truth table
• y1 = x1 ⊕ x2 ⊕ x3 ⊕ x4 Requires one LUT.
• y2 = x1 ⊕ x2 Still requires one LUT.
• y2 results in an under utilized LUT.
. . . need to maximize LUT utilization to minimize area.
9/12/2012 CHES 2012, Leuven Belgium 6
Finite field Multiplier for Best LUT utilization
7 7 7 8 7 7 7 87 8 7 87 8 7 77 87 7 7 8 7 7 7 7 7 7 7 8 5 6
29 29 29 29
14 14 15 14 15
5858 59
116 117
58
14 15 14 15 14 15 15 15 1515 14
233 Karatusba Multiplier
29 29 29 30
(a) Karatsuba-Ofman Multiplication
[VLSID 2008]9/12/2012 CHES 2012, Leuven Belgium 7
Finite field Multiplier for Best LUT utilization
7 7 7 8 7 7 7 87 8 7 87 8 7 77 87 7 7 8 7 7 7 7 7 7 7 8 5 6
29 29 29 29
14 14 15 14 15
5858 59
116 117
58
14 15 14 15 14 15 15 15 1515 14
233 Karatusba Multiplier
29 29 29 30
(a) Karatsuba-Ofman Multiplication
233
29 29 29 29
14 14 15 14 15
5858 59
29 29 29 30
116 117
58
14 15 14 15 14 15 15 15 1515 14
Karatusba Multiplier
77 78 7778 7778 7778 7778 7778 77 78 7856
Classical Multiplier
(b) Hybrid Karatsuba Multiplication
[VLSID 2008]9/12/2012 CHES 2012, Leuven Belgium 7
Finite field Multiplier for Best LUT utilization
7 7 7 8 7 7 7 87 8 7 87 8 7 77 87 7 7 8 7 7 7 7 7 7 7 8 5 6
29 29 29 29
14 14 15 14 15
5858 59
116 117
58
14 15 14 15 14 15 15 15 1515 14
233 Karatusba Multiplier
29 29 29 30
(a) Karatsuba-Ofman Multiplication
233
29 29 29 29
14 14 15 14 15
5858 59
29 29 29 30
116 117
58
14 15 14 15 14 15 15 15 1515 14
Karatusba Multiplier
77 78 7778 7778 7778 7778 7778 77 78 7856
Classical Multiplier
(b) Hybrid Karatsuba Multiplication
8800
9600
4 6 8 10 12 14 16 18 20 22
LU
Ts
Threshold
(c) Finding the Right Threshold
[VLSID 2008]9/12/2012 CHES 2012, Leuven Belgium 7
Finite field Multiplier for Best LUT utilization
7 7 7 8 7 7 7 87 8 7 87 8 7 77 87 7 7 8 7 7 7 7 7 7 7 8 5 6
29 29 29 29
14 14 15 14 15
5858 59
116 117
58
14 15 14 15 14 15 15 15 1515 14
233 Karatusba Multiplier
29 29 29 30
(a) Karatsuba-Ofman Multiplication
233
29 29 29 29
14 14 15 14 15
5858 59
29 29 29 30
116 117
58
14 15 14 15 14 15 15 15 1515 14
Karatusba Multiplier
77 78 7778 7778 7778 7778 7778 77 78 7856
Classical Multiplier
(b) Hybrid Karatsuba Multiplication
8800
9600
4 6 8 10 12 14 16 18 20 22
LU
Ts
Threshold
(c) Finding the Right Threshold
0
100000
200000
300000
400000
500000
600000
700000
800000
900000
1e+06
1.1e+06
90
180
270
360
450
540
Area
* Ti
me
Number of bits
Karatsuba-OfmanHybrid Karatsuba
(d) Comparing Multipliers
[VLSID 2008]9/12/2012 CHES 2012, Leuven Belgium 7
Finite Field Inversion Using Itoh-Tsujii Algorithm
• Given a ∈ GF (2m), find a−1 ∈ GF (2m) such that a · a−1 = 1• Fermat’s Little Theorem : a−1 = a2
m−2
• Itoh-Tsujii Algorithm
1 Define the addition chain for m − 1(for example m = 233 : (1, 2, 3, 6, 7, 14, 28, 58, 116, 232))
2 Computea→ a2
2−1 → a2
3−1 → a2
6−1 → a2
7−1 → a2
14−1 · · · → a2
232−1
3 Square to get a2233
−2
9/12/2012 CHES 2012, Leuven Belgium 8
Finite Field Inversion Using Itoh-Tsujii Algorithm
• Given a ∈ GF (2m), find a−1 ∈ GF (2m) such that a · a−1 = 1• Fermat’s Little Theorem : a−1 = a2
m−2
• Itoh-Tsujii Algorithm
1 Define the addition chain for m − 1(for example m = 233 : (1, 2, 3, 6, 7, 14, 28, 58, 116, 232))
2 Computea→ a2
2−1 → a2
3−1 → a2
6−1 → a2
7−1 → a2
14−1 · · · → a2
232−1
3 Square to get a2233
−2
• Exponentiation requires a series of cascaded squarers calledpowerblock along with a finite field multiplier
Square Square Square Square
Circuit−1 Circuit−11
Multiplexerqsel
Input
Output
Circuit−2 Circuit−3
9/12/2012 CHES 2012, Leuven Belgium 8
Using Higher Exponents in the Itoh-Tsujii AlgorithmConsider using a quad circuit instead of a square.
• This requires an addition chain to m−12 instead of m − 1 thus
finishes faster.
[IEEE TVLSI 2011, DATE 2011]
9/12/2012 CHES 2012, Leuven Belgium 9
Using Higher Exponents in the Itoh-Tsujii AlgorithmConsider using a quad circuit instead of a square.
• This requires an addition chain to m−12 instead of m − 1 thus
finishes faster.• The frequency of operation is not affected and area used isless due to better LUT utilization.
Table: Comparison of Squarer and Quad Circuits on Xilinx Virtex 4 FPGA
Field Squarer Circuit Quad Circuit Size ratio
#LUTs Delay (ns) #LUTq Delay (ns)#LUTq
2(#LUTs )
GF (2193) 96 1.48 145 1.48 0.75
GF (2233) 153 1.48 230 1.48 0.75
[IEEE TVLSI 2011, DATE 2011]
9/12/2012 CHES 2012, Leuven Belgium 9
Using Higher Exponents in the Itoh-Tsujii AlgorithmConsider using a quad circuit instead of a square.
• This requires an addition chain to m−12 instead of m − 1 thus
finishes faster.• The frequency of operation is not affected and area used isless due to better LUT utilization.
Table: Comparison of Squarer and Quad Circuits on Xilinx Virtex 4 FPGA
Field Squarer Circuit Quad Circuit Size ratio
#LUTs Delay (ns) #LUTq Delay (ns)#LUTq
2(#LUTs )
GF (2193) 96 1.48 145 1.48 0.75
GF (2233) 153 1.48 230 1.48 0.75
• Larger exponent circuits can similarly be used to obtain fasterresults.
[IEEE TVLSI 2011, DATE 2011]
9/12/2012 CHES 2012, Leuven Belgium 9
Using Higher Exponents in the Itoh-Tsujii AlgorithmConsider using a quad circuit instead of a square.
• This requires an addition chain to m−12 instead of m − 1 thus
finishes faster.• The frequency of operation is not affected and area used isless due to better LUT utilization.
Table: Comparison of Squarer and Quad Circuits on Xilinx Virtex 4 FPGA
Field Squarer Circuit Quad Circuit Size ratio
#LUTs Delay (ns) #LUTq Delay (ns)#LUTq
2(#LUTs )
GF (2193) 96 1.48 145 1.48 0.75
GF (2233) 153 1.48 230 1.48 0.75
• Larger exponent circuits can similarly be used to obtain fasterresults.
• However there is an initial overhead of computing a2q−1,
which increases as the exponent circuit increases.
[IEEE TVLSI 2011, DATE 2011]
9/12/2012 CHES 2012, Leuven Belgium 9
The Arithmetic Unit
2n2n
Circuit 12n
us
D_sel
C_sel
MUXD
E
MUXA0
E_sel
A2
A3
Circuit 2 Circuit
Pow_sel
B_sel
MUX
B
A2
A3
A1
Quad
A0
C0
Powerblock
SquarerC1
C2
Squarer
Quad
A_sel
MUX
A
A0
Field
(HBKM)(#11)
(#2) (#2)
(#2)
(#2)
(#2)
(#2)
MUXF
(#2)
(#1)
(#1) (#2)
(#1)C
MUX
ARITHMETIC UNIT
(#2) (#2) (#2)
MultiplierFinite Field Multiplication
Power Block
9/12/2012 CHES 2012, Leuven Belgium 10
The Arithmetic Unit
2n2n
Circuit 12n
us
C_sel
Circuit 2 Circuit
Pow_sel
A0
C0
Powerblock
Field
MUXF
(#2)(#1)C
MUX
ARITHMETIC UNIT
(#2) (#2) (#2)
Multiplier
^4 ^2
^4
^2
Power Block
9/12/2012 CHES 2012, Leuven Belgium 11
The Arithmetic Unit
2n2n
Circuit 12n
us
C_sel
Circuit 2 Circuit A0
Powerblock
Field
ARITHMETIC UNIT
Multiplier
^4 ^2
^4
^2
9/12/2012 CHES 2012, Leuven Belgium 12
The Register Bank
2n2n
Circuit 12n
us
D_sel
C_sel
MUXD
E
MUXA0
E_sel
A2
A3
Circuit 2 Circuit
Pow_sel
B_sel
MUX
B
A2
A3
A1
Quad
A0
C0
Powerblock
SquarerC1
C2
A3
A2
A1
A0Squarer
Quad
A_sel
MUX
A
A0
FieldMultiplier
(HBKM)
C2C1C0
Qout
Qin
REGISTERBANK
MUXF C
MUX
ARITHMETIC UNIT
Six registers implemented as flip-flops to maximize CLButilization
9/12/2012 CHES 2012, Leuven Belgium 13
Pipelining the Processor• How many stages of pipelining ?
• To many would increase the frequency of operation but makesit difficult to schedule instructions without bubbles
• To few would result in a low frequency of operation
9/12/2012 CHES 2012, Leuven Belgium 14
Pipelining the Processor• How many stages of pipelining ?
• To many would increase the frequency of operation but makesit difficult to schedule instructions without bubbles
• To few would result in a low frequency of operation
• Where to place the pipeline stages?
9/12/2012 CHES 2012, Leuven Belgium 14
Pipelining the Processor• How many stages of pipelining ?
• To many would increase the frequency of operation but makesit difficult to schedule instructions without bubbles
• To few would result in a low frequency of operation
• Where to place the pipeline stages?
2n2n
Circuit 12n
us
D_sel
C_sel
MUXD
E
MUXA0
E_sel
A2
A3
Circuit 2 Circuit
Pow_sel
B_sel
MUX
B
A2
A3
A1
Quad
A0
C0
Powerblock
SquarerC1
C2
A3
A2
A1
A0Squarer
Quad
A_sel
MUX
A
A0
FieldMultiplier
(HBKM)
C2C1C0
Qout
Qin
REGISTERBANK
Reg
iste
rs
Muxes
MuxesInput
Output
MUXF C
MUX
ARITHMETIC UNIT
CRITICAL PATH
(a) A Bad Pipline StagePlacement
2n2n
Circuit 12n
us
D_sel
C_sel
MUXD
E
MUXA0
E_sel
A2
A3
Circuit 2 Circuit
Pow_sel
B_sel
MUX
B
A2
A3
A1
Quad
A0
C0
Powerblock
SquarerC1
C2
A3
A2
A1
A0Squarer
Quad
A_sel
MUX
A
A0
FieldMultiplier
(HBKM)
C2C1C0
Qout
Qin
REGISTERBANK
Reg
iste
rs
Muxes
MuxesInput
Output
MUXF C
MUX
ARITHMETIC UNIT
CRITICAL PATH
(b) A Good Pipeline StagePlacement
9/12/2012 CHES 2012, Leuven Belgium 14
Pipelining the Processor• How many stages of pipelining ?
• To many would increase the frequency of operation but makesit difficult to schedule instructions without bubbles
• To few would result in a low frequency of operation
• Where to place the pipeline stages?
2n2n
Circuit 12n
us
D_sel
C_sel
MUXD
E
MUXA0
E_sel
A2
A3
Circuit 2 Circuit
Pow_sel
B_sel
MUX
B
A2
A3
A1
Quad
A0
C0
Powerblock
SquarerC1
C2
A3
A2
A1
A0Squarer
Quad
A_sel
MUX
A
A0
FieldMultiplier
(HBKM)
C2C1C0
Qout
Qin
REGISTERBANK
Reg
iste
rs
Muxes
MuxesInput
Output
MUXF C
MUX
ARITHMETIC UNIT
CRITICAL PATH
(a) A Bad Pipline StagePlacement
2n2n
Circuit 12n
us
D_sel
C_sel
MUXD
E
MUXA0
E_sel
A2
A3
Circuit 2 Circuit
Pow_sel
B_sel
MUX
B
A2
A3
A1
Quad
A0
C0
Powerblock
SquarerC1
C2
A3
A2
A1
A0Squarer
Quad
A_sel
MUX
A
A0
FieldMultiplier
(HBKM)
C2C1C0
Qout
Qin
REGISTERBANK
Reg
iste
rs
Muxes
MuxesInput
Output
MUXF C
MUX
ARITHMETIC UNIT
CRITICAL PATH
(b) A Good Pipeline StagePlacement
Can we determine the best pipeline strategy at the designexploration phase?
9/12/2012 CHES 2012, Leuven Belgium 14
Optimally Pipeline into L Stages Apriori
1 Model the delay of the design
2 Use the model to identify all critical paths (with delay t)
3 Place pipeline registers to ensure that no path has delaygreater than t
L
9/12/2012 CHES 2012, Leuven Belgium 15
Modeling the Delay
• The delay of a circuit is proportional to the number of LUTsin the critical path
4 5 6 7 8 9
10
5 6 7 8 9 10 11No
. o
f L
UT
s in
Critic
al P
ath
Delay of Circuit (in ns)
9/12/2012 CHES 2012, Leuven Belgium 16
Modeling the Delay
• The delay of a circuit is proportional to the number of LUTsin the critical path
4 5 6 7 8 9
10
5 6 7 8 9 10 11No
. o
f L
UT
s in
Critic
al P
ath
Delay of Circuit (in ns)
• The number of LUTs in the critical path of an n inputBoolean function f (x1, x2, x3, · · · xn) is ⌈logk(n)⌉, where k isthe number of inputs to the LUT.
9/12/2012 CHES 2012, Leuven Belgium 16
Modeling LUTdelay of all Elements in the Processor
Component k− LUT Delay for k ≥ 4 m = 163, k = 4m bit field adder 1 1m bit n : 1 Mux ⌈logk(n + log2n)⌉ 2 (for n = 4)(Dn:1(m)) 1 (for n = 2)
Exponentiation max(LUTDelay(di )), where di is the i th 2 (for n = 1)Circuit (D2n (m)) output bit of the exponentiation circuit 2 (for n = 2)Powerblock us × D2n (m) + Dus :1(m) 4 (for us = 2)(Dpowerblk(m))
Modular Reduction 1 for irreducible trinomials 2(Dmod ) 2 for pentanomials (for pentanomials)HBKM 11 (for τ = 11)(DHBKM(m)) Dsplit + Dthreshold + Dcombine + Dmod
= ⌈logk (mτ)⌉ + ⌈logk (2τ)⌉
+⌈log2(
mτ
)
⌉ + Dmod
9/12/2012 CHES 2012, Leuven Belgium 17
Critical Path : Placing Pipeline Registers Optimally
C_sel
B_sel
A3
A1
Powerblock
Squarer
C2
A3
A2
A1
A0
FieldCritical Path
Multiplier
(HBKM)
C2C1C0
Qout
Qin
REGISTERBANK
Reg
iste
rs
ARITHMETIC UNIT
(#2)
(#1)
(#1)(#1)
(#2)(#2)
(#11)
(#2)
(#2)
(#2)
(#2) (#2)
(#2)
(#2) (#2)(#2)
(#2)
9/12/2012 CHES 2012, Leuven Belgium 18
Critical Path : Placing Pipeline Resgisters Optimally
Squarer Squarer
4:1 Input MuxRegister BankMerged Squarer and Adder
HBKM
(#1)
MUX
B
MUX
D
Pipeline Register
(#11)(#2)
(#6) (#6) (#5) (#6)
Register Bank
4:1 Output Mux
(#2)(#2)(#1)
(#2)(#2)
4 stage pipeline
• Critical path length : 23 LUTs
• Time Period : ⌈234 ⌉ = 6 LUTs
9/12/2012 CHES 2012, Leuven Belgium 19
Scheduling of Operations (for 1 bit)
Table: Scheduling Instructions
ek1 : Xi ← Xi · Zj ek4 : Zj ← (T · Zj )2
ek2 : Zi ← Xj · Zi ek5 : T ← Xi · Zi ; Zi ← (Xi + Zi )2
ek3 : T ← Xj ; Xj ← X 4j + b · Z4
j ek6 : Xi ← x · Zi + T
9/12/2012 CHES 2012, Leuven Belgium 20
Scheduling of Operations (for 1 bit)
Table: Scheduling Instructions
ek1 : Xi ← Xi · Zj ek4 : Zj ← (T · Zj )2
ek2 : Zi ← Xj · Zi ek5 : T ← Xi · Zi ; Zi ← (Xi + Zi )2
ek3 : T ← Xj ; Xj ← X 4j + b · Z4
j ek6 : Xi ← x · Zi + T
ek1 ek2 ek3
ek5
ek6
ek4
(a) Data Dependencies BetweenInstructions
1 2 3 4 5 6 7 8Clock Cycle
ek6
ek5
ek4
ek3
ek2or ek
1
ek1or ek
2
(b) Timing Diagram for 3 StagePipeline
9/12/2012 CHES 2012, Leuven Belgium 20
Scheduling of 2 Consequtive bits : Overlapping two bits
When two consequtive bits are equal
sk−1
sk
ek1 ek2 ek3
ek5
ek6
ek4
ek−11ek−12 ek−13
ek−15
ek−16
ek−14
153 4 5 6 7 9 10 11 12 13 148
Clock Cycles
1 2
Stolen Clock Cycle
ek−16
ek−15
ek−14
ek−13
ek−12
ek−11
ek6
ek5
ek4
ek3
ek2 or ek1
ek1 or ek2sk completes
sk−2starts
sk−1starts
9/12/2012 CHES 2012, Leuven Belgium 21
Scheduling of 2 Consequtive bits : Overlapping two bits
When two consequtive bits are NOT equal
sk−1
sk
ek1 ek2 ek3
ek4
ek6
ek5
ek−11ek−12 ek−13
ek−14
ek−15
ek−16
153 4 5 6 7 9 10 11 12 13 148
Clock Cycles
1 2
Stolen Clock Cycle
ek−16
ek−15
ek−14
ek−13
ek−12
ek−11
ek6
ek5
ek4
ek3
ek2 or ek1
ek1 or ek2 sk completes
sk−1 starts
sk−2 starts
9/12/2012 CHES 2012, Leuven Belgium 22
Clock Cycles for L Stage Pipeline• Without overlapping 2L+ 2 clock cycles are required per bit• With overlapping, we save a clock cycle, so 2L+ 1• Additionally,
• If there is a pipline stage soon after the multiplier, then dataforwarding is feasible
• Clock cycles per bit is 2L
1 2 3 4 5 6 7 8Clock Cycle
9
Data Forwarding Path
10
ek6
ek5
ek4
ek3
sk−1 starts
ek1or ek
2
ek2or ek
1
Clock cycles saved are m or 2m9/12/2012 CHES 2012, Leuven Belgium 23
Modeling Computation Time to find the Right PipelineSo, we now know• How to place pipeline registers• and estimate the clock cycles required
Putting it together, we can estimate how long it takes to perform ascalar multiplication
Table: Computation Time Estimates for Various Values of L for an ECM over GF (2163) and FPGA with 4
input LUTs
L us DataForwarding Computation TimeFeasible
1 9 No 10192 4 No 5243 3 No 4124 2 Yes 3575 1 No 3956 1 Yes 3607 1 Yes 358
9/12/2012 CHES 2012, Leuven Belgium 24
Architecture Details
1Z2Z X1X2
us
2n2n
Circuit 12n
A0
A1
A2
A3
A1
A1
A1
A3
C
MOD
Combine OutputsThreshold MultsSplit Operands
Combine Level 1
Combine Level 4
{
STAGE 1 STAGE 2 STAGE 3 STAGE 4
REGISTER BANK
ARITHMETIC UNIT
ClockReset
Scalar
B_selC_selD_selE_selPow_selA0_selA1_selA2_selA3_selQin_selX_selZ_sel
X1_wX2_wZ1_wZ2_wT_wA_w
Write EnableRegister
MultiplexerSelect Lines
y
Z_sel
Control Unit
MUX Z XX_sel
1
Base Point in ROMCurve Constant
x A TA_sel
Qin_sel
A2_sel A1_selA3_sel
A0_sel
Data Forwarding Path
PowerblockCircuit 2 Circuit
b
Qin
A0A1A2A3
A2 DMUX
MUX
C_sel
D_sel
Qout
A0
Pow_sel
A0
A3
A2
A0
A_sel
B_sel
E_sel
MUX F
C0
C1
C2
Squarer
Quad
Squarer
Quad
MUX
MUX
MUX
E
B
A
HBKM
MUX GMUX HMUX IMUX JK
9/12/2012 CHES 2012, Leuven Belgium 25
Finite State Machine
Initialization Coordinate Conversion
Completion
CompletionAD2
0AD3
0AD4
0AD9
0
AD1
neq
AD4
1AD2
1AD3
1
AD1
eq
AD9
1
I0 I1 I5 AD1 C1 C2
sk−1 6= sk
sk−1 = sk
sk−1 = sk
sk−1 6= sk
sk−1 = 1
sk−1 = 0
st−2 = 1
st−2 = 0sk−1 = 0
sk−1 = 1
C125
9/12/2012 CHES 2012, Leuven Belgium 26
Comparisons
Work Platform Field Slices LUTs Freq Comp.(m) (MHz) Time (µs)
Orlando XCV400E 163 - 3002 76.7 210Bednara XCV1000 191 - 48300 36 270Gura XCV2000E 163 - 19508 66.5 140Lutz XCV2000E 163 - 10017 66 233Saqib XCV3200 191 18314 - 10 56Pu XC2V1000 193 - 3601 115 167Ansari XC2V2000 163 - 8300 100 42Rebeiro XC4V140 233 19674 37073 60 31
Jarvinen1 Stratix II 163 (11800ALMs) - - 48.9
Kim 2 XC4VLX80 163 24363 - 143 10.1Chelton XCV2600E 163 15368 26390 91 33
XC4V200 163 16209 26364 153.9 19.5Azarderakhsh XC4CLX100 163 12834 22815 196 17.2
XC5VLX110 163 6536 17305 262 12.9Our Result (Virtex 4 FPGA) XC4VLX80 163 8070 14265 147 9.7
XC4V200 163 8095 14507 132 10.7XC4VLX100 233 13620 23147 154 12.5
Our Result (Virtex 5 FPGA) XC5VLX85t 163 3446 10176 167 8.6XC5VSX240 163 3513 10195 148 9.5XC5VLX85t 233 5644 18097 156 12.3
1. uses 4 field multipliers; 2. uses 3 field multipliers; 3. uses 2 field multipliers
9/12/2012 CHES 2012, Leuven Belgium 27
Conclusions
• We show the implementation of a high-speed elliptic curvecrypto-processor for FPGA platforms
• The use of highly optimized finite field primitives, efficientutilization of FPGA primitives, help reduce the area, which inturn make routing easier
• A theoretically designed pipline strategy provides idealpipelining of the design to increase clock frequncy
• Efficient scheduling with data forwarding machanisms reduceclock cycle requirement
• All these result in one of the fastest elliptic curveimplementation on FPGAs
• Additionally, area required is significantly less compared toother high-speed designs
9/12/2012 CHES 2012, Leuven Belgium 28
Thank You for your Attention
9/12/2012 CHES 2012, Leuven Belgium 29