
Lab Session 2

Design of Elliptic Curve Cryptosystem

Debdeep Mukhopadhyay Chester Rebeiro

Dept. of Computer Science and Engineering

Indian Institute of Technology Kharagpur

INDIA

The Processor Overview

• Register Bank: regbank.v (stores temporary data)
• ROM: stores the curve constant ‘b’ and the base points
• Arithmetic Unit: ec_alu.v (performs the underlying field computations)
• Control Unit: smul.v (sequences the field computations for performing the point operations)

Register Bank

• The heart of the register file is 8 registers of 233 bits each.
• Organized as 3 banks: RA, RB, RC
• Dual-port distributed RAMs of the Xilinx FPGAs
• Asynchronous read
• Synchronous write
• we: write enable signal
• Inputs: C0, C1, Qout
• Outputs: A0, A1, A2, A3 and Qin
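The asynchronous-read, synchronous-write behaviour listed above can be sketched in software. This is a minimal Python model, not the Xilinx primitive: the class name `DualPortRam`, its method names, and the choice that writes use the first address port are assumptions made for illustration.

```python
class DualPortRam:
    """Toy model of one 16 x 233-bit dual-port distributed RAM (hypothetical class)."""
    WIDTH = 233
    DEPTH = 16

    def __init__(self):
        self.mem = [0] * self.DEPTH

    def read(self, addr1, addr2):
        # Asynchronous read: both outputs follow their addresses, no clock edge needed.
        return self.mem[addr1], self.mem[addr2]

    def clock(self, we, addr1, din):
        # Synchronous write: memory changes only on a clock edge, and only if we = 1.
        # Assumption: the write address is the first address port.
        if we:
            self.mem[addr1] = din & ((1 << self.WIDTH) - 1)


ram = DualPortRam()
ram.clock(we=1, addr1=2, din=0xABC)   # stored at the clock edge
ram.clock(we=0, addr1=3, din=0xDEF)   # ignored, write enable is low
out1, out2 = ram.read(2, 3)           # both ports read in the same cycle
```

Reading two addresses at once is what lets one bank feed two ALU inputs in a single time step.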

Module regbank.v

module regbank(clk, cwh, c0, c1, a0, a1, a2, a3);
  input wire clk;
  input wire [22:0] cwh;               /* control word */
  input wire [232:0] c0, c1;           /* inputs to regbank from ALU */
  output wire [232:0] a0, a1, a2, a3;  /* outputs from regbank to ALU */

  wire rb1_we, rb2_we, rb3_we;         /* write enables, one per bank */
  wire [3:0] rb1_addr1, rb2_addr1, rb3_addr1;
  wire [3:0] rb1_addr2, rb2_addr2, rb3_addr2;
  wire [232:0] rb1_din, rb2_din, rb3_din;
  wire [232:0] rb1_dout1, rb2_dout1, rb3_dout1;
  wire [232:0] rb1_dout2, rb2_dout2, rb3_dout2;
  wire [232:0] qin, qout;

  /* Instances of distributed memory */
  XC3S_RAM16X233 regbank1(rb1_din, rb1_addr1, rb1_addr2, rb1_we, clk, rb1_dout1, rb1_dout2);
  XC3S_RAM16X233_D regbank2(rb2_din, rb2_addr1, rb2_addr2, rb2_we, clk, rb2_dout1, rb2_dout2);
  XC3S_RAM16X233_D regbank3(rb3_din, rb3_addr1, rb3_addr2, rb3_we, clk, rb3_dout1, rb3_dout2);

  /* Quadblock instance */
  bquadblk bqb(cwh[20], qin, cwh[19:16], qout);

  assign qin = (cwh[21] == 1'b1) ? a1 : a2;

  /* Read addresses of the three banks */
  assign rb1_addr1 = {3'b0, cwh[0]};
  assign rb1_addr2 = {3'b0, cwh[1]};
  assign rb2_addr1 = {2'b0, cwh[4:3]};
  assign rb2_addr2 = {2'b0, cwh[6:5]};
  assign rb3_addr1 = {3'b0, cwh[8]};
  assign rb3_addr2 = {3'b0, cwh[9]};

  /* a0 to a3 are fed to the ALU */
  assign a0 = (cwh[11] == 1'b0) ? rb1_dout1 : rb2_dout2;
  assign a2 = (cwh[12] == 1'b0) ? rb2_dout1 : rb1_dout2;
  assign a1 = (cwh[13] == 1'b0) ? rb3_dout1 : rb2_dout2;
  assign a3 = rb3_dout2;

  /* Select what gets written into the RAMs */
  assign rb1_we = cwh[2];
  assign rb1_din = (cwh[22] == 1'b1) ? c0 : ((cwh[14] == 1'b0) ? c0 : c1);

  assign rb2_we = cwh[7];
  assign rb2_din = (cwh[22] == 1'b1) ? c1 : ((cwh[15] == 1'b0) ? c0 : c1);

  assign rb3_we = cwh[10];
  assign rb3_din = (cwh[22] == 1'b1) ? 233'h1
                 : (cwh[20] == 1'b1) ? qout : c0;
endmodule

The ALU for the ECC Processor

The AU performs:
• Point operations (doubling and addition) in Lopez-Dahab projective coordinates.
• Inversion, to convert from projective coordinates back to affine coordinates.

• The AU has 5 inputs (the outputs of the regbank) and 3 outputs (the inputs of the regbank).
• The computation of the AU has 2 phases:
  ◦ Phase 1: point addition and doubling
  ◦ Phase 2: inversion
• The AU consists of:
  ◦ Hybrid Karatsuba multiplier (used at all time steps)
  ◦ Quad block (used in phase 2 for the final inversion)
• The quad block has 14 cascaded stages (as described before) and computes the quad operation (raising to the fourth power) repeatedly, as selected by c[29…26].
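As a software illustration of what the quad block computes, the sketch below models repeated quading in GF(2^233) with Python integers as bit vectors. The reduction polynomial x^233 + x^74 + 1 is an assumption (it is the NIST choice for this field size; the slides do not state it), and the function names are hypothetical.

```python
M = 233
POLY = (1 << 233) | (1 << 74) | 1   # assumed reduction polynomial x^233 + x^74 + 1


def reduce(a):
    """Reduce a polynomial over GF(2) modulo POLY."""
    for i in range(a.bit_length() - 1, M - 1, -1):
        if (a >> i) & 1:
            a ^= POLY << (i - M)
    return a


def square(a):
    """Square in GF(2^233): spread the bits apart, then reduce."""
    s, i = 0, 0
    while a:
        if a & 1:
            s |= 1 << (2 * i)
        a >>= 1
        i += 1
    return reduce(s)


def quad(a):
    """One quad stage: a^4, i.e. two squarings back to back."""
    return square(square(a))


def quad_block(a, n):
    """Cascade n quad stages, giving a^(4^n); the slide allows up to 14 per pass."""
    for _ in range(n):
        a = quad(a)
    return a


x_to_the_4 = quad_block(0b10, 1)    # the element x maps to x^4
```

Because squaring is linear over GF(2), each stage is only XOR gates in hardware, which is why many stages can be cascaded in one clock cycle.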

The verilog code for ALU

module ec_alu(cw, a0, a1, a2, a3, c0, c1);
  input wire [232:0] a0, a1, a2, a3;  /* the inputs to the alu */
  input wire [9:0] cw;                /* the control word */
  output wire [232:0] c0, c1;         /* the alu outputs */

  /* Temporary results */
  wire [232:0] a0sq, a0qu;
  wire [232:0] a1sq, a1qu;
  wire [232:0] a2sq, a2qu;
  wire [232:0] sa2, sa4, sa5, sa7, sa8, sa8_1;
  wire [232:0] sc1;
  wire [232:0] sd2, sd2_1;

  /* Multiplier inputs and output */
  wire [232:0] minA, minB, mout;

  multiplier mul(minA, minB, mout);

  /* Squares of the inputs */
  squarer sq1_p0(a0, a0sq);
  squarer sq_p1(a1, a1sq);
  squarer sq_p2(a2, a2sq);

  /* Fourth powers: the square of each square */
  squarer sq2_p2(a2sq, a2qu);
  squarer sq2_p1(a1sq, a1qu);
  squarer sq2_p3(a0sq, a0qu);

  /* Choose the inputs to the Multiplier */
  mux8 muxA(a0, a0sq, a2, sa7, sd2, a1, a1qu, 233'd0, cw[2:0], minA);
  mux8 muxB(a1, a1sq, sa4, sa8, sd2_1, a3, a2qu, a1qu, cw[5:3], minB);

  /* Choose the outputs of the ALU */
  mux4 muxC(mout, sa2, a1sq, sc1, cw[7:6], c0);
  mux4 muxD(sa8_1, sa5, a1qu, sd2, cw[9:8], c1);

  /* Field additions (XOR) used by the point operations */
  assign sa2 = mout ^ a2;
  assign sa4 = a1sq ^ a2;
  assign sa5 = mout ^ a2sq ^ a0;
  assign sa7 = a0 ^ a2;
  assign sa8 = a1 ^ a3;
  assign sa8_1 = mout ^ a0;

  assign sc1 = mout ^ a3;

  assign sd2 = a0qu ^ a1;
  assign sd2_1 = a2sq ^ a3 ^ a1;

endmodule
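The `multiplier` instance above is described earlier as a hybrid Karatsuba multiplier. The sketch below shows only the basic Karatsuba recursion over GF(2)[x] in Python, with a plain shift-and-XOR product at the leaves; it is not the hybrid hardware structure, it leaves out the modular reduction, and the function names and the threshold of 16 bits are arbitrary choices for illustration.

```python
import random


def clmul_schoolbook(a, b):
    """Carry-less product in GF(2)[x] by shift-and-XOR."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def clmul_karatsuba(a, b, nbits=233, threshold=16):
    """Karatsuba product in GF(2)[x]; both operands must fit in nbits bits."""
    if nbits <= threshold:
        return clmul_schoolbook(a, b)
    h = (nbits + 1) // 2                 # split point: low half has h bits
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h
    b0, b1 = b & mask, b >> h
    lo = clmul_karatsuba(a0, b0, h, threshold)
    hi = clmul_karatsuba(a1, b1, nbits - h, threshold)
    mid = clmul_karatsuba(a0 ^ a1, b0 ^ b1, h, threshold)
    # Three half-size products instead of four; additions are XORs.
    return (hi << (2 * h)) ^ ((mid ^ lo ^ hi) << h) ^ lo


random.seed(1)
a = random.getrandbits(233)
b = random.getrandbits(233)
product = clmul_karatsuba(a, b)          # up to 465 bits, still to be reduced
```

The result is the unreduced polynomial product; a field multiplier follows it with reduction modulo the field polynomial.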

Control Unit

• The CU is hardwired.
• It generates 33 control signals, which determine the flow of data.
• c[0…9]: control the inputs to the multiplier and the outputs C0 and C1 of the AU
• c[26…29]: select lines of the multiplexers inside the quad block
• The remaining control lines are for reads and writes of the registers in the register file
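To make the control-word layout concrete, here is a small Python decoder. The bit positions are read off the `assign` statements in regbank.v and the mux select ports in ec_alu.v; the field names, the helper `bits`, and the view of the 33 signals as a 10-bit low word `cwl` plus a 23-bit high word `cwh` are naming assumptions for this sketch.

```python
def bits(value, hi, lo):
    """Extract value[hi:lo], Verilog style."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_control_word(cwl, cwh):
    """Split the control word into named fields (names are hypothetical)."""
    return {
        # cwl: ALU multiplexers
        "muxA_sel": bits(cwl, 2, 0),     # multiplier input A
        "muxB_sel": bits(cwl, 5, 3),     # multiplier input B
        "muxC_sel": bits(cwl, 7, 6),     # ALU output c0
        "muxD_sel": bits(cwl, 9, 8),     # ALU output c1
        # cwh: register bank addressing and write enables
        "rb1_addr1": bits(cwh, 0, 0), "rb1_addr2": bits(cwh, 1, 1), "rb1_we": bits(cwh, 2, 2),
        "rb2_addr1": bits(cwh, 4, 3), "rb2_addr2": bits(cwh, 6, 5), "rb2_we": bits(cwh, 7, 7),
        "rb3_addr1": bits(cwh, 8, 8), "rb3_addr2": bits(cwh, 9, 9), "rb3_we": bits(cwh, 10, 10),
        # cwh: routing between banks, ALU and quad block
        "a0_sel": bits(cwh, 11, 11), "a2_sel": bits(cwh, 12, 12), "a1_sel": bits(cwh, 13, 13),
        "rb1_din_sel": bits(cwh, 14, 14), "rb2_din_sel": bits(cwh, 15, 15),
        "quad_sel": bits(cwh, 19, 16),   # c[26..29] when cwh is stacked above cwl
        "quad_to_rb3": bits(cwh, 20, 20),
        "qin_sel": bits(cwh, 21, 21),
        "load_constants": bits(cwh, 22, 22),
    }


# "Double Step 1" from the control unit: cwl = 10'h209, cwh = 23'h0x8490 (x digit taken as 0)
fields = decode_control_word(0x209, 0x008490)
```

For this word the decoder reports writes into banks 2 and 3 and no write into bank 1, which is a quick way to sanity-check a control word against the intended time step.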

Projective Point Arithmetic

Point Doubling:
◦ Input: (X1, Y1, Z1)
◦ Output: (X4, Y4, Z4)
◦ Constraint: there is one multiplier, which can perform one multiplication per clock cycle
◦ Hence doubling needs four time steps
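The doubling formulas themselves are not reproduced in this text, so the sketch below uses the standard Lopez-Dahab doubling formulas for a curve y^2 + xy = x^3 + a*x^2 + b with a in {0, 1}. It runs over a toy field GF(2^7) so that it stays small; the field, the constants `A_COEF` and `B_COEF`, and the function names are assumptions for illustration, not the processor's parameters.

```python
M, POLY = 7, 0b10000011      # toy field GF(2^7) with x^7 + x + 1 (assumption)
A_COEF, B_COEF = 1, 0b1010   # toy curve constants (hypothetical)


def fmul(a, b):
    """Multiply in GF(2^M): shift-and-XOR with interleaved reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= POLY
    return r


def fsq(a):
    """Squaring; in hardware this is a cheap dedicated circuit."""
    return fmul(a, a)


def ld_double(X1, Y1, Z1, a=A_COEF, b=B_COEF):
    """Lopez-Dahab doubling with exactly four general multiplications."""
    x2, z2 = fsq(X1), fsq(Z1)
    bz4 = fmul(b, fsq(z2))                       # multiplication 1: b * Z1^4
    Z3 = fmul(x2, z2)                            # multiplication 2: X1^2 * Z1^2
    X3 = fsq(x2) ^ bz4                           # X1^4 + b*Z1^4
    t = (Z3 if a else 0) ^ fsq(Y1) ^ bz4         # a*Z3 + Y1^2 + b*Z1^4, a is 0 or 1
    Y3 = fmul(bz4, Z3) ^ fmul(X3, t)             # multiplications 3 and 4
    return X3, Y3, Z3

# Usage: ld_double(x, y, 1) doubles the affine point (x, y);
# the affine result is (X3 / Z3, Y3 / Z3^2).
```

The four `fmul` calls on general operands correspond to the four time steps imposed by the single multiplier; squarings and XORs fit alongside them.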

Projective Point Doubling Sequencing

Projective Point Addition

• Input: P = (X1, Y1, Z1) in projective coordinates, Q = (x2, y2) in affine coordinates
• Output: (X3, Y3, Z3)
◦ Constraint: there is one multiplier, which can perform one multiplication per clock cycle
◦ Hence addition needs eight time steps
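The addition formulas are likewise missing from this text. As a sketch, the code below implements the standard mixed-coordinate Lopez-Dahab addition (projective P plus affine Q) for curves with a in {0, 1}, again over a toy field GF(2^7). The field, the constant `A_COEF`, and the function names are assumptions; the case P = Q or P = -Q is not handled.

```python
M, POLY = 7, 0b10000011      # toy field GF(2^7) with x^7 + x + 1 (assumption)
A_COEF = 1                   # curve coefficient a, assumed to be 0 or 1


def fmul(a, b):
    """Multiply in GF(2^M): shift-and-XOR with interleaved reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= POLY
    return r


def fsq(a):
    return fmul(a, a)


def ld_add_mixed(X1, Y1, Z1, x2, y2, a=A_COEF):
    """(X1:Y1:Z1) + (x2, y2) with exactly eight general multiplications."""
    z1sq = fsq(Z1)
    A = fmul(y2, z1sq) ^ Y1                       # multiplication 1
    B = fmul(x2, Z1) ^ X1                         # multiplication 2
    C = fmul(Z1, B)                               # multiplication 3
    D = fmul(fsq(B), C ^ (z1sq if a else 0))      # multiplication 4
    Z3 = fsq(C)
    E = fmul(A, C)                                # multiplication 5
    X3 = fsq(A) ^ D ^ E
    F = X3 ^ fmul(x2, Z3)                         # multiplication 6
    G = fmul(x2 ^ y2, fsq(Z3))                    # multiplication 7
    Y3 = fmul(E ^ Z3, F) ^ G                      # multiplication 8
    return X3, Y3, Z3

# Usage: ld_add_mixed(X1, Y1, Z1, x2, y2); the affine result is (X3 / Z3, Y3 / Z3^2).
```

Keeping Q in affine form (Z = 1) is what brings the count down to eight multiplications, matching the eight time steps.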

Projective Point Addition Sequencing

The State Machine for the CU

Design of the CU

The verilog code for the CU

/* Output Logic */
always @(state) begin
  case(state)
    6'd0: begin /* Init L2R Step 1 */
      cwl <= 10'h000;
      cwh <= 23'h4x8484;
    end
    6'd1: begin /* Init L2R Step 2 */
      cwl <= 10'h000;
      cwh <= 23'h4x808D;
    end
    6'd2: begin /* Init L2R Step 3 */
      cwl <= 10'hx;
      cwh <= 23'h4xx098;
    end

    /* The Doubling States */
    6'd3: begin /* Double Step 1 */
      cwl <= 10'h209;
      cwh <= 23'h0x8490;
    end
    6'd4: begin /* Double Step 1a */
      cwl <= 10'h002;
      cwh <= 23'h0x20F0;
    end
    6'd5: begin /* Double Step 2 */
      cwl <= 10'h324;
      cwh <= 23'h0x6544;
    end
    6'd6: begin /* Double Step 3 */
      cwl <= 10'hxC0;
      cwh <= 23'h0x0ac0;
    end

    /* The Addition States */
    6'd7: begin /* Addition Step 1 */
      cwl <= 10'h048;
      cwh <= 23'h0x08a0;
    end
    6'd8: begin /* Addition Step 2 */
      cwl <= 10'h002;
      cwh <= 23'h0x5006;
    end
    6'd9: begin /* Addition Step 3 */
      cwl <= 10'h028;
      cwh <= 23'h0x0090;
    end
    6'd10: begin /* Addition Step 4 */
      cwl <= 10'h011;
      cwh <= 23'h0x0214;
    end
    6'd11: begin /* Addition Step 5 */
      cwl <= 10'h102;
      cwh <= 23'h0x6544;
    end
    6'd12: begin /* Addition Step 6 */
      cwl <= 10'h08A;
      cwh <= 23'h0xB4D2;
    end
    6'd13: begin /* Addition Step 7 */
      cwl <= 10'h00B;
      cwh <= 23'h0x18A2;
    end
    6'd14: begin /* Addition Step 8 */
      cwl <= 10'h058;
      cwh <= 23'h0x0ac0;
    end

The final Inversion Step

The verilog snippet

    /* The final inverse: the Itoh-Tsujii sequence starts here */
    6'd15: begin /* Inv 1 */
      cwl <= 10'hx0D;
      cwh <= 23'h0x04x0; /* The first a=a^3 */
    end
    6'd16: begin /* Inv 2 */
      cwl <= 10'hx06;
      cwh <= 23'h0x0090;
    end
    6'd17: begin /* Inv 3 */
      cwl <= 10'hx35;
      cwh <= 23'h0x0090;
    end
    6'd18: begin /* Inv 4-1 */
      cwl <= 10'hx;
      cwh <= 23'h130510;
    end
    6'd19: begin /* Inv 4-2 */
      cwl <= 10'hx02;
      cwh <= 23'h0x0190;
    end
    6'd20: begin /* Inv 5 */
      cwl <= 10'hx35;
      cwh <= 23'h0x0090;
    end
    6'd21: begin /* Inv 6-1 */
      cwl <= 10'hx;
      cwh <= 23'h170510;
    end
    6'd22: begin /* Inv 6-2 */
      cwl <= 10'hx02;
      cwh <= 23'h0x0190;
    end
    6'd23: begin /* Inv 7-1 */
      cwl <= 10'hx;
      cwh <= 23'h1E0510;
    end
    6'd24: begin /* Inv 7-2 */
      cwl <= 10'hx02;
      cwh <= 23'h0x0190;
    end
    6'd25: begin /* Inv 8 */
      cwl <= 10'hx35;
      cwh <= 23'h0x0090;
    end
    6'd26: begin /* Inv 9-1 */
      cwl <= 10'hx;
      cwh <= 23'h1E0510;
    end
    6'd27: begin /* Inv 9-2 */
      cwl <= 10'hx;
      cwh <= 23'h3E0500;
    end
    6'd28: begin /* Inv 9-3 */
      cwl <= 10'hx3A;
      cwh <= 23'h0x0190;
    end
    /* the remaining inversion states are not shown on the slide */
  endcase
end

endmodule
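The inversion states above apply the Itoh-Tsujii idea of building a^(-1) = a^(2^233 - 2) from a short chain of multiplications and repeated quads, starting from a^3. The Python sketch below shows that idea in software. The reduction polynomial x^233 + x^74 + 1 and the particular addition chain for 116 are assumptions; the slides do not list the chain the hardware uses, and the function names are hypothetical.

```python
M = 233
POLY = (1 << 233) | (1 << 74) | 1      # assumed reduction polynomial x^233 + x^74 + 1


def fmul(a, b):
    """Multiply in GF(2^233): shift-and-XOR with interleaved reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= POLY
    return r


def quad_n(a, n):
    """Raise a to the power 4^n (2n squarings), the job of the quad block."""
    for _ in range(2 * n):
        a = fmul(a, a)
    return a


# Assumed addition chain reaching 116 = (233 - 1) / 2; each pair (j, k) produces index j + k.
CHAIN = [(1, 1), (2, 1), (3, 3), (6, 1), (7, 7), (14, 14), (28, 1), (29, 29), (58, 58)]


def itoh_tsujii_inverse(a):
    """Return a^(-1) for non-zero a, using alpha_k = a^(4^k - 1)."""
    alpha = {1: fmul(fmul(a, a), a)}               # alpha_1 = a^3
    for j, k in CHAIN:
        # alpha_(j+k) = (alpha_j)^(4^k) * alpha_k
        alpha[j + k] = fmul(quad_n(alpha[j], k), alpha[k])
    return fmul(alpha[116], alpha[116])            # (a^(2^232 - 1))^2 = a^(2^233 - 2)


a = 0x1F3A5C7E9B2D4F6081A3C5E7092B4D6F
a_inv = itoh_tsujii_inverse(a)
```

With this chain only ten multiplications are needed; the long runs of quads between them are why some inversion steps in the state machine span several states.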

The scalar multiplier

The verilog snippet

/* the next state logic */
always @(state) begin
  case(state)
    /* Init states */
    6'd0: nextstate <= 6'd1;
    6'd1: nextstate <= 6'd2;
    6'd2: nextstate <= 6'd3;

    /* Double States */
    6'd3: nextstate <= 6'd4;
    6'd4: nextstate <= 6'd5;
    6'd5: nextstate <= 6'd6;
    6'd6: begin
      if (k[`KEYMSB] == 1'b1)
        nextstate <= 6'd7;   /* the current key bit is 1: do the addition after this doubling */
      else if (ef == 1'b0)
        nextstate <= 6'd15;  /* the key bit is 0 and this is the last iteration: go to the end */
      else
        nextstate <= 6'd3;   /* skip the addition and do the next doubling */
    end

    /* Addition States */
    6'd7: nextstate <= 6'd8;
    6'd8: nextstate <= 6'd9;
    6'd9: nextstate <= 6'd10;
    6'd10: nextstate <= 6'd11;
    6'd11: nextstate <= 6'd12;
    6'd12: nextstate <= 6'd13;
    6'd13: nextstate <= 6'd14;
    6'd14: begin
      if (ef == 1'b1)
        nextstate <= 6'd3;   /* more key bits remain: next doubling */
      else
        nextstate <= 6'd15;  /* last iteration done: start the inversion */
    end

    /* The Itoh Tsujii States */
    6'd15: nextstate <= 6'd16;
    6'd16: nextstate <= 6'd17;
    6'd17: nextstate <= 6'd18;
    6'd18: nextstate <= 6'd19;
    6'd19: nextstate <= 6'd20;
    6'd20: nextstate <= 6'd21;
    6'd21: nextstate <= 6'd22;
    6'd22: nextstate <= 6'd23;
    6'd23: nextstate <= 6'd24;
    6'd24: nextstate <= 6'd25;
    6'd25: nextstate <= 6'd26;
    6'd26: nextstate <= 6'd27;
    6'd27: nextstate <= 6'd28;
    6'd28: nextstate <= 6'd29;
    6'd29: nextstate <= 6'd30;
    6'd30: nextstate <= 6'd31;
    6'd31: nextstate <= 6'd32;
    6'd32: nextstate <= 6'd33;
    6'd33: nextstate <= 6'd34;
    6'd34: nextstate <= 6'd35;
    6'd35: nextstate <= 6'd36;
    6'd36: nextstate <= 6'd37;
    6'd37: nextstate <= 6'd38;
    6'd38: nextstate <= 6'd38;  /* stay here: multiplication is complete */
    default: nextstate <= 6'bx;
  endcase
end
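The next-state logic implements left-to-right double-and-add. A compact Python sketch of the same control flow follows; the group operations are passed in as functions so the flow can be checked with plain integers, and the function name, the `trace` list, and its labels are hypothetical.

```python
def scalar_mul_l2r(k, P, double, add, keysize=32):
    """Left-to-right double-and-add: init, then (double, optional add) per key bit."""
    if k <= 0 or k >> keysize:
        raise ValueError("k must be a non-zero keysize-bit scalar")
    key_bits = [(k >> i) & 1 for i in range(keysize - 1, -1, -1)]
    lead = key_bits.index(1)          # leading zeros are skipped before the main loop
    Q = P                             # init states: the leading 1 loads Q = P
    trace = ["init"]
    for bit in key_bits[lead + 1:]:
        Q = double(Q)                 # double states
        trace.append("double")
        if bit:                       # addition states only when the key bit is 1
            Q = add(Q, P)
            trace.append("add")
    trace.append("inverse")           # Itoh-Tsujii states: back to affine coordinates
    return Q, trace


# Check the control flow with integers standing in for curve points.
result, trace = scalar_mul_l2r(0xC7, 5, double=lambda q: 2 * q, add=lambda q, p: q + p)
```

For k = 0xC7 the trace contains 7 doublings and 4 additions, one doubling per key bit after the leading 1 and one addition per remaining 1 bit.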

IO Interface of the scalar multiplier

/* The input to regbank is either constants or the results from ALU */

assign c0r = (cwh[22] == 1'b0) ? c0a : `BASEPOINT_X;

assign c1r = (cwh[22] == 1'b0) ? c1a : yconstants;

assign yconstants = (state != 6'd2) ? `BASEPOINT_Y : `CURVECONSTANT_B;

Output

/* Store the results after converting back into affine coordinates */

assign sx = (key != 233'b1) ? a0 : `BASEPOINT_X;

assign sy = (key != 233'b1) ? a2 : `BASEPOINT_Y;

assign done = (state == 6'd38) ? 1 : 0; /* Set done to 1 if multiplication is complete */

The Module counter

`define KEYSIZE 32
`define KEYMSB 31

module counter (clk, nrst, e);
  input wire clk;   /* clock used for the counter */
  input wire nrst;  /* active low reset */
  output wire e;    /* 0 when count == 0, 1 otherwise */

  reg [7:0] count;  /* the register which actually decrements */

  /* e stays high while count is non-zero and drops to 0 when count reaches 0 */
  assign e = |count;

  always @(posedge clk or negedge nrst) begin
    if (nrst == 1'b0)
      count <= 8'd`KEYMSB;
    else
      count <= count - 1'b1;
  end

endmodule

This module counts the number of key bits that have been processed.
• Before the leading 1 is detected, the key is simply shifted at the fast clock and the counter follows it.
• After the leading 1 is detected, the counter updates at the start of each doubling.
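A simplified software view of the counter is sketched below: it starts at KEYMSB, is decremented once per fast-clock shift while the leading 1 is being searched for, and then once per doubling. This Python model ignores exact clock-edge timing, and the function name and return values are assumptions for illustration.

```python
KEYSIZE = 32
KEYMSB = KEYSIZE - 1


def count_iterations(key):
    """Return (fast_shifts, iterations) for a KEYSIZE-bit key."""
    mask = (1 << KEYSIZE) - 1
    k = key & mask
    count = KEYMSB                    # reset value of the counter
    fast_shifts = 0
    # Phase 1: shift at the fast clock until the leading 1 reaches the MSB.
    while count and not (k >> KEYMSB) & 1:
        k = (k << 1) & mask
        count -= 1
        fast_shifts += 1
    # Phase 2: one decrement per doubling until the counter reaches 0.
    iterations = 0
    while count:
        count -= 1
        iterations += 1
    return fast_shifts, iterations


shifts, doublings = count_iterations(0xC7)
```

For the key 0xC7 this gives 24 fast shifts for the leading zeros and 7 doublings for the bits after the leading 1.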

The clocks in the design

/* Shift register for the key. The current bit is always the MSB */
always @(posedge clk) begin
  if (nrst == 1'b0)
    k <= key;
  else if (start == 1'b0)   /* if start=0, shift every clock cycle */
    k[`KEYMSB:0] <= {k[(`KEYMSB-1):0], 1'b0};
  else if (state == 6'd4)   /* if start=1, shift once every iteration of the multiplier */
    k[`KEYMSB:0] <= {k[(`KEYMSB-1):0], 1'b0};
  else                      /* else, don't shift */
    k[`KEYMSB:0] <= k[`KEYMSB:0];
end

/* Detect the first 1 */
assign tstart = start;

/* The counter clock is either a fast clock (clk) used during
 * leading 1 detection, or a slow clock used during multiplication
 */
assign mainclk = (start == 1'b0) ? clk : cck;

always @(posedge clk) begin
  if (nrst == 1'b0)
    start <= key[`KEYMSB];
  else
    start <= tstart | k[(`KEYMSB - 1)];
end

/* Generate clock signal for counter */
always @(posedge clk) begin
  if (nrst == 1'b0) begin
    cck <= 1'b0;
  end
  else begin
    case(state)
      6'd3: cck <= ~cck;
      6'd4: cck <= ~cck;
      default: cck <= cck;
    endcase
  end
end

Testing of the design

Top level interface of the design:
◦ module ecsmul(clk, nrst, key, sx, sy, done);

Test-bench:

module tb();
  reg clk;
  reg nrst;
  wire done;
  reg [31:0] kmem [7:0];
  reg [255:0] k;
  wire [232:0] sx, sy;
  reg [232:0] res[1:0];
  integer i;

A testbench

  initial begin
    $readmemh("../../scratch/tv.txt", kmem);
    #50
    for (i = 0; i < 8; i = i + 1) begin
      $display("%d %h\n", i, kmem[i]);
    end
    k = {kmem[0], kmem[1], kmem[2], kmem[3], kmem[4], kmem[5], kmem[6], kmem[7]};
    // k = `KEYSIZE'hC7;
    $display("%h\n", k[`KEYSIZE - 1:0]);
    clk = 1'b0;
    nrst = 1'b1;

    #10  nrst = 1'b0;
    #110 nrst = 1'b1;
    #100000
    $display("%h\n%h", sx, sy);
    res[0] = sx;
    res[1] = sy;
    #1
    $writememh("../../scratch/res_verilog.txt", res);
    $finish;
  end

  initial begin
    $dumpfile("dump.vcd");
    $dumpvars;
    $dumpon;
    #100000 $dumpoff;
  end

  ecsmul ec_mul(clk, nrst, k[`KEYMSB:0], sx, sy, done);

  always begin
    #100 clk = ~clk;
  end

endmodule
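The test bench reads eight 32-bit hex words into `kmem` and concatenates them with `kmem[0]` as the most significant word. A small Python helper for producing such a test-vector file is sketched below; the function name is hypothetical, and it assumes the simulator loads the first word of the file into `kmem[0]`.

```python
def scalar_to_words(k, nwords=8, wordbits=32):
    """Split scalar k into hex words, most significant word first."""
    if k < 0 or k >> (nwords * wordbits):
        raise ValueError("scalar does not fit in nwords * wordbits bits")
    mask = (1 << wordbits) - 1
    words = [(k >> (wordbits * (nwords - 1 - i))) & mask for i in range(nwords)]
    return ["%08x" % w for w in words]


lines = scalar_to_words(0xC7)
text = "\n".join(lines) + "\n"        # contents for a file such as tv.txt

# To write the file (path taken from the test bench):
# with open("../../scratch/tv.txt", "w") as f:
#     f.write(text)
```

With KEYSIZE set to 32 only the low 32 bits of `k` reach the multiplier, so the scalar under test ends up in the last word of the file.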

Verification of Correctness

• We use the elliptic curve code by M. Rosing from the book Implementing Elliptic Curve Cryptography.
• Go to the directory ld_rosing/polymath/basics/
• In the file elliptic_poly.c, lines 1232 to 1239 contain the scalar. t1.e[0] is the most significant word and t1.e[7] is the least significant word. Set the scalar to the value used in the test bench.
• Save the file and exit.
• On a Linux machine, run ‘make’ to create the executable.
• Execute ‘elliptic_poly’.
