Master’s Thesis Presentation
Hoang Le

Director: Dr. Kris Gaj

ece.gmu.edu/crypto_resources/web_resources/theses/...

Outline

RSA
ECM
Reconfigurable Computing Platforms, Languages and Programming Environments
Partitioning ECM Code between HDLs and HLLs
Implementation of ECM Using SRC Carte C and Celoxica Handel-C
Comparison of Results
Discussion of Results
Conclusions

RSA

History of RSA

Invented by Rivest, Shamir, and Adleman in 1977
Most widely used public key cryptosystem
Key length is variable depending on application
RSA’s key can be chosen long for enhanced security, or short for efficiency

Operation of RSA

Public Key (e, n): encryption of message m into ciphertext c, $c = f(m) = m^e \bmod n$

Private Key (d, n): decryption of c back into m, $m = f^{-1}(c) = c^d \bmod n$
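The two formulas can be tried on a toy key. The sketch below is only an illustration: the primes 61 and 53, the exponent e = 17 and the message 65 are assumed values chosen for readability, far too small for real use, and they do not come from the presentation.

```python
from math import gcd

# Toy parameters, assumed for illustration only.
p, q = 61, 53
n = p * q                     # RSA modulus
phi = (p - 1) * (q - 1)       # Euler's totient of n
e = 17                        # public exponent, must be coprime to phi
assert gcd(e, phi) == 1
d = pow(e, -1, phi)           # private exponent (needs Python 3.8+)


def encrypt(m: int) -> int:
    """c = m^e mod n, using the public key (e, n)."""
    return pow(m, e, n)


def decrypt(c: int) -> int:
    """m = c^d mod n, using the private key (d, n)."""
    return pow(c, d, n)


message = 65
ciphertext = encrypt(message)
recovered = decrypt(ciphertext)
```

Anyone who can factor n into p and q can recompute phi and d, which is why the following slides focus on factoring.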

Security of RSA

Factoring a big number is difficult or infeasible
Cost and time required to factor an n-bit RSA modulus provide an upper bound on the security of n-bit RSA
Currently, the size of key used is 1024 bits or 309 digits

ECM

What is ECM?

Elliptic Curve Method of Factoring

Lenstra 1985: Phase 1
Brent, Montgomery 1986-87: Phase 2

N: number to be factored; q: factor of N (< 50 bits)

Factoring time depends mainly on the size of factor q

ECM in NFS

Polynomial Selection
Relation Collection: Sieving, Mini-Factoring (ECM) of 200-250 bit numbers
Linear Algebra
Square Root

Elliptic Curve

$Y^2 = X^3 + X + 1 \bmod p$ ($p = 23$)

Figure: points fulfilling the equation of the curve (X axis 0 to 20, Y axis 0 to 25)

Special point $\vartheta$ (point at infinity) such that: $P + \vartheta = \vartheta + P = P$

Addition: P = (6,19), Q = (7,12), R = P + Q = (13,7)

Doubling: P = (3,13), 2P = P + P = (7,11)

$\exists P$ such that all points of the curve are $P, 2P, 3P, \ldots, nP = \vartheta$, with $(n+1)P = P$
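The addition and doubling values on this slide can be checked with the textbook affine chord-and-tangent formulas. The sketch below is a minimal illustration for the slide's curve only; the function names are hypothetical, and `None` is an assumed stand-in for the point at infinity.

```python
from typing import Optional, Tuple

Point = Optional[Tuple[int, int]]   # None stands for the point at infinity

P_MOD = 23   # field prime from the slide
A = 1        # curve Y^2 = X^3 + A*X + B
B = 1


def on_curve(pt: Point) -> bool:
    """Check that a point satisfies the curve equation modulo P_MOD."""
    if pt is None:
        return True
    x, y = pt
    return (y * y - (x ** 3 + A * x + B)) % P_MOD == 0


def ec_add(p1: Point, p2: Point) -> Point:
    """Affine addition; every call needs one modular inversion."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                                   # P + (-P) = infinity
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


r = ec_add((6, 19), (7, 12))      # addition example from the slide
two_p = ec_add((3, 13), (3, 13))  # doubling example from the slide
```

Both results agree with the slide: R = (13,7) and 2P = (7,11).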

Projective vs. Affine Coordinates

affine coordinates: Pa = (x, y)
  addition and doubling require inversion

projective coordinates: Pp = (x, y, z)
  addition and doubling can be done without inversion

projective coordinates for Montgomery form of the curve: PpM = (x : : z)
  addition and doubling do not require y coordinate

Scalar Multiplication

Q = k · P = P + P + P + … + P (k times)

k: number (scalar); P, Q: points
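One common way to compute k·P with the (x : : z) coordinates of a Montgomery-form curve is the Montgomery ladder, which needs no y coordinate and no inversion until the final conversion. The sketch below uses the standard x-only doubling and differential-addition formulas. The curve constant `a24 = (A + 2)/4`, the prime 10007 and the starting x = 3 are assumed toy values, and the slides do not say that the thesis hardware uses exactly this ladder.

```python
from typing import Tuple

XZ = Tuple[int, int]   # projective x-only point (X : : Z)


def xdbl(x: int, z: int, a24: int, n: int) -> XZ:
    """x-only doubling on B*y^2 = x^3 + A*x^2 + x, with a24 = (A + 2)/4."""
    s = (x + z) * (x + z) % n
    d = (x - z) * (x - z) % n
    t = (s - d) % n                      # equals 4*x*z
    return s * d % n, t * (d + a24 * t) % n


def xadd(xp: int, zp: int, xq: int, zq: int, xd: int, zd: int, n: int) -> XZ:
    """Differential addition: (xd : : zd) is the known difference P - Q."""
    u = (xp - zp) * (xq + zq) % n
    v = (xp + zp) * (xq - zq) % n
    return zd * (u + v) ** 2 % n, xd * (u - v) ** 2 % n


def ladder(k: int, x: int, z: int, a24: int, n: int) -> XZ:
    """Return k*P as (X : : Z) for k >= 1, keeping r1 - r0 = P throughout."""
    r0, r1 = (x, z), xdbl(x, z, a24, n)
    for bit in bin(k)[3:]:               # bits of k after the leading 1
        if bit == "1":
            r0, r1 = xadd(*r0, *r1, x, z, n), xdbl(*r1, a24, n)
        else:
            r0, r1 = xdbl(*r0, a24, n), xadd(*r0, *r1, x, z, n)
    return r0


def to_affine(pt: XZ, n: int) -> int:
    """Single inversion at the very end: x = X / Z mod n."""
    return pt[0] * pow(pt[1], -1, n) % n


# Toy usage with assumed parameters: prime modulus, A = 6 so a24 = 2.
p, a24 = 10007, 2
x6 = to_affine(ladder(6, 3, 1, a24, p), p)
```

Each ladder step costs one doubling and one addition regardless of the bit value, which is one reason this form suits a fixed hardware datapath.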

ECM Algorithm

Inputs:
N – number to be factored
E – elliptic curve
P0 – point of the curve E: initial point
B1 – smoothness bound for Phase 1
B2 – smoothness bound for Phase 2

Outputs:
q – factor of N, 1 < q ≤ N
or FAIL

ECM Algorithm – Phase 1

precomputations
1: $k \leftarrow \prod_i p_i^{e_i}$ such that $p_i$ are consecutive primes with $p_i \le B_1$, and $e_i$ is the largest exponent such that $p_i^{e_i} \le B_1$

main computations
2: $Q_0 \leftarrow k P_0 = (x_{Q_0} : : z_{Q_0})$

postcomputations
3: $q \leftarrow \gcd(z_{Q_0}, N)$
4: if $q > 1$
5: return $q$ (factor of $N$)
6: else
7: go to Phase 2
8: end if
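Step 1 of Phase 1 can be sketched in a few lines. The helper names below are hypothetical and the trial-division prime test is only meant for small bounds; the slides do not show how the host computer actually builds k.

```python
from typing import List


def primes_up_to(limit: int) -> List[int]:
    """All primes p <= limit, by trial division (fine for small bounds)."""
    return [p for p in range(2, limit + 1)
            if all(p % d for d in range(2, int(p ** 0.5) + 1))]


def phase1_scalar(b1: int) -> int:
    """k = product of p_i^e_i, where e_i is the largest exponent with p_i^e_i <= B1."""
    k = 1
    for p in primes_up_to(b1):
        prime_power = p
        while prime_power * p <= b1:
            prime_power *= p
        k *= prime_power
    return k


k_20 = phase1_scalar(20)   # 2^4 * 3^2 * 5 * 7 * 11 * 13 * 17 * 19
```

For B1 = 20 this gives k = 232,792,560, the value used in the Phase 1 example of the presentation.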

ECM Algorithm – Phase 2

09: $d \leftarrow 1$

main computations
10: for each prime $p = B_1$ to $B_2$ do
11: $(x_{pQ_0}, y_{pQ_0}, z_{pQ_0}) \leftarrow p Q_0$
12: $d \leftarrow d \cdot z_{pQ_0} \pmod N$
13: end for

postcomputations
14: $q \leftarrow \gcd(d, N)$
15: if $q > 1$ then
16: return $q$
17: else
18: return FAIL
19: end if
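Steps 09 to 19 multiply all z coordinates together modulo N and take a single gcd at the end, instead of one gcd per prime. The sketch below shows only that accumulation. The list of z values is hypothetical: the middle entry is the z coordinate from the Phase 1 example of the presentation, and the other two are arbitrary numbers coprime to N, not results of real point multiplications.

```python
from math import gcd
from typing import Iterable, Optional


def phase2_accumulate(z_values: Iterable[int], n: int) -> Optional[int]:
    """Accumulate d = product of z values mod n, then one gcd(d, n)."""
    d = 1
    for z in z_values:
        d = d * z % n
    q = gcd(d, n)
    return q if q > 1 else None       # None plays the role of FAIL


N = 1740719                           # 1279 * 1361, from the example slide
hypothetical_z = [12345, 1686279, 99999]
found = phase2_accumulate(hypothetical_z, N)
```

If any single z shares a factor with N, the product does too, so one gcd is enough to reveal it.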

ECM Phase 1 Example

N = 1 740 719 = 1279 · 1361

E: y^2 = x^3 + 14x + 1 (mod 1 740 719)
P0 = (5 : : 1)
B1 = 20
k = 2^4 · 3^2 · 5 · 7 · 11 · 13 · 17 · 19 = 232 792 560

kP0 = (707 838 : : 1 686 279)
gcd(1 686 279, 1 740 719) = 1361
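The same idea can be tried in software with a different, simpler technique than the one on the slide: plain affine arithmetic modulo N, where a failed modular inversion exposes a common factor with N (Lenstra's original formulation). The sketch below is an assumption-laden illustration. The y coordinate 14 is not on the slide; it is derived from the curve equation (5^3 + 14·5 + 1 = 196 = 14^2). The function names are hypothetical, and because the arithmetic differs from the projective form used in the thesis, the run may stop with either factor or with no factor.

```python
from math import gcd
from typing import Optional, Tuple


class FactorFound(Exception):
    """Raised when an inversion fails; carries gcd(value, n)."""
    def __init__(self, factor: int) -> None:
        super().__init__(factor)
        self.factor = factor


def inv_mod(v: int, n: int) -> int:
    v %= n
    g = gcd(v, n)
    if g != 1:
        raise FactorFound(g)          # g may also be n itself
    return pow(v, -1, n)


def add(p1: Tuple[int, int], p2: Tuple[int, int], a: int, n: int) -> Tuple[int, int]:
    """Affine addition or doubling on y^2 = x^3 + a*x + b, modulo n."""
    x1, y1 = p1
    x2, y2 = p2
    if p1 == p2:
        lam = (3 * x1 * x1 + a) * inv_mod(2 * y1, n) % n
    else:
        lam = (y2 - y1) * inv_mod(x2 - x1, n) % n
    x3 = (lam * lam - x1 - x2) % n
    return x3, (lam * (x1 - x3) - y1) % n


def ecm_phase1_affine(n: int, a: int, p0: Tuple[int, int], k: int) -> Optional[int]:
    """Left-to-right double-and-add of k*p0; return a proper factor or None."""
    r = p0
    try:
        for bit in bin(k)[3:]:
            r = add(r, r, a, n)
            if bit == "1":
                r = add(r, p0, a, n)
    except FactorFound as hit:
        return hit.factor if hit.factor < n else None
    return None


N = 1740719
result = ecm_phase1_affine(N, 14, (5, 14), 232792560)
```

Whatever `result` is, a non-None value is a proper divisor of N by construction. The slide reports the factor 1361 for its own projective computation.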

Hierarchy of ECM Operations

Top level: ECM, scalar multiplication (k·P)

Medium level: elliptic curve point operations
  Point addition (P+Q)
  Point doubling (2P)

Low level: modular arithmetic (field operations)
  Modular multiplication (x·y mod p)
  Modular addition (x+y mod p)
  Modular subtraction (x-y mod p)
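The three low-level operations can be sketched as follows, assuming both operands are already reduced (0 ≤ x, y < p). Addition and subtraction then need only one conditional correction, which maps naturally onto an adder/subtractor unit. The multiplication uses Python's `%` for brevity; the slides do not say how the hardware multiplier performs the reduction.

```python
def mod_add(x: int, y: int, p: int) -> int:
    """x + y mod p for reduced operands: at most one subtraction of p."""
    s = x + y
    return s - p if s >= p else s


def mod_sub(x: int, y: int, p: int) -> int:
    """x - y mod p for reduced operands: at most one addition of p."""
    d = x - y
    return d + p if d < 0 else d


def mod_mul(x: int, y: int, p: int) -> int:
    """x * y mod p; the reduction method here is a software shortcut."""
    return x * y % p


total = mod_add(20, 5, 23)
diff = mod_sub(5, 20, 23)
prod = mod_mul(7, 12, 23)
```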

Reconfigurable Computing Platforms

FPGAs

Figure: FPGA structure with Configurable Logic Blocks, I/O Blocks, and Block RAMs

Reconfigurable Computer

Figure: a microprocessor system (several µPs, each with its own memory, plus I/O and an interface) next to a reconfigurable system (several FPGAs, each with its own memory, plus I/O and an interface)

SRC MAP – Reconfigurable Processor

Figure: SRC MAP reconfigurable processor. Source: [SRC, MAPLD04]

SRC System

Figure: SRC-6 system. Components shown: µP with Memory, SNAP, PCI-X, MAP, Common Memory, SRC Hi-Bar Switch, Chaining GPIO, Gig Ethernet etc., Disk, Storage Area Network, Local Area Network, Wide Area Network, Customers’ Existing Networks. Source: [SRC, MAPLD04]

Reconfigurable Computing Languages and Programming Environments

Traditional Design Flow – HDLs for Programming FPGAs

Figure: design flow from Specification to HDL source code (HDL description), Synthesis (netlists), Implementation (bitstream), and Configuration. Verification steps alongside: Functional Simulation, Post-synthesis Simulation, Timing Simulation, On-chip Testing.

Traditional Design Flow – HLLs for Programming µProcessor

Software elements analysis: extract the requirements (customer)
Specification: precisely describe the software
Architecture: abstract representation
Coding: implementation
Testing: test all parts of the software

APIs between FPGAs and µPs

Figure: Host Computer (Main Program) connected through APIs to FPGA Boards

Vendor dependent
No common standard

SRC Carte C

+ very easy to learn and use
+ standard ANSI C
+ hides implementation details
+ very well integrated environment
+ mature: in production use for over 4 years with constant improvements

- subset of C
- legacy C code requires rewriting
- C limitations in describing HW (parallelism, data types)
- closed environment, limited portability of code to HW platforms other than SRC

SRC Carte C – Design Flow

Figure: Carte C design flow. Inputs: application sources (.c or .f files; .mc or .mf files) and macro sources (HDL sources, .vhd or .v files). Stages: µP Compiler and MAP Compiler, Logic synthesis (.v files to .ngo netlists), Place & Route (.bin configuration bitstreams), object files (.o files), Linker, application executable.

Celoxica Handel-C

+ very easy to learn and use
+ superset of ANSI C
+ hides implementation details
+ very flexible, no limitation in parallelism and data type, extended operators for bit manipulation
+ well-defined timing model
+ portable to a wide range of FPGA devices

- legacy C code requires rewriting
- each statement takes 1 clock cycle to execute

Celoxica Handel-C – Design Flow

Figure: Executable Specification, Handel-C, VHDL, Synthesis, EDIF, Place & Route

Previous Work on Implementing Applications in HLLs for Reconfigurable Hardware (1)

Vendor libraries of hardware macros developed and distributed by SRC Inc., including
  basic integer and floating-point arithmetic
  digital signal processing

User libraries of hardware macros developed by GWU/GMU/USC 2002-2006, including
  Secret-key cipher encryption & breaking
  Binary Galois Field arithmetic (polynomial basis & normal basis representation)
  Elliptic Curve Arithmetic
  Long integer modular arithmetic (RSA)
  Sorting
  Image processing
  Bioinformatics

N. Nguyen, K. Gaj, D. Caliga, T. El-Ghazawi, "Implementation of Elliptic Curve Cryptosystems on a Reconfigurable Computer", FPT 2003
S. Bajracharya, C. Shu, K. Gaj, T. El-Ghazawi, "Implementation of Elliptic Curve Cryptosystems over GF(2^n) in Optimal Normal Basis on a Reconfigurable Computer", FPL 2004
Vlad Kindratenko (NCSA), "Accelerating Scientific Applications with Reconfigurable Computing", SC06
Viktor K. Prasanna, Gerald R. Morris, "Sparse Matrix Computations on Reconfigurable Computer", RC Magazine, Mar 2007

Esam El-Araby, Mohamed Taher, Mohamed Abouellail, Tarek El-Ghazawi, and Gregory B. Newby, "Comparative Analysis of High Level Programming for Reconfigurable Computers: Methodology and Empirical Study", SPL 2007

Partitioning ECM Code between HDLs and HLLs

General Architecture of ECM

Figure: Main Program, Data Communication Module, Control Unit & Mem, M1, M2, A/S

Coding Partition – Scheme 1

Figure: the same blocks, each assigned to C/C++, HLL, or HDL

Coding Partition – Scheme 2

Figure: the same blocks, each assigned to C/C++, HLL, or HDL

Coding Partition – Scheme 3

Figure: the same blocks, each assigned to C/C++ or HLL

4 Implemented Schemes

SRC Carte C
  Scheme 1 (Entire ECM Unit as HDL core)
  Scheme 2 (Control Unit in Carte-C)

Celoxica Handel-C
  Scheme 2 (Control Unit in Handel-C)
  Scheme 3 (Entire ECM Unit in Handel-C)

Implementation of ECM Using SRC Carte C

Carte C Partition – Scheme 1

Figure: Main Program, Data Communication Module, Control Unit & Mem, M1, M2, A/S, each assigned to C/C++, Carte C, or VHDL

ECM Module – Block Diagram

Figure: Control Unit with Instruction MEM and Global MEM, driving Unit 1 through Unit T; each unit contains a Local MEM, M1, M2, and A/S

Result – Place & Route

Number of occupied Slices: 30,996 out of 33,792 (91%)
Frequency: 99 MHz
Execution Time (both Phases): 38 ms

Result – # Lines of Code

            Number of lines of code    Estimated time to complete
VHDL        3975                       3 students x 1 semester
Carte-C     106

ECM Architecture – Top Level View

Figure: FPGA containing the Control Unit, Instruction memory, Global memory (RAM), and ECM Units (Phase 1 & Phase 2), connected through I/O to the Host computer

Host computer pre-computes E, P0, k and post-computes final gcds

Result – Overall

Legend:
1 – General pre-computations independent of N
2 – Pre-computations (µP)
3 – Transfer in (µP→FPGA)
4 – Main computations (Phase 1 & 2) (FPGA)
5 – Transfer out (FPGA→µP)
6 – Post-computations (µP)

Step          1       2       3       4        5       6
Time (µs)     2,249   1,368   177     36,289   145     81
Percentage            3.6%    0.5%    95.3%    0.4%    0.2%

Before optimization: 100% = 38,060 µs (steps 2 to 6)
After optimization: 100% = 36,611 µs (steps 3 to 5: 0.5%, 99.1%, 0.4%)

Figure: timelines before and after optimization, with µP steps 1, 2, 6 and FPGA steps 3, 4, 5
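The percentages on this slide can be recomputed from the per-step times. The mapping of the six times to steps 1 to 6 is read from the order in which the numbers appear on the slide and should be treated as an assumption; the two totals it produces match the 38,060 µs and 36,611 µs quoted on the slide.

```python
from typing import Dict, List, Tuple

# Per-step times in microseconds (step -> time), assumed mapping.
TIMES_US: Dict[int, int] = {1: 2249, 2: 1368, 3: 177, 4: 36289, 5: 145, 6: 81}


def breakdown(steps: List[int]) -> Tuple[int, Dict[int, float]]:
    """Total time of the given steps and each step's share in percent."""
    total = sum(TIMES_US[s] for s in steps)
    shares = {s: round(100 * TIMES_US[s] / total, 1) for s in steps}
    return total, shares


before_total, before_pct = breakdown([2, 3, 4, 5, 6])   # µP and FPGA in sequence
after_total, after_pct = breakdown([3, 4, 5])           # only FPGA-side steps counted
```

In both cases the FPGA main computation (step 4) dominates, so further gains have to come from the ECM units themselves rather than from data transfer.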

Carte C Partition – Scheme 2

Figure: Main Program, Data Communication Module, Control Unit & Mem, M1, M2, A/S, each assigned to C/C++, Carte C, or VHDL

Result – Place & Route

Frequency: 100 MHz
Execution Time: 69 ms

Result – # Lines of Code

            Number of lines of code    Estimated time to complete
VHDL        1812                       1 student x 1 semester
Carte-C     1127

Implementation of ECM Using Celoxica Handel-C

Handel-C Partition – Scheme 2

Figure: Main Program, Data Communication Module, Control Unit & Mem, M1, M2, A/S, each assigned to C/C++, Handel-C, or VHDL

Result – # Lines of Code

            Number of lines of code    Estimated time to complete
VHDL        1750                       1 student x 1/2 semester
Handel-C    1400

Handel-C Partition – Scheme 3

Figure: Main Program, Data Communication Module, Control Unit & Mem, M1, M2, A/S, each assigned to C/C++ or Handel-C

Result – # Lines of Code

            Number of lines of code    Estimated time to complete
VHDL        0                          1 student x 1/3 semester
Handel-C    1600

Comparison of Results

3 Possible Comparisons

SRC Carte C: Scheme 1 (Entire ECM Unit as HDL core) vs. Scheme 2 (Control Unit in Carte-C)
Celoxica Handel-C: Scheme 2 (Control Unit in Handel-C) vs. Scheme 3 (Entire ECM Unit in Handel-C)
SRC Carte C Scheme 2 vs. Celoxica Handel-C Scheme 2

Comparison of Results of 2 Schemes in SRC – Resource Utilization

            Resource        # Units   # Lines of Code    Coding Time                Exe. Time
            (CLB slices)              HDL      HLL
Scheme 1    91%*            9         3975     106       3 students x 1 semester    38 ms
Scheme 2    97%*            5         1812     1127      1 student x 1 semester     69 ms

Scheme 1: Entire ECM Unit in VHDL
Scheme 2: Control Unit in Carte-C

Comparison of Results of 2 Schemes in SRC – # ECM Operations/sec

Scheme 1: 246
Scheme 2: 72
Software: 40

Scheme 1 vs. Scheme 2: 3.42x
Scheme 2 vs. Software: 1.8x

Scheme 1: Entire ECM Unit in VHDL
Scheme 2: Control Unit in Carte-C

Comparison of Results of 2 Schemes in Handel-C – Resource Utilization

            Resource        # Units   # Lines of Code    Coding Time                 Exe. Time
            (CLB slices)              HDL      HLL
Scheme 2    99%             11        1750     1400      1 student x 1/2 semester    160 ms
Scheme 3    99%             10        0        1600      1 student x 1/3 semester    116 ms

Scheme 2: Control Unit in Handel-C
Scheme 3: Entire ECM Unit in Handel-C

Comparison of Results of 2 Schemes in Handel-C – # ECM Operations/sec

VHDL: 436*
Scheme 2: 69
Scheme 3: 86
Software: 40

Scheme 3 vs. Scheme 2: 1.25x
Scheme 3 vs. Software: 2.2x
VHDL vs. Software: 10.9x

Scheme 2: Control Unit in Handel-C
Scheme 3: Entire ECM Unit in Handel-C
* Reported in SHARCS, 2006

Cross-Comparison between Schemes in SRC and Handel-C – Resource Utilization

                    Resource   # Units   # Lines of Code    Coding Time                 Exe. Time
                                         HDL      HLL
SRC Scheme 2        97%        5         1812     1127      1 student x 1 semester      69 ms
Handel-C Scheme 2   99%        9         1750     1400      1 student x 1/2 semester    160 ms

Note: the bold numbers are already adjusted to reflect the added resource

Handel-C and SRC Scheme: Control Unit in HLLs

Cross-Comparison between Schemes in SRC and Handel-C – # ECM Operations/sec

SRC Scheme 2: 72
Handel-C Scheme 2: 56
Software: 40

SRC Scheme 2 vs. Handel-C Scheme 2: 1.29x
SRC Scheme 2 vs. Software: 1.8x

Handel-C and SRC Scheme: Control Unit in HLLs
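The speedup factors quoted on the comparison slides follow directly from the reported operations per second. The sketch below recomputes them. It also checks an assumed relation that the slides do not state: throughput equals the number of ECM units divided by the execution time of one run. That relation reproduces the reported figures when the 36,611 µs optimized time is used for SRC Scheme 1.

```python
def speedup(fast_ops: float, slow_ops: float) -> float:
    """Ratio of two throughputs, rounded to two decimals."""
    return round(fast_ops / slow_ops, 2)


def ops_per_sec(units: int, exec_time_s: float) -> int:
    """Assumed model: each unit finishes one ECM run per execution time."""
    return round(units / exec_time_s)


# Reported throughputs (ECM operations per second) from the slides.
SOFTWARE, SRC_1, SRC_2 = 40, 246, 72
HC_2, HC_3, HC_2_ADJUSTED, VHDL_SHARCS = 69, 86, 56, 436

src1_vs_src2 = speedup(SRC_1, SRC_2)            # slide: 3.42x
hc3_vs_hc2 = speedup(HC_3, HC_2)                # slide: 1.25x
src2_vs_hc2 = speedup(SRC_2, HC_2_ADJUSTED)     # slide: 1.29x
vhdl_vs_software = speedup(VHDL_SHARCS, SOFTWARE)  # slide: 10.9x
```

Under the assumed model, the drop from 246 to 72 operations per second in SRC comes from both fewer units (5 instead of 9) and a longer run (69 ms instead of about 37 ms).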

Discussion of Results

Figure: Efficiency (Low to High, 0 to 100%) plotted against Ease-of-use (Difficult to Easy, 0 to 100%) for VHDL, SRC 1, SRC 2, Handel-C 2, and Handel-C 3

SRC 1: Entire ECM Unit in VHDL
SRC 2: Control Unit in Carte-C
Handel-C 2: Control Unit in Handel-C
Handel-C 3: Entire ECM Unit in Handel-C

Conclusions

VHDL still gives the best performance

SRC 1 to SRC 2:
  3.42x reduction in performance
  1/3x reduction in development time

Handel-C 2 to Handel-C 3:
  1.25x increase in performance
  1/2x reduction in development time

SRC 2 to Handel-C 2:
  1.29x reduction in performance
  1/2x reduction in development time

Future Work

Run the whole ECM project written in Handel-C using selected Celoxica boards
Develop equivalent implementations of ECM using other HLLs for RCs, such as Impulse-C, Mitrion-C and DSPLogic; then compare all the results with each other.
Tweak the Handel-C ECM code to achieve higher clock frequency and port it to run on SRC (as hardware macro), Cray XD1, SGI, and other platforms that may not have a fixed running clock.

Thank you !!!

Questions???
