Master’s Thesis Presentation
Hoang Le
Director: Dr. Kris Gaj
Outline
RSA
ECM
Reconfigurable Computing Platforms, Languages and Programming Environments
Partitioning ECM Code between HDLs and HLLs
Implementation of ECM Using SRC Carte C and Celoxica Handel-C
Comparison of Results
Discussion of Results
Conclusions
RSA
History of RSA
Invented by Rivest, Shamir, and Adleman in 1977
Most widely used public key cryptosystem
Key length is variable depending on application
RSA’s key can be chosen long for enhanced security, or short for efficiency
Operation of RSA
Public Key (e, n): $c = f(m) = m^e \bmod n$
Private Key (d, n): $m = f^{-1}(c) = c^d \bmod n$
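The two maps above can be tried with toy numbers. This is a minimal sketch, assuming tiny illustrative primes (61 and 53), a public exponent of 17 and a message of 65, none of which come from the presentation and all of which are far too small for real use:

```python
# Toy RSA round trip. All parameters are illustrative assumptions,
# deliberately tiny and insecure.
p, q = 61, 53
n = p * q                     # public modulus
phi = (p - 1) * (q - 1)       # Euler's totient of n
e = 17                        # public exponent, coprime to phi
d = pow(e, -1, phi)           # private exponent (modular inverse, Python 3.8+)

m = 65                        # message, must satisfy 0 <= m < n
c = pow(m, e, n)              # encryption:  c = m^e mod n
recovered = pow(c, d, n)      # decryption:  m = c^d mod n

print(n, d, c, recovered)
```

Anyone who can factor n into p and q can recompute phi and therefore d, which is why the cost of factoring bounds the security of the scheme.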
Security of RSA
Factoring a big number is difficult or infeasible
Cost and time required to factor an n-bit RSA modulus provide an upper bound on the security of n-bit RSA
Currently, the size of key used is 1024 bits or 309 digits
ECM
What is ECM?
Elliptic Curve Method of Factoring
Phase 1: Lenstra, 1985
Phase 2: Brent, Montgomery, 1986-87
[Figure: the number N with its factor q marked; annotation: < 50 bits]
Factoring time depends mainly on the size of factor q
ECM in NFS
Polynomial Selection
Relation Collection
  Sieving
  Mini-Factoring (ECM) of 200-250 bit numbers
Linear Algebra
Square Root
Elliptic Curve
$Y^2 = X^3 + X + 1 \bmod p \quad (p = 23)$
Points fulfilling the equation of the curve, plus a special point $\vartheta$ (point at infinity) such that: $P + \vartheta = \vartheta + P = P$
[Figure: points of the curve plotted for X from 0 to 20 and Y from 0 to 25]
Addition: P = (6,19), Q = (7,12), R = P + Q = (13,7)
Doubling: P = (3,13), 2P = P + P = (7,11)
$\exists P$: all points of the curve: $P, 2P, 3P, \ldots, nP = \vartheta, (n+1)P = P, \ldots$
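The addition and doubling examples on this slide can be checked with the usual affine chord-and-tangent formulas. The sketch below assumes those textbook formulas; the function name `ec_add` and the use of `None` for the point at infinity are illustrative choices, not taken from the presentation:

```python
# Affine point arithmetic on Y^2 = X^3 + a*X + b over GF(p).
# Curve from the slide: a = 1, b = 1, p = 23. b does not appear in the formulas.
P_MOD = 23
A = 1
INF = None  # illustrative stand-in for the point at infinity


def ec_add(P, Q, a=A, p=P_MOD):
    """Return P + Q; handles the point at infinity and doubling."""
    if P is INF:
        return Q
    if Q is INF:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return INF                                   # P + (-P) = infinity
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow((2 * y1) % p, -1, p) % p   # tangent slope
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p          # chord slope
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)


print(ec_add((6, 19), (7, 12)))   # addition example from the slide
print(ec_add((3, 13), (3, 13)))   # doubling example from the slide
```

Both slopes need a modular inversion, which is the cost that projective coordinates avoid.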
Projective vs. Affine Coordinates
affine coordinates: $P_a = (x, y)$
  addition and doubling require inversion
projective coordinates: $P_p = (x, y, z)$
  addition and doubling can be done without inversion
projective coordinates for Montgomery form of the curve: $P_{pM} = (x : : z)$
  addition and doubling do not require the y coordinate
Scalar Multiplication
$Q = k \cdot P = P + P + P + \ldots + P$ (k times)
k: number (scalar)
P, Q: points
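Adding P to itself k times is rarely done literally; a common shortcut is double-and-add, which needs about log2(k) doublings. The sketch below is one such approach on the small curve $Y^2 = X^3 + X + 1 \bmod 23$ used earlier in the deck. It is an assumption for illustration (the presentation itself works with Montgomery-form projective coordinates), and the names `ec_add` and `scalar_mult` are hypothetical:

```python
# Double-and-add scalar multiplication in affine coordinates.
# Illustrative only: curve Y^2 = X^3 + X + 1 over GF(23).
P_MOD, A, INF = 23, 1, None


def ec_add(P, Q, a=A, p=P_MOD):
    """Affine addition with the point at infinity represented as None."""
    if P is INF:
        return Q
    if Q is INF:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return INF
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow((2 * y1) % p, -1, p) % p
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)


def scalar_mult(k, P):
    """Compute k*P by scanning the bits of k from least significant."""
    result, addend = INF, P
    while k > 0:
        if k & 1:
            result = ec_add(result, addend)
        addend = ec_add(addend, addend)   # addend = 2^i * P
        k >>= 1
    return result


def naive_mult(k, P):
    """Reference: literally add P to itself k times."""
    result = INF
    for _ in range(k):
        result = ec_add(result, P)
    return result


print(scalar_mult(2, (3, 13)))
```

The naive version is kept only as a cross-check for small k.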
ECM Algorithm
Inputs:
  N – number to be factored
  E – elliptic curve
  P0 – point of the curve E : initial point
  B1 – smoothness bound for Phase 1
  B2 – smoothness bound for Phase 2
Outputs:
  q – factor of N, 1 < q ≤ N, or FAIL
ECM Algorithm – Phase 1
precomputations:
1: $k \leftarrow \prod_{p_i \le B_1} p_i^{e_i}$, where $p_i$ are the consecutive primes such that $p_i \le B_1$, and $e_i$ is the largest exponent such that $p_i^{e_i} \le B_1$
main computations:
2: $Q_0 \leftarrow kP_0 = (x_{Q_0} : : z_{Q_0})$
postcomputations:
3: $q \leftarrow \gcd(z_{Q_0}, N)$
4: if $q > 1$
5:   return $q$ (factor of $N$)
6: else
7:   go to Phase 2
8: end if
ECM Algorithm – Phase 2
main computations:
09: $d \leftarrow 1$
10: for each prime $p = B_1$ to $B_2$ do
11:   $(x_{pQ_0}, y_{pQ_0}, z_{pQ_0}) \leftarrow pQ_0$
12:   $d \leftarrow d \cdot z_{pQ_0} \pmod{N}$
13: end for
postcomputations:
14: $q \leftarrow \gcd(d, N)$
15: if $q > 1$ then
16:   return $q$
17: else
18:   return FAIL
19: end if
ECM Phase 1 Example
$N = 1\,740\,719 = 1279 \cdot 1361$
$E : y^2 = x^3 + 14x + 1 \pmod{1\,740\,719}$
$P_0 = (5 : : 1)$
$B_1 = 20$
$k = 2^4 \cdot 3^2 \cdot 5 \cdot 7 \cdot 11 \cdot 13 \cdot 17 \cdot 19 = 232\,792\,560$
$kP_0 = (707\,838 : : 1\,686\,279)$
$\gcd(1\,686\,279, 1\,740\,719) = 1361$
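Two steps of this example are plain integer arithmetic and can be reproduced directly: building the Phase 1 multiplier k from the smoothness bound, and the final gcd. The sketch below assumes the helper names `primes_up_to` and `phase1_multiplier`, which are illustrative. It does not recompute kP0, whose z coordinate is taken from the slide as given. The multiplier printed on the slide is what this construction yields for a bound of 20.

```python
from math import gcd


def primes_up_to(n):
    """Sieve of Eratosthenes: all primes p with p <= n."""
    sieve = [True] * (n + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            for j in range(i * i, n + 1, i):
                sieve[j] = False
    return [i for i, is_prime in enumerate(sieve) if is_prime]


def phase1_multiplier(b1):
    """k = product of p^e over primes p <= b1, with e maximal such that p^e <= b1."""
    k = 1
    for p in primes_up_to(b1):
        power = p
        while power * p <= b1:
            power *= p
        k *= power
    return k


N = 1_740_719
k = phase1_multiplier(20)
z = 1_686_279            # z coordinate of k*P0, copied from the slide
q = gcd(z, N)
print(k, q, N // q)
```

The gcd exposes the factor because z is a multiple of 1361 but not of 1279.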
Hierarchy of ECM Operations
Top level: scalar multiplication (k·P)
Medium level: elliptic curve point operations
  Point addition (P+Q)
  Point doubling (2P)
Low level: modular arithmetic (field operations)
  Modular multiplication (x·y mod p)
  Modular addition (x+y mod p)
  Modular subtraction (x-y mod p)
Reconfigurable Computing Platforms
FPGAs
[Figure: FPGA structure with Configurable Logic Blocks, Block RAMs and I/O Blocks]
Reconfigurable Computer
[Figure: a microprocessor system (μP … μP, each with its own Memory, plus Interface and I/O) next to a reconfigurable system (FPGA … FPGA, each with its own Memory, plus Interface and I/O)]
SRC MAP – Reconfigurable Processor
Source: [SRC, MAPLD04]
SRC System
[Figure: SRC-6 system. μP boards with Memory and SNAP™ interfaces, MAP® processors with chaining GPIO, and Common Memory are connected through the SRC Hi-Bar Switch; PCI-X, Gig Ethernet, etc. link the system to Disk, Storage Area Network, Local Area Network and Wide Area Network (customers’ existing networks)]
Source: [SRC, MAPLD04]
Reconfigurable Computing Languages and Programming Environments
Traditional Design Flow – HDLs for Programming FPGAs
[Figure: design flow]
Specification
HDL description → HDL source code → Functional Simulation
Synthesis → Netlists → Post-synthesis Simulation
Implementation → Bitstream → Timing Simulation
Configuration → On-chip Testing
Traditional Design Flow – HLLs for Programming µProcessor
Software elements analysis: extract the requirements (customer)
Specification: precisely describe the software
Architecture: abstract representation
Coding: implementation
Testing: test all parts of the software
APIs between FPGAs and µPs
[Figure: Host Computer running the Main Program, connected through APIs to FPGA Boards]
Vendor dependent
No common standard
SRC Carte C
+ very easy to learn and use
+ standard ANSI C
+ hides implementation details
+ very well integrated environment
+ mature - in production use for over 4 years with constant improvements
- subset of C
- legacy C code requires rewriting
- C limitations in describing HW (parallelism, data types)
- closed environment, limited portability of code to HW platforms other than SRC
SRC Carte C – Design Flow
[Figure: design flow]
Application sources (.c or .f files) → μP Compiler → Object files (.o files)
Application sources (.mc or .mf files) → MAP Compiler → Object files (.o files) and .v files
Macro sources: HDL sources (.vhd or .v files)
.v files → Logic synthesis → Netlists (.ngo files) → Place & Route → Configuration bitstreams (.bin files)
.o files and .bin files → Linker → Application executable
Celoxica Handel-C
+ very easy to learn and use
+ superset of ANSI C
+ hides implementation details
+ very flexible, no limitation in parallelism and data type, extended operators for bit manipulation
+ well-defined timing model
+ portable to a wide range of FPGA devices
- legacy C code requires rewriting
- each statement takes 1 clock cycle to execute
Celoxica Handel-C – Design Flow
[Figure: design flow]
Executable Specification → Handel-C
Handel-C → VHDL → Synthesis → EDIF → Place & Route
Handel-C → EDIF → Place & Route
Previous Work on Implementing Applications in HLLs for Reconfigurable Hardware (1)
Vendor libraries of hardware macros developed and distributed by SRC Inc., including
  basic integer and floating-point arithmetic
  digital signal processing
User libraries of hardware macros developed by GWU/GMU/USC 2002-2006, including
  Secret-key cipher encryption & breaking
  Binary Galois Field arithmetic (polynomial basis & normal basis representation)
  Elliptic Curve Arithmetic
  Long integer modular arithmetic (RSA)
  Sorting
  Image processing
  Bioinformatics
Previous Work on Implementing Applications in HLLs for Reconfigurable Hardware (2)
N. Nguyen, K. Gaj, D. Caliga, T. El-Ghazawi, “Implementation of Elliptic Curve Cryptosystems on a Reconfigurable Computer”, FPT 2003
S. Bajracharya, C. Shu, K. Gaj, T. El-Ghazawi, “Implementation of Elliptic Curve Cryptosystems over GF(2^n) in Optimal Normal Basis on a Reconfigurable Computer”, FPL 2004
Vlad Kindratenko (NCSA), “Accelerating Scientific Applications with Reconfigurable Computing”, SC06
Viktor K. Prasanna, Gerald R. Morris, “Sparse Matrix Computations on Reconfigurable Computer”, Magazine, Mar 2007
Previous Work on Implementing Applications in HLLs for Reconfigurable Hardware (3)
Esam El-Araby, Mohamed Taher, Mohamed Abouellail, Tarek El-Ghazawi, and Gregory B. Newby, “Comparative Analysis of High Level Programming for Reconfigurable Computers: Methodology and Empirical Study”, SPL 2007
Partitioning ECM Code between HDLs and HLLs
General Architecture of ECM
[Figure: block diagram with Main Program, Data Communication Module, Control Unit & Mem, and units M1, M2, A/S]
Coding Partition – Scheme 1
[Figure: the general architecture (Main Program, Data Communication Module, Control Unit & Mem, M1, M2, A/S) divided among C/C++, HLL and HDL; in Scheme 1 the entire ECM unit is an HDL core]
Coding Partition – Scheme 2
[Figure: the general architecture (Main Program, Data Communication Module, Control Unit & Mem, M1, M2, A/S) divided among C/C++, HLL and HDL; in Scheme 2 the control unit is written in the HLL]
Coding Partition – Scheme 3
[Figure: the general architecture (Main Program, Data Communication Module, Control Unit & Mem, M1, M2, A/S) divided between C/C++ and HLL only; in Scheme 3 the entire ECM unit is written in the HLL]
4 Implemented Schemes
SRC Carte C:
  Scheme 1 (Entire ECM Unit as HDL core)
  Scheme 2 (Control Unit in Carte-C)
Celoxica Handel-C:
  Scheme 2 (Control Unit in Handel-C)
  Scheme 3 (Entire ECM Unit in Handel-C)
Implementation of ECM Using SRC Carte C
Carte C Partition – Scheme 1
[Figure: Main Program, Data Communication Module, Control Unit & Mem, M1, M2, A/S divided among C/C++, Carte C and VHDL]
ECM Module – Block Diagram
[Figure: a Control Unit with Instruction MEM and Global MEM drives ECM Unit 1 through ECM Unit T; each unit contains a Local MEM and units M1, M2, A/S]
Result – Place & Route
Number of occupied Slices | 30,996 out of 33,792 (91%)
Frequency | 99 MHz
Execution Time (both Phases) | 38 ms
Result – # Lines of Code
Language | Number of lines of code | Estimated Time to Complete
VHDL | 3975 | 3 students x 1 semester
Carte-C | 106 |
ECM Architecture – Top Level View
[Figure: Host computer connected through I/O to the FPGA, which contains Instruction memory, Control Unit, Global memory (RAM) and the ECM Units for Phase 1 & Phase 2]
Host computer pre-computes E, P0, k and post-computes final gcds
Result – Overall
Legend:
1 – General pre-computations independent of N
2 – Pre-computations (μP)
3 – Transfer in (μP→FPGA)
4 – Main computations (Phase 1 & 2) (FPGA)
5 – Transfer out (FPGA→μP)
6 – Post-computations (μP)

Step | 1 | 2 | 3 | 4 | 5 | 6
Time (µs) | 2,249 | 1,368 | 177 | 36,289 | 145 | 81
Percentage before optimization (100% = 38,060 μs) | | 3.6% | 0.5% | 95.3% | 0.4% | 0.2%
Percentage after optimization (100% = 36,611 μs) | | | 0.5% | 99.1% | 0.4% |

[Figure: µP and FPGA timelines; after optimization, steps 2 and 6 on the µP overlap with steps 3, 4 and 5 on the FPGA]
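The percentages on this slide follow from the per-step times. The sketch below assumes the times are read as 1,368, 177, 36,289, 145 and 81 µs for steps 2 to 6, and that after optimization only steps 3 to 5 remain on the critical path. Both readings are inferred from the 38,060 µs and 36,611 µs totals rather than stated outright on the slide.

```python
# Per-step times in microseconds (steps 2..6), as read from the slide.
step_time_us = {2: 1368, 3: 177, 4: 36289, 5: 145, 6: 81}


def shares(steps):
    """Return (total time, {step: percentage of total}) for the given steps."""
    total = sum(step_time_us[s] for s in steps)
    return total, {s: round(100.0 * step_time_us[s] / total, 1) for s in steps}


# Before optimization every step runs one after another.
total_before, pct_before = shares([2, 3, 4, 5, 6])
# After optimization (assumed): µP steps 2 and 6 are hidden behind FPGA work.
total_after, pct_after = shares([3, 4, 5])

print(total_before, pct_before)
print(total_after, pct_after)
```

Either way the FPGA main computation dominates, so hiding the µP work saves under 4% of the total.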
Carte C Partition – Scheme 2
[Figure: Main Program, Data Communication Module, Control Unit & Mem, M1, M2, A/S divided among C/C++, Carte C and VHDL, with the control unit in Carte C]
Result – Place & Route
Number of occupied Slices | 33,114 out of 33,792 (97%)
Frequency | 100 MHz
Execution Time | 69 ms
Result – # Lines of Code
Language | Number of lines of code | Estimated Time to Complete
VHDL | 1812 | 1 student x 1 semester
Carte-C | 1127 |
Implementation of ECM Using Celoxica Handel-C
Handel-C Partition – Scheme 2
[Figure: Main Program, Data Communication Module, Control Unit & Mem, M1, M2, A/S divided among C/C++, Handel-C and VHDL, with the control unit in Handel-C]
Result – Place & Route
Number of occupied Slices | 33,680 out of 33,792 (99%)
Frequency | 22 MHz
Execution Time (both Phases) | 160 ms
Result – # Lines of Code
Language | Number of lines of code | Estimated Time to Complete
VHDL | 1750 | 1 student x 1/2 semester
Handel-C | 1400 |
Handel-C Partition – Scheme 3
[Figure: Main Program, Data Communication Module, Control Unit & Mem, M1, M2, A/S divided between C/C++ and Handel-C, with the entire ECM unit in Handel-C]
Result – Place & Route
Number of occupied Slices | 33,790 out of 33,792 (99%)
Frequency | 30 MHz
Execution Time (both Phases) | 116 ms
Result – # Lines of Code
Language | Number of lines of code | Estimated Time to Complete
VHDL | 0 | 1 student x 1/3 semester
Handel-C | 1600 |
Comparison of Results
3 Possible Comparisons
1: SRC Carte C Scheme 1 (Entire ECM Unit as HDL core) vs. SRC Carte C Scheme 2 (Control Unit in Carte-C)
2: Celoxica Handel-C Scheme 2 (Control Unit in Handel-C) vs. Celoxica Handel-C Scheme 3 (Entire ECM Unit in Handel-C)
3: SRC Carte C Scheme 2 (Control Unit in Carte-C) vs. Celoxica Handel-C Scheme 2 (Control Unit in Handel-C)
Comparison of Results of 2 Schemes in SRC – Resource Utilization
Scheme | Resource (CLB slices) | # Units | # Lines of Code, HDL | # Lines of Code, HLL | Coding Time | Exe. Time
Scheme 1 | 91%* | 9 | 3975 | 106 | 3 students x 1 semester | 38 ms
Scheme 2 | 97%* | 5 | 1812 | 1127 | 1 student x 1 semester | 69 ms
Scheme 1: Entire ECM Unit in VHDL
Scheme 2: Control Unit in Carte-C
Comparison of Results of 2 Schemes in SRC – # ECM Operations/sec
Scheme 1: 246
Scheme 2: 72
Software: 40
Scheme 1 vs. Scheme 2: 3.42x
Scheme 2 vs. Software: 1.8x
Scheme 1: Entire ECM Unit in VHDL
Scheme 2: Control Unit in Carte-C
Comparison of Results of 2 Schemes in Handel-C – Resource Utilization
Scheme | Resource (CLB slices) | # Units | # Lines of Code, HDL | # Lines of Code, HLL | Coding Time | Exe. Time
Scheme 2 | 99% | 11 | 1750 | 1400 | 1 student x 1/2 semester | 160 ms
Scheme 3 | 99% | 10 | 0 | 1600 | 1 student x 1/3 semester | 116 ms
Scheme 2: Control Unit in Handel-C
Scheme 3: Entire ECM Unit in Handel-C
Comparison of Results of 2 Schemes in Handel-C – # ECM Operations/sec
VHDL: 436 *
Scheme 3: 86
Scheme 2: 69
Software: 40
Scheme 3 vs. Scheme 2: 1.25x
Scheme 3 vs. Software: 2.2x
VHDL vs. Software: 10.9x
Scheme 2: Control Unit in Handel-C
Scheme 3: Entire ECM Unit in Handel-C
* Reported in SHARCS, 2006
Cross-Comparison between schemes in SRC and Handel-C – Resource Utilization
Scheme | Resource | # Units | # Lines of Code, HDL | # Lines of Code, HLL | Coding Time | Exe. Time
SRC Scheme 2 | 97% | 5 | 1812 | 1127 | 1 student x 1 semester | 69 ms
Handel-C Scheme 2 | 99% | 9 | 1750 | 1400 | 1 student x 1/2 semester | 160 ms
Note: the bold numbers are already adjusted to reflect the added resource
Handel-C and SRC Scheme: Control Unit in HLLs
Cross-Comparison between schemes in SRC and Handel-C – # ECM Operations/sec
SRC Scheme 2: 72
Handel-C Scheme 2: 56
Software: 40
SRC Scheme 2 vs. Handel-C Scheme 2: 1.29x
SRC Scheme 2 vs. Software: 1.8x
Handel-C and SRC Scheme: Control Unit in HLLs
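The slides do not state how the operations-per-second figures were obtained. They are consistent with dividing the number of parallel ECM units by the execution time of one run, which is the assumption behind this sketch. The function name `ops_per_sec` is illustrative, and SRC Scheme 1 is assumed to use the optimized time of 36.611 ms rather than the rounded 38 ms.

```python
def ops_per_sec(units, exec_time_ms):
    """ECM operations per second for `units` ECM units running in parallel."""
    return units / (exec_time_ms / 1000.0)


# (number of ECM units, execution time in ms), taken from the result tables.
configs = {
    "SRC Scheme 1": (9, 36.611),
    "SRC Scheme 2": (5, 69.0),
    "Handel-C Scheme 2": (11, 160.0),
    "Handel-C Scheme 3": (10, 116.0),
    "Handel-C Scheme 2 (9 units)": (9, 160.0),   # adjusted figure from the cross-comparison
}
SOFTWARE = 40  # software baseline from the slides

throughput = {name: round(ops_per_sec(u, t)) for name, (u, t) in configs.items()}
for name, value in throughput.items():
    print(f"{name}: {value} ops/s, {value / SOFTWARE:.2f}x software")
```

Under this reading, throughput depends as much on how many units fit on the device as on the clock frequency each scheme reaches.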
Discussion of Results
[Figure: Efficiency (Low to High, 50% to 100%) plotted against Ease-of-use (Difficult to Easy, 0 to 100%) for VHDL, SRC 1, SRC 2, Handel-C 2 and Handel-C 3]
SRC 1: Entire ECM Unit in VHDL
SRC 2: Control Unit in Carte-C
Handel-C 2: Control Unit in Handel-C
Handel-C 3: Entire ECM Unit in Handel-C
Conclusions
VHDL still gives the best performance
SRC 1 to SRC 2:
  3.42x reduction in performance
  1/3x reduction in development time
Handel-C 2 to Handel-C 3:
  1.25x increase in performance
  1/2x reduction in development time
SRC 2 to Handel-C 2:
  1.29x reduction in performance
  1/2x reduction in development time
Future Work
Run the whole ECM project written in Handel-C using selected Celoxica boards
Develop equivalent implementations of ECM using other HLLs for RCs, such as Impulse-C, Mitrion-C and DSPLogic; then compare all the results with each other
Tweak the Handel-C ECM code to achieve a higher clock frequency and port it to run on SRC (as a hardware macro), Cray XD1, SGI, and other platforms that may not have a fixed running clock
Thank you!
Questions?