Modular Multiplication Algorithms for FPGAs
Mustafa Parlak
Outline
• What is an FPGA?
• FPGA vs. ASIC & Microprocessors
• FPGA Design Metrics
• FPGAs in Cryptography
• Adders: Basic operator of Modular Multiplications
• Modular Multiplications
– Interleaved Modular Multiplications
– Montgomery Modular Multiplications
• Comparison of Modular Multiplication algorithms
What is an FPGA
• FPGA = Field Programmable Gate Array
• A semiconductor IC that can be configured by the user (designer) after manufacturing
• Two-dimensional array of customizable logic blocks placed in an interconnect framework
• The user configures:
1. The function of each logic block
2. The interconnection between the logic blocks
• Can be programmed using a logic circuit diagram (schematic) or source code in VHDL or Verilog
What is an FPGA
• Logic blocks
– to implement combinational and sequential logic
• Interconnect
– wires to connect inputs and outputs to logic blocks
• I/O blocks
– special logic blocks at periphery of device for external connections
• Key questions:
– how to make logic blocks programmable?
– how to connect the wires?
– after the chip has been fabricated
FPGA Logic Blocks
• 4-input look-up table (LUT)
– implements combinational logic functions
• Register
– optionally stores output of LUT
[Figure: logic block. INPUTS feed a 4-input "look up table" (4-LUT) followed by a flip-flop (FF); a latch set by the configuration bit-stream selects the OUTPUT.]
FPGA Interconnect
LUTs (Look-Up Tables)
• A LUT contains memory cells to implement small logic functions
• Each cell holds ‘0’ or ‘1’
• Programmed with the outputs of a truth table
• Inputs select the content of one of the cells as output
[Figure: a 4-input LUT (inputs a, b, c, d; also usable as 16x1 RAM or 16-bit SR) with a mux and a flip-flop with clock, clock enable and set/reset. A 3-input LUT needs 8 memory cells; LUTs have 3 to 6 inputs, and a multiplexer (MUX) selects among static random access memory (SRAM) cells.]
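The LUT behaviour described above (memory cells loaded with a truth table, inputs selecting one cell) can be modelled in a few lines. This is only a behavioural sketch; `make_lut` and the majority-function example are illustrative names and data, not part of the slides.

```python
def make_lut(truth_table_outputs):
    """Return a function that models a k-input LUT.

    truth_table_outputs[i] is the stored bit for the input combination whose
    binary encoding is i (first input = most significant bit).
    """
    cells = list(truth_table_outputs)          # stands in for the SRAM cells
    k = len(cells).bit_length() - 1
    if len(cells) != 1 << k:
        raise ValueError("a k-input LUT needs exactly 2**k cells")

    def lut(*inputs):
        if len(inputs) != k:
            raise ValueError(f"expected {k} inputs")
        address = 0
        for bit in inputs:                     # inputs act as mux select lines
            address = (address << 1) | (bit & 1)
        return cells[address]

    return lut


# 3-input majority function: 8 memory cells, one per input combination.
majority = make_lut([0, 0, 0, 1, 0, 1, 1, 1])
print(majority(1, 0, 1))  # 1
```

Changing the logic function only means loading different cell contents, which is what the configuration bit-stream does on a real device.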
Configuring FPGA
• Millions of SRAM cells hold the LUTs and interconnect routing
• Volatile memory: loses its configuration when board power is turned off
• Keep the bit pattern describing the SRAM cells in non-volatile memory, e.g. Flash
• Configuration takes ~secs
[Figure: configuration data in and configuration data out through a chain of cells; legend: I/O pin/pad, SRAM cell]
Generic FPGA Design Flow
• Design Entry:
– Create your design files using:
• Schematic editor or
• Hardware description language (Verilog, VHDL)
• Design “implementation” on FPGA:
– Synthesis, partition, place, and route to create bit-stream file
• Design verification:
– Use simulator to check function,
– other software determines max clock frequency.
– Load onto FPGA device (cable connects PC to development board)
• Check operation at full speed in real environment.
FPGA vs. ASIC/Microprocessors
– ASIC gives high performance at cost of inflexibility.
– Processor is very flexible but not tuned to the application.
– Reconfigurable hardware is a nice compromise.
[Figure: spectrum from Microprocessor (Software) through Reconfigurable Hardware (Firmware) to ASIC (Hardware)]
FPGA vs. ASIC
FPGA
• Reconfigurable
• Low throughput
• Short design cycle
• Suitable for low volume production
– Low cost at small number
• High power
• High silicon area
– Utilization problem
• No testing cost
• Already fabricated
ASIC
• No reconfiguration
• High throughput
• Long design cycle
• Suitable for high volume production (> 1 million)
– Low cost at large number
• Low power
• Low silicon area
– Fully utilized
• High testing cost
• Needs to be fabricated
FPGA vs. Processors
FPGA
• Long design cycle
• Expensive
• High throughput
– (more than 20~100x)
Processor
• Short design cycle
• Cheap
• Low throughput
– Significantly slower
FPGA Based Applications
• Cryptography
• Network processors
• Evolvable and biologically-inspired hardware
• Rapid ASIC prototyping
• Real-time systems
• Embedded applications
• Custom-computing hardware
• Reconfigurable computing
• Special-purpose computation engines
– Hardware dedicated to solving one problem (or class of problems)
– Accelerators attached to general-purpose computers
FPGA Design Metrics
• Time Complexity
– Throughput is the amount of data processed per unit time (bits/sec)
– The higher the throughput of a design, the better its efficiency
• Area Complexity
– # of LUT, FF, RAM etc.
• Design metric combining time and area together
– Throughput/Area
– The ratio is higher for high throughput and small area
• Another important design metric: Power
Area-Speed optimization
Loop unrolling & pipelining
In general there is a trade-off between
• Speed
• Area
• Speed boosters
• Parallel execution
• Loop unrolling and pipelining
• In all cases area increases with increasing speed
Why FPGA?
• Flexibility from general purpose computing and speed from reconfigurable logic
• Due to the inherent fine-grained granularity, the parallelism tends to be very high
• Registers, latches and even distributed RAM blocks can be created and distributed wherever needed by the datapath
• The lack of a fixed architecture in an FPGA allows designers to tailor the design's datapath and control flow arbitrarily
• Highly regular and iterative applications with non-standard word lengths.
Why FPGA suits well in Cryptography
• Speed & real-time execution
– Encryption/decryption data rate up to 1 Gb/sec for IPsec crypto devices
• RNG integrity
– Ring-oscillator (RO) based RNG
• COMSEC Criteria
– Red-Black Separation.
– Harder to attack and break the cryptosystem running on FPGA as compared to GPPs
• The effectiveness of the FPGA’s cell structure for implementing bit-wise logical operations typical of many cryptographic algorithms
• The large amount of memory inside FPGA
– Eases the implementation of memory-intensive substitution operations
– Local storage working as a cache whenever needed
• Low power (as compared to GPP?)
Modular Multiplication Algorithms
• Why is modular multiplication important?
– Most common operation of
• RSA
• Finite field arithmetic
• DSA
• Diffie-Hellman key exchange
• ECC
• Modular multiplication algorithms in GF(p)
– Multiply and divide
• Naïve method
– Interleaved modular multiplication
• Multiplication and reduction are interleaved
– Montgomery modular multiplication
• Transformation and operations in residue domain
– Other algorithms
• Brickell’s method
• … etc
Adders: Basic Building Block of Multiplication
• A full adder (FA) is a combinational circuit with 3 inputs and two outputs
• Computes sum (S_i) and carry (C_{i+1}) for the next stage
• An FA is a one-bit adder. What happens if FAs are cascaded to make an n-bit adder?
§ Carry has to be propagated
§ Problem: propagation delay
§ Can we get rid of carry propagation or decrease it?
§ A number of methods have been proposed to implement addition efficiently
• Ripple Carry (obvious one)
• Carry Look Ahead
• Carry Save
• Delayed Carry
• Brent-Kung
• etc.
Ripple Carry and Carry Save Adders
Ripple Carry Adder
• Each FA receives Cin from the previous FA
• Advantages
– Sign detection is easy
• Disadvantage
– Delay is high
– Let the delay of an FA be T(FA)
– Delay of an n-bit adder is n*T(FA)
Carry-Save Adder
• Parallel ensemble of FAs
• Advantages
– Delay is constant and equal to that of one FA
• Disadvantages
– Adds three numbers and produces two
– Sign detection is hard
– Needs a conventional adder to get the final result
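A small bit-level model may help contrast the two adders. It is a sketch only: the function names and the operands 13, 7 and 9 are illustrative assumptions, and Python integers stand in for hardware bit vectors.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout


def ripple_carry_add(x, y, n):
    """Add two n-bit numbers with n chained full adders (the carry ripples)."""
    carry, result = 0, 0
    for i in range(n):                        # n sequential FA delays
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result | (carry << n)


def carry_save_add(x, y, z):
    """Reduce three numbers to a (sum, carry) pair; bit positions are independent."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1    # carries are saved, not propagated
    return s, c


s, c = carry_save_add(13, 7, 9)
total = ripple_carry_add(s, c, 6)             # conventional adder for the final result
print(s, c, total)  # 3 26 29
```

The carry-save step has no loop over bit positions, which mirrors the constant one-FA delay; the price is that the result stays in the redundant (sum, carry) form until a conventional addition resolves it.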
Other Adders and Comparison
Carry Look Ahead
• Improves speed by reducing carry propagation
Carry Delayed Adder
• Two-level carry save adder
Modular Addition
• Given A, B < P compute A + B (mod P)
1. Find Sʹ = A + B
2. If (Sʹ ≥ P)
3. S = Sʹ - P
4. else S = Sʹ
• Omura’s Method: an efficient method for computing the modular addition
– Useful for multi-operand modular addition
– Eliminates the need for subtraction
– For n-bit operands, this method always keeps the intermediate results within n bits; they never grow beyond that
– Whenever the sum exceeds n bits, the carry-out is ignored and a correction is performed.
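Omura’s Method
1. Compute the correction factor m = 2^n - P
2. First compute S' = A + B.
3. If there is a carry-out (nth bit), drop it and set S = S' + m, else S = S'.
Ex: Assume P = 39, so n = 6 and m = 2^6 - 39 = 25 = (011001)
A sum of 70 = (1000110) produces a carry-out; dropping it leaves 6, and 6 + 25 = 31.
We obtain the result as S = 31, which is 70 (mod 39)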
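The three steps can be sketched as follows. Assumptions to note: the operands 35 and 35 are chosen here only so that the sum is 70 as in the example, the loop repeats the correction in case adding m overflows again, and the returned value is only guaranteed to fit in n bits, so it may still need a final reduction below P.

```python
def omura_add(a, b, p, n):
    """Add a and b modulo p while keeping the result within n bits.

    Assumes 2**(n-1) < p < 2**n and 0 <= a, b < 2**n. The result is congruent
    to a + b (mod p) but is not necessarily smaller than p.
    """
    mask = (1 << n) - 1
    m = (1 << n) - p              # correction factor m = 2^n - P
    s = a + b
    while s >> n:                 # carry-out of the nth bit
        s = (s & mask) + m        # drop the carry (worth 2^n) and add m: net effect is -P
    return s


print(omura_add(35, 35, 39, 6))  # 31
```

Dropping the carry subtracts 2^n and adding m puts back 2^n - P, so each correction changes the value by exactly -P without a subtractor or a comparison against P.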
Interleaved Modular Multiplication
At most two subtractions are needed to reduce the partial product
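A minimal sketch of the interleaved scheme, assuming the usual formulation that scans the multiplier from its most significant bit (the original listing may differ in detail). The function name and argument order are illustrative.

```python
def interleaved_modmul(a, b, p, n):
    """Compute a*b mod p by scanning the n bits of b, most significant first.

    Assumes 0 <= a, b < p < 2**n.
    """
    r = 0
    for i in reversed(range(n)):
        r = 2 * r + a * ((b >> i) & 1)   # shift the partial product, add a if the bit is set
        if r >= p:                       # r < 3p at this point, so at most
            r -= p                       # two subtractions bring it
        if r >= p:                       # back below p
            r -= p
    return r


print(interleaved_modmul(17, 23, 39, 6) == (17 * 23) % 39)  # True
```

Each iteration costs one addition plus up to two compare-and-subtract steps, which is the cost the later variants try to remove.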
Interleaved with Omura’s Method
Observations with the standard interleaved method
• 3 additions (or subtractions) per iteration
• Two comparisons and two reductions per iteration
• Partial addition result goes beyond n bits
• Use Omura’s method to get rid of subtractions and comparisons
Advantages
• Comparisons and subtractions eliminated
• Partial product R never grows beyond n bits
Disadvantages
• Pre-computation increases execution time
• Still 3 additions per iteration
• Extra memory for storing M
• One final correction subtraction may be required
Interleaved with Pre-computation
• A clever method to reduce 3 additions/reductions to 1 addition:
– Idea: the reduction of the ith iteration can be calculated and made ready for the next, (i+1)th, iteration (correction step)
– The correction can be added to the intermediate product of the next iteration
– Instead of reducing with P, reduce with 2^n, which is selecting the n least significant bits
– These possible correction values can be pre-computed before multiplication starts and stored in a look-up table
• At the ith iteration, assume the partial product is calculated as R = A·B_i + 2R and is ready for the next iteration.
• The partial product R may grow by only 2 more bits, from n to n+2, as R = (R_{n+1} R_n R_{n-1} … R_0)
• Assume that R grows by only 1 bit, R = (R_n R_{n-1} … R_0).
– Now R is n+1 bits long
Interleaved with Pre-computation
• Instead of reducing R to P, reduce it to 2^n.
– Rʹ = R (mod 2^n) selects the n least significant bits of R. Then Rʹ = R - 2^n is ready for the (i+1)th iteration
– Add a correction factor at the next, (i+1)th, iteration to restore the same partial product as in the original interleaved algorithm
• At the (i+1)th iteration, assume B_{i+1} = 0
– The original interleaved algorithm finds Rʹʹ = A·B_{i+1} + 2R (mod P) = 0 + 2R (mod P) = 2R (mod P)
• Verification
– Shift left (doubles the partial product): Rʹʹ = 2Rʹ = 2(R - 2^n) = 2R - 2^(n+1)
– Reduce the partial product by adding 2^(n+1) (mod P) (correction factor)
– Rʹʹ = 2R - 2^(n+1) + 2^(n+1) (mod P) = 2R (mod P), which is the desired result for the (i+1)th iteration.
• Only a few possible correction factors may occur.
– 0, B, 2^(n+1) (mod P), B + 2^(n+1) (mod P), 2^(n+2) (mod P), B + 2^(n+2) (mod P)
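The correction idea can be checked with a short model. This is a sketch under stated assumptions rather than the datapath of the slides: it follows the recurrence R = A·B_i + 2R (so the addend is A, whereas the list above names it B), it tabulates corrections for every overflow value 0 to 3 rather than only the listed ones, and it folds the final restore and reduction into one step using Python's `%`.

```python
def interleaved_precomp_modmul(a, b, p, n):
    """a*b mod p with one table-driven addition per iteration (sketch).

    Assumes 0 <= a, b < p < 2**n. The partial product is kept in n bits; the
    bits that overflow past bit n-1 are remembered and compensated in the
    next iteration by a pre-computed correction.
    """
    mask = (1 << n) - 1
    # table[ov][bit] = bit*a + (ov * 2^(n+1) mod p), computed before the loop
    table = [[((ov << (n + 1)) % p) + bit * a for bit in (0, 1)]
             for ov in range(4)]
    r, ov = 0, 0
    for i in reversed(range(n)):
        t = 2 * r + table[ov][(b >> i) & 1]   # the single addition of the iteration
        r, ov = t & mask, t >> n              # reduce modulo 2^n, remember the overflow
    t = r + (ov << n)                         # restore the bits dropped in the last step
    return t % p                              # stands in for the final compare/subtract


print(interleaved_precomp_modmul(17, 23, 39, 6) == (17 * 23) % 39)  # True
```

The invariant is that r + ov·2^n stays congruent to the true partial product modulo P: doubling turns the dropped ov·2^n into ov·2^(n+1), which is exactly what the table entry adds back modulo P.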
Interleaved with Pre-computation
• Advantages
– One addition in each iteration
– Almost 2x increase in speed
• Disadvantages
– Requires pre-computation (breaks the regularity)
– Requires one extra iteration
– Requires extra local storage (4x operand bit length = 4 x 2n)
• Ex: 2048-bit RSA modular multiplication (4 x 2048 = 8 kbit)
– Comparison and subtraction at the end
Interleaved with Pre-computation Datapath
Interleaved with CSA Utilization
• The conventional adder is replaced with a CSA (redundant representation)
• Reduction to M in the interleaved algorithm is replaced with reduction with 2^n
• Afterwards, the value of k*2^n (mod M) is added in order to reconstruct the correct intermediate result at the next iteration
• At the end S, C are added to find the correct result
Interleaved with CSA Utilization
Advantages
• Two additions per iteration (?)
• Additions in constant time (no carry propagation)
Disadvantages
• The result is in redundant form (C, S), which has to be resolved with a conventional adder (one more adder)
• Calculation of A is not straightforward and needs subtractions and comparisons
• Needs more storage to save S, C instead of one value
• Datapath requires more logic
• Complex FSM and address generation
Interleaved with CSA Utilization and Pre-computation
• The problems mentioned make the algorithm infeasible
• The same pre-computation idea is applied
– The intermediate result I has only two possible values (0, Y)
– In the correction phase, A also has only a few possible values
– These two can be combined as 2A + I, pre-computed and stored
Interleaved with CSA Utilization and Pre-computation
Advantages
– Only one addition per iteration, in constant time
– No comparison and reduction
Disadvantages
– Requires pre-computation (breaks the regularity)
– Requires one extra iteration
– Requires extra storage (6x operand bit length)
• Ex: 2048-bit RSA modular multiplication
• 6 x 2048 = 12 kbit local storage
– At the end of iterations
• Requires a conventional adder to calculate (C + S)
• May require one extra reduction (subtraction)
– Requires 3-operand memory bandwidth per cycle
Montgomery Modular Multiplication
• In 1985, P. L. Montgomery introduced an efficient algorithm for computing A·B (mod P)
• It performs modulo reduction without division
• The algorithm replaces the division-by-P operation with division by a power of 2
– Well suited to computer systems because division by a power of 2 is simply a shift operation
• Define a P-residue to be a residue class modulo P.
– Given A, B as n-bit operands: Aʹ = A·R (mod P), Bʹ = B·R (mod P)
• Select R co-prime to P. The natural choice is R = 2^n, where n is the operand size.
• Montgomery multiplication computes
– MonPro(A, B) = A·B·R^-1 (mod P)
• Given Aʹ = A·R (mod P) and B
– MonPro(Aʹ, B) = Aʹ·B·R^-1 (mod P) = A·R·B·R^-1 (mod P) = A·B (mod P)
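The identities above can be checked numerically with a reference model. This sketch computes R^-1 directly with Python's three-argument `pow` (negative exponents need Python 3.8 or later), which is not how the hardware algorithm works; it only illustrates the residue-domain bookkeeping. The values P = 39, A = 17, B = 23 are arbitrary illustrative choices.

```python
def mon_pro(x, y, p, r):
    """Reference model of MonPro(x, y) = x*y*R^-1 mod p (not the fast algorithm)."""
    r_inv = pow(r, -1, p)            # modular inverse; requires gcd(r, p) == 1
    return (x * y * r_inv) % p


p, n = 39, 6                         # odd modulus, so R = 2^n is co-prime to it
r = 1 << n
a, b = 17, 23
a_res = (a * r) % p                  # A' = A*R mod P
b_res = (b * r) % p                  # B' = B*R mod P

# One operand in the residue domain: the factor R cancels against R^-1.
print(mon_pro(a_res, b, p, r) == (a * b) % p)                  # True
# Both operands in the residue domain: the result is the residue of A*B.
print(mon_pro(a_res, b_res, p, r) == (a * b * r) % p)          # True
# Multiplying a residue by 1 converts it back to the ordinary domain.
print(mon_pro(mon_pro(a_res, b_res, p, r), 1, p, r) == (a * b) % p)  # True
```

Because products of residues stay in the residue domain, the conversion cost is paid once per operand and amortised over a chain of multiplications, as in modular exponentiation.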
Binary Montgomery Modular Multiplication
• A, B, P are n-bit numbers (A, B, P < 2^n)
• Let A = (A_{n-1} A_{n-2} ··· A_0) be the binary representation of A.
• Choose R = 2^n
• MonPro(A, B) = A·B·2^-n (mod P)
• Start from the least significant bit, and obtain the following binary add-shift algorithm to compute T = A·B·2^-n
Binary Montgomery Modular Multiplication
• We are interested in T = A·B·2^-n (mod P), not T = A·B·2^-n
• Reduce the partial product T in each iteration
– If T is even then
• T/2 (mod P) = T/2
• Reduce by just right shifting by one bit
– If T is odd then T + P must be even (P is odd)
• We know T < P
• T (mod P) = T + P (mod P)
• (T + P) < 2P => (T + P)/2 < P
• Result is already reduced modulo P
• Reduce by adding P and then right shifting by one bit
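The add-shift loop with the even/odd rule can be sketched as below. It assumes an odd modulus and operands already smaller than P; the function name is illustrative, and the partial product is allowed to reach 2P between iterations, which is why a single subtraction remains at the end.

```python
def binary_mon_pro(a, b, p, n):
    """Bit-serial MonPro(a, b) = a*b*2^-n mod p for odd p.

    Assumes 0 <= a, b < p < 2**n.
    """
    t = 0
    for i in range(n):                  # least significant bit of a first
        t += ((a >> i) & 1) * b         # add b when bit A_i is set
        if t & 1:                       # odd: adding p makes it even
            t += p                      # without changing t modulo p
        t >>= 1                         # exact division by 2
    if t >= p:                          # t < 2p here, one subtraction suffices
        t -= p
    return t


p, n = 39, 6
expected = (17 * 23 * pow(1 << n, -1, p)) % p
print(binary_mon_pro(17, 23, p, n) == expected)  # True
```

The only decision inside the loop is a test of the least significant bit, in contrast to the full-width comparisons against P in the interleaved method.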
Binary Montgomery Modular Multiplication
Advantages
• On average more than one addition for each iteration
• Only a one-bit comparison is performed to decide the P addition
Disadvantages
• One extra subtraction is needed at the end
• Requires conversion to the residue domain
• Not a big problem if multiple multiplications are required for the same modulus
Montgomery Multiplication with Pre-computation
• Before computing the partial product it is known that one of 0, P, B, B+P needs to be added.
• The following truth table shows what is to be added

R_0  A_i  B_0  Precomp
0    0    0    0
0    0    1    0
0    1    0    B
0    1    1    B+P
1    0    0    P
1    0    1    P
1    1    0    B+P
1    1    1    B
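The truth table translates directly into a look-up indexed by (R_0, A_i, B_0). A minimal sketch, assuming an odd modulus and operands below P; the list `addend` and the function name are illustrative, and the variable `t` plays the role of the partial product R.

```python
def mon_pro_precomp(a, b, p, n):
    """MonPro(a, b) = a*b*2^-n mod p with one table-selected addition per bit.

    Assumes p is odd and 0 <= a, b < p < 2**n.
    """
    bp = b + p                               # pre-computed once, before the loop
    # Index is the 3-bit number (R_0, A_i, B_0), in the row order of the table.
    addend = [0, 0, b, bp, p, p, bp, b]
    b0 = b & 1
    t = 0
    for i in range(n):
        ai = (a >> i) & 1
        idx = ((t & 1) << 2) | (ai << 1) | b0
        t = (t + addend[idx]) >> 1           # the sum is always even, so the shift is exact
    if t >= p:                               # single final subtraction
        t -= p
    return t


p, n = 39, 6
expected = (17 * 23 * pow(1 << n, -1, p)) % p
print(mon_pro_precomp(17, 23, p, n) == expected)  # True
```

Two of the eight rows select 0, so on random operands roughly a quarter of the iterations need no addition at all.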
Montgomery with Pre-computation
Advantages
• Less than one addition per iteration
– Latency decreased
• Simpler datapath
Disadvantages
• Storage is required to save B+P
• B+P has to be calculated before iterations start.
• Slightly more complex loop control compared to simple Montgomery multiplication
– Negligible
Montgomery with CSA utilization
Advantages
• Additions are done by a CSA, which has 1 FA delay
– Improves operating frequency
• Almost one addition per iteration
Disadvantages
• Memory bandwidth is 3 operands per cycle (C, S, I)
• Requires 1 extra iteration to restore the result
• Storage increases
– X, Y, P, Y+P, C, S need to be stored
• Complex datapath (2x larger because of redundant representation {C, S})
– Conventional adder needed to get C + S
• Directly affects operating frequency (think of RCA: n*FA delay)
• Result of the conventional addition needs to be reduced (final reduction)
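A behavioural sketch of the carry-save variant, combining the pre-computed addend selection with a redundant (S, C) partial product. It is an assumption-laden model, not the slide's datapath: it does not model the extra iteration mentioned above, it uses B where the slide's storage list uses Y, and Python's `+` stands in for the final conventional adder.

```python
def carry_save(x, y, z):
    """3-to-2 compression: x + y + z == s + c, with no carry propagation."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def mon_pro_csa(a, b, p, n):
    """MonPro(a, b) = a*b*2^-n mod p with the partial product kept as (S, C).

    Assumes p is odd and 0 <= a, b < p < 2**n.
    """
    bp = b + p
    addend = [0, 0, b, bp, p, p, bp, b]      # indexed by (R_0, A_i, B_0)
    b0 = b & 1
    s, c = 0, 0
    for i in range(n):
        ai = (a >> i) & 1
        r0 = (s ^ c) & 1                     # least significant bit of S + C
        s, c = carry_save(s, c, addend[(r0 << 2) | (ai << 1) | b0])
        # C has a zero LSB after compression and S + C is even, so S is even too:
        s, c = s >> 1, c >> 1                # halving both words is exact
    t = s + c                                # conventional (carry-propagating) addition
    if t >= p:                               # final reduction
        t -= p
    return t


p, n = 39, 6
expected = (17 * 23 * pow(1 << n, -1, p)) % p
print(mon_pro_csa(17, 23, p, n) == expected)  # True
```

Inside the loop every operation is bitwise, so the per-iteration delay does not depend on n; the carry-propagating addition appears only once, after the last iteration.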
Comparisons of MM Algorithms
Algorithm | # of additions per iteration | # of adders | Storage needed
Interleaved | Greater than 2 | 1 | 3x operand length
Interleaved with Pre-computation | Slightly greater than 1 (one extra iteration) | 1 | 7x operand length
Interleaved with CSA | Slightly greater than 1 (one extra iteration) | 2 (1 CSA, 1 RCA); complex datapath (redundant rep) | 9x operand length
Montgomery | Greater than 1, less than 1.5 | 1 | 3x operand length
Montgomery with Pre-computation | Less than 1 | 1 | 4x operand length
Montgomery with CSA | Slightly greater than 1 (one extra iteration) | 2 (1 CSA, 1 RCA); complex datapath (redundant rep) | 4x operand length