Modular Multiplication Algorithms for FPGAs

Mustafa Parlak

Outline
• What is an FPGA?
• FPGA vs. ASIC & Microprocessors
• FPGA Design Metrics
• FPGAs in Cryptography
• Adders: Basic operator of Modular Multiplications
• Modular Multiplications
  – Interleaved Modular Multiplications
  – Montgomery Modular Multiplications
• Comparison of Modular Multiplication algorithms

What is an FPGA?
• FPGA = Field Programmable Gate Array
• A semiconductor IC that can be configured by the user (designer) after manufacturing
• Two-dimensional array of customizable logic blocks placed in an interconnect framework
• The user configures:
  1. The function of each logic block
  2. The interconnection between the logic blocks
• Can be programmed using a logic circuit diagram (schematic) or source code in VHDL or Verilog

What is an FPGA?
• Logic blocks
  – to implement combinational and sequential logic
• Interconnect
  – wires to connect inputs and outputs to logic blocks
• I/O blocks
  – special logic blocks at the periphery of the device for external connections
• Key questions:
  – how to make logic blocks programmable?
  – how to connect the wires?
  – after the chip has been fabricated

FPGA Logic Blocks
• 4-input lookup table (LUT)
  – implements combinational logic functions
• Register
  – optionally stores the output of the LUT

[Figure: logic block containing a 4-input look-up table (4-LUT) and a flip-flop (FF), with INPUTS, OUTPUT, and a latch set by the configuration bit-stream.]

FPGA Interconnect

LUTs (Look-Up Tables)
• A LUT contains memory cells to implement small logic functions
• Each cell holds ‘0’ or ‘1’
• Programmed with the outputs of a truth table
• Inputs select the content of one of the cells as output
• 3 to 6 inputs; a 3-input LUT has 8 memory cells

[Figure labels: 4-input LUT, 16x1 RAM, 16-bit SR, flip-flop with clock enable and set/reset; multiplexer (MUX) selecting among Static Random Access Memory (SRAM) cells.]
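The look-up behaviour described on this slide can be modelled in a few lines of Python. This is only a software sketch: the names `program_lut` and `read_lut` and the majority function used as an example are illustrative assumptions, not part of the slides.

```python
from itertools import product

def program_lut(func, k):
    """Fill the 2**k memory cells of a k-input LUT with the truth table of func."""
    return [func(*bits) & 1 for bits in product((0, 1), repeat=k)]

def read_lut(cells, *inputs):
    """The inputs act as an address (first input = most significant bit)
    that selects the content of one memory cell as the output."""
    address = 0
    for bit in inputs:
        address = (address << 1) | bit
    return cells[address]

# A 3-input LUT needs 2**3 = 8 memory cells; here it is programmed
# with a (hypothetical) majority function.
majority = program_lut(lambda a, b, c: (a & b) | (a & c) | (b & c), 3)
print(majority)                     # [0, 0, 0, 1, 0, 1, 1, 1]
print(read_lut(majority, 1, 0, 1))  # 1
```

Reprogramming the same cells with a different truth table changes the logic function without changing the structure, which is the point of the configurable logic block.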

Configuring an FPGA
• Millions of SRAM cells hold the LUT contents and the interconnect routing
• Volatile memory: loses its configuration when board power is turned off
• Keep the bit pattern describing the SRAM cells in non-volatile memory, e.g. Flash
• Configuration takes on the order of seconds

[Figure labels: configuration data in, configuration data out, I/O pin/pad, SRAM cell.]

Generic FPGA Design Flow
• Design entry:
  – Create your design files using:
    • Schematic editor, or
    • Hardware description language (Verilog, VHDL)
• Design “implementation” on FPGA:
  – Synthesis, partition, place, and route to create the bit-stream file
• Design verification:
  – Use a simulator to check function
  – Other software determines the maximum clock frequency
  – Load onto the FPGA device (a cable connects the PC to the development board)
    • Check operation at full speed in the real environment

FPGA vs. ASIC/Microprocessors
– An ASIC gives high performance at the cost of inflexibility.
– A processor is very flexible but not tuned to the application.
– Reconfigurable hardware is a nice compromise.

[Figure labels: Microprocessor, Reconfigurable Hardware; Software, Firmware, Hardware.]

FPGA vs. ASIC

FPGA
• Reconfigurable
• Low throughput
• Short design cycle
• Suitable for low-volume production
  – Low cost at small numbers
• High power
• High silicon area
  – Utilization problem
• No testing cost
• Already fabricated

ASIC
• No reconfiguration
• High throughput
• Long design cycle
• Suitable for high-volume production (> 1 million)
  – Low cost at large numbers
• Low power
• Low silicon area
  – Fully utilized
• High testing cost
• Needs to be fabricated

FPGA vs. Processors

FPGA
• Long design cycle
• Expensive
• High throughput
  – (more than 20 to 100x)

Processor
• Short design cycle
• Cheap
• Low throughput
  – Significantly slower

FPGA-Based Applications
• Cryptography
• Network processors
• Evolvable and biologically-inspired hardware
• Rapid ASIC prototyping
• Real-time systems
• Embedded applications
• Custom-computing hardware
• Reconfigurable computing
• Special-purpose computation engines
  – Hardware dedicated to solving one problem (or class of problems)
  – Accelerators attached to general-purpose computers

FPGA Design Metrics
• Time complexity
  – Throughput is the amount of data processed per unit time (bits/sec)
  – The higher the throughput of a design, the better its efficiency
• Area complexity
  – Number of LUTs, FFs, RAMs, etc.
• Design metric combining time and area
  – Throughput/Area
  – The ratio is higher for high throughput and small area
• Another important design metric: power

Area-Speed Optimization

[Figure: loop unrolling & pipelining]

• In general there is a trade-off between
  – Speed
  – Area
• Speed boosters
  – Parallel execution
  – Loop unrolling and pipelining
• In all cases, area increases with increasing speed

Why FPGA?
• Flexibility from general-purpose computing and speed from reconfigurable logic
• Due to the inherent fine-grained granularity, the parallelism tends to be very high
• Registers, latches, and even distributed RAM blocks can be created and placed wherever the datapath needs them
• The lack of a fixed architecture allows designers to tailor the design's datapath and control flow arbitrarily
• Well suited to highly regular and iterative applications with non-standard word lengths

Why FPGAs Suit Cryptography Well
• Speed & real-time execution
  – Encryption/decryption data rate up to 1 Gb/sec for IPsec crypto devices
• RNG integrity
  – RO (ring oscillator) based RNG
• COMSEC criteria
  – Red-Black separation
  – Harder to attack and break a cryptosystem running on an FPGA as compared to general-purpose processors (GPPs)
• The effectiveness of the FPGA's cell structure for implementing the bit-wise logical operations typical of many cryptographic algorithms
• The large amount of memory inside the FPGA
  – Eases the implementation of memory-intensive substitution operations
  – Local storage working as a cache whenever needed
• Low power (as compared to GPP?)

Modular Multiplication Algorithms
• Why is modular multiplication important?
  – Most common operation of
    • RSA
    • Finite field arithmetic
    • DSA
    • Diffie-Hellman key exchange
    • ECC
• Modular multiplication algorithms in GF(p)
  – Multiply and divide
    • Naïve method
  – Interleaved modular multiplication
    • Multiplication and reduction are interleaved
  – Montgomery modular multiplication
    • Transformation and operations in the residue domain
  – Other algorithms
    • Brickell's method
    • … etc.

Adders: Basic Building Block of Multiplication
• A full adder (FA) is a combinational circuit with three inputs and two outputs
• Computes the sum ($S_i$) and the carry ($C_{i+1}$) for the next stage
• An FA is a one-bit adder. What happens if FAs are cascaded to make an n-bit adder?
  § The carry has to be propagated
  § Problem: propagation delay
  § Can we get rid of carry propagation or decrease it?
  § A number of methods have been proposed to implement addition efficiently
    • Ripple carry (the obvious one)
    • Carry look-ahead
    • Carry save
    • Delayed carry
    • Brent-Kung
    • etc.
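A bit-level Python model can make the carry-propagation problem concrete. This is a sketch only: `full_adder` and `ripple_carry_add` are illustrative names, and the loop models in software what is a chain of n cascaded full adders in hardware.

```python
def full_adder(a, b, c_in):
    """One-bit full adder: three input bits, returns (sum bit, carry out)."""
    s = a ^ b ^ c_in
    c_out = (a & b) | (a & c_in) | (b & c_in)
    return s, c_out

def ripple_carry_add(x, y, n):
    """Add two n-bit integers with n cascaded full adders.
    Each stage must wait for the carry of the previous stage."""
    carry, result = 0, 0
    for i in range(n):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry  # n-bit sum and the final carry-out

print(ripple_carry_add(45, 25, 6))  # (6, 1): 45 + 25 = 70 = 64 + 6
```

The sequential dependence on `carry` inside the loop is what the faster adder structures listed above try to shorten or remove.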

Ripple Carry and Carry-Save Adders

Ripple Carry Adder
• Each FA receives $C_{in}$ from the previous FA
• Advantages
  – Sign detection is easy
• Disadvantages
  – Delay is high
  – Let the delay of an FA be T(FA)
  – The delay of an n-bit adder is n·T(FA)

Carry-Save Adder
• Parallel ensemble of FAs
• Advantages
  – Delay is constant and equal to one FA delay
• Disadvantages
  – Adds three numbers and produces two
  – Sign detection is hard
  – Needs a conventional adder to get the final result
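The carry-save idea can be sketched with word-level bit operations in Python, where every bit position behaves as an independent full adder. The function name `carry_save_add` and the operand values are assumptions chosen for illustration.

```python
def carry_save_add(x, y, z):
    """3:2 compression: three operands in, a (sum, carry) pair out.
    No carry travels between bit positions, so the delay is one FA."""
    sum_word = x ^ y ^ z                             # per-bit sum outputs
    carry_word = ((x & y) | (x & z) | (y & z)) << 1  # per-bit carries, shifted
    return sum_word, carry_word

s, c = carry_save_add(45, 25, 7)
print(s, c)   # 51 26
print(s + c)  # 77, the final conventional addition resolves the result
```

The pair `(s, c)` is the redundant representation: its value is only known after the final conventional addition, which is why sign detection and comparison are hard in this form.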

Other Adders and Comparison

Carry Look-Ahead
• Improves speed by reducing carry propagation

Carry Delayed Adder
• Two-level carry save

Modular Addition
• Given $A, B < P$, compute $A + B \pmod P$
  1. Find $S' = A + B$
  2. If ($S' \ge P$)
  3.   $S = S' - P$
  4. else $S = S'$
• Omura's method: an efficient method for computing the modular addition
  – Useful for multi-operand modular addition
  – Eliminates the need for subtraction
  – For n-bit operands, this method always keeps the intermediate results within n bits; they never grow beyond that
  – Whenever a result exceeds n bits, the carry-out is ignored and a correction is performed
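The four-step compare-and-subtract addition listed first on this slide can be sketched in Python as follows. The function name `mod_add` is only illustrative, and the comparison is written as `>=` so that a sum equal to P reduces to 0.

```python
def mod_add(a, b, p):
    """Modular addition for operands already reduced (a, b < p)."""
    s = a + b          # step 1: S' = A + B
    if s >= p:         # step 2: one comparison
        return s - p   # step 3: one subtraction
    return s           # step 4: no reduction needed

print(mod_add(35, 35, 39))  # 31
print(mod_add(20, 19, 39))  # 0
```

The comparison and the subtraction are the two operations that Omura's method sets out to avoid.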

Omura's Method
1. Compute the correction factor $m = 2^n - P$
2. First compute $S' = A + B$
3. If there is a carry-out (the n-th bit), drop it and set $S = S' + m$; else $S = S'$

Ex: Assume $P = 39$, so $m = 2^6 - 39 = 25 = (011001)_2$.
We obtain the result $S = 31$, which is $70 \pmod{39}$.
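The three steps can be sketched in Python as below. The function name `omura_add` is illustrative, and the operands 35 and 35 are an assumption: the slide only states that the sum is 70, not which operands produce it.

```python
def omura_add(a, b, p, n):
    """Omura-style addition for a, b < p < 2**n.
    The result stays within n bits and is congruent to a + b modulo p,
    but it is not necessarily fully reduced below p."""
    mask = (1 << n) - 1
    m = (1 << n) - p        # step 1: correction factor m = 2^n - P
    s = a + b               # step 2: S' = A + B
    if s >> n:              # step 3: carry-out of the n-bit adder?
        s = (s & mask) + m  # drop the carry and add the correction
    return s

print(omura_add(35, 35, 39, 6))  # 31, i.e. 70 mod 39
print(omura_add(20, 30, 39, 6))  # 50: no carry-out, congruent to 11 mod 39
```

Dropping the carry subtracts $2^n$ and adding m puts back $2^n - P$, so the net effect is a subtraction of P carried out with an addition only. The second call shows why a final correction may still be needed at the end of a longer computation.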

Interleaved Modular Multiplication

At most two subtractions are needed to reduce the partial product
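A minimal Python sketch of the standard interleaved (shift-add-reduce) scheme is given below. It assumes the usual textbook form, scanning the bits of B from the most significant end with the partial product $R = 2R + A \cdot B_i$; the function name and operand values are illustrative.

```python
def interleaved_modmul(a, b, p, n):
    """Compute a*b mod p for a, b < p < 2**n, reducing after every step."""
    r = 0
    for i in reversed(range(n)):       # scan B from MSB to LSB
        r = 2 * r + a * ((b >> i) & 1)  # shift, then conditionally add A
        # With r < p and a < p before this step, the new r is below 3p,
        # so at most two subtractions bring it back under p.
        if r >= p:
            r -= p
        if r >= p:
            r -= p
    return r

print(interleaved_modmul(35, 27, 39, 6))  # 9, since 35 * 27 = 945 = 24 * 39 + 9
```

Each iteration therefore costs up to three additions or subtractions and two comparisons, which is the overhead the later variants try to remove.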

Interleaved with Omura's Method

Observations on the standard interleaved method
• 3 additions (or subtractions) per iteration
• Two comparisons and two reductions per iteration
• The partial addition result goes beyond n bits
• Use Omura's method to get rid of subtractions and comparisons

Advantages
• Comparisons and subtractions eliminated
• Partial product R never grows beyond n bits

Disadvantages
• Pre-computation increases execution time
• Still 3 additions per iteration
• Extra memory for storing the correction factor m
• One final correction subtraction may be required

Interleaved with Pre-computation
• A clever method to reduce 3 additions/reductions to 1 addition:
  – Idea: the reduction of the i-th iteration can be calculated and made ready for the next, (i+1)-th, iteration (correction step)
  – The correction can be added to the intermediate product of the next iteration
  – Instead of reducing with P, reduce with $2^n$, which amounts to selecting the n least significant bits
  – The possible correction values can be pre-computed before the multiplication starts and stored in a look-up table
• At the i-th iteration, assume the partial product $R = A \cdot B_i + 2R$ is calculated and ready for the next iteration
• The partial product R may grow by at most 2 bits, from n to n+2 bits: $R = (R_{n+1} R_n R_{n-1} \dots R_0)$
• Assume that R grows by only 1 bit: $R = (R_n R_{n-1} \dots R_0)$
  – Now R is n+1 bits long

Interleaved with Pre-computation
• Instead of reducing R modulo P, reduce it modulo $2^n$
  – $R' = R \bmod 2^n$ selects the n least significant bits of R. Then $R' = R - 2^n$ is ready for the (i+1)-th iteration
  – Add a correction factor at the next, (i+1)-th, iteration to restore the same partial product as in the original interleaved algorithm
• At the (i+1)-th iteration, assume $B_{i+1} = 0$
  – The original interleaved algorithm finds $R'' = A \cdot B_{i+1} + 2R \pmod P = 0 + 2R \pmod P = 2R \pmod P$
• Verification
  – Shift left (doubles the partial product): $R'' = 2R' = 2(R - 2^n) = 2R - 2^{n+1}$
  – Reduce the partial product by adding $2^{n+1} \pmod P$ (the correction factor)
  – $R'' = 2R - 2^{n+1} + (2^{n+1} \bmod P) \equiv 2R \pmod P$, which is the desired result for the (i+1)-th iteration
• Only a few possible correction factors may occur:
  – $0$, $B$, $2^{n+1} \pmod P$, $B + 2^{n+1} \pmod P$, $2^{n+2} \pmod P$, $B + 2^{n+2} \pmod P$
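The verification step can be checked numerically with a short Python sketch. It covers only the case discussed on this slide (R has grown by one bit and the next multiplier bit is 0); the function name `corrected_double` and the values P = 39, n = 6 are assumptions for illustration.

```python
def corrected_double(r, p, n):
    """One step of the truncate-then-correct idea for an (n+1)-bit r
    whose top bit is set, with the next multiplier bit equal to 0."""
    r_trunc = r & ((1 << n) - 1)     # R' = R mod 2^n = R - 2^n
    correction = (1 << (n + 1)) % p  # pre-computed value 2^(n+1) mod P
    return 2 * r_trunc + correction  # shift left, then add the correction

p, n = 39, 6
r = 70                               # 7-bit partial product, top bit set
print(corrected_double(r, p, n))     # 23
print((2 * r) % p)                   # 23, the value the original algorithm reduces to
```

In general the corrected value is only congruent to $2R$ modulo P rather than equal to the fully reduced result, which is consistent with the final comparison and subtraction that this variant still needs at the end.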

Interleaved with Pre-computation
• Advantages
  – One addition in each iteration
  – Almost 2x increase in speed
• Disadvantages
  – Requires pre-computation (breaks the regularity)
  – Requires one extra iteration
  – Requires extra local storage (4 × operand bit length = 4n bits)
    • Ex: 2048-bit RSA modular multiplication (4 × 2048 = 8 kbit)
  – Comparison and subtraction at the end

Interleaved with Pre-computation: Datapath

Interleaved with CSA Utilization
• The conventional adder is replaced with a CSA (redundant representation)
• Reduction modulo M in the interleaved algorithm is replaced with reduction modulo $2^n$
• Afterwards, the value of $k \cdot 2^n \pmod M$ is added in order to reconstruct the correct intermediate result at the next iteration
• At the end, S and C are added to find the correct result

Interleaved with CSA Utilization

Advantages
• Two additions per iteration (?)
• Additions in constant time (no carry propagation)

Disadvantages
• The result is in redundant form (C, S), which has to be resolved with a conventional adder (one more adder)
• Calculation of A is not straightforward and needs subtractions and comparisons
• Needs more storage to save S and C instead of one value
• Datapath requires more logic
• Complex FSM and address generation

Interleaved with CSA Utilization and Pre-computation
• The problems mentioned make the algorithm infeasible
• The same pre-computation idea is applied
  – The intermediate result I has only two possible values (0, Y)
  – In the correction phase, A also has only a few possible values
  – These two can be combined as 2A + I, pre-computed, and stored

Interleaved with CSA Utilization and Pre-computation

Advantages
– Only one addition per iteration, in constant time
– No comparison and reduction

Disadvantages
– Requires pre-computation (breaks the regularity)
– Requires one extra iteration
– Requires extra storage (6 × operand bit length)
  • Ex: 2048-bit RSA modular multiplication
  • 6 × 2048 = 12 kbit local storage
– At the end of the iterations
  • Requires a conventional adder to calculate (C + S)
  • May require one extra reduction (subtraction)
– Requires 3-operand memory bandwidth per cycle

Montgomery Modular Multiplication
• In 1985, P. L. Montgomery introduced an efficient algorithm for computing $A \cdot B \pmod P$
• It performs modulo reduction without division
• The algorithm replaces division by P with division by a power of 2
  – Well suited to computer systems because division by a power of 2 is simply a shift operation
• Define a P-residue to be a residue class modulo P
  – Given A, B as n-bit operands: $A' = A \cdot R \pmod P$, $B' = B \cdot R \pmod P$
• Select R co-prime to P. The natural choice is R matched to the operand size ($R = 2^n$)
• Montgomery multiplication computes
  – $\mathrm{MonPro}(A, B) = A \cdot B \cdot R^{-1} \pmod P$
• Given $A' = A \cdot R \pmod P$ and $B$
  – $\mathrm{MonPro}(A', B) = A' \cdot B \cdot R^{-1} \pmod P = A \cdot R \cdot B \cdot R^{-1} \pmod P = A \cdot B \pmod P$
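The last identity can be checked numerically. The sketch below is a reference model only: it computes $R^{-1}$ directly with Python's built-in three-argument `pow` (negative exponents need Python 3.8 or later) rather than using the division-free procedure, and the name `monpro_reference` and the values P = 39, n = 6 are assumptions.

```python
def monpro_reference(a, b, p, n):
    """Reference value of A * B * R^-1 mod P with R = 2^n.
    Uses an explicit modular inverse, so it is NOT the division-free
    Montgomery procedure, only a check of what that procedure returns."""
    r_inv = pow(1 << n, -1, p)  # exists because R = 2^n is co-prime to odd P
    return (a * b * r_inv) % p

p, n = 39, 6
a, b = 35, 27
a_res = (a << n) % p            # A' = A * R mod P, here 17
product = monpro_reference(a_res, b, p, n)
print(a_res, product, (a * b) % p)  # 17 9 9
```

Only one operand has to be moved into the residue domain for a single product to come out in ordinary form, which is the shortcut the final bullet describes.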

Binary Montgomery Modular Multiplication
• A, B, P are n-bit numbers ($A, B, P < 2^n$)
• Let $A = (A_{n-1} A_{n-2} \cdots A_0)$ be the binary representation of A
• Choose $R = 2^n$
• $\mathrm{MonPro}(A, B) = A \cdot B \cdot 2^{-n} \pmod P$
• Start from the least significant bit, and obtain the following binary add-shift algorithm to compute $T = A \cdot B \cdot 2^{-n}$
• We are interested in $T = A \cdot B \cdot 2^{-n} \pmod P$, not $T = A \cdot B \cdot 2^{-n}$
• Reduce the partial product T in each iteration
  – If T is even, then
    • $T/2 \pmod P = T/2$
    • Reduce by just right-shifting by one bit
  – If T is odd, then $T + P$ must be even (P is odd)
    • We know $T < P$
    • $T \pmod P = T + P \pmod P$
    • $(T + P) < 2P \Rightarrow (T + P)/2 < P$
    • The result is already reduced modulo P
    • Reduce by adding P and then right-shifting by one bit

Advantages
• On average, more than one addition per iteration
• Only a one-bit comparison is performed to decide on the addition of P

Disadvantages
• One extra subtraction is needed at the end
• Requires conversion to the residue domain
  – Not a big problem if multiple multiplications are required for the same modulus

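Montgomery Multiplication with Pre-computation
• Before computing the partial product, it is known that one of 0, P, B, B + P needs to be added
• The following truth table shows what to add

| $R_0$ | $A_i$ | $B_0$ | Precomp |
|-------|-------|-------|---------|
| 0     | 0     | 0     | 0       |
| 0     | 0     | 1     | 0       |
| 0     | 1     | 0     | B       |
| 0     | 1     | 1     | B + P   |
| 1     | 0     | 0     | P       |
| 1     | 0     | 1     | P       |
| 1     | 1     | 0     | B + P   |
| 1     | 1     | 1     | B       |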
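The truth table can be turned directly into a selection step, as in the Python sketch below. The dictionary stands in for the hardware multiplexer; the names `monpro_precomputed` and `addend` and the test values are assumptions, and the modulus is assumed to be odd.

```python
def monpro_precomputed(a, b, p, n):
    """Montgomery product a * b * 2^-n mod p with a single, pre-selected
    addend per iteration (p odd, a, b < p)."""
    b_plus_p = b + p                 # computed once, before the loop
    addend = {                       # keys are (R_0, A_i, B_0)
        (0, 0, 0): 0,        (0, 0, 1): 0,
        (0, 1, 0): b,        (0, 1, 1): b_plus_p,
        (1, 0, 0): p,        (1, 0, 1): p,
        (1, 1, 0): b_plus_p, (1, 1, 1): b,
    }
    r = 0
    for i in range(n):
        key = (r & 1, (a >> i) & 1, b & 1)
        r = (r + addend[key]) >> 1   # the sum is always even, so the shift is exact
    return r - p if r >= p else r    # final subtraction

print(monpro_precomputed(17, 27, 39, 6))  # 9
```

Every table entry makes the sum even, so the add-P decision no longer depends on first computing $R + A_i \cdot B$, and iterations that select 0 need no addition at all.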

Montgomery with Pre-computation

Advantages
• Less than one addition per iteration
  – Latency decreased
• Simpler datapath

Disadvantages
• Storage is required to save B + P
• B + P has to be calculated before the iterations start
• Slightly more complex loop control compared to simple Montgomery multiplication
  – Negligible

Montgomery with CSA Utilization

Montgomery with CSA Utilization

Advantages
• Additions are done by a CSA, which has a delay of 1 FA
  – Improves operating frequency
• Almost one addition per iteration

Disadvantages
• Memory bandwidth is 3 operands per cycle (C, S, I)
• Requires 1 extra iteration to restore the result
• Storage increases
  – X, Y, P, Y + P, C, S need to be stored
• Complex datapath (2x larger because of the redundant representation {C, S})
  – A conventional adder is needed to get C + S
    • Directly affects operating frequency (think of an RCA with n·T(FA) delay)
• The result of the conventional addition needs to be reduced (final reduction)

Comparisons of MM Algorithms

| Algorithm | # of additions / iteration | # of adders | Storage needed |
|-----------|----------------------------|-------------|----------------|
| Interleaved | Greater than 2 | 1 | 3 × operand length |
| Interleaved with Pre-computation | Slightly greater than 1 (one extra iteration) | 1 | 7 × operand length |
| Interleaved with CSA | Slightly greater than 1 (one extra iteration) | 2 (1 CSA, 1 RCA); complex datapath (redundant rep.) | 9 × operand length |
| Montgomery | Greater than 1, less than 1.5 | 1 | 3 × operand length |
| Montgomery with Pre-computation | Less than 1 | 1 | 4 × operand length |
| Montgomery with CSA | Slightly greater than 1 (one extra iteration) | 2 (1 CSA, 1 RCA); complex datapath (redundant rep.) | 4 × operand length |
