
    Alexandru Ioan Cuza University of Iasi

    Faculty of Computer Science

TECHNICAL REPORT

MpNT: A Multi-Precision Number Theory Package.

    Number-Theoretic Algorithms (I)

F.L. Tiplea, S. Iftene, C. Hritcu, I. Goriac,
R.M. Gordan, E. Erbiceanu

    TR 03-02, May 2003

    ISSN 1224-9327

Alexandru Ioan Cuza University of Iasi, Faculty of Computer Science

Str. Berthelot 16, 6600 Iasi, Romania
Tel. +40-32-201090, e-mail: [email protected]


MpNT: A Multi-Precision Number Theory Package

    Number-Theoretic Algorithms (I)

F.L. Tiplea, S. Iftene, C. Hritcu, I. Goriac,
R.M. Gordan, E. Erbiceanu

    May 21, 2003

    Faculty of Computer Science

    Al.I.Cuza University of Iasi

6600 Iasi, Romania
E-mail: [email protected]


    Preface

MpNT is a multi-precision number theory package developed at the Faculty of Computer Science, Al.I.Cuza University of Iasi, Romania. It has been started as a base for cryptographic applications, looking for both efficiency and portability without disregarding code structure and clarity. However, it can be used in any other domain which requires efficient large number computations.

The present paper is the first in a series of papers dedicated to the design of the MpNT library. It has two goals. First, it discusses some basic number-theoretic algorithms that have been implemented so far, as well as the structure of the library. The second goal is to have a companion to the courses Algebraic Foundations of Computer Science [64], Coding Theory and Cryptography [65], and Security Protocols [66], where most of the material of this paper has been taught and students were faced with the problem of designing efficient implementations of cryptographic primitives and protocols. From this point of view we have tried to prepare a self-contained paper. No specific background in mathematics or programming languages is assumed, but a certain amount of maturity in these fields is desirable.

Due to the detailed exposition, the paper can accompany well any course on algorithm design or computer mathematics.

    The paper has been prepared as follows:

Sections 1-6, by F.L. Tiplea;

    Section 7, by S. Iftene;

    Section 8, by C. Hritcu, I. Goriac, R.M. Gordan, and E. Erbiceanu.

The library has been implemented by C. Hritcu, I. Goriac, R.M. Gordan, and E. Erbiceanu.

    Iasi, May 2003


    Contents

Preface

Introduction

1 Preliminaries
  1.1 Base Representation of Integers
  1.2 Analysis of Algorithms

2 Addition and Subtraction

3 Multiplication
  3.1 School Multiplication
  3.2 Karatsuba Multiplication
  3.3 FFT Multiplication

4 Division
  4.1 School Division
  4.2 Recursive Division

5 The Greatest Common Divisor
  5.1 The Euclidean GCD Algorithm
  5.2 Lehmer's GCD Algorithm
  5.3 The Binary GCD Algorithm

6 Modular Reductions
  6.1 Barrett Reduction
  6.2 Montgomery Reduction and Multiplication
  6.3 Comparisons

7 Exponentiation
  7.1 General Techniques
  7.2 Fixed-Exponent Techniques
  7.3 Fixed-Base Techniques
  7.4 Techniques Using Modulus Particularities
  7.5 Exponent Recoding
  7.6 Multi-Exponentiation

8 MpNT: A Multi-Precision Number Theory Package
  8.1 MpNT Development Policies
  8.2 Implementation Notes
  8.3 Other Libraries

References


    Introduction

An integer in C is typically 32 bits, of which 31 can be used for positive integer arithmetic. Some compilers, such as GCC, offer a long long type, giving 64 bits capable of representing integers up to about 9·10^18. This is good for most purposes, but some applications require many more digits than this. For example, public-key encryption with the RSA algorithm typically requires numbers of 300 digits or more. Computing the probabilities of certain real events often involves very large numbers; although the result might fit in a typical C type, the intermediate computations require very large numbers which will not fit into a C integer, not even a 64-bit one. Therefore, computations with large numerical data (having more than 10 or 20 digits, for example) need specific treatment. There is mathematical software, such as Maple or Mathematica, which provides the possibility to work with unlimited precision. Such software can be used to prototype algorithms or to compute constants, but it is usually inefficient since it is not dedicated to numerical computations. In order to overcome this shortcoming, several multi-precision libraries have been proposed, such as LIP [40], LiDIA [25], CLN [27], PARI [26], NTL [57], GMP [24], etc.

MpNT (Multi-precision Number Theory) is a new multi-precision library developed at the Faculty of Computer Science, Al.I.Cuza University of Iasi, Romania. It has been started as a base for cryptographic applications, looking for both efficiency and portability without disregarding code structure and clarity. However, it can be used in any other domain which requires efficient large number computations. The library is written in ISO C++ (with a small kernel in assembly language), so it should work on any available architecture. Special optimizations apply for the Intel IA-32 and compatible processors under Windows and Linux. Comparisons between our library and the existing ones show that it is quite efficient.

The present paper is the first in a series of papers dedicated to the design of the MpNT library. It has two goals. First, it discusses some basic number-theoretic algorithms that have been implemented so far, as well as the structure of the library. The second goal is to have a companion to the courses Algebraic Foundations of Computer Science [64], Coding Theory and Cryptography [65], and Security Protocols [66], where most of the material of this paper has been taught and students were faced with the problem of designing efficient implementations of cryptographic primitives and protocols. From this point of view we have tried to prepare a self-contained paper. No specific background in mathematics or programming languages is assumed, but a certain amount of maturity in these fields is desirable.

The paper is organized into 8 sections. The first section establishes the notation regarding base representation and the complexity of algorithms, while the next 6 sections discuss algorithms for addition and subtraction, multiplication, division, greatest common divisor, modular reduction, and exponentiation. The last section is devoted to a short description of the MpNT library (for further details the reader is referred to the user guide and the home page of the library).


    1 Preliminaries

In this section we establish the notation we use in our paper regarding the base representation of integers and the complexity of algorithms. For details, the reader is referred to classical textbooks on algorithmic number theory and analysis of algorithms, such as [1, 36, 23].

    1.1 Base Representation of Integers

The set of integers is denoted by Z. Integers a ≥ 0 are referred to as non-negative integers or natural numbers, and integers a > 0 are referred to as positive integers. The set of natural numbers is denoted by N.

For an integer a, |a| stands for the absolute value of a, and sign(a) stands for its sign, defined by

sign(a) = 1, if a ≥ 0, and sign(a) = −1, if a < 0.


We identify positive integers by their representation in some base β. From this point of view we can write

(a_{k−1}, …, a_0)_β = (a′_{l−1}, …, a′_0)_γ

in order to specify that (a_{k−1}, …, a_0)_β and (a′_{l−1}, …, a′_0)_γ are distinct representations (in base β and γ, respectively) of the same integer. For example,

32 = (100000)_2

means that 32 and (100000)_2 are representations in base 10 and base 2, respectively, of the same integer. For technical reasons, the notation (a_{k−1}, …, a_0)_β will sometimes be extended by allowing leading zeroes. For example, (0, 0, 1, 1)_β also denotes the integer 0·β^3 + 0·β^2 + 1·β + 1 = β + 1, whose base β representation is (1, 1)_β. The difference between these two notations will be clear from the context.

Recall now a few useful properties regarding the base representation of integers (for details, the reader is referred to [67]). Let a_0, …, a_{k−1} be base β digits. Then,

1. a_0 + a_1·β + ⋯ + a_i·β^i < β^{i+1}, for all i, 0 ≤ i < k;
2. (a_{k−1}, …, a_0)_β = (a_{k−1}, …, a_{k−i})_β·β^{k−i} + (a_{k−i−1}, …, a_0)_β, for all i, 1 ≤ i < k.

Sometimes, we shall be interested in decomposing the representation of an

integer in blocks of digits. For example, it may be the case that the representation (100000)_2 be decomposed into 3 equally sized blocks, 10, 00, and 00. In this case we shall write [10, 00, 00]_{2^2}. In general, we write

[w_{l−1}, …, w_0]_{β^m}

for the decomposition of (a_{k−1}, …, a_0)_β into blocks of m digits each, where m ≥ 1 and w_i is a sequence of m β-digits, for all 0 ≤ i < l (therefore, k = ml). The first block may be padded by zeroes to the left in order to have length m. For example, [01, 01]_{2^2} is a decomposition of (1, 0, 1)_2 into blocks of length 2. When the length of the blocks varies from block to block, we shall write

[w_{l−1}, …, w_0]_{β^+},

but in this case no padding to the left of the first block is allowed. For example,

[1, 000, 00]_{2^+}

is such a decomposition of (1, 0, 0, 0, 0, 0)_2. If a = (a_{k−1}, …, a_0)_β is the base β representation of a and k − 1 ≥ i ≥ j ≥ 0, then a[i : j] denotes the integer

a[i : j] = (a_i, …, a_j)_β.

Using this notation, the relation at 2 can be re-written as

2′. a = a[k − 1 : k − i]·β^{k−i} + a[k − i − 1 : 0].

In practice, big integers or multi-precision integers are handled thanks to the representation in a suitable base β. The choice of the base must satisfy some requirements such as:


- β should fit in a basic data type;
- β should be as large as possible, to decrease the size of the representation of large integers and to decrease the cost of the basic algorithms running on them.

Typical choices for β are β = 2^m or β = 10^m (powers of 2 or 10).
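To make the digit notation concrete, the following small Python sketch (our own illustration; the MpNT library itself is written in C++, and the names to_digits/from_digits are not MpNT routines) converts between an integer and its base-β digit sequence (a_0, …, a_{k−1}), stored least-significant digit first:

```python
def to_digits(a, beta):
    """Base-beta representation of a >= 0 as the digit list (a_0, ..., a_{k-1}),
    least-significant digit first; 0 is represented by the empty list."""
    digits = []
    while a > 0:
        a, r = divmod(a, beta)
        digits.append(r)
    return digits

def from_digits(digits, beta):
    """Evaluate a_0 + a_1*beta + ... + a_{k-1}*beta^{k-1} by Horner's rule."""
    a = 0
    for d in reversed(digits):
        a = a * beta + d
    return a
```

For instance, to_digits(32, 2) yields the digits of (100000)_2 in least-significant-first order, and from_digits inverts the conversion for any base.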

    1.2 Analysis of Algorithms

Two basic things must be taken into consideration when one describes an algorithm. The first one is to prove its correctness, i.e., that the algorithm gives the desired result when it halts. The second one is about its complexity. Since we are interested in practical implementations, we must give an estimate of the algorithm's running time (if possible, both in the worst and average case). The space requirement must also be considered. In many algorithms this is negligible, and then we shall not bother mentioning it, but in certain algorithms it becomes an important issue which has to be addressed.

The time complexity will be the most important issue we address. It will always be measured in bit/digit operations, i.e., logical or arithmetic operations on bits/digits. This is the most realistic model if using real computers (and not idealized ones). We do not list explicitly all the digit operations, but they will be clear from the context. For example, the addition of two digits, taking the carry into account too, is a digit operation. The shifting and copying operations are not considered digit operations¹. In practice, these operations are fast in comparison with digit operations, so they can be safely ignored.

In many cases we are only interested in the dominant factor of the complexity of an algorithm, focusing on the shape of the running time curve, thus ignoring low-order terms or constant factors. This is called asymptotic analysis, which is usually discussed in terms of the O notation.

Given a function f from the set of positive integers N into the set of positive real numbers R+, denote by O(f) the set

O(f) = {g : N → R+ | (∃c > 0)(∃n_0 ∈ N)(∀n ≥ n_0)(g(n) ≤ c·f(n))}.

Notice that O(f) is a set of functions. Nonetheless, it is common practice to write g(n) = O(f(n)) to mean that g ∈ O(f), and to say "g(n) is O(f(n))" rather than "g is in O(f)". We also write f(n) = Θ(g(n)) for f(n) = O(g(n)) and g(n) = O(f(n)).

We shall say that an algorithm is of time complexity O(f(n)) if the running time g(n) of the algorithm on an input of size n is O(f(n)). When f(n) = log n (f(n) = (log n)^2, f(n) = P(log n) for some polynomial P, f(n) = n) we shall say that the algorithm is linear (quadratic, polynomial, exponential) time.

¹The shifting operation shifts digits a few places to the left/right. The copying operation copies down a compact group of digits.


    2 Addition and Subtraction

Addition and subtraction are two simple operations whose complexity is linear with respect to the length of the operands.

Let a and b be two integers. Their sum satisfies the following properties:

1. if a and b have the same sign, then a + b = sign(a)·(|a| + |b|);
2. if a and b have opposite signs and |a| ≥ |b|, then a + b = sign(a)·(|a| − |b|);
3. if a and b have opposite signs and |a| < |b|, then a + b = sign(b)·(|b| − |a|).

Therefore, the addition of arbitrary integers reduces to the addition and subtraction of non-negative integers. The algorithm Add below adds two positive integers of the same length (the shorter operand can always be padded with leading zeroes).

Add(a, b)
input: a = (a_{k−1}, …, a_0)_β > 0, b = (b_{k−1}, …, b_0)_β > 0;
output: c = a + b;
begin
1.  carry := 0;
2.  for i := 0 to k − 1 do
    begin
3.    c_i := (a_i + b_i + carry) mod β;
4.    carry := (a_i + b_i + carry) div β;
    end;
5.  if carry > 0 then c_k := 1;
end.

It is easy to see that the algorithm Add requires k digit operations, which shows that its time complexity is O(k). In a similar way we can design an algorithm Sub for subtraction. Its complexity is O(k) as well.

Sub(a, b)
input: a = (a_{k−1}, …, a_0)_β, b = (b_{k−1}, …, b_0)_β, a > b > 0;
output: c = a − b;
begin
1.  borrow := 0;
2.  for i := 0 to k − 1 do
    begin
3.    c_i := (a_i − b_i + borrow) mod β;
4.    borrow := (a_i − b_i + borrow) div β;
    end;
end.
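The two algorithms translate almost line by line into Python. The sketch below is our own illustration (not MpNT code); numbers are lists of base-β digits, least-significant digit first, and the base BETA is an arbitrary choice:

```python
BETA = 2 ** 16  # illustrative base; any beta >= 2 that fits a machine word works

def add(a, b):
    """The Add algorithm: digit-wise addition with carry of two non-negative
    numbers of the same length k, digits stored least-significant first."""
    k = len(a)
    c = [0] * (k + 1)
    carry = 0
    for i in range(k):
        t = a[i] + b[i] + carry
        c[i] = t % BETA
        carry = t // BETA
    if carry > 0:
        c[k] = 1
    return c

def sub(a, b):
    """The Sub algorithm, assuming a > b. Python's floor division makes the
    borrow 0 or -1 automatically, matching step 4."""
    k = len(a)
    c = [0] * k
    borrow = 0
    for i in range(k):
        t = a[i] - b[i] + borrow
        c[i] = t % BETA     # digit in [0, BETA)
        borrow = t // BETA  # 0 or -1
    return c
```

Note how the mod/div pair of steps 3-4 needs no explicit case analysis: the carry and the borrow fall out of integer division.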


    3 Multiplication

Several techniques can be used in order to compute the product a·b of two integers. We shall present in this section the school multiplication method, the Karatsuba method, and the FFT method.

Since a·b = (sign(a)·sign(b))·(|a|·|b|) and a·0 = 0·a = 0, we may focus only on the multiplication of non-negative integers.

    3.1 School Multiplication

The technique we present here is an easy variation of the school method. It is based on computing partial sums after each row multiplication. In this way, the intermediate numbers obtained by addition do not exceed β^2 − 1 and, therefore, a basic procedure for the multiplication of two base β digits can be used.

Formally, we can write

a·b = Σ_{i=0}^{k−1} (a_i·b)·β^i,

where a = Σ_{i=0}^{k−1} a_i·β^i. The algorithm SchoolMul given below exploits this property.

SchoolMul(a, b)
input: a = (a_{k−1}, …, a_0)_β ≥ 0, b = (b_{l−1}, …, b_0)_β ≥ 0;
output: c = a·b;
begin
1.  if (a = 0 ∨ b = 0) then c := 0
    else begin
2.    for i := 0 to k − 1 do c_i := 0;
3.    for j := 0 to l − 1 do
      begin
4.      carry := 0;
5.      for h := 0 to k − 1 do
        begin
6.        t := a_h·b_j + c_{h+j} + carry;
7.        c_{h+j} := t mod β;
8.        carry := t div β
        end;
9.      c_{j+k} := carry;
      end;
    end;
end.

It is easy to see that c < β^{k+l} because a < β^k and b < β^l. Therefore, we have |c|_β ≤ k + l. The number t in line 6 of the algorithm SchoolMul satisfies

t = a_h·b_j + c_{h+j} + carry ≤ (β − 1)^2 + (β − 1) + (β − 1) = β^2 − 1 < β^2,

because a_h, b_j, c_{h+j}, carry < β. Thus, the algorithm SchoolMul has time complexity O(kl).
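A direct Python transcription of SchoolMul (again our own illustration, not MpNT code) shows the partial-sum mechanism; digits are stored least-significant first:

```python
def school_mul(a, b, beta):
    """SchoolMul: multiply a (k digits) by b (l digits), digits stored
    least-significant first, accumulating partial sums row by row."""
    k, l = len(a), len(b)
    if not a or not b:   # empty digit list represents 0
        return []
    c = [0] * (k + l)
    for j in range(l):
        carry = 0
        for h in range(k):
            t = a[h] * b[j] + c[h + j] + carry  # t <= beta^2 - 1 < beta^2
            c[h + j] = t % beta
            carry = t // beta
        c[j + k] = carry
    return c
```

The intermediate value t never exceeds β^2 − 1, exactly as in the bound above, so a double-precision digit type suffices for the inner loop.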


    3.2 Karatsuba Multiplication

The school multiplication is far from being the best solution to the multiplication problem. In [33], Karatsuba and Ofman proposed an iterative solution to this problem. Let us assume that we want to multiply a and b, both of the same length k in base β. Moreover, assume that k is an even number, k = 2l. We can write:

a = u_1·β^l + u_0 and b = v_1·β^l + v_0

(u_1, u_0, v_1, and v_0 are l-digit integers obtained as in Section 1.1). The number c = a·b can be decomposed as

c = u_1·v_1·β^{2l} + [(u_1 − u_0)(v_0 − v_1) + u_1·v_1 + u_0·v_0]·β^l + u_0·v_0

or as

c = u_1·v_1·β^{2l} + [(u_0 + u_1)(v_0 + v_1) − u_1·v_1 − u_0·v_0]·β^l + u_0·v_0.

No matter which of the two decompositions is chosen, the relation giving c consists only of three multiplications of base β numbers of length l and a few auxiliary operations (additions and shifts) whose complexity is linear in l. Therefore, the complexity M(2l) of multiplying two base β sequences of length k = 2l satisfies

M(2l) ≤ γ, if l = 1, and M(2l) ≤ 3M(l) + γ·l, if l > 1,

for some constant γ. By induction on i we obtain M(2^i) ≤ γ·(3^i − 2^i). Indeed,

M(2^{i+1}) = M(2·2^i) ≤ 3M(2^i) + γ·2^i ≤ 3γ·(3^i − 2^i) + γ·2^i = γ·(3^{i+1} − 2^{i+1}).

Let k and l be positive integers such that 2^{l−1} < k ≤ 2^l. Then, the complexity of multiplying two numbers of length k is at most M(2^l) ≤ γ·(3^l − 2^l) < 3γ·3^{l−1} ≤ 3γ·k^{log_2 3}, so Karatsuba multiplication runs in time O(k^{log_2 3}) = O(k^{1.585}).


Karatsuba1(a, b, k)
input: a, b ≥ 0 integers of length k, where k is a power of 2;
output: c = a·b;
begin
1.  if k ≤ k_0 then SchoolMul(a, b)
    else begin
2.    l := k/2;
3.    decompose a = u_1·β^l + u_0 and b = v_1·β^l + v_0;
4.    A := Karatsuba1(u_1, v_1, l);
5.    B := Karatsuba1(u_0, v_0, l);
6.    C := Karatsuba1(u_0 + u_1, v_0 + v_1, l);
7.    D := C − A − B;
8.    c := A·β^{2l} + D·β^l + B
    end
end.
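The structure of Karatsuba1 can be sketched in Python using the built-in unbounded integers (our own illustration, not MpNT code): here the split is in base 2^m rather than β^l, the cutoff k_0 is an arbitrary bit-length threshold, and the built-in product stands in for SchoolMul in the base case.

```python
def karatsuba(a, b, k0=32):
    """Multiply non-negative integers a and b by Karatsuba's three-product
    recursion; below k0 bits, fall back to the built-in multiplication."""
    if a.bit_length() <= k0 or b.bit_length() <= k0:
        return a * b
    m = max(a.bit_length(), b.bit_length()) // 2   # split point: base 2^m
    u1, u0 = a >> m, a & ((1 << m) - 1)            # a = u1*2^m + u0
    v1, v0 = b >> m, b & ((1 << m) - 1)            # b = v1*2^m + v0
    A = karatsuba(u1, v1, k0)                      # high product u1*v1
    B = karatsuba(u0, v0, k0)                      # low product u0*v0
    C = karatsuba(u0 + u1, v0 + v1, k0)            # (u0+u1)*(v0+v1)
    D = C - A - B                                  # = u0*v1 + u1*v0, step 7
    return (A << (2 * m)) + (D << m) + B           # step 8: shifts and one add
```

The shifts in the last line correspond to the multiplications by β^{2l} and β^l in step 8 of the pseudocode.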

The variables l and D are not really needed; they are used here only to help the comments below.

At each recursive call of Karatsuba1(a, b, k), where a and b have the length k and k > k_0, the following are true:

- step 3 is very fast: u_1, u_0, v_1, and v_0 are directly obtained from the base β representation of a and b by digit extraction (copy);
- step 8 is also very fast: multiplication by a power of β means a left shift by the corresponding number of digits;
- 2 subtractions of integers of length k are needed to compute D at step 7. We notice that D ≥ 0;
- one addition is needed in order to compute c = a·b at step 8. Indeed, if we consider the result split up into four parts of equal length (i.e., k/2), then it can be obtained from A, B, and D just by adding D as is shown in Figure 1.

(Figure 1: Getting a·b in Karatsuba1(a, b, k). The result A·β^k + B is viewed as four parts of k/2 digits each, and D is added shifted by k/2 digits.)

As we have already seen, 2 subtractions and one addition of base β integers of length k are needed in order to complete a recursive call. This means 4 subtractions and 2 additions of base β integers of length k/2. Michel Quercia remarked in [51] (see also [72]) that a decomposition of A, B, and C before step 7 in the algorithm above can result in a reduction of the number of additions. Indeed, let

A = A_1·β^{k/2} + A_0, B = B_1·β^{k/2} + B_0, and


C = C_1·β^{k/2} + C_0. Then, a·b can be obtained as described in Figure 2, where the difference A_0 − B_1 occurs twice. More precisely, the decomposition of a·b can be rewritten as follows:

a·b = A_1·β^{3k/2} + (C_1 − A_1 + (A_0 − B_1))·β^k + (C_0 − B_0 − (A_0 − B_1))·β^{k/2} + B_0.

(Figure 2: A new way of getting a·b. The halves A_1, A_0, B_1, B_0, C_1, C_0 are arranged over the four k/2-digit positions of the result.)

In this way, only 4 subtractions and one addition of base β numbers of length k/2 are needed. The algorithm Karatsuba2 exploits this property.

Karatsuba2(a, b, k)
input: a, b ≥ 0 integers of length k, where k is a power of 2;
output: c = a·b;
begin
1.  if k ≤ k_0 then SchoolMul(a, b)
    else begin
2.    l := k/2;
3.    decompose a = u_1·β^l + u_0 and b = v_1·β^l + v_0;
4.    A := Karatsuba2(u_1, v_1, l);
5.    B := Karatsuba2(u_0, v_0, l);
6.    C := Karatsuba2(u_0 + u_1, v_0 + v_1, l);
7.    decompose A = A_1·β^l + A_0;
8.    decompose B = B_1·β^l + B_0;
9.    decompose C = C_1·β^l + C_0;
10.   D := A_0 − B_1;
11.   A_0 := C_1 − A_1 + D;
12.   B_1 := C_0 − B_0 − D;
13.   c := A_1·β^{3l} + A_0·β^{2l} + B_1·β^l + B_0;
    end
end.

None of these two algorithms takes into consideration the case of arbitrary length operands. In order to overcome this we have to consider two more cases:

- the operands a and b have an odd length k. Let k = 2l + 1. Then, we can write a = u_1·β + u_0 and b = v_1·β + v_0, which leads to

a·b = u_1·v_1·β^2 + (u_0·v_1 + u_1·v_0)·β + u_0·v_0.

This relation shows that we can use a Karatsuba step to compute u_1·v_1; the other products can be computed by school multiplication (whose complexity is linear in k);


- the operands a and b have different lengths. Let k be the length of a, and l be the length of b. Assume that k > l. Let k = r·l + k′, where k′ < l. We can write

a = u_r·β^{(r−1)·l+k′} + ⋯ + u_1·β^{k′} + u_0,

where u_1, …, u_r are of length l, and u_0 is of length k′. Then,

a·b = u_r·b·β^{(r−1)·l+k′} + ⋯ + u_1·b·β^{k′} + u_0·b.

Moreover, u_i·b can be computed by a Karatsuba step.

The algorithm KaratsubaMul takes into consideration these remarks and provides a complete solution to multi-precision integer multiplication.

KaratsubaMul(a, b, k, l)
input: a ≥ 0 of length k, and b ≥ 0 of length l;
output: c = a·b;
begin
1.  case k, l of
2.    min{k, l} ≤ k_0: SchoolMul(a, b);
3.    k = l even:
      begin
4.      p := k/2;
5.      decompose a = u_1·β^p + u_0 and b = v_1·β^p + v_0;
6.      A := KaratsubaMul(u_1, v_1, p, p);
7.      B := KaratsubaMul(u_0, v_0, p, p);
8.      C := KaratsubaMul(u_0 + u_1, v_0 + v_1, p, p);
9.      D := C − A − B;
10.     c := A·β^{2p} + D·β^p + B
      end;
11.   k = l odd:
      begin
12.     let k = 2l′ + 1;
13.     decompose a = u_1·β + u_0 and b = v_1·β + v_0;
14.     A := KaratsubaMul(u_1, v_1, 2l′, 2l′);
15.     B := SchoolMul(u_0, v_1);
16.     C := SchoolMul(u_1, v_0);
17.     D := SchoolMul(u_0, v_0);
18.     c := A·β^2 + (B + C)·β + D
      end
    else
      begin
        (assume k > l);
19.     decompose k = r·l + k′, where k′ < l;
20.     decompose a = u_r·β^{(r−1)·l+k′} + ⋯ + u_1·β^{k′} + u_0;
21.     for i := 1 to r do A_i := KaratsubaMul(u_i, b, l, l);
22.     A_0 := KaratsubaMul(u_0, b, k′, l);
23.     c := A_r·β^{(r−1)·l+k′} + ⋯ + A_1·β^{k′} + A_0;
      end
  end case
end.


    3.3 FFT Multiplication

The main idea of the Karatsuba algorithm can be naturally generalized. Instead of dividing the sequences a and b into two subsequences of the same size, divide them into r equally sized subsequences,

a = u_{r−1}·β^{(r−1)·l} + ⋯ + u_1·β^l + u_0

and

b = v_{r−1}·β^{(r−1)·l} + ⋯ + v_1·β^l + v_0

(assuming that their length is k = r·l). Let U and V be the polynomials

U(x) = u_{r−1}·x^{r−1} + ⋯ + u_1·x + u_0

and

V(x) = v_{r−1}·x^{r−1} + ⋯ + v_1·x + v_0.

Thus, a = U(β^l) and b = V(β^l). Therefore, in order to determine a·b it should be sufficient to find the coefficients of the polynomial W(x) = U(x)·V(x),

W(x) = w_{2r−2}·x^{2r−2} + ⋯ + w_1·x + w_0.

The coefficients of W can be efficiently computed by the Fast Fourier Transform. This idea was first exploited by Schönhage and Strassen [56]. Their algorithm is the fastest multiplication algorithm for arbitrary length integers.

The Discrete Fourier Transform The discrete Fourier transform, abbreviated DFT, can be defined over an arbitrary commutative ring (R, +, ·, 0, 1)², usually abbreviated by R. The one-element ring (that is, the case 1 = 0) is excluded from our considerations. For details concerning rings the reader is referred to [13].

Definition 3.3.1 Let R be a commutative ring. An element ω ∈ R is called a primitive or principal nth root of unity, where n ≥ 2, if the following properties are true:

(1) ω ≠ 1;
(2) ω^n = 1;
(3) Σ_{j=0}^{n−1} (ω^i)^j = 0, for all 1 ≤ i < n.

The elements ω^0, ω^1, …, ω^{n−1} are called the nth roots of unity.

Remark 3.3.1 Assume that R is a field and n has a multiplicative inverse n^{−1} in R³. Then, the third property in Definition 3.3.1 is equivalent to

(3′) ω^i ≠ 1, for all 1 ≤ i < n.

²Recall that (R, +, ·, 0, 1) is a commutative ring if (R, +, 0) is an abelian group, (R, ·, 1) is a commutative monoid, and the usual distributive laws hold.

³Integers appear in any ring, even in finite ones. Take n to be 1 + ⋯ + 1 (n times), where 1 is the multiplicative unit.


Indeed, if we assume that (3) holds true but there is an i such that ω^i = 1, then the relation Σ_{j=0}^{n−1} (ω^i)^j = 0 implies n·1 = 0, which leads to 1 = n^{−1}·0 = 0; a contradiction.

Conversely, assume that (3′) holds true. Then, for any i, 1 ≤ i < n, we have

0 = (ω^n)^i − 1 = (ω^i)^n − 1 = (ω^i − 1)·(Σ_{j=0}^{n−1} (ω^i)^j),

which leads to Σ_{j=0}^{n−1} (ω^i)^j = 0 because ω^i ≠ 1 and R is a field.

Example 3.3.1 In the field C of complex numbers, the element ω = e^{(2πi)/n} is a primitive nth root of unity.

In the field Z_m, where m is a prime number, any primitive element (generator of this field) is a primitive (m − 1)st root of unity.
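The three defining properties can be checked mechanically in Z_p. The Python sketch below is our own illustration; for instance, 2 is a principal 8th root of unity in Z_17 (its multiplicative order is 8), while 4 is not (its order is only 4):

```python
def is_principal_root(omega, n, p):
    """Check Definition 3.3.1 in the ring Z_p: omega != 1, omega^n = 1, and
    sum_{j=0}^{n-1} (omega^i)^j = 0 for all 1 <= i < n."""
    if omega % p == 1 or pow(omega, n, p) != 1:
        return False
    return all(sum(pow(omega, i * j, p) for j in range(n)) % p == 0
               for i in range(1, n))
```

Note that 4 = 2^2 is, however, a principal 4th root of unity in Z_17, illustrating Proposition 3.3.1(5) below.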

The following properties are obvious but they will be intensively used.

Proposition 3.3.1 Let R be a commutative ring and ω ∈ R be a primitive nth root of unity. Then, the following properties are true:

(1) ω has a multiplicative inverse, which is ω^{n−1};
(2) ω^{−i} = ω^{n−i}, for all 0 ≤ i < n;
(3) ω^i·ω^j = ω^{(i+j) mod n};
(4) If n is even, then (ω^i)^2 = (ω^{n/2+i})^2, for all 0 ≤ i < n/2;
(5) If n > 2 is even and n has a multiplicative inverse n^{−1} in R, then ω^2 is a primitive (n/2)th root of unity.

Proof (1) The relation ω^n = 1 leads to ω·ω^{n−1} = 1, which shows that ω^{n−1} is the multiplicative inverse of ω.

(2) follows from (1), and (3) uses the property ω^n = 1.

(4) Let 0 ≤ i < n/2. Then, ω^0 = ω^n leads to ω^{2i} = ω^{n+2i}, which is what we need.

(5) Clearly, (ω^2)^{n/2} = ω^n = 1. Let us prove that ω^2 ≠ 1. If we assume, by contradiction, that ω^2 = 1, then the relation Σ_{j=0}^{n−1} (ω^2)^j = 0 implies n·1 = 0, which leads to a contradiction as in Remark 3.3.1.

Let 1 ≤ i < n/2. We have to prove that Σ_{j=0}^{n/2−1} ((ω^2)^i)^j = 0. Since 2i < n we can write

Σ_{j=0}^{n−1} (ω^{2i})^j = Σ_{j=0}^{n/2−1} (ω^{2i})^j + Σ_{j=n/2}^{n−1} (ω^{2i})^j
                        = Σ_{j=0}^{n/2−1} (ω^{2i})^j + Σ_{j=0}^{n/2−1} (ω^{2i})^{n/2+j}
                        = Σ_{j=0}^{n/2−1} (ω^{2i})^j + Σ_{j=0}^{n/2−1} (ω^{2i})^j
                        = 2·Σ_{j=0}^{n/2−1} (ω^{2i})^j
                        = 0

(the property (4) and the fact that ω is a primitive nth root of unity have been used). The relation 2·Σ_{j=0}^{n/2−1} (ω^{2i})^j = 0 leads to n·Σ_{j=0}^{n/2−1} (ω^{2i})^j = 0, and to Σ_{j=0}^{n/2−1} (ω^{2i})^j = n^{−1}·0 = 0.

Therefore, ω^2 is a primitive (n/2)th root of unity.


Remark 3.3.2 When R is a field, more useful properties about roots of unity can be proved. For example, ω^{n/2} = −1, provided that n is even. Indeed,

0 = ω^n − 1 = (ω^{n/2})^2 − 1 = (ω^{n/2} − 1)·(ω^{n/2} + 1),

which shows that ω^{n/2} + 1 = 0 because ω^{n/2} ≠ 1. This property leads to another useful property, namely

ω^{n/2+i} = −ω^i, for all 0 ≤ i < n/2, provided that n is even.

Let R be a commutative ring and ω ∈ R be a primitive nth root of unity. Denote by A_{ω,n}, or simply by A when ω and n are clear from the context, the n × n matrix with A(i, j) = ω^{ij}:

A = ( 1   1          1             …   1
      1   ω          ω^2           …   ω^{n−1}
      …
      1   ω^{n−1}    (ω^{n−1})^2   …   (ω^{n−1})^{n−1} )

That is, line i of A contains all the powers of ω^i, for all 0 ≤ i < n.

Proposition 3.3.2 Let R be a commutative ring and ω ∈ R be a primitive nth root of unity. If the integer n has a multiplicative inverse n^{−1} in R, then the matrix A is non-singular and its inverse A^{−1} is given by A^{−1}(i, j) = n^{−1}·ω^{−ij}, for all 0 ≤ i, j < n.

Proof It is enough to show that A·A^{−1} = A^{−1}·A = I_n, where I_n is the n × n identity matrix. Let B = A·A^{−1}. Then,

B(i, j) = n^{−1}·Σ_{k=0}^{n−1} ω^{k(i−j)},

for all i and j. If i = j then B(i, j) = 1, and if i > j then B(i, j) = 0 (by Definition 3.3.1(3)). In the case i < j we use the property in Proposition 3.3.1(2), which leads to B(i, j) = 0. Therefore, B = I_n. The case A^{−1}·A = I_n can be discussed in a similar way.

For an n-dimensional row vector a = (a_0, …, a_{n−1}) with elements from R, a^t stands for the transpose of a.

Definition 3.3.2 Let R be a commutative ring and ω ∈ R be a primitive nth root of unity.

(1) The discrete Fourier transform of a vector a = (a_0, …, a_{n−1}) with elements from R is the vector F(a) = A·a^t.

(2) The inverse discrete Fourier transform of a vector a = (a_0, …, a_{n−1}) with elements from R is the vector F^{−1}(a) = A^{−1}·a^t, provided that n has a multiplicative inverse in R.
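As an illustration of Definition 3.3.2 and Proposition 3.3.2, the following Python sketch (our own choice of parameters, not MpNT code) computes F and F^{−1} directly from the matrices A and A^{−1} in the ring Z_17, taking n = 8 and ω = 2:

```python
p, n, omega = 17, 8, 2           # Z_17, with 2 a principal 8th root of unity
inv_n = pow(n, p - 2, p)         # n^{-1} mod p, by Fermat's little theorem
inv_omega = pow(omega, p - 2, p)  # omega^{-1} mod p

def dft(a):
    """F(a) = A * a^t, where A(i, j) = omega^{ij}."""
    return [sum(pow(omega, i * j, p) * a[j] for j in range(n)) % p
            for i in range(n)]

def inv_dft(y):
    """F^{-1}(y) = A^{-1} * y^t, where A^{-1}(i, j) = n^{-1} * omega^{-ij}."""
    return [inv_n * sum(pow(inv_omega, i * j, p) * y[j] for j in range(n)) % p
            for i in range(n)]
```

The round trip inv_dft(dft(a)) = a is exactly the statement A^{−1}·A = I_n of Proposition 3.3.2; both transforms here cost O(n^2) ring operations, which is the figure the FFT below improves.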


Relationship Between DFT and Polynomial Evaluation There is a close relationship between the DFT and polynomial evaluation. Let

a(x) = Σ_{j=0}^{n−1} a_j·x^j

be an (n − 1)st degree polynomial over R. This polynomial can be viewed as a vector

a = (a_0, …, a_{n−1}).

Evaluating the polynomial a(x) at the points ω^0, ω^1, …, ω^{n−1} is equivalent to computing F(a), i.e.,

F(a) = (a(ω^0), a(ω^1), …, a(ω^{n−1})).

Likewise, computing the inverse discrete Fourier transform is equivalent to finding (interpolating) a polynomial given its values at the nth roots of unity.

Definition 3.3.3 Let a = (a_0, …, a_{n−1}) and b = (b_0, …, b_{n−1}) be two n-ary vectors over R. The convolution of a and b, denoted by a ⊗ b, is the 2n-ary vector c = (c_0, …, c_{2n−1}) given by

c_i = Σ_{j=0}^{i} a_j·b_{i−j},

for all 0 ≤ i < 2n (with the convention that a_j = b_j = 0 for j ≥ n).


The Fast Fourier Transform Algorithm Computing the (inverse) discrete Fourier transform of a vector a ∈ R^n requires time complexity O(n^2) (if we assume that arithmetic operations on elements of R require one step each). When n is a power of 2 there is an O(n log n) time complexity algorithm due to Cooley and Tukey [16]. This algorithm is based on the following simple remark. Let a′(x) and a″(x) be the polynomials given by

a′(x) = a_0 + a_2·x + ⋯ + a_{n−2}·x^{n/2−1}

and

a″(x) = a_1 + a_3·x + ⋯ + a_{n−1}·x^{n/2−1},

where a(x) = a_0 + a_1·x + a_2·x^2 + ⋯ + a_{n−1}·x^{n−1}. Then,

(∗) a(x) = a′(x^2) + x·a″(x^2).

Therefore:

- evaluating a(x) at the points ω^0, ω^1, …, ω^{n−1} reduces to evaluating a′(x) and a″(x) at (ω^0)^2, (ω^1)^2, …, (ω^{n−1})^2, and then re-combining the results via (∗). By Proposition 3.3.1(4), a′(x) and a″(x) will be evaluated at only n/2 points. That is,

a(ω^i) = a′(ω^{2i}) + ω^i·a″(ω^{2i})

and

a(ω^{n/2+i}) = a′(ω^{2i}) + ω^{n/2+i}·a″(ω^{2i}),

for all 0 ≤ i < n/2;

- by Proposition 3.3.1(5), a′(x) and a″(x) will be evaluated at the (n/2)th roots of unity. Therefore, we can recursively continue with a′(x) and a″(x).

The algorithm FFT1 given below is based on these remarks. Lines 5 and 6 in FFT1 are the recursion, and lines 8-10 do the re-combination via (∗): line 8 does the first half, from 0 to n/2 − 1, while line 9 does the second half, from n/2 to n − 1. Line 9 can be replaced by

y_{n/2+i} := y′_i − ω^i·y″_i,

whenever the property ω^{n/2} = −1 holds true in the ring R (e.g., when R is a field).

The time complexity T(n) of this algorithm satisfies

T(n) = 2T(n/2) + Θ(n)

(the loop 8-10 has complexity Θ(n)). So, T(n) = Θ(n log n).

In order to compute F^{−1}(a) we replace ω by ω^{−1} and multiply the result by n^{−1} in the algorithm FFT1(a, n). Therefore, the inverse DFT has the same time complexity.


FFT1(a, n)
input: a = (a_0, …, a_{n−1}) ∈ R^n, where n is a power of 2;
output: F(a);
begin
1.  if n = 1 then F(a) := a
    else begin
2.    t := 1;
3.    a′ := (a_0, a_2, …, a_{n−2});
4.    a″ := (a_1, a_3, …, a_{n−1});
5.    y′ := FFT1(a′, n/2);
6.    y″ := FFT1(a″, n/2);
7.    for i := 0 to n/2 − 1 do
      begin
8.      y_i := y′_i + t·y″_i;
9.      y_{n/2+i} := y′_i + ω^{n/2}·t·y″_i;
10.     t := t·ω;
      end
11.   F(a) := y;
    end
end.
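Over the field C, FFT1 can be rendered in a few lines of Python (our own illustration, not MpNT code); line 9 is written in the y′_i − ω^i·y″_i form, which is valid here because ω^{n/2} = −1 in C:

```python
import cmath

def fft1(a):
    """Recursive FFT over C following FFT1: split into even- and odd-indexed
    coefficients, recurse, and recombine via a(x) = a'(x^2) + x*a''(x^2)."""
    n = len(a)
    if n == 1:
        return a[:]
    y_even = fft1(a[0::2])   # a' = (a_0, a_2, ..., a_{n-2})
    y_odd = fft1(a[1::2])    # a'' = (a_1, a_3, ..., a_{n-1})
    y = [0] * n
    for i in range(n // 2):
        w = cmath.exp(2j * cmath.pi * i / n)      # omega^i, omega = e^{2*pi*i/n}
        y[i] = y_even[i] + w * y_odd[i]           # line 8
        y[n // 2 + i] = y_even[i] - w * y_odd[i]  # line 9, using omega^{n/2} = -1
    return y
```

Each level of the recursion does Θ(n) work, matching the T(n) = 2T(n/2) + Θ(n) recurrence above.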

Let us have a closer look at FFT1. Figure 3 shows the tree of input vectors to the recursive calls of the procedure, for an initial vector of length 8:

(a_0, a_1, a_2, a_3, a_4, a_5, a_6, a_7)
(a_0, a_2, a_4, a_6)            (a_1, a_3, a_5, a_7)
(a_0, a_4)  (a_2, a_6)  (a_1, a_5)  (a_3, a_7)
(a_0) (a_4) (a_2) (a_6) (a_1) (a_5) (a_3) (a_7)

Figure 3: The tree of vectors of the recursive calls

This tree

shows where coefficients are moved on the way down. The computation starts at the bottom and traverses the tree to the root. That is, for instance, from (a_0) and (a_4) we compute a_0 + ω^0 a_4, which can be stored into a_0, and a_0 + ω^{4+0} a_4, which can be stored into a_4. From (a_0, a_4) and (a_2, a_6) we compute a_0 + ω^0 a_2, which can be stored into a_0, a_0 + ω^{4+0} a_2, which can be stored into a_2, a_4 + ω^2 a_6, which can be stored into a_4, and a_4 + ω^{4+2} a_6, which can be stored into a_6. Three iterations are needed to reach the root of the tree. After the third one, a_0 is in fact the evaluation of a(x) at ω^0, and so on. These iterations are represented in Figure 4. Let us analyse this example in more detail:

• the order (a_0, a_4, a_2, a_6, a_1, a_5, a_3, a_7) can be easily obtained by reversing the binary representations of the numbers 0, 1, 2, 3, 4, 5, 6, 7. For instance, a_4 takes the place of a_1 because the mirror image of 001 is 100, and so on;

• there are k = log_2 8 iterations, and the jth iteration uses the (n/2^{k−j})th roots of unity, which are

(ω^0)^{2^{k−j}}, …, (ω^{n−1})^{2^{k−j}};


Figure 4: Iterations of FFT1 for n = 8 (starting from the bit-reversed order a_0, a_4, a_2, a_6, a_1, a_5, a_3, a_7 and producing a(ω^0), …, a(ω^7) after the third iteration)

• at each iteration j, there are 2^{k−j} subvectors of 2^j elements each. The elements of each subvector are re-combined by means of the unity roots that are used at this iteration. The re-combination respects the same rule for each subvector. For example, the subvector (a_0, a_4) at the first iteration in Figure 4 gives rise to the elements a_0 + ω^0 a_4 and a_0 + ω^4 a_4, the subvector (a_2, a_6) gives rise to the elements a_2 + ω^0 a_6 and a_2 + ω^4 a_6, and so on.

These remarks, generalized to an arbitrary integer n = 2^k, lead to the following algorithm.

FFT2(a, n)
input: a = (a_0, …, a_{n−1}) ∈ R^n, where n = 2^k, k ≥ 1;
output: F(a);
begin
1.  a := bit-reverse(a);
2.  for j := 1 to k do
    begin
3.    m := 2^j;
4.    ω_j := ω^{2^{k−j}};
5.    ω′ := 1;
6.    for i := 0 to m/2 − 1 do
      begin
7.      for s := i to n − 1 step m do
        begin
8.        t := ω′ a_{s+m/2};
9.        u := a_s;
10.       a_s := u + t;
11.       a_{s+m/2} := u + ω^{n/2} t;
        end
12.     ω′ := ω′ ω_j;
      end
    end
end.

In the algorithm FFT2, bit-reverse is a very simple procedure which produces the desired order of the input vector. As we have already explained,


this order is obtained just by replacing a_i by a_{rev(i)}, where rev(i) is the number obtained from the binary representation of i by reversing its bits.

Line 2 in the algorithm counts the iterations and, for each iteration j, it gives the dimension of the subvectors at that iteration and prepares for the computation of the unity roots (used at that iteration). Line 6 prepares for making the re-combinations regarding each subvector, and lines 7–11 do them.

As with FFT1, line 11 can be replaced by

a_{s+m/2} := u − t,

whenever the property ω^{n/2} = −1 holds true in the ring R.

In order to compute F^{−1}(a) we replace ω by ω^{−1} and multiply the result by n^{−1} in the algorithm FFT2(a, n).
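The same computation in iterative form, i.e., bit-reverse followed by FFT2, can be sketched in Python as follows (illustrative only; the in-place updates mirror lines 8–12 of the pseudocode):

```python
def bit_reverse(a):
    """Permute a: the element at position i moves to position rev(i),
    the k-bit reversal of i (n = len(a) = 2^k)."""
    n = len(a)
    k = n.bit_length() - 1
    out = [None] * n
    for i in range(n):
        r, x = 0, i
        for _ in range(k):
            r, x = (r << 1) | (x & 1), x >> 1
        out[r] = a[i]
    return out

def fft2(a, w):
    """Iterative FFT2: n = len(a) = 2^k, w a primitive nth root of unity."""
    n = len(a)
    k = n.bit_length() - 1
    a = bit_reverse(a)                          # line 1
    for j in range(1, k + 1):                   # line 2: the k iterations
        m = 2 ** j                              # line 3: subvector length
        wj = w ** (2 ** (k - j))                # line 4: primitive (2^j)th root
        wp = 1                                  # line 5
        for i in range(m // 2):                 # line 6
            for s in range(i, n, m):            # line 7
                t = wp * a[s + m // 2]          # line 8
                u = a[s]                        # line 9
                a[s] = u + t                    # line 10
                a[s + m // 2] = u + w ** (n // 2) * t   # line 11
            wp = wp * wj                        # line 12
    return a
```

Over a field, the update in line 11 can of course use u − t instead, as noted above.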

If one wants to start with the original order of the input vector, then the re-combination should be performed in the reverse order of the one used in FFT2. This order is obtained as in Figure 5. As we can see, the first iteration

Figure 5: Starting with the original order of the input vector

combines the elements in the same order as the last iteration in Figure 4, and so on. After the last iteration, the resulting vector should be reversed. One can easily design a slightly modified version of FFT2 in order to work in this case too. We shall prefer to present another version, FFT3, which shows explicitly how to get, for each iteration j and element a_i, the indexes p, q, and r such that a_i = a_p + ω^r a_q.

FFT3(a, n)
input: a = (a_0, …, a_{n−1}) ∈ R^n, where n = 2^k, k ≥ 1;
output: F(a);
begin
1.  for i := 0 to 2^k − 1 do y_i := a_i;
2.  for j := 0 to k − 1 do
    begin
3.    for i := 0 to 2^k − 1 do z_i := y_i;
4.    for i := 0 to 2^k − 1 do
      begin
5.      let i = (d_{k−1} ⋯ d_0)_2;
6.      p := (d_{k−1}, …, d_{k−j}, 0, d_{k−j−2}, …, d_0)_2;
7.      q := (d_{k−1}, …, d_{k−j}, 1, d_{k−j−2}, …, d_0)_2;
8.      r := (d_{k−j−1}, …, d_0, 0, …, 0)_2;
9.      y_i := z_p + ω^r z_q;
      end
    end;
10. F(a) := bit-reverse(y);
end.

Usually, in implementations, the powers of ω are pre-computed.

FFT in the Complex Field C.  FFT is intensively used with the field of complex numbers. We recall that any complex number z = x + iy can be written as

z = |z|(cos θ + i sin θ)  or  z = |z| e^{iθ},

where |z| = √(x^2 + y^2) and tan θ = y/x.

The equation z^n = 1 has exactly n complex roots, which are e^{2πij/n}, for all 0 ≤ j < n. These solutions form a cyclic group under multiplication, and the complex number ω = e^{2πi/n} generates this group. It is easy to see that ω is a primitive nth root of unity, and the solutions of the equation z^n = 1 are exactly the nth roots of unity (in the sense of Definition 3.3.1). These numbers can be located in the complex plane as the vertices of a regular polygon of n sides inscribed in the unit circle |z| = 1 and with one vertex at the point z = 1 (see Figure 6 for the case n = 8).

Figure 6: The values ω^0, …, ω^7, where ω = e^{2πi/8}

Since C is a field, we may use all the properties of the nth roots of unity mentioned above, such as

• ω^{−1} = ω^{n−1};


• ω^{n/2} = −1, provided that n is even;

• ω^{n/2+i} = −ω^i and (ω^{n/2+i})^2 = (ω^i)^2, for all 0 ≤ i < n/2, provided that n is even;

• ω^2 is a primitive (n/2)th root of unity, provided that n is even.

The FFT algorithms developed in the above subsections work in this case too. Moreover, they may use the property ω^{n/2+i} = −ω^i, which gives some efficiency in implementations.

Let us consider now the problem of computing ω^j, for all 0 ≤ j < n, where n is a power of 2 (for example, n = 2^k).

Denote ω_r = e^{2πi/2^r}, for all 1 ≤ r ≤ k. That is,

ω_r = ω^{2^{k−r}},

for all r. Remark that ω_k = ω.

If these values have been already computed, then ω^j can be computed by

ω^j = ω_r ⋯ ω_r  (q times),

where j = 2^{k−r} q, for some r and odd q. This method could be very inefficient because it implies many multiplications.

We can compute ω^j recursively. For example, if we assume that ω^{j′} has been computed, for all j′ < j, then we can write

ω^j = ω^{j′} ω_r,

where j′ = 2^{k−r}(q − 1) < j. Now, q − 1 is even and, therefore, j′ can be written as j′ = 2^{k−r′} q′, for some r′ < r and odd q′ < q. This shows that ω_r is not used in computing ω^{j′}. In other words, we can say that there is a sequence

r_j > ⋯ > r_1 ≥ 1

such that

ω^j = ω_{r_1} ⋯ ω_{r_j}.

In this way, no more than k multiplications are needed to compute ω^j.
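The decomposition of ω^j into at most k of the factors ω_r, one per set bit of j, can be sketched as follows (Python, over C; an illustration, not library code):

```python
import cmath

def omega_power(j, k):
    """omega^j for omega = e^(2*pi*i/2^k), using at most k multiplications
    by the precomputed values omega_r = e^(2*pi*i/2^r), r = 1..k."""
    omegas = [cmath.exp(2j * cmath.pi / 2 ** r) for r in range(1, k + 1)]
    result = 1
    for b in range(k):
        if j & (1 << b):                 # bit b of j contributes omega^(2^b)
            result *= omegas[k - b - 1]  # omega^(2^b) = omega_{k-b}
    return result
```

Each ω^{2^b} equals ω_{k−b}, so the product ranges over distinct ω_r, exactly as in the sequence r_j > ⋯ > r_1 above.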

Let us focus now on computing ω_r, for all 1 ≤ r ≤ k. Simple computations show that ω_1 = −1 and ω_2 = i.

Denote ω_r = x_r + i y_r, where x_r = cos(2π/2^r) and y_r = sin(2π/2^r). There are many pairs of recurrence equations for getting ω_{r+1} from ω_r. One of them, proposed in [36], is

x_{r+1} = √((1 + x_r)/2),  y_{r+1} = √((1 − x_r)/2).

In [62], Tate has shown that the equation for y_{r+1} is not stable because x_r approaches 1 rather quickly. Moreover, he proposed an alternative pair of equations,

x_{r+1} = (1/2)√(2(1 + x_r)),  y_{r+1} = y_r/√(2(1 + x_r)),

and proved that these equations are stable.
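The stability claim is easy to check numerically. The sketch below computes the ω_r by Tate's pair of recurrences (starting from ω_2 = i, since 1 + x_1 = 0 makes the pair inapplicable at r = 1) and can be compared against e^{2πi/2^r}:

```python
import math

def omegas_tate(k):
    """omega_r = x_r + i*y_r for r = 1..k; omega_1 = -1, omega_2 = i,
    and r >= 3 obtained by Tate's stable recurrences."""
    out = [complex(-1.0, 0.0)]            # omega_1 = -1
    x, y = 0.0, 1.0                       # omega_2 = i
    if k >= 2:
        out.append(complex(x, y))
    for _ in range(3, k + 1):
        s = math.sqrt(2.0 * (1.0 + x))    # sqrt(2(1 + x_r))
        x, y = s / 2.0, y / s             # x_{r+1} and y_{r+1}
        out.append(complex(x, y))
    return out
```

Even for r around 30, where 2π/2^r is tiny and the unstable formula √((1 − x_r)/2) loses most significant digits of y_{r+1}, the values produced this way stay within a few ulps of e^{2πi/2^r}.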


FFT in Finite Fields.  If we are working in the field of real (or complex) numbers, we must approximate real numbers with finite precision numbers. In this way we introduce errors. These errors can be avoided if we work in a finite ring (or field), like the ring Z_m of integers modulo m.

The FFT in a finite ring is usually called the number theoretic transform, abbreviated NTT.

Theorem 3.3.2  Let n and ω be powers of 2, and m = ω^{n/2} + 1. Then, the following are true:

(1) n has a multiplicative inverse in Z_m;

(2) ω^{n/2} ≡ −1 mod m;

(3) ω is a principal nth root of unity in Z_m.

Proof  (1) and (2) follow directly from the way n, ω, and m have been chosen.

(3) Let n = 2^k. By induction on k one can easily prove that the following relation holds in any commutative ring R:

Σ_{i=0}^{n−1} a^i = Π_{i=0}^{k−1} (1 + a^{2^i}),

for all a ∈ R. Therefore, it holds in Z_m.

Let 1 ≤ i < n. In order to prove that

Σ_{j=0}^{n−1} (ω^i)^j ≡ 0 mod m

it suffices to show that 1 + ω^{2^s i} ≡ 0 mod m, for some 0 ≤ s < k.

Let i = 2^t i′, where i′ is odd. By choosing s such that s + t = k − 1, one can easily show that 1 + ω^{2^s i} ≡ 0 mod m.

The importance of Theorem 3.3.2 is that the convolution theorem is valid in rings like Z_{ω^{n/2}+1}, where n and ω are powers of 2. Moreover, the property ω^{n/2} ≡ −1 mod m holds true, which is very important for implementations.

Let us assume that we wish to compute the convolution of two n-dimensional vectors a = (a_0, …, a_{n−1}) and b = (b_0, …, b_{n−1}), where n is a power of 2. Then, we have to:

• define a′ = (a_0, …, a_{n−1}, 0, …, 0) and b′ = (b_0, …, b_{n−1}, 0, …, 0), which are 2n-ary vectors;

• choose ω as a power of 2, for instance ω = 2. Then, 2 is a primitive (2n)th root of unity in Z_{2^n+1};

• compute F^{−1}(F(a′) · F(b′)) (componentwise product) in Z_{2^n+1}. If the components of the convolution a ∘ b are in the range 0 to 2^n, then the result is exactly a ∘ b; otherwise, the result is correct modulo 2^n + 1.
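The convolution procedure just described can be sketched in Python as follows (illustrative only; an exact O(n log n) transform over Z_{2^n+1} with ω = 2, using ω^{n/2} ≡ −1 in the butterfly; pow(x, −1, m) requires Python 3.8+):

```python
def ntt(a, w, m):
    """FFT over Z_m (all arithmetic mod m); w must be a primitive
    len(a)-th root of unity mod m, len(a) a power of 2."""
    n = len(a)
    if n == 1:
        return a[:]
    even = ntt(a[0::2], w * w % m, m)
    odd  = ntt(a[1::2], w * w % m, m)
    y, wp = [0] * n, 1
    for i in range(n // 2):
        y[i]          = (even[i] + wp * odd[i]) % m
        y[i + n // 2] = (even[i] - wp * odd[i]) % m  # w^(n/2) = -1 mod m
        wp = wp * w % m
    return y

def convolution(a, b):
    """Convolution of two length-n vectors (n a power of 2), computed in
    Z_{2^n + 1} with w = 2 as in the text; exact as long as every
    component of the result stays below 2^n + 1."""
    n = len(a)
    m = 2 ** n + 1                       # = w^((2n)/2) + 1 with w = 2
    a2, b2 = a + [0] * n, b + [0] * n    # pad to 2n-ary vectors
    fa, fb = ntt(a2, 2, m), ntt(b2, 2, m)
    fc = [x * y % m for x, y in zip(fa, fb)]
    c = ntt(fc, pow(2, -1, m), m)        # inverse transform with w^(-1)
    inv2n = pow(2 * n, -1, m)            # ... times (2n)^(-1) mod m
    return [x * inv2n % m for x in c]
```

All arithmetic here is exact integer arithmetic, so no rounding errors can occur, which is precisely the point of working over a finite ring.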

Let us focus now on computing F(a) in Z_{ω^{n/2}+1}, where a is an n-dimensional vector over Z_{ω^{n/2}+1}, and n and ω are powers of 2. Clearly, we can use any FFT


algorithm presented in the subsections above, where the addition and multiplication should be replaced by addition and multiplication modulo ω^{n/2} + 1.

Let m = ω^{n/2} + 1. Any number a ∈ Z_m can be represented by a string of b bits, where

b = (n/2) log ω + 1.

There are log n iterations in FFT2, and each iteration computes n values by addition and multiplication modulo m. Therefore, there are O(n log n) operations.

Addition modulo m is of time complexity O(b). The only multiplication used is by ω^p, where 0 ≤ p < n. Such a multiplication is equivalent to a left shift by p log ω places (remember that ω is a power of 2). The resulting integer has at most 3b − 2 bits because

p log ω < n log ω = 2(b − 1)

(see the relation above) and the original integer has at most b bits.

The modular reduction x mod m we have to perform, where x is an integer, is based on the fact that x can be written in the form Σ_{i=0}^{l−1} x_i (ω^{n/2})^i (as usual, by splitting the binary representation of x into blocks of (n/2) log ω bits). Then, we use the following lemma.

Lemma 3.3.1  Σ_{i=0}^{l−1} x_i (ω^{n/2})^i ≡ Σ_{i=0}^{l−1} x_i (−1)^i mod m.

Proof  Remark that ω^{n/2} ≡ −1 mod m.

Therefore, the direct Fourier transform is of time complexity O(bn log n), i.e., O(n^2 (log n)(log ω)).

Let us take into consideration the inverse Fourier transform. It requires multiplication by ω^{−p} and by n^{−1} (these are computed modulo m). Since

ω^p ω^{n−p} ≡ 1 mod m,

we have that ω^{−p} = ω^{n−p}. Therefore, multiplication by ω^{−p} is multiplication by ω^{n−p}, which is a left shift of (n − p) log ω places (the resulting integer still has at most 3b − 2 bits).

Assume n = 2^k (we know that n is a power of 2). In order to find y such that

2^k y ≡ 1 mod m

we start from the remark that

ω^n ≡ 1 mod m.

Therefore, we may choose y such that 2^k y = ω^n, i.e., y = 2^{n log ω − k}. The integer y is the inverse of n modulo m. Multiplying by n^{−1} is equivalent to a left shift of n log ω − k places.

The time complexity of the inverse transform is the same as for the direct transform.
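The block-splitting reduction of Lemma 3.3.1 can be sketched as follows (Python; the block size L plays the role of the (n/2) log ω bits of the text, so that m = 2^L + 1 and 2^L ≡ −1 mod m):

```python
def reduce_mod(x, L):
    """Reduce x >= 0 modulo m = 2^L + 1 via the alternating block sum of
    Lemma 3.3.1: split x into L-bit blocks and add them with signs
    +, -, +, ..., since 2^L = -1 mod m."""
    m = (1 << L) + 1
    s, sign = 0, 1
    while x:
        s += sign * (x & ((1 << L) - 1))   # low-order L-bit block
        x >>= L
        sign = -sign
    return s % m                           # Python's % maps s into [0, m)
```

Only shifts, masks, and additions are used, so the reduction costs O(size of x) bit operations, which is what makes the operation counts above attainable.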


4 Division

Division deals with the problem of finding, for two integers a and b ≠ 0, a representation a = bq + r, where 0 ≤ r < |b|. We restrict ourselves to dividends a > 0 and divisors b > 0 because, for all the other cases, we have:

• if a < 0 and b > 0, then ⌊a/b⌋ = −⌊(−a)/b⌋ − 1 and a mod b = b − ((−a) mod b);

• if a


Theorem 4.1.1  Let a = (a_k, a_{k−1}, …, a_0)_β and b = (b_{k−1}, …, b_0)_β be integers such that q = ⌊a/b⌋ < β. Then, the integer

q̂ = min(⌊(a_k β + a_{k−1})/b_{k−1}⌋, β − 1)

satisfies the following:

(1) q̂ ≥ q;

(2) q̂ − 2 ≤ q, providing that b is normalized.

Proof  (1) Let a = qb + r, where r < b. If q̂ = β − 1, then q̂ ≥ q because q < β. Otherwise, q̂ = ⌊(a_k β + a_{k−1})/b_{k−1}⌋ and, therefore,

a_k β + a_{k−1} = q̂ b_{k−1} + r′,

where r′ < b_{k−1}. Thus,

a_k β + a_{k−1} = q̂ b_{k−1} + r′ ≤ q̂ b_{k−1} + b_{k−1} − 1.

We shall prove that a − q̂ b < b, which shows that q̂ ≥ q. We have:

a − q̂ b ≤ a − q̂ (b_{k−1} β^{k−1})
        = (a_k β^k + ⋯ + a_0) − (q̂ b_{k−1}) β^{k−1}
        ≤ (a_k β^k + ⋯ + a_0) − (a_k β + a_{k−1} − b_{k−1} + 1) β^{k−1}
        = a_{k−2} β^{k−2} + ⋯ + a_0 + b_{k−1} β^{k−1} − β^{k−1}
        < b_{k−1} β^{k−1}
        ≤ b.

(2) Assume that b_{k−1} ≥ ⌊β/2⌋, but q < q̂ − 2. Then,

q̂ ≤ (a_k β + a_{k−1})/b_{k−1} = (a_k β^k + a_{k−1} β^{k−1})/(b_{k−1} β^{k−1}) ≤ a/(b_{k−1} β^{k−1}),

which leads to

a ≥ q̂ b_{k−1} β^{k−1} > q̂ (b − β^{k−1})

(the integer b cannot be β^{k−1} because, otherwise, q̂ = q). Therefore,

q̂ < a/(b − β^{k−1}).

The relation q < q̂ − 2 leads to:

3 ≤ q̂ − q < a/(b − β^{k−1}) − a/b + 1 ≤ (a/b) · β^{k−1}/(b − β^{k−1}) + 1,

hence

a/b ≥ 2(b − β^{k−1})/β^{k−1} ≥ 2(b_{k−1} − 1).


Next, using q̂ ≤ β − 1, we obtain

β − 4 ≥ q̂ − 3 ≥ q = ⌊a/b⌋ ≥ 2(b_{k−1} − 1),

which shows that b_{k−1} ≤ β/2 − 1; a contradiction.

The following theorem follows easily from definitions.

Theorem 4.1.2  Let a and b = (b_{k−1}, …, b_0)_β be integers. Then, db is normalized and ⌊a/b⌋ = ⌊(da)/(db)⌋, where d = ⌊β/(b_{k−1} + 1)⌋.

If b is normalized, then the integer d in Theorem 4.1.2 is 1.
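The estimate of Theorem 4.1.1 is easy to test numerically. The sketch below uses β = 2^16 (our choice, purely for illustration), builds random normalized divisors, and checks q ≤ q̂ ≤ q + 2:

```python
import random

B = 2 ** 16                              # the base (beta in the text)

def digits(x, nd):
    """Little-endian base-B digits of x, padded to nd digits."""
    return [(x >> (16 * i)) & (B - 1) for i in range(nd)]

def q_hat(a_dig, b_dig):
    """The estimate min((a_k*B + a_{k-1}) // b_{k-1}, B - 1)."""
    return min((a_dig[-1] * B + a_dig[-2]) // b_dig[-1], B - 1)

random.seed(7)
for _ in range(1000):
    # normalized 3-digit divisor: top digit b_{k-1} >= B/2
    b = (random.randint(B // 2, B - 1) << 32) | random.getrandbits(32)
    q = random.randint(0, B - 1)             # the true quotient, q < B
    a = q * b + random.randint(0, b - 1)     # so that q = a // b
    qh = q_hat(digits(a, 4), digits(b, 3))
    assert q <= qh <= q + 2
```

Without the normalization of b the lower bound q̂ ≤ q + 2 genuinely fails, which is why the scaling by d of Theorem 4.1.2 is applied first.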

Let a = (a_k, a_{k−1}, …, a_0)_β and b = (b_{k−1}, …, b_0)_β be integers such that q = ⌊a/b⌋ < β. The theorems above say that, in order to determine q, one can do as follows:

• replace a by da and b by db (b is normalized now and the quotient is not changed);

• compute q̂ and try successively q̂, q̂ − 1 and q̂ − 2 in order to find q (one of them must be q). If a − bq̂ ≥ 0, then q = q̂; otherwise, we compute a − b(q̂ − 1). If a − b(q̂ − 1) ≥ 0 then q = q̂ − 1 else q = q̂ − 2.

There is a better quotient test than the above one, which avoids making too many multiplications. It is based on the following theorem.

Theorem 4.1.3  Let a = (a_k, a_{k−1}, …, a_0)_β > 0 and b = (b_{k−1}, …, b_0)_β > 0 be integers, and q = ⌊a/b⌋. Then, for any positive integer q̃ the following hold:

(1) If q̃ b_{k−2} > (a_k β + a_{k−1} − q̃ b_{k−1})β + a_{k−2}, then q < q̃;

(2) If q̃ b_{k−2} ≤ (a_k β + a_{k−1} − q̃ b_{k−1})β + a_{k−2}, then q > q̃ − 2.

Proof  (1) Assume that q̃ b_{k−2} > (a_k β + a_{k−1} − q̃ b_{k−1})β + a_{k−2}. Then,

a − q̃ b ≤ a − q̃ (b_{k−1} β^{k−1} + b_{k−2} β^{k−2})
        < a_k β^k + a_{k−1} β^{k−1} + a_{k−2} β^{k−2} + β^{k−2} − q̃ (b_{k−1} β^{k−1} + b_{k−2} β^{k−2})
        = β^{k−2} ((a_k β + a_{k−1} − q̃ b_{k−1})β + a_{k−2} − q̃ b_{k−2} + 1)
        ≤ 0

(the last inequality follows from the hypothesis). Therefore, q < q̃.

(2) Assume that q̃ b_{k−2} ≤ (a_k β + a_{k−1} − q̃ b_{k−1})β + a_{k−2}, but q ≤ q̃ − 2. Then,

a < (q̃ − 1) b
  < q̃ (b_{k−1} β^{k−1} + b_{k−2} β^{k−2}) + β^{k−1} − b
  ≤ q̃ b_{k−1} β^{k−1} + ((a_k β + a_{k−1} − q̃ b_{k−1})β + a_{k−2}) β^{k−2} + β^{k−1} − b
  = a_k β^k + a_{k−1} β^{k−1} + a_{k−2} β^{k−2} + β^{k−1} − b
  ≤ a_k β^k + a_{k−1} β^{k−1} + a_{k−2} β^{k−2}
  ≤ a,


which is a contradiction (the third inequality follows from the hypothesis).

Before describing a division algorithm based on the above ideas, one more thing has to be treated carefully: the requirement ⌊a/b⌋ < β. Let a = (a_k, a_{k−1}, …, a_0)_β and b = (b_{k−1}, …, b_0)_β. We have two cases:

1. If b is normalized, then the integers a′ = (0, a_k, …, a_1)_β and b satisfy ⌊a′/b⌋ < β;

2. If b is not normalized and d is chosen as in Theorem 4.1.2, then the integers (da)[k + 1 : 1] (consisting of the k + 1 most significant digits of da) and db satisfy ⌊(da)[k + 1 : 1]/(db)⌋ < β.

The algorithm SchoolDiv1 applies all these ideas.

SchoolDiv1(a, b)
input: a = (a_{k+m−1}, …, a_0)_β > 0 and b = (b_{k−1}, …, b_0)_β > 0, where k ≥ 2 and m ≥ 1;
output: q = (q_m, …, q_0)_β and r such that a = bq + r and 0 ≤ r < b;
begin
1.  d := ⌊β/(b_{k−1} + 1)⌋;
2.  a := da;
3.  let a = (a_{k+m}, …, a_0)_β (a_{k+m} = 0 if d = 1);
4.  b := db;
5.  let b = (b_{k−1}, …, b_0)_β;
6.  x := (a_{k+m}, …, a_m)_β;
7.  for i := m downto 0 do
    begin
8.    if b_{k−1} = x_k then q̃ := β − 1 else q̃ := ⌊(x_k β + x_{k−1})/b_{k−1}⌋;
9.    if b_{k−2} q̃ > (x_k β + x_{k−1} − b_{k−1} q̃)β + x_{k−2}
10.   then begin
11.     q̃ := q̃ − 1;
12.     if b_{k−2} q̃ > (x_k β + x_{k−1} − b_{k−1} q̃)β + x_{k−2}
13.     then q_i := q̃ − 1
14.     else if x − bq̃


SchoolDiv2(a, b)
input: a = (a_{k−1}, …, a_0)_β > 0 and 0 < b < β, where k ≥ 1;
output: q = (q_{k−1}, …, q_0)_β and r such that a = bq + r and 0 ≤ r < b;
begin
1.  r := 0;
2.  for i := k − 1 downto 0 do
    begin
3.    q_i := ⌊(rβ + a_i)/b⌋;
4.    r := (rβ + a_i) mod b;
    end
end.
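SchoolDiv2 translates almost literally into Python (a sketch with digits stored little-endian; a hypothetical helper, not MpNT code):

```python
def school_div2(a_dig, b, B=2 ** 16):
    """SchoolDiv2: divide a, given by its little-endian base-B digits,
    by a single-digit divisor 0 < b < B; returns (q_dig, r) with
    a = b*q + r and 0 <= r < b."""
    q_dig = [0] * len(a_dig)
    r = 0
    for i in reversed(range(len(a_dig))):   # lines 2-4: high digit first
        cur = r * B + a_dig[i]              # r < b < B, so cur < b*B
        q_dig[i], r = divmod(cur, b)        # hence q_dig[i] < B
    return q_dig, r
```

The invariant r < b guarantees that each partial dividend rβ + a_i fits in two digits and each quotient digit fits in one, so no trial-and-error on the quotient digit is needed in this single-digit case.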

Let us discuss now the time complexity of division. Denote by D(n) the cost of division of a 2n-bit integer by an n-bit integer.

In order to obtain the quotient ⌊a/b⌋, where a is a 2n-bit integer and b is an n-bit integer, we write

⌊a/b⌋ = ⌊(a/2^{2n−1}) · (2^{2n−1}/b)⌋.

The factor a/2^{2n−1} can be easily obtained by shifts. Therefore, we have to estimate the cost of getting the quotient ⌊2^{2n−1}/b⌋. Denote by C(n) this cost.

We may assume, without loss of generality, that n is even, and let n = 2m and b = 2^m b′ + b′′, where b′ ≥ 2^{m−1}. Then, we have

⌊2^{2n−1}/b⌋ = ⌊2^{4m−1}/(2^m b′ + b′′)⌋ = ⌊(2^{4m−1}/(2^m b′)) · (2^m b′/(2^m b′ + b′′))⌋ = ⌊(2^{3m−1}/b′) · (2^m b′/(2^m b′ + b′′))⌋.

Let α = 2^m b′/(2^m b′ + b′′) and x = b′′/(2^m b′). It is easy to see that α = 1/(x + 1), which can be written as

α = 1 − x + x^2 − ⋯

Thus,

2^{2n−1}/b = (2^{3m−1}/b′)(1 − x + x^2 − ⋯)

Since 0 ≤ x


According to these formulas, the quotient q can be obtained from a, from half-size quotients such as ⌊2^{3m−1}/b′⌋ and ⌊(2^{2m−1} b′′)/(b′^2)⌋ (which account for the leading terms of the expansion), together with a few half-size multiplications and some extra simple operations. The cost of these auxiliary operations is linear in n. Therefore, we can write

C(n) ≤ 2C(n/2) + 3M(n/2) + cn,

for some positive constant c. Assuming M(2k) ≥ 2M(k) for all k, we obtain

C(n) = O(M(n)) + O(n ln n).

If we add the hypothesis M(n) ≥ cn ln n for some positive constant c, the previous estimate leads to

C(n) = O(M(n)).

Therefore, D(n) = O(M(n)), which shows that the cost of the division is dominated by the cost of multiplication, up to a multiplicative constant.

4.2 Recursive Division

The Karatsuba multiplication is based on partitioning the integers and performing more multiplications with smaller integers. A similar idea can be applied to division as well, in the sense that the quotient q = ⌊a/b⌋ can be obtained by q = q_1 β^k + q_0, where q_1 = ⌊a_1/b⌋, q_0 = ⌊(r_1 β^k + a_0)/b⌋, and a = a_1 β^k + a_0, for some k. This method was discovered in 1972 by Moenck and Borodin [43], who expressed it only in the case of O(n^{1+ε}) multiplication. Later, in 1997, Jebelean [32] expressed the same algorithm in the Karatsuba case, and finally, in 1998, Burnikel and Ziegler [10] worked out the details for a practical implementation. This algorithm, which should be called the Moenck-Borodin-Jebelean-Burnikel-Ziegler division algorithm as Zimmermann pointed out in [71], provides the fastest solution to the division problem.

Let a > 0 and b > 0 be integers. Then, for any k ≥ 1 there are integers a_1, a_0, q_1, q_0, r_1, and r_0 such that the following are true:

a = a_1 β^k + a_0                         0 ≤ a_0 < β^k
  = (q_1 b + r_1) β^k + a_0               0 ≤ r_1 < b
  = q_1 β^k b + (r_1 β^k + a_0)
  = q_1 β^k b + q_0 b + r_0               0 ≤ r_0 < b
  = (q_1 β^k + q_0) b + r_0.

Therefore, the quotient of the division of a by b can be obtained as follows: decompose a into two blocks a_1 and a_0 (a = a_1 β^k + a_0), find q_1 = ⌊a_1/b⌋ and q_0 = ⌊(r_1 β^k + a_0)/b⌋, and then combine these two quotients via β^k.

The algorithm D2/1 exploits this simple idea. It finds the quotient of the division of an at most 2m-digit integer by an m-digit integer (that is, the base-β representation of the dividend is at most double that of the divisor). The constant m_0 used in this algorithm depends on the machine and implementation, and it is assumed that school division is fast for m ≤ m_0. The algorithm D3/2, also used in D2/1, outputs the quotient and the remainder of the division of two integers x and y whose lengths satisfy 2|x| ≤ 3|y| (it will be described later).


D2/1(a, b)
input: a > 0 and b > 0 such that ⌊a/b⌋ < β^m and β^m/2 ≤ b < β^m;
output: q = ⌊a/b⌋ and r = a − qb;
begin
1.  if (m is odd) or (m ≤ m_0) then SchoolDiv(a, b)
    else begin
2.    decompose a = a_1 β^{m/2} + a_0, where a_0 < β^{m/2};
3.    (q_1, r_1) := D3/2(a_1, b);
4.    (q_0, r_0) := D3/2(r_1 β^{m/2} + a_0, b);
5.    q := q_1 β^{m/2} + q_0;
6.    r := r_0;
    end
end.

Both requirements ⌊a/b⌋ < β^m and β^m/2 ≤ b < β^m in the algorithm D2/1 are particularly important for the algorithm D3/2 (see the theorem below).

In order to describe the algorithm D3/2 we need the following theoretical result, whose proof can be found in [10].

Theorem 4.2.1  Let a > 0, b > 0, and 0 < k

D3/2(a, b)
input: a > 0 and b > 0 such that ⌊a/b⌋ < β^m and β^{2m}/2 ≤ b < β^{2m};
output: q = ⌊a/b⌋ and r = a − qb;
begin
1.  decompose a = a_2 β^{2m} + a_1 β^m + a_0, where a_2, a_1, a_0 < β^m;
2.  decompose b = b_1 β^m + b_0, where b_0 < β^m;
3.  if a_2 < b_1 then (q, r) := D2/1(a_2 β^m + a_1, b_1)
4.  else begin q := β^m − 1; r := (a_2 β^m + a_1) − q b_1 end;
5.  d := q b_0;
6.  r := r β^m + a_0 − d;
7.  while r < 0 do
    begin
8.    r := r + b;
9.    q := q − 1;
    end
end.


The multiplication in line 5 can be done by the Karatsuba algorithm. Lines 3 and 4 estimate the quotient, and lines 5–9 test it for the real value.

Let us now analyze the running time of algorithm D2/1. Let M(m) denote the time to multiply two m-digit integers, and D(m) the time to divide a 2m-digit integer by an m-digit integer. The operations in lines 1 and 2 are done in O(m) (or even in time O(1) if we use clever pointer arithmetic and temporary space management). D3/2 makes a recursive call in line 3 which takes D(m/2). The backmultiplication q b_0 in line 5 takes M(m/2). The remaining additions and subtractions take O(m). Let c be a convenient constant so that the overall time for all splitting operations, additions, and subtractions is less than cm. D3/2 is called twice by D2/1, so we have the recursion

D(m) ≤ 2D(m/2) + 2M(m/2) + cm
     ≤ 2[2D(m/4) + 2M(m/4) + cm/2] + 2M(m/2) + cm
     = 4D(m/4) + 4M(m/4) + 2M(m/2) + 2cm
     ⋯
     ≤ 2^{log m} D(1) + Σ_{i=1}^{log m} 2^i M(m/2^i) + Σ_{i=1}^{log m} cm
     = O(m) + Σ_{i=1}^{log m} 2^i M(m/2^i) + O(m log m).

At this point we see that the running time depends on the multiplication method we use. If we use the ordinary school method, we have M(m) = m^2 and

D(m) = m^2 + O(m log m),

which is even a little bit worse than school division.

If we use Karatsuba multiplication with M(m) = K(m) = O(m^{log_2 3}), the above sum has the upper bound

D(m) = 2K(m) + O(m log m),

because in this case K(m/2^i) = K(m)/3^i.

We consider now the problem of dividing an arbitrary n-digit integer a by an arbitrary m-digit integer b. The first idea would be to pad a with zeros to the left such that its length becomes a multiple of m. Let n′ be the length of a after padding with zeros, and let [u_{k−1}, …, u_0]_{β^m} be its decomposition into blocks of length m each, where n′ = mk. Then, the idea is to apply D2/1 to (u_{k−1}) β^m + (u_{k−2}) and b, getting a quotient q_{k−2} and a remainder r_{k−2}, then to apply D2/1 to (r_{k−2}) β^m + (u_{k−3}) and b, getting a quotient q_{k−3} and a remainder r_{k−3}, and so on. But, we notice that D2/1(x, y) calls D3/2(x′, y) which, in turn, calls D2/1(x′′, y_1), for some x′ and x′′, where y_1 is the high-order half of y. Therefore, the length of y should be even. The same holds for y_1 too. That is, a call of D2/1 with y_1 as a divisor leads to a call of D3/2 and then to a call of D2/1 with the divisor y_{11}, the high-order half of y_1. Therefore, the length of y_1 should be even. The recursive calls stop when the divisor length is less than or equal to the division limit m_0, when school division is faster.

As a conclusion, the divisor b should be padded with zeros to the left until its length becomes of the form 2^j l with l ≤ m_0. Then, a should be padded with zeros to the left until its length becomes a multiple of m′.

All these lead to the algorithm RecursiveDiv.


RecursiveDiv(a, b)
input: a > 0 of length n and b > 0 of length m;
output: q = ⌊a/b⌋ and r = a − qb;
begin
1.  j := min{i | 2^i m_0 > m};
2.  l := ⌈m/2^j⌉;
3.  m′ := 2^j l;
4.  d := max{i | 2^i b < β^{m′}};
5.  b := b 2^d;
6.  a := a 2^d;
7.  k := min{i ≥ 2 | a < β^{im′}/2};
8.  decompose a = a_{k−1} β^{m′(k−1)} + ⋯ + a_1 β^{m′} + a_0;
9.  x := a_{k−1} β^{m′} + a_{k−2};
10. for i := k − 2 downto 0 do
    begin
11.   (q_i, r_i) := D2/1(x, b);
12.   if i > 0 then x := r_i β^{m′} + a_{i−1};
    end;
13. q := q_{k−2} β^{m′(k−2)} + ⋯ + q_1 β^{m′} + q_0;
14. r := r_0/2^d;
end.

Line 5 in algorithm RecursiveDiv normalizes b, while line 6 shifts a by the same amount as b.

Let us analyze the running time of RecursiveDiv. First of all we notice that k ≤ max{2, ⌈(n + 1)/m′⌉} (using the notation in the algorithm). Indeed, we may assume that the highest digit b_{s−1} of b is nonzero. Then, our choice of k ensures k ≤ ⌈(n + m′ − m + 1)/m′⌉ because, after shifting, a < β^{n+m′−m+1}/2. If n ≤ m + 1, we clearly have k = 2. Otherwise, it is easy to see that the following is true

⌈(n + m′ − m + 1)/m′⌉ ≤ ⌈(n + 1)/m′⌉,

which proves that k ≤ max{2, ⌈(n + 1)/m′⌉}.

The algorithm D2/1 is called k − 1 times and, because m′ ≤ 2m, every execution of D2/1 takes time

D(m′) ≤ D(2m) ≤ 2K(2m) + O(2m log 2m).

With K(m) = c m^{log 3}, for some constant c, we conclude that the overall running time is

O((n/m)(K(m) + O(m log m))) = O(n m^{log 3 − 1} + n log m),

which is a remarkable asymptotic improvement over the Θ(nm) time bound for ordinary school division.


5 The Greatest Common Divisor

The greatest common divisor (GCD) of two integers a and b, not both zero, denoted by gcd(a, b), is the largest integer d such that d divides both a and b. In particular, gcd(a, 0) = gcd(0, a) = |a|, and gcd(a, b) = gcd(|a|, |b|). Hence, we can always assume that a and b are non-negative.

The most well-known algorithm for computing GCD is the Euclidean algorithm⁴. In 1938, Lehmer [39] proposed a method to apply the Euclidean algorithm to large integers. Then, in 1967, Stein [60] discovered a GCD algorithm well suited for single-precision integers employing binary arithmetic. All of these algorithms have O(n^2) time complexity on two n-bit integers. In 1971, Schönhage [55] described an O(n (log n)^2 (log log n)) time complexity GCD algorithm using FFT, but the algorithm is not considered practical. With the emphasis on parallel algorithms, Sorenson [58] generalized in 1994 the binary algorithm, proposing the k-ary GCD algorithm. This algorithm (and its variants) has practical and efficient sequential versions of time complexity O(n^2/log n).

5.1 The Euclidean GCD Algorithm

Euclid's algorithm for computing the greatest common divisor of two integers is probably the oldest and most important algorithm in number theory. In [36], Knuth classifies this algorithm as

... the granddaddy of all algorithms, because it is the oldest nontrivial algorithm that has survived to the present day.

Based on the remark that gcd(a, b) = gcd(b, a mod b), the Euclidean algorithm consists in performing successive divisions until the remainder becomes 0. The last non-zero remainder is gcd(a, b). The algorithm is given below.

Euclid(a, b)
input: a ≥ b > 0;
output: gcd(a, b);
begin
1.  while b > 0 do
    begin
2.    r := a mod b; a := b; b := r;
    end
3.  gcd(a, b) := a
end.

The number of division steps performed by the algorithm Euclid was the subject of some very interesting research. Finally, in 1844, Lamé [38] was able to determine the inputs which elicit the worst-case behavior of the algorithm.

Theorem 5.1.1 (Lamé, 1844 [38])  Let a ≥ b > 0 be integers. The number of division steps performed by Euclid(a, b) does not exceed 5 times the number of decimal digits of b.

⁴ Described in Book VII of Euclid's Elements.


Proof  Let r_0 = a, r_1 = b, and

r_0 = r_1 q_1 + r_2,               0 < r_2 < r_1
r_1 = r_2 q_2 + r_3,               0 < r_3 < r_2
⋯
r_{n−2} = r_{n−1} q_{n−1} + r_n,   0 < r_n < r_{n−1}
r_{n−1} = r_n q_n + r_{n+1},       r_{n+1} = 0

be the division steps performed by the algorithm Euclid(a, b). Then, gcd(a, b) = r_n. We remark that q_n ≥ 2 and q_i ≥ 1, for all 1 ≤ i ≤ n − 1.

Let (F_i)_{i≥1} be the Fibonacci sequence, given by F_1 = F_2 = 1 and

F_{i+2} = F_{i+1} + F_i,

for all i ≥ 1. Then, r_n ≥ 1 = F_2 and r_{n−1} ≥ 2 = F_3, which lead to

r_{n−2} ≥ r_{n−1} + r_n ≥ F_2 + F_3 = F_4.

By induction on i we can prove easily that r_i ≥ F_{n−i+2}, for all i. Therefore, b = r_1 ≥ F_{n+1}. Now, using the fact that

F_{i+2} > R^i,

for all i ≥ 1, where R = (1 + √5)/2, we get

b = r_1 ≥ F_{n+1} > R^{n−1},

which leads to

log_{10} b > (n − 1) log_{10} R > (n − 1)/5,

by the fact that log_{10} R > 1/5. Therefore, n < 5 log_{10} b + 1, which proves the theorem.
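The algorithm and the bound of Theorem 5.1.1 can be exercised together; as the proof suggests, consecutive Fibonacci numbers are the worst case (a Python sketch, not library code):

```python
def euclid_steps(a, b):
    """Euclid's algorithm, also counting the division steps performed."""
    steps = 0
    while b > 0:
        a, b = b, a % b
        steps += 1
    return a, steps

# Worst case: consecutive Fibonacci numbers (cf. the proof above).
F = [1, 1]
while len(F) < 40:
    F.append(F[-1] + F[-2])
for i in range(2, 39):
    g, s = euclid_steps(F[i + 1], F[i])
    # Lame's bound: at most 5 times the number of decimal digits of b
    assert g == 1 and s <= 5 * len(str(F[i]))
```

For example, euclid_steps(252, 198) returns (18, 4): remainders 54, 36, 18, 0.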

Let N ≥ 0 be a natural number. The Euclidean algorithm performs O(log N) division steps on inputs a and b with 0 < b ≤ a ≤ N. The naive bit complexity of a division step with integers less than N is O((log N)^2). Therefore, the time complexity of Euclid's algorithm is O((log N)^3).

A careful analysis shows that the Euclidean algorithm performs better. Dividing r_i by r_{i+1} to get a quotient of q_{i+1} and a remainder of r_{i+2} can be done in O((log r_{i+1})(log q_{i+1})) (using the notation in the proof of Theorem 5.1.1). Then,

Σ_{i=0}^{n−1} (log r_{i+1})(log q_{i+1}) ≤ (log b) Σ_{i=0}^{n−1} log q_{i+1} ≤ (log b)(log q_1 ⋯ q_n).

It is not hard to see that

q_1 ⋯ q_n ≤ a,

which leads to

Theorem 5.1.2  Let a ≥ b > 0. The time complexity of Euclid's algorithm on inputs a and b is O((log a)(log b)).


It is well-known that the greatest common divisor of two numbers a and b can be written as a linear combination gcd(a, b) = ua + vb, for some u, v ∈ Z. Two integers u and v satisfying this property can be easily computed using Euclid's algorithm. This is very important in practice, for example in solving linear diophantine equations ax + by = c, where a, b, c ∈ Z, or in computing the modular multiplicative inverse. Let us give a few details about these two facts.

It is well-known that a linear diophantine equation ax + by = c with integer coefficients has solutions (in Z) if and only if gcd(a, b) | c. If this happens, then x = uc′ and y = vc′ is a solution, where gcd(a, b) = ua + vb and c = gcd(a, b) c′.

Let a ∈ Z and m ≥ 2. It is known that a has an inverse modulo m if and only if gcd(a, m) = 1. Finding an inverse modulo m of a, when it exists, reduces to solving the equation

ax ≡ 1 mod m

or, equivalently,

ax − my = 1

in indeterminates x and y. This equation has solutions (in Z) if and only if gcd(a, m) = 1 and, if this happens, any solution x = x_0 of it is an inverse modulo m of a.

Let us go back now to the problem of finding a linear combination of gcd(a, b). If we analyze the remainder sequence in Euclid's algorithm we notice that:

r_0 = 1 · a + 0 · b
r_1 = 0 · a + 1 · b
r_2 = r_0 − r_1 q_1 = 1 · a + (−q_1) · b
r_3 = r_1 − r_2 q_2 = (−q_2) · a + (1 + q_1 q_2) · b

and so on (using the notation in the proof of Theorem 5.1.1). In other words, along with getting the remainder (at a step) we can also find a linear combination of it (with respect to a and b).

What needs to be done next is to find an elegant way of writing the remainders' linear combination based on the ones in the previous steps. If to each element x that appears in the remainder sequence above we associate a 2-ary vector V_x = (u, v) that gives the linear combination of it with respect to a and b, i.e., x = ua + vb, then the linear combination of the remainders can be determined by:

V_{r_0} = (1, 0)
V_{r_1} = (0, 1)
r_2 = r_0 − r_1 q_1            V_{r_2} = V_{r_0} − q_1 V_{r_1}
r_3 = r_1 − r_2 q_2            V_{r_3} = V_{r_1} − q_2 V_{r_2}
⋯
r_n = r_{n−2} − r_{n−1} q_{n−1}    V_{r_n} = V_{r_{n−2}} − q_{n−1} V_{r_{n−1}}

In this way we can determine gcd(a, b) as well as a linear combination (with respect to a and b) for it. Thus, we have obtained Euclid's Extended Algorithm, EuclidExt.


EuclidExt(a, b)
input: a ≥ b > 0;
output: gcd(a, b) and V = (u, v) such that gcd(a, b) = ua + vb;
begin
1.  V0 := (1, 0);
2.  V1 := (0, 1);
3.  while b > 0 do
    begin
4.    q := a div b; r := a mod b; a := b; b := r;
5.    V := V0; V0 := V1; V1 := V − qV1;
    end
6.  gcd(a, b) := a;
7.  V := V0;
end.

The correctness of EuclidExt can be proved immediately, based on the specifications above. It has the same complexity as Euclid's algorithm. More precisely, at each step, besides a division, we perform two multiplications (a multiplication having the same complexity as a division) and two subtractions (the complexity of a subtraction being linear in the maximum length of the binary representations of the operands).

The sequences (ui)_{i≥0} and (vi)_{i≥0} given by Vri = (ui, vi) for all 0 ≤ i ≤ n are called the cosequences associated to r0 = a and r1 = b (using the notation above). These sequences can be computed independently of each other due to the recurrence equation they satisfy, that is,

    Vr_{i+2} = Vri − q_{i+1}Vr_{i+1},

for all 0 ≤ i ≤ n − 2. This is important because, in some cases, computing only one of the cosequences can speed up the whole computation. For example, computing only the first cosequence (ui)_{i≥0}, we get u as its last but one element. Then, v = (gcd(a, b) − ua)/b.

We close the section by mentioning once more that three very important kinds of sequences are associated with two integers a ≥ b > 0:

• the quotient sequence (qi)_{i≥1},
• the remainder sequence (ri)_{i≥0}, and
• two cosequences (ui)_{i≥0} and (vi)_{i≥0}.

These sequences are intensively used in order to improve the Euclidean algorithm.

5.2 Lehmer's GCD Algorithm

For multi-precision inputs, the Euclidean algorithm is not the best choice in practice. This is because a multi-precision division operation is relatively expensive. In 1938, Lehmer [39] proposed an interesting method by which consecutive


multi-precision division steps are replaced by consecutive single-precision division steps. In order to understand his method let us recall that the Euclidean GCD algorithm is based on computing a remainder sequence,

    r0 = a
    r1 = b
    ri = r_{i−2} mod r_{i−1},

for all 2 ≤ i ≤ n + 1 and any inputs a ≥ b > 0, where rn ≠ 0 and r_{n+1} = 0. Then, gcd(a, b) = rn.

A quotient sequence is implicitly computed along with this remainder sequence,

    q_{i+1} = ⌊ri/r_{i+1}⌋,

for all 0 ≤ i ≤ n − 1.

In general, if a and b are multi-precision numbers, many of the computations above are multi-precision division steps, and in most cases the quotients are single-precision integers. Finding the quotient q of two multi-precision integers a and b is much more expensive than computing a − qb, assuming that q is a single-precision integer.

Therefore, if we are able to find the quotient sequence q1, . . . , qn in a less expensive way, then we have the chance to significantly improve the speed of the Euclidean algorithm.

Given two multi-precision integers a and b, Lehmer had the idea to consider two single-precision approximations r̄0 and r̄1 of a and b, and to compute a quotient sequence q̄1, . . . , q̄i as long as q̄1 = q1, . . . , q̄i = qi. Remark that computing such a quotient sequence involves only single-precision division steps. If this quotient sequence is computed, then the integers a and b can be replaced by corresponding smaller integers and the procedure can be repeated. Three main problems are to be settled:

1. how to choose r̄0 and r̄1;

2. how to test the equality q̄j = qj without computing qj (a test for doing that will be called a quotient test);

3. how to compute the new a and b values when the quotient sequence q̄1, . . . , q̄i cannot be extended any longer (i.e., q̄_{i+1} ≠ q_{i+1}).

Lehmer's approach to the problems above is the following. Let a ≥ b > 0 be two multi-precision base-β integers. Assume that h is chosen such that ā = ⌊a/h⌋ < p and b̄ = ⌊b/h⌋ < p are single-precision integers.

Then, consider r̄0 = ā + A1 and r̄1 = b̄ + C1, where A1 = 1 and C1 = 0. In order to check that the quotient q̄1 = ⌊r̄0/r̄1⌋ is the same as the quotient q1 of r0 and r1, we make use of inequalities comparing it with the quotient ⌊(ā + B1)/(b̄ + D1)⌋, where B1 = 0 and D1 = 1: when the two quotients agree, they also equal q1. Repeating this test with suitably updated coefficients A, B, C, D leads to the algorithm Lehmer1 below.

Lehmer1(a, b, p)
input: a ≥ b > 0 two multi-precision base-β integers, and p > 0 such that all x < p are single-precision;
output: gcd(a, b) and V = (u, v) such that gcd(a, b) = ua + vb;
begin
1.  u0 := 1; u1 := 0;
2.  a′ := a; b′ := b;
3.  while b′ > p do
    begin
4.      let h be the smallest integer such that ā := ⌊a′/h⌋ < p;
5.      b̄ := ⌊b′/h⌋;
6.      A := 1; B := 0; C := 0; D := 1;
7.      while b̄ + C ≠ 0 and b̄ + D ≠ 0 do
        begin
8.          q := ⌊(ā + A)/(b̄ + C)⌋; q′ := ⌊(ā + B)/(b̄ + D)⌋;
9.          if q ≠ q′ then break;
10.         x := A − qC; A := C; C := x;
11.         x := B − qD; B := D; D := x;
12.         x := ā − q·b̄; ā := b̄; b̄ := x;
        end;
13.     if B = 0
        then begin
14.         X := ⌊a′/b′⌋; Y := a′ mod b′; a′ := b′; b′ := Y;
15.         x := u0 − X·u1; u0 := u1; u1 := x;
        end
        else begin
16.         X := A·a′ + B·b′; Y := C·a′ + D·b′; a′ := X; b′ := Y;
17.         X := A·u0 + B·u1; Y := C·u0 + D·u1; u0 := X; u1 := Y;
        end
    end
18. (gcd(a′, b′), (u, v)) := EuclidExt(a′, b′);
19. u := u·u0 + v·u1; v := (gcd(a, b) − u·a)/b
end.

Collins' Approach. Lehmer's quotient test is based on computing one more quotient (line 8 in Lehmer1). In 1980, Collins [14] developed a better test which requires only the computation of the sequences (q̄i), (r̄i), and (vi) associated to r̄0 = ā and r̄1 = b̄. The test is

    if (∀i ≤ k)(r̄_{i+1} ≥ |v_{i+1}| and r̄i − r̄_{i+1} ≥ |v_{i+1} − vi|), then (∀i ≤ k)(q̄i = qi).

Collins' test has the advantage that only one quotient has to be computed at each step, and only one of the cosequences.


Jebelean's Approach. In 1993, Jebelean [31] refined Collins' quotient test, getting an exact quotient test based on the sequences (q̄i), (r̄i), (ui), and (vi) associated to r̄0 = ā and r̄1 = b̄. The test is

    q̄i = qi for all 1 ≤ i ≤ k iff for all 1 ≤ i ≤ k the following are true:
    1. r̄_{i+1} ≥ −u_{i+1} and r̄i − r̄_{i+1} ≥ v_{i+1} − vi, if i is even, and
    2. r̄_{i+1} ≥ −v_{i+1} and r̄i − r̄_{i+1} ≥ u_{i+1} − ui, if i is odd

(the proof can be found in [31]).

The algorithm Lehmer2 below is a version of Lehmer's algorithm based on Jebelean's quotient test. The algorithm is taken from [59].

Lehmer2(a, b, p)
input: a ≥ b > 0 two multi-precision base-β integers, and p > 0 such that all x < p are single-precision;
output: gcd(a, b);
begin
1.  while b ≠ 0 do
    begin
2.      if |a| − |b| ≤ |p|/2 then
        begin
3.          h := β^(|a| − |p|);
4.          ā := ⌊a/h⌋;
5.          b̄ := ⌊b/h⌋;
6.          (u0, v0) := (1, 0);
7.          (u1, v1) := (0, 1);
8.          i := 0;
9.          done := false;
10.         repeat
11.             q := ⌊ā/b̄⌋;
12.             (x, y) := (u0, v0) − q·(u1, v1);
13.             (ā, b̄) := (b̄, ā − q·b̄);
14.             i := i + 1;
15.             if i is even
16.             then done := (b̄ < −x or ā − b̄ < y − v1)


5.3 The Binary GCD Algorithm

The binary GCD algorithm computes gcd(a, b) using only subtractions and divisions by 2 (shifts), based on the following identities:

1. If a and b are both even, then gcd(a, b) = 2·gcd(a/2, b/2);
2. If a is even and b is odd, then gcd(a, b) = gcd(a/2, b);
3. If a and b are both odd, then gcd(a, b) = gcd(|a − b|/2, b);
4. gcd(a, 0) = |a|.

The idea is to apply the first identity until one of a and b is odd. Then, identities 2 and 3 are applied, making a and b smaller each time. Since a or b is always odd from this point on, we shall never apply the first identity again. Eventually, one of a and b becomes zero, allowing the application of the last identity.

An explicit description of the algorithm is given below.

Binary(a, b)
input: a ≥ b > 0 integers;
output: gcd(a, b);
begin
1.  x := 1;
2.  while (a mod 2 = 0) and (b mod 2 = 0) do
    begin
3.      a := a/2; b := b/2; x := 2x;
    end
4.  while a ≠ 0 do
    begin
5.      while (a mod 2 = 0) do a := a/2;
6.      while (b mod 2 = 0) do b := b/2;
7.      y := |a − b|/2;
8.      if a ≥ b then a := y else b := y;
    end
9.  gcd(a, b) := x·b;
end.

The main loop of the algorithm is the one at line 4. Since each iteration of this loop reduces the product ab by a factor of 2 or more, the total number of iterations of the main loop is O(log ab). As each iteration can be performed in O(log ab) time, the total time complexity of the algorithm is O((log ab)²).

The binary GCD algorithm can be extended in order to find a linear combination of the GCD, based on the following remarks:

1. let a ≥ b > 0 be integers;

2. the algorithm Binary(a, b) transforms a and b until one of them becomes 0. Then, the other one, multiplied by a suitable power of two, is gcd(a, b);

3. if we start the algorithm with a linear combination of a and a linear combination of b, usually (1, 0) and (0, 1) respectively, and transform these linear combinations in accordance with the transformations applied to a and b, then we finally get a linear combination of the non-zero factor (which gives a linear combination of gcd(a, b)).

More precisely:


(a) start initially with (1, 0) and (0, 1) as linear combinations for a and b, respectively;

(b) if both a and b are even, then divide them by 2 without changing their linear combinations. This is based on the fact that any linear combination of gcd(a/2, b/2) is a linear combination of gcd(a, b), and vice-versa;

(c) assume that a is even and b is odd, and their linear combinations are (u0, v0) and (u1, v1), respectively. Then,

    i. (u0/2, v0/2) is a linear combination of a/2, if u0 and v0 are even;
    ii. ((u0 + b)/2, (v0 − a)/2) is a linear combination of a/2, if u0 or v0 is odd.

(d) assume that a and b are odd, a ≥ b, and their linear combinations are (u0, v0) and (u1, v1), respectively. Then, (u0 − u1, v0 − v1) is a linear combination of a − b.

These remarks lead to the algorithm BinaryExt given below.

BinaryExt(a, b)
input: a ≥ b > 0 integers;
output: gcd(a, b) and V = (u, v) such that gcd(a, b) = ua + vb;
begin
1.  x := 1;
2.  while (a mod 2 = 0) and (b mod 2 = 0) do
    begin
3.      a := a/2; b := b/2; x := 2x;
    end
4.  u0 := 1; v0 := 0; u1 := 0; v1 := 1;
5.  a′ := a; b′ := b;
6.  while a′ ≠ 0 do
    begin
7.      while (a′ mod 2 = 0) do
        begin
8.          a′ := a′/2;
9.          if (u0 mod 2 = 0) and (v0 mod 2 = 0)
10.         then begin u0 := u0/2; v0 := v0/2 end
11.         else begin u0 := (u0 + b)/2; v0 := (v0 − a)/2 end;
        end
12.     while (b′ mod 2 = 0) do
        begin
13.         b′ := b′/2;
14.         if (u1 mod 2 = 0) and (v1 mod 2 = 0)
15.         then begin u1 := u1/2; v1 := v1/2 end
16.         else begin u1 := (u1 + b)/2; v1 := (v1 − a)/2 end;
        end
17.     if a′ ≥ b′
18.     then begin a′ := a′ − b′; u0 := u0 − u1; v0 := v0 − v1 end
19.     else begin b′ := b′ − a′; u1 := u1 − u0; v1 := v1 − v0 end
    end
20. gcd(a, b) := x·b′; V := (u1, v1)
end.


    6 Modular Reductions

Arithmetic in Zm, the set of integers modulo m, is usually called modular arithmetic. Basic modular operations (addition, subtraction, multiplication, exponentiation) are obtained from basic operations on integers plus a modular reduction. For example,

    a ⊕ b = (a + b) mod m,

for all a, b ∈ Zm, where ⊕ is the addition in Zm and x mod m is the remainder of the division of x by m (also called the modular reduction of x).

Modular addition and subtraction can be easily implemented taking into account the following remarks:

1. a + b < 2m,
2. if a ≥ b, then 0 ≤ a − b < m, and
3. if a < b, then 0 < a + m − b < m,

for all a, b ∈ Zm.

Things are more involved in the case of modular multiplication or exponentiation, where efficient modular reduction techniques are required. The most straightforward method for performing modular reduction is to compute the remainder of the division by m using a multi-precision division algorithm. This method is commonly referred to as the classical reduction. There are also other reduction techniques, like Barrett or Montgomery reduction, which can speed up modular operations in certain circumstances.
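Following the three remarks above, modular addition and subtraction need no division at all; a sketch:

```python
def mod_add(a, b, m):
    """a + b in Z_m, for 0 <= a, b < m."""
    s = a + b
    # remark 1: a + b < 2m, so at most one subtraction of m is needed
    return s - m if s >= m else s

def mod_sub(a, b, m):
    """a - b in Z_m, for 0 <= a, b < m."""
    # remarks 2 and 3: either a - b is already reduced, or add m once
    return a - b if a >= b else a + m - b
```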

    6.1 Barrett Reduction

    The modular reduction technique we present in this section has been introducedby Paul Barrett [2] in order to obtain a high speed implementation of the RSAencryption algorithm on an off-the-shelf digital signal processing chip.

Let us assume that the representation base is β and

    m = m_{k−1}·β^(k−1) + · · · + m1·β + m0,
    a = a_{l−1}·β^(l−1) + · · · + a1·β + a0,

where k < l ≤ 2k. Therefore, β^(k−1) ≤ m < β^k ≤ a < β^(2k). We can write

    a/m = (a/β^(k−1)) · (β^(2k)/m) · (1/β^(k+1)),

which shows that the quotient q = ⌊a/m⌋ can be estimated by computing

    q̂ = ⌊u·μ/β^(k+1)⌋,

where

    u = ⌊a/β^(k−1)⌋ = a[l − 1 : k − 1]  and  μ = ⌊β^(2k)/m⌋.

The estimation q̂ satisfies:


Theorem 6.1.1 Using the notations above, if k < l ≤ 2k, then q − 2 ≤ q̂ ≤ q.

Proof. Clearly, q̂ ≤ q. Using x/y ≥ ⌊x/y⌋ > x/y − 1, we obtain

    q̂ > (1/β^(k+1)) · (a/β^(k−1) − 1) · (β^(2k)/m − 1) − 1
      = a/m − a/β^(2k) − β^(k−1)/m + 1/β^(k+1) − 1
      ≥ q − (a/β^(2k) + β^(k−1)/m + 1 − 1/β^(k+1))
      > q − 3,

because a < β^(2k) and β^(k−1) ≤ m. Since q̂ and q are integers, it follows that q̂ ≥ q − 2, which proves the theorem.

The theorem above shows that the error in estimating q by q̂ is at most two. It is mentioned in [2] that for about 90% of the values of a and m, the initial value of q̂ will be the right one, and that only in 1% of the cases will q̂ be two in error.
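The estimation can be checked in Python. The sketch below is ours: it takes β = 2^16 (an assumed digit size), so the divisions by powers of β become plain shifts, and it computes the full product u·μ rather than the partial sum discussed next:

```python
W = 16                       # digit size: beta = 2**W (our choice)

def barrett_reduce(a, m, k, mu):
    """Return a mod m, where m has k base-2^W digits, beta^k <= a < beta^(2k),
    and mu = floor(beta^(2k)/m) is pre-computed."""
    u = a >> ((k - 1) * W)                   # u = floor(a / beta^(k-1))
    q_hat = (u * mu) >> ((k + 1) * W)        # q_hat = floor(u*mu / beta^(k+1))
    r = a - q_hat * m
    while r >= m:                            # at most two subtractions of m
        r -= m
    return r

m = 0x1234_5678_9ABC_DEF1                    # a 4-digit modulus in base 2^16
k = 4
mu = (1 << (2 * k * W)) // m                 # pre-computation: mu = floor(beta^(2k)/m)
a = 0x0FED_CBA9_8765_4321_1111_2222_3333    # some a with k < l <= 2k digits
assert barrett_reduce(a, m, k, mu) == a % m
```

Since q̂ ≤ q, the intermediate r is never negative here, and by the theorem the final adjustment loop runs at most twice.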

The computation of q̂ requires:

• a pre-computation (of μ),
• two divisions by a power of β (which are right-shifts of the base-β representation), and
• a multi-precision multiplication of μ and a[l − 1 : k − 1].

Moreover, we remark that there is no need for the whole multiplication of μ and a[l − 1 : k − 1] because we have to divide the result by β^(k+1) and, therefore, we do not need the lower k + 1 digits of the product. But we must take into consideration the influence of the carry from position k to position k + 1. This carry can be accurately estimated by computing only the digits from position k − 1 upwards. Therefore, we can obtain an approximation of the correct value of q̂ by computing

    Σ_{i+j ≥ k−1} μi·a_{j+k−1}·β^(i+j)

and then dividing the result by β^(k+1). This approximation may be at most one in error.

An estimate for a mod m is then r̂, given by r̂ = a − q̂m. The real remainder r = a − qm satisfies r < m < β^k. The remainder r̂ may be larger than β^k because q̂ could be smaller than q. But, if β > 2, we have r̂ < β^(k+1). Therefore, we can find r̂ by

    r̂ = (a mod β^(k+1) − (q̂m) mod β^(k+1)) mod β^(k+1),

which means that the product q̂m can be computed partially, by computing only the lower k + 1 digits

    Σ_{i+j ≤ k} q̂i·mj·β^(i+j).

As a conclusion, we can obtain a mod m by subtracting q̂m mod β^(k+1) from a[k : 0], and then adjusting the result by at most two subtractions of m.

The algorithm BarrettRed takes into account all these remarks.


BarrettRed(a, m)
input: a base-β l-digit number a and a base-β k-digit number m such that k < l ≤ 2k;
output: r = a mod m;
pre-computation: μ = ⌊β^(2k)/m⌋;
begin
1.  u := Σ_{i+j ≥ k−1} μi·a_{j+k−1}·β^(i+j);
2.  q̂ := ⌊u/β^(k+1)⌋;
3.  v := Σ_{i+j ≤ k} q̂i·mj·β^(i+j);
4.  r := a[k : 0] − v[k : 0];
5.  if r < 0 then r := r + β^(k+1);
6.  while r ≥ m do r := r − m;
end.

6.2 Montgomery Reduction

Let m and R be positive integers such that R > m and gcd(m, R) = 1. The Montgomery reduction of an integer a w.r.t. m and R is defined as aR⁻¹ mod m, where R⁻¹ is the inverse of R modulo m (it exists because R is relatively prime to m).

Theorem 6.2.1 Let m and R be positive integers such that m < R and gcd(m, R) = 1. Then, for all integers a, the following are true:

(1) a + tm is an integer multiple of R,
(2) (a + tm)/R ≡ aR⁻¹ mod m,
(3) (a + tm)/R < 2m, provided that 0 ≤ a < mR,

where t = am′ mod R, m′ = −m⁻¹ mod R, and m⁻¹ is the inverse of m modulo R.

Proof. (1) follows from the remark that tm ≡ −a mod R, which leads to a + tm = αR, for some integer α.

(2) It is enough to show that α in the relation above is congruent to aR⁻¹ modulo m. Indeed, we have α = aR⁻¹ + tmR⁻¹ and, since tmR⁻¹ ≡ 0 mod m, it follows that α ≡ aR⁻¹ mod m.

(3) follows from a < mR and t < R.

Theorem 6.2.1 shows that, in order to compute the Montgomery reduction of a, one can compute the estimate ā = (a + tm)/R, which is at most one multiple of m in error, provided that a < mR. This means that the Montgomery reduction of a reduces to computing (a + tm)/R, that is, to the computation of m′ = −m⁻¹ mod R, of t (by a multiplication), and finally to a division by R. These computations can be performed efficiently if we choose R = β^k, where β^(k−1) ≤ m < β^k. Moreover, Dusse and Kaliski [18] remarked that the main idea in Theorem 6.2.1 is to make a a multiple of R by adding multiples of m. Instead of computing all of t at once, one can compute one digit ti at a time, add ti·m·β^i to a, and repeat. This change, combined with the particular choice of R, allows computing m′0 = −m0⁻¹ mod β instead of m′, where m0 is the least significant digit of m (provided that gcd(m0, β) = 1).

A Montgomery reduction can be implemented according to the algorithm MonRed.

MonRed(a, m)
input: m = (m_{k−1} · · · m0)_β and a such that 0 ≤ a < mR and gcd(m0, β) = 1, where R = β^k;
output: r = aR⁻¹ mod m;
pre-computation: m′0 = −m0⁻¹ mod β;
begin
1.  for i := 0 to k − 1 do
    begin
2.      ti := ai·m′0 mod β;
3.      a := a + ti·m·β^i;
    end
4.  a := a/β^k;
5.  if a ≥ m then a := a − m;
6.  r := a
end.

As can be seen from the algorithm, the Montgomery reduction can be performed in k(k + 1) multiplications, which is superior to Barrett's algorithm. However, we note that the number of multiplications does not depend on the size of a. This means that the Montgomery reduction is not efficient for small numbers, for which the Barrett reduction is the better choice.
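A Python sketch of MonRed (our transcription; β = 2^16 and k = 4 below are assumed illustrative parameters, and pow(x, -1, n), available since Python 3.8, supplies the modular inverse for the pre-computation):

```python
def mon_red(a, m, beta, k):
    """Montgomery reduction: return a * R^(-1) mod m, with R = beta**k,
    0 <= a < m*R and gcd(m mod beta, beta) = 1."""
    m0_inv = pow(m % beta, -1, beta)       # inverse of the least digit of m
    mp0 = (-m0_inv) % beta                 # pre-computation: m'_0
    for i in range(k):
        a_i = (a // beta**i) % beta        # current i-th digit of a
        t_i = (a_i * mp0) % beta
        a += t_i * m * beta**i             # zero out digit i of a
    a //= beta**k                          # exact division by R
    return a - m if a >= m else a

beta, k = 2**16, 4
m = 0x1234_5678_9ABC_DEF1                  # odd k-digit modulus in base 2^16
R = beta**k
a = 123456789123456789                     # 0 <= a < m*R
assert mon_red(a, m, beta, k) == (a * pow(R, -1, m)) % m
```

Each pass of the loop zeroes one more low-order digit of a, so after k passes a is an exact multiple of R, as in Theorem 6.2.1.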

Montgomery multiplication. Let m and R be as in the Montgomery reduction. It is straightforward to show that the set {aR mod m | a < m} is a complete residue system, i.e., it contains all numbers between 0 and m − 1. The number ā = aR mod m is called the m-residue of a.


Thus, there is a one-to-one correspondence between the numbers in the range 0 to m − 1 and the numbers in the set above, given by a ↦ ā = aR mod m. The Montgomery product of two m-residues ā and b̄ is defined as the m-residue

    c̄ = āb̄R⁻¹ mod m.

It is easy to see that c̄ is the m-residue of the product c = ab mod m.

The computation of the Montgomery product is achieved as in the algorithm MonPro.

MonPro(ā, b̄, m)
input: two m-residues ā and b̄;
output: c̄ = āb̄R⁻¹ mod m;
pre-computation: m′0 = −m0⁻¹ mod β, provided that gcd(m0, β) = 1;
begin
1.  x := āb̄;
2.  c̄ := MonRed(x, m);
end.

Remark that x in the algorithm MonPro is not larger than mR and, therefore, MonRed can be applied to it.

Now, if we want to compute ab mod m, for some a, b ∈ Zm, then we can do it in at least two ways:

• MonPro(MonPro(a, b, m), R² mod m, m), or
• MonRed(MonPro(ā, b̄, m), m).

Although the cost of computing R² mod m (in the first case) and ā and b̄ (in the second case) can be neglected, both variants are more expensive than the usual multiplication followed by a classical reduction. However, Montgomery multiplication is very effective when several modular multiplications w.r.t. the same modulus are needed. Such is the case when one needs to compute a modular exponentiation. Details about this operation will be given in Section 7.
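The two routes can be checked with an abstract MonPro. To keep the sketch short we implement it directly through R⁻¹ mod m rather than through the digit-wise MonRed; the parameters are our illustrative choices:

```python
beta, k = 2**16, 4
m = 0x1234_5678_9ABC_DEF1          # odd modulus, so gcd(m, R) = 1
R = beta**k
R_inv = pow(R, -1, m)

def mon_pro(x, y):
    # abstractly x*y*R^(-1) mod m; in practice computed as MonRed(x*y, m)
    return (x * y * R_inv) % m

a, b = 987654321, 123456789
# first way: no conversion to m-residues, one extra MonPro with R^2 mod m
assert mon_pro(mon_pro(a, b), R * R % m) == (a * b) % m
# second way: convert to m-residues, multiply, convert back with one reduction
abar, bbar = a * R % m, b * R % m
assert (mon_pro(abar, bbar) * R_inv) % m == (a * b) % m
assert mon_pro(abar, bbar) == (a * b % m) * R % m   # the m-residue of ab mod m
```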

    6.3 Comparisons

The execution time of the classical reduction, the Barrett reduction, and the Montgomery reduction depends in a different way on the length of the argument. For the first two of them, the time complexity varies linearly between the maximum value (for an argument twice as long as the modulus) and almost zero (for an argument as long as the modulus). For arguments smaller than the modulus no reduction takes place (the arguments are already reduced). On the other hand, the time complexity of the Montgomery reduction is independent of the length of the argument. This shows that both the classical and the Barrett reduction are faster than the Montgomery reduction below a certain length of the argument.

The table in Figure 7 gives some comparisons between the classical reduction using recursive division, the Barrett reduction, and the Montgomery reduction. In this table, K(k) stands for the complexity of Karatsuba multiplication of two k-digit integers.


                          Barrett         Montgomery        Recursive Division
    Restrictions on input a < β^(2k)      a < mβ^k          None
    Pre-computation       ⌊β^(2k)/m⌋      −m0⁻¹ mod β       Normalization
    Post-computation      None            None              Unnormalization
    Multiplications       k(k + 4)        k(k + 1)          2K(k) + O(k log k)

    Figure 7: Three reduction methods


    7 Exponentiation

The exponentiation problem in a monoid (M, ·) is to compute aⁿ, for some a ∈ M and integer n > 0. When M is a group, we may also raise the problem of computing aⁿ for negative integers n. The naive method for computing aⁿ requires n − 1 multiplications but, as we shall see in this section, we can do much better. This is very important because exponentiation is heavily used in many working fields such as primality testing, cryptography, security protocols etc. In such fields, practical implementations depend crucially on the efficiency of exponentiation.

Although the methods we are going to present can be applied to any monoid, we shall deal only with modular exponentiation, that is, exponentiation in Zm, where the multiplication is the ordinary multiplication followed by reduction modulo m. Therefore, everything in this section refers to integers a ∈ Zm, n ≥ 1, and m ≥ 2.

We shall mainly discuss five types of exponentiation techniques:

• general techniques, where the base and the exponent are arbitrary;

• fixed-exponent techniques, where the exponent is fixed and arbitrary choices of the base are allowed;

• fixed-base techniques, where the base is fixed and arbitrary choices of the exponent are allowed;

• techniques based on modulus particularities, where special properties of the modulus are exploited;

• exponent recoding techniques, where various representations of the exponent are used.

Finally, we focus on multi-exponentiation techniques.

In general, there are two ways of reducing the time required by a modular exponentiation. One way is to reduce the number of modular multiplications, and the other is to decrease the time required by a modular multiplication. Both of them will be considered in our exposition, but the stress lies on the first one (a preliminary version of this section can be found in [30]).
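As a first illustration of reducing the number of modular multiplications, the classical binary (square-and-multiply) method uses O(log n) multiplications instead of n − 1; the general techniques discussed next refine this idea. A sketch:

```python
def mod_exp(a, n, m):
    """Left-to-right binary exponentiation: a**n mod m, for n >= 1, m >= 2."""
    result = 1
    for bit in bin(n)[2:]:                  # binary digits of n, most significant first
        result = (result * result) % m      # always square
        if bit == '1':
            result = (result * a) % m       # multiply when the bit is set
    return result

assert mod_exp(5, 117, 19) == pow(5, 117, 19)
```

Each bit of n costs one squaring and at most one multiplication, both followed by a modular reduction; this is precisely where the reduction techniques of Section 6 pay off.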

    7.1 General Techniques

Most of the general exponentiation techniques, i.e., techniques which do not exploit any particularity of the exponent or of the base, thus being generally applicable, can be viewed as particular cases or slight variants of the sliding window method. This method is based on arbitrary block decompositions of the binary representation of the exponent, n = [w_{l−1}, . . . , w0]_{2+}. The blocks wi are usually called windows.

    There are several variants of this method, depending on how the windows are