Linear algebra over dense matrices over F_2 and small extensions

Martin R. Albrecht

SIAM AG11, October 7, 2011



Outline

M4RI – linear algebra over GF(2)

M4RIE – linear algebra over small extensions of GF(2)

F2 with SSE2

We can perform 128 finite field operations in one instruction, i.e., memory access is the expensive operation, not arithmetic.

Hence, optimising memory access is crucial.
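For illustration (M4RI itself is C with SSE2 intrinsics), a row over F_2 can be stored as machine words, so that a single XOR adds 64 entries at once, or 128 with SSE2:

```python
# A row over GF(2) stored as a list of 64-bit words: one XOR per word
# adds 64 entries at once (SSE2 widens this to 128 per instruction).

def add_rows(dst, src):
    """dst += src over GF(2), word by word."""
    for w in range(len(dst)):
        dst[w] ^= src[w]
    return dst

row_a = [0b1011, 0b0001]
row_b = [0b0110, 0b0011]
print(add_rows(row_a, row_b))  # [0b1101, 0b0010] == [13, 2]
```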

Strassen-Winograd [Str69] Multiplication

All multiplications in this work use Strassen-Winograd matrix multiplication, but with different base case implementations.

• fastest known practical algorithm

• complexity: O(n^(log₂ 7))

• → linear algebra constant: ω = log₂ 7

• for efficiency: crossover to base case at some dimension

In our work we focus on optimising the base case.
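A sketch of the recursion over GF(2), where addition and subtraction are both XOR. The naive base case and the crossover value are placeholders for the tuned implementation; the classic 7-product Strassen scheme is shown (Winograd's variant saves a few additions):

```python
def add(X, Y):
    """Entrywise XOR: over GF(2) this is both addition and subtraction."""
    return [[x ^ y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def naive(A, B):
    """Cubic base-case multiplication over GF(2)."""
    n, k = len(B), len(B[0])
    C = [[0] * k for _ in A]
    for i in range(len(A)):
        for t in range(n):
            if A[i][t]:
                for j in range(k):
                    C[i][j] ^= B[t][j]
    return C

def strassen(A, B, crossover=2):
    """Strassen multiplication over GF(2) for square matrices."""
    n = len(A)
    if n <= crossover or n % 2:
        return naive(A, B)
    h = n // 2
    def quads(M):
        return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
                [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])
    A11, A12, A21, A22 = quads(A)
    B11, B12, B21, B22 = quads(B)
    # the seven recursive products (all +/- collapse to XOR)
    M1 = strassen(add(A11, A22), add(B11, B22), crossover)
    M2 = strassen(add(A21, A22), B11, crossover)
    M3 = strassen(A11, add(B12, B22), crossover)
    M4 = strassen(A22, add(B21, B11), crossover)
    M5 = strassen(add(A11, A12), B22, crossover)
    M6 = strassen(add(A21, A11), add(B11, B12), crossover)
    M7 = strassen(add(A12, A22), add(B21, B22), crossover)
    C11 = add(add(M1, M4), add(M5, M7))
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(M1, M2), add(M3, M6))
    return ([C11[i] + C12[i] for i in range(h)] +
            [C21[i] + C22[i] for i in range(h)])
```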

M4RM [ADKF70] I

Consider C = A · B, where A is m × ℓ and B is ℓ × n.

A can be divided into ℓ/k vertical “stripes” A_0, …, A_{(ℓ−1)/k} of k columns each, and B into ℓ/k horizontal “stripes” B_0, …, B_{(ℓ−1)/k} of k rows each. We have:

C = A · B = Σ_{i=0}^{(ℓ−1)/k} A_i · B_i.

M4RM [ADKF70] II

A =
[1 1 0 1]
[0 0 0 0]
[1 1 1 1]
[0 1 1 1]

B =
[1 0 1 1]
[0 1 1 0]
[0 1 1 0]
[0 1 0 1]

A_0 =
[1 1]
[0 0]
[1 1]
[0 1]

A_1 =
[0 1]
[0 0]
[1 1]
[1 1]

B_0 =
[1 0 1 1]
[0 1 1 0]

B_1 =
[0 1 1 0]
[0 1 0 1]

A_0 · B_0 =
[1 1 0 1]
[0 0 0 0]
[1 1 0 1]
[0 1 1 0]

A_1 · B_1 =
[0 1 0 1]
[0 0 0 0]
[0 0 1 1]
[0 0 1 1]

M4RM: Algorithm O(n³/log n)

begin
    C ← create an m × n matrix with all entries 0;
    k ← log n;
    for 0 ≤ i < ℓ/k do
        // create table of 2^k − 1 linear combinations
        T ← MakeTable(B, i × k, 0, k);
        for 0 ≤ j < m do
            // read index for table T
            id ← ReadBits(A, j, i × k, k);
            add row id from T to row j of C;
    return C;
end

Algorithm 1: M4RM
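Algorithm 1 can be sketched in Python with rows stored as integers (bit c ↔ column c). Here k is a fixed parameter rather than ⌊log n⌋, and the table is filled with one row addition per entry rather than an explicit Gray code walk:

```python
def m4rm(A, B, l, k=4):
    """C = A*B over GF(2). A: m rows as l-bit ints, B: l rows as ints."""
    m = len(A)
    C = [0] * m
    for i in range(0, l, k):
        kk = min(k, l - i)
        # table: T[x] = XOR of the rows B[i..i+kk) selected by the bits
        # of x, built with one row addition per entry
        T = [0] * (1 << kk)
        for x in range(1, 1 << kk):
            lsb = x & -x
            T[x] = T[x ^ lsb] ^ B[i + lsb.bit_length() - 1]
        for j in range(m):
            idx = (A[j] >> i) & ((1 << kk) - 1)  # ReadBits: kk bits of row j
            C[j] ^= T[idx]                       # one table-row addition
    return C
```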

M4RM: Cache friendliness I

begin
    C ← create an m × n matrix with all entries 0;
    k ← log n;
    for 0 ≤ i < ℓ/k do
        // this is cheap in terms of memory access
        T ← MakeTable(B, i × k, 0, k);
        for 0 ≤ j < m do
            // we load each row to take care of only k bits
            id ← ReadBits(A, j, i × k, k);
            add row id from T to row j of C;
    return C;
end

Algorithm 2: Memory access pattern of M4RM

M4RM: Cache friendliness II

begin
    C ← create an m × n matrix with all entries 0;
    k ← ⌊log n⌋;
    for 0 ≤ start < m/bs do
        for 0 ≤ i < ℓ/k do
            // we regenerate T for each block
            T ← MakeTable(B, i × k, 0, k);
            for 0 ≤ s < bs do
                j ← start × bs + s;
                id ← ReadBits(A, j, i × k, k);
                add row id from T to row j of C;
    return C;
end

Algorithm 3: Cache friendly M4RM

t > 1 Gray Code Tables

• arithmetic is quite cheap compared to memory access

• the cost of memory access depends on its location in memory

• → try to fill all of L1 with Gray code tables

• Example: k = 10 and 1 table → 10 bits at a time; k = 9 and 2 tables → same memory, but 18 bits at once

• price: one extra row addition
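The t = 2 idea can be sketched as follows (rows again stored as Python integers; the real implementation sizes k against the L1 cache and builds tables via Gray codes). Two tables cover 2k bits per pass, at the price of one extra row addition per row of A:

```python
def make_table(B, start, kk):
    """All 2^kk GF(2)-combinations of rows B[start..start+kk)."""
    T = [0] * (1 << kk)
    for x in range(1, 1 << kk):
        lsb = x & -x
        T[x] = T[x ^ lsb] ^ B[start + lsb.bit_length() - 1]
    return T

def m4rm_t2(A, B, l, k=3):
    """M4RM with t = 2 tables: 2k bits per pass, one extra row addition."""
    m = len(A)
    C = [0] * m
    i = 0
    while i < l:
        k0 = min(k, l - i)
        k1 = min(k, l - i - k0)
        T0 = make_table(B, i, k0)
        T1 = make_table(B, i + k0, k1) if k1 else [0]
        for j in range(m):
            id0 = (A[j] >> i) & ((1 << k0) - 1)
            id1 = (A[j] >> (i + k0)) & ((1 << k1) - 1)
            C[j] ^= T0[id0] ^ T1[id1]  # two lookups, one extra XOR
        i += k0 + k1
    return C
```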

Matrix Dimensions     t = 1     t = 2     t = 8
10,000 × 10,000       4.141     1.982     1.599
16,384 × 16,384      16.434     7.258     6.034
20,000 × 20,000      29.520    14.655    11.655

Table: Strassen with different base cases on a 2.33 Ghz Core 2 Duo (times in seconds)

Results: Multiplication

[Plot: execution times of Magma and Sage for multiplication, and log₂ of the Magma/Sage ratio, against matrix dimension]

Figure: 2.66 Ghz Intel i7, 4GB RAM

PLE Decomposition

Definition (PLE)

Let A be an m × n matrix over a field K. A PLE decomposition of A is a triple of matrices P, L and E such that P is an m × m permutation matrix, L is a unit lower triangular matrix, and E is an m × n matrix in row-echelon form, and

A = PLE.

PLE decomposition can be computed in-place, that is, L and E are stored in A and P is stored as an m-vector.

Algorithms

Iterative: Gaussian elimination where we write L below and on the main diagonal.

Block Iterative: Variant of PLE-style Gaussian elimination; inspired by the M4RI [Bar07] algorithm.

Block Recursive: Variant of asymptotically fast PLUQ factorisation [IMH82]; reduces to matrix multiplication.

We implemented the block recursive algorithm with the block iterative one as base case; we focus on optimising the latter.
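The iterative variant can be sketched over GF(2) with dense 0/1 row lists (M4RI itself works on packed words; this only shows the structure):

```python
def ple(A):
    """Iterative PLE over GF(2). Returns (P, L, E) with
    A[P[i]] == row i of L*E, E in row-echelon form."""
    m, n = len(A), len(A[0])
    E = [row[:] for row in A]
    L = [[int(i == j) for j in range(m)] for i in range(m)]
    P = list(range(m))  # P[i] = original index of the row now at position i
    r = 0               # current rank
    for c in range(n):
        # find a pivot in column c at or below row r
        piv = next((i for i in range(r, m) if E[i][c]), None)
        if piv is None:
            continue
        E[r], E[piv] = E[piv], E[r]
        P[r], P[piv] = P[piv], P[r]
        for t in range(r):  # swap the already-computed part of L too
            L[r][t], L[piv][t] = L[piv][t], L[r][t]
        for i in range(r + 1, m):
            if E[i][c]:
                L[i][r] = 1  # over GF(2) the eliminating multiplier is 1
                for t in range(c, n):
                    E[i][t] ^= E[r][t]
        r += 1
    return P, L, E
```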

Results: Reduced Row Echelon Form

[Plot: execution times of Magma and Sage for elimination, and log₂ of the Magma/Sage ratio, against matrix dimension]

Figure: 2.66 Ghz Intel i7, 4GB RAM

Outline

M4RI – linear algebra over GF(2)

M4RIE – linear algebra over small extensions of GF(2)

Motivation

System             Time
Sage 4.7.1         3267.27s
NTL 5.4.2          1005.73s
GAP 4.4.12           16.94s
Magma 2.15            3.40s
LinBox over F13       4.67s
this work             0.71s

Table: RREF of a 4,000 × 4,000 matrix over F_{2^4}.

Representation of Elements I

Elements in F_{2^e} ≅ F_2[x]/f can be written as

a_0 α^0 + a_1 α^1 + · · · + a_{e−1} α^{e−1}.

We identify the bitstring a_0, …, a_{e−1} with

• the element Σ_{i=0}^{e−1} a_i α^i ∈ F_{2^e} and

• the integer Σ_{i=0}^{e−1} a_i 2^i.

We pack several of those bitstrings into one machine word:

a_{0,0,0}, …, a_{0,0,e−1}, padding, …, a_{0,n−1,0}, …, a_{0,n−1,e−1}, padding.

This representation is used in the matrix type mzed_t.
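A toy version of this packing (ignoring the padding and word-size details of the actual mzed_t layout), which also shows why adding two packed rows is still a single XOR per word:

```python
def pack(elems, e):
    """Pack e-bit field elements a_0, a_1, ... into one integer, a_0 lowest."""
    word = 0
    for i, a in enumerate(elems):
        word |= a << (i * e)
    return word

def unpack(word, e, n):
    """Recover n packed e-bit elements."""
    mask = (1 << e) - 1
    return [(word >> (i * e)) & mask for i in range(n)]

# addition in F_{2^e} is coefficient-wise, and XOR never carries across
# chunks, so adding two packed rows is one XOR per word
row1 = pack([0b111, 0b100], 3)
row2 = pack([0b011, 0b001], 3)
assert unpack(row1 ^ row2, 3, 2) == [0b100, 0b101]
```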

Representation of Elements II

• We can rewrite matrices over F_{2^e} as an e-tuple of matrix “slices” over F_2.

• One slice for each degree of α.

• Instead of considering matrices of polynomials, we can consider polynomials of matrices.

• This representation is used in the matrix type mzd_slice_t.

Tomas J. Boothby and Robert Bradshaw. Bitslicing and the Method of Four Russians over larger finite fields. CoRR, abs/0901.1413, 2009.

Representation of Elements III

Example:

( α² + α + 1   α + 1 )
( α²           1     )

mzed_t:

|-111-011...|
|-100-001...|

mzd_slice_t:

0: |11...|  1: |11...|  2: |10...|
   |01...|     |00...|     |10...|

Travolta tables: The idea I

In both representations:

• Scaling/multiplying is expensive: either table lookups per element or lots of bit operations on words.

• Adding is cheap: one XOR per word.

Thus, we really, really prefer additions over multiplications.

Travolta tables: The idea II

Input: A – m × n matrix
Input: B – n × k matrix

begin
    for 0 ≤ i < n do
        for 0 ≤ j < m do
            C_j ← C_j + A_{j,i} × B_i;
    return C;
end

Travolta tables: The idea III

Input: A – m × n matrix
Input: B – n × k matrix

begin
    for 0 ≤ i < n do
        for 0 ≤ j < m do
            C_j ← C_j + A_{j,i} × B_i; // the addition is cheap
    return C;
end

Travolta tables: The idea IV

Input: A – m × n matrix
Input: B – n × k matrix

begin
    for 0 ≤ i < n do
        for 0 ≤ j < m do
            C_j ← C_j + A_{j,i} × B_i; // the multiplication is expensive
    return C;
end

Travolta tables: The idea V


…but there are only 2^e possible multiples of B_i.

Travolta tables: The idea VI

Input: A – m × n matrix
Input: B – n × k matrix

begin
    for 0 ≤ i < n do
        // use Gray codes here
        for 0 ≤ j < 2^e do
            T_j ← j × B_i;
        for 0 ≤ j < m do
            x ← A_{j,i};
            C_j ← C_j + T_x;
    return C;
end
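This table-based multiplication can be sketched with field elements as small integers; the field below is F_4 with f = x² + x + 1 (the polynomial used in the Karatsuba example), and the table is filled naively where the implementation would use a Gray code walk:

```python
def gf_mul(a, b, e=2, f=0b111):
    """Multiply two elements of F_{2^e} = F_2[x]/(f), f given as a bitmask."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & (1 << e):  # reduce modulo f
            a ^= f
    return p

def travolta_mul(A, B, e=2, f=0b111):
    """C = A*B over F_{2^e}: one table of all 2^e multiples per row of B."""
    m, n, k = len(A), len(B), len(B[0])
    C = [[0] * k for _ in range(m)]
    for i in range(n):
        # all 2^e multiples of row i of B
        T = [[gf_mul(x, v, e, f) for v in B[i]] for x in range(1 << e)]
        for j in range(m):
            row = T[A[j][i]]
            C[j] = [c ^ t for c, t in zip(C[j], row)]  # addition is XOR
    return C
```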

Matrix multiplication:

• m · (2^e + n) · k additions

• m · e · k multiplications

Gaussian elimination:

• r · (n + 2^e) · n additions

• r · (e + 1) · n multiplications

Karatsuba multiplication: the idea

• Consider F_{2^2} with the primitive polynomial f = x² + x + 1.

• We want to compute C = AB.

• Rewrite A as A_0 x + A_1 and B as B_0 x + B_1.

• The product is

  C = A_0 B_0 x² + (A_0 B_1 + A_1 B_0) x + A_1 B_1.

• Reduction modulo f gives

  C = (A_0 B_0 + A_0 B_1 + A_1 B_0) x + A_1 B_1 + A_0 B_0.

• This last expression can be rewritten as

  C = ((A_0 + A_1)(B_0 + B_1) + A_1 B_1) x + A_1 B_1 + A_0 B_0.

Thus the cost is 3 multiplications and 4 additions over F_2.
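The formula above can be sketched with the two GF(2) slices of each matrix; naive GF(2) products stand in for the library's M4RM/Strassen routines:

```python
def gf2_add(X, Y):
    """Matrix addition over GF(2): entrywise XOR."""
    return [[x ^ y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def gf2_mul(X, Y):
    """Naive matrix product over GF(2) (stand-in for M4RM/Strassen)."""
    m, n, k = len(X), len(Y), len(Y[0])
    C = [[0] * k for _ in range(m)]
    for i in range(m):
        for t in range(n):
            if X[i][t]:
                for j in range(k):
                    C[i][j] ^= Y[t][j]
    return C

def karatsuba_f4(A0, A1, B0, B1):
    """(A0*x + A1)(B0*x + B1) mod x^2 + x + 1 with three GF(2) products."""
    P0 = gf2_mul(A0, B0)                              # A0*B0
    P1 = gf2_mul(A1, B1)                              # A1*B1
    P2 = gf2_mul(gf2_add(A0, A1), gf2_add(B0, B1))    # (A0+A1)(B0+B1)
    C0 = gf2_add(P2, P1)  # x-coefficient: A0B0 + A0B1 + A1B0
    C1 = gf2_add(P1, P0)  # constant term: A1B1 + A0B0
    return C0, C1
```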

Is it worth it?

e     Trav+Stras    # of M4RI mults    schoolbook    [Mon05]    M4RIE
2         0.624s          6.47              4            3       3.09
3         1.324s         13.73              9            6       6.12
4         1.480s         15.35             16            9       9.58
5         2.776s         28.79             25           13        –
6         3.180s         32.98             36           17        –
7         3.992s         41.41             49           22        –
8         6.304s         65.39             64           27        –
9        26.737s        277.35             81           34        –
10       34.526s        358.15            100           39        –

Table: Multiplication of 4,000 × 4,000 matrices over F_{2^e}.

Peter L. Montgomery. Five, six, and seven-term Karatsuba-like formulae. IEEE Trans. on Computers, 53(3):362–369, 2005.

Results: Reduced Row Echelon Forms I

Putting it all together:

• PLE and TRSM are reduced to asymptotically fast matrix multiplication (Karatsuba or Travolta-Strassen).

• Both the PLE and the TRSM base case use Travolta tables, though they are not fully optimised yet.

• Note that there is some overhead because of representation switching (mzed_t vs. mzd_slice_t).

Results: Reduced Row Echelon Forms II

[Six plots: wall time and normalised cycle count cc/(2mnr^0.807) for elimination at e = 2, 3, 4, comparing ple and travolta]

Figure: Travolta O(n³) vs. PLE O(n^ω) on a 2.66 Ghz Intel i7, 4GB RAM

Results: Reduced Row Echelon Forms III

[Plot: Elimination, Magma vs. Sage, log₂ of relative execution times against matrix dimension]

Figure: 2.66 Ghz Intel i7, 4GB RAM

Results: Reduced Row Echelon Forms IV

e    Magma 2.15-10    GAP 4.4.12    M4RIE 0x6b24b839a46f
2         6.040         162.658          3.310
3        14.470         442.522          5.332
4        60.370         502.672          6.330

Table: Elimination of 10,000 × 10,000 matrices on a 2.66 Ghz Intel i7

Thank you

• website:
  • http://m4ri.sagemath.org

• code:
  • http://bitbucket.org/malb/m4ri
  • http://bitbucket.org/malb/m4rie

• papers:
  • http://arxiv.org/abs/0811.1714
  • https://bitbucket.org/cpernet/pluqm4ri
  • https://bitbucket.org/malb/m4rie-paper

V. Arlazarov, E. Dinic, M. Kronrod, and I. Faradzev. On economical construction of the transitive closure of a directed graph. Dokl. Akad. Nauk., 194(11), 1970. (In Russian; English translation in Soviet Math. Dokl.)

Gregory V. Bard. Algorithms for Solving Linear and Polynomial Systems of Equations over Finite Fields with Applications to Cryptanalysis. PhD thesis, University of Maryland, 2007.

O. Ibarra, S. Moran, and R. Hui. A generalization of the fast LUP matrix decomposition algorithm and applications. Journal of Algorithms, 3:45–56, 1982.

Peter L. Montgomery. Five, six, and seven-term Karatsuba-like formulae. IEEE Trans. on Computers, 53(3):362–369, 2005.

Volker Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13:354–356, 1969.