Approximate Neumann Series or Exact Matrix Inversion for Massive MIMO?
Oscar Gustafsson, Erik Bertilsson, Johannes Klasson, and Carl Ingemarsson

Slides: arith24.arithsymposium.org/slides/s10-gustafsson.pdf
July 25, 2017

Matrix Inversion in Massive MIMO

• $N$ terminals, $M$ antennas

• Channel matrix $H \in \mathbb{C}^{M \times N}$

• Gram matrix $X = H^H H \in \mathbb{C}^{N \times N}$ to be inverted for zero forcing (or MMSE)

• $X$ is conjugate symmetric (Hermitian) and positive semi-definite

• With uncorrelated channels and $M \gg N$, $X$ is diagonally dominant
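The properties listed above can be checked numerically. The following is a minimal sketch, assuming NumPy and an i.i.d. complex Gaussian channel as a stand-in for "uncorrelated channels"; the helper names `gram_matrix` and `dominance_ratio`, the seed, and the chosen values of $M$ and $N$ are illustrative and not taken from the slides.

```python
import numpy as np

def gram_matrix(H):
    """Return X = H^H H for a channel matrix H of shape (M, N)."""
    return H.conj().T @ H

def dominance_ratio(X):
    """Largest ratio of off-diagonal row sum to |diagonal entry|.

    A value below 1 means X is strictly diagonally dominant by rows.
    """
    diag = np.abs(np.diag(X))
    off = np.sum(np.abs(X), axis=1) - diag
    return float(np.max(off / diag))

rng = np.random.default_rng(0)  # fixed seed, illustrative only
N = 8
ratios = {}
for M in (16, 128, 1024):
    # Assumed channel model: i.i.d. complex Gaussian entries.
    H = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
    X = gram_matrix(H)
    ratios[M] = dominance_ratio(X)
    hermitian = np.allclose(X, X.conj().T)
    min_eig = np.linalg.eigvalsh(X).min()  # non-negative up to rounding for a Gram matrix
    print(f"M = {M:4d}: Hermitian = {hermitian}, min eigenvalue = {min_eig:.3f}, "
          f"dominance ratio = {ratios[M]:.3f}")
```

Under this channel model the dominance ratio is expected to shrink as $M/N$ grows, which is the regime in which the diagonal of $X$ is a useful approximation of $X$ itself.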











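[Figure: frame structure of duration $T_\text{frame}$; slot labels as extracted: PUL UL UL UL DLG DL G; segment lengths $N_{\text{UL},1}$, $N_{\text{UL},2}$, $N_\text{DL}$]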

• One matrix inversion per frame

• Computed between reception of the pilot and transmission of the first downlink data

• The critical metric is latency, not throughput










Algorithms for Matrix Inversion

• Exact algorithms

    • Numerical issues, especially in fixed-point, for close-to-singular (sub-)matrices
    • Division and/or square roots
    • Cubic complexity

• LDLᵀ-decomposition

    • Lowest operation count
    • Reasonable fixed-point properties
    • No square roots






• Neumann series expansion

• Precondition matrix $A \approx X^{-1}$

$$\hat{X}_K^{-1} = \left( \sum_{n=1}^{K} (I - AX)^{n-1} \right) A \tag{1}$$

• “High parallelism”

• “Low complexity”

• “No division”

• “Numerically stable”






Diagonal precondition matrix

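$$A = \begin{bmatrix} a_{1,1} & 0 & \cdots & 0 \\ 0 & a_{2,2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{N,N} \end{bmatrix}, \qquad a_{i,i} = 1/x_{i,i}$$

$$I - AX = \begin{bmatrix} 0 & y_{1,2} & \cdots & y_{1,N} \\ y_{2,1} & 0 & \cdots & y_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ y_{N,1} & y_{N,2} & \cdots & 0 \end{bmatrix}$$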
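Equation (1) with the diagonal precondition matrix can be sketched as follows. This is a minimal illustration assuming NumPy and floating-point arithmetic (the slides target fixed-point hardware); the function names, the random channel model, and the sizes $M = 256$, $N = 8$ are assumptions made for the example.

```python
import numpy as np

def diagonal_preconditioner(X):
    """A = diag(1 / x_ii)."""
    return np.diag(1.0 / np.diag(X))

def neumann_inverse(X, A, K):
    """Approximate inv(X) by (sum_{n=1}^{K} (I - A X)^(n-1)) A, as in equation (1)."""
    N = X.shape[0]
    E = np.eye(N, dtype=X.dtype) - A @ X   # iteration matrix I - AX
    term = np.eye(N, dtype=X.dtype)        # (I - AX)^0
    acc = np.zeros_like(X)
    for _ in range(K):
        acc = acc + term                   # add (I - AX)^(n-1)
        term = term @ E
    return acc @ A

rng = np.random.default_rng(1)  # fixed seed, illustrative only
M, N = 256, 8
H = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
X = H.conj().T @ H

A = diagonal_preconditioner(X)
E = np.eye(N) - A @ X
X_inv = np.linalg.inv(X)  # reference result for comparison only

errors = {}
for K in (1, 2, 3):
    approx = neumann_inverse(X, A, K)
    errors[K] = np.linalg.norm(approx - X_inv) / np.linalg.norm(X_inv)
    print(f"K = {K}: relative error = {errors[K]:.3e}")
```

With this choice of $A$ the diagonal of $I - AX$ is exactly zero, matching the structure shown above, and the series converges only when the spectral radius of $I - AX$ is below one, which is why diagonal dominance of $X$ matters.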






Tri-diagonal precondition matrix

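$$A = \begin{bmatrix} a_{1,1} & a_{1,2} & 0 & \cdots & 0 \\ a_{2,1} & a_{2,2} & a_{2,3} & \cdots & 0 \\ 0 & a_{3,2} & a_{3,3} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & a_{N,N} \end{bmatrix}$$

Sequential computation of $A$

Generic $I - AX$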






Diagonal + column precondition matrix

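$$A = \begin{bmatrix} a_{1,1} & 0 & \cdots & 0 \\ a_{2,1} & a_{2,2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a_{N,1} & 0 & \cdots & a_{N,N} \end{bmatrix}$$

$$I - AX = \begin{bmatrix} 0 & y_{1,2} & \cdots & y_{1,N} \\ 0 & y_{2,2} & \cdots & y_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & y_{N,2} & \cdots & y_{N,N} \end{bmatrix}$$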
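The slides do not state how the column entries $a_{i,1}$ are chosen. The sketch below uses one choice that reproduces the zero first column of $I - AX$ shown above: $a_{i,i} = 1/x_{i,i}$ and $a_{i,1} = -x_{i,1}/(x_{i,i}\,x_{1,1})$ for $i > 1$. This choice, the function name, and the test matrix are assumptions for illustration and may differ from the authors' construction.

```python
import numpy as np

def diag_plus_column_preconditioner(X):
    """Diagonal-plus-first-column A such that the first column of A X equals e_1.

    Assumed construction (not taken from the slides):
      a_ii = 1 / x_ii
      a_i1 = -x_i1 / (x_ii * x_11)   for i > 1
    """
    d = np.diag(X)
    A = np.diag(1.0 / d).astype(X.dtype)
    # (A X)[i, 0] = a_i1 * x_11 + a_ii * x_i1 = 0 for every row below the first.
    A[1:, 0] = -X[1:, 0] / (d[1:] * d[0])
    return A

rng = np.random.default_rng(2)  # fixed seed, illustrative only
M, N = 128, 6
H = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
X = H.conj().T @ H

A = diag_plus_column_preconditioner(X)
E = np.eye(N) - A @ X
print("max |first column of I - AX| =", np.abs(E[:, 0]).max())
print("max |diagonal of I - AX|     =", np.abs(np.diag(E)).max())
```

Unlike the purely diagonal case, the diagonal entries $y_{i,i}$ for $i > 1$ are generally nonzero here, in agreement with the matrix shown above.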





Computational Complexity

• The latency (time to obtain the result) of an algorithm depends on two aspects:

    • Total number of operations → this part of the latency scales with the number of processing elements (PEs)
    • Number of sequential operations → this part of the latency does not scale with the number of PEs

• Pipelining of the PEs

    • Increases clock frequency
    • Increases latency in clock cycles























Computational Complexity Example

[Figure: 4 × 4 exact matrix inversion based on LDLᵀ]


How Many Cycles?

• Assume multiply-and-add (MAD) operations

• Reciprocals performed using Newton-Raphson → a number of sequential MAD operations

• Sum-of-products computed using sequential MADs

• $O$ operations, each with $P$ pipeline stages, implemented on $Q$ processing elements (PEs) require

$$C_\text{alg} \geq \max\left\{ \left\lceil \frac{O}{Q} \right\rceil + P - 1,\; P\,C_\text{latency} \right\} \text{ cycles.} \tag{2}$$
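The bound in equation (2) can be evaluated directly. The sketch below assumes that $C_\text{latency}$ denotes the number of operations on the longest chain of dependent (sequential) operations; the function name and the input values are hypothetical and chosen only to show which of the two terms dominates.

```python
def cycle_lower_bound(num_ops, num_pes, pipeline_stages, sequential_ops):
    """Lower bound on cycles from equation (2).

    num_ops         -- O, total number of operations
    num_pes         -- Q, number of processing elements
    pipeline_stages -- P, pipeline stages per operation
    sequential_ops  -- C_latency, assumed here to be the longest dependency chain
    """
    ceil_ops_per_pe = -(-num_ops // num_pes)          # integer ceiling of O / Q
    throughput_bound = ceil_ops_per_pe + pipeline_stages - 1
    dependency_bound = pipeline_stages * sequential_ops
    return max(throughput_bound, dependency_bound)

# Hypothetical workload: 100 operations, 20 of them on the critical path.
print(cycle_lower_bound(100, 1, 1, 20))   # one PE: the operation count dominates
print(cycle_lower_bound(100, 8, 1, 20))   # eight PEs: the dependency chain dominates
print(cycle_lower_bound(100, 8, 3, 20))   # deeper pipeline: the chain costs P cycles per step
```

Adding PEs only reduces the first term, so beyond roughly $O / C_\text{latency}$ PEs the bound stops improving.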




Algorithm Comparison – Complexity

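Method                     MADs                                              Reciprocals

Exact method
  LDLᵀ+EQU                 $\tfrac{1}{2}N^3 + \tfrac{1}{2}N^2 - N$           $N$

Neumann series
  Diagonal, K = 2          $N^2 - N$                                         $N$
  Diagonal, K = 3          $\tfrac{1}{2}N^3 + N^2 - \tfrac{1}{2}N$           $N$
  Tri-diagonals, K = 2     $3N^2 + 7N - 10$                                  $2N - 1$
  Tri-diagonals, K = 3     $\tfrac{1}{2}N^3 + 6N^2 + \tfrac{1}{2}N - 2$      $2N - 1$
  Diag. + column, K = 2    $\tfrac{3}{2}N^2 + \tfrac{5}{2}N - 4$             $N$
  Diag. + column, K = 3    $\tfrac{1}{2}N^3 + \tfrac{5}{2}N^2 - 2N - 1$      $N$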
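To compare the operation counts for a concrete matrix size, the table can be evaluated numerically. The sketch below transcribes the formulas from the table; the dictionary layout and key names are illustrative choices.

```python
# MAD counts from the complexity table, written with integer arithmetic.
# Each numerator is even for every integer N, so "// 2" is exact.
MAD_COUNTS = {
    "LDLT+EQU":            lambda N: (N**3 + N**2 - 2 * N) // 2,
    "Diagonal, K=2":       lambda N: N**2 - N,
    "Diagonal, K=3":       lambda N: (N**3 + 2 * N**2 - N) // 2,
    "Tri-diagonals, K=2":  lambda N: 3 * N**2 + 7 * N - 10,
    "Tri-diagonals, K=3":  lambda N: (N**3 + 12 * N**2 + N - 4) // 2,
    "Diag. + column, K=2": lambda N: (3 * N**2 + 5 * N - 8) // 2,
    "Diag. + column, K=3": lambda N: (N**3 + 5 * N**2 - 4 * N - 2) // 2,
}

# Reciprocal counts from the same table.
RECIPROCAL_COUNTS = {
    "LDLT+EQU":            lambda N: N,
    "Diagonal, K=2":       lambda N: N,
    "Diagonal, K=3":       lambda N: N,
    "Tri-diagonals, K=2":  lambda N: 2 * N - 1,
    "Tri-diagonals, K=3":  lambda N: 2 * N - 1,
    "Diag. + column, K=2": lambda N: N,
    "Diag. + column, K=3": lambda N: N,
}

N = 8
for method, mads in MAD_COUNTS.items():
    print(f"{method:22s} MADs = {mads(N):4d}  reciprocals = {RECIPROCAL_COUNTS[method](N)}")
```

For $N = 8$ the exact method needs 280 MADs, while every $K = 3$ Neumann variant needs more (316, 642, and 399 MADs respectively).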


Algorithm Comparison – Latency

Method                     Sequential MADs     Sequential reciprocals

Exact method
  LDLᵀ+EQU                 $4N - 4$            $N$

Neumann series
  Diagonal, K = 2          $2$                 $1$
  Diagonal, K = 3          $N + 1$             $1$
  Tri-diagonals, K = 2     $2N + 5$            $N$
  Tri-diagonals, K = 3     $3N + 5$            $N$
  Diag. + column, K = 2    $N + 2$             $1$
  Diag. + column, K = 3    $2N + 1$            $1$


Results

[Figure: bit-error rate for the four approaches, $N = 20$, $M = 120$; logarithmic vertical axis from $10^{-8}$ to $10^{0}$, horizontal axis from 0 to 5; curves: Diagonal, Column Diagonal, Tridiagonal, LDL]


Results

Reciprocal ⇒ three sequential MAD operations

4 × 4 matrix, latency in cycles for different numbers of PEs:

• #PE: 1, latency: 48

• #PE: 2, latency: 29

• #PE: 3, latency: 26

• #PE: 4, latency: 25

[Figure: four panels, one per PE count, showing #Operations versus Cycle]


Results – 16× 16

[Figure: cycles (logarithmic axis) versus number of processing elements for a 16 × 16 matrix; curves: Tri-diagonal, Col. + Diag., Diagonal, Exact. Solid: actual result, dashed: from equation]


Results – 8× 8

[Figure: cycles (logarithmic axis) versus number of processing elements for an 8 × 8 matrix; curves: Col. + Diag., Diagonal, Exact. Solid: actual result, dashed: from equation]


Results

With $P = 1, 2, 3, 4$ levels of pipelining, 4 × 4 matrix:

• P: 1, latency: 48

• P: 2, latency: 57

• P: 3, latency: 77

• P: 4, latency: 98

[Figure: four panels, one per pipelining level, showing #Operations versus Cycle]
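The reported 4 × 4 latencies can be compared against the lower bound of equation (2). The sketch below makes several assumptions: the operation counts for the exact method are taken from the complexity and latency tables, each reciprocal is counted as three MADs, $C_\text{latency}$ is the sequential operation count, the PE sweep uses $P = 1$, and the pipelining sweep uses a single PE. The variable and function names are illustrative.

```python
def cycle_lower_bound(O, Q, P, C_latency):
    """Equation (2): max(ceil(O / Q) + P - 1, P * C_latency)."""
    return max(-(-O // Q) + P - 1, P * C_latency)

MADS_PER_RECIPROCAL = 3   # "Reciprocal => three sequential MAD operations"
N = 4

# Exact method (LDLT+EQU): total and sequential operation counts in MADs.
total_ops = (N**3 + N**2 - 2 * N) // 2 + MADS_PER_RECIPROCAL * N   # 36 + 12
sequential_ops = (4 * N - 4) + MADS_PER_RECIPROCAL * N             # 12 + 12

reported_by_pes = {1: 48, 2: 29, 3: 26, 4: 25}      # assumed P = 1
reported_by_stages = {1: 48, 2: 57, 3: 77, 4: 98}   # assumed Q = 1

bounds_by_pes = {Q: cycle_lower_bound(total_ops, Q, 1, sequential_ops)
                 for Q in reported_by_pes}
bounds_by_stages = {P: cycle_lower_bound(total_ops, 1, P, sequential_ops)
                    for P in reported_by_stages}

for Q, cycles in reported_by_pes.items():
    print(f"Q = {Q}, P = 1: reported {cycles}, bound {bounds_by_pes[Q]}")
for P, cycles in reported_by_stages.items():
    print(f"Q = 1, P = {P}: reported {cycles}, bound {bounds_by_stages[P]}")
```

Under these assumptions every reported latency is at or above the bound, and the single-PE, unpipelined case meets it exactly at 48 cycles.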


Results – 16× 16

Time in single-cycle-latency operations, assuming pipelining increases speed linearly

[Figure: time (logarithmic axis) versus number of processing elements (1 to 4) for a 16 × 16 matrix; curves: Col. + Diag., Diagonal, Exact. Solid: $P = 1$, dashed: $P = 2$, dash-dotted: $P = 3$]
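The time normalization used in these plots can be illustrated with the 4 × 4 cycle counts reported earlier for $P = 1, 2, 3, 4$. The sketch assumes that "pipelining increases speed linearly" means the clock frequency scales by $P$, so the time in units of one unpipelined operation is the cycle count divided by $P$; the function name is illustrative.

```python
# Reported cycle counts for the 4 x 4 exact inversion, keyed by pipeline depth P.
reported_cycles = {1: 48, 2: 57, 3: 77, 4: 98}

def normalized_time(cycles, P):
    """Time in units of one unpipelined operation, assuming the clock rate scales by P."""
    return cycles / P

times = {P: normalized_time(c, P) for P, c in reported_cycles.items()}
for P, t in times.items():
    print(f"P = {P}: {reported_cycles[P]} cycles -> {t:.2f} time units")
```

Under this optimistic assumption the cycle count grows with $P$ while the normalized time still decreases, from 48 units at $P = 1$ to 24.5 units at $P = 4$.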


Results – 8× 8

Time in single-cycle-latency operations, assuming pipelining increases speed linearly

[Figure: time (logarithmic axis) versus number of processing elements (1 to 4) for an 8 × 8 matrix; curves: Col. + Diag., Diagonal, Exact. Solid: $P = 1$, dashed: $P = 2$, dash-dotted: $P = 3$]


Design Example

• Assume a latency requirement of 0.05 ms (10% of an LTE-like frame with 2 UL and 2 DL symbols)

• For $N = 8$ and one PE, 304 cycles are required for the exact algorithm

• One PE operating at $f_\text{clk} = 6.08$ MHz is therefore sufficient

• $N = 30$ ⇒ $f_\text{clk} \approx 280$ MHz

• 2000 inversions per second (2 kInv/s), with the PE idle 90% of the time
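The numbers in this design example can be reproduced from the operation counts of the exact method. The sketch assumes one unpipelined PE, three MADs per reciprocal, and that the single-PE cycle count equals the total operation count; the function names are illustrative.

```python
def exact_cycles_one_pe(N, mads_per_reciprocal=3):
    """Cycles for LDLT+EQU on one unpipelined PE: MADs plus reciprocals expanded into MADs."""
    mads = (N**3 + N**2 - 2 * N) // 2
    return mads + mads_per_reciprocal * N

def required_clock_hz(cycles, latency_s):
    """Clock frequency needed to finish the given number of cycles within latency_s seconds."""
    return cycles / latency_s

LATENCY_S = 0.05e-3          # 0.05 ms latency requirement
FRAME_S = LATENCY_S / 0.10   # latency budget is 10% of the frame

results = {}
for N in (8, 30):
    cycles = exact_cycles_one_pe(N)
    results[N] = (cycles, required_clock_hz(cycles, LATENCY_S) / 1e6)
    print(f"N = {N:2d}: {cycles} cycles, f_clk = {results[N][1]:.2f} MHz")

inversions_per_second = 1.0 / FRAME_S
print(f"{inversions_per_second:.0f} inversions per second")
```

This gives 304 cycles and 6.08 MHz for $N = 8$, 14010 cycles and about 280 MHz for $N = 30$, and 2000 inversions per second for a 0.5 ms frame.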





Is Neumann useful at all?

• If fewer than three terms are used, the complexity may be lower

• Only compute parts of the third iteration

• Allows increasing the number of terminals further

• But numerically most efficient when the ratio between the number of antennas and the number of terminals is high

• May give a better result with singular or close-to-singular matrices (the result is not correct, but may not be as bad as that of an exact algorithm)

• (Really) large matrices














Conclusions


• Latency, not throughput

• Complexity for Neumann series with $K = 3$ is higher than for the best exact algorithm

• Few terms for Neumann when diagonally dominant

    • Diagonally dominant ⇒ well conditioned ⇒ exact algorithm behaves well
    • Few terminals ⇒ more diagonally dominant ⇒ fewer Neumann terms (but also less complexity for the exact algorithm)

• With few PEs compared to matrix size, the limited parallelism of the exact algorithm is no problem

• Required latency/parallelism determined by frame structure

Thank you! Questions?

www.liu.se
