Inversion Architecture for MIMO Processinglrts.gel.ulaval.ca/publications/uploadPDF/publication_34.pdf · Ro=c2i. Zero(l) Zero(2) Zero(M From a hardware perspective, equation (4)

An Efficient Regular Matrix Inversion CircuitArchitecture for MIMO Processing

Isabelle LaRoche and Sebastien Roy

Department of Electrical and Computer Engineering,Laval University, Quebec, Qc, G1K 7P4, Canada

Email: {ilaroche, sebasroy} @ gel.ulaval.ca

Abstract-A novel circuit architecture and algorithm is pre- There is some evidence that methods which involve re-sented for the efficient implementation of a matrix inversion currence, such as Sherman-Morrison, iterative methods, andunit. The division-free algorithm yields a scaled version of the v.l'S. 'iEinverse and the scaling factor. Based on the Sherman-Morrison variou formsfiven rotatns-asd inversion typicallyformula, the proposed architecture is characterized by regular, provlde better numerical robustness and precision than conven-locally-connected arrays of processing units and simple iterative tional direct inversion methods [6]. Simulations have shownprocessing. It is especially well-suited for covariance matrices, that this is true for the proposed method. Furthermore, aor any other matrix which can be constructed from rank-one means of dynamic scaling control was devised to maximizeupdates of an initial matrix whose inverse is known. While it the precision of the result when integer arithmetic is used.constitutes an ideal solution for antenna array MMSE (minimummean-square error) processing, it can also be generalized to many II. OPTIMAL COMBINING USING MINIMUM MEANother applications with little effort. Implementation results of a SQUARE ERROR CRITERIAheavily pipelined matrix inverter on a Xilinx Virtex-I FPGAare presented, including cost in logic slices and maximum clock Figure 1 shows a typical M x M Multi-Input Multi-Outputfrequency. The cost / complexity of the proposed solution is com- (MIMO) system. In a Layered Space-Time (LST) scheme [10],parable to, and in many cases better than, known alternatives, each transmitting antenna's signal is at some point estimated

by passing through a linear weight-and-sum structure (spatialfilter), where the optimal weight vector can be chosen accord-

Iing to the zero-forcing (ZF) or the minimum mean-square errorMatrix inversion is a common operation in many signal (MMSE) criterion.

processing problems, including most block adaptation methods >used in adaptive communication systems.

In previous efforts [1], [2], implementation in hardware isoften non-trivial because of the need for complex operators, * :e.g. division or square root. Many proposed architectures rely Tx Rxon some form of factorization such as Cholesky [3] or QR[4], [5], [6]. The latter class is of particular interest, and <M >: rMincludes many variants which exploit Givens rotations [7]for efficient hardware-based QR decomposition. In particular, Fig. 1. An M x M MIMO systemthe implementation described in [5] exploits squared Givensrotations (SGR) [8] to avoid the square root operator, as was In the presence of interference, the optimal method isoriginally suggested in [9]. However, a number of divisions MMSE combining, which requires knowledge of the receivedis still required. The triangular locally-connected array of signal autocorrelation matrix to compute the weights:processing units used therein yields directly a weight vector W = a,R- (1for array processing where the necessary matrix inversion isimplicit. This approach is characterized by a very complex se- is a costant c0 is the desired hanel vector andries of events and timing requirements, and therefore requires RN is the autocorrelation matrix.a substantial design effort to emulate. Assuming the noise and interfering signals are uncorrelated,

In [3], the use of the Sherman-Morrison formula was the autocorrelation matrix is given byemployed to construct an inversion circuit which inverts a :

series of matri es where two sue essive matrices differ only RXX= cr2I +- cic, (2)by a smnall perturbatioln. The approach proposed hereiln uses t=1a modified form of the Sherman-Morrison formula to invert where ca2 is the noise power, I is the identity matrix, ci anda covarialnce matrix firom scratch, while avoiding divisionLs cU are the chalnnrel vector anLd tranLsposed-colnjugate channrelthalnks to appropriate scaling, vector, respectively, of the ith tranlsm[itted signlal.

0-7803-9390-2/06/$20.00 ©2006 IEEE 4819 ISCAS 2006

The LST receiver architecture which constitutes the moti- hard-wired 18 x 18 multipliers and RAM blocks, which free upvation for the present work was designed in such a way that the generic logic for the other required components (adders,estimates of the channel vector ci are successively obtained, flip-flops, counters, etc). Simulation of the system was per-and are used to construct the RXX matrix for each layer. This formed using Matlab-generated test vectors in the Modelsimstructure given in (2) can therefore be exploited to perform XE testing environment.inversion. V. MODULE DESCRIPTION

III. SHERMAN-MORRISON FORMULA This section gives an in-depth view of the important mod-The Sherman-Morrison formula [11], being a special case ules in the implementation.

of the matrix inversion lemma, allows the easy computation A. Matrix-vector multiplicationof the inverse of a slightly-modified matrix A given that the Shown in Figure 3, this unit computes the product cH I.inverse of the original matrix is known. The perturbation must Multiplexers are needed in order to select either the initialtake the form of a rank-I update, e.g. uv , where u and v 2 o o.

are vectors f appropriat dimensions.matrix, a o h euto h rvosieain hsare vectors of appropriate dimensions. signals are input at the top of a linear I-cell systolic array

GivenA-_.1. , the: Sherman-Morrison formlomprised of multiply-ac umulate cells. The current channel

(A-1 UVH-1= - (A-lUVHA-l) vector is conjugated and input from the left of the array. The1 + vHA-lu - resulting vector is fed into a series of AND gates to ensure

By examining (2), it can easily be seen that (3) can be di- that only the final and valid result, not the intermediary value,rectly applied to the autocorrelation matrix inversion problem. is sent to the next modules.Using a different notation, (3) becomes: init(1) R, init(2) R2 * * init(M)RM

(R7'1 CfCHRj71j) II(R11 .c,cf'<1 R7 i"1 IcUR7-1 (4 SelMat(1) SelMat(2) SelMat(M)i - y1 y i - i 1 1 + cHR 1CiI I II I

where Ri = H72i+ CkC, so that R1 = R, andRo =c2i. Zero(l) Zero(2) Zero(MFrom a hardware perspective, equation (4) is problematic

because of the presence of division. By introducing appropriatescaling, the division can be translated into a multiplication,thus avoiding this difficulty. The recurrent relation then be-comes Fig. 3. Matrix-vector imnultiplication

MT7 (cl +cH,7c1C) , + CC -1i-l B. Matrix-matrix multiplication= M- I(ci_ + CHM I Ci)- Figure 4 details the matrix multiplier which computes

;Xf ' 0* ) 0 : ' 7t 1 MI~1 T-

(5) f<1icic{'f<14i. Since dMi7'j1 (c{Iitij), both inputsof the unit come from the matrix-vector multiplier, one branch

where J4-7 = ciR-1 and the scaling factor ce can be being fed to a conjugate unit beforehand (see Figure 2). Theexpressed as: unit is comprised of a M x M-cell systolic array. Each cell

ai+1 = ai(aj + CH -1) has the same structure as the one described above, with thea1(oi-j iv0~ (6)addition of a vertical passthrough link to propagate the topwith ao = 1. inputs downwards. The results are multiplexed onto M1 output

lines in accordance with a specific structure.IV. SYSTEM: ARCH[ITECTURE

A. System description C. Vector-vector multiplicationThis unit implements cHR17ci. The result of the matrix-

Shown in Figure 2, the overall system arhitecture directly i e transpose eore bengimplements the modified Sherman-Morrison formula (5). Each multipliebthe hn el toien vtrorse peformedbox corresponds to a specific hardware unit, the main ones multiplexerywhich effeciely acts as a parallel-to-serialbeing detailed in section V. At the system level, it is essen- cnre Bothpvectors effed to a single cell.tially a dataflow architecture with some amount of pipelining.Individual arithmetic units performing matrix and/or vector D. Scaling factor additionoperlatin are reglar Ulotlly-connectedmarrxays. Fiur 5vhweteiplmnaicforc~71 n

Memryo: blLocks alnd a high lnumber of multipliers alre (6). A flip-flop is used as a memryo: elemnent to stolre the valuerequired, as well as other basic logic elements. In part because of a. This value is added to the vector-vector multiplicationof these requiremenlts, the Xilinx Vilrtex-JI family of FPGAs lresult. The nrew oa is computed anLd stolred inLto the memoryseem[ed a nlaturalL target platformn sinlce it provides distributed elemnent.

4820

2

a- Addnition M

r1Chne Meor S Matrix X|etmto bufr Vector

M 2

flit 2M+1 V

FLegend Multiplication btraction

M complex->: complexl|e delay integer 1

I.-I__ j

Fig. 2. General system architecture

cOiR-I1z me <= '0'; loadM <= 'O';enScale <= 'O'; clrMM <= 'O';mwe <= '0'; selMV <= '0';clrScale <= '0'; count <= 0;

||... enMV <= '0'; selC <= '0'; loadScale <= '0'; count2 <= 0;ZiiW> W cIrMV <='0'; enVy <= '0'; enLambda <= '0';count3 <= 0;

. Z12 ,>.Rst=1zeroMV=;clrVV <= 'O'; enMM <= 'O'; count iter < 0r Z _ ...... Z12=W30 <- t\ I~~~~~~~~~~~~~~~~~~~~~~~me <= '1',Zii Z12 Zi ZiMZ1M Valid 0 mwe<e T;

Z 1 2 ->*->~~ ~ ~ ~ ~ ~ ~ Wit Valid =1 enMV <= 1;Z22 > i~~~~~~count_iter += 1 cunt_iter >= MloadM =T

Z2M RT~~}Vc~R~ enLambda<='~1';51 SI crScale <= 1Z2M-~~~~~I) zcount <=0R7zzc | Z22Z2M Z2M >Rcjgc,Rz z enaba<' cutie

enMM <='O';cun_itr Mcount2 <= 0;cIrMM <= '1'; ount3 >= M+1 loadM <= '0'

DQA else <= cou-n2 e+V1; ( count==1;* *> *--cl2M| crVV <= '1cTM = 0:

-------nugte-i Zmm > JselMV <= '1' nt +=3enScale <= 'O'; JloadScale <= '0;

Fig. 4. Matrix-matrix multiplicatione nMM <=T;

Fig. 5. Scaling factor additioncal <=T; e <'O'

vcount2<d d i f maste<sign; lVI SEQUEN R if(count iter uty) First, MVA<=fopar loaded wzeroMVth u=O4~~~~~~~~~~~~~~~~~~~oDSQcalse= ,; ( S)count2 += 1; count2 += 1; enV S5 r;

u 6> DL o th sa+ct RXXcg loadScale<=sina o g b tEin-- Eln > end;/ (Sk(-

Lds > Ldrqie tnd sn thi v

ctHR c Fig. 6. State M bachine

Fig. 5. Scaling factor addition

various delayed versions of these master signals.The generation of the initial matrix also requires some

VI. SEQUENCER 2conrtrol circuitry. First, M\ flip-flLops are loaded with the crTFigure 6 shows the diagram of the state machine generating The load sigdallob dM generated by the FSM is for the first

all the control signals required to ensure plrope sequencig of flip-flop and since this value is exclusively on the diagon,al, athe system. delay of 2 clock cycles is introduced between loatdM and the

Because of the layered structure, the circuitry is used more other flip-flops.tan once to compute te inverse an many units need to be Te system contains many multplexers whose select Sinls

turnled on anld off at specific tim[es. This is crucial for the are genlerated by the FSM. The mnultiplexer for the chalnnelunits containlinlg processing cell arrays. Also, the accumulated coefficienlt vectors switches inlputs at each iteration, so thatvalues from a previous iteration must be cleared before starting its select signal,se/C, is tied to a counter, count_iter whichthe next computation. Enable (enMV, enMM and enVV) and counts the number of iterations completed so far. For theclear (clrMV, clrMM and clrVV) signals are thus needed to input multiplexers of the matrix-vector multiplication unit, thecontrol the array cells. Inl the linlear and square arrays, the cells switch between the two inputs maust not occur at the samle timaemnust e controle b.yindividual elnabl alndclear signlas. To for al A multiplexers because each line iS delayed onelockachieve this simply, while maintaining the generic, scalable cycle from the previous. Thus, the select signal generated byqua lty of the designl, a single enalla,e and a single clear: signll the FSM, selMV, iS used dilrectly by the frst mux and delayedare generated by the FSM. A sequence of flip-flops provides for the other mnuxes. selMV is alLso delLayed alnd used by the

4821

multiplexer for the matrix input of the multiplication unit. The operations and both the individual processing cells and theinput multiplexer of the vector-vector multiplication unit must sequencer are much more complex than with the proposedchange at every clock cycle for a total of M1 switches, so it method. Furthermore, the 0(M/2) completion time of theis tied to counter count2 controlled by the FSM. The output proposed technique is due to the systolic-style dataflow (whichmultiplexers for the matrix-matrix multiplication must also simplifies routing and control signal requirements). Linearswitch between lines at every clock cycles for a total of M time is achievable in principle by using pipelining betweenswitches. A series a flip-flops delay selMV by the appropriate iterations and between successive inversions.number of cycles and the end signal is used by the first mux.

V CONCLUSIONAgain, not every mux must be switched at the same time, soflip-flops are used as a mean to delay the select signal for the A division and square-root free matrix inversion unit wasremaining muxes. proposed and designed for incorporation into a layered space-

The scaling factor addition contains a memory element that time MIMO receiver. It is implemented using integer arith-must be controlled by the FSM. Three signals -enScale metic and handles complex-valued matrices. It requires thatloadScale and clrScale -are used to properly control this the input matrix be expressible as a sum of rank-1 updates, aunit. Particularities of these signals are that loadScale is tied to condition automatically met in a LST receiver since individualone for a single clock cycle during the entire inverting process channel estimates are available. It was implemented in aand clrScale is also tied to one for a single clock cycle, at the Virtex-JI FPGA with the inclusion of a dynamic scalingend of the Ml iterations before the next inversion starts. circuit (using simple shifts) to improve precision and avoid

Specific to the matrix-vector multiplication unit is a control underflow/overflow. Further optimisations are planned, as wellsignal fed to the output AND gates called zeroMV. Again, the as a detailed precision analysis.one generated by the FSM is delayed before being fed to the ACKNOWLEDGMENTother AND gates.

This work was supported by the National Sciences andVII. IMPLEMENTATION RESULTS Engineering Research Council of Canada (NSERC) under

its Discovery Grant Program. The support of the CanadianThe implemented system was synthesised using Xilinx XST Microelectronics Corporation (CMC), under its System-On-

on a Virtex-JI XC2V6000. Preliminary logic resource cost Chip Research Network (SOCRN) program, is also gratefullyresults are shown in Table I. The maximum clock frequency acknowledged.is currently 108.3 MHz and avenues have been identified forfurther optimization. REFERENCES

[I] A. El-Amawy, "A systolic architecture for fast dense imnatrix inversion'Sloic 4-input 18TsXult 1IEEE Transactions On Computers, vol. 38, no. 3, pp. 449-455, MaySlices LUTs Multipliers 1989.

Matrix-vector 765/33792 1412/67584 16/144 [2] F. Edman and V. Owall "A scalable pipelined complex valued m-atrixmultiplication 2 0 2 0 11% 0inversion architecture," in IEEE Initernational Symposium on CircuitsMatrix-matrix 3108/33792 5850/67584 64/144 and Systenis Kobe Japan, May 2005.multiplication 9 e 8e 44% [3] M. Ylinen, A. Burian, and J. Takala, "Updating matrix inverse in fixed-Vector-vector 187/33792 350/67584 4/144 point representation: Direct versus iterative methods," in IEEE Inter-multiplication 0 0o 2% na;tional Syoiposiumi on System-o;n-Cl ip, Tampere, Finland, NovemberMultiplication 780/33792 376/67584 16/144 2003.

2= 0o 11= [4] Z. Liu, J. V. McCanny, G. Lightbody, and R. Walke, "Generic socTotal 4446/33792 7805/67584 101/144 qr array processor for adaptive beamforming," IEEE Tran.sactions On

13% 11% 70% Circuits and Systems - II: An alog and Digital, vol. 50, no. 4, pp. 169-

TABLE I 175, Avril 2003.[5] G. Lightbody R. Woods, and R. Walke, "Design of parameterizable

IMPLEMENTATION RESULTS silicon intellectual property core for qr-based rls filtering, IEEE Tracns-actions On Very La-ge Scale Integration (VLSI) Systems, vol. 1, no. 4,pp. 659-678, Aout 2003.

[6] T. Shepherd and J. McWhirter, "Systolic Adaptive Beamforming," inA single matrix inversion requires 41Mf2 clock cycles, ex- Radar Array Processing, S. Haykin, J. Litva, and T. Shepherd, Eds.*t * 2 I simple processing cells. In comparison, Springer-Verlag, 1993.ploiting [7] G. HI. Golub and C. F. V. Loan, Matrix Conpurtation. John Hopkins

Edman's solution [2] is a linear array of 2M processors with University Press, 1991, second Edition.0(2P(AMi2 + AI - 1)) time units being required to complete [8] R. Dohler, "Squared givens rotation," IMA Jourtial of Numerical Anal-aninversion, where P is the degree of pipelining with ach ysis, vol. 11, pp. 1-5, 199l1.an llnversloln, where P 1S the degree of plpellnllrl each 9[9] W. M. Gentleman and H. T. Kung, "Matrix triangularization by systolic

processing cell. A high degree of pipelining is mandatory (the arrays," SPIE Real-Time Signal Processing IV, vol. 298, pp. 19-26, 1981.author proposes P 5) to obtain a decent clock frequency [10] G. J. Foschini "Layered space-tim-e architecture for wireless commnu-

mneiation in a fadinog environnment when using multi-elemennt anntennas,because of the high latenlcy of the required divisionl operator. Bell Labs Tech mica.l Jour ml.I pp. 41-59, FalLl 1996.The very best methods [1] which also require Alf2 +Al [11] E. W. Weisstein, She; manz-Mor-isonz Emmuzla, Fromprocessors (typically as two triangular arrays of MIf +M. elc- M[athWorld-A Wolfralm Web Resource, [Onlinre], Available:2 http://mathworld.wolLfram.co rSherlman-M[orrisonLFormula.htlml

these methods lne essarily involve division alnd/or square root

4822

Documents

Inversion Architecture for MIMO Processinglrts.gel.ulaval.ca/publications/uploadPDF/publication_34.pdf · Ro=c2i. Zero(l) Zero(2) Zero(M From a hardware perspective, equation (4)