Finite Precision Analysis of the 3GPP Standard Turbo Decoder for Fixed-Point Implementation in FPGA Devices

Anabel Morales-Cortés and R. Parra-Michel

Department of Electrical Engineering, CINVESTAV-IPN, Jalisco, Mexico.

[email protected]

Luis F. González-Pérez and Gabriela Cervantes T.

Electronic Design Center, ITESM, Jalisco, Mexico.

[email protected]

Abstract

It is well known that fixed point hardware implementations of DSP algorithms need an accurate analysis of the finite precision of their operands and arithmetic operations. In this paper, a finite precision analysis is given for the implementation of a turbo code in FPGA devices for the 3GPP standard of cellular communications. We determine the size of the operands to be used in the component decoders of the algorithm. This analysis is performed in a practical way and compared to theoretical results reported in the literature. The ranges of the differences in the node metrics, LLRs and extrinsic information are obtained with simulations, and the bitwidths of all these variables are readily estimated so that a straightforward hardware implementation in an FPGA is possible.

Keywords: Turbocodes, MAP decoding, finite precision arithmetic, FPGA, VLSI architecture.

1. Introduction

Turbo Codes (TC) have attracted a lot of attention since their introduction by Berrou et al. in 1993 [1]. They are one of the most important achievements in the channel coding area, since their performance comes quite close to the theoretical limits established by Shannon [2].

Nowadays, TC are being incorporated in new communication standards. They have been adopted in the new cellular communication systems UMTS (3GPP) and CDMA2000 (3GPP2). This work is focused on the turbo code scheme developed by 3rd Generation Partnership Project (3GPP) [3].

When TC are to be implemented in hardware architectures, it is very important to consider both the processing speed needed to fulfill throughput requirements and, if a fixed point implementation is to be done, the precision of the operands and arithmetic operators of the global architecture. In this paper we focus on the latter.

(This work was supported by research grant INTEL-MCORE2008, CINVESTAV, CONACYT research project 84559-Y and CONACYT scholarship 203468.)

For a fixed point implementation of the turbo decoder, it is important to minimize the bitwidth used to represent quantities. Some papers have investigated the fixed-point representation of the inputs to Maximum A Posteriori (MAP) decoders [4,5], while others review minimum fixed point representations for their internal signals [5,6].

In MAP decoders, the problems associated with implementing a fixed point architecture can be summarized in three main aspects: (1) the representation of the soft inputs to the decoder, i.e., the channel inputs and the extrinsic information (or a priori information) coming from the other constituent decoder; (2) the fixed point representation of the Log Likelihood Ratios (LLR) and extrinsic information at the output of each MAP decoder; and (3) the internal precision of the decoder arithmetic operations. All these issues are addressed in this paper.

This paper is organized as follows. In Section 2, the principles of the turbo code algorithm for the 3GPP standard are described. In Section 3, an analysis of finite precision arithmetic in the context of the decoder algorithm is given. In Section 4, the results of simulations are shown together with a comparison with theoretical results. Finally, conclusions are given in Section 5.

2. Turbo codes

The structure of the 3GPP turbo encoder is shown in Figure 1-(a). It is composed of two Recursive Systematic Convolutional (RSC) codes separated by an interleaver (denoted hereafter as π) [3]. Each RSC code is a rate R=1/2 convolutional code with constraint length K=4, whose trellis diagram is shown in Figure 1-(b). The overall encoder is thus a rate R=1/3 turbo encoder.


In the standard, the number of data bits at the input to the turbo encoder is N (40 ≤ N ≤ 5114).

Figure 1. (a) 3GPP Turbo encoder, (b) Trellis diagram.

At time k, the information bit is uk (+1 or -1) and the turbo encoder outputs the codeword (Xk, Zk, Z’k) according to the encoder structure, where Xk is the systematic bit and Zk and Z’k are the parity bits generated according to the modulo-2 adders and the contents of the shift registers in the constituent encoders.
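As a concrete illustration of one constituent RSC encoder, the following minimal Python sketch produces the systematic and parity streams for a block of bits. The feedback and feedforward taps used here (g0 = 1 + D^2 + D^3, g1 = 1 + D + D^3) are taken from the 3GPP specification and are an assumption of this sketch, since Figure 1 is not reproduced in this text; bits are handled as 0/1 rather than ±1 for simplicity.

def rsc_encode(bits):
    # One rate-1/2, K=4 RSC encoder: 3-bit shift register with recursive feedback.
    s = [0, 0, 0]                          # register contents [a(k-1), a(k-2), a(k-3)]
    parity = []
    for u in bits:                         # u in {0, 1}
        fb = u ^ s[1] ^ s[2]               # feedback taps g0 = 1 + D^2 + D^3 (assumed)
        parity.append(fb ^ s[0] ^ s[2])    # feedforward taps g1 = 1 + D + D^3 (assumed)
        s = [fb, s[0], s[1]]               # shift the register
    return bits, parity                    # systematic bits Xk and parity bits Zk

print(rsc_encode([1, 0, 1, 1]))            # -> ([1, 0, 1, 1], [1, 1, 0, 1])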

The structure of the iterative decoder is shown in Figure 2. It consists of two MAP decoders linked by interleavers and deinterleavers (π-1). Each decoder takes three inputs: the channel output corresponding to the systematic bit (yks), the channel output corresponding to the parity bits of its associated component encoder (ykp1 or ykp2), and the extrinsic information from the other component decoder (L(uk)), known as the a priori information of the systematic bit. The component decoders exploit both the inputs from the channel and this a priori information to refine the probabilities associated with each information bit, which are typically represented in terms of Log Likelihood Ratios (LLRs) [7]. In the first iteration, the first component decoder (MAP 1) takes the channel outputs (yks, ykp1) and produces a soft output, in the form of LLRs, indicating its estimate of the information bits. Then, the systematic channel output and the a priori information coming from the second decoder are subtracted from these LLRs in order to generate the extrinsic information that is used as the a priori information of the systematic bits in the second constituent decoder. This extrinsic information serves as additional knowledge that allows the second decoder to refine the LLRs of the information bits. The process is repeated iteratively, where one iteration comprises two decoding stages (MAP 1 and MAP 2), and the LLRs of the information bits improve at each iteration. For implementation purposes, the well-known Log-MAP algorithm is used [7]; the Log-MAP algorithm is the original MAP algorithm [8] expressed in the log domain.
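The iterative exchange just described can be summarized in a short sketch. The function log_map_decode below is a hypothetical stand-in for one constituent Log-MAP decoder (here a stub that returns zero LLRs so the loop is runnable), and perm is an arbitrary permutation standing in for the 3GPP interleaver; only the flow of extrinsic information between the two decoders is meant to be accurate.

import numpy as np

def log_map_decode(Lc_ys, Lc_yp, La):
    # Hypothetical stand-in for one constituent Log-MAP decoder; a real
    # implementation would run eqs. (1)-(4). The stub returns zero LLRs
    # so that the surrounding loop is runnable.
    return np.zeros_like(Lc_ys)

def turbo_decode(Lc_ys, Lc_yp1, Lc_yp2, perm, n_iter=8):
    La1 = np.zeros_like(Lc_ys)                 # a priori input of MAP 1
    for _ in range(n_iter):
        L1 = log_map_decode(Lc_ys, Lc_yp1, La1)
        Le1 = L1 - Lc_ys - La1                 # extrinsic information, eq. (5)
        La2 = Le1[perm]                        # interleave for MAP 2
        L2 = log_map_decode(Lc_ys[perm], Lc_yp2, La2)
        Le2 = L2 - Lc_ys[perm] - La2
        La1 = np.empty_like(Le2)
        La1[perm] = Le2                        # deinterleave back to MAP 1 order
    L_final = np.empty_like(L2)
    L_final[perm] = L2                         # deinterleave the final LLRs
    return np.where(L_final >= 0, 1, -1)       # hard decisions

# Toy usage with a random permutation standing in for the 3GPP interleaver
rng = np.random.default_rng(0)
N = 8
perm = rng.permutation(N)
y = rng.normal(size=(3, N))
print(turbo_decode(y[0], y[1], y[2], perm))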


Figure 2. General structure of a turbo decoder

The decoding process performed in each constituent decoder for the computation of the LLRs can be summarized in the following steps:

1. Branch metric computation (BM)

Γk(s', s) = (1/2) uk L(uk) + (1/2) Lc Σ_{l=0}^{n-1} xkl ykl,          (1)

where uk is the information bit that makes the transition from state s' to state s in the trellis, L(uk) is the a priori information provided by the previous decoder, and xkl and ykl are the expected symbols (x0 and x1 in Figure 1-(b)) and the actual received symbols at the channel output (in Figure 2, yks and ykp1 if MAP 1 is used, or yks and ykp2 if MAP 2 is used), respectively. Finally, Lc is the channel reliability value, which for an Additive White Gaussian Noise (AWGN) channel is defined as Lc = 2/σ², where σ² is the noise variance.

2. Forward recursion (FW)

Ak(s) = MAX*_{s'} [ Ak-1(s') + Γk(s', s) ],   k = 0, 1, ..., N-1,          (2)

where A0(0) = 0 and A0(s) = -∞ for all s ≠ 0.

3. Backward recursion (BW)

Bk-1(s') = MAX*_{s} [ Bk(s) + Γk(s', s) ],   k = N, ..., 1,          (3)

where BN(0) = 0 and BN(s) = -∞ for all s ≠ 0.

4. Log-Likelihood Ratio (LLR)

L(uk | y) = MAX*_{(s',s): uk=+1} [ Ak-1(s') + Γk(s', s) + Bk(s) ]
          − MAX*_{(s',s): uk=-1} [ Ak-1(s') + Γk(s', s) + Bk(s) ].          (4)

5. Extrinsic information

Lek(uk) = L(uk | y) − Lc yks − L(uk).          (5)

In eqs. (2)-(4), Ak and Bk are called the node metrics of the trellis. The MAX* operator is known as the Jacobian logarithm and is defined here as:

MAX*(x, y) = ln(e^x + e^y) = max(x, y) + ln(1 + e^{-|x-y|}) = max(x, y) + fc(|x-y|),          (6)

where fc(|x-y|) is a correction function that can be implemented using a look-up table (LUT), and max(x, y) selects the maximum between x and y.
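To make the use of the MAX* operator in the recursions concrete, the following minimal sketch implements eq. (6) exactly and applies it to a single forward-recursion step of eq. (2) on a toy 2-state trellis; the trellis and the input values are illustrative assumptions only, not the 8-state 3GPP code of Figure 1-(b).

import math

def max_star(x, y):
    # Exact Jacobian logarithm of eq. (6): max(x, y) plus the correction fc(|x - y|).
    return max(x, y) + math.log(1.0 + math.exp(-abs(x - y)))

NEG_INF = -1e9    # stands in for -infinity, keeping the arithmetic well defined

# Toy 2-state trellis (NOT the 8-state 3GPP trellis of Figure 1-(b)), used only to
# show the mechanics. predecessors[s] lists (previous state s', information bit u,
# parity label xp) for every branch entering state s.
predecessors = {
    0: [(0, -1, -1), (1, +1, -1)],
    1: [(0, +1, +1), (1, -1, +1)],
}

def forward_step(A_prev, La, Lc_ys, Lc_yp):
    # One step of eq. (2): A_k(s) = MAX* over s' of [A_{k-1}(s') + Gamma_k(s', s)].
    A = {}
    for s, branches in predecessors.items():
        acc = None
        for s_prev, u, xp in branches:
            # Branch metric of eq. (1): systematic label equals u, one parity label xp.
            gamma = 0.5 * u * La + 0.5 * (u * Lc_ys + xp * Lc_yp)
            candidate = A_prev[s_prev] + gamma
            acc = candidate if acc is None else max_star(acc, candidate)
        A[s] = acc
    return A

A0 = {0: 0.0, 1: NEG_INF}          # A_0(0) = 0, A_0(s) = "-inf" otherwise
print(forward_step(A0, La=0.0, Lc_ys=1.2, Lc_yp=-0.4))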

A block diagram explaining how to calculate these metrics is shown in Figure 3. Next, the analysis of the finite precision of each of these blocks is given.

Figure 3. Computational process in the MAP algorithm.

3. Node metrics and LLRs precision analysis

As mentioned above, finite precision analysis of the turbo decoder is quite challenging because we must consider the bitwidth of the inputs, the outputs and the internal operations of the decoder, especially when an iterative process is performed. We know that the LLRs at the decoder output depend only on the differences between node metrics (eqs. 2 and 3); hence, with a proper use of normalization techniques we can estimate the bitwidth necessary to prevent overflow [9]. Let us focus on the estimation of the dynamic range of the FW and BW recursions to deduce their required bitwidth. These dynamic ranges are defined as:

ΔAk = max_{s1,s2} | Ak(s1) − Ak(s2) |,

ΔBk = max_{s1,s2} | Bk(s1) − Bk(s2) |,

where s1 and s2 are any two states s1, s2 ∈ {0, ..., sm−1} of the trellis and k ∈ {0, ..., N−1} is the time instant. In [5], these quantities were bounded as ΔAk ≤ Bα and ΔBk ≤ Bα, where

Bα = m · ( max_k |λk(u;I)| + n0 · max_k |λk(i)(c;I)| ),          (7)

m is the encoder memory, n0 is the number of input bits at each decoder, λk(u;I) is the a priori information, and λk(i)(c;I) is the channel output of the systematic and parity bits, defined as in [5]:

λk(i)(c;I) = 2 yk(i) / σ² = Lc yk(i),   i = 1, 2.          (8)

Finally, the range of the extrinsic information at the decoder output (the a priori information for the next decoder) is bounded by |λek(u;O)| ≤ Bλ, where

Bλ = (m + 1) · n0 · max_k |λk(c;I)| + m · max_k |λk(u;I)|.          (9)

As noted in equations (7) and (9), to determine these bounds it is first necessary to determine the range of the channel outputs λk(c;I) and the a priori information λk(u;I). There are no exact theoretical bounds for these values; as a result, simulations of the decoding process must be performed in order to estimate their dynamic range.

4. Simulation based precision analysis

Next, bounds for the dynamic range of forward and backward recursions, LLRs at the decoder output and extrinsic information (eqs. 4 and 5) are obtained with simulations and compared to theoretical bounds found in the previous section. Simulations were carried out using BPSK modulation over an AWGN channel.
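For reference, the soft inputs used in such a simulation can be generated along the following lines; treating the SNR as Es/N0 per transmitted symbol and omitting any code-rate normalization are simplifying assumptions of this sketch.

import numpy as np

# BPSK symbols over AWGN, scaled by the channel reliability Lc = 2/sigma^2 to
# form the decoder soft inputs Lc*y (SNR convention and scaling assumed here).
rng = np.random.default_rng(1)
snr_db = 1.0
sigma2 = 1.0 / (2.0 * 10.0 ** (snr_db / 10.0))   # noise variance per real dimension
symbols = rng.integers(0, 2, size=10) * 2 - 1    # BPSK symbols in {-1, +1}
received = symbols + rng.normal(scale=np.sqrt(sigma2), size=symbols.size)
Lc = 2.0 / sigma2
print(Lc * received)                             # decoder soft inputs Lc*y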

4.1 Channel symbols and a priori information

In this section, an analysis of the minimum bitwidth required for the systematic and parity symbols and for the extrinsic information is given.

Let FP(nb,np) be the finite precision representation of an n-bit two's complement number, where np is the number of bits in the fractional part, nb is the number of bits in the integer part, and an additional bit is used as the sign bit. This way, n = 1 + nb + np. Using this format, different configurations of nb and np for the channel symbols λk(c;I) (see Figure 2) are studied to determine the combination that entails the least performance degradation. Figure 4 presents a comparison, in terms of bit error rate (BER) versus signal to noise ratio (SNR) and for different numbers of decoding iterations, between the "ideal" decoder (infinite precision in the channel outputs) and two finite precision formats, FP(2,2) and FP(2,3).
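The quantization implied by the FP(nb,np) format can be sketched as follows; saturating out-of-range values is an assumption here, since the paper does not state how input overflow is handled.

import numpy as np

def quantize_fp(x, int_bits, frac_bits):
    # Quantize to the FP(nb, np) format described above: 1 sign bit, int_bits
    # integer bits and frac_bits fractional bits. Saturation on overflow is an
    # assumption of this sketch.
    step = 2.0 ** (-frac_bits)
    lo = -(2.0 ** int_bits)
    hi = (2.0 ** int_bits) - step
    return np.clip(np.round(np.asarray(x, dtype=float) / step) * step, lo, hi)

# Example: channel outputs quantized to FP(2,2), i.e. 5 bits in total.
# Values snap to multiples of 0.25 and saturate at -4 and +3.75.
print(quantize_fp([3.9, -4.2, 0.37, -0.12], int_bits=2, frac_bits=2))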

From this figure it can be noticed that, at iterations 5 and 8 and for a BER of 10^-5, both formats have practically the same performance as the ideal case; at iteration 3 the FP(2,2) format has a loss of 0.02 dB, while FP(2,3) only shows a 0.0125 dB penalty. From these results the FP(2,2) format is chosen, i.e., the channel outputs need only 5 bits, without noticeable performance degradation.

Regarding the a priori or extrinsic information, different formats have been analyzed as well. Figure 5 shows performance degradations of these formats when compared to the infinite precision case for the same number of decoding iterations while keeping the FP(2,2) format for the channel outputs.


Figure 4. Decoding Performance for different channel output formats.

Figure 5. Decoding Performance for different a priori information formats.

We can see that the FP(2,2) format suffers a considerable performance degradation. However, increasing the integer part by one bit reduces this degradation. The FP(4,2), FP(5,2) and FP(6,2) formats show a performance very similar to the ideal case: at a BER of 10^-4 the degradation is 0.0175 dB, 0.0167 dB and 0.0165 dB for each format, respectively. As a result, the 7-bit FP(4,2) format is retained for the hardware architecture.

4.2 Finite precision bounds on BM, FW and BW node metrics and LLR

Once we have found the bit sizes required for the inputs to the MAP decoders, we are able to investigate the necessary bitwidth for the signals that are computed within the MAP decoder, namely, the BM (eq. 1), the BW and FW recursions (eqs. 2, 3), the LLR (eq. 4) and the extrinsic information (eq. 5). Recalling the branch metric as

Γk(s', s) = (1/2) uk λk(u;I) + (1/2) Σ_{l=1}^{n} xkl λkl(c;I),

where λk(u;I) and λk(c;I) are the a priori information and the channel outputs of the previous section, respectively, and uk, xkl ∈ {-1, 1} are the information bit at the input to the encoder and the branch labels of each transition in the trellis. In order to find the dynamic range of this metric it suffices to estimate its maximum value, which is:

max |Γk(s', s)| = (1/2) max |λk(u;I)| + (1/2) · 2 · max |λk(c;I)|
               = (1/2) max |λk(u;I)| + max |λk(c;I)|
               = (1/2)(16) + (4) = 12.

The maximum values of the channel outputs were found in the previous section. The former equation suggests that we only need 4 bits for the integer part of the BM plus the sign bit, requiring a total of 5 bits.
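The bit count above follows from a one-line calculation; a minimal check in code, using the same input maxima rounded to 16 and 4 as in the text:

import math

# Worst-case branch metric of eq. (1) with the formats chosen in Section 4.1.
max_apriori = 16.0   # max |lambda_k(u;I)| with the FP(4,2) format
max_channel = 4.0    # max |lambda_k(c;I)| with the FP(2,2) format
n = 2                # systematic plus parity symbol per trellis step

max_bm = 0.5 * max_apriori + 0.5 * n * max_channel   # = 12
int_bits = math.ceil(math.log2(max_bm + 1))          # 4 integer bits, plus the sign bit
print(max_bm, int_bits)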

To determine the dynamic range of the node metrics for both the FW and BW recursions, we proceed in the same way; that is, the maximum and minimum values of these metrics are computed for every time instant and for different SNRs, iterations, information frames and values of N. The dynamic range of the inputs to the decoder was bounded in the previous section. Figure 6 presents the results obtained by simulation. When the inputs to the MAP decoder use the FP(2,2) and FP(4,2) formats for the channel outputs and the a priori information, respectively, the node metrics in the FW recursion do not exceed a value of 45; hence, 6 bits suffice for the integer part of the node metrics, plus the sign bit. Similar results were obtained for the BW recursion, since the computations are the same.

In the case of the LLR, simulations showed that the maximum value of the former was 64 (Figure 7); hence, 6 bits plus the sign bit are required. For the extrinsic information (Figure 2), Figure 8 shows that the maximum value is 44; hence a total of 7 bits are required. However, since the FP(4,2) format will be used at the input to the second MAP decoder, only 4 integer bits are kept. Simulations showed that this reduction does not entail any loss in performance.


The theoretical bound of Eq. 7 for the dynamic range of the FW and BW recursions, with FP(2,2) for the channel outputs and FP(4,2) for the a priori information, is 72, and the bound of Eq. 9 for the extrinsic information is 80. These bounds are consistent with our simulation results. Moreover, the simulations showed that the dynamic range of these variables can be reduced in practice, which translates into an important saving in area and an improvement in speed for the decoder architecture.
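As a quick cross-check, the reconstructed bounds (7) and (9) can be evaluated with the same rounded input maxima used in Section 4.2; this is only a sketch of the arithmetic, reproducing the 72 and 80 quoted above.

# Bounds (7) and (9) with m = 3 (K = 4 encoder), n0 = 2 and the rounded maxima.
m, n0 = 3, 2
max_apriori, max_channel = 16.0, 4.0      # FP(4,2) and FP(2,2) maxima, rounded

B_alpha = m * (max_apriori + n0 * max_channel)            # eq. (7) -> 72.0
B_lambda = (m + 1) * n0 * max_channel + m * max_apriori   # eq. (9) -> 80.0
print(B_alpha, B_lambda)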

Up to this point we have analyzed the dynamic range of all the variables considered in the decoder scheme. This analysis provided a tight estimation of the number of bits in the integer part of each of these variables. To finish the analysis, the precision of the fractional part must be considered as well. Figure 9 shows the performance of the turbo decoder for different numbers of bits in the fractional part of the path metrics, node metrics, LLRs and extrinsic information. As can be seen, three bits suffice, since the performance degradation is only 0.02 dB at a BER of 10^-5 and iteration 8.

To summarize, the bitwidth of node metrics, LLRs and extrinsic information is 10 using a FP(6,3) format. Next, we will estimate the number of bits required in the Max* operator.

4.3 Precision of the Max* operator

As seen in Eqs. 2, 3 and 4, the MAX* operator (eq. 6) is the core of the FW, BW and LLR computations, so a detailed analysis of its finite precision is mandatory.

[Plot: maximum dynamic range of the node metrics versus decoding iteration, for SNRs of 0.5 dB and 2.0 dB with N = 5114 and of 5.0 dB and 7.0 dB with N = 40.]

Figure 6. Node metrics of FW recursion at MAP2.

[Plot: maximum absolute LLR at the decoder output versus decoding iteration, for the same SNR and frame-size combinations.]

Figure 7. LLRs at MAP2.

[Plot: maximum absolute extrinsic information at the decoder output versus decoding iteration, for the same SNR and frame-size combinations.]

Figure 8. Extrinsic information at MAP2.

Figure 9. BER using different fractional bit size.

The correction function fc(|x-y|) in eq. 6 (Figure 10) can be implemented with a look-up table (LUT) addressed by the magnitude of the difference between x and y. This LUT contains the values of the correction factor and can use a different number of fractional bits, depending on the required precision. Figure 10 shows the correction factor for different numbers of fractional bits, Table 1 shows the memory requirements of the LUT for different precisions, and Figure 11 shows the performance of the turbo decoder for the corresponding LUT sizes and fractional bits. From these results we can see that 3 bits of precision incur a penalty of only 0.015 dB; hence, 3 bits of precision are kept for the architecture.

[Plot: the correction factor as a function of |x-y|, comparing the ideal function with its 3-bit and 5-bit quantized versions.]

Figure 10. Correction Factor


Table 1. Memory requirements for the MAX* operator

Output bits   Input bits (range)    Memory locations
2 bits        5 bits (0-2)          17
3 bits        5 bits (0-2.625)      22
4 bits        5 bits (0-3.375)      28
5 bits        6 bits (0-4.125)      34
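Table 1 can be reproduced with a short script. Two assumptions are made here, since the paper does not spell out the construction: the LUT addresses step through |x-y| in increments of 1/8 (matching the 3 fractional bits retained for the metrics), and the table stops at the last address whose correction still rounds to a nonzero value at the chosen output precision.

import math

def max_star_lut_size(out_frac_bits, step=0.125):
    # Tabulate fc(d) = ln(1 + e^-d) on a grid of spacing `step` and keep entries
    # up to the last address whose value still rounds to a nonzero multiple of
    # 2^-out_frac_bits (rounding rule assumed; see the note above).
    scale = 1 << out_frac_bits
    d, last_nonzero = 0.0, 0.0
    while d < 8.0:                               # fc decays quickly; 8 is ample
        if round(math.log(1.0 + math.exp(-d)) * scale) != 0:
            last_nonzero = d
        d += step
    locations = int(round(last_nonzero / step)) + 1
    addr_bits = math.ceil(math.log2(locations))
    return last_nonzero, locations, addr_bits

for bits in (2, 3, 4, 5):
    rng, locs, addr = max_star_lut_size(bits)
    print(f"{bits}-bit output: range 0-{rng:g}, {locs} locations, {addr} address bits")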

Finally, Figure 12 presents a performance comparison between the infinite precision turbo decoder and the fixed point simulation, for different frame sizes of the 3GPP standard (40, 1024, 2560 and 5114 bits), using the parameters obtained throughout this paper. At a BER of 10^-5 the loss in performance is only 0.017 dB, which is a quite acceptable degradation.

Figure 11. Decoding performance with variable LUT size.

5. Conclusions

In this paper, a finite precision analysis of a turbo decoder for the 3GPP standard was given for its implementation in programmable logic devices. The precision analysis was made with simulations of the algorithm and validated with theoretical bounds from the literature. The simulations showed that fewer bits can be used than those claimed by theory. The analysis presented here can also serve as a general framework for obtaining the bitwidth required for operands and arithmetic operators in DSP algorithms. First, an analysis of the inputs of the algorithm to be implemented in hardware is done, so that no important degradation is introduced when limiting their dynamic range. Then, the dynamic range of the internal variables is estimated, so that proper storage requirements, operand sizes and the resolution of the arithmetic operators can be found. Finally, a performance comparison between the infinite precision and the finite precision systems is made. With these results, the algorithm (a turbo decoder in this case) can be implemented straightforwardly in programmable logic devices.

Figure 12. Turbo code performance with infinite and finite precision.

6. References

[1] C. Berrou, A. Glavieux and P. Thitimajshima, "Near Shannon Limit Error Correcting Coding and Decoding: Turbo Codes," in Proc. ICC'93, pp. 1064-1070.

[2] C. E. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J., 1948. http://plan9.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf

[3] 3rd Generation Partnership Project, Technical Specification Group Radio Access Network, Multiplexing and channel coding (FDD). http://www.mumor.org/public/background/25201-500.pdf

[4] G. Montorsi and S. Benedetto, "Design of Fixed-Point Iterative Decoders for Concatenated Codes with Interleavers," IEEE Journal on Selected Areas in Communications, vol. 19, no. 5, pp. 571-582, May 2001.

[5] Y. Wu, B. D. Woerner and T. K. Blankenship, "Data Width Requirements in SISO Decoding With Modulo Normalization," IEEE Trans. Commun., vol. 49, no. 11, Nov. 2001.

[6] J. M. Hsu and C. L. Wang, "On Finite Implementation of a Decoder for Turbo Codes," in IEEE Int. Symp. Circuits and Systems, vol. 4, May 1999, pp. 423-426.

[7] L. Hanzo, T. H. Liew and B. L. Yeap, Turbo Coding, Turbo Equalisation and Space-Time Coding for Transmission over Wireless Channels, Wiley-IEEE Press, 2002.

[8] L. R. Bahl, J. Cocke, F. Jelinek and J. Raviv, "Optimal decoding of linear codes for minimizing symbol error rate," IEEE Trans. Inform. Theory, vol. IT-20, pp. 284-287, Mar. 1974.

[9] C. B. Shung, P. H. Siegel, G. Ungerboeck and H. K. Thapar, "VLSI Architectures for Metric Normalization in the Viterbi Algorithm," in Proc. Int. Conf. Commun., vol. 4, pp. 1723-1728.
