
Received April 10, 2005; revised January 15, 2006

A Brief SNR Analysis in Turbo Decoding and Its Applications

Edited by Shuzhan Xu

Abstract. This paper is a collection of a few papers. The key topic is SNR calculation in turbo decoding for different situations, practical window techniques analysis, soft sample scaling, AGC, BTFD and their applications. The main contributions are local MAP and ML decoding algorithms (based on approximate computation) leading to high-speed decoding schemes, and simplified MAP algorithm with no soft sample scaling needed.

Key Words. Turbo/MAP (BCJR)/ML (Viterbi), simplified MAP/ML decoding, intrinsic and virtual SNR, windowing technique, local/adaptive/high speed decoding, RMS scaling

“With a precise theory, we can communicate over millions of miles.”
--- Wayne Stark, Michigan EECS 554 Lecture Notes, 1995

“For communication engineers, the dream of operating near Shannon limit is like to search for the holy grail.”
--- Stephen Wilson, Digital Modulation and Coding, Prentice Hall, 1995

1. Introduction By Shuzhan Xu

This paper is a collection of four research manuscripts and three technical reports on turbo, BCJR (MAP) and Viterbi decoder design issues. It summarizes previous research and development efforts on turbo and Viterbi decoders in two wireless digital chipsets: Motorola CDMA 2000 (1998-2000) and Agere/Lucent UMTS WCDMA (2000-2002). Several patent applications by Motorola, Agere/Lucent and the Chinese Academy of Sciences (high-speed decoding design methods were filed in China with reservations, as Motorola did not make the claim) cover most of the main results. For legal issues, please contact intellectual property lawyers Brian Mancini ([email protected]), David Smith ([email protected]) and Xiumei Zhou ([email protected]). The material is organized according to contents, and notes are provided to explain where and when the results come from. As engineers, we have delivered the designs for commercial products. As researchers, we feel lucky to have come up with some new algorithms and analysis. As this was a joint effort of many people over many years, we are extremely happy that the painful effort was not in vain.

INTERNATIONAL JOURNAL OF INFORMATION AND SYSTEMS SCIENCES, Volume 2, Number 2, Pages 155-279
©2006 Institute for Scientific Computing and Information

This paper is a polished version of a Chinese Academy of Science technical report (No. 742-03-798-01, Some SNR Analysis in Turbo Decoding and Applications, compiled and released in October of 2003 in Shenyang Institute of Automation, where I used to hold a part time position). The report was then distributed to people for comments. We are happy that it started many communications and gained us new friends. With positive comments even from the legendary Claude Berrou and Andrew Viterbi, we now feel that our results have some value and are mature enough to be published. This journal is chosen because the paper is long and the “conjectures” for the future may well lead to lengthy discussions and even debate. The major contents (four manuscripts and three reports) are as follows.

A Simple Turbo Decoding Intrinsic SNR Calculation and Applications (by Shuzhan Xu, Haim Teicher, Koji Tanaka and William Smith)

Abstract: A simple SNR and quality index calculation is given for the intrinsic SNR increase as the number of turbo iterations increases. We can use the asymptotic behavior of these indexes to monitor the decoding process, to stop iteration and to devise ARQ schemes. In practice, the sliding window technique and the Viterbi dual backward engine technique are used to reduce memory, at the cost of extra computation compared to the direct implementation of the MAP-based (max* or max) turbo decoder. We can shrink or vary the SNR-dependent synchronization window size as iteration goes on. A bypass scheme, which bypasses the SNR-dependent correction term, can also be easily implemented to switch max* to max in later iterations. Some simulation results are provided for justification.

Extrinsic Information Impact on ML and MAP Decoding of Convolutional Codes (by Shuzhan Xu and Wayne Stark)

Abstract: We analyze the impact of uncorrelated extrinsic information on ML and MAP decoding of convolutional codes (that is, we try to do a one-iteration-step analysis of turbo codes). LLR value monotonic properties and performance bounds under Gaussian noise are presented. Quality indexes and virtual SNR values with extrinsic information input are proposed to monitor decoding. This analysis enables us to devise some new decoding strategies such as ARQ schemes with Yamamoto-Itoh type indexes and adaptive turbo decoding with local decoding engines. We also explain the SNR dependency of the truncated Viterbi decoder trace back length and the truncated MAP decoder synchronization window size.

High-Speed Convolutional and Turbo Decoding Schemes (by Shuzhan Xu)

Abstract: As data rates get higher and higher in communication systems, we need correspondingly faster decoders. The time delay of the commonly used Viterbi (ML) decoder and BCJR (MAP) decoder comes from trellis complexity and frame size. A parallel layout of ACS butterfly structures can be applied to tackle the trellis complexity. We can chop the frame into small segments and introduce corresponding local decoding engines in parallel to tackle the frame size complexity. We can reach virtually arbitrary decoder speed with a combination of these two strategies. The price to pay is repeated parallel computations. We present algorithms, architectures, and trade-off analysis with a focus on implementation issues. Naturally, these schemes can also be applied to speed up each constituent decoder and thus the turbo decoder.


Simple RMS Soft Sample Scaling and Simplified Turbo Decoders (by Shuzhan Xu, Jan Meyer and Gerhard Ammer)

Abstract: The soft sample input to the turbo decoder must be scaled for two reasons: (1) the signal-to-noise ratio (SNR) must be estimated as a scaling factor for optimal performance, (2) the fixed precision samples must be scaled into the right dynamic range in practical designs. Scaling is a challenging problem in CDMA systems due to the interaction among power control, AGC, channel de-interleaving, and online SNR scaling. We investigate these problems and come up with very simple slot-based RMS scaling algorithms. Numerical simulation results are presented for justification. To alleviate the painful scaling process, we introduce a simple turbo decoder based on a soft output convolutional decoding scheme with performance and complexity between log-MAP and max-log-MAP decoding. This gives us robust low-power designs with fairly good performance.

Optimal linear approximation scheme for the correction term in log-MAP decoding (by Shuzhan Xu, John Falkowski and Junchen Du)

Abstract: For better turbo decoder performance, the log-MAP (max*) algorithm needs to be implemented with a correction term, typically implemented by a look-up table. We propose here an optimal linear approximation scheme to compute the correction term. The performance degradation is negligible. Our schemes can be implemented very simply in ASIC and DSP.

UMTS WCDMA Soft Sample AGC Normalization for Decoding (by Shuzhan Xu, Qi Wang, Vasic Dobrica, Stephen Spence and Phong Nguyen)

Abstract: In UMTS CDMA systems, soft samples coming out of the RAKE receiver need to be normalized for optimal decoding performance. We present normalization schemes based on slot-based mantissa multiplication and cascaded bit shifts. TTI-based normalization is based on frame-based normalization, which in turn relies on slot-based normalization. This is the optimal way to fully utilize the dynamic range, to use the least exponent memory and to have the least amount of operations. We also address some implementation details.

UMTS WCDMA Blind Transport Format Detection (BTFD) Schemes (by Shuzhan Xu, William Smith and Gerhard Ammer)

Abstract: This paper investigates BTFD algorithms and design issues in UMTS WCDMA systems. Our guideline is to implement BTFD on top of the commonly used truncated Viterbi decoder to guarantee decoding quality and flexibility of BTFD implementation. A DSP based multi-pass default solution is also presented as a backup scheme.

As mentioned, this work is a by-product of Motorola and Agere/Lucent design projects. Other involved parties are NEC Australia and Bell Labs Australia. We feel lucky that these design efforts have provided us with observations and intuitions for deeper understanding. Our gratitude goes also to the Motorola, Agere/Lucent and Chinese Academy of Science patent committees and lawyers for writing up the patent applications. Our results clearly reflect the joint effort among quite a few companies and many people from different countries. The fun together has left an unforgettable chapter in our lives.

I am extremely lucky to have received help from so many people. In particular, my gratitude goes to my teachers Professors Wayne Stark and Sean Coffey of University of Michigan for leading me into the world of communications. The brief yet productive study period in Ann Arbor marked a turning point in my life and the transition to a new career. I thank Professor Stephen Wilson of University of Virginia for great help and invaluable guidance. I thank Tom Richardson of Flarion for his deep insights, fun together and stimulating discussions. Special thanks go to Professor Sergio Verdu of Princeton University for his great encouragement, precise guidance and sharp sense of science.

I must say something about the long preparation process of this paper. All investigations started when I came up with the idea of sliding window size shrinking in turbo decoding iteration during a literature survey in 1998. The Motorola patent committee criticized this variation as too simple and without justification. My former colleague Haim Teicher suggested calculating the intrinsic SNR as an explanation (the virtual SNR increase in turbo decoding was studied by other people and published later, yet our approach with direct calculation seems to be more straightforward), and the paper “A simple turbo decoding intrinsic SNR calculation and applications” was first drafted, submitted and rejected. In order to sort out the truth, joint work with Professor Wayne Stark was conducted under a Motorola research contract (1999-2000). Detailed analysis of MAP and ML decoding and window techniques led to the dual-side windowing technique, local decoding and adaptive decoding schemes. This research effort is summarized in the paper “Extrinsic information impact on ML and MAP decoding of convolutional codes”. Motorola gave up the intellectual property protection of our results on the grounds that they were not suitable for PCS applications. Tom Richardson confirmed and further elaborated the window techniques. After double RTL simulation confirmation and further study with my Agere colleagues Koji Tanaka and Bill Smith, the paper “A simple turbo decoding intrinsic SNR calculation and applications” was finalized.

One day in 2002 at Agere, I was struck by the fact that a parallel layout of local MAP or ML decoding engines gives us high-speed decoding schemes. After consulting Professor Wayne Stark and the Motorola patent committee, the paper “High-speed convolutional and turbo decoding schemes” was compiled. To build the Agere UMTS WCDMA chipset decoder, approximate and simple RMS scaling schemes were proposed with former colleagues Jan Meyer and Gerhard Ammer. Trying to take the scaling advantage of ML decoding, a scaling-free simplified MAP decoding scheme was also proposed. These results are summarized in the paper “Simple RMS soft sample scaling and simplified turbo decoders”. During the Agere decoder design period (2000-2002), studies on other decoding issues were summarized in the three technical reports “Optimal linear approximation scheme for the correction term in log-MAP decoding”, “UMTS WCDMA soft sample AGC normalization for decoding” and “UMTS WCDMA blind transport format detection (BTFD) schemes”. All of these results were compiled in the Chinese Academy of Science report and distributed.

This turbo journey was such an unusual, dramatic and unexpected period of life. The US communication industry was in such an exciting yet turbulent mode of development. Companies, development teams and colleagues changed so dramatically and so many things happened. We can see clearly that the authors are all over the world now. So many strong emotions and surprises were involved when we re-connected in the final editing of this paper. Anyway, we are thankful for the development opportunities that made this paper possible. What is really life then? Who are we really?

“I sometimes have strange feelings, are they right?” Haim Teicher’s words in casual conversation raised many thoughts then. We have tried to “communicate” with the academic world from the very beginning, even though we spoke quite different “languages”. I sincerely thank professors Richard Wesel (UCLA), William Ryan (University of Arizona) and Norman Beaulieu (University of Alberta) and the anonymous referees for their valuable comments on previous drafts. Their detailed error correction and very strict professional requirements were actually the greatest help in improving the paper quality.


2. A Simple Turbo Decoding Intrinsic SNR Calculation and Applications By Shuzhan Xu, Haim Teicher, Koji Tanaka and William Smith

2.1. Introduction

This paper presents some simple calculations of intrinsic SNR and quality indexes in turbo decoding for practical use. Derived easily from the extended path metric that combines extrinsic information, these indexes have typical asymptotic behavior with respect to turbo iterations and can be applied for decoder monitoring, iteration stopping and ARQ schemes. Because they are simple to implement, some practical schemes can be devised and explained easily based on the intrinsic SNR increase. Such applications include varying the window sizes in sliding window or Viterbi techniques in practical decoder designs, and switching max* to max in later iterations by bypassing the logarithmic correction term.

2.2. A simple intrinsic SNR and quality index calculation

For system setup, we simply assume the turbo code rate is 1/3 without puncturing. The code rate is thus 1/2 for each systematic constituent encoder: let $X = \{x_i\}_{i=0}^{L-1}$ be the information bits, $P = \{p_i\}_{i=0}^{L-1}$ and $P' = \{p'_i\}_{i=0}^{L-1}$ be the parity bits of the first and the second constituent encoder respectively. For a static AWGN channel with noise variance $\sigma^2 = N_0/2$, we receive soft samples $Y = \{y_i\}_{i=0}^{L-1} = \{\sqrt{E_b}\,x_i + n_i\}_{i=0}^{L-1}$, $T = \{t_i\}_{i=0}^{L-1} = \{\sqrt{E_b}\,p_i + n'_i\}_{i=0}^{L-1}$, $T' = \{s_i\}_{i=0}^{L-1} = \{\sqrt{E_b}\,p'_i + n''_i\}_{i=0}^{L-1}$, where $L$ is the frame size. For each constituent MAP decoder, we have [1][4]

(1) $L_i = \frac{2\sqrt{E_b}}{\sigma^2}\, y_i + z_i + l_i$,

where $\{L_i\}_{i=0}^{L-1}$ are the generated soft LLR values, $\{z_i\}_{i=0}^{L-1}$ are the input a priori information, and $\{l_i\}_{i=0}^{L-1}$ are the newly generated extrinsic information for the next iteration. We only analyze the first constituent decoder for simplicity.

For a SISO (soft-in, soft-out) convolutional decoding scheme (e.g. MAP or SOVA) with extrinsic information (i.e. a priori information), we have

(2) $p[Y|X] = \Delta\, e^{-\frac{1}{2\sigma^2}\sum_{i=0}^{L-1}\{(y_i - \sqrt{E_b}\,x_i)^2 + (t_i - \sqrt{E_b}\,p_i)^2\}}\; e^{\frac{1}{2}\sum_{i=0}^{L-1} x_i z_i}$

$\quad = \Delta\, e^{-\frac{1}{2\sigma^2}\sum_{i=0}^{L-1}\{y_i^2 + E_b x_i^2 + t_i^2 + E_b p_i^2\}}\; e^{\frac{\sqrt{E_b}}{\sigma^2}\sum_{i=0}^{L-1}\{x_i y_i + p_i t_i\}}\; e^{\frac{1}{2}\sum_{i=0}^{L-1} x_i z_i}$,

with $\Delta = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{2L} \prod_{i=0}^{L-1} \frac{1}{e^{-z_i/2} + e^{z_i/2}}$. Please note that $e^{\frac{1}{2}\sum_{i=0}^{L-1} x_i z_i}$ is the path metric correction factor introduced by the extrinsic information. This factor helps the path metric separation and thus improves the decoding performance. We therefore define the quality index as $Q(iter, \{x_i\}, L) = \sum_{i=0}^{L-1} x_i z_i$, where $iter$ is the iteration number in the turbo decoding. Please note that what we mean by iteration is a half-iteration cycle. This is based on the symmetric nature of the two constituent decoders and our generic analysis.

The correlation term with extrinsic information included in the previous analysis

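(3) $\frac{\sqrt{E_b}}{\sigma^2}\sum_{i=0}^{L-1}\{x_i y_i + p_i t_i\} + \frac{1}{2}\sum_{i=0}^{L-1} x_i z_i = \frac{\sqrt{E_b}}{\sigma^2}\sum_{i=0}^{L-1}\left\{x_i\left(y_i + \frac{\sigma^2}{2\sqrt{E_b}}\, z_i\right) + p_i t_i\right\}$,

is the extended path metric commonly used when Viterbi decoding is applied. Equivalently, we can think of the soft samples input to the Viterbi decoders as $\left\{\left(y_i + \frac{\sigma^2}{2\sqrt{E_b}}\, z_i\right),\; t_i\right\}$. With the standard signal to noise ratio calculation $SNR = \frac{(E[y_i|x_i])^2}{\sigma^2}$, we get the SNR of the input data samples into the constituent decoder as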

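(4) $SNR(x_i, y_i, iter) = \frac{\left(E\left[\,y_i + \frac{\sigma^2}{2\sqrt{E_b}}\, z_i \,\middle|\, x_i\right]\right)^2}{\sigma^2} = \frac{E_b x_i^2}{\sigma^2} + x_i z_i + \frac{\sigma^2}{4E_b}\, z_i^2$,

where the last two terms are due to the extrinsic information. The SNR corresponding to the parity samples is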

(5) $SNR(p_i, t_i, iter) = \frac{(E[t_i|p_i])^2}{\sigma^2} = \frac{E_b p_i^2}{\sigma^2}$.

In turbo decoding, the SNR for systematic samples changes with the iteration while the corresponding SNR for parity samples does not change. If $z_i$ has the same sign as $x_i$ (this is true in general in turbo decoding and is the key point [1]),

(6) $SNR(x_i, y_i, iter) = \frac{\left(E\left[\,y_i + \frac{\sigma^2}{2\sqrt{E_b}}\, z_i \,\middle|\, x_i\right]\right)^2}{\sigma^2} \ge \frac{E_b x_i^2}{\sigma^2} = SNR(x_i, y_i, 0)$,

which shows that the extrinsic information will increase the SNR of the data input to each constituent decoder. With $AverageSNR(0)$ denoting the initial SNR value, we get the following average SNR over the whole frame at an iteration stage:

(7) $AverageSNR(iter) = \frac{1}{2L}\left\{\sum_{i=0}^{L-1} SNR(x_i, y_i, iter) + \sum_{i=0}^{L-1} SNR(p_i, t_i, iter)\right\}$

$\quad = \frac{1}{2L}\sum_{i=0}^{L-1}\left\{\frac{E_b x_i^2}{\sigma^2} + \frac{E_b p_i^2}{\sigma^2}\right\} + \frac{1}{2L}\sum_{i=0}^{L-1} x_i z_i + \frac{\sigma^2}{4E_b}\left(\frac{1}{2L}\sum_{i=0}^{L-1} z_i^2\right)$

$\quad = AverageSNR(0) + \frac{1}{2L}\, Q(iter, \{x_i\}, L) + \frac{\sigma^2}{4E_b}\left(\frac{1}{2L}\sum_{i=0}^{L-1} z_i^2\right)$.

We call this value the intrinsic SNR. The above expression also justifies our quality index definition and shows the connection between the quality index and the intrinsic SNR. Our simple analysis also re-interprets the well-known fact that the turbo iteration brings up the SNR and thus produces the coding gain. Please refer to [5][17-19] for reference.

Since the $x_i$ are unknown at the receiver, we use $Q_H(iter, \{x_i\}, L) = \sum_{i=0}^{L-1} \hat{d}_i z_i$, where $\hat{d}_i$ is the hard decision $\hat{d}_i = sign(L_i)$; or the soft version $Q_S(iter, \{x_i\}, L) = \sum_{i=0}^{L-1} L_i z_i$ as substitutes of the quality index in practice. We thus have practical intrinsic SNR values

(8) $AverageSNR_H(iter) = AverageSNR(0) + \frac{1}{2L}\, Q_H(iter, \{x_i\}, L) + \frac{\sigma^2}{4E_b}\left(\frac{1}{2L}\sum_{i=0}^{L-1} z_i^2\right)$,

(9) $AverageSNR_S(iter) = AverageSNR(0) + \frac{1}{2L}\, Q_S(iter, \{x_i\}, L) + \frac{\sigma^2}{4E_b}\left(\frac{1}{2L}\sum_{i=0}^{L-1} z_i^2\right)$,

for application purposes.
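As an illustration only, the hard and soft quality indexes and the practical intrinsic SNR of equations (8) and (9) can be sketched in a few lines of Python. The function names, the argument layout and the convention sign(0) = +1 are assumptions made for this sketch and do not come from the original designs; `avg_snr0` stands for $AverageSNR(0)$, which is supplied by the caller.

```python
def quality_indexes(llr, z):
    """Return (Q_H, Q_S) for one half-iteration.

    llr: soft LLR outputs L_i of the constituent decoder.
    z:   extrinsic (a priori) values z_i fed into that decoder.
    """
    # Hard index: sum of sign(L_i) * z_i (sign(0) is taken as +1 here).
    q_hard = sum((1.0 if l >= 0 else -1.0) * zi for l, zi in zip(llr, z))
    # Soft index: sum of L_i * z_i.
    q_soft = sum(l * zi for l, zi in zip(llr, z))
    return q_hard, q_soft


def intrinsic_snr(avg_snr0, q, z, eb, sigma2):
    """AverageSNR(0) + Q/(2L) + sigma^2/(4 Eb) * (1/(2L)) * sum(z_i^2)."""
    frame_len = len(z)
    z_term = sum(zi * zi for zi in z) / (2.0 * frame_len)
    return avg_snr0 + q / (2.0 * frame_len) + sigma2 / (4.0 * eb) * z_term
```

Passing `q_hard` gives the value of equation (8) and passing `q_soft` gives equation (9); only one multiply-accumulate pass over the frame is needed in either case.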

Our analysis is based on simple engineering intuition. For good turbo codes, $z_i$ generally has the same sign as $x_i$ and its amplitude $|z_i|$ increases with the turbo iteration (these are true at least for most of the bits in a frame). This once again is the very observation that led to the invention of the feedback turbo decoding scheme [1]. In general, the intrinsic SNR will increase with iteration and finally reach a relatively saturated constant value. The quality index and intrinsic SNR will also have such asymptotic behavior. We can further justify it by numerical simulation, using the CDMA 2000 standard code (rate 1/3, G1=13 and G2=15) with 2000 frames of size 640 bits under AWGN channel conditions. The behavior of our hard and soft quality indexes is represented by the following figures:

Figure 1. Asymptotic behavior of the hard and soft quality indexes
[Plots omitted. Two panels: "Asymptotic behavior of hard quality index: 8 full cycles" (y-axis: hard quality index value, scale x 10^4) and "Asymptotic behavior of soft quality index: 8 full cycles" (y-axis: soft quality index value, scale x 10^6). x-axis: iteration number, 0 to 16. Curves: EbNo = 1.5 dB (solid line), 2.0 dB (.* line), 2.5 dB (dotted line).]

Under the same conditions, the values of $AverageSNR_H(iter) - AverageSNR(0)$ and $AverageSNR_S(iter) - AverageSNR(0)$ are presented in the following plots respectively. That is, we only show the sum of the last two terms in the intrinsic SNR expressions. These results also demonstrate the asymptotic behavior of the intrinsic SNR values.

Figure 2. Asymptotic behavior of the hard and soft intrinsic SNR values
[Plots omitted. Two panels: "Intrinsic SNR Increment: Hard Version" (y-axis: hard intrinsic SNR increment, 0 to 250) and "Intrinsic SNR Increment: Soft Version" (y-axis: soft intrinsic SNR increment, 0 to 2500). x-axis: number of iterations, 0 to 16. Curves: Eb/N0 = 0.8 dB (solid line), 0.9 dB (.* line), 1.0 dB (dotted line).]

Finally, we point out that the previous analysis is equivalent to utilizing Viterbi decoding to analyze and monitor turbo decoding as follows.

Figure 3. Analyzing turbo decoder with Viterbi decoder
[Diagram omitted. The constituent decoders SISO I and SISO II exchange extrinsic information through the interleaver (int) and deinterleaver (deint) and produce the LLR output; a virtual Viterbi decoder for analysis is attached to each SISO decoder.]

The turbo decoding process remains unchanged. Two Viterbi decoders are attached for the purpose of analysis to derive the intrinsic SNR and quality indexes without any real decoding effort. We can also view the two Viterbi decoders as monitoring windows for the turbo decoding process. The approximation error of the Viterbi decoder to the constituent MAP decoding will not accumulate since the turbo decoding is not affected by this monitoring. Anyway, this is one way to capture our intuition.

2.3. Application I: iteration stopping

The asymptotic behavior of the intrinsic SNR and quality indexes can be applied for iteration stopping. We can stop iteration when they cross the knee of the asymptote. One easy way is to check the relative increase. With a threshold chosen as, say, 0.01, we stop iteration (only after 9 iterations out of a total of 16, to avoid false alarms) if

(10) $\{Q_H(iter+1, \{x_i\}, L) - Q_H(iter, \{x_i\}, L)\} \,/\, Q_H(iter, \{x_i\}, L) < 0.01$,

(11) $\{Q_S(iter+1, \{x_i\}, L) - Q_S(iter, \{x_i\}, L)\} \,/\, Q_S(iter, \{x_i\}, L) < 0.01$.
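The relative-increase test of equations (10) and (11) can be sketched as follows. This is a minimal sketch rather than the production logic: the function name, the list-based history, and the guard against a non-positive previous index are assumptions added for illustration, while the default threshold of 0.01 and the 9-iteration minimum follow the text.

```python
def should_stop(q_history, threshold=0.01, min_iterations=9):
    """Decide whether to stop after the latest half-iteration.

    q_history: quality index values (hard or soft), one per half-iteration
               completed so far, oldest first.
    """
    # Never stop before the mandatory number of half-iterations,
    # and at least two values are needed to form a slope.
    if len(q_history) < max(min_iterations, 2):
        return False
    prev, curr = q_history[-2], q_history[-1]
    # Assumption: a non-positive previous index means decoding has not
    # started to converge, so the ratio is not meaningful yet.
    if prev <= 0:
        return False
    return (curr - prev) / prev < threshold
```

For example, with a history of nine values whose last two entries are 100.0 and 100.5, the relative increase is 0.005 and the function returns `True`.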

The BER performance curves are given as follows.

Figure 4. BER performance of iteration stopping with quality indexes
[Plot omitted. Panel title: "BER performance: exact(solid), hard index(:), soft index(*)". x-axis: SNR in dB (0.5 to 1.5). y-axis: BER ($10^{-6}$ to $10^{-2}$).]

The degradation due to iteration stopping is less than a few hundredths of a dB at $10^{-5}$ BER. The average number of iterations is approximately 9.6 (about 4.5 full cycles), which indicates the computational savings. Intuitively, the average number of iterations needed decreases as the initial SNR increases.

Iteration stopping has long been studied. Engineering approaches like CRC check, LLR threshold and variance have been applied in practice. More subtle indexes like cross-entropy and variations were studied [4][6]. The cross entropy is approximated by

(12) $Cross\_Entropy \approx \sum_k \frac{|\Delta z_k^{(i)}|}{\exp(|L_k^{(i)}|)} = T(i)$,

where $\Delta z_k^{(i)} = z_k^{(i)} - z_k^{(i-2)}$ is the difference of the input extrinsic information to the same constituent decoder over one iteration cycle. We can stop iteration when $T(i)$ drops to the range of $(10^{-2} \sim 10^{-4}) * T(1)$. Shao et al [6] simplified the criterion of Hagenauer with further approximations and derived the following two indexes for iteration stopping:

(1) SCR (sign change ratio) criterion: Let $C(i)$ denote the number of sign changes in the extrinsic information input to the same constituent decoder between iteration cycles. We stop iteration if $C(i) \le (0.005 \sim 0.03) * L$.

(2) HDA (hard decision aided) criterion: We store the hard decisions at the $(i-2)$-th iteration, and compare them with the signs of $\{L_k^{(i)}\}$. If the signs match for each bit in the whole block, we can stop the iteration.

It is reported in [6] that HDA reduces the number of iterations more than CE or SCR for similar BER performance at low SNR. However, HDA is not as efficient as either the CE or SCR criteria in computational savings for similar BER performance at high SNR. From the implementation simplicity point of view, HDA is arguably “the way”.
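The SCR and HDA checks described above can be sketched as below. The helper names are hypothetical, sign(0) is treated as +1 by assumption, and the default ratio of 0.01 is just one value inside the 0.005 to 0.03 range quoted from [6].

```python
def sign(value):
    """Hard decision of a soft value; zero is mapped to +1 by assumption."""
    return 1 if value >= 0 else -1


def sign_change_count(z_prev, z_curr):
    """C(i): sign changes in the extrinsic input between iteration cycles."""
    return sum(1 for a, b in zip(z_prev, z_curr) if sign(a) != sign(b))


def scr_stop(z_prev, z_curr, ratio=0.01):
    """SCR criterion: stop when C(i) <= ratio * L."""
    return sign_change_count(z_prev, z_curr) <= ratio * len(z_curr)


def hda_stop(prev_hard_decisions, llr_curr):
    """HDA criterion: stop when the stored hard decisions from iteration
    i-2 match the signs of the current LLR values for every bit."""
    return all(d == sign(l) for d, l in zip(prev_hard_decisions, llr_curr))
```

Both checks need only comparisons and a counter, which is consistent with the remark that HDA is attractive from an implementation point of view.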

Recall that the hard quality index is represented as $Q(k, \{x_i\}, L) = \sum_{i=0}^{L-1} x_i z_i^{(k)}$, and suppose all extrinsic information $\{z_i^{(k)}\}_{i=0}^{L-1}$ has about the same magnitude for two full iteration cycles; that is, $|z_i^{(k+2)}| \approx |z_i^{(k)}| \approx Z$. We then have

(13) $\{Q_H(k+2, \{x_i\}, L) - Q_H(k, \{x_i\}, L)\} \,/\, Q_H(k, \{x_i\}, L)$

$\quad = \left(\sum_{i=0}^{L-1} x_i z_i^{(k+2)} - \sum_{i=0}^{L-1} x_i z_i^{(k)}\right) \Big/ \sum_{i=0}^{L-1} x_i z_i^{(k)} \approx \frac{\#(sign) * Z}{\sum_{i=0}^{L-1} x_i z_i^{(k)}} \approx \frac{\#(sign) * Z}{L * Z} = \frac{C(i)}{L}$,

where $\#(sign)$ is the number of sign changes between $\{z_i^{(k)}\}_{i=0}^{L-1}$ and $\{z_i^{(k+2)}\}_{i=0}^{L-1}$.

Clearly, we now see that our quality index is virtually the SCR criterion. The performance of our iteration stopping schemes depends on the threshold setup and therefore requires calibration with a lot of frames. The main advantage of our approach is implementation simplicity. Since LLR and extrinsic information values are output in each constituent decoding stage, we just need a MAC (multiply and accumulate) unit to calculate the soft index, and one memory unit to store it for comparison with the next one in the slope calculation. We can even store all the quality index values at different iterations with very few memory elements. A comparison unit based on one subtraction and one division is needed. For the hard index, we just need to add a slicer for the hard decision. This hardware can simply be attached to the decoder hardware. Another advantage is that our approach can be applied at any iteration stage (not only at full iteration cycles).


2.4. Application II: ARQ schemes

We now look at ARQ schemes using quality indexes. Let $Q_{index}(iter, L)$ denote any of the quality indexes or the intrinsic SNR values, $\{T_{lower}(iter)\}_{iter=0}^{2N-1}$ be threshold values and $I_0$ be a mandatory iteration number. We propose a simple practical ARQ scheme as follows.

ARQ scheme with quality index:

(1) Keep decoding until $iter = I_0$ (i.e. at least $I_0$ iterations).

(2) Send a retransmission request for the whole frame if $Q_{index}(iter) < T_{lower}(iter)$; else keep decoding and checking for iteration stopping criteria after each constituent decoding.

By intuition, if the quality index is still not good enough after certain iterations, the frame itself is bad. Continuing the decoding iteration in such a condition will not resolve the frame with an acceptable BER. A retransmission request is therefore a good decision. If the quality index passes the threshold, the frame quality is good and we should continue the decoding process. The iteration stopping criteria will also be checked for the purpose of computational savings and decoder delay reduction.
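The two-step ARQ rule can be sketched as a small decision function. This is only an illustrative sketch: the function name and the string return values are hypothetical, and it assumes that the threshold test is applied at every half-iteration from $I_0$ onwards, which is one possible reading of step (2).

```python
def arq_decision(iteration, q_index, t_lower, mandatory_iterations):
    """Return "continue" or "retransmit" for the current half-iteration.

    iteration:            current half-iteration number (0-based).
    q_index:              quality index or intrinsic SNR at this iteration.
    t_lower:              list of lower thresholds, one per half-iteration.
    mandatory_iterations: I_0, iterations decoded before any check.
    """
    # Step (1): always decode the first I_0 half-iterations.
    if iteration < mandatory_iterations:
        return "continue"
    # Step (2): a frame whose index is below the threshold is declared bad.
    if q_index < t_lower[iteration]:
        return "retransmit"
    # Otherwise keep decoding; iteration stopping is checked separately.
    return "continue"
```

A decoder loop would call this after each constituent decoding and, on "continue", go on to evaluate its iteration stopping criterion.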

We now present simulation justification under the same simulation environment as in the previous section. With 100,000 frames (CDMA 2000 standard turbo code, frame size of 640 bits) transmitted, Figure 5 shows the BER performance of our ARQ schemes. Curve ARQ_1 is derived with $I_0 = 0$ and $\{T_{lower}(iter) = [\mu(iter) - 2*\sigma(iter)]\}_{iter=0}^{2N-1}$, curve ARQ_2 is derived with $I_0 = 1$ and $\{T_{lower}(iter) = [\mu(iter) - \sigma(iter)]\}_{iter=0}^{2N-1}$, and the no-ARQ curve is simply the turbo decoder performance. $Q_S(iter)$ is used in

$\sigma^2 = \frac{1}{N}\sum_{n=0}^{N-1} [Q_S(iter)]^2 - \left\{\frac{1}{N}\sum_{n=0}^{N-1} Q_S(iter)\right\}^2$, $\quad \mu = \frac{1}{N}\sum_{n=0}^{N-1} Q_S(iter)$, $\quad N = 100000$,

for variance and mean statistics generation, where the sums run over the $N$ simulated frames. The throughputs for curve ARQ_1 are [95.4%, 96.8%, 97.2%, 98.3%] with about 0.05 dB coding gain at $10^{-5}$ BER. The throughputs for curve ARQ_2 are [78.3%, 79.8%, 81.4%, 82.3%], with about 0.2 dB coding gain at $10^{-5}$ BER. This partly demonstrates the performance of our ARQ schemes.
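The calibration of the lower thresholds from simulation statistics can be sketched as follows, assuming the per-frame quality indexes have been recorded for every half-iteration. The function name and the nested-list data layout are assumptions made for this sketch; the mean and variance formulas follow the expressions above, and the factor `k` corresponds to the multiplier of $\sigma$ (2 for ARQ_1, 1 for ARQ_2).

```python
import math


def calibrate_thresholds(q_samples, k=2.0):
    """Compute T_lower(iter) = mu(iter) - k * sigma(iter).

    q_samples: list of frames; each frame is a list holding the quality
               index at every half-iteration (all frames equally long).
    k:         number of standard deviations below the mean.
    """
    n_frames = len(q_samples)
    n_iters = len(q_samples[0])
    thresholds = []
    for it in range(n_iters):
        column = [frame[it] for frame in q_samples]
        mu = sum(column) / n_frames
        # Variance as mean of squares minus squared mean.
        var = sum(q * q for q in column) / n_frames - mu * mu
        sigma = math.sqrt(max(var, 0.0))  # guard against rounding below 0
        thresholds.append(mu - k * sigma)
    return thresholds
```

The same routine could be run with a different `k` per iteration by calling it once per value and picking the required entries.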

Figure 5. Performance of ARQ schemes with quality indexes

Once again, we can set up the threshold values according to statistics. Statistics generation defines the calibration process, with the mean and variance as the key parameters for the threshold search. The ARQ scheme performance (effective throughput, BER performance and decoding processing) is directly related to this calibration. We now elaborate on the setup of the threshold values $\{T_{lower}(iter)\}_{iter=0}^{2N-1}$. We first generate statistics of the chosen quality index. Denoting $\{\mu(iter)\}_{iter=0}^{2N-1}$ and $\{\sigma(iter)\}_{iter=0}^{2N-1}$ as the mean and standard deviation values, the threshold values can be given as $\{T_{lower}(iter)\}_{iter=0}^{2N-1} = \{\mu(iter) - k(iter) * \sigma(iter)\}_{iter=0}^{2N-1}$, where $k(iter)$ can be the same or different for all iterations ($\{k(iter)\}_{iter=0}^{2N-1}$ can be set as 1, 2, or 3 as typical statistics). The following diagram depicts the threshold setup.

The intuition behind this setup is the premise that the quality indexes at each iteration may be represented by a Gaussian distribution. This can be partially justified by the central limit theorem. Our proposed scheme is then based on the following quantity

(14) $\{z(iter)\}_{iter=0}^{2N-1} = \left\{\frac{Q_{index}(iter) - \mu(iter)}{\sigma(iter)}\right\}_{iter=0}^{2N-1}$,

which are z-score values in statistical terms.

Figure 6. ARQ scheme with quality index
[Diagram omitted. Quality index versus iteration number; the threshold curve $\mu(iter) - k(iter) * \sigma(iter)$ separates the pass region (above) from the reject region (below).]

2.5. Application III: sliding window size shrinking or variation

Direct implementation of the MAP-based (max* or max) turbo decoder requires a lot of memory to store the intermediate recursive sequences. Sliding window techniques [8][9][10][11] have been proposed to reduce the memory at the cost of extra computation. Later, the Viterbi dual backward engine technique [8] reduced the computation of this technique with a moderate increase in memory. This scheme basically sits in between the direct computation (minimum amount of computation, maximum amount of memory) and the extreme case of the sliding window technique (minimum amount of memory, maximum amount of computation). Both of these “windowing” techniques rely on the self-synchronization nature of convolutional decoders. In the Viterbi algorithm, we can trace back to the right state starting from any state, given that the trace back length is long enough. The MAP decoding recursions are virtually Viterbi algorithms executed in two different directions. The starting value doesn’t matter much as long as we have enough “trace back length” (this “trace back length” is the synchronization window size).

Intuitively, the synchronization window size or trace back length is also determined by the input SNR. The synchronization window size will be longer for lower SNR and shorter for higher SNR. Consider the extreme case of propagation through a channel without noise: we don’t even need this synchronization window calculation. In most practical turbo decoder designs, two MAP decodings are done in each iteration cycle with windowing techniques. As the turbo iteration process virtually increases the SNR of the input samples to each constituent decoder, the synchronization window size needed in each constituent MAP decoding can be shrunk or varied accordingly. The only subtle point is how to set the starting values for the backward recursion. Of course, we should not reduce the window size to zero. This intuition is the foundation of the windowing schemes to be introduced.

Successive shrinking of synchronization window: Let $Q_{index}(iter)$ be any of the above quality indexes or intrinsic SNR values, $T_{lower}$ a threshold value and $I_0$ a fixed iteration number. We have the following successive window-shrinking scheme:

(1) Keep decoding until $iter = I_0$ or $Q_{index}(iter) \ge T_{lower}$ with a synchronization window size of $K$ symbols and random start.

(2) Shrink the synchronization window size successively by a fixed number of $K_1$ symbols in each iteration. To be more adaptive, we can shrink the window size in proportion to the increment of the quality indexes. We use starting values $\beta(S_i) = \alpha(S_i)$ if $\alpha(S_i)$ is available, and $\beta(S_i) = \frac{1}{S_{num}}$ if $\alpha(S_i)$ is not available, where $S_i$ is a state of each constituent MAP decoder.

(3) Check for iteration stopping while shrinking the window size.

Richardson [13] pointed out a subtle window combination scheme, presented next. His recommendation is based on the following intuition and fact: the decoding quality is not going to be good at low SNR no matter how long the window size is. As an extreme case, we are not able to decode anything if pure noise is pumped into the decoder. This says that we should use short window sizes at the beginning.

Richardson window size variation scheme: The windowing technique is:

(1) Use a very short window size for the first iteration and increase the window size gradually as the iterations go on, until the maximum window size is reached.

(2) Shrink the synchronization window size successively after the maximum window size has been reached.

(3) Check for iteration stopping while shrinking the window size.
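The two window schedules above can be sketched as follows. This is only a minimal illustration: the function names, the step sizes and the lower limit `k_min` are assumptions introduced here, not values from the original designs, and the iteration stopping check of step (3) is left out.

```python
def shrinking_schedule(k_max, k_step, k_min, hold_iters, total_iters):
    """Successive shrinking: hold k_max for hold_iters iterations,
    then shrink by k_step per iteration, never going below k_min."""
    sizes = []
    for it in range(total_iters):
        if it < hold_iters:
            sizes.append(k_max)
        else:
            sizes.append(max(k_min, k_max - k_step * (it - hold_iters + 1)))
    return sizes


def grow_then_shrink_schedule(k_start, k_max, k_step, k_min, total_iters):
    """Richardson-style variation: start short, grow to k_max,
    then shrink again, never going below k_min."""
    sizes = []
    size = k_start
    growing = True
    for _ in range(total_iters):
        sizes.append(size)
        if growing and size < k_max:
            size = min(k_max, size + k_step)
        else:
            growing = False
            size = max(k_min, size - k_step)
    return sizes


if __name__ == "__main__":
    print(shrinking_schedule(20, 5, 10, 2, 6))
    print(grow_then_shrink_schedule(10, 20, 5, 10, 6))
```

Keeping `k_min` strictly positive reflects the remark that the window size should never be reduced to zero.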

Under the same setup, we have the following decoder BER performance curve.

[Figure 7 plot: two panels of BER versus SNR (SNR axis from 0.5 to 1.5). Panel titles: “Max* turbo decoding: exact(solid), B1(:), B2(+), B3(*)” and “Max turbo decoding: exact(solid), B1(:), B2(+), B3(*)”.]

Figure 7. MAX* and MAX turbo decoder with different window combinations

In the above plots, B1 means window size combination [20, 20, 20, 20, 15, 15, 15, 15, 10, 10, 10, 10, 10, 10, 10, 10]. B2 means window size combination [10, 10, 15, 15, 20, 20, 20, 20, 10, 10, 10, 10, 10, 10, 10, 10]. And B3 means window size combination [10, 10, 10, 10, 20, 20, 20, 20, 10, 10, 10, 10, 10, 10, 10, 10]. The performance degradation of these techniques compared to the exact performance curve is negligible (within hundredths of a dB). This shows that the successive window size shrinking or variation schemes work well. As expected, the max-log-MAP turbo decoding is less sensitive to the window size change. For a practical channel, we can rely on simulation or product calibration to figure out the “optimal” combination for a specific design.

The previous windowing schemes can be implemented as a very simple attachment to existing designs. Take an ASIC decoder with the Viterbi technique as an example, where one forward engine and two parallel backward engines are used. All we need to do is control when the backward engines are active and when they are idle. This “sleep mode” is very easy to control and the attached hardware is extremely simple: we just disable the clock to the sleeping engines and add a counter to control the timing properly.

2.6. Application IV: switching max* to max in later iteration stages

Note that $\max{}^*(x, y) = \log(e^x + e^y) = \max(x, y) + \log(1 + e^{-|x-y|})$, and the logarithmic correction term is typically implemented via a look-up table. Let $\alpha(S_i)$, $\beta(S_i)$ and $\gamma(S'_i, S_j)$ be the forward and backward recursive sequences and branch metric, $a(S_i) = \log\{\alpha(S_i)\}$, $b(S_i) = \log\{\beta(S_i)\}$, $c(S'_i, S_j) = \log\{\gamma(S'_i, S_j)\}$, then

(15) $a(S_i) = \max{}^*\{[a(S_i^1) + c(S_i^1, S_i)], [a(S_i^2) + c(S_i^2, S_i)]\}$,

(16) $b(S_i) = \max{}^*\{[b(S_i^1) + c(S_i, S_i^1)], [b(S_i^2) + c(S_i, S_i^2)]\}$,

where $S_i^1$ and $S_i^2$ are the corresponding transition states based on the recursive situation. This is the classical ACS operation with a correction term. As $\max{}^*(x, y) \approx \max(x, y)$, we utilize the normalized path metric in the correction term.
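As a minimal sketch of the max* operation and of one ACS step of the form (15), the following Python functions may help. The function names and the `combine` parameter are illustrative assumptions, not part of the original design.

```python
import math


def max_star(x, y):
    """Exact max*: log(e^x + e^y) = max(x, y) + log(1 + e^{-|x - y|})."""
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))


def max_approx(x, y):
    """max approximation: the logarithmic correction term is dropped."""
    return max(x, y)


def acs_step(a1, c1, a2, c2, combine=max_star):
    """One add-compare-select step over two incoming branches:
    add the branch metrics, then combine with max* (or max)."""
    return combine(a1 + c1, a2 + c2)


if __name__ == "__main__":
    print(acs_step(1.0, 0.5, 0.2, 1.1))                      # with correction
    print(acs_step(1.0, 0.5, 0.2, 1.1, combine=max_approx))  # without
```

The correction term is at most log 2 (when the two candidates are equal) and decays quickly as they separate, which is what makes a small look-up table sufficient.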

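From the LLR decomposition, we see that all the soft samples are scaled by $\frac{4\sqrt{E_b}}{N_0}$. Let $x$ and $y$ denote the path metrics at the ACS, we see that

(17) $x = \frac{4\sqrt{E_b}}{N_0}\tilde{x} = \frac{4\sqrt{E_b}}{N_0}\sum_i (\pm\sqrt{E_b}\,x_i)$, $\quad y = \frac{4\sqrt{E_b}}{N_0}\tilde{y} = \frac{4\sqrt{E_b}}{N_0}\sum_i (\pm\sqrt{E_b}\,y_i)$,

where $x_i$ and $y_i$ denote the involved systematic symbols and parity symbols, and $\log(1 + e^{-|x-y|}) = \log(1 + e^{-\frac{4\sqrt{E_b}}{N_0}|\tilde{x}-\tilde{y}|})$. For high SNR, we have asymptotically,

(18) $\lim_{N_0 \to 0} \log(1 + e^{-\frac{4\sqrt{E_b}}{N_0}|\tilde{x}-\tilde{y}|}) = 0$.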

This is the well-known fact that max* and max are asymptotically the same: the performance difference between max* and max shrinks as SNR grows. The 0.3 dB performance difference between max* and max based turbo decoding under AWGN channel is the accumulated difference of many iteration stages. As the turbo decoder brings up the intrinsic SNR, we can use max* at low SNR for performance and switch from max* to max in the later turbo iteration stages to reduce computation. Formally, let $Switch\_ITER$ be a given iteration number; we have the following algorithm:

Turbo decoding with algorithm switch: for the first $Switch\_ITER$ iterations, use max* for constituent decoding; use max for constituent decoding afterwards.

Suppose $Max\_ITER$ is the maximum iteration number and each constituent encoder has 8 states. The total number of max* correction computations for both forward recursion and backward recursion is $8 * L * Max\_ITER$ with direct implementation. This number will be reduced to $8 * L * Switch\_ITER$ by algorithm switching, and the percentage of saving is

(19) $\frac{Max\_ITER - Switch\_ITER}{Max\_ITER} = 1 - \frac{Switch\_ITER}{Max\_ITER}$.
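The saving in (19) is simple arithmetic; a small sketch follows. The function names are assumptions made for this example, and the count uses the same $8 * L * ITER$ expression as the text.

```python
def correction_count(frame_len, num_iters, num_states=8):
    """Number of max* correction computations, following the
    8 * L * ITER count used in the text (8 states assumed)."""
    return num_states * frame_len * num_iters


def switch_saving(max_iter, switch_iter):
    """Fraction of correction computations saved by switching to max
    after switch_iter iterations, as in equation (19)."""
    if max_iter <= 0 or not 0 <= switch_iter <= max_iter:
        raise ValueError("need 0 <= switch_iter <= max_iter and max_iter > 0")
    return 1.0 - switch_iter / max_iter


if __name__ == "__main__":
    # Example: 10 maximum iterations, switching after the first 4.
    print(correction_count(1000, 10), correction_count(1000, 4))
    print(switch_saving(10, 4))
```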

The following simulation results (UMTS W-CDMA turbo codes with a maximum of 10 iterations under AWGN channel) justify the performance of our switching scheme.

Figure 8. Performance of max* and max switching algorithms

Here $S_1$ means $Switch\_ITER = 1$; $S_2$, $S_3$ and $S_4$ are defined similarly. The BER performance degradation is negligible when we use max* for the first 4 iterations. The final switching point can be decided via simulation or calibration over a real channel. Our switching algorithms can be implemented in hardware with programmable parameters so that we can switch at any iteration stage.

We need the following stages of ACS operation for max*: (1) fetch memory to get the path metric values $PM(0)$ and $PM(1)$, and calculate the branch metric $S$, (2) three parallel computations for $PM(0) + S$, $PM(1) - S$ and $PM(0) - PM(1) + 2S$, (3) two parallel operations: compare $PM(0) + S$ and $PM(1) - S$ for maximum selection, and use $PM(0) - PM(1) + 2S$ to address the LUT for the correction term, and (4) a final addition.

In max decoding, the operations that can be bypassed are: the three-input adder to compute $PM(0) - PM(1) + 2S$, the addressing of the LUT, and the final two-input adder for the new path metric. We include the following diagrams to illustrate the hardware implementation. Note that the registers utilized add an additional latch to the flip-flop and that this latch is cleared separately. This feature allows for two outputs from the register, one of which may be separately cleared. Figure 9 shows the actual usage of this register in a possible hardware implementation of our algorithm.
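The ACS stages and the bypass can be modelled in software as below. This is a behavioural sketch only, not the hardware design: the LUT step size, the LUT depth and the function names are assumptions chosen for illustration.

```python
import math

LUT_STEP = 0.25   # assumed quantization step of the LUT address
LUT_SIZE = 32     # assumed LUT depth; larger differences map to zero
LUT = [math.log1p(math.exp(-k * LUT_STEP)) for k in range(LUT_SIZE)]


def lut_correction(diff):
    """Look up log(1 + e^{-|diff|}) with a coarse quantized address."""
    idx = int(abs(diff) / LUT_STEP)
    return LUT[idx] if idx < LUT_SIZE else 0.0


def acs(pm0, pm1, s, use_max_star=True):
    """ACS over candidates PM(0) + S and PM(1) - S.

    With use_max_star=False, the three-input adder, the LUT access and
    the final addition are skipped (the bypass described in the text)."""
    cand0 = pm0 + s
    cand1 = pm1 - s
    best = max(cand0, cand1)          # compare and select
    if not use_max_star:
        return best                   # max decoding: bypass the rest
    diff = pm0 - pm1 + 2 * s          # three-input adder, equals cand0 - cand1
    return best + lut_correction(diff)


if __name__ == "__main__":
    print(acs(1.0, 2.0, 0.3), acs(1.0, 2.0, 0.3, use_max_star=False))
```

A single flag selects between the two modes, which mirrors the idea of making the switching point a programmable parameter.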

Figure 9. Implementation of max* and max switching in butterfly

[Figure 9 diagram: ACS butterfly with inputs PM(0), PM(1) and S; adders forming PM(0) + S, PM(1) - S and PM(0) - PM(1) + 2S; a Compare & Select block; a LUT; and registers with data input D, outputs QA and QB, and resets R and RB.]

Figure 10. Implementation of max* and max switching in butterfly

[Figure 10 diagram: register built from a master/slave latch pair plus an added latch; signals D, CLK, RA, RB, QA and QB.]

There are many different ways to implement the ACS butterfly structure. Bypass schemes of similar form can easily be devised for any of them. In terms of hardware impact, they reduce the circuit activity factor (the LUT branch of the ACS butterfly is not activated after switching) and thus the power consumption. We also point out another way of implementing these switching algorithms: when max* is switched to max, the correction computation is no longer needed, so we can simply use a fixed value to address the LUT and get a zero output.

2.7. Conclusions

We have in a sense interpreted (through simple analysis plus simulation) the turbo decoding process. The practical applications we came up with are based on the characteristics of the turbo decoding convergence process and on engineering intuition.

Acknowledgement: We thank Professor Stephen Wilson of the University of Virginia and the anonymous referee for pointing out a straightforward way to derive the intrinsic SNR. Our special appreciation goes to John Falkowski of Agere Systems for help on writing.

References

[1] C. Berrou et al, Near Shannon limit error-correcting coding and decoding: Turbo codes, IEEE Int. Conf. on Comm., pp. 1064-1070, May, 1993

[2] L. Bahl et al, Optimal decoding of linear codes for minimizing symbol error rate, IEEE Trans. Info. Theory, Vol. 20, pp. 284-287, March, 1974

[3] J. Hagenauer & P. Hoeher, A Viterbi algorithm with soft-decision outputs and its applications, Proc. IEEE GLOBECOM, pp. 1680-1686, 1989

[4] J. Hagenauer et al, Iterative decoding of binary block and convolutional codes, IEEE Trans. Info. Theory, Vol. 42, pp. 429-445, March, 1996

[5] J. Hagenauer, The turbo principle for decoding of concatenated codes, IEEE International Workshop on Concatenated Codes, Ulm, October, 1999

[6] R. Y. Shao, S. Lin and M. P. C. Fossorier, Two simple stopping criteria for turbo decoding, IEEE Trans. Comm., Vol. 47, No. 8, pp. 1117-1120, August, 1999

[7] H. Yamamoto and K. Itoh, Viterbi decoding algorithm for convolutional codes with repeat request, IEEE Trans. Info. Theory, Vol. 26, No. 5, pp. 540-547, 1980

[8] A. Viterbi, An intuitive justification and a simplified implementation of the MAP decoder for convolutional codes, IEEE JSAC, Vol. 16, No. 2, pp. 260-264, February, 1998

[9] S. Benedetto et al, Soft input soft output MAP module to decode parallel and serial concatenated codes, TDA Progress Report 42-127, JPL, 1996

[10] S. Benedetto et al, Soft-output decoding algorithms in iterative decoding of turbo codes, TDA Progress Report 42-124, February, 1996

[11] S. Pietrobon, Efficient implementation of continuous MAP decoders and a synchronization technique for turbo decoders, Proc. Int. Sym. Inform. Theory Appl., pp. 586-589, Victoria, B. C., Canada, 1996

[12] A. Matache, S. Dolinar and F. Pollara, Stopping rules for turbo decoders, JPL TMO Progress Report 42-142, August, 2000

[13] T. Richardson, Personal communications, Flarion, Bedminster, NJ

[14] C. LaRosa, Personal communications, PCSRL, Motorola, Harvard, IL

[15] M. Schaffner and J. Oliver, Personal communications, Motorola, IL

[16] R. Wesel (UCLA), W. Ryan (Univ of Arizona), N. Beaulieu (Univ of Alberta), Personal communications

[17] T. Richardson and R. Urbanke, The capacity of low-density parity check codes under message passing decoding, IEEE Trans. Info. Theory, Vol. 47, No. 2, pp. 599-618, February 2001

[18] H. El Gamal and A. R. Hammons, Analyzing the turbo decoder using the Gaussian approximation, IEEE Trans. Info. Theory, Vol. 47, No. 2, pp. 671-686, 2001

[19] S. ten Brink, Convergence behavior of iteratively decoded parallel concatenated codes, IEEE Trans. Comm., Vol. 49, No. 10, pp. 1727-1737, October 2001

3. Extrinsic Information Impact on ML and MAP Decoding of Convolutional Codes By Shuzhan Xu and Wayne Stark

3.1. Introduction

To understand turbo decoding, we study the impact of extrinsic information on Viterbi (ML) and BCJR (MAP) decoding schemes. We thus attempt a one-iteration-step analysis of turbo decoding with a focus on the decoding process (fixed numerical procedures). First, we restate the classical ML and MAP decoding algorithms in a unified fashion to reveal their connections more clearly. One direct application of this analysis is that we can reason about the SNR dependency of the truncated Viterbi decoder trace back length and the truncated MAP decoder synchronization window size. This justifies analytically the commonly used windowing techniques in practical decoder designs [6][7][8][12].

Monotonic properties of LLR values and improved performance bounds are derived directly from our input analysis. For simplicity, we analyze only uncorrelated extrinsic information input. These results can be generalized to later turbo iterations in practice with relaxation of the correlation requirements and with the help of numerical simulations. In short, the right extrinsic information input will improve the decoder performance. Our analysis is only a partial justification of this key intuition in turbo decoding.

Quality indexes and virtual SNR values, general versions of the average intrinsic SNR introduced in [12], with extrinsic information input are proposed to monitor the decoding quality. These values have typical asymptotic behavior with respect to turbo iterations and are simple indications of the convergence of turbo decoding. Various versions come with different index sets (global, local or Yamamoto-Itoh type) which are the foundation of some practical applications. These applications are ARQ schemes, iteration stopping of local decoding engines, and adaptive iterative turbo decoding schemes. We present some analysis and some numerical simulation results for brief justification.

We simply assume the code rate $r$ is 1/2, $S_{total}$ is the total number of states on the trellis, and the frame size is $L$ for the convolutional encoder and the turbo constituent encoder (the turbo encoder rate is thus 1/3 without puncturing). We suppose the encoder starts with and ends in the zero state with proper tail bits. We suppose $M = \{m_i\}_{i=0}^{L-1}$ is the transmitted information bits, and the convolutional (or the first constituent turbo encoder) output is $X = \{x_i, p_i\}_{i=0}^{L-1}$ ($x_i = m_i$ for a systematic encoder). Transmitting over an AWGN channel with noise variance $\sigma^2 = \frac{N_0}{2}$ using BPSK modulation, we receive soft samples $Y = \{y_i, t_i\}_{i=0}^{L-1}$, $y_i = \sqrt{E_b}\,x_i + n_i$ and $t_i = \sqrt{E_b}\,p_i + n'_i$, with polarity $0 \to +1$ and $1 \to -1$. The extrinsic information sequence $Z = \{z_0, z_1, \ldots, z_{L-1}\}$ is given as $z_i = \log\frac{p(m_i = +1)}{p(m_i = -1)}$ with $p[m_i] = \frac{e^{m_i z_i/2}}{e^{z_i/2} + e^{-z_i/2}}$, for $m_i = \pm 1$.
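The relation between the extrinsic value $z_i$ and the prior $p[m_i]$ can be checked numerically with a short sketch. The function name is an assumption made for this example.

```python
import math


def prior_from_llr(z, m):
    """Prior p[m] for m in {+1, -1}, given the log-likelihood ratio
    z = log(p(m = +1) / p(m = -1))."""
    return math.exp(m * z / 2.0) / (math.exp(z / 2.0) + math.exp(-z / 2.0))


if __name__ == "__main__":
    z = 1.2
    p_plus, p_minus = prior_from_llr(z, +1), prior_from_llr(z, -1)
    print(p_plus + p_minus)              # the two priors sum to one
    print(math.log(p_plus / p_minus))    # recovers z
```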

3.2. SNR dependency of ML and MAP decoder windowing techniques

With $S_i$ denoting a state of the trellis at the $i$-th time moment, MAP decoding is optimal symbol-by-symbol detection with forward recursion

(20) $\alpha(S_i) = \sum_{S_{i-1}} \alpha(S_{i-1})\,\gamma(S_{i-1} \to S_i)$, $\quad \alpha(S_0) = 1$ for $S_0 = 0$, $\alpha(S_0) = 0$ for $S_0 \ne 0$,

and backward recursion

(21) $\beta(S_i) = \sum_{S_{i+1}} \gamma(S_i \to S_{i+1})\,\beta(S_{i+1})$, $\quad \beta(S_L) = 1$ for $S_L = 0$, $\beta(S_L) = 0$ for $S_L \ne 0$.
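The recursions (20) and (21) can be sketched in the probability domain as follows. The data layout `gamma[i][s][s_next]` and the function name are assumptions for this example; a practical decoder would work in the log domain and normalize at each step.

```python
def forward_backward(gamma, num_states):
    """Forward and backward recursions for a terminated trellis.

    gamma[i][s][s_next] is the branch metric gamma(S_i = s -> S_{i+1} = s_next)
    in the probability domain (0.0 for transitions that do not exist)."""
    L = len(gamma)
    alpha = [[0.0] * num_states for _ in range(L + 1)]
    beta = [[0.0] * num_states for _ in range(L + 1)]
    alpha[0][0] = 1.0   # start in the zero state
    beta[L][0] = 1.0    # end in the zero state
    for i in range(1, L + 1):
        for s in range(num_states):
            alpha[i][s] = sum(alpha[i - 1][sp] * gamma[i - 1][sp][s]
                              for sp in range(num_states))
    for i in range(L - 1, -1, -1):
        for s in range(num_states):
            beta[i][s] = sum(gamma[i][s][sn] * beta[i + 1][sn]
                             for sn in range(num_states))
    return alpha, beta


if __name__ == "__main__":
    # Arbitrary positive branch metrics on a fully connected 2-state trellis.
    g = [[[0.1 + 0.05 * ((i + 2 * s + sn) % 5) for sn in range(2)]
          for s in range(2)] for i in range(4)]
    a, b = forward_backward(g, 2)
    print(a[4][0], b[0][0])   # both equal the total path probability mass
```

A useful sanity check is that the sum over states of $\alpha(S_i)\beta(S_i)$ is the same at every time index, since each such sum counts all paths from the zero state to the zero state.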

The soft decision LLR is calculated as

(22) $L_i = \log\frac{p[m_i = +1 \mid Y]}{p[m_i = -1 \mid Y]} = \log\frac{\sum_{S^+} \alpha(S_i)\,\gamma(S_i \to S_{i+1})\,\beta(S_{i+1})}{\sum_{S^-} \alpha(S_i)\,\gamma(S_i \to S_{i+1})\,\beta(S_{i+1})}$,

which is the so-called log-MAP algorithm in implementation, and $\gamma(S_i \to S_{i+1})$ is the branch metric. An equivalent form is

(23) $L_i = \log\frac{\sum_{S \in P^+} p(Y \mid X_S)}{\sum_{S \in P^-} p(Y \mid X_S)} = \log\frac{\sum_{P^+} p_L(S_i)\,\gamma(S_i \to S_{i+1})\,p_R(S_{i+1})}{\sum_{P^-} p_L(S_i)\,\gamma(S_i \to S_{i+1})\,p_R(S_{i+1})}$,

where $S$ is a continuous path, $P^+ = \{S : m_i = +1\}$ and $P^- = \{S : m_i = -1\}$ cover all of the continuous paths that start and end with the zero state on the trellis, $p_L(S_i) = p[Y \mid X]_{[0, \ldots, i-1], S_i}$ is the path metric of a path that starts with the zero state and ends with state $S_i$, and $p_R(S_{i+1}) = p[Y \mid X]_{S_{i+1}, [i+1, \ldots, L-1]}$ is the path metric of a path that starts with state $S_{i+1}$ and ends with the zero state.

The Viterbi decoder is an ML algorithm, which searches for the optimal continuous path with an effective path trimming process. In parallel to MAP decoding, we now define the following recursive sequences

(24) $\alpha^*(S_i) = \max_{S_{i-1}} \{\alpha^*(S_{i-1})\,\gamma(S_{i-1} \to S_i)\}$, $\quad \alpha^*(S_0) = 1$ for $S_0 = 0$, $\alpha^*(S_0) = 0$ for $S_0 \ne 0$,

for forward recursion and the following recursive sequences

(25) $\beta^*(S_i) = \max_{S_{i+1}} \{\gamma(S_i \to S_{i+1})\,\beta^*(S_{i+1})\}$, $\quad \beta^*(S_L) = 1$ for $S_L = 0$, $\beta^*(S_L) = 0$ for $S_L \ne 0$,

for backward recursion. We can easily derive the following basic properties.

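Proposition 2.1. For a state $S_k^*$ on the trellis at time moment $k$ with $0 \le k \le L-1$, and any two path sets $P_{fwd} = \{(S_0, S_1, \ldots, S_k^*)\}$ and $P_{bwd} = \{(S_k^*, S_{k+1}, \ldots, S_L)\}$, we have

(26) $\alpha^*(S_k^*) = \max_{\{S_0, S_1, \ldots, S_{k-1}\}} \{p(Y_{[0,k]} \mid X_{\{S_0, S_1, \ldots, S_{k-1}, S_k^*\}})\}$,

(27) $\beta^*(S_k^*) = \max_{\{S_{k+1}, S_{k+2}, \ldots, S_L\}} \{p(Y_{[k,L]} \mid X_{\{S_k^*, S_{k+1}, \ldots, S_L\}})\}$.

In particular, we have $\alpha^*(S_L = 0) = \beta^*(S_0 = 0) = \max_{\{X\}} \{p(Y \mid X)\}$.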

Proof: Due to the symmetric properties of $\{\alpha^*(S_i)\}_{0 \le i \le L}$ and $\{\beta^*(S_i)\}_{0 \le i \le L}$, we just show the forward recursive results. We do this proof by mathematical induction.

$\alpha^*(S_1^*) = p(Y_{[0,1]} \mid X_{\{S_0, S_1^*\}}) = \max\{\alpha^*(S_0)\,\gamma(S_0 \to S_1^*),\ \alpha^*(S'_0)\,\gamma(S'_0 \to S_1^*)\}$ is true due to the initial assumptions about the recursive sequences. That is, for $k = 1$,

(28) $\alpha^*(S_k^*) = \max_{\{S_0, S_1, \ldots, S_{k-1}\}} \{p(Y_{[0,k]} \mid X_{\{S_0, S_1, \ldots, S_{k-1}, S_k^*\}})\}$.

Suppose $\alpha^*(S_k^*) = \max_{\{S_0, S_1, \ldots, S_{k-1}\}} \{p(Y_{[0,k]} \mid X_{\{S_0, S_1, \ldots, S_{k-1}, S_k^*\}})\}$ is true. With the definition of our recursive sequences, we have for $k + 1$

(29) $\alpha^*(S_{k+1}^*) = \max_{\{S_k\}} \{\alpha^*(S_k)\,\gamma(S_k \to S_{k+1}^*)\}$

$= \max_{\{S_k\}} \{\gamma(S_k \to S_{k+1}^*) \max_{\{S_0, S_1, \ldots, S_{k-1}\}} \{p(Y_{[0,k]} \mid X_{\{S_0, S_1, \ldots, S_{k-1}, S_k\}})\}\}$

$= \max_{\{S_0, S_1, \ldots, S_k\}} \{p(Y_{[0,k+1]} \mid X_{\{S_0, S_1, \ldots, S_{k-1}, S_k, S_{k+1}^*\}})\}$,

which proves our claim. Q.E.D

The forward and backward recursive sequences we just defined are simply a different statement of the Viterbi decoding processes (the backward sequence can be viewed as a Viterbi decoder running in the opposite direction, or decoding backward after a whole frame of samples has been received). This enables us to treat ML and MAP decoding in a universal algorithmic fashion. The soft decision LLR is calculated as

(30) $L_i^* = \log\frac{\sum_{S^+} \alpha^*(S_i)\,\gamma(S_i \to S_{i+1})\,\beta^*(S_{i+1})}{\sum_{S^-} \alpha^*(S_i)\,\gamma(S_i \to S_{i+1})\,\beta^*(S_{i+1})}$,

which is the so-called max-log-MAP algorithm in implementation. Equivalently,

(31) $L_i^* = \log\frac{\sum_{S \in P^+(surviving)} p[Y \mid X_S]}{\sum_{S \in P^-(surviving)} p[Y \mid X_S]} = \log\frac{\sum_{S \in P^+(surviving)} p_L(S_i)\,\gamma(S_i \to S_{i+1})\,p_R(S_{i+1})}{\sum_{S \in P^-(surviving)} p_L(S_i)\,\gamma(S_i \to S_{i+1})\,p_R(S_{i+1})}$,

where $P^+(surviving) = \{S : m_i = +1\}$ and $P^-(surviving) = \{S : m_i = -1\}$ cover all the continuous surviving paths that start with and end in the zero state on the trellis. Only the paths that survive the Viterbi decoder path trimming operation are included, so these path sets are smaller than those of log-MAP; the difference is surviving paths versus all paths. We can see the connection and difference between MAP and ML algorithms more clearly now. In particular, it has been proved that max-log-MAP decoding is equivalent to the extended SOVA (which is Viterbi decoding in terms of hard decision) [9]. Our analysis offers some indication of this equivalence. What we are trying to do is only to view ML and MAP decoding in a more unified fashion.


One direct application of the universal treatment of ML and MAP decoding is that we can reason that the truncated Viterbi decoder trace back length and the truncated MAP decoder synchronization window size are SNR dependent. These truncated algorithms are of great practical value. Based mainly on intuition and simulation, the Viterbi decoder trace back length and the MAP decoder window size are known to be SNR dependent, and some practical variations have been investigated (see [12] and the listed references for detail). We now try to justify these claims more rigorously. The key point is to analyze the error tolerance introduced by the truncation in the decoding process.

Since the path metric is computed based on the ACS (add, compare and select) operation, we can analyze the probability that two paths (both forward and backward) merge together (we only analyze the forward recursion paths due to the symmetry). We assume the all zero code word is transmitted. Consider two paths $X_{\{S_i, S_{i+1}, \ldots, S_{i+k-1}, \ldots, S_{i+W}\}}$ and $X_{\{S'_i, S'_{i+1}, \ldots, S'_{i+k-1}, \ldots, S'_{i+W}\}}$ starting with fixed different states $S_i$ and $S'_i$ (corresponding to bits $\{x_i, p_i, x_{i+1}, p_{i+1}, \ldots, x_{i+W}, p_{i+W}\}$ and $\{x'_i, p'_i, x'_{i+1}, p'_{i+1}, \ldots, x'_{i+W}, p'_{i+W}\}$) that do not join together. We have at least $\lfloor W/K \rfloor$ different information bits on these two paths (where $K$ is the constraint length of the encoder trellis). The reason is that any $K$ consecutive identical information bits will push the encoders into the same state and the two paths will merge together. Suppose the corresponding received soft samples are $\{y_i, t_i, y_{i+1}, t_{i+1}, \ldots, y_{i+W}, t_{i+W}\}$; then the path metric difference of this section is

(32) $\Delta_{i,W} = \sum_{j=0}^{W} (x_{i+j}\,y_{i+j} + p_{i+j}\,t_{i+j}) - \sum_{j=0}^{W} (x'_{i+j}\,y_{i+j} + p'_{i+j}\,t_{i+j}) = \sum_{j \in O_1} y_{i+j} + \sum_{j \in O_2} t_{i+j}$,

where $O_1$ has at least $\lfloor W/K \rfloor$ elements and $O_2$ has at least one element (the parity bits cannot all match due to different information input). Writing $W^*$ for the total number of elements in $O_1$ and $O_2$, we have the following mean and variance values for the statistical analysis

(33) $E[\Delta_{i,W}] = W^* \sqrt{E_b} \ge (\lfloor W/K \rfloor + 1)\sqrt{E_b} \ge \frac{W}{K}\sqrt{E_b}$,

(34) $var[\Delta_{i,W}] = W^* \sigma^2 \ge (\lfloor W/K \rfloor + 1)\,\sigma^2 \ge \frac{W}{K}\sigma^2$,

where $\Delta_{i,W} = \tilde{E} + \tilde{n}$ is a Gaussian random variable with $\tilde{E} = W^* \sqrt{E_b}$ and $var[\tilde{n}] = W^* \sigma^2$. The probability of wrong path selection, if the decision is made only based on this section of the trellis, is

(35) $p_{path,j,W,e} = p(\Delta_{i,W} < 0) = Q\left(\sqrt{\frac{(W^* \sqrt{E_b})^2}{2 W^* \sigma^2}}\right) = Q\left(\sqrt{\frac{W^* E_b}{2\sigma^2}}\right) \le Q\left(\sqrt{\frac{W E_b}{K N_0}}\right)$,

and clearly $p_{path,j,W,e} \to 0$ as $W \to \infty$.

Theorem 2.2. For Viterbi decoding with trace back length $W$ and tracing back from a randomly picked state on trellis section $\{x_i, p_i, x_{i+1}, p_{i+1}, \ldots, x_{i+W}, p_{i+W}\}$, the probability of the event $E$ that the traced-back path does not give the same bit decision for the $(i-1)$-th bit $x_{i-1}$ is bounded by

(36) $p(E) \le N_{state} \cdot Q\left(\sqrt{\frac{W E_b}{K N_0}}\right)$,

where $N_{state}$ is the total number of states of the trellis, and clearly $\lim_{W \to \infty} p(E) = 0$ with $\frac{E_b}{N_0} > 0$.

Proof: The Viterbi decoding hard decision is given by the optimal surviving path. Without loss of generality, we assume the optimal path passes the branch linking two zero states for the $(i-1)$-th bit $x_{i-1}$. Denote by $E_i$ the event that tracing back from state $S_i$ gives a path whose decision for the $(i-1)$-th bit $x_{i-1}$ is not the same. Clearly, we have $p(E_i) \le Q\left(\sqrt{\frac{W E_b}{K N_0}}\right)$, for this path does not have a cross point with the optimal path (otherwise the two paths would give the same $(i-1)$-th bit decision). We thus have

(37) $p(E) \le p\left(\bigcup_{i=0}^{N_{state}-1} E_i\right) \le \sum_{i=0}^{N_{state}-1} p(E_i) \le N_{state} \cdot Q\left(\sqrt{\frac{W E_b}{K N_0}}\right)$.

Given the commonly used bound $\left(1 - \frac{1}{x^2}\right)\frac{e^{-x^2/2}}{\sqrt{2\pi}\,x} \le Q(x) \le \frac{1}{2} e^{-x^2/2}$, we easily have

(38) $p(E) \le N_{state} \cdot Q\left(\sqrt{\frac{W E_b}{K N_0}}\right) \le \frac{1}{2} N_{state} \cdot e^{-\frac{W E_b}{2 K N_0}} \to 0$,

as $W \to \infty$ and with $\frac{E_b}{N_0} > 0$. That is, $\lim_{W \to \infty} p(E) = 0$. Q.E.D

The previous results tell us that we can get good Viterbi decoding performance by tracing back from any state as long as the trace back length is long enough and the SNR is not that bad. Normally, the Viterbi decoder will behave better than this “random trace back” since we do trace back from the state with maximum correlation path metric.
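To see how the bound of Theorem 2.2 ties the trace back length to SNR, the following sketch evaluates $N_{state} \cdot Q(\sqrt{W E_b / (K N_0)})$ and searches for the smallest $W$ that meets a target. The function names, the target value and the use of dB for $E_b/N_0$ are assumptions made for this example; the bound is loose, so the resulting numbers only show the trend.

```python
import math


def q_func(x):
    """Gaussian tail probability Q(x), via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def traceback_bound(w, k, n_state, ebn0_db):
    """Upper bound N_state * Q(sqrt(W * Eb / (K * N0))) from Theorem 2.2."""
    ebn0 = 10.0 ** (ebn0_db / 10.0)   # convert dB to a linear ratio
    return n_state * q_func(math.sqrt(w * ebn0 / k))


def min_window(k, n_state, ebn0_db, target, w_max=10000):
    """Smallest trace back length whose bound is at or below target,
    or None if no length up to w_max is enough."""
    for w in range(1, w_max + 1):
        if traceback_bound(w, k, n_state, ebn0_db) <= target:
            return w
    return None


if __name__ == "__main__":
    for snr_db in (0.0, 1.0, 2.0, 3.0):
        print(snr_db, min_window(k=4, n_state=8, ebn0_db=snr_db, target=1e-3))
```

The required length shrinks as the SNR grows, in line with the intuition that later turbo iterations can use shorter windows.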

We now look at the SNR dependency of the window size of the practical truncated MAP decoder. Sliding window, the Viterbi technique and their variations have been studied for the single side truncated MAP decoding [6][7][8]. The dual truncated MAP version (what we call local decoding schemes) will be presented later. We try to justify these decoding schemes more rigorously. For log-MAP decoding with dual truncated window size $W$, the LLR values are

$L_i^{(W)} = \log\frac{\sum_{P^+} p_L^{(W)}(S_i)\,\gamma(S_i \to S_{i+1})\,p_R^{(W)}(S_{i+1})}{\sum_{P^-} p_L^{(W)}(S_i)\,\gamma(S_i \to S_{i+1})\,p_R^{(W)}(S_{i+1})}$,

where $p_L^{(W)}(S_i)$ and $p_R^{(W)}(S_i)$ are the corresponding path metrics calculated from the truncated window operation. Without loss of generality, we assume that the truncated log-MAP decoding window operation is as follows

(39) $p_L^{(W)}(S_i) = \frac{1}{N_{state}}\, e^{-\frac{1}{2\sigma^2} \sum_{j=i-W}^{i} \{(y_j - x_j\sqrt{E_b})^2 + (t_j - p_j\sqrt{E_b})^2\}}$,

(40) $p_L(S_i) = \alpha^*(S_{i-W})\, e^{-\frac{1}{2\sigma^2} \sum_{j=i-W}^{i} \{(y_j - x_j\sqrt{E_b})^2 + (t_j - p_j\sqrt{E_b})^2\}}$,

where $\alpha^*(S_{i-W})$ has also been normalized to have $\alpha^*(S_{i-W}) \le 1$. Please notice that

$E\left[\sum_{j=i-W}^{i} \{(y_j - x_j\sqrt{E_b})^2 + (t_j - p_j\sqrt{E_b})^2\}\right] \ge 2W\sigma^2$,

which leads to $E[|p_L(S_i) - p_L^{(W)}(S_i)|] \le 2e^{-W} \to 0$ as $W \to \infty$. Similarly, we have $E[|p_R(S_i) - p_R^{(W)}(S_i)|] \le 2e^{-W} \to 0$ as $W \to \infty$. Above all, we need to show that $\lim_{W \to \infty} E[|L_i - L_i^{(W)}|] = 0$.

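For the SNR dependency of $\lim_{W \to \infty} E[|L_i - L_i^{(W)}|] = 0$, we have that $L_i - L_i^{(W)}$ equals

(41) $L_i - L_i^{(W)} = \log\frac{\sum_{P^+} p_L(S_i)\,\gamma(S_i \to S_{i+1})\,p_R(S_{i+1})}{\sum_{P^+} p_L^{(W)}(S_i)\,\gamma(S_i \to S_{i+1})\,p_R^{(W)}(S_{i+1})} - \log\frac{\sum_{P^-} p_L(S_i)\,\gamma(S_i \to S_{i+1})\,p_R(S_{i+1})}{\sum_{P^-} p_L^{(W)}(S_i)\,\gamma(S_i \to S_{i+1})\,p_R^{(W)}(S_{i+1})} \,\hat{=}\, \Delta_+ - \Delta_-$,

where $\hat{=}$ represents the words “defined as”. Obviously $|L_i - L_i^{(W)}| \le |\Delta_+| + |\Delta_-|$, and $|\Delta_+|$ is upper-bounded through the decomposition

(42) $\Delta_+ = \log\frac{\sum_{P^+} p_L(S_i)\,\gamma(S_i \to S_{i+1})\,p_R^{(W)}(S_{i+1})}{\sum_{P^+} p_L^{(W)}(S_i)\,\gamma(S_i \to S_{i+1})\,p_R^{(W)}(S_{i+1})} + \log\frac{\sum_{P^+} p_L(S_i)\,\gamma(S_i \to S_{i+1})\,p_R(S_{i+1})}{\sum_{P^+} p_L(S_i)\,\gamma(S_i \to S_{i+1})\,p_R^{(W)}(S_{i+1})}$.

Expressing these terms as addition and subtraction in the logarithmic domain, the intermediate value theorem of calculus tells us that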


(43) |)1()()1()(

)1()1()(log|

∑+ ++→

∑+ ++→

P iSWRpiSiSiSLp

P iSRpiSiSiSLp

γ

γ

)(ˆ| |,)1()(

)}1()()1(){1()(WT

PSi

P jSiSjSjSjSLpiSW

RpiSRpiSiSiSLp

jS

=∑=+∈ ∑

+ +→+−++→

∈ξγ

γ ,

where jSiS ,

ξ is an intermediate value between )1

(+i

SR

p and )1

()(+i

SWR

p .

From the truncated decoding scheme, we have

(44) },)1()(min{

|)1()()1(|)1()(

)(∑

+ +→

+∈ +−

++→

∈

∑

≤

P jSiSjSjSjSLp

PiS iSWRpiSRpiSiSiSLp

jS

WTξγ

γ

},)1()(min{

|)1()()1(|

∑+ +→

+∈ +−

+

∈

∑

≤

P jSiSjSjSjSLp

PiS iSWRpiSRp

jS

ξγ,

and we also have symmetrically that

(45) |)1()()1()()(

)1()()1()(log|

∑+ ++→

∑+ ++→

P iSWRpiSiSiSW

Lp

P iSWRpiSiSiSLp

γ

γ

})()1('

,min{

|)1()()1(|

∑+ +

→

+∈ +−

+

∈

∑

≤

P jSRpjSjSjSiS

PiS iSWLpiSLp

jS

γξ,

where ', jSiS

ξ is a intermediate value between )1

(+i

SL

p and )1

()(+i

SWL

p , and

now |||||| )(

−∆+

+∆≤− W

ii LL is therefore upper bounded by


(46) },)1()(min{

|)1()()1(|

|| )(

∑+ +→

+∈ +−

+

≤

∈

∑

−

P jSiSjSjSjSLp

PiS iSWRpiSRp

jS

Wii LL

ξγ

})1()1('

,min{

|)()()(|

∑+ ++→

+∈−

∈

∑

+

P jSRpjSjSjSiS

PiS iSWRpiSLp

jS

γξ

},)1()()(min{

|)1()()1(|

∑− +→

−∈ +−

+

∈

∑

+

P jSiSjSjSjSWLp

PiS iSWRpiSRp

jS

ηγ

})1()1('

,min{

|)()()(|

∑− ++→

−∈−

∈

∑

+

P jSRpjSjSjSiS

PiS iSWLpiSLp

jS

γη,

where ',

,, jSiSjSiS

ξη are the other two corresponding variable values when applying

the intermediate value theorem. Please note that )()()(i

SWL

pi

SL

p = is true for

samples close to the left end of the frame when the window edge meets the beginning

position. And )()()(i

SWR

pi

SR

p = is true for samples close to the right end of the

frame when the window covers the end position. In the statistical average sense, we have

(47) ])()([lim ,1∑ →+∈

+∞→PS

SSjjjLWj

jiSSSpE ξγ

i

PSiRiiiL NSpSSSp

i

=∑ →=+∈

++ ˆ)()()( 11γ ,

(48) i

PSjRjjSSW NSpSSE

jji

=∑ →+∈

+∞→ ])()([lim 1'

, γξ ,

(49) ])()([lim ,1)(∑ →

−∈+∞→

PSSSjjj

WLW

jji

SSSpE ηγ

i

PSjRjjjL DSpSSSp

j

=∑ →=−∈

+ ˆ)()()( 1γ ,


(50) i

PSjRjjSSW DSpSSE

jji

=∑ →−∈

+∞→ ])()([lim 1'

, γµ ,

we have the following bound for the difference of LLR values on average

(51) ][)1( |)()(||][| 1)(

1)( ∑ −−

+∈++⋅−≤

PSi

WRiR

Wii

i

SpSpLLE Ei

Nε

][)1( |)()(| )(∑ −++∈

⋅−PS

iW

LiLi

SpSpEi

Nε

][)1( |)()(| 1)(

1∑ −+−∈

++⋅−PS

iW

RiRi

SpSpEi

Dε

][)1( |)()(| 1)(∑ −+

−∈⋅−

PS

WLiL

i

SpSpEi

Dε ,

when 0

WW ≥ , here 0

W and ε are fixed positive constants.

We now look at )()()(i

SWL

Pi

SL

P − for a specific time position i in the middle of

the frame. We can calculate this difference precisely as

(52) |)()()(|i

SWL

Pi

SL

P −

∑−=

−+−−

−−

=

i

Wij bEjpjtbEjtjy

estateNWi

S

}2)(2){(22

1

|1

)(*| σα

∑−=

−+−−

≤

i

Wij bEjpjtbEjtjy

e

}2)(2){(22

1

σ .

If a truncated path is part of the path corresponding to the transmitted bit sequence, we

have 22][ })(){( 22 σWEi

Wijbjjbjj EptEty =∑ −+−

−=

, and for all other truncated paths,

bE

fdWE

i

Wijbjjbjj EptEty +≥∑ −+−

−=

22][ })(){( 22 σ for at least f

d bit difference

given window size W is big enough, where f

d is the free distance. Counting all the

involved truncated paths, we have

(53) ]22)122(1[2][ |)()(| )( σ

bEfd

eWstate

NWeEPS

iW

LiLi

SpSp−

−+−≤∑ −+∈

.

The same bound is true in statistics sense for the following


]22)122(1[2]|)()()(|[ σ

bEfd

eWstate

NWePiS iSW

RPiSRPE−

−+−≤+∈

−∑ ,

]22)122(1[2]|)()()(|[ σ

bEfd

eWstate

NWePiS iSW

RPiSRPE−

−+−≤−∈

−∑ ,

]22)122(1[2]|)()()(|[ σ

bEfd

eWstate

NWePiS iSW

LPiSLPE−

−+−≤−∈

−∑ .

The previous tedious derivation finally leads to the following theorem.

Theorem 2.3. For dual truncated log-MAP decoding with window size $W$, we have

(54) ]22)122(1[|])([| σ

bEfd

eWstate

NWCeWi

Li

LE−

−+−≤− ,

where ),max()1(8i

Di

NC ε−= , where state

N is the total number of states of the

trellis, and clearly 0|])([|lim =−∞→

Wi

Li

LW

given 00>

NbE

.

3.3. Extrinsic information impact on ML and MAP decoding

For ML decoding, the path metric with uncorrelated extrinsic information input is

(55) $p[\{Y, Z\} \mid X] = \prod_{i=0}^{L-1} p[y_i \mid x_i]\,p[t_i \mid p_i]\,p[m_i] = \left(\frac{1}{2\pi\sigma^2}\right)^{L} p[Y \mid X]\, \frac{e^{\frac{1}{2}\sum_{i=0}^{L-1} m_i z_i}}{\prod_{i=0}^{L-1} (e^{z_i/2} + e^{-z_i/2})}$,

where $p[Y \mid X] = e^{-\frac{1}{2\sigma^2} \sum_{i=0}^{L-1} \{(y_i - m_i\sqrt{E_b})^2 + (t_i - p_i\sqrt{E_b})^2\}}$. Please note that $e^{\frac{1}{2}\sum_{i=0}^{L-1} m_i z_i}$ is the path metric correction factor introduced by the extrinsic information. Intuitively, this correction factor helps the separation of path metrics and thus improves the decoding performance (given the right extrinsic information input).

Let $M_S^{(1)} = -\frac{1}{2\sigma^2}\sum_{i=0}^{L-1} \{(y_i - m_i\sqrt{E_b})^2 + (t_i - p_i\sqrt{E_b})^2\} + \frac{1}{2}\sum_{i=0}^{L-1} m_i z_i$ be the ML decoding path metric with extrinsic information input (also known as the extended path metric), and denote $M_S^{(0)} = -\frac{1}{2\sigma^2}\sum_{i=0}^{L-1} \{(y_i - m_i\sqrt{E_b})^2 + (t_i - p_i\sqrt{E_b})^2\}$ as the path metric without extrinsic information. We have $M_S^{(1)} = M_S^{(0)} + \frac{1}{2}\sum_{i=0}^{L-1} m_i z_i$. Let $S_{opt}^{(0)}$ and $S_{opt}^{(1)}$ denote the optimal surviving path of ML decoding without and with extrinsic information input, that is $M^{(0)}_{S_{opt}} = \max_S \{M_S^{(0)}\}$ and $M^{(1)}_{S_{opt}} = \max_S \{M_S^{(1)}\}$. Clearly $m_i z_i \ge 0$ if $m_i$ and $z_i$ have the same sign, and $m_i z_i \le 0$ otherwise. Denote

$\Delta_M^{(1)} = \min_{S \ne S_{opt}} \{M^{(1)}_{S_{opt}} - M^{(1)}_S\}$,

$\Delta_M^{(0)} = \min_{S \ne S_{opt}} \{M^{(0)}_{S_{opt}} - M^{(0)}_S\}$,

be the minimum path metric difference to the optimal path metric. If $S_{opt}^{(0)} = \{x_0^*, x_1^*, \ldots, x_{L-1}^*\}$, the extrinsic information $z_i$ has the same sign as $x_i^*$ for each $i$, and $S = \{x_0, x_1, \ldots, x_{L-1}\}$ is any non-optimal path, then $\sum_{i=0}^{L-1} x_i^* z_i \ge \sum_{i=0}^{L-1} x_i z_i$ due to the sign assumption. This implies $S_{opt}^{(1)} = S_{opt}^{(0)}$, so the optimal path remains the same with extrinsic information input. We have

(56) $M^{(1)}_{S_{opt}} - M^{(1)}_S = M^{(0)}_{S_{opt}} + \frac{1}{2}\sum_{i=0}^{L-1} x_i^* z_i - M^{(0)}_S - \frac{1}{2}\sum_{i=0}^{L-1} x_i z_i = M^{(0)}_{S_{opt}} - M^{(0)}_S + \frac{1}{2}\sum_{i=0}^{L-1} (x_i^* - x_i)\,z_i$,

where $S_{opt}$ differs from $S$ in $T = \{(i_0, i_1, \ldots, i_{p-1}) : 0 \le i_0 < i_1 < \ldots < i_{p-1} \le L-1,\ p \ge 1\}$ information bit positions with the cardinal number $|T| \ge 1$. Note that

(57) $(x_i^* - x_i)\,z_i = 2|z_i|$, if $x_i^* \ne x_i$,

(58) $(x_i^* - x_i)\,z_i = 0$, if $x_i^* = x_i$,

and $M^{(1)}_{S_{opt}} - M^{(1)}_S \ge M^{(0)}_{S_{opt}} - M^{(0)}_S + \frac{1}{2}\sum_T |z_i|$. Denote $C_Z = \frac{1}{p}\sum_T |z_i|$; then clearly $C_Z \ge \min_{0 \le i \le L-1} \{|z_i|\}$, and taking the minimum of both sides gives

(59) $\Delta_M^{(1)} \ge \Delta_M^{(0)} + C_Z \ge \Delta_M^{(0)} + \min_{0 \le i \le L-1} \{|z_i|\}$.

This inequality shows the increase of the path metric difference due to the right extrinsic information input. This increase helps the decoding quality and can be summarized in the following error probability bound, derived with the classical technique [5] of Viterbi decoder performance analysis.
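The growth of the path metric gap can be checked by brute force on a toy example. The sketch below is an illustration under stated assumptions, not the code analyzed in the paper: it uses a hypothetical rate-1/2 systematic code with parity $p_i = m_i \oplus m_{i-1}$, no trellis termination, a short frame, and randomly drawn extrinsic values whose signs agree with the ML path.

```python
import itertools
import math
import random


def encode(bits):
    """Toy systematic code (assumed for illustration): parity is
    m_i XOR m_{i-1}; bits map to symbols as 0 -> +1, 1 -> -1."""
    prev, xs, ps = 0, [], []
    for b in bits:
        xs.append(1 - 2 * b)
        ps.append(1 - 2 * (b ^ prev))
        prev = b
    return xs, ps


def metric0(bits, y, t, eb, sigma2):
    """Path metric without extrinsic information, M^(0)."""
    xs, ps = encode(bits)
    root = math.sqrt(eb)
    dist = sum((yi - root * xi) ** 2 + (ti - root * pi) ** 2
               for yi, ti, xi, pi in zip(y, t, xs, ps))
    return -dist / (2.0 * sigma2)


def min_gap(metrics, best):
    """Smallest metric difference between the best path and any other."""
    return min(metrics[best] - m for path, m in metrics.items() if path != best)


def check_gap(frame_len=6, eb=1.0, sigma2=1.0, seed=1):
    rng = random.Random(seed)
    noise = math.sqrt(sigma2)
    # All-zero codeword transmitted: every symbol is +sqrt(Eb).
    y = [math.sqrt(eb) + rng.gauss(0.0, noise) for _ in range(frame_len)]
    t = [math.sqrt(eb) + rng.gauss(0.0, noise) for _ in range(frame_len)]
    paths = list(itertools.product((0, 1), repeat=frame_len))
    m0 = {p: metric0(p, y, t, eb, sigma2) for p in paths}
    opt0 = max(m0, key=m0.get)
    # Extrinsic values with the same sign as the ML path symbols.
    x_opt, _ = encode(opt0)
    z = [xi * rng.uniform(0.5, 2.0) for xi in x_opt]
    m1 = {p: m0[p] + 0.5 * sum(xi * zi for xi, zi in zip(encode(p)[0], z))
          for p in paths}
    opt1 = max(m1, key=m1.get)
    return opt0 == opt1, min_gap(m0, opt0), min_gap(m1, opt0), min(abs(v) for v in z)


if __name__ == "__main__":
    same_path, gap0, gap1, z_min = check_gap()
    print(same_path, gap0, gap1, z_min)
```

In this setting the optimal path is unchanged and the gap with extrinsic input is at least the gap without it plus the smallest $|z_i|$, which matches the statement of the inequality.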

Theorem 3.1. If $S_{opt}^{(0)} = \{x_0^*, x_1^*, \ldots, x_{L-1}^*\}$, and $z_i$ has the same sign as $x_i^*$, we then have the following results for Viterbi decoding

(60) $p_e \le Q(C) \cdot e^{dE_b/N_0} \cdot T(D)\big|_{D = e^{-E_b/N_0}}$,

where $p_e$ is the error probability per node and $T(D)$ is the generating function, and the constant is $C = \sqrt{\frac{2dE_b}{N_0}}\left(1 + \frac{N_0}{4dE_b}\sum_{j=0}^{d^*-1} |z_j|\right)$, also

(61) $p_b \le Q(C)\, e^{dE_b/N_0}\, \frac{\partial T(D, L, I)}{\partial I}\Big|_{L=1,\, I=1,\, D = e^{-E_b/N_0}}$,

where $d_f$ is the free distance of the decoding trellis, $p_b$ is the bit error probability and $T(D, L, I)$ is the generating function, with $L$ denoting the length and $I$ denoting the number of 1’s in the information sequence. Clearly, the $Q(C)$ term is bounded by

(62) $Q(C) \le Q\left(\sqrt{\frac{2dE_b}{N_0}}\left(1 + \frac{N_0}{4dE_b}\sum_{j=0}^{d^*-1} |z_j|\right)\right) \le Q\left(\sqrt{\frac{2dE_b}{N_0}}\left(1 + \frac{N_0}{4dE_b}\min_{0 \le j \le L-1} |z_j|\right)\right)$.

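Proof: We just outline the major portion of this standard proof.

The key is to evaluate the pairwise error probability of two paths with $d \ge d_f$ bit difference and path metrics shifted by extrinsic information. Suppose one path is the transmitted all-zero path and the other one has $d^*$ different information bits. We simply assume the different bit positions are $\{0, 1, \ldots, d^*-1\}$, $d^* \ge 1$, without loss of generality. We rename the samples as $\{\xi_j\}_{j=0}^{d-1}$. Just looking at the correlation part of the extended path metric, we have the pairwise probability $p_d$ as

(63) $p_d = P\left\{\frac{\sqrt{E_b}}{\sigma^2}\sum_{j=0}^{d-1}\xi_j + \frac{1}{2}\sum_{j=0}^{d^*-1} z_j \le -\frac{\sqrt{E_b}}{\sigma^2}\sum_{j=0}^{d-1}\xi_j - \frac{1}{2}\sum_{j=0}^{d^*-1} z_j\right\}$

$= P\left\{\frac{2\sqrt{E_b}}{\sigma^2}\sum_{j=0}^{d-1}\xi_j \le -\sum_{j=0}^{d^*-1} z_j\right\}$.

Let $X = \frac{2\sqrt{E_b}}{\sigma^2}\sum_{j=0}^{d-1}\xi_j$, then $X \sim N(\mu, \sigma_X^2)$ with $\mu = \frac{2dE_b}{\sigma^2}$, $\sigma_X^2 = \frac{4dE_b}{\sigma^2}$,

(64) $p_d = \Phi\left(\frac{-\sum_{j=0}^{d^*-1} z_j - \frac{2dE_b}{\sigma^2}}{\frac{2\sqrt{dE_b}}{\sigma}}\right) = Q\left(\frac{\sum_{j=0}^{d^*-1} z_j + \frac{2dE_b}{\sigma^2}}{\frac{2\sqrt{dE_b}}{\sigma}}\right)$

$= Q\left(\frac{\sqrt{dE_b}}{\sigma}\left(1 + \frac{\sigma^2}{2dE_b}\sum_{j=0}^{d^*-1} z_j\right)\right) = Q\left(\sqrt{\frac{2dE_b}{N_0}}\left(1 + \frac{N_0}{4dE_b}\sum_{j=0}^{d^*-1} z_j\right)\right)$

$\le Q\left(\sqrt{\frac{2dE_b}{N_0}}\left(1 + \frac{N_0}{4dE_b}\sum_{j=0}^{d^*-1} |z_j|\right)\right) = Q(C)$

$\le Q\left(\sqrt{\frac{2d_fE_b}{N_0}}\left(1 + \frac{N_0}{4d_fE_b}\min_{0 \le j \le L-1}\{|z_j|\}\right)\right)$,

where $\sigma^2 = N_0/2$, and the rest of the proof follows strictly the classical techniques. Q.E.D.

With uncorrelated extrinsic information input, the branch metric computation becomes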

(65) $\gamma^{(1)}(S_i \to S_{i+1}) = p[S_{i+1}|S_i]\, p[(y_i, t_i)|S_i \to S_{i+1}] = p[y_i|x_i]\, p[t_i|p_i]\, p[m_i]$,

and the LLR values are calculated as

(66) $L_i^{(1)} = \log\frac{p[m_i = +1|\{Y, Z\}]}{p[m_i = -1|\{Y, Z\}]} = \log\frac{\sum_{S \in S^+}\exp(M_S^{(1)})}{\sum_{S \in S^-}\exp(M_S^{(1)})}$.

Recall the fact that the bit error probability of log-MAP decoding is not greater than the bit error probability of the Viterbi decoder. We use $p_b^{(M,i)}$ to denote the bit error probability of log-MAP decoding and $p_b^{(V,i)}$ to denote the bit error probability of Viterbi or max-log-MAP decoding, without extrinsic information ($i = 0$) or with extrinsic information ($i = 1$). If $S^{(0)}_{opt} = \{x_0^*, x_1^*, \ldots, x_{L-1}^*\}$, and $z_i$ has the same sign as $x_i^*$, we then have $p_b^{(M,0)} \le p_b^{(V,0)}$, $p_b^{(M,1)} \le p_b^{(V,1)}$, where

$p_b^{(V,0)} \le Q\left(\sqrt{\frac{2dE_b}{N_0}}\right) \cdot e^{dE_b/N_0} \cdot \frac{\partial T(D,L,I)}{\partial I}\Big|_{L=1,\, I=1,\, D = e^{-E_b/N_0}}$,

$p_b^{(V,1)} \le Q(C)\, e^{dE_b/N_0}\, \frac{\partial T(D,L,I)}{\partial I}\Big|_{L=1,\, I=1,\, D = e^{-E_b/N_0}}$.
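To get a feel for how much the extrinsic term shrinks the $Q(C)$ factor in (60)-(62), the following sketch evaluates it numerically. It is an illustration only: $E_b$ is normalized to 1, and the function names `q_function` and `pairwise_error` as well as the example values of $d$ and $|z_j|$ are assumptions made for the demonstration.

```python
import math

def q_function(x):
    """Gaussian tail probability Q(x) = P(N(0,1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def pairwise_error(d, eb_n0_db, z_abs):
    """Q(C) with C = sqrt(2 d Eb/N0) * (1 + N0/(4 d Eb) * sum_j |z_j|).

    Assumption: Eb is normalized to 1, so N0 = 1 / (Eb/N0).
    z_abs holds the magnitudes |z_j| on the differing information bits.
    """
    eb_n0 = 10.0 ** (eb_n0_db / 10.0)
    c = math.sqrt(2.0 * d * eb_n0) * (1.0 + sum(z_abs) / (4.0 * d * eb_n0))
    return q_function(c)

# Without extrinsic information the term reduces to Q(sqrt(2 d Eb/N0)).
without_extrinsic = pairwise_error(d=6, eb_n0_db=1.0, z_abs=[])
with_extrinsic = pairwise_error(d=6, eb_n0_db=1.0, z_abs=[1.0, 1.0])
print(without_extrinsic, with_extrinsic)
```

With an empty `z_abs` the classical pairwise term is recovered, so the ratio of the two printed values gives the reduction predicted by the bound for the chosen extrinsic magnitudes.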

In short, the previous analysis says that the right extrinsic information input helps decoder performance. This gives some insight into the role of extrinsic information in turbo decoding.

3.4. Quality indexes and virtual SNR with extrinsic information

We now argue that proper extrinsic information input raises the operating SNR of the decoder in a virtual sense. This is of course in the sense of equivalent SNR (once the samples are received, their SNR values cannot be changed any more). With the standard signal to noise ratio calculation formula


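(67) $SNR = \frac{(E[\xi_i|x_i])^2}{\sigma^2}$,

if making bitwise decision $x_i$ based on $\xi_i$. Also, we have the following relations

(68) $\log\frac{p(m_i = +1|y_i)}{p(m_i = -1|y_i)} = \log\frac{p(y_i|m_i = +1)\, p(m_i = +1)}{p(y_i|m_i = -1)\, p(m_i = -1)}$

$= \log\frac{p(y_i|m_i = +1)}{p(y_i|m_i = -1)} + \log\frac{p(m_i = +1)}{p(m_i = -1)}$

$= \log\frac{\frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(y_i - \sqrt{E_b})^2}{2\sigma^2}}}{\frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(y_i + \sqrt{E_b})^2}{2\sigma^2}}} + \log\frac{p(m_i = +1)}{p(m_i = -1)}$

$= \frac{2\sqrt{E_b}}{\sigma^2} y_i + z_i = \frac{2\sqrt{E_b}}{\sigma^2}\left(y_i + \frac{\sigma^2}{2\sqrt{E_b}} z_i\right) = \frac{2\sqrt{E_b}}{\sigma^2}\hat{\xi}_i$.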

This simple analysis shows the impact of extrinsic information on detection: the extrinsic information shifts the mean value of the received sample corresponding to the systematic bit (we prefer this to combining it with the received sample corresponding to the parity bit). In other words, it is equivalent to using $\{\xi_i, z_i\}$ for the decision in each trellis branch. The average SNR for the corresponding trellis branch of the constituent decoder thus becomes

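(69) $SNR(x_i, y_i, z_i) = \frac{1}{2}\left\{\frac{(E[\hat{\xi}_i|x_i])^2}{\sigma^2} + \frac{(E[t_i|p_i])^2}{\sigma^2}\right\}$

$= \frac{1}{2}\left\{\frac{\left(E\left[x_i\sqrt{E_b} + n_i + \frac{\sigma^2}{2\sqrt{E_b}} z_i \,\middle|\, x_i\right]\right)^2}{\sigma^2} + \frac{(E[p_i\sqrt{E_b} + n_i'|p_i])^2}{\sigma^2}\right\}$

$= \frac{1}{2}\left\{\frac{\left(x_i\sqrt{E_b} + \frac{\sigma^2}{2\sqrt{E_b}} E[z_i|x_i]\right)^2}{\sigma^2} + \frac{p_i^2 E_b}{\sigma^2}\right\}$

$= \frac{1}{2}\left\{\frac{E_b}{\sigma^2}(x_i^2 + p_i^2) + x_i E[z_i|x_i] + \frac{\sigma^2}{4E_b}(E[z_i|x_i])^2\right\}$

$= \frac{1}{2}\left\{\frac{E_b}{\sigma^2}(x_i^2 + p_i^2) + x_i z_i + \frac{\sigma^2}{4E_b} z_i^2\right\}$.

The last equality is true when $z_i$ is uncorrelated with $x_i$. This is somewhat justified when we simply look at decoding schemes as numerical procedures: in decoding algorithms, we treat extrinsic information as an input number to be combined into branch metric or recursive computations. We thus have

(70) $SNR(x_i, y_i, z_i) = \frac{1}{2}\left\{\frac{E_b}{\sigma^2}(x_i^2 + p_i^2) + x_i z_i + \frac{\sigma^2}{4E_b} z_i^2\right\}$.

Clearly if $z_i$ has the same sign as $x_i$, then the virtual SNR is increased by the right extrinsic information input. A different derivation of this result has been studied in [12]. The very phenomenon that $z_i$ usually has the same sign as $x_i$ was one of the motivations for the invention of the turbo decoder [1].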
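As a small worked example of (70), the sketch below evaluates the virtual SNR of one trellis branch for a few extrinsic values. The function name `virtual_snr` and the numeric inputs are illustrative assumptions, not values from the designs described here.

```python
def virtual_snr(x, p, z, eb, sigma2):
    """Virtual branch SNR following (70).

    x, p   : systematic and parity bits in {-1, +1}
    z      : extrinsic information for the systematic bit
    eb     : energy per bit, sigma2 : noise variance
    """
    return 0.5 * (eb / sigma2 * (x * x + p * p)
                  + x * z
                  + sigma2 / (4.0 * eb) * z * z)

# With z = 0 the expression reduces to Eb / sigma^2.
base = virtual_snr(x=1, p=1, z=0.0, eb=1.0, sigma2=0.5)
# Extrinsic value with the same sign as x raises the virtual SNR ...
helped = virtual_snr(x=1, p=1, z=1.0, eb=1.0, sigma2=0.5)
# ... while a small value of the wrong sign lowers it.
hurt = virtual_snr(x=1, p=1, z=-0.5, eb=1.0, sigma2=0.5)
print(base, helped, hurt)
```

Note that the quadratic term is always non-negative, so a wrong-signed $z_i$ lowers the virtual SNR only while $|z_i| < 4E_b/\sigma^2$ in this expression.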

We proposed the decoding monitoring quality index $Q(T) = \frac{1}{2|T|}\sum_{i \in T} x_i z_i$, where $z_i$ is the extrinsic information, $T$ is a set of consecutive sample indexes in a frame and $|T|$ is the number of indexes (equal to the corresponding number of information bits) in it. For practical use, we define the hard index $Q_H(T) = \frac{1}{2|T|}\sum_{i \in T} \hat{d}_i z_i$, where $\hat{d}_i$ is the hard decision $\hat{d}_i = sign\{L_i\}$, and the soft index $Q_S(T) = \frac{1}{2|T|}\sum_{i \in T} L_i z_i$ as approximations. Since $z_i$ typically has the same sign as $m_i$, we can use $Q_{abs}(T) = \frac{1}{2|T|}\sum_{i \in T} |z_i|$ as a quality index too. Similar to the intrinsic SNR studied in [12], we also have average virtual SNR values

(71) $AverageSNR(T) = StartSNR + Q(T) + \frac{1}{|T|}\sum_{i \in T}\left\{\frac{\sigma^2}{8E_b} z_i^2\right\}$,

for the corresponding decoding stage. We propose the following practical virtual SNR

(72) $AverageSNR_H(T) = StartSNR + Q_H(T) + \frac{1}{|T|}\sum_{i \in T}\left\{\frac{\sigma^2}{8E_b} z_i^2\right\}$,

or a soft version of it

(73) $AverageSNR_S(T) = StartSNR + Q_S(T) + \frac{1}{|T|}\sum_{i \in T}\left\{\frac{\sigma^2}{8E_b} z_i^2\right\}$,

or the absolute value quality index version of it

(74) $AverageSNR_{abs}(T) = StartSNR + Q_{abs}(T) + \frac{1}{|T|}\sum_{i \in T}\left\{\frac{\sigma^2}{8E_b} z_i^2\right\}$,

where $StartSNR = \frac{1}{2|T|}\sum_{i \in T}\left\{\frac{E_b}{\sigma^2}(x_i^2 + p_i^2)\right\} = \frac{E_b}{\sigma^2}$ is the SNR without extrinsic information.

When $T = \{0, 1, \ldots, L-1\}$, the previous expressions are the quality indexes and the intrinsic SNR values introduced in [12]. When $T = \{0, 1, \ldots, W-1\}$, with $W = 0, 1, \ldots, L-1$, these quality indexes are similar to Yamamoto-Itoh indexes for Viterbi decoding, and ARQ schemes can be devised. When $T = \{i, i+1, \ldots, i+W-1\}$ and $i = 0, 1, \ldots, L-W-1$, these quality indexes are essentially moving averages of the extrinsic information. We call these indexes local quality indexes; local decoding engines can be devised accordingly with truncated operations on a section of samples.
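The hard, soft and absolute-value indexes and the moving-average (local) variant can be computed in a few lines. The following is a minimal sketch, assuming the LLR values and extrinsic values of one frame are available as arrays; the function names and the choice of NumPy are illustrative, and `average_virtual_snr` assumes $x_i^2 = p_i^2 = 1$ so that $StartSNR = E_b/\sigma^2$.

```python
import numpy as np

def quality_indexes(llr, z):
    """Hard, soft and absolute-value quality indexes over one index set T."""
    llr = np.asarray(llr, dtype=float)
    z = np.asarray(z, dtype=float)
    n = len(z)
    d_hat = np.where(llr >= 0, 1.0, -1.0)          # hard decisions sign{L_i}
    return {
        "hard": float(np.sum(d_hat * z) / (2.0 * n)),
        "soft": float(np.sum(llr * z) / (2.0 * n)),
        "abs": float(np.sum(np.abs(z)) / (2.0 * n)),
    }

def average_virtual_snr(q, z, eb, sigma2):
    """StartSNR + Q + mean(sigma^2 z_i^2 / (8 Eb)), as in (71)-(74)."""
    z = np.asarray(z, dtype=float)
    return eb / sigma2 + q + float(np.mean(sigma2 * z ** 2 / (8.0 * eb)))

def local_hard_index(llr, z, w):
    """Hard index over every sliding window of w consecutive positions."""
    llr = np.asarray(llr, dtype=float)
    z = np.asarray(z, dtype=float)
    d_hat = np.where(llr >= 0, 1.0, -1.0)
    return np.convolve(d_hat * z, np.ones(w), mode="valid") / (2.0 * w)

llr = [2.0, -1.0, 3.0, -4.0]
z = [1.0, -0.5, 0.5, -2.0]
q = quality_indexes(llr, z)
local = local_hard_index(llr, z, w=2)
print(q, local, average_virtual_snr(q["hard"], z, eb=1.0, sigma2=0.5))
```

The sliding-window version returns one value per window start, which is the quantity a local decoding engine would compare against its threshold.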

Looking at the most extreme case with only one point extrinsic information input, we try to understand intuitively the local impact of the extrinsic information.

Proposition 4.1. If $S^{(0)}_{opt} = \{x_0^*, x_1^*, \ldots, x_{L-1}^*\}$, $z_i = 0$ for $i \ne i_0$, and $z_{i_0}$ has the same sign as $x_{i_0}^*$, then for both max-log-MAP and log-MAP decoding, we have

(75) $L_j^{(1)} \ge L_j^{(0)}$, if $x_j^* = +1$ and $0 \le j \le L-1$,

(76) $L_j^{(1)} \le L_j^{(0)}$, if $x_j^* = -1$ and $0 \le j \le L-1$,

where $L_j$ is the LLR. We have particularly

(77) $L_{i_0}^{(1)} = L_{i_0}^{(0)} + z_{i_0}$,

and this equation is always true regardless of the sign of the extrinsic information input.
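Equation (77) can be verified by brute force on a very short frame, using the LLR definition (66) with the extended path metric. The sketch below enumerates all information sequences; the toy recursive parity rule `accumulator_parity`, the frame length and the noise variance are hypothetical choices made only for this check.

```python
import itertools
import numpy as np

def accumulator_parity(m):
    """Hypothetical toy encoder: parity is the running product of the +/-1 bits."""
    return np.cumprod(m)

def llr_brute_force(y, t, z, sigma2, encode, eb=1.0):
    """LLR of every information bit by enumerating all paths, as in (66).

    The path metric is M^(0) + 0.5 * sum(m_i z_i), with M^(0) built from the
    systematic samples y and the parity samples t.
    """
    L = len(y)
    num = np.zeros(L)
    den = np.zeros(L)
    for bits in itertools.product([-1.0, 1.0], repeat=L):
        m = np.array(bits)
        p = encode(m)
        metric = -(np.sum((y - np.sqrt(eb) * m) ** 2)
                   + np.sum((t - np.sqrt(eb) * p) ** 2)) / (2.0 * sigma2)
        metric += 0.5 * np.dot(m, z)
        weight = np.exp(metric)
        num += weight * (m > 0)    # paths with m_i = +1
        den += weight * (m < 0)    # paths with m_i = -1
    return np.log(num / den)

rng = np.random.default_rng(1)
L, i0 = 5, 2
y, t = rng.normal(size=L), rng.normal(size=L)
z = np.zeros(L)
z[i0] = 0.8                        # single-point extrinsic input

llr0 = llr_brute_force(y, t, np.zeros(L), 0.7, accumulator_parity)
llr1 = llr_brute_force(y, t, z, 0.7, accumulator_parity)
print(llr1[i0] - llr0[i0])         # should match z[i0] up to rounding
```

Changing the sign of `z[i0]` leaves the difference equal to `z[i0]`, in line with the remark that (77) holds regardless of the sign of the extrinsic input.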

Proof: We just show that $L_{i_0}^{(1)} = L_{i_0}^{(0)} + z_{i_0}$ is true for log-MAP decoding (max-log-MAP is similar). For log-MAP decoding, we only need to follow the BCJR algorithm and use the recursive relations. Under the given conditions, we can easily verify that

(78) the forward recursion $\alpha(S_i) = \sum_{S_{i-1}} \alpha(S_{i-1})\, \gamma(S_{i-1} \to S_i)$ gives $\alpha^{(1)}(S_i) = \alpha^{(0)}(S_i)$, for $0 \le i \le i_0$,

(79) $\beta^{(1)}(S_i) = \beta^{(0)}(S_i)$, for $i_0 + 1 \le i \le L$.

We have therefore the following relation (with $i = i_0$)

(80) $L_{i_0}^{(1)} = \log\frac{\sum_{S^+} \alpha^{(1)}(S_i)\, \gamma^{(1)}(S_i \to S_{i+1})\, \beta^{(1)}(S_{i+1})}{\sum_{S^-} \alpha^{(1)}(S_i)\, \gamma^{(1)}(S_i \to S_{i+1})\, \beta^{(1)}(S_{i+1})}$

$= \log\frac{\sum_{S^+} \alpha^{(0)}(S_i)\, \gamma^{(1)}(S_i \to S_{i+1})\, \beta^{(0)}(S_{i+1})}{\sum_{S^-} \alpha^{(0)}(S_i)\, \gamma^{(1)}(S_i \to S_{i+1})\, \beta^{(0)}(S_{i+1})} = L_{i_0}^{(0)} + z_{i_0}$,

and the last part follows directly from the branch metric calculation. Q.E.D.

The turbo decoding process can be viewed as a confidence propagation scheme [10]. The previous results illustrate the confidence propagation in its simplest form: from one point to the whole frame. The last equality shows whether the extrinsic information is helping the decoding "constructively" or "destructively". The local indexes reflect the local impact of extrinsic information and enable us to monitor the decoding quality as locally as a single bit. They can therefore reflect local changes in decoding.

3.5. ARQ schemes with Yamamoto-Itoh type indexes

Following the approach of Yamamoto and Itoh [11], we can devise similar ARQ schemes for turbo decoding. With $Q^*_{index}(1, N)$ denoting any of the Yamamoto-Itoh type quality indexes or virtual SNR values, and $A$ denoting a threshold value, we first present the following ARQ scheme for ML or MAP decoding with extrinsic information input.

ARQ Scheme: For $1 \le N \le L$, if $Q^*_{index}(1, N) \ge A$, continue the decoding process, else request retransmission of the sample block with time index $K = \{0, 1, \ldots, N-1\}$.

For performance, we have the following results using the classical technique [5][11].

Theorem 5.1. If $S^{(0)}_{opt} = \{x_0^*, x_1^*, \ldots, x_{L-1}^*\}$, $z_i = 0$ for $i \ne i_0$, and $z_{i_0}$ has the same sign as $x_{i_0}^*$, then if we apply the ARQ scheme with quality index $Q^*(\{m_i\}, N) = \frac{1}{NE_b}\sum_{i=0}^{N-1} m_i z_i$ with Viterbi or max-log-MAP decoding, we then have

(81) $p_e \le Q\left(\sqrt{\frac{2dE_b}{N_0}}\left(1 + \frac{AN_0}{4dE_b}\right)\right) \cdot e^{dE_b/N_0} \cdot T(D)\big|_{D = e^{-E_b/N_0}}$,

where $d_f$ is the free distance of the decoding trellis, $p_e$ is the error probability per node and $T(D)$ is the generating function. Also

(82) $p_b \le Q\left(\sqrt{\frac{2dE_b}{N_0}}\left(1 + \frac{AN_0}{4dE_b}\right)\right) \cdot e^{dE_b/N_0} \cdot \frac{\partial T(D,L,I)}{\partial I}\Big|_{L=1,\, I=1,\, D = e^{-E_b/N_0}}$,

where $p_b$ is the bit error probability and $T(D,L,I)$ is the generating function with $L$ denoting the length and $I$ denoting the number of 1's in the information sequence. If log-MAP with the same scheme is applied and $p_b^{(M)}$ is the bit error probability, then

(83) $p_b^{(M)} \le p_b \le Q\left(\sqrt{\frac{2dE_b}{N_0}}\left(1 + \frac{AN_0}{4dE_b}\right)\right) \cdot e^{dE_b/N_0} \cdot \frac{\partial T(D,L,I)}{\partial I}\Big|_{L=1,\, I=1,\, D = e^{-E_b/N_0}}$.

The previous results tell us the improvement of decoding performance introduced by the ARQ schemes. We believe the performance will be similar if a different quality index or virtual SNR is used. With a threshold check at each iterative constituent decoding, we can also devise similar ARQ schemes for turbo decoding. Suppose the turbo decoder is designed with $M$ full iteration cycles. For each of the $2M$ half iteration cycles, we need to do a SISO (soft-in soft-out) decoding. Define $\{Q^*_{index}(1, N, iter)\}_{iter=0}^{2M-1}$, $1 \le N \le L$, as any of the previous Yamamoto-Itoh type indexes, and $\{A(iter)\}_{iter=0}^{2M-1}$ as iteration-based threshold values. We propose the following ARQ scheme for turbo decoding.

ARQ scheme for turbo decoding: For $iter = 0, \ldots, 2M-1$, check the following ARQ scheme at the corresponding half iteration cycle: (1) if $Q^*_{index}(1, N, iter) \ge A(iter)$ for $1 \le N \le L$, keep the turbo decoding process, (2) else request retransmission of the block with time index $K = \{0, 1, \ldots, N-1\}$.
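The threshold test used in the ARQ schemes above can be sketched as a running check over growing prefixes of the frame. This is only an illustration: it assumes the soft index $Q_S$ evaluated on the prefix $\{0, \ldots, N-1\}$ is the chosen Yamamoto-Itoh type index, and the function name `arq_check` is hypothetical.

```python
import numpy as np

def arq_check(llr, z, threshold):
    """Return None if the prefix index stays >= threshold for all N,
    otherwise the first N for which retransmission of samples 0..N-1
    would be requested.

    Assumption: the index is the soft index Q_S over {0, ..., N-1},
    i.e. sum(L_i * z_i) / (2 N).
    """
    llr = np.asarray(llr, dtype=float)
    z = np.asarray(z, dtype=float)
    n = np.arange(1, len(z) + 1)
    prefix_index = np.cumsum(llr * z) / (2.0 * n)
    failing = np.nonzero(prefix_index < threshold)[0]
    return None if failing.size == 0 else int(failing[0]) + 1

llr = [2.0, 2.0, -1.0, 3.0]
z = [1.0, 1.0, 2.0, 1.0]
print(arq_check(llr, z, threshold=0.5))   # a prefix falls below the threshold
print(arq_check(llr, z, threshold=0.2))   # every prefix passes
```

For the turbo ARQ scheme the same check would be repeated at each half iteration with its own threshold $A(iter)$.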

We can illustrate these ARQ schemes schematically with the following diagrams.

[Figure 1. ARQ Turbo Decoding. Block labels: SISO I with ARQ scheme; SISO II with ARQ scheme; π; π; de-interleaver.]

Each SISO block is a normal soft output ML or MAP decoding with a Yamamoto-Itoh type ARQ scheme utilizing the extrinsic information input as follows.

[Figure 2. SISO with ARQ. Flowchart labels: SET N=1; Is Q(1,N,iter) < A?; Yes; No; retransmit samples 0..N-1; Keep SISO Decoding; SET N=N+1.]

We require each constituent decoding to pass the ARQ scheme with Yamamoto-Itoh type index threshold requirements. Intuitively, this repeated threshold check could cause many retransmissions. This will certainly increase the decoding overhead and reduce the throughput through repeated transmission and extra decoding processing. The feasible implementation schemes and their performance need to be evaluated accordingly in practice.

3.6. Adaptive iterative decoding schemes

We now introduce some more adaptive turbo decoding schemes based on local quality indexes to combat fading. The key behind these adaptive iterative decoding schemes is the combination of local decoding engines and iteration stopping schemes. We first propose the local decoding engines, which are based on decoder windowing operations.

Instead of operating on the full trellis, practical ML and MAP decoding schemes are commonly implemented to operate only on part of the trellis at a time to save memory. These truncated algorithms are of great practical value and have been extensively studied [6][7][8][12]. The operating part of the trellis is typically shifted or jumped to cover the whole trellis of the frame. Truncated Viterbi decoding is a common practice, and techniques like the sliding window [7][8] and the Viterbi technique [6] have also been developed for MAP and thus turbo decoding. The key to these schemes is the self-synchronization property of Viterbi and MAP decoding. Some variations of these algorithms, based on intuition and simulation, have also been investigated [12]. Our analysis of the SNR dependency of the windowing techniques further validates these schemes. We now propose local decoding schemes for adaptive iterative decoding.

We apply dual-sided truncation to the normal Viterbi algorithm instead of the common single-sided truncation. Renaming the samples as $\{\xi_0, \xi_1, \ldots, \xi_{2L-2}, \xi_{2L-1}\}$, we use $M$ local decoding engines (we simply split the soft samples evenly into $M$ parts; a non-uniform partition is also perfectly valid). We also assume $\frac{L}{M} = N$ is an integer and correlation is used as path metric. The $i$-th ($1 \le i \le N$) local Viterbi decoding engine operates as follows.

Begin local Viterbi decoding engine:

(1) If $i = 1$, start path metric computation with $\{\xi_0, \xi_1, \ldots, \xi_{4M-2}, \xi_{4M-1}\}$. The path metrics are initialized as $PM(s) = 0$ for state $s = 0$ and $PM(s) = -\infty$ for $s \ne 0$ (the correlation metric is maximized). Start the trace back operation from the state with maximum path metric at time moment $2M-1$, and decode bits $\{x_0, x_1, \ldots, x_{M-1}\}$.

(2) If $i = N-1$ or $i = N$, start decoder path metric computation with soft samples $\{\xi_{2L-6M}, \xi_{2L-6M+1}, \ldots, \xi_{2L-1}\}$. The path metrics are initialized as $PM(s) = 0$ for all states $s$. Trace back from the zero state at time moment $L-1$ to decode bits $\{x_{L-2M}, x_{L-2M+1}, \ldots, x_{L-1}\}$. Please note that these two cases are merged into one due to the same trace back state (decoder flushing).

(3) If $1 < i < N-1$, start decoder path metric computation with soft samples $\{\xi_{2(i-1)M}, \xi_{2(i-1)M+1}, \ldots, \xi_{2(i+2)M-1}\}$. The path metric values are initialized as $PM(s) = 0$ for all states $s$. Trace back from the state with maximum path metric at $2(i+2)M-1$ to decode bits $\{x_{iM}, x_{iM+1}, \ldots, x_{(i+1)M-1}\}$.

End local Viterbi decoding engine

Following the simple engineering diagram approach, these local Viterbi decoding engines can be schematically illustrated as follows.

[Figure 3. Local Viterbi Decoding Engine. Labels: Synchronization Portion I; Path Metric Computation With Random Start; Reliable Decoding Portion; Start Trace Back Operation; Synchronization Portion II.]

The purpose of the first synchronization portion is to make the trellis fully open and the path metrics reliable. This synchronization period is not needed at the beginning of the frame due to the known starting state. The purpose of the second synchronization portion is for the paths to merge so that the trace back is reliable. This synchronization portion is not needed at the end of the frame due to the known ending state. It is well known that the performance degradation of a Viterbi decoder with an adequate trace back length (SNR dependent; five times the constraint length is a common rule of thumb) compared to the full trace back length is negligible. A longer trace back length could be needed for a fading channel. We can always resort to simulation to decide the proper window size needed. Please note that we can use the max-log-MAP approach to obtain soft output local Viterbi decoding engines.

Based on the self-synchronization property, we can also devise the following local BCJR decoding schemes (or local BCJR decoding engines) for MAP decoding.

[Figure 4. Local BCJR Decoding Engine. Labels: Forward Recursive Computation; Synchronization Portion I; Reliable Decoding Portion; Synchronization Portion II; Backward Recursive Computation.]

Notice that what we propose here is actually a modified version of the sliding window technique [7][8] and the Viterbi technique [6]. Once again, this scheme is based on "double truncation" to work only on a trellis segment. The key to our proposal is to add a learning portion (synchronization) for the forward recursion computation so that we can start it with random values.

The purpose of the two synchronization portions is to make the recursive sequence values reliable for the later LLR computation. Due to the symmetric nature of the recursive sequences, the same window size can be used for both synchronization portions. The very symmetry of these two recursive sequences inspired us to devise the local decoding schemes. The synchronization portion for the α sequence is not needed at the beginning of the frame, and the synchronization portion for the β sequence is not needed at the end of the frame, due to the known states. For practical design, we can always use simulation to figure out the needed window sizes. As in the Viterbi decoding case, we use $M$ local decoding engines (splitting the soft samples evenly into $M$ parts, though a non-uniform partition can also be applied). We also assume $\frac{L}{M} = N$ is an integer. The $i$-th ($1 \le i \le N$) local MAP decoding engine is thus formally given as follows.

Begin local BCJR decoding engine:

(1) If $i = 1$, start the forward recursion with $\{\xi_0, \xi_1, \ldots, \xi_{4M-2}, \xi_{4M-1}\}$. The initial values are $\alpha(0) = 1$ and $\alpha(s) = 0$ for states $s \ne 0$. Start the backward recursion with $\{\xi_0, \xi_1, \ldots, \xi_{4M-2}, \xi_{4M-1}\}$. The initial values are assigned uniformly (that is, $\beta(s) = \frac{1}{N_{state}}$). Decode bits $\{x_0, x_1, \ldots, x_{M-1}\}$.

(2) If $i = N-1$ or $i = N$, start the forward recursion with soft samples $\{\xi_{2L-4M}, \xi_{2L-4M+1}, \ldots, \xi_{2L-1}\}$. The initial values are assigned uniformly. Start the backward recursion with the same soft samples and initial values $\beta(0) = 1$ and $\beta(s) = 0$ for $s \ne 0$. Decode information bits $\{x_{L-2M}, \ldots, x_{L-1}\}$. Please note that these two cases are merged into one due to the same backward recursion starting point.

(3) If $1 < i < N-1$, start the forward recursion with soft samples $\{\xi_{2(i-1)M}, \xi_{2(i-1)M+1}, \ldots, \xi_{2(i+2)M-2}, \xi_{2(i+2)M-1}\}$. The initial values are assigned uniformly. Start the backward recursion with the same soft samples and uniformly assigned initial values. Decode bits $\{x_{iM}, x_{iM+1}, \ldots, x_{(i+1)M-1}\}$.

End local BCJR decoding engine

Splitting the time indexes of the whole soft sample frame into $T$ disjoint blocks denoted by $K_i = \{d_i, d_i + 1, \ldots, d_{i+1}\}$, $0 \le i < T$, $d_0 = 0$, $d_{T+1} = L-1$ and $d_i < d_{i+1}$, the local decoding engines $\{M_i\}_{0 \le i < T}$ operate on $M_i = \{d_i - W, \ldots, d_i, d_i + 1, \ldots, d_{i+1} + W\}$, where $W$ is the synchronization window size (we assume the window size is the same for both sides, though they can be different). We can provide parallel local ML or MAP decoding engines for each $\{M_i\}_{0 \le i < T}$ to cover the whole decoding frame. That is, we can decode the whole frame with parallel operation of these local engines. We can do iterative turbo decoding based on the parallel layout of the local decoding engines for each constituent decoder.

For each local decoding engine, we can apply iteration stopping criteria separately. In this way, we avoid uniform stopping of the iteration and obtain more adaptive turbo decoding schemes. Intuitively, the fading channel has a non-uniform impact on the received samples in the frame. Part of the samples might need more iterations to get the useful "information" fully extracted. Turbo decoding simulation results have shown that most bit errors are corrected in the first few iterations, while the later iterations correct comparatively few errors. Our intuition is therefore to introduce adaptive schemes based on local decoding engines that avoid unnecessary computation. The local quality indexes we introduced also have asymptotic behavior and can be used for iteration stopping as done in [12]. One way to stop the iteration is to monitor the quality index value until it passes a given threshold; another way is to utilize the asymptotic behavior (we stop the iteration when the index reaches its asymptote). Other iteration-stopping criteria can also be modified to stop local decoding engines.
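The partition of a frame into decoding blocks and their extended windows can be sketched as follows. This is a minimal illustration under assumptions of our own: blocks are near-even and represented as half-open Python ranges, the same synchronization size `sync` (the $W$ above) is used on both sides, and windows are clipped at the frame boundaries where no synchronization portion is needed. The name `local_windows` is hypothetical.

```python
def local_windows(frame_len, num_blocks, sync):
    """Return a list of (block, window) index ranges for the local engines.

    block  : indexes whose bits this engine actually decodes
    window : block extended by `sync` positions on each side, clipped to
             the frame (known start/end states need no synchronization)
    """
    bounds = [(i * frame_len) // num_blocks for i in range(num_blocks + 1)]
    windows = []
    for i in range(num_blocks):
        start, stop = bounds[i], bounds[i + 1]
        lo = max(0, start - sync)
        hi = min(frame_len, stop + sync)
        windows.append((range(start, stop), range(lo, hi)))
    return windows

for block, window in local_windows(frame_len=12, num_blocks=3, sync=2):
    print(list(block), list(window))
```

Each engine then runs on its `window` samples and keeps only the outputs for its `block`, so the engines are independent and can be operated in parallel.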

In the turbo decoding process, some of the local decoding engines may be stopped first. Yet turbo decoding is a frame-based operation due to the turbo interleaver, so the local decoding engines rely on each other for the next iteration stage. Even if some local decoding engines can be stopped, we still need to continue the whole iterative decoding process. We can only stop the decoding when every local decoding engine can be stopped. The key is to devise bypass schemes that cooperate with the turbo interleaving and deinterleaving of extrinsic information. The bypass scheme we are going to propose enables us to stop some local decoding engines first and still carry on the whole decoding process. We use adaptive SISO blocks for each constituent decoding. Both LLR values and extrinsic information also need to be fed into the adaptive SISO block for quality index calculation (which is a very important difference from the conventional SISO). We first present our adaptive turbo decoding schemes schematically as follows.

[Figure 5. Adaptive Turbo decoding. Labels: Adaptive SISO I; Adaptive SISO II; int; int; deint; LLR & extrinsic; LLR & extrinsic.]

Each adaptive SISO consists of the following local decoding engines, local quality index calculation and local decoding engine bypass control blocks.

The local decoding engines inside each adaptive SISO are as follows.

[Figure 6. Adaptive SISO Decoder. Labels: received samples, extrinsic info & LLR values; Sample, extrinsic & LLR Split; Local Decoding Engine n1 with Local Quality Index Check; Local Decoding Engine n2; Local Decoding Engine nT.]

Each decoding engine with local quality index check deals with just a portion of the data samples, extrinsic information and LLR values. The local quality check part decides whether we should bypass the local decoding engine as follows.

[Figure 7. Local Decoding Bypass Scheme. Labels: received samples; extrinsic info; LLR; extrinsic; LLR value; Local Decoding Engine; Local Quality Index Check (engine control); Hard Decision; MUX.]

With a hard decision block, we can calculate both the hard and soft versions of the local quality indexes or virtual SNR values. Each local decoding engine checks whether it should do the decoding or not. If the local quality index is good, the LLR and extrinsic information values are just passed to the next decoding stage without decoding. Otherwise, the local decoding engine does the decoding to update the LLR and extrinsic information values. The quality index check block controls the corresponding local decoding engine and the MUX. This control mechanism enables us to combine iteration stopping with parallel local decoding engine schemes through a kind of "sleep mode" (the local decoding engines are not necessarily running all the time; an engine remains in "sleep" when bypassed). One of the key factors of this scheme is that our local quality indexes are based on checking the input to each local decoding engine rather than the output.

As global quality indexes can be used for iteration stopping at half iteration cycles [12], we can also stop the iteration on a local decoding engine at any stage. This gives us better adaptivity and flexibility in decoder design and potentially maximum power reduction. Besides the computation saving, the decoding time is also reduced, as we just pass the LLR and extrinsic information values to the next decoding stage. That is, the decoding time for each constituent adaptive SISO decoder gets shorter and shorter as we put more and more local decoding engines to sleep. A subtle point is that if we put a local decoding engine in one of the adaptive SISO decoders to sleep, that engine should remain asleep in the next iteration (one iteration means 2 half iteration cycles). Yet the number of sleeping local engines in the next adaptive SISO decoder may not be less, due to the turbo interleaving and deinterleaving. The decoding time needed in each full iteration cycle gets smaller and smaller, but the immediately following adaptive decoding engine might take a longer decoding time. Formally,

Begin adaptive turbo decoding scheme:

(1) The adaptive turbo decoder consists of two identical adaptive constituent SISO decoders separated by the turbo interleaver and de-interleaver. The inputs to each adaptive SISO are soft samples, extrinsic information and LLR values. The outputs of each adaptive SISO are LLR values and extrinsic information to be fed into the next constituent adaptive SISO.

(2) Each adaptive SISO consists of $T$ local decoding engines corresponding to the $T$ segments of the whole frame (for the frame in both the original order and the interleaved order). Each local decoding engine is a local ML or MAP decoder with soft output. Each local decoding engine has an associated decoding bypass scheme to decide whether the local decoding engine should be put to sleep.

(3) Each local decoding engine bypass scheme works as follows: (a) Calculate first the corresponding hard or soft local quality index with the input LLR values and the extrinsic information. (b) Check the local quality index against a threshold or check whether it has reached its asymptote. (c) Bypass the local decoding engine if the quality index passes the threshold or asymptotic check, and otherwise run the local decoding engine. The word "bypass" means just taking the input LLR and extrinsic values as the output LLR and extrinsic values. We say a local decoding engine is put to sleep when it is bypassed.

(4) A sleeping local decoding engine will not be activated in the next iteration cycle. In other words, the number of sleeping engines in each adaptive SISO decoder keeps increasing as the iteration goes on. We finish the adaptive turbo decoding process when every local decoding engine is asleep.

End adaptive turbo decoding scheme

If we implement the proposed adaptive iterative decoding schemes in ASIC, the key is to implement the local decoding engines needed. There are three different ways: (1) Implement just one local decoding engine in hardware and repeat it $T$ times in each adaptive SISO decoder. (2) Implement $K$ local decoding engines in hardware and repeat them $T/K$ times in each adaptive SISO decoder, where $1 < K < T$. (3) Implement all $T$ local decoding engines in hardware. The trade-off is between hardware and decoding delay. The first approach needs the least amount of hardware and requires the maximum amount of decoding time. The last approach needs the maximum amount of hardware but comes with the least amount of decoding time. The second approach is a compromise between the previous two. We want to point out that it is extremely easy to control each local decoding engine and put it to "sleep". The major implementation effort needed is just to disable the clock to the sleeping engines. This can be easily done with a counter for timing control. We point out the very important fact that a decoder with local decoding engines uses more power due to the extra synchronization, even though the adaptive decoding schemes reduce power consumption. Investigating this subtle trade-off will guide the final decision and the choice of feasible schemes in a real implementation. Clearly, careful simulation and calibration are needed to fully evaluate all the trade-off factors.
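The bypass rule in steps (3) and (4) of the adaptive scheme above can be sketched in software as follows. This is only an illustration under stated assumptions: the hard local quality index with a fixed threshold is used as the check, blocks are given as index arrays, and `decode_block` is a hypothetical callback standing in for a local ML or MAP engine.

```python
import numpy as np

def adaptive_siso_pass(llr, ext, blocks, asleep, threshold, decode_block):
    """One adaptive SISO half iteration with per-block bypass.

    llr, ext     : input LLR and extrinsic arrays for the whole frame
    blocks       : list of index arrays, one per local decoding engine
    asleep       : list of booleans, updated in place; once an engine
                   sleeps it stays asleep in later calls
    decode_block : hypothetical callback (k, llr, ext) -> (llr_k, ext_k)
    Returns (new_llr, new_ext, done) where done means every engine sleeps.
    """
    new_llr, new_ext = llr.copy(), ext.copy()
    for k, idx in enumerate(blocks):
        if not asleep[k]:
            # Hard local quality index computed from the engine *inputs*.
            d_hat = np.where(llr[idx] >= 0, 1.0, -1.0)
            q_hard = np.sum(d_hat * ext[idx]) / (2.0 * len(idx))
            asleep[k] = bool(q_hard >= threshold)
        if asleep[k]:
            continue  # bypass: inputs are passed through unchanged
        new_llr[idx], new_ext[idx] = decode_block(k, llr, ext)
    return new_llr, new_ext, all(asleep)

# Toy usage with a stand-in engine that just scales its inputs.
blocks = [np.arange(0, 2), np.arange(2, 4)]
stub_engine = lambda k, l, e: (2.0 * l[blocks[k]], e[blocks[k]] + 1.0)
llr_in = np.array([4.0, 4.0, 0.1, -0.1])
ext_in = np.array([2.0, 2.0, 0.05, 0.05])
asleep = [False, False]
llr_out, ext_out, done = adaptive_siso_pass(
    llr_in, ext_in, blocks, asleep, threshold=0.5, decode_block=stub_engine)
print(asleep, done)
```

In hardware the `continue` branch corresponds to gating the clock of the sleeping engine while the MUX forwards its inputs.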

Finally, we point out that the strategy with parallel local decoding engines enables us to reach virtually the highest possible convolutional and turbo decoder speed [13]. Most importantly, it breaks the decoder speed bottleneck imposed by the frame size. This unexpected application also serves as a justification of our local decoding schemes.

3.7. Numerical simulation results

We present some simulation results to briefly justify the decoding schemes proposed in the previous sections. For the Yamamoto-Itoh type ARQ schemes, we use the CDMA 2000 standard turbo code with a frame size of 640 bits for simulation. The soft quality index $Q_S(iter)$ is used with threshold values $\{A(iter) = [\mu(iter) - \sigma(iter)]\}_{iter=0}^{2M-1}$, where $\mu(iter)$ and $\sigma(iter)$ are the estimated statistics of the extrinsic information obtained by

$\mu(iter) = \frac{1}{N}\sum_{n=0}^{N-1} Q_S(iter)$, $\sigma^2(iter) = \frac{1}{N}\sum_{n=0}^{N-1} [Q_S(iter)]^2 - \left\{\frac{1}{N}\sum_{n=0}^{N-1} Q_S(iter)\right\}^2$,

where the sums run over the simulated frames, with the number of frames $N = 100000$. The throughputs are [93.9%, 94.8%, 95.2%, 96.3%] with about 0.1 dB coding gain at $10^{-5}$ BER. This partly justifies our ARQ schemes with Yamamoto-Itoh type indexes. The BER performance is as follows.

Figure 8. Performance of Yamamoto-Itoh type turbo ARQ scheme

We show the performance of local decoding engines to partly demonstrate the adaptive iterative decoding schemes. We only verify the performance of turbo decoding with a parallel layout of local decoding engines in a static channel with various window sizes. We use the standard CDMA2000 turbo code (constraint length 4, rate 1/3 and rate 1/2 constituent code, 640 bits per frame). The performance degradation compared to the optimal performance limit is negligible for long enough window sizes (we plot both max and max* versions). In the following plots, W20, W15 and W10 mean that the window size is 20, 15 and 10 BPSK modulated symbols respectively.
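For readers less familiar with the max and max* variants mentioned above: max-log-MAP replaces $\log(e^a + e^b)$ by $\max(a, b)$, while log-MAP (max*) adds the correction term $\log(1 + e^{-|a-b|})$. A minimal sketch of the two operators, with function names chosen here for illustration:

```python
import math

def max_log(a, b):
    """max-log-MAP approximation of log(exp(a) + exp(b))."""
    return max(a, b)

def max_star(a, b):
    """Exact log(exp(a) + exp(b)) via the Jacobian logarithm (max*)."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

a, b = 1.0, 2.0
exact = math.log(math.exp(a) + math.exp(b))
print(max_log(a, b), max_star(a, b), exact)
```

The correction term is at most $\log 2$ and decays quickly with $|a - b|$, which is why hardware implementations often use a small lookup table for it.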

Figure 9. Performance of adaptive iterative decoding with MAX and MAX*

We use the standard CDMA2000 convolutional code (constraint length 9, rate 1/2, 504-bit frame) to verify our local Viterbi decoding engines with various window sizes. Denote W5*K, W6*K and W7*K as window sizes of 5, 6 and 7 times the constraint length (45, 54 and 63 symbols) respectively.

Figure 10. Performance of parallel local Viterbi decoding scheme

Starting from the commonly used trace back length of five times the constraint length, we can rely on simulation to decide the final window size needed for the dual truncated local Viterbi decoders. A longer window size is needed for more accurate computation. Practical fading channels will most likely need even longer window sizes.

3.8. Conclusions

We briefly analyzed the impact of extrinsic information on ML and MAP decoding schemes of convolutional codes. This analysis gives us some intuition about the turbo decoding process and leads to some practical implementation schemes (quality monitoring indexes, ARQ schemes and adaptive iterative decoding). The truncated ML and MAP decoding schemes can now also be reasoned about more rigorously within the context of our analysis.

Acknowledgement: We thank M. Eoin Buckley for presenting our results to the patent committee of Motorola and Bryan Mancini for writing up the patent applications.

References

[1] C. Berrou et al, Near Shannon limit error-correcting coding and decoding: Turbo codes, IEEE Int. Conf. On Comm., pp 1064-1070, May, 1993


Vol. 20, pp284-287, March, 1974


GLOBECOMM, pp1680-1686, 1989


Vol. 42, pp429-445, March, 1996

[5] A. Viterbi and J. Omura, Principles of digital communication and coding, McGraw-Hill, 1979

[6] A. Viterbi, An intuitive justification and a simplification of a simplified implementation of the MAP

decoder for convolutional codes, IEEE JSAC Vol. 16, No 2, pp260-264, February, 1998




turbo decoders, pp586-589, Proc. Int. Sym. Inform. Theory Appl., Victoria, B. C. Canada, 1996

[9] M. Fossorier et al, On the equivalence between SOVA and max-log-MAP decodings, pp137-139, IEEE

Comm. Letters, Vol. 2, No 5, May 1998

[10] Jung-Fu Cheng, Iterative decoding, Ph.d Thesis, Caltech

[11] H. Yamamoto and K. Itoh, Viterbi decoding algorithm for convolutional codes with repeat request, IEEE

Trans. Info. Theory, Vol. 26, No 5, pp540-547, 1980

[12] S. Xu, H. Teicher, K. Tanaka and W. Smith, A Simple Turbo Decoding Intrinsic SNR Calculation and

Applications, section 2 of this paper

[13] S. Xu, High-speed convolutional and turbo decoding schemes, section 4 of this paper


4. High-Speed Convolutional and Turbo Decoding Schemes By Shuzhan Xu

4.1. Introduction

We present in this paper some algorithms and architectures to speed up the commonly used Viterbi (ML) decoder and BCJR (MAP) decoder. These decoding schemes with soft output can be utilized in iterative turbo decoding. As data rates get higher and higher in communication systems, we need faster and faster decoders in product development. We propose here some highly parallel approaches to speed up Viterbi, BCJR and turbo decoding. Our schemes can make these decoders reach virtually arbitrarily high speed if we ignore the implementation cost. The decoder delay (time delay) is introduced mainly by two factors: trellis complexity and frame size. A parallel layout of ACS (add, compare and select) butterfly structures can tackle the complexity introduced by the trellis. To overcome the speed barrier introduced by the frame size, we need to rely on new algorithms and architectures capable of decoding local segments of the whole frame in parallel. These decoding schemes were initiated and investigated in [10] (referred to as local decoding engines there) for adaptive decoding schemes targeted at local channel impairments in turbo decoding. With a parallel layout of these local decoding engines and butterfly structures, we can reach virtually arbitrarily high decoding speed. The price to pay is silicon area expansion and increased power consumption. The parallel schemes are approximate decoding schemes. With proper arrangement, the coding gain can remain virtually unchanged and the performance degradation due to the parallel scheme can be negligible.

The ACS butterfly structures and the local decoding engines can be laid out in serial or in parallel. This flexible combination and freedom of implementation gives us the capability to design decoders with any speed we want (from very slow to extremely fast). This also reveals the underlying harmony and beauty in decoder design.

4.2. Viterbi (ML) decoding algorithms, architectures and variations

As the ML (maximum likelihood) algorithm is optimal in the sense of the best path, the Viterbi decoder searches for the optimal continuous path under the AWGN channel with an effective path trimming process: the ACS operation. Mathematically, with $X = \{x_i\}_{i=0}^{L-1}$ as information bits, $Y = \{y_{2i}, y_{2i+1}\}_{i=0}^{L-1}$ as soft samples, and code rate ½, we have

$$\hat{X} = \arg\max_{X} \{ p[Y|X] \},$$

and the i.i.d. input assumption of soft samples leads to

$$p[Y|X] = \prod_{i=0}^{L-1} p[y_i|x_i]\, p[t_i|p_i] = \left(\frac{1}{2\pi\sigma^2}\right)^{L} e^{-\frac{1}{2\sigma^2}\sum_{i=0}^{L-1}\{(x_i - y_i)^2 + (p_i - t_i)^2\}}$$

for AWGN. For Viterbi decoding, the optimal path is the path with the minimum squared Euclidean distance or, equivalently, the maximum correlation. We use correlation as branch and path metric in this paper. The additive property of correlation and the trellis structure give the following typical efficient butterfly structures for the ACS process. We present only the commonly used non-recursive code butterfly structures here. The proof follows directly from the encoder bit operation with generator polynomials. Butterfly structures for recursive codes can be modified accordingly.
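The equivalence between minimum squared Euclidean distance and maximum correlation can be checked with a short expansion. The sketch below assumes antipodal (BPSK, $\pm 1$) coded symbols; the symbols $c_k$ (coded symbols along a candidate path) and $r_k$ (the $2L$ received soft samples) are introduced here only for illustration.

```latex
\begin{align*}
\sum_{k=0}^{2L-1} (r_k - c_k)^2
  &= \sum_{k=0}^{2L-1} r_k^2 \;-\; 2\sum_{k=0}^{2L-1} r_k c_k \;+\; \sum_{k=0}^{2L-1} c_k^2 \\
  &= \underbrace{\sum_{k=0}^{2L-1} r_k^2 + 2L}_{\text{identical for every path, since } c_k^2 = 1}
     \;-\; 2\sum_{k=0}^{2L-1} r_k c_k .
\end{align*}
% Hence, under the antipodal assumption,
% \arg\min_{\text{path}} \sum_k (r_k - c_k)^2 = \arg\max_{\text{path}} \sum_k r_k c_k .
```

Because the path-independent terms drop out of every comparison, the ACS operation only needs the correlation sums, which are additive along the trellis.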

Viterbi decoding butterfly structure

With $N = 2^{K-1}$ denoting the total number of states of the trellis and $K$ the constraint length, we have the following branch transitions for states $i = 0, \ldots, \frac{N}{2} - 1$:

(1) From state $i$ and state $i + \frac{N}{2}$ to state $2i$ if the input bit is 0; the two branch outputs have opposite polarity. From state $i$ and state $i + \frac{N}{2}$ to state $2i+1$ if the input bit is 1; the two branch outputs have opposite polarity. The branch output corresponding to the state transition from $i$ to $2i$ is identical to the one corresponding to the state transition from $i + \frac{N}{2}$ to $2i+1$.

(2) With $p(i)$ denoting the path metric and $S$ the branch metric corresponding to the state transition from $i$ to $2i$, we have

$$\max[\,p(i) + S,\; p(i + \tfrac{N}{2}) - S\,] \to p(2i)$$

and

$$\max[\,p(i) - S,\; p(i + \tfrac{N}{2}) + S\,] \to p(2i+1).$$

The opposite polarity of the branch outputs means that the corresponding branch metrics have opposite signs. Thus only one branch metric computation is needed for the previous four path metric updates.
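The butterfly relations above can be sketched in a few lines of Python. This is a minimal illustration, assuming the correlation convention (a larger metric is better); the function name `acs_step` and the `branch_metric` callback are hypothetical and not part of the original design.

```python
def acs_step(pm, branch_metric):
    """One trellis step of add-compare-select using N/2 butterflies.

    pm            -- list of N path metrics, N = 2**(K-1), larger is better
    branch_metric -- callable i -> S, the metric of the transition i -> 2i
    Returns (new_pm, survivors); survivors[s] is the chosen predecessor of s.
    """
    n = len(pm)
    half = n // 2
    new_pm = [0.0] * n
    survivors = [0] * n
    for i in range(half):
        s = branch_metric(i)              # computed once per butterfly
        upper, lower = pm[i], pm[i + half]
        # Target state 2i: branch i -> 2i carries +S, branch i+N/2 -> 2i carries -S.
        if upper + s >= lower - s:
            new_pm[2 * i], survivors[2 * i] = upper + s, i
        else:
            new_pm[2 * i], survivors[2 * i] = lower - s, i + half
        # Target state 2i+1: the signs of the two branch metrics are swapped.
        if upper - s >= lower + s:
            new_pm[2 * i + 1], survivors[2 * i + 1] = upper - s, i
        else:
            new_pm[2 * i + 1], survivors[2 * i + 1] = lower + s, i + half
    return new_pm, survivors


# Example: a 4-state trellis with the same branch metric for every butterfly.
metrics, previous = acs_step([0.0, 1.0, 2.0, 3.0], lambda i: 0.5)
```

All butterflies in the loop are independent of each other, which is what allows them to be laid out in parallel in hardware.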

Viterbi decoding with full trace back length, although it gives the best BER performance, is rarely implemented because of the need to store the trace back history of the whole frame (a huge amount of trace back memory for a large frame size). To reduce memory, truncated Viterbi decoding schemes are typically implemented with sliding window techniques [6][7][8]. The truncated Viterbi decoding schemes reduce the trace back memory, but their speeds are still bounded by the frame size (the bottleneck of the decoder delay). We now recall the following local Viterbi decoding engines [10] to break the frame barrier. They were initially designed to deal with local channel fading effects in turbo decoding. Given the capability of decoding a segment, we can use a parallel layout of local decoding engines to decode a whole frame. This enables us to reach virtually arbitrarily high speed. The extreme case is to make the reliable decoding portion just one symbol. The whole decoding time is then reduced to the processing time of the two windows. These local decoding engines are illustrated in Figure 1.

Here, “path metric computation with equal start” means that we set the starting path metrics to zero for all states on the trellis. The purpose of the first synchronization portion is to make the trellis fully open and the path metrics reliable. This synchronization period is not needed at the beginning of a frame. The purpose of the second synchronization portion is for the paths to merge so that the trace back is reliable. This synchronization portion is not needed at the end of the frame due to the known ending state. A normal Viterbi decoder uses a trace back length equal to five times the constraint length or longer. We can use this trace back length as a guideline in the search for the optimal synchronization period. Longer trace back lengths (synchronization portions) will be needed for fading channels. We can always resort to simulation for the final decision. This synchronization length (called the window size) decides the complexity of the local decoding engines.

We assume that the frame size is $L$ including tail bits, that each local decoding engine reliably decodes a segment of $M$ bits (we split the whole frame evenly into segments of length $M$), and that $N = \frac{L}{M}$ is an integer. Formally, the $i$-th ($1 \le i \le N$) local decoding engine is as follows [10].

[Figure 1. Local Viterbi decoding engine. Labels: path metric computation with equal start; reliable decoding portion; start trace back from min or max path metric; sync portion I; sync portion II; slide or jump to repeat the whole operation.]

Begin local Viterbi decoding algorithm:

(1) If $i = 1$, start the butterfly structure with $\{y_0, y_1, \ldots, y_{4M-2}, y_{4M-1}\}$. The path metric values are initialized as $PM(s) = 0$ for state $s = 0$ and $PM(s) = +\infty$ for $s \ne 0$. Start the trace back operation from the state with maximum path metric at time moment $2M-1$ and decode bits $\{x_0, x_1, \ldots, x_{M-1}\}$.

(2) If $i = N-1$ or $i = N$, start the butterfly structure computation with soft samples $\{y_{2L-6M}, y_{2L-6M+1}, \ldots, y_{2L-1}\}$. The path metric values are initialized as $PM(s) = 0$ for all states $s$. Trace back from the zero state at time moment $L-1$ to decode bits $\{x_{L-2M}, x_{L-2M+1}, \ldots, x_{L-1}\}$. Please note that these two cases are merged into one here due to the same trace back state.

(3) If $1 < i < N-1$, start the butterfly computation with soft samples $\{y_{2(i-1)M}, y_{2(i-1)M+1}, \ldots, y_{2(i+2)M-1}\}$. The path metric values are initialized as $PM(s) = 0$ for all states $s$. Trace back from the state with maximum path metric at time moment $2(i+2)M-1$ to decode bits $\{x_{iM}, x_{iM+1}, \ldots, x_{(i+1)M-1}\}$.

End local Viterbi decoding algorithm

4.3. BCJR (MAP) decoding algorithms, architectures and variations

The BCJR algorithm is a MAP (maximum a posteriori probability) algorithm based on optimal symbol detection. It is commonly used for soft output and turbo decoding. Mathematically, with $X = \{x_i\}_{i=0}^{L-1}$ as information bits, $Y = \{y_{2i}, y_{2i+1}\}_{i=0}^{L-1}$ as soft samples, and code rate ½, the following probability properties are true:

$$p[S_i \to S_{i+1} \mid Y] = \frac{p[S_i \to S_{i+1}, Y]}{p[Y]},$$

$$p[S_i \to S_{i+1}, Y] = p[S_i, (y_0, t_0), \ldots, (y_{i-1}, t_{i-1})]\; p[S_{i+1}, (y_i, t_i) \mid S_i]\; p[(y_{i+1}, t_{i+1}), \ldots, (y_{L-1}, t_{L-1}) \mid S_{i+1}]$$

$$\triangleq \alpha(S_i)\, \gamma(S_i \to S_{i+1})\, \beta(S_{i+1}),$$

$$\gamma(S_i \to S_{i+1}) = p[S_{i+1} \mid S_i]\; p[(y_i, t_i) \mid S_i \to S_{i+1}] = p[m_i]\; p[y_i \mid x_i]\; p[t_i \mid p_i],$$

where $S_i$ is the state on the trellis, $\gamma(S_i \to S_{i+1})$ is the branch metric of the state transition, and $\alpha(S_i)$ and $\beta(S_{i+1})$ are two probability sequences with recursive properties running forward and backward respectively. The LLR values are given by

$$L_i = \log \frac{p[m_i = +1 \mid Y]}{p[m_i = -1 \mid Y]} = \log \frac{\sum_{S^+} \alpha(S_i)\, \gamma(S_i \to S_{i+1})\, \beta(S_{i+1})}{\sum_{S^-} \alpha(S_i)\, \gamma(S_i \to S_{i+1})\, \beta(S_{i+1})},$$

where $S^+ = \{S_i \to S_{i+1} : m_i = +1\}$ and $S^- = \{S_i \to S_{i+1} : m_i = -1\}$ are the two transition sets. This decoding algorithm can be implemented efficiently with the dual directional butterfly structures (derived from the encoder bit operation with generator polynomials) for path metric computation. Once again, we present only the non-recursive code case here. Butterfly structures can be modified accordingly for recursive codes.

BCJR decoding dual directional butterfly structure

With $N = 2^{K-1}$ denoting the total number of states of the trellis and $K$ the constraint length, we have the following branch transitions for states $i = 0, \ldots, \frac{N}{2} - 1$:

$$\max{}^*[\,a(i) + S,\; a(i + \tfrac{N}{2}) - S\,] \to a(2i), \qquad \max{}^*[\,a(i) - S,\; a(i + \tfrac{N}{2}) + S\,] \to a(2i+1),$$

where $S$ is the branch metric corresponding to the state transition from $i$ to $2i$ (forward state transition and recursive computation), and

$$b(2i) \leftarrow \max{}^*[\,b(i) + S',\; b(i + \tfrac{N}{2}) - S'\,], \qquad b(2i+1) \leftarrow \max{}^*[\,b(i) - S',\; b(i + \tfrac{N}{2}) + S'\,],$$

where $S'$ is the branch metric corresponding to the state transition from $i$ to $2i$ (backward state transition and backward computation). Only one branch metric computation is needed for each of the previous four transitions.

Here $a(S_i) = \log\{\alpha(S_i)\}$, $b(S_i) = \log\{\beta(S_i)\}$ and $c(S'_i, S_j) = \log\{\gamma(S'_i, S_j)\}$.

Note that the ACS operation contains a logarithmic correction term, typically implemented via a look-up table or a linear approximation, for log-MAP decoding (max*, defined as

$$\max{}^*(x, y) = \max(x, y) + \log(1 + e^{-|x - y|})$$

). Ignoring the logarithmic correction term leads to max-log-MAP (max) decoding. The key problem with direct implementation is that we need to hold the whole frame of forward or backward sequences to derive the final soft output, which imposes a tremendous memory requirement. Sliding block type techniques have been proposed to reduce the memory requirement at the cost of extra computation [6][7][8]. Similar to truncated Viterbi decoding, their speeds are bounded by the frame size (the bottleneck of the decoder delay).
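A minimal Python sketch of the three variants of the selection operation discussed above: exact max*, a look-up-table approximation, and plain max. The table step `LUT_STEP` and table length are arbitrary choices made here for illustration, not values from the original design.

```python
import math


def max_star(x, y):
    """Exact log-MAP operation: log(e**x + e**y)."""
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))


def max_log(x, y):
    """max-log-MAP operation: the correction term is ignored."""
    return max(x, y)


# Hypothetical 8-entry table of the correction term, sampled every 0.5.
LUT_STEP = 0.5
LUT = [math.log1p(math.exp(-k * LUT_STEP)) for k in range(8)]


def max_star_lut(x, y):
    """max* with the correction term read from a small look-up table."""
    k = int(abs(x - y) / LUT_STEP)
    correction = LUT[k] if k < len(LUT) else 0.0  # negligible beyond the table
    return max(x, y) + correction


print(max_star(1.0, 2.0), max_star_lut(1.0, 2.0), max_log(1.0, 2.0))
```

The correction term never exceeds $\log 2$ and decays quickly with $|x - y|$, which is why a short table is usually considered sufficient.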

We now cite the “local BCJR decoding engines” [10] to break the bottleneck imposed by the frame size. Once again, they were initially designed to deal with fading effects in turbo decoding. With a parallel layout of these local decoding engines, we can make MAP decoding run virtually as fast as we want. Once again, these schemes give us the capability of doing MAP decoding on a segment or even at a single point. The local decoding engines are schematically illustrated in Figure 2.

The computation of the $a(S_i)$ and $b(S_i)$ sequences starts with uniform (i.e. all equal) values except at both end points. Sync portion I and sync portion II are the learning portions in which the forward and backward recursive sequences, respectively, acquire reliable values. The first sync portion is not needed at the very beginning and the second sync portion is not needed at the very end. Similar to the Viterbi decoder trace back length, there is an issue concerning the length of the synchronization portions (also called the window size). We can again use at least five times the constraint length as the starting point in the search for the proper synchronization period via simulation. A longer synchronization period will be needed for fading channels. We use the same window size for both synchronization portions (they can be different).

Under the same assumptions as in the Viterbi decoding case, each local decoding engine again reliably decodes a segment of $M$ bits (we split the soft samples evenly into segments of this length) and we assume that $N = \frac{L}{M}$ is an integer. Formally, the $i$-th ($1 \le i \le N$) local decoding engine operates as follows.

[Figure 2. Local BCJR decoding engine. Labels: forward computation with random start values; backward computation with random start values.]

Begin local BCJR decoding algorithm:

(1) If $i = 1$, start the forward recursive computations by butterfly structure with soft samples $\{y_0, y_1, \ldots, y_{4M-1}\}$. The initial start values are $\alpha(s) = 1$ for state $s = 0$ and $\alpha(s) = 0$ for $s \ne 0$. Start the backward recursive computation by butterfly structure with soft samples $\{y_0, y_1, \ldots, y_{4M-1}\}$. The initial start values are assigned uniformly (that is, $b_s = \frac{1}{N_{state}}$). Decode bits $\{x_0, x_1, \ldots, x_{M-1}\}$.

(2) If $i = N-1$ or $i = N$, start the forward recursive computations by butterfly structure with soft samples $\{y_{2L-4M}, y_{2L-4M+1}, \ldots, y_{2L-1}\}$. The initial start values are assigned uniformly. Start the backward recursive computations by butterfly structure with the same soft samples and initial values $b(0) = 0$ and $b(s) = 0$ for $s \ne 0$. Decode information bits $\{x_{L-2M}, x_{L-2M+1}, \ldots, x_{L-1}\}$. Please note that these two cases are merged into one here due to the same backward recursion starting point.

(3) If $1 < i < N-1$, start the forward recursive computations by butterfly structure with samples $\{y_{2(i-1)M}, y_{2(i-1)M+1}, \ldots, y_{2(i+2)M-2}, y_{2(i+2)M-1}\}$. The initial start values are assigned uniformly. Start the backward recursive computations by butterfly structure with the same set of soft samples and equal initial starting values. Decode bits $\{x_{iM}, x_{iM+1}, \ldots, x_{(i+1)M-1}\}$.

End local BCJR decoding algorithm

4.4. First high-speed decoding strategy: parallel butterfly structures

To build high-speed ML and MAP decoders, the first strategy is to lay out the ACS butterfly structures in parallel to tackle the trellis complexity. Trellis complexity comes from the constraint length, as more trellis states and longer constraint lengths are typically used to obtain larger coding gain. The ACS butterfly structures are typically 4-state atomic by nature. Parallel versus serial layout is a trade-off between hardware area (for ASIC implementation) and decoder speed. This strategy is not suitable for turbo decoders, as we typically use only 4-state or 8-state constituent encoders, which implies that the parallel layout of local decoding engines is pretty much the only way to obtain high-speed turbo decoding schemes. Clearly, more elaborate schemes and architectures can be devised for specific designs. However, architectures with a parallel layout of butterfly structures are still speed bounded by the frame size. The soft samples input to the decoder are still processed in sequential order, which remains the bottleneck of decoder speed. This is particularly true as a higher data rate typically means a longer frame size (normally the frame duration is kept roughly constant). The local decoding engines enable us to break the frame size bottleneck.

4.5. Second high-speed decoding strategy: parallel local decoding engines

The key contribution of local decoding engines is that we can actually decode just a small segment of the whole frame. They were initially designed to obtain adaptive decoding schemes for turbo codes [10]. With unequal numbers of iterations imposed on different local decoding engines, they can adapt to fading channel impairments. To speed up the turbo decoder, we can lay out these local decoding engines (local BCJR engines or local soft output Viterbi decoding engines (SOVA) for each constituent decoder) in parallel so that the whole frame of samples can be processed in parallel instead of in sequential order. High-speed architectures can thus be devised to break the frame size bottleneck. This is extremely important for building high-speed turbo decoders, as the decoder delay comes mainly from the frame size and the number of iterations rather than from the trellis complexity. The parallel layout of local decoding engines enables us to have decoders (ML, MAP and turbo) of virtually arbitrarily high speed for convolutional and turbo codes in practice.
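The partition of a frame into independently decodable segments can be sketched as follows. This is an illustrative layout only: it assumes a uniform segment length and a separate `window` parameter for both synchronization portions, and it does not merge the last two segments as the formal algorithms of Sections 4.2 and 4.3 do. The function name and the 0-based, end-exclusive symbol indexing are choices made here.

```python
def local_engine_windows(frame_bits, segment_bits, window):
    """List (proc_start, proc_end, dec_start, dec_end) for every local engine.

    Indices count trellis symbols and are end-exclusive. At code rate 1/2 the
    corresponding soft-sample indices are twice the symbol indices.
    """
    if frame_bits % segment_bits != 0:
        raise ValueError("frame size must be a multiple of the segment size")
    spans = []
    for dec_start in range(0, frame_bits, segment_bits):
        dec_end = dec_start + segment_bits
        # Sync portion I is not needed at the start of the frame.
        proc_start = max(0, dec_start - window)
        # Sync portion II is not needed at the end of the frame.
        proc_end = min(frame_bits, dec_end + window)
        spans.append((proc_start, proc_end, dec_start, dec_end))
    return spans


# Example: 12 bits, segments of 4 bits, synchronization window of 2 symbols.
for span in local_engine_windows(12, 4, 2):
    print(span)
```

Each tuple describes work that depends only on its own slice of soft samples, so all engines can run at the same time.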

Once again, we assume that the whole frame of $2L$ soft samples (including the soft samples associated with tail bits) is divided evenly into segments of $M$ bits each (we simply assume this partition is uniform). More complex schemes can be devised and implemented with various decoding lengths. Clearly, the uniform partition has the simplest decoder implementation. The parallel layout of the local decoding engines for MAP decoding differs subtly from that of the Viterbi decoder. The major difference is between the trace back operation and the backward recursion. We illustrate in Figures 3 and 4 our high-speed decoding schemes based on the parallel layout of local Viterbi decoding engines and local BCJR engines.

We see that building a high-speed turbo decoder is a straightforward task once high-speed ML and MAP decoding schemes are available, as turbo decoding is merely MAP or ML decoding separated by interleaving/deinterleaving with iterations. Of course, we suppose that the interleaver/deinterleaver is not the bottleneck of turbo decoder speed.

We now analyze decoder speed in terms of hardware clock cycles. If we can process all states in parallel (in the turbo decoding case or with a parallel layout of butterfly structures) in one cycle for each soft symbol, the delay will be $3M$ cycles for the fully parallel decoders (with proper hardware techniques the trace back or LLR calculation overhead can be reduced to zero). This tells us how fast the decoder can be in its ultimate limit. If we choose $M = 100$, then we can decode the whole frame of data in 300 cycles. This speed is fast enough to handle any data rate. We make two important points clear here. The first point is that the speed of the fully parallel decoder is independent of the frame size (the parallel layout of local decoding engines breaks the frame size barrier). The second point is that the decoder speed can be considered fast enough as long as it is faster than the speed of the rest of the receiver blocks (the decoder need not be faster than the rate at which soft symbols are fed into it anyway). In a word, we have reached the decoder design speed limit.
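The cycle count above can be turned into a small calculator. The sketch below assumes one trellis step per clock cycle and zero trace back or LLR overhead, as stated in the text; the clock frequency is a hypothetical input used only to convert cycles into a bit rate for a single decoding pass.

```python
def parallel_decoder_budget(frame_bits, segment_bits, clock_hz):
    """Return (engines, delay_cycles, bits_per_second) for a fully parallel layout."""
    if frame_bits % segment_bits != 0:
        raise ValueError("frame size must be a multiple of the segment size")
    engines = frame_bits // segment_bits   # N = L / M engines run side by side
    delay_cycles = 3 * segment_bits        # sync I + reliable portion + sync II
    bits_per_second = frame_bits * clock_hz / delay_cycles
    return engines, delay_cycles, bits_per_second


# M = 100 gives 300 cycles, whatever the frame size.
print(parallel_decoder_budget(5000, 100, 100e6))
print(parallel_decoder_budget(10000, 100, 100e6))
```

Doubling the frame size doubles the number of engines but leaves the delay at $3M$ cycles, which is the frame-size independence noted above.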

[Figure 3. Parallel local Viterbi/SOVA engines: local decoding engines I, II, III, ..., K]

[Figure 4. Parallel local BCJR (MAP) engines: local decoding engines I, II, III, ..., K]


We now give a brief analysis of the implementation cost. Clearly, the major costs are silicon area expansion and increased power consumption. A rough estimate of the area expansion in a fully parallel implementation is an increase by $\frac{L}{M}$ times as compared to a normal implementation. This is the total number of local decoding engines in parallel. We assume that the power consumption for executing the same engine $\frac{L}{M}$ times is the same as for executing $\frac{L}{M}$ engines once in parallel. The power consumption increase is mainly introduced by the extra computation brought in by the first synchronization portion (this portion is not needed if we only apply the sliding window or Viterbi technique [6][7][8]). Therefore, the dynamic power is roughly increased by a factor of 13 compared to normal decoding architectures. Please note that the silicon area and the power consumption of each local decoding engine also depend on how we lay out the butterfly structures. To achieve faster decoding speed, a parallel layout of butterfly structures will of course be more costly than serial layout versions.

4.6. Flexible combination, implementation freedom and performance

Finally, we can view our high-speed decoding schemes as a combination of two kinds of “processing”: states and samples. Both states and samples can be processed in parallel or sequentially. We can also view these schemes as a combination of state “resolution” and sample “resolution”. A higher degree of parallel processing means a finer resolution. Our high-speed decoding algorithms give the flexibility to have any combination of these resolutions. We illustrate our decoding schemes in Figure 5.

We assume here that the states are organized according to the order in which they appear in the butterfly structures and that the samples are in natural order. Each box in the diagram corresponds to a local decoding engine (the box is simply the reliable decoding portion of the dual directional synchronized local decoding engine). Note that the state resolution and the sample resolution can be refined globally or even locally (that is, we can actually have non-uniform local decoding engines). The combination of resolutions can be refined or merged. More precisely, a local decoding engine can be split into two local decoding engines and two consecutive local decoding engines can be merged into one.

[Figure 5. Joint state and sample resolution diagram. Axes: state resolution, sample resolution. Labels: state resolution x sample resolution, split and merge (parallel states and parallel local engines); split 1 into 2; merge 2 into 1.]

This “split and merge” can be done according to any desired resolution. This gives us full flexibility in designing local decoding engines. Similarly, the state resolution can also be “split and merged”. More precisely, we can use a fully parallel layout of butterfly structures (finest resolution) or use the butterfly structures serially (coarsest resolution). A chosen decoder design can come with any combination and corresponds to a specific decoder speed, ranging from the slowest (coarsest resolution) to the fastest (finest resolution). This adaptability reveals the underlying harmony and beauty in decoder design (nature is indeed interesting).

The parallel layout of butterfly structures is purely an implementation issue and does not affect decoder performance. We just need to justify the performance under the parallel layout of local decoding engines (that is, we only need to justify the performance of local decoding engines under various window sizes). For turbo decoding, we use the original turbo code [1] (constraint length 4, rate 1/3 and rate ½ constituent code, 800 bits per frame) to justify the performance of our decoding schemes with local decoding engines under various window sizes. The performance degradation compared to the commonly used sliding window or Viterbi techniques, and even to the optimal performance limit, is negligible for long enough window sizes. We plot their performance (for both max* and max versions) in the following plots. In each of the next two plots, $W_{20}$, $W_{15}$, $W_{10}$ and $W_{5}$ mean that the window size is 20, 15, 10 and 5 symbols (with BPSK modulation) respectively. The local decoding engines are implemented with equal window sizes.

Figure 6. Performance of parallel local MAP turbo decoding schemes: max*

Figure 7. Performance of parallel local MAP turbo decoding schemes: max


The previous performance curves show the robustness and quality of turbo decoding with parallel local MAP decoding engines under various window sizes. For practical designs, we can always use simulation to decide the “optimal” window size. As expected, we notice that max based turbo decoding is less sensitive to window size change.

We now use the standard convolutional codes (constraint length 9 and rate ½) to justify our local Viterbi decoding engines under various window sizes. The following two plots are for frame sizes 288 and 512 respectively. In each of the next two plots, $W_{4*K}$, $W_{5*K}$, $W_{6*K}$ and $W_{7*K}$ mean that the window size is 4, 5, 6 and 7 times the constraint length (that is, 36, 45, 54 and 63 symbols) respectively. The local decoding engines are implemented with equal window size for both ends.

Figure 8. Performance of parallel local Viterbi decoding schemes ($L = 288$)

Figure 9. Performance of parallel local Viterbi decoding schemes ($L = 512$)

We see that the common practice of using five times the constraint length as trace back length and window size does not work quite as well for the dual truncated Viterbi decoding schemes. A longer window size is needed for more accurate computation (to compensate for the truncation error from both ends). This is a demanding requirement, especially in the low BER operating range. Fading channels will need even longer window sizes. We can always resort to numerical simulation to determine the “optimal” window size easily.

Luckily, the parallel local MAP-based (max or max*) turbo decoders are less sensitive to window size variations. As Viterbi decoder is mainly used for high BER applications and turbo decoder is mainly used for low BER applications, we can build the corresponding error correction schemes without too much implementation cost.


4.7. Conclusions

We presented algorithms and architectures to speed up ML, MAP and turbo decoding schemes. Without considering the implementation cost (such as silicon area and power consumption), our algorithms can achieve virtually arbitrarily high speed. The main design effort is to decide the trade-off between speed, silicon area and power consumption. The “split and merge” capability of local decoding engines and butterfly structures enables us to devise decoding algorithms and architectures of different speeds based on the final choice. In short, we have presented schemes that virtually reach the possible decoding speed limit and have come up with guidelines for speed and implementation trade-off decisions.

The input sample and output sample transmission latency must be considered in decoder designs. Elaborating further on Professor Stephen Wilson’s comment, high-speed buses (parallel, high bandwidth and with an efficient protocol) for data transmission also need to be studied. This effort could lead to important engineering innovations.

Acknowledgement: My gratitude goes to my teacher, Professor Wayne Stark of the University of Michigan, for leading me into the world of communications.

References

[1] C. Berrou et al, Near Shannon limit error-correcting coding and decoding: turbo codes, IEEE Int. Conf. On Comm., pp 1064-1070, May, 1993


Vol. 20, pp284-287, March, 1974


GLOBECOMM, pp1680-1686, 1989


Vol. 42, pp429-445, March, 1996

[5] A. Viterbi and J. Omura, Principles of digital communication and coding, McGraw-Hill, 1979






turbo decoders, pp586-589, Proc. Int. Sym. Inform.Theory Appl., Victoria, B. C. Canada, 1996

[9] G. Fettweis and H. Meyr, High speed parallel Viterbi decoding: algorithm and VLSI architecture, IEEE

Communications Magazine, pp46-55, May 1991

[10] S. Xu and W. Stark, Extrinsic information impact on ML and MAP decoding of convolutional codes, section 3 of this paper

[11] P. Beerel and K. Chugg, A low latency SISO with application to broadband turbo decoding, IEEE JSAC

Vol. 19, No 5, May 2001, pp860-870


5. Simple RMS Soft Sample Scaling and Simplified Turbo Decoders By Shuzhan Xu, Jan Meyer and Gerhard Ammer

5.1. Introduction: soft sample scaling problems and intuitions

For turbo decoding, we need an estimate of the SNR values to scale the input samples in order to avoid possibly severe performance degradation [4][5][6][7][8][9][10]. For implementation, we also need to address the dynamic range issue for efficient soft sample representation. In a finite precision format, every scaling operation means requantization (with truncation error and saturation error). Precisely speaking, we must have an SNR estimator to give the online SNR values and another scaling factor to put the soft samples into the right dynamic range. Ideally, these two estimated factors are one and the same. We must also take into account the impact of power control and AGC in CDMA systems. To have a good local online SNR estimate, the estimator needs to be short. Yet a short estimator will itself have more estimation errors. We prefer to have an estimator before the channel de-interleaver so that it reflects the channel impairment in its natural order. Given the power control and the AGC schemes, we prefer the estimator to be slot based to reflect the difference in transmission power.

Besides the dynamic range, the scaling requirements for the turbo decoder are mainly due to the max* correction term (typically implemented via a look-up table). One way is to scale the soft samples properly and use a fixed look-up table. Another way is to program the look-up table according to the SNR estimate. The last way is a combined approach: we scale the soft samples with a control constant and use it to program the look-up table entries. RMS values are a natural choice for dynamic range scaling. An important property is that the squared slot-based RMS values add up to the squared RMS value of the whole frame with a simple control constant. The most important fact, as we will see, is that the RMS value with a control constant can also serve as an SNR estimate.
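The additivity of squared slot RMS values can be illustrated with a short Python sketch. It assumes equal-length slots, in which case the control constant is simply one over the number of slots; the function names and the target RMS level are hypothetical choices for this illustration.

```python
import math


def rms(samples):
    """Root mean square of a list of soft samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def frame_rms_from_slots(slot_rms_values):
    """Combine slot RMS values into the frame RMS (equal-length slots assumed)."""
    mean_square = sum(r * r for r in slot_rms_values) / len(slot_rms_values)
    return math.sqrt(mean_square)


def scale_to_rms(samples, target_rms):
    """Scale samples so that their RMS equals target_rms (a hypothetical level)."""
    factor = target_rms / rms(samples)
    return [v * factor for v in samples]


frame = [1.0, -2.0, 3.0, -4.0, 0.5, 0.5, -1.0, 2.0]
slots = [frame[0:4], frame[4:8]]
print(rms(frame), frame_rms_from_slots([rms(s) for s in slots]))
```

Slot-level values can therefore be computed before the channel de-interleaver and combined later without revisiting the samples.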

We suppose the code rate is 1/3 for the turbo encoder without puncturing and the code rate is ½ for each constituent encoder. Let $\{x_i\}_{i=0}^{L-1}$ be the transmitted bit sequence, and let $\{p_i\}_{i=0}^{L-1}$ and $\{q_i\}_{i=0}^{L-1}$ be the parity bit sequences output by the first and the second constituent encoder respectively. When these bits are transmitted over a channel with Rayleigh fading factors $\alpha_i$ and additive white Gaussian noise with variance $\sigma^2 = \frac{N_0}{2}$, we receive samples

$$\{y_i\}_{i=0}^{L-1} = \{\alpha_i x_i \sqrt{E_S} + n_i\}_{i=0}^{L-1},$$

$$\{t_i\}_{i=0}^{L-1} = \{\alpha_i p_i \sqrt{E_S} + n'_i\}_{i=0}^{L-1},$$

$$\{s_i\}_{i=0}^{L-1} = \{\alpha_i q_i \sqrt{E_S} + n''_i\}_{i=0}^{L-1},$$

where $L$ is the frame size. Each SISO decoding relies on the following equation [1][3]

(84) $L_i = \frac{2\alpha_i\sqrt{E_S}}{\sigma^2}\, y_i + z_i + l_i$,

where $\{L_i\}_{i=0}^{L-1}$ are the LLR values, $\{z_i\}_{i=0}^{L-1}$ is the input extrinsic (a priori) information and $\{l_i\}_{i=0}^{L-1}$ is the newly generated extrinsic information to be used for the next iteration. Clearly, we need SNR values to scale the received samples for extrinsic information generation. A subtle point is that SNR scaling is only needed for max* (log-MAP) due to the logarithmic correction term, which makes the decoding process nonlinear. There is no need for soft sample scaling with linear decoding schemes (e.g. max-log-MAP, although its performance is not as good).
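The difference between the two operations with respect to scaling can be checked numerically. The sketch below measures how far an operation is from commuting with a positive scale factor `c`; the helper name `homogeneity_gap` and the sample values are chosen here for illustration only.

```python
import math


def max_star(x, y):
    """log-MAP selection: max plus the logarithmic correction term."""
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))


def homogeneity_gap(op, x, y, c):
    """op(c*x, c*y) - c*op(x, y); zero when scaling commutes with op."""
    return op(c * x, c * y) - c * op(x, y)


x, y, c = 0.4, -1.1, 3.0
print(homogeneity_gap(max, x, y, c))       # 0.0: max-log-MAP ignores the scale
print(homogeneity_gap(max_star, x, y, c))  # non-zero: log-MAP depends on the scale
```

With max, a common positive scale on all inputs only scales the outputs, so hard decisions are unchanged; with max* a wrong scale changes the relative weight of the correction term, which is why the SNR estimate matters for log-MAP.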

In digital processing, another nonlinear factor is introduced by the AGC and the front end A/D converter. A/D conversion with a fixed number of quantization levels makes even linear decoding, Viterbi decoding for example, a nonlinear process. The A/D conversion requires dynamic range scaling to reduce quantization errors, so we also need to address this dynamic scaling issue in practice. Solutions to the turbo scaling problem therefore must do the following: (1) online SNR estimation for optimal decoder performance, (2) scaling to keep the soft samples in the right dynamic range. For CDMA systems, another nonlinear factor is added by the power control schemes. As the transmission power is constant only within each slot and changes from slot to slot, we need to estimate the SNR values slot by slot to take the transmission power differences into account. Only in this way can we get more accurate estimates.

As the soft sample sequence after channel de-interleaving no longer reflects the time order of channel propagation, estimators and scaling schemes applied at that point have a somewhat random nature (post-channel de-interleaver processing is thus virtually a “blind” estimation approach). Ideally, scaling should be done before the channel de-interleaver to reflect the natural propagation order. We also face fixed-point precision issues when we scale and adjust the look-up table. In CDMA systems, the RAKE receiver demands a larger number of bits for the data path, while the turbo decoder needs fewer bits. This reduction of bit precision, which is actually a digital requantization, is typically done before the channel de-interleaver. Therefore, scaling is preferably done before the requantization (or at least the scaling factors should already be estimated) to obtain more accurate results.


This strongly recommends the slot-based approaches as well. Typically, the turbo decoder tolerates more overestimation than underestimation (this is partially due to the fact that $\left|\frac{d}{dx}\log(1+e^{-x})\right| = \left|\frac{-e^{-x}}{1+e^{-x}}\right| = \frac{e^{-x}}{1+e^{-x}}$ is decreasing with respect to $x$, that is, an overestimate will cause a smaller change in the function value $\log(1+e^{-x})$ than an underestimate). The SNR estimation error must stay between $-2$ dB and $6$ dB to keep the performance degradation acceptable in a static channel. According to Summers and Wilson [4],

(85) $\frac{E(y_i^2)}{[E(|y_i|)]^2} = \frac{E(t_i^2)}{[E(|t_i|)]^2} = \frac{E(s_i^2)}{[E(|s_i|)]^2} = f\left(\frac{E_S}{\sigma^2}\right) = f(\beta)$,

where the function $f(\beta)$ is given by

(86) $f(\beta) = \frac{1 + \frac{E_S}{\sigma^2}}{\left\{\sqrt{\frac{2}{\pi}}\, e^{-\frac{E_S}{2\sigma^2}} + \sqrt{\frac{E_S}{\sigma^2}}\,\mathrm{erf}\left[\sqrt{\frac{E_S}{2\sigma^2}}\right]\right\}^2}$,

with $\beta = E_S/\sigma^2$ denoting the SNR value. With the measured value $z = \frac{E(y_i^2)}{[E(|y_i|)]^2}$, we can find the needed SNR value $\beta = E_S/\sigma^2$ by

(87) $\beta \approx -34.0516 z^2 + 65.9548 z - 23.6184$.
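As an illustration of (85) to (87), the sketch below evaluates $f(\beta)$, computes the sample ratio $z$, applies the quadratic fit with the coefficients printed above, and also inverts $f$ by bisection. The function names, the bisection bounds and the assumption that $f$ is decreasing on the search interval are ours. A quick numerical check suggests that the fit (87) returns $\beta$ in dB rather than as a linear ratio, but this should be confirmed against [4].

```python
import math

def sw_ratio(beta):
    """f(beta) of (86): E[y^2] / (E|y|)^2 for BPSK in AWGN, beta = E_S / sigma^2."""
    denom = (math.sqrt(2.0 / math.pi) * math.exp(-beta / 2.0)
             + math.sqrt(beta) * math.erf(math.sqrt(beta / 2.0)))
    return (1.0 + beta) / denom ** 2

def moment_ratio(samples):
    """Sample estimate of z = E[y^2] / (E|y|)^2."""
    n = len(samples)
    mean_sq = sum(v * v for v in samples) / n
    mean_abs = sum(abs(v) for v in samples) / n
    return mean_sq / mean_abs ** 2

def beta_from_fit(z):
    """Quadratic fit (87), coefficients as printed in the text."""
    return -34.0516 * z * z + 65.9548 * z - 23.6184

def beta_by_bisection(z, lo=1e-6, hi=100.0, steps=80):
    """Invert f numerically, assuming f is decreasing on [lo, hi]."""
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if sw_ratio(mid) > z:
            lo = mid      # ratio still too large, so beta must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

In a fixed-point design the inversion would normally be replaced by a small look-up table indexed by $z$; the bisection is only a convenient reference.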

Note that the fading factor $\alpha$ is combined into $E_S$ in this case. A look-up table can be used to decide the final estimated online SNR value to avoid complicated computations. We can also use the following noise estimator proposed by Kim [6][7] in his thesis,

(88) 12]1

)(1

[2ˆ −∑=

++=K

i isitiyK

σ

]1

)}2)0ˆ(2)0ˆ(2)0ˆ{(1

[ ∑=

−+−+−+K

i isitiyK

µµµ ,


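where $\hat{\mu}_0 = \frac{1}{3K}\sum_{i=0}^{K-1}(y_i + t_i + s_i)$. For symbol energy, we can use

(89) $\hat{E}_S = \frac{1}{3K}\sum_{i=0}^{K-1}[y_i^2 + t_i^2 + s_i^2]$,

(90) $\hat{E}_S = \left\{\frac{1}{3K}\sum_{i=0}^{K-1}[(y_i - \hat{\mu}_0)\,\mathrm{sign}(y_i) + (t_i - \hat{\mu}_0)\,\mathrm{sign}(t_i) + (s_i - \hat{\mu}_0)\,\mathrm{sign}(s_i)]\right\}^2$.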
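A minimal sketch of the mean offset $\hat{\mu}_0$ and of the two symbol energy estimators (89) and (90) follows. The function names are ours, and the sign of each sample is taken from the sample itself, which is how we read (90); the example input is hypothetical.

```python
def mean_offset(y, t, s):
    """mu_0 estimate: average of all 3K soft samples in the block."""
    k = len(y)
    return (sum(y) + sum(t) + sum(s)) / (3.0 * k)

def energy_second_moment(y, t, s):
    """(89): mean of the squared samples (this also contains the noise power)."""
    k = len(y)
    return sum(v * v for v in list(y) + list(t) + list(s)) / (3.0 * k)

def sign(v):
    return 1.0 if v >= 0 else -1.0

def energy_decision_directed(y, t, s):
    """(90): squared mean of the offset-corrected samples times their own sign."""
    k = len(y)
    mu0 = mean_offset(y, t, s)
    acc = sum((v - mu0) * sign(v) for v in list(y) + list(t) + list(s))
    return (acc / (3.0 * k)) ** 2

# Noise-free hypothetical block with K = 2: both estimators return E_S = 1.
y, t, s = [1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]
e89 = energy_second_moment(y, t, s)
e90 = energy_decision_directed(y, t, s)
```

The same functions can be applied to one slot or to a whole frame by choosing the block that is passed in.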

All the previous estimators can be applied over a slot or over a whole frame. As max-log-MAP decoding (with performance degradation) requires no soft sample scaling, we come up with a compromise scheme, named simplified log-MAP (SMAP), with max-log-MAP recursion and log-MAP type LLR calculation. Luckily, it needs only slightly more operations than max-log-MAP while performing close to log-MAP. With the max-log-MAP recursion, the demanding soft sample scaling requirement is dropped. The reduced computation also gives us a low-power decoder design.

5.2. Pre(Post)-channel deinterleaver and decoder assisted approaches

We formulate the soft samples before the channel de-interleaver as:

$\{y_i\}_{i=0}^{L-1} = \{G_i(\alpha_i x_i\sqrt{E_S} + n_i)\}_{i=0}^{L-1}$,

$\{t_i\}_{i=0}^{L-1} = \{G'_i(\alpha'_i p_i\sqrt{E_S} + n'_i)\}_{i=0}^{L-1}$,

$\{s_i\}_{i=0}^{L-1} = \{G''_i(\alpha''_i q_i\sqrt{E_S} + n''_i)\}_{i=0}^{L-1}$,

where $\alpha_i$, $\alpha'_i$ and $\alpha''_i$ are i.i.d. Rayleigh fading factors, and $n_i$, $n'_i$ and $n''_i$ are combined noise and all other interference. We assume $n_i$, $n'_i$ and $n''_i$ are zero mean Gaussian with variance $\sigma^2 = N_0/2$. Factors $k_i$, $k'_i$ and $k''_i$ are due to power control and they remain constant over a slot. Factors $G_i$, $G'_i$ and $G''_i$ are due to the front end AGC loop, and $\frac{\tilde{E}_S}{N_0} = \frac{k_i^2\alpha_i^2 E_S}{N_0}$ is the SNR value we need to estimate. The estimation schemes do not rely heavily on the AGC because it does not change the SNR values. Also, the AGC impact is already in the received soft samples before we do any SNR estimation and scaling.

As an example, we look at the downlink power control schemes in UMTS WCDMA systems. The base station adjusts its transmitting power, with a certain delay, according to the received TPC (transmitting power control) bit. The downlink power for the $k$-th slot is adjusted according to the following formula

(91) $P(k) = P(k-1) + P_{TPC}(k) + P_{bal}(k)$,

where $P_{TPC}(k)$ is the power adjustment due to inner loop power control, and $P_{bal}(k)$ is the correction according to the downlink power control procedure for balancing radio link power to a common reference power. $P_{TPC}(k)$ is given as follows: if the value of the Limited Power Raise Used parameter is ‘Not used’, then

(92) $P_{TPC}(k) = +\Delta_{TPC}$, if $TPC_{est}(k) = 1$,

(93) $P_{TPC}(k) = -\Delta_{TPC}$, if $TPC_{est}(k) = 0$,

else if the value of the Limited Power Raise Used parameter is ‘Used’, then

(94) $P_{TPC}(k) = +\Delta_{TPC}$, if $TPC_{est}(k) = 1$ and $\Delta_{sum}(k) + \Delta_{TPC} < Power\_Raise\_Limit$,

(95) $P_{TPC}(k) = 0$, if $TPC_{est}(k) = 1$ and $\Delta_{sum}(k) + \Delta_{TPC} \ge Power\_Raise\_Limit$,

(96) $P_{TPC}(k) = -\Delta_{TPC}$, if $TPC_{est}(k) = 0$,

where $\Delta_{sum}(k) = \sum_{i=k-DL\_Power\_Averaging\_Window\_Size+1}^{k-1} P_{TPC}(i)$ is the temporary sum of the last $DL\_Power\_Averaging\_Window\_Size$ inner loop power adjustments. The power control step size $\Delta_{TPC}$ can take four values: 0.5, 1, 1.5 or 2 dB.

The power control adjustment frequency is 1500 Hz. In short, the transmitting power remains constant only within a slot and changes from slot to slot.
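The rules (91) to (96) translate directly into a small state machine. The sketch below is only an illustration under our own naming: `tpc_adjustment`, `downlink_power_trace` and their parameters are hypothetical, all quantities are in dB, and the window sum follows the summation limits printed above (the previous window size minus one adjustments).

```python
def tpc_adjustment(tpc_est, delta_tpc, limited_raise_used=False,
                   delta_sum=0.0, power_raise_limit=0.0):
    """Inner-loop adjustment P_TPC(k) in dB, following (92)-(96)."""
    if tpc_est == 0:
        return -delta_tpc                      # (93) and (96)
    if not limited_raise_used:
        return delta_tpc                       # (92)
    if delta_sum + delta_tpc < power_raise_limit:
        return delta_tpc                       # (94)
    return 0.0                                 # (95)

def downlink_power_trace(p_start, tpc_bits, delta_tpc, window, limit,
                         p_bal=0.0):
    """Iterate (91) slot by slot with the limited power raise rule enabled."""
    history = []           # past P_TPC values, newest last
    powers = []
    p = p_start
    for bit in tpc_bits:
        recent = history[-(window - 1):] if window > 1 else []
        step = tpc_adjustment(bit, delta_tpc, True, sum(recent), limit)
        p = p + step + p_bal
        history.append(step)
        powers.append(p)
    return powers

# Four consecutive "up" commands, 1 dB step, window 3, raise limit 3 dB.
trace = downlink_power_trace(0.0, [1, 1, 1, 1], 1.0, window=3, limit=3.0)
```

In this hypothetical run the third "up" command is suppressed by the raise limit, so the power stays constant for one slot before rising again, which is exactly the kind of slot-to-slot variation the slot based estimators have to follow.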

We make the following assumptions about our CDMA system model: $k_i = k'_i = k''_i$ for each slot, and $G_i k_i\alpha_i = G'_i k'_i\alpha'_i = G''_i k''_i\alpha''_i$ remains constant for each slot. These generic assumptions make our investigation valid for most receivers with CDMA type power control schemes. The slot based pre-channel de-interleaver processing schemes can be done in two steps: (1) online SNR estimation for each slot, (2) averaging among the estimates obtained from the slots.

There are several reasons for us to consider inter-slot estimation schemes. First, we can use inter-slot estimation schemes to average out some noise perturbation. Second, the inter-slot estimation scheme corresponds to the transmitting power control schemes. Suppose $\frac{\hat{E}_S}{N_0}(i)$ is the estimated online SNR for the $i$-th slot, then we propose the following average as the final scaling factor for the $i$-th slot

(97) $SNR(i) = \lambda_1\frac{\hat{E}_S}{N_0}(i) + \lambda_2\frac{\hat{E}_S}{N_0}(i-1) + P_{TPC}(i) + P_{bal}(i)$,

where $\lambda_1$ and $\lambda_2$ are nonnegative numbers that add to one (that is, $\lambda_1 + \lambda_2 = 1$). When $\lambda_1 = 0$ and $\lambda_2 = 1$, the online SNR for this slot is purely based on the previous estimate and the power control adjustment. On the other hand, when $\lambda_1 = 1$ and $\lambda_2 = 0$, no estimate of the previous slot is used and the scheme becomes slot based. Clearly, averages over more than two slots can also be devised and applied.

For post-channel de-interleaver processing, the channel de-interleaver randomizes the soft samples and, in effect, averages out the power control effects and the AGC effects. We can make the following assumptions about the received samples

(98) $y_i = \alpha_i x_i\sqrt{E_S} + n_i + \mu_0$,

(99) $t_i = \alpha'_i p_i\sqrt{E_S} + n'_i + \mu_0$,

(100) $s_i = \alpha''_i q_i\sqrt{E_S} + n''_i + \mu_0$,

where $\alpha_i$, $\alpha'_i$ and $\alpha''_i$ are Gaussian random variables with $\mu$ as their mean and $\sigma_*$ as their variance, and $n_i$, $n'_i$ and $n''_i$ (same as in the pre-channel de-interleaver system model) are combined noise and interference. We assume $n_i$, $n'_i$ and $n''_i$ are zero mean Gaussian with variance $\sigma^2 = N_0/2$, and are independent of $\alpha_i$, $\alpha'_i$, $\alpha''_i$. Here $\mu_0$ denotes the mean value of the soft samples, and is typically very close to zero. Denote $\frac{\hat{E}_S}{N_0} = \frac{\mu^2 E_S + \mu\mu_0\sqrt{E_S} + \mu_0^2}{N_0} \approx \frac{\mu^2 E_S}{N_0}$, and this is the scaling factor to be estimated (it makes no sense to use an estimator shorter than the frame based approach here).

Another way to do SNR estimation is to use some of the partially decoded results (both extrinsic information and LLR values). In [12], Oh and Cheun proposed adaptive channel SNR estimation algorithms that use the extrinsic values generated within the iterative MAP decoder to update the channel SNR estimate toward its optimum value per iteration or per frame. Define

(101) $C_n(iter) = \max_{0\le i\le L-1}\{|z_i|\} - \min_{0\le i\le L-1}\{|z_i|\}$,

where $n$ is the frame number, and $iter$ is the iteration number in turbo decoding. That is, $C_n(iter)$ is a very simple estimate of the extrinsic information variation. We then take the average of $C_n(iter)$ over several frames

(102) $\bar{C}_n(iter) = \frac{1}{P}\sum_{n=1}^{P} C_n(iter)$.

The channel SNR estimate $\hat{\gamma}$ is then given by

(103) $\hat{\gamma}(iter) = \mu\times\mathrm{sign}\{\bar{C}_n(iter) - \bar{C}_n(iter-1)\} + \hat{\gamma}(iter-1)$,

where $\mu$ is the update gain. Scaled by these SNR estimates, we can also speed up the convergence of the turbo decoder and reduce steady state jitter.

We can also estimate the online channel SNR values based on LLR values, in exactly the same way as the estimators based on extrinsic values. More specifically, we can devise the following estimation schemes. Define

(104) $D_n(iter) = \max_{0\le i\le L-1}\{|L_i|\} - \min_{0\le i\le L-1}\{|L_i|\}$,

where $n$ is the frame number and $iter$ is the iteration number in turbo decoding. That is, $D_n(iter)$ is the difference between the maximum and minimum LLR amplitude in the $n$-th frame. We then take the average of $D_n(iter)$ over several frames

(105) $\bar{D}_n(iter) = \frac{1}{P}\sum_{n=1}^{P} D_n(iter)$.

The channel SNR estimate $\hat{\gamma}$ is then given by

(106) $\hat{\gamma}(iter) = \mu\times\mathrm{sign}\{\bar{D}_n(iter) - \bar{D}_n(iter-1)\} + \hat{\gamma}(iter-1)$,

where $\mu$ is the update gain (based on simulation calibration or real measurement). The main drawback of these estimators is the calibration effort involved.

It is well known that the turbo decoding process actually raises the online SNR values. In an engineering sense, this explains where the coding gain comes from. These results also give us the intuition to propose new online SNR estimation schemes; that is, we can simply look at them as dual aspects of a single process. First, the online SNR value can be estimated as [23]

(107) $SNR(iter) \approx StartSNR + \frac{1}{L}\sum_{i=0}^{L-1} L_i z_i$,

where the second term is the so-called soft quality index. With different update gains, and given that $L_i$ and $z_i$ typically have the same sign in the later iteration stages, we thus have the following channel SNR estimate

(108) $SNR(iter) \approx C_1\left(\frac{1}{L}\sum_{i=0}^{L-1} L_i^2\right) \approx C_2\left(\frac{1}{L}\sum_{i=0}^{L-1} z_i^2\right)$,

where $C_1$ and $C_2$ are two constants (we use the gain factors to adjust them). We can see clearly the unified logic behind the previous estimation schemes: they are just two sides of the same problem. The feasibility of these decoder-assisted schemes relies heavily on the decoder design, and they are our least recommended option as far as implementation is concerned.

5.3. RMS scaling algorithms

We now propose our one-step slot based RMS scaling algorithms. To keep maximum flexibility, we introduce an overall scaling control constant for all slots so that the correction look-up table remains flexible. We study only the AWGN channel (that is, $\alpha_i = 1$) for easy analysis. Denote $\{\xi_i\}_{i=0}^{3L-1} = \{y_i, t_i, s_i\}_{i=0}^{L-1}$ as the soft samples; we have

(109) $E[|\xi_i|^2] = E_S + \sigma^2$.

For low SNR, we have $E[|\xi_i|^2] \approx \sigma^2$. The RMS value over a slot is then

(110) $RMS = \frac{1}{3K}\sum_{i=0}^{3K-1}|\xi_i|^2 \approx E[|\xi_i|^2] \approx \sigma^2$,

where $K$ is the number of samples per slot. Therefore the RMS value can be used as an online noise variance estimator. Luckily, this estimate is more accurate at low SNR.
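The approximation in (110) is easy to check by simulation. The sketch below, under our own assumptions (BPSK over AWGN with $\alpha_i = 1$, a fixed random seed, hypothetical energies and slot length), computes the slot value of (110) for a low-SNR and a high-SNR case; at low SNR the result is close to $\sigma^2$, while at high SNR it is dominated by $E_S$ as (109) predicts.

```python
import random

def slot_mean_square(samples):
    """RMS of (110): mean of |xi_i|^2 over the soft samples of one slot."""
    return sum(v * v for v in samples) / len(samples)

def simulate_slot(num_samples, symbol_energy, sigma, rng):
    """BPSK over AWGN with alpha_i = 1: xi = +/- sqrt(E_S) + n."""
    amp = symbol_energy ** 0.5
    return [rng.choice((-amp, amp)) + rng.gauss(0.0, sigma)
            for _ in range(num_samples)]

rng = random.Random(1)                       # fixed seed, illustrative only
sigma = 1.0
low = slot_mean_square(simulate_slot(30000, 0.1, sigma, rng))   # E_S << sigma^2
high = slot_mean_square(simulate_slot(30000, 4.0, sigma, rng))  # E_S >> sigma^2
```

The slot length used here is much longer than a real slot so that the sample averages are stable; with realistic slot sizes the estimate is noisier, which is one motivation for averaging over slots.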

Suppose we have $T$ slots in a frame as follows

(111) $((\xi_0,\ldots,\xi_{3K-1}), (\xi_{3K},\ldots,\xi_{6K-1}), \ldots, (\xi_{3(T-1)K},\ldots,\xi_{3TK-1}))$,

where $K$ is the number of samples per slot. Suppose the slot based online SNR scaling factors are $\{C_i\}_{i=0}^{T-1}$, where $C_i$ is the scaling factor estimated for the $i$-th slot by a certain algorithm. The RMS value over the whole frame after slot by slot scaling will be

(112) $RMS_1 = \frac{1}{T}\sum_{i=0}^{T-1}\{C_i^2\, RMS_i\}$,

where $RMS_i$ is the RMS value of the $i$-th slot. For reference, the RMS value over a whole frame before slot based scaling is $RMS_0 = \frac{1}{T}\sum_{i=0}^{T-1} RMS_i$. As scaling tries to stabilize the channel impairments and to fit the soft samples into the right dynamic range for fixed-point implementation, we naturally require the RMS value after scaling to be a fixed constant, that is, let

(113) $RMS_1 = \frac{1}{T}\sum_{i=0}^{T-1}\{C_i^2\, RMS_i\} = C$,

where $C$ is a constant. The most natural choice for each slot based scaling factor is

(114) $C_i = \frac{C}{RMS_i}$,

which is exactly the scaling constant we proposed. This also tells us the connection between slot based scaling factors and the overall control constant.
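The constant-output requirement (113) can be verified numerically. The sketch below uses the square-root form of the slot gain, $\sqrt{C/RMS_i}$, which is our assumption: it is the form that satisfies (113) exactly when $RMS_i$ is read as a mean-square value as in (110). The slot data and function names are hypothetical.

```python
def mean_square(samples):
    return sum(v * v for v in samples) / len(samples)

def slot_scale_factors(slots, target):
    """Per-slot gains sqrt(C / RMS_i) so every scaled slot has mean square C."""
    return [(target / mean_square(slot)) ** 0.5 for slot in slots]

def frame_mean_square_after_scaling(slots, gains):
    """RMS_1 of (112): (1/T) * sum_i C_i^2 * RMS_i."""
    total = sum(g * g * mean_square(slot) for g, slot in zip(gains, slots))
    return total / len(slots)

# Hypothetical frame of T = 3 slots with different received power levels.
slots = [[0.5, -0.4, 0.6], [2.0, -1.5, 1.8], [1.0, 1.1, -0.9]]
gains = slot_scale_factors(slots, target=2.0)
rms_1 = frame_mean_square_after_scaling(slots, gains)
```

Weak slots receive a larger gain than strong slots, so the frame seen by the decoder has the same mean-square level regardless of the power control pattern.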


As for noise variance estimation, we need to use the RMS value as the scaling factor. However, we need to use the square root of the RMS value for the purpose of generating an output data stream with constant RMS value. The slot based SNR scaling factor we need is

(115) $\frac{2\sqrt{E_S}}{\sigma^2} \approx \frac{2\sqrt{E_S}}{RMS_i - E_S} = \frac{2\sqrt{E_S}}{(\sqrt{RMS_i} - \sqrt{E_S})(\sqrt{RMS_i} + \sqrt{E_S})} = \frac{2}{(\sqrt{RMS_i} + \sqrt{E_S})\left(\sqrt{RMS_i/E_S} - 1\right)} = \frac{D_i}{RMS_i}$,

which has exactly the same form as the scaling constant we proposed. This also tells us the exact relation between the ideal scaling factor and the RMS value. The fact that the turbo decoder tolerates more overestimation than underestimation serves as the guideline for the final scaling factor calculation. From the previous expressions, we can derive many feasible schemes for a real design (compensation schemes, i.e. an SNR dependent control constant obtained from calibration, can be added for more accurate estimation results). We see easily that

(116) $\frac{2\sqrt{E_S}}{\sigma^2} \approx \frac{C_i}{RMS_i}$,

which is exactly our slot based RMS scaling scheme.

Let us quote some results of Pietrobon [8] to further understand the scaling issues and take a different look. The LUT for the logarithmic correction term is generated by

(117) $c\,\log(1 + e^{-x/c})$,

where $c$ is given by the equation $c = \frac{A\sigma^2}{4}$, and $A$ is given by

(118) $A = \frac{C^*(2^{q-1} - 1)}{mag(\sigma)}$,

where $C^* = 0.65$, $q$ is the number of bits in the digital quantizing scheme, and

(119) $mag(\sigma) \approx \sigma\sqrt{\frac{2}{\pi}} \approx 0.798\sigma$.

We can thus easily calculate the needed LUT for the logarithmic correction term. The exact operation is to divide every soft sample by $c$ for scaling. Clearly, we have

(120) $\frac{1}{c} = \frac{4\, mag(\sigma)}{0.65\,(2^{q-1} - 1)\,\sigma^2} \approx \frac{4\times 0.798}{0.65\,(2^{q-1} - 1)\,\sigma}$.
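For concreteness, (117) to (120) can be evaluated as follows. This is a minimal sketch under our own assumptions: the table is indexed by integer metric differences $x = 0, 1, 2, \ldots$, entries are rounded to the nearest integer, and the function names and the example values $\sigma = 1$, $q = 6$ are hypothetical.

```python
import math

C_STAR = 0.65   # constant quoted in the text from [8]

def mag(sigma):
    """Low-SNR approximation (119) of the mean sample magnitude."""
    return sigma * math.sqrt(2.0 / math.pi)

def scale_constant(sigma, q):
    """c = A * sigma^2 / 4 with A from (118)."""
    a = C_STAR * (2 ** (q - 1) - 1) / mag(sigma)
    return a * sigma ** 2 / 4.0

def correction_lut(c, size):
    """Integer table of c*log(1 + exp(-x/c)) for x = 0, 1, ..., size-1."""
    return [int(round(c * math.log1p(math.exp(-x / c)))) for x in range(size)]

c = scale_constant(1.0, 6)      # hypothetical: sigma = 1, 6-bit soft samples
lut = correction_lut(c, 64)
```

Because $c$ depends on $\sigma$, a change of operating SNR changes both the table contents and the number of non-zero entries, which is why a fixed table needs properly scaled soft samples.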


Different approximate scaling algorithms can also be devised accordingly. Following [8], we have the following typical soft sample statistics

(121) $E[|\xi_i|] = mag(\sigma) \approx 0.798\sigma$,

(122) $E[|\xi_i|^2] = E_S + \sigma^2$.

With these approximations, we can devise the following feasible scaling algorithm,

(123) $\frac{1}{c} \approx \frac{4\times 0.798\times 0.798}{0.65\,(2^{q-1} - 1)\,E(|\xi_i|)}$,

which does not depend explicitly on SNR. A slot based scaling factor could be

(124) $\frac{1}{c} \approx \frac{H_i}{E(|\xi_i|)}$,

which is not based on RMS values and can also be utilized. Given the fact that

(125) $\sqrt{RMS_i} \approx \sqrt{E_S + \sigma^2} = \sigma\sqrt{\frac{E_S}{\sigma^2} + 1} = \sigma\sqrt{2\,SNR + 1}$,

we can also have the following scaling factor

(126) $\frac{1}{c} \approx \frac{4\times 0.798\times\sqrt{2\,SNR + 1}}{0.65\,(2^{q-1} - 1)\,\sqrt{RMS_i}} = \frac{E_i}{\sqrt{RMS_i}}$,

which depends explicitly on the SNR values and the operation range. This once again leads to one-step slot based RMS value scaling, which confirms our previous analysis. Numerical simulation results under a static channel show that $C_i = 2.0$ is pretty much the optimal constant. This serves as a guideline for the “optimal constant” search. In practice, we can also take into account the AGC effect and the signal strength as

(127) $\frac{1}{c} \approx \frac{F_i\,E(|\xi_i|)}{RMS_i} = \frac{F_i\left(\frac{1}{3K}\sum_{i=0}^{3K-1}|\xi_i|\right)}{RMS_i}$.

If we are concerned about DC offset residue in the soft samples, we can also use

(128) $\frac{1}{c} \approx \frac{G_i}{RMS_i - [E(\xi_i)]^2} = \frac{G_i}{RMS_i - \left(\frac{1}{3K}\sum_{i=0}^{3K-1}\xi_i\right)^2}$.

The control constant is operation range dependent and can be decided by simulation or calibration, with the final goal of putting the signal into the decoder tolerance range.

5.4. Soft sample scaling implementation issues

We now address some implementation issues, particularly complexity and latency. For ASIC or DSP implementation, we have the following overall scenarios.

(1) Slot based pre-channel de-interleaver processing, which follows the channel impairment naturally and is more suitable to be combined with the RAKE receiver (that is, scale the soft samples inside the RAKE receiver).

(2) Random based post-channel de-interleaver processing, which can be designed as part of the channel de-interleaver.

(3) Decoder-assisted post-channel de-interleaver processing, which virtually does online SNR estimation and soft sample scaling inside the turbo decoder.

We see clearly that these three approaches correspond naturally to the three major components in the turbo decoder interface: RAKE receiver, channel de-interleaver, and turbo decoder. The complexity and latency are decided by how and when the scaling is finally done. Soft samples that come out of the RAKE receiver in CDMA receivers are typically processed in a slot-by-slot fashion. The slot based approach is therefore also the most natural choice as far as implementation is concerned, let alone the higher dynamics available. The soft samples out of the RAKE combining are typically handled by a DSP in a slot-based buffer before being pushed towards the channel de-interleaver. We can therefore use a DSP to implement our slot based scaling algorithms. To reduce the requantization error, we do rounding before the format cast. The steps are as follows: estimate the slot based scaling constant (square root RMS values or $E(|\xi_i|)$; LUT values can be used to avoid the square root and inverse operations), scale the soft samples of the whole slot (the three steps are multiplication, rounding and format cast), and regenerate the frame based logarithmic correction LUT and re-program it to the turbo decoder if needed (we prefer the one-step approach with fixed control constant and LUT to avoid this operation).

We can easily count the operations needed to finish these tasks. Recall that there are $T$ slots in a frame and $K$ samples per slot. Suppose the slot based online SNR scaling factors are $\{C_i\}_{i=0}^{T-1}$; we need $K$ multiplications and $K$ additions to calculate the RMS values. The square root and inverse are just fractional overhead operations. We also need another $K$ multiplications to finish the scaling process. Rounding and format cast may add actual operations. There is no overhead on DSP processors with rounding and format cast built into the multiplication.

5.5. Soft sample scaling numerical simulation results


First, we present some pre-channel de-interleaver processing results under AWGN with power control schemes (by varying the symbol energy of each slot with step size 0.5 dB). We utilize the average over slots (with PC: $\lambda_1 = 0.5$ and $\lambda_2 = 0.5$; without PC: $\lambda_1 = 1.0$ and $\lambda_2 = 0.0$). Estimation is done by Summers and Wilson’s method.

Figure 1. BER performance of pre-channel de-interleaving processing

The performance degradation is negligible with power control combined in, and the performance degradation without averaging over slots is also quite small. To justify RMS scaling, we present here the “optimal” scaling (frame based) under AWGN with floating point and 8-bit soft samples respectively (simulation with rate 1/3 and frame size 860).

Figure 2. Frame based RMS scaling performance (floating point vs. 8-bit sample)

The RMS scaling scheme under static channels in floating point has negligible performance degradation. It has less than a hundredth of a dB performance degradation with 8-bit soft samples. Given the 0.3 dB degradation of max as compared to max*, we see clearly that max* with our RMS scaling algorithms can outperform max. One important performance evaluation criterion therefore should be that fixed-point max* with RMS scaling outperforms floating-point max. For 8-bit soft sample precision under Rayleigh fading conditions (which is the typical preliminary evaluation test before hardware design) in a WCDMA environment (code rate 1/3 and frame size 860, 4-path Rayleigh at 120 km/h), we have the following simulation results.

Figure 3. BER performance of the slot based RMS scaling in WCDMA systems

We see that log-MAP with the RMS scaling scheme outperforms floating-point max-log-MAP by about 0.2 dB. This is the result we expected.

5.6. A simplified turbo decoding scheme without scaling

Having seen the painful effort needed to obtain optimal turbo decoding performance, we now look at ways to bypass the scaling requirement. We assume $S_{total}$ is the total number of states on the trellis. The extrinsic information $Z = \{z_0, z_1, \ldots, z_{L-1}\}$ is given by $z_i = \log\frac{p(m_i = +1)}{p(m_i = -1)}$ with $p[m_i] = \frac{e^{m_i z_i/2}}{e^{z_i/2} + e^{-z_i/2}}$. With $S_i$ denoting a state of the trellis at the $i$-th moment, MAP decoding uses the forward recursion

(129) $\alpha(S_i) = \sum_{S_{i-1}}\alpha(S_{i-1})\,\gamma(S_{i-1}\to S_i)$, with $\alpha(S_0) = 1$ for $S_0 = 0$ and $\alpha(S_0) = 0$ for $S_0 \ne 0$,

and the backward recursion

(130) $\beta(S_i) = \sum_{S_{i+1}}\gamma(S_i\to S_{i+1})\,\beta(S_{i+1})$, with $\beta(S_L) = 1$ for $S_L = 0$ and $\beta(S_L) = 0$ for $S_L \ne 0$.

The soft decision LLR is calculated as

(131) $L_i = \log\frac{\sum_{P\in P^+} p[Y|X]}{\sum_{P\in P^-} p[Y|X]} = \log\frac{\sum_{S^+}\alpha(S_i)\,\gamma(S_i\to S_{i+1})\,\beta(S_{i+1})}{\sum_{S^-}\alpha(S_i)\,\gamma(S_i\to S_{i+1})\,\beta(S_{i+1})}$,

which is the so-called log-MAP algorithm in implementation, and $\gamma(S_i\to S_{i+1})$ is the branch metric. Also, $P$ is a continuous path; $P^+ = \{P: m_i = +1\}$ and $P^- = \{P: m_i = -1\}$ cover all of the continuous paths that start and end with the zero state on the trellis; $S$ is a transition branch; and $S^+ = \{S: m_i = +1\}$ and $S^- = \{S: m_i = -1\}$ cover all the branch transitions at the $i$-th time moment. The approximation

(132) $L_i = \log\frac{\sum_{P\in P^+} p[Y|X]}{\sum_{P\in P^-} p[Y|X]} \approx \log\frac{\max_{P^+}\{p[Y|X]\}}{\max_{P^-}\{p[Y|X]\}} = L_i(M)$,

leads to the so-called max-log-MAP decoding in practice. The extrinsic information in turbo decoding will be combined into the branch metric computation.

Similar to log-MAP decoding, we define the following modified recursive sequences

(133) $\alpha^*(S_i) = \max_{S_{i-1}}\{\alpha^*(S_{i-1})\,\gamma(S_{i-1}\to S_i)\}$, with $\alpha^*(S_0) = 1$ for $S_0 = 0$ and $\alpha^*(S_0) = 0$ for $S_0 \ne 0$,

for the forward recursion and the following recursive sequences

(134) $\beta^*(S_i) = \max_{S_{i+1}}\{\gamma(S_i\to S_{i+1})\,\beta^*(S_{i+1})\}$, with $\beta^*(S_L) = 1$ for $S_L = 0$ and $\beta^*(S_L) = 0$ for $S_L \ne 0$,

for the backward recursion, and calculate the soft decision LLR as

(135) $L_i^* = \log\frac{\sum_{S^+}\alpha^*(S_i)\,\gamma(S_i\to S_{i+1})\,\beta^*(S_{i+1})}{\sum_{S^-}\alpha^*(S_i)\,\gamma(S_i\to S_{i+1})\,\beta^*(S_{i+1})}$,

which is a new soft output convolutional decoding scheme. These forward and backward recursive sequences are simply a different statement of the Viterbi decoding process (the backward sequence computation can be viewed as a Viterbi decoder running in the reverse direction after a whole frame of samples has been received). Precisely, the LLR values are calculated as in log-MAP decoding and the recursive sequences are calculated as in max-log-MAP decoding. Therefore, we can view this scheme as a simplified version of log-MAP (with reduced complexity in the recursive computation) or an enhanced version of max-log-MAP (with increased complexity in the LLR calculation). This gives the simplified turbo decoding algorithm we are going to introduce. Equivalently,

(136) $L_i^* = \log\frac{\sum_{P\in P^+(surviving)} p[Y|X]}{\sum_{P\in P^-(surviving)} p[Y|X]} \approx \log\frac{\max_{P^+}\{p[Y|X]\}}{\max_{P^-}\{p[Y|X]\}} = L_i(M)$,

where $P^+(surviving) = \{P: m_i = +1\}$ and $P^-(surviving) = \{P: m_i = -1\}$ cover all the continuous surviving paths that start and end in the zero state on the trellis. These are only the paths that survive the Viterbi decoder path trimming operation, so the path sets are smaller than those of log-MAP; the difference is surviving paths versus all paths. This soft-output convolutional decoding scheme, called simplified MAP decoding (SMAP), will bring us a simplified yet efficient turbo decoding scheme, as we will see later. We have $|L_i| \ge |L_i^*| \ge |L_i(M)|$ in a statistical sense generally (verified by simulation).

Using SMAP for each constituent decoder, we can have a new simplified turbo decoding scheme following the classical iterative approach. The key is to combine the extrinsic information (interleaved or de-interleaved version) into our simplified MAP constituent decoding. Analyzing the first constituent decoder will be adequate for our discussion. We know the max-log-MAP decoding recursive sequence is the path metric of Viterbi decoding. Mathematically, the path metric with extrinsic information input is

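(137) $p[\{Y,Z\}|X] = \Delta\cdot\left(\prod_{i=0}^{L-1}\frac{1}{e^{z_i/2} + e^{-z_i/2}}\right)e^{\frac{1}{2}\sum_{i=0}^{L-1} m_i z_i}$,

where $\Delta = p[Y|X] = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{2L} e^{-\frac{1}{2\sigma^2}\sum_{i=0}^{L-1}\{(y_i - x_i\sqrt{E_b})^2 + (t_i - p_i\sqrt{E_b})^2\}}$.

At the $i$-th time moment, the path metric for paths in $P^+(surviving) = \{P: m_i = +1\}$ is

(138) $p[\{Y,Z\}|X] = \Xi_i^+\cdot e^{-\frac{(y_i - \sqrt{E_b})^2}{2\sigma^2}}\cdot e^{+\frac{z_i}{2}}\cdot e^{-\frac{(t_i - p_i\sqrt{E_b})^2}{2\sigma^2}}$,

where $\Xi_i^+ = K(Z)\cdot e^{-\frac{1}{2\sigma^2}\sum_{j\ne i}\{(y_j - x_j\sqrt{E_b})^2 + (t_j - p_j\sqrt{E_b})^2\} + \frac{1}{2}\sum_{j\ne i} m_j z_j}$ and $K(Z) = \left(\prod_{i=0}^{L-1}\frac{1}{e^{z_i/2} + e^{-z_i/2}}\right)\cdot\left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{2L}$, and the path metric for paths in $P^-(surviving) = \{P: m_i = -1\}$ is

(139) $p[\{Y,Z\}|X] = \Xi_i^-\cdot e^{-\frac{(y_i + \sqrt{E_b})^2}{2\sigma^2}}\cdot e^{-\frac{z_i}{2}}\cdot e^{-\frac{(t_i - p_i\sqrt{E_b})^2}{2\sigma^2}}$,

where $\Xi_i^- = K(Z)\cdot e^{-\frac{1}{2\sigma^2}\sum_{j\ne i}\{(y_j - x_j\sqrt{E_b})^2 + (t_j - p_j\sqrt{E_b})^2\} + \frac{1}{2}\sum_{j\ne i} m_j z_j}$. A direct calculation of $L_i^* = \log\frac{\sum_{P\in P^+(surviving)} p[Y|X]}{\sum_{P\in P^-(surviving)} p[Y|X]}$ leads us to the conclusion

(140) $L_i^* = \frac{2\sqrt{E_b}}{\sigma^2}\, y_i + z_i + l_i$,

where $l_i = \log\frac{\sum_{P\in P^+(surviving)}\Xi_i^+\cdot e^{-\frac{(t_i - p_i\sqrt{E_b})^2}{2\sigma^2}}}{\sum_{P\in P^-(surviving)}\Xi_i^-\cdot e^{-\frac{(t_i - p_i\sqrt{E_b})^2}{2\sigma^2}}}$ is the newly updated extrinsic information to be fed into the next iteration. This is the theoretical foundation of turbo decoding. What we have shown is simply that the well-known LLR decomposition equation, as shown in [1][3], is also valid for our SMAP.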

Our simplified turbo decoder is therefore a very simple variation of the classical turbo scheme. This approach is clearly valid for both parallel and serial concatenated versions. Most importantly, this decoder can be easily built without much modification of the commonly implemented architectures, which is extremely important in practice.

Finally, we point out that a different approach has been tried in [22] to improve max-log-MAP decoding. That approach is based on scaling the extrinsic information coming out of each max-log-MAP constituent decoder. This paper motivated us to analyze turbo decoding further and to come up with the SMAP version of improvement.

5.7. Window techniques, soft sample scaling and low-power issues

The recursive sequences are just intermediate stages of the final LLR calculation. Direct implementation requires a lot of memory and is impractical. Truncated versions or window techniques must be applied to alleviate the memory requirement [15][16][17][23]. SMAP is the same as max-log-MAP in recursive computation, so all existing window techniques can be applied. Variations of these techniques (non-truncated, single-side truncated, or dual-side truncated versions) can be utilized in implementation, which gives us a full range of designs with different speed and power consumption.

One very important fact is that our SMAP algorithm actually requires no online SNR estimation and no soft sample scaling. The recursive sequence computation is linear in terms of the input soft samples. We now give the precise reason for this linearity. The involved SMAP path metric can be equivalently calculated, correlation versus Euclidean distance

squared, as ∑∑−

=

−

=+

1

0211

0}{

2

L

i izime

L

i bEipitbEixiybE

eσ . If we multiply the soft

samples by a positive constant 0k > , we have LLR values of SMAP calculated as

(141)

∑

∑

−∈

∑−

=∑−

=+

+∈

∑−

=∑−

=+

=

)(

1

0211

0}{

2

)(

1

0211

0}{

2

log*,

survivingPP

L

i izime

L

i bEipiktbEixikybE

e

survivingPP

L

i izime

L

i bEipiktbEixikybE

e

kiL

σ

σ

*

)(

1

0211

0}{

2

)(

1

0211

0}{

2

logi

L

survivingPP

L

i izime

L

i bEipitbEixiybE

e

survivingPP

L

i izime

L

i bEipitbEixiybE

e

=

−∈

∑−

=∑−

=+

+∈

∑−

=∑−

=+

=

∑

∑

σ

σ

.

The invariance $L^*_{i,k} = L^*_i$ tells us that no soft sample scaling is needed for SMAP. This makes our simplified turbo decoder robust and simple in implementation.

Our simplified turbo decoder is actually a robust low-power scheme. Max-log-MAP decoding needs less computation as a result of ignoring the logarithmic correction term. Note that $\max{}^*(x, y) = \log(e^x + e^y) = \max(x, y) + \log(1 + e^{-|x - y|})$, and the logarithmic correction term is typically implemented via a look-up table. Let $a(S_i) = \log\{\alpha(S_i)\}$, $b(S_i) = \log\{\beta(S_i)\}$, $c(S'_i, S_j) = \log\{\gamma(S'_i, S_j)\}$; we have the following recursions

(142) $a(S_i) = \max{}^*\{[a(S_i^1) + c(S_i^1, S_i)], [a(S_i^2) + c(S_i^2, S_i)]\}$,

(143) $b(S_i) = \max{}^*\{[b(S_i^1) + c(S_i, S_i^1)], [b(S_i^2) + c(S_i, S_i^2)]\}$,

where $S_i^1$ and $S_i^2$ are the corresponding transition states based on the recursive situation. This is the classical ACS operation with a correction term. The LLR value $L_i$ at time moment $i$ with log-MAP can then be computed as

(144) $L_i = \max{}^*_{u_i = +1}\{a(S'_i) + c(S'_i, S_i) + b(S_i)\} - \max{}^*_{u_i = -1}\{a(S'_i) + c(S'_i, S_i) + b(S_i)\}$.

Define $a^*(S_i) = \log\{\alpha^*(S_i)\}$, $b^*(S_i) = \log\{\beta^*(S_i)\}$; we have the following LLR calculation for max-log-MAP decoding

(145) $L_i(M) = \max_{u_i = +1}\{a^*(S'_i) + c(S'_i, S_i) + b^*(S_i)\} - \max_{u_i = -1}\{a^*(S'_i) + c(S'_i, S_i) + b^*(S_i)\}$.

The LLR value of SMAP is calculated, on the other hand, as

(146) $L_i^* = \max{}^*_{u_i = +1}\{a^*(S'_i) + c(S'_i, S_i) + b^*(S_i)\} - \max{}^*_{u_i = -1}\{a^*(S'_i) + c(S'_i, S_i) + b^*(S_i)\}$.

SMAP reduces the number of operations in decoding compared to log-MAP, which leads to a reduction in power consumption. As studied in [23], switching off the LUT operation can reduce power. Note that the power increase introduced by the SMAP LLR calculation is relatively small, since the LLR computation unit in log-MAP decoding is not the major power consumer anyway.
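The only difference between (145) and (146) is the operator used to combine the branch sums in the final LLR step. The sketch below makes that explicit; the branch sums are hypothetical numbers standing in for $a^*(S'_i) + c(S'_i, S_i) + b^*(S_i)$, and the function names are ours.

```python
import math
from functools import reduce

def max_star(x, y):
    """Jacobian logarithm: max(x, y) plus the logarithmic correction term."""
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))

def llr_max_log(plus_terms, minus_terms):
    """(145): plain max over the branch sums for u_i = +1 and u_i = -1."""
    return max(plus_terms) - max(minus_terms)

def llr_smap(plus_terms, minus_terms):
    """(146): same branch sums, combined with max* instead of max."""
    return reduce(max_star, plus_terms) - reduce(max_star, minus_terms)

# Hypothetical branch sums a*(S') + c(S', S) + b*(S) at one trellis step.
plus_terms = [2.0, 1.5, -0.5, 0.0]
minus_terms = [1.0, -1.0, -2.0, 0.5]
l_m = llr_max_log(plus_terms, minus_terms)
l_s = llr_smap(plus_terms, minus_terms)
```

Since the forward and backward recursions stay max-only, the look-up table is exercised once per LLR output rather than once per state update, which is where the saving relative to log-MAP comes from.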

To reduce the major part of the SMAP power consumption, we need to apply iteration stopping (also called early termination) schemes, as the turbo convergence has a typical asymptotic behavior. Iteration stopping schemes can be devised to cut the later iterations without much performance degradation. Iteration stopping schemes have significant practical impact. The key to all iteration stopping schemes is how to detect the saturation point in turbo convergence. Matache et al [18] classify all iteration stopping schemes as hard decision rules, soft decision rules, CRC rules, and magic Genie rules. Their underlying logic is based on how the information is processed. The trade-off between computation reduction and performance degradation is the key for practical design. My favorite iteration stopping criterion, simple yet solidly verified in many designs, is the hard decision comparison scheme between two iteration stages by Shao et al [19]. We can put these schemes on top of our SMAP decoding with ease. All the related design and implementation issues are the same as for log-MAP and max-log-MAP.
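A hard decision comparison rule of this kind can be sketched in a few lines. This is only an illustration in the spirit of the scheme cited above, not a reproduction of [19]: the callback `run_iteration`, the bit mapping and the function names are our assumptions.

```python
def hard_decisions(llrs):
    """Map LLR values to bits: 1 for non-negative LLR, 0 otherwise."""
    return [1 if v >= 0 else 0 for v in llrs]

def decode_with_early_stop(run_iteration, max_iterations):
    """Stop when two consecutive iterations give identical hard decisions.

    run_iteration(k) is a hypothetical callback returning the LLR list
    produced by turbo iteration k (k = 0, 1, ...).
    """
    previous = None
    bits = []
    for k in range(max_iterations):
        bits = hard_decisions(run_iteration(k))
        if bits == previous:
            return bits, k + 1          # iterations actually used
        previous = bits
    return bits, max_iterations

# Hypothetical LLR outputs of four successive iterations.
fake_llrs = [[0.1, -0.2, 0.3], [0.5, 0.4, 0.9], [1.5, 1.4, 2.0], [3.0, 3.0, 3.0]]
bits, used = decode_with_early_stop(lambda k: fake_llrs[k], 4)
```

The rule needs only one frame of stored hard decisions and a comparison per iteration, so it adds little to the decoder regardless of whether the constituent algorithm is log-MAP, SMAP or max-log-MAP.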

To narrow the performance gap between log-MAP and SMAP, we can further extend the algorithm switch strategy investigated in [23]. We can reason, from an engineering point of view, that the turbo decoding process raises the intrinsic SNR. This explains the coding gain. As a result of extrinsic information combining, the intrinsic SNR value grows as the iterations go on until it eventually reaches saturation. Recall that log-MAP and max-log-MAP are asymptotically the same at high SNR. As the performance degradation of max-log-MAP compared to log-MAP gets smaller at high SNR, we can switch from log-MAP to max-log-MAP in the later turbo iteration stages. This strategy reduces the decoder activity factor and leads to a reduction in power consumption. The switching point can be decided according to simulation and calibration results. We now naturally have the capability to build a decoder that switches from log-MAP to SMAP and then to max-log-MAP. This enables a full range of decoder designs that trade off complexity against performance.

However, we do not recommend the switching strategy from log-MAP to SMAP and then to max-log-MAP, because we want to avoid the demanding soft sample scaling in log-MAP. Switching from SMAP to max-log-MAP does not save much computation.

5.8. SMAP numerical simulation results

We use the standard UMTS WCDMA turbo code (constraint length 4, rate ½ constituent code, alternating puncturing, 640 bits per frame) just to illustrate our SMAP decoder performance. The following performance results, as compared to log-MAP and max-log-MAP decoding, are generated under the AWGN channel for standard reference.

Figure 4. Performance of SMAP turbo decoding scheme


The performance of SMAP is, as expected, in between log-MAP and max-log-MAP, and the degradation of SMAP compared to log-MAP is less than 0.1 dB at a BER of $10^{-5}$ (the performance degradation of max-log-MAP is about 0.3 dB). To our satisfaction, we see that the simple log-MAP LLR average process actually improves the max-log-MAP decoding considerably. Please note that only a few more iterations are added for SMAP as compared to max-log-MAP, which makes our scheme very attractive in implementation.

We now present some performance results to justify the algorithm switch strategy. The following plots show the performance of a turbo decoder utilizing log-MAP for the first 4 iterations and SMAP for the last 4 iterations. We can see that the performance degradation is well below 0.1 dB as compared to the log-MAP decoding limit.

Figure 5. Performance of algorithm switch scheme

We now justify the performance of SMAP decoding with windowing techniques (sliding window, Viterbi technique and variations). Similar techniques are also valid for truncated Viterbi decoding with limited trace back operation. The key to these techniques is that the recursive sequences can be calculated approximately with a random start as long as the window size (that is, the synchronization portion) is long enough. We first illustrate the sliding window technique (we may call it the single-sided window technique) and the Viterbi technique (algorithmically, the sliding window technique with dual backward recursive engines) for the constituent SMAP in the following diagram.

Figure 6. SMAP sliding window technique (diagram labels: forward computation continued from the beginning of the frame; reliable decoding portion; backward computation with random start values; sync portion with fixed window size)


The computation of the $\alpha^*(S_i)$ sequence is as defined, and the $\beta^*(S_i)$ sequence computation starts with uniform values except at the end of the frame. The sync portion is the learning portion for the backward recursive sequence to have reliable values. Similar to the Viterbi decoder trace back length, there is an issue concerning the window size. We can use at least five times the constraint length as the starting point to search for the right choice via simulation. A longer window size is clearly needed for fading channels. The following performance curves show the performance degradation due to window size variation. In these plots, $W_5$, $W_{10}$, $W_{15}$, $W_{30}$ mean the window size is 5, 10, 15, and 30 information bits respectively. We can see that the performance degradation, compared to the exact performance, is negligible when the window size is 25. The final design decision can be easily derived by extensive whole chain simulation.
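The bookkeeping behind the sliding window can be sketched as follows. This is only an illustrative schedule, not the decoder itself: the block length of 32 bits is an arbitrary assumption, and the sync length of 20 is just five times the constraint length 4 mentioned above. The function names are hypothetical.

```python
def sliding_window_schedule(frame_len, block_len, sync_len):
    """Return one entry per decoding block of the backward recursion.

    Each entry is (decode_start, decode_end, backward_start, known_end):
      - [decode_start, decode_end) is the reliable decoding portion,
      - the backward recursion starts at backward_start (sync portion first)
        and runs down to decode_start,
      - known_end is True when the recursion starts at the end of the frame,
        where the start values are known instead of uniform.
    """
    schedule = []
    for decode_start in range(0, frame_len, block_len):
        decode_end = min(decode_start + block_len, frame_len)
        backward_start = min(decode_end + sync_len, frame_len)
        known_end = backward_start == frame_len
        schedule.append((decode_start, decode_end, backward_start, known_end))
    return schedule


def backward_overhead(schedule, frame_len):
    """Backward recursion steps per information bit (1.0 means no windowing cost)."""
    steps = sum(back - start for start, _, back, _ in schedule)
    return steps / frame_len


# Example: 640-bit frame, assumed 32-bit blocks, 20-bit sync portion.
plan = sliding_window_schedule(640, 32, 20)
overhead = backward_overhead(plan, 640)
```

The overhead figure makes the trade-off visible: a longer sync portion gives more reliable backward values but adds backward steps for every block.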

Figure 7. Performance of SMAP window techniques

The dual-truncated SMAP version (local decoding engine) enables us to reach as high a decoder speed as possible (ignoring, of course, the implementation cost). We illustrate the dual-truncated technique (we may call it the double-sided window) in the following diagram.

Figure 8. Local SMAP decoding engine (diagram labels: forward computation with random start values; backward computation with random start values; jump to repeat the whole operation)

The computation of the $\alpha^*(S_i)$ and $\beta^*(S_i)$ sequences starts with uniform values except when they meet the end points. The two sync portions are the learning portions that let the respective recursive sequences reach reliable values, since their starting values are uniformly assigned. The choice of window size can be easily justified via numerical simulation. We may use the same window size for both synchronization portions or make them different. The double truncated versions will introduce extra computation, and their performance will not be justified here. Please refer to [23][24] and the references listed there for a detailed description, performance and further variations of these techniques.

5.9. Conclusions

We have studied and analyzed the soft sample scaling problems for turbo decoders in the context of CDMA systems. Our analysis is general enough for most receivers. The key to our solution is to make the online SNR estimation factor and the dynamic range scaling factor the same. Our slot-based approach is the natural choice for systems with power control schemes. These algorithms have good performance and simple implementation. As a simple variation of log-MAP and max-log-MAP decoding, our simplified decoder is robust, with good performance and reduced complexity. Most importantly, it gets rid of the tedious requirements of soft sample scaling.

Acknowledgement: We thank Mark Bickerstaff, Thomas Prokop, Charles Thomas and Ben Widdup of Bell Labs Australia for inspiring discussions.

References

[1] C. Berrou et al, Near Shannon limit error-correcting coding and decoding: Turbo codes, IEEE Int. Conf. On Comm., pp 1064-1070, May, 1993


Vol. 20, pp284-287, March, 1974


Vol. 42, pp429-445, March, 1996

[4] T. Summers and G. Wilson, SNR mismatch and online estimation in turbo decoding, IEEE Trans. Comm., Vol. 46, No 4, pp 421-423, April 1998

[5] E. Hall and S. Wilson, Design and analysis of turbo codes on Rayleigh fading channels, IEEE Trans. Comm., Vol. 16, No 2, pp 160-174, February, 1998

[6] S. Kim, Belief propagation, parameter estimation, and issues in turbo decoding, PhD. Thesis, Cornell University, 1998

[7] C. Heegard and S. Wicker, Turbo codes, Kluwer Academic Press, 1999

[8] S. Pietrobon, Implementation and performance of a Turbo/MAP decoder, International Journal of Satellite Communications, Vol. 16, pp 409-429, 1998

[9] P. Frenger, Turbo decoding for wireless systems with imperfect channel estimation, IEEE Trans. Comm., Vol. 48, No 9, pp 1437-2000, September 2000

[10] P. Frenger and A. Svensson, Decision directed coherent detection in multicarrier systems on Rayleigh fading channels, IEEE Trans. Veh. Technol., Vol. 48, pp 490-498, March, 1999

[11] M. Valenti and B. Woerner, Iterative channel estimation and decoding of pilot symbol assisted turbo codes over flat-fading channels, manuscript

[12] W. Oh and K. Cheun, Adaptive channel SNR estimation algorithm for turbo decoder, IEEE Comm. Letter, Vol. 4, No 8, pp 255-257, August 2000

[13] A. Worm, P. Hoeher and N. Wehn, Turbo decoding without SNR estimation, IEEE Comm. Letter, Vol. 4, No 6, pp 193-195, June 2000

[14] A. Viterbi, CDMA: principles of spread spectrum communication, Addison-Wesley, 1995






turbo decoders, pp586-589, Proc. Int. Sym. Inform. Theory Appl., Victoria, B. C. Canada, 1996

[18] A. Matache, S. Dolinar and F. Pollara, Stopping rules for turbo decoders, JPL TMO Progress Report 42-142, August 2000

[19] R. Shao, S. Lin and M. Fossorier, Two simple stopping criteria for turbo decoding, IEEE Trans. Comm., Vol. 47, pp 1117-1120, 1999

[20] D. Pauluzzi and N. Beaulieu, A comparison of SNR estimation techniques for the AWGN channel, IEEE Trans. Comm., Vol. 48, No 10, pp 1681-1691, October, 2000

[21] N. Beaulieu, A. Toms and D. Pauluzzi, Comparison of four SNR estimators for QPSK modulations, IEEE Comm. Letters, Vol. 4, No 2, pp 43-45, February, 2000

[22] J. Ertel, A. Finger and J. Vogt, Improving the max-log-MAP turbo decoder, Electronics Letters, Vol. 36, No 20, pp 1714-1716, October, 2000

[23] S. Xu, H. Teicher, K. Tanaka and W. Smith, A simple calculation of turbo decoding intrinsic SNR and some applications, section 2 of this paper

[24] S. Xu and W. Stark, Extrinsic information impact on ML and MAP decoding of convolutional codes, section 3 of this paper


6. Further thoughts and intuitions
By Shuzhan Xu

Evariste Galois told us “Unfortunately what is little recognized is that the most worthwhile scientific books are those in which the author clearly indicates what he does not know; for an author most hurts his readers by concealing difficulties.” (Quoted in N. Rose, Mathematical Maxims and Minims, Raleigh N C 1988). From the preface to his final manuscripts, we can read “Since the beginning of the century, computational procedures have become so complicated that any progress by those means has become impossible, without the elegance which modern mathematicians have brought to bear on their research, and by means of which the spirit comprehends quickly and in one step a great many computations.” In the last days of his short life, he reminded us “Go to the roots, of these calculations. Group the operations. Classify them according to their complexities rather than their appearances. This, I believe, is the mission of future mathematicians. This is the road on which I am embarking in this work.”

After decoder design and some thinking, I inevitably also have my own observations, intuitions and feelings about decoding and communication theory. Although I have consulted several researchers and colleagues and drawn on their collective thinking, the following "conjectures" are solely my own responsibility if they are wrong or completely misleading.

(1) Shannon capacity with Gaussian noise seems unachievable with finite length codes and finite decoding operations. Practice is generally below the performance bounds.

(2) Three issues in iterative decoding are: high-speed decoding, low-power decoding, and convergence analysis. The question of a low-power decoding algorithm (an optimal algorithm in terms of decoding "efficiency") seems answerable only by convergence analysis. It is generally hard to get an optimal (low-power) algorithm without knowing why it is optimal.

(3) To get a constructive proof of the Shannon capacity theorem, it seems we need a better understanding of both code structures and decoding schemes. This "general" structure needs to include turbo and LDPC codes as special cases, and the "universal" decoding scheme needs to be linked with information analysis.

Whoever reads this, please help us clear out the answers (either in a positive or a negative way). The answers (wherever they come from) are clearly the most important, for they give us better understanding and solve puzzles. In the end, truth is simply truth. More or less as an amateur, I will be very happy and feel greatly honored should the above speculations offer any valid help to communication theory research. Anyway, I am speaking out other people's feelings and my own mind (so often in a random and "day dreaming" state). For now, let's treat science simply as art and entertain ourselves with some perturbations of possibility combinations.


Appendices

A. Optimal linear approximation for the correction term in log-MAP decoding
By Shuzhan Xu, John Falkowski and Junchen Du

A.1. Optimal linear approximation scheme

The log-MAP (max*) decoding in practice is typically done by a dual directional ACS (accumulate, compare and select, the recursions) with a logarithmic correction term as

$$\max{}^*(x, y) = \log(e^x + e^y) = \max(x, y) + \log(1 + e^{-|x-y|}) \approx \max(x, y),$$

where the correction term $\log(1 + e^{-|x-y|})$ is typically implemented with a look-up table, replaced by a constant value, or simply ignored (max). The performance degradation of max as compared to max* is about 0.35 dB under the AWGN channel.
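As a quick numerical check of the identity above, the following sketch implements max* exactly and compares it with the plain max. It is an illustration only; the function names are ours and floating point is used instead of the fixed-point arithmetic of a real decoder.

```python
import math


def max_star(x, y):
    """Exact Jacobian logarithm: log(exp(x) + exp(y))."""
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))


def max_log(x, y):
    """max-log approximation: the correction term is dropped."""
    return max(x, y)


def correction(x, y):
    """Correction term log(1 + exp(-|x - y|)); it lies in (0, log 2]."""
    return max_star(x, y) - max_log(x, y)
```

The correction is largest (log 2) when the two inputs are equal and decays quickly as they move apart, which is why max-log-MAP approaches log-MAP at high SNR.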

Linear and constant approximations have been investigated by Benedetto et al [3][4] as

$$\log(1 + e^{-x}) \approx -ax + b, \quad 0 < x < \frac{b}{a}.$$

For $b = \log(2)$ and $a = 0.3$, the BER performance degradation is 0.1 dB. The problem with this linear approximation is that 0.3 cannot be easily implemented in ASIC hardware without approximation and cannot be represented precisely in common DSPs. This observation motivates us to investigate further.

The key goal for us is to obtain linear approximation schemes, with good performance, that can be easily implemented in digital logic (ASIC hardware or DSP). The target function $f(x) = \log(1 + e^{-x})$ has its first-order Taylor expansion at $x = 0$ as $f(x) \approx \log(2) - \frac{1}{2}x$. This scheme has an easy implementation in digital logic ($\frac{1}{2}x$ is just a right shift of $x$). For the Taylor expansion at $x = \log(3)$, where $f'(x) = -1/(1 + e^{x}) = -\frac{1}{4}$, we have $f(x) \approx \log(\frac{4}{3}) - \frac{1}{4}(x - \log(3))$, whose slope is again a simple right shift. Many different linear approximation schemes can be devised from Taylor series, and their performance can be justified via simulations.

Now, we look at optimal linear approximation schemes with respect to the $L_\infty$-norm, $L_1$-norm and $L_2$-norm. To approximate $f(x) = \log(1 + e^{-x})$, we first need to truncate the function and set the approximation range as follows:

$$f(x) = \log(1 + e^{-x})\,\chi_{[0,\eta]}(x),$$

where $\chi_{[0,\eta]}$ is the characteristic function of the interval $[0, \eta]$, that is, $\chi_{[0,\eta]}(x) = 1$ for $0 \le x \le \eta$ and $\chi_{[0,\eta]}(x) = 0$ otherwise. We define the $L_\infty$-norm, $L_1$-norm and $L_2$-norm respectively as

$$\|f\|_{L_\infty} = \max_{0 \le x \le \eta} |f(x)|, \quad \|f\|_{L_1} = \int_0^\eta |f(x)|\,dx, \quad \|f\|_{L_2} = \Big\{ \int_0^\eta |f(x)|^2\,dx \Big\}^{1/2}.$$

We denote the function spaces

$$L_\infty[0,\eta] = \{g : \|g\|_{L_\infty} < \infty\}, \quad L_1[0,\eta] = \{g : \|g\|_{L_1} < \infty\}, \quad L_2[0,\eta] = \{g : \|g\|_{L_2} < \infty\},$$

and define the corresponding distances as

$$d_{L_\infty}(f, g) = \|f - g\|_{L_\infty}, \quad d_{L_1}(f, g) = \|f - g\|_{L_1}, \quad d_{L_2}(f, g) = \|f - g\|_{L_2}.$$

Our goal is to find the optimal linear approximation schemes

$$a_\infty x + b_\infty = \arg\min\{\|f - a_\infty x - b_\infty\|_{L_\infty}\},$$

$$a_1 x + b_1 = \arg\min\{\|f - a_1 x - b_1\|_{L_1}\},$$

$$a_2 x + b_2 = \arg\min\{\|f - a_2 x - b_2\|_{L_2}\}.$$

Dyadic approximation to the linear line slope will be investigated for implementation purposes. The following classical results tell us the existence and uniqueness of the optimal linear approximation schemes [6].

Proposition 1.1. For $L_\infty[0,\eta]$, $L_1[0,\eta]$ and $L_2[0,\eta]$, there exist unique linear functions $a_\infty x + b_\infty$, $a_1 x + b_1$ and $a_2 x + b_2$ such that

$$a_\infty x + b_\infty = \arg\min\{\|f(x) - ax - b\|_{L_\infty}\},$$

$$a_1 x + b_1 = \arg\min\{\|f(x) - ax - b\|_{L_1}\},$$

$$a_2 x + b_2 = \arg\min\{\|f(x) - ax - b\|_{L_2}\},$$

as optimal linear approximation functions.

To search for optimal linear approximations, we search for the optimal range, optimal slope and optimal constant. With the theoretical guidelines and the numerical search procedures in the appendix, we restrict our search range to $\eta \in [1.0, 3.0]$ (since $\log(1 + e^{-3}) = 0.049$ is already very close to zero), with the following results.

η      f(η)      Norm    Slope            Constant        Max Error
2.0    0.12693   L_∞     a_∞ = -0.28      b_∞ = 0.64      0.12574
                 L_1     a_1 = -0.27      b_1 = 0.62      0.12574
                 L_2     a_2 = -0.28      b_2 = 0.63      0.12574
2.5    0.07889   L_∞     a_∞ = -0.24      b_∞ = 0.62      0.07813
                 L_1     a_1 = -0.23      b_1 = 0.59      0.12315
                 L_2     a_2 = -0.24      b_2 = 0.60      0.09315
3.0    0.04587   L_∞     a_∞ = -0.22      b_∞ = 0.62      0.09309
                 L_1     a_1 = -0.19      b_1 = 0.53      0.16315
                 L_2     a_2 = -0.20      b_2 = 0.56      0.13315

The optimal linear approximation schemes under different norms are virtually the same (considering also the numerical errors). We propose the approximation range to be $\eta = 3.0$ and force the line slope to be $-1/4 = -0.25$. We have the following numerical results.

Range and Slope          Constant    Max Error
η = 3.0, a = -0.25       b = 0.67    0.12859
                         b = 0.68    0.11859
                         b = 0.69    0.12766

Clearly, the optimal choice is $\eta = 3.0$, $a = -0.25$ and $b = 0.68$. To keep the linear approximation always positive, we restrict $\eta = 2.72$, the point where $0.68 - 0.25x$ reaches zero. Given that these coefficients are generated by numerical computation, we point out that we may vary $\eta = 2.72$ and $b = 0.68$ a little in implementation (but $a = -0.25$ should not be changed). The performance curve is generated with UMTS W-CDMA turbo codes. These results show that the performance degradation is negligible (less than 0.05 dB under the AWGN channel). To our satisfaction, the optimal linear approximation scheme (optimal in the sense of implementation) is good both in implementation and in performance.
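The maximum errors quoted for the slope $-0.25$ can be checked with a short script. This is a sketch under the assumption that the maximum error is taken as $\max |f(x) - (ax + b)|$ over $[0, \eta]$ with the unclipped line; the grid resolution and all function names are our own choices.

```python
import math


def correction_exact(x):
    """Target function f(x) = log(1 + exp(-x)) for x >= 0."""
    return math.log1p(math.exp(-x))


def correction_linear(x, a=-0.25, b=0.68):
    """Linear approximation clipped at zero; for a = -0.25, b = 0.68 the
    clipping point is x = 2.72."""
    return max(0.0, a * x + b)


def max_abs_error(a, b, eta, steps=3000):
    """Maximum |f(x) - (a*x + b)| on a uniform grid over [0, eta] (no clipping)."""
    worst = 0.0
    for i in range(steps + 1):
        x = eta * i / steps
        worst = max(worst, abs(correction_exact(x) - (a * x + b)))
    return worst


errors = {b: max_abs_error(-0.25, b, 3.0) for b in (0.67, 0.68, 0.69)}
```

Under that assumption the three errors agree with the tabulated values to about four decimals, and $b = 0.68$ balances the error at $x = \log 3$ against the error at the end of the range.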


Figure 1. Optimal linear approximation and its BER performance

For even better approximation results without using the look-up table, higher order approximation schemes can be utilized. We can use higher order Lagrange interpolation schemes (particularly quadratic and cubic), for example. Piecewise linear approximation can of course also be used, with very simple implementation and negligible performance degradation. Higher order and piecewise linear approximation schemes are not in high demand given the good performance of our optimal linear approximation schemes.

A.2. ASIC and DSP implementation schemes

The basic structure for ASIC implementation of the optimal linear approximation scheme is shown as follows. The actual fixed-point values for the constants 0.68 and 2.72 will be determined by the Q-format and the number of binary bits used to represent the integer part. For example, a 16-bit Q12 number has an implicit binary point between bits 12 and 11. Bit 15 (sign bit) and bits 14-12 represent the integral part of the number, and bits 11-0 represent the fractional part. We have the following ASIC implementation diagram.

Figure 2. Implementation diagram of optimal linear approximation (data flow: c = max(a,b), d = min(a,b), e = c - d, f = min(e, 2.72), g = f >> 2, h = c + 0.68, y = h - g)

For DSP implementation, we give examples based on Agere Systems' StarCore SC140 DSP. The StarCore SC140 DSP is a VLIW machine with four data arithmetic and logic units (DALU) and two address generation units (AGU). It can execute four DALU instructions and two AGU instructions per cycle. The SC140 has sixteen 40-bit orthogonal data registers (d0-d15) and sixteen 32-bit address registers (r0-r15). The following example is the kernel of the max* implementation on SC140 with a single DALU:

/* r0: memory address where “0.68” is stored */

/* r1: memory address to temporary store “h” */

/* r2: memory address where “a” is stored */

/* r3: memory address where “2.72” is stored */

/* r4: memory address to store “y” */

/* r5: memory address where “b” is stored */

max d0,d4 move.w *r0,d8 move.w d8,*r4+

/* d4=”c”=max(“a”,”b”), load “0.68” to d8,store old “y” */

add d4,d8,d8 move.w d4,*r1 move.w *r2+,d4

/* d8=h=c+”0.68”, store “c” or d4, load current “b” to d4 */

min d0,d4 move.w *r1,d0

/* d4=”d”=min(“a”,”b”), load “c” to d0 */

sub d4,d0,d0 move.w *r3,d4

/* d0=”e”=”c”-“d” , load “2.72” to “d4 */

min d0,d4

/* d4=”f”=min(“e”,”2.72”) */

asrr #2,d4 move.w *r5+,d0

/* d4=”g”=”f”>>2, load new “a” to d0 */

sub d4,d8,d8 move.w *r2,d4

/* d8=”y”=”h”-“g”, load new “b” to d4 */
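The same data flow (c, d, e, f, g, h, y) can be modelled bit-accurately in a few lines, which is convenient for checking the kernel against a floating-point reference. This is a sketch assuming the Q12 format discussed above; saturation and the 16-bit word length are not modelled, and the function names are ours.

```python
FRAC_BITS = 12  # assumed Q12 format: 12 fractional bits


def to_q(value, frac_bits=FRAC_BITS):
    """Convert a real number to a fixed-point integer by rounding."""
    return int(round(value * (1 << frac_bits)))


CONST_B = to_q(0.68)    # constant term of the linear approximation
CONST_ETA = to_q(2.72)  # clipping point of the approximation range


def max_star_fixed(a, b):
    """max* with the linear correction 0.68 - e/4, on fixed-point integers."""
    c = max(a, b)
    d = min(a, b)
    e = c - d               # |a - b|
    f = min(e, CONST_ETA)   # clip to the approximation range
    g = f >> 2              # multiply by 1/4 with a right shift
    h = c + CONST_B
    return h - g            # y = c + 0.68 - f/4
```

With these constants the correction vanishes exactly once the inputs differ by more than 2.72, because the clipped and shifted value equals the stored constant.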

It takes 7 cycles to compute one max* value with one DALU. If all 4 DALUs are used, 1.75 cycles per max* value can be achieved with the following SC140 code:

[ max d0,d4 max d1,d5 max d2,d6 max d3,d7

move.4w *r0,d8:d9:d10:d11 move.w d8:d9:d10:d11,*r4+ ]

[ add d4,d8,d8 add d5,d9,d9 add d6,d10,d10 add d7,d11,d11

move.4w d4:d5:d6:d7,*r1 move.4w *r2+,d4:d5:d6:d7 ]

[ min d0,d4 min d1,d5 min d2,d6 min d3,d7 move.w *r1,d0:d1:d2:d3 ]

[ sub d4,d0,d0 sub d5,d1,d1 sub d6,d2,d2 sub d7,d3,d3 move.4w

*r3,d4:d5:d6:d7 ]

[ min d0,d4 min d1,d5 min d2,d6 min d3,d7 ]

[ asrr #2,d4 asrr #2,d5 asrr #2,d6 asrr #2,d7

move.w *r5+,d0:d1:d2:d3 ]

[ sub d4,d8,d8 sub d5,d9,d9 sub d6,d10,d10 sub d7,d11,d11 move.w

*r2,d4:d5:d6:d7 ]

Some of the SC140 instructions can do two 16-bit operations in one DALU instruction, such as max2, add2 and sub2. If two more instructions, min2 and asr2, are added to the SC140 instruction set, 0.875 cycles per max* operation can be achieved. The following is the pseudo code if min2 and asr2 become available:

/* r0: memory address where “0.68” is stored */

/* r1: memory address where “a” is stored */

/* r3: memory address where “2.72” is stored */

/* r4: memory address to store “y” */

/* r5: memory address where “b” is stored */

[ max2 d0,d4 min2 d4,d0 max2 d1,d5 min2 d5,d1

move.2l d8:d9,*r4+ move.2l *r1+,d6:d7 ]

/* perform c=max(a,b) and d=min”a,b” */

[ max2 d2,d6 min2 d6,d2 max2 d3,d7 min2 d7,d3

move.2l d10:d11,*r4+ move.2l *r0,d8:d9 ]

/* perform c=max(a,b) and d=min”a,b” */

[ add2 d4,d8 add2 d5,d9 sub2 d0,d4 sub2 d1,d5

move.2l *r0,d10:d11 ]

/* perform h=c+0.68, e=c-d */

[ add2 d6,d10 add2 d7,d11 sub2 d2,d6 sub2 d3,d7

move.2l *r3,d0:d1 move.2l *r3,d2:d3 ]

/* perform h=c+0.68, e=c-d */

[ min2 d0,d4 min2 d1,d5 min2 d2,d6 min2 d3,d7

move.2l *r5+,d0:d1 ]

/* perform f=min(e,2.72) */

[ asr2 #2,d4 asr2 #2,d5 asr2 #2,d6 asr2 #2,d7 ]

/* perform g=f>>2 */

[ sub2 d4,d8 sub2 d5,d9 sub2 d6,d10 sub2 d7,d11 ]

/* perform y=h-g */

Please note that the previous ASIC and DSP implementations cover the full max* ACS operation, not just the optimal linear approximation scheme.

A.3. Conclusions

We proposed and analyzed an optimal linear approximation scheme for the computation of the log-MAP correction term. Numerical simulation results further justify our analysis. In particular, the optimal linear approximation scheme is simple in implementation (both ASIC and DSP) and gives near-ideal final decoder performance.

APPENDIX: Calculation of Optimal Linear Approximation

We now give detailed calculation procedures to get the coefficients of the optimal linear approximation schemes. This is a summary of how we use some of the theory presented in [6] as a guideline for our specific problems. As we know, the optimal linear approximation to the function $f(x)$ is unique under any of the three norms. For easy analysis, we avoid the derivation effort needed to get closed-form linear approximation schemes (we always represent the coefficients with finite precision in implementation anyway). We approach this problem via numerical schemes with the following procedures.

For the $L_\infty[0,1]$-norm, all we need to do is find $a_\infty x + b_\infty$ such that

$$a_\infty x + b_\infty = \arg\min\{\|f(x) - ax - b\|_{L_\infty}\} = \arg\min\{\max_{0 \le x \le 1} |f(x) - ax - b|\}.$$

With $0 \le x \le 1$, $-\frac{1}{2} \le a \le 0$ and $-1 \le b \le 1$ (this is an extended search range), we view this problem as a three-dimensional optimization problem. We can use numerical procedures to find optimal solutions very easily.

For the $L_1[0,1]$-norm, we need to find $a_1 x + b_1$ such that

$$a_1 x + b_1 = \arg\min\{\|f - a_1 x - b_1\|_{L_1}\} = \arg\min\Big\{\int_0^1 |f - a_1 x - b_1|\,dx\Big\},$$

and for the $L_2[0,1]$-norm, we need to find $a_2 x + b_2$ such that

$$a_2 x + b_2 = \arg\min\{\|f(x) - ax - b\|_{L_2}\} = \arg\min\Big\{\Big(\int_0^1 |f(x) - ax - b|^2\,dx\Big)^{1/2}\Big\}.$$

With the same extended search range $0 \le x \le 1$, $-\frac{1}{2} \le a \le 0$ and $-1 \le b \le 1$, we apply some very simple numerical integration schemes to find the optimal linear approximations. Finally, we point out that optimal linear approximation schemes for different $\eta$ values can be easily determined in the same way.
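The $L_\infty$ search described above can be sketched as a plain grid search. This is a brute-force illustration, not the procedure actually used for the reported coefficients: the 0.01 coefficient resolution, the number of sample points in $x$ and the function name are assumptions.

```python
import math


def best_linear_linf(eta, a_values, b_values, x_steps=100):
    """Grid search for the line a*x + b minimizing max |f(x) - a*x - b|
    over [0, eta], with f(x) = log(1 + exp(-x)).

    Returns (max_error, a, b) for the best grid point.
    """
    xs = [eta * i / x_steps for i in range(x_steps + 1)]
    fs = [math.log1p(math.exp(-x)) for x in xs]
    best = None
    for a in a_values:
        for b in b_values:
            err = max(abs(f - a * x - b) for x, f in zip(xs, fs))
            if best is None or err < best[0]:
                best = (err, a, b)
    return best


# Extended search range from the text: -1/2 <= a <= 0 and -1 <= b <= 1,
# sampled here with an assumed resolution of 0.01.
a_grid = [-i / 100 for i in range(51)]
b_grid = [i / 100 for i in range(-100, 101)]
err, a_opt, b_opt = best_linear_linf(2.0, a_grid, b_grid)
```

For $\eta = 2.0$ this search returns the slope $-0.28$ and the constant $0.64$. The $L_1$ and $L_2$ cases follow the same pattern with the inner `max` replaced by a numerical integral.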

References

[1] C. Berrou et al, Near Shannon limit error-correcting coding and decoding: Turbo codes, IEEE Int. Conf. On Comm., pp 1064-1070, May, 1993


Vol. 20, pp284-287, March, 1974



[4] S. Benedetto et al, Soft-output decoding algorithms in iterative decoding of turbo codes, TDA Progress Report 142-124, February, 1996

[5] A. Chass, A. Gubeskys and G. Kutz, Efficient software implementation of the Max-Log-MAP turbo decoder on the StarCore SC140 DSP, Motorola Document

[6] E. W. Cheney, Introduction to approximation theory, AMS, March 1999


B. UMTS WCDMA Soft Sample AGC Normalization for Decoding
By Shuzhan Xu, Qi Wang, Vasic Dobrica, Stephen Spence and Phong Nguyen

B.1. Introduction: problems and intuitions

The UMTS WCDMA receiver contains two decoders: a Viterbi decoder and a turbo decoder. The turbo decoder can be configured into two versions: log-MAP and max-log-MAP (with or without the logarithmic correction term implemented as a look-up table). The following three impacts must be studied for decoder operation:

(1) DC offset residue in the soft samples.
(2) Dynamic range control to reduce digital quantization errors.
(3) SNR scaling for log-MAP turbo decoding.

The DC offset is generally supposed to have been removed before the soft samples reach the decoding stage. Dynamic range control means utilizing the finite bit width efficiently and keeping the quantization errors to a minimum. For optimal turbo decoding (log-MAP version), soft samples must also be scaled with the online SNR value. The SNR variation is hidden in the soft samples in the normalization and scaling stage. Viterbi and max-log-MAP decoding are linear decoding schemes with respect to the input soft samples, and quantization error is the major issue that needs to be considered. Soft sample scaling for the nonlinear log-MAP turbo decoding is the most demanding due to the online SNR scaling and dynamic range control [3].
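The difference between the linear and nonlinear schemes can be seen directly on the basic max and max* operations. The sketch below is only an illustration of this scaling property on a single operation, not on a full decoder; the function names are ours.

```python
import math


def max_star(x, y):
    """log-MAP kernel: log(exp(x) + exp(y))."""
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))


def scaling_gap_max(k, x, y):
    """max(kx, ky) - k * max(x, y); zero for any k > 0."""
    return max(k * x, k * y) - k * max(x, y)


def scaling_gap_max_star(k, x, y):
    """max*(kx, ky) - k * max*(x, y); generally nonzero for k != 1."""
    return max_star(k * x, k * y) - k * max_star(x, y)
```

For example, doubling the inputs 1.0 and 2.5 leaves the max kernel exactly consistent, while the max* kernel is off by about 0.35. This is why a wrong scaling of the soft samples is harmless for Viterbi and max-log-MAP decoding but not for log-MAP.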

Closely related functions are AGC and power control schemes. AGC is applied to reduce the quantization error of the front end A/D, and it does not change the signal SNR value. Power control combats propagation fading, and it does change the signal SNR. An ideal power control scheme, if fast enough and without control error, should keep the final SNR constant. Above all, the decoder is a passive device and is the last block of the receiver. The information loss and performance degradation of the previous receiver blocks accumulate and are reflected by the decoder.

Soft sample normalization is needed to reverse the impact of the front end AGC and to align the soft samples in a decoding block, so as to best preserve the information in soft samples that have been processed by all the previous receiver blocks. For implementation, we need to scale soft samples after the RAKE receiver along with the Rx transport processing. We study these issues and present some simple implementation schemes to obtain optimal decoder performance. Our normalization schemes are slot-based with mantissa multiplication and cascaded bit shift. Based on slot-based normalization, we can come up with frame-based and further TTI-based normalization with cascaded operations. This study reveals the connection among different normalization stages.

B.2. RAKE receiver, AGC, power control and RX transport in UMTS

The UMTS WCDMA receiver functions can be briefly illustrated as follows.


The AGC loop is updated 20 times per slot with approximately ±2 dB/slot adjustment. A gain factor per slot is passed through the formatter block to produce an adjustment factor for each slot in floating point format with an 8-bit mantissa and a 5-bit exponent. That is, all the soft samples in a slot are multiplied by

$$\frac{\mathrm{Mantissa}}{128} \cdot 2^{\mathrm{Exponent}} = \mathrm{Mantissa} \cdot 2^{\mathrm{Exponent} - 7}$$

to complete the AGC loop operation. This is quite straightforward in DSP implementation: a bit shift after the multiplication of the mantissa with the soft samples. This operation of course can push the small amplitude soft samples to zero if one sample has an extremely big amplitude and the remaining samples are small in amplitude. The chance of this scenario should be very small for the following reasons:

(1) The soft samples of a slot come out of de-spreading and RAKE combining.
(2) The AGC adjustment factor comes out of an averaging operation.
(3) The AGC is designed to operate slowly.
(4) The peak to average power is under tight control.
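The multiply-then-shift operation can be sketched as follows. This is a minimal illustration on Python integers; the signedness of the mantissa and exponent fields and the truncating shift are assumptions, and the function name is hypothetical.

```python
def apply_agc(samples, mantissa, exponent):
    """Scale integer soft samples by mantissa * 2**(exponent - 7).

    The mantissa multiplication is done first; the power of two is then
    applied as a bit shift. A right shift truncates toward minus infinity.
    """
    shift = exponent - 7
    scaled = []
    for sample in samples:
        product = sample * mantissa
        if shift >= 0:
            scaled.append(product << shift)
        else:
            scaled.append(product >> -shift)
    return scaled
```

For instance, a mantissa of 128 with exponent 0 is a unit gain, while a mantissa of 64 with exponent 0 halves the samples and pushes a sample of amplitude 1 to zero, which is the small-amplitude effect mentioned above.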

The implementation detail of this normalization is our major discussion topic.

The power control function compensates fading effects. The received SNR should remain the same under perfect power control schemes. Given the slot-based power control schemes, we do expect the SNR variation of the soft samples coming out of the RAKE receiver to be limited to a certain narrow range. This is also the precondition for good decoder performance. In contrast to the AGC loop, the effect of power control is implicit and is hidden in the soft samples before we scale and normalize them. The SNR varies by approximately ±1.8 dB/slot in the current design.

RAKE receiver, based on MRC (maximum ratio combining) principles, combines signals from the strongest propagation paths. SIR (signal to interference ratio) is estimated for each finger for MRC combining. Right after the RAKE receiver, the slot-based AGC normalization must be performed and soft samples must be normalized accordingly.

We look at the RX transport processing schemes (decoder interface data flow).

Figure 1. RAKE, AGC and power control (diagram blocks: front end; A/D; despread & RAKE combining; normalization; scaling; to decoder; AGC loop; slot based power control)


This diagram shows the data flow from the RAKE receiver to the decoder. The soft samples after the RAKE receiver are normalized and scaled (if necessary) into the right range on a slot-by-slot basis. We will see that frame-based and TTI-based normalization schemes are slot-based with a proper exponent shift (only the range over which the largest common exponent is taken is different). For efficient SNR scaling to get optimal performance of log-MAP decoding, we also use slot-based schemes [3]. The RX transport is merely a soft sample shuffling and re-arrangement. To reduce the memory and the processing load, we need to do a format cast (the RAKE receiver gives 16-bit samples and the decoder only uses 8-bit samples). To reduce the digital quantization error and estimation error, we try to do all the mantissa multiplications on a slot-by-slot basis before the format cast and do the exponent shift later. This is the key to our algorithms.

We take the frame-based AGC normalization as an example. The AGC mantissa multiplication is slot-based (that is, we first normalize the samples on a slot-by-slot basis). Then the number of bits that need to be shifted is recorded for later comparison. The whole implementation trick here is that we do the AGC mantissa multiplication first and normalize the samples on a slot basis. These samples are truncated and rounded before being pushed into the 2nd de-interleaver. The exponent, or the bit shift needed for the slot, is recorded. After a whole frame of samples is available, the soft samples are read out from the 2nd de-interleaver, bit shifted and written back. Right shift of samples will not introduce error, and the dynamic range can be utilized most efficiently this way. Of course, the slot that holds the soft sample with the largest amplitude will not be shifted. We depict this normalization graphically as follows.

The signal being processed will look graphically as follows.

Figure 2. Rx Transport (diagram blocks: RAKE slot buffer; slot-based AGC mantissa multiplication and bit shift record; exponent; 2nd de-int; S1, S2, ..., S15 sample concat; CCTrCH; TrCH F; frame segment; 1st de-int; TTI; final TTI-based bit shift)

Figure 3. Frame-based AGC normalization (diagram blocks: RAKE; slot-based AGC mantissa multiplication and bit shift record; to 2nd de-int; frame-based soft sample normalization in DSP; slot-based bit shift; to & from 2nd de-int; Slot 1 to Slot 15 of a 10 ms frame)


The signal over a TTI block will not look the same as in a frame because the sample propagation order has been completely shuffled by the Rx transport processing. But the samples in a TTI block go to the same decoder. Most importantly, the slot-based online SNR scaling factor can be combined with the AGC scaling factor. Thus, we only discuss AGC normalization in this paper.

B.3. Data flow, processing error and optimal dynamic range utilization

We point out immediately the following issues in the algorithm implementation.

(1) Bit width of the 2nd de-interleaver is 8-bit. A 16-bit format would force all the following memory blocks to double as well.

(2) Slot-based, frame-based or TTI-based normalization. Clearly, our scheme is based on slot-based normalization and scaling. The difference between slot-based, frame-based and TTI-based normalization schemes is simply the range over which we look for the maximum common exponent for the final bit shift. This can be evaluated in terms of memory size, processing timing and delay, overall system performance, and so on. It should also fit in the available Rx transport processing architecture.

The first trade-off is virtually the processing (truncation and rounding) error analysis. The second involves more system and architecture issues.
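The final alignment to a maximum common exponent can be sketched as follows. This is an illustration under the assumption that each slot has already been mantissa-multiplied and carries its own recorded exponent, so that the true value of a sample is `stored * 2**exponent`; the function name and data layout are hypothetical.

```python
def align_to_common_exponent(slots, exponents):
    """Right-shift each slot so that all slots share the largest exponent.

    slots:     list of lists of integer soft samples, one list per slot
    exponents: recorded exponent of each slot
    Returns (aligned_slots, common_exponent).
    """
    common = max(exponents)
    aligned = [
        [sample >> (common - exponent) for sample in slot]
        for slot, exponent in zip(slots, exponents)
    ]
    return aligned, common


# Frame-based use: pass the 15 slots of one frame; TTI-based use: pass all
# slots of the TTI. Only the range of the max() changes.
frame, common = align_to_common_exponent([[64, -32], [10, 5]], [0, 2])
```

The slot with the largest exponent is left untouched, and every other slot is only shifted right, which matches the description of the frame-based scheme.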

Implementing the mantissa multiplication with slot-based normalization first aims at the minimum processing error within the given precision. The underlying reason is that a right shift of soft samples will not increase the truncation and rounding error as compared to a left shift. The order of truncation, rounding and bit shift does make a difference. For example, if $x = 0000,0000,0100,0000$ is left shifted by two bits, then truncated and rounded to 8 bits, then right shifted by two bits, we have 0000,0000,0000,0001=x 0001,00100001,00000000,0001 =+=x, that is $x = 0100,0000$, and the error is $0100,0000 - 0100,0000 = 0000,0000$ in 8-bit format. On the other hand, truncating and rounding to 8 bits without the bit shift leads to $x = 0000,0000,0100,0000 + 0001,0000 \to 0101,0000$ and the error is $2^{-7}$. Keeping the processing error under tight control is a key issue in the implementation of digital processing.

[Figure 4. Signal after RAKE without AGC (Slot 1 to Slot 15 of a frame).]

[Figure 5. Signal after frame-based AGC (Slot 1 to Slot 15 of a frame).]

We can easily get some general analysis on soft sample manipulation. When we cast a soft sample from a higher number of bits $K_1$ into a lower number of bits $K_2$, the truncation and rounding error is $2^{-K_2}$. But if we shift the soft sample left by $K$ bits first, do truncation and rounding, then shift right by $K$ bits, the error will be $2^{-(K_2+K)}$. We suppose of course that there is no overflow (in the case of overflow, no bit shift is done and the error is simply $2^{-K_2}$). To be more precise, the error for our slot-based normalization approach is no more than $2^{-7}$ for every sample in 8-bit format. This is the best possible error we can control. This analysis partly justifies the AGC normalization algorithm (which utilizes the dynamic range in the most efficient way possible). Please note that $2^{-7}$ is the worst possible error for any sample in a slot. In frame-based normalization, the average truncation and rounding error could be much less due to the uneven number of bit shifts in the whole frame. Roughly speaking, the truncation and rounding error of the whole frame will be $2^{-(7-\mathrm{aveExp})}$, where $\mathrm{aveExp} = \mathrm{average}\{\mathrm{Exp}_{slot}\}$ is a negative number. If the bit width in the 2nd de-interleaver is increased to 16-bit, then the maximum truncation and rounding error will be reduced to $2^{-15}$ and the average error will be reduced to $2^{-(15-\mathrm{aveExp})}$ accordingly. This simple error analysis can be easily extended to frame-based and TTI-based schemes. This capability of cascaded bit shift operation (slot-based normalization leads to frame-based normalization, which then leads to TTI-based normalization) gives an easy implementation of TTI-based normalization schemes. This is the way to have the optimal dynamic range utilization.

B.4. Normalization with slot-based multiplication and cascaded bit shift
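Before turning to the implementation, the shift-before-rounding argument of B.3 can be checked numerically. The following is a minimal sketch, not the authors' fixed-point code: it models a sample as a real number, takes `frac_bits` as the number of fractional bits kept after the cast, and assumes round-to-nearest. The function names and the test samples are hypothetical.

```python
import math

def quantize(x: float, frac_bits: int) -> float:
    """Round x to the nearest multiple of 2**-frac_bits (ties round up)."""
    scale = 2 ** frac_bits
    return math.floor(x * scale + 0.5) / scale

def quantize_with_preshift(x: float, frac_bits: int, k: int) -> float:
    """Shift left by k bits, quantize, then shift right by k bits.

    The caller must guarantee that x * 2**k stays in range (no overflow).
    """
    return quantize(x * 2 ** k, frac_bits) / 2 ** k

def worst_error(values, frac_bits: int, k: int = 0) -> float:
    """Largest absolute quantization error over the given values."""
    return max(abs(quantize_with_preshift(v, frac_bits, k) - v) for v in values)

# Hypothetical small-amplitude samples on a 15-fractional-bit grid.
# |v| < 2**-3, so a 2-bit left shift cannot overflow a [-1, 1) format.
samples = [n / 2 ** 15 for n in range(-4095, 4096)]

direct_err = worst_error(samples, frac_bits=7)        # cast without shifting
shifted_err = worst_error(samples, frac_bits=7, k=2)  # shift, cast, shift back
print(f"direct: {direct_err}, shifted: {shifted_err}")
```

Under these assumptions the worst-case error with a 2-bit pre-shift is four times smaller than with a direct cast, in line with the $2^{-K_2}$ versus $2^{-(K_2+K)}$ bounds quoted in the text.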

We now present the frame-based AGC normalization scheme implementation details.


BEGIN AGC multiplication and frame-based normalization algorithm {

Step I: multiplication of soft samples with mantissa on each slot.

On each slot, suppose the AGC factor is $\mathrm{mantissa} \cdot 2^{\mathrm{Exponent}}$ and the samples are $\{y_i\}_{i=1}^{N}$, where mantissa is in 8-bit format (1-bit sign, 7-bit fractional), and $\{y_i\}_{i=1}^{N}$ are in 16-bit format (1-bit sign, 8-bit integer, 7-bit fractional). For DSP implementation,

$\mathrm{mantissa} * y_i = \mathrm{mantissa}_i \cdot 2^{\mathrm{Exponent}_i}$.

For full precision, $\mathrm{mantissa}_i$ needs to be 24-bit (16+8); we can keep it as 16-bit (1-bit sign, 15-bit fractional) and overwrite the soft sample $y_i$ in the slot buffer. The truncation and rounding is therefore

$[(\mathrm{mantissa}_i + 2^{-15})]_{24\text{-bit}} \to [\mathrm{mantissa}_i]_{16\text{-bit}}$.

After the whole slot of sample multiplication, we find and record

$\mathrm{EXP} = \max\{\mathrm{Exponent} + \mathrm{Exponent}_i\}$,

which is the largest common exponent. We then retrieve each $\mathrm{mantissa}_i$ and do the bit shift of each sample accordingly, that is

$\mathrm{mantissa}_i \cdot 2^{\mathrm{Exponent}_i + \mathrm{Exponent} - \mathrm{EXP}} \to \mathrm{mantissa}_i$.

The newly updated $\mathrm{mantissa}_i$ will be kept in 8-bit format and pushed into the 2nd de-interleaver. The truncation and rounding operation is

$[(\mathrm{mantissa}_i + 2^{-15})]_{16\text{-bit}} \to [\mathrm{mantissa}_i]_{8\text{-bit}}$.

A new largest common exponent

$\mathrm{EXP}_{slot} = \max\{\mathrm{Exponent} + \mathrm{Exponent}_i - \mathrm{EXP}\}$

is recorded for this slot.

Step II: repeat step I for each slot in the whole frame.

All slots in the frame will be processed following step I. The common exponent

$\mathrm{EXP}_{frame} = \max\{\mathrm{EXP}_{slot}\}$,

is found and the soft samples of the whole frame $\{\mathrm{mantissa}_i\}_{i=1}^{N*\mathrm{NumSlots}}$ will be retrieved and bit shifted accordingly on a slot basis as follows

$\mathrm{mantissa}_i \cdot 2^{-(\mathrm{EXP}_{frame} - \mathrm{EXP}_{slot})} \to \mathrm{mantissa}_i$

with no truncation and rounding (this is done before the 2nd de-interleaving).

} END algorithm

We see that the frame-based AGC normalization is really based on a slot-based mantissa multiplication and a maximum exponent search over 15 slots. Each slot is normalized and one exponent is recorded for frame-based normalization. For TTI-based normalization, we can use the slot-based approach with all the exponents recorded and processed according to the Rx transport. A more subtle way to do it is to use the frame-based exponents to generate the TTI-based common exponent for the final bit shifts before the 1st de-interleaver. Similar to the slot-based normalization, we record an exponent after each frame-based normalization operation. The final TTI-based normalization is done via a bit shift based on the maximum exponent taken over these frames. Please note that TTI-based normalization is built on top of the frame-based normalization and no AGC multiplication is needed. Graphically, we have the following scheme of implementation.
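The slot-based multiplication followed by the frame-level bit shift can be sketched in a few lines. This is a floating-point illustration under stated assumptions, not the DSP firmware: `math.frexp` stands in for the DSP normalization instruction, the slot exponent recorded here is simply the common exponent found in Step I, and all function names are hypothetical.

```python
import math

FRAC_BITS = 7  # assumed 8-bit format: 1 sign bit + 7 fractional bits

def quantize(m: float, frac_bits: int = FRAC_BITS) -> float:
    """Round to the fixed-point grid and saturate to the signed range."""
    scale = 2 ** frac_bits
    q = math.floor(m * scale + 0.5)
    q = max(-scale, min(scale - 1, q))
    return q / scale

def normalize_slot(samples, agc_mantissa, agc_exponent):
    """Step I: AGC multiplication, then shift to the slot's common exponent."""
    # mantissa * y_i = mantissa_i * 2**exponent_i, with 0.5 <= |mantissa_i| < 1
    parts = [math.frexp(agc_mantissa * y) for y in samples]
    slot_exp = max((agc_exponent + e for m, e in parts if m != 0.0),
                   default=agc_exponent)
    shifted = [quantize(m * 2.0 ** (agc_exponent + e - slot_exp))
               for m, e in parts]
    return shifted, slot_exp

def normalize_frame(slots):
    """Step II: align every slot to the largest slot exponent in the frame.

    slots is a list of (samples, agc_mantissa, agc_exponent), one per slot.
    """
    per_slot = [normalize_slot(*s) for s in slots]
    frame_exp = max(e for _, e in per_slot)
    out = []
    for mantissas, slot_exp in per_slot:
        # Pure right shift by (frame_exp - slot_exp) bits, no re-rounding here
        out.append([m * 2.0 ** -(frame_exp - slot_exp) for m in mantissas])
    return out, frame_exp

# Two hypothetical slots with different AGC factors
frame, frame_exp = normalize_frame([
    ([1.0, -0.5], 0.5, 0),
    ([4.0, 2.0], 0.5, 1),
])
print(frame, frame_exp)
```

Multiplying each returned mantissa by `2**frame_exp` gives back the AGC-scaled sample, which is the sense in which the whole frame shares one exponent.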

We have the following formal description of the TTI-based normalization scheme.

BEGIN TTI-based normalization algorithm {

Step I: do frame-based normalization on each frame, record the overall

$\mathrm{EXP}_{frame} = \max\{\mathrm{EXP}_{slot}\}$.

Step II: repeat step I for each frame, record the corresponding $\mathrm{EXP}_{frame}$ for each frame in the whole TTI block, do Rx transport processing. The common exponent

$\mathrm{EXP}_{TTI} = \max\{\mathrm{EXP}_{frame}\}$,

is found and the soft samples of the whole TTI block $\{\mathrm{mantissa}_i\}_{i=1}^{N*\mathrm{NumSlotsInTTI}}$ will be retrieved and bit shifted accordingly on a frame basis as follows

$\mathrm{mantissa}_i \cdot 2^{-(\mathrm{EXP}_{TTI} - \mathrm{EXP}_{frame})} \to \mathrm{mantissa}_i$

with no truncation and rounding (this is done before the 1st de-interleaving).

} END algorithm

[Figure 6. TTI-based AGC normalization: frame 1 to frame 8 of a TTI; frame-based AGC normalization and bit shift record; frame-based bit shift to and from the 1st de-interleaver and Rx transport; TTI-based normalization via bit shift in DSP.]

We may come up with “optimal” mixture implementation schemes based on the slot-based AGC normalization, cascaded bit shift and Rx transport processing. The key here is that the mantissa multiplication is done as a slot-based operation and all that remains is cascaded bit shifts. The mixture scheme could take advantage of the implementation simplicity of the frame-based approach and the performance of the TTI-based approach. The exponents are stored first as slot based, then frame based and so on. They pass through the same Rx transport processing as the soft samples do. The final bit shift and maximum exponent are calculated based on the exponents in the normalization range. As TTI-based normalization is within the affordable budget, we omit a detailed description of any other normalization schemes.

B.5. DSP Implementation details

We now count the number of DSP operations needed for the previous algorithms.

Operations in slot-based AGC mantissa multiplication

--- N 16-bit reads from the slot buffer, where N is the number of samples per slot.
--- N multiplications with the AGC mantissa, kept in a 32-bit DSP counter each time.
--- represent $\mathrm{mantissa} * y_i = \mathrm{mantissa}_i \cdot 2^{-\mathrm{exponent}_i}$ (M*N operations, where M is the number of operations to put the product into mantissa form; with a special built-in normalization function, M = 1 for most DSPs), do 16-bit truncation and rounding (N operations), write each sample to the slot buffer (N writes), record $\mathrm{EXP} - \mathrm{exponent}_i$ (N writes).
--- N comparisons to find the maximum common exponent, one write to record it.
--- read out soft samples from the slot buffer, do the bit shift (one read for the exponent, and the bit shift itself), do truncation and rounding (16-bit to 8-bit), write to the 2nd de-interleaver buffer (N sample reads, N exponent reads, N bit shifts, N operations for truncation and rounding, N writes to the 2nd de-interleaver buffer).

The total number of operations is therefore about 11*N + M*N per slot plus some overhead operations, where the number M is DSP dependent and yet to be determined.


Operations in frame-based bit shifts and normalization

--- 15 reads and comparisons to decide the maximum common exponent.
--- for each slot, read out soft samples from the 2nd de-interleaver, do the bit shift, write the soft samples back to the 2nd de-interleaver (N reads, N bit shifts, N writes per slot).

The total number of operations is therefore 15*3*N plus some overhead per frame. In summary, the total number of operations needed for AGC multiplication and normalization is roughly 15*(11*N + M*N + 3*N) per frame.
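The operation counts above are simple enough to tabulate. The sketch below only restates the per-slot and per-frame totals from the text; the function names and the sample count used in the example are hypothetical, and overhead operations are ignored.

```python
def slot_ops(n: int, m: int = 1) -> int:
    """Slot-based AGC multiplication and normalization: 11*N + M*N."""
    return 11 * n + m * n

def frame_ops(n: int, m: int = 1, slots_per_frame: int = 15) -> int:
    """Per-frame total: 15 * (11*N + M*N + 3*N), overhead not counted."""
    return slots_per_frame * (slot_ops(n, m) + 3 * n)

# Hypothetical example: 100 soft samples per slot, M = 1
print(slot_ops(100), frame_ops(100))
```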

We now look at the TTI-based normalization process. Based on the current architecture, each sample is read out from, bit shifted, and pushed back to the 1st de-interleaver buffer. The major issue is that we need a buffer to record the shift exponents for each frame. This array will pass through the 2nd de-interleaver, be segmented to form CCTrCHs, and form TTI blocks. That is, this array will pass through exactly the same Rx transport processing functions as the samples do. All these equivalent Rx transport functions can be implemented in DSP firmware (without touching the Rx transport processing for samples).
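A minimal sketch of this final TTI-level shift follows, assuming each sample carries a tag telling which frame it came from (the tag array is what travels through the same Rx transport as the samples). The function and argument names are hypothetical, and samples are modeled as real-valued mantissas.

```python
def tti_normalize(samples, frame_of_sample, frame_exps):
    """Align all samples of a TTI block to the largest frame exponent.

    samples:         mantissas in the order left by Rx transport processing
    frame_of_sample: frame index of each sample, carried through the same transport
    frame_exps:      EXP_frame recorded for each frame of the TTI block
    """
    tti_exp = max(frame_exps)
    shifted = [m * 2.0 ** -(tti_exp - frame_exps[f])
               for m, f in zip(samples, frame_of_sample)]
    return shifted, tti_exp

# Hypothetical TTI block of two frames whose samples have been interleaved
shifted, tti_exp = tti_normalize([0.5, 0.25, 0.5], [0, 1, 0], [1, 3])
print(shifted, tti_exp)
```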

Operations in TTI-based bit shifts and normalization

The TTI block size could be 6400*6.6=42240; we just need to estimate the extra bit shifts needed on top of the frame-based normalization. We need an array to record exponents (one for each frame). This array needs to go through all the Rx transport processing functions (including 2nd de-interleaving, frame segmentation and concatenation, forming CCTrCHs, and forming a buffer for the TTI block). Its size will be bigger than the number of frames in a TTI block (one TTI block contains at most 8 frames of 10 ms) due to the frame segmentation. Please note that the 2nd de-interleaving actually will not change the shift exponent for that frame. This is the key to the simple implementation of TTI-based normalization. We can estimate the buffer size needed for this part accordingly. For each TTI block, we also need one read for each soft sample with some overhead for the exponent read, one bit shift, and one write for each sample to the 1st de-interleaver buffer.

B.6. Conclusions

We have proposed and analyzed soft sample normalization schemes with slot-based operation and cascaded bit shifts for the most efficient implementation. The algorithms we came up with are straightforward (they can be extended to frame-based and then TTI-based schemes with ease) and make the best use of the available dynamic range. This approach is very flexible in DSP implementation and can be easily modified to fit the final design architecture and Rx transport processing schemes.

References

[1] T. Summers and G. Wilson, SNR mismatch and online estimation in turbo decoding, IEEE Trans. Comm., Vol. 46, No. 4, pp. 421-423, April 1998

[2] S. Pietrobon, Implementation and performance of a Turbo/MAP decoder, International Journal of Satellite Communications, Vol. 16, pp. 409-429, 1998

[3] Shuzhan Xu, Jan Meyer and Gerhard Ammer, Simple RMS soft sample scaling and simplified turbo decoders, section 5 of this paper

[4] F. Zalio, S. Wang and F. Savaglio, personal communications, NEC Australia


C. UMTS WCDMA Blind Transport Format Detection (BTFD) Schemes By Shuzhan Xu, William Smith and Gerhard Ammer

C.1. BTFD problems and intuitions

The UMTS W-CDMA standard specifies quite complicated transport, multiplexing, channel coding, and channel structures. We will discuss BTFD with channel decoding and CRC checking here. Only one coded composite transport channel (CCTrCH) is received and the number of CCTrCH bits per radio frame is 600 or less. Fixed positions of the transport channels are used on the CCTrCH to be detected. A convolutional code is used and a CRC is appended to transport blocks on all explicitly detected TrCHs. The sum of the transport format set sizes of all explicitly detected TrCHs is 16 or less and the total number of TrCHs is no more than 3.

Suppose that we use correlation as the path metric. The overall BTFD search criteria are

$s(n_{end}) = 10 \log_{10} \frac{a_0(n_{end}) - a_{min}(n_{end})}{a_{max}(n_{end}) - a_{min}(n_{end})}$

and the CRC, where $a_{max}(n_{end})$, $a_{min}(n_{end})$, $a_0(n_{end})$ are three path metrics (max, min and 0-state) at a checkpoint $n_{end}$. A correct BTFD point may be declared only if $s(n_{end}) \le D$ and the CRC check passes. The final conclusion on TF is the point with the minimum $s(n_{end})$ among those with $s(n_{end}) \le D$ and a passing CRC check. The BTFD schemes are thus search processes, and how to implement them on top of the decoder architecture is the key to BTFD algorithm design. Intuitively, we can implement the major functions of BTFD inside or outside of the Viterbi decoder (as an extension of the Viterbi decoder or as part of the DSP processing after decoding).

We now address the path metric threshold implementation. From $s(n_{end}) \le D$, we see

$\{a_0(n_{end}) - a_{min}(n_{end})\} \le 10^{D/10} \{a_{max}(n_{end}) - a_{min}(n_{end})\}$.

Let $D^* = 10^{D/10}$; we need just to check $\mathrm{Threshold}(n_{end}) \ge 0$, where

$\mathrm{Threshold}(n_{end}) = D^* \{a_{max}(n_{end}) - a_{min}(n_{end})\} - \{a_0(n_{end}) - a_{min}(n_{end})\}$.

The valid points are the points with $\mathrm{Threshold}(n_{end}) \ge 0$. The final winner is the point with the largest $\mathrm{Threshold}(n_{end})$ (that is, $n_{win} = \arg\max\{\mathrm{Threshold}(n_{end})\}$). This threshold computation requires three additions (subtractions) and one multiplication. This can be done within a few DSP cycles (4 cycles for most DSPs). Please note that $D^* = 10^{D/10}$ needs to be programmed after the system performance calibration. We assume the Viterbi decoder is implemented in ASIC hardware throughout our discussion. The overall system view is shown in Figure 1.
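As a brief aside on the threshold test, the computation can be sketched as follows. The sketch follows the threshold form exactly as the text writes it, with the sign conventions of the path metrics taken from the text; the function names and the example metrics are hypothetical, and $D^*$ is passed in precomputed, as it would be after calibration.

```python
def btfd_threshold(a_max: float, a_min: float, a_0: float, d_star: float) -> float:
    """Threshold(n_end) = D* (a_max - a_min) - (a_0 - a_min), as in the text."""
    return d_star * (a_max - a_min) - (a_0 - a_min)

def pick_winner(candidates, d_star: float):
    """Return the n_end with the largest non-negative threshold, or None.

    candidates: iterable of (n_end, a_max, a_min, a_0), one per checkpoint.
    """
    best_n, best_t = None, None
    for n_end, a_max, a_min, a_0 in candidates:
        t = btfd_threshold(a_max, a_min, a_0, d_star)
        if t >= 0 and (best_t is None or t > best_t):
            best_n, best_t = n_end, t
    return best_n

# Hypothetical path metrics at three checkpoints, D* = 0.5
checkpoints = [(40, 100.0, 0.0, 60.0), (80, 100.0, 0.0, 20.0), (120, 100.0, 0.0, 45.0)]
print(pick_winner(checkpoints, d_star=0.5))
```

Each threshold costs three subtractions or additions and one multiplication, matching the cycle estimate given in the text.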

The downlink transport channel multiplexing structure (Tx) is illustrated in Figure 2 (page 11) of TS 25.212 with detailed descriptions. The downlink transport channel multiplexing structure (Rx) is the set of receiver functions that reverse the specified transmitter operation. They are design dependent and we ignore them here.

C.2. DSP based multi-pass default solutions

We can configure and run the decoder from the beginning to each BTFD checkpoint. Then the CRC and path metric threshold are computed and stored (only for points that passed the CRC check). Optimal end points could be determined after exhausting all the BTFD checkpoints [3]. The function flow of this BTFD solution is as follows:

(1) Transfer soft samples (whole frame) into the decoder input buffer.
(2) Configure the Viterbi decoder (constraint length, code rate, CRC format, first BTFD checkpoint) and run the decoder to the first checkpoint.
(3) Compute or tap out (if computed by hardware) the CRC results and the path metrics to check $s(n_{end}) \le D$. Store the check results.
(4) Configure the Viterbi decoder with the next checkpoint, run the decoder, and repeat step (3) until the last checkpoint.
(5) Compare all the checked results in the DSP. If no point satisfies the CRC check and $s(n_{end}) \le D$, then declare a frame error. Otherwise, declare the winner $n_{win}$ to be the checkpoint with CRC pass and minimum $s(n_{end})$.
(6) Read out the hard decision bits from the beginning to $n_{win}$.

We must point out a subtle point here: we need some scratch memory to buffer some hard decision bits. For each BTFD check point, decoding ends with flushing from the zero state. The flushed bits will be rewritten by normal decoding operation when checking the next BTFD check point. The CRC pass of the current checkpoint is based on the flushed bits, not on the rewritten bits. If this point is decided to be the “optimal” BTFD point, the flushed bits should be buffered for a later hard decision bit read out. A buffer of two times the trace back length is sufficient for this purpose. Without this scratch buffer, we need to configure and re-run the decoder from the very beginning to $n_{win}$; this last approach is our least recommended for obvious reasons.

[Figure 1. System view with Viterbi decoder and BTFD: transmitter baseband processing, downlink transport channel multiplexing (Tx), Tx RF; Rx RF, downlink transport channel multiplexing (Rx), receiver baseband processing.]

The communication between the DSP and the decoder is as follows: (1) DSP to decoder: soft sample transfer, decoder configuration, and instruction to start decoding. (2) Decoder to DSP: interrupt after each decoding. The DSP also needs to read out from the decoder the CRC check results (if the CRC check is implemented as part of the decoder), the path metrics for threshold computation, and the hard decision bits (when BTFD is complete or there is a need to read out hard decision bits for a CRC check). The amount of DSP processing depends on how much of the BTFD functions have been done in hardware. For example, the DSP can simply read out the CRC check results if the CRC check is done in the decoder. Otherwise, the DSP needs to read out the hard decision bits and run the full CRC check. Similarly, we can use the DSP to check the path metric criteria or simply read out the threshold calculation if it is implemented in hardware.

The tasks in the DSP processing routine (interrupt service) are basically as follows, on top of the context switch needed in every interrupt service routine.

(1) At each checkpoint: read out the CRC check or compute the CRC from the available hard decision bits. If the CRC passes, read out the path metrics to check $s(n_{end}) \le D$. If $s(n_{end}) \le D$ is true, check the previously stored checkpoint location and $s(n_{end})$. If the current $s(n_{end})$ is smaller, overwrite the previous point location by the current one and overwrite the previous $s(n_{end})$ by the current $s(n_{end})$. Otherwise (the CRC fails, or $s(n_{end}) > D$, or the current $s(n_{end})$ is bigger than the previously stored one) do nothing and go to the next checkpoint.
(2) After checking all BTFD points, if there are no stored pass points, declare a frame error. Otherwise, read out the hard decision bits from the beginning to $n_{win} = n_{end}$.
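The bookkeeping done across the interrupt services can be sketched as a running minimum. This is an illustration only: it assumes each decoder run has already produced a CRC flag and a value of the metric $s(n_{end})$, and the function name and the example numbers are hypothetical.

```python
def select_btfd_winner(results, d):
    """Multi-pass BTFD selection.

    results: iterable of (n_end, crc_ok, s), one entry per decoder run.
    d:       threshold D on s(n_end).
    Returns the winning checkpoint, or None to signal a frame error.
    """
    best = None  # (n_end, s) of the best checkpoint stored so far
    for n_end, crc_ok, s in results:
        if not crc_ok or s > d:
            continue  # CRC fails or s(n_end) > D: do nothing
        if best is None or s < best[1]:
            best = (n_end, s)  # overwrite the stored location and metric
    return None if best is None else best[0]

# Hypothetical results for four checkpoints with D = 3.0
runs = [(39, False, 0.1), (81, True, 2.5), (103, True, 0.4), (148, True, 5.0)]
print(select_btfd_winner(runs, d=3.0))
```

Only one stored location and one stored metric are kept at any time, which is why the DSP load of this default solution stays small even though the decoder is run once per checkpoint.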

The two major evaluation factors concerning the DSP implementation are MIPS and decoding time. The MIPS are small due to the easy processing. The needed BTFD processing time (DSP configuration time, interrupt service latency, and decoder processing time) is quite significant due to the multiple configurations and runs. For the decoder, this excessive repeated decoding will also increase the power consumption. As the decoding time is on average about $16/2 = 8$ times the decoding time of a full run (based on average statistics), the power consumption will be roughly 8 times that of normal decoding. In conclusion, this implementation approach is simple and easy to control, but it is not efficient in terms of decoding time and decoder power consumption. This approach therefore serves mainly as a default and backup approach.

C.3. Hardware based one-pass solutions

We now look at a hardware based one-pass solution to improve the BTFD efficiency. That is, we will attempt to implement BTFD as an integral part of Viterbi decoder. For trace back memory saving, the Viterbi decoder is typically implemented with a truncated version (fixed window size trace back instead of full frame length trace back). Algorithm wise, the whole frame is partitioned into small blocks of the trace back window size. A normal block is decoded based on two blocks (the current and the next) of path metric computation and the trace back. The trace back is done from a maximum path metric state except for the last two blocks. The last two blocks are decoded by tracing back from zero state at once (decoder flushing). There are three possibilities for each BTFD check point: (1) the BTFD check point is a trace back point, (2) the BTFD check point is in the middle of a trace back window, (3) the BTFD check point is at the end of the frame. We have developed a one-pass zero state trace back solution.

Suppose the frame size is $L$, the trace back length is $L_{Tr}$ and the decoded portion after each trace back is $L_{Dec}$ (typically $L_{Dec} = L_{Tr}$ and we assume so in this paper). For BTFD checkpoint $n_{end}$, there are three possibilities: $n_{end} = k*L_{Tr}$, $n_{end} = k*L_{Tr} + M$, or $n_{end} = L$ for some integer $k$ and $M > 0$. As BTFD may trace back from a different state as compared to normal truncated Viterbi decoding, hardware context switching (saving registers, path metrics and so on) and path metric computation rollbacks will be needed. Proper resource management in the Viterbi decoder is the key to implementing these BTFD schemes. We need to add an extra buffer BTFD_temp of size $2*L_{Tr}$ bits to store the hard decision bits that come out of the zero state trace back. We also need a temporary buffer DEC_temp of $L_{Tr}$ bits to store some decoder output.
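The three checkpoint cases can be told apart with one integer division. The sketch below is illustrative: the function name, the returned labels and the optional `num_tail_bits` argument are hypothetical, and the end-of-frame case is taken as the frame size plus the tail bits.

```python
def classify_checkpoint(n_end: int, frame_len: int, tr_len: int, num_tail_bits: int = 0):
    """Classify a BTFD checkpoint against the trace back window grid.

    Returns (case, k, m) with n_end = k * tr_len + m:
      "I"   : checkpoint is a trace back point (m == 0)
      "II"  : checkpoint lies inside a trace back window (m > 0)
      "III" : checkpoint is the end of the frame
    """
    k, m = divmod(n_end, tr_len)
    if n_end >= frame_len + num_tail_bits:
        return "III", k, m
    return ("I" if m == 0 else "II"), k, m

# Hypothetical frame of 200 bits with a trace back length of 32
for n in (64, 70, 200):
    print(n, classify_checkpoint(n, frame_len=200, tr_len=32))
```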

For the first case with $n_{end} = k*L_{Tr}$, we have the following sub-algorithm.

Begin Sub-algorithm I:

(1) Trace back from the state with the maximum path metric at timing moment $n_{end}$ to decode the portion with time index $[(k-2)*L_{Tr}, (k-1)*L_{Tr}]$; this is normal Viterbi decoder operation. Buffer the hard decision bits into DEC_temp.

(2) Check if $\mathrm{Threshold}(n_{end}) \ge 0$.

(3) If the threshold check passed, flush the decoder from the zero state and decode hard decision bits with time index $[(k-2)*L_{Tr}, k*L_{Tr}]$. Please note that the portion $[(k-2)*L_{Tr}, (k-1)*L_{Tr}]$ gets re-decoded. Otherwise go to (5).

(4) Calculate and check the CRC. If passed, compare $\mathrm{Threshold}(n_{end})$ and $\mathrm{Threshold}(n_{win})$. If $\mathrm{Threshold}(n_{end}) \ge \mathrm{Threshold}(n_{win})$, then update $n_{win} = n_{end}$, set $\mathrm{Threshold}(n_{win}) = \mathrm{Threshold}(n_{end})$, and buffer the hard decision bits into BTFD_temp. Go to (5).

(5) Copy the hard decision bits from DEC_temp into the decoder output buffer to replace the corresponding portion with time index $[(k-2)*L_{Tr}, (k-1)*L_{Tr}]$. Continue the Viterbi decoder operation to decode the portion with time index $[(k-1)*L_{Tr}, k*L_{Tr}]$ again (trace back from the state with maximum path metric at time moment $(k+1)*L_{Tr}$ later).

End Sub-algorithm I

For the second case $n_{end} = k*L_{Tr} + M$, we have the following sub-algorithm.

Begin Sub-algorithm II:

(1) Trace back from the state with the maximum path metric at timing moment $(k+1)*L_{Tr}$ to decode the portion with time index $[(k-1)*L_{Tr}, k*L_{Tr}]$, and buffer the hard decision bits into DEC_temp.

(2) Check if $\mathrm{Threshold}(n_{end}) \ge 0$.

(3) If the threshold check passed, flush the decoder from the zero state and decode hard decision bits with time index $[(k-1)*L_{Tr}, n_{end}]$. Please note that the portion $[(k-1)*L_{Tr}, k*L_{Tr}]$ gets re-decoded. Otherwise go to (5).

(4) Calculate and check the CRC. If the CRC passed, compare $\mathrm{Threshold}(n_{end})$ with $\mathrm{Threshold}(n_{win})$. If $\mathrm{Threshold}(n_{end}) \ge \mathrm{Threshold}(n_{win})$, then update $n_{win} = n_{end}$, set $\mathrm{Threshold}(n_{win}) = \mathrm{Threshold}(n_{end})$, and buffer the hard decision bits into BTFD_temp. Go to (5).

(5) Copy the hard decision bits from DEC_temp into the decoder output buffer to replace the corresponding portion with time index $[(k-2)*L_{Tr}, (k-1)*L_{Tr}]$. Continue the Viterbi decoder operation to decode the portion with time index $[k*L_{Tr}, (k+1)*L_{Tr}]$ again (trace back from the state with maximum path metric at time moment $(k+2)*L_{Tr}$ later).

End Sub-algorithm II

The third BTFD case $n_{end} = L + \mathrm{NumTailBits}$ is straightforward in trace back, CRC check and path metrics examination. It is just the normal decoding operation as described in the following sub-algorithm.

Begin Sub-algorithm III:

(1) Flush the decoder from the zero state and decode hard decision bits with time index $[(k-1)*L_{Tr}, n_{end}]$. Note that no temporary buffer is needed in this case.

(2) Check the threshold $\mathrm{Threshold}(n_{end}) \ge 0$.

(3) If the threshold check passed, calculate and check the CRC. If passed, compare $\mathrm{Threshold}(n_{end})$ with $\mathrm{Threshold}(n_{win})$. Update $n_{win} = n_{end}$ and $\mathrm{Threshold}(n_{win}) = \mathrm{Threshold}(n_{end})$ if $\mathrm{Threshold}(n_{end}) \ge \mathrm{Threshold}(n_{win})$; otherwise, do nothing. The decoding and BTFD are now finished.

End Sub-algorithm III

We formally present the one-pass zero state trace back BTFD algorithm as follows:

Begin one-pass BTFD algorithm:

Step 1: Initialize $n_{win} = \mathrm{NULL}$ and $\mathrm{Threshold}(n_{win}) = 0$.

Step 2: For each BTFD checkpoint $n_{end}$ {
    If $n_{end} = k*L_{Tr}$ and $n_{end} < L + \mathrm{NumTailBits}$ {
        Execute Sub-algorithm I;
    } else if $n_{end} = k*L_{Tr} + M$ and $n_{end} < L + \mathrm{NumTailBits}$ {
        Execute Sub-algorithm II;
    } else if $n_{end} = L + \mathrm{NumTailBits}$ {
        Execute Sub-algorithm III;
    }
}

Step 3: If ($n_{win} \ne \mathrm{NULL}$) {
    Declare $n_{win}$;
    If ($n_{win} < L + \mathrm{NumTailBits}$) {
        Read out the hard bits stored in BTFD_temp for Sub-algorithm I and II;
        Overwrite the corresponding portion of the decoder output buffer.
    }
    Read out the hard decision bits of index $[0, n_{win}]$ from the decoder output buffer;
} else {
    Declare a bad frame;
}

End one-pass BTFD algorithm

Please note that the third step in the previous algorithm is data I/O and it can be easily implemented in hardware or DSP as an integral part of the Viterbi decoder.

C.4. SIP specific BTFD implementation issues

We now address the SIP (soft information processor, which contains both Viterbi and turbo decoders) BTFD implementation issues. Since the SIP is a very standard Viterbi decoder, these issues are general and apply to most Viterbi decoder designs. For SIP Viterbi decoding, the calculated path metric values are stored in the Path Metric Memory (PMM) block. There are eight banks (rows) in the PMM, each storing a number of path metric values depending on the code constraint. The path metric values are read out, normalized and presented to the butterfly structure to calculate the new path metric values, which are then written back. The decision bits, also calculated by the butterfly structure, are aggregated into bytes and written to the trace back memory. The winning state, corresponding to the maximum path metric, for each trellis row is delivered to the trace back processor as it becomes the starting point for the trace back process. The maximum path metric value is also used to compute the normalization factor. The trace back memory processor is used to implement the trace back operation. The Viterbi Traceback Processor (VTP) controls the writing of the decision bytes to the trace back memory every clock cycle. During the trace back processing, the VTP examines each trellis row to determine the biggest value to be used as a starting point for the trace back to follow. The trace back during the flush period will start from the zero state no matter what the last winning state is. Based on our algorithm, we need one decoder flushing from the zero state for each BTFD checkpoint.

The overall Viterbi decoder operation of SIP (same as most designs) is as follows.

To implement the hardware based one-pass BTFD algorithms, we need to modify the following blocks of the Viterbi decoder on top of the current design. Modify the control unit to add BTFD checkpoint operation configuration and control. The major functions that need to be added are: (a) storage and retrieval of all the BTFD check points, (b) configuration and control of decoder trace back and flushing at BTFD check points, (c) temporary buffering of hard decision bits, (d) CRC check, (e) threshold computation and comparison with the previously stored value, (f) other bookkeeping functions. Add read and write access to the two buffers DEC_temp and BTFD_temp. Registers to store the input BTFD check points and an additional register for the output $n_{win}$ are needed. Also, the interface between the SIP and the external DSP needs to be modified accordingly. We point out an important fact that hardware calculation of the CRC is available in the SIP. This makes our one-pass hardware solution much easier. The SIP Viterbi decoder control state-machine can be illustrated in the following diagram [4].

[Figure 2. SIP Viterbi decoding engine: control unit, data fetch, forward trellis processor, PMM, butterfly structure with normalization, normalization factor calculation, VTP, data packer.]

[Figure 3. Viterbi decoder controller state machine: start Viterbi, wait for sample, step sample, forward process, trace back, done.]

As we can see from our one-pass BTFD algorithms, we need double trace back for the first case with $n_{end} = k*L_{Tr}$ (one for normal decoding and one for decoder flushing). Control of the VTP, illustrated as follows, is therefore very important.

The input and output signals are listed in [4] with detailed descriptions. As a consequence, it is fairly simple to configure and to control the VTP in the current design for BTFD purposes. This enables us to have simple, one-pass BTFD solutions. Even in the dual trace back case, all that is needed is to configure and run the VTP twice. We point out in particular that no context switching or trellis re-processing is needed in our scheme. In this way, we aim for the simplest possible BTFD implementation.

We will not present the implementation details of the decoder interface and regular bookkeeping inside the decoder for BTFD. In addition to running the VTP and checking the CRC results, the main function of BTFD is threshold calculation and comparison. From previous discussions, we note the following equivalent form of threshold checking:

$\mathrm{Threshold}(n_{end}) = D^* \{a_{max}(n_{end}) - a_{min}(n_{end})\} - \{a_0(n_{end}) - a_{min}(n_{end})\}$, $\mathrm{Threshold}(n_{end}) \ge 0$.

This functional block can be implemented in an ASIC as follows.

The overall functional flow of our one-pass BTFD schemes can be illustrated as follows:

We can finally summarize the BTFD control state machine in the following diagram.

[Figure 4. SIP Viterbi VTP engine: VTP block with CLK, input signals and output signals.]

[Figure 5. BTFD threshold computation & comparison: path metrics α(max), α(min) and α(0) retrieved from the path metric memory, adders and a multiplier by D*, comparison against the previous threshold held by the control unit, update or bypass.]

[Figure 6. BTFD functional flow diagram: control unit; run Viterbi decoder to checkpoint; configure & run VTP; buffer bits to DEC_temp & BTFD_temp; check CRC (CRC pass / CRC fail); calc & compare threshold; store n(end) & threshold(n(end)); declare n(win) or a bad frame.]


In summary, the diagram in Figure 6 depicts the detailed functional flow of BTFD. The normal Viterbi decoder state machine is illustrated in Figure 3. We point out in particular that the approach we proposed here is an add-on approach. In the implementation sense, this is the “optimal” approach. It can be easily added to the SIP and to most commonly available decoder designs as well.

C.5. SIP specific BTFD implementation details

The previous discussions and diagrams give us guidelines for our “optimized” one-pass BTFD implementation. We now address some hardware implementation details in the SIP specific environment. Our efforts here should help guide the final implementation.

BTFD and hardware CRC calculation are only applicable for constraint length 9 Viterbi decoding in the SIP. The entire BTFD process is basically trying to figure out the “optimal” BTFD end position based on threshold and the CRC passing. The major tasks are: (1) configure and run the VTP engine for trace back, (2) buffer the VTP hard decision bits into temporary registers, (3) retrieve three path metrics (zero state, min and max) for threshold computation, (4) check for CRC passing, and (5) compare and find the optimal BTFD check point. We give further implementation details of each task as follows (without addressing the control functions).

The two runs of the VTP engine are straightforward tasks. We only need to configure it with a different trace back state and run it twice for BTFD. The VTP unit is connected to the data packer unit, which operates in a bitwise fashion; that is, the data packer unit writes the hard decision bits to the decoder output buffer bit by bit. For BTFD, we need to buffer these hard decision bits into DEC_temp (TrL*2 bits) and BTFD_temp (TrL bits) accordingly. We would implement BTFD_temp and DEC_temp as two SIP internal memory registers. The following pseudo RTL code describes this buffering.

Possible BTFD_temp and DEC_temp implementation (pseudo code):

[Figure 7. BTFD state machine. Labels: decoder interface; normal Viterbi decoding (Viterbi decoding control: start decoding, finish decoding); attached BTFD processing (BTFD control: start BTFD, finish BTFD).]

------------------------------------------------------------------------
-- Note, this is created in standard memory format. It should probably
-- be keyed to note the following pin connections:
--
-- Output pin        Input pin
-- on Data Packer:   on dec_temp:   Description:
-- ===============   ============   ==================================
-- clk61             CK             System Clock
-- dselout           CS             Select or enable a Write or Read
-- dwen              BW             Select which bit of the word to write
-- daddrout(1:0)     A              Select which of the 4 registers to use
-- ddataout          D              Value of the data bit to be written
--
-- Output pin
-- on dec_temp:
-- ============
-- Q                                Overwrite value to be used by output
--
------------------------------------------------------------------------

library IEEE;
use IEEE.NUMERIC_STD.all;
use IEEE.STD_LOGIC_1164.all;

-- entity declaration --
entity dec_temp is
  generic(
    N : NATURAL := 32; -- Width of the Register
    W : NATURAL := 4;  -- Depth of the Register
    M : NATURAL := 2   -- Width of the address bus for the RAM
  );
  port(
    A  : in  STD_LOGIC_VECTOR(M-1 downto 0) := (OTHERS => 'X'); -- Addr Bus
    D  : in  STD_LOGIC := 'X';                                  -- Data Input Bit
    BW : in  STD_LOGIC_VECTOR(N-1 downto 0) := (OTHERS => 'X'); -- Bit-Write Enable (Active-High)
    RW : in  STD_LOGIC := 'X'; -- Read/Write Enable (Active-High Read Enable)
    CS : in  STD_LOGIC := 'X'; -- Active-High Chip Select
    CK : in  STD_LOGIC := 'X'; -- Clock
    Q  : out STD_LOGIC_VECTOR(N-1 downto 0) := (OTHERS => 'X')  -- Data Output Bus
  );
end dec_temp;

-- architecture body --
architecture bhv of dec_temp is
begin
  process -- Write/Read RAM on positive clock edge if sampled CS select true.
    TYPE MEM_ADDR is array (W-1 downto 0) of STD_LOGIC_VECTOR(N-1 downto 0);
    VARIABLE mem  : MEM_ADDR;
    VARIABLE temp : STD_LOGIC_VECTOR(N-1 downto 0);
    VARIABLE CSRW : STD_LOGIC_VECTOR(1 downto 0);
    VARIABLE addr : NATURAL;
  begin
    wait until (CK'event and CK = '1');
    CSRW := CS & RW;
    case CSRW is
      when "11" => -- Read Cycle: Update Q with specified contents of the memory.
        Q <= To_X01( mem( TO_INTEGER( UNSIGNED(A) )));
      when "10" => -- Write those bits within the word for which BW is asserted.
        addr := TO_INTEGER( UNSIGNED(A) );
        temp := mem(addr);
        for i in 0 to N-1 loop
          if (BW(i) = '1') then
            temp(i) := D;
          end if;
        end loop;
        mem(addr) := temp;
      when others =>
        null;
    end case;
  end process;
end bhv;

Note: The BTFD_temp register will be the same as the DEC_temp register except that the value for ‘W’ increases from 4 to 8 and the value for ‘M’ increases from 2 to 3.
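To make the bit-write behavior of this register easier to check against the pseudo RTL, here is a small behavioral model in Python. It is a sketch for illustration only: the class name `BitWriteMemory` and the `clock` method are invented here, and the memory is assumed to start at zero, whereas the VHDL variable is uninitialized until written.

```python
class BitWriteMemory:
    """Behavioral model of a depth-by-width register file with per-bit write enables."""

    def __init__(self, width=32, depth=4):
        self.width = width
        self.words = [0] * depth   # assumption: cleared at start (VHDL starts undefined)
        self.q = 0                 # registered read output

    def clock(self, cs, rw, addr, d=0, bw=0):
        """Model one rising clock edge; cs, rw, d are 0 or 1, bw is a bit mask."""
        if not cs:
            return self.q                      # chip not selected: nothing changes
        if rw:
            self.q = self.words[addr]          # read cycle ("11" in the VHDL case)
        else:
            mask = bw & ((1 << self.width) - 1)
            if d:
                self.words[addr] |= mask       # write D = 1 into the enabled bits
            else:
                self.words[addr] &= ~mask      # write D = 0 into the enabled bits
        return self.q
```

With the defaults the model corresponds to DEC_temp (4 words of 32 bits); `BitWriteMemory(width=32, depth=8)` corresponds to the BTFD_temp variant described in the note.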

Possible threshold computation implementation (pseudo code):

For threshold computation, we have three internal registers (dpmzero, dpmmin and dpmmax) to hold the three corresponding path metrics. The following pseudo RTL code implements the threshold computation and comparison (that is, Figure 5).

--------------------------------------------------------------------

--
-- File          : thresh_cac.vhd
-- Related Files : sip_pkg.vhd -- required package file
--
--------------------------------------------------------------------
--
-- Revision History
-- Revision 1.1 2002/03/07 22:51:23 whsmith3
-- Initial revision.
--
--------------------------------------------------------------------
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;
USE ieee.std_logic_unsigned.ALL; -- needed for "+" on the std_logic_vector counter
LIBRARY work;
USE work.sip_pkg.ALL;            -- supplies t_metric and metric_width

entity thresh_cac is
  port (
    clk61      : in std_logic; -- SIP system clock
    reset_n    : in std_logic; -- Async reset; best used after each run
    run_thresh : in std_logic; -- Enable to allow computation
    dpmmin     : in t_metric;  -- Minimum path metric
    dpmmax     : in t_metric;  -- Maximum path metric
    dpmzero    : in t_metric;  -- Zero value path metric
    dpmcons    : in t_metric;  -- Computation constant, D
    nthresh    : buffer std_logic_vector(3 downto 0) -- selected value for n output
  );
end thresh_cac;

architecture rtl of thresh_cac is
  subtype t_metric2 is signed(metric_width+metric_width-1 downto 0);
  signal dpm_x_n            : t_metric;  -- max path metric minus min path metric
  signal dpm_z_n            : t_metric;  -- zero path metric minus min path metric
  signal dsub               : t_metric2; -- Constant D * dpm_x_n
  signal threshold          : t_metric2; -- Resulting computed threshold
  signal previous_threshold : t_metric2; -- Previously computed, maximum threshold
  signal ncount : std_logic_vector(3 downto 0); -- n count to be matched to the
                                                -- maximum threshold
begin

  process(clk61, reset_n)
  begin -- counter to assign the proper value of n to each threshold.
    if reset_n = '0' then
      ncount <= (others => '0');
    elsif rising_edge(clk61) then
      if run_thresh = '1' then
        ncount <= ncount + 1; -- was "ncount = ncount + 1", which is not an assignment
      end if;
    end if;
  end process;

  process(run_thresh, dpmmax, dpmmin, dpmzero, dpmcons, dpm_x_n, dsub, dpm_z_n)
  begin -- latch series that computes the new threshold.
    if run_thresh = '1' then
      dpm_x_n   <= dpmmax - dpmmin;
      dpm_z_n   <= dpmzero - dpmmin;
      dsub      <= dpmcons * dpm_x_n;
      threshold <= dsub - dpm_z_n;
    end if;
  end process;

  process(clk61, reset_n)
  begin -- Register that looks for the maximum threshold and assigns the
        -- corresponding n as an output from the circuit.
    if reset_n = '0' then
      previous_threshold <= (others => '0');
    elsif rising_edge(clk61) then
      if run_thresh = '1' and threshold > previous_threshold then
        previous_threshold <= threshold;
        nthresh            <= ncount;
      end if;
    end if;
  end process;

end rtl;
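The arithmetic in the pseudo RTL can be cross-checked with a short integer model in Python. This is only a sketch: the function names `threshold` and `select_n` are invented here, the metrics are treated as plain integers without fixed-point scaling or overflow, and `None` stands in for the output value that the hardware leaves unchanged when no threshold exceeds the reset value of zero.

```python
def threshold(dpmmax, dpmmin, dpmzero, dpmcons):
    """threshold = D * (max - min) - (zero - min), as in the latch series."""
    dpm_x_n = dpmmax - dpmmin      # max path metric minus min path metric
    dpm_z_n = dpmzero - dpmmin     # zero-state path metric minus min path metric
    return dpmcons * dpm_x_n - dpm_z_n


def select_n(metrics, dpmcons):
    """Track the largest threshold and the count n at which it occurred.

    metrics: sequence of (dpmmax, dpmmin, dpmzero), one entry per evaluation.
    Returns (nthresh, previous_threshold).
    """
    previous_threshold = 0         # reset value of the threshold register
    nthresh = None                 # assumption: no selection made yet
    for ncount, (dpmmax, dpmmin, dpmzero) in enumerate(metrics):
        t = threshold(dpmmax, dpmmin, dpmzero, dpmcons)
        if t > previous_threshold: # strict comparison, as in the RTL
            previous_threshold = t
            nthresh = ncount
    return nthresh, previous_threshold
```

For example, with D = 1 and the metric triples (100, 20, 90), (120, 10, 15) and (80, 30, 70), the three thresholds are 10, 105 and 10, so the model selects n = 1.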

As the CRC is computed in hardware, we have the following details.

CRC check description:

At the beginning of a computation, the internal CRC register is initialized to zero. Periodically throughout a decode, the dcrc_enb line is set high, enabling a comparison between the injected CRC registered value and the computed internal CRC value. The internal CRC value is computed serially: each incoming bit is taken in turn and generates the next value to be stored in the internal CRC register. Toward the end of the computation, the injected CRC registered value is externally updated. A final internal CRC computation is then run until the internal and external CRC values match. At that point, CRC_out is set high, indicating the end of decode. The CRC_out pin directly results in a decoder interrupt that is sent to the CPU, indicating a completed decoder analysis.
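The serial CRC update and the comparison against an injected value can be sketched in Python as follows. This is an illustrative model, not the SIP circuit: the function names are invented here, the 16-bit generator D^16 + D^12 + D^5 + 1 is used only as an example polynomial, and the bit ordering (most significant bit first) is an assumption. Only the zero initialization and the bit-serial update come from the description above.

```python
G_CRC16 = 0x1021  # D^16 + D^12 + D^5 + 1, with the D^16 term implicit (example choice)


def crc_serial(bits, poly=G_CRC16, width=16):
    """Shift bits through a CRC register that starts at zero, one bit per step."""
    reg = 0
    mask = (1 << width) - 1
    for bit in bits:
        feedback = ((reg >> (width - 1)) & 1) ^ (bit & 1)
        reg = (reg << 1) & mask
        if feedback:
            reg ^= poly
    return reg


def to_bits(value, width):
    """Expand an integer into a list of bits, most significant bit first."""
    return [(value >> i) & 1 for i in range(width - 1, -1, -1)]


def crc_matches(payload_bits, injected_crc, poly=G_CRC16, width=16):
    """Model of the comparison step: internal CRC versus the injected value."""
    return crc_serial(payload_bits, poly, width) == injected_crc
```

A useful property of this form is that running the register over the payload followed by its own CRC bits leaves the register at zero, so a hardware check can equivalently test for an all-zero register.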

Finally, we point out that the analysis of hardware implementation issues such as timing, synthesis and layout via RTL simulation can be completed after the final BTFD design. We assume the related control functions and the needed SIP interface modification tasks are straightforward and can be easily justified. It is clear that an extensive verification effort of the entire SIP module will be needed after our one-pass hardware BTFD implementation to finalize the design. This verification effort will support the design analysis presented here.

C.6. Conclusions

We have investigated and compared two BTFD solutions. They are straightforward and direct extensions of the commonly used Viterbi decoder designs. They can be designed as an integral part of the Viterbi decoder with a minor attached block or with some simple DSP processing.

Acknowledgement: We sincerely thank our colleagues Jin-Ghee Goh and Junchen Du of Agere for various discussions on BTFD issues.

References

[1] A. Viterbi and J. Omura, Principles of digital communication and coding, McGraw-Hill, 1979

[2] 3GPP TS 25.212 Technical Specification

[3] S. Xu and G. Ammer, BTFD Application Notes, Agere Systems

[4] Soft Information Processor (SIP) Specification, Bell Labs Australia


Author information

Gerhard Ammer received his Dipl.-Ing. in Electrical Engineering from the Technical University of Munich, Germany in 1982. He worked in several R&D positions for 2G/3G mobile communication, and is currently a manager for Agere Systems in Germany. E-mail: [email protected]

Vasic Dobrica received his B.Eng. and M.Eng. degrees from Belgrade University, Serbia, in 1982 and 1985 respectively. He works as a manager in NEC Australia in areas of mobile terminal R&D for 3G and beyond 3G. E-mail: [email protected]

Junchen Du received his BS EE from University of Science and Technology of China, MS EE from Shanghai Jiao Tong University, China, and Ph.D in EE from Polytechnic University, Brooklyn, NY. He works as a staff/Manager in the video firmware team at Qualcomm CDMA Technologies. E-mail: [email protected]

John Falkowski received his MS in Electrical Engineering from the University of Dayton, Ohio. He is currently a Senior Member of Technical Staff in Agere mobile terminal division working on 2G/3G chipset design. E-mail: [email protected]

Jan Meyer received his Dipl.-Ing. degree in EE Information Technology in 1995 from Munich University of Technology, Germany. He worked with Siemens Mobile Phones on optimization of GSM devices regarding channel decoding and equalization (1995-1998). He worked with Lucent Technologies / Agere Systems on UMTS standardization and handset chipset definition (1998-2001). He was with Interdigital Communications and worked on UMTS TDD and FDD, base and mobile station architectures (2001-2004). In 2004 he joined the German Patent and Trade Mark Office to become a Patent Examiner in the area of Data Communications. His research interests are centered on channel coding and equalization of the mobile radio channel. E-mail: [email protected]

Phong Nguyen received his double degree in B.EEE with honor in 1997 from Adelaide University, Australia and is currently working toward his M.SE. He has been working on various wireless digital system designs including short range connectivity, WLAN, 2G and 3G. Currently he is working for NECA Australia 3G Mobile R&D division as a Principal Engineer on beyond 3G communication systems. E-mail: [email protected]

William Smith received his BSEE from Michigan Technological University in Houghton and his MSEE from Illinois Institute of Technology in Chicago in 1981 and 1989 respectively. He is currently a Senior Engineer at Northrop Grumman in Baltimore working on signal processors, specializing in VHDL design and simulation. He is also an adjunct faculty member in electrical engineering at Anne Arundel Community College. E-mail: [email protected]

Stephen Spence was born and educated in Australia and has worked on various electronics equipment development projects in various companies. He is currently a project leader in NEC Australia at Melbourne responsible for 3G and beyond 3G mobile terminal chipset design and implementation. E-mail: [email protected]

Wayne Stark received his Ph.D. from the University of Illinois in 1982. He has been with the University of Michigan at Ann Arbor since then and is currently a professor of Electrical Engineering and Computer Science. He was an Associate Editor of the IEEE Transactions on Communications from 1985-1989. He received a National Science Foundation Presidential Young Investigator Award in 1985, was a member of the Board of Governors of the IEEE Information Theory Society from 1986-1988, and became an IEEE Fellow in 1998. He pursues research in a range of topics related to wireless communications including spread-spectrum modulation, error control coding, adaptive coded-modulation, fading, multiple-access, jamming, and handoff algorithms. E-mail: [email protected]

Koji Tanaka received his M.S. in EE from State University of New York at Buffalo and his Ph.D in EE from Virginia Polytechnic Institute and State University, Blacksburg, Virginia in 1977 and 1981 respectively. He was an assistant professor at Ohio University at Athens from 1981 to 1985 and an associate professor at Japan Defense Academy at Yokosuka from 1990 to 1995. He is currently an adjunct professor at Temple University, Philadelphia, PA. His twenty years of industrial experience includes AT&T Bell Laboratories/Lucent/Agere. His current research interests are in the area of Digital Signal Processing and its application to VLSI system design. E-mail: [email protected]


Haim Teicher received a B.Sc. (EE) from The City College of New York and an M.Sc. EE in Digital Communications from Columbia University in 1974 and 1976 respectively. He has worked as an R&D engineer on Telemetry and Communication Systems in the Armament Dept. of the Government of Israel, and as a system architect for Wireless IS-95, IS-2000, 3G and HSDPA in AT&T/Lucent and Motorola. He has also done research in automatic image recognition of VLSI mask defects in Applied Materials Israel. Currently, he is working on Ultra Wideband communications for Elbit Systems in Israel. E-mail: [email protected]

Qi Wang received his BS and MS in Telecommunication Engineering from Nanjing University of Posts and Telecommunications and his Ph.D in Electronic Engineering from the Australian National University in 1994, 1997 and 2000 respectively. He was a Research Lecturer at the ITU, University of South Australia in 2000. He has been working on the WCDMA chipset design in NEC Australia since 2001. E-mail: [email protected]

Shuzhan Xu received his BS in mathematics from Shandong Normal University, China, his Ph.D in mathematics from University of Alberta, Canada, and his MS in EE from University of Michigan, MI in 1986, 1995 and 1996 respectively. He has worked on various wireless digital modem and chipset design projects focusing on error correction decoder design and digital base band signal processing. He is a connectivity specialist working on digital audio and home network in Philips Development Center China. E-mail: [email protected]
