[IEEE 2010 Ieee Globecom Workshops - Miami, FL, USA (2010.12.6-2010.12.10)] 2010 IEEE Globecom Workshops - Agile encoder architectures for strength-adaptive long BCH codes

Agile Encoder Architectures for Strength-Adaptive Long BCH Codes

Raghunath Cherukuri, CodEPhy Inc

Richardson, Texas, 75082. Email: [email protected]

Abstract— Long Bose-Chaudhuri-Hocquenghem (BCH) codes are the error correction codes of choice for FLASH memory applications. In these applications, to cope with variations in noise mechanisms, the packet length and the error correction capability of the code often need to be changed (strength-adaptive BCH code). In this paper, we present a linear feedback shift register (LFSR) based architecture with a critical path bounded by log N (where N is the length of the codeword), independent of the strength of the code and without any penalty in latency, at a reasonable additional cost. The differentiating feature of the proposed architecture is the agility with which the strength can be changed at a very competitive cost.¹

I. INTRODUCTION

As of 2010, error correction codes are in wide use in almost all digital communications. This is due to the higher performance that the market demands for achieving reliable communications over noisy channels. BCH codes, a very important family of block codes that can be decoded using algebraic techniques with affordable complexity, have been in wide use for decades, especially in storage channels, in various forms such as Hamming codes or Reed-Solomon (RS) codes. BCH codes are also widely used in concatenated coding techniques, concatenated either with convolutional codes or with other block codes such as low-density parity-check (LDPC) codes. Second-generation Digital Video Broadcast (DVB) standards are adopting BCH codes as part of their concatenated coding strategy. Binary BCH codes have some advantages over RS codes, especially if the noise is random. The read channel in a multi-level cell (MLC) based FLASH memory behaves as a random-noise channel, and BCH is a favorite code for error correction in FLASH memory products.

It is well known in coding theory that the performance of a code improves as the message length k increases. This is one of the reasons for the trend in high-performance communication toward message lengths greater than, e.g., $2^{14}$, as in DVB-S2 [1] or in NAND FLASH memory [2]. It is also very common for the application to demand varying error correction capabilities [1]; the reason may be a change in the packet length or a change in the noise statistics. All this points to the need for, and the practice of, long BCH codes.

For a message of length k bits, a codeword of length n bits can be generated by appending (n − k) bits of parity to the message. For binary BCH codes these parity bits can correct up to t errors [3]. A BCH code is typically implemented in systematic form. In systematic form, a BCH codeword is obtained by dividing the message polynomial by the generator polynomial and then appending the remainder to the message to form the codeword polynomial. The generator polynomial is formed by taking the least common multiple (LCM) of all the minimal polynomials corresponding to the 2t roots. The division is typically implemented using an LFSR architecture. In a serial LFSR architecture, the feedback signal is required to drive all the XORs in the LFSR as part of the polynomial reduction built into the division operation. When the divisor polynomial g(x) is implemented in expanded form, the degree of g(x) can be very high, and the fan-out of the feedback signal is so large that the throughput is constrained by this load.

¹ Protected under patent law.

The rate R of a given code is defined as the ratio k/n. A code described as a high-rate code has a high value of R. This implies that the number of parity bits is small and the code is designed to correct only a few errors. The number of errors a code can correct can also be used to characterize a code. For example, if t is small the code is said to be weak, and if t is large the code is said to be strong. It can easily be seen that a strong code typically has more parity bits, and a weak code has fewer parity bits.

In an MLC NAND FLASH memory the cell is a floating-gate metal-oxide-semiconductor field-effect transistor (MOSFET) with a programmable threshold that is controlled through the floating gate. When the control voltage exceeds the programmed threshold voltage, the transistor conducts a current proportional to the threshold voltage. By measuring the current and through proper conversions, the threshold voltage can be inferred as one of P levels. If P=8, then there are 8 threshold levels and the cell can convey $\log_2(8) = 3$ bits/cell. On a typical MLC NAND FLASH die, there are millions of these MOSFETs arranged in a known fashion for read and write operations. Several error mechanisms alter the threshold voltages from their desired values and so work against the read or write operation of an MLC NAND cell: a) manufacturing tolerances, b) leakage of charge from one cell to another and c) manufacturing defects. The variation in threshold voltage changes across the die and also from die to die. Another important factor in the application of MLC NAND FLASH memories is their finite lifetime, which is due to the finite number of write and read operations that can be performed on a particular cell. Through wear-level management and other highly intelligent algorithms, the memory controller keeps track of the defects and of the number of operations performed on each cell in order to extend the life of the memory. Since the variation in threshold voltage is not uniform across the die or from die to die, a single ECC cannot meet the application demands. The strategy that incorporates multiple error correction codes based on the defect densities is termed an adaptive coding strategy. This work is not concerned with the design of such adaptive codes, but rather with the implementation of the encoder for a given code of this kind. There are many ways in which the notion of "adaptive behavior" can be incorporated. In our context of MLC NAND FLASH memory employing a BCH code as ECC, we say that a code is adaptive, rate-adaptive or strength-adaptive if, for a given k, n (and hence n − k) changes, which also means that t changes.

IEEE Globecom 2010 Workshop on Application of Communication Theory to Emerging Memory Technologies

978-1-4244-8865-0/10/$26.00 ©2010 IEEE

Despite the fan-out bottleneck of long BCH codes, applications demand higher performance. This should be achieved by eliminating or reducing the fan-out bottleneck, by processing more message samples in parallel, and by making the architecture agile and efficient in switching between packets and between values of t. This defines the problem and is our motivation for studying it. In the following, we review some recent and pertinent solutions to this problem that are available in the open literature.

Re-timing and loop unfolding can be used to reduce the fan-out bottleneck, as is done in [4]. Unfolding can be combined with re-timing to transform the LFSR to accept parallel input data [4]. Further speedup has been achieved in [5]. Both these techniques involve pre- and post-processing using a new polynomial p(x) that needs to be identified through exhaustive search based on clustered look-ahead computation. The overhead may be justified since the solution was targeted at throughputs of Gbps or higher. In [6], parallelism is achieved by simple loop unrolling. This solution did not address the fan-out issue, since the target applications require data rates in the 100 Mbps range. Parallelism can also be achieved using state-space design methods, as has been shown in the realization of a parallel cyclic redundancy check (CRC) [7]. This method involves pre-computation and storage of huge matrices. Some design examples using the method proposed in [7] and targeting MLC NAND FLASH memory applications can be found in [8], [9].

An approach that is similar to our solution is presented in [10]. There the given message is first divided by the minimal polynomials in parallel, and the resulting remainders are then combined in a weighted fashion based on the Chinese Remainder Theorem (CRT). The weights need to be pre-computed using the extended Euclidean algorithm for multiplicative inverses. Further, these weights need to be stored in memory. The area overhead is more than t times that of the LFSR corresponding to the g(x) polynomial. Since the division is done by the minimal polynomials, the fan-out is upper bounded by $m \approx \log_2 N$.

In our solution, instead of having the LFSRs corresponding to the minimal polynomials in parallel, we cascade them in series. This means that each downstream LFSR divides the quotient arriving from the LFSR immediately upstream of it. After processing all the n samples serially (or k samples, depending on how the LFSR is implemented), the remainders are weighted and combined. The weights are the minimal polynomials themselves, so there is no need to pre-compute or store any polynomials other than the minimal polynomials. The complexity of the combining network is at most that of another g(x). There is no penalty in latency and the fan-out is upper bounded by m. Parallelism can be applied either by loop unrolling or by unfolding as in [4]. Further improvements in fan-out at the minimal polynomial level can also be made using re-timing. Switching context is as easy as adding another LFSR stage.

The remainder of the paper is organized as follows: the problem and an existing solution based on CRT are presented in Section II, our solution is presented in Section III and some results are presented in Section IV. Conclusions and an outline of further work are presented in Section V.

II. BCH ENCODER ARCHITECTURES

A k-bit message $(u_0, u_1, \ldots, u_{k-1})$ can be protected by a BCH code against up to t random bit errors by adding up to m·t redundant bits to form an n-bit codeword $(c_0, c_1, \ldots, c_{n-1})$, where $n = 2^m - 1$. The binary message and code bits $u_i$ and $c_j$ are from GF(2) and form the coefficients of polynomials of degree (k − 1) and (n − 1) respectively. The systematic encoding is performed by

$$c(x) = u(x)\,x^{n-k} + r(x) \tag{1}$$

where

$$r(x) = \mathrm{Rem}\big(u(x)\,x^{n-k}\big)_{g(x)} \tag{2}$$

or

$$u(x)\,x^{n-k} = q(x)\,g(x) + r(x) \tag{3}$$

Here g(x) is the generator polynomial to be specified, r(x) is the remainder polynomial resulting from the polynomial division of $u(x)\,x^{n-k}$ by g(x), and q(x) is the quotient from the division. Let α be a primitive element of the extension field GF($2^m$) constructed using a primitive polynomial p(x). Then the generator polynomial g(x) is the lowest-degree polynomial over GF(2) that has $\alpha, \alpha^2, \alpha^3, \ldots, \alpha^{2t}$ as its roots [3]. Let $g_i$ be the minimal polynomial of $\alpha^i$. Since every even power of α is a conjugate of some preceding odd power of α and therefore shares its minimal polynomial, g(x) must be the LCM of all the odd-indexed minimal polynomials, which can be given as

$$g(x) = \mathrm{LCM}(g_1, g_3, g_5, \ldots, g_{2t-1}) \tag{4}$$
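The reduction in (4) to odd-indexed minimal polynomials can be checked numerically. The following is a minimal sketch, assuming the standard fact that the conjugates of $\alpha^i$ in GF($2^m$) are $\alpha^{2i}, \alpha^{4i}, \ldots$ with exponents taken modulo $2^m - 1$; the function names are hypothetical and are not taken from the paper.

```python
def cyclotomic_coset(i, m):
    """Exponents of the conjugates of alpha^i in GF(2^m): {i, 2i, 4i, ...} mod 2^m - 1."""
    n = (1 << m) - 1
    coset, e = [], i % n
    while e not in coset:
        coset.append(e)
        e = (2 * e) % n
    return sorted(coset)


def distinct_minimal_polynomial_indices(m, t):
    """Smallest exponent of each conjugacy class met by alpha^1 .. alpha^(2t)."""
    representatives = []
    for i in range(1, 2 * t + 1):
        rep = cyclotomic_coset(i, m)[0]
        if rep not in representatives:
            representatives.append(rep)
    return representatives


# GF(2^4) with t = 2, and GF(2^16) with t = 12 (the DVB-S2 field size).
print(distinct_minimal_polynomial_indices(4, 2))
print(distinct_minimal_polynomial_indices(16, 12))
# The size of a coset equals the degree of the corresponding minimal polynomial.
print(len(cyclotomic_coset(23, 16)))
```

Only odd indices survive in both cases, which is why t distinct minimal polynomials are enough to cover the 2t consecutive roots in these examples.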

To give an idea, the minimal polynomials for the BCH code used in DVB-S2/T2 are given in Table I. The weight of a polynomial is defined as the number of non-zero coefficients in the polynomial, and the weight of each minimal polynomial is shown in column 3. To correct t errors, the first t polynomials are multiplied to obtain g(x). In DVB-S2, with a packet length of n=64800, t can be 8, 10 or 12. The generator polynomials for these cases have (degree, weight) pairs of (128, 69), (160, 79) and (192, 85), respectively. In general, the weight of a minimal polynomial $g_i(x)$ is at most $\deg(g_i(x)) + 1$, so the fan-out of its feedback signal is bounded by $\deg(g_i(x))$. The architecture of a systematic BCH encoder using an LFSR is shown in Fig. 1. Here γ are the coefficients of the generator polynomial. The message is shifted into the shift register serially. On each clock the quotient is shifted out and discarded. At the end of k clock cycles, the state of the LFSR is the remainder, which is shifted out as the tail bits following the message. At each clock the feedback signal is generated and has to drive all the summing nodes to perform the polynomial reduction. The fan-out of this signal sets the clock period of this architecture.

Table I. BCH POLYNOMIALS FOR DVB-S2/T2 WITH N=64800

gi    minimal polynomial                                                         weight
g1    1 + x^2 + x^3 + x^5 + x^16                                                 5
g2    1 + x + x^4 + x^5 + x^6 + x^8 + x^16                                       7
g3    1 + x^2 + x^3 + x^4 + x^5 + x^7 + x^8 + x^9 + x^10 + x^11 + x^16           11
g4    1 + x^2 + x^4 + x^6 + x^9 + x^11 + x^12 + x^14 + x^16                      9
g5    1 + x + x^2 + x^3 + x^5 + x^8 + x^9 + x^10 + x^11 + x^12 + x^16            11
g6    1 + x^2 + x^4 + x^5 + x^7 + x^8 + x^9 + x^10 + x^12 + x^13 + x^14 + x^15 + x^16   13
g7    1 + x^2 + x^5 + x^6 + x^8 + x^9 + x^10 + x^11 + x^13 + x^15 + x^16         11
g8    1 + x + x^2 + x^5 + x^6 + x^8 + x^9 + x^12 + x^13 + x^14 + x^16            11
g9    1 + x^5 + x^7 + x^9 + x^10 + x^11 + x^16                                   7
g10   1 + x + x^2 + x^5 + x^7 + x^8 + x^10 + x^12 + x^13 + x^14 + x^16           11
g11   1 + x^2 + x^3 + x^5 + x^9 + x^11 + x^12 + x^13 + x^16                      9
g12   1 + x + x^5 + x^6 + x^7 + x^9 + x^11 + x^12 + x^16                         9

[Figure 1: chain of delay elements D with feedback taps γ_1, γ_2, ..., γ_(n-k-2), γ_(n-k-1); labels: message input, quotient, switch open for parity bits, switch closed for parity bits, codeword output.]

Figure 1. Serial BCH encoder architecture based on LFSR.
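The serial encoder of Fig. 1 can be modeled bit by bit in software. The sketch below is illustrative only: it assumes the small (15, 7) BCH code with t = 2, whose generator $x^8 + x^7 + x^6 + x^4 + 1$ is the product of $x^4 + x + 1$ and $x^4 + x^3 + x^2 + x + 1$, rather than the DVB-S2 code, and all function names are hypothetical. Polynomials are stored as Python integers with bit i holding the coefficient of $x^i$.

```python
def gf2_mod(a, g):
    """Remainder of a(x) divided by g(x) over GF(2)."""
    dg = g.bit_length() - 1
    while a.bit_length() - 1 >= dg:
        a ^= g << (a.bit_length() - 1 - dg)
    return a


def lfsr_encode(msg_bits, g):
    """Bit-serial systematic encoder; msg_bits are given highest-order coefficient first."""
    r = g.bit_length() - 1                  # number of delay elements, n - k
    mask = (1 << r) - 1
    taps = g & mask                         # coefficients gamma_0 .. gamma_(r-1)
    state = 0
    for b in msg_bits:
        feedback = b ^ (state >> (r - 1))   # quotient bit, discarded by the encoder
        state = (state << 1) & mask
        if feedback:
            state ^= taps                   # feedback drives every tap at once
    parity = [(state >> i) & 1 for i in range(r - 1, -1, -1)]
    return list(msg_bits) + parity, state


def bits_to_int(bits):
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


G_15_7 = 0b111010001                        # x^8 + x^7 + x^6 + x^4 + 1
message = [1, 0, 1, 1, 0, 0, 1]
codeword, parity_state = lfsr_encode(message, G_15_7)
fanout = bin(G_15_7).count("1") - 1         # taps driven by the feedback signal
print(codeword, fanout)
```

After k clocks the register holds the remainder of (2), so the codeword is divisible by g(x). The number of taps driven by the feedback bit grows with the weight of g(x), which is the load the text identifies as the speed limit.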

Encoder based on CRT

Since a BCH encoder based on CRT has some similarities with our proposed method, we review it in this section. For mathematical accuracy and details please refer to [10]. For every $g_i(x)$, $i = 1, 2, \ldots, t$, over GF(2), let $g'_i(x) = g(x)/g_i(x)$. Let $g''_i(x)$ be another polynomial such that

$$g''_i(x)\,g'_i(x) \equiv 1 \pmod{g_i(x)}$$

Then $g''_i(x)$ is the multiplicative inverse of $g'_i(x)$ modulo $g_i(x)$, and it can be obtained using the extended Euclidean algorithm. According to the CRT, the remainder in (2) can then be given as

$$\mathrm{Rem}\big(u(x)\,x^{n-k}\big)_{g(x)} = \sum_{i=1}^{t} g'_i(x)\,\mathrm{Rem}\big(g''_i(x)\,u(x)\,x^{n-k}\big)_{g_i(x)}$$

As is evident, the design phase requires pre-computing the $g'_i(x)$ and their multiplicative inverses and storing these polynomials, in addition to storing all the $g_i(x)$. The message is then pre-processed (multiplied using an LFSR) by each $g''_i(x)$ before the actual division. The division is carried out in parallel using LFSRs, and the remainders are combined (added) after being properly weighted (multiplied using an LFSR) by the $g'_i(x)$. The clock speed is constrained by the fan-out of the LFSR during division and is thus bounded by m.
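The CRT combination reviewed above can be checked on a small case. This is a minimal sketch under stated assumptions, not the implementation of [10]: it uses the (15, 7) BCH code with minimal polynomials $x^4 + x + 1$ and $x^4 + x^3 + x^2 + x + 1$, stores polynomials as integers (bit i is the coefficient of $x^i$), and all function names are hypothetical.

```python
def gf2_mul(a, b):
    """Carry-less product of two polynomials over GF(2)."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        b >>= 1
    return p


def gf2_divmod(a, g):
    """Quotient and remainder of a(x) / g(x) over GF(2); g must be non-zero."""
    q, dg = 0, g.bit_length() - 1
    while a.bit_length() - 1 >= dg:
        s = a.bit_length() - 1 - dg
        q ^= 1 << s
        a ^= g << s
    return q, a


def gf2_inverse(a, g):
    """Inverse of a(x) modulo g(x) by the extended Euclidean algorithm."""
    r0, r1 = g, gf2_divmod(a, g)[1]
    s0, s1 = 0, 1                       # invariant: s_i * a = r_i (mod g)
    while r1 != 1:
        if r1 == 0:
            raise ValueError("a(x) and g(x) are not coprime")
        q, r = gf2_divmod(r0, r1)
        r0, r1 = r1, r
        s0, s1 = s1, s0 ^ gf2_mul(q, s1)
    return gf2_divmod(s1, g)[1]


def crt_remainder(a, minimal_polys):
    """Remainder of a(x) modulo the product of minimal_polys, built branch by branch."""
    g = 1
    for gi in minimal_polys:
        g = gf2_mul(g, gi)
    total = 0
    for gi in minimal_polys:
        g_prime = gf2_divmod(g, gi)[0]              # g'_i = g / g_i
        g_dprime = gf2_inverse(g_prime, gi)         # g''_i, pre-computed in hardware
        ri = gf2_divmod(gf2_mul(g_dprime, a), gi)[1]
        total ^= gf2_mul(g_prime, ri)               # weight the branch and add
    return total, g


u = 0b1011001
remainder, g = crt_remainder(u << 8, [0b10011, 0b11111])
print(bin(remainder), bin(g))
```

Each branch needs its own pair of stored polynomials $g'_i$ and $g''_i$, which is the design-phase overhead discussed in the text.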

There are several disadvantages with the encoder based on CRT. The pre-multiplication increases the data length by the degree of $g''_i(x)$, which can be at least m. The divider needs to process these extra bits, and thus the latency increases. Clocking issues are not simple but can be worked out. The hardware complexity is at least t times m·t, since there are t parallel branches. A point to make here is that the architecture is input bit-serial, although the division operations by the minimal polynomials run in parallel, so the throughput is that of any serial LFSR architecture. One advantage worth mentioning is that strength adaptation is simple, but only if the relevant polynomials are pre-computed during the design phase.

In the next section, we propose an architecture that eliminates all these disadvantages and keeps the advantages.

III. PROPOSED ENCODER ARCHITECTURE BASED ON WEIGHTED SUM OF REMAINDERS

In this section we present our architecture. Let us define a sequence of t divisions as follows

$$\begin{aligned}
u(x)\,x^{n-k} &= q_1(x)\,g_1(x) + r_1(x) \\
q_1(x) &= q_2(x)\,g_2(x) + r_2(x) \\
q_2(x) &= q_3(x)\,g_3(x) + r_3(x) \\
&\;\;\vdots \\
q_{t-1}(x) &= q_t(x)\,g_t(x) + r_t(x)
\end{aligned} \tag{5}$$

It can be shown in a straightforward manner that the desired r(x) can be obtained by summing the weighted remainders $r_1(x)$ through $r_t(x)$ as given below (dropping x for clarity)

$$r = \big(\cdots\big((r_t\,g_{t-1} + r_{t-1})\,g_{t-2} + r_{t-2}\big)\cdots\big)\,g_1 + r_1 \tag{6}$$
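The step from (5) to (6) is a back-substitution. As a sketch for t = 3, assuming that every $g_i$ has degree m and that g is the product $g_1 g_2 g_3$ (which holds when the minimal polynomials are distinct):

```latex
\begin{aligned}
u(x)\,x^{n-k} &= q_1 g_1 + r_1 \\
              &= (q_2 g_2 + r_2)\,g_1 + r_1 \\
              &= \big((q_3 g_3 + r_3)\,g_2 + r_2\big)\,g_1 + r_1 \\
              &= q_3\,g_1 g_2 g_3 + \underbrace{(r_3 g_2 + r_2)\,g_1 + r_1}_{r}
\end{aligned}
```

Since each $r_i$ has degree below m, the bracketed term has degree below 3m, which is the degree of g. It is therefore the unique remainder of $u(x)\,x^{n-k}$ on division by g, and $q_3$ is the corresponding quotient. The same argument extends to any t.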

In Fig. 2 we illustrate the architecture as a block diagram. The weighting and combining network shown in Fig. 2 and defined in (6) is expanded in Fig. 3. Through a proper clocking schedule, the remainders are shifted out, multiplied by the minimal polynomials and added bitwise. All the multipliers are enabled simultaneously. It is worth recalling that the length of the output sequence of a multiplier is upper bounded by the sum of the length of the multiplier polynomial and the length of the input sequence. This means that although the multipliers are all enabled at the same time, they work for different (progressively longer) durations. Multiplier $g_{t-1}$ is clocked during the first m clocks of the (n − k) parity bits. Multiplier $g_{t-2}$ is clocked for 2·m clock cycles, and so on.


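[Figure 2: block diagram. Labels: message input; cascaded dividers g1, g2, ..., gt-1, gt; quotients q1, q2, ..., qt-1, qt; remainders r1, r2, ..., rt-1, rt feeding the weighting and combining block; codeword output.]

Figure 2. Proposed BCH encoder architecture based on cascaded LFSR division and weighted remainders.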

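[Figure 3: chain of multipliers and adders. Labels: rt multiplied by gt-1 and added to rt-1; multiplied by gt-2 and added to rt-2; and so on through the multiplier for g1 and the addition of r1; codeword output.]

Figure 3. Weighting and Combining Network.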

All the remainders are clocked during the first set of m clocks and added bitwise to the result of the corresponding multiplication operation (the first bit that is shifted out from the LFSRs is the MSB).

This process continues with the other multipliers until all the (n − k) bits are shifted out. It can be noted here that each register has length m. The remainders are available in the divisor registers. Multiplication can also be done using an LFSR architecture. The order of each multiplier is again m. There are (t − 1) multipliers in the combiner. This is the overhead of the proposed architecture. Here the throughput is not constrained by the fan-out of a feedback signal, since the multipliers have no feedback; they therefore work faster than the dividers in the architecture and are not a bottleneck.
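The divisions in (5) and the combination in (6) can be checked at the polynomial level. This is a minimal sketch that ignores the clocking schedule entirely: it assumes the first three minimal polynomials of Table I as stages, stores polynomials as integers (bit i is the coefficient of $x^i$), and uses hypothetical function names.

```python
def gf2_mul(a, b):
    """Carry-less product of two polynomials over GF(2)."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        b >>= 1
    return p


def gf2_divmod(a, g):
    """Quotient and remainder of a(x) / g(x) over GF(2)."""
    q, dg = 0, g.bit_length() - 1
    while a.bit_length() - 1 >= dg:
        s = a.bit_length() - 1 - dg
        q ^= 1 << s
        a ^= g << s
    return q, a


def poly(*exponents):
    """Build a polynomial from the exponents of its non-zero terms."""
    return sum(1 << e for e in exponents)


def cascaded_remainder(a, stages):
    """stages = [g_1, ..., g_t]; returns r(x) assembled from (5) and (6)."""
    remainders, q = [], a
    for gi in stages:                        # the chain of divisions in (5)
        q, ri = gf2_divmod(q, gi)
        remainders.append(ri)
    r = remainders[-1]                       # start from r_t
    for gi, ri in zip(reversed(stages[:-1]), reversed(remainders[:-1])):
        r = gf2_mul(r, gi) ^ ri              # one step of (6): r <- r * g_i + r_i
    return r


G1 = poly(0, 2, 3, 5, 16)
G2 = poly(0, 1, 4, 5, 6, 8, 16)
G3 = poly(0, 2, 3, 4, 5, 7, 8, 9, 10, 11, 16)
STAGES = [G1, G2, G3]
G = gf2_mul(gf2_mul(G1, G2), G3)             # expanded generator for t = 3

u = 0xABCDE12345                             # arbitrary message bits
a = u << 48                                  # u(x) * x^(n-k) with n - k = 3 * 16
print(cascaded_remainder(a, STAGES) == gf2_divmod(a, G)[1])
```

Only the short polynomials in `STAGES` are needed by the cascade; the expanded `G` is built here solely to check the result against direct long division.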

In the long-division architecture, when t is changed the entire g(x) polynomial changes, since it is the product of a different number of the minimal polynomials in Table I. So in the DVB-S2 example, three such long polynomials need to be stored in a RAM, and each one is selected through multiplexers. If the packet length also changes, as happens in DVB, then the overhead of the storage and selection network increases. This is not so in the proposed architecture. If t changes to t+1, another short LFSR is cascaded in both the division and the combining networks. The clock frequency remains the same in our architecture. We do not claim that (6) is novel; we merely claim novelty in the application of (6) as an encoder for the BCH family of codes with adaptive strength capability while simultaneously reducing the fan-out bottleneck.
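The storage argument can be made concrete with the polynomials of Table I. The sketch below is illustrative: it expands g(x) for t = 8, 10 and 12 and compares the number of coefficient bits a long-division encoder would keep with the number the cascade keeps, counting one bit per coefficient (an assumption of this sketch, not a figure from the paper). The names are hypothetical.

```python
# Exponents of the non-zero terms of g1 .. g12, transcribed from Table I.
TABLE_I = [
    (0, 2, 3, 5, 16),
    (0, 1, 4, 5, 6, 8, 16),
    (0, 2, 3, 4, 5, 7, 8, 9, 10, 11, 16),
    (0, 2, 4, 6, 9, 11, 12, 14, 16),
    (0, 1, 2, 3, 5, 8, 9, 10, 11, 12, 16),
    (0, 2, 4, 5, 7, 8, 9, 10, 12, 13, 14, 15, 16),
    (0, 2, 5, 6, 8, 9, 10, 11, 13, 15, 16),
    (0, 1, 2, 5, 6, 8, 9, 12, 13, 14, 16),
    (0, 5, 7, 9, 10, 11, 16),
    (0, 1, 2, 5, 7, 8, 10, 12, 13, 14, 16),
    (0, 2, 3, 5, 9, 11, 12, 13, 16),
    (0, 1, 5, 6, 7, 9, 11, 12, 16),
]
MINIMAL = [sum(1 << e for e in exps) for exps in TABLE_I]


def gf2_mul(a, b):
    """Carry-less product of two polynomials over GF(2)."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        b >>= 1
    return p


def generator(t):
    """Expanded g(x): product of the first t minimal polynomials."""
    g = 1
    for gi in MINIMAL[:t]:
        g = gf2_mul(g, gi)
    return g


def degree(p):
    return p.bit_length() - 1


def weight(p):
    return bin(p).count("1")


strengths = (8, 10, 12)
generators = {t: generator(t) for t in strengths}
for t, g in generators.items():
    print(t, degree(g), weight(g))          # compare with the pairs quoted in Section II

long_division_bits = sum(degree(g) + 1 for g in generators.values())
cascade_bits = sum(degree(gi) + 1 for gi in MINIMAL)
print(long_division_bits, cascade_bits)
```

Moving from t = 8 to t = 10 or 12 replaces the whole expanded polynomial in the long-division case, while the cascade only enables further entries of the same `MINIMAL` list.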

A design issue for the proposed encoder architecture is the effect of carry propagation that results from simultaneous clocking of the division and multiplication operations using the LFSRs. It appears that the fan-out problem has been translated into a carry propagation problem. There are many excellent techniques available to solve the carry propagation problem (such as carry look-ahead) and hence to improve the clock speed of the architecture.

IV. IMPLEMENTATION ISSUES

In this section, we discuss some implementation issues and give some design examples.

A. Implementation

As illustrated in Fig. 2, t LFSRs of size m (m delay elements) are cascaded in series. They all function according to the architecture shown in Fig. 1. The message enters LFSR1 at the summing node after the mth delay element, and at the same time is sent to the channel as the information bits of the systematic code. The quotient of LFSR1 enters LFSR2 at the same point but, unlike the message, it is not sent any further. Similar signal routing is used for all the remaining quotients. No special clocking is needed, and all the LFSRs work on the same input clock. On every clock, a division step is performed in all the LFSRs, and at the end of the k clock cycles the remainders are available in all t LFSRs. The number of XORs in each LFSR depends on the weight of the minimal polynomial and can be at most m, for a total of m·t XORs for this stage. The input is bit-serial and the latency is thus k clock cycles. The parity bits start going out to the channel from the (k + 1)th clock cycle onwards, from the combiner block. The throughput is bounded by the m XORs present in the feedback path of the dividers.
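A bit-level model helps to see how the stages share one clock. The sketch below is not the exact arrangement described above: it assumes the n-clock variant mentioned in the introduction, in which each LFSR accepts its input at the low-order end and the message is followed by n − k zeros, so that the register contents match the divisions in (5) term by term. The k-clock arrangement with the input after the mth delay element is not modeled. Stages are the first three polynomials of Table I and all names are hypothetical.

```python
def gf2_divmod(a, g):
    """Quotient and remainder of a(x) / g(x) over GF(2); bit i is the coefficient of x^i."""
    q, dg = 0, g.bit_length() - 1
    while a.bit_length() - 1 >= dg:
        s = a.bit_length() - 1 - dg
        q ^= 1 << s
        a ^= g << s
    return q, a


def poly(*exponents):
    return sum(1 << e for e in exponents)


def cascade_lfsr(bits, stages, m):
    """Clock every stage once per input bit (highest-order bit first).

    Each stage is a divider LFSR of length m; the bit shifted out of stage i
    (its quotient bit) is the input of stage i + 1 on the same clock.
    """
    mask = (1 << m) - 1
    states = [0] * len(stages)
    for b in bits:
        x = b
        for i, g in enumerate(stages):
            fb = (states[i] >> (m - 1)) & 1          # quotient bit of stage i
            states[i] = ((states[i] << 1) & mask) | x
            if fb:
                states[i] ^= g & mask                # reduction by g_i
            x = fb
    return states


M = 16
STAGES = [
    poly(0, 2, 3, 5, 16),
    poly(0, 1, 4, 5, 6, 8, 16),
    poly(0, 2, 3, 4, 5, 7, 8, 9, 10, 11, 16),
]
u, k = 0b10110011100011110000, 20                    # arbitrary 20-bit message
n = k + M * len(STAGES)
a = u << (M * len(STAGES))                           # message followed by n - k zeros
bits = [(a >> i) & 1 for i in range(n - 1, -1, -1)]
lfsr_states = cascade_lfsr(bits, STAGES, M)

expected, q = [], a                                  # r_1 .. r_t from (5)
for g in STAGES:
    q, r = gf2_divmod(q, g)
    expected.append(r)
print(lfsr_states == expected)
```

In this variant every stage's input comes from a register output of the previous stage, so the per-stage feedback load stays at the weight of a single minimal polynomial regardless of t.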

Stage 2 of the encoding, performed in the combining network, is depicted in Fig. 3 and described in (6). There are (t − 1) LFSR-based multipliers and (t − 1) XORs for bitwise GF addition, for a total of (m + 1)·(t − 1) XORs. Since the multipliers are not a speed bottleneck, they can operate on the same clock as the dividers. As mentioned in the previous section, only one remainder (the corresponding LFSR) is enabled and clocked out at a time for m clock cycles, for a total of m·t clock cycles corresponding to the m·t parity bits. Each multiplier works for 2·m clock cycles with an overlap of m clocks.

The total complexity in terms of XORs is m·t + (t − 1)·m + (t − 1), which is approximately 2·m·t (it equals 2·m·t − m + t − 1, and so stays below 2·m·t whenever t ≤ m + 1). For comparison, the count of XORs for the CRT-based encoder is on the order of $t^2 m + 2mt + t$. Clearly our solution is more hardware efficient, offering the same throughput without any overhead during the design phase.

B. Design Examples

Example 1: m=11, k=1926, n=2047 and t=11. Fan-out is upper bounded by 11. Count of XORs is 242 and the count for the CRT-based architecture is 1595.

Example 2: m=13, k=7684, n=8191 and t=39. Fan-out is upper bounded by 13. Count of XORs is 1014 vs. 20865.

Example 3: DVB-S2/T2, m=16, k=51648, n=51840 (shortened from $2^{16} - 1 = 65535$) and t=12. Fan-out is upper bounded by 16 and count of XORs is 384 vs. 2506.
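The arithmetic behind the three examples can be reproduced with a few lines. This sketch evaluates the XOR expressions of Section IV-A as they are printed; the function names are hypothetical. The examples quote the 2·m·t figure for the proposed encoder, and the CRT expression evaluated as printed comes out near, but not exactly at, the CRT figures quoted above, which may include terms that are not spelled out in the text.

```python
def xor_count_proposed(m, t):
    """Dividers (m*t) plus multipliers ((t-1)*m) plus adders (t-1)."""
    return m * t + (t - 1) * m + (t - 1)


def xor_estimate_proposed(m, t):
    """The 2*m*t figure quoted in the design examples."""
    return 2 * m * t


def xor_count_crt_as_printed(m, t):
    """CRT-based count, evaluated exactly as the expression appears in the text."""
    return t * t * m + 2 * m * t + t


EXAMPLES = {
    "Example 1": (11, 1926, 2047, 11),    # (m, k, n, t)
    "Example 2": (13, 7684, 8191, 39),
    "Example 3": (16, 51648, 51840, 12),
}

for name, (m, k, n, t) in EXAMPLES.items():
    parity_bits = n - k                   # equals m * t for all three examples
    print(name, parity_bits,
          xor_count_proposed(m, t),
          xor_estimate_proposed(m, t),
          xor_count_crt_as_printed(m, t))
```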

V. CONCLUSIONS

A new LFSR-based encoder architecture for BCH codes has been presented. The architecture is a cascade of LFSRs, each LFSR representing the minimal polynomial corresponding to a root that defines the BCH code. The load on the feedback signal of each LFSR is the weight of that minimal polynomial, and the worst case is bounded by $\deg(g_i)$. This is much better than the load of the expanded version of the generator polynomial, which can have a degree of $\deg(g_1) + \deg(g_2) + \cdots + \deg(g_t)$. The overhead of the proposed method is the multiplication and addition network that combines the weighted remainders. It is shown that the complexity of such a network is less than that of the LFSR for g(x). There is no penalty in latency with the proposed architecture. There is no need for pre-computation, no need to store lengthy polynomials and no need for an exhaustive search for special modifying polynomials. Parallelism can be incorporated using loop unfolding techniques. Another benefit is the agility with which the correction capability can be modified. Compared with a long LFSR implementation, the proposed architecture reduces the critical path by a factor of t with an area overhead of 2x while preserving the latency. The proposed encoder architecture can also be applied to non-binary codes, which is currently under investigation. The speed advantage that can be gained from a parallel implementation of the proposed encoder architecture is also under investigation.

REFERENCES

[1] P. Urard, L. Paumier, V. Heinrich, N. Raina, and N. Chawla, “A 360mW 105Mb/s DVB-S2 compliant codec based on 64800b LDPC and BCH codes enabling satellite-transmission portable devices,” IEEE Solid-State Circuits Conference, 2008, pp. 310–311, Feb. 2008.

[2] C. Trinh et al., “A 5.6MB/s 64Gb 4b/Cell NAND Flash memory in 43nm CMOS,” IEEE Solid-State Circuits Conference, 2009, pp. 246–247, Feb. 2009.

[3] S. Lin and D. J. Costello, “Error Control Coding,” Prentice Hall, pp. 194–195, 2004.

[4] K. K. Parhi, “Eliminating the fanout bottleneck in parallel long BCH encoders,” IEEE Trans. Circuits Syst. I, vol. 51, no. 3, pp. 512–516, Mar. 2004.

[5] X. Zhang and K. K. Parhi, “High-speed architectures for parallel long BCH encoders,” IEEE Trans. VLSI Syst., vol. 13, no. 7, pp. 872–877, Jul. 2005.

[6] R. Micheloni et al., “A 4Gb 2b/cell NAND flash memory with embedded 5b BCH ECC for 36MB/s system read throughput,” IEEE Solid-State Circuits Conference, 2006, pp. 497–506, Feb. 2006.

[7] G. Campobello, G. Patane, and M. Russo, “Parallel CRC realization,” IEEE Transactions on Computers, vol. 52, no. 10, pp. 1312–1319, Oct. 2003.

[8] Z. Jun, W. Zhi-Gong, H. Qing-Sheng, and X. Jie, “Optimized design for high-speed parallel BCH encoder,” IEEE International Workshop on VLSI and Video Tech, pp. 97–100, May 2005.

[9] W. Lui, J. Rho, and W. Sung, “Low-power high-throughput BCH error correction VLSI design for multi-level cell NAND flash memories,” IEEE Workshop on Signal Processing Systems Design and Implementation, pp. 303–308, Oct. 2006.

[10] H. Chen, “CRT-Based high-speed parallel architecture for long BCH encoding,” IEEE Trans. Circuits Syst. II, vol. 56, no. 8, pp. 684–686, Aug. 2009.

