Eindhoven University of Technology MASTER VLSI design of a ... · VLSI design of a Reed-Solomon decoder for gigabit automotive ethernet Xue, B. Award date: 2016 Link to publication

Eindhoven University of Technology

MASTER

VLSI design of a Reed-Solomon decoder for gigabit automotive ethernet

Xue, B.

Award date:2016

Link to publication

DisclaimerThis document contains a student thesis (bachelor's or master's), as authored by a student at Eindhoven University of Technology. Studenttheses are made available in the TU/e repository upon obtaining the required degree. The grade received is not published on the documentas presented in the repository. The required complexity or quality of research of student theses may vary by program, and the requiredminimum study period may vary in duration.

General rightsCopyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright ownersand it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

• Users may download and print one copy of any publication from the public portal for the purpose of private study or research. • You may not further distribute the material or use it for any profit-making activity or commercial gain

https://research.tue.nl/en/studentthesis/vlsi-design-of-a-reedsolomon-decoder-for-gigabit-automotive-ethernet(1ea6e335-2562-43c6-82bd-1a3221b2b23e).html

VLSI Design of A Reed-SolomonDecoder for Gigabit Automotive

Ethernet

Master Thesis

Bing Xue

Department of Mathematics and Computer ScienceSystem Architecture and Networking Research Group

Supervisors:dr.ir. Ozgun Paker

prof.dr.ir. C.H. (Kees) van Berkelprof.dr.ir. M.C.W. (Marc) Geilen

Eindhoven, September 2016

Abstract

The car becomes smarter in today’s world and more advanced electronics is being used in-vehicle.Gigabit Automotive Ethernet is developed to provide fast bandwidth for all kinds of applicationsand connect different functional components in the car. The Reed-Solomon coding is a powerfulforward error correction technique used in Gigabit Automotive Ethernet to compact channel noiseduring data transmission. The Gigabit Automotive Ethernet uses 9-bit based shortened (450, 406)codes. Few study can be found in literature to decoding 9-bit RS codes. This thesis exploredthe design space of RS decoder and designed two RS algebraic decoder based on inversionlessBerlekamp-Massey algorithm. The decoder was designed in CMOS 40nm process and achieves a1.125 Gbps throughput. A low latency is achieved to decode RS codeword and the decoder canreport decoding failure if the received codeword is not correctable.

VLSI Design of A Reed-Solomon Decoder for Gigabit Automotive Ethernet iii

Preface

This master project is conducted in the High Speed Serial Interfaces group of NXP SemiconductorB.V in Eindhoven. Hereby I sincerely want to give thanks to my school supervisor C.H. (Kees) vanBerkel Without him, neither would I get to know the splendid world of VLSI nor take this projectas my graduation project. During the project, he was helpful and always gave me proper advicewhich avoided many detours. Meanwhile, I am wholeheartedly grateful to Ozgun Paker’s devotionin the guidance to me. Without him, this project wouldnt go smoothly. He never hesitated tohelp me when I had doubts and confusion. He encouraged me when I got depressed and made meconfident with my work. Finally, I would also like to thanks M.C.W. (Marc) Geilen for reviewingmy work.

VLSI Design of A Reed-Solomon Decoder for Gigabit Automotive Ethernet v

Contents

Contents vii

List of Figures ix

List of Tables xi

1 Introduction 11.1 Thesis Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Reed-Solomon Codes and Their Application for Automotive . . . . . . . . . . . . . 21.3 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.4 Scope of the Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.5 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Preliminaries 52.1 Galois Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.1.1 Construction of Galois Field . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.1.2 Representation of Galois Field . . . . . . . . . . . . . . . . . . . . . . . . . 62.1.3 Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.1.4 Inversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.1.5 Galois Field Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.2 Encoding of Reed-Solomon Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82.3 Decoding Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3 Alternative Designs and Final Algorithm 113.1 Decoder Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3.1.1 Algebraic Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123.1.2 Transform Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.2 Design Space Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133.2.1 Syndrome Computation for RS Decoding . . . . . . . . . . . . . . . . . . . 143.2.2 Key Equation Solver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143.2.3 Error Correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.3 Design Decisions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203.3.1 Deign Decision on Syndrome Computation . . . . . . . . . . . . . . . . . . 203.3.2 Comparison of Transform Decoder and Algebraic Decoder . . . . . . . . . . 203.3.3 Comparison of Key Equation Solvers . . . . . . . . . . . . . . . . . . . . . . 21

4 Reed-Solomon Decoder Implementation 254.1 Syndrome Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254.2 Key Equation Solver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

4.2.1 Parallel iBM architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264.2.2 Dual-line iBM architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.3 Chien Search and Forney . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

VLSI Design of A Reed-Solomon Decoder for Gigabit Automotive Ethernet vii

CONTENTS

5 Results and Analysis 315.1 Synthesis Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315.2 Power Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335.3 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

6 Conclusions 37

Bibliography 39

Acronyms 41

viii VLSI Design of A Reed-Solomon Decoder for Gigabit Automotive Ethernet

List of Figures

2.1 Systematic RS Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.2 Encoding Circuit for (450, 406) RS Code . . . . . . . . . . . . . . . . . . . . . . . . 102.3 Decoding Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.1 An overall block diagram for algebraic RS decoder . . . . . . . . . . . . . . . . . . 123.2 An overall block diagram for the transform RS decoder . . . . . . . . . . . . . . . 133.3 The Block Diagram for Chien Search . . . . . . . . . . . . . . . . . . . . . . . . . 183.4 The Block Diagram for Forney Algorithm . . . . . . . . . . . . . . . . . . . . . . . 193.5 Remaining Error Transform Block . . . . . . . . . . . . . . . . . . . . . . . . . . . 193.6 Inverse Error Transform Block . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203.7 (a) Parallel to serial converter (P2S) (b) Serial to parallel converter (S2P) . . . . . 22

4.1 Syndrome Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254.2 Parallel Architecture for inversionless Berlekamp-Massey Algorithm . . . . . . . . . 264.3 Dual-line Architecture for inversionless Berlekamp-Massey Algorithm . . . . . . . . 274.4 Chien Search for shortened RS code . . . . . . . . . . . . . . . . . . . . . . . . . . 294.5 Decoding Timeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

5.1 Layout of Dual-line iBM RS Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . 325.2 Power distribution for parallel iBM RS decoder . . . . . . . . . . . . . . . . . . . . 335.3 Power distribution for dual-line iBM RS decoder . . . . . . . . . . . . . . . . . . . 345.4 Power consumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

VLSI Design of A Reed-Solomon Decoder for Gigabit Automotive Ethernet ix

List of Tables

1.1 Modifying techniques. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

3.1 Comparison of Transform Decoder and Algebraic Decoder . . . . . . . . . . . . . . . . 213.2 Hardware Complexity, Critical Path Delay and Latency of (n, k) RS KES Block . 223.3 Estimated Delay of Basic Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . 233.4 Estimated Complexity, Latency and Critical Path Dealy for (450, 406) RS Decoder 23

5.1 Cell Area of RS Decoder for Difference Standard Cells and Time Constraint . . . . 315.2 Selected Netlist for Placement and Routing . . . . . . . . . . . . . . . . . . . . . . 325.3 Summary of RS Decoder after Placement and Routing . . . . . . . . . . . . . . . . 335.4 Power Consumption of two RS Decoders . . . . . . . . . . . . . . . . . . . . . . . . 355.5 Calculated Gate Count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355.6 Comparison Results of RS Decoder Designs . . . . . . . . . . . . . . . . . . . . . . 35

VLSI Design of A Reed-Solomon Decoder for Gigabit Automotive Ethernet xi

Chapter 1

Introduction

1.1 Thesis Context

The car technology has been advancing considerably over the past decades. A growing number ofelectronics systems are integrated to vehicles to improve the performance of vehicles and provideadvanced features. Based on features and constraints, these electronic systems in a car can bedivided into different functional domains and each domain has independent controls. For example,the powertrain domain of a vehicle is responsible for control of engine and transmission. Varioussensors and complex controls are used in this domain to maximize the power and efficiency.

To maintain the normal operation of vehicles, components in the same domains or even in dif-ferent domains need to communicate with each other. Therefore, in-vehicle networks are developedto interconnect components inside a vehicle. Multiple in-vehicle networks has been developed suchas CAN,LIN, LVDS, MOST and FlexRay. Those networks are usually developed for specific ap-plications or domains. Typically, each in-vehicle component has its own dedicated wiring andcommunication requirements, resulting complex cabling. According to the investigation in [11],the wiring harness is the third highest cost component and third heaviest component in a car. Inaddition, integration of heterogeneous networks in a car can be a great challenge. If one centralizednetwork can be provided to connect different functions and domains in a car, weight and cost ofa car can be reduced due to less wiring and simpler cabling. Besides, a large set of AdvancedDriver Assistance Systems (ADAS) has been developed to further enhance safety and ease driv-ing process. Such examples include automotive night vision system, lane departure warning andparking assistance. These systems are typically camera based and their bandwidth requirementshave exceeded the capabilities of traditional automotive networks as mentioned above.

Automotive Ethernet, a wired hierarchical homogeneous network, is one industry solution tothose issues. Even though the Ethernet has been invented over 40 years, it is not generally viewedas an option for in-vehicle use due to technical concerns such as Electromagnetic Compatibility(EMC), cost and other considerations. A few years ago, however, Broadcom developed an Ethernetvariant called BroadR − Reach that is specifically designed for automotive networking and laterevolved into the standard IEEE P802.3bp. The Gigabit Ethernet, next generation of AutomotiveEthernet, can serve as a backbone of the car that bridges other networks and subsystems together.All communications of different in-vehicle components can coexist in the same switched Ethernetnetwork, thus reducing connectivity costs. In addition, Automotive Ethernet adopts unshieldedtwisted pair cable which is light and inexpensive. Besides, the Gigabit Ethernet can providebandwidth needed for Advanced Driver Assistance Systems, making these advanced applicationspossible in vehicles. Hence, The need for network convergence, weight reduction and cost saving,high bandwidth networks and wider array of applications are the main drivers of AutomotiveEthernet.

Automobiles are expected to function for decades after the initial sale and they carry pre-cious human cargo. Therefore, the data stream through the network also has to be reliable. The

VLSI Design of A Reed-Solomon Decoder for Gigabit Automotive Ethernet 1

CHAPTER 1. INTRODUCTION

Automotive Ethernet requires a demanding bit error ratio (BER) which is less than 10−10. How-ever, the communication channel of Automotive Ethernet suffers from un-cancellable cross-talkfrom other data lines and impulse error due to ignition systems, switched electronics, etc. Oneof techniques that protect data against these errors during transmission is Reed-Solomon coding.The Reed-Solomon codes, a kind of forward error correction codes, add redundancy data to themessage so that the receiver is able to detect and correct a certain number of errors introducedby noisy communication channel. Hence, the Reed-Solomon encoder and decoder are requiredin the Physical Layer (PHY) of Automotive Ethernet, and they are working simultaneously toensure a full-duplex communication. The decoder designed in this thesis project is based on the1000BASE-T1 PHY which is one of the Gigabit Ethernet family of high-speed ful-duplex networkspecifications, capable of operating at 1000 Mb/s.

1.2 Reed-Solomon Codes and Their Application for Auto-motive

Reed-Solomon Codes were discovered by Irving Reed and Gus Solomon in 1959 and publishedin 1960[27]. Binary BCH codes were discovered by Hocquenghem in 1959 and independently byBose and Chaudhuri in 1960 [19][27]. Later, in 1961 Gorenstein and Zierler then generalizedBCH code to arbitrary Galois fields of size pm and discovered that RS code is a subclass of BCHcode. As one of the most popular forward error correction codes, Reed-Solomon codes are widelyused in data transmission and data storage. Reed-Solomon codes are constructed and decodedthrough the use of finite filed arithmetic. The finite field is also referred to Galois Field to honorthe French mathematician Evariste Galois. A Galois Field is usually denoted as GF (q) and it isa algebraic system with q elements where addition, subtraction, multiplication and division areproperly defined. In this thesis, we only discuss binary field GF (2) and its extended field GF (2m).The arithmetic results of two elements in a finite field is still in the same field. Therefore, finitefield is said to be a closed field. A finite field GF (2m) is defined by a primitive polynomial p(x)and can be viewed as a vector space over GF (2). The complexity of finite field arithmetics isheavily dependent on the primitive polynomial and basis representation.

Reed-Solomon codes are symbol-based and each m bits are grouped into one symbols. Eachsymbol of Reed-Solomon code is from a finite field GF (q). A sequence of symbols to be transmittedthrough the channel is referred to as message and denoted as u. The encoder at transmitter acceptsk message symbols and add 2t redundant symbols, also called parity symbols, based a specific ruleto form a codeword. The codeword, with length of n = k + 2t, is also called one RS frame anddenoted as v. The resulting RS code is a linear, cyclic block code. The linear property means thatthe sum of any two codeword is still a valid codeword. The cyclic perperty refers to that whenthe codeword is shifted cyclically, the result is also a valid codeword. The communication channelin reality is usually suffering from various noise. When codewords reach receive through the noisychannel, the received codeword r might be corrupted by noise. Denote the error introduced by thenoisy channel as e, and then error corruption can be modeled as addition in finite field arithmetic.

r = v + e (1.1)

If the decoder can reconstruct the error pattern e, then original codeword v can be recovered.The decoder checks parity symbols of received codeword r based on the same rule used in encoderto determine whether it is a valid codeword. If there is error in r, the decoder tries to correcterrors. The error-correcting capability of RS codes is determined by its Hamming distance. ForReed-Solomon codes, its Hamming distance is determined by number of parities.

Hamming distance = 2t+ 1 (1.2)

error-correcting capability = bHamming distance

2c (1.3)

2 VLSI Design of A Reed-Solomon Decoder for Gigabit Automotive Ethernet


To decode the codeword, the decoder has to map the received codeword r to one of codeword inthe code set. It is possible that the received codeword is transformed into another valid codewordafter error corruption. In this case, the parities are still valid and the decoder will assume thereis no error in received codeword. This kind of error pattern is referred to as undetectable errorpattern. When an undetectable error pattern occurs, the decoder makes a decoding error. Whenthe parities do not hold, the decoder tries to correct the codeword. It happens that the codewordis still invalid after correction. In this case, this received codeword is called uncorrectable RScodeword and the decoder results in decoding failure.

The notation (n, k) is used to specify the block length and message lenth of a Reed-Solomoncode. A natural Reed-Solomon code has a fixed block length n where the number of symbols inone codeword is n = 2m. Denote p as the number of parity symbols where p = n − k. For anatural Reed-Solomon code, the parity symbols p = 2t where t is the error-correcting capability.Therefore, an RS code can be described with these three parameters. If the original code doesnot fit to the channel, some modification techniques can be used to change the length of the code.The modification of code is done by fixing one parameter and varying the other two parameters.There are six such techniques as shown in Table 1.1.

Table 1.1: Modifying techniques.

Modification Modified (n, k) RS code DescriptionAugmenting (n, k + l) Fix n; increase k by l; decrease p by lExpurgating (n, k − l) Fix n; decrease k by l; increase p by lExtending (n+ l, k) Fix k; increase n by l; increase p by lPuncturing (n− l, k) Fix k; decrease n by l; decrease p by lLengthening (n+ l, k + l) Fix p; increase n by l; increase k by lShortening (n− l, k − l) Fix p; decrease n by l; decrease k by l

In this thesis, only shortening technique is discussed since it is used in 1000BASE-T1 PHY.Shortening is done by decreasing the number of message symbols and keep same number of paritiessuch that the resulting shortened code has the same error-correcting capability as the original code.For automotive, the (450, 406) RS code is used and each symbol in the codeword is from the finitefield GF (29) which is defined by a primitive polynomial x9 + x4 + 1. This RS code has 4069-bit message symbols and 44 parity symbols and it is shortened from (511, 467) RS code. Reed-Solomon codes are good at correcting both random errors and burst errors. For the (450, 406)RS code, 22 random symbol errors or 190-bit burst errors can be corrected at least. According to[13], the probability Pu of undetectable error after correction for worst-case channel is boundedby

Pu < q−2tt∑

h=0

(q − 1

h

)(q − 1)h (1.4)

For worst case channel, the probability of symbol error during transmission is (q − 1)/q, whereq is the size of the finite field. The upper bound of Pu for RS decoder 1000BASE-T1 PHY is5.2× 10−22, allowing the decoder runs for decades without any undetected error.

1.3 Problem Description

As discussed in previous sections, Reed-Solomon coding is important to combat channel noiseduring data transmission for Automotive Ethernet. To implement Reed-Solomon coding, both RSencoder and RS decoder are required in the PHY. The design of encoder is simple and specifiedby the standard, while the design of RS decoder is much more complex. Hence, design of a RSdecoder will be the focus of this thesis.



There are several algorithms to decode RS codes and various RS decoders have been imple-mented. Each design results in different complexity. Designing a cost-effective RS decoder interms of silicon area and power consumption is of vital importance for organizations striving toremain competitive. Although RS codes have been discovered over 60 years and various decodersare designed for different applications, there is no 9-bit symbol based RS decoder existing in themarket yet. Hence, investigation of existing RS decoders and estimation of their complexity in1000BASE-T1 PHY are essential before the chip is designed and fabricated.

The objective of this project is to design a high-speed low-cost Reed-Solomon decoder for1000BASE-T1 PHY of Automotive Ethernet. The RS decoder is specifically designed to decode(450, 406) RS codes. The decoder receives 9-bit symbols from the channel and detect whetherthere is any error in the RS frame. If there is any, the decode tries to correct it. As required by thestandard, the decoder has to report decoding failure when the RS frame is uncorrectable. Besides,the standard also imposes a 7200 ns delay constraint for transmitting and receiving data, whichis two RS frame duration. When the thesis starts, other IP blocks in the PHY have not beendesigned and thus their delays are not finalized. Therefore, a low-latency RS decoder is designedand thus it gives more flexibility to other IP blocks. An assumption of 1.2 RS frame duration,which is 4320 ns is set to be the latency constraint of the RS decoder. The RS decoder will beimplemented in VHDL and synthesized using 40nm CMOS technology. The RS decoder is locatedat the physical Coding Sublayer (PCS) of the 1000BASE-T1 PHY and it accepts one 9− bit datasymbol every 8 ns. Therefore, the critical path of decoder should be less than 8 ns and thus thedecoder can run at least 125MHz. Every clock cycle, the decoder output one 9-bit symbol andthus it achieves a throughput of 1.125 Gbps.

1.4 Scope of the Project

It is assumed that audience has no preliminary knowledge about Galois Field (GF) and Reed-Solomon (RS) codes. Hence, this thesis presents all necessary backgrounds in detail. Since thisproject targets a specific application, the requirements of the application constraints the researchscope of this project. However, some alternatives, e.g. encoding and decoding approaches, arepresented without too much details. By doing this, a more general overview of RS encodingand decoding is presented. The audience aiming for different application may still benefit fromthis thesis. This thesis focuses on the VLSI decoding algorithm of RS codes based on generatorpolynomial encoding approach and their implementation.

1.5 Thesis Overview

The rest of this thesis is organized as follows. Chapter 2 introduces the Galois Field, RS encodingand some terminology used in RS decoding, which are preliminary knowledge to understand RSdecoding. Chapter 3 presents the decoder structure and existing algorithms for decoding RS code.Various designs are explored and comparison is made in terms of hardware complexity, latency andcritical path delay. The detailed VLSI implementation of RS decoder in this thesis is presentedin Chapter 4. Then, the synthesized results and power estimations are shown in Chapter 5. TheChapter 6 concludes this thesis.


Chapter 2

Preliminaries

This chapter introduces preliminary knowledge that is essential to understand RS decoding. Sec-tion 2.1 begins with Galois filed which is used throughout RS encoding and decoding. Then, thedetails of RS encoding is discussed in Section 2.2. Next, some terminologies and concepts used inRS decoding is presented.

2.1 Galois Field

2.1.1 Construction of Galois Field

As stated in Chapter 1, Reed-Solomon codes are constructed and decoded through the use of finitefield arithmetic. Hence, understanding finite field arithmetic and their complexity for implement-ation are necessary before the RS encoding and decoding are discussed. A field is basically a set ofelements where addition, subtraction, multiplication and division are probably defined and can beperformed without leaving the field. The smallest finite field is {0, 1} and it is denoted as GF (2).

Binary field GF(2):

GF (2) = {0, 1}(2.1)

The addition and multiplication operations defined in GF (2) are shown in Eq.(2.2). The additionin GF (2) is performed based on binary addition without carry-bit. Therefore, addition in GF (2) isequivalent to subtraction. The division can be derived from multiplication and it will be elaboratedlater. The commutativity and associativity which hold in normal algebraic system are still validhere. For GF (2), addition can be implemented by an XOR gates and multiplication can beimplemented by simply an AND gate.

Arithmetic in GF(2):

1 + 1 = 0

1− 1 = 0

1 + 0 = 1

1 · 1 = 1

0 · 1 = 0

(2.2)

An extended binary field GF (2m) includes GF (2) and introduces some additional elements. Toconstruct GF (2m), a primitive polynomial p(x) =

∑mi=0 pix

i over GF (2) is required. A polynomialwith coefficients from GF (2) is said that it is a polynomial over GF (2). The concept of a primitivepolynomial is similar to prime number in the normal algebraic system.


CHAPTER 2. PRELIMINARIES

Definition 1. A polynomial p(x) of degree m is a primitive polynomial if the following twoconditions are satisfied:

1. The polynomial p(x) is not divisible by any polynomial f(x) over GF(2) of degree 0 <deg f(x) < m.

2. The smallest positive integer n for which p(x) divides xn + 1 is n = 2m − 1.

The first element that is added to the binary field GF (2) is denoted as α and it is assumedthat α is a root of p(x). Additional elements are introduced based on consecutive multiplication.That is,

α2 = α · α

α3 = α · α · α...

αj = α · α · . . . · α (j times)

...

Now, we haveF = {0, 1, α, α2, ..., αj , ...}

The elements 0 and 1 can be denoted as α−∞ and α0 respectively. As stated in Definition 1, p(x)divides X2m−1 + 1, we have

x2m−1 + 1 = f(x)p(x) (2.3)

where f(x) is a polynomial over GF (2). By replacing x with α, we obtain

α2m−1 + 1 = f(α) · 0 = 0 (2.4)

since α is a root of p(x). Therefore,α2m−1 = 1 (2.5)

Then, it can be seen that the element after α2m−1 is mapped to element αi where 0 ≤ i ≤ 2m− 2.Hence, under the condition p(α) = 0, the set F becomes a finite set with 2m elements and it is anextended binary field GF (2m) denoted as

Extended binary filed GF (2m) :

GF (2m) = {0, 1, α, α2, ..., α2m−1}(2.6)

2.1.2 Representation of Galois Field

The elements are represented as powers of α in Eq.(2.6). Besides, a GF (2m) can be regarded asa vector space of dimension m over GF (2). Given a basis {γ0, γ1, ..., γm−1} of the vector spaceGF (2m), any element a ∈ GF (2m) can be represented as linear summation of elements of thebasis, i.e.

a = a0γ0 + a1γ1 + ...+ am−1γm−1 (2.7)

where ai ∈ GF (2) and γi ∈ GF (2m) for 0 ≤ i ≤ m − 1. Then, a0, a1, ..., am−1 can be viewedas coordinates of a with respect to this basis. The set {1, α, ..., αm−1} forms such a basis and it

is referred to as standard basis1. Another common used basis is normal basis {α, α21 , ..., α2m−1}.The complexity of Galois Field arithmetic is heavily dependent on the primitive polynomial p(x)and the basis used to represent the field. In 100BASE-T1 standard, primitive p(x) = x9 +x4 +1 is

1standard basis is also called canonical basis or polynomial basis



used to construct the Galois Field GF (29) and the standard basis {α0, α1, ..., α8} is used to GaloisField elements. With standard basis representation, each Galois Field elements a ∈ GF (2m) canbe written as polynomial form a(x)

a(x) = a0 + a1x0 + ...+ am−1x

m−1 (2.8)

Then, addition of two elements is simply adding each corresponding digits without carry bit.Subtraction is also equivalent to addition in GF (2m).

2.1.3 Multiplication

The multiplication operation with standard basis can be accomplished in two steps: polynomialmultiplication and modular reduction. Let c(x) ∈ GF (2m) be the product of a(x) ∈ GF (2m)and b(x) ∈ GF (2m), then c(x) = a(x)b(x) mod p(x). The polynomial multiplication step d(x) =a(x)b(x) results in a polynomial d(x) with degree of (at most) 2m− 2 .

d(x) = a(x)b(x) = (

m−1∑i=0

aixi)(

m−1∑i=0

bixi) (2.9)

After the reduction operation c(x) = d(x) mod p(x), a polynomial c(x) with degree of m − 1 isobtained. For efficiency reasons, these two steps are often combined.

2.1.4 Inversion

The computation of multiplicative inverse of Galois Field elements is one of the most complicatedoperations in GF (2m). There are three well-known methods to perform GF inversion operation:lookup table, Fermat’s Theorem and Euclidean Algorithm.

Lookup Table. The first one is ROM-based look-up table. The required size of the ROM is (2m−1)×m-bit, since the element 0 does not have a multiplicative inverse. For small m, this methodusually shows better performance than other two methods in terms of silicon area and computationtime[3]. However, it is not attractive for large m since the size of ROM grows exponentially withm. Large m is typically used in cryptography, i.e. 163 or more [14]. Furthermore, it also cannotbe used for an universal RS decoder.

Fermat’s Theorem. For every a ∈ GF (2m), a2m−1 = 1. Therefore, a−1 = a2m−2 = a2

1

a22

...a2m−1

.The inverse operation based on Fermat’s theorem requires m−1 squaring and m−2 multiplication.It is widely used for normal basis representation since the squaring is simply a cyclic shift of a GFelement’s binary digits for normal basis representation.

Euclidean Aglorithm. The Eculid’s algorithm is originally used to find the greatest commondivisor (GCD) of two numbers or two polynomials. For polynomial a(x) and b(x), their greatestcommon divisor can be written as

a(x)f(x) + b(x)g(x) = gcd(a(x), b(x)) (2.10)

Setting b(x) as primitive polynomial p(x), the gcd(a(x), p(x)) is 1 since p(x) is relative prime tob(x). Then, we have

a(x)f(x) + p(x)g(x) = 1 (2.11)

Then, modulo p(x) at both sides, we have

1 = a(x)f(x) mod p(x)

a−1(x) = f(x) mod p(x)(2.12)

Hence, f(x) mod p(x) is the inverse of a(x). Inversion circuits based on Euclidean Algorithmrequires polynomial divisions, which are very complex operations.



2.1.5 Galois Field Fourier Transform

The Galois filed Fourier transform is a generalization of discrete Fourier transform to finite fields.Let v(x) = v0 + v1x + ... + vn−1x

n−1 and V (x) = V0 + V1x + ... + V n− 1Xn−1 be polynomialsover GF (2m), where αn = 1. Then, v(x) and V (x) form a Fourier transform pair if:

Vj = v(αj) =

n−1∑i=0

viαij (2.13a)

vi =1

nV (αi) =

1

n

n−1∑j=0

Vjα−ij (2.13b)

Then V (x) is the spectrum polynomial of v(x). As shown in Eq.(2.13), the Fourier transformand inverse transform in Galois field are essentially polynomial evaluation by replacing x with αi.Then, the following two properties holds:

1. The jth spectral component Vj is zero if and only if αj is a root of v(x).

2. The ith component of v(x) is zero if and only if α−i is a root of V (x).

2.2 Encoding of Reed-Solomon Codes

There are systematic RS encoding and non-systematic RS encoding. The code is systematic,meaning that the information symbols are not disturbed in any way in the encoder and the paritysymbols are added separately to each block. These two encoding methods can lead to differentdecoding methods. The original approach proposed by Reed and Solomon is non-systemaic. Themessage symbols (m0,m1, ...,mn−1)can be viewed as coefficients of a polynomial m(x) = m0 +m1x + ... + mn−1x

n−1; the same applies to codeword v. In this thesis, the word codeword andcode polynomial are interchangeable. The original approach of RS encoding is that a RS codeword v is formed by evaluating P (x) at each of the q elements in the finite field GF (q).

v = (v0, v1, ..., vn−1) = (m(1),m(α1), ...,m(αn−1)) (2.14)

However, this encoding method results in a large silicon area since it requires n registers, nmultipliers and n adders. In [24], the original approach of RS codes is used to build a RS encoderand decoder. The generator polynomial approach, a method of systematic encoding, is the oneof the most popular approaches since it requires less silicon area. The resulting encoder onlyrequires 2t registers, 2t + 1 multipliers and 2t adders. The 1000BASE-T1 PHY standard adoptsthis approach. The generator polynomial approach requires a code generator polynomial g(x).

g(x) =

FCR+2t−1∏i=FCR+1

(x− (αG)i) (2.15)

where FCR is the power of the first consecutive root in g(x) and αG is any primitive element offinite field GF (2m) [9]. Any element in the finite field can be denoted as powers of the primitive

element, i.e. α(G)i . The element αG is generally chosen to be α so that the roots of g(x) areconsecutive powers. The value of FCR is chosen to be 1 in many applications, for example Voyagerproject [9], DVD [5] and Ultra Wide Band [15]. The choice of FCR sometimes may affect thehardware requirements of the encoder and/or decoder slightly. In [28], a (204, 188) RS encoderbased on FCR = 120 is built and it claims the resulting encoder saves 30% area. However, nospecific number of area of encoder and details of the comparison are given in [28]. For 1000BASE-T1 PHY, αG is chosen to be α, FCR is chosen to be 0 and t is 22. Then, the code generatorpolynomial g(x) can be simplified to

g(x) =

43∏i=1

(x− αi) = g0 + g1x+ g2x2 + ...+ g44x

44 (2.16)



The message polynomial u(x) is

u(x) = u0 + u1x+ u2x2 + ...+ uk−1x

k−1 (2.17)

The symbol xi serves as a place holder and it indicates that symbol ui is in the ith position in thecodeword. The parity-check symbols b(x) is obtained by first shifting u(x) 2t positions and thenperforming polynomial division.

b(x) = x2tu(x) mod g(x) (2.18)

The resulting remainder b(x) has a degree of at most 2t-1. Then, the encoder appends b(x) tothe message. Then we obtain the resulting codeword v(x)

v(x) = x2tu(x) + b(x) (2.19)

Since the addition in GF (2m) is equivalent to subtraction. The encoding operations removes theleft-over part of x2tu(x) mod g(x). Then, the resulting v(x) is divisible by g(x) and therefore ithas all the 2t roots α0, α1, ..., α2t−1 of g(x) as its own roots. The resulting RS code can correctup to t random symbol errors.

Message symbolsParity

2t k

n

Figure 2.1: Systematic RS Encoding

Let V (x) = V0 + V1x + ... + Vn−1xn−1 be the Fourier transform of RS codeword v(x). Then,

V (x) can be obtained by evaluating v(x) at 2m−1 elements α0, α1, ..., α2m−1 of finite field GF (2m)based on Eq.(2.13a). Based on first property of Galois Field Fourier transform, if αi = j is a rootof v(x) then the jth frequency component of V (x) is zero. Hence, the encoding is a process ofmaking 2t spectral component as zero from the point of view of frequency domain.

Vj = 0 for j = 0, 1, ..., 2t-1 (2.20)

For a natural (n,k) RS code, the block length n has to be n = 2m − 1. Therefore, once m ischosen the block length is fixed. In some application, however, a code of suitable natural lengthor suitable number of message symbols cannot be found. Then, the modifying techniques canbe applied to get the required code. In (n,k) RS code, there are 2k−l codewords with l leadinghigh-order message symbols. These 2k−l codewords forms a shorten (n − l, k − l) RS code byremoving these l leading zeros. The encoding for a shortened RS code can be accomplished bythe same circuits as those employed by the original RS code. The reason is that the deletedleading-zero message symbols do not affect the computation of parity symbols. After shortening,the (n− l, k − l) RS code has at least the same error-correcting capability as the (n, k) RS code.

Figure 2.2 shows the encoding circuit of RS code used for 1000BASE-T1 PHY. Before encoding,all shift register are initialized to zero. The message symbols µ0, µ1, ..., µ406 are shifted into thecircuit from right end. Shifting m(x) from the right end is equivalent to multiplying m(x) byxn−k. After all 406 symbols have entered the circuit, 44 parity symbols are computed and held in44 registers P0...P43.



D

g0

D

g1

... ... D

g2

D

g43 g44

b0 b1 b42 b43 µ405;µ404; :::;µ1;µ0

Figure 2.2: Encoding Circuit for (450, 406) RS Code

2.3 Decoding Flow

Before the decoding methods of RS codes are described, there are several characteristics andconcepts of RS codes that need to be discussed. Let r(x) = r0 + r1x+ ...+ rn−1x

n−1 denote thereceived code polynomial and e(x) = e0+e1x+...+en−1x

n−1 denote the error pattern that corruptsthe transmitted code polynomial v(x) = v0 + v1x + ... + vn−1x

n−1, then the error corruption inthe communication channel can be modeled as addition in finite field.

r(x) = v(x) + e(x) (2.21)

If there is no error, all coefficients of e(x) are zero. If there is v > 0 errors, then polynomial e(x)will have v non-zero coefficients

e(x) = Y1X1 + Y2X2 + ...+ YvXv (2.22)

where x0 ≤ X1 < X2 < ... < Xv ≤ xn−1. It is conventional to say that error values Y1, Y2, ..., Yvhave occurred in error locations X1, X2, ..., Xv. If we can reconstruct the error pattern e(x), thenthe original code polynomial v(x) can be recovered.

The received RS codeword, also called RS frame, is said to be valid if all the parity symbolshold for the codeword. The (n, k) RS code essentially defines a vector space of k dimensions andthe every non-zero codeword differs at least 2t+1 coordinates. When receiving a invalid codeword,the decoder will map the received codeword to the closest codeword in the vector space definedby (n, k) RS code. However, if there is more than t errors in the received codeword, the receivedcodeword may be closer to another valid codeword and the decoder will map it to that codeword.As a result, the decoder commits a decoding error which is undetectable. On the other hand, ifa erroneous codeword differs from all codewords in t+ 1 or more coordinates, then the codewordafter decoding is still invalid. This error is detectable and the decoder can report a decodingfailure.

Received RS frame

valid RS frame

no error undetectable error

invalid RS frame

correctableuncorrectable

(decoding failure)

Figure 2.3: Decoding Flow


Chapter 3

Alternative Designs and FinalAlgorithm

There are many ways to decode Reed-Solomon codes. The decoding algorithms of RS codesconstructed by the generator polynomial approach can be divided into three main classes [27]:syndrome-based decoding, remainder-based decoding and time-domain decoding. Further, syndrome-based decoding can be divided into algebraic decoding1 and transform decoding. The time domaindecoding is derived from transform decoding by Blauhut [27]. In the literature, however, there arelack of existing practical applications and chip fabrication using remainder-based decoding andtime-domain decoding. The syndrome-based decoding, which is the most classical RS decodingmethod, is extensively studied by researchers and many applications has been using this approach.Since the focus of this thesis is to design efficient RS decoder for 1000BASE-T1 PHY which iscomparable to existing work in the literature, the syndorme-based decoding and time domaindecoding are not used. In the first subsection, the decoder structure of algebraic decoding andtransform decoding are presented. Then, the design space of syndrome-based decoding is explored.Finally, the design decisions are made based on estimation of hardware complexity, critical pathand decoding latency.

3.1 Decoder Structure

The syndrome-based decoding firstly compute 2t syndromes to detect the occurrence of error.Syndromes are a set of characteristics that characterize a particular error pattern. The syndromeSi is defined as

Si = r(αI) = v(αI) + e(αI) where I = iG+ FCR, for 0 ≤ i ≤ 2t− 1 (3.1)

Since First Consecutive Root (FCR) is zero and G is 1 in 1000BASE-T1 PHY, FCR and G willbe omitted in the rest of this thesis. Then

Si = r(αi) = v(αi) + e(αi) for 0 ≤ i ≤ 2t− 1 (3.2)

Since the 2t elements α0, α1, ..., α2t−1 are the roots of code generator polynomial, then

Si = r(αi) = 0 + e(αi) = e(αi) (3.3)

As shown in Eq. (3.3), the value of syndromes are only related to the error pattern e(x). If all 2tsyndrome components are zero, then r(x) is valid. Otherwise, further actions are taken for errorcorrection based on the syndrome polynomial S(x), which is defined as

S(x) = S0 + S1x+ ...+ S2t−1x2t−1 (3.4)

1For historical reason, algebraic decoding is originally called time-domain decoding in some papers, e.g. [8].However, this thesis refers to the classification and definition in[27].


CHAPTER 3. ALTERNATIVE DESIGNS AND FINAL ALGORITHM

For error correction,both the error locations and error values have to be identified. To facilitatethe decoding process, the error locator polynomial σ(x) is defined as

σ(x) =

v∏l=1

(1−Xlx) (3.5)

and error value evaluator polynomial Z(x) is defined as

Z(x) =

v∑l=1

YlXl

v∏i=1,i6=l

(1−Xix) (3.6)

Based on Eq.(3.5) and Eq.(3.6), it should be noted that

deg(Z(x)) < deg(σ(x)) ≤ t (3.7)

Then, polynomial σ(x) and Z(x) are related to S(x) through the key equation which is definedby Berlekamp [1]:

[1 + S(x)]σ(x) ≡ Z(x) mod x2t+1 (3.8)

A slightly different key equation which is defined in [19] is used in this thesis

S(x)σ(x) ≡ Z(x) mod x2t (3.9)

Alternatively, the key equation can be shown in matrix form as

S0 0 · · · 0 0S1 S0 · · · 0 0...

.... . .

......

Sv Sv−1 · · · S1 S0

Sv+1 Sv · · · S2 S1

......

. . ....

...S2t−1 S2t−2 · · · S2t−v S2t−1−v

σ0σ1σ2...σv

=

Z...

Zv−100...0

(3.10)

The key equation states that the coefficients of xv, xv+1, ..., x2t in the product S(x)σ(x) are zero.This condition can be used to find σ(x). The well-known algorithms for solving the key equa-tions are Peterson Gorenstein Zierler (PGZ) algorithm, Berlekamp-Massey (BM) algorithm andEuclidean Algorithm (EA). Details of these algorithms are presented in the next section. In thefollowing subsections, the overall structure of algebraic decoding and transform decoding will bepresented. The main difference is how to determine error locations and error values.

3.1.1 Algebraic Decoding

SyndromeComputation

Key Eqn.Solver

Chien&Forney

Buffer

r(x)S(x)

σ(x)

Z(x)

e(x)

v(x)

Figure 3.1: An overall block diagram for algebraic RS decoder

As shown in Figure 3.1, the first step of algebraic decoding is to compute 2t syndromes basedon Eq.(3.3). Next, the key equation is solved based on the 2t syndromes. The error locatorpolynomial σ(x) and error value evaluator Z(x) are found at this step. After obtaining σ(x),Z(x) can be computed based on the upper part of the matrix in Eq.(3.10). The reciprocals



of roots X1, X2, ..., Xv of error locator polynomial σ(x) are the error locations and the ChienSearch is generally used to evaluate the roots of σ(x). Chien Search essentially substitutes all theelements in finite field GF (2m) to σ(x) and check whether the result is zero. Once error locationsX−11 , X−12 , ..., X−1v are found, an efficient method called Forney Algorithm can be used to evaluatethe magnitude of the error. Based on Forney Algorithm, the error value Yk at location Xk can becalculated as

Yk = −X1−FCRk Z(X−1k )

σ′(X−1k )(3.11)

where σ′(x) is the derivative of σ(x). Both Chien Search and Forney Algorithm involve polynomialevaluation and therefore they share similarity in VLSI implementation. In many designs, thesetwo step are combined into one function block.

3.1.2 Transform Decoding

SyndromeComputation

Key Eqn.Solver

Inverse Fourier

Transform

Buffer

r(x)S(x) σ(x) e(x)

v(x)

Buffer

RemainingError Transform

E(x)

Figure 3.2: An overall block diagram for the transform RS decoder

Let R(x) = R0 +R1x+ ...+Rn−1xn−1 and E(x) = E0 +E1x+ ...+En−1x

n−1 be the Fouriertransform of r(x) and e(x), respectively. Then,

Rj = Vj + Ej for j = 0, 1, ..., n-1 (3.12)

As shown in Eq.(2.20), the first 2t frequency components of codeword should be zero after encoding.Further, the syndrome computation is essentially to compute the first 2t frequency components ofreceived code polynomial r(x). Then

Rj = Ej = Sj = r(αj) for j = 0, 1, ..., 2t-1 (3.13)

That is, the first 2t frequency component of error pattern in frequency domain is known after syn-dorme computation. The task is to find the rest frequency component of E(x) for error correction.The error locator polynomial σ(x) can be found by solving key equation. Next, the transform de-coding finds the error pattern in frequency domain instead of time-domain as algebraic decodingby using a recursive extension method.

Ei = −v∑j=0

σjEi−j , for i = 0, 1, ..., n− 1 (3.14)

Then the whole error correction process is performed in frequency domain, while algebraic decodingis a mixture of frequency domain process and time domain process. After the error pattern is re-constructed, inverse Fourier transform has to be performed to get codeword in time domain.

3.2 Design Space Exploration

In the last section, both architectures of algebraic RS decoder and transform RS decoder havebeen presented. This section describes available options for each function block of the decoder.



3.2.1 Syndrome Computation for RS Decoding

The 2t syndrome components S0, S1, ..., S2t−1 can be computed as

Si = r(αi) =

n−1∑j=0

rjα(ij) (3.15)

As shown in Eq.(3.15), each syndrome Si requires n(n−1)2 multiplications (α0 = 1 does not require

multiplication) and n − 1 additions. The Horner’s rule can be used to reduced the requiredmultiplication to n− 1. By using Horner’s rule Eq.(3.15) can be rewritten in recursive form as

Si = (..((rn−1αi) + rn−2α

i) + ...+ r1)αi + r0 (3.16)

The syndromes are first used for error detection. As stated in previous section, all 2t syndrome areequal to zero if there is no error in the codeword. In [4], it was proved that error detection can bedone by only computing the first half syndromes S0, S1, ..., St−1. Then this idea is generalized tothat computation of any t continuous syndrome components can be used for error detection [15].The half syndrome computation technique can reduce the power consumption since most of thetime the decoder only performs syndrome computation. However, if there is error in the codeword,the other half syndromes has to be computed for solving the key equation. Computating rest ofthe syndrome components takes n clock cycles using Horner’s rule. Hence, the decoding latencywill significantly increases if the half syndrome computation technique is used. Although parallelunit can be added to speed up the computation, it requires additional hardware resources.

3.2.2 Key Equation Solver

The well-known algorithms for solving the key equations are Peterson Gorenstein Zierler (PGZ)algorithm, Berlekamp-Massey (BM) algorithm and Euclidean Algorithm (EA). The PGZ directlysolves the key equations which are non-linear simultaneous equations. However, the decoding com-plexity of PGZ is proportional to t2 [28] and therefore it is not suitable for VLSI implementationwhen error-correcting capability t is large. According to [26], the PGZ is efficient for t ≤ 3. Thedecoding complexity of BM and EA scales linearly with number of error t [28] and therefore arewidely used in RS decoding. Since our application has a large error-correcting capability t = 22,BM and EA are considered as the options for our design. The detail of PGZ is omitted here.

Berlekmap-Massey Algorithm and Its Variants

The BM algorithm is an iterative procedure. It was proposed by Berlekmap and later improveby Massey. Inversion required in the original BM algorithm is very complex and time-consumingoperations. In 1971, an Inverse-free Berlekamp-Massey (iBM) algorithm was proposed by Burton[5] for binary BCH codes and then extended to RS codes by Reed et al. [21] in 1991. By removingthe inversion, the new algorithm is more suitable for VLSI implementation and a complete proofof the algorithm was presented in [21]. The error locator and error evaluator found by iBM isscaled by a constant comparing the one obtained from BM algorithm. That is,

σ∗(x) = cσ(x)

Z∗(x) = cZ(x)(3.17)

where c is a constant. However, σ∗(x) has the same roots as σ(x). Since the error evaluator Z(x)is also scaled by the same constant, the scaling factor c will be canceled when computing errorvalues. Hence, we will not specifically differentiate σ∗(x) and σ(x) in the rest of the thesis. Thesame applies to Z∗(x) and Z(x). Since the iBM algorithm is more hardware efficient comparedto BM algorithm, many RS decoders have implemented based it. The conventional BM algorithmtakes 2t iterations to find error-locator polynomial for RS codes. It was proved in [7] that onlyt+ e iterations are required where e is the number of errors in the RS frame. This Early-stopped



Berlekamp-Massey (ESBM) algorithm, which is also inverse-free, can improve the decoding stepsand save power for error correction. However, the variable execution time will cause output jitter,which is not desirable in many applications. The ESBM is summarized as following

Initial values:

σ(0)(x) = 1, B(0)(x) = 1

l0 = 0, ε0 = 1

At uth iteration:

du =

u−1∑j=0

Su−1−j

δ =

{1 if du 6= 0 and 2lu−1 ≤ u− 10 otherwise

lu = δ(u− lu−1) + (1− δ)lu−1[σ(u)(x)

B(u)(x)

]=

[εu−1 −duxδ (1− δ)x

][σ(u−1)(x)

B(u−1)(x)

]εu = δdu + (1− δ)εu−1

Stop condition:

u = 2t or u = lu + t

(3.18)

The variable u is step index and δ is an auxiliary variable used to format the algorithm. Thepolynomial B(x) is a scratch polynomial and used to update error locator polynomial σ(x) ateach iteration. The variable lu indicates the degree of σ(u)(x) and ε is a scaling factor introducedto eliminate inversion required in BM algorithm. The value of discrepancy d together with ldetermines how to update σ(x) at each iteration. The iBM runs for u = 1, 2, ...2t and σ(2t) is theresulting error locator polynomial. The ESBM algorithm refines the stop condition as u = 2t oru = l + t[7].

A straightforward implementation of iBM leads to a parallel architecture where coefficientsof polynomials are updated in parallel in each iteration of iBM algorithm as shown in [21]. Thespeed bottleneck in parallel architecture is in the iterative computation of discrepancies followedby the updating of the error-locator polynomial. This critical path existing in a loop and it cannotbe directly pipelined. A serial architecture, which is mentioned in [12], can be implemented toreduce critical path delay and hardware complexity. However, it is not generally implementeddue to high decoding latency which is 2t× (2t+ 1) clock cycles. Then, [6] presented decomposedBerlekmap-Massey algorithm where a clever scheduling is used. The resulting VLSI design onlyrequires 3 multipliers and 2t× (t+ 1) clock cycles latency. Later, Chang and Shung [5] improvedtheir implementation in [6] by adopting dual basis multiplier instead of standard basis multipliersince the dual basis multiplier use less transistor without considering the transformation circuits.The parallel structure of RS decoder suffers from long critical path, and the serial structure anddecomposed structure suffers from a long decoding latency. To achieve a fast speed of the circuitas well as low decoding latency, researchers have been trying to reformulate iBM algorithm so thata pipelined version of iBM can be implemented. A dual-line structure which has precisely thisproperty was proposed in [12]. The dual-line structure stores intermediate result of discrepancy dwhich can be used to compute σ(x).

Euclidean Algorithm and Its Variants

The Euclidean Algorithm (EA), or Euclid algorithm, is an efficient method for finding the GreatestCommon Divisor (GCD) of two numbers or two polynomials2. In 1975, it was discovered by

2In some research papers, the extended version of the algorithm for finding GCD of two polynomials is referredto Extended Euclidean Algorithm (EEA).



Sugiyama etc. [25] that EA can be used for solving key equation in RS decoding. To solve thekey equation (3.9) by EA, the key equation can be rearranged as

Z(x) = γ(x)x2t + σ(x)S(x) (3.19)

This is exactly the same form as Eq. (2.10) by setting a(x) = x2t and b(x) = S(x). The EuclideanAlgorithm is an iterative procedure and at each iteration i the following equation always holds

Z(i)(x) = γ(x)(i)(x)x2t + σ(i)(x)S(x) (3.20)

The polynomial Z(i)(x), γ(x)(i)(x) and σ(i)(x) are updated as

q(i) = bZ(i−2)(x)

Z(i−1)(x)c

Z(i)(x) = Z(i−1)(x)− q(i)(x)Z(i−2)(x)

σ(i)(x) = σ(i−1)(x)− q(i)(x)σ(i−2)(x)

γ(i)(x) = γ(i−1)(x)− q(i)(x)γ(i−2)(x)

(3.21)

with initial values

Z(−1)(x) = x2t, Z(0)(x) = S(x)

γ(−1)(x) = 1, γ(0)(x) = 0

σ(−1)(x) = 0, σ(0)(x) = 1

(3.22)

The polynomial γ(x) is not required to solve key equation and thus can be omitted during com-putation. The algorithm stops when we reach a step ρ for which

degZ(ρ) < deg σ(ρ) ≤ t (3.23)

It is shown in [19] that this algorithm takes ρ ≤ 2t iterations to find both σ(x) and Z(x). Itshould be noticed that σ(x) and Z(x) found by EA are also scaled by a constant. However, it isgenerally believed that the original EA is not suitable for VLSI implementation since it involvesa sequence of complicated polynomial divisions. Unlike the polynomial modulo operation in RSencoding where divisor polynomial is fixed, both divisor and dividend polynomials involved in EAare not known prior to computation. In VLSI implementation, division generally involves inverseoperation followed by multiplication. The inverse operation in finite field is area intensive andtiming consuming. In 1985, Shao etc. [23] proposed a Modified Euclidean Algorithm (MEA) forRS decoding based on the work in [2]. The MEA which does not require inverse operation can beimplemented in systolic array structures that lead to efficient VLSI implementation. The MEA is



summarized here.

Initial values:

µ(0)(x) = 1, R(0)(x) = x2t

λ(0)(x) = 0, Q(0)(x) = S(x)

At ith iteration:

li−1 = deg(R(i−1)(x))− deg(Q(i−1)(x))

δ =

{1 if li−1 ≥ 0

0 otherwise

R(i)(x) =

{bi−1R

(i−1)(x)− x|li−1|ai−1Q(i−1)(x) if δ = 1

ai−1Q(i−1)(x)− x|li−1|bi−1R

(i−1)(x) if δ = 0

Q(i)(x) =

{Q(i−1)(x) if δ = 1

R(i−1)(x) if δ = 0

λ(i)(x) =

{bi−1λ

(i−1)(x)− x|li−1|ai−1µ(i−1)(x) if δ = 1

ai−1µ(i−1)(x)− x|li−1|bi−1λ

(i−1)(x) if δ = 0

µi(x) =

{mu(i−1)(x) if δ = 1

λ(i−1)(x) if δ = 0

Stop condition:

degR(i)(x) < t

(3.24)

where ai−1 and bi−1 are the leading coefficients of R(i−1)(x) and Q(i−1)(x), respectively. Whenthe algorithm stops, σ(x) = λi(x) and Z(x) = Ri(x).The resulting implementation in [23] is aserial-in serial-out architecture and it consists of 2t identical processing cells. The 2t syndromesare serially shifted in and it takes 2t cycles to obtain the first coefficient of error locator polynomialand error evaluator polynomial. The advantages of MEA are its regularity and retiming abilitythat inserting registers in a feedforward cut to reduce the critical path.

In [18], Lee et al implemented both EA and MEA for (255, 239) RS decoder. It showed that thedecoder using EA has a small area than the decoder using MEA. However, the decoder using EAhas a slower clock frequency due to inverse operation required in EA. The EA architecture consistsof 2t divider processing cells and t multiply processing cells[18][17]. It is a parallel-in and parallel-out architecture and its critical path consists of (1ROM + 1AND +2Mult+1Add+2MUX). TheMEA implemented in [18] is a serial-in and serial-out architecture and its consists of 2t processingcells. The critical path is only (1Mult+1Add+1Mux) and thus it can run at a higher clockfrequency. In [16] , the multiplier is pipelined and retiming technique is used to reduce the criticalpath of a MEA-based RS decoder in [18]. However, retiming also introduces more hardware andincreases latency. The implementation of MEA requires 2t processing cells and a substantialportion of the cells are idle during computation, resulting inefficient hardware utilization. Sinceeach processing cell performs the same computation, it is possible to use a single cell recursively toperform the successive steps of MEA instead of pipelining data to the next cell. By multiplexingand recursive techniques, the MEA in [22] was implemented with one cell. Since MEA takes 2titeration and each iteration is processed by 2t cells successively, it takes 4t2 cycles to obtain thefirst coefficients of σ(x) and Z(x) if one processing cell used. In addition, the first iteration of MEAcan be eliminated by changing initial conditions[10][28]. In [10], Hsu et al implemented a foldedMEA architecture and only one processing cell is used to perform the computation recursively. Theresulting latency is (2t−1)(2t+1) cycles which is exactly one RS frame duration for (255, 239) RScode. However, more processing cells are required for (450, 406) RS code. Otherwise, the latencywill be much large than one RS frame duration. For MEA, each processing cell can be furtherdivided into a degree computation block and polynomial arithmetic block. As shown in Eq (3.21),



the degree computation determines how polynomials should be updated. It is shown in [28] thatthe degree computation can be omitted since the degree difference is predictable. Concretely,degR(i)(x) < degQ(i)(x) after odd iterations and degR(i)(x) = degQ(i)(x) after even iterations.Thus, the hardware complexity of MEA can be greatly reduced. The MEA in [28] is parallel-inand parallel out architecture. Therefore, we will refer to this architecture as parallel MEA todifferentiate other architectures.

3.2.3 Error Correction

Once the key equation is solved, the error vector e(x) can be reconstructed either in time domainor frequency domain. To reconstruct the error vector, both the error location and error value needto be determined. In time domain, The Chien Search is used to find the error location by findingthe roots of σ(x) and Forney Algorithm is used to evaluate the error value. The Forney Algorithmrequires computing derivative σ′(x) as shown in Eq.(3.11).

σ′(x) = σ1 + 2σ2x+ 3σ3x2...+ tσtx

t (3.25)

Since in finite field a+ a = a− a = 0, we have

σ′(x) = σ1 + 3σ3x2 + 5σ5x

4...+ tσt−1xt−1 (3.26)

Therefore, the computation of σ′(x) can be simply done by extracting the odd coefficients of σ(x).By setting FCR = 0 in Eq.(3.11), the Forney Algorithm for our decoder can be written as

Yk = −X1kZ(X−1k )

σ′(X−1k )= −

Z(X−1k )

σodd(X−1k )

(3.27)

where Yk is the error value at error location X−1k , and σodd(x) represents the odd terms of σ(x).The division in Eq.(3.27) can be done by multiplying the inverse of σodd(x). To find the inverse ofσodd(x), an inversion circuit is required. The evaluation of Z(x) and σ(x) requires two polynomialevaluation circuits as shown in 3.3. After Chien Search is finished, a test for decoding failure

D

1 0

α0

Z0

D

1 0

α1

Z1

... ... D

1 0

αt−1

Zt−1

Z(αi)... ...

D

1 0

α1

σ1

D

1 0

α3

σ3

... ... D

1 0

αt−1

σt−1

σodd(αi)

... ...

D α0

1 0

σ0

D α2

1 0

σ2

D αt

1 0

σt

... ...

... ...

σ(αi)

Figure 3.3: The Block Diagram for Chien Search

can be performed. During the literature research, it was found that only a few of publicationsdiscuss the details about detecting decoding failure of RS decoding. It was mentioned in [8] thatdetection of decoding failure can be done by testing that there are no repeated roots in σ(x).

The Chien Search checks whether each element in GF (2m) is a root of σ(x) by followinga specific order. Since the roots of σ(x) are reciprocals of error locations, the Chine Searchessentially checks the error location from location αn−1 to α0 which corresponds to the outputorder of buffered RS frame. However, the possible error location will be from αn−l−1 to α0 ifshortening technique is used. Hence, Chien Search for shortened codes will spend l ”idle cycles.



D D

D

D

σ(αi)

σ(αi)

Z(αi)

Inversion

CircuitD D

10

Zero detect

m

m

m

m

m m m m

1 FIFO output

00vi

D

Figure 3.4: The Block Diagram for Forney Algorithm

Alternatively, the registers can be pre-initialized to skip these l cycles or σr(x) = xtσ(x−1) canbe used so that Chien Search follows the order from α0 to αn−l−1. However, pre-initializationmethod introduces extra multipliers and revering σ(x) violates the output order of RS frame. Dueto the limited latency budget, the pre-initialization method will be used.

Reconstruction of error pattern in frequency domain does not require error evaluator poly-nomial. The first 2t frequency components E0, E1, ...E2t−1 of error vector e(x) are known fromsyndrome computation. The remaining error transform can be calculated recursively as

Ej = −v∑k=0

σkEj−k, for j = 2t, 2t+ 1, ..., n− 1 (3.28)

This equation assumes that σ0 = 1. However, σ(x) is scaled by a constant for variants of BMalgorithm and EA. In that case, the error locator polynomial σ(x) needs to be normalized by σ0before computing remaining error transform. The remaining error transform can be implementedin hardware as shown in Figure 3.5. It consists of t cells. Each cell, except for the first one,contains one general multiplier, two registers and one adder. The registers, which provides inputto multipliers, are initialized with coefficients of σ(x). The 2t syndromes are serially shifted inand error vector in frequency domain are serially shifted out. The first 2t outputs are just the2t syndromes. Then, the multiplexer M1 select the feedback output and M2 select output fromerror transform block. In order to correct the codeword, inverse Fourier transform has to be done

σt

D

σt−1

D

...

...

σ1

D 0

1

Syndromes

S2t−1; :::; S1; S0

Error Transform

En−1; :::; E0; E1

0

1

M1

M2

Figure 3.5: Remaining Error Transform Block

to find the error vector in time domain e(x). Since the error transform E(x) is computed serially,it is efficient to use an architecture similar to that for the syndrome computation[22] as shown inFigure 3.6. There are n cells and the block length n equals to 450 for 1000BASE-T1 PHY. Eachcell consists of one constant multiplier, one adder, one multiplexer and two registers. The errortransform are serially shifted in and e(x) are obtained after q cycles where q is the size of GF (2m).Duplicated registers are used to convert the parallel output to serial output.



D

Error Transform

E510; :::; E1; E0

α0

e0

D

α−1

e1

D

α−449

e449...

...

......

...

Figure 3.6: Inverse Error Transform Block

3.3 Design Decisions

The design choices of each block have been presented. This section shows how the design decisionsare made based on estimation of hardware complexity, critical path delay and processing latency.

3.3.1 Deign Decision on Syndrome Computation

The syndrome computation is the first block for both algebraic RS decoder and transform RSdecoder. Hence, we first make a design decision on syndrome computation. As stated in section3.2.1, the half syndrome computation method can reduce power consumption at the expense ofincreasing n cycles decoding latency. Since our decoding latency is only 1.2n cycles, the halfsyndrome computation method cannot directly be used. All 2t syndromes will be computed first.Then, instead of checking all 2t syndromes, only the first t syndromes are checked. If those tsyndromes are zero, all computed 2t syndromes will be passed to key equation solver. In this way,half hardware of the zero-detector circuit is saved. The overall power consumption can also bereduced slightly.

3.3.2 Comparison of Transform Decoder and Algebraic Decoder

By estimating hardware complexity, comparison of transform decoder and algebraic decoder usingMEA is presented in [22]. It concluded that the algebraic decoder based on MEA uses less hardwareresources than transform decoder based on MEA. The same conclusion was presented in [8] byproviding supporting data in terms of silicon area of complete decoder chip design. In Table3.1, hardware complexity and decoding latency of transform decoder and algebraic decoder arepresented for shortened (450, 406) RS code. The syndrome computation block for both decodersare the same and therefore is ignored in Table 3.1. The major difference is the error correctionblock. The algebraic decoder requires both σ(x) and Z(x) for reconstructing error vector e(x)while the transform decoder only requires σ(x). Therefore, the choices of error correction blockalso affects the choices of key equation solver. Here we first make a design decision about the errorcorrection block.

The BM algorithm is generally considered to be more hardware efficient than EA and thisalso can be seen in the next subsection. To simplify the comparison, we assume that both BMalgorithm is used for transform decoder and algebraic decoder. The BM algorithm and its variantsusually first generates σ(x) after 2t steps and then generate Z(x) after t+ 1 steps. It can be seenfrom Table 3.1 that transform decoder requires more building elements than algebraic decoder.In [22], Shao concluded that the estimated area of (255, 223) RS transform decoder is almosttwice that of (255, 223) RS algebraic decoder. In [8], the core area of the (255, 239) RS transformdecoder is also twice that of (255, 239) RS algebraic decoder after placement&routing. Since thesize of algebraic decoder mainly depends on error correcting capability t while the size of transformdecoder heavily depends on code block length n, the area difference of (450, 406) RS transformdecoder and algebraic decoder can be even greater.



Table 3.1: Comparison of Transform Decoder and Algebraic Decoder

Decoder Block Algebraic Decoder Transform DecoderKey Equation Solver Latency = 3t+1 Latency = 2t

Polynoimal Evalutionpart of Chien/Forney

2t+1 Registers(2t + 1) × 2 constant Mults2t+1 Adders2t+1 MuxesLatency = 2 cycles

-

Calculating Error Valuepart of Chien/Forney

4 Registers1 Inversion Circuit1 general MultsLatency = 1cycles

-

Delayed Syndromes - 2t Registers

Remaining ErrorTransform

-

t Registerst+1 general Multst Adderst Muxes1 Inversion CircuitsLatency = 1 cycle

Inveser ErrorTransform

-

2n Registersn Constant Multsn Aders2n MuxesLatency = q cycles

Output Decoded Code Latency = 1 cycle Latency = 1 cycle

Total Amount for(450, 406) RS Decoder(n=450, t=22, q=512)

49 Registers1 General Mults90 Constant Mults45 Adders45 Muxes1 Inversion CircuitLatency = 49 cycles

966 Registers22 General Mults450 Constant Mults472 Adders922 Muxes1 Inversion CircuitsLatency = 559 clocks

q = 2m is the size of GF (2m), t is error-correcting capability and n isthe block length of RS codeword

Our design has a latency budget which is 1.2 RS frame duration. Since syndrome computationalready takes 1 RS frame duration, the total latency of the rest cannot exceeds 0.2 RS frameduration which is equivalent to 90 clock cycles. For algebraic decoder, more hardware can be addedto Chien&Forney block to skip the 61 idle cycles for shortened (450, 406) RS code. However, fortransform decoder the remaining error transform block is a recursive procedure and it generatesone frequency component Ej every cycle. Since the first 44 frequency components are known, ittakes 468 cycles to find all the frequency components of error vector. This recursive block cannotbe speed up by parallel units. The inverse Fourier transform block requires all 512 frequencycomponents. Therefore, 61 idle cycles cannot be skipped. The resulting latency which is 558cycles is much greater than 90 cycles. Therefore, the transform decoding fails to satisfy thelatency budget.

3.3.3 Comparison of Key Equation Solvers

The key equation solver block is the most complicated block in RS decoder and are intensivelystudied by researchers. Summary of complexity of various KES architecture can be found in [29][15] [20]. However, comparison of latency of KES is not fair in those paper. First, the BM-basedarchitecture usually generates first generate σ(x) and then generate Z(x), while the EA/MEA-based architecture generates both σ(x) and Z(x) at same time. However, in [15] the latency ofBM-based architectures only consider the delay for computing σ(x). Second, the BM algorithmis usually implemented as parallel-out architecture where all coefficients of resulting polynomialare generated in one cycle. However, MEA algorithm can be implemented as either parallel-out or serial-out architecture. In serial-out architecture, one coefficient of resulting polynomialsgenerated every cycle. Since the structure algebraic decoder is used, all coefficients of both error



locator and error evaluator are needed for Chien&Forney block. Therefore, a serial-to-parallelconverter is required in this case, which introduces extra latency. Hence, we consider the latencyof KES block as both computation of error locator and error evaluator are fully founded. Inaddition, the syndrome computation block generates all 2t syndromes simultaneously. Therefore,a parallel-to-serial converter is also required if the KES block is a serial-in architecture. The serial-to-parallel and parallel-to-serial converters can be implemented as shown Figure 3.7. Since thosetwo converters are pretty simple, we only add the required registers which is more important forimplementation in last column of Table 3.2.

D0 D1 Di... ...

1

0

1

0

a0 a1 ai... ...

(a)

D0 D1 Di... ...

(b)

a0 a1 ai... ...

Figure 3.7: (a) Parallel to serial converter (P2S) (b) Serial to parallel converter (S2P)

Table 3.2: Hardware Complexity, Critical Path Delay and Latency of (n, k) RS KES Block

KESArchitecture

Reg Mult Add Mux Latency Critical Path Delay CommentExtra

RegisterParalleliBM

3t+1 3t+1 t+log(t+1) 2t 3t+2 2TMult + (1 + log(t + 1))TAddserial-in

parallel-out2t

SerialiBM

3t+3 3 1 4 5t2 + 4t + 1 1TMult + 1TAdd + 1TMuxserial-in

parallel-out2t

DecomposediBM[6]

2t+6 3 2 2 3t2 + 2t 1TMult + 1TAddserial-in

parallel-out2t

Dual-lineiBM [12]

4t+3 4t+1 2t 2t 3t+2 1TMult + 1TAddparallel-in

parallel-outt

OriginalEA [17]

14t+6 3t+1 4t+1 11t+4 4t-31TROM + 2TMult

+1TAdd + 2TMux

parallel-inparallel-out

0

parallelMEA [28]

4t+4 4t+2 4t+4* 16t+16* 4t-2 1TMult + 1TAdd + 1TMuxparallel-in

parallel-out0

MEA [18](estimated)

42t 8t 8t 38t 3t+37 1TMult + 1TAdd + 1TMuxserial-in

serial-out4t+1

retimedMEA [16]

78t+4 8t 8t 40t+2 10t+81TDff + 1TNor8

+1TAnd2 + 1TMux2

serial-inserial-out

4t+1

foldedMEA [10]

8t+2 4 2 4 4t2 − 1 1TMult + 1TAdd + 1TMuxserial-in

parallel-out2t

a) All the latency has taken into account the delay introduced by serial-to-parallel converter at output.b) For Dual-line iBM, the extra registers are due to computation of Z(x).c) Fo other architectures, the extra registers are due to P2S and/or S2Pd) The asterisk (*) indicates the estimated number.e) All the multiplier list here refer to general finite field multiplier.

The Table 3.2 gives a comparison of different key equation solver architecture. It can be seenthat the BM-based architectures usually requires less registers and multipliers than EA/MEA-based architectures. The key word parallel refers to the fact that all coefficients of the polynomialsare computed every clock cycle, while the key word serial refers to the fact that only one coefficientsof the polynomials are computed.

To better estimate whether a design is feasible and can run at 125MHz, we first estimate thedelay of each basics elements, such as multipliers, adders and multipliers. These basic elements arefirst implemented and then synthesized with 40nm technology by using Cadence RTL Complier.The result is shown in Table 3.3. It shows the Galois field multiplier has the longest delay.



Table 3.3: Estimated Delay of Basic Elements

Delay (ns)TMult 2.171TAdd 0.304TMux 0.097

Table 3.4: Estimated Complexity, Latency and Critical Path Dealy for (450, 406) RS Decoder

KESArchitecture

Reg Mult Add Mux LatencySatisfy

Latency BudgetCritical PathDelay (ns)

Satisfy TimeConstraint

ParalleliBM

111 67 24 44 68 YES 5.06 EASY

SerialiBM

113 3 1 4 2509 NO 2.57 EASY

DecomposediBM

94 3 2 2 1519 NO 2.48 EASY

Dual-lineiBM

113 89 44 44 68 YES 2.48 EASY

OriginalEA

314 67 89 246 85 YES 6.84 EASY

parallelMEA

92 90 92 368 86 YES 2.57 EASY

MEA(estimated)

1013 176 176 836 103 NO 2.57 EASY

retimedMEA

1809 176 176 882 228 NO < 2.57 EASY

foldedMEA

222 4 2 4 967 NO 2.57 EASY

a) The Original EA requires one ROM which is not included in the table.b) The retimed MEA use pipelined multiplier and the actual area can be even larger.c) For retimed MEA, the critical path delay is not computed. However, it should be less than thatof MEA due to pipelined multiplier.

Table 3.4 shows the estimated cost for our application. We have a latency budget for decodingwhich is 1.2n = 540 cycles. Since SC block takes n = 450 cycles and CF block takes 4 cycles,the left latency budget for KES block is 86 clock cycles. The latency requirements remove 5designs as shown in Table 3.4. The estimation of critical path delay can show how difficult it isto make a feasible design run at 125MHz which is the targeted clock frequency. Even though theestimated critical path delay slightly violates the constraint, a feasible design may still be madedue to optimization of the RTL complier. Hence, we only put the word Easy or Difficult in thelast column. Fortunately, all the designs are feasible at gate level synthesis. However, the originalEA architecture may not give enough margin for placement and routing steps. In addition, oneROM is required in original EA to perform polynomial division. Up to this point, there are onlythree options left: parallel iBM, Dual-line iBM and parallel MEA. Both estimated critical pathand area are comparable for Dual-line iBM and parallel MEA. The parallel iBM has least areaamong all feasible designs and therefore it is chosen to implement as a baseline. We also chooseDual-line iBM architecture for two reasons. First, dual-line architecture can achieve a smallerdecoding latency which requires a smaller FIFO size. Second, the parallel iBM has a long logicsteps. As shwon in Table 3.2, the critical path which consists of 2 multipliers and 1 + log(t + 1)adders. After the decoder has been designed and implemented on silicon, many testing need tobe performed. Long logic step is not good for testing purpose.


Chapter 4

Reed-Solomon DecoderImplementation

After reviewing some related works, design decisions have been made in chapter 3. This chapterdescribes implementation of RS decoder in detail. As described in section 3.1, the algebraic RSdecoder consists of three function blocks: Syndrome Computation (SC) block, Key Equation Solver(KES) block and Chien&Forney (CF) block. First, both RS encoder and decoder are modeled inC language and then tested in MATLAB. Then, the RTL code is developed for the decoder andtested with C code model.

4.1 Syndrome Computation

Figure 4.1: Syndrome Computation

The decoder receives a sequence of symbols r0, r1, ..., rn−1 consecutively which are used tocompute 2t syndrome components.

Si =

n−1∑j=0

rjαij for i = 0, 1, ..., n− 1 (4.1)

According to Eq.(4.1), computation of each syndrome component Si requires n(n−1)2 multiplication,

n− 1 additions. By using Horner’s rule, Eq.(4.1) can be re-written as

Si = (...(((rn−1αi + rn−2)αi + rn−3)αi + ...)αi + r0) (4.2)

Then, computation of Si only requires n − 1 multiplications and n − 1 additions. As shown inFigure 4.1, the syndrome computation block consists of 2t cells. Each cell compute one syndrome


CHAPTER 4. REED-SOLOMON DECODER IMPLEMENTATION

component after n clock cycles. At each clock cycle, the computing results are accumulated andafter n clock cycles the syndromes are held in 2t registers as shown in Figure 4.1. Since serialsymbols are fed consecutively, the SC block should be able to separate RS frames. This can beachieved either by a local controller or an external signal indicating start of a RS frame. A localcontroller is used in this design. The controller is basically a counter that counts from 0 to n− 1.When it counts to n − 1, the input symbols which is the first symbols of next RS frame will befed into the register instead of accumulating results.

4.2 Key Equation Solver

4.2.1 Parallel iBM architecture

Su−1 Su−2 Su−22

σ(u−1)1 σ

(u−1)2 σ

(u−1)22σ

(u−1)0

+... ...

B(u−1)1 B

(u−1)2B

(u−1)0

Su−21

σ(u−1)21

B(u−1)21

"u−1

Su

Controller

... ...

Figure 4.2: Parallel Architecture for inversionless Berlekamp-Massey Algorithm

The block diagram of parallel architecture of inversionless Berlekamp-Massey algorithm isshown in Figure 4.2. The parallel-to-serial converter is omitted here and it is required to holdthe burst transfer of syndromes from SC block. There are three series of registers shown in theblock diagram. The register S represents the syndromes used for computing discrepancy at eachiteration. At each iteration, at most t+1 syndrome components are required. Initially the all thoseregisters are initialed to zero. The When the KES block accepts the input syndromes, it also checkswhether the first half syndromes is zero or not. If it is, then the received RS frame is assumedto be error free and KES. Otherwise, the parallel-to-serial converter starts to shifts syndromesinto register S. The register σ represents the scratch error locator polynomial and register Brepresents the polynomial used to update σ. According to Eq.(3.18), both polynomial σ(x) and

B(x) are initialized to 1. At each iteration u, the coefficient σ(u−1)i multiply the corresponding

syndrome Su−i and the summation of these products is the discrepancy du. For i < u, Su−i = 0.

Then, the multiplicative results duB(u−1)i is added to σ

(u−1)i . The discrepancy du is stored as the

scaling factor εu if du 6= 0 and 2li−1 ≤ u − 1. The scaling factor is introduced to eliminate theinverse operaiton required in original Berlekamp-Massey Algorithm. Then after 2t iteration, theresulting polynomial σ(x) is held in register σ. The computation of error evaluator polynomialZ(x) is very similar to the computation of discrepancy. Therefore, the same circuit in Figure 4.2can be reused to compute Z(X). To compute Z(x), register S and B are set to zero while registerσ is unchanged. Then the perallel-to-serial convert starts to shift S0, S1, ...St−1 to the registerS, and one coefficient of Z(x) is computed at each iteration. The error locator polynomial has a



degree of t− 1 at most and it can be held in register B. Parallel-to-serial converter causes 1 cycledelay, 2t cycles is sued to compute σ(x) and t+ 1 cycle is used for computing Z(x). therefore, thelatency of KES block is 3t+ 2.

As mentioned before, the Berlekmap-Massey algorithm can stop early at t + v steps where vis the number of error in the RS frame. The new stop condition is u = 2toru = lu + t. Thenthe computation of Z(x) which has a degree of v − 1 will only takes v + 1 steps. Hence, thelatency of KES reduce from 3t + 2 to t + 2v + 2. When there is only one error in the receivedRS frame, the KES block only has to take 24 cycles instead of 68 cycles. Applying this early stoptechnique, the average latency of KES block is reduced. However, the output will be generatedwith different latency. As a result, when the KES block is starting to output, the CF block maystill be busy processing the previous data. In that case, a handshaking protocol is required toensure the data in CF block will not be contaminated. In addition, the irregular output behaviorwill be propagated and visible at the output of the decoder. This irregular output behavior is notdesired and can be eliminated by buffering the results. An alternative solution is that the KESblocks stops early but only output data with fixed latency. When the algorithm finds σ(x) andZ(x) at i < 3t + 2 step, the KES circuit stops computation and but keep counting the steps. Itonly output data at 3t+ 2 steps. In this way, the output of decoder is still a regular behavior andpower of KES block will also be reduced if proper clock gating is inserted.

4.2.2 Dual-line iBM architecture

D(u−1)44D

(u−1)45

C(u−1)44 C

(u−1)43

0

D(u−1)43

C(u−1)42

0

C(u−1)2

D(u−1)2

0

C(u−1)1

D(u−1)1

0

C(u−1)0 Controller

"(u−1)

du... ...

Figure 4.3: Dual-line Architecture for inversionless Berlekamp-Massey Algorithm

In parallel iBM, t + 1 syndromes multiply t + 1 coefficients of σ(x) respectively, resultingt + 1 product results. Then, t + 1 product results are summed for computing discrepancy. Thecomputation of discrepancy has a long logic steps and can potentially limit the speed of decoder.Usually, the path delay of computation of discrepancy in iBM is the critical path of RS decoder.The dual-line architecture reduces this long critical path and achieves the same latency as paralleliBM. The block diagram of dual-line iBM is shown in Figure 4.3. There are two series of registers,C and D. The C registers have the scratch error locator polynomial σ(x) and D registers have thecorrection terms B(x) which is used to update error locator polynomial. The dual-line architectureis a parallel-in parallel-out architecture and the 2t syndromes are directly stored in those C andD registers. The C registers and D registers are initialized as shown in Eq.(4.3).

C(0)k =

{Sk+1 , for 0 ≤ k ≤ 2t− 11 , for k=2t

D(0)k =

Sk , for 1 ≤ k ≤ 2t0 , for k=2t1 , for k=2t+1

(4.3)



Every cycle, the C registers and D registers are updated as

C(u)k =

{ε(u−1) × C(u−1)

k+1 , when du = 0

ε(u−1) × C(u−1)k+1 + du ×D(u−1)

k+1 , when du 6= 0

D(u)k =

0 , when k=2t-u

C(u−1)k , when k 6= 2t− u ∧ (du 6= 0 ∧ 2lu−1 ≤ u− 1

D(u−1)k , otherwise

(4.4)

The discrepancy is generated at the right-most of C registers. There are 45 C registers whileσ(x) only have 23 coefficients. Initially, the C44 represents the σ(0)(x). Then every cycle, theposition of σ(x) is shifted in C registers and the size of σ(x) also grows. After 2t cycle, theregister C0, C1, ..., C22 represents the σ(22)(x). Similar to parallel iBM, the circuit can be reusedto compute error locator polynomial Z(x). The author who proposed this dual-line architecturedid not explicitly give the initial condition for computing error evaluator Z(x). In [15] a initialcondition is given. In this thesis, a different initial condition so that the error locator polynomialcan be stored and used for computing error evaluator without extra storage.

C(2t+1)k =

{Sk , for 0 ≤ k ≤ t− 10 , for t ≤ 2t

D(2t+1)k =

{0 , for 1 ≤ k ≤ tσk−t−1 , for t+ 1 ≤ k ≤ 2t+ 1

(4.5)

The ε register is initialized to 1 and kept for the rest of computation. After the re-initialization, thecircuits can run with same updating rule for C registers while keep D registers unchanged. Then,the Z(x) is obtained in registers C1, C2, ..., Ct after another t cycles. During the computationof Z(x), the σ(x) is stored in D23 to D45. In addition, in order to perform re-initializationthe syndorme S0, S1, ..., St+1 has to be stored with extra register when KES block first acceptsinput.In total, the dual-line iBM also takes 3t+ 2 cycles. That is, 1 cycle is used for initializing Cregisters and D registers, 2t cycles are required for computing σ(x) and t+ 1 cycles are requiredfor computing Z(x).

4.3 Chien Search and Forney

To make a fair comparison between algebraic decoder and transform decoder, a generic Chien &Forney block has been presented in Chapter 3.2.3. Here, we present more details for implementingChien&Forney block for our (450, 406) RS code. First, Chien Search block, as well as part ofForney block, performs polynomial evaluation. Due to the shortening, there is no error at locationα450, α451, ..., α510. The error detection at location α450, α451, ..., α510 corresponds to evaluatingσ(x) at x = α, α2, ..., α61, since the error location is the inverse of roots of σ(x). The Chien searchalways starts searching from α and therefore it takes 511 cycles to search all locations. Therefore,we pay 61 cycles penalty. To skip these 61 cycles, either σr = xtσ(x−1) can be used for Chiensearch, or we perform a initialization step to skip the first 61 search. Using the inverse versionerror locator polynomial σr(x) conflicting with the output order of FIFO. Therefore, we use theinitialization method to achieve low latency. For our (450, 406) RS code, the fist location to searchis α449. Therefore, the Chien search first evaluates σ(α(−449)) = σ(α62) and check weather theresult is zero.

σ(α62) = σ0 + σ1α61 + σ2α

(2×62) (4.6)

According to Eq.(4.6), t constant multiplier is added to the Chien search block to perform theinitialization as shown in Figure 4.4. Similarly, t−1 additional constant multipliers ar also requiredfor evaluating Z(x). As a result, the Chien$Forney block for shortened RS codes uses 2t− 1 moreconstant multiplier than the natural RS code. During the Chien search, one extra counter is usedto count how many roots have been found. If the number of roots is less than the degree of error



D

1 0

α0

D

1 0

α1... ... D

1 0

... ...

σtσ0 σ1

α1∗62

αt∗62

αt

σ(α(i+62))

Figure 4.4: Chien Search for shortened RS code

locator polynomial, it indicates that the error locator polynomial is not correctly found by KESblock. Then the decoder can declare a decoding failure to indicates that there is more thant 22errors in the received RS frame.

The Forney algorithm also requires find the inverse of σodd(x) as shown in Eq.3.27. The circuitshown in Figure 3.4 takes 4 cycles to output first error symbols. During the implementation, onemore cycle is used to ensure that the ROM is probably gated. That is, it is only accessed whenthere is error has been found. As discussed in section 2.1.4, there are three methods to find inverseelement in Galois field. The latency of Euclidean algorithm is 2m and the latency of Fermat’salgorithm is m − 1. A look-up table method only takes 1 cycle to find the inverse. Therefore, AROM-based look-up table is used to achieve low latency. As a result, the latency of Chien&Forneyblock is 5 cycles.

4.4 Summary

The overall decoding timeline is shown in Figure 4.5. The resulting decoding latency of our decoderis 523 cycles. The latency of CF is only 5 cycles. However, at worst case, it may be active for oneRS frame duration. The decoding latency determines the FIFO size. The FIFO is implemented byusing SRAM. The SRAM and ROM are generate by memory generator tool Embed-It. However,it has certain limitation for the size of generated memory. A 9-bit ROM cannot be generated bythe tool. Therefore, 512 × 10 bit ROM is used. On wrapper is designed for the ROM to convert9-bit address to 10-bit address by adding one zero bit. The 10-bit ROM output is also convertedto 9-bit. One 528 × 9-bit SRAM is generated by the tool to be used as FIFO for the decoder.The FIFO output has to be synchronize with output of CF block. Therefore, one more controllerblock is developed at top level to signal FIFO outputting.

Figure 4.5: Decoding Timeline


Chapter 5

Results and Analysis

5.1 Synthesis Results

The RTL designs are synthesized with RTL compiler and mapped to technology dependent lo-gic gates. The RTL compiler runs with GlobalFoundries CMOS 40nm standard cells librarygf40npksdst. Table 5.1 shows the synthesized netlists of these two decoders under different clockperiod. Both decoders can easily operate at 125MHz period which is our target frequency. Besides,the operating frequency can be easily scaled from 125MHz to at most 500MHz. Since the Galoisfield multiplier is essentially a set of combinatorial logics, they can be optimized, distributed andshared between different components. Therefore, RS decoder can easily scales to different fre-quency. As a result, more cells are required. The maximum frequency that the parallel iBMRS decoder can achieve is 333MHz, while dual-line iBM RS decoder can achieve a maximum fre-quency of 500MHz. The operating frequency of parallel iBM cannot increase due to the criticalpath in the KES block. For dual-line iBM RS decoder, the critical path lies in the CF block whenfrequency is higher than 500MHz. The access of ROM-table limits the speed of the decoder. At125MHz, the dual-line iBM still leaves 2.3ns while the slack of parallel iBM is 0. Besides, dual-liniBM generally occupies more silicon area than parallel iBM. These results are consistent with ourprevious estimation.

Table 5.1: Cell Area of RS Decoder for Difference Standard Cells and Time Constraint

Cell Area(um2)

Time Constraint(ns)

Slack(ns)

Computation Density(Kbit/s/um2)

dual-line iBM

49498 8 2.3 2350081 6 0.5 3050159 4 0 4551523 3 0 5854322 2.5 0 66

parallel iBM

44415 8 0 2545653 6 0 3350944 4 0 4462795 3 0 48

It also can be easily seen from Table 5.1 that both the computation capacity of two designshave exceeded our requirements. The metric computation density is used to indicate the efficiencyof the design. The parallel iBM provides higher computation density when the target frequency isless than 250MHz, while the dual-line iBM is more efficient for higher frequencies. For dual-lineiBM the netlist with 8ns constraint is chosen for the placement and routing since there is still2.4ns slack. For parallel iBM the netlist with 6ns constraint is chosen, since extra delay will beintroduced by routing. Details of selected netlist are shown in Table 5.2.


CHAPTER 5. RESULTS AND ANALYSIS

Table 5.2: Selected Netlist for Placement and Routing

Dual-line iBMCell Area

(um2)Percentage Parallel iBM

Cell Area(um2)

Percentage

Top View 49498 100.00% Top View 45653 100.00%SC 4847 9.79% SC 4302 9.42%KES 27595 55.75% KES 25014 54.79%CF* 7647 15.45% CF* 6951 15.23%

CF/ROM 1959 3.96% CF/ROM 1959 4.29%FIFO 7265 14.68% FIFO 7254 15.89%

Controller 185 0.37% Controller 173 0.38%CF* indicates that he area of CF block excludes the area of ROM.CF/ROM indicates that ROM is a subblock of CF block

The FIFO consists of a SRAM and an external wrapper. The area of SRAM is 6985 um2.

As shown in Table 5.2, the SC block has the least area while the KES block has the highestcomplexity. The table shows the area of CF block excluding ROM. The FIFO is implementedwith SRAM. The embedded memory, ROM and SRAM, occupies around 20% area of the design.It is interesting to see that the area of SC block and CF block differs in two design even as aresult of optimization of the RTL complier. A initial layout is made for both designs so that amore accurate power estimation can be obtained. Figure 5.1 shows the layout for dual-line iBM.The parallel iBM has a similar has a similar layout and it is not shown here. The summary ofpost-layout results can be found in Table 5.3.

Figure 5.1: Layout of Dual-line iBM RS Decoder



Table 5.3: Summary of RS Decoder after Placement and Routing

Dual-line iBMRS Decoder

Parallel iBMRS Decoder

ProcessCMOS 40nm

gf40npksdst cell libraryCMOS 40nm

gf40npksdst cell libraryCore Area 264um x 263um 264um x 263umCell Area 50046 um2 46071 um2

Memory Block(ROM)

one 512x10bit ROM= 1959 um2

one 512x10bit ROM= 1959 um2

Memory Block(SRAM)

one 528x9bit SRAM= 6985 um2

one 528x9bit SRAM= 6985 um2

Operatingfrequency

125MHz 125MHz

Supplyvoltage

1.1 V 1.1 V

5.2 Power Analysis

Power distribution for parallel iBM RS decoder when no error is found

SC46%

KES20%

CF*< 1%CF/ROM2%

FIFO28%

Ctrl4%

Power distribution for parallel iBM RS decoder when correcting 22 errors

SC27%

KES28%

CF*24%

CF/ROM2%

FIFO16%

Ctrl2%

Figure 5.2: Power distribution for parallel iBM RS decoder

The Figure 5.4 shows the power consumption of each blocks when correcting different numberof errors. The SC block, FIFO and control module have a constant power regardless of numberof errors since they are always functioning. When there is no error When there is no error, theSC blocks consumes the most of the power. When correcting 22 errors, the power consumption ofKES and CF blocks are comparable to syndrome computation. The CF block has been optimizedso that it stops computation when expected number of error has been founded. However, the errorcan be located at the last location so that CF has to be active for 450 cycles. Therefore, the powerconsumed by CF is approximately linear with number of errors. When there is no error, the majorpower consumption of CF block comes from the leakage power of ROM. When error increases, thepower consumption of CF block increases sharply. However the power consumption of ROM onlyincreases slightly since it is accessed 22 times at worst case during one RS frame duration. Theparallel iBM has been using early-stop condition, thus it is easier to understand that the power



Power distribution for dual-line iBM RS decoder when no error is found

SC54%

KES14%

CF*< 1%CF/ROM2%

FIFO25%

Ctrl5%

Power distribution for dual-line iBM RS decoder when correcting 22 errors

SC31%

KES25%

CF*25%

CF/ROM2%

FIFO14%

Ctrl3%

Figure 5.3: Power distribution for dual-line iBM RS decoder

Number of error0 5 10 15 20

Pow

er (

uw)

0

200

400

600

800

1000

1200

Power consumption versus number of errorfor parallel iBM RS decoder

SCKESCF*CF/ROMFIFOCtrl

Number of error0 5 10 15 20

Pow

er (

uw)

0

200

400

600

800

1000

1200

Power consumption versus number of errorfor dual-line iBM RS decoder

SCKESCF*CF/ROMFIFOCtrl

Figure 5.4: Power consumption

consumption of KES for parallel iBM increase linearly with number of errors. For dual-line iBMthe early-stop condition is not applied. The power consumption of KES also scales approximatelylinear with number of error. It is probably because that less computation is required when thenumber of error is small in the codeword. Since the computation steps of KES depends on thecalculated discrepancy. When the error locator polynomial has been found, the discrepancy willbe always 0 in the rest of iteration. In that case, less registers are updating and less gates areswitching. The power consumption of KES is less than that of SC at most of time since it is onlyactive during 68 cycles during one RS frame duration. The power distribution in each block canalso be seen in Figure 5.2 and 5.3. When no error is found, the KES also consumes a certain ofpower since it is also perform error detection for each RS frame.

The parallel iBM has a smaller power than dual-line iBM. This corresponds to the fact thatless gates are used for this design. When there is 22 errors, the dual-line iBM consumes 13% morepower than the parallel iBM.



Table 5.4: Power Consumption of two RS Decoders

Power (mW )Number of error parallel iBM dual-line iBM

0 2.1 2.322 3.6 4.0

5.3 Benchmarking

In chapter 5.1, we use computation density to compare efficiency of parallel iBM and dual-lineiBM based on computation density. In literature, many paper reports the gate count of theirdesign instead of cell area. To compare our designs with others, we also calculate the gate count.One 2-input NAND is counted as a gate [28], and the gate count is calculated as

gate count =total cell area

area of one 2-input NAND(5.1)

In standard cell library gf40npksdst, the area of one NAND is 0.7056um2. In chapter 5.1, we donot select the most efficient netlist since the computation capacity has exceeded our requirements.Here, to show how efficient of our RTL design, we use the most efficient netlist. That is, the netlistwith 2.5ns constraint for dual-line iBM and the netlist with 3ns constraint for parallel iBM arechosen. Then gate count of our design is shown in Table 5.5.

Table 5.5: Calculated Gate Count

Gate Countparallel iBM dual-line iBM

Non-memory Part 76310 64301Memory Part 12686 12686

Table 5.6: Comparison Results of RS Decoder Designs

RS code LatencyMax Freq(MHz)

Throughput(Mbit/s)

GateCount

NT(Kbit/gate)

Our bestparallel iBM

(450, 406) n+3t+7 333 2667 30180 88

Our bestdual-line iBM

(450, 406) n+3t+7 400 3200 26299 122

dual-line iBM[15] (255, 239) n+3t+5 200 1600 12613 127Parallel MEA [28] (204, 188) n+4t-2 200 1600 30343 53Original MEA [17] (255, 239) 287 300 2400 44700 54

First, the Normalized Throughput (NT) is calculated by normalizing throughput by gate count.Then, we can measure number of bits can be processed per second per gate. It is a similar metricto computation density. Since there is no design of 9-bit based RS decoder published in literature,we have to compare our designs with 8-bit based RS decoder. Therefore, normalization has tobe performed to take into account number of bits m, error-correcting capability t and technology.The register, multiplier and adder are proportional to m. However, the complexity of multiplieris generally proportional to m2 and also depends on the primitive polynomial p(x). Here, weroughly consider it as linear to m to make a comparison. The non-memory part of RS decoderis also promotional to t. The FIFO size is proportional to the block length n and ROM size isproportional to the size of Galois field q. Since a 8-bit based (255, 206) RS decoder has half sizeof n and q, we scale down the memory part by half. For the non-memory part, the gate count islinearly scaled down for t = 8 and m = 8. Similarly, the throughput also linearly reduces since itis also proportional to m. The scaled version of our decoder is shown in Table 5.6. We compare



the design with similarly latency as our case. As shown in Table 5.6, the both our best paralleliBM RS decoder and dual-line iBM RS decoder can achieve a high normalized throughput.

ower


Chapter 6

Conclusions

The required bandwidth of in-vehicle network is increasing sharply due to more advanced electron-ics being used in the car. The Gigabit Automotive Ethernet provides fast bandwidth to all kindsAdvanced Driver Assistance Systems. In addition, it can server as a backbone of the car and thusreduce the cost of harness. The 1000BASE-T1 PHY is one network specification for Gigabit Auto-motive Ethernet. The Reed-Solomon coding technique is used in1000BASE-T1 PHY to compactthe noise in the transmission channel. The Reed-Solomon codes is one powerful forward error cor-rection code and it can correct t random symbols errors based on 2t parity symbols. The shortened(450, 406) Reed-Solomon code is used in 1000BASE-T1 PHY. It is 9-bit based and all symbols ofRS code are from Galois field GF (29). The RS codes is encoded based on generator polynomialapproach as specified by the standard. For RS decoder in gigabit Automotive Ethernet, we assume1.2 RS frame duration as the latency budget for decoding RS code. In addition,The decoder needsto be able to run at 125MHz frequency and it is also required to report decoding failure whenthe received RS codeword is not correctable. Based on these requirements, we have explored thedesign space of RS decoder. Two different decoder structures, algebraic decoder and transformdecoder, have been studied. The transform decoder is generally believed to occupy more siliconarea than algebraic decoder. In addition, the transform decoder has a long decoding latency whichviolates our latency budget. The algebraic decode consists of three functional blocks, SyndromeComputation (SC) ,Key Equation Solver (KES) and Chien&Forney (CF). For SC block, the halfsyndrome computation method can save power consumption for error detection. However, it isnot used in our design due to the latency penalty. Two algorithms for KES and their variantshave also been discussed. The Euclidean Algorithm (EA) algorithm occupies lots of silicon areaand one ROM is required. The Modified Euclidean Algorithm (MEA) is a feed-forward structureand more suitable for Very-large-scale Integration (VLSI) implementation. This structure can beeasily pipelined to achieve a high speed. The Inverse-free Berlekamp-Massey (iBM) is a iterativeprocedure to solve key equation. It takes 2t steps to find the error locator polynomial and cansteps early at t+ v steps where v is the expected number of error. The iBM is generally believedto be more hardware efficient than MEA since the same circuit can be used for computing botherror locator polynomial and error evaluator polynomial. However, the MEA has to compute twopolynomials at the same time. Based on the estimated hardware complexity, latency and delayof critical path, we have chosen parallel iBM and dual-line iBM to be implemented. For the CFblock, ROM-based look-up table is used to perform the inverse operation to achieve a low latencyand low cost. Due to shortening, the Chien search is active for 61 cycles without really checkingerror location. Therefore, 2t − 1 extra constant multipliers are used to initialize the registers inChien search to skip the first 61 search. At the end of Chien search, the decoder can check whetherthe number of founded roots is equal to the degree of error locator polynomial. If not, then theerror locator polynomial is not corrected and the received codeword is not correctable. Then, thedecoder can report decoding failure.

The designed RS decoder is one sub-block of 1000BASE-T1 PHY. The PHY has a two RSframe duration latency requirement for decoding RS code. When this thesis starts, the latency of


CHAPTER 6. CONCLUSIONS

other IP blocks has not been finalized. Therefore a 1.2 RS frame duration is assumed to be thelatency requirement of RS decoder. Due to the low latency requirement, high speed RS decoderis designed. Two RS algebraic decoder based on Inverse-free Berlekamp-Massey (iBM), parallelarchitecture of iBM and dual-line iBM, have been designed and implemented in 40nm technology.Both decoders satisfy the requirement of 1000BASE-T1 PHY. An overal decoding latency of 523clock cycles is achieved which corresponds to 4184ns delay. When the time constraint is less than4ns, the parallel iBM gives a better computation density. The dual-line iBM is more efficientfor high frequencies. In addition, the critical path of the dual-line architecture has consists of 1multiplier and 1 adder for all kinds of RS codes. For the number of adders in the critical pathdepends on the error correcting capability t. Both the RTL designs are efficient when comparingwith related work. However, to maintain a low cost, the netlists with lower area are chosen. Bothdesigns have same decoding latency. The dual-line iBM can be synthesized to run at a higherspeed if required. The final designed RS coder runs at 125MHz. The cell area of final designeddual-line iBM RS decoder occupies 8% percentage more silicon area than parallel iBM RS decoder.The power consumption of both decoders are approximately linear with number of errors. Thedual-line parallel iBM consumes 15% more power than parallel iBM at worst case. Therefore, theparallel iBM is more efficient thant dual-line iBM in terms of silicon area and power consumption.However the difference is small. The parallel iBM RS decoder has a long logic steps.


Bibliography

[1] R. Elwyn Berlekamp. Algebraic Coding Theory. MacGraw-Hill Book Company, 1968. 12

[2] R. P. Brent and H. T. Kung. Systolic VLSI Arrays for Polynomial GCD Computation. IEEETransactions on Computers, C-33(8):731–736, Aug 1984. 16

[3] H. Brunner, A. Curiger, and M. Hofstetter. On Computing Multiplicative Inverses inGF (2m).IEEE Transactions on Computers, 42(8):1010–1015, Aug 1993. 7

[4] H. Chang, C. Lin, and C. Lee. A low-power Reed-Solomon decoder for STM-16 opticalcommunications. In ASIC, 2002. Proceedings. 2002 IEEE Asia-Pacific Conference on, pages351–354, 2002. 14

[5] H. Chang, C. B. Shung, and C. Lee. A Reed-Solomon product-code (RS-PC) decoder chipfor DVD applications. IEEE Journal of Solid-State Circuits, 36(2):229–238, Feb 2001. 8, 14,15

[6] H. C. Chang and C. Shung. A reed-solomon product-code (rs-pc) decoder for dvd applic-ations. In Solid-State Circuits Conference, 1998. Digest of Technical Papers. 1998 IEEEInternational, pages 390–391, Feb 1998. 15, 22

[7] C. L. Chen. High-speed decoding of BCH codes (Corresp.). IEEE Transactions on InformationTheory, 27(2):254–256, Mar 1981. 14, 15

[8] Ted S Fill. An assessment of vlsi and embedded software implementations for reed-solomondecoders. Master’s thesis, National Library of Canada= Bibliotheque nationale du Canada,2001. 11, 18, 20

[9] W. A Geisel. Tutorial on Reed-Solomon error correction coding. 1990. 8

[10] H. Y. Hsu, A. Y. Wu, and J. C. Yeo. Area-Efficient VLSI Design of Reed-Solomon Decoder for10GBase-LX4 Optical Communication Systems. IEEE Transactions on Circuits and SystemsII: Express Briefs, 53(11):1245–1249, Nov 2006. 17, 22

[11] Ixia. Automotive Ethernet: An Overview (White Paper). Online, May 2014. 1

[12] H. J. Kang and I. C. Park. A high-speed and low-latency Reed-Solomon decoder based ona dual-line structure. In Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEEInternational Conference on, volume 3, pages III–3180–III–3183, May 2002. 15, 22

[13] T. Kasami and S. Lin. On the Probability of Undetected Error for the Maximum DistanceSeparable Codes. IEEE Transactions on Communications, 32(9):998–1006, Sep 1984. 3

[14] K. Kobayashi, N. Takagi, and K. Takagi. An Algorithm for Inversion in GF (2m) Suitable forImplementation Using a Polynomial Multiply Instruction on GF(2). In 18th IEEE Symposiumon Computer Arithmetic (ARITH ’07), pages 105–112, June 2007. 7

[15] A. Kumar. High Throughput Reed Solomon Decoder for Ultra Wide Band. Technical report,Koninklijke Philips Electronics N.V. 2004, 2004. 8, 14, 21, 28, 35


BIBLIOGRAPHY

[16] H. Lee. A vlsi design of a high-speed reed-solomon decoder. In ASIC/SOC Conference, 2001.Proceedings. 14th Annual IEEE International, pages 316–320, 2001. 17, 22

[17] H. Lee. An area-efficient Euclidean algorithm block for Reed-Solomon decoder. In VLSI,2003. Proceedings. IEEE Computer Society Annual Symposium on, pages 209–210, Feb 2003.17, 22, 35

[18] H. Lee, M. Yu, and L. Song. VLSI design of Reed-Solomon decoder architectures. In Cir-cuits and Systems, 2000. Proceedings. ISCAS 2000 Geneva. The 2000 IEEE InternationalSymposium on, volume 5, pages 705–708 vol.5, 2000. 17, 22

[19] S. Lin and D.J Costello. Error Control Coding. Pearson Education, second edition, 2004. 2,12, 16

[20] J. I. Park, K. Lee, C. S. Choi, and H. Lee. High-speed low-complexity reed-solomon decoderusing pipelined berlekamp-massey algorithm. In SoC Design Conference (ISOCC), 2009International, pages 452–455, Nov 2009. 21

[21] I. S Reed and M. T Shih. VLSI design of inverse-free Berlekamp-Massey algorithm. IEEProceedings E-Computers and Digital Techniques, 138(5):295–298, 1991. 14, 15

[22] H. M. Shao and I. S. Reed. On the VLSI design of a pipeline Reed-Solomon decoder usingsystolic arrays. IEEE Transactions on Computers, 37(10):1273–1280, Oct 1988. 17, 19, 20

[23] H. M. Shao, T. K. Truong, L. J. Deutsch, J. H. Yuen, and I. S. Reed. A VLSI Design of aPipeline Reed-Solomon Decoder. IEEE Transactions on Computers, C-34(5):393–403, May1985. 16, 17

[24] S. Srivastava, R. McSweeney, C. Spagnol, and E. Popovici. Efficient Berlekamp-Massey basedrecursive decoder for Reed-Solomon codes. In Microelectronics (MIEL), 2012 28th Interna-tional Conference on, pages 379–382, May 2012. 8

[25] Y Sugiyama, M Kasahara, S Hirasawa, and T Namekawa. A method for solving key equationfor decoding Goppa codes. Information and Control, 27(1):87–99, 1975. 16

[26] S. Wang, Huai-Yi Hsu, and An-Yeu Wu. A very low-cost multi-mode Reed-Solomon decoderbased on Peterson-Gorenstein-Zierler algorithm. In Signal Processing Systems, 2001 IEEEWorkshop on, pages 37–48, 2001. 14

[27] S.B. Wicker and V.K Bhargava, editors. Reed-Solomon Codes and Their Applications. IEEEPress Piscataway, NJ, USA, 1994. 2, 11

[28] Y. You, J. Wang, F. Lai, and Y. Ye. VLSI design and implementation of high-speedRS(204,188) decoder. In Communications, Circuits and Systems and West Sino Expositions,IEEE 2002 International Conference on, volume 1, pages 82–86 vol.1, June 2002. 8, 14, 17,18, 22, 35

[29] B. Yuan, Z. Wang, L. Li, M. Gao, J. Sha, and C. Zhang. Area-efficient reed-solomon decoderdesign for optical communications. IEEE Transactions on Circuits and Systems II: ExpressBriefs, 56(6):469–473, June 2009. 21


Acronyms

ADAS Advanced Driver Assistance Systems. 1

BCH Bose, Chaudhuri, and Hocquenghem. 14

BM Berlekamp-Massey. 12, 14, 15, 19–22

CAN Controller Area Network. 1

CF Chien&Forney. 23, 25, 29, 37

DVD Digital Versatile Disk. 8

EA Euclidean Algorithm. 12, 14–17, 19–23, 37

EEA Extended Euclidean Algorithm. 15

EMC Electromagnetic Compatibility. 1

ESBM Early-stopped Berlekamp-Massey. 14, 15

FCR First Consecutive Root. 8, 11

GCD Greatest Common Divisor. 7, 15

GF Galois Field. 4, 7

iBM Inverse-free Berlekamp-Massey. 14, 15, 27, 28, 31–35, 37, 38

KES Key Equation Solver. 21–23, 25, 37

LIN Local Interconnect Network. 1

LVDS Low Voltage Differential Signaling. 1

MEA Modified Euclidean Algorithm. 16–18, 20–22, 37

MOST Media Oriented Systems Transport. 1

PCS physical Coding Sublayer. 4

PGZ Peterson Gorenstein Zierler. 12, 14

PHY Physical Layer. 2, 8

ROM Read Only Memory. 7


Acronyms

RS Reed-Solomon. 3–5, 8–11, 14–21, 25–29, 31, 35, 38

SC Syndrome Computation. 23, 25, 26, 37

VLSI Very-large-scale Integration. 4, 13–16, 37


Documents

Eindhoven University of Technology MASTER VLSI design of a ... · VLSI design of a Reed-Solomon decoder for gigabit automotive ethernet Xue, B. Award date: 2016 Link to publication