POLITECNICO DI MILANO
Scuola di Ingegneria Industriale e dell’Informazione
DESIGN AND IMPLEMENTATION OF 100
GBPS REED-SOLOMON DECODERS
Supervisors:
Prof. Franco Zappa
Prof. Javier Valls
Author:
Gabriele Perrone 820485
Academic Year 2015-2016
Contents

Acknowledgments ...................................................................................................................... 7
Abstract ...................................................................................................................................... 9
1 Introduction ...................................................................................................................... 11
1.1 Goals of the thesis ..................................................................................................... 12
1.2 Methods ..................................................................................................................... 12
1.2.1 Algebra ............................................................................................................... 13
1.2.1.1 Arithmetic in binary fields .......................................................................... 14
1.2.1.2 Galois fields ................................................................................................ 16
1.2.2 Modelling ........................................................................................................... 18
1.2.2.1 Matlab ......................................................................................................... 18
1.2.2.2 Python ......................................................................................................... 18
1.2.2.3 Unified Modelling Language (UML) ......................................................... 19
1.2.3 Realisation.......................................................................................................... 19
1.2.3.1 Doxygen...................................................................................................... 19
1.3 Structure of the thesis ................................................................................................ 19
2 VHDL library of arithmetic operators ............................................................................. 21
2.1 Sum............................................................................................................................ 21
2.2 Multiplication ............................................................................................................ 21
2.3 Inversion .................................................................................................................... 23
2.4 Examples of usage ..................................................................................................... 24
2.4.1 Encoder RS(7,3) in GF(2^3) ................................................................ 24
2.4.1.1 Simulation and system validation ............................................................... 29
2.4.2 Encoder RS(255,239) in GF(2^8) ........................................................ 29
2.4.2.1 Simulation and system validation ............................................................... 30
2.4.2.2 Utilisation and timing report ....................................................................... 30
2.4.2.3 Code documentation ................................................................................... 32
3 Decoder for RS codes ...................................................................................................... 33
3.1 Decoding process ...................................................................................................... 33
3.1.1 Reed-Solomon codes ......................................................................................... 33
3.1.2 Syndrome Computer .......................................................................................... 33
3.1.2.1 Horner’s rule ............................................................................................... 34
3.1.3 Berlekamp-Massey Algorithm (BMA) .............................................................. 35
3.1.3.1 Enhanced Parallel Inversionless Berlekamp-Massey Algorithm (ePIBMA) ....... 36
3.1.4 Chien Search (CS) and Forney’s algorithm ....................................................... 37
3.1.4.1 enhanced Chien Search Error Evaluation Component (eCSEE) ................ 38
3.2 Decoder RS(255,239) in GF(2^8)................................................................ 38
3.2.1 Syndrome Computer .......................................................................................... 39
3.2.2 ePIBMA component .......................................................................................... 41
3.2.3 eCSEE component ............................................................................................. 46
3.2.4 Decoder block assembly .................................................................................... 53
3.3 Decoder RS(528,514) in GF(2^10) .............................................................. 58
3.4 Compared analysis of timings and usage of resources .............................................. 60
3.4.1 Usage of resources in a generic RS(n,k) decoder .............................................. 60
3.4.1.1 Syndrome Computer ................................................................................... 60
3.4.1.2 RS_ePIBMA ............................................................................................... 60
3.4.1.3 RS_eCSEE .................................................................................................. 61
3.4.1.4 Other components ....................................................................................... 61
3.4.2 Latencies and critical paths ................................................................................ 61
3.4.3 Maximum operating frequency .......................................................................... 62
4 Design of 100 Gbps decoders .......................................................................................... 65
4.1 Parallelisation of the RS(255,239) decoder............................................................... 65
4.1.1 Background ........................................................................................................ 65
4.1.1.1 Syndrome Computer ................................................................................... 68
4.1.1.2 eCSEE ......................................................................................................... 72
4.1.1.3 Conclusions ................................................................................................ 75
4.1.2 Innovative designs ............................................................................................. 77
4.1.2.1 Proposed decoder architecture and theoretical analysis ............................. 78
4.1.2.2 FPGA implementation ................................................................................ 88
4.1.2.3 ASIC implementation ............................................................................... 105
4.1.2.4 Comparison with published results ........................................................... 109
4.2 Parallelisation of the RS(528,514) decoder............................................................. 110
4.2.1 Background ...................................................................................................... 110
4.2.2 Innovative designs ........................................................................................... 111
4.2.2.1 Proposed architecture and theoretical analysis ......................................... 111
4.2.2.2 FPGA implementation .............................................................................. 114
4.2.2.3 ASIC implementation ............................................................................... 115
5 Conclusions .................................................................................................................... 117
6 List of figures ................................................................................................................. 119
7 List of tables ................................................................................................................... 123
8 Bibliography .................................................................................................................. 125
9 Annexure ........................................................................................................................ 127
9.1 [Annex 1] Files positions ........................................................................................ 127
Acknowledgments

I thank my family for their support and encouragement throughout my years of study
and personal growth.
I must express my gratitude to Professor Franco Zappa. Had I not attended his classes during
my bachelor's degree, I would probably never have moved to Electronics Engineering, and I
would not have discovered my passion for this field.
I thank my Spanish supervisor, Professor Javier Valls, for giving me the chance to develop
this thesis with him.
Abstract

Reed-Solomon error-correction codes are used in a wide variety of fields, ranging from
telecommunications to digital data storage. Recently, this family of codes has been proposed
for high-speed connections over cable or optical fibre. In this thesis, optimised hardware
architectures were developed to reach a decoding speed of 100 Gbps. Parametric VHDL
libraries were implemented to simplify the execution of arithmetic operations in Galois
fields, the algebraic fields in which Reed-Solomon codes work. The architectures proposed
by other authors were studied in order to make the best use of the component that implements
the Berlekamp-Massey algorithm. Particular attention was also paid to an efficient usage of
memory in the decoder. The possible solutions are analysed and discussed. The two
Reed-Solomon codes studied are RS(255,239), which works in GF(2^8), and RS(528,514), which
works in GF(2^10). For each decoder, a VHDL model was implemented. After verification with
the help of Python models, the decoders were implemented on FPGAs and in 90 nm CMOS ASIC
technology. The obtained results attained the required decoding throughput and proved more
efficient, in terms of the area-time product, and lower in latency than the current state
of the art.
1 Introduction

The project consists of the development and realisation of a decoder for Reed-Solomon codes
that can perform the decoding task at high speed, granting a throughput of the order of
Gbps. These decoders are used in communication or storage systems to ensure the correct
transmission or recovery of data. Figure 1 shows the block schematic of a generic system.
The aim of the overall architecture is a reliable and efficient exchange of data between
the two endpoints. One aspect that can effectively improve the situation is the possibility
to restore data that arrived with errors. For example, imagine communicating with a probe
that is exploring Mars: with high probability there will be constraints on the energy
available and very long round-trip times for the exchange of data. In this case, having only
parity symbols would be a real waste: if the message arrives with errors it is possible to
detect them, but nothing can be done except sending a packet asking for a retransmission,
wasting time and energy. For these cases, channel-encoding systems were developed that can
not only detect errors, but also correct a certain number of mistakes that occurred in the
transmission.
The two highlighted blocks are the areas where the thesis takes place.
Figure 1) Schematisation in blocks of an RF system.
In summary, the encoding of the transmission channel has the following goals:
- Maximise the number of bits transmitted (throughput);
- Minimise the number of errors introduced in the transmission;
- Minimise the energy required;
- Minimise the bandwidth required;
- Minimise the complexity of the encoder and decoder.
Channel Encoding
Channel encoding implies a transformation of the data of the original message. The aim, as
said before, is not to use a reverse channel through which the receiver, once an error is
detected, asks for the corrupted message to be sent again. In FEC techniques, the tasks of
detecting and correcting the errors are carried out by the receiver.
To do so, from the original pack of data the encoder produces some redundancy bits in order
to give the receiver the possibility to restore the original message. These FEC codes are
generally characterised by a few main parameters:
n = total number of bits of the packet (code bits)
k = number of bits of the original packet (data bits)
t = maximum number of errors that the code can correct
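The relation between these parameters depends on the code; for the Reed-Solomon codes treated later, the n − k redundancy symbols correct up to t = (n − k)/2 symbol errors. A minimal sketch (the function name is illustrative):

```python
def rs_correction_capability(n, k):
    # for Reed-Solomon codes, the n-k parity symbols correct up to t = (n-k)//2 errors
    return (n - k) // 2

# the two codes studied in the thesis
print(rs_correction_capability(255, 239))  # -> 8
print(rs_correction_capability(528, 514))  # -> 7
```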
Before describing the composition and structure of the codes adopted for the project, some
details about the particular algebra used in these numerical fields must be explained. These
details are deferred to later sections.
1.1 Goals of the thesis

In the literature, a lot of work has been done on decoders for Reed-Solomon codes, which are
currently used in many devices for various purposes. The main idea of the project is to reach
the Ethernet speed (100 Gbps) using the smallest possible amount of resources. The focus of
the thesis is mainly on RS(255,239) and RS(528,514), in order to include them in 100 Gbps
communication standards. The trade-off among the area occupation, the throughput and the
latency of a decoder is present in every previous work analysed. The aim is to decouple the
three parameters as much as possible in order to reduce the compromise and attain a better
solution. To do so, a study on the way resources are used is carried out, in order to obtain
a generic method for determining the optimal solution. To reach the goal, a new approach
that brings plenty of advantages is used. Before explaining it, some methods must be
introduced.
1.2 Methods

Before proceeding with the development of the work, the main tools used for the thesis are
introduced here and the way they are used is explained.
1.2.1 Algebra

To understand how FEC codes work, the algebra used in them has to be explained. Since it is
not straightforward, and since it is at the basis of the algorithms, this section introduces
some theorems and definitions that help fix the concepts of this particular algebra. Since
the field is very broad, it is not possible to cover all the details, only the concepts
fundamental to the operations.
Field: Let F be a set of elements on which two binary operations, addition and multiplication,
are defined. The set F together with two binary operations “+” and “∙” is a field if the following
conditions are satisfied:
i. F is a commutative group under addition “+”. The identity element with respect to
addition is called the zero element or the additive identity of F and is denoted by 0.
ii. The set of nonzero elements in F is a commutative group under multiplication. The
identity element with respect to multiplication is called the unit element or
multiplicative identity of F and is denoted by 1.
iii. Multiplication is distributive over addition; that is, for any three elements a, b, and
c in F:
a ∙ (b + c) = a ∙ b + a ∙ c
The field therefore consists of at least two elements: the additive identity and the multiplicative
identity.
The order of a field is the number of elements belonging to the field.
A Finite field is a field with a finite number of elements.
Properties
1) For every element a in a field, a∙0 = 0∙a = 0.
2) For any two nonzero elements a and b in a field, a∙b ≠ 0.
3) a∙b = 0 and a ≠ 0 imply that b = 0.
4) For any two elements a and b in a field:
−(a ∙ b) = (−a) ∙ b = a ∙ (−b)
5) For a ≠ 0, a∙b = a∙c implies b = c.
The set {0,1} is called the binary field and is denoted GF(2). It is very important in coding
theory and is widely used in digital computers and digital data transmission.
Examples of modulo-2 addition and modulo-2 multiplication are presented in the tables below.

+ | 0 1
0 | 0 1
1 | 1 0
Table 1) Modulo-2 addition.

∙ | 0 1
0 | 0 0
1 | 0 1
Table 2) Modulo-2 multiplication.
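The two tables can be reproduced in a couple of lines, since in GF(2) addition reduces to the XOR of the operands and multiplication to the AND (an illustrative sketch):

```python
# GF(2) arithmetic on the elements {0, 1}
def gf2_add(a, b):
    return a ^ b   # modulo-2 addition (Table 1)

def gf2_mul(a, b):
    return a & b   # modulo-2 multiplication (Table 2)

print([[gf2_add(a, b) for b in (0, 1)] for a in (0, 1)])  # -> [[0, 1], [1, 0]]
print([[gf2_mul(a, b) for b in (0, 1)] for a in (0, 1)])  # -> [[0, 0], [0, 1]]
```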
Let p be a prime number and GF(p) a finite field of p elements; for any positive integer m
it is possible to extend the prime field GF(p) to a field of p^m elements, which is called
an extension field of GF(p) and is denoted GF(p^m). It has been proved that the order of any
finite field is a power of a prime.
Since a field is closed under addition, the results of additions have to lie inside the
field. A finite field has a finite number of elements, so there has to be some repetition:
the sums of the unit element with itself cannot all be distinct. Thus, two positive integers
n and m must exist, with m < n, for which the following expression holds:
Σ_{i=1}^{n} 1 = Σ_{i=1}^{m} 1

Doing some algebra, this can be written in a new form:

Σ_{i=1}^{n−m} 1 = Σ_{i=1}^{λ} 1 = 0
where λ is the smallest number that satisfies the relation and is called the characteristic
of the field GF(q). It can be demonstrated that the characteristic of a finite field is prime.
Theorems
i) Let a be a non-zero element of the finite field GF(q); then a^(q−1) = 1.
ii) The order n of the element a of the previous theorem divides q − 1.
An element a of order q − 1 is called primitive.
1.2.1.1 Arithmetic in binary fields
Although it is possible to construct a code starting from any Galois field, in digital
electronics it is most common to create codes starting from GF(2) or its extensions GF(2^m).
Starting with the simple GF(2), the arithmetic is really similar to ordinary arithmetic,
except that the element 2 is equal to 0. This small change implies that in a simple addition
1+1 = 2 = 0, and therefore the unit element satisfies 1 = −1. Thus, subtraction in such a GF
gives the same result as addition, and vice versa.
As an example of the application of this property, consider the solution of the following
system:

X + Y = 1
X + Z = 0
X + Y + Z = 1

Simply using the property described above:

X = 1 − Y = 1 + Y
Z = −X = X
X + Y + X = Y = 1 (from the third equation)

which gives X = 0, Z = 0, Y = 1.
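Since a GF(2) system has finitely many candidate solutions, the result can also be checked by exhaustive search, with "+" replaced by XOR (a quick verification sketch):

```python
from itertools import product

# brute-force the system X+Y=1, X+Z=0, X+Y+Z=1 over GF(2)
solutions = [(x, y, z) for x, y, z in product((0, 1), repeat=3)
             if x ^ y == 1 and x ^ z == 0 and x ^ y ^ z == 1]
print(solutions)  # -> [(0, 1, 0)], i.e. X=0, Y=1, Z=0
```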
A generic polynomial over the binary field GF(2) can be written as:

f(X) = f_0 + f_1·X + f_2·X^2 + … + f_n·X^n

where all the coefficients f_i can be either 0 or 1. The largest power of X with a non-zero
coefficient is the degree of the polynomial: if f_n ≠ 0, then n is the degree of the
polynomial.
When "a polynomial over GF(2)" is written in what follows, it means "a polynomial whose
coefficients are elements of GF(2)".
In GF(2) there are in general 2^n polynomials of degree n. For example, there are two
polynomials of degree 1: X and 1+X; and four of degree 2: X^2, 1+X^2, X+X^2 and 1+X+X^2.
Polynomials over GF(2) follow the usual rules for addition, subtraction, multiplication and
division, except that the coefficients have the properties of the GF.
Suppose we want to divide f(X) = 1+X+X^4+X^5+X^6 by g(X) = 1+X+X^3. The division can be
performed with Euclid's division algorithm. At the first step, the partial quotient is X^3.
Multiplying the divisor by the partial quotient gives p1(X) = X^3+X^4+X^6. Adding this first
partial polynomial to the dividend gives p2(X) = 1+X+X^3+X^5. Continuing the operations, the
quotient is q(X) = X^3+X^2 and the remainder is r(X) = 1+X+X^2.
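The same division can be checked mechanically. Encoding a polynomial as an integer whose bit i is the coefficient of X^i (so f = 0b1110011 and g = 0b1011), long division over GF(2) reduces to shifts and XORs (an illustrative sketch):

```python
def gf2_divmod(f, g):
    # long division of binary polynomials; bit i of an int is the coefficient of X^i
    q = 0
    while f.bit_length() >= g.bit_length():
        shift = f.bit_length() - g.bit_length()
        q ^= 1 << shift        # add X^shift to the quotient
        f ^= g << shift        # subtract (= XOR) the shifted divisor
    return q, f

# f(X) = 1+X+X^4+X^5+X^6, g(X) = 1+X+X^3
q, r = gf2_divmod(0b1110011, 0b1011)
print(bin(q), bin(r))  # -> 0b1100 (X^2+X^3) and 0b111 (1+X+X^2)
```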
The elements a that, substituted into the polynomial, give 0 as a result are called roots:

f(a) = 0

A root of the polynomial f(X) = 1+X+X^2+X^4 is 1: f(1) = 1+1+1^2+1^4 = 1+1+1+1 = 0. Dividing
the polynomial by the corresponding factor 1+X, it can be written as
f(X) = (1+X)(X^3+X^2+1).
A polynomial over GF(2) with an even number of terms is divisible by 1+X.
A polynomial f(X) of degree m is called irreducible over GF(2) if it is not divisible by any
polynomial over GF(2) of degree less than m but greater than 0.
Theorem: Any irreducible polynomial over GF(2) of degree m divides X^(2^m−1) + 1.
Any irreducible polynomial p(X) of degree m is said to be primitive if the smallest positive
integer n for which p(X) divides X^n + 1 is n = 2^m − 1.
1.2.1.2 Galois fields
Construction of a Galois field
To start the construction of a GF, we begin with the two elements 0 and 1 and with the
element α. The following operations are defined by the properties exposed in the previous
section:

0 ∙ 1 = 1 ∙ 0 = 0
1 ∙ 1 = 1
0 ∙ α = α ∙ 0 = 0
1 ∙ α = α ∙ 1 = α
α ∙ α = α^2
α ∙ α ∙ α = α^3
α^i ∙ α^j = α^(i+j)

The resulting field is F = {0, 1, α, α^2, …, α^j, …}. The first non-zero element 1 can also
be referred to as α^0. To continue building the field, we add the conditions that the field
F contains only 2^m elements and is closed under multiplication.
Taking a primitive polynomial p(X) of degree m over GF(2), from the previous theorem we can
assert that p(X) divides X^(2^m−1) + 1:

X^(2^m−1) + 1 = p(X) ∙ q(X)

Supposing α is a root of the primitive polynomial, p(α) = 0. Replacing X with α:

α^(2^m−1) + 1 = 0 ∙ q(α) = 0
α^(2^m−1) = 1

Therefore, it is now possible to write the newly created field as
F = {0, α^0, α^1, α^2, …, α^(2^m−2)}. Taking two exponents i and j such that
0 ≤ i, j ≤ 2^m − 2, and knowing that α^i ∙ α^j = α^(i+j), if i + j ≥ 2^m − 1 we can write
i + j = (2^m − 1) + r with 0 ≤ r ≤ 2^m − 2. Hence
α^(i+j) = α^((2^m−1)+r) = α^(2^m−1) ∙ α^r = α^r, and so the closure under multiplication is
verified.
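This exponent arithmetic is what makes multiplication cheap in practice: nonzero elements multiply by adding their exponents modulo 2^m − 1. A sketch for GF(2^3) generated by p(X) = 1 + X + X^3, using the integer representation of the elements:

```python
# powers of alpha in GF(2^3) generated by p(X) = 1 + X + X^3 (integer representation)
exp = [1, 2, 4, 3, 6, 7, 5]              # exp[i] = alpha^i
log = {v: i for i, v in enumerate(exp)}  # inverse mapping

def gf8_mul(a, b):
    # multiply nonzero elements by adding exponents modulo 2^3 - 1 = 7
    if a == 0 or b == 0:
        return 0
    return exp[(log[a] + log[b]) % 7]

print(gf8_mul(3, 6))  # alpha^3 * alpha^4 = alpha^7 = alpha^0 -> 1
```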
The next step is the definition of the addition operation. Taking 0 ≤ i ≤ 2^m − 2, we can
divide the polynomial X^i by the polynomial p(X):

X^i = p(X) ∙ q_i(X) + a_i(X)

where q_i and a_i are respectively the quotient and the remainder. Since X and p(X) are
relatively prime (only the element 1 in common), the remainder is of the form:

a_i(X) = a_{i,0} + a_{i,1}·X + … + a_{i,m−1}·X^(m−1)

and a_i(X) ≠ 0 for any non-negative i.
Taking two indices i and j inside the valid range [0; 2^m − 2], the remainder polynomials
a_i(X) and a_j(X) cannot be equal (the demonstration is not presented here).
Thus, for i = 0, 1, 2, …, 2^m − 2 we obtain 2^m − 1 distinct polynomials of degree at most
m − 1. Replacing X with α we have α^i = a_i(α) and, by derivation from the previous line, we
have obtained 2^m − 1 distinct non-zero polynomials in GF(2^m). So the 2^m elements of F can
be represented by 2^m distinct polynomials of α over GF(2) with degree at most m − 1. The
addition operation is commutative in F and F is closed under "+". The binary field GF(2) is
also called the ground field.
For a Galois field there are therefore several representations, which are summarised in the
table below. In this example, the aim is to build a GF(2^3) field generated by
p(X) = 1+X+X^3. From p(X) = 0 we get X^3 = 1+X, and from this equation we can start to build
the field.
Power | Polynomial   | Tuple | Integer
0     | 0            | 000   | 0
1     | 1            | 001   | 1
α     | α            | 010   | 2
α^2   | α^2          | 100   | 4
α^3   | 1 + α        | 011   | 3
α^4   | α + α^2      | 110   | 6
α^5   | 1 + α + α^2  | 111   | 7
α^6   | 1 + α^2      | 101   | 5
Table 3) Representations of the Galois field formed with three bits.
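The table can be regenerated programmatically: each new power of α is a left shift (multiplication by α), reduced with p(X) whenever the degree reaches m. A sketch (the function name and integer encoding are illustrative):

```python
def build_gf_exp(m, prim):
    # successive powers of alpha in GF(2^m); bit i of prim is the coefficient of X^i in p(X)
    exp, x = [], 1
    for _ in range(2**m - 1):
        exp.append(x)
        x <<= 1                # multiply by alpha
        if x & (1 << m):       # degree reached m: reduce modulo p(X)
            x ^= prim
    return exp

print(build_gf_exp(3, 0b1011))  # -> [1, 2, 4, 3, 6, 7, 5], the Integer column of Table 3
```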
The first property is that a root of a polynomial with coefficients in GF(2) can lie in an
extension field GF(2^m). The concept is similar to that of the complex roots of a real
polynomial.
Theorem: taking β (an element of GF(2^m)) as a root of a polynomial f(X) with coefficients
in GF(2), then for any non-negative l the element β^(2^l) is also a root of f(X). The element
β^(2^l) is called a conjugate of β.
Theorem: the minimal polynomial ϕ(X) of a field element β is irreducible.
Theorem: taking a polynomial f(X) over GF(2) and the minimal polynomial ϕ(X) of the field
element β, if β is a root of f(X), then f(X) is divisible by ϕ(X). If, moreover, f(X) is
irreducible, then necessarily ϕ(X) = f(X).
Theorem: if β is a primitive element of GF(2^m), all its conjugates β^2, β^4, β^8, … are
also primitive elements of GF(2^m).
This leads to the conclusion that, taking β of order n in GF(2^m), all its conjugates have
the same order.
Vector Spaces
Taking V, a set of elements on which addition is defined, and F, a field, and defining a
multiplication between the elements of V and F, the set V is called a vector space over the
field F if:
i) V is a commutative group under addition.
ii) For any element a in F and any element v in V, a∙v is an element of V.
iii) (Distributive laws) For any elements u and v in V and any elements a and b in F:
a ∙ (u + v) = a ∙ u + a ∙ v
(a + b) ∙ v = a ∙ v + b ∙ v
iv) (Associative law) For any v in V and any a and b in F:
(a ∙ b) ∙ v = a ∙ (b ∙ v)
v) With 1 the unit element of F, for any v in V, 1∙v = v.
The elements of V are called vectors and the elements of the field F are called scalars.
1.2.2 Modelling

To realise the decoder, it was really important to first model the algorithms and the
components. The models have two main purposes: allowing a correct and detailed understanding
of how the algorithms work, and allowing easier debugging while implementing the components
in VHDL. In fact, having software models of the components made partial results and variable
values available during the debugging of the components, as well as for the verification and
testing of the devices realised.
1.2.2.1 Matlab
Matlab is a standard program for modelling and prototyping in the scientific field. This
software was used to understand the algorithms in the first place, because it is easy and
straightforward to use. Thanks to its simple interfaces and its dedicated library for Galois
fields, coding with this tool was fast.
1.2.2.2 Python
Afterwards, once sure of how the algorithms work, Python was used for a better modelling of
the components. The main annoying aspect of Matlab is its notation for accessing the
elements of an array (the first element is at index one instead of zero). This, and the
greater programming freedom, led to the usage of Python in some parts of the project.
Moreover, with Python the advantage of an interactive console was not lost, since it is an
interpreted language; it gives easier access to basic tools such as file streams and string
parsing, useful for VHDL test benches; and, being a fully fledged programming language, it
leaves more freedom while coding.
1.2.2.3 Unified Modelling Language (UML)
Together with Matlab and Python, UML was adopted as another modelling standard. UML gave the
tools necessary to have a clearer vision of the procedures to be carried out. With this
language, the algorithms were schematised as flowcharts, which made the implementation
easier.
1.2.3 Realisation

After realising the models, the following step was to implement them in hardware. For this,
a hardware description language was adopted.
The choice of description language was between Verilog and VHDL. They are quite equivalent
and there is no big difference but, since VHDL is stricter and more precise, the choice fell
on VHDL. Moreover, in this language the parametrisation of components is easier, so it
turned out to be more powerful for the realisation of scalable structures.
Regarding the FPGA implementation, the IDE and first basic simulator used in the project is
Xilinx Vivado® 2016. For better and more detailed simulations, the use of Altera ModelSim
10.3d was necessary.
For the ASIC implementations of the designs, used to compare them with the results in the
literature, Cadence SOC Encounter was employed, which takes a VHDL design and generates an
equivalent integrated circuit.
1.2.3.1 Doxygen
The realisation of a complex system coded in VHDL posed a problem of documentation and
reusability of the code. For this reason the program Doxygen was used to document the code
and make understandable the usage of the packages and components realised during the
development of the decoder. Doxygen is a software tool that automatically generates HTML and
LaTeX documentation of the files, provided they are appropriately commented.
1.3 Structure of the thesis

The work of the thesis is divided into a few main parts. A first important chapter is
dedicated to the study of Galois field arithmetic. This is the necessary basis for
understanding how the various decoding algorithms work. After understanding the arithmetic
operations and modelling them in Python, they also have to be implemented in VHDL so that
they are easier to use during the implementation of more complex systems. In fact, splitting
the complexity of the operations makes the design itself easier.
A second chapter is dedicated to the decoding algorithms. After presenting the main
algorithms and architectures for the decoder, the algorithms are implemented in working
decoders and the first results are obtained. The drawbacks and advantages of the analysed
solutions are discussed. These lead to new, innovative designs (chapter 4) for reaching the
throughput of 100 Gbps efficiently. Finally, the results are numerically analysed and in the
last chapter some conclusions are drawn.
2 VHDL library of arithmetic operators

This first chapter of the work is dedicated to the arithmetic operators on which the whole
thesis is based. The aim pursued here is to obtain an easy-to-use library for carrying out
computations in Galois fields. The operators were developed in a dedicated package so that
their reuse in different contexts is facilitated.
2.1 Sum

The first operation needed by the algorithms is a basic sum between two numbers in the
Galois field. The sum is simply a XOR gate applied between the corresponding bits of the
numbers. An example of the operation (between unsigned values) can be seen below. The
leftmost bit is the MSB.

A = [0 1 1 0 1 0 1 0] = 64+32+8+2 = 106
B = [1 0 1 1 0 0 0 1] = 128+32+16+1 = 177

A 0 1 1 0 1 0 1 0
B 1 0 1 1 0 0 0 1
C 1 1 0 1 1 0 1 1

So the result is C = A+B = [1 1 0 1 1 0 1 1] = 128+64+16+8+2+1 = 219.
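In software the same sum is a single XOR on the integer representations of the bit patterns above (a one-line check):

```python
# GF(2^m) addition is bitwise XOR: no carries propagate between bit positions
a = 0b01101010   # A = 106
b = 0b10110001   # B = 177
print(bin(a ^ b), a ^ b)  # -> 0b11011011 219
```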
2.2 Multiplication

This operation is more complex and had to be analysed more deeply. First of all, a flowchart
that explains how the operation works is presented below.
Figure 2) Working operation of the multiplication in Galois fields.
Below, the working operation is explained with an example so that it is clearer and more
straightforward to understand.

A = 156 = [1 0 0 1 1 1 0 0]
B = 67 = [0 1 0 0 0 0 1 1]
P = primitive polynomial = 285 = [1 0 0 0 1 1 1 0 1]

First, the carry-less (partial) product is formed: a copy of A, shifted left by the position
of each set bit of B (here X^0, X^1 and X^6), is XOR-ed into the partial result:

1 0 0 1 1 1 0 0              (A, bit 0 of B)
1 0 0 1 1 1 0 0 -            (A shifted by 1, bit 1 of B)
1 0 0 1 1 1 0 0 - - - - - -  (A shifted by 6, bit 6 of B)

0 1 0 0 1 1 0 1 0 1 0 0 1 0 0  (carry-less product)

Then the product is reduced: P, aligned under the current leading 1, is XOR-ed in until the
degree drops below 8:

0 1 0 0 1 1 0 1 0 1 0 0 1 0 0
  1 0 0 0 1 1 1 0 1
0 0 0 1 0 1 0 0 0 0 0 1 0 0
      1 0 0 0 1 1 1 0 1
0 0 1 0 1 1 1 0 0 0 0
    1 0 0 0 1 1 1 0 1
0 0 1 1 0 1 1 0 1

leaving the final 8-bit result [0 1 1 0 1 1 0 1].
So the result of the operation is C=A*B=[0 1 1 0 1 1 0 1] = 64+32+8+4+1 = 109
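The two stages of the flowchart (carry-less product, then reduction by P) can also be sketched compactly in Python on the integer representations (an illustrative sketch mirroring the two Matlab scripts):

```python
def gf_mult(a, b, prim=0b100011101, m=8):
    # stage 1: carry-less product, one shifted copy of a per set bit of b
    p = 0
    for i in range(b.bit_length()):
        if (b >> i) & 1:
            p ^= a << i
    # stage 2: reduce modulo the primitive polynomial (285 = X^8+X^4+X^3+X^2+1)
    while p.bit_length() > m:
        p ^= prim << (p.bit_length() - 1 - m)   # cancel the current leading term
    return p

print(gf_mult(156, 67))  # -> 109, matching the worked example
```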
The script corresponding to this operation was developed in Matlab. The first function
computes the carry-less (partial) multiplication between the two numbers:

function [C] = mult_stage1(A, B)
    n = length(A);
    length_C = 2*(n-1);
    C = zeros(length_C, 1);
    C(1) = and(A(1), B(1));           % first element
    C(length_C+1) = and(A(n), B(n));  % last element
    for (i = 2:length_C)
        if (i <= n)
            r = and(A(i), B(1));
            for (j = 2:i)
                r = xor(r, and(B(j), A(i-j+1)));
            end
            C(i) = r;
        else
            delta = i - n;
            r = and(A(n), B(delta+1));
            for (j = delta+1:n-1)
                r = xor(r, and(A(n-j+delta), B(j+1)));
            end
            C(i) = r;
        end
    end
end
The second function performs the modulo reduction by the primitive polynomial P:

function [C] = mult_stage2(M, P)
    for (i = length(M):-1:length(P))
        if (M(i) == 1)
            for (j = length(P):-1:1)
                if (P(j) == 1)
                    M(i+j-length(P)) = not(M(i+j-length(P)));
                end
            end
        end
    end
    C = M;
end
Having verified the models, and thus finished the modelling, the same operations have to be
implemented in VHDL. The best solution is to implement a package that can be used easily
when coding, so an overload of the operators "+" and "*" was defined for the
std_logic_vector subtype gf_logic_vector. In this way, their usage in the other parts of
the project is fast: if an operation like D=(A+B)*C has to be performed, it is only
necessary to write "D<=(A+B)*C", provided that A, B, C, D are of the subtype
gf_logic_vector. Another basic subtype of gf_logic_vector that was defined is gf_symbol,
i.e. a gf_logic_vector of length m.
2.3 Inversion
The inversion of an element is needed especially in the computation of the errors that occurred. This operation is quite simple from a mathematical point of view, but it can be implemented in multiple ways.
Supposing we want to invert a generic element α^i of the Galois field GF(2^m), with n = 2^m − 1 non-zero elements, the inverted element corresponds to α^(n−i). Therefore, the simplest way of inverting an element from an electronic point of view is the use of a table: the element to be inverted is presented at the input and addresses a table that brings the corresponding inverted element to the output. The procedure that generates the inverted elements is modelled with the following Python code, which implements the mechanism shown in Figure 3:
inv = [0]*(n+1)
inv[1] = 1                       # 1 is its own inverse
for i in range(1, pow(2, m) - 1):
    u = gf_pow(2, i)             # element alpha^i to be inverted
    u_1 = 1
    for j in range(1, n - i + 1):
        u_1 = F.Multiply(u_1, 2) # u_1 becomes alpha^(n-i)
    inv[u] = u_1
Figure 3) Algorithm used for inverting an element "u".
The VHDL function is instead the following:

--! @brief Function that generates automatically a vector for inverting the elements of a field
function generate_inverted_elements return inverted_elements_type is
  constant alpha : gf_symbol := conv_gf_logic_vector(std_logic_vector(to_unsigned(2, m)));  -- element alpha^1 = 2
  variable u     : gf_symbol := conv_gf_logic_vector(std_logic_vector(to_unsigned(1, m)));  -- element to be inverted
  variable u_1   : gf_symbol;                            -- inverted element
  variable inverted_elements : inverted_elements_type;   -- all the inverted elements
begin
  -- the first two elements are handled separately
  inverted_elements(0) := conv_gf_logic_vector(std_logic_vector(to_unsigned(0, m)));
  inverted_elements(1) := u;
  for i in 1 to 2**m - 2 loop
    u := u * alpha;   -- element to be inverted: alpha^i
    u_1 := conv_gf_logic_vector(std_logic_vector(to_unsigned(1, m)));  -- initialise the inverse to 1
    for j in 1 to n - i loop
      u_1 := u_1 * alpha;   -- u_1 becomes alpha^(n-i)
    end loop;
    -- store the inverse at the index of the element it inverts
    inverted_elements(to_integer(unsigned(u))) := u_1;
  end loop;
  return inverted_elements;
end function;
This function is called in the code for the initialisation of the ROM inversion tables in order to
generate all the inverted elements.
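The same generation mechanism can be validated in Python on the small field GF(2^3): the table is rebuilt exactly as described, and every non-zero element multiplied by its stored inverse must give 1 (a sketch; `gf_mul` is an illustrative helper, not the thesis code):

```python
m = 3
n = 2**m - 1            # number of non-zero field elements
PRIM = 0b1011           # primitive polynomial x^3 + x + 1 for GF(2^3)

def gf_mul(a, b):
    # Carry-less multiply, then reduce modulo the primitive polynomial.
    prod = 0
    for i in range(m):
        if (b >> i) & 1:
            prod ^= a << i
    for deg in range(2 * m - 2, m - 1, -1):
        if (prod >> deg) & 1:
            prod ^= PRIM << (deg - m)
    return prod

# Build the inversion table as in the algorithm above:
# the inverse of alpha^i is alpha^(n-i), obtained by repeated multiplication by alpha.
inv = [0] * (n + 1)
inv[1] = 1
u = 1
for i in range(1, n):
    u = gf_mul(u, 2)            # u = alpha^i
    u_1 = 1
    for _ in range(n - i):
        u_1 = gf_mul(u_1, 2)    # u_1 = alpha^(n-i)
    inv[u] = u_1

# Every non-zero element times its inverse must be 1.
for x in range(1, n + 1):
    assert gf_mul(x, inv[x]) == 1
```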
2.4 Examples of usage
The encoder is the component that receives at its input the data to be sent and encodes it, computing the parity symbols. Though the implementation of the encoder was not strictly necessary, it was used as an exercise to verify the GF Arithmetic package and to generate some encoded messages to be tested later on the decoder.
2.4.1 Encoder RS(7,3) in GF(2^3)
The first simple component designed was an encoder for the RS(7,3) code working in GF(2^3), which can be seen in the figure below. Since the number of symbols and the number of bits per symbol are low, debugging the device was easier. In fact, in this implementation it was possible to see how the synthesiser was implementing the device. Only once the architecture of the encoder was completely verified was it possible, in a second moment, to increase the number of symbols and bits in order to arrive at the final result.
Figure 4) Schematic of a RS(7,3) encoder.
For the first "k" cycles the two switches stay on the "a" contact, while in the next "n-k" cycles they stay on the "b" contact. Before the explanation of the system, it is worth observing the critical path, which consists of: adder → switch → multiplier → adder → register. Obviously the adders and multipliers, since they implement operations in the Galois field, are built as shown in the previous section.
To explain the working operation better, we can see how this structure works with an example for RS(7,3). Let's take m=[2, 7, 3] as the message to be encoded. The generator polynomial for GF(2^3) is g(X) = α^3 + α X + X^2 + α^3 X^3 + X^4, so g=[3, 2, 1, 3, 1]. These coefficients are the multiplication factors used in the structure shown in Figure 4. Before starting the example, the register values are obviously set to 0.
Cycle 1 (m=2)
Output of switch 1 = 2+0 = 2
Reg(0) = 2·3+0 = α·α^3 = α^4 = 6
Reg(1) = 2·2+0 = α·α = α^2 = 4
Reg(2) = 2·1+0 = α = 2
Reg(3) = 2·3+0 = α·α^3 = α^4 = 6
Cycle 2 (m=7)
Out_switch1 = 7+6 = 1
Reg(0) = 1·3 = 3
Reg(1) = 1·2+6 = 2+6 = 4
Reg(2) = 1·1+4 = 1+4 = 5
Reg(3) = 1·3+2 = 3+2 = 1
And so on for the remaining cycles. This simple example also shows how the hardware was debugged using Python: the Python script automatically generated these results to help locate mistakes in the VHDL code, as can be seen in Figure 5.
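The cycle-by-cycle values above can be reproduced with a short Python model of the encoder's feedback structure (a sketch; `gf_mul` is an illustrative GF(8) multiply helper using the primitive polynomial x^3 + x + 1, and g holds the multiplier coefficients g0..g3 read off the schematic):

```python
PRIM, M = 0b1011, 3            # GF(2^3), primitive polynomial x^3 + x + 1

def gf_mul(a, b):
    # Carry-less multiply, then reduce modulo the primitive polynomial.
    prod = 0
    for i in range(M):
        if (b >> i) & 1:
            prod ^= a << i
    for deg in range(2 * M - 2, M - 1, -1):
        if (prod >> deg) & 1:
            prod ^= PRIM << (deg - M)
    return prod

g = [3, 2, 1, 3]               # multiplier coefficients of the four encoder cells
reg = [0, 0, 0, 0]

for sym in [2, 7, 3]:          # message symbols, one per clock cycle
    fb = sym ^ reg[3]          # output of switch 1 (feedback)
    # each register takes feedback*gi plus the previous register's old value
    reg = [gf_mul(fb, g[0]),
           gf_mul(fb, g[1]) ^ reg[0],
           gf_mul(fb, g[2]) ^ reg[1],
           gf_mul(fb, g[3]) ^ reg[2]]
    print(reg)                 # after cycle 1: [6, 4, 2, 6]; after cycle 2: [3, 4, 5, 1]
```

After the third message symbol the registers hold the parity symbols, which are then shifted out through the "b" contact of the switches.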
Figure 5) Python RS(7,3) Encoder for message 0 [2,7,3].
After drawing the schematic of the encoder and realising the Python model for the encoding calculations, the VHDL design of the RS Encoder started. The first stage consisted in the definition of the black box mask of the component, which can be seen in Figure 6.
Figure 6) Black box representation of the RS Encoder in VHDL.
The RS Encoder block has as inputs:
- a data bus of "m" bits for the message to be encoded;
- a clock signal port;
- a synchronous reset port;
- an enable port.
For the outputs instead it has:
- a data bus of "m" bits for the encoded message;
- a par signal that is '0' if the output is not a parity symbol and '1' if it is;
- a ready signal that is '0' if there is no valid output and '1' if a valid output is given.
To realise the schematic of Figure 4, it can be noticed that there are repetitive parts that can be schematised as cells. In fact, the Multiplier-Adder-Register block can be the basic systolic cell of the encoder; it was called RSEncUnit. Using this cell, the schematic of Figure 4 is changed into the schematic of Figure 7. Since it doesn't have any adder, the first cell can be schematised as an RSEncUnit with one addend set to zero.
Figure 7) Encoder RS(7,3) working in GF(8) internal structure.
The cell realised has the black box representation shown in Figure 8 and the internal structure
is shown in Figure 9.
Figure 8) RSEncUnit black box.
Figure 9) RSEncUnit internal schematic and connections.
For representing the generator polynomial, an integer calculated as follows is used in the code:
g(X) = X^4 + α^3 X^3 + X^2 + α X + α^3 = [1,3,1,2,3] = [001 011 001 010 011] = 5715
The generator polynomial is converted into a constant to make sure that, if possible, the synthesiser will simplify and optimise the architecture.
In the code for the RS(7,3) encoder, the architecture was implemented by instantiating the component RSEncUnit four times with a generate statement and interconnecting the blocks with an array of gf_logic_vectors of m bits; this data bus is called out_reg. Switch 1 was realised with combinational logic (a conditional assignment), while for switch 2 a process was used in order to register the output. The multiplexers of the two switches are controlled by a control signal, switch_control_signal, which is a std_ulogic: it is low while message symbols are arriving and high otherwise.
2.4.1.1 Simulation and system validation
After finishing the implementation, a test-bench had to be realised for both the Python and the VHDL systems. The test is schematised in Figure 10. Basically, the test consists in sending three messages to be encoded without any pause and checking that the output is exactly the one desired. Python was used to check the correctness of the encoder. The messages chosen for the test were [2,7,3], [4,0,6] and [5,1,1].
Figure 10) Blocks representing the test-bench system.
2.4.2 Encoder RS(255,239) in GF(2^8)
After having obtained the basic version of the encoder, the encoder for the code that is actually going to be used, i.e. RS(255,239), was designed. This encoder works in GF(2^8), so 8 bits are needed for every symbol: the symbol used is the byte. The first step was to modify the GF_Arithmetic module appropriately. Since the quantity of data to try and verify is significantly larger than for the previous device, the methodology had to be modified and every component programmed slightly differently. The first part of the new encoder was the creation of a script for the generation of the messages, together with the proper modifications to the original Python script in order to accomplish the correct encoding.
The message generator script simply takes three strings (Suetonius's "Alea iacta est", "Carpe Diem" and Cato the Elder's "Carthagum delendam est"), cuts them or zero-pads them so that the length is exactly "k" (239 in this case) and then writes the ASCII number corresponding to each character of the string into a stream file, used afterwards by both the Python and the VHDL encoder.
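A minimal Python sketch of such a generator (the output filename and the exact padding convention are illustrative assumptions, not taken from the thesis):

```python
k = 239                                  # message symbols per RS(255,239) codeword

def make_message(text):
    """Truncate or zero-pad a string so it is exactly k ASCII symbols long."""
    codes = [ord(c) for c in text[:k]]
    return codes + [0] * (k - len(codes))

messages = ["Alea iacta est", "Carpe Diem", "Carthagum delendam est"]

# One symbol (decimal ASCII code) per line, three messages back to back.
with open("stream.txt", "w") as f:
    for text in messages:
        for symbol in make_message(text):
            f.write("%d\n" % symbol)
```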
The Python encoder takes these files as input and encodes them in the same way as realised previously. The computed output is printed to different files in two notations: binary and decimal.
With the Python code finished and ready to support the VHDL debugging, the next part was the development of the encoder in VHDL. The first problem faced was the generator polynomial: in the RS(7,3) encoder it had been coded as an integer (5715 = [001 011 001 010 011]), but in this case the length of the code could not be supported by the VHDL integer type. A new subtype of integer, gen_integer, of m bits, used for the generator polynomial's coefficients, had then to be defined. Then, an array type of gen_integer was defined so that all the polynomial coefficients can stay in one compact variable, divided into smaller parts.
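To see why a single integer cannot hold g(X) here, it can be computed in Python: g(X) = (X − α^1)(X − α^2)···(X − α^16) has 17 eight-bit coefficients, i.e. 136 bits in total (a sketch; `gf_mul` is an illustrative GF(256) multiply helper):

```python
PRIM, M = 0b100011101, 8           # GF(2^8), primitive polynomial 285

def gf_mul(a, b):
    # Carry-less multiply, then reduce modulo the primitive polynomial.
    prod = 0
    for i in range(M):
        if (b >> i) & 1:
            prod ^= a << i
    for deg in range(2 * M - 2, M - 1, -1):
        if (prod >> deg) & 1:
            prod ^= PRIM << (deg - M)
    return prod

# g(X) = (X + alpha^1)(X + alpha^2) ... (X + alpha^16), built incrementally.
# Coefficients are stored highest degree first; in GF(2^m), minus equals plus.
g = [1]
root = 1
for _ in range(16):
    root = gf_mul(root, 2)         # the next root alpha^i
    g = ([1]
         + [g[j + 1] ^ gf_mul(g[j], root) for j in range(len(g) - 1)]
         + [gf_mul(g[-1], root)])

print(len(g), 8 * len(g))          # 17 coefficients, 136 bits in total
```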
2.4.2.1 Simulation and system validation
This section is the analogue of the one for RS(7,3), but since the component is more complex, there are more tests to make. Basically, two behavioural test-benches were realised. The first one, Generic Architecture, is the encoding of one message of 239 bytes, while the second one, Multi_Message_Architecture, is the sequenced encoding (without any pause in between) of three messages of 239 bytes.
2.4.2.2 Utilisation and timing report
For the timing report, a particular component, called Timing_Tester, had to be created. This component registers the inputs and the outputs of the RSEncoder, which prevents an incorrect calculation of the timings. In fact, if the inputs and the outputs are not registered, not all the paths in the device are taken into account. In a realistic system these paths exist, so with this precaution the error is avoided. The module thus consists of input registers, which sample the inputs, connected to an instance of the encoder whose outputs are again sampled by other registers. The basic schematic is the one seen in Figure 11.
Figure 11) Timing_tester realisation viewed in Xilinx Vivado®.
In Figure 11 it is possible to see the three parts described before: on the left the input registers, in light blue the RSEncoder, and on the right the output registers. Registering both inputs and outputs ensures that at least one critical path exists and that all the paths are taken into consideration, thus making a correct timing analysis possible.
A Kintex 7 device (xc7k160tfbg676-3) was used for the post-synthesis timing analysis. The initial constraint used for the timing analysis is the TCL command:
create_clock -period 2.000 -name clk -waveform {0.000 1.000} [get_ports clk]
The search started with a clock period of 2.000 nanoseconds, and a Newton-Raphson method was adopted to find the maximum operating frequency. The waveform, as we can see from the TCL constraint above, always has a duty cycle of 50%: when 1.900 nanoseconds was used, for example, the waveform was "-waveform {0.000 0.950}".
The results of the Newton-Raphson iterations are visible in Figure 12, a Python graph representing the data obtained. The black dots highlight the valid solutions found, while the red ones are invalid. The final result is:
Clock period = 1.778 ns
Clock frequency = 562.43 MHz
Figure 12) Graph representing the post-synthesis time analysis made with Xilinx Vivado®.
As for hardware utilisation, Figure 13 shows the resource usage of the RS(255,239) encoder.
Figure 13) Resources utilisation for Encoder RS(255,239).
With this data, a post-implementation timing analysis with a clock frequency of 550 MHz (period 1.818 ns) was run, so the constraint used was:
create_clock -period 1.818 -name clk -waveform {0.000 0.909} [get_ports clk]
Since the frequency is 550 MHz and the symbol is the byte (8 bits), the overall data processing speed is 550 MHz × 8 bits = 4.4 Gbps.
2.4.2.3 Code documentation
Figure 14 and Figure 15 show the results of commenting the VHDL code to document the system. The HTML pages, generated automatically with Doxygen, are useful every time the code is used or re-used, for understanding interfaces, functions, components and their internal structure.
Figure 14) HTML page for the VHDL code documentation of the RS Encoder.
Figure 15) HTML page for the VHDL code documentation of the Timing Tester for RS Encoder.
3 Decoder for RS codes
This first chapter of results is dedicated to the study of the decoder. In the first part the algorithms are studied and compared in order to make the best possible choice. After that, the chosen algorithms are inserted into working architectures and the first results, in terms of area occupation and timing, are obtained.
3.1 Decoding process
FEC (Forward Error Correction) codes are a group of codes that introduce some redundancy symbols to make it possible for the receiver to find and correct possible errors in the received word. These codes assign the main tasks of the working operation to the receiver, with no retransmission: from this comes the name. From now onwards the sender will be called encoder, while the receiver will be called decoder.
3.1.1 Reed-Solomon codes
There is a huge quantity of FEC codes available in the literature. In this section the methods used in the project are introduced.
The Reed-Solomon codes were discovered in 1960 by Irving Reed and Gustave Solomon, who gave them their name. RS codes are non-binary cyclic codes in which the code symbols are binary m-tuples.
Non-binary codes do not use only the simple binary operations implemented by XOR and AND gates: they perform operations in GF(2^m). A group of "m" bits is defined as a symbol.
The structure of the encoder is straightforward and will be neglected for now, while some words have to be spent on the decoder.
Figure 16) Generic decoder architecture.
In the figure above, the generic architecture of a decoder system is illustrated. The first block computes the syndrome.
3.1.2 Syndrome Computer
The syndrome is a polynomial that can be computed starting from the received message and that contains all the basic information for restoring it: position and magnitude of the errors. Every set of errors is related to one and only one syndrome polynomial. Thus, since it is a one-to-one relationship, in theory it is possible to link through a table each set of errors to its syndrome polynomial without any further step: knowing the syndrome, we can get the corresponding set of errors and add it to the received message in order to get the original message back.
Even if this solution is possible in theory, there is a physical limit to storing all the possible syndrome values and the corresponding sets of errors: basically there are memory issues, so it is neither technically feasible nor convenient to implement a decoder in this way.
The classical way of computing the syndrome consists in substituting the corresponding values into the polynomial. For example, in a generic RS(7,3) code, the syndrome is "2t" symbols long, i.e. four. The meaning of the "t" parameter, as introduced before in the introduction chapter, is half of the difference between the total number of symbols and the useful number of symbols of the message.
S = Syndrome vector = [S0 S1 S2 S3]
If the received message is:
r(X) = r0·X^6 + r1·X^5 + r2·X^4 + r3·X^3 + r4·X^2 + r5·X + r6
The first element of the syndrome can be computed as follows:
S0 = r(α) = r0·α^6 + r1·α^5 + r2·α^4 + r3·α^3 + r4·α^2 + r5·α + r6
While the second one is:
S1 = r(α^2)
and similarly S2 = r(α^3) and S3 = r(α^4).
Substitution into the received polynomial is the classical way of computing the syndrome, and the algorithm is really easy to understand, but it does not offer great performance, since it consumes a lot of hardware. In general, throughout the project, the minimisation of resources will be a continuous goal: using few resources allows more parallelisation and therefore higher throughput. We seek to optimise resource usage with respect to throughput and latency.
3.1.2.1 Horner’s rule
The way of computing the syndrome shown above is not the one actually used in digital systems. There is a more convenient and efficient way, called Horner's rule, that uses fewer resources, though it requires more clock pulses to finish the calculation.
If r(X) is the message received, then S_i = r(α^i) with i = 1:(n-k); basically, it is a substitution into the message polynomial.
For RS(7,3), the rule can be written in extended form in this way:
S(X) = (((((r0·X + r1)·X + r2)·X + r3)·X + r4)·X + r5)·X + r6
The parentheses in the formula indicate the operations to be made in the same clock cycle. Hence, the overall amount of clock periods required for the computation of a syndrome is "n".
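The rule can be sketched in Python for RS(7,3) over GF(8): each syndrome accumulator multiplies by its α^i and adds the incoming symbol, one symbol per clock pulse (a sketch; `gf_mul` is an illustrative helper, and the received word is assumed to be the error-free codeword obtained for the message [2,7,3] with the generator polynomial of the previous chapter, whose parity works out to [3,6,7,6]):

```python
PRIM, M = 0b1011, 3            # GF(2^3), primitive polynomial x^3 + x + 1

def gf_mul(a, b):
    # Carry-less multiply, then reduce modulo the primitive polynomial.
    prod = 0
    for i in range(M):
        if (b >> i) & 1:
            prod ^= a << i
    for deg in range(2 * M - 2, M - 1, -1):
        if (prod >> deg) & 1:
            prod ^= PRIM << (deg - M)
    return prod

alphas = [2, 4, 3, 6]          # the n-k evaluation points alpha^1 .. alpha^4
r = [2, 7, 3, 3, 6, 7, 6]      # received word, highest-degree coefficient first

# One Horner step per received symbol: each accumulator is multiplied by its
# alpha^i and the incoming symbol is added (XORed).
S = [0] * 4
for sym in r:
    S = [gf_mul(S[i], alphas[i]) ^ sym for i in range(4)]

print(S)                       # an error-free codeword gives [0, 0, 0, 0]
```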
3.1.3 Berlekamp-Massey Algorithm (BMA)
The second step is a component able to build, from the syndrome polynomial, two other polynomials that contain the location and the magnitude of the errors. The error-location polynomial has as roots the values related to the positions of the errors; the error-magnitude polynomial instead yields the magnitude of the error at each position defined by the error-location polynomial.
Several algorithms are available for this task, such as the Euclidean one. In this project it was decided, for reasons explained further on, to use the one named after the scientists who discovered it: Berlekamp-Massey (BM).
This algorithm, together with the Euclidean one, is normally considered a standard, as it computes both polynomials simultaneously in "2t" iterations. The classical version of the algorithm uses inversions in the Galois field during its working operation. This operation is particularly expensive in terms of resources. The need to avoid inversion led to the search for a variant of the algorithm that arrives at the solution in the same number of iterations. Since its main distinguishing characteristic with respect to the classical version is that it avoids the inversion of elements, the variant takes the name of inversion-less Berlekamp-Massey (iBM) algorithm. Its working operation flowchart is in Figure 17.
Figure 17) Flowchart that describes the working operation of the inversion-less Berlekamp-Massey (iBM)
algorithm.
This version is not exactly the one used in the project, but a precursor from which the most practical and implementable one will be derived.
3.1.3.1 Enhanced Parallel Inversionless Berlekamp-Massey Algorithm (ePIBMA)
The previous version of the Berlekamp-Massey procedure is not perfectly optimised. Indeed, though the number of iterations needed is the same, the latter algorithm must still be followed by a canonical Chien search and Forney correction. Moreover, it is not possible to design a component that implements this version of the algorithm with a systolic architecture. The possibility of realising a systolic structure is important because it significantly lowers the complexity of the architecture, and therefore also of the VHDL code. In order to attain these improvements, the so-called enhanced Parallel Inversionless Berlekamp-Massey algorithm is introduced [Wu15].
The outputs are the error-location polynomial "Λ" and the auxiliary polynomial "B". Used together, these two polynomials help reach a faster solution in the successive stage. Other outputs of secondary importance are also computed.
The algorithm takes "n-k" clock pulses to compute the outputs, like the other algorithm exposed, but it is possible to build the system through the composition of cells.
Figure 18) ePIBMA working operation diagram.
3.1.4 Chien Search (CS) and Forney's algorithm
The last stage of the decoder consists of two different parts: one computes the location of each error and the other computes its magnitude. Generally, the standard algorithm for calculating the positions of the errors is the Chien search. Figure 19 shows the flowchart that summarises the procedure.
In short, the basic idea is the simple substitution into the polynomial of all the non-zero elements of the field in which the code operates. If the evaluation of the polynomial is zero, by definition a solution has been found, and so the decoder will proceed to compute the corresponding magnitude.
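The exhaustive-substitution idea can be sketched in Python for GF(8): every non-zero element α^j is tried in Λ(X), and the exponents where the evaluation is zero mark the error locations (a sketch; the single-error locator polynomial below is an illustrative assumption, not one from the thesis, and `gf_mul` is an illustrative helper):

```python
PRIM, M = 0b1011, 3                 # GF(2^3), primitive polynomial x^3 + x + 1

def gf_mul(a, b):
    # Carry-less multiply, then reduce modulo the primitive polynomial.
    prod = 0
    for i in range(M):
        if (b >> i) & 1:
            prod ^= a << i
    for deg in range(2 * M - 2, M - 1, -1):
        if (prod >> deg) & 1:
            prod ^= PRIM << (deg - M)
    return prod

def poly_eval(coeffs, x):
    # Horner evaluation, highest-degree coefficient first.
    acc = 0
    for c in coeffs:
        acc = gf_mul(acc, x) ^ c
    return acc

# Illustrative single-error locator Lambda(X) = alpha^2 * X + 1; its only root
# is X = alpha^(-2) = alpha^5 (alpha = 2, alpha^2 = 4 in this field).
Lam = [4, 1]

roots = []
x = 1                               # alpha^0
for j in range(2**M - 1):           # try every non-zero field element alpha^j
    if poly_eval(Lam, x) == 0:
        roots.append(j)             # alpha^j is a root of Lambda
    x = gf_mul(x, 2)                # move on to alpha^(j+1)

print(roots)                        # [5]
```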
Figure 19) Flowchart that describes the working operation of the classical version of the Chien Search (CS)
algorithm.
Having found the position of an error, the magnitude is still to be computed. To do this, Forney's algorithm can be used. Its work flow is summarised in Figure 20.
Figure 20) Flowchart of the working operation of the Forney's algorithm.
3.1.4.1 enhanced Chien Search Error Evaluation Component (eCSEE)
The classical form of the Chien algorithm has the same drawback as the Berlekamp-Massey one: it doesn't have a systolic structure. A systolic structure allows the building of a simpler system, hence simpler VHDL code, while presenting no drawback in terms of resources used and timing characteristics. The eCSEE evaluates the error-location polynomial and computes the error magnitude at the same time; functionally speaking, it corresponds to both the Chien and Forney algorithms. The outputs are the location of each error and its computed magnitude. In the schematic of the general architecture of the decoder (Figure 16), this block is the last one (actually two blocks): the addition and the selection made by the switch in the figure are in reality made outside this block.
The procedure followed for implementing this component is the one shown in Figure 21.
Figure 21) Flowchart diagram of the eCSEE algorithm.
The schematics and algorithms presented represent the choices made for this thesis. These versions of the decoder were implemented in Python and are presented as results of the thesis. As far as the VHDL coding is concerned, only the final systolic versions are submitted and explained.
3.2 Decoder RS(255,239) in GF(2^8)
The first component to be studied and implemented is a plain version of a decoder for the Reed-Solomon (255,239) code, working in a Galois field with eight bits per symbol. The device can be split into the following main components:
- SyndromeComputer
- ePIBMA
- eCSEE
3.2.1 Syndrome Computer
This is the first stage of the decoder device. The component calculates the syndrome vector using Horner's rule. As explained before, the advantage of this rule is that only one multiplier is needed, and it is used throughout the "n" cycles that the component is active. The schematic that implements the method is, straightforwardly, the one in Figure 22. Obviously, in the figure, the clock, enable and reset control signals are connected properly to the global system signals.
Figure 22) Syndrome calculation unit schematic implementing Horner's rule.
Concerning the VHDL code, there were some problems. The first was how to obtain the factors to be substituted into the arriving message. While in software, especially in Python, which is an interpreted dynamic programming language, the values can be computed with a for statement, in VHDL the factors have to be declared as constants in order to allow the synthesiser to simplify the circuit as much as possible. This, and later on other issues like type declarations, reinforced the idea of creating a separate package file (RS_Decoder_Types) in which every accessory variable, type, subtype and function is defined. Following this concept, a function was written that, through a for loop, generates and returns the alpha factors needed for the SyndromeComputer initialisation. This also implied the creation of a type specifically designed for the vector of elements. Summing up, the problem was solved by declaring a new array type of "n-k" gf_symbols, called alphas. This array is initialised by the function generate_alphas, which was designed for this purpose.
Another problem was the definition of the type for returning the syndrome vector, called syndrome_type, which is an array of "n-k" gf_symbols.
The component requires a specific cell (Figure 22) for its working operation. The cell was
called SyndromeCell (the black box model is in Figure 23).
Figure 23) SyndromeCell black box block.
Observing Figure 22, the cell is apparently no different from the cell previously used for the encoder: it is composed of the same topology of a GF multiplier, a GF adder and a register. When all the cells are interconnected and tested, though, it can be noticed that, as it is, the cell cannot be used for messages sent in series without pauses. In fact, thinking of the basic schematic shown in Figure 9, the register at clock pulse number "n-1" would have to be reset in order to start from zero at the next syndrome calculation; at the same time, the reset has to be avoided because we want the result at the output. As it was, the architecture was not correct and an improvement was needed.
The first idea was to insert in the loop a multiplexer between the output of the register and the multiplier. If the reset was high, the multiplexer selected a zero, otherwise it selected the output of the register; the same signal was used as clock enable for the register. The solution worked correctly but, on deeper analysis, a lengthening of the critical path can be observed. This degrades the performance of the system because it forces the synthesiser to decrease the maximum operating frequency achieved. The original critical path was composed of an adder, a register and a multiplier; the new solution added a multiplexer to the critical path. This was not acceptable and other solutions had to be explored.
The final architecture adopted for the cell is shown in Figure 24.
The final architecture adopted for the cell is showed in Figure 24.
Figure 24) SyndromeCell internal structure.
The main advantage of using two registers instead of one is that the reset signal actually resets only the register in the loop, while for the lower register it is used as an enable for sampling the output. So the reset signal acts as reset for the upper register and as enable for the lower one. Thus, it was possible to restore the previous critical path of the cell while obtaining the wanted behaviour. Hence, the critical path is again:
T_crit = T_add + T_mult + T_register_setup
Once the cell was obtained, the component was built as shown below.
Figure 25) Internal structure of the SyndromeComputer.
The system also requires a control circuit for generating the ready signal, which has to arrive at the output after "n" clock pulses. The clock pulses are counted by a counter and, when this reaches "n", the signal goes high for one clock pulse. At the output, the syndrome symbols are grouped as syndrome_type.
The result of this section is the black box model presented in Figure 26.
Figure 26) Black box model of the SyndromeComputer component.
3.2.2 ePIBMA component
Having computed the syndrome, the following stage is the Berlekamp-Massey one. Its task is to obtain, from the syndrome values and the arrived message, the error-location polynomial "Λ", the auxiliary polynomial "B" and the other outputs required by the eCSEE module.
As discussed in the previous section of this chapter, the version of the method adopted for the decoder is the one that allows the systolic structure.
The fundamental repetitive cell of the component is illustrated in Figure 27 and Figure 28, which show respectively the black box and the internal structure of the cell.
Figure 27) Black box representation of the RS_ePIBMA_Cell.
Figure 28) Internal Structure of the RS_ePIBMA_Cell.
Every cell computes the "i-th" coefficient of the two polynomials. From the figure we can notice that the control signals (MC1, MC2, MC3) determine the behaviour of the cell through some multiplexers.
The critical path of the cell is the one used by Ω_0^(i) and Ω_(p+1)^(i), i.e.:
T_crit = T_add + T_mult + T_register_setup
In general, this is also the critical path of the overall system.
Particular attention was necessary for the initialisation stage. In fact, while in Python it corresponds to a simple assignment, in VHDL it is a little more complex. The initialisation step was implemented by forcing the control signals so that the values of Ω and Θ arrive directly at the inputs of the two registers. If, at iteration 0, the initial values are at the inputs of cell number "p", then "MC1=0", "MC2=(others=>'0')" and "MC3=1" also have to be set. In this way, it is guaranteed that at the next clock pulse the cells will be initialised correctly. This iteration corresponds to the "Initialisation" step of Figure 18. At the same moment these signals are set, some attention also has to be paid to the values the signals will take in the following step: the control signals of the following iteration have to be initialised too. The following figures show the solutions adopted for the several signals.
Figure 29) Circuit for the initialisation of the MC1 control signal.
Figure 30) Circuit for the initialisation of the MC2 control signal.
Figure 31) Circuit for the initialisation of the MC3 control signal.
In Figure 29, Figure 30 and Figure 31 it can be noticed that the temporary signals created
(MC1_temp for example) are selected only if init_signal is low.
Referring to Figure 18, the correspondence of the signals is explained here. The MC1 signal corresponds to the first if inside the iterations, though the condition to be verified is modified to optimise the implementation. In fact, writing:
MC1 = (Ω_0^(i) ≠ 0) AND (L_Λ^(i) ≤ L_B^(i))
is the same as checking this other condition:
NOT(MC1) = (Ω_0^(i) = 0) OR (L_Λ^(i) > L_B^(i))
This result comes directly from the application of De Morgan's theorem. Simply moving L_B from the right side to the left side, the system will be even better:
NOT(MC1) = (Ω_0^(i) = 0) OR ((L_Λ^(i) − L_B^(i)) > 0)
Where before there was a comparator between two integers, now there is first a subtraction and then a simple check of the sign bit of the result.
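The equivalence between the comparator and the subtract-then-test-sign-bit form can be checked exhaustively in Python for a small two's-complement width (a sketch; the 5-bit width and the subtraction order are illustrative assumptions):

```python
W = 5                                        # illustrative two's-complement width

def sign_bit(x):
    """Sign bit of x seen as a W-bit two's-complement value."""
    return (x >> (W - 1)) & 1

# L_lambda <= L_B  is equivalent to: the W-bit difference (L_B - L_lambda) is
# non-negative, i.e. its sign bit is 0 (exact while both operands fit in W-1 bits).
for L_lambda in range(2 ** (W - 1)):
    for L_B in range(2 ** (W - 1)):
        diff = (L_B - L_lambda) & (2 ** W - 1)    # wrap-around subtraction
        assert (L_lambda <= L_B) == (sign_bit(diff) == 0)
```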
The MC2 signal is a shifting vector that, fed to the cells, puts a 0 at position "2t-i-2" of the Θ polynomial. Finally, the MC3 signal synthesises the choice that appears in the right branch of the iteration flow, i.e. the condition (L_B = (t-1)).
In the same way, the other control signals were also deduced from the algorithm in Figure 18.
The signal gamma is computed as shown in Figure 32. We can notice that, as previously done
for the control signals, a value for gamma is set initially when the init_signal arrives and also
for the next period, thanks to the presence of the register.
Figure 32) Circuit for the calculation of the gamma signal.
The signal omega_0 is set simply with a multiplexer (Figure 33), whose select signal is init_signal. During initialisation the signal is set to zero; otherwise the first symbol of the omega polynomial is taken.
Figure 33) Circuit for the calculation of the omega_0 signal.
Following the same procedure, the circuits for computing LB and Lsigma were also obtained.
Figure 34) Circuit for the calculation of parameter L_B.
Figure 35) Circuit for the calculation of L_sigma.
The system built for calculating the zeta signal (Figure 36) deserves a description. Like the previous ones, the signal is defined by the control signals. A register was necessary for delaying the signal to the next iteration, when the init_signal goes down. The additional part is, first of all, a GF multiplier that multiplies the current zeta by the basic element α^-1. In the schematic this operation is simplified with a block containing the basic element; in VHDL, instead, there is the problem of how to implement it. The α^-1 value is taken from a function (generate_inverted_elements) defined in the package RS_Decoder_Types. At the declaration of the signal alpha_1, it is initialised with the second element of the array returned by the function. At the output of the first register on the left there is a multiplexer that manages the first clock period, when init_signal is high, as seen in the other architectures shown in this component.
Finally, referring to Figure 37, there is the ready signal. When the counter reaches “2t” counts, the system schedules the internal_ready signal to go high in the following period; one period later the ready signal goes high as well. This delay is useful to let the ePIBMA component know one cycle in advance that the computation is finished. In this way it is possible to sample the outputs of the system: the output ready goes high at the same time as the data become available at the output ports.
Figure 36) Circuit for the calculation of the zeta signal.
Figure 37) Circuit for the calculation of the ready signal.
3.2.3 eCSEE component
The final stage of the device is a component that implements the eCSEE method. It receives
the outputs of the Berlekamp-Massey component and computes the roots of the error-location
polynomial and the magnitude of the errors. The black-box mask of the block is shown in
Figure 38. The outputs of this block are the clock counts (CC), the error magnitude, the ready signal and the number of errors occurred (num_err). This last one will be useful for the real-time debug of the component, explained in the following section dedicated to the block architecture of the decoder.
Figure 38) RS_eCSEE black-box representation.
We can see the generic blocks of the device in Figure 39. Observing the architecture, we notice repetitive modules for the evaluation of sigma_even and sigma_odd, which form the first two rows of the figure. Similar modules are used for the evaluation of the “B” polynomial. As expounded earlier, the advantages of the systolic structure are enormous, and in this component too the algorithm allows this characteristic to be exploited. At the output stage, the enable is given by the condition Λ_even(α^-j) == Λ_odd(α^-j). In the Galois field, this condition can be computed as the sum of the evaluations of the two polynomials, i.e.:
Λ_even(α^-j) + Λ_odd(α^-j) == 0
The output block, when enabled, multiplies the evaluation of the odd part of sigma by the evaluation of the “B” polynomial. This result is then inverted, multiplied by the zeta value and finally presented at the output.
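The root check of the output stage can be sketched behaviourally in Python. This is not the VHDL: it only illustrates that summing the even and odd evaluations over GF(2^m) is the same as testing Λ(α^-j) = 0. The field GF(2^8) with primitive polynomial x^8+x^4+x^3+x^2+1 (0x11D) is an assumption for the example.

```python
PRIM = 0x11D  # assumed primitive polynomial for GF(2^8)

def gf_mul(a: int, b: int) -> int:
    """Carry-less multiplication modulo the primitive polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM
        b >>= 1
    return r

def eval_part(coeffs, x, parity):
    """Evaluate only the even- (parity=0) or odd-degree (parity=1) terms."""
    acc, xi = 0, 1
    for i, c in enumerate(coeffs):        # coeffs[i] is the degree-i coefficient
        if i % 2 == parity:
            acc ^= gf_mul(c, xi)
        xi = gf_mul(xi, x)
    return acc

def is_root(coeffs, x):
    even = eval_part(coeffs, x, 0)
    odd = eval_part(coeffs, x, 1)
    return (even ^ odd) == 0              # GF(2^m) addition is XOR
```

For example, with Λ(x) = (x + 1)(x + 2) = x² + 3x + 2, the positions 1 and 2 are detected as roots while 3 is not.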
Summing up, we can distinguish several blocks: a part for the evaluation of the sigma and “B” polynomials, a part for the calculation of the zeta, another for computing the enable signal of the output stage, and the output stage itself. The evaluation circuits are built using only one repetitive unit, called RS_eCSEE_Cell.
The sketch seen in general terms before is represented by the VHDL code as in Figure 40. There are small differences that will be explained and motivated later in this section, but in general it is the same architecture as Figure 39.
In the left part of the structure there are four blocks for the evaluation of the two polynomials. While for the sigma polynomial the evaluation must be divided because the odd part is needed in the output stage, for “B” this is not required. The reason it was separated anyway is segmentation: dividing the evaluation unit into smaller units allowed the system to be pipelined, reducing its critical path. Since the system was pipelined, keeping the “B” polynomial evaluation whole would have added latency. Dividing “B” into two parts as well reduces the latency and restores symmetry with respect to the sigma evaluator.
The symmetry simply helps in computing the arrival times of the various signals in the digital circuit.
Since the cell used is the one visible in Figure 41 and Figure 42, the latency of the evaluation blocks is “t/2”. In my particular example of RS(255,239) it is 4.
Figure 39) Internal blocks of the eCSEE device.
Figure 40) Internal structure of the eCSEE component; the blocks are placed ordered in respect to the x-axis that
represents the time.
Figure 41) Black-box representation of the RS_eCSEE_Cell.
Figure 42) Internal structure of the RS_eCSEE_Cell.
Comparing the images above, the cell differs from the cells visible in Figure 39 only in the presence of the segmentation registers between one adder and the next. Referring to Figure 42, it can be seen that the multiplexer selects the polynomial coefficient “Λi” for the first cycle and then starts selecting the previous partial_result. After this initialising step, the cell starts working in steady state, i.e. mult_factor will be the mult_factor of the previous period.
The register at the bottom right has the task of delaying the init_signal that reaches the cell and giving it as an input to the next cell. The segmentation of the path makes these signal delays necessary, and this was the solution chosen. In VHDL it was more compact to make the signal go through the cells this way, simply connecting one cell’s output to the next cell’s input. Another way of doing it, saving some registers, would have been to create a chain of registers that delays the init_signal only once, instead of replicating the structure. Although that solution is cheaper from a resource point of view, the extra cost of registers with this cell is only “3*t/2” 1-bit registers, i.e. 12 1-bit registers. This is negligible with respect to the overall resource usage, discussed in the next section of this chapter; therefore this solution is equally valid.
The critical path of the cell is the path that goes from the partial_result register to add_result, passing through one multiplexer, one multiplier and one adder:
T_crit = T_mux + T_mult + T_adder
The evaluation blocks are a chain of RS_eCSEE_Cell units put in series, as shown in Figure 43 and Figure 44. For the first cell of each even chain, the previous cell’s value is zero, so the adder reduces to a simple wire connection. For the odd chains, instead, the value is set to the first coefficient of the polynomial. This follows the general circuit of Figure 39.
The evaluators are a cascade of four cells, except for the B_even evaluation block. In fact, if the sigma polynomial’s length is “t”, the length of “B” is “t-1”, and this creates an issue for the calculation of the even part because it needs one cell less (“t/2-1” instead of “t/2”). To synchronise the arrival of the signals at the output of the evaluators, a register was placed at the end of the evaluation block of the even part of “B”, restoring the synchronisation. In Figure 44 the different cell can be seen at the end of the chain for the evaluation of the B_even polynomial.
Figure 43) Sigma polynomial evaluation block.
Figure 44) “B” polynomial evaluation block.
Another important part of this component is the zeta computer, which follows the schematic of Figure 45. The circuit uses the control signal internal_CC, which comes directly from the counter of clock pulses. The shift register at the output of the module is needed for the segmentation of this component, to synchronise the signals arriving at the multiplier of the output stage.
Figure 45) Zeta computer component.
The segmentation of the RS_eCSEE brought synchronisation problems throughout the internal circuits. Having illustrated the counting of the clock pulses and the init_signal, the processing of the control signals across the device will now be illustrated.
The counts of the clock pulses have to be delayed in order to arrive simultaneously with the error magnitude. In Figure 40 the x-axis represents time. The evaluation blocks take “t/2” clock pulses of latency and the inverting element one clock pulse, so the total is “t/2+1” clock pulses. This is exactly the shift-register delay that has to be given to the internal_CC for it to arrive at the output (Figure 46).
Figure 46) Circuit for the calculation of the number of clock pulses (CC).
The internal_ready signal is another signal of interest. Figure 47 represents the system for the calculation of the ready signal. A counter counts the clock pulses separately. The counts are compared to the first extreme (“t/2”) of the interval where the ready has to be high, and to the second extreme (“t/2+n”). When the counts are between these two values, the internal_ready is high; otherwise it is low. The reset circuit is in reality a bit more complex than that, to allow the eCSEE to work even with consecutive messages. If the reset arrives while the internal_CC is stopped, there are no messages in series: there is at least one clock period of pause, so the counter of this module is reset to zero. If instead the internal_CC is still counting when the reset signal arrives, two consecutive messages are present; to maintain the correct number of counts, the counter is reset to one.
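The ready-window counter with its two reset behaviours can be sketched behaviourally; this is my reading of the description above, with illustrative names (the still_counting flag stands for “internal_CC is still counting”), not the actual VHDL.

```python
T = 8          # correctable errors, RS(255,239)
N = 255        # codeword length

class ReadyCounter:
    """Behavioural sketch of the internal_ready counter described above."""
    def __init__(self):
        self.count = 0

    def tick(self, reset: bool, still_counting: bool) -> bool:
        if reset:
            # back-to-back messages: the cycle consumed by the reset itself
            # must still be counted, so restart from 1 instead of 0
            self.count = 1 if still_counting else 0
        else:
            self.count += 1
        # ready is high when the count lies between the two extremes
        return T // 2 < self.count <= T // 2 + N
```

The key point is the conditional reset value: restarting from one instead of zero keeps the count aligned when a new message begins with no idle cycle in between.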
Figure 47) Circuit for the calculation of the ready signal.
Another crucial signal is the reset. When it arrives, this signal also has to be delayed for the component to work correctly. Using the same estimation made for the internal_CC, the reset signal is likewise delayed by “t/2+1” clock periods.
Having presented all the control signals, the architecture of the last block of this component is now presented. The last stage’s enable is the condition given before, i.e. the evaluation of “Λ” (“sigma_evaluated==0”), as can be seen in Figure 40. This condition (error_found), again for synchronisation reasons, is delayed by one period before arriving at the clock enable of the output register. The outputs of the “B” polynomial evaluation blocks are summed to obtain the total value B_evaluated. This is multiplied by sigma_odd_evaluated and the result of this operation (non_inverted_partial_result) is the input of the Inverted_elements_ROM (Figure 48). This block is a ROM built with the embedded RAM (BRAM) of the FPGA. To fill the ROM with the correct values, it was necessary to write a specific function, generate_inverted_elements, in the package RS_Decoder_Types. The function for the creation of an array of inverted elements is based on the scheme of Figure 3.
Figure 48) Black-box representation of the ROM for inverting the elements.
The ROM is used by giving the element to be inverted to the input port addr; in the next period the corresponding inverted element appears on the output port data. This element introduces a delay of one clock period into the path.
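The contents of such a ROM can be generated offline, as an analogue of the generate_inverted_elements function mentioned above, here sketched in Python rather than VHDL. GF(2^8) with primitive polynomial 0x11D and primitive element α = 2 is an assumption for the example; the key identity is α^-i = α^(255-i).

```python
PRIM = 0x11D  # assumed primitive polynomial for GF(2^8)

def gf_mul(a: int, b: int) -> int:
    """Carry-less multiplication modulo the primitive polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM
        b >>= 1
    return r

def generate_inverse_table():
    """Build log/antilog tables from the powers of alpha = 2 and derive
    the inverse of every non-zero element: alpha^-i = alpha^(255-i)."""
    exp, log = [0] * 255, [0] * 256
    x = 1
    for i in range(255):
        exp[i] = x
        log[x] = i
        x = gf_mul(x, 2)           # multiply by the primitive element alpha
    inv = [0] * 256                # address 0 left at 0 (0 has no inverse)
    for a in range(1, 256):
        inv[a] = exp[(255 - log[a]) % 255]
    return inv
```

Each table entry would then be written into the BRAM at the address of the element it inverts.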
Continuing the description of the system of Figure 40, the outcome of the inverting block (inverted_partial_result) is multiplied by the zeta calculated by the dedicated block and then given to the input of the output multiplexer. The result of the multiplication is called error_magnitude_calculated and corresponds to the computation of the error magnitude in the eCSEE algorithm. The mux output is selected by the error_found signal and then sampled by a register. The error magnitude is thus zero if no error is found, and equals error_magnitude_calculated if an error is found.
Finally, the signal num_err is simply determined by counting the times the error_found signal went high during the decoding of one message.
3.2.4 Decoder block assembly
The blocks described up to now have to be assembled together. The way of assembling them and making the decoder work properly is worth a section.
Figure 49 illustrates the decoder black-box mask. The inputs are the canonical clock, reset and enable, plus data_in, where the symbols of the message to be decoded are inserted one by one. The outputs are:
- data_out is, similarly to data_in, the decoded message given one symbol at a time;
- ready is simply a bit that goes high when an output is produced;
- invalid_output is a flag bit communicating whether the output produced is valid or, for example, the decoder did not find all the errors contained in the message;
- par says whether at the output there are the parity bits (high) or the message bits (low).
Figure 49) RS_Decoder black-box representation.
The general structure of the decoder was shown previously in Figure 16. In the upper branch we see the three blocks described until now in sequence: SyndromeComputer, RS_ePIBMA and RS_eCSEE. The architecture to obtain the original message is still missing from the description. To obtain it, a block is necessary for delaying the received message and keeping it stored until the three blocks have processed it; then it is possible to sum it to the error vector computed by the three core blocks. This delay block is implemented by a Dual-Port RAM called Delay_RAM. The interface of this component is displayed in Figure 50.
Figure 50) Black-box representation of the Delay_RAM.
Besides the usual clock input, there are the inputs for enabling the two ports and two address ports. Moreover, there is the input data port for port “a” and the output port for port “b”.
A double port was required because it is necessary to write the new data as they arrive, in order to store them, and to read them when the error correction is ready. To do this, two units had to be designed for generating the addresses, so that the signals remain synchronised. In Figure 51 there is the basic schematisation of a circuit that uses the Delay_RAM.
Figure 51) Example circuit for the usage of the Delay_RAM block.
There has to be a system, similar to the internal system of the RS_eCSEE, that manages all the signal delays to keep them synchronised. The two processes driving the RAM block are simple counters. The input_delay_process (Figure 52) counts the clock pulses and also takes into account the number of the message being examined: “m” bits for the symbol address and two MSBs for the message address. In my example of an RS(255,239), the address is then ten bits wide.
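The address layout of the input_delay_process can be sketched behaviourally (the generator and its names are illustrative, not the VHDL): an m-bit symbol counter concatenated with a 2-bit message counter, for m + 2 = 10 address bits.

```python
M = 8            # bits of the symbol field (RS(255,239))
N = 255          # symbols per message

def input_addresses(num_messages: int):
    """Yield the write address for each incoming symbol: the two MSBs
    select one of up to four messages in the RAM, the low m bits count
    the symbol inside the message."""
    for msg in range(num_messages):
        for sym in range(N):
            yield ((msg % 4) << M) | sym   # 2-bit message field | m-bit symbol field
```

The 2-bit message field wraps around, so the RAM holds a sliding window of messages while the decoding pipeline catches up.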
Figure 52) Circuit for the generation of the address for input data in the Delay_RAM.
Concerning the output data address, the situation is more complex, since the symbols at the output of the RS_eCSEE component are in reversed order with respect to the arriving ones. In fact, while in the arriving messages the parity symbols are at the end of the message, at the output of the Chien search component they are the first to come out. The differences between the two systems can be observed by comparing Figure 52 and Figure 53.
Figure 53) Circuit for the generation of the output data address.
In the output data address circuit, we can see that the error location, the signal coming out of the RS_eCSEE module, is reversed with respect to “n” (the length of the message) to re-establish the same sequence of symbols as the message. When the “m” bits of the address reach zero, the system switches to the next message.
Another important process of the RS_Decoder is the one that controls whether the output is valid or not. In fact, if the number of errors found is less than the length of the sigma polynomial L_sigma, it means that not all the errors were found. Moreover, there are limitations on the maximum number of errors that are detectable and correctable by a Reed-Solomon code: the maximum number is “t”, which in the case of RS(255,239) is eight.
The overall system (Figure 54) is completed with a small note about the control signals used in the decoder. The blocks in the figure appear to receive the same control signals, but this is not the real case: it would not work. This was done only for the sake of simplicity in drawing the circuit; in reality, the latencies of the blocks have to be taken into account. The discussion about the latencies is left to the following section, but in general shift registers are used to delay the control signals across the blocks and allow the system to work properly.
Figure 54) Complete schematic of the RS_Decoder component described with blocks.
In the system, shift registers properly delay the control signals among the various blocks and deliver a correct timing synchronisation. These components have the task of delaying the value of L_sigma by the latency of the Chien component, in order to avoid invalid_output problems due to unsynchronised signals. As a matter of fact, if the length of the sigma polynomial were not delayed, when the Berlekamp-Massey block finishes with a new length it would be compared with the errors found by Chien at the end of the previous message. For example, if message 1 is being decoded by Chien for some clock pulses when message 2 finishes Berlekamp-Massey, there would be a comparison between the length of the sigma polynomial of message 2 and the errors found by Chien in message 1. Interposing this block avoids this synchronisation problem.
To make the shift register more compact, it was based on a Dual-Port RAM; Figure 55 shows the structure. The RAM stores “t/2+2” values of an unsigned number. The control logic gives the proper values to the two addresses and also provides a flag full, saying whether the memory is full or not. This signal selects the output through a multiplexer: if the memory is not full, the output is set to zero.
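The RAM-based shift register can be sketched as a circular buffer of depth t/2 + 2 whose output is forced to zero until the memory has been written through once (the full flag of the text). This is a behavioural model with illustrative names, not the VHDL.

```python
DEPTH = 8 // 2 + 2      # t/2 + 2 for t = 8

class RamShiftRegister:
    """Behavioural sketch of the L_sigma shift register on a dual-port RAM."""
    def __init__(self, depth: int = DEPTH):
        self.mem = [0] * depth
        self.wr = 0
        self.filled = 0

    def tick(self, value: int) -> int:
        # port b reads the oldest entry, port a overwrites it with the new one
        out = self.mem[self.wr]
        self.mem[self.wr] = value
        self.wr = (self.wr + 1) % len(self.mem)
        full = self.filled >= len(self.mem)
        self.filled = min(self.filled + 1, len(self.mem))
        return out if full else 0      # output mux: zero while not full
```

Reading and writing the same address on the two ports every cycle yields a fixed delay equal to the depth, with a single pointer as control logic.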
Figure 55) Internal structure of the L_sigma shift register.
The decoder component RS_Decoder is now completely described.
The test-bench for the device was made with a system similar to the one used for the encoder shown in Figure 10. The difference is that in this case it was necessary to simulate the presence of noise in the transmission medium or in the receiving system. These imperfections were modelled in the Python code of the decoder with the function add_noise. This function takes eight random numbers between zero and “n”, corresponding to the positions of the errors introduced, and then changes the symbols in those positions to a random number between zero and 2^m. The number of errors introduced by the code is actually controlled by the variable num_errors passed to the function; in my code this is set to the maximum number of errors correctable by the code, “t”. The original messages, read from a text file, are modified and saved into another text file with the noise added. The decoding algorithm then does its job, decodes the messages and saves them again. The VHDL file uses the noisy files as a starting point and computes the output. Thus, the data that come out of the decoder are manually compared with those coming out of the Python model.
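A sketch of what such an add_noise function may look like, based only on the description above (the actual implementation is in the thesis code; here the positions are drawn without repetition, which is one possible reading of “eight random numbers between zero and n”):

```python
import random

def add_noise(message, num_errors, m=8):
    """Corrupt num_errors symbols of the message at random positions,
    replacing each with a random symbol in [0, 2^m)."""
    noisy = list(message)
    n = len(noisy)
    positions = random.sample(range(n), num_errors)   # distinct error positions
    for pos in positions:
        noisy[pos] = random.randrange(2 ** m)         # random corrupted symbol
    return noisy
```

With num_errors = t = 8 this exercises the decoder at its correction limit; a replaced symbol may coincide with the original, so the effective number of errors can be lower.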
A comparison of the results for an RS(255,239) decoder is shown in the figure below.
Figure 56) Python graph presenting the results obtained by the RS(255,239) decoder.
In red there is the noisy message, which slightly differs from the original message, in blue. The blue curve is actually not observable in the figure because the green one, i.e. the decoded message, completely coincides with the original one, showing the correct operation of the algorithm. The message is composed of three parts: a first part containing a string, a second part of zero-padding to reach the 239 symbols, and the final parity symbols. In the central part, where only zeros should be found, we can see spikes of errors corrected by the decoding algorithm.
3.3 Decoder RS(528,514) in GF(2^10)
After implementing the classic Reed-Solomon code RS(255,239), another basic decoder that had to be analysed was the RS(528,514), in order to increase the throughput. The formula of the throughput, as presented before, is the following:
Throughput = m / T_critical_path
Since the critical path cannot be changed without changing the architecture of the decoder, the code had to be changed. Another relevant code in the literature is the RS(1023,1009), which works in a Galois field of ten bits. In this case the critical path should be the same while the number of bits increases by 25%, producing a corresponding increase in throughput.
Looking more closely at the code of the previous decoder, it can be noticed that the previous architecture was not generic enough: the VHDL code was valid only for even values of “t”. In this case, a new internal structure of the Chien component had to be arranged. When “t” is even, there have to be “t/2” cells for sigma (even and odd parts) and for the odd part of “B”, and “t/2-1” cells for the even part of “B”. For synchronising the signals, the presence of a register at the end of the “B” even cells is mandatory. When “t” is odd instead, the number of cells for sigma_even and for the odd and even parts of “B” is “floor(t/2)”, while for the odd part of “sigma” it is “ceil(t/2)”. In this case too, registers were added to guarantee the correct synchronous arrival of the signals at the following stage.
In VHDL this was implemented with a simple extension inside the Chien component. At the beginning of the architecture description, the condition in which the decoder is operating (is “t” odd or even?) is verified with two if generate statements that choose the proper internal structure. For realising the floor and ceil functions, the expression “t mod 2” is used. It gives the information whether “t” is odd or even and can also be used for rounding numbers. In fact, the default behaviour of VHDL integer division is a floor function. Using this information it is possible to obtain the operation needed. For example:
t/2 + t mod 2 = 8/2 + 8 mod 2 = 4 + 0 = 4   for t = 8
t/2 + t mod 2 = 7/2 + 7 mod 2 = 3 + 1 = 4   for t = 7
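The same trick can be written in Python terms, where integer division also truncates towards zero for positive operands (the function names are illustrative):

```python
def cells_odd_part(t: int) -> int:
    # ceil(t/2) realised as floor(t/2) + (t mod 2),
    # the same expression used in the VHDL generics
    return t // 2 + t % 2

def cells_even_part(t: int) -> int:
    # floor(t/2), the synthesiser's default division behaviour
    return t // 2
```

For even “t” the correction term vanishes, so a single expression covers both cases without conditional generate statements.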
The same expression is used throughout the new, modified code to guarantee the correct lengths of vectors and the correct latency. In particular, having (in the case of t=7) four cells for the odd part of sigma, the minimum latency for the block is still four, and the latency that the decoder module foresees for the Chien block has to be adjusted. If before it was “t/2”, now it will be “t/2 + t mod 2”, in order to compute the latency correctly for the two cases and to maintain a compact form without adding other conditional generate statements.
After the verification of this code, it was possible to pass to a new step: shortening the code. In fact, not all the 1023 symbols of the message are actually used. To do this, some new parameters have to be introduced:
s = number of symbols cut from the message
code rate = fraction of useful symbols of the message = (k − s) / (n − s)
Comparing the codes RS(1023,1009) and RS(528,514), it can be seen that while the total throughput remains basically untouched, the code rate changes significantly. It is higher in the first one, implying that a smaller percentage of errors can be corrected: only 7 errors over 1023 symbols.
The shortened code is composed as follows: a vector of 495 zeros has to be concatenated to the 514 symbols of the message to send, and fourteen parity symbols have to be appended after the message. At the end, the zeros are cut before sending and only the 528 non-null symbols are transmitted.
The first question that can be asked is whether the previously implemented decoder still works correctly or is affected by this change; especially, whether the computed syndrome changes when committing these changes. For the sake of simplicity, it is demonstrated here that nothing relevant changes with an RS(7,3) and an RS(5,1) code, but the result is obviously general: it depends neither on the length of the message nor on the length of the cut.
Taking into account the entire message:
r(X) = [r6, r5, r4, r3, r2, r1, r0]
The message filled with zeros:
r(X) = [0, 0, r4, r3, r2, r1, r0]
And the shortened message:
r(X) = [r4, r3, r2, r1, r0]
We can finally compute the syndrome with Horner’s rule:
S(X) = (((((r6·X + r5)·X + r4)·X + r3)·X + r2)·X + r1)·X + r0
That can be simplified (both for the zero-padded and the cut message) as:
S(X) = (((r4·X + r3)·X + r2)·X + r1)·X + r0
Since the syndrome does not change, the Berlekamp-Massey algorithm and Chien search component are not affected. The only changes that have to be taken into account are the latency and the number of symbols arriving: the two processes for the input and output addresses of the Delay_RAM and the latencies of the SyndromeComputer and RS_eCSEE blocks.
3.4 Compared analysis of timings and usage of resources
Here the two versions of the decoder are compared and some calculations of area and timing are made.
3.4.1 Usage of resources in a generic RS(n,k) decoder
In this section the components used in each version are analysed in detail, keeping the results separated per component. For every component, the resources used by every cell are analysed first; secondly, the overall system is taken into account (for example, the control signals’ circuits). At the end of this section, a table summarising the computed results is presented.
3.4.1.1 Syndrome Computer
The SyndromeComputer is basically constituted only by cells and a final output stage for sampling the outputs.
The SyndromeCell is built with one adder, one multiplier and two registers. For every device, “n-k” cells are used for the calculation of the syndrome. Therefore, the total amount of resources used is “n-k” adders, “n-k” multipliers and “2(n-k)” registers.
The control signals’ circuits are ignored in the recap table: these components are negligible with respect to the amount of hardware required by the cells.
3.4.1.2 RS_ePIBMA
Every cell of this component uses three multiplexers, one adder, two multipliers and two registers. The number of cells for the component is “n-k”. The resources needed are summarised in Table 5.
In this case, the control circuits’ resource usage is not straightforwardly negligible, so it is taken into consideration. Referring to the figures of the dedicated section on this component, Table 4 was drawn up.
          Multiplexers  Multipliers  Registers  Counters  Logic ports    Shift registers
MC1       2             0            0          0         1xOR           0
MC2       2             0            1          0         0              1
MC3       2             0            0          0         1xAND          0
Gamma     3             0            1          0         0              0
Omega_0   1             0            0          0         0              0
L_B       4             0            1          0         0              0
L_sigma   3             0            1          0         0              0
zeta      3             1            2          0         1xOR           1
ready     0             0            2          1         1xCOMPARATOR   0
Table 4) Recap table for the control unit of the RS_ePIBMA component.
In this table, details about the bit widths of the components are not given; full details are accessible through the code. The table is shown here just to give a rough idea of how much these control logic elements affect resource usage in relation to the other components. Taking the example of an RS(255,239) decoder, with “2t=16”, the number of multiplexers used in the cells is 48, while in the control logic it is 20. The greater the number of parity symbols, the more negligible the control logic becomes with respect to the cells.
3.4.1.3 RS_eCSEE
The cell of this device is composed of two multiplexers, one adder, one multiplier and three registers (two of 1 bit and one of “m” bits). The two 1-bit registers will not be counted for the sake of simplicity. The cells are seven for the auxiliary polynomial and eight for the sigma polynomial, so these numbers are not to be multiplied by “n-k” but by “n-k-1”. The registers are “n-k-1” inside the cells, plus one outside, introduced at the end of the auxiliary polynomial evaluator to make the segmentation work properly.
This component moreover uses two adders, one multiplier, one ROM, one comparator and one 1-bit register for the calculation of the error condition and of the inverted_partial_result, plus another multiplier and another register for the output stage. The zeta computer needs two multiplexers, one multiplier, one register and a shift register. The control logic needs one AND port, two comparators, one counter and two registers for the calculation of the internal and external ready.
3.4.1.4 Other components
There are also outer components inside the decoder but outside the previously analysed blocks. First of all, a double-port RAM is needed. The circuits for the addresses need two 2-bit counters, one “m”-bit counter, one adder, two comparators, two AND ports and two registers. An adder and a mux for the correction of the messages must also be taken into account and, for checking the valid output, a counter and a comparator. The shift registers for the control signals are two shift registers each for enable and reset for the Berlekamp-Massey and Chien search blocks, so in total four.
                  Multiplexers  Adders  Multipliers  Registers
SyndromeComputer  0             n-k     n-k          2(n-k)
RS_ePIBMA         3(n-k)        n-k     2(n-k)       2(n-k)
RS_eCSEE          2(n-k-1)      n-k-1   n-k-1        n-k

Table 5) Recap table of resources used per component, taking into account the cell usage.
3.4.2 Latencies and critical paths
The critical paths are fundamental for estimating the maximum frequency of the component. To first order, this is given by the inverse of the maximum period of time necessary for covering the critical paths. In Table 6 the critical paths of the realised components are summarised. In general, the critical path of the cells is the most remarkable one.
The SyndromeCell and the RS_ePIBMA_Cell have the critical path discussed before, passing through an adder and a multiplier. For the Chien search, instead, the critical path is the one that goes from the partial_result register to add_result, passing through one mux, one multiplier, one adder and then again one mux.
The latency of the module for the calculation of the syndrome is due to the iterations that have to be done for Horner’s rule; the data all go out simultaneously. For the ePIBMA, the latency is defined implicitly by the algorithm. The RS_eCSEE instead takes “t/2” cycles due to the cells of the sigma and auxiliary polynomials, plus one delay for the inversion and one for the output register.
                  Tcritical_path             Latency
SyndromeComputer  T_add + T_mult             n
RS_ePIBMA         T_add + T_mult             n-k
RS_eCSEE          T_add + T_mult + 2·T_mux   (n-k)/4 + 2 + n

Table 6) Table of critical paths and latencies of the various parts.
The overall latency, however, is not simply the sum of the latencies, because another fact has to be taken into account. The data arrive ordered from the first symbol of the message to the last symbol of the parity, while the Chien search delivers its output in reverse order. Thus, the actual latency depends on the position of the requested symbol. To have the first symbol corrected, one must wait:
Latency_first symbol = n + (n − k) + ((n − k)/4 + 2 + n) + 1
In fact, the first symbol waits the overall latency plus "n" more cycles to get its error-magnitude value. For the last symbol of the parity:
Latency_last symbol = n + (n − k) + ((n − k)/4 + 2) + 1
The general formula for the latency is therefore:

Latency(p) = n + (n − k) + ((n − k)/4 + 2 + p) + 1
where "p" is the position of the symbol for which we want to compute the latency.
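As a quick sanity check, the latency formula can be evaluated for the RS(255,239) decoder. This is a minimal sketch (the function name is illustrative); "p" runs from 0 for the last parity symbol up to n for the first message symbol, since the Chien search outputs the symbols in reverse order.

```python
def decoder_latency(p, n=255, k=239):
    """Clock cycles before the symbol at reverse position p is corrected.

    p = n for the first symbol of the message, p = 0 for the last
    parity symbol (the Chien search outputs symbols in reverse order).
    """
    return n + (n - k) + ((n - k) // 4 + 2 + p) + 1

# First message symbol: waits the full pipeline plus n extra cycles.
print(decoder_latency(255))  # 533
# Last parity symbol: available as soon as the pipeline drains.
print(decoder_latency(0))    # 278
```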
3.4.3 Maximum operating frequency
The maximum operating frequency was obtained as previously shown in the section on the RS(255,239) encoder. Since the schematic of the digital circuit is different, a new analysis of the critical path has to be done. The critical paths of the internal components of the decoder were computed in the previous sections and the results are summarised in Table 6. As can be noticed, the worst critical path is that of the Chien search cell. In the whole decoder implemented on an FPGA, however, the critical path will not necessarily be this one: the routing in a given device can introduce new timing constraints that are not predictable with this simple kind of analysis.
After a trial-and-error search, the final result obtained for the decoder was:
𝑇𝑐𝑟𝑖𝑡𝑖𝑐𝑎𝑙 𝑝𝑎𝑡ℎ = 3.02 𝑛𝑠
𝑓𝑚𝑎𝑥 = 331 𝑀𝐻𝑧
As for the quantity of data produced at the output, it is related to the Galois field in which the decoder operates. In particular:

Throughput = m · f_max = m / T_critical path
Therefore, for the two decoders considered in this chapter:

Throughput_RS(255,239) = 2.65 Gbps
Throughput_RS(1023,1009) = 3.31 Gbps
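Both figures follow directly from the 3.02 ns critical path; a minimal check (the symbol widths m = 8 and m = 10 are those of GF(2^8) and GF(2^10)):

```python
def throughput_gbps(m, t_crit_ns=3.02):
    """Throughput in Gbps: m bits per clock, one clock per critical path."""
    return m / t_crit_ns

print(round(throughput_gbps(8), 2))   # 2.65  RS(255,239)
print(round(throughput_gbps(10), 2))  # 3.31  RS(1023,1009)
```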
4 Design of 100 Gbps decoders
The decoders obtained so far reach only a few Gbps, which is not enough with respect to the goal of the thesis. By parallelising the components, it is possible to obtain higher throughputs without wasting too many resources. There is no single way of parallelising a decoder: it is a matter of design choices. This chapter explains the choices made in the project and illustrates the performance attained.
The analysis is made on different levels. The background works are evaluated through an estimation of the XOR gates they use. The estimation is made by a Python script that computes the equivalent gates, basing the calculation on the instantiation of the main blocks. The same script is used to estimate the equivalent number of gates used by the new designs proposed here. Finally, some of these designs are implemented on FPGA and as ASIC, and the solutions are compared.
4.1 Parallelisation of the RS(255,239) decoder
The first decoder subjected to parallelisation is the classical RS(255,239). Starting from the background works, the solutions present in the literature are explained and analysed. Finally, some new designs are introduced and compared.
4.1.1 Background
Plenty of articles about the parallelisation of the RS(255,239) can be found in the literature. This decoder has been studied for a long time, so many different solutions have been explored. Among these, the newest and best-performing ones were chosen as a reference against which to judge the new solutions proposed later on.
The parallelisation process consists in increasing the number of symbols processed per clock pulse.
It can be observed that the Berlekamp-Massey component is the one that limits the critical path, although its latency is low. The other two components, SyndromeComputer and eCSEE, instead, cause problems with the latency but not with the maximum operating frequency. That said, parallelisation may introduce new critical-path limitations (new cell architectures have to be considered), but it decreases the latency significantly, depending on the level of parallelisation. Parallelising can also increase hardware utilisation and area occupation. The Berlekamp-Massey component has low latency (only "n-k" cycles) and can therefore be shared among several SyndromeComputers, saving some area; some logic has to be implemented for the correct routing of the data.
The optimal solution is the one that achieves a good trade-off between resources used and latency, without worsening the critical path of the system too much; ideally, without worsening the longest path at all. To decide suitably, an analysis of the versions of the syndrome and Chien search components is proposed here, with everything summarised in tables to make the comparison clear. The analysis consists in calculating the overall equivalent XOR gates used by each solution. The details of this conversion are given later.
The analysis has to take into account not only the hardware usage but also the throughput and the latency, which are crucial parameters of interest: the main goal is a high-speed, low-latency decoder. To get a better view of the available solutions and to compare them easily, two new parameters are defined.
#(basic elements) / (1 Gbps) = basic elements required per Gbps of throughput

#(basic elements) / (saved latency's clock pulses) = basic elements required per clock pulse of latency saved

where the saved latency is defined as:

saved latency = (standard case latency) − (considered case latency)
These two parameters also provide a better way to compare resource usage when the throughputs and latencies differ, which is actually very common in real cases: they combine the performance obtained (throughput, or latency saved) with the resources used. The fewer the resources used to reach the same performance, the better.
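The two comparison metrics reduce to simple ratios; a minimal sketch (the function names are illustrative, and the example values are the 2-parallel syndrome figures from Tables 9 and 10):

```python
def xors_per_gbps(xor_gates, throughput_gbps):
    """Resource cost per unit of throughput: lower is better."""
    return xor_gates / throughput_gbps

def xors_per_saved_cycle(xor_gates, standard_latency, case_latency):
    """Resource cost per clock pulse of latency saved: lower is better."""
    return xor_gates / (standard_latency - case_latency)

# The 2-parallel syndrome block (1536 XORs, 5.16 Gbps, latency 128)
# against the non-parallel reference latency of 255 clock pulses:
print(round(xors_per_gbps(1536, 5.16), 1))             # 297.7
print(round(xors_per_saved_cycle(1536, 255, 128), 1))  # 12.1
```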
The versions taken into consideration for the analysis are the basic non-parallelised one and the two-, three-, four- and five-parallel ones. Of these, the four-parallel solution is a new design introduced in this thesis.
An analysis of the throughput and latency of the five versions, using only one channel per solution, is presented in the figures below. The same operating frequency is assumed for every cell, that is, for every solution an architecture that does not increase the critical path is adopted. In the FPGA implementation this assumption turns out to be quite inaccurate, since in large systems the critical path is often set by the routing of the signals rather than by the theoretical critical path of the cells. Still, it gives an indication of the performance achieved; only after the implementation of the integrated circuit can the results be compared properly.
Figure 57) Graph of the latency and throughput corresponding to each solution explored.
The latency and throughput are defined by the formulas below:

Latency = #(total symbols) / p

Throughput = (operating frequency) · m · p

p = #(symbols processed per iteration)
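These formulas can be tabulated for RS(255,239); a sketch, where the 322.5 MHz operating frequency is an assumption back-derived from the 2.58 Gbps non-parallel entry of Table 9 (the 4-parallel case costs one extra cycle in that table, so it is omitted here):

```python
import math

def symbols_latency(n, p):
    """Clock pulses to process an n-symbol codeword, p symbols at a time."""
    return math.ceil(n / p)

def throughput_gbps(f_mhz, m, p):
    """Throughput in Gbps: p symbols of m bits per clock at f_mhz MHz."""
    return f_mhz * m * p / 1000.0

# RS(255,239) syndrome block at the frequency implied by Table 9:
for p in (1, 2, 3, 5):
    print(p, symbols_latency(255, p), round(throughput_gbps(322.5, 8, p), 2))
```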
Nomenclature
Some nomenclature used during the development of the project to shorten names is introduced here. The number of symbols processed per clock pulse is denoted by the letter "p": for example, a two-parallel RS(255,239) decoder will be called a "2p" RS(255,239) decoder from now on.
By analogy, the letter "s" denotes the number of segmentations used in the syndrome cell, and the letter "c" identifies the number of channels used in the decoder.
In summary, a two-parallel RS(255,239) decoder with a once-segmented syndrome cell and eight channels is shortened to "2p-1s-8c" RS(255,239) decoder. As can be noticed, it is a very compact way of communicating the main parameters of interest of the decoder.
Equivalent XOR gates computation
Before moving to the computation, a few words are needed on how the equivalent number of XOR gates is obtained. So far, only basic objects like adders and multipliers have been used: how can they be converted into a number of XOR gates?
The equivalent values are listed in the conversion table below, obtained by implementing each of these basic components and matching the result against XOR gates.

                       XOR-gate equivalent
Simplified multiplier  2·log2(2^m) XORs
Multiplier             8·m XORs + 6·m ANDs
Adder                  m XORs
Mux 2-1 (1 bit)        1 XOR
Register (1 bit)       3 XORs
RAM (1 bit)            1 XOR
AND                    0.75 XORs
Table 7) XOR-gate equivalents of every basic component used.
A simplified multiplier is a multiplier in which one of the two factors is a constant. This allows the hard-wired simplifications typical of constant multiplication in digital electronics. Multiplexers with more than two inputs have to be decomposed into basic multiplexers; for example, a 4-1 mux can be translated into three 2-1 multiplexers.
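The conversion table can be captured in a small estimator in the spirit of the Python script mentioned earlier (a sketch; the function name and element labels are illustrative, not the script's actual API, and multi-bit elements are costed for m bits):

```python
def xor_cost(element, m=8):
    """Equivalent XOR gates of one basic element over GF(2^m) (Table 7)."""
    and_as_xor = 0.75  # one AND gate counts as 0.75 XOR gates
    costs = {
        "simplified_multiplier": 2 * m,            # 2*log2(2^m) XORs
        "multiplier": 8 * m + 6 * m * and_as_xor,  # 8m XORs + 6m ANDs
        "adder": m,         # m XORs
        "mux21": m,         # 1 XOR per bit
        "register": 3 * m,  # 3 XORs per bit
        "ram_bit": 1,       # 1 XOR per stored bit
    }
    return costs[element]

# One non-parallel syndrome cell: adder + simplified multiplier + register;
# 16 such cells for RS(255,239) give the 768 XORs of Table 10.
cell = xor_cost("adder") + xor_cost("simplified_multiplier") + xor_cost("register")
print(cell * 16)  # 768
```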
4.1.1.1 Syndrome Computer
The first component to be analysed is the syndrome computer. The variants taken into consideration are the standard non-parallelised one and the ones with two, three, four and five parallelised symbols.
The schematics of the new versions of the cells are presented here below.
Figure 58) Two-parallel syndrome cell internal architecture [Ji15].
Figure 59) Three-parallel syndrome cell internal architecture [Park12].
Figure 60) Four-parallel syndrome cell internal architecture.
Figure 61) Five-parallel syndrome cell internal architecture [Salvador14].
In the table below, the resources used by the five variants are summarised.

              Adders  Simplified multipliers  Mux (2-1)  Registers
Non-parallel  1       1                       0          1
2-parallel    2       2                       0          1
3-parallel    4       4                       1          5
4-parallel    4       4                       1          4
5-parallel    6       5                       5          5
Table 8) Comparison of the resource usage of the syndrome cell variants.
In all cases the simplified multipliers are used, since one of the factors is indeed a constant. At a glance, it can be observed that the three- and five-parallel solutions use too many resources and are probably not the best choices, but let us look at the numbers.
The performance figures are summarised below.
              Tcritical_path            Latency  Throughput [Gbps]
Non-parallel  T_add + T_mult            255      2.58
2-parallel    2·T_add + T_mult          128      5.16
3-parallel    2·T_add + T_mult + T_mux  85       7.74
4-parallel    T_add + T_mult + T_mux    65       10.32
5-parallel    T_add + T_mult + T_mux    51       12.90
Table 9) Latency and critical-path characteristics of the solutions analysed.
In this case the critical path does not affect the operating frequency, because the longest critical path in the table is still shorter than the critical path of the Berlekamp-Massey component. In the two-parallel solution, in fact, some segmentation can be performed to reach the minimum critical path. Since the operating frequency of the device is independent of the critical paths of these cells, this parameter is not useful for the analysis.
Table 10 summarises the results of the Python script for the calculation of the equivalent number of XOR gates for each version.

              #(XORs)
Non-parallel  768
2-parallel    1536
3-parallel    3472
4-parallel    3088
5-parallel    4048
Table 10) Equivalent XOR gates used by the overall syndrome component.
The last step is therefore to relate each resource usage to the throughput and to the latency saved. The figures below show the results graphically.
Figure 62) Resource usage per Gbps of throughput obtained. The x axis shows the number of symbols computed per clock pulse; the y axis shows the number of XOR gates used per Gbps of throughput.
Figure 63) Resource utilisation per clock pulse of latency saved. The x axis shows the number of symbols computed per clock pulse; the y axis shows the average number of basic elements used per clock pulse saved.
4.1.1.2 eCSEE
The eCSEE component is now under analysis. Similarly to what was done for the syndrome, we begin with the calculation of the resources used.
Figure 64) Internal structure of a two-parallel eCSEE cell.
Figure 65) Internal structure of a three-parallel eCSEE cell.
There are actually only two cell types for the Chien search: the same type is used in all the variants that process more than one symbol. Figure 64 and Figure 65 show examples of the parallelised type of Chien cell.
The analysis of the resource usage is summarised in Table 11.
              Adders  Multipliers  Mux (2-1)  Registers
Non-parallel  1       1            2          1
p-parallel    0       p            1          1
Table 11) Comparison of the resource usage of the eCSEE cell variants.
Moving now to the latency and critical paths of the Chien cells:

              Tcritical_path            Latency
Non-parallel  T_add + T_mult + 2·T_mux  (n-k)/4 + 2
p-parallel    T_mux + T_mult            ceil(n/p)
Table 12) Latency and critical-path characteristics of the solutions analysed.
To get a better idea of the resource usage, a deeper look is needed.
In general, the architectures use the same number of cells ("t" and "t-1") for the sigma and "B" polynomials, one cell for the computation of the zeta value and "p" final stages. The only difference is that the parallelised version requires one extra stage for correct operation, i.e. the addition stage (Figure 66).
Figure 66) Internal structure for the 3-parallel evaluation of the sigma polynomial [Park12].
To arrive at the overall resource usage, it only remains to list the usage of these single elements and then sum them for each solution. The so-called iROM is the ROM used in a module for the inversion of the elements of a Galois field, while SR stands for shift register.
             Adders  Multipliers  Registers  Mux (2-1)  iROMs  SR (t/2-1)
Main cell    0       p            1          1          0      0
Zeta cell    0       1            1          2          0      1
Final stage  0       2*           1          0          1      0
Add stage    8       0            1          1          0      0
Table 13) Resource utilisation of the components in the eCSEE variants.
In the final stage the multipliers used are not simplified ones, and this has to be taken into account in the resource utilisation.
              Main cells  Zeta cells  Final stages  Add stages
Non-parallel  2t-1        1           1             0
p-parallel    2t-1        1           p             p
Table 14) Instantiation of cells in the eCSEE variants.
Carrying out the calculation, the results for each solution are displayed in the table below.

              #(XORs)
Non-parallel  7862
2-parallel    13726
3-parallel    20310
4-parallel    26894
5-parallel    33478
Table 15) Resources used by the eCSEE module of all the solutions taken into consideration.
To get a graphical idea of the resource usage, figures similar to those used for the syndrome are displayed below.
Figure 67) Resource usage per Gbps of throughput obtained. The x axis shows the number of symbols computed per clock pulse; the y axis shows the average number of basic elements used per Gbps of throughput.
Figure 68) Resource utilisation per clock pulse of latency saved. The x axis shows the number of symbols computed per clock pulse; the y axis shows the average number of basic elements used per clock pulse saved.
As for the latency and throughput of the component, Figure 57 is valid also for the eCSEE component.
4.1.1.3 Conclusions
With the data now available, a final computation can be made. The best options are clearly the two-parallel and four-parallel solutions; in this section they are the only ones under evaluation, in order to simplify the discussion.
First, the architecture of the system being evaluated is explained. To compare the two solutions fairly, the throughput has to be equalised for both. With the throughput fixed, the latency and the resource usage are computed and a trade-off is then made.
The system for the two-parallel solution is shown in Figure 69; the system for the four-parallel solution in Figure 70.
Figure 69) System evaluated for the two-parallel solution.
Figure 70) System evaluated for the four-parallel solution.
The use of components in a generic architecture can therefore be summarised as:

p · SyndromeComputer + ePIBMA + p · eCSEE

For simplicity, since every solution uses exactly one Berlekamp-Massey component, which therefore does not change the overall result, its contribution to the resource consumption is neglected here. The logic circuits are also neglected, since they give only a marginal contribution.
Before moving to the numbers, another observation has to be made: the resource usage must also include the FIFO RAM used for storing the incoming messages. The formulas below give the number of messages to be stored for a generic "p"-parallel solution:

latency = 2 · ceil(n/p) + (n − k) + s

#(symbols processed per channel) = ceil(n/p)

#(messages per channel) = ceil(latency / #(symbols processed per channel)) + 1
For the 2p case, for example, the latency is nearly 272 clock pulses and the number of symbols processed per channel is 128; therefore, three messages have to be stored. Here it was taken into account that the sequence of symbols arriving at the decoder leaves the Chien component in reverse order, so one more message has to be added.
If the same calculation is made for the 4p option, the latency is approximately 144 clock pulses, the number of symbols processed per channel is 64, and the number of messages still remains three. In general, whatever the parallelisation, three messages have to be stored per channel. The overall number of messages to store is thus given by:

#(messages) = c · 4

#(symbols) = c · 4 · n
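The message count can be checked with a small helper (a sketch; the +1 term accounts for the reverse-order output of the Chien search, as described above, and the segmentation values s=1 and s=2 are those of the 2p and 4p options):

```python
import math

def messages_per_channel(n, k, p, s):
    """FIFO messages per channel for a p-parallel, s-segmented decoder."""
    latency = 2 * math.ceil(n / p) + (n - k) + s
    symbols_per_channel = math.ceil(n / p)
    return math.ceil(latency / symbols_per_channel) + 1

# Both the 2p and 4p RS(255,239) options need 4 messages per channel:
# 3 from the latency plus 1 for the reverse-order output.
print(messages_per_channel(255, 239, 2, 1))  # 4
print(messages_per_channel(255, 239, 4, 2))  # 4
```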
The table below summarises the computations for the two analysed solutions.

       Syndrome Computer  eCSEE   RAM    Total
2p-8c  12288              109808  65280  187376
4p-4c  12352              107576  32640  152568
Table 16) Number of XOR gates used by the two solutions taken into consideration.
The results show that the four-parallel solution is comparable to the two-parallel one in the basic components, but achieves a significant saving of XOR gates thanks to the smaller number of channels and, consequently, the smaller number of messages to be stored.
4.1.2 Innovative designs
The previous discussion found that the most convenient topology is the four-parallel cell. The question that arises is: given the cell topology, is the four-parallel version the most convenient one? Since the number of channels clearly impacts the XOR gates used for storing the messages, and since that count is significant with respect to the overall number of gates used, is there any better solution? The answer is not straightforward, so some analysis has to be made.
The approach used in this thesis differs from the ones seen in the literature. The main idea is to use a "heavy" Berlekamp-Massey core that uses more resources than the others but does not worsen the critical path, and to reuse it for several messages. To exploit the ePIBMA component to the full, the optimal numbers of syndrome and Chien blocks are derived: the opposite of what the other works do. The result of this approach can be seen in Figure 75. At the end, this approach is compared against what can be found in the literature, using a theoretical estimation of equivalent gates, the FPGA implementation and the ASIC one.
4.1.2.1 Proposed decoder architecture and theoretical analysis
The solutions under analysis are those that can deliver a throughput of 41.28 Gbps, i.e. sixteen times that of the basic cell. Parallelising the already-realised basic cell sixteen times is considered one of the possible cases.
In general, the Berlekamp-Massey component remains the same, while several different configurations are implemented in order to find the best one. One small change is made: the registers that keep the syndrome values are inserted after the big multiplexer. This can save a lot of resources; for example, in the 1p solution the syndrome is saved using one sixteenth of the registers used in the other configuration.
After presenting the internal architectures and analysing latency and critical path, the section ends with a comparison of the resources used and the results obtained; on this basis, the most convenient solution is predicted. Later, the analyses are verified with the Xilinx Vivado® post-implementation simulation and an ASIC implementation.
1p RS Decoder
The first solution explored is the one that parallelises the syndrome and Chien components used in the basic RS(255,239) decoder built before. The structure of the system is shown in Figure 71.
Figure 71) Schematic of the system using 1p versions of the cells for SyndromeComputer and eCSEE
components.
For this solution, the structures of the components are exactly the same as presented in the previous chapter; the only difference is that some components are instantiated more than once.
With respect to the original syndrome cell there is actually one small change: to save some resources, the registers storing the syndrome are inserted after the 16-1 multiplexer, and this also influences the structure of the cell. In short, the lower register of the cell architecture of Figure 24 is removed, as can be seen in Figure 72.
Figure 72) Internal structure of the 1p SyndromeComputer cell.
As for latencies, the syndrome and Chien parts each require "n" clock pulses; summed with the "n-k" of the Berlekamp-Massey and one pulse for the registers saving the syndrome values, this leads to "3n-k+1" clock pulses. A few more clock pulses due to segmentation are added, but they are negligible with respect to the total.
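For RS(255,239) this amounts to the following quick check (the function name is illustrative):

```python
def decoder_latency_1p(n=255, k=239):
    """Syndrome (n) + Berlekamp-Massey (n-k) + Chien (n) + 1 register stage."""
    return n + (n - k) + n + 1  # = 3n - k + 1

print(decoder_latency_1p())  # 527 clock pulses
```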
2p RS Decoder
In the parallelised solutions, the Chien component is always based on the same cell topology, like the one in Figure 66. The internal structure of the cell can always be derived from the figures already shown, so it will not be repeated here every time.
Figure 73) Schematic of the system using 2p versions of the cells for SyndromeComputer and eCSEE
components.
The syndrome calculation is more interesting, since the segmentation of the cells can lead to different solutions, each with its advantages and drawbacks.
Figure 58 shows a basic version of the two-parallel cell, as already presented. The critical path of the RS_ePIBMA component is:

T_critical path = T_multiplier + T_adder

The goal is not to exceed the limit imposed by the Berlekamp-Massey component, so that the operating frequency of the system does not worsen. The cell found in [Ji15] has a critical path of two adders and a simplified multiplier, so it is not the solution we are looking for. A variant that shortens the critical path a bit is the 2p-1s shown below.
Figure 74) Internal structure of the 2p-1s SyndromeComputer cell.
The critical path is now set by the accumulator on the right:

T_critical path = T_multiplier + T_adder + T_multiplexer

This path cannot be shortened further, since the accumulator operates in a loop. In all the configurations analysed later, this will be the target to reach. The price to pay in this variant is one more register and one more clock pulse of latency. The additional clock pulse actually introduces a new problem: the Berlekamp-Massey needs 16 clock pulses per message, while the normal two-parallel non-segmented syndrome block needs 128 clock pulses, so with eight syndrome blocks the timings fit perfectly. With the additional register, the segmented variant needs 129 clock periods, and so the system has to stall for one clock. In theory this decreases the throughput: doing the math, it drops from 41.28 Gbps to 40.96 Gbps (a loss of nearly 0.32 Gbps in total) if the frequency is assumed constant. This assumption, however, does not seem correct, since the critical path changes significantly; it has to be verified with a simulation tool.
Figure 75) Graph showing the perfect alignment of the signals when the cells are not segmented.
In Figure 75 the periods in which the SyndromeComputer components are active are shown in green, the Berlekamp-Massey ones in red and the eCSEE ones in blue. It can be noticed that the signals fit perfectly. In case of segmentation, the Berlekamp-Massey component has to stay idle for some time to allow the correct operation of the decoder.
4p RS Decoder
Similarly to the previous variant, the generic system for a four-parallel decoder is presented below.
Figure 76) Schematic of the system using 4p versions of the cells for SyndromeComputer and eCSEE
components.
Figure 77) Internal structure of the 4p-1s SyndromeComputer cell.
Figure 78) Internal structure of the 4p-2s SyndromeComputer cell.
Skipping the basic non-segmented case, which is obviously not practical because of its long critical path, the one-segmented and two-segmented solutions are presented.
Going from the non-segmented, through the one-segmented, to the two-segmented variant, a significant increase in registers has to be paid in order to shorten the critical path. Another price to pay is the latency, which increases as well. The same effect considered for the 2p-1s variant occurs in all the segmented solutions: in particular, assuming the same operating frequency (to be verified by simulation), the decoder has to stall for two cycles for correct operation. The throughput therefore drops from the original 41.28 Gbps to 40.03 Gbps. As stated before, in reality this factor should not decrease, since a significant increase of the frequency should be achieved; should that not happen, the non-segmented solution would be the best. A precise analysis, again, will be made with the simulation tools in the next section.
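The stall-adjusted throughputs quoted for the segmented variants follow from scaling the ideal 41.28 Gbps by the fraction of useful cycles; a sketch under the constant-frequency assumption (cycles per message: 128 for 2p, 64 for 4p, 32 for 8p):

```python
def stalled_throughput(ideal_gbps, cycles_per_message, stall_cycles):
    """Throughput when the decoder stalls for stall_cycles per message."""
    return ideal_gbps * cycles_per_message / (cycles_per_message + stall_cycles)

print(round(stalled_throughput(41.28, 128, 1), 2))  # 2p-1s: 40.96 Gbps
print(round(stalled_throughput(41.28, 64, 2), 2))   # 4p-2s: 40.03 Gbps
print(round(stalled_throughput(41.28, 32, 2), 2))   # 8p-2s: 38.85 Gbps
```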
8p RS Decoder
As with the four-parallel solution, the eight-parallel one can be segmented in three different ways.
Figure 79) Schematic of the system using 8p versions of the cells for the SyndromeComputer and eCSEE components.
Figure 80) Internal structure of the 8p SyndromeComputer cell.
Figure 81) Internal structure of the 8p-1s SyndromeComputer cell.
Figure 82) Internal structure of the 8p-2s SyndromeComputer cell.
In general, the same trade-off observations apply: better latency and lower resource usage versus the shortening of the critical path. In the worst case, i.e. the 2s variant of the eight-parallel cell, the throughput worsens to 38.85 Gbps. As said previously, the frequency actually achieved should be higher, so the overall throughput should exceed that of the non-segmented variant.
16p RS Decoder
For the sixteen-parallel solution, intermediate variants are not shown; only the final two-segmented solution (Figure 84) is presented. The discussion about the trade-off is the same as for the other parts, so it is skipped here.
Figure 83) Schematic of the system using 16p versions of the cells for SyndromeComputer and eCSEE
components.
Figure 84) Internal structure of the 16p-2s SyndromeComputer cell.
Comparisons
In general, the sums performed after the first segmentation layer can be arranged as an adder tree, so that the path passes through the lowest possible number of adders; the overall number of adders does not change.
The table below summarises the resource consumption and critical path of the cell topologies presented so far. For the multiplier, the asterisk denotes the simplified version, i.e. multiplication by a constant value.
        Adder  Multiplier*  Register  Mux (2-1)  Tcritical_path
1p      1      1            1         0          T_mult + T_add
2p      2      2            1         1          T_mult + 2·T_add
2p-1s   2      2            2         1          T_mult + T_add + T_mux
4p      4      4            1         1          2·T_mult + 3·T_add
4p-1s   4      4            3         1          T_mult + 2·T_add
4p-2s   4      4            4         1          T_mult + T_add + T_mux
8p      8      8            1         1          2·T_mult + 4·T_add
8p-1s   8      8            4         1          T_mult + 2·T_add + T_mux
8p-2s   8      8            6         1          T_mult + T_add + T_mux
16p     16     16           1         1          2·T_mult + 9·T_add
16p-2s  16     16           10        1          T_mult + T_add + T_mux
Table 17) Utilisation of elements in the various SyndromeComputer cell variants.
To complete the analysis, the total amount of resources used in the SyndromeComputer and eCSEE components has to be computed. Choosing for each parallelism the variant that gives the minimum critical path (so that the operating frequency can actually increase), the table below summarises the components used by the solutions analysed. For the Chien component, as said in the previous section, the resource occupation follows a precise formula, so the discussion can be skipped; the details can be found in Table 14.
           SyndromeComputer  ePIBMA  eCSEE  Correction_Block  Total
Original   1152              4480    4135   8168              17935
1p-16c     12288             4480    63536  130688            210992
2p-1s-8c   13312             4480    48768  65408             131968
4p-2s-4c   12800             4480    48288  32768             98336
8p-2s-2c   11008             4480    48048  16448             79984
16p-2s-1c  10112             4480    47928  8288              70808
Table 18) Number of XOR gates used by every block in every version.
In this computation, the logic blocks and the big multiplexer are neglected. Table 19 lists the resource usage of each of these multiplexers.
In general, the equivalent number of basic 2-1 multiplexers for a c-input multiplexer is computed with the recursive halving:

#(basic mux) = c/2 + c/4 + … + 2 + 1

where c is the number of syndrome blocks parallelised.
Here is an example to explain the formula. For the 1p solution, c=16, so a sixteen-input multiplexer has to be converted. In the first step, sixteen inputs are reduced to eight outputs using eight 2-1 multiplexers; in the second step, eight inputs are reduced to four outputs with four multiplexers, and so on. The overall number of multiplexers needed is summarised in the table below.
           #(basic multiplexers)
1p-16c     8+4+2+1 = 15
2p-1s-8c   4+2+1 = 7
4p-2s-4c   2+1 = 3
8p-2s-2c   1
16p-2s-1c  0
Table 19) Results of the decomposition of the P-1 multiplexer.
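The recursive-halving decomposition can be written out directly (a sketch; the function name is illustrative, and c is assumed to be a power of two):

```python
def basic_mux_count(c):
    """Number of 2-1 multiplexers needed to build a c-input multiplexer
    by repeated halving (c assumed a power of two)."""
    total = 0
    while c > 1:
        c //= 2     # each stage halves the number of signals...
        total += c  # ...using one 2-1 mux per surviving signal
    return total

for channels in (16, 8, 4, 2, 1):
    print(channels, basic_mux_count(channels))  # 15, 7, 3, 1, 0
```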
For a complete evaluation, some further details have to be taken into account. The equivalences of Table 7 can be used to translate these basic elements into XOR gates: the first solution needs fifteen 2-1 multiplexers of "m" bits, which adds about 120 XOR gates and is clearly negligible.
In Table 18, the RAM storing the messages is included in the block called Correction_Block, which stores the data at the input, takes the error vector coming from the Chien component and corrects the received symbol when needed. The expectation for the implementations is that the best solution should be, without any doubt, the 16p-2s-1c.
4.1.2.2 FPGA implementation
In the second part of this analysis, every solution was implemented in the IDE and an implementation run was performed in order to get precise results. The device used is a Kintex 7. From the previous analysis it seems clear that parallelisation brings huge improvements; in this sense, the 16p-2s-1c decoder should obviously be the best choice.
After the presentation of all the versions, the resource-usage results are given. A detail about how they were obtained is worth mentioning: to find the minimum clock period, the synthesiser uses more resources, whereas to measure the hardware usage the clock period is relaxed to 20 nanoseconds. The timing and resource-usage information is therefore not strictly coherent, but it gives a good estimate of the ranking among the versions obtained. That said, we can move to the analysis of the FPGA implementations.
1p-16c RS Decoder
In this version the changes are few, since the architecture basically remains the one adopted for the basic RS(255,239) decoder, and the parallelisation process was quite straightforward. The only block introduced is the Delay_Block (Figure 85 and Figure 86), which delays the input with FIFO logic.
Figure 85) Black box representation of the Delay_Block.
Figure 86) Internal structure of the Delay_Block.
The block provides the delaying function without needing any outer control logic, since this is included inside it: the input and output addresses are generated by two internal processes.
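The behaviour of the Delay_Block can be sketched as a circular buffer whose read address trails the write address by a fixed depth; the following Python model (a behavioural sketch, not part of the thesis VHDL) illustrates the idea:

```python
class DelayBlock:
    """Behavioural model of a fixed-delay FIFO: each clock, the symbol
    written `depth` cycles ago is read out before the new one is stored."""
    def __init__(self, depth):
        self.depth = depth
        self.ram = [0] * depth   # memory initialised to zero
        self.wr = 0              # single address: read-then-write at the same slot

    def clock(self, data_in):
        data_out = self.ram[self.wr]          # symbol from `depth` cycles ago
        self.ram[self.wr] = data_in           # overwrite with the new symbol
        self.wr = (self.wr + 1) % self.depth  # advance the circular address
        return data_out

fifo = DelayBlock(depth=4)
outputs = [fifo.clock(i) for i in range(1, 9)]
print(outputs)  # first `depth` outputs are the initial zeros: [0, 0, 0, 0, 1, 2, 3, 4]
```

In the hardware block the two addresses are generated by the internal processes; here a single read-then-write pointer gives the same fixed delay.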
Once the Delay_Block was implemented, the technical problem encountered was how to use a generate statement for the sixteen channels, since coding all of them manually was clearly not convenient. The solution was to incorporate all the remaining channel logic inside a new component called Correction_Block (Figure 87 and Figure 88). In summary, this component receives the incoming data, waits for the Chien search to finish its operation and then generates the output of the decoder. The block allows the VHDL coding of the channels with a generate statement, so the architectural problems raised by the addition of the channels are all solved.
Figure 87) Black box representation of the Correction_Block.
Figure 88) Internal structure of the Correction_Block.
At first, the Correction_Block was implemented precisely as described. After some tests it was changed, because it proved simpler to incorporate the Delay_Block components directly inside the correction block; only the organisation of the circuit changed, not the underlying logic.
The speed achieved with this design was around 275 MHz, leading to a throughput of 35.2 Gbps.
2p-1s-8c RS Decoder
For the two-parallel version of the decoder the architecture is slightly different. The differences in the syndrome-computing cell were already discussed above; some more explanation is given here for the architecture of the eCSEE component, now renamed parallel enhanced Chien Search and Error Evaluation (peCSEE).
Since the architecture is more complicated and the cell of the Chien component changes, the basic blocks implemented in VHDL had to change with respect to the original version. The first change needed was the redefinition of the syndrome cell; the cell implemented in the VHDL system is the one shown in Figure 89.
Figure 89) Internal structure of the 2p-1s-SyndromeCell.
After implementing the cell and slightly modifying the general architecture of the SyndromeComputer (because of the one-clock-pulse segmentation inside the cells), the main design problem lay in the final parts of the Chien search and error evaluation components. This part, in fact, changed shape completely after the introduction of the new cell (Figure 90).
The design therefore started with a cell whose purpose is sampling and storing some important variables coming from the Berlekamp-Massey block, for example sigma(0) or B(0); these values must not change while the eCSEE component is processing the data. The schematic of the Sampling_Cell is presented in Figure 91.
Figure 90) Internal structure of the 2p-eCSEE cell.
Figure 91) Internal structure of the Sampling_Cell.
One of the most important parts of the Chien search can now be explained: the evaluation blocks. The structure remained as in the original version, i.e. separate evaluations of the even and odd parts of the two polynomials; Figure 92 shows the structure. At the end of the section, more detail is given about the segmentation and the latencies of the entire block.
The design of the eCSEE component then moved to a device implementing the last part: detecting the error and, when needed, computing the error magnitude. The structure of the eCSEE is displayed in Figure 93, and the Final_Stage component is shown in Figure 94.
Now the segmentation can be discussed. At first the two outputs of the eCSEE_Cell were not registered, which used fewer resources. However, when the various blocks were assembled, the critical path of the component became quite long, since it passed through one simplified multiplier, three adders and one full multiplier. The path obviously had to be shortened, and the first measure taken was to introduce segmentation inside the evaluation cells. Note that this segmentation did not create any synchronisation problem apart from the adjustment of ePIBMA_ready, since the sampling cells hold all the other variables. The critical path was then still three adders and a full multiplier, so it had to be cut just before the multiplier; in the picture, the registers inserted at the outputs of the evaluation blocks can be seen. This segmentation added another register in the path of the ePIBMA_ready signal. A final comment concerns the outputs of the final stages: the ready signals are combined with an AND gate (they should be synchronised, so this should not be a problem), and the numbers of errors found obviously have to be summed, since they refer to the same message.
Figure 92) Structures of the evaluation blocks for the odd and even parts of the two polynomials.
Figure 93) Internal structure of the Chien search and error evaluation component.
Figure 94) Internal structure of the Final_Stage component.
The internal architecture of the Final_Stage is quite similar to the one used in the original decoder, but here it is grouped inside one component. The presence of the inversion ROM requires the insertion of registers for the correct synchronisation of the signals: one was put in the path of the comparison of the sigma evaluation to zero. No further segmentation was needed for the inversion ROM, since the Zeta_Computer already contains a delay inside, as can be seen in Figure 95.
Figure 95) Internal structure of the component for the calculation of zeta.
Synchronisation raised another issue, due to the enable signal introducing a one-clock-pulse delay in the output processes that generate the ready, num_err and error-location signals. Because of this, a new register had to be inserted at the output of the multiplexer, thus avoiding long critical paths through the blocks cascaded after it.
In the realisation of the system a problem arises: the two final stages should start from different values of zeta, so a component that computes them (Starting_Zeta_Computer) has to be inserted. In the figure below it is possible to see the internal architecture of the block.
Figure 96) Internal architecture of the 2p-Starting_Zeta_Computer.
This solution would add two multipliers and 2·m+1 bits of registers. It also leads to a different structure of the component that computes the zeta inside the final stage, as shown in the picture below.
Figure 97) Internal structure of the modified Zeta_Computer.
Though this solution seems reasonable at first, when it is extended to the more parallelised versions of the Chien component it is clearly not implementable. The figures below present the versions for the other decoders: a 16p version needs sixteen additional full multipliers, which is obviously not acceptable.
Figure 98) Internal architecture of the 4p-Starting_Zeta_Computer.
Figure 99) Internal architecture of the 8p-Starting_Zeta_Computer.
Figure 100) Internal architecture of the 16p-Starting_Zeta_Computer.
Note that these multipliers are used only once per message, which is a total waste of resources. An additional restriction limiting the possible solutions is the latency: the maximum tolerated latency for the block is three to four clock pulses, so the solution is not straightforward. A first way to reuse the multipliers is what we will refer to as the enhanced_Starting_Zeta_Computer. The figure below shows only the sixteen-parallel version, but analogous architectures can be derived for the other options.
Figure 101) Internal architecture of the 16p-enhanced_Starting_Zeta_Computer.
The architecture is considerably more complex, and so is the control logic. Although not shown in the picture, there is a significant saving of resources: most of the registers can be discarded and only eight multipliers are needed, with the latency unchanged.
This solution seemed the best obtainable, but the problem was finally solved in a radically different way. The idea that avoids all of the computation comes from understanding how the zeta value is computed inside the Berlekamp-Massey component. To compute the two required zetas, two Look-Up Tables (LUTs) are inserted, filled with alphas from α^0 to α^(15i), where i is the index of the corresponding table (in this case zero and one). A counter advances according to the ePIBM algorithm and is used at the end as a pointer into the LUTs. This solution minimises the resources used and even saves one multiplier from the ePIBMA; it is the solution adopted for all the versions from here onwards.
Figure 102 shows how the components are connected to each other inside this version of the decoder. The last column of components is the stage of the Correction_Block components.
The block is basically the same presented for the previous version, but some parallelisation is needed; the modified version is presented in Figure 103 and Figure 104.
Figure 102) Blocks assembly inside the decoder.
Figure 103) Black box representation of the Correction_Block.
Figure 104) Internal structure of the Correction_Block component.
The parallelisation is obtained by using p Delay_Block components, the same ones introduced in the previous version. As said before, the grouping of the circuit in the Delay_Block does not correspond to the VHDL code, where it has been distributed inside the Correction_Block.
The clock period obtained for this implementation on the device defined previously is 4 nanoseconds, which corresponds to a frequency of 250 MHz and therefore a throughput of 32 Gbps. These values are only indicative of the rough ranking among the versions; changing the device changes this parameter.
4p-2s-4c RS Decoder
This version of the decoder is very similar to the two-parallel one. The components previously presented are modular: if they are scaled by the p parameter, this solution becomes really easy to implement. Below are images of some of the components updated to this version. Since the schematics are straightforward to derive from the ones already shown, from now on they will not be inserted anymore.
Figure 105) Internal structure of the 4p-2s SyndromeComputer cell.
Figure 106) Internal structure of the four-parallel eCSEE cell.
Figure 107) Schematics for the evaluation blocks and the following stages.
Figure 108) Simplified blocks that constitute the 4p-2s-4c decoder.
As for the minimum clock period obtainable with this decoder architecture, 4 nanoseconds were obtained, corresponding to nearly 250 MHz; the throughput is therefore 32 Gbps.
8p-2s-2c RS Decoder
This version follows the same pattern as the previous one; only a few images are shown, both for reasons of space and because the topology has already been seen.
Figure 109) Simplified blocks that constitute the 8p-2s-2c decoder.
The architecture is much simplified with respect to the previous one. Moving to more parallelised versions, the complexity migrates from the global architecture (the one in Figure 109, for example) to the cells, which become bigger and have more connections. Moreover, the multiplexer (like the de-multiplexer) becomes simpler, with two inputs and one output.
The minimum clock period achieved was 3.85 nanoseconds, i.e. almost 260 MHz, which brings the throughput to a higher value of 33.28 Gbps.
16p-2s-1c RS Decoder
This version is even more simplified: the multiplexer and the de-multiplexer disappear completely, together with the register after the multiplexer. For the sake of simplicity, in fact, that register was included directly in the cells (Figure 110).
Figure 110) Internal structure of the SyndromeCell for the 16p-2s-1c version.
The control-signal logic is the simplest possible, since there is only one channel. In this case, as mentioned before, all the complexity of the system moves inside the cells, as can be noticed in the picture below.
Figure 111) System architecture of the 16p-2s-1c decoder.
The schematic of the system is very similar to that of the original non-parallelised version. The other components are not described, since they follow the same topology presented previously.
The minimum clock period obtained was significantly shorter than in the previous solutions: 3.3 nanoseconds, versus the 3.8 nanoseconds of the best previous versions. This is due to the simplification and the reduced complexity of the connections between the components: less complexity also means shorter routing paths and therefore a higher operating frequency. The operating frequency of this version is therefore 303 MHz, which leads to a throughput of 38.79 Gbps.
Pushing the synthesiser a bit further, the results show a significant increase in the resources used; the minimum clock period obtained is 3.25 nanoseconds (an operating frequency of 307.7 MHz and thus a throughput of 39.38 Gbps).
The choice between the two implementations depends on the final goal. If the device is the one used for these tests and the goal is the Ethernet speed, then three decoders of the first, less resource-hungry version have to be used in parallel. Each time, it has to be evaluated whether the extra 2 Gbps are worth the additional resources.
Implementation results
The two main aspects to be examined are resource usage and the performance obtained; the results are taken directly from the Vivado® environment.
The versions are compared in two separate tables: resources and timings. For the resources, the version of the decoder using the fewest possible resources is considered; for the timings, the minimum obtainable clock period is reported (even if it uses more resources in theory).
Table 20) Recap table of the resources used by the implementations of the various versions of the decoder in a
Kintex 7.
In Table 20, for the Correction_Block of the original implementation there is unexpectedly no usage of LUTs or registers. This is because its architecture actually differs from the others: the adders are distributed in the generic decoder, and instead of the Correction_Block the RAM for the delay is reported. This choice was made to allow a comparison of the BRAM usage.
In the table below, the number of clock pulses required by each decoder to produce an output is analysed.

Version      Latency [#(clock pulses)]
1p-16c       n+(n-k)+n+11 = 3n-k+11 = 537
2p-1s-8c     (n/2)+(n-k)+(n/2)+11 = 2n-k+11 = 282
4p-2s-4c     (n/4)+(n-k)+(n/4)+12 = (n/2)+n-k+11 = 154
8p-2s-2c     (n/8)+(n-k)+(n/8)+11 = (n/4)+n-k+11 = 85
16p-2s-1c    (n/16)+(n-k)+(n/16)+11 = (n/8)+n-k+10 = 54

Table 21) Latency of each decoder in clock pulses.
The calculation is done taking into account the latencies due to:
1) SyndromeComputer intrinsic latency;
2) Syndrome cells additional segmentation;
3) eCSEE latency before starting the throughput;
4) eCSEE intrinsic latency for finishing all the symbols correction.
Moreover, all the versions incur one clock pulse of latency from registering the output of the multiplexer after the SyndromeComputer components, and are also affected by the intrinsic latency of the Berlekamp-Massey block (n−k). The timings are computed by combining the clock-pulse counts with the clock periods already reported in the relative sections.
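For the rows of Table 21 whose expressions reduce cleanly (the more parallel versions involve ceiling effects and extra pipeline registers), the closed forms can be checked numerically; a sketch, using n = 255 and k = 239:

```python
n, k = 255, 239  # RS(255,239) parameters

def latency_pulses(p, overhead=11):
    """Clock pulses: message entry (n/p), Berlekamp-Massey block (n-k),
    message exit (n/p), plus a fixed pipeline overhead (assumed here)."""
    return int(2 * n / p + (n - k) + overhead)

print(latency_pulses(1))  # 1p-16c row: 537
print(latency_pulses(2))  # 2p-1s-8c row: 282
```

The two n/p terms for p = 2 sum back to exactly n, which is why that row simplifies to 2n−k+11.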
The timing results are presented in the table below. Note that for the 1p-16c version the operating frequency consistent with the 3.65 ns period is about 274 MHz.

Version      Clock period [ns]   Operating frequency [MHz]   Throughput [Gbps]   Latency [ns]
Original     3.02                331.12                      2.65                1621.74
1p-16c       3.65                274                         35.07               1960.05
2p-1s-8c     4                   250                         32                  1043.4
4p-2s-4c     4                   250                         32                  674.25
8p-2s-2c     3.85                260                         33.25               400.4
16p-2s-1c    3.25                307.7                       39.38               255.2

Table 22) Recap table for the timing results on the Kintex 7.
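Since every parallelised version processes sixteen 8-bit symbols per clock (p × c = 16 throughout), the throughput column is simply the operating frequency times 128 bits; a quick sanity check in Python, assuming that relationship:

```python
SYMBOLS_PER_CLOCK = 16  # p * c is 16 in every parallelised version
BITS_PER_SYMBOL = 8

def throughput_gbps(freq_mhz):
    """Decoder throughput: symbols per clock times symbol width times clock rate."""
    return freq_mhz * 1e6 * SYMBOLS_PER_CLOCK * BITS_PER_SYMBOL / 1e9

print(throughput_gbps(250))    # 2p-1s-8c / 4p-2s-4c rows
print(throughput_gbps(307.7))  # 16p-2s-1c row
```

The same relation, with 8 bits per clock instead of 128, gives the 2.65 Gbps of the original serial decoder.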
To check the correct implementation of the decoders, some considerations about the BRAM usage have to be made. The BRAM blocks embedded in the FPGA are 1k×36, organised in two separate blocks of 1k×18 each. The calculations below match the results obtained.
p = #(symbols processed)   c = #(channels)   #(words/component)   #(bits/word)   BRAM configuration   Generic configuration   #(BRAMs)
1                          16                1024                 8              1k × 18              1k×8                    8
2                          8                 512                  16             512 × 36             512×16                  4
4                          4                 256                  32             512 × 36             256×36                  2
8                          2                 128                  64             512 × 36             128×36                  2
16                         1                 64                   128            512 × 36             64×36                   2

Table 23) Configurations of the BRAMs used for delaying the messages.
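The per-channel FIFO geometry of Table 23 follows from a fixed total of 1024 symbol slots being split across wider, shallower memories as p grows; a small sketch, assuming that relationship:

```python
def fifo_geometry(p, total_symbols=1024, symbol_bits=8):
    """Words per delay component and word width for a p-parallel decoder:
    depth shrinks as parallelism grows, width grows as p symbols share a word."""
    words = total_symbols // p   # #(words/component)
    bits = symbol_bits * p       # #(bits/word)
    return words, bits

for p in (1, 2, 4, 8, 16):
    print(p, fifo_geometry(p))
```

The total storage (words × bits) is constant at 8192 bits per component, which is why the BRAM count stops decreasing once the word width exceeds the 36-bit port of a single block.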
Regarding the minimum clock period, it can be noticed that on an FPGA the timing is strictly limited by the routing of the signals: the simpler the architecture, the faster it goes. This explains the noticeable difference in speed between the first versions and the last ones.
Regarding resource usage, instead, it is clear that the more the system is parallelised, the fewer resources are used. As said before, surprisingly, the operating frequency also improves, which leaves no doubt about the optimal solution: the 16p-2s-1c RS(255,239) decoder.
After implementing the decoders on the Xilinx Kintex 7, in order to speed the system up as much as possible, the various versions were implemented on a Xilinx UltraScale FPGA (xcvu190-flga2577-3-e). Fixing the operating frequency to 300 MHz, the final throughput was a constant 38.4 Gbps. The two tables below summarise the results obtained.
Table 24) Recap table of the resources used by the implementations of the various versions of the decoder in a
Virtex.
Table 25) Table that resumes the timing information of the various implementations in a Virtex.
Table 25 summarises the main parameters of the designs. The throughputs are almost constant, except for some versions where the minimum clock period was difficult to lower. To understand how fast some implementations are and how much margin is left, the Worst Negative Slack (WNS) parameter is also given: with it, we can appreciate how close the design is to the timing limit. Finally, the clock period is rounded, which is why there is an apparent inconsistency in the operating-frequency information of the two-parallel and four-parallel versions: the operating frequency is the precise, reliable value.
To accomplish the goals of the thesis, three decoders of the 16p version will be needed; the attained throughput will be nearly 115 Gbps.
4.1.2.3 ASIC implementation
The Cadence SOC Encounter software makes it possible to go from VHDL code to an integrated circuit. The standard-cell library used is the “Faraday 90-nm CMOS standard cell library”.
To estimate the number of gates used by the components, an approximate technique was used. First, circuits of 50 NAND gates and 50 XOR gates were synthesised and their total areas were obtained. When a circuit is later implemented, the following formulas give a rough estimate of the number of basic gates used:
#(basic XOR gates) = 50 · Area(device) / Area(50 XOR gates)

#(basic NAND gates) = 50 · Area(device) / Area(50 NAND gates)
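The area-ratio estimate is straightforward to express in code; the sketch below uses made-up area figures purely for illustration (they are not taken from the thesis):

```python
def equivalent_gates(device_area, ref_area_50_gates):
    """Estimate the equivalent basic-gate count of a synthesised circuit
    by scaling against the area of a reference circuit of 50 gates."""
    return 50 * device_area / ref_area_50_gates

# Hypothetical areas (same units for both, e.g. um^2)
print(equivalent_gates(device_area=10_000.0, ref_area_50_gates=250.0))  # -> 2000.0
```

The same function is applied twice per design, once with the XOR reference area and once with the NAND reference area.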
In this section, the place-and-route timings of the ASIC implementations are also reported. The expectation is a remarkably higher operating frequency: in the FPGA implementations, in fact, the minimum clock period is limited by the routing of the signals.
Procedure
Some scripts had to be written to implement the various versions of the decoders. Since the implementations were launched from the console (and not via the GUI), the hierarchical compilation order of each part had to be specified. Running the first scripts, it was noticed that the implementation of RAM and ROM was too expensive in terms of hardware utilisation. The way of telling the synthesiser that the component to be instantiated is a RAM differs between the FPGA flow and the ASIC flow: the VHDL code has to change. Taking as an example the internal description of the Delay_RAM, it was previously described as follows:
--------------------------------------------------
-- WRITE PROCESS
--------------------------------------------------
Write_process : process(clk)
begin
if rising_edge(clk) then
if (ena='1') then
RAM(conv_integer(addra)) := dia;
end if;
end if;
end process Write_process;
--------------------------------------------------
-- READ PROCESS
--------------------------------------------------
Read_process : process(clk)
begin
if rising_edge(clk) then
if (enb='1') then
dob <= RAM(conv_integer(addrb));
else
dob <= (others=>'0'); -- if not enable, I give a 0
end if;
end if;
end process Read_process;
This description uses the BRAMs available in the target FPGA, which was exactly the solution chosen when designing the decoder: the FPGA synthesiser recognises the behavioural style shown above and implements the component with a BRAM. For the ASIC implementation, some VHDL code has to be added. In particular, it is better to implement the memory with latches instead of flip-flops, since it consumes less area. The Xilinx manual shows (Figure 112) the basic structure of a BRAM, which has to be reproduced with a structural description.
Figure 112) Internal structure of a BRAM cell.
The structural representation code is:
--------------------------------------------------
-- WRITE PROCESS
--------------------------------------------------
Write_process : process(ena_reg,dia_reg,addra_reg)
begin
if (ena_reg='1') then
RAM(conv_integer(addra_reg)) := dia_reg;
end if;
end process Write_process;
--------------------------------------------------
-- READ PART
--------------------------------------------------
dob_int <= RAM(conv_integer(addrb_reg));
--------------------------------------------------
-- OUTPUT LATCH
--------------------------------------------------
Latch_dob : process(enb_reg,dob_int)
begin
if (enb_reg='1') then
dob <= dob_int;
end if;
end process Latch_dob;
--------------------------------------------------
-- REGISTERING PROCESS
--------------------------------------------------
Reg_process : process(clk)
begin
if rising_edge(clk) then
ena_reg <= ena;
enb_reg <= enb;
addra_reg <= addra;
addrb_reg <= addrb;
dia_reg <= dia;
end if;
end process Reg_process;
The difference between the two cases lies only in the internal structure of the cell, so the best way to represent this in VHDL is to use the same entity with two different architectures. To do this, a constant had to be defined (in “RS_Decoder_Types.vhd”) to keep track of the version to be implemented; a std_logic type was chosen: if it is zero, the FPGA version is implemented, and if it is one, the ASIC version is chosen. The architecture is selected inside the code as follows:
-- connecting the RAM block
FPGA_mode_generate : if (implementation_mode='0') generate
Delay_RAM_component : entity work.Delay_RAM(FPGA_architecture) port map(
-- INPUTS
clk => clock,
ena => enable_in,
enb => enable_out,
addra => input_data_address,
addrb => output_data_address,
dia => internal_data_in,
-- OUTPUTS
dob => data_out_bus
);
end generate;
ASIC_mode_generate : if(implementation_mode='1') generate
Delay_RAM_component : entity work.Delay_RAM(ASIC_architecture) port map(
-- INPUTS
clk => clock,
ena => enable_in,
enb => enable_out,
addra => input_data_address,
addrb => output_data_address,
dia => internal_data_in,
-- OUTPUTS
dob => data_out_bus
);
end generate;
The same procedure was used for every component containing a ROM or a RAM. In the files provided, this kind of dual architecture is shown only in the 38p-2s-1c decoder (see the next chapter), just to show how it is coded; the other decoders keep the plain FPGA architecture.
Implementation results
The results are presented in two tables: one for area occupation and one for timings. Starting from the resources used, the results are broken down by component. This makes it possible to distinguish the gates used by each component and to compare how they change from one implementation to another. The gate counts of the decoder versions are given as equivalent numbers of XOR and NAND gates. Finally, the table contains two totals: this shows the overall impact of the FIFO memory on the area occupation and allows further considerations on resource usage.
Table 26) Recap table that resumes the equivalent number of gates used by each RS(255,239) ASIC
implementation of the decoders.
The numbers decrease steadily as the degree of parallelisation increases, as also happened with the FPGA implementations. The only component that grows is the ePIBMA, because of the ZetaComputer (the table providing the zeta results to the eCSEE component).
The ASIC implementations were realised with a fixed clock-period constraint: every version was implemented with a clock period of 2 nanoseconds, hence an operating frequency of 500 MHz.
In the table below, the results are grouped in the same order as Table 26, with the main decoder parameters summarised in the left part. While the throughput is kept constant, the latency decreases markedly, reaching only 116 nanoseconds with the 16p version.
Table 27) Recap table that resumes the main timing characteristics of the RS(255,239) ASIC implementations of
the decoders.
To compare the decoders, a new parameter is introduced. The idea is to bind two important figures (throughput and number of gates used) into a single parameter. The efficiency is defined so that it increases as the result improves:

Efficiency* = Throughput [Gbps] / Number of equivalent NAND gates without FIFO [thousands]
In previous works, the results were compared only on the gates used by the decoder itself, leaving aside the FIFO memory. A more proper comparison should consider the overall system, since there is no way to avoid implementing the memory. So another parameter is defined:

Efficiency = Throughput [Gbps] / Number of equivalent NAND gates with FIFO [thousands]
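These ratios can be checked directly against the comparison table; a sketch (throughput per thousand equivalent NAND gates, figures taken from Table 28):

```python
def efficiency(throughput_gbps, gates):
    """Throughput per thousand equivalent NAND gates (higher is better)."""
    return throughput_gbps / (gates / 1000)

# Figures for two published decoders, taken from the comparison table
print(round(efficiency(156, 582_000), 3))   # [Ji15], total gate count with FIFO
print(round(efficiency(240, 708_600), 4))   # [Park12], total gate count with FIFO
```

Using the partial gate count (without FIFO) instead of the total gives the Efficiency* column.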
Table 27 presents the comparisons among the decoders obtained and, as expected, the best one is the 16p.
4.1.2.4 Comparison with published results
The decoders obtained now have to be compared with the literature, again using the efficiency parameter. Here the two kinds of efficiency become useful: in the literature, in fact, the gates used for the FIFO memory are ignored. An estimate of the memory was made by taking the latency of each decoder into account and letting Python compute the equivalent number of gates. It will not be precise, but, since these data were omitted in the articles, it at least gives a rough estimation. In the comparison with the literature, the “gates” parameter refers to the equivalent number of NANDs.
Table 28 makes clear, from the “Efficiency” parameter, that the proposed solution is better than the others taken from the literature. Adding the FIFO to the calculation (last row), the gap between the decoders increases significantly. The latency also registers a relevant improvement: only 106 nanoseconds against, in the best of the other cases, 256.
To reach the Ethernet speed, the proposed architecture needs two decoders in parallel (hence the throughput will be 128 Gbps).
                              Proposed   [Ji15]   [Park12]  [LeeH08]   [LeeS08]  [Lee05]  [Song02]
CMOS technology [nm]          90         90       90        130        180       130      160
SC                            19600      75000    129000    58000      100800    48000    40000
KES                           12230      112000   114350    108200     156000    272000   84000
CSEE                          69900      82000    174250    211800     178000    73000    240000
Partial gate count            104650     269000   417600    378000     434800    393000   364000
FIFO                          31900      313000   291000    437000     313000    314000   318000
Total gate count              136550     582000   708600    815000     747800    707000   682000
Clock rate [MHz]              500        625      625       300        400       625      112
Latency [clock pulses]        53         260      161       242        260       522      168
Latency [µs]                  0.106      0.416    0.2576    0.8066667  0.65      0.8352   1.5
Throughput [Gbps]             64         156      240       115        102       80       43
Efficiency without FIFO
[Gbps/(thousands of gates)]   0.6116     0.58     0.5747    0.3042     0.2346    0.2036   0.1181
Efficiency
[Gbps/(thousands of gates)]   0.4769     0.268    0.3387    0.1411     0.1364    0.1131   0.06305

Table 28) Comparison table between the proposed decoder and the literature results.
Another interesting strength of the thesis can be deduced from the following table, which reports the relative weight (in percentage) of the FIFO memory over the total number of gates used. The results show that the proposed decoder devotes a much smaller share of its area to the FIFO than the published designs.

                     Proposed   [Ji15]   [Park12]  [LeeH08]  [LeeS08]  [Lee05]  [Song02]
FIFO                 31900      313000   291000    437000    313000    314000   318000
Total gate count     136550     582000   708600    815000    747800    707000   682000
Percentage weight    23.36%     53.78%   41.07%    53.62%    41.86%    44.41%   46.63%

Table 29) Analysis of the percentage weight of the FIFO memory on the total number of gates used.
4.2 Parallelisation of the RS(528,514) decoder
In analogy with what was done for the RS(255,239), it was necessary to parallelise the basic plain structure of the RS(528,514) decoder. The procedure is the same.
4.2.1 Background
Few articles about this code can be found in the literature, and none were available about its parallelisation. The reason is mainly that the code is relatively new: while the RS(255,239), which is now considered a standard, has countless publications about its implementation, this code is recent and leaves us a blank page, giving great freedom of design.
4.2.2 Innovative designs
The scientific articles give us no reference design against which to compare the results. The approach, then, will be to apply the same method used in the previous chapter, analysing different solutions, and finally, at the end of the thesis, to compare the best results obtainable with both codes: RS(255,239) and RS(528,514).
Leaving aside the large impact of the RAM used for delaying the messages, the clear conclusion reached at the end of the previous section was that parallelisation helps to save resources and achieves higher throughputs and lower latencies. The number of cycles needed by the Berlekamp-Massey component in this case is fourteen. Dividing the overall number of incoming symbols (528) by this value and rounding up gives the maximum number of usable channels:

#(channels) = ceil(528 / 14) = 38
Accordingly, the cases studied for the RS(528,514) are restricted to 1p-38c, 2p-1s-19c and finally 38p-2s-1c. The obvious expected outcome of the analysis is that the 38p-2s-1c should be the best of this family of decoders.
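The channel bound follows from matching the message-entry time to the Berlekamp-Massey processing time; in code, using the parameters of the two codes treated in the thesis:

```python
import math

def max_channels(n, bm_cycles):
    """Maximum number of channels: the message must take at least as many
    clock pulses to enter as the Berlekamp-Massey block needs to run."""
    return math.ceil(n / bm_cycles)

print(max_channels(528, 14))   # RS(528,514): BM needs fourteen cycles
print(max_channels(255, 16))   # RS(255,239): BM needs n-k = 16 cycles
```

The second call reproduces the sixteen channels of the 1p-16c RS(255,239) decoder, which is why the same family structure recurs here.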
4.2.2.1 Proposed architecture and theoretical analysis
As in the other analyses, a first rough calculation of the equivalent number of XOR gates is made. The calculation, again, is only indicative of the approximate resource usage, so it should agree with the ASIC implementation results within a good margin of tolerance. The structures used are the same as in the previous chapter, so no detail is given for the first two versions.
38p-2s-1c RS Decoder
The only detail that has to be discussed is the structure of the syndrome cell, since it was not discussed before.
Figure 113) Internal structure of the cell for the computation of the syndrome in the 38p-2s-1c RS(528,514)
decoder.
The following pictures show the structure of the so-called Mult_Cell, which composes the SyndromeCell for this version of the decoder.
Figure 114) Black box representation of the Mult_Cell component.
Figure 115) Internal structure of the Mult_Cell.
The structure of the syndrome computer is the same as before, only partitioned into smaller units to make it simpler to build. The segmentation is done at the end of each cell and at the end of the additions. Despite the long chain of adders, the critical path of this block is still the one consisting of a multiplier, an adder and a multiplexer.
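As a behavioural reference for the cell above, the syndrome computation can be sketched in Python with Horner's rule: multiply the accumulator by the root, then add the incoming symbol (the same multiplier-adder pair that forms the critical path). The GF(2^10) reduction polynomial x^10 + x^3 + 1 is an assumption made for illustration, and the sketch is purely serial, while the hardware processes thirty-eight symbols per cycle.

```python
def gf_mul(a, b, poly=0b10000001001, m=10):
    """Carry-less multiply in GF(2^m); poly = x^10 + x^3 + 1 (assumed)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a >> m:          # reduce whenever the degree reaches m
            a ^= poly
        b >>= 1
    return r

def syndromes(received, n_syn=14, alpha=2):
    """S_j = r(alpha^j) for j = 1..n_syn, evaluated with Horner's rule."""
    syn = []
    root = 1
    for _ in range(n_syn):
        root = gf_mul(root, alpha)   # next root alpha^j
        acc = 0
        for sym in received:         # one multiply-add per symbol, as in the cell
            acc = gf_mul(acc, root) ^ sym
        syn.append(acc)
    return syn

# A single error in the last-processed symbol yields S_j = e for every j.
assert syndromes([0] * 527 + [5]) == [5] * 14
```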
Another important aspect to remark is how the messages arrive at the decoder. The input data bus is organised in thirty-eight ports, and a message takes fourteen cycles to be completely inserted into the decoder. Therefore 532 symbols enter the decoder, so four of them have to be zero elements. To avoid adding any correction stage to the decoder, it is better to perform the zero-padding at the beginning of the message. If the message is:
m = [1, 2, 3, 4, …]
Then, the zero-padded message will be:
m* = [0, 0, 0, 0, 1, 2, 3, 4, …]
and so the syndrome is not affected. The symbols are ordered so that the first symbol of the zero-padded message is placed on the last port of the input data bus. Therefore, at the first clock pulse there will be:
data_in[37] <= m*[0];
data_in[36] <= m*[1];
...
data_in[0] <= m*[37];
More details are displayed in the test-bench of the 38p-2s-1c decoder.
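The claim that zero-padding at the head of the message leaves the syndrome untouched follows from Horner's rule: leading zeros keep the accumulator at zero. A minimal check (the GF(2^10) polynomial x^10 + x^3 + 1 is an illustrative assumption):

```python
def gf_mul(a, b, poly=0b10000001001, m=10):
    # Carry-less GF(2^10) multiply; the reduction polynomial is an assumption.
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a >> m:
            a ^= poly
        b >>= 1
    return r

def horner(symbols, root):
    # Evaluate the received polynomial at `root`; first symbol = highest degree.
    acc = 0
    for s in symbols:
        acc = gf_mul(acc, root) ^ s
    return acc

m = [1, 2, 3, 4] + [0] * 524        # a 528-symbol message
m_star = [0, 0, 0, 0] + m           # zero-padded to 532 symbols
assert all(horner(m_star, r) == horner(m, r) for r in range(1, 16))
```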
Comparisons
The data obtained from the calculations made by the Python script are presented in table form below.
            SyndromeComputer  ePIBMA  eCSEE    CorrectionBlock  Total
Original    1176              4172    7451     21128            33927
1p-38c      28728             4172    283138   802864           1118902
2p-1s-19c   23408             4172    245974   401584           675138
38p-2s-1c   21728             4172    243634   21424            290958
Table 30) Recap table of the number of XOR gates used by every block for every version.
In the analysis, four stored messages are counted per channel. This leads to the following table.
            #(messages to be stored)
Original    4
1p-38c      38 * 4 = 152
2p-1s-19c   19 * 4 = 76
38p-2s-1c   1 * 4 = 4
Table 31) Table that resumes the number of messages that have to be stored for every version.
As can be observed from Table 30, the weight of the message storage is dominant with respect to the other components. This confirms that the method used in the thesis brings relevant advantages.
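The bookkeeping behind Table 31 is simply four buffered messages per channel; a throwaway sketch (version names as in the text):

```python
channels = {"Original": 1, "1p-38c": 38, "2p-1s-19c": 19, "38p-2s-1c": 1}
stored = {name: 4 * c for name, c in channels.items()}  # 4 messages per channel
print(stored)  # {'Original': 4, '1p-38c': 152, '2p-1s-19c': 76, '38p-2s-1c': 4}
```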
The conclusions will be drawn in the final chapter, only after the results of the ASIC implementation have been obtained.
4.2.2.2 FPGA implementation
The analysis consisted in the VHDL coding and verification of the decoders. As before, to obtain timing information the clock period is pushed to the limit (therefore more resources are used), while to obtain resource-usage information the requested clock period is relaxed to 20 nanoseconds. These data are therefore not strictly related to each other, but they give a correct idea of the hierarchy in hardware usage and operating frequency.
Implementation results
The results in terms of hardware usage of the versions of the decoder are the following:
Table 32) Recap table of the resources used by the implementations of the various versions of the decoder in a
Kintex 7.
In the resource usage, it can be noticed that the ePIBMA component grows slightly with the higher parallelisation of the decoder. This is due to the zeta computer, which consists of as many tables of sixteen GF symbols as there are parallelised symbols, i.e. one for the 1p, two for the 2p and thirty-eight for the 38p version.
In general, the amount of resources used should decrease with higher parallelisation, but it can be seen that some parameters increase when passing from the 1p-38c to the 2p-1s-19c decoder. This is due to the change in the architecture of the decoder.
Another observation concerns the usage of BRAMs. This component is used in the implementations both for the inversion ROM and for the RAM that delays the arriving messages. It is not straightforward to reconstruct how the synthesiser maps these components onto BRAMs, but it is clear that more parallelisation gives a total saving of memory. To better analyse the BRAM usage due to the message delay in the channels, some calculations had to be made.
p=#(symbols  c=#(channels)  #(words/    #(bits/  BRAM           Generic
processed)                  component)  word)    configuration  configuration
1            38             2048        10       2k x 18        2k x 10, 38 BRAMs
2            19             1024        20       1k x 36        1k x 36, 19 BRAMs
38           1              64          380      512 x 36       64 x 36, 6 BRAMs
Table 33) Table that resumes the configurations of the BRAMs used for delaying the messages.
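A quick cross-check of Table 33 confirms the memory saving: the total delay storage (channels x words x bits per word) shrinks sharply as parallelisation grows.

```python
# (p symbols per cycle, channels, words per component, bits per word)
configs = [(1, 38, 2048, 10), (2, 19, 1024, 20), (38, 1, 64, 380)]
for p, c, words, bits in configs:
    total_bits = c * words * bits
    print(f"{p}p: {total_bits} bits of delay memory")
# 1p: 778240, 2p: 389120, 38p: 24320
```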
As also seen for the other kind of decoder, the Vivado implementations confirmed the estimations made.
            Clock period [ns]  Operating frequency [MHz]  Throughput [Gbps]  Latency [ns]
Original    3.02               331.13                     3.31               3303.9
1p-38c      4.4                227.27                     86.36              4813.6
2p-1s-19c   3.6                277.78                     105.55             1994.4
38p-2s-1c   3.65               273.97                     104.11             188.8
Table 34) Recap table for the timings results for Kintex 7.
Regarding the timing results, the table above resumes the main parameters of interest.
The throughput of the last two options already reaches the goal, but the latency still seems too high: the typically accepted latency is around one hundred nanoseconds.
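The throughput figures in Table 34 follow directly from the 38 symbols of 10 bits decoded per clock cycle; a sketch of the arithmetic (function name is illustrative):

```python
def throughput_gbps(freq_mhz, symbols_per_cycle, bits_per_symbol=10):
    # bits decoded per second, expressed in Gbps
    return freq_mhz * 1e6 * symbols_per_cycle * bits_per_symbol / 1e9

print(round(throughput_gbps(277.78, 38), 2))   # 2p-1s-19c: ~105.56
print(round(throughput_gbps(273.97, 38), 2))   # 38p-2s-1c: ~104.11
```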
To better satisfy the strict requirements, the device on which the designs were implemented was changed to a Virtex UltraScale FPGA, the same model used in the previous chapter. The achieved results are resumed in the tables below.
Table 35) Recap table of the resources used by the implementations of the various versions of the decoder in a
Virtex.
Table 36) Recap table for the timings results for Virtex.
For these implementations, the clock frequency was made uniform as far as possible; the original design, therefore, is not slower. In general, better results are obtainable with a Virtex (Table 34 and Table 36).
4.2.2.3 ASIC implementation
The technique and the steps followed to implement the designs in ASIC are the same described in the previous chapter, so the results of the implementations are presented directly.
Implementation results
In analogy with the other ASIC implementations, the results are divided by component and two separate totals of gates are resumed: one with the implementation of the FIFO memory and one without. The best version of this Reed-Solomon code is the 38p decoder.
The timings (Table 38) show, in general, some slowness with respect to the RS(255,239) ASIC implementations: the throughput is lower. The efficiencies (computed in both ways) are also slightly lower than in the other implementations. The latency obtained for the 38p decoder is bearable (130 nanoseconds).
Table 37) Recap table that resumes the equivalent number of gates used by each RS(528,514) ASIC
implementation of the decoders.
Table 38) Recap table that resumes the main timing characteristics of the RS(528,514) ASIC implementations of
the decoders.
To reach the throughput goal of the thesis, instantiating a single 38p RS(528,514) decoder is more than enough, since it delivers 152 Gbps.
5 Conclusions
In this thesis, hardware architectures of decoders for Reed-Solomon (RS) codes reaching a decoding speed of 100 Gbps were developed. The basic arithmetical operations of finite-field algebra in Galois fields were studied, and a VHDL library was implemented to make these operations easier to use during the development of the work. The decoding process was studied and the algorithms were selected after a brief analysis. The chosen algorithms were then implemented in basic plain versions of the decoders, reaching throughputs of the order of some Gbps. Then, an analysis of the high-speed decoders already in the literature was made and a way to overtake them was studied. To do so, a detailed discussion was made on the best way of parallelising the decoders. The goals of the study were to obtain an efficient area-time relation, to find an optimised method for memory usage in the decoder and to attain a low latency. The family of RS codes studied comprises the RS(255,239), which operates in GF(2^8), and the RS(528,514), which operates in GF(2^10). The main idea of the implementation was to use the component implementing the Berlekamp-Massey algorithm at its best and to adapt the remaining components to obtain the highest rate of parallelisation. This led to low-latency and high-throughput solutions. A systolic architecture was preferred for each component. The typologies of solutions studied realise the parallelisation in different degrees, so the number of parallelised components for the computation of the syndrome and the Chien search was varied from one solution to another. The architectures were then implemented in VHDL. Their behaviour was tested thanks to Python scripts that verified the correctness of the outputs. Then, the designs were implemented in FPGA (Xilinx Kintex 7 and Xilinx Virtex UltraScale). With some slight changes to the internal memory structure, the designs were implemented in ASIC in 90 nm CMOS technology.
By parallelising two innovative designs of RS(255,239) decoders, a throughput of 124 Gbps and a latency of 116 nanoseconds were reached, using a bit less than 0.842 mm2 of area.
For the RS(528,514), a single 38p version of the innovative design is more than enough to reach the goals. The throughput attained is 152 Gbps, while the latency is 130 nanoseconds. The area used for implementing the decoder is nearly 1.2 mm2.
6 List of figures
Figure 1) Schematisation in blocks of a RF system. ................................................................ 11
Figure 2) Working operation of the multiplication in Galois fields. ....................................... 21
Figure 3) Algorithm used for inverting an element "u". .......................................................... 23
Figure 4) Schematic of a RS(7,3) encoder. .............................................................................. 25
Figure 5) Python RS(7,3) Encoder for message 0 [2,7,3]. ....................................................... 26
Figure 6) Black box representation of the RS Encoder in VHDL. .......................................... 26
Figure 7) Encoder RS(7,3) working in GF(8) internal structure. ............................................ 27
Figure 8) RSEncUnit black box. .............................................................................................. 28
Figure 9) RSEncUnit internal schematic and connections. ...................................................... 28
Figure 10) Blocks representing the test-bench system. ........................................................... 29
Figure 11) Timing_tester realisation viewed in Xilinx Vivado®. ........................................... 30
Figure 12) Graph representing the post-synthesis time analysis made with Xilinx Vivado®. 31
Figure 13) Resources utilisation for Encoder RS(255,239). .................................................... 31
Figure 14) HTML page for the VHDL code documentation of the RS Encoder. ................... 32
Figure 15) HTML page for the VHDL code documentation of the Timing Tester for RS
Encoder. ................................................................................................................................... 32
Figure 16) Generic decoder architecture.................................................................................. 33
Figure 17) Flowchart that describes the working operation of the inversion-less Berlekamp-
Massey (iBM) algorithm. ......................................................................................................... 35
Figure 18) ePIBMA working operation diagram. .................................................................... 36
Figure 19) Flowchart that describes the working operation of the classical version of the Chien
Search (CS) algorithm. ............................................................................................................ 37
Figure 20) Flowchart of the working operation of the Forney's algorithm. ............................ 37
Figure 21) Flowchart diagram of the eCSEE algorithm. ......................................................... 38
Figure 22) Syndrome calculation unit schematic implementing Horner's rule........................ 39
Figure 23) SyndromeCell black box block............................................................................... 39
Figure 24) SyndromeCell internal structure. ............................................................................ 40
Figure 25) Internal structure of the SyndromeComputer. ........................................................ 41
Figure 26) Black box model of the SyndromeComputer component. ..................................... 41
Figure 27) Black box representation of the RS_ePIBMA_Cell. ............................................... 42
Figure 28) Internal Structure of the RS_ePIBMA_Cell............................................................ 42
Figure 29) Circuit for the initialisation of the MC1 control signal. ......................................... 43
Figure 30) Circuit for the initialisation of the MC2 control signal. ......................................... 43
Figure 31) Circuit for the initialisation of the MC3 control signal. ......................................... 43
Figure 32) Circuit for the calculation of the gamma signal. .................................................... 44
Figure 33) Circuit for the calculation of the omega_0 signal. ................................................. 45
Figure 34) Circuit for the calculation of parameter L_B.......................................................... 45
Figure 35) Circuit for the calculation of L_sigma. .................................................................. 45
Figure 36) Circuit for the calculation of the zeta signal. ......................................................... 46
Figure 37) Circuit for the calculation of the ready signal. ....................................................... 46
Figure 38) RS_eCSEE black-box representation. .................................................................... 47
Figure 39) Internal blocks of the eCSEE device. ..................................................................... 48
Figure 40) Internal structure of the eCSEE component; the blocks are placed ordered in respect
to the x-axis that represents the time........................................................................................ 48
Figure 41) Black-box representation of the RS_eCSEE_Cell. ................................................. 49
Figure 42) Internal structure of the RS_eCSEE_Cell. .............................................................. 49
Figure 43) Sigma polynomial evaluation block. ...................................................................... 50
Figure 44) “B” polynomial evaluation block. .......................................................................... 50
Figure 45) Zeta computer component. ..................................................................................... 51
Figure 46) Circuit for the calculation of the number of clock pulses (CC). ............................ 51
Figure 47) Circuit for the calculation of the ready signal. ....................................................... 52
Figure 48) Black-box representation of the ROM for inverting the elements. ........................ 52
Figure 49) RS_Decoder black-box representation. .................................................................. 53
Figure 50) Black-box representation of the Delay_RAM. ....................................................... 54
Figure 51) Example circuit for the usage of the Delay_RAM block. ....................................... 54
Figure 52) Circuit for the generation of the address for input data in the Delay_RAM. .......... 54
Figure 53) Circuit for the generation of the output data address. ............................................ 55
Figure 54) Complete schematic of the RS_Decoder component described with blocks. ........ 56
Figure 55) Internal structure of the L_sigma shift register. ..................................................... 56
Figure 56) Python graph presenting the results obtained by the RS(255,239) decoder. ......... 57
Figure 57) Graph that represents the latency and the throughput correspondent to each solution
exploited. .................................................................................................................................. 67
Figure 58) Two-parallel syndrome cell internal architecture [Ji15]. ....................................... 68
Figure 59) Three-parallel syndrome cell internal architecture [Park12]. ................................ 68
Figure 60) Four-parallel syndrome cell internal architecture. ................................................. 69
Figure 61) Five-parallel syndrome cell internal architecture [Salvador14]. ............................ 69
Figure 62) First picture that represents the usage of resources per Gbps of throughput obtained.
On the x axis there are the number of symbols computed per clock pulse; on the y axis there
are the number of XOR gates used per Gbps of throughput. ................................................... 71
Figure 63) Resources utilisation per every clock pulse of latency saved. On the x axis there are
the number of symbols computed per clock pulse; on the y axis there are the average number
of basic elements used per clock pulse saved. ......................................................................... 71
Figure 64) Internal structure of a two-parallel eCSEE cell. ..................................................... 72
Figure 65) Internal structure of a three-parallel eCSEE cell. ................................................... 72
Figure 66) Internal structure for the 3-parallel evaluation of the sigma polynomial [Park12].
.................................................................................................................................................. 73
Figure 67) Picture that represents the usage of resources per Gbps of throughput obtained. On
the x axis there are the number of symbols computed per clock pulse; on the y axis there are
the average number of basic elements used per Gbps of throughput. ..................................... 74
Figure 68) Resources utilisation per every clock pulse of latency saved. On the x axis there are
the number of symbols computed per clock pulse; on the y axis there are the average number
of basic elements used per clock pulse saved. ......................................................................... 75
Figure 69) System evaluated for the two-parallel solution. ..................................................... 76
Figure 70) System evaluated for the four-parallel solution. .................................................... 76
Figure 71) Schematic of the system using 1p versions of the cells for SyndromeComputer and
eCSEE components. ................................................................................................................. 78
Figure 72) Internal structure of the 1p SyndromeComputer cell. ............................................ 79
Figure 73) Schematic of the system using 2p versions of the cells for SyndromeComputer and
eCSEE components. ................................................................................................................. 79
Figure 74) Internal structure of the 2p-1s SyndromeComputer cell. ....................................... 80
Figure 75) Graph that shows the perfect arrival of the signals if there’s no segmentation of the
cells. ......................................................................................................................................... 80
Figure 76) Schematic of the system using 4p versions of the cells for SyndromeComputer and
eCSEE components. ................................................................................................................. 81
Figure 77) Internal structure of the 4p-1s SyndromeComputer cell. ....................................... 81
Figure 78) Internal structure of the 4p-2s SyndromeComputer cell. ....................................... 81
Figure 79) Schematic of the system using 8p versions of the cells for SyndromeComputer and
eCSEE components. ................................................................................................................. 82
Figure 80) Internal structure of the 8p SyndromeComputer cell. ............................................ 83
Figure 81) Internal structure of the 8p-1s SyndromeComputer cell. ....................................... 84
Figure 82) Internal structure of the 8p-2s SyndromeComputer cell. ....................................... 85
Figure 83) Schematic of the system using 16p versions of the cells for SyndromeComputer and
eCSEE components. ................................................................................................................. 85
Figure 84) Internal structure of the 16p-2s SyndromeComputer cell. ..................................... 86
Figure 85) Black box representation of the Delay_Block. ....................................................... 89
Figure 86) Internal structure of the Delay_Block. ................................................................... 89
Figure 87) Black box representation of the Correction_Block. ............................................... 90
Figure 88) Internal structure of the Correction_Block. ........................................................... 90
Figure 89) Internal structure of the 2p-1s-SyndromeCell. ....................................................... 91
Figure 90) Internal structure of the 2p-eCSEE cell. ................................................................. 92
Figure 91) Internal structure of the Sampling_Cell. ................................................................ 92
Figure 92) Structures of the evaluation blocks for the odd and even parts of the two
polynomials. ............................................................................................................................. 93
Figure 93) Internal structure of the Chien search and error evaluation component. ............... 93
Figure 94) Internal structure of the Final_Stage component. .................................................. 94
Figure 95) Internal structure of the component for the calculation of zeta. ............................ 94
Figure 96) Internal architecture of the 2p-Starting_Zeta_Computer. ...................................... 95
Figure 97) Internal structure of the modified Zeta_Computer................................................. 95
Figure 98) Internal architecture of the 4p-Starting_Zeta_Computer. ...................................... 95
Figure 99) Internal architecture of the 8p-Starting_Zeta_Computer. ...................................... 96
Figure 100) Internal architecture of the 16p-Starting_Zeta_Computer. .................................. 96
Figure 101) Internal architecture of the 16p-enhanced_Starting_Zeta_Computer. ................. 97
Figure 102) Blocks assembly inside the decoder. .................................................................... 98
Figure 103) Black box representation of the Correction_Block. ............................................. 98
Figure 104) Internal structure of the Correction_Block component. ....................................... 98
Figure 105) Internal structure of the 4p-2s SyndromeComputer cell. ..................................... 99
Figure 106) Internal structure of the four-parallel eCSEE cell. ............................................. 100
Figure 107) Schematics for the evaluation blocks and the following stages. ........................ 100
Figure 108) Simplified blocks that constitute the 4p-2s-4c decoder. .................................... 100
Figure 109) Simplified blocks that constitute the 8p-2s-2c decoder. .................................... 101
Figure 110) Internal structure of the SyndromeCell for the 16p-2s-1c version. .................... 102
Figure 111) System architecture of the 16p-2s-1c decoder. .................................................. 102
Figure 112) Internal structure of a BRAM cell. ..................................................................... 107
Figure 113) Internal structure of the cell for the computation of the syndrome in the 38p-2s-1c
RS(528,514) decoder. ............................................................................................................ 112
Figure 114) Black box representation of the Mult_Cell component..................................... 112
Figure 115) Internal structure of the Mult_Cell. ................................................................... 112
7 List of tables
Table 1) Modulo-2 addition. .................................................................................................... 13
Table 2) Modulo-2 multiplication. ........................................................................................... 14
Table 3) Representations of the Galois Field formed with three bits. ..................................... 17
Table 4) Recap table for the control unit of the RS_ePIBMA component. .............................. 60
Table 5) Recap table of resources used per every component, taking into account the cells
usage. ....................................................................................................................................... 61
Table 6) Table for critical paths and latencies of the various parts. ........................................ 62
Table 7) Table that resumes the usage of XOR gates for every component used. .................. 68
Table 8) Comparison of resources usage of the variants of the syndrome cells. ..................... 69
Table 9) Resume of the latency and critical path characteristics for the solutions analysed. .. 70
Table 10) Recap table for used basic elements in the overall syndrome component. ............. 70
Table 11) Comparison of resources usage of the variants of the eCSEE cells. ....................... 72
Table 12) Resume of the latency and critical path characteristics for the three solutions
analysed.................................................................................................................................... 72
Table 13) Recap table for the resources utilisation of the components in eCSEE variants. .... 73
Table 14) Table that resumes the instantiation of cells in the typologies of eCSEE. .............. 73
Table 15) Recap table for the resources used by the eCSEE module of all the solutions taken
into consideration. .................................................................................................................... 74
Table 16) Table that resumes the number of XOR gates used for the two solutions taken into
consideration. ........................................................................................................................... 77
Table 17) Table that resumes the utilisation of elements for the various variants of
SyndromeComputer cell. .......................................................................................................... 87
Table 18) Recap table of the number of XOR gates used by every block for every version. . 88
Table 19) Results of the simplification of the P-1 multiplexer. ............................................... 88
Table 20) Recap table of the resources used by the implementations of the various versions of
the decoder in a Kintex 7. ...................................................................................................... 103
Table 21) Table that analyses in the clock pulses the latency of each decoder. .................... 104
Table 22) Recap table for the timings results in Kintex 7. .................................................... 104
Table 23) Table that resumes the configurations of the BRAMs used for delaying the messages.
................................................................................................................................................ 104
Table 24) Recap table of the resources used by the implementations of the various versions of
the decoder in a Virtex. .......................................................................................................... 105
Table 25) Table that resumes the timing information of the various implementations in a Virtex.
................................................................................................................................................ 105
Table 26) Recap table that resumes the equivalent number of gates used by each RS(255,239)
ASIC implementation of the decoders. .................................................................................. 108
Table 27) Recap table that resumes the main timing characteristics of the RS(255,239) ASIC
implementations of the decoders. .......................................................................................... 109
Table 28) Comparison table between the decoder proposed and the literature’s results. ...... 110
Table 29) Analysis of the percentage weight of the FIFO memory on the total number of gates
used. ....................................................................................................................................... 110
Table 30) Recap table of the number of XOR gates used by every block for every version. 113
Table 31) Table that resumes the number of messages that have to be stored for every version.
................................................................................................................................................ 113
Table 32) Recap table of the resources used by the implementations of the various versions of
the decoder in a Kintex 7. ...................................................................................................... 114
Table 33) Table that resumes the configurations of the BRAMs used for delaying the messages.
................................................................................................................................................ 114
Table 34) Recap table for the timings results for Kintex 7. ................................................... 115
Table 35) Recap table of the resources used by the implementations of the various versions of
the decoder in a Virtex. .......................................................................................................... 115
Table 36) Recap table for the timings results for Virtex. ...................................................... 115
Table 37) Recap table that resumes the equivalent number of gates used by each RS(528,514)
ASIC implementation of the decoders. .................................................................................. 116
Table 38) Recap table that resumes the main timing characteristics of the RS(528,514) ASIC
implementations of the decoders. .......................................................................................... 116
8 Bibliography

[Song02] Song L., Yu M.-L., Shaffer M.S., "10 and 40 Gb/s Forward Error Correction Devices
for Optical Communications", IEEE Journal of Solid-State Circuits, pp. 1565-1573, 2002.

[Lee03] Lee H., "High-Speed VLSI Architecture for Parallel Reed-Solomon Decoder", IEEE
Transactions on VLSI Systems, pp. 288-294, 2003.

[Lee05] Lee H., "A High-Speed Low-Complexity Reed-Solomon Decoder for Optical
Communications", IEEE Transactions on Circuits and Systems II, pp. 461-465, 2005.

[LeeH08] Lee H., Choi C.-S., Shin J., Ko J.-S., "100Gb/s Three-Parallel Reed-Solomon Based
Forward Error Correction Architecture for Optical Communications", 2008 International SoC
Design Conference, pp. 265-268, 2008.

[LeeS08] Lee S., Choi C.-S., Lee H., "Two-Parallel Reed-Solomon Based FEC Architecture for
Optical Communications", IEICE Electronics Express, pp. 374-380, 2008.

[Park12] Park J.-I., Yeon J., Yang S.-J., Lee H., "An Ultra High-Speed Time-Multiplexing
Reed-Solomon-Based FEC Architecture", IEEE, 2012.

[Salvador14] Salvador A., Carvalho D., Nakandakare C., Mobilon E., de Oliveira J.,
"100Gbit/s FEC for OTN Protocol: Design Architecture and Implementation Results", IEEE,
2014.

[Ji15] Ji W., Zhang W., Xingru P., Zhibin L., "16-Channel Two-Parallel Reed-Solomon Based
Forward Error Correction Architecture for Optical Communications", 2015 IEEE International
Conference on Digital Signal Processing (DSP), pp. 239-243, 2015.

[Wu15] Wu Y., "New Scalable Decoder Architectures for Reed-Solomon Codes", IEEE, pp.
2741-2761, 2015.
9 Annexure

9.1 [Annex 1] File positions

This section describes how the files supplied with the thesis are organised inside the folder.
The files follow the same order and are organised into the same chapters into which the thesis
report is divided. For the Python and Matlab parts, the plain "file.py" and "file.m" files are
sufficient. As for the project files, it was not possible to include all of them: only the VHDL
sources were placed in the folder, since they are the essential result of the thesis. Some text
files are also included to allow testing with the test benches.
All the versions of the decoders and encoders were successfully tested. The test benches are
provided with the thesis, but the paths of the source files used in the simulation must be
updated to match the local folder layout.
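As an illustration of that last step, a small Python helper of the following kind can retarget the hard-coded paths inside the test-bench sources in one pass. The naming convention ("*_tb.vhd"), the directory layout, and the old/new path prefixes below are hypothetical examples, not taken from the project files:

```python
from pathlib import Path

def retarget_testbench_paths(src_dir, old_prefix, new_prefix):
    """Replace a hard-coded directory prefix inside every VHDL
    test-bench file (assumed here to end in "_tb.vhd") found under
    src_dir. Returns the names of the files that were modified."""
    changed = []
    for vhd in sorted(Path(src_dir).rglob("*_tb.vhd")):
        text = vhd.read_text()
        new_text = text.replace(old_prefix, new_prefix)
        if new_text != text:
            vhd.write_text(new_text)   # rewrite the file in place
            changed.append(vhd.name)
    return changed
```

For example, `retarget_testbench_paths("src", "C:/old/dir", "./vectors")` would rewrite every string such as `"C:/old/dir/input.txt"` appearing in the test benches under `src` to `"./vectors/input.txt"`, so the simulator can find the stimulus files in the new location.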
The documentation of each project is located in the folder "html" and can be opened by
double-clicking on "index.html" inside that folder.