POLITECNICO DI MILANO
Scuola di Ingegneria Industriale e dell’Informazione
DESIGN AND IMPLEMENTATION OF 100
GBPS REED-SOLOMON DECODERS
Supervisors:
Prof. Franco Zappa
Prof. Javier Valls
Author:
Gabriele Perrone 820485
Academic Year 2015-2016
Contents

Acknowledgments ...................................................................................................................... 7
Abstract ...................................................................................................................................... 9
1 Introduction ...................................................................................................................... 11
1.1 Goals of the thesis ..................................................................................................... 12
1.2 Methods ..................................................................................................................... 12
1.2.1 Algebra ............................................................................................................... 13
1.2.1.1 Arithmetic in binary fields .......................................................................... 14
1.2.1.2 Galois fields ................................................................................................ 16
1.2.2 Modelling ........................................................................................................... 18
1.2.2.1 Matlab ......................................................................................................... 18
1.2.2.2 Python ......................................................................................................... 18
1.2.2.3 Unified Modelling Language (UML) ......................................................... 19
1.2.3 Realisation.......................................................................................................... 19
1.2.3.1 Doxygen...................................................................................................... 19
1.3 Structure of the thesis ................................................................................................ 19
2 VHDL library of arithmetic operators ............................................................................. 21
2.1 Sum............................................................................................................................ 21
2.2 Multiplication ............................................................................................................ 21
2.3 Inversion .................................................................................................................... 23
2.4 Examples of usage ..................................................................................................... 24
2.4.1 Encoder RS(7,3) in GF(2^3) ................................................................ 24
2.4.1.1 Simulation and system validation ............................................................... 29
2.4.2 Encoder RS(255,239) in GF(2^8) ........................................................ 29
2.4.2.1 Simulation and system validation ............................................................... 30
2.4.2.2 Utilisation and timing report ....................................................................... 30
2.4.2.3 Code documentation ................................................................................... 32
3 Decoder for RS codes ...................................................................................................... 33
3.1 Decoding process ...................................................................................................... 33
3.1.1 Reed-Solomon codes ......................................................................................... 33
3.1.2 Syndrome Computer .......................................................................................... 33
3.1.2.1 Horner’s rule ............................................................................................... 34
3.1.3 Berlekamp-Massey Algorithm (BMA) .............................................................. 35
3.1.3.1 Enhanced Parallel Inversionless Berlekamp-Massey Algorithm (ePIBMA) ....... 36
3.1.4 Chien Search (CS) and Forney’s algorithm ....................................................... 37
3.1.4.1 enhanced Chien Search Error Evaluation Component (eCSEE) ................ 38
3.2 Decoder RS(255,239) in GF(2^8)................................................................ 38
3.2.1 Syndrome Computer .......................................................................................... 39
3.2.2 ePIBMA component .......................................................................................... 41
3.2.3 eCSEE component ............................................................................................. 46
3.2.4 Decoder block assembly .................................................................................... 53
3.3 Decoder RS(528,514) in GF(2^10) .............................................................. 58
3.4 Compared analysis of timings and usage of resources .............................................. 60
3.4.1 Usage of resources in a generic RS(n,k) decoder .............................................. 60
3.4.1.1 Syndrome Computer ................................................................................... 60
3.4.1.2 RS_ePIBMA ............................................................................................... 60
3.4.1.3 RS_eCSEE .................................................................................................. 61
3.4.1.4 Other components ....................................................................................... 61
3.4.2 Latencies and critical paths ................................................................................ 61
3.4.3 Maximum operating frequency .......................................................................... 62
4 Design of 100 Gbps decoders .......................................................................................... 65
4.1 Parallelisation of the RS(255,239) decoder............................................................... 65
4.1.1 Background ........................................................................................................ 65
4.1.1.1 Syndrome Computer ................................................................................... 68
4.1.1.2 eCSEE ......................................................................................................... 72
4.1.1.3 Conclusions ................................................................................................ 75
4.1.2 Innovative designs ............................................................................................. 77
4.1.2.1 Proposed decoder architecture and theoretical analysis ............................. 78
4.1.2.2 FPGA implementation ................................................................................ 88
4.1.2.3 ASIC implementation ............................................................................... 105
4.1.2.4 Comparison with published results ........................................................... 109
4.2 Parallelisation of the RS(528,514) decoder............................................................. 110
4.2.1 Background ...................................................................................................... 110
4.2.2 Innovative designs ........................................................................................... 111
4.2.2.1 Proposed architecture and theoretical analysis ......................................... 111
4.2.2.2 FPGA implementation .............................................................................. 114
4.2.2.3 ASIC implementation ............................................................................... 115
5 Conclusions .................................................................................................................... 117
6 List of figures ................................................................................................................. 119
7 List of tables ................................................................................................................... 123
8 Bibliography .................................................................................................................. 125
9 Annexure ........................................................................................................................ 127
9.1 [Annex 1] Files positions ........................................................................................ 127
Acknowledgments

I thank my family for their support and encouragement throughout my years of study
and personal growth.
I must express my gratitude to Professor Franco Zappa. Had I not attended his classes during
my bachelor's degree, I would probably never have moved to Electronics Engineering, and I
would not have discovered my passion for this field.
I thank my Spanish supervisor, Professor Javier Valls, for giving me the chance to develop
this thesis with him.
Abstract

Reed-Solomon error-correction codes are used in a wide variety of fields, ranging from
telecommunications to digital data storage. Recently, this family of codes has been proposed
for high-speed connections over cable or optical fibre. In this thesis, optimised hardware
architectures were developed to reach a decoding speed of 100 Gbps. Parametric VHDL
libraries were implemented to simplify the execution of arithmetic operations in Galois
fields, the algebraic fields in which Reed-Solomon codes work. The architectures proposed
by other authors were studied in order to make the best use of the component that implements
the Berlekamp-Massey algorithm. Particular attention was also paid to an efficient usage of
memory in the decoder. The possible solutions are analysed and discussed. The two
Reed-Solomon codes studied are RS(255,239), which works in GF(2^8), and RS(528,514), which
works in GF(2^10). For each decoder, a VHDL model was implemented. After verification with
the help of Python models, the decoders were implemented on FPGAs and in 90 nm CMOS ASIC
technology. The obtained results attained the required decoding throughput and proved more
efficient, in terms of the area-time product, and lower in latency than the current state
of the art.
1 Introduction

The project consists of the development and realisation of a decoder for Reed-Solomon codes
that can perform the decoding task at high speed, granting a throughput of the order of
Gbps. These decoders are used in communication or storage systems to ensure the correct
transmission or recovery of data. Figure 1 shows the block schematic of a generic system.
The aim of the overall architecture is a reliable and efficient exchange of data between
the two endpoints. One aspect that can effectively improve the situation is the possibility
to restore data that arrived with errors. For example, imagine communicating with a probe
that is exploring Mars: with high probability there will be constraints on the energy
available and very long round-trip times for the exchange of data. In this case, having only
parity symbols would be a real waste: if the message arrives with errors it is possible to
detect them, but nothing can be done except sending a packet asking for a retransmission,
wasting time and energy. For these cases, channel-encoding systems were developed that can
not only detect errors, but also correct a certain number of mistakes that occurred in the
transmission.
The two highlighted blocks are the areas where the thesis takes place.
Figure 1) Schematisation in blocks of an RF system.
In summary, the encoding of the transmission channel has the following goals:
- Maximise the number of bits transmitted (throughput);
- Minimise the number of errors introduced in the transmission;
- Minimise the energy required;
- Minimise the bandwidth required;
- Minimise the complexity of the encoder and decoder.
Channel Encoding
Channel encoding implies a transformation of the data of the original message. The aim, as
said before, is not to use a reverse channel through which the receiver, once an error is
detected, asks for the corrupted message to be sent again. In FEC techniques, the tasks of
detecting and correcting the errors are carried out by the receiver.
To do so, from the original pack of data the encoder produces some redundancy bits in order
to give the receiver the possibility to restore the original message. These FEC codes are
generally characterised by a few main parameters:
n = total number of bits of the packet (code bits)
k = number of bits of the original packet (data bits)
t = maximum number of errors that the code can correct
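The relation between these parameters depends on the code; for the Reed-Solomon codes treated later, the n − k redundancy symbols correct up to t = (n − k)/2 symbol errors. A minimal sketch (the function name is illustrative):

```python
def rs_correction_capability(n, k):
    # for Reed-Solomon codes, the n-k parity symbols correct up to t = (n-k)//2 errors
    return (n - k) // 2

# the two codes studied in the thesis
print(rs_correction_capability(255, 239))  # -> 8
print(rs_correction_capability(528, 514))  # -> 7
```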
Before describing the composition and structure of the codes adopted for the project, some
details about the particular algebra used in these numerical fields must be explained. These
details are deferred to later sections.
1.1 Goals of the thesis

In the literature, a lot of work has been done on decoders for Reed-Solomon codes, which are
currently used in many devices for various purposes. The main idea of the project is to reach
the Ethernet speed (100 Gbps) using the smallest possible amount of resources. The focus of
the thesis is mainly on RS(255,239) and RS(528,514), in order to include them in 100 Gbps
communication standards. The trade-off among the area occupation, the throughput and the
latency of a decoder is present in every previous work analysed. The aim is to decouple the
three parameters as much as possible in order to reduce the compromise and attain a better
solution. To do so, a study on the way resources are used is carried out, in order to obtain
a generic method for determining the optimal solution. To reach the goal, a new approach
that brings plenty of advantages is used. Before explaining it, some methods must be
introduced.
1.2 Methods

Before proceeding with the development of the work, the main tools used for the thesis are
introduced here and the way they are used is explained.
1.2.1 Algebra

To understand how FEC codes work, the algebra used in them has to be explained. Since it is
not straightforward, and since it is at the basis of the algorithms, this section introduces
some theorems and definitions that help fix the concepts of this particular algebra. Since
the field is very broad, it is not possible to cover all the details, only the concepts
fundamental to the operations.
Field: Let F be a set of elements on which two binary operations, addition and multiplication,
are defined. The set F together with two binary operations “+” and “∙” is a field if the following
conditions are satisfied:
i. F is a commutative group under addition “+”. The identity element with respect to
addition is called the zero element or the additive identity of F and is denoted by 0.
ii. The set of nonzero elements in F is a commutative group under multiplication. The
identity element with respect to multiplication is called the unit element or
multiplicative identity of F and is denoted by 1.
iii. Multiplication is distributive over addition; that is, for any three elements a, b, and
c in F:
a ∙ (b + c) = a ∙ b + a ∙ c
The field therefore consists of at least two elements: the additive identity and the multiplicative
identity.
The order of a field is the number of elements belonging to the field.
A Finite field is a field with a finite number of elements.
Properties
1) For every element a in a field, a∙0 = 0∙a = 0.
2) For any two nonzero elements a and b in a field, a∙b ≠ 0.
3) a∙b = 0 and a ≠ 0 imply that b = 0.
4) For any two elements a and b in a field:
−(a ∙ b) = (−a) ∙ b = a ∙ (−b)
5) For a ≠ 0, a∙b = a∙c implies b = c.
The set {0,1} is called the binary field and is denoted GF(2). It is very important in coding
theory and is widely used in digital computers and digital data transmission.
Examples of modulo-2 addition and modulo-2 multiplication are presented in the tables below.

+ | 0 1
0 | 0 1
1 | 1 0
Table 1) Modulo-2 addition.

∙ | 0 1
0 | 0 0
1 | 0 1
Table 2) Modulo-2 multiplication.
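The two tables can be reproduced in a couple of lines, since in GF(2) addition reduces to the XOR of the operands and multiplication to the AND (an illustrative sketch):

```python
# GF(2) arithmetic on the elements {0, 1}
def gf2_add(a, b):
    return a ^ b   # modulo-2 addition (Table 1)

def gf2_mul(a, b):
    return a & b   # modulo-2 multiplication (Table 2)

print([[gf2_add(a, b) for b in (0, 1)] for a in (0, 1)])  # -> [[0, 1], [1, 0]]
print([[gf2_mul(a, b) for b in (0, 1)] for a in (0, 1)])  # -> [[0, 0], [0, 1]]
```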
Let p be a prime number and GF(p) a finite field of p elements; for any positive integer m
it is possible to extend the prime field GF(p) to a field of p^m elements, which is called
an extension field of GF(p) and is denoted GF(p^m). It has been proved that the order of any
finite field is a power of a prime.
Since a field is closed under addition, the results of additions have to lie inside the
field. A finite field has a finite number of elements, so there has to be some repetition:
the sums of the unit element with itself cannot all be distinct. Thus, two positive integers
n and m must exist, with m < n, for which the following expression holds:
Σ_{i=1}^{n} 1 = Σ_{i=1}^{m} 1

Doing some algebra, this can be written in a new form:

Σ_{i=1}^{n−m} 1 = Σ_{i=1}^{λ} 1 = 0
where λ is the smallest number that satisfies the relation and is called the characteristic
of the field GF(q). It can be demonstrated that the characteristic of a finite field is prime.
Theorems
i) Let a be a non-zero element of the finite field GF(q); then a^(q−1) = 1.
ii) The order n of the element a of the previous theorem divides q − 1.
An element a of order q − 1 is called primitive.
1.2.1.1 Arithmetic in binary fields
Although it is possible to construct a code starting from any Galois field, in digital
electronics it is most common to create codes starting from GF(2) or its extensions GF(2^m).
Starting with the simple GF(2), the arithmetic is really similar to ordinary arithmetic,
except that the element 2 is equal to 0. This small change implies that in a simple addition
1+1 = 2 = 0, and therefore the unit element satisfies 1 = −1. Thus, subtraction in such a GF
gives the same result as addition, and vice versa.
As an example of the application of this property, consider the solution of the following
system:

X + Y = 1
X + Z = 0
X + Y + Z = 1

Simply using the property described above:

X = 1 − Y = 1 + Y
Z = −X = X
X + Y + X = Y = 1 (from the third equation)

which gives X = 0, Z = 0, Y = 1.
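Since a GF(2) system has finitely many candidate solutions, the result can also be checked by exhaustive search, with "+" replaced by XOR (a quick verification sketch):

```python
from itertools import product

# brute-force the system X+Y=1, X+Z=0, X+Y+Z=1 over GF(2)
solutions = [(x, y, z) for x, y, z in product((0, 1), repeat=3)
             if x ^ y == 1 and x ^ z == 0 and x ^ y ^ z == 1]
print(solutions)  # -> [(0, 1, 0)], i.e. X=0, Y=1, Z=0
```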
A generic polynomial over the binary field GF(2) can be written as:

f(X) = f_0 + f_1·X + f_2·X^2 + … + f_n·X^n

where all the coefficients f_i can be either 0 or 1. The largest power of X with a non-zero
coefficient is the degree of the polynomial: if f_n ≠ 0, then n is the degree of the
polynomial.
When "a polynomial over GF(2)" is written in what follows, it means "a polynomial whose
coefficients are elements of GF(2)".
In GF(2) there are in general 2^n polynomials of degree n. For example, there are two
polynomials of degree 1: X and 1+X; and four of degree 2: X^2, 1+X^2, X+X^2 and 1+X+X^2.
Polynomials over GF(2) follow the usual rules for addition, subtraction, multiplication and
division, except that the coefficients have the properties of the GF.
Suppose we want to divide f(X) = 1+X+X^4+X^5+X^6 by g(X) = 1+X+X^3. The division can be
performed with Euclid's division algorithm. At the first step, the partial quotient is X^3.
Multiplying the divisor by the partial quotient gives p1(X) = X^3+X^4+X^6. Adding this first
partial polynomial to the dividend gives p2(X) = 1+X+X^3+X^5. Continuing the operations, the
quotient is q(X) = X^3+X^2 and the remainder is r(X) = 1+X+X^2.
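The same division can be checked mechanically. Encoding a polynomial as an integer whose bit i is the coefficient of X^i (so f = 0b1110011 and g = 0b1011), long division over GF(2) reduces to shifts and XORs (an illustrative sketch):

```python
def gf2_divmod(f, g):
    # long division of binary polynomials; bit i of an int is the coefficient of X^i
    q = 0
    while f.bit_length() >= g.bit_length():
        shift = f.bit_length() - g.bit_length()
        q ^= 1 << shift        # add X^shift to the quotient
        f ^= g << shift        # subtract (= XOR) the shifted divisor
    return q, f

# f(X) = 1+X+X^4+X^5+X^6, g(X) = 1+X+X^3
q, r = gf2_divmod(0b1110011, 0b1011)
print(bin(q), bin(r))  # -> 0b1100 (X^2+X^3) and 0b111 (1+X+X^2)
```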
The elements a that, substituted into the polynomial, give 0 as a result are called roots:

f(a) = 0

A root of the polynomial f(X) = 1+X+X^2+X^4 is 1: f(1) = 1+1+1^2+1^4 = 1+1+1+1 = 0. Dividing
the polynomial by the corresponding factor 1+X, it can be written as
f(X) = (1+X)(X^3+X^2+1).
A polynomial over GF(2) with an even number of terms is divisible by 1+X.
A polynomial f(X) of degree m is called irreducible over GF(2) if it is not divisible by any
polynomial over GF(2) of degree less than m but greater than 0.
Theorem: Any irreducible polynomial over GF(2) of degree m divides X^(2^m−1) + 1.
Any irreducible polynomial p(X) of degree m is said to be primitive if the smallest positive
integer n for which p(X) divides X^n + 1 is n = 2^m − 1.
1.2.1.2 Galois fields
Construction of a Galois field
To start the construction of a GF, we begin with the two elements 0 and 1 and with the
element α. The following operations are defined by the properties exposed in the previous
section:

0 ∙ 1 = 1 ∙ 0 = 0
1 ∙ 1 = 1
0 ∙ α = α ∙ 0 = 0
1 ∙ α = α ∙ 1 = α
α ∙ α = α^2
α ∙ α ∙ α = α^3
α^i ∙ α^j = α^(i+j)

The resulting field is F = {0, 1, α, α^2, …, α^j, …}. The first non-zero element 1 can also
be referred to as α^0. To continue building the field, we add the conditions that the field
F contains only 2^m elements and is closed under multiplication.
Taking a primitive polynomial p(X) of degree m over GF(2), from the previous theorem we can
assert that p(X) divides X^(2^m−1) + 1:

X^(2^m−1) + 1 = p(X) ∙ q(X)

Supposing α is a root of the primitive polynomial, p(α) = 0. Replacing X with α:

α^(2^m−1) + 1 = 0 ∙ q(α) = 0
α^(2^m−1) = 1

Therefore, it is now possible to write the newly created field as
F = {0, α^0, α^1, α^2, …, α^(2^m−2)}. Taking two exponents i and j such that
0 ≤ i, j ≤ 2^m − 2, and knowing that α^i ∙ α^j = α^(i+j), if i + j ≥ 2^m − 1 we can write
i + j = (2^m − 1) + r with 0 ≤ r ≤ 2^m − 2. Hence
α^(i+j) = α^((2^m−1)+r) = α^(2^m−1) ∙ α^r = α^r, and so the closure under multiplication is
verified.
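This exponent arithmetic is what makes multiplication cheap in practice: nonzero elements multiply by adding their exponents modulo 2^m − 1. A sketch for GF(2^3) generated by p(X) = 1 + X + X^3, using the integer representation of the elements:

```python
# powers of alpha in GF(2^3) generated by p(X) = 1 + X + X^3 (integer representation)
exp = [1, 2, 4, 3, 6, 7, 5]              # exp[i] = alpha^i
log = {v: i for i, v in enumerate(exp)}  # inverse mapping

def gf8_mul(a, b):
    # multiply nonzero elements by adding exponents modulo 2^3 - 1 = 7
    if a == 0 or b == 0:
        return 0
    return exp[(log[a] + log[b]) % 7]

print(gf8_mul(3, 6))  # alpha^3 * alpha^4 = alpha^7 = alpha^0 -> 1
```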
The next step is the definition of the addition operation. Taking 0 ≤ i ≤ 2^m − 2, we can
divide the polynomial X^i by the polynomial p(X):

X^i = p(X) ∙ q_i(X) + a_i(X)

where q_i and a_i are respectively the quotient and the remainder. Since X and p(X) are
relatively prime (only the element 1 in common), the remainder is of the form:

a_i(X) = a_{i,0} + a_{i,1}·X + … + a_{i,m−1}·X^(m−1)

and a_i(X) ≠ 0 for any non-negative i.
Taking two indices i and j inside the valid range [0; 2^m − 2], the remainder polynomials
a_i(X) and a_j(X) cannot be equal (the demonstration is not presented here).
Thus, for i = 0, 1, 2, …, 2^m − 2 we obtain 2^m − 1 distinct polynomials of degree at most
m − 1. Replacing X with α we have α^i = a_i(α) and, by derivation from the previous line, we
have obtained 2^m − 1 distinct non-zero polynomials in GF(2^m). So the 2^m elements of F can
be represented by 2^m distinct polynomials of α over GF(2) with degree at most m − 1. The
addition operation is commutative in F and F is closed under "+". The binary field GF(2) is
also called the ground field.
For a Galois field there are therefore several representations, which are summarised in the
table below. In this example, the aim is to build a GF(2^3) field generated by
p(X) = 1+X+X^3. From p(X) = 0 we get X^3 = 1+X, and from this equation we can start to build
the field.
Power | Polynomial   | Tuple | Integer
0     | 0            | 000   | 0
1     | 1            | 001   | 1
α     | α            | 010   | 2
α^2   | α^2          | 100   | 4
α^3   | 1 + α        | 011   | 3
α^4   | α + α^2      | 110   | 6
α^5   | 1 + α + α^2  | 111   | 7
α^6   | 1 + α^2      | 101   | 5
Table 3) Representations of the Galois field formed with three bits.
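The table can be regenerated programmatically: each new power of α is a left shift (multiplication by α), reduced with p(X) whenever the degree reaches m. A sketch (the function name and integer encoding are illustrative):

```python
def build_gf_exp(m, prim):
    # successive powers of alpha in GF(2^m); bit i of prim is the coefficient of X^i in p(X)
    exp, x = [], 1
    for _ in range(2**m - 1):
        exp.append(x)
        x <<= 1                # multiply by alpha
        if x & (1 << m):       # degree reached m: reduce modulo p(X)
            x ^= prim
    return exp

print(build_gf_exp(3, 0b1011))  # -> [1, 2, 4, 3, 6, 7, 5], the Integer column of Table 3
```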
The first property is that a root of a polynomial with coefficients in GF(2) can lie in an
extension field GF(2^m). The concept is similar to that of the complex roots of a real
polynomial.
Theorem: taking β (an element of GF(2^m)) as a root of a polynomial f(X) with coefficients
in GF(2), then for any non-negative l the element β^(2^l) is also a root of f(X). The element
β^(2^l) is called a conjugate of β.
Theorem: the minimal polynomial ϕ(X) of a field element β is irreducible.
Theorem: taking a polynomial f(X) over GF(2) and the minimal polynomial ϕ(X) of the field
element β, if β is a root of f(X), then f(X) is divisible by ϕ(X). If, moreover, f(X) is
irreducible, then necessarily ϕ(X) = f(X).
Theorem: if β is a primitive element of GF(2^m), all its conjugates β^2, β^4, β^8, … are
also primitive elements of GF(2^m).
This leads to the conclusion that, taking β of order n in GF(2^m), all its conjugates have
the same order.
Vector Spaces
Taking V, a set of elements on which addition is defined, and F, a field, and defining a
multiplication between the elements of V and F, the set V is called a vector space over the
field F if:
i) V is a commutative group under addition.
ii) For any element a in F and any element v in V, a∙v is an element of V.
iii) (Distributive laws) For any elements u and v in V and any elements a and b in F:
a ∙ (u + v) = a ∙ u + a ∙ v
(a + b) ∙ v = a ∙ v + b ∙ v
iv) (Associative law) For any v in V and any a and b in F:
(a ∙ b) ∙ v = a ∙ (b ∙ v)
v) With 1 the unit element of F, for any v in V, 1∙v = v.
The elements of V are called vectors and the elements of the field F are called scalars.
1.2.2 Modelling

To realise the decoder, it was really important to first model the algorithms and the
components. The models have two main purposes: allowing a correct and detailed understanding
of how the algorithms work, and allowing easier debugging while implementing the components
in VHDL. In fact, having software models of the components made partial results and variable
values available during the debugging of the components, as well as for the verification and
testing of the devices realised.
1.2.2.1 Matlab
Matlab is a standard program for modelling and prototyping in the scientific field. This
software was used to understand the algorithms in the first place, because it is easy and
straightforward to use. Thanks to its simple interfaces and its dedicated library for Galois
fields, coding with this tool was fast.
1.2.2.2 Python
Afterwards, once sure of how the algorithms work, Python was used for a better modelling of
the components. The main annoying aspect of Matlab is its notation for accessing the
elements of an array (the first element is at index one instead of zero). This, and the
greater programming freedom, led to the usage of Python in some parts of the project.
Moreover, with Python the advantage of an interactive console was not lost, since it is an
interpreted language; it gives easier access to basic tools such as file streams and string
parsing, useful for VHDL test benches; and, being a fully fledged programming language, it
leaves more freedom while coding.
1.2.2.3 Unified Modelling Language (UML)
Together with Matlab and Python, UML was adopted as another modelling standard. UML gave the
tools necessary to have a clearer vision of the procedures to be carried out. With this
language, the algorithms were schematised as flowcharts, which made the implementation
easier.
1.2.3 Realisation

After realising the models, the following step was to implement them in hardware. For this,
a hardware description language was adopted.
The choice of description language was between Verilog and VHDL. They are quite equivalent
and there is no big difference but, since VHDL is stricter and more precise, the choice fell
on VHDL. Moreover, in this language the parametrisation of components is easier, so it
turned out to be more powerful for the realisation of scalable structures.
Regarding the FPGA implementation, the IDE and first basic simulator used in the project is
Xilinx Vivado® 2016. For better and more detailed simulations, the use of Altera ModelSim
10.3d was necessary.
For the ASIC implementations of the designs, used to compare them with the results in the
literature, Cadence SOC Encounter was employed, which takes a VHDL design and generates an
equivalent integrated circuit.
1.2.3.1 Doxygen
The realisation of a complex system coded in VHDL posed a problem of documentation and
reusability of the code. For this reason the program Doxygen was used to document the code
and make understandable the usage of the packages and components realised during the
development of the decoder. Doxygen is a software tool that automatically generates HTML and
LaTeX documentation of the files, provided they are appropriately commented.
1.3 Structure of the thesis

The work of the thesis is divided into a few main parts. A first important chapter is
dedicated to the study of Galois field arithmetic. This is the necessary basis for
understanding how the various decoding algorithms work. After understanding the arithmetic
operations and modelling them in Python, they also have to be implemented in VHDL so that
they are easier to use during the implementation of more complex systems. In fact, splitting
the complexity of the operations makes the design itself easier.
A second chapter is dedicated to the decoding algorithms. After presenting the main
algorithms and architectures for the decoder, the algorithms are implemented in working
decoders and the first results are obtained. The drawbacks and advantages of the analysed
solutions are discussed. These lead to new, innovative designs (chapter 4) for reaching the
throughput of 100 Gbps efficiently. Finally, the results are numerically analysed and in the
last chapter some conclusions are drawn.
2 VHDL library of arithmetic operators

This first chapter of the work is dedicated to the arithmetic operators on which the whole
thesis is based. The aim pursued here is to obtain an easy-to-use library for carrying out
computations in Galois fields. The operators were developed in a dedicated package so that
their reuse in different contexts is facilitated.
2.1 Sum

The first operation needed by the algorithms is a basic sum between two numbers in the
Galois field. The sum is simply a XOR gate applied between the corresponding bits of the
numbers. An example of the operation (between unsigned values) can be seen below. The
leftmost bit is the MSB.

A = [0 1 1 0 1 0 1 0] = 64+32+8+2 = 106
B = [1 0 1 1 0 0 0 1] = 128+32+16+1 = 177

A 0 1 1 0 1 0 1 0
B 1 0 1 1 0 0 0 1
C 1 1 0 1 1 0 1 1

So the result is C = A+B = [1 1 0 1 1 0 1 1] = 128+64+16+8+2+1 = 219.
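In software the same sum is a single XOR on the integer representations of the bit patterns above (a one-line check):

```python
# GF(2^m) addition is bitwise XOR: no carries propagate between bit positions
a = 0b01101010   # A = 106
b = 0b10110001   # B = 177
print(bin(a ^ b), a ^ b)  # -> 0b11011011 219
```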
2.2 Multiplication

This operation is more complex and had to be analysed more deeply. First of all, a flowchart
that explains how the operation works is presented below.
Figure 2) Working operation of the multiplication in Galois fields.
Below, the working operation is explained with an example so that it is clearer and more
straightforward to understand.

A = 156 = [1 0 0 1 1 1 0 0]
B = 67 = [0 1 0 0 0 0 1 1]
P = primitive polynomial = 285 = [1 0 0 0 1 1 1 0 1]

First, the carry-less (partial) product is formed: a copy of A, shifted left by the position
of each set bit of B (here X^0, X^1 and X^6), is XOR-ed into the partial result:

1 0 0 1 1 1 0 0              (A, bit 0 of B)
1 0 0 1 1 1 0 0 -            (A shifted by 1, bit 1 of B)
1 0 0 1 1 1 0 0 - - - - - -  (A shifted by 6, bit 6 of B)

0 1 0 0 1 1 0 1 0 1 0 0 1 0 0  (carry-less product)

Then the product is reduced: P, aligned under the current leading 1, is XOR-ed in until the
degree drops below 8:

0 1 0 0 1 1 0 1 0 1 0 0 1 0 0
  1 0 0 0 1 1 1 0 1
0 0 0 1 0 1 0 0 0 0 0 1 0 0
      1 0 0 0 1 1 1 0 1
0 0 1 0 1 1 1 0 0 0 0
    1 0 0 0 1 1 1 0 1
0 0 1 1 0 1 1 0 1

leaving the final 8-bit result [0 1 1 0 1 1 0 1].
So the result of the operation is C=A*B=[0 1 1 0 1 1 0 1] = 64+32+8+4+1 = 109
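The two stages of the flowchart (carry-less product, then reduction by P) can also be sketched compactly in Python on the integer representations (an illustrative sketch mirroring the two Matlab scripts):

```python
def gf_mult(a, b, prim=0b100011101, m=8):
    # stage 1: carry-less product, one shifted copy of a per set bit of b
    p = 0
    for i in range(b.bit_length()):
        if (b >> i) & 1:
            p ^= a << i
    # stage 2: reduce modulo the primitive polynomial (285 = X^8+X^4+X^3+X^2+1)
    while p.bit_length() > m:
        p ^= prim << (p.bit_length() - 1 - m)   # cancel the current leading term
    return p

print(gf_mult(156, 67))  # -> 109, matching the worked example
```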
The script corresponding to this operation was developed in Matlab. The first function
computes the carry-less (partial) multiplication between the two numbers:

function [C] = mult_stage1(A, B)
    n = length(A);
    length_C = 2*(n-1);
    C = zeros(length_C, 1);
    C(1) = and(A(1), B(1));           % first element
    C(length_C+1) = and(A(n), B(n));  % last element
    for (i = 2:length_C)
        if (i <= n)
            r = and(A(i), B(1));
            for (j = 2:i)
                r = xor(r, and(B(j), A(i-j+1)));
            end
            C(i) = r;
        else
            delta = i - n;
            r = and(A(n), B(delta+1));
            for (j = delta+1:n-1)
                r = xor(r, and(A(n-j+delta), B(j+1)));
            end
            C(i) = r;
        end
    end
end
The second function performs the modulo reduction by the primitive polynomial P:

function [C] = mult_stage2(M, P)
    for (i = length(M):-1:length(P))
        if (M(i) == 1)
            for (j = length(P):-1:1)
                if (P(j) == 1)
                    M(i+j-length(P)) = not(M(i+j-length(P)));
                end
            end
        end
    end
    C = M;
end
Having verified the models, and thus finished the modelling, the same operations have to be
implemented in VHDL. The best solution is to implement a package that can be used easily
when coding, so an overload of the operators "+" and "*" was defined for the
std_logic_vector subtype gf_logic_vector. In this way, their usage in the other parts of
the project is fast: if an operation like D=(A+B)*C has to be performed, it is only
necessary to write "D<=(A+B)*C", provided that A, B, C, D are of the subtype
gf_logic_vector. Another basic subtype of gf_logic_vector that was defined is gf_symbol,
i.e. a gf_logic_vector of length m.
2.3 Inversion
The inversion of an element is needed especially in the computation of the errors that occurred. This operation is quite simple from a mathematical point of view, but it can be implemented in multiple ways.
Supposing we want to invert a generic element α^i of the Galois field GF(2^m), with n = 2^m − 1 non-zero elements, the inverted element corresponds to α^(n−i). Therefore, the simplest way of inverting an element from an electronic point of view is the use of a table: the element to be inverted is presented at the input and addresses a table that brings the corresponding inverted element to the output. The procedure that generates the inverted elements is modelled with the following Python code, which implements the mechanism shown in Figure 3:
inv = [0]*(n+1)
inv[1] = 1                       # 1 is its own inverse
for i in range(1, pow(2, m) - 1):
    u = gf_pow(2, i)             # element alpha^i to be inverted
    u_1 = 1
    for j in range(1, n - i + 1):
        u_1 = F.Multiply(u_1, 2) # u_1 becomes alpha^(n-i)
    inv[u] = u_1
Figure 3) Algorithm used for inverting an element "u".
The VHDL function is instead the following:

--! @brief Function that generates automatically a vector for inverting the elements of a field
function generate_inverted_elements return inverted_elements_type is
  constant alpha : gf_symbol := conv_gf_logic_vector(std_logic_vector(to_unsigned(2, m)));  -- element alpha^1 = 2
  variable u     : gf_symbol := conv_gf_logic_vector(std_logic_vector(to_unsigned(1, m)));  -- element to be inverted
  variable u_1   : gf_symbol;                            -- inverted element
  variable inverted_elements : inverted_elements_type;   -- all the inverted elements
begin
  -- the first two elements are handled separately
  inverted_elements(0) := conv_gf_logic_vector(std_logic_vector(to_unsigned(0, m)));
  inverted_elements(1) := u;
  for i in 1 to 2**m - 2 loop
    u := u * alpha;   -- element to be inverted: alpha^i
    u_1 := conv_gf_logic_vector(std_logic_vector(to_unsigned(1, m)));  -- initialise the inverse to 1
    for j in 1 to n - i loop
      u_1 := u_1 * alpha;   -- u_1 becomes alpha^(n-i)
    end loop;
    -- store the inverse at the index of the element it inverts
    inverted_elements(to_integer(unsigned(u))) := u_1;
  end loop;
  return inverted_elements;
end function;
This function is called in the code for the initialisation of the ROM inversion tables in order to
generate all the inverted elements.
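The same generation mechanism can be validated in Python on the small field GF(2^3): the table is rebuilt exactly as described, and every non-zero element multiplied by its stored inverse must give 1 (a sketch; `gf_mul` is an illustrative helper, not the thesis code):

```python
m = 3
n = 2**m - 1            # number of non-zero field elements
PRIM = 0b1011           # primitive polynomial x^3 + x + 1 for GF(2^3)

def gf_mul(a, b):
    # Carry-less multiply, then reduce modulo the primitive polynomial.
    prod = 0
    for i in range(m):
        if (b >> i) & 1:
            prod ^= a << i
    for deg in range(2 * m - 2, m - 1, -1):
        if (prod >> deg) & 1:
            prod ^= PRIM << (deg - m)
    return prod

# Build the inversion table as in the algorithm above:
# the inverse of alpha^i is alpha^(n-i), obtained by repeated multiplication by alpha.
inv = [0] * (n + 1)
inv[1] = 1
u = 1
for i in range(1, n):
    u = gf_mul(u, 2)            # u = alpha^i
    u_1 = 1
    for _ in range(n - i):
        u_1 = gf_mul(u_1, 2)    # u_1 = alpha^(n-i)
    inv[u] = u_1

# Every non-zero element times its inverse must be 1.
for x in range(1, n + 1):
    assert gf_mul(x, inv[x]) == 1
```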
2.4 Examples of usage
The encoder is the component that receives at its input the data to be sent and encodes it, computing the parity symbols. Though the implementation of the encoder was not strictly necessary, it was used as an exercise to verify the GF Arithmetic package and to generate some encoded messages to be tested later on the decoder.
2.4.1 Encoder RS(7,3) in GF(2^3)
The first simple component designed was an encoder for the RS(7,3) code working in GF(2^3), which can be seen in the figure below. Since the number of symbols and the number of bits per symbol are low, debugging the device was easier. In fact, in this implementation it was possible to see how the synthesiser was implementing the device. Only once the architecture of the encoder was completely verified was it possible, in a second moment, to increase the number of symbols and bits in order to arrive at the final result.
Figure 4) Schematic of a RS(7,3) encoder.
For the first "k" cycles the two switches stay on the "a" contact, while in the next "n-k" cycles they stay on the "b" contact. Before the explanation of the system, it is worth observing the critical path, which consists of: adder → switch → multiplier → adder → register. Obviously the adders and multipliers, since they implement operations in the Galois field, are built as shown in the previous section.
To explain the working operation better, we can see how this structure works with an example for RS(7,3). Let's take m=[2, 7, 3] as the message to be encoded. The generator polynomial for GF(2^3) is g(X) = α^3 + α X + X^2 + α^3 X^3 + X^4, so g=[3, 2, 1, 3, 1]. These coefficients are the multiplication factors used in the structure shown in Figure 4. Before starting the example, the register values are obviously set to 0.
Cycle 1 (m=2)
Output of switch 1 = 2+0 = 2
Reg(0) = 2·3+0 = α·α^3 = α^4 = 6
Reg(1) = 2·2+0 = α·α = α^2 = 4
Reg(2) = 2·1+0 = α = 2
Reg(3) = 2·3+0 = α·α^3 = α^4 = 6
Cycle 2 (m=7)
Out_switch1 = 7+6 = 1
Reg(0) = 1·3 = 3
Reg(1) = 1·2+6 = 2+6 = 4
Reg(2) = 1·1+4 = 1+4 = 5
Reg(3) = 1·3+2 = 3+2 = 1
And so on for the remaining cycles. This simple example also shows how the hardware was debugged using Python: the Python script automatically generated these results to help locate mistakes in the VHDL code, as can be seen in Figure 5.
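The cycle-by-cycle values above can be reproduced with a short Python model of the encoder's feedback structure (a sketch; `gf_mul` is an illustrative GF(8) multiply helper using the primitive polynomial x^3 + x + 1, and g holds the multiplier coefficients g0..g3 read off the schematic):

```python
PRIM, M = 0b1011, 3            # GF(2^3), primitive polynomial x^3 + x + 1

def gf_mul(a, b):
    # Carry-less multiply, then reduce modulo the primitive polynomial.
    prod = 0
    for i in range(M):
        if (b >> i) & 1:
            prod ^= a << i
    for deg in range(2 * M - 2, M - 1, -1):
        if (prod >> deg) & 1:
            prod ^= PRIM << (deg - M)
    return prod

g = [3, 2, 1, 3]               # multiplier coefficients of the four encoder cells
reg = [0, 0, 0, 0]

for sym in [2, 7, 3]:          # message symbols, one per clock cycle
    fb = sym ^ reg[3]          # output of switch 1 (feedback)
    # each register takes feedback*gi plus the previous register's old value
    reg = [gf_mul(fb, g[0]),
           gf_mul(fb, g[1]) ^ reg[0],
           gf_mul(fb, g[2]) ^ reg[1],
           gf_mul(fb, g[3]) ^ reg[2]]
    print(reg)                 # after cycle 1: [6, 4, 2, 6]; after cycle 2: [3, 4, 5, 1]
```

After the third message symbol the registers hold the parity symbols, which are then shifted out through the "b" contact of the switches.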
Figure 5) Python RS(7,3) Encoder for message 0 [2,7,3].
After drawing the schematic of the encoder and realising the Python model for the encoding calculations, the VHDL design of the RS Encoder started. The first stage consisted in the definition of the black box mask of the component, which can be seen in Figure 6.
Figure 6) Black box representation of the RS Encoder in VHDL.
The RS Encoder block has as inputs:
- a data bus of "m" bits for the message to be encoded;
- a clock signal port;
- a synchronous reset port;
- an enable port.
For the outputs instead it has:
- a data bus of "m" bits for the encoded message;
- a par signal that is '0' if the output is not a parity symbol and '1' if it is;
- a ready signal that is '0' if there is no valid output and '1' if a valid output is given.
To realise the schematic of Figure 4, it can be noticed that there are repetitive parts that can be schematised as cells. In fact, the Multiplier-Adder-Register block can be the basic systolic cell of the encoder; it was called RSEncUnit. Using this cell, the schematic of Figure 4 is changed into the schematic of Figure 7. Since it doesn't have any adder, the first cell can be schematised as an RSEncUnit with one addend set to zero.
Figure 7) Encoder RS(7,3) working in GF(8) internal structure.
The cell realised has the black box representation shown in Figure 8 and the internal structure
is shown in Figure 9.
Figure 8) RSEncUnit black box.
Figure 9) RSEncUnit internal schematic and connections.
For representing the generator polynomial, an integer calculated as follows is used in the code:
g(X) = X^4 + α^3 X^3 + X^2 + α X + α^3 = [1,3,1,2,3] = [001 011 001 010 011] = 5715
The generator polynomial is converted into a constant to make sure that, if possible, the synthesiser will simplify and optimise the architecture.
In the code for the RS(7,3) encoder, the architecture was implemented by instantiating the component RSEncUnit four times with a generate statement and interconnecting the blocks with an array of gf_logic_vectors of m bits; this data bus is called out_reg. Switch 1 was realised with combinational logic (a conditional assignment), while for switch 2 a process was used in order to register the output. The multiplexers of the two switches are controlled by a control signal, switch_control_signal, which is a std_ulogic: it is low while message symbols are arriving and high otherwise.
2.4.1.1 Simulation and system validation
After finishing the implementation, a test-bench had to be realised for both the Python and the VHDL systems. The test is schematised in Figure 10. Basically, the test consists in sending three messages to be encoded without any pause and checking that the output is exactly the one desired. Python was used to check the correctness of the encoder. The messages chosen for the test were [2,7,3], [4,0,6] and [5,1,1].
Figure 10) Blocks representing the test-bench system.
2.4.2 Encoder RS(255,239) in GF(2^8)
After having obtained the basic version of the encoder, the encoder for the code that is actually going to be used, i.e. RS(255,239), was designed. This encoder works in GF(2^8), so 8 bits are needed for every symbol: the symbol used is the byte. The first step was to modify the GF_Arithmetic module appropriately. Since the quantity of data to try and verify is significantly larger than for the previous device, the methodology had to be modified and every component programmed slightly differently. The first part of the new encoder was the creation of a script for the generation of the messages, together with the proper modifications to the original Python script in order to accomplish the correct encoding.
The message generator script simply takes three strings (Suetonius's "Alea iacta est", "Carpe Diem" and Cato the Elder's "Carthagum delendam est"), cuts them or zero-pads them so that the length is exactly "k" (239 in this case) and then writes the ASCII number corresponding to each character of the string into a stream file, used afterwards by both the Python and the VHDL encoder.
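A minimal Python sketch of such a generator (the output filename and the exact padding convention are illustrative assumptions, not taken from the thesis):

```python
k = 239                                  # message symbols per RS(255,239) codeword

def make_message(text):
    """Truncate or zero-pad a string so it is exactly k ASCII symbols long."""
    codes = [ord(c) for c in text[:k]]
    return codes + [0] * (k - len(codes))

messages = ["Alea iacta est", "Carpe Diem", "Carthagum delendam est"]

# One symbol (decimal ASCII code) per line, three messages back to back.
with open("stream.txt", "w") as f:
    for text in messages:
        for symbol in make_message(text):
            f.write("%d\n" % symbol)
```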
The Python encoder takes these files as input and encodes them in the same way as realised previously. The computed output is printed to different files in two notations: binary and decimal.
With the Python code finished and ready to support the VHDL debugging, the next part was the development of the encoder in VHDL. The first problem faced was the generator polynomial: in the RS(7,3) encoder it had been coded as an integer (5715 = [001 011 001 010 011]), but in this case the length of the code could not be supported by the VHDL integer type. A new subtype of integer, gen_integer, of m bits, used for the generator polynomial's coefficients, had then to be defined. Then, an array type of gen_integer was defined so that all the polynomial coefficients can stay in one compact variable, divided into smaller parts.
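To see why a single integer cannot hold g(X) here, it can be computed in Python: g(X) = (X − α^1)(X − α^2)···(X − α^16) has 17 eight-bit coefficients, i.e. 136 bits in total (a sketch; `gf_mul` is an illustrative GF(256) multiply helper):

```python
PRIM, M = 0b100011101, 8           # GF(2^8), primitive polynomial 285

def gf_mul(a, b):
    # Carry-less multiply, then reduce modulo the primitive polynomial.
    prod = 0
    for i in range(M):
        if (b >> i) & 1:
            prod ^= a << i
    for deg in range(2 * M - 2, M - 1, -1):
        if (prod >> deg) & 1:
            prod ^= PRIM << (deg - M)
    return prod

# g(X) = (X + alpha^1)(X + alpha^2) ... (X + alpha^16), built incrementally.
# Coefficients are stored highest degree first; in GF(2^m), minus equals plus.
g = [1]
root = 1
for _ in range(16):
    root = gf_mul(root, 2)         # the next root alpha^i
    g = ([1]
         + [g[j + 1] ^ gf_mul(g[j], root) for j in range(len(g) - 1)]
         + [gf_mul(g[-1], root)])

print(len(g), 8 * len(g))          # 17 coefficients, 136 bits in total
```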
2.4.2.1 Simulation and system validation
This section is the analogue of the one for RS(7,3), but since the component is more complex, there are more tests to make. Basically, two behavioural test-benches were realised. The first one, Generic Architecture, is the encoding of one message of 239 bytes, while the second one, Multi_Message_Architecture, is the sequenced encoding (without any pause in between) of three messages of 239 bytes.
2.4.2.2 Utilisation and timing report
For the timing report, a particular component, called Timing_Tester, had to be created. This component registers the inputs and the outputs of the RSEncoder, which prevents an incorrect calculation of the timings. In fact, if the inputs and the outputs are not registered, not all the paths in the device are taken into account. In a realistic system these paths exist, so with this precaution the error is avoided. The module thus consists of input registers, which sample the inputs, connected to an instance of the encoder whose outputs are again sampled by other registers. The basic schematic is the one seen in Figure 11.
Figure 11) Timing_tester realisation viewed in Xilinx Vivado®.
In Figure 11 it is possible to see the three parts described before: on the left the input registers, in light blue the RSEncoder, and on the right the output registers. Registering both inputs and outputs ensures that at least one critical path exists and that all the paths are taken into consideration, thus making a correct timing analysis possible.
A Kintex 7 device (xc7k160tfbg676-3) was used for the post-synthesis timing analysis. The initial constraint used for the timing analysis is the TCL command:
create_clock -period 2.000 -name clk -waveform {0.000 1.000} [get_ports clk]
The search started with a clock period of 2.000 nanoseconds, and a Newton-Raphson method was adopted to find the maximum operating frequency. The waveform, as we can see from the TCL constraint above, always has a duty cycle of 50%: when 1.900 nanoseconds was used, for example, the waveform was "-waveform {0.000 0.950}".
The results of the Newton-Raphson iterations are visible in Figure 12, a Python graph representing the data obtained. The black dots highlight the valid solutions found, while the red ones are invalid. The final result is:
Clock period = 1.778 ns
Clock frequency = 562.43 MHz
Figure 12) Graph representing the post-synthesis time analysis made with Xilinx Vivado®.
As for hardware utilisation, Figure 13 shows the resource usage of the RS(255,239) encoder.
Figure 13) Resources utilisation for Encoder RS(255,239).
With this data, a post-implementation timing analysis with a clock frequency of 550 MHz (period 1.818 ns) was run, so the constraint used was:
create_clock -period 1.818 -name clk -waveform {0.000 0.909} [get_ports clk]
Since the frequency is 550 MHz and the symbol is the byte (8 bits), the overall data processing speed is 550 MHz × 8 bits = 4.4 Gbps.
2.4.2.3 Code documentation
Figure 14 and Figure 15 show the results of commenting the VHDL code to document the system. The HTML pages, generated automatically with Doxygen, are useful every time the code is used or re-used, for understanding interfaces, functions, components and their internal structure.
Figure 14) HTML page for the VHDL code documentation of the RS Encoder.
Figure 15) HTML page for the VHDL code documentation of the Timing Tester for RS Encoder.
3 Decoder for RS codes
This first chapter of results is dedicated to the study of the decoder. In the first part the algorithms are studied and compared in order to make the best possible choice. After that, the chosen algorithms are inserted into working architectures and the first results, in terms of area occupation and timing, are obtained.
3.1 Decoding process
FEC (Forward Error Correction) codes are a group of codes that introduce some redundancy symbols to make it possible for the receiver to find and correct possible errors in the received word. These codes assign the main tasks of the working operation to the receiver, with no retransmission: from this comes the name. From now onwards the sender will be called encoder, while the receiver will be called decoder.
3.1.1 Reed-Solomon codes
There is a huge quantity of FEC codes available in the literature. In this section the methods used in the project are introduced.
The Reed-Solomon codes were discovered in 1960 by Irving Reed and Gustave Solomon, who gave them their name. RS codes are non-binary cyclic codes in which the code symbols are binary m-tuples.
Non-binary codes do not use only the simple binary operations implemented by XOR and AND gates: they perform operations in GF(2^m). A group of "m" bits is defined as a symbol.
The structure of the encoder is straightforward and will be neglected for now, while some words have to be spent on the decoder.
Figure 16) Generic decoder architecture.
In the figure above, the generic architecture of a decoder system is illustrated. The first block computes the syndrome.
3.1.2 Syndrome Computer
The syndrome is a polynomial that can be computed starting from the received message and that contains all the basic information for restoring it: position and magnitude of the errors. Every set of errors is related to one and only one syndrome polynomial. Thus, since it is a one-to-one relationship, in theory it is possible to link through a table each set of errors to its syndrome polynomial without any further step: knowing the syndrome, we can get the corresponding set of errors and add it to the received message in order to get the original message back.
Even if this solution is possible in theory, there is a physical limit to storing all the possible syndrome values and the corresponding sets of errors: basically there are memory issues, so it is neither technically feasible nor convenient to implement a decoder in this way.
The classical way of computing the syndrome consists in substituting the corresponding values into the polynomial. For example, in a generic RS(7,3) code, the syndrome is "2t" symbols long, i.e. four. The meaning of the "t" parameter, as introduced before in the introduction chapter, is half of the difference between the total number of symbols and the useful number of symbols of the message.
S = Syndrome vector = [S0 S1 S2 S3]
If the received message is:
r(X) = r0·X^6 + r1·X^5 + r2·X^4 + r3·X^3 + r4·X^2 + r5·X + r6
The first element of the syndrome can be computed as follows:
S0 = r(α) = r0·α^6 + r1·α^5 + r2·α^4 + r3·α^3 + r4·α^2 + r5·α + r6
While the second one is:
S1 = r(α^2)
and similarly S2 = r(α^3) and S3 = r(α^4).
Substitution into the received polynomial is the classical way of computing the syndrome, and the algorithm is really easy to understand, but it does not offer great performance, since it consumes a lot of hardware. In general, throughout the project, the minimisation of resources will be a continuous goal: using few resources allows more parallelisation and therefore higher throughput. We seek to optimise resource usage with respect to throughput and latency.
3.1.2.1 Horner’s rule
The way of computing the syndrome shown above is not the one actually used in digital systems. There is a more convenient and efficient way, called Horner's rule, that uses fewer resources, though it requires more clock pulses to finish the calculation.
If r(X) is the message received, then S_i = r(α^i) with i = 1:(n-k); basically, it is a substitution into the message polynomial.
For RS(7,3), the rule can be written in extended form in this way:
S(X) = (((((r0·X + r1)·X + r2)·X + r3)·X + r4)·X + r5)·X + r6
The parentheses in the formula indicate the operations to be made in the same clock cycle. Hence, the overall amount of clock periods required for the computation of a syndrome is "n".
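The rule can be sketched in Python for RS(7,3) over GF(8): each syndrome accumulator multiplies by its α^i and adds the incoming symbol, one symbol per clock pulse (a sketch; `gf_mul` is an illustrative helper, and the received word is assumed to be the error-free codeword obtained for the message [2,7,3] with the generator polynomial of the previous chapter, whose parity works out to [3,6,7,6]):

```python
PRIM, M = 0b1011, 3            # GF(2^3), primitive polynomial x^3 + x + 1

def gf_mul(a, b):
    # Carry-less multiply, then reduce modulo the primitive polynomial.
    prod = 0
    for i in range(M):
        if (b >> i) & 1:
            prod ^= a << i
    for deg in range(2 * M - 2, M - 1, -1):
        if (prod >> deg) & 1:
            prod ^= PRIM << (deg - M)
    return prod

alphas = [2, 4, 3, 6]          # the n-k evaluation points alpha^1 .. alpha^4
r = [2, 7, 3, 3, 6, 7, 6]      # received word, highest-degree coefficient first

# One Horner step per received symbol: each accumulator is multiplied by its
# alpha^i and the incoming symbol is added (XORed).
S = [0] * 4
for sym in r:
    S = [gf_mul(S[i], alphas[i]) ^ sym for i in range(4)]

print(S)                       # an error-free codeword gives [0, 0, 0, 0]
```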
3.1.3 Berlekamp-Massey Algorithm (BMA)
The second step is a component able to build, from the syndrome polynomial, two other polynomials that contain the location and the magnitude of the errors. The error-location polynomial has as roots the values related to the positions of the errors; the error-magnitude polynomial instead yields the magnitude of the error at each position defined by the error-location polynomial.
Several algorithms are available for this task, such as the Euclidean one. In this project it was decided, for reasons explained further on, to use the one named after the scientists who discovered it: Berlekamp-Massey (BM).
This algorithm, together with the Euclidean one, is normally considered a standard, as it computes both polynomials simultaneously in "2t" iterations. The classical version of the algorithm uses inversions in the Galois field during its working operation. This operation is particularly expensive in terms of resources. The need to avoid inversion led to the search for a variant of the algorithm that arrives at the solution in the same number of iterations. Since its main distinguishing characteristic with respect to the classical version is that it avoids the inversion of elements, the variant takes the name of inversion-less Berlekamp-Massey (iBM) algorithm. Its working operation flowchart is in Figure 17.
Figure 17) Flowchart that describes the working operation of the inversion-less Berlekamp-Massey (iBM)
algorithm.
This version is not exactly the one used in the project, but a precursor from which the most practical and implementable one will be derived.
3.1.3.1 Enhanced Parallel Inversionless Berlekamp-Massey Algorithm (ePIBMA)
The previous version of the Berlekamp-Massey procedure is not perfectly optimised. Indeed, though the number of iterations needed is the same, the latter algorithm must still be followed by a canonical Chien search and Forney correction. Moreover, it is not possible to design a component that implements this version of the algorithm with a systolic architecture. The possibility of realising a systolic structure is important because it significantly lowers the complexity of the architecture, and therefore also of the VHDL code. In order to attain these improvements, the so-called enhanced Parallel Inversionless Berlekamp-Massey algorithm is introduced [Wu15].
The outputs are the error-location polynomial "Λ" and the auxiliary polynomial "B". Used together, these two polynomials help reach a faster solution in the successive stage. Other outputs of secondary importance are also computed.
The algorithm takes "n-k" clock pulses to compute the outputs, like the other algorithm exposed, but it is possible to build the system through the composition of cells.
Figure 18) ePIBMA working operation diagram.
3.1.4 Chien Search (CS) and Forney's algorithm
The last stage of the decoder consists of two different parts: one computes the location of each error and the other computes its magnitude. Generally, the standard algorithm for calculating the positions of the errors is the Chien search. Figure 19 shows the flowchart that summarises the procedure.
In short, the basic idea is the simple substitution into the polynomial of all the non-zero elements of the field in which the code operates. If the evaluation of the polynomial is zero, by definition a solution has been found, and so the decoder will proceed to compute the corresponding magnitude.
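The exhaustive-substitution idea can be sketched in Python for GF(8): every non-zero element α^j is tried in Λ(X), and the exponents where the evaluation is zero mark the error locations (a sketch; the single-error locator polynomial below is an illustrative assumption, not one from the thesis, and `gf_mul` is an illustrative helper):

```python
PRIM, M = 0b1011, 3                 # GF(2^3), primitive polynomial x^3 + x + 1

def gf_mul(a, b):
    # Carry-less multiply, then reduce modulo the primitive polynomial.
    prod = 0
    for i in range(M):
        if (b >> i) & 1:
            prod ^= a << i
    for deg in range(2 * M - 2, M - 1, -1):
        if (prod >> deg) & 1:
            prod ^= PRIM << (deg - M)
    return prod

def poly_eval(coeffs, x):
    # Horner evaluation, highest-degree coefficient first.
    acc = 0
    for c in coeffs:
        acc = gf_mul(acc, x) ^ c
    return acc

# Illustrative single-error locator Lambda(X) = alpha^2 * X + 1; its only root
# is X = alpha^(-2) = alpha^5 (alpha = 2, alpha^2 = 4 in this field).
Lam = [4, 1]

roots = []
x = 1                               # alpha^0
for j in range(2**M - 1):           # try every non-zero field element alpha^j
    if poly_eval(Lam, x) == 0:
        roots.append(j)             # alpha^j is a root of Lambda
    x = gf_mul(x, 2)                # move on to alpha^(j+1)

print(roots)                        # [5]
```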
Figure 19) Flowchart that describes the working operation of the classical version of the Chien Search (CS)
algorithm.
Having found the position of an error, the magnitude is still to be computed. To do this, Forney's algorithm can be used. Its work flow is summarised in Figure 20.
Figure 20) Flowchart of the working operation of the Forney's algorithm.
3.1.4.1 enhanced Chien Search Error Evaluation Component (eCSEE)
The classical form of the Chien algorithm has the same drawback as the Berlekamp-Massey one: it doesn't have a systolic structure. A systolic structure allows the building of a simpler system, hence simpler VHDL code, while presenting no drawback in terms of resources used and timing characteristics. The eCSEE evaluates the error-location polynomial and computes the error magnitude at the same time; functionally speaking, it corresponds to both the Chien and Forney algorithms. The outputs are the location of each error and its computed magnitude. In the schematic of the general architecture of the decoder (Figure 16), this block is the last one (actually two blocks): the addition and the selection made by the switch in the figure are in reality made outside this block.
The procedure followed for implementing this component is the one shown in Figure 21.
Figure 21) Flowchart diagram of the eCSEE algorithm.
The schematics and algorithms presented represent the choices made for this thesis. These versions of the decoder were implemented in Python and are presented as results of the thesis. As far as the VHDL coding is concerned, only the final systolic versions are submitted and explained.
3.2 Decoder RS(255,239) in GF(2^8)
The first component to be studied and implemented is a plain version of a decoder for the Reed-Solomon (255,239) code, working in a Galois field with eight bits per symbol. The device can be split into the following main components:
- SyndromeComputer
- ePIBMA
- eCSEE
3.2.1 Syndrome Computer
This is the first stage of the decoder device. The component calculates the syndrome vector using Horner's rule. As explained before, the advantage of this rule is that only one multiplier is needed, and it is used throughout the "n" cycles that the component is active. The schematic that implements the method is, straightforwardly, the one in Figure 22. Obviously, in the figure, the clock, enable and reset control signals are connected properly to the global system signals.
Figure 22) Syndrome calculation unit schematic implementing Horner's rule.
Concerning the VHDL code, there were some problems. The first was how to obtain the factors to be substituted into the arriving message. While in software, especially in Python, which is an interpreted dynamic programming language, the values can be computed with a for statement, in VHDL the factors have to be declared as constants in order to allow the synthesiser to simplify the circuit as much as possible. This, and later on other issues like type declarations, reinforced the idea of creating a separate package file (RS_Decoder_Types) in which every accessory variable, type, subtype and function is defined. Following this concept, a function was written that, through a for loop, generates and returns the alpha factors needed for the SyndromeComputer initialisation. This also implied the creation of a type specifically designed for the vector of elements. Summing up, the problem was solved by declaring a new array type of "n-k" gf_symbols, called alphas. This array is initialised by the function generate_alphas, which was designed for this purpose.
Another problem was the definition of the type for returning the syndrome vector, called syndrome_type, which is an array of "n-k" gf_symbols.
The component requires a specific cell (Figure 22) for its working operation. The cell was
called SyndromeCell (the black box model is in Figure 23).
Figure 23) SyndromeCell black box block.
Observing Figure 22, the cell is apparently no different from the cell previously used for the encoder: it is composed of the same topology of a GF multiplier, a GF adder and a register. When all the cells are interconnected and tested, though, it can be noticed that, as it is, the cell cannot be used for messages sent in series without pauses. In fact, thinking of the basic schematic shown in Figure 9, the register at clock pulse number "n-1" would have to be reset in order to start from zero at the next syndrome calculation; at the same time, the reset has to be avoided because we want the result at the output. As it was, the architecture was not correct and an improvement was needed.
The first idea was to insert in the loop a multiplexer between the output of the register and the multiplier. If the reset was high, the multiplexer selected a zero, otherwise it selected the output of the register; the same signal was used as clock enable for the register. The solution worked correctly but, on deeper analysis, a lengthening of the critical path can be observed. This degrades the performance of the system because it forces the synthesiser to decrease the maximum operating frequency achieved. The original critical path was composed of an adder, a register and a multiplier; the new solution added a multiplexer to the critical path. This was not acceptable and other solutions had to be explored.
The final architecture adopted for the cell is shown in Figure 24.
The final architecture adopted for the cell is showed in Figure 24.
Figure 24) SyndromeCell internal structure.
The main advantage of using two registers instead of one is that the reset signal actually resets only the register in the loop, while for the lower register it is used as an enable for sampling the output. So the reset signal acts as reset for the upper register and as enable for the lower one. Thus, it was possible to restore the previous critical path of the cell while obtaining the wanted behaviour. Hence, the critical path is again:
T_crit = T_add + T_mult + T_register_setup
Once the cell was obtained, the component was built as shown below.
Figure 25) Internal structure of the SyndromeComputer.
The system also requires a control circuit for generating the ready signal, which has to arrive at the output after "n" clock pulses. The clock pulses are counted by a counter and, when this reaches "n", the signal goes high for one clock pulse. At the output, the syndrome symbols are grouped as syndrome_type.
The result of this section is the black box model presented in Figure 26.
Figure 26) Black box model of the SyndromeComputer component.
3.2.2 ePIBMA component
Having computed the syndrome, the following stage is the Berlekamp-Massey one. Its task is to obtain, from the syndrome values and the arrived message, the error-location polynomial "Λ", the auxiliary polynomial "B" and the other outputs required by the eCSEE module.
As discussed in the previous section of this chapter, the version of the method adopted for the decoder is the one that allows the systolic structure.
The fundamental repetitive cell of the component is illustrated in Figure 27 and Figure 28, which show respectively the black box and the internal structure of the cell.
Figure 27) Black box representation of the RS_ePIBMA_Cell.
Figure 28) Internal Structure of the RS_ePIBMA_Cell.
Every cell computes the "i-th" coefficient of the two polynomials. From the figure we can notice that the control signals (MC1, MC2, MC3) determine the behaviour of the cell through some multiplexers.
The critical path of the cell is the one used by Ω_0^(i) and Ω_(p+1)^(i), i.e.:
T_crit = T_add + T_mult + T_register_setup
In general, this is also the critical path of the overall system.
Particular attention was necessary for the initialisation stage. In fact, while in Python it corresponds to a simple assignment, in VHDL it is a little more complex. The initialisation step was implemented by forcing the control signals so that the values of Ω and Θ arrive directly at the inputs of the two registers. If, at iteration 0, the initial values are at the inputs of cell number "p", then "MC1=0", "MC2=(others=>'0')" and "MC3=1" also have to be set. In this way, it is guaranteed that at the next clock pulse the cells will be initialised correctly. This iteration corresponds to the "Initialisation" step of Figure 18. At the same moment these signals are set, some attention also has to be paid to the values the signals will take in the following step: the control signals of the following iteration have to be initialised too. The following figures show the solutions adopted for the several signals.
Figure 29) Circuit for the initialisation of the MC1 control signal.
Figure 30) Circuit for the initialisation of the MC2 control signal.
Figure 31) Circuit for the initialisation of the MC3 control signal.
In Figure 29, Figure 30 and Figure 31 it can be noticed that the temporary signals created
(MC1_temp for example) are selected only if init_signal is low.
Referring to Figure 18, the correspondence of the signals is explained here. The MC1 signal corresponds to the first if inside the iterations, though the condition to be verified is modified to optimise the implementation. In fact, writing:
MC1 = (Ω_0^(i) ≠ 0) AND (L_Λ^(i) ≤ L_B^(i))
is the same as checking this other condition:
NOT(MC1) = (Ω_0^(i) = 0) OR (L_Λ^(i) > L_B^(i))
This result comes directly from the application of De Morgan's theorem. Simply moving L_B from the right side to the left side, the system will be even better:
NOT(MC1) = (Ω_0^(i) = 0) OR ((L_Λ^(i) − L_B^(i)) > 0)
Where before there was a comparator between two integers, now there is first a subtraction and then a simple check of the sign bit of the result.
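The equivalence between the comparator and the subtract-then-test-sign-bit form can be checked exhaustively in Python for a small two's-complement width (a sketch; the 5-bit width and the subtraction order are illustrative assumptions):

```python
W = 5                                        # illustrative two's-complement width

def sign_bit(x):
    """Sign bit of x seen as a W-bit two's-complement value."""
    return (x >> (W - 1)) & 1

# L_lambda <= L_B  is equivalent to: the W-bit difference (L_B - L_lambda) is
# non-negative, i.e. its sign bit is 0 (exact while both operands fit in W-1 bits).
for L_lambda in range(2 ** (W - 1)):
    for L_B in range(2 ** (W - 1)):
        diff = (L_B - L_lambda) & (2 ** W - 1)    # wrap-around subtraction
        assert (L_lambda <= L_B) == (sign_bit(diff) == 0)
```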
The MC2 signal is a shifting vector that, fed to the cells, puts a 0 at position "2t-i-2" of the Θ polynomial. Finally, the MC3 signal synthesises the choice that appears in the right branch of the iteration flow, i.e. the condition (L_B = (t-1)).
In the same way, the other control signals were also deduced from the algorithm in Figure 18.
The signal gamma is computed as shown in Figure 32. We can notice that, as previously done
for the control signals, a value for gamma is set initially when the init_signal arrives and also
for the next period, thanks to the presence of the register.
Figure 32) Circuit for the calculation of the gamma signal.
The signal omega_0 is set simply with a multiplexer (Figure 33), whose select signal is init_signal. During initialisation the signal is set to zero; otherwise the first symbol of the omega polynomial is taken.
Figure 33) Circuit for the calculation of the omega_0 signal.
Following the same procedure, the circuits for computing LB and Lsigma were also obtained.
Figure 34) Circuit for the calculation of parameter L_B.
Figure 35) Circuit for the calculation of L_sigma.
The system built for calculating the zeta signal (Figure 36) deserves a description. Like the previous ones, the signal is defined by the control signals. A register was necessary for delaying the signal to the next iteration, when the init_signal goes down. The additional part is, first of all, a GF multiplier that multiplies the current zeta by the basic element α^-1. In the schematic this operation is simplified with a block containing the basic element; in VHDL, instead, there is the problem of how to implement it. The α^-1 value is taken from a function (generate_inverted_elements) defined in the package RS_Decoder_Types. At the declaration of the signal alpha_1, it is initialised with the second element of the array returned by the function. At the output of the first register on the left there is a multiplexer that manages the first clock period, when init_signal is high, as seen in the other architectures shown in this component.
Finally, referring to Figure 37, there is the ready signal. When the counter reaches “2t” counts, the system schedules the internal_ready signal to go high in the following period; one period later the ready signal goes high as well. This delay is useful to let the ePIBMA component know one cycle in advance that the computation is finished. In this way it is possible to sample the outputs of the system: the output ready goes high at the same time as the data become available at the output ports.
Figure 36) Circuit for the calculation of the zeta signal.
Figure 37) Circuit for the calculation of the ready signal.
3.2.3 eCSEE component
The final stage of the device is a component that implements the eCSEE method. It receives
the outputs of the Berlekamp-Massey component and computes the roots of the error-location
polynomial and the magnitude of the errors. The black-box mask of the block is shown in
Figure 38. The outputs of this block are the clock counts (CC), the error magnitude, the ready signal and the number of errors occurred (num_err). This last one will be useful for the real-time debug of the component, explained in the following section dedicated to the block architecture of the decoder.
Figure 38) RS_eCSEE black-box representation.
We can see the generic blocks of the device in Figure 39. Observing the architecture, we notice repetitive modules for the evaluation of sigma_even and sigma_odd, which form the first two rows of the figure. Similar modules are used for the evaluation of the “B” polynomial. As expounded earlier, the advantages of the systolic structure are enormous, and in this component too the algorithm allows this characteristic to be exploited. At the output stage, the enable is given by the condition Λ_even(α^-j) == Λ_odd(α^-j). In the Galois field, this condition can be computed as the sum of the evaluations of the two polynomials, i.e.:
Λ_even(α^-j) + Λ_odd(α^-j) == 0
The output block, when enabled, multiplies the evaluation of the odd part of sigma by the evaluation of the “B” polynomial. This result is then inverted, multiplied by the zeta value and finally presented at the output.
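The root check of the output stage can be sketched behaviourally in Python. This is not the VHDL: it only illustrates that summing the even and odd evaluations over GF(2^m) is the same as testing Λ(α^-j) = 0. The field GF(2^8) with primitive polynomial x^8+x^4+x^3+x^2+1 (0x11D) is an assumption for the example.

```python
PRIM = 0x11D  # assumed primitive polynomial for GF(2^8)

def gf_mul(a: int, b: int) -> int:
    """Carry-less multiplication modulo the primitive polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM
        b >>= 1
    return r

def eval_part(coeffs, x, parity):
    """Evaluate only the even- (parity=0) or odd-degree (parity=1) terms."""
    acc, xi = 0, 1
    for i, c in enumerate(coeffs):        # coeffs[i] is the degree-i coefficient
        if i % 2 == parity:
            acc ^= gf_mul(c, xi)
        xi = gf_mul(xi, x)
    return acc

def is_root(coeffs, x):
    even = eval_part(coeffs, x, 0)
    odd = eval_part(coeffs, x, 1)
    return (even ^ odd) == 0              # GF(2^m) addition is XOR
```

For example, with Λ(x) = (x + 1)(x + 2) = x² + 3x + 2, the positions 1 and 2 are detected as roots while 3 is not.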
Summing up, we can distinguish several blocks: a part for the evaluation of the sigma and “B” polynomials, a part for the calculation of the zeta, another for computing the enable signal of the output stage, and the output stage itself. The evaluation circuits are built using only one repetitive unit, called RS_eCSEE_Cell.
The sketch seen in general terms before is represented by the VHDL code as in Figure 40. There are small differences that will be explained and motivated later in this section, but in general it is the same architecture as Figure 39.
In the left part of the structure there are four blocks for the evaluation of the two polynomials. While for the sigma polynomial the evaluation must be divided because the odd part is needed in the output stage, for “B” this is not required. The reason it was separated anyway is segmentation: dividing the evaluation unit into smaller units allowed the system to be pipelined, reducing its critical path. Since the system was pipelined, keeping the “B” polynomial evaluation whole would have added latency. Dividing “B” into two parts as well reduces the latency and restores symmetry with respect to the sigma evaluator.
The symmetry simply helps in computing the arrival times of the various signals in the digital circuit.
Since the cell used is the one visible in Figure 41 and Figure 42, the latency of the evaluation blocks is “t/2”. In my particular example of RS(255,239) it is 4.
Figure 39) Internal blocks of the eCSEE device.
Figure 40) Internal structure of the eCSEE component; the blocks are placed ordered in respect to the x-axis that
represents the time.
Figure 41) Black-box representation of the RS_eCSEE_Cell.
Figure 42) Internal structure of the RS_eCSEE_Cell.
Comparing the images above, the cell differs from the cells visible in Figure 39 only in the presence of the segmentation registers between one adder and the next. Referring to Figure 42, it can be seen that the multiplexer selects the polynomial coefficient “Λi” for the first cycle and then starts selecting the previous partial_result. After this initialising step, the cell starts working in steady state, i.e. mult_factor will be the mult_factor of the previous period.
The register at the bottom right has the task of delaying the init_signal that reaches the cell and giving it as an input to the next cell. The segmentation of the path makes these signal delays necessary, and this was the solution chosen. In VHDL it was more compact to make the signal go through the cells this way, simply connecting one cell’s output to the next cell’s input. Another way of doing it, saving some registers, would have been to create a chain of registers that delays the init_signal only once, instead of replicating the structure. Although that solution is cheaper from a resource point of view, the extra cost of registers with this cell is only “3*t/2” 1-bit registers, i.e. 12 1-bit registers. This is negligible with respect to the overall resource usage, discussed in the next section of this chapter; therefore this solution is equally valid.
The critical path of the cell is the path that goes from the partial_result register to add_result, passing through one multiplexer, one multiplier and one adder:
T_crit = T_mux + T_mult + T_adder
The evaluation blocks are a chain of RS_eCSEE_Cell units put in series, as shown in Figure 43 and Figure 44. For the first cell of each even chain, the previous cell’s value is zero, so the adder reduces to a simple wire connection. For the odd chains, instead, the value is set to the first coefficient of the polynomial. This follows the general circuit of Figure 39.
The evaluators are a cascade of four cells, except for the B_even evaluation block. In fact, if the sigma polynomial’s length is “t”, the length of “B” is “t-1”, and this creates an issue for the calculation of the even part because it needs one cell less (“t/2-1” instead of “t/2”). To synchronise the arrival of the signals at the output of the evaluators, a register was placed at the end of the evaluation block of the even part of “B”, restoring the synchronisation. In Figure 44 the different cell can be seen at the end of the chain for the evaluation of the B_even polynomial.
Figure 43) Sigma polynomial evaluation block.
Figure 44) “B” polynomial evaluation block.
Another important part of this component is the zeta computer, which follows the schematic of Figure 45. The circuit uses the control signal internal_CC, which comes directly from the counter of clock pulses. The shift register at the output of the module is needed for the segmentation of this component, to synchronise the signals arriving at the multiplier of the output stage.
Figure 45) Zeta computer component.
The segmentation of the RS_eCSEE brought synchronisation problems throughout the internal circuits. Having illustrated the counting of the clock pulses and the init_signal, the processing of the control signals across the device will now be illustrated.
The counts of the clock pulses have to be delayed in order to arrive simultaneously with the error magnitude. In Figure 40 the x-axis represents time. The evaluation blocks take “t/2” clock pulses of latency and the inverting element one clock pulse, so the total is “t/2+1” clock pulses. This is exactly the shift-register delay that has to be given to the internal_CC for it to arrive at the output (Figure 46).
Figure 46) Circuit for the calculation of the number of clock pulses (CC).
The internal_ready signal is another signal of interest. Figure 47 represents the system for the calculation of the ready signal. A counter counts the clock pulses separately. The counts are compared to the first extreme (“t/2”) of the interval where the ready has to be high, and to the second extreme (“t/2+n”). When the counts are between these two values, the internal_ready is high; otherwise it is low. The reset circuit is in reality a bit more complex than that, to allow the eCSEE to work even with consecutive messages. If the reset arrives while the internal_CC is stopped, there are no messages in series: there is at least one clock period of pause, so the counter of this module is reset to zero. If instead the internal_CC is still counting when the reset signal arrives, two consecutive messages are present; to maintain the correct number of counts, the counter is reset to one.
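The ready-window counter with its two reset behaviours can be sketched behaviourally; this is my reading of the description above, with illustrative names (the still_counting flag stands for “internal_CC is still counting”), not the actual VHDL.

```python
T = 8          # correctable errors, RS(255,239)
N = 255        # codeword length

class ReadyCounter:
    """Behavioural sketch of the internal_ready counter described above."""
    def __init__(self):
        self.count = 0

    def tick(self, reset: bool, still_counting: bool) -> bool:
        if reset:
            # back-to-back messages: the cycle consumed by the reset itself
            # must still be counted, so restart from 1 instead of 0
            self.count = 1 if still_counting else 0
        else:
            self.count += 1
        # ready is high when the count lies between the two extremes
        return T // 2 < self.count <= T // 2 + N
```

The key point is the conditional reset value: restarting from one instead of zero keeps the count aligned when a new message begins with no idle cycle in between.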
Figure 47) Circuit for the calculation of the ready signal.
Another crucial signal is the reset. When it arrives, this signal also has to be delayed for the component to work correctly. Using the same estimation made for the internal_CC, the reset signal is likewise delayed by “t/2+1” clock periods.
Having presented all the control signals, the architecture of the last block of this component is now presented. The last stage’s enable is the condition given before, i.e. the evaluation of “Λ” (“sigma_evaluated==0”), as can be seen in Figure 40. This condition (error_found), again for synchronisation reasons, is delayed by one period before arriving at the clock enable of the output register. The outputs of the “B” polynomial evaluation blocks are summed to obtain the total value B_evaluated. This is multiplied by sigma_odd_evaluated and the result of this operation (non_inverted_partial_result) is the input of the Inverted_elements_ROM (Figure 48). This block is a ROM built with the embedded RAM (BRAM) of the FPGA. To fill the ROM with the correct values, it was necessary to write a specific function, generate_inverted_elements, in the package RS_Decoder_Types. The function for the creation of an array of inverted elements is based on the scheme of Figure 3.
Figure 48) Black-box representation of the ROM for inverting the elements.
The ROM is used by giving the element to be inverted to the input port addr; in the next period the corresponding inverted element appears on the output port data. This element introduces a delay of one clock period into the path.
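The contents of such a ROM can be generated offline, as an analogue of the generate_inverted_elements function mentioned above, here sketched in Python rather than VHDL. GF(2^8) with primitive polynomial 0x11D and primitive element α = 2 is an assumption for the example; the key identity is α^-i = α^(255-i).

```python
PRIM = 0x11D  # assumed primitive polynomial for GF(2^8)

def gf_mul(a: int, b: int) -> int:
    """Carry-less multiplication modulo the primitive polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM
        b >>= 1
    return r

def generate_inverse_table():
    """Build log/antilog tables from the powers of alpha = 2 and derive
    the inverse of every non-zero element: alpha^-i = alpha^(255-i)."""
    exp, log = [0] * 255, [0] * 256
    x = 1
    for i in range(255):
        exp[i] = x
        log[x] = i
        x = gf_mul(x, 2)           # multiply by the primitive element alpha
    inv = [0] * 256                # address 0 left at 0 (0 has no inverse)
    for a in range(1, 256):
        inv[a] = exp[(255 - log[a]) % 255]
    return inv
```

Each table entry would then be written into the BRAM at the address of the element it inverts.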
Continuing the description of the system of Figure 40, the outcome of the inverting block (inverted_partial_result) is multiplied by the zeta calculated by the dedicated block and then given to the input of the output multiplexer. The result of the multiplication is called error_magnitude_calculated and corresponds to the computation of the error magnitude in the eCSEE algorithm. The mux output is selected by the error_found signal and then sampled by a register. The error magnitude is thus zero if no error is found, and equals error_magnitude_calculated if an error is found.
Finally, the signal num_err is simply determined by counting the times the error_found signal went high during the decoding of one message.
3.2.4 Decoder block assembly
The blocks described up to now have to be assembled together. The way of assembling them and making the decoder work properly is worth a section.
Figure 49 illustrates the decoder black-box mask. The inputs are the canonical clock, reset and enable, plus data_in, where the symbols of the message to be decoded are inserted one by one. The outputs are:
- data_out is, similarly to data_in, the decoded message given one symbol at a time;
- ready is simply a bit that goes high when an output is produced;
- invalid_output is a flag bit communicating whether the output produced is valid or, for example, the decoder did not find all the errors contained in the message;
- par says whether at the output there are the parity bits (high) or the message bits (low).
Figure 49) RS_Decoder black-box representation.
The general structure of the decoder was shown previously in Figure 16. In the upper branch we see the three blocks described until now in sequence: SyndromeComputer, RS_ePIBMA and RS_eCSEE. The architecture to obtain the original message is still missing from the description. To obtain it, a block is necessary for delaying the received message and keeping it stored until the three blocks have processed it; then it is possible to sum it to the error vector computed by the three core blocks. This delay block is implemented by a Dual-Port RAM called Delay_RAM. The interface of this component is displayed in Figure 50.
Figure 50) Black-box representation of the Delay_RAM.
Besides the usual clock input, there are the inputs for enabling the two ports and two address ports. Moreover, there is the input data port for port “a” and the output port for port “b”.
A double port was required because it is necessary to write the new data as they arrive, in order to store them, and to read them when the error correction is ready. To do this, two units had to be designed for generating the addresses, so that the signals remain synchronised. In Figure 51 there is the basic schematisation of a circuit that uses the Delay_RAM.
Figure 51) Example circuit for the usage of the Delay_RAM block.
There has to be a system, similar to the internal system of the RS_eCSEE, that manages all the signal delays to keep them synchronised. The two processes driving the RAM block are simple counters. The input_delay_process (Figure 52) counts the clock pulses and also takes into account the number of the message being examined: “m” bits for the symbol address and two MSBs for the message address. In my example of an RS(255,239), the address is then ten bits wide.
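The address layout of the input_delay_process can be sketched behaviourally (the generator and its names are illustrative, not the VHDL): an m-bit symbol counter concatenated with a 2-bit message counter, for m + 2 = 10 address bits.

```python
M = 8            # bits of the symbol field (RS(255,239))
N = 255          # symbols per message

def input_addresses(num_messages: int):
    """Yield the write address for each incoming symbol: the two MSBs
    select one of up to four messages in the RAM, the low m bits count
    the symbol inside the message."""
    for msg in range(num_messages):
        for sym in range(N):
            yield ((msg % 4) << M) | sym   # 2-bit message field | m-bit symbol field
```

The 2-bit message field wraps around, so the RAM holds a sliding window of messages while the decoding pipeline catches up.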
Figure 52) Circuit for the generation of the address for input data in the Delay_RAM.
Concerning the output data address, the situation is more complex, since the symbols at the output of the RS_eCSEE component are in reversed order with respect to the arriving ones. In fact, while in the arriving messages the parity symbols are at the end of the message, at the output of the Chien search component they are the first to come out. The differences between the two systems can be observed by comparing Figure 52 and Figure 53.
Figure 53) Circuit for the generation of the output data address.
In the output data address circuit, we can see that the error location, the signal coming out of the RS_eCSEE module, is reversed with respect to “n” (the length of the message) to re-establish the same sequence of symbols as the message. When the “m” bits of the address reach zero, the system switches to the next message.
Another important process of the RS_Decoder is the one that controls whether the output is valid or not. In fact, if the number of errors found is less than the length of the sigma polynomial L_sigma, it means that not all the errors were found. Moreover, there are limitations on the maximum number of errors that are detectable and correctable by a Reed-Solomon code: the maximum number is “t”, which in the case of RS(255,239) is eight.
The overall system (Figure 54) is completed with a small note about the control signals used in the decoder. The blocks in the figure appear to receive the same control signals, but this is not the real case: it would not work. This was done only for the sake of simplicity in drawing the circuit; in reality, the latencies of the blocks have to be taken into account. The discussion about the latencies is left to the following section, but in general shift registers are used to delay the control signals across the blocks and allow the system to work properly.
Figure 54) Complete schematic of the RS_Decoder component described with blocks.
In the system, shift registers properly delay the control signals among the various blocks and deliver a correct timing synchronisation. These components have the task of delaying the value of L_sigma by the latency of the Chien component, in order to avoid invalid_output problems due to unsynchronised signals. As a matter of fact, if the length of the sigma polynomial were not delayed, when the Berlekamp-Massey block finishes with a new length it would be compared with the errors found by Chien at the end of the previous message. For example, if message 1 is being decoded by Chien for some clock pulses when message 2 finishes Berlekamp-Massey, there would be a comparison between the length of the sigma polynomial of message 2 and the errors found by Chien in message 1. Interposing this block avoids this synchronisation problem.
To make the shift register more compact, it was based on a Dual-Port RAM; Figure 55 shows the structure. The RAM stores “t/2+2” values of an unsigned number. The control logic gives the proper values to the two addresses and also provides a flag full, saying whether the memory is full or not. This signal selects the output through a multiplexer: if the memory is not full, the output is set to zero.
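The RAM-based shift register can be sketched as a circular buffer of depth t/2 + 2 whose output is forced to zero until the memory has been written through once (the full flag of the text). This is a behavioural model with illustrative names, not the VHDL.

```python
DEPTH = 8 // 2 + 2      # t/2 + 2 for t = 8

class RamShiftRegister:
    """Behavioural sketch of the L_sigma shift register on a dual-port RAM."""
    def __init__(self, depth: int = DEPTH):
        self.mem = [0] * depth
        self.wr = 0
        self.filled = 0

    def tick(self, value: int) -> int:
        # port b reads the oldest entry, port a overwrites it with the new one
        out = self.mem[self.wr]
        self.mem[self.wr] = value
        self.wr = (self.wr + 1) % len(self.mem)
        full = self.filled >= len(self.mem)
        self.filled = min(self.filled + 1, len(self.mem))
        return out if full else 0      # output mux: zero while not full
```

Reading and writing the same address on the two ports every cycle yields a fixed delay equal to the depth, with a single pointer as control logic.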
Figure 55) Internal structure of the L_sigma shift register.
The decoder component RS_Decoder is now completely described.
The test-bench for the device was made with a system similar to the one used for the encoder shown in Figure 10. The difference is that in this case it was necessary to simulate the presence of noise in the transmission medium or in the receiving system. These imperfections were modelled in the Python code of the decoder with the function add_noise. This function takes eight random numbers between zero and “n”, corresponding to the positions of the errors introduced, and then changes the symbols in those positions to a random number between zero and 2^m. The number of errors introduced by the code is actually controlled by the variable num_errors passed to the function; in my code this is set to the maximum number of errors correctable by the code, “t”. The original messages, read from a text file, are modified and saved into another text file with the noise added. The decoding algorithm then does its job, decodes the messages and saves them again. The VHDL file uses the noisy files as a starting point and computes the output. Thus, the data that come out of the decoder are manually compared with those coming out of the Python model.
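A sketch of what such an add_noise function may look like, based only on the description above (the actual implementation is in the thesis code; here the positions are drawn without repetition, which is one possible reading of “eight random numbers between zero and n”):

```python
import random

def add_noise(message, num_errors, m=8):
    """Corrupt num_errors symbols of the message at random positions,
    replacing each with a random symbol in [0, 2^m)."""
    noisy = list(message)
    n = len(noisy)
    positions = random.sample(range(n), num_errors)   # distinct error positions
    for pos in positions:
        noisy[pos] = random.randrange(2 ** m)         # random corrupted symbol
    return noisy
```

With num_errors = t = 8 this exercises the decoder at its correction limit; a replaced symbol may coincide with the original, so the effective number of errors can be lower.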
A comparison of the results for an RS(255,239) decoder is shown in the figure below.
Figure 56) Python graph presenting the results obtained by the RS(255,239) decoder.
In red there is the noisy message, which slightly differs from the original message, in blue. The blue curve is actually not observable in the figure because the green one, i.e. the decoded message, completely coincides with the original one, showing the correct operation of the algorithm. The message is composed of three parts: a first part containing a string, a second part of zero-padding to reach the 239 symbols, and the final parity symbols. In the central part, where only zeros should be found, we can see spikes of errors corrected by the decoding algorithm.
3.3 Decoder RS(528,514) in GF(2^10)
After implementing the classic Reed-Solomon code RS(255,239), another basic decoder that had to be analysed was the RS(528,514), in order to increase the throughput. The formula of the throughput, as presented before, is the following:
Throughput = m / T_critical_path
Since the critical path cannot be changed without changing the architecture of the decoder, the code had to be changed. Another relevant code in the literature is the RS(1023,1009), which works in a Galois field of ten bits. In this case the critical path should be the same while the number of bits increases by 25%, producing a corresponding increase in throughput.
Looking more closely at the code of the previous decoder, it can be noticed that the previous architecture was not generic enough: the VHDL code was valid only for even values of “t”. In this case, a new internal structure of the Chien component had to be arranged. When “t” is even, there have to be “t/2” cells for sigma (even and odd parts) and for the odd part of “B”, and “t/2-1” cells for the even part of “B”. For synchronising the signals, the presence of a register at the end of the “B” even cells is mandatory. When “t” is odd instead, the number of cells for sigma_even and for the odd and even parts of “B” is “floor(t/2)”, while for the odd part of “sigma” it is “ceil(t/2)”. In this case too, registers were added to guarantee the correct synchronous arrival of the signals at the following stage.
In VHDL this was implemented with a simple extension inside the Chien component. At the beginning of the architecture description, the condition in which the decoder is operating (is “t” odd or even?) is verified with two if generate statements that choose the proper internal structure. For realising the floor and ceil functions, the expression “t mod 2” is used. It gives the information whether “t” is odd or even and can also be used for rounding numbers. In fact, the default behaviour of VHDL integer division is a floor function. Using this information it is possible to obtain the operation needed. For example:
t/2 + t mod 2 = 8/2 + 8 mod 2 = 4 + 0 = 4   for t = 8
t/2 + t mod 2 = 7/2 + 7 mod 2 = 3 + 1 = 4   for t = 7
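The same trick can be written in Python terms, where integer division also truncates towards zero for positive operands (the function names are illustrative):

```python
def cells_odd_part(t: int) -> int:
    # ceil(t/2) realised as floor(t/2) + (t mod 2),
    # the same expression used in the VHDL generics
    return t // 2 + t % 2

def cells_even_part(t: int) -> int:
    # floor(t/2), the synthesiser's default division behaviour
    return t // 2
```

For even “t” the correction term vanishes, so a single expression covers both cases without conditional generate statements.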
The same expression is used throughout the new, modified code to guarantee the correct lengths of vectors and the correct latency. In particular, having (in the case of t=7) four cells for the odd part of sigma, the minimum latency for the block is still four, and the latency that the decoder module foresees for the Chien block has to be adjusted. If before it was “t/2”, now it will be “t/2 + t mod 2”, in order to compute the latency correctly for the two cases and to maintain a compact form without adding other conditional generate statements.
After the verification of this code, it was possible to pass to a new step: shortening the code. In fact, not all the 1023 symbols of the message are actually used. To do this, some new parameters have to be introduced:
s = number of symbols cut from the message
code rate = fraction of useful symbols of the message = (k − s) / (n − s)
Comparing the codes RS(1023,1009) and RS(528,514), it can be seen that while the total throughput remains basically untouched, the code rate changes significantly. It is higher in the first one, implying that a smaller percentage of errors can be corrected: only 7 errors over 1023 symbols.
The shortened code is composed as follows: a vector of 495 zeros has to be concatenated to the 514 symbols of the message to send, and fourteen parity symbols have to be appended after the message. At the end, the zeros are cut before sending and only the 528 non-null symbols are transmitted.
The first question that can be asked is whether the previously implemented decoder still works correctly or is affected by this change; especially, whether the computed syndrome changes when committing these changes. For the sake of simplicity, it is demonstrated here that nothing relevant changes with an RS(7,3) and an RS(5,1) code, but the result is obviously general: it depends neither on the length of the message nor on the length of the cut.
Taking into account the entire message:
r(X) = [r6, r5, r4, r3, r2, r1, r0]
The message filled with zeros:
r(X) = [0, 0, r4, r3, r2, r1, r0]
And the shortened message:
r(X) = [r4, r3, r2, r1, r0]
We can finally compute the syndrome with Horner’s rule:
S(X) = (((((r6·X + r5)·X + r4)·X + r3)·X + r2)·X + r1)·X + r0
That can be simplified (both for the zero-padded and the cut message) as:
S(X) = (((r4·X + r3)·X + r2)·X + r1)·X + r0
Since the syndrome does not change, the Berlekamp-Massey algorithm and Chien search component are not affected. The only changes that have to be taken into account are the latency and the number of symbols arriving: the two processes for the input and output addresses of the Delay_RAM and the latencies of the SyndromeComputer and RS_eCSEE blocks.
3.4 Compared analysis of timings and usage of resources
Here the two versions of the decoder are compared and some calculations of area and timing are made.
3.4.1 Usage of resources in a generic RS(n,k) decoder
In this section the components used in each version are analysed in detail, keeping the results separated per component. For every component, the resources used by every cell are analysed first; secondly, the overall system is taken into account (for example, the control signals’ circuits). At the end of this section, a table summarising the computed results is presented.
3.4.1.1 Syndrome Computer
The SyndromeComputer is basically constituted only by cells and a final output stage for sampling the outputs.
The SyndromeCell is built with one adder, one multiplier and two registers. For every device, “n-k” cells are used for the calculation of the syndrome. Therefore, the total amount of resources used is “n-k” adders, “n-k” multipliers and “2(n-k)” registers.
The control signals’ circuits are ignored in the recap table: these components are negligible with respect to the amount of hardware required by the cells.
3.4.1.2 RS_ePIBMA
Every cell of this component uses three multiplexers, one adder, two multipliers and two registers. The number of cells for the component is “n-k”. The resources needed are summarised in Table 5.
In this case, the control circuits’ resource usage is not straightforwardly negligible, so it is taken into consideration. Referring to the figures of the dedicated section on this component, Table 4 was drawn up.
          Multiplexers  Multipliers  Registers  Counters  Logic ports    Shift registers
MC1       2             0            0          0         1xOR           0
MC2       2             0            1          0         0              1
MC3       2             0            0          0         1xAND          0
Gamma     3             0            1          0         0              0
Omega_0   1             0            0          0         0              0
L_B       4             0            1          0         0              0
L_sigma   3             0            1          0         0              0
zeta      3             1            2          0         1xOR           1
ready     0             0            2          1         1xCOMPARATOR   0
Table 4) Recap table for the control unit of the RS_ePIBMA component.
In this table, details about the bit widths of the components are not given; full details are accessible through the code. The table is shown here just to give a rough idea of how much these control logic elements affect resource usage in relation to the other components. Taking the example of an RS(255,239) decoder, with “2t=16”, the number of multiplexers used in the cells is 48, while in the control logic it is 20. The greater the number of parity symbols, the more negligible the control logic becomes with respect to the cells.
3.4.1.3 RS_eCSEE
The cell of this device is composed of two multiplexers, one adder, one multiplier and three registers (two of 1 bit and one of “m” bits). The two 1-bit registers will not be counted for the sake of simplicity. The cells are seven for the auxiliary polynomial and eight for the sigma polynomial, so these numbers are not to be multiplied by “n-k” but by “n-k-1”. The registers are “n-k-1” inside the cells, plus one outside, introduced at the end of the auxiliary polynomial evaluator to make the segmentation work properly.
This component moreover uses two adders, one multiplier, one ROM, one comparator and one 1-bit register for the calculation of the error condition and of the inverted_partial_result, plus another multiplier and another register for the output stage. The zeta computer needs two multiplexers, one multiplier, one register and a shift register. The control logic needs one AND port, two comparators, one counter and two registers for the calculation of the internal and external ready.
3.4.1.4 Other components
There are also outer components inside the decoder but outside the previously analysed blocks. First of all, a double-port RAM is needed. The circuits for the addresses need two 2-bit counters, one “m”-bit counter, one adder, two comparators, two AND ports and two registers. An adder and a mux for the correction of the messages must also be taken into account and, for checking the valid output, a counter and a comparator. The shift registers for the control signals are two shift registers each for enable and reset for the Berlekamp-Massey and Chien search blocks, so in total four.
                  Multiplexers  Adders  Multipliers  Registers
SyndromeComputer  0             n-k     n-k          2(n-k)
RS_ePIBMA         3(n-k)        n-k     2(n-k)       2(n-k)
RS_eCSEE          2(n-k-1)      n-k-1   n-k-1        n-k

Table 5) Recap table of resources used per component, taking into account the cell usage.
3.4.2 Latencies and critical paths
The critical paths are fundamental for estimating the maximum frequency of the component. To first order, this is given by the inverse of the maximum period of time necessary for covering the critical paths. In Table 6 the critical paths of the realised components are summarised. In general, the critical path of the cells is the most remarkable one.
The SyndromeCell and the RS_ePIBMA_Cell have the critical path discussed before, passing through an adder and a multiplier. For the Chien search, instead, the critical path is the one that goes from the partial_result register to add_result, passing through one mux, one multiplier, one adder and then again one mux.
The latency of the module for the calculation of the syndrome is due to the iterations that have to be done for Horner’s rule; the data all go out simultaneously. For the ePIBMA, the latency is defined implicitly by the algorithm. The RS_eCSEE instead takes “t/2” cycles due to the cells of the sigma and auxiliary polynomials, plus one delay for the inversion and one for the output register.
                  Tcritical_path             Latency
SyndromeComputer  T_add + T_mult             n
RS_ePIBMA         T_add + T_mult             n-k
RS_eCSEE          T_add + T_mult + 2·T_mux   (n-k)/4 + 2 + n

Table 6) Table of critical paths and latencies of the various parts.
The overall latency, however, is not simply the sum of the latencies, because another fact has to be taken into account. The data arrive ordered from the first symbol of the message to the last symbol of the parity, while the Chien search delivers its output in reverse order. Thus, the actual latency depends on the position of the requested symbol. To have the first symbol corrected, one must wait:
Latency_first symbol = n + (n − k) + ((n − k)/4 + 2 + n) + 1
In fact, the first symbol waits the overall latency plus "n" more cycles to get its error-magnitude value. For the last symbol of the parity:
Latency_last symbol = n + (n − k) + ((n − k)/4 + 2) + 1
The general formula for the latency is therefore:

Latency(p) = n + (n − k) + ((n − k)/4 + 2 + p) + 1
where "p" is the position of the symbol for which we want to compute the latency.
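As a quick sanity check, the latency formula can be evaluated for the RS(255,239) decoder. This is a minimal sketch (the function name is illustrative); "p" runs from 0 for the last parity symbol up to n for the first message symbol, since the Chien search outputs the symbols in reverse order.

```python
def decoder_latency(p, n=255, k=239):
    """Clock cycles before the symbol at reverse position p is corrected.

    p = n for the first symbol of the message, p = 0 for the last
    parity symbol (the Chien search outputs symbols in reverse order).
    """
    return n + (n - k) + ((n - k) // 4 + 2 + p) + 1

# First message symbol: waits the full pipeline plus n extra cycles.
print(decoder_latency(255))  # 533
# Last parity symbol: available as soon as the pipeline drains.
print(decoder_latency(0))    # 278
```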
3.4.3 Maximum operating frequency
The maximum operating frequency was obtained as previously shown in the section on the RS(255,239) encoder. Since the schematic of the digital circuit is different, a new analysis of the critical path has to be done. The critical paths of the internal components of the decoder were computed in the previous sections and the results are summarised in Table 6. As can be noticed, the worst critical path is that of the Chien search cell. In the whole decoder implemented on an FPGA, however, the critical path will not necessarily be this one: the routing in a given device can introduce new timing constraints that are not predictable with this simple kind of analysis.
After a trial-and-error search, the final result obtained for the decoder was:
𝑇𝑐𝑟𝑖𝑡𝑖𝑐𝑎𝑙 𝑝𝑎𝑡ℎ = 3.02 𝑛𝑠
𝑓𝑚𝑎𝑥 = 331 𝑀𝐻𝑧
As for the quantity of data produced at the output, it is related to the Galois field in which the decoder operates. In particular:

Throughput = m · f_max = m / T_critical path
Therefore, for the two decoders considered in this chapter:

Throughput_RS(255,239) = 2.65 Gbps
Throughput_RS(1023,1009) = 3.31 Gbps
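Both figures follow directly from the 3.02 ns critical path; a minimal check (the symbol widths m = 8 and m = 10 are those of GF(2^8) and GF(2^10)):

```python
def throughput_gbps(m, t_crit_ns=3.02):
    """Throughput in Gbps: m bits per clock, one clock per critical path."""
    return m / t_crit_ns

print(round(throughput_gbps(8), 2))   # 2.65  RS(255,239)
print(round(throughput_gbps(10), 2))  # 3.31  RS(1023,1009)
```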
4 Design of 100 Gbps decoders
The decoders obtained so far reach only a few Gbps, which is not enough with respect to the goal of the thesis. By parallelising the components, it is possible to obtain higher throughputs without wasting too many resources. There is no single way of parallelising a decoder: it is a matter of design choices. This chapter explains the choices made in the project and illustrates the performance attained.
The analysis is made on different levels. The background works are evaluated through an estimation of the XOR gates they use. The estimation is made by a Python script that computes the equivalent gates, basing the calculation on the instantiation of the main blocks. The same script is used to estimate the equivalent number of gates used by the new designs proposed here. Finally, some of these designs are implemented on FPGA and as ASIC, and the solutions are compared.
4.1 Parallelisation of the RS(255,239) decoder
The first decoder subjected to parallelisation is the classical RS(255,239). Starting from the background works, the solutions present in the literature are explained and analysed. Finally, some new designs are introduced and compared.
4.1.1 Background
Plenty of articles about the parallelisation of the RS(255,239) can be found in the literature. This decoder has been studied for a long time, so many different solutions have been explored. Among these, the newest and best-performing ones were chosen as a reference against which to judge the new solutions proposed later on.
The parallelisation process consists in increasing the number of symbols processed per clock pulse.
It can be observed that the Berlekamp-Massey component is the one that limits the critical path, although its latency is low. The other two components, SyndromeComputer and eCSEE, instead, cause problems with the latency but not with the maximum operating frequency. That said, parallelisation may introduce new critical-path limitations (new cell architectures have to be considered), but it decreases the latency significantly, depending on the level of parallelisation. Parallelising can also increase hardware utilisation and area occupation. The Berlekamp-Massey component has low latency (only "n-k" cycles) and can therefore be shared among several SyndromeComputers, saving some area; some logic has to be implemented for the correct routing of the data.
The optimal solution is the one that achieves a good trade-off between resources used and latency, without worsening the critical path of the system too much; ideally, without worsening the longest path at all. To decide suitably, an analysis of the versions of the syndrome and Chien search components is proposed here, with everything summarised in tables to make the comparison clear. The analysis consists in calculating the overall equivalent XOR gates used by each solution. The details of this conversion are given later.
The analysis has to take into account not only the hardware usage but also the throughput and the latency, which are crucial parameters of interest: the main goal is a high-speed, low-latency decoder. To get a better view of the available solutions and to compare them easily, two new parameters are defined.
#(basic elements) / (1 Gbps) = basic elements required per Gbps of throughput

#(basic elements) / (saved latency's clock pulses) = basic elements required per clock pulse of latency saved

where the saved latency is defined as:

saved latency = (standard case latency) − (considered case latency)
These two parameters also provide a better way to compare resource usage when the throughputs and latencies differ, which is actually very common in real cases: they combine the performance obtained (throughput, or latency saved) with the resources used. The fewer the resources used to reach the same performance, the better.
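The two comparison metrics reduce to simple ratios; a minimal sketch (the function names are illustrative, and the example values are the 2-parallel syndrome figures from Tables 9 and 10):

```python
def xors_per_gbps(xor_gates, throughput_gbps):
    """Resource cost per unit of throughput: lower is better."""
    return xor_gates / throughput_gbps

def xors_per_saved_cycle(xor_gates, standard_latency, case_latency):
    """Resource cost per clock pulse of latency saved: lower is better."""
    return xor_gates / (standard_latency - case_latency)

# The 2-parallel syndrome block (1536 XORs, 5.16 Gbps, latency 128)
# against the non-parallel reference latency of 255 clock pulses:
print(round(xors_per_gbps(1536, 5.16), 1))             # 297.7
print(round(xors_per_saved_cycle(1536, 255, 128), 1))  # 12.1
```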
The versions taken into consideration for the analysis are the basic non-parallelised one and the two-, three-, four- and five-parallel ones. Of these, the four-parallel solution is a new design introduced in this thesis.
An analysis of the throughput and latency of the five versions, using only one channel per solution, is presented in the figures below. The same operating frequency is assumed for every cell, that is, for every solution an architecture that does not increase the critical path is adopted. In the FPGA implementation this assumption turns out to be quite inaccurate, since in large systems the critical path is often set by the routing of the signals rather than by the theoretical critical path of the cells. Still, it gives an indication of the performance achieved; only after the implementation of the integrated circuit can the results be compared properly.
Figure 57) Graph of the latency and throughput corresponding to each solution explored.
The latency and throughput are defined by the formulas below:

Latency = #(total symbols) / p

Throughput = (operating frequency) · m · p

p = #(symbols processed per iteration)
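These formulas can be tabulated for RS(255,239); a sketch, where the 322.5 MHz operating frequency is an assumption back-derived from the 2.58 Gbps non-parallel entry of Table 9 (the 4-parallel case costs one extra cycle in that table, so it is omitted here):

```python
import math

def symbols_latency(n, p):
    """Clock pulses to process an n-symbol codeword, p symbols at a time."""
    return math.ceil(n / p)

def throughput_gbps(f_mhz, m, p):
    """Throughput in Gbps: p symbols of m bits per clock at f_mhz MHz."""
    return f_mhz * m * p / 1000.0

# RS(255,239) syndrome block at the frequency implied by Table 9:
for p in (1, 2, 3, 5):
    print(p, symbols_latency(255, p), round(throughput_gbps(322.5, 8, p), 2))
```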
Nomenclature
Some nomenclature used during the development of the project to shorten names is introduced here. The number of symbols processed per clock pulse is denoted by the letter "p": for example, a two-parallel RS(255,239) decoder will be called a "2p" RS(255,239) decoder from now on.
By analogy, the letter "s" denotes the number of segmentations used in the syndrome cell, and the letter "c" identifies the number of channels used in the decoder.
In summary, a two-parallel RS(255,239) decoder with a once-segmented syndrome cell and eight channels is shortened to "2p-1s-8c" RS(255,239) decoder. As can be noticed, it is a very compact way of communicating the main parameters of interest of the decoder.
Equivalent XOR gates computation
Before moving to the computation, a few words are needed on how the equivalent number of XOR gates is obtained. So far, only basic objects like adders and multipliers have been used: how can they be converted into a number of XOR gates?
The equivalent values are listed in the conversion table below, obtained by implementing each of these basic components and matching the result against XOR gates.

                       XOR-gate equivalent
Simplified multiplier  2·log2(2^m) XORs
Multiplier             8·m XORs + 6·m ANDs
Adder                  m XORs
Mux 2-1 (1 bit)        1 XOR
Register (1 bit)       3 XORs
RAM (1 bit)            1 XOR
AND                    0.75 XORs
Table 7) XOR-gate equivalents of every basic component used.
A simplified multiplier is a multiplier in which one of the two factors is a constant. This allows the hard-wired simplifications typical of constant multiplication in digital electronics. Multiplexers with more than two inputs have to be decomposed into basic multiplexers; for example, a 4-1 mux can be translated into three 2-1 multiplexers.
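The conversion table can be captured in a small estimator in the spirit of the Python script mentioned earlier (a sketch; the function name and element labels are illustrative, not the script's actual API, and multi-bit elements are costed for m bits):

```python
def xor_cost(element, m=8):
    """Equivalent XOR gates of one basic element over GF(2^m) (Table 7)."""
    and_as_xor = 0.75  # one AND gate counts as 0.75 XOR gates
    costs = {
        "simplified_multiplier": 2 * m,            # 2*log2(2^m) XORs
        "multiplier": 8 * m + 6 * m * and_as_xor,  # 8m XORs + 6m ANDs
        "adder": m,         # m XORs
        "mux21": m,         # 1 XOR per bit
        "register": 3 * m,  # 3 XORs per bit
        "ram_bit": 1,       # 1 XOR per stored bit
    }
    return costs[element]

# One non-parallel syndrome cell: adder + simplified multiplier + register;
# 16 such cells for RS(255,239) give the 768 XORs of Table 10.
cell = xor_cost("adder") + xor_cost("simplified_multiplier") + xor_cost("register")
print(cell * 16)  # 768
```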
4.1.1.1 Syndrome Computer
The first component to be analysed is the syndrome computer. The variants taken into consideration are the standard non-parallelised one and the ones with two, three, four and five parallelised symbols.
The schematics of the new versions of the cells are presented here below.
Figure 58) Two-parallel syndrome cell internal architecture [Ji15].
Figure 59) Three-parallel syndrome cell internal architecture [Park12].
Figure 60) Four-parallel syndrome cell internal architecture.
Figure 61) Five-parallel syndrome cell internal architecture [Salvador14].
In the table below, the resources used by the five variants are summarised.

              Adders  Simplified multipliers  Mux (2-1)  Registers
Non-parallel  1       1                       0          1
2-parallel    2       2                       0          1
3-parallel    4       4                       1          5
4-parallel    4       4                       1          4
5-parallel    6       5                       5          5
Table 8) Comparison of the resource usage of the syndrome cell variants.
In all cases the simplified multipliers are used, since one of the factors is indeed a constant. At a glance, it can be observed that the three- and five-parallel solutions use too many resources and are probably not the best choices, but let us look at the numbers.
The performance figures are summarised below.
              Tcritical_path            Latency  Throughput [Gbps]
Non-parallel  T_add + T_mult            255      2.58
2-parallel    2·T_add + T_mult          128      5.16
3-parallel    2·T_add + T_mult + T_mux  85       7.74
4-parallel    T_add + T_mult + T_mux    65       10.32
5-parallel    T_add + T_mult + T_mux    51       12.90
Table 9) Latency and critical-path characteristics of the solutions analysed.
In this case the critical path does not affect the operating frequency, because the longest critical path in the table is still shorter than the critical path of the Berlekamp-Massey component. In the two-parallel solution, in fact, some segmentation can be performed to reach the minimum critical path. Since the operating frequency of the device is independent of the critical paths of these cells, this parameter is not useful for the analysis.
Table 10 summarises the results of the Python script for the calculation of the equivalent number of XOR gates for each version.

              #(XORs)
Non-parallel  768
2-parallel    1536
3-parallel    3472
4-parallel    3088
5-parallel    4048
Table 10) Equivalent XOR gates used by the overall syndrome component.
The last step is therefore to relate each resource usage to the throughput and to the latency saved. The figures below show the results graphically.
Figure 62) Resource usage per Gbps of throughput obtained. The x axis shows the number of symbols computed per clock pulse; the y axis shows the number of XOR gates used per Gbps of throughput.
Figure 63) Resource utilisation per clock pulse of latency saved. The x axis shows the number of symbols computed per clock pulse; the y axis shows the average number of basic elements used per clock pulse saved.
4.1.1.2 eCSEE
The eCSEE component is now under analysis. Similarly to what was done for the syndrome, we begin with the calculation of the resources used.
Figure 64) Internal structure of a two-parallel eCSEE cell.
Figure 65) Internal structure of a three-parallel eCSEE cell.
There are actually only two cell types for the Chien search: the same type is used in all the variants that process more than one symbol. Figure 64 and Figure 65 show examples of the parallelised type of Chien cell.
The analysis of the resource usage is summarised in Table 11.
              Adders  Multipliers  Mux (2-1)  Registers
Non-parallel  1       1            2          1
p-parallel    0       p            1          1
Table 11) Comparison of the resource usage of the eCSEE cell variants.
Moving now to the latency and critical paths of the Chien cells:

              Tcritical_path            Latency
Non-parallel  T_add + T_mult + 2·T_mux  (n-k)/4 + 2
p-parallel    T_mux + T_mult            ceil(n/p)
Table 12) Latency and critical-path characteristics of the solutions analysed.
To get a better idea of the resource usage, a deeper look is needed.
In general, the architectures use the same number of cells ("t" and "t-1") for the sigma and "B" polynomials, one cell for the computation of the zeta value and "p" final stages. The only difference is that the parallelised version requires one extra stage for correct operation, i.e. the addition stage (Figure 66).
Figure 66) Internal structure for the 3-parallel evaluation of the sigma polynomial [Park12].
To arrive at the overall resource usage, it only remains to list the usage of these single elements and then sum them for each solution. The so-called iROM is the ROM used in a module for the inversion of the elements of a Galois field, while SR stands for shift register.
             Adders  Multipliers  Registers  Mux (2-1)  iROMs  SR (t/2-1)
Main cell    0       p            1          1          0      0
Zeta cell    0       1            1          2          0      1
Final stage  0       2*           1          0          1      0
Add stage    8       0            1          1          0      0
Table 13) Resource utilisation of the components in the eCSEE variants.
In the final stage the multipliers used are not simplified ones, and this has to be taken into account in the resource utilisation.
              Main cells  Zeta cells  Final stages  Add stages
Non-parallel  2t-1        1           1             0
p-parallel    2t-1        1           p             p
Table 14) Instantiation of cells in the eCSEE variants.
Carrying out the calculation, the results for each solution are displayed in the table below.

              #(XORs)
Non-parallel  7862
2-parallel    13726
3-parallel    20310
4-parallel    26894
5-parallel    33478
Table 15) Resources used by the eCSEE module of all the solutions taken into consideration.
To get a graphical idea of the resource usage, figures similar to those used for the syndrome are displayed below.
Figure 67) Resource usage per Gbps of throughput obtained. The x axis shows the number of symbols computed per clock pulse; the y axis shows the average number of basic elements used per Gbps of throughput.
Figure 68) Resource utilisation per clock pulse of latency saved. The x axis shows the number of symbols computed per clock pulse; the y axis shows the average number of basic elements used per clock pulse saved.
As for the latency and throughput of the component, Figure 57 is valid also for the eCSEE component.
4.1.1.3 Conclusions
With the data now available, a final computation can be made. The best options are clearly the two-parallel and four-parallel solutions; in this section they are the only ones under evaluation, in order to simplify the discussion.
First, the architecture of the system being evaluated is explained. To compare the two solutions fairly, the throughput has to be equalised for both. With the throughput fixed, the latency and the resource usage are computed and a trade-off is then made.
The system for the two-parallel solution is shown in Figure 69; the system for the four-parallel solution in Figure 70.
Figure 69) System evaluated for the two-parallel solution.
Figure 70) System evaluated for the four-parallel solution.
The use of components in a generic architecture can therefore be summarised as:

p · SyndromeComputer + ePIBMA + p · eCSEE

For simplicity, since every solution uses exactly one Berlekamp-Massey component, which therefore does not change the overall result, its contribution to the resource consumption is neglected here. The logic circuits are also neglected, since they give only a marginal contribution.
Before moving to the numbers, another observation has to be made: the resource usage must also include the FIFO RAM used for storing the incoming messages. The formulas below give the number of messages to be stored for a generic "p"-parallel solution:

latency = 2 · ceil(n/p) + (n − k) + s

#(symbols processed per channel) = ceil(n/p)

#(messages per channel) = ceil(latency / #(symbols processed per channel)) + 1
For the 2p case, for example, the latency is nearly 272 clock pulses and the number of symbols processed per channel is 128; therefore, three messages have to be stored. Here it was taken into account that the sequence of symbols arriving at the decoder leaves the Chien component in reverse order, so one more message has to be added.
If the same calculation is made for the 4p option, the latency is approximately 144 clock pulses, the number of symbols processed per channel is 64, and the number of messages still remains three. In general, whatever the parallelisation, three messages have to be stored per channel. The overall number of messages to store is thus given by:

#(messages) = c · 4

#(symbols) = c · 4 · n
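The message count can be checked with a small helper (a sketch; the +1 term accounts for the reverse-order output of the Chien search, as described above, and the segmentation values s=1 and s=2 are those of the 2p and 4p options):

```python
import math

def messages_per_channel(n, k, p, s):
    """FIFO messages per channel for a p-parallel, s-segmented decoder."""
    latency = 2 * math.ceil(n / p) + (n - k) + s
    symbols_per_channel = math.ceil(n / p)
    return math.ceil(latency / symbols_per_channel) + 1

# Both the 2p and 4p RS(255,239) options need 4 messages per channel:
# 3 from the latency plus 1 for the reverse-order output.
print(messages_per_channel(255, 239, 2, 1))  # 4
print(messages_per_channel(255, 239, 4, 2))  # 4
```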
The table below summarises the computations for the two analysed solutions.

       Syndrome Computer  eCSEE   RAM    Total
2p-8c  12288              109808  65280  187376
4p-4c  12352              107576  32640  152568
Table 16) Number of XOR gates used by the two solutions taken into consideration.
The results show that the four-parallel solution is comparable to the two-parallel one in the basic components, but achieves a significant saving of XOR gates thanks to the smaller number of channels and, consequently, the smaller number of messages to be stored.
4.1.2 Innovative designs
The previous discussion found that the most convenient topology is the four-parallel cell. The question that arises is: given the cell topology, is the four-parallel version the most convenient one? Since the number of channels clearly impacts the XOR gates used for storing the messages, and since that count is significant with respect to the overall number of gates used, is there any better solution? The answer is not straightforward, so some analysis has to be made.
The approach used in this thesis differs from the ones seen in the literature. The main idea is to use a "heavy" Berlekamp-Massey core that uses more resources than the others but does not worsen the critical path, and to reuse it for several messages. To exploit the ePIBMA component to the full, the optimal numbers of syndrome and Chien blocks are derived: the opposite of what the other works do. The result of this approach can be seen in Figure 75. At the end, this approach is compared against what can be found in the literature, using a theoretical estimation of equivalent gates, the FPGA implementation and the ASIC one.
4.1.2.1 Proposed decoder architecture and theoretical analysis
The solutions under analysis are those that can deliver a throughput of 41.28 Gbps, i.e. sixteen times that of the basic cell. Parallelising the already-realised basic cell sixteen times is considered one of the possible cases.
In general, the Berlekamp-Massey component remains the same, while several different configurations are implemented in order to find the best one. One small change is made: the registers that keep the syndrome values are inserted after the big multiplexer. This can save a lot of resources; for example, in the 1p solution the syndrome is saved using one sixteenth of the registers used in the other configuration.
After presenting the internal architectures and analysing latency and critical path, the section ends with a comparison of the resources used and the results obtained; on this basis, the most convenient solution is predicted. Later, the analyses are verified with the Xilinx Vivado® post-implementation simulation and an ASIC implementation.
1p RS Decoder
The first solution explored is the one that parallelises the syndrome and Chien components used in the basic RS(255,239) decoder built before. The structure of the system is shown in Figure 71.
Figure 71) Schematic of the system using 1p versions of the cells for SyndromeComputer and eCSEE
components.
For this solution, the structures of the components are exactly the same as presented in the previous chapter; the only difference is that some components are instantiated more than once.
With respect to the original syndrome cell there is actually one small change: to save some resources, the registers storing the syndrome are inserted after the 16-1 multiplexer, and this also influences the structure of the cell. In short, the lower register of the cell architecture of Figure 24 is removed, as can be seen in Figure 72.
Figure 72) Internal structure of the 1p SyndromeComputer cell.
As for latencies, the syndrome and Chien parts each require "n" clock pulses; summed with the "n-k" of the Berlekamp-Massey and one pulse for the registers saving the syndrome values, this leads to "3n-k+1" clock pulses. A few more clock pulses due to segmentation are added, but they are negligible with respect to the total.
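For RS(255,239) this amounts to the following quick check (the function name is illustrative):

```python
def decoder_latency_1p(n=255, k=239):
    """Syndrome (n) + Berlekamp-Massey (n-k) + Chien (n) + 1 register stage."""
    return n + (n - k) + n + 1  # = 3n - k + 1

print(decoder_latency_1p())  # 527 clock pulses
```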
2p RS Decoder
In the parallelised solutions, the Chien component is always based on the same cell topology, like the one in Figure 66. The internal structure of the cell can always be derived from the figures already shown, so it will not be repeated here every time.
Figure 73) Schematic of the system using 2p versions of the cells for SyndromeComputer and eCSEE
components.
The syndrome calculation is more interesting, since the segmentation of the cells can lead to different solutions, each with its advantages and drawbacks.
Figure 58 shows a basic version of the two-parallel cell, as already presented. The critical path of the RS_ePIBMA component is:

T_critical path = T_multiplier + T_adder

The goal is not to exceed the limit imposed by the Berlekamp-Massey component, so that the operating frequency of the system does not worsen. The cell found in [Ji15] has a critical path of two adders and a simplified multiplier, so it is not the solution we are looking for. A variant that shortens the critical path a bit is the 2p-1s shown below.
Figure 74) Internal structure of the 2p-1s SyndromeComputer cell.
The critical path is now set by the accumulator on the right:

T_critical path = T_multiplier + T_adder + T_multiplexer

This path cannot be shortened further, since the accumulator operates in a loop. In all the configurations analysed later, this will be the target to reach. The price to pay in this variant is one more register and one more clock pulse of latency. The additional clock pulse actually introduces a new problem: the Berlekamp-Massey needs 16 clock pulses per message, while the normal two-parallel non-segmented syndrome block needs 128 clock pulses, so with eight syndrome blocks the timings fit perfectly. With the additional register, the segmented variant needs 129 clock periods, and so the system has to stall for one clock. In theory this decreases the throughput: doing the math, it drops from 41.28 Gbps to 40.96 Gbps (a loss of nearly 0.32 Gbps in total) if the frequency is assumed constant. This assumption, however, does not seem correct, since the critical path changes significantly; it has to be verified with a simulation tool.
Figure 75) Graph showing the perfect alignment of the signals when the cells are not segmented.
In Figure 75 the periods in which the SyndromeComputer components are active are shown in green, the Berlekamp-Massey ones in red and the eCSEE ones in blue. It can be noticed that the signals fit perfectly. In case of segmentation, the Berlekamp-Massey component has to stay idle for some time to allow the correct operation of the decoder.
4p RS Decoder
Similarly to the previous variant, the generic system for a four-parallel decoder is presented below.
Figure 76) Schematic of the system using 4p versions of the cells for SyndromeComputer and eCSEE
components.
Figure 77) Internal structure of the 4p-1s SyndromeComputer cell.
Figure 78) Internal structure of the 4p-2s SyndromeComputer cell.
Skipping the basic non-segmented case, which is obviously not practical because of its long critical path, the one-segmented and two-segmented solutions are presented.
Going from the non-segmented, through the one-segmented, to the two-segmented variant, a significant increase in registers has to be paid in order to shorten the critical path. Another price to pay is the latency, which increases as well. The same effect considered for the 2p-1s variant occurs in all the segmented solutions: in particular, assuming the same operating frequency (to be verified by simulation), the decoder has to stall for two cycles for correct operation. The throughput therefore drops from the original 41.28 Gbps to 40.03 Gbps. As stated before, in reality this factor should not decrease, since a significant increase of the frequency should be achieved; should that not happen, the non-segmented solution would be the best. A precise analysis, again, will be made with the simulation tools in the next section.
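The stall-adjusted throughputs quoted for the segmented variants follow from scaling the ideal 41.28 Gbps by the fraction of useful cycles; a sketch under the constant-frequency assumption (cycles per message: 128 for 2p, 64 for 4p, 32 for 8p):

```python
def stalled_throughput(ideal_gbps, cycles_per_message, stall_cycles):
    """Throughput when the decoder stalls for stall_cycles per message."""
    return ideal_gbps * cycles_per_message / (cycles_per_message + stall_cycles)

print(round(stalled_throughput(41.28, 128, 1), 2))  # 2p-1s: 40.96 Gbps
print(round(stalled_throughput(41.28, 64, 2), 2))   # 4p-2s: 40.03 Gbps
print(round(stalled_throughput(41.28, 32, 2), 2))   # 8p-2s: 38.85 Gbps
```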
8p RS Decoder
As with the four-parallel solution, the eight-parallel one can be segmented in three different ways.
Figure 79) Schematic of the system using 8p versions of the cells for the SyndromeComputer and eCSEE components.
Figure 80) Internal structure of the 8p SyndromeComputer cell.
Figure 81) Internal structure of the 8p-1s SyndromeComputer cell.
Figure 82) Internal structure of the 8p-2s SyndromeComputer cell.
In general, the same trade-off observations apply: better latency and lower resource usage versus the shortening of the critical path. In the worst case, i.e. the 2s variant of the eight-parallel cell, the throughput worsens to 38.85 Gbps. As said previously, the frequency actually achieved should be higher, so the overall throughput should exceed that of the non-segmented variant.
16p RS Decoder
For the sixteen-parallel solution, intermediate variants are not shown; only the final two-segmented solution (Figure 84) is presented. The discussion about the trade-off is the same as for the other parts, so it is skipped here.
Figure 83) Schematic of the system using 16p versions of the cells for SyndromeComputer and eCSEE
components.
Figure 84) Internal structure of the 16p-2s SyndromeComputer cell.
Comparisons
In general, the sums performed after the first segmentation layer can be arranged as an adder tree, so that the path passes through the lowest possible number of adders; the overall number of adders does not change.
The table below summarises the resource consumption and critical path of the cell topologies presented so far. For the multiplier, the asterisk denotes the simplified version, i.e. multiplication by a constant value.
        Adder  Multiplier*  Register  Mux (2-1)  Tcritical_path
1p      1      1            1         0          T_mult + T_add
2p      2      2            1         1          T_mult + 2·T_add
2p-1s   2      2            2         1          T_mult + T_add + T_mux
4p      4      4            1         1          2·T_mult + 3·T_add
4p-1s   4      4            3         1          T_mult + 2·T_add
4p-2s   4      4            4         1          T_mult + T_add + T_mux
8p      8      8            1         1          2·T_mult + 4·T_add
8p-1s   8      8            4         1          T_mult + 2·T_add + T_mux
8p-2s   8      8            6         1          T_mult + T_add + T_mux
16p     16     16           1         1          2·T_mult + 9·T_add
16p-2s  16     16           10        1          T_mult + T_add + T_mux
Table 17) Utilisation of elements in the various SyndromeComputer cell variants.
To complete the analysis, the total amount of resources used in the SyndromeComputer and eCSEE components has to be computed. Choosing for each parallelism the variant that gives the minimum critical path (so that the operating frequency can actually increase), the table below summarises the components used by the solutions analysed. For the Chien component, as said in the previous section, the resource occupation follows a precise formula, so the discussion can be skipped; the details can be found in Table 14.
           SyndromeComputer  ePIBMA  eCSEE  Correction_Block  Total
Original   1152              4480    4135   8168              17935
1p-16c     12288             4480    63536  130688            210992
2p-1s-8c   13312             4480    48768  65408             131968
4p-2s-4c   12800             4480    48288  32768             98336
8p-2s-2c   11008             4480    48048  16448             79984
16p-2s-1c  10112             4480    47928  8288              70808
Table 18) Number of XOR gates used by every block in every version.
In this computation, the logic blocks and the big multiplexer are neglected. Table 19 lists the resource usage of each of these multiplexers.
In general, the equivalent number of basic 2-1 multiplexers for a c-input multiplexer is computed with the recursive halving:

#(basic mux) = c/2 + c/4 + … + 2 + 1

where c is the number of syndrome blocks parallelised.
Here is an example to explain the formula. For the 1p solution, c=16, so a sixteen-input multiplexer has to be converted. In the first step, sixteen inputs are reduced to eight outputs using eight 2-1 multiplexers; in the second step, eight inputs are reduced to four outputs with four multiplexers, and so on. The overall number of multiplexers needed is summarised in the table below.
           #(basic multiplexers)
1p-16c     8+4+2+1 = 15
2p-1s-8c   4+2+1 = 7
4p-2s-4c   2+1 = 3
8p-2s-2c   1
16p-2s-1c  0
Table 19) Results of the decomposition of the P-1 multiplexer.
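The recursive-halving decomposition can be written out directly (a sketch; the function name is illustrative, and c is assumed to be a power of two):

```python
def basic_mux_count(c):
    """Number of 2-1 multiplexers needed to build a c-input multiplexer
    by repeated halving (c assumed a power of two)."""
    total = 0
    while c > 1:
        c //= 2     # each stage halves the number of signals...
        total += c  # ...using one 2-1 mux per surviving signal
    return total

for channels in (16, 8, 4, 2, 1):
    print(channels, basic_mux_count(channels))  # 15, 7, 3, 1, 0
```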
For a complete evaluation, some further details have to be taken into account. The equivalences of Table 7 can be used to translate these basic elements into XOR gates: the first solution needs fifteen 2-1 multiplexers of "m" bits, which adds about 120 XOR gates and is clearly negligible.
In Table 18, the RAM storing the messages is included in the block called Correction_Block, which stores the data at the input, takes the error vector coming from the Chien component and corrects the received symbol when needed. The expectation for the implementations is that the best solution should be, without any doubt, the 16p-2s-1c.
4.1.2.2 FPGA implementation
In the second part of this analysis, every solution was implemented in the IDE and an implementation run was performed in order to get precise results. The device used is a Kintex 7. From the previous analysis it seems clear that parallelisation brings huge improvements; in this sense, the 16p-2s-1c decoder should obviously be the best choice.
After the presentation of all the versions, the resource-usage results are given. A detail about how they were obtained is worth mentioning: to find the minimum clock period, the synthesiser uses more resources, whereas to measure the hardware usage the clock period is relaxed to 20 nanoseconds. The timing and resource-usage information is therefore not strictly coherent, but it gives a good estimate of the ranking among the versions obtained. That said, we can move to the analysis of the FPGA implementations.
1p-16c RS Decoder
In this version the changes are few, since the architecture basically remains the one adopted for the basic RS(255,239) decoder, and the parallelisation process was quite straightforward. The only block introduced is the Delay_Block (Figure 85 and Figure 86), which delays the input with FIFO logic.
Figure 85) Black box representation of the Delay_Block.
Figure 86) Internal structure of the Delay_Block.
The block provides the delaying function without needing any outer control logic, since this is included inside it: the input and output addresses are generated by two internal processes.
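The behaviour of the Delay_Block can be sketched as a circular buffer whose read address trails the write address by a fixed depth; the following Python model (a behavioural sketch, not part of the thesis VHDL) illustrates the idea:

```python
class DelayBlock:
    """Behavioural model of a fixed-delay FIFO: each clock, the symbol
    written `depth` cycles ago is read out before the new one is stored."""
    def __init__(self, depth):
        self.depth = depth
        self.ram = [0] * depth   # memory initialised to zero
        self.wr = 0              # single address: read-then-write at the same slot

    def clock(self, data_in):
        data_out = self.ram[self.wr]          # symbol from `depth` cycles ago
        self.ram[self.wr] = data_in           # overwrite with the new symbol
        self.wr = (self.wr + 1) % self.depth  # advance the circular address
        return data_out

fifo = DelayBlock(depth=4)
outputs = [fifo.clock(i) for i in range(1, 9)]
print(outputs)  # first `depth` outputs are the initial zeros: [0, 0, 0, 0, 1, 2, 3, 4]
```

In the hardware block the two addresses are generated by the internal processes; here a single read-then-write pointer gives the same fixed delay.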
Once the Delay_Block was implemented, the technical problem encountered was how to use a generate statement for the sixteen channels, since coding all of them manually was clearly not convenient. The solution was to incorporate all the remaining channel logic inside a new component called Correction_Block (Figure 87 and Figure 88). In summary, this component receives the incoming data, waits for the Chien search to finish its operation and then generates the output of the decoder. The block allows the VHDL coding of the channels with a generate statement, so the architectural problems raised by the addition of the channels are all solved.
Figure 87) Black box representation of the Correction_Block.
Figure 88) Internal structure of the Correction_Block.
At first, the Correction_Block was implemented precisely as described. After some tests it was changed, because it proved simpler to incorporate the Delay_Block components directly inside the correction block; only the organisation of the circuit changed, not the underlying logic.
The speed achieved with this design was around 275 MHz, leading to a throughput of 35.2 Gbps.
2p-1s-8c RS Decoder
For the two-parallel version of the decoder the architecture is slightly different. The differences in the syndrome-computing cell were already discussed above; some more explanation is given here for the architecture of the eCSEE component, now renamed parallel enhanced Chien Search and Error Evaluation (peCSEE).
Since the architecture is more complicated and the cell of the Chien component changes, the basic blocks implemented in VHDL had to change with respect to the original version. The first change needed was the redefinition of the syndrome cell; the cell implemented in the VHDL system is the one shown in Figure 89.
Figure 89) Internal structure of the 2p-1s-SyndromeCell.
After implementing the cell and slightly modifying the general architecture of the SyndromeComputer (because of the one-clock-pulse segmentation inside the cells), the main design problem lay in the final parts of the Chien search and error evaluation components. This part, in fact, changed shape completely after the introduction of the new cell (Figure 90).
The design therefore started with a cell whose purpose is sampling and storing some important variables coming from the Berlekamp-Massey block, for example sigma(0) or B(0); these values must not change while the eCSEE component is processing the data. The schematic of the Sampling_Cell is presented in Figure 91.
Figure 90) Internal structure of the 2p-eCSEE cell.
Figure 91) Internal structure of the Sampling_Cell.
One of the most important parts of the Chien search can now be explained: the evaluation blocks. The structure remained as in the original version, i.e. separate evaluations of the even and odd parts of the two polynomials; Figure 92 shows the structure. At the end of the section, more detail is given about the segmentation and the latencies of the entire block.
The design of the eCSEE component then moved to a device implementing the last part: detecting the error and, when needed, computing the error magnitude. The structure of the eCSEE is displayed in Figure 93, and the Final_Stage component is shown in Figure 94.
Now the segmentation can be discussed. At first the two outputs of the eCSEE_Cell were not registered, which used fewer resources. However, when the various blocks were assembled, the critical path of the component became quite long, since it passed through one simplified multiplier, three adders and one full multiplier. The path obviously had to be shortened, and the first measure taken was to introduce segmentation inside the evaluation cells. Note that this segmentation did not create any synchronisation problem apart from the adjustment of ePIBMA_ready, since the sampling cells hold all the other variables. The critical path was then still three adders and a full multiplier, so it had to be cut just before the multiplier; in the picture, the registers inserted at the outputs of the evaluation blocks can be seen. This segmentation added another register in the path of the ePIBMA_ready signal. A final comment concerns the outputs of the final stages: the ready signals are combined with an AND gate (they should be synchronised, so this should not be a problem), and the numbers of errors found obviously have to be summed, since they refer to the same message.
Figure 92) Structures of the evaluation blocks for the odd and even parts of the two polynomials.
Figure 93) Internal structure of the Chien search and error evaluation component.
Figure 94) Internal structure of the Final_Stage component.
The internal architecture of the Final_Stage is quite similar to the one used in the original decoder, but here it is grouped inside one component. The presence of the inversion ROM requires the insertion of registers for the correct synchronisation of the signals: one was put in the path of the comparison of the sigma evaluation to zero. No further segmentation was needed for the inversion ROM, since the Zeta_Computer already contains a delay inside, as can be seen in Figure 95.
Figure 95) Internal structure of the component for the calculation of zeta.
Synchronisation raised another issue, due to the enable signal introducing a one-clock-pulse delay in the output processes that generate the ready, num_err and error-location signals. Because of this, a new register had to be inserted at the output of the multiplexer, thus avoiding long critical paths through the blocks cascaded after it.
In the realisation of the system a problem arises: the two final stages should start from different values of zeta, so a component that computes them (Starting_Zeta_Computer) has to be inserted. In the figure below it is possible to see the internal architecture of the block.
Figure 96) Internal architecture of the 2p-Starting_Zeta_Computer.
This solution would add two multipliers and 2·m+1 bits of registers. It also leads to a different structure of the component that computes the zeta inside the final stage, as shown in the picture below.
Figure 97) Internal structure of the modified Zeta_Computer.
Though this solution seems reasonable at first, when it is extended to the more parallelised versions of the Chien component it is clearly not implementable. The figures below present the versions for the other decoders: a 16p version needs sixteen additional full multipliers, which is obviously not acceptable.
Figure 98) Internal architecture of the 4p-Starting_Zeta_Computer.
Figure 99) Internal architecture of the 8p-Starting_Zeta_Computer.
Figure 100) Internal architecture of the 16p-Starting_Zeta_Computer.
Note that these multipliers are used only once per message, which is a total waste of resources. An additional restriction limiting the possible solutions is the latency: the maximum tolerated latency for the block is three to four clock pulses, so the solution is not straightforward. A first way to reuse the multipliers is what we will refer to as the enhanced_Starting_Zeta_Computer. The figure below shows only the sixteen-parallel version, but analogous architectures can be derived for the other options.
Figure 101) Internal architecture of the 16p-enhanced_Starting_Zeta_Computer.
The architecture is considerably more complex, and so is the control logic. Although not shown in the picture, there is a significant saving of resources: most of the registers can be discarded and only eight multipliers are needed, with the latency unchanged.
This solution seemed the best obtainable, but the problem was finally solved in a radically different way. The idea that avoids all of the computation comes from understanding how the zeta value is computed inside the Berlekamp-Massey component. To compute the two required zetas, two Look-Up Tables (LUTs) are inserted, filled with alphas from α^0 to α^(15i), where i is the index of the corresponding table (in this case zero and one). A counter advances according to the ePIBM algorithm and is used at the end as a pointer into the LUTs. This solution minimises the resources used and even saves one multiplier from the ePIBMA; it is the solution adopted for all the versions from here onwards.
Figure 102 shows how the components are connected to each other inside this version of the decoder. The last column of components is the stage of the Correction_Block components.
The block is basically the same presented for the previous version, but some parallelisation is needed; the modified version is presented in Figure 103 and Figure 104.
Figure 102) Blocks assembly inside the decoder.
Figure 103) Black box representation of the Correction_Block.
Figure 104) Internal structure of the Correction_Block component.
The parallelisation is obtained by using p Delay_Block components, the same ones introduced in the previous version. As said before, the grouping of the circuit in the Delay_Block does not correspond to the VHDL code, where it has been distributed inside the Correction_Block.
The clock period obtained for this implementation on the device defined previously is 4 nanoseconds, which corresponds to a frequency of 250 MHz and therefore a throughput of 32 Gbps. These values are only indicative of the rough ranking among the versions; changing the device changes this parameter.
4p-2s-4c RS Decoder
This version of the decoder is very similar to the two-parallel one. The components previously presented are modular: if they are scaled by the p parameter, this solution becomes really easy to implement. Below are images of some of the components updated to this version. Since the schematics are straightforward to derive from the ones already shown, from now on they will not be inserted anymore.
Figure 105) Internal structure of the 4p-2s SyndromeComputer cell.
Figure 106) Internal structure of the four-parallel eCSEE cell.
Figure 107) Schematics for the evaluation blocks and the following stages.
Figure 108) Simplified blocks that constitute the 4p-2s-4c decoder.
As for the minimum clock period obtainable with this decoder architecture, 4 nanoseconds were obtained, corresponding to nearly 250 MHz; the throughput is therefore 32 Gbps.
8p-2s-2c RS Decoder
This version follows the same pattern as the previous one; only a few images are shown, both for reasons of space and because the topology has already been seen.
Figure 109) Simplified blocks that constitute the 8p-2s-2c decoder.
The architecture is much simplified with respect to the previous one. Moving to more parallelised versions, the complexity migrates from the global architecture (the one in Figure 109, for example) to the cells, which become bigger and have more connections. Moreover, the multiplexer (like the de-multiplexer) becomes simpler, with two inputs and one output.
The minimum clock period achieved was 3.85 nanoseconds, i.e. almost 260 MHz, which brings the throughput to a higher value of 33.28 Gbps.
16p-2s-1c RS Decoder
This version is even more simplified: the multiplexer and the de-multiplexer disappear completely, together with the register after the multiplexer. For the sake of simplicity, in fact, that register was included directly in the cells (Figure 110).
Figure 110) Internal structure of the SyndromeCell for the 16p-2s-1c version.
The control-signal logic is the simplest possible, since there is only one channel. In this case, as mentioned before, all the complexity of the system moves inside the cells, as can be noticed in the picture below.
Figure 111) System architecture of the 16p-2s-1c decoder.
The schematic of the system is very similar to that of the original non-parallelised version. The other components are not described, since they follow the same topology presented previously.
The minimum clock period obtained was significantly shorter than in the previous solutions: 3.3 nanoseconds, versus the 3.8 nanoseconds of the best previous versions. This is due to the simplification and the reduced complexity of the connections between the components: less complexity also means shorter routing paths and therefore a higher operating frequency. The operating frequency of this version is therefore 303 MHz, which leads to a throughput of 38.79 Gbps.
Pushing the synthesiser a bit further, the results show a significant increase in the resources used; the minimum clock period obtained is 3.25 nanoseconds (an operating frequency of 307.7 MHz and thus a throughput of 39.38 Gbps).
The choice between the two implementations depends on the final goal. If the device is the one used for these tests and the goal is the Ethernet speed, then three decoders of the first, less resource-hungry version have to be used in parallel. Each time, it has to be evaluated whether the extra 2 Gbps are worth the additional resources.
Implementation results
The two main aspects to be examined are resource usage and the performance obtained; the results are taken directly from the Vivado® environment.
The versions are compared in two separate tables: resources and timings. For the resources, the version of the decoder using the fewest possible resources is considered; for the timings, the minimum obtainable clock period is reported (even if it uses more resources in theory).
Table 20) Recap table of the resources used by the implementations of the various versions of the decoder in a
Kintex 7.
In Table 20, for the Correction_Block of the original implementation there is unexpectedly no usage of LUTs or registers. This is because its architecture actually differs from the others: the adders are distributed in the generic decoder, and instead of the Correction_Block the RAM for the delay is reported. This choice was made to allow a comparison of the BRAM usage.
In the table below, the number of clock pulses required by each decoder to produce an output is analysed.

Version      Latency [#(clock pulses)]
1p-16c       n+(n-k)+n+11 = 3n-k+11 = 537
2p-1s-8c     (n/2)+(n-k)+(n/2)+11 = 2n-k+11 = 282
4p-2s-4c     (n/4)+(n-k)+(n/4)+12 = (n/2)+n-k+11 = 154
8p-2s-2c     (n/8)+(n-k)+(n/8)+11 = (n/4)+n-k+11 = 85
16p-2s-1c    (n/16)+(n-k)+(n/16)+11 = (n/8)+n-k+10 = 54

Table 21) Latency of each decoder in clock pulses.
The calculation is done taking into account the latencies due to:
1) SyndromeComputer intrinsic latency;
2) Syndrome cells additional segmentation;
3) eCSEE latency before starting the throughput;
4) eCSEE intrinsic latency for finishing all the symbols correction.
Moreover, all the versions incur one clock pulse of latency from registering the output of the multiplexer after the SyndromeComputer components, and are also affected by the intrinsic latency of the Berlekamp-Massey block (n−k). The timings are computed by combining the clock-pulse counts with the clock periods already reported in the relative sections.
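For the rows of Table 21 whose expressions reduce cleanly (the more parallel versions involve ceiling effects and extra pipeline registers), the closed forms can be checked numerically; a sketch, using n = 255 and k = 239:

```python
n, k = 255, 239  # RS(255,239) parameters

def latency_pulses(p, overhead=11):
    """Clock pulses: message entry (n/p), Berlekamp-Massey block (n-k),
    message exit (n/p), plus a fixed pipeline overhead (assumed here)."""
    return int(2 * n / p + (n - k) + overhead)

print(latency_pulses(1))  # 1p-16c row: 537
print(latency_pulses(2))  # 2p-1s-8c row: 282
```

The two n/p terms for p = 2 sum back to exactly n, which is why that row simplifies to 2n−k+11.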
The timing results are presented in the table below. Note that for the 1p-16c version the operating frequency consistent with the 3.65 ns period is about 274 MHz.

Version      Clock period [ns]   Operating frequency [MHz]   Throughput [Gbps]   Latency [ns]
Original     3.02                331.12                      2.65                1621.74
1p-16c       3.65                274                         35.07               1960.05
2p-1s-8c     4                   250                         32                  1043.4
4p-2s-4c     4                   250                         32                  674.25
8p-2s-2c     3.85                260                         33.25               400.4
16p-2s-1c    3.25                307.7                       39.38               255.2

Table 22) Recap table for the timing results on the Kintex 7.
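Since every parallelised version processes sixteen 8-bit symbols per clock (p × c = 16 throughout), the throughput column is simply the operating frequency times 128 bits; a quick sanity check in Python, assuming that relationship:

```python
SYMBOLS_PER_CLOCK = 16  # p * c is 16 in every parallelised version
BITS_PER_SYMBOL = 8

def throughput_gbps(freq_mhz):
    """Decoder throughput: symbols per clock times symbol width times clock rate."""
    return freq_mhz * 1e6 * SYMBOLS_PER_CLOCK * BITS_PER_SYMBOL / 1e9

print(throughput_gbps(250))    # 2p-1s-8c / 4p-2s-4c rows
print(throughput_gbps(307.7))  # 16p-2s-1c row
```

The same relation, with 8 bits per clock instead of 128, gives the 2.65 Gbps of the original serial decoder.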
To check the correct implementation of the decoders, some considerations about the BRAM usage have to be made. The BRAM blocks embedded in the FPGA are 1k×36, organised in two separate blocks of 1k×18 each. The calculations below match the results obtained.
p = #(symbols processed)   c = #(channels)   #(words/component)   #(bits/word)   BRAM configuration   Generic configuration   #(BRAMs)
1                          16                1024                 8              1k × 18              1k×8                    8
2                          8                 512                  16             512 × 36             512×16                  4
4                          4                 256                  32             512 × 36             256×36                  2
8                          2                 128                  64             512 × 36             128×36                  2
16                         1                 64                   128            512 × 36             64×36                   2

Table 23) Configurations of the BRAMs used for delaying the messages.
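The per-channel FIFO geometry of Table 23 follows from a fixed total of 1024 symbol slots being split across wider, shallower memories as p grows; a small sketch, assuming that relationship:

```python
def fifo_geometry(p, total_symbols=1024, symbol_bits=8):
    """Words per delay component and word width for a p-parallel decoder:
    depth shrinks as parallelism grows, width grows as p symbols share a word."""
    words = total_symbols // p   # #(words/component)
    bits = symbol_bits * p       # #(bits/word)
    return words, bits

for p in (1, 2, 4, 8, 16):
    print(p, fifo_geometry(p))
```

The total storage (words × bits) is constant at 8192 bits per component, which is why the BRAM count stops decreasing once the word width exceeds the 36-bit port of a single block.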
Regarding the minimum clock period, it can be noticed that on an FPGA the timing is strictly limited by the routing of the signals: the simpler the architecture, the faster it goes. This explains the noticeable difference in speed between the first versions and the last ones.
Regarding resource usage, instead, it is clear that the more the system is parallelised, the fewer resources are used. As said before, surprisingly, the operating frequency also improves, which leaves no doubt about the optimal solution: the 16p-2s-1c RS(255,239) decoder.
After implementing the decoders on the Xilinx Kintex 7, in order to speed the system up as much as possible, the various versions were implemented on a Xilinx UltraScale FPGA (xcvu190-flga2577-3-e). Fixing the operating frequency to 300 MHz, the final throughput was a constant 38.4 Gbps. The two tables below summarise the results obtained.
Table 24) Recap table of the resources used by the implementations of the various versions of the decoder in a
Virtex.
Table 25) Table that resumes the timing information of the various implementations in a Virtex.
Table 25 summarises the main parameters of the designs. The throughputs are almost constant, except for some versions where the minimum clock period was difficult to lower. To understand how fast some implementations are and how much margin is left, the Worst Negative Slack (WNS) parameter is also given: with it, we can appreciate how close the design is to the timing limit. Finally, the clock period is rounded, which is why there is an apparent inconsistency in the operating-frequency information of the two-parallel and four-parallel versions: the operating frequency is the precise, reliable value.
To accomplish the goals of the thesis, three decoders of the 16p version will be needed; the attained throughput will be nearly 115 Gbps.
4.1.2.3 ASIC implementation
The Cadence SOC Encounter software makes it possible to go from VHDL code to an integrated circuit. The standard-cell library used is the “Faraday 90-nm CMOS standard cell library”.
To estimate the number of gates used by the components, an approximate technique was used. First, circuits of 50 NAND gates and 50 XOR gates were synthesised and their total areas were obtained. When a circuit is later implemented, the following formulas give a rough estimate of the number of basic gates used:
#(basic XOR gates) = 50 · Area(device) / Area(50 XOR gates)

#(basic NAND gates) = 50 · Area(device) / Area(50 NAND gates)
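The area-ratio estimate is straightforward to express in code; the sketch below uses made-up area figures purely for illustration (they are not taken from the thesis):

```python
def equivalent_gates(device_area, ref_area_50_gates):
    """Estimate the equivalent basic-gate count of a synthesised circuit
    by scaling against the area of a reference circuit of 50 gates."""
    return 50 * device_area / ref_area_50_gates

# Hypothetical areas (same units for both, e.g. um^2)
print(equivalent_gates(device_area=10_000.0, ref_area_50_gates=250.0))  # -> 2000.0
```

The same function is applied twice per design, once with the XOR reference area and once with the NAND reference area.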
In this section, the place-and-route timings of the ASIC implementations are also reported. The expectation is a remarkably higher operating frequency: in the FPGA implementations, in fact, the minimum clock period is limited by the routing of the signals.
Procedure
Some scripts had to be written to implement the various versions of the decoders. Since the implementations were launched from the console (and not via the GUI), the hierarchical compilation order of each part had to be specified. Running the first scripts, it was noticed that the implementation of RAM and ROM was too expensive in terms of hardware utilisation. The way of telling the synthesiser that the component to be instantiated is a RAM differs between the FPGA flow and the ASIC flow: the VHDL code has to change. Taking as an example the internal description of the Delay_RAM, it was previously described as follows:
--------------------------------------------------
-- WRITE PROCESS
--------------------------------------------------
Write_process : process(clk)
begin
if rising_edge(clk) then
if (ena='1') then
RAM(conv_integer(addra)) := dia;
end if;
end if;
end process Write_process;
--------------------------------------------------
-- READ PROCESS
--------------------------------------------------
Read_process : process(clk)
begin
if rising_edge(clk) then
if (enb='1') then
dob <= RAM(conv_integer(addrb));
else
dob <= (others=>'0'); -- if not enable, I give a 0
end if;
end if;
end process Read_process;
This description uses the BRAMs available in the target FPGA, which was exactly the solution chosen when designing the decoder: the FPGA synthesiser recognises the behavioural style shown above and implements the component with a BRAM. For the ASIC implementation, some VHDL code has to be added. In particular, it is better to implement the memory with latches instead of flip-flops, since it consumes less area. The Xilinx manual shows (Figure 112) the basic structure of a BRAM, which has to be reproduced with a structural description.
Figure 112) Internal structure of a BRAM cell.
The structural representation code is:
--------------------------------------------------
-- WRITE PROCESS
--------------------------------------------------
Write_process : process(ena_reg,dia_reg,addra_reg)
begin
if (ena_reg='1') then
RAM(conv_integer(addra_reg)) := dia_reg;
end if;
end process Write_process;
--------------------------------------------------
-- READ PART
--------------------------------------------------
dob_int <= RAM(conv_integer(addrb_reg));
--------------------------------------------------
-- OUTPUT LATCH
--------------------------------------------------
Latch_dob : process(enb_reg,dob_int)
begin
if (enb_reg='1') then
dob <= dob_int;
end if;
end process Latch_dob;
--------------------------------------------------
-- REGISTERING PROCESS
--------------------------------------------------
Reg_process : process(clk)
begin
if rising_edge(clk) then
ena_reg <= ena;
enb_reg <= enb;
addra_reg <= addra;
addrb_reg <= addrb;
dia_reg <= dia;
end if;
end process Reg_process;
The difference between the two cases lies only in the internal structure of the cell, so the best way to represent this in VHDL is to use the same entity with two different architectures. To do this, a constant had to be defined (in “RS_Decoder_Types.vhd”) to keep track of the version to be implemented; a std_logic type was chosen: if it is zero, the FPGA version is implemented, and if it is one, the ASIC version is chosen. The architecture is selected inside the code as follows:
-- connecting the RAM block
FPGA_mode_generate : if (implementation_mode='0') generate
Delay_RAM_component : entity work.Delay_RAM(FPGA_architecture) port map(
-- INPUTS
clk => clock,
ena => enable_in,
enb => enable_out,
addra => input_data_address,
addrb => output_data_address,
dia => internal_data_in,
-- OUTPUTS
dob => data_out_bus
);
end generate;
ASIC_mode_generate : if(implementation_mode='1') generate
Delay_RAM_component : entity work.Delay_RAM(ASIC_architecture) port map(
-- INPUTS
clk => clock,
ena => enable_in,
enb => enable_out,
addra => input_data_address,
addrb => output_data_address,
dia => internal_data_in,
-- OUTPUTS
dob => data_out_bus
);
end generate;
The same procedure was used for every component containing a ROM or a RAM. In the files provided, this kind of dual architecture is shown only in the 38p-2s-1c decoder (see the next chapter), just to show how it is coded; the other decoders keep the plain FPGA architecture.
Implementation results
The results are presented in two tables: one for area occupation and one for timings. Starting from the resources used, the results are broken down by component. This makes it possible to distinguish the gates used by each component and to compare how they change from one implementation to another. The gate counts of the decoder versions are given as equivalent numbers of XOR and NAND gates. Finally, the table contains two totals: this shows the overall impact of the FIFO memory on the area occupation and allows further considerations on resource usage.
Table 26) Recap table that resumes the equivalent number of gates used by each RS(255,239) ASIC
implementation of the decoders.
The numbers decrease steadily as the degree of parallelisation increases, as also happened with the FPGA implementations. The only component that grows is the ePIBMA, because of the ZetaComputer (the table providing the zeta results to the eCSEE component).
The ASIC implementations were realised with a fixed clock-period constraint: every version was implemented with a clock period of 2 nanoseconds, hence an operating frequency of 500 MHz.
In the table below, the results are grouped in the same order as Table 26, with the main decoder parameters summarised in the left part. While the throughput is kept constant, the latency decreases markedly, reaching only 116 nanoseconds with the 16p version.
Table 27) Recap table that resumes the main timing characteristics of the RS(255,239) ASIC implementations of
the decoders.
To compare the decoders, a new parameter is introduced. The idea is to bind two important figures (throughput and number of gates used) into a single parameter. The efficiency is defined so that it increases as the result improves:

Efficiency* = Throughput [Gbps] / Number of equivalent NAND gates without FIFO [thousands]
In previous works, the results were compared only on the gates used by the decoder itself, leaving aside the FIFO memory. A more proper comparison should consider the overall system, since there is no way to avoid implementing the memory. So another parameter is defined:

Efficiency = Throughput [Gbps] / Number of equivalent NAND gates with FIFO [thousands]
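These ratios can be checked directly against the comparison table; a sketch (throughput per thousand equivalent NAND gates, figures taken from Table 28):

```python
def efficiency(throughput_gbps, gates):
    """Throughput per thousand equivalent NAND gates (higher is better)."""
    return throughput_gbps / (gates / 1000)

# Figures for two published decoders, taken from the comparison table
print(round(efficiency(156, 582_000), 3))   # [Ji15], total gate count with FIFO
print(round(efficiency(240, 708_600), 4))   # [Park12], total gate count with FIFO
```

Using the partial gate count (without FIFO) instead of the total gives the Efficiency* column.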
Table 27 presents the comparisons among the decoders obtained and, as expected, the best one is the 16p.
4.1.2.4 Comparison with published results
The decoders obtained now have to be compared with the literature, again using the efficiency parameter. Here the two kinds of efficiency become useful: in the literature, in fact, the gates used for the FIFO memory are ignored. An estimate of the memory was made by taking the latency of each decoder into account and letting Python compute the equivalent number of gates. It will not be precise, but, since these data were omitted in the articles, it at least gives a rough estimation. In the comparison with the literature, the “gates” parameter refers to the equivalent number of NANDs.
Table 28 makes clear, from the “Efficiency” parameter, that the proposed solution is better than the others taken from the literature. Adding the FIFO to the calculation (last row), the gap between the decoders increases significantly. The latency also registers a relevant improvement: only 106 nanoseconds against, in the best of the other cases, 256.
To reach the Ethernet speed, the proposed architecture needs two decoders in parallel (hence the throughput will be 128 Gbps).
                              Proposed   [Ji15]   [Park12]  [LeeH08]   [LeeS08]  [Lee05]  [Song02]
CMOS technology [nm]          90         90       90        130        180       130      160
SC                            19600      75000    129000    58000      100800    48000    40000
KES                           12230      112000   114350    108200     156000    272000   84000
CSEE                          69900      82000    174250    211800     178000    73000    240000
Partial gate count            104650     269000   417600    378000     434800    393000   364000
FIFO                          31900      313000   291000    437000     313000    314000   318000
Total gate count              136550     582000   708600    815000     747800    707000   682000
Clock rate [MHz]              500        625      625       300        400       625      112
Latency [clock pulses]        53         260      161       242        260       522      168
Latency [µs]                  0.106      0.416    0.2576    0.8066667  0.65      0.8352   1.5
Throughput [Gbps]             64         156      240       115        102       80       43
Efficiency without FIFO
[Gbps/(thousands of gates)]   0.6116     0.58     0.5747    0.3042     0.2346    0.2036   0.1181
Efficiency
[Gbps/(thousands of gates)]   0.4769     0.268    0.3387    0.1411     0.1364    0.1131   0.06305

Table 28) Comparison table between the proposed decoder and the literature results.
Another interesting strength of the thesis can be deduced from the following table, which reports the relative weight (in percentage) of the FIFO memory over the total number of gates used. The results show that the proposed decoder devotes a much smaller share of its area to the FIFO than the published designs.

                     Proposed   [Ji15]   [Park12]  [LeeH08]  [LeeS08]  [Lee05]  [Song02]
FIFO                 31900      313000   291000    437000    313000    314000   318000
Total gate count     136550     582000   708600    815000    747800    707000   682000
Percentage weight    23.36%     53.78%   41.07%    53.62%    41.86%    44.41%   46.63%

Table 29) Analysis of the percentage weight of the FIFO memory on the total number of gates used.
4.2 Parallelisation of the RS(528,514) decoder
In analogy with what was done for the RS(255,239), it was necessary to parallelise the basic plain structure of the RS(528,514) decoder. The procedure is the same.
4.2.1 Background
Few articles about this code can be found in the literature, and none were available about its parallelisation. The reason is mainly that the code is relatively new: while the RS(255,239), which is now considered a standard, has countless publications about its implementation, this code is recent and leaves us a blank page, giving great freedom of design.
4.2.2 Innovative designs
The scientific articles give us no reference design against which to compare the results. The approach, then, will be to apply the same method used in the previous chapter, analysing different solutions, and finally, at the end of the thesis, to compare the best results obtainable with both codes: RS(255,239) and RS(528,514).
Leaving aside the large impact of the RAM used for delaying the messages, the clear conclusion reached at the end of the previous section was that parallelisation helps to save resources and achieves higher throughputs and lower latencies. The number of cycles needed by the Berlekamp-Massey component in this case is fourteen. Dividing the overall number of incoming symbols (528) by this value and rounding up gives the maximum number of usable channels:

#(channels) = ceil(528 / 14) = 38
Accordingly, the cases studied for the RS(528,514) are restricted to 1p-38c, 2p-1s-19c and finally 38p-2s-1c. The obvious expected outcome of the analysis is that the 38p-2s-1c should be the best of this family of decoders.
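The channel bound follows from matching the message-entry time to the Berlekamp-Massey processing time; in code, using the parameters of the two codes treated in the thesis:

```python
import math

def max_channels(n, bm_cycles):
    """Maximum number of channels: the message must take at least as many
    clock pulses to enter as the Berlekamp-Massey block needs to run."""
    return math.ceil(n / bm_cycles)

print(max_channels(528, 14))   # RS(528,514): BM needs fourteen cycles
print(max_channels(255, 16))   # RS(255,239): BM needs n-k = 16 cycles
```

The second call reproduces the sixteen channels of the 1p-16c RS(255,239) decoder, which is why the same family structure recurs here.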
4.2.2.1 Proposed architecture and theoretical analysis
As in the other analyses, a first rough calculation of the equivalent number of XOR gates is made. The calculation, again, is only indicative of the approximate resource usage, so it should agree with the ASIC implementation results within a good margin of tolerance. The structures used are the same as in the previous chapter, so no detail is given for the first two versions.
38p-2s-1c RS Decoder
The only detail that has to be discussed is the structure of the syndrome cell, since it was not discussed before.
Figure 113) Internal structure of the cell for the computation of the syndrome in the 38p-2s-1c RS(528,514)
decoder.
The following pictures show the structure of the so-called Mult_Cell, which composes the SyndromeCell for this version of the decoder.
Figure 114) Black box representation of the Mult_Cell component.
Figure 115) Internal structure of the Mult_Cell.
The structure of the syndrome computer is the same as before, only partitioned into smaller units to make it simpler to build. The segmentation is done at the end of each cell and at the end of the additions. Despite the long chain of adders, the critical path of this block is still the one consisting of a multiplier, an adder and a multiplexer.
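As a behavioural reference for the cell above, the syndrome computation can be sketched in Python with Horner's rule: multiply the accumulator by the root, then add the incoming symbol (the same multiplier-adder pair that forms the critical path). The GF(2^10) reduction polynomial x^10 + x^3 + 1 is an assumption made for illustration, and the sketch is purely serial, while the hardware processes thirty-eight symbols per cycle.

```python
def gf_mul(a, b, poly=0b10000001001, m=10):
    """Carry-less multiply in GF(2^m); poly = x^10 + x^3 + 1 (assumed)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a >> m:          # reduce whenever the degree reaches m
            a ^= poly
        b >>= 1
    return r

def syndromes(received, n_syn=14, alpha=2):
    """S_j = r(alpha^j) for j = 1..n_syn, evaluated with Horner's rule."""
    syn = []
    root = 1
    for _ in range(n_syn):
        root = gf_mul(root, alpha)   # next root alpha^j
        acc = 0
        for sym in received:         # one multiply-add per symbol, as in the cell
            acc = gf_mul(acc, root) ^ sym
        syn.append(acc)
    return syn

# A single error in the last-processed symbol yields S_j = e for every j.
assert syndromes([0] * 527 + [5]) == [5] * 14
```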
Another important aspect to remark is how the messages arrive at the decoder. The input data bus is organised in thirty-eight ports, and a message takes fourteen cycles to be completely inserted into the decoder. Therefore 532 symbols enter the decoder, so four of them have to be zero elements. To avoid adding any correction stage to the decoder, it is better to perform the zero-padding at the beginning of the message. If the message is:
m = [1, 2, 3, 4, …]
Then, the zero-padded message will be:
m* = [0, 0, 0, 0, 1, 2, 3, 4, …]
and so the syndrome is not affected. The symbols are ordered so that the first symbol of the zero-padded message is placed on the last port of the input data bus. Therefore, at the first clock pulse there will be:
data_in[37] <= m*[0];
data_in[36] <= m*[1];
...
data_in[0] <= m*[37];
More details are displayed in the test-bench of the 38p-2s-1c decoder.
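The claim that zero-padding at the head of the message leaves the syndrome untouched follows from Horner's rule: leading zeros keep the accumulator at zero. A minimal check (the GF(2^10) polynomial x^10 + x^3 + 1 is an illustrative assumption):

```python
def gf_mul(a, b, poly=0b10000001001, m=10):
    # Carry-less GF(2^10) multiply; the reduction polynomial is an assumption.
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a >> m:
            a ^= poly
        b >>= 1
    return r

def horner(symbols, root):
    # Evaluate the received polynomial at `root`; first symbol = highest degree.
    acc = 0
    for s in symbols:
        acc = gf_mul(acc, root) ^ s
    return acc

m = [1, 2, 3, 4] + [0] * 524        # a 528-symbol message
m_star = [0, 0, 0, 0] + m           # zero-padded to 532 symbols
assert all(horner(m_star, r) == horner(m, r) for r in range(1, 16))
```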
Comparisons
The data obtained from the calculations made by the Python script are presented in table form below.
            SyndromeComputer  ePIBMA  eCSEE    CorrectionBlock  Total
Original    1176              4172    7451     21128            33927
1p-38c      28728             4172    283138   802864           1118902
2p-1s-19c   23408             4172    245974   401584           675138
38p-2s-1c   21728             4172    243634   21424            290958
Table 30) Recap table of the number of XOR gates used by every block for every version.
In the analysis, four stored messages are counted per channel. This leads to the following table.
            #(messages to be stored)
Original    4
1p-38c      38 * 4 = 152
2p-1s-19c   19 * 4 = 76
38p-2s-1c   1 * 4 = 4
Table 31) Table that resumes the number of messages that have to be stored for every version.
As can be observed from Table 30, the weight of the message storage is dominant with respect to the other components. This confirms that the method used in the thesis brings relevant advantages.
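The bookkeeping behind Table 31 is simply four buffered messages per channel; a throwaway sketch (version names as in the text):

```python
channels = {"Original": 1, "1p-38c": 38, "2p-1s-19c": 19, "38p-2s-1c": 1}
stored = {name: 4 * c for name, c in channels.items()}  # 4 messages per channel
print(stored)  # {'Original': 4, '1p-38c': 152, '2p-1s-19c': 76, '38p-2s-1c': 4}
```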
The conclusions will be drawn in the final chapter, only after the results of the ASIC implementation have been obtained.
4.2.2.2 FPGA implementation
The analysis consisted in the VHDL coding and verification of the decoders. As before, to obtain timing information the clock period is pushed to the limit (therefore more resources are used), while to obtain resource-usage information the requested clock period is relaxed to 20 nanoseconds. These data are therefore not strictly related to each other, but they give a correct idea of the hierarchy in hardware usage and operating frequency.
Implementation results
The results in terms of hardware usage of the versions of the decoder are the following:
Table 32) Recap table of the resources used by the implementations of the various versions of the decoder in a
Kintex 7.
In the resource usage, it can be noticed that the ePIBMA component grows slightly with the higher parallelisation of the decoder. This is due to the zeta computer, which consists of as many tables of sixteen GF symbols as there are parallelised symbols, i.e. one for the 1p, two for the 2p and thirty-eight for the 38p version.
In general, the amount of resources used should decrease with higher parallelisation, but it can be seen that some parameters increase when passing from the 1p-38c to the 2p-1s-19c decoder. This is due to the change in the architecture of the decoder.
Another observation concerns the usage of BRAMs. This component is used in the implementations both for the inversion ROM and for the RAM that delays the arriving messages. It is not straightforward to reconstruct how the synthesiser maps these components onto BRAMs, but it is clear that more parallelisation gives a total saving of memory. To better analyse the BRAM usage due to the message delay in the channels, some calculations had to be made.
p=#(symbols  c=#(channels)  #(words/    #(bits/  BRAM           Generic
processed)                  component)  word)    configuration  configuration
1            38             2048        10       2k x 18        2k x 10, 38 BRAMs
2            19             1024        20       1k x 36        1k x 36, 19 BRAMs
38           1              64          380      512 x 36       64 x 36, 6 BRAMs
Table 33) Table that resumes the configurations of the BRAMs used for delaying the messages.
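A quick cross-check of Table 33 confirms the memory saving: the total delay storage (channels x words x bits per word) shrinks sharply as parallelisation grows.

```python
# (p symbols per cycle, channels, words per component, bits per word)
configs = [(1, 38, 2048, 10), (2, 19, 1024, 20), (38, 1, 64, 380)]
for p, c, words, bits in configs:
    total_bits = c * words * bits
    print(f"{p}p: {total_bits} bits of delay memory")
# 1p: 778240, 2p: 389120, 38p: 24320
```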
As also seen for the other kind of decoder, the Vivado implementations confirmed the estimations made.
            Clock period [ns]  Operating frequency [MHz]  Throughput [Gbps]  Latency [ns]
Original    3.02               331.13                     3.31               3303.9
1p-38c      4.4                227.27                     86.36              4813.6
2p-1s-19c   3.6                277.78                     105.55             1994.4
38p-2s-1c   3.65               273.97                     104.11             188.8
Table 34) Recap table for the timings results for Kintex 7.
Regarding the timing results, the table above resumes the main parameters of interest.
The throughput of the last two options already reaches the goal, but the latency still seems too high: the typically accepted latency is around one hundred nanoseconds.
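The throughput figures in Table 34 follow directly from the 38 symbols of 10 bits decoded per clock cycle; a sketch of the arithmetic (function name is illustrative):

```python
def throughput_gbps(freq_mhz, symbols_per_cycle, bits_per_symbol=10):
    # bits decoded per second, expressed in Gbps
    return freq_mhz * 1e6 * symbols_per_cycle * bits_per_symbol / 1e9

print(round(throughput_gbps(277.78, 38), 2))   # 2p-1s-19c: ~105.56
print(round(throughput_gbps(273.97, 38), 2))   # 38p-2s-1c: ~104.11
```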
To better satisfy the strict requirements, the device on which the designs were implemented was changed to a Virtex UltraScale FPGA, the same model used in the previous chapter. The achieved results are resumed in the tables below.
Table 35) Recap table of the resources used by the implementations of the various versions of the decoder in a
Virtex.
Table 36) Recap table for the timings results for Virtex.
For these implementations, the clock frequency was made uniform as far as possible; the original design, therefore, is not slower. In general, better results are obtainable with a Virtex (Table 34 and Table 36).
4.2.2.3 ASIC implementation
The technique and the steps followed to implement the designs in ASIC are the same described in the previous chapter, so the results of the implementations are presented directly.
Implementation results
In analogy with the other ASIC implementations, the results are divided by component and two separate totals of gates are resumed: one with the implementation of the FIFO memory and one without. The best version of this Reed-Solomon code is the 38p decoder.
The timings (Table 38) show, in general, some slowness with respect to the RS(255,239) ASIC implementations: the throughput is lower. The efficiencies (computed in both ways) are also slightly lower than in the other implementations. The latency obtained for the 38p decoder is bearable (130 nanoseconds).
Table 37) Recap table that resumes the equivalent number of gates used by each RS(528,514) ASIC
implementation of the decoders.
Table 38) Recap table that resumes the main timing characteristics of the RS(528,514) ASIC implementations of
the decoders.
To reach the throughput goal of the thesis, instantiating a single 38p RS(528,514) decoder is more than enough, since it delivers 152 Gbps.
5 Conclusions
In this thesis, hardware architectures of decoders for Reed-Solomon (RS) codes reaching a decoding speed of 100 Gbps were developed. The basic arithmetical operations of finite-field algebra in Galois fields were studied, and a VHDL library was implemented to make these operations easier to use during the development of the work. The decoding process was studied and the algorithms were selected after a brief analysis. The chosen algorithms were then implemented in basic plain versions of the decoders, reaching throughputs of the order of some Gbps. Then, an analysis of the high-speed decoders already in the literature was made and a way to overtake them was studied. To do so, a detailed discussion was made on the best way of parallelising the decoders. The goals of the study were to obtain an efficient area-time relation, to find an optimised method for memory usage in the decoder and to attain a low latency. The family of RS codes studied comprises the RS(255,239), which operates in GF(2^8), and the RS(528,514), which operates in GF(2^10). The main idea of the implementation was to use the component implementing the Berlekamp-Massey algorithm at its best and to adapt the remaining components to obtain the highest rate of parallelisation. This led to low-latency and high-throughput solutions. A systolic architecture was preferred for each component. The typologies of solutions studied realise the parallelisation in different degrees, so the number of parallelised components for the computation of the syndrome and the Chien search was varied from one solution to another. The architectures were then implemented in VHDL. Their behaviour was tested thanks to Python scripts that verified the correctness of the outputs. Then, the designs were implemented in FPGA (Xilinx Kintex 7 and Xilinx Virtex UltraScale). With some slight changes to the internal memory structure, the designs were implemented in ASIC in 90 nm CMOS technology.
By parallelising two innovative designs of RS(255,239) decoders, a throughput of 124 Gbps and a latency of 116 nanoseconds were reached, using a bit less than 0.842 mm2 of area.
For the RS(528,514), a single 38p version of the innovative design is more than enough to reach the goals. The throughput attained is 152 Gbps, while the latency is 130 nanoseconds. The area used for implementing the decoder is nearly 1.2 mm2.
6 List of figures
Figure 1) Schematisation in blocks of a RF system. ................................................................ 11
Figure 2) Working operation of the multiplication in Galois fields. ....................................... 21
Figure 3) Algorithm used for inverting an element "u". .......................................................... 23
Figure 4) Schematic of a RS(7,3) encoder. .............................................................................. 25
Figure 5) Python RS(7,3) Encoder for message 0 [2,7,3]. ....................................................... 26
Figure 6) Black box representation of the RS Encoder in VHDL. .......................................... 26
Figure 7) Encoder RS(7,3) working in GF(8) internal structure. ............................................ 27
Figure 8) RSEncUnit black box. .............................................................................................. 28
Figure 9) RSEncUnit internal schematic and connections. ...................................................... 28
Figure 10) Blocks representing the test-bench system. ........................................................... 29
Figure 11) Timing_tester realisation viewed in Xilinx Vivado®. ........................................... 30
Figure 12) Graph representing the post-synthesis time analysis made with Xilinx Vivado®. 31
Figure 13) Resources utilisation for Encoder RS(255,239). .................................................... 31
Figure 14) HTML page for the VHDL code documentation of the RS Encoder. ................... 32
Figure 15) HTML page for the VHDL code documentation of the Timing Tester for RS
Encoder. ................................................................................................................................... 32
Figure 16) Generic decoder architecture.................................................................................. 33
Figure 17) Flowchart that describes the working operation of the inversion-less Berlekamp-
Massey (iBM) algorithm. ......................................................................................................... 35
Figure 18) ePIBMA working operation diagram. .................................................................... 36
Figure 19) Flowchart that describes the working operation of the classical version of the Chien
Search (CS) algorithm. ............................................................................................................ 37
Figure 20) Flowchart of the working operation of the Forney's algorithm. ............................ 37
Figure 21) Flowchart diagram of the eCSEE algorithm. ......................................................... 38
Figure 22) Syndrome calculation unit schematic implementing Horner's rule........................ 39
Figure 23) SyndromeCell black box block............................................................................... 39
Figure 24) SyndromeCell internal structure. ............................................................................ 40
Figure 25) Internal structure of the SyndromeComputer. ........................................................ 41
Figure 26) Black box model of the SyndromeComputer component. ..................................... 41
Figure 27) Black box representation of the RS_ePIBMA_Cell. ............................................... 42
Figure 28) Internal Structure of the RS_ePIBMA_Cell............................................................ 42
Figure 29) Circuit for the initialisation of the MC1 control signal. ......................................... 43
Figure 30) Circuit for the initialisation of the MC2 control signal. ......................................... 43
Figure 31) Circuit for the initialisation of the MC3 control signal. ......................................... 43
Figure 32) Circuit for the calculation of the gamma signal. .................................................... 44
Figure 33) Circuit for the calculation of the omega_0 signal. ................................................. 45
Figure 34) Circuit for the calculation of parameter L_B.......................................................... 45
Figure 35) Circuit for the calculation of L_sigma. .................................................................. 45
Figure 36) Circuit for the calculation of the zeta signal. ......................................................... 46
Figure 37) Circuit for the calculation of the ready signal. ....................................................... 46
Figure 38) RS_eCSEE black-box representation. .................................................................... 47
Figure 39) Internal blocks of the eCSEE device. ..................................................................... 48
Figure 40) Internal structure of the eCSEE component; the blocks are placed ordered in respect
to the x-axis that represents the time........................................................................................ 48
Figure 41) Black-box representation of the RS_eCSEE_Cell. ................................................. 49
Figure 42) Internal structure of the RS_eCSEE_Cell. .............................................................. 49
Figure 43) Sigma polynomial evaluation block. ...................................................................... 50
Figure 44) “B” polynomial evaluation block. .......................................................................... 50
Figure 45) Zeta computer component. ..................................................................................... 51
Figure 46) Circuit for the calculation of the number of clock pulses (CC). ............................ 51
Figure 47) Circuit for the calculation of the ready signal. ....................................................... 52
Figure 48) Black-box representation of the ROM for inverting the elements. ........................ 52
Figure 49) RS_Decoder black-box representation. .................................................................. 53
Figure 50) Black-box representation of the Delay_RAM. ....................................................... 54
Figure 51) Example circuit for the usage of the Delay_RAM block. ....................................... 54
Figure 52) Circuit for the generation of the address for input data in the Delay_RAM. .......... 54
Figure 53) Circuit for the generation of the output data address. ............................................ 55
Figure 54) Complete schematic of the RS_Decoder component described with blocks. ........ 56
Figure 55) Internal structure of the L_sigma shift register. ..................................................... 56
Figure 56) Python graph presenting the results obtained by the RS(255,239) decoder. ......... 57
Figure 57) Graph that represents the latency and the throughput correspondent to each solution
exploited. .................................................................................................................................. 67
Figure 58) Two-parallel syndrome cell internal architecture [Ji15]. ....................................... 68
Figure 59) Three-parallel syndrome cell internal architecture [Park12]. ................................ 68
Figure 60) Four-parallel syndrome cell internal architecture. ................................................. 69
Figure 61) Five-parallel syndrome cell internal architecture [Salvador14]. ............................ 69
Figure 62) First picture that represents the usage of resources per Gbps of throughput obtained.
On the x axis there are the number of symbols computed per clock pulse; on the y axis there
are the number of XOR gates used per Gbps of throughput. ................................................... 71
Figure 63) Resources utilisation per every clock pulse of latency saved. On the x axis there are
the number of symbols computed per clock pulse; on the y axis there are the average number
of basic elements used per clock pulse saved. ......................................................................... 71
Figure 64) Internal structure of a two-parallel eCSEE cell. ..................................................... 72
Figure 65) Internal structure of a three-parallel eCSEE cell. ................................................... 72
Figure 66) Internal structure for the 3-parallel evaluation of the sigma polynomial [Park12].
.................................................................................................................................................. 73
Figure 67) Picture that represents the usage of resources per Gbps of throughput obtained. On
the x axis there are the number of symbols computed per clock pulse; on the y axis there are
the average number of basic elements used per Gbps of throughput. ..................................... 74
Figure 68) Resources utilisation per every clock pulse of latency saved. On the x axis there are
the number of symbols computed per clock pulse; on the y axis there are the average number
of basic elements used per clock pulse saved. ......................................................................... 75
Figure 69) System evaluated for the two-parallel solution. ..................................................... 76
Figure 70) System evaluated for the four-parallel solution. .................................................... 76
Figure 71) Schematic of the system using 1p versions of the cells for SyndromeComputer and
eCSEE components. ................................................................................................................. 78
Figure 72) Internal structure of the 1p SyndromeComputer cell. ............................................ 79
Figure 73) Schematic of the system using 2p versions of the cells for SyndromeComputer and
eCSEE components. ................................................................................................................. 79
Figure 74) Internal structure of the 2p-1s SyndromeComputer cell. ....................................... 80
Figure 75) Graph that shows the perfect arrival of the signals if there’s no segmentation of the
cells. ......................................................................................................................................... 80
Figure 76) Schematic of the system using 4p versions of the cells for SyndromeComputer and
eCSEE components. ................................................................................................................. 81
Figure 77) Internal structure of the 4p-1s SyndromeComputer cell. ....................................... 81
Figure 78) Internal structure of the 4p-2s SyndromeComputer cell. ....................................... 81
Figure 79) Schematic of the system using 8p versions of the cells for SyndromeComputer and
eCSEE components. ................................................................................................................. 82
Figure 80) Internal structure of the 8p SyndromeComputer cell. ............................................ 83
Figure 81) Internal structure of the 8p-1s SyndromeComputer cell. ....................................... 84
Figure 82) Internal structure of the 8p-2s SyndromeComputer cell. ....................................... 85
Figure 83) Schematic of the system using 16p versions of the cells for SyndromeComputer and
eCSEE components. ................................................................................................................. 85
Figure 84) Internal structure of the 16p-2s SyndromeComputer cell. ..................................... 86
Figure 85) Black box representation of the Delay_Block. ....................................................... 89
Figure 86) Internal structure of the Delay_Block. ................................................................... 89
Figure 87) Black box representation of the Correction_Block. ............................................... 90
Figure 88) Internal structure of the Correction_Block. ........................................................... 90
Figure 89) Internal structure of the 2p-1s-SyndromeCell. ....................................................... 91
Figure 90) Internal structure of the 2p-eCSEE cell. ................................................................. 92
Figure 91) Internal structure of the Sampling_Cell. ................................................................ 92
Figure 92) Structures of the evaluation blocks for the odd and even parts of the two
polynomials. ............................................................................................................................. 93
Figure 93) Internal structure of the Chien search and error evaluation component. ............... 93
Figure 94) Internal structure of the Final_Stage component. .................................................. 94
Figure 95) Internal structure of the component for the calculation of zeta. ............................ 94
Figure 96) Internal architecture of the 2p-Starting_Zeta_Computer. ...................................... 95
Figure 97) Internal structure of the modified Zeta_Computer................................................. 95
Figure 98) Internal architecture of the 4p-Starting_Zeta_Computer. ...................................... 95
Figure 99) Internal architecture of the 8p-Starting_Zeta_Computer. ...................................... 96
Figure 100) Internal architecture of the 16p-Starting_Zeta_Computer. .................................. 96
Figure 101) Internal architecture of the 16p-enhanced_Starting_Zeta_Computer. ................. 97
Figure 102) Blocks assembly inside the decoder. .................................................................... 98
Figure 103) Black box representation of the Correction_Block. ............................................. 98
Figure 104) Internal structure of the Correction_Block component. ....................................... 98
Figure 105) Internal structure of the 4p-2s SyndromeComputer cell. ..................................... 99
Figure 106) Internal structure of the four-parallel eCSEE cell. ............................................. 100
Figure 107) Schematics for the evaluation blocks and the following stages. ........................ 100
Figure 108) Simplified blocks that constitute the 4p-2s-4c decoder. .................................... 100
Figure 109) Simplified blocks that constitute the 8p-2s-2c decoder. .................................... 101
Figure 110) Internal structure of the SyndromeCell for the 16p-2s-1c version. .................... 102
Figure 111) System architecture of the 16p-2s-1c decoder. .................................................. 102
Figure 112) Internal structure of a BRAM cell. ..................................................................... 107
Figure 113) Internal structure of the cell for the computation of the syndrome in the 38p-2s-1c
RS(528,514) decoder. ............................................................................................................ 112
Figure 114) Black box representation of the Mult_Cell component..................................... 112
Figure 115) Internal structure of the Mult_Cell. ................................................................... 112
7 List of tables
Table 1) Modulo-2 addition. .................................................................................................... 13
Table 2) Modulo-2 multiplication. ........................................................................................... 14
Table 3) Representations of the Galois Field formed with three bits. ..................................... 17
Table 4) Recap table for the control unit of the RS_ePIBMA component. .............................. 60
Table 5) Recap table of resources used per every component, taking into account the cells
usage. ....................................................................................................................................... 61
Table 6) Table for critical paths and latencies of the various parts. ........................................ 62
Table 7) Table that resumes the usage of XOR gates for every component used. .................. 68
Table 8) Comparison of resources usage of the variants of the syndrome cells. ..................... 69
Table 9) Resume of the latency and critical path characteristics for the solutions analysed. .. 70
Table 10) Recap table for used basic elements in the overall syndrome component. ............. 70
Table 11) Comparison of resources usage of the variants of the eCSEE cells. ....................... 72
Table 12) Resume of the latency and critical path characteristics for the three solutions
analysed.................................................................................................................................... 72
Table 13) Recap table for the resources utilisation of the components in eCSEE variants. .... 73
Table 14) Table that resumes the instantiation of cells in the typologies of eCSEE. .............. 73
Table 15) Recap table for the resources used by the eCSEE module of all the solutions taken
into consideration. .................................................................................................................... 74
Table 16) Table that resumes the number of XOR gates used for the two solutions taken into
consideration. ........................................................................................................................... 77
Table 17) Table that resumes the utilisation of elements for the various variants of
SyndromeComputer cell. .......................................................................................................... 87
Table 18) Recap table of the number of XOR gates used by every block for every version. . 88
Table 19) Results of the simplification of the P-1 multiplexer. ............................................... 88
Table 20) Recap table of the resources used by the implementations of the various versions of
the decoder in a Kintex 7. ...................................................................................................... 103
Table 21) Table that analyses in the clock pulses the latency of each decoder. .................... 104
Table 22) Recap table for the timings results in Kintex 7. .................................................... 104
Table 23) Table that resumes the configurations of the BRAMs used for delaying the messages.
................................................................................................................................................ 104
Table 24) Recap table of the resources used by the implementations of the various versions of
the decoder in a Virtex. .......................................................................................................... 105
Table 25) Table that resumes the timing information of the various implementations in a Virtex.
................................................................................................................................................ 105
Table 26) Recap table that resumes the equivalent number of gates used by each RS(255,239)
ASIC implementation of the decoders. .................................................................................. 108
Table 27) Recap table that resumes the main timing characteristics of the RS(255,239) ASIC
implementations of the decoders. .......................................................................................... 109
Table 28) Comparison table between the decoder proposed and the literature’s results. ...... 110
Table 29) Analysis of the percentage weight of the FIFO memory on the total number of gates
used. ....................................................................................................................................... 110
Table 30) Recap table of the number of XOR gates used by every block for every version. 113
Table 31) Table that resumes the number of messages that have to be stored for every version.
................................................................................................................................................ 113
Table 32) Recap table of the resources used by the implementations of the various versions of
the decoder in a Kintex 7. ...................................................................................................... 114
Table 33) Table that resumes the configurations of the BRAMs used for delaying the messages.
................................................................................................................................................ 114
Table 34) Recap table for the timings results for Kintex 7. ................................................... 115
Table 35) Recap table of the resources used by the implementations of the various versions of
the decoder in a Virtex. .......................................................................................................... 115
Table 36) Recap table for the timings results for Virtex. ...................................................... 115
Table 37) Recap table that resumes the equivalent number of gates used by each RS(528,514)
ASIC implementation of the decoders. .................................................................................. 116
Table 38) Recap table that resumes the main timing characteristics of the RS(528,514) ASIC
implementations of the decoders. .......................................................................................... 116
8 Bibliography

[Song02] Song L., Yu M.-L., Shaffer M.S., "10 and 40 Gb/s Forward Error Correction Devices
for Optical Communications", IEEE Journal of Solid-State Circuits, pp. 1565-1573, 2002.

[Lee03] Lee H., "High-Speed VLSI Architecture for Parallel Reed-Solomon Decoder", IEEE
Transactions on VLSI Systems, pp. 288-294, 2003.

[Lee05] Lee H., "A High-Speed Low-Complexity Reed-Solomon Decoder for Optical
Communications", IEEE Transactions on Circuits and Systems II, pp. 461-465, 2005.

[LeeH08] Lee H., Choi C.-S., Shin J., Ko J.-S., "100Gb/s Three-Parallel Reed-Solomon Based
Forward Error Correction Architecture for Optical Communications", 2008 International SoC
Design Conference, pp. 265-268, 2008.

[LeeS08] Lee S., Choi C.-S., Lee H., "Two-Parallel Reed-Solomon Based FEC Architecture for
Optical Communications", IEICE Electronics Express, pp. 374-380, 2008.

[Park12] Park J.-I., Yeon J., Yang S.-J., Lee H., "An Ultra High-Speed Time-Multiplexing
Reed-Solomon-Based FEC Architecture", IEEE, 2012.

[Salvador14] Salvador A., Carvalho D., Nakandakare C., Mobilon E., de Oliveira J.,
"100Gbit/s FEC for OTN Protocol: Design Architecture and Implementation Results", IEEE,
2014.

[Ji15] Ji W., Zhang W., Xingru P., Zhibin L., "16-Channel Two-Parallel Reed-Solomon Based
Forward Error Correction Architecture for Optical Communications", 2015 IEEE International
Conference on Digital Signal Processing (DSP), pp. 239-243, 2015.

[Wu15] Wu Y., "New Scalable Decoder Architectures for Reed-Solomon Codes", IEEE, pp.
2741-2761, 2015.
9 Annexure

9.1 [Annex 1] File positions

This section describes how the files supplied with the thesis are organised inside the folder.
The files follow the same order and are organised into the same chapters into which the thesis
report is divided. For the Python and Matlab parts, the plain "file.py" and "file.m" files are
sufficient. As for the project files, it was not possible to include all of them: only the VHDL
sources were placed in the folder, since they are the essential result of the thesis. Some text
files are also included to allow testing with the test benches.
All the versions of the decoders and encoders were successfully tested. The test benches are
provided with the thesis, but the paths of the source files used in the simulation must be
updated to match the local folder layout.
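As an illustration of that last step, a small Python helper of the following kind can retarget the hard-coded paths inside the test-bench sources in one pass. The naming convention ("*_tb.vhd"), the directory layout, and the old/new path prefixes below are hypothetical examples, not taken from the project files:

```python
from pathlib import Path

def retarget_testbench_paths(src_dir, old_prefix, new_prefix):
    """Replace a hard-coded directory prefix inside every VHDL
    test-bench file (assumed here to end in "_tb.vhd") found under
    src_dir. Returns the names of the files that were modified."""
    changed = []
    for vhd in sorted(Path(src_dir).rglob("*_tb.vhd")):
        text = vhd.read_text()
        new_text = text.replace(old_prefix, new_prefix)
        if new_text != text:
            vhd.write_text(new_text)   # rewrite the file in place
            changed.append(vhd.name)
    return changed
```

For example, `retarget_testbench_paths("src", "C:/old/dir", "./vectors")` would rewrite every string such as `"C:/old/dir/input.txt"` appearing in the test benches under `src` to `"./vectors/input.txt"`, so the simulator can find the stimulus files in the new location.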
The documentation of each project is located in the folder "html" and can be opened by
double-clicking on "index.html" inside that folder.