Chapter 1

Introduction to Block Codes

1.1 Introduction

Consider (see figure 1.1) the binary symmetric channel (BSC). Its crossover probability is denoted by p. Assume that 0 ≤ p ≤ 1/2.

""

""

""

""

""

""

"" bb

bb

bb

bb

bb

bb

bb-

¤£ ¡¢

¤£ ¡¢ 1

0

1

0

X Yp

p

1− p¤£ ¡¢

1− p¤£ ¡¢

Figure 1.1: Binary symmetric channel.

We observe that the output Y is a non-perfect reproduction of the input X (if p ≠ 0). Therefore one might think that it is impossible to communicate in a reliable way over a BSC. However, one of the principal results in Information Theory states that we can communicate reliably over this channel if we use coding.

1.2 Codes and channel capacity

A code with blocklength n consists of M binary codewords x1, x2, · · · , xM, each having n binary (∈ {0, 1}) components. The rate of such a code is defined as

R ≜ (1/n) log M bits per transmission.    (1.1)

Note that the base of the logarithm is assumed to be 2.


The transmitter (encoder) chooses one of the codewords, say x, and offers it to the channel. We assume that all codewords are chosen with equal probability 1/M. The response of the channel to the input sequence x is the output sequence y. Based on this output sequence the receiver (decoder) produces an estimate x̂ of the transmitted codeword.

The probability

PE ≜ Pr{X̂ ≠ X}    (1.2)

is the average error probability.

Example 1.1 A repetition code consists of two codewords whose components differ in every position, e.g. x1 = 00 · · · 0 and x2 = 11 · · · 1. For the rate of a repetition code we get R = (log 2)/n = 1/n.

Figure 1.2: Coding for the BSC with n = 3 and M = 2 (repetition code).

For blocklength n = 3 the receiver can choose e.g. x̂ = 000 for y = 000, 001, 010, or 100 and x̂ = 111 for y = 011, 101, 110, or 111 (see figure 1.2). This leads to an average error probability PE = 3p^2(1 − p) + p^3.
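As a small check (ours, not part of the original text), the following Python sketch sums the probabilities of all error patterns that a minimum-distance decoder cannot handle; the function name repetition_error_prob is hypothetical.

from itertools import product

def repetition_error_prob(p, n=3):
    # exact error probability of the length-n repetition code on a BSC with
    # crossover probability p under minimum-distance decoding (n odd, so no ties)
    pe = 0.0
    for pattern in product([0, 1], repeat=n):      # all 2^n error patterns
        weight = sum(pattern)
        if weight > n // 2:                        # majority of positions flipped: decoding error
            pe += p**weight * (1 - p)**(n - weight)
    return pe

p = 0.1
print(repetition_error_prob(p))           # 0.028
print(3 * p**2 * (1 - p) + p**3)          # 3p^2(1-p) + p^3 = 0.028, as in the text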

Shannon [16] proved the following result in 1948, using a random coding argument.

Theorem 1.1 For a discrete memoryless channel having channel capacity C, there exist codes of any rate R smaller than C with average error probability PE as small as we desire. The channel capacity C can be determined from the transition probability matrix of the channel.

The channel capacity of the BSC with crossover probability p is given by

CBSC = 1 − h(p) = 1 + p log(p) + (1 − p) log(1 − p),    (1.3)

where h(·) denotes the binary entropy function.
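A small numerical sketch (ours, not from the original text) of the binary entropy function and the resulting BSC capacity; it shows e.g. that the capacity drops to about 0.5 bit per transmission around p ≈ 0.11 and to 0 at p = 1/2.

from math import log2

def h(p):
    # binary entropy function in bits
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def bsc_capacity(p):
    # equation (1.3): C_BSC = 1 - h(p)
    return 1 - h(p)

for p in (0.0, 0.05, 0.11, 0.5):
    print(p, round(bsc_capacity(p), 3))   # 1.0, 0.714, 0.5, 0.0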


In figure 1.3 we have plotted this capacity as a function of p.

Figure 1.3: Capacity of the BSC as a function of the crossover probability p.

With his result Shannon created a number of problems:

• How do we find the codes that do exist according to Shannon’s result? This result tells us that codes that are chosen at random are likely to be good.

• How do we compute the error probability of specific codes? Shannon could only show that the average error probability averaged over an ensemble of codes is not too large. From this he could conclude that there should exist at least one code in the ensemble with a not too large average error probability.

• How do we implement these codes? The random codes that Shannon proposes do not have structure and are hopelessly complex for interesting blocklengths n. Are there codes with an acceptable complexity that achieve capacity?

These are the important problems in Coding Theory.

1.3 Decoding principles

1.3.1 Maximum-a-posteriori (MAP) decoding

Suppose that we use a certain code. What should the decoding rule be to achieve the smallest possible average error probability PE?


Note that

1 − PE = ∑_y Pr{X = x̂(y), Y = y}
       = ∑_y Pr{Y = y} Pr{X = x̂(y) | Y = y}
       ≤ ∑_y Pr{Y = y} max_x Pr{X = x | Y = y}.    (1.4)

If the decoder upon receiving y chooses x̂ such that

Pr{X = x̂ | Y = y} = max_x Pr{X = x | Y = y},    (1.5)

the upper bound for 1 − PE is achieved and thus the lowest possible value of the average error probability PE is obtained.

Note that Pr{X = x | Y = y} is the a-posteriori probability of the codeword x when y was received. The a-priori probability of the codeword x is Pr{X = x}.

Result 1.2 The MAP-decoding rule (after receiving y choose x̂ such that Pr{X = x̂ | Y = y} ≥ Pr{X = x | Y = y}, for all x) minimizes the average error probability PE. This is the most fundamental decoding rule.

1.3.2 Maximum-likelihood (ML) decoding

We have assumed that all codewords are equally probable, i.e. have probability 1/M. Hence

Pr{X = x | Y = y} = Pr{X = x, Y = y} / Pr{Y = y}
                  = Pr{X = x} Pr{Y = y | X = x} / Pr{Y = y}
                  = Pr{Y = y | X = x} / (M Pr{Y = y}).    (1.6)

Upon receiving y the a-posteriori probability Pr{X = x | Y = y} is now maximized if Pr{Y = y | X = x} is maximized. We get the smallest PE when the decoder chooses the x that maximizes the likelihood Pr{Y = y | X = x}.

Result 1.3 The ML-decoding rule (after receiving y choose x̂ such that Pr{Y = y | X = x̂} ≥ Pr{Y = y | X = x}, for all x) minimizes the average error probability PE when all codewords have the same probability 1/M.

From (1.6) we see that MAP-decoding requires, after reception of y, maximizing Pr{X = x} Pr{Y = y | X = x} over x, while ML-decoding requires maximizing only Pr{Y = y | X = x} over x, which depends only on the channel transition probabilities and is thus easier.


1.3.3 Minimum-distance (MD) decoding

Definition 1.1 Let dH(x, y) be the Hamming distance between x and y, i.e. the number of positions in which x and y differ from each other. We assume that the sequences x and y both have the same length n.

For the BSC we now get that

Pr{Y = y | X = x} = p^{dH(x,y)} (1 − p)^{n−dH(x,y)}
                  = (1 − p)^n (p/(1 − p))^{dH(x,y)}.    (1.7)

Since p/(1 − p) ≤ 1 we now achieve the smallest possible average error probability PE if the decoder chooses an x that has ‘minimum (Hamming) distance’ to the received y.

Result 1.4 The MD-decoding rule (after receiving y choose x̂ such that dH(x̂, y) ≤ dH(x, y), for all x) minimizes the average error probability PE when all codewords have the same probability 1/M and the channel is the BSC.
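As an aside (not in the original text), a brute-force minimum-distance decoder is only a few lines of Python; the helper names below are ours, and the example code is the repetition code of figure 1.2.

def hamming_distance(x, y):
    # number of positions in which x and y differ
    return sum(a != b for a, b in zip(x, y))

def md_decode(y, codewords):
    # Result 1.4: return a codeword closest to y in Hamming distance
    return min(codewords, key=lambda x: hamming_distance(x, y))

code = [(0, 0, 0), (1, 1, 1)]
print(md_decode((0, 1, 0), code))   # (0, 0, 0)
print(md_decode((1, 0, 1), code))   # (1, 1, 1)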

1.4 Minimum distance of a code and error correction

Definition 1.2 We define for a code the minimum (Hamming) distance as

dmin ≜ min_{x ≠ x′} dH(x, x′),    (1.8)

where x and x ′ are both codewords.

Result 1.5 For a code with minimum distance dmin a minimum-distance decoder is capable of correcting all error patterns containing not more than e errors if and only if 2e + 1 ≤ dmin.

Proof (a) Let dmin ≥ 2e + 1. Suppose that an error pattern occurred with not more than e errors, hence dH(x, y) ≤ e.

Figure 1.4: Error correction

Page 6: Block Codes

CHAPTER 1. INTRODUCTION TO BLOCK CODES 9

The triangle inequality holds for Hamming distance. It implies (see figure 1.4) that

dH(x, y) + dH(x′, y) ≥ dH(x, x′).    (1.9)

From this and dH(x, y) ≤ e we obtain for any x′ ≠ x that

dH(x′, y) ≥ dH(x, x′) − dH(x, y) ≥ dmin − e ≥ 2e + 1 − e = e + 1.    (1.10)

Hence a minimum distance decoder can always correct this error pattern containing not more than e errors.

(b) Now let dmin ≤ 2e. Note that for a code with minimum distance dmin there exists at least one triple x, x′ (≠ x), and y such that dH(x, y) = ⌈dmin/2⌉ and dH(x′, y) = ⌊dmin/2⌋. If dH(x, y) > dH(x′, y) an (optimal) minimum distance decoder will not make the right decision if x was transmitted and y received. In this case an error pattern containing ⌈dmin/2⌉ ≤ e errors occurred. If dH(x, y) = dH(x′, y) a minimum distance decoder will not make the right decision if x was transmitted and y received or if x′ was transmitted and y received. Also in this case an error pattern containing dmin/2 ≤ e errors occurred.

Thus if dmin ≤ 2e some patterns containing e or fewer errors are not corrected. □

We can conclude that for a code with minimum distance dmin all error patterns with

e = ⌊(dmin − 1)/2⌋ errors    (1.11)

can be corrected and not all patterns containing more than e errors. Since

1 − PE ≥ ∑_{i=0}^{e} (n choose i) p^i (1 − p)^{n−i}    (1.12)

it is important to make e, and hence dmin, of a code as large as possible.

1.5 Unstructured codes

For codes that do not have structure

• the encoder has to use a table with all M = 2^{nR} codewords, and

• the decoder has to compare all M = 2^{nR} codewords from a similar table with the received sequence y, OR the decoder has to use a table with 2^n entries, one for each possible Y, from which it can find x̂ immediately.

Hence for large values of n unstructured codes are not very practical.

Shannon’s random codes are with high probability unstructured codes. Only by making n large can we make the error probability PE acceptably small. Therefore these random codes are not very practical.

In the next sections we will consider codes with a simple structure: linear codes.


1.6 Linear codes: definition

We can consider a codeword x = (x1, x2, · · · , xn) as a (row-)vector with n components that are either 0 or 1.

Definition 1.3 The sum α + α′ of two scalars α ∈ {0, 1} and α′ ∈ {0, 1} is the modulo-2 sum of these scalars. The sum x + x′ of two vectors x and x′ is the componentwise modulo-2 sum of both vectors. The product αx of a vector x with a scalar α ∈ {0, 1} can be found by multiplying all components of x with α.

Definition 1.4 A collection x1, x2, · · · , xk of k vectors is called dependent when there exist k scalars αi ∈ {0, 1}, i = 1, 2, · · · , k (not all of them equal to 0), such that

∑_{i=1}^{k} αi xi = 0,

where 0 ≜ (0, 0, · · · , 0). If such a combination of k scalars does not exist the vectors are called independent.

Definition 1.5 The codewords of a linear code are linear combinations of a collection of k independent vectors x1, x2, · · · , xk, hence for each codeword x there exist k scalars αi ∈ {0, 1}, i = 1, 2, · · · , k such that

x = ∑_{i=1}^{k} αi xi, where αi ∈ {0, 1}.    (1.13)

There are 2^k such combinations. These combinations all result in different codewords since the spanning vectors x1, x2, · · · , xk are independent. Note that 0 is always a codeword. The rate of a linear code is thus equal to k/n (bit per code-symbol).

Example 1.2 For n = 5 and k = 3 and the independent vectors x1 = 11100, x2 = 00110, and x3 = 11111, we obtain the codewords 00000, 11100, 00110, 11010, 11111, 00011, 11001, and 00101. The rate of this code is 3/5.
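A minimal Python sketch (ours, not part of the original text) that generates these eight codewords by forming all modulo-2 linear combinations of the spanning vectors of example 1.2; the function name linear_code is hypothetical.

from itertools import product

def linear_code(spanning):
    # all 2^k modulo-2 combinations of the k spanning vectors (definition 1.5)
    n = len(spanning[0])
    words = set()
    for alphas in product([0, 1], repeat=len(spanning)):
        word = tuple(sum(a * x[i] for a, x in zip(alphas, spanning)) % 2
                     for i in range(n))
        words.add(word)
    return sorted(words)

span = [(1, 1, 1, 0, 0), (0, 0, 1, 1, 0), (1, 1, 1, 1, 1)]   # example 1.2
code = linear_code(span)
print(len(code))                               # 8 codewords
print(min(sum(c) for c in code if any(c)))     # smallest nonzero weight: 2 (cf. section 1.7)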

1.7 Linear codes: properties

Some properties of linear codes are given below.

• If x is a codeword and so is x′ then also x + x′ is a codeword.

Let x = ∑_{i=1}^{k} αi xi and x′ = ∑_{i=1}^{k} α′i xi; then we obtain that x + x′ = ∑_{i=1}^{k} (αi + α′i) xi is also a codeword.

• Definition 1.6 Let wH(x) be the Hamming weight of x, i.e. the number of components of x that are non-zero.

From dH(x, x′) = wH(x + x′) it then follows that:

dmin = min_{x ≠ x′} dH(x, x′) = min_{x ≠ x′} wH(x + x′) = min_{x ≠ 0} wH(x),    (1.14)


Figure 1.5: Two codewords x and x′ have the same environment.

in other words, in a linear code the codeword (≠ 0) with the smallest weight determines the minimum distance of the code.

Note that in (1.14) the last equality follows from the observation that the sum x + x′ is a codeword and that this sum runs over all codewords ≠ 0. This can be observed by taking x′ = 0.

Example 1.3 The minimum distance of the code in example 1.2 is 2 since 00011 is one of the codewords with the smallest Hamming weight ≠ 0.

• Let x and x′ be both codewords from a linear code. If also x + a is a codeword then so is x′ + a. If x + b is not a codeword then also x′ + b cannot be a codeword. See figure 1.5. This implies that all codewords in a linear code have the same environment.

• Linear codes ‘achieve capacity’ for the binary symmetric channel. This is perhaps the most important property of linear codes. This result follows from inspection of the random coding argument (see e.g. Gallager [10]).

1.8 The generator matrix G

If G is the k × n matrix with rows x1, x2, · · · , xk then we can read formula (1.13) as

x = αG.    (1.15)

Note that α is also a row-vector, but with k components. It consists of k information symbols that together form the message that is to be transmitted by means of the codeword.

Using this matrix G, the generator matrix of the code, the encoder can easily compute all the codewords. Therefore the encoder (and decoder) for a linear code only needs to store the k vectors that form the generator matrix and not the M = 2^k codewords.


Example 1.4 From n = 5 and k = 3 and x1 = 11100, x2 = 00110, and x3 = 11111 we obtain the generator matrix

G =
[ 1 1 1 0 0 ]
[ 0 0 1 1 0 ]
[ 1 1 1 1 1 ].

Then e.g. for α = 011 we get codeword x = 11001 since

(11001) = (011) [ 1 1 1 0 0 ]
                [ 0 0 1 1 0 ]
                [ 1 1 1 1 1 ].
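The encoding operation x = αG of (1.15) is just a modulo-2 vector-matrix product; a small sketch of this (ours, not from the original text) reproducing the computation of example 1.4:

def encode(alpha, G):
    # x = alpha G over GF(2); alpha has k bits, G has k rows of length n
    n = len(G[0])
    return tuple(sum(a * row[i] for a, row in zip(alpha, G)) % 2 for i in range(n))

G = [(1, 1, 1, 0, 0), (0, 0, 1, 1, 0), (1, 1, 1, 1, 1)]   # example 1.4
print(encode((0, 1, 1), G))    # (1, 1, 0, 0, 1), i.e. codeword 11001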

Definition 1.7 Systematic linear codes have the property that the i-th component of the codeword x is equal to the i-th component of the corresponding message α, hence

xi = αi, for i = 1, 2, · · · , k.    (1.16)

The generator matrix of a systematic code has the form G = [Ik P], where Ik is the k × k identity matrix and P the so-called parity-check part of the generator matrix.

The first k symbols of a codeword are the information symbols, the other n − k symbols are the parity-check symbols. These symbols are added in such a way that the parity-check equations are satisfied, as we shall soon see.

It is always possible to make a generator matrix systematic by adding rows to each other and interchanging columns of the matrix.

Example 1.5 Generator matrix G (see example 1.4) is brought in systematic form by adding row 1 to row 3, then adding row 2 to row 1, and then again row 3 to row 1 and row 2. In this way we obtain

G′ =
[ 1 1 0 0 1 ]
[ 0 0 1 0 1 ]
[ 0 0 0 1 1 ].

Subsequently we interchange columns 2 and 3 and then columns 3 and 4, and we get the “systematic” matrix

G′′ =
[ 1 0 0 1 1 ]
[ 0 1 0 0 1 ]
[ 0 0 1 0 1 ].

1.9 The parity-check matrix H

Assume that the generator matrix G of a linear code is systematic, i.e. G = [Ik P]. Then each codeword x1, x2, · · · , xn relates to the message α1, α2, · · · , αk as follows:

x1 = α1
· · ·
xk = αk
xk+1 = P1,1 α1 + · · · + Pk,1 αk
· · ·
xn = P1,n−k α1 + · · · + Pk,n−k αk.    (1.17)


This results in n − k equations for codeword x1, x2, · · · , xn:

P1,1 x1 + · · · + Pk,1 xk + xk+1 = 0
· · ·
P1,n−k x1 + · · · + Pk,n−k xk + xn = 0.    (1.18)

If the equations hold for a sequence x1, x2, · · · , xn this sequence has to be a codeword. To see this take αi = xi for i = 1, · · · , k.

The (independent) equations (1.18), the so-called parity-check equations, form an alternative way to describe a linear code. A sequence x is a codeword if and only if

H x^T = 0,    (1.19)

where the matrix H = [P^T In−k] is the so-called systematic parity-check matrix. Note that 0 is a column-vector.

Example 1.6 The parity-check matrix corresponding to the systematic matrix

G′′ =
[ 1 0 0 1 1 ]
[ 0 1 0 0 1 ]
[ 0 0 1 0 1 ]

(see example 1.5) is

H′′ =
[ 1 0 0 1 0 ]
[ 1 1 1 0 1 ].

By placing the columns back on their original positions we obtain for the code generated by G from example 1.4 a parity-check matrix

H =
[ 1 1 0 0 0 ]
[ 1 0 1 1 1 ].

1.10 Syndrome decoding

The parity-check matrix can be very useful in the decoding process. Assume that an output sequence y has been received. For this output sequence we can write that

y = x + e,    (1.20)

where x is the transmitted codeword and e the error vector that occurred.

For each fixed y there are now 2^k possible error vectors. All these error vectors are such that y + e is again a codeword. Therefore all possible error vectors are a solution of H(y + e)^T = 0, or H e^T = H y^T.

If we now define the syndrome as s ≜ H y^T (note that this is a column-vector), all possible error vectors e are solutions of

H e^T = s.    (1.21)


We have seen before that the smallest possible average error probability is achieved if we apply minimum-distance decoding. Therefore the decoder has to search for the (or for an) error vector e having the smallest possible weight and satisfying H e^T = s. This leads to the following decoding procedure:

Result 1.6 1. Compute the syndrome s = H y^T.

2. Determine the (or an) error vector z with the smallest Hamming weight satisfying H z^T = s.

3. Decode x̂ = y + z.

If both k and n − k are not too large, step 2 can be implemented as a simple ‘table-lookup’ action. Therefore we first determine the so-called standard-array corresponding to the code. We again consider our example code.

Example 1.7 The standard-array corresponding to the code from example 1.4 becomes

syndrome    coset leader
(0 0)^T     00000   00011 00101 00110 11001 11010 11100 11111
(0 1)^T     00100   00111 00001 00010 11101 11110 11000 11011
(1 0)^T     01000   01011 01101 01110 10001 10010 10100 10111
(1 1)^T     10000   10011 10101 10110 01001 01010 01100 01111

In this array the rows are formed by all words that have the same syndrome. These words together form a so-called coset. The (or a) word in the coset having the smallest Hamming weight is called the coset leader and is placed in the first column. The top row has syndrome 00 and consists of all codewords. The coset leader there is 00000.

Consider now the coset corresponding to syndrome s and assume that z is the coset leader, hence H z^T = s. For each codeword x from the top row we get that H(z + x)^T = H z^T + H x^T = H z^T = s. Hence codeword x corresponds to a word z + x in our coset. Therefore we place this word z + x in the column that contains codeword x. If we do this for all codewords we see that there are at least 2^k members in our coset. The number of cosets is 2^{n−k}. Since these cosets are all disjoint and all together they cannot contain more than 2^n elements, each coset should have exactly 2^k elements.

Note that by the structure of the standard-array the words in a column are all decoded in the same way, namely onto the codeword in the (top position of the) column. This is a consequence of the fact that y with its corresponding coset leader z determines x̂ = y + z. By construction this is equal to the codeword x in the column.

Using the standard-array we can set up a table that contains for each syndrome s the corresponding coset leader z, i.e. the word having smallest Hamming weight satisfying H z^T = s.

Syndrome decoding is optimal: it results in the smallest possible PE. However, it is also quite complex (time- or memory-consuming). Therefore we would like to find linear codes which admit decoding methods that are not so complex and which do achieve acceptable average error probabilities.
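As an illustration (not part of the original text), a small Python sketch of Result 1.6 for the parity-check matrix H of example 1.6. The coset-leader table is built by visiting all words in order of increasing weight, so the first word found for each syndrome is a valid leader; ties between equally light words may be resolved differently than in example 1.7.

from itertools import product

H = [(1, 1, 0, 0, 0), (1, 0, 1, 1, 1)]        # example 1.6

def syndrome(word):
    # H w^T over GF(2), returned as a tuple
    return tuple(sum(h * w for h, w in zip(row, word)) % 2 for row in H)

n = len(H[0])
leader = {}
for w in sorted(product([0, 1], repeat=n), key=sum):   # words by increasing weight
    leader.setdefault(syndrome(w), w)                  # first (lightest) word is the coset leader

def syndrome_decode(y):
    # Result 1.6: add the coset leader of the syndrome of y
    z = leader[syndrome(y)]
    return tuple((a + b) % 2 for a, b in zip(y, z))

print(syndrome_decode((1, 1, 1, 0, 1)))   # (1, 1, 1, 0, 0): a codeword nearest to y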


1.11 Hamming codes

The codes found in 1950 by Hamming [12] can be described best by their parity-check matrix H. This matrix consists of all 2^m − 1 different columns of size m, not including the 0-column.

Example 1.8 E.g. for m = 4 we obtain:

H =
[ 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 ]
[ 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 ]
[ 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 ]
[ 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 ]

A Hamming code can correct a single error. To see this, assume that an error occurred at position i. Then the error vector e consists of only zeroes except for the 1 on position i. The syndrome of the received sequence y is

H y^T = H(x + e)^T = H x^T + H e^T = H e^T = vi,    (1.22)

where vi is the i-th column in the parity-check matrix, which is ≠ 0. If no error occurred the syndrome is 0. Since by construction all columns are different and ≠ 0, the syndrome specifies whether or not an error occurred and, if so, the position where the error did occur. Therefore if a single error occurs it can always be corrected.

Since a Hamming code can correct at least one error its minimum distance dmin should be at least 3. However, the minimum distance of a Hamming code is exactly 3. This follows from the fact that there is at least one triple of columns in the parity-check matrix that are dependent, in the example the three leftmost columns. This implies that there is a codeword with Hamming weight 3, in the example 111000000000000. Hence dmin cannot be larger than 3 and therefore is exactly 3.

Result 1.7 The minimum Hamming distance of a Hamming code is dmin = 3, hence it is capable of correcting a single error. There are Hamming codes with

n = 2^m − 1,
k = n − m = 2^m − 1 − m,    (1.23)

for m = 2, 3, · · · . For m → ∞ the blocklength n → ∞ and the rate R = k/n → 1.
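A small sketch (ours, not from the original text) of this single-error correction mechanism: the columns of H are all nonzero binary m-tuples, so the syndrome of a single error equals the column at the error position.

from itertools import product

m = 3
columns = [c for c in product([0, 1], repeat=m) if any(c)]   # the 2^m - 1 nonzero columns
n = len(columns)                                             # 7 for m = 3

def syndrome(y):
    # j-th component of H y^T, where column i of H is columns[i]
    return tuple(sum(col[j] * y[i] for i, col in enumerate(columns)) % 2
                 for j in range(m))

def correct_single_error(y):
    s = syndrome(y)
    y = list(y)
    if any(s):                       # nonzero syndrome: flip the matching position
        y[columns.index(s)] ^= 1
    return tuple(y)

y = [0] * n
y[4] = 1                             # single error in the all-zero codeword
print(correct_single_error(tuple(y)))   # (0, 0, 0, 0, 0, 0, 0)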

1.12 The Hamming bound

Result 1.8 A code with blocklength n that is capable of correcting all error patterns of weight not more than e cannot have more than

2^n / [ 1 + (n choose 1) + (n choose 2) + · · · + (n choose e) ] codewords.    (1.24)

This is the so-called Hamming bound [12].


This result follows from the fact that each codeword can be received in 1 + (n choose 1) + (n choose 2) + · · · + (n choose e) different ways if e or fewer errors occur. All these possible received sequences together form the “decoding region” of the codeword. Since all the decoding regions corresponding to codewords have to be disjoint, and cannot contain more than 2^n elements in total, the bound follows.

For Hamming codes the Hamming bound is satisfied with equality. These codes can correct a single error. According to the Hamming bound these codes cannot have more than 2^n/[1 + n] codewords. For a Hamming code n = 2^m − 1 for some m ∈ {2, 3, · · · }, thus we get the bound 2^{2^m − 1}/[1 + 2^m − 1] = 2^{2^m − 1 − m}, which is equal to the actual number of codewords in the Hamming code.

Codes that satisfy the Hamming bound with equality are called perfect. Therefore Hamming codes are perfect.

1.13 The Gilbert-Varshamov bound for linear codes

1.13.1 The bound

Result 1.9 Consider a linear code with blocklength n, minimum distance dmin ≥ d and 2^k codewords. If

2^k [ 1 + (n choose 1) + · · · + (n choose d − 1) ] < 2^n    (1.25)

there exists a linear code with the same blocklength n and minimum distance also ≥ d, but with 2^{k+1} codewords. This results in the so-called Gilbert-Varshamov bound [11], [17].

Proof Consider all “spheres” with radius d − 1 having a codeword as center. These spheres possibly overlap each other but together contain certainly not more than 2^k [ 1 + (n choose 1) + · · · + (n choose d − 1) ] sequences. If this number is less than the total number of sequences 2^n of length n, thus if inequality (1.25) holds, there should exist a word that does not belong to any of the spheres. This sequence z therefore has distance at least d to all codewords. Now note that z and the k row-vectors that span our linear code are independent. This follows from the fact that the k row-vectors are independent and z is not a codeword, i.e. not a linear combination of the k row-vectors.

We now add the word z as a row-vector to the generator matrix of our linear code. The new linear code has 2^{k+1} codewords. We now have to check the Hamming weights of the “new” codewords x + z where x is an ‘old’ codeword. It follows that

wH(x + z) = dH(x, z) ≥ d,    (1.26)

hence also the minimum distance of the new code is ≥ d. □

From this result, by increasing k, it follows that there exists at least one linear code of blocklength n with dmin ≥ d and

2^k ≥ 2^n / [ 1 + (n choose 1) + · · · + (n choose d − 1) ].    (1.27)


1.13.2 Asymptotic version

To investigate the asymptotic behaviour of the Gilbert-Varshamov bound we increase the blocklength n and we consider the rate of the linear codes with dmin ≥ δn that must exist according to the Gilbert-Varshamov bound. If we assume that 0 < δ < 1/2 then

R = log(2^k)/n ≥ log( 2^n / [ 1 + (n choose 1) + · · · + (n choose ⌈δn⌉ − 1) ] ) / n
  ≥ 1 − (1/n) log ∑_{0≤i≤δn} (n choose i)
  ≥ 1 − h(δ).    (1.28)

The first inequality follows from the Gilbert-Varshamov result in (1.27). The last inequality is a consequence of (see also van Lint [14], p. 9)

1 = (δ + (1 − δ))^n (a)≥ ∑_{0≤i≤δn} (n choose i) δ^i (1 − δ)^{n−i}
  = ∑_{0≤i≤δn} (n choose i) (1 − δ)^n (δ/(1 − δ))^i
  (b)≥ ∑_{0≤i≤δn} (n choose i) (1 − δ)^n (δ/(1 − δ))^{δn} = δ^{δn} (1 − δ)^{(1−δ)n} ∑_{0≤i≤δn} (n choose i).    (1.29)

Here (a) follows from the binomial theorem and (b) from the fact that δ/(1 − δ) ≤ 1 and i ≤ δn.

If we want to use these codes on the BSC with crossover probability p we have to take δ larger than 2p to get an arbitrarily small average error probability. E.g. take δ = 2p + ε for some ε > 0. Then with probability approaching 1 for n → ∞, by the law of large numbers, the number e of errors that actually occurred satisfies

δn = (2p + ε)n > 2e + 1. (1.30)

Then, since dmin ≥ δn > 2e + 1, the message can be reconstructed and therefore the average error probability can be made arbitrarily small by increasing n.

Since we can choose ε > 0 arbitrarily small, there exist for the BSC with 0 ≤ p ≤ 1/4 codes for all

R > 1 − h(2p)    (1.31)

for which the average error probability can be made arbitrarily small by increasing n.

This is essentially less than what Shannon promised (see figure 1.6). Since so far no binary code constructions have been found that asymptotically achieve more than the Gilbert-Varshamov bound for n → ∞, there is still a lot to be discovered in Coding Theory.
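A short numerical comparison (ours, not part of the original text) of the capacity 1 − h(p) and the rate 1 − h(2p) guaranteed by the Gilbert-Varshamov argument, as plotted in figure 1.6:

from math import log2

def h(x):
    # binary entropy function
    return 0.0 if x in (0.0, 1.0) else -x * log2(x) - (1 - x) * log2(1 - x)

for p in (0.01, 0.05, 0.10):
    print(p, round(1 - h(p), 3), round(1 - h(2 * p), 3))
# p      1-h(p)   1-h(2p)
# 0.01   0.919    0.859
# 0.05   0.714    0.531
# 0.10   0.531    0.278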

NOTE: One should realize that the concept of minimum distance plays no role in the random coding arguments given by Shannon. So perhaps we should not focus our attention on minimum distance. This is, however, what has mainly been done in the first three or four decades of Coding Theory.


Figure 1.6: The asymptotic Gilbert-Varshamov bound and Shannon capacity versus crossover probability p on the horizontal axis.

1.14 Introduction BCH codes

Ten years after the discovery of the Hamming code, codes were invented that were capable of correcting two or more errors. These codes are named BCH codes after their inventors (Bose and Ray-Chaudhuri [9] and Hocquenghem [13]).

To describe these codes we consider a Hamming code with parameter m = 4, hence n = 15 (and k = 11). The parity-check matrix of this code consists of 15 different columns vi ≠ 0, i = 1, 2, · · · , 15 of four elements ∈ {0, 1}, thus

H = (v1, v2, · · · , v15).    (1.32)

To be able to correct one error we need 4 parity-check equations. It is not unreasonable to assume that with 8 parity-check equations we can correct two errors. Therefore we expand our parity-check matrix by another collection of 15 columns wi, i = 1, 2, · · · , 15 of four elements, hence

H′ = ( v1 v2 · · · v15 )
     ( w1 w2 · · · w15 ).    (1.33)

The question now arises what column wi has to be added to vi, for i = 1, 2, · · · , 15? Since all vi, i = 1, 2, · · · , 15 are different, we can rephrase this question as: What is the function F(·) that maps v onto w?

To be able to investigate all these functions F(·) it would be nice if we could add, subtract, multiply and divide columns. We have already seen that adding columns is easy: simply take the modulo-2 sum componentwise. Subtraction then is the same as addition. But how should we then multiply and divide?


To make this possible we associate with each column a polynomial: polynomial 0 with column (0, 0, 0, 0)^T, 1 with (1, 0, 0, 0)^T, x with (0, 1, 0, 0)^T, x + 1 with (1, 1, 0, 0)^T, · · · , and x^3 + x^2 + x + 1 with (1, 1, 1, 1)^T. All these polynomials have degree at most 3. In what follows we will call these polynomials elements.

We can now perform the following operations on these 16 elements:

• ADDITION: (x^3 + x + 1) + (x^2 + x + 1) = x^3 + x^2. No problem.

• SUBTRACTION: Same as addition.

• MULTIPLICATION: Since (x^3 + x + 1)(x^2 + x + 1) = x^5 + x^4 + 1 has degree > 3, we have a problem. However, we can reduce the result modulo some polynomial M(x) of degree 4. Take e.g. M(x) = x^4 + x + 1, then we get x^5 + x^4 + 1 : x^4 + x + 1 = x + 1 with remainder x^2, thus x^5 + x^4 + 1 = x^2 mod M(x).

• DIVISION: We can only divide if each element (except 0) has a unique inverse (also an element). Such an inverse exists if we choose the polynomial M(x) to be irreducible. A polynomial is irreducible if it can only be divided by itself and by 1 (a ‘prime’ polynomial). For an explanation see appendix A.

The 16 elements together with the described operations (addition, multiplication) form the Galois field GF(16).
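A compact Python sketch (ours, not part of the original text) of this arithmetic, representing each element as a 4-bit integer whose bit i is the coefficient of x^i; the helper names gf_mul and gf_inv are ours.

M = 0b10011                      # M(x) = x^4 + x + 1, irreducible over GF(2)

def gf_mul(a, b):
    # multiply two elements of GF(16), reducing modulo M(x)
    prod = 0
    while b:
        if b & 1:
            prod ^= a
        b >>= 1
        a <<= 1
        if a & 0b10000:          # degree reached 4: subtract (XOR) M(x)
            a ^= M
    return prod

def gf_inv(a):
    # unique multiplicative inverse, by exhaustive search (see appendix A)
    return next(b for b in range(1, 16) if gf_mul(a, b) == 1)

a, b = 0b1011, 0b0111            # x^3 + x + 1 and x^2 + x + 1
print(bin(gf_mul(a, b)))         # 0b100, i.e. x^2, as computed above
print(gf_mul(a, gf_inv(a)))      # 1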

We now want to be specific about the columns. Now assume that two errors occurred, one on position i, the other one on position j. Then for the syndrome that consists of two columns (elements), hence s = (s1, s2)^T, we get that

vi + vj = s1 ≠ 0,
F(vi) + F(vj) = s2.    (1.34)

It would be nice if we could set up a quadratic equation that has vi and vj as roots. Hence

(v − vi)(v − vj) = v^2 + (vi + vj)v + vi vj = 0 mod M(x).    (1.35)

That vi and vj are the only solutions of this equation is demonstrated in Appendix A.

It now appears that F(v) = v^3 makes it possible to express the coefficients in equation (1.35) in terms of the syndrome elements s1 and s2. We get

s1 = vi + vj
s2 = vi^3 + vj^3 = (vi + vj)(vi^2 + vj^2 + vi vj)    (1.36)
   = (vi + vj)((vi + vj)^2 + vi vj)
   = s1(s1^2 + vi vj),    (1.37)


hence

vi + vj = s1,
vi vj = s2/s1 + s1^2.    (1.38)

Therefore vi and vj are the only roots of the quadratic equation

v^2 + s1 v + (s2/s1 + s1^2) = 0.    (1.39)

This equation can be solved easily by successively substituting v1, v2, · · · , v15 and checking whether equality is achieved.

Example 1.9 If s1 = x^3 + x^2 and s2 = x^3 + x^2 + 1, then we obtain the quadratic equation

v^2 + (x^3 + x^2)v + ( (x^3 + x^2 + 1)/(x^3 + x^2) + (x^3 + x^2)^2 ) = 0.

The roots of this equation are x^3 + x + 1 and x^2 + x + 1. The corresponding positions are the error positions.

If only one error occurs then s2 = s1^3. If no errors occur then s1 = s2 = 0.
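A small sketch (ours, not from the original text) that solves (1.39) for example 1.9 by trying all 15 nonzero field elements; gf_mul and gf_inv are the GF(16) helpers sketched in section 1.14, repeated here so the fragment is self-contained, and error_positions is a hypothetical name.

M = 0b10011                                   # M(x) = x^4 + x + 1

def gf_mul(a, b):
    prod = 0
    while b:
        if b & 1:
            prod ^= a
        b >>= 1
        a <<= 1
        if a & 0b10000:
            a ^= M
    return prod

def gf_inv(a):
    return next(b for b in range(1, 16) if gf_mul(a, b) == 1)

def error_positions(s1, s2):
    # roots of v^2 + s1 v + (s2/s1 + s1^2) = 0, equation (1.39)
    c = gf_mul(s2, gf_inv(s1)) ^ gf_mul(s1, s1)
    return [v for v in range(1, 16) if gf_mul(v, v) ^ gf_mul(s1, v) ^ c == 0]

# example 1.9: s1 = x^3 + x^2, s2 = x^3 + x^2 + 1
print([bin(v) for v in error_positions(0b1100, 0b1101)])
# ['0b111', '0b1011'], i.e. x^2 + x + 1 and x^3 + x + 1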

Result 1.10 There are double-error correcting BCH codes with

n = 2^m − 1,
k = n − 2m = 2^m − 1 − 2m,    (1.40)

for m = 2, 3, · · · . The minimum Hamming distance of a double-error correcting BCH code is dmin ≥ 5. For m → ∞ the blocklength n → ∞ and R = k/n → 1.

Note that the BCH decoder that we have described here is not an MD-decoder. We have only demonstrated that two or fewer errors can always be corrected. It is not discussed what happens when more than two errors occur.

If instead of two we want to be able to correct t errors, we can use mt parity-check equations [9], [13]. Asymptotically BCH codes do not achieve the Gilbert-Varshamov bound (except for rate R = 1).

1.15 The weight enumerator

Consider a linear code with blocklength n. The weight enumerator of this code is the polynomial

A(z) = A0 + A1 z + A2 z^2 + · · · + An z^n,    (1.41)

where Aw is the number of codewords with Hamming weight w.


Theorem 1.11 If we use a code with weight enumerator A(z) on the BSC with crossover probability p we obtain the following upper bound for the average error probability:

PE ≤ A(γ) − 1, with γ = 2√(p(1 − p)).    (1.42)

The proof of this theorem can be found in Appendix B.

Example 1.10 For the (linear) code consisting of the two codewords 00000 and 11111 the weight enumerator is A(z) = 1 + z^5. The upper bound for the average error probability is therefore 32 p^{5/2} (1 − p)^{5/2}. The exact average error probability is 10p^3(1 − p)^2 + 5p^4(1 − p) + p^5 = 10p^3 − 15p^4 + 6p^5. Hence the bound is not very good.
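A numerical sketch of this comparison (ours, not from the original text); weight_enumerator and union_bound are hypothetical helper names.

from math import sqrt

def weight_enumerator(codewords):
    # coefficients A_0, ..., A_n of A(z), equation (1.41)
    n = len(codewords[0])
    A = [0] * (n + 1)
    for c in codewords:
        A[sum(c)] += 1
    return A

def union_bound(codewords, p):
    # theorem 1.11: P_E <= A(gamma) - 1 with gamma = 2 sqrt(p(1-p))
    gamma = 2 * sqrt(p * (1 - p))
    return sum(a * gamma**w for w, a in enumerate(weight_enumerator(codewords))) - 1

code = [(0,) * 5, (1,) * 5]              # the repetition code of example 1.10
p = 0.01
print(union_bound(code, p))              # about 3.1e-4
print(10*p**3 - 15*p**4 + 6*p**5)        # exact value, about 1.0e-5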

1.16 Hard and soft decisions

1.16.1 Hard decisions

The binary symmetric channel is actually a model for transmitting antipodal signals over an additive Gaussian noise channel and then making so-called hard decisions at the output (see figure 1.7).

Figure 1.7: Antipodal signaling over an additive white Gaussian noise channel. Hard and soft output.

In each transmission i = 1, 2, · · · , n a signal si ∈ {−√Es, +√Es} is chosen as channel input. Here Es is the signal power. Gaussian noise ni disturbs the signal additively, i.e.

ri = si + ni,

and the density function of the noise variable Ni is

pNi(n) = (1/√(2πσ^2)) exp(−n^2/(2σ^2)), with σ^2 = N0/2.    (1.43)

The receiver observes the output ri of the Gaussian channel and then decides first whether the sign of ri is positive or negative. This leads to a variable yi that is either +1 or −1. The vector of variables y1, y2, · · · , yn is now used for decoding.


What is now the probability p that e.g. si = −√Es is sent and yi = +1 is received? It can be seen that this occurs only if ni ≥ √Es, hence

p = ∫_{√Es}^{∞} (1/√(2πσ^2)) exp(−α^2/(2σ^2)) dα = Q(√Es/σ) = Q(√(2Es/N0)),

where Q(x) ≜ (1/√(2π)) ∫_{x}^{∞} exp(−α^2/2) dα. Similarly the probability that si = √Es is sent and yi = −1 is received is equal to Q(√(2Es/N0)). Note that this implies that the channel with input x ∈ {−1, +1} and output y is binary and symmetric, if we assume that s = x√Es. Exactly this gives us the motivation for investigating codes for the BSC.

1.16.2 Soft decisions, improvement study

Instead of using the vector y1, y2, · · · , yn of hard decisions for decoding we can use the so-called soft decisions r1, r2, · · · , rn (see figure 1.7). Clearly this can only improve the decoder’s performance. To see what improvement can be achieved we will study the transmission of two codewords x and x′ that have Hamming distance dH(x, x′) = d. We express our results in terms of Es/N0, which is (half) the signal-to-noise ratio.

We know from theorem 1.11 that for hard decisions

PE^hard(d) ≤ (2√(p(1 − p)))^d ≤ (4p)^{d/2} = (4Q(√(2Es/N0)))^{d/2}.    (1.44)

We can now use the following upper bound for the Q-function

Q(x) ≤ (1/2) exp(−x^2/2), for x ≥ 0.    (1.45)

This leads to

PE^hard(d) ≤ (2 exp(−Es/N0))^{d/2} = 2^{d/2} exp(−d Es/(2N0)).

What is now the corresponding error probability for the soft-decision case? Observe that since dH(x, x′) = d the corresponding signal vectors s and s′ have squared Euclidean distance 4d Es. Therefore

PE^soft(d) = Q(√(d Es)/σ) = Q(√(2d Es/N0)) ≤ (1/2) exp(−d Es/N0).

We can now compare the performance of soft versus hard decisions. If we ignore the coefficient in front of the exponential function, we see that in the soft-decision case we need only half as much signal energy as in the hard-decision case (in other words, we gain 3.01 dB). Note that to come to this conclusion we did assume that all bounds are tight!
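A short numerical sketch (ours, not part of the original text) of this comparison for two codewords at Hamming distance d, using the exact Q-function via Python’s math.erfc; the helper names are ours.

from math import sqrt, erfc

def Q(x):
    # Gaussian tail function
    return 0.5 * erfc(x / sqrt(2))

def hard_bound(d, es_n0):
    # bound (1.44): (4 Q(sqrt(2 Es/N0)))^(d/2)
    return (4 * Q(sqrt(2 * es_n0))) ** (d / 2)

def soft_exact(d, es_n0):
    # P_E^soft(d) = Q(sqrt(2 d Es/N0))
    return Q(sqrt(2 * d * es_n0))

d, es_n0 = 5, 2.0                 # Es/N0 = 2, i.e. about 3 dB
print(hard_bound(d, es_n0))       # roughly 2.5e-3
print(soft_exact(d, es_n0))       # roughly 3.9e-6: soft decisions are far better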

What if we have a (linear) code with many codewords and not just two? In that case we can obtain an upper bound on the average error probability in terms of the weight enumerator of the code, just like in the hard-decision case.


Theorem 1.12 If we use a code with weight enumerator A(z) on the AWGN channel with antipodal signaling with amplitude √Es and noise variance N0/2, we obtain the following upper bound for the average error probability:

PE ≤ (1/2)[A(γ) − 1], with γ = exp(−Es/N0).    (1.46)

Appendix A: Finding inverses and solving equations in a Galois field

The question that we want to solve here is what it means that

r(x)s(x) = 0 mod M(x)    (1.47)

for two polynomials (not necessarily elements) r(x) and s(x), where we assume that M(x) is irreducible and has degree 4.

Now r(x) could be divisible by a polynomial p(x) ≠ 1 and s(x) could be divisible by a polynomial q(x) ≠ 1 with p(x)q(x) = M(x). Then r(x)s(x) would be a multiple of M(x) and (1.47) would be satisfied without r(x) or s(x) being a multiple of M(x) (or 0).

However, since M(x) is irreducible, there do not exist polynomials p(x) ≠ 1 and q(x) ≠ 1 such that p(x)q(x) = M(x). Therefore r(x) or s(x) is divisible by M(x), hence (1.47) implies that

r(x) = 0 mod M(x) or s(x) = 0 mod M(x).

For a more complete proof see [1], page 82.

We will now use this implication twice.

1. Assume that element a(x) ≠ 0. We are now interested in knowing the inverse of this element. We therefore multiply a(x) with b1(x) and b2(x), which are both elements. Assume that both multiplications give the same result, hence a(x)b1(x) mod M(x) = a(x)b2(x) mod M(x), thus

a(x)(b1(x) + b2(x)) = 0 mod M(x).

Since a(x) ≠ 0 and not divisible by M(x) (the degree of a(x) is smaller than 4) it is necessary that b1(x) + b2(x) = 0 mod M(x). The sum b1(x) + b2(x) of two elements cannot be divided by M(x) since the degrees of b1(x) and b2(x) are both smaller than 4. Therefore b1(x) + b2(x) = 0, hence b1(x) = b2(x).

The 16 different elements b(x) thus give 16 different products. Since a product can only assume one out of 16 possible values, exactly one of these 16 products is equal to 1. The corresponding element b(x) is the unique inverse of the element a(x).

2. Consider two elements a(x) and b(x) and the quadratic equation

(e(x)+ a(x))(e(x)+ b(x)) = 0 mod M(x),


in e(x), where e(x) is an element. Can this equation have other solutions than e(x) = a(x) and e(x) = b(x)? The answer is no: this is an immediate consequence of the implication, since e(x) + a(x) = 0 mod M(x) implies that e(x) = a(x), because e(x) and a(x) are elements and have degree less than 4, etc.

Appendix B: Upper bound for PE in terms of the weight enumerator

In the proof we assume that the decoder is an MD-decoder (which is optimal here). Let codeword x1 be the actually transmitted codeword.

Define for m = 2, 3, · · · , M the sets of output sequences

Ym ≜ {y : dH(y, xm) ≤ dH(y, x1)}.    (1.48)

If y ∉ Ym for all m = 2, 3, · · · , M certainly no error occurs. Then for all m ≠ 1 we get that

dH (y, xm) > dH (y, x1)

and an MD-decoder decodes x1. Hence for the error probability PE^1 ≜ Pr{X̂ ≠ x1 | X = x1} we get that

1 − PE^1 ≥ Pr{Y ∉ Y2 ∩ Y ∉ Y3 ∩ · · · ∩ Y ∉ YM | X = x1},

in other words

PE^1 ≤ 1 − Pr{Y ∉ Y2 ∩ Y ∉ Y3 ∩ · · · ∩ Y ∉ YM | X = x1}
     = Pr{Y ∈ Y2 ∪ Y ∈ Y3 ∪ · · · ∪ Y ∈ YM | X = x1}
     ≤ ∑_{m=2,3,··· ,M} Pr{Y ∈ Ym | X = x1},

where in the last inequality we use the “union” bound. The union bound in its most simple form states that Pr{A ∪ B} ≤ Pr{A} + Pr{B}.

Now fix m ≠ 1. Then

Pr{Y ∈ Ym | X = x1} = Pr{dH(Y, xm) ≤ dH(Y, x1) | X = x1}.

What does it mean that dH(y, xm) ≤ dH(y, x1)? Take a look at figure 1.8. In this figure we see the transmitted codeword x1 and the alternative codeword xm. On a number of positions x1 and xm are equal; on the remaining positions, whose number is dH(xm, x1), x1 and xm differ from each other.

dH(x1, y) = # errors in =-positions + # errors in ≠-positions
dH(xm, y) = # errors in =-positions + dH(xm, x1) − # errors in ≠-positions.


Figure 1.8: Actually transmitted codeword, alternative codeword, and received word.

Note that an extra error in the ≠-positions results in an increment of dH(x1, y) and in a decrement of dH(xm, y). We therefore conclude that

dH(y, xm) ≤ dH(y, x1) ⇔ # errors in ≠-positions ≥ dH(xm, x1)/2.    (1.49)

For dH (xm, x1) = d this leads to

Pr{Y ∈ Ym | X = x1} = Pr{dH(Y, xm) ≤ dH(Y, x1) | X = x1}
                    = ∑_{j=⌈d/2⌉}^{d} (d choose j) p^j (1 − p)^{d−j} =: Pe(d).    (1.50)

We will now compute Pe(d) from (1.50). First we assume that d = 2k is even. Then

Pe(d) = ∑_{j=k}^{2k} (2k choose j) p^j (1 − p)^{2k−j}
      = p^k (1 − p)^k ∑_{j=k}^{2k} (2k choose j) (p/(1 − p))^{j−k}
      ≤ p^k (1 − p)^k ∑_{j=k}^{2k} (2k choose j)
      ≤ p^k (1 − p)^k 2^{2k} = (2√(p(1 − p)))^{2k},    (1.51)


where we have used that 0 ≤ p ≤ 1/2. For odd d = 2k + 1 we obtain

Pe(d) = ∑_{j=k+1}^{2k+1} (2k+1 choose j) p^j (1 − p)^{2k+1−j}
      = p^{k+1/2} (1 − p)^{k+1/2} ∑_{j=k+1}^{2k+1} (2k+1 choose j) (p/(1 − p))^{j−k−1/2}
      ≤ p^{k+1/2} (1 − p)^{k+1/2} ∑_{j=k+1}^{2k+1} (2k+1 choose j)
      ≤ p^{k+1/2} (1 − p)^{k+1/2} 2^{2k+1} = (2√(p(1 − p)))^{2k+1}.    (1.52)

Finally we get that

Pr{Y ∈ Ym | X = x1} ≤ (2√(p(1 − p)))^{dH(xm,x1)}.    (1.53)

This results in the following upper bound for the average error probability conditioned on the fact that x1 is transmitted:

PE^1 ≤ ∑_{m=2,3,··· ,M} (2√(p(1 − p)))^{dH(xm,x1)}.

Since the code is linear, the number of codewords at distance w from x1 is equal to the number of codewords with weight w, hence equal to Aw. The consequence of this is that

PE^1 ≤ ∑_{w=1}^{n} Aw (2√(p(1 − p)))^w = A(γ) − 1.

For any codeword other than x1 we can prove the same. This finishes the proof.

1.17 Exercises

1. Show that the Hamming distance dH defined in definition 1.1 satisfies the following properties of a bona fide metric:

(a) dH(x, x) = 0.

(b) dH(x, y) > 0 if x ≠ y.

(c) dH(x, y) = dH(y, x).

(d) dH(x, y) ≤ dH(x, z)+ dH(z, y).

(Problem 7.4 from McEliece [15])


2. Suppose you were approached by a communications engineer who told you that his (binary) channel accepts words of length n and that the only kind of error pattern ever observed is one of the n + 1 patterns (000000, 000001, 000011, 000111, 001111, 011111, 111111, illustrated for n = 6). Design a linear (n, k) code that will correct all such patterns with as large a rate as possible. Illustrate your construction for n = 7.

(Problem 7.15 from McEliece [15])

3. Consider our example BCH code. Let the syndromes be s1 = x^2 + x + 1 and s2 = x^2. What are the columns (elements) vi and vj that correspond to the error positions?

4. The following result (due to Van de Meeberg [25]) shows how to strengthen theorem 1.11 for a BSC.

(a) Show that the bound (1.53) can be improved to Pe(d) ≤ γ^{dH(x1,xm)+1} if dH(x1, xm) is odd. [Hint: it can be shown that the error probability of a repetition code of length 2n is the same as for one of length 2n − 1.]

(b) Hence show that theorem 1.11 can be improved to PE ≤ (1/2)[(1 + γ)A(γ) + (1 − γ)A(−γ)] − 1.

(From problem 7.26 from McEliece [15])

5. Show that

Q(x) ≤ (1/2) exp(−x^2/2), for x ≥ 0.