How to Construct Polar Codes - arXiv · How to Construct Polar Codes Ido Tal and Alexander Vardy Abstract—A method for efﬁciently constructing polar codes is presented and analyzed

How to Construct Polar Codes

Ido Tal and Alexander Vardy

Abstract—A method for efficiently constructing polar codesis presented and analyzed. Although polar codes are explicitlydefined, a straightforward construction is intractable, since theresulting bit channels typically have an output alphabet sizewhich is exponential in the code length. Thus, the core problemthat needs to be solved is: how does one faithfully approximatea bit channel with an intractably large output alphabet byanother bit channel having a manageable output alphabet size.We devise two approximation methods which “sandwich” theoriginal bit channel. Both approximations turn out to be veryclose in practice, and can be efficiently computed.

I. INTRODUCTION

Polar codes [1] have recently been invented by Arıkan. Inhis seminal paper, Arıkan introduced polar codes in the contextof a binary-input, memoryless, output-symmetric channel overwhich information is to be sent. In this setting, polar codeshave an explicit construction and are known to be capacityachieving. More so, they have corresponding efficient encod-ing and decoding algorithms. To date, no other family of codesis known to posses all of these favorable traits. Since [1], polarcodes have been generalized both in definition and application.In [2], polar codes are generalized to the case of memorylessoutput-symmetric channels with non-binary input alphabets. In[3], the polarization phenomenon is studied for a broad familyof kernel matrices and error exponents are derived for eachsuch kernel matrix (generalizing the work of [4] for Arıkan’soriginal 2 × 2 kernel). In terms of applications, polar codeshave been used in the context of MAC channels [5], wiretapchannels [6], and compression [7][8]. However, in this paper,we restrict ourselves to the study of the original setting studiedin [1]. Namely, a binary-input, memoryless, output-symmetricchannel with the standard 2× 2 kernel matrix.

Although the construction of polar codes is explicit, there isonly one known instance — the binary erasure channel (BEC)— in which the construction is also efficient. A first attemptat an efficient construction of polar codes in the general casewas made by Mori and Tanaka [9] [10]. In [9], the authorsshow that a key step in the channel construction can beviewed as an instance of density evolution [11]. Thus, theypropose a method utilizing convolution, where the numberof convolutions needed is of the order of the code length.However, as indeed noted in [9], it is not clear how one would

The material in this paper was presented in part at the Workshop onInformation Theory, Dublin, Ireland, August 2010.

Ido Tal is with the Department of Electrical and Computer Engineering,University of California San Diego, La Jolla, CA 92093–0407, U.S.A. (e-mail: [email protected]).

Alexander Vardy is with the Department of Electrical and Com-puter Engineering, the Department of Computer Science and Engineering,and the Department of Mathematics, University of California San Diego, LaJolla, CA 92093–0407, U.S.A. (e-mail: [email protected]).

implement such a convolution to be sufficiently exact on theone hand and tractable on the other.

Our aim in this paper is to provide a method by which polarcodes can be efficiently constructed. We do so by buildingupon the ideas derived in [9]. An exact implementation ofthe convolutions discussed in [9] implies an algorithm withmemory requirements exponential in the code length, and thusimpractical. Alternatively, one could use quantization (bin-ning) to try and reduce the memory requirements. However,for a quantization scheme to be of interest, it must satisfytwo requirements. First, it must be fast enough, which usuallytranslates into a rather small number of quantization levels(bins). Second, after the calculations have been carried out, wemust be able to interpret them in a precise manner. Namely,the quantization operation induces an inherent inaccuracy inthe calculation, which we should be able take into account soas to ultimately make a precise statement.

Our main contribution are two approximation methods. Inboth methods, the memory limitations are specified, and notexceeded. One method is used to get a lower bound on theprobability of error of each channel while the other is used toobtain an upper bound. The quantization used to derive a lowerbound on the probability of error is called a degrading quan-tization, while the other is called an upgrading quantization.Both quantization transform the “current channel” into a newone with a smaller output alphabet. The degrading quantizationresults in a channel degraded with respect to the original one,while the upgrading quantization results in a channel such thatthe original channel is degraded with respect to it.

Figure 1 shows that, in practice, the two approximationsare typically very tight. Recall from [1, Proposition 2] that anupper bound PFER = PFER(k, n) on the frame error rate of apolar code of dimension k is obtained as follows: we order thebit channels in ascending order according to their misdecodingprobability1 and sum the first k entries. Through our degradingand upgrading procedures, we are able to find upper and lowerbounds on PFER, respectively. As can be seen in Figure 1,these bounds effectively coincide. As an example, let us fixPFER to 10−6 and consider upper and lower bounds on the rateR = k/n. For the left graph we have a (constructive) lowerbound of R ≥ 0.4247 and an upper bound of R ≤ 0.4248.For the right graph, we have 0.9732 ≤ R ≤ 0.9737. We notethat for all the plots presented in this paper, the parameter µ(defined later on) did not exceed 512.

The structure of this paper is as follows. In Section II, webriefly review polar codes and set up notation. Section III

1In [1], Arıkan uses the Bhattacharyya parameter instead of the misdecodingprobability, but as we explain shortly, this is of no real importance.

arX

iv:1

105.

6164

v2 [

cs.I

T]

5 O

ct 2

011

Prob

abili

ty o

f Err

or

Rate0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

−18

−16

−14

−12

−10

−8

−6

−4

−2

0

2

4

= 1,048,576nn = 1024

BSC(0.11)

Rate0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 0.99 1

−18

−16

−14

−12

−10

−8

−6

−4

−2

0

2

4

Prob

abili

ty o

f Err

or

= 1,048,576n

= 1024n

BSC(0.001)

Fig. 1. Upper and lower bounds on log10 PFER for two codeword lengthsand two channels, as a function of rate R = k/n.

is devoted to the exposition of the degrading and upgradingrelations, and their consequences for us. In Section IV, we givea high level description of our algorithms for approximatinga channel. The missing details in Section IV are filled in inSection V. Namely, we show how one can reduce the outputalphabet size of a channel by 2 to get either a degraded orupgraded version of the original channel. In Section VI, weshow how one can either degrade or upgrade a channel witha continuous output alphabet to a channel with a finite outputalphabet of specified size. Finally, in Section VII, we discusshow one can find improvements to the general algorithms fora specialized case. The improved algorithm is then analyzedin Section VIII.

II. POLAR CODES

In this section we briefly review polar codes with theprimary aim of setting up notation. We also show where thedifficulty of constructing polar codes lies.

Denote by W the underlying memoryless channel throughwhich we are to transmit information. Since W is a channel, it

has an associated input alphabet X and output alphabet Y; wedenote this by W : X → Y . The probability of observingy ∈ Y given that x ∈ X was transmitted through W isdenoted by W(y|x). We assume throughout that W has binaryinput and so X = {0, 1}. We require that W is symmetric.Namely, there exists a permutation π : Y → Y such that: i)π−1 = π and ii) W(y|1) = W(π(y)|0) (see [12, Page 94] foran equivalent definition). When the permutation is understoodfrom the context, we abbreviate π(y) as y. For now, we willfurther assume that the output alphabet Y of W is finite. Thisassumption will be justified in Section VI, in which we showhow to get from a continuous output alphabet to a discreteone.

Denote the length of the codewords we will be transmittingover W by n = 2m. For y = (yi)

n−1i=0 and u = (ui)

n−1i=0 ,

let Wn(y|u) be shorthand for∏n−1i=0 W(yi|ui). That is, Wn

corresponds to n independent uses of the channel W. Akey paradigm introduced in [1] is that of transforming nidentical copies (independent uses) of the channel W into n

bit-channels. Namely, for 0 ≤ i < n, bit channel W(m)i =Wi

has a binary input alphabet X and output alphabet Yn × X i.For input ui ∈ X and output (y,ui−1

0 ) ∈ Yn × X i theprobability Wi((y,u

i−10 )|ui) is defined as follows. Let G be

the kernel matrix introduced in [1],

G =

(1 01 1

).

Let G⊗m be the m-fold Kronecker product of G and let Bmbe the n× n bit-reversal matrix defined in [1]. Then,

W(m)i ((y,ui−1

0 )|ui) =

1

2n−1

∑

v∈{0,1}n−1−i

Wn(y|(ui−10 , ui,v) ·BmG⊗m) . (1)

In essence, constructing a polar code of dimension k isequivalent to finding the k “best” bit-channels. Namely, in[1], one is instructed to choose the k bit-channels with thelowest corresponding Bhattacharyya bounds. We note that thechoice of ranking according to Bhattacharyya bounds stemsfrom the relative technical ease of manipulating them. A morestraightforward criterion would have been to rank according tothe probability of misdecoding ui given the input (y,ui−1

0 ),and this is the criterion we will follow here. Since Wi is welldefined through (1), this task is indeed explicit, and thus sois the construction of a polar code. However, note that theoutput alphabet size of each bit channel is exponential in n,and thus a straightforward evaluation of the ranking criterion isintractable for all but the shortest of codes. Our main objectivewill be to circumvent this difficulty.

As a first step towards achieving our goal, we recall thatboth W(m+1)

2i and W(m+1)2i+1 can be constructed recursively

from W(m)i according to (22) and (23) in [1], respectively.

Thus, the explosion in output alphabet size happens in stages:going from stage m to m + 1 roughly squares the outputalphabet size. Thus, our revised aim will be to control thegrowth at each stage. To keep the notation minimal, we do

originalchannel

W

anotherchannel

P︸︷︷︸

degraded channel Q(a) Degrading

upgradedchannel

Q′

anotherchannel

P︸︷︷︸

original channel W

(b) Upgrading

Fig. 2. Degrading and upgrading a channel W

the following. Recall that in [1], Equations (22) and (23) area specialization of (17) and (18) with the f function beingessentially a reshuffling of terms. In a slight abuse of notation,we henceforth take f as the identity function. Next, let Y(m)

i

be the output alphabet ofW(m)i . We adopt the notation in [13]

and denote

W(m+1)2i = (W(m)

i�W(m)

i ) , (2)

W(m+1)2i+1 = (W(m)

i�W(m)

i ) , (3)

where for y1, y2 ∈ Y(m)i and u1, u2 ∈ X

(W(m)i

�W(m)i )(y1, y2|u1)

def=

∑

u2∈X

1

2W(m)i (y1|u1 ⊕ u2)W(m)

i (y2|u2) (4)

and

(W(m)i

�W(m)i )(y1, y2, u1|u2)

def=

1

2W(m)i (y1|u1 ⊕ u2)W(m)

i (y2|u2) (5)

III. CHANNEL DEGRADATION AND UPGRADATION

As previously outlined, our solution to the explosion ingrowth of the output alphabet Y(m)

i is to replace the chan-nel W(m)

i by an approximation. In fact, we will have twoapproximations, one yielding a “better” channel and the otheryielding a “worse” one. In this section, we formalize thesenotions.

We say that a channel Q : X → Z is degraded with respectto W : X → Y , if there exists a channel P : Y → Z suchthat for all z ∈ Z and x ∈ X ,

Q(z|x) =∑

y∈YW(y|x) · P(z|y) . (6)

We write Q �W to denote that Q is degraded with respectto W.

In the interest of brevity and clarity later on, we also definethe inverse relation: we say that a channel Q′ : X → Z ′ is

upgraded with respect to W : X → Y if there exists a channelP : Z ′ → Y such that for all z′ ∈ Z ′ and x ∈ X ,

W(y|x) =∑

z′∈Z′Q′(z′|x) · P(y|z′) . (7)

Namely, Q′ can be degraded to W. Similarly, we write thisas Q′ �W.

By definition,

W1 �W2 if and only if W2 �W1 . (8)

Also, it is easily shown that both “degraded” and “upgraded”are transitive relations:

If W1 �W2 and W2 �W3 then W1 �W3 .(9)

If W1 �W2 and W2 �W3 then W1 �W3 .(10)

Lastly, since a channel is both degraded and upgraded withrespect to itself (take the intermediate channel as the identityfunction), we have that both relations are reflexive:

W �W and W �W . (11)

If a channel Weq is both degraded and upgraded with re-spect to W, then we say that W and Weq are equivalent, anddenote this by W ≡Weq. Since “degraded” and “upgraded”are transitive relations, it follows from (9) and (10) that the“equivalent” relation is transitive as well:

If W1 ≡W2 and W2 ≡W3 then W1 ≡W3 .(12)

Also, by (8), we have that “equivalent” is a symmetric relation:

If W1 ≡W2 then W2 ≡W1 . (13)

Lastly, since a channel W is both upgraded and degradedwith respect to itself, we have by (11) that “equivalent” is areflexive relation:

W ≡W . (14)

Let W : X → Y be a given binary-input, memoryless,output-symmetric (BMS) channel. We now set the notationfor three quantities of interest. i) Denote by Pe(W) the prob-ability of error under Maximum Likelihood (ML) decoding,where ties are broken arbitrarily, and the input distribution isBernoulli(1/2). That is,

Pe(W) =∑

y∈Y:W(y|0)<W(y|1)

W(y|0) +∑

y∈Y:W(y|0)=W(y|1)

W(y|0)/2 . (15)

ii) Denote by Z(W) the Bhattacharyya parameter,

Z(W) =∑

y∈Y

√W(y|0)W(y|1) . (16)

iii) Denote by I(W) the capacity,∑

y∈Y

∑

x∈X

1

2W(y|x) log

W(y|x)12W(y|0) + 1

2W(y|1).

The following lemma states that these three quantities behaveas expected with respect to the degrading and upgradingrelations. The equation most important to us will be (17).

Lemma 1: Let W : X → Y be a BMS channel and letQ : X → Z be degraded with respect to W, that is, Q �W.Then,

Pe(Q) ≥ Pe(W) , (17)Z(Q) ≥ Z(W) , and (18)I(Q) ≤ I(W) . (19)

Moreover, all of the above continues to hold if we replace“degraded” by “upgraded”, � by �, and reverse the inequal-ities. Specifically, if W ≡ Q, then the weak inequalities arein fact equalities.

Proof: We regard only the first part, since the “Moreover”part follows easily from Equation (8). Equation (17) followsfrom the fact that an ML decoder minimizes the probabilityof a decoding failure, together with the observation that theoutput of Q can be “simulated”, given the output of W.Equation (18) is proved in [13, Lemma 1.8]. Equation (19)is a simple consequence of the data-processing inequality [14,Theorem 2.8.1].

Note that it may be the case that y and y are the samesymbol (an erasure). It would make our proofs simpler ifthis special case was assumed not to happen. We will indeedassume this later on, with the next lemma providing most ofthe justification.

Lemma 2: Let W : X → Y be a BMS channel. Thereexists a BMS channel Weq : X → Z such that i) Weq isequivalent to W, and ii) for all z ∈ Z we have that z and zare distinct.

Proof: If W is such that for all y ∈ Y we have that yand y are distinct, then we are done, since we can take Weq

equal to W.Otherwise, let y? ∈ Y be such that y? and y? are the same

symbol. Let the alphabet Z be defined as follows:

Z = (Y \ {y?}) ∪ {z1, z2} ,where z1 and z2 are new symbols, not already in Y . Now,define the channel Weq : X → Z as follows. For all z ∈ Zand x ∈ X ,

Weq(z|x) =

{W(z|x) if z ∈ Y ,12W(y?|x) if z = z1 or z = z2.

We first show that Weq � W. To see this, take theintermediate channel P : Z → Y as the channel that maps(with probability 1) z1 and z2 to y?, and all other symbolsto themselves. Next, we show that Weq � W. To see this,define the intermediate channel P : Y → Z as follows.

P(z|y) =

1 if z = y,12 if y = y? and z ∈ {z1, z2},0 otherwise.

To sum up, we have constructed a new channel Weq whichis equivalent to W, and contains one less problematic symbol

(y? was replaced by the pair z1, z2). It is also easy to see thatWeq is BMS. We can now apply this construction over andover, until the resulting channel has no problematic symbols.

Now that Lemma 2 is proven, we will indeed assume fromthis point forward that all channels are BMS and have nooutput symbols y such that y and y are equal. As we willshow later on, this assumption does not limit us. Moreover,given a generic BMS channel W : X → Y , we will furtherassume that for all y ∈ Y , at least one of the probabilitiesW(y|0) and W(y|0) is positive (otherwise, we can removethe pair of symbols y, y from the alphabet, since they cannever occur).

Given a channel W : X → Y , we now define for eachoutput symbol y ∈ Y an associated likelihood ratio, denotedLRW(y). Specifically,

LRW(y) =W(y|0)

W(y|0)

(if W(y|0) = 0, then we must have by assumption thatW(y|0) > 0, and we define LRW(y) =∞). If the channel Wis understood from the context, we will abbreviate LRW(y)to LR(y).

IV. A HIGH LEVEL DESCRIPTION OF THE ALGORITHM

In this section, we give a high level description of ouralgorithms of approximating a bit channel. We then showhow these approximations can be used in order to constructa polar code, and also gauge the penalty incurred due toapproximation.

In order to completely specify the approximating algo-rithms, one has to supply two merging functions, a degrad-ing merging function degrading_merge and an upgradedmerging function upgrading_merge. We will now definethe properties required of our merging functions, leaving thespecification of the functions we have actually used to the nextsection. The next section will also make clear why we havechosen to call these functions “merging”.

For a degrading merge function degrading_merge, thefollowing must hold. For a BMS channel W and positiveinteger µ, the output of degrading_merge(W, µ) is aBMS channel Q such that i) Q �W is degraded with respectto W, and ii) The size of the output alphabet of Q is at mostµ. We define the properties required of upgrading_mergesimilarly, but with “degraded” replaced by “upgraded” and �by �.

Let 0 ≤ i < n be an integer with binary representationi = 〈b1, b2, . . . , bm〉2, where b1 is the most significant bit.Algorithms 1 and 2 contain our procedures for finding a de-graded and upgraded approximation of the bit channel W(m)

i ,respectively. In words, we employ the recursive constructions(2) and (3), taking care to reduce the output alphabet size ofeach intermediate channel from at most 2µ2 (apart possiblyfrom the underlying channel W) to at most µ.

The key to proving the correctness of Algorithms 1 and 2is the following lemma. It is essentially a restatement of [13,Lemma 4.7]. For completeness, we restate the proof as well.

Algorithm 1: A high level description of the degradingprocedure

input : An underlying BMS channel W, a bound µ onthe output alphabet size, an indexi = 〈b1, b2, . . . , bm〉2.

output: A BMS channel that is degraded with respect tothe bit channel W(m)

i .1 Q← degrading_merge(W, µ);2 for j = 1, 2, . . . ,m do3 if bj = 0 then4 W← Q � Q5 else6 W← Q � Q

7 Q← degrading_merge(W, µ);

8 return Q;

Algorithm 2: A high level description of the upgradingprocedureinput : An underlying BMS channel W, a bound µ on

the output alphabet size, an indexi = 〈b1, b2, . . . , bm〉2.

output: A BMS channel that is upgraded with respect tothe bit channel W(m)

i .1 Q’← upgrading_merge(W, µ);2 for j = 1, 2, . . . ,m do3 if bj = 0 then4 W← Q’ � Q’5 else6 W← Q’ � Q’

7 Q’← upgrading_merge(W, µ);

8 return Q’;

Lemma 3: Fix a binary input channel W : X → Y , anddenote

W� = W�W , W� = W�W .

Next, letQ �W be a degraded with respect to W, and denote

Q� = Q�Q , Q� = Q�Q .

Then,Q� �W� and Q� �W� .

Namely, the degradation relation is preserved by the thechannel transformation operation.

Moreover, all of the above continues to hold if we replace“degraded” by “upgraded” and � by �.

Proof: We will prove only the “degraded” part, since itimplies the “upgraded” part (by interchanging the roles of Wand Q).

Let P : Y → Z be the channel which degrades W to Q:for all z ∈ Z and x ∈ X ,

Q(z|x) =∑

y∈YW(y|x)P(z|y) . (20)

We first prove Q� �W�. By (4) applied to Q, we get thatfor all z ∈ Y and u1 ∈ X ,

Q�((z1, z2)|u1) =∑

u2∈X

1

2Q(z1|u1 ⊕ u2)Q(z2|u2) .

Next, we expand Q twice according to (20), and get

Q�((z1, z2)|u1) =∑

(y1,y2)∈Y2

∑

u2

1

2W (y1|u1⊕u2)W (y2|u2)P(z1|y1)P(z2|y2) .

By (4), this reduces to

Q�((z1, z2)|u1) =∑

(y1,y2)∈Y2

W�((y1, y2)|u1)P(z1|y1)P(z2|y2) . (21)

Next, define the channel P∗ : Y2 → Z2 as follows. For all(y1, y2) ∈ Y2 and (z1, z2) ∈ Z2,

P∗((z1, z2)|(y1, y2)) = P(z1|y1)P(z2|y2) .

It is easy to prove that P∗ is indeed a channel (we get aprobability distribution on Z2 for every fixed (y1, y2) ∈ Y2).Thus, (21) reduces to

Q�((z1, z2)|u1) =∑

(y1,y2)∈Y2

W�((y1, y2)|u1)P∗((z1, z2)|(y1, y2)) ,

and we get by (6) that Q� �W�. The claim Q� �W� isproved in much the same way.

Proposition 4: The output of Algorithm 1 (Algorithm 2) isa BMS channel that is degraded (upgraded) with respect toW(m)i .

Proof: The proof follows easily from Lemma 3, byinduction on j.

Recall that, ideally, a polar code is constructed as follows.We are given an underlying channel W : X → Y , a specifiedcodeword length n = 2m, and a target block error rateeBlock. We choose the largest possible subset of bit-channelsW(m)i such that the sum of their bit-misdecoding probabilities

Pe(W(m)i ) is not greater than eBlock. The resulting code is

spanned by the rows in BmG⊗m corresponding to the subset

of chosen bit-channels. Denote the rate of this code as Rexact.Since we have no computational handle on the bit channels

W(m)i , we must resort to approximations. Let Q(m)

i be theresult of running Algorithm 1 on W and i. Since Q(m)

i �W(m)i , we have by (17) that Pe(Q(m)

i ) ≥ Pe(W(m)i ). Note

that since the output alphabet of Q(m)i is small (at most µ),

we can actually compute Pe(Q(m)i ). We now mimic the ideal

construction by choosing the largest possible subset of indicesfor which the sum of Pe(Q(m)

i ) is at most eBlock. Note thatfor this subset we have that the sum of Pe(W(m)

i ) is at mosteBlock as well. Thus, the code spanned by the correspondingrows of BmG

⊗m is assured to have block error probability ofat most eBlock.

W:

Q:

a1

b1a1

b1 a2

b2a2

b2

y1 y1y2 y2

a1+a2b1+b2a1+a2

b1+b2

z1,2 z1,2

(a)

1

1

1

1

z1,2

z1,2

y1

y2

y1

y2

P(b)

Fig. 3. Degrading W to Q. (a) The entry in the first/second row of a channelis the probability of receiving the corresponding symbol, given that a 0/1 wastransmitted. (b) The intermediate channel P .

Denote the rate of this code by Rdegraded. It is easy tosee that Rdegraded ≤ Rexact. In order to gauge the differencebetween the two rates, we compute a third rate, Rupgraded,such that Rupgraded ≥ Rexact and consider the differenceRupgraded − Rdegraded. The rate Rupgraded is computed thesame way that Rdegraded is, but instead of using Algorithm 1we use Algorithm 2. Recall from Figure 1 that Rdegraded andRupgraded are typically very close.

V. MERGING FUNCTIONS

In this section, we construct the alphabet size reduc-ing degrading and upgrading functions refereed to asdegrading_merge and upgrading_merge in Algo-rithms 1 and 2, respectively. For now, let us treat our functionsas heuristic (delaying their analysis to Section VIII). Theproblem of constructing optimal degrading and upgradingfunctions (as well as finding a good definition of “optimal”)are, as far as we know, open.

A. Degrading-merge function

The next lemma shows how one can reduce the outputalphabet size by 2, and get a degraded channel. It is our firststep towards defining a valid degrading_merge function.

Lemma 5: Let W : X → Y be a BMS channel, and lety1 and y2 be symbols in the output alphabet Y . Define thechannel Q : X → Z as follows (see Figure 3(a)). The outputalphabet Z is given by

Z = Y \ {y1, y1, y2, y2} ∪ {z1,2, z1,2} .For all x ∈ X and z ∈ Z , define

Q(z|x) =

W(z|x) if z 6∈ {z1,2, z1,2},W(y1|x) + W(y2|x) if z = z1,2,W(y1|x) + W(y2|x) if z = z1,2.

Then Q �W. That is, Q is degraded with respect to W.Proof: Take the intermediate channel P : Y → Z as

the channel that maps with probability 1 as follows (seeFigure 3(b)): both y1 and y2 map to z1,2, both y1 and y2

map to z1,2, other symbols map to themselves. Recall that wehave assumed that W does not contain an erasure symbol, andthis continues to hold for Q.

We now define the degrading_merge function we haveused. It gives good results in practice and affords a fast imple-mentation. Assume we are given a BMS channel W : X → Ywith an alphabet size of 2L (recall our assumption that y andy are unique), and wish to reduce its alphabet size to µ, whiletransforming W into a degraded version of itself. If 2L ≤ µ,then we are done, since we can take the degraded version of Wto be W itself. Otherwise, we do the following. Recall that foreach y we have that LR(y) = 1/LR(y), where in this context1/0 =∞ and 1/∞ = 0. Thus, our first step is to choose fromeach pair (y, y) a representative such that LR(y) ≥ 1. Next,we order these L representative as y1, y2, . . . , yL such that forall 1 ≤ i ≤ L− 1

LR(yi) ≤ LR(yi+1) .

We now ask the following: for which index 1 ≤ i ≤ L − 1does the channel resulting from the application of Lemma 5to W, yi, and yi+1 result in a channel with largest mutualinformation? After finding the maximizing index i we indeedapply Lemma 5 and get a degraded channelQ with an alphabetsize smaller by 2 than that of W. The same process is appliedto Q, until the output alphabet size is not more than µ.

In terms of complexity, note that finding the index i forwhich the mutual information of Q is maximized is equiv-alent to finding the index i for which the change in mutualinformation between W and Q is minimized. Assuming weare supplied the probabilities of receiving yi, yi, yi+1, andyi+1, given that 0 we transmitted, the calculation of the changewith respect to the index i takes O(1) time. More so, afterthe merging operation takes place, all but a constant numberof such calculations carried out for the previous step can bereused for the current step (since all but a constant number ofprobabilities stay the same). Assuming that the output alphabetsize of W is at most 2µ2, we conclude that by using the heap[15, Section 6] and doubly-linked-list [15, Subsection 10.2]data structures, one can implement the degrading_mergefunction to run in O(µ2 logµ) time.

Note that at first sight, it may seem as though there mightbe an even better heuristic to employ. As before, assume thatthe yi are ordered according to their likelihood ratios, andall of these are at least 1. Instead of limiting the applicationof Lemma 5 to yi and yi+1, we can broaden our search andconsider the penalty in merging arbitrary yi and yj wherei 6= j. Indeed, we could further consider merging arbitrary yiand yj where i 6= j. Clearly, this broader search will incur acomplexity penalty. However, as the next theorem shows, wewill essentially gain nothing by it.

Theorem 6: Let W : X → Y be a BMS channel, with

Y = {y1, y2, . . . , yL, y1, y2, . . . , yL} .

W:

Q′:

a1

b1a1

b1 a2

b2a2

b2

y1 y1y2 y2

a2+α2b2+β2a2+α2

b2+β2

z2 z2

a2

b2≥ a1

b1α2

β2=

a2

b2

α2+β2 = a1 + b1

(a)

p2

p2

p1

p1

p3

p3

z2

z2

y2

y1

y2

y1

P

γ =a1 − β2

α2 − β2

p1 =α2γ

a2 + α2

p2 =a2

a2 + α2

p3 =α2(1− γ)

a2 + α2

(b)

Fig. 4. First method of Upgrading W to Q′. (a) The entry in the first/secondrow of a channel is the probability of receiving the corresponding symbol,given that a 0/1 was transmitted. (b) The intermediate channel P .

Assume that

1 ≤ LR(y1) ≤ LR(y2) ≤ · · · ≤ LR(yL) .

For symbols w1, w2 ∈ Y , denote by I(w1, w2) the symmetriccapacity of the channel one gets by the application of Lemma 5to w1 and w2. Then, for all distinct 1 ≤ i ≤ L and 1 ≤ j ≤ L,

I(yi, yj) = I(yi, yj) ≥ I(yi, yj) = I(yi, yj) . (22)

Moreover, for all 1 ≤ i < j < k ≤ L we have that either

I(yi, yj) ≥ I(yi, yk) ,

orI(yj , yk) ≥ I(yi, yk) .

Since it is somewhat long, we have relegated the proofof Theorem 6 to Appendix A. Although the initial lemmasneeded for the proof are rather intuitive, the latter seem tobe a lucky coincidence (probably due to a lack of a deeperunderstanding on the authors’ part). The prime example seemsto be Equation (71) in the proof of Lemma 27.

B. Upgrading-merge functions

The fact that one can merge symbol pairs and get a degradedversion of the original channel should come as no surprise.However, it turns out that we can also merge symbol pairsand get an upgraded version of the original channel. We firstshow a simple method of doing this. Later on, we will show aslightly more complex method, and compare between the two.

As in the degrading case, we show how to reduce the outputalphabet size by 2, and then apply this method repeatedly asmuch as needed. The following lemma shows how the corereduction can be carried out. The intuition behind it is simple.Namely, now we “promote” a pair of output symbols to have ahigher LR value, and then merge with an existing pair havingthat LR.

Lemma 7: Let W : X → Y be a BMS channel, and lety2 and y1 be symbols in the output alphabet Y . Denote λ2 =LR(y2) and λ1 = LR(y1). Assume that

1 ≤ λ1 ≤ λ2 . (23)

Next, let a1 = W(y1|0) and b1 = W(y1|0). Define α2 andβ2 as follows. If λ2 <∞

α2 = λ2a1 + b1λ2 + 1

β2 =a1 + b1λ2 + 1

. (24)

Otherwise, we have λ2 =∞, and so define

α2 = a1 + b1 β2 = 0 . (25)

For real numbers α, β, and x ∈ X , define

t(α, β|x) =

{α if x = 0,β if x = 1.

Define the channel Q′ : X → Z ′ as follows (see Figure 4(a)).The output alphabet Z ′ is given by

Z ′ = Y \ {y2, y2, y1, y1} ∪ {z2, z2} .For all x ∈ X and z ∈ Z ′,

Q′(z|x) =

W(z|x) if z 6∈ {z2, z2},W(y2|x) + t(α2, β2|x) if z = z2,W(y2|x) + t(β2, α2|x) if z = z2.

Then Q′ �W. That is, Q′ is upgraded with respect to W.Proof: Denote a2 = W(y2|0) and b2 = W(y2|0). First,

note thata1 + b1 = α2 + β2 .

Next, let γ be defined as follows. If λ2 > 1, let

γ =a1 − β2

α2 − β2=b1 − α2

β2 − α2,

and note that (23) implies that 0 ≤ γ ≤ 1. Otherwise (λ1 =λ2 = 1), let

γ = 1 .

Define the intermediate channel P : Z ′ → Y as follows.

P(y|z) =

1 if z 6∈ {z2, z2} and y = z,α2γa2+α2

if (z, y) ∈ {(z2, y1), (z2, y1)},a2

a2+α2if (z, y) ∈ {(z2, y2), (z2, y2)},

α2(1−γ)a2+α2

if (z, y) ∈ {(z2, y1), (z2, y1)},0 otherwise.

Notice that when λ2 <∞, we have that

a2

a2 + α2=

b2b2 + β2

andα2

a2 + α2=

β2

b2 + β2.

Some simple calculations finish the proof.The following corollary shows that we do not “lose any-

thing” when applying Lemma 7 to symbols y1 and y2 suchthat LR(y1) = LR(y2). Thus, intuitively, we do not expect tolose much when applying Lemma 7 to symbols with “close”LR values.

Corollary 8: Let W, Q′, y1, and y2 be as in Lemma 7.If LR(y1) = LR(y2), then Q′ ≡ W. That is, W and Q′are equivalent. More so: all of the above holds if we replace“Lemma 7” by “Lemma 5”.

W:

Q′:

a1

b1a1

b1 a2

b2a2

b2 a3

b3a3

b3

y1 y1y2 y2y3 y3

a1+α1b1+β1a1+α1

b1+β1

z1 z1

a3+α3b3+β3a3+α3

b3+β3

z3 z3

a1

b1<

a2

b2<

a3

b3

α1

β1=

a1

b1α3

β3=

a3

b3

α1+α3 = a2

β1+β3 = b2

(a)

p3

p3

q3

q3

q1

q1

p1

p1

z3

z3

z1

z1

y3

y2

y1

y3

y2

y1

P

p1 =a1

a1 + α1

q1 =α1

a1 + α1

p3 =a3

a3 + α3

q3 =α3

a3 + α3

(b)

Fig. 5. Second method of Upgrading W to Q′. (a) The entry in thefirst/second row of a channel is the probability of receiving the correspondingsymbol, given that a 0/1 was transmitted. (b) The intermediate channel P .

Proof: The proof follows by noticing that the channel Q′we get by applying Lemma 7 to W, y1, and y2, is exactly thesame channel we get if we apply Lemma 5 instead. Thus, wehave both Q′ �W and Q′ �W.

In Lemma 7, we have essentially transferred the probabilityW(y1|0) + W(y1|0) onto a symbol pair with a higher LRvalue. We now show a different method of merging thatinvolves dividing the probability W(y1|0)+W(y1|0) betweena symbol pair with higher LR value and a symbol pair withlower LR value. As we will prove latter on, this new approachis generally preferable.

Lemma 9: Let W : X → Y be a BMS channel, and lety1, y2, and y3 be symbols in the output alphabet Y . Denoteλ1 = LR(y1), λ2 = LR(y2), and λ3 = LR(y3). Assume that

1 ≤ λ1 < λ2 < λ3 .

Next, let a2 = W(y2|0) and b2 = W(y2|0). Define α1, β1,α3, β3 as follows. If λ3 <∞

α1 = λ1λ3b2 − a2

λ3 − λ1β1 =

λ3b2 − a2

λ3 − λ1, (26)

α3 = λ3a2 − λ1b2λ3 − λ1

β3 =a2 − λ1b2λ3 − λ1

. (27)

Otherwise, we have λ3 =∞, and so define

α1 = λ1b2 β1 = b2 , (28)α3 = a2 − λ1b2 β3 = 0 . (29)

Let t(α, β|x) be as in Lemma 7, and define the BMS channelQ′ : X → Z ′ as follows (see Figure 5(a)). The output alphabetZ ′ is given by

Z ′ = Y \ {y1, y1, y2, y2, y3, y3} ∪ {z1, z1, z3, z3} .

For all x ∈ X and z ∈ Z ′, define

Q′(z|x) =

W(z|x) if z 6∈ {z1, z1, z3, z3},W(y1|x) + t(α1, β1|x) if z = z1,W(y1|x) + t(β1, α1|x) if z = z1,W(y3|x) + t(α3, β3|x) if z = z3,W(y3|x) + t(β3, α3|x) if z = z3.

Then Q′ �W. That is, Q′ is upgraded with respect to W.Proof: Denote a1 = W(y1|0), b1 = W(y1|0), a3 =

W(y3|0), and b3 = W(y3|0). Define the intermediate channelP : Z ′ → Y as follows.

P(y|z) =

1 if z 6∈ {z3, z3, z1, z1} and y = z,a1

a1+α1= b1

b1+β1if (z, y) ∈ {(z1, y1), (z1, y1)},

α1

a1+α1= β1

b1+β1if (z, y) ∈ {(z1, y2), (z1, y2)},

a3a3+α3

if (z, y) ∈ {(z3, y3), (z3, y3)},α3

a3+α3if (z, y) ∈ {(z3, y2), (z3, y2)},

0 otherwise.

Notice that when λ3 <∞, we have thata3

a3 + α3=

b3b3 + β3

andα3

a3 + α3=

β3

b3 + β3.

The proof follows by observing that, whatever the value ofλ3,

α1 + α3 = a2 and β1 + β3 = b2 .

The following Lemma formalizes why Lemma 9 results ina merging operation that is better than that of Lemma 7.

Lemma 10: Let W, y1, y2, and y3 be as in Lemma 9.Denote by Q′123 : X → Z ′123 the result of applying Lemma 9to W, y1, y2, and y3. Next, denote by Q′23 : X → Z ′23

the result of applying Lemma 7 to W, y2, and y3. ThenQ′23 � Q′123 �W. Namely, in a sense,Q′123 is a more faithfulrepresentation of W than Q′23 is.

Proof: Recall that the two alphabets Z ′123 and Z ′23 satisfy

Z ′123 = {z1, z1, z3, z3} ∪ A ,

Z ′23 = {y1, y1, z3, z3} ∪ A ,

whereA = Y \ {y1, y1, y2, y2, y3, y3}

is the set of symbols not participating in either merge opera-tion.

In order to prove that Q′123 is degraded with respect toQ′23, we must supply a corresponding intermediate channelP : Z ′23 → Z ′123. To this end, let

λ3 =Q′123(z3|0)

Q′123(z3|1)=Q′23(z3|0)

Q′23(z3|1)=

W(y3|0)

W(y3|1)

andγ =

Q′123(z3|0)

Q′23(z3|0)=Q′123(z3|1)

Q′23(z3|1).

Note that in both Lemma 9 and 7 we have that α3/β3 = λ3.Next, we recall that in Lemma 7 we have that α3+β3 = a2+b2

whereas in Lemma 9 we have α3 + β3 = a2 + b2 − α1 − β1.Thus, we conclude that 0 ≤ γ ≤ 1. Moreover, since 0 ≤ γ ≤1, we conclude that the following definition of an intermediatechannel is indeed valid.

P(z123|z23) =

1 if z123 = z23 and z123 ∈ A,1 if (z23, z123) ∈ {(y1, z1), (y1, z1)},γ if (z23, z123) ∈ {(z3, z3), (z3, z3)},(1−γ)λ1

λ1+1 if (z23, z123) ∈ {(z3, z1), (z3, z1)},(1−γ)λ1+1 if (z23, z123) ∈ {(z3, z1), (z3, z1)},0 otherwise.

A short calculation shows that P is indeed an intermediatechannel that degrades Q′23 to Q′123.

At this point, the reader may be wondering why we havechosen to state Lemma 7 at all. Namely, it is clear whatdisadvantages it has with respect to Lemma 9, but we have yetto state any advantages. Recalling the conditions of Lemma 9,we see that it can not be employed when the set {λ1, λ2, λ3}contains non-unique elements. In fact, more is true. Ultimately,when one wants to implement the algorithms outlined in thispaper, one will most probably use floating point numbers.Recall that a major source of numerical instability stems fromsubtracting two floating point numbers that are too close. Byconsidering the denominator in (26) and (27) we see that λ1

and λ3 should not be too close. More so, by considering thethe numerators, we conclude that λ2 should not be too closeto neither λ1 nor λ3. So, when these cases do occur, our onlyoption is Lemma 7.

We now define the merge-upgrading procedure we haveused. Apart from an initial step, it is very similar to the merge-degrading procedure we have previously outlined. Assumewe are given a BMS channel W : X → Y with analphabet size of 2L and wish to reduce its alphabet size toµ, while transforming W into a upgraded version of itself.If 2L ≤ µ, then, as before, we are done. Otherwise, as inthe merge-degrading procedure, we choose L representativesy1, y2, . . . , yL, and order them according to their LR values,all of which are greater than or equal to 1. We now specifythe preliminary step: for some specified parameter epsilon(we have used ε = 10−3), we check if there exists an index1 ≤ i < L such that the ratio LR(yi+1)/LR(yi) is less than1 + ε. If so, we apply Lemma 7 repeatedly, until no suchindex exists. Now comes the main step. We ask the followingquestion: for which index 1 ≤ i ≤ L − 1 does the channelresulting from the application of Lemma 9 to W, yi, yi+1, andyi+2 result in a channel with largest mutual information? Afterfinding the maximizing index i we indeed apply Lemma 9 andget an upgraded channel Q′ with an alphabet size smaller by2 than that of W. The same process is applied to Q′, until theoutput alphabet size is not more than µ. As before, assumingthe output alphabet size of W is at most 2µ2, it should beeasy to see that the whole procedure takes O(µ2 logµ) time.

As was the case for degrading, the following theorem proves

1,048,576786,432524,288262,1440.5×10−9

10−9

1.5×10−9

Fig. 6. Result of running both upgrading and degrading algorithms onBSC(0.11) and n = 220, setting 10−9 as a threshold on Pe, and plottingcases in which one bound is above the threshold and the other is below it(132 such cases occurred).

that no generality is lost by only considering merging ofconsecutive triplets of the form yi, yi+1, yi+2 in the mainstep. The proof is given in Appendix B.

Theorem 11: Let W : X → Y be a BMS channel. Denoteby IW the symmetric capacity of W and by I(y1, y2, y3) thesymmetric capacity one gets by the application of Lemma 9to W and symbols y1, y2, y3 ∈ Y such that

1 ≤ LR(y1) ≤ LR(y2) ≤ LR(y3) .

Let LR(y1) = λ1, LR(y2) = λ2, LR(y3) = λ3, π2 =W(y2|0) + W(y2|1), and denote the difference in capacitiesas

∆[λ1;λ2, π2;λ3] = I(y1, y2, y3)− IW .

Then, for all λ′1 ≤ λ1 and λ′3 ≥ λ3,

∆[λ1;λ2, π2;λ3] ≤ ∆[λ′1;λ2, π2;λ′3] . (30)

We end this section by considering the running time ofAlgorithms 1 and 2. Without loss of generality, we considerAlgorithm 1. Let us assume for simplicity that the outputalphabet size of W is at most µ. Thus, by induction, at the startof each loop the size of the output alphabet of Q is at most µ.Therefore, at each iteration, calculating W from Q takes O(µ2)time, since the output alphabet size of W is at most 2µ2. Next,we have seen that each invocation of degrading_mergetakes O(µ2 logµ) time. Since the number of times we loop ism, the total running time is O(m · µ2 logµ).

Typically, we would run Algorithms 1 and 2 for each ofthe n channels. Noticing that intermediate calculations can beshared between different channels, we get that the total runningtime is O(n · µ2 logµ).

Figures 6 and 7 contain examples of the sort of informationthat can be extracted from a run of our algorithms.

VI. CHANNELS WITH CONTINUOUS OUTPUT ALPHABETS

Recall that in order to apply either Algorithm 1 or 2 toan underlying BMS channel W, we had to thus far assumethat W has a finite output alphabet. In this section, we showtwo transforms (one degrading and the other upgrading) thattransform a BMS channel with a continuous alphabet to a BMSchannel with a specified finite output alphabet size. Thus, afterapplying the degrading (upgrading) transform we will shortlyspecify to W, we will be in a position to apply Algorithm 1(2) and get a degraded (upgraded) approximation of W(m)

i .Moreover, we prove that both degraded and upgraded versions

(a) Mutual information

(b) Bhattacharyya

(c) Probability of error

Fig. 7. Result of running both upgrading and degrading algorithms onBSC(0.11) and n = 220, plotting upper and lower bounds on the mutualinformation I(W(m)

i ), Bhattacharyya bound Z(W(m)i ), and Probability of

error Pe(W(m)i ). Lower bounds are colored while upper bounds are black.

Most times, the dots overlap, since the bounds are close.

of our original channels have a guaranteed closeness to theoriginal channel, in terms of difference of symmetric capacity.

Let W be a given BMS channel with a continuous alphabet.We will make a few assumptions on W. First, we assume thatthe output alphabet of W is the reals R. Thus, for y ∈ R, letf(y|0) and f(y|1) be the pdf functions of the output giventhat the input was 0 and 1, respectively. Next, we require thatthe symmetry of W manifest itself as

f(y|0) = f(−y|1) , for all y ∈ R .

Also, for notational convenience, we require that

f(y|0) ≥ f(y|1) , for all y ≥ 0 . (31)

Note that all of the above holds for the BAWGN channel (afterrenaming the input 0 as −1).

We now introduce some notation. For y ≥ 0, define thelikelihood ratio of y as

λy =f(y|x = 0)

f(y|x = 1). (32)

As usual, if f(y|x = 1) is zero while f(y|x = 0) is not, wedefine λy =∞. Also, if both f(y|x = 0) and f(y|x = 1) arezero, then we arbitrarily define λy = 1. Note that by (31), wehave that λy ≥ 1.

Under these definitions, a short calculation shows that thesymmetric capacity of W is

I(W) =

∫ ∞

0

(f(y|x = 0) + f(y|x = 1))C[λy] dy ,

where for 1 ≤ λ <∞

C[λ] = 1− λ

λ+ 1log2

(1 +

1

λ

)− 1

λ+ 1log2 (λ+ 1) ,

and (for continuity) we define C[∞] = 1.Let µ = 2L be the specified size of the degraded/upgraded

channel output alphabet. An important property of C[λ] is thatit is strictly increasing in λ for λ ≥ 1. This property is easilyproved, and will now be used to show that the following setsform a partition of the non-negative reals. For 1 ≤ i ≤ L− 1,let

Ai =

{y ≥ 0 :

i− 1

L≤ C[λy] <

i

L

}. (33)

For i = L we similarly define (changing the second inequalityto a weak inequality)

AL =

{y ≥ 0 :

L− 1

L≤ C[λy] ≤ 1

}. (34)

As we will see later on, we must assume that the sets Ai aresufficiently “nice”. This will indeed be the case of for BAWGNchannel.

A. Degrading transform

Essentially, our degrading procedure will consist of Lapplications of the continuous analog of Lemma 5. Denoteby Q : X → Z the degraded approximation of W we aregoing to produce, where

Z = z1, z1, z2, z2, . . . , zL, zL .

We define Q as follows.

Q(zi|0) = Q(zi|1) =

∫

Ai

f(y|0) dy , (35)

Q(zi|0) = Q(zi|1) =

∫

Ai

f(−y|0) dy . (36)

Lemma 12: The channel Q : X → Z is a BMS channelsuch that Q �W.

Proof: It is readily seen that Q is a BMS channel. Toprove Q �W, we now supply intermediate channel P : R→Z .

P(z|y) =

1 if z = zi and y ∈ Ai ,1 if z = zi and −y ∈ Ai ,0 otherwise .

The following Lemma bounds the loss in symmetric capac-ity incurred by the degrading operation.

Lemma 13: The difference in symmetric capacities of Qand W can be bounded as follows,

0 ≤ I(W)− I(Q) ≤ 1

L=

2

µ. (37)

Proof: The first inequality in (37) is a consequence of thedegrading relation and (19). We now turn our attention to thesecond inequality.

Recall that since the Ai partition the non-negative reals, thesymmetric capacity of W equals

I(W) =

L∑

i=1

∫

Ai

(f(y|0) + f(y|1))C[λy]dy . (38)

As for Q, we start by defining for 1 ≤ i ≤ L the ratio

Λi =Q(zi|0)

Q(zi|1),

where the cases of the numerator and/or denominator equalingzero are as in the definition of λy . By this definition, similarlyto the continuous case, the symmetric capacity of Q is equalto

I(Q) =

L∑

i=1

(Q(zi|0) +Q(zi|1))C[Λi] . (39)

Recall that by the definition of Ai in (33) and (34), we havethat for all y ∈ Ai,

i− 1

L≤ C[λy] ≤ i

L.

Thus, by the definition of Q(zi|0) and Q(zi|1) in (35) and(36), respectively, we must have by the strict monotonicity ofC that

i− 1

L≤ C[Λi] ≤

i

L, if Q(zi|0) > 0 .

Thus, for all y ∈ Ai,

|C[Λi]− C[λy]| ≤ 1

L, if Q(zi|0) > 0 .

Next, note that Q(zi|0) > 0 implies Q(zi|0) + Q(zi|1) > 0.Thus, we may bound I(Q) as follows,

I(Q) =

L∑

i=1

(Q(zi|0) +Q(zi|1))C[Λi] =

L∑

i=1

∫

Ai

(f(y|0) + f(y|1))C[Λi]dy ≥

L∑

i=1

∫

Ai

(f(y|0) + f(y|1))

(C[λy]− 1

L

)dy =

(L∑

i=1

∫

Ai

(f(y|0) + f(y|1))C[λy]dy

)− 1

L= I(W)− 1

L,

which proves the second inequality.

B. Upgrading transform

In parallel with the degrading case, our upgrading procedurewill essentially consist of L applications of the continuousanalog of Lemma 7. Denote by Q′ : X → Z ′ the upgradedapproximation of W we are going to produce, where

Z ′ = z1, z1, z2, z2, . . . , zL, zL .

As before, we will show that the loss in capacity due to theupgrading operation is at most 1/L.

Let us now redefine the ratio Λi. Recalling that the functionC[λ] is strictly increasing in λ ≥ 1, we deduce that it has aninverse in that range. Thus, for 1 ≤ i ≤ L, we define Λi ≥ 1as follows,

Λi = C−1

[i

L

]. (40)

Note that for i = L, we have that ΛL = ∞. Also, note thatfor y ∈ Ai we have by (33) and (34) that

1 ≤ λy ≤ Λi . (41)

We now define Q′. For 1 ≤ i ≤ L, let,

πi =

∫

Ai

(f(α|0) + f(−α|0)

)dα . (42)

Then,

Q′(z|0) =

ΛiπiΛi+1 if z = zi and Γi 6=∞ ,πi

Λi+1 if z = zi and Γi 6=∞ ,

πi if z = zi and Γi =∞ ,

0 if z = zi and Γi =∞ ,

(43)

and

Q′(zi|1) = Q′(zi|0) , Q′(zi|1) = Q′(zi|0) . (44)

Lemma 14: The channel Q′ : X → Z ′ is a BMS channelsuch that Q′ �W.

Proof: As before, the proof that Q′ is a BMS channel iseasy. To show that Q′ �W, we must supply the intermediatechannel P . The proof follows easily if we define P : Z ′ → Ras the cascade of two channels, P1 : Z ′ → R and P2 : R→ R.

The channel P1 : Z ′ → R is essentially a renaming channel.Denote by g(α|z) the pdf of the output P1 given that the inputwas z. Then, for 1 ≤ i ≤M ,

g(α|z) =

f(α|0)+f(−α|0)πi

if z = zi and α ∈ Ai ,f(α|0)+f(−α|0)

πiif z = zi and −α ∈ Ai ,

0 otherwise .(45)

Note that by (42), the function g(α|z) is indeed a pdf for everyfixed value of z ∈ Z ′.

Next, we turn to P2 : R→ R, the LR reducing channel. Letα ∈ Ai and recall the definition of λy given in (32). Definethe quantity pα as follows,

pα =

Λi−λα(λα+1)(Λi−1) if 1 < Λi <∞ ,12 if Λi = 1 ,

1λα+1 if Λi =∞ and λα <∞ ,

0 if λα =∞ .

(46)

By (41) with α in place of y we deduce that 0 ≤ pα ≤ 1/2.We define the channel P2 : R→ R as follows. For y ≥ 0,

P2(y|α) =

{1− pα if y = α ,

pα if y = −α , (47)

andP2(−y| − α) = P2(y|α) .

Consider the random variable Y , which is defined as the outputof the concatenation of channels Q′, P1, and P2, given thatthe input to Q′ was 0. We must show that the pdf of Y isf(y|0). To do this, we consider the limit

limε→0

Prob(y ≤ Y ≤ y + ε)

ε.

Consider first a y such that y ∈ Ai, and assume further thatε is small enough so that the whole interval between y andy + ε is in Ai. In this case, the above can be expanded to

limε→0

1

ε

[Q′(zi|0) ·

∫ y+ε

y

g(α|zi)(1− pα) dα

+Q′(zi|0) ·∫ −y

−y−εg(α|zi)p−α dα

].

Assuming that the two integrands are indeed integrable, thisreduces to

Q′(zi|0) · g(y|zi)(1− py) +Q′(zi|0) · g(−y|zi)py .From here, simple calculations indeed reduce the above tof(y|0). The other cases are similar.

As in the degrading case, we can bound the loss in sym-metric capacity incurred by the upgrading operation.

Lemma 15: The difference in symmetric capacities of Q′and W can be bounded as follows,

0 ≤ I(Q′)− I(W) ≤ 1

L=

2

µ. (48)

Proof: The first inequality in (48) is a consequence of theupgrading relation and (19). We now turn our attention to thesecond inequality.

For all y ∈ Ai, by (33), (34) and (40), we have that

C[Λi]− C[λy] ≤ 1

L, if Q′(zi|0) > 0 .

Next, notice that by (43),

Λi =Q′(zi|0)

Q′(zi|1), if Q′(zi|0) > 0 .

As in the degrading case, we have that Q′(zi|0) > 0 impliesQ′(zi|0) + Q′(zi|1) > 0. Thus, we may bound I(Q′) asfollows,

I(Q′) =

L∑

i=1

(Q′(zi|0) +Q′(zi|1))C[Λi] =

L∑

i=1

∫

Ai

(f(y|0) + f(y|1))C[Λi]dy ≤

L∑

i=1

∫

Ai

(f(y|0) + f(y|1))

(C[λy] +

1

L

)dy =

(L∑

i=1

∫

Ai

(f(y|0) + f(y|1))C[λy]dy

)+

1

L= I(W) +

1

L,

which proves the second inequality.

VII. VARIATIONS OF OUR ALGORITHMS

As one might expect, Algorithms 1 and 2 can be tweakedand modified. As an example, we now show an improvementto Algorithm 1 for a specific case.

Recall our description of how to construct a polar code givenat the end of Section IV: obtain a degraded approximation ofeach bit channel through the use of Algorithm 1, and thenselect the k best channels when ordered according to theupper bound on the probability of misdecoding. Note thatAlgorithm 1 returns a channel, but in this case only oneattribute of that channel interests us, namely, the probabilityof misdecoding. In this section, we show how to specializeAlgorithm 1 accordingly and benefit.

The specialized algorithm is given as Algorithm 3. Thekey observation follows from Equations (26) and (27) in [1],which we now restate. Recall that Z(W) is the Bhattacharyyaparameter of the channel W. Then,

Z(W�W) ≤ 2Z(W)− Z(W)2 (49)Z(W�W) = Z(W)2 (50)

The proof of correctness is by induction, and we leave itto the reader. However, we would like to draw attention tothe following possibly subtle point. Note that for any channelW, we have that 0 ≤ Z(W) ≤ 1 since a geometric mean isupper-bounded by the corresponding arithmetic one. This factis important, since we are now allowed to assume that 2Z−Z2

is increasing in Z. Also, we note that the plots in this paperhaving to do with a lower bound on the probability of errorwere produced by running this algorithm.

Algorithm 3: An upper bound on misdecoding probabilityinput : An underlying BMS channel W, a bound µ on

the output alphabet size, an indexi = 〈b1, b2, . . . , bm〉2.

output: An upper bound on Pe(W(m)i ).

1 Z← Z(W);2 Q← degrading_merge(W, µ);3 for j = 1, 2, . . . ,m do4 if bj = 0 then5 W← Q � Q ;6 Z← min{Z(W), 2Z− Z2};7 else8 W← Q � Q ;9 Z← min{Z(W),Z2};

10 Q← degrading_merge(W, µ);

11 return min{Pe(Q),Z};

VIII. ANALYSIS

As we’ve seen in previous sections, we can build polar codesby employing Algorithm 3, and gauge how far we are fromthe optimal construction by running Algorithm 2. As can beseen in Figure 1, our construction turns out to be essentiallyoptimal, for moderate sizes of µ. Still, we would like to have atheorem validating our numerical observations, and we devotethis section to stating and proving such a theorem.

Recall from [4] that for a polar code of length n = 2m, thefraction of bit channels with probability of error less than 2−n

β

tends to the symmetric capacity of the underlying channel as ngoes to infinity, for β < 1/2. Moreover, the constraint β < 1/2is tight in that the fraction of such channels is strictly less thanthe symmetric capacity, for β > 1/2.

We now state an analog of [4, Theorem 1] for the up-per bounds on the probability of error supplied by runningAlgorithm 3. The following theorem states that we can getarbitrarily close to the optimal construction of a polar code oflength n, as n tends to infinity.

Theorem 16: Let an underlying BMS channel W be given.Denote by n = 2m a code length and by µ = 2L an upperbound on the size of an output alphabet. Recall that for 0 ≤i < n, the output of running Algorithm 3 on input W, µ,i, and m is an upper bound on the probability of error ofW(m)i . Denote this upper bound as Pe(W(m)

i , µ). Then, foreach constant β < 1/2 there exists an ε ≥ 0 which is afunction of W, µ, and β (but is not a function of n) such that

lim infm→∞

∣∣∣{i : Pe(W(m)

i , µ) < 2−nβ}∣∣∣

n= I(W)− ε , (51)

wherelimµ→∞

ε = ε(W, µ, β) = 0 . (52)

The rest of this section is devoted to proving Theorem 16.However, let us first state the following two implications.

Remark 17: With respect to Theorem 16, setting the pa-rameter µ to, say, µ = 2 blog nc results in an asymptoticallyoptimal construction scheme. That is, in this case, ε tends tozero. Thus, for n large enough we can construct a polar codeof length n in time O(n log2 n log log n). This polar code isassured to have a block error probability of 2−n

β

, where β isa function of n and approaches 1/2 as n tends to infinity

Using Theorem 16 one can show that capacity approachingpolar codes can be constructed in polynomial time (in fact,linear time). This claim is made more precise in the followingcorollary.

Corollary 18: Let an underlying BMS channel W be given,as well as constants β < 1/2 and ε > 0. Then, there exist anm0 and an even integer µ such that the following holds. Forall m > m0 and n = 2m we have that∣∣∣

{i : Pe(W(m)

i , µ) < 2−nβ}∣∣∣

n≥ I(W)− ε . (53)

The running time needed to find the above set (and constructthe corresponding code) is O(n · µ2 logµ). However, since µis not a function of n, we may argue that the running time isO(n).

Proof: Given the specified ε, let µ = 2L be such thatε(W, µ, β) from (52) satisfies

ε(W, µ, β) ≤ ε

2.

Thus, from (51) we have that

lim infm→∞

∣∣∣{i : Pe(W(m)

i , µ) < 2−nβ}∣∣∣

n≥ I(W)− ε

2.

Therefore, there must be an m0 such that (53) holds for allm > m0.

In order to prove Theorem 16, we make essential use ofthe results of Pedarsani, Hassani, Tal, and Telatar [16], inparticular [16, Theorem 1] given below. We also point outthat certain ideas used in the proof of Theorem 16 appear —in one form or another — in [16, Theorem 2] and its proof.

Theorem 19 (Restatement of [16, Theorem 1]): Let an un-derlying BMS channel W be given. Let n = 2m be the codelength, and denote byW(m)

i the corresponding ith bit channel,where 0 ≤ i < n. Next, denote by Q(m)

i (L) the degradedapproximation ofW(m)

i returned by running Algorithm 1 withparameters W, µ = 2L, i, and m. Then,

∣∣∣{i : I(W(m)

i )− I(Q(m)i (L)) ≥

√mL

}∣∣∣n

≤√m

L.

Remark 20: Recall that in Subsection VI-A we introduceda method of degrading a continuous channel to a discrete onewith at most µ = 2L symbols. In fact, there is nothing specialabout the continuous case: a slight modification can be used todegrade an arbitrary discrete channel to a discrete channel withat most µ symbols. Thus, we have an alternative to the merge-degrading method introduced in Subsection V-A. It turns outthat Theorem 19 would still hold had we used that alternative,and thus so would Theorem 16.

We now break the proof of Theorem 16 into several lemmas.Lemma 21: Let Q(m)

i (L) be as in Theorem 19. Then, forevery δ > 0 and ε > 0 there exists an m0 and a µ = 2L suchthat ∣∣∣

{i0 : Z

(Q(m0)i0

(L))≤ δ}∣∣∣

n0≥ I(W)− ε , (54)

wheren0 = 2m0 and 0 ≤ i0 < n0 .

Proof: For simplicity of notation, let us drop the subscript0 from i0, n0, and m0. Recall that by [1, Theorem 1] wehave that the symmetric capacity of bit channels polarizes.Specifically, for each ε1 > 0 and δ1 > 0 there exists an msuch that∣∣∣

{i : I

(W(m)i

)≥ 1− δ1

}∣∣∣n

≥ I(W)− ε1 . (55)

We can now combine the above with Theorem 19 and deducethat∣∣∣{i : I

(Q(m)i (L)

)≥ 1− δ1 −

√mL

}∣∣∣n

≥

I(W)− ε1 −√m

L. (56)

Next, we claim that for each δ2 > 0 and ε2 > 0 there existm and µ = 2L such that

∣∣∣{i : I

(Q(m)i (L)

)≥ 1− δ2

}∣∣∣n

≥ I(W)− ε2 . (57)

To see this, take ε1 = ε2/2, δ1 = δ2/2, and let m be theguaranteed constant such that (55) holds. Now, we can take Lbig enough so that, in the context of (56), we have that both

δ1 +

√m

L< δ2

and

ε1 +

√m

L< ε2 .

By [1, Equation (2)] we have that

Z(Q(m)i (L)

)≤√

1− I2(Q(m)i (L)

).

Thus, if (57) holds then∣∣∣{i : Z

(Q(m)i (L)

)≤√

2δ2 − δ22

}∣∣∣n

≥ I(W)− ε2 .

So, as before, we deduce that for every δ3 > 0 and ε3 > 0there exist m and µ such that

∣∣∣{i : Z

(Q(m)i (L)

)≤ δ3

}∣∣∣n

≥ I(W)− ε3 .

Lemma 22: For every m ≥ 0 and index 0 ≤ i < 2m letthere be a corresponding real 0 ≤ ζ(i,m) ≤ 1. Denote the

binary representation of i by i = 〈b1, b2, . . . , bm〉2. Assumethat the ζ(i,m) satisfy the following recursive relation. Form > 0 and i′ = 〈b1, b2, . . . , bm−1〉2,

ζ(i,m) ≤{2ζ(i′,m− 1)− ζ2(i′,m− 1) if bm = 0 ,

ζ2(i′,m− 1) otherwise .(58)

Then, for every β < 1/2 we have that

lim infm→∞

∣∣∣{i : ζ(i,m) < 2−n

β}∣∣∣

n≥ 1− ζ(0, 0) , (59)

where n = 2m.Proof: First, note that both f1(ζ) = ζ2 and f2(ζ) =

2ζ − ζ2 strictly increase from 0 to 1 when ζ ranges from 0to 1. Thus, it suffices to prove the claim for the worst case inwhich the inequality in (58) is replaced by an equality. Assumefrom now on that this is indeed the case.

Consider an underlying BEC with probability of erasure(as well as Bhattacharyya parameter) ζ(0, 0). Next, note thatthe ith bit channel, for 0 ≤ i < n = 2m, is also a BEC,with probability of erasure ζ(i,m). Since the capacity ofthe underlying BEC is 1 − ζ(0, 0), we deduce (59) by [4,Theorem 2].

We are now in a position to prove Theorem 16.Proof of Theorem 16: We first consider (51). By defini-

tion of lim inf , the left hand side exists. Thus, there exists an εso that the right hand side is equal to the left hand side. We areleft to prove that ε ≥ 0. However, assuming that ε is negativewould imply that the fraction of bit channels with capacityapproaching zero is strictly less than 1 − I(W). Thus, ourassumption contradicts the polarization phenomenon stated in[1, Theorem 2].

We now focus on the main part of the proof, validating (52).In order to prove the limit, we must show that for each κ > 0there exists a µ = 2L such that ε(W, µ, β) ≤ κ. Thus, letκ > 0 be given. By Lemma 21, there exist constants m0 andL such that∣∣∣

{i0 : Z

(Q(m0)i0

(L))≤ κ

2

}∣∣∣n0

≥ I(W)− κ

2, (60)

wheren0 = 2m0 and 0 ≤ i0 < n0 .

Denote the codeword length as n = 2m, where m = m0 +m1 and m1 > 0. Consider an index 0 ≤ i < n having binaryrepresentation

i = 〈b1, b2, . . . , bm0, bm0+1, . . . , bm〉2 ,

where b1 is the most significant bit. We split the run ofAlgorithm 3 on i into two stages. The first stage will havej going from 1 to m0, while the second stage will have jgoing from m0 + 1 to m.

We start by considering the end of the first stage. Namely,consider line 10 of the algorithm when j = m0. After the line

is executed, we have that the variable Q is equal to Q(m0)i0

(L),where

i0 = 〈b1, b2, . . . , bm0〉2 .

Thus, since degrading increases the Bhattacharyya parameter,we have then that the Bhattacharyya parameter of W is lessthan or equal to that of Q. So, by the minimization carried outin either line 6 or 9, we conclude the following: at the end ofline 10 of the algorithm, when j = m0,

Z = Z(m0)i0

(L) ≤ Z(Q(m0)i0

(L)).

We can now combine this observation with (60) to concludethat ∣∣∣

{i0 : Z(m0)

i0(L) ≤ κ

2

}∣∣∣n0

≥ I(W)− κ

2. (61)

We now move on to consider the second stage of thealgorithm. Fix an index i = 〈b1, b2, . . . , bm0

, bm0+1, . . . , bm〉2.That is, let i have i0 as a binary prefix of length m0. Denoteby Z[t] the value of Z at the end of line 10, when j = m0 + t.By lines 6 and 9 of the algorithm we have, similarly to (58),that

Z[t+ 1] ≤{

2Z[t]− Z2[t] if bm0+t+1 = 0 ,

Z2[t] otherwise .

We now combine our observations about the two stages. Letγ be a constant such that

β < γ <1

2.

Considering (60), we see that out of the n0 = 2m0 possibleprefixes of length m0, the fraction for which

Z[0] ≤ κ

2(62)

is at least I(W) − κ2 . Next, by Lemma 22, we see that for

each such prefix, the fraction of suffixes for which

Z[m1] ≤ 2−(n1)γ (63)

is at least 1 − Z[0], as n1 = 2m1 tends to infinity. Thus, foreach such prefix, we get by (62) that (in the limit) the fractionof such suffixes is at least 1 − κ

2 . We can now put all ourbounds together and claim that as m1 tends to infinity, thefraction of indices 0 ≤ i < 2m for which (63) holds is at least

(I(W)− κ

2

)·(

1− κ

2

)≥ I(W)− κ .

By line (11) of Algorithm 3, we see that Z[m1] is an upperbound on the return value Pe(W(m)

i ). Thus, we conclude that

lim infm→∞

∣∣∣{i : Pe(W(m)

i , µ) < 2−(n1)γ}∣∣∣

n= I(W)− ε

With the above at hand, the only thing left to do in order toprove (51) is to show that for m1 large enough we have that

2−(n1)γ ≤ 2−nβ

,

which reduces to showing that

(n1)γ−β ≥ (n0)β .

Since γ > β and n0 = 2m0 is constant, this is indeed the case.

APPENDIX APROOF OF THEOREM 6

This appendix is devoted to the proof of Theorem 6. We startby defining some notation. Let W : X → Y , L, y1, y2, . . . , yLand y1, y2, yL be as in Theorem 6. Let w ∈ Y and w ∈ Y be asymbol pair, and denote by (a, b) the corresponding probabilitypair, where

a = p(w|0) = p(w|1) , b = p(w|1) = p(w|0) .

The contribution of this probability pair to the symmetriccapacity of W is denoted by

C(a, b) = −(a+b) log2((a+b)/2)+a log2(a)+b log2(b) =

− (a+ b) log2(a+ b) + a log2(a) + b log2(b) + (a+ b) ,

where 0 log2 0 = 0.Next, suppose we are given two probability pairs: (a1, b1)

and (a2, b2) corresponding to the symbol pair w1, w1 andw2, w2, respectively. The capacity difference resulting fromthe application of Lemma 5 to w1 and w2 is denoted by

∆(a1, b1; a2, b2) = C(a1, b1)+C(a2, b2)−C(a1+a2, b1+b2) .

For reasons that will become apparent later on, we hence-forth relax the definition of a probability pair to two non-negative numbers, the sum of which may be greater than 1.Note that C(a, b) is still well defined with respect to thisgeneralization, as is ∆(a1, b1; a2, b2). Furthermore, to excludetrivial cases, we require that a probability pair (a, b) has atleast one positive element.

The following lemma states that we lose capacity by per-forming a downgrading merge.

Lemma 23: Let (a1, b1) and (a2, b2) be two probabilitypairs. Then,

∆(a1, b1; a2, b2) ≥ 0

Proof: Assume first that a1, b1, a2, b2 are all positive. Inthis case, ∆(a1, b1; a2, b2) can be written as follows:

(a1 + a2)

(−a1a1+a2

log2(a1+b1)(a1+a2)a1(a1+b1+a2+b2) +

−a2a1+a2

log2(a2+b2)(a1+a2)a2(a1+b1+a2+b2)

)+

(b1 + b2)

(−b1b1+b2

log2(a1+b1)(b1+b2)b1(a1+b1+a2+b2) +

−b2b1+b2

log2(a2+b2)(b1+b2)b2(a1+b1+a2+b2)

)

By Jensen’s inequality, both the first two lines and the lasttwo lines can be lower bounded be 0. The proof for cases inwhich some of the variables equal zero is much the same.

The intuition behind the following lemma is that the orderof merging does matter in terms of total capacity lost.

Lemma 24: Let (a1, b1), (a2, b2), and (a3, b3) be threeprobability pairs. Then,

∆(a1, b1; a2, b2) + ∆(a1 + a2, b1 + b2; a3, b3) =

∆(a2, b2; a3, b3) + ∆(a1, b1; a2 + a3, b2 + b3) .

Proof: Both sides of the equation equal

C(a1, b1)+C(a2, b2)+C(a3, b3)−C(a1+a2+a3, b1+b2+b3) .

Instead of working with a probability pair (a, b), we find iteasier to work with a probability sum π = a+b and likelihoodratio λ = a/b. Of course, we can go back to our previousrepresentation as follows. If λ = ∞ then a = π and b = 0.Otherwise, a = λ·π

λ+1 and b = πλ+1 . Recall that our relaxation

of the term “probability pair” implies that π is positive and itmay be greater than 1.

Abusing notation, we define the quantity C through λ andπ as well. For λ =∞ we have C[∞, π] = π. Otherwise,

C[λ, π] =

π

(− λ

λ+ 1log2

(1 +

1

λ

)− 1

λ+ 1log2(1 + λ)

)+ π .

Let us next consider merging operations. The merging ofthe symbol pair corresponding to [λ1, π1] with that of [λ2, π2]gives a symbol pair with [λ1,2, π1,2], where

π1,2 = π1 + π2

and

λ1,2 = λ[π1, λ1;π2, λ2] =

λ1π1(λ2 + 1) + λ2π2(λ1 + 1)

π1(λ2 + 1) + π2(λ1 + 1)(64)

Abusing notation, define

∆[λ1, π1;λ2, π2] = C[λ1, π1] + C[λ2, π2]− C[λ1,2, π1,2] .

Clearly, we have that the new definition of ∆ is symmetric:

∆[λ1, π1;λ2, π2] = ∆[λ2, π2;λ1, π1] . (65)

Lemma 25: ∆[λ1, π1;λ2, π2] is monotonic increasing inboth π1 and π2.

Proof: Recall from (65) that ∆ is symmetric, and so itsuffices to prove the claim for π1. Thus, our goal is to provethe following for all ρ > 0,

∆[λ1, π1 + ρ;λ2, π2] ≥ ∆[λ1, π1;λ2, π2] .

At this point, we find it useful to convert back from the like-lihood ratio/probability sum representation [λ, π] to the prob-ability pair representation (a, b). Denote by (a1, b1), (a2, b2),and (a′, b′) the probability pairs corresponding to [λ1, π1],[λ2, π2], and [λ1, π1 + ρ], respectively. Let a3 = a′ − a1 andb3 = b′ − b1. Next, since both (a1, b1) and (a′, b′) have the

same likelihood ratio, we deduce that both a3 and b3 are non-negative. Under our new notation, we must prove that

∆(a1 + a3, b1 + b3; a2, b2) ≥ ∆(a1, b1; a2, b2) .

Since both (a1, b1) and (a′, b′) have likelihood ratio λ1, thisis also the case for (a3, b3). Thus, a simple calculation showsthat

∆(a1, b1; a3, b3) = 0 .

Hence,

∆(a1 + a3, b1 + b3; a2, b2) =

∆(a1 + a3, b1 + b3; a2, b2) + ∆(a1, b1; a3, b3)

Next, by Lemma 24,

∆(a1 + a3, b1 + b3; a2, b2) + ∆(a1, b1; a3, b3) =

∆(a1, b1; a2, b2) + ∆(a1 + a2, b1 + b2; a3, b3) .

Since, by Lemma 23, we have that ∆(a1 +a2, b1 + b2; a3, b3)is non-negative, we are done.

We are now at the point in which our relaxation of the term“probability pair” can be put to good use. Namely, we willnow see how to reduce the number of variables involved byone, by taking a certain probability sum to infinity.

Lemma 26: Let λ1, π1, and λ2 be given. Assume that 0 <λ2 <∞. Define

∆[λ1, π1;λ2,∞] = limπ2→∞

∆[λ1, π1;λ2, π2] .

If 0 < λ1 <∞, then

∆[λ1, π1;λ2,∞] =

π1

(− λ1

λ1 + 1log2

(1 + 1

λ1

1 + 1λ2

)− 1

λ1 + 1log2

(λ1 + 1

λ2 + 1

)).

(66)

If λ1 =∞, then

∆[λ1, π1;λ2,∞] = π1

(− log2

(1

1 + 1λ2

)). (67)

If λ1 = 0, then

∆[λ1, π1;λ2,∞] = π1

(− log2

(1

λ2 + 1

)). (68)

Proof: Consider first the case 0 < λ1 <∞. We write out∆[λ1, π1;λ2, π2] in full and after rearrangement get

π1

(1

1λ1,2

+ 1log2

(1 +

1

λ1,2

)− 1

1λ1

+ 1log2

(1 +

1

λ1

))+

π1

(1

λ1,2 + 1log2 (1 + λ1,2)− 1

λ1 + 1log2 (1 + λ1)

)+

π2

(1

1λ1,2

+ 1log2

(1 +

1

λ1,2

)− 1

1λ2

+ 1log2

(1 +

1

λ2

))+

π2

(1

λ1,2 + 1log2 (1 + λ1,2)− 1

λ2 + 1log2 (1 + λ2)

),

(69)

where λ1,2 is given in (64). Next, note that

limπ2→∞

λ1,2 = λ2 .

Thus, applying limπ2→∞ to the first two lines of (69) isstraightforward. Next, consider the third line of (69), and writeits limit as

limπ2→∞

11

λ1,2+1

log2

(1 + 1

λ1,2

)− 1

1λ2

+1log2

(1 + 1

λ2

)

1π2

Since limπ2→∞ λ1,2 = λ2, we get that both numerator anddenominator tend to 0 as π2 →∞. Thus, we apply l’Hopital’srule and get

limπ2→∞

1(λ1,2+1)2

(log2 e− log2

(1 + 1

λ1,2

))∂λ1,2

∂π2

1(π2)2

=

limπ2→∞

1

(λ1,2 + 1)2

(log2 e− log2

(1 +

1

λ1,2

))·

π1(λ2 + 1)(λ1 + 1)(λ2 − λ1)

(π1(λ1+1)+π2(λ1+1)π2

)2=

π1(λ2 − λ1)

(λ1 + 1)(λ2 + 1)

(log2 e− log2

(1 +

1

λ2

)),

where e = 2.71828 . . . is Euler’s number. Similarly, taking thelimπ2→∞ of the fourth line of (69) gives

π1(λ2 − λ1)

(λ1 + 1)(λ2 + 1)(− log2 e− log2 (1 + λ2)) .

Thus, a short calculations finishes the proof for this case. Thecases λ1 = ∞ and λ1 = 0 are handled much the same way.

The utility of the next Lemma is that it asserts a strongerclaim than the “Moreover” part of Theorem 6, for a specificvalue of λ2.

Lemma 27: Let probability pairs (a1, b1) and (a3, b3) havelikelihood ratios λ1 and λ3, respectively. Assume λ1 ≤ λ3.Denote π1 = a1 + b1 and π3 = a3 + b3. Let

λ2 = λ1,3 = λ[π1, λ1;π3, λ3] , (70)

as defined in (64). Then,

∆[λ1, π1;λ2,∞] ≤ ∆[λ1, π1;λ3, π3]

and∆[λ3, π3;λ2,∞] ≤ ∆[λ1, π1;λ3, π3]

Proof: We start by taking care of a trivial case. Note thatif it is not the case that 0 < λ2 < ∞, then λ1 = λ2 = λ3,and the proof follows easily.

So, we henceforth assume that 0 < λ2 < ∞, as was donein Lemma 26. Let

∆(1,3) = ∆[λ1, π1;λ3, π3] ,

∆′(1,2) = ∆[λ1, π1;λ2,∞] ,

and∆′(2,3) = ∆[λ3, π3;λ2,∞] .

Thus, we must prove that ∆′(1,2) ≤ ∆(1,3) and ∆′(2,3) ≤∆(1,3). Luckily, Lemma 26 and a bit of calculation yields that

∆′(1,2) + ∆′(2,3) = ∆(1,3) . (71)

Recall that ∆′(1,2) and ∆′(2,3) must be non-negative by Lem-mas 23 and 25. Thus, we are done.

The next lemma shows how to discard the restraint put onλ2 in Lemma 27.

Lemma 28: Let the likelihood ratios λ1, λ3 and the proba-bility sums π1, π3 be as in be as in Lemma 27. Fix

λ1 ≤ λ2 ≤ λ3 . (72)

Then either

∆[λ1, π1;λ2,∞] ≤ ∆[λ1, π1;λ3, π3] (73)

or∆[λ3, π3;λ2,∞] ≤ ∆[λ1, π1;λ3, π3] (74)

Proof: Let λ1,3 be as in (70), and note that

λ1 ≤ λ1,3 ≤ λ3 .

Assume w.l.o.g. that λ2 is such that

λ1 ≤ λ2 ≤ λ1,3 .

From Lemma 27 we have that

∆[λ1, π1;λ1,3,∞] ≤ ∆[λ1, π1;λ3, π3]

Thus, we may assume that λ2 < λ1,3 and aim to prove that

∆[λ1, π1;λ2,∞] ≤ ∆[λ1, π1;λ1,3,∞] . (75)

Next, notice that

∆[λ1, π1;λ2,∞] = 0 , if λ2 = λ1 . (76)

Thus, let us assume that

λ1 < λ2 < λ1,3 .

Specifically, it follows that

0 < λ2 <∞and thus the assumption in Lemma 26 holds.

Define the function f as follows

f(λ′2) = ∆[λ1, π1;λ′2,∞] .

Assume first that λ1 = 0, and thus by (68) we have that

∂f(λ′2)

∂λ′2≥ 0 .

On the other hand, if λ1 6= 0 we must have that 0 ≤ λ1 <∞.Thus, by (66) we have that

∂f(λ′2)

∂λ′2=

π1

(λ1 + 1)(λ′2 + 1)

(1− λ1

λ′2

),

which is also non-negative for λ′2 ≥ λ1. Thus, we have provedthat the derivative is non-negative in both cases, and thistogether with (76) proves (75).

We are now in a position to prove Theorem 6.

Proof of Theorem 6: We first consider the “Moreover”part of the theorem. Let [λ1, π1], [λ2, π2], and [λ1, π1] cor-respond to yi, yj , and yk, respectively. From Lemma 28 wehave that either (73) or (74) holds. Assume w.l.o.g. that (73)holds. By Lemma 25 we have that

∆[λ1, π1;λ2, π2] ≤ ∆[λ1, π1;λ2,∞] .

Thus,∆[λ1, π1;λ2, π2] ≤ ∆[λ1, π1;λ3, π3] ,

which is equivalent to

I(yj , yk) ≥ I(yi, yk) .

Having finished the “Moreover” part, we now turn ourattention to the proof of (22). The two equalities in (22) arestraightforward, so we are left with proving the inequality. Forλ ≥ 0 and π > 0, the following are easily verified:

C[λ, π] = C[1/λ, π] , (77)

andC[λ, π] increases with λ ≥ 1 . (78)

Also, for λ as given in (64), λ1, λ2 ≥ 0, and π1 > 0, π2 > 0,it is easy to show that

λ[λ1, π1, λ2, π2] increases with both λ1 and λ2 . (79)

Let [λ1, π1] and [λ2, π2] correspond to yi and yj , respec-tively. Denote

γ = λ[λ1, π1;λ2, π2]

andδ = λ[1/λ1, π1;λ2, π2]

Hence, our task reduces to showing that

C[γ, π1 + π2] ≥ C[δ, π1 + π2] . (80)

Assume first that δ ≥ 1. Recall that both λ1 ≥ 1 and λ2 ≥ 1.Thus, by (79) we conclude that γ ≥ δ ≥ 1. This, togetherwith (78) finishes the proof.

Conversely, assume that δ ≤ 1. Since

λ[λ1, π1;λ2, π2] = λ[1/λ1, π1; 1/λ2, π2]

we now get from (79) that γ ≤ δ ≤ 1. This, together with(77) and (78) finishes the proof.

APPENDIX BPROOF OF THEOREM 11

As a preliminary step toward the proof of Theorem 11, weconvince ourselves that the notation ∆[λ1;λ2, π2;λ3] used inthe theorem is indeed valid. Specifically, the next lemma showsthat knowledge of the arguments of ∆ indeed suffices to cal-culate the difference in capacity. The proof is straightforward.

Lemma 29: For i = 1, 2, 3, let yi and λi, as well as π2 beas in Theorem 11. If λ3 <∞, then

∆[λ1;λ2, π2;λ3] =π2

(λ2 + 1)(λ1 − λ3)

[

(λ3 − λ2)

(λ1 log2

(1 +

1

λ1

)+ log2(1 + λ1)

)+

(λ2 − λ1)

(λ3 log2

(1 +

1

λ3

)+ log2(1 + λ3)

)+

(λ1 − λ3)

(λ2 log2

(1 +

1

λ2

)+ log2(1 + λ2)

)]. (81)

Otherwise, λ3 =∞ and

∆[λ1;λ2, π2;λ3 =∞] =

π2

λ2 + 1

[−λ1 log2

(1 +

1

λ1

)− log2(1 + λ1)

+ λ2 log2

(1 +

1

λ2

)+ log2(1 + λ2)

]. (82)

Having the above calculations at hand, we are in a positionto prove Theorem 11.

Proof of Theorem 11: First, let consider the case λ3 <∞.Since our claim does not involve changing the values of λ2

and π2, let us fix them and denote

f(λ1, λ3) = ∆[λ1;λ2, π2;λ3] .

Under this notation, it suffices to prove that f(λ1, λ3) isdecreasing in λ1 and increasing in λ3, where λ1 < λ2 < λ3.A simple calculation shows that

∂f(λ1, λ3)

∂λ1=

−π2(λ3 − λ2)

(1 + λ2)(λ3 − λ1)2

[

λ3 log

(1 + 1

λ1

1 + 1λ3

)+ log

(1 + λ1

1 + λ3

)]. (83)

So, in order to show that f(λ1, λ3) is decreasing in λ1, itsuffices to show that the term inside the square brackets ispositive for all λ1 < λ3. Indeed, if we denote

g(λ1, λ3) = λ3 log

(1 + 1

λ1

1 + 1λ3

)+ log

(1 + λ1

1 + λ3

),

then is readily checked that

g(λ1, λ1) = 0 ,

while∂g(λ1, λ3)

∂λ1=

λ3 − λ1

λ1(λ1 + 1)

is positive for λ3 > λ1. The proof of f(λ1, λ3) increasing inλ3 is exactly the same, up to a change of variable names.

Let us now consider the second case, λ3 = ∞. Similarlyto what was done before, let us fix λ2 and π2, and consider∆[λ1;λ2, π2;λ3 =∞] as a function of λ1. Denote

h(λ1) = ∆[λ1;λ2, π2;λ3 =∞] .

Under this notation, our aim is to prove that h(λ1) is decreas-ing in λ1. Indeed,

∂h(λ1)

∂λ1=−π2 log2

(1 + 1

λ1

)

λ2 + 1

is easily seen to be negative.

ACKNOWLEDGMENTS

We thank Emmanuel Abbe, Erdal Arıkan, Hamed Hassani,Ramtin Pedarsani, Artyom Sharov, Emre Telatar, and RudigerUrbanke for helpful discussions. We are especially grateful toArtyom Sharov for going over many incarnations of this paperand offering valuable comments.

REFERENCES

[1] E. Arıkan, “Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels,” IEEETrans. Inform. Theory, vol. 55, pp. 3051–3073, 2009.

[2] E. Sasoglu, E. Telatar, and E. Arıkan, “Polarization for arbitrary discretememoryless channels,” arXiv:0908.0302v1.

[3] S. B. Korada, E. Sasoglu, and R. Urbanke, “Polar codes: Character-ization of exponent, bounds, and constructions,” IEEE Trans. Inform.Theory, vol. 56, pp. 6253–6264, 2010.

[4] E. Arıkan and E. Telatar, “On the rate of channel polarization,” in Proc.IEEE Int’l Symp. Inform. Theory (ISIT’2009), Seoul, South Korea, 2009,pp. 1493–1495.

[5] E. Abbe and E. Telatar, “Polar codes for the m-user MAC and matroids,”arXiv:1002.0777v2.

[6] H. Mahdavifar and A. Vardy, “Achieving the secrecy capacity of wiretapchannels using polar codes,” IEEE Trans. Inform. Theory, accepted forpublication.

[7] E. Arıkan, “Source polarization,” arXiv:1001.3087v2.[8] E. Abbe, “Extracting randomness and dependencies via a matrix polar-

ization,” arXiv:1102.1247v1.[9] R. Mori, “Properties and construction of polar codes,” Master’s thesis,

Kyoto University, 2010. [Online]. Available: http://arxiv.org/abs/1002.3521

[10] R. Mori and T. Tanaka, “Performance and construction of polar codeson symmetric binary-input memoryless channels,” in Proc. IEEE Int’lSymp. Inform. Theory (ISIT’2009), Seoul, South Korea, 2009, pp. 1496–1500.

[11] T. Richardson and R. Urbanke, Modern Coding Theory. Cambridge,UK: Cambridge University Press, 2008.

[12] R. G. Gallager, Information Theory and Reliable Communications. NewYork: John Wiley, 1968.

[13] S. B. Korada, “Polar codes for channel and source coding,” Ph.D.dissertation, Ecole Polytechnique Federale de Lausanne, 2009.

[14] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed.Wiley, 2006.

[15] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introductionto Algorithms, 2nd ed. Cambridge, Massachusetts: The MIT Press,2001.

[16] R. Pedarsani, S. H. Hassani, I. Tal, and E. Telatar, “On the constructionof polar codes,” in Proc. IEEE Int’l Symp. Inform. Theory (ISIT’2011),Saint Petersburg, Russia, 2011, pp. 11–15.

http://arxiv.org/abs/1002.3521

http://arxiv.org/abs/1002.3521

Documents

How to Construct Polar Codes - arXiv · How to Construct Polar Codes Ido Tal and Alexander Vardy Abstract—A method for efﬁciently constructing polar codes is presented and analyzed