
Journal of Information Systems and Telecommunication

Information Bottleneck and its Applications in Deep Learning

Hassan Hafez-Kolahi
Department of Computer Engineering, Sharif University of Technology, Tehran, Iran

[email protected]

Shohreh Kasaei*
Department of Computer Engineering, Sharif University of Technology, Tehran, Iran

[email protected]

Abstract

Information Theory (IT) has been used in Machine Learning (ML) since the early days of this field. In the last decade, advances in Deep Neural Networks (DNNs) have led to surprising improvements in many applications of ML. The result has been a paradigm shift in the community toward revisiting previous ideas and applications in this new framework. Ideas from IT are no exception. One of the ideas being revisited by many researchers in this new era is Information Bottleneck (IB), a formulation of information extraction based on IT. The IB is promising in both analyzing and improving DNNs. The goal of this survey is to review the IB concept and demonstrate its applications in deep learning. The information-theoretic nature of IB also makes it a good candidate for showing the more general concept of how IT can be used in ML. Two important concepts are highlighted in this narrative on the subject: i) the concise and universal view that IT provides on seemingly unrelated methods of ML, demonstrated by explaining how IB relates to minimal sufficient statistics, stochastic gradient descent, and variational auto-encoders, and ii) the common technical mistakes and problems caused by applying ideas from IT, which are discussed through a careful study of some recent methods suffering from them.

Keywords: Machine Learning; Information Theory; Information Bottleneck; Deep Learning; Variational Auto-Encoder.

1. Introduction

The area of information theory was born with Shannon's landmark paper in 1948 [1]. One of the main topics of IT is communication, that is, sending the information of a source in such a way that the receiver can decipher it. Shannon's work established the basis for quantifying the bits of information and answering the basic questions faced in that communication. On the other hand, one can describe machine learning as the science of deciphering (decoding) the parameters of a true model (source) by considering a random sample that is generated by that model. In this view, it is easy to see why these two fields frequently cross paths. This dates back to early attempts of statisticians to learn parameters from a set of observed samples, which were later found to have interesting IT counterparts [2]. To this day, IT is used to analyze statistical properties of learning algorithms [3, 4, 5].

After the revolution of deep neural networks [6], the lack of a theory that is able to explain their success [7] has motivated researchers to analyze (and improve) DNNs by using IT observations. The idea was first proposed by [8], who made some connections between the information bottleneck method [9] and DNNs. Further experiments showed evidence that supports the applicability of IB in DNNs [10]. After that, many researchers tried to use those techniques to analyze DNNs [11, 12, 10, 13] and subsequently improve them [14, 15, 16].

In this survey, in order to follow the current research headlines, the main concepts and methods needed to become familiar with IB and DNNs are covered. In Section 2, the historical evolution of information extraction methods, from classical statistical approaches to IB, is discussed. Section 3 is devoted to the connections between IB and recent DNNs. In Section 4, another information-theoretic approach for analyzing DNNs is introduced as an alternative to IB. Finally, Section 5 concludes the survey.

2. Evolution of Information Extraction Methods

A shared concept in statistics, information theory, and machine learning is defining and extracting the relevant information about a target variable from observations. This general idea was present from the early days of modern statistics and has evolved ever since, taking a new form in each discipline that arose through time. As is expected from such a multidisciplinary concept, a complete understanding of it requires a persistent pursuit of the concept in all relevant fields. This is the main objective of this section. In order to give a clear view, the methods are organized in chronological order with an emphasis on their cause and effect; i.e., why each concept was developed and what it added to the big picture.

In the remainder of this section, first the notations are defined, and after that the evolution of methods from sufficient statistics to IB is explained.

2.1 Preliminaries and Notations

Consider X ∈ 𝒳 and Y ∈ 𝒴 as random variables with the joint distribution function p(x, y), where 𝒳 and 𝒴 are called the input and output spaces, respectively. Here, the realization of each Random Variable (r.v.) is represented by the same symbol in the lower case. The conditional entropy of X, given Y, is defined as H(X|Y) = E[−log p(X|Y)] and their Mutual Information (MI) is given by I(X;Y) = E[log (p(X,Y) / (p(X)p(Y)))]. There are also more technical definitions for MI allowing it to be used in cases where the distribution function p(x, y) is singular [17, 18]. An important property of MI is that it is invariant under bijective transforms f and g; i.e., I(X;Y) = I(f(X); g(Y)) [19].

A noisy channel is described by a conditional distribution function p(x̃|x), in which X̃ is the noisy version of X. In the rate-distortion setting, a distortion function d(x, x̃) is given and the minimum required bit-rate for a fixed expected distortion is studied. Then

R(D) = min_{p(x̃|x): E[d(X,X̃)] ≤ D} I(X;X̃).    (1)
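As a concrete illustration of these definitions (not part of the original text), the short numpy sketch below computes H(X|Y) and I(X;Y) for a small, arbitrarily chosen joint probability table; the table, the helper name, and the use of base-2 logarithms are all illustrative assumptions.

```python
import numpy as np

def safe_log2(p):
    """log2 that returns 0 where p == 0 (those terms vanish in the sums below)."""
    out = np.zeros_like(p)
    np.log2(p, out=out, where=p > 0)
    return out

# Joint distribution p(x, y) over small discrete alphabets (rows: x, columns: y).
p_xy = np.array([[0.25, 0.05],
                 [0.05, 0.25],
                 [0.20, 0.20]])

p_x = p_xy.sum(axis=1)          # marginal p(x)
p_y = p_xy.sum(axis=0)          # marginal p(y)

# H(X|Y) = E[-log p(X|Y)], with the expectation taken under p(x, y).
p_x_given_y = p_xy / p_y
H_X_given_Y = -np.sum(p_xy * safe_log2(p_x_given_y))

# I(X;Y) = E[log p(X,Y) / (p(X) p(Y))].
I_XY = np.sum(p_xy * safe_log2(p_xy / np.outer(p_x, p_y)))

print(f"H(X|Y) = {H_X_given_Y:.4f} bits, I(X;Y) = {I_XY:.4f} bits")
```

The invariance of I(X;Y) under bijective relabelings of X or Y can be checked directly by permuting the rows or columns of the table above: the computed value does not change.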

2.2 Minimal Sufficient Statistics

A core concept in statistics is defining the relevant information about a target Y from observations X. One of the first mathematical formulations proposed for measuring relevance is the concept of a sufficient statistic. This concept is defined below [20].

Definition 1 (Sufficient Statistic). Let Y ∈ 𝒴 be an unknown parameter and X ∈ 𝒳 be a random variable with conditional probability distribution function p(x|y). Given a function f : 𝒳 → 𝒮, the random variable S = f(X) is called a sufficient statistic for Y if

∀x ∈ 𝒳, y ∈ 𝒴:  P(X = x | Y = y, S = s) = P(X = x | S = s).    (2)

In other words, a sufficient statistic captures all the information about Y that is available in X. This property is stated in the following theorem [21, 2].

Theorem 1. Let S be a probabilistic function of X. Then, S is a sufficient statistic for Y if and only if (iff)

I(S;Y) = I(X;Y).    (3)

Note that in many classical cases that one encounters in point estimation, it is assumed that there is a family of distribution functions parameterized by an unknown parameter θ, and furthermore that N Independent and Identically Distributed (i.i.d.) samples of the target distribution function are observed. This case fits the definition by setting Y = θ and considering the high-dimensional random variable X = {X^(i)}_{i=1}^N that contains all observations.
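To make Theorem 1 concrete, the following self-contained Python check (our illustration, not an example from the paper) takes Y to be one of two coin biases chosen uniformly at random and X to be N i.i.d. flips of the chosen coin; the number of heads S = Σ_i X^(i) is a sufficient statistic, and enumerating the joint distributions confirms that I(S;Y) = I(X;Y). The particular biases and N are arbitrary.

```python
import itertools
import numpy as np

# Setup: Y ~ Uniform{0, 1} picks a coin bias; X = (X^(1), ..., X^(N)) are N flips.
N = 4
bias = {0: 0.2, 1: 0.7}          # p(flip = 1 | Y = y), arbitrary illustrative values
p_y = {0: 0.5, 1: 0.5}

def mutual_information(joint):
    """Mutual information (bits) of a joint distribution given as {(a, b): prob}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * np.log2(p / (pa[a] * pb[b])) for (a, b), p in joint.items() if p > 0)

# Build p(x, y) over all 2^N flip sequences, and p(s, y) over head counts.
p_xy, p_sy = {}, {}
for y in (0, 1):
    for x in itertools.product((0, 1), repeat=N):
        s = sum(x)
        p = p_y[y] * bias[y] ** s * (1 - bias[y]) ** (N - s)
        p_xy[(x, y)] = p
        p_sy[(s, y)] = p_sy.get((s, y), 0.0) + p

print(f"I(X;Y) = {mutual_information(p_xy):.6f} bits")
print(f"I(S;Y) = {mutual_information(p_sy):.6f} bits  (equal, as Theorem 1 requires)")
```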


A simple investigation shows that the sufficiency definition includes the trivial identity statistic S = X. Obviously, such statistics are not helpful, as copying the whole signal cannot be called "extraction" of relevant information. Consequently, one needs a way to restrict the sufficient statistic from being wasteful in using observations. To address this issue, the authors of [22] introduced the notion of minimal sufficient statistics. This concept is defined below.

Definition 2 (Minimal Sufficient Statistic). A sufficient statistic S is said to be minimal if it is a function of all other sufficient statistics:

∀T: T is a sufficient statistic ⇒ ∃g: S = g(T).    (4)

It means that a Minimal Sufficient Statistic (MSS) induces the coarsest partitioning of the input variable X. In other words, an MSS tries to group the values of X together into as few partitions as possible, while making sure that there is no information loss in the process.

The following theorem describes the relation between minimal sufficient statistics and mutual information [21].

Theorem 2. Let X be a sample drawn from a distribution function that is determined by the random variable Y. The statistic S is an MSS for Y iff it is a solution of the optimization problem

min_{T: T is a sufficient statistic} I(X;T).    (5)

By using Theorem 1, the constraint of this optimization problem can be written in information-theoretic terms, as

min_{T: I(T;Y) = I(X;Y)} I(X;T).    (6)

It shows that the MSS is the statistic that has all the available information about Y, while retaining the minimum possible information about X. In other words, it is the best compression of X with zero information loss about Y.

In Table 1, the components of the MSS definition are presented in a concise way by using Markov chains. Note that these Markov chains should hold for every possible statistic S, sufficient statistic SS, and minimal sufficient statistic MSS. With these three Markov chains and the information inequality corresponding to each, it is easy to verify Theorems 1 and 2. Using the first two inequalities, it is easily proved that I(SS;Y) = I(X;Y). The last inequality shows that the MSS should be the SS with minimal I(SS;X).

In most practical problems, where X = {X^(i)}_{i=1}^N is N-dimensional data, one hopes to find a (minimal) sufficient statistic S in such a way that its dimension does not depend on N. Unfortunately, this is found to be impossible for almost all distributions (except the ones belonging to the exponential family) [21, 23].

2.3 Information Bottleneck

To tackle this problem, Tishby et al. presented the IB method, which solves the Lagrange relaxation of the optimization problem (6), namely [9]

min_{p(x̃|x)}  I(X;X̃) − βI(X̃;Y),    (7)

where X̃ is the representation of X, and β is a positive parameter that controls the trade-off between the compression and the preserved information about Y. For β ≤ 1, the trivial case where X̃ ⊥⊥ X is a solution. The reason is that the data processing inequality enforces I(X;X̃) ≥ I(X̃;Y) = ((1 − β) + β)I(X̃;Y). Therefore, the value (1 − β)I(X̃;Y) is a lower bound for the objective function of the optimization problem (7). For β ≤ 1, this lower bound is non-negative and is minimized by setting I(X̃;Y) = 0, which is achieved by simply choosing I(X;X̃) = 0.

As such, the solution starts from I(X;X̃) = I(X̃;Y) = 0, and by increasing β, both I(X;X̃) and I(X̃;Y) are increased. In the limit β → ∞, this optimization problem is equivalent to (5) [21]. Note that in IB, the optimization is performed over conditional distribution functions p(x̃|x). Therefore, the solution is no longer restricted to deterministic statistics T = f(X). In general, the optimization problem (7) does not necessarily have a deterministic solution. This is true even for simple cases with two binary variables [16]. The IB provides a quite general framework with many extensions (there are variations of this method for more than one variable [24]). But, since there is no evident connection between these variations and DNNs, they are not covered in this survey.


Table 1: Markov chains corresponding to the conditions that form a Minimal Sufficient Statistic, along with the information inequality each one enforces.

              Markov Chain        Data Processing Inequality
Statistic     Y → X → S           I(S;Y) ≤ I(X;Y)
Sufficient    Y → SS → X          I(SS;Y) ≥ I(X;Y)
Minimal       X → SS → MSS        ∀ SS: I(MSS;X) ≤ I(SS;X)

Tishby et al. showed that IB has a nice rate-distortion interpretation, using the distortion function d(x, x̃) = KL(p(y|x) ‖ p(y|x̃)) [25]. It should be noted that this does not exactly conform to the classical rate-distortion setting, since here the distortion function implicitly depends on the p(x̃|x) which is being optimized. They provided an algorithm similar to the well-known Blahut-Arimoto rate-distortion algorithm [26, 27] to solve the IB problem.
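The survey does not spell that algorithm out, but when p(x, y) is a known discrete table the self-consistent IB equations of [9] are short enough to sketch. The following numpy implementation is our own illustration (the function name, the toy joint distribution, the fixed iteration count, and the small smoothing constants are assumptions); it alternates the updates of the encoder q(x̃|x), the marginal q(x̃), and the decoder q(y|x̃).

```python
import numpy as np

def ib_iterations(p_xy, n_clusters, beta, n_iter=300, seed=0):
    """Iterate the self-consistent IB equations for a known discrete p(x, y).
    Returns the (generally stochastic) encoder q(x~|x) of shape (|X|, n_clusters).
    Minimal sketch: no convergence test, no annealing in beta."""
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)                              # p(x)
    p_y_given_x = p_xy / p_x[:, None]                   # p(y|x)

    q = rng.random((len(p_x), n_clusters))              # random soft assignment q(x~|x)
    q /= q.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        q_t = q.T @ p_x                                 # q(x~) = sum_x p(x) q(x~|x)
        # q(y|x~) = sum_x p(y|x) q(x~|x) p(x) / q(x~)
        q_y_given_t = (q * p_x[:, None]).T @ p_y_given_x / (q_t[:, None] + 1e-12)
        # KL(p(y|x) || q(y|x~)) for every pair (x, x~)
        log_ratio = np.log(p_y_given_x[:, None, :] + 1e-12) - np.log(q_y_given_t[None, :, :] + 1e-12)
        kl = (p_y_given_x[:, None, :] * log_ratio).sum(axis=2)
        # q(x~|x) proportional to q(x~) * exp(-beta * KL)
        q = q_t[None, :] * np.exp(-beta * kl)
        q /= q.sum(axis=1, keepdims=True)
    return q

# Toy joint distribution (arbitrary numbers, illustration only).
p_xy = np.array([[0.20, 0.05],
                 [0.15, 0.10],
                 [0.05, 0.20],
                 [0.10, 0.15]])
print(np.round(ib_iterations(p_xy, n_clusters=2, beta=5.0), 3))
```

Sweeping β from small to large values traces out the trade-off discussed above: for small β the encoder collapses to an uninformative assignment, while for large β it approaches a (minimal) sufficient partitioning of X.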

Until now, it was assumed that the joint distribution function of X and Y is known. But this is not the case in ML. In fact, if one knows the joint distribution function, then the problem is usually as easy as computing an expectation over the conditional distribution function; e.g., f(x) = E_{p(y|x)}[Y] for regression and E_{p(y|x)}[1(Y = c)], c ∈ 𝒴, for classification. Arguably, one of the main challenges of ML is to solve the problem when one has access to the distribution function only through a finite set of samples.

Interestingly, it was found that the value of β, introduced as a Lagrange relaxation parameter in (7), can be used to control the bias-variance trade-off in cases for which the distribution function is not known and the mutual information is just estimated from a finite number of samples. It means that instead of trying to reach the MSS by setting β → ∞, when the distribution function is unknown, one should settle for a β* < ∞ which gives the best bias-variance trade-off [21]. The reason is that the error of estimating the mutual information from finite samples is bounded by O(|X̃| log m / √m), where m is the number of samples and |X̃| is the number of possible values that the random variable X̃ can take (see Theorem 1 of [21]). The quantity |X̃| has a direct relation with β: a small β means a more compressed X̃, meaning that fewer distinct values are required to represent X̃. This is in line with the general rule that simpler models generalize better. As such, there are two opposite forces in play: one tries to increase β to make the Lagrange relaxation of the optimization problem (7) more accurate, while the other tries to decrease β in order to control the finite-sample estimation errors of I(X;X̃) and I(X̃;Y). The authors of [21] also tried to make some connections between the IB and the classification problem. Their main argument is that, in equation (7), I(X̃;Y) can be considered as a proxy for the classification error. They showed that if two conditions are met, the misclassification error is bounded from above in terms of I(X̃;Y). These conditions are: i) the classes have equal probability, and ii) each sample is composed of a large number of components (as in the document (text) classification setting). The latter is equivalent to the general technique in IT where one can neglect small probabilities when dealing with typical sets. They also argued that I(X;X̃) is a regularization term that controls the generalization-complexity trade-off.

The main limitation of their work is that they considered both X and Y to be discrete. This assumption is violated in many applications of ML, including image and speech processing. While there are extensions of IB that allow working with continuous random variables [28], their finite-sample analysis and the connections to ML applications are less studied.

3. Information Bottleneck and Deep Learning

After the revolution of DNNs, which started with the work of [29], the state-of-the-art algorithms in various areas of ML were beaten by DNN alternatives. While most of the ideas used in DNNs have existed for decades, the recent success attracted unprecedented attention from the community. In this new paradigm, both practitioners and theoreticians found new ideas, either using DNNs to solve specific problems or using previous theoretical tools to understand DNNs.

Similarly, the interaction of IB and DNNs in the literature can be divided into two main categories. The first is to use the IB theories in order to analyze DNNs, and the other is to use ideas from IB to improve DNN-based learning algorithms. The remainder of this section is divided based on these categories.

Section 3.1 is devoted to the application of IB in analyzing the usual DNNs, which is mainly motivated by the conjecture that Stochastic Gradient Descent (SGD), the de facto learning algorithm used for DNNs, implicitly solves an IB problem. In Section 3.2, the practical applications of IB for improving DNNs and developing new structures are discussed. The practical application is currently mostly limited to Variational Auto-Encoders (VAEs).

3.1 Information Bottleneck and Stochastic Gradient Descent

From a theoretical standpoint, the success of DNNs is not completely understood. The reason is that many learning theory tools analyze models with a limited capacity and find inequalities restricting the deviation between train and test statistics. But it was shown that commonly used DNNs have huge capacities that make such theoretical results inapplicable [7, 4]. In recent years, there have been many efforts to mathematically explain the generalization capability of DNNs by using a variety of tools. They range from attributing it to the way that the SGD method automatically finds flat local minima (which are stable and thus generalize well) [30, 31, 32, 33], to efforts trying to relate the success of DNNs to the special class of hierarchical functions that they generate [34]. Each of these categories has its critics and thus the problem is still under debate (e.g., [35] argues that flatness can be changed arbitrarily by re-parametrization and that the direct relation between generalization and flatness is not generally true). In this survey, the focus is on a special set of methods that try to analyze DNNs by information theory results (see [36] for a broader discussion).

Tishby et al. used ideas from IB to formulate the goal of deep learning as an information-theoretic trade-off between compression and prediction [8]. In that view, an NN forms a Markov chain of representations, each trying to refine (compress) the representation while preserving the information about the target. Therefore, they argued that a DNN is automatically trying to solve an IB problem and the last layer is the optimal representation X̃ that is to be found. Then, they used the generalization theories of IB (discussed in Section 2.3) to explain the success of DNNs. One of their main contributions is the idea of using information plane diagrams to show what happens inside a DNN (see Figure 1b). The information plane is a 2D diagram with I(X;X̃) and I(X̃;Y) as the x and y axes, respectively. In this diagram, each layer of the network is represented by a point that shows how much information it contains about the input and the output.

Later, they also showed empirically that in learning DNNs by simple SGD (without regularization or batch normalization), the compression actually happens [10]. The Markov chain representation that they used and their results are shown in Figure 1. As the SGD proceeds, by tracking each layer on the information plane, they reported observing path A in Figure 1b. In this path, a deep hidden layer starts from the point (0, 0). The justification is that at the beginning of SGD, where all weights are chosen randomly, the hidden layer is meaningless and does not hold any information about either X or Y. During the training phase, as the prediction loss is minimized, I(X̃;Y) is expected to increase (since the network uses X̃ to predict the label, and its success depends on how much information X̃ has about Y). But changes in I(X;X̃) are not easy to predict. The surprising phenomenon that they reported is that at first I(X;X̃) increases (called the learning phase), but at some point a phase transition happens (represented by a star in Figure 1b) and I(X;X̃) starts to decrease (called the compression phase). It is surprising because the loss minimized in deep learning does not have any compression term.


Figure 1: Information plane diagram of DNNs. (a) Markov chain representation of a DNN with m hidden layers. [Note that the predicted label Ŷ has access to Y only through X.] (b) The path that hidden layers undergo during SGD training in the information plane. Three possible paths under debate by the authors are represented by A, B, and C.

Through experimental investigations, they also found that compression happens in the later steps of SGD, when the empirical error is almost zero and the gradient vector is dominated by its noisy part (i.e., a small gradient mean but a high gradient variance is observed). From this observation, they argued that after reaching a low empirical error, the noisy gradient descent forms a diffusion process which approaches the stationary distribution that maximizes the entropy of the weights, under the empirical error constraint. They also explained how deeper structures can help SGD approach this equilibrium faster. In summary, their results suggested that the reason behind the success of DNNs is that they automatically learn short descriptions of samples, which in turn controls the capacity of the models. They reported their results for both synthetic datasets (true mutual information values) and real datasets (estimated mutual information values).

Saxe et al. [13] further investigated this phenomenon on more datasets and different kinds of activation functions. They observed the compression phase only in cases where a saturating activation function is used (e.g., sigmoid or tanh). They argued that the diffusion-process explanation is not adequate to explain all the different cases; e.g., for the ReLU activation, which is commonly used in the literature, they usually could not see any compression phase (path B in Figure 1b). It should be noted that their observations do not take the effect of compression completely out of the picture; rather, they just reject the universal existence of an explicit compression phase at the end of the training phase. As shown in Figure 1b, even though there is no compression phase in path B, the resulting representation is still compressed compared to X. This compression effect can be attributed to the initial randomness of the network rather than to an explicit compression phase. They also noticed that the way the mutual information is estimated is crucial in the process. One of the usual methods for mutual information estimation is binning. In that approach, the bin size is the parameter to be chosen. They showed that for small enough bin sizes, if the precision error of arithmetic calculations is not involved, there will not be any information loss to begin with (path C in Figure 1b). The reason is that when one projects a finite set of distinct points to a random lower-dimensional space, the chance that any two points get mixed is zero. Even though this problem is seemingly just an estimation error caused by a low number of samples in each bin (and thus does not invalidate the synthetic-data results of [10]), it is actually connected to a more fundamental problem. If one removes the binning process and deals with the true values of mutual information, serious problems arise when using IB to study common DNNs on continuous variables. The problem is that in usual DNNs, for which the hidden representation has a deterministic relation with the inputs, the IB functional of the optimization problem (7) is infinite for almost all weight matrices and thus the problem is ill-posed. This concept was further investigated in [37].

Even though the problem was not explicitly addressed until recently, there are two approaches used by researchers that automatically tackle this problem. As mentioned before, the first approach, used by [8], applies binning techniques to estimate the mutual information. This is equivalent to adding (quantization) noise, which keeps the IB functional finite. But, in this way, the noise is added just for the analysis process and does not affect the NN itself. As noted by [13], unfortunately some of the advertised characteristics of mutual information, namely the information inequality for layers and the invariance under reparameterization of the weights, then no longer hold.
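As a rough illustration of this first approach (our own sketch, not the exact estimator of [10] or [13]), the code below discretizes the activations of one hidden layer into equal-width bins and estimates I(T;X) and I(T;Y) from a sample; the bin count is precisely the parameter whose choice [13] showed to be critical. The function name, the bin count, and the stand-in data are assumptions.

```python
import numpy as np

def binned_layer_mi(activations, labels, n_bins=30):
    """Binning estimate of (I(T;X), I(T;Y)) for one layer.  Assumes all inputs in
    the sample are distinct, so that after discretization I(T;X) = H(T)."""
    # Discretize each activation into equal-width bins, then treat the whole
    # binned vector as a single discrete symbol t.
    edges = np.linspace(activations.min(), activations.max(), n_bins + 1)
    binned = np.digitize(activations, edges[1:-1])
    _, t = np.unique(binned, axis=0, return_inverse=True)
    t = t.ravel()  # guard against numpy versions that keep an extra axis

    def entropy(ids):
        p = np.bincount(ids) / len(ids)
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    h_t = entropy(t)                                   # estimate of I(T;X)
    h_t_given_y = sum((labels == y).mean() * entropy(t[labels == y])
                      for y in np.unique(labels))
    return h_t, h_t - h_t_given_y                      # (I(T;X), I(T;Y)) estimates

# Usage with random stand-in data; in practice `activations` would be recorded
# from a hidden layer at different stages of SGD to draw the information plane.
activations = np.random.randn(1000, 10)
labels = np.random.randint(0, 2, 1000)
print(binned_layer_mi(activations, labels))
```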

The second approach is to explicitly add noise to the layers and thus make the NN truly stochastic. This idea was first discussed by [10] as a way to bias IB toward simpler models (as is usually desired in ML problems). It was later found that there is a direct relationship between SGD and variational inference [38]. On the other hand, variational inference has a "noisy computation" interpretation [16]. These results showed that the idea of using stochastic mappings in NNs had been used much earlier than the recent focus on IB interpretations. In the light of this connection, researchers tried to propose new learning algorithms based on IB in order to take the compression into account more explicitly. These ideas are strongly connected to Variational Auto-Encoders (VAEs) [39]. The denoising auto-encoders [40, 41] also use an explicit noise addition and thus can be studied in the IB framework. The next section is devoted to the relation between IB and VAE, which has recently become a core concept in the field.

3.2 Information Bottleneck and Variational Auto-Encoder

Achille et al. [16] introduced the idea of information dropout in correspondence to the commonly used dropout technique [6]. Starting from the loss functional in the optimization problem (7) and noting that I(X̃;Y) = H(Y) − H(Y|X̃), one can rewrite the problem as

min_{p(x̃|x)}  I(X;X̃) + βH(Y|X̃).    (8)

Moreover, the terms can be expanded as per-sample losses:

H(Y|X̃) = E_{p(x,y)}[ E_{p(x̃|X)}[ −log p(Y|X̃) ] ],
I(X;X̃) = E_{p(x)}[ KL(p(x̃|X) ‖ p(x̃)) ],    (9)

where KL denotes the Kullback-Leibler divergence. The expectations in these two equations can be estimated by a sampling process. For the distribution functions p(x) and p(x, y), the training samples D = {(x^(i), y^(i))}_{i=1}^N are already given. Therefore, the loss function of IB can be approximated as

L = (1/N) Σ_{i=1}^{N} [ E_{p(x̃|x^(i))}[ −log p(y^(i)|x̃) ] + β KL(p(x̃|x^(i)) ‖ p(x̃)) ]    (10)

(note that, up to an overall rescaling of the objective, the β that weights the KL term here plays the role of 1/β in (8)).

It is worth noting that if we let x̃ be the output of the NN, the first term is the cross-entropy (which is the loss function usually used in deep learning). The second term acts as a regularization term that prevents the conditional distribution function p(x̃|x) from being too dependent on the value of x. As noted by [16], this formulation reveals an interesting resemblance to the Variational Auto-Encoder (VAE) presented by [39]. The VAE tries to solve the unsupervised problem of reconstruction by modeling the process which generated each data point from a (simpler) random variable x̃ with a (usually fixed) prior p_0(x̃). The goal is to find the generative distribution function p_θ(x|x̃) and also a variational approximation p_φ(x̃|x). This is done by maximizing a variational lower bound on the marginal log-likelihood of the training data, or equivalently by minimizing the loss [16]

L_{θ,φ} = (1/N) Σ_{i=1}^{N} [ E_{p_φ(x̃|x^(i))}[ −log p_θ(x^(i)|x̃) ] + KL(p_φ(x̃|x^(i)) ‖ p_0(x̃)) ].    (11)

Comparing this with equation (10), it is evident that VAE can be considered as an estimation of a special case of IB when: i) Y = X, ii) β = 1, iii) the prior distribution function is fixed, p(x̃) = p_0(x̃), and iv) the distribution functions p(x̃|x) and p(x|x̃) are parameterized by φ and θ, respectively. These parameters are optimized separately, as suggested by variational inference (note that in IB the attention is on p(x̃|x), and, assuming that p(x, y) is given, the values of p(x̃) and p(y|x̃) are determined from it). It is worth noting that restrictions ii and iii are crucial. The reason is that just setting Y = X and β = 1, without any other restrictions, would make the objective function (7) a constant, so that every p(x̃|x) would be a solution. Even if β ≠ 1, the trivial loss function (1 − β)I(X;X̃) is obtained, which is minimized either by x̃ = x (when β > 1) or by x̃ ⊥⊥ x (when β < 1). Neither of these solutions is desired in representation learning (for another view on this matter, see the discussion of [42] on "feasible" vs. "realizable" solutions).

A similar variational approach is used to solve the IB optimization problem (10) in the more general setting with β ≠ 1 and Y ≠ X [16].
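Since the survey does not show how a loss such as (10) is optimized in practice, the following PyTorch sketch of a variational IB classifier, in the spirit of [14, 16], may help. The diagonal-Gaussian encoder, the fixed standard-normal prior, the layer sizes, and the single-sample estimate of the inner expectation are all illustrative assumptions rather than the exact models of those papers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBClassifier(nn.Module):
    """Encoder p_phi(z|x) as a diagonal Gaussian, decoder p_theta(y|z) as a linear
    classifier; a minimal sketch of the variational IB idea."""
    def __init__(self, in_dim=784, z_dim=32, n_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, 2 * z_dim))  # mean and log-variance
        self.decoder = nn.Linear(z_dim, n_classes)

    def forward(self, x):
        mu, log_var = self.encoder(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)  # reparameterized sample
        return self.decoder(z), mu, log_var

def vib_loss(logits, y, mu, log_var, beta=1e-3):
    # First term of (10): E[-log p(y|z)], estimated with a single sample of z.
    cross_entropy = F.cross_entropy(logits, y)
    # Second term: closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ),
    # standing in for KL(p(z|x) || p(z)) with a fixed prior.
    kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(dim=-1).mean()
    return cross_entropy + beta * kl

# One illustrative training step on random stand-in data.
model = VIBClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(64, 784), torch.randint(0, 10, (64,))
logits, mu, log_var = model(x)
loss = vib_loss(logits, y, mu, log_var)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(float(loss))
```

Setting beta to 0 recovers an ordinary stochastic classifier trained with cross-entropy, while increasing beta strengthens the compression term; with a decoder that reconstructs x instead of predicting y and beta = 1, the same loss becomes the VAE objective (11) discussed above.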

Another point to note is that, despite the connection between IB and VAE, some of the VAE issues that researchers have reported do not directly apply to IB. In fact, we think that it is helpful to use the IB interpretation to understand the VAE problems and to remedy them. For example, one of the improvements over the original VAE is β-VAE [45]. Its authors found that having β > 1 leads to better performance compared to the original configuration of VAE, which is restricted to β = 1. This phenomenon can be studied by using its counterpart results in IB. As mentioned in Section 2.3, β controls the bias-variance trade-off in the case of a finite training set. Therefore, one should search for the β* which does best in practice at preventing the model from over-fitting. The same argument might be applied to VAE.

Another issue in VAE, which has attracted the attention of many researchers [42, 43, 46, 47], is that when the family of decoders p_θ(x|x̃) is too powerful, the loss function (11) can be minimized by just using the decoder and completely ignoring the latent variable; i.e., p_θ(x|x̃) = p(x). In this case, the optimization objective (11) decomposes into two separate terms, where the first term depends only on θ and the second term depends only on φ. As a result, the second term is minimized by setting p_φ(x̃|x) = p_0(x̃). Therefore, x and x̃ will be independent, which is obviously not desired in a feature extraction problem. This problem does not exist in the original IB formulation, in which the focus is on p(x̃|x) and p(x|x̃) ∝ p(x̃|x)p(x) is computed without any degrees of freedom (no parameter θ to optimize). This is in contrast with the VAE setting, where the discussion starts from p_θ(x|x̃) and later p_φ(x̃|x) is introduced by variational inference. Note that having a strong family of encoders p_φ(x̃|x) does not cause any problem, as long as it is adequately regularized by KL(p_φ(x̃|x^(i)) ‖ p_0(x̃)). It should be added that even though IB does not inherently suffer from the "too strong decoder" problem, the current methods, which are based on the variational distribution and optimization of both θ and φ, are not immune to it [14, 12, 16]. This is currently an active research area and we believe the IB viewpoint will help to develop better solutions to it.

In Figure 2, a summary of the existing methods and how they evolved through time is represented in a hierarchical structure. Note that the solution based on variational techniques [16] bypasses all the limitations faced in the previous sections; i.e., it is not limited to a specific family of distributions, does not need the distribution function to be known, and also works for continuous variables. As represented in this figure, while the recent methods are capable of solving more general problems, the theoretical guarantees for them are scarcer.

4. Beyond Information Bottleneck

All the methods discussed so far use IB, which relies on the quantity I(X;T) to control the variance of the method (see Section 2.3). While this approach has been used successfully in many applications, its complete theoretical analysis in the general case is difficult. In this section, a different approach based on mutual information, which has recently attracted the attention of researchers, is presented. In this new view, instead of looking at I(X;T) as the notion of complexity, one considers I(S;A(S)). Here S is the set of all training samples, and A is the learning algorithm which uses the training points to compute a hypothesis h.

In this approach, not only the mutual information between a single sample X and its representation is considered; rather, the mutual information between all of the samples and the whole learned model is studied.


[Figure 2 is a tree diagram: Information Extraction branches into the Exponential Family (Statistics), covered by Minimal Sufficient Statistics [22, 2] (before 1999), and General Distributions (IB). The latter branches into Known Distribution, covered by Blahut-Arimoto [9, 25, 24, 28] (1999-2005), and Unknown Distribution, which in turn branches into Discrete, covered by Empirical Blahut-Arimoto [21] (2010), and Continuous (DNNs), covered by Noisy Computation [16, 40, 41, 43] (2013-2018) and Information Regularization [14, 44, 11, 12, 15] (2016-2018). Methods toward the left are applicable in fewer cases but come with more theoretical guarantees; methods toward the right are applicable in more cases but with fewer theoretical guarantees.]

Figure 2: Schematic review of the main information extraction methods discussed in this survey, representing the evolution of the algorithms through time. Moving from left to right, the methods are sorted in chronological order. This figure shows that recent algorithms are applicable in more general cases (but usually provide fewer theoretical guarantees).

Following recent information-theoretic techniques from [48, 49, 50], the authors of [3] used this notion to prove the following interesting inequality:

P[ |err_test − err_train| > ε ] < O( I(S;A(S)) / (nε²) ),    (12)

where err_test and err_train are the test (true) error and the training (empirical) error of the hypothesis A(S), respectively, n is the training set size, and ε > 0 is a positive real number.

The intuition behind this inequality is that the more bits of the training set a learning algorithm uses, the higher the risk that it will overfit to it. The interesting property of this inequality is that the mutual information between the whole input and output of the algorithm depends deeply on all aspects of the learning algorithm. This is in contrast with many other approaches that use the properties of the hypothesis space H to bound the generalization gap, where the effect of the final hypothesis chosen by the learning algorithm is usually blurred away due to the use of uniform convergence in proving the bounds, as in the Vapnik-Chervonenkis theory [51]. In [52], the chaining method [53] was used to further improve the inequality (12) so that it also takes into account the capacity of the hypothesis space.

Though the inequality (12) seems appealing, as it directly bounds the generalization error by the simple-looking information-theoretic term I(S;A(S)), unfortunately the calculation/estimation of this term is even harder than that of the I(X;T) used in IB. This makes it quite challenging to apply this technique in real-world machine learning problems, where the distribution is unknown and the learning algorithm is usually quite complex [54, 4].

To the best knowledge of the authors, the only attempt made to use this technique to analyze the deep learning process is the recent article [55]. In that work, the authors argue that as the dataset S goes through the DNN layers 1, …, m, an intermediate sequence of datasets (S_ℓ)_{ℓ=1}^m is formed, and I(S_ℓ;W) is a decreasing function of ℓ (here W is the set of all weights in the DNN). They further argue that this can be used along with the inequality (12) to show that deeper architectures have a smaller generalization error. A major problem with their analysis is that they used the Markov assumption W → S → S_1 → S_2 → … → S_{m−1} → S_m. This assumption does not generally hold in a DNN, because calculating S_ℓ requires direct use of W (more precisely, the weights up to layer ℓ are used). Therefore, it seems that the correct application of this technique in analyzing DNNs requires a more elaborate treatment, which is hoped to appear in the near future.

5. Conclusion

A survey on the interaction of IB and DNNs was given. First, the headlines of the long history of using information theory in ML were presented. The focus was on how the ideas evolved over time. The discussion started from the MSS, which is practically restricted to distributions from the exponential family. Then the IB framework and the Blahut-Arimoto algorithm were discussed, which do not work for unknown continuous distributions. After that, methods based on variational approximation were introduced, which are applicable to quite general cases. Finally, another, more theoretically appealing, usage of information theory was introduced, which uses the mutual information between the training set and the learned model to bound the generalization error of a learning algorithm. Despite its theoretical benefits, it was shown that its application to understanding DNNs is challenging.

During this journey, it was revealed how some seemingly unrelated areas have hidden relations to IB. It was also shown how the mysterious generalization power of SGD (the de facto learning method of DNNs) is hypothesized to be caused by the implicit IB compression property hidden in SGD. Also, the recently successful unsupervised method VAE was found to be a special case of IB when solved by employing the variational approximation.

In fact, the profound and seemingly simple tools that information theory provides bring some traps. As understanding these pitfalls is just as important, they were also discussed in this survey. It was shown how seemingly harmless information-theoretic formulas can create impossible situations. The two major cases discussed were: i) using the mutual information to train continuous deterministic DNNs, which makes the problem ill-posed, and ii) using variational approximations without restricting the space of solutions, which can easily result in meaningless situations. The important lesson learned from these revelations was how ideas from information theory can give a unified view of different ML concepts. We believe that this view is quite helpful for understanding the shortcomings of methods and remedying them.

Acknowledgment

We wish to thank Dr. Mahdieh Soleymani for her beneficial discussions and comments.

References

[1] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
[2] S. Kullback and R. A. Leibler, "On information and sufficiency," The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951.
[3] R. Bassily, S. Moran, I. Nachum, J. Shafer, and A. Yehudayoff, "Learners that Use Little Information," in Algorithmic Learning Theory, pp. 25–55, 2018.
[4] M. Vera, P. Piantanida, and L. R. Vega, "The Role of Information Complexity and Randomization in Representation Learning," arXiv:1802.05355 [cs, stat], Feb. 2018.
[5] I. Nachum, J. Shafer, and A. Yehudayoff, "A Direct Sum Result for the Information Complexity of Learning," arXiv:1804.05474 [cs, math, stat], Apr. 2018.
[6] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[7] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, "Understanding deep learning requires rethinking generalization," International Conference on Learning Representations, 2017.
[8] N. Tishby and N. Zaslavsky, "Deep Learning and the Information Bottleneck Principle," arXiv preprint arXiv:1503.02406, 2015.
[9] N. Tishby, F. Pereira, and W. Bialek, "The information bottleneck method," in Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pp. 368–377, 1999.
[10] R. Shwartz-Ziv and N. Tishby, "Opening the Black Box of Deep Neural Networks via Information," arXiv:1703.00810 [cs], Mar. 2017.
[11] P. Khadivi, R. Tandon, and N. Ramakrishnan, "Flow of information in feed-forward deep neural networks," arXiv preprint arXiv:1603.06220, 2016.
[12] A. Achille and S. Soatto, "On the Emergence of Invariance and Disentangling in Deep Representations," arXiv:1706.01350 [cs, stat], June 2017.
[13] A. M. Saxe, Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, B. D. Tracey, and D. D. Cox, "On the Information Bottleneck Theory of Deep Learning," International Conference on Learning Representations, Feb. 2018.
[14] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, "Deep Variational Information Bottleneck," International Conference on Learning Representations, 2017.
[15] A. Kolchinsky, B. D. Tracey, and D. H. Wolpert, "Nonlinear Information Bottleneck," arXiv:1705.02436 [cs, math, stat], May 2017.
[16] A. Achille and S. Soatto, "Information Dropout: Learning Optimal Representations Through Noisy Computation," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2018.
[17] A. Kolmogorov, "On the Shannon theory of information transmission in the case of continuous signals," IRE Transactions on Information Theory, vol. 2, no. 4, pp. 102–108, 1956.
[18] T. M. Cover, P. Gacs, and R. M. Gray, "Kolmogorov's Contributions to Information Theory and Algorithmic Complexity," The Annals of Probability, no. 3, pp. 840–865, 1989.
[19] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
[20] R. A. Fisher, "On the mathematical foundations of theoretical statistics," Phil. Trans. R. Soc. Lond. A, vol. 222, no. 594-604, pp. 309–368, 1922.
[21] O. Shamir, S. Sabato, and N. Tishby, "Learning and generalization with the information bottleneck," Theoretical Computer Science, vol. 411, pp. 2696–2711, June 2010.
[22] E. L. Lehmann and H. Scheffe, "Completeness, Similar Regions, and Unbiased Estimation," in Bulletin of the American Mathematical Society, vol. 54, pp. 1080–1080, 1948.
[23] B. O. Koopman, "On distributions admitting a sufficient statistic," Transactions of the American Mathematical Society, vol. 39, no. 3, pp. 399–409, 1936.
[24] N. Friedman, O. Mosenzon, N. Slonim, and N. Tishby, "Multivariate information bottleneck," in Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pp. 152–161, Morgan Kaufmann Publishers Inc., 2001.
[25] R. Gilad-Bachrach, A. Navot, and N. Tishby, "An information theoretic tradeoff between complexity and accuracy," in Learning Theory and Kernel Machines, pp. 595–609, Springer, 2003.
[26] R. Blahut, "Computation of channel capacity and rate-distortion functions," IEEE Transactions on Information Theory, vol. 18, no. 4, pp. 460–473, 1972.
[27] S. Arimoto, "An algorithm for computing the capacity of arbitrary discrete memoryless channels," IEEE Transactions on Information Theory, vol. 18, no. 1, pp. 14–20, 1972.
[28] G. Chechik, A. Globerson, N. Tishby, and Y. Weiss, "Information bottleneck for Gaussian variables," Journal of Machine Learning Research, vol. 6, no. Jan, pp. 165–188, 2005.
[29] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
[30] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, "On large-batch training for deep learning: Generalization gap and sharp minima," arXiv preprint arXiv:1609.04836, 2016.
[31] S. Hochreiter and J. Schmidhuber, "Flat minima," Neural Computation, vol. 9, no. 1, pp. 1–42, 1997.
[32] P. Chaudhari, A. Choromanska, S. Soatto, and Y. LeCun, "Entropy-SGD: Biasing gradient descent into wide valleys," arXiv preprint arXiv:1611.01838, 2016.
[33] M. Hardt, B. Recht, and Y. Singer, "Train faster, generalize better: Stability of stochastic gradient descent," arXiv preprint arXiv:1509.01240, 2015.
[34] T. Poggio, H. Mhaskar, L. Rosasco, B. Miranda, and Q. Liao, "Why and When Can Deep–but Not Shallow–Networks Avoid the Curse of Dimensionality," arXiv preprint arXiv:1611.00740, 2016.
[35] L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio, "Sharp minima can generalize for deep nets," arXiv preprint arXiv:1703.04933, 2017.
[36] R. Vidal, J. Bruna, R. Giryes, and S. Soatto, "Mathematics of Deep Learning," arXiv:1712.04741 [cs], Dec. 2017.
[37] R. A. Amjad and B. C. Geiger, "How (Not) To Train Your Neural Network Using the Information Bottleneck Principle," arXiv:1802.09766 [cs, math], Feb. 2018.
[38] P. Chaudhari and S. Soatto, "Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks," arXiv:1710.11029 [cond-mat, stat], Oct. 2017.
[39] D. P. Kingma and M. Welling, "Auto-encoding variational bayes," arXiv preprint arXiv:1312.6114, 2013.
[40] Y. Bengio, L. Yao, G. Alain, and P. Vincent, "Generalized denoising auto-encoders as generative models," in Advances in Neural Information Processing Systems, pp. 899–907, 2013.
[41] D. J. Im, S. Ahn, R. Memisevic, and Y. Bengio, "Denoising Criterion for Variational Auto-Encoding Framework," in AAAI, pp. 2059–2065, 2017.
[42] A. A. Alemi, B. Poole, I. Fischer, J. V. Dillon, R. A. Saurous, and K. Murphy, "Fixing a Broken ELBO," arXiv:1711.00464 [cs, stat], Feb. 2018.
[43] X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel, "Variational lossy autoencoder," arXiv preprint arXiv:1611.02731, 2016.
[44] C.-W. Huang and S. S. S. Narayanan, "Flow of Renyi information in deep neural networks," in Machine Learning for Signal Processing (MLSP), 2016 IEEE 26th International Workshop on, pp. 1–6, IEEE, 2016.
[45] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, "Beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework," International Conference on Learning Representations, Nov. 2016.
[46] S. Zhao, J. Song, and S. Ermon, "InfoVAE: Information maximizing variational autoencoders," arXiv preprint arXiv:1706.02262, 2017.
[47] H. Zheng, J. Yao, Y. Zhang, and I. W. Tsang, "Degeneration in VAE: In the Light of Fisher Information Loss," arXiv:1802.06677 [cs, stat], Feb. 2018.
[48] D. Russo and J. Zou, "How much does your data exploration overfit? Controlling bias via information usage," arXiv preprint arXiv:1511.05219, 2015.
[49] D. Russo and J. Zou, "Controlling bias in adaptive data analysis using information theory," in Artificial Intelligence and Statistics, pp. 1232–1240, 2016.
[50] A. Xu and M. Raginsky, "Information-theoretic analysis of generalization capability of learning algorithms," in Advances in Neural Information Processing Systems, pp. 2521–2530, 2017.
[51] V. N. Vapnik and A. Y. Chervonenkis, "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities," Theory of Probability and its Applications, vol. 16, no. 2, p. 264, 1971.
[52] A. R. Asadi, E. Abbe, and S. Verdu, "Chaining Mutual Information and Tightening Generalization Bounds," arXiv preprint arXiv:1806.03803, 2018.
[53] R. M. Dudley, "The sizes of compact subsets of Hilbert space and continuity of Gaussian processes," Journal of Functional Analysis, vol. 1, no. 3, pp. 290–330, 1967.
[54] M. Vera, P. Piantanida, and L. R. Vega, "The Role of the Information Bottleneck in Representation Learning," in 2018 IEEE International Symposium on Information Theory (ISIT), Vail, CO, pp. 1580–1584, IEEE, June 2018.
[55] J. Zhang, T. Liu, and D. Tao, "An Information-Theoretic View for Deep Learning," arXiv preprint arXiv:1804.09060, 2018.

Hassan Hafez-Kolahi was born in Mashhad, Iran, in 1989. He received the B.Sc. degree in Computer Engineering from Ferdowsi University of Mashhad in 2011 and the M.Sc. degree from Sharif University of Technology in 2013. He is currently a Ph.D. candidate at Sharif University of Technology. His research interests are in machine learning and information theory.

Shohreh Kasaei received the B.Sc. degree from the Department of Electronics, Faculty of Electrical and Computer Engineering, Isfahan University of Technology, Iran, in 1986, the M.Sc. degree from the Graduate School of Engineering, Department of Electrical and Electronic Engineering, University of the Ryukyus, Japan, in 1994, and the Ph.D. degree from the Signal Processing Research Centre, School of Electrical and Electronic Systems Engineering, Queensland University of Technology, Australia, in 1998. She joined Sharif University of Technology in 1999, where she is currently a full professor and the director of the Image Processing Laboratory (IPL). Her research interests are in 3D computer vision and image/video processing, including dynamic 3D object tracking, dynamic 3D human activity recognition, virtual reality, 3D semantic segmentation, multi-resolution texture analysis, scalable video coding, image retrieval, video indexing, face recognition, hyperspectral change detection, video restoration, and fingerprint authentication.
