
Proceedings of Machine Learning Research 101:94–108, 2019 ACML 2019

ResNet and Batch-normalization Improve Data Separability

Yasutaka Furusho [email protected]

Kazushi Ikeda [email protected]

Nara Institute of Science and Technology, Nara 630-0192, Japan

Editors: Wee Sun Lee and Taiji Suzuki

Abstract

The skip-connection and the batch-normalization (BN) in ResNet enable an extremely deep neural network to be trained with high performance. However, the reasons for this high performance are still unclear. To clarify them, we study the effects of the skip-connection and the BN on the class-related signal propagation through hidden layers, because a large ratio of the between-class distance to the within-class distance of feature vectors at the last hidden layer induces high performance. Our result shows that the between-class distance and the within-class distance change differently through layers: the deep multilayer perceptron with randomly initialized weights degrades the ratio of the between-class distance to the within-class distance, and the skip-connection and the BN relax this degradation. Moreover, our analysis implies that the skip-connection and the BN encourage training to improve this distance ratio. These results imply that the skip-connection and the BN induce high performance.

Keywords: Deep learning, ResNet, Skip-connection, Batch-normalization

1. Introduction

Deep neural networks have a high expressive power that grows exponentially with respect to the depth of the neural network (Montufar et al., 2014; Telgarsky, 2016; Raghu et al., 2017). However, the classic multilayer perceptron (MLP) cannot reduce its empirical risk by training even though it stacks more layers (He et al., 2016a). To overcome this problem, the ResNet incorporates skip-connections between layers (He et al., 2016a,b) and the batch-normalization (BN) normalizes the input of activation functions (Ioffe and Szegedy, 2015). These architectures enable an extremely deep neural network to be trained with high performance.

The property of the skip-connection has been discussed from the point of view of signal propagation, that is, the change of feature vectors through layers (Poole et al., 2016; Yang and Schoenholz, 2017). In an MLP with randomly initialized weights, the cosine distance between two feature vectors converges to a fixed point in [0, 1] in an exponential order of the depth (Poole et al., 2016). The skip-connection relaxes this exponential order into a polynomial order and thus preserves the structure of the input space (Yang and Schoenholz, 2017). Their analysis used the mean-field theory, that is, they considered a neural network with infinitely many hidden units and approximated the sum of the activations by a Gaussian random variable. This approximation can deal with broad classes of activation functions.

© 2019 Y. Furusho & K. Ikeda.


However, their analysis is limited to randomly initialized neural networks, and the theoretical relationship between their results and the classification performance is unclear.

In this work, we focused on the most popular activation function, the ReLU function, and analyzed the class-related signal propagation through the hidden layers of the randomly initialized MLP, the ResNet, and the ResNet with BN, as well as the effect of training, because a large ratio of the between-class distance to the within-class distance of feature vectors at the last hidden layer induces high classification performance (Devroye et al., 2013). Our results show that, with randomly initialized weights, the MLP strongly decreases the between-class distance compared with the within-class distance. The skip-connection and the BN relax this decrease of the between-class distance thanks to the preservation of the angle between input vectors. Our analysis also implies that this preservation of the angle at initialization encourages training to improve the distance ratio. These results imply that the skip-connection and the BN induce high performance.

2. Preliminaries

2.1. Problem settings

We have a training set S = {(x(n), y(n))}_{n=1}^N. Each example is a pair of an input x(n) ∈ R^D and a class label y(n) ∈ {−1, +1}, which are independent and identically distributed samples from a probability distribution D. Indices of the examples are omitted if they are clear from the context.

2.2. Neural networks

We consider DNNs, which transform an input vector x ∈ R^D into a new feature vector h^L ∈ R^D through the following L blocks. Let h^0 = x and φ(·) = max{0, ·} be the ReLU activation function.

Multilayer perceptron (MLP):

h^{l+1}_i = φ(u^{l+1}_i),   u^{l+1}_i = ∑_{j=1}^D W^l_{i,j} h^l_j.   (1)

ResNet (Yang and Schoenholz, 2017; Hardt and Ma, 2017):

h^{l+1}_i = ∑_{j=1}^D W^{l,2}_{i,j} φ(u^{l+1}_j) + h^l_i,   u^{l+1}_i = ∑_{j=1}^D W^{l,1}_{i,j} h^l_j.   (2)

ResNet with batch-normalization (BN):

h^{l+1}_i = ∑_{j=1}^D W^{l,2}_{i,j} φ(BN(u^{l+1}_j)) + h^l_i,   BN(u^{l+1}_i) = (u^{l+1}_i − E[u^{l+1}_i]) / √Var(u^{l+1}_i),   u^{l+1}_i = ∑_{j=1}^D W^{l,1}_{i,j} h^l_j,   (3)


Figure 1: The blocks of the MLP, the ResNet, and the ResNet with BN.

where the expectation is taken under the distribution of input vectors in the mini-batch of the stochastic gradient descent (SGD). Without loss of generality, we assume that the variance of each coordinate of the input vectors in the mini-batch is one, Var(x_d) = 1 for all d ∈ [D].

The above DNNs predict the corresponding output for an input based on the feature vector at the last block.

ŷ = v^T h^L + b.   (4)

We analyzed the average behaviors of these neural networks when the weights were randomly initialized as follows. In the MLP, the weights were initialized by the He initialization (He et al., 2015) because the activation function is the ReLU function.

W^l_{i,j} ∼ N(0, 2/D).   (5)

In the ResNet and the ResNet with BN, the first internal weights were initialized by the He initialization, but the second internal weights were initialized by the Xavier initialization (Glorot and Bengio, 2010) because the second internal activation function is the identity.

W^{l,1}_{i,j} ∼ N(0, 2/D),   W^{l,2}_{i,j} ∼ N(0, 1/D).   (6)

2.3. Classification error and the feature vectors

A large ratio of the between-class distance to the within-class distance of the feature vectors h^L at the last block induces a small classification error.

Theorem 1 (Theorem 4.4 of Devroye et al. (2013)) Let m_{+1} and S_{+1} be the population mean and the population covariance matrix of the feature vector h^L for class +1. We also define m_{−1} and S_{−1} for class −1 in the same way. Then, for any v ∈ R^D,

inf_{b∈R} Pr_{(x,y)∼D}[y · (v^T h^L + b) ≤ 0] ≤ [1 + (v^T(m_{+1} − m_{−1}) / (√(v^T S_{+1} v) + √(v^T S_{−1} v)))²]^{−1}.   (7)

We can apply this result to the classification error on the training set by replacing the population mean and the population covariance matrix with the sample mean and the sample covariance matrix.
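To make the bound concrete, here is a small NumPy sketch (our own illustration; the toy Gaussian features and the mean-difference choice of v are placeholders, not the paper's experiment) that evaluates the right-hand side of Eq. (7) from sample statistics.

```python
import numpy as np

def devroye_bound(H_pos, H_neg, v):
    """Right-hand side of Eq. (7) with sample mean/covariance plugged in.

    H_pos, H_neg: (N_+, D) and (N_-, D) arrays of last-block features h^L.
    """
    m_pos, m_neg = H_pos.mean(axis=0), H_neg.mean(axis=0)
    S_pos, S_neg = np.cov(H_pos, rowvar=False), np.cov(H_neg, rowvar=False)
    ratio = v @ (m_pos - m_neg) / (np.sqrt(v @ S_pos @ v) + np.sqrt(v @ S_neg @ v))
    return 1.0 / (1.0 + ratio ** 2)

rng = np.random.default_rng(0)
H_pos = rng.normal(+1.0, 1.0, size=(500, 16))    # toy features for class +1
H_neg = rng.normal(-1.0, 1.0, size=(500, 16))    # toy features for class -1
v = H_pos.mean(axis=0) - H_neg.mean(axis=0)       # mean-difference direction
print(devroye_bound(H_pos, H_neg, v))             # small value: the classes are well separated
```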


2.4. Signal propagation

To clarify the effect of the architecture on the distance between feature vectors h^l(n) and h^l(m),

‖h^l(n) − h^l(m)‖² = ‖h^l(n)‖² + ‖h^l(m)‖² − 2‖h^l(n)‖‖h^l(m)‖ cos ∠(h^l(n), h^l(m)),   (8)

we calculated the length q^l(n) and the angle ∠^l(n,m),

q^l(n) = E[‖h^l(n)‖²],   ∠^l(n,m) = arccos c^l(n,m),   q^l(n,m) = E[h^l(n)^T h^l(m)],   c^l(n,m) = q^l(n,m) / √(q^l(n) q^l(m)),   (9)

where q^l(n,m) is the inner product and c^l(n,m) is the cosine similarity. Note that the expectation is taken under the probability distribution of the randomly initialized weights.

3. Main results

3.1. Signal propagation

The distance between feature vectors, which can be written in terms of the length and the angle, is related to the classification error. To clarify the effect of the skip-connection and the BN on the classification error, we derive the recurrence relations of the length and the angle in the MLP, the ResNet, and the ResNet with BN. The proofs of the theorems are in the appendix.

Theorem 2 The transformation by the randomly initialized MLP preserves the length of any feature vector and strongly decreases a large angle compared with a small angle (Fig. 3).

q^{l+1}(n) = ‖h^l(n)‖²,   ∠^{l+1}(n,m) = arccos ψ(∠(h^l(n), h^l(m)))   (10)

where ψ(θ) = (1/π){sin θ + (π − θ) cos θ}.
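As a sanity check of Theorem 2, the following is our own minimal Monte Carlo sketch in NumPy (the dimension, the number of trials, and the test angle are arbitrary choices): apply one He-initialized ReLU block of Eq. (1) to two inputs with a prescribed angle and compare the average output cosine with ψ(θ).

```python
import numpy as np

def psi(theta):
    """psi(theta) = (1/pi){sin(theta) + (pi - theta) cos(theta)} from Theorem 2."""
    return (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / np.pi

D, trials, theta = 500, 200, 2.0
x_n = np.zeros(D); x_n[0] = 1.0
x_m = np.zeros(D); x_m[0] = np.cos(theta); x_m[1] = np.sin(theta)  # angle(x_n, x_m) = theta

rng = np.random.default_rng(0)
cos_out = []
for _ in range(trials):
    W = rng.normal(0.0, np.sqrt(2.0 / D), size=(D, D))       # He-initialized block, Eq. (5)
    h_n, h_m = np.maximum(W @ x_n, 0.0), np.maximum(W @ x_m, 0.0)
    cos_out.append(h_n @ h_m / (np.linalg.norm(h_n) * np.linalg.norm(h_m)))

print("empirical cosine:", np.mean(cos_out), "  theory psi(theta):", psi(theta))
```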

Remark 3 In real-life applications, the angle between input vectors which belong to different classes is larger than the angle between input vectors which belong to the same class (Yamaguchi et al., 1998; Wolf and Shashua, 2003). Therefore, ψ(θ) strongly decreases the between-class angle compared with the within-class angle (Fig. 2).

Figure 2: Strong decrease of the between-class angle by ψ(θ).

Figure 3: Recurrence relation of the angle by the l-th block of the DNN.


Remark 4 Because of the strong decrease of the between-class angle (Theorem 2 and Fig. 3), the MLP degrades the ratio of the between-class distance to the within-class distance, which is an undesirable property for classification.

The next theorems show that the skip-connection and the BN relax this degradation.

Theorem 5 The transformation by the randomly initialized ResNet increases the length by the same factor for any feature vector and relaxes the strong decrease of a large angle compared with the MLP (Fig. 3).

q^{l+1}(n) = 2‖h^l(n)‖²,
∠^{l+1}(n,m) = arccos{(1/2) ψ(∠(h^l(n), h^l(m))) + (1/2) cos ∠(h^l(n), h^l(m))}.   (11)

Theorem 6 The transformation by the randomly initialized ResNet with BN increases the length by the same factor for any feature vector and relaxes the strong decrease of a large angle compared with the MLP and the ResNet (Fig. 3).

q^{l+1}(n) = ((l + 3)/(l + 2)) ‖h^l(n)‖²,
∠^{l+1}(n,m) = arccos{(1/(l + 3)) ψ(∠(h^l(n), h^l(m))) + ((l + 2)/(l + 3)) cos ∠(h^l(n), h^l(m))}.   (12)

Remark 7 The skip-connection and the BN relax the degradation of the distance ratio compared with the MLP thanks to the preservation of the angle (Theorems 5, 6, and Fig. 3).

The above theorems also provide a clear interpretation of how the skip-connection and the BN preserve the angle. The ReLU activation function contracts the angle because it truncates the negative part of its input. The skip-connection bypasses the ReLU activation function and thus reduces its effect by half. Moreover, the BN reduces the effect of the ReLU activation function to the order of the reciprocal of the depth.

Corollary 8 Suppose that the change of the angle by each block follows Theorems 2, 5, and 6. When the angle between input vectors is sufficiently small such that arccos ψ(∠(x(n), x(m))) can be approximated by the linear function a · ∠(x(n), x(m)), where 0 < a < 1 is a constant, we obtain the angle dynamics in Table 1.

Table 1: Angle dynamics when the input angle is sufficiently small.

Model     Angle ∠^L(n,m)
MLP       ≃ a^L · ∠(x(n), x(m))
ResNet    ≃ ((1 + a)/2)^L · ∠(x(n), x(m))
BN        ≃ (2/(L + 2)) · ∠(x(n), x(m))

Figure 4: Angle dynamics when the input angle is ∠(x(n), x(m)) = 0.5.
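The following sketch (our own, in NumPy) iterates the exact angle recurrences of Theorems 2, 5, and 6, i.e. the maps plotted in Fig. 3, from the starting angle 0.5 used in Fig. 4; it reproduces the qualitative behavior that the MLP collapses the angle fastest while the ResNet and the BN preserve it.

```python
import numpy as np

def psi(theta):
    return (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / np.pi

def safe_arccos(x):
    return np.arccos(np.clip(x, -1.0, 1.0))   # guard against tiny floating-point overshoot

def step_mlp(theta, l):
    return safe_arccos(psi(theta))                                      # Theorem 2

def step_resnet(theta, l):
    return safe_arccos(0.5 * psi(theta) + 0.5 * np.cos(theta))          # Theorem 5

def step_resnet_bn(theta, l):
    return safe_arccos(psi(theta) / (l + 3) + (l + 2) / (l + 3) * np.cos(theta))  # Theorem 6

theta0, L = 0.5, 10
for name, step in [("MLP", step_mlp), ("ResNet", step_resnet), ("ResNet+BN", step_resnet_bn)]:
    theta = theta0
    for l in range(L):
        theta = step(theta, l)
    print(f"{name:10s} angle after {L} blocks: {theta:.4f}")
```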


3.2. Role of training and its relation to the initialization

Feature vectors should have a large between-class angle and a small within-class angle for a small classification error. However, the randomly initialized neural networks decrease the between-class angle. Wang et al. (2018) showed that minimizing the softmax loss corresponds to increasing the between-class angle and decreasing the within-class angle under some assumptions. Besides this property of the softmax loss, our analysis provides an insight into how training tackles this problem from the point of view of the network architecture.

Theorem 9 In the MLP, the cosine similarity cos ∠(h^{l+1}(n), h^{l+1}(m)) is proportional to

∑_{i=1}^D δ^l_i(n,m) · ‖W^l_{i,•}‖² · cos ∠(W^l_{i,•}, h^l(n)) cos(∠(W^l_{i,•}, h^l(n)) − ∠(h^l(n), h^l(m)))   (13)

where W^l_{i,•} = (W^l_{i,1}, ..., W^l_{i,D})^T is a weight vector, which can be controlled by training, and δ^l_i(n,m) = 1[W^l_{i,•} h^l(n) > 0] · 1[W^l_{i,•} h^l(m) > 0] is the co-activation state.
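The exact decomposition behind Eq. (13) (the third line of Eq. (31) in Appendix C) can be checked numerically; the following is our own small NumPy sketch with arbitrary dimensions and random inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32
h_n, h_m = rng.standard_normal(D), rng.standard_normal(D)
W = rng.normal(0.0, np.sqrt(2.0 / D), size=(D, D))

# Left-hand side: inner product of the ReLU outputs of one MLP block.
lhs = np.maximum(W @ h_n, 0.0) @ np.maximum(W @ h_m, 0.0)

# Right-hand side: sum over units of delta * ||W_i||^2 * cos * cos, scaled by the lengths.
rhs = 0.0
for i in range(D):
    w = W[i]
    delta = float((w @ h_n > 0) and (w @ h_m > 0))          # co-activation state delta_i
    cos_n = w @ h_n / (np.linalg.norm(w) * np.linalg.norm(h_n))
    cos_m = w @ h_m / (np.linalg.norm(w) * np.linalg.norm(h_m))
    rhs += delta * np.linalg.norm(w) ** 2 * cos_n * cos_m
rhs *= np.linalg.norm(h_n) * np.linalg.norm(h_m)

print(lhs, rhs)   # the two values agree up to floating-point error
```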

Similar theorems hold in the cases of the ResNet and the ResNet with BN. Theorem 9 implies that training can find a good weight W^l that makes the within-class angle smaller and the between-class angle larger, as follows.

Remark 10 The factor cos ∠(W^l_{i,•}, h^l(n)) cos(∠(W^l_{i,•}, h^l(n)) − ∠(h^l(n), h^l(m))) in Eq. (13) and its plot (Fig. 5) imply that training can find a weight vector W^l_{i,•} making the angle smaller or larger. However, it seems difficult to make the within-class angle smaller and the between-class angle larger at the same time. The co-activation states {δ^l_i(n,m)}_{i=1}^D overcome this problem by being activated class-wisely, such that part of them become active if x(n) and x(m) belong to the same class and the other part become active if x(n) and x(m) belong to different classes.

The above discussion and the following theorem show the relationship between training and the preservation of the angle at initialization.

Theorem 11 A small angle ∠(h^l(n), h^l(m)) encourages the co-activation at initialization.

P(δ^l_i(n,m) = 1) = (1/2)(1 − ∠(h^l(n), h^l(m))/π).   (14)
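A Monte Carlo check of Theorem 11, under our own toy setup (the dimension, the number of trials, and the test angle are arbitrary choices): for two fixed vectors with angle θ and He-initialized rows W^l_{i,•}, estimate the probability that both pre-activations are positive and compare with (1/2)(1 − θ/π).

```python
import numpy as np

D, trials, theta = 64, 100_000, 1.2
h_n = np.zeros(D); h_n[0] = 1.0
h_m = np.zeros(D); h_m[0] = np.cos(theta); h_m[1] = np.sin(theta)   # angle(h_n, h_m) = theta

rng = np.random.default_rng(0)
W = rng.normal(0.0, np.sqrt(2.0 / D), size=(trials, D))   # each row plays the role of W^l_{i,.}
co_activation_rate = ((W @ h_n > 0) & (W @ h_m > 0)).mean()

print("empirical:", co_activation_rate, "  theory:", 0.5 * (1.0 - theta / np.pi))
```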

Figure 5: Change of the similarity by changing α_i = ∠(W^l_{i,•}, h^l(n)). Let ∠ = ∠(h^l(n), h^l(m)).

Figure 6: Region of W^l_{i,•} and the corresponding behavior of δ^l_i(n,m).


Remark 12 At the high blocks of the initialized MLP, the between-class angle is small and thus a large part of the co-activation states {δ^l_i(n,m)}_{i=1}^D are independent of the class information (left in Fig. 6). Therefore, training changes all the angles in a similar way and the improvement of the distance ratio is small. On the other hand, the skip-connection and the BN preserve the between-class angles at high blocks and thus the co-activation states {δ^l_i(n,m)}_{i=1}^D still depend on the class information (right in Fig. 6). Therefore, training can change the angles class-wisely and improves the distance ratio well.

4. Numerical simulations

4.1. Dataset

We made a training set and a test set {(x(n), y(n))}_{n=1}^{1000} by subsampling data belonging to class label 0 or 1 from the MNIST dataset or the CIFAR10 dataset. Moreover, we applied PCA whitening to these datasets so that the variance of each element of the input vectors becomes one. The class label 0 was relabeled to −1.
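As an illustration of this preprocessing, here is a minimal NumPy sketch (our own; the loading of MNIST/CIFAR10 is left out, and the subsample size, the epsilon, and the helper name are assumptions rather than the authors' pipeline). It assumes the raw data is already available as an array X of shape (N, D) with labels y.

```python
import numpy as np

def make_binary_whitened_set(X, y, n_per_set=1000, eps=1e-5, seed=0):
    """Subsample classes 0/1, PCA-whiten, and relabel 0 as -1."""
    rng = np.random.default_rng(seed)
    mask = (y == 0) | (y == 1)
    X, y = X[mask], y[mask]
    idx = rng.permutation(len(X))[:n_per_set]
    X, y = X[idx].astype(np.float64), y[idx]

    # PCA whitening: rotate onto the principal axes and rescale each component
    # to unit variance, so that Var(x_d) = 1 for every coordinate d.
    Xc = X - X.mean(axis=0)
    eigval, eigvec = np.linalg.eigh(np.cov(Xc, rowvar=False))
    X_white = (Xc @ eigvec) / np.sqrt(eigval + eps)   # eps guards near-zero directions

    y_pm = np.where(y == 0, -1, +1)                   # relabel class 0 as -1
    return X_white, y_pm
```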

Figure 7: Recurrence relation of the angle by one block or three blocks of the neural networks (panels: MNIST (L = 1), MNIST (L = 3), CIFAR10 (L = 1), CIFAR10 (L = 3)). We plotted the mean and the standard deviation of the angle over ten randomly initialized weights. The black lines show the identity.

Figure 8: Dynamics of the angle through blocks of the neural networks (panels: MNIST with ∠(x(n), x(m)) = 0.93, MNIST with ∠(x(n), x(m)) = 1.96, CIFAR10 with ∠(x(n), x(m)) = 1.31, CIFAR10 with ∠(x(n), x(m)) = 1.83). We plotted the mean and the standard deviation of the angle over ten randomly initialized weights. The black lines show the identity.


4.2. Signal propagation through blocks

We calculated the angles between feature vectors transformed by one block or three blocks of the neural networks (Fig. 7). The MLP strongly decreased the between-class (large) angle compared with the within-class (small) angle, the skip-connection and the BN relaxed this decrease, and these numerical values matched our theoretical values.

We chose two input vectors and calculated their angle at each block (Fig. 8). The MLP made this angle decrease quickly, and the skip-connection and the BN relaxed this decrease, which agreed with our analysis.

4.3. Co-activation rate through blocks

We calculated the co-activation rates between feature vectors transformed by three blocks of the neural networks (Fig. 9). A small angle encouraged the co-activation, and these numerical values matched our theoretical values.

We chose two input vectors and calculated their co-activation rate at each block (Fig. 10). The co-activation rate of the MLP quickly increased through the blocks, and the skip-connection and the BN suppressed this increase, which agreed with our analysis.

4.4. Effect of training on the distance ratio and relation to the error

We stacked a softmax layer on top of the L-block neural networks and trained these models with stochastic gradient descent by minimizing the cross-entropy loss. We calculated the error rates and the ratios of the between-class distance to the within-class distance in Theorem 1,

v^T(m_{+1} − m_{−1}) / (√(v^T S_{+1} v) + √(v^T S_{−1} v)),   where v = m_{+1} − m_{−1},   (15)

on the test set. We plotted the distance ratios before and after training (Fig. 11) and the relationship between the distance ratio and the classification error (Fig. 12).
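For reference, a minimal NumPy sketch of this ratio (our own helper; H and y are assumed to be the last-block features and the ±1 labels) is:

```python
import numpy as np

def distance_ratio(H, y):
    """Eq. (15) with v = m_{+1} - m_{-1}; H is an (N, D) array of features h^L, y in {-1, +1}."""
    H_pos, H_neg = H[y == +1], H[y == -1]
    m_pos, m_neg = H_pos.mean(axis=0), H_neg.mean(axis=0)
    v = m_pos - m_neg
    S_pos, S_neg = np.cov(H_pos, rowvar=False), np.cov(H_neg, rowvar=False)
    return v @ (m_pos - m_neg) / (np.sqrt(v @ S_pos @ v) + np.sqrt(v @ S_neg @ v))
```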

Figure 9: Co-activation rates (panels: MNIST (L = 3), CIFAR10 (L = 3)). We plotted the mean and the standard deviation of the co-activation rate over ten randomly initialized weights.

Figure 10: Co-activation rates through blocks (panels: MNIST with ∠(x(n), x(m)) = 1.96, CIFAR10 with ∠(x(n), x(m)) = 1.83). We plotted the mean and the standard deviation of the co-activation rate over ten randomly initialized weights.


Figure 11: Distance ratio of the neural networks before (dotted lines) and after training (bold lines).

Figure 12: Error rate vs. the distance ratio. Each number means the number of blocks of the neural networks.

We applied this simulation only on the binary-class CIFAR10 dataset because the binary-class MNIST dataset was so easy that all the DNNs perfectly classified the data even on the test set.

Fig. 11 shows that the skip-connection and the BN improve the distance ratio compared with the MLP, although deeper neural networks degraded the distance ratio. Fig. 12 shows that the error rate is negatively correlated with the distance ratio (the correlation coefficient is −0.91). These results support our claim.

5. Conclusion

To clarify the success of the ResNet and the BN, we analyzed the change of the distance between feature vectors through the blocks of the MLP, the ResNet, and the ResNet with BN, because a large ratio of the between-class distance to the within-class distance of the feature vectors at the last block induces a small classification error. Our results show that the randomly initialized MLP degrades this ratio because of the strong decrease of the between-class angle. The skip-connection and the BN relax this degradation thanks to the preservation of the between-class angle. Our analysis also implies that this preservation of the angle at initialization encourages training to improve the distance ratio, and thus the skip-connection and the BN induce high performance. Numerical simulations supported these results.

Acknowledgments

This work was supported in part by JSPS KAKENHI grant numbers 18J15055 and 18K19821, and by the NAIST Big Data Project.

References

Luc Devroye, Laszlo Gyorfi, and Gabor Lugosi. A probabilistic theory of pattern recognition, volume 31. Springer Science & Business Media, 2013.

Raja Giryes, Guillermo Sapiro, and Alexander M Bronstein. Deep neural networks with random gaussian weights: A universal classification strategy? IEEE Trans. Signal Processing, 64(13):3444–3457, 2016.

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.

Talha Cihad Gulcu and Alper Gungor. Comments on "Deep neural networks with random gaussian weights: A universal classification strategy?". arXiv preprint arXiv:1901.02182, 2019.

Moritz Hardt and Tengyu Ma. Identity matters in deep learning. International Conference on Learning Representations, 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016a.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016b.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.

Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pages 2924–2932, 2014.

Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in neural information processing systems, pages 3360–3368, 2016.

Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. In International Conference on Machine Learning, pages 2847–2854, 2017.

Matus Telgarsky. Benefits of depth in neural networks. In Conference on Learning Theory, pages 1517–1539, 2016.

Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. CosFace: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5265–5274, 2018.

Lior Wolf and Amnon Shashua. Learning over sets using kernel principal angles. Journal of Machine Learning Research, 4(Oct):913–931, 2003.

Osamu Yamaguchi, Kazuhiro Fukui, and K-i Maeda. Face recognition using temporal image sequence. In Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition, pages 318–323. IEEE, 1998.

Ge Yang and Samuel Schoenholz. Mean field residual networks: On the edge of chaos. In Advances in neural information processing systems, pages 7103–7114, 2017.

Appendix A. Proof of the recurrence relations (Theorems 2, 5, 6, and 11)

Before we prove the recurrence relations, we present the following lemma, which plays an important role in our analysis. The strategy of its proof is the same as in Giryes et al. (2016) and Gulcu and Gungor (2019), that is, we use the fact that a random Gaussian vector is uniformly distributed on the sphere. Theorem 11 can be proved in the same way.

Lemma 13 Let w ∈ R^D be a random vector which follows the Gaussian distribution N(0, σ²I). Then, for x(n), x(m) ∈ R^D,

E[φ(w^T x(n)) · φ(w^T x(m))] = (σ²/2) ‖x(n)‖‖x(m)‖ · ψ(∠(x(n), x(m)))   (16)

where ψ(θ) = (1/π){sin θ + (π − θ) cos θ}.

Proof of Lemma 13 Without loss of generality, we can take x(n) to lie along the w₁-axis and x(m) to lie on the w₁w₂-plane. Integrate out the D − 2 orthogonal coordinates of the random weight vector and represent x(n), x(m) in the remaining two-dimensional Cartesian coordinate system of the w₁w₂-plane. Consider integration over the w₁w₂-plane by the change of variables (w₁, w₂) = (r cos θ, r sin θ). Let u(θ) = (cos θ, sin θ)^T. Then,

E[φ(w^T x(n)) · φ(w^T x(m))] = ∫_0^∞ ∫_0^{2π} p(r u(θ)) φ(r u(θ)^T x(n)) φ(r u(θ)^T x(m)) J dθ dr   (17)

where p(r u(θ)) = (1/(2πσ²)) exp(−r²/(2σ²)) is the two-dimensional Gaussian density function and J = |det[∂(w₁, w₂)^T/∂r, ∂(w₁, w₂)^T/∂θ]| = r is the Jacobian. Notice that φ(u(θ)^T x(n)) · φ(u(θ)^T x(m)) is non-zero iff u(θ) points in the same direction as the input vectors x(n), x(m) (Fig. 13).

E[φ(w^T x(n)) · φ(w^T x(m))] = ∫_0^∞ p(r u(θ)) r³ dr · ∫_0^{2π} φ(u(θ)^T x(n)) φ(u(θ)^T x(m)) dθ
= (σ²/π) ∫_{−(π/2 − ∠(x(n), x(m)))}^{π/2} ‖x(n)‖ cos θ · ‖x(m)‖ cos(θ − ∠(x(n), x(m))) dθ
= (σ²/2) ‖x(n)‖‖x(m)‖ · ψ(∠(x(n), x(m))).   (18)
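A quick Monte Carlo check of Lemma 13 (our own sketch; the dimension, σ, and the test vectors are arbitrary): average φ(w^T x(n)) φ(w^T x(m)) over Gaussian draws of w and compare with (σ²/2)‖x(n)‖‖x(m)‖ψ(∠(x(n), x(m))).

```python
import numpy as np

def psi(theta):
    return (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / np.pi

rng = np.random.default_rng(0)
D, sigma, trials = 8, 0.7, 200_000
x_n, x_m = rng.standard_normal(D), rng.standard_normal(D)
theta = np.arccos(x_n @ x_m / (np.linalg.norm(x_n) * np.linalg.norm(x_m)))

W = rng.normal(0.0, sigma, size=(trials, D))                     # rows play the role of w
empirical = (np.maximum(W @ x_n, 0.0) * np.maximum(W @ x_m, 0.0)).mean()
theory = 0.5 * sigma ** 2 * np.linalg.norm(x_n) * np.linalg.norm(x_m) * psi(theta)

print(empirical, theory)   # the two values should be close for a large number of trials
```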


Table 2: Recurrence relations. Let ψ(θ) = (1/π){sin θ + (π − θ) cos θ}.

Model     Length q^{l+1}(n)            Inner product q^{l+1}(n,m)
MLP       ‖h^l(n)‖²                    ‖h^l(n)‖‖h^l(m)‖ ψ(∠(h^l(n), h^l(m)))
ResNet    2‖h^l(n)‖²                   ‖h^l(n)‖‖h^l(m)‖ (ψ(∠(h^l(n), h^l(m))) + cos ∠(h^l(n), h^l(m)))
BN        ((l+3)/(l+2)) ‖h^l(n)‖²      ‖h^l(n)‖‖h^l(m)‖ ((1/(l+3)) ψ(∠(h^l(n), h^l(m))) + ((l+2)/(l+3)) cos ∠(h^l(n), h^l(m)))

A.1. Recurrence relation of MLP

Let W^l_{i,•} = (W^l_{i,1}, ..., W^l_{i,D}) be a row vector.

Length:

q^{l+1}(n) = E[‖h^{l+1}(n)‖²] = ∑_{i=1}^D E[φ(W^l_{i,•} h^l(n))²] = ∑_{i=1}^D (1/2) E[∑_{j=1}^D W^l_{i,j}² h^l_j(n)²] = ‖h^l(n)‖².   (19)

Inner-product:

q^{l+1}(n,m) = E[h^{l+1}(n)^T h^{l+1}(m)] = E[φ(W^l h^l(n))^T φ(W^l h^l(m))] = ∑_{i=1}^D E[φ(W^l_{i,•} h^l(n)) · φ(W^l_{i,•} h^l(m))].

Note that W^l_{i,•} is a random Gaussian vector. Applying Lemma 13 to the above equation,

q^{l+1}(n,m) = ‖h^l(n)‖‖h^l(m)‖ · ψ(∠(h^l(n), h^l(m))).   (20)

Then, we can obtain the recurrence relation of the angle by substituting these into Eq. (9).

Figure 13: Condition for φ(u(θ)^T x(n)) · φ(u(θ)^T x(m)) > 0.


A.2. Recurrence relation of ResNet

Length:

q^{l+1}(n) = E[‖W^{l,2} φ(W^{l,1} h^l(n)) + h^l(n)‖²]
= E[‖W^{l,2} φ(W^{l,1} h^l(n))‖²] + 2 E[h^l(n)^T W^{l,2} φ(W^{l,1} h^l(n))] + ‖h^l(n)‖²
= 2‖h^l(n)‖².   (21)

Inner-product:

q^{l+1}(n,m) = E[{W^{l,2} φ(W^{l,1} h^l(n)) + h^l(n)}^T {W^{l,2} φ(W^{l,1} h^l(m)) + h^l(m)}]
= E[{W^{l,2} φ(W^{l,1} h^l(n))}^T {W^{l,2} φ(W^{l,1} h^l(m))}] + h^l(n)^T E[W^{l,2} φ(W^{l,1} h^l(m))] + h^l(m)^T E[W^{l,2} φ(W^{l,1} h^l(n))] + h^l(n)^T h^l(m).   (22)

The first term can be calculated using Lemma 13.

E[{W^{l,2} φ(W^{l,1} h^l(n))}^T {W^{l,2} φ(W^{l,1} h^l(m))}] = ∑_{i=1}^D E[∑_{j=1}^D W^{l,2}_{i,j}² φ(W^{l,1}_{j,•} h^l(n)) · φ(W^{l,1}_{j,•} h^l(m))]
= ∑_{i=1}^D (1/D) E[∑_{j=1}^D φ(W^{l,1}_{j,•} h^l(n)) · φ(W^{l,1}_{j,•} h^l(m))]
= ‖h^l(n)‖‖h^l(m)‖ · ψ(∠(h^l(n), h^l(m))).   (23)

The second term and the third term are zero. Then,

q^{l+1}(n,m) = ‖h^l(n)‖‖h^l(m)‖ (ψ(∠(h^l(n), h^l(m))) + cos ∠(h^l(n), h^l(m))).   (24)

Thus, we can obtain the recurrence relation of the angle by substituting these into Eq. (9).

A.3. Recurrence relation of ResNet with BN

We first calculate the statistics of BN.

Mean:

E[u^{l+1}_i] = E[W^{l,1}_{i,•} h^l] = 0.   (25)


Variance:

Var(u^{l+1}_i) = Var(∑_{j=1}^D W^{l,1}_{i,j} h^l_j) = (2/D) ∑_{j=1}^D Var(h^l_j)
= (2/D) ∑_{j=1}^D Var(W^{l−1,2}_{j,•} φ(BN(W^{l−1,1} h^{l−1})) + h^{l−1}_j)
= (2/D) ∑_{j=1}^D {∑_{k=1}^D (1/D)(1/2) + Var(h^{l−1}_j)}
= (2/D) ∑_{j=1}^D {1/2 + Var(h^{l−1}_j)}
= (2/D) ∑_{j=1}^D (l/2 + Var(x_j)) = l + 2.   (26)

By applying these statistics to Eq. (3), we can derive the recurrence relations in the same way as those of the ResNet.
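As a sanity check of Eqs. (25)–(26), the following sketch (our own toy setup; the batch size, width, and depth are arbitrary) propagates a whitened mini-batch through randomly initialized BN-ResNet blocks and tracks the mean square of u^{l+1} over the batch and over units, which is its variance under the zero-mean property of Eq. (25); it should grow roughly as l + 2.

```python
import numpy as np

rng = np.random.default_rng(0)
B, D, L = 2000, 200, 6
x = rng.standard_normal((B, D))
h = (x - x.mean(axis=0)) / x.std(axis=0)      # whitened input: Var(x_d) = 1, as in Section 2.2

for l in range(L):
    W1 = rng.normal(0.0, np.sqrt(2.0 / D), size=(D, D))   # He init
    W2 = rng.normal(0.0, np.sqrt(1.0 / D), size=(D, D))   # Xavier init
    u = h @ W1.T
    # Mean square of u over the batch and over units; by Eq. (25) this is its variance.
    print(f"block {l}: mean of u^2 = {(u ** 2).mean():.2f}   (theory l + 2 = {l + 2})")
    u_hat = (u - u.mean(axis=0)) / np.sqrt(u.var(axis=0) + 1e-8)   # batch normalization
    h = np.maximum(u_hat, 0.0) @ W2.T + h
```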

Appendix B. Proof of the angle dynamics through layers (Table 1)

B.1. Angle dynamics through layers of MLP

If θ is small, arccos(ψ(θ)) can be well approximated by the linear function a · θ, where 0 < a < 1 is a constant. Therefore,

∠^1(n,m) ≃ a · ∠(x(n), x(m)).   (27)

Applying this procedure L times, we obtain

∠^L(n,m) ≃ a^L · ∠(x(n), x(m)).   (28)

B.2. Angle dynamics through layers of ResNet and ResNet with BN

Notice that ψ(θ) + cos θ is positive for θ ∈ [0, 2] and that arccos is concave for positive arguments. Therefore, we can obtain a lower bound on the angle dynamics of the ResNet by using the Jensen inequality.

∠^1(n,m) = arccos((1/2) ψ(∠(x(n), x(m))) + (1/2) cos ∠(x(n), x(m)))
> (1/2) arccos(ψ(∠(x(n), x(m)))) + (1/2) · ∠(x(n), x(m))
≃ ((1 + a)/2) · ∠(x(n), x(m)).   (29)

Applying this procedure L times, we obtain

∠^L(n,m) ≳ ((1 + a)/2)^L · ∠(x(n), x(m)).   (30)


We can obtain a lower bound on the angle dynamics of the ResNet with BN in the same way as that of the ResNet.

Appendix C. Proof of Theorem 9

Cosine similarity can be written as follows.

cos ∠(h^{l+1}(n), h^{l+1}(m)) = h^{l+1}(n)^T h^{l+1}(m) / (‖h^{l+1}(n)‖‖h^{l+1}(m)‖)
= (1/(‖h^{l+1}(n)‖‖h^{l+1}(m)‖)) ∑_{i=1}^D φ(W^l_{i,•} h^l(n)) · φ(W^l_{i,•} h^l(m))
= (‖h^l(n)‖‖h^l(m)‖/(‖h^{l+1}(n)‖‖h^{l+1}(m)‖)) ∑_{i=1}^D δ^l_i(n,m) · ‖W^l_{i,•}‖² · cos ∠(W^l_{i,•}, h^l(n)) cos ∠(W^l_{i,•}, h^l(m))
∝ ∑_{i=1}^D δ^l_i(n,m) · ‖W^l_{i,•}‖² · cos ∠(W^l_{i,•}, h^l(n)) cos(∠(W^l_{i,•}, h^l(n)) − ∠(h^l(n), h^l(m)))   (31)

where δ^l_i(n,m) = 1[W^l_{i,•} h^l(n) > 0] · 1[W^l_{i,•} h^l(m) > 0].
