Lecture 3: The Principle of Sparsity


[§4, §8, §11.2, and §∞.3 of "The Principles of Deep Learning Theory (PDLT)," arXiv:2106.10165]

Problems 1, 2, & 3

Fully-trained network output, Taylor-expanded around initialization (see the sketch below):

• Problem 1: too many terms in general

• Problem 2: complicated mapping from the initial distributions over model parameters to the statistics at initialization

• Problem 3: complicated dynamics taking the statistics at initialization to the statistics after training

Dan has covered Dynamics; Sho will cover Statistics, for WIDE & DEEP neural networks:

Lecture 3: The Principle of Sparsity, deriving recursions

Lecture 4: The Principle of Criticality, solving recursions
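As a sketch of the expansion the slide refers to, in PDLT-style notation (the symbols θ, θ*, and dθ are reconstructions following the book, not taken from the slide): with θ the model parameters at initialization and θ* the fully-trained parameters,

\[
z_i\!\left(x;\theta^{\star}\right)
= z_i\!\left(x;\theta\right)
+ \sum_{\mu} d\theta_{\mu}\,\frac{\partial z_i\!\left(x;\theta\right)}{\partial\theta_{\mu}}
+ \frac{1}{2}\sum_{\mu,\nu} d\theta_{\mu}\,d\theta_{\nu}\,
\frac{\partial^{2} z_i\!\left(x;\theta\right)}{\partial\theta_{\mu}\,\partial\theta_{\nu}}
+ \cdots ,
\qquad d\theta \equiv \theta^{\star}-\theta .
\]

Problem 1 is that, a priori, none of the higher-order terms may be dropped; Problems 2 and 3 concern the statistics of each term over initializations and how those statistics evolve during training.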

Outline

1. Neural Networks 101

2. One-Layer Neural Networks

3. Two-Layer Neural Networks

4. Deep Neural Networks

1. Neural Networks 101

Neural Networks

A deep network alternates affine transformations with an elementwise activation function: each layer's preactivations are passed through the activation function into the next layer, ending with the network output. A hat denotes a quantity evaluated @ initialization.

Biases and weights (the model parameters) are independently (& symmetrically) distributed, with variances set by the initialization hyperparameters; scaling the weight variance inversely with the width of the previous layer gives a good wide limit.
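A minimal sketch of this setup in PDLT-style notation (the symbols n_ℓ, C_b, and C_W are conventions taken from the book, not from the slide; the parameters are taken Gaussian here for concreteness):

\[
z_i^{(1)}(x) = b_i^{(1)} + \sum_{j=1}^{n_0} W_{ij}^{(1)}\, x_j ,
\qquad
z_i^{(\ell+1)}(x) = b_i^{(\ell+1)} + \sum_{j=1}^{n_\ell} W_{ij}^{(\ell+1)}\, \sigma\!\big(z_j^{(\ell)}(x)\big),
\]

with the parameters drawn independently at initialization as

\[
b_i^{(\ell)} \sim \mathcal{N}\!\left(0,\, C_b\right),
\qquad
W_{ij}^{(\ell)} \sim \mathcal{N}\!\left(0,\, \frac{C_W}{n_{\ell-1}}\right) .
\]

The 1/n_{\ell-1} scaling of the weight variance is what makes the wide limit well behaved.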

One Aside on Gradient Descent [Cf. Andrea's "S" matrix]

Taylor expansion of the gradient-descent update to the network output:

• Neural Tangent Kernel (NTK), at leading order [Cf. Dan's lectures]

• differential of the NTK (dNTK), at the next order [Cf. Dan's lectures]

• ddNTK, at the order after that

Neural Tangent Kernel (NTK)

Diagonal, group-by-group, learning rate: one learning rate for each group of model parameters (biases and weights, layer by layer), scaled with width so that the NTK has a good wide limit.
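A sketch of these objects in PDLT-style notation (the learning-rate tensor λ and the per-group rates λ_b, λ_W are conventions assumed from the book): a gradient-descent step updates the parameters by

\[
d\theta_\mu = -\,\eta \sum_{\nu} \lambda_{\mu\nu}\, \frac{\partial \mathcal{L}}{\partial \theta_\nu},
\]

and substituting this into the Taylor expansion of the output, the linear term is governed by the Neural Tangent Kernel,

\[
H_{i_1 i_2}(x_1, x_2) \equiv \sum_{\mu,\nu} \lambda_{\mu\nu}\,
\frac{\partial z_{i_1}(x_1)}{\partial \theta_\mu}\,
\frac{\partial z_{i_2}(x_2)}{\partial \theta_\nu},
\]

while the higher-order terms bring in the dNTK (one extra parameter derivative acting on z) and then the ddNTK. Choosing λ diagonal and group-by-group, e.g. λ_b per bias group and λ_W / n_{\ell-1} per weight group following the book's convention, is what keeps the NTK finite as the width grows.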

Two Pedagogical Simplifications

1. Single input; drop sample indices

[See “PDLT” (arXiv:2106.10165) for more general cases.]

2. Layer-independent hyperparameters; drop layer indices from them

2. One-Layer Neural Networks

Statistics of the First-Layer Preactivations

Compute the moments over the first-layer parameters by "Wick contraction":

• Neurons don't talk to each other; they are statistically independent.

• We marginalized over / integrated out the first-layer biases and weights.

• Two interpretations: (i) the outputs of one-layer networks; or (ii) the preactivations in the first layer of deeper networks.
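A sketch of the result for a single input, assuming Gaussian initialization as in PDLT (G^{(1)} is the book's first-layer metric/kernel): the first-layer preactivation distribution is Gaussian and diagonal in neurons,

\[
\mathbb{E}\big[\hat z_i^{(1)}\big] = 0,
\qquad
\mathbb{E}\big[\hat z_{i_1}^{(1)}\, \hat z_{i_2}^{(1)}\big] = \delta_{i_1 i_2}\, G^{(1)},
\qquad
G^{(1)} = C_b + C_W\, \frac{1}{n_0}\sum_{j=1}^{n_0} x_j\, x_j ,
\]

with all higher connected correlators vanishing.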

Statistics of the First-Layer NTK

• "Deterministic": it doesn't depend on any particular initialization; you always get the same number.

• "Frozen": it cannot evolve during training; no representation learning.

• No representation learning, and no algorithm dependence.
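A sketch of why, for a single input (notation again follows PDLT; λ_b and λ_W are the first-layer learning rates): the first-layer NTK is

\[
\hat H^{(1)}_{i_1 i_2} = \delta_{i_1 i_2}\left(\lambda_b + \lambda_W\, \frac{1}{n_0}\sum_{j=1}^{n_0} x_j\, x_j\right),
\]

which contains no model parameters at all, so it is the same for every draw of the initialization (deterministic) and cannot change as the parameters are trained (frozen).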

Statistics of One-Layer Neural Networks

Linear dynamics with a simple, closed-form solution (sketched below), taking the statistics at initialization to the statistics after training:

• Same trivial statistics for infinite-width neural networks of any fixed depth.

• No representation learning, no algorithm dependence; not a good model of deep learning.

We must study deeper networks of finite width!
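A minimal sketch of that simple solution, assuming gradient flow on an MSE loss restricted to the training input(s) (these choices are illustrative assumptions, not the slide's): with a frozen, deterministic NTK H, the outputs obey

\[
\frac{d}{dt}\, z_i(t) = -\sum_{j} H_{ij}\,\big(z_j(t) - y_j\big)
\qquad\Longrightarrow\qquad
z(t) - y = e^{-H t}\,\big(z(0) - y\big),
\]

so the distribution after training is just an affine transformation of the Gaussian distribution at initialization.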

3. Two-Layer Neural Networks

Statistics of the Second-Layer Preactivations

Wick-contract over the second-layer parameters and arrange the terms:

• Recursive: the second-layer statistics are built out of the first-layer statistics.

• The 1/n width-scaling of the weight variance was important.

The result is a nearly-Gaussian distribution for the second-layer preactivations [cf. the exactly Gaussian distribution in the first layer]:

• Gaussian in the infinite-width limit: too simple; specified by one number (one matrix – the kernel – more generally).

• Sparse description at O(1/n): specified by two numbers (two tensors more generally, one of them having four sample indices).

• Interacting neurons at finite width.
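A sketch of those two numbers for a single input (following PDLT conventions; the Gaussian average ⟨·⟩_G is an assumed shorthand): the kernel obeys the recursion

\[
G^{(2)} = C_b + C_W\, \big\langle \sigma(z)\,\sigma(z)\big\rangle_{G^{(1)}},
\qquad
\big\langle F(z)\big\rangle_{G} \equiv \int \frac{dz}{\sqrt{2\pi G}}\; e^{-z^{2}/(2G)}\; F(z),
\]

while the leading connected four-point correlator is suppressed by one power of the width,

\[
\mathbb{E}\big[\hat z^{(2)}_{i_1}\hat z^{(2)}_{i_2}\hat z^{(2)}_{i_3}\hat z^{(2)}_{i_4}\big]\Big|_{\text{connected}}
= \frac{\delta_{i_1 i_2}\delta_{i_3 i_4}}{n_1}\, V^{(2)} + \big(\text{two index permutations}\big),
\qquad
V^{(2)} = C_W^{2}\Big[\big\langle \sigma^{4}\big\rangle_{G^{(1)}} - \big\langle \sigma^{2}\big\rangle_{G^{(1)}}^{2}\Big].
\]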

Statistics of the Second-Layer NTK and beyond

The second-layer NTK splits into two pieces:

• 1st piece, the same as before: the contribution of the second-layer biases and weights; here the 1/n width-scaling of the learning rates was important.

• 2nd piece, chain rule: the contribution of the first-layer parameters, propagated forward through the first layer.

Putting things together, the NTK forward equation (sketched below):

• "Stochastic": it fluctuates from instantiation to instantiation.

• "Defrosted": it can evolve during training.

Fun for the weekend (solutions in §8, §11.2, and §∞.3): the analogous finite-width analysis for the NTK fluctuations, the dNTK, and the ddNTK*, which underlie Representation Learning.

[*for smooth activation functions]
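A sketch of the two pieces and the resulting forward equation for a single input (notation follows PDLT; hats denote quantities at initialization):

\[
\hat H^{(2)}_{i_1 i_2}
= \underbrace{\delta_{i_1 i_2}\left(\lambda_b + \lambda_W\, \frac{1}{n_1}\sum_{j} \sigma\big(\hat z_j^{(1)}\big)\,\sigma\big(\hat z_j^{(1)}\big)\right)}_{\text{1st piece: second-layer parameters}}
\;+\;
\underbrace{\sum_{j_1, j_2} \hat W^{(2)}_{i_1 j_1}\, \hat W^{(2)}_{i_2 j_2}\,
\sigma'\big(\hat z_{j_1}^{(1)}\big)\,\sigma'\big(\hat z_{j_2}^{(1)}\big)\, \hat H^{(1)}_{j_1 j_2}}_{\text{2nd piece: chain rule through the first layer}},
\]

which depends explicitly on the random weights and preactivations (stochastic) and on objects that change during training (defrosted). Taking its expectation at leading order in 1/n_1 gives the mean forward equation

\[
H^{(2)} = \lambda_b + \lambda_W\, \big\langle \sigma\,\sigma \big\rangle_{G^{(1)}}
+ C_W\, \big\langle \sigma'\sigma' \big\rangle_{G^{(1)}}\, H^{(1)} .
\]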

Statistics of Two-Layer Neural Networks

• Two interpretations: (i) the outputs, NTK, … of a two-layer network; or (ii) the preactivations, mid-layer NTK, … in the second layer of a deeper network.

• Neurons do talk to each other; they are statistically dependent.

• Yes representation learning (and yes algorithm dependence); they can now capture the rich dynamics of real, finite-width neural networks.

But what is being amplified by deep learning?

4. Deep Neural Networks

Statistics of Deep Neural Networks

Wick-contract over the next layer's parameters, arrange the terms, and keep the leading contributions at large width. Layer by layer this produces recursions, sketched below, for the:

• Two-point correlator (the kernel)

• Four-point connected correlator (the vertex)

• NTK mean

The NTK forward equation again has a 1st trivial piece plus a 2nd chain-rule piece.

Statistics of the NTK fluctuations and beyond: similar recursions follow for the

• NTK fluctuations (§8)

• dNTK (§11.2)

• ddNTK (§∞.3)
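A sketch of the leading single-input recursions, in PDLT-style notation (the susceptibility χ_∥ and the form of the vertex recursion are shown schematically; see §4, §5, and §8 of the book for the precise coefficients and the multi-input generalization):

\[
G^{(\ell+1)} = C_b + C_W\, \big\langle \sigma\,\sigma \big\rangle_{G^{(\ell)}},
\qquad
H^{(\ell+1)} = \lambda_b + \lambda_W\, \big\langle \sigma\,\sigma \big\rangle_{G^{(\ell)}}
+ C_W\, \big\langle \sigma'\sigma' \big\rangle_{G^{(\ell)}}\, H^{(\ell)},
\]

and, schematically, for the four-point vertex

\[
V^{(\ell+1)} = \chi_{\parallel}^{2}\big(G^{(\ell)}\big)\, V^{(\ell)}
+ C_W^{2}\Big[\big\langle \sigma^{4}\big\rangle_{G^{(\ell)}} - \big\langle \sigma^{2}\big\rangle_{G^{(\ell)}}^{2}\Big],
\qquad
\chi_{\parallel}(G) \equiv C_W\, \frac{d}{dG}\,\big\langle \sigma\,\sigma \big\rangle_{G},
\]

with analogous forward equations for the NTK fluctuations, the dNTK, and the ddNTK.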

The Principle of Sparsity for WIDE Neural Networks

From the statistics at initialization to the statistics after training [dynamics: Dan; statistics: Sho]:

• Infinite width: specified by the kernel and the NTK mean.

• Large-but-finite width, at O(1/n): additionally specified by the four-point vertex, the NTK fluctuations, the dNTK, and the ddNTK.

All determined through recursion relations (RG-flow interpretation: §4.6).
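These recursions are easy to check numerically. Below is a minimal sketch comparing the single-input kernel recursion against an ensemble of randomly initialized tanh networks; the widths, depth, and hyperparameter values are illustrative choices, not taken from the lecture.

import numpy as np

rng = np.random.default_rng(0)
n, L, n0 = 1000, 4, 100          # hidden width, depth, input dimension (illustrative)
C_b, C_W = 0.1, 1.0              # initialization hyperparameters (illustrative)
x = rng.standard_normal(n0)      # a single input, as in the pedagogical simplification

def next_kernel(G, num_samples=10**6):
    """Kernel recursion G^(l+1) = C_b + C_W * <sigma(z)^2>_G, estimated by Monte Carlo."""
    z = rng.standard_normal(num_samples) * np.sqrt(G)
    return C_b + C_W * np.mean(np.tanh(z) ** 2)

# Theory: iterate the recursion starting from the exact first-layer kernel.
G = C_b + C_W * np.dot(x, x) / n0
theory = [G]
for _ in range(L - 1):
    G = next_kernel(G)
    theory.append(G)

# Experiment: average z_i^(l) z_i^(l) over an ensemble of wide networks at initialization.
num_nets = 200
experiment = np.zeros(L)
for _ in range(num_nets):
    W = rng.standard_normal((n, n0)) * np.sqrt(C_W / n0)
    b = rng.standard_normal(n) * np.sqrt(C_b)
    z = b + W @ x                                   # first-layer preactivations
    experiment[0] += np.mean(z ** 2)
    for ell in range(1, L):
        W = rng.standard_normal((n, n)) * np.sqrt(C_W / n)
        b = rng.standard_normal(n) * np.sqrt(C_b)
        z = b + W @ np.tanh(z)                      # deeper-layer preactivations
        experiment[ell] += np.mean(z ** 2)
experiment /= num_nets

print("layer    recursion    ensemble")
for ell in range(L):
    print(f"{ell + 1:5d}    {theory[ell]:9.4f}    {experiment[ell]:8.4f}")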

Next Lecture: Solving Recursions, "The Principle of Criticality" for DEEP Neural Networks

One more thing…

Recommended reading: "The Principles of Deep Learning Theory," arXiv:2106.10165.