Lecture 3: The Principle of Sparsity


[§4, §8, §11.2, and §∞.3 of "The Principles of Deep Learning Theory (PDLT)," arXiv:2106.10165]

Problems 1, 2, & 3

Fully-trained network output, Taylor-expanded around initialization (see the sketch below):

• Problem 1: too many terms in general

• Problem 2: complicated mapping from the initial distributions over model parameters to the statistics at initialization

• Problem 3: complicated dynamics taking the statistics at initialization to the statistics after training

Dan has covered Dynamics; Sho will cover Statistics, for WIDE & DEEP neural networks:

Lecture 3: The Principle of Sparsity, deriving recursions

Lecture 4: The Principle of Criticality, solving recursions
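As a sketch of the expansion the slide refers to, in PDLT-style notation (the symbols θ, θ*, and dθ are reconstructions following the book, not taken from the slide): with θ the model parameters at initialization and θ* the fully-trained parameters,

\[
z_i\!\left(x;\theta^{\star}\right)
= z_i\!\left(x;\theta\right)
+ \sum_{\mu} d\theta_{\mu}\,\frac{\partial z_i\!\left(x;\theta\right)}{\partial\theta_{\mu}}
+ \frac{1}{2}\sum_{\mu,\nu} d\theta_{\mu}\,d\theta_{\nu}\,
\frac{\partial^{2} z_i\!\left(x;\theta\right)}{\partial\theta_{\mu}\,\partial\theta_{\nu}}
+ \cdots ,
\qquad d\theta \equiv \theta^{\star}-\theta .
\]

Problem 1 is that, a priori, none of the higher-order terms may be dropped; Problems 2 and 3 concern the statistics of each term over initializations and how those statistics evolve during training.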

Outline

1. Neural Networks 101

2. One-Layer Neural Networks

3. Two-Layer Neural Networks

4. Deep Neural Networks

1. Neural Networks 101

Neural Networks

A deep network alternates affine transformations with an elementwise activation function: each layer's preactivations are passed through the activation function into the next layer, ending with the network output. A hat denotes a quantity evaluated @ initialization.

Biases and weights (the model parameters) are independently (& symmetrically) distributed, with variances set by the initialization hyperparameters; scaling the weight variance inversely with the width of the previous layer gives a good wide limit.
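A minimal sketch of this setup in PDLT-style notation (the symbols n_ℓ, C_b, and C_W are conventions taken from the book, not from the slide; the parameters are taken Gaussian here for concreteness):

\[
z_i^{(1)}(x) = b_i^{(1)} + \sum_{j=1}^{n_0} W_{ij}^{(1)}\, x_j ,
\qquad
z_i^{(\ell+1)}(x) = b_i^{(\ell+1)} + \sum_{j=1}^{n_\ell} W_{ij}^{(\ell+1)}\, \sigma\!\big(z_j^{(\ell)}(x)\big),
\]

with the parameters drawn independently at initialization as

\[
b_i^{(\ell)} \sim \mathcal{N}\!\left(0,\, C_b\right),
\qquad
W_{ij}^{(\ell)} \sim \mathcal{N}\!\left(0,\, \frac{C_W}{n_{\ell-1}}\right) .
\]

The 1/n_{\ell-1} scaling of the weight variance is what makes the wide limit well behaved.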

One Aside on Gradient Descent [Cf. Andrea's "S" matrix]

Taylor expansion of the gradient-descent update to the network output:

• Neural Tangent Kernel (NTK), at leading order [Cf. Dan's lectures]

• differential of the NTK (dNTK), at the next order [Cf. Dan's lectures]

• ddNTK, at the order after that

Neural Tangent Kernel (NTK)

Diagonal, group-by-group, learning rate: one learning rate for each group of model parameters (biases and weights, layer by layer), scaled with width so that the NTK has a good wide limit.
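A sketch of these objects in PDLT-style notation (the learning-rate tensor λ and the per-group rates λ_b, λ_W are conventions assumed from the book): a gradient-descent step updates the parameters by

\[
d\theta_\mu = -\,\eta \sum_{\nu} \lambda_{\mu\nu}\, \frac{\partial \mathcal{L}}{\partial \theta_\nu},
\]

and substituting this into the Taylor expansion of the output, the linear term is governed by the Neural Tangent Kernel,

\[
H_{i_1 i_2}(x_1, x_2) \equiv \sum_{\mu,\nu} \lambda_{\mu\nu}\,
\frac{\partial z_{i_1}(x_1)}{\partial \theta_\mu}\,
\frac{\partial z_{i_2}(x_2)}{\partial \theta_\nu},
\]

while the higher-order terms bring in the dNTK (one extra parameter derivative acting on z) and then the ddNTK. Choosing λ diagonal and group-by-group, e.g. λ_b per bias group and λ_W / n_{\ell-1} per weight group following the book's convention, is what keeps the NTK finite as the width grows.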

Two Pedagogical Simplifications

1. Single input; drop sample indices

[See “PDLT” (arXiv:2106.10165) for more general cases.]

2. Layer-independent hyperparameters; drop layer indices from them

2. One-Layer Neural Networks

Statistics of the First-Layer Preactivations

Compute the moments over the first-layer parameters by "Wick contraction":

• Neurons don't talk to each other; they are statistically independent.

• We marginalized over / integrated out the first-layer biases and weights.

• Two interpretations: (i) the outputs of one-layer networks; or (ii) the preactivations in the first layer of deeper networks.
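A sketch of the result for a single input, assuming Gaussian initialization as in PDLT (G^{(1)} is the book's first-layer metric/kernel): the first-layer preactivation distribution is Gaussian and diagonal in neurons,

\[
\mathbb{E}\big[\hat z_i^{(1)}\big] = 0,
\qquad
\mathbb{E}\big[\hat z_{i_1}^{(1)}\, \hat z_{i_2}^{(1)}\big] = \delta_{i_1 i_2}\, G^{(1)},
\qquad
G^{(1)} = C_b + C_W\, \frac{1}{n_0}\sum_{j=1}^{n_0} x_j\, x_j ,
\]

with all higher connected correlators vanishing.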

Statistics of the First-Layer NTK

• "Deterministic": it doesn't depend on any particular initialization; you always get the same number.

• "Frozen": it cannot evolve during training; no representation learning.

• No representation learning, and no algorithm dependence.
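A sketch of why, for a single input (notation again follows PDLT; λ_b and λ_W are the first-layer learning rates): the first-layer NTK is

\[
\hat H^{(1)}_{i_1 i_2} = \delta_{i_1 i_2}\left(\lambda_b + \lambda_W\, \frac{1}{n_0}\sum_{j=1}^{n_0} x_j\, x_j\right),
\]

which contains no model parameters at all, so it is the same for every draw of the initialization (deterministic) and cannot change as the parameters are trained (frozen).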

Statistics of One-Layer Neural Networks

Linear dynamics with a simple, closed-form solution (sketched below), taking the statistics at initialization to the statistics after training:

• Same trivial statistics for infinite-width neural networks of any fixed depth.

• No representation learning, no algorithm dependence; not a good model of deep learning.

We must study deeper networks of finite width!
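A minimal sketch of that simple solution, assuming gradient flow on an MSE loss restricted to the training input(s) (these choices are illustrative assumptions, not the slide's): with a frozen, deterministic NTK H, the outputs obey

\[
\frac{d}{dt}\, z_i(t) = -\sum_{j} H_{ij}\,\big(z_j(t) - y_j\big)
\qquad\Longrightarrow\qquad
z(t) - y = e^{-H t}\,\big(z(0) - y\big),
\]

so the distribution after training is just an affine transformation of the Gaussian distribution at initialization.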

3. Two-Layer Neural Networks

Statistics of the Second-Layer Preactivations

Wick-contract over the second-layer parameters and arrange the terms:

• Recursive: the second-layer statistics are built out of the first-layer statistics.

• The 1/n width-scaling of the weight variance was important.

The result is a nearly-Gaussian distribution for the second-layer preactivations [cf. the exactly Gaussian distribution in the first layer]:

• Gaussian in the infinite-width limit: too simple; specified by one number (one matrix – the kernel – more generally).

• Sparse description at O(1/n): specified by two numbers (two tensors more generally, one of them having four sample indices).

• Interacting neurons at finite width.
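A sketch of those two numbers for a single input (following PDLT conventions; the Gaussian average ⟨·⟩_G is an assumed shorthand): the kernel obeys the recursion

\[
G^{(2)} = C_b + C_W\, \big\langle \sigma(z)\,\sigma(z)\big\rangle_{G^{(1)}},
\qquad
\big\langle F(z)\big\rangle_{G} \equiv \int \frac{dz}{\sqrt{2\pi G}}\; e^{-z^{2}/(2G)}\; F(z),
\]

while the leading connected four-point correlator is suppressed by one power of the width,

\[
\mathbb{E}\big[\hat z^{(2)}_{i_1}\hat z^{(2)}_{i_2}\hat z^{(2)}_{i_3}\hat z^{(2)}_{i_4}\big]\Big|_{\text{connected}}
= \frac{\delta_{i_1 i_2}\delta_{i_3 i_4}}{n_1}\, V^{(2)} + \big(\text{two index permutations}\big),
\qquad
V^{(2)} = C_W^{2}\Big[\big\langle \sigma^{4}\big\rangle_{G^{(1)}} - \big\langle \sigma^{2}\big\rangle_{G^{(1)}}^{2}\Big].
\]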

Statistics of the Second-Layer NTK and beyond

The second-layer NTK splits into two pieces:

• 1st piece, the same as before: the contribution of the second-layer biases and weights; here the 1/n width-scaling of the learning rates was important.

• 2nd piece, chain rule: the contribution of the first-layer parameters, propagated forward through the first layer.

Putting things together, the NTK forward equation (sketched below):

• "Stochastic": it fluctuates from instantiation to instantiation.

• "Defrosted": it can evolve during training.

Fun for the weekend (solutions in §8, §11.2, and §∞.3): the analogous finite-width analysis for the NTK fluctuations, the dNTK, and the ddNTK*, which underlie Representation Learning.

[*for smooth activation functions]
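A sketch of the two pieces and the resulting forward equation for a single input (notation follows PDLT; hats denote quantities at initialization):

\[
\hat H^{(2)}_{i_1 i_2}
= \underbrace{\delta_{i_1 i_2}\left(\lambda_b + \lambda_W\, \frac{1}{n_1}\sum_{j} \sigma\big(\hat z_j^{(1)}\big)\,\sigma\big(\hat z_j^{(1)}\big)\right)}_{\text{1st piece: second-layer parameters}}
\;+\;
\underbrace{\sum_{j_1, j_2} \hat W^{(2)}_{i_1 j_1}\, \hat W^{(2)}_{i_2 j_2}\,
\sigma'\big(\hat z_{j_1}^{(1)}\big)\,\sigma'\big(\hat z_{j_2}^{(1)}\big)\, \hat H^{(1)}_{j_1 j_2}}_{\text{2nd piece: chain rule through the first layer}},
\]

which depends explicitly on the random weights and preactivations (stochastic) and on objects that change during training (defrosted). Taking its expectation at leading order in 1/n_1 gives the mean forward equation

\[
H^{(2)} = \lambda_b + \lambda_W\, \big\langle \sigma\,\sigma \big\rangle_{G^{(1)}}
+ C_W\, \big\langle \sigma'\sigma' \big\rangle_{G^{(1)}}\, H^{(1)} .
\]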

Statistics of Two-Layer Neural Networks

• Two interpretations: (i) the outputs, NTK, … of a two-layer network; or (ii) the preactivations, mid-layer NTK, … in the second layer of a deeper network.

• Neurons do talk to each other; they are statistically dependent.

• Yes representation learning (and yes algorithm dependence); they can now capture the rich dynamics of real, finite-width neural networks.

But what is being amplified by deep learning?

4. Deep Neural Networks

Statistics of Deep Neural Networks

Wick-contract over the next layer's parameters, arrange the terms, and keep the leading contributions at large width. Layer by layer this produces recursions, sketched below, for the:

• Two-point correlator (the kernel)

• Four-point connected correlator (the vertex)

• NTK mean

The NTK forward equation again has a 1st trivial piece plus a 2nd chain-rule piece.

Statistics of the NTK fluctuations and beyond: similar recursions follow for the

• NTK fluctuations (§8)

• dNTK (§11.2)

• ddNTK (§∞.3)
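A sketch of the leading single-input recursions, in PDLT-style notation (the susceptibility χ_∥ and the form of the vertex recursion are shown schematically; see §4, §5, and §8 of the book for the precise coefficients and the multi-input generalization):

\[
G^{(\ell+1)} = C_b + C_W\, \big\langle \sigma\,\sigma \big\rangle_{G^{(\ell)}},
\qquad
H^{(\ell+1)} = \lambda_b + \lambda_W\, \big\langle \sigma\,\sigma \big\rangle_{G^{(\ell)}}
+ C_W\, \big\langle \sigma'\sigma' \big\rangle_{G^{(\ell)}}\, H^{(\ell)},
\]

and, schematically, for the four-point vertex

\[
V^{(\ell+1)} = \chi_{\parallel}^{2}\big(G^{(\ell)}\big)\, V^{(\ell)}
+ C_W^{2}\Big[\big\langle \sigma^{4}\big\rangle_{G^{(\ell)}} - \big\langle \sigma^{2}\big\rangle_{G^{(\ell)}}^{2}\Big],
\qquad
\chi_{\parallel}(G) \equiv C_W\, \frac{d}{dG}\,\big\langle \sigma\,\sigma \big\rangle_{G},
\]

with analogous forward equations for the NTK fluctuations, the dNTK, and the ddNTK.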

The Principle of Sparsity for WIDE Neural Networks

From the statistics at initialization to the statistics after training [dynamics: Dan; statistics: Sho]:

• Infinite width: specified by the kernel and the NTK mean.

• Large-but-finite width, at O(1/n): additionally specified by the four-point vertex, the NTK fluctuations, the dNTK, and the ddNTK.

All determined through recursion relations (RG-flow interpretation: §4.6).
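These recursions are easy to check numerically. Below is a minimal sketch comparing the single-input kernel recursion against an ensemble of randomly initialized tanh networks; the widths, depth, and hyperparameter values are illustrative choices, not taken from the lecture.

import numpy as np

rng = np.random.default_rng(0)
n, L, n0 = 1000, 4, 100          # hidden width, depth, input dimension (illustrative)
C_b, C_W = 0.1, 1.0              # initialization hyperparameters (illustrative)
x = rng.standard_normal(n0)      # a single input, as in the pedagogical simplification

def next_kernel(G, num_samples=10**6):
    """Kernel recursion G^(l+1) = C_b + C_W * <sigma(z)^2>_G, estimated by Monte Carlo."""
    z = rng.standard_normal(num_samples) * np.sqrt(G)
    return C_b + C_W * np.mean(np.tanh(z) ** 2)

# Theory: iterate the recursion starting from the exact first-layer kernel.
G = C_b + C_W * np.dot(x, x) / n0
theory = [G]
for _ in range(L - 1):
    G = next_kernel(G)
    theory.append(G)

# Experiment: average z_i^(l) z_i^(l) over an ensemble of wide networks at initialization.
num_nets = 200
experiment = np.zeros(L)
for _ in range(num_nets):
    W = rng.standard_normal((n, n0)) * np.sqrt(C_W / n0)
    b = rng.standard_normal(n) * np.sqrt(C_b)
    z = b + W @ x                                   # first-layer preactivations
    experiment[0] += np.mean(z ** 2)
    for ell in range(1, L):
        W = rng.standard_normal((n, n)) * np.sqrt(C_W / n)
        b = rng.standard_normal(n) * np.sqrt(C_b)
        z = b + W @ np.tanh(z)                      # deeper-layer preactivations
        experiment[ell] += np.mean(z ** 2)
experiment /= num_nets

print("layer    recursion    ensemble")
for ell in range(L):
    print(f"{ell + 1:5d}    {theory[ell]:9.4f}    {experiment[ell]:8.4f}")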

Next Lecture: Solving Recursions, "The Principle of Criticality" for DEEP Neural Networks

One more thing…

Recommended reading: "The Principles of Deep Learning Theory," arXiv:2106.10165.