Statistical Machine Learning
Lecture 8 – Deep learning and neural networks

Niklas Wahlström
Division of Systems and Control
Department of Information Technology
Uppsala University
Email: [email protected]
Summary of Lecture 7 (I/III)
Boosting is a sequential ensemble method for regression or classification.
[Figure: training data → Model 1 → modified data → Model 2 → modified data → Model 3 → ... → Model B → boosted model]
The models are built sequentially such that each model tries to correct the mistakes made by the previous one!

Contrary to bagging, boosting methods reduce the bias of their base models. Works well with e.g. shallow trees.
Summary of Lecture 7 (II/III)
AdaBoost is a boosting method for binary classification. The AdaBoost classifier can be written as

y_boost^B(x) = \mathrm{sign}\Big(\sum_{b=1}^{B} \alpha^{(b)} y^{(b)}(x)\Big),

where y^{(b)}(x), the bth base classifier, predicts either −1 or 1, and α^{(b)} is a nonnegative real number representing this classifier's "confidence" in its prediction.
The base classifiers and confidence coefficients are found by sequentially (and greedily) minimizing the exponential loss which, for a classifier y(x) = sign{c(x)}, is defined as

L(y, c(x)) = exp(−y · c(x)),

where y · c(x) is referred to as the margin.
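As a concrete illustration of these two formulas, here is a minimal Python/NumPy sketch (not from the lecture) that evaluates a boosted classifier and the exponential loss at one point; the base classifiers and the confidence coefficients alpha are arbitrary placeholders rather than a fitted AdaBoost ensemble.

```python
import numpy as np

# Hypothetical base classifiers y^(b)(x) in {-1, +1} and confidence weights alpha^(b).
# In real AdaBoost these would be fitted sequentially; here they are just placeholders.
base_classifiers = [
    lambda x: np.sign(x[0]),          # y^(1)(x)
    lambda x: np.sign(x[1] - 0.5),    # y^(2)(x)
    lambda x: np.sign(x[0] + x[1]),   # y^(3)(x)
]
alpha = np.array([0.8, 0.3, 0.5])     # nonnegative "confidence" coefficients

def boosted_classifier(x):
    """y_boost^B(x) = sign( sum_b alpha^(b) y^(b)(x) )."""
    c = sum(a * clf(x) for a, clf in zip(alpha, base_classifiers))
    return np.sign(c), c              # prediction and the underlying score c(x)

def exponential_loss(y, c):
    """L(y, c(x)) = exp(-y * c(x)), where y * c(x) is the margin."""
    return np.exp(-y * c)

x = np.array([1.0, -0.2])
y_true = 1
y_hat, c = boosted_classifier(x)
print(y_hat, exponential_loss(y_true, c))
```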
Summary of Lecture 7 (III/III)
The exponential loss function is computationally convenient, but sensitive to outliers. In practice it is often preferable to use a robust loss function with a more gentle slope for negative margins.
[Figure: loss as a function of the margin y · c(x), comparing the exponential loss, binomial deviance, Huber-like loss and misclassification loss.]
Gradient boosting can be used to learn a boosting ensemble for general differentiable loss functions.
Practical info
1. The laboratory work
• Format: 4h mandatory computer lab about deep learning. Approved on the spot, no report.
• Five available slots: March 2, 18:15-12:00; March 5, 13:15-17:00; March 8, 8:15-12:00 and 13:15-17:00; and March 9, 13:15-17:00.
• Sign up at Studentportalen.
• Lab-PM and code are available from the homepage. Take a printed lab-PM before you leave and bring it to the lab session.
• Use the lab computers or bring your own laptop with PyTorch installed and running, or use Google Colab.
• Do the preparatory exercises before the lab session.

2. The exam
• You may bring one A4 handwritten sheet of paper, a mathematics handbook and a calculator.
• Questions from any part of the course may be on the exam.
• Don't forget to register!
Contents – Lecture 8
1. This lecture: The neural network model
• Neural network for regression
• Neural network for classification

2. Next lecture:
• Convolutional neural network
• How to train a neural network
Constructing neural networks for regression
A neural network (NN) is a nonlinear function y = f(x; θ) from an input x to an output y, parameterized by parameters θ.

Linear regression models the relationship between a continuous output y and a continuous input x,

y = \sum_{j=1}^{p} W_j x_j + b = Wx + b,

where the parameters θ are the weights W_j and the offset (aka bias/intercept) term b,

θ = [b  W]^T,  W = [W_1  W_2  ⋯  W_p]
Generalized linear regression
We can generalize this by introducing nonlinear transformations of the predictor Wx + b,

y = h(Wx + b).

[Figure: one-layer model with inputs 1, x_1, …, x_p, weights b, W_1, …, W_p, activation h and output y.]
We call h(x) the activation function. Two common choices are:
Sigmoid: h(x) = 1/(1 + e^{−x})

ReLU: h(x) = max(0, x)
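A minimal NumPy sketch of these two activation functions (my own illustration, not part of the slides):

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: h(x) = 1 / (1 + exp(-x)), squashes values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """ReLU activation: h(x) = max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid(x))   # approx [0.12, 0.5, 0.95]
print(relu(x))      # [0.0, 0.0, 3.0]
```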
Let us consider an example of a neural network.
Neural network - construction
A neural network is a sequential construction of several generalized linear regression models.

[Figure: network with inputs 1, x_1, …, x_p, hidden units q_1, …, q_U and output y.]

q_1 = h\big(b_1^{(1)} + \sum_{j=1}^{p} W_{1j}^{(1)} x_j\big)
q_2 = h\big(b_2^{(1)} + \sum_{j=1}^{p} W_{2j}^{(1)} x_j\big)
...
q_U = h\big(b_U^{(1)} + \sum_{j=1}^{p} W_{Uj}^{(1)} x_j\big)

y = b^{(2)} + \sum_{i=1}^{U} W_i^{(2)} q_i
Neural network - construction
A neural network is a sequential construction of several generalized linear regression models.

[Figure: the same network with inputs, hidden units and output, now in matrix-vector notation.]

q = h(W^{(1)} x + b^{(1)}),  y = W^{(2)} q + b^{(2)}

W^{(1)} = [W_{11}^{(1)} ⋯ W_{1p}^{(1)}; ⋮ ; W_{U1}^{(1)} ⋯ W_{Up}^{(1)}],  b^{(1)} = [b_1^{(1)}; ⋮ ; b_U^{(1)}],  q = [q_1; ⋮ ; q_U]

b^{(2)} = [b^{(2)}],  W^{(2)} = [W_1^{(2)} ⋯ W_U^{(2)}]

(weight matrices, offset vectors and hidden units, respectively; rows are separated by semicolons)
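The matrix-vector form translates almost line by line into code. Below is a minimal NumPy sketch of the forward pass of a one-hidden-layer network with a ReLU activation (one of the two choices above); the layer sizes and randomly drawn parameters are illustrative assumptions, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
p, U = 3, 4                      # number of inputs and hidden units (illustrative)

# Layer parameters: weight matrices and offset (bias) vectors.
W1 = rng.normal(size=(U, p))     # W^(1), shape (U, p)
b1 = rng.normal(size=U)          # b^(1), shape (U,)
W2 = rng.normal(size=(1, U))     # W^(2), shape (1, U)
b2 = rng.normal(size=1)          # b^(2), shape (1,)

def h(x):
    """ReLU activation applied elementwise."""
    return np.maximum(0.0, x)

def forward(x):
    """q = h(W^(1) x + b^(1)),  y = W^(2) q + b^(2)."""
    q = h(W1 @ x + b1)           # hidden units
    y = W2 @ q + b2              # output (regression: no activation on the last layer)
    return y

x = np.array([0.5, -1.0, 2.0])
print(forward(x))
```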
Neural network - construction
A neural network is a sequential construction of several generalized linear regression models.

[Figure: network with inputs 1, x_1, …, x_p, two layers of hidden units q_1^{(1)}, …, q_{U_1}^{(1)} and q_1^{(2)}, …, q_{U_2}^{(2)}, and output y.]

q^{(1)} = h(W^{(1)} x + b^{(1)})
q^{(2)} = h(W^{(2)} q^{(1)} + b^{(2)})
y = W^{(3)} q^{(2)} + b^{(3)}

The model learns better using a deep network (several layers) instead of a wide and shallow network.
Deep learning
A neural network with L layers can be written as

q^{(0)} = x
q^{(l)} = h\big(W^{(l)} q^{(l-1)} + b^{(l)}\big),  l = 1, …, L−1
y = W^{(L)} q^{(L−1)} + b^{(L)}

All weight matrices and offset vectors in all layers combined are the parameters of the network,

θ = \big[\mathrm{vec}(W^{(1)})^T, b^{(1)T}, …, \mathrm{vec}(W^{(L)})^T, b^{(L)T}\big]^T,

which constitutes the parametric model y = f(x; θ). If L is large we call it a deep neural network.

Deep learning is a class of machine learning models and algorithms that use a cascade of multiple layers, each of which is a nonlinear transformation.
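A minimal NumPy sketch of this L-layer recursion (the layer sizes and random parameters are illustrative assumptions, and ReLU is used as the activation h):

```python
import numpy as np

rng = np.random.default_rng(1)
layer_sizes = [3, 8, 8, 1]                      # p inputs, two hidden layers, one output (illustrative)
L = len(layer_sizes) - 1                        # number of layers

# theta: one weight matrix W^(l) and one offset vector b^(l) per layer.
weights = [rng.normal(size=(layer_sizes[l + 1], layer_sizes[l])) for l in range(L)]
offsets = [rng.normal(size=layer_sizes[l + 1]) for l in range(L)]

def forward(x):
    """q^(0) = x;  q^(l) = h(W^(l) q^(l-1) + b^(l)) for l < L;  y = W^(L) q^(L-1) + b^(L)."""
    q = x
    for l in range(L - 1):
        q = np.maximum(0.0, weights[l] @ q + offsets[l])   # hidden layers with ReLU
    return weights[L - 1] @ q + offsets[L - 1]             # linear output layer

print(forward(np.array([0.2, -0.7, 1.5])))
```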
NN for classification (M = 2 classes)
We can also use neural networks for classification. With M = 2 classes we squash the output through the logistic function to get a class probability g ∈ [0, 1],

g = f(z) = p(y = 1 | x),  where f(z) = e^z / (1 + e^z).

[Figure: network with inputs, hidden units, a single logit z, the logistic function f and the class probability g.]

What if M > 2?
NN for classification (M > 2 classes)
For M > 2 classes we want to predict the class probability for all M classes, g_m = p(y = m | x). We extend the logistic function to the softmax activation function

g_m = f_m(z) = e^{z_m} / \sum_{l=1}^{M} e^{z_l},  m = 1, …, M.

[Figure: network with inputs, hidden units, logits z_1, …, z_M and class probabilities g_1, …, g_M obtained via softmax(z).]

softmax(z) = [f_1(z), …, f_M(z)]^T
NN classification with M = 3 classes
Consider an example with three classes, M = 3.

[Figure: network with inputs, hidden units, logits z_1 = 0.9, z_2 = 4.2, z_3 = 0.2, and class probabilities g_1 = 0.03, g_2 = 0.95, g_3 = 0.02 obtained via softmax(z)]

where

f_m(z) = e^{z_m} / \sum_{l=1}^{M} e^{z_l},  softmax(z) = [f_1(z), …, f_M(z)]^T.
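A small NumPy sketch of the softmax computation that reproduces the numbers above; subtracting max(z) before exponentiating is a standard numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(z):
    """softmax(z)_m = exp(z_m) / sum_l exp(z_l)."""
    z = z - np.max(z)                 # numerical stability; does not change the result
    e = np.exp(z)
    return e / np.sum(e)

z = np.array([0.9, 4.2, 0.2])         # logits from the example
g = softmax(z)
print(np.round(g, 2))                 # [0.03 0.95 0.02]
print(g.sum())                        # approx 1
```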
One-hot encoding (for M > 2)
We use one-hot encoding for the output y = [y_1, …, y_M]^T, which means

y_m = 1 if x belongs to class m, and y_m = 0 otherwise.

Example: We have M = 3 classes and if x belongs to...
• ...class m = 1, then y = [1, 0, 0]^T.
• ...class m = 2, then y = [0, 1, 0]^T.
• ...class m = 3, then y = [0, 0, 1]^T.
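A one-line NumPy sketch of this encoding (illustrative; classes are numbered 1, …, M as on the slide):

```python
import numpy as np

def one_hot(m, M):
    """Return the one-hot vector y with y_m = 1 and all other entries 0 (classes numbered 1..M)."""
    return np.eye(M)[m - 1]

print(one_hot(1, 3))   # [1. 0. 0.]
print(one_hot(2, 3))   # [0. 1. 0.]
print(one_hot(3, 3))   # [0. 0. 1.]
```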
One-hot encoding and cross-entropy loss
Example: Consider M = 3 classes and the output y = [0, 1, 0]^T. How do we compare y_1, …, y_M with g_1, …, g_M?

Cross-entropy loss function: L(y, g) = −\sum_{m=1}^{M} y_m \ln g_m.

            m = 1   m = 2   m = 3   L(y, g)
y_m           0       1       0
g_m(θ_A)    0.28    0.68    0.04    −ln 0.68 = 0.39
g_m(θ_B)    0.51    0.03    0.46    −ln 0.03 = 3.51
g_m(θ_C)    0.03    0.95    0.02    −ln 0.95 = 0.05

Consider n training data points T = {x_i, y_i}_{i=1}^{n}. We then train the network by minimizing

\hat{θ} = \arg\min_θ \frac{1}{n} \sum_{i=1}^{n} L(y_i, g_i).
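The loss values in the table can be reproduced with a few lines of NumPy; a minimal sketch using the probability vectors from the table:

```python
import numpy as np

def cross_entropy(y, g):
    """L(y, g) = -sum_m y_m * ln(g_m), with y one-hot encoded."""
    return -np.sum(y * np.log(g))

y = np.array([0.0, 1.0, 0.0])                     # one-hot output, class m = 2
for name, g in [("theta_A", [0.28, 0.68, 0.04]),
                ("theta_B", [0.51, 0.03, 0.46]),
                ("theta_C", [0.03, 0.95, 0.02])]:
    print(name, round(cross_entropy(y, np.array(g)), 2))
# theta_A 0.39, theta_B 3.51, theta_C 0.05
```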
See the slides at the end for a more formal derivation of softmax.
Example: Language models
One result on the use of deep learning for language modeling
Language Models - February 2019
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language Models are Unsupervised Multitask Learners. February 2019.
See also the blog post: https://blog.openai.com/better-language-models/
Language models - Background
A language model is a probabilistic model over sequences of words.
Training
"This is an introductory course to statistical machine" (input) → "learning" (output)

The model can be used to hallucinate new texts.

Prediction
"Machine learning is the scientific study of algorithms and" (input) → "statistical" (prediction)
In the paper they trained a huge language model with 1.5 billion (!) parameters using 40 GB of text data from the internet. It can also be used for reading comprehension, language translation, question answering, etc.
Language models - Synthetic text (I/IV)
System Prompt (human-written): In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

Model Completion (machine-written): The scientist named the population, after their distinctive horn, Ovid's Unicorn. These four-horned, silver-white unicorns were previously unknown to science.

Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved...
Language models - Synthetic text (II/IV)
...
Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.

Pérez and the others then ventured further into the valley. "By the time we reached the top of one peak, the water looked blue, with some crystals on top," said Pérez.
...
Language models - Synthetic text (III/IV)
...
Pérez and his friends were astonished to see the unicorn herd. These creatures could be seen from the air without having to move too much to see them - they were so close they could touch their horns.

While examining these bizarre creatures the scientists discovered that the creatures also spoke some fairly regular English. Pérez stated, "We can see, for example, that they have a common 'language,' something like a dialect or dialectic."

Dr. Pérez believes that the unicorns may have originated in Argentina, where the animals were believed to be descendants of a lost race of people who lived there before the arrival of humans in those parts of South America.
...
Language models - Synthetic text (IV/IV)
...
While their origins are still unclear, some believe that perhaps the creatures were created when a human and a unicorn met each other in a time before human civilization. According to Pérez, "In South America, such incidents seem to be quite common."

However, Pérez also pointed out that it is likely that the only way of knowing for sure if unicorns are indeed the descendants of a lost alien race is through DNA. "But they seem to be able to communicate in English quite well, which I believe is a sign of evolution, or at least a change in social organization," said the scientist.
Language models - policy implication
Could be used for
• AI writing assistants
• More capable dialogue agents
• Unsupervised translation between languages
• Better speech recognition systems

Potential malicious purposes
• Generate misleading news articles
• Impersonate others online
• Automate the production of faked content on social media
• Automate the production of spam/phishing content

Due to our concerns about malicious applications of the technology, we are not releasing the trained model.
Why deep?
Example: Image classification
Input: pixels of an image
Output: object identity
Each hidden layer extracts increasingly abstract features.
Zeiler, M. D. and Fergus, R. Visualizing and Understanding Convolutional Networks. Computer Vision – ECCV (2014).
Some comments - Why now?
Neural networks have been around for more than fifty years. Why have they become so popular now (again)?

To solve really interesting problems you need:
1. Efficient learning algorithms
2. Efficient computational hardware
3. A lot of labeled data!

These three factors have not been fulfilled to a satisfactory level until the last 5-10 years.
The lab
Topic: Image classification with neural networks
Task 1: Classification of hand-written digits
Task 2: Real-world image classification
• Read Section 2 in the lab instruction and do the preparatoryexercises in Section 3 before the lab
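Since the lab uses PyTorch, a minimal sketch of the kind of fully connected classifier discussed in this lecture might look as follows; the layer sizes, the 10 output classes and the 28×28 input size typical of hand-written digit images are my assumptions, not taken from the lab instructions.

```python
import torch
import torch.nn as nn

# A small fully connected network: flattened image in, class logits out.
model = nn.Sequential(
    nn.Flatten(),                  # 28x28 image -> 784-dimensional input vector (assumed size)
    nn.Linear(28 * 28, 128),       # W^(1), b^(1)
    nn.ReLU(),                     # activation function h
    nn.Linear(128, 10),            # W^(2), b^(2): logits z_1, ..., z_10
)

loss_fn = nn.CrossEntropyLoss()    # applies softmax + cross-entropy internally

x = torch.randn(1, 1, 28, 28)      # one dummy image
z = model(x)                       # logits
print(torch.softmax(z, dim=1))     # class probabilities g_1, ..., g_10

y = torch.tensor([3])              # dummy class label
print(loss_fn(z, y))               # training would minimize this over many images
```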
A few concepts to summarize lecture 8
Neural network (NN): A nonlinear parametric model constructed by stacking several linear models with intermediate nonlinear activation functions.

Activation function: (a.k.a. squashing function) A nonlinear scalar function applied to each output element of the linear models in a NN.

Hidden units: Intermediate variables in the NN which are not observed, i.e. belong neither to the input nor the output data.
Softmax: A function used to transform the output of a NN to class probabilities.
One-hot encoding: An encoding of the output for training NNs with softmax output.
Softmax regression - A multi-class problem
Consider a classification problem with 3 classes.
Input: x = [x_1, x_2] ∈ R^2
Output: y ∈ {"red", "blue", "green"}

[Figure: training data in the (x_1, x_2)-plane with three classes.]

Can we extend logistic regression to handle more than 2 classes?
Softmax regression
Consider a multi-class classification problem y ∈ {1, …, M}.

g_{im} = p(y_i = m | x_i)

Softmax regression: Assume a linear model for each "log odds",

\ln \frac{g_{im}}{Z} = w_m^T x_i + b_m,  m = 1, …, M,

where Z is a normalization factor such that \sum_{m=1}^{M} g_{im} = 1. ⇒

g_{im} = Z e^{z_{im}},  z_{im} = w_m^T x_i + b_m,  m = 1, …, M

The normalization factor Z can now be computed as

1 = \sum_{m=1}^{M} g_{im} = Z \sum_{m=1}^{M} e^{z_{im}}  ⇒  Z = \frac{1}{\sum_{m=1}^{M} e^{z_{im}}}

The softmax regression model is then

g_{im} = \frac{e^{z_{im}}}{\sum_{l=1}^{M} e^{z_{il}}},  z_{im} = w_m^T x_i + b_m,  m = 1, …, M
Softmax regression using maximum likelihood

Pick w_m and b_m which make the data as likely as possible:

\hat{w}_{1:M}, \hat{b}_{1:M} = \arg\max_{w_{1:M}, b_{1:M}} \ln p(y_{1:n} | x_{1:n}, w_{1:M}, b_{1:M})

Assume all y_i to be independent:

\ln p(y_{1:n} | x_{1:n}, w_{1:M}, b_{1:M}) = \sum_{i=1}^{n} \ln p(y_i | x_i, w_{1:M}, b_{1:M})
= \sum_{m=1}^{M} \sum_{i : y_i = m} \ln \underbrace{p(y_i = m | x_i, w_{1:M}, b_{1:M})}_{= g_{im}}
This leads to the following optimization problem:

\hat{w}_{1:M}, \hat{b}_{1:M} = \arg\min_{w_{1:M}, b_{1:M}} −\sum_{i=1}^{n} \sum_{m=1}^{M} y_{im} \ln(g_{im}),

where y_{im} is the one-hot encoding of the output,

y_{im} = 1 if y_i = m, and y_{im} = 0 if y_i ≠ m.
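A minimal NumPy sketch of this optimization on synthetic data, using plain gradient descent on the (averaged) cross-entropy objective; the data, step size and number of iterations are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, M = 300, 2, 3

# Synthetic training data: three Gaussian blobs, labels one-hot encoded.
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
labels = rng.integers(0, M, size=n)
X = means[labels] + rng.normal(size=(n, p))
Y = np.eye(M)[labels]                                  # y_im, shape (n, M)

W = np.zeros((M, p))                                   # w_m stacked as rows
b = np.zeros(M)

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

step = 0.1
for _ in range(500):
    G = softmax(X @ W.T + b)                           # g_im
    grad_Z = (G - Y) / n                               # gradient of the mean cross-entropy w.r.t. z_im
    W -= step * grad_Z.T @ X
    b -= step * grad_Z.sum(axis=0)

pred = np.argmax(softmax(X @ W.T + b), axis=1)
print("training accuracy:", np.mean(pred == labels))
```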
Softmax regression on multi-class problem
Consider again the classification problem with 3 classes.
[Figure: softmax regression decision boundaries in the (x_1, x_2)-plane for the three classes.]
Softmax computes the probability for each of the three classes.
Here the result is visualized with decision boundaries for the class which has the highest probability.