Statistical Machine Learning
Lecture 8 – Deep learning and neural networks

Niklas Wahlström
Division of Systems and Control
Department of Information Technology
Uppsala University
Email: [email protected]
Summary of Lecture 7 (I/III)
Boosting is a sequential ensemble method for regression or classification.
[Figure: training data → Model 1 → modified data → Model 2 → modified data → Model 3 → ... → Model B → boosted model]
The models are built sequentially such that each model tries to correct the mistakes made by the previous one!

Contrary to bagging, boosting methods reduce the bias of their base models. Works well with e.g. shallow trees.
Summary of Lecture 7 (II/III)
AdaBoost is a boosting method for binary classification. The AdaBoost classifier can be written as

y_boost^B(x) = \mathrm{sign}\Big(\sum_{b=1}^{B} \alpha^{(b)} y^{(b)}(x)\Big),

where y^{(b)}(x), the bth base classifier, predicts either −1 or 1, and α^{(b)} is a nonnegative real number representing this classifier's "confidence" in its prediction.
The base classifiers and confidence coefficients are found by sequentially (and greedily) minimizing the exponential loss which, for a classifier y(x) = sign{c(x)}, is defined as

L(y, c(x)) = exp(−y · c(x)),

where y · c(x) is referred to as the margin.
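As a concrete illustration of these two formulas, here is a minimal Python/NumPy sketch (not from the lecture) that evaluates a boosted classifier and the exponential loss at one point; the base classifiers and the confidence coefficients alpha are arbitrary placeholders rather than a fitted AdaBoost ensemble.

```python
import numpy as np

# Hypothetical base classifiers y^(b)(x) in {-1, +1} and confidence weights alpha^(b).
# In real AdaBoost these would be fitted sequentially; here they are just placeholders.
base_classifiers = [
    lambda x: np.sign(x[0]),          # y^(1)(x)
    lambda x: np.sign(x[1] - 0.5),    # y^(2)(x)
    lambda x: np.sign(x[0] + x[1]),   # y^(3)(x)
]
alpha = np.array([0.8, 0.3, 0.5])     # nonnegative "confidence" coefficients

def boosted_classifier(x):
    """y_boost^B(x) = sign( sum_b alpha^(b) y^(b)(x) )."""
    c = sum(a * clf(x) for a, clf in zip(alpha, base_classifiers))
    return np.sign(c), c              # prediction and the underlying score c(x)

def exponential_loss(y, c):
    """L(y, c(x)) = exp(-y * c(x)), where y * c(x) is the margin."""
    return np.exp(-y * c)

x = np.array([1.0, -0.2])
y_true = 1
y_hat, c = boosted_classifier(x)
print(y_hat, exponential_loss(y_true, c))
```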
Summary of Lecture 7 (III/III)
The exponential loss function is computationally convenient, but sensitive to outliers. In practice it is often preferable to use a robust loss function with a more gentle slope for negative margins.
[Figure: loss as a function of the margin y · c(x), comparing the exponential loss, binomial deviance, Huber-like loss and misclassification loss.]
Gradient boosting can be used to learn a boosting ensemble for general differentiable loss functions.
Practical info
1. The laboratory work
• Format: 4h mandatory computer lab about deep learning. Approved on the spot, no report.
• Five available slots: March 2, 18:15-12:00; March 5, 13:15-17:00; March 8, 8:15-12:00 and 13:15-17:00; and March 9, 13:15-17:00.
• Sign up at Studentportalen.
• Lab-PM and code are available from the homepage. Take a printed lab-PM before you leave and bring it to the lab session.
• Use the lab computers or bring your own laptop with PyTorch installed and running, or use Google Colab.
• Do the preparatory exercises before the lab session.

2. The exam
• You may bring one A4 handwritten sheet of paper, a mathematics handbook and a calculator.
• Questions from any part of the course may be on the exam.
• Don't forget to register!
Contents – Lecture 8
1. This lecture: The neural network model
• Neural network for regression
• Neural network for classification

2. Next lecture:
• Convolutional neural network
• How to train a neural network
Constructing neural networks for regression
A neural network (NN) is a nonlinear function y = f(x; θ) from an input x to an output y, parameterized by parameters θ.

Linear regression models the relationship between a continuous output y and a continuous input x,

y = \sum_{j=1}^{p} W_j x_j + b = Wx + b,

where the parameters θ are the weights W_j and the offset (aka bias/intercept) term b,

θ = [b  W]^T,  W = [W_1  W_2  ⋯  W_p]
Generalized linear regression
We can generalize this by introducing nonlinear transformations of the predictor Wx + b,

y = h(Wx + b).

[Figure: one-layer model with inputs 1, x_1, …, x_p, weights b, W_1, …, W_p, activation h and output y.]
We call h(x) the activation function. Two common choices are:
Sigmoid: h(x) = 1/(1 + e^{−x})

ReLU: h(x) = max(0, x)
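A minimal NumPy sketch of these two activation functions (my own illustration, not part of the slides):

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: h(x) = 1 / (1 + exp(-x)), squashes values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """ReLU activation: h(x) = max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid(x))   # approx [0.12, 0.5, 0.95]
print(relu(x))      # [0.0, 0.0, 3.0]
```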
Let us consider an example of a neural network.
Neural network - construction
A neural network is a sequential construction of several generalized linear regression models.

[Figure: network with inputs 1, x_1, …, x_p, hidden units q_1, …, q_U and output y.]

q_1 = h\big(b_1^{(1)} + \sum_{j=1}^{p} W_{1j}^{(1)} x_j\big)
q_2 = h\big(b_2^{(1)} + \sum_{j=1}^{p} W_{2j}^{(1)} x_j\big)
...
q_U = h\big(b_U^{(1)} + \sum_{j=1}^{p} W_{Uj}^{(1)} x_j\big)

y = b^{(2)} + \sum_{i=1}^{U} W_i^{(2)} q_i
Neural network - construction
A neural network is a sequential construction of several generalized linear regression models.

[Figure: the same network with inputs, hidden units and output, now in matrix-vector notation.]

q = h(W^{(1)} x + b^{(1)}),  y = W^{(2)} q + b^{(2)}

W^{(1)} = [W_{11}^{(1)} ⋯ W_{1p}^{(1)}; ⋮ ; W_{U1}^{(1)} ⋯ W_{Up}^{(1)}],  b^{(1)} = [b_1^{(1)}; ⋮ ; b_U^{(1)}],  q = [q_1; ⋮ ; q_U]

b^{(2)} = [b^{(2)}],  W^{(2)} = [W_1^{(2)} ⋯ W_U^{(2)}]

(weight matrices, offset vectors and hidden units, respectively; rows are separated by semicolons)
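The matrix-vector form translates almost line by line into code. Below is a minimal NumPy sketch of the forward pass of a one-hidden-layer network with a ReLU activation (one of the two choices above); the layer sizes and randomly drawn parameters are illustrative assumptions, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
p, U = 3, 4                      # number of inputs and hidden units (illustrative)

# Layer parameters: weight matrices and offset (bias) vectors.
W1 = rng.normal(size=(U, p))     # W^(1), shape (U, p)
b1 = rng.normal(size=U)          # b^(1), shape (U,)
W2 = rng.normal(size=(1, U))     # W^(2), shape (1, U)
b2 = rng.normal(size=1)          # b^(2), shape (1,)

def h(x):
    """ReLU activation applied elementwise."""
    return np.maximum(0.0, x)

def forward(x):
    """q = h(W^(1) x + b^(1)),  y = W^(2) q + b^(2)."""
    q = h(W1 @ x + b1)           # hidden units
    y = W2 @ q + b2              # output (regression: no activation on the last layer)
    return y

x = np.array([0.5, -1.0, 2.0])
print(forward(x))
```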
Neural network - construction
A neural network is a sequential construction of several generalized linear regression models.

[Figure: network with inputs 1, x_1, …, x_p, two layers of hidden units q_1^{(1)}, …, q_{U_1}^{(1)} and q_1^{(2)}, …, q_{U_2}^{(2)}, and output y.]

q^{(1)} = h(W^{(1)} x + b^{(1)})
q^{(2)} = h(W^{(2)} q^{(1)} + b^{(2)})
y = W^{(3)} q^{(2)} + b^{(3)}

The model learns better using a deep network (several layers) instead of a wide and shallow network.
Deep learning
A neural network with L layers can be written as

q^{(0)} = x
q^{(l)} = h\big(W^{(l)} q^{(l-1)} + b^{(l)}\big),  l = 1, …, L−1
y = W^{(L)} q^{(L−1)} + b^{(L)}

All weight matrices and offset vectors in all layers combined are the parameters of the network,

θ = \big[\mathrm{vec}(W^{(1)})^T, b^{(1)T}, …, \mathrm{vec}(W^{(L)})^T, b^{(L)T}\big]^T,

which constitutes the parametric model y = f(x; θ). If L is large we call it a deep neural network.

Deep learning is a class of machine learning models and algorithms that use a cascade of multiple layers, each of which is a nonlinear transformation.
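A minimal NumPy sketch of this L-layer recursion (the layer sizes and random parameters are illustrative assumptions, and ReLU is used as the activation h):

```python
import numpy as np

rng = np.random.default_rng(1)
layer_sizes = [3, 8, 8, 1]                      # p inputs, two hidden layers, one output (illustrative)
L = len(layer_sizes) - 1                        # number of layers

# theta: one weight matrix W^(l) and one offset vector b^(l) per layer.
weights = [rng.normal(size=(layer_sizes[l + 1], layer_sizes[l])) for l in range(L)]
offsets = [rng.normal(size=layer_sizes[l + 1]) for l in range(L)]

def forward(x):
    """q^(0) = x;  q^(l) = h(W^(l) q^(l-1) + b^(l)) for l < L;  y = W^(L) q^(L-1) + b^(L)."""
    q = x
    for l in range(L - 1):
        q = np.maximum(0.0, weights[l] @ q + offsets[l])   # hidden layers with ReLU
    return weights[L - 1] @ q + offsets[L - 1]             # linear output layer

print(forward(np.array([0.2, -0.7, 1.5])))
```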
NN for classification (M = 2 classes)
We can also use neural networks for classification. With M = 2 classes we squash the output through the logistic function to get a class probability g ∈ [0, 1],

g = f(z) = p(y = 1 | x),  where f(z) = e^z / (1 + e^z).

[Figure: network with inputs, hidden units, a single logit z, the logistic function f and the class probability g.]

What if M > 2?
NN for classification (M > 2 classes)
For M > 2 classes we want to predict the class probability for all M classes, g_m = p(y = m | x). We extend the logistic function to the softmax activation function

g_m = f_m(z) = e^{z_m} / \sum_{l=1}^{M} e^{z_l},  m = 1, …, M.

[Figure: network with inputs, hidden units, logits z_1, …, z_M and class probabilities g_1, …, g_M obtained via softmax(z).]

softmax(z) = [f_1(z), …, f_M(z)]^T
NN classification with M = 3 classes
Consider an example with three classes, M = 3.

[Figure: network with inputs, hidden units, logits z_1 = 0.9, z_2 = 4.2, z_3 = 0.2, and class probabilities g_1 = 0.03, g_2 = 0.95, g_3 = 0.02 obtained via softmax(z)]

where

f_m(z) = e^{z_m} / \sum_{l=1}^{M} e^{z_l},  softmax(z) = [f_1(z), …, f_M(z)]^T.
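A small NumPy sketch of the softmax computation that reproduces the numbers above; subtracting max(z) before exponentiating is a standard numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(z):
    """softmax(z)_m = exp(z_m) / sum_l exp(z_l)."""
    z = z - np.max(z)                 # numerical stability; does not change the result
    e = np.exp(z)
    return e / np.sum(e)

z = np.array([0.9, 4.2, 0.2])         # logits from the example
g = softmax(z)
print(np.round(g, 2))                 # [0.03 0.95 0.02]
print(g.sum())                        # approx 1
```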
One-hot encoding (for M > 2)
We use one-hot encoding for the output y = [y_1, …, y_M]^T, which means

y_m = 1 if x belongs to class m, and y_m = 0 otherwise.

Example: We have M = 3 classes and if x belongs to...
• ...class m = 1, then y = [1, 0, 0]^T.
• ...class m = 2, then y = [0, 1, 0]^T.
• ...class m = 3, then y = [0, 0, 1]^T.
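A one-line NumPy sketch of this encoding (illustrative; classes are numbered 1, …, M as on the slide):

```python
import numpy as np

def one_hot(m, M):
    """Return the one-hot vector y with y_m = 1 and all other entries 0 (classes numbered 1..M)."""
    return np.eye(M)[m - 1]

print(one_hot(1, 3))   # [1. 0. 0.]
print(one_hot(2, 3))   # [0. 1. 0.]
print(one_hot(3, 3))   # [0. 0. 1.]
```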
One-hot encoding and cross-entropy loss
Example: Consider M = 3 classes and the output y = [0, 1, 0]^T. How do we compare y_1, …, y_M with g_1, …, g_M?

Cross-entropy loss function: L(y, g) = −\sum_{m=1}^{M} y_m \ln g_m.

            m = 1   m = 2   m = 3   L(y, g)
y_m           0       1       0
g_m(θ_A)    0.28    0.68    0.04    −ln 0.68 = 0.39
g_m(θ_B)    0.51    0.03    0.46    −ln 0.03 = 3.51
g_m(θ_C)    0.03    0.95    0.02    −ln 0.95 = 0.05

Consider n training data points T = {x_i, y_i}_{i=1}^{n}. We then train the network by minimizing

\hat{θ} = \arg\min_θ \frac{1}{n} \sum_{i=1}^{n} L(y_i, g_i).
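The loss values in the table can be reproduced with a few lines of NumPy; a minimal sketch using the probability vectors from the table:

```python
import numpy as np

def cross_entropy(y, g):
    """L(y, g) = -sum_m y_m * ln(g_m), with y one-hot encoded."""
    return -np.sum(y * np.log(g))

y = np.array([0.0, 1.0, 0.0])                     # one-hot output, class m = 2
for name, g in [("theta_A", [0.28, 0.68, 0.04]),
                ("theta_B", [0.51, 0.03, 0.46]),
                ("theta_C", [0.03, 0.95, 0.02])]:
    print(name, round(cross_entropy(y, np.array(g)), 2))
# theta_A 0.39, theta_B 3.51, theta_C 0.05
```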
See the slides at the end for a more formal derivation of softmax.
Example: Language models
One result on the use of deep learning for language modeling
Language Models - February 2019
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language Models are Unsupervised Multitask Learners. February 2019.
See also the blog post: https://blog.openai.com/better-language-models/
Language models - Background
A language model is a probabilistic model over sequences of words.
Training
"This is an introductory course to statistical machine" (input) → "learning" (output)

The model can be used to hallucinate new texts.

Prediction
"Machine learning is the scientific study of algorithms and" (input) → "statistical" (prediction)
In the paper they trained a huge language model with 1.5 billion (!) parameters using 40 GB of text data from the internet. It can also be used for reading comprehension, language translation, question answering, etc.
Language models - Synthetic text (I/IV)
System Prompt (human-written): In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

Model Completion (machine-written): The scientist named the population, after their distinctive horn, Ovid's Unicorn. These four-horned, silver-white unicorns were previously unknown to science.

Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved...
Language models - Synthetic text (II/IV)
...
Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.

Pérez and the others then ventured further into the valley. "By the time we reached the top of one peak, the water looked blue, with some crystals on top," said Pérez.
...
Language models - Synthetic text (III/IV)
...
Pérez and his friends were astonished to see the unicorn herd. These creatures could be seen from the air without having to move too much to see them - they were so close they could touch their horns.

While examining these bizarre creatures the scientists discovered that the creatures also spoke some fairly regular English. Pérez stated, "We can see, for example, that they have a common 'language,' something like a dialect or dialectic."

Dr. Pérez believes that the unicorns may have originated in Argentina, where the animals were believed to be descendants of a lost race of people who lived there before the arrival of humans in those parts of South America.
...
Language models - Synthetic text (IV/IV)
...
While their origins are still unclear, some believe that perhaps the creatures were created when a human and a unicorn met each other in a time before human civilization. According to Pérez, "In South America, such incidents seem to be quite common."

However, Pérez also pointed out that it is likely that the only way of knowing for sure if unicorns are indeed the descendants of a lost alien race is through DNA. "But they seem to be able to communicate in English quite well, which I believe is a sign of evolution, or at least a change in social organization," said the scientist.
Language models - policy implication
Could be used for
• AI writing assistants
• More capable dialogue agents
• Unsupervised translation between languages
• Better speech recognition systems

Potential malicious purposes
• Generate misleading news articles
• Impersonate others online
• Automate the production of faked content on social media
• Automate the production of spam/phishing content

Due to our concerns about malicious applications of the technology, we are not releasing the trained model.
Why deep?
Example: Image classification
Input: pixels of an image
Output: object identity
Each hidden layer extracts increasingly abstract features.
Zeiler, M. D. and Fergus, R. Visualizing and Understanding Convolutional Networks. Computer Vision – ECCV (2014).
Some comments - Why now?
Neural networks have been around for more than fifty years. Why have they become so popular now (again)?

To solve really interesting problems you need:
1. Efficient learning algorithms
2. Efficient computational hardware
3. A lot of labeled data!

These three factors have not been fulfilled to a satisfactory level until the last 5-10 years.
The lab
Topic: Image classification with neural networks
Task 1: Classification of hand-written digits
Task 2: Real-world image classification
• Read Section 2 in the lab instruction and do the preparatoryexercises in Section 3 before the lab
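Since the lab uses PyTorch, a minimal sketch of the kind of fully connected classifier discussed in this lecture might look as follows; the layer sizes, the 10 output classes and the 28×28 input size typical of hand-written digit images are my assumptions, not taken from the lab instructions.

```python
import torch
import torch.nn as nn

# A small fully connected network: flattened image in, class logits out.
model = nn.Sequential(
    nn.Flatten(),                  # 28x28 image -> 784-dimensional input vector (assumed size)
    nn.Linear(28 * 28, 128),       # W^(1), b^(1)
    nn.ReLU(),                     # activation function h
    nn.Linear(128, 10),            # W^(2), b^(2): logits z_1, ..., z_10
)

loss_fn = nn.CrossEntropyLoss()    # applies softmax + cross-entropy internally

x = torch.randn(1, 1, 28, 28)      # one dummy image
z = model(x)                       # logits
print(torch.softmax(z, dim=1))     # class probabilities g_1, ..., g_10

y = torch.tensor([3])              # dummy class label
print(loss_fn(z, y))               # training would minimize this over many images
```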
A few concepts to summarize lecture 8
Neural network (NN): A nonlinear parametric model constructed by stacking several linear models with intermediate nonlinear activation functions.

Activation function: (a.k.a. squashing function) A nonlinear scalar function applied to each output element of the linear models in a NN.

Hidden units: Intermediate variables in the NN which are not observed, i.e. belong neither to the input nor the output data.
Softmax: A function used to transform the output of a NN to class probabilities.
One-hot encoding: An encoding of the output for training NNs with softmax output.
Softmax regression - A multi-class problem
Consider a classification problem with 3 classes.
Input: x = [x_1, x_2] ∈ R^2
Output: y ∈ {"red", "blue", "green"}

[Figure: training data in the (x_1, x_2)-plane with three classes.]

Can we extend logistic regression to handle more than 2 classes?
Softmax regression
Consider a multi-class classification problem y ∈ {1, …, M}.

g_{im} = p(y_i = m | x_i)

Softmax regression: Assume a linear model for each "log odds",

\ln \frac{g_{im}}{Z} = w_m^T x_i + b_m,  m = 1, …, M,

where Z is a normalization factor such that \sum_{m=1}^{M} g_{im} = 1. ⇒

g_{im} = Z e^{z_{im}},  z_{im} = w_m^T x_i + b_m,  m = 1, …, M

The normalization factor Z can now be computed as

1 = \sum_{m=1}^{M} g_{im} = Z \sum_{m=1}^{M} e^{z_{im}}  ⇒  Z = \frac{1}{\sum_{m=1}^{M} e^{z_{im}}}

The softmax regression model is then

g_{im} = \frac{e^{z_{im}}}{\sum_{l=1}^{M} e^{z_{il}}},  z_{im} = w_m^T x_i + b_m,  m = 1, …, M
Softmax regression using maximum likelihood

Pick w_m and b_m which make the data as likely as possible:

\hat{w}_{1:M}, \hat{b}_{1:M} = \arg\max_{w_{1:M}, b_{1:M}} \ln p(y_{1:n} | x_{1:n}, w_{1:M}, b_{1:M})

Assume all y_i to be independent:

\ln p(y_{1:n} | x_{1:n}, w_{1:M}, b_{1:M}) = \sum_{i=1}^{n} \ln p(y_i | x_i, w_{1:M}, b_{1:M})
= \sum_{m=1}^{M} \sum_{i : y_i = m} \ln \underbrace{p(y_i = m | x_i, w_{1:M}, b_{1:M})}_{= g_{im}}
This leads to the following optimization problem:

\hat{w}_{1:M}, \hat{b}_{1:M} = \arg\min_{w_{1:M}, b_{1:M}} −\sum_{i=1}^{n} \sum_{m=1}^{M} y_{im} \ln(g_{im}),

where y_{im} is the one-hot encoding of the output,

y_{im} = 1 if y_i = m, and y_{im} = 0 if y_i ≠ m.
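A minimal NumPy sketch of this optimization on synthetic data, using plain gradient descent on the (averaged) cross-entropy objective; the data, step size and number of iterations are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, M = 300, 2, 3

# Synthetic training data: three Gaussian blobs, labels one-hot encoded.
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
labels = rng.integers(0, M, size=n)
X = means[labels] + rng.normal(size=(n, p))
Y = np.eye(M)[labels]                                  # y_im, shape (n, M)

W = np.zeros((M, p))                                   # w_m stacked as rows
b = np.zeros(M)

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

step = 0.1
for _ in range(500):
    G = softmax(X @ W.T + b)                           # g_im
    grad_Z = (G - Y) / n                               # gradient of the mean cross-entropy w.r.t. z_im
    W -= step * grad_Z.T @ X
    b -= step * grad_Z.sum(axis=0)

pred = np.argmax(softmax(X @ W.T + b), axis=1)
print("training accuracy:", np.mean(pred == labels))
```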
Softmax regression on multi-class problem
Consider again the classification problem with 3 classes.
[Figure: softmax regression decision boundaries in the (x_1, x_2)-plane for the three classes.]
Softmax computes the probability for each of the three classes.
Here the result is visualized with decision boundaries for the class which has the highest probability.