Deep Learning Explained
Module 2: Logistic Regression
Sayan D. Pathak, Ph.D., Principal ML Scientist, Microsoft
Roland Fernandez, Senior Researcher, Microsoft
Module Outline
• Application: OCR with MNIST data
• Model: Logistic Regression
• Concepts: loss, minibatch, train-test-predict workflow
MNIST Handwritten Digits (OCR)
• Data set of handwritten digits (0-9) with:
  ✓ 60,000 training images
  ✓ 10,000 test images
• Each image is 28 x 28 pixels
[Figure: sample handwritten digit images with their corresponding labels]
[Figure: each 28 x 28 pixel image is flattened into a 784-pixel input vector $\vec{x}$]
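A small numpy sketch of this input encoding, using a random array as a stand-in for a real MNIST digit:

```python
import numpy as np

image = np.random.rand(28, 28)   # one 28 x 28 grayscale image (stand-in)
x = image.reshape(784)           # flattened 784-pixel input vector x
print(x.shape)                   # (784,)
```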
Logistic Regression
[Figure: the 784-pixel input vector $\vec{x}$ is mapped through the model (W, $\vec{b}$) to one score per digit class 0-9, e.g. 0.1 0.1 0.3 0.9 0.4 0.2 0.1 0.1 0.6 0.3]
Model parameters: weights W (784 x 10) and bias $\vec{b}$ (dim 10)
$\vec{z} = W\vec{x}^T + \vec{b}$
Logistic regression is a model that maps input features to discrete output classes, as opposed to linear regression, which produces a continuous output.
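A numpy sketch of this model computation on one flattened image; the zero-initialized parameters are placeholders for learned values, with W shaped so the matrix-vector product yields the ten class scores:

```python
import numpy as np

W = np.zeros((10, 784))   # one weight row per digit class
b = np.zeros(10)          # one bias per digit class
x = np.random.rand(784)   # a flattened 28 x 28 image (stand-in)

z = W @ x + b             # ten class scores, one per digit
print(z.shape)            # (10,)
```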
Logistic Regression
[Figure: the 784 pixels of the flattened input $\vec{x}$ feed a weighted-sum unit for each output class]
Score for class 0: $S_0 = \sum_{j=1}^{784} w_{0j}\, x_j = \vec{w}_0 \cdot \vec{x}^T$ (sum of weights x pixels)
Score for class 1: $S_1 = \sum_{j=1}^{784} w_{1j}\, x_j = \vec{w}_1 \cdot \vec{x}^T$
Logistic Regression
[Figure: ten weighted-sum units, one per digit class 0-9, all sharing the 784-pixel input $\vec{x}$]
Score for class $i$: $S_i = \sum_{j=1}^{784} w_{ij}\, x_j = \vec{w}_i \cdot \vec{x}^T$, for $i = 0, \ldots, 9$
Stacking the ten weight vectors into the weight matrix W (784 x 10) computes all the scores at once: $\vec{z} = W\vec{x}^T$
Logistic Regression
[Figure: the ten weighted sums, each with a bias term, passed through an activation function]
Adding a bias vector $\vec{b}$ (dim 10) gives the model: $\vec{z} = W\vec{x}^T + \vec{b}$
Activation function: sigmoid, which maps each score to the (0, 1) range, e.g. 0.1 0.1 0.3 0.9 0.4 0.2 0.1 0.1 0.6 0.3
Logistic Regression with Softmax
[Figure: the evidence $\vec{z} = W\vec{x}^T + \vec{b}$ is passed through unchanged (activation = none) and then through softmax]
Softmax converts the evidence $z_0, z_1, \ldots, z_9$ into predicted probabilities (p):
$p_i = \dfrac{e^{z_i}}{\sum_{j=0}^{9} e^{z_j}}$
e.g. 0.08 0.08 0.10 0.17 0.11 0.09 0.08 0.08 0.13 0.01
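A numpy sketch of this softmax, with illustrative evidence values; it maps the ten scores to probabilities that are positive and sum to 1:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

z = np.array([0.2, 0.2, 0.4, 1.0, 0.5, 0.3, 0.2, 0.2, 0.7, -1.5])
p = softmax(z)
print(p.round(2), p.sum())    # predicted probabilities, summing to 1
```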
Loss Function
Loss functions:
Squared error: $se = \sum_{j=0}^{9} (y_j - p_j)^2$
Cross-entropy error: $ce = -\sum_{j=0}^{9} y_j \log(p_j)$
Example: for the label 3, one-hot encoded (Y): 0 0 0 1 0 0 0 0 0 0
Predicted probabilities (p) from the model (W, b): 0.08 0.08 0.10 0.17 0.11 0.09 0.08 0.08 0.13 0.01
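A numpy sketch computing both losses for this slide's example label and predicted probabilities:

```python
import numpy as np

Y = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0], dtype=float)  # one-hot label 3
p = np.array([0.08, 0.08, 0.10, 0.17, 0.11,
              0.09, 0.08, 0.08, 0.13, 0.01])                # predictions

se = np.sum((Y - p) ** 2)     # squared error
ce = -np.sum(Y * np.log(p))   # cross-entropy error = -log(0.17) ≈ 1.77
print(se, ce)
```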
Train Workflow
[Figure: training loop: a data sampler draws features (x) and labels (Y) from the training data; the model z(params) produces predictions, the loss against Y drives the "update params" step, and progress is reported over the iterations until the "train more?" check says stop]
Train Workflow: MNIST Train
[Figure: a minibatch of 128 MNIST training images, e.g. digits 3, 7, 8, 0]
Input features (X: 128 x 784): a minibatch of 128 samples
One-hot encoded labels (Y: 128 x 10), e.g.:
0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 0 0 0
Model: z = times(X, W) + b, i.e. $z = \mathbf{W}X^T + \vec{b}$, with weights W (784 x 10) and bias $\vec{b}$ (dim 10)
Loss: cross_entropy_with_softmax(z, Y)
Error (optional): classification_error(z, Y)
Trainer(model, (loss, error), learner)
Trainer.train_minibatch({X, Y})
Learners (sgd, adagrad, etc.) are solvers that estimate W and b.
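A minimal sketch of this slide's train workflow, assuming the CNTK 2.x Python API; the random 128-sample minibatch stands in for real MNIST images and one-hot labels, and the learning rate is an assumption:

```python
import cntk as C
import numpy as np

input_dim, num_classes = 784, 10
X = C.input_variable(input_dim)          # features (128 x 784 per batch)
Y = C.input_variable(num_classes)        # one-hot labels (128 x 10)

W = C.parameter(shape=(input_dim, num_classes))   # weights (784 x 10)
b = C.parameter(shape=(num_classes,))             # bias (dim 10)
z = C.times(X, W) + b                             # model: z = times(X, W) + b

loss  = C.cross_entropy_with_softmax(z, Y)
error = C.classification_error(z, Y)              # error (optional)

lr      = C.learning_rate_schedule(0.2, C.UnitType.minibatch)
learner = C.sgd(z.parameters, lr)                 # solver estimating W & b
trainer = C.Trainer(z, (loss, error), [learner])

x_batch = np.random.rand(128, input_dim).astype(np.float32)
y_batch = np.eye(num_classes, dtype=np.float32)[
    np.random.randint(0, num_classes, 128)]
trainer.train_minibatch({X: x_batch, Y: y_batch})
```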
Learn the weights: Learners / Optimizers / Solvers
For 1 sample:
Loss: $L_i = -\sum_{j=0}^{9} y_j^{(i)} \log(p_j)$, where $p_j = f(x^{(i)}; \theta)_j$ and $\theta \in (w, b)$
For all samples (m = 60,000 images):
Total loss $= \sum_{i=1}^{m} L_i(\theta; (x^{(i)}, y^{(i)}))$
Convex function: there is one and only one minimum.
[Fig: convex loss surface, courtesy http://codingwiththomas.blogspot.com/2012/09/particle-swarm-optimization.html]
Gradient Descent
$\theta' = \theta - \mu \,\mathrm{grad}(L; \theta)$
where $\theta$ = model parameters, $\mu$ = learning rate.
Computing the total loss ($\sum_i^n L_i$) for a large data set is expensive and often redundant; refer to http://sebastianruder.com/optimizing-gradient-descent/ for details.
Stochastic Gradient Descent (SGD)
SGD: update the parameters for each (data, label) pair.
Mini-batch SGD: update the parameters for each mini-batch, i.e. a set of (data, label) pairs.
Refer to http://sebastianruder.com/optimizing-gradient-descent/ for details on different learners.
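A numpy sketch of one mini-batch SGD step ($\theta' = \theta - \mu\,\mathrm{grad}$) for the softmax model above; the gradient formulas follow from the cross-entropy loss, and the 128-sample batch is random stand-in data:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # row-wise, stabilized
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
W, b, mu = np.zeros((784, 10)), np.zeros(10), 0.1  # params, learning rate

X = rng.random((128, 784))                 # mini-batch of features
Y = np.eye(10)[rng.integers(0, 10, 128)]   # one-hot labels

p = softmax(X @ W + b)        # predicted probabilities (128 x 10)
g = (p - Y) / len(X)          # gradient of mean cross-entropy w.r.t. z
W -= mu * (X.T @ g)           # theta' = theta - mu * grad
b -= mu * g.sum(axis=0)
```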
Other learners
Momentum-SGD, Nesterov, Adagrad, Adadelta, Adam.
Refer to http://sebastianruder.com/optimizing-gradient-descent/ for details on different learners.
[Figure: animation comparing optimizer trajectories, image by Alec Radford]
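A hedged sketch of constructing a few of these learners in CNTK 2.x; the placeholder parameter and the schedule values are illustrative, and exact constructor signatures vary across CNTK versions:

```python
import cntk as C

p  = C.parameter(shape=(10,))                             # placeholder parameter
lr = C.learning_rate_schedule(0.1, C.UnitType.minibatch)  # learning rate
m  = C.momentum_schedule(0.9)                             # momentum

sgd_learner      = C.sgd([p], lr)
momentum_learner = C.momentum_sgd([p], lr, m)
adagrad_learner  = C.adagrad([p], lr)
```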
Validation Workflow
[Figure: the train loop as before (data sampler → model z(params) → loss → update params, reported over iterations), plus a validate loop: a second data sampler draws features (x) and labels (Y) from the validation data, evaluates the model with the trained params, and reports until the "more?" check says stop, yielding the final model]
Test Workflow
[Figure: a data sampler draws features (x) and labels (Y) from the test data; the final model, with its trained params, is evaluated and the results reported until the "test more?" check says stop]
Test Workflow: MNIST Test
[Figure: a minibatch of 32 MNIST test images, e.g. digits 3, 7, 8, 0]
Input features (X*: 32 x 784): a minibatch of 32 samples
One-hot encoded labels (Y*: 32 x 10), e.g.:
0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 0 0 0
Model: z = times(X*, W*) + b*, i.e. $z = \mathbf{W}^* X^{*T} + \vec{b}^*$, with trained weights W* (784 x 10) and bias $\vec{b}^*$ (dim 10)
Trainer.test_minibatch({X*, Y*})
Returns the classification error as the percentage of incorrectly labeled MNIST images.
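Continuing the training sketch above (same trainer, X, and Y), a minimal test step; the 32-sample batch is random stand-in data for real MNIST test images and labels:

```python
import numpy as np

x_test = np.random.rand(32, 784).astype(np.float32)
y_test = np.eye(10, dtype=np.float32)[np.random.randint(0, 10, 32)]

# Average classification error over this minibatch, as a fraction
avg_error = trainer.test_minibatch({X: x_test, Y: y_test})
print("incorrectly labeled: {:.1%}".format(avg_error))
```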
Prediction Workflow
[Figure: any MNIST image, e.g. a handwritten 9]
Input feature (new X: 1 x 784), fed to the model (W, b):
Model.eval(new X)
Predicted softmax probabilities (predicted_label):
0.02 0.09 0.03 0.03 0.01 0.02 0.02 0.06 0.02 0.70
[ numpy.argmax(predicted_label) for predicted_label in predicted_labels ]
[9]
Prediction Workflow
[Figure: a batch of 25 MNIST images, e.g. digits 9, 5, ..., 8, 2]
Input feature (new X: 25 x 784), fed to the model (W, b):
Model.eval(new X)
Predicted softmax probabilities (predicted_label):
[ numpy.argmax(predicted_label) for predicted_label in predicted_labels ]
[9, 5, 8, …, 2]
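Continuing the CNTK sketches above (same z and X), a minimal prediction step: wrap the evidence in a softmax node, evaluate it on a batch of stand-in images, and take the per-row argmax as the predicted digit:

```python
import cntk as C
import numpy as np

out = C.softmax(z)                                   # predicted probabilities
new_X = np.random.rand(25, 784).astype(np.float32)   # new X: 25 x 784 (stand-in)
predicted_labels = out.eval({X: new_X})
print([int(np.argmax(p)) for p in predicted_labels])
```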