
Page 1:

Advanced Introduction to Machine Learning

— Spring Quarter, Week 2 — https://canvas.uw.edu/courses/1372141

Prof. Jeff Bilmes

University of Washington, Seattle

Departments of: Electrical & Computer Engineering, Computer Science & Engineering

http://melodi.ee.washington.edu/~bilmes

April 6th/8th, 2020

Page 2:

Logistics Review

Announcements

HW1 to be posted this evening, due in 1.5 weeks.

Virtual office hours this week, Thursday night at 10:00pm via zoom (same link as class).

Page 3:

Logistics Review

Class Road Map

W1 (3/30, 4/1): What is ML, Probability, Coins, Gaussians and linear regression, Associative Memories, Supervised Learning
W2 (4/6, 4/8): More supervised, logistic regression, complexity and bias/variance tradeoff
W3 (4/13, 4/15): Bias/Variance, Regularization, Ridge, CrossVal, Multiclass
W4 (4/20, 4/22): Multiclass classification, ERM, Gen/Disc, Naive Bayes
W5 (4/27, 4/29): Lasso, Regularizers, Curse of Dimensionality
W6 (5/4, 5/6): Curse of Dimensionality, Dimensionality Reduction, k-NN
W7 (5/11, 5/13): k-NN, LSH, DTs, Bootstrap/Bagging, Boosting & Random Forests, GBDTs
W8 (5/18, 5/20): Graphs; Graphical Models (Factorization, Inference, MRFs, BNs)
W9 (5/27, 6/1): Learning Paradigms; Clustering; EM Algorithm
W10 (6/3, 6/8): Spectral Clustering, Graph SSL, Deep models, (SVMs, RL); The Future

Last lecture is 6/8 since 5/25 is a holiday (or we could just have lecture on 5/25).

Page 4:

Logistics Review

Acknowledgments/References

Some of the below material was drawn from:

Bishop, 1996.

https://courses.cs.washington.edu/courses/cse546/18au/
https://courses.cs.washington.edu/courses/cse546/16au/
https://courses.cs.washington.edu/courses/cse546/14au/
http://cs229.stanford.edu/syllabus.html
https://ml.berkeley.edu/blog/2017/07/13/tutorial-4/

Page 5:

Logistics Review

Review

This is where, each day, we will review the previous lecture's material.

Page 6:

Logistics Review

Some readings

Matrix cookbook: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf

Linear algebra notes: http://cs229.stanford.edu/section/cs229-linalg.pdf

Writeup on Overfitting and Underfitting on our web page (https://canvas.uw.edu/courses/1372141); see in particular https://canvas.uw.edu/courses/1372141/discussion_topics/5384617

Page 7:

Logistics Review

Class (and Machine Learning) overview

1. Introduction
• What is ML
• What is AI
• Why are we so interested in these topics right now?

2. ML Paradigms/Concepts
• Overfitting/Underfitting, model complexity, bias/variance
• size of data, big data, sample complexity
• ERM, loss + regularization, loss functions, regularizers
• supervised, unsupervised, and semi-supervised learning
• reinforcement learning (RL), multi-agent, planning/control
• transfer and multi-task learning
• federated and distributed learning
• active learning, machine teaching
• self-supervised, zero/one-shot, open-set learning

3. Dealing with Features
• dimensionality reduction, PCA, LDA, MDS, t-SNE, UMAP
• locality sensitive hashing (LSH)
• feature selection
• feature engineering
• matrix factorization & feature engineering
• representation learning

4. Evaluation
• accuracy/error, precision/recall, ROC, likelihood/posterior, cost/utility, margin
• train/eval/test data splits
• n-fold cross validation
• method of the bootstrap

6. Inference Methods
• probabilistic inference
• MLE, MAP
• belief propagation
• forward/backpropagation
• Monte Carlo methods

7. Models & Representation
• linear least squares, linear regression, logistic regression, sparsity, ridge, lasso
• generative vs. discriminative models
• Naive Bayes
• k-nearest neighbors
• clustering, k-means, k-medoids, EM & GMMs, single linkage
• decision trees and random forests
• support vector machines, kernel methods, max margin
• perceptron, neural networks, DNNs
• Gaussian processes
• Bayesian nonparametric methods
• ensemble methods
• the bootstrap, bagging, and boosting
• graphical models
• time-series, HMMs, DBNs, RNNs, LSTMs, Attention, Transformers
• structured prediction
• grammars (as in NLP)

12. Other Techniques
• compressed sensing
• submodularity, diversity/homogeneity modeling

8. Philosophy, Humanity, Spirituality
• artificial intelligence (AI)
• artificial general intelligence (AGI)
• artificial intelligence vs. science fiction

9. Applications
• computational biology
• social networks
• computer vision
• speech recognition
• natural language processing
• information retrieval
• collaborative filtering/matrix factorization

10. Programming
• python
• libraries (e.g., NumPy, SciPy, matplotlib, scikit-learn (sklearn), pytorch, CNTK, Theano, tensorflow, keras, H2O, etc.)
• HPC: C/C++, CUDA, vector processing

11. Background
• linear algebra
• multivariate calculus
• probability theory and statistics
• information theory
• mathematical (e.g., convex) optimization

5. Optimization Methods
• Unconstrained Continuous Optimization: (stochastic) gradient descent (SGD), adaptive learning rates, conjugate gradient, 2nd-order Newton
• Constrained Continuous Optimization: Frank-Wolfe (conditional gradient descent), projected gradient, linear, quadratic, and convex programming
• Discrete optimization: greedy, beam search, branch-and-bound, submodular optimization

[Slide graphics: a preview of variable elimination on a six-node graphical model, showing p(x1, x2) = Σ_{x3} Σ_{x4} · · · Σ_{x6} p(x1, x2, . . . , x6) computed under two different elimination orders, with the reconstituted graphs and the per-variable elimination complexities ranging from O(r^2) to O(r^4); and a deep network diagram with an input layer, hidden layers 1-7, and an output unit.]

Page 8:

Logistics Review

Strategy

Strategy for the next period of time:

For some topic in (2):
    for subtopic in subset of (6) relevant to topic.

Page 9:

Logistics Review

Traditional Computer Programming vs. ML

Let us change our traditional attitude to the construction of programs. Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do. -- Donald Knuth

[Figure: a human programming a computer writes an algorithm/computer program; a computer runs it and produces something seemingly useful.]

Page 10:

Logistics Review

Traditional Computer Programming vs. ML

Machine Learning is the art of repeatedly telling a computer what one wants the computer to tell a second computer about a lot of data. This continues until the second computer gets it right.

[Figure: a human programming a computer writes an algorithm/computer program which, given training data, produces a second algorithm/computer program; a computer runs that second program on test data and produces something seemingly intelligent.]

Page 11:

Logistics Review

Traditional Computer Programming vs. ML

https://imarticus.org/what-is-machine-learning-and-does-it-matter/

other defs of ML: https://www.kdnuggets.com/2018/12/essence-machine-learning.html

Page 12:

Logistics Review

Probability and Uncertainty

Key point: the world is a complicated place, we cannot know everything, and even of what we think we know we cannot (nor should we) always be certain. Uncertainty abounds! Need a representation of uncertainty.

Probability has a precise mathematical definition (Kolmogorov axioms), but we use it in deference to the inevitable uncertainty surrounding all decisions.

Simple and subjective working definition:

probability = (number of cases something happened) / (number of total cases)   (1.2)

Good for repeatable, measurable events (e.g., coin flips, dice, etc.). Harder for future events (probability it will rain tomorrow, probability Manchester City beats Liverpool, etc.).

Despite shortcomings, used as a representation of uncertainty/certainty (e.g., the probability that image x contains the face of person y).

Machine learning often strives for the "best" probabilities in data using learning algorithms.

Page 13:

Logistics Review

Coin Flipping and ML

D = {b1, b2, . . . , bn} is a series of n independent and identical coin flips, bi ∈ {H, T}.

k = |{i : bi = H}| is the count of the number of heads in D.

How true, or likely, is it that θ is the probability of heads?

Pr(D | θ) = θ^k (1 − θ)^(n−k) = Likelihood of D given θ   (1.2)

How to find the most likely explanation of D? Maximum likelihood:

θ_MLE = argmax_{θ ∈ [0,1]} Pr(D | θ) = argmax_{θ ∈ [0,1]} log Pr(D | θ)   (1.3)

How to find θ_MLE? Calculus: setting ∂/∂θ log Pr(D | θ) = 0 leads to

θ_MLE = k/n   (1.4)

Thus, computing k and dividing by n is a simple way to learn!
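As a quick illustration (my own sketch, not from the slides), here is the coin-flip MLE in NumPy: simulate n flips under an assumed true heads probability, compute θ_MLE = k/n from equation (1.4), and check it against a grid search over the log-likelihood.

import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.7                      # assumed "true" heads probability for the simulation
n = 1000                              # number of flips
flips = rng.random(n) < theta_true    # True = heads, False = tails

k = int(flips.sum())                  # number of heads in D
theta_mle = k / n                     # closed-form MLE, equation (1.4)

# Sanity check: maximize the log-likelihood k*log(theta) + (n-k)*log(1-theta) on a grid.
thetas = np.linspace(1e-3, 1 - 1e-3, 999)
loglik = k * np.log(thetas) + (n - k) * np.log(1 - thetas)
print(theta_mle, thetas[np.argmax(loglik)])    # the two estimates agree closely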

Page 14:

Logistics Review

Learning Gaussians

Given the data sample D without knowing µ, C, how likely is the sample under some hypothesized parameters µ, C?

log Pr(D | µ, C) = Σ_{i=1}^n log Pr(x_i | µ, C)   (1.3)
                 ≜ log Likelihood of D given µ, C   (1.4)

How to find the most likely explanation of D? Maximum likelihood:

[µ_MLE, C_MLE] = argmax_{µ ∈ R^m, C ∈ P(m)} log Pr(D | µ, C)   (1.5)

How to find the MLE quantities? Again calculus: setting ∂/∂µ log Pr(D | µ, C) = 0 and ∂/∂C log Pr(D | µ, C) = 0 leads to

µ_MLE = (1/n) Σ_{i=1}^n x_i   and   C_MLE = (1/n) Σ_{i=1}^n (x_i − µ_MLE)(x_i − µ_MLE)^T   (1.6)
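A minimal NumPy sketch (my own illustration) of equation (1.6): given an n × m matrix whose rows are the samples, compute the MLE mean and the MLE covariance (note the 1/n normalization, not the unbiased 1/(n−1)).

import numpy as np

rng = np.random.default_rng(0)
mu_true = np.array([1.0, -2.0])
C_true = np.array([[2.0, 0.5],
                   [0.5, 1.0]])
X = rng.multivariate_normal(mu_true, C_true, size=5000)   # n x m sample matrix

n = X.shape[0]
mu_mle = X.mean(axis=0)                   # (1/n) sum_i x_i
centered = X - mu_mle
C_mle = centered.T @ centered / n         # (1/n) sum_i (x_i - mu)(x_i - mu)^T

print(mu_mle)
print(C_mle)                              # np.cov(X.T, bias=True) gives the same estimate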

Page 15:

Logistics Review

Associations and Associative Memories

Associative memory, auto-associative memory, or hetero-associative memory. In general, associate x ∈ X to y ∈ Y via h : X → Y.

Examples: memory subsystem (separate address for each x ∈ X), data structures like hash tables, or red-black trees, etc.

Often X, Y are very large, and we have only a sample of associations D = {(xi, yi)}_{i=1}^n where xi ∈ X, yi ∈ Y, and n ≪ |X|.

We want to build an associative memory that works even outside of D. That is, even for x ∉ {x : x = xi for some i ∈ [n], (xi, yi) ∈ D}.

Why? D might not be complete; variation, noise, or possible data corruption may not be fully captured in D. Also, X might be infinitely large.

Page 16:

Logistics Review

Associations and Associative Memories

Machine learning: write an algorithm that, given a large enough D, produces a program h that generalizes (works) well on unseen samples. Respond reasonably to variation, noise, and data corruption (be robust). Do this computationally as efficiently as possible, and (ideally) understand it mathematically.

Boils down to finding a good h : X → Y that can do the mapping (association). Sometimes we choose some h ∈ H where H is a large collection of possible associators. More frequently, h is parameterized via some parameters θ, and we find a good θ leading to h_θ.

Many ways to do this, depending on the nature of X, Y, how big the data is (number of samples n), and available resources (compute, core machine memory/RAM, storage/disk, communication (latency/bandwidth), time, money, energy usage).

Often, x ∈ R^m is an m-dimensional vector of features. In general, x is known as a feature vector.

Page 17:

Logistics Review

Statistical parameter estimation

Training data D = {(x^(i), y^(i))}_{i=1}^n where (x^(i), y^(i)) ∼ p(x, y) are drawn from some distribution, x^(i) ∈ R^m and y^(i) ∈ R.

x^(i) is an m-dimensional column vector of features, y^(i) is a scalar. Goal: find h_θ : X → Y with minimum error, where

Error_i = e_i = h_θ(x^(i)) − y^(i)   (1.3)

E[e^2] = E_{p(x,y)}[(h_θ(x) − y)^2] = ∫ p(x, y)(h_θ(x) − y)^2 dx dy   (1.4)
       = ∫ p(x) ∫ (h_θ(x) − y)^2 p(y|x) dy dx   (1.5)

and θ ∈ R^m is a parameter vector, θ = (θ1, θ2, . . . , θm), θ_i ∈ R. Taking derivatives and setting to zero, we get the best solution:

h_θ(x) = ∫ y p(y|x) dy = E[Y | x] = best association.   (1.6)

This assumes we have the distribution p and also the resources to compute E[Y | x].

Page 18:

Logistics Review

Linear estimator: Objective Optimization

Recall, h_θ(x) ≜ θ^T x is parameterized by parameters θ, so

J(θ) = (1/n) Σ_{i=1}^n (h_θ(x^(i)) − y^(i))^2   (1.3)

Taking the derivative of the error objective J(θ) w.r.t. θ and setting it to zero gives:

∂J/∂θ = (2/n) Σ_{i=1}^n (h_θ(x^(i)) − y^(i)) ∂h_θ(x^(i))/∂θ = 0   (1.4)

The linear assumption h_θ(x) = x^T θ yields ∂h_θ(x^(i))/∂θ = x^(i).

Page 19:

Logistics Review

Linear Least Squares

This gives the objective to be minimized (smallest, or least, of the sum of squares of the errors):

∂J(θ)/∂θ = (2/n) Σ_{i=1}^n (x^(i)T θ − y^(i)) x^(i) = 0   (1.3)

We simplify this a bit by defining matrices associated with these quantities. First define the n × m design matrix X, whose i-th row is x^(i)T, and the length-n column vector ~y:

X = [x^(1)T; x^(2)T; . . . ; x^(n)T]   and   ~y = (y1, y2, . . . , yn)^T   (1.4)

The objective J(θ) from the previous slide, in equivalent matrix-vector form:

J(θ) = (1/2)(Xθ − ~y)^T (Xθ − ~y)   (1.5)
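A small NumPy sketch (mine, with synthetic data) of the matrix-vector form: stack the samples as rows of the design matrix X, stack the targets into the vector y, and evaluate J(θ) = (1/2)(Xθ − y)^T (Xθ − y).

import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 3
X = rng.normal(size=(n, m))                      # design matrix: row i is x^(i) transposed
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=n)    # noisy targets

def J(theta, X, y):
    # Matrix-vector form of the least-squares objective: 1/2 ||X theta - y||^2
    r = X @ theta - y
    return 0.5 * r @ r

print(J(theta_true, X, y))    # small, since theta_true generated the data
print(J(np.zeros(m), X, y))   # much larger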

Page 20:

Logistics Review

Normal Equations

With this, we get the "normal equations":

∇_θ J(θ) = X^T (Xθ − ~y) = 0   (1.3)

i.e., modeling ~y to be in the column space of matrix X (linear combinations of columns of X), when ~y is being approximated by Xθ.

Called the normal equations because the column space of X is orthogonal to the residual error E = (~y − Xθ̂), giving the solution θ = θ̂ as shown.

[Figure: ~y (what is to be approximated) is projected onto the space of possible approximations {y : y = Xθ, θ ∈ R^m}, the column space of X; Xθ̂ is the actual approximation, the closest point to ~y, and ~y − Xθ̂ is the error.]

If X^T X is invertible (typical if n ≥ m), the solution has the form:

θ̂ = (X^T X)^(−1) X^T ~y

where (X^T X)^(−1) X^T is known as the Moore-Penrose pseudo-inverse of matrix X.
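A quick sketch (mine, on synthetic data) of solving the normal equations; in practice np.linalg.lstsq (which uses an SVD) is preferred over explicitly forming (X^T X)^(-1).

import numpy as np

rng = np.random.default_rng(1)
n, m = 200, 4
X = rng.normal(size=(n, m))
theta_true = rng.normal(size=m)
y = X @ theta_true + 0.05 * rng.normal(size=n)

# Normal-equations solution: theta = (X^T X)^{-1} X^T y (solve, rather than invert)
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Preferred in practice: a least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_ne, theta_lstsq))   # True up to numerical error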

Page 21:

Logistics Review

Gradient Descent, Batch Gradient Descent

Gradient updates for all elements of θ at the same time, for a sample pair (x^(i), y^(i)):

θ ← θ + α (y^(i) − h_θ(x^(i))) x^(i) = θ + α (y^(i) − θ^T x^(i)) x^(i)   (1.9)

Move θ in the direction of x^(i), weighted by α (y^(i) − h_θ(x^(i))) ∈ R, i.e., α times the error.

Called the LMS (least mean squares) update rule, also called the Widrow-Hoff (early NN folks) learning rule.

Batch Gradient Descent uses J(θ) = (1/n) Σ_{i=1}^n (h_θ(x^(i)) − y^(i))^2, and since the gradient is a linear operator, this yields the following:

Algorithm 2: Batch Gradient Descent learning
Input: Training data D, learning rate α, initial parameter estimate θ
Output: Learnt model parameters θ
1 for t = 1, . . . , T do
2     θ ← θ + α Σ_{i=1}^n (y^(i) − h_θ(x^(i))) x^(i)
Return: the final parameters θ
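A minimal NumPy sketch of Algorithm 2 on synthetic data (the learning rate and iteration count below are arbitrary choices of mine, not values from the course).

import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 3
X = rng.normal(size=(n, m))
theta_true = np.array([1.5, -0.7, 2.0])
y = X @ theta_true + 0.1 * rng.normal(size=n)

def batch_gradient_descent(X, y, alpha=0.002, T=500):
    # Algorithm 2: theta <- theta + alpha * sum_i (y^(i) - x^(i)^T theta) x^(i)
    theta = np.zeros(X.shape[1])          # initial parameter estimate
    for _ in range(T):
        errors = y - X @ theta            # all per-sample errors at once
        theta = theta + alpha * (X.T @ errors)
    return theta

print(batch_gradient_descent(X, y))       # close to theta_true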

Page 22:

Logistics Review

More visualization: Batch Gradient Descent

Page 23:

Logistics Review

Incremental and Stochastic Gradient Descent

Algorithm 3: Incremental Gradient Descent (IGD) learning
Input: Training data D, learning rate α, initial parameter estimate θ
Output: Learnt model parameters θ
1 for t = 1, . . . , T do
2     for i = 1, . . . , n do
3         θ ← θ + α (y^(i) − h_θ(x^(i))) x^(i)
Return: the final parameters θ

Optimization folks (e.g., Bertsekas) call this the incremental gradient method. It is called Stochastic Gradient Descent (SGD) if we randomize (with or without replacement) the order of the data items.
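A matching sketch of Algorithm 3 with a shuffled pass over the data each epoch, i.e., SGD without replacement (again, the learning rate and epoch count are my arbitrary choices).

import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 3
X = rng.normal(size=(n, m))
theta_true = np.array([1.5, -0.7, 2.0])
y = X @ theta_true + 0.1 * rng.normal(size=n)

def sgd(X, y, alpha=0.01, epochs=20):
    # Algorithm 3 with a randomized data order each epoch ("stochastic" gradient descent)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(X.shape[0]):
            error = y[i] - X[i] @ theta            # scalar error on one sample
            theta = theta + alpha * error * X[i]   # single-sample update
    return theta

print(sgd(X, y))    # close to theta_true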

Page 24:

Logistics Review

More visualization: Stochastic Gradient Descent

Page 25:

On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance

Underfitting vs. Overfitting

[Figure: three plots of the same xy data (x from 0 to 7, y from 0 to 4.5), each overlaid with a fitted curve: left a linear fit, middle a quadratic fit, right a degree-5 polynomial fit.]

Fit a model with various input features (values of powers of x); the goal is to predict y based on xy-pair samples D = {(x^(i), y^(i))}_i.

Fit models: left y = θ0 + θ1 x; middle y = θ0 + θ1 x + θ2 x^2; right y = Σ_{j=0}^5 θ_j x^j.

Both the left and right plots poorly fit the data, but they are poor for different reasons. The left could be underfitting, and the right could be overfitting. The center plot looks better.
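A sketch of the three fits using numpy.polyfit on synthetic data (the quadratic-plus-noise data below is my own stand-in for the slide's dataset); training error alone keeps improving with degree, which is exactly why it cannot detect overfitting.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.3, 6.7, 12)
y = 0.5 + 1.2 * x - 0.12 * x**2 + 0.25 * rng.normal(size=x.size)   # stand-in data

for degree in (1, 2, 5):
    coeffs = np.polyfit(x, y, deg=degree)     # least-squares fit of a degree-d polynomial
    y_hat = np.polyval(coeffs, x)
    print(degree, np.mean((y_hat - y) ** 2))  # training MSE shrinks as degree grows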


Page 30:

On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance

Overfitting definition (T. Mitchell)

We say that a hypothesis overfits the training examples if some other hypothesis that fits the training examples less well actually performs better over the entire distribution of instances (i.e., including instances beyond the training set).

Definition 2.3.1 (overfitting)

Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h′ ∈ H, such that h has smaller error than h′ over the training examples, but h′ has a smaller overall error than h over the entire distribution (or data set) of instances.

We'll visit this topic again when we discuss bias/variance, but first let's discuss a few more models.


Page 33:

On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance

Linear Regression

Linear regression involves fitting a model of the form y = h_θ(x) = Σ_i θ_i x_i, where x_i is the i-th input feature and θ_i is the i-th parameter.

The model is linear in the parameters, h_{αθ + α′θ′}(x) = α h_θ(x) + α′ h_{θ′}(x), that we "regress" to.

Reasonable starting model for when y ∈ R.


Page 36:

On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance

Logistic Regression

What if y is a category? E.g., y ∈ {0, 1, 2, . . . , ℓ − 1} for ℓ categories.

Simplest case is y ∈ {0, 1}, a binary label (more generally, y could be an integer, or sometimes ∈ {−1, +1}, or a vector of integers).

Categorical prediction, or binary classification: predict which class (zero or one) data object x is in (e.g., buy/sell, good/defective, fresh/stale, fraud/genuine, spam/not-spam, fake news/fake-fake news, etc.).

A linear model can still be used but might not be ideal; we do not need values outside the range [0, 1].

With probabilities, binary prediction can still be presented with a representation of uncertainty: rather than a crisp 0/1 decision, we can look at p(y = 0|x) vs. p(y = 1|x).

Most uncertain about x if p(y = 0|x) = p(y = 1|x) = 0.5. Most certain about x if p(y = 1|x) = 0 or p(y = 1|x) = 1.

Correctness is not the same as certainty!!!

It is harder for a linear model to give such an interpretation since linear model output has no bound, in general.
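As a preview sketch of my own (the logistic model itself is presumably developed on later slides), the standard way to get a bounded, probability-like output from a linear score is the logistic (sigmoid) function σ(z) = 1/(1 + e^(−z)), so that p(y = 1 | x) = σ(θ^T x) always lies in (0, 1).

import numpy as np

def sigmoid(z):
    # Logistic function: maps any real-valued score into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.8, -1.5])    # hypothetical parameters
x = np.array([2.0, 1.0])         # hypothetical feature vector

score = theta @ x                # unbounded linear score theta^T x
p1 = sigmoid(score)              # bounded, interpretable as p(y = 1 | x)
print(score, p1, 1 - p1)         # p(y=1|x) and p(y=0|x) sum to 1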


Page 44:

On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance

Making a decision

Given a probability model p(y|x), how do we make a final decision?

Let y ∈ {0, 1} be the true label and ŷ ∈ {0, 1} be a prediction.

Decide true if p(y = 1|x) ≥ τ, where τ ∈ [0, 1] is a decision threshold, i.e., ŷ = 1{p(y=1|x) ≥ τ}. A natural value is τ = 0.5, but other values are also not unreasonable.

Given a validation data set D_va = {(x^(i), y^(i))}_{i=1}^n on which a classifier produces predictions {ŷ^(i)}_{i=1}^n, we can compute the following quantities:

True positives   TP = Σ_{i=1}^n 1{ŷ^(i) = 1 ∧ y^(i) = 1}
True negatives   TN = Σ_{i=1}^n 1{ŷ^(i) = 0 ∧ y^(i) = 0}
False positives  FP = Σ_{i=1}^n 1{ŷ^(i) = 1 ∧ y^(i) = 0}
False negatives  FN = Σ_{i=1}^n 1{ŷ^(i) = 0 ∧ y^(i) = 1}

Note that the number of samples n = TP + TN + FP + FN.
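A small NumPy sketch (mine, on a toy validation set) of thresholding predicted probabilities at τ and counting TP, TN, FP, and FN.

import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                  # toy validation labels
p1 = np.array([0.9, 0.4, 0.35, 0.8, 0.1, 0.7, 0.6, 0.2])     # model's p(y=1|x) per sample

tau = 0.5
y_hat = (p1 >= tau).astype(int)                              # y_hat = 1{p(y=1|x) >= tau}

TP = int(np.sum((y_hat == 1) & (y_true == 1)))
TN = int(np.sum((y_hat == 0) & (y_true == 0)))
FP = int(np.sum((y_hat == 1) & (y_true == 0)))
FN = int(np.sum((y_hat == 0) & (y_true == 1)))

print(TP, TN, FP, FN, TP + TN + FP + FN == y_true.size)      # the counts partition the n samples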


Page 53:

On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance

Accuracy and Error

Overall accuracy:

Accuracy = (Total correct predictions) / (Total predictions) = (TP + TN) / (TP + TN + FP + FN)   (2.1)

Error:

Error = 1.0 − Accuracy = (FP + FN) / (TP + TN + FP + FN)   (2.2)
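Continuing the toy example above, equations (2.1) and (2.2) as a small helper of my own (the counts passed in are hypothetical):

def accuracy_error(TP, TN, FP, FN):
    # Accuracy = (TP + TN) / n and Error = 1 - Accuracy, with n = TP + TN + FP + FN
    n = TP + TN + FP + FN
    acc = (TP + TN) / n
    return acc, 1.0 - acc

print(accuracy_error(TP=3, TN=3, FP=1, FN=1))   # (0.75, 0.25)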


Page 55:

On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance

Binary Confusion Matrix

Given n samples in a validation data set, we can plot the relationship between TP, TN, FP, FN.

[Confusion-matrix figure: rows are the true label, columns are the predicted label. True-positive row: TP (predicted positive) and FN (predicted negative), with row total the number of positive samples. True-negative row: FP (predicted positive) and TN (predicted negative), with row total the number of negative samples. Column totals are the number of samples predicted to be positive and the number predicted to be negative.]

Page 56:

On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance

False Positive and False Negative

https://www.kdnuggets.com/2020/01/guide-precision-recall-confusion-matrix.html

Page 57:

On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance

Precision, Recall, and F-Measure

Precision:

P = TP / (TP + FP) = TP / (number of predicted positives)   (2.3)

Recall:

R = TP / (TP + FN) = TP / (number of positives)   (2.4)

F-measure (or F1-score), the harmonic mean of precision and recall:

F-measure = 2 / (1/Precision + 1/Recall) = 2 (Precision × Recall) / (Precision + Recall)   (2.5)
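Equations (2.3)-(2.5) translated directly into a small helper (mine), reusing confusion counts like those computed earlier:

def precision_recall_f1(TP, FP, FN):
    # Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = harmonic mean of the two
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f1(TP=3, FP=1, FN=1))   # hypothetical counts -> (0.75, 0.75, 0.75)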


Page 60: Advanced Introduction to Machine Learning · 2020. 9. 27. · ¥ python ¥ libraries (e.g., NumPy, SciPy, matplotlib, scikit-learn (sklearn), pytorch, CNTK, Theano, tensor"ow, keras,

On Fitting (preview) Classification, Logistic Regression Fit Complexity, Bias/Variance

ROC and AUC

Recall that we predict $\hat{y} = 1\{p(y=1|x) \ge \tau\}$; but how do TP and FP change as we vary the decision threshold $\tau$?

The Receiver Operating Characteristic (ROC) curve is determined by sweeping $\tau$.

The area under the curve (AUC) gives an overall measure of how well the model is doing over all $\tau$. Higher AUC is better.

https://en.wikipedia.org/wiki/Receiver_operating_characteristic
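A rough sketch (my own, on hypothetical labels and scores) of how sweeping $\tau$ traces out the ROC curve, with AUC approximated by the trapezoid rule:

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                      # hypothetical labels
p_hat  = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])    # hypothetical Pr(y=1|x) scores

tpr, fpr = [], []
for tau in np.linspace(1.0, 0.0, 101):          # sweep the threshold from 1 down to 0
    y_pred = (p_hat >= tau).astype(int)         # predict 1 iff Pr(y=1|x) >= tau
    TP = np.sum((y_pred == 1) & (y_true == 1))
    FP = np.sum((y_pred == 1) & (y_true == 0))
    FN = np.sum((y_pred == 0) & (y_true == 1))
    TN = np.sum((y_pred == 0) & (y_true == 0))
    tpr.append(TP / (TP + FN))                  # true positive rate (= recall)
    fpr.append(FP / (FP + TN))                  # false positive rate

fpr, tpr = np.array(fpr), np.array(tpr)
auc = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0)   # trapezoid rule
print(f"AUC = {auc:.3f}")
```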


Logistic Regression

Solution for binary classification: use a logistic function, as in

$\Pr(y = 1 \mid x) = h_\theta(x) = g(\theta^\top x) = \dfrac{1}{1 + \exp(-\theta^\top x)}$   (2.6)

where $\theta^\top x = \theta_0 + \sum_{i=1}^{m} \theta_i x_i$, with $x_0 \equiv 1$ so that $\theta_0$ is the bias/shift.

$g(z) = 1/(1 + e^{-z})$ is known as a logistic function.

(Figure: plot of the logistic function $g(z) = 1/(1 + e^{-z})$ over $z \in [-4, 4]$.)

A logistic function is one type of sigmoid function, others being the hyperbolic tangent, arctan, error function, etc. (see https://en.wikipedia.org/wiki/Sigmoid_function).
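A minimal sketch of the model in Eq. (2.6); the parameter and feature values below are made up for illustration:

```python
import numpy as np

def g(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """h_theta(x) = g(theta^T x); x is assumed to already contain x_0 = 1."""
    return g(theta @ x)

theta = np.array([-1.0, 2.0, 0.5])   # hypothetical [theta_0, theta_1, theta_2]
x = np.array([1.0, 0.3, -0.8])       # x_0 = 1 prepended to two features
print(h(theta, x))                    # Pr(y = 1 | x) under this theta
```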


Logistic with scale parameter β

Approximate a step function with a scale parameter $\beta \in \mathbb{R}_+$, giving $g_\beta(z) = 1/(1 + e^{-\beta z})$.

(Figure: plots of $g_\beta(z)$ over $z \in [-4, 4]$ for $\beta \in \{0.5, 1.0, 2.0, 5.0, 10.0\}$; larger $\beta$ gives a sharper, more step-like transition.)


Gradients of Logistic

A logistic function's gradient is easy to compute. This follows since:

$g'(z) = \dfrac{d}{dz} \dfrac{1}{1 + e^{-z}}$   (2.7)

$\;\;\; = -\dfrac{1}{(1 + e^{-z})^2} \left( -e^{-z} \right)$   (2.8)

$\;\;\; = \dfrac{1}{1 + e^{-z}} \cdot \dfrac{e^{-z}}{1 + e^{-z}}$   (2.9)

$\;\;\; = \dfrac{1}{1 + e^{-z}} \left( 1 - \dfrac{1}{1 + e^{-z}} \right)$   (2.10)

$\;\;\; = g(z)\,(1 - g(z))$   (2.11)

Given this, we can derive a gradient descent learning rule, similar to LMS, but for logistic regression.
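As a quick numerical sanity check of Eq. (2.11) (my own sketch, using a centered finite difference):

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
eps = 1e-6
numeric = (g(z + eps) - g(z - eps)) / (2 * eps)   # centered finite-difference derivative
analytic = g(z) * (1 - g(z))                      # Eq. (2.11)
print(np.max(np.abs(numeric - analytic)))         # agreement to ~1e-10 or better
```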


Fitting Logistic Regression Using Gradient Descent

Training data $D = \{(x^{(i)}, y^{(i)})\}_{i \in [n]}$; now $y^{(i)} \in \{0, 1\}$ is a binary label.

Goal: formulate the likelihood (to maximize) in terms of the parameters $\theta$.

Probability model: $\Pr(y = 1 \mid x; \theta) = h_\theta(x) = g(\theta^\top x)$ and $\Pr(y = 0 \mid x; \theta) = 1 - h_\theta(x)$, thus for $y \in \{0, 1\}$,

$\Pr(y \mid x; \theta) = (h_\theta(x))^{y} (1 - h_\theta(x))^{1 - y}$   (2.12)

Likelihood $L(\theta) = \prod_i \Pr(y^{(i)} \mid x^{(i)}; \theta)$ and log likelihood

$\ell(\theta) = \sum_{i=1}^{n} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$   (2.13)

The negative log likelihood (or $\text{cost}(\theta) = -\ell(\theta)$) is convex in $\theta$.
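A sketch of the cost $-\ell(\theta)$ from Eq. (2.13); the tiny design matrix is hypothetical, and the clipping is only a numerical convenience:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(theta, X, y):
    """cost(theta) = -ell(theta) for logistic regression, per Eq. (2.13).

    X: (n, m+1) design matrix whose first column is all ones; y: (n,) array in {0, 1}.
    """
    p = sigmoid(X @ theta)                      # h_theta(x^(i)) for every sample i
    p = np.clip(p, 1e-12, 1 - 1e-12)            # avoid log(0) when predictions saturate
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Tiny hypothetical data set: two features plus the constant x_0 = 1.
X = np.array([[1.0, 0.5, 1.2], [1.0, -1.0, 0.3], [1.0, 2.0, -0.7]])
y = np.array([1, 0, 1])
print(neg_log_likelihood(np.zeros(3), X, y))    # equals 3*log(2) at theta = 0
```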


Fitting Logistic Regression Using Gradient Descent

Gradient of the log likelihood on one training pair $(x, y)$:

$\dfrac{\partial}{\partial \theta} \ell(\theta) = (y - h_\theta(x))\, x.$   (2.14)

The derivation of this derivative, via the chain rule, uses the logistic derivative property $g'(z) = g(z)(1 - g(z))$.

Gradient descent steps:

$\theta \leftarrow \theta + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x^{(i)}$   (2.15)

Again, the direction is given by $x^{(i)}$, by an amount equal to $\alpha \cdot \text{error} = \alpha (y^{(i)} - h_\theta(x^{(i)}))$.

The error has the same form as in the linear case (the answer $y^{(i)}$ minus the prediction $h_\theta(x^{(i)})$), but the prediction is quite different from before.
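A minimal sketch of stochastic updates following Eq. (2.15) on hypothetical 1-D data; the learning rate, epoch count, and data are arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_sgd(X, y, alpha=0.1, epochs=200, seed=0):
    """Repeated stochastic steps theta <- theta + alpha*(y_i - h_theta(x_i))*x_i (Eq. 2.15)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(X.shape[0]):        # visit samples in random order
            error = y[i] - sigmoid(X[i] @ theta)     # y^(i) - h_theta(x^(i))
            theta += alpha * error * X[i]            # step in the direction of x^(i)
    return theta

# Hypothetical 1-D data, with a column of ones so theta_0 is the bias.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0, 0, 0, 1, 1, 1])
print(fit_logistic_sgd(X, y))
```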


Fit Linear vs. Logistic Comparison

from https://scikit-learn.org/stable/auto_examples/linear_model/plot_logistic.html


Logistic Regression, Linear Separability, and Overfitting

(Figure: logistic regression and linearly separable data. How smooth is the transition in logistic regression?)

Logistic regression negative log likelihood (i.e., cost) $J(\theta)$:

$J(\theta) = -\sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$   (2.16)

to be minimized, where $h_\theta(x) = \dfrac{1}{1 + \exp(-\theta^\top x)}$.

What happens as the cost decreases (likelihood improves)? If $y^{(i)} = 1$, then $h_\theta(x^{(i)}) \to 1$; if $y^{(i)} = 0$, then $h_\theta(x^{(i)}) \to 0$. Hence, $J(\theta) \to 0$. This requires $\theta \to \infty$. Should the transition be allowed to be arbitrarily sudden?
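To see this concretely, here is a small sketch (my own, on hypothetical separable 1-D data) running unregularized batch gradient ascent on $\ell(\theta)$: $\|\theta\|$ keeps growing while $J(\theta)$ keeps shrinking toward 0.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linearly separable 1-D data (first column is x_0 = 1).
x = np.array([-2.0, -1.0, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = np.zeros(2)
alpha = 0.5
for t in range(1, 20001):
    p = sigmoid(X @ theta)
    theta += alpha * X.T @ (y - p)                  # batch gradient ascent on ell(theta)
    if t % 5000 == 0:
        p = np.clip(p, 1e-12, 1 - 1e-12)            # guard the logs below
        J = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
        print(f"iter {t:5d}  ||theta|| = {np.linalg.norm(theta):7.2f}  J(theta) = {J:.6f}")
# On separable data the norm of theta grows without bound as J(theta) -> 0,
# i.e., the predicted transition between the classes becomes arbitrarily sharp.
```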


Overfitting and large magnitude parameters

Therefore, it seems that even large magnitude parameters can lead to a form of overfitting: the $i$ such that $y^{(i)} = 0$ get a perfect 0 prediction, and the $i$ such that $y^{(i)} = 1$ get a perfect 1 prediction, where such certainty is probably not warranted.

Better solution: don't overfit; for points close to the decision boundary, allow a gradual prediction transition between 0 and 1 in the region of uncertainty.

This means putting a restriction on $\theta$ (not letting it get too big).

One possible complexity penalty, the 2-norm $\Omega(\theta) = \|\theta\|_2$, prefers "simple" models, which in this case are those with small coefficients.
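A sketch of adding such a penalty to the cost; here I use the squared 2-norm $\lambda \|\theta\|_2^2$ (the common ridge-style variant of the penalty above) with an arbitrary $\lambda$; with $\lambda > 0$ the minimizer stays finite even on separable data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def penalized_cost_and_grad(theta, X, y, lam):
    """J(theta) + lam * ||theta||^2 and its gradient."""
    p = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    cost = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) + lam * np.dot(theta, theta)
    grad = -X.T @ (y - p) + 2 * lam * theta
    return cost, grad

# Same hypothetical separable data as before.
x = np.array([-2.0, -1.0, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = np.zeros(2)
for _ in range(5000):
    cost, grad = penalized_cost_and_grad(theta, X, y, lam=0.1)
    theta -= 0.1 * grad                      # gradient descent on the penalized cost
print(theta, np.linalg.norm(theta))          # stays bounded, unlike the unregularized fit
```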


Perceptron and Logistic Regression

https://towardsdatascience.com/what-the-hell-is-perceptron-626217814f53

(Figure: a perceptron diagram with inputs and weights $\theta_0, \theta_1, \ldots, \theta_{m-1}, \theta_m$, from the link above.)

Model $h_\theta(x) = g(\theta^\top x)$; the perceptron uses a hard activation function

$g(z) = \begin{cases} -1 & \text{if } z < 0 \\ +1 & \text{if } z \ge 0 \end{cases}$   (2.17)

This leads to the same learning update rule $\theta \leftarrow \theta + \alpha (y^{(i)} - h_\theta(x^{(i)})) x^{(i)}$.
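A minimal perceptron sketch using the hard activation of Eq. (2.17) and the update rule above; the $\pm 1$ labels and the toy data are my own assumptions for the example:

```python
import numpy as np

def hard_g(z):
    """Hard activation of Eq. (2.17): -1 if z < 0, +1 if z >= 0."""
    return np.where(z >= 0, 1.0, -1.0)

def perceptron(X, y, alpha=1.0, epochs=20):
    """Updates theta <- theta + alpha*(y_i - hard_g(theta^T x_i))*x_i over the data."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in range(X.shape[0]):
            theta += alpha * (y[i] - hard_g(X[i] @ theta)) * X[i]   # nonzero only on mistakes
    return theta

# Hypothetical linearly separable data with labels in {-1, +1} and a bias column.
x = np.array([-2.0, -1.0, 1.0, 2.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([-1.0, -1.0, 1.0, 1.0])
theta = perceptron(X, y)
print(theta, hard_g(X @ theta))   # once separated, predictions match y and updates stop
```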


Perceptron, Linear Models, and Linearly Separable Data

(Figure: logistic regression and the perceptron can do perfectly when the data is (nicely) linearly separable.)

The line designates the boundary of a "ridge" or "cliff" between the categories.


Review

The next three slides are review from Lecture 1.

Please read the writeup "Underfitting and Overfitting in Machine Learning", to be posted to canvas.


Underfitting vs. Overfitting

(Figure: three scatter plots of the same $(x, y)$ data, $x \in [0, 7]$, $y \in [0, 4.5]$, each overlaid with one of the fitted curves described below.)

Fit a model with various input features (values of powers of $x$); the goal is to predict $y$ based on $xy$-pair samples $D = \{(x^{(i)}, y^{(i)})\}_i$.

Fit models: left $y = \theta_0 + \theta_1 x$; middle $y = \theta_0 + \theta_1 x + \theta_2 x^2$; right $y = \sum_{j=0}^{5} \theta_j x^j$.

Both the left and right plots poorly fit the data, but they are poor for different reasons. The left could be underfitting, and the right could be overfitting. The center plot looks better.
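A small sketch of fitting the same three model families to synthetic data with numpy.polyfit; the data-generating curve and the noise level are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 7, 12))                         # 12 hypothetical x samples in [0, 7]
y = 1.5 * np.sqrt(x) + rng.normal(0, 0.25, size=x.size)    # made-up smooth trend plus noise

for degree in (1, 2, 5):                                   # the three model families above
    theta = np.polyfit(x, y, deg=degree)                   # least-squares fit of sum_j theta_j x^j
    resid = y - np.polyval(theta, x)
    print(f"degree {degree}: training MSE = {np.mean(resid**2):.4f}")
# Training error can only shrink as the degree grows, but the degree-5 curve can wiggle
# between the samples (overfitting), while degree 1 may be too rigid (underfitting).
```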


Overfitting definition (T. Mitchell)

We say that a hypothesis overfits the training examples if some other hypothesis that fits the training examples less well actually performs better over the entire distribution of instances (i.e., including instances beyond the training set).

Definition 2.5.1 (overfitting)

Given a hypothesis space $H$, a hypothesis $h \in H$ is said to overfit the training data if there exists some alternative hypothesis $h' \in H$ such that $h$ has smaller error than $h'$ over the training examples, but $h'$ has a smaller overall error than $h$ over the entire distribution (or data set) of instances.

We'll visit this topic again when we discuss bias/variance, but first let's discuss a few more models.


Some Definitions

Any data set $D = \left( (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)}) \right)$ is drawn from a given distribution, meaning that $(x^{(j)}, y^{(j)}) \sim p(x, y)$ for all $1 \le j \le n$.

Training a model by maximizing accuracy on a training set $D_{\text{tr}}$:

$h \in \operatorname{argmax}_{h \in H} \sum_{(x, y) \in D_{\text{tr}}} A(y, h(x)) - \lambda \Omega(h)$   (2.18)

(Here $\lambda$ is the accuracy-regularization tradeoff coefficient, a hyperparameter; $A(y, h(x))$ is large if $h(x)$ is a good predictor of $y$ and small if it is a poor predictor.)

Accuracy according to the sample distribution:

$\text{accuracy}(h) = \mathbb{E}_{p(x, y)}[A(y, h(x))] = \int p(x, y)\, A(y, h(x))\, dx\, dy$   (2.19)

Accuracy of a trained model on a data set $D$:

$\text{accuracy}_D(h) = \dfrac{1}{|D|} \sum_{(x, y) \in D} A(y, h(x))$   (2.20)

Training data set $D_{\text{tr}}$ and validation (or development) data set $D_{\text{va}}$.
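A minimal sketch of $\text{accuracy}_D(h)$ for the 0/1 accuracy choice $A(y, h(x)) = 1\{y = h(x)\}$; the validation labels and predictions are hypothetical:

```python
import numpy as np

def zero_one_accuracy(y_true, y_pred):
    """accuracy_D(h) of Eq. (2.20) with A(y, h(x)) = 1{y == h(x)}."""
    return np.mean(y_true == y_pred)

# Hypothetical predictions of some trained h on a validation set D_va.
y_va      = np.array([1, 0, 1, 1, 0, 0])
h_of_x_va = np.array([1, 0, 0, 1, 0, 1])
print(zero_one_accuracy(y_va, h_of_x_va))   # 4/6 ~ 0.667
```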


Overfitting

Definition 2.5.1 (overfitting)

We say that $h \in H$ overfits the training data $D_{\text{tr}}$ if there exists $h' \in H$ such that

$\text{accuracy}_{D_{\text{tr}}}(h) > \text{accuracy}_{D_{\text{tr}}}(h')$ and $\text{accuracy}(h) < \text{accuracy}(h')$.   (2.21)

Since we can't compute $\text{accuracy}(h)$ or $\text{accuracy}(h')$ as mentioned above, a practical definition of overfitting changes this to:

$\text{accuracy}_{D_{\text{tr}}}(h) > \text{accuracy}_{D_{\text{tr}}}(h')$ and $\text{accuracy}_{D_{\text{va}}}(h) < \text{accuracy}_{D_{\text{va}}}(h')$.   (2.22)


Underfitting

Definition 2.5.2 (underfitting)

We say that $h \in H$ underfits the training data $D_{\text{tr}}$ if there exists $h'' \in H$ such that

$\text{accuracy}_{D_{\text{tr}}}(h) < \text{accuracy}_{D_{\text{tr}}}(h'')$ and $\text{accuracy}(h) < \text{accuracy}(h'')$.   (2.23)


Overfitting and Underfitting

(Figure: three panels plotting training set accuracy and validation set accuracy against the complexity (or capability, or capacity) of the model, each divided into an underfitting range, a fitting range, and an overfitting range.)

(a) The highly typical case is when the training set accuracy is higher than the validation set accuracy.

(b) It is possible, but somewhat unlikely, that the validation set accuracy crosses the training set accuracy. The overfitting/underfitting ranges are still the same.

(c) It is possible, but very unlikely, that the validation set accuracy is higher than the training set accuracy, as shown in this plot. But the overfitting/underfitting ranges are still the same.

Figure: Overfitting and underfitting shown as a function of model complexity for a fixed training set size $n_{\text{tr}}$. Any $h$ in the red region overfits the training set. Any $h$ in the yellow region underfits the training set. Any $h$ in the green (middle) region properly fits the training set. The regions are all based on the accuracy $\text{accuracy}_{D_{\text{va}}}(h)$ computed on a validation data set $D_{\text{va}}$, but the same principle would be true if it were possible to measure accuracy on the entire distribution, $\text{accuracy}(h)$.


Learning Curves

How does overfitting/underfitting depend on the size of the training data set?

A given model (with fixed complexity) will tend to overfit a small training data set, and underfit a large training data set.

(Figure: two learning curves plotting accuracy against the number of training samples $n_{\text{tr}}$, each showing training set accuracy and validation set accuracy, with high-variance (small $n_{\text{tr}}$) and low-variance (large $n_{\text{tr}}$) regions marked. (a) Learning curve with a low complexity (i.e., $\Omega(h)$ small) model. (b) Learning curve with a high complexity (i.e., $\Omega(h)$ large) model.)

On the left, the model underfits since the model does not have much capability even with much training data. On the right, the model starts out overfitting but eventually fits with enough training data.
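A rough sketch of tracing such a learning curve (my own, using scikit-learn's LogisticRegression on synthetic data); the subset sizes and data-generating process are arbitrary:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 20))
w = rng.normal(size=20)
y = (X @ w + rng.normal(scale=2.0, size=n) > 0).astype(int)   # noisy linear concept

X_tr, y_tr = X[:1500], y[:1500]      # training pool
X_va, y_va = X[1500:], y[1500:]      # validation set

for n_tr in (20, 50, 100, 300, 1000, 1500):
    model = LogisticRegression(max_iter=1000).fit(X_tr[:n_tr], y_tr[:n_tr])
    acc_tr = model.score(X_tr[:n_tr], y_tr[:n_tr])    # training set accuracy
    acc_va = model.score(X_va, y_va)                  # validation set accuracy
    print(f"n_tr={n_tr:5d}  train acc={acc_tr:.3f}  val acc={acc_va:.3f}")
# Typically the training accuracy starts near 100% and falls, the validation accuracy
# rises, and the two curves approach each other as n_tr grows.
```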


Real-world consequences of overfitting

The Fukushima nuclear power plant disaster in 2011 was caused by overfitting a model to data!! See https://ml.berkeley.edu/blog/2017/07/13/tutorial-4/ and https://mpra.ub.uni-muenchen.de/69383/1/MPRA_paper_69383.pdf

Earthquake data going back 400 years, and a model fit to it.

A non-linear model fit predicts a magnitude 9 earthquake every ≈13000 years. This was used to design the Fukushima Daiichi nuclear power plant to withstand an 8.6 magnitude earthquake and a tsunami of 5.7 meters. In 2011: a 9.0 earthquake and a 14 meter tsunami!!


Real-world consequences of overfitting

A log-linear model shows a different trend: a linear model fit predicts a magnitude 9 earthquake about every 300 years, and would have led to quite a different reactor design.

Moral: Overfitting (or underfitting) can have huge consequences!


Overfitting slider on the web

See "Visualization of the bias-variance tradeoff" at the following link, entitled "The Bias-Variance Dilemma": https://medium.com/@ml.at.berkeley/machine-learning-crash-course-part-4-the-bias-variance-dilemma-a94e60ec1d3


The random process of training

Recall training data $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$, where $(x^{(i)}, y^{(i)}) \sim p(x, y)$ are drawn from some distribution, $x^{(i)} \in \mathbb{R}^m$ and $y^{(i)} \in \mathbb{R}$.

The training data is a random sample (i.e., $\Pr(D) = \prod_{i=1}^{n} p(x^{(i)}, y^{(i)})$), and is itself random.

We fit a model $h_\theta$ where $\theta$ is derived from the training procedure.

We can think of $\theta(D)$ as the parameters of the model obtained via the process of model fitting. $\theta(D)$ is a random variable since it is a deterministic function of the random sample $D$ of size $n$.

Each time we draw a different training set, we might get (hopefully only very slightly) different $\theta$.
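A small simulation sketch of this randomness; here I use an ordinary least-squares fit as the training procedure (my own choice, for simplicity), and re-draw the training set many times to see how $\theta(D)$ varies:

```python
import numpy as np

def fit_least_squares(X, y):
    """theta(D): an ordinary least-squares fit, a deterministic function of the data set D."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(0)
true_theta = np.array([1.0, 2.0])
thetas = []
for _ in range(1000):                              # draw many training sets D of size n = 30
    x = rng.uniform(-1, 1, size=30)
    X = np.column_stack([np.ones_like(x), x])
    y = X @ true_theta + rng.normal(scale=0.5, size=30)
    thetas.append(fit_least_squares(X, y))
thetas = np.array(thetas)
print("mean of theta(D):", thetas.mean(axis=0))    # close to the generating parameters
print("std  of theta(D):", thetas.std(axis=0))     # spread across re-drawn training sets
```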


Bias/Variance Intuition (on one slide).

When we are fitting a model $h_\theta$ to data $D$, we optimize over (or can choose) any model in the model family (e.g., the linear model family, the logistic model, whether we include higher order polynomial features, etc.).

High bias, low variance (underfitting): if the model family is too simple (low complexity), it will never match important characteristics in the data. It is, in such a case, biased. Even if we train on multiple different data sets, we'll get approximately the same model $h_\theta$ since the model family is not that capable or flexible. All are wrong in the same way. Thus, the variance of the random variable $h_\theta(D)$ is low!

Low bias, high variance (overfitting): if the model family is too complex, it will match unimportant idiosyncrasies in the data. Bias is low, since it matches the data quite well. When we train on multiple different data sets, each with its own (often random) idiosyncrasies, we'll get very different models $h_\theta$ since the model family can match any idiosyncrasies in the data. Thus, the variance of the random variable $h_\theta(D)$ is high!

Bias/Variance

Recall the lecture 1 goal: find $h_\theta : \mathcal{X} \to \mathcal{Y}$ with minimum error.

The best solution, $\operatorname{argmin}_h \mathbb{E}_{p(x,y)}[(y - h(x))^2]$, leads to
$$h^*(x) = \int y\, p(y|x)\, dy = \mathbb{E}[Y|x] = \text{best association} \qquad (2.24)$$
assuming we have the distribution $p$, the resources to compute $\mathbb{E}[Y|x]$, and that the model family spanned by $h$ includes the functional form of $\mathbb{E}[Y|x]$ (i.e., $\mathbb{E}[Y|x]$ is realizable, as opposed to the agnostic case, which is not realizable). This is the best we can do in theory.

In practice, we have training data $\mathcal{D}$ and a limited model family $\mathcal{H}$ (e.g., linear models), and instead do:
$$h \in \operatorname*{argmin}_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} (y^{(i)} - h(x^{(i)}))^2 \qquad (2.25)$$

Two reasons we might not do well: (a) the wrong model family $\mathcal{H}$, and (b) not enough data ($n$ too small).

Validation and Bias/Variance

How do we measure how good we are? We could consider measuring on the same training data we've got:
$$\frac{1}{n} \sum_{i=1}^{n} (y^{(i)} - h(x^{(i)}))^2 = \frac{1}{|\mathcal{D}|} \sum_{(x,y) \in \mathcal{D}} (y - h(x))^2 \qquad (2.26)$$

This is already, intuitively, bad for the bias/variance reasons we've discussed: each $\mathcal{D}$ gives a different solution, and each $h_{\theta(\mathcal{D})}$ can look good on its own data.

The ideal approach tests on all data (e.g., including future samples we didn't train on),
$$\mathbb{E}_{p(x,y)}[(y - h(x))^2] \qquad (2.27)$$
but this is bad since it is impractical at best and (more likely) impossible.

The typical approach: draw a separate validation data set $\mathcal{D}_{\mathrm{va}}$, with $\mathcal{D}_{\mathrm{va}} \cap \mathcal{D} = \emptyset$, and to try to get at the generalization error, compute:
$$\frac{1}{|\mathcal{D}_{\mathrm{va}}|} \sum_{(x,y) \in \mathcal{D}_{\mathrm{va}}} (y - h(x))^2 \qquad (2.28)$$
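A minimal sketch of (2.26) versus (2.28) (the data-generating process, polynomial degree, and split sizes below are assumptions for illustration): for a flexible model, the training-set estimate is typically optimistic relative to the held-out estimate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, size=200)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=200)

# D is the training set, D_va a disjoint validation set.
x_tr, x_va, y_tr, y_va = train_test_split(x, y, test_size=0.5, random_state=0)

coeffs = np.polyfit(x_tr, y_tr, deg=15)                        # deliberately flexible fit
train_mse = np.mean((y_tr - np.polyval(coeffs, x_tr)) ** 2)    # eq. (2.26)
val_mse = np.mean((y_va - np.polyval(coeffs, x_va)) ** 2)      # eq. (2.28)
print(f"training MSE {train_mse:.3f} vs. validation MSE {val_mse:.3f}")
```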

Bias/Variance, Overall Error

The ideal solution is $h^*(x) = \mathbb{E}[Y|x]$, and the random estimate from the data set is $h_{\mathcal{D}}(x) \triangleq h_{\theta(\mathcal{D})} = \operatorname*{argmin}_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} (y^{(i)} - h(x^{(i)}))^2$.

Let's consider measuring the overall error for any $h$:
$$\begin{aligned}
\mathrm{error}(h) = \mathbb{E}_{p(x,y)}[(h(x) - y)^2] &= \int (h(x) - y)^2 p(y|x) p(x)\, dx\, dy &&(2.29)\\
&= \mathbb{E}_{p(x,y)}[(h(x) - \mathbb{E}[Y|x] + \mathbb{E}[Y|x] - y)^2] &&(2.30)\\
&= \mathbb{E}_{p(x,y)}\Big[(h(x) - \mathbb{E}[Y|x])^2 &&(2.31)\\
&\qquad + 2(h(x) - \mathbb{E}[Y|x])(\mathbb{E}[Y|x] - y) &&(2.32)\\
&\qquad + (\mathbb{E}[Y|x] - y)^2\Big] &&(2.33)\\
&= \mathbb{E}_{p(x,y)}[(h(x) - \mathbb{E}[Y|x])^2] &&(2.34)\\
&\qquad + 2(\mathbb{E}[h(x)] - \mathbb{E}[Y|X])(\mathbb{E}[Y|x] - \mathbb{E}[Y|x]) &&(2.35)\\
&\qquad + \mathbb{E}_{p(x,y)}[(\mathbb{E}[Y|x] - y)^2] &&(2.36)
\end{aligned}$$

Note, the 2nd term cancels out.

2nd Term Cancels Out

Why does the second term cancel out?

$\mathbb{E}[Y|x] = \int y\, p(y|x)\, dy$ = best association, and is a deterministic function of $x$: the expected value of the random variable $Y$ under the conditional distribution $p(y|x)$ for a given $x$. Therefore,
$$\begin{aligned}
&\mathbb{E}_{p(x,y)}\Big[2(h(x) - \mathbb{E}[Y|x])(\mathbb{E}[Y|x] - y)\Big] &&(2.37)\\
&= 2 \int\!\!\int p(x)\, p(y|x) \Big[(h(x) - \mathbb{E}[Y|x])(\mathbb{E}[Y|x] - y)\Big]\, dy\, dx &&(2.38)\\
&= 2 \int p(x)\, (h(x) - \mathbb{E}[Y|x]) \Big[\int p(y|x)(\mathbb{E}[Y|x] - y)\, dy\Big]\, dx &&(2.39)\\
&= 2 \int p(x)\, (h(x) - \mathbb{E}[Y|x]) \Big[\mathbb{E}[Y|x] - \mathbb{E}[Y|x]\Big]\, dx = 0 &&(2.40)
\end{aligned}$$

Aside: $\mathbb{E}[Y|X] = \int y\, p(y|X)\, dy$ is a deterministic function of the r.v. $X$, and hence $\mathbb{E}[Y|X]$ is itself a r.v. with a mean $\mathbb{E}[\mathbb{E}[Y|X]]$. Thus, $\mathbb{E}[\mathbb{E}[Y|X]] = \int \mathbb{E}[Y|x]\, p(x)\, dx = \int \big[\int y\, p(y|x)\, p(x)\, dy\big]\, dx = \int y \big[\int p(y|x)\, p(x)\, dx\big]\, dy = \int y\, p(y)\, dy = \mathbb{E}[Y]$.
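A quick Monte Carlo sanity check (the joint distribution and the fixed predictor $h$ below are arbitrary assumptions, not from the slides): sampling many $(x, y)$ pairs, the cross term (2.37) averages to roughly zero, and $\mathbb{E}[\mathbb{E}[Y|X]]$ matches $\mathbb{E}[Y]$.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 1_000_000

# Assumed joint: X ~ Uniform(0, 1) and Y | X = x ~ Normal(3x, 1), so E[Y|x] = 3x.
x = rng.uniform(0.0, 1.0, size=N)
y = 3.0 * x + rng.normal(size=N)
cond_mean = 3.0 * x                      # E[Y|x]
h = lambda t: 1.0 - 2.0 * t              # an arbitrary fixed predictor h

cross_term = np.mean(2.0 * (h(x) - cond_mean) * (cond_mean - y))
print(f"cross term (2.37) ~ {cross_term:+.4f}  (should be near 0)")
print(f"E[E[Y|X]] ~ {cond_mean.mean():.4f}  vs.  E[Y] ~ {y.mean():.4f}")
```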

Bias/Variance, the r.v. $h_{\mathcal{D}}(x)$

Thus the error has only two terms:
$$\mathrm{error}(h) = \mathbb{E}_{p(x,y)}[(h(x) - \mathbb{E}[Y|x])^2] + \mathbb{E}_{p(x,y)}[(\mathbb{E}[Y|x] - y)^2] \qquad (2.41)$$

The second term has nothing to do with the model $h$ whose error we are measuring; it is the inherent error, due to the random process and label noise. We can neglect this term in our study of bias/variance.

The first term can be simplified to $\mathbb{E}_{p(x)}[(h(x) - \mathbb{E}[Y|x])^2]$ and is the learning error (or MSE). It is zero when $h(x) = h^*(x) = \mathbb{E}[Y|x]$.

$h_{\mathcal{D}}(x)$ is a (random) learnt model from the (random variable) data set $\mathcal{D}$. Hence, $h_{\mathcal{D}}(x)$ is a random variable (a deterministic function, the learning/optimization process, of the random-variable data set $\mathcal{D}$), and it has a mean $\mathbb{E}_{\mathcal{D}}[h_{\mathcal{D}}(x)]$ and a variance $\mathbb{E}_{\mathcal{D}}\big[(h_{\mathcal{D}}(x) - \mathbb{E}_{\mathcal{D}}[h_{\mathcal{D}}(x)])^2\big]$.

To clarify notation, we'll use $\mathcal{D}$ and $\mathcal{D}'$ to express things like $h_{\mathcal{D}}(x)$'s variance $\mathbb{E}_{\mathcal{D}}\big[(h_{\mathcal{D}}(x) - \mathbb{E}_{\mathcal{D}'}[h_{\mathcal{D}'}(x)])^2\big]$, where $\mathcal{D}$ and $\mathcal{D}'$ are two independent and identically distributed (iid) random variables over datasets.

Bias/Variance Breakdown

We further analyze only this first term, and take ensemble averages over $\mathcal{D}$, the random data sample. $h_{\mathcal{D}}(x)$ is a (random) learnt model from the (random variable) data set $\mathcal{D}$. For a given $x$, we have
$$\begin{aligned}
\mathbb{E}_{\mathcal{D}}[(h_{\mathcal{D}}(x) - \mathbb{E}[Y|x])^2] &= \mathbb{E}_{\mathcal{D}}[(h_{\mathcal{D}}(x) - \mathbb{E}_{\mathcal{D}'}[h_{\mathcal{D}'}(x)] + \mathbb{E}_{\mathcal{D}'}[h_{\mathcal{D}'}(x)] - \mathbb{E}[Y|x])^2]\\
&= \mathbb{E}_{\mathcal{D}}\Big[(h_{\mathcal{D}}(x) - \mathbb{E}_{\mathcal{D}'}[h_{\mathcal{D}'}(x)])^2 &&(2.42)\\
&\qquad + 2(h_{\mathcal{D}}(x) - \mathbb{E}_{\mathcal{D}'}[h_{\mathcal{D}'}(x)])(\mathbb{E}_{\mathcal{D}'}[h_{\mathcal{D}'}(x)] - \mathbb{E}[Y|x])\\
&\qquad + (\mathbb{E}_{\mathcal{D}'}[h_{\mathcal{D}'}(x)] - \mathbb{E}[Y|x])^2\Big] &&(2.43)\\
&= \mathbb{E}_{\mathcal{D}}[(h_{\mathcal{D}}(x) - \mathbb{E}_{\mathcal{D}'}[h_{\mathcal{D}'}(x)])^2] &&(2.44)\\
&\qquad + (\mathbb{E}_{\mathcal{D}'}[h_{\mathcal{D}'}(x)] - \mathbb{E}[Y|x])^2 &&(2.45)\\
&= \mathrm{variance}(x) + \mathrm{bias\ squared}(x) &&(2.46)
\end{aligned}$$

We then take $\mathbb{E}_{p(x)}[\cdot]$ to get the overall bias and variance over all $x$.

Error (MSE) and Bias/Variance Breakdown

$$\begin{aligned}
\mathrm{MSE} &= \mathbb{E}_{p(x)}\big[\mathbb{E}_{\mathcal{D}}[(h_{\mathcal{D}}(x) - \mathbb{E}[Y|x])^2]\big] &&(2.47)\\
&= \mathbb{E}_{p(x)}\big[\mathbb{E}_{\mathcal{D}}[(h_{\mathcal{D}}(x) - \mathbb{E}_{\mathcal{D}'}[h_{\mathcal{D}'}(x)])^2]\big] + \mathbb{E}_{p(x)}\big[(\mathbb{E}_{\mathcal{D}'}[h_{\mathcal{D}'}(x)] - \mathbb{E}[Y|x])^2\big]\\
&= \mathrm{variance} + \mathrm{bias\ squared} &&(2.48)
\end{aligned}$$
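This decomposition can be checked by simulation (a sketch under assumed settings: the quadratic stand-in for $\mathbb{E}[Y|x]$, the linear fitted family, and the sample sizes below are all made up): estimate $\mathbb{E}_{\mathcal{D}}[h_{\mathcal{D}}(x)]$ by refitting on many independent training sets, then compare the MSE to bias squared plus variance.

```python
import numpy as np

rng = np.random.default_rng(4)
x_grid = np.linspace(-1.0, 1.0, 41)
f = lambda x: x ** 2                       # stands in for E[Y|x]

def fit_linear(n=25):
    # One draw of D with Gaussian label noise, then a linear least-squares fit.
    x = rng.uniform(-1.0, 1.0, size=n)
    y = f(x) + rng.normal(scale=0.1, size=n)
    return np.polyval(np.polyfit(x, y, deg=1), x_grid)

preds = np.stack([fit_linear() for _ in range(2000)])    # h_D(x) for many D
mean_pred = preds.mean(axis=0)                           # estimate of E_D[h_D(x)]

mse = ((preds - f(x_grid)) ** 2).mean()                  # approx. E_{p(x)} E_D[(h_D(x) - E[Y|x])^2]
variance = preds.var(axis=0).mean()
bias_sq = ((mean_pred - f(x_grid)) ** 2).mean()
print(f"MSE {mse:.4f}  ~=  variance {variance:.4f} + bias^2 {bias_sq:.4f}")
```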

Bias/Variance, Unbiased estimator

To be an unbiased estimation procedure means that $\mathbb{E}_{\mathcal{D}}[h_{\mathcal{D}}(x)] = \mathbb{E}[Y|x]$, or that the bias is zero.

We do sometimes have zero bias and a variance that depends on the size of the data.
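A standard example of this (not from the slide; the constants below are assumptions) is the sample mean of Gaussian data: its bias is zero for every $n$, while its variance $\sigma^2/n$ shrinks as the data set grows.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma = 2.0, 1.0                       # assumed true mean and noise level

for n in (10, 100, 1000):
    # 5000 independent datasets of size n; the estimator is the sample mean.
    means = rng.normal(mu, sigma, size=(5000, n)).mean(axis=1)
    print(f"n={n:5d}: bias ~ {means.mean() - mu:+.4f}, "
          f"variance ~ {means.var():.4f}  (sigma^2/n = {sigma**2 / n:.4f})")
```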

Bias/Variance for Linear Least Squares

Linear model with noise: $y = h_{\theta^*}(x) + \epsilon = \theta^{*\top} x + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Suppose this is the true generative process for some $\theta^*$.

Given training data $\mathcal{D}$ and the corresponding $n \times m$ design matrix $X$ and length-$n$ column vector $\vec{y}$, we have the relationship $\vec{y} = X\theta + \vec{\epsilon}$, where $\vec{\epsilon}$ is a length-$n$ vector of Gaussians, $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.

The MLE parameter estimate is $\theta = (X^\top X)^{-1} X^\top \vec{y} = \theta^* + (X^\top X)^{-1} X^\top \vec{\epsilon}$, a noisy (r.v.) version of $\theta^*$.

The best estimate is $\mathbb{E}[Y|x] = \theta^{*\top} x$.

Recall the error, which has only two terms: the model error (which can be broken into the bias squared and the variance) and the inherent error.
$$\mathrm{error}(h) = \mathbb{E}_{p(x,y)}[(h(x) - \mathbb{E}[Y|x])^2] + \mathbb{E}_{p(x,y)}[(\mathbb{E}[Y|x] - y)^2] \qquad (2.49)$$

In the current case, the inherent error is:
$$\mathbb{E}_{p(x,y)}[(\mathbb{E}[Y|x] - y)^2] = \sigma^2 \qquad (2.50)$$

The random model $h_\theta$ is based on the random sample $\mathcal{D}$, so $\mathbb{E}_{\mathcal{D}}[h_\theta] = \mathbb{E}_{\mathcal{D}}[h_{\theta^* + (X^\top X)^{-1} X^\top \vec{\epsilon}}] = h_{\theta^*}$ since $\mathbb{E}_{\mathcal{D}}[\epsilon] = 0$.

Thus, the bias squared at $x$ is:
$$(\mathbb{E}_{\mathcal{D}}[h_{\mathcal{D}}(x)] - \mathbb{E}[Y|x])^2 = 0 \qquad (2.51)$$

The variance increases with $m$ and decreases with $n$ (the sample size):
$$\mathbb{E}_{\mathcal{D}}[(h_{\mathcal{D}}(x) - \mathbb{E}_{\mathcal{D}'}[h_{\mathcal{D}'}(x)])^2] = \frac{\sigma^2 m}{n} \qquad (2.52)$$

A famous theorem (the Gauss-Markov theorem) states that among all linear unbiased estimators, the linear least squares (LLS) estimator has the smallest variance, and hence the smallest (mean squared) error, of all unbiased linear estimators! I.e., $\mathrm{Var}(h_{\theta_{\mathrm{LLS}}}(x)) \leq \mathrm{Var}(h_{\theta_{\mathrm{any\ unbiased}}}(x))$.
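A sketch that empirically checks (2.51) and (2.52) (the choices of $\theta^*$, $m$, $n$, $\sigma$, and the standard-normal feature distribution below are assumptions): refit the least-squares estimate on many simulated datasets and compare the predictor's bias and variance at test points to $\sigma^2 m / n$.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m, sigma = 200, 5, 0.5                  # assumed sample size, dimension, noise level
theta_star = rng.normal(size=m)            # assumed true parameters
x_test = rng.normal(size=(1000, m))        # test points drawn like the features

preds = []
for _ in range(500):                       # many independent training sets D
    X = rng.normal(size=(n, m))            # design matrix
    y = X @ theta_star + rng.normal(scale=sigma, size=n)
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)    # LLS / MLE estimate
    preds.append(x_test @ theta_hat)       # h_D(x) at the test points
preds = np.stack(preds)                    # shape (500, 1000)

bias_sq = ((preds.mean(axis=0) - x_test @ theta_star) ** 2).mean()
variance = preds.var(axis=0).mean()
print(f"bias^2 ~ {bias_sq:.5f}  (should be ~ 0)")
print(f"variance ~ {variance:.5f}  vs.  sigma^2 m / n = {sigma**2 * m / n:.5f}")
```

Under these assumptions the estimated bias² is essentially zero, and the estimated variance lands close to $\sigma^2 m / n$, shrinking if $n$ is increased or $m$ is decreased.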
