CS229 Midterm Review Part II
Taide Ding
November 1, 2019
Overview
1 Past Midterm Stats
2 Helpful Resources
3 Notation: quick clarifying review
4 Another perspective on bias-variance
5 Common Problem-solving Strategies (with examples)
The Midterms are tough - DON’T PANIC!
[Figure: Fall 16 midterm grade distribution]
Fall 17: µ = 39.5, σ = 14.5
Spring 19: µ = 65.4, σ = 22.4
Helpful Resources
Study guide by past CS229 TA Shervine Amidi (link is on the course syllabus):
https://stanford.edu/~shervine/teaching/cs-229/
Helpful Resources
IMPORTANT: CS229 Linear Algebra and Probability Review handouts
Go over them carefully and in detail.
Any and all of the concepts/tools within are fair game w.r.t. solving midterm problems
TAKE NOTES
Notation: quick clarifying review
{x^{(i)}, y^{(i)}}_{i=1}^{n} denotes a dataset of n examples. For each example i, x^{(i)} ∈ R^d and y^{(i)} ∈ R.
The j-th element (i.e. feature) of the i-th sample is denoted x_j^{(i)}.
X ∈ R^{n×d} is the data matrix and ~y ∈ R^n is the label vector such that:

\[
x^{(i)} = \begin{bmatrix} x^{(i)}_1 \\ \vdots \\ x^{(i)}_d \end{bmatrix}, \quad
X = \begin{bmatrix} x^{(1)T} \\ \vdots \\ x^{(n)T} \end{bmatrix}, \quad
\vec{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}, \quad
X\theta = \begin{bmatrix} \theta^T x^{(1)} \\ \vdots \\ \theta^T x^{(n)} \end{bmatrix}
\]

for parameter vector θ ∈ R^d.
The t-th iteration of θ is denoted θ^{(t)}.
Superscripts: sample index i ∈ [1, n]; iteration index t ∈ [1, T]
Subscripts: feature index j ∈ [1, d]
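As a quick sanity check on these shapes, here is a minimal NumPy sketch (the dataset values are random placeholders, not from the slides):

```python
import numpy as np

n, d = 5, 3  # n examples, d features (arbitrary for illustration)
rng = np.random.default_rng(0)

X = rng.normal(size=(n, d))   # data matrix: row i is x^(i)T
y = rng.normal(size=n)        # label vector ~y
theta = rng.normal(size=d)    # parameter vector

# X @ theta stacks the scalars theta^T x^(i) into a length-n vector
pred = X @ theta
assert pred.shape == (n,)

# entry i equals the dot product of row i with theta
assert np.allclose(pred[0], X[0] @ theta)
```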
Another perspective on bias-variance
F is your model class
f* is the optimal model for the problem (i.e. the true generating distribution)
g is the optimal model in your model class
f̂ is the model you obtain through learning on your dataset
approximation error → bias
  reduce bias by expanding F (e.g. more features, more layers) or moving F closer to the optimal model f* (i.e. choosing a better class)
estimation error → variance
  reduce variance by contracting F (e.g. removing features, regularizing) or making f̂ closer to g (e.g. a better training algorithm, more data)
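A toy numerical sketch of the "expanding F reduces bias" direction (not from the slides; the data and degrees are made up): nested least-squares model classes can only fit the training data better as F grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dataset(n=20):
    # true generating model f* is sin(pi x) plus Gaussian noise
    x = np.linspace(-1, 1, n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y

def train_error(degree, x, y):
    # least-squares fit within the class F of degree-`degree` polynomials
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    return np.mean(resid ** 2)

x, y = sample_dataset()
# Degree-9 polynomials contain degree-1 polynomials as a subclass, so the
# richer class achieves training error at least as low (lower bias) --
# though the extra capacity raises variance on fresh datasets.
assert train_error(9, x, y) <= train_error(1, x, y)
```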
Common Problem-solving Strategies
Take stock of your arsenal:
1 Probability
  Bayes' Rule
  Independence, Conditional Independence
  Chain Rule
  etc.
2 Calculus (e.g. taking gradients)
  Maximum likelihood estimation:
  ℓ(·) = log L(·) = log ∏ p(·) = ∑ log p(·)
  Loss minimization
  etc.
3 Linear Algebra
  PSD, eigendecomposition, projection, Mercer's Theorem, etc.
4 Proof techniques
  construction, contradiction (e.g. counterexample), induction, contrapositive, etc.
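The log-likelihood identity above can be checked numerically. This is a minimal sketch assuming a Gaussian model with known σ = 1 and made-up data; for i.i.d. samples the log of the product of densities equals the sum of log-densities, and the sample mean maximizes it.

```python
import math

data = [1.2, 0.8, 1.5, 0.9, 1.1]  # made-up i.i.d. observations

def gaussian_pdf(x, mu, sigma=1.0):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def log_likelihood(mu):
    # l(mu) = sum_i log p(x_i; mu)
    return sum(math.log(gaussian_pdf(x, mu)) for x in data)

# log prod_i p(x_i) == sum_i log p(x_i)
product = math.prod(gaussian_pdf(x, 1.0) for x in data)
assert math.isclose(math.log(product), log_likelihood(1.0))

# For this model the MLE of mu is the sample mean
mle = sum(data) / len(data)
assert all(log_likelihood(mle) >= log_likelihood(mu) for mu in (0.5, 1.0, 1.5))
```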
Spring 19 Problem 3(a,b) - Exponential Discr. Analysis
Tools used: Probability (Bayes’, Indep, Chain Rule), Calculus (MLE)
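The Bayes' rule step in discriminant-analysis problems like this one can be sketched numerically. The priors and likelihood values below are made up for illustration; the point is the mechanics of p(y = k | x) ∝ p(x | y = k) p(y = k).

```python
# Hypothetical class priors p(y=k) and class-conditional likelihoods
# p(x | y=k) evaluated at some fixed x (values are illustrative only).
prior = {0: 0.6, 1: 0.4}
likelihood = {0: 0.05, 1: 0.20}

# Evidence p(x) = sum_k p(x | y=k) p(y=k), by the chain rule + marginalization
evidence = sum(likelihood[k] * prior[k] for k in prior)

# Posterior for class 1 via Bayes' rule
posterior1 = likelihood[1] * prior[1] / evidence

# Posteriors over all classes sum to 1
assert abs(sum(likelihood[k] * prior[k] / evidence for k in prior) - 1.0) < 1e-12
# Here the class-1 likelihood advantage outweighs its smaller prior
assert posterior1 > 0.5
```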
Spring 19 Problem 5(a) - Kernel Fun
Tools used: Linear Algebra (PSD properties, eigendecomposition), proof by construction
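The "construction" tool pairs naturally with a numerical check: a valid (Mercer) kernel must produce a symmetric PSD kernel matrix on any finite point set. A minimal sketch using the RBF kernel on arbitrary points (points and γ are illustrative choices, not from the exam problem):

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.normal(size=(6, 2))  # arbitrary points in R^2

def rbf(u, v, gamma=1.0):
    # RBF kernel k(u, v) = exp(-gamma ||u - v||^2), a known valid kernel
    return np.exp(-gamma * np.sum((u - v) ** 2))

# Construct the kernel (Gram) matrix K_ij = k(x_i, x_j)
K = np.array([[rbf(u, v) for v in pts] for u in pts])

assert np.allclose(K, K.T)           # symmetric
eigvals = np.linalg.eigvalsh(K)
assert np.all(eigvals >= -1e-10)     # PSD, up to numerical noise
```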
Summary
1 The midterm is tough. Don’t panic!
2 Use resources - study guide, lecture and review handouts, Piazza, OH
3 Know your problem-solving tools - take stock of your arsenal!
Best of Luck!