CS229 Midterm Review Part II
Taide Ding
November 1, 2019
Overview
1 Past Midterm Stats
2 Helpful Resources
3 Notation: quick clarifying review
4 Another perspective on bias-variance
5 Common Problem-solving Strategies (with examples)
The Midterms are tough - DON’T PANIC!
[Figure: Fall 16 midterm grade distribution]
Fall 17: µ = 39.5, σ = 14.5
Spring 19: µ = 65.4, σ = 22.4
Helpful Resources
Study guide by past CS229 TA Shervine Amidi (link is on the course syllabus):
https://stanford.edu/~shervine/teaching/cs-229/
Helpful Resources
IMPORTANT: CS229 Linear Algebra and Probability Review handouts
Go over them carefully and in detail.
Any and all of the concepts/tools within are fair game w.r.t. solving midterm problems
TAKE NOTES
Notation: quick clarifying review
{x^{(i)}, y^{(i)}}_{i=1}^{n} denotes a dataset of n examples. For each example i, x^{(i)} ∈ R^d and y^{(i)} ∈ R.
The j-th element (i.e. feature) of the i-th sample is denoted x_j^{(i)}.
X ∈ R^{n×d} is the data matrix and ~y ∈ R^n is the label vector such that:

\[
x^{(i)} = \begin{bmatrix} x^{(i)}_1 \\ \vdots \\ x^{(i)}_d \end{bmatrix}, \quad
X = \begin{bmatrix} x^{(1)T} \\ \vdots \\ x^{(n)T} \end{bmatrix}, \quad
\vec{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}, \quad
X\theta = \begin{bmatrix} \theta^T x^{(1)} \\ \vdots \\ \theta^T x^{(n)} \end{bmatrix}
\]

for parameter vector θ ∈ R^d.
The t-th iteration of θ is denoted θ^{(t)}.
Superscripts: sample index i ∈ [1, n]; iteration index t ∈ [1, T]
Subscripts: feature index j ∈ [1, d]
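As a quick sanity check on these shapes, here is a minimal NumPy sketch (the dataset values are random placeholders, not from the slides):

```python
import numpy as np

n, d = 5, 3  # n examples, d features (arbitrary for illustration)
rng = np.random.default_rng(0)

X = rng.normal(size=(n, d))   # data matrix: row i is x^(i)T
y = rng.normal(size=n)        # label vector ~y
theta = rng.normal(size=d)    # parameter vector

# X @ theta stacks the scalars theta^T x^(i) into a length-n vector
pred = X @ theta
assert pred.shape == (n,)

# entry i equals the dot product of row i with theta
assert np.allclose(pred[0], X[0] @ theta)
```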
Another perspective on bias-variance
F is your model class
f* is the optimal model for the problem (i.e. the true generating distribution)
g is the optimal model in your model class
f̂ is the model you obtain through learning on your dataset
approximation error → bias
  reduce bias by expanding F (e.g. more features, more layers) or moving F closer to the optimal model f* (i.e. choosing a better class)
estimation error → variance
  reduce variance by contracting F (e.g. removing features, regularizing) or making f̂ closer to g (e.g. a better training algorithm, more data)
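A toy numerical sketch of the "expanding F reduces bias" direction (not from the slides; the data and degrees are made up): nested least-squares model classes can only fit the training data better as F grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dataset(n=20):
    # true generating model f* is sin(pi x) plus Gaussian noise
    x = np.linspace(-1, 1, n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y

def train_error(degree, x, y):
    # least-squares fit within the class F of degree-`degree` polynomials
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    return np.mean(resid ** 2)

x, y = sample_dataset()
# Degree-9 polynomials contain degree-1 polynomials as a subclass, so the
# richer class achieves training error at least as low (lower bias) --
# though the extra capacity raises variance on fresh datasets.
assert train_error(9, x, y) <= train_error(1, x, y)
```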
Common Problem-solving Strategies
Take stock of your arsenal:
1 Probability
  Bayes' Rule
  Independence, Conditional Independence
  Chain Rule
  etc.
2 Calculus (e.g. taking gradients)
  Maximum likelihood estimation:
  ℓ(·) = log L(·) = log ∏ p(·) = ∑ log p(·)
  Loss minimization
  etc.
3 Linear Algebra
  PSD, eigendecomposition, projection, Mercer's Theorem, etc.
4 Proof techniques
  construction, contradiction (e.g. counterexample), induction, contrapositive, etc.
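The log-likelihood identity above can be checked numerically. This is a minimal sketch assuming a Gaussian model with known σ = 1 and made-up data; for i.i.d. samples the log of the product of densities equals the sum of log-densities, and the sample mean maximizes it.

```python
import math

data = [1.2, 0.8, 1.5, 0.9, 1.1]  # made-up i.i.d. observations

def gaussian_pdf(x, mu, sigma=1.0):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def log_likelihood(mu):
    # l(mu) = sum_i log p(x_i; mu)
    return sum(math.log(gaussian_pdf(x, mu)) for x in data)

# log prod_i p(x_i) == sum_i log p(x_i)
product = math.prod(gaussian_pdf(x, 1.0) for x in data)
assert math.isclose(math.log(product), log_likelihood(1.0))

# For this model the MLE of mu is the sample mean
mle = sum(data) / len(data)
assert all(log_likelihood(mle) >= log_likelihood(mu) for mu in (0.5, 1.0, 1.5))
```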
Spring 19 Problem 3(a,b) - Exponential Discr. Analysis
Tools used: Probability (Bayes’, Indep, Chain Rule), Calculus (MLE)
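The Bayes' rule step in discriminant-analysis problems like this one can be sketched numerically. The priors and likelihood values below are made up for illustration; the point is the mechanics of p(y = k | x) ∝ p(x | y = k) p(y = k).

```python
# Hypothetical class priors p(y=k) and class-conditional likelihoods
# p(x | y=k) evaluated at some fixed x (values are illustrative only).
prior = {0: 0.6, 1: 0.4}
likelihood = {0: 0.05, 1: 0.20}

# Evidence p(x) = sum_k p(x | y=k) p(y=k), by the chain rule + marginalization
evidence = sum(likelihood[k] * prior[k] for k in prior)

# Posterior for class 1 via Bayes' rule
posterior1 = likelihood[1] * prior[1] / evidence

# Posteriors over all classes sum to 1
assert abs(sum(likelihood[k] * prior[k] / evidence for k in prior) - 1.0) < 1e-12
# Here the class-1 likelihood advantage outweighs its smaller prior
assert posterior1 > 0.5
```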
Spring 19 Problem 5(a) - Kernel Fun
Tools used: Linear Algebra (PSD properties, eigendecomposition), proof by construction
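The "construction" tool pairs naturally with a numerical check: a valid (Mercer) kernel must produce a symmetric PSD kernel matrix on any finite point set. A minimal sketch using the RBF kernel on arbitrary points (points and γ are illustrative choices, not from the exam problem):

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.normal(size=(6, 2))  # arbitrary points in R^2

def rbf(u, v, gamma=1.0):
    # RBF kernel k(u, v) = exp(-gamma ||u - v||^2), a known valid kernel
    return np.exp(-gamma * np.sum((u - v) ** 2))

# Construct the kernel (Gram) matrix K_ij = k(x_i, x_j)
K = np.array([[rbf(u, v) for v in pts] for u in pts])

assert np.allclose(K, K.T)           # symmetric
eigvals = np.linalg.eigvalsh(K)
assert np.all(eigvals >= -1e-10)     # PSD, up to numerical noise
```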
Summary
1 The midterm is tough. Don’t panic!
2 Use resources - study guide, lecture and review handouts, Piazza, OH
3 Know your problem-solving tools - take stock of your arsenal!
Best of Luck!