Lecture 2: Supervised vs. unsupervised learning, bias-variance tradeoff
Reading: Chapter 2
STATS 202: Data mining and analysis
Sergio Bacallado
September 24, 2014
1 / 20
Supervised vs. unsupervised learning

In unsupervised learning we start with a data matrix: the rows are samples or units, and the columns are variables or factors. The variables may be:

- Quantitative, e.g. weight, height, number of children, ...
- Qualitative, e.g. college major, profession, gender, ...

2 / 20
Supervised vs. unsupervised learning

In unsupervised learning we start with a data matrix. Our goals are to:

- Find meaningful relationships between the variables or units. Correlation analysis.
- Find low-dimensional representations of the data which make it easy to visualize the variables and units. PCA, ICA, isomap, locally linear embeddings, etc.
- Find meaningful groupings of the data. Clustering.

Unsupervised learning is also known in Statistics as exploratory data analysis.

3 / 20
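As a concrete sketch of the second goal, here is a minimal numpy-only example (the data matrix is synthetic, invented for illustration) that computes a 2-D principal-component representation of a small data matrix via the SVD:

```python
import numpy as np

# Hypothetical data matrix: 6 samples (rows / units) x 4 variables (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

# Center each variable; the top right-singular vectors of the centered
# matrix are the principal directions (PCA via the SVD).
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Low-dimensional representation: project the samples onto the first
# two principal directions.
scores = Xc @ Vt[:2].T
print(scores.shape)   # (6, 2)
```

Plotting the two columns of `scores` against each other would give the kind of low-dimensional visualization described above; clustering methods could then be run on `scores` instead of the raw matrix.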
Supervised vs. unsupervised learning

In supervised learning, there are input variables and output variables. The rows of the data matrix are still samples or units; most columns are input variables, and one column is the output variable.

- If the output is quantitative, we say this is a regression problem.
- If the output is qualitative, we say this is a classification problem.

4 / 20
Supervised vs. unsupervised learning

Let X be the vector of inputs for a particular sample. The output variable is modeled by:

    Y = f(X) + ε,

where ε is a random error. Our goal is to learn the function f, using a set of training samples.

5 / 20
Supervised vs. unsupervised learning

    Y = f(X) + ε,    where ε is a random error.

Motivations:

- Prediction: Useful when the input variables are readily available, but the output variable is not.
  Example: Predict stock prices next month using data from last year.
- Inference: A model for f can help us understand the structure of the data: which variables influence the output, and which don't? What is the relationship between each variable and the output, e.g. linear, non-linear?
  Example: What is the influence of genetic variations on the incidence of heart disease?

6 / 20
Parametric and nonparametric methods

There are two kinds of supervised learning methods:

- Parametric methods: We assume that f takes a specific form. For example, a linear form:

      f(X) = X1β1 + ... + Xpβp,

  with parameters β1, ..., βp. Using the training data, we try to fit the parameters.

- Non-parametric methods: We don't make any assumptions on the form of f, but we restrict how "wiggly" or "rough" the function can be.

7 / 20
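To make the distinction concrete, here is a small simulated sketch (numpy only; the data, the sine truth, and the choice of k are all invented): a parametric fit with two parameters, versus a non-parametric k-nearest-neighbours average whose roughness is controlled by k rather than by a functional form:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, size=50))
y = np.sin(x) + rng.normal(scale=0.3, size=50)   # the truth is non-linear

# Parametric: assume f(X) = beta0 + beta1 * X and fit the two parameters.
beta1, beta0 = np.polyfit(x, y, deg=1)
pred_param = beta0 + beta1 * x

# Non-parametric: no assumed form; predict by averaging the k nearest
# training responses (k controls how "wiggly" the fit can be).
def knn_predict(x0, k=5):
    nearest = np.argsort(np.abs(x - x0))[:k]
    return y[nearest].mean()

pred_knn = np.array([knn_predict(x0) for x0 in x])

mse = lambda pred: np.mean((y - pred) ** 2)
print(mse(pred_param), mse(pred_knn))
```

On this non-linear truth the k-NN average tracks the data far better than the straight line, at the cost of having no interpretable parameters.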
Parametric vs. nonparametric prediction

[Figures 2.4 and 2.5: Income as a function of Years of Education and Seniority, fit with a parametric surface and a non-parametric surface.]

Parametric methods have a limit of fit quality. Non-parametric methods keep improving as we add more data to fit.

Parametric methods are often simpler to interpret.

8 / 20
Prediction error

Training data: (x1, y1), (x2, y2), ..., (xn, yn).
Predicted function: f̂.

Our goal in supervised learning is to minimize the prediction error. For regression models, this is typically the Mean Squared Error:

    MSE(f̂) = E(y0 − f̂(x0))².

Unfortunately, this quantity cannot be computed, because we don't know the joint distribution of (X, Y). We can compute a sample average using the training data; this is known as the training MSE:

    MSE_training(f̂) = (1/n) Σ_{i=1}^{n} (yi − f̂(xi))².

9 / 20
Prediction error

The main challenge of statistical learning is that a low training MSE does not imply a low MSE.

If we have test data {(x′i, y′i); i = 1, ..., m} which were not used to fit the model, a better measure of quality for f̂ is the test MSE:

    MSE_test(f̂) = (1/m) Σ_{i=1}^{m} (y′i − f̂(x′i))².

10 / 20
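The gap between the two quantities shows up in a small simulation (the true curve, sample sizes, and degrees below are all invented for illustration): polynomial fits of increasing degree drive the training MSE down, while the test MSE typically levels off and then rises again:

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(1.5 * x)                      # the truth, known here

x_tr = rng.uniform(0, 5, size=30)                  # training data
y_tr = f(x_tr) + rng.normal(scale=0.5, size=30)
x_te = rng.uniform(0, 5, size=200)                 # held-out test data
y_te = f(x_te) + rng.normal(scale=0.5, size=200)

for deg in [1, 4, 10, 15]:                         # flexibility knob
    coef = np.polyfit(x_tr, y_tr, deg)
    mse_tr = np.mean((y_tr - np.polyval(coef, x_tr)) ** 2)
    mse_te = np.mean((y_te - np.polyval(coef, x_te)) ** 2)
    print(f"degree {deg:2d}: training MSE {mse_tr:.3f}, test MSE {mse_te:.3f}")
```

This is the gray-versus-red pattern in the figures that follow: training MSE always rewards flexibility, test MSE does not.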
Figure 2.9

[Left panel: simulated data (circles) and the true curve f (black); right panel: Mean Squared Error against flexibility.]

The circles are simulated data from the black curve. In this artificial example, we know what f is.

Three estimates f̂ are shown:
1. Linear regression.
2. Splines (very smooth).
3. Splines (quite rough).

Red line: test MSE. Gray line: training MSE.

11 / 20
Figure 2.10

[Same setup as Figure 2.9: data, estimates, and training/test MSE against flexibility.]

The function f is now almost linear.

12 / 20
Figure 2.11

[Same setup, with noise of small variance.]

When the noise ε has small variance, the third method does well.

13 / 20
The bias-variance decomposition

Let x0 be a fixed test point, y0 = f(x0) + ε0, and let f̂ be estimated from n training samples (x1, y1), ..., (xn, yn).

Let E denote the expectation over y0 and the training outputs (y1, ..., yn). Then, the Mean Squared Error at x0 can be decomposed:

    MSE(x0) = E(y0 − f̂(x0))² = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε0).

- Var(ε0) is the irreducible error.
- Var(f̂(x0)) = E[f̂(x0) − E f̂(x0)]² is the variance of the estimate of Y. This measures how much the estimate of f at x0 changes when we sample new training data.
- [Bias(f̂(x0))]² = [E f̂(x0) − f(x0)]² is the squared bias of the estimate of Y. This measures the deviation of the average prediction E f̂(x0) from the truth f(x0).

14 / 20
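Each term can be estimated by simulation when f is known (the true f, the point x0, the degrees, and the sample sizes below are all invented for illustration): repeatedly redraw the training set, refit, and look at the spread and the average of f̂(x0):

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(x)        # the truth, available only in simulation
x0, sigma = 2.0, 0.5           # fixed test point, noise standard deviation

def fit_at_x0(deg):
    # Draw a fresh training set and return this fit's prediction at x0.
    x = rng.uniform(0, 5, size=30)
    y = f(x) + rng.normal(scale=sigma, size=30)
    return np.polyval(np.polyfit(x, y, deg), x0)

for deg in [1, 10]:
    preds = np.array([fit_at_x0(deg) for _ in range(500)])
    variance = preds.var()                     # Var(f_hat(x0))
    bias_sq = (preds.mean() - f(x0)) ** 2      # [Bias(f_hat(x0))]^2
    print(f"degree {deg:2d}: variance {variance:.4f}, bias^2 {bias_sq:.4f}")
```

The rigid degree-1 fit shows high squared bias and low variance; the flexible degree-10 fit shows the reverse. The irreducible Var(ε0) = σ² = 0.25 is there in either case.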
Implications of the bias-variance decomposition

    MSE(x0) = E(y0 − f̂(x0))² = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε0).

- The MSE is always positive.
- Each element on the right hand side is always positive.
- Therefore, typically when we decrease the bias beyond some point, we increase the variance, and vice versa.

More flexibility ⟺ Higher variance ⟺ Lower bias.

16 / 20
Squiggly f, high noise / Linear f, high noise / Squiggly f, low noise

[Figure 2.12: MSE, squared bias, and variance against flexibility, one panel per setting above.]

17 / 20
Classification problems

In a classification setting, the output takes values in a discrete set. For example, if we are predicting the brand of a car based on a number of variables, the function f takes values in the set {Ford, Toyota, Mercedes-Benz, ...}.

The model
    Y = f(X) + ε
becomes insufficient, as Y is not necessarily real-valued.

We will use slightly different notation:

    P(X, Y): joint distribution of (X, Y),
    P(Y | X): conditional distribution of Y given X,
    ŷi: prediction for xi.

18 / 20
Loss function for classification

There are many ways to measure the error of a classification prediction. One of the most common is the 0-1 loss:

    E[1(ŷ0 ≠ y0)].

Like the MSE, this quantity can be estimated from training and test data by taking a sample average:

    (1/n) Σ_{i=1}^{n} 1(ŷi ≠ yi).

19 / 20
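As a sketch (the labels and predictions below are invented), the sample-average 0-1 loss is just the misclassification rate:

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])   # observed classes
y_hat  = np.array([0, 1, 0, 0, 1, 1, 0, 1])   # predictions

# Sample average of 1(y_hat_i != y_i): the fraction misclassified.
loss = np.mean(y_hat != y_true)
print(loss)   # 0.25 (2 of 8 misclassified)
```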
Bayes classifier

[Figure 2.13: simulated two-class data in the (X1, X2) plane, with the Bayes decision boundary.]

In practice, we never know the joint probability P. However, we can assume that it exists.

The Bayes classifier assigns:

    ŷi = argmax_j P(Y = j | X = xi).

It can be shown that this is the best classifier under the 0-1 loss.

20 / 20
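A minimal sketch of the rule (the Gaussian class-conditionals, priors, and test points are all invented): when P(X, Y) is fully specified, the Bayes classifier just evaluates something proportional to the posterior for each class and takes the argmax:

```python
import numpy as np

# Two classes with known priors and 1-D Gaussian class-conditionals
# N(-1, 1) and N(+1, 1); here the joint P(X, Y) is fully known.
priors = np.array([0.5, 0.5])
means, sd = np.array([-1.0, 1.0]), 1.0

def density(x, mu):
    # Gaussian density of x under class mean mu.
    return np.exp(-(x - mu) ** 2 / (2 * sd ** 2)) / np.sqrt(2 * np.pi * sd ** 2)

def bayes_classify(x):
    # priors * densities is proportional to P(Y = j | X = x); the
    # normalizing constant does not affect the argmax.
    return int(np.argmax(priors * density(x, means)))

print(bayes_classify(-0.3), bayes_classify(0.7))   # 0 1
```

With equal priors and equal variances the decision boundary here sits at x = 0; under the 0-1 loss no classifier built from the data can beat this rule on average.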