STK2100 - Machine Learning and Statistical Methods for Prediction and Classification

Lecturer: Geir Storvik
Exercises: Lars Henry Berge Olsen

Plan for today:
- Quick overview of the course
- Some topics from the first two chapters
Big data, data science, statistics
- Huge amounts of data are now more easily available:
  - Receipts from supermarkets
  - Credit card data
  - Genetic data
  - Mobile phone data
  - Meteorological data
  - Web data
  - Astronomical data
  - Large physical experiments
- Automatic methods for collecting data
- Major reductions in cost for storage
- Data Science is central in extracting knowledge from data
Data Science
The importance of Data Science:
- https://blog.edx.org/the-importance-of-data-science-in-the-21st-century
- http://bigdata.teradata.com/US/Big-Data-Quick-Start/People-And-Roles/Data-Scientist/
- https://blog.alexa.com/know-data-science-important/
- https://blog.udacity.com/2014/11/data-science-job-skills.html
This course:
- Focus on data analysis
- Statistics, machine learning, statistical learning
Statistics/machine learning/data mining
- Azzalini and Scarpa (2012): "Data mining represents the work of processing, graphically or numerically, large amounts or continuous streams of data, with the aim of extracting information useful to those who possess them."
  Includes statistics, machine learning, database management (IN3020)
- In many cases statistics and machine learning are presented as different things:
  - Statistics: Model based
  - Machine learning: Algorithm based
  - https://www.svds.com/machine-learning-vs-statistics/
  - Breiman (2001)
- In practice: Considerable overlap
- Common goal: Extract knowledge from data
- Required background:
  - Programming
  - Mathematics (calculus, linear algebra)
  - Probability theory, statistical inference
This course: Focus on prediction
- Predicting a stochastic outcome is important in many situations:
  - Numbers of people infected or hospitalized due to Covid-19
  - Numbers dying within each age group (life insurance)
  - Extreme rainfall (meteorology)
  - Outcome of a medical treatment
  - What people buy in stores
  - and many more
- Prediction of the outcome is typically based on a choice of relevant explanatory variables:
  - Residence, age, education
  - Temperature, wind
  - Genomic data
- Statistical terminology:
  - Continuous response: Prediction
  - Categorical response: Classification
- Huge datasets are (sometimes) collected:
  - Gives possibilities for more flexible methods/models
  - Can give new challenges in evaluation of methods
  - Can also give computational challenges
- Many of the main ideas are common with the analysis of smaller datasets.
Car data
- Problem: Predict distance covered per unit fuel (or consumption of fuel) as a function of certain characteristics of a car
[Figure: scatterplot matrix of City distance, Engine size, Number of cylinders, and Curb weight]
Car data

- Problem: Predict distance covered per unit fuel (or consumption of fuel) as a function of certain characteristics of a car
[Figure: scatterplot matrix of City consumption, Engine size, Number of cylinders, and Curb weight]
Wage data
- Variable of interest: Wage
- Explanatory variables:
  - age
  - year
  - education
[Figure: Wage plotted against Age, Year, and Education Level]
Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
Stock market data
- Variable of interest: Increase/decrease in stock value (direction)
- Explanatory variables:
  - Value on earlier days (lag1-lag5)
  - Volume (volume)
[Figure: boxplots of the percentage change in S&P for yesterday, two days previous, and three days previous, grouped by today's direction (Down/Up)]
Eye data

- Gene expression data from a microarray experiment on 120 rats (Scheetz et al., 2006)
- Response y: Expression level of the TRIM32 gene
- Explanatory variables x1, ..., x200: Data from 200 gene probes
[Figure: scatterplot matrix of the gene probes x.1377, x.1748, x.2487, x.2679, x.2789 and the response y]
- Possible model: $Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + \varepsilon$
- Least squares estimate:
  $\hat{\beta} = (X^T X)^{-1} X^T Y$
- Problem: $X^T X$ is not invertible!
- Problem: Many explanatory variables compared to the number of individuals
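A minimal R sketch of this problem on simulated data (the dimensions are made up for illustration): with p > n the matrix $X^T X$ has rank at most n, so the least squares formula cannot be applied directly. The ridge remedy at the end is one standard fix, not something taken from the slide.

    ## Simulated data; dimensions are made up for illustration
    set.seed(1)
    n <- 50; p <- 200                 # more variables than observations
    X <- matrix(rnorm(n * p), n, p)
    y <- rnorm(n)
    XtX <- crossprod(X)               # t(X) %*% X, a p x p matrix
    qr(XtX)$rank                      # rank is at most n = 50, far below p = 200
    ## solve(XtX) fails here; one standard remedy (not from the slide) is
    ## ridge regression, which inverts XtX + lambda * I instead:
    lambda <- 1
    beta_ridge <- solve(XtX + lambda * diag(p), crossprod(X, y))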
Images of faces
- Images of students at Stanford (100 women, 100 men)
- Each image consists of 100 × 100 = 10 000 pixels
- Possible to classify gender based on these images?
[Figure: four example face images]
- Problem: Many explanatory variables compared to the number of individuals
Gene expression data

- 6,830 gene expression measurements
- No specific response/variable of interest
- Data from 64 cancer cells
- Of interest to group cells
[Figure: the 64 cancer cell lines plotted against two derived variables Z1 and Z2, with two alternative groupings]
STK2100
- Focus on prediction and classification
- Will also touch upon problems related to big data:
  - Statistical issues
    - p larger than n problems
    - Multiple testing
  - Computational issues
- Textbook: Azzalini, Scarpa: Data Analysis and Data Mining
- Supplementary text: James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning - with applications in R
- Supplemented by extra exercises/notes.
- Recommended previous knowledge:
  - Calculus/linear algebra (MAT1100/1110/1120)
  - Probability theory/statistical inference (STK1100/1110)
  - Basic programming (IN1900 or IN1000)
Statistical learning

- Focus: Prediction of a response variable based on explanatory variables
  - If quantitative response variable: Regression
  - If categorical response variable: Classification
- Machine learning: Focus on algorithms for prediction
  - Often used as a black box
  - Evaluation often based on a separate test set
- Statistical learning: Focus on methods for prediction
  - More "open box"
  - Focus on inference:
    - Quantification of uncertainty
    - Evaluation of methods
    - Hypothesis testing
- No clear distinction; both focus on prediction
Prediction - advertising data
- Response: Sales of a product in 200 different markets (sales)
- Explanatory variables:
  - Advertising budget in TV (TV)
  - Advertising budget in radio (radio)
  - Advertising budget in newspapers (newspaper)
[Figure: Sales plotted against TV, Radio, and Newspaper advertising budgets]
Prediction - Wage data
- Response: Income (income)
- Explanatory variable: years of education
[Figure: Income plotted against Years of Education (two panels)]
Statistical model for prediction
- x = (x1, ..., xp) is a set of explanatory variables, y is the response.
- Possible model: $Y = f(x) + \varepsilon$, $E[\varepsilon] = 0$
- Sales data: Possible choice of f:
  $f(x) = \beta_0 + \sum_{j=1}^{p} \beta_j x_j$
- Wage data: Non-linear relationship, not obvious how to choose f
Estimation of f
- Assume $Y = f(x) + \varepsilon$, $E[\varepsilon] = 0$, f unknown, so that
  $E[Y \mid x] = f(x)$
- Why estimate f?
  - Prediction: $\hat{y} = \hat{f}(x)$
    - Can think of $\hat{f}$ as a black box; not important to know the form of $\hat{f}$
  - Understanding how y is influenced by x1, ..., xp:
    - Which variable(s) influence the response?
    - What is the relationship between a variable and the response? Is the relationship linear?
- Precision of an estimate (for given x and $\hat{f}$):
  $E[(Y - \hat{Y})^2] = \underbrace{E[(f(x) - \hat{f}(x))^2]}_{\text{reducible}} + \underbrace{V(\varepsilon)}_{\text{irreducible}}$
  The choice of estimation method influences the reducible part of the prediction error.
Estimation of f - parametric methods

- Linear regression: Assume
  $f(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$
  An example of a parametric model. The model is described by the parameters $\beta = (\beta_0, ..., \beta_p)$; f is estimated through estimation of the parameters $\beta$.
- Assume p = 1 and
  $f(x) = \beta_1 e^{\lambda_1 x} + \beta_2 e^{\lambda_2 x}$
  Also an example of a parametric model. The model is described by the parameters $\theta = (\beta_1, \beta_2, \lambda_1, \lambda_2)$; f is estimated through estimation of $\theta$. (A fitting sketch follows after the figure below.)
[Figure: data (x, y) illustrating the parametric model fits]
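A minimal R sketch of parametric estimation for both models above, on simulated data (all parameter values and starting values are made up): lm() fits the linear model, while nls() fits the two-exponential model by nonlinear least squares.

    ## Simulated data; the true parameters are made up
    set.seed(1)
    n <- 100
    x <- runif(n, 0, 200)
    y <- 90 * exp(-0.02 * x) + 30 * exp(-0.001 * x) + rnorm(n, sd = 2)

    ## Linear model: f(x) = beta0 + beta1 * x
    fit_lin <- lm(y ~ x)

    ## Two-exponential model: f(x) = b1 * exp(l1 * x) + b2 * exp(l2 * x);
    ## nonlinear least squares needs starting values for the parameters
    fit_exp <- nls(y ~ b1 * exp(l1 * x) + b2 * exp(l2 * x),
                   start = list(b1 = 80, l1 = -0.01, b2 = 20, l2 = -0.002))
    coef(fit_exp)                     # estimates of theta = (b1, l1, b2, l2)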
Estimation of f - non-parametric methods
- Non-parametric methods: Assume f(x) is a smooth function of x.
- Data: {(x1, y1), ..., (xn, yn)}
- Estimation: $f(x) = E[Y \mid X = x]$ can be estimated by
  $\hat{f}(x) = \frac{\sum_{i=1}^{n} I(x_i = x)\, y_i}{\sum_{i=1}^{n} I(x_i = x)}$
  Problem: There can be very few observations with $x_i = x$.
- Idea: Since $f(x) \approx f(x_i)$ for $x_i$ close to x:
  $\hat{f}(x) = \frac{\sum_{i=1}^{n} I(x_i \approx x)\, y_i}{\sum_{i=1}^{n} I(x_i \approx x)}$
  (A sketch of this local-averaging idea follows after the figure below.)
[Figure: surface of Income as a function of Years of Education and Seniority]
Parametric or non-parametric
- Precision of estimate:
  $E[(Y - \hat{Y})^2] = \underbrace{E[(f(x) - \hat{f}(x))^2]}_{\text{reducible}} + \underbrace{V(\varepsilon)}_{\text{irreducible}}$
- Reducible part:
  $E[(f(x) - \hat{f}(x))^2] = E[(f(x) - E[\hat{f}(x)] + E[\hat{f}(x)] - \hat{f}(x))^2]$
  $= [f(x) - E[\hat{f}(x)]]^2 + E[(E[\hat{f}(x)] - \hat{f}(x))^2]$
  $= \underbrace{[f(x) - E[\hat{f}(x)]]^2}_{\text{Bias}^2} + \underbrace{\text{Var}[\hat{f}(x)]}_{\text{Variance}}$
- Few assumptions about f give small bias but can give high variance
- Stronger assumptions on f can give small variance but high bias (if the assumptions are wrong)
- For interpretation, simpler forms of f are preferable
  (The simulation below illustrates the trade-off.)
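A small R simulation of this trade-off (all settings are made up): data are generated repeatedly from a known f, polynomials of a low and a high degree are fitted, and the squared bias and variance of $\hat{f}(x_0)$ at a single point are estimated.

    ## Simulated illustration; the true f, x0, degrees, and sample size are made up
    set.seed(1)
    f  <- function(x) sin(2 * pi * x)
    x0 <- 0.3; B <- 500; n <- 50

    est_at_x0 <- function(degree) {
      replicate(B, {                  # B independent training sets
        x <- runif(n); y <- f(x) + rnorm(n, sd = 0.3)
        fit <- lm(y ~ poly(x, degree))
        predict(fit, newdata = data.frame(x = x0))
      })
    }
    for (d in c(1, 9)) {              # rigid vs flexible fit
      fh <- est_at_x0(d)
      cat("degree", d, " bias^2:", (mean(fh) - f(x0))^2,
          " variance:", var(fh), "\n")
    }

The rigid fit (degree 1) typically shows large squared bias and small variance; the flexible fit (degree 9) the opposite.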
Overview of different methods
[Figure: trade-off between flexibility (x-axis, low to high) and interpretability (y-axis, low to high): Subset Selection and Lasso at low flexibility/high interpretability, then Least Squares, then Generalized Additive Models and Trees, with Bagging, Boosting, and Support Vector Machines at high flexibility/low interpretability]
Regression vs classification
- Variables are often divided into two types:
  - Quantitative: Numerical values (ordered)
  - Qualitative: Categorical, no ordering
- Regression: Response is quantitative
- Classification: Response is qualitative
- Note:
  - There can be qualitative explanatory variables within regression
  - There can be quantitative explanatory variables within classification
Unsupervised learning

- So far: Response Y and explanatory variable(s) X
  - Supervised learning
- In many situations: Only X
  - Gene expression data
- Want to find relations between variables
- Common approach: Cluster/group data
[Figure: two scatterplots of X2 against X1 illustrating clustering/grouping of observations]
- Unsupervised learning (Chapter 6)
Evaluation of precision
- Goal: Introduce many statistical learning methods (extending linear regression/logistic regression)
- No method will be the best in all situations
  - Depends on how complex f is, how much data is available, etc.
- In a specific problem setting: Try out several methods
  - How do we evaluate which one is best?
- Possible goal (regression):
  $E[(Y^* - \hat{f}(x^*))^2]$
  where $x^*$ are the values of the explanatory variables at which we want to predict a new $Y^*$.
- Problem: $Y^*$ is unknown
Estimation of measurement error

- Possible estimate of $E[(Y^* - \hat{f}(x^*))^2]$:
  $\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{f}(x_i))^2$
- This measurement error is evaluated on the same data as used for estimating f:
  - Problem: We are interested in prediction on new data
  - Using the same data can give too optimistic answers
[Figure: left, simulated data (X, Y); right, mean squared error against flexibility]
- Idea: Divide the data into two parts:
  - Training data: Used for estimating $\hat{f}$
  - Test data: Used for estimating $E[(Y^* - \hat{f}(x^*))^2]$
  (A minimal split sketch follows below.)
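A minimal R sketch of this split on simulated data (the true f, the degrees, and the split are made up): the training MSE keeps decreasing with flexibility, while the test MSE eventually increases.

    ## Simulated data; all settings are made up for illustration
    set.seed(1)
    n <- 100
    x <- runif(n); y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
    train <- sample(n, n / 2)         # indices of the training half

    for (degree in c(1, 3, 10)) {
      fit  <- lm(y ~ poly(x, degree), subset = train)
      pred <- predict(fit, newdata = data.frame(x = x))
      cat("degree", degree,
          " train MSE:", mean((y[train] - pred[train])^2),
          " test MSE:",  mean((y[-train] - pred[-train])^2), "\n")
    }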
Variance vs bias
[Figure: three simulated examples; for each, the data with fits of varying flexibility and the corresponding curves of MSE, squared bias, and variance against flexibility]
$E[(\hat{f}(x) - f(x))^2] = \underbrace{[f(x) - E[\hat{f}(x)]]^2}_{\text{Bias}^2} + \underbrace{\text{Var}[\hat{f}(x)]}_{\text{Variance}}$
- Bias decreases with higher flexibility
- Variance increases with higher flexibility
Classification
- Regression: Measurement error measured by $E[(Y^* - \hat{f}(x^*))^2]$
- Classification: Error rate $E[I(Y \neq \hat{Y})]$
  - Can be estimated by $\frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{y}_i)$
- Same problems as before:
  - The estimated error rate is too optimistic if estimated from the same data
  - Trade-off between bias and variance
Bayes classifier

- One can show (exercise) that the classification rule which minimizes the error rate is
  $\hat{Y} = \text{argmax}_j \Pr(Y = j \mid X = x)$
- This is called the Bayes classifier (a small numerical sketch follows after the figure below)
[Figure: simulated two-class data in the (X1, X2) plane with the Bayes decision boundary. The borders between the regions are called Bayes decision boundaries.]
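A minimal R sketch of the Bayes classifier in a made-up setting where $\Pr(Y = j \mid X = x)$ is known exactly: two equally likely classes with $X \mid Y = 1 \sim N(0, 1)$ and $X \mid Y = 2 \sim N(2, 1)$.

    ## Made-up example with known class distributions, so the Bayes rule is exact
    bayes_class <- function(x) {
      p1 <- 0.5 * dnorm(x, mean = 0)  # prior * density for class 1
      p2 <- 0.5 * dnorm(x, mean = 2)  # prior * density for class 2
      ifelse(p1 > p2, 1, 2)           # argmax_j of Pr(Y = j | X = x)
    }
    bayes_class(c(0.5, 1.5))          # returns 1 2; the decision boundary is x = 1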
K-nearest neighbors

- Ideal: Use the Bayes classifier
- In practice $\Pr(Y = k \mid X = x^*)$ is unknown and needs to be estimated.
- K-nearest neighbors: Let $N_0 \subset \{1, ..., n\}$ be the set of K indexes with $x_i$ nearest to $x^*$.
- Estimation of $\Pr(Y = k \mid X = x^*) = E[I(Y = k) \mid X = x^*]$:
  $\widehat{\Pr}(Y = k \mid X = x^*) = \frac{1}{K} \sum_{i \in N_0} I(y_i = k)$
- Classification rule: Classify a test point with $X = x^*$ to the class with the highest number of cases among the K nearest points. (A sketch using the class package follows below.)
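A minimal R sketch of K-nearest neighbors using knn() from the class package, on simulated two-class data (the data-generating mechanism and the values of K are made up to mirror the figures below):

    ## Simulated two-class data; settings chosen to mirror the figures below
    library(class)
    set.seed(1)
    n <- 200
    X <- matrix(rnorm(2 * n), n, 2)   # two explanatory variables
    y <- factor(ifelse(X[, 1] + X[, 2] + rnorm(n) > 0, "Up", "Down"))
    train <- sample(n, n / 2)

    for (K in c(1, 10, 100)) {
      pred_tr <- knn(X[train, ], X[train, ],  y[train], k = K)
      pred_te <- knn(X[train, ], X[-train, ], y[train], k = K)
      cat("K =", K,
          " training error:", mean(pred_tr != y[train]),
          " test error:",     mean(pred_te != y[-train]), "\n")
    }

With K = 1 the training error is zero by construction, while the test error behaves as in the error-rate figure below.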
[Figure: KNN decision boundaries on the simulated two-class data for K = 10, K = 1, and K = 100]

[Figure: training and test error rates plotted against 1/K]
Plan
- Chapter 2: Linear models
  - Linear regression
  - Logistic regression
  - Least squares and maximum likelihood
  - Also parts of Appendix A
- Chapter 3: Optimism, conflicts and trade-offs
  - General concepts, evaluation, comparison
- Chapter 4: Regression methods
- Chapter 5: Classification methods
- Chapter 6: Cluster analysis
A. Azzalini and B. Scarpa. Data Analysis and Data Mining: An Introduction. OUP USA, 2012.