STK2100 - Machine Learning and Statistical Methods for Prediction and Classification

Lecturer: Geir Storvik
Exercises: Lars Henry Berge Olsen

Plan for today:
- Quick overview of the course
- Some topics from the first two chapters
Big data, data science, statistics
- Huge amounts of data are now more easily available:
  - Receipts from supermarkets
  - Credit card data
  - Genetic data
  - Mobile phone data
  - Meteorological data
  - Web data
  - Astronomical data
  - Large physical experiments
- Automatic methods for collecting data
- Major reductions in cost for storage
- Data Science is central in extracting knowledge from data
Data Science
The importance of Data Science:
- https://blog.edx.org/the-importance-of-data-science-in-the-21st-century
- http://bigdata.teradata.com/US/Big-Data-Quick-Start/People-And-Roles/Data-Scientist/
- https://blog.alexa.com/know-data-science-important/
- https://blog.udacity.com/2014/11/data-science-job-skills.html
This course:
- Focus on data analysis
- Statistics, machine learning, statistical learning
Statistics/machine learning/data mining
- Azzalini and Scarpa (2012): "Data mining represents the work of processing, graphically or numerically, large amounts or continuous streams of data, with the aim of extracting information useful to those who possess them."
  Includes statistics, machine learning, database management (IN3020)
- In many cases statistics and machine learning are presented as different things:
  - Statistics: Model based
  - Machine learning: Algorithm based
  - https://www.svds.com/machine-learning-vs-statistics/
  - Breiman (2001)
- In practice: Considerable overlap
- Common goal: Extract knowledge from data
- Required background:
  - Programming
  - Mathematics (calculus, linear algebra)
  - Probability theory, statistical inference
This course: Focus on prediction
- Predicting a stochastic outcome is important in many situations:
  - Numbers of people infected or hospitalized due to Covid-19
  - Numbers dying within each age group (life insurance)
  - Extreme rainfall (meteorology)
  - Outcome of a medical treatment
  - What people buy in stores
  - and many more
- Prediction of the outcome is typically based on a choice of relevant explanatory variables:
  - Residence, age, education
  - Temperature, wind
  - Genomic data
- Statistical terminology:
  - Continuous response: Prediction
  - Categorical response: Classification
- Huge datasets are (sometimes) collected:
  - Gives possibilities for more flexible methods/models
  - Can give new challenges in evaluation of methods
  - Can also give computational challenges
- Many of the main ideas are common with the analysis of smaller datasets.
Car data
- Problem: Predict distance covered per unit fuel (or consumption of fuel) as a function of certain characteristics of a car
[Figure: scatterplot matrix of City distance, Engine size, Number of cylinders, and Curb weight]
Car data

- Problem: Predict distance covered per unit fuel (or consumption of fuel) as a function of certain characteristics of a car
[Figure: scatterplot matrix of City consumption, Engine size, Number of cylinders, and Curb weight]
Wage data
- Variable of interest: Wage
- Explanatory variables:
  - age
  - year
  - education
[Figure: Wage plotted against Age, Year, and Education Level]
Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
Stock market data
- Variable of interest: Increase/decrease in stock value (direction)
- Explanatory variables:
  - Value on earlier days (lag1-lag5)
  - Volume (volume)
[Figure: boxplots of the percentage change in S&P for yesterday, two days previous, and three days previous, grouped by today's direction (Down/Up)]
Eye data

- Gene expression data from a microarray experiment on 120 rats (Scheetz et al., 2006)
- Response y: Expression level of the TRIM32 gene
- Explanatory variables x1, ..., x200: Data from 200 gene probes
[Figure: scatterplot matrix of the gene probes x.1377, x.1748, x.2487, x.2679, x.2789 and the response y]
- Possible model: $Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + \varepsilon$
- Least squares estimate:
  $\hat{\beta} = (X^T X)^{-1} X^T Y$
- Problem: $X^T X$ is not invertible!
- Problem: Many explanatory variables compared to the number of individuals
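A minimal R sketch of this problem on simulated data (the dimensions are made up for illustration): with p > n the matrix $X^T X$ has rank at most n, so the least squares formula cannot be applied directly. The ridge remedy at the end is one standard fix, not something taken from the slide.

    ## Simulated data; dimensions are made up for illustration
    set.seed(1)
    n <- 50; p <- 200                 # more variables than observations
    X <- matrix(rnorm(n * p), n, p)
    y <- rnorm(n)
    XtX <- crossprod(X)               # t(X) %*% X, a p x p matrix
    qr(XtX)$rank                      # rank is at most n = 50, far below p = 200
    ## solve(XtX) fails here; one standard remedy (not from the slide) is
    ## ridge regression, which inverts XtX + lambda * I instead:
    lambda <- 1
    beta_ridge <- solve(XtX + lambda * diag(p), crossprod(X, y))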
Images of faces
- Images of students at Stanford (100 women, 100 men)
- Each image consists of 100 × 100 = 10 000 pixels
- Possible to classify gender based on these images?
[Figure: four example face images]
- Problem: Many explanatory variables compared to the number of individuals
Gene expression data

- 6,830 gene expression measurements
- No specific response/variable of interest
- Data from 64 cancer cells
- Of interest to group cells
[Figure: the 64 cancer cell lines plotted against two derived variables Z1 and Z2, with two alternative groupings]
STK2100
- Focus on prediction and classification
- Will also touch upon problems related to big data:
  - Statistical issues
    - p larger than n problems
    - Multiple testing
  - Computational issues
- Textbook: Azzalini, Scarpa: Data Analysis and Data Mining
- Supplementary text: James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning - with applications in R
- Supplemented by extra exercises/notes.
- Recommended previous knowledge:
  - Calculus/linear algebra (MAT1100/1110/1120)
  - Probability theory/statistical inference (STK1100/1110)
  - Basic programming (IN1900 or IN1000)
Statistical learning

- Focus: Prediction of a response variable based on explanatory variables
  - If quantitative response variable: Regression
  - If categorical response variable: Classification
- Machine learning: Focus on algorithms for prediction
  - Often used as a black box
  - Evaluation often based on a separate test set
- Statistical learning: Focus on methods for prediction
  - More "open box"
  - Focus on inference:
    - Quantification of uncertainty
    - Evaluation of methods
    - Hypothesis testing
- No clear distinction; both focus on prediction
Prediction - advertising data
- Response: Sales of a product in 200 different markets (sales)
- Explanatory variables:
  - Advertising budget in TV (TV)
  - Advertising budget in radio (radio)
  - Advertising budget in newspapers (newspaper)
[Figure: Sales plotted against TV, Radio, and Newspaper advertising budgets]
Prediction - Wage data
- Response: Income (income)
- Explanatory variable: years of education
[Figure: Income plotted against Years of Education (two panels)]
Statistical model for prediction
- x = (x1, ..., xp) is a set of explanatory variables, y is the response.
- Possible model: $Y = f(x) + \varepsilon$, $E[\varepsilon] = 0$
- Sales data: Possible choice of f:
  $f(x) = \beta_0 + \sum_{j=1}^{p} \beta_j x_j$
- Wage data: Non-linear relationship, not obvious how to choose f
Estimation of f
- Assume $Y = f(x) + \varepsilon$, $E[\varepsilon] = 0$, f unknown, so that
  $E[Y \mid x] = f(x)$
- Why estimate f?
  - Prediction: $\hat{y} = \hat{f}(x)$
    - Can think of $\hat{f}$ as a black box; not important to know the form of $\hat{f}$
  - Understanding how y is influenced by x1, ..., xp:
    - Which variable(s) influence the response?
    - What is the relationship between a variable and the response? Is the relationship linear?
- Precision of an estimate (for given x and $\hat{f}$):
  $E[(Y - \hat{Y})^2] = \underbrace{E[(f(x) - \hat{f}(x))^2]}_{\text{reducible}} + \underbrace{V(\varepsilon)}_{\text{irreducible}}$
  The choice of estimation method influences the reducible part of the prediction error.
Estimation of f - parametric methods

- Linear regression: Assume
  $f(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$
  An example of a parametric model. The model is described by the parameters $\beta = (\beta_0, ..., \beta_p)$; f is estimated through estimation of the parameters $\beta$.
- Assume p = 1 and
  $f(x) = \beta_1 e^{\lambda_1 x} + \beta_2 e^{\lambda_2 x}$
  Also an example of a parametric model. The model is described by the parameters $\theta = (\beta_1, \beta_2, \lambda_1, \lambda_2)$; f is estimated through estimation of $\theta$. (A fitting sketch follows after the figure below.)
[Figure: data (x, y) illustrating the parametric model fits]
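A minimal R sketch of parametric estimation for both models above, on simulated data (all parameter values and starting values are made up): lm() fits the linear model, while nls() fits the two-exponential model by nonlinear least squares.

    ## Simulated data; the true parameters are made up
    set.seed(1)
    n <- 100
    x <- runif(n, 0, 200)
    y <- 90 * exp(-0.02 * x) + 30 * exp(-0.001 * x) + rnorm(n, sd = 2)

    ## Linear model: f(x) = beta0 + beta1 * x
    fit_lin <- lm(y ~ x)

    ## Two-exponential model: f(x) = b1 * exp(l1 * x) + b2 * exp(l2 * x);
    ## nonlinear least squares needs starting values for the parameters
    fit_exp <- nls(y ~ b1 * exp(l1 * x) + b2 * exp(l2 * x),
                   start = list(b1 = 80, l1 = -0.01, b2 = 20, l2 = -0.002))
    coef(fit_exp)                     # estimates of theta = (b1, l1, b2, l2)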
Estimation of f - non-parametric methods
- Non-parametric methods: Assume f(x) is a smooth function of x.
- Data: {(x1, y1), ..., (xn, yn)}
- Estimation: $f(x) = E[Y \mid X = x]$ can be estimated by
  $\hat{f}(x) = \frac{\sum_{i=1}^{n} I(x_i = x)\, y_i}{\sum_{i=1}^{n} I(x_i = x)}$
  Problem: There can be very few observations with $x_i = x$.
- Idea: Since $f(x) \approx f(x_i)$ for $x_i$ close to x:
  $\hat{f}(x) = \frac{\sum_{i=1}^{n} I(x_i \approx x)\, y_i}{\sum_{i=1}^{n} I(x_i \approx x)}$
  (A sketch of this local-averaging idea follows after the figure below.)
[Figure: surface of Income as a function of Years of Education and Seniority]
Parametric or non-parametric
- Precision of estimate:
  $E[(Y - \hat{Y})^2] = \underbrace{E[(f(x) - \hat{f}(x))^2]}_{\text{reducible}} + \underbrace{V(\varepsilon)}_{\text{irreducible}}$
- Reducible part:
  $E[(f(x) - \hat{f}(x))^2] = E[(f(x) - E[\hat{f}(x)] + E[\hat{f}(x)] - \hat{f}(x))^2]$
  $= [f(x) - E[\hat{f}(x)]]^2 + E[(E[\hat{f}(x)] - \hat{f}(x))^2]$
  $= \underbrace{[f(x) - E[\hat{f}(x)]]^2}_{\text{Bias}^2} + \underbrace{\text{Var}[\hat{f}(x)]}_{\text{Variance}}$
- Few assumptions about f give small bias but can give high variance
- Stronger assumptions on f can give small variance but high bias (if the assumptions are wrong)
- For interpretation, simpler forms of f are preferable
  (The simulation below illustrates the trade-off.)
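A small R simulation of this trade-off (all settings are made up): data are generated repeatedly from a known f, polynomials of a low and a high degree are fitted, and the squared bias and variance of $\hat{f}(x_0)$ at a single point are estimated.

    ## Simulated illustration; the true f, x0, degrees, and sample size are made up
    set.seed(1)
    f  <- function(x) sin(2 * pi * x)
    x0 <- 0.3; B <- 500; n <- 50

    est_at_x0 <- function(degree) {
      replicate(B, {                  # B independent training sets
        x <- runif(n); y <- f(x) + rnorm(n, sd = 0.3)
        fit <- lm(y ~ poly(x, degree))
        predict(fit, newdata = data.frame(x = x0))
      })
    }
    for (d in c(1, 9)) {              # rigid vs flexible fit
      fh <- est_at_x0(d)
      cat("degree", d, " bias^2:", (mean(fh) - f(x0))^2,
          " variance:", var(fh), "\n")
    }

The rigid fit (degree 1) typically shows large squared bias and small variance; the flexible fit (degree 9) the opposite.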
Overview of different methods
[Figure: trade-off between flexibility (x-axis, low to high) and interpretability (y-axis, low to high): Subset Selection and Lasso at low flexibility/high interpretability, then Least Squares, then Generalized Additive Models and Trees, with Bagging, Boosting, and Support Vector Machines at high flexibility/low interpretability]
Regression vs classification
- Variables are often divided into two types:
  - Quantitative: Numerical values (ordered)
  - Qualitative: Categorical, no ordering
- Regression: Response is quantitative
- Classification: Response is qualitative
- Note:
  - There can be qualitative explanatory variables within regression
  - There can be quantitative explanatory variables within classification
Unsupervised learning

- So far: Response Y and explanatory variable(s) X
  - Supervised learning
- In many situations: Only X
  - Gene expression data
- Want to find relations between variables
- Common approach: Cluster/group data
[Figure: two scatterplots of X2 against X1 illustrating clustering/grouping of observations]
- Unsupervised learning (Chapter 6)
Evaluation of precision
- Goal: Introduce many statistical learning methods (extending linear regression/logistic regression)
- No method will be the best in all situations
  - Depends on how complex f is, how much data is available, etc.
- In a specific problem setting: Try out several methods
  - How do we evaluate which one is best?
- Possible goal (regression):
  $E[(Y^* - \hat{f}(x^*))^2]$
  where $x^*$ are the values of the explanatory variables at which we want to predict a new $Y^*$.
- Problem: $Y^*$ is unknown
Estimation of measurement error

- Possible estimate of $E[(Y^* - \hat{f}(x^*))^2]$:
  $\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{f}(x_i))^2$
- This measurement error is evaluated on the same data as used for estimating f:
  - Problem: We are interested in prediction on new data
  - Using the same data can give too optimistic answers
[Figure: left, simulated data (X, Y); right, mean squared error against flexibility]
- Idea: Divide the data into two parts:
  - Training data: Used for estimating $\hat{f}$
  - Test data: Used for estimating $E[(Y^* - \hat{f}(x^*))^2]$
  (A minimal split sketch follows below.)
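A minimal R sketch of this split on simulated data (the true f, the degrees, and the split are made up): the training MSE keeps decreasing with flexibility, while the test MSE eventually increases.

    ## Simulated data; all settings are made up for illustration
    set.seed(1)
    n <- 100
    x <- runif(n); y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
    train <- sample(n, n / 2)         # indices of the training half

    for (degree in c(1, 3, 10)) {
      fit  <- lm(y ~ poly(x, degree), subset = train)
      pred <- predict(fit, newdata = data.frame(x = x))
      cat("degree", degree,
          " train MSE:", mean((y[train] - pred[train])^2),
          " test MSE:",  mean((y[-train] - pred[-train])^2), "\n")
    }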
Variance vs bias
[Figure: three simulated examples; for each, the data with fits of varying flexibility and the corresponding curves of MSE, squared bias, and variance against flexibility]
$E[(\hat{f}(x) - f(x))^2] = \underbrace{[f(x) - E[\hat{f}(x)]]^2}_{\text{Bias}^2} + \underbrace{\text{Var}[\hat{f}(x)]}_{\text{Variance}}$
- Bias decreases with higher flexibility
- Variance increases with higher flexibility
Classification
- Regression: Measurement error measured by $E[(Y^* - \hat{f}(x^*))^2]$
- Classification: Error rate $E[I(Y \neq \hat{Y})]$
  - Can be estimated by $\frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{y}_i)$
- Same problems as before:
  - The estimated error rate is too optimistic if estimated from the same data
  - Trade-off between bias and variance
Bayes classifier

- One can show (exercise) that the classification rule which minimizes the error rate is
  $\hat{Y} = \text{argmax}_j \Pr(Y = j \mid X = x)$
- This is called the Bayes classifier (a small numerical sketch follows after the figure below)
[Figure: simulated two-class data in the (X1, X2) plane with the Bayes decision boundary. The borders between the regions are called Bayes decision boundaries.]
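A minimal R sketch of the Bayes classifier in a made-up setting where $\Pr(Y = j \mid X = x)$ is known exactly: two equally likely classes with $X \mid Y = 1 \sim N(0, 1)$ and $X \mid Y = 2 \sim N(2, 1)$.

    ## Made-up example with known class distributions, so the Bayes rule is exact
    bayes_class <- function(x) {
      p1 <- 0.5 * dnorm(x, mean = 0)  # prior * density for class 1
      p2 <- 0.5 * dnorm(x, mean = 2)  # prior * density for class 2
      ifelse(p1 > p2, 1, 2)           # argmax_j of Pr(Y = j | X = x)
    }
    bayes_class(c(0.5, 1.5))          # returns 1 2; the decision boundary is x = 1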
K-nearest neighbors

- Ideal: Use the Bayes classifier
- In practice $\Pr(Y = k \mid X = x^*)$ is unknown and needs to be estimated.
- K-nearest neighbors: Let $N_0 \subset \{1, ..., n\}$ be the set of K indexes with $x_i$ nearest to $x^*$.
- Estimation of $\Pr(Y = k \mid X = x^*) = E[I(Y = k) \mid X = x^*]$:
  $\widehat{\Pr}(Y = k \mid X = x^*) = \frac{1}{K} \sum_{i \in N_0} I(y_i = k)$
- Classification rule: Classify a test point with $X = x^*$ to the class with the highest number of cases among the K nearest points. (A sketch using the class package follows below.)
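A minimal R sketch of K-nearest neighbors using knn() from the class package, on simulated two-class data (the data-generating mechanism and the values of K are made up to mirror the figures below):

    ## Simulated two-class data; settings chosen to mirror the figures below
    library(class)
    set.seed(1)
    n <- 200
    X <- matrix(rnorm(2 * n), n, 2)   # two explanatory variables
    y <- factor(ifelse(X[, 1] + X[, 2] + rnorm(n) > 0, "Up", "Down"))
    train <- sample(n, n / 2)

    for (K in c(1, 10, 100)) {
      pred_tr <- knn(X[train, ], X[train, ],  y[train], k = K)
      pred_te <- knn(X[train, ], X[-train, ], y[train], k = K)
      cat("K =", K,
          " training error:", mean(pred_tr != y[train]),
          " test error:",     mean(pred_te != y[-train]), "\n")
    }

With K = 1 the training error is zero by construction, while the test error behaves as in the error-rate figure below.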
[Figure: KNN decision boundaries on the simulated two-class data for K = 10, K = 1, and K = 100]

[Figure: training and test error rates plotted against 1/K]
Plan
- Chapter 2: Linear models
  - Linear regression
  - Logistic regression
  - Least squares and maximum likelihood
  - Also parts of Appendix A
- Chapter 3: Optimism, conflicts and trade-offs
  - General concepts, evaluation, comparison
- Chapter 4: Regression methods
- Chapter 5: Classification methods
- Chapter 6: Cluster analysis
A. Azzalini and B. Scarpa. Data Analysis and Data Mining: An Introduction. OUP USA, 2012.