PART I: INTRODUCTION TO STATISTICAL LEARNING
Donglin Zeng, Department of Biostatistics, University of North Carolina


Statistical Decision Theory


Definition of statistical learning

• My definition: statistical learning is a framework of statistical methods and computational algorithms that use data generated from a probability distribution, with the goal of either prediction or data extraction in future applications.

• Statistical learning consists of developing
  – statistical methods;
  – computational algorithms.

• Statistical learning concerns
  – empirical data and their randomness.

• Statistical learning aims for
  – future prediction;
  – understanding future data patterns.

• Hence, many scientific disciplines play roles in statistical learning: probability and statistics, computer science, data science, informatics, and subject-area applications.


Other names for statistical learning

• Machine learning, data mining
• Pattern recognition
• Supervised learning and unsupervised learning
• Data analytics or predictive analytics


Comparisons between statistical learning and statistical inference

• Traditional statistical inference focuses on understanding the distributional behavior of data.

• In statistical inference, estimation and hypothesis testing (inference) of distribution parameters are of most interest; bias, consistency, and efficiency are the main concerns.

• In statistical learning, distribution estimation is less important than learning goals such as prediction and feature extraction.

• Thus, prediction accuracy is most important in statistical learning.

• Consistency of the prediction rule and control of the expected risk are of more concern in statistical learning.

• However,
  – both assume the data are randomly generated from some underlying distribution, and so account for random behavior in their procedures;
  – both rely on data-dependent objective functions for estimation and inference;
  – both, more or less, involve developing statistical models for the data and computational algorithms for execution;
  – more specifically, supervised learning is analogous to regression, and unsupervised learning to density estimation.


Challenges in modern statistical learning

• Method challenges: what kinds of methods and models make it possible to achieve the prediction goals?

• Data challenges: how should data complexity be handled (dimensionality, heterogeneous structure, missing data, etc.)?

• Algorithm challenges: what kinds of computational algorithms are suitable for the estimation problem and the data?

• Inference challenges: how well does the method perform when applied to future data?


Example 1. Email Spam Data


Example 2. Prostate Cancer Data


Example 3. Handwritten Digit Data


Example 4. DNA Expression Data


Overview of lectures on statistical learning

– I will introduce a number of statistical or machine learning methods.

– I will discuss the probabilistic and statistical theory behind the learning methods.

– Computational algorithms and examples will be used throughout the lectures.


What you should know

• Many data examples and figures are taken from Hastie, Tibshirani and Friedman’s book.

• A number of R algorithms and examples are taken from a variety of publicly available web sources.

• All errors in this course are mine.


Statistical Decision Theory (Supervised Learning)

• The goal of supervised learning is to learn a prediction rule that predicts the outcome given a subject’s feature variables.

• The components in supervised learning
  – $X$: feature variables
  – $Y$: outcome variable (continuous, categorical, ordinal)

• We assume that $(X, Y)$ follows some distribution.
• We aim to determine a prediction rule

  $f : X \to Y$

  using the available data $(X_1, Y_1), \ldots, (X_n, Y_n)$, called the training data or training sample.
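As a rough illustration of "determining a prediction rule from training data", the sketch below builds a rule f from a list of (x, y) pairs. The nearest-neighbor rule used here is only a stand-in chosen for brevity, not a method taken from these slides; the function name `fit_nearest_neighbor` and the toy data are hypothetical, and the feature is assumed to be a single number.

```python
def fit_nearest_neighbor(training_data):
    """Return a prediction rule f built from training pairs (x_i, y_i).

    Assumption: x is a single number; y may be continuous or categorical.
    """
    data = list(training_data)
    if not data:
        raise ValueError("training_data must be non-empty")

    def f(x):
        # Predict the outcome of the training subject whose feature is closest to x.
        _, y_nearest = min(data, key=lambda pair: abs(pair[0] - x))
        return y_nearest

    return f


# Hypothetical training sample (X_1, Y_1), ..., (X_4, Y_4) with a categorical outcome.
training_data = [(0.0, "no"), (1.0, "no"), (2.0, "yes"), (3.0, "yes")]
f = fit_nearest_neighbor(training_data)
prediction = f(2.4)  # closest training feature is 2.0
```

The point of the sketch is the shape of the problem: the learning step takes the training sample as input and returns a function that maps a new feature value to a predicted outcome.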


Loss function to assess a prediction rule

• A loss function is a functional that, for a given rule $f$ and a specific subject with $(X, Y)$, quantifies the loss incurred due to imprecise prediction.

• General notation: $L(y, x; f)$, but usually it is defined based on a certain metric between $y$ and $f(x)$. For the latter, we use $L(y, f(x))$.

• Examples of typical loss functions
  – squared loss: $L(y, f) = (y - f)^2$
  – absolute deviation loss: $L(y, f) = |y - f|$
  – Huber loss: $L(y, f) = (y - f)^2 I(|y - f| < \delta) + (2\delta|y - f| - \delta^2) I(|y - f| \ge \delta)$
  – zero-one loss: $L(y, f) = I(y \ne f)$
  – preference loss: $L(y_1, y_2, f_1, f_2) = 1 - I(y_1 < y_2, f_1 < f_2)$
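The loss functions listed above can be written out directly. The following is a minimal sketch in plain Python; the function names and the default threshold `delta=1.0` are assumptions made for illustration, and `f` denotes the predicted value f(x).

```python
def squared_loss(y, f):
    """Squared loss: (y - f)^2."""
    return (y - f) ** 2


def absolute_loss(y, f):
    """Absolute deviation loss: |y - f|."""
    return abs(y - f)


def huber_loss(y, f, delta=1.0):
    """Huber loss as defined above: quadratic for small residuals, linear for large ones.

    delta=1.0 is an illustrative default, not a value from the slides.
    """
    r = abs(y - f)
    if r < delta:
        return r ** 2
    return 2 * delta * r - delta ** 2


def zero_one_loss(y, f):
    """Zero-one loss: 1 if the prediction differs from the outcome, else 0."""
    return 1 if y != f else 0


def preference_loss(y1, y2, f1, f2):
    """Preference loss: 1 - I(y1 < y2, f1 < f2)."""
    return 0 if (y1 < y2 and f1 < f2) else 1
```

For example, with `delta=1.0` a residual of 0.5 falls in the quadratic branch of the Huber loss, while a residual of 3 falls in the linear branch, so large errors are penalized less heavily than under squared loss.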


Plot of loss function

[Figure: loss functions plotted against x, for x from −2 to 2; vertical axis from 0 to 4.]


Statistical framework for supervised learning

• Feature variables: $X$; outcome: $Y$.
• Loss function: $L(y, f(x))$.
• The goal is to find the optimal prediction rule $f^*$ that minimizes the expected prediction error:

  $\mathrm{EPE}(f) = E[L(Y, f(X))].$

• Training data: $(X_1, Y_1), \ldots, (X_n, Y_n)$.
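Because the distribution of (X, Y) is unknown in practice, EPE(f) cannot be computed exactly, but it can be approximated by averaging the loss over a sample. The sketch below illustrates this with squared loss under a purely hypothetical data-generating model (X uniform on (0, 1), Y = 2X plus Gaussian noise); the names `empirical_risk`, `risk_linear` and `risk_constant` are illustrative. For an honest estimate of EPE, the sample should be independent of the data used to choose f.

```python
import random


def empirical_risk(f, sample, loss):
    """Average loss of rule f over (x, y) pairs; an estimate of EPE(f) = E[L(Y, f(X))]."""
    return sum(loss(y, f(x)) for x, y in sample) / len(sample)


def squared_loss(y, f):
    return (y - f) ** 2


# Hypothetical distribution (an assumption for illustration only):
# X ~ Uniform(0, 1), Y = 2X + noise, noise ~ Normal(0, 0.1^2).
rng = random.Random(0)
sample = []
for _ in range(5000):
    x = rng.random()
    sample.append((x, 2 * x + rng.gauss(0, 0.1)))

# Compare two candidate rules: the true regression line and a constant prediction.
risk_linear = empirical_risk(lambda x: 2 * x, sample, squared_loss)
risk_constant = empirical_risk(lambda x: 1.0, sample, squared_loss)
```

Under this simulated model the rule f(x) = 2x should have an estimated risk close to the noise variance, while the constant rule additionally pays for ignoring the feature, so its estimated risk is markedly larger.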
