
PART I: INTRODUCTION TO STATISTICAL LEARNING

Donglin Zeng, Department of Biostatistics, University of North Carolina

Statistical Decision Theory

Definition of statistical learning

- My definition: statistical learning is a framework of statistical methods and computational algorithms that use data generated from a probability distribution, with the goal of either prediction or data extraction in future applications.

- Statistical learning consists of developing
  – statistical methods;
  – computational algorithms.

- Statistical learning concerns
  – empirical data and their randomness.

- Statistical learning aims for
  – future prediction;
  – understanding future data patterns.

- Hence, many scientific disciplines play roles in statistical learning: probability and statistics, computer science, data science, informatics, and subject-area applications.

Other names for statistical learning

- Machine learning, data mining
- Pattern recognition
- Supervised learning and unsupervised learning
- Data analytics or predictive analytics

Comparisons between statistical learning and statistical inference

- Traditional statistical inference focuses on understanding the distributional behavior of data.

- In statistical inference, estimation and hypothesis testing (inference) of distribution parameters are of most interest; bias, consistency, and efficiency are the main concerns.

- In statistical learning, distribution estimation is less important than the learning goals themselves, such as prediction and feature extraction.

- Thus, prediction accuracy is most important in statistical learning.

- Prediction rule consistency and expected risk control are of more concern in statistical learning.

- However,
  – both assume data are randomly generated from some underlying distribution, so both account for random behavior in their procedures;
  – both rely on data-dependent objective functions for estimation and inference;
  – both, more or less, involve the development of statistical models for the data and computational algorithms for execution;
  – more specifically, supervised learning is analogous to regression, and unsupervised learning to density estimation.

Challenges in modern statistical learning

- Method challenges: what kinds of methods/models enable the achievement of prediction goals?

- Data challenges: how to deal with data complexity: dimensionality, heterogeneous structure, missing data, etc.

- Algorithm challenges: what kinds of computational algorithms are suitable for the estimation method and the data?

- Inference challenges: how well does a learned rule perform when applied to future data?

Example 1. Email Spam Data

Example 2. Prostate Cancer Data

Example 3. Handwritten Digit Data

Example 4. DNA Expression Data

Overview of lectures on statistical learning

– I will introduce a number of statistical and machine learning methods.

– I will discuss the probabilistic and statistical theory behind learning methods.

– Computational algorithms and examples will be used throughout the lectures.

What you should know

- Many data examples and figures are taken from Hastie, Tibshirani and Friedman's book.

- A number of R algorithms and examples are taken from a variety of publicly available web sources.

I All errors in this course are mine.

Statistical Decision Theory (Supervised Learning)

- The goal of supervised learning is to learn a prediction rule that predicts the outcome given a subject's feature variables.

- The components in supervised learning
  – X: feature variables
  – Y: outcome variable (continuous, categorical, ordinal)

- We assume that (X, Y) follows some joint distribution.

- We aim to determine a prediction rule:

f : X → Y

using available data (X1, Y1), ..., (Xn, Yn), called the training data or training sample.
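The setup above can be sketched in code. The course's examples are in R, but as an illustration only, here is a minimal Python sketch: the training sample is simulated (a hypothetical linear relationship, not from the course data), and the prediction rule f is fit by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training sample (X_1, Y_1), ..., (X_n, Y_n):
# for illustration, Y depends linearly on X plus Gaussian noise.
n = 200
X = rng.uniform(-2.0, 2.0, size=n)
Y = 1.0 + 2.0 * X + rng.normal(scale=0.5, size=n)

# One simple prediction rule f: X -> Y is a least-squares linear fit.
A = np.column_stack([np.ones(n), X])  # design matrix with intercept
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)

def f(x):
    """Learned prediction rule mapping a feature value x to a predicted outcome."""
    return coef[0] + coef[1] * x
```

With enough training data, the fitted coefficients recover the intercept and slope used to generate the sample; any other family of rules (trees, nearest neighbors, etc.) could play the same role as f here.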

Loss function to assess a prediction rule

- A loss function is a functional that quantifies, for a given rule f and a specific subject with (X, Y), the loss incurred due to prediction imprecision.

- General notation: L(y, x; f), but usually the loss is defined based on a certain metric between y and f(x). For the latter, we write L(y, f(x)).

- Examples of typical loss functions
  – squared loss: L(y, f) = (y − f)²
  – absolute deviation loss: L(y, f) = |y − f|
  – Huber loss: L(y, f) = (y − f)² I(|y − f| < δ) + (2δ|y − f| − δ²) I(|y − f| ≥ δ)
  – zero-one loss: L(y, f) = I(y ≠ f)
  – preference loss: L(y₁, y₂, f₁, f₂) = 1 − I(y₁ < y₂, f₁ < f₂)
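The first four losses above translate directly into code. This Python sketch (an illustration, not from the course materials) implements them elementwise; note the Huber scaling here matches the slide's formula, which is continuous at |y − f| = δ.

```python
import numpy as np

def squared_loss(y, f):
    # L(y, f) = (y - f)^2
    return (y - f) ** 2

def absolute_loss(y, f):
    # L(y, f) = |y - f|
    return np.abs(y - f)

def huber_loss(y, f, delta=1.0):
    # Quadratic for small residuals, linear beyond delta;
    # the two pieces agree at |y - f| = delta: delta^2 = 2*delta*delta - delta^2.
    r = np.abs(y - f)
    return np.where(r < delta, (y - f) ** 2, 2.0 * delta * r - delta ** 2)

def zero_one_loss(y, f):
    # L(y, f) = I(y != f), for categorical outcomes
    return np.asarray(y != f, dtype=float)
```

For a residual of 3 with δ = 1, the Huber loss is 2·1·3 − 1 = 5, growing linearly rather than quadratically, which is what makes it robust to outliers.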

Plot of loss functions

[Figure: loss functions plotted against the residual over (−2, 2); vertical axis from 0 to 4.]

Statistical framework for supervised learning

- Feature variables: X; outcome: Y.

- Loss function: L(y, f(x)).

- The goal is to find the optimal prediction rule f* minimizing the expected prediction error:

EPE(f) = E[L(Y, f(X))].

- Training data: (X1, Y1), ..., (Xn, Yn).
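To make EPE(f) concrete, it can be approximated by Monte Carlo. This Python sketch (a hypothetical simulated distribution, not course data) uses squared loss, for which the optimal rule is the conditional mean f*(x) = E[Y | X = x]; its EPE is the irreducible noise variance, and any other rule has larger expected loss.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical joint distribution of (X, Y): Y = sin(X) + noise.
m = 100_000
X = rng.uniform(-np.pi, np.pi, size=m)
Y = np.sin(X) + rng.normal(scale=0.3, size=m)

def epe(f):
    """Monte Carlo estimate of EPE(f) = E[L(Y, f(X))] under squared loss."""
    return float(np.mean((Y - f(X)) ** 2))

# The optimal rule under squared loss is f*(x) = E[Y | X = x] = sin(x);
# its EPE is the noise variance 0.3^2 = 0.09.
epe_best = epe(np.sin)
epe_zero = epe(lambda x: np.zeros_like(x))  # a naive constant rule, for comparison
```

The estimate for the conditional-mean rule lands near 0.09, while the constant rule pays the extra E[sin²(X)] = 0.5, illustrating why EPE, not fit to the training data, is the quantity supervised learning tries to control.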
