22
STAT 151B Administrative Info Introduction Statistics 151B: Modern Statistical Prediction and Machine Learning Overview and introduction

Statistics 151B: Modern Statistical Prediction and …jon/stat-151b-spring-2012/lecture... · Modern Statistical Prediction and Machine Learning ... statistical inference ... methods

  • Upload
    buitram

  • View
    217

  • Download
    0

Embed Size (px)

Citation preview

STAT 151B

AdministrativeInfo

IntroductionStatistics 151B:

Modern Statistical Prediction andMachine Learning

Overview and introduction

STAT 151B

AdministrativeInfo

Introduction

Administrative information

Homepage: http://www.stat.berkeley.edu/∼jon/

stat-151b-spring-2012

All announcements and materials here.No printouts distributed in lecture.

Instructor: Prof. J. [email protected] hours: Thu 12:30 PM - 1:30 PM

Graduate student instructor: Jeff [email protected] hours: TBA

STAT 151B

AdministrativeInfo

Introduction

Administrative information

Course texts:T. Hastie, R. Tibshirani, J. Friedman, The Elementsof Statistical Learning: Data Mining, Inference andPrediction, 2nd ed., 5th printing. Free pdf.P. Dalgaard, Introductory Statistics with R, 2nd ed.

Prerequisites:basic linear algebra (Ax = b, Ax = λx)multivariate calculus (∂f/∂xi , ∇x f )statistical inference

This is not a math course, but it uses math in anessential way to make ideas precise.

STAT 151B

AdministrativeInfo

Introduction

Computing

Platform: the R statistical computing environment,http://www.R-project.org .Freely available, open source.Installation packages available for Windows, Mac,Linuxen.The de facto standard for statistical computing all overthe world.You will become proficient in R: section will cover theDalgaard book extensively (learning R and reviewingbasic statistical inference).

STAT 151B

AdministrativeInfo

Introduction

Homework

Four assignments over the course of the semester, dueroughly every two weeks.Work in groups of three.Data analysis in R (submit a transcript of your Rsession, source files, brief written summaries), usingtechniques from the course.Problem-solving to check your understanding oftheories and methods.No late submissions.30% of the final grade.

STAT 151B

AdministrativeInfo

Introduction

Midterm

Topics covered on the midterm will be clearly indicated.One of the weekly discussion sections will be devotedto review/Q&A.Thursday March 22nd, in class.30% of the final grade.

STAT 151B

AdministrativeInfo

Introduction

Final project

This year, something new: a competition on a contestdataset.Build the best prediction rule you can using ideas andmethods from the course.Write a project report describing your analysis andresults (guidelines provided beforehand).Work in the same group of three.Project grade depends on quality of work and report,not standing in the competition.I will open the contest on Kaggle in early April.Jeff and I will compete. . . prepare to get owned.40% of the final grade.

STAT 151B

AdministrativeInfo

Introduction

Final project – alternative

A substantial data analysis project.You pick

an area/topic/field/subject you are interested in,a nontrivial dataset related to it,a question to answer or theory to test.

Address the question or theory using some of the ideasand methods from the course.Write a project report describing your analysis andresults (guidelines provided beforehand).Work in same group of three.40% of the final grade.

STAT 151B

AdministrativeInfo

Introduction

FINAL PROJECT DEADLINES

Thursday April 5th: for those opting out of competition,final project proposal due at the beginning of lecture.Friday May 11th: final project report due at 3pm.

Last day permitted by the university. No extensionis possible.

You cannot pass the course without doing the finalproject.Get in front of this project. Do not leave it to thelast minute.If you like, please come to my office hours with yourideas or questions about a candidate project.

STAT 151B

AdministrativeInfo

Introduction

Example: spam filtering

• Customize an email spam detection system.

STAT 151B

AdministrativeInfo

Introduction

Example: handwritten-digit recognition

Elements of Statistical Learning c!Hastie, Tibshirani & Friedman 2001 Chapter 1

Figure 1.2: Examples of handwritten digits from U.S.

postal envelopes.

STAT 151B

AdministrativeInfo

Introduction

Example: gene finding

!"#$%#&#$'(&)(&*$+,-./#0

12 32

!"#45-&$6 45-&$7 45-&$3 45-&$89&:,-&$6 9&:,-&$7 9&:,-&$3

;-/<=$>(*&?/+<,(0()(&#:,?@:

A,?&@";-(&:B!%#B

C;/(@#$>(:#B=%

C;/(@#$>(:#%%!%=%

!,?&>/?:(-&9&(:(?:(-&=!%

C:-;$@-)-&!=%D!%=D!==

+,-0-:#,!=!=

STAT 151B

AdministrativeInfo

Introduction

Example: typing cancer with microarrays

Elem

ents

ofSta

tisticalLea

rnin

gc!

Hastie,

Tib

shira

ni&

Fried

man

2001

Chapter

1

SID42354

SID31984

SID301902

SIDW128368

SID375990

SID360097

SIDW325120

ESTsChr.10

SIDW365099

SID377133

SID381508

SIDW308182

SID380265

SIDW321925

ESTsChr.15

SIDW362471

SIDW417270

SIDW298052

SID381079

SIDW428642

TUPLE1TUP1

ERLUMEN

SIDW416621

SID43609

ESTs

SID52979

SIDW357197

SIDW366311

ESTs

SMALLNUC

SIDW486740

ESTs

SID297905

SID485148

SID284853

ESTsChr.15

SID200394

SIDW322806

ESTsChr.2

SIDW257915

SID46536

SIDW488221

ESTsChr.5

SID280066

SIDW376394

ESTsChr.15

SIDW321854

WASWiskott

HYPOTHETIC

SIDW376776

SIDW205716

SID239012

SIDW203464

HLACLASSI

SIDW510534

SIDW279664

SIDW201620

SID297117

SID377419

SID114241

ESTsCh31

SIDW376928

SIDW310141

SIDW298203

PTPRC

SID289414

SID127504

ESTsChr.3

SID305167

SID488017

SIDW296310

ESTsChr.6

SID47116

MITOCHOND

Chr

SIDW376586

Homosapiens

SIDW487261

SIDW470459

SID167117

SIDW31489

SID375812

DNAPOLYME

SID377451

ESTsChr.1

MYBPROTO

SID471915

ESTs

SIDW469884

HumanmRNA

SIDW377402

ESTs

SID207172

RASGTPASE

SID325394

H.sapiensmRN

GNAL

SID73161

SIDW380102

SIDW299104

BREAST

RENAL

MELANOMA

MELANOMA

MCF7D-repro

COLON

COLONK562B-repro

COLON

NSCLC

LEUKEMIA

RENAL

MELANOMA

BREAST

CNS

CNS

RENAL

MCF7A-repro

NSCLCK562A-repro

COLON

CNS

NSCLC

NSCLC

LEUKEMIA

CNS

OVARIAN

BREAST

LEUKEMIA

MELANOMA

MELANOMA

OVARIANOVARIAN

NSCLC

RENAL

BREAST

MELANOMA

OVARIAN

OVARIAN

NSCLC

RENAL

BREAST

MELANOMA

LEUKEMIA

COLONBREAST

LEUKEMIA

COLON

CNS

MELANOMA

NSCLC

PROSTATE

NSCLC

RENAL

RENAL

NSCLC

RENALLEUKEMIA

OVARIAN

PROSTATE

COLON

BREAST

RENAL

UNKNOWN

Figure

1.3:D

NA

mic

roarra

ydata

:expre

ssion

matrix

of

6830

genes

(rows)

and

64

sam

ples

(colu

mns),

for

the

hum

an

tum

or

data

.O

nly

ara

ndom

sam

ple

of

100

rows

are

shown.

The

disp

lay

isa

hea

tm

ap,ra

ngin

gfro

mbrig

ht

gree

n(n

ega-

tive,under

expre

ssed)

tobrig

htred

(positiv

e,over

expre

ssed).

Missin

gvalu

es

are

gra

y.

The

rows

and

colu

mns

are

disp

layed

ina

random

lychose

nord

er.

STAT 151B

AdministrativeInfo

Introduction

Example: phoneme detection• Classify a recorded phoneme, based on a log-periodogram.

STAT 151B

AdministrativeInfo

Introduction

Example: Land usage from LANDSAT images• Classify the pixels in a LANDSAT image, according to usage:

{red soil, cotton, vegetation stubble, mixture, gray soil, damp gray soil,

very damp gray soil}

STAT 151B

AdministrativeInfo

Introduction

Example: face detection

!"#$%&''( )*+,-./0%1-%2,./%#34./001-56%7/8/.814-% ((

9:;/31</-8,=%$/0>=806%?$4@=/A%/8%,=B%CDE

STAT 151B

AdministrativeInfo

Introduction

Example: seizure prediction

STAT 151B

AdministrativeInfo

Introduction

Ex: “Collaborative filtering” (the Netflix prize)

CS 277: Data Mining Netflix Competition Overview Padhraic Smyth, UC Irvine: 8

Ratings Data

1 3 4

3 5 5

4 5 5

3

3

2 ? ?

?

2 1 ?

3 ?

1

Test Data Set (most recent ratings)

480,000 users

17,700 movies

STAT 151B

AdministrativeInfo

Introduction

Statistical prediction

AKA “supervised learning” in the machine learningcommunity. Ingredients of the general framework:An outcome Y (called the response variable,dependent variable, or target).A p-vector of observables X (called the predictors,covariates, features, inputs, regressors, or independentvariables).Regression: Y is a continuous quantity (price, weight,concentration).Classification: Y takes values in a finite, unordered set:

clinical outcome: { lived, died }consumer credit: { bankrupt in 2006, or not }cancer type: { breast, melanoma, leukemia, . . . }

STAT 151B

AdministrativeInfo

Introduction

Objectives

We have training data {(x1, y1), . . . , (xN , yN)}. These areobservations, examples, or instances. (Don’t call themsamples; the whole training data set is one sample.)Using the training data, the goals are:

Build a rule to predict ynew from xnew.Understand which predictors are related to theresponse.Estimate what the quality of future predictions andinferences will be.

STAT 151B

AdministrativeInfo

Introduction

Schedule of topics

[See syllabus]

STAT 151B

AdministrativeInfo

Introduction

A statistician’s manifesto (Trevor Hastie)

Understand the ideas behind the statistical methods, soyou know how to use them, when to use them, whennot to use them.Complicated methods build on simple methods.Understand the simple methods first.The results of a method are of little use without anassessment of how well or poorly it is doing.

Corollary : simple methods are sometimes just asgood as complicated methods; “technology tendsto overwhelm common sense” (D. Freedman)

Statistical prediction is an active and exciting area ofresearch, touching science, industry, and finance.