STAT 151B
Administrative Info
Introduction
Statistics 151B: Modern Statistical Prediction and Machine Learning
Overview and introduction
Administrative information
Homepage: http://www.stat.berkeley.edu/~jon/stat-151b-spring-2012
All announcements and materials here. No printouts distributed in lecture.
Instructor: Prof. J. McAuliffe, jon@stat.berkeley.edu, office hours: Thu 12:30 PM - 1:30 PM
Graduate student instructor: Jeff Regier, jeff@stat.berkeley.edu, office hours: TBA
Administrative information
Course texts:
• T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., 5th printing. Free pdf.
• P. Dalgaard, Introductory Statistics with R, 2nd ed.
Prerequisites:
• basic linear algebra (Ax = b, Ax = λx)
• multivariate calculus (∂f/∂x_i, ∇_x f)
• statistical inference
This is not a math course, but it uses math in an essential way to make ideas precise.
Computing
• Platform: the R statistical computing environment, http://www.R-project.org.
• Freely available, open source.
• Installation packages available for Windows, Mac, Linuxen.
• The de facto standard for statistical computing all over the world.
• You will become proficient in R: section will cover the Dalgaard book extensively (learning R and reviewing basic statistical inference).
Homework
• Four assignments over the course of the semester, due roughly every two weeks.
• Work in groups of three.
• Data analysis in R (submit a transcript of your R session, source files, brief written summaries), using techniques from the course.
• Problem-solving to check your understanding of theories and methods.
• No late submissions.
• 30% of the final grade.
Midterm
• Topics covered on the midterm will be clearly indicated.
• One of the weekly discussion sections will be devoted to review/Q&A.
• Thursday March 22nd, in class.
• 30% of the final grade.
Final project
• This year, something new: a competition on a contest dataset.
• Build the best prediction rule you can using ideas and methods from the course.
• Write a project report describing your analysis and results (guidelines provided beforehand).
• Work in the same group of three.
• Project grade depends on quality of work and report, not standing in the competition.
• I will open the contest on Kaggle in early April.
• Jeff and I will compete... prepare to get owned.
• 40% of the final grade.
Final project – alternative
• A substantial data analysis project.
• You pick
  • an area/topic/field/subject you are interested in,
  • a nontrivial dataset related to it,
  • a question to answer or theory to test.
• Address the question or theory using some of the ideas and methods from the course.
• Write a project report describing your analysis and results (guidelines provided beforehand).
• Work in the same group of three.
• 40% of the final grade.
FINAL PROJECT DEADLINES
• Thursday April 5th: for those opting out of competition, final project proposal due at the beginning of lecture.
• Friday May 11th: final project report due at 3pm.
  • Last day permitted by the university. No extension is possible.
• You cannot pass the course without doing the final project.
• Get in front of this project. Do not leave it to the last minute.
• If you like, please come to my office hours with your ideas or questions about a candidate project.
Example: spam filtering
• Customize an email spam detection system.
Example: handwritten-digit recognition
Figure 1.2: Examples of handwritten digits from U.S. postal envelopes. (Elements of Statistical Learning, © Hastie, Tibshirani & Friedman 2001, Chapter 1.)
Example: gene finding
Figure: The Gene Finding Problem. A gene shown 5' to 3' as alternating exons (Exon 1 to Exon 4) and introns (Intron 1 to Intron 3), annotated with sequence signals: promoter (TATA), translation initiation (ATG), splice sites, branchpoint, polypyrimidine tract, and stop codon (TAG/TGA/TAA).
Example: typing cancer with microarrays
Figure 1.3: DNA microarray data: expression matrix of 6830 genes (rows) and 64 samples (columns), for the human tumor data. Only a random sample of 100 rows are shown. The display is a heat map, ranging from bright green (negative, under expressed) to bright red (positive, over expressed). Missing values are gray. The rows and columns are displayed in a randomly chosen order. (Elements of Statistical Learning, © Hastie, Tibshirani & Friedman 2001, Chapter 1.)
Example: phoneme detection
• Classify a recorded phoneme, based on a log-periodogram.
Example: Land usage from LANDSAT images
• Classify the pixels in a LANDSAT image, according to usage: {red soil, cotton, vegetation stubble, mixture, gray soil, damp gray soil, very damp gray soil}
Example: face detection
Figure: Experimental results (Rowley et al.). Source slide: Advances in Face Processing: Detection.
Ex: “Collaborative filtering” (the Netflix prize)
Figure: Ratings Data. A ratings matrix over 480,000 users and 17,700 movies; known ratings appear as numbers (1 to 5), and entries marked ? belong to the Test Data Set (most recent ratings). (Padhraic Smyth, UC Irvine, CS 277: Data Mining, Netflix Competition Overview.)
Statistical prediction
AKA “supervised learning” in the machine learning community. Ingredients of the general framework:
• An outcome Y (called the response variable, dependent variable, or target).
• A p-vector of observables X (called the predictors, covariates, features, inputs, regressors, or independent variables).
• Regression: Y is a continuous quantity (price, weight, concentration).
• Classification: Y takes values in a finite, unordered set:
  • clinical outcome: { lived, died }
  • consumer credit: { bankrupt in 2006, or not }
  • cancer type: { breast, melanoma, leukemia, ... }
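The difference between regression and classification can be made concrete with a small sketch. The slide does not prescribe any method; a k-nearest-neighbor rule is used here purely as an illustration, the toy data are made up, and the function names are hypothetical. The course itself uses R; Python is used here only for illustration.

```python
from collections import Counter
import math

def nearest(train, x_new, k):
    """Return the k training pairs (x, y) whose x is closest to x_new."""
    return sorted(train, key=lambda pair: math.dist(pair[0], x_new))[:k]

def predict_regression(train, x_new, k=3):
    """Continuous Y: average the responses of the k nearest neighbors."""
    neighbors = nearest(train, x_new, k)
    return sum(y for _, y in neighbors) / len(neighbors)

def predict_classification(train, x_new, k=3):
    """Finite, unordered Y: majority vote among the k nearest neighbors."""
    neighbors = nearest(train, x_new, k)
    return Counter(y for _, y in neighbors).most_common(1)[0][0]

# Made-up toy data: each x is a p-vector of observables with p = 2.
regression_train = [
    ((1.0, 1.0), 10.0), ((1.0, 2.0), 12.0), ((2.0, 1.0), 14.0),
    ((8.0, 8.0), 50.0), ((9.0, 8.0), 52.0),
]
classification_train = [
    ((1.0, 1.0), "melanoma"), ((1.0, 2.0), "melanoma"), ((2.0, 1.0), "melanoma"),
    ((8.0, 8.0), "leukemia"), ((9.0, 8.0), "leukemia"), ((8.0, 9.0), "leukemia"),
]

y_hat_reg = predict_regression(regression_train, (1.5, 1.5))
y_hat_cls = predict_classification(classification_train, (8.5, 8.5))
print(y_hat_reg, y_hat_cls)
```

The two predictors share the same inputs X and differ only in how they combine the neighbors' responses: a numeric average for a continuous Y, a vote for a categorical Y.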
Objectives
We have training data {(x_1, y_1), ..., (x_N, y_N)}. These are observations, examples, or instances. (Don’t call them samples; the whole training data set is one sample.) Using the training data, the goals are:
  • Build a rule to predict y_new from x_new.
  • Understand which predictors are related to the response.
  • Estimate what the quality of future predictions and inferences will be.
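A minimal sketch of these goals, under assumptions the slide does not state: the data are simulated from a known line plus noise, the rule is a least-squares line with a single predictor, and the quality of future predictions is estimated by holding out part of the data (one common approach among several). All names are hypothetical, and Python is used only for illustration since the course platform is R.

```python
import random

def fit_line(pairs):
    """Least-squares fit of y = a + b * x for a single predictor."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def mean_squared_error(rule, pairs):
    """Average squared difference between observed y and predicted y."""
    return sum((y - rule(x)) ** 2 for x, y in pairs) / len(pairs)

# Simulated observations (assumption): y = 2 + 3x + standard normal noise.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.uniform(0, 10)
    data.append((x, 2.0 + 3.0 * x + rng.gauss(0, 1)))

# Build the rule on the first 150 observations only.
train, held_out = data[:150], data[150:]
a, b = fit_line(train)

def rule(x_new):
    """Predict y_new from x_new using the fitted line."""
    return a + b * x_new

# Error on held-out observations estimates the quality of future predictions.
train_mse = mean_squared_error(rule, train)
held_out_mse = mean_squared_error(rule, held_out)
print(a, b, train_mse, held_out_mse)
```

The fitted slope `b` speaks to the second goal (how the predictor relates to the response), while `held_out_mse` speaks to the third; the training error alone tends to be an optimistic estimate because the rule was tuned to those same observations.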
A statistician’s manifesto (Trevor Hastie)
• Understand the ideas behind the statistical methods, so you know how to use them, when to use them, when not to use them.
• Complicated methods build on simple methods. Understand the simple methods first.
• The results of a method are of little use without an assessment of how well or poorly it is doing.
  • Corollary: simple methods are sometimes just as good as complicated methods; “technology tends to overwhelm common sense” (D. Freedman)
• Statistical prediction is an active and exciting area of research, touching science, industry, and finance.