Essential Statistics in Biology: Getting the Numbers Right. Raphael Gottardo Clinical Research Institute of Montreal (IRCM) [email protected] http://www.rglab.org. Outline. Exploratory Data Analysis 1-2 sample t -tests, multiple testing Clustering SVD/PCA - PowerPoint PPT Presentation
Essential Statistics in Biology: Getting the Numbers Right
Raphael Gottardo, Clinical Research Institute of Montreal (IRCM)
[email protected] http://www.rglab.org
Day 1
Outline
•Exploratory Data Analysis
•1-2 sample t-tests, multiple testing
•Clustering
•SVD/PCA
•Frequentists vs. Bayesians
PCA and SVD (Multivariate analysis)
Day 1 - Section 4
Outline
•What is SVD? Mathematical definition
•Relation to Principal Component Analysis (PCA)
•Applications of PCA and SVD
•Illustration with gene expression data
SVD
Let X be a matrix of size m×n (m ≥ n) and rank r ≤ n. Then we can decompose X as
$X = U S V^T$
where X is m×n, U is m×n, S is n×n, and $V^T$ is n×n.
- U is the matrix of left singular vectors
- V is the matrix of right singular vectors
- S is a diagonal matrix whose diagonal entries are the singular values
The singular vectors give the directions; the singular values give the amplitudes.
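The decomposition can be checked numerically. The sketch below is only an illustration on a small simulated matrix (the 6×3 size and the seed are arbitrary assumptions, not taken from the slides); it uses base R's svd(), which returns the components u, d and v.

```r
set.seed(1)
# Hypothetical 6 x 3 data matrix (m = 6, n = 3)
X <- matrix(rnorm(18), nrow = 6, ncol = 3)
s <- svd(X)                      # s$u is 6 x 3, s$d has length 3, s$v is 3 x 3
# Rebuild X from its factors: X = U S V^T
X.rebuilt <- s$u %*% diag(s$d) %*% t(s$v)
max(abs(X - X.rebuilt))          # numerically zero
# The singular vectors are orthonormal
crossprod(s$u)                   # approximately the 3 x 3 identity
crossprod(s$v)                   # approximately the 3 x 3 identity
# The singular values are returned in decreasing order
s$d
```

Note that svd() returns V, not its transpose, so the right singular vectors are the columns of s$v.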
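Relation to PCA
Assume that the columns of X are centered (each variable has mean zero). Then $X^T X$ is, up to a constant, the empirical covariance matrix, and SVD is equivalent to PCA. If the rows are centered instead, the same holds for $X X^T$ and U.
The columns of V (the rows of $V^T$) are the right singular vectors, or principal components
New variables: $XV = US$. Variance: proportional to the squared singular values
Gene expression: Eigengenes or eigenassays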
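As a hedged illustration of this equivalence, the sketch below compares svd() on a column-centered matrix with base R's prcomp(). The simulated data (50 observations, 4 variables) are an assumption made for the example only.

```r
set.seed(1)
# Hypothetical data: 50 observations (rows), 4 variables (columns)
X <- matrix(rnorm(200), nrow = 50, ncol = 4)
Xc <- scale(X, center = TRUE, scale = FALSE)   # center each column
s <- svd(Xc)
p <- prcomp(X, center = TRUE, scale. = FALSE)
# Principal directions agree up to the sign of each column
diff.dir <- max(abs(abs(s$v) - abs(p$rotation)))
# Standard deviations of the new variables: d / sqrt(m - 1)
diff.sd <- max(abs(s$d / sqrt(nrow(X) - 1) - p$sdev))
# New variables (scores): U S = Xc V
diff.scores <- max(abs(abs(s$u %*% diag(s$d)) - abs(p$x)))
c(diff.dir, diff.sd, diff.scores)              # all numerically zero
```

The absolute values are there because singular vectors are only defined up to sign.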
Applications of SVD and PCA
•Dimension reduction (simplify a dataset)
•Clustering
•Discriminant analysis
•Exploratory data analysis tool
•Find the most important signal in data
•2D projections
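Dimension reduction can be sketched as keeping only the first k singular triplets. The example below is an illustration under assumptions: the data are a simulated noisy rank-1 matrix, and k = 1 is chosen by hand.

```r
set.seed(1)
# Hypothetical noisy rank-1 matrix: 20 rows, 5 columns
signal <- outer(rnorm(20), rnorm(5))
X <- signal + matrix(rnorm(100, sd = 0.1), 20, 5)
s <- svd(X)
k <- 1
# Keep only the first k singular vectors and values
X.k <- s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], k, k) %*% t(s$v[, 1:k, drop = FALSE])
# Share of the total sum of squares retained by the first k components
retained <- sum(s$d[1:k]^2) / sum(s$d^2)
retained
# The squared approximation error equals the sum of the discarded squared singular values
err <- sum((X - X.k)^2)
discarded <- sum(s$d[-(1:k)]^2)
c(err, discarded)
```

The same truncation with k = 2 gives the coordinates used for 2D projections.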
Toy example
s = (13.47, 1.45)
set.seed(100)
x1<-rnorm(100,0,1)
y1<-rnorm(100,1,1)
### Linear transformation that correlates x and y
var0.5<-matrix(c(1,-.5,-.5,.1),2,2)
data1<-t(var0.5%*%t(cbind(x1,y1)))
set.seed(100)
x2<-rnorm(100,2,1)
y2<-rnorm(100,2,1)
var0.5<-matrix(c(1,.5,.5,1),2,2)
data2<-t(var0.5%*%t(cbind(x2,y2)))
data<-rbind(data1,data2)
### SVD of the first group; the lines through the origin follow the right singular vectors
svd1<-svd(data1)
plot(data1,xlab="x",ylab="y",xlim=c(-6,6),ylim=c(-6,6))
abline(coef=c(0,svd1$v[2,1]/svd1$v[1,1]),col=2)
abline(coef=c(0,svd1$v[2,2]/svd1$v[1,2]),col=3)
Toy example
s = (47.79, 13.25)
### SVD of the second group
svd2<-svd(data2)
plot(data2,xlab="x",ylab="y",xlim=c(-6,6),ylim=c(-6,6))
abline(coef=c(0,svd2$v[2,1]/svd2$v[1,1]),col=2)
abline(coef=c(0,svd2$v[2,2]/svd2$v[1,2]),col=3)
### SVD of the combined data
svd<-svd(data)
plot(data,xlab="x",ylab="y",xlim=c(-6,6),ylim=c(-6,6))
abline(coef=c(0,svd$v[2,1]/svd$v[1,1]),col=2)
abline(coef=c(0,svd$v[2,2]/svd$v[1,2]),col=3)
Toy example
### Projection
data.proj<-svd$u%*%diag(svd$d)
svd.proj<-svd(data.proj)
plot(data.proj,xlab="x",ylab="y",xlim=c(-6,6),ylim=c(-6,6))
abline(coef=c(0,svd.proj$v[2,1]/svd.proj$v[1,1]),col=2)
### svd.proj$v[1,2]=0, so the second direction is drawn as a vertical line
abline(v=0,col=3)
Toy example
s = (47.17, 11.88)
[Figure: projected data plotted in the new coordinates]
Toy example
### New data
set.seed(100)
x1<-rnorm(100,-1,1)
y1<-rnorm(100,1,1)
var0.5<-matrix(c(1,-.5,-.5,1),2,2)
data1<-t(var0.5%*%t(cbind(x1,y1)))
set.seed(100)
x2<-rnorm(100,1,1)
y2<-rnorm(100,1,1)
var0.5<-matrix(c(1,.5,.5,1),2,2)
data2<-t(var0.5%*%t(cbind(x2,y2)))
data<-rbind(data1,data2)
svd1<-svd(data1)
plot(data1,xlab="x",ylab="y",xlim=c(-6,6),ylim=c(-6,6))
abline(coef=c(0,svd1$v[2,1]/svd1$v[1,1]),col=2)
abline(coef=c(0,svd1$v[2,2]/svd1$v[1,2]),col=3)
svd2<-svd(data2)
plot(data2,xlab="x",ylab="y",xlim=c(-6,6),ylim=c(-6,6))
abline(coef=c(0,svd2$v[2,1]/svd2$v[1,1]),col=2)
abline(coef=c(0,svd2$v[2,2]/svd2$v[1,2]),col=3)
svd<-svd(data)
plot(data,xlab="x",ylab="y",xlim=c(-6,6),ylim=c(-6,6))
abline(coef=c(0,svd$v[2,1]/svd$v[1,1]),col=2)
abline(coef=c(0,svd$v[2,2]/svd$v[1,2]),col=3)
Toy example
s = (26.48, 24.98)
Application to microarrays
•Dimension reduction (simplify a dataset)
•Clustering (too many samples)
•Discriminant analysis (find a group of genes)
•Exploratory data analysis tool
•Find the most important signal in data
•2D projections (clusters?)
Application to microarrays
Cho cell cycle data set, 384 genes
We have standardized the data
cho.data<-as.matrix(read.table("logcho_237_4class.txt",skip=1)[,3:19])
### Standardize each gene (row) to mean 0 and standard deviation 1
cho.mean<-apply(cho.data,1,"mean")
cho.sd<-apply(cho.data,1,"sd")
cho.data.std<-(cho.data-cho.mean)/cho.sd
svd.cho<-svd(cho.data.std)
### Contribution of each PC
barplot(svd.cho$d/sum(svd.cho$d),col=heat.colors(17))
### First three singular vectors (PCA)
plot(svd.cho$v[,1],xlab="time",ylab="Expression profile",type="b")
plot(svd.cho$v[,2],xlab="time",ylab="Expression profile",type="b")
plot(svd.cho$v[,3],xlab="time",ylab="Expression profile",type="b")
### Projection
plot(svd.cho$u[,1]*svd.cho$d[1],svd.cho$u[,2]*svd.cho$d[2],xlab="PCA 1 ",ylab="PCA 2")
plot(svd.cho$u[,1]*svd.cho$d[1],svd.cho$u[,3]*svd.cho$d[3],xlab="PCA 1 ",ylab="PCA 3")
plot(svd.cho$u[,2]*svd.cho$d[2],svd.cho$u[,3]*svd.cho$d[3],xlab="PCA 2 ",ylab="PCA 3")
### Select a cluster
ind<-(svd.cho$u[,2]*svd.cho$d[2])^2+(svd.cho$u[,3]*svd.cho$d[3])^2>5 & svd.cho$u[,2]*svd.cho$d[2]>0 & svd.cho$u[,3]*svd.cho$d[3]<0
plot(svd.cho$u[,2]*svd.cho$d[2],svd.cho$u[,3]*svd.cho$d[3],xlab="PCA 2 ",ylab="PCA 3")
points(svd.cho$u[ind,2]*svd.cho$d[2],svd.cho$u[ind,3]*svd.cho$d[3],col=2)
### Expression profiles of the selected genes
matplot(t(cho.data.std[ind,]),xlab="time",ylab="Expression profiles",type="l")
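The standardization step relies on R recycling a vector down the columns of a matrix, which is easy to misread. The sketch below checks this on a small simulated matrix standing in for the expression data (the 5×17 size and the values are assumptions; the Cho file itself is not needed), and compares it with sweep(), which states the row-wise intent explicitly.

```r
set.seed(1)
# Hypothetical stand-in for the expression matrix: 5 genes (rows), 17 time points
x <- matrix(rnorm(5 * 17, mean = 3, sd = 2), nrow = 5)
x.mean <- apply(x, 1, mean)
x.sd <- apply(x, 1, sd)
# A length-5 vector is recycled down each column, so row i is shifted by x.mean[i]
x.std <- (x - x.mean) / x.sd
range(rowMeans(x.std))        # approximately 0 for every row
apply(x.std, 1, sd)           # 1 for every row
# Same computation written with sweep()
x.std2 <- sweep(sweep(x, 1, x.mean, "-"), 1, x.sd, "/")
all.equal(x.std, x.std2)
```

The recycling shortcut only works because the vector length equals the number of rows; standardizing columns this way would give wrong results.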
Application to microarrays
[Figure: bar plot of the singular values (relative contribution), with the main contribution marked. Why?]
[Figures: singular vectors PC1, PC2 and PC3]
[Figures: projections onto PC1/PC2, PC1/PC3 and PC2/PC3]
[Figures: projection onto PC2/PC3 with the selected cluster of 24 genes]
Conclusion
•SVD is a powerful tool
•Can be very useful in gene expression data
•SVD of genes (eigen-genes)
•SVD of samples (eigen-assays)
•Mostly an EDA tool
Overview of Statistical Inference: Bayes vs. Frequentists
(If time permits)
Day 1 - Section 5
Introduction
•Parametric statistical model
•Observations $y$ are drawn from a probability distribution $f(y\mid\theta)$, where $\theta$ is the parameter vector
•Likelihood function: $\ell(\theta\mid y) = f(y\mid\theta)$ (inverted density)
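Introduction
Normal distribution
Probability distribution for one observation is
$f(y_i\mid\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(y_i-\mu)^2}{2\sigma^2}\right\}$
If the observations are independent,
$f(y\mid\mu,\sigma^2) = \prod_{i=1}^{n} f(y_i\mid\mu,\sigma^2)$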
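To make the "inverted density" idea concrete, the following sketch evaluates the normal log-likelihood as a function of the mean for 15 simulated observations from N(1,1). Treating the variance as known and equal to 1, and the grid limits, are assumptions made to keep the example one-dimensional.

```r
set.seed(100)
y <- rnorm(15, mean = 1, sd = 1)          # 15 observations from N(1,1)
# Log-likelihood of mu: sum of log densities under independence (sd fixed at 1)
loglik <- function(mu, y) sum(dnorm(y, mean = mu, sd = 1, log = TRUE))
mu.grid <- seq(-1, 3, by = 0.01)
ll <- sapply(mu.grid, loglik, y = y)
mu.max <- mu.grid[which.max(ll)]          # grid value with the highest likelihood
c(mu.max, mean(y))                        # the maximizer is the sample mean, up to grid spacing
plot(mu.grid, exp(ll - max(ll)), type = "l", xlab = "mu", ylab = "Relative likelihood")
```

Working on the log scale avoids numerical underflow when the product runs over many observations.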
Introduction
[Figure: 15 observations from N(1,1), shown with the true probability distribution]
Inference
•The parameters are unknown
•“Learn” something about the parameter vector θ from the data
•Make inference about θ
‣ Estimate θ
‣ Confidence region
‣ Test a hypothesis (θ=0)
The frequentist approach
•The parameters are fixed but unknown
•Inference is based on the relative frequency of occurrence when repeating the experiment
•For example, one can look at the variance of an estimator to evaluate its efficiency
The Normal Example: Estimation
Normal distribution: $\mu$ is the mean and $\sigma^2$ is the variance
$\hat\mu = \bar y = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $\hat\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar y)^2$ (sample mean and sample variance)
Numerical example, 15 obs. from N(1,1)
Use the theory of repeated samples to evaluate the estimators.
The Normal Example: Estimation
In our toy example, the data are normal, and we can derive the sampling distribution of the estimators.
For example, we know that $\hat\mu$ is normal with mean $\mu$ and variance $\sigma^2/n$.
The standard deviation of an estimator is called the standard error.
What if we can't derive the sampling distribution? Use the bootstrap!
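The repeated-sampling idea can be sketched by simulation: draw many data sets of 15 observations from N(1,1) and look at the spread of the sample mean. The number of repetitions (10000) is an arbitrary assumption.

```r
set.seed(100)
n <- 15
n.rep <- 10000                              # number of repeated experiments (arbitrary)
# One sample mean per simulated experiment
mu.hat <- replicate(n.rep, mean(rnorm(n, mean = 1, sd = 1)))
mean(mu.hat)                                # close to the true mean, 1
sd(mu.hat)                                  # empirical standard error
1 / sqrt(n)                                 # theoretical standard error, sigma / sqrt(n)
```

In practice only one data set is observed, which is why the theoretical result, or a resampling scheme, is needed.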
The Bootstrap
- Basic idea is to resample the data we have observed and compute a new value of the statistic/estimator for each resampled data set.
- Then one can assess the estimator by looking at the empirical distribution across the resampled data sets.
### Bootstrap standard error of the mean
set.seed(100)
x<-rnorm(15)
mu.hat<-mean(x)
sigma.hat<-sd(x)
B<-100
mu.hatNew<-rep(0,B)
for(i in 1:B){
  ### Resample with replacement and recompute the estimator
  x.new<-sample(x,replace=TRUE)
  mu.hatNew[i]<-mean(x.new)
}
se<-sd(mu.hatNew)
### Bootstrap standard error of the median
set.seed(100)
x<-rnorm(15)
mu.hat<-mean(x)
sigma.hat<-sd(x)
B<-100
mu.hatNew<-rep(0,B)
for(i in 1:B){
  x.new<-sample(x,replace=TRUE)
  mu.hatNew[i]<-median(x.new)
}
se<-sd(mu.hatNew)
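The two loops above differ only in the statistic, so the resampling can be wrapped in a small function. This is a sketch: boot.se is a hypothetical helper name, not a function from base R or the slides, and B = 1000 is an arbitrary default.

```r
# Hypothetical helper: bootstrap standard error of any statistic
boot.se <- function(x, statistic, B = 1000) {
  # Recompute the statistic on B resamples drawn with replacement
  stats <- replicate(B, statistic(sample(x, replace = TRUE)))
  sd(stats)
}
set.seed(100)
x <- rnorm(15)
se.mean <- boot.se(x, mean)
se.median <- boot.se(x, median)
c(se.mean, se.median)
sd(x) / sqrt(length(x))   # textbook standard error of the mean, for comparison
```

For the mean the bootstrap value should be close to the textbook formula; for the median there is no equally simple formula, which is where the bootstrap is most useful.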
The Normal Example: CI
Confidence interval for the mean $\mu$:
$\hat\mu \pm t_{n-1,\,1-\alpha/2}\,\hat\sigma/\sqrt{n}$
The quantile $t_{n-1,\,1-\alpha/2}$ depends on n, but when n is large it is close to the normal quantile $z_{1-\alpha/2}$, and usually $\alpha = 0.05$, where $z_{0.975} \approx 1.96$.
Numerical example, 15 obs. from N(1,1)
What does this mean?
> set.seed(100)
> x<-rnorm(15)
> t.test(x,mu=0)
	One Sample t-test
data:  x
t = 0.3487, df = 14, p-value = 0.7325
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.2294725  0.3185625
sample estimates:
mean of x
 0.044545
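One way to answer "What does this mean?" is to compute the interval by hand and then check its coverage under repeated sampling. The sketch below does both; the 2000 repetitions are an arbitrary assumption, and the data are drawn with rnorm(15) as in the code above, so the true mean is 0.

```r
set.seed(100)
x <- rnorm(15)
n <- length(x)
# 95% interval by hand: mean +/- t quantile * standard error
ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) / sqrt(n)
ci
ci.ttest <- as.numeric(t.test(x)$conf.int)   # same interval from t.test
# Coverage: proportion of repeated intervals that contain the true mean 0
covered <- replicate(2000, {
  ci.new <- t.test(rnorm(15))$conf.int
  ci.new[1] <= 0 && 0 <= ci.new[2]
})
mean(covered)                                # close to 0.95
```

The 95% refers to the procedure: about 95% of intervals built this way contain the true mean, while any single interval either does or does not.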
The Normal Example: Testing
Test a hypothesis about the mean: $H_0: \mu = \mu_0$
t-test: $t = \frac{\hat\mu - \mu_0}{\hat\sigma/\sqrt{n}}$
If $H_0$ is true, t follows a t-distribution with n-1 degrees of freedom
p-value: $p = P(|T_{n-1}| \ge |t|)$
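The test statistic and two-sided p-value can be computed directly and compared with t.test. This is a minimal sketch assuming the null value 0 and the same simulated data as in the earlier examples.

```r
set.seed(100)
x <- rnorm(15)
n <- length(x)
mu0 <- 0                                          # hypothesized mean
# t statistic: estimate minus null value, divided by the standard error
t.stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))
# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p.value <- 2 * pt(-abs(t.stat), df = n - 1)
c(t = t.stat, p = p.value)
res <- t.test(x, mu = mu0)                        # note: the argument is mu, not mean
c(t = unname(res$statistic), p = res$p.value)
```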
The Bayesian Approach
•Parametric statistical model
•Observations $y$ are drawn from a probability distribution $f(y\mid\theta)$, where $\theta$ is the parameter vector
•The parameters are unknown but random
•The uncertainty on the parameter vector is modeled through a prior distribution $\pi(\theta)$
The Bayesian Approach
A Bayesian statistical model is made of
1. A parametric statistical model, $f(y\mid\theta)$
2. A prior distribution, $\pi(\theta)$
Q: How can we combine the two?
A: Bayes' theorem!
The Bayesian Approach
Bayes' theorem ↔ Inversion of probability
If A and E are events such that P(E)≠0 and P(A)≠0, then P(A|E) and P(E|A) are related by
$P(A\mid E) = \frac{P(E\mid A)\,P(A)}{P(E)}$
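A small numerical sketch of the inversion follows. All three input probabilities are made-up values chosen for illustration only; P(E) is obtained from the law of total probability.

```r
# Hypothetical numbers for illustration only
p.A <- 0.01              # P(A): prior probability of the event A
p.E.given.A <- 0.95      # P(E|A)
p.E.given.notA <- 0.10   # P(E|not A)
# Law of total probability gives P(E)
p.E <- p.E.given.A * p.A + p.E.given.notA * (1 - p.A)
# Bayes' theorem: P(A|E) = P(E|A) P(A) / P(E)
p.A.given.E <- p.E.given.A * p.A / p.E
p.A.given.E
```

Even with a large P(E|A), the inverted probability P(A|E) stays small here because A is rare to begin with.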
The Bayesian Approach
From prior to posterior:
$\pi(\theta\mid y) = \frac{f(y\mid\theta)\,\pi(\theta)}{\int f(y\mid\theta)\,\pi(\theta)\,d\theta}$
- $f(y\mid\theta)$: information on θ contained in the observation y
- $\pi(\theta)$: prior information
- $\int f(y\mid\theta)\,\pi(\theta)\,d\theta$: normalizing constant
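For a one-dimensional parameter, the posterior can be approximated on a grid, which makes the role of the normalizing constant visible. The sketch assumes a normal likelihood with known standard deviation 1 and a N(0,1) prior; both choices, and the grid limits, are assumptions for the example.

```r
set.seed(100)
y <- rnorm(15, mean = 1, sd = 1)                 # data; sigma = 1 assumed known
step <- 0.01
theta <- seq(-2, 4, by = step)                   # grid of parameter values
prior <- dnorm(theta, mean = 0, sd = 1)          # assumed N(0,1) prior
# Likelihood of each grid value: product of the densities of the observations
lik <- sapply(theta, function(t) prod(dnorm(y, mean = t, sd = 1)))
unnorm <- lik * prior                            # likelihood times prior
post <- unnorm / sum(unnorm * step)              # divide by the normalizing constant
total <- sum(post * step)                        # integrates to 1
post.mean <- sum(theta * post * step)            # posterior mean
c(total, post.mean)
```

Grid approximation stops being practical as the dimension of θ grows, which is one motivation for conjugate priors and Markov chain Monte Carlo.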
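The Bayesian Approach
Sequential nature of Bayes' theorem: for observations $y_1$ and $y_2$ that are independent given θ,
$\pi(\theta\mid y_1,y_2) \propto f(y_2\mid\theta)\,\pi(\theta\mid y_1)$
The posterior is the new prior!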
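The sequential property can be checked with a normal mean and known variance, where the posterior is available in closed form. In this sketch posterior_normal is a hypothetical helper, and the N(0,1) prior and the split of the data into 5 and 10 observations are arbitrary assumptions.

```r
# Hypothetical helper: posterior of a normal mean with known variance sigma2
posterior_normal <- function(prior.mean, prior.var, y, sigma2 = 1) {
  n <- length(y)
  post.var <- 1 / (1 / prior.var + n / sigma2)
  post.mean <- post.var * (prior.mean / prior.var + sum(y) / sigma2)
  c(mean = post.mean, var = post.var)
}
set.seed(100)
y <- rnorm(15, mean = 1, sd = 1)
# All 15 observations at once
batch <- posterior_normal(0, 1, y)
# First 5 observations, then the posterior is used as the prior for the remaining 10
step1 <- posterior_normal(0, 1, y[1:5])
step2 <- posterior_normal(step1[["mean"]], step1[["var"]], y[6:15])
rbind(batch, step2)                              # identical rows
```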
The Bayesian Approach
Justifications:
•Updates the information about θ by extracting the information about θ from the data
•Conditions upon the observations (likelihood principle)
•Avoids averaging over the unobserved values of y
•Provides a complete, unified inferential scope
The Bayesian Approach
Practical aspect:
•Calculation of the normalizing constant can be difficult
•Conjugate priors (exact calculation is possible)
•Markov chain Monte Carlo
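As a hedged illustration of Markov chain Monte Carlo, the sketch below runs a random-walk Metropolis sampler, which only needs the posterior up to its normalizing constant. The model (normal likelihood with known standard deviation 1, N(0,1) prior), the proposal standard deviation, the chain length and the burn-in are all assumptions chosen so that the exact answer is known for comparison.

```r
set.seed(100)
y <- rnorm(15, mean = 1, sd = 1)
# Log of the unnormalized posterior: N(theta, 1) likelihood, N(0, 1) prior (assumed)
log.post <- function(theta) {
  sum(dnorm(y, mean = theta, sd = 1, log = TRUE)) + dnorm(theta, 0, 1, log = TRUE)
}
n.iter <- 20000
draws <- numeric(n.iter)
theta <- 0                                         # starting value
for (i in 1:n.iter) {
  proposal <- theta + rnorm(1, sd = 0.5)           # symmetric random-walk proposal
  # Accept with probability min(1, posterior ratio); the constant cancels
  if (log(runif(1)) < log.post(proposal) - log.post(theta)) theta <- proposal
  draws[i] <- theta
}
keep <- draws[-(1:2000)]                           # drop burn-in
c(mean(keep), 15 * mean(y) / 16)                   # sampled vs exact posterior mean
c(var(keep), 1 / 16)                               # sampled vs exact posterior variance
```

For this conjugate model the sampler is unnecessary; it is shown because the same loop applies when no closed form exists.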
The Bayesian Approach
Conjugate priors:
Example: normal mean, one observation
Normal likelihood + normal prior → normal posterior
Example: normal mean, n observations
Normal likelihood + normal prior → normal posterior
Shrinkage: the posterior mean lies between the prior mean and the sample mean
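The normal example can be sketched in code. The notation here is an assumption, since the slide formulas are not recoverable: observations are N(θ, sigma2) with sigma2 known, and the prior on θ is N(mu0, tau2). Under that standard setup the posterior mean is a weighted average of the sample mean and the prior mean; conjugate_normal is a hypothetical helper name.

```r
# Hypothetical helper: conjugate posterior for a normal mean with known variance
conjugate_normal <- function(y, mu0, tau2, sigma2) {
  n <- length(y)
  w <- (n / sigma2) / (n / sigma2 + 1 / tau2)    # weight given to the sample mean
  c(mean = w * mean(y) + (1 - w) * mu0,          # shrinks the sample mean toward mu0
    var = 1 / (n / sigma2 + 1 / tau2),
    weight = w)
}
set.seed(100)
y <- rnorm(15, mean = 1, sd = 1)
one.obs <- conjugate_normal(y[1], mu0 = 0, tau2 = 1, sigma2 = 1)   # weight 1/2
all.obs <- conjugate_normal(y, mu0 = 0, tau2 = 1, sigma2 = 1)      # weight 15/16
rbind(one.obs, all.obs)
```

With more observations the weight on the sample mean approaches 1, so the prior matters less and the shrinkage fades.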
Introduction
[Figure, shown in two build sequences: standardized likelihood, prior and posterior for 15 observations from N(1,1)]
The Bayesian Approach
Criticism of the Bayesian choice:
•Many!
•Subjectivity of the prior (most critical)
•The prior distribution is the key to Bayesian inference
The Bayesian Approach
Response:
•Prior information is (almost) always available
•There is no such thing as the prior distribution
•The prior is a tool summarizing available information as well as the uncertainty related to this information
•The use of your prior is OK as long as you can justify it
The Bayesian Approach
Bayesian statistics and bioinformatics:
•Make the best of available prior information
•Unified framework
•The prior information can be used to regularize noisy estimates (few replicates)
•Computationally demanding?