Visualization and PCA with
Gene Expression Data
Utah State University – Fall 2017
Statistical Bioinformatics (Biomedical Big Data)
Notes 4
References
Chapter 10 of the Bioconductor Monograph
Ringnér (2008). “What is principal component analysis?” Nature Biotechnology 26:303-304.
Chapter 8 of Johnson & Wichern’s Applied Multivariate Statistical Analysis, 5th ed.
MA plot – compare samples
MA plot: M = Y - X vs. A = 0.5(Y + X)
Equivalent to rotating (by 45 degrees) and rescaling the Y vs. X scatterplot
For log-scale expression on arrays Y and X:
M = log fold-change (Y vs. X)
A = average log expression
For comparing multiple arrays, can create a “pseudo-array” reference X by taking the median for each gene across all samples
Sometimes add a loess curve: locally weighted polynomial regression
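The construction above can be sketched in base R. This is only an illustration on simulated log2 values; the matrix `eset.sim`, the reference `X.ref`, and all numeric settings are made up here and are not part of the Dilution example that follows.

```r
# Simulated log2 expression matrix: 1000 genes (rows) x 4 arrays (columns).
# All values are invented for illustration only.
set.seed(1)
n.genes <- 1000
base <- rnorm(n.genes, mean = 8, sd = 2)   # gene-level average expression
eset.sim <- sapply(1:4, function(j) base + rnorm(n.genes, sd = 0.3))
colnames(eset.sim) <- paste0("array", 1:4)

# Pseudo-array reference: median of each gene across all arrays
X.ref <- apply(eset.sim, 1, median)

# Compare array 2 with the reference
Y <- eset.sim[, 2]
M <- Y - X.ref          # log fold-change
A <- (Y + X.ref) / 2    # average log expression

plot(A, M, pch = 16, cex = 0.3, main = "M-A plot vs. median pseudo-array")
abline(h = 0)

# Loess curve: locally weighted polynomial regression of M on A
fit <- loess(M ~ A)
o <- order(A)
lines(A[o], fitted(fit)[o], col = "red", lwd = 2)
```

With well-normalized arrays the loess curve should stay close to the horizontal line at M = 0; visible curvature would hint at an intensity-dependent bias.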
Visualization and efficiency
library(affydata); data(Dilution)
eset <- log2(exprs(Dilution))
dim(eset) # 409,600 rows (gene probes), 4 columns (samples)
X <- eset[,1]; Y <- eset[,2]
M <- Y - X; A <- (Y+X)/2
plot(A,M,main='default M-A plot',pch=16,cex=0.1); abline(h=0)
Interpretation:
M-direction shows differential expression
A-direction shows average expression
Look for:
systematic changes
outliers
patterns suggesting quality / normalization problems
(larger M-variability, curvature)
Problems:
1. loss of information (overlaid points)
2. file size
Note: Dilution is from an Affymetrix experiment, only used for illustration here.
Alternative: show density by color
library(geneplotter); library(RColorBrewer)
blues.ramp <- colorRampPalette(brewer.pal(9,"Blues")[-1])
dc <- densCols(A,M,colramp=blues.ramp)
plot(A,M,col=dc,pch=16,cex=0.1,main='color density M-A plot')
abline(h=0)
Interpretation:
color represents density
around corresponding point
How is this better?
makes overlaid points visible
What problem(s) remain?
file size (one point per probe)
Alternative: highlight outer points
# par(...) returns the previous values of the settings it changes,
# so they can be restored afterwards without read-only warnings
op <- par(bg="black", fg="white",
          col.axis="white", col.lab="white",
          col.sub="white", col.main="white")
dCol <- densCols(A,M,colramp=blues.ramp)
plot(A,M,col=dCol,pch=16,cex=0.1,
     main='color density M-A plot')
abline(h=0)
par(op)
Alternative: smooth density
smoothScatter(A,M,nbin=250,nrpoints=50,colramp=blues.ramp,
main='smooth scatter M-A plot'); abline(h=0)
What does this do?
1. smooth the local density
using a kernel density estimator
2. black points represent
isolated data points
- But it can be a bit blurry
(creates visual artifacts)
Alternative: hexagonal binning
library(hexbin)
hb <- hexbin(cbind(A,M),xbins=40)
plot(hb, colramp = blues.ramp,
main='hexagonal binning M-A plot')
What does this do?
essentially discretizes density
- Maybe a little clunky,
and adding reference lines
can be tricky
- But – probably the “safest” plot
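# Same as previous slide, but try adding horizontal ref
hb <- hexbin(cbind(A,M),xbins=40)
p <- plot(hb, colramp = blues.ramp,
          main='hexagonal binning M-A plot')
# abline() is a base-graphics call, but hexbin plots are drawn with grid,
# so this line may be misplaced or fail
abline(h=0)
# hexbin's own helper draws inside the plot viewport instead
hexVP.abline(p$plot.vp, h=0)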
Image file types and sizes (color density M-A plot example from slide 5)
dCol <- densCols(A,M,colramp=blues.ramp)
pdf("C:\\folder\\f1.pdf")
plot(A,M,col=dCol,pch=16,cex=0.1,main='M-A plot'); abline(h=0)
dev.off() # 3,643 KB
postscript("C:\\folder\\f1.ps")
plot(A,M,col=dCol,pch=16,cex=0.1,main='M-A plot'); abline(h=0)
dev.off() # 25,315 KB
jpeg("C:\\folder\\f1.jpg")
plot(A,M,col=dCol,pch=16,cex=0.1,main='M-A plot'); abline(h=0)
dev.off() # 14 KB
png("C:\\folder\\f1.png")
plot(A,M,col=dCol,pch=16,cex=0.1,main='M-A plot'); abline(h=0)
dev.off() # 9 KB
(Note other options for these functions to control size and quality.)
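Why the vector formats get so large can be checked with a small experiment. The sketch below uses random points rather than the Dilution data; the helper name `pdf.size.kb` and the temporary file names are invented for this illustration. A vector device such as pdf() stores every plotted point, so the file is expected to grow with the number of points, whereas a raster device such as png() stores pixels.

```r
set.seed(1)
out.dir <- tempdir()

# Write a scatterplot of n random points to a PDF; return the file size in KB
# (pdf.size.kb is a hypothetical helper written for this example)
pdf.size.kb <- function(n) {
  f <- file.path(out.dir, paste0("scatter_", n, ".pdf"))
  pdf(f)
  plot(rnorm(n), rnorm(n), pch = 16, cex = 0.1)
  dev.off()
  file.size(f) / 1024
}

sizes <- sapply(c(1000, 10000, 100000), pdf.size.kb)
names(sizes) <- c("1e3", "1e4", "1e5")
round(sizes)   # file size in KB for each number of points
```

Binned displays (hexagonal binning, smoothed density) avoid this growth because the number of drawn objects no longer depends on the number of probes.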
Principal Components Analysis
A common approach in high-dimensional data:
reduce dimensionality
Notation:
$X_{lj}$ = [log-scale] expression / abundance level for
“variable” (gene / protein / metabolite / substance) j
in “observation” (sample) l of the data
[so $X^T$ ≈ expression set matrix]
Define the ith principal component
(like a new variable or column):
$$PC_i = \sum_j a_{ij} X_j$$
(where $X_j$ is the jth column of X)
Principal Components Analysis
In the ith principal component:
$$PC_i = \sum_j a_{ij} X_j$$
the coefficients $a_{ij}$ are chosen (automatically) so that:
PC1, PC2, … each have the most variation possible
PC1, PC2, … are mutually uncorrelated (often loosely called “independent”)
PC1 has the most variation, followed by PC2, then PC3, …
$\sum_j (a_{ij})^2 = 1$ for each i
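These properties can be verified numerically. The sketch below uses base R's prcomp on simulated data (not the course's pc2 function); the toy matrix and its variable names v1 to v4 are assumptions made only for this check.

```r
set.seed(1)
# Toy data: 50 observations (rows) on 4 variables (columns), two of them correlated
z <- rnorm(50)
X <- cbind(v1 = z + rnorm(50, sd = 0.5),
           v2 = z + rnorm(50, sd = 0.5),
           v3 = rnorm(50),
           v4 = rnorm(50, sd = 2))

pca <- prcomp(X, center = TRUE, scale. = FALSE)
a <- t(pca$rotation)   # row i holds the coefficients a_ij of PC i
scores <- pca$x        # column i is PC i evaluated on the centered data

rowSums(a^2)             # sum_j (a_ij)^2 for each i
round(cor(scores), 10)   # correlation matrix of the PCs
apply(scores, 2, var)    # variance of PC1, PC2, ...

# PC1 rebuilt by hand as sum_j a_1j * X_j, using the centered columns of X
Xc <- scale(X, center = TRUE, scale = FALSE)
pc1.manual <- Xc %*% a[1, ]
```

Note that prcomp centers each column before forming the components, so the hand-built PC1 matches `scores[, 1]` only when the centered columns are used.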
PCA: Interpretation
Magnitude of the $a_{ij}$’s indicates each variable’s importance to the variability captured by PC i
Example:
Suppose the $a_{1j}$’s are large for a certain class of gene / protein / metabolite, but small for other classes. Then PC1 can be interpreted as representing that class.
Problem: such a clean interpretation is not guaranteed
PCA: Visualization with the Biplot
Several tools exist, but the “biplot” is fairly
common
Represent both observations / samples (rows of X)
and variables [genes / proteins / etc.] (columns of X)
Observations usually plotted as text labels at
coordinates determined by first two PC’s
Greater interest: variables plotted as labeled arrows, whose coordinates (on the arbitrary scale of the top and right axes) give each variable’s “weight” in the first two PC’s
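A minimal biplot can be tried on R's built-in USArrests data before moving to expression data. This is only an illustration of the display described above: it uses base R's prcomp and biplot rather than the pc2 function used later, and the cex values are arbitrary.

```r
# 50 states (observations) x 4 variables; standardize variables before PCA
pca <- prcomp(USArrests, scale. = TRUE)

head(pca$x[, 1:2])     # PC1/PC2 scores: basis for the observation labels
                       # (biplot() rescales these before plotting)
pca$rotation[, 1:2]    # PC1/PC2 coefficients: basis for the variable arrows

# Observations appear as text labels, variables as labeled arrows
biplot(pca, cex = c(0.6, 0.9))
```

An arrow pointing toward a group of observation labels suggests that those observations have relatively high values of that variable.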
PCA Implementation and Example
Problem with high-dimensional “wide” data:
many more “variables” than “observations”
Solution: transpose X, run PCA, and convert back to the original
space [princomp2 function in msProcess package]
Example here: Naples data
4 Ctrl patients, 4 diseased (RCM or DCM) patients
Focus on C vs. non-C (D or R) differentiation
Just use previously-selected set of 20 genes for
now (recall Notes 3 slide 23)
## Get subset of Naples data (same as in Notes 3, slide 23)
url <- "http://www.stat.usu.edu/jrstevens/bioinf/naples.csv"
naples <- read.csv(url, row.names=1)
head(naples)
emat <- as.matrix(naples)
gn <- rownames(emat)
gn.list <- c("ENSG00000226310","ENSG00000226711",
"ENSG00000255007","ENSG00000197312","ENSG00000177551",
"ENSG00000130653","ENSG00000138495","ENSG00000236998",
"ENSG00000251322","ENSG00000269103","ENSG00000265933",
"ENSG00000205359","ENSG00000251576","ENSG00000186615",
"ENSG00000040933","ENSG00000068137","ENSG00000159248",
"ENSG00000168916","ENSG00000256238","ENSG00000169679")
keep <- is.element(gn,gn.list)   # TRUE for the 20 selected genes (avoids masking t())
small.eset <- emat[keep,]
group <- c('C','R','C','D','C','R','C','D')
colnames(small.eset) <- group
# Run PCA and visualize result
source("http://www.stat.usu.edu/jrstevens/bioinf/pc2.R")
pc <- pc2(t(small.eset), scale = TRUE)
Biplot arrows show how certain variables “put” observations in certain parts of the plot.
Here, dominant variables are 251322, 197312, 138495, and 159248 (genes abbreviated by the trailing digits of their Ensembl IDs)
biplot(pc, cex=c(2,1), xlim=c(-.8,.8))
small.eset
                     C     R     C     D     C     R     C     D
ENSG00000040933    640   924   740   808   633   942   352   835
ENSG00000068137    759  1121   681   863   762   785   401   839
ENSG00000130653    483  1102   620  1030   457  1250   493  1190
ENSG00000138495   3372  2682  3318  2092  2996  2077  3053  1774
ENSG00000159248    442    28  1111    41    97     6   393    18
ENSG00000168916    640   997   491   796   473  1038   238   849
ENSG00000169679     12    82    25    47    33    38    18    46
ENSG00000177551      0     1     0    10     0     2     0     2
ENSG00000186615     57    69    50    77    51    88    60    82
ENSG00000197312    745  1466   847  1045   659  1233   501   855
ENSG00000205359      1     0     1     0     1     0     1     0
ENSG00000226310      0     5     0     5     0     3     0     3
ENSG00000226711     32    96    42    58    41    55    23    55
ENSG00000236998      0     3     0     5     2     3     0     4
ENSG00000251322   2167  3098  2325  2474  2203  3173  1862  4666
ENSG00000251576      6    12     0    27     1    34     2    14
ENSG00000255007      3     0     4     1     3     0    10     1
ENSG00000256238     26    38    22    39    25    48    14    28
ENSG00000265933      0     1     0     2     0    11     0     1
ENSG00000269103      0     1     0     2     0     1     0     1
Why would 251322, 197312, 138495, and 159248 dominate?
Caution with high-dimensional PCA
With a large # of variables (genes, dimensions,
etc.), need to be aware of scale
Variables with large variance (perhaps due to
scale) will dominate principal components
Need to standardize variables (genes; rows
here) “if they are measured on scales with
widely different ranges or if the units of
measurement are not [comparable]” (Johnson &
Wichern text)
Log-scale expression values often better
behaved
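The effect of scale can be seen in a small simulation. The sketch below uses base R's prcomp and three made-up variables (g1, g2, g3) that all carry the same underlying signal, with g3 recorded on a scale 100 times larger. Variables are in columns here, as prcomp expects.

```r
set.seed(1)
n <- 200
z <- rnorm(n)   # shared signal
dat <- cbind(g1 = z + rnorm(n, sd = 0.5),
             g2 = z + rnorm(n, sd = 0.5),
             g3 = 100 * (z + rnorm(n, sd = 0.5)))   # same signal, 100x the scale

pc.raw    <- prcomp(dat, scale. = FALSE)   # PCA on the raw variables
pc.scaled <- prcomp(dat, scale. = TRUE)    # PCA after standardizing each variable

round(pc.raw$rotation[, 1], 3)      # PC1 coefficients without standardizing
round(pc.scaled$rotation[, 1], 3)   # PC1 coefficients after standardizing
```

Without standardizing, PC1 is essentially g3 alone. After standardizing, the three variables receive similar weights, which reflects that they carry the same signal.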
After standardizing rows of small.eset:
Now able to detect
effects, both huge and
subtle:
C > non-C for 138495,
159248, 205359, and
255007
C < non-C for 68137, …
“Scree plot” shows
variance of the first
few PC’s
Best to have majority
of variance
accounted for by the
first two (or so) PC’s
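The share of variance carried by each PC can also be computed directly. This sketch uses base R's prcomp on the built-in iris measurements purely as a stand-in data set; it assumes nothing about the pc2 object beyond what screeplot displays.

```r
# Built-in data: 150 observations on 4 numeric variables
pca.iris <- prcomp(iris[, 1:4], scale. = TRUE)

var.pc   <- pca.iris$sdev^2        # variance of each PC
prop.var <- var.pc / sum(var.pc)   # proportion of total variance
round(cumsum(prop.var), 3)         # cumulative proportion, PC1 onward

screeplot(pca.iris, main = "Scree plot")   # bar heights are var.pc
```

If the cumulative proportion for the first two PCs is high, a two-dimensional biplot loses little of the variability in the data.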
# standardize each row
X <- log(small.eset+1)
# Steps omitted here; each row of X standardized.
# Think about how you could do it!
pcX <- pc2(t(X), scale=TRUE)
biplot(pcX, xlim=c(-.8,.6), cex=c(2,1))
screeplot(pcX)
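One possible way to standardize rows is sketched below. It is not necessarily the approach used for the slides. The toy matrix `counts` takes the first four columns of three rows from the small.eset printout (ENSG00000040933, ENSG00000169679, ENSG00000138495), and the names `X.log` and `X.std` are introduced here only for illustration.

```r
# Toy stand-in for small.eset: 3 genes (rows) x 4 samples (columns)
counts <- rbind(ENSG00000040933 = c(640, 924, 740, 808),
                ENSG00000169679 = c(12, 82, 25, 47),
                ENSG00000138495 = c(3372, 2682, 3318, 2092))
colnames(counts) <- c("C", "R", "C", "D")
X.log <- log(counts + 1)

# scale() standardizes columns, so transpose, standardize, and transpose back
X.std <- t(scale(t(X.log)))

rowMeans(X.std)       # row means after standardizing
apply(X.std, 1, sd)   # row standard deviations after standardizing
```

A row with zero variance (a gene with identical values in every sample) would produce NaN under this approach and would need to be removed or handled separately.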
Side note: archived package
http://cran.r-project.org/web/packages/msProcess/
“Package ‘msProcess’ was removed from the CRAN repository.”
“Formerly available versions can be obtained from the archive.”
(But only there as source, not compiled package)
We may return to this package later, but for now, we only
need the princomp2 function:
“This function performs principal component analysis (PCA) for wide
data x, i.e. dim(x)[1] < dim(x)[2]. This kind of data can be handled by
princomp in S-PLUS but not in R. The trick is to do PCA for t(x) first
and then convert back to the original space. It might be more efficient
than princomp for high dimensional data.”
The pc2 function sourced in earlier (from pc2.R) is a modified (but fully
functional) version of princomp2
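The “transpose and convert back” idea in the quoted description can be illustrated with a short linear-algebra sketch. This is not the source of princomp2 or pc2, only one way the trick could be carried out in base R on simulated wide data; all object names are invented here. If $u$ is an eigenvector of the small n x n matrix $X_c X_c^T/(n-1)$ with eigenvalue $\lambda$, then $X_c^T u / \sqrt{(n-1)\lambda}$ is a unit-length eigenvector of the p x p covariance matrix with the same eigenvalue.

```r
set.seed(1)
n <- 8; p <- 200   # wide data: 8 samples, 200 variables
X <- matrix(rnorm(n * p), nrow = n)
Xc <- scale(X, center = TRUE, scale = FALSE)

# Eigen-decomposition of the small n x n matrix instead of the p x p covariance
G <- tcrossprod(Xc) / (n - 1)
eg <- eigen(G, symmetric = TRUE)
k <- n - 1   # centering leaves at most n - 1 nonzero eigenvalues
lambda <- eg$values[1:k]

# Convert back to the original space: p x k loadings with unit-length columns
V <- crossprod(Xc, eg$vectors[, 1:k])
V <- sweep(V, 2, sqrt((n - 1) * lambda), "/")
scores <- Xc %*% V

# Compare with prcomp, which handles wide data via SVD; signs may differ
ref <- prcomp(X)
all.equal(lambda, ref$sdev[1:k]^2)
all.equal(abs(scores), abs(ref$x[, 1:k]), check.attributes = FALSE)
```

The saving comes from decomposing an 8 x 8 matrix rather than a 200 x 200 one; with thousands of genes the difference becomes substantial.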
# Try a 3-d biplot
library(BiplotGUI)
Biplots(Data=t(X)) # actually, using t(log(small.eset+1))
# looks similar here
# lower-left: External --> In 3D
Summary
Visualization efficiency
Avoid overlaying points
Conserve file size (note file type)
Choose meaningful color palette
Principal Components
Emphasize major sources of variability
No guaranteed interpretability (no accounting for bio.)
Pay attention to scale of variables
Warnings
Why end with this? What can be gained from clustering?
general ideas, possible structures; NOT statistical inference (at least not in the same way)
Be wary of:
elaborate pictures
non-normalized data
unjustifiable decisions
clustering method
distance function
color scheme
claims of “proof” (maybe “support”)
arbitrary decisions
What is clustering best for?
exploratory data analysis and summary, NOT statistical inference