1

Visualization and PCA with Gene Expression Data

Utah State University – Fall 2017

Statistical Bioinformatics (Biomedical Big Data)

Notes 4

2

References

Chapter 10 of Bioconductor Monograph

Ringnér (2008). “What is principal component analysis?” Nature Biotechnology 26:303-304.

Chapter 8 of Johnson & Wichern’s Applied Multivariate Statistical Analysis, 5th ed.

3

MA plot – compare samples

MA plot: M=Y-X vs. A=0.5(Y+X)

Rotate and scale Y vs. X scatterplot

For log-scale expression on arrays Y and X:

M = log fold-change (Y vs. X)

A = average expression

For comparing multiple arrays, can create a “pseudo-array” reference X by taking the median for each gene across all samples

Sometimes add a Loess curve: locally weighted polynomial regression
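A minimal sketch of the pseudo-array idea, using simulated log2 expression values rather than real arrays. The object names (`eset.sim`, `ref`) are made up for illustration, and base R's `lowess()` stands in for the Loess curve mentioned above.

```r
# Simulated log2 expression: 2000 genes (rows) x 4 arrays (columns).
# All values are made up for illustration only.
set.seed(1)
n.genes <- 2000
base <- rnorm(n.genes, mean = 8, sd = 2)            # gene-level average
eset.sim <- sapply(1:4, function(k) base + rnorm(n.genes, sd = 0.3))
colnames(eset.sim) <- paste0("array", 1:4)

# Pseudo-array reference: median of each gene across all samples
ref <- apply(eset.sim, 1, median)

# M and A for array 2 versus the reference
Y <- eset.sim[, 2]
M <- Y - ref          # log fold-change (Y vs. reference)
A <- (Y + ref) / 2    # average log expression

plot(A, M, pch = 16, cex = 0.3, main = "M-A plot vs. median pseudo-array")
abline(h = 0)
lines(lowess(A, M), col = "red", lwd = 2)   # locally weighted regression curve
```

With well-normalized data the smoothed curve should stay close to the M = 0 line; curvature would suggest an intensity-dependent bias.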

4

Visualization and efficiency

library(affydata); data(Dilution)

eset <- log2(exprs(Dilution))

dim(eset) # 409,600 rows (gene probes), 4 columns (samples)

X <- eset[,1]; Y <- eset[,2]

M <- Y - X; A <- (Y+X)/2

plot(A,M,main='default M-A plot',pch=16,cex=0.1); abline(h=0)

Interpretation:

M-direction shows differential expression

A-direction shows average expression

Look for:

systematic changes

outliers

patterns suggesting quality / normalization problems

(larger M-variability, curvature)

Problems: 1. loss of information (overlaid points)

2. file size

Note: Dilution is from an Affymetrix experiment, only used for illustration here.

5

Alternative: show density by color

library(geneplotter); library(RColorBrewer)

blues.ramp <- colorRampPalette(brewer.pal(9,"Blues")[-1])

dc <- densCols(A,M,colramp=blues.ramp)

plot(A,M,col=dc,pch=16,cex=0.1,main='color density M-A plot')

abline(h=0)

Interpretation:

color represents density

around corresponding point

How is this better?

visualizes overlaid points

What problem(s) remain?

file size (one point per probe)

6

Alternative: highlight outer points

op <- par(no.readonly = TRUE) # save only the settable parameters, so par(op) restores cleanly

par(bg="black", fg="white",

col.axis="white", col.lab="white",

col.sub="white", col.main="white")

dCol <- densCols(A,M,colramp=blues.ramp)

plot(A,M,col=dCol,pch=16,cex=0.1,

main='color density M-A plot')

abline(h=0)

par(op)

7

Alternative: smooth density

smoothScatter(A,M,nbin=250,nrpoints=50,colramp=blues.ramp,

main='smooth scatter M-A plot'); abline(h=0)

What does this do?

1. smooth the local density

using a kernel density estimator

2. black points represent

isolated data points

- But it can be a bit blurry

(creates visual artifacts)

8

Alternative: hexagonal binning

library(hexbin)

hb <- hexbin(cbind(A,M),xbins=40)

plot(hb, colramp = blues.ramp,

main='hexagonal binning M-A plot')

What does this do?

essentially discretizes density

- Maybe a little clunky,

and adding reference lines

can be tricky

- But – probably the “safest” plot

9

# Same as previous slide, but try adding horizontal ref

hb <- hexbin(cbind(A,M),xbins=40)

plot(hb, colramp = blues.ramp,

main='hexagonal binning M-A plot')

abline(h=0) # base-graphics call; hexbin plots are drawn with grid, so this line fails or is misplaced
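One way around the reference-line problem, sketched here with simulated A and M values, is to draw the line inside the hexbin plot's own viewport with `hexVP.abline()` from the hexbin package. The `requireNamespace()` guard is only there so the sketch is skipped when hexbin is not installed.

```r
# Sketch: add a horizontal reference line to a hexbin plot.
# hexbin draws with grid graphics, so the line is added in the plot's
# hexbin viewport rather than with base abline().
set.seed(1)
A <- rnorm(5000, mean = 8, sd = 2)   # simulated average expression
M <- rnorm(5000, sd = 0.4)           # simulated log fold-change

if (requireNamespace("hexbin", quietly = TRUE)) {
  library(hexbin)
  hb  <- hexbin(cbind(A, M), xbins = 40)
  # plot() invisibly returns the viewports used for the plot and legend
  hvp <- plot(hb, main = "hexagonal binning M-A plot")
  hexVP.abline(hvp$plot.vp, h = 0, col = "red")
}
```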

10

Image file types and sizes (slide 5 ex.)

dCol <- densCols(A,M,colramp=blues.ramp)

pdf("C:\\folder\\f1.pdf")

plot(A,M,col=dCol,pch=16,cex=0.1,main='M-A plot'); abline(h=0)

dev.off() # 3,643 KB

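postscript("C:\\folder\\f1.ps")

plot(A,M,col=dCol,pch=16,cex=0.1,main='M-A plot'); abline(h=0) # same plot call for each device

dev.off() # 25,315 KB

jpeg("C:\\folder\\f1.jpg")

plot(A,M,col=dCol,pch=16,cex=0.1,main='M-A plot'); abline(h=0)

dev.off() # 14 KB

png("C:\\folder\\f1.png")

plot(A,M,col=dCol,pch=16,cex=0.1,main='M-A plot'); abline(h=0)

dev.off() # 9 KB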

(Note other options for these functions to control size and quality.)
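The size gap between vector and raster formats comes from vector files storing every plotted point. A rough sketch of that effect, assuming simulated points and `tempfile()` paths instead of the `C:\folder` paths above (the helper `pdf.size` is hypothetical):

```r
# Sketch: a vector format (PDF) grows with the number of points plotted.
set.seed(1)
pdf.size <- function(n) {
  f <- tempfile(fileext = ".pdf")   # temporary output file
  pdf(f)
  plot(rnorm(n), rnorm(n), pch = 16, cex = 0.1, main = "M-A plot")
  abline(h = 0)
  dev.off()
  size <- file.size(f)              # size in bytes
  unlink(f)                         # clean up
  size
}

small <- pdf.size(1000)
large <- pdf.size(100000)
c(small = small, large = large) / 1024   # sizes in KB
```

Raster devices such as `png()` store a fixed grid of pixels, so their file size depends mainly on image dimensions rather than on the number of points.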

Principal Components Analysis

A common approach in high-dimensional data:

reduce dimensionality

Notation:

$X_{lj}$ = [log-scale] expression / abundance level for “variable” (gene / protein / metabolite / substance) $j$ in “observation” (sample) $l$ of the data

[so $X^T \approx$ expression set matrix]

Define the $i$th principal component (like a new variable or column):

$$PC_i = \sum_j a_{ij} X_j$$

(where $X_j$ is the $j$th column of $X$)

11

Principal Components Analysis

12

In the $i$th principal component

$$PC_i = \sum_j a_{ij} X_j$$

the coefficients $a_{ij}$ are chosen (automatically) so that:

PC1, PC2, … each have the most variation possible (given the earlier PC’s)

PC1, PC2, … are mutually uncorrelated

PC1 has the most variation, followed by PC2, then PC3, …

$\sum_j (a_{ij})^2 = 1$ for each $i$
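These properties can be checked numerically. The sketch below uses base R's `prcomp()` on a small simulated matrix (the variable names g1, g2, g3 are made up); it is an illustration, not the pc2 function used later in these notes.

```r
# Simulated data: 50 observations (rows) x 3 variables (columns)
set.seed(1)
n <- 50
X <- cbind(g1 = rnorm(n), g2 = rnorm(n), g3 = rnorm(n))
X[, "g2"] <- X[, "g1"] + 0.5 * X[, "g2"]   # make g2 correlated with g1

pc <- prcomp(X)      # centers the columns by default
a  <- pc$rotation    # column i holds the coefficients a_ij of PC_i

# PC_i = sum_j a_ij X_j, computed by hand on the centered data
Xc     <- scale(X, center = TRUE, scale = FALSE)
scores <- Xc %*% a

max(abs(scores - pc$x))    # matches prcomp's scores up to rounding error
colSums(a^2)               # squared coefficients sum to 1 for each PC
round(cor(scores), 10)     # off-diagonal correlations are 0
apply(scores, 2, var)      # variances decrease from PC1 to PC3
```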

PCA: Interpretation

Size of the $a_{ij}$’s indicates importance in variability

Example:

Suppose the $a_{1j}$’s are large for a certain class of gene / protein / metabolite, but small for other classes. Then PC1 can be interpreted as representing that class

Problem: such clean interpretation is not guaranteed

13

PCA: Visualization with the Biplot

Several tools exist, but the “biplot” is fairly common

Represents both observations / samples (rows of X) and variables [genes / proteins / etc.] (columns of X)

Observations usually plotted as text labels at coordinates determined by first two PC’s

Greater interest: Variables plotted as labeled arrows; the coordinates of each arrow tip (on the arbitrary scale of the top and right axes) show that variable’s “weight” in the first two PC’s

14

PCA Implementation and Example

Problem with high-dimensional “wide” data

If there are many more “variables” than “observations”, princomp in R cannot be used directly

Solution: transpose X, do PCA, and convert back to the original space [princomp2 function in msProcess package]

Example here: Naples data

4 Ctrl patients, 4 diseased (RCM or DCM) patients

Focus on C vs. non-C (D or R) differentiation

Just use previously-selected set of 20 genes for now (recall Notes 3 slide 23)
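The transpose trick can be sketched in base R as follows. This is an illustration of the general idea on simulated data, assuming an eigendecomposition of the small n x n matrix; it is not the actual code of princomp2 or pc2.

```r
# Simulated wide data: 8 samples (rows) x 20 genes (columns)
set.seed(1)
n <- 8; p <- 20
X <- matrix(rnorm(n * p), n, p)

# princomp() refuses data with fewer observations than variables
res <- tryCatch(princomp(X), error = function(e) e)
inherits(res, "error")

# Work in the small n x n space, then convert back to gene space
Xc <- scale(X, center = TRUE, scale = FALSE)   # center each gene
G  <- Xc %*% t(Xc)                             # n x n instead of p x p
eg <- eigen(G, symmetric = TRUE)
k  <- n - 1                       # centering leaves at most n-1 components
d  <- sqrt(eg$values[1:k])        # singular values of Xc
V  <- t(Xc) %*% eg$vectors[, 1:k] %*% diag(1 / d)   # p x k coefficients
scores <- Xc %*% V                                  # n x k PC scores

# Agrees with prcomp() up to the (arbitrary) sign of each component
max(abs(abs(V) - abs(prcomp(X)$rotation[, 1:k])))
```

The payoff is that the eigendecomposition is done on an 8 x 8 matrix rather than a 20 x 20 one, which matters far more when the number of genes is in the thousands.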

15

16

## Get subset of Naples data (same as in Notes 3, slide 23)

url <- "http://www.stat.usu.edu/jrstevens/bioinf/naples.csv"

naples <- read.csv(url, row.names=1)

head(naples)

emat <- as.matrix(naples)

gn <- rownames(emat)

gn.list <- c("ENSG00000226310","ENSG00000226711",

"ENSG00000255007","ENSG00000197312","ENSG00000177551",

"ENSG00000130653","ENSG00000138495","ENSG00000236998",

"ENSG00000251322","ENSG00000269103","ENSG00000265933",

"ENSG00000205359","ENSG00000251576","ENSG00000186615",

"ENSG00000040933","ENSG00000068137","ENSG00000159248",

"ENSG00000168916","ENSG00000256238","ENSG00000169679")

keep <- is.element(gn,gn.list) # named 'keep' to avoid masking the transpose function t()

small.eset <- emat[keep,]

group <- c('C','R','C','D','C','R','C','D')

colnames(small.eset) <- group

# Run PCA and visualize result

source("http://www.stat.usu.edu/jrstevens/bioinf/pc2.R")

pc <- pc2(t(small.eset), scale = TRUE)

17

Biplot arrows show how certain variables “put” observations in certain parts of the plot.

Here, dominant variables are 251322, 197312, 138495, and 159248

biplot(pc, cex=c(2,1), xlim=c(-.8,.8))

18

small.eset

                   C    R    C    D    C    R    C    D
ENSG00000040933  640  924  740  808  633  942  352  835
ENSG00000068137  759 1121  681  863  762  785  401  839
ENSG00000130653  483 1102  620 1030  457 1250  493 1190
ENSG00000138495 3372 2682 3318 2092 2996 2077 3053 1774
ENSG00000159248  442   28 1111   41   97    6  393   18
ENSG00000168916  640  997  491  796  473 1038  238  849
ENSG00000169679   12   82   25   47   33   38   18   46
ENSG00000177551    0    1    0   10    0    2    0    2
ENSG00000186615   57   69   50   77   51   88   60   82
ENSG00000197312  745 1466  847 1045  659 1233  501  855
ENSG00000205359    1    0    1    0    1    0    1    0
ENSG00000226310    0    5    0    5    0    3    0    3
ENSG00000226711   32   96   42   58   41   55   23   55
ENSG00000236998    0    3    0    5    2    3    0    4
ENSG00000251322 2167 3098 2325 2474 2203 3173 1862 4666
ENSG00000251576    6   12    0   27    1   34    2   14
ENSG00000255007    3    0    4    1    3    0   10    1
ENSG00000256238   26   38   22   39   25   48   14   28
ENSG00000265933    0    1    0    2    0   11    0    1
ENSG00000269103    0    1    0    2    0    1    0    1

Why would 251322, 197312, 138495, and 159248 dominate?

Caution with high-dimensional PCA

With a large # of variables (genes, dimensions, etc.), need to be aware of scale

Variables with large variance (perhaps due to scale) will dominate principal components

Need to standardize variables (genes; rows here) “if they are measured on scales with widely different ranges or if the units of measurement are not [comparable]” (Johnson & Wichern text)

Log-scale expression values are often better behaved
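A small simulated illustration of the scale problem, using base R's `prcomp()`. The three variables (`big`, `small`, `tiny`) are made up to mimic one high-count gene alongside two low-count genes.

```r
# Simulated data: 8 observations of three variables on very different scales
set.seed(1)
n <- 8
big   <- rnorm(n, mean = 3000, sd = 500)   # high-count, high-variance gene
small <- rnorm(n, mean = 5,    sd = 2)
tiny  <- rnorm(n, mean = 1,    sd = 0.5)
X <- cbind(big, small, tiny)

raw <- prcomp(X, scale. = FALSE)   # PCA on the original scales
std <- prcomp(X, scale. = TRUE)    # PCA after standardizing each variable

# Absolute PC1 coefficients: 'big' takes essentially all the weight
# when the variables are not standardized
round(abs(raw$rotation[, "PC1"]), 3)
round(abs(std$rotation[, "PC1"]), 3)
```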

19

20

After standardizing rows of small.eset:

Now able to detect effects, both huge and subtle:

C > non-C for 138495, 159248, 205359, and 255007

C < non-C for 68137, …

21

“Scree plot” shows variance of the first few PC’s

Best to have majority of variance accounted for by the first two (or so) PC’s
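The quantity behind a scree plot is the variance of each PC, which can also be expressed as a proportion of the total. A sketch with simulated data and base R's `prcomp()`, where six made-up variables share one underlying factor:

```r
# Simulated data: 30 observations of 6 variables driven by one shared factor
set.seed(1)
n <- 30
f <- rnorm(n)                                      # shared underlying factor
X <- sapply(1:6, function(j) f + rnorm(n, sd = 0.5))

pc   <- prcomp(X, scale. = TRUE)
prop <- pc$sdev^2 / sum(pc$sdev^2)   # proportion of variance per PC
cumsum(prop)                         # cumulative proportion: first two, first three, ...

screeplot(pc, main = "scree plot")   # bar heights are the PC variances
```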

22

# log-transform, then standardize each row

X <- log(small.eset+1)

# Steps omitted here; each row of X standardized.

# Think about how you could do it!

pcX <- pc2(t(X), scale=TRUE)

biplot(pcX, xlim=c(-.8,.6), cex=c(2,1))

screeplot(pcX)
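One possible way to do the omitted row standardization, shown on three rows copied from the small.eset table. This is only a sketch of one approach; the notes do not say which method was actually used.

```r
# Three rows of small.eset, copied from the table shown earlier
counts <- rbind(
  ENSG00000138495 = c(3372, 2682, 3318, 2092, 2996, 2077, 3053, 1774),
  ENSG00000159248 = c( 442,   28, 1111,   41,   97,    6,  393,   18),
  ENSG00000197312 = c( 745, 1466,  847, 1045,  659, 1233,  501,  855))
colnames(counts) <- c('C','R','C','D','C','R','C','D')

X <- log(counts + 1)

# scale() standardizes columns, so transpose, scale, and transpose back
X.std <- t(scale(t(X)))

rowMeans(X.std)        # each row now has mean 0 (up to rounding)
apply(X.std, 1, sd)    # and standard deviation 1
```

A row with identical values in every sample would have standard deviation 0 and produce NaN here, so such rows would need to be dropped or handled separately.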

Side note: archived package

http://cran.r-project.org/web/packages/msProcess/

“Package ‘msProcess’ was removed from the CRAN repository.”

“Formerly available versions can be obtained from the archive.”

(But only there as source, not compiled package)

We may return to this package later, but for now, we only need the princomp2 function:

“This function performs principal component analysis (PCA) for wide data x, i.e. dim(x)[1] < dim(x)[2]. This kind of data can be handled by princomp in S-PLUS but not in R. The trick is to do PCA for t(x) first and then convert back to the original space. It might be more efficient than princomp for high dimensional data.”

The pc2 function sourced in earlier (see the Naples example code) is a modified (but fully functional) version of princomp2

23

24

# Try a 3-d biplot

library(BiplotGUI)

Biplots(Data=t(X)) # actually, using t(log(small.eset+1))

# looks similar here

# lower-left: External --> In 3D

25

Summary

Visualization efficiency

Avoid overlaying points

Conserve file size (note file type)

Choose meaningful color palette

Principal Components

Emphasize major sources of variability

No guaranteed interpretability (no accounting for biology)

Pay attention to scale of variables

26

Warnings – why end with this?

What can be gained from clustering?

general ideas, possible structures; NOT statistical inference (at least not in the same way)

Be wary of:

elaborate pictures

non-normalized data

unjustifiable decisions

clustering method

distance function

color scheme

claims of “proof” (maybe “support”)

arbitrary decisions

What is clustering best for?

exploratory data analysis and summary, NOT statistical inference
