23
Scalable Probabilistic PCA PCA Probabilistic PCA Results Scalable and flexible probabilistic PCA for large-scale genetic variation data Sriram Sankararaman Department of Computer Science Department of Human Genetics UCLA

Scalable Probabilistic PCA - ww2.amstat.org › meetings › sdss › 2018 › online... · Scalable Probabilistic PCA PCA Probabilistic PCA Results Probabilistic PCA EM algorithm:

  • Upload
    others

  • View
    13

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Scalable Probabilistic PCA - ww2.amstat.org › meetings › sdss › 2018 › online... · Scalable Probabilistic PCA PCA Probabilistic PCA Results Probabilistic PCA EM algorithm:

ScalableProbabilistic

PCA

PCA

ProbabilisticPCA

Results

Scalable and flexible probabilistic PCA forlarge-scale genetic variation data

Sriram Sankararaman

Department of Computer ScienceDepartment of Human Genetics

UCLA

Page 2: Scalable Probabilistic PCA - ww2.amstat.org › meetings › sdss › 2018 › online... · Scalable Probabilistic PCA PCA Probabilistic PCA Results Probabilistic PCA EM algorithm:

ScalableProbabilistic

PCA

PCA

ProbabilisticPCA

Results

The genomic revolution

Challenge: scalable methods to analyze and visualize this data

Page 3: Scalable Probabilistic PCA - ww2.amstat.org › meetings › sdss › 2018 › online... · Scalable Probabilistic PCA PCA Probabilistic PCA Results Probabilistic PCA EM algorithm:

ScalableProbabilistic

PCA

PCA

ProbabilisticPCA

Results

Genetic data

Individual 1

Individual 2

Individual 3

Position along Genome

.

.

.

.A..C...

.

.

.

.A..T...

.

.

.

.A..C...

.

.

.

.G..T...

.

.

.

.A..C...

.

.

.

.G..C...

Individual 1

Individual 2

Individual 3

SNPs AC

AT

AC

GT

AC

GC

Genotype 1

Genotype 2

Genotype 3

SNPs 01

11

12

Page 4: Scalable Probabilistic PCA - ww2.amstat.org › meetings › sdss › 2018 › online... · Scalable Probabilistic PCA PCA Probabilistic PCA Results Probabilistic PCA EM algorithm:

ScalableProbabilistic

PCA

PCA

ProbabilisticPCA

Results

PCA on genetic data

Genotypes

SNPs

1020202

1110102

1211111

0102011

0112010

PCA0.7 0.3 -0.1 -0.4 -0.5

Page 5: Scalable Probabilistic PCA - ww2.amstat.org › meetings › sdss › 2018 › online... · Scalable Probabilistic PCA PCA Probabilistic PCA Results Probabilistic PCA EM algorithm:

ScalableProbabilistic

PCA

PCA

ProbabilisticPCA

Results

PCA on genetic data

Visualize genetic structure

Novembre et al. Nature 2008

Page 6: Scalable Probabilistic PCA - ww2.amstat.org › meetings › sdss › 2018 › online... · Scalable Probabilistic PCA PCA Probabilistic PCA Results Probabilistic PCA EM algorithm:

ScalableProbabilistic

PCA

PCA

ProbabilisticPCA

Results

PCA

Given N genotype vectors over M SNPsxn ∈ RM , n ∈ {1, . . . , N} and K ≤M .

xn ≈ w1zn,1 + . . .+wKzn,K

= [w1 . . .wK ]

zn,1...zn,K

= Wzn

• PCA Constraint: columns of W are orthonormal.

• The PCA solution W = UK where UK contains the topK eigenvectors of the sample covariance matrix.

Page 7: Scalable Probabilistic PCA - ww2.amstat.org › meetings › sdss › 2018 › online... · Scalable Probabilistic PCA PCA Probabilistic PCA Results Probabilistic PCA EM algorithm:

ScalableProbabilistic

PCA

PCA

ProbabilisticPCA

Results

Challenges with PCA

Computational

• Compute all eigenvalue, eigenvectors.• Singular-Value Decomposition (SVD):

O(MN min(M,N)) ≈ O(MN2)

• Infeasible for genetic datasets (large number of SNPs M orindividuals N).

• Recent Randomized approximation algorithms

Statistical

• Missing genotypes.

• Correlation among SNPs.

Halko et al. 2009, Galinskey et al. 2016

Page 8: Scalable Probabilistic PCA - ww2.amstat.org › meetings › sdss › 2018 › online... · Scalable Probabilistic PCA PCA Probabilistic PCA Results Probabilistic PCA EM algorithm:

ScalableProbabilistic

PCA

PCA

ProbabilisticPCA

Results

Probabilistic PCA

Model

zniid∼ N (0, IK)

p(xn|zn,W , σ2) = N (Wzn, σ2IM )

Log likelihood

LL(W , σ2) ≡ logP (X|W , σ2) = log

N∏n=1

p(xn, zn|W , σ2)

The maximum likelihood estimator is equivalent to PCA.

Roweis 1999, Tipping and Bishop 1999, Engelhardt and Stephens 2010

Page 9: Scalable Probabilistic PCA - ww2.amstat.org › meetings › sdss › 2018 › online... · Scalable Probabilistic PCA PCA Probabilistic PCA Results Probabilistic PCA EM algorithm:

ScalableProbabilistic

PCA

PCA

ProbabilisticPCA

Results

Probabilistic PCA

EM algorithm

• E-step:

Z = (WTW )−1

WTX

• M-step:

W = XZT(ZZT)−1

• Assume: σ2 → 0

Page 10: Scalable Probabilistic PCA - ww2.amstat.org › meetings › sdss › 2018 › online... · Scalable Probabilistic PCA PCA Probabilistic PCA Results Probabilistic PCA EM algorithm:

ScalableProbabilistic

PCA

PCA

ProbabilisticPCA

Results

Probabilistic PCA

EM algorithm: computational complexity

• E-step:

Z = (WTW )︸ ︷︷ ︸K×K

−1WT︸︷︷︸K×M

X︸︷︷︸M×N

O(NMK)

• M-step:

W = X︸︷︷︸M×N

ZT︸︷︷︸N×K

(ZZT)︸ ︷︷ ︸K×K

−1

O(NMK)

Page 11: Scalable Probabilistic PCA - ww2.amstat.org › meetings › sdss › 2018 › online... · Scalable Probabilistic PCA PCA Probabilistic PCA Results Probabilistic PCA EM algorithm:

ScalableProbabilistic

PCA

PCA

ProbabilisticPCA

Results

Probabilistic PCA

EM algorithm: computational complexity

• Run for I iterations with each iteration costing O(NMK).

• For small K, leads to a linear-time algorithm.

• Ignores the special structure of the genotype matrix X.

Page 12: Scalable Probabilistic PCA - ww2.amstat.org › meetings › sdss › 2018 › online... · Scalable Probabilistic PCA PCA Probabilistic PCA Results Probabilistic PCA EM algorithm:

ScalableProbabilistic

PCA

PCA

ProbabilisticPCA

Results

Probabilistic PCA

EM algorithm: computational complexity

• Run for I iterations with each iteration costing O(NMK).

• For small K, leads to a linear-time algorithm.

• Ignores the special structure of the genotype matrix X.

Page 13: Scalable Probabilistic PCA - ww2.amstat.org › meetings › sdss › 2018 › online... · Scalable Probabilistic PCA PCA Probabilistic PCA Results Probabilistic PCA EM algorithm:

ScalableProbabilistic

PCA

PCA

ProbabilisticPCA

Results

Probabilistic PCA

EM algorithm: computational complexity

Each E and M step, perform the following operations K times:

c = Xb

• X is a fixed M ×N matrix of genotypes.

• b is a real-valued vector that could potentially changeeach iteration.

• Naive multiplication takes O(NM).

• For a genotype matrix, can we do some pre-processing sothat Xb can be computed more efficiently ?

• Yes! For a matrix with binary entries: O( MNlog2(N)).

Liberty and Zucker 2009

Page 14: Scalable Probabilistic PCA - ww2.amstat.org › meetings › sdss › 2018 › online... · Scalable Probabilistic PCA PCA Probabilistic PCA Results Probabilistic PCA EM algorithm:

ScalableProbabilistic

PCA

PCA

ProbabilisticPCA

Results

Probabilistic PCA

EM algorithm: computational complexity

Each E and M step, perform the following operations K times:

c = Xb

• X is a fixed M ×N matrix of genotypes.

• b is a real-valued vector that could potentially changeeach iteration.

• Naive multiplication takes O(NM).

• For a genotype matrix, can we do some pre-processing sothat Xb can be computed more efficiently ?

• Yes! For a matrix with binary entries: O( MNlog2(N)).

Liberty and Zucker 2009

Page 15: Scalable Probabilistic PCA - ww2.amstat.org › meetings › sdss › 2018 › online... · Scalable Probabilistic PCA PCA Probabilistic PCA Results Probabilistic PCA EM algorithm:

ScalableProbabilistic

PCA

PCA

ProbabilisticPCA

Results

Probabilistic PCA

EM algorithm: computational complexity

Each E and M step, perform the following operations K times:

c = Xb

• X is a fixed M ×N matrix of genotypes.

• b is a real-valued vector that could potentially changeeach iteration.

• Naive multiplication takes O(NM).

• For a genotype matrix, can we do some pre-processing sothat Xb can be computed more efficiently ?

• Yes! For a matrix with binary entries: O( MNlog2(N)).

Liberty and Zucker 2009

Page 16: Scalable Probabilistic PCA - ww2.amstat.org › meetings › sdss › 2018 › online... · Scalable Probabilistic PCA PCA Probabilistic PCA Results Probabilistic PCA EM algorithm:

ScalableProbabilistic

PCA

PCA

ProbabilisticPCA

Results

EM with the Mailman algorithm

• Entries in genotype matrix take one of three values:{0, 1, 2}.

• Using the Mailman algorithm, per-iteration timecomplexity of EM for genotype matrix: O( MNK

log3(N)).

Sub-linear time algorithm for computing PCA

Page 17: Scalable Probabilistic PCA - ww2.amstat.org › meetings › sdss › 2018 › online... · Scalable Probabilistic PCA PCA Probabilistic PCA Results Probabilistic PCA EM algorithm:

ScalableProbabilistic

PCA

PCA

ProbabilisticPCA

Results

Simulations

Accuracy

50, 000 SNPs, 10, 000 individuals

Fst MEV

K = 5 K = 10

0.001 0.987 1.0000.002 0.999 1.0000.003 0.999 1.0000.004 0.999 1.0000.005 1.000 1.0000.006 1.000 1.0000.007 1.000 1.0000.008 1.000 1.0000.009 1.000 1.0000.010 1.000 1.000

Page 18: Scalable Probabilistic PCA - ww2.amstat.org › meetings › sdss › 2018 › online... · Scalable Probabilistic PCA PCA Probabilistic PCA Results Probabilistic PCA EM algorithm:

ScalableProbabilistic

PCA

PCA

ProbabilisticPCA

Results

Simulations

Efficiency

M = 100, 000 SNPs, K = 5, FST = 0.01

0 100 200 300 400 500

020

000

4000

060

000

8000

010

0000

1400

00

N (X 1000) individuals

Com

pute

tim

e (s

econ

ds)

●●●●●● ● ● ● ● ● ● ●

● EMFastPCAPLINK_SVDFlash−PCA

0 100 200 300 400 500

010

0020

0030

0040

00

N (X 1000) individuals

Com

pute

tim

e (s

econ

ds)

●●●●●● ● ● ●

● EMFastPCAFlash−PCA

0 100 200 300 400 500

010

020

030

040

050

060

070

0

N (X 1000) individuals

Ave

rage

iter

atio

n tim

e (s

econ

ds)

●●●●●● ● ● ●

● EMFastPCA

Page 19: Scalable Probabilistic PCA - ww2.amstat.org › meetings › sdss › 2018 › online... · Scalable Probabilistic PCA PCA Probabilistic PCA Results Probabilistic PCA EM algorithm:

ScalableProbabilistic

PCA

PCA

ProbabilisticPCA

Results

Application to 1000 Genomes data

●●●

●●●●●

●●●●●●●●●●●

●●

●●●●●

●●●●●

●●

●●●●

●●●

●●

●●●●

●●●●

●●

●●

●●●

●●●●

●●●

●●●●

●●●●●●

●●●●

●●

●●●●●

●●●●●

●●●

●●

●●●●●●●●●●

●●●

●●●●●●

●●●

●●●

●●

●●●●●●●

●●●

●●●

●●●

●●●●●

●●●●●

●●●●

●●

●●

●●

●●●●●

●●

●●●

●●●●●

●●

●●●

●●●●●

●●●

●●●●

●●

●●●●●

●●●

●●●

●●

●●●●●●

●●●●

●●

●●●

●●●●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●● ●

●●

●●

●●

●●●

●●

●● ●

●●

●● ●

●●

●●●

●●

●●

● ●●

●●

●●●

●●

●●

●●

●●●●●●

●●

●●

●●●●●

●●●●

●●

●●

●●

●●●●●●

●●

●●

●●●●

●●●

●●●●●

●●●

●●●●●

●●●●●●●●

●●●●●

●●

●●●●●●●●●●●

●●●

●●

●●●

● ●

●●●

●●

●●

●●●

●●●●

●●

●●●●

●●●●

●●●

●●●●

●●

●●●●●●

●●●●

●●●

●●●

●●

●●●●●●

●●●

●●●●

●●●●●●●

●●●

●●

●●●●

●●

●●●

●●

●●●●●●●

●●●

●●●●●●

●●●●

●●●●●

●●

●●●

●●●

●●●

●●●●●●

●●●●●●●

●●

●●●●●

● ●●●●●

●●

●●●

●●

●●

●●

●●●●●●●●●●

●●

●●

●●●●

●●●●●

●●●●●●●

●●●●●●●●

●●●●●●●●●●● ●

●●●●

●●●

●●● ●●

●●●●

●● ●● ●●●●●●●●●●

●●

●●

●●

●●●

●●

● ●

●●

●●●●●

● ●●●●●

● ●●●●

●●●●●●●

●● ●●●●●

●● ●●●●●●

●● ●●●●●

●●●●●●●

●●●●●●●

●●

●●●

●●●

●●

● ●

●● ●

●●

●●

●●

●●

● ●

● ●

●●●

●●

●●

●●

● ●●●

●●

●●●

●●

●●

●●

●●●

●●●

●●●

●●

●●

●●

●●●●

●●●●

●●

●●

●●●●●

● ●

●●●●

●●● ●●●●●●●●

●●

●●●●

●●●●●●●●●

●●●

●●●●

●●●●

●●●

●●●

−0.02 0.00 0.02 0.04 0.06

−0.

020.

000.

020.

04

EM

PC1

PC

2

EURASNAMRAFR

●●●

●●●●●

●●●●●●●●●●●

●●

●●●●●

●●●●●

●●

●●●●

●●●

●●

●●●●

●●●●

●●

●●

●●●

●●●●

●●●

●●●●

●●●●●●

●●●●

●●

●●●●●

●●●●●

●●●

●●

●●●●●●●●●●

●●●

●●●●●●

●●●

●●●

●●

●●●●●●●

●●●

●●●

●●●

●●●●●

●●●●●

●●●●

●●

●●

●●

●●●●●

●●

●●●

●●●●●

●●

●●●

●●●●●

●●●

●●●●

●●

●●●●●

●●●

●●●

●●

●●●●●●

●●●●

●●

●●●

●●●●●

●●

●●

●●●

●●●

●●

●●

●●

●●

●● ●

●●

●●

●●

●●●

●●

●● ●

●●

●● ●

●●

●●●

●●

●●

● ●●

●●

●●●

●●

●●

●●

●●●●●●

●●

●●

●●●●●

●●●●

●●

●●

●●

●●●●●●

●●

●●

●●●●

●●●

●●●●●

●●●

●●●●●

●●●●●●●●

●●●●●

●●

●●●●●●●●●●●

●●●

●●

●●●

● ●

●●●

●●

●●

●●●

●●●●

●●

●●●●

●●●●

●●●

●●●●

●●

●●●●●●

●●●●

●●●

●●●

●●

●●●●●●

●●●

●●●●

●●●●●●●

●●●

●●

●●●●

●●

●●●

●●

●●●●●●●

●●●

●●●●●●

●●●●

●●●●●

●●

●●●

●●●

●●●

●●●●●●

●●●●●●●

●●

●●●●●

● ●●●●●

●●

●●●

●●

●●

●●

●●●●●●●●●●

●●

●●

●●●●

●●●●●

●●●●●●●

●●●●●●●●

●●●●●●●●●●● ●

●●●●

●●●

●●● ●●

●●●●

●● ●● ●●●●●●●●●●

●●

●●

●●

●●●

●●

● ●

●●

●●●●●

● ●●●●●

● ●●●●

●●●●●●●

● ●●●●●

●● ●●●●●●

●● ●●●●●

●●●●●●●

●●●●●●●

●●

●●●

●●●

●●

● ●

●● ●

●●

●●

●●

●●

● ●

● ●

●●●

●●

●●

●●

● ●●●

●●

●●●

●●

●●

●●

●●●

●●●

●●●

●●

●●

●●

●●●●

●●●●

●●

●●

●●●●●

● ●

●●●●

●●● ●●●●●●●●

●●

●●●●

●●●●●●●●●

●●●

●●●●

●●●●

●●●

●●●

−0.02 0.00 0.02 0.04 0.06−

0.02

0.00

0.02

0.04

SVD

PC1

PC

2

EURASNAMRAFR

Page 20: Scalable Probabilistic PCA - ww2.amstat.org › meetings › sdss › 2018 › online... · Scalable Probabilistic PCA PCA Probabilistic PCA Results Probabilistic PCA EM algorithm:

ScalableProbabilistic

PCA

PCA

ProbabilisticPCA

Results

Other advantages of Probabilistic PCA

Can naturally handle missing data

• E-step involves inferring hidden variables Z as well ashidden (missing observations).

• Can handle missing data efficiently.

Can use model selection to infer K.

• Choose K to maximize the marginal likelihood P (X|K).

• Use cross-validation and pick K that maximizes likelihoodon held out data.

Page 21: Scalable Probabilistic PCA - ww2.amstat.org › meetings › sdss › 2018 › online... · Scalable Probabilistic PCA PCA Probabilistic PCA Results Probabilistic PCA EM algorithm:

ScalableProbabilistic

PCA

PCA

ProbabilisticPCA

Results

Open questions

• Modeling correlations .

• Beyond Gaussian outputs

Baran et al. 2013, Wen and Stephens 2012Collins et al. 2002

Page 22: Scalable Probabilistic PCA - ww2.amstat.org › meetings › sdss › 2018 › online... · Scalable Probabilistic PCA PCA Probabilistic PCA Results Probabilistic PCA EM algorithm:

ScalableProbabilistic

PCA

PCA

ProbabilisticPCA

Results

Summary

• PCA can be interpreted as a latent variable model withcontinuous latent variable.

• Probabilistic interpretation useful to generalize PCA.• Leads to efficient inference.

• Fast matrix-vector multiplication for genotype data moregenerally applicable and can lead to sub-linear timealgorithms.

Page 23: Scalable Probabilistic PCA - ww2.amstat.org › meetings › sdss › 2018 › online... · Scalable Probabilistic PCA PCA Probabilistic PCA Results Probabilistic PCA EM algorithm:

ScalableProbabilistic

PCA

PCA

ProbabilisticPCA

Results

Acknowledgments

• Aman Agrawal

• Minh Le

• Eran Halperin

Email [email protected]