Efficient dimension reduction on massive data
MMDS 2010

Stoyan Georgiev (1), Sayan Mukherjee (2), Nick Patterson (3)

(1) Computational Biology and Bioinformatics Program, Institute for Genome Sciences & Policy, Duke University
(2) Departments of Statistical Science, Computer Science, and Mathematics, Institute for Genome Sciences & Policy, Duke University
(3) Broad Institute of MIT and Harvard

June 17, 2010

Overview

(1) Motivating genetics problem.
(2) Current approach (PCA).
(3) Factor models.
    (a) A genetic example.
    (b) The Frisch problem.
(4) Supervised dimension reduction.
(5) Efficient subspace inference.
(6) Empirical results.
    (a) Wishart simulations.
    (b) Text example.

Motivating genetics problem

Inference of population structure

A classic problem in biology and genetics is the study of population structure.

(1) Does genetic variation in populations follow geography?
(2) Can we infer population histories from genetic variation?
(3) When we associate genetic loci (locations) with disease, we need to correct for population structure.

Genetic data

For each individual we have two letters from {A, C, T, G} at each polymorphic (SNP) site, which is coded as an integer in {0, 1, 2}:

    C_i = (AC, ..., GG, ..., TT) ⟹ (1, ..., 0, ..., 2) ∈ ℝ^{500,000},    C = [C_1, ..., C_m].
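As a concrete illustration of this coding, here is a minimal sketch: it counts copies of a designated reference allele at each site, which reproduces the 0/1/2 integers above. The per-site reference choice is my assumption; the slides do not specify it.

```python
import numpy as np

def encode_genotypes(genotypes, ref_alleles):
    """genotypes: two-letter strings, e.g. ["AC", "GG", "TT"].
    ref_alleles: hypothetical reference letter at each site.
    Returns the count of the reference allele, in {0, 1, 2}."""
    return np.array([g.count(ref) for g, ref in zip(genotypes, ref_alleles)],
                    dtype=np.int8)

C_i = encode_genotypes(["AC", "GG", "TT"], ["A", "C", "T"])
print(C_i)  # [1 0 2], matching the example above
```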

Genetic data encodes population history

[Figure from Novembre et al. 2008 (Nature).]

Current approach

Dominant method for inference of population structure

Eigenstrat: Patterson et al. 2006 (PLoS Genetics). Combines principal components analysis and Tracy-Widom theory to infer population structure.

(1) Normalize: M_ij = (C_ij − μ_j) / √( (μ_j/2)(1 − μ_j/2) ) for all i, j.
(2) Form X = (1/n) M M′.
(3) Order λ_1, ..., λ_m and test for significant eigenvalues using Tracy-Widom statistics.
(4) Compute the effective number of markers

    n′ = ( (m + 1)(∑_i λ_i)² ) / ( (m − 1)∑_i λ_i² − (∑_i λ_i)² ).
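A minimal numerical sketch of steps (1)-(4) (my variable names; the per-eigenvalue Tracy-Widom significance test itself is omitted):

```python
import numpy as np

def eigenstrat_spectrum(C):
    """C: m x n genotype matrix in {0, 1, 2}, rows = individuals,
    columns = SNPs (the transpose of C as written above). Assumes
    every SNP is polymorphic (0 < p < 1); monomorphic sites must
    be removed first. Returns the ordered eigenvalues and the
    effective number of markers n'."""
    m, n = C.shape
    mu = C.mean(axis=0)                         # per-SNP mean
    p = mu / 2.0                                # allele frequency
    M = (C - mu) / np.sqrt(p * (1 - p))         # (1) normalize
    X = (M @ M.T) / n                           # (2) X = (1/n) M M'
    lam = np.linalg.eigvalsh(X)[::-1]           # (3) ordered eigenvalues
    s1, s2 = lam.sum(), (lam ** 2).sum()
    n_eff = (m + 1) * s1**2 / ((m - 1) * s2 - s1**2)  # (4)
    return lam, n_eff
```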

The challenge

We will be getting genetic data with n ≥ 500,000 and m ≥ 30,000.

Can we extend Eigenstrat so that it runs on this data on a standard desktop?

Yes! But...

Factor models

Probabilistic view of PCA

X ∈ ℝ^p is characterized by a multivariate normal

    X ∼ No(μ + Aν, Δ),    ν ∼ No(0, I_d),

with μ ∈ ℝ^p, A ∈ ℝ^{p×d}, Δ ∈ ℝ^{p×p}, and ν ∈ ℝ^d.

ν is a latent variable; what is d?
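A minimal generative sketch of this model (my choice of dimensions; Δ is taken diagonal here, as in classical factor analysis, though the slide allows a general p × p matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
p, d, m = 50, 2, 1000

mu = rng.normal(size=p)               # mean, mu in R^p
A = rng.normal(size=(p, d))           # loadings, A in R^{p x d}
delta = 0.1 * np.ones(p)              # diagonal of Delta (assumption)

nu = rng.normal(size=(m, d))          # latent factors, nu ~ No(0, I_d)
eps = rng.normal(size=(m, p)) * np.sqrt(delta)
X = mu + nu @ A.T + eps               # X ~ No(mu + A nu, Delta)
```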

A genetic example

We obtain genetic data from Yoruba (African) and Japanese people.

(1) Run Eigenstrat: obtain 4 PCs.
(2) Run a factor model constrained to two factors. Observe that the components are the mean allele frequencies of the two populations.

Which is right?

Both

Let us decompose the covariance of the genetic variation Σ:

(1) μ_1: mean allele frequency in Yoruba.
(2) μ_2: mean allele frequency in Japanese.
(3) Σ = g(μ_1 μ_1′ + μ_2 μ_2′ + μ_1 μ_2′).

So the covariance is rank 4 even if two factors capture the allele structure in the two populations.

Frisch problem (1934)

Given m observations of n variables, what are the linear relations between the variables, and how many linear relations are there?

    minimize    rank(Σ − Ψ)
    subject to  Σ − Ψ ⪰ 0,
                ψ_j > 0,

where Ψ = diag(ψ_1, ..., ψ_n).
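Rank is not convex, so this problem is hard as stated. A standard convex surrogate is minimum-trace factor analysis, which replaces rank with trace; a sketch using cvxpy (my formulation, with ψ_j ≥ 0 in place of strict positivity):

```python
import numpy as np
import cvxpy as cp

def min_trace_factor_analysis(Sigma):
    """Convex surrogate for the Frisch problem: minimize
    trace(Sigma - Psi) subject to Sigma - Psi PSD and psi >= 0,
    with Psi diagonal. Trace stands in for rank."""
    n = Sigma.shape[0]
    psi = cp.Variable(n, nonneg=True)        # diagonal of Psi
    L = Sigma - cp.diag(psi)                 # candidate low-rank part
    prob = cp.Problem(cp.Minimize(cp.trace(L)), [L >> 0])
    prob.solve()
    return psi.value, L.value
```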

Possible way out

The important quantity we should worry about is the subspace we project onto:

    infer B = span(v_1, ..., v_d).

Supervised dimension reduction (SDR)

Given response variables Y_1, ..., Y_m ∈ ℝ and explanatory variables or covariates X_1, ..., X_m ∈ 𝒳 ⊂ ℝ^p,

    Y_i = f(X_i) + ε_i,    ε_i ∼ No(0, σ²) i.i.d.

Is there a subspace S ≡ S_{Y|X} such that Y ⊥⊥ X | P_S(X), with

    P_S(X) = B′X,    B = (b_1, ..., b_d)?

Distribution theory for SDR

Sliced inverse regression: K.C. Li 1991 (JASA).

(1) Define the following quantities:

    Ω ≡ cov(E[X | Y]),    Σ_X ≡ cov(X).

(2) Solve the generalized eigen-decomposition problem

    Ωb = λΣb.

(3) B = span(b_1, ..., b_d) for all i = 1, ..., d such that λ_i ≥ ε.
(4) This idea works if p(X | Y) is elliptical (unimodal).

An algorithm

The data are x_1, ..., x_m and y_1, ..., y_m.

(1) Compute the sample covariance matrix Σ_X.
(2) Bin the values {y_i}_{i=1}^m into S bins.
(3) For each bin s = 1, ..., S compute the mean of the x_i ∈ s:

    μ_s = (1/n_s) ∑_{i∈s} x_i.

(4) Compute

    Ω = (1/S) ∑_s μ_s μ_s′.

(5) Solve Ωb = λΣb.
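A compact sketch of this algorithm (my variable names; centering the slice means and the small ridge added to Σ for numerical stability are my additions):

```python
import numpy as np
from scipy.linalg import eigh

def sir(X, y, n_slices=10, d=2):
    """Sliced inverse regression. X: m x p data, y: m responses.
    Returns the top-d generalized eigenvectors of Omega b = lambda Sigma b."""
    m, p = X.shape
    Sigma = np.cov(X, rowvar=False)                   # (1) sample covariance
    slices = np.array_split(np.argsort(y), n_slices)  # (2) bin the y values
    xbar = X.mean(axis=0)
    means = np.stack([X[idx].mean(axis=0) - xbar      # (3) centered slice means
                      for idx in slices])
    Omega = means.T @ means / n_slices                # (4) Omega
    # (5) generalized symmetric eigenproblem, with a small ridge
    lam, B = eigh(Omega, Sigma + 1e-8 * np.eye(p))
    order = np.argsort(lam)[::-1][:d]
    return B[:, order], lam[order]
```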

Subgroups or multimodal data

n = 7129 dimensions, m = 38 samples; 19 are Acute Myeloid Leukemia (AML) and 19 are Acute Lymphoblastic Leukemia (B-cell and T-cell).

[Figure: two-dimensional projection of the leukemia samples.]

Localization

Local sliced inverse regression: Wu et al. 2010 (JCGS).

(1) Define the following quantities:

    Ω_loc ≡ cov(E[X_loc | Y]),    Σ_X ≡ cov(X).

(2) Solve the generalized eigen-decomposition problem

    Ω_loc b = λΣb.

(3) B = span(b_1, ..., b_d) for all i = 1, ..., d such that λ_i ≥ ε.
(4) This idea works if p(X_loc | Y) is elliptical (unimodal).

Metrics for subspace estimates

Given two subspaces B and B̂, we consider two metrics for the similarity of B̂ to B.

(1) Qiang: projection onto,

    (1/d) ∑_{i=1}^d ||P_B b̂_i||² = (1/d) ∑_{i=1}^d ||(BBᵀ) b̂_i||².

(2) Golub: angle between,

    dist(B, B̂) = √(1 − cos(θ_d)²),

where the principal angles θ_1, ..., θ_d are computed recursively:

    cos(θ_i) = max_{u∈B} max_{v∈B̂} u′v = u_i′v_i

subject to ‖u‖ = ‖v‖ = 1, u ⊥ u_1, ..., u_{i−1}, v ⊥ v_1, ..., v_{i−1}.
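A short sketch of both metrics, assuming B and B̂ are given as p × d matrices with orthonormal columns (so that P_B = BBᵀ is a projection):

```python
import numpy as np
from scipy.linalg import subspace_angles

def qiang_similarity(B, B_hat):
    """Mean squared norm of the projections of the estimated basis
    vectors onto the true subspace; 1 means perfect recovery."""
    proj = (B @ B.T) @ B_hat               # P_B applied to each b_hat_i
    return np.mean(np.sum(proj ** 2, axis=0))

def golub_distance(B, B_hat):
    """sqrt(1 - cos(theta_d)^2), i.e. the sine of the largest
    principal angle between the two subspaces."""
    theta = subspace_angles(B, B_hat)      # principal angles
    return np.sin(np.max(theta))
```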

Digits

[Figure: sample handwritten digit images.]

All ten digits

    digit     Nonlinear         Linear
    0         0.04 (±0.01)      0.05 (±0.01)
    1         0.01 (±0.003)     0.03 (±0.01)
    2         0.14 (±0.02)      0.19 (±0.02)
    3         0.11 (±0.01)      0.17 (±0.03)
    4         0.13 (±0.02)      0.13 (±0.03)
    5         0.12 (±0.02)      0.21 (±0.03)
    6         0.04 (±0.01)      0.0816 (±0.02)
    7         0.11 (±0.01)      0.14 (±0.02)
    8         0.14 (±0.02)      0.20 (±0.03)
    9         0.11 (±0.02)      0.15 (±0.02)
    average   0.09              0.14

Table: Average classification error rate and standard deviation on the digits data.

Efficient subspace inference

Randomized methods

By combining (block) Lanczos and random projections, Rokhlin et al. 2009 (SIAM J. Matrix Anal. Appl.) obtained a fast, provable, randomized method.

The matrix is m × n and of rank k, and t is the number of iterations of a power method. With high probability, the top k eigenvalues and eigenvectors can be well approximated in time

    O(mnkt).

Randomized PCA

Data: A ∈ ℝ^{m×n}; number of eigenvalues k, with 2k ≤ m; number of iterations i.

(A) Find an orthonormal basis for the range of A:
    (a) Draw G ∈ ℝ^{m×ℓ} with entries from U[−1, 1].
    (b) R_0 = AᵀG.
    (c) For j = 1, ..., i: R_j = (AᵀA) R_{j−1}.
    (d) R = [R_0 ... R_i].
    (e) Factor R = QS, with Q orthonormal and S upper triangular.

(B) Project the data and compute the SVD:
    (a) B = AQ.
    (b) Factorize B = UΣWᵀ (using the SVD).
    (c) Set U = U(:, 1:k), Σ = Σ(1:k), V = AᵀUΣ⁻¹.
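A direct NumPy sketch of this scheme (my parameter names; ℓ is set to 2k here to match the "2k ≤ m" input convention, and no re-orthonormalization is done between power iterations, which a careful implementation would add for numerical stability):

```python
import numpy as np

def randomized_pca(A, k, n_iter=3, seed=0):
    """Randomized PCA sketch. A: m x n, k: number of components,
    n_iter: power-method iterations (the slide's i)."""
    m, n = A.shape
    ell = 2 * k                                    # block size (assumption)
    rng = np.random.default_rng(seed)
    G = rng.uniform(-1.0, 1.0, size=(m, ell))      # (A.a) G ~ U[-1, 1]
    R = A.T @ G                                    # (A.b) R_0 = A'G
    blocks = [R]
    for _ in range(n_iter):                        # (A.c) R_j = (A'A) R_{j-1}
        R = A.T @ (A @ R)
        blocks.append(R)
    Q, _ = np.linalg.qr(np.hstack(blocks))         # (A.d)-(A.e) R = QS
    B = A @ Q                                      # (B.a) project the data
    U, s, _ = np.linalg.svd(B, full_matrices=False)  # (B.b) B = U Sigma W'
    U, s = U[:, :k], s[:k]                         # (B.c) keep top k
    V = A.T @ U / s                                # V = A'U Sigma^{-1}
    return U, s, V
```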

Our method

Iterate randomized PCA on the Gram matrix A = XX′ ∈ ℝ^{n×n} until the subspace converges.

Main differences:

(1) Work with the Gram matrix to avoid storing in memory matrices the size of the data.
(2) Pack/unpack the SNP data into bytes with 2-bit fields (see the sketch below).
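A minimal sketch of 2-bit genotype packing: four genotypes per byte, least-significant bits first. This layout is hypothetical; the actual implementation's byte layout is not specified above.

```python
import numpy as np

def pack_genotypes(g):
    """Pack genotypes in {0, 1, 2} into 2-bit fields, 4 per byte."""
    g = np.asarray(g, dtype=np.uint8)
    pad = (-len(g)) % 4                      # pad to a multiple of 4
    g = np.concatenate([g, np.zeros(pad, dtype=np.uint8)]).reshape(-1, 4)
    packed = g[:, 0] | (g[:, 1] << 2) | (g[:, 2] << 4) | (g[:, 3] << 6)
    return packed.astype(np.uint8)

def unpack_genotypes(packed, n):
    """Inverse of pack_genotypes; n is the original genotype count."""
    packed = np.asarray(packed, dtype=np.uint8)
    cols = [(packed >> shift) & 0b11 for shift in (0, 2, 4, 6)]
    return np.stack(cols, axis=1).reshape(-1)[:n]
```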

Empirical results

Wishart: timing

[Figure: running time versus m for Wishart simulations (iter = 6).]

Wishart: error

[Figure: Qiang's dissimilarity versus m for Wishart simulations (iter = 8).]

Reuters data: error

[Figure: Qiang's dissimilarity versus m for Reuters documents (iter = 5).]

Conclusion

(1) We can infer population structure in large data using eigen-decomposition.
(2) Interpretation of the subspace is easier than interpretation of the factors.
(3) The method can be applied to generalized eigen-decomposition.
(4) Loss of numerical precision?
(5) Fast computation of Tracy-Widom statistics using Fredholm determinants: Bornemann 2009 (arXiv).

Acknowledgements

Funding:

• Center for Systems Biology at Duke
• NSF
• NIH