Probabilistic Sparse Matrix Factorization Delbert Dueck, Quaid Morris, Brendan Frey (Probabilistic & Statistical Inference Group) Tim Hughes (Banting and Best Department of Medical Research)


Page 1: Probabilistic Sparse Matrix Factorization

Probabilistic Sparse Matrix Factorization

Delbert Dueck, Quaid Morris, Brendan Frey (Probabilistic & Statistical Inference Group)

Tim Hughes (Banting and Best Department of Medical Research)

Page 2: Probabilistic Sparse Matrix Factorization

Objective

Patterns in gene expression array data can be used to help understand gene regulation and predict the function of yet-uncharacterized genes

Objective: To develop a method of probabilistic sparse matrix factorization (PSMF) and apply it to gene expression data to learn the hidden structure underlying the data.

Page 3: Probabilistic Sparse Matrix Factorization

Biological Background

Genes encode basic information about an organism.

They tend to be highly expressed in tissues related to their functional role.

Mouse gene expression data is from Zhang, Morris, et al. (2004).

Gene expression is influenced by the presence of transcription factors (TFs).

Co-expressed genes are likely activated by the same TFs.

The activity of each gene can be explained by the activities of a small number of transcription factors.

Page 4: Probabilistic Sparse Matrix Factorization

Gene Expression Array Dataset

Entire data set: X, a G×T matrix (G=22709 genes, T=55 tissues) of scalar expression values ($x_{gt}$).

[Figure: heat map of the full G×T expression array, with a zoomed inset of 100 genes. Tissue columns labeled in the inset: bladder (t=3), hindbrain (t=22), colon (t=9), midbrain (t=31), large intestine (t=25), lymph node (t=28), stomach (t=45), spleen (t=44), pancreas (t=34), small intestine (t=41). Color scale: 0 2 4 6 8 >10.]

Expression vector for gene XM_133866.1: $\mathbf{x}_g$ (g=10056), a row vector of length T=55.

Page 5: Probabilistic Sparse Matrix Factorization

Sparse Matrix Factorization

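Gene expression data model: each gene's expression profile ($\mathbf{x}_g$) is …

a linear combination (weighted by $y_{gc}$, $c \in \mathbf{s}_g$) …

of a small number ($r_g \le N$) …

of C possible transcription factor profiles ($\mathbf{z}_c$, $c \in \mathbf{s}_g$):

$$\mathbf{x}_g \approx \sum_{n=1}^{r_g} y_{g s_{gn}} \, \mathbf{z}_{s_{gn}}$$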
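The sum above can be traced with a few lines of plain Python. This is only an illustrative sketch: the function name `reconstruct_profile`, the 0-based factor indices, and the toy numbers are assumptions made here, not part of the slides.

```python
def reconstruct_profile(weights, factor_ids, Z):
    """Return sum_n y_{g,s_gn} * z_{s_gn} for a single gene g.

    weights    -- the r_g weights y_{g,s_gn} (hypothetical toy values)
    factor_ids -- the r_g factor indices s_gn (0-based in this sketch)
    Z          -- list of C factor profiles, each a list of length T
    """
    T = len(Z[0])
    profile = [0.0] * T
    for y, c in zip(weights, factor_ids):
        for t in range(T):
            profile[t] += y * Z[c][t]
    return profile


# Toy setting: C=3 factor profiles over T=4 tissues (made-up numbers).
Z = [
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
    [2.0, 2.0, 2.0, 2.0],
]

# A gene using r_g = 2 factors: factor 0 with weight 0.5, factor 2 with weight 1.5.
x_hat = reconstruct_profile([0.5, 1.5], [0, 2], Z)
print(x_hat)
```

Only the factors listed in `factor_ids` contribute, which is the sense in which the representation is sparse.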

Page 6: Probabilistic Sparse Matrix Factorization

Sparse Matrix Factorization

Matrix format (entire dataset):

$$\underbrace{X}_{G \times T} \approx \underbrace{Y}_{G \times C} \cdot \underbrace{Z}_{C \times T}$$

[Figure: X written out as a dense G×T matrix of entries $x_{gt}$; Y written out as a sparse G×C matrix in which each row holds a few nonzero weights $y_{gc}$ and zeros elsewhere; Z as a dense C×T matrix. Alongside, the arrays S and r list, for each row of Y, which factors are used and how many.]
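To make the roles of S and r concrete, the following sketch expands per-gene assignments into the sparse matrix Y and forms the product Y·Z. The helper names `build_Y` and `matmul`, the 0-based indices, and all numbers are hypothetical choices for illustration.

```python
def build_Y(S, r, weights, C):
    """Expand per-gene factor assignments into a G x C matrix that is mostly zeros."""
    Y = []
    for s_g, r_g, w_g in zip(S, r, weights):
        row = [0.0] * C
        for n in range(r_g):          # only the first r_g assignments are active
            row[s_g[n]] = w_g[n]
        Y.append(row)
    return Y


def matmul(A, B):
    """Plain-Python matrix product of A (m x k) and B (k x p)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Toy setting: G=2 genes, C=3 factors, T=2 tissues, at most N=2 factors per gene.
Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
S = [[0, 2], [1, 0]]                  # factor indices s_gn (0-based here)
r = [2, 1]                            # gene 0 uses two factors, gene 1 uses one
weights = [[2.0, 3.0], [4.0, 0.0]]    # y_{g,s_gn}; entries beyond r_g are ignored

Y = build_Y(S, r, weights, C=3)
X_hat = matmul(Y, Z)
print(Y)
print(X_hat)
```

Storing (S, r, weights) instead of the full Y keeps at most N numbers per gene, which matters when G is in the tens of thousands.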

Page 7: Probabilistic Sparse Matrix Factorization

Probabilistic Sparse Matrix Factorization

To express as a distribution, assume …

varying levels of Gaussian noise in the data:

$$P(\mathbf{x}_g \mid \mathbf{y}_g, Z, \mathbf{s}_g, r_g, \psi_g) = \mathcal{N}\Big(\mathbf{x}_g;\ \sum_{n=1}^{r_g} y_{g s_{gn}} \mathbf{z}_{s_{gn}},\ \psi_g^2 I\Big)$$

nothing about transcription factor weights:

$$P(\mathbf{y}_g) \propto 1$$

normally-distributed transcription factor profiles:

$$P(\mathbf{z}_c) = \mathcal{N}(\mathbf{z}_c;\ \mathbf{0},\ I)$$

uniformly-distributed factor assignments:

$$P(\mathbf{s}_g \mid r_g) = \prod_{n=1}^{r_g} \prod_{c=1}^{C} \Big(\tfrac{1}{C}\Big)^{\delta(s_{gn} - c)}$$

multinomially-distributed factor counts:

$$P(r_g = n) = \nu_n$$

Page 8: Probabilistic Sparse Matrix Factorization

Probabilistic Sparse Matrix Factorization

Multiply the five distributions above together to get the joint distribution:

$$P(X, Y, Z, S, \mathbf{r} \mid \Psi) = P(X \mid Y, Z, S, \mathbf{r}, \Psi) \cdot P(Y) \cdot P(Z) \cdot P(S \mid \mathbf{r}) \cdot P(\mathbf{r})$$

$$\propto \Big[\prod_{g=1}^{G} \mathcal{N}\Big(\mathbf{x}_g;\ \sum_{n=1}^{r_g} y_{g s_{gn}} \mathbf{z}_{s_{gn}},\ \psi_g^2 I\Big)\Big] \Big[\prod_{c=1}^{C} \mathcal{N}(\mathbf{z}_c;\ \mathbf{0},\ I)\Big] \Big[\prod_{g=1}^{G} \prod_{c=1}^{C} \prod_{n=1}^{N} \Big(\tfrac{1}{C}\Big)^{\delta(s_{gn} - c)}\Big] \Big[\prod_{g=1}^{G} \prod_{n=1}^{N} \nu_n^{\delta(r_g - n)}\Big]$$
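As a rough illustration of how the joint distribution can be evaluated for a given configuration, the sketch below computes its logarithm up to the constant hidden by the proportionality sign. Several things are assumptions of this sketch and not statements from the slides: the function names, the 0-based indices, passing the noise variances as a list `psi2` of $\psi_g^2$ values, and counting the $1/C$ term only for the $r_g$ active assignments of each gene.

```python
import math


def log_normal_iso(x, mean, var):
    """Log density of an isotropic Gaussian N(x; mean, var * I)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, mean))
    return -0.5 * len(x) * math.log(2 * math.pi * var) - 0.5 * sq / var


def log_joint(X, weights, S, r, Z, psi2, nu):
    """Log of the joint over (X, Y, Z, S, r), up to an additive constant."""
    C, T = len(Z), len(Z[0])
    total = 0.0
    for x_g, w_g, s_g, r_g, var_g in zip(X, weights, S, r, psi2):
        # Likelihood term: x_g is Gaussian around its sparse reconstruction.
        mean = [sum(w_g[n] * Z[s_g[n]][t] for n in range(r_g)) for t in range(T)]
        total += log_normal_iso(x_g, mean, var_g)
        # Uniform prior over the r_g active factor assignments (assumption: inactive slots add nothing).
        total += r_g * math.log(1.0 / C)
        # Prior on the number of factors used by this gene.
        total += math.log(nu[r_g - 1])
    for z_c in Z:
        # Standard normal prior on each factor profile.
        total += log_normal_iso(z_c, [0.0] * T, 1.0)
    return total


# Toy check: assigning each gene to the factor that matches it scores higher.
Z = [[1.0, 0.0], [0.0, 1.0]]
X = [[2.0, 0.1], [0.0, 3.0]]
weights = [[2.0], [3.0]]
r = [1, 1]
psi2 = [0.5, 0.5]
nu = [0.55, 0.27, 0.18]   # P(r_g) values quoted on the Visualization slide
good = log_joint(X, weights, [[0], [1]], r, Z, psi2, nu)
bad = log_joint(X, weights, [[1], [0]], r, Z, psi2, nu)
print(good > bad)
```

The flat prior on the weights contributes only a constant, so it does not appear in the code.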

Page 9: Probabilistic Sparse Matrix Factorization

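Factorized Variational Inference

Exact inference is intractable with P(∙):

$$P(X, Y, Z, S, \mathbf{r} \mid \Psi) \propto \Big[\prod_{g=1}^{G} \mathcal{N}\Big(\mathbf{x}_g;\ \sum_{n=1}^{r_g} y_{g s_{gn}} \mathbf{z}_{s_{gn}},\ \psi_g^2 I\Big)\Big] \Big[\prod_{c=1}^{C} \mathcal{N}(\mathbf{z}_c;\ \mathbf{0},\ I)\Big]$$

$$\cdot \Big[\prod_{g=1}^{G} \prod_{c=1}^{C} \prod_{n=1}^{N} \Big(\tfrac{1}{C}\Big)^{\delta(s_{gn} - c)}\Big] \Big[\prod_{g=1}^{G} \prod_{n=1}^{N} \nu_n^{\delta(r_g - n)}\Big]$$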

Page 10: Probabilistic Sparse Matrix Factorization

Factorized Variational Inference

Exact inference is intractable with P(∙).

Approximate it by a simpler distribution, Q(∙), and perform inference on that:

$$P(Y, Z, S, \mathbf{r} \mid X, \Psi) \approx \prod_{g=1}^{G} \prod_{c=1}^{C} Q(y_{gc}) \cdot \prod_{c=1}^{C} \prod_{t=1}^{T} Q(z_{ct}) \cdot \prod_{g=1}^{G} \prod_{n=1}^{N} Q(s_{gn}) \cdot \prod_{g=1}^{G} Q(r_g)$$
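The slides do not spell out the objective used to fit Q(∙). A common formulation, assumed here for illustration, is to minimize the variational free energy, which differs from the KL divergence between Q and the true posterior only by a term that does not depend on Q:

$$\mathcal{F}(Q) = \sum_{S, \mathbf{r}} \int Q(Y, Z, S, \mathbf{r}) \, \log \frac{Q(Y, Z, S, \mathbf{r})}{P(X, Y, Z, S, \mathbf{r} \mid \Psi)} \, dY \, dZ$$

$$\mathcal{F}(Q) = \mathrm{KL}\big(Q(Y, Z, S, \mathbf{r}) \,\big\|\, P(Y, Z, S, \mathbf{r} \mid X, \Psi)\big) - \log P(X \mid \Psi)$$

Because $\log P(X \mid \Psi)$ is fixed for a given $\Psi$, lowering $\mathcal{F}$ over the factors of Q brings Q closer to the posterior, and the fully factorized form lets each factor be updated in turn while the others are held fixed.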

Page 11: Probabilistic Sparse Matrix Factorization

Visualization

[Figure: heat maps of the data X (G genes × T tissues) next to its factorization X ≈ Y Z, shown as the product image (G genes × T tissues), Y (G genes × C factors) and Z (C factors × T tissues).]

PROBABILISTIC SPARSE MATRIX FACTORIZATION

C=50 possible factors

N=3 factors per gene (max), $P(r_g)$ = [.55 .27 .18]

Scale: 0 2 4 6 8 >10

*Sorted by primary transcription factor ($s_{g1}$)

Page 12: Probabilistic Sparse Matrix Factorization

Results – p-value histograms

Genes can be partitioned into “primary classes” (i.e. same $s_{g1}$ value), “secondary classes”, etc.

Compare classes with annotated gene ontology (GO-BP) categories for statistical significance.

[Figure: five histograms of log10(p-value), one panel each for random clustering, hierarchical agglomerative clustering, PSMF (primary), PSMF (secondary) and PSMF (tertiary). x-axis: log10(p-value), from -20 to 0; y-axis: frequency, from 0 to 0.4.]
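The slides do not say which test produced these p-values. One common choice for scoring the overlap between a gene class and an annotation category is the hypergeometric tail probability, sketched below purely as an assumption; the function name `enrichment_p_value` and the toy counts are hypothetical.

```python
from math import comb, log10


def enrichment_p_value(n_genes, n_annotated, class_size, overlap):
    """P(a random class of `class_size` genes contains at least `overlap` annotated genes)."""
    denom = comb(n_genes, class_size)
    upper = min(n_annotated, class_size)
    tail = sum(
        comb(n_annotated, k) * comb(n_genes - n_annotated, class_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / denom


# Toy numbers: 10 genes in total, 4 carry the annotation,
# and a class of 3 genes contains 2 of the annotated ones.
p = enrichment_p_value(n_genes=10, n_annotated=4, class_size=3, overlap=2)
print(p, log10(p))
```

Smaller p-values (more negative log10 values) mean the class lines up with the annotation category more strongly than a random grouping of the same size would.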

Page 13: Probabilistic Sparse Matrix Factorization

Results – mean log10 p-values

[Figure, left panel: "Mean log10 p-values". x-axis: C (# clusters, factors), from 10 to 100; y-axis: mean log10(p-value), from -25 to 0.]

[Figure, right panel: "Fraction of factors with significance". x-axis: C (# clusters, factors), from 10 to 100; y-axis: fraction of factors with significance, from 0% to 100%.]

Curves in both panels: PSMF N={1,2,3} primary (i.e. $s_{g1}$ clustering); hierarchical agglomerative clustering; PSMF N={2,3} secondary; PSMF N=3 tertiary; random clustering.

Page 14: Probabilistic Sparse Matrix Factorization

Results – count of significant p-values

[Figure, left panel: "Mean log10 p-values". x-axis: C (# clusters, factors), from 10 to 100; y-axis: mean log10(p-value), from -25 to 0.]

[Figure, right panel: "Fraction of factors with significance". x-axis: C (# clusters, factors), from 10 to 100; y-axis: fraction of factors with significance, from 0% to 100%.]

Curves in both panels: PSMF N={1,2,3} primary (i.e. $s_{g1}$ clustering); hierarchical agglomerative clustering; PSMF N={2,3} secondary; PSMF N=3 tertiary; random clustering.

Page 15: Probabilistic Sparse Matrix Factorization

Future Directions – different Q(·)

[Figure: complete log likelihood (axis scaled by 10^4) versus iteration, from 0 to 50.]

**NOTE: The complete log likelihoods are not necessarily monotonically increasing due to the non-negativity constraint, implemented via a zero-thresholding heuristic.

$$\prod_{g=1}^{G} \prod_{c=1}^{C} Q(y_{gc}) \cdot \prod_{c=1}^{C} Q(\mathbf{z}_c) \cdot \prod_{g=1}^{G} \prod_{n=1}^{N} Q(s_{gn}) \cdot \prod_{g=1}^{G} Q(r_g)$$

$$\prod_{g=1}^{G} \prod_{c=1}^{C} Q(y_{gc}) \cdot \prod_{c=1}^{C} Q(\mathbf{z}_c) \cdot \prod_{g=1}^{G} Q(\mathbf{s}_g, r_g)$$

$$\prod_{c=1}^{C} Q(\mathbf{z}_c) \cdot \prod_{g=1}^{G} Q(\mathbf{y}_g, \mathbf{s}_g, r_g)$$

Iterated conditional modes (point estimates)
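The point-estimate idea can be illustrated with a heavily simplified sketch. It is not the procedure from the slides: it assumes a single factor per gene (N=1), equal noise for every gene, and no prior on the factor profiles, so each sweep simply lowers the squared reconstruction error. The name `icm_sweep` and the toy data are hypothetical.

```python
def icm_sweep(X, Z, assign, weights):
    """One sweep of point updates for the simplified N=1 case; returns squared error."""
    C, T = len(Z), len(Z[0])
    # Step 1: for each gene, pick the factor and weight with the smallest squared error.
    for g, x in enumerate(X):
        best = None
        for c in range(C):
            zz = sum(v * v for v in Z[c])
            y = sum(a * b for a, b in zip(x, Z[c])) / zz if zz > 0 else 0.0
            y = max(0.0, y)   # zero-thresholding keeps the weight non-negative
            err = sum((a - y * b) ** 2 for a, b in zip(x, Z[c]))
            if best is None or err < best[0]:
                best = (err, c, y)
        _, assign[g], weights[g] = best
    # Step 2: refit each factor profile by least squares over the genes assigned to it.
    for c in range(C):
        members = [g for g in range(len(X)) if assign[g] == c and weights[g] > 0]
        if not members:
            continue          # leave unused factors unchanged
        denom = sum(weights[g] ** 2 for g in members)
        Z[c] = [sum(weights[g] * X[g][t] for g in members) / denom for t in range(T)]
    return sum(
        (X[g][t] - weights[g] * Z[assign[g]][t]) ** 2
        for g in range(len(X))
        for t in range(T)
    )


# Toy data: two genes follow one pattern, two follow another.
X = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
Z = [[1.0, 0.2], [0.1, 1.0]]          # rough initial factor profiles
assign = [0] * len(X)
weights = [0.0] * len(X)

err1 = icm_sweep(X, Z, assign, weights)
err2 = icm_sweep(X, Z, assign, weights)
print(assign, err1, err2)
```

With one factor per gene, clipping a negative weight to zero is still the best non-negative choice, so the error cannot rise in this toy case. With several factors per gene, thresholding a jointly fitted weight vector is only a heuristic, which is consistent with the note above about non-monotonic likelihoods.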

Page 16: Probabilistic Sparse Matrix Factorization

Summary

Introduced probabilistic sparse matrix factorization (PSMF), in which each row of the data matrix is a linear combination of a “small” number of hidden factors selected from a larger set.

Described a variational inference algorithm for fitting the PSMF model.

Evaluated the model on a gene function prediction task.