Model checks for complex hierarchical models Alex Lewin and Sylvia Richardson Imperial College Centre for Biostatistics

Page 1

Model checks for complex hierarchical models

Alex Lewin and Sylvia Richardson

Imperial College

Centre for Biostatistics

Page 2

Many complex models used in bioinformatics

Classification/clustering can be greatly affected by choice of distributions

Our approach: exploit the structure of the model to perform predictive checks

hierarchical models generally involve exchangeability assumptions

mixture models are partially exchangeable

Background and Aims

Page 3

Mixture model for gene expression data

Model checks for mixture model

distribution for gene-specific variances

different mixture priors

Future work: model checks for a clustering and variable selection model (Tadesse et al. 2005)

Outline of Talk

Page 4

Hierarchical mixture model for gene expression data

Data: paired log differences between 2 conditions

g = gene, r = replicate, j = mixture component

ygr | δg, σg ~ N(δg, σg²)

δg | η ~ Σj wj hj(ηj)

σg² | μ,τ ~ f(μ,τ)

w ~ Dirichlet(1,…,1), various priors for δg, σg

δg = differential effect for gene g

σg² = variance for each gene

[Figure: graph of the model with nodes δg, ȳg, Sg, σg, μ,τ, wj, ηj]
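As a rough illustration of the model on this slide, the following Python sketch simulates paired log differences from a three-component version of it. Everything specific here is an assumption made for the example rather than a setting from the talk: the point mass at zero for the null component, the Gamma(1.5, rate) tails, the log-normal form chosen for f(μ,τ), the fixed weights (instead of a Dirichlet draw), the function name `simulate_gene_data`, and all numeric values.

```python
import math
import random


def simulate_gene_data(n_genes, n_rep, weights, eta_minus, eta_plus, mu, tau, seed=0):
    """Simulate y[g][r] from an illustrative 3-component hierarchical mixture."""
    rng = random.Random(seed)
    data, z, delta = [], [], []
    for _ in range(n_genes):
        # Allocation z_g with P(z_g = j) = w_j (0 = under, 1 = null, 2 = over).
        j = rng.choices([0, 1, 2], weights=weights)[0]
        if j == 0:
            d = -rng.gammavariate(1.5, 1.0 / eta_minus)  # assumed Gam-(1.5, rate eta_minus)
        elif j == 1:
            d = 0.0  # assumed point mass at zero
        else:
            d = rng.gammavariate(1.5, 1.0 / eta_plus)  # assumed Gam+(1.5, rate eta_plus)
        # Assumed variance model: sigma_g^2 ~ logNorm(mu, tau), tau = sd on the log scale.
        sigma2 = rng.lognormvariate(mu, tau)
        # y_gr | delta_g, sigma_g ~ N(delta_g, sigma_g^2)
        y = [rng.gauss(d, math.sqrt(sigma2)) for _ in range(n_rep)]
        data.append(y)
        z.append(j)
        delta.append(d)
    return data, z, delta


# Made-up settings: 200 genes, 3 replicates, 80% of genes in the null component.
data, z, delta = simulate_gene_data(200, 3, [0.1, 0.8, 0.1], 1.0, 1.0, -1.0, 0.5)
```

Simulated data of this kind is what the later checks are calibrated against: if the fitted model matches the simulating model, the predictive p-values should look roughly uniform.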

Page 5

Mixture model for gene expression data

Many mixture models have been proposed for gene expression data

Set-up is similar to variable selection prior: point mass + alternative distribution

Particular choices for alternative:

Normal (Lönnstedt and Speed)

Uniform (Parmigiani et al)

many others …

Page 6

Mixture model for gene expression data

Allow for asymmetry in over- and under-expressed genes: 3-component mixture model

δg | η ~ w1h1(η1) + w2h2(η2) + w3h3(η3)

6 knock-out and 5 wildtype mice

MAS5.0 processed data

Page 7

Mixture model for gene expression data

Classify each gene into mixture components using posterior probabilities
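In an MCMC fit, the posterior probability P(zg = j | data) can be estimated by the fraction of iterations in which gene g is allocated to component j. The sketch below shows that calculation on made-up allocation draws. The function names and the rule of assigning each gene to its most probable component are assumptions for illustration, not necessarily the rule used in the talk.

```python
from collections import Counter


def posterior_allocation_probs(z_draws, n_components=3):
    """z_draws[t][g] = component (0-based) of gene g at MCMC iteration t."""
    n_iter = len(z_draws)
    n_genes = len(z_draws[0])
    probs = []
    for g in range(n_genes):
        counts = Counter(draw[g] for draw in z_draws)
        probs.append([counts.get(j, 0) / n_iter for j in range(n_components)])
    return probs


def classify(probs):
    """Assign each gene to the component with the largest estimated probability."""
    return [max(range(len(p)), key=p.__getitem__) for p in probs]


# Toy allocation draws for 2 genes over 4 iterations (made-up values).
z_draws = [[1, 2], [1, 2], [0, 2], [1, 1]]
probs = posterior_allocation_probs(z_draws)
labels = classify(probs)
```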

Page 8

Choice of mixture prior affects classification results

Mixture prior for δg | Est. w2 (fraction in null)

w1Unif(-η-,0) + w2δ(0) + w3Unif(0,η+) | 0.96

w1Gam-(1.5,η-) + w2δ(0) + w3Gam+(1.5,η+) | 0.68

w1Gam-(1.5,η-) + w2N(0,ε) + w3Gam+(1.5,η+) | 0.99

Page 9

Mixture model for gene expression data

Model checks for mixture model

distribution for gene-specific variances

different mixture priors

Future work: model checks for a clustering and variable selection model (Tadesse et al. 2005)

Outline of Talk

Page 10

Predict new data from the model

Use posterior predictive distribution

Condition on hyperparameters (‘mixed predictive’ * not very conservative)

Get Bayesian p-value for each gene/marker/sample

Use all p-values together (100’s or 1000’s) to assess model fit

* Gelman, Meng and Stern 1995; Marshall and Spiegelhalter 2003

Predictive model checks

Page 11

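Checking distribution for gene variances

Bayesian p-value for gene g:

pg = Prob( S^mpred > Sg^obs | data )

All genes are exchangeable

histogram of p-values for all genes together

[Figure: posterior of S^mpred compared with Sg^obs]

[Figure: model graph with nodes δg, ȳg, Sg^obs, σg, μ,τ, showing the posterior predictive Sg^ppred and the mixed predictive S^mpred, the latter generated through σ^pred]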

Page 12

Predictive p-values for data simulated from the model

Histograms should be Uniform

Mixed predictive distribution much less conservative than posterior predictive

‘Mixed’ v. ‘posterior’ predictive

[Figure: histograms of predictive p-values; panels "Using global distribution" and "Using gene-specific distributions"]
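One simple way to put a number on "histograms should be Uniform" is to bin the p-values and to compute their Kolmogorov-Smirnov distance from the Uniform(0,1) distribution. The sketch below is an illustrative add-on, not part of the talk. The function names are hypothetical, and the KS distance is a swapped-in summary standing in for visual inspection of the histogram.

```python
def histogram_counts(pvals, n_bins=10):
    """Count p-values falling in n_bins equal-width bins on [0, 1]."""
    counts = [0] * n_bins
    for p in pvals:
        k = min(int(p * n_bins), n_bins - 1)  # p = 1.0 goes in the last bin
        counts[k] += 1
    return counts


def ks_distance_from_uniform(pvals):
    """Largest gap between the empirical CDF of pvals and the Uniform(0,1) CDF."""
    xs = sorted(pvals)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        d = max(d, (i + 1) / n - x, x - i / n)
    return d


# Evenly spread p-values: flat histogram, tiny KS distance.
even = [(i + 0.5) / 100 for i in range(100)]
# Very conservative p-values: everything piled up at 0.5.
piled = [0.5] * 100
even_counts = histogram_counts(even)
even_ks = ks_distance_from_uniform(even)
piled_ks = ks_distance_from_uniform(piled)
```

Conservative checks, such as the posterior predictive ones described on this slide, show up as p-values concentrated around 0.5 and therefore a large distance from uniform, even when the model is correct.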

Page 13

Checking different variance models

Model differential expression between 3 transgenic and 3 wildtype mice

Variance models compared:

σg² | μ,τ ~ Gam(μ,τ), μ fixed

σg² | μ,τ ~ Gam(μ,τ)

σg² | μ,τ ~ logNorm(μ,τ)

σg² = σ² for all genes

Page 14

Implementation (MCMC)

pg ← 0

for t = 1,…,niter {

σt^pred ~ f(μt,τt)

St^mpred ~ Gam( m, m(σt^pred)^-2 )

pg ← pg + I[ St^mpred > Sg^obs ]

}

pg ← pg / niter

Just two extra parameters predicted at each iteration

niter = no. MCMC iterations

m = (no. replicates – 1)/2

[Figure: model graph with nodes δg, ȳg, Sg^obs, σg, μ,τ, and the mixed predictive nodes σ^pred and S^mpred]
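The loop on this slide can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' software: it takes f to be a Gamma distribution with shape μ and rate τ for σ², it reads the second argument of Gam(m, ·) as a rate, and it uses constant "posterior draws" fixed at the simulating values as a stand-in for real MCMC output. The function name `mixed_predictive_pvalues` is hypothetical.

```python
import bisect
import random


def mixed_predictive_pvalues(s2_obs, mu_draws, tau_draws, n_rep, seed=0):
    """Estimate p_g = Prob(S^mpred > S_g^obs | data) for every gene.

    Assumes sigma^2 | mu, tau ~ Gamma(shape=mu, rate=tau); swap in another f as needed.
    """
    rng = random.Random(seed)
    m = (n_rep - 1) / 2
    s2_pred = []
    for mu_t, tau_t in zip(mu_draws, tau_draws):
        # sigma_t^pred ~ f(mu_t, tau_t); gammavariate takes (shape, scale)
        sigma2_pred = rng.gammavariate(mu_t, 1.0 / tau_t)
        # S_t^mpred ~ Gam(m, rate = m / sigma2_pred), i.e. scale = sigma2_pred / m
        s2_pred.append(rng.gammavariate(m, sigma2_pred / m))
    s2_pred.sort()
    n_iter = len(s2_pred)
    # Fraction of predicted values strictly greater than each observed variance.
    return [(n_iter - bisect.bisect_right(s2_pred, s)) / n_iter for s in s2_obs]


# Toy check: simulate observed variances from the same model (made-up values).
rng = random.Random(1)
n_rep, true_mu, true_tau = 4, 2.0, 2.0
m = (n_rep - 1) / 2
s2_obs = []
for _ in range(2000):
    sigma2 = rng.gammavariate(true_mu, 1.0 / true_tau)
    s2_obs.append(rng.gammavariate(m, sigma2 / m))
pvals = mixed_predictive_pvalues(s2_obs, [true_mu] * 2000, [true_tau] * 2000, n_rep)
mean_p = sum(pvals) / len(pvals)
```

Because the same predicted pair is reused for every gene, the cost is two extra draws per iteration regardless of the number of genes, which is the point made on the slide.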

Page 15

Mixture model for gene expression data

Model checks for mixture model

distribution for gene-specific variances

different mixture priors

Future work: model checks for a clustering and variable selection model (Tadesse et al. 2005)

Outline of Talk

Page 16

Checking mixture prior

δg | η ~ w1h1(η1) + w2h2(η2) + w3h3(η3)

OR

δg | η, zg = j ~ hj(ηj) j = 1,…,3

P(zg = j) = wj

Model checking: focus on separate mixture components

Page 17

δg | η, zg = j ~ hj(ηj) j = 1,…,3

Think about MCMC iterations …

Mixture component is estimated from genes currently assigned to that component

Can only define p-value for given gene and mix. component when the gene is assigned to that component (i.e. condition on zg in p-value)

So check each component using only the genes currently assigned (i.e. condition on zg in histogram)

Issues for mixture model checking

Page 18

Predictive checks for mixture model

Bayesian p-value for gene g and mix. component j:

pgj = Prob( ȳgj^mpred > ȳg^obs | data, zg=j )

Genes assigned to the same mix. component are exchangeable

histogram of p-values for each mix. component separately

histogram for component j made only from genes with large P(zg = j)

[Figure: model graph with nodes δg, δj^pred, wj, ȳg, Sg, ȳgj^mpred, σg, μ,τ, ηj]

Page 19

Effectively we condition on a best classification

Condition on classification to check separate components

All genes with P(zg = j) > 0

Only genes with P(zg = j) > 0.5

Predictive p-values for data simulated from the model

Page 20

Checking different mixture distributions

w1Unif(-η-,0) + w2δ(0) + w3Unif(0,η+)

Outer mix. components skewed too much away from zero

Null component too narrow

Page 21

Checking different mixture distributions

w1Gam-(1.5,η-) + w2 δ(0) + w3Gam+(1.5,η+)

Outer components skewed opposite

Null still too narrow?

Page 22

Checking different mixture distributions

w1Gam-(1.5,η-) + w2N(0,ε) + w3Gam+(1.5,η+)

Better fit for all components

Page 23

Implementation

pgj ← 0

for t = 1,…,niter {

δjt^pred ~ hj(ηjt), j = 1,…,3

ȳgt^mpred ~ N( δjt^pred , σg²/nrep ) for j = zgt

pgj ← pgj + I[ ȳgt^mpred > ȳg^obs ] for j = zgt

}

pgj ← pgj / niter(zg=j)

niter(zg=j) = no. MCMC iterations in which gene g is assigned to component j

Need ≈ngenes extra parameters at each iteration

[Figure: model graph with nodes δg, δj^pred, wj, ȳg, Sg, ȳgj^mpred, σg, μ,τ, ηj]
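A Python sketch of this component-wise loop is given below. It is an illustration under assumptions rather than the authors' implementation: the MCMC output is passed in as plain lists (`z_draws`, `eta_draws`, `sigma2_draws`, all hypothetical names), each component prior hj is supplied as a sampler function, the second Gamma parameter is read as a rate, ε is read as a variance, and the toy inputs are made up.

```python
import math
import random


def component_pvalues(ybar_obs, z_draws, eta_draws, sigma2_draws, n_rep, samplers, seed=0):
    """Estimate p_gj = Prob(ybar_gj^mpred > ybar_g^obs | data, z_g = j).

    z_draws[t][g], eta_draws[t][j], sigma2_draws[t][g] are draws at iteration t.
    Returns p[g][j], or None where gene g was never allocated to component j.
    """
    rng = random.Random(seed)
    n_genes, n_comp = len(ybar_obs), len(samplers)
    exceed = [[0] * n_comp for _ in range(n_genes)]
    visits = [[0] * n_comp for _ in range(n_genes)]
    for z_t, eta_t, sigma2_t in zip(z_draws, eta_draws, sigma2_draws):
        # One predicted effect per component at each iteration.
        delta_pred = [samplers[j](rng, eta_t[j]) for j in range(n_comp)]
        for g in range(n_genes):
            j = z_t[g]  # only the currently assigned component is updated
            ybar_pred = rng.gauss(delta_pred[j], math.sqrt(sigma2_t[g] / n_rep))
            exceed[g][j] += int(ybar_pred > ybar_obs[g])
            visits[g][j] += 1
    return [[exceed[g][j] / visits[g][j] if visits[g][j] else None
             for j in range(n_comp)] for g in range(n_genes)]


# Assumed priors: Gam-(1.5, rate eta), N(0, variance eta), Gam+(1.5, rate eta).
samplers = [
    lambda rng, eta: -rng.gammavariate(1.5, 1.0 / eta),
    lambda rng, eta: rng.gauss(0.0, math.sqrt(eta)),
    lambda rng, eta: rng.gammavariate(1.5, 1.0 / eta),
]

# Toy MCMC output: gene 0 always in the null component, gene 1 always over-expressed.
n_iter = 200
z_draws = [[1, 2]] * n_iter
eta_draws = [[1.0, 0.01, 1.0]] * n_iter
sigma2_draws = [[1.0, 1.0]] * n_iter
pvals = component_pvalues([0.0, -100.0], z_draws, eta_draws, sigma2_draws, 3, samplers)
```

Dividing by the number of visits rather than by the total number of iterations mirrors the conditioning on zg = j in the definition of pgj.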

Page 24

Summary of model checking procedure

1. Find part of model where individuals are assumed to be exchangeable (so information is shared)

2. Choose test statistic T (e.g. sample mean or variance)

3. Predict Tpred from distribution for exchangeable individuals (whole posterior for Tpred)

4. Compare observed Ti for each individual i to distribution of Tpred

5. For checking mixture components, condition on the best classification

Page 25

Mixture model for gene expression data

Model checks for mixture model

distribution for gene-specific variances

different mixture priors

Future work: model checks for a clustering and variable selection model (Tadesse et al. 2005)

Outline of Talk

Page 26

yi vector of gene expression for each sample i = 1,…,n

Multi-variate mixture model for clustering samples:

yi | zi = j ~ MVN(ζj, Λj), j = 1,…,J

P(zi = j) = wj

No. of mix. components (J) is estimated in the model

Aim to select genes which are informative for clustering the samples

Clustering and variable selection (Tadesse et al. 2005)

Page 27

Clustering and variable selection (Tadesse et al. 2005)

Likelihood conditional on allocation to mixture:

Likelihood | z ∝ ∏j exp( -1/2 ∑i∈Cj (yi(γ) - μj(γ))ᵀ Σj(γ)⁻¹ (yi(γ) - μj(γ)) ) × exp( -1/2 ∑i=1,…,n (yi(γ’) - η(γ’))ᵀ Ω(γ’)⁻¹ (yi(γ’) - η(γ’)) )

Cj = set of samples allocated to mix. component j

γ = vector of indices of selected variables

γ’ = vector of indices of variables not used to cluster samples

Conjugate priors on multivariate means and covariance matrices

P(γg = 1) = φ

i = sample, g = gene, j = mix. component

Page 28

Clustering and variable selection (Tadesse et al. 2005)

i = sample, g = gene, j = mix. component

Model checking: want to check the distribution for each mixture component separately (conditional on J)

In addition, need to condition on a given variable selection

Clearly impossible computationally

[Figure: model graph with nodes μj(γ), Σj(γ), yi, y(γ)j^pred, wj, η(γ), Ω(γ), φ, J]

Page 29

1) Run model with no prediction

2) Find the best configuration:

set of selected variables (γ)

no. mixture components J

allocation of samples to mixture components zi

3) Re-run model, with (γ), J and zi fixed, and calculate predictive p-values

Computing predictive p-values

pij = Prob( Tj^pred > Ti^obs | data, zi=j, J, (γ) )

where T = |y|² (for example)

Page 30

Conclusions

Choice of model distributions can greatly influence results of clustering and classification

For models where information is shared across individuals, predictive checks can be used as an alternative to cross-validation

Should be possible to do this even for quite complex models (if you can fit the model, you can check it)

Page 31

Acknowledgements

Collaborators on BBSRC Exploiting Genomics Grant

Natalia Bochkina, Clare Marshall

Peter Green

Meeting on model checking in Cambridge

David Spiegelhalter

Shaun Seaman

BBSRC Exploiting Genomics Grant

Paper and software at http://www.bgx.org.uk/