
Bayesian State Space Models for Inferring and Predicting Temporal Gene Expression Profiles

Yulan Liang* and Arpad Kelemen

Department of Biostatistics, University at Buffalo, The State University of New York,

252A2 Farber Hall, 3435 Main Street, Buffalo, NY 14214, USA

Received 26 April 2006, revised 10 October 2006, accepted 22 January 2007

Summary

Prediction of gene dynamic behavior is a challenging and important problem in genomic research while

estimating the temporal correlations and non-stationarity is the key to this process. Unfortunately,

most existing techniques used for the inclusion of the temporal correlations treat the time course as

evenly distributed time intervals and use stationary models with time-invariant settings. This is an

assumption that is often violated in microarray time course data since the time course expression data

are at unequal time points, where the difference in sampling times varies from minutes to days. Further-

more, the unevenly spaced short time courses with sudden changes make the prediction of genetic

dynamics difficult. In this paper, we develop two types of Bayesian state space models to tackle this

challenge for inferring and predicting the gene expression profiles associated with diseases. In the

univariate time-varying Bayesian state space models we treat both the stochastic transition matrix and

the observation matrix as time-variant in a linear setting and point out that this can easily be extended to

a nonlinear setting. In the multivariate Bayesian state space model we include temporal correlation

structures in the covariance matrix estimations. In both models, the unevenly spaced short time courses with

unseen time points are treated as hidden state variables. Bayesian approaches with various prior and

hyper-prior models with MCMC algorithms are used to estimate the model parameters and hidden

variables. We apply our models to multiple tissue polygenic Affymetrix data sets. Results show that

the predictions of the genomic dynamic behavior can be well captured by the proposed models.

Key words: Affymetrix data; Bayesian approach; Deviance Information Criterion; Prediction;

State space model; Temporal gene expression.

1 Introduction

After the completion of the genome sequencing project, new computational and statistical challenges

have arisen in genomic research, which may include gene/protein function predictions, gene and pro-

tein interaction network modelling and dynamic pathway discovery. However, complex phenotypes

(characteristics or traits that are observable or measurable) such as disease status (normal, disease) or

blood pressure typically involve multiple inter-correlated genetic and environmental factors that interact

in a hierarchical fashion. Microarrays, a high-throughput genetic data collection method, hold

tremendous latent information that requires more sophisticated computational tools to uncover

(Chu et al., 1998; Liang & Kelemen, 2006).

Time-course gene expression data are often prepared to study dynamic biological systems since

knowing when or whether a gene is expressed (Jacob and Monod, 1961), and how one interacts with

others can provide a strong clue to its biological roles. Clustering analyses are currently the most

commonly used statistical methods for time course gene microarray data due to the large number of

*Corresponding author: e-mail: [email protected], Phone: 001 (0) 716 829 2814, Fax: 001(0) 716 829 2200

Biometrical Journal 49 (2007) 6, 801–814. DOI: 10.1002/bimj.200610335

© 2007 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

genes involved, and the identification of groups of genes with similar temporal patterns of expres-

sion is usually a critical step in the analysis of kinetic data (Eisen, et al., 1998; Holter, et al., 2001;

Ernst, et al., 2005). However, most of the common clustering methods in use today lack the ability

to address the inherent temporal dependence between observations when samples are

collected in an unequal time-ordered sequence.

Moreover, there are other limitations of the majority of the existing time course approaches and

clustering techniques. Firstly, conventional time series techniques such as Fourier analysis, autoregres-

sive or moving averaging models are not suitable for short time course gene expression data. Sec-

ondly, although other techniques such as dynamic Bayesian clustering have been developed, these

methods require stationary conditions, linearity for lower order AR models, and uniformly spaced

time points, which are not present in microarray experiments (Ramoni, et al., 2002). Thirdly, some

curve fitting, spline (e.g. in terms of polynomials in time) methods, mixed effects models and non-

linear regression models have been applied to temporal microarray data and they can model the non-

linear relations between genes, deal with the unevenly spaced data and may produce a good fit, but do

not facilitate prediction and may cause overfitting problems (Luan and Li, 2003). Bayesian decomposition

methods and singular value decomposition have also been developed for modelling the dynamics

of microarray data through matrix decomposition (Alter, et al., 2000). A difficulty is the

curse of dimensionality (high dimensional variable/feature space with small sample size) and ill

posed problems (West, 2003). Lastly, but most importantly, most existing approaches using static

models treat time as a categorical or ordinal variable but not as a continuous variable. This distinction

is important because the kinetic parameters derived from ordinal variable treatments will not carry

meaning except in the case when the time points are evenly spaced (Bar-Joseph, et al., 2003).

State space models have greater flexibility in modelling non-stationary and nonlinear short time

course microarray data (Roweis and Ghahramani, 1999; Congdon, 2003; Durbin and Koopman, 2000).

However, existing methods are based on standard Kalman filter methods that rely on

linear state transitions and Gaussian errors (Harvey, 1989). Perrin et al. (2003) used a penalized like-

lihood maximization (MAP) implemented through an extended version of EM algorithm to learn the

parameters of the model (Dempster, et al., 1977). The drawback of MAP is that it gives no tool for

reducing the model complexity and the smoothness coefficients. Rangel, et al. (2004) used classical cross-validation and bootstrap techniques and Beal et al. (2005) used variational approximations with

linear time invariant Gaussian setting for constructions of the regulatory network in the state space

framework (Roweis and Ghahramani, 1999).

In this paper, by combining the merits of Bayesian flexibility of estimation procedures and the

stochastic process of modelling the temporal dynamics, we develop state space models in a fully

Bayesian setting for inferring and predicting time course gene expression profiles associated with

diseases. We consider and develop both univariate and multivariate Bayesian

state space models. Markov chain Monte Carlo (MCMC) algorithms are used to sample the

posterior distribution of the hidden variables and the model parameters. Various prior models with

different hyper-prior distributions are simulated and compared, and Deviance Information Criterion

(DIC) is used for model checking and selection (Spiegelhalter, et al., 2002). DIC is also used to

compare the performance of the univariate and multivariate Bayesian state space

models. The developed models were applied to Affymetrix temporal gene expression data sets collected following corticosteroid administration, derived from multiple tissue polygenic phenomena in complex

biological systems, which will be discussed next.

2 Affymetrix Temporal Gene Expression Data Sets

Genes are expressed in a two stage process of protein production: transcription (RNA, gene expres-

sion) and translation. Each gene is transcribed (at the appropriate time) from DNA into mRNA, which

then leaves the nucleus and is translated into the required protein. Any gene which is active in this

way at any particular time is said to be expressed. Gene expression investigations study the amount of


transcribed mRNA in a biological system. Measuring the activity level of a gene (its expression level)

in a particular cell at a particular time can be conducted by measuring the concentration of that gene's

mRNA transcript in the cell's total RNA. One high-throughput method to measure gene expression is DNA microarrays. A DNA microarray can be used to detect RNAs that may or may not be translated

into active proteins and it consists of a solid surface on which strands of polynucleotides have been

attached in specified positions. We refer to the polynucleotides immobilized on the solid surface as

probes. There are three popular approaches: cDNA (spotted) microarrays on a glass slide,

oligonucleotide microarrays (Affymetrix) on a silicon chip, and SNP microarrays, which are used to read

the sequence of a genome at particular positions.

Microarrays provide a method of high-throughput data collection that is necessary for constructing comprehensive information on the transcriptional basis of polygenic phenomena. To investigate thousands of genes, there are two ways to mine gene expression: coordinated gene expression (temporal gene expression profiles, also referred to as time course gene expression), which assesses the expression levels of a large number of genes over a period of time or through a series of experimental conditions; and differential gene expression, which makes pair-wise comparisons (Liang and Kelemen, 2004; Liang et al., 2005). When microarrays are used in a rich time course, they yield temporal patterns of changes in gene expression that illustrate the cascade of molecular events that cause broad systemic responses (Almon, et al., 2004).

Corticosteroids are a class of compounds that exhibit the most potent immunosuppressive and anti-

inflammatory activities. These drugs are widely used in a variety of acute and chronic disease states,

such as asthma, leukemia, and organ transplantation. Although their therapeutic effects result from

regulation of immune system genes, many adverse events occur due to unwanted influence of the drug

on other genes, primarily those genes involved in metabolic processes (Jin, et al., 2003). The corticos-

teroid compounds produce both beneficial and harmful effects, through binding to the same type of

glucocorticoid receptor. This binding activity results in a cascade of signal transduction pathways to

ultimately produce an eventual drug response and clinical outcome. Because drug activity requires a

sequential series of events in order to elicit its effects, different genes may exhibit different expression

profiles over time following the administration of a drug dose. The particular genes that are either up-regulated

or down-regulated, in combination with specific time-course patterns, may be predictive of the ultimate outcome(s) that result from drug therapy. Therefore, it is important to improve our understanding

of the time-dependent changes in gene expression caused by corticosteroid therapy in order

to potentially discover the precise genes that may be the most critical in producing favorable therapeu-

tic outcomes versus those that may instigate negative, unwanted effects.

Our study starts from Affymetrix time course gene expression data sets that were generated in rat

tissues of liver, muscle and kidney over time in response to a single bolus dose of methylprednisolone

(MPL) in order to examine global changes in gene expression. This was a pre-clinical study per-

formed on experimental rats. Forty-eight animals received a single IV bolus 50 mg/kg dose of MPL

(Jin, et al., 2003). Three rats were subsequently sacrificed at each of the following sixteen (experimental)

time-points: 0.25, 0.5, 0.75, 1, 2, 4, 5, 5.5, 6, 7, 8, 12, 18, 30, 48, and 72 hours. Liver, muscle

and kidney tissue samples were collected from each animal and processed to assay for gene expression.

Therefore, triplicate measurements were obtained at each of these time-points. Four rats were

not administered any drug and were sacrificed at time t = 0. These served as the control group for gene expression in the absence of any drug at baseline.

Total RNA was separately extracted from the samples from each animal and purified. The isolated

RNA was then used to create biotinylated cRNA. This target cRNA was then hybridized to individual

Affymetrix GeneChips Rat Genome U230A and U34A. The expression levels of a total of 15,923

oligonucleotide sequences were quantified for kidney and 8,799 probe sets for liver and muscle in each

chip. A series of filtering steps, including removal of probes that were not expressed or not up- or down-regulated,

was performed in order to subset the data into a more manageable dataset that would yield the most

relevant information regarding potentially important changes in gene expression due to MPL dosing

(Almon, et al., 2004). After pre-processing and gene filtering we were able to eliminate probe sets


that were not expressed in the tissue, were not regulated by the drug treatment, or did not meet

defined quality control standards.

In order to demonstrate the inference and prediction performance of the proposed models we present six selected genes that were differentially expressed in three tissues (liver, muscle and kidney). These

six genes were selected in our previous study, Bayesian Meta-Analysis with Markov Mixture model,

which we will report in another paper. These genes are measured on a continuous scale. For instance,

gene L33869_at at the above 17 time points before transformation had the following measured

values: 62.025, 50.100, 56.833, 57.400, 63.633, 39.600, 26.333, 43.000, 37.367, 49.600, 49.067,

51.533, 44.967, 61.333, 38.133, 51.600, 44.733. Because our analysis is focused on the dynamic

changes in gene expression from time t = 0 after dosing of MPL, gene expression was converted to a ratio via a simple calculation that involved dividing the gene expression at time $t_i$ by the gene expression

level at time $t_0$, where $i$ represents the specific post-dose time-point and $t_0$ represents baseline at

time 0 hours (i.e. the control group that did not receive drug). Inherently, the gene expression for every gene in the control group would be equal to a value of 1. These ratios were subsequently

natural-logarithmically transformed to produce normally distributed gene expression levels at each

sampling time-point. This log-transformation directly centered the data around mean 0 since the control values were all equal to 1 prior to the transformation.
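The ratio and log transformation described above can be sketched in a few lines of Python. This is only an illustration: it treats the first of the 17 quoted values for gene L33869_at as the time-0 control, which is an assumption about the ordering, and the function name `log_ratios` is hypothetical.

```python
import math

# Measured values for gene L33869_at as quoted in the text. The first entry is
# assumed here to be the time-0 control value.
raw = [62.025, 50.100, 56.833, 57.400, 63.633, 39.600, 26.333, 43.000,
       37.367, 49.600, 49.067, 51.533, 44.967, 61.333, 38.133, 51.600, 44.733]


def log_ratios(values):
    """Divide each value by the baseline (first entry) and take the natural log."""
    baseline = values[0]
    return [math.log(v / baseline) for v in values]


transformed = log_ratios(raw)
```

Under this assumption the baseline maps to a ratio of 1 and therefore to a log-ratio of 0, while values above or below baseline become positive or negative log-ratios.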

3 Methods

3.1 Bayesian state space model formulation and prior model specification

The measured (observed) gene expressions provided by microarray experiment are contaminated by

noise. A state space model will decompose the signal and noise processes into two model equations:

the stochastic equation and the observation equation. In a state space model, a sequence of $P$-dimensional real-valued observed gene expression vectors $\{X_t\}$ is modelled by assuming that, at each time step, $\{X_t\}$ is generated from a $K$-dimensional unobserved or hidden state variable $\{S_t\}$, and that the sequence $\{S_t\}$ defines a Markov process (Congdon, 2003). The joint probability of $\{X_t, S_t\}$ is

$$P(\{S_t, X_t\}) = P(S_1)\, P(X_1 \mid S_1) \prod_{t=2}^{T} P(S_t \mid S_{t-1})\, P(X_t \mid S_t) \tag{1}$$

where $P(S_1)$ is the initial state probability and is assumed to be generated from conjugate distributions such as Gaussian or Student t-distributions. $P(S_t \mid S_{t-1})$ is the transition density or probability of the hidden states (such as genes that are not measured or included in the study), which can be inferred from the measured gene expressions $\{X_t\}$, and it can be defined in the stochastic evolution equations as:

$$S_t = g_t(S_{t-1}) + V_t \tag{2}$$

where $g_t$ is the deterministic transition function determining the mean of $S_t$ given $S_{t-1}$. $P(X_t \mid S_t)$ is the observation density or probability, which can be defined in the observation equations as:

$$X_t = f_t(S_t) + W_t \tag{3}$$

where $f_t$ is the statistical transition function of the observation process. $W_t$ and $V_t$ are assumed to follow Gaussian or non-Gaussian distributions with zero means for both processes. $g_t$ and $f_t$ follow either linear or nonlinear settings.

3.1.1 Univariate time varying Bayesian state space model

We start with univariate state space models in a fully hierarchical Bayesian setting. Here, both the stochastic and observation equations take linear forms, and the distributions of the state variables $P(S_t \mid S_{t-1})$ and observed gene expressions $P(X_t \mid S_t)$ are assumed to follow Gaussian distributions. The model is as follows for each gene:

$$\begin{aligned} S_t &= A S_{t-1} + w_t \\ X_t &= C S_t + n_t \end{aligned} \tag{4}$$

where $t = 1, \ldots, T$ ($T$ is the number of time points). The time steps in the model do not have to correspond to a fixed unit of real time, which allows unevenly distributed time course data that are not measured at fixed time intervals to be modelled. $w_t$ and $n_t$ are noise sequences, $A$ is the state-to-state transition matrix, and $C$ is the state-to-observation transition matrix.
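To make the roles of the two equations concrete, the following is a minimal simulation sketch of a scalar linear Gaussian state space model of the kind described above. The function name, the parameter values and the choice of 16 steps are illustrative assumptions, not values taken from the study.

```python
import random


def simulate_state_space(a, c, state_sd, obs_sd, n_steps, s0=0.0, seed=0):
    """Simulate S_t = a*S_{t-1} + w_t and X_t = c*S_t + n_t with Gaussian noise."""
    rng = random.Random(seed)
    states, observations = [], []
    s_prev = s0
    for _ in range(n_steps):
        s_t = a * s_prev + rng.gauss(0.0, state_sd)   # hidden state equation
        x_t = c * s_t + rng.gauss(0.0, obs_sd)        # observation equation
        states.append(s_t)
        observations.append(x_t)
        s_prev = s_t
    return states, observations


# Hypothetical parameter values, chosen only for illustration.
states, observations = simulate_state_space(
    a=0.8, c=1.5, state_sd=0.3, obs_sd=0.2, n_steps=16
)
```

Only `observations` would be available in practice; `states` plays the part of the hidden variables that have to be inferred.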

Besides the observed gene expressions $\{X_t\}$ and the hidden state variable $\{S_t\}$, which can be inferred from the associated $\{X_t\}$, we can also include a set of exogenous variables in the above models. These variables can be environmental factors such as certain exposures, or they can be other genes that interact with the studied gene. These variables can modify the observed gene expression $\{X_t\}$ and affect the inferred hidden state variables. In our case we are interested in modelling the influence of the expression of one gene at a previous time point on another gene and its associated hidden variables; therefore, we modified (4) by including this influence in both the state equation and the observation equation as follows:

$$\begin{aligned} S_{t+1} &= A S_t + B X_t + w_t \\ X_t &= C S_t + D X_{t-1} + n_t \end{aligned} \tag{5}$$

where matrix D in the observation equation captures the relationship of gene expression levels at

consecutive time points, matrix C captures the influence of the hidden variables on gene expression

level at each time point, and matrix B models the influence of gene expression values from previous

time points on the hidden states. Combining the two formulas in (5) we get:

$$\begin{aligned} X_t &= AC\, S_{t-1} + (CB + D)\, X_{t-1} + e_t \\ e_t &= C w_{t-1} + n_t \end{aligned} \tag{6}$$

where $AC$ is the hidden state dynamic matrix with the influence of hidden state variables on gene expression level at each time point, $CB + D$ contains both the direct gene-to-gene interaction and the gene-to-gene interactions through the hidden state over time, and $e_t$ is the noise.
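The combination step can be written out explicitly. The derivation below is a sketch that shifts the state equation in (5) back by one time step and substitutes it into the observation equation. In the univariate case the coefficients are scalars, so the product $CA$ that appears here equals the $AC$ used in the text.

```latex
\begin{aligned}
S_t &= A S_{t-1} + B X_{t-1} + w_{t-1}
  && \text{(state equation of (5), shifted by one step)} \\
X_t &= C S_t + D X_{t-1} + n_t
  && \text{(observation equation of (5))} \\
    &= C\left(A S_{t-1} + B X_{t-1} + w_{t-1}\right) + D X_{t-1} + n_t \\
    &= CA\, S_{t-1} + (CB + D)\, X_{t-1} + \left(C w_{t-1} + n_t\right)
\end{aligned}
```

The last bracket is the combined noise term; if the state noise is identically distributed over time, it has the same distribution whether it is indexed by $t-1$ or $t$.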

Rangel, et al. (2004) and Beal, et al. (2005) assumed that $A$, $B$, $C$, $D$ are linear and time-invariant

matrices in the state space model setting. Here in our model (6), we initialized our model with a time

varying matrix. The motivation of the above BSS model with time varying coefficient comes from our

major prediction goal, which requires a model not only to have good fit in the sampling period, but

also a good generalization performance. Since microarray experiments are more concerned about the

short term prediction given the short term time course data, as Congdon (2003) suggested, the intro-

duction of time variability is advisable. This way, the underlying parameters to be estimated evolve

through time with continuous measure instead of discrete.

In order to incorporate the fully hierarchical Bayesian setting into the above model (6) for learning

the model parameters and model structures, and using probability density functions, we reformulate

our univariate time varying state space model given in (6) with fully Bayesian setting as follows:

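$$X_t \sim N(\mu_t, \sigma_t^2) \tag{7}$$

$$\mu_t = AC_t\, S_{t-1} + (CB + D)_t\, X_{t-1} \tag{8}$$

$$AC_t \sim N(\beta_{1,t-1}, \sigma_{1t}^2 I); \quad (CB + D)_t \sim N(\beta_{2,t-1}, \sigma_{2t}^2 I); \quad S_t \sim N(\beta_3, \sigma_3^2) \tag{9}$$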

As Congdon (2003) suggested, the above Gaussian/normal distribution can also be replaced by a

Student t distribution with three parameters: degrees of freedom, mean (the same mean as the

normal distribution), and variance (we take a larger variance than the normal, $k\sigma^2$, in order to deal with outliers and other sources of heavy-tailed distributions, which are typical in microarray

data). This may also help to deal with over-dispersion problems in time course data.





Specifying hyper-priors for the vector $\beta_{1,t-1}$, the matrix $\sigma_{1t}^2 I$ and so on requires elicitation, which consists of

incorporating relevant background knowledge into the formulation of priors on parameters. Since our

knowledge of the parameter values is limited, diffuse or just proper priors can be called upon. Regarding stationarity, we followed Zellner (1996) without stationarity constraints, which uses

a non-informative reference prior (such as normal-gamma or Gaussian-Wishart). Various hyper-parameters

generated from non-informative distributions, such as a vague inverse-Gamma distribution, were

tested.

The advantage of the proposed fully Bayesian setting with time varying state space model is that

the estimation of varying coefficients is based on the assumption that they belong to the common

density and hence Bayesian methods are appropriate in pooling the strength estimation of coefficients

from a common density. Moreover, it can deal with the outliers and shifts in time, learn the non-

stationarity typically observed in microarray data, and place more weight on the recent observations.

Lastly, it is flexible for time varying coefficients with linear or nonlinear functions (Congdon, 2003).

Note that this model is a univariate state space model through hierarchical Bayesian setting. A

hierarchical model has its virtues. It is richer than a one-stage model, and produces shrinkage estimators

of the parameters automatically. This shrinkage may help to improve the generalization performance and overcome the over-fitting problem for noisy, sparse, high dimensional gene data.

3.1.2 Multivariate Bayesian state space models with correlated covariance

Multivariate state space models that can describe patterns of dependency among multiple series (genes

across time) measured simultaneously on a common system may be helpful to discover the gene

dependences of the underlying processes (West and Harrison, 1999). Therefore, we extend the univari-

ate Bayesian state space model to multivariate model via inclusion of the covariance structure in order

to learn gene correlations and their temporal behavior for prediction. Here, each series depends on both

its own past and the past values of the other series, therefore the variations in expression for a given

gene can be predicted by a small set of other genes. One advantage of simultaneously modeling

several series is the possibility of pooling information from related genes to improve the precision and

out of sample forecasts (Congdon, 2003).

We are particularly interested in modeling several correlated series (genes) as well as the error terms. We still assume that the time course gene expression data with $P$ genes follow a multivariate Gaussian distribution with the linear setting (7, 8). However, the covariance matrices and error matrices in (9) are modified from univariate normal to multivariate Gaussian distributions or multivariate Student-t distributions as follows:

$$AC_t \sim \mathrm{MVN}(\beta_1, W_1); \quad (CB + D)_t \sim \mathrm{MVN}(\beta_2, W_2); \quad e_t \sim \mathrm{MVN}(0, W) \tag{10}$$

where $W_1$, $W_2$, $W$ are positive definite covariance matrices following inverse Wishart priors (Congdon, 2003). $\beta_1$, $\beta_2$ are vectors that are generated from multivariate t-distributions or Gaussian distributions with vague hyper inverse-Gamma distributions, just as we have discussed earlier.

Note that the difference between (9) and (10) is that the covariance matrix in (9) is assumed to have a diagonal, uncorrelated structure. However, this does not mean that (9) is a special case of (10), since in (9) we assumed time varying coefficient vectors and in (10) we did not make this assumption.

The estimations of the hidden variables, parameters and hyper-parameters such as the covariance matrix in both the univariate time varying Bayesian state space model and the multivariate Bayesian state space model are conducted by standard Markov chain Monte Carlo algorithms for both posterior inference and predictions.

3.2 Model comparisons, selection criteria and validations

In order to choose the best model for prediction, the model selection criteria, such as the Akaike

Information Criterion (AIC), Bayesian Information Criterion (BIC) or Bayes factors can be consid-





ered. Although AIC is useful for non-nested models, it works poorly in the case of multicollinearity,

which is typical for gene expression data. It has drawbacks of tending to be biased for complicated

models due to the fact that log-likelihood increases faster than the model complexity component.

Deviance Information Criterion is a new measure proposed by Spiegelhalter, et al. (2002) for model

complexity and goodness of fit under the Bayesian setting. It is more appropriate when comparing

complex hierarchical models in the Bayesian setting, where the number of parameters is not clearly

defined. One advantage of DIC is its inclusion of a prior distribution, which induces a dependency

between parameters that is likely to reduce the effective dimensionality. Furthermore, it helps to identify the prior models. DIC can be summarized by the posterior expectation of the deviance and the complexity (effective number of parameters), where the complexity is the expected deviance minus the deviance at the posterior expectation of the parameters. It is defined as follows:

$$\begin{aligned} p_D &= \overline{D(\theta)} - D(\bar{\theta}) \\ D(\theta) &= -2 \log\{p(y \mid \theta)\} + 2 \log\{f(y)\} \\ \mathrm{DIC} &= D(\bar{\theta}) + 2 p_D \end{aligned} \tag{11}$$

where $D(\theta)$ is the Bayesian deviance. The posterior mean of the deviance, $\overline{D(\theta)}$, is a Bayesian measure of fit. The effective number of parameters $p_D$ in the model is defined as the difference between the posterior mean of the deviance and the deviance at the posterior means of the parameters of interest. We used DIC as a within-sample fit measure for model selection and also to avoid the notorious over-fitting problem by controlling the optimum number of parameters for model selection.
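As an illustration of how these quantities are obtained from posterior draws, the sketch below computes the posterior mean deviance, $p_D$ and DIC for a toy model with i.i.d. normal data, a known standard deviation and a flat prior on the mean. The toy model, the simulated data and the function names are assumptions made for the example; the standardizing term $2\log\{f(y)\}$ depends only on the data and is omitted because it cancels when models are compared on the same data.

```python
import math
import random


def deviance(y, mu, sigma):
    """D(theta) = -2 log p(y | theta) for i.i.d. Normal(mu, sigma^2) data."""
    n = len(y)
    sq = sum((yi - mu) ** 2 for yi in y)
    log_lik = -0.5 * n * math.log(2 * math.pi * sigma ** 2) - sq / (2 * sigma ** 2)
    return -2.0 * log_lik


def dic_from_draws(y, mu_draws, sigma):
    """Return (posterior mean deviance, p_D, DIC) from posterior draws of mu."""
    d_bar = sum(deviance(y, m, sigma) for m in mu_draws) / len(mu_draws)
    mu_bar = sum(mu_draws) / len(mu_draws)
    d_hat = deviance(y, mu_bar, sigma)      # deviance at the posterior mean
    p_d = d_bar - d_hat                     # effective number of parameters
    return d_bar, p_d, d_hat + 2 * p_d


rng = random.Random(0)
sigma = 1.0
y = [rng.gauss(0.5, sigma) for _ in range(20)]   # simulated toy data
y_bar = sum(y) / len(y)
# With a flat prior and known sigma, the posterior of mu is Normal(y_bar, sigma^2 / n).
mu_draws = [rng.gauss(y_bar, sigma / math.sqrt(len(y))) for _ in range(5000)]
d_bar, p_d, dic = dic_from_draws(y, mu_draws, sigma)
```

In this toy model there is a single free parameter, so $p_D$ should come out close to 1; the same bookkeeping applies to draws from a more complex hierarchical model, where the effective number of parameters is not known in advance.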

To validate competing models, one option is to make short-term prediction ahead with samples and

then choose the model which diminishes the prediction error, the mean squared error or the mean

absolute error (Congdon, 2003). Another option is to fit the models to periods $t = 1, \ldots, F$, where $F$ is less than $N$. Periods $F + 1, F + 2, \ldots, N$ are used to validate the competing models. Instead of using the first $F$ time points to build the model and the rest for validation, we used the first option.

Furthermore, we used the same model to predict not only the next time point, but the next few time

points, such as five points.
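The selection rule described above, choosing the model with the smallest prediction error, can be sketched as follows. The held-out values, the two candidate forecasts and the helper names are hypothetical and serve only to show the mechanics of comparing models by mean squared or mean absolute error.

```python
def mean_squared_error(actual, predicted):
    """Average squared difference between observed and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def pick_best_model(actual, forecasts, error_fn=mean_squared_error):
    """Return (name, error) of the candidate with the smallest prediction error."""
    scored = {name: error_fn(actual, pred) for name, pred in forecasts.items()}
    best = min(scored, key=scored.get)
    return best, scored[best]


# Hypothetical observed log-ratios and forecasts from two made-up candidates.
held_out = [0.10, -0.05, 0.20, 0.00, 0.15]
forecasts = {
    "model_a": [0.12, -0.02, 0.18, 0.03, 0.10],
    "model_b": [0.30, 0.20, -0.10, 0.25, 0.40],
}
best_name, best_error = pick_best_model(held_out, forecasts)
```

Passing `error_fn=mean_absolute_error` switches the criterion without changing the rest of the comparison.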

4 Results

We simulated the prior models with various cases of both univariate and multivariate Bayesian state space models using WinBUGS and applied them to the multiple tissue data sets discussed in Section 2. A Markov chain Monte Carlo algorithm with Gibbs sampling was used to sample from the posterior distributions of the parameters, and the simulated samples were used to draw inferences on the parameters given the tissue data sets. Each parameter was estimated as the mean of its posterior distribution based on an assumption for its prior distribution. The results were robust in many complex cases. To ensure that the chain was sampling from its equilibrium distribution, 2000 samples after 6000 burn-in iterations were used for computation.
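The burn-in scheme can be illustrated with a small Gibbs sampler. This is a toy sketch, not the WinBUGS model used in the study: it assumes i.i.d. normal data with unknown mean and precision, vague normal and gamma priors chosen for the example, and a made-up data vector. Only the 6000 burn-in iterations followed by 2000 retained draws mirror the text.

```python
import random


def gibbs_normal(y, n_iter=8000, burn_in=6000, seed=1,
                 m0=0.0, p0=0.001, a0=0.001, b0=0.001):
    """Gibbs sampler for y_i ~ Normal(mu, 1/tau) with vague priors.

    mu ~ Normal(m0, 1/p0) and tau ~ Gamma(shape=a0, rate=b0) are illustrative
    prior choices, not the priors used in the paper.
    """
    rng = random.Random(seed)
    n = len(y)
    y_sum = sum(y)
    mu, tau = 0.0, 1.0  # arbitrary starting values
    kept = []
    for it in range(n_iter):
        # Full conditional of mu given tau: normal with combined precision.
        prec = p0 + n * tau
        mean = (p0 * m0 + tau * y_sum) / prec
        mu = rng.gauss(mean, prec ** -0.5)
        # Full conditional of tau given mu: gamma with updated shape and rate.
        shape = a0 + n / 2.0
        rate = b0 + 0.5 * sum((yi - mu) ** 2 for yi in y)
        tau = rng.gammavariate(shape, 1.0 / rate)  # second argument is a scale
        if it >= burn_in:
            kept.append((mu, tau))  # keep only post burn-in draws
    return kept


# Made-up log-ratio style data for illustration.
y = [0.10, -0.20, 0.30, 0.00, 0.25, -0.10, 0.15, 0.05]
draws = gibbs_normal(y)
posterior_mean_mu = sum(mu for mu, _ in draws) / len(draws)
```

The posterior mean is computed from the retained draws only, which is the sense in which each parameter is estimated as the mean of its posterior distribution.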

We tested four models of the Bayesian state space model discussed in the methods section with various priors and hyper-prior distributions. Since the choice of the hyper-prior distribution of the parameters is a key issue, to investigate the influence of the choice of the hyper-parameters on the estimates, we carried out a sensitivity analysis of different choices of the initial priors/parameters. In model I, the univariate time invariant Bayesian state space model (with regression coefficients, their variances and state variables fixed through time) was tested: $AC \sim N(\mu_1, \sigma^2)$; $(CB + D) \sim N(\mu_2, \sigma^2)$; $S \sim N(\mu_3, \sigma^2)$; $e \sim N(\mu, W)$. The hyper-prior distribution was $\mu_1, \mu_2, \mu_3, \mu \sim N(0, \sigma^2)$, where

$\sigma^2 \sim$ Inverse Gamma$(1, 0.001)$. Since we had no information on the precision (inverse of $\sigma^2$), small precisions such as 0.001 and 0.01 were tested. $W$ was generated from an inverse Wishart$(R, r)$ distribution, and the degrees of freedom ($r$) was the number of genes included in the model. For demonstration purposes, we selected six commonly differentially expressed genes from the multiple tissue data: the degrees of freedom was six and the scale matrix was specified as

$$R = \begin{pmatrix} 100 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0.1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0.1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0.1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0.1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0.01 \end{pmatrix}$$

[Figure 1: six box-plot panels titled muY[1,] through muY[6,], each plotted over time points 1 to 16]

Figure 1 Box plots of estimated gene expression profiles for six selected genes from liver tissue data using univariate Bayesian state space model (model I).


Figure 2 Box plots of estimated gene expression profiles for six

selected genes using univariate time varying Bayesian state space

model (model II).




By looking at the overall goodness of fit measure: DIC value (6.502E+7), the deviance (1230.0)

and the over-dispersed estimates (see Fig. 1), we conclude that this model did not fit well.

To obtain model II we improved model I by modifying the above time invariant model into the univariate time varying Bayesian state space model described earlier (see Fig. 2). Using this model, due to

the hierarchical Bayesian setting and shrinkage effects, we were also able to filter out non-differentially

expressed genes to reduce the number of dimensions and overcome the curse of dimensionality

problem: a total of 131 of 3614 genes were left. The remaining, non-differentially

expressed genes were filtered out (with unadjusted significance level alpha = 0.05).


Figure 3 Box plots of estimated gene expressions for six selected

genes from liver at the given sixteen time points (model III). Predic-

tion of the gene expressions at the next 5 time points are also

shown.





Table 1 Sensitivity analysis from various prior and initial values and corresponding DIC values; six selected genes were tested, using temporal tissue data (line in bold shows the best model with lowest DIC).

| S (state space variables): Initials | S: Deviance | S: DIC | AC: Initials | AC: Deviance | AC: DIC | CB + D: Initials | CB + D: Deviance | CB + D: DIC |
|---|---|---|---|---|---|---|---|---|
| $S_t \sim N(0, 0.001)$ | 458.5 | 436.7 | / | / | / | $\mu_2 = (10, 0.1, 10, 0.1, 10, 0.1)'$ | 458.5 | 436.7 |
| $S_t \sim N(1, 1)$ | 467.4 | 517.9 | / | / | / | $\mu_2 = (10, 10, 10, 10, 10, 10)'$ | 458.5 | 436.7 |
| $S_t \sim N(0, 1)$ | 459.1 | 459.6 | / | / | / | $\mu_2 = (0.1, 0.1, 0.1, 0.1, 0.1, 0.1)$ | 458.5 | 436.7 |
| $S_t \sim N(0, 0.001)$ | 440.9 | 612.1 | $\mu_1 = (10, 0.1, 10, 0.1, 10, 0.1)'$ | 440.9 | 612.1 | $\mu_2 = (10, 0.1, 10, 0.1, 10, 0.1)'$ | 440.9 | 612.1 |
| $S_t \sim N(1, 1)$ | 457.8 | 333.7 | $\mu_1 = (10, 10, 10, 10, 10, 10)'$ | 440.9 | 612.1 | $\mu_2 = (10, 10, 10, 10, 10, 10)$ | 440.9 | 612.1 |
| $S_t \sim N(0, 1)$ | 447.1 | 364.9 | $\mu_1 = (0.1, 0.1, 0.1, 0.1, 0.1, 0.1)$ | 436.0 | 332.9 | $\mu_2 = (0.1, 0.1, 0.1, 0.1, 0.1, 0.1)$ | 438.4 | 587.2 |
| $S_t \sim t(0, 0.001, 95)$ | 444.3 | 220.9 | $\mu_1 = (10, 0.1, 10, 0.1, 10, 0.1)'$ | 444.3 | 220.9 | $\mu_2 = (10, 0.1, 10, 0.1, 10, 0.1)'$ | 444.3 | 220.9 |
| $S_t \sim t(0, 0.01, 95)$ | 443.0 | 195.1 | $\mu_1 = (10, 10, 10, 10, 10, 10)'$ | 443.0 | 195.1 | $\mu_2 = (10, 10, 10, 10, 10, 10)'$ | 443.0 | 195.1 |
| $S_t \sim t(0, 0.1, 95)$ | 441.5 | 333.9 | $\mu_1 = (0.1, 0.1, 0.1, 0.1, 0.1, 0.1)$ | 443.7 | 176.5 | $\mu_2 = (0.1, 0.1, 0.1, 0.1, 0.1, 0.1)$ | 441.1 | 5311 |
| $S_t \sim t(0, 1, 95)$ | 451.7 | 485.4 | / | / | / | / | / | / |

Other initials:
S (state space variables): $\mu_1 = (10, 0.1, 10, 0.1, 10, 0.1)'$, $\mu_2 = (10, 0.1, 10, 0.1, 10, 0.1)'$
AC: $s_{g,t} \sim N(0, 0.001)$ or $s_{g,t} \sim t(0, 0.01, 95)$; $\mu_2 = (10, 0.1, 10, 0.1, 10, 0.1)'$
CB + D: $s_{g,t} \sim N(0, 0.001)$ or $s_{g,t} \sim t(0, 0.01, 95)$; $\mu_1 = (10, 0.1, 10, 0.1, 10, 0.1)'$




We obtained model III by using the multivariate Bayesian state space model and changing the prior distribution from a univariate normal distribution to a multivariate Gaussian distribution, such as CB D_t ~ MVN(m, W), while keeping the hyper-prior distributions similar to those of model I, such as an inverse Wishart(R, r) prior distribution for W. The advantage of this model is that it includes gene-gene interaction correlations by using a full covariance matrix instead of a diagonal, uncorrelated matrix (as in model I). The DIC value for this model was 463.9, and the deviance was 463.8. Since the number of parameters estimated in this model was 294, these deviance and DIC values were considered small, which indicated good model fit. The box plots of the estimated gene expressions (Fig. 3) show that this model avoids the over-dispersion problem, although the estimated standard errors are larger than in the previous model.
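The difference between a diagonal covariance matrix and a full one can be seen in a two-gene toy case. The Python sketch below draws zero-mean bivariate normal vectors through a 2x2 Cholesky factor and compares the sample correlation with and without an off-diagonal covariance term. The variances, the covariance of 0.8, the sample size and all function names are illustrative assumptions. It does not reproduce the six-gene prior or the MCMC estimation used in this paper.

```python
import math
import random

def draw_bivariate_normal(rng, var1, var2, cov12):
    """Draw one zero-mean bivariate normal vector via a 2x2 Cholesky factor."""
    l11 = math.sqrt(var1)
    l21 = cov12 / l11
    l22 = math.sqrt(var2 - l21 * l21)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return l11 * z1, l21 * z1 + l22 * z2

def sample_correlation(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Diagonal covariance: the two hypothetical genes vary independently.
diagonal = [draw_bivariate_normal(rng, 1.0, 1.0, 0.0) for _ in range(20000)]
# Full covariance: an off-diagonal term of 0.8 ties the two genes together.
full = [draw_bivariate_normal(rng, 1.0, 1.0, 0.8) for _ in range(20000)]

corr_diagonal = sample_correlation(diagonal)
corr_full = sample_correlation(full)
print(f"diagonal: {corr_diagonal:.2f}, full: {corr_full:.2f}")
```

With a diagonal matrix the sample correlation stays near zero, whereas the full matrix recovers a correlation close to the assumed 0.8. This is the kind of gene-gene dependence that model III can represent and model I cannot.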

Figure 4 Comparison of the observed gene expression profiles (dotted line) with the estimated (inferred) gene expression profiles (solid line) using state space model for six selected genes from liver data.

In model IV we further varied the prior and hyper-prior distributions of the state variable of the mean from the normal to the Student t-distribution: X_t ~ t(0, 0.001, 95). The Student t-distribution has three parameters: the mean, the precision (or variance) and the degrees-of-freedom parameter k. The parameter k determines the extent of over-dispersion: smaller values of k allow more marked departures from normality in the tails, so data expected to contain outliers might be described by a t density with fewer than 10 degrees of freedom (Congdon, 2002). We chose a large k to obtain a density close to the normal while still being able to see the difference. With this model we achieved a considerably smaller DIC value of 195.1, which may be evidence of the robustness of the t-distribution for gene expression data and of an enhanced prior model compared with the other three models. The improvement in the DIC supports this contention. Table 1 displays various prior and hyper-prior settings for some of our experiments using Bayesian state space models, the sensitivity analysis from various initial values and the corresponding DIC values.
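The role of the degrees-of-freedom parameter k can be checked numerically. The Python sketch below compares the density of a standard Student t-distribution with that of a standard normal four units from the centre, for k = 5 and k = 95. It assumes location 0 and unit scale, not the precision of 0.001 used above, and the function names are illustrative.

```python
import math

def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def student_t_pdf(x, k):
    """Density of the standard Student t-distribution with k degrees of freedom."""
    log_norm = (math.lgamma((k + 1) / 2) - math.lgamma(k / 2)
                - 0.5 * math.log(k * math.pi))
    return math.exp(log_norm - (k + 1) / 2 * math.log1p(x * x / k))

# Compare tail mass four units from the centre for a small and a large k.
for k in (5, 95):
    ratio = student_t_pdf(4.0, k) / normal_pdf(4.0)
    print(f"k = {k}: t density / normal density at x = 4 is {ratio:.1f}")
```

For k = 5 the tail density is many times the normal one, while for k = 95 the two are of the same order. This is consistent with choosing a large k to stay close to the normal while still allowing somewhat heavier tails.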

Figures 1, 2 and 3 show box plots of the posterior means and 95% credible intervals for the expression profile of each of the six genes, as estimated by the respective Bayesian state space models. Figure 4 compares the observed gene expression profiles (dotted lines) with the estimated (inferred) gene expression profiles (solid lines) from the state space model for the six selected genes. In each case, the state space model captured the dynamics (the observed dotted line is very close to the estimated solid line), although there are small deviations. Predictions of the gene expressions from model IV at the next 5 time points after 72 hours are also shown in the figure. The estimated gene expression profiles are similar to the observed ones, although there are some differences for genes AF053312_s_at and L32591mRNA_g_at, whose expressions are on a larger scale in the observed data than in the estimated data. These figures show that every gene except L32591mRNA_g_at is significantly expressed at one or more time points. Only gene L33869_at was predicted to show significant down-regulation at the next time point after 72 hours. We also compared the DIC values with and without prediction: the model with prediction gave a DIC value of 464.8, compared with 463.9 without prediction.

5 Discussion

We demonstrated the potential of our proposed Bayesian state space models for making inferences and predictions about genomic dynamics and applied them to multiple tissue data. Our main findings are as follows: i) even in the simple case, the proposed Bayesian state space model can provide good estimates and predictions of temporal gene expression profiles through its model specification (e.g. estimation of the scale matrix) in the multivariate setting; ii) the model can handle and infer the hidden and unmeasurable variables that affect observed gene expressions; iii) the model can analyze time course gene expression data that are not measured at fixed time intervals (discrete, unevenly spaced time course data), as is the case for most genomic and proteomic data.
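Points ii) and iii) can be illustrated with a much simpler model than those used in this paper. The Python sketch below runs a scalar local-level Kalman filter (a random-walk hidden state observed with noise) in which the state variance grows in proportion to the elapsed time between samples, so that uneven gaps and unobserved future time points are handled by the same recursion. This is a substituted, closed-form technique and not the MCMC-based hierarchical models described above. The sampling times, observations, variances and function name are all hypothetical.

```python
def local_level_filter(times, observations, q_per_unit_time, obs_var, m0, p0):
    """Filter a random-walk state observed with noise at uneven times.

    An observation of None marks an unobserved time point, for which only
    the prediction step is carried out.
    Returns a list of (time, state mean, state variance).
    """
    mean, var = m0, p0
    prev_t = times[0]
    results = []
    for t, y in zip(times, observations):
        # Predict: uncertainty about the hidden state grows with the time gap.
        var = var + q_per_unit_time * (t - prev_t)
        if y is not None:
            # Update: weight the observation by the Kalman gain.
            gain = var / (var + obs_var)
            mean = mean + gain * (y - mean)
            var = (1.0 - gain) * var
        results.append((t, mean, var))
        prev_t = t
    return results

# Hypothetical sampling times in hours; the last two points are unobserved.
times = [0.0, 0.25, 1.0, 4.0, 18.0, 72.0, 96.0, 120.0]
obs = [1.0, 1.2, 1.1, 0.7, 0.4, 0.5, None, None]
track = local_level_filter(times, obs, q_per_unit_time=0.01,
                           obs_var=0.05, m0=0.0, p0=1.0)
for t, mean, var in track:
    print(f"t = {t:6.2f} h  mean = {mean:.3f}  variance = {var:.3f}")
```

At the unobserved times the state mean is carried forward unchanged while its variance increases with the gap, which mirrors the widening uncertainty one would expect for predictions beyond the last measured time point.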

Our proposed models are innovative in two respects. Firstly, estimation is simplified in the fully hierarchical Bayesian setting, and the new model settings can incorporate non-standard distributions, non-stationary and nonlinear features, and short, unevenly spaced time courses. These settings are also more realistic for genomic data. Secondly, the models can handle unmeasurable hidden variables and unknown factors that affect observed gene expressions, and they can predict time course gene expression data that are not measured at fixed time intervals. The flexibility of modelling the expression measurements as continuous rather than discrete, and therefore with dynamic models rather than unrealistic static models, appears to be a major advantage. The models have great potential for biological and medical systems and can be applied to further the study of the dynamic effects of drugs in various diseases. We are currently investigating Bayesian state space models for precise estimation of the hidden structural and functional parameters of biological systems, which are well known for their noisy, uncertain and stochastic nature.

Acknowledgements The authors would like to thank Dr. Richard Almon for providing the data. This work is

supported by National Science Foundation grant DMS-0604639.





References

Almon, R. R., DuBois, D. C., Piel, W., and Jusko, W. J. (2004). The genomic response of skeletal muscle to methylprednisolone using microarrays: tailoring data mining to the structure of the pharmacogenomic time series. Pharmacogenomics 5, 525–552.

Alter, O., Brown, P. O., and Botstein, D. (2000). Singular value decomposition for genome-wide expression data processing and modelling. Proceedings of the National Academy of Sciences 97, 10101–10106.

Bar-Joseph, Z., Gerber, G., Gifford, D., Jaakkola, T., and Simon, I. (2003). Continuous Representations of Time Series Gene Expression Data. Journal of Computational Biology 10, 241–256.

Beal, M. J., Falciani, F. L., Ghahramani, Z., Rangel, C., and Wild, D. (2005). A Bayesian Approach to Reconstructing Genetic Regulatory Networks with Hidden Factors. Bioinformatics 21, 349–356.

Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., and Brown, P. O. (1998). The Transcriptional Program of Sporulation in Budding Yeast. Science 282, 699–705.

Congdon, P. (2002). Bayesian Statistical Modelling. John Wiley & Sons, Ltd. New York.

Congdon, P. (2003). Applied Bayesian Modelling. John Wiley & Sons, Ltd. New York.

Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39, 1–38.

Durbin, J. and Koopman, S. J. (2000). Time series analysis for non-Gaussian observations based on state space models from both classical and Bayesian perspectives (with discussion). Journal of the Royal Statistical Society, Series B 62, 3–56.

Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences 95, 14863–14868.

Ernst, J., Nau, G. J., and Bar-Joseph, Z. (2005). Clustering short time series gene expression data. Bioinformatics 21, Suppl 1: i159–i168.

Harvey, A. C. (1989). Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, London.

Holter, N., Maritan, A., Cieplak, M., Fedoroff, N., and Banavar, J. (2001). Dynamic modeling of gene expression data. Proceedings of the National Academy of Sciences 98, 1693–1698.

Jacob, F. and Monod, J. (1961). Genetic regulatory mechanisms in the synthesis of the proteins. Journal of Molecular Biology 3, 318–356.

Jin, J. Y., Almon, R. R., Dubois, D. C., and Jusko, W. J. (2003). Modeling of corticosteroid pharmacogenomics in rat liver using gene microarrays. Journal of Pharmacology and Experimental Therapeutics 307, 93–109.

Liang, Y. and Kelemen, A. (2004). Hierarchical Bayesian Neural Network for Gene Expression Temporal Patterns. Journal of Statistical Applications in Genetics and Molecular Biology 3, Article 20.

Liang, Y., Tayo, B., Cai, X., and Kelemen, A. (2005). Differential and Trajectory Methods for Time Course Gene Expression Data. Bioinformatics 20, 3009–3016.

Liang, Y. and Kelemen, A. (2006). Associating phenotypes with molecular events: a review of statistical advances and challenges underpinning microarray analyses. Journal of Functional and Integrative Genomics 6, 1–13.

Luan, Y. and Li, H. (2003). Clustering of time-course gene expression data using a mixed-effects model with B-splines. Bioinformatics 19, 474–482.

Perrin, B. E., Ralaivola, L., Mazurie, A., Bottani, S., Mallet, J., and D'Alche-Buc, F. (2003). Gene networks inference using dynamic Bayesian networks. Bioinformatics 19, Suppl 2: II138–II148.

Ramoni, M. F., Sebastiani, P., and Kohane, I. (2002). Cluster analysis of gene expression dynamics. Proceedings of the National Academy of Sciences 99, 9121–9126.

Rangel, C., Angus, J., Ghahramani, Z., Lioumi, M., Sotheran, E. A., Gaiba, A., Wild, D. L., and Falciani, F. (2004). Modeling T-cell activation using gene expression profiling and state space models. Bioinformatics 20, 1361–1372.

Roweis, S. and Ghahramani, Z. (1999). A Unifying Review of Linear Gaussian Models. Neural Computation 11, 305–345.

Spiegelhalter, D., Best, N., Carlin, B., and Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B 64, 583–639.

West, M. (2003). Bayesian factor regression models in the Large p, Small n paradigm. Bayesian Statistics 7, 723–732.

West, M. and Harrison, J. (1999). Bayesian Forecasting and Dynamic Models, 2nd edition. New York, Springer.

Zellner, A. (1996). An introduction to Bayesian inference in econometrics. John Wiley & Sons, Ltd. New York.


