Bayesian State Space


    Bayesian State Space Models for Inferring and Predicting Temporal Gene Expression Profiles

    Yulan Liang* and Arpad Kelemen

    Department of Biostatistics, University at Buffalo, The State University of New York,

    252A2 Farber Hall, 3435 Main Street, Buffalo, NY 14214, USA

    Received 26 April 2006, revised 10 October 2006, accepted 22 January 2007

    Summary

    Prediction of gene dynamic behavior is a challenging and important problem in genomic research while

    estimating the temporal correlations and non-stationarity are the keys in this process. Unfortunately,

    most existing techniques used for the inclusion of the temporal correlations treat the time course as

    evenly distributed time intervals and use stationary models with time-invariant settings. This is an

    assumption that is often violated in microarray time course data since the time course expression data

    are at unequal time points, where the difference in sampling times varies from minutes to days.

    Furthermore, the unevenly spaced short time courses with sudden changes make the prediction of genetic

    dynamics difficult. In this paper, we develop two types of Bayesian state space models to tackle this

    challenge for inferring and predicting the gene expression profiles associated with diseases. In the

    univariate time-varying Bayesian state space models we treat both the stochastic transition matrix and

    the observation matrix time-variant with linear setting and point out that this can easily be extended to

    nonlinear setting. In the multivariate Bayesian state space model we include temporal correlation

    structures in the covariance matrix estimations. In both models, the unevenly spaced short time courses with

    unseen time points are treated as hidden state variables. Bayesian approaches with various prior and

    hyper-prior models with MCMC algorithms are used to estimate the model parameters and hidden

    variables. We apply our models to multiple tissue polygenic Affymetrix data sets. Results show that

    the predictions of the genomic dynamic behavior can be well captured by the proposed models.

    Key words: Affymetrix data; Bayesian approach; Deviance Information Criterion; Prediction;

    State space model; Temporal gene expression.

    1 Introduction

    After the completion of the genome sequencing project, new computational and statistical challenges

    have arisen in genomic research, which may include gene/protein function predictions, gene and pro-

    tein interaction network modelling and dynamic pathway discovery. However, complex phenotypes

    (characteristics or traits that are observable or measurable) such as disease status (normal, disease) or

    blood pressure typically involve multiple inter-correlated genetic and environmental factors that interact in a hierarchical fashion. Microarrays, a high-throughput genetic data collection method, hold tremendous latent information that requires more sophisticated computational tools to uncover (Chu et al., 1998; Liang & Kelemen, 2006).

    Time-course gene expression data are often prepared to study dynamic biological systems since

    knowing when or whether a gene is expressed (Jacob and Monod, 1961), and how one interacts with

    others can provide a strong clue to its biological roles. Clustering analyses are currently the most

    commonly used statistical methods for time course gene microarray data due to the large number of

    *Corresponding author: e-mail: [email protected], Phone: 001 (0) 716 829 2814, Fax: 001(0) 716 829 2200

    Biometrical Journal 49 (2007) 6, 801-814 DOI: 10.1002/bimj.200610335

    © 2007 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim


    genes involved, and the identification of groups of genes with similar temporal patterns of expression is usually a critical step in the analysis of kinetic data (Eisen, et al., 1998; Holter, et al., 2001; Ernst, et al., 2005). However, most of the common clustering methods that are used today lack the ability to address the inherent temporal dependence between data observations when samples are collected in an unequal time-ordered sequence.

    Moreover, there are other limitations of the majority of the existing time course approaches and

    clustering techniques. Firstly, conventional time series techniques such as Fourier analysis, autoregres-

    sive or moving averaging models are not suitable for short time course gene expression data. Sec-

    ondly, although other techniques such as dynamic Bayesian clustering have been developed, these

    methods require stationary conditions, linearity for lower order AR models, and uniformly spaced

    time points, which are not present in microarray experiments (Ramoni, et al., 2002). Thirdly, some

    curve fitting, spline (e.g. in terms of polynomials in time) methods, mixed effects models and non-

    linear regression models have been applied to temporal microarray data and they can model the non-

    linear relations between genes, deal with the unevenly spaced data and may produce a good fit, but do

    not facilitate prediction and may cause overfitting problems (Luan and Li, 2003). Bayesian decomposition methods and singular value decomposition have also been developed for modelling the dynamics of microarray data through matrix decomposition (Alter, et al., 2000). A difficulty is the

    curse of dimensionality (high dimensional variable/feature space with small sample size) and ill

    posed problems (West, 2003). Lastly, but most importantly, most existing approaches using static

    models treat time as a categorical or ordinal variable but not as a continuous variable. This distinction

    is important because the kinetic parameters derived from ordinal variable treatments will not carry

    meaning except in the case when the time points are evenly spaced (Bar-Joseph, et al., 2003).

    State space models have greater flexibility in modelling non-stationary and nonlinear short time

    course microarray data (Roweis and Ghahramani, 1999; Congdon, 2003, Durbin and Koopman, 2000).

    However, current existing methods were based on standard Kalman filter methods that rely on the

    linear state transitions and Gaussian errors (Harvey, 1989). Perrin et al. (2003) used a penalized like-

    lihood maximization (MAP) implemented through an extended version of EM algorithm to learn the

    parameters of the model (Dempster, et al., 1977). The drawback of MAP is that it gives no tool for

    reducing the model complexity and the smoothness coefficients. Rangel, et al. (2004) used classical cross-validation and bootstrap techniques, and Beal et al. (2005) used variational approximations with

    linear time invariant Gaussian setting for constructions of the regulatory network in the state space

    framework (Roweis and Ghahramani, 1999).

    In this paper, by combining the merits of Bayesian flexibility of estimation procedures and the

    stochastic process of modelling the temporal dynamics, we develop state space model in the fully

    Bayesian setting for inferring and predicting time course gene expression profiles associated with

    diseases. We consider and develop both univariate Bayesian state space models and multivariate Baye-

    sian state space models. Monte Carlo Markov Chain (MCMC) algorithms are used to sample the

    posterior distribution of the hidden variables and the model parameters. Various prior models with

    different hyper-prior distributions are simulated and compared, and Deviance Information Criterion

    (DIC) is used for model checking and selections (Spiegelhalter, et al. 2002). DIC is also used to

    compare the univariate Bayesian state space model and multivariate Bayesian state space models

    performances. The developed models were applied to Affymetrix temporal gene expression data sets following corticosteroid administration, derived from multiple tissue polygenic phenomena in complex

    biological systems, which will be discussed next.

    2 Affymetrix Temporal Gene Expression Data Sets

    Genes are expressed in a two stage process of protein production: transcription (RNA, gene expres-

    sion) and translation. Each gene is transcribed (at the appropriate time) from DNA into mRNA, which

    then leaves the nucleus and is translated into the required protein. Any gene which is active in this

    way at any particular time is said to be expressed. Gene expression investigations study the amount of


    transcribed mRNA in a biological system. Measuring the activity level of a gene (its expression level)

    in a particular cell at a particular time can be conducted by measuring the concentration of that gene's mRNA transcript in the cell's total RNA. One high-throughput method to measure gene expression is the DNA microarray. A DNA microarray can be used to detect RNAs that may or may not be translated

    into active proteins and it consists of a solid surface on which strands of polynucleotides have been

    attached in specified positions. We refer to the polynucleotides immobilized on the solid surface as

    probes. There are three major popular approaches: cDNA (spotted) microarrays on a glass slide, oligonucleotide microarrays (Affymetrix) on a silicon chip, and SNP microarrays, which are used to read the sequence of a genome in particular positions.

    Microarrays can provide a method of high throughput data collection that is necessary for construct-

    ing comprehensive information on the transcriptional basis of polygenic phenomena. In order to inves-

    tigate thousands of genes, there are two types of categories to mine gene expressions: coordinated

    gene expressions (temporal gene expression profiles referred to as time course gene expression): by

    assessing the expression levels of large number of genes over a period of time or through a series of

    experimental conditions; and differential gene expressions: by making pair-wise comparisons (Liang,

    and Kelemen, 2004; Liang et al., 2005). When microarrays are used in a rich time course, they yield temporal patterns of changes in gene expression that illustrate the cascade of molecular events that

    cause broad systemic responses (Almon, et al., 2004).

    Corticosteroids are a class of compounds that exhibit the most potent immunosuppressive and anti-

    inflammatory activities. These drugs are widely used in a variety of acute and chronic disease states,

    such as asthma, leukemia, and organ transplantation. Although their therapeutic effects result from

    regulation of immune system genes, many adverse events occur due to unwanted influence of the drug

    on other genes, primarily those genes involved in metabolic processes (Jin, et al., 2003). The corticos-

    teroid compounds produce both beneficial and harmful effects, through binding to the same type of

    glucocorticoid receptor. This binding activity results in a cascade of signal transduction pathways to

    ultimately produce an eventual drug response and clinical outcome. Because drug activity requires a

    sequential series of events in order to elicit its effects, different genes may exhibit different expression

    profiles over time following the administration of a drug dose. The particular genes that are either up-regulated or down-regulated, in combination with specific time-course patterns, may be predictive of the ultimate outcome(s) that result from drug therapy. Therefore, it is important to improve our understanding of the time-dependent changes in gene expression caused by corticosteroid therapy in order

    to potentially discover the precise genes that may be the most critical in producing favorable therapeu-

    tic outcomes versus those that may instigate negative, unwanted effects.

    Our study starts from Affymetrix time course gene expression data sets that were generated in rat

    tissues of liver, muscle and kidney over time in response to a single bolus dose of methylprednisolone

    (MPL) in order to examine global changes in gene expression. This was a pre-clinical study per-

    formed on experimental rats. Forty-eight animals received a single IV bolus 50 mg/kg dose of MPL

    (Jin, et al., 2003). Three rats were consequently sacrificed at each of the following sixteen (experi-

    mental) time-points: 0.25, 0.5, 0.75, 1, 2, 4, 5, 5.5, 6, 7, 8, 12, 18, 30, 48, and 72 hours. Liver, muscle

    and kidney tissue samples were collected from each animal and processed to assay for gene expres-

    sion. Therefore, triplicate measurements were obtained at each of these time-points. Four rats were

    not administered any drug and were sacrificed at time $t = 0$. These served as the control group for gene expression in the absence of any drug at baseline.

    Total RNA was separately extracted from the samples from each animal and purified. The isolated

    RNA was then used to create biotinylated cRNA. This target cRNA was then hybridized to individual

    Affymetrix GeneChips Rat Genome U230A and U34A. The expression levels of a total of 15,923

    oligonucleotide sequences were quantified for kidney and 8799 probe sets for liver and muscle in each

    chip. A series of filtering steps, which removed probes that were not expressed or not up- or down-regulated,

    was performed in order to subset the data into a more manageable dataset that would yield the most

    relevant information regarding potentially important changes in gene expression due to MPL dosing

    (Almon, et al., 2004). After pre-processing and gene filtering we were able to eliminate probe sets


    that were not expressed in the tissue, were not regulated by the drug treatment, or did not meet

    defined quality control standards.

    In order to demonstrate the inference and prediction performance of the proposed models we present 6 selected genes that were differentially expressed in three tissues (liver, muscle and kidney). These six genes were selected from our previous study, Bayesian Meta-Analysis with Markov Mixture model, which we will report in another paper. These genes are measured on a continuous scale. For instance, gene L33869_at at the above 17 time points before transformation had the following measured

    values: 62.025, 50.100, 56.833, 57.400, 63.633, 39.600, 26.333, 43.000, 37.367, 49.600, 49.067,

    51.533, 44.967, 61.333, 38.133, 51.600, 44.733. Because our analysis is focused on the dynamic

    changes in gene expression from time $t = 0$ after dosing of MPL, gene expression was converted to a ratio by dividing the gene expression at time $t_i$ by the gene expression level at time $t_0$, where $i$ represents the specific post-dose time-point and $t_0$ represents baseline at time 0 hours (i.e. the control group that did not receive drug). By construction, this ratio equals 1 for every gene in the control group. These ratios were subsequently natural-logarithmically transformed to produce normally distributed gene expression levels at each sampling time-point. This log-transformation directly centered the data around mean 0, since the control values were all equal to 1 prior to the transformation.
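The ratio and log transformation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' preprocessing code: it uses the values quoted for gene L33869_at and assumes that the first quoted value is the baseline (control) measurement at 0 hours, which the text does not state explicitly. The function name `log_ratio` is hypothetical.

```python
import math

# Raw expression values quoted in the text for gene L33869_at.
# ASSUMPTION: the first entry is the baseline (t = 0) control value.
raw = [62.025, 50.100, 56.833, 57.400, 63.633, 39.600, 26.333, 43.000,
       37.367, 49.600, 49.067, 51.533, 44.967, 61.333, 38.133, 51.600, 44.733]


def log_ratio(values):
    """Return ln(x_i / x_0) for every entry, where x_0 is the first value."""
    baseline = values[0]
    return [math.log(v / baseline) for v in values]


transformed = log_ratio(raw)
# The baseline maps to ln(1) = 0; values below baseline become negative.
```

Under this assumption the baseline entry maps to exactly 0, and time points with expression below baseline receive negative log-ratios.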

    3 Methods

    3.1 Bayesian state space model formulation and prior model specification

    The measured (observed) gene expressions provided by microarray experiment are contaminated by

    noise. A state space model will decompose the signal and noise processes into two model equations:

    the stochastic equation and the observation equation. In a state space model, a sequence of $P$-dimensional real valued observation gene expression vectors $\{X_t\}$ is modelled by assuming that at each time step, $X_t$ is generated from a $K$-dimensional unobserved or hidden state variable $S_t$, and that the sequence $\{S_t\}$ defines a Markov process (Congdon, 2003). The joint probability of $\{X_t, S_t\}$ is

    $$P(\{S_t, X_t\}) = P(S_1)\, P(X_1 \mid S_1) \prod_{t=2}^{T} P(S_t \mid S_{t-1})\, P(X_t \mid S_t) \qquad (1)$$

    where $P(S_1)$ is the initial state probability and is assumed to be generated from conjugate distributions such as Gaussian or Student t-distributions. $P(S_t \mid S_{t-1})$ is the transition density or probability of the hidden states (such as genes that are not measured or included in the study) that can be inferred from the measured gene expressions $\{X_t\}$, and it can be defined in the stochastic evolution equation as:

    $$S_t = g_t(S_{t-1}) + V_t \qquad (2)$$

    where $g_t$ is the deterministic transition function determining the mean of $S_t$ given $S_{t-1}$. $P(X_t \mid S_t)$ is the observation density or probability that can be defined in the observation equation as:

    $$X_t = f_t(S_t) + W_t \qquad (3)$$

    where $f_t$ is the statistical transition function of the observation process. $W_t$ and $V_t$ are assumed to follow Gaussian or non-Gaussian distributions with zero means for both processes. $g_t$ and $f_t$ follow either linear or nonlinear settings.

    3.1.1 Univariate time varying Bayesian state space model

    We start with univariate state space models in a fully hierarchical Bayesian setting. Here, both the stochastic and observation equations take linear forms, and the distributions of the state variables $P(S_t \mid S_{t-1})$ and observed gene expressions $P(X_t \mid S_t)$ are assumed to follow Gaussian distributions. The model is


    as follows for each gene:

    $$\begin{aligned} S_t &= A S_{t-1} + w_t \\ X_t &= C S_t + \nu_t \end{aligned} \qquad (4)$$

    where $t = 1, \ldots, T$ ($T$ is the number of time points). The time steps in the model do not have to correspond to a fixed unit of real time, which allows unevenly distributed time course data that are not measured at fixed time intervals to be modelled. $w_t$ and $\nu_t$ are noise sequences, $A$ is the state to state transition matrix, and $C$ is the state to observation transition matrix.
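To make the linear Gaussian model concrete, the following sketch simulates a scalar version of the state and observation equations. It is an illustration only: the function name `simulate_ssm`, the parameter values and the use of Gaussian noise with fixed standard deviations are assumptions made for the example, not settings taken from the paper.

```python
import random


def simulate_ssm(A, C, q_sd, r_sd, T, s0=0.0, seed=0):
    """Simulate S_t = A*S_{t-1} + w_t and X_t = C*S_t + n_t in the scalar case.

    q_sd and r_sd are the (assumed) standard deviations of the state noise
    and the observation noise; s0 is the state before the first time step.
    """
    rng = random.Random(seed)
    states, obs = [], []
    s = s0
    for _ in range(T):
        s = A * s + rng.gauss(0.0, q_sd)   # state (stochastic) equation
        x = C * s + rng.gauss(0.0, r_sd)   # observation equation
        states.append(s)
        obs.append(x)
    return states, obs


# Hypothetical settings: 16 time steps, mildly persistent hidden state.
states, obs = simulate_ssm(A=0.8, C=1.0, q_sd=0.3, r_sd=0.1, T=16)
```

Because the loop index is a step counter rather than a clock time, the same code applies when the 16 steps correspond to unevenly spaced sampling times, which is the point made in the text.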

    Besides the observed gene expressions $\{X_t\}$ and the hidden state variable $\{S_t\}$, which can be inferred from the associated $\{X_t\}$, we can also include a set of exogenous variables in the above models. These variables can be environmental factors such as certain exposures, or they can be other genes that interact with the studied gene. These variables can modify the observed gene expression $\{X_t\}$ and affect the inferred hidden state variables. In our case we are interested in modelling the influence of the expression of one gene at a previous time point on another gene and its associated hidden variables; therefore, we modified (4) by including this influence in both the state equation and the observation equation as follows:

    $$\begin{aligned} S_{t+1} &= A S_t + B X_t + w_t \\ X_t &= C S_t + D X_{t-1} + \nu_t \end{aligned} \qquad (5)$$

    where matrix $D$ in the observation equation captures the relationship of gene expression levels at consecutive time points, matrix $C$ captures the influence of the hidden variables on gene expression level at each time point, and matrix $B$ models the influence of gene expression values from previous time points on the hidden states. Combining the two formulas in (5) we get:

    $$\begin{aligned} X_t &= AC\, S_{t-1} + (CB + D)\, X_{t-1} + \varepsilon_t \\ \varepsilon_t &= C w_t + \nu_t \end{aligned} \qquad (6)$$

    where $AC$ is the hidden state dynamic matrix with the influence of hidden state variables on gene expression level at each time point, $CB + D$ contains both the gene-to-gene interaction and the gene-to-gene interactions through the hidden state over time, and $\varepsilon_t$ is the noise.
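The substitution that leads from (5) to (6) can be written out step by step. The sketch below is one reading of that step, writing $\nu_t$ for the observation noise: the state equation of (5) is shifted back by one time step and inserted into the observation equation. Two points are assumptions of this sketch rather than statements from the text: in the general matrix case the product appears as $CA$, which coincides with the $AC$ of (6) in the scalar (univariate) case, and the state noise enters with index $t-1$, which has the same distribution as $w_t$ if the noise sequence is identically distributed over time.

```latex
\begin{aligned}
S_t &= A S_{t-1} + B X_{t-1} + w_{t-1}
      && \text{(state equation of (5), shifted back one step)} \\
X_t &= C S_t + D X_{t-1} + \nu_t
      && \text{(observation equation of (5))} \\
    &= C \left( A S_{t-1} + B X_{t-1} + w_{t-1} \right) + D X_{t-1} + \nu_t \\
    &= CA\, S_{t-1} + (CB + D)\, X_{t-1} + \left( C w_{t-1} + \nu_t \right)
\end{aligned}
```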

    Rangel, et al. (2004) and Beal, et al. (2005) assumed that $A, B, C, D$ are linear and time-invariant matrices in the state space model setting. Here in our model (6), we initialized our model with a time varying matrix. The motivation for the above BSS model with time varying coefficients comes from our major prediction goal, which requires a model to have not only a good fit in the sampling period but also good generalization performance. Since microarray experiments are more concerned with short term prediction given short term time course data, the introduction of time variability is advisable, as Congdon (2003) suggested. This way, the underlying parameters to be estimated evolve through time on a continuous rather than a discrete scale.

    In order to incorporate the fully hierarchical Bayesian setting into the above model (6) for learning

    the model parameters and model structures, and using probability density functions, we reformulate

    our univariate time varying state space model given in (6) with fully Bayesian setting as follows:

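    $$X_t \sim N(\mu_t, \sigma_t^2) \qquad (7)$$

    $$\mu_t = (AC)_t\, S_t + (CB + D)_t\, X_t \qquad (8)$$

    $$(AC)_t \sim N(\beta_{1,t-1}, \sigma_{1t}^2 I), \quad (CB + D)_t \sim N(\beta_{2,t-1}, \sigma_{2t}^2 I), \quad S_t \sim N(\beta_3, \sigma_3^2) \qquad (9)$$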

    As Congdon (2003) suggested, the above Gaussian/normal distribution can also be replaced by a Student t distribution with three parameters: degrees of freedom, mean (the same mean as the normal distribution) and variance (we take a larger variance than the normal, by a factor of $k\sigma^2$), in order to deal with outliers and other sources of heavy tailed distributions, which are typical in microarray data. This may also help to deal with over-dispersion problems in time course data.
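A forward simulation can illustrate what a time varying coefficient prior of this kind implies for a single gene. The sketch below is a hedged reading, not the authors' WinBUGS model: it assumes that the prior mean of each coefficient at time $t$ is its own value at $t-1$ (a random walk), it uses the previous observation in the mean as in (6), and all names and numerical values (`simulate_time_varying`, `sd_coef`, the starting coefficients) are hypothetical.

```python
import random


def simulate_time_varying(T, x0=0.0, ac0=0.5, cbd0=0.5, beta3=0.0,
                          sd_coef=0.05, sd_state=1.0, sd_obs=0.1, seed=1):
    """Forward-simulate one gene under random-walk (time varying) coefficients.

    ASSUMPTION: each coefficient is drawn around its previous value, which is
    one possible reading of the prior in (9).
    """
    rng = random.Random(seed)
    ac, cbd = ac0, cbd0
    x_prev = x0
    xs, coefs = [], []
    for _ in range(T):
        ac = rng.gauss(ac, sd_coef)        # (AC)_t drawn around (AC)_{t-1}
        cbd = rng.gauss(cbd, sd_coef)      # (CB + D)_t drawn around its past value
        s = rng.gauss(beta3, sd_state)     # hidden state S_t ~ N(beta_3, sigma_3^2)
        mu = ac * s + cbd * x_prev         # mean of X_t, using X_{t-1} as in (6)
        x = rng.gauss(mu, sd_obs)          # X_t ~ N(mu_t, sigma_t^2)
        xs.append(x)
        coefs.append((ac, cbd))
        x_prev = x
    return xs, coefs


xs, coefs = simulate_time_varying(T=16)
```

Setting `sd_coef=0` recovers a time-invariant model, so the same function can be used to compare simulated profiles with and without time variability.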


    Specifying hyper-priors for the vector $\beta_{1,t-1}$, the matrix $\sigma_{1t}^2 I$ and so on requires elicitation, which consists of incorporating relevant background knowledge into the formulation of priors on parameters. Since our knowledge of the parameter values is limited, diffuse or just proper priors can be called upon. Regarding the stationarity issue, we followed Zellner (1996) and imposed no stationarity constraints, using a non-informative reference prior (such as normal-gamma or Gaussian-Wishart). Various hyper-parameters generated from non-informative distributions, such as a vague inverse-Gamma distribution, were tested.

    The advantage of the proposed fully Bayesian setting with time varying state space model is that

    the estimation of varying coefficients is based on the assumption that they belong to the common

    density and hence Bayesian methods are appropriate in pooling the strength estimation of coefficients

    from a common density. Moreover, it can deal with the outliers and shifts in time, learn the non-

    stationarity typically observed in microarray data, and place more weight on the recent observations.

    Lastly, it is flexible for time varying coefficients with linear or nonlinear functions (Congdon, 2003).

    Note that this model is a univariate state space model through hierarchical Bayesian setting. A

    hierarchical model has its virtues. It is richer than a one-stage model, and produces shrinkage estimators of the parameters automatically. These shrinkage estimators may help to improve the generalization performance and overcome the over-fitting problem for noisy, sparse, high dimensional gene data.

    3.1.2 Multivariate Bayesian state space models with correlated covariance

    Multivariate state space models that can describe patterns of dependency among multiple series (genes

    across time) measured simultaneously on a common system may be helpful to discover the gene

    dependences of the underlying processes (West and Harrison, 1999). Therefore, we extend the univari-

    ate Bayesian state space model to multivariate model via inclusion of the covariance structure in order

    to learn gene correlations and their temporal behavior for prediction. Here, each series depends on both

    its own past and the past values of the other series, therefore the variations in expression for a given

    gene can be predicted by a small set of other genes. One advantage of simultaneously modeling

    several series is the possibility of pooling information from related genes to improve the precision and

    out of sample forecasts (Congdon, 2003).

    We are particularly interested in modeling several correlated series (genes) as well as the error terms. We still assume that the time course gene expression data with $P$ genes follow a multivariate Gaussian distribution with linear setting (7, 8). However, the covariance matrices and error matrices in (9) are modified from univariate normal to multivariate Gaussian distributions or multivariate Student t-distributions as follows:

    $$(AC)_t \sim \mathrm{MVN}(\beta_1, W_1), \quad (CB + D)_t \sim \mathrm{MVN}(\beta_2, W_2), \quad \varepsilon_t \sim \mathrm{MVN}(0, W) \qquad (10)$$

    where $W_1, W_2, W$ are positive definite covariance matrices following inverse Wishart priors (Congdon, 2003). $\beta_1, \beta_2$ are vectors that are generated from multivariate t-distributions or Gaussian distributions with vague inverse-Gamma hyper-priors, as discussed earlier.

    Note that the difference between (9) and (10) is that the covariance matrix in (9) is assumed to have a diagonal, uncorrelated structure. However, this does not mean that (9) is a special case of (10), since in (9) we assumed time varying coefficient vectors and in (10) we did not make this assumption.

    The estimations of the hidden variables, parameters and hyper-parameters such as the covariance matrix in both the univariate time varying Bayesian state space model and the multivariate Bayesian state space model are conducted by standard Monte Carlo Markov Chain algorithms for both posterior inference and predictions.

    3.2 Model comparisons, selection criteria and validations

    In order to choose the best model for prediction, model selection criteria such as the Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC) or Bayes factors can be considered. Although AIC is useful for non-nested models, it works poorly in the case of multicollinearity, which is typical for gene expression data. It also tends to be biased for complicated models, because the log-likelihood increases faster than the model complexity component.

    Deviance Information Criterion is a new measure proposed by Spiegelhalter, et al. (2002) for model complexity and goodness of fit under the Bayesian setting. It is more appropriate when comparing complex hierarchical models in the Bayesian setting, where the number of parameters is not clearly defined. One advantage of DIC is its inclusion of a prior distribution, which induces a dependency between parameters that is likely to reduce the effective dimensionality. Furthermore, it helps the identification of prior models. DIC combines the posterior expectation of the deviance with a complexity term (the effective number of parameters), computed as the expected deviance minus the deviance at the posterior expectation of the parameters. It is defined as follows:

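    $$\begin{aligned} p_D &= \overline{D(\theta)} - D(\bar{\theta}) \\ D(\theta) &= -2 \log\{p(y \mid \theta)\} + 2 \log\{f(y)\} \\ \mathrm{DIC} &= D(\bar{\theta}) + 2 p_D \end{aligned} \qquad (11)$$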

    where $D(\theta)$ is the Bayesian deviance. The posterior mean of the deviance is a Bayesian measure of fit. The effective number of parameters $p_D$ in the model is defined as the difference between the posterior

    mean of the deviance and the deviance at the posterior means of the parameters of interest. We used

    DIC for within sample fit measure for model selections and also to avoid the notorious over-fitting

    problem by controlling the optimum number of parameters for model selection.
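As an illustration of how these quantities are obtained from MCMC output, the sketch below computes the effective number of parameters and DIC for a deliberately simple case: independent normal data with known standard deviation and posterior draws of the mean only. The standardizing term $f(y)$ is set to 1, so its logarithm vanishes; this choice, the function names and the toy numbers are assumptions of the example, not part of the paper's analysis.

```python
import math


def normal_deviance(y, mu, sigma):
    """Deviance -2 log p(y | mu) for i.i.d. N(mu, sigma^2) data, with f(y) = 1."""
    n = len(y)
    ss = sum((v - mu) ** 2 for v in y)
    return n * math.log(2 * math.pi * sigma ** 2) + ss / sigma ** 2


def dic(y, mu_draws, sigma):
    """Return (DIC, p_D, mean deviance) from posterior draws of mu."""
    deviances = [normal_deviance(y, m, sigma) for m in mu_draws]
    d_bar = sum(deviances) / len(deviances)        # posterior mean of the deviance
    mu_bar = sum(mu_draws) / len(mu_draws)         # posterior mean of the parameter
    d_at_mean = normal_deviance(y, mu_bar, sigma)  # deviance at the posterior mean
    p_d = d_bar - d_at_mean                        # effective number of parameters
    return d_at_mean + 2 * p_d, p_d, d_bar


# Toy example: two observations and two hypothetical posterior draws.
dic_value, p_d, d_bar = dic([0.0, 1.0], [0.0, 1.0], sigma=1.0)
```

In this form DIC is equivalently the mean deviance plus $p_D$, which is why a model with a slightly worse fit can still be preferred when its effective number of parameters is much smaller.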

    To validate competing models, one option is to make short-term prediction ahead with samples and

    then choose the model which diminishes the prediction error, the mean squared error or the mean

    absolute error (Congdon, 2003). Another option is to fit the models to periods $t = 1, \ldots, F$, where $F$ is less than $N$. Periods $F+1, F+2, \ldots, N$ are used to validate the competing models. Instead of using the first $F$ time points to build the model and use the rest for validation, we used the first option.

    Furthermore, we used the same model to predict not only the next time point, but the next few time

    points, such as five points.
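The comparison of competing models by prediction error can be sketched as follows. The helper names (`mean_squared_error`, `mean_absolute_error`, `best_model`) and the example forecasts are hypothetical; the sketch only shows the bookkeeping for choosing the model with the smallest error over a few steps ahead, not how the forecasts themselves are produced.

```python
def mean_squared_error(actual, predicted):
    """Average squared difference between observed and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def best_model(actual, forecasts, error=mean_squared_error):
    """Return the name of the model whose forecasts minimize the chosen error."""
    return min(forecasts, key=lambda name: error(actual, forecasts[name]))


# Hypothetical five-step-ahead forecasts from two competing models.
observed = [0.10, -0.20, 0.05, 0.30, 0.00]
forecasts = {
    "model_a": [0.12, -0.15, 0.00, 0.25, 0.05],
    "model_b": [0.50, 0.40, -0.30, 0.00, 0.60],
}
chosen = best_model(observed, forecasts)
```

Passing `error=mean_absolute_error` switches the criterion, which is useful when a few large misses should not dominate the comparison.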

    4 Results

    We simulated the prior models with various cases of both univariate and multivariate Bayesian state

    space models using WinBUGS and applied them to the multiple tissue data sets discussed in Section 2. Monte

    Carlo Markov Chain algorithm with Gibbs sampling was used to sample from the posterior distribu-

    tions of parameters and the simulated data was used to draw inferences on the parameters given the

    tissue data sets. Each parameter was estimated as the mean of its posterior distribution based on an

    assumption for its prior distribution. The results were robust in many complex cases. To ensure that the chain was sampling from its equilibrium distribution, 2000 samples after 6000 burn-in iterations were used for computation.
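The role of burn-in and retained samples can be illustrated with a toy Gibbs sampler. The sketch below is not the WinBUGS model used in the paper: it samples the mean and precision of a single normal sample under conjugate priors, with the burn-in and retained sample counts from the text as defaults. The function name `gibbs_normal`, the prior settings and the example data are assumptions made for the illustration.

```python
import random


def gibbs_normal(y, n_burn=6000, n_keep=2000, m0=0.0, tau0=0.001,
                 a=1.0, b=0.001, seed=0):
    """Gibbs sampler for y_i ~ N(mu, 1/tau) with mu ~ N(m0, 1/tau0) and
    tau ~ Gamma(shape=a, rate=b); returns the draws kept after burn-in."""
    rng = random.Random(seed)
    n = len(y)
    sum_y = sum(y)
    mu, tau = 0.0, 1.0                       # arbitrary starting values
    mu_draws, tau_draws = [], []
    for it in range(n_burn + n_keep):
        # Full conditional of mu given tau: normal with updated precision.
        prec = tau0 + n * tau
        mean = (tau0 * m0 + tau * sum_y) / prec
        mu = rng.gauss(mean, prec ** -0.5)
        # Full conditional of tau given mu: gamma with updated shape and rate.
        rate = b + 0.5 * sum((v - mu) ** 2 for v in y)
        tau = rng.gammavariate(a + n / 2.0, 1.0 / rate)  # second argument is a scale
        if it >= n_burn:                     # discard the burn-in iterations
            mu_draws.append(mu)
            tau_draws.append(tau)
    return mu_draws, tau_draws


# Hypothetical log-ratio data for one gene; the posterior mean of mu is the
# average of the retained draws.
example = [0.12, -0.05, 0.30, 0.21, -0.10, 0.08]
mu_draws, tau_draws = gibbs_normal(example)
posterior_mean = sum(mu_draws) / len(mu_draws)
```

Each parameter estimate is then the mean of its retained draws, which mirrors how the posterior means are reported in the text.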

    We tested four models of the Bayesian state space model we discussed in the methods section withvarious priors and hyper-prior distributions. Since the choice of the hyper-prior distribution of the

    parameters is a key issue. To investigate the influence of the choice of the hyper-parameters on the estimates, we carried out a sensitivity analysis over different choices of the initial priors/parameters. In model I, the univariate time-invariant Bayesian state space model (with regression coefficients, their variances and state variables fixed through time) was tested: AC $\sim N(\mu_1, \sigma^2)$; CB D $\sim N(\mu_2, \sigma^2)$; S $\sim N(\mu_3, \sigma^2)$; $\varepsilon \sim N(\mu, W)$. The hyper-prior distribution was $\mu_1, \mu_2, \mu_3, \mu \sim N(0, \sigma^2)$, where $\sigma^2 \sim \text{Inverse Gamma}(1, 0.001)$. Since we had no information on the precision (inverse of $\sigma^2$), small precisions such as 0.001 and 0.01 were tested. $W$ was generated from an inverse Wishart$(R, r)$ distribution, and the degree of freedom ($r$) was the number of genes included in the model. For demonstration purposes, we selected six commonly differentially expressed genes from the multiple tissues data: the degree of freedom was six and the scale matrix was specified as

    $$R = \begin{pmatrix} 100 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0.1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0.1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0.1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0.1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0.01 \end{pmatrix}$$

    Biometrical Journal 49 (2007) 6 807

    © 2007 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.biometrical-journal.com

    Figure 1 Box plots of estimated gene expression profiles for six selected genes from liver tissue data using univariate Bayesian state space model (model I). Panels: muY[1,] to muY[6,], one box per time point [g,1] to [g,16].
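The prior hierarchy described for model I can be sketched by drawing one set of values from it. This is only an illustrative sketch, not the authors' implementation: it assumes the second argument of each normal distribution is a variance (in BUGS-style notation it could instead be a precision), assumes a shape/scale parameterization for the inverse gamma, and treats the components of the mean vector as independent. The variable names are hypothetical.

```python
import numpy as np
from scipy.stats import invgamma, invwishart

rng = np.random.default_rng(0)

# Scale matrix R as printed in the text (6 x 6, diagonal).
R = np.diag([100.0, 0.1, 0.1, 0.1, 0.1, 0.01])
n_genes = R.shape[0]  # degrees of freedom r = number of genes in the model

# sigma^2 ~ Inverse Gamma(1, 0.001); shape/scale parameterization is an assumption.
sigma2 = invgamma.rvs(a=1.0, scale=0.001, random_state=rng)

# mu_1, mu_2, mu_3 ~ N(0, sigma^2); second argument read as a variance (assumption).
mu1, mu2, mu3 = rng.normal(0.0, np.sqrt(sigma2), size=3)

# mu is the mean vector of the error term; independent components are assumed.
mu = rng.normal(0.0, np.sqrt(sigma2), size=n_genes)

# W ~ Inverse Wishart(R, r) with r = 6 degrees of freedom.
W = invwishart.rvs(df=n_genes, scale=R, random_state=rng)

print("sigma^2 draw:", sigma2)
print("diagonal of W draw:", np.diag(W))
```

With the degrees of freedom equal to the matrix dimension, the inverse Wishart prior is very diffuse, so individual draws of W can vary by orders of magnitude between seeds.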


    Figure 2 Box plots of estimated gene expression profiles for six

    selected genes using univariate time varying Bayesian state space

    model (model II).


    Judging by the overall goodness-of-fit measures, namely the DIC value (6.502E+7) and the deviance (1230.0), and by the over-dispersed estimates (see Fig. 1), we conclude that this model did not fit well.

    To obtain model II, we improved model I by modifying the above time-invariant model into the univariate time-varying Bayesian state space model described earlier (see Fig. 2). With this model, owing to the hierarchical Bayesian setting and shrinkage effects, we were also able to filter out non-differentially expressed genes (at an unadjusted significance level of alpha = 0.05), which reduces the number of dimensions and mitigates the curse of dimensionality: in total, 131 of the 3614 genes were retained.


    Figure 3 Box plots of estimated gene expressions for six selected genes from liver at the given sixteen time points (model III). Predictions of the gene expressions at the next 5 time points are also shown.


    Table 1 Sensitivity analysis from various prior and initial values and corresponding DIC values; six selected genes were tested, using temporal tissue data (line in bold shows the best model with lowest DIC).

    | S (state space variables): Initials | Deviance | DIC | AC: Initials | Deviance | DIC | CB D: Initials | Deviance | DIC |
    |---|---|---|---|---|---|---|---|---|
    | $S_t \sim N(0, 0.001)$ | 458.5 | 436.7 | / | / | / | $\mu_2$ (10, 0.1, 10, 0.1, 10, 0.1)' | 458.5 | 436.7 |
    | $S_t \sim N(1, 1)$ | 467.4 | 517.9 | / | / | / | $\mu_2$ (10, 10, 10, 10, 10, 10)' | 458.5 | 436.7 |
    | $S_t \sim N(0, 1)$ | 459.1 | 459.6 | / | / | / | $\mu_2$ (0.1, 0.1, 0.1, 0.1, 0.1, 0.1) | 458.5 | 436.7 |
    | $S_t \sim N(0, 0.001)$ | 440.9 | 612.1 | $\mu_1$ (10, 0.1, 10, 0.1, 10, 0.1)' | 440.9 | 612.1 | $\mu_2$ (10, 0.1, 10, 0.1, 10, 0.1)' | 440.9 | 612.1 |
    | $S_t \sim N(1, 1)$ | 457.8 | 333.7 | $\mu_1$ (10, 10, 10, 10, 10, 10)' | 440.9 | 612.1 | $\mu_2$ (10, 10, 10, 10, 10, 10) | 440.9 | 612.1 |
    | $S_t \sim N(0, 1)$ | 447.1 | 364.9 | $\mu_1$ (0.1, 0.1, 0.1, 0.1, 0.1, 0.1) | 436.0 | 332.9 | $\mu_2$ (0.1, 0.1, 0.1, 0.1, 0.1, 0.1) | 438.4 | 587.2 |
    | $S_t \sim t(0, 0.001, 95)$ | 444.3 | 220.9 | $\mu_1$ (10, 0.1, 10, 0.1, 10, 0.1)' | 444.3 | 220.9 | $\mu_2$ (10, 0.1, 10, 0.1, 10, 0.1)' | 444.3 | 220.9 |
    | $S_t \sim t(0, 0.01, 95)$ | 443.0 | 195.1 | $\mu_1$ (10, 10, 10, 10, 10, 10)' | 443.0 | 195.1 | $\mu_2$ (10, 10, 10, 10, 10, 10)' | 443.0 | 195.1 |
    | $S_t \sim t(0, 0.1, 95)$ | 441.5 | 333.9 | $\mu_1$ (0.1, 0.1, 0.1, 0.1, 0.1, 0.1) | 443.7 | 176.5 | $\mu_2$ (0.1, 0.1, 0.1, 0.1, 0.1, 0.1) | 441.1 | 5311 |
    | $S_t \sim t(0, 1, 95)$ | 451.7 | 485.4 | / | / | / | / | / | / |
    | Other initials: $\mu_1$ (10, 0.1, 10, 0.1, 10, 0.1)', $\mu_2$ (10, 0.1, 10, 0.1, 10, 0.1)' | | | Other initials: $s_{g,t} \sim N(0, 0.001)$ or $s_{g,t} \sim t(0, 0.01, 95)$; $\mu_2$ (10, 0.1, 10, 0.1, 10, 0.1)' | | | Other initials: $s_{g,t} \sim N(0, 0.001)$ or $s_{g,t} \sim t(0, 0.01, 95)$; $\mu_1$ (10, 0.1, 10, 0.1, 10, 0.1)' | | |


    We obtained model III by using the multivariate Bayesian state space model and changing the prior distribution from a univariate normal distribution to a multivariate Gaussian distribution, such as CB D_t $\sim \mathrm{MVN}(\mu, W)$, while keeping the hyper-prior distributions similar to model I, such as the inverse Wishart$(R, r)$ prior distribution for $W$. The advantage of this model is that it captures gene-gene correlation by using a full covariance matrix instead of a diagonal, uncorrelated matrix (as in model I). The DIC value for this model was 463.9, and the deviance was 463.8. Since the number of parameters estimated in this model was 294, these deviance and DIC values were considered small, which indicated good model fit. The box plots of the estimated gene expressions (Fig. 3) show that this model avoids the over-dispersion problem, although the estimated standard errors are larger than in the previous model.
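Since the model comparison here rests on the deviance and the DIC, it may help to recall how the DIC is assembled from MCMC output. Following Spiegelhalter et al. (2002), the DIC is the posterior mean deviance plus an effective number of parameters pD, where pD is the posterior mean deviance minus the deviance evaluated at the posterior mean of the parameters. The sketch below applies this to a toy normal-mean model with known variance and a flat prior, which is an assumption made purely for illustration; the function names are hypothetical and the data are simulated, not taken from the paper.

```python
import numpy as np

def normal_deviance(y, theta, sigma=1.0):
    """Deviance -2 * log-likelihood for y_i ~ N(theta, sigma^2)."""
    resid = y - theta
    return float(np.sum(resid**2) / sigma**2 + y.size * np.log(2 * np.pi * sigma**2))

def dic(deviance_samples, deviance_at_posterior_mean):
    """Return posterior mean deviance, effective parameter count pD, and DIC."""
    d_bar = float(np.mean(deviance_samples))
    p_d = d_bar - deviance_at_posterior_mean
    return {"Dbar": d_bar, "pD": p_d, "DIC": d_bar + p_d}

rng = np.random.default_rng(0)

# Simulated data: 16 observations (toy example, not the liver data).
y = rng.normal(1.0, 1.0, size=16)

# With a flat prior and known sigma = 1, the posterior of theta is N(mean(y), 1/n);
# we draw from it directly instead of running a sampler.
theta_samples = rng.normal(y.mean(), 1.0 / np.sqrt(y.size), size=5000)

deviance_samples = np.array([normal_deviance(y, th) for th in theta_samples])
result = dic(deviance_samples, normal_deviance(y, theta_samples.mean()))
print(result)
```

In this toy model pD comes out close to 1, the number of free parameters. In hierarchical models pD can be far smaller than the nominal parameter count, and it can even be negative, in which case the DIC falls below the mean deviance.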

    In model IV we further varied the prior and hyper-prior distributions of the state variable of the mean from normal to Student t-distribution: $X_t \sim t(0, 0.001, 95)$. Here, there are three parameters for the Student t-distribution: mean, precision (or variance) and the degrees of freedom parameter $k$. The parameter $k$ determines the extent of over-dispersion: smaller values of $k$ allow for more marked departures from normality in the tails, so a density expected to have outliers might be described by a t density with degrees of freedom under 10 (Congdon, 2002). We chose a large $k$ so that the density is close to normal while still allowing us to see the difference. With this model we achieved a markedly smaller DIC value of 195.1, which may be evidence of the robustness of the t-distribution for gene expression data and suggests an enhanced prior model compared to the other three models. The improvement in the DIC supports this contention. Table 1 displays various prior and hyper-prior settings for some of our experiments using Bayesian state space models, the sensitivity analysis from various initials and the corresponding DIC values.

    Figure 4 Comparison of the observed gene expression profiles (dotted line) with the estimated (inferred) gene expression profiles (solid line) using state space model for six selected genes from liver data.
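The effect of the degrees of freedom $k$ on the tails can be checked numerically. The sketch below compares upper-tail probabilities of a standardized Student t density for a few values of $k$ with those of a standard normal. The threshold of 3 and the chosen values of $k$ are illustrative assumptions. The helper that converts a precision to a scale assumes the second parameter of $t(0, 0.001, 95)$ is a precision, as in BUGS-style notation; the text leaves this open ("precision (or variance)").

```python
from scipy.stats import norm, t

threshold = 3.0  # tail cut-off in standardized units (illustrative choice)

# Upper-tail probability under the standard normal.
tail_normal = norm.sf(threshold)

# Upper-tail probabilities under Student t with small and large k.
tail_t = {k: t.sf(threshold, df=k) for k in (5, 10, 95)}

for k, p in tail_t.items():
    print(f"k = {k:>2}: P(T > {threshold}) = {p:.5f}, ratio to normal = {p / tail_normal:.1f}")

def precision_to_scale(tau):
    """Convert a precision tau to the scale 1/sqrt(tau) expected by scipy.stats.t."""
    return tau ** -0.5

# Example: a t prior with precision 0.01 corresponds to a scale of about 10.
prior = t(df=95, loc=0.0, scale=precision_to_scale(0.01))
```

With $k = 95$ the tail probability is only slightly above the normal one, which matches the stated aim of staying close to normality while retaining somewhat heavier tails.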

    Figures 1, 2 and 3 show box plots of the posterior means and 95% credible intervals of the given models for each of the six gene expression profiles, estimated by the Bayesian state space model. Figure 4 compares the observed gene expression profiles (dotted line) with the estimated (inferred) gene expression profiles (solid line) using the state space model for six selected genes. In each case, the state space model captures the dynamics (the observed dotted line is very close to the estimated solid line), although there are small deviations. Predictions of the gene expressions using model IV at the next 5 time points after 72 hours are also shown in the figure. The estimated gene expression profiles are similar to the observed ones, although there are some differences for genes AF053312_s_at and L32591mRNA_g_at. These two genes have larger-scale expressions in the observed data than in the estimated data. From these figures we can see that all genes except L32591mRNA_g_at are significantly expressed at at least one time point. Only gene L33869_at was predicted to have significant down-regulation at the next time point after 72 hrs. We also compared DICs with and without prediction: the prediction model gave a DIC value of 464.8 (without prediction the DIC was 463.9).
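The posterior means and 95% credible intervals summarized in the box plots can be obtained from MCMC draws as sketched below. The draws here are simulated stand-ins, not output from the models in this paper, and the function name is hypothetical. Flagging a time point when its interval excludes zero is one possible reading of "significantly expressed"; the text does not state the criterion, so this rule is an assumption.

```python
import numpy as np

def summarize_draws(draws, level=0.95):
    """Posterior mean and equal-tailed credible interval per column of `draws`.

    `draws` has shape (n_samples, n_time_points).
    """
    draws = np.asarray(draws, dtype=float)
    alpha = 1.0 - level
    lower, upper = np.quantile(draws, [alpha / 2, 1 - alpha / 2], axis=0)
    return {
        "mean": draws.mean(axis=0),
        "lower": lower,
        "upper": upper,
        # Assumed rule: flag a time point when the interval excludes zero.
        "excludes_zero": (lower > 0) | (upper < 0),
    }

rng = np.random.default_rng(1)

# Simulated posterior draws for one gene at three time points.
true_means = np.array([0.0, 2.0, -1.5])
draws = rng.normal(true_means, 0.3, size=(4000, 3))

summary = summarize_draws(draws)
for name, value in summary.items():
    print(name, value)
```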

    5 Discussion

    We demonstrated the potential significance of our proposed Bayesian state space models for making inferences and predictions on genomic dynamics and applied them to multiple tissue data: i) in the simple case, our proposed Bayesian state space model can provide good estimates and predictions for temporal gene expression profiles through its model specification (e.g. estimation of the scale matrix) in the multivariate setting; ii) this model can handle and infer the hidden and un-measurable variables that affect observed gene expressions; iii) this model can analyze time course gene expression data not measured at fixed time intervals (discrete, unevenly spaced time course data), as is the case in most genomic and proteomic data.
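One common way to realise point iii) is to place the series on a regular grid and treat unobserved grid points as hidden states. The sketch below does this with a scalar local-level model and a Kalman filter. It is a deliberately simpler stand-in for the MCMC-based models of this paper, not a reproduction of them; the function name, the noise variances `q` and `r`, the initial values `m0` and `p0`, and the toy data are all illustrative assumptions.

```python
import numpy as np

def local_level_filter(y, q=0.1, r=0.5, m0=0.0, p0=10.0):
    """Kalman filter for x_t = x_{t-1} + w_t, y_t = x_t + v_t.

    w_t ~ N(0, q), v_t ~ N(0, r). NaN entries in `y` mark grid points with
    no measurement; for these only the prediction step is carried out.
    Returns the filtered state means and variances.
    """
    means, variances = [], []
    m, p = m0, p0
    for obs in y:
        # Prediction step: the state mean is unchanged, uncertainty grows by q.
        p = p + q
        if not np.isnan(obs):
            # Update step, only when a measurement is available.
            gain = p / (p + r)
            m = m + gain * (obs - m)
            p = (1.0 - gain) * p
        means.append(m)
        variances.append(p)
    return np.array(means), np.array(variances)

# Toy series on a regular grid; NaN marks time points that were not sampled.
y = np.array([1.0, np.nan, np.nan, 1.2, np.nan, 0.8])
means, variances = local_level_filter(y)
print("filtered means:    ", np.round(means, 3))
print("filtered variances:", np.round(variances, 3))
```

The filtered variance grows across each run of unobserved points and shrinks again once a measurement arrives, which is the behaviour one would expect when longer sampling gaps carry more uncertainty.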

    Our proposed models are innovative in two respects. Firstly, estimation is simplified in the fully hierarchical Bayesian setting and in other new model settings, where non-standard distributions and non-stationary and nonlinear features with short, unevenly spaced time courses can be incorporated. These settings are also more realistic for genomic data. Secondly, the models can handle un-measurable hidden variables and unknown factors that affect observed gene expressions, and can predict time course gene expression data not measured at fixed time intervals. The flexibility of modelling the expression measurements as continuous rather than discrete, and therefore with dynamic models rather than unrealistic static models, appears to be a major advantage. The model has great potential for biological and medical systems and can be applied to further the study of dynamic drug effects in various diseases. Currently we are investigating Bayesian state space models for precise estimation of the hidden structural and functional parameters of biological systems, which are well known for their noisy, uncertain and stochastic nature.

    Acknowledgements The authors would like to thank Dr. Richard Almon for providing the data. This work is

    supported by National Science Foundation grant DMS-0604639.


    References

    Almon, R. R., DuBois, D. C., Piel, W., and Jusko, W. J. (2004). The genomic response of skeletal muscle to methylprednisolone using microarrays: tailoring data mining to the structure of the pharmacogenomic time series. Pharmacogenomics 5, 525-552.

    Alter, O., Brown, P. O., and Botstein, D. (2000). Singular value decomposition for genome-wide expression data processing and modelling. Proceedings of the National Academy of Sciences 97, 10101-10106.

    Bar-Joseph, Z., Gerber, G., Gifford, D., Jaakkola, T., and Simon, I. (2003). Continuous Representations of Time Series Gene Expression Data. Journal of Computational Biology 10, 241-256.

    Beal, M. J., Falciani, F. L., Ghahramani, Z., Rangel, C., and Wild, D. (2005). A Bayesian Approach to Reconstructing Genetic Regulatory Networks with Hidden Factors. Bioinformatics 21, 349-356.

    Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., and Brown, P. O. (1998). The Transcriptional Program of Sporulation in Budding Yeast. Science 282, 699-705.

    Congdon, P. (2002). Bayesian Statistical Modelling. John Wiley & Sons, Ltd. New York.

    Congdon, P. (2003). Applied Bayesian Modelling. John Wiley & Sons, Ltd. New York.

    Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39, 1-38.

    Durbin, J. and Koopman, S. J. (2000). Time series analysis for non-Gaussian observations based on state space models from both classical and Bayesian perspectives (with discussion). Journal of the Royal Statistical Society, Series B 62, 3-56.

    Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences 95, 14863-14868.

    Ernst, J., Nau, G. J., and Bar-Joseph, Z. (2005). Clustering short time series gene expression data. Bioinformatics 21, Suppl 1: i159-i168.

    Harvey, A. C. (1989). Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, London.

    Holter, N., Maritan, A., Cieplak, M., Fedoroff, N., and Banavar, J. (2001). Dynamic modeling of gene expression data. Proceedings of the National Academy of Sciences 98, 1693-1698.

    Jacob, F. and Monod, J. (1961). Genetic regulatory mechanisms in the synthesis of the proteins. Journal of Molecular Biology 3, 318-356.

    Jin, J. Y., Almon, R. R., Dubois, D. C., and Jusko, W. J. (2003). Modeling of corticosteroid pharmacogenomics in rat liver using gene microarrays. Journal of Pharmacology and Experimental Therapeutics 307, 93-109.

    Liang, Y. and Kelemen, A. (2004). Hierarchical Bayesian Neural Network for Gene Expression Temporal Patterns. Journal of Statistical Applications in Genetics and Molecular Biology 3, Article 20.

    Liang, Y., Tayo, B., Cai, X., and Kelemen, A. (2005). Differential and Trajectory Methods for Time Course Gene Expression Data. Bioinformatics 20, 3009-3016.

    Liang, Y. and Kelemen, A. (2006). Associating phenotypes with molecular events: a review of statistical advances and challenges underpinning microarray analyses. Journal of Functional and Integrative Genomics 6, 1-13.

    Luan, Y. and Li, H. (2003). Clustering of time-course gene expression data using a mixed-effects model with B-splines. Bioinformatics 19, 474-482.

    Perrin, B. E., Ralaivola, L., Mazurie, A., Bottani, S., Mallet, J., and d'Alché-Buc, F. (2003). Gene networks inference using dynamic Bayesian networks. Bioinformatics 19, Suppl 2: II138-II148.

    Ramoni, M. F., Sebastiani, P., and Kohane, I. (2002). Cluster analysis of gene expression dynamics. Proceedings of the National Academy of Sciences 99, 9121-9126.

    Rangel, C., Angus, J., Ghahramani, Z., Lioumi, M., Sotheran, E. A., Gaiba, A., Wild, D. L., and Falciani, F. (2004). Modeling T-cell activation using gene expression profiling and state space models. Bioinformatics 20, 1361-1372.

    Roweis, S. and Ghahramani, Z. (1999). A Unifying Review of Linear Gaussian Models. Neural Computation 11, 305-345.

    Spiegelhalter, D., Best, N., Carlin, B., and Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of Royal Statistical Society, Series B 64, 583-639.

    West, M. (2003). Bayesian factor regression models in the Large p, Small n paradigm. Bayesian Statistics 7, 723-732.

    West, M. and Harrison, J. (1999). Bayesian Forecasting and Dynamic models, 2nd edition. New York, Springer.

    Zellner, A. (1996). An introduction to Bayesian inference in econometrics. John Wiley & Sons, Ltd. New York.
