View
217
Download
2
Tags:
Embed Size (px)
Citation preview
10. Lecture SS 20005
Cell Simulations 1
V10: Proteome during the Yeast Cell Cycle
Bioinformatics, 21, 1164 (2005)
Science 307, 724 (2005)
10. Lecture SS 20005
Cell Simulations 2
Idea
Well known: certain genes are expressed only at specific stages of the cell
cycle, e.g. the cyclins and the histone.
these genes exhibit a periodic pattern of expression when monitored
during consecutive cell cycles.
Many microarry experiments have characterized the expression levels.
Post-analysis by computational methods yeast contains ca. 300 – 800
genes that are periodically expressed.
Problem: results correlate very poorly, particularly when different
computational methods are applied to the same data.
Altogether, 1800 genes have been proposed to be expressed periodically,
which is almost one third of the S. cerevisae genome.
Bioinformatics, 21, 1164 (2005)
10. Lecture SS 20005
Cell Simulations 3
Comparison of methods
Benchmark different methods on three data sets:
B1 113 genes identified as periodically expressed in small-scale experiments
B2 Genes whose promoters are bound (p-value < 0.01) by at least one of
nine known cell cycle transcription factors in two IP studies.
352 genes of which many should be cell cycle regulated
B3 genes annotated in MIPS as „cell cycle and DNA processing“
Understand strengths and weaknesses of different methods.
Bioinformatics, 21, 1164 (2005)
10. Lecture SS 20005
Cell Simulations 4
Methods of analysis: statistical test for regulation
The standard deviation can be easily calculated for each logratio profile,
giving a measure of the spread of the samples around the mean.
Heavily regulated genes will thus have large standard deviations, whereas
genes without significant regulation display little deviation from the mean.
To test for the significance of regulation, we therefore compare the observed
standard deviation for each expression profile to a randomly generated
background distribution.
1,000,000 random profiles were constructed by selecting at each time point
the log-ratio from a randomly chosen gene. A p-value for regulation was
calculated as the fraction of the simulated profiles with standard deviations
equal to or larger than that observed for the real expression profile.
Bioinformatics, 21, 1164 (2005)
10. Lecture SS 20005
Cell Simulations 5
where and T is the interdivision time.
Similarly, scores were calculated for 1,000,000 artificial profiles constructed by
random shuffling of the data points within the expression profile of the gene in
question. The p-value for periodicity was calculated as the fraction of artificial
profiles with Fourier scores equal to or larger than that observed for the real
expression profile.
The p-value for regulation is thus a comparison between individual genes and the
global distribution, whereas the pvalue for periodicity is a comparison involving
only data from the gene in question.
Methods of analysis: statistical test for periodicity
To estimate a p-value for periodicity, we compared the Fourier score of the
observed gene expression profile for each gene to those of random
permutations of the same gene.
For each gene, i, a Fourier score, Fi, was computed as
22
cossin
ti
tii txttxtF
T
2
10. Lecture SS 20005
Cell Simulations 6
Combined test for regulation and periodicity
For each gene, a combined p-value of regulation was calculated by
multiplying the separate p-values of regulation from each of the three
experiments.
Analogously, a combined p-value of periodicity was calculated.
Subsequently, the p-value of regulation and p-value of periodicity were
multiplied to obtain the total p-value. An undesirable feature of the total p-
value is that it may become very low (i.e. highly significant) due to only one of
the tests. Genes that are strongly regulated but not periodic (or vice versa)
will thus receive good scores.
To address this, we multiply the total p-value with two penalty terms that
weight down genes that are either not significantly regulated or not
significantly periodic. The final score used for ranking is:
22
001.01
001.01 yperiodicitregulation
total
ppp
Bioinformatics, 21, 1164 (2005)
10. Lecture SS 20005
Cell Simulations 7
Assigning the time of peak expression
Since we approximate each expression profile by a sine wave, the time of
peak expression for a gene in a single experiment is trivially defined as the
time where the sine wave is maximal. We refer to this as the peak time.
Due to differences in experimental conditions, the time it takes the cell to
complete a cycle (the interdivision time) varies greatly between the alpha,
Cdc15 and Cdc28 experiments. In order to compare the timing of peak
expression across experiments, we therefore transformed the time-scales
from minutes to percent of the cell cycle by dividing with the interdivision
times estimated by Zhao et al. (2001).
Combining peak times from different experiments into one is a non-trivial
task, since the assignment should not be trusted in those experiments where
the expression profile is not sufficiently periodic. To compensate for this, we
weighted the individual peak times when computing the global, combined
peak time.Bioinformatics, 21, 1164 (2005)
10. Lecture SS 20005
Cell Simulations 8
Methods (1)
M1 Cho et al. (1998) visually inspected the expression profiles of all genes
regulated more than two-fold during the Cdc28 experiment, classifying 421 of
them as “periodic”.
M2 Spellman et al. (1998) computed a score for every gene that was based
partly on the correlation to one of five idealized gene profiles, and partly on a
Fourier-like score yielding the signal strength at a period similar to the
interdivision time. The genes were ranked based on their combined score
and truncated to 800 genes as this corresponds to a sensitivity of 90% on
B1.
M3 Johansson et al. (2003) used partial least squares regression to analyze
the three data sets individually and in combination. The approach was based
on fitting of sine curves and thus also yields an estimate of the time of
peak expression. Thresholds were estimated based on random permutations
of the data.Bioinformatics, 21, 1164 (2005)
10. Lecture SS 20005
Cell Simulations 9
Methods (2)
M4 Zhao et al. (2001) reanalyzed each of the three data sets individually
using a statistical single–pulse model. The resulting score, describing how
well a profile fits the model, is independent of the magnitude of regulation.
An appealing feature of the model is that it also estimates the time of
activation and deactivation of the gene.
M5 Luan & Li (2004) used a modeling approach based on cubic splines,
rather than sine waves. The sets were analyzed individually and the
statistical nature of the method enabled the authors to identify thresholds that
satisfied a given false discovery rate.
M6 Lu et al. (2004) used Bayesian modeling techniques to estimate a
periodic–normal mixture model based on five microarray time courses. The
authors provide a ranked list of 822 genes based on all five data sets.
Bioinformatics, 21, 1164 (2005)
10. Lecture SS 20005
Cell Simulations 10
Combination method
In addition to the methods M1–M6, we also include a new and simple
statistical approach (M7) based on permutations.
Its score is composed of two terms: one quantifies only the magnitude of
regulation, whereas the other measures the periodicity of the expression
profile.
The combined score ensures that high ranking genes are both significantly
regulated and periodic.
Bioinformatics, 21, 1164 (2005)
10. Lecture SS 20005
Cell Simulations 11
Comparison of published methods
Bioinformatics, 21, 1164 (2005)
10. Lecture SS 20005
Cell Simulations 12
Regulation vs. periodicity
Bioinformatics, 21, 1164 (2005)
10. Lecture SS 20005
Cell Simulations 13
An alternative strategy for evaluating the performance ofmethods for extracting periodically expressed genes is to examine the overlap in genes identified in different experiments. For each of the methods that provide a ranked list of genes for each individual experiment, top 300 lists were extracted and their overlap visualized as Venn diagrams (see Figure 3). The results are consistent with those obtained from the benchmark sets. The results of M4, a representative of the magnitude-independent methods, shows by far the leastagreement between the three experiments. In comparison, M3 and M7 identify almost twice as many genes that agree in all three experiments. For all computational methods, the alpha factor experiment shows the best overlap with the two other experiments, while the Cdc28 experiment overlaps the least. This supports our previous observations with regard tothe quality of the individual experiments. The lower rightVenn diagram in Figure 3 illustrates that a renormalization of the Cdc28 data improves the agreement with the two other experimental data sets, in addition to improving performance on the benchmark sets (Figure 2). When including the renormalizedCdc28 data in our analysis, two thirds of the genesdetected from the alpha factor synchronization are confirmed by at least one other experiment.
Agreement across experiments
Bioinformatics, 21, 1164 (2005)
10. Lecture SS 20005
Cell Simulations 14
Consistency of peak expression
So far, we have addressed the issues of finding periodicallyexpressed genes and asked if the same genes are identifiedin different experiments. It is, however, equally important tocheck that those genes behave similarly across experiments,i.e. that their expression profiles peak at the same stage in thecell cycle. As a consequence of differences in experimentalprotocols, the interdivision time varies between experiments.Furthermore, the synchronization methods release cells atdifferent points in the cycle. To enable cross-experimentcomparisons, we represent time in percent of the cell cycle(0% being the time of cell division) and align the time axesof the experiments relative to each other. For each gene, we can thus assign a time of peak expression ineach experiment and compare these across experiments. Toexamine the degree of consistency between these peak timesin different experiments, we computed the largest peak timedifference for each gene. Figure 4 shows the distribution ofthese differences for the set of 89 genes that were identified asperiodic in all three experiments (see Figure 3). For 87 of the89 genes, all three peak times fall within an interval of 15% ofa cell cycle, clearly demonstrating the reproducibility of peakexpression by different synchronization methods. For genesthat appear periodic in all three experiments, an average peaktime could easily be computed.
Bioinformatics, 21, 1164 (2005)
10. Lecture SS 20005
Cell Simulations 15
Expression of phase-specific genes
As a final check of the reproducibility, we investigated
the timing of peak expression for four sets of known
phase specific genes across the three experiments. B1
was subdivided by Spellman et al. (1998) into phase-
specific groups according to their reported peak
expression in the small scale experiments.
The distribution of peak times within each group
is visualized in Figure 5, along with the distribution of
our combined peak times. From this, it is clear that the
phases occur in the same order, with the same length,
and at the same time in all three experiments. It thus
appears that the synchronization methods cause no
abnormalities of the cell-cycle transcriptional program.
Together, Figures 4 and 5 show that the combined
peak time is a meaningful measure that accurately
describes when in the cell cycle a gene is expressed.
Bioinformatics, 21, 1164 (2005)
10. Lecture SS 20005
Cell Simulations 16
Conclusions
Most surprisingly, the benchmark analysis reveals that most of the new and more mathematically advanced methods for identifying periodically expressed genes perform considerably worse than the early method by Spellman et al. (1998)(M2). These results should encourage developers of future computational methods to evaluate the performance of their methods carefully. We show that the performance gap is due to the magnitude-independence of most newer methods. This independence may improve their ability to discover novel weakly regulated cell cycle genes, however, as the magnitude of regulation contains a large part of the signal, this should be exploited in order to derive the most accurate set of cell cycle regulated genes.Using the periodicity score alone yields results similar to those of other magnitude-independent methods, while our combined score performs as well as the best existing methods. In comparison to other methods, ours have two main advantages. First, genes ranked high on our list are guaranteed to be both significantly regulated and significantly periodic. Second, we require consistency in peak time across experiments, which allows us to assign a time of peakexpression to each gene.
Bioinformatics, 21, 1164 (2005)
10. Lecture SS 20005
Cell Simulations 17
Analysis of complexome during cell cycle
Most research on biological networks has been focused on static topological
properties, describing networks as collections of nodes and edges rather than as
dynamic structural entities.
Here this study focusses on the temporal aspects of networks, which allows us to
study the dynamics of protein complex assembly during the Saccharomyces
cerevisiae cell cycle.
The integrative approach combines protein-protein interactions with information on
the timing of the transcription of specific genes during the cell cycle, obtained from
DNA microarray time series shown before.
a quality-controlled set of 600 periodically expressed genes, each assigned to
the point in the cell cycle where its expression peaks.
Science 307, 724 (2005)
10. Lecture SS 20005
Cell Simulations 18
Data preparation
Construct physical interaction network for the corresponding proteins from Y2H
screens, TAP pull-downs, and curated complexes from the MIPS database.
To reduce the error rate of 30 to 50% expected in most current large-scale
interaction screens, all physical interaction data were combined, a topology-based
confidence score was assigned to each individual interactions as in the STRING
database, and only high-confidence interactions were selected.
These were further filtered with information on subcellular localization to exclude
interactions between proteins annotated to incompatible compartments; no
curated MIPS interactions were lost because of this filtering.
The topology-based scoring scheme, filtering, and extraction criteria reduced the
error rate for interactions by an order of magnitude to only 3 to 5%.
Science 307, 724 (2005)
10. Lecture SS 20005
Cell Simulations 19
E.g. compatible subcellular localizations
Science 307, 724 (2005)
10. Lecture SS 20005
Cell Simulations 20
Construct of temporal network
Include in the extracted network (Fig. 1), in addition to the periodically expressed
(„dynamic“) proteins, constitutively expressed („static“) proteins that preferentially
interact with dynamic ones.
The resulting network consists of 300 proteins (Fig. 1, inside circle), including 184
dynamic proteins (colored according to their time of peak expression) and 116
static proteins (depicted in white).
For 412 of the 600 dynamic proteins identified in the microarray analysis, no
physical interactions of sufficient reliability could be found (Fig. 1, outside circle).
Some may be missed subunits of stable complexes already in the network; the
majority, however, probably participate in transient interactions, which are often
not detected by current interaction assays.
Science 307, 724 (2005)
10. Lecture SS 20005
Cell Simulations 21
Temporal protein interaction network in yeast cell cycle
Cell cycle proteins that are part
of complexes or other physical
interactions are shown within
the circle.
For the dynamic proteins, the
time of peak expression is
shown by the node color;
static proteins are represented
as white nodes.
Outside the circle, the dynamic
proteins without interactions
are positioned and colored
according to their peak time.
Science 307, 724 (2005)
10. Lecture SS 20005
Cell Simulations 22
Agreement with know complexes; error estimates
de Lichtenberg et al. „cell cycle" was constructed by applying the entire scheme to „all". The MIPS complexes were not included for this analysis.
de Lichtenberg et al. „all" contains the union of the Y2H screens by Uetz et al. and Ito et al. and the matrix representations of the two complex pull-down experiments by Gavin et al. and Ho et al. This interaction set is the raw data for which the error rate has been estimated to be 30-50%.
de Lichtenberg et al. „45%" and „85%“consists of the subset of „all" to which we assign a topology-based quality score of at least 0.45 or 0.85, corresponding to the cutoff for including an interaction between two cell cycle proteins in the nal network (Figure 1). Interactions were not fitered based on subcellular localization data.
Science 307, 724 (2005)
10. Lecture SS 20005
Cell Simulations 23
Interpretation
Virtually all complexes contain both dynamic and static subunits (Fig. 1),
the latter accounting for about half of the direct interaction partners of periodically
regulated proteins through all phases of the cell cycle (Fig. 2).
Transcriptional regulation thus influences almost all cell cycle complexes and
thereby, indirectly, their static subunits.
This implies that many cell cycle proteins cannot be identified through the analysis
of any single type of experimental data but only through integrative analysis of
several data types.
In addition to suggesting functions for individual proteins, the network (Fig. 1)
indicates the existence of entire previously unknown modules.
Science 307, 724 (2005)
10. Lecture SS 20005
Cell Simulations 24
Just-in-time synthesis vs. just-in-time-assembly
Transcription of cell cycle–regulated genes is generally thought to be turned on
when or just before their protein products are needed: often referred to as
just-in-time synthesis.
Contrary to the cell cycle in bacteria, however, just-in-time synthesis of entire
complexes is rarely observed in the network. The only large complex to be
synthesized in its entirety just in time is the nucleosome, all subunits of which are
expressed in S phase to produce nucleosomes during DNA replication.
Instead, the general design principle appears to be that only some subunits of
each complex are transcriptionally regulated in order to control the timing of
final assembly.
Science 307, 724 (2005)
10. Lecture SS 20005
Cell Simulations 25
Interpretation
Several examples of this just-in-time assembly (rather than just-in-time
synthesis) are suggested by the network, including the prereplication complex
(Fig. 3B), complexes involved in DNA replication and repair, the spindle pole body,
proteins related to the cytoskeleton, and numerous smaller complexes or modules.
the transcriptome time mappings visualized in Fig. 1 are in close agreement
with previous studies on the dynamic formation of individual protein complexes,
suggesting that the timing of transcription of dynamic proteins is indicative of the
timing of assembly and action of the complexes and modules (Fig. 3).
Science 307, 724 (2005)
10. Lecture SS 20005
Cell Simulations 26
Interpretation
Just-in-time assembly would have an advantage over just-in-time synthesis of
entire complexes in that only a few components need to be tightly regulated in
order to control the timing of final complex assembly.
This would explain the recent observation that the periodic transcription of specific
cell cycle genes is poorly conserved through evolution.
For the prereplication complex, exactly this variation between organisms has been
shown, although the subunits and the order in which they assemble are
conserved.
Science 307, 724 (2005)
10. Lecture SS 20005
Cell Simulations 27
Interactions during the cell cycle
The number of interactions of dynamic
proteins with other dynamic proteins
(dynamic – dynamic) and with static
proteins (dynamic – static) is shown as
a function of cell cycle progression.
Zero time corresponds to the time of
cell division. The large number of
interactions during G1 phase reflects a
general overrepresentation of genes
expressed in this part of the cell cycle.
Science 307, 724 (2005)
10. Lecture SS 20005
Cell Simulations 28
Dynamic modules
Each panel shows a specific module
from the network in Fig. 1 using the
corresponding colors of individual
proteins.
Science 307, 724 (2005)
10. Lecture SS 20005
Cell Simulations 29
Dynamic modules
Previously unknown module connecting
processes related to chromosome structure
with mitotic events in the bud. The nucleosome
assembly protein Nap1p is know to shuttle
between the nucleus and cytosol and regulates
the activity of Gin4p, one of two G1-expressd
budding-related kinases in the module. Nap1p
and the histone variant Htz1p connect the
module to the nucleosome and sister
chromatid modules, respectively. Two poorly
characterized proteins, Nis1p and the putative
Cdc28p substrate Yol070p are both expressed
in mitosis and localize to the bud neck.
Science 307, 724 (2005)
10. Lecture SS 20005
Cell Simulations 30
Dynamic modules
Schematic representation of the dynamic
assembly of the prereplication complex. It
contains six static proteins (Orc1p to Orc6p)
that are bound to origins of replication
throughout the entire division cycle.
A subcomplex of six Mcm proteins is recruited
to the ORC complex in G1 phase by a Cdc6p-
dependent mechanism.
We see the corresponding genes (except
MCM6) transcribed just before that. Final
recruitment of the replication machinery is
dependent on Cdc45p, which is found
expressed in early S phase.
Science 307, 724 (2005)
10. Lecture SS 20005
Cell Simulations 31
Dynamic modules
Cdc28p module, with the different cyclins and
interactors placed at their time of synthesis. At
the end of mitosis, the cyclins are ubiquitinated
and targeted for destruction by APC and SCF,
reflected in the network by the interaction
between Cdh1p and Clb2p. The latter also
interacts with Swe1p, which inhibits entry into
mitosis by phosphorylating Cdc28p in complex
with Clb-type cyclins.
In this well-studied module, an uncharacterized
protein, Ypl014p, was discovered as well.
Science 307, 724 (2005)
10. Lecture SS 20005
Cell Simulations 32
Timing of expression for interaction partners
The distribution of time differences between
dynamic interaction partners reveals a strong
preference for interactions between proteins
expressed close in time.
Science 307, 724 (2005)
10. Lecture SS 20005
Cell Simulations 33
Timing of expression for interaction partners
For each hub protein, the average difference in
peak time was calculated among the dynamic
interaction partners.
The distribution is bimodal, supporting the
proposition of two classes of hubs, namely
„party“ and „date“ hubs.
Representative hub proteins are marked in the
figure to illustrate the relation between the two
types of hubs and modules.
Science 307, 724 (2005)
10. Lecture SS 20005
Cell Simulations 34
Outlook
With the emergence of new large-scale data sets, including assays focused on
protein-DNA interactions and transient protein-protein interactions, we expect
more of the dynamic proteins to be included in the network.
Also, we currently lack information about the life- time of the observed
complexes and modules.
With reliable time series of protein abundances, preferably in individual
compartments, the resolution of this temporal network can be increased
considerably, because even individual interactions over time could then
be monitored.
Moreover, the integrative approach presented here should be applicable
to any biological system for which both interaction data and time series are
available.
Science 307, 724 (2005)