10. Lecture SS 20005Cell Simulations1 V10: Proteome during the Yeast Cell Cycle Bioinformatics, 21, 1164 (2005) Science 307, 724 (2005)

10. Lecture SS 20005

Cell Simulations 1

V10: Proteome during the Yeast Cell Cycle

Bioinformatics, 21, 1164 (2005)

Science 307, 724 (2005)


Cell Simulations 2

Idea

Well known: certain genes are expressed only at specific stages of the cell

cycle, e.g. the cyclins and the histone.

these genes exhibit a periodic pattern of expression when monitored

during consecutive cell cycles.

Many microarry experiments have characterized the expression levels.

Post-analysis by computational methods yeast contains ca. 300 – 800

genes that are periodically expressed.

Problem: results correlate very poorly, particularly when different

computational methods are applied to the same data.

Altogether, 1800 genes have been proposed to be expressed periodically,

which is almost one third of the S. cerevisae genome.



Cell Simulations 3

Comparison of methods

Benchmark different methods on three data sets:

B1 113 genes identified as periodically expressed in small-scale experiments

B2 Genes whose promoters are bound (p-value < 0.01) by at least one of

nine known cell cycle transcription factors in two IP studies.

352 genes of which many should be cell cycle regulated

B3 genes annotated in MIPS as „cell cycle and DNA processing“

Understand strengths and weaknesses of different methods.



Cell Simulations 4

Methods of analysis: statistical test for regulation

The standard deviation can be easily calculated for each logratio profile,

giving a measure of the spread of the samples around the mean.

Heavily regulated genes will thus have large standard deviations, whereas

genes without significant regulation display little deviation from the mean.

To test for the significance of regulation, we therefore compare the observed

standard deviation for each expression profile to a randomly generated

background distribution.

1,000,000 random profiles were constructed by selecting at each time point

the log-ratio from a randomly chosen gene. A p-value for regulation was

calculated as the fraction of the simulated profiles with standard deviations

equal to or larger than that observed for the real expression profile.



Cell Simulations 5

where and T is the interdivision time.

Similarly, scores were calculated for 1,000,000 artificial profiles constructed by

random shuffling of the data points within the expression profile of the gene in

question. The p-value for periodicity was calculated as the fraction of artificial

profiles with Fourier scores equal to or larger than that observed for the real

expression profile.

The p-value for regulation is thus a comparison between individual genes and the

global distribution, whereas the pvalue for periodicity is a comparison involving

only data from the gene in question.

Methods of analysis: statistical test for periodicity

To estimate a p-value for periodicity, we compared the Fourier score of the

observed gene expression profile for each gene to those of random

permutations of the same gene.

For each gene, i, a Fourier score, Fi, was computed as

22

cossin

ti

tii txttxtF

T

2


Cell Simulations 6

Combined test for regulation and periodicity

For each gene, a combined p-value of regulation was calculated by

multiplying the separate p-values of regulation from each of the three

experiments.

Analogously, a combined p-value of periodicity was calculated.

Subsequently, the p-value of regulation and p-value of periodicity were

multiplied to obtain the total p-value. An undesirable feature of the total p-

value is that it may become very low (i.e. highly significant) due to only one of

the tests. Genes that are strongly regulated but not periodic (or vice versa)

will thus receive good scores.

To address this, we multiply the total p-value with two penalty terms that

weight down genes that are either not significantly regulated or not

significantly periodic. The final score used for ranking is:

22

001.01

001.01 yperiodicitregulation

total

ppp



Cell Simulations 7

Assigning the time of peak expression

Since we approximate each expression profile by a sine wave, the time of

peak expression for a gene in a single experiment is trivially defined as the

time where the sine wave is maximal. We refer to this as the peak time.

Due to differences in experimental conditions, the time it takes the cell to

complete a cycle (the interdivision time) varies greatly between the alpha,

Cdc15 and Cdc28 experiments. In order to compare the timing of peak

expression across experiments, we therefore transformed the time-scales

from minutes to percent of the cell cycle by dividing with the interdivision

times estimated by Zhao et al. (2001).

Combining peak times from different experiments into one is a non-trivial

task, since the assignment should not be trusted in those experiments where

the expression profile is not sufficiently periodic. To compensate for this, we

weighted the individual peak times when computing the global, combined

peak time.Bioinformatics, 21, 1164 (2005)


Cell Simulations 8

Methods (1)

M1 Cho et al. (1998) visually inspected the expression profiles of all genes

regulated more than two-fold during the Cdc28 experiment, classifying 421 of

them as “periodic”.

M2 Spellman et al. (1998) computed a score for every gene that was based

partly on the correlation to one of five idealized gene profiles, and partly on a

Fourier-like score yielding the signal strength at a period similar to the

interdivision time. The genes were ranked based on their combined score

and truncated to 800 genes as this corresponds to a sensitivity of 90% on

B1.

M3 Johansson et al. (2003) used partial least squares regression to analyze

the three data sets individually and in combination. The approach was based

on fitting of sine curves and thus also yields an estimate of the time of

peak expression. Thresholds were estimated based on random permutations

of the data.Bioinformatics, 21, 1164 (2005)


Cell Simulations 9

Methods (2)

M4 Zhao et al. (2001) reanalyzed each of the three data sets individually

using a statistical single–pulse model. The resulting score, describing how

well a profile fits the model, is independent of the magnitude of regulation.

An appealing feature of the model is that it also estimates the time of

activation and deactivation of the gene.

M5 Luan & Li (2004) used a modeling approach based on cubic splines,

rather than sine waves. The sets were analyzed individually and the

statistical nature of the method enabled the authors to identify thresholds that

satisfied a given false discovery rate.

M6 Lu et al. (2004) used Bayesian modeling techniques to estimate a

periodic–normal mixture model based on five microarray time courses. The

authors provide a ranked list of 822 genes based on all five data sets.



Cell Simulations 10

Combination method

In addition to the methods M1–M6, we also include a new and simple

statistical approach (M7) based on permutations.

Its score is composed of two terms: one quantifies only the magnitude of

regulation, whereas the other measures the periodicity of the expression

profile.

The combined score ensures that high ranking genes are both significantly

regulated and periodic.



Cell Simulations 11

Comparison of published methods



Cell Simulations 12

Regulation vs. periodicity



Cell Simulations 13

An alternative strategy for evaluating the performance ofmethods for extracting periodically expressed genes is to examine the overlap in genes identified in different experiments. For each of the methods that provide a ranked list of genes for each individual experiment, top 300 lists were extracted and their overlap visualized as Venn diagrams (see Figure 3). The results are consistent with those obtained from the benchmark sets. The results of M4, a representative of the magnitude-independent methods, shows by far the leastagreement between the three experiments. In comparison, M3 and M7 identify almost twice as many genes that agree in all three experiments. For all computational methods, the alpha factor experiment shows the best overlap with the two other experiments, while the Cdc28 experiment overlaps the least. This supports our previous observations with regard tothe quality of the individual experiments. The lower rightVenn diagram in Figure 3 illustrates that a renormalization of the Cdc28 data improves the agreement with the two other experimental data sets, in addition to improving performance on the benchmark sets (Figure 2). When including the renormalizedCdc28 data in our analysis, two thirds of the genesdetected from the alpha factor synchronization are confirmed by at least one other experiment.

Agreement across experiments



Cell Simulations 14

Consistency of peak expression

So far, we have addressed the issues of finding periodicallyexpressed genes and asked if the same genes are identifiedin different experiments. It is, however, equally important tocheck that those genes behave similarly across experiments,i.e. that their expression profiles peak at the same stage in thecell cycle. As a consequence of differences in experimentalprotocols, the interdivision time varies between experiments.Furthermore, the synchronization methods release cells atdifferent points in the cycle. To enable cross-experimentcomparisons, we represent time in percent of the cell cycle(0% being the time of cell division) and align the time axesof the experiments relative to each other. For each gene, we can thus assign a time of peak expression ineach experiment and compare these across experiments. Toexamine the degree of consistency between these peak timesin different experiments, we computed the largest peak timedifference for each gene. Figure 4 shows the distribution ofthese differences for the set of 89 genes that were identified asperiodic in all three experiments (see Figure 3). For 87 of the89 genes, all three peak times fall within an interval of 15% ofa cell cycle, clearly demonstrating the reproducibility of peakexpression by different synchronization methods. For genesthat appear periodic in all three experiments, an average peaktime could easily be computed.



Cell Simulations 15

Expression of phase-specific genes

As a final check of the reproducibility, we investigated

the timing of peak expression for four sets of known

phase specific genes across the three experiments. B1

was subdivided by Spellman et al. (1998) into phase-

specific groups according to their reported peak

expression in the small scale experiments.

The distribution of peak times within each group

is visualized in Figure 5, along with the distribution of

our combined peak times. From this, it is clear that the

phases occur in the same order, with the same length,

and at the same time in all three experiments. It thus

appears that the synchronization methods cause no

abnormalities of the cell-cycle transcriptional program.

Together, Figures 4 and 5 show that the combined

peak time is a meaningful measure that accurately

describes when in the cell cycle a gene is expressed.



Cell Simulations 16

Conclusions

Most surprisingly, the benchmark analysis reveals that most of the new and more mathematically advanced methods for identifying periodically expressed genes perform considerably worse than the early method by Spellman et al. (1998)(M2). These results should encourage developers of future computational methods to evaluate the performance of their methods carefully. We show that the performance gap is due to the magnitude-independence of most newer methods. This independence may improve their ability to discover novel weakly regulated cell cycle genes, however, as the magnitude of regulation contains a large part of the signal, this should be exploited in order to derive the most accurate set of cell cycle regulated genes.Using the periodicity score alone yields results similar to those of other magnitude-independent methods, while our combined score performs as well as the best existing methods. In comparison to other methods, ours have two main advantages. First, genes ranked high on our list are guaranteed to be both significantly regulated and significantly periodic. Second, we require consistency in peak time across experiments, which allows us to assign a time of peakexpression to each gene.



Cell Simulations 17

Analysis of complexome during cell cycle

Most research on biological networks has been focused on static topological

properties, describing networks as collections of nodes and edges rather than as

dynamic structural entities.

Here this study focusses on the temporal aspects of networks, which allows us to

study the dynamics of protein complex assembly during the Saccharomyces

cerevisiae cell cycle.

The integrative approach combines protein-protein interactions with information on

the timing of the transcription of specific genes during the cell cycle, obtained from

DNA microarray time series shown before.

a quality-controlled set of 600 periodically expressed genes, each assigned to

the point in the cell cycle where its expression peaks.

Science 307, 724 (2005)


Cell Simulations 18

Data preparation

Construct physical interaction network for the corresponding proteins from Y2H

screens, TAP pull-downs, and curated complexes from the MIPS database.

To reduce the error rate of 30 to 50% expected in most current large-scale

interaction screens, all physical interaction data were combined, a topology-based

confidence score was assigned to each individual interactions as in the STRING

database, and only high-confidence interactions were selected.

These were further filtered with information on subcellular localization to exclude

interactions between proteins annotated to incompatible compartments; no

curated MIPS interactions were lost because of this filtering.

The topology-based scoring scheme, filtering, and extraction criteria reduced the

error rate for interactions by an order of magnitude to only 3 to 5%.

Science 307, 724 (2005)


Cell Simulations 19

E.g. compatible subcellular localizations

Science 307, 724 (2005)


Cell Simulations 20

Construct of temporal network

Include in the extracted network (Fig. 1), in addition to the periodically expressed

(„dynamic“) proteins, constitutively expressed („static“) proteins that preferentially

interact with dynamic ones.

The resulting network consists of 300 proteins (Fig. 1, inside circle), including 184

dynamic proteins (colored according to their time of peak expression) and 116

static proteins (depicted in white).

For 412 of the 600 dynamic proteins identified in the microarray analysis, no

physical interactions of sufficient reliability could be found (Fig. 1, outside circle).

Some may be missed subunits of stable complexes already in the network; the

majority, however, probably participate in transient interactions, which are often

not detected by current interaction assays.

Science 307, 724 (2005)


Cell Simulations 21

Temporal protein interaction network in yeast cell cycle

Cell cycle proteins that are part

of complexes or other physical

interactions are shown within

the circle.

For the dynamic proteins, the

time of peak expression is

shown by the node color;

static proteins are represented

as white nodes.

Outside the circle, the dynamic

proteins without interactions

are positioned and colored

according to their peak time.

Science 307, 724 (2005)


Cell Simulations 22

Agreement with know complexes; error estimates

de Lichtenberg et al. „cell cycle" was constructed by applying the entire scheme to „all". The MIPS complexes were not included for this analysis.

de Lichtenberg et al. „all" contains the union of the Y2H screens by Uetz et al. and Ito et al. and the matrix representations of the two complex pull-down experiments by Gavin et al. and Ho et al. This interaction set is the raw data for which the error rate has been estimated to be 30-50%.

de Lichtenberg et al. „45%" and „85%“consists of the subset of „all" to which we assign a topology-based quality score of at least 0.45 or 0.85, corresponding to the cutoff for including an interaction between two cell cycle proteins in the nal network (Figure 1). Interactions were not fitered based on subcellular localization data.

Science 307, 724 (2005)


Cell Simulations 23

Interpretation

Virtually all complexes contain both dynamic and static subunits (Fig. 1),

the latter accounting for about half of the direct interaction partners of periodically

regulated proteins through all phases of the cell cycle (Fig. 2).

Transcriptional regulation thus influences almost all cell cycle complexes and

thereby, indirectly, their static subunits.

This implies that many cell cycle proteins cannot be identified through the analysis

of any single type of experimental data but only through integrative analysis of

several data types.

In addition to suggesting functions for individual proteins, the network (Fig. 1)

indicates the existence of entire previously unknown modules.

Science 307, 724 (2005)


Cell Simulations 24

Just-in-time synthesis vs. just-in-time-assembly

Transcription of cell cycle–regulated genes is generally thought to be turned on

when or just before their protein products are needed: often referred to as

just-in-time synthesis.

Contrary to the cell cycle in bacteria, however, just-in-time synthesis of entire

complexes is rarely observed in the network. The only large complex to be

synthesized in its entirety just in time is the nucleosome, all subunits of which are

expressed in S phase to produce nucleosomes during DNA replication.

Instead, the general design principle appears to be that only some subunits of

each complex are transcriptionally regulated in order to control the timing of

final assembly.

Science 307, 724 (2005)


Cell Simulations 25

Interpretation

Several examples of this just-in-time assembly (rather than just-in-time

synthesis) are suggested by the network, including the prereplication complex

(Fig. 3B), complexes involved in DNA replication and repair, the spindle pole body,

proteins related to the cytoskeleton, and numerous smaller complexes or modules.

the transcriptome time mappings visualized in Fig. 1 are in close agreement

with previous studies on the dynamic formation of individual protein complexes,

suggesting that the timing of transcription of dynamic proteins is indicative of the

timing of assembly and action of the complexes and modules (Fig. 3).

Science 307, 724 (2005)


Cell Simulations 26

Interpretation

Just-in-time assembly would have an advantage over just-in-time synthesis of

entire complexes in that only a few components need to be tightly regulated in

order to control the timing of final complex assembly.

This would explain the recent observation that the periodic transcription of specific

cell cycle genes is poorly conserved through evolution.

For the prereplication complex, exactly this variation between organisms has been

shown, although the subunits and the order in which they assemble are

conserved.

Science 307, 724 (2005)


Cell Simulations 27

Interactions during the cell cycle

The number of interactions of dynamic

proteins with other dynamic proteins

(dynamic – dynamic) and with static

proteins (dynamic – static) is shown as

a function of cell cycle progression.

Zero time corresponds to the time of

cell division. The large number of

interactions during G1 phase reflects a

general overrepresentation of genes

expressed in this part of the cell cycle.

Science 307, 724 (2005)


Cell Simulations 28

Dynamic modules

Each panel shows a specific module

from the network in Fig. 1 using the

corresponding colors of individual

proteins.

Science 307, 724 (2005)


Cell Simulations 29

Dynamic modules

Previously unknown module connecting

processes related to chromosome structure

with mitotic events in the bud. The nucleosome

assembly protein Nap1p is know to shuttle

between the nucleus and cytosol and regulates

the activity of Gin4p, one of two G1-expressd

budding-related kinases in the module. Nap1p

and the histone variant Htz1p connect the

module to the nucleosome and sister

chromatid modules, respectively. Two poorly

characterized proteins, Nis1p and the putative

Cdc28p substrate Yol070p are both expressed

in mitosis and localize to the bud neck.

Science 307, 724 (2005)


Cell Simulations 30

Dynamic modules

Schematic representation of the dynamic

assembly of the prereplication complex. It

contains six static proteins (Orc1p to Orc6p)

that are bound to origins of replication

throughout the entire division cycle.

A subcomplex of six Mcm proteins is recruited

to the ORC complex in G1 phase by a Cdc6p-

dependent mechanism.

We see the corresponding genes (except

MCM6) transcribed just before that. Final

recruitment of the replication machinery is

dependent on Cdc45p, which is found

expressed in early S phase.

Science 307, 724 (2005)


Cell Simulations 31

Dynamic modules

Cdc28p module, with the different cyclins and

interactors placed at their time of synthesis. At

the end of mitosis, the cyclins are ubiquitinated

and targeted for destruction by APC and SCF,

reflected in the network by the interaction

between Cdh1p and Clb2p. The latter also

interacts with Swe1p, which inhibits entry into

mitosis by phosphorylating Cdc28p in complex

with Clb-type cyclins.

In this well-studied module, an uncharacterized

protein, Ypl014p, was discovered as well.

Science 307, 724 (2005)


Cell Simulations 32

Timing of expression for interaction partners

The distribution of time differences between

dynamic interaction partners reveals a strong

preference for interactions between proteins

expressed close in time.

Science 307, 724 (2005)


Cell Simulations 33

Timing of expression for interaction partners

For each hub protein, the average difference in

peak time was calculated among the dynamic

interaction partners.

The distribution is bimodal, supporting the

proposition of two classes of hubs, namely

„party“ and „date“ hubs.

Representative hub proteins are marked in the

figure to illustrate the relation between the two

types of hubs and modules.

Science 307, 724 (2005)


Cell Simulations 34

Outlook

With the emergence of new large-scale data sets, including assays focused on

protein-DNA interactions and transient protein-protein interactions, we expect

more of the dynamic proteins to be included in the network.

Also, we currently lack information about the life- time of the observed

complexes and modules.

With reliable time series of protein abundances, preferably in individual

compartments, the resolution of this temporal network can be increased

considerably, because even individual interactions over time could then

be monitored.

Moreover, the integrative approach presented here should be applicable

to any biological system for which both interaction data and time series are

available.

Science 307, 724 (2005)

Documents

10. Lecture SS 20005Cell Simulations1 V10: Proteome during the Yeast Cell Cycle Bioinformatics, 21, 1164 (2005) Science 307, 724 (2005)