SUPPLEMENTAL FIGURES AND TABLES

Supp. Figure 1. Graphical alignment results of DGE data.
Supp. Figure 2. Reproducibility and differential expression analysis for DGE data.
Supp. Figure 3. Comparison of fold difference in RNA abundance between copper-replete vs. copper-deficient cells estimated from DGE-TAG vs. real time-PCR experiments.
Supp. Figure 4. Change in protein abundance for hydrogenase and acyl-ACP desaturase parallels mRNA abundance.
Supp. Figure 5. Representative lanes from gel electrophoresis.
Supp. Figure 6. Reproducibility of proteomics data.
Supp. Figure 7. Amino acid alignment of Chlamydomonas reinhardtii CGL78 to homologs from Arabidopsis thaliana (AT5G58250), Oryza sativa (Os03g21370), Physcomitrella patens ssp. patens (PHYPADRAFT_121744), and Thermosynechococcus elongatus BP-1 (tll1499).
Supp. Figure 8. Fatty acid content of crr1 cells under copper-replete and -deficient conditions.
Supp. Figure 9. Fatty acid composition of MGDG and DGTS for CRR1 cells under copper-replete and -deficient conditions.
Supp. Figure 10. Promoter analysis of CRR1 targets.
Supp. Figure 11. Metaplot of RNA-Seq coverage for different types of libraries.
Supp. Figure 12. Alignment of DGE tags.
Supp. Figure 13. Single-end RNA-Seq simulated library.
Supp. Figure 14. Chlamydomonas transcriptome mappability for different read lengths.
Supp. Figure 15. Percent of trimming hits as a function of average exon length for a simulated library of 35mers.
Supp. Figure 16. Histogram of fold changes for a simulated library of 35mers.
Supp. Figure 17. Allocation of ambiguous hits.
Supp. Figure 18. Definition of mappable sets.
Supp. Figure 19. Additional corrections for the estimation of the Transcript Relative Abundance (TRA).
Supp. Table 1. Primers used for real-time PCR.
Supp. Table 2. Alignment statistics of DGE libraries.

SUPPLEMENTAL METHODS

1. Computational methods
2. DGE analysis
3. RNA-Seq analysis
4. Promoter analysis
5. Identification of algal proteins in genome databases
6. Immunodetection

SUPPLEMENTAL REFERENCES

SUPPLEMENTAL DATASETS (see separate Excel files for datasets)

Supplemental Dataset 1. Copper-deficiency responsive genes identified via DGE-TAGs.
Supplemental Dataset 2. Summary of RNA-Seq statistics and expression estimates after allocation of ambiguous hits.
Supplemental Dataset 3. RNA-Seq identifies copper-deficiency and CRR1 targets under photoheterotrophic conditions.
Supplemental Dataset 4. RNA-Seq identifies copper-deficiency targets in photoautotrophically grown cells.

Supplemental Data. Castruita et al. Plant Cell (2011). 10.1105/tpc.111.084400.


Supplemental Dataset 5. Identification of copper-deficiency targets via alignment of sequences to FM4 and Augustus 10.2.
Supplemental Dataset 6. Transcript abundances for genes encoding tetrapyrrole biosynthesis enzymes.
Supplemental Dataset 7. Transcript abundance for genes encoding copper-containing proteins and copper chaperones.
Supplemental Dataset 8. Transcript abundance for genes encoding metal transporters.
Supplemental Dataset 9. Summary of protein abundance from MSE analysis.
Supplemental Dataset 10. Orthologs of Chlamydomonas copper-deficiency responsive genes in V. carteri.
Supplemental Dataset 11. Most abundant transcripts in Chlamydomonas.
Supplemental Dataset 12. Transcript abundance for nucleus-encoded genes for Chlamydomonas.


Supplemental Figure 1. Graphical alignment results of DGE data. (A) Landscape of DGE data as displayed on the UCSC genome browser for a 700,000 bp region (x-axis). The y-axis shows signal intensity on a log scale. The average signal from 3 different conditions is displayed on both strands. (B) Metaplot of percent of DGE signal for the Chlamydomonas transcriptome. The most 3' NlaIII restriction site accounts for more than 40% of the total signal. (C) Browser view for the FDX5 locus, a CRR1 target. Differential expression is observed for wild-type cells only.


Supplemental Figure 2. Reproducibility and differential expression analysis for DGE data. qq-plot of p-values of differential counts in a pairwise comparison between biological replicates (A). Small p-values are overrepresented as compared to the uniform distribution. Mean-difference scatterplots for different biological replicates (B), different biological conditions (C), and different biological conditions after pooling biological replicates (D). Pooling of replicates reduces the variance of pairwise comparisons. Mixture-model fitting of p-values to a mixture of a uniform (green) and one (red) or two (dark yellow) beta distributions for false discovery estimates. The model performs better after low expression genes (p-value distribution skewed to one in panel E) are filtered (panel F). See Supplemental Methods for details.

Supplemental Figure 3. Comparison of fold difference in RNA abundance between copper-replete vs. copper-deficient cells estimated from DGE-TAG vs. real time-PCR experiments. Each data point represents the average of two independent experiments. The correlation coefficient was calculated from the regression line derived with a 95% confidence interval.


Supplemental Figure 4. Change in protein abundance for hydrogenase and acyl-ACP desaturase parallels mRNA abundance. Proteins were separated on an SDS-denaturing polyacrylamide gel (10%) and transferred to PVDF-FL for immunoblot analysis with anti-hydrogenase and anti-FAB2. Protein loading was normalized by Coomassie staining.

Supplemental Figure 5. Representative lanes from gel electrophoresis. 30 µg of soluble protein extract from +Cu and –Cu cells was loaded per lane. A representative lane is shown. The 4-12% NuPage gel was stained with Coomassie blue to visualize proteins and sliced as indicated on the left for fractionation.


Supplemental Figure 6. Reproducibility of proteomics data. Mean-difference scatterplots show similarity of samples prepared from (A) replicate cultures of the same condition, (B) different experimental conditions, and (C) after pooling replicates from each of the two conditions.


Supplemental Figure 7. Amino acid alignment of Chlamydomonas reinhardtii CGL78 to homologs from Arabidopsis thaliana (AT5G58250), Oryza sativa (Os03g21370), Physcomitrella patens ssp. patens (PHYPADRAFT_121744), and Thermosynechococcus elongatus BP-1 (tll1499). The eukaryotic proteins are predicted to contain an N-terminal signal sequence and all proteins have predicted non-membrane, non-cytoplasmic localization. The cyanobacterial homologs are commonly named YCF54.


Supplemental Figure 8. Fatty acid content of crr1 cells under copper-replete and -deficient conditions. Lipids were isolated from +Cu and −Cu crr1 cells, separated by TLC, and quantified by GC. The bars show lipid composition in mol % and indicate the means (±SD) of five independent experiments. Panel A shows the total fatty acid profile of whole crr1 cells, and panels B, C, and D show the fatty acid composition of MGDG, DGDG, and DGTS, respectively.

Supplemental Figure 9. Fatty acid composition of MGDG and DGTS for CRR1 cells under copper-replete and -deficient conditions. The bars show lipid composition in mol % and indicate the means (±SD) of six independent experiments. Total fatty acids in MGDG and DGTS are shown in panel A and B respectively. Values marked with an asterisk are significantly different from each other (p < 0.05, non-paired 2-sample t-test).


Supplemental Figure 10. Promoter analysis of CRR1 targets. Panel A, left: Frequency differences for highly enriched 6-mers in CRR1 target promoters of variable length (from 100 to 2000 bp) as compared to random sets of promoters from the Chlamydomonas transcriptome. GTAC and GC-rich motifs ranked first among all possible words of 4 to 8 bases. Right: The prevalence of the top-10 enriched motifs in 100 different randomizations shows a decreasing enrichment for different windows along the promoter region as well as some extent of positional bias (with GC-rich words prevailing ~1200 bp upstream of the transcription start site). Panel B: Sequence profiles for GTAC sites in different upstream windows (up to 200 bp, 800-900 bp, and 1250-1350 bp) show differential enrichment of GTAC flanking sequences at different residues.


Supplemental Figure 12. Alignment of DGE tags. Example of brute force alignment of two arbitrary sequences, where "X" denotes the Frobenius inner product of two matrices (A). For each read, all pairwise products with NlaIII restriction sites in the genome are computed. The same method is employed to determine the degree of confusability of NlaIII restriction sites. Three consecutive rounds of alignment are performed (see Methods text for details). The first round (left) identifies reads with an alignment score equal to or better than 16 (B). The remaining reads correspond mostly to polyA rich sequences (right top, sequence content of all the reads in a single lane; right bottom, sequence content for the remaining reads after the first alignment round).
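The Frobenius-product scoring mentioned in the caption of Supplemental Figure 12 can be illustrated with a minimal sketch. It assumes that reads and candidate sites are one-hot encoded as 4 x k matrices (the encoding is not stated in the caption), in which case the inner product simply counts matching positions. The function names are hypothetical and the candidate sites are toy sequences, not data from this work.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode seq as a 4 x len(seq) 0/1 matrix (one row per base in BASES).
    Symbols outside ACGT (e.g. N) give an all-zero column."""
    return [[1 if base == b else 0 for base in seq] for b in BASES]

def frobenius(a, b):
    """Frobenius inner product: sum of element-wise products of two
    matrices of equal shape."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

def best_sites(read, sites):
    """Brute-force scoring of one read against every candidate site.
    sites maps a (hypothetical) site name to a sequence of the same length
    as the read. Returns the best score and the names of all sites reaching it."""
    encoded_read = one_hot(read)
    scores = {name: frobenius(encoded_read, one_hot(seq)) for name, seq in sites.items()}
    top = max(scores.values())
    return top, sorted(name for name, score in scores.items() if score == top)
```

With this encoding, a read is reported against a site when the score reaches the chosen threshold (16 in the first alignment round described in the caption), and a read whose best score is shared by several sites is a candidate ambiguous hit.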

Supplemental Figure 11. Metaplot of RNA-Seq coverage for different types of libraries. Sequence coverage for each transcript was normalized to its maximum value across the transcript length as a test of signal uniformity. The metaplots show the average for all transcripts. The Whole Transcriptome Analysis (WTA) protocol from Illumina was optimized during the course of the project (from WTA-I to WTA-II) with increased uniformity for both single end (green) and paired-end (red) libraries.


Supplemental Figure 13. Single-end RNA-Seq simulated library. Mean-difference scatterplots of expression levels for a simulated library of increasing sequencing depth (left to right). The actual expression is compared with the estimated expression after the analysis pipeline. Light blue lines represent the mean fold change in running windows of 200 data points. Green and red lines are displayed for comparison, and correspond to the 5th and 95th quantiles of the fold change for high-sequencing depth (right plot).

Supplemental Figure 14. Chlamydomonas transcriptome mappability for different read lengths. The four histograms represent the fraction of genes with a unique mappable length in the interval 0-25%, 25-50%, 50-75% and 75-100% respectively. Unique mappability saturates for reads longer than 50 nt.


Supplemental Figure 15. Percent of trimming hits as a function of average exon length for a simulated library of 35mers. Genes with a lower fraction of exon length close to exon-intron boundaries receive on average 20% of their counts from the trimming alignment round. For genes with small exons, the ratio can be as high as 40%.

Supplemental Figure 16. Histogram of fold changes for a simulated library of 35mers. The number of genes with low or null fold change increases significantly when the expression estimates including counts from the second alignment round are compared to the true solution of the simulated library.


Supplemental Figure 17. Allocation of ambiguous hits. (A) Top: mean-difference scatterplots for the number of hits before and after allocation as compared to the true solution. The assignment of ambiguous hits provides a better approximation to the true solution for almost every expression level as shown by the 25th and 75th quantiles of the fold changes (red and green curves, top and bottom panel A). (B) Allocation of ambiguous hits for a real library on a single locus of the Chlamydomonas genome. Sequence coverage is improved after allocation (top green coverage track) for a gene with low unique mappability.


Supplemental Figure 18. Definition of mappable sets. Fuzzy boundaries: Exon-intron boundaries are modified to account for sequence similarity between the 5' end of each intron and the next exon (A). The definition of the fuzzy boundary depends on the number of mismatches allowed for each read length. The final ensemble of mappable sets for each gene accounts for overlapping bases with neighbor genes and exonic k-mers that overlap with intronic sequence, along with the genomic multiplicity (B). Each of these factors depends in turn on the value of k and, therefore, the number of allowed mismatches (which determines the corresponding fuzzy boundary).


Supplemental Figure 19. Additional corrections for the estimation of the Transcript Relative Abundance (TRA). (A) Cumulative distribution function (CDF) of the number of counts and (B) estimated expression levels. Each graph corresponds to a different real library of different sequencing depth. CDF of expression levels after imputation of missing values (C). CDF of expression levels after Kaplan-Meier correction (D). The expression distributions collapse into a similar curve irrespective of the sequencing depth.

Supplemental Table 1. Primers used for real-time PCR.

| Gene name | Protein ID (a) | Forward primer (b) | Reverse primer (b) |
|---|---|---|---|
| CTR1 | 196101 | CCTTCAACATCGGCTTCTTC | CCCTTCTTGTGGTCAGCAGT |
| CTR2 | 196096 | CGCTGCTGAACCTCATCTC | GCAGGTGGCCCAGGAATAG |
| CTR3 | 196115 | CGCTCCCCGCCATCCCCAAG | GTGAGCGGGTCAGGGCAGGT |
| COPT1 | 196102 | TCTGGTTTGCTGGGCTCTCC | GTACACCGCCAGTCCCTCGT |
| CTP2 | 205938 | CTGCAACGTGCTGGTGACTG | GTGAGGGTGCCGGTCTTGTC |
| CTP3 | 195962 | GCACATGCTCCGAGGAAGAC | TCGACAAGACCGGCACACTC |
| CPX1 | 53583 | CCAGACCTCCAAGCGTGTGT | GCGTTGCGGATCACCTTCTC |
| HMA1 | 195962 | ACGGATGACGAACAGCAGCA | CTCACACAGCCCCTTCCACA |
| IRT2 | 174212 | GGGGCTACTGATGGCGGTGA | GTAGATGGGTGCGGCGATGA |
| FEA1 | 129929 | TGCTGGGCCTGCAGGGCGTGTC | ACGGCGGAGGCCTTGAAGTTGC |
| FDX5 | 156833 | ACCATCCTCACGCACCAG | CCCCTCCGTTGCGTGATAAA |
| CYC6 | 193170 | AGCGACGTGGCGCCGGTATCAAT | GTCCACCGAGATGAACACCAGCTG |
| CRD1 | 284376 | CGTAGGTAGGCTGACTGCGTTG | GTCATTTATGCGCAGCCCTTG |
| COX17 | 154148 | AGCTGCCCTGACACCAAGAA | TCACACCTTGAACCCCTCCA |
| COX2a | 184789 | TCCTGATCACTGTGGTGACCCTG | CAGCGAGTAGATGAGGGTGAGC |
| COX2b | 190125 | CGTGAAGATGGTCGCGGTGCCGGGT | TTACGAGATCCACTTCTTCAC |
| COX11 | 195707 | CTTGTGACTGCGGGCTTGTG | ACTGGGCTCCTGGGTTCTCC |
| COX19 | 127387 | AGGGCGTTTTCCCTCTGGAC | TTGCGCCATCAGGTCCCTCT |
| AOX2 | 77667 | GAGGGAGAGTTCGTGACC | TTACTTCCCGACTGTCGCCGC |
| SCO1 | 159985 | GAGGGAGAGTTCGTGACC | TTACTTCCCGACTGTCGCCGC |
| | 79831 | GTGCTGACGTGGCTGTCGTT | GATGCAGAAGGTGGCCCAGT |
| AOF1 | 173525 | GTGAGAGTGCCGGGTTGCTT | GCCCCCATGCCATAAATACC |
| PFR1 | 122198 | CGATGTGGAGGAGTGGCTGA | CAGCTCGTGGTGCAGCTTGT |
| | 143800 | GGTGCCAGGTTGATTTTGGTG | CTGCACAATTCAGCCATTAGG |
| | 150517 | CGCTGTGGTTCTGGCTGTTC | GATGAGGAAGCCGCTGATGG |
| HYDEF | 128256 | CTGCATGATTGACGCCCAGA | AGGCGGCCCAAGAGAAGAAC |
| LCI34 | 183315 | CACGGACGCTTCTGGCTTTC | ACACGTCACCGGCCCTCTAA |
| DES6 | 56668 | CGCTGGAGGTGCTGCTGAAT | GGTCAGCAGCTTGAGCGACA |
| HYD1 | 183963 | GTCTATTCGCGGCAGCTC | TCGTGGACATGACTCAAAGG |
| RSEP1 | 206032 | GGGACTGGGGTTCAGGGATT | AGGCGTGCTCGAATGACACA |
| | 185347 | ACACGCGGCGGTGGGCGTCA | CGAGGGTGCCGAACAGGAAG |
| CYP51G1 | 196411 | ACAACAAGGCGGCTGAGGAG | GCAGTGCCTCGGTGATGTTG |
| | 193036 | CACAAGCAGCACACGCACAG | GCAAGACCCACCAAGAGAGCA |
| | 189582 | CAGTGCCCGAGCCAGTAACC | CAGCCTAAGCGACGGAAGCA |
| GOX7 | 196819 | GCGTGGTCGTGCTGGAGATT | CGGGGCAATGTTGATGTTGG |
| GOX8 | 196818 | CGCCTGGACTCCTTCCTTGA | AACACACGCCACGCATGAAC |
| | 144679 | CTTGTGCCGTAGTGCGGTGA | CCCTCATCATGCCGTCCTCCT |
| | 194369 | CCCTTGGCGTGAGTTGCAGT | TGGGTCAGCCGTCATTCAAG |
| | 184989 | CGACCCAAACACTGACAAGCA | GGACAAGCCAGGGTGCAGTT |
| | 166317 | CGGCGGTCTGTCATGCTGGTC | TGAAGCCAACCAAATGCGTGT |
| CGL78 | 162021 | CCTGGACCGCGTGCTGAAGA | TACCGGGCGTAAGGGGCAGT |
| HYDG | 196226 | CCAATCACTGCATGAACACC | TTGACAGATGCGATCACGTT |
| CβLP | 105734 | GCCACACCGAGTGGGTGTCGTGCG | CCTTGCCGCCCGAGGCGCACAGCG |

(a) Corresponding to the version 3.1 draft genome.
(b) The primer pairs for each gene model are shown. All primer sequences are written 5' to 3'.


Supplemental Table 2. Alignment statistics of DGE libraries.

| Lane | Total tags | 1st round % aligned: total | 1st round % aligned: non-ambiguous | 2nd round % aligned: total | 2nd round % aligned: non-ambiguous | 3rd round % aligned: total | 3rd round % aligned: non-ambiguous |
|---|---|---|---|---|---|---|---|
| Lane 1 | 4379841 | 0.92 | 0.85 | 0.95 | 0.88 | 0.97 | 0.90 |
| Lane 2 | 3128365 | 0.92 | 0.86 | 0.94 | 0.88 | 0.97 | 0.90 |
| Lane 3 | 4264777 | 0.90 | 0.84 | 0.93 | 0.87 | 0.96 | 0.90 |
| Lane 4 | 3799682 | 0.92 | 0.86 | 0.95 | 0.88 | 0.97 | 0.90 |
| Lane 5 | 2439979 | 0.93 | 0.86 | 0.95 | 0.88 | 0.97 | 0.90 |
| Lane 6 | 4872158 | 0.91 | 0.84 | 0.94 | 0.87 | 0.96 | 0.89 |


SUPPLEMENTAL METHODS

1. Computational methods

1.1. Degree of confusability and transcriptome mappability (the infinite gap penalty approach): The transcriptome is profiled in this work by very short reads (21 nt for DGE, 33-50 nt for WTA). The unambiguous assignment of a short nucleotide read to a given genomic position is a non-trivial problem, and the approximations typically applied are not always made explicit. It is customary to align short read sequences allowing for a given (small) number of mismatches in order to account for base-calling errors. If a binary scoring approach is used (both perfect and suboptimal alignments are regarded as equivalent as long as they pass a predefined threshold), for every read one needs to report the complete set of alignments above that threshold and define an ambiguous or multiple alignment hit (multihit) accordingly. In other words, any given pair of genomic loci of the same size can be mapped by the same read if the number of allowed mismatches is high enough.

Thus, for each genomic base $j$ and read length $k$, we define and compute the degree of confusability $DC_j(k)$ as the minimum distance to any other locus of the same length in the genome, where the distance between two given loci is the number of bases at which they differ. For a given read length $k$ and number of allowed mismatches $m(k)$, a base is called mappable if $DC_j(k) > m(k)$. Now, for every read $r$ of length $k$ we define an alignment k-hit $h_{rk}$ as the (possibly empty) set of genomic loci $j$ with an alignment score $S_r$ equal to or better than a given threshold $s(k)$: $h_{rk} = \{j \mid S_r \geq s(k)\}$.

The classes of ambiguous and non-ambiguous hits are then defined as follows. A hit is non-ambiguous if the cardinality of $h_{rk}$ is zero for all $k$ in $K$, or if for some value of $k$ in $K$ the cardinality of $h_{rk}$ is one and the only member of the set is a mappable base. A hit is ambiguous if for some $k$ the cardinality of $h_{rk}$ is one and the member of the set is not mappable, or if the cardinality of $h_{rk}$ is greater than 1 (multihit).
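As an illustration of the degree of confusability defined above, the following minimal sketch computes it by brute force on a toy sequence. It makes several simplifying assumptions that are not part of the original method: each locus is identified by its start position j, only the forward strand is scanned, and the function names are hypothetical. The quadratic cost makes it unsuitable for a real genome.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def degree_of_confusability(genome, k):
    """Brute-force DC_j(k): for the k-long locus starting at each position j,
    the minimum Hamming distance to the k-long locus at any other position.
    Forward strand only; O(n^2 * k), toy sizes only."""
    kmers = [genome[j:j + k] for j in range(len(genome) - k + 1)]
    dc = []
    for j, kmer in enumerate(kmers):
        distances = (hamming(kmer, other) for i, other in enumerate(kmers) if i != j)
        # If there is no other locus at all, no confusion is possible: use k.
        dc.append(min(distances, default=k))
    return dc

def mappable(dc, m):
    """A locus is mappable when DC_j(k) > m, the number of allowed mismatches."""
    return [d > m for d in dc]
```

For example, in the toy sequence `ACGTACGTTT` with k = 4, the two copies of `ACGT` have a degree of confusability of 0 and are never mappable, whereas `TACG` differs from every other 4-mer at all four positions and remains mappable for any m below 4.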

Then, the perfect match approach can be understood as the limit of zero sequencing error rate, and ambiguous hits are restricted to sets of loci with complete sequence identity. In practice, once all the hits are labeled, one can make the choice to restrict any downstream analysis to unambiguous hits (as for DGE data in this work) or to allocate each ambiguous hit to a single genomic position based on a given assignment model (see RNA-Seq methods below).
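The labeling of hits described above can be sketched as a small decision function. This is only one reading of the definitions: it assumes that the hit considered for each read is the one at the largest read length with a non-empty set of loci (an assumption borrowed from the recursive alignment strategy), and both the function name and the `is_mappable` callback are hypothetical.

```python
def classify_hit(hits_by_k, is_mappable):
    """Label a read from its k-hits.

    hits_by_k   : dict mapping read length k -> set of loci j whose alignment
                  score passes the threshold s(k)
    is_mappable : function (j, k) -> bool, i.e. whether DC_j(k) > m(k)
    """
    lengths_with_hits = [k for k, loci in hits_by_k.items() if loci]
    if not lengths_with_hits:
        # Empty hit set for every k: counted as non-ambiguous (unaligned).
        return "non-ambiguous (no hit)"
    k_star = max(lengths_with_hits)  # assumption: longest aligned length wins
    loci = hits_by_k[k_star]
    if len(loci) == 1 and is_mappable(next(iter(loci)), k_star):
        return "non-ambiguous"
    # Either a single non-mappable locus or a multihit.
    return "ambiguous"
```

In the perfect-match limit, `is_mappable` reduces to asking whether the locus sequence occurs exactly once in the genome.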

For normalization of RNA-Seq data, it is necessary to compute the mappable length of each transcript. This estimation depends on sequencing parameters (read length), intrinsic properties of the genome (per-base mappability for each read length and exon-intron boundaries), and alignment parameters (allowed mismatches and multihit handling). The definition of mappable sets below and its application to the normalization of expression data account for these dependencies, and can be applied to different alignment strategies. However, for transcriptome applications, the previous definitions are still flawed, as they rely on the hypothesis that any member of a k-hit is a connected set of bases in the genome. Moreover, the degree of confusability should be computed accordingly, as any given short mRNA fragment might be found as either a connected or a non-connected set of bases in the genome. For very short reads such as the ones used in this work, performing true gapped alignment is impractical, so we tried to maximize the number of aligned reads by a trimming approach (see below) and proper normalization. For longer reads, some sort of gap penalty might be introduced (both for gap opening and for the total number of gaps for a read of a given size) and the degree of confusability should be defined and computed accordingly. Consequently, the alignment strategy explained below must be understood in the infinite gap penalty limit.


An alternative common approach is to align the reads to the ensemble of transcript sequences, but we rule out this possibility to minimize the impact of an imperfect annotation (González-Ballester et al., 2010).

1.2. Definitions and notation: Throughout this work we apply the matrix notation introduced in Bullard et al. (2010). Matrix subscripts are used to denote individual lanes ($i \in [1, I]$), genes or sets of genomic bases ($j \in [1, J]$), and aligned read lengths ($k \in [k_{\min}, k_{\max}]$). The mRNA samples from 11 different experiments ($a(i) \in [a_1, a_{11}]$) were sequenced in $I = 28$ lanes corresponding to 12 different flow cells ($c(i) \in [c_1, c_{12}]$) (see Supplemental Dataset 2 for details). The set of curated genes provided in the JGI v3.1 Chlamydomonas genome annotation includes 14,598 gene models.

We used a recursive alignment strategy (see below) in which the complete read ($k_{\max}$) or end-trimmed forms of the read (down to $k_{\min}$) are mapped to the reference sequence. For each read we find $k^*$, the maximum value of $k$ such that $S_{rk} \geq s(k)$, and the corresponding set of genomic positions $h_{rk^*} = \{j \mid S_r \geq s(k^*)\}$ is denoted as a k-hit. Let $H_{ik}(x)$ denote the matrix containing the number of k-hits for lane $i$ at a given genomic position $x$. Thus, $H_{ijk} = H_{ik}(x \in M_{jk})$ contains the number of k-hits per lane for a given mappable set $M_{jk}$ (i.e. gene, see below).

Replacing the corresponding subscript by a "·" symbol represents totals over all lanes, genes or read lengths. That is, the total number of hits for a given gene and lane is denoted by $H_{ij\cdot}$, and the total lane counts by $H_{i\cdot\cdot}$. When there is no ambiguity, we use an implicit notation where indices for totals are omitted, that is, $H_{ij} = H_{ij\cdot}$. If groups of lanes from two different experiments were pooled for a differential expression analysis, $H_{a(1)j\cdot}$ or $H_{a(1)j}$ would be read as the total number of hits for the mappable set $j$ over all lanes from experiment $a_1$. The average of a given variable over all the lanes for a given condition $a(i)$ is denoted as $\langle \bullet \rangle_{a(i)}$. Variables corresponding to a particular multiplicity filter (e.g. unique hits, or maximum number of hits reported by the aligner) are represented with a superscript: $H^{1}_{ijk}$ indicates the hits matrix when only unique hits are considered. No superscript is used for unfiltered variables.
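The dot notation for totals can be made concrete with a small sketch. It assumes the hits matrix is stored as a nested list indexed as `H[i][j][k]` (lane, mappable set, read-length index); this layout and the function names are illustrative choices, not part of the original pipeline.

```python
def totals(H):
    """Collapse a hits matrix H[i][j][k] into per-gene and per-lane totals.

    Returns (H_ij, H_i), where H_ij[i][j] sums over read lengths
    (written H_{ij.} in the text) and H_i[i] additionally sums over
    genes (H_{i..}, the total lane counts)."""
    H_ij = [[sum(per_length) for per_length in lane] for lane in H]
    H_i = [sum(row) for row in H_ij]
    return H_ij, H_i

def pooled(H_ij, lanes):
    """Total hits per mappable set j over the lanes that belong to one
    experiment a (written H_{a j} in the text)."""
    n_sets = len(H_ij[0])
    return [sum(H_ij[i][j] for i in lanes) for j in range(n_sets)]
```

For two lanes, two genes and two read lengths, `totals([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])` gives per-gene totals `[[3, 7], [11, 15]]` and lane totals `[10, 26]`.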

1.3. Statistical methods for differential expression and reproducibility analysis: Several approaches have been proposed to perform Poisson statistics on RNA-seq count data (Audic and Claverie, 1997; Marioni et al., 2008; Bullard et al., 2010). In the present work, we tested two different approaches for two-sample comparisons: the one presented in (Audic and Claverie, 1997) and the Fisher exact test as discussed in (Bullard et al., 2010). We found almost identical results with both methods, for both low and high numbers of counts. Throughout this work, the threshold used to call a gene differentially expressed is selected from a predefined value for the false discovery rate. Posterior probabilities of differential expression, PPDE(p) ∈ [0,1), were computed from the previous p-values using the mixture-model approach described in (Allison et al., 2002) in order to estimate false discovery rates. In short, the p-value distributions are fitted to a mixture of a beta distribution (representing the set of differentially expressed genes) and a uniform distribution (non-differentially expressed genes). To this end, we made use of an implementation in R included in the package cyberT (Baldi and Long, 2001). PPDE(pj) ~ 1 implies that gene j has a large probability of being differentially expressed. This method allows us to estimate the false discovery rate at any given p-value threshold, PPDE(< pj). In practice, we typically define the group of target genes starting from the set {j | PPDE(pj)>0.95}. In order to address technical or biological reproducibility between two different runs or libraries, two
different types of graphs are presented. First, based on the hypothesis that two samples have the same statistical properties and that technical reproducibility is perfect, the distribution of p-values should be uniform. This idea is implemented in terms of qq-plots where the p-values of different statistical tests for hits totals $H_{ij}, H_{i'j}$ are compared to the uniform distribution in $[0,1)$. Second, we can inspect the variability of expression levels between different lanes in terms of mean-difference scatter-plots (Bullard et al., 2010). We define a normalized expression measure as

$$E_{a(i)j} = \frac{c}{H_{a(i)}}\, H_{a(i)j}$$

for a set of lanes $a(i)$ and a constant factor $c$ (equal to $1\times10^{6}$ in this work). The mean-difference scatter-plots show expression fold-changes $E^{1}_{a(i)j_1} / E^{1}_{a(i)j_2}$ as a function of the overall expression measure $E^{1}_{a(i)j_1} / E^{1}_{a(i)j_2}$ in a modified logarithmic scale $\log_{10}(1+x)$.
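
As an illustration of the mixture model described in this section, the following is a minimal Python sketch of how a posterior probability of differential expression could be obtained once the uniform and beta components have been fitted. It is not the cyberT implementation; the function names and the parameter values (`lam0`, `r`, `s`) are hypothetical and chosen only for illustration.

```python
import math

def beta_pdf(p, r, s):
    """Density of a Beta(r, s) distribution at p in (0, 1)."""
    log_norm = math.lgamma(r + s) - math.lgamma(r) - math.lgamma(s)
    return math.exp(log_norm + (r - 1) * math.log(p) + (s - 1) * math.log(1 - p))

def ppde(p, lam0, r, s):
    """Posterior probability of differential expression at p-value p.

    lam0 is the weight of the uniform (non-differentially expressed)
    component; r and s are the beta shape parameters. All three are
    assumed to have been fitted beforehand.
    """
    de = (1 - lam0) * beta_pdf(p, r, s)   # differentially expressed part
    return de / (lam0 + de)               # uniform density is 1 on (0, 1)

# Hypothetical fitted parameters, for illustration only.
lam0, r, s = 0.8, 0.3, 4.0
for p in (1e-4, 1e-2, 0.5):
    print(p, round(ppde(p, lam0, r, s), 3))
```

With a beta component concentrated near zero, small p-values receive a PPDE close to 1 and large p-values a PPDE close to 0, which is the behaviour the thresholding on PPDE relies on.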

2. DGE analysis

DGE offers a higher dynamic range than microarrays, with better sensitivity for poorly expressed genes. DGE is also less expensive than RNA-Seq at the same level of transcriptome coverage. However, in its current form, a number of limitations make the analysis of DGE data rather involved. First, the restriction of the original mRNA library by a particular enzyme (NlaIII in this work) predefines a set of transcripts as absent (those lacking an unambiguously mappable restriction site) regardless of their expression level (around 700 gene models of the Chlamydomonas v3.1 annotation). Second, the short length of DGE reads (20-21 bp) introduces an additional level of ambiguity for the assignment of alignment hits. As a first approximation, one could choose to perform a conservative analysis where only reads that map uniquely to the genome and with no mismatches are considered (~78% for the libraries presented in this work). Such an approach would limit the sensitivity of the technique in some particular cases, so one is tempted to relax the previous restrictions in order to increase the number of mapped reads. However, as mentioned above, one needs to be careful about the assignment of alignment hits to sites with a high degree of confusability. Only 83% of the NlaIII restriction sites in the Chlamydomonas genome are unique, i.e. there is a significant number of restriction sites that cannot be resolved with DGE tags. With no additional analysis, only silent repeats (those which are completely included in the same gene model) might be assigned without ambiguity to a single transcript. For 21bp tags, we cannot allow for a large number of mismatches, as they might be confused with one or more different restriction sites. In fact, the fraction of the Chlamydomonas v3.1 unique restriction sites with at least another site at a distance of 1 (respectively 2, 3 or 4) is 7% (respectively 26, 52 or 14%). A marginal number of restriction sites would allow for a bigger number of mismatches without ambiguity. That means that if we allow for one mismatch, there are an additional 7% of restriction sites that should be considered as non-unique. This additional confusability increases dramatically with the number of mismatches, and would be almost complete beyond 4 mismatches (as the uniqueness of the Chlamydomonas v3.1 genome for k-mers smaller than 17 decreases very fast).
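
The distance statistics quoted above can be illustrated with a small Python sketch that computes, for each tag, the Hamming distance to its closest other tag. This is an assumption-laden toy (the tag list is invented and the quadratic scan would not scale to the full table of restriction sites), not the procedure used for the reported percentages.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length tags."""
    return sum(x != y for x, y in zip(a, b))

def nearest_neighbor_distance(tags):
    """For each (unique) tag, the Hamming distance to its closest other tag."""
    out = {}
    for i, t in enumerate(tags):
        out[t] = min(hamming(t, u) for j, u in enumerate(tags) if j != i)
    return out

def fraction_at_distance(tags, d):
    """Fraction of tags whose closest other tag is exactly d mismatches away."""
    nn = nearest_neighbor_distance(tags)
    return sum(1 for v in nn.values() if v == d) / len(tags)

# Toy 8-mers standing in for unique 17 bp NlaIII tags (hypothetical data).
tags = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGCC", "CAGCAGCA"]
for d in (1, 2):
    print(d, fraction_at_distance(tags, d))
```

A tag whose nearest neighbour is at distance 1 becomes ambiguous as soon as one mismatch is tolerated, which is why the allowed number of mismatches has to stay small.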

2.1 Brute-force recursive alignment of DGE reads: Illumina’s proprietary pipeline was used for off-instrument data pre-processing (image analysis and base-calling). The resulting 21-mers were aligned to the Chlamydomonas v3.1 genome by means of a brute-force exact alignment procedure (Supplemental Figure 12A). First, a tag table of NlaIII 17bp sites (577370 in total for each strand) was retrieved from the
genome sequence. Also, a library of unique reads from six sequencing lanes was created (a total of 554389), keeping track of the number of occurrences of each read in every lane. Each entry of the tag table and each read was represented as a 4x17 binary matrix. The Frobenius inner product of such two matrices is defined as

$$\langle A, B \rangle_F := \mathrm{trace}(AB^{T}) = \sum_{(ij)} A_{ij}B_{ij}$$

and provides the number of matching bases, or alignment score. We make use of the previous product to compute both DCj(k) for the members of the tag table and the alignment score of each read.
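
A minimal Python sketch of this scoring scheme follows. It assumes the rows of the binary matrix are ordered A, C, G, T (the source does not state the row order) and uses plain lists rather than any particular matrix library; the function names are illustrative.

```python
BASES = "ACGT"  # assumed row order of the 4 x L binary matrix

def one_hot(seq):
    """Encode a sequence as a 4 x len(seq) binary matrix (rows: A, C, G, T)."""
    return [[1 if base == b else 0 for base in seq] for b in BASES]

def frobenius(a, b):
    """Frobenius inner product: sum of element-wise products of two matrices."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

def alignment_score(read, tag):
    """Number of matching bases between a read and a tag of the same length."""
    return frobenius(one_hot(read), one_hot(tag))

tag = "ACGTACGTACGTACGTA"          # 17-mer
read = "ACGTACGTACGTACGTT"         # same tag with one mismatch at the 3' end
print(alignment_score(tag, tag), alignment_score(read, tag))
```

Each column of the encoding contains a single 1, so a column contributes 1 to the product exactly when the two sequences carry the same base at that position.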

A hierarchical alignment strategy was carried out to build a landscape of non-ambiguous hits across the genome (see Supplemental Table 2):
a) 1st round: As we mentioned above, the degree of confusability for one mismatch is moderate while there are a substantial number of one-mismatch alignments (7% of the total, see Supplemental Figure 12B). Thus, in a first step, all the hits with a score better than or equal to 16 (i.e. up to one mismatch) were labeled as ambiguous or non-ambiguous (in the sense of DCj(k)). On average over the six lanes, 92% of the reads passed this first filter (~93% of them with perfect score), and 85% can be assigned unambiguously.
b) 2nd round: The set of reads that cannot be assigned in the previous step shows a strong bias to high A content at the 3’ end (see Supplemental Figure 12B). This suggests that the main source of misalignments is the sequencing of the polyA tail from transcripts with a restriction site close to the transcription termination site. Therefore, the remaining reads were 3’-trimmed one base at a time (starting from 15mers and up to 13mers) searching for a non-ambiguous hit with perfect score (i.e. 15-13). This step was able to rescue an average of 2.6% reads per lane, bringing the total fraction of aligned reads close to 95% (~88% unambiguously). The relevance of this additional alignment step is more apparent at the transcript level (even if it affects only a small fraction of them), as it mainly involves the assignment of hits to the most 3’ restriction site, which carries around 45% of the total DGE signal of the transcript (Supplemental Figure 1).
c) 3rd round: Non-ambiguous hits from reads that miss the previous steps were assigned regardless of their score. No less than 2% of the remaining reads per lane was added to the alignment landscape in this step. The final percent of unambiguously assigned reads was ~90% for each of the six lanes.
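
The first two rounds can be sketched in Python as follows. This is a simplified illustration, not the pipeline used in this work: a hit is treated as non-ambiguous when exactly one tag in the table satisfies the criterion, which stands in for the DCj(k)-based labeling, and the function and argument names are hypothetical.

```python
def mismatches(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def assign_read(read, tags, trim_lengths=(15, 14, 13)):
    """Return (tag, round) for a non-ambiguous hit, or (None, None).

    Round 1: full-length comparison, at most one mismatch.
    Round 2: 3'-trimmed read, exact match against the tag prefix.
    """
    hits = [t for t in tags if mismatches(read, t) <= 1]
    if len(hits) == 1:
        return hits[0], 1
    if hits:                       # several candidates: left as ambiguous
        return None, None
    for k in trim_lengths:         # trim one base at a time from the 3' end
        hits = [t for t in tags if t[:k] == read[:k]]
        if len(hits) == 1:
            return hits[0], 2
        if hits:                   # shorter prefixes can only add candidates
            break
    return None, None

# Hypothetical 17-mer tag table and reads.
tags = ["ACGTACGTACGTACGTA", "TTTTCCCCGGGGACGTC"]
print(assign_read("ACGTACGTACGTACGTT", tags))   # one mismatch
print(assign_read("TTTTCCCCGGGGACGAA", tags))   # two 3' mismatches, e.g. polyA
```

The second read fails the one-mismatch filter because its last two bases differ from the tag, but it is recovered once those bases are trimmed, mirroring the polyA scenario described in the 2nd round.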

2.2 Gene assignment and normalization for DGE mappable sets: The whole-genome landscape of non-ambiguous hits produced by the previous alignment approach (Supplemental Figure 1A) was used to estimate transcript abundances. First, the correspondence between NlaIII mappable restriction sites and annotated genes was established. Restriction sites located in overlapping regions of several genes were filtered out. On average, there are more than 8 restriction sites per gene (median = 6) and 611 models lack an NlaIII site. According to the DGE library preparation protocol, one should expect a single tag per gene (or more exactly per polyadenylation site) corresponding to the most 3’ restriction site. However, the enzymatic digestion is shown to be incomplete (Supplemental Figure 1BC), so we need to consider all of the mappable restriction sites overlapping the gene sequence. Even in this case, DGE still provides one single tag per cDNA molecule. Moreover, as shown in Supplemental Figure 1BC, we typically found DGE signal from both the forward and reverse strand for a given restriction site. The origin of this “anti-sense” signal is uncertain, but it clearly correlates with the intensity of the sense signal (data not shown). Both
true anti-sense transcription and sequencing slips might contribute to this anti-sense signal. The total number of counts per transcript was computed by pooling the counts of all the non-ambiguous restriction sites in the corresponding gene strand. On average, 74% of the non-ambiguous hits were assigned to annotated transcripts (~66% of the total). The remaining hits are most likely located at 3’ UTRs of poorly annotated genes. Differential expression and reproducibility analysis was carried out in terms of absolute number of counts, while expression levels are reported as transcript relative abundance estimates (TRAs) averaged across replicates from the same condition:

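$$T^{1}_{a(i)j} = \left( \frac{c\,H^{1}_{ij}}{H_{i}} \right)_{a(i)}$$

where $c = 10^{6}$, the subscript $a(i)$ denotes the average over the lanes of condition a(i), and $T^{1}_{a(i)j}$ is reported in units of counts per million (CPM, see Supplemental Dataset 1).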
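
A small Python sketch of this normalization is given below. It assumes that the lane total $H_i$ can be taken as the sum of the counts supplied for that lane, which is a simplification made only to keep the example self-contained; the data are invented.

```python
def tra_cpm(counts_by_lane, gene, c=1e6):
    """Average over replicate lanes of c * H1_ij / H_i for one gene.

    counts_by_lane: list of dicts (gene -> unique-hit count), one per lane
    of the same condition. H_i is approximated here by the sum of the
    counts in the lane (an assumption of this sketch).
    """
    per_lane = []
    for counts in counts_by_lane:
        total = sum(counts.values())
        per_lane.append(c * counts.get(gene, 0) / total)
    return sum(per_lane) / len(per_lane)

# Two hypothetical replicate lanes of one condition.
lanes = [{"g1": 10, "g2": 90}, {"g1": 30, "g2": 70}]
print(tra_cpm(lanes, "g1"))
```

Normalizing each lane before averaging keeps a deeply sequenced replicate from dominating the estimate.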

2.3 Reproducibility and differential expression: Six lanes, corresponding to three different experimental conditions, were analyzed making use of DGE. For each condition, the replicates are true biological replicates. It is accepted that for high-quality NGS data, lane or flow cell effects are negligible compared to experimental variability (Bullard et al., 2010). As a matter of fact, qq-plots for different biological replicates of the same experiment show a noticeable overrepresentation of small p-values (Supplemental Figure 2A), while expression levels exhibit good reproducibility for moderately and highly expressed genes (Supplemental Figure 2B). On the other hand, we observed that pooling the counts of experimental replicates increases sensitivity and reproducibility for lowly and moderately expressed genes (Supplemental Figure 2CD). The digital nature of the data has two related effects on the analysis: first, the number of counts is bounded by 0 and the counts for the most abundant gene at a given sequencing depth, but does not saturate, in the sense that the technique does not introduce any extrinsic cut-off to the minimum and maximum number of counts. This contrasts with microarrays where expression levels are inferred from hybridization intensities, which are intrinsically bounded and made noisy by optical detection and saturation. The noiseless character of sequence data is manifested in the p-value distribution in the form of an overrepresentation of big (close to 1) p-values (Supplemental Figure 2E), or in other words, a high number of genes with no or very small variance. In practice, we need to correct or filter the p-value distribution in order to apply the mixture model approach for the computation of false discovery rates (see above). Supplemental Figure 2E shows the results of the mixture model fitting to an unfiltered distribution of p-values. When a combination of a uniform and a beta function is used, the shape of the p-value distribution produces a misleading model, where the posterior probability of differential expression is high for p-values close to 1. There are a number of ways to correct for this. First, it is possible to use a more complicated model with more than one beta function as implemented in (Allison et al., 2002) (Supplemental Figure 2EF). However, according to our results, this alternative tends to overfit the data and increase the estimated number of differentially expressed genes. It could be possible to modify this method by forcing the model to contain two beta functions with appropriate values for the parameters to fit each of the peaks of the distribution (small and big p-values) and deduce the formulas for PPDE(p) accordingly (so that the big p-value function would contribute to the estimated fraction of non-regulated
genes). However, a more natural way is to filter the genes first, where those with a number of counts smaller than a predefined value are labeled as absent (and potentially less relevant from a biological point of view). Supplemental Figure 2F shows an example of this approach, which is used throughout this study to address posterior probabilities of differential expression.

3. RNA-Seq analysis

3.1. Simulated RNA-seq libraries: We made use of a random but simplified simulation of the main steps of the RNA-seq protocol. In this simulation, the resulting sequence data lack most of the sequencing biases that are present in real libraries, but the simulation is still a convenient tool for hypothesis/methods testing. These biases result from physical properties of the flow cell or sequence context during cDNA synthesis (Hansen et al., 2010). The main steps in the simulation are: mRNA generation and random sampling, random fragmentation, size selection and sequencing. For mRNA generation, expression levels (TRAij) from previous estimates are used to compute the initial number of mRNAs per gene i in a given experiment j as

$$N^{copies}_{ij} = A\,(TRA_{r(i)j} + 1)$$

where one pseudocount is added so a minimum of A mRNAs is generated for each gene, and A is an amplification constant. For validation purposes, we typically randomize the gene index (denoted by r(i) above) in order to remove any existing bias in the generation of TRA values, while keeping the global distribution intact. Also, if pairwise comparisons are involved, the same randomization is used across different lanes or experiments to preserve the same pattern of experimental variability. This first big pool of mRNAs is then subjected to random sampling to model a desired sequencing depth (Supplemental Figure 13). For the random fragmentation step we make use of a one-dimensional random chain of fragmenters at a predefined density (1/100 in this work). The chain is long enough (several million units) to avoid correlation between fragmentations of mRNAs from the same gene. Each mRNA copy is assigned a random position in the chain, and this determines the fragmentation points.
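
The mRNA generation and fragmentation steps can be sketched in Python as below. The chain length, seed and transcript length are arbitrary illustrative values, the rounding of the copy number to an integer is an assumption, and the function names are hypothetical.

```python
import random

def n_copies(tra, amplification):
    """Ncopies = A * (TRA + 1): at least A molecules per gene."""
    return int(round(amplification * (tra + 1)))

def make_chain(length, density, rng):
    """Random chain of fragmenters: True marks a cut site."""
    return [rng.random() < density for _ in range(length)]

def fragment(mrna_length, chain, rng):
    """Place one mRNA copy at a random chain offset; return fragment lengths."""
    offset = rng.randrange(len(chain) - mrna_length + 1)
    cuts = [i for i in range(1, mrna_length) if chain[offset + i]]
    bounds = [0] + cuts + [mrna_length]
    return [b - a for a, b in zip(bounds, bounds[1:])]

rng = random.Random(0)
chain = make_chain(10000, 1 / 100, rng)   # much shorter than in the text
frags = fragment(1500, chain, rng)
print(n_copies(2.5, 10), len(frags), sum(frags))
```

Because every copy lands at an independent offset, two copies of the same transcript are cut at different positions, which is the decorrelation the long chain is meant to provide.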

For size selection, each fragment f of length L is assigned a random number r(f) from a uniform distribution which is compared against a (non-normalized) Gaussian distribution

$$\hat{N}(x,\mu,\sigma^{2}) = e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$$

with mean $\mu = 100$ and standard deviation $\sigma^{2} = 10$. The fragment is selected if $r(f) < \hat{N}(L,\mu,\sigma^{2})$. The resulting sample has a size distribution similar to $\hat{N}(x,\mu,\sigma^{2})$.

Fragments selected from the previous step are used to generate the final set of sequences. First, for a single-end library of k cycles, one of the ends is selected at random. Then, the first k bases are selected and “sequenced” with a predefined error rate function $e(x),\ x \in [1,k]$. The nucleotide substitution is random, meaning there is no preferred substitution or bias in base calling (but it can be easily modified to account for a predefined matrix of transition probabilities). Again, for each read and each base, we assign a random number from a uniform distribution r(fx), and a nucleotide substitution is made if r(fx)<e(x).
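
The size selection and sequencing steps can be sketched in Python as follows. Two points are assumptions of this sketch: the width parameter is passed through as `sigma2` exactly as it appears in the Gaussian above, and reading from the opposite end of a fragment is modeled as reading its reverse complement. The error-rate function in the demo is invented.

```python
import math
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def accept_fragment(length, mu, sigma2, rng):
    """Keep a fragment if a uniform draw falls below the unnormalized Gaussian."""
    weight = math.exp(-((length - mu) ** 2) / (2 * sigma2))
    return rng.random() < weight

def sequence_read(fragment, k, error_rate, rng):
    """Read the first k bases from a randomly chosen end.

    error_rate(x) is the substitution probability at cycle x (1-based);
    a substituted base is drawn uniformly from the three other bases.
    """
    if rng.random() < 0.5:
        fragment = reverse_complement(fragment)
    read = []
    for x, base in enumerate(fragment[:k], start=1):
        if rng.random() < error_rate(x):
            base = rng.choice([b for b in "ACGT" if b != base])
        read.append(base)
    return "".join(read)

rng = random.Random(42)
fragment = "ACGTTGCAAGGCTTAACCGGATCGATCGTTAGC"
print(accept_fragment(len(fragment), 100, 10, rng))
print(sequence_read(fragment, 10, lambda x: 0.01 * x / 10, rng))
```

Replacing the uniform choice among the three alternative bases with a draw from a transition matrix would give the biased substitution model mentioned in the text.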

3.2 Mapping reads to the genome: We performed a hierarchical genomic-based alignment. In terms of the notation introduced above, for each read, we tried sequentially to obtain a non-empty k-hit with the following parameters:
for kmax = read length = 33: K = [33 30 29 … 21 = kmin]; m(k) = [2 0 0 … 0];
for kmax = 35: K = [35 32 29 … 21 = kmin]; m(k) = [2 0 0 … 0];
for kmax = 50: K = [50 47 29 … 25 = kmin]; m(k) = [2 0 0 … 0];
given that we have computed DCj(k) for all k in K. In each step, reads yielding a non-empty k-hit are filtered out for the next step. For each read, two trimmed reads are generated by removing the required number of bases from both the 5’ and 3’ end, and their alignments are pooled together. Alignments were performed using SOAP (Li et al., 2008) and Novoalign (http://www.novocraft.com).
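
The trimming schedule can be sketched in Python as below. The aligner is abstracted as a user-supplied function `find_hits(seq, max_mismatches)`, which is hypothetical and stands in for SOAP or Novoalign; the schedule and the toy genome in the demo are illustrative and are not the parameter sets listed above.

```python
def trimmed_variants(read, k):
    """Two k-long versions of a read: one trimmed at the 3' end, one at the 5' end."""
    if k >= len(read):
        return [read]
    return [read[:k], read[len(read) - k:]]

def hierarchical_hits(read, schedule, find_hits):
    """Try each (k, max_mismatches) step in order; stop at the first non-empty k-hit.

    find_hits(seq, max_mismatches) must return a list of genomic positions.
    """
    for k, m in schedule:
        hits = []
        for seq in trimmed_variants(read, k):
            hits.extend(find_hits(seq, m))    # pool 5'- and 3'-trimmed alignments
        if hits:
            return k, hits
    return None, []

# Toy exact-match "aligner" over an invented genome (ignores mismatches).
genome = "AAACCCGGGTTTACGTACGT"
def find_hits(seq, max_mismatches):
    n = len(seq)
    return [i for i in range(len(genome) - n + 1) if genome[i:i + n] == seq]

print(hierarchical_hits("CCCGGGTTTGGG", [(12, 2), (9, 0), (6, 0)], find_hits))
```

In the demo the full-length read has no hit, while its 3'-trimmed 9-mer aligns, which is the situation expected for reads that run across an exon-intron boundary.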

The previous parameters were selected based on the low error rate of the libraries (so a significant fraction of the misalignments for k=kmax are due to exon-intron boundaries) and the fact that for k<21 the overall confusability of the genome is high (Supplemental Figure 14) and the precise alignment of very short reads is slow. Summary statistics of the previous alignment steps can be found in Supplemental Dataset 2. On average, a non-empty k-hit was found for 92% of the reads, and a non-ambiguous one for 84%. The percent of sequences recovered from the trimming step can be as high as 30% for 50 nt libraries. Supplemental Figure 15 shows the relationship between the amount of sequencing recovered in the trimming step as compared to the average exon length of the gene for a simulated library (see next section). As expected, there is a significant improvement in the alignment of sequences across exon boundaries, which might be critical for the detection of small exons. Moreover, we expect that expression estimates and reproducibility for genes with low expression or a high number of exons will benefit significantly from this additional alignment step. Supplemental Figure 16 shows a histogram of fold changes for two simulated libraries of the same condition, where the height of the central bin shows that reproducibility is improved after the recursive alignment round.

3.3. Ambiguous hits assignment: A number of different approaches have been proposed for the allocation of ambiguous alignment hits (Taub et al., 2010). In our case a major limitation for the application of those methods is the incompleteness and inaccuracy of the annotation, as expression levels of annotated transcripts from non-ambiguous hits are typically used as the first estimate of the true solution.

Here, we have followed a simple allocation/disambiguation method prior to transcript assignment that makes use of the local coverage from unambiguous hits as a first estimate. Briefly: 1) For each experiment, whole genome coverage vectors are computed from the set of unambiguous hits. These are to be used as the first estimate of the solution. The choice of coverage instead of alignment hits is based on the observed unevenness of the hits landscape, most likely due to amplification bias during random hexamer priming, see (Hansen et al., 2010). The point coverage function of a transcribed base, while highly nonuniform, is a better representation of the average expression level of a particular locus. 2) Ambiguous hits are sorted by their cardinality (number of putative locations), so that the allocation steps below are implemented in increasing steps of ambiguity (in general, one would expect that the allocation of a multihit of size 2 is less involved than one of size 10). On the other hand, genomic repeated regions usually comprise a mixture of diverse multiplicities, so each allocation step could help to allocate neighboring hits of higher ambiguity. In practice, we use an upper limit for the number of alignments reported for each read (1000 in our case) therefore multiplicity values greater than that are discarded thereafter. 3) For each cardinality value and each hit hrk, we score each putative location x in hrk as follows. The coverage vectors [Cu(x,w),Cd(x,w)] for two windows of size w upstream and downstream of each location x are retrieved. The score is then computed as

$\Sigma(x) = \min\left(\max(C_u(x,w)),\ \max(C_d(x,w))\right)$. This particular choice is based on the following scenario: a base belonging to a region (e.g. an exon) with null or low coverage from the unambiguous hits gets a
score from the coverage in its surroundings. The size of the window has to be big enough to span nearby introns (average size 373 bp), and because of the low or null coverage in a significant fraction of this neighborhood we choose the maximum in each window as a representative value (instead of, for example, the mean coverage). In some cases, one of the windows will overlap with an adjacent gene of, most likely, different expression level. The selection of the minimum of both maxima increases the likelihood that the score is selected from a base of the same gene. We have tested different values of w and the solutions are stable around w=1000, the value we have used for the data in this work. 4) The previous set of scores $\sigma = \{\Sigma(x),\ x \in h_{rk}\}$ can be used for the disambiguation of hrk in several ways. For example, if the read r is found a number of times F in the library, a proportional assignment method will assign, to each x in hrk, a number of reads f(x) given by f(x)= σ(x)/sum(σ)*F.
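
The scoring and the deterministic proportional assignment can be sketched in Python as follows. The coverage vector is represented as a plain list indexed by genomic position, the windows exclude the position x itself, and leaving a read unallocated when all scores are zero is a choice of this sketch rather than something stated in the text.

```python
def location_score(coverage, x, w):
    """min of the maximum coverage in the windows of size w up- and downstream of x."""
    upstream = coverage[max(0, x - w):x]
    downstream = coverage[x + 1:x + 1 + w]
    return min(max(upstream, default=0), max(downstream, default=0))

def proportional_allocation(scores, copies):
    """f(x) = sigma(x) / sum(sigma) * F for each candidate location x."""
    total = sum(scores.values())
    if total == 0:
        return {x: 0.0 for x in scores}   # no prior information (sketch choice)
    return {x: s / total * copies for x, s in scores.items()}

# Toy coverage from unambiguous hits and a read seen F = 8 times.
coverage = [0, 5, 0, 0, 2, 0, 0, 0, 0, 1, 0]
print(location_score(coverage, 3, 2))
print(proportional_allocation({10: 3, 50: 1}, 8))
```

Taking the minimum of the two window maxima means that a location flanked by coverage on one side only, such as a base next to a neighbouring gene, receives a low score.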

We have compared this deterministic approach with a probabilistic (yet proportional) assignment routine, where the cumulative sum of σ is normalized to the unit interval and, for each single occurrence of the read r, a random number from a uniform distribution selects the position x in hrk for allocation. Although both methods provide very similar results, the probabilistic version seems to converge faster to a self-consistent solution (data not shown), and is the method used for the results presented in this paper. Supplemental Figure 17A (top) shows mean-difference scatterplots for the number of hits per gene using simulated data. The number of hits is compared against the exact value before and after disambiguation. A significant number of genes is closer to the exact solution after disambiguation across the entire expression range (Supplemental Figure 17A (bottom), where both the 25th and 75th quantiles of the fold changes after disambiguation are closer to 0). The allocation method fails for genes with very low mappability, and clearly overestimates the number of hits for some of them (negative fold changes in the mean-difference scatterplots). The problem in this case results from the lack of prior information for these genes. Finally, Supplemental Figure 17B shows an example of disambiguation in a single locus from a real library. Expression estimates based on this allocation algorithm are provided in Supplemental Dataset 2.
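
A minimal Python sketch of the probabilistic variant is shown below. It draws one location per occurrence of the read from the cumulative sum of the scores; the function name, the seed and the example scores are hypothetical, and the handling of an all-zero score set is an assumption.

```python
import bisect
import itertools
import random

def probabilistic_allocation(scores, copies, rng):
    """Allocate each of `copies` occurrences of a read to one candidate location.

    Locations are drawn with probability proportional to their score by
    inverting the cumulative sum of the scores.
    """
    locations = list(scores)
    cumulative = list(itertools.accumulate(scores[x] for x in locations))
    counts = {x: 0 for x in locations}
    total = cumulative[-1]
    if total == 0:
        return counts                       # nothing to guide the allocation
    for _ in range(copies):
        u = rng.random() * total
        idx = min(bisect.bisect_right(cumulative, u), len(locations) - 1)
        counts[locations[idx]] += 1
    return counts

rng = random.Random(1)
print(probabilistic_allocation({10: 3.0, 50: 1.0}, 8, rng))
```

Unlike the deterministic rule, every occurrence is assigned to exactly one location, so the per-location counts are integers that fluctuate around the proportional values.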

3.4. Definition of mappable sets: The assignment of hits to annotated transcripts and their normalization is described in the next sections. However, some classes of hits overlapping annotated coding sequence need to be discarded: 1) hits in regions of overlap between different genes, 2) ambiguous hits (or those with multiplicity higher than the maximum used for allocation), and 3) hits overlapping exon-intron boundaries. The set of genomic positions linked to the previous types of hits can be identified from the annotation and the genome sequence. In other words, a natural approach to the assignment problem is to work out first, for each gene j, the group of bases that are accessible or can be unambiguously sampled with short reads, which we denominate here as the mappable set ($\mathcal{M}^{q}_{jk}$). The size or cardinality of this set, $M^{q}_{jk} = |\mathcal{M}^{q}_{jk}|$, is the so-called mappable length or mappability, and acts as an effective transcript length for normalization purposes. As explained below, the notation here emphasizes again the dependency on the length of the read k and the choice made for the allocation of ambiguous hits (q).

A small number of sequential adjustments on the annotated transcripts is required in order to yield, for each transcript, the final family of mappable sets $\{\mathcal{M}^{q}_{jk}\}$. 1) Fuzzy boundaries: coverage plots at exon-intron boundaries are usually seen to decay a few bases downstream of the annotated junction (Supplemental Figure 18A). The reason for this can be found in the sequence similarity between the 5’ ends of the intron and the next exon, and the extent of it depends on the number of mismatches m(k)
allowed in the alignment step (0 or 2 in our case, depending on the value of k). Therefore, we replace each annotated exon-intron junction by its 0 or 2 mismatch variants (Supplemental Figure 18A). 2) 3’-end correction for exons: k-hits located in a window of size $k-1$ upstream of the 3’ end of each exon would overlap intron sequence. Therefore, all genomic bases from these windows are removed. For k=kmax, the 2 mismatches junction variant is used. For k<kmax, the 0 mismatch variant is now split in an ensemble of k-junctions, so that the effective size of the resulting exon is a function of k (Supplemental Figure 18B). This correction does not impact all genes equally, as it is proportional to the number of exons (but not necessarily to the transcript length). 3) Overlapping and multiplicity correction: regions of overlap between two different gene models $(j_1, j_2)$ were removed from $\mathcal{M}^{q}_{j_1 k}$ and $\mathcal{M}^{q}_{j_2 k}$ independently for each value of $k$. Finally, from the computation of DCj(k) (that involves the reckoning of the number of copies of each k-mer in the genome), we can filter out those bases with a multiplicity equal or bigger than q.

This way, from the point of view of short-read transcriptome sampling, each gene is seen as a family of mappable sets for a given multiplicity level q, $\{\mathcal{M}^{q}_{jk},\ k \in K\}$, or q-mappable sets. The set of cardinalities $\{M^{q}_{jk} = |\mathcal{M}^{q}_{jk}|,\ k \in K\}$ is denominated q-mappable lengths (or q-mappability). $\mathcal{M}^{q}_{jk}$ might be an empty set below some threshold for k. In the simplest case, the unique hits filter would then yield the $\{\mathcal{M}^{1}_{jk}\}$ family of uniquely mappable sets.
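
The dependence of the mappable length on k can be illustrated with a short Python sketch. It covers only the 3’-end correction and a generic set of removed positions; the fuzzy-boundary variants are not modeled, the correction is applied to every exon including the last one, and the names and coordinates are hypothetical.

```python
def mappable_positions(exons, k, blocked=frozenset()):
    """Start positions of k-mers lying entirely within one exon.

    exons: list of (start, end) half-open genomic intervals for one gene.
    blocked: positions removed for gene overlap or multiplicity >= q
             (assumed to be precomputed elsewhere).
    """
    positions = set()
    for start, end in exons:
        # Drop the last k-1 bases: a k-hit starting there would leave the exon.
        positions.update(range(start, end - k + 1))
    return positions - set(blocked)

def mappable_length(exons, k, blocked=frozenset()):
    """Cardinality of the mappable set for read length k."""
    return len(mappable_positions(exons, k, blocked))

exons = [(0, 100), (200, 230)]     # a 100 bp and a 30 bp exon
for k in (35, 21):
    print(k, mappable_length(exons, k))
```

In this toy gene the 30 bp exon contributes nothing at k=35 but ten positions at k=21, which shows why genes with many short exons benefit most from the shorter k values of the hierarchical alignment.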

3.5. Gene assignment and normalization for transcript relative abundance: Once the concept of mappable set has been introduced, the assignment and normalization of hits is straightforward. For each set of lanes from the same experiment, we quantified the number of k-hits associated with each mappable set (j,k), $H^{q}_{a(i)jk}$. The partial totals for each k are normalized using the corresponding mappability, and summed over all k values using the total number of aligned sequences $H_{a(i)}$ (a measure of sequencing depth) as a global scaling factor. The Transcript Relative Abundance $T^{q}_{a(i)j}$ (TRA) is expressed in units of reads per kilobase of mappable length per million mapped reads (RPKM), where the constant factor $c$ is equal to $1\times10^{6}$ (see (Mortazavi et al., 2008)).

$$T^{q}_{a(i)j} = \frac{c}{H_{a(i)}} \sum_{k \in K} \frac{H^{q}_{a(i)jk}}{M^{q}_{jk}}$$
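
The normalization can be written as a short Python function. In this sketch the mappable lengths are assumed to be supplied in kilobases, so that c = 10^6 gives values per kilobase per million; empty mappable sets are skipped, and all numbers in the demo are invented.

```python
def tra(hits, mappability, total_aligned, c=1e6):
    """T = (c / H) * sum over k of hits[k] / mappability[k].

    hits: dict k -> number of k-hits assigned to the gene.
    mappability: dict k -> mappable length of the gene for read length k
                 (assumed to be in kilobases in this sketch).
    total_aligned: total number of aligned sequences H for the experiment.
    """
    acc = 0.0
    for k, h in hits.items():
        m = mappability.get(k, 0)
        if m > 0:                  # an empty mappable set cannot carry hits
            acc += h / m
    return c / total_aligned * acc

print(tra({35: 700, 32: 50}, {35: 1.4, 32: 2.0}, 2_000_000))
```

Normalizing each partial total by its own mappability before summing keeps hits recovered at shorter k from being divided by a length that was only available at kmax.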

3.6. Additional corrections—Imputation and censored correction: A total of 28 lanes were sequenced for a set of 11 single experiments. As is mentioned in the main text, for samples that were previously analyzed by DGE we found that the reproducibility between biological replicates was very high, so they were pooled into a single library for RNA-Seq sequencing. The remaining samples (Supplemental Dataset 2) are true biological replicates.

For pooled libraries, we found that expression estimates are highly reproducible above some mean expression level albeit small p-values are again slightly overrepresented. However, the same comment applies to simulated libraries, so lane to lane variations are likely due to stochastic fluctuations for low expression genes and the intrinsic overrepresentation of small p-values for moderately and highly expressed genes. Not surprisingly, the reproducibility is worse when true biological replicates are compared, so additional filters for low expression genes are applied in order to minimize the false positives in the final target gene set (see below). The arguments introduced in the section on DGE tags concerning pooling counts for replicates of the same condition apply here as well.

A good fraction of the experimental variability found in expression estimates affects genes with low or intermediate expression, and can be partially accounted for by the random sampling of the original mRNA pool which is subsequently processed at high (but not infinite) sequencing depth (see above). Below, we describe the methods we have used to alleviate the influence of sequencing depth in our expression estimates.

Following the previous discussion, Supplemental Figure 19AB shows the cumulative distribution function (CDF) of the hits and TRAs distribution respectively, for lanes with different sequencing depth. The effect that missing values have on the overall shape of the distribution is apparent, and this impacts primarily genes with low or intermediate expression estimates. We apply two additional corrections to $T^{q}_{a(i)j}$.

A) Simulation-assisted imputation. Given the first estimate $T^{q}_{a(i)j}$ for each experiment, we build a set of simulated libraries from a high concentration of mRNAs as described before and matching the same sequencing depth as obtained for the real libraries. We can then compute the detection probability of having an estimated expression level equal to zero given the exact expression level. Genes with no expression in any experiment are removed, as no prior information can be used in this case. For genes with missing entries in at least one experiment ($T^{q}_{a(i)j} = 0$ for some $i$, but $\sum_{i} T^{q}_{a(i)j} \neq 0$), we impute a value given by the weighted mean of the non-missing values, where the weights are the detection probability of the gene in each experiment. This way, genes with a low expression value across the entire expression matrix are imputed with a low but comparable expression level. In contrast, for transcriptionally repressed genes (those showing a significant expression in some experiments but zero in another condition), the method imputes a marginal expression. After imputation, $T^{q}_{a(i)j}$ is renormalized again to keep the units of estimated mRNAs per million. One of the side effects of this imputation is the regularization of fold changes as it avoids division by zero, providing a lower bound to the fold change of transcriptionally repressed genes.

B) Censored-correction: Supplemental Figure 19C shows the effect of the previous regularization on the CDF of $T^{q}_{a(i)j}$. It is clear that the effect of finite sequencing is still perceptible. However, for each experiment $i$, $T^{q}_{a(i)j}$ can be seen as a series of observations where some values were not quantified because of technical limitation. In other words, the distribution of $T^{q}_{a(i)j}$ can be seen as a censored distribution (Lawless, 2002). Therefore, we performed a simple modification where the previously imputed values are considered as censored cases. In short, the empirical CDF can be estimated making use of the Kaplan-Meier estimator, which is a non-parametric approach to the actual distribution (see (Lawless, 2002)). The resulting estimate is linearly interpolated to the values of the original CDF for $T^{q}_{a(i)j}$ and renormalized to yield a matrix of corrected TRAs (denoted as $T^{\infty q}_{a(i)j}$) to a situation of high sequencing depth (e.g. no missing values). The result is illustrated in Supplemental Figure 19D, where the CDF for $T^{\infty q}_{a(i)j}$ of the same set of libraries collapses into a similar distribution, facilitating the comparison of expression levels at different levels of sequencing depth.

3.7. Simulation-assisted selection of expression thresholds: The previous corrections provide a regularization of the low expression tail of the expression distribution and induce a minimal modification for genes with intermediate and high expression (where reproducibility is observed to be very high). However, they do not account for the variance introduced by random sampling, which is high for low expression genes and might confound any downstream analysis. Therefore, the final set of target genes for comparisons is filtered to remove genes with low abundance transcripts. While such genes might still be relevant, the filtering decreases the likelihood of false positives.

Nevertheless, in most applications, the selection of an expression threshold is arbitrary, even though it depends strongly on the amount of sequencing (Supplemental Figure 13). Therefore, we make use of simulated libraries to select an expression threshold on a per-experiment basis, based solely on the observed reproducibility at different levels of sequencing depth. Briefly, for each experiment we run several simulations from the estimated $T^{\infty q}_{a(i)j}$, but at a much higher depth of coverage (typically around 15-20 million sequences, but always higher than the maximum in the real dataset). We run the entire analysis pipeline on the simulations to obtain the corresponding matrices of expression levels and compute the distribution of fold changes relative to the real data. Typically, for expression values lower than 7-10 RPKM, the 95th percentile of the absolute fold changes was higher than two-fold, so that value (unique for each library, and generally similar to the median of the distribution of expression estimates) was applied as the lower boundary for the selection of the final set of genes for further analysis.
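The threshold selection can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the pipeline used here: reads are drawn by simple multinomial sampling from a relative-abundance vector, the full alignment and allocation steps are skipped, and the cutoff is evaluated on all genes at or above each candidate value rather than on expression bins. The names `simulate_library`, `percentile`, and `select_threshold` are hypothetical.

```python
import math
import random
from collections import Counter


def simulate_library(abundance, depth, rng):
    """Draw `depth` reads, each assigned to a gene with probability
    proportional to its abundance; return abundances per million reads."""
    genes = list(abundance)
    weights = [abundance[g] for g in genes]
    counts = Counter(rng.choices(genes, weights=weights, k=depth))
    return {g: counts.get(g, 0) * 1e6 / depth for g in genes}


def percentile(values, q):
    """Nearest-rank percentile (q between 0 and 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def select_threshold(reference, simulated, candidates,
                     max_log2_fc=1.0, pseudo=1e-9):
    """Smallest candidate cutoff for which genes with reference expression
    at or above the cutoff have a 95th-percentile |log2 fold change|
    no larger than max_log2_fc (1.0 corresponds to two-fold)."""
    for cutoff in sorted(candidates):
        fold_changes = [
            abs(math.log2((simulated[g] + pseudo) / reference[g]))
            for g in reference
            if reference[g] >= cutoff
        ]
        if fold_changes and percentile(fold_changes, 95) <= max_log2_fc:
            return cutoff
    return None  # no candidate satisfies the reproducibility criterion
```

In practice one would repeat the simulation several times per experiment and pool the fold changes before choosing the cutoff; a single call is shown only to keep the sketch short.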

4. Promoter analysis

The differential expression analysis explained in the main paper yielded a set of 63 genes that are potential targets of the CRR1 transcription factor. To assess the enrichment of their promoters in binding motifs, we performed a k-mer frequency analysis on manually curated promoters. We built a 2D enrichment table by independently varying two parameters: promoter length L (between 100 and 2000) and k-mer size k (from 4 to 8). For each pair (L,k), we selected a random background set of promoters from the entire transcriptome and computed frequency differences for every k-mer between the target and background sets. A total of 100 different randomizations were used to compute average frequency differences along with the number of times a given k-mer scored in the top 10. For k=4, the GTAC motif ranked first across all L values in terms of top-10 occurrences, and competed in frequency difference with the GC-rich motif (GCGC). Supplemental Figure 10A shows a similar set of results for k=6, where three related motifs with a GTAC core have the highest enrichment: G-GTAC-C, CG-GTAC and GTAC-CG. Moreover, the motif G-GTAC-A scored in the top 5 mostly for short promoters, while GC-rich motifs compete in enrichment for promoters of intermediate size. Not surprisingly, for short promoters and k=8, the motif CG-GTAC-CG showed the top scores, followed by some GTAC-core motifs.
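The k-mer enrichment procedure for a single (L,k) pair can be sketched as follows. This is a minimal Python illustration, assuming that each promoter is given as a string oriented 5' to 3' and ending at the transcription start site, so that the proximal L bases are the last L characters, and that the background set has the same size as the target set. The function names and these conventions are assumptions of the example, not part of the original analysis.

```python
import random
from collections import Counter


def kmer_frequencies(promoters, length, k):
    """Relative k-mer frequencies over the last `length` bases of each promoter."""
    counts = Counter()
    for seq in promoters:
        window = seq[-length:]
        for i in range(len(window) - k + 1):
            counts[window[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def enrichment(targets, all_promoters, length, k,
               n_random=100, top=10, seed=0):
    """Average target-minus-background frequency difference per k-mer, and
    the number of randomizations in which each k-mer scores in the top `top`."""
    rng = random.Random(seed)
    target_freq = kmer_frequencies(targets, length, k)
    mean_diff = Counter()
    top_hits = Counter()
    for _ in range(n_random):
        # Random background set drawn from the whole promoter collection.
        background = rng.sample(all_promoters, len(targets))
        bg_freq = kmer_frequencies(background, length, k)
        diffs = {
            kmer: target_freq.get(kmer, 0.0) - bg_freq.get(kmer, 0.0)
            for kmer in set(target_freq) | set(bg_freq)
        }
        for kmer, d in diffs.items():
            mean_diff[kmer] += d / n_random
        for kmer in sorted(diffs, key=diffs.get, reverse=True)[:top]:
            top_hits[kmer] += 1
    return mean_diff, top_hits
```

Calling `enrichment` once for each combination of L and k would fill the 2D table described above.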

The previous results seem to indicate that there are some enriched residues in the flanking sequences of the GTAC core. However, only the motif obtained from proximal promoters (up to ~300bp) shows a remarkable deviation from a random sequence (Supplemental Figure 10B). We have obtained a clear overrepresentation of GTAC sites even for long promoters, but the high occurrence of GTAC motifs in proximal promoters might dominate the frequency analysis for high L values. On the other hand, we do not expect every site to be functional or have the same binding affinity. On top of that, there seems to be some positional bias both for GTAC and GC-rich motifs.

We measured the fraction of promoters with GTAC sites in windows of 100 bases along the promoter. When compared to whole-transcriptome ratios, the overrepresentation seems to occur only in proximal promoters (~250 bp) and in two regions centered at 850 and 1300 bp upstream of the transcription start site (TSS). The corresponding motifs for these three regions show enrichment in different residues, with the G-GTAC-C site dominating for proximal promoters and more complex motifs (with enrichment in residues not flanking the core) further upstream of the TSS.
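A windowed count of this kind can be sketched in Python as shown below. The example assumes promoters are strings ending at the TSS and counts only motif occurrences that lie entirely inside a window; sites spanning a window boundary are ignored. The function name `motif_window_fractions` and these conventions are assumptions of the sketch.

```python
def motif_window_fractions(promoters, motif="GTAC", window=100, n_windows=20):
    """Fraction of promoters containing `motif` in each window upstream of
    the TSS. Index 0 is the window closest to the TSS."""
    fractions = []
    for w in range(n_windows):
        hits = 0
        usable = 0
        for seq in promoters:
            end = len(seq) - w * window
            start = end - window
            if start < 0:
                continue  # promoter too short to cover this window
            usable += 1
            if motif in seq[start:end]:
                hits += 1
        fractions.append(hits / usable if usable else float("nan"))
    return fractions
```

Running the same function on the target promoters and on all promoters in the transcriptome gives the two sets of ratios to compare window by window.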

5. Identification of algal proteins in genome databases

5.1. Copper proteins: Known and putative copper-binding proteins in the Chlamydomonas genome were identified from the literature (Merchant et al., 2006; Kim et al., 2008). The sequences of these copper proteins were used in tBLASTn searches against the Chlamydomonas v3 unmasked genome. Predicted copper-binding proteins identified in Arabidopsis (Andreini et al., 2008) were also used as tBLASTn queries. The identifiers of copper-binding InterPro domains were used as keywords in the Advanced Search tool of the Chlamydomonas genome browser. All tBLASTn searches used the following parameters: word size of 3, e-value cutoff of 1e-2, and filtering for low-complexity regions disabled.

5.2. Algal homologs: The protein sequences derived from the Chlamydomonas copper-deficiency responsive gene set were used as queries in BLASTp and tBLASTn searches against the Volvox carteri, Chlorella sp. NC64A, and Chlorella vulgaris draft genome sequences. Searches were conducted through the BLAST function on each organism’s genome browser. The protein sequence of the best hit was used as a query in a BLASTp search against Chlamydomonas proteins to establish a reciprocal best hit relationship. All searches used a word size of 3, an e-value cutoff of 1e-2, and filtering for low-complexity regions disabled.
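The reciprocal best hit criterion can be expressed compactly once the search results are available as (query, subject, e-value) tuples. The Python sketch below assumes such an in-memory layout; the tuple format and function names are hypothetical, and the parsing of BLAST output is not shown.

```python
def best_hits(hits, max_evalue=1e-2):
    """Map each query to its lowest e-value subject, ignoring hits above
    the e-value cutoff. `hits` is an iterable of (query, subject, evalue)."""
    best = {}
    for query, subject, evalue in hits:
        if evalue > max_evalue:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward, reverse, max_evalue=1e-2):
    """Pairs (query, subject) in which each sequence is the best hit of
    the other. `forward` holds searches against the other alga, `reverse`
    the searches of those best hits back against Chlamydomonas proteins."""
    fwd = best_hits(forward, max_evalue)
    rev = best_hits(reverse, max_evalue)
    return sorted((q, s) for q, s in fwd.items() if rev.get(s) == q)
```

Ties in e-value are resolved here by keeping the first hit encountered, which is a choice of this sketch and not a statement about how ties were handled in the searches.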

6. Immunodetection

Proteins were separated on an SDS-denaturing polyacrylamide gel (10% monomer) and transferred in a semi-dry blotter onto PVDF-FL (0.45 µm; Millipore) for 90 min under constant current (400 A) in transfer buffer (25 mM Tris, 192 mM glycine, 0.0004% SDS (w/v), and 20% (v/v) methanol). The membrane was blocked overnight in PBS containing 0.05% (w/v) Tween 20, 0.5% (w/v) bovine serum albumin, 15 ppm ProClin 300, and 0.02% SDS (w/v) and incubated in primary antiserum; this solution was used as the diluent for both primary and secondary antibodies. For washing, PBS containing 0.1% Tween 20 (w/v) was used. The secondary antibody, used at 1:10,000, was goat anti-rabbit IRDye 800 (LI-COR Biosciences). Primary antibodies (anti-FAB2 and anti-HYD1A) were used at a 1:5000 dilution. Bound antibody was detected via the Odyssey Imager (LI-COR Biosciences) using the 800 channel.

SUPPLEMENTAL REFERENCES

Allison, D.B., Gadbury, G.L., Heo, M.S., Fernández, J.R., Lee, C.K., Prolla, T.A., and Weindruch, R. (2002). A mixture model approach for the analysis of microarray gene expression data. Comput. Stat. Data Anal. 39, 1-20.

Andreini, C., Banci, L., Bertini, I., and Rosato, A. (2008). Occurrence of copper proteins through the three domains of life: A bioinformatic approach. J. Proteome Res. 7, 209-216.

Audic, S., and Claverie, J.M. (1997). The significance of digital gene expression profiles. Genome Res. 7, 986-995.

Baldi, P., and Long, A.D. (2001). A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 17, 509-519.

Bullard, J.H., Purdom, E., Hansen, K.D., and Dudoit, S. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 11, 13.

González-Ballester, D., Casero, D., Cokus, S., Pellegrini, M., Merchant, S.S., and Grossman, A.R. (2010). RNA-Seq Analysis of Sulfur-Deprived Chlamydomonas Cells Reveals Aspects of Acclimation Critical for Cell Survival. Plant Cell 22, 2058-2084.

Hansen, K.D., Brenner, S.E., and Dudoit, S. (2010). Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. 38, e131.

Kim, B.E., Nevitt, T., and Thiele, D.J. (2008). Mechanisms for copper acquisition, distribution and regulation. Nat. Chem. Biol. 4, 176-185.

Lawless, J.F. (2002). Statistical Models and Methods for Lifetime Data. (Wiley).

Li, R.Q., Li, Y.R., Kristiansen, K., and Wang, J. (2008). SOAP: short oligonucleotide alignment program. Bioinformatics 24, 713-714.

Marioni, J.C., Mason, C.E., Mane, S.M., Stephens, M., and Gilad, Y. (2008). RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 18, 1509-1517.

Merchant, S.S., Allen, M.D., Kropat, J., Moseley, J.L., Long, J.C., Tottey, S., and Terauchi, A.M. (2006). Between a rock and a hard place: Trace element nutrition in Chlamydomonas. Biochim. Biophys. Acta 1763, 578-594.

Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L., and Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 5, 621-628.

Taub, M., Lipson, D., and Speed, T.P. (2010). Methods for allocating short-reads. Commun. Inf. Syst. 10, 69-82.
