Organisation of the Caenorhabditis elegans small non ... Organisation of the Caenorhabditis elegans

  • View
    2

  • Download
    0

Embed Size (px)

Text of Organisation of the Caenorhabditis elegans small non ... Organisation of the Caenorhabditis elegans

  • Supplementary materials for:

    Organisation of the Caenorhabditis elegans small non-coding

    transcriptome: genomic features, biogenesis and expression

    Wei Deng, Xiaopeng Zhu, Geir Skogerbø, Yi Zhao, Zhuo Fu, Yudong Wang, Housheng He, Lun Cai, Hong Sun, Changning Liu, Biao Li, Baoyan Bai, Jie Wang, Dong Jia, Shiwei Sun, Hang He, Yan Cui, Yu Wang, Dongbo Bu, Runsheng Chen

    Contents

    1. Supplementary figure 1. Expression profiles and clusters

    2. Supplementary figure 2. Distance from intronic ncRNAs to adjacent upstream and downstream exons.

    3. Supplementary figure 3. Relationship between intronic ncRNAs and host gene expression.

    4. Supplementary document 1. Capping probability

    5. Supplementary document 2: Genomic location of the novel ncRNA genes

    6. Supplementary document 3: ncRNA expression analysis

    7. Supplementary document 4: Upstream and internal motifs of the C. elegans noncoding RNA loci

    8. Supplementary document 5. C. elegans ncRNA biogenesis groups

    9. Supplementary table 1. ncRNA biogenesis groups

    10. Supplementary document 6: Estimates of the C. elegans small non-coding RNA number

    11. Supplementary document 7. oligo sequences used in this work.

    12. Supplementary sequences 1. All ncRNA sequences

    13. Supplementary tables 2 & 3. Basic data of ncRNAs and their gene loci used in this study.

  • Supplementary figure 1

    Fig. S-1 Expression patterns for 105 clustered ncRNAs (the figure includes both novel and known ncRNAs). L1s, starved overnight after hatch. L1m, m indicates the developmental midterm between the immediate preceding (L1s) and following (L1) stages. YA, Young Adult. MA, Mature adult. HS, Heat Shock treated (L4 worms at 30 3 hours)℃ . See supplementary document for time intervals.

  • Supplementary figure 2

    Upstream distance

    0

    5

    10

    15

    20

    25

    10 10 0

    10 00 10 10

    0 10

    00

    Downstream distance

    Fr eq

    ue nc

    y Group I Group II Group V

    Fig. S-2. Distance (in bp) from intronic ncRNAs to adjacent upstream and downstream exons. The distances to the upstream exon from Group I and II loci (“motif loci”) are significantly greater than from Group V loci (“non-motif loci”), suggesting that independently transcribed Group I and Group II loci require a upstream region of a minimum length to accommodate core promoter elements. Group V ncRNAs are transcribed from the host gene promoter, and excised from pre-mRNA or processed from intron lariats, thus having a lower requirement for an upstream sequence length. The size of the region downstream of intronic loci seems not correlated to the mode of biogenesis.

  • Supplementary figure 3

    R2 = 0.2271

    R2 = 0.0423

    0

    5

    10

    15

    20

    25

    0 50 100 150 200 250

    Host gene EST hits

    nc R

    N A

    li br

    ar y

    cl on

    es

    Group V

    Group II Trend V

    Trend II

    Fig. S-3. Relationship between intronic ncRNAs and host gene expression. The number

    of clones of each intronic ncRNA species in biogenesis groups II and V (see

    Supplementary table 1) were plotted against the average number of EST found for their

    respective host genes. For Group V ncRNAs, assumed to be excised and processed from

    pre-mRNA or spliced introns, there was a positive relationship between the number of

    host gene EST hits in WormBase (Harris et al., 2003) and the library clone number (R2

    = 0.23). The low R2 value is not surprising given the type and size of data used for the

    plot, however it may be caused by incompletely intron-processed ncRNAs (thus

    accounting for samples with high host gene and low ncRNA values). No positive

    relationship was found for the assumedly independently transcribed Group II ncRNAs.

    Independently transcribed ncRNAs appear to have higher expression levels than

    dependently transcribed ncRNAs, whilst the expression of their host genes expression

    appear be lower than those of co-transcribed ncRNAs.

  • Supplementary document 1: Capping probability

    The library construction procedure included a step to distinguish between 5’ capped and non-capped RNAs. Prior to 5’ end linker ligation, the ncRNAs were split into to aliquots, one treated with PolyNucleotide Kinase (PNK, Fermentas) to phosphorylate noncapped RNA, the other treated with Tobacco Acid Pyrophosphatase (TAP, Epicentre) to remove 5’ end methyl-guanosine caps form capped RNA. This generated two types of libraries, TAP libraries containing mostly 5’ capped ncRNAs, and PNK libraries with mostly non-capped ncRNA For the most known and abundant capped ncRNAs (U1, U2, U4, U5), clone numbers were used to determine the probability that an capped ncRNAs be cloned in TAP or PNK libraries. Among 407 clones, 395 were present/found in the TAP library and 12 reads in the PNK library. For non-capped ncRNA, 5.8 S rRNA clones were used, which had a distribution of 148 and 21 clones in TAP and PNK libraries, respectively. We also investigated why a number clones occurred in the “opposite” library, and found that almost all incorrectly cloned sequences were 5’ truncated, probably brought about the in experimental steps. As we have no information on the exact 5’ end structures of the novel ncRNAs other than observed differences between sequence,, we used all sequences reads to determine the probabilities of the cap status. Table1. Distribution of clones/sequence reads of known capped and non-capped ncRNA in TAP and PNK library

    Including partial ncRNA reads Ignoring partial ncRNA reads. TAP library PNK library TAP library PNK library Capped 97% (395) 3% (12) 100% (175) 0 Non-capped 12% (21) 88% (148) 0 100% (114)

    The probability of an ncRNA, with t clones in the TAP library and p clones in the PNK library, to either be capped [P(C|D)] or noncapped [P(N|D)] is determined by the following formula: P(C|D) / P(N|D) = (0.97 t x 0.03 p) / (0.12 t x 0.88 p) P(C|D) + P(N|D) = 1 According to the calculated probabilities of all ncRNAs, we divided them into five groups, assigned with a value ranging from 1 to -1. These value is used in supplementary tables 2 for capping possibility annotation.

  • Probability Assigned value > 95% probability of being capped 1 > 80% probability of being capped 0.5 Information insufficient to determine cap status 0 > 80% probability of being non-capped -0.5 > 95% probability of being non-capped -1 For ncRNAs with only one sequenced clone and a possibility of greater than 95% to be capped, we assigned them to the >80% group to reduce risk of faulty determinations due to too few samples. We also remove reads from the estimate in cases of affirmed truncations.

  • Supplementary document 2: Genomic location of the novel ncRNA

    genes

    The genomic locations of the novel small RNAs can be divided into several categories (tab. S-1). A majority of the novel loci (55%) are located in sense direction within a known or putative transcript of another gene, of which intronic loci constitute the larger part (51%). Among loci found within an intron, a considerable fraction (24% of all novel loci) is also located within a C. elegans operon. Two loci are found in either introns or UTR regions, depending on which frame is used for translation of the protein, and one locus (CeN42) covers the 3’ end of an exon and the 5’ end of the following intron. The approximately 1/3 of loci found outside any known transcribed region are to a large extent located in relative vicinity to a protein coding gene (< 1 kb), whereas no locus is more than 10 kb away from a known or predicted transcribed gene. One locus is found in the antisense direction to a coding exon. Table S-1. Distribution of novel small RNA genomic locations in C. elegans. It should be noted some RNAs may have several loci. All ncRNAs located in operons are found in introns (or intron + exon) of the operonic genes. The total number of genomic loci is therefore different both from the number of novel small RNAs, and from the sum of loci reported in the table. The “All” column refers to the loci of known plus novel ncRNAs detected in our screen. Number of ncRNA loci Genomic location Novel(%) All (%) Within transcribed sequence (sense) 55 (54.5) 90 (45.5) Intron 51 (50.5) 85 (42.9) In intron or UTR (cen95, cen106) 2 (2.0) 2 (1.0) Overlapping exon and intron (cen42) 1 (1.0) 1 (0.5) In UTR (cen59) 1 (1.0) 1 (0.5) Overlapping 5'UTR and the ORF (cen4.1) 0 (0.0) 1 (0.5) In operon 24 (23.8) 29 (14.6) Antisense to transcribed sequence 10 (9.9) 25 (12.6) - to intron 10 (9.9) 23 (11.6) - to exon (cen107.4) 0 (0.0) 1 (0.5) - to UTR (cen4.17) 0 (0.0) 1 (0.5) Intergenic 36 (35.6) 83 (41.9) All 101 198

  • 1. Intronic ncRNA genes Intronically located novel ncRNA genes comprise 54 loci, located to 48 different known or predicted protein-coding genes. Altogether 21 different protein functional classes are represented among the genes hosting the intronic small RNAs, however, genes coding for ribosomal and ATP-binding proteins make up 30% and 19% of all functionally annotated genes respectively.

    0

    4

    8

    12

    16

    20

    Function