University of Groningen

Computational Methods for High-Throughput Small RNA Analysis in Plants
Monteiro Morgado, Lionel

IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below.

Document Version: Publisher's PDF, also known as Version of record

Publication date: 2018

Link to publication in University of Groningen/UMCG research database

Citation for published version (APA): Monteiro Morgado, L. (2018). Computational Methods for High-Throughput Small RNA Analysis in Plants. University of Groningen.

Copyright: Other than for strictly personal use, it is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license (like Creative Commons).

Take-down policy: If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.

Downloaded from the University of Groningen/UMCG research database (Pure): http://www.rug.nl/research/portal. For technical reasons the number of authors shown on this cover page is limited to 10 maximum.

Download date: 13-06-2021



https://research.rug.nl/en/publications/computational-methods-for-highthroughput-small-rna-analysis-in-plants(3df6e1b7-fa60-4b29-a460-6b9224408c6f).html


CHAPTER 3

Learning sequence patterns of AGO-sRNA affinity from high-throughput sequencing libraries to improve in silico functional small RNA detection and classification in plants

Lionel Morgado, Ritsert C Jansen and Frank Johannes

Abstract

The loading of small RNA (sRNA) into Argonaute complexes is a crucial step in all regulatory pathways identified so far in plants that depend on such non-coding sequences. Important transcriptional and post-transcriptional silencing mechanisms can be activated depending on the specific plant Argonaute (AGO) protein to which a sRNA binds. It is known that sRNA-AGO associations are at least partly encoded in the sRNA primary structure, but the sequence features that drive this association have not been fully explored. Here we train support vector machines (SVMs) on sRNA sequencing data obtained from AGO-immunoprecipitation experiments to identify features that determine sRNA affinity to specific AGOs. Our SVM models reveal that AGO affinity is strongly determined by complex k-mers in the 5’ and 3’ ends of sRNA, in addition to well-known features such as sRNA length and the base composition of the first nucleotide. Moreover, we find that these k-mers tend to overlap known transcription factor (TF) binding motifs, thus highlighting a close interplay between TF and sRNA-mediated transcriptional regulation. We integrated the learned SVMs in a computational pipeline that can be used for de novo functional classification of sRNA sequences. This tool, called SAILS, is provided as a web portal accessible at http://sails.lionelmorgado.eu.


    3.1 Introduction

Small RNA (sRNA) is a class of non-coding RNA with significant roles in developmental biology, physiology, pathogen interactions, and more recently in genome stability and transposable element control [194]. Plants have two major classes of sRNA: the microRNA (miRNA), which is processed from imperfectly self-folded hairpin precursors derived from miRNA genes [195]; and the small interfering RNA (siRNA), which is produced from double-stranded RNA duplexes [196]. siRNAs can be further divided into three major groups: secondary siRNA such as trans-acting (ta)-siRNAs, which are promoted by miRNA cleavage of messenger RNA; natural antisense transcript (nat)-siRNAs, derived from the overlapping regions of antisense transcript pairs naturally present in the genome; and heterochromatic (hc)-siRNAs, mostly generated from transposons, heterochromatic and repetitive genomic regions and involved in DNA methylation and heterochromatin formation. Apart from their biogenesis, the mode of action of a given sRNA is tightly related to the Argonaute protein to which it can bind [197]. Argonautes form the core of all sRNA-guided silencing complexes identified so far. Once loaded into Argonaute, sRNA guides the silencing machinery to targets through base-pairing principles. Argonautes are highly conserved proteins with family members in most eukaryotes [197–199]. There are two main subfamilies of Argonautes in eukaryotes, AGO and PIWI, but only AGO proteins are found in plants. In plants, AGOs can be grouped into three phylogenetic clades with a highly variable number of members from species to species [198] (Figure 3.1). Arabidopsis has ten members with specialized or redundant functions among them: AGO1, AGO5 and AGO10 form the first clade; AGO2, AGO3 and AGO7 form the second clade; and the third clade is composed of AGO4, AGO6, AGO8 and AGO9 [197]. Members of the first and second clades are involved in post-transcriptional silencing (PTS) by inhibiting translation or by promoting messenger RNA cleavage, and AGOs in the third clade are chromatin modifiers that induce transcriptional silencing (TS) via mechanisms such as DNA methylation [200–202]. Understanding the mechanisms that determine the loading of sRNA to specific AGOs is essential for predicting their biological function, and for identifying their putative silencing targets.

High-throughput sequencing in combination with immunoprecipitation (IP) techniques has made it possible to determine the sequences of sRNA that are bound to different AGO families. AGO-IP experiments have been performed for AGO1, AGO2, AGO4, AGO5, AGO6, AGO7, AGO9 and AGO10. The low expression level of AGOs 3 and 8 suggests that they may not be functionally relevant. Previous analyses have shown that AGO-sRNA associations are partly determined by the 5’ terminus and the length of a sRNA sequence [203, 204]. Sequences of approximately 21 nucleotides (nt) tend to be involved in PTS, while 24 nt sRNAs are characteristic of TS. Although sequence length has been widely used to infer the silencing pathway a given sRNA is most likely implicated in, by itself it is an inaccurate predictor, since many sequencing products lack other structural features known to enable AGO loading [203–206]. Contrary to animals, in plants the 5’ nucleotide is also recognized as a strong indicator of AGO sorting. Enrichment for sequences starting with pyrimidines is frequent in AGOs from the first clade (AGO1/10: uridine; AGO5: cytosine), while adenine dominates the third clade (AGO4/6/9) as well as AGO2 from the second clade. Furthermore, direct experimental evidence shows that mutating the 5’ nucleotide of a sRNA can redirect its AGO destination [207]; however, other relevant features, such as sequence motifs encoded by the primary structure, also appear to play a role [200, 208–211].

While sRNAs that participate in PTS have been intensively studied, leading to the discovery of many structural features that influence activation, much less is known about AGO-associated hc-siRNA in transcriptional silencing. Indeed, most studies that use hc-siRNAs place a strong emphasis on sequence length, ignoring other important aspects of the sRNA sequence that promote an active role in genomic regulation. Studying AGO-bound sRNA is a starting point to fill this gap and to improve our understanding of the relationship between the structure and function of sRNA in plants. Here we train support vector machines (SVMs) on sRNA sequencing data obtained from AGO-IP experiments to identify features that determine sRNA affinity with specific AGOs. Our SVMs reveal that AGO affinity is strongly determined by complex k-mers in the 5’ and 3’ ends of sRNA, in addition to well-known features such as sRNA length and the base composition of the first nucleotide. Moreover, we find that these k-mers tend to overlap known transcription factor (TF) binding motifs, thus highlighting a close interplay between TF and sRNA-mediated transcriptional regulation. We incorporated the learned SVMs in an online computational pipeline that can be used for de novo sRNA functional classification. The classification pipeline is suitable for individual sRNAs but also for high-throughput sRNA-seq datasets.

Figure 3.1. Phylogenetic tree for AGO in A. thaliana. Functional groups are indicated: transcriptional silencing (TS) and post-transcriptional silencing (PTS).

    3.2 Material and methods

    3.2.1 Data sets

Table 3.1 summarizes the A. thaliana deep sequencing sRNA libraries used in this study. For SVM model training and testing we used one Col-0 wild-type sRNA library as well as eight AGO-IP datasets. SVM model validation was afterwards performed on additional AGO-IP data from different tissues and from pathogen-infected plants, in addition to a set of putative miRNA and ta-siRNAs from several plant species. All sRNA-seq datasets were pre-processed and mapped to the Col-0 A. thaliana reference genome. Reads with at least one perfect match were collapsed into unique sRNA sequences and single-copy sRNAs were removed. sRNA sequences in the genome-wide library that were not present in any of the AGO-IP sets were isolated as a new group named “noAGO”. The “noAGO” set was used for SVM training in combination with AGO-IP data to learn discriminative rules to identify sequences with low potential to load to an AGO and therefore with a small chance of becoming functional sRNA.


Table 3.1. Libraries of sRNA used in the current work. TOT: total sRNA; AGO-IP: AGO immunoprecipitated; NA: Not Applicable. T: used for training; V: used for validation.

# Type Notes Species Tissue Reference T V

1 AGO1-IP - A. thaliana Inflorescence Mi, 2008 Y Y
2 AGO1-IP - A. thaliana Flower Zhu, 2011 N Y
3 AGO1-IP - A. thaliana Flower Wang, 2011 N Y
4 AGO1-IP - A. thaliana Leaf Wang, 2011 N Y
5 AGO1-IP - A. thaliana Root Wang, 2011 N Y
6 AGO1-IP - A. thaliana Seedling Wang, 2011 N Y
7 AGO1-IP - A. thaliana Inflorescence Qi, 2006 N Y
8 AGO1-IP Pseudomonas syringae infection set A. thaliana Leaf Zhang, 2011 N Y
9 AGO2-IP - A. thaliana Inflorescence Montgomery, 2008 Y Y
10 AGO2-IP - A. thaliana Inflorescence Mi, 2008 N Y
11 AGO2-IP Pseudomonas syringae infection set A. thaliana Leaf Zhang, 2011 N Y
12 AGO4-IP - A. thaliana Inflorescence Mi, 2008 Y Y
13 AGO4-IP - A. thaliana Inflorescence Havecker, 2010 N Y
14 AGO4-IP - A. thaliana Inflorescence Qi, 2006 N Y
15 AGO4-IP - A. thaliana Flower Wang, 2011 N Y
16 AGO4-IP - A. thaliana Leaf Wang, 2011 N Y
17 AGO4-IP - A. thaliana Root Wang, 2011 N Y
18 AGO4-IP - A. thaliana Seedling Wang, 2011 N Y
19 AGO5-IP - A. thaliana Inflorescence Mi, 2008 Y Y
20 AGO6-IP - A. thaliana Aerial Havecker, 2010 Y Y
21 AGO7-IP - A. thaliana Inflorescence Montgomery, 2008 Y Y
22 AGO9-IP - A. thaliana Aerial Havecker, 2010 Y Y
23 AGO10-IP - A. thaliana Inflorescence Zhu, 2011 Y Y
24 AGO1-IP HA-AGO1-DAH(Col-0)_Mock A. thaliana Inflorescence Garcia-Ruiz, 2015 N Y
25 AGO1-IP HA-AGO1-DAH(Col-0)_TuMV A. thaliana Inflorescence Garcia-Ruiz, 2015 N Y
26 AGO2-IP HA-AGO2-DAD(ago2-1)_Mock A. thaliana Rosette Garcia-Ruiz, 2015 N Y
27 AGO2-IP HA-AGO2-DAD(ago2-1)_TuMV-AS9 A. thaliana Rosette Garcia-Ruiz, 2015 N Y
28 AGO2-IP HA-AGO2-DAD(ago2-1)_Mock A. thaliana Cauline Garcia-Ruiz, 2015 N Y
29 AGO2-IP HA-AGO2-DAD(ago2-1)_TuMV-AS9 A. thaliana Cauline Garcia-Ruiz, 2015 N Y
30 AGO2-IP HA-AGO2-DAD(ago2-1)_Mock A. thaliana Rosette Garcia-Ruiz, 2015 N Y
31 AGO2-IP HA-AGO2-DAD(ago2-1)_TuMV A. thaliana Rosette Garcia-Ruiz, 2015 N Y
32 AGO1-IP HA-AGO1-DAH(Col)_Mock A. thaliana Rosette Garcia-Ruiz, 2015 N Y
33 AGO1-IP HA-AGO1-DAH(Col)_TuMV A. thaliana Rosette Garcia-Ruiz, 2015 N Y
34 AGO10-IP HA-AGO10-DDH(Col)_Mock A. thaliana Inflorescence Garcia-Ruiz, 2015 N Y
35 AGO10-IP HA-AGO10-DDH(Col)_TuMV A. thaliana Inflorescence Garcia-Ruiz, 2015 N Y
36 AGO10-IP HA-AGO10-DDH(Col)_Mock A. thaliana Rosette Garcia-Ruiz, 2015 N Y
37 AGO10-IP HA-AGO10-DDH(Col)_TuMV A. thaliana Rosette Garcia-Ruiz, 2015 N Y
38 AGO1-IP HA-AGO1-DAH(ago2-1)_Mock A. thaliana Cauline Garcia-Ruiz, 2015 N Y
39 AGO1-IP HA-AGO1-DAH(ago2-1)_TuMV-AS9 A. thaliana Cauline Garcia-Ruiz, 2015 N Y
40 AGO10-IP HA-AGO10-DAH(ago2-1)_Mock A. thaliana Cauline Garcia-Ruiz, 2015 N Y
41 AGO10-IP HA-AGO10-DAH(ago2-1)_TuMV-AS9 A. thaliana Cauline Garcia-Ruiz, 2015 N Y
42 AGO2-IP HA-AGO2-DAD(ago2-1)_Mock A. thaliana Inflorescence Garcia-Ruiz, 2015 N Y
43 AGO2-IP HA-AGO2-DAD(ago2-1)_TuMV A. thaliana Inflorescence Garcia-Ruiz, 2015 N Y
44 miRNA All known miRNA from Arabidopsis A. thaliana NA Kozomara, 2014 N Y
45 miRNA All known miRNA in plants All plants NA Kozomara, 2014 N Y
46 ta-siRNA All known tasiRNA from Arabidopsis A. thaliana NA Zhang, 2014 N Y
47 ta-siRNA All known tasiRNA in plants All plants NA Zhang, 2014 N Y
48 TOT Total sRNA from wild type A. thaliana Inflorescence Slotkin, 2009 Y N


    3.2.2 Learning procedure

A supervised machine learning approach rooted in the SVM algorithm was devised to learn classifiers capable of determining AGO-sRNA affinity from the sRNA sequences alone (Supplementary Material). Briefly, the complete inference system comprises three layers (Figure 3.2A): layer 1 includes a binary SVM model that filters out sequences that do not show strong evidence for binding to any of the known plant AGOs, and that therefore are expected to be inactive; layer 2 is composed of an ensemble of binary one-vs-one classifiers, each trained to explore the dissimilarities in AGO-bound sRNA sequences in a pairwise fashion; finally, layer 3 consists of a voting system that assigns a single score to each AGO using the decision values produced in the previous layer. Layers 2 and 3 are interconnected, since the outputs of the classifiers from the 2nd layer serve as inputs to the 3rd layer, and the 3rd layer combines them to provide a more informative result. Layer 1 is independent of all other classifiers and can therefore be decoupled if desired. All sRNA sequences used in training, testing or validation were transformed into sets of features comprising:

i. Position specific base composition (PSBC)

One way to convert a string into a numerical representation is by using flags for the presence of a given nucleotide in a determined sequence position. This is equivalent to mapping each sequence position to a four-dimensional feature space that represents each of the four possible bases in the DNA alphabet as follows: 𝐴 = 〈1,0,0,0〉, 𝐶 = 〈0,1,0,0〉, 𝐺 = 〈0,0,1,0〉, 𝑇 = 〈0,0,0,1〉. Using this format, a sequence can be mapped to a feature space of dimensionality 𝐹 = 𝐴 ∙ 𝐿, where 𝐴 is the number of possibilities in the alphabet of nucleic acids and 𝐿 is the sequence length. Since the length of the sRNA sequences analyzed here is not constant but varies in an interval with an upper limit here defined as 27 bases, all sRNA must be projected into a space of size 27 × 4 = 108. In the case of sequences shorter than 27 nucleotides, the extra positions in the feature vector simply remain empty for all nucleotides. Although this approach is a reasonable solution to cope with the variation observed in length, it has the disadvantage of introducing noise in the representation that increases as the 3’ end of the model sequence is approached.


Figure 3.2. Machine learning architectures employed: (A) architecture of the inference system; (B) cascade SVM implemented. In the cascade scheme, data are split into subsets and SVs are determined for each one. The results are combined two-by-two and used for training in the next cascade layer. This is done sequentially until a single model is obtained in the last layer. The resulting SVs are then tested for global convergence by feeding the result of the last layer into the first layer, together with the non-SVs. FV: feature vectors; DVi: decision values from classifier i; TDi: training data partition i; SVj: SVs produced by optimization j; CLk: cascade layer k.
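The two-by-two combination in the cascade can be sketched as follows. This is an illustrative sketch only: `train_svm_and_get_svs` is a hypothetical stand-in for a real solver (e.g. LIBSVM) that would fit an SVM and return only its support vectors; here it is mocked so the data flow of Figure 3.2B is runnable. The final convergence check described in the caption is omitted.

```python
def train_svm_and_get_svs(data):
    # Placeholder (assumption): a real implementation would fit an SVM
    # and return only the training points selected as support vectors.
    return sorted(set(data))

def cascade_train(partitions):
    """Combine per-partition SVs two-by-two until a single model remains."""
    layer = [train_svm_and_get_svs(p) for p in partitions]
    while len(layer) > 1:
        merged = []
        for i in range(0, len(layer) - 1, 2):
            # Retrain on the union of two SV sets (next cascade layer).
            merged.append(train_svm_and_get_svs(layer[i] + layer[i + 1]))
        if len(layer) % 2 == 1:
            merged.append(layer[-1])  # odd partition carried forward
        layer = merged
    return layer[0]  # SVs of the last cascade layer
```

With four toy partitions, `cascade_train([[3, 1], [2, 2], [5, 4], [6]])` merges them pairwise over two cascade layers.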


This happens because the rightmost nucleotides in the real sRNA sequences are projected into more central positions of the model sequence, blurring 3’-side positional patterns when looking across instances of variable size. To compensate for that effect, the same kind of projection but starting at the 3’ position of the sRNA instead of the 5’ was additionally considered, in a feature set referred to here as PSBC2.
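The 5’- and 3’-anchored encodings can be sketched as below. This is a minimal illustration assuming the 27-base upper limit and zero-padding of unused positions as described above; the function and constant names are ours.

```python
BASES = "ACGT"
MAX_LEN = 27  # upper length limit stated in the text

def psbc(seq, anchor="5p"):
    """One-hot encode a sequence into a 27*4 = 108-dimensional vector.

    anchor="5p": positions counted from the 5' end (PSBC);
    anchor="3p": positions counted from the 3' end (PSBC2), which keeps
    3'-side positional patterns aligned across variable-length sRNAs.
    Positions beyond the sequence length stay at 0 for all bases.
    """
    vec = [0.0] * (MAX_LEN * len(BASES))
    s = seq if anchor == "5p" else seq[::-1]
    for pos, base in enumerate(s[:MAX_LEN]):
        vec[pos * len(BASES) + BASES.index(base)] = 1.0
    return vec
```

For example, `psbc("ACG")` sets exactly three entries to 1.0, one per position, while `psbc("ACG", anchor="3p")` counts positions from the 3’ end instead.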

ii. k-mer composition

Approaches based on k-mers map the occurrence of subwords of a given length in the sRNA sequence into a feature space that represents all possible k-mers of that length. Taking as an example k-mers of size 2, there are 4² = 16 possibilities in the 4-letter universe of DNA. The 16-dimensional 2-gram vector for the DNA “ACGT” alphabet would then be: 〈𝐴𝐴, 𝐴𝐶, 𝐴𝐺, 𝐴𝑇, 𝐶𝐴, 𝐶𝐶, 𝐶𝐺, 𝐶𝑇, 𝐺𝐴, 𝐺𝐶, 𝐺𝐺, 𝐺𝑇, 𝑇𝐴, 𝑇𝐶, 𝑇𝐺, 𝑇𝑇〉. As an example, mapping the sequence “ATGCATG” onto this vector space, counting the occurrences of each of the possible k-mers, yields: 〈0,0,0,2,1,0,0,0,0,1,0,0,0,0,2,0〉. It is important to note that this method focuses on the frequency of patterns rather than their position in the sequence. Here, k-mers of length 1 to 5 were explored.
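A minimal sketch of this mapping (the function name is ours), reproducing the worked “ATGCATG” example:

```python
from itertools import product

def kmer_vector(seq, k):
    """Count occurrences of every possible k-mer, in lexicographic order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {m: 0 for m in kmers}
    for i in range(len(seq) - k + 1):  # sliding window over the sequence
        counts[seq[i:i + k]] += 1
    return [counts[m] for m in kmers]

# kmer_vector("ATGCATG", 2) → [0,0,0,2,1,0,0,0,0,1,0,0,0,0,2,0]
```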

iii. Shannon entropy scores

Entropy as a measure of information content gives an indication of the degree of repetitiveness in a sequence. Among several flavors, Shannon entropy is one of the most popular and consists of the score

𝐻(𝑋) = −∑𝑖 𝑝(𝑥𝑖) log₂ 𝑝(𝑥𝑖),

with 𝑋 a sequence and 𝑝(𝑥𝑖) the relative frequency of character 𝑥𝑖 in 𝑋.
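Under this reading of 𝑝(𝑥𝑖) as the relative frequency of each distinct character, the score can be computed as:

```python
from collections import Counter
from math import log2

def shannon_entropy(seq):
    """H(X) = -sum_i p(x_i) * log2 p(x_i), with p(x_i) the relative
    frequency of each distinct character in the sequence."""
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in Counter(seq).values())

# shannon_entropy("AACG") → 1.5, since p = {A: 0.5, C: 0.25, G: 0.25}
```

A perfectly repetitive sequence such as "AAAA" scores 0, the minimum.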

iv. Sequence length

This feature entails the number of nucleotides that compose each sRNA. Although simple, the enrichment for certain sizes in specific pathways is a consistent observation.


The learning methodology applied to layers 1 and 2 was similar. Prior to model training, highly correlated features (|Pearson score| > 0.75) were removed, keeping a single representative randomly selected from each correlated set. The remaining features were normalized to the range between 0 and 1, to avoid dominance effects and numerical difficulties in the downstream calculations [212]. SVM learning with recursive feature elimination (SVM-RFE) [213] was then applied to find models for layers 1 and 2 with a reduced and more informative feature set. To circumvent computational problems that typically arise when standard SVM algorithms are applied to large datasets, a linear kernel was employed with a specialized linear solver [214]. A 5-fold cross-validation procedure was implemented to account for data variation in each feature selection round, and the mean ROC-AUC was calculated to assess the quality of the classifiers. In each round, 1/3 of the features with the lowest contribution to the discriminative model were eliminated, until 10 features were left. From there on, features were excluded one by one until no more features were available. The optimal feature subset for each classifier from layers 1 and 2 was determined with an elbow method applied to the curve formed by the mean cross-validation ROC-AUC values recorded during the feature selection process. The best features were subsequently used to train classifiers applying LIBSVM [215] with radial basis function (RBF) kernels to explore non-linear relationships in the data. A cascade scheme was implemented in this task to tackle computational problems that otherwise would not allow learning with such large datasets (Figure 3.2B). To avoid biases in learning, training was performed with balanced datasets by undersampling the largest class involved in each binary problem. Sequences with the highest read abundance were prioritized, and the remaining spots were occupied by instances randomly selected from the remaining pool(s).
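The two preprocessing steps (correlation pruning and [0, 1] normalization) can be sketched as follows. This is a pure-Python illustration under assumptions of ours: the greedy keep-first rule for choosing a representative per correlated set, and non-constant feature vectors; the thesis itself selects representatives randomly.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant feature vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_correlated(features, threshold=0.75):
    """features: dict name -> list of values. Greedily keeps a feature only
    if its |Pearson r| with every already-kept feature is <= threshold."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

def minmax(values):
    """Scale a feature vector to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]
```

For example, a feature that is a scaled copy of another (r = 1) is dropped, while a weakly correlated one survives.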

In layer 3, a balanced dataset composed of sRNA from all 8 available AGO groups was created. Decision values were computed for each of the sequences using the classifiers obtained for layer 2, and served as input for a 5-fold cross-validation procedure used to train and test the inference scheme applied to the 3rd layer. Three strategies were explored to combine the outputs from layer 2: a voting system, where the winner is the AGO protein with the largest number of decisions in its favor; a weighted rule learned with a linear SVM algorithm; and a weighted rule learned with a non-linear SVM using a RBF kernel. All SVM hyperparameters in layers 1, 2 and 3 were tuned by means of a grid search. A more detailed description of the SVM approach is provided in Supplementary Material.
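The first of the three combination strategies, the plain voting system, can be sketched as below. This is a minimal illustration with names of ours; we assume the sign convention that a positive pairwise decision value favors the first AGO of the pair, and ties are broken arbitrarily. The two weighted SVM variants would replace this counting rule.

```python
from collections import Counter

def majority_vote(pairwise_decisions):
    """pairwise_decisions: dict (ago_i, ago_j) -> decision value from the
    layer-2 one-vs-one classifier; positive values vote for ago_i,
    negative for ago_j. Returns the AGO with most votes."""
    votes = Counter()
    for (ago_i, ago_j), d in pairwise_decisions.items():
        votes[ago_i if d > 0 else ago_j] += 1
    winner, _ = votes.most_common(1)[0]  # ties resolved arbitrarily
    return winner
```

For instance, with three pairwise classifiers over AGO1, AGO2 and AGO4, two votes for AGO4 outweigh one for AGO1.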


    3.3 Results

    3.3.1 Detection of high confidence functional sRNA

We explored a series of SVM classifiers to discriminate putatively functional from non-functional sRNA based on various sRNA sequence features. As outlined above (see section Material and Methods), SVMs were trained on sequenced A. thaliana Columbia (Col-0) AGO-IP sRNA libraries in comparison with libraries of Col-0 total sRNA. Specific sRNA sequences contained in the AGO-IP sRNA libraries were labeled “AGO”, whereas sequences contained in the total library (but not in the AGO-sRNA libraries) were labeled “noAGO”. To minimize sequencing noise, single-copy sRNAs were removed.
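The labeling step amounts to simple set operations; a minimal sketch (function name ours; single-copy removal is assumed to have happened upstream):

```python
def label_sets(total_library, ago_ip_libraries):
    """Sequences seen in any AGO-IP library form the "AGO" set; sequences
    found only in the total (wild-type) library form the "noAGO" set."""
    ago = set().union(*ago_ip_libraries)
    no_ago = set(total_library) - ago
    return ago, no_ago
```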

    5’ and 3’ k-mers are important in distinguishing functional from non-functional sRNA

We first trained separate SVMs using either only the position specific base composition (PSBC or PSBC2) of the sRNA sequences, sRNA entropy, sRNA length, or all possible k-mers of size 1 to 5 nucleotides (see section Learning procedure). The SVMs trained on the PSBC or the k-mer features achieved considerable classification accuracy (~65%), while the SVMs trained using sRNA length or entropy performed less well (accuracy: ~50-60%, Figure 3.3). We then joined all the features together in a single classifier. Starting with 1582 features in total, our selection procedure yielded a final classifier containing only 138 features (Figure 3.3), and achieved a classification accuracy of nearly 80%. Of the 138 features, 127 are k-mers and 10 correspond to position specific nucleotides (Figure 3.4).

Figure 3.3. Results for the classifiers trained to detect sRNA from the sets AGO vs noAGO: number of features and performance. Pie charts represent the fraction of features in the complete set (the total number is in the central black section of the pie) that was eliminated during the correlation analysis (red section in the external ring) or kept (grey section in the external ring). The remaining features were subjected to selection with SVM-RFE, of which a subset comprises the optimal features (light blue section). For each initial conglomerate of features, accuracy was measured using all features in a set (grey bars) and after feature selection (light blue bars). The optimized features determined when all feature sets were analyzed together were then subjected to non-linear learning using a cascade SVM scheme (dark blue bar). Each bar plot has on top the standard deviation for the accuracy calculated from the 5-fold cross-validation procedure.

Figure 3.4. Features ranked by absolute value of their weights in the optimized classifier from layer 1 (“AGO” vs “noAGO”). Position specific base composition (PSBC) features are represented in the format 𝐿𝑃, where 𝐿 ∈ {𝐴, 𝐶, 𝐺, 𝑈} and 𝑃 is the position of the nucleotide in the sequence, starting the count at the 5’ end, or at the 3’ end in the case of PSBC v2, which is represented by 𝑖𝐿𝑃; Shannon: sequence information content given by Shannon entropy.

The k-mers of the final classifier were examined in more detail. Since the k-mers had no positional requirements for inclusion in the model, we asked whether the k-mers retained in the final classifier showed any positional bias in the actual sRNA sequences. To do this, each of these k-mers was mapped back to the sRNA sequences contained in the “AGO” and the “noAGO” libraries. We found that the location of the k-mers was strongly biased toward the 5’ end of the sRNAs and to some extent also toward the 3’ end (Figure 3.5), with a visible depletion toward the center of the sRNA sequences.

Looking at the 5’ end of sequences of meaningful length, we observed that 24 nt sRNAs from the “AGO” set are dominated by k-mers starting with an adenine (A), while 21 nt sRNAs have a mixture of adenine, uracil (U) and cytosine (C); in both cases, 5’ guanine (G) is absent (Figure 3.6). This finding is consistent with previous reports showing that the base composition of the first nucleotide of a sRNA is important for AGO affinity [194, 197, 207]. However, the vast majority (125 out of 138) of the features consisted of k-mers larger than one nucleotide in length, indicating that more complex sequence features guide AGO-sRNA associations.

    Figure 3.5. Density distribution of k-mers retained in the final classifier discriminating between functional (AGO) versus non-functional (noAGO) sRNA (layer 1). The projection of k-mers in the AGO set shows a clear positional bias toward the 5’ and 3’ ends of 21 nt and 24 nt sRNA sequences.


Figure 3.6. Frequency of nucleotides in k-mers across the sRNA libraries. This was determined from the distribution of k-mers retained in the classifiers according to their density peaks in 21 nt and 24 nt sRNA: (A) layer 1: AGO vs noAGO; (B) layer 2: mean of the frequencies for classifiers involving a given AGO.


    5’ and 3’ k-mers are enriched for transcription factor binding motifs

We further assessed whether the k-mers correspond to known sequence motifs. To that end, we performed a motif analysis on the “AGO” and the “noAGO” sets with MEME [216], a computational framework for de novo and known motif identification. We then focused on k-mers with a size of 5 nt and mapped them to the motifs retrieved in the previous step. We found that the majority of k-mers (38 of 46) corresponded to segments of known or predicted motifs, and noted a significant enrichment for k-mer matches in “AGO” compared with “noAGO” (100 and 32, respectively). Interestingly, the known motifs matching k-mers were consistently related to stress response and development, processes known to be highly regulated by sRNA (Table S3.1).

The TF enrichment was not surprising for 21 nt sRNA, which are known to act on genic sequences that frequently contain TF binding motifs. However, similar TF patterns were also found for 24 nt sRNA, suggesting a role in gene regulation beyond the well-documented function of 24 nt sequences in heterochromatin silencing. We identified the 5-mer “AGAAG” as the k-mer showing the strongest enrichment in 24 nt sequences compared with 21 nt sRNA from the “AGO” set, and noted that this subsequence was associated with 3 motifs: At1g68670, a G2-like protein involved in phosphate homeostasis; SVP, a MADS protein that acts as a floral repressor and functions within the thermosensory pathway; and MYB77, which is expressed in response to potassium deprivation and auxin. All these motifs have been shown to have higher DNA-binding capacity when the targets are unmethylated [217], revealing a possible bridge between 24 nt heterochromatic sRNA, DNA methylation and TF occupancy.

Figure 3.7. Results for the models trained to discriminate sRNA from two different AGO-IP sets (layer 2): features selected, performance and top-3 features. The top three rows of pie charts represent the features kept after the correlation analysis (grey section), a fraction of which comprises the optimal sets obtained by feature selection (light blue section). The three following rows of pie charts represent the composition of the features kept after feature selection, with the number in the middle representing the total number of features that survived SVM-RFE. For each binary model, ROC-AUC is shown for classifiers trained: using all non-correlated features (grey bars), after feature selection (light blue bars), and using the optimized feature set in non-linear learning with the cascade SVM scheme (dark blue bars). Each bar plot has on top the standard deviation for the performance calculated from the 5-fold cross-validation procedure. The top-3 features found with the feature selection procedure (features kept in the last 3 SVM-RFE iterations) are indicated in a third section. The bottom section indicates the AGO-IP sets used in each of the 28 models, plus a colour scheme that marks inter- (white) and intra-clade models (red: clade 1; blue: clade 2; green: clade 3).


    3.3.2 Classification and validation of specific AGO-sRNA associations

    In the previous section, our goal was to build a classifier that can distinguish functional from

    non-functional sRNA. We achieved this by training on “AGO” versus “noAGO” sRNA. A

related problem is to find properties of sRNA that allow us to infer their differential loading onto

    specific AGO proteins, as this will determine their particular mode of action. To achieve this,

    we trained binary one-vs-one classifiers to find sequence features that discriminate between

    the eight different Col-0 AGO-IP libraries (layer 2, Figure 3.2A). The ensemble of one-vs-one

    classifiers was then subjected to a voting system that assigns a single score to each AGO

    (layer 3, Figure 3.2A).

    We used a 5-fold cross-validation scheme to determine the accuracy of the inference system

    at three levels: 1. AGO: fraction of sequences for which the AGO-bound protein was

    correctly predicted; 2. Clade: fraction of sequences for which the AGO prediction falls within

    the correct clade; and 3. Function: fraction of sequences for which the functional group can

    be correctly assigned based on the AGO prediction, translating predictions for AGO4/6/9

    into a potential for involvement in TS, and assignments to other AGOs as suggestion of PTS

    activity. In addition, we also performed a validation using 35 A. thaliana AGO-IP datasets

    never seen by the classifiers during training, plus miRNA and ta-siRNA libraries from multiple

    plants. The AGO-IP validation sets were collected from different tissues and experimental

    conditions, which allowed us to evaluate the robustness of the classifiers to such variation.
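The three accuracy levels can be derived from a single vector of AGO predictions once each AGO is mapped to its clade and to a silencing function (AGO4/6/9 → TS, all others → PTS, as above). A minimal sketch, assuming the standard Arabidopsis AGO clade grouping and entirely hypothetical predictions:

```python
# Clade membership for the eight Col-0 AGO-IP libraries
# (clade 1: AGO1/5/10; clade 2: AGO2/7; clade 3: AGO4/6/9)
CLADE = {"AGO1": 1, "AGO5": 1, "AGO10": 1,
         "AGO2": 2, "AGO7": 2,
         "AGO4": 3, "AGO6": 3, "AGO9": 3}

def function(ago):
    # AGO4/6/9 imply transcriptional silencing (TS); the rest PTS
    return "TS" if ago in ("AGO4", "AGO6", "AGO9") else "PTS"

def accuracies(true, pred):
    """Return (AGO-level, clade-level, function-level) accuracy."""
    n = len(true)
    ago = sum(t == p for t, p in zip(true, pred)) / n
    clade = sum(CLADE[t] == CLADE[p] for t, p in zip(true, pred)) / n
    func = sum(function(t) == function(p) for t, p in zip(true, pred)) / n
    return ago, clade, func

# Hypothetical predictions for four sequences
true = ["AGO1", "AGO4", "AGO2", "AGO6"]
pred = ["AGO1", "AGO6", "AGO7", "AGO6"]
print(accuracies(true, pred))  # (0.5, 1.0, 1.0): wrong AGO, right clade/function
```

This illustrates why clade- and function-level accuracy can exceed AGO-level accuracy: a misassignment within the correct clade still counts as correct at the two coarser levels.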

    Classification accuracy of sRNA at AGO, clade and function level

    Results from our 5-fold cross-validation analysis showed that our classifiers achieved very

    high accuracy (Figure 3.8), indicating that differences between sRNA bound to specific AGOs

    can indeed be detected. Classification accuracy at the level of specific AGO proteins was

    around 60% on average, ranging from as low as 40% (AGO7) to as high as 85% (AGO5). Since

    AGO proteins within a clade are highly homologous and similar in function, it is likely that

    sRNA-AGO binding is promiscuous in nature, which would render the search for

    discriminative features more challenging. Indeed, classification accuracy at the level of the

    clade and function were considerably higher than at the level of specific AGOs (clade

    accuracy: 85%, function accuracy 93%; see Figure 3.8A). Moreover, the final intra-clade

    classifiers were generally more complex than inter-clade classifiers, containing on average


122 features compared with 75 features, respectively (Figure 3.8C). Looking at the features

    retained in the final classifiers, intra-clade classifiers contained proportionally more k-mers

    of length larger than 1 nt compared with inter-clade classifiers (86% and 79% of the features

    in intra-clade and inter-clade models, respectively). Hence, factors that govern AGO-specific

    affinities within a clade appear to involve more complex sequence determinants. Similar to

    the “AGO” versus “noAGO” analysis presented above (section Detection of high confidence

    functional sRNA), we found that the k-mers in the final classifiers showed strong positional

    bias toward the 5’ and 3’ ends of sRNA sequences (Figure 3.6). This finding indicates that the

    same sRNA regions that differentiate functional from non-functional sequences also

    contribute to the binding affinity of sRNA to specific AGO proteins.

    To study the nature of the information contained in the k-mers selected by SVM-RFE in more

    detail, we compared them with a motif analysis performed for each AGO library using the

MEME suite [216]. For simplicity, we focused on 5-mers (Supplementary

    Material). We found that nearly 40% of the 5-mers were identified as motifs or derived

    segments (Table 3.2), often related to known transcription factors with roles in development

    and response to stress. For instance, in models involving AGO2 (an AGO linked to ta-siRNAs)

we identified 5-mers matching the motif ETT, an experimentally validated target of an evolutionarily conserved ta-siRNA known as tasiR-ARF [63]. Targeting of ETT by ta-siRNA

    has been extensively studied and is known to have pleiotropic effects on Arabidopsis flower

    development by interference with the auxin pathway [218]. These results thus support the

    conclusion that SVM learning captured sequence information with relevant biological

    meaning. A full overview of k-mers overlapping known or predicted motifs is provided in

    Table 3.2 and in the Supplementary Material.

    Validation analysis revealed that our classifiers are relatively sensitive, even across

    biologically very heterogeneous datasets (Figure 3.9). Sensitivity was particularly high at the

    level of clade and function (clade: 70% and function: 84.3%), and moderate at the level of

specific AGOs (AGO: 42%). Interestingly, when looking at the performance obtained for the

    ta-siRNA and miRNA datasets, we observed high sensitivity values both for the set isolated

    for Arabidopsis (~80% and ~90%, respectively) and also for the complete plant set (~95% and

~90%, respectively). In the set containing a mixture of ta-siRNA from diverse plants, the sensitivity is considerably higher than in the set composed only of sequences from


Figure 3.8. Details of the AGO-sorting inference (layers 2+3). Mean accuracy across the 5-fold cross-validation scheme for the voting system used in layer 3, measured (A) at the AGO, clade and function levels and (B) separated by individual AGO set. The average number of features composing the intra- and inter-clade classifiers from layer 2 (C), and the percentage of those features which are k-mers (D). Error bars: standard deviation.

Table 3.2. Number of 5-mers in the final classifiers from layer 2 matching motifs found with MEME in each AGO-IP library. The 5-mers were separated by the sign of their weight w in the final classifier.

#  |  Set a  |  Set b  |  # features in the final classifier  |  # 5-mers in the final classifier: favouring AGOa (w>0) / favouring AGOb (w<0); mapping to a motif  |  % 5-mers mapping to a motif


Figure 3.9. Sensitivity for the validation sets. Bar plots show results for AGO (light grey), clade (light blue) and function (dark blue) inference. An indication of the use of the data sets during learning, the sampled tissue, and the AGO with which they were immunoprecipitated is shown by coloured squares under the bars. Below these, the data set reference numbers are indicated, following the enumeration in Table 3.1.


Arabidopsis, which supports the idea that the inference system can identify PTS sRNA not only from the species used in the learning phase but also from other plant species. In conclusion, the inference system is robust to tissue and treatment variation and can recognize sRNA membership independently of the cellular origin, showing the good generalization desired for the discovery of new functional units.

    3.4 Discussion

    To our knowledge, this is the first time that classifiers were built to infer AGO-sRNA affinity

    from the sRNA sequence alone. Using adequate solvers and learning architectures, SVMs

    could be applied to large genomic datasets and discovered highly discriminative rules, both

to distinguish sequences that bind to AGOs from other sequencing products, as well as to infer

    AGO-sRNA kinship. In addition to the known 5’ nucleotide composition and the sequence

    length, feature selection revealed the contribution of other sequence attributes, mostly

    complex k-mers, in defining sRNA preference for certain AGO proteins. How robust specific

    sRNA-AGO affinities are to changes in certain sequence properties is an interesting topic for

    future research and can shed light on evolutionary constraints on sRNA-mediated

    transcriptional and post-transcriptional silencing pathways. To answer such questions,

    additional experiments are necessary to manipulate sRNA sequences in a precise and

    targeted fashion, for example by use of the CRISPR-Cas9 system. Although our inference

    method appears to be highly accurate in predicting the putative function of sRNA sequences,

    it is important to keep in mind that the actual biological activity of a given sRNA is

    dependent on other factors beyond the AGO loading step, such as the degree of

    complementarity between a sRNA and the target sequence, as well as the presence of

    specific chromatin states at the target locus.

AGO-sorting information has the potential to decrease the very high number of false positives reported by currently available PTS target prediction tools, but the true impact needs to be further evaluated. In any case, the computational framework developed here displays a discriminative power that makes it suitable for early screens of genome-wide sRNA libraries when looking for candidates with the highest chance of having certain functional roles, and when sorting sRNA into TS and PTS classes is needed. The tool is ultimately a more affordable alternative to expensive and laborious AGO-IP experiments, since it can derive AGO-sRNA profiles from a single genome-wide sequencing library. Another interesting application of the framework is to explore whether specific sRNA from exogenous sources, such as artificially designed sequences or those derived from pathogens, correspond to functional plant sRNA and which silencing pathways they are likely to target.

More sophisticated models could be developed using, for example, the expression patterns of the AGO proteins at the moment the sRNA sequencing experiments take place. This extra information could improve the capacity to correctly predict AGO affinity under particular conditions, but it also demands additional and more complex inputs, thus limiting the range of applications of the sRNA classifiers.


    S3 Supplementary Material

    S3.1 Support Vector Machines

    The support vector machine (SVM) is a popular machine learning algorithm with strong

    bonds to statistical learning theory. During the training phase, the SVM searches for the

decision hyperplane that maximizes the margin for the training set $(\vec{x}_1, y_1), \dots, (\vec{x}_n, y_n)$ composed of $n$ training instances $\vec{x}_i$ with labels $y_i \in \{-1, 1\}$ in 2-class problems. Maximum

    margin classifiers are frequently the ones that guarantee the highest generalization capacity,

    which explains the higher performance of models obtained via SVMs compared to other

    machine learning algorithms [219].

To attain a classifier of the form $\vec{w} \cdot \vec{x} + b = 0$, the learning process requires minimizing the norm $\lVert\vec{w}\rVert$ of the normal vector of the decision hyperplane, subject to the constraints $y_i(\vec{w} \cdot \vec{x}_i + b) \ge 1$, for $i = 1, \dots, n$. In the binary case, a classification is then given by the sign of the classifier output.

To accommodate points that lie on the wrong side of the decision frontier, a soft-margin classifier can be trained instead of the hard-margin version, considering a hinge loss function of the type

$$\zeta_i = \max(0,\, 1 - y_i(\vec{w} \cdot \vec{x}_i + b))$$

and minimizing

$$\frac{1}{n}\sum_{i=1}^{n} \zeta_i + \lambda \lVert\vec{w}\rVert^2$$

subject to $y_i(\vec{w} \cdot \vec{x}_i + b) \ge 1 - \zeta_i$ and $\zeta_i \ge 0$, for all $i$. The variable $\lambda$ defines the trade-off between having a larger margin and the number of training points lying on the wrong side of the margin. In the limit $\lambda \to 0$, the soft-margin formulation approaches the hard-margin version. Most frequently, the trade-off is tuned not through $\lambda$ but through a directly related cost parameter $C = 1/\lambda$.

The above equation states the optimization problem in a form known as the primal. Classically, it can be solved using quadratic programming, but there are other, more sophisticated approaches, such as sub-gradient descent or coordinate descent methods, that can reduce the training time by several orders of magnitude when learning from very large datasets.
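A minimal sub-gradient descent on this primal objective can be sketched as follows (toy data, arbitrary learning rate and epoch count; purely illustrative, not the solver used in this work):

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Sub-gradient descent on the primal soft-margin objective
    (1/n) * sum_i max(0, 1 - y_i (w.x_i + b)) + lam * ||w||^2,
    with labels y in {-1, +1}. Illustrative, not optimized."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1  # only margin violators contribute to the hinge term
        grad_w = 2 * lam * w - (y[active, None] * X[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Two well-separated toy clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
w, b = train_linear_svm(X, y)
pred = np.sign(X @ w + b)
print((pred == y).mean())  # fraction of training points classified correctly
```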

Alternatively, the optimization problem can be stated in the dual form

$$\text{maximize } f(\alpha_1, \dots, \alpha_n) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i \alpha_i (\vec{x}_i \cdot \vec{x}_j)\, y_j \alpha_j$$

subject to $\sum_{i=1}^{n} \alpha_i y_i = 0$ and $0 \le \alpha_i \le \frac{1}{2n\lambda}$ for all $i$, and can be solved through the Lagrangian method. In this case the decision hyperplane is described in terms of the training instances $\vec{x}_i$ that lie closest to it and for which the Lagrange multiplier $\alpha_i \ne 0$, with $\vec{w} = \sum_{i=1}^{n} \alpha_i y_i \vec{x}_i$. These instances placed on the margin are called support vectors (SVs).
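In scikit-learn, for example, the dual solution can be inspected after fitting: the attribute `dual_coef_` stores the products $\alpha_i y_i$ for the support vectors, so for a linear kernel the normal vector $\vec{w} = \sum_i \alpha_i y_i \vec{x}_i$ can be recovered directly (a sketch on toy data, assuming scikit-learn is available):

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data
X = np.array([[-2.0, -2.0], [-1.5, -1.8], [-2.2, -1.0],
              [2.0, 2.0], [1.5, 1.8], [2.2, 1.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=10.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for each support vector, so the
# primal normal vector w = sum_i alpha_i y_i x_i can be recovered:
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))  # True for a linear kernel
```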

    When more complex relationships between the classes are encoded in the data, it is

    beneficial to use a search space with higher dimensionality where the discrimination is more

    feasible. Data can be projected into such space using non-linear functions classically

    employed in machine learning like polynomials, sigmoids or radial basis functions (RBFs)

    [220, 221]. RBFs are a special and common choice, since other kernel functions like the linear

    and the sigmoid can be designed as special cases of RBFs [220, 221].

    S3.2 SVM-RFE

    Recursive feature elimination (RFE) is a backward feature elimination method that can be

    defined in three simple steps:

1. Train a classifier (optimize the weights $w_i$ according to an objective function $J$)
2. Determine the ranking criterion for all features (the cost $DJ(i)$ or $w_i^2$)
3. Remove the feature(s) with the smallest ranking criterion

The whole process is repeated until a stopping condition is met, such as reaching a desired minimum number of features or an empty feature set.

    In the case of SVM-RFE, the principles of RFE are combined with SVM model optimization in

    a wrapper approach.
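This wrapper is available off the shelf, e.g. as scikit-learn's `RFE` combined with a linear SVM, which ranks features by their squared weights $w_i^2$. A sketch on synthetic data (the settings are illustrative, not those used in this chapter):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

# Synthetic data: 20 features, of which only 5 are informative
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)

# SVM-RFE: repeatedly fit a linear SVM, rank features by w_i^2 and
# drop the weakest one, until 5 features remain
svm = LinearSVC(C=1.0, dual=False, max_iter=5000)
rfe = RFE(estimator=svm, n_features_to_select=5, step=1).fit(X, y)

print(rfe.support_.sum())  # 5 features kept
print(rfe.ranking_)        # rank 1 marks the surviving features
```

The `step` parameter controls how many features are dropped per iteration; removing one at a time is the classical SVM-RFE setting but larger steps trade ranking resolution for speed.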

    S3.3 SVM learning in very large datasets

SVM learning is frequently mentioned as one of the most accurate machine learning methods, thanks to the maximum margin principle that it employs. However, searching for the maximum margin hyperplane in a large dataset comes with a computational burden from loading large matrices for which extensive optimization problems must be solved. This makes traditional SVM formulations, whose training complexity lies between $\mathcal{O}(n^2)$ and $\mathcal{O}(n^3)$ in the number $n$ of training samples [222], difficult to use when exploring large sets. The machine learning community has worked to overcome this limitation, mostly by optimizing the mathematical machinery of SVM learning or by presenting “out of the box” solutions based on the theory behind SVM learning and experimental observations.

Empirical studies showed that linear SVM learning can in some cases be sufficient, yielding classifiers with accuracy comparable to that achieved with more complex kernel transformations. There are solvers specialized in linear SVM learning, such as LibLinear [214], with a training time complexity of $\mathcal{O}(n)$: the training time grows linearly with the number of training instances, drastically reducing learning time compared to other SVM formulations. Additionally, such an approach does not demand optimizing any hyperparameter beyond the cost, further decreasing the number of training rounds necessary to tune the algorithm.

    When non-linear SVM learning is necessary, the computational burden can be attenuated by

    using special learning schemes, such as the cascade configuration [223], which take

    advantage of theoretical properties of the SVM algorithm. This architecture breaks the

    global optimization process into smaller subproblems that are solved separately by multiple

    SVMs with subsets of data. The partial solutions composed by the support vectors can then

    be joined and reprocessed in a cascade of SVMs that end in a single model guaranteed to

    converge to the global optimum after multiple passes through the cascade. Empirical tests

    showed that a single pass over the cascade can already produce decision rules with good

generalization. In addition, this design can be parallelized, further reducing the training time.

SVM formulations for multiclass problems are available. However, since the complexity of the problem increases as more classes are included in the same optimization problem, multiclass SVM classification typically uses binary models. Two flavours can be found: one-vs-one and one-vs-all [224]. In the first case, binary classifiers are trained for each of the pairwise combinations among the $K$ classes, forming a total of $\frac{K(K-1)}{2}$ classifiers [219]. One-vs-all schemes use fewer classifiers, since each classifier is trained to separate one class $i$ from all remaining groups, for a total of $K$. A final prediction can be obtained by combining the inferences of the binary classifiers through mechanisms as simple as an unweighted voting system, where the final decision is made in favour of the group that gathers the highest number of positive predictions across all the individual rules. Multiclass prediction through binary schemes is proven to achieve accuracy levels comparable to those of direct multiclass models [219], and it brings extra advantages associated with the parallelization of the training process and its ease of implementation.
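A one-vs-one ensemble with unweighted voting, analogous to the layer 2 + layer 3 design used in this chapter, can be sketched as follows (scikit-learn assumed; a toy three-class problem stands in for the eight AGO sets):

```python
from itertools import combinations
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Toy 3-class problem standing in for the 8 AGO classes
X, y = make_blobs(n_samples=300, centers=3, random_state=0)
classes = np.unique(y)

# Layer 2: one binary SVM per pair of classes, K(K-1)/2 models in total
pair_models = {}
for a, b in combinations(classes, 2):
    mask = np.isin(y, [a, b])
    pair_models[(a, b)] = LinearSVC(dual=False).fit(X[mask], y[mask])

# Layer 3: unweighted voting, one vote per pairwise model
def predict_vote(x):
    votes = {c: 0 for c in classes}
    for model in pair_models.values():
        votes[model.predict([x])[0]] += 1
    return max(votes, key=votes.get)

preds = np.array([predict_vote(x) for x in X])
print((preds == y).mean())  # training accuracy of the voting ensemble
```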

    S3.4 Cascade SVM

    The cascade SVM [223] is a learning architecture in which training is performed sequentially

    in subsets of data. Each partial SVM filters uninformative data by finding sub-solutions that

    through the cascade are sequentially joined in a global solution. Former SVM sub-solutions

    which are given by the respective SVs are fed as input to a later SVM that further simplifies

    the result in case of data redundancy. This hierarchy ends-up in a single SVM that joins the

    SVs from previously optimized problems in a final model. This divide-and-conquer strategy,

    which explores principles of SVM learning in the dual form, allows reducing the number of

    training points used to find the global solution by dealing with smaller and therefore

    computationally more manageable subsets of data. The computational problems faced by

    the SVM algorithm when dealing with large datasets are circumvented since the training

    complexity is dependent on the number 𝑛 of training samples and estimated between 𝒪(𝑛2)

    and 𝒪(𝑛3), with obvious gains for smaller datasets.
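A single pass of this scheme can be sketched as follows: train independent SVMs on disjoint chunks, pool their support vectors, and retrain a final SVM on that reduced set (scikit-learn assumed; a full cascade would merge sub-solutions over several layers and possibly iterate):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary problem standing in for a large sRNA dataset
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Layer 1: train independent SVMs on disjoint chunks of the data
sv_idx = []
for idx in np.array_split(np.arange(len(X)), 4):
    m = SVC(kernel="rbf", C=1.0).fit(X[idx], y[idx])
    sv_idx.extend(idx[m.support_])  # keep only this chunk's support vectors

# Layer 2: a final SVM trained on the pooled support vectors only
final = SVC(kernel="rbf", C=1.0).fit(X[sv_idx], y[sv_idx])

print(len(sv_idx), "of", len(X), "points reach the final SVM")
print(final.score(X, y))  # accuracy of the cascade model on the full set
```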

    S3.5 Motif analysis

All sRNA sets used to develop the classifiers presented here were subjected to a motif search to assess the possible presence of conserved patterns in the sRNA sequences. This task used the MEME-ChIP platform [225] and was directed at the lists of unique sRNAs in the “AGO”, “noAGO” and AGO-IP sets. First, a de novo motif discovery search was executed for long ungapped motifs with MEME (6 to 27 nt) and for short ungapped motifs with DREME (3 to 8 nt). The motifs found to be significant (e-value ≤ 0.05) were then analyzed for enrichment in terms of their relative position inside the sequence with CentriMo (e-value ≤ 0.05).


Next, all motifs found were aligned to two libraries of known motifs from Arabidopsis thaliana using Tomtom (e-value ≤ 1). The first library contained a set of 113 motifs detected by protein-binding microarrays (PBM; Franco-Zorrilla et al., 2014) and the second is composed of 872 motifs captured by DNA affinity purification sequencing (DAP-seq; O’Malley et al., 2016). Finally, the motifs from each set were compared pairwise to identify library-specific and shared motifs.

    The k-mers selected by SVM-RFE were matched with perfect similarity (no mismatches, no

    gaps) to subsequences retrieved with FIMO (p-value


Table S3.1. Motifs with 5-mer matches found in the “AGO” and “noAGO” sRNA sets.


Table S3.1 (continued)


Table S3.2. PBM motifs found in the AGO-IP libraries. Columns give the number of motifs per step: long ungapped (MEME), short ungapped (DREME), centrally enriched (CentriMo), known (Tomtom) and remapped (FIMO).

Set      MEME   DREME   CentriMo   Tomtom   FIMO
AGO1        1      40         10       25     36
AGO2        0      28         39       28     38
AGO4        0      28          0       21     20
AGO5        0      74          5       57     72
AGO6        1      22          2       14     10
AGO7        3      84          0       53     80
AGO9        1      27          0       15     18
AGO10       1      28         32       22     37


    Table S3.3. All PBM motifs found in the AGO-IP libraries.

    # Name TAIR ID Short description

    1 ARR14 AT2G01760 Cytokinin-activated signaling pathway, phosphorelay signal transduction system.

    2 ATHB51 AT5G03790 Bract formation, floral meristem determinacy, leaf morphogenesis, positive regulation of transcription, DNA-templated, regulation of timing of transition from vegetative to reproductive phase.

    3 bZIP60 AT1G42990 bZIP60 mRNA is upregulated by the addition of ER stress inducers. It is involved in endoplasmic reticulum unfolded protein response, immune system process, and response to chitin.

    4 DAG2 AT2G46590 Expression is limited to vascular system of the mother plant. Recessive mutation is inherited as maternal-effect and expression is not detected in the embryo. Mutants are defective in seed germination and are more dependent on light and cold treatment and less sensitive to gibberellin during seed germination.

    5 DEAR3 AT2G23340 Ethylene-activated signaling pathway.

    6 DEAR4 AT4G36900 Ethylene-activated signaling pathway, plant defense, freezing stress, dehydration stress.

    7 ETT AT2G33860 ettin (ett) mutations have pleiotropic effects on Arabidopsis flower development, causing increases in perianth organ number, decreases in stamen number and anther formation, and apical-basal patterning defects in the gynoecium. The ETTIN gene encodes a protein with homology to DNA binding proteins which bind to auxin response elements. ETT transcript is expressed throughout stage 1 floral meristems and subsequently resolves to a complex pattern within petal, stamen and carpel primordia. ETT probably functions to impart regional identity in floral meristems that affects perianth organ number spacing, stamen formation, and regional differentiation in stamens and the gynoecium. During stage 5, ETT expression appears in a ring at the top of the floral meristem before morphological appearance of the gynoecium, consistent with the proposal that ETT is involved in prepatterning apical and basal boundaries in the gynoecium primordium. It is a target of the ta-siRNA tasiR-ARF.

    8 GLK1 AT2G20570 Chloroplast organization, negative regulation of flower development, negative regulation of leaf senescence, positive regulation of transcription, regulation of chlorophyll biosynthetic process, regulation of transcription and signal transduction.

9 HSFB2A AT5G62020 The heat shock transcription factor B2A is involved in response to chitin and gametophyte development. The mRNA is cell-to-cell mobile.

    10 HSFC1 AT3G24520 Heat shock transcription factor C1.

    11 KAN1 AT5G16560 Involved in abaxial cell fate specification, carpel development, organ morphogenesis, plant ovule development, polarity specification of adaxial/abaxial axis, radial pattern formation, xylem and phloem pattern formation.

    12 KAN4 AT5G42630 Encodes a member of the KANADI family of putative transcription factors. Involved in integument formation during ovule development and expressed at the boundary between the inner and outer integuments. It is essential for directing laminar growth of the inner integument. Along with KAN1 and KAN2, KAN4 is involved in proper localization of PIN1 (necessary for proper flowering) in early embryogenesis.

    13 MYB52 AT1G17950 Involved in the regulation of secondary cell wall biogenesis and response to abscisic acid.


    Table S3.3. (continued)

    # Name TAIR ID Description

    14 RAP2.6 AT1G43160 Involved in chloroplast organization, ethylene-activated signaling pathway, and positive regulation of transcription. Mediates cellular response to: heat, abscisic acid, cold, jasmonic acid, osmotic stress, salicylic acid, salt stress, water deprivation and wounding.

    15 REM1 AT4G31610 Expressed specifically in reproductive meristems. It is a member of a moderately sized gene family distantly related to known plant DNA binding proteins. Necessary for proper flower development.

    16 RVE1 AT5G17300 Myb-like transcription factor that regulates hypocotyl growth by regulating free auxin levels in a time-of-day specific manner.

    17 SPL7 AT5G18830 Encodes a member of the Squamosa Binding Protein family of transcriptional regulators. SPL7 is expressed highly in roots and appears to play a role in copper homeostasis. Mutants are hypersensitive to copper deficient conditions and display a retarded growth phenotype. SPL7 binds to the promoter of the copper responsive miRNAs miR398b and miR389c.

    18 STY1 A member of SHI gene family. Arabidopsis thaliana has ten members that encode proteins with a RING finger-like zinc finger motif. Despite being highly divergent in sequence, many of the SHI-related genes are partially redundant in function and synergistically promote gynoecium, stamen and leaf development in Arabidopsis. STY1/STY2 double mutants showed defective style, stigma as well as serrated leaves. Binds to the promoter of YUC4 and YUC8 (binding site ACTCTAC). It is involved in auxin homeostasis, negative regulation of gibberellic acid mediated signaling pathway, positive regulation of transcription, stigma development, style development, xylem and phloem pattern formation.

    19 TOE1 AT2G28550 It is related to AP2.7 (RAP2.7). It is involved in ethylene-activated signaling pathway, multicellular organismal development, organ morphogenesis and vegetative to reproductive phase transition of meristem.

    20 TOE2 AT5G60120 Involved in ethylene-activated signaling pathway and multicellular organismal development.

    21 WOX13 AT4G35550 Involved in plant development.

    22 WRKY12 AT2G44745 Pith secondary wall formation and a promoter of flowering.

    23 WRKY38 AT5G22570 Involved in defense response to bacterium and salicylic acid mediated signaling pathway. Arabidopsis FLOWERING LOCUS D influences systemic-acquired resistance-induced expression and histone modifications of WRKY genes. Dehydration stress. Basal Defense in Arabidopsis: WRKYs Interact with Histone Deacetylase HDA19.

    24 YAB1 AT2G45190 Abaxial cell fate specification, cell fate commitment, fruit development, inflorescence meristem growth, meristem structural organization, polarity specification of adaxial/abaxial axis, regulation of flower development, regulation of leaf development, regulation of shoot apical meristem development, specification of floral organ identity, specification of organ position.

    25 YAB5 AT2G26580 Polarity specification of adaxial/abaxial axis, regulation of leaf development, regulation of shoot apical meristem development.

    26 ZAT14 AT5G03510 Zinc finger.


    Table S3.4. PBM motifs found in the AGO-IP libraries: motifs only in AGOa.

    # AGOa AGOb Motifs only in AGOa

    1 AGO1 AGO2 KAN1, WRKY38, ZAT14

    2 AGO1 AGO4 DEAR3, KAN1, MYB55_2, WRKY38, ZAT14

    3 AGO1 AGO5 DEAR3, MYB55_2, WRKY38, ZAT14

    4 AGO1 AGO6 DEAR3, KAN1, MYB55_2, WRKY38, ZAT14

    5 AGO1 AGO7 DEAR3, KAN1, MYB55_2, WRKY38, ZAT14

    6 AGO1 AGO9 DEAR3, KAN1, MYB55_2, WRKY38, ZAT14

    7 AGO1 AGO10 DEAR3, WRKY38, ZAT14

    8 AGO2 AGO4 bZIP60_2, DEAR3, DEAR4, ETT_2, KAN4, MYB55_2, REM1, RVE1_2, SPL7, STY1_2, TOE1, WOX13, WRKY12, YAB1

    9 AGO2 AGO5 bZIP60_2, DEAR3, DEAR4, ETT_2, KAN4, MYB55_2, REM1, RVE1_2, SPL7, STY1_2, TOE1, WOX13, WRKY12, YAB1

    10 AGO2 AGO6 bZIP60_2, DEAR3, DEAR4, ETT_2, KAN4, MYB55_2, REM1, RVE1_2, SPL7, STY1_2, TOE1, WOX13, YAB1

    11 AGO2 AGO7 bZIP60_2, DEAR3, DEAR4, ETT_2, KAN4, MYB55_2, REM1, RVE1_2, SPL7, STY1_2, TOE1, WOX13, WRKY12, YAB1

    12 AGO2 AGO9 bZIP60_2, DEAR3, DEAR4, ETT_2, KAN4, MYB55_2, REM1, RVE1_2, SPL7, STY1_2, TOE1, WOX13, WRKY12, YAB1

    13 AGO2 AGO10 bZIP60_2, DEAR3, DEAR4, ETT_2, KAN4, REM1, RVE1_2, SPL7, STY1_2, TOE1, WOX13, YAB1

    14 AGO4 AGO5 -

    15 AGO4 AGO6 -

    16 AGO4 AGO7 -

    17 AGO4 AGO9 -

    18 AGO4 AGO10 -

    19 AGO5 AGO6 HSFB2A_2, HSFC1_2, KAN1, RAP2.6_2, YAB5

    20 AGO5 AGO7 HSFB2A_2, HSFC1_2, KAN1, RAP2.6_2, YAB5

    21 AGO5 AGO9 HSFB2A_2, HSFC1_2, KAN1, RAP2.6_2, YAB5

    22 AGO5 AGO10 HSFB2A_2, HSFC1_2, RAP2.6_2

    23 AGO6 AGO7 -

    24 AGO6 AGO9 -

    25 AGO6 AGO10 -

    26 AGO7 AGO9 -

    27 AGO7 AGO10 -

    28 AGO9 AGO10 -


    Table S3.5. PBM motifs found in the AGO-IP libraries: motifs only in AGOb.

    # AGOa AGOb Motifs only in AGOb

    1 AGO1 AGO2 bZIP60_2, DEAR4, ETT_2, KAN4, REM1, RVE1_2, SPL7, STY1_2, TOE1, WOX13, WRKY12, YAB1

    2 AGO1 AGO4 -

    3 AGO1 AGO5 HSFB2A_2, HSFC1_2, RAP2.6_2, YAB5

    4 AGO1 AGO6 -

    5 AGO1 AGO7 -

    6 AGO1 AGO9 -

    7 AGO1 AGO10 ARR14_2, ATHB51, DAG2, DEAR3_2, GLK1_2, HSFB2A, MYB52, TOE1_2, TOE2, WRKY12, YAB5, ZAT18

    8 AGO2 AGO4 -

    9 AGO2 AGO5 HSFB2A_2, HSFC1_2, KAN1, RAP2.6_2, YAB5

    10 AGO2 AGO6 -

    11 AGO2 AGO7 -

    12 AGO2 AGO9 -

    13 AGO2 AGO10 ARR14_2, ATHB51, DAG2, DEAR3_2, GLK1_2, HSFB2A, KAN1, MYB52, TOE1_2, TOE2, YAB5, ZAT18

    14 AGO4 AGO5 HSFB2A_2, HSFC1_2, KAN1, RAP2.6_2, YAB5

    15 AGO4 AGO6 -

    16 AGO4 AGO7 -

    17 AGO4 AGO9 -

    18 AGO4 AGO10 ARR14_2, ATHB51, DAG2, DEAR3_2, GLK1_2, HSFB2A, KAN1, MYB52, MYB55_2, TOE1_2, TOE2, WRKY12, YAB5, ZAT18

    19 AGO5 AGO6 -

    20 AGO5 AGO7 -

    21 AGO5 AGO9 -

    22 AGO5 AGO10 ARR14_2, ATHB51, DAG2, DEAR3_2, GLK1_2, HSFB2A, MYB52, MYB55_2, TOE1_2, TOE2, WRKY12, ZAT18

23 AGO6 AGO7 -

    24 AGO6 AGO9 -

    25 AGO6 AGO10 ARR14_2, ATHB51, DAG2, DEAR3_2, GLK1_2, HSFB2A, KAN1, MYB52, MYB55_2, TOE1_2, TOE2, WRKY12, YAB5, ZAT18

    26 AGO7 AGO9 -

    27 AGO7 AGO10 ARR14_2, ATHB51, DAG2, DEAR3_2, GLK1_2, HSFB2A, KAN1, MYB52, MYB55_2, TOE1_2, TOE2, WRKY12, YAB5, ZAT18

    28 AGO9 AGO10 ARR14_2, ATHB51, DAG2, DEAR3_2, GLK1_2, HSFB2A, KAN1, MYB52, MYB55_2, TOE1_2, TOE2, WRKY12, YAB5, ZAT18


    Chapter 3