University of Groningen

Computational Methods for High-Throughput Small RNA Analysis in Plants
Monteiro Morgado, Lionel

IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below.

Document version: Publisher's PDF, also known as Version of record

Publication date: 2018

Citation for published version (APA): Monteiro Morgado, L. (2018). Computational Methods for High-Throughput Small RNA Analysis in Plants. University of Groningen.

Link to publication in the University of Groningen/UMCG research database: https://research.rug.nl/en/publications/computational-methods-for-highthroughput-small-rna-analysis-in-plants(3df6e1b7-fa60-4b29-a460-6b9224408c6f).html
CHAPTER 3
Learning sequence patterns of AGO-sRNA affinity from
high-throughput sequencing libraries to improve in silico
functional small RNA detection and classification in plants
Lionel Morgado, Ritsert C Jansen and Frank Johannes
Abstract
The loading of small RNA (sRNA) into Argonaute complexes is a crucial step in all regulatory
pathways identified so far in plants that depend on such non-coding sequences. Important
transcriptional and post-transcriptional silencing mechanisms can be activated depending on
the specific plant Argonaute (AGO) protein to which a sRNA binds. It is known that sRNA-
AGO associations are at least partly encoded in the sRNA primary structure, but the
sequence features that drive this association have not been fully explored. Here we train
support vector machines (SVMs) on sRNA sequencing data obtained from AGO-
immunoprecipitation experiments to identify features that determine sRNA affinity to
specific AGOs. Our SVM models reveal that AGO affinity is strongly determined by complex
k-mers in the 5’ and 3’ ends of sRNA, in addition to well-known features such as sRNA length
and the base composition of the first nucleotide. Moreover, we find that these k-mers tend
to overlap known transcription factor (TF) binding motifs, thus highlighting a close interplay
between TF and sRNA-mediated transcriptional regulation. We integrated the learned SVMs
in a computational pipeline that can be used for de novo functional classification of sRNA
sequences. This tool, called SAILS, is provided as a web portal accessible at
http://sails.lionelmorgado.eu.
3.1 Introduction
The small RNA (sRNA) is a class of non-coding RNA with significant roles in developmental
biology, physiology, pathogen interactions, and more recently in genome stability and
transposable element control [194]. Plants have two major classes of sRNA: the microRNA
(miRNA), which is processed from imperfectly self-folded hairpin precursors derived from
miRNA genes [195]; and the small interfering RNA (siRNA) that is produced from double-
stranded RNA duplexes [196]. siRNAs can be further divided into three major groups:
secondary siRNA such as trans-acting (ta)-siRNAs, which are promoted by miRNA cleavage of
messenger RNA; natural antisense transcript (nat)-siRNAs derived from the overlapping
regions of antisense transcript pairs naturally present in the genome; and heterochromatic
(hc)-siRNAs, mostly generated from transposons, heterochromatic and repetitive genomic
regions and involved in DNA methylation and heterochromatin formation. Apart from their
biogenesis, the mode of action of a given sRNA is tightly related with the Argonaute protein
to which it can bind [197]. Argonautes form the core of all sRNA-guided silencing complexes
identified so far. Once loaded into Argonaute, sRNA guides the silencing machinery to
targets through base pairing principles. Argonautes are highly conserved proteins with family
members in most eukaryotes [197–199]. Eukaryotes have two main subfamilies of Argonautes, AGO and PIWI, but only AGO proteins are found in plants. Plant AGOs can be grouped into three phylogenetic clades with a highly variable number of
elements from species to species [198] (Figure 3.1). Arabidopsis has ten members with
specialized or redundant functions among them: AGO1, AGO5 and AGO10 in the first clade;
AGO2, AGO3 and AGO7 form the second clade; and the third clade is composed of AGO4,
AGO6, AGO8 and AGO9 [197]. Members of the first and second clade are involved in post-
transcriptional silencing (PTS) by inhibiting translation or by promoting messenger RNA
cleavage, and AGOs in the third clade are chromatin modifiers that induce transcriptional
silencing (TS) via mechanisms such as DNA methylation [200–202]. Understanding the
mechanisms that determine the loading of sRNA to specific AGOs is essential for predicting
their biological function, and for identifying their putative silencing targets.
High-throughput sequencing in combination with immunoprecipitation (IP) techniques has made it possible to determine the sequences of sRNA that are bound to different AGO
families. AGO-IP experiments have been performed for AGO1, AGO2, AGO4, AGO5, AGO6,
AGO7, AGO9 and AGO10. The low expression level of AGOs 3 and 8 suggests that they may
not be functionally relevant. Previous analyses have shown that AGO-sRNA associations are
partly determined by the 5’ terminus and the length of a sRNA sequence [203, 204].
Sequences of approximately 21 nucleotides (nt) tend to be involved in PTS, while 24 nt
sRNAs are characteristic of TS. Although sequence length has been widely used as a way to
infer the silencing pathway a given sRNA is most likely implicated in, by itself it is an
inaccurate predictor since many sequencing products can lack other structural features
known to enable AGO loading [203–206]. In contrast to animals, in plants the 5’ nucleotide is
also recognized as a strong indicator of AGO sorting. Enrichment for sequences starting with pyrimidines is frequent in AGOs from the first clade (AGO1/10: uridine; AGO5: cytosine), while adenosine dominates the third clade (AGO4/6/9) as well as AGO2 from the second
clade. Furthermore, direct experimental evidence shows that mutating the 5’ nucleotide of a
sRNA can redirect its AGO destination [207]; however, other relevant features, such as
sequence motifs encoded by the primary structure appear to play a role [200, 208–211].
While sRNAs that participate in PTS have been intensively studied, leading to the discovery
of many structural features that influence activation, much less is known about AGO-associated hc-siRNA in transcriptional silencing. Indeed, most studies that use hc-siRNAs place a strong emphasis on sequence length, ignoring other important aspects of the sRNA sequence that promote an active role in genomic regulation.

Figure 3.1. Phylogenetic tree for AGO in A. thaliana. Functional groups are indicated: transcriptional silencing (TS) and post-transcriptional silencing (PTS).

Studying AGO-bound sRNA is a
starting point to fill this gap and to improve our understanding on the relationship between
the structure and function of sRNA in plants. Here we train support vector machines (SVMs)
on sRNA sequencing data obtained from AGO-IP experiments to identify features that
determine sRNA affinity with specific AGOs. Our SVMs reveal that AGO affinity is strongly
determined by complex k-mers in the 5’ and 3’ ends of sRNA, in addition to well-known
features such as sRNA length and the base composition of the first nucleotide. Moreover,
we find that these k-mers tend to overlap known transcription factor (TF) binding motifs,
thus highlighting a close interplay between TF and sRNA-mediated transcriptional regulation.
We incorporated the learned SVMs in an online computational pipeline that can be used for
de novo sRNA functional classification. The classification pipeline is suitable for individual
sRNAs but also for high-throughput sRNA-seq datasets.
3.2 Material and methods
3.2.1 Data sets
Table 3.1 summarizes the A. thaliana deep sequencing sRNA libraries used in this study. For
SVM model training and testing we used one Col-0 wild-type sRNA library as well as eight
AGO-IP datasets. SVM model validation was afterwards performed on additional AGO-IP
data from different tissues and from pathogen infected plants, in addition to a set of
putative miRNA and ta-siRNAs from several plant species. All sRNA-seq datasets were pre-
processed and mapped to the Col-0 A. thaliana reference genome. Reads with at least one
perfect match were collapsed into unique sRNA sequences and single copy sRNAs were
removed. sRNA sequences in the genome-wide library that were not present in any of the
AGO-IP sets, were isolated as a new group named “noAGO”. The “noAGO” set was used for
SVM training in combination with AGO-IP data to learn discriminative rules to identify
sequences with low potential to load to an AGO and therefore with small chance of
becoming functional sRNA.
Table 3.1. Libraries of sRNA used in the current work. TOT: total sRNA; AGO-IP: AGO
immunoprecipitated; NA: Not Applicable. T: used for training; V: used for validation.
#  | Type     | Notes                              | Species     | Tissue        | Reference         | T | V
1  | AGO1-IP  | -                                  | A. thaliana | Inflorescence | Mi, 2008          | Y | Y
2  | AGO1-IP  | -                                  | A. thaliana | Flower        | Zhu, 2011         | N | Y
3  | AGO1-IP  | -                                  | A. thaliana | Flower        | Wang, 2011        | N | Y
4  | AGO1-IP  | -                                  | A. thaliana | Leaf          | Wang, 2011        | N | Y
5  | AGO1-IP  | -                                  | A. thaliana | Root          | Wang, 2011        | N | Y
6  | AGO1-IP  | -                                  | A. thaliana | Seedling      | Wang, 2011        | N | Y
7  | AGO1-IP  | -                                  | A. thaliana | Inflorescence | Qi, 2006          | N | Y
8  | AGO1-IP  | Pseudomonas syringae infection set | A. thaliana | Leaf          | Zhang, 2011       | N | Y
9  | AGO2-IP  | -                                  | A. thaliana | Inflorescence | Montgomery, 2008  | Y | Y
10 | AGO2-IP  | -                                  | A. thaliana | Inflorescence | Mi, 2008          | N | Y
11 | AGO2-IP  | Pseudomonas syringae infection set | A. thaliana | Leaf          | Zhang, 2011       | N | Y
12 | AGO4-IP  | -                                  | A. thaliana | Inflorescence | Mi, 2008          | Y | Y
13 | AGO4-IP  | -                                  | A. thaliana | Inflorescence | Havecker, 2010    | N | Y
14 | AGO4-IP  | -                                  | A. thaliana | Inflorescence | Qi, 2006          | N | Y
15 | AGO4-IP  | -                                  | A. thaliana | Flower        | Wang, 2011        | N | Y
16 | AGO4-IP  | -                                  | A. thaliana | Leaf          | Wang, 2011        | N | Y
17 | AGO4-IP  | -                                  | A. thaliana | Root          | Wang, 2011        | N | Y
18 | AGO4-IP  | -                                  | A. thaliana | Seedling      | Wang, 2011        | N | Y
19 | AGO5-IP  | -                                  | A. thaliana | Inflorescence | Mi, 2008          | Y | Y
20 | AGO6-IP  | -                                  | A. thaliana | Aerial        | Havecker, 2010    | Y | Y
21 | AGO7-IP  | -                                  | A. thaliana | Inflorescence | Montgomery, 2008  | Y | Y
22 | AGO9-IP  | -                                  | A. thaliana | Aerial        | Havecker, 2010    | Y | Y
23 | AGO10-IP | -                                  | A. thaliana | Inflorescence | Zhu, 2011         | Y | Y
24 | AGO1-IP  | HA-AGO1-DAH(Col-0)_Mock            | A. thaliana | Inflorescence | Garcia-Ruiz, 2015 | N | Y
25 | AGO1-IP  | HA-AGO1-DAH(Col-0)_TuMV            | A. thaliana | Inflorescence | Garcia-Ruiz, 2015 | N | Y
26 | AGO2-IP  | HA-AGO2-DAD(ago2-1)_Mock           | A. thaliana | Rosette       | Garcia-Ruiz, 2015 | N | Y
27 | AGO2-IP  | HA-AGO2-DAD(ago2-1)_TuMV-AS9       | A. thaliana | Rosette       | Garcia-Ruiz, 2015 | N | Y
28 | AGO2-IP  | HA-AGO2-DAD(ago2-1)_Mock           | A. thaliana | Cauline       | Garcia-Ruiz, 2015 | N | Y
29 | AGO2-IP  | HA-AGO2-DAD(ago2-1)_TuMV-AS9       | A. thaliana | Cauline       | Garcia-Ruiz, 2015 | N | Y
30 | AGO2-IP  | HA-AGO2-DAD(ago2-1)_Mock           | A. thaliana | Rosette       | Garcia-Ruiz, 2015 | N | Y
31 | AGO2-IP  | HA-AGO2-DAD(ago2-1)_TuMV           | A. thaliana | Rosette       | Garcia-Ruiz, 2015 | N | Y
32 | AGO1-IP  | HA-AGO1-DAH(Col)_Mock              | A. thaliana | Rosette       | Garcia-Ruiz, 2015 | N | Y
33 | AGO1-IP  | HA-AGO1-DAH(Col)_TuMV              | A. thaliana | Rosette       | Garcia-Ruiz, 2015 | N | Y
34 | AGO10-IP | HA-AGO10-DDH(Col)_Mock             | A. thaliana | Inflorescence | Garcia-Ruiz, 2015 | N | Y
35 | AGO10-IP | HA-AGO10-DDH(Col)_TuMV             | A. thaliana | Inflorescence | Garcia-Ruiz, 2015 | N | Y
36 | AGO10-IP | HA-AGO10-DDH(Col)_Mock             | A. thaliana | Rosette       | Garcia-Ruiz, 2015 | N | Y
37 | AGO10-IP | HA-AGO10-DDH(Col)_TuMV             | A. thaliana | Rosette       | Garcia-Ruiz, 2015 | N | Y
38 | AGO1-IP  | HA-AGO1-DAH(ago2-1)_Mock           | A. thaliana | Cauline       | Garcia-Ruiz, 2015 | N | Y
39 | AGO1-IP  | HA-AGO1-DAH(ago2-1)_TuMV-AS9       | A. thaliana | Cauline       | Garcia-Ruiz, 2015 | N | Y
40 | AGO10-IP | HA-AGO10-DAH(ago2-1)_Mock          | A. thaliana | Cauline       | Garcia-Ruiz, 2015 | N | Y
41 | AGO10-IP | HA-AGO10-DAH(ago2-1)_TuMV-AS9      | A. thaliana | Cauline       | Garcia-Ruiz, 2015 | N | Y
42 | AGO2-IP  | HA-AGO2-DAD(ago2-1)_Mock           | A. thaliana | Inflorescence | Garcia-Ruiz, 2015 | N | Y
43 | AGO2-IP  | HA-AGO2-DAD(ago2-1)_TuMV           | A. thaliana | Inflorescence | Garcia-Ruiz, 2015 | N | Y
44 | miRNA    | All known miRNA from Arabidopsis   | A. thaliana | NA            | Kozomara, 2014    | N | Y
45 | miRNA    | All known miRNA in plants          | All plants  | NA            | Kozomara, 2014    | N | Y
46 | ta-siRNA | All known tasiRNA from Arabidopsis | A. thaliana | NA            | Zhang, 2014       | N | Y
47 | ta-siRNA | All known tasiRNA in plants        | All plants  | NA            | Zhang, 2014       | N | Y
48 | TOT      | Total sRNA from wild type          | A. thaliana | Inflorescence | Slotkin, 2009     | Y | N
3.2.2 Learning procedure
A supervised machine learning approach rooted in the SVM algorithm was devised to learn
classifiers capable of determining AGO-sRNA affinity from the sRNA sequences alone
(Supplementary Material). Briefly, the complete inference system comprises three layers
(Figure 3.2A): layer 1 includes a binary SVM model that filters out sequences that do not
show strong evidence for binding to any of the known plant AGOs, and that therefore are
expected to be inactive; layer 2 is composed of an ensemble of binary one-vs-one classifiers,
each trained to explore the dissimilarities in AGO-bound sRNA sequences in a pairwise
fashion; finally, layer 3 consists of a voting system that assigns a single score to each AGO
using the decision values produced in the previous layer. Layers 2 and 3 are interconnected,
since the outputs of the classifiers from the 2nd layer serve as inputs to the 3rd layer, and the
3rd layer combines them to provide a more informative result. Layer 1 is independent of all
other classifiers and can therefore be decoupled if desired. All sRNA sequences used in
training, testing or validation were transformed into sets of features comprising:
i. Position specific base composition (PSBC)
One way to convert a string into a numerical representation is by using flags for the
presence of a given nucleotide in a determined sequence position. This is equivalent
to mapping each sequence position to a four-dimensional feature space that
represents each of the four possible bases in the DNA alphabet as follows:
𝐴 = 〈1,0,0,0〉, 𝐶 = 〈0,1,0,0〉, 𝐺 = 〈0,0,1,0〉, 𝑇 = 〈0,0,0,1〉.
Using this format, a sequence can be mapped to a feature space of dimensionality 𝐹 = 𝐴 ∙ 𝐿, where 𝐴 is the size of the nucleic acid alphabet and 𝐿 is the sequence length. Since the length of the sRNA sequences analyzed here is not constant but varies within an interval with an upper limit defined as 27 bases, all sRNA are projected into a space of size 27 × 4 = 108. For sequences shorter than 27 nt, the extra positions in the feature vector simply remain empty
for all nucleotides. Although this approach is a reasonable solution to cope with the
variation observed in length, it has the disadvantage of introducing noise in the
representation that increases as the 3’ end of the model sequence is approached.
Figure 3.2. Machine learning architectures employed: (A) architecture of the inference system; (B) cascade
SVM implemented. In the cascade scheme, data are split into subsets and SVs are determined for each one.
The results are combined two-by-two and used for training in the next cascade layer. This is done sequentially
till a single model is obtained in the last layer. The resulting SVs are then tested for global convergence by
feeding the result of the last layer into the first layer, together with the non-SVs. FV: feature vectors; DVi:
decision values from classifier i; TDi: Training data partition i; SVj: SVs produced by optimization j; CLk:
cascade layer k.
This happens because the rightmost nucleotides of the real sRNA sequences are projected onto more central positions of the model sequence, blurring 3’-side positional patterns across instances of variable size. To compensate for this effect, the same kind of projection, but anchored at the 3’ end of the sRNA instead of the 5’ end, was additionally considered in a feature set referred to here as PSBC2.
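The two encodings above can be sketched as follows; this is an illustrative reconstruction (the function name and layout are assumptions, not code from the thesis), one-hot encoding each position and padding up to the 27 nt maximum, with PSBC anchored at the 5’ end and PSBC2 at the 3’ end.

```python
# Sketch of the PSBC/PSBC2 encodings (illustrative, not the original code).
BASES = "ACGT"
MAX_LEN = 27  # upper length limit used in the text

def psbc(seq, anchor_3prime=False):
    """Map a sequence to a 27*4 position-specific base-composition vector.

    anchor_3prime=False gives PSBC (positions counted from the 5' end);
    anchor_3prime=True gives PSBC2 (positions counted from the 3' end).
    Unused positions for shorter sequences simply remain zero.
    """
    vec = [0.0] * (MAX_LEN * len(BASES))
    for i, base in enumerate(seq):
        pos = (len(seq) - 1 - i) if anchor_3prime else i
        vec[pos * len(BASES) + BASES.index(base)] = 1.0
    return vec

v = psbc("ATG")
assert len(v) == 108                       # 27 x 4 feature space
assert v[0:4] == [1.0, 0.0, 0.0, 0.0]      # 'A' flagged at position 1 (5' end)
assert psbc("ATG", anchor_3prime=True)[0:4] == [0.0, 0.0, 1.0, 0.0]  # 'G' at the 3' end
```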
ii. k-mer composition
Approaches based on k-mers map the occurrence of subwords of a given
length in the sRNA sequence into a feature space that represents all possible k-mers
of that length. Taking as example k-mers of size 2, there are 42=16 possibilities in the
4 letters universe of DNA. The 16 length 2-gram vector for the DNA “ACGT” alphabet
would then be:
〈𝐴𝐴, 𝐴𝐶, 𝐴𝐺, 𝐴𝑇, 𝐶𝐴, 𝐶𝐶, 𝐶𝐺, 𝐶𝑇, 𝐺𝐴, 𝐺𝐶, 𝐺𝐺, 𝐺𝑇, 𝑇𝐴, 𝑇𝐶, 𝑇𝐺, 𝑇𝑇〉.
As an example, mapping the sequence “ATGCATG” onto this vector space by counting the occurrences of each possible k-mer yields:
〈0,0,0,2,1,0,0,0,0,1,0,0,0,0,2,0〉.
It is important to note that this method focuses on the frequency of patterns rather
than their position in the sequence. Here k-mers of length 1 to 5 were explored.
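A minimal k-mer counter reproducing the worked example above (an assumed helper, not code from the thesis):

```python
# Count occurrences of every possible DNA k-mer, in lexicographic order
# over the alphabet ACGT (matching the 2-gram vector shown in the text).
from itertools import product

def kmer_vector(seq, k):
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {m: 0 for m in kmers}
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    return [counts[m] for m in kmers]

# Reproduces the example from the text:
assert kmer_vector("ATGCATG", 2) == [0,0,0,2,1,0,0,0,0,1,0,0,0,0,2,0]
```

Note that the vector records frequencies, not positions, which is why positional bias of the selected k-mers has to be checked separately (section 3.3.1).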
iii. Shannon entropy scores
Entropy as a measure of information content gives an indication about the degree of
repetitiveness in a sequence. Among several flavors, Shannon entropy is one of the
most popular and consists in a score given by:
𝐻(𝑋) = −∑𝑖 𝑝(𝑥𝑖) log2(𝑝(𝑥𝑖))
, with 𝑋 a sequence of length 𝑙, the sum running over the distinct characters 𝑥𝑖 in 𝑋, and 𝑝(𝑥𝑖) the frequency of character 𝑥𝑖.
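As a concrete illustration (a minimal sketch, not the thesis code), the score can be computed from nucleotide frequencies:

```python
# Shannon entropy of a sequence from the frequency of each nucleotide.
from collections import Counter
from math import log2

def shannon_entropy(seq):
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in Counter(seq).values())

assert shannon_entropy("AAAA") == 0.0             # fully repetitive, zero entropy
assert abs(shannon_entropy("ACGT") - 2.0) < 1e-9  # maximal for a 4-letter alphabet
```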
iv. Sequence length
This feature entails the number of nucleotides that compose each sRNA. Although
simple, the enrichment for certain sizes in specific pathways is a consistent
observation.
The learning methodology applied to layers 1 and 2 was similar. Prior to model training,
highly correlated features (|Pearson score|>0.75) were removed keeping a single
representative randomly selected from each correlated set. The remaining ones were
normalized in the range between 0 and 1, to avoid dominance effects and numerical
difficulties in the downstream calculations [212]. SVM learning with recursive feature
elimination (SVM-RFE) [213] was then applied to find models for layers 1 and 2 with a
reduced and more informative feature set. To circumvent computational problems that
typically arise when standard SVM algorithms are applied to large datasets, a linear kernel
was employed with a specialized linear solver [214]. A 5-fold cross-validation procedure was
implemented to account for data variation in each feature selection round, and the mean ROC-AUC was calculated to assess the quality of the classifiers. In each round, one third of the features with the lowest contribution to the discriminative model were eliminated, until 10 features were left. From there on, features were excluded one by one until no more features were
available. The optimal feature subset for each classifier from layers 1 and 2 was determined
with an elbow method applied to the curve formed by the mean cross-validation ROC-AUC
values recorded during the feature selection process. The best features were subsequently
used to train classifiers applying LIBSVM [215] with radial basis function (RBF) kernels to
explore non-linear relationships in the data. A cascade scheme was implemented in this task
to tackle computational problems that otherwise would not allow learning with such large
datasets (Figure 3.2B). To avoid biases in learning, training was performed with balanced
datasets by undersampling the largest class involved in each binary problem. Sequences with
the highest read abundance were prioritized, and the remaining spots were occupied by
instances randomly selected from the remaining pool(s).
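The correlation filtering, scaling and recursive-elimination steps above can be sketched as follows. This is an illustrative reconstruction on toy data using scikit-learn (an assumption; the thesis used LIBLINEAR/LIBSVM directly), and it omits the 5-fold cross-validation, elbow criterion, RBF retraining and cascade scheme.

```python
# Sketch of the layer-1/2 feature selection recipe (illustrative only).
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
X = rng.random((200, 40))                    # toy feature matrix
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)    # toy binary labels

# 1. Drop highly correlated features (|Pearson| > 0.75), keeping one per set.
corr = np.abs(np.corrcoef(X, rowvar=False))
keep = []
for j in range(X.shape[1]):
    if all(corr[j, k] <= 0.75 for k in keep):
        keep.append(j)
X = X[:, keep]

# 2. Normalize each feature to the range [0, 1].
X = MinMaxScaler().fit_transform(X)

# 3. Recursive feature elimination with a linear SVM,
#    removing roughly one third of the features per round.
rfe = RFE(LinearSVC(dual=False), n_features_to_select=10, step=0.33)
rfe.fit(X, y)
print(rfe.support_.sum())  # 10 features retained
```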
In layer 3, a balanced dataset composed of sRNA from all 8 AGO groups available was
created. Decision values were computed for each of the sequences using the classifiers
obtained for layer 2, and served as input for a 5-fold cross-validation procedure used to train
and test the inference scheme applied to the 3rd layer. Three strategies were explored to
combine the outputs from layer 2: a voting system, where the winner is the AGO protein
with the largest number of decisions in its favor; a weighted rule learned with a linear SVM
algorithm; and a weighted rule learned with a non-linear SVM using a RBF kernel. All SVM
hyperparameters in layers 1, 2 and 3, were tuned by means of a grid search. A more detailed
description of the SVM approach is provided in Supplementary Material.
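The simple voting rule for layer 3 can be illustrated as follows (a toy sketch with hypothetical decision values; the two weighted SVM-based combination rules are not shown):

```python
# Layer-3 voting over one-vs-one decision values from layer 2 (illustrative).
from collections import Counter

def vote(pairwise_decisions):
    """pairwise_decisions: {(ago_a, ago_b): decision_value}.

    A positive decision value is read as a vote for ago_a,
    a negative one as a vote for ago_b; the AGO with most votes wins.
    """
    votes = Counter()
    for (a, b), d in pairwise_decisions.items():
        votes[a if d > 0 else b] += 1
    return votes.most_common(1)[0][0]

decisions = {("AGO1", "AGO2"): 0.8,
             ("AGO1", "AGO4"): 1.2,
             ("AGO2", "AGO4"): -0.3}
assert vote(decisions) == "AGO1"  # AGO1 collects two of the three votes
```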
3.3 Results
3.3.1 Detection of high confidence functional sRNA
We explored a series of SVM classifiers to discriminate putatively functional from non-
functional sRNA based on various sRNA sequence features. As outlined above (see section
Material and Methods), SVMs were trained on sequenced A. thaliana Columbia (Col-0) AGO-
IP sRNA libraries in comparison with libraries of Col-0 total sRNA. Specific sRNA sequences
contained in the AGO-IP sRNA libraries were labeled “AGO”, whereas sequences contained
in the total library (but not in the AGO-sRNA libraries) were labeled “noAGO”. To minimize
sequencing noise, single copy sRNA were removed.
5’ and 3’ k-mers are important in distinguishing functional from non-functional sRNA
We first trained separated SVMs using either only the Position Specific Base Composition
(PSBC or PSBC2) of the sRNA sequences, sRNA entropy, sRNA length, or all possible k-mers of
Figure 3.3. Results for the classifiers trained to detect sRNA from the sets AGO vs noAGO: number of
features and performance. Pie charts represent the fraction of features in the complete set (the total number
is in the central black section of the pie) that was eliminated during the correlation analysis (red section in the
external ring) or kept (grey section in the external ring). The remaining features were subjected to selection
with SVM-RFE, of which a subset comprises the optimal features (light blue section). For each initial
conglomerate of features, accuracy was measured: using all features in a set (grey bars) and after feature
selection (light blue bars). The optimized features determined when all feature sets were analyzed together
were then subjected to non-linear learning using a cascade SVM scheme (dark blue bar). Each bar plot has on
top the standard deviation for the accuracy calculated from the 5-fold cross-validation procedure.
Figure 3.4. Features ranked by absolute value of their weights in the optimized classifier from layer 1 (“AGO” vs “noAGO”). Position specific base composition (PSBC) features are represented in the format 𝐿𝑃, where 𝐿 ∈ {𝐴, 𝐶, 𝐺, 𝑈} and 𝑃 is the position of the nucleotide in the sequence, starting the count at 5’, or at 3’ in the case of PSBC v2, which is represented by 𝑖𝐿𝑃; Shannon: sequence information content given by Shannon entropy.
size 1 to 5 nucleotides (see section Learning procedure). The SVMs trained on the PSBC or
the k-mer features achieved considerable classification accuracy (~65%), while the SVMs
trained using sRNA length or entropy performed less well (accuracy: ~50-60%, Figure 3.3).
We then joined all the features together in a single classifier. Starting with 1582 features in
total, our selection procedure yielded a final classifier containing only 138 features (Figure
3.3), and achieved a classification accuracy of nearly 80%. Of the 138 features, 127 are k-
mers and 10 correspond to position specific nucleotides (Figure 3.4).
The k-mers of the final classifier were examined in more detail. Since the k-mers had no
positional requirements for inclusion in the model, we asked whether the k-mers retained in
the final classifier showed any positional bias in the actual sRNA sequences. To do this, each
of these k-mers was mapped back to the sRNA sequences contained in the “AGO” and the
“noAGO” libraries. We found that the location of the k-mers was strongly biased toward the
5’ end of the sRNAs and to some extent also toward the 3’ end (Figure 3.5), with a visible
depletion toward the center of the sRNA sequences.
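The mapping of retained k-mers back onto the sRNA sequences can be sketched as follows (an illustrative helper, not the original analysis code); positions collected this way, aggregated over a library, underlie density profiles of the kind shown in Figure 3.5:

```python
# Record every position at which a k-mer matches within a set of sequences.
def match_positions(kmer, seqs):
    hits = []
    for s in seqs:
        hits += [i for i in range(len(s) - len(kmer) + 1)
                 if s[i:i + len(kmer)] == kmer]
    return hits

pos = match_positions("AG", ["AGTTTT", "TTTTAG"])
assert pos == [0, 4]  # one 5'-end hit and one 3'-end hit
```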
Looking at the 5’ end of sequences of meaningful length, we observed that 24 nt sRNA from the “AGO” set are dominated by k-mers starting with an adenine (A), while 21 nt sRNA show a mixture of adenine, uracil (U) and cytosine (C); in both cases 5’ guanine (G) is absent (Figure 3.6). This finding is consistent with previous reports
showing that the base composition of the first nucleotide of a sRNA is important for AGO
affinity [194, 197, 207]. However, the vast majority (125 out of 138) of the features consisted
of k-mers larger than one nucleotide in length, indicating that more complex sequence
features guide AGO-sRNA associations.
Figure 3.5. Density distribution of k-mers retained in the final classifier discriminating between functional (AGO) versus non-functional (noAGO) sRNA (layer 1). The projection of k-mers in the AGO set shows a clear positional bias toward the 5’ and 3’ ends of 21 nt and 24 nt sRNA sequences.
Figure 3.6. Frequency of nucleotides in k-mers across the sRNA libraries. This was determined from the
distribution of k-mers retained in the classifiers according to their density peaks in 21 nt and 24 nt sRNA:
(A) layer 1: AGO vs noAGO; (B) layer 2: mean of the frequencies for classifiers involving a given AGO.
5’ and 3’ k-mers are enriched for transcription factor binding motifs
We further assessed whether the k-mers correspond to known sequence motifs. To that
end, we performed a motif analysis using the “AGO” and the “noAGO” sets with MEME
[216], a computational framework for de novo and known motif identification. Then, we
focused on k-mers with a size of 5 nt and mapped them to the motifs retrieved in the
previous step. We found that the majority of k-mers (38 of 46) corresponded to segments of
known or predicted motifs, and noted a significant enrichment for k-mer matches in “AGO”
when comparing with “noAGO” (100 and 32, respectively). Interestingly, the known motifs
mapping k-mers were consistently related to stress response and development, which are
processes known to be highly regulated by sRNA (Table S3.1).
The TF enrichment was not surprising for 21 nt sRNA which are known to act on genic
sequences that frequently contain TF binding motifs. However, similar TF patterns were also
found for 24 nt sRNA, suggesting a role in gene regulation beyond the well-documented
function of 24 nt sequences in heterochromatin silencing. We identified the 5-mer “AGAAG”
as the k-mer showing the strongest enrichment in 24 nt sequences compared with 21 nt
sRNA from the “AGO” set and noted that this subsequence was associated with 3 motifs:
At1g68670, a G2-like protein involved in phosphate homeostasis; SVP, a MADS protein that
acts as a floral repressor and functions within the thermosensory pathway; and MYB77,
which is expressed in response to potassium deprivation and auxin. All these motifs have
been shown to have higher DNA-binding capacity when the targets are unmethylated [217],
revealing a possible bridge between 24 nt heterochromatic sRNA, DNA methylation and TF
occupancy.
Figure 3.7. Results for the models trained to discriminate sRNA from two different AGO-IP sets (layer 2): features selected, performance and top 3 features. The top three rows of pie charts represent the features kept after the correlation analysis (grey section), a fraction of which comprises the optimal sets obtained by feature selection (light blue section). The three following rows of pie charts represent the composition of the features kept after feature selection, with the number in the middle representing the total number of features that survived SVM-RFE. For each binary model, ROC-AUC is shown for classifiers trained: using all non-correlated features (grey bars), after feature selection (light blue bars) and using the optimized feature set in non-linear learning with the cascade SVM scheme (dark blue bars). Each bar plot has on top the standard deviation of the performance calculated from the 5-fold cross-validation procedure. The top 3 features found with the feature selection procedure (features kept in the last 3 SVM-RFE iterations) are indicated in a third section. The bottom section indicates the AGO-IP sets used in each of the 28 models, plus a colour scheme that marks inter- (white) and intra-clade models (red: clade 1; blue: clade 2; green: clade 3).
3.3.2 Classification and validation of specific AGO-sRNA associations
In the previous section, our goal was to build a classifier that can distinguish functional from
non-functional sRNA. We achieved this by training on “AGO” versus “noAGO” sRNA. A
related problem is to find properties of sRNA that allow us to infer their differential loading to
specific AGO proteins, as this will determine their particular mode of action. To achieve this,
we trained binary one-vs-one classifiers to find sequence features that discriminate between
the eight different Col-0 AGO-IP libraries (layer 2, Figure 3.2A). The ensemble of one-vs-one
classifiers was then subjected to a voting system that assigns a single score to each AGO
(layer 3, Figure 3.2A).
We used a 5-fold cross-validation scheme to determine the accuracy of the inference system
at three levels: 1. AGO: fraction of sequences for which the AGO-bound protein was
correctly predicted; 2. Clade: fraction of sequences for which the AGO prediction falls within
the correct clade; and 3. Function: fraction of sequences for which the functional group can
be correctly assigned based on the AGO prediction, translating predictions for AGO4/6/9
into a potential involvement in TS, and assignments to other AGOs as a suggestion of PTS
activity. In addition, we also performed a validation using 35 A. thaliana AGO-IP datasets
never seen by the classifiers during training, plus miRNA and ta-siRNA libraries from multiple
plants. The AGO-IP validation sets were collected from different tissues and experimental
conditions, which allowed us to evaluate the robustness of the classifiers to such variation.
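The collapse from an AGO-level prediction to the clade and function levels can be expressed as a simple lookup, following the clade structure described in the Introduction (an illustrative sketch, not the pipeline's code):

```python
# Map an AGO-level prediction to its clade and functional group (TS vs PTS).
CLADE = {"AGO1": 1, "AGO5": 1, "AGO10": 1,
         "AGO2": 2, "AGO3": 2, "AGO7": 2,
         "AGO4": 3, "AGO6": 3, "AGO8": 3, "AGO9": 3}

def function(ago):
    # Clade 3 (AGO4/6/9, plus AGO8) -> transcriptional silencing;
    # clades 1 and 2 -> post-transcriptional silencing.
    return "TS" if CLADE[ago] == 3 else "PTS"

assert function("AGO6") == "TS"
assert function("AGO7") == "PTS"
```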
Classification accuracy of sRNA at AGO, clade and function level
Results from our 5-fold cross-validation analysis showed that our classifiers achieved very
high accuracy (Figure 3.8), indicating that differences between sRNA bound to specific AGOs
can indeed be detected. Classification accuracy at the level of specific AGO proteins was
around 60% on average, ranging from as low as 40% (AGO7) to as high as 85% (AGO5). Since
AGO proteins within a clade are highly homologous and similar in function, it is likely that
sRNA-AGO binding is promiscuous in nature, which would render the search for
discriminative features more challenging. Indeed, classification accuracy at the level of the
clade and function were considerably higher than at the level of specific AGOs (clade
accuracy: 85%, function accuracy 93%; see Figure 3.8A). Moreover, the final intra-clade
classifiers were generally more complex than inter-clade classifiers, containing on average
122 features compared with 75 features, respectively (Figure 3.8C). Looking at the features
retained in the final classifiers, intra-clade classifiers contained proportionally more k-mers
of length larger than 1 nt compared with inter-clade classifiers (86% and 79% of the features
in intra-clade and inter-clade models, respectively). Hence, factors that govern AGO-specific
affinities within a clade appear to involve more complex sequence determinants. Similar to
the “AGO” versus “noAGO” analysis presented above (section Detection of high confidence
functional sRNA), we found that the k-mers in the final classifiers showed strong positional
bias toward the 5’ and 3’ ends of sRNA sequences (Figure 3.6). This finding indicates that the
same sRNA regions that differentiate functional from non-functional sequences also
contribute to the binding affinity of sRNA to specific AGO proteins.
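To make the idea of positional k-mer features concrete, here is a minimal sketch of one way such features could be encoded; the function name and example sequence are hypothetical illustrations, not the encoding used in this chapter:

```python
def positional_kmer_features(seq, k=2):
    """Binary features marking which k-mer occurs at each position.

    Such an encoding lets a linear classifier pick up positional
    biases, e.g. toward the 5' and 3' ends of an sRNA.
    """
    return {(pos, seq[pos:pos + k]): 1
            for pos in range(len(seq) - k + 1)}

# Hypothetical 21-nt sRNA sequence (illustrative only).
srna = "UGACGAAGAUAGAGAGCACGC"
feats = positional_kmer_features(srna, k=2)
print((0, "UG") in feats)  # the 5'-end dinucleotide feature is present
```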
To study the nature of the information contained in the k-mers selected by SVM-RFE in more
detail, we compared them with a motif analysis performed for each AGO library using the
MEME suite [216]. For simplicity, we focused on 5-mers (Supplementary
Material). We found that nearly 40% of the 5-mers were identified as motifs or derived
segments (Table 3.2), often related to known transcription factors with roles in development
and response to stress. For instance, in models involving AGO2 (an AGO linked to ta-siRNAs)
we identified 5-mers matching the ETT motif; ETT is an experimentally validated target of an
evolutionarily conserved ta-siRNA named tasiR-ARF [63]. Targeting of ETT by ta-siRNA
has been extensively studied and is known to have pleiotropic effects on Arabidopsis flower
development by interference with the auxin pathway [218]. These results thus support the
conclusion that SVM learning captured sequence information with relevant biological
meaning. A full overview of k-mers overlapping known or predicted motifs is provided in
Table 3.2 and in the Supplementary Material.
Validation analysis revealed that our classifiers are relatively sensitive, even across
biologically very heterogeneous datasets (Figure 3.9). Sensitivity was particularly high at the
level of clade and function (clade: 70%; function: 84.3%), and moderate at the level of
specific AGOs (AGO: 42%). Interestingly, when looking at the performance obtained for the
ta-siRNA and miRNA datasets, we observed high sensitivity both for the sets isolated
from Arabidopsis (~80% and ~90%, respectively) and for the complete plant sets (~95% and
~90%, respectively). In the set containing a mixture of ta-siRNA from diverse plants, the
sensitivity is considerably higher than in the one composed only of sequences from
Figure 3.8. Details of the AGO-sorting inference (layers 2+3). Mean accuracy across the 5-fold cross-
validation scheme for the voting system used in layer 3: (A) at the AGO, clade and function level; (B)
separated by individual AGO set. (C) Average number of features composing the intra- and inter-clade
classifiers from layer 2, and (D) percentage of those features that are k-mers. Error bars: standard
deviation.
Table 3.2. Number of 5-mers in the final classifiers from layer 2 matching motifs found with MEME in each
AGO-IP library. The 5-mers were separated by the sign of their weight w in the final classifier.

# | Set a | Set b | # features in the final classifier | # 5-mers in the final classifier, favouring AGOa (w>0) / AGOb (w<0) and mapping to a motif | % 5-mers mapping to a motif
Figure 3.9. Sensitivity for the validation sets. Bar plots show results for AGO (light grey), clade (light blue)
and function (dark blue) inference. Indication about the use of the data sets during learning, the sampled
tissue and the AGO with which they were immunoprecipitated is shown by colored squares under the bars.
After these, the data set reference numbers are indicated, following the enumeration in Table 3.1.
Arabidopsis, which supports the idea that the inference system can identify PTS sRNA not
only from the species used in the learning phase but is also extensible to other plant species.
In conclusion, the inference system is robust to tissue and treatment variation and can
recognize sRNA membership independently of cellular origin, showing the good
generalization desired for the discovery of new functional units.
3.4 Discussion
To our knowledge, this is the first time that classifiers were built to infer AGO-sRNA affinity
from the sRNA sequence alone. Using adequate solvers and learning architectures, SVMs
could be applied to large genomic datasets and discovered highly discriminative rules, both
to distinguish sequences that bind to AGOs from other sequencing products, as well as to infer
AGO-sRNA kinship. In addition to the known 5’ nucleotide composition and the sequence
length, feature selection revealed the contribution of other sequence attributes, mostly
complex k-mers, in defining sRNA preference for certain AGO proteins. How robust specific
sRNA-AGO affinities are to changes in certain sequence properties is an interesting topic for
future research and can shed light on evolutionary constraints on sRNA-mediated
transcriptional and post-transcriptional silencing pathways. To answer such questions,
additional experiments are necessary to manipulate sRNA sequences in a precise and
targeted fashion, for example by use of the CRISPR-Cas9 system. Although our inference
method appears to be highly accurate in predicting the putative function of sRNA sequences,
it is important to keep in mind that the actual biological activity of a given sRNA is
dependent on other factors beyond the AGO loading step, such as the degree of
complementarity between a sRNA and the target sequence, as well as the presence of
specific chromatin states at the target locus.
AGO-sorting information has the potential to decrease the very high number of false
positives reported by currently available PTS target prediction tools, but the true impact
needs to be further evaluated. In any case, the computational framework here developed
displays a discriminative power that makes it suitable for early screens in genome-wide sRNA
libraries when looking for candidates with the highest chance to have certain functional
roles, and when sorting sRNA by TS and PTS classes is needed. The tool is ultimately a more
affordable alternative to expensive and laborious AGO-IP experiments, since it can derive AGO-
sRNA profiles from a single genome-wide sequencing library. Another interesting application
of the framework is to explore if specific sRNA from exogenous sources, such as artificially
designed sequences or those derived from pathogens, correspond to functional plant sRNA
and which silencing pathways they are likely to target.
More sophisticated models could be developed using, for example, the expression patterns of
the AGO proteins at the moment the sRNA sequencing experiments take place. This extra
information could improve the capacity to correctly predict AGO affinity under particular
conditions, but it also demands additional and more complex inputs, thus limiting the range
of applications of the sRNA classifiers.
S3 Supplementary Material
S3.1 Support Vector Machines
The support vector machine (SVM) is a popular machine learning algorithm with strong
bonds to statistical learning theory. During the training phase, the SVM searches for the
decision hyperplane that maximizes the margin for the training set $(\vec{x}_1, y_1), \ldots, (\vec{x}_n, y_n)$,
composed of $n$ training instances $\vec{x}_i$ with labels $y_i \in \{-1, 1\}$ in 2-class problems. Maximum-
margin classifiers are frequently the ones that guarantee the highest generalization capacity,
which helps explain the often superior performance of models obtained via SVMs compared
to other machine learning algorithms [219].
To attain a classifier of the form $\vec{w} \cdot \vec{x} + b = 0$, the learning process minimizes the norm
$\|\vec{w}\|$ of the decision hyperplane's normal vector subject to $y_i(\vec{w} \cdot \vec{x}_i + b) \geq 1$, for
$i = 1, \ldots, n$. In the binary case, a classification is then given by the sign of the classifier's
output.
To accommodate points that lie on the wrong side of the decision frontier, a soft-margin
classifier can be trained instead of the hard-margin version, considering a hinge loss function
of the type

$$\zeta_i = \max\left(0,\ 1 - y_i(\vec{w} \cdot \vec{x}_i + b)\right)$$

and minimizing

$$\frac{1}{n}\sum_{i=1}^{n}\zeta_i + \lambda\|\vec{w}\|^2$$

subject to $y_i(\vec{w} \cdot \vec{x}_i + b) \geq 1 - \zeta_i$ and $\zeta_i \geq 0$, for all $i$. The variable $\lambda$ defines the
trade-off between having a larger margin and the number of training points lying on the
wrong side of the margin. For sufficiently small $\lambda$, the soft-margin formulation behaves like
the hard-margin version on linearly separable data. Most frequently, the trade-off is tuned
not through $\lambda$ but through the directly related cost parameter $C = \frac{1}{\lambda}$.
The above equation states the optimization problem in a form known as the primal.
Classically, it can be solved using quadratic programming, but more sophisticated
approaches, such as sub-gradient descent or coordinate descent, can reduce the training
time by several orders of magnitude when learning from very large datasets.
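As a concrete illustration of the sub-gradient route, the following minimal NumPy sketch minimizes the soft-margin objective above on toy data; it is an illustrative implementation under simplifying assumptions, not the solver used in this work:

```python
import numpy as np

def svm_primal_subgradient(X, y, lam=0.01, eta=0.1, epochs=500):
    """Full-batch sub-gradient descent on the primal objective
    (1/n) * sum_i max(0, 1 - y_i (w.x_i + b)) + lam * ||w||^2."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1           # hinge-active points
        w -= eta * (2 * lam * w - (viol * y) @ X / n)
        b -= eta * (-np.sum(viol * y) / n)
    return w, b

# Toy linearly separable data (illustrative only).
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -2.0], [-1.5, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = svm_primal_subgradient(X, y)
print(np.sign(X @ w + b))  # recovers the labels
```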
Alternatively, the optimization problem can be stated in the dual form

$$\text{maximize } f(\alpha_1, \ldots, \alpha_n) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} y_i\alpha_i(\vec{x}_i \cdot \vec{x}_j)y_j\alpha_j$$

subject to $\sum_{i=1}^{n}\alpha_i y_i = 0$ and $0 \leq \alpha_i \leq \frac{1}{2n\lambda}$, for all $i$; it can be solved through the
Lagrangian method. In this case the decision hyperplane is described in terms of the training
instances $\vec{x}_i$ that lie closest to it and for which the Lagrange multiplier $\alpha_i \neq 0$, with
$\vec{w} = \sum_{i=1}^{n}\alpha_i y_i \vec{x}_i$. These instances, which lie on the margin, are called support vectors (SV).
When more complex relationships between the classes are encoded in the data, it is
beneficial to use a search space of higher dimensionality in which discrimination is more
feasible. Data can be projected into such a space using non-linear functions classically
employed in machine learning, such as polynomials, sigmoids or radial basis functions (RBFs)
[220, 221]. RBFs are a special and common choice, since other kernel functions, such as the
linear and sigmoid kernels, can be obtained as special cases of RBFs [220, 221].
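A minimal sketch of the pairwise RBF kernel computation, $K(\vec{x}, \vec{z}) = \exp(-\gamma\|\vec{x}-\vec{z}\|^2)$, is shown below (illustrative only):

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2), computed for all pairs."""
    sq_dists = (np.sum(X1 ** 2, axis=1)[:, None]
                + np.sum(X2 ** 2, axis=1)[None, :]
                - 2 * X1 @ X2.T)
    return np.exp(-gamma * sq_dists)

X = np.array([[1.0, 0.0], [0.0, 1.0]])
K = rbf_kernel(X, X, gamma=0.5)
print(np.round(K, 3))  # symmetric, with ones on the diagonal
```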
S3.2 SVM-RFE
Recursive feature elimination (RFE) is a backward feature elimination method that can be
defined in three simple steps:
1. Train a classifier (optimization of weights $w_i$ according to an objective function $J$)
2. Determine the ranking criterion for all features (the cost $DJ(i)$ or $w_i^2$)
3. Remove the feature(s) with the smallest ranking criterion
The whole process is repeated until a stopping condition is met, such as reaching a desired
minimum number of features or an empty feature set.
In the case of SVM-RFE, the principles of RFE are combined with SVM model optimization in
a wrapper approach.
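The three RFE steps can be sketched as follows, using a minimal linear SVM trained by sub-gradient descent and the $w_i^2$ ranking criterion; this is an illustrative implementation, not the SVM-RFE code used in the chapter:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=300):
    """Minimal linear soft-margin SVM (batch sub-gradient descent)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1
        w -= eta * (2 * lam * w - (viol * y) @ X / n)
        b -= eta * (-np.sum(viol * y) / n)
    return w, b

def svm_rfe(X, y, n_keep=1):
    """Recursively drop the feature with the smallest w_i^2."""
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        w, _ = train_linear_svm(X[:, active], y)
        worst = int(np.argmin(w ** 2))   # smallest ranking criterion
        active.pop(worst)
    return active

# Toy data: feature 0 is informative, feature 1 is noise.
rng = np.random.default_rng(0)
X = np.column_stack([np.r_[np.ones(20), -np.ones(20)] * 2,
                     rng.normal(0, 0.1, 40)])
y = np.r_[np.ones(20), -np.ones(20)]
selected = svm_rfe(X, y, n_keep=1)
print(selected)  # the informative feature survives
```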
S3.3 SVM learning in very large datasets
SVM learning is frequently mentioned as one of the most accurate machine learning
methods, thanks to the maximum-margin principle it employs. However, searching for
the maximum-margin hyperplane in a large dataset carries a computational burden: large
matrices must be loaded and extensive optimization problems solved. This makes traditional
SVM formulations, whose training complexity lies between $\mathcal{O}(n^2)$ and $\mathcal{O}(n^3)$ in the
number $n$ of training samples [222], difficult to use when
exploring large sets. The machine learning community has been struggling to overcome such
adversity, mostly by optimizing the mathematical background of SVM learning or by
presenting “out of the box” solutions based on the theory behind SVM learning and
experimental observations.
Empirical studies showed that linear SVM learning can in some cases be sufficient, yielding
classifiers with accuracy comparable to that achieved with more complex kernel
transformations. There are solvers specialized in linear SVM learning, such as LibLinear [214],
whose training time complexity of $\mathcal{O}(n)$ grows only linearly with the number of training
instances and therefore drastically reduces the learning time compared with other SVM
formulations. Additionally, such an approach does not demand optimizing any
hyperparameter beyond the cost, further decreasing the number of training rounds
necessary to tune the algorithm.
When non-linear SVM learning is necessary, the computational burden can be attenuated by
special learning schemes, such as the cascade configuration [223], which takes advantage of
theoretical properties of the SVM algorithm. This architecture breaks the global optimization
process into smaller subproblems that are solved separately by multiple SVMs on subsets of
the data. The partial solutions, composed of the support vectors, are then joined and
reprocessed in a cascade of SVMs ending in a single model that is guaranteed to converge to
the global optimum after multiple passes through the cascade. Empirical tests showed that a
single pass over the cascade can already produce decision rules with good generalization. In
addition, this design can be parallelized, further reducing the training time.
SVM formulations for multiclass problems are available. However, since the complexity
grows as more classes are included in the same optimization problem, multiclass SVM
classification typically uses binary models. Two flavors can be found: one-vs-one and
one-vs-all [224]. In the first case, a binary classifier is trained for each pairwise combination
among the $K$ classes, for a total of $K(K-1)/2$ classifiers [219]. One-vs-all schemes use fewer
classifiers, since each classifier is trained to separate class $i$ from all remaining groups, for a
total of $K$. A final prediction can be obtained by combining the inferences of the binary
classifiers through mechanisms as simple as an unweighted voting system, where the final
decision is made in favor of the group that gathers the highest number of positive
predictions across all individual rules. Multiclass prediction through binary schemes is
proven to achieve accuracy comparable to that of direct multiclass models [219], with the
extra advantages of a parallelizable training process and ease of implementation.
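A one-vs-all scheme can be sketched as follows, with the decision value used to resolve the vote among the binary classifiers (an illustrative implementation with a minimal sub-gradient trainer, not the chapter's code):

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=300):
    """Minimal linear soft-margin SVM (batch sub-gradient descent)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1
        w -= eta * (2 * lam * w - (viol * y) @ X / n)
        b -= eta * (-np.sum(viol * y) / n)
    return w, b

def one_vs_all_fit(X, labels):
    """Train one binary SVM per class: class k (+1) vs the rest (-1)."""
    return {k: train_linear_svm(X, np.where(labels == k, 1.0, -1.0))
            for k in np.unique(labels)}

def one_vs_all_predict(models, x):
    """Decide for the class with the highest decision value (which
    also acts as tie-breaker for the unweighted vote)."""
    return max(models, key=lambda k: x @ models[k][0] + models[k][1])

# Three toy classes at well-separated centers (illustrative only).
X = np.repeat(np.array([[4.0, 0.0], [-4.0, 4.0], [-4.0, -4.0]]),
              5, axis=0)
labels = np.repeat([0, 1, 2], 5)
models = one_vs_all_fit(X, labels)
print(one_vs_all_predict(models, np.array([3.5, 0.5])))  # -> 0
```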
S3.4 Cascade SVM
The cascade SVM [223] is a learning architecture in which training is performed sequentially
on subsets of data. Each partial SVM filters uninformative data by finding sub-solutions that
are sequentially joined into a global solution through the cascade. Earlier SVM sub-solutions,
given by their respective SVs, are fed as input to a later SVM that further simplifies the
result in case of data redundancy. The hierarchy ends in a single SVM that joins the SVs from
the previously optimized problems into a final model. This divide-and-conquer strategy,
which exploits principles of SVM learning in the dual form, reduces the number of training
points used to find the global solution by dealing with smaller and therefore computationally
more manageable subsets of data. Because SVM training complexity scales between $\mathcal{O}(n^2)$
and $\mathcal{O}(n^3)$ in the number $n$ of training samples, working on smaller subsets brings obvious
computational gains, circumventing the problems faced by the SVM algorithm on large
datasets.
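A single pass of such a cascade can be sketched as follows, with a minimal sub-gradient trainer standing in for a full SVM solver (illustrative only; support vectors are approximated here as the points on or inside the margin):

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=400):
    """Minimal linear soft-margin SVM (batch sub-gradient descent)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1
        w -= eta * (2 * lam * w - (viol * y) @ X / n)
        b -= eta * (-np.sum(viol * y) / n)
    return w, b

def cascade_svm(X, y, n_splits=2, tol=0.1):
    """One pass of a two-layer cascade: train on data subsets, keep
    only each subset's support vectors, retrain on their union."""
    parts = np.array_split(np.arange(len(y)), n_splits)
    kept = []
    for part in parts:
        w, b = train_linear_svm(X[part], y[part])
        on_margin = y[part] * (X[part] @ w + b) <= 1 + tol
        kept.append(part[on_margin])
    sel = np.concatenate(kept)
    return train_linear_svm(X[sel], y[sel])

# Toy data shuffled so both classes appear in every split.
rng = np.random.default_rng(0)
X = np.vstack([[2, 2] + rng.normal(0, 0.1, (20, 2)),
               [-2, -2] + rng.normal(0, 0.1, (20, 2))])
y = np.r_[np.ones(20), -np.ones(20)]
order = rng.permutation(40)
X, y = X[order], y[order]
w, b = cascade_svm(X, y)
print((np.sign(X @ w + b) == y).all())  # True
```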
S3.5 Motif analysis
All sRNA sets used to develop the classifiers presented here were subjected to a motif search
to clarify the eventual presence of conserved patterns in the sRNA sequences. This task used
the MEME-ChIP platform [225] and was directed at the lists of unique sRNAs in the
“AGO”, “noAGO” and AGO-IP sets. First, a de novo motif discovery search was
executed for long ungapped motifs with MEME (6 to 27 nt) and short ungapped motifs with
DREME (3 to 8 nt). The motifs found to be significant (e-value ≤ 0.05) were then analyzed for
enrichment in terms of their relative position inside the sequence with CentriMo
(e-value ≤ 0.05).
Next, all motifs found were aligned against two libraries of known motifs from Arabidopsis
thaliana using Tomtom (e-value ≤ 1). The first library contained a set of 113 motifs detected
by protein-binding microarrays (PBM; Franco-Zorrilla et al., 2014) and the second is
composed of 872 motifs captured by DNA affinity purification sequencing (DAP-seq;
O’Malley et al., 2016).
Finally, the motifs from each set were compared pairwise to identify library-specific and
shared motifs.
The k-mers selected by SVM-RFE were matched with perfect similarity (no mismatches, no
gaps) to subsequences retrieved with FIMO (p-value
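The perfect-similarity matching step can be sketched as a simple substring test; the k-mers and motif sites below are made up for illustration and are not actual FIMO output:

```python
def match_kmers_to_sites(kmers, motif_sites):
    """Match k-mers to motif occurrences requiring perfect identity
    (no mismatches, no gaps), as in the FIMO post-processing step."""
    return {kmer: [site for site in motif_sites if kmer in site]
            for kmer in kmers}

# Illustrative 5-mers and motif occurrences (hypothetical data).
kmers = ["ACGTA", "TTTTT"]
sites = ["GACGTAC", "CCCCCCC"]
hits = match_kmers_to_sites(kmers, sites)
print(hits)  # {'ACGTA': ['GACGTAC'], 'TTTTT': []}
```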
Table S3.1. Motifs with 5-mer matches found in the “AGO” and “noAGO” sRNA sets.
Table S3.2. PBM motifs found in the AGO-IP libraries.

Set   | Long ungapped (MEME) | Short ungapped (DREME) | Centrally enriched (CentriMo) | Known (Tomtom) | Remapped (FIMO)
AGO1  | 1 | 40 | 10 | 25 | 36
AGO2  | 0 | 28 | 39 | 28 | 38
AGO4  | 0 | 28 |  0 | 21 | 20
AGO5  | 0 | 74 |  5 | 57 | 72
AGO6  | 1 | 22 |  2 | 14 | 10
AGO7  | 3 | 84 |  0 | 53 | 80
AGO9  | 1 | 27 |  0 | 15 | 18
AGO10 | 1 | 28 | 32 | 22 | 37
Table S3.3. All PBM motifs found in the AGO-IP libraries.
# Name TAIR ID Short description
1 ARR14 AT2G01760 Cytokinin-activated signaling pathway, phosphorelay signal transduction system.
2 ATHB51 AT5G03790 Bract formation, floral meristem determinacy, leaf morphogenesis, positive regulation of transcription, DNA-templated, regulation of timing of transition from vegetative to reproductive phase.
3 bZIP60 AT1G42990 bZIP60 mRNA is upregulated by the addition of ER stress inducers. It is involved in endoplasmic reticulum unfolded protein response, immune system process, and response to chitin.
4 DAG2 AT2G46590 Expression is limited to vascular system of the mother plant. Recessive mutation is inherited as maternal-effect and expression is not detected in the embryo. Mutants are defective in seed germination and are more dependent on light and cold treatment and less sensitive to gibberellin during seed germination.
5 DEAR3 AT2G23340 Ethylene-activated signaling pathway.
6 DEAR4 AT4G36900 Ethylene-activated signaling pathway, plant defense, freezing stress, dehydration stress.
7 ETT AT2G33860 ettin (ett) mutations have pleiotropic effects on Arabidopsis flower development, causing increases in perianth organ number, decreases in stamen number and anther formation, and apical-basal patterning defects in the gynoecium. The ETTIN gene encodes a protein with homology to DNA binding proteins which bind to auxin response elements. ETT transcript is expressed throughout stage 1 floral meristems and subsequently resolves to a complex pattern within petal, stamen and carpel primordia. ETT probably functions to impart regional identity in floral meristems that affects perianth organ number spacing, stamen formation, and regional differentiation in stamens and the gynoecium. During stage 5, ETT expression appears in a ring at the top of the floral meristem before morphological appearance of the gynoecium, consistent with the proposal that ETT is involved in prepatterning apical and basal boundaries in the gynoecium primordium. It is a target of the ta-siRNA tasiR-ARF.
8 GLK1 AT2G20570 Chloroplast organization, negative regulation of flower development, negative regulation of leaf senescence, positive regulation of transcription, regulation of chlorophyll biosynthetic process, regulation of transcription and signal transduction.
9 HSFB2A AT5G62020 The heat shock transcription factor B2A is involved in response to chitin and gametophyte development. The mRNA is cell-to-cell mobile.
10 HSFC1 AT3G24520 Heat shock transcription factor C1.
11 KAN1 AT5G16560 Involved in abaxial cell fate specification, carpel development, organ morphogenesis, plant ovule development, polarity specification of adaxial/abaxial axis, radial pattern formation, xylem and phloem pattern formation.
12 KAN4 AT5G42630 Encodes a member of the KANADI family of putative transcription factors. Involved in integument formation during ovule development and expressed at the boundary between the inner and outer integuments. It is essential for directing laminar growth of the inner integument. Along with KAN1 and KAN2, KAN4 is involved in proper localization of PIN1 (necessary for proper flowering) in early embryogenesis.
13 MYB52 AT1G17950 Involved in the regulation of secondary cell wall biogenesis and response to abscisic acid.
14 RAP2.6 AT1G43160 Involved in chloroplast organization, ethylene-activated signaling pathway, and positive regulation of transcription. Mediates cellular response to: heat, abscisic acid, cold, jasmonic acid, osmotic stress, salicylic acid, salt stress, water deprivation and wounding.
15 REM1 AT4G31610 Expressed specifically in reproductive meristems. It is a member of a moderately sized gene family distantly related to known plant DNA binding proteins. Necessary for proper flower development.
16 RVE1 AT5G17300 Myb-like transcription factor that regulates hypocotyl growth by regulating free auxin levels in a time-of-day specific manner.
17 SPL7 AT5G18830 Encodes a member of the Squamosa Binding Protein family of transcriptional regulators. SPL7 is expressed highly in roots and appears to play a role in copper homeostasis. Mutants are hypersensitive to copper deficient conditions and display a retarded growth phenotype. SPL7 binds to the promoter of the copper responsive miRNAs miR398b and miR389c.
18 STY1 A member of SHI gene family. Arabidopsis thaliana has ten members that encode proteins with a RING finger-like zinc finger motif. Despite being highly divergent in sequence, many of the SHI-related genes are partially redundant in function and synergistically promote gynoecium, stamen and leaf development in Arabidopsis. STY1/STY2 double mutants showed defective style, stigma as well as serrated leaves. Binds to the promoter of YUC4 and YUC8 (binding site ACTCTAC). It is involved in auxin homeostasis, negative regulation of gibberellic acid mediated signaling pathway, positive regulation of transcription, stigma development, style development, xylem and phloem pattern formation.
19 TOE1 AT2G28550 It is related to AP2.7 (RAP2.7). It is involved in ethylene-activated signaling pathway, multicellular organismal development, organ morphogenesis and vegetative to reproductive phase transition of meristem.
20 TOE2 AT5G60120 Involved in ethylene-activated signaling pathway and multicellular organismal development.
21 WOX13 AT4G35550 Involved in plant development.
22 WRKY12 AT2G44745 Pith secondary wall formation and a promoter of flowering.
23 WRKY38 AT5G22570 Involved in defense response to bacterium and the salicylic acid mediated signaling pathway, as well as dehydration stress. Arabidopsis FLOWERING LOCUS D influences systemic-acquired-resistance-induced expression and histone modifications of WRKY genes; in basal defense, WRKYs interact with the histone deacetylase HDA19.
24 YAB1 AT2G45190 Abaxial cell fate specification, cell fate commitment, fruit development, inflorescence meristem growth, meristem structural organization, polarity specification of adaxial/abaxial axis, regulation of flower development, regulation of leaf development, regulation of shoot apical meristem development, specification of floral organ identity, specification of organ position.
25 YAB5 AT2G26580 Polarity specification of adaxial/abaxial axis, regulation of leaf development, regulation of shoot apical meristem development.
26 ZAT14 AT5G03510 Zinc finger.
Table S3.4. PBM motifs found in the AGO-IP libraries: motifs only in AGOa.
# AGOa AGOb Motifs only in AGOa
1 AGO1 AGO2 KAN1, WRKY38, ZAT14
2 AGO1 AGO4 DEAR3, KAN1, MYB55_2, WRKY38, ZAT14
3 AGO1 AGO5 DEAR3, MYB55_2, WRKY38, ZAT14
4 AGO1 AGO6 DEAR3, KAN1, MYB55_2, WRKY38, ZAT14
5 AGO1 AGO7 DEAR3, KAN1, MYB55_2, WRKY38, ZAT14
6 AGO1 AGO9 DEAR3, KAN1, MYB55_2, WRKY38, ZAT14
7 AGO1 AGO10 DEAR3, WRKY38, ZAT14
8 AGO2 AGO4 bZIP60_2, DEAR3, DEAR4, ETT_2, KAN4, MYB55_2, REM1, RVE1_2, SPL7, STY1_2, TOE1, WOX13, WRKY12, YAB1
9 AGO2 AGO5 bZIP60_2, DEAR3, DEAR4, ETT_2, KAN4, MYB55_2, REM1, RVE1_2, SPL7, STY1_2, TOE1, WOX13, WRKY12, YAB1
10 AGO2 AGO6 bZIP60_2, DEAR3, DEAR4, ETT_2, KAN4, MYB55_2, REM1, RVE1_2, SPL7, STY1_2, TOE1, WOX13, YAB1
11 AGO2 AGO7 bZIP60_2, DEAR3, DEAR4, ETT_2, KAN4, MYB55_2, REM1, RVE1_2, SPL7, STY1_2, TOE1, WOX13, WRKY12, YAB1
12 AGO2 AGO9 bZIP60_2, DEAR3, DEAR4, ETT_2, KAN4, MYB55_2, REM1, RVE1_2, SPL7, STY1_2, TOE1, WOX13, WRKY12, YAB1
13 AGO2 AGO10 bZIP60_2, DEAR3, DEAR4, ETT_2, KAN4, REM1, RVE1_2, SPL7, STY1_2, TOE1, WOX13, YAB1
14 AGO4 AGO5 -
15 AGO4 AGO6 -
16 AGO4 AGO7 -
17 AGO4 AGO9 -
18 AGO4 AGO10 -
19 AGO5 AGO6 HSFB2A_2, HSFC1_2, KAN1, RAP2.6_2, YAB5
20 AGO5 AGO7 HSFB2A_2, HSFC1_2, KAN1, RAP2.6_2, YAB5
21 AGO5 AGO9 HSFB2A_2, HSFC1_2, KAN1, RAP2.6_2, YAB5
22 AGO5 AGO10 HSFB2A_2, HSFC1_2, RAP2.6_2
23 AGO6 AGO7 -
24 AGO6 AGO9 -
25 AGO6 AGO10 -
26 AGO7 AGO9 -
27 AGO7 AGO10 -
28 AGO9 AGO10 -
Table S3.5. PBM motifs found in the AGO-IP libraries: motifs only in AGOb.
# AGOa AGOb Motifs only in AGOb
1 AGO1 AGO2 bZIP60_2, DEAR4, ETT_2, KAN4, REM1, RVE1_2, SPL7, STY1_2, TOE1, WOX13, WRKY12, YAB1
2 AGO1 AGO4 -
3 AGO1 AGO5 HSFB2A_2, HSFC1_2, RAP2.6_2, YAB5
4 AGO1 AGO6 -
5 AGO1 AGO7 -
6 AGO1 AGO9 -
7 AGO1 AGO10 ARR14_2, ATHB51, DAG2, DEAR3_2, GLK1_2, HSFB2A, MYB52, TOE1_2, TOE2, WRKY12, YAB5, ZAT18
8 AGO2 AGO4 -
9 AGO2 AGO5 HSFB2A_2, HSFC1_2, KAN1, RAP2.6_2, YAB5
10 AGO2 AGO6 -
11 AGO2 AGO7 -
12 AGO2 AGO9 -
13 AGO2 AGO10 ARR14_2, ATHB51, DAG2, DEAR3_2, GLK1_2, HSFB2A, KAN1, MYB52, TOE1_2, TOE2, YAB5, ZAT18
14 AGO4 AGO5 HSFB2A_2, HSFC1_2, KAN1, RAP2.6_2, YAB5
15 AGO4 AGO6 -
16 AGO4 AGO7 -
17 AGO4 AGO9 -
18 AGO4 AGO10 ARR14_2, ATHB51, DAG2, DEAR3_2, GLK1_2, HSFB2A, KAN1, MYB52, MYB55_2, TOE1_2, TOE2, WRKY12, YAB5, ZAT18
19 AGO5 AGO6 -
20 AGO5 AGO7 -
21 AGO5 AGO9 -
22 AGO5 AGO10 ARR14_2, ATHB51, DAG2, DEAR3_2, GLK1_2, HSFB2A, MYB52, MYB55_2, TOE1_2, TOE2, WRKY12, ZAT18
23 AGO6 AGO7 -
24 AGO6 AGO9 -
25 AGO6 AGO10 ARR14_2, ATHB51, DAG2, DEAR3_2, GLK1_2, HSFB2A, KAN1, MYB52, MYB55_2, TOE1_2, TOE2, WRKY12, YAB5, ZAT18
26 AGO7 AGO9 -
27 AGO7 AGO10 ARR14_2, ATHB51, DAG2, DEAR3_2, GLK1_2, HSFB2A, KAN1, MYB52, MYB55_2, TOE1_2, TOE2, WRKY12, YAB5, ZAT18
28 AGO9 AGO10 ARR14_2, ATHB51, DAG2, DEAR3_2, GLK1_2, HSFB2A, KAN1, MYB52, MYB55_2, TOE1_2, TOE2, WRKY12, YAB5, ZAT18