University of Groningen

Computational Methods for High-Throughput Small RNA Analysis in Plants
Monteiro Morgado, Lionel

IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below.

Document Version: Publisher's PDF, also known as Version of record

Publication date: 2018

Link to publication in University of Groningen/UMCG research database

Citation for published version (APA): Monteiro Morgado, L. (2018). Computational Methods for High-Throughput Small RNA Analysis in Plants. University of Groningen.

Copyright: Other than for strictly personal use, it is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license (like Creative Commons).

Take-down policy: If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.

Downloaded from the University of Groningen/UMCG research database (Pure): http://www.rug.nl/research/portal. For technical reasons the number of authors shown on this cover page is limited to 10 maximum.

Download date: 13-06-2021



https://research.rug.nl/en/publications/computational-methods-for-highthroughput-small-rna-analysis-in-plants(3df6e1b7-fa60-4b29-a460-6b9224408c6f).html


CHAPTER 3

Learning sequence patterns of AGO-sRNA affinity from high-throughput sequencing libraries to improve in silico functional small RNA detection and classification in plants

Lionel Morgado, Ritsert C Jansen and Frank Johannes

Abstract

The loading of small RNA (sRNA) into Argonaute complexes is a crucial step in all regulatory pathways identified so far in plants that depend on such non-coding sequences. Important transcriptional and post-transcriptional silencing mechanisms can be activated depending on the specific plant Argonaute (AGO) protein to which a sRNA binds. It is known that sRNA-AGO associations are at least partly encoded in the sRNA primary structure, but the sequence features that drive this association have not been fully explored. Here we train support vector machines (SVMs) on sRNA sequencing data obtained from AGO-immunoprecipitation experiments to identify features that determine sRNA affinity to specific AGOs. Our SVM models reveal that AGO affinity is strongly determined by complex k-mers in the 5’ and 3’ ends of sRNA, in addition to well-known features such as sRNA length and the base composition of the first nucleotide. Moreover, we find that these k-mers tend to overlap known transcription factor (TF) binding motifs, thus highlighting a close interplay between TF and sRNA-mediated transcriptional regulation. We integrated the learned SVMs in a computational pipeline that can be used for de novo functional classification of sRNA sequences. This tool, called SAILS, is provided as a web portal accessible at http://sails.lionelmorgado.eu.


    3.1 Introduction

Small RNA (sRNA) is a class of non-coding RNA with significant roles in developmental biology, physiology, pathogen interactions, and more recently in genome stability and transposable element control [194]. Plants have two major classes of sRNA: the microRNA (miRNA), which is processed from imperfectly self-folded hairpin precursors derived from miRNA genes [195]; and the small interfering RNA (siRNA), which is produced from double-stranded RNA duplexes [196]. siRNAs can be further divided into three major groups: secondary siRNA such as trans-acting (ta)-siRNAs, which are promoted by miRNA cleavage of messenger RNA; natural antisense transcript (nat)-siRNAs, derived from the overlapping regions of antisense transcript pairs naturally present in the genome; and heterochromatic (hc)-siRNAs, mostly generated from transposons, heterochromatic and repetitive genomic regions and involved in DNA methylation and heterochromatin formation. Apart from their biogenesis, the mode of action of a given sRNA is tightly related to the Argonaute protein to which it can bind [197]. Argonautes form the core of all sRNA-guided silencing complexes identified so far. Once loaded into Argonaute, sRNA guides the silencing machinery to targets through base-pairing principles. Argonautes are highly conserved proteins with family members in most eukaryotes [197–199]. There are two main subfamilies of Argonautes in eukaryotes, AGO and PIWI, but only AGO proteins are found in plants. In plants, AGOs can be grouped into three phylogenetic clades with a highly variable number of members from species to species [198] (Figure 3.1). Arabidopsis has ten members with specialized or redundant functions among them: AGO1, AGO5 and AGO10 form the first clade; AGO2, AGO3 and AGO7 form the second clade; and the third clade is composed of AGO4, AGO6, AGO8 and AGO9 [197]. Members of the first and second clades are involved in post-transcriptional silencing (PTS) by inhibiting translation or by promoting messenger RNA cleavage, and AGOs in the third clade are chromatin modifiers that induce transcriptional silencing (TS) via mechanisms such as DNA methylation [200–202]. Understanding the mechanisms that determine the loading of sRNA to specific AGOs is essential for predicting their biological function, and for identifying their putative silencing targets.

High-throughput sequencing in combination with immunoprecipitation (IP) techniques has made it possible to determine the sequences of sRNA that are bound to different AGO families. AGO-IP experiments have been performed for AGO1, AGO2, AGO4, AGO5, AGO6, AGO7, AGO9 and AGO10. The low expression level of AGOs 3 and 8 suggests that they may not be functionally relevant. Previous analyses have shown that AGO-sRNA associations are partly determined by the 5’ terminus and the length of a sRNA sequence [203, 204]. Sequences of approximately 21 nucleotides (nt) tend to be involved in PTS, while 24 nt sRNAs are characteristic of TS. Although sequence length has been widely used to infer the silencing pathway a given sRNA is most likely implicated in, by itself it is an inaccurate predictor, since many sequencing products lack other structural features known to enable AGO loading [203–206]. Contrary to animals, in plants the 5’ nucleotide is also recognized as a strong indicator of AGO sorting. Enrichment for sequences starting with pyrimidines is frequent in AGOs from the first clade (AGO1/10: uridine; AGO5: cytosine), while adenine dominates the third clade (AGO4/6/9) as well as AGO2 from the second clade. Furthermore, direct experimental evidence shows that mutating the 5’ nucleotide of a sRNA can redirect its AGO destination [207]; however, other relevant features, such as sequence motifs encoded by the primary structure, also appear to play a role [200, 208–211].

While sRNAs that participate in PTS have been intensively studied, leading to the discovery of many structural features that influence activation, much less is known about AGO-associated hc-siRNA in transcriptional silencing. Indeed, most studies that use hc-siRNAs place a strong emphasis on sequence length, ignoring other important aspects of the sRNA sequence that promote an active role in genomic regulation. Studying AGO-bound sRNA is a starting point to fill this gap and to improve our understanding of the relationship between the structure and function of sRNA in plants. Here we train support vector machines (SVMs) on sRNA sequencing data obtained from AGO-IP experiments to identify features that determine sRNA affinity with specific AGOs. Our SVMs reveal that AGO affinity is strongly determined by complex k-mers in the 5’ and 3’ ends of sRNA, in addition to well-known features such as sRNA length and the base composition of the first nucleotide. Moreover, we find that these k-mers tend to overlap known transcription factor (TF) binding motifs, thus highlighting a close interplay between TF and sRNA-mediated transcriptional regulation. We incorporated the learned SVMs in an online computational pipeline that can be used for de novo sRNA functional classification. The classification pipeline is suitable for individual sRNAs but also for high-throughput sRNA-seq datasets.

Figure 3.1. Phylogenetic tree for AGO in A. thaliana. Functional groups are indicated: transcriptional silencing (TS) and post-transcriptional silencing (PTS).

    3.2 Material and methods

    3.2.1 Data sets

Table 3.1 summarizes the A. thaliana deep sequencing sRNA libraries used in this study. For SVM model training and testing we used one Col-0 wild-type sRNA library as well as eight AGO-IP datasets. SVM model validation was afterwards performed on additional AGO-IP data from different tissues and from pathogen-infected plants, in addition to a set of putative miRNA and ta-siRNAs from several plant species. All sRNA-seq datasets were pre-processed and mapped to the Col-0 A. thaliana reference genome. Reads with at least one perfect match were collapsed into unique sRNA sequences and single-copy sRNAs were removed. sRNA sequences in the genome-wide library that were not present in any of the AGO-IP sets were isolated as a new group named “noAGO”. The “noAGO” set was used for SVM training in combination with AGO-IP data to learn discriminative rules to identify sequences with low potential to load to an AGO and therefore with a small chance of becoming functional sRNA.


Table 3.1. Libraries of sRNA used in the current work. TOT: total sRNA; AGO-IP: AGO immunoprecipitated; NA: Not Applicable. T: used for training; V: used for validation.

# Type Notes Species Tissue Reference T V

1 AGO1-IP - A. thaliana Inflorescence Mi, 2008 Y Y
2 AGO1-IP - A. thaliana Flower Zhu, 2011 N Y
3 AGO1-IP - A. thaliana Flower Wang, 2011 N Y
4 AGO1-IP - A. thaliana Leaf Wang, 2011 N Y
5 AGO1-IP - A. thaliana Root Wang, 2011 N Y
6 AGO1-IP - A. thaliana Seedling Wang, 2011 N Y
7 AGO1-IP - A. thaliana Inflorescence Qi, 2006 N Y
8 AGO1-IP Pseudomonas syringae infection set A. thaliana Leaf Zhang, 2011 N Y
9 AGO2-IP - A. thaliana Inflorescence Montgomery, 2008 Y Y
10 AGO2-IP - A. thaliana Inflorescence Mi, 2008 N Y
11 AGO2-IP Pseudomonas syringae infection set A. thaliana Leaf Zhang, 2011 N Y
12 AGO4-IP - A. thaliana Inflorescence Mi, 2008 Y Y
13 AGO4-IP - A. thaliana Inflorescence Havecker, 2010 N Y
14 AGO4-IP - A. thaliana Inflorescence Qi, 2006 N Y
15 AGO4-IP - A. thaliana Flower Wang, 2011 N Y
16 AGO4-IP - A. thaliana Leaf Wang, 2011 N Y
17 AGO4-IP - A. thaliana Root Wang, 2011 N Y
18 AGO4-IP - A. thaliana Seedling Wang, 2011 N Y
19 AGO5-IP - A. thaliana Inflorescence Mi, 2008 Y Y
20 AGO6-IP - A. thaliana Aerial Havecker, 2010 Y Y
21 AGO7-IP - A. thaliana Inflorescence Montgomery, 2008 Y Y
22 AGO9-IP - A. thaliana Aerial Havecker, 2010 Y Y
23 AGO10-IP - A. thaliana Inflorescence Zhu, 2011 Y Y
24 AGO1-IP HA-AGO1-DAH(Col-0)_Mock A. thaliana Inflorescence Garcia-Ruiz, 2015 N Y
25 AGO1-IP HA-AGO1-DAH(Col-0)_TuMV A. thaliana Inflorescence Garcia-Ruiz, 2015 N Y
26 AGO2-IP HA-AGO2-DAD(ago2-1)_Mock A. thaliana Rosette Garcia-Ruiz, 2015 N Y
27 AGO2-IP HA-AGO2-DAD(ago2-1)_TuMV-AS9 A. thaliana Rosette Garcia-Ruiz, 2015 N Y
28 AGO2-IP HA-AGO2-DAD(ago2-1)_Mock A. thaliana Cauline Garcia-Ruiz, 2015 N Y
29 AGO2-IP HA-AGO2-DAD(ago2-1)_TuMV-AS9 A. thaliana Cauline Garcia-Ruiz, 2015 N Y
30 AGO2-IP HA-AGO2-DAD(ago2-1)_Mock A. thaliana Rosette Garcia-Ruiz, 2015 N Y
31 AGO2-IP HA-AGO2-DAD(ago2-1)_TuMV A. thaliana Rosette Garcia-Ruiz, 2015 N Y
32 AGO1-IP HA-AGO1-DAH(Col)_Mock A. thaliana Rosette Garcia-Ruiz, 2015 N Y
33 AGO1-IP HA-AGO1-DAH(Col)_TuMV A. thaliana Rosette Garcia-Ruiz, 2015 N Y
34 AGO10-IP HA-AGO10-DDH(Col)_Mock A. thaliana Inflorescence Garcia-Ruiz, 2015 N Y
35 AGO10-IP HA-AGO10-DDH(Col)_TuMV A. thaliana Inflorescence Garcia-Ruiz, 2015 N Y
36 AGO10-IP HA-AGO10-DDH(Col)_Mock A. thaliana Rosette Garcia-Ruiz, 2015 N Y
37 AGO10-IP HA-AGO10-DDH(Col)_TuMV A. thaliana Rosette Garcia-Ruiz, 2015 N Y
38 AGO1-IP HA-AGO1-DAH(ago2-1)_Mock A. thaliana Cauline Garcia-Ruiz, 2015 N Y
39 AGO1-IP HA-AGO1-DAH(ago2-1)_TuMV-AS9 A. thaliana Cauline Garcia-Ruiz, 2015 N Y
40 AGO10-IP HA-AGO10-DAH(ago2-1)_Mock A. thaliana Cauline Garcia-Ruiz, 2015 N Y
41 AGO10-IP HA-AGO10-DAH(ago2-1)_TuMV-AS9 A. thaliana Cauline Garcia-Ruiz, 2015 N Y
42 AGO2-IP HA-AGO2-DAD(ago2-1)_Mock A. thaliana Inflorescence Garcia-Ruiz, 2015 N Y
43 AGO2-IP HA-AGO2-DAD(ago2-1)_TuMV A. thaliana Inflorescence Garcia-Ruiz, 2015 N Y
44 miRNA All known miRNA from Arabidopsis A. thaliana NA Kozomara, 2014 N Y
45 miRNA All known miRNA in plants All plants NA Kozomara, 2014 N Y
46 ta-siRNA All known tasiRNA from Arabidopsis A. thaliana NA Zhang, 2014 N Y
47 ta-siRNA All known tasiRNA in plants All plants NA Zhang, 2014 N Y
48 TOT Total sRNA from wild type A. thaliana Inflorescence Slotkin, 2009 Y N


    3.2.2 Learning procedure

A supervised machine learning approach rooted in the SVM algorithm was devised to learn classifiers capable of determining AGO-sRNA affinity from the sRNA sequences alone (Supplementary Material). Briefly, the complete inference system comprises three layers (Figure 3.2A): layer 1 includes a binary SVM model that filters out sequences that do not show strong evidence for binding to any of the known plant AGOs, and that therefore are expected to be inactive; layer 2 is composed of an ensemble of binary one-vs-one classifiers, each trained to explore the dissimilarities in AGO-bound sRNA sequences in a pairwise fashion; finally, layer 3 consists of a voting system that assigns a single score to each AGO using the decision values produced in the previous layer. Layers 2 and 3 are interconnected, since the outputs of the classifiers from the 2nd layer serve as inputs to the 3rd layer, and the 3rd layer combines them to provide a more informative result. Layer 1 is independent of all other classifiers and can therefore be decoupled if desired. All sRNA sequences used in training, testing or validation were transformed into sets of features comprising:

i. Position specific base composition (PSBC)

One way to convert a string into a numerical representation is by using flags for the presence of a given nucleotide in a determined sequence position. This is equivalent to mapping each sequence position to a four-dimensional feature space that represents each of the four possible bases in the DNA alphabet as follows: 𝐴 = 〈1,0,0,0〉, 𝐶 = 〈0,1,0,0〉, 𝐺 = 〈0,0,1,0〉, 𝑇 = 〈0,0,0,1〉. Using this format, a sequence can be mapped to a feature space of dimensionality 𝐹 = 𝐴 ∙ 𝐿, where 𝐴 is the number of possibilities in the alphabet of nucleic acids and 𝐿 is the sequence length. Since the length of the sRNA sequences analyzed here is not constant but varies in an interval with an upper limit here defined as 27 bases, all sRNA must be projected into a space of size 27 × 4 = 108. In the case of sequences shorter than 27 nucleotides, the extra positions in the feature vector simply remain empty for all nucleotides. Although this approach is a reasonable solution to cope with the variation observed in length, it has the disadvantage of introducing noise in the representation that increases as the 3’ end of the model sequence is approached.


Figure 3.2. Machine learning architectures employed: (A) architecture of the inference system; (B) cascade SVM implemented. In the cascade scheme, data are split into subsets and SVs are determined for each one. The results are combined two-by-two and used for training in the next cascade layer. This is done sequentially until a single model is obtained in the last layer. The resulting SVs are then tested for global convergence by feeding the result of the last layer into the first layer, together with the non-SVs. FV: feature vectors; DVi: decision values from classifier i; TDi: training data partition i; SVj: SVs produced by optimization j; CLk: cascade layer k.
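The two-by-two combination in the cascade can be sketched as follows. This is an illustrative sketch only: `train_svm_and_get_svs` is a hypothetical stand-in for a real solver (e.g. LIBSVM) that would fit an SVM and return only its support vectors; here it is mocked so the data flow of Figure 3.2B is runnable. The final convergence check described in the caption is omitted.

```python
def train_svm_and_get_svs(data):
    # Placeholder (assumption): a real implementation would fit an SVM
    # and return only the training points selected as support vectors.
    return sorted(set(data))

def cascade_train(partitions):
    """Combine per-partition SVs two-by-two until a single model remains."""
    layer = [train_svm_and_get_svs(p) for p in partitions]
    while len(layer) > 1:
        merged = []
        for i in range(0, len(layer) - 1, 2):
            # Retrain on the union of two SV sets (next cascade layer).
            merged.append(train_svm_and_get_svs(layer[i] + layer[i + 1]))
        if len(layer) % 2 == 1:
            merged.append(layer[-1])  # odd partition carried forward
        layer = merged
    return layer[0]  # SVs of the last cascade layer
```

With four toy partitions, `cascade_train([[3, 1], [2, 2], [5, 4], [6]])` merges them pairwise over two cascade layers.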


This happens because the rightmost nucleotides in the real sRNA sequences are projected into more central positions of the model sequence, blurring 3’-side positional patterns when looking across instances of variable size. To compensate for that effect, the same kind of projection but starting at the 3’ position of the sRNA instead of the 5’ was additionally considered, in a feature set referred to here as PSBC2.
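The 5’- and 3’-anchored encodings can be sketched as below. This is a minimal illustration assuming the 27-base upper limit and zero-padding of unused positions as described above; the function and constant names are ours.

```python
BASES = "ACGT"
MAX_LEN = 27  # upper length limit stated in the text

def psbc(seq, anchor="5p"):
    """One-hot encode a sequence into a 27*4 = 108-dimensional vector.

    anchor="5p": positions counted from the 5' end (PSBC);
    anchor="3p": positions counted from the 3' end (PSBC2), which keeps
    3'-side positional patterns aligned across variable-length sRNAs.
    Positions beyond the sequence length stay at 0 for all bases.
    """
    vec = [0.0] * (MAX_LEN * len(BASES))
    s = seq if anchor == "5p" else seq[::-1]
    for pos, base in enumerate(s[:MAX_LEN]):
        vec[pos * len(BASES) + BASES.index(base)] = 1.0
    return vec
```

For example, `psbc("ACG")` sets exactly three entries to 1.0, one per position, while `psbc("ACG", anchor="3p")` counts positions from the 3’ end instead.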

ii. k-mer composition

Approaches based on k-mers map the occurrence of subwords of a given length in the sRNA sequence into a feature space that represents all possible k-mers of that length. Taking as an example k-mers of size 2, there are 4² = 16 possibilities in the 4-letter universe of DNA. The 16-dimensional 2-gram vector for the DNA “ACGT” alphabet would then be: 〈𝐴𝐴, 𝐴𝐶, 𝐴𝐺, 𝐴𝑇, 𝐶𝐴, 𝐶𝐶, 𝐶𝐺, 𝐶𝑇, 𝐺𝐴, 𝐺𝐶, 𝐺𝐺, 𝐺𝑇, 𝑇𝐴, 𝑇𝐶, 𝑇𝐺, 𝑇𝑇〉. As an example, mapping the sequence “ATGCATG” onto this vector space, counting the occurrences of each of the possible k-mers, yields: 〈0,0,0,2,1,0,0,0,0,1,0,0,0,0,2,0〉. It is important to note that this method focuses on the frequency of patterns rather than their position in the sequence. Here, k-mers of length 1 to 5 were explored.
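A minimal sketch of this mapping (the function name is ours), reproducing the worked “ATGCATG” example:

```python
from itertools import product

def kmer_vector(seq, k):
    """Count occurrences of every possible k-mer, in lexicographic order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {m: 0 for m in kmers}
    for i in range(len(seq) - k + 1):  # sliding window over the sequence
        counts[seq[i:i + k]] += 1
    return [counts[m] for m in kmers]

# kmer_vector("ATGCATG", 2) → [0,0,0,2,1,0,0,0,0,1,0,0,0,0,2,0]
```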

iii. Shannon entropy scores

Entropy as a measure of information content gives an indication of the degree of repetitiveness in a sequence. Among several flavors, Shannon entropy is one of the most popular and consists of the score

𝐻(𝑋) = −∑𝑖 𝑝(𝑥𝑖) log₂ 𝑝(𝑥𝑖),

with 𝑋 a sequence and 𝑝(𝑥𝑖) the relative frequency of character 𝑥𝑖 in 𝑋.
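Under this reading of 𝑝(𝑥𝑖) as the relative frequency of each distinct character, the score can be computed as:

```python
from collections import Counter
from math import log2

def shannon_entropy(seq):
    """H(X) = -sum_i p(x_i) * log2 p(x_i), with p(x_i) the relative
    frequency of each distinct character in the sequence."""
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in Counter(seq).values())

# shannon_entropy("AACG") → 1.5, since p = {A: 0.5, C: 0.25, G: 0.25}
```

A perfectly repetitive sequence such as "AAAA" scores 0, the minimum.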

iv. Sequence length

This feature entails the number of nucleotides that compose each sRNA. Although simple, the enrichment for certain sizes in specific pathways is a consistent observation.


The learning methodology applied to layers 1 and 2 was similar. Prior to model training, highly correlated features (|Pearson score| > 0.75) were removed, keeping a single representative randomly selected from each correlated set. The remaining features were normalized to the range between 0 and 1, to avoid dominance effects and numerical difficulties in the downstream calculations [212]. SVM learning with recursive feature elimination (SVM-RFE) [213] was then applied to find models for layers 1 and 2 with a reduced and more informative feature set. To circumvent computational problems that typically arise when standard SVM algorithms are applied to large datasets, a linear kernel was employed with a specialized linear solver [214]. A 5-fold cross-validation procedure was implemented to account for data variation in each feature selection round, and the mean ROC-AUC was calculated to assess the quality of the classifiers. In each round, 1/3 of the features with the lowest contribution to the discriminative model were eliminated, until 10 features were left. From there on, features were excluded one by one until no more features were available. The optimal feature subset for each classifier from layers 1 and 2 was determined with an elbow method applied to the curve formed by the mean cross-validation ROC-AUC values recorded during the feature selection process. The best features were subsequently used to train classifiers applying LIBSVM [215] with radial basis function (RBF) kernels to explore non-linear relationships in the data. A cascade scheme was implemented in this task to tackle computational problems that otherwise would not allow learning with such large datasets (Figure 3.2B). To avoid biases in learning, training was performed with balanced datasets by undersampling the largest class involved in each binary problem. Sequences with the highest read abundance were prioritized, and the remaining spots were occupied by instances randomly selected from the remaining pool(s).
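The two preprocessing steps (correlation pruning and [0, 1] normalization) can be sketched as follows. This is a pure-Python illustration under assumptions of ours: the greedy keep-first rule for choosing a representative per correlated set, and non-constant feature vectors; the thesis itself selects representatives randomly.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant feature vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_correlated(features, threshold=0.75):
    """features: dict name -> list of values. Greedily keeps a feature only
    if its |Pearson r| with every already-kept feature is <= threshold."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

def minmax(values):
    """Scale a feature vector to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]
```

For example, a feature that is a scaled copy of another (r = 1) is dropped, while a weakly correlated one survives.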

In layer 3, a balanced dataset composed of sRNA from all 8 available AGO groups was created. Decision values were computed for each of the sequences using the classifiers obtained for layer 2, and served as input for a 5-fold cross-validation procedure used to train and test the inference scheme applied to the 3rd layer. Three strategies were explored to combine the outputs from layer 2: a voting system, where the winner is the AGO protein with the largest number of decisions in its favor; a weighted rule learned with a linear SVM algorithm; and a weighted rule learned with a non-linear SVM using a RBF kernel. All SVM hyperparameters in layers 1, 2 and 3 were tuned by means of a grid search. A more detailed description of the SVM approach is provided in Supplementary Material.
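The first of the three combination strategies, the plain voting system, can be sketched as below. This is a minimal illustration with names of ours; we assume the sign convention that a positive pairwise decision value favors the first AGO of the pair, and ties are broken arbitrarily. The two weighted SVM variants would replace this counting rule.

```python
from collections import Counter

def majority_vote(pairwise_decisions):
    """pairwise_decisions: dict (ago_i, ago_j) -> decision value from the
    layer-2 one-vs-one classifier; positive values vote for ago_i,
    negative for ago_j. Returns the AGO with most votes."""
    votes = Counter()
    for (ago_i, ago_j), d in pairwise_decisions.items():
        votes[ago_i if d > 0 else ago_j] += 1
    winner, _ = votes.most_common(1)[0]  # ties resolved arbitrarily
    return winner
```

For instance, with three pairwise classifiers over AGO1, AGO2 and AGO4, two votes for AGO4 outweigh one for AGO1.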


    3.3 Results

    3.3.1 Detection of high confidence functional sRNA

We explored a series of SVM classifiers to discriminate putatively functional from non-functional sRNA based on various sRNA sequence features. As outlined above (see section Material and Methods), SVMs were trained on sequenced A. thaliana Columbia (Col-0) AGO-IP sRNA libraries in comparison with libraries of Col-0 total sRNA. Specific sRNA sequences contained in the AGO-IP sRNA libraries were labeled “AGO”, whereas sequences contained in the total library (but not in the AGO-sRNA libraries) were labeled “noAGO”. To minimize sequencing noise, single-copy sRNAs were removed.
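The labeling step amounts to simple set operations; a minimal sketch (function name ours; single-copy removal is assumed to have happened upstream):

```python
def label_sets(total_library, ago_ip_libraries):
    """Sequences seen in any AGO-IP library form the "AGO" set; sequences
    found only in the total (wild-type) library form the "noAGO" set."""
    ago = set().union(*ago_ip_libraries)
    no_ago = set(total_library) - ago
    return ago, no_ago
```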

    5’ and 3’ k-mers are important in distinguishing functional from non-functional sRNA

We first trained separate SVMs using either only the position specific base composition (PSBC or PSBC2) of the sRNA sequences, sRNA entropy, sRNA length, or all possible k-mers of size 1 to 5 nucleotides (see section Learning procedure). The SVMs trained on the PSBC or the k-mer features achieved considerable classification accuracy (~65%), while the SVMs trained using sRNA length or entropy performed less well (accuracy: ~50-60%, Figure 3.3). We then joined all the features together in a single classifier. Starting with 1582 features in total, our selection procedure yielded a final classifier containing only 138 features (Figure 3.3), and achieved a classification accuracy of nearly 80%. Of the 138 features, 127 are k-mers and 10 correspond to position specific nucleotides (Figure 3.4).

Figure 3.3. Results for the classifiers trained to detect sRNA from the sets AGO vs noAGO: number of features and performance. Pie charts represent the fraction of features in the complete set (the total number is in the central black section of the pie) that was eliminated during the correlation analysis (red section in the external ring) or kept (grey section in the external ring). The remaining features were subjected to selection with SVM-RFE, of which a subset comprises the optimal features (light blue section). For each initial conglomerate of features, accuracy was measured using all features in a set (grey bars) and after feature selection (light blue bars). The optimized features determined when all feature sets were analyzed together were then subjected to non-linear learning using a cascade SVM scheme (dark blue bar). Each bar plot has on top the standard deviation for the accuracy calculated from the 5-fold cross-validation procedure.

Figure 3.4. Features ranked by absolute value of their weights in the optimized classifier from layer 1 (“AGO” vs “noAGO”). Position specific base composition (PSBC) features are represented in the format 𝐿𝑃, where 𝐿 ∈ {𝐴, 𝐶, 𝐺, 𝑈} and 𝑃 is the position of the nucleotide in the sequence, starting the count at the 5’ end, or at the 3’ end in the case of PSBC v2, which is represented by 𝑖𝐿𝑃; Shannon: sequence information content given by Shannon entropy.

The k-mers of the final classifier were examined in more detail. Since the k-mers had no positional requirements for inclusion in the model, we asked whether the k-mers retained in the final classifier showed any positional bias in the actual sRNA sequences. To do this, each of these k-mers was mapped back to the sRNA sequences contained in the “AGO” and the “noAGO” libraries. We found that the location of the k-mers was strongly biased toward the 5’ end of the sRNAs and to some extent also toward the 3’ end (Figure 3.5), with a visible depletion toward the center of the sRNA sequences.

Looking at the 5’ end of sequences of meaningful length, we observed that 24 nt sRNAs from the “AGO” set are dominated by k-mers starting with an adenine (A), while 21 nt sRNAs have a mixture of adenine, uracil (U) and cytosine (C); in both cases, 5’ guanine (G) is absent (Figure 3.6). This finding is consistent with previous reports showing that the base composition of the first nucleotide of a sRNA is important for AGO affinity [194, 197, 207]. However, the vast majority (125 out of 138) of the features consisted of k-mers larger than one nucleotide in length, indicating that more complex sequence features guide AGO-sRNA associations.

    Figure 3.5. Density distribution of k-mers retained in the final classifier discriminating between functional (AGO) versus non-functional (noAGO) sRNA (layer 1). The projection of k-mers in the AGO set shows a clear positional bias toward the 5’ and 3’ ends of 21 nt and 24 nt sRNA sequences.


Figure 3.6. Frequency of nucleotides in k-mers across the sRNA libraries. This was determined from the distribution of k-mers retained in the classifiers according to their density peaks in 21 nt and 24 nt sRNA: (A) layer 1: AGO vs noAGO; (B) layer 2: mean of the frequencies for classifiers involving a given AGO.


    5’ and 3’ k-mers are enriched for transcription factor binding motifs

We further assessed whether the k-mers correspond to known sequence motifs. To that end, we performed a motif analysis on the “AGO” and the “noAGO” sets with MEME [216], a computational framework for de novo and known motif identification. We then focused on k-mers with a size of 5 nt and mapped them to the motifs retrieved in the previous step. We found that the majority of k-mers (38 of 46) corresponded to segments of known or predicted motifs, and noted a significant enrichment for k-mer matches in “AGO” compared with “noAGO” (100 and 32, respectively). Interestingly, the known motifs matching k-mers were consistently related to stress response and development, processes known to be highly regulated by sRNA (Table S3.1).

The TF enrichment was not surprising for 21 nt sRNA, which are known to act on genic sequences that frequently contain TF binding motifs. However, similar TF patterns were also found for 24 nt sRNA, suggesting a role in gene regulation beyond the well-documented function of 24 nt sequences in heterochromatin silencing. We identified the 5-mer “AGAAG” as the k-mer showing the strongest enrichment in 24 nt sequences compared with 21 nt sRNA from the “AGO” set, and noted that this subsequence was associated with 3 motifs: At1g68670, a G2-like protein involved in phosphate homeostasis; SVP, a MADS protein that acts as a floral repressor and functions within the thermosensory pathway; and MYB77, which is expressed in response to potassium deprivation and auxin. All these motifs have been shown to have higher DNA-binding capacity when the targets are unmethylated [217], revealing a possible bridge between 24 nt heterochromatic sRNA, DNA methylation and TF occupancy.

Figure 3.7. Results for the models trained to discriminate sRNA from two different AGO-IP sets (layer 2): features selected, performance and top-3 features. The top three rows of pie charts represent the features kept after the correlation analysis (grey section), a fraction of which comprises the optimal sets obtained by feature selection (light blue section). The three following rows of pie charts represent the composition of the features kept after feature selection, with the number in the middle representing the total number of features that survived SVM-RFE. For each binary model, ROC-AUC is shown for classifiers trained: using all non-correlated features (grey bars), after feature selection (light blue bars), and using the optimized feature set in non-linear learning with the cascade SVM scheme (dark blue bars). Each bar plot has on top the standard deviation for the performance calculated from the 5-fold cross-validation procedure. The top-3 features found with the feature selection procedure (features kept in the last 3 SVM-RFE iterations) are indicated in a third section. The bottom section indicates the AGO-IP sets used in each of the 28 models, plus a colour scheme that marks inter- (white) and intra-clade models (red: clade 1; blue: clade 2; green: clade 3).


    3.3.2 Classification and validation of specific AGO-sRNA associations

    In the previous section, our goal was to build a classifier that can distinguish functional from

    non-functional sRNA. We achieved this by training on “AGO” versus “noAGO” sRNA. A

related problem is to find properties of sRNA that allow us to infer their differential loading onto

    specific AGO proteins, as this will determine their particular mode of action. To achieve this,

    we trained binary one-vs-one classifiers to find sequence features that discriminate between

    the eight different Col-0 AGO-IP libraries (layer 2, Figure 3.2A). The ensemble of one-vs-one

    classifiers was then subjected to a voting system that assigns a single score to each AGO

    (layer 3, Figure 3.2A).

    We used a 5-fold cross-validation scheme to determine the accuracy of the inference system

    at three levels: 1. AGO: fraction of sequences for which the AGO-bound protein was

    correctly predicted; 2. Clade: fraction of sequences for which the AGO prediction falls within

    the correct clade; and 3. Function: fraction of sequences for which the functional group can

    be correctly assigned based on the AGO prediction, translating predictions for AGO4/6/9

    into a potential for involvement in TS, and assignments to other AGOs as suggestion of PTS

    activity. In addition, we also performed a validation using 35 A. thaliana AGO-IP datasets

    never seen by the classifiers during training, plus miRNA and ta-siRNA libraries from multiple

    plants. The AGO-IP validation sets were collected from different tissues and experimental

    conditions, which allowed us to evaluate the robustness of the classifiers to such variation.
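The three accuracy levels can be derived from a single vector of AGO predictions once each AGO is mapped to its clade and to a silencing function (AGO4/6/9 → TS, all others → PTS, as above). A minimal sketch, assuming the standard Arabidopsis AGO clade grouping and entirely hypothetical predictions:

```python
# Clade membership for the eight Col-0 AGO-IP libraries
# (clade 1: AGO1/5/10; clade 2: AGO2/7; clade 3: AGO4/6/9)
CLADE = {"AGO1": 1, "AGO5": 1, "AGO10": 1,
         "AGO2": 2, "AGO7": 2,
         "AGO4": 3, "AGO6": 3, "AGO9": 3}

def function(ago):
    # AGO4/6/9 imply transcriptional silencing (TS); the rest PTS
    return "TS" if ago in ("AGO4", "AGO6", "AGO9") else "PTS"

def accuracies(true, pred):
    """Return (AGO-level, clade-level, function-level) accuracy."""
    n = len(true)
    ago = sum(t == p for t, p in zip(true, pred)) / n
    clade = sum(CLADE[t] == CLADE[p] for t, p in zip(true, pred)) / n
    func = sum(function(t) == function(p) for t, p in zip(true, pred)) / n
    return ago, clade, func

# Hypothetical predictions for four sequences
true = ["AGO1", "AGO4", "AGO2", "AGO6"]
pred = ["AGO1", "AGO6", "AGO7", "AGO6"]
print(accuracies(true, pred))  # (0.5, 1.0, 1.0): wrong AGO, right clade/function
```

This illustrates why clade- and function-level accuracy can exceed AGO-level accuracy: a misassignment within the correct clade still counts as correct at the two coarser levels.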

    Classification accuracy of sRNA at AGO, clade and function level

    Results from our 5-fold cross-validation analysis showed that our classifiers achieved very

    high accuracy (Figure 3.8), indicating that differences between sRNA bound to specific AGOs

    can indeed be detected. Classification accuracy at the level of specific AGO proteins was

    around 60% on average, ranging from as low as 40% (AGO7) to as high as 85% (AGO5). Since

    AGO proteins within a clade are highly homologous and similar in function, it is likely that

    sRNA-AGO binding is promiscuous in nature, which would render the search for

    discriminative features more challenging. Indeed, classification accuracy at the level of the

    clade and function were considerably higher than at the level of specific AGOs (clade

    accuracy: 85%, function accuracy 93%; see Figure 3.8A). Moreover, the final intra-clade

    classifiers were generally more complex than inter-clade classifiers, containing on average


122 features compared with 75 features, respectively (Figure 3.8C). Looking at the features

    retained in the final classifiers, intra-clade classifiers contained proportionally more k-mers

    of length larger than 1 nt compared with inter-clade classifiers (86% and 79% of the features

    in intra-clade and inter-clade models, respectively). Hence, factors that govern AGO-specific

    affinities within a clade appear to involve more complex sequence determinants. Similar to

    the “AGO” versus “noAGO” analysis presented above (section Detection of high confidence

    functional sRNA), we found that the k-mers in the final classifiers showed strong positional

    bias toward the 5’ and 3’ ends of sRNA sequences (Figure 3.6). This finding indicates that the

    same sRNA regions that differentiate functional from non-functional sequences also

    contribute to the binding affinity of sRNA to specific AGO proteins.

    To study the nature of the information contained in the k-mers selected by SVM-RFE in more

    detail, we compared them with a motif analysis performed for each AGO library using the

MEME suite [216]. For simplicity, we focused on 5-mers (Supplementary

    Material). We found that nearly 40% of the 5-mers were identified as motifs or derived

    segments (Table 3.2), often related to known transcription factors with roles in development

    and response to stress. For instance, in models involving AGO2 (an AGO linked to ta-siRNAs)

we identified 5-mers matching the motif ETT, an experimentally validated target of an evolutionarily conserved ta-siRNA known as tasiR-ARF [63]. Targeting of ETT by ta-siRNA

    has been extensively studied and is known to have pleiotropic effects on Arabidopsis flower

    development by interference with the auxin pathway [218]. These results thus support the

    conclusion that SVM learning captured sequence information with relevant biological

    meaning. A full overview of k-mers overlapping known or predicted motifs is provided in

    Table 3.2 and in the Supplementary Material.

    Validation analysis revealed that our classifiers are relatively sensitive, even across

    biologically very heterogeneous datasets (Figure 3.9). Sensitivity was particularly high at the

    level of clade and function (clade: 70% and function: 84.3%), and moderate at the level of

specific AGOs (AGO: 42%). Interestingly, when looking at the performance obtained for the

    ta-siRNA and miRNA datasets, we observed high sensitivity values both for the set isolated

    for Arabidopsis (~80% and ~90%, respectively) and also for the complete plant set (~95% and

~90%, respectively). In the set containing a mixture of ta-siRNA from diverse plants, the sensitivity is considerably higher than in the set composed only of sequences from


Figure 3.8. Details of the AGO-sorting inference (layers 2+3). Mean accuracy across the 5-fold cross-validation scheme for the voting system used in layer 3, measured (A) at the AGO, clade and function levels and (B) separated by individual AGO set. The average number of features composing the intra- and inter-clade classifiers from layer 2 (C), and the percentage of those features which are k-mers (D). Error bars: standard deviation.

Table 3.2. Number of 5-mers in the final classifiers from layer 2 matching motifs found with MEME in each AGO-IP library. The 5-mers were separated by the sign of their weight w in the final classifier.

#  |  Set a  |  Set b  |  # features in the final classifier  |  # 5-mers in the final classifier: favouring AGOa (w>0) / favouring AGOb (w<0); mapping to a motif  |  % 5-mers mapping to a motif


Figure 3.9. Sensitivity for the validation sets. Bar plots show results for AGO (light grey), clade (light blue) and function (dark blue) inference. An indication of the use of the data sets during learning, the sampled tissue, and the AGO with which they were immunoprecipitated is shown by coloured squares under the bars. Below these, the data set reference numbers are indicated, following the enumeration in Table 3.1.


Arabidopsis, which supports the idea that the inference system can identify PTS sRNA not only from the species used in the learning phase but also from other plant species. In conclusion, the inference system is robust to tissue and treatment variation and can recognize sRNA membership independently of the cellular origin, showing the good generalization desired for the discovery of new functional units.

    3.4 Discussion

    To our knowledge, this is the first time that classifiers were built to infer AGO-sRNA affinity

    from the sRNA sequence alone. Using adequate solvers and learning architectures, SVMs

    could be applied to large genomic datasets and discovered highly discriminative rules, both

to distinguish sequences that bind to AGOs from other sequencing products, as well as to infer

    AGO-sRNA kinship. In addition to the known 5’ nucleotide composition and the sequence

    length, feature selection revealed the contribution of other sequence attributes, mostly

    complex k-mers, in defining sRNA preference for certain AGO proteins. How robust specific

    sRNA-AGO affinities are to changes in certain sequence properties is an interesting topic for

    future research and can shed light on evolutionary constraints on sRNA-mediated

    transcriptional and post-transcriptional silencing pathways. To answer such questions,

    additional experiments are necessary to manipulate sRNA sequences in a precise and

    targeted fashion, for example by use of the CRISPR-Cas9 system. Although our inference

    method appears to be highly accurate in predicting the putative function of sRNA sequences,

    it is important to keep in mind that the actual biological activity of a given sRNA is

    dependent on other factors beyond the AGO loading step, such as the degree of

    complementarity between a sRNA and the target sequence, as well as the presence of

    specific chromatin states at the target locus.

AGO-sorting information has the potential to decrease the very high number of false positives reported by currently available PTS target prediction tools, but the true impact needs to be further evaluated. In any case, the computational framework developed here displays a discriminative power that makes it suitable for early screens of genome-wide sRNA libraries when looking for candidates with the highest chance of having certain functional roles, and when sorting sRNA into TS and PTS classes is needed. The tool is ultimately a more affordable alternative to expensive and laborious AGO-IP experiments, since it can derive AGO-sRNA profiles from a single genome-wide sequencing library. Another interesting application of the framework is to explore whether specific sRNA from exogenous sources, such as artificially designed sequences or those derived from pathogens, correspond to functional plant sRNA and which silencing pathways they are likely to target.

More sophisticated models could be developed using, for example, the expression patterns of the AGO proteins at the moment the sRNA sequencing experiments take place. This extra information could improve the capacity to correctly predict AGO affinity under particular conditions, but it also demands additional and more complex inputs, thus limiting the range of applications of the sRNA classifiers.


    S3 Supplementary Material

    S3.1 Support Vector Machines

    The support vector machine (SVM) is a popular machine learning algorithm with strong

    bonds to statistical learning theory. During the training phase, the SVM searches for the

decision hyperplane that maximizes the margin for the training set $(\vec{x}_1, y_1), \dots, (\vec{x}_n, y_n)$ composed of $n$ training instances $\vec{x}_i$ with labels $y_i \in \{-1, 1\}$ in 2-class problems. Maximum

    margin classifiers are frequently the ones that guarantee the highest generalization capacity,

    which explains the higher performance of models obtained via SVMs compared to other

    machine learning algorithms [219].

To attain a classifier of the form $\vec{w} \cdot \vec{x} + b = 0$, the learning process requires minimizing the norm $\lVert\vec{w}\rVert$ of the normal vector of the decision hyperplane, subject to the constraints $y_i(\vec{w} \cdot \vec{x}_i + b) \ge 1$, for $i = 1, \dots, n$. In the binary case, a classification is then given by the sign of the classifier output.

To accommodate points that lie on the wrong side of the decision frontier, a soft-margin classifier can be trained instead of the hard-margin version, considering a hinge loss function of the type

$$\zeta_i = \max(0,\, 1 - y_i(\vec{w} \cdot \vec{x}_i + b))$$

and minimizing

$$\frac{1}{n}\sum_{i=1}^{n} \zeta_i + \lambda \lVert\vec{w}\rVert^2$$

subject to $y_i(\vec{w} \cdot \vec{x}_i + b) \ge 1 - \zeta_i$ and $\zeta_i \ge 0$, for all $i$. The variable $\lambda$ defines the trade-off between having a larger margin and the number of training points lying on the wrong side of the margin. In the limit $\lambda \to 0$, the soft-margin formulation approaches the hard-margin version. Most frequently, the trade-off is tuned not through $\lambda$ but through a directly related cost parameter $C = 1/\lambda$.

The above equation states the optimization problem in a form known as the primal. Classically, it can be solved using quadratic programming, but there are other, more sophisticated approaches, such as sub-gradient descent or coordinate descent methods, that can reduce the training time by several orders of magnitude when learning from very large datasets.
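A minimal sub-gradient descent on this primal objective can be sketched as follows (toy data, arbitrary learning rate and epoch count; purely illustrative, not the solver used in this work):

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Sub-gradient descent on the primal soft-margin objective
    (1/n) * sum_i max(0, 1 - y_i (w.x_i + b)) + lam * ||w||^2,
    with labels y in {-1, +1}. Illustrative, not optimized."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1  # only margin violators contribute to the hinge term
        grad_w = 2 * lam * w - (y[active, None] * X[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Two well-separated toy clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
w, b = train_linear_svm(X, y)
pred = np.sign(X @ w + b)
print((pred == y).mean())  # fraction of training points classified correctly
```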

Alternatively, the optimization problem can be stated in the dual form

$$\text{maximize } f(\alpha_1, \dots, \alpha_n) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i \alpha_i (\vec{x}_i \cdot \vec{x}_j)\, y_j \alpha_j$$

subject to $\sum_{i=1}^{n} \alpha_i y_i = 0$ and $0 \le \alpha_i \le \frac{1}{2n\lambda}$ for all $i$, and can be solved through the Lagrangian method. In this case the decision hyperplane is described in terms of the training instances $\vec{x}_i$ that lie closest to it and for which the Lagrange multiplier $\alpha_i \ne 0$, with $\vec{w} = \sum_{i=1}^{n} \alpha_i y_i \vec{x}_i$. These instances placed on the margin are called support vectors (SVs).
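In scikit-learn, for example, the dual solution can be inspected after fitting: the attribute `dual_coef_` stores the products $\alpha_i y_i$ for the support vectors, so for a linear kernel the normal vector $\vec{w} = \sum_i \alpha_i y_i \vec{x}_i$ can be recovered directly (a sketch on toy data, assuming scikit-learn is available):

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data
X = np.array([[-2.0, -2.0], [-1.5, -1.8], [-2.2, -1.0],
              [2.0, 2.0], [1.5, 1.8], [2.2, 1.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=10.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for each support vector, so the
# primal normal vector w = sum_i alpha_i y_i x_i can be recovered:
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))  # True for a linear kernel
```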

    When more complex relationships between the classes are encoded in the data, it is

    beneficial to use a search space with higher dimensionality where the discrimination is more

    feasible. Data can be projected into such space using non-linear functions classically

    employed in machine learning like polynomials, sigmoids or radial basis functions (RBFs)

    [220, 221]. RBFs are a special and common choice, since other kernel functions like the linear

    and the sigmoid can be designed as special cases of RBFs [220, 221].

    S3.2 SVM-RFE

    Recursive feature elimination (RFE) is a backward feature elimination method that can be

    defined in three simple steps:

1. Train a classifier (optimize the weights $w_i$ according to an objective function $J$)
2. Determine the ranking criterion for all features (the cost $DJ(i)$ or $w_i^2$)
3. Remove the feature(s) with the smallest ranking criterion

The whole process is repeated until a stopping condition is met, such as reaching a desired minimum number of features or an empty feature set.

    In the case of SVM-RFE, the principles of RFE are combined with SVM model optimization in

    a wrapper approach.
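This wrapper is available off the shelf, e.g. as scikit-learn's `RFE` combined with a linear SVM, which ranks features by their squared weights $w_i^2$. A sketch on synthetic data (the settings are illustrative, not those used in this chapter):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

# Synthetic data: 20 features, of which only 5 are informative
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)

# SVM-RFE: repeatedly fit a linear SVM, rank features by w_i^2 and
# drop the weakest one, until 5 features remain
svm = LinearSVC(C=1.0, dual=False, max_iter=5000)
rfe = RFE(estimator=svm, n_features_to_select=5, step=1).fit(X, y)

print(rfe.support_.sum())  # 5 features kept
print(rfe.ranking_)        # rank 1 marks the surviving features
```

The `step` parameter controls how many features are dropped per iteration; removing one at a time is the classical SVM-RFE setting but larger steps trade ranking resolution for speed.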

    S3.3 SVM learning in very large datasets

SVM learning is frequently mentioned as one of the most accurate machine learning methods, thanks to the maximum margin principle that it employs. However, searching for the maximum margin hyperplane in a large dataset comes with a computational burden from loading large matrices for which extensive optimization problems must be solved. This makes traditional SVM formulations, whose training complexity lies between $\mathcal{O}(n^2)$ and $\mathcal{O}(n^3)$ in the number $n$ of training samples [222], difficult to use when exploring large sets. The machine learning community has worked to overcome this limitation, mostly by optimizing the mathematical machinery of SVM learning or by presenting “out of the box” solutions based on the theory behind SVM learning and experimental observations.

Empirical studies showed that linear SVM learning can in some cases be sufficient, yielding classifiers with accuracy comparable to that achieved with more complex kernel transformations. There are solvers specialized in linear SVM learning, such as LibLinear [214], with a training time complexity of $\mathcal{O}(n)$: the training time grows linearly with the number of training instances, drastically reducing learning time compared to other SVM formulations. Additionally, such an approach does not demand optimizing any hyperparameter beyond the cost, further decreasing the number of training rounds necessary to tune the algorithm.

    When non-linear SVM learning is necessary, the computational burden can be attenuated by

    using special learning schemes, such as the cascade configuration [223], which take

    advantage of theoretical properties of the SVM algorithm. This architecture breaks the

    global optimization process into smaller subproblems that are solved separately by multiple

    SVMs with subsets of data. The partial solutions composed by the support vectors can then

    be joined and reprocessed in a cascade of SVMs that end in a single model guaranteed to

    converge to the global optimum after multiple passes through the cascade. Empirical tests

    showed that a single pass over the cascade can already produce decision rules with good

generalization. In addition, this design can be parallelized, further reducing the training time.

SVM formulations for multiclass problems are available. However, since the complexity of the problem increases as more classes are included in the same optimization problem, multiclass SVM classification typically uses binary models. Two flavours can be found: one-vs-one and one-vs-all [224]. In the first case, binary classifiers are trained for each of the pairwise combinations among the $K$ classes, forming a total of $\frac{K(K-1)}{2}$ classifiers [219]. One-vs-all schemes use fewer classifiers, since each classifier is trained to separate one class $i$ from all remaining groups, for a total of $K$. A final prediction can be obtained by combining the inferences of the binary classifiers through mechanisms as simple as an unweighted voting system, where the final decision is made in favour of the group that gathers the highest number of positive predictions across all the individual rules. Multiclass prediction through binary schemes is proven to achieve accuracy levels comparable to those of direct multiclass models [219], and it brings extra advantages associated with the parallelization of the training process and its ease of implementation.
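A one-vs-one ensemble with unweighted voting, analogous to the layer 2 + layer 3 design used in this chapter, can be sketched as follows (scikit-learn assumed; a toy three-class problem stands in for the eight AGO sets):

```python
from itertools import combinations
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Toy 3-class problem standing in for the 8 AGO classes
X, y = make_blobs(n_samples=300, centers=3, random_state=0)
classes = np.unique(y)

# Layer 2: one binary SVM per pair of classes, K(K-1)/2 models in total
pair_models = {}
for a, b in combinations(classes, 2):
    mask = np.isin(y, [a, b])
    pair_models[(a, b)] = LinearSVC(dual=False).fit(X[mask], y[mask])

# Layer 3: unweighted voting, one vote per pairwise model
def predict_vote(x):
    votes = {c: 0 for c in classes}
    for model in pair_models.values():
        votes[model.predict([x])[0]] += 1
    return max(votes, key=votes.get)

preds = np.array([predict_vote(x) for x in X])
print((preds == y).mean())  # training accuracy of the voting ensemble
```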

    S3.4 Cascade SVM

    The cascade SVM [223] is a learning architecture in which training is performed sequentially

    in subsets of data. Each partial SVM filters uninformative data by finding sub-solutions that

    through the cascade are sequentially joined in a global solution. Former SVM sub-solutions

    which are given by the respective SVs are fed as input to a later SVM that further simplifies

    the result in case of data redundancy. This hierarchy ends-up in a single SVM that joins the

    SVs from previously optimized problems in a final model. This divide-and-conquer strategy,

    which explores principles of SVM learning in the dual form, allows reducing the number of

    training points used to find the global solution by dealing with smaller and therefore

    computationally more manageable subsets of data. The computational problems faced by

    the SVM algorithm when dealing with large datasets are circumvented since the training

    complexity is dependent on the number 𝑛 of training samples and estimated between 𝒪(𝑛2)

    and 𝒪(𝑛3), with obvious gains for smaller datasets.
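A single pass of this scheme can be sketched as follows: train independent SVMs on disjoint chunks, pool their support vectors, and retrain a final SVM on that reduced set (scikit-learn assumed; a full cascade would merge sub-solutions over several layers and possibly iterate):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary problem standing in for a large sRNA dataset
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Layer 1: train independent SVMs on disjoint chunks of the data
sv_idx = []
for idx in np.array_split(np.arange(len(X)), 4):
    m = SVC(kernel="rbf", C=1.0).fit(X[idx], y[idx])
    sv_idx.extend(idx[m.support_])  # keep only this chunk's support vectors

# Layer 2: a final SVM trained on the pooled support vectors only
final = SVC(kernel="rbf", C=1.0).fit(X[sv_idx], y[sv_idx])

print(len(sv_idx), "of", len(X), "points reach the final SVM")
print(final.score(X, y))  # accuracy of the cascade model on the full set
```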

    S3.5 Motif analysis

All sRNA sets used to develop the classifiers presented here were subjected to a motif search to assess the possible presence of conserved patterns in the sRNA sequences. This task used the MEME-ChIP platform [225] and was directed at the lists of unique sRNAs in the “AGO”, “noAGO” and AGO-IP sets. First, a de novo motif discovery search was executed for long ungapped motifs with MEME (6 to 27 nt) and for short ungapped motifs with DREME (3 to 8 nt). The motifs found to be significant (e-value ≤ 0.05) were then analyzed for enrichment in terms of their relative position inside the sequence with CentriMo (e-value ≤ 0.05).


Next, all motifs found were aligned to two libraries of known motifs from Arabidopsis thaliana using Tomtom (e-value ≤ 1). The first library contained a set of 113 motifs detected by protein-binding microarrays (PBM; Franco-Zorrilla et al., 2014) and the second is composed of 872 motifs captured by DNA affinity purification sequencing (DAP-seq; O’Malley et al., 2016). Finally, the motifs from each set were compared pairwise to identify library-specific and shared motifs.

    The k-mers selected by SVM-RFE were matched with perfect similarity (no mismatches, no

    gaps) to subsequences retrieved with FIMO (p-value


Table S3.1. Motifs with 5-mer matches found in the “AGO” and “noAGO” sRNA sets.


Table S3.1 (continued)


Table S3.2. PBM motifs found in the AGO-IP libraries. Columns give the number of motifs per step: long ungapped (MEME), short ungapped (DREME), centrally enriched (CentriMo), known (Tomtom) and remapped (FIMO).

Set      MEME   DREME   CentriMo   Tomtom   FIMO
AGO1        1      40         10       25     36
AGO2        0      28         39       28     38
AGO4        0      28          0       21     20
AGO5        0      74          5       57     72
AGO6        1      22          2       14     10
AGO7        3      84          0       53     80
AGO9        1      27          0       15     18
AGO10       1      28         32       22     37


    Table S3.3. All PBM motifs found in the AGO-IP libraries.

    # Name TAIR ID Short description

    1 ARR14 AT2G01760 Cytokinin-activated signaling pathway, phosphorelay signal transduction system.

    2 ATHB51 AT5G03790 Bract formation, floral meristem determinacy, leaf morphogenesis, positive regulation of transcription, DNA-templated, regulation of timing of transition from vegetative to reproductive phase.

    3 bZIP60 AT1G42990 bZIP60 mRNA is upregulated by the addition of ER stress inducers. It is involved in endoplasmic reticulum unfolded protein response, immune system process, and response to chitin.

    4 DAG2 AT2G46590 Expression is limited to vascular system of the mother plant. Recessive mutation is inherited as maternal-effect and expression is not detected in the embryo. Mutants are defective in seed germination and are more dependent on light and cold treatment and less sensitive to gibberellin during seed germination.

    5 DEAR3 AT2G23340 Ethylene-activated signaling pathway.

    6 DEAR4 AT4G36900 Ethylene-activated signaling pathway, plant defense, freezing stress, dehydration stress.

    7 ETT AT2G33860 ettin (ett) mutations have pleiotropic effects on Arabidopsis flower development, causing increases in perianth organ number, decreases in stamen number and anther formation, and apical-basal patterning defects in the gynoecium. The ETTIN gene encodes a protein with homology to DNA binding proteins which bind to auxin response elements. ETT transcript is expressed throughout stage 1 floral meristems and subsequently resolves to a complex pattern within petal, stamen and carpel primordia. ETT probably functions to impart regional identity in floral meristems that affects perianth organ number spacing, stamen formation, and regional differentiation in stamens and the gynoecium. During stage 5, ETT expression appears in a ring at the top of the floral meristem before morphological appearance of the gynoecium, consistent with the proposal that ETT is involved in prepatterning apical and basal boundaries in the gynoecium primordium. It is a target of the ta-siRNA tasiR-ARF.

    8 GLK1 AT2G20570 Chloroplast organization, negative regulation of flower development, negative regulation of leaf senescence, positive regulation of transcription, regulation of chlorophyll biosynthetic process, regulation of transcription and signal transduction.

9 HSFB2A AT5G62020 The heat shock transcription factor B2A is involved in response to chitin and gametophyte development. The mRNA is cell-to-cell mobile.

    10 HSFC1 AT3G24520 Heat shock transcription factor C1.

    11 KAN1 AT5G16560 Involved in abaxial cell fate specification, carpel development, organ morphogenesis, plant ovule development, polarity specification of adaxial/abaxial axis, radial pattern formation, xylem and phloem pattern formation.

    12 KAN4 AT5G42630 Encodes a member of the KANADI family of putative transcription factors. Involved in integument formation during ovule development and expressed at the boundary between the inner and outer integuments. It is essential for directing laminar growth of the inner integument. Along with KAN1 and KAN2, KAN4 is involved in proper localization of PIN1 (necessary for proper flowering) in early embryogenesis.

    13 MYB52 AT1G17950 Involved in the regulation of secondary cell wall biogenesis and response to abscisic acid.


    Table S3.3. (continued)

    # Name TAIR ID Description

    14 RAP2.6 AT1G43160 Involved in chloroplast organization, ethylene-activated signaling pathway, and positive regulation of transcription. Mediates cellular response to: heat, abscisic acid, cold, jasmonic acid, osmotic stress, salicylic acid, salt stress, water deprivation and wounding.

    15 REM1 AT4G31610 Expressed specifically in reproductive meristems. It is a member of a moderately sized gene family distantly related to known plant DNA binding proteins. Necessary for proper flower development.

    16 RVE1 AT5G17300 Myb-like transcription factor that regulates hypocotyl growth by regulating free auxin levels in a time-of-day specific manner.

    17 SPL7 AT5G18830 Encodes a member of the Squamosa Binding Protein family of transcriptional regulators. SPL7 is expressed highly in roots and appears to play a role in copper homeostasis. Mutants are hypersensitive to copper deficient conditions and display a retarded growth phenotype. SPL7 binds to the promoter of the copper responsive miRNAs miR398b and miR389c.

    18 STY1 A member of SHI gene family. Arabidopsis thaliana has ten members that encode proteins with a RING finger-like zinc finger motif. Despite being highly divergent in sequence, many of the SHI-related genes are partially redundant in function and synergistically promote gynoecium, stamen and leaf development in Arabidopsis. STY1/STY2 double mutants showed defective style, stigma as well as serrated leaves. Binds to the promoter of YUC4 and YUC8 (binding site ACTCTAC). It is involved in auxin homeostasis, negative regulation of gibberellic acid mediated signaling pathway, positive regulation of transcription, stigma development, style development, xylem and phloem pattern formation.

    19 TOE1 AT2G28550 It is related to AP2.7 (RAP2.7). It is involved in ethylene-activated signaling pathway, multicellular organismal development, organ morphogenesis and vegetative to reproductive phase transition of meristem.

    20 TOE2 AT5G60120 Involved in ethylene-activated signaling pathway and multicellular organismal development.

    21 WOX13 AT4G35550 Involved in plant development.

    22 WRKY12 AT2G44745 Pith secondary wall formation and a promoter of flowering.

    23 WRKY38 AT5G22570 Involved in defense response to bacterium and salicylic acid mediated signaling pathway. Arabidopsis FLOWERING LOCUS D influences systemic-acquired resistance-induced expression and histone modifications of WRKY genes. Dehydration stress. Basal Defense in Arabidopsis: WRKYs Interact with Histone Deacetylase HDA19.

    24 YAB1 AT2G45190 Abaxial cell fate specification, cell fate commitment, fruit development, inflorescence meristem growth, meristem structural organization, polarity specification of adaxial/abaxial axis, regulation of flower development, regulation of leaf development, regulation of shoot apical meristem development, specification of floral organ identity, specification of organ position.

    25 YAB5 AT2G26580 Polarity specification of adaxial/abaxial axis, regulation of leaf development, regulation of shoot apical meristem development.

    26 ZAT14 AT5G03510 Zinc finger.


    Table S3.4. PBM motifs found in the AGO-IP libraries: motifs only in AGOa.

    # AGOa AGOb Motifs only in AGOa

    1 AGO1 AGO2 KAN1, WRKY38, ZAT14

    2 AGO1 AGO4 DEAR3, KAN1, MYB55_2, WRKY38, ZAT14

    3 AGO1 AGO5 DEAR3, MYB55_2, WRKY38, ZAT14

    4 AGO1 AGO6 DEAR3, KAN1, MYB55_2, WRKY38, ZAT14

    5 AGO1 AGO7 DEAR3, KAN1, MYB55_2, WRKY38, ZAT14

    6 AGO1 AGO9 DEAR3, KAN1, MYB55_2, WRKY38, ZAT14

    7 AGO1 AGO10 DEAR3, WRKY38, ZAT14

    8 AGO2 AGO4 bZIP60_2, DEAR3, DEAR4, ETT_2, KAN4, MYB55_2, REM1, RVE1_2, SPL7, STY1_2, TOE1, WOX13, WRKY12, YAB1

    9 AGO2 AGO5 bZIP60_2, DEAR3, DEAR4, ETT_2, KAN4, MYB55_2, REM1, RVE1_2, SPL7, STY1_2, TOE1, WOX13, WRKY12, YAB1

    10 AGO2 AGO6 bZIP60_2, DEAR3, DEAR4, ETT_2, KAN4, MYB55_2, REM1, RVE1_2, SPL7, STY1_2, TOE1, WOX13, YAB1

    11 AGO2 AGO7 bZIP60_2, DEAR3, DEAR4, ETT_2, KAN4, MYB55_2, REM1, RVE1_2, SPL7, STY1_2, TOE1, WOX13, WRKY12, YAB1

    12 AGO2 AGO9 bZIP60_2, DEAR3, DEAR4, ETT_2, KAN4, MYB55_2, REM1, RVE1_2, SPL7, STY1_2, TOE1, WOX13, WRKY12, YAB1

    13 AGO2 AGO10 bZIP60_2, DEAR3, DEAR4, ETT_2, KAN4, REM1, RVE1_2, SPL7, STY1_2, TOE1, WOX13, YAB1

    14 AGO4 AGO5 -

    15 AGO4 AGO6 -

    16 AGO4 AGO7 -

    17 AGO4 AGO9 -

    18 AGO4 AGO10 -

    19 AGO5 AGO6 HSFB2A_2, HSFC1_2, KAN1, RAP2.6_2, YAB5

    20 AGO5 AGO7 HSFB2A_2, HSFC1_2, KAN1, RAP2.6_2, YAB5

    21 AGO5 AGO9 HSFB2A_2, HSFC1_2, KAN1, RAP2.6_2, YAB5

    22 AGO5 AGO10 HSFB2A_2, HSFC1_2, RAP2.6_2

    23 AGO6 AGO7 -

    24 AGO6 AGO9 -

    25 AGO6 AGO10 -

    26 AGO7 AGO9 -

    27 AGO7 AGO10 -

    28 AGO9 AGO10 -


    Table S3.5. PBM motifs found in the AGO-IP libraries: motifs only in AGOb.

    # AGOa AGOb Motifs only in AGOb

    1 AGO1 AGO2 bZIP60_2, DEAR4, ETT_2, KAN4, REM1, RVE1_2, SPL7, STY1_2, TOE1, WOX13, WRKY12, YAB1

    2 AGO1 AGO4 -

    3 AGO1 AGO5 HSFB2A_2, HSFC1_2, RAP2.6_2, YAB5

    4 AGO1 AGO6 -

    5 AGO1 AGO7 -

    6 AGO1 AGO9 -

    7 AGO1 AGO10 ARR14_2, ATHB51, DAG2, DEAR3_2, GLK1_2, HSFB2A, MYB52, TOE1_2, TOE2, WRKY12, YAB5, ZAT18

    8 AGO2 AGO4 -

    9 AGO2 AGO5 HSFB2A_2, HSFC1_2, KAN1, RAP2.6_2, YAB5

    10 AGO2 AGO6 -

    11 AGO2 AGO7 -

    12 AGO2 AGO9 -

    13 AGO2 AGO10 ARR14_2, ATHB51, DAG2, DEAR3_2, GLK1_2, HSFB2A, KAN1, MYB52, TOE1_2, TOE2, YAB5, ZAT18

    14 AGO4 AGO5 HSFB2A_2, HSFC1_2, KAN1, RAP2.6_2, YAB5

    15 AGO4 AGO6 -

    16 AGO4 AGO7 -

    17 AGO4 AGO9 -

    18 AGO4 AGO10 ARR14_2, ATHB51, DAG2, DEAR3_2, GLK1_2, HSFB2A, KAN1, MYB52, MYB55_2, TOE1_2, TOE2, WRKY12, YAB5, ZAT18

    19 AGO5 AGO6 -

    20 AGO5 AGO7 -

    21 AGO5 AGO9 -

    22 AGO5 AGO10 ARR14_2, ATHB51, DAG2, DEAR3_2, GLK1_2, HSFB2A, MYB52, MYB55_2, TOE1_2, TOE2, WRKY12, ZAT18

23 AGO6 AGO7 -

    24 AGO6 AGO9 -

    25 AGO6 AGO10 ARR14_2, ATHB51, DAG2, DEAR3_2, GLK1_2, HSFB2A, KAN1, MYB52, MYB55_2, TOE1_2, TOE2, WRKY12, YAB5, ZAT18

    26 AGO7 AGO9 -

    27 AGO7 AGO10 ARR14_2, ATHB51, DAG2, DEAR3_2, GLK1_2, HSFB2A, KAN1, MYB52, MYB55_2, TOE1_2, TOE2, WRKY12, YAB5, ZAT18

    28 AGO9 AGO10 ARR14_2, ATHB51, DAG2, DEAR3_2, GLK1_2, HSFB2A, KAN1, MYB52, MYB55_2, TOE1_2, TOE2, WRKY12, YAB5, ZAT18


    Chapter 3