Shortran v.0.43 Manual

7/30/2019 Shortran v.0.43 Manual

1/27

A users manual for

shortran

Released: 06-June-2012

Vikas Gupta and Stig U. Andersen

Bioinformatics Research Center and

Department of Molecular Biology and Genetics

Aarhus University, Denmark

Contact: [email protected], [email protected]


2/27

2

1 OVERVIEW.....................................................................................................................4

1.1Background...........................................................................................................................4

1.2Summaryofshortranmodules.......................................................................................4

1.3Implementation...................................................................................................................4

1.4Availability............................................................................................................................4

2 Installation....................................................................................................................5

2.1PythonandPerl...................................................................................................................5

2.2Bowtie.....................................................................................................................................5

2.3MySQLandsettingupdatabase......................................................................................5

2.4ECHO........................................................................................................................................6

2.5mirDeep..................................................................................................................................6

2.6Blast.........................................................................................................................................6

2.7NiBLS.......................................................................................................................................6

3 QUICKSTARTGUIDE..................................................................................................7

3.1Configuration........................................................................................................................7

3.2Explorethedatabase................. .................. .................. ................. .................. .................. 7

4 CONFIGURATION.........................................................................................................8

4.1 Configurationusingtheconfiguration.txtfile........................................................8

4.2 Interactiveconfiguration..............................................................................................9

5 Modules..........................................................................................................................95.1 Module-1:Adaptertrimming,errorcorrection,filteringandprofiling.......9

5.2 Module-2:GenomicMapping....................................................................................11

5.3 Module-3:Clusteringanalysis..................................................................................12

5.4 Module-4:miRNApredictions..................................................................................13

5.5 Module-5:ta-siRNApredictions..............................................................................14

5.6 Module-6:Annotation.................................................................................................14

5.7 Module-7:ExpressionPlots......................................................................................14

5.8 Module-8:PatternSearch..........................................................................................15

5.9 Module-9:Queries........................................................................................................16

6 Application.................................................................................................................17

6.1 Module-1..........................................................................................................................17

6.2 Module-2..........................................................................................................................21

6.3 Module-3..........................................................................................................................21

6.4 Module-4..........................................................................................................................22

6.5 Module-5..........................................................................................................................22

6.6 Module-6..........................................................................................................................22


3/27

3

6.7 Module-7..........................................................................................................................22

6.8 Module-8..........................................................................................................................25

6.9 Module-9..........................................................................................................................26

6.10 Computationsummary............................................................................................27


4/27

4

1 OVERVIEW

1.1Background

WefounditcomplicatedtocombinesmallRNA(sRNA)profilingandannotationinformationusingexistinganalysistools.Herewedescribetheinstallationand

use of shortran, which generates and combines expression profiling andannotationdatafromsRNAsequencingrawdata.Theoutputisprovidedasan

easily accessible MySQL database format and provides graphical output tofacilitatefastdataanalysis.

1.2Summaryofshortranmodules

shortran isa combinationof variousmodules,allowingauser toanalyzedata

using either an interactive command line interface or a configuration file. Adetailed descriptionof the use of eachmodule is provided in the application

section of this manual. shortran is capable of performing filtering, read errorcorrection, expression profiling,mapping,clustering analysis,miRNA/ta-siRNA

prediction,patternsearch,targetpredictionandannotation.

1.3Implementation

PythonistheprimarylanguageofthepipelinebutPerlhasbeenalsoused.The

package contains multiple Python language scripts. shortran.py is the main,whichinvokestheotherpipelinemodules. shortranisdevelopedusingpython(v2.7.2) andperl (v5.12.3), it has been tested on bothMacOS X (10.6.8) andLinuxUbuntu(10.04.3).

1.4Availability

Theshortranpackageincludinganexampledatasetisavailableatwww.carb.dk/resources.asp.


5/27

5

2 InstallationTheprimarymodulesofthe shortranpackagerequirenoinstallationandcanbeinvokeddirectlyusingpythonafterthedependencieshavebeeninstalled.Torun

shortran,changedirectory(cd)tothefoldercontainingthe shortran.pyfileand

runthescript:> cd[path of shortran_folder]

>python shortran.py

The > indicates the commandprompt and should not be typed. Brackets []

indicatestextthatneedstobeadjusted.Bracketsshouldnotbetyped.Commentsareindicatedby#.Toobtainthepathofafileorfolderforintroducingitintoa

script, the icon of the respective item can be dragged into a termial windowwherethepathwillthenbedisplayed.

DetailedinstructionsforinstallationandsetupofcoredependenciesonMacOS

and Linux platforms are included. Please refer to install_osx.README andinstall_linux.README.

2.1PythonandPerl

Most operating systems have built-in Python and Perl and no installation isrequired. If that is not the case, Python can be downloaded from

http://python.org/getit/, and Perl from http://www.perl.org/get.html. Inaddition, the Python libraries numpy (http://numpy.scipy.org/), scipy(http://www.scipy.org/) and matplotlib (http://matplotlib.sourceforge.net/)

needtobeinstalled.

2.2Bowtie

Bowtieisrequiredbyanumberofmodulesandshouldbeinstalledandaddedto

the system path (so that typing bowtie in the terminalactivates the programregardlessoftheworkingdirectory).IfinstallingonMacOS,Bowtieisinstalled

andaddedtothepathbytheinstallationscript.Itcanbefoundathttp://bowtie-bio.sourceforge.net/index.shtml.

2.3MySQLandsettingupdatabase

InstallationofMySQLisoptionalbutitisrequiredforuseoftheshortrandatabasefunctionalities.Adetaileddescriptionondownloadingandsettingup

MySQLcanbefoundathttp://www.mysql.com/.

MySQLsetupforshortranishandledautomaticallybytheMacOSsetupscriptandcanbesetuponLinuxsystemsusingtheenclosedMySQLsetupscript:

> sudo mysql < "mysql_setup.txt"

You can also set up MySQL manually by creating a database and givingpermissionstoyouruser(s):


6/27

6

> sudo mysql

mysql> CREATE DATABASE profiles;

mysql> CREATE USER user@localhost IDENTIFIED BY ;

mysql> GRANT ALL PRIVILEGES ON profiles.* TO

user@localhost;ThisbasicMySQLsetupcorrespondstothefollowingsettingsintheconfiguration.txtfile:

server=localhost

user=user

password=

database=profiles

2.4ECHO

ECHOisusedonlyifk-mercorrectionofsequencingreadsisrequired.Itcanbe

downloadedfromhttp://uc-echo.sourceforge.net/.

2.5mirDeep

AllthescriptsrequiredformirDeeparesuppliedwiththeshortranpackagebutits dependencies must be globally installed, see http://www.mdc-

berlin.de/en/research/research_teams/systems_biology_of_gene_regulatory_ele

ments/projects/miRDeep/.Itisonlyrequiredformodule-4.

2.6Blast

formatdbandfastacmdfromtheBlastpackageshouldbeinstalled.Adescriptioncanbefoundhere:http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/formatdb_fastacmd.html.

2.7NiBLS

NiBLS(MacLeanetal.BMCBioinformatics.201018;11:93)canoptionallybeusedtodetectsmallRNAlociinsteadofthesimpleclusterdetectionalgorithm

includedwithshortran.Seehttps://github.com/danmaclean/NiBLS.


7/27

7

3 QUICK START GUIDE3.1Configuration

Forthefirstrun,werecommendthatyouopenthe"configuration.txt"fileina

texteditorandreplace"/Users/test/Desktop/shortran_0.43/demo"withthepathtotheshortrandemofolderonyoursystem.

Opentheterminalwindowandchangetotheshortrandirectoryandruntheprogram:

> cd[path_to_shortran_folder]

>python shortran.py

Andhit"n"torunshortranaccordingtotheconfigurationfile.

Ifalldependenciesareinstalledcorrectly,youroutputfoldershouldlooklikethe

"example_output"folderavailablefordownload,whentherunisfinished.

3.2Explorethedatabase

YoucanexploretheMySQLdatabasedirectlyusingtheterminalwindow.For

example,type:

>mysql -u user

mysql > show databases;

mysql > use profiles;

mysql > show tables;

mysql > describe `demo_2012-05-09_profile_19_24_sorted.cut-1_score_1`;

mysql > SELECT * FROM `demo_2012-05-09_profile_19_24_sorted.cut-1_score_1` LIMIT 10;

Whenloggedintothedatabase,youcanexperimentwiththequeries.Forinstanceretrievethe10mosthighlyabundantsRNAsequenceswithexon

matchesandsortthembyabundance:

mysql > SELECT `#Sequence`,`sum_norm`,`genomic_region` FROM`demo_2012-05-09_profile_19_24_sorted.cut-1_score_1` WHERE

`genomic_region` LIKE "%exonic%" ORDER BY `sum_norm` DESCLIMIT 10;

Forfurtherqueryexamples,pleaseseesection6-9(Module9)


8/27

8

4 CONFIGURATIONTherearetwowaysofconfiguringtheanalysis.Eitherthefile configuration.txtcanbemodifieddirectly,ortheinteractiveconfigurationinterfacecanbeused.

4.1 Configurationusingtheconfiguration.txtfileParametersandvariablescanbedefinedduringconfiguration.Herewedescribethe common parameters for the pipeline. Module-specific parameters will be

explainedinsection4ofthemanual.Westartbydefininganame(i.e.date)forthefolderas:

date = '[a string]' # i.e. 2012-02-21

The range of small RNA lengths to be analyzed is defined by entering the

minimumandmaximumlengths:min_seq_size= '[an integer]' # i.e. 19

max_seq_size= '[an integer]' # i.e. 24

Rarereadscanbefilteredbyusingaminimumthreshold(cutoff),whichremovesall sequences which do not occur a minimum of cutoff times when summedacrossalllibraries.Filteringusingavariationscore( score_regulation_cutoff)canbeusedtofilterforsequencesthatshowinter-libraryvariationinabundance:

cutoff = '[an integer]'

score_regulation_cutoff= '[an integer]' #stdev/sqrt(average)

Library headers,whichare used in the plotsand in theMySQLtables, can bedefined:

genotype = '[strings separated by commas]' #i.e. 'wt','mut'

TheMySQLservershouldbeconfiguredusingthefollowingoptions:

server = '[string]' # i.e. localhost

user = '[string]' # i.e. user

password = '[string]' # i.e. ''

database = '[string]' # i.e. profiles

Bowtie mapping parameters such as maxmismatches in seed, max sum of

mismatchquality scores across alignmentand seed length,can bemodifiedusingthefollwoingparameters:

mismatches = '[an integer]'

max_maq_err = '[an integer]'


9/27

9

seedlen = '[an integer]'

Each library should have its own folder containing sequencing data in fastqformat.Thepathforthedirectorycontainingallfoldersandfoldernameshould

beconfigured(fordetailsseetheApplicationsection).fq_file_dir = '[path to directory with all data folders]'

lib = '[all the data folder names separated by commas]'

The variablefiltering_file in the configuration file can refer to any fasta file,against which the reads should be filtered. There is an option

use_only_mapped_reads,whichifTrueallowsdiscardallreadsnotmappingtothefiltering_fileandifFalsediscardsonlythemappedreads.filtering_filecanbeleftempty,ifnofilteringisrequired.

filtering_file = '[file_path]'

use_only_mapped_reads = '[True/False]'

Thereferencegenomefile,annotationfeaturefile(describingforexamplegene

models),andconservedmiRNAfilesaredefinedusingthefollowingoptions:

genome_file = '[file_path]'

gtf_file = '[file_path]'

miRNA_database = '[file_path]'

Theparametersigv_infileandcluster_infileareoptionalandcanbeusedtoinputmapping or clustering results from other sources into the pipeline. For fileformats, please see the example data. If left empty, the files generated from

earliermodulesareusedbydefault.

4.2 InteractiveconfigurationThe shortran pipelinealsooffersan interactive configurationmode,where the

user ispromptedforall optionsand parameters.The optionsare the sameasdescribed in section 3.1, but an additional help function is provided in the

interactiveversion.

5 Modules5.1 Module-1:Adaptertrimming,errorcorrection,filteringand

profiling

Thismodule offers pre-processing of the raw data obtained from sequencing(fastq format). Sequencing errors can be corrected byk-mercorrection using


10/27

10

ECHO.Thisstep iscomputationally expensive and isnot criticalforuse ofthe

othermodules.Itmay,however,improveanalysisaccuracy.

Raw_data_filtering = True

ErrorCorrection = True

Path_2_ECHO = '[full path to echo]'

Ifrawfastqfilesareused,adaptersshouldbeclippedoffandthefilesshouldbe

splitbysize:

adapter_filtering = True

keep_with_adapter_only = True

# If True, discards reads if no adapter was detected

adapter='[3 adapter sequence' # i.e. 'AGCCTTCTA'

# 3 adapter sequence. Multiple sequences can be entered as a comma-

separated list in the same order as the sequencing libraries in thelib variable

trim=0 # number of bases to trim from 5 end

Readscanbefilteredbasedonhomologytoauser-suppliedmultifastafile.For

instance, the user could remove readsmatching ribosomal RNA sequences or

choosetofocusonlyonreadsmappingtothereferencegenome.

homology_filtering='[full path and filename]' #multifasta file

use_only_mapped_reads=False #removes reads matching the fasta file

Followingerrorcorrectionandfiltering,profiletableswithrawandnormalized

readcountscanbegenerated.

Make_no_cut_profile_files=True

Oncetheprofiletablesareavailable,additionalfiltersforabundanceandinter-

libraryvariationscore(stdev/sqrt(average))canbeapplied.

cutoff='a digit' # the minimum read count sum across all libraries

score_regulation_cutoff=[a digit] # the minimum variation score

ThefinalstepinModule-1addsthefilteredprofiletabletoaMySQLdatabase:

Add_output_to_mysql_database=True

SummaryModule-1:

Input:

RawFastqfilesOutput:


11/27

11

AdapterfilteredFfastqfiles-$output/module-1/adapter_filtered_data/ Fastqfilessplitbysize-$output/module-1/size_fractionated_data/ HomologyfilteredFastqfiles-$output/module-1/filter/ Profiletables-$output/module-1/profiles/

o Normalizedcounttableswithabundancecutoffo Normalized count tables with abundance cutoff and variationscorecut-off

o Similarfilesforallthesizefractions Fileswithrawabundancetotalcounts

o Redundantcounts-$output/module-1/libraries_abundanceo Non-redundantcounts$output/module-1/libraries_abundance_unique

Fileforselectingcontrolforlibraries-libraries_reference Fileforselectingpairstoanalyzedforcorrelation-compare_libraries Graphicaloutput

o Plotsforrawabundancesacrosslibraries

o Sequencesizevariationacrossthelibrarieso Sequencesizefractionvariationwithinthelibrary

MySQLtable

5.2 Module-2:GenomicMappingReads are mapped to a given reference genome and alignment .igv files are

generated. These files contain information about mapping positions and

expression values and can be displayed using the IGV browser

(http://www.broadinstitute.org/igv/).genome_file = '[full path and filename' #reference genome file

Genome_mapping = True

The variable infile allows using an alternative profile table. If left empty, theprofiletablegeneratedbyModule-1isusedbydefault.

infile='[full path and filename]' #use alternative profile table

SummaryModule-2:

Input:

Filewithallthereadso Bydefaulttakenfrommodule-1outputo Optional: The user can submit any file where the first columncontains sequences for genome mapping using the infileparameter.

Output:

Mappedfileso

Samformato IGVformat


12/27

12

Fastafileo Containsallthereads

5.3 Module-3:ClusteringanalysisIn module-3, a clustering algorithm is invoked where mapped reads are

separated intodifferent clustersbased on thegenomicpositions.Withdefaultsettings,aclusterisagenomicregion,whichcontainsatleastfiveuniquereads

containingnogaplonger than50nucleotidesbetweenconsecutive reads.Alsotheabundancesofthesereadsshouldsumtoatotalofatleast75.Parameters

can be changed to adjust the stringency. An external IGV file to be used for

clusteringcanbespecifiedusingtheigv_infileparameter.Ifleftblank,theIGVfileproducedbyModule-2isusedbydefault.

Cluster_prediction=True

divide_by_size=False #do not treat size fractions separately incluster analysis

abundance_cutoff = 75 #minimum sum of read abundances per cluster

Add_cluster_to_mySQL = True

igv_infile=''

If preferred, the NiBLs algorithm (MacLean et al. BMC Bioinformatics. 201018;11:93) can be used to detect clusters instead of the simple approach

describedabove:

Cluster_prediction=TrueNiBLS=True

Finally,clusterscanbeannotatedaccordingtotheiroverlapwiththegenomicfeaturesspecifiedusingthegtf_fileparameter.

Cluster_genomic_region_prediction=True

SummaryModule-3:

Input: Mappedfiletogenome

o BydefaultmappedfilefromModule-2o Optional:IGVformatmappedfileprovidedexternally(igv_infile)

Output Clusterfile

o BEDformatfileforclusterconsistingallsizeclasseso Sizefractionatedclusterso Clusterssummaryo Clustersinfastaformatfile

Genomicannotationofclusterso Figurewithgenomicdistributionofclusters


13/27

13

o Countsinthefilecluster_genomic_region.txto File with all the sequences and their corresponding genomicannotation(fileextension.region)

MySQLtable

5.4 Module-4:miRNApredictionsModules 4 and 5 are designed for categorizing small RNAs. Module-4 uses

miRDeepormiRDeep-P formiRNAprediction.ThemiRDeep-Pmodule isonlyapplicableforsmallRNAdatasetsobtainedfromplants.Foranimaldata,please

use miRDeep instead. Predicted miRNAs are mapped against the a fasta file

containingconservedmiRNAsequences(miRNA_database)tochecktheaccuracyof the prediction. A summary onmapped conserved miRNA/miRNA*/miRNA-

familiesisgenerated.

miRNA_database = '[file name]'

#fasta file with conserved miRNA sequences

MiRNA_prediction=True

mi_sequence_file = '[file name]' #optional external miRDeep formatinput file

gff_file = '[file name]' #annotation file only for miRNApredictions

Thevariablemi_sequence_fileisthefilepathtotheproperlyformatted(miRDeepstandard input) fasta file. If left empty, a miRDeep format input fasta file is

generatedfromtheModule-1outputprofiletablebydefault.miRDeepcanuseanannotation file to filter tRNA and rRNA sequences from its prediction results

(gff_file).

SummaryModule-4:

Input

GenomeFastafile(genome_file) RNAclassificationannotationfile(gfffile) SequencefastafileinMiRDeepformat(mi_sequence_file)

o BydefaultitistakenfromModule-1 ConservedmiRNA(miRNA_database)

o Bydefaultitistakenfromdemo(miRbasev.17)Output

PotentialmiRNAcandidates-miRNA_filter_P_prediction_miRNA.fasta miRNAprecursorsmiRNA_precursors miRNAstructures-miRNA_structures ConservedmiRNAs,familyandhomologcounts-

miRNA_filter_P_prediction_miRNA.miRNA_mapped.miRNAfamily.txt


14/27

14

5.5 Module-5:ta-siRNApredictionsTrans-actingsmallinterferingRNAsisanothercategoryofsmallRNAsgenerated

byplants.Herewepredictta-siRNAusingamodifiedversionofChensalgorithm

(Ho-MingChenetal.,2007).Afastafilecontainingta-siRNAclustersisproduced

andsupplementedwithanotherfiledescribingtheprobabilitythatansRNAisassociatedwithatas-locus.

TasiRNA_prediction=True

SummaryModule-5:

Input

Genomefastafile(genome_file) Clusterfile(fromModule-3)

Output

Predicted ta-siRNA sequences with p-valve


15/27

15

libraries.Theselibrariescanbedirectlycomparedagainstoneanotherorthey

canbecomparedafterbeingnormalizedtoagivenreferencelibrary.Modifythefiles libraries_referenceand compare_libraries in theModule-1 folder to set upthelibrarycomparisons.

Pattern_plot=True

pattern_plot_file='' #external file with profile table, default isprofile table from Module-1

SummaryModule-7:

Input

Profiletableo Bydefaultfrommodule1(pattern_plot_file)

Output

Graphicaloutputo Normalizedexpressionvaluesploto Logofnormalizedexpressionvaluesploto Relativefractionplotwhenprovidedcontrollibraries

Fileswithcorrelationcoefficientvalues-*.spearman_values_*

5.8 Module-8:PatternSearchThismodule calculates the Markov probability for themost 5 two and three

nucleotides.Anoutput filewithprobabilities isgenerated toallow theuser toidentifysequencecompositionbiases.Givenapatterninthesub-module,allthe

sequences with the pattern can be retrieved and further analysis can be

performedonthisset.

Find_pattern=True

infile_for_pattern='' #profile table from Module-1 or other file withsequences in first column

SummaryModule-8:

Input

Filecontainingsequenceso Bydefaultprofiletablefrommodule-1(infile_for_pattern)

Output Markovprobabilities

o Forfirstthreenucleotides-markov_prob.txt Filecontainingsequenceswiththetargetedpattern-*.fasta.* Mappingcountsforsequenceswithpattern-pattern_read_stats.txt


16/27

16

5.9 Module-9:QueriesOncetheMySQLtableiscreated,thedatacanbeaccessedusingSQLqueries.

MySQL_query=True

query_infile='query_infile' #Text file containing MySQL queries.

Should be located in the shortran main directory.

SummaryModule-9:

Forexamplequeries,pleaseseetheapplicationsection,Module-9.

Input

MySQLformatquery(query_infile)

Output MySQLqueryresult-mysql_query_output


17/27

17

6 Application

Here we explain the application of shortran for the analysis of the datasetsuppliedwiththepackage.

Asampledatasetisavailablewiththe shortranpackage.ItconsistsofasubsetoffourlibrariesfromLotusjaponicussRNA-seqdata.Outputsfromallthemodulesareincludedinthepackage,andwerecommendthatusersgothroughthedemodata examplesbeforeproceedingwiththeir owndata.Herewebrieflyexplain

theusageandoutputofeachmodule.All shortransettingsusedintheanalysisare presented in the configuration.txt file, which can be used to analyze thesampledataafterchangingdirectoryandMySQLserversettingstoreflectyour

localenvironment.

6.1 Module-1Raw data for processing are fastq files. In the demo, these fastq files areseparated in fourdifferentfoldersnamedGHD-5_sample,GHD-6_sample,GHD-

9_sample and GHD-10_sample. These four folders refer to librariesmock_wild_type, wild_type, mock_NFR1 and NFR1 respectively. Each contains

readsobtainedfromIlluminaRNA-seqreads,whicharemappedtochromosome

2(35-40MB)onthegenome.

Wepre-processedthereadsinmodule1.1andanerrorcorrectionalgorithmwas

invoked. There were 627,147 reads in the raw dataset (Figure 1) and whenparsingsequencesthroughECHOonthedatasetapproximately3.6%readswere

correctedforeachlibrary.

The two files lib_abundance and lib_abundance_unique contain abundances foreachsizeandlibraryfornon-redundantandredundantreadsrespectively,data

isplotted(Figures2and3).Figures2aand2bareexamplefiguresshowingread

lengthdistribution(in%)amongdifferentlibrariesforsizeclasses21ntand24nt.Plotsforallsizeareavailableintheplotsfolder.Figure3aand3bareplotsfor

size distribution within a library for mock_NFR1 and NFR1 libraries,respectively.

Homology filteringwas performed toremove all the reads coming from lotus

repeat data, leaving 605,329 reads. Both filtered reads and reads left after

filteringareprovidedasfastqfilesthatcanbeusedinseparateanalyses.Inmodule1.2,all the readsareprocessed tocalculateabundancesof all smallRNAuniquereadswithineachlibrary,andnormalizationisperformedtotake

intoaccountdifferencesinlibrarysizes.Therewere42,448uniquereadsinthe

dataset and 25,329 were left after applying an abundance cut-off of 2. An

additionalvariationscore-cutoff(stdev/average>1)wasappliedtoidentify

12,814differentiallyregulatedsequences(uniquereads).Theprofiletablesare

storedintheModule-1outputfolderaswellasintheMySQLoutputtable.


18/27

18

Figure1.Rawreadcountsforeachlibrary.


19/27

19

Figure2a.Readlengthdistributionsamongalllibraries.21fraction.

Figure2b.Readlengthdistributionsamongalllibraries.24fraction.


20/27

20

Figure3a.Sizedistributionforthemock_NFR1library

Figure3b.SizedistributionfortheNFR1library


21/27

21

6.2 Module-2AllthereadsweremappedtogenomeusingBowtiemappingsoftware.Mapping

parameteroptionswere:

Mismatches=2

max_mac_err=70

seedlen=25

Allthereadsweremappedbacktothegenomeon102,356places.Asthedemo

datawasconstructedwiththemappedreadsonly,allthereadsareexpectedtomapback.AnIGVformatfileisgeneratedandplacedinModule-2.ThisIGVfile

alsocontainsallthenormalizedexpressionvaluesandthesecanvisualizedinthe

IGVbrowser.

6.3 Module-3ThismoduleprocessestheIGVfilefromModule-2andpredictsclusterswitha

minimumabundanceof75.4315clusterswerepredicted.A .bedformat file is

suppliedcontainingall the clusterswithembeddedsequences.The file can bevisualizedusing the IGV browser.Genomic annotationwas performed and 48

clusters were exonic, 46 intronic, 3855 were intergenic and 2 clusters

overlappedintron/exonboundaries(Figure-4).

Figure4.Genomicdistributionofclusters


22/27

22

6.4 Module-4miRNA prediction was done using the plant-adjusted version of miRDeep,miRDeep-P,and29miRNAswerepredicted,including1conservedand28novel

miRNAs.Thesecontain1miRNAfamiliesanddetailsofhomologsforeachfamily.

AlldetailsaregivenintheModule-4outputfolder,mappingisperformedagainstmIRBASErelease17.AlternativemiRNAdatabasescanbecomparedagainstas

definedbytheuser.

6.5 Module-530 individual ta-siRNA sequences were predicted based on the relative

positioning(phasing)ofthesRNAlocionthegenome.Themajorityoftheseloci

wascontainedin6clusters(Module-3).Aclusterfastafileisavailable,asisthelist of all the smallRNA readswith associated p-value evaluating the relative

positioning of the loci. A read should have p-value lower than 0.001 to beconsideredaphasedRNA.

6.6 Module-6Thismoduleisimplementedtoaddannotationtoallthesequencesintheprofile

table.ThismodulerequiresMySQLinstallationtobeexecuted.Asinput,avariety

offastafilescanbeprovidedbytheuser.Sequenceswillbemappedagainsteach

file, and where sequences map to the respective reference the associatedmappingpositionsarebereported.

6.7 Module-7Thismodule creates expressionpatternplots for the libraries thatneed tobe

compared. Pairs to be considered can be defined in the file provided in theModule-1folder,compare_libraries.Bydefaultplotsaregeneratedforallpossiblesamplepairs, including theplotagainstsumand scorevalues fromtheprofile

table. As target libraries will usually be sequenced alongwith corresponding

controls, the latter can be specified as references in the file namedlibraries_reference,alsoprovidedintheModule-1folder.

Thefilelibraries_referencecontainstwocolumnsofwhichthefirstreferstothetargetlibrary.Librariesareannotatedbynumbersinthesameorderasonthe

user-provided input list for libraries. The second column is filled by 0s bydefault, unless a control library is defined. Target/reference relationships are

definedbyreplacingthe0swiththenumberoftheappropriatecontrollibrary.

Four different kinds of plots can be generated. When no control library isdefined, normalized counts and log values of normalized counts are plotted

(Figures 5a and 5b). If reference libraries are defined, expression scores are

plottedaccordingtoequations1aand1b.TheexpressionplotsarevisualizedinFigures5cand5d.


23/27

23

Figure5a.Normalizedcountsareplotted.

Equation1a.log-odd-score=log(Ti/Ci)

Equation1b.norm_control=(Ti-Ci)/(Ti+Ci)

WhereTiistheexpressionvaluesinthetargetlibraryand

Ciistheexpressionvaluesinthecontrollibrary


24/27

24

Figure5b.log(normalizedcounts)areplotted

Figure5c..log-odd-score=log(Ti/Ci)plotted


25/27

25

Figure5d.norm_control=(Ti-Ci)/(Ti+Ci)plotted.

6.8 Module-8Thismodule is used to find sequence compositionbiases in the5 endof the

sequences.Di- and tri-nucleotideMarkov probabilities are calculated for eachnucleotide combination. In theprovidedsampledataset, forexample,anover-

representationofGAA starting triplets is observed. Sequenceswith particularmotifs are extracted andmapped back to the genome. If a bias is caused by

technical artifacts, the proportion of sequences with the motif in question

mappingtothereferencegenomemightbelowerthantheaverage.


26/27

26

6.9 Module-9Usingthismodule,theMySQLtablecanbequeried.Complexqueriesconcerning

particular biological questions can be performed. The table below shows

examplesofqueries.

Purpose Syntax

Retrieve variation scores and miRBase annotations

for predicted miRNAs.

SELECT

`#Sequence`,

`miRBase.fa`,

`score`

FROM `2012-03-04_profile_19_24_sorted.cut-1_score_1`

WHERE

`miRNA_prediction` = "predicted as miRNA";

Retrieve average response values for sRNAs up-

regulated in the wild only for sRNAs mapping to

miRNAs annotated in miRBase.

SELECT

COUNT(score) AS n,

AVG(LOG(wild_type_norm/Mock_wild_type_norm)) AS wt,

AVG(LOG(NFR1_norm/Mock_NFR1_norm)) AS nfr1


WHERE

LOG(wild_type_norm/Mock_wild_type_norm) > 0

AND `miRBase.fa` !="0";

Retrieve expression information for sRNAs

mapping to miRNAs annotated in miRBase and

sort in descending order by the wild type

expression response.

SELECT

`#Sequence`AS sequence,

`read_size` AS length,

LOG(wild_type_norm/Mock_wild_type_norm) AS wt,

LOG(NFR1_norm/Mock_NFR1_norm) AS nfr1,

`miRBase.fa`


WHERE

`miRBase.fa` != 0

ORDER BY wt DESC;


27/27

6.10Computationsummary

Module Process Time (s) CPU % Memory %

1.1 Adapter filtering/trimming 15.4 0.4 0.3

1.1 Error correction 5827.5 73.4 0.91.1 Size splitting 7.1 49.4 0.4

1.1 Homology filtering 116.0 113.5 0.5

1.2 Making profile tables 73.5 54.1 1.3

1.3 Adding data to MySQL database 1.9 37.9 0.2

2 Genomic mapping 129.0 131.9 5.4

3.1 Cluster prediction only 9.6 98.5 1.6

3.1 - including sequence extraction 1393.4 149.3 0.8

3.2 Cluster genomic region annotation 26.7 97.8 0.4

4.1 miRNA predictions-mirDeep2 2476.3 100 40.4

4.1 miRNA predictions-mirDeepP 11875.1 1.8 3.94.2 miRBase mapping 5.0 3.3 0.8

5 ta-siRNA prediction 79.7 98.9 0.9

6 Homology annotation - mapping 6.5 16.7 0.2

6 Homology annotation - MySQL 1.1 0.1 0.1

7 Expression profile plotting 107.8 97.6 1.1

8.1 Find pattern 3.7 100 0.2

8.2 Map pattern 155.2 99.8 5.9

9 MySQL querying 0.0 0.9 0.2

Thetotalnumberofreadsprocessedwas627147.Datawasanalyzedona

MacintoshiBookwith4GBmemoryandafourcore2.7GHzInteli7processor.

Documents

Shortran v.0.43 Manual