Upload
marcosdecarvalho
View
223
Download
0
Embed Size (px)
Citation preview
7/30/2019 Shortran v.0.43 Manual
1/27
A users manual for
shortran
Released: 06-June-2012
Vikas Gupta and Stig U. Andersen
Bioinformatics Research Center and
Department of Molecular Biology and Genetics
Aarhus University, Denmark
Contact: [email protected], [email protected]
7/30/2019 Shortran v.0.43 Manual
2/27
2
1 OVERVIEW.....................................................................................................................4
1.1Background...........................................................................................................................4
1.2Summaryofshortranmodules.......................................................................................4
1.3Implementation...................................................................................................................4
1.4Availability............................................................................................................................4
2 Installation....................................................................................................................5
2.1PythonandPerl...................................................................................................................5
2.2Bowtie.....................................................................................................................................5
2.3MySQLandsettingupdatabase......................................................................................5
2.4ECHO........................................................................................................................................6
2.5mirDeep..................................................................................................................................6
2.6Blast.........................................................................................................................................6
2.7NiBLS.......................................................................................................................................6
3 QUICKSTARTGUIDE..................................................................................................7
3.1Configuration........................................................................................................................7
3.2Explorethedatabase................. .................. .................. ................. .................. .................. 7
4 CONFIGURATION.........................................................................................................8
4.1 Configurationusingtheconfiguration.txtfile........................................................8
4.2 Interactiveconfiguration..............................................................................................9
5 Modules..........................................................................................................................95.1 Module-1:Adaptertrimming,errorcorrection,filteringandprofiling.......9
5.2 Module-2:GenomicMapping....................................................................................11
5.3 Module-3:Clusteringanalysis..................................................................................12
5.4 Module-4:miRNApredictions..................................................................................13
5.5 Module-5:ta-siRNApredictions..............................................................................14
5.6 Module-6:Annotation.................................................................................................14
5.7 Module-7:ExpressionPlots......................................................................................14
5.8 Module-8:PatternSearch..........................................................................................15
5.9 Module-9:Queries........................................................................................................16
6 Application.................................................................................................................17
6.1 Module-1..........................................................................................................................17
6.2 Module-2..........................................................................................................................21
6.3 Module-3..........................................................................................................................21
6.4 Module-4..........................................................................................................................22
6.5 Module-5..........................................................................................................................22
6.6 Module-6..........................................................................................................................22
7/30/2019 Shortran v.0.43 Manual
3/27
3
6.7 Module-7..........................................................................................................................22
6.8 Module-8..........................................................................................................................25
6.9 Module-9..........................................................................................................................26
6.10 Computationsummary............................................................................................27
7/30/2019 Shortran v.0.43 Manual
4/27
4
1 OVERVIEW
1.1Background
WefounditcomplicatedtocombinesmallRNA(sRNA)profilingandannotationinformationusingexistinganalysistools.Herewedescribetheinstallationand
use of shortran, which generates and combines expression profiling andannotationdatafromsRNAsequencingrawdata.Theoutputisprovidedasan
easily accessible MySQL database format and provides graphical output tofacilitatefastdataanalysis.
1.2Summaryofshortranmodules
shortran isa combinationof variousmodules,allowingauser toanalyzedata
using either an interactive command line interface or a configuration file. Adetailed descriptionof the use of eachmodule is provided in the application
section of this manual. shortran is capable of performing filtering, read errorcorrection, expression profiling,mapping,clustering analysis,miRNA/ta-siRNA
prediction,patternsearch,targetpredictionandannotation.
1.3Implementation
PythonistheprimarylanguageofthepipelinebutPerlhasbeenalsoused.The
package contains multiple Python language scripts. shortran.py is the main,whichinvokestheotherpipelinemodules. shortranisdevelopedusingpython(v2.7.2) andperl (v5.12.3), it has been tested on bothMacOS X (10.6.8) andLinuxUbuntu(10.04.3).
1.4Availability
Theshortranpackageincludinganexampledatasetisavailableatwww.carb.dk/resources.asp.
7/30/2019 Shortran v.0.43 Manual
5/27
5
2 InstallationTheprimarymodulesofthe shortranpackagerequirenoinstallationandcanbeinvokeddirectlyusingpythonafterthedependencieshavebeeninstalled.Torun
shortran,changedirectory(cd)tothefoldercontainingthe shortran.pyfileand
runthescript:> cd[path of shortran_folder]
>python shortran.py
The > indicates the commandprompt and should not be typed. Brackets []
indicatestextthatneedstobeadjusted.Bracketsshouldnotbetyped.Commentsareindicatedby#.Toobtainthepathofafileorfolderforintroducingitintoa
script, the icon of the respective item can be dragged into a termial windowwherethepathwillthenbedisplayed.
DetailedinstructionsforinstallationandsetupofcoredependenciesonMacOS
and Linux platforms are included. Please refer to install_osx.README andinstall_linux.README.
2.1PythonandPerl
Most operating systems have built-in Python and Perl and no installation isrequired. If that is not the case, Python can be downloaded from
http://python.org/getit/, and Perl from http://www.perl.org/get.html. Inaddition, the Python libraries numpy (http://numpy.scipy.org/), scipy(http://www.scipy.org/) and matplotlib (http://matplotlib.sourceforge.net/)
needtobeinstalled.
2.2Bowtie
Bowtieisrequiredbyanumberofmodulesandshouldbeinstalledandaddedto
the system path (so that typing bowtie in the terminalactivates the programregardlessoftheworkingdirectory).IfinstallingonMacOS,Bowtieisinstalled
andaddedtothepathbytheinstallationscript.Itcanbefoundathttp://bowtie-bio.sourceforge.net/index.shtml.
2.3MySQLandsettingupdatabase
InstallationofMySQLisoptionalbutitisrequiredforuseoftheshortrandatabasefunctionalities.Adetaileddescriptionondownloadingandsettingup
MySQLcanbefoundathttp://www.mysql.com/.
MySQLsetupforshortranishandledautomaticallybytheMacOSsetupscriptandcanbesetuponLinuxsystemsusingtheenclosedMySQLsetupscript:
> sudo mysql < "mysql_setup.txt"
You can also set up MySQL manually by creating a database and givingpermissionstoyouruser(s):
7/30/2019 Shortran v.0.43 Manual
6/27
6
> sudo mysql
mysql> CREATE DATABASE profiles;
mysql> CREATE USER user@localhost IDENTIFIED BY ;
mysql> GRANT ALL PRIVILEGES ON profiles.* TO
user@localhost;ThisbasicMySQLsetupcorrespondstothefollowingsettingsintheconfiguration.txtfile:
server=localhost
user=user
password=
database=profiles
2.4ECHO
ECHOisusedonlyifk-mercorrectionofsequencingreadsisrequired.Itcanbe
downloadedfromhttp://uc-echo.sourceforge.net/.
2.5mirDeep
AllthescriptsrequiredformirDeeparesuppliedwiththeshortranpackagebutits dependencies must be globally installed, see http://www.mdc-
berlin.de/en/research/research_teams/systems_biology_of_gene_regulatory_ele
ments/projects/miRDeep/.Itisonlyrequiredformodule-4.
2.6Blast
formatdbandfastacmdfromtheBlastpackageshouldbeinstalled.Adescriptioncanbefoundhere:http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/formatdb_fastacmd.html.
2.7NiBLS
NiBLS(MacLeanetal.BMCBioinformatics.201018;11:93)canoptionallybeusedtodetectsmallRNAlociinsteadofthesimpleclusterdetectionalgorithm
includedwithshortran.Seehttps://github.com/danmaclean/NiBLS.
7/30/2019 Shortran v.0.43 Manual
7/27
7
3 QUICK START GUIDE3.1Configuration
Forthefirstrun,werecommendthatyouopenthe"configuration.txt"fileina
texteditorandreplace"/Users/test/Desktop/shortran_0.43/demo"withthepathtotheshortrandemofolderonyoursystem.
Opentheterminalwindowandchangetotheshortrandirectoryandruntheprogram:
> cd[path_to_shortran_folder]
>python shortran.py
Andhit"n"torunshortranaccordingtotheconfigurationfile.
Ifalldependenciesareinstalledcorrectly,youroutputfoldershouldlooklikethe
"example_output"folderavailablefordownload,whentherunisfinished.
3.2Explorethedatabase
YoucanexploretheMySQLdatabasedirectlyusingtheterminalwindow.For
example,type:
>mysql -u user
mysql > show databases;
mysql > use profiles;
mysql > show tables;
mysql > describe `demo_2012-05-09_profile_19_24_sorted.cut-1_score_1`;
mysql > SELECT * FROM `demo_2012-05-09_profile_19_24_sorted.cut-1_score_1` LIMIT 10;
Whenloggedintothedatabase,youcanexperimentwiththequeries.Forinstanceretrievethe10mosthighlyabundantsRNAsequenceswithexon
matchesandsortthembyabundance:
mysql > SELECT `#Sequence`,`sum_norm`,`genomic_region` FROM`demo_2012-05-09_profile_19_24_sorted.cut-1_score_1` WHERE
`genomic_region` LIKE "%exonic%" ORDER BY `sum_norm` DESCLIMIT 10;
Forfurtherqueryexamples,pleaseseesection6-9(Module9)
7/30/2019 Shortran v.0.43 Manual
8/27
8
4 CONFIGURATIONTherearetwowaysofconfiguringtheanalysis.Eitherthefile configuration.txtcanbemodifieddirectly,ortheinteractiveconfigurationinterfacecanbeused.
4.1 Configurationusingtheconfiguration.txtfileParametersandvariablescanbedefinedduringconfiguration.Herewedescribethe common parameters for the pipeline. Module-specific parameters will be
explainedinsection4ofthemanual.Westartbydefininganame(i.e.date)forthefolderas:
date = '[a string]' # i.e. 2012-02-21
The range of small RNA lengths to be analyzed is defined by entering the
minimumandmaximumlengths:min_seq_size= '[an integer]' # i.e. 19
max_seq_size= '[an integer]' # i.e. 24
Rarereadscanbefilteredbyusingaminimumthreshold(cutoff),whichremovesall sequences which do not occur a minimum of cutoff times when summedacrossalllibraries.Filteringusingavariationscore( score_regulation_cutoff)canbeusedtofilterforsequencesthatshowinter-libraryvariationinabundance:
cutoff = '[an integer]'
score_regulation_cutoff= '[an integer]' #stdev/sqrt(average)
Library headers,whichare used in the plotsand in theMySQLtables, can bedefined:
genotype = '[strings separated by commas]' #i.e. 'wt','mut'
TheMySQLservershouldbeconfiguredusingthefollowingoptions:
server = '[string]' # i.e. localhost
user = '[string]' # i.e. user
password = '[string]' # i.e. ''
database = '[string]' # i.e. profiles
Bowtie mapping parameters such as maxmismatches in seed, max sum of
mismatchquality scores across alignmentand seed length,can bemodifiedusingthefollwoingparameters:
mismatches = '[an integer]'
max_maq_err = '[an integer]'
7/30/2019 Shortran v.0.43 Manual
9/27
9
seedlen = '[an integer]'
Each library should have its own folder containing sequencing data in fastqformat.Thepathforthedirectorycontainingallfoldersandfoldernameshould
beconfigured(fordetailsseetheApplicationsection).fq_file_dir = '[path to directory with all data folders]'
lib = '[all the data folder names separated by commas]'
The variablefiltering_file in the configuration file can refer to any fasta file,against which the reads should be filtered. There is an option
use_only_mapped_reads,whichifTrueallowsdiscardallreadsnotmappingtothefiltering_fileandifFalsediscardsonlythemappedreads.filtering_filecanbeleftempty,ifnofilteringisrequired.
filtering_file = '[file_path]'
use_only_mapped_reads = '[True/False]'
Thereferencegenomefile,annotationfeaturefile(describingforexamplegene
models),andconservedmiRNAfilesaredefinedusingthefollowingoptions:
genome_file = '[file_path]'
gtf_file = '[file_path]'
miRNA_database = '[file_path]'
Theparametersigv_infileandcluster_infileareoptionalandcanbeusedtoinputmapping or clustering results from other sources into the pipeline. For fileformats, please see the example data. If left empty, the files generated from
earliermodulesareusedbydefault.
4.2 InteractiveconfigurationThe shortran pipelinealsooffersan interactive configurationmode,where the
user ispromptedforall optionsand parameters.The optionsare the sameasdescribed in section 3.1, but an additional help function is provided in the
interactiveversion.
5 Modules5.1 Module-1:Adaptertrimming,errorcorrection,filteringand
profiling
Thismodule offers pre-processing of the raw data obtained from sequencing(fastq format). Sequencing errors can be corrected byk-mercorrection using
7/30/2019 Shortran v.0.43 Manual
10/27
10
ECHO.Thisstep iscomputationally expensive and isnot criticalforuse ofthe
othermodules.Itmay,however,improveanalysisaccuracy.
Raw_data_filtering = True
ErrorCorrection = True
Path_2_ECHO = '[full path to echo]'
Ifrawfastqfilesareused,adaptersshouldbeclippedoffandthefilesshouldbe
splitbysize:
adapter_filtering = True
keep_with_adapter_only = True
# If True, discards reads if no adapter was detected
adapter='[3 adapter sequence' # i.e. 'AGCCTTCTA'
# 3 adapter sequence. Multiple sequences can be entered as a comma-
separated list in the same order as the sequencing libraries in thelib variable
trim=0 # number of bases to trim from 5 end
Readscanbefilteredbasedonhomologytoauser-suppliedmultifastafile.For
instance, the user could remove readsmatching ribosomal RNA sequences or
choosetofocusonlyonreadsmappingtothereferencegenome.
homology_filtering='[full path and filename]' #multifasta file
use_only_mapped_reads=False #removes reads matching the fasta file
Followingerrorcorrectionandfiltering,profiletableswithrawandnormalized
readcountscanbegenerated.
Make_no_cut_profile_files=True
Oncetheprofiletablesareavailable,additionalfiltersforabundanceandinter-
libraryvariationscore(stdev/sqrt(average))canbeapplied.
cutoff='a digit' # the minimum read count sum across all libraries
score_regulation_cutoff=[a digit] # the minimum variation score
ThefinalstepinModule-1addsthefilteredprofiletabletoaMySQLdatabase:
Add_output_to_mysql_database=True
SummaryModule-1:
Input:
RawFastqfilesOutput:
7/30/2019 Shortran v.0.43 Manual
11/27
11
AdapterfilteredFfastqfiles-$output/module-1/adapter_filtered_data/ Fastqfilessplitbysize-$output/module-1/size_fractionated_data/ HomologyfilteredFastqfiles-$output/module-1/filter/ Profiletables-$output/module-1/profiles/
o Normalizedcounttableswithabundancecutoffo Normalized count tables with abundance cutoff and variationscorecut-off
o Similarfilesforallthesizefractions Fileswithrawabundancetotalcounts
o Redundantcounts-$output/module-1/libraries_abundanceo Non-redundantcounts$output/module-1/libraries_abundance_unique
Fileforselectingcontrolforlibraries-libraries_reference Fileforselectingpairstoanalyzedforcorrelation-compare_libraries Graphicaloutput
o Plotsforrawabundancesacrosslibraries
o Sequencesizevariationacrossthelibrarieso Sequencesizefractionvariationwithinthelibrary
MySQLtable
5.2 Module-2:GenomicMappingReads are mapped to a given reference genome and alignment .igv files are
generated. These files contain information about mapping positions and
expression values and can be displayed using the IGV browser
(http://www.broadinstitute.org/igv/).genome_file = '[full path and filename' #reference genome file
Genome_mapping = True
The variable infile allows using an alternative profile table. If left empty, theprofiletablegeneratedbyModule-1isusedbydefault.
infile='[full path and filename]' #use alternative profile table
SummaryModule-2:
Input:
Filewithallthereadso Bydefaulttakenfrommodule-1outputo Optional: The user can submit any file where the first columncontains sequences for genome mapping using the infileparameter.
Output:
Mappedfileso
Samformato IGVformat
7/30/2019 Shortran v.0.43 Manual
12/27
12
Fastafileo Containsallthereads
5.3 Module-3:ClusteringanalysisIn module-3, a clustering algorithm is invoked where mapped reads are
separated intodifferent clustersbased on thegenomicpositions.Withdefaultsettings,aclusterisagenomicregion,whichcontainsatleastfiveuniquereads
containingnogaplonger than50nucleotidesbetweenconsecutive reads.Alsotheabundancesofthesereadsshouldsumtoatotalofatleast75.Parameters
can be changed to adjust the stringency. An external IGV file to be used for
clusteringcanbespecifiedusingtheigv_infileparameter.Ifleftblank,theIGVfileproducedbyModule-2isusedbydefault.
Cluster_prediction=True
divide_by_size=False #do not treat size fractions separately incluster analysis
abundance_cutoff = 75 #minimum sum of read abundances per cluster
Add_cluster_to_mySQL = True
igv_infile=''
If preferred, the NiBLs algorithm (MacLean et al. BMC Bioinformatics. 201018;11:93) can be used to detect clusters instead of the simple approach
describedabove:
Cluster_prediction=TrueNiBLS=True
Finally,clusterscanbeannotatedaccordingtotheiroverlapwiththegenomicfeaturesspecifiedusingthegtf_fileparameter.
Cluster_genomic_region_prediction=True
SummaryModule-3:
Input: Mappedfiletogenome
o BydefaultmappedfilefromModule-2o Optional:IGVformatmappedfileprovidedexternally(igv_infile)
Output Clusterfile
o BEDformatfileforclusterconsistingallsizeclasseso Sizefractionatedclusterso Clusterssummaryo Clustersinfastaformatfile
Genomicannotationofclusterso Figurewithgenomicdistributionofclusters
7/30/2019 Shortran v.0.43 Manual
13/27
13
o Countsinthefilecluster_genomic_region.txto File with all the sequences and their corresponding genomicannotation(fileextension.region)
MySQLtable
5.4 Module-4:miRNApredictionsModules 4 and 5 are designed for categorizing small RNAs. Module-4 uses
miRDeepormiRDeep-P formiRNAprediction.ThemiRDeep-Pmodule isonlyapplicableforsmallRNAdatasetsobtainedfromplants.Foranimaldata,please
use miRDeep instead. Predicted miRNAs are mapped against the a fasta file
containingconservedmiRNAsequences(miRNA_database)tochecktheaccuracyof the prediction. A summary onmapped conserved miRNA/miRNA*/miRNA-
familiesisgenerated.
miRNA_database = '[file name]'
#fasta file with conserved miRNA sequences
MiRNA_prediction=True
mi_sequence_file = '[file name]' #optional external miRDeep formatinput file
gff_file = '[file name]' #annotation file only for miRNApredictions
Thevariablemi_sequence_fileisthefilepathtotheproperlyformatted(miRDeepstandard input) fasta file. If left empty, a miRDeep format input fasta file is
generatedfromtheModule-1outputprofiletablebydefault.miRDeepcanuseanannotation file to filter tRNA and rRNA sequences from its prediction results
(gff_file).
SummaryModule-4:
Input
GenomeFastafile(genome_file) RNAclassificationannotationfile(gfffile) SequencefastafileinMiRDeepformat(mi_sequence_file)
o BydefaultitistakenfromModule-1 ConservedmiRNA(miRNA_database)
o Bydefaultitistakenfromdemo(miRbasev.17)Output
PotentialmiRNAcandidates-miRNA_filter_P_prediction_miRNA.fasta miRNAprecursorsmiRNA_precursors miRNAstructures-miRNA_structures ConservedmiRNAs,familyandhomologcounts-
miRNA_filter_P_prediction_miRNA.miRNA_mapped.miRNAfamily.txt
7/30/2019 Shortran v.0.43 Manual
14/27
14
5.5 Module-5:ta-siRNApredictionsTrans-actingsmallinterferingRNAsisanothercategoryofsmallRNAsgenerated
byplants.Herewepredictta-siRNAusingamodifiedversionofChensalgorithm
(Ho-MingChenetal.,2007).Afastafilecontainingta-siRNAclustersisproduced
andsupplementedwithanotherfiledescribingtheprobabilitythatansRNAisassociatedwithatas-locus.
TasiRNA_prediction=True
SummaryModule-5:
Input
Genomefastafile(genome_file) Clusterfile(fromModule-3)
Output
Predicted ta-siRNA sequences with p-valve
7/30/2019 Shortran v.0.43 Manual
15/27
15
libraries.Theselibrariescanbedirectlycomparedagainstoneanotherorthey
canbecomparedafterbeingnormalizedtoagivenreferencelibrary.Modifythefiles libraries_referenceand compare_libraries in theModule-1 folder to set upthelibrarycomparisons.
Pattern_plot=True
pattern_plot_file='' #external file with profile table, default isprofile table from Module-1
SummaryModule-7:
Input
Profiletableo Bydefaultfrommodule1(pattern_plot_file)
Output
Graphicaloutputo Normalizedexpressionvaluesploto Logofnormalizedexpressionvaluesploto Relativefractionplotwhenprovidedcontrollibraries
Fileswithcorrelationcoefficientvalues-*.spearman_values_*
5.8 Module-8:PatternSearchThismodule calculates the Markov probability for themost 5 two and three
nucleotides.Anoutput filewithprobabilities isgenerated toallow theuser toidentifysequencecompositionbiases.Givenapatterninthesub-module,allthe
sequences with the pattern can be retrieved and further analysis can be
performedonthisset.
Find_pattern=True
infile_for_pattern='' #profile table from Module-1 or other file withsequences in first column
SummaryModule-8:
Input
Filecontainingsequenceso Bydefaultprofiletablefrommodule-1(infile_for_pattern)
Output Markovprobabilities
o Forfirstthreenucleotides-markov_prob.txt Filecontainingsequenceswiththetargetedpattern-*.fasta.* Mappingcountsforsequenceswithpattern-pattern_read_stats.txt
7/30/2019 Shortran v.0.43 Manual
16/27
16
5.9 Module-9:QueriesOncetheMySQLtableiscreated,thedatacanbeaccessedusingSQLqueries.
MySQL_query=True
query_infile='query_infile' #Text file containing MySQL queries.
Should be located in the shortran main directory.
SummaryModule-9:
Forexamplequeries,pleaseseetheapplicationsection,Module-9.
Input
MySQLformatquery(query_infile)
Output MySQLqueryresult-mysql_query_output
7/30/2019 Shortran v.0.43 Manual
17/27
17
6 Application
Here we explain the application of shortran for the analysis of the datasetsuppliedwiththepackage.
Asampledatasetisavailablewiththe shortranpackage.ItconsistsofasubsetoffourlibrariesfromLotusjaponicussRNA-seqdata.Outputsfromallthemodulesareincludedinthepackage,andwerecommendthatusersgothroughthedemodata examplesbeforeproceedingwiththeir owndata.Herewebrieflyexplain
theusageandoutputofeachmodule.All shortransettingsusedintheanalysisare presented in the configuration.txt file, which can be used to analyze thesampledataafterchangingdirectoryandMySQLserversettingstoreflectyour
localenvironment.
6.1 Module-1Raw data for processing are fastq files. In the demo, these fastq files areseparated in fourdifferentfoldersnamedGHD-5_sample,GHD-6_sample,GHD-
9_sample and GHD-10_sample. These four folders refer to librariesmock_wild_type, wild_type, mock_NFR1 and NFR1 respectively. Each contains
readsobtainedfromIlluminaRNA-seqreads,whicharemappedtochromosome
2(35-40MB)onthegenome.
Wepre-processedthereadsinmodule1.1andanerrorcorrectionalgorithmwas
invoked. There were 627,147 reads in the raw dataset (Figure 1) and whenparsingsequencesthroughECHOonthedatasetapproximately3.6%readswere
correctedforeachlibrary.
The two files lib_abundance and lib_abundance_unique contain abundances foreachsizeandlibraryfornon-redundantandredundantreadsrespectively,data
isplotted(Figures2and3).Figures2aand2bareexamplefiguresshowingread
lengthdistribution(in%)amongdifferentlibrariesforsizeclasses21ntand24nt.Plotsforallsizeareavailableintheplotsfolder.Figure3aand3bareplotsfor
size distribution within a library for mock_NFR1 and NFR1 libraries,respectively.
Homology filteringwas performed toremove all the reads coming from lotus
repeat data, leaving 605,329 reads. Both filtered reads and reads left after
filteringareprovidedasfastqfilesthatcanbeusedinseparateanalyses.Inmodule1.2,all the readsareprocessed tocalculateabundancesof all smallRNAuniquereadswithineachlibrary,andnormalizationisperformedtotake
intoaccountdifferencesinlibrarysizes.Therewere42,448uniquereadsinthe
dataset and 25,329 were left after applying an abundance cut-off of 2. An
additionalvariationscore-cutoff(stdev/average>1)wasappliedtoidentify
12,814differentiallyregulatedsequences(uniquereads).Theprofiletablesare
storedintheModule-1outputfolderaswellasintheMySQLoutputtable.
7/30/2019 Shortran v.0.43 Manual
18/27
18
Figure1.Rawreadcountsforeachlibrary.
7/30/2019 Shortran v.0.43 Manual
19/27
19
Figure2a.Readlengthdistributionsamongalllibraries.21fraction.
Figure2b.Readlengthdistributionsamongalllibraries.24fraction.
7/30/2019 Shortran v.0.43 Manual
20/27
20
Figure3a.Sizedistributionforthemock_NFR1library
Figure3b.SizedistributionfortheNFR1library
7/30/2019 Shortran v.0.43 Manual
21/27
21
6.2 Module-2AllthereadsweremappedtogenomeusingBowtiemappingsoftware.Mapping
parameteroptionswere:
Mismatches=2
max_mac_err=70
seedlen=25
Allthereadsweremappedbacktothegenomeon102,356places.Asthedemo
datawasconstructedwiththemappedreadsonly,allthereadsareexpectedtomapback.AnIGVformatfileisgeneratedandplacedinModule-2.ThisIGVfile
alsocontainsallthenormalizedexpressionvaluesandthesecanvisualizedinthe
IGVbrowser.
6.3 Module-3ThismoduleprocessestheIGVfilefromModule-2andpredictsclusterswitha
minimumabundanceof75.4315clusterswerepredicted.A .bedformat file is
suppliedcontainingall the clusterswithembeddedsequences.The file can bevisualizedusing the IGV browser.Genomic annotationwas performed and 48
clusters were exonic, 46 intronic, 3855 were intergenic and 2 clusters
overlappedintron/exonboundaries(Figure-4).
Figure4.Genomicdistributionofclusters
7/30/2019 Shortran v.0.43 Manual
22/27
22
6.4 Module-4miRNA prediction was done using the plant-adjusted version of miRDeep,miRDeep-P,and29miRNAswerepredicted,including1conservedand28novel
miRNAs.Thesecontain1miRNAfamiliesanddetailsofhomologsforeachfamily.
AlldetailsaregivenintheModule-4outputfolder,mappingisperformedagainstmIRBASErelease17.AlternativemiRNAdatabasescanbecomparedagainstas
definedbytheuser.
6.5 Module-530 individual ta-siRNA sequences were predicted based on the relative
positioning(phasing)ofthesRNAlocionthegenome.Themajorityoftheseloci
wascontainedin6clusters(Module-3).Aclusterfastafileisavailable,asisthelist of all the smallRNA readswith associated p-value evaluating the relative
positioning of the loci. A read should have p-value lower than 0.001 to beconsideredaphasedRNA.
6.6 Module-6Thismoduleisimplementedtoaddannotationtoallthesequencesintheprofile
table.ThismodulerequiresMySQLinstallationtobeexecuted.Asinput,avariety
offastafilescanbeprovidedbytheuser.Sequenceswillbemappedagainsteach
file, and where sequences map to the respective reference the associatedmappingpositionsarebereported.
6.7 Module-7Thismodule creates expressionpatternplots for the libraries thatneed tobe
compared. Pairs to be considered can be defined in the file provided in theModule-1folder,compare_libraries.Bydefaultplotsaregeneratedforallpossiblesamplepairs, including theplotagainstsumand scorevalues fromtheprofile
table. As target libraries will usually be sequenced alongwith corresponding
controls, the latter can be specified as references in the file namedlibraries_reference,alsoprovidedintheModule-1folder.
Thefilelibraries_referencecontainstwocolumnsofwhichthefirstreferstothetargetlibrary.Librariesareannotatedbynumbersinthesameorderasonthe
user-provided input list for libraries. The second column is filled by 0s bydefault, unless a control library is defined. Target/reference relationships are
definedbyreplacingthe0swiththenumberoftheappropriatecontrollibrary.
Four different kinds of plots can be generated. When no control library isdefined, normalized counts and log values of normalized counts are plotted
(Figures 5a and 5b). If reference libraries are defined, expression scores are
plottedaccordingtoequations1aand1b.TheexpressionplotsarevisualizedinFigures5cand5d.
7/30/2019 Shortran v.0.43 Manual
23/27
23
Figure5a.Normalizedcountsareplotted.
Equation1a.log-odd-score=log(Ti/Ci)
Equation1b.norm_control=(Ti-Ci)/(Ti+Ci)
WhereTiistheexpressionvaluesinthetargetlibraryand
Ciistheexpressionvaluesinthecontrollibrary
7/30/2019 Shortran v.0.43 Manual
24/27
24
Figure5b.log(normalizedcounts)areplotted
Figure5c..log-odd-score=log(Ti/Ci)plotted
7/30/2019 Shortran v.0.43 Manual
25/27
25
Figure5d.norm_control=(Ti-Ci)/(Ti+Ci)plotted.
6.8 Module-8Thismodule is used to find sequence compositionbiases in the5 endof the
sequences.Di- and tri-nucleotideMarkov probabilities are calculated for eachnucleotide combination. In theprovidedsampledataset, forexample,anover-
representationofGAA starting triplets is observed. Sequenceswith particularmotifs are extracted andmapped back to the genome. If a bias is caused by
technical artifacts, the proportion of sequences with the motif in question
mappingtothereferencegenomemightbelowerthantheaverage.
7/30/2019 Shortran v.0.43 Manual
26/27
26
6.9 Module-9Usingthismodule,theMySQLtablecanbequeried.Complexqueriesconcerning
particular biological questions can be performed. The table below shows
examplesofqueries.
Purpose Syntax
Retrieve variation scores and miRBase annotations
for predicted miRNAs.
SELECT
`#Sequence`,
`miRBase.fa`,
`score`
FROM `2012-03-04_profile_19_24_sorted.cut-1_score_1`
WHERE
`miRNA_prediction` = "predicted as miRNA";
Retrieve average response values for sRNAs up-
regulated in the wild only for sRNAs mapping to
miRNAs annotated in miRBase.
SELECT
COUNT(score) AS n,
AVG(LOG(wild_type_norm/Mock_wild_type_norm)) AS wt,
AVG(LOG(NFR1_norm/Mock_NFR1_norm)) AS nfr1
FROM `2012-03-04_profile_19_24_sorted.cut-1_score_1`
WHERE
LOG(wild_type_norm/Mock_wild_type_norm) > 0
AND `miRBase.fa` !="0";
Retrieve expression information for sRNAs
mapping to miRNAs annotated in miRBase and
sort in descending order by the wild type
expression response.
SELECT
`#Sequence`AS sequence,
`read_size` AS length,
LOG(wild_type_norm/Mock_wild_type_norm) AS wt,
LOG(NFR1_norm/Mock_NFR1_norm) AS nfr1,
`miRBase.fa`
FROM `2012-03-04_profile_19_24_sorted.cut-1_score_1`
WHERE
`miRBase.fa` != 0
ORDER BY wt DESC;
7/30/2019 Shortran v.0.43 Manual
27/27
6.10Computationsummary
Module Process Time (s) CPU % Memory %
1.1 Adapter filtering/trimming 15.4 0.4 0.3
1.1 Error correction 5827.5 73.4 0.91.1 Size splitting 7.1 49.4 0.4
1.1 Homology filtering 116.0 113.5 0.5
1.2 Making profile tables 73.5 54.1 1.3
1.3 Adding data to MySQL database 1.9 37.9 0.2
2 Genomic mapping 129.0 131.9 5.4
3.1 Cluster prediction only 9.6 98.5 1.6
3.1 - including sequence extraction 1393.4 149.3 0.8
3.2 Cluster genomic region annotation 26.7 97.8 0.4
4.1 miRNA predictions-mirDeep2 2476.3 100 40.4
4.1 miRNA predictions-mirDeepP 11875.1 1.8 3.94.2 miRBase mapping 5.0 3.3 0.8
5 ta-siRNA prediction 79.7 98.9 0.9
6 Homology annotation - mapping 6.5 16.7 0.2
6 Homology annotation - MySQL 1.1 0.1 0.1
7 Expression profile plotting 107.8 97.6 1.1
8.1 Find pattern 3.7 100 0.2
8.2 Map pattern 155.2 99.8 5.9
9 MySQL querying 0.0 0.9 0.2
Thetotalnumberofreadsprocessedwas627147.Datawasanalyzedona
MacintoshiBookwith4GBmemoryandafourcore2.7GHzInteli7processor.