62
14/11/2017 Assemblage de-novo de transcriptome Trinity Ecole ITMO 2017 ABiMS – Station Biologique Roscoff

Assemblage de-novo de transcriptome

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Assemblage de-novo de transcriptome

14/11/2017

Assemblagede-novodetranscriptome

TrinityEcoleITMO2017

ABiMS – StationBiologiqueRoscoff

Page 2: Assemblage de-novo de transcriptome

INTRODUCTION

Page 3: Assemblage de-novo de transcriptome

WGSSequencing

Assemble

DraftGenomeScaffolds

SNPs

Methylation

ProteinsTx-factor

bindingsites

AParadigmforGenomicResearch

Page 4: Assemblage de-novo de transcriptome

AParadigmforGenomicResearch

WGSSequencing

Assemble

DraftGenomeScaffolds

Expression

Transcripts

SNPs

Methylation

ProteinsTx-factor

bindingsites

Align

Page 5: Assemblage de-novo de transcriptome

AMaturing ParadigmforTranscriptomeResearch

WGSSequencing

Assemble

DraftGenomeScaffolds

Methylation

Tx-factorbindingsites

Align

Page 6: Assemblage de-novo de transcriptome

AMaturing ParadigmforTranscriptomeResearch

WGSSequencing

Assemble

DraftGenomeScaffolds

Methylation

Tx-factorbindingsites

Align

$$$$$$$$$$$$$$$$$$$$

$

$+

Page 7: Assemblage de-novo de transcriptome

AMaturing ParadigmforTranscriptomeResearch

WGSSequencing

Assemble

DraftGenomeScaffolds

Methylation

Tx-factorbindingsites

Align

$$$$$$$$$$$$$$$$$$$$

$

$+

Page 8: Assemblage de-novo de transcriptome

RNASeq denovoanalysis workflow

Page 9: Assemblage de-novo de transcriptome

DataCleaning

• Unknownnucleotides• Badqualitynucleotides• Adaptorsandprimerssub-sequences• PolyA/Ttails• Lowcomplexitysequences• rRNAsequences• Contaminantsequences• Shortlengthsequences

Butalso:• Removingsingletons• In-siliconormalization• Sequencingerrorscorrection• …

Biasshouldbecorrectedinreverseorderoftheirgeneration1. Sequencingbiases(badquality,unknowns)2. Librarypreparation

AdaptorsandprimerssequencesPolyA/Ttails

3. Biologicalsample(lowcomplexity,rRNA,contaminants)

Page 10: Assemblage de-novo de transcriptome

Datacleaning

Input(fastqc)

FastQC

FastQC HTMLReport

Trimmomatic

FastQC

FastQC HTMLReport

SortmeRNA

FastQC

FastQC HTMLReport

Page 11: Assemblage de-novo de transcriptome

Trimmomatic command

java -jar trimmomatic.jar PE -phred33 \ lib1_1.fastq lib1_2.fastq \ lib1_1.P.qtrim lib1_1.U.qtrim \ lib1_2.P.qtrim lib1_2.U.qtrim \ ILLUMINACLIP:illumina.fa:2:30:10\ SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:25

Raw readsPairedandunpaired reads1Pairedandunpaired reads2Adapters

Input Read Pairs: 2 000 000 Both Surviving: 1 879 345 (93.97%) Forward Only Surviving: 94 153 (4.71%) Reverse Only Surviving: 18 098 (0.90%) Dropped: 8 404 (0.42%)

TrimmomaticPE: Completed successfully

Page 12: Assemblage de-novo de transcriptome

FastQC:PersequenceGCcontent

• Acontamination?

Page 13: Assemblage de-novo de transcriptome

FastQC:PersequenceGCcontent

• Acontamination?

Can this be fixed ? Maybe…

Page 14: Assemblage de-novo de transcriptome

FastQC:PersequenceGCcontent

Page 15: Assemblage de-novo de transcriptome

rRNA contamination

Priortosequencing :• Ribodepletion kits

After sequencing :• Remove rRNA reads from raw reads• Detect rRNA transcripts

Page 16: Assemblage de-novo de transcriptome

SortMeRNA

Page 17: Assemblage de-novo de transcriptome

SortMeRNA commands

> merge-paired-reads.sh read_1.fq read_2.fq read-interleaved.fq

>sortmerna --fastx -a 4 --log --paired_out -e 0.1 --id 0.97 --coverage 0.97 \--ref silva-bac-16s-id90.fasta,silva-bac-16s-id90:\silva-bac-23s-id98.fasta,silva-bac-23s-id98:\silva-euk-18s-id95.fasta,silva-euk-18s-id95:\silva-euk-28s-id98.fasta,silva-euk-28s-id98:\rfam-5s-database-id98.fasta,rfam-5s-database-id98:\rfam-5.8s-database-id98.fasta,rfam-5.8s-database-id98 --reads read-interleaved.fq --other output_mRNA.fastqfastq --aligned output_aligned.fastq

>unmerge-paired-reads.sh output_mRNA.fastq read-sortmerna_1.fq read-sortmerna_2.fq

ReferenceDB

Page 18: Assemblage de-novo de transcriptome

SortMeRNA results

Results: Total reads = 34 196 864 Total reads for de novo clustering = 4 084 914 Total reads passing E-value threshold = 30 122 173 (88.08%) Total reads failing E-value threshold = 4 074 691 (11.92%) Minimum read length = 150 Maximum read length = 150 Mean read length = 150 By database: silva-bac-16s-id90.fasta 6.95% silva-bac-23s-id98.fasta 18.75% silva-euk-18s-id95.fasta 9.97% silva-euk-28s-id98.fasta 52.42% rfam-5s-database-id98.fasta 0.00% rfam-5.8s-database-id98.fasta 0.00%

Total reads passing %id and %coverage thresholds = 26 037 259

Page 19: Assemblage de-novo de transcriptome

TRANSCRIPTOME ASSEMBLYSTRATEGIES

Page 20: Assemblage de-novo de transcriptome

DenovotranscriptassemblySplicedalignmentofRNA-Seq togenome

TranscriptreconstructionfromsplicedalignmentofassembledtranscriptstogenomeTranscriptreconstruction

fromRNA-Seq splicedalignments

Genome

Genome

TophatSTARHISAT2

CufflinksStringtieIsoLassoBayesemblerTripTraphCEMTransComb

+

RNA-Seq reads

ContemporarystrategiesfortranscriptreconstructionfromRNA-Seq

Gmap

OasesSoapDenovoTransAbyssTransIDBA-TranShannonBinPackerBridger

Page 21: Assemblage de-novo de transcriptome

- Compressdata(inchworm):- Cutreadsintok-mers (kconsecutivenucleotides)- Overlapandextend(greedy)- Reportallsequences(“contigs”)

- BuilddeBruijn graph(chrysalis):- Collectallcontigs thatsharek-1-mers- Buildgraph(disjoint“components”)- Mapreadstocomponents

- Enumerateallconsistentpossibilities(butterfly):- Unwrapgraphintolinearsequences- Usereadsandpairstoeliminatefalsesequences- Usedynamicprogrammingtolimitcomputetime(SNPs!!)

Trinityworkflow

Page 22: Assemblage de-novo de transcriptome

RNA-Seqreads

Linearcontigs

de-Bruijngraphs

Transcripts+

Isoforms

Thousands of disjoint graphs

Trinity– Howitworks:

Page 23: Assemblage de-novo de transcriptome

InchwormContigs fromAlt-SplicedTranscripts

Page 24: Assemblage de-novo de transcriptome

+

Inchwormcanonlyreportcontigs derivedfromuniquekmers.

Alternativelysplicedtranscripts:- themorehighlyexpressedtranscriptmaybereportedasasinglecontig,- thepartsthataredifferentinthealternativeisoformarereportedseparately.

InchwormContigs fromAlt-SplicedTranscripts

Page 25: Assemblage de-novo de transcriptome

BuilddeBruijn Graphs(ideally,onepergene)BuilddeBruijn Graphs(ideally,onepergene)

readpairinginformationtoincludeminimallyoverlappingcontigs

Verifyvia“welds”

Integrate(clustering)Isoformsviak-1overlaps

Chrysalis

Page 26: Assemblage de-novo de transcriptome

Butterfly

Page 27: Assemblage de-novo de transcriptome

Trinityusageandoptions

Typical TrinitycommandTrinity --seqType fq --max_memory 50G --leftA_rep1_left.fq --right A_rep1_right.fq --CPU 4

Runningatypical Trinityjobrequires ~1hour and~1GRAMper~1millionPEreads.

Theassembled transcripts will be found at'trinity_out_dir/Trinity.fasta'.

Trinity --seqType fq --max_memory 50G --single single.fq --CPU 4

Page 28: Assemblage de-novo de transcriptome

Results

>TRINITY_DN810_c0_g1_i2 len=226 path=[407:0-225] [-1, 407, -2] GATGATATCAACAATGAGACTTGTGAACCAGGTGAAGAAAACTCTTTCTTTGTATGCGAC CTAGGTGAAATTGAAAGATTGTACGCTAACTGGTGGAAAGAACTACCAAGAGTTCAGCCA TTTTACGCTGTCAAGTGTAACCCAGATTTGAAGATAATAAGAAAATTGGCTGACCTCGGA

TRINITY_DNW|cX_gY_iZ (until release2.0cX_gY_iZ previously compX_cY_seqZ

TRINITY_DNW|cX definesthegraphicalcomponentgeneratedbyChrysalis(fromclusteringinchwormcontigs).Butterflymightteasesubgraphsapartfromeachotherwithinasinglecomponent,basedonthereadsupportdata.Thisgivesrisetosubgraphs (gY).:trinity genesEachsubgraphthengivesrisetopathsequences (iZ).:trinity isoforms(path) list ofvertices inthecompacted graphthat represent thefinaltranscript sequence andtherangewithin thegiven assembledsequence that those nodes corresond to.

Result:linearsequencesgroupedincomponents,contigs andsequences

>TRINITY_DN889_c0_g1_i1 len=259 path=[473:0-258] [-1, 473, -2] GAACAATGTCTACACTGTCTTCAACTTGGATGACAAGGAACTTTCATTGGCTCAAGCTAA CTACAATTCATCTCTGAAACCAGATATTGAAGAAATCAAGGATACTGTCCCTAGCGCTGT GCTGGCTCCACAATACTACAACACATTCTCAGCTGACCCAACTGCCACTGCAGTCACTGG TAACATCTTTGCACCAGAGGCCACTATGTCCATGGCTGCTCCAGCTAATGCTTCTAGAAA CTCTTCATTAAACTCTCCT

Page 29: Assemblage de-novo de transcriptome

Trinitystatistics

TRINITY_HOME/util/TrinityStats.pl Trinity.fasta

################################ ## Counts of transcripts, etc.################################ Total trinity 'genes': 7648 Total trinity transcripts: 7719 Percent GC: 38.88 ######################################## Stats based on ALL transcript contigs: ######################################## Contig N10: 4318Contig N20: 3395 Contig N30: 2863 Contig N40: 2466 Contig N50: 2065 Median contig length: 1038 Average contig: 1354.26 Total assembled bases: 10453524

##################################################### ## Stats based on ONLY LONGEST ISOFORM per 'GENE': ##################################################### Contig N10: 4317 Contig N20: 3375 Contig N30: 2850 Contig N40: 2458 Contig N50: 2060 Median contig length: 1044 Average contig: 1354.49 Total assembled bases: 10359175

Page 30: Assemblage de-novo de transcriptome

Trinityusageandoptions

Typical Trinitycommandwith multiplesamplesTrinity --seqType fq --max_memory 50G --CPU 4\--left A_rep1_left.fq,A_rep2_left.fq \--right A_rep1_right.fq,A_rep2_right.fq

sample.txtcond_A cond_A_rep1 A_rep1_left.fq A_rep1_right.fqcond_A cond_A_rep2 A_rep2_left.fq A_rep2_right.fqcond_A cond_A_rep3 A_rep3_left.fq A_rep3_right.fqcond_B cond_B_rep1 B_rep1_left.fq B_rep1_right.fqcond_B cond_B_rep2 B_rep2_left.fq B_rep2_right.fqcond_B cond_B_rep3 B_rep3_left.fq B_rep3_right.fq

Trinity --seqType fq --max_memory 50G --CPU 4\--samples_file sample.txt

Page 31: Assemblage de-novo de transcriptome

Trinity« genome guided »

Genome guided TrinitycommandTrinity --genome_guided_bam rnaseq_alignments.csorted.bam --max_memory 50G --genome_guided_max_intron 10000 --CPU 6

Ifyour RNA-Seq sample differs sufficiently from your reference genome andyou'd like tocapturevariations within your assembled transcriptsDenovoassembly is restricted toonly those reads that map tothegenome.

Theadvantage is that reads that share sequence incommon butmap todistinctpartsofthegenome will be targeted separately forassembly.

Thedisadvantage is that reads that donotmap tothegenome will notbe incorporatedinto theassembly.->Unmapped reads can,however,be targeted foraseparate genome-freedenovoassembly.

Theassembled transcripts will be found at'trinity_out_dir/Trinity-GG.fasta'.

Page 32: Assemblage de-novo de transcriptome

Trinity« longreads »

Trinity --seqType fq --max_memory 50G --CPU 4\--samples_file sample.txt --long_reads contigs.fasta

Inshort,theTrinityv2.4.0versionusesthepacbio reads mostly forpath tracing inagraphthat's built based ontheilluminareads (notbuild using illuminaandpacbio).

contigs.fasta: fasta filecontaining error-corrected orcircular consensus(CCS)PacBio reads

Page 33: Assemblage de-novo de transcriptome

Trinityincluding trimming andnormalisation

Trinity --seqType fq --max_memory 50G --CPU 4--samples_file sample.txt --trimmomatic--quality_trimming_params "ILLUMINACLIP:illumina.fa:2:30:10 SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:25"

• Normalisation:- BydefinitionRNAseq displayawiderangeofexpressions

Verylowexpressedà Veryhighlyexpressedtranscripts

- Theinformationgivenbyreadsfromhighexpressiontranscriptsisredundant,andveryhighcoveragealsobringsmoresequencingerrors

- De-novoassemblersdonotbenefitfromcoverageincreasebeyondacertainpoint(>200millionsreads),andfewerdatameansquickerassemblies

èHowtodecreasecoverageofhighlyexpressedtranscriptswithoutdecreasingthatoflowexpressedtranscripts?

• Trimming

Page 34: Assemblage de-novo de transcriptome

Insilico normalizationofreads

High

Moderate

Low

Page 35: Assemblage de-novo de transcriptome

NGSreadsnormalization(byTrinity)

1. Countkmers inallthedata(Jellyfish):• withk=25

2. Foreachread,computethemedian,average andstdev kmers coverage

3. Acceptareadwithaprobabilityof:max coverage/median

Page 36: Assemblage de-novo de transcriptome

NGSreadsnormalization(byTrinity)

3. Acceptareadwithaprobabilityof:

e.g. with 𝑚𝑎𝑥𝑐𝑜𝑣𝑒𝑟𝑎𝑔𝑒 = 30

Read_A: 𝑚𝑒𝑑𝑖𝑎𝑛𝑐𝑜𝑣𝑒𝑟𝑎𝑔𝑒 = 60à 234_6789:3;929<=3>

= 0.5

è Read_A has a 50% chance of being kept

Read_B: 𝑚𝑒𝑑𝑖𝑎𝑛𝑐𝑜𝑣𝑒𝑟𝑎𝑔𝑒 = 10à 234_6789:3;929<=3>

= 3

è Read_B has a 300% chance of being kept ;-)è Read_B will be kept

Page 37: Assemblage de-novo de transcriptome

NGSreadsnormalization(byTrinity)

3. Acceptareadwithaprobabilityof:

Readscoming from ahighly expressed transcript andareseveraltimesmorecovered than thethreshold.

è Its informationis also contained byother reads.è Soit hasless chancetobe kept.

Reads coming from alow expressed transcript,way below thethreshold.

è Its informationis notvery redondant,need it fortheassembly.

è Soitwillabsolutly bekept

Page 38: Assemblage de-novo de transcriptome

NGSreadsnormalization(byTrinity)

1. Countkmers inallthedata(Jellyfish):• withk=25

2. Foreachread,computethemedian,average andstdev kmers coverage

3. Acceptareadwithaprobabilityof:maxcov/median

4. Removeareadif:standartdev/average (CV)>1(100%)

Ahighvariabilityinareadkmer coveragemeansthereisprobablyalotofsequencingerrorsinthisread

Page 39: Assemblage de-novo de transcriptome

Standalone normalisation

$TRINITY_HOME/util/insilico_read_normalization.pl\ --seqType fq --JM 1G --max_cov 50 \ --left lib1_1.P.qtrim --right lib2_2.P.qtrim \ --pairs_together --output insil_norm_ex

1189570 / 1879312 = 63.30% reads selected during normalization. 1094 / 1879312 = 0.06% reads discarded as likely aberrant based on coverage profiles.

Normalization complete. See outputs: insil_norm_ex/lib1_1.P.qtrim.normalized_K25_C50_pctSD200.fq insil_norm_ex/lib1_2.P.qtrim.normalized_K25_C50_pctSD200.fq

Page 40: Assemblage de-novo de transcriptome

Trinitynormalisation

Trinity --seqType fq --max_memory 50G --CPU 4--samples_file sample.txt --trimmomatic--quality_trimming_params "ILLUMINACLIP:illumina.fa:2:30:10 SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:25 --normalize_by_read_set

Page 41: Assemblage de-novo de transcriptome

RNASeqanalysis

Page 42: Assemblage de-novo de transcriptome

ASSEMBLYQUALITYASSESSMENTANDCLEANNING

Transcriptomeassembly

Page 43: Assemblage de-novo de transcriptome

Assembly quality assessment

• Assemblymetrics

• Readsmappingbackrate

Page 44: Assemblage de-novo de transcriptome

Metrics

• Thenumber ofcontigsintheassembly

• Thesizeofthesmallest contig

• Thesizeofthelargest contig

• Thenumber ofbasesincluded inthe

assembly

• Themean length ofthecontigs

• Thenumber ofcontigs<200bases

• Thenumber ofcontigs>1,000bases

• Thenumber ofcontigs>10,000bases

• Thenumber ofcontigsthat had anopen

reading frame

• Themean %ofthecontigcovered bythe

ORF

• NX(e.G.N50):thelargest contigsizeat

which atleastX%ofbasesarecontained in

contigsatleastthis length

• %Ofbasesthat areGorC

• GCskew

• ATskew

• Thenumber ofbasesthat areN

• Theproportionofbasesthat areN

• Thetotallinguistic complexity ofthe

assembly

Page 45: Assemblage de-novo de transcriptome

DenovoTranscriptome AssemblyisPronetoCertainTypesofErrors

Smith-Unnaetal.GenomeResearch,2016

Page 46: Assemblage de-novo de transcriptome

Realignment metrics

Theassembly is asum-up.Therealignment rategives howmuch oftheinitialinformationisinside thecontigs.

• align reads against assembly generated transcripts• compute percentage ofreads mapped

Page 47: Assemblage de-novo de transcriptome

Realignment metrics

Factors affecting realignment rate:

• Presence ofhighly expressed genes• Contaminationbybuildingblocks(adaptors)• Reads quality

Atypical‘good’assemblyhas~80%readsmappingtotheassemblyand~80%areproperlypaired.

Properpairs

Givenreadpair: PossiblemappingcontextsintheTrinityassemblyarereported:

Improperpairs Leftonly Rightonly

Page 48: Assemblage de-novo de transcriptome

Assembly evaluation :read remapping

$TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq--transcripts Trinity.fasta --est_method RSEM --aln_method bowtie2 --prep_reference --trinity_mode --samples_file samples.txt --seqType fq

$TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq--transcripts Trinity.fasta --est_method kallisto --prep_reference--trinity_mode --samples_file samples.txt --seqType fq

$TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq--transcripts Trinity.fasta --est_method salmon --prep_reference --trinity_mode --samples_file samples.txt --seqType fq

Alignment methods :bowtie2-RSEM

Pseudo-Alignment methods :kallisto

Pseudo-Alignment methods :salmon

Page 49: Assemblage de-novo de transcriptome

Realignment metrics

70%

72%

74%

76%

78%

80%

82%

84%

86%

88%

90%

salmon kallisto bowtie2_RSEM

Page 50: Assemblage de-novo de transcriptome

Assembly evaluation :read remapping

head cond_A_rep1/abundance.tsv | column –tOr head cond_A_rep1/abundance.tsv.genes | column –t

Pseudo-Alignment methods :kallisto (salmon :quant.sf ;quant.sf.genes)

target_id length eff_length est_counts tpmTRINITY_DN144_c0_g1_i1 4833 4703.42 138 16.266 TRINITY_DN144_c0_g2_i1 2228 2098.42 0.000103136 2.72479e-05 TRINITY_DN179_c0_g1_i1 1524 1394.42 227 90.2502 TRINITY_DN159_c0_g1_i1 659 529.534 7.75713 8.12123 TRINITY_DN159_c0_g2_i1 247 119.949 0.24287 1.12251 TRINITY_DN153_c0_g1_i1 2378 2248.42 16 3.9451 TRINITY_DN130_c0_g1_i1 215 89.2898 776 4818.09 TRINITY_DN130_c1_g1_i1 295 166.986 216 717.115 TRINITY_DN106_c0_g1_i1 4442 4312.42 390 50.137

target_id length eff_length est_counts tpmTRINITY_DN2774_c0_g1 2926.00 2796.42 31.00 6.15 TRINITY_DN5482_c0_g1 3064.00 2934.42 344.00 64.99 TRINITY_DN6803_c0_g1 1439.00 1309.42 1379.00 583.85 TRINITY_DN386_c0_g2 4279.00 4149.42 3.23 0.43 TRINITY_DN23_c0_g2 632.00 502.53 9.99 11.02 TRINITY_DN5348_c0_g1 2091.00 1961.42 264.00 74.62 TRINITY_DN5222_c0_g1 2416.00 2286.42 148.00 35.89 TRINITY_DN4680_c0_g1 1420.00 1290.42 167.00 71.75 TRINITY_DN2900_c0_g1 283.00 155.12 1.00 3.57

Page 51: Assemblage de-novo de transcriptome

Expressionmatrixconstruction

$TRINITY_HOME/util/abundance_estimates_to_matrix.pl\ --est_method kallisto --out_prefix Trinity_trans\ --name_sample_by_basedir\ cond_A_rep1/abundance.tsv\ cond_A_rep2/abundance.tsv\ cond_B_rep1/abundance.tsv\ cond_B_rep2/abundance.tsv

Two matrices,- onecontaining theestimated counts,- onecontaining theTPMexpressionvaluesthat arecross-sample normalized using the

TMMmethod.

TMMnormalization assumesthat most transcripts arenotdifferentially expressed,andlinearly scales theexpressionvaluesofsamples tobetter enforce this property.

AscalingnormalizationmethodfordifferentialexpressionanalysisofRNA-Seqdata,RobinsonandOshlack,GenomeBiology2010.

Page 52: Assemblage de-novo de transcriptome

Expression

Often,mostassembledtranscriptsare*very*lowlyexpressed(Howmany‘transcripts&genes’aretherereally?)

20ktranscripts

Cumulative#of

Transcripts

1.4millionTrinitytranscriptcontigsN50~500bases

*Salamandertranscriptome

-1*minimumTPM

AlternativetoN50?

Page 53: Assemblage de-novo de transcriptome

• Sortcontigs byexpressionvalue,descendingly.• ComputeN50givenminimum%totalexpressiondatathresholds=>ExN50

N50=3457,and

24Ktranscripts

ComputeN50BasedontheTop-mostHighlyExpressedTranscripts(ExN50)

90%ofexpressiondata

AlternativetoN50:ExN50– E90N50

#E min_expr E-N50 num_transcriptsE2 89129.251 2397 1E3 89129.251 2397 2E5 66030.692 2397 3E6 66030.692 2397 4E8 66030.692 2397 5... ....... ...... ....E86 9.187 3056 12309E87 7.044 3149 14261E88 6.136 3261 16646E89 4.538 3351 19635E90 3.939 3457 23471E91 3.077 3560 28583E92 2.208 3655 35832E93 1.287 3706 47061... ....... ...... ....E97 0.235 2683 275376E98 0.164 2163 428285E99 0.128 1512 668589E100 0 606 1554055

$TRINITY_HOME/util/misc/contig_ExN50_statistic.pl \Trinity_trans.TMM.EXPR.matrix Trinity.fasta > ExN50.stats

Page 54: Assemblage de-novo de transcriptome

NoteshiftinExN50profilesasyouassemblemoreandmorereads.

*Candidatranscriptome

ThousandsofReads

MillionsofReads

ExN50ProfilesforDifferent TrinityAssembliesUsing Different ReadDepths

Page 55: Assemblage de-novo de transcriptome

Toolstoevaluate transcriptomes

Since areference genome is notavailable,thequality ofcomputer-assembled contigsmay be verified :

- bycomparing theassembled sequences tothereads used togenerate them (reference-free)

- byaligning thesequences ofconserved gene domains foundinmRNA transcripts totranscriptomes orgenomes ofcloselyrelated species (reference-based).

Page 56: Assemblage de-novo de transcriptome

Toolstoevaluate transcriptomes

Transrate:understand your transcriptome assembly.http://hibberdlab.com/transrate

Transrate analysesatranscriptome assembly inthree keyways:• byinspecting thecontigsequences• bymapping reads tothecontigsandinspecting thealignments• byaligning thecontigsagainst proteins ortranscripts from arelated species and

inspecting thealignments• Assemblies score• Contigsscore• Optimised assemblies score(filter outbad contigsfrom anassembly,leaving

you with only thewell-assembled ones)

Page 57: Assemblage de-novo de transcriptome

BUSCOanalysis

CEGMA(http://korflab.ucdavis.edu/datasets/cegma/)

HMM:sfor248coreeukaryoticgenesalignedtoyourassemblytoassesscompletenessofgenespace“complete”: 70%aligned“partial”: 30%aligned

BUSCO(http://busco.ezlab.org/)AssessinggenomeassemblyandannotationcompletenesswithBenchmarkingUniversalSingle-CopyOrthologs

Page 58: Assemblage de-novo de transcriptome

BUSCOResults# BUSCO was run in mode: transcriptome EUKARYOTES

C:86.5%[S:48.2%,D:38.3%],F:7.6%,M:5.9%,n:303

262 Complete BUSCOs (C)146 Complete and single-copy BUSCOs (S)116 Complete and duplicated BUSCOs (D)23 Fragmented BUSCOs (F)18 Missing BUSCOs (M)303 Total BUSCO groups searched

# BUSCO was run in mode: transcriptome PLANT

C:13.9%[S:8.1%,D:5.8%],F:2.0%,M:84.1%,n:1440

200 Complete BUSCOs (C)117 Complete and single-copy BUSCOs (S)83 Complete and duplicated BUSCOs (D)29 Fragmented BUSCOs (F)1211 Missing BUSCOs (M)1440 Total BUSCO groups searched

Page 59: Assemblage de-novo de transcriptome

DBG:softwares

• Velvet/Oases– Velvet(Zerbino,Birney 2008)is asophisticated setofalgorithms that constructs deBruijn graphs,simplifiesthegraphs,andcorrectsthegraphsforerrors andrepeats.

– Oases (Schulzetal.2012)post-processes Velvetassemblies (minustherepeat correction)with different k-mer sizes.

• Trans-ABySS– Trans-ABySS (Robertsonetal.2010)takes multipleABySSassemblies (Simpsonetal.2009)

• CLCbioGenomics Workstation• SOAPdenovo-trans,• rnaSPADES

Page 60: Assemblage de-novo de transcriptome

Assemblers comparison

HuangX,ChenXG,Armbruster PA/Comparativeperformanceoftranscriptome assembly methods fornon-modelorganisms. BMCGenomics.2016Jul 27;17:523.doi:10.1186/s12864-016-2923-8.

Thisstudy compared fourtranscriptome assembly methods,- adenovoassembler(Trinity)- two transcriptome re-assembly strategies utilizing proteomic andgenomic resources from closelyrelated species (reference-based re-assembly andTransPS)- agenome-guided assembler(Cufflinks)

« However,our results emphasize theefficacyofdenovoassembly,which can be aseffectiveasgenome-guided assembly when thereference genome assembly is fragmented.

Ifagenome assembly andsufficientcomputational resources areavailable,it canbe beneficial tocombinedenovoandgenome-guided assemblies »

Page 61: Assemblage de-novo de transcriptome

• Qiong-YiZhaoetal.,Optimizing denovotranscriptomeassemblyfromshort-readRNA-Seqdata:acomparativestudy.BMCBioinformatics2011,12(Suppl14):S2

• Clarke,K.,Yang,Y.,Marsh,R.,Xie,L.,&Zhang,K.K.(2013).Comparativeanalysis ofdenovotranscriptome assembly.ScienceChinaLifeSciences,56(2),156–162.doi:10.1007/s11427-013-4444-x

• (Vijay etal.,2013)Challengesandstrategies intranscriptome assembly anddifferential gene expressionquantification.Acomprehensive insilicoassessment ofRNA-seq experiments.Molecular ecology.PMID:22998089

• (Haasetal.,2013)Denovotranscript sequence reconstructionfrom RNA-seq using theTrinityplatform forreferencegeneration andanalysis.Natureprotocols.PMID:23845962

• (Luetal.,2013)Comparativestudy ofdenovoassembly andgenome-guided assembly strategies fortranscriptomereconstructionbased onRNA-Seq.Sci ChinaLifeSci.

• Chen,G.,Yin,K.,Wang,C.,&Shi,T.(n.d.).Denovotranscriptome assembly ofRNA-Seq reads with different strategies.ScienceChinaLifeSciences,54(12),1129–1133.doi:10.1007/s11427-011-4256-9

• (Heetal.,2015)Optimalassembly strategies oftranscriptome related toploidies ofeukaryotic organisms.BMCgenomics.DOI:10.1186/s12864-014-1192-7

• S.B.Rana,F.J.Zadlock IV,Z.Zhang,W.R.Murphy,andC.S.Bentivegna,“Comparison ofDeNovoTranscriptomeAssemblers andk-mer Strategies Using theKillifish,Fundulus heteroclitus,”PLoS ONE,vol.11,no.4,p.e0153104,Apr.2016.

• (WangandGribskov,2016)Comprehensive evaluation ofdenovotranscriptome assembly programsandtheir effectsondifferential gene expressionanalysis.Bioinformatics.PMID:27694201

Assemblers comparison

Page 62: Assemblage de-novo de transcriptome

Newdenovotranscriptome assemblers

• IDBA-Tran (Pengetal.,Bioinf.,2014)• IDBA-MTP(Pengetal.,RECOMB2014)• SOAPdenovo-Trans(Xie etal.,Bioinf.,2014)• Fuetal.,ICCABS,2014• StringTie (Pertea etal.,Nat.Biotech.,2015)• Bermuda(Tangetal.,ACM,2015)• Bridger(Changetal.,Gen.Biol.2015)• BinPacker (Liuetal.PLOSCompBiol,2016)• FRAMA(BensMetal.,BMCGenomics2016)