Upload
edward
View
17
Download
0
Embed Size (px)
DESCRIPTION
Assembly group Presentation II. Robert Arthur Kevin Lee Xing Liu Pushkar Pande Gena Tang Racchit Thapliyal Tianjun Ye. Two problematic libraries. M19107 & M21639 were resequenced due to low coverage M19107 still had low coverage after re- seq M21639 coverage increased to ~80X. - PowerPoint PPT Presentation
Citation preview
Assembly group Presentation II
Robert ArthurKevin Lee
Xing LiuPushkar Pande
Gena TangRacchit Thapliyal
Tianjun Ye
Two problematic libraries
M19107 & M21639 were resequenced due to low coverage
M19107 still had low coverage after re-seq M21639 coverage increased to ~80X
De novo assembly – exponential dec.
0 10 20 30 40 50 60 70 80 900
200
400
600
800
1000
1200
1400
Effect of coverage depth on contig number
De novoReference2 sample de novo
Coverage
Tota
l C
onti
g C
ount
M21639 still bad
0 10 20 30 40 50 60 70 80 900
200
400
600
800
1000
1200
1400
Effect of coverage depth on contig number
De novoReference2 sample de novo
Coverage
Tota
l C
onti
g C
ount
Mixed culture?
Is this a mixed culture? What does a mixed culture assembly look
like? newProject, addRun, addRun, runProject … M19501 & M21127 M21127 & M21621
Mixed culture
0 10 20 30 40 50 60 70 80 900
200
400
600
800
1000
1200
1400
Effect of coverage depth on contig number
De novoReference2 sample de novo
Coverage
Tota
l C
onti
g C
ount
Possible explanations for poor assembly
About 20% larger genome◦ Recent plasmid or other large genome insertion
Also lost hemolytic ability and H2S production
De novo assembly – 36 is the best
0 10 20 30 40 50 60 70 80 900
200
400
600
800
1000
1200
1400
Effect of coverage depth on contig number
De novoReference2 sample de novo
Coverage
Tota
l C
onti
g C
ount
De novo assembly
Limited by our data, 36 contigs is lower limit◦ Number of repeat elements◦ rRNA
20X coverage is sufficient
General assembly stats
GenomeNewbler de novo
Newbler reference
Mira3 de novo
AMOScmp reference
minimus2 - newbler/mira
minimus2 - newbler/AMOScmp
M19107.sff 217 1260 208 471 123 109
M19501.sff 75 988 181 521 22 27
M21127.sff 59 1013 89 538 38 28
M21621.sff 50 986 67 515 28 23
M21639.sff 175 1272 175 573 54 69
M21709.sff 52 313 83 187 37 32
M19107_1.sff 1336 1361
M19107_2.sff 450 1006
M21639_1.sff 266 1165
M21639_2.sff 147 1282
Assembler Evaluation Strategy
Single assembler evaluation Minimus2 assembler evaluation Feedback by gene prediction group
Single Assembler Evaluation (“Hard Measurement”)
Single assembler evaluation
“Soft” measurement: satisfy gene prediction group's requirement.
RNA prediction group requires a file which can trace back the depth of the reads. For this, we use the .tsv file in the Newbler output.
• 454AlignmentInfo.tsv (-infoall/-info/-noinfo) base consensus, quality, depth and flow-signal, at each position in each contig. A very useful file.
• eg:
Position Consensus Quality Unique Align Signal Signal Score Depth Depth StdDev (incl. duplicates)
>contig000081 G 64 26 32 0.98 0.052 A 64 27 33 0.94 0.133 T 64 27 33 1.97 0.144 T 64 27 33 1.97 0.145 G 64 27 33 0.97 0.06...etc...
Newbler Output
Single Assembler Evaluation
For “hard” measurement: we focused mainly on “Total Contigs” “N50 Contigs Bases”, “Total Big Contigs”, “Big Contigs Percent Bases”, “Big Contig Reads”, “Singleton Reads”.
For “soft” measurement: we focused on the trace back of depth for each base pair.
Final Rank of Single Assembler
Combine the “hard” and “soft” measurement manually. We get as a result:
1. Newbler De Novo 2. Mira3 3. Amos Eliminated : Newbler reference & velvet
Minimus2 Evaluation
Since our top choice is Newbler, we want to include Newbler’s results in the merged contigs. Thus, we analysed the statistics of:
1. Newbler merged with Amos 2. Newbler merged with Mira Visulization tool: hawkeye
Minimus2 Evaluation
Minimus2 Evaluation
Gene prediction results feedback
Single assembler evaluation (Sample Data: contigs)
Single assembler evaluation (Sample Data: Big contigs)
Final Rankings of Single Assembler
Combined the “hard” and “soft” measurement manually and we got:
1. Newbler De Novo 2. Mira3 3. Amos Eliminated : Newbler reference & velvet
Minimus2 Evaluation (sample data: contigs)
Minimus2 Evaluation (sample data: reads)
Minimus2 Evaluation
Using the same strategy as above, our rankings are:
1. Newbler merged with Mira; 2. Newbler merged with Amos;
Final Recommendation
1. Merged Contigs: (1) Newbler merged with Mira (2) Newbler merged with Amos 2. Single Assembler: (1) Newbler (2) Mira (3) Amos
Feedback Strategy
Gene Prediction Group may use predicted genes and RNAs to evaluate our assembly results.
Minimus2 Overview
Minimus2 is a modified minimus pipeline
It is designed to merge one or two sequence sets hereafter referred to as S1 or S2
Uses Nucmer based Overlap Detector instead of the Smith-Waterman hash overlap Minimus1 uses (much faster)
Minimus2 Usage
minimus2 prefix \ -D REFCOUNT=n \ # Number of sequences is the first
set -D OVERLAP=n \ # Minimum overlap (Default 40bp) -D CONSERR=f \ # Maximum consensus error (0..1)
(Def 0.06) -D MINID=n \ # Minimum overlap %id for align. (Def 94) -D MAXTRIM=n # Maximum sequence trimming length
(Def 20bp)
◦ Prefix refers to an .AFG filename
Minimus2 Usage, Cont’d.
REFCOUNT should be set to number of sequences in the first set
“all vs all” alignment is ran by default and sets REFCOUNT to 0 unless user-specified
S1 & S2 should be merged and converted to AMOS format using toAmos command
toAmos –s S1-S2.seq –o S1-S2.afg
Minimus2 Usage, Cont’d.
After you merge the data, you actually run minimus on it
Minimus2 S1-S2 –D REFCOUNT=## Input
S1-S2.afg Output
S1-S2.fasta (contig) S1-S2.singletons.seq (single)
Nucmer Algorithm
A modification of the MUMmer package matching algorithm
Operates via building and then searching a suffix tree data structure
This is a significant upgrade from the minimus1 approach as searching using suffix trees is O(n) and minimus1’s method is O(n2)
Linear time versus polynomial time
MUMmer link : http://mummer.sourceforge.net/
Nucmer Algorithm The Nucmer strategy uses approximately 17 bytes
of memory for each basepair in the reference sequence
The query supplied by the user is streamed past the reference suffix tree so that the memory requirements do not depend on the size of the query sequence
In English: Bigger query does not mean order of magnitude longer operating time
Unique algorithm that can be found and analyzed as it is open source on Sourceforge
MIRA
Based on OLC approach Strategies:
◦ Preprocessing: high confidence region (HCR) ◦ Use a quick heuristic algorithm to alignment the HCR of reads◦ Overlaps are reviewed with Smith-Waterman alignment algorithm ◦ Contigs can be be optionally analysed and corrected by an incorporated
version of an automatic editor◦ Repeats are resolved by searching for typical mis-assembly patterns◦ Optional pre-assembly read extension step: the assembler can try to
extend HCRs of reads by analysing the overlap pairs from the previous alignments.
MIRA outputs
d_results: this directory contains all the output files of the assembly in different formats.
d_info: this directory contains information files of the final assembly.
d_log: this directory contains log files and temporary assembly files.
d_chkpt: this directory contains checkpoint files needed to resume assemblies that crashed or were stopped (not implemented yet, but soon)
d_results
out.padded.fasta: this file contains as FASTA sequence the consensus of the contigs that were assembled in the process. Positions in the consensus containing gaps (also called 'pads', denoted by an asterisk) are still present.
out.unpadded.fasta: this file contains as FASTA sequence the consensus of the contigs that were assembled in the process, put positions in the consensus containing gaps were removed.
qual files Outputs with other formats
◦ caf, ace, gap4d
d_info
info_assembly.txt: some statistics as well as whether or not problematic areas remain in the result.
info_callparameters.txt: This file contains the parameters as given on the mira command line when the assembly was started.
info_contigstats.txt: This file contains in tabular format statistics info_contigreadlist.txt: This file contains information which reads have
been assembled into which contigs. info_readstooshort: A list containing the names of those reads that have
been sorted out of the assembly only due to the fact that they were too short, before any processing started.
error_reads_invalid: A list of sequences that have been found to be invalid due to various reasons (given in the output of the assembler).
Newbler
Another OLC assembler◦ Starts with `indexing`
Scans the .sff file, trims the reads, Performs some checks for possible 3’ and 5’ primers.
◦ Finds overlaps between reads Splits the phase between long reads and short reads. Alignments proceed using seed and extend.
◦ Simplifies overlap graph and generates consensus contigs. Uses the quality information for base calling.
Newbler: Metrics file
454NewblerMetrics.txt◦ runData
Total number of reads and bases in the file, also the number of reads and bases after trimming.
◦ runMetrics Number of searches, seeds and overlaps during the
alignment phase of assembly◦ readAlignmentResults
Number of reads and bases aligned to other reads, Inferred read error – No. of errors, mainly indels, between
the contigs and the reads.◦ consensusDistribution
This section deals with base calling of the consensus contigs.
◦ consensusResults A summary of the read alignments and assembly statistics
Newbler: Contigs 454AllContigs.fna (>100 bp default) 454LargeContigs.fna (>500 bp default)
◦ The fasta header: Gives the the unique contig number, its length in bp and the number of
reads in the alignment used to build this contig
◦ lower case bases correspond to quality values below 40. 454Contigs.ace
◦ All contigs in .ace format. Useful for visualization using for e.g. Eagleview
>contig00001 length=381 numreads=158
>contig00002 length=144 numreads=560GGGAGAACTCATCTCTTGGCAAGTTTCGTGCTTAGATGCTTTCAGCACTTATCTCTTCCGCACTTAGCTACCCGGCAATGCGTCTGGCGACACAACCGGAACACCAGTGaTGCGTCCACTCCGGTCCTCTCGTACTAGGAGCAG
>contig00002 length=144 numreads=56064 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 6464 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 20 64 64 64 64 64 64 64 64 64 6464 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
Newbler: The status files 454TrimStatus.txt: This file describes what
(trimmed) part of the read was considered for alignment. ◦ read Id, trim points used, used trimmed length, raw length
454ReadStatus.txt: This file describes where reads ended up after assembly was complete.◦ Id, Status (assembled, partially assembled, singleton,
outlier, too short).
454AlignmentInfo.tsv: This file gives a consensus alignment overview for each position in each contig.◦ Position, consensus, quality score, depth, signal, std.
deviation
Automating assembly
Motivation◦ Troublesome installation, number of dependencies◦ Difficult to remember command line parameters
Automation◦ install.sh : A script to install assemblers and their
dependencies◦ assembler.sh: Script to run assemblers with default
arguments.
Future work
Continue to dialog with G.P. to determine assembly of choice.
Metrics:◦ tRNA count◦ rRNA count◦ Protein coding regions
Questions?