Genome Assembly Final Results

Jeri Dilts, Suzanna Kim

Hema Nagrajan, Deepak Purushotham

Ambily Sivadas, Amit Rupani

Leo Wu

02-22-2012

Outline

- Pipeline for evaluation
- Quantitative evaluation
- Qualitative evaluation
- Choosing the BEST assembly
- Final results
- Demo

Pipeline for evaluation

Strategy – Key alterations

- Prinseq preprocessing: unnecessary, because the assemblers have built-in capabilities; use Prinseq for data statistics only
- Error correction: does not fit our methods
  - Coral is based on overlap-layout-consensus and works best with de Bruijn graph assemblers
  - Echo has never been tested on 454 data
- Final assemblers: Newbler, Mira, Celera, AMOScmp
- Discarded assemblers: Abyss, Velvet and Pcap454
- MAIA hybrid assembly: needs a phylogenetically close reference genome


Quantitative Evaluation

Metrics
- No. of contigs -> the fewer the better
- N50 -> the higher the better
- Assembly size -> the closer to the estimated genome size, the better

Quantitative Assembly Score
  Score = (N50 * Assembly size) / No. of Contigs
- The higher the score, the better!
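N50 is used here without a definition. As a minimal sketch, assuming the common definition (the contig length at which the length-sorted contigs first cover at least half of the assembly), it could be computed as below. The contig lengths are hypothetical and only illustrate the calculation.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly size."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(contig_lengths) / 2
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths, for illustration only.
example_lengths = [80, 70, 50, 40, 30, 20]
print(n50(example_lengths))
```

A few long contigs raise the N50 much more than many short ones, which is why it is reported alongside the contig count.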

M19107 - Evaluation

Runs                  # Contigs   N50     Total Size   Score
Newbler               199         16319   1753573      8.16
Mira                  201         19353   1790088      8.24
Celera                146         20609   1747621      8.39
Newbler_Mira          129         25914   1774129      8.55
Newbler_Celera        104         27207   1719874      8.65
Newbler_Mira_Celera   96          27478   1701316      8.69
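The Score column in the table above appears to be the base-10 logarithm of N50 * Assembly size / No. of Contigs. The logarithm is an inference from the tabulated numbers, not something the slides state. A short sketch that recomputes the column under that assumption:

```python
import math


def assembly_score(num_contigs, n50, assembly_size):
    """Quantitative assembly score; the log10 scaling is an assumption
    inferred from the tabulated values, not stated in the slides."""
    return math.log10(n50 * assembly_size / num_contigs)


# (number of contigs, N50, total size) taken from the M19107 table.
runs = {
    "Newbler": (199, 16319, 1753573),
    "Mira": (201, 19353, 1790088),
    "Celera": (146, 20609, 1747621),
    "Newbler_Mira": (129, 25914, 1774129),
    "Newbler_Celera": (104, 27207, 1719874),
    "Newbler_Mira_Celera": (96, 27478, 1701316),
}

for name, (contigs, n50_value, size) in runs.items():
    print(f"{name:22s}{assembly_score(contigs, n50_value, size):.2f}")
```

On a log scale, a tenfold change in the ratio moves the score by one unit, so the small differences between runs in the table correspond to sizeable changes in the underlying ratio.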

Qualitative Evaluation

Strategy
- Align the assembly contigs to the original reference genome and compute differences

Challenges
- No original reference genome for our data set

Approach
- Create simulated 454 read datasets from a completely sequenced genome

Tools used
- FlowSim
- 454Sim
- Art-454

FlowSim

- A simulation pipeline based on real data
- Lets you model each step of the pyrosequencing process

Utilities:
- Clonesim: simulates the shearing step
  Usage: clonesim -c count -l dist input.fasta
- Gelfilter: selects a certain range of clone lengths
  Usage: gelfilter min max
- Kitsim: attaches A and B adaptors
  Usage: kitsim -k key -a adapter input.fasta -o output.fasta
- Mutator: introduces random substitutions and indels in the sequences
  Usage: mutator -i indel_rate -s subst_rate input.fasta -o output.fasta
- Duplicator: generates artificial duplicates of many clones
  Usage: duplicator dup_prob
- Flowsim: simulates the actual pyrosequencing process
  Usage: flowsim -G generation input.fasta -o output.sff

Example:
  clonesim -c 400000 -l "Normal 350 95" input.fasta | gelfilter 25 600 | kitsim | mutator | duplicator 0.03 | flowsim -G Titanium -o output.sff

http://biohaskell.org/Applications/FlowSim
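To make the first two stages of the example concrete, here is a conceptual sketch of what shearing followed by size selection amounts to: draw fragment lengths from a normal distribution (mean 350, standard deviation 95, as in the example) and keep only fragments between 25 and 600 bases. This is not FlowSim's implementation; the function name, the random toy genome and the seed are all hypothetical.

```python
import random


def simulate_fragments(genome, count, mean_len, sd_len, min_len, max_len, seed=0):
    """Toy model of shearing plus gel filtering (not FlowSim's actual code)."""
    rng = random.Random(seed)
    fragments = []
    for _ in range(count):
        # Shearing: fragment length drawn from a normal distribution.
        length = int(round(rng.gauss(mean_len, sd_len)))
        # Gel filter: discard fragments outside the selected size range.
        if length < 1 or length < min_len or length > max_len or length > len(genome):
            continue
        start = rng.randrange(len(genome) - length + 1)
        fragments.append(genome[start:start + length])
    return fragments


# Hypothetical random "genome", for illustration only.
toy_rng = random.Random(1)
toy_genome = "".join(toy_rng.choice("ACGT") for _ in range(5000))

frags = simulate_fragments(toy_genome, 1000, 350, 95, 25, 600)
print(len(frags), min(map(len, frags)), max(map(len, frags)))
```

The later stages (adaptors, mutations, duplicates, flowgram simulation) would each transform this fragment list in turn, which is why the real tools compose naturally through pipes.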







454Sim

- 454 simulation at higher speed and accuracy
- USP: configurable statistical models
- Supports GS FLX, Titanium and GS 20

Utilities:
- fragsim: simulates shearing
  Usage: fragsim -c 1000000 -l 1000 genome.fasta > genome.fragments.fasta
- 454sim: simulates the sequencing step
  Usage: 454sim -o genome.sff genome.fragments.fasta

Example:
  fragsim -c 250000 -l 1000 genome.fasta | 454sim -g FLX -o genome.sff

ART-454

- Supports Illumina (Solexa), 454 and SOLiD read simulation
- Used for the 1000 Genomes Project

Usage:
- Single-end reads:
  Art_454 Input.fasta Output_prefix Fold_coverage
- Paired-end reads:
  Art_454 Input.fasta Output_prefix Fold_coverage Mean_Frag_Len Std_Deviation

Running pipeline on Simulated reads

Reference – Haemophilus influenzae F3047 (NC_014922)

Ran 454Sim, FlowSim and Art-454 to generate reads

Ran de novo assemblers - Newbler, Mira3 and Celera (CABOG)

Merged assemblies using Minimus2

Evaluate Assembly Accuracy (How?)

Assembly Accuracy

Challenges
- Alignment of contigs to the reference genome

Approach
- Local alignment (BLAST, bwa, bowtie)
- Whole genome alignment (Mauve, MUMmer)
- Align the assembly to the reference genome
- Compute nucleotide differences, gaps and rearranged segments

Mauve

- Uses positional homology genome alignment
  - Each site in the assembly maps to at most one site on the reference
  - Optimized contiguity
  - E.g. progressiveMauve
- Ordering of contigs: Mauve Contig Mover algorithm
- Compare to identify differences

[Figure: Mauve Genome Aligner view, after ordering of contigs]

Mauve Assembly Metrics

Basecalling accuracy
- Count and location of bases called wrongly
- Direction of miscalling, e.g. A->G
- Count and location of bases predicted to exist, but uncalled

Genome content accuracy
- Count and location of bases missing from the assembly
- Count and location of extra bases in the assembly
- Size distribution of the missing and extra fragments

Genome structure accuracy
- Estimate of misassembly count

Example

Reference genome: AGGCTAGCGCGCGATTAGGATC
Assembly:         AGTAGCGGGCCGATTAAGANC

Alignment
  Reference  AGGCTAGCGCG-CGATTAGGATC
  Assembly   AG--TAGCGGGCCGATTAAGANC

- Miscalls: 2 (C->G and G->A)
- Uncalled bases: 1 (N)
- Extra bases: 1 (insertion of C)
- Missing bases: 2 (deletion of GC)
- Missing segments: 1
- Extra segments: 1
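The counts in this example can be reproduced by walking the two aligned rows column by column. The sketch below is an illustration of the bookkeeping only, not Mauve's own scoring code, and the function name is hypothetical.

```python
def alignment_metrics(ref_aln, asm_aln):
    """Count differences between two gapped, equal-length alignment rows."""
    if len(ref_aln) != len(asm_aln):
        raise ValueError("aligned rows must have the same length")
    metrics = {
        "miscalls": [],          # e.g. "C->G" (reference base -> assembly base)
        "uncalled": 0,           # N in the assembly
        "missing_bases": 0,      # gap in the assembly row
        "extra_bases": 0,        # gap in the reference row
        "missing_segments": 0,   # runs of consecutive missing bases
        "extra_segments": 0,     # runs of consecutive extra bases
    }
    prev_ref, prev_asm = "", ""
    for r, a in zip(ref_aln, asm_aln):
        if a == "-":
            metrics["missing_bases"] += 1
            if prev_asm != "-":
                metrics["missing_segments"] += 1
        elif r == "-":
            metrics["extra_bases"] += 1
            if prev_ref != "-":
                metrics["extra_segments"] += 1
        elif a == "N":
            metrics["uncalled"] += 1
        elif a != r:
            metrics["miscalls"].append(f"{r}->{a}")
        prev_ref, prev_asm = r, a
    return metrics


# The alignment from the example slide.
result = alignment_metrics("AGGCTAGCGCG-CGATTAGGATC",
                           "AG--TAGCGGGCCGATTAAGANC")
print(result)
```

Counting segments separately from bases distinguishes one long missing region from many scattered single-base deletions, which have very different causes.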

Scoring simulated reads with Mauve

Reference – Haemophilus influenzae F3047 (NC_014922)

Ran 454Sim, FlowSim and Art-454 to generate reads

Ran de novo assemblers - Newbler, Mira3 and Celera (CABOG)

Merged assemblies using Minimus2

Ran Mauve to align the assemblies back to the reference genome

Computed assembly metrics

Miscalled Bases
[Bar chart: number of miscalled bases per assembly]

Assembly          Miscalled bases
Newbler           16
Mira              52
CA                8
Newbler+Mira      18
Newbler+CA        24
Newbler+Mira+CA   36

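Uncalled bases
[Bar chart: number of uncalled bases per assembly; the bars for Newbler and CA carry no value label]

Assembly          Uncalled bases
Newbler           (no value shown)
Mira              3
CA                (no value shown)
Newbler+Mira      14
Newbler+CA        7
Newbler+Mira+CA   34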

Total missing bases
[Bar chart: number of missed bases per assembly]

Assembly          Missing bases
Newbler           90,490
Mira              76,648
CA                92,196
Newbler+Mira      73,632
Newbler+CA        82,121
Newbler+Mira+CA   195,619

Total extra segments
[Bar chart: number of extra bases per assembly]

Assembly          Extra bases
Newbler           709
Mira              4,387
CA                1,907
Newbler+Mira      5,895
Newbler+CA        4,590
Newbler+Mira+CA   5,218

Choosing the BEST assembly

Quantitative metrics
- N50
- Contig count
- Assembly size

Qualitative metrics
- Miscalled bases
- Uncalled bases
- Missing bases
- Extra bases

Assembly Scores

Quantitative Score
  (N50 * Assembly size) / No. of Contigs

Qualitative Score (% Accuracy)
  1 - (Miscalls + Uncalled + Missing + Extra + Gaps in Ref + Gaps in Assembly) / Reference Size
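The qualitative score reads as one minus the summed error counts divided by the reference size, expressed as a percentage. A minimal sketch follows, applied to the toy alignment example from the Mauve metrics slide (22-base reference). The two gap terms are set to zero there as an assumption, since the slides do not define how they are counted.

```python
def quality_score(reference_size, miscalls, uncalled, missing, extra,
                  gaps_in_ref=0, gaps_in_assembly=0):
    """Percent accuracy: 100 * (1 - total errors / reference size)."""
    if reference_size <= 0:
        raise ValueError("reference_size must be positive")
    errors = (miscalls + uncalled + missing + extra
              + gaps_in_ref + gaps_in_assembly)
    return 100.0 * (1 - errors / reference_size)


# Toy alignment example: 2 miscalls, 1 uncalled, 2 missing and 1 extra base
# against a 22-base reference; gap terms assumed to be zero.
toy_accuracy = quality_score(22, miscalls=2, uncalled=1, missing=2, extra=1)
print(f"{toy_accuracy:.1f}%")
```

Because every error type carries equal weight in this formula, a large missing region dominates the score far more than a handful of miscalled bases.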

[Table: Metrics Summary, Art-454 reads, reporting ASSEMBLY SCORE and QUALITY SCORE]

[Figure: Assembly spec. vs Accuracy plot, 454Sim. Axes: Assembly score (0.1 to 0.5) and Quality of output (90.0 to 100.0%). Points: Mira, Newbler+CA+Mira, Newbler+Mira, Newbler+CA, CA, Newbler.]

[Figure: Assembly spec. vs Accuracy plot, Art-454. Axes: Assembly score (0 to 6) and Quality of output (90.0 to 100.0%). Points: Mira, Newbler, Newbler+CA, CA, Newbler+Mira, Newbler+Mira+CA.]

[Figure: Assembly spec. vs Accuracy plot, FlowSim. Axes: Assembly score (0 to 8) and Quality of output (90.0 to 100.0%). Points: Mira, Newbler+Mira, Newbler+Mira+CA, Newbler+CA, CA, Newbler.]

[Figure: Assembly spec. vs Accuracy plot, M21709. Axes: Assembly score (0 to 8) and Quality of output (50.0 to 80.0%). Points: Newbler+Mira, AMOScmp, Celera, Mira, Newbler+Mira+CA, Newbler, Newbler+CA, AMOScmp+Newbler.]

Inference

- Striking a balance is critical
- We chose
  - Newbler + MIRA for H. haemolyticus
  - Newbler + AMOScmp for H. influenzae
- Universally applicable pipeline
  - Adopt the most consistent tool/pipeline (conservative approach): NEWBLER
- Assembling specific genomes/strains
  - Choose the combination that strikes the best balance for your genome: NEWBLER + (CELERA/MIRA)

Final Results

Genomes   Contig #   N50      Size      Method
M19107    129        25914    1774129   Newbler + Mira
M19501    19         284900   1809865   Newbler + Mira
M21127    32         122121   2029793   Newbler + Mira
M21621    27         139238   1959123   Newbler + Mira
M21639    56         87673    2397857   Newbler + Mira
M21709    28         140484   1808157   Newbler + AMOScmp

Key take-aways

- Understand your data: platform, long/short reads, coverage, paired/non-paired, quality of basecalling, etc.
- Evaluate the need for error correction
- Choose a set of "best" assemblers: de novo/reference assembly, DBG/OLC algorithm
- Merge assemblies
- Ordering and scaffolding
- Finishing
- Evaluate your assembly at every step to ensure that you are on the right track!

Coming next >>> Demo
