Jeri Dilts, Suzanna Kim, Hema Nagrajan, Deepak Purushotham, Ambily Sivadas, Amit Rupani, Leo Wu. Genome Assembly Final Results. 02-22-2012

Page 1: Genome Assembly  Final Results

Genome Assembly Final Results

Jeri Dilts
Suzanna Kim
Hema Nagrajan
Deepak Purushotham
Ambily Sivadas
Amit Rupani
Leo Wu

02-22-2012

Page 2: Genome Assembly  Final Results

Outline

Pipeline for evaluation
Quantitative evaluation
Qualitative evaluation
Choosing the BEST assembly
Final results
Demo

Page 3: Genome Assembly  Final Results

Pipeline for evaluation

Page 4: Genome Assembly  Final Results

Strategy – Key alterations

Prinseq preprocessing
  Unnecessary: the assemblers have built-in capabilities
  Use Prinseq for data statistics
Error correction
  Does not fit our methods
  Coral is based on overlap-layout-consensus and works best with de Bruijn graph assemblers
  Echo has never been tested on 454 data
Final assemblers
  Newbler, Mira, Celera, AMOScmp
Discarded assemblers
  ABySS, Velvet, and PCAP454
MAIA hybrid assembly
  Needs a phylogenetically close reference genome

Page 5: Genome Assembly  Final Results

Outline

Pipeline for evaluation
Quantitative evaluation
Qualitative evaluation
Choosing the BEST assembly
Final results
Demo

Page 6: Genome Assembly  Final Results

Quantitative Evaluation

Metrics
  No. of contigs -> the fewer the better
  N50 -> the higher the better
  Assembly size -> the closer to the estimated genome size, the better
Quantitative Assembly Score
  Score = (N50 * Assembly size) / No. of Contigs
  The higher the score, the better!
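N50 is used throughout these slides but never defined. A minimal sketch of the usual definition follows: the contig length at which the running total of contigs, sorted from longest to shortest, first reaches half of the assembly size. The function name and the contig lengths are hypothetical and chosen only for illustration; they are not taken from the assemblies in this deck.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly size.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths, for illustration only.
example_lengths = [80, 70, 50, 40, 30, 20]
print(n50(example_lengths))
```

A fragmented assembly with many short contigs reaches the halfway point late in the sorted list, which is why a higher N50 is read as better contiguity.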

Page 7: Genome Assembly  Final Results

M19107 - Evaluation

Runs                 # Contigs  N50    Total Size  Score
Newbler              199        16319  1753573     8.16
Mira                 201        19353  1790088     8.24
Celera               146        20609  1747621     8.39
Newbler_Mira         129        25914  1774129     8.55
Newbler_Celera       104        27207  1719874     8.65
Newbler_Mira_Celera  96         27478  1701316     8.69
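The scores in the M19107 table are far smaller than the raw ratio N50 * Assembly size / No. of Contigs. They appear to be the base-10 logarithm of that ratio, although the slides do not say so. The sketch below recomputes each row under that assumption; the function name is hypothetical and the input numbers are copied from the table.

```python
import math

# (contigs, N50, total size, score as reported on the slide) for M19107.
runs = {
    "Newbler": (199, 16319, 1753573, 8.16),
    "Mira": (201, 19353, 1790088, 8.24),
    "Celera": (146, 20609, 1747621, 8.39),
    "Newbler_Mira": (129, 25914, 1774129, 8.55),
    "Newbler_Celera": (104, 27207, 1719874, 8.65),
    "Newbler_Mira_Celera": (96, 27478, 1701316, 8.69),
}


def assembly_score(n_contigs, n50, total_size):
    """Raw quantitative score: N50 * assembly size / number of contigs."""
    return n50 * total_size / n_contigs


for name, (contigs, n50_value, size, reported) in runs.items():
    raw = assembly_score(contigs, n50_value, size)
    # Assumption: the slide reports log10 of the raw ratio.
    print(f"{name:20s} raw={raw:.3e} log10={math.log10(raw):.2f} reported={reported}")
```

On a log scale, a 0.3 difference in score corresponds to roughly a twofold difference in the raw ratio, so the gap between Newbler alone and the three-way merge is larger than the two-decimal scores suggest.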

Page 8: Genome Assembly  Final Results

Outline

Pipeline for evaluation
Quantitative evaluation
Qualitative evaluation
Choosing the BEST assembly
Final results
Demo

Page 9: Genome Assembly  Final Results

Qualitative Evaluation

Strategy
  Align the assembly contigs to the original reference genome and compute differences
Challenges
  No original reference genome for our data set
Approach
  Create simulated 454 read datasets from a completely sequenced genome
Tools used
  FlowSim
  454Sim
  ART-454

Page 10: Genome Assembly  Final Results

FlowSim

A simulation pipeline based on real data
Lets you model each step of the pyrosequencing process
Utilities:
  Clonesim: simulates the shearing step
    Usage: clonesim -c count -l dist input.fasta
  Gelfilter: selects a certain range of clone lengths
    Usage: gelfilter min max
  Kitsim: attaches A and B adaptors
    Usage: kitsim -k key -a adapter input.fasta -o output.fasta
  Mutator: introduces random substitutions and indels in the sequences
    Usage: mutator -i indel_rate -s subst_rate input.fasta -o output.fasta
  Duplicator: generates artificial duplicates of many clones
    Usage: duplicator dup_prob
  Flowsim: simulates the actual pyrosequencing process
    Usage: flowsim -G generation input.fasta -o output.sff
Example:
  clonesim -c 400000 -l "Normal 350 95" input.fasta | gelfilter 25 600 | kitsim | mutator | duplicator 0.03 | flowsim -G Titanium -o output.sff
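When the same FlowSim pipeline has to be rerun with different clone counts or duplication rates, it can help to assemble the command line programmatically. The sketch below only builds the command string; it does not run anything. It uses only the tools and flags that appear in the example on this slide, and the function name and its parameter names are hypothetical.

```python
import shlex


def flowsim_pipeline(input_fasta, output_sff, clone_count=400000,
                     length_dist="Normal 350 95", min_len=25, max_len=600,
                     dup_prob=0.03, generation="Titanium"):
    """Build the shell pipeline from the slide's example as a single string.

    Flags and their order are copied from the slide; nothing else is assumed
    about the FlowSim command-line interface.
    """
    stages = [
        ["clonesim", "-c", str(clone_count), "-l", length_dist, input_fasta],
        ["gelfilter", str(min_len), str(max_len)],
        ["kitsim"],
        ["mutator"],
        ["duplicator", str(dup_prob)],
        ["flowsim", "-G", generation, "-o", output_sff],
    ]
    # shlex.join quotes arguments that contain spaces, such as the length distribution.
    return " | ".join(shlex.join(stage) for stage in stages)


print(flowsim_pipeline("input.fasta", "output.sff"))
```

The resulting string can be pasted into a shell or passed to a job script; quoting of the length distribution is handled by `shlex.join`.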

Page 11: Genome Assembly  Final Results

454Sim

454 simulation at higher speed and accuracy
USP: configurable statistical models
Supports GS FLX, Titanium and GS 20
Utilities:
  fragsim: simulates shearing
    Usage: fragsim -c 1000000 -l 1000 genome.fasta > genome.fragments.fasta
  454sim: simulates the sequencing step
    Usage: 454sim -o genome.sff genome.fragments.fasta
Example:
  fragsim -c 250000 -l 1000 genome.fasta | 454sim -g FLX -o genome.sff

Page 12: Genome Assembly  Final Results

ART-454

Supports Illumina, 454 and SOLiD read simulation
Used for the 1000 Genomes Project
Usage:
  Single-end reads: art_454 Input.fasta Output_prefix Fold_coverage
  Paired-end reads: art_454 Input.fasta Output_prefix Fold_coverage Mean_Frag_Len Std_Deviation

Page 13: Genome Assembly  Final Results

Running pipeline on Simulated reads

Reference – Haemophilus influenzae F3047 (NC_014922)

Ran 454Sim, FlowSim and Art-454 to generate reads

Ran de novo assemblers - Newbler, Mira3 and Celera (CABOG)

Merged assemblies using Minimus2

Evaluate Assembly Accuracy (How?)

Page 14: Genome Assembly  Final Results

Assembly Accuracy

Challenges
  Alignment of contigs to the reference genome
Approach
  Local alignment (BLAST, bwa, bowtie)
  Whole genome alignment (Mauve, MUMmer)
  Align the assembly to the reference genome
  Compute nucleotide differences, gaps and rearranged segments

Page 15: Genome Assembly  Final Results

Mauve

Uses positional homology genome alignment
  Each site in the assembly maps to at most one site on the reference
  Optimized contiguity
  E.g. progressiveMauve
Ordering of contigs: Mauve Contig Mover algorithm
Compare to identify differences

Page 16: Genome Assembly  Final Results

Mauve Genome Aligner

Page 17: Genome Assembly  Final Results

After Ordering of Contigs

Page 18: Genome Assembly  Final Results

Mauve Assembly Metrics

Basecalling accuracy
  Count and location of bases called wrongly
  Direction of miscalling, e.g. A->G
  Count and location of bases predicted to exist, but uncalled
Genome content accuracy
  Count and location of bases missing from the assembly
  Count and location of extra bases in the assembly
  Size distribution of the missing and extra fragments
Genome structure accuracy
  Estimate of misassembly count

Page 19: Genome Assembly  Final Results

Example

Reference genome
  AGGCTAGCGCGCGATTAGGATC
Assembly
  AGTAGCGGGCCGATTAAGANC
Alignment
  AGGCTAGCGCG-CGATTAGGATC
  AG--TAGCGGGCCGATTAAGANC

Miscalls: 2 (C->G and G->A)
Uncalled bases: 1 (N)
Extra bases: 1 (insertion of C)
Missing bases: 2 (deletion of GC)
Missing segments: 1
Extra segments: 1
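The counts in this example can be reproduced by walking the two aligned strings column by column. The sketch below is a simplified illustration of the metric definitions and is not Mauve's implementation. It assumes the alignment is already given as two equal-length gapped strings, with "-" for a gap and "N" for an uncalled base; the function name is hypothetical.

```python
from itertools import groupby


def alignment_metrics(ref_aln, asm_aln, gap="-", uncalled="N"):
    """Count per-base differences between an aligned reference and assembly."""
    if len(ref_aln) != len(asm_aln):
        raise ValueError("aligned sequences must have equal length")
    states = []
    for r, a in zip(ref_aln.upper(), asm_aln.upper()):
        if r == gap and a == gap:
            continue  # column carries no information
        if r == gap:
            states.append("extra")      # base present only in the assembly
        elif a == gap:
            states.append("missing")    # base present only in the reference
        elif a == uncalled:
            states.append("uncalled")   # assembler emitted N
        elif r != a:
            states.append("miscall")    # substitution
        else:
            states.append("match")
    # Collapse consecutive identical states to count contiguous segments.
    segments = [state for state, _ in groupby(states)]
    return {
        "miscalls": states.count("miscall"),
        "uncalled_bases": states.count("uncalled"),
        "extra_bases": states.count("extra"),
        "missing_bases": states.count("missing"),
        "extra_segments": segments.count("extra"),
        "missing_segments": segments.count("missing"),
    }


result = alignment_metrics("AGGCTAGCGCG-CGATTAGGATC",
                           "AG--TAGCGGGCCGATTAAGANC")
print(result)
```

Counting segments separately from bases matters because one long missing region and many scattered single-base deletions give the same base count but point to different assembly problems.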

Page 20: Genome Assembly  Final Results

Scoring simulated reads with Mauve

Reference – Haemophilus influenzae F3047 (NC_014922)

Ran 454Sim, FlowSim and Art-454 to generate reads

Ran de novo assemblers - Newbler, Mira3 and Celera (CABOG)

Merged assemblies using Minimus2

Ran Mauve to align the assemblies back to the reference genome

Computed assembly metrics

Page 21: Genome Assembly  Final Results

Miscalled Bases

[Bar chart: number of miscalled bases per assembly. Newbler: 16; Mira: 52; CA: 8; Newbler+Mira: 18; Newbler+CA: 24; Newbler+Mira+CA: 36]

Page 22: Genome Assembly  Final Results

Uncalled bases

[Bar chart: number of uncalled bases per assembly. Value labels as extracted: Mira: 3; Newbler+Mira: 14; Newbler+CA: 7; Newbler+Mira+CA: 34. Newbler and CA carry no value label]

Page 23: Genome Assembly  Final Results

Total missing bases

[Bar chart: number of missed bases per assembly. Newbler: 90,490; Mira: 76,648; CA: 92,196; Newbler+Mira: 73,632; Newbler+CA: 82,121; Newbler+Mira+CA: 195,619]

Page 24: Genome Assembly  Final Results

Total extra segments

[Bar chart: number of extra bases per assembly. Newbler: 709; Mira: 4,387; CA: 1,907; Newbler+Mira: 5,895; Newbler+CA: 4,590; Newbler+Mira+CA: 5,218]

Page 25: Genome Assembly  Final Results

Outline

Pipeline for evaluation
Quantitative evaluation
Qualitative evaluation
Choosing the BEST assembly
Final results
Demo

Page 26: Genome Assembly  Final Results

Choosing the BEST assembly

Quantitative metrics
  N50
  Contig count
  Assembly size
Qualitative metrics
  Miscalled bases
  Uncalled bases
  Missing bases
  Extra bases

Page 27: Genome Assembly  Final Results

Assembly Scores

Quantitative Score
  (N50 * Assembly size) / No. of Contigs
Qualitative Score (% Accuracy)
  1 - (Miscalls + Uncalled + Missing + Extra + Gaps in Ref + Gaps in Assembly) / Reference Size
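The qualitative score is one minus the fraction of the reference affected by any kind of error, expressed as a percentage. A minimal sketch follows, assuming every term is a count of bases; the slides do not say whether the two gap terms are base counts or gap counts. The function name and the example numbers are hypothetical and are not results from this project.

```python
def qualitative_score(miscalls, uncalled, missing, extra,
                      gaps_in_ref, gaps_in_assembly, reference_size):
    """Percent accuracy: 100 * (1 - total errors / reference size).

    Assumption: every argument is a number of bases.
    """
    if reference_size <= 0:
        raise ValueError("reference_size must be positive")
    errors = (miscalls + uncalled + missing + extra
              + gaps_in_ref + gaps_in_assembly)
    return 100.0 * (1 - errors / reference_size)


# Hypothetical counts, for illustration only.
example_accuracy = qualitative_score(
    miscalls=10, uncalled=5, missing=60, extra=20,
    gaps_in_ref=3, gaps_in_assembly=2, reference_size=10_000,
)
print(f"{example_accuracy:.2f}%")
```

Because missing bases are usually orders of magnitude more numerous than miscalls, this score is driven mostly by how much of the reference the assembly fails to cover.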

Page 28: Genome Assembly  Final Results

Metrics Summary – Art 454

[Summary table with columns: Assembly score, Quality score]

Page 29: Genome Assembly  Final Results

Assembly spec. vs Accuracy plot – 454Sim

[Scatter plot, 454Sim reads. Axes: assembly score (0.1 to 0.5) and quality of output (90.0% to 100.0%). Points: Mira, Newbler+CA+Mira, Newbler+Mira, Newbler+CA, CA, Newbler]

Page 30: Genome Assembly  Final Results

Assembly spec. vs Accuracy plot - Art-454

[Scatter plot, ART-454 reads. Axes: assembly score (0 to 6) and quality of output (90.0% to 100.0%). Points: Mira, Newbler, Newbler+CA, CA, Newbler+Mira, Newbler+Mira+CA]

Page 31: Genome Assembly  Final Results

Assembly spec. vs Accuracy plot – FlowSim

[Scatter plot, FlowSim reads. Axes: assembly score (0 to 8) and quality of output (90.0% to 100.0%). Points: Mira, Newbler+Mira, Newbler+Mira+CA, Newbler+CA, CA, Newbler]

Page 32: Genome Assembly  Final Results

Assembly spec. vs Accuracy plot – M21709

[Scatter plot, M21709. Axes: assembly score (0 to 8) and quality of output (50.0% to 80.0%). Points: Newbler+Mira, AMOScmp, Celera, Mira, Newbler+Mira+CA, Newbler, Newbler+CA, AMOScmp+Newbler]

Page 33: Genome Assembly  Final Results

Inference

Striking a balance is critical

We chose
  Newbler + MIRA for H. haemolyticus
  Newbler + AMOScmp for H. influenzae

Universally applicable pipeline
  Adopt the most consistent tool/pipeline (conservative approach)
  NEWBLER

Assembling specific genomes/strains
  Choose the one that strikes the best balance for your genome
  NEWBLER + (CELERA/MIRA)

Page 34: Genome Assembly  Final Results

Outline

Pipeline for evaluation
Quantitative evaluation
Qualitative evaluation
Choosing the BEST assembly
Final results
Demo

Page 35: Genome Assembly  Final Results

Final Results

Genome  Contig #  N50     Size     Method
M19107  129       25914   1774129  Newbler + Mira
M19501  19        284900  1809865  Newbler + Mira
M21127  32        122121  2029793  Newbler + Mira
M21621  27        139238  1959123  Newbler + Mira
M21639  56        87673   2397857  Newbler + Mira
M21709  28        140484  1808157  Newbler + AMOScmp

Page 36: Genome Assembly  Final Results

Key take-aways

Understand your data
  Platform, long/short reads, coverage, paired/non-paired, quality of basecalling, etc.
Evaluate the need for error correction
Choose a set of “best” assemblers
  De novo/reference assembly, DBG/OLC algorithm
Merge assemblies
Ordering and scaffolding
Finishing

Evaluate your assembly at every step to ensure that you are on the right track!

Page 37: Genome Assembly  Final Results

Coming next >>> Demo
