Robert Arthur Kevin Lee Xing Liu Pushkar Pande Gena Tang Racchit Thapliyal Tianjun Ye

Page 1: Authors

Robert Arthur, Kevin Lee, Xing Liu, Pushkar Pande, Gena Tang, Racchit Thapliyal, Tianjun Ye

Page 2: Outline

Sequencing Methods

Experimental comparison of De Bruijn graph and Overlap graph assemblers

Preliminary Results

Lab Exercise

Page 3: Sequencing Methods

Sanger Sequencing
◦ Cycle sequencing reaction
◦ ddNTP-terminated, dye-labeled products
◦ High-resolution electrophoretic separation
◦ Parallelized in 96 or 384 capillaries
◦ Read lengths up to 1 kbp
◦ Raw accuracy up to 99.999%
◦ Costs 50¢ per kb

Page 4: Sequencing Methods

Second-Generation Sequencing
◦ Cyclic array methods: 454, Illumina, AB SOLiD, Polonator, HeliScope
◦ Platforms vary in biochemistry and array generation, yet are conceptually similar in workflow

Page 5: Illumina

Page 6: Illumina continued

Page 7: AB SOLiD

Page 8: 454 Pyrosequencing

Create a DNA library
◦ Ligate adaptors to fragments

Emulsion PCR
◦ Agarose beads
◦ Oil, water, PCR reagents
◦ Results in 1 million copies of the fragment on each bead

Page 9: More 454

Beads arrayed into picotiter plate
◦ Immobilized via addition of enzyme-containing beads
◦ Each well holds a single bead

Bst polymerase, luciferase, apyrase, and ATP sulfurylase are used

Page 10: Even more 454 (Example of Output)

[Flowgram figure: flow order TACG; signal levels labeled 1-mer, 2-mer, 3-mer, 4-mer; key sequence TCAG]

Measures the presence or absence of each nucleotide at any given position
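The flowgram idea can be sketched in a few lines of Python. This is only an illustration, assuming that each flow signal is roughly proportional to the homopolymer length (the 1-mer to 4-mer levels in the figure) and that rounding to the nearest integer is good enough; the function name `decode_flowgram` and the sample signal values are hypothetical, not taken from any 454 software.

```python
def decode_flowgram(signals, flow_order="TACG"):
    """Turn a list of flow signals into a base sequence.

    Assumption: signal i belongs to nucleotide flow_order[i % 4], and its
    rounded value is the number of times that base is incorporated.
    """
    bases = []
    for i, signal in enumerate(signals):
        count = int(round(signal))  # 0 = absent, 1 = 1-mer, 2 = 2-mer, ...
        bases.append(flow_order[i % len(flow_order)] * count)
    return "".join(bases)


# Two flow cycles (T, A, C, G, T, A, C, G) that spell the key sequence.
key_signals = [1.02, 0.05, 0.97, 0.10, 0.03, 1.10, 0.00, 0.95]
print(decode_flowgram(key_signals))
```

With the flow order TACG, the key TCAG needs two full cycles because A is flowed before C in each cycle.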

Page 11: Videos (454 Workflow)

Page 12: Videos (Pyrosequencing)

Note: we did not choose the music

Page 13: Comparison of 2nd Gen Platforms

Page 14: Outline

Sequencing Methods

Experimental comparison of De Bruijn graph and Overlap graph assemblers

Preliminary Results

Lab Exercise

Page 15: De Bruijn Graph assemblers and Overlap Graph assemblers

De Bruijn Graph assemblers
◦ Velvet, ABySS, Euler

Overlap Graph assemblers
◦ Newbler, Edena, SSAKE, VCAKE
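For readers new to the first family, here is a minimal sketch of how a De Bruijn graph can be built from reads. It assumes the common formulation in which nodes are (k-1)-mers and every k-mer in a read contributes one edge; real assemblers such as Velvet add error correction, reverse complements, and graph simplification, none of which is shown. The function name `de_bruijn_graph` is illustrative.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        # A read shorter than k yields an empty range and contributes nothing.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # edge: prefix -> suffix
    return dict(graph)


print(de_bruijn_graph(["ACGT", "CGTA"], 3))
```

An overlap graph, by contrast, keeps whole reads as nodes and connects two reads when a suffix of one matches a prefix of the other.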

Page 16: Synthetic Data used for Experiments

Write a C program to simulate reads from a reference genome with a specific read length, coverage, and base error rate
◦ Human chr 22, ~33.5M bases
◦ Streptococcus suis, NC_012925.1, ~2M bases
◦ Helicobacter acinonychis Sheeba, ~1.5M bases

Write another C program to measure the quality of assemblers
◦ N50 length
◦ No. of contigs
◦ Max contig length
◦ No. of misassembled contigs
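The original simulator was written in C and is not shown in the slides. The Python sketch below only illustrates the idea under stated assumptions: reads are sampled uniformly from the forward strand, the number of reads is coverage × genome length / read length, and errors are independent substitutions. The function name `simulate_reads` and its parameters are hypothetical.

```python
import random


def simulate_reads(genome, read_len, coverage, error_rate, seed=0):
    """Sample error-containing reads from a reference sequence (a sketch)."""
    rng = random.Random(seed)
    # Assumption: coverage = n_reads * read_len / genome length.
    n_reads = int(coverage * len(genome) / read_len)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        bases = list(genome[start:start + read_len])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # Substitute with one of the other three bases.
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(bases))
    return reads


toy_genome = "ACGTTGCAAGGCTTAACCGG" * 10
reads = simulate_reads(toy_genome, read_len=35, coverage=50, error_rate=0.001)
print(len(reads), reads[0])
```

A real simulator would also need to handle both strands and ambiguous bases (N) in the reference, which this sketch ignores.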

Page 17: Read Length

De Bruijn graph assemblers are only suitable for short-read data

K limitation
◦ Use a hash table or sorting to index K-mers
◦ Need a unique key (value) to represent each K-mer
  K = 16: 4^16 = 2^32 <-> 32-bit integer (unsigned int)
  K = 32: 4^32 = 2^64 <-> 64-bit integer (unsigned long long)
  K > 32: multiple integers are needed to represent the hash table key
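The key-size arithmetic above follows from packing each base into 2 bits. A small sketch, assuming the conventional mapping A=0, C=1, G=2, T=3 (the slides do not say which mapping the assemblers use); `encode_kmer` is an illustrative name.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding


def encode_kmer(kmer):
    """Pack a K-mer into a single integer, 2 bits per base."""
    key = 0
    for base in kmer:
        key = (key << 2) | CODE[base]
    return key


for k in (16, 32, 33):
    largest = encode_kmer("T" * k)  # all bits set: the largest possible key
    print(k, largest.bit_length())  # number of bits the key needs
```

Python integers grow as needed, so the K = 33 case simply reports 66 bits; in C that key no longer fits in an unsigned long long, which is the limitation the slide describes.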

Page 18

Simulated reads from Streptococcus suis: read length 300, 50X coverage, error rate 0.1%

Velvet default: K <= 31, so we use 31

Assembler   # of contigs (total length)   N50 length   # of misassembled contigs (total length)
Velvet      46515 (1716053 bp)            115 bp       5 (1346 bp)

Recompile Velvet, K = 99

Assembler   # of contigs (total length)   N50 length   # of misassembled contigs (total length)
Velvet      441 (1974382 bp)              15328 bp     1 (34 bp)
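For reference, the N50 length reported in these tables can be computed as below. This is a sketch using the usual definition (the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembly length); the original measurement program was written in C and may differ in details such as minimum contig length.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached at least half of the total length
            return length
    return 0


print(n50([2, 3, 4, 5, 6]))  # total 20; 6 + 5 = 11 >= 10, so N50 is 5
```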

Page 19: Quality and Accuracy

Some of the literature states that “De Bruijn based approach prone to false positives” and that “Overlap graph has better quality”

Page 20

Simulated reads from Helicobacter acinonychis Sheeba: read length 35, 50X coverage, error rate 0.1%

Assembler   # of contigs (total length)   N50 length   # of misassembled contigs (total length)
Velvet      336 (1525746 bp)              10.4 kbp     17 (156637 bp)
Edena       340 (1513259 bp)              9.8 kbp      0 (0 bp)

Page 21

Simulated reads from Streptococcus suis: read length 35, 50X coverage, error rate 0.1%

Assembler   # of contigs (total length)   N50 length   # of misassembled contigs (total length)
Velvet      1106 (1969617 bp)             5266 bp      12 (255594 bp)
Edena       1003 (1970342 bp)             6416 bp      0 (0 bp)

Page 22: Runtime and Memory Usage

Overlap graph based assemblers are computationally expensive and use more memory
◦ All-to-all alignment of reads, O(n^2)
◦ Use more memory to store the overlap graph

Typically, the number of reads is much larger than the number of K-mers
◦ Especially for short-read data
  With the same coverage and genome length, shorter reads mean more reads
◦ It is stated that De Bruijn graphs are more suitable for NGS data
  Shorter reads and high throughput
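The O(n^2) cost comes from comparing every read against every other read. A deliberately naive sketch of that all-to-all step, assuming exact suffix-prefix matches only (real assemblers tolerate mismatches and use indexes to avoid most comparisons); the function names are illustrative.

```python
def overlap_length(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def all_pairs_overlaps(reads, min_len):
    """Compare every ordered pair of reads: n * (n - 1) comparisons."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                length = overlap_length(a, b, min_len)
                if length:
                    edges[(i, j)] = length
    return edges


print(all_pairs_overlaps(["ACGTAC", "TACGGA", "GGATTT"], min_len=3))
```

Halving the read length at fixed coverage doubles the number of reads n and therefore roughly quadruples the number of pairs, which is why short-read data hits this approach hardest.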

Page 23

Simulated reads from Streptococcus suis: 802995 reads, read length 50, 20X coverage, error rate 0.1%
Xeon E5530 2.4 GHz

Assembler   Time       Memory
Velvet      33 secs    ~220 MB
SSAKE       26 mins    ~900 MB
VCAKE       107 mins   ~1.1 GB

Page 24: However!

Recent advances in pattern matching algorithms and techniques enable the use of overlap graphs
◦ Suffix tree, suffix array, prefix array, compressed suffix array

Suffix array
◦ Able to find overlaps between reads in linear time
◦ Using a compressed suffix array can significantly reduce the memory requirements of overlap graph assemblers

Examples
◦ D. Hernandez, P. François, L. Farinelli, M. Osteras, and J. Schrenzel, De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Research 18:802-809, 2008.
◦ Jared T. Simpson and Richard Durbin, Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26(12):i367-i373, 2010.
◦ Pasqual: Pushkar and I have developed a parallel sequence assembler based on the overlap graph in our research project
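To give a feel for index-based overlap detection, the sketch below sorts all read suffixes once and then looks up each read prefix by binary search instead of comparing all pairs. It is a toy stand-in, not a real suffix array: it stores the suffix strings themselves, is not linear time, and is not how Edena, SGA, or Pasqual are implemented. All names are hypothetical.

```python
from bisect import bisect_left


def build_suffix_index(reads, min_len):
    """Sorted list of (suffix, read_id) for proper suffixes of length >= min_len."""
    entries = []
    for read_id, read in enumerate(reads):
        for start in range(1, len(read) - min_len + 1):
            entries.append((read[start:], read_id))
    entries.sort()
    return entries


def find_overlaps(reads, min_len):
    """Return sorted (a_id, b_id, length): a suffix of read a equals a prefix of read b."""
    index = build_suffix_index(reads, min_len)
    overlaps = []
    for b_id, b in enumerate(reads):
        for length in range(min_len, len(b)):
            prefix = b[:length]
            # (prefix, -1) sorts just before every entry whose suffix equals prefix.
            pos = bisect_left(index, (prefix, -1))
            while pos < len(index) and index[pos][0] == prefix:
                a_id = index[pos][1]
                if a_id != b_id:
                    overlaps.append((a_id, b_id, length))
                pos += 1
    return sorted(overlaps)


print(find_overlaps(["ACGTAC", "TACGGA", "GGATTT"], min_len=3))
```

A true suffix array or FM-index stores only positions (or a compressed transform) rather than the suffix strings, which is where the memory savings mentioned above come from.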

Page 25

Simulated reads from Human chr22: 6978908 reads, read length 50, 20X coverage, error rate 0.1%
Xeon E5530 2.4 GHz with 4 cores / 8 threads

Assembler          Time       Memory
Velvet             292 mins   ~17 GB
Edena              37 mins    ~7 GB
Pasqual            43 mins    ~8 GB
Parallel Pasqual   9 mins     ~8 GB

Page 26: Mixed Length Reads

H. influenzae
◦ Read lengths of 30 to 300

Velvet does not work
◦ K is fixed
◦ If we use a big K, reads shorter than K cannot be assembled
◦ If we use a small K, it is difficult to assemble the long reads

Overlap graph assemblers do not have this issue
◦ Newbler
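The fixed-K trade-off can be made concrete with a little arithmetic: a read of length L contributes L - K + 1 K-mers, so any read shorter than K contributes none. A small sketch with made-up read lengths in the 30 to 300 range (the function names and sample values are illustrative only):

```python
def kmers_per_read(read_len, k):
    """Number of K-mers a read of this length contributes (0 if shorter than K)."""
    return max(0, read_len - k + 1)


def fraction_unusable(read_lengths, k):
    """Fraction of reads that are shorter than K and so add nothing to the graph."""
    unusable = sum(1 for n in read_lengths if n < k)
    return unusable / len(read_lengths)


lengths = [30, 50, 100, 300]  # hypothetical mixed-length data set
for k in (19, 31, 99):
    print(k, [kmers_per_read(n, k) for n in lengths], fraction_unusable(lengths, k))
```

With K = 99 half of these sample reads are dropped, while with K = 19 every read is used but a 300-base read is broken into 282 short K-mers, losing most of its long-range information.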

Page 27: Conclusion

Controversial
◦ The relation between De Bruijn graph and Overlap graph is still unclear

We can still conclude from the experiments
◦ Regarding quality and accuracy, Overlap graph assemblers are thought to be better than De Bruijn graph assemblers
◦ De Bruijn graph assemblers do not work for long reads
◦ De Bruijn graph assemblers do not work for mixed-length reads (K is fixed)
◦ Traditional overlap graph assemblers are slower and use more memory, but the latest assemblers are better than De Bruijn graph assemblers

Page 28: Outline

Sequencing Methods

Experimental comparison of De Bruijn graph and Overlap graph assemblers

Preliminary Results

Lab Exercise

Page 29: Quality score and length distribution

Id       Mean length   Median length   Std dev
M19107   577.5849      569             83.9605

Page 30: Quality score and length distribution

Id       Mean length   Median length   Std dev
M19501   624.7172      621             78.4074

Page 31: Quality score and length distribution

Id       Mean length   Median length   Std dev
M21127   618.7576      616             81.5678

Page 32: Quality score and length distribution

Id       Mean length   Median length   Std dev
M21621   620.6305      621             83.978

Page 33: Quality score and length distribution

Id       Mean length   Median length   Std dev
M21639   573.384       564             66.5525

Page 34: Quality score and length distribution

Id       Mean length   Median length   Std dev
M21709   626.2459      624             78.2447
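Summary statistics like the ones on these slides can be reproduced from a list of read lengths with the Python standard library. This is a sketch only: the slides do not say whether the population or the sample standard deviation was reported, so the use of `statistics.pstdev` here is an assumption, and `length_summary` is an illustrative name.

```python
import statistics


def length_summary(read_lengths):
    """Return (mean, median, population standard deviation) of read lengths."""
    return (
        statistics.mean(read_lengths),
        statistics.median(read_lengths),
        statistics.pstdev(read_lengths),  # assumption: population std dev
    )


print(length_summary([2, 4, 4, 4, 5, 5, 7, 9]))
```

For the thousands of reads in a 454 run the two standard deviation variants differ only negligibly.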

Page 35: Velvet

Id       K    No. of contigs   N50   Max length   Total length   % reads used
M19107   19   217160           16    665          2905543        97.3535
         29   176741           26    655          3315033        88.7319
M19501   19   618036           13    429          4716286        78.9177
         29   537077           18    490          5725530        35.5981
M21127   19   319999           15    483          3498613        91.4239
         29   259942           24    416          3998418        73.0187
M21621   19   218872           16    640          3052522        93.7490
         29   157853           26    838          3256837        87.5425
M21639   19   770867           13    628          5818868        85.0236
         29   680339           19    601          7348599        46.1671
M21709   19   291156           16    768          3425632        95.7695
         29   207736           25    816          3637419        83.8704

$> velveth <output_dir> <k-mer length> -fasta -long <reads.fasta>
$> velvetg <output_dir>

Input: Fasta/Fastq
Output: Fasta

Page 36: WGS assembler (Celera)

Id       No. of contigs   N50     Max length   Total length   % reads used
M19107   236              11881   32038        1766060        96.3570
M19501   214              1230    4519         278112         98.6032
M21127   345              8349    26765        1947955        97.9181
M21621   356              7791    30668        1892633        98.1710
M21639   326              2092    9912         610813         98.3939
M21709   520              4393    15002        1700040        98.5221

$> sffToCA -trim soft -libraryname ${Id}-trimsoft -output ${Id}-trimsoft ${Id}.sff
$> runCA -p ${Id} -d ${Id} ovlConcurrency=4 ${Id}-trimsoft.frg

Input: frg format
Output: Fasta

• More than 50 separate programs make up the Celera Assembler pipeline
• The runCA script helps manage them all

Page 37: Newbler

De Novo Assembly

Id       No. of contigs   N50      Max length   Total length
M19107   217              15659    38000        25112606
M19501   75               157459   343196       106836011
M21127   59               121256   316274       40693944
M21621   50               138437   339424       50432798
M21639   175              43023    182797       158028027
M21709   52               140128   319869       69503256

Reference Assembly (Haemophilus-influenzae-refseq.fasta)

Id       No. of contigs   N50     Max length   Total length
M19107   1260             2496    10409        1224223
M19501   988              3503    18724        1380153
M21127   -                -       -            -
M21621   -                -       -            -
M21639   1272             2701    13712        1416318
M21709   313              13836   70298        1607841

Input: .sff
Output: Fasta

$> runAssembly <reads.sff> // de novo assembly

Page 38: MIRA

MIRA stands for Mimicking Intelligent Read Assembly

Id       No. of contigs   N50      Max length   Total length   % reads used
M19107   208              18379    51687        1795134        95.7478
M19501   181              185484   321569       1901198        97.7347
M21127   89               81157    305626       1951240        97.4776
M21621   67               90877    253924       1887484        97.5015
M21639   175              90800    152373       2378888        98.1330
M21709   83               62871    197745       1840248        97.6776

$> sff_extract -s ${Id}_in.454.fasta -q ${Id}_in.454.fasta.qual -x ${Id}_traceinfo_in.454.xml ${Id}.sff
$> mira --project=${Id} --job=denovo,genome,normal,454 -GE:not=4 >& ${Id}_assembly.log

Input: Fasta + qual + trace info
Output: Fasta, Ace

Page 39: Eagle view - M19107.ace

Page 40: Eagle view - M19501.ace

Page 41: Works Cited

“Next-generation DNA sequencing”, Shendure et al., http://compgenomics2011.biology.gatech.edu/images/f/f9/Shendure-NatureBiotechnology-2008.pdf

“Next-generation DNA sequencing methods”, Mardis et al., http://compgenomics2011.biology.gatech.edu/images/5/59/Mardis-AnnuRevGenet-2008.pdf

Page 42: Outline

Sequencing Methods

Experimental comparison of De Bruijn graph and Overlap graph assemblers

Preliminary Results

Lab Exercise

Page 43: Lab Exercise

Download the Lab Exercise file from the Genome Assembly wiki page