The first near-complete assembly of the hexaploid bread wheat genome, Triticum aestivum. Daniela Puiu, Aleksey Zimin, Richard Hall, Sarah Kingan, Bernardo Clavijo, Steven Salzberg. ICG-12, Oct 27 2017

Daniela Puiu at #ICG12: The first near-complete assembly of the hexaploid bread wheat genome, Triticum aestivum

Page 1: Daniela Puiu at #ICG12: The first near-complete assembly of the hexaploid bread wheat genome, Triticum aestivum

The first near-complete assembly of the hexaploid bread wheat genome, Triticum aestivum

Daniela Puiu

Aleksey Zimin, Richard Hall, Sarah Kingan, Bernardo Clavijo, Steven Salzberg

ICG-12

Oct 27 2017

Page 2

Sequencing and Assembly of the Ancestral and Common Wheat

Aegilops tauschii ssp. strangulata, accession AL8/78

Chinese Spring variety (CS42, accession Dv418)

2013-2017

Page 3

History of Wheat

~8,000 years ago: spontaneous hybridization

Emmer wheat + goat grass = bread wheat (world's 3rd cereal crop)

Triticum turgidum + Aegilops tauschii = Triticum aestivum

AABB + DD = AABBDD

Whole Genome => Assisted Breeding => Improved Yield

Page 4

The Wheat Genome

One of the most complex genomes!

1) Genome size: over 15 billion bases
2) Allohexaploid: six copies of each chromosome
3) >90% repeats

Multiple past attempts to assemble => assemblies shorter than the estimated genome size.

Page 5

New vs Previous Assemblies

[Figure: N50 of the new and previous assemblies; Triticum 3.1 N50 = 232K]

Page 6

Data Reduction

Original Reads   Number   Sum     Coverage   Accuracy
Illumina         7.06G    1Tb     65x        99.5%
PacBio           55.5M    545Gb   36x        87.5%

Processed Seq    Number   Sum     Coverage   Accuracy
super-reads      95.7M    31Gb    2x         99.95%
mega-reads       57M      278Gb   18x        99.65%

MaSuRCA mega-reads hybrid correction
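
The Coverage column can be reproduced as total sequenced bases divided by genome size. The sketch below is only an illustration: the slide does not say which genome-size estimate was used, so it assumes the Triticum 3.1 total size from the Assembly Statistics slide, and the names `GENOME_SIZE_BP`, `TOTAL_BASES`, and `coverage` are made up for this example.

```python
# Assumption: genome size taken as the Triticum 3.1 total size (bp)
# reported on the Assembly Statistics slide.
GENOME_SIZE_BP = 15_344_693_583

# Total bases per data set, as listed in the Sum column.
TOTAL_BASES = {
    "Illumina": 1e12,      # 1 Tb
    "PacBio": 545e9,       # 545 Gb
    "super-reads": 31e9,   # 31 Gb
    "mega-reads": 278e9,   # 278 Gb
}


def coverage(total_bases, genome_size=GENOME_SIZE_BP):
    """Average depth of coverage: sequenced bases divided by genome size."""
    return total_bases / genome_size


for name, bases in TOTAL_BASES.items():
    print(f"{name}: {coverage(bases):.1f}x")
```

Under that assumption the rounded values agree with the Coverage column for all four data sets.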

Page 7

MaSuRCA mega-reads Correction

Page 8

Assembly Pipeline

[Flowchart; box labels grouped as they appear on the slide]

Hybrid branch: MaSuRCA Correction, Illumina, Celera WGS Assembler, Mega-reads, Remove Duplicates, Triticum 1.0, Triticum 2.0

PacBio branch: FALCON Correction, PacBio, FALCON Assembler, pReads, Arrow Polishing, FALCON Trit 0.5, FALCON Trit 1.0

Final steps: k-mer Analysis, Merge, Triticum 3.1

Page 9

k-mer Analysis

[Figure: 31-mer frequencies; k-mers missing from the PacBio assembly only; axis ticks from 10M to 50M]
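
The idea of "k-mers missing from one assembly only" can be illustrated as a set difference between the k-mer sets of two assemblies. This is a toy sketch, not the tooling used for the talk (a genome of this size needs a dedicated k-mer counter): the function names and the two short contigs are hypothetical, and k=5 stands in for the 31-mers on the slide.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement,
    so both strands map to the same key."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def kmer_set(sequences, k):
    """Collect the canonical k-mers of every sequence in an assembly."""
    found = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            found.add(canonical(seq[i:i + k]))
    return found


def missing_from(reference, other, k):
    """k-mers present in `reference` but absent from `other`."""
    return kmer_set(reference, k) - kmer_set(other, k)


# Hypothetical toy contigs standing in for the two assemblies.
hybrid = ["ACGTACGGA"]
pacbio = ["ACGTACG"]
only_in_hybrid = missing_from(hybrid, pacbio, k=5)
print(sorted(only_in_hybrid))
```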

Page 10

Assembly Merge

Merging of the Hybrid and PacBio assemblies

[Figure: a Triticum 2.0 contig, FALCON contig A, FALCON contig B, and the resulting Triticum 3.1 contig; overlaps labeled >5Kb]

Page 11

Assembly Statistics

Assembly          Number of contigs   Total size (bp)    N50 size (bp)
Triticum 2.0      375,328             14,395,027,822     75,599
FALCON Trit 1.0   97,809              12,939,100,857     215,314
Triticum 3.1      279,439             15,344,693,583     232,659
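
For readers unfamiliar with the N50 column: N50 is the largest length L such that contigs of length L or longer contain at least half of the assembled bases. A minimal sketch of that standard definition follows; the function name `n50` and the toy lengths are illustrative and not taken from the slides.

```python
def n50(lengths):
    """Largest length L such that contigs of length >= L together
    hold at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 needs at least one contig length")


# Toy example: total is 300, and 80 + 70 reaches half of it.
print(n50([80, 70, 50, 40, 30, 20, 10]))
```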

Page 12

Run Time: 100 CPU years

Main Steps    Run time (CPU hrs)   Wall time (months)
MaSuRCA       100K                 1.5
Celera WGS    470K                 5
FALCON        150K                 0.75
ARROW         160K                 0.75
total         880K                 9

100K CPU hrs = 11.5 years
800K CPU hrs = 100 years
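
The conversion from CPU hours to CPU years is plain arithmetic. The sketch below sums the per-step CPU hours from the table and divides by the hours in a year; it assumes a 365-day year with one CPU running continuously, so the results differ slightly from the rounded figures on the slide. The names `HOURS_PER_YEAR`, `CPU_HOURS`, and `cpu_years` are illustrative.

```python
# Assumption: one CPU running around the clock for a 365-day year.
HOURS_PER_YEAR = 24 * 365

# CPU hours per step, from the run-time table.
CPU_HOURS = {
    "MaSuRCA": 100_000,
    "Celera WGS": 470_000,
    "FALCON": 150_000,
    "ARROW": 160_000,
}


def cpu_years(cpu_hours):
    """Convert CPU hours to years of continuous single-CPU time."""
    return cpu_hours / HOURS_PER_YEAR


total_hours = sum(CPU_HOURS.values())
print(f"total: {total_hours} CPU hrs = {cpu_years(total_hours):.1f} CPU years")
print(f"100K CPU hrs = {cpu_years(100_000):.1f} CPU years")
```

With these inputs the per-step hours add up to the 880K total in the table, which corresponds to roughly 100 CPU years.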

Page 13

Genome Repetitiveness

[Figure: k-mer uniqueness ratios for wheat, fly, cow, rice, pine, and Ae. tauschii]
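
The slide does not define the k-mer uniqueness ratio. As a rough illustration only, the sketch below assumes one simple reading: the fraction of k-mer positions in a sequence whose k-mer occurs exactly once. It ignores the reverse strand, and the function name is hypothetical.

```python
from collections import Counter


def kmer_uniqueness_ratio(seq, k):
    """Fraction of k-mer positions in `seq` whose k-mer occurs exactly once
    (an assumed, simplified definition; forward strand only)."""
    n = len(seq) - k + 1
    if n <= 0:
        raise ValueError("sequence shorter than k")
    counts = Counter(seq[i:i + k] for i in range(n))
    unique_positions = sum(1 for c in counts.values() if c == 1)
    return unique_positions / n


# A short repetitive toy sequence: longer k-mers span the repeat,
# so the ratio rises with k.
toy = "ACGTACGT"
print(kmer_uniqueness_ratio(toy, 4), kmer_uniqueness_ratio(toy, 6))
```

In a highly repetitive genome such a ratio stays low until k becomes long enough to span the repeats.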

Page 14

Publication

Page 15

Conclusions

The most challenging genome (we) assembled!

Learning experience!

Assembly quality vs computational resources?

Share your data!

Page 16

Acknowledgements

Johns Hopkins University: Steven Salzberg, Aleksey Zimin

UC Davis Plant Sciences: Jan Dvorak, Mingcheng Luo

Earlham Institute: Bernardo Clavijo