Daniela Puiu at #ICG12: The first near-complete assembly of the hexaploid bread wheat genome,...

The first near-complete assembly of the hexaploid bread wheat genome, Triticum aestivum

Daniela Puiu, Aleksey Zimin, Richard Hall, Sarah Kingan, Bernardo Clavijo, Steven Salzberg

ICG-12, Oct 27 2017

ICG-12, The Wheat Genome

Sequencing and Assembly of the Ancestral and Common Wheat

Aegilops tauschii ssp. strangulata, accession AL8/78
Chinese Spring variety (CS42, accession Dv418)

2013-2017

History of Wheat

~8,000 years ago: spontaneous hybridization

Emmer Wheat + Goat grass = Bread Wheat (World's 3rd cereal crop)
Triticum turgidum + Aegilops tauschii = Triticum aestivum
AABB + DD = AABBDD

Whole Genome => Assisted Breeding => Improved Yield

The Wheat Genome

One of the most complex genomes!

1) Genome size: over 15 billion bases
2) Allohexaploid: six copies of each chromosome
3) >90% repeats

Multiple past attempts to assemble => assemblies shorter than the estimated genome size.

New vs Previous Assemblies

Triticum 3.1: N50 = 232K

Data Reduction

Original Reads   Number   Sum     Coverage   Accuracy
Illumina         7.06G    1Tb     65x        99.5%
PacBio           55.5M    545Gb   36x        87.5%

Processed Seq    Number   Sum     Coverage   Accuracy
super-reads      95.7M    31Gb    2x         99.95%
mega-reads       57M      278Gb   18x        99.65%

MaSuRCA mega-reads hybrid correction
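
The Coverage column above can be checked with simple arithmetic: fold coverage is total sequenced bases divided by genome size. The sketch below is illustrative only. It assumes the genome size equals the Triticum 3.1 total reported on the Assembly Statistics slide, and it uses the rounded sums from the table, so the results are approximate.

```python
# Assumption: genome size taken from the Triticum 3.1 total on the
# Assembly Statistics slide; the talk only says "over 15 billion bases".
GENOME_SIZE_BP = 15_344_693_583

# Total bases per data set, as rounded on the slide (1Tb = 1e12, 1Gb = 1e9).
total_bases = {
    "Illumina": 1e12,
    "PacBio": 545e9,
    "super-reads": 31e9,
    "mega-reads": 278e9,
}


def coverage(bases: float, genome_size: float = GENOME_SIZE_BP) -> float:
    """Return fold coverage: sequenced bases divided by genome size."""
    return bases / genome_size


for name, bases in total_bases.items():
    print(f"{name:12s} {coverage(bases):5.1f}x")
```

Rounded to whole numbers, these values agree with the 65x, 36x, 2x and 18x shown on the slide.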

MaSuRCA mega-reads Correction

Assembly Pipeline

Hybrid assembly: Illumina (+ PacBio) -> MaSuRCA Correction -> Mega-reads -> Celera WGS Assembler -> Triticum 1.0 -> Remove Duplicates -> Triticum 2.0

PacBio assembly: PacBio -> FALCON Correction -> pReads -> FALCON Assembler -> FALCON Trit 0.5 -> Arrow Polishing -> FALCON Trit 1.0

Triticum 2.0 + FALCON Trit 1.0 -> k-mer Analysis -> Merge -> Triticum 3.1

k-mer Analysis

[Plot: 31-mer frequencies, axis ticks from 10M to 50M; highlighted series: k-mers missing from the PacBio assembly only]
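
As a rough illustration of the comparison named on this slide, the sketch below lists k-mers that occur in one set of sequences but not in another. It is a toy version under stated assumptions: the function names `kmers` and `missing_kmers` are hypothetical, only the forward strand is considered, and everything is held in memory. A 15 Gb genome would need a dedicated k-mer counter, and the talk does not say which tool was used.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all k-mers in seq (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def missing_kmers(reference_seqs, query_seqs, k=31):
    """Return k-mers present in reference_seqs but absent from query_seqs."""
    ref = set().union(*(kmers(s, k) for s in reference_seqs))
    qry = set().union(*(kmers(s, k) for s in query_seqs))
    return ref - qry


# Toy input with k=4 instead of 31 so the result is easy to read.
hybrid_contigs = ["ACGTACGTTT"]   # stands in for the hybrid assembly
pacbio_contigs = ["ACGTACG"]      # stands in for the PacBio-only assembly
print(sorted(missing_kmers(hybrid_contigs, pacbio_contigs, k=4)))
```

With the toy input, the two k-mers at the end of the longer contig are reported as missing from the shorter one.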

Assembly Merge

Merging of the Hybrid and PacBio assemblies

[Figure: a Triticum 2.0 contig, FALCON contigA, FALCON contigB, and the resulting Triticum 3.1 contig; three regions labeled >5Kb]

Assembly Statistics

Assembly          Number    Total size (bp)   N50 size (bp)
Triticum 2.0      375,328   14,395,027,822    75,599
FALCON Trit 1.0   97,809    12,939,100,857    215,314
Triticum 3.1      279,439   15,344,693,583    232,659
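
For readers unfamiliar with the N50 column, the sketch below shows the usual way the statistic is computed: sort sequence lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. The lengths in the example are made up for illustration and are not taken from the wheat assemblies.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths.

    N50 is the largest length L such that sequences of length >= L
    together contain at least half of the total bases.
    """
    if not lengths:
        raise ValueError("n50() requires at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Made-up contig lengths: total is 290, so the halfway point is 145.
toy_lengths = [80, 70, 50, 40, 30, 20]
print(n50(toy_lengths))  # 80 + 70 = 150 >= 145, so N50 is 70
```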

Run Time: 100 CPU years

Main Steps    RunTime (CPU hrs)   WallTime (months)
MaSuRCA       100K                1.5
Celera WGS    470K                5
FALCON        150K                0.75
ARROW         160K                0.75
total         880K                9

100K CPU hrs = 11.5 years
800K CPU hrs = 100 years
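
The conversion from CPU hours to CPU years is plain division. The sketch below assumes a 365-day year (8,760 hours) and uses the per-step figures from the table. Under that assumption the 880K total works out to roughly 100 CPU years, matching the slide title, and 100K CPU hours to a little over 11 years, close to the slide's rounded value.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; assumes a non-leap year

# CPU hours per step, as listed in the table.
cpu_hours = {
    "MaSuRCA": 100_000,
    "Celera WGS": 470_000,
    "FALCON": 150_000,
    "ARROW": 160_000,
}


def cpu_years(hours: float) -> float:
    """Convert CPU hours to CPU years."""
    return hours / HOURS_PER_YEAR


total_hours = sum(cpu_hours.values())
print(f"total: {total_hours:,} CPU hrs = {cpu_years(total_hours):.1f} CPU years")
print(f"100K CPU hrs = {cpu_years(100_000):.1f} CPU years")
```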

Genome Repetitiveness

[Plot: k-mer uniqueness ratios for WHEAT, FLY, COW, RICE, PINE and Ae. tauschii]
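
To make the plotted quantity concrete, the sketch below computes a k-mer uniqueness ratio for a short sequence. The slide does not state the exact definition used, so this is an assumption: the ratio here is the fraction of distinct k-mers that occur exactly once. The function name `kmer_uniqueness_ratio` and the toy sequence are illustrative only.

```python
from collections import Counter


def kmer_uniqueness_ratio(seq: str, k: int) -> float:
    """Fraction of distinct k-mers in seq that occur exactly once.

    Assumed definition; the slide does not say which one was plotted.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    if not counts:
        raise ValueError("sequence is shorter than k")
    unique = sum(1 for c in counts.values() if c == 1)
    return unique / len(counts)


# A toy repetitive sequence: the ratio rises as k grows.
toy = "ACACACGT"
for k in (2, 4):
    print(k, kmer_uniqueness_ratio(toy, k))
```

In a repeat-rich genome the ratio stays low until k exceeds the length of the common repeats, which is why such curves are used to compare repetitiveness across species.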

Publication

Conclusions

The most challenging genome (we) assembled!

Learning experience!

Assembly quality vs computational resources?

Share your data!

Acknowledgements

Johns Hopkins University: Steven Salzberg, Aleksey Zimin

UC Davis Plant Sciences: Jan Dvorak, Mingcheng Luo

Earlham Institute: Bernardo Clavijo
