42
De novo whole genome assembly Qi Sun Bioinformatics Facility Cornell University

De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

De novo whole genome assembly

Qi Sun

Bioinformatics FacilityCornell University

Page 2: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

• Short reads:

o Illumina (150 bp, up to 300 bp)

• Long reads (>10kb):

o PacBio SMRT;

o Oxford Nanopore

Sequencing platforms

Page 3: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

• Illumina: Mate pair

• 10X Chromium: Linked Reads

• BioNano: Optical Mapping:

• Dovetail; Phase Genomics: Hi-C

• NRGene: proprietary DeNovoMAGIC software

• Long-read platforms (PacBio or Nanopore)

Contiging/Scaffolding Platforms

Page 4: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

DNA fragment length vs

Sequencing read length

DNA fragment

Sequencing Read: ACGGGAGGGACCCG…

Page 5: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

Short-read Sequencing Platform: Illumina

Paired-endFragment: <500bp fragmentRead: 100-250 bp

Mate pair readsFragment: 5kb, 8kb or 15kbRead: 50 bp

Page 6: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

NNNScaffolds

Contigs

Illumina Reads Assembly Strategy

Paired-end reads

Mate-pair reads

NNNN

Contiging

Scaffolding

Page 7: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

Fragment: >>10kbRead: 10-20 kb

Long-read Sequencing Platform: PacBio SMRT

2017 2018

From PacBio web site

Page 8: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

Chin et al.(2013) Nature Methods 10:563

PacBio: High error rate (>10%)

Error correction methods• Self error correction (100x depth)• By Illumina reads

Page 9: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

MinION

Oxford Nanopore

Error rate is equivalent or better than PacBio but much cheaper

Page 10: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

From contiging to Scaffolding

Contigs

Using Mate-pair

Using Long reads

Using Physical map: optical map & Hi-C

Page 11: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

Scaffolding strategies. Physical maps

Hi-C: Sequence cross-linked and ligated DNA fragments

Optical Mapping:Generating high-quality genome maps by labeling specific 7-mer nickaserecognition sites in a genome with a single-color fluorophore

Dovetail & Phased GenomicsBioNano

Page 12: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

10X Genomics

• < $5000 for assembled and phased genome;

• Large scaffold, capture the gene region very well

Page 13: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

Genome assembly strategies

•The Goal Gene space vs whole genome;

Contig vs Scaffold vs pseudo-chromosome;

Resolving paralogs?

Resolving haploid?

•The price

Page 14: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

Genome assembly strategies

Bacterial genomes (<10Mb) : Illumina paired-end reads

Eucaryotic genomes (<1Gb):

Ultimate high quality chromosomal level assembly: PacBio/Nanopore + BioNano + Dovetail/PhaseGenomics

+ Genetic linkage map

• PacBio• Nanopore• 10x

Page 15: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

Two categories of contiging strategies

overlap–layout–consensus de-bruijn-graph

source: http://www.homolog.us/Tutorials/index.php

Velvet, Soap-denovo, Abyss, Trinity et alCanu, Falcon, Celera et al.

Long reads Short reads

Page 16: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

de-bruijn-graph for contiging short reads

source: http://www.homolog.us/Tutorials/index.php

Kmers

Page 17: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

GGATGGAAGTCG…………… CGATGGAAGGATParalogous regions

GATGGAA ATGGAAG

TGGAAGT

TGGAAGG

GGATGGAAGT

CGATGGAAGG

GATGGAAGT

GATGGAAGG

7merGGATTGGA

CCGATTGGA

9mer

Kmer Paths in paralogous regions

Page 18: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

Sequencing errors and

Tips, bubbles and crosslinks

source: http://www.homolog.us/Tutorials/index.php

Tips

Bubbles

Crosslinks

Page 19: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

Deal with sequencing errors and repetitive regions

1. Sequencing errors• Remove low depth kmers in a bubble;• Too long kmers would cause coverage problem;

2. Repetitive region• Longer kmers

Tiny repeat:Separate the path

Break boundary between low and high copy regions

Page 20: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

Examine a de-bruijn graph assembly software: Soap denovo

Page 21: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

Read depth vs Kmer depth

Read depth: number of reads at a genome position

Kmer depth: number of occurrence of an identical kmer

Read depth: 4

Kmeroccurance: 3

Page 22: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

Impact of kmer-length (2)Read depth vs Kmer depth

30mer80mer

Kmer depth =3 Kmer depth =1

Longer kmers could results in too little kmer depth

Read depth: 3

Page 23: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

Estimate genome size based on kmer distribution

Kmer depth -> Read depth -> Genome size

Page 24: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

Estimate genome size based on kmer distribution

N = M* L/(L-K+1)

Step 1: convert read depth to kmer depth

M: kmer depth = 112L: read length = 101 bpK: Kmer size =21 bpN: read depth =140

Step 2: genome size is total sequenced basepairsdevided by read depth

Genome size = T/N

T: total base pairs = 0.505 gbN: read depth = 140Genome size: 3.6 mb

75 merRead depth = 3Kmer depth = 2

Page 25: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

ErrorCorrectReads.pl \

PAIRED_READS_A_IN=R1.fastq.gz \

PAIRED_READS_B_IN=R2.fastq.gz \

KEEP_KMER_SPECTRA=1 \

PHRED_ENCODING=33 \

PLOIDY=1 \

READS_OUT=corrected_out \

>& report.log &

Estimate genome size withErrorCorrectReads.pl from ALLPATHS-LG

Page 26: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

Heterozygous genomes

Experimentally:

• Create inbred lines or haploid cell culture.• Assembly of clonal fragments, merging allelic regions.

Assembly

• SNP: merge bubbles• Highly polymorphic regions

Highly polymorphic regions

Kmer coverage distribution

Page 27: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

Planning check list of whole genome assembly

• Sequencing platform;

• High quality DNA-extraction;

• Sequencing Read length and read depth;

• Contiging software;

• Scaffolding stragety software;

• Gene annotation strategy (RNA-seqrecommended)

Page 28: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

Genome Assembly Software for Different Technology Platforms

PacBio/NanoporeCanuFalcon

10xSuperNova

IlluminaSoap DenovoMaSuRCADiscovarPlatinus

Page 29: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

Workflow of de novo assembly

• Clean sequencing data as required;(trim adapter and low quality sequences)

• Contiging, scaffolding, gap filling, polishing

• Evaluation of assembly;

Page 30: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

Polishing Software

Pilon

Racon

Scaffolding and Polishing

Commercial Serivce

BioNano

Dovetail

Phase Genomics

Hybrid data

Using long reads for gap filling and scaffolding:

e.g. SPAdes; PBJelly2

Using short reads to correct errors* Tend to cause errors

Page 31: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

SOAPdenovo-127mer all -s config.txt -K 101 -R -o assembly

SOAPdenovo-127mer all -s config.txt -K 127 -R -o assembly

Test different kmer sizes

SOAP denovo2 on BioHPC Lab

/programs/SOAPdenovo2/SOAPdenovo-63mer [options]/programs/SOAPdenovo2/SOAPdenovo-127mer [options]

Two binary codes with max kmer size 63 or 127

Running de-bruijn-graph assembly software

Page 32: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

//Start

k <- kmin (kmin is set at graph construction ‘pregraph’ step);

Construct initial de Bruijn graph with kmin;

Remove low depth k-mers and cut tips;

Merge bubbles of the de Bruijn graph;

Repeat {

k <- k + 1;

Get contig graph Hk from previous loop or construct from de Bruijn graph;

Map reads to Hk and remove the reads already represented in the graph;

Construct Hk+1 graph base on Hk graph and the remaining reads with k;

Remove low depth edges and weak edges in Hk;

} Stop if k >= kmax(kmax = k set in contig step(-m));

Cut tips and merge bubbles;

Output all contigs;

//End

Multi-Kmer Approach in Soap Denovo2

• Start building graph with small kmer; • Iteratively rebuild kmer by mapping larger kmers to previous graph

Page 33: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

#maximal read lengthmax_rd_len=101[LIB]#average insert sizeavg_ins=300#if sequence needs to be reversedreverse_seq=0#in which part(s) the reads are usedasm_flags=3#in which order the reads are used while scaffoldingrank=1# cutoff of pair number for a reliable connection (at least 3 for short insert size)pair_num_cutoff=3#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)map_len=32#a pair of fastq file, read 1 file should always be followed by read 2 fileq1=r1.fastqq2=r2.fastq

SOADdenovo config file

/programs/SOAPdenovo2/SOAPdenovo-127mer all -s config.txt -K 127 -R -o assembly

Page 34: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

Evaluating genome assembly

• Completeness

• Contig/scaffold length

• Contig/scaffold quality• Chimeric contigs/scaffolds;• Collapsed paralogs;

• Estimated genome size vs assembly size;• Gene space completeness: BUSCO

• N50

Page 35: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

Evaluate the completeness in Genespace

BUSCO gene sets: single-copy orthologs in at least 90% of the species in each lineage.

Arthropods Vertebrates Fungi BacteriaMetazoans Ekaryotes Plants

BUSCO

Page 36: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

BUSCO assessment workflow

Sima˜o et al. Bioinformatics, 2015, 1–3

Page 37: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

Sima˜o et al. Bioinformatics, 2015, 1–3

BUSCO Output

C:complete D:duplicated F:fragmented M:missing(Report % of genes in each category)

Page 38: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

Run BUSCO on BioHPC Lab

cp -r /programs/augustus-3.2.1 ./

# set PATH for required software

export AUGUSTUS_CONFIG_PATH=/workdir/XXX/augustus.2.5.5/config

export PATH=/programs/hmmer/binaries:/programs/emboss/bin:$PATH

export PATH=/workdir/XXX/augustus-3.2.1/bin:$PATH

python3 /programs/BUSCO_v1.2/BUSCO_v1.2.py -o SAMPLE -in

assembly.fa -l lineage_db -m genome

https://biohpc.cornell.edu/lab/userguide.aspx?a=software&i=255#c

Page 39: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

Evaluation of Genome assembly 1Metrics for contig length

N50 and L50 *

N50 50% (base pairs) of the assemblies are contigs above this size.

L50 Number of contigs greater than the N50 length.

NG50 and LG50

N50 is calculated based on assembly size. NG50 is calculated based on estimated genome size.

Page 40: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

10X PACBio

Comparing synteny of ten biggest contigs from 10X and PACBioassembly of Cinerea

Evaluate scaffold quality based on alignment to a closely related genome

Cheng Zou

Page 41: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

Evaluate scaffolding quality based on genetic mapping

Fei Lu, Buckler lab http://www.nature.com/ncomms/2015/150416/ncomms7914/full/ncomms7914.html

Use mapped GBS sequence tags to evaluate each contig

Page 42: De novo whole genome assembly - Cornell UniversityRead: 10-20 kb Long-read Sequencing Platform: PacBio SMRT 2017 2018 ... Soap Denovo MaSuRCA Discovar Platinus. Workflow of de novo

From assembled genome to annotated genome

Procaryotic genomes Eucaryotic genomes

Genome annotation servers (web based)

1. RAST2. NCBI

Gene prediction pipeline: Maker

Function annotation pipeline: Blast2GO