Christopher W. Beitel, Aaron Darling twitter: @koadman

Enjoy Hi-C! - Aaron Darling

Page 1: Enjoy Hi-C! - Aaron Darling

Christopher W. Beitel, Aaron Darling twitter: @koadman

Page 2: Enjoy Hi-C! - Aaron Darling

Genomics revolutionized our understanding of bacteria

Perna et al 2001 Nature, Welch et al 2002 PNAS

Three genomes of the same species: only 40% of genes in common

Page 3: Enjoy Hi-C! - Aaron Darling

Phylogenetic diversity: the known unknowns

● Can amplify and sequence the 16S rRNA gene present in all bacteria and archaea

● Millions of 16S sequences obtained per run

● Align sequences, build evolutionary tree, analyze

Uncultured bacterial diversity greatly exceeds known isolate diversity

Rinke et al 2013, Nature

Page 4: Enjoy Hi-C! - Aaron Darling

Is metagenomics the answer...??

Sample

Smash with bead beater

Purify DNA

Shear DNA to 400nt frags

Prep sequencer library

Sequence (yay!)

Analyze

Metagenomics loses long-range linkage

At least two paths forward:

1. Physically “dissect” microbial communities

2. Better inference methods

3. Nanopore long read sequencing?

Page 5: Enjoy Hi-C! - Aaron Darling

What is Hi-C?

High throughput sequencing of proximity-ligation products

Landmark papers

Chromosome conformation capture

Hi-C: Lieberman-Aiden et al 2009

Single (human) cell Hi-C: Nagano et al 2013

Page 6: Enjoy Hi-C! - Aaron Darling

We propose to use Hi-C for metagenomics

Conjecture: DNA in the same cell will become crosslinked, ligated, and sequenced more frequently (green links) than DNA fragments from different cells (red links)

Cell/Species A Cell/Species B

Page 7: Enjoy Hi-C! - Aaron Darling

A little experiment to test Hi-C

Selected 5 strains, 4 species:

+ E. coli BL21 (BL21)

+ E. coli K12 (K12)

+ Pediococcus pentosaceus (Ped)

+ Burkholderia thailandensis (2 chromosomes: Bur1, Bur2)

+ Lactobacillus brevis (chromosome + 2 plasmids: Lac0, Lac1, Lac2)

Grew up cultures, mixed in equal parts based on OD600

Applied standard Hi-C protocol, sequenced on Illumina MiSeq

~22M read pairs, 160nt reads

Page 8: Enjoy Hi-C! - Aaron Darling

Reconstructing genomes from within metagenomes

A classic problem in metagenomics

Can't de novo assemble Hi-C data due to presence of ligation junctions, other issues

To test reconstruction we instead simulated reads with BioGrinder and assembled them with SOAPdenovo.

Assembly stats:

Cov.: 100

Contigs (scaff.): 7687 (557)

Bp assem.: 15,577,164

N50 (scaff.): 87634

N90 (scaff.): 8552

Max. len.: 379623

Contig. error: 0

Scaff. error: 0

Pct. assem.: 77.37
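
For readers unfamiliar with the N50 and N90 columns, here is a minimal sketch of how such statistics are usually computed. The function name `nx_length` and the toy contig lengths are illustrative assumptions; they are not taken from the talk or from the assembler's output.

```python
def nx_length(lengths, fraction):
    """Return the Nx statistic: the largest length L such that contigs
    of length >= L together hold at least `fraction` of all bases."""
    if not lengths:
        raise ValueError("no contig lengths given")
    target = fraction * sum(lengths)
    running = 0
    # Walk from the longest contig down until the target is reached.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return min(lengths)


# Toy lengths (hypothetical), 300 bases in total.
toy_contigs = [100, 80, 60, 40, 20]
n50 = nx_length(toy_contigs, 0.5)
n90 = nx_length(toy_contigs, 0.9)
print("N50:", n50, "N90:", n90)
```

A large gap between N50 and N90, as in the stats above, means a long tail of short contigs carries the last part of the assembly.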

Page 9: Enjoy Hi-C! - Aaron Darling

A Hi-C scaffold graph

● Nodes are assembly contigs

● Node size ∝ contig size

● Edge weights ∝ normalized Hi-C read counts linking contigs

● Fruchterman-Reingold layout

Page 10: Enjoy Hi-C! - Aaron Darling

A Hi-C scaffold graph

● Nodes colored by SPECIES

● Can we actually compute clusters?

● Markov clustering (inflation I=1.1): 4 clusters, >97% of genome in each bin
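
The clustering step can be sketched in a few lines. This is a bare-bones Markov clustering loop (expansion, then inflation) on a toy contig graph, not the MCL program itself and not the authors' pipeline. The contig names and edge weights are invented and assumed to be already normalized. The sketch uses inflation 2.0 rather than the I=1.1 reported above, because the toy cross-species link is far stronger relative to the within-species links than in real data.

```python
def normalize_columns(m):
    """Scale every column of a square matrix so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def markov_cluster(nodes, weights, inflation=2.0, iterations=50):
    """Cluster an undirected weighted graph with a minimal MCL loop."""
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    m = [[0.0] * n for _ in range(n)]
    for (a, b), w in weights.items():
        m[index[a]][index[b]] = m[index[b]][index[a]] = float(w)
    # Self-loops (set to the strongest incident edge) stabilize the walk.
    for i in range(n):
        m[i][i] = max(m[i]) or 1.0
    m = normalize_columns(m)
    for _ in range(iterations):
        # Expansion: two random-walk steps (matrix squaring).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: element-wise power favors strong connections.
        m = normalize_columns([[v ** inflation for v in row] for row in m])
    # Each column (node) joins the row (attractor) holding most of its mass.
    clusters = {}
    for j in range(n):
        attractor = max(range(n), key=lambda i: m[i][j])
        clusters.setdefault(attractor, []).append(nodes[j])
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical contigs from two species, joined by one weak spurious link.
contigs = ["A1", "A2", "A3", "B1", "B2", "B3"]
links = {
    ("A1", "A2"): 50, ("A2", "A3"): 45, ("A1", "A3"): 30,
    ("B1", "B2"): 40, ("B2", "B3"): 35, ("B1", "B3"): 20,
    ("A3", "B1"): 2,
}
clusters = markov_cluster(contigs, links)
print(clusters)
```

Lower inflation values give coarser clusters; on a graph this small a value near 1.1 would be expected to merge everything, so the parameter has to be tuned to the data.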

Page 11: Enjoy Hi-C! - Aaron Darling

Contig clustering: why it works

● 99% of Hi-C read pairs associate within-species

● Hi-C read pairs associate throughout the chromosome

● Spans distances that no existing or imaginary “long read” technology can cover

Figure: Hi-C read pair insert distributions. x-axis: insert distance in nucleotides (0 to 2,500,000). Series: L. brevis ATCC367, P. pentosaceus ATCC25745, E. coli K12 DH10B, E. coli BL21(DE3), B. thailandensis E264 chrI, B. thailandensis E264 chrII

Figure: Rate at which Hi-C reads associate within and across species
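
Both quantities on this slide are easy to compute once read pairs are mapped back to the reference replicons. The sketch below is illustrative only: the replicon labels follow the abbreviations introduced earlier in the talk, but the read pairs are invented, and treating each replicon as circular is an assumption.

```python
# Replicon label -> species, using the abbreviations from the experiment slide.
SPECIES_OF = {
    "BL21": "E. coli", "K12": "E. coli",
    "Ped": "P. pentosaceus",
    "Bur1": "B. thailandensis", "Bur2": "B. thailandensis",
    "Lac0": "L. brevis", "Lac1": "L. brevis", "Lac2": "L. brevis",
}


def within_species_fraction(pairs):
    """Fraction of read pairs whose two ends map to the same species.

    Each pair is ((replicon, position), (replicon, position)).
    """
    same = sum(1 for (r1, _), (r2, _) in pairs
               if SPECIES_OF[r1] == SPECIES_OF[r2])
    return same / len(pairs)


def circular_insert_distance(pos1, pos2, replicon_length):
    """Shortest distance between two positions on a circular replicon."""
    d = abs(pos1 - pos2)
    return min(d, replicon_length - d)


# Hypothetical mapped read pairs (not data from the experiment).
toy_pairs = [
    (("K12", 1_000), ("K12", 2_300_000)),   # same replicon, far apart
    (("Bur1", 500), ("Bur2", 90_000)),      # two chromosomes, one species
    (("Lac0", 42), ("Lac1", 7)),            # chromosome to plasmid
    (("Ped", 10), ("K12", 10)),             # cross-species pair
]
fraction = within_species_fraction(toy_pairs)
wrapped = circular_insert_distance(1_000, 4_599_000, 4_600_000)
print(fraction, wrapped)
```

Note that pairs linking Bur1 to Bur2, or a Lactobacillus plasmid to its chromosome, count as within-species here, which is exactly the signal that lets Hi-C group replicons from the same cell.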

Page 12: Enjoy Hi-C! - Aaron Darling

Resolving strain differences: Single nucleotide variants

- SNPs between E. coli K12 and BL21 are nodes

- Edges link SNPs observed in same read pair

Are Hi-C graphs better connected than mate-pair graphs?

Largest connected component, mate pair (MP) versus Hi-C:

5k MP: 6.2%

10k MP: 16.6%

20k MP: 32.4%

Hi-C: 97.8%
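
The "largest connected component" figure can be computed with a small union-find over the SNP graph. This is a minimal sketch with invented SNP positions and edges; `largest_component_fraction` is a hypothetical helper name, not code from the study.

```python
from collections import Counter


def largest_component_fraction(nodes, edges):
    """Fraction of nodes that fall in the largest connected component."""
    if not nodes:
        raise ValueError("graph has no nodes")
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)
    sizes = Counter(find(n) for n in nodes)
    return max(sizes.values()) / len(nodes)


# Hypothetical SNP positions; each edge is a read pair covering two SNPs.
snps = [10, 200, 5_000, 90_000, 1_500_000]
short_range_edges = [(10, 200), (200, 5_000)]
long_range_edges = short_range_edges + [(5_000, 1_500_000), (90_000, 10)]

short_fraction = largest_component_fraction(snps, short_range_edges)
long_fraction = largest_component_fraction(snps, long_range_edges)
print(short_fraction, long_fraction)
```

In the toy data, adding two long-range links is enough to pull every SNP into one component, which mirrors the jump from the mate-pair columns to the Hi-C column above.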

Page 13: Enjoy Hi-C! - Aaron Darling

Resolving strain differences: Hi-C versus mate-pair

Mate pair graph distances scale linearly.

Hi-C graphs are scale-free.

Estimation error in probability of variant linkage grows with path length!

Distance on genome versus path length in variant graph
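
One way to see why error grows with path length is a toy model, which is an assumption made here for illustration and not the analysis behind the slide. Suppose each edge reports whether two SNP alleles belong to the same strain and is wrong independently with probability `edge_error`. Chaining `path_length` edges then gives the right answer only when an even number of edges are wrong.

```python
from math import comb


def linkage_accuracy(edge_error, path_length):
    """Closed form for P(even number of wrong edges along the path)."""
    return 0.5 * (1.0 + (1.0 - 2.0 * edge_error) ** path_length)


def linkage_accuracy_bruteforce(edge_error, path_length):
    """Same probability, summed directly over the binomial distribution."""
    return sum(
        comb(path_length, k)
        * edge_error ** k
        * (1.0 - edge_error) ** (path_length - k)
        for k in range(0, path_length + 1, 2)
    )


# Accuracy decays toward 0.5 (a coin flip) as the path gets longer.
for hops in (1, 2, 5, 20):
    print(hops, round(linkage_accuracy(0.1, hops), 4))
```

Under this model, a graph with short paths between distant variants keeps accuracy high, while a graph whose path length scales with genomic distance drifts toward a coin flip.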

Page 14: Enjoy Hi-C! - Aaron Darling

Opportunities to collaborate

1. Species clustering of assemblies

2. Resolving strain haplotypes

3. Application to microbial communities!

4. Improvement of the Hi-C protocol

5. Population metagenomics – track gene flow of, e.g. antibiotic resistance plasmids

6. Your idea (and some grant funding) here

Page 15: Enjoy Hi-C! - Aaron Darling

Thanks to…

Jonathan Eisen, UC Davis

Jenna Lang, UC Davis

Christopher Beitel, UC Davis

Lutz Froenicke

Richard Michelmore

Ian Korf

Matthew DeMaere

Catherine Burke

Manuscript preprint online at https://peerj.com/preprints/260/