Comparative Microbial Genomics group
Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark

Too Much Data - Cautionary Tales of Next-generation and Next-next Generation Sequencing

Dave Ussery

Workshop on Comparative Genomics
King Mongkut's University of Technology Thonburi
Bangkok, Thailand

1st talk for Tuesday, 8 March 2010

Page 4

www.cbs.dtu.dk

Page 5

Outline
http://www.cbs.dtu.dk/courses/thaiworkshop2010/

Page 6

Subject: What's new for "complete genome" in PubMed
From: My NCBI <[email protected]>
Date: 7 March, 2010 1:20:10 AM GMT+07:00
To: Dave Ussery <[email protected]>

Sender's message:
Sent on Saturday, 2010 Mar 06
Search: "complete genome"

PubMed Results
Items 1 - 12 of 12

2. Genomic Structure of an Economically Important Cyanobacterium, Arthrospira (Spirulina) platensis NIES-39.

Fujisawa T, Narikawa R, Okamoto S, Ehira S, Yoshimura H, Suzuki I, Masuda T, Mochimaru M, Takaichi S, Awai K, Sekine M, Horikawa H, Yashiro I, Omata S, Takarada H, Katano Y, Kosugi H, Tanikawa S, Ohmori K, Sato N, Ikeuchi M, Fujita N, Ohmori M.

DNA Res. 2010 Mar 4. [Epub ahead of print]

PMID: 20203057 [PubMed - as supplied by publisher]

Page 7

Genomic Structure of an Economically Important Cyanobacterium, Arthrospira (Spirulina) platensis NIES-39

Takatomo Fujisawa¹, Rei Narikawa², Shinobu Okamoto³, Shigeki Ehira⁴, Hidehisa Yoshimura², Iwane Suzuki⁵, Tatsuru Masuda², Mari Mochimaru⁶, Shinichi Takaichi⁷, Koichiro Awai⁸, Mitsuo Sekine¹, Hiroshi Horikawa¹, Isao Yashiro¹, Seiha Omata¹, Hiromi Takarada¹, Yoko Katano¹, Hiroki Kosugi¹, Satoshi Tanikawa¹, Kazuko Ohmori⁹, Naoki Sato², Masahiko Ikeuchi², Nobuyuki Fujita¹,*, and Masayuki Ohmori⁴

¹ Bioresource Information Center, Department of Biotechnology, National Institute of Technology and Evaluation (NITE), 2-10-49 Nishihara, Shibuya-ku, Tokyo 151-0066, Japan
² Department of Life Sciences (Biology), The University of Tokyo, 3-8-1 Komaba, Meguro-ku, Tokyo 153-8902, Japan
³ Database Center for Life Science, Research Organization of Information and Systems, 2-11-6 Yayoi, Bunkyo-ku, Tokyo 113-0032, Japan
⁴ Department of Biological Sciences, Faculty of Science and Engineering, Chuo University, 1-13-27 Kasuga, Bunkyo-ku, Tokyo 112-8551, Japan
⁵ Graduate School of Life and Environmental Sciences, University of Tsukuba, Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki 305-8572, Japan
⁶ Natural Science Faculty, Komazawa University, 1-23-1 Komazawa, Setagaya-ku, Tokyo 154-8525, Japan
⁷ Biological Laboratory, Nippon Medical School, Kosugi-cho 2, Nakahara-ku, Kawasaki 211-0063, Japan
⁸ Division of Global Research Leaders, Shizuoka University, 836 Ohya, Suruga-ku, Shizuoka 422-8529, Japan
⁹ Department of Life Sciences, Showa Women’s University, 1-7 Taishido, Setagaya-ku, Tokyo 154-8533, Japan

* To whom correspondence should be addressed. Tel. +81 3-3481-1933. Fax. +81 3-3481-8424. E-mail: [email protected]

Edited by Katsumi Isono
(Received 1 December 2009; accepted 11 January 2010)

Abstract
A filamentous non-N₂-fixing cyanobacterium, Arthrospira (Spirulina) platensis, is an important organism for industrial applications and as a food supply. Almost the complete genome of A. platensis NIES-39 was determined in this study. The genome structure of A. platensis is estimated to be a single, circular chromosome of 6.8 Mb, based on optical mapping. Annotation of this 6.7 Mb sequence yielded 6630 protein-coding genes as well as two sets of rRNA genes and 40 tRNA genes. Of the protein-coding genes, 78% are similar to those of other organisms; the remaining 22% are currently unknown. A total 612 kb of the genome comprise group II introns, insertion sequences and some repetitive elements. Group I introns are located in a protein-coding region. Abundant restriction-modification systems were determined. Unique features in the gene composition were noted, particularly in a large number of genes for adenylate cyclase and haemolysin-like Ca²⁺-binding proteins and in chemotaxis proteins. Filament-specific genes were highlighted by comparative genomic analysis.

Key words: cyanobacteria; Arthrospira; health supplement; genome; cAMP

1. Introduction

Cyanobacteria are prokaryotes that perform oxygenic photosynthesis and constitute a large taxonomic group within the domain of eubacteria. Cyanobacteria are divided morphologically (unicellular or filamentous) or functionally (N₂-fixing and non-N₂-fixing). Filamentous species are subdivided into those with and without a heterocyst which is a differentiation from vegetative cells for fixing nitrogen.¹,²

© The Author 2010. Published by Oxford University Press on behalf of Kazusa DNA Research Institute. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

DNA RESEARCH pp. 1–19, (2010) doi:10.1093/dnares/dsq004

DNA Research Advance Access published March 4, 2010


Page 8

The genome sequence and annotation of A. platensis NIES-39 are available at GenBank/EMBL/DDBJ under accession no. AP011615

Page 9

~40 students, with 7 genomes per student.

• What can be included in the presentations on Friday?

1. Table of related genomes finished and in GenBank
2. 16S rRNA tree
3. BLAST Matrix
4. Core- and pan-genome plot
5. BLAST Atlas

[Figure: genome atlas of E. coli K-12 W3110 (4,641,433 bp). Labels: Origin, Terminus, rrlA, rrlB, rrlC, rrlD, rrlE, rrlG, rrlH. Position ticks: 0M to 4M in steps of 0.5M.]

[Figure: pan- and core-genome plot for 20 Campylobacter genomes. X axis: genomes 1 to 20. Y axis: 0 to 7,000. Series: New genes, New gene families, Core genome, Pan genome.]

Genomes in plot order:
1: Campylobacter jejuni 269.97
2: Campylobacter jejuni RM1221
3: Campylobacter jejuni CG8486
4: Campylobacter jejuni CF93-6
5: Campylobacter jejuni 84-25
6: Campylobacter jejuni CG8421
7: Campylobacter jejuni 81-176A
8: Campylobacter jejuni 81-176B
9: Campylobacter jejuni 260.94
10: Campylobacter jejuni NCTC 11168
11: Campylobacter jejuni HB93-13
12: Campylobacter jejuni 81116
13: Campylobacter jejuni M1
14: Campylobacter coli RM2228
15: Campylobacter lari RM2100
16: Campylobacter concisus 13826
17: Campylobacter fetus 82-40
18: Campylobacter upsaliensis RM3195
19: Campylobacter hominis ATCC BAA-381
20: Campylobacter curvus 525.92

Proteome comparison of Campylobacter proteomes: conserved protein families (BLAST matrix)

Colour scale, homology between proteomes: 92.3 % to 59.2 %
Colour scale, homology within proteomes: 3.7 % to 0.6 %

Genomes (a dash marks a length left blank on the slide):

No.  Genome                    PID     Length (nt)  Proteins  Families  Homology within proteome
1    C. jejuni doylei 267.97   17163   1,845,106    1,911     1,824     3.7 %  (67 / 1,824)
2    C. jejuni RM1221          303     -            1,838     1,799     1.6 %  (29 / 1,799)
3    C. jejuni CG8486          17055   -            1,425     1,414     0.6 %  (8 / 1,414)
4    C. jejuni CF93-6          16265   -            1,756     1,728     1.3 %  (23 / 1,728)
5    C. jejuni 84-25           16367   -            1,748     1,720     1.4 %  (24 / 1,720)
6    C. jejuni CG8421          21037   -            1,512     1,495     1.0 %  (15 / 1,495)
7    C. jejuni 81-176A         16135   -            1,758     1,728     1.4 %  (25 / 1,728)
8    C. jejuni 81-176B         17341   -            1,690     1,662     1.4 %  (23 / 1,662)
9    C. jejuni 260.94          16229   -            1,716     1,688     1.3 %  (22 / 1,688)
10   C. jejuni NCTC 11168      8       1,641,481    1,624     1,596     1.3 %  (21 / 1,596)
11   C. jejuni HB93-13         16267   -            1,708     1,676     1.3 %  (21 / 1,676)
12   C. jejuni 81116           17953   -            1,626     1,598     1.3 %  (21 / 1,598)
13   C. jejuni M1              ???     -            1,627     1,601     1.2 %  (19 / 1,601)

Homology between proteomes (genome numbers as above; percentage, then the two counts shown in the matrix cell):

1 vs 2    60.7 %   1,383 / 2,280
1 vs 3    59.3 %   1,223 / 2,064
1 vs 4    61.4 %   1,367 / 2,226
1 vs 5    62.0 %   1,374 / 2,217
1 vs 6    60.6 %   1,271 / 2,099
1 vs 7    59.2 %   1,341 / 2,267
1 vs 8    60.6 %   1,335 / 2,203
1 vs 9    60.0 %   1,336 / 2,228
1 vs 10   64.2 %   1,357 / 2,113
1 vs 11   59.8 %   1,334 / 2,229
1 vs 12   61.0 %   1,321 / 2,167
1 vs 13   60.5 %   1,315 / 2,174

2 vs 3    68.5 %   1,312 / 1,915
2 vs 4    74.4 %   1,512 / 2,031
2 vs 5    74.9 %   1,512 / 2,019
2 vs 6    71.4 %   1,382 / 1,935
2 vs 7    68.8 %   1,449 / 2,106
2 vs 8    68.5 %   1,416 / 2,068
2 vs 9    73.8 %   1,487 / 2,016
2 vs 10   74.8 %   1,460 / 1,953
2 vs 11   70.0 %   1,440 / 2,058
2 vs 12   71.7 %   1,428 / 1,991
2 vs 13   69.9 %   1,408 / 2,014

3 vs 4    74.4 %   1,342 / 1,804
3 vs 5    75.1 %   1,346 / 1,793
3 vs 6    76.4 %   1,261 / 1,651
3 vs 7    70.7 %   1,306 / 1,847
3 vs 8    71.5 %   1,285 / 1,798
3 vs 9    73.5 %   1,316 / 1,790
3 vs 10   80.6 %   1,345 / 1,668
3 vs 11   72.9 %   1,306 / 1,792
3 vs 12   74.8 %   1,291 / 1,727
3 vs 13   73.4 %   1,278 / 1,742

4 vs 5    88.2 %   1,616 / 1,832
4 vs 6    74.6 %   1,380 / 1,851
4 vs 7    75.2 %   1,487 / 1,977
4 vs 8    75.0 %   1,455 / 1,939
4 vs 9    80.3 %   1,524 / 1,897
4 vs 10   88.0 %   1,557 / 1,770
4 vs 11   77.9 %   1,493 / 1,916
4 vs 12   76.6 %   1,447 / 1,888
4 vs 13   75.7 %   1,437 / 1,898

5 vs 6    76.1 %   1,393 / 1,830
5 vs 7    76.4 %   1,498 / 1,962
5 vs 8    76.1 %   1,465 / 1,924
5 vs 9    76.6 %   1,481 / 1,934
5 vs 10   89.1 %   1,564 / 1,755
5 vs 11   77.8 %   1,490 / 1,916
5 vs 12   76.6 %   1,444 / 1,884
5 vs 13   76.0 %   1,439 / 1,893

6 vs 7    73.7 %   1,373 / 1,864
6 vs 8    74.4 %   1,351 / 1,816
6 vs 9    73.9 %   1,358 / 1,837
6 vs 10   81.6 %   1,395 / 1,709
6 vs 11   74.7 %   1,362 / 1,824
6 vs 12   77.7 %   1,360 / 1,750
6 vs 13   76.5 %   1,348 / 1,762

7 vs 8    84.6 %   1,556 / 1,840
7 vs 9    78.6 %   1,506 / 1,917
7 vs 10   79.5 %   1,477 / 1,859
7 vs 11   85.2 %   1,570 / 1,842
7 vs 12   79.0 %   1,474 / 1,867
7 vs 13   77.4 %   1,458 / 1,884

8 vs 9    78.4 %   1,473 / 1,880
8 vs 10   80.2 %   1,454 / 1,813
8 vs 11   81.0 %   1,496 / 1,848
8 vs 12   80.2 %   1,454 / 1,814
8 vs 13   79.8 %   1,450 / 1,818

9 vs 10   80.0 %   1,463 / 1,828
9 vs 11   81.1 %   1,509 / 1,861
9 vs 12   81.0 %   1,474 / 1,819
9 vs 13   79.9 %   1,464 / 1,832

10 vs 11  81.0 %   1,469 / 1,814
10 vs 12  82.4 %   1,447 / 1,757
10 vs 13  80.4 %   1,429 / 1,777

11 vs 12  82.2 %   1,483 / 1,805
11 vs 13  81.0 %   1,471 / 1,815

12 vs 13  92.3 %   1,536 / 1,664
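The percentage in each BLAST matrix cell appears to be the first count divided by the second. The sketch below recomputes a few cells under that assumption, which is inferred from the numbers and not stated on the slide. The function name `conserved_percent` is hypothetical.

```python
def conserved_percent(first_count: int, second_count: int) -> float:
    """Return first_count / second_count as a percentage, to one decimal place."""
    if second_count <= 0:
        raise ValueError("second_count must be positive")
    return round(100.0 * first_count / second_count, 1)


# (first count, second count, percentage printed in the matrix cell)
cells = [
    (1383, 2280, 60.7),  # a between-proteome cell
    (1536, 1664, 92.3),  # the highest between-proteome value
    (67, 1824, 3.7),     # a within-proteome cell
]

for first, second, shown in cells:
    computed = conserved_percent(first, second)
    print(f"{first:,} / {second:,} = {computed} % (slide shows {shown} %)")
```

In all three cases the recomputed value matches the printed percentage, which supports reading each cell as a simple ratio.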

Page 10

Outline

• The problem - too much data!
• A brief history - The speed of sequencing
• Cautionary tales
• Some approaches to handle this....

Page 11

Technology

The data deluge
Businesses, governments and society are only starting to tap its vast potential
Feb 25th 2010 | From The Economist print edition

EIGHTEEN months ago, Li & Fung, a firm that manages supply chains for retailers, saw 100 gigabytes of information flow through its network each day. Now the amount has increased tenfold. During 2009, American drone aircraft flying over Iraq and Afghanistan sent back around 24 years’ worth of video footage. New models being deployed this year will produce ten times as many data streams as their predecessors, and those in 2011 will produce 30 times as many.

Everywhere you look, the quantity of information in the world is soaring. According to one estimate, mankind created 150 exabytes (billion gigabytes) of data in 2005. This year, it will create 1,200 exabytes. Merely keeping up with this flood, and storing the bits that might be useful, is difficult enough. Analysing it, to spot patterns and extract useful information, is harder still. Even so, the data deluge is already starting to transform business, government, science and everyday life (see our special report in this issue). It has great potential for good - as long as consumers, companies and governments make the right choices about when to restrict the flow of data, and when to encourage it.

1. The problem - too much data!

Page 12

1. The problem - too much data!

27 February, 2010 | From The Economist print edition

Page 13

How to visualize lots of data....

In Nature this week, features and opinion pieces on one of the most daunting challenges facing modern science: how to cope with the flood of data now being generated. A petabyte is a lot of memory, however you say it - a quadrillion, 10¹⁵, or tens of thousands of trillions of bytes. But that is the currency of 'big data'. We visited the Sanger Institute's supercomputing centre, and its petabyte of capacity. [News Feature p. 16]

Nature podcast

Volume 455 Number 7209 pp1-136

4 September, 2008

1. The problem - too much data!

Page 14

1. The problem - too much data!

Three Current “next-generation” technologies:

1. Illumina (aka “Solexa”) - 500 million reads (100 bp)

Page 15

1. The problem - too much data!

Three Current “next-generation” technologies:

1. Illumina (aka “Solexa”) - 500 million reads (100 bp)
2. Roche 454

Page 16

Applied Biosystems® SOLiD™ 4 System

SPECIFICATION SHEET

See the Difference
The SOLiD™ 4 System enables you to obtain more high-quality sequence at a lower cost per run. New optimized reagents and algorithms provide more uniform coverage across the genome and result in higher throughput and accuracy for all applications. Accelerate your time to results with automated workflows, intelligent barcoding designs, and the broadest portfolio of application-specific kits and analysis tools. With the SOLiD™ 4 System, you have the throughput and accuracy to cost-effectively discover causative variation - you have the Quality Genome.

Key Benefits

• Higher accuracy: detection of causative variation enabled at lower coverage and cost per sample
• Scalable throughput on a single platform: 80–100 GB of mappable sequence per run
• Automated workflow: 80% reduction in hands-on time and increased reproducibility in yield allow for significant time and labor savings
• True paired-end sequencing: bidirectional sequencing facilitates detection of genetic alterations as well as splice variants and fusion transcripts with lower sample input
• Robust multiplexing kits: intelligent barcode strategy enables accurate assignment without introduction of bias
• Sample-to-results application support: additional application-specific kits and flexible analysis framework for optimized end-to-end application-specific workflows
• Unrivaled support: over 800 dedicated service and support specialists as well as a catalog of in-depth chemistry and bioinformatics courses available

Experience Peace of Mind
The SOLiD™ System’s open slide format and flexible bead densities continue to yield increases in throughput on the same platform with minor upgrades. The SOLiD™ 4 System can generate up to 100 Gb of mappable sequence or greater than 1.4 billion reads per run. Discover the peace of mind provided by the confidence that you will benefit from future technology advances without the purchase of a new system.

SOLiD™ 4 System Sequencing

1. The problem - too much data!

Three Current “next-generation” technologies:

1. Illumina (aka “Solexa”) - 500 million reads (100 bp)
2. Roche 454 - more than 1 million reads (1000 bp)
3. ABI SOLiD - ~100 Gbp per run! 35 bp reads
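Multiplying the number of reads by the read length gives a rough yield per run for the platforms listed above. This is a minimal sketch that assumes yield is simply reads times read length, ignoring paired ends, quality filtering and unmappable reads. The function name `bases_per_run` is hypothetical, and the inputs are the slide's round figures.

```python
def bases_per_run(reads: int, read_length_bp: int) -> int:
    """Rough yield in bases: number of reads times read length (an assumption)."""
    return reads * read_length_bp


# Round figures from the slide; the 454 read count is a lower bound.
platforms = {
    "Illumina": (500_000_000, 100),
    "Roche 454": (1_000_000, 1000),
}

for name, (reads, length_bp) in platforms.items():
    gbp = bases_per_run(reads, length_bp) / 1e9
    print(f"{name}: about {gbp:g} Gbp per run")

# For SOLiD the slide gives the yield (~100 Gbp) and the read length (35 bp),
# so the same relation is inverted to estimate the number of reads.
solid_reads = 100e9 / 35
print(f"ABI SOLiD: about {solid_reads / 1e9:.1f} billion reads of 35 bp")
```

Under these assumptions Illumina comes out at about 50 Gbp per run and Roche 454 at about 1 Gbp per run, so the three platforms span roughly two orders of magnitude in yield.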

Page 17

Genome Research, Jan 2009

The new paradigm of flow cell sequencing
Robert A. Holt¹ and Steven J.M. Jones
British Columbia Cancer Agency, Genome Sciences Centre, Vancouver, British Columbia V5Z 4E6, Canada

DNA sequencing is in a period of rapid change, in which capillary sequencing is no longer the technology of choice for most ultra-high-throughput applications. A new generation of instruments that utilize primed synthesis in flow cells to obtain, simultaneously, the sequence of millions of different DNA templates has changed the field. We compare and contrast these new sequencing platforms in terms of stage of development, instrument configuration, template format, sequencing chemistry, throughput capability, operating cost, data handling issues, and error models. While these platforms outperform capillary instruments in terms of bases per day and cost per base, the short length of sequence reads obtained from most instruments and the limited number of samples that can be run simultaneously imposes some practical constraints on sequencing applications. However, recently developed methods for paired-end sequencing and for array-based direct selection of desired templates from complex mixtures extend the utility of these platforms for genome analysis. Given the ever increasing demand for DNA sequence information, we can expect continuous improvement of this new generation of instruments and their eventual replacement by even more powerful technology.

Since the establishment of DNA as hereditary material and the elucidation of its structure, there has been insatiable demand for sequence information and remarkable innovation in the methods used to obtain it. Like many technologies, DNA sequencing has advanced by punctuated equilibrium, where a new approach to sequencing is introduced, adopted, and improved upon incrementally for some period of time, then replaced by the next wave. The very earliest sequencing techniques involved variations on the theme of cleavage of short polynucleotides and subsequent identification by their migration characteristics using two-dimensional paper chromatography. Using this approach it was possible to infer short sequences, such as that of the Escherichia coli lac operon (Gilbert and Maxam 1973), and it was feasible at the time to report the data from an entire sequencing project in a paper’s abstract. A transition of major significance was spearheaded by the Sanger group in the mid 1970s, when they introduced the notion of using primed template replication by polymerase and separation of the extension products by gel electrophoresis (Sanger and Coulson 1975) to obtain DNA sequence information. Modifying this approach to allow base-specific chain termination by di-deoxy nucleotides (Sanger et al. 1977) laid the foundation for sequencing for the next 30 yr. Further incremental improvements during this time included using fluorescent rather than radiolabeled terminators, separation on acrylamide matrices in capillaries rather than slab gels, and, ultimately, the deployment of mechanized production lines for template preparation and devices for automated generation and reading of sequence ladders. This industrial approach to sequencing spawned the modern era of genomics and has provided an archive of complete reference genome sequences. Yet demand for DNA sequence is undiminished and we find ourselves in a new period of rapid change. If the hallmark of the past paradigm was electrophoretic separation of terminated DNA chains, then the hallmark of new paradigm is flow cell sequencing, with stepwise determination of DNA sequence by iterative cycles of nucleotide extensions done in parallel on massive numbers of clonally amplified template molecules. If one takes the broad view of a flow cell as a reaction chamber that contains template tethered to a solid support, to which nucleotides and ancillary reagents are iteratively applied and washed away, then the new instruments on the market (the Roche GS-FLX, the Illumina 1G analyzer, and the Applied Biosystems SOLiD) are all flow cell sequencers (as are instruments anticipated in the near future such as the Helicos HeliScope and the Danaher Polonator). Massively parallel approaches using flow cells allow DNA to be sequenced markedly faster and cheaper than ever before. This means that lines of scientific inquiry that once were prohibitively expensive are now feasible, and this is good because there is much to explore. For example, human genome sequences have been compiled but represent a miniscule proportion of the ~100 million kilograms of human DNA that is on the planet on any given day. It is certain that novel template from the biosphere will continue to drive consecutive waves of innovation in sequencing technology for some time to come.

The technology

Templates and sequencing chemistries

While all of the latest commercial sequencing instruments use flow cells and massive parallelization to increase sequencing capacity, the specifics of template preparation, sequencing chemistry, and flow cell configuration differ among the platforms. There is often a misconception that the new generation of sequencers perform sequencing on single molecules. In fact, all currently available platforms (the Roche GS-FLX, Illumina 1G analyzer, and the Applied Biosystems SOLiD) require PCR-based amplification of fragmented template DNA to obtain sufficient signal for base calling. However, these methods utilize a single DNA molecule as the initial substrate for amplification allowing each sequenced molecule to represent a single haplotype. This has proven to be useful for robust polymorphism detection particularly in cancer-derived material, where associated normal tissue may obscure heterozygote calls using traditional Sanger sequencing of PCR products. As discussed further below, the instrument being developed by Helicos stays with the single molecule throughout analysis.

¹ Corresponding author.
E-mail [email protected]; fax (604) 877-6085.
Article is online at http://www.genome.org/cgi/doi/10.1101/gr.073262.107.

Next-Generation DNA Sequencing/Review

18:839–846 ©2008 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/08; www.genome.org
Genome Research 839

Downloaded from genome.cshlp.org on January 3, 2009 - Published by Cold Spring Harbor Laboratory Press

"Indeed, any of these new machines running at full capacity for a year will generate more sequence than existed in the whole of NCBI at the beginning of 2008. Analysis of the sequence data has rapidly become the limiting step and will likely become the most expensive part. The sheer volume of data will provide challenges in processing, networking, storage, and analysis of the flow-cell images just to provide the initial base calling." after Holt & Jones, 2009

Sanger Center has 28 Solexa machines, 8 ABI Solids, 2 Roche 454 machines

>1000 teraBytes per month!

1. The problem - too much data!


The Human Genome Project

Started more than 20 years ago (~1985)

The U.S. government agreed to invest $200,000,000 U.S. per year for 20 years.

One base per second = 216 years! (for the ~6,800,000,000 bp diploid genome)

~3,400,000,000 bp per haploid genome

~6,800,000,000 bp per diploid genome
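The 216-year figure can be checked with a few lines of arithmetic. This is a minimal sketch, assuming a 365-day year and a constant rate of one base per second; the function name `years_to_sequence` is made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # non-leap year, an assumption


def years_to_sequence(n_bases, bases_per_second=1.0):
    """Years needed to read n_bases once at a constant rate."""
    return n_bases / bases_per_second / SECONDS_PER_YEAR


haploid_bp = 3_400_000_000  # genome sizes as quoted on the slide
diploid_bp = 6_800_000_000

print(f"haploid: {years_to_sequence(haploid_bp):.1f} years")
print(f"diploid: {years_to_sequence(diploid_bp):.1f} years")
```

The slide's 216 years corresponds to the diploid genome; the haploid genome alone would take about half as long at the same rate.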

2. A brief history - The speed of sequencing


1. “First Human Genome”: $3,000,000,000 + 15 years

2. Celera genome (a.k.a. J. Craig Venter): $100,000,000 + 0.75 years (9 months)

3. Jim Watson’s genome: $900,000 + 0.17 years (2 months)

4. John Doe's genome: $1,000 + 0.0003 years (0.1 day)

5. "next next-generation" machines

• Helicos Biosystems machine can sequence human genome in 1 hour (2009).

• Pacific Biosciences machine can sequence human genome in 4 minutes (2010).

• Omni Molecular Recognizer Application - human genome less than $1, <1 minute.
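To put the durations in the list above on a common scale, each can be converted into an implied read rate in bases per second. This is a rough sketch under stated assumptions: a single pass over a 3,400,000,000 bp haploid genome (no coverage, no assembly time), "2 months" taken as 60 days, and the vendor figures taken at face value as claims. The names `DURATIONS` and `implied_rate` are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY
HAPLOID_BP = 3_400_000_000  # haploid genome size quoted on the earlier slide

# Durations from the list above, converted to seconds.
DURATIONS = {
    "First Human Genome": 15 * SECONDS_PER_YEAR,
    "Celera genome": 0.75 * SECONDS_PER_YEAR,
    "Jim Watson's genome": 60 * SECONDS_PER_DAY,  # "2 months" taken as 60 days
    "John Doe's genome": 0.1 * SECONDS_PER_DAY,
    "Helicos (claimed)": 60 * 60,
    "Pacific Biosciences (claimed)": 4 * 60,
}


def implied_rate(duration_seconds, genome_bp=HAPLOID_BP):
    """Bases per second needed to read genome_bp once in duration_seconds."""
    return genome_bp / duration_seconds


for name, seconds in DURATIONS.items():
    print(f"{name}: {implied_rate(seconds):,.0f} bases/second")
```

Real projects sequence each base many times over, so the true instrument throughput would be higher than these single-pass figures.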


[Figure: number of genomes in NCBI web pages (y-axis, 0 to 4,000) by year (x-axis, 1995 to 2010). Series: Bacteria, Archaea, total published, Unfinished, total.]


as of 21 Jan, 2009

as of 4 March, 2010

3630

2226


as of 4 March, 2010


3. Cautionary tales


4 1 Sequences as Biological Information

organisms, the number of species present in the environment, and, despite their small size, the biomass they represent on a worldwide scale. Even inside an animal, microbes are abundant: only one out of every 10 cells in a human body is actually human, whilst the other nine cells are prokaryotic.

From an evolutionary perspective, Bacteria and Archaea have been around for more than 3 billion years; plants and animals are relatively recent ‘newcomers’ on the scene, arriving less than half a billion years ago. Since Bacteria and Archaea can divide rather quickly and have had much more time to evolve, their diversity by far exceeds that of eukaryotes (the members of Eucarya). Our human perception is that plants and animals are completely unlike each other, and so are, say, insects and mammals, as they are strikingly different even at first sight. The diversity of

Fig. 1.1 A phylogenetic tree displaying the genetic distances between members of the three super-kingdoms of life: Bacteria, Archaea, and Eucarya. The represented bacterial genera will appear in examples throughout the book. The distance between bacterial genera is much larger than that of plants and animals, drawn on the same scale of genetic distance

[Fig. 1.1 labels. Super-kingdoms: BACTERIA, ARCHAEA, EUCARYA. Eucarya: Unicellular eukaryotes, Protozoans, Macro-organisms (Animals, Plants), Giardia, Saccharomyces, Trypanosoma, Slime mold, Babesia. Archaea: Crenarchaeota, Euryarchaeota. Bacteria: Flavobacterium, Chlamydiae, Cyanobacteria, Proteobacteria, Actinobacteria, Chlorobi, Clostridium, Bacillus, Chloroflexi, Acidobacteria, Aquificae, Thermotoga, Thermus, Deinococcus, Firmicutes, Bacteroidetes, Spirochaetes, Planctomycetes.]


[Figure: phylogenetic tree of sequenced genomes. Color ranges: Eukaryota, Archaea, Bacteria. Highlighted groups: Archaea, Firmicutes, Spirochetes, γ-Proteobacteria, E. coli, Eukaryotes. Leaves are individual species and strains, from Giardia lamblia, Leishmania major and Homo sapiens through to Escherichia coli, Yersinia pestis and Wigglesworthia brevipalpis. Annotation: "You are here".]


[Figure: phylogenetic tree of sequenced genomes (color ranges: Eukaryota, Archaea, Bacteria), shown twice. Callouts: E. coli, humans, worms, Y. pestis.]


Eukaryotes, compared with human:

genome     genes     orthologs   %
human      23,621    -           -
chimp      19,829    19,568      99%
chicken    18,529    14,000      76%
worm       18,000    10,000      55%
yeast       6,000     1,700      28%

C. botulinum, compared with type A, strain ATCC 3502:

strain                           genes   orthologs   %
type A, strain ATCC 3502         3606    -           -
type A, strain ATCC 19397        3553    3202        90%
type A, strain Kyoto             3874    2974        77%
type C, strain Ecklund           2801    1126        40%
type E1, strain BoNT E Beluga    3760    1092        29%
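The percentages on this slide appear to be the ortholog count divided by the gene count of the genome being compared. That reading is an assumption inferred from the numbers, not something the slide states. The sketch below recomputes the values; the function name `ortholog_percent` and the dictionary layout are illustrative only.

```python
def ortholog_percent(orthologs, genes):
    """Orthologs as a percentage of the compared genome's gene count."""
    return 100.0 * orthologs / genes


# (orthologs, genes) pairs as read from the slide.
eukaryotes = {
    "chimp": (19568, 19829),
    "chicken": (14000, 18529),
    "worm": (10000, 18000),
    "yeast": (1700, 6000),
}
c_botulinum = {
    "type A, ATCC 19397": (3202, 3553),
    "type A, Kyoto": (2974, 3874),
    "type C, Ecklund": (1126, 2801),
    "type E1, BoNT E Beluga": (1092, 3760),
}

for group in (eukaryotes, c_botulinum):
    for name, (orthologs, genes) in group.items():
        print(f"{name}: {ortholog_percent(orthologs, genes):.1f}%")
```

Under this reading the recomputed values agree with the slide to within one percentage point; the worm entry comes out slightly above 55%, presumably because its counts are rounded to the nearest thousand.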


gagttttatc gcttccatga cgcagaagtt aacactttcg gatatttctg atgagtcgaa aaattatctt gataaagcag gaattactac tgcttgttta cgaattaaat cgaagtggac tgctggcgga aaatgagaaa attcgaccta tccttgcgca gctcgagaag ctcttacttt gcgacctttc gccatcaact aacgattctg tcaaaaactg acgcgttgga tgaggagaag tggcttaata tgcttggcac gttcgtcaag gactggttta gatatgagtc acattttgtt catggtagag attctcttgt tgacatttta aaagagcgtg gattactatc tgagtccgat gctgttcaac cactaatagg taagaaatca tgagtcaagt tactgaacaa tccgtacgtt tccagaccgc tttggcctct attaagctca ttcaggcttc tgccgttttg gatttaaccg aagatgattt cgattttctg acgagtaaca aagtttggat tgctactgac cgctctcgtg ctcgtcgctg cgttgaggct tgcgtttatg gtacgctgga ctttgtggga taccctcgct ttcctgctcc tgttgagttt attgctgccg tcattgctta ttatgttcat cccgtcaaca ttcaaacggc ctgtctcatc atggaaggcg ctgaatttac ggaaaacatt attaatggcg tcgagcgtcc ggttaaagcc gctgaattgt tcgcgtttac cttgcgtgta cgcgcaggaa acactgacgt tcttactgac gcagaagaaa acgtgcgtca aaaattacgt gcggaaggag tgatgtaatg tctaaaggta aaaaacgttc tggcgctcgc cctggtcgtc cgcagccgtt gcgaggtact aaaggcaagc gtaaaggcgc tcgtctttgg tatgtaggtg gtcaacaatt ttaattgcag gggcttcggc cccttacttg aggataaatt atgtctaata ttcaaactgg cgccgagcgt atgccgcatg acctttccca tcttggcttc cttgctggtc agattggtcg tcttattacc atttcaacta ctccggttat cgctggcgac tccttcgaga tggacgccgt tggcgctctc cgtctttctc cattgcgtcg tggccttgct attgactcta ctgtagacat ttttactttt tatgtccctc atcgtcacgt ttatggtgaa cagtggatta agttcatgaa ggatggtgtt aatgccactc ctctcccgac tgttaacact actggttata ttgaccatgc cgcttttctt ggcacgatta accctgatac caataaaatc cctaagcatt tgtttcaggg ttatttgaat atctataaca actattttaa agcgccgtgg atgcctgacc gtaccgaggc taaccctaat gagcttaatc aagatgatgc tcgttatggt ttccgttgct gccatctcaa aaacatttgg actgctccgc ttcctcctga gactgagctt tctcgccaaa tgacgacttc taccacatct attgacatta tgggtctgca agctgcttat gctaatttgc atactgacca agaacgtgat tacttcatgc agcgttacca tgatgttatt tcttcatttg gaggtaaaac ctcttatgac gctgacaacc gtcctttact tgtcatgcgc tctaatctct gggcatctgg ctatgatgtt gatggaactg accaaacgtc gttaggccag ttttctggtc gtgttcaaca gacctataaa 
cattctgtgc cgcgtttctt tgttcctgag catggcacta tgtttactct tgcgcttgtt cgttttccgc ctactgcgac taaagagatt cagtacctta acgctaaagg tgctttgact tataccgata ttgctggcga ccctgttttg tatggcaact tgccgccgcg tgaaatttct atgaaggatg ttttccgttc tggtgattcg tctaagaagt ttaagattgc tgagggtcag tggtatcgtt atgcgccttc gtatgtttct cctgcttatc accttcttga aggcttccca ttcattcagg aaccgccttc tggtgatttg caagaacgcg tacttattcg ccaccatgat tatgaccagt gtttccagtc cgttcagttg ttgcagtgga atagtcaggt taaatttaat gtgaccgttt atcgcaatct gccgaccact cgcgattcaa tcatgacttc gtgataaaag attgagtgtg aggttataac gccgaagcgg taaaaatttt aatttttgcc gctgaggggt tgaccaagcg aagcgcggta ggttttctgc ttaggagttt aatcatgttt cagactttta tttctcgcca taattcaaac tttttttctg ataagctggt tctcacttct gttactccag cttcttcggc acctgtttta cagacaccta aagctacatc gtcaacgtta tattttgata gtttgacggt taatgctggt aatggtggtt ttcttcattg cattcagatg gatacatctg tcaacgccgc taatcaggtt gtttctgttg gtgctgatat tgcttttgat gccgacccta aattttttgc ctgtttggtt cgctttgagt cttcttcggt tccgactacc ctcccgactg cctatgatgt ttatcctttg aatggtcgcc atgatggtgg ttattatacc gtcaaggact gtgtgactat tgacgtcctt ccccgtacgc cgggcaataa cgtttatgtt ggtttcatgg tttggtctaa ctttaccgct actaaatgcc gcggattggt ttcgctgaat aagagattat ttgtctccag ccacttaagt gaggtgattt atgtttggtg ctattgctgg cggtattgct tctgctcttg ctggtggcgc catgtctaaa ttgtttggag gcggtcaaaa agccgcctcc ggtggcattc aaggtgatgt gcttgctacc gataacaata ctgtaggcat gggtgatgct ggtattaaat ctgccattca aggctctaat gttcctaacc ctgatgaggc cgcccctagt tttgtttctg gtgctatggc taaagctggt aaaggacttc ttgaaggtac gttgcaggct ggcacttctg ccgtttctga taagttgctt gatttggttg gacttggtgg caagtctgcc gctgataaag gaaaggatac tcgtgattat cttgctgctg catttcctga gcttaatgct tgggagcgtg ctggtgctga tgcttcctct gctggtatgg ttgacgccgg atttgagaat caaaaagagc ttactaaaat gcaactggac aatcagaaag agattgccga gatgcaaaat gagactcaaa aagagattgc tggcattcag tcggcgactt cacgccagaa tacgaaagac caggtatatg cacaaaatga gatgcttgct tatcaacaga aggagtctac tgctcgcgtt gcgtctatta tggaaaacac caatctttcc aagcaacagc aggtttccga gattatgcgc caaatgctta 
ctcaagctca aacggctggt cagtatttta ccaatgacca aatcaaagaa atgactcgca aggttagtgc tgaggttgac ttagttcatc agcaaacgca gaatcagcgg tatggctctt ctcatattgg cgctactgca aaggatattt ctaatgtcgt cactgatgct gcttctggtg tggttgatat ttttcatggt attgataaag ctgttgccga tacttggaac aatttctgga aagacggtaa agctgatggt attggctcta atttgtctag gaaataaccg tcaggattga caccctccca attgtatgtt ttcatgcctc caaatcttgg aggctttttt atggttcgtt cttattaccc ttctgaatgt cacgctgatt attttgactt tgagcgtatc gaggctctta aacctgctat tgaggcttgt ggcatttcta ctctttctca atccccaatg cttggcttcc ataagcagat ggataaccgc atcaagctct tggaagagat tctgtctttt cgtatgcagg gcgttgagtt cgataatggt gatatgtatg ttgacggcca taaggctgct tctgacgttc gtgatgagtt tgtatctgtt actgagaagt taatggatga attggcacaa tgctacaatg tgctccccca acttgatatt aataacacta tagaccaccg ccccgaaggg gacgaaaaat ggtttttaga gaacgagaag acggttacgc agttttgccg caagctggct gctgaacgcc ctcttaagga tattcgcgat gagtataatt accccaaaaa gaaaggtatt aaggatgagt gttcaagatt gctggaggcc tccactatga aatcgcgtag aggctttgct attcagcgtt tgatgaatgc aatgcgacag gctcatgctg atggttggtt tatcgttttt gacactctca cgttggctga cgaccgatta gaggcgtttt atgataatcc caatgctttg cgtgactatt ttcgtgatat tggtcgtatg gttcttgctg ccgagggtcg caaggctaat gattcacacg ccgactgcta tcagtatttt tgtgtgcctg agtatggtac agctaatggc cgtcttcatt tccatgcggt gcactttatg cggacacttc ctacaggtag cgttgaccct aattttggtc gtcgggtacg caatcgccgc cagttaaata gcttgcaaaa tacgtggcct tatggttaca gtatgcccat cgcagttcgc tacacgcagg acgctttttc acgttctggt tggttgtggc ctgttgatgc taaaggtgag ccgcttaaag ctaccagtta tatggctgtt ggtttctatg tggctaaata cgttaacaaa aagtcagata tggaccttgc tgctaaaggt ctaggagcta aagaatggaa caactcacta aaaaccaagc tgtcgctact tcccaagaag ctgttcagaa tcagaatgag ccgcaacttc gggatgaaaa tgctcacaat gacaaatctg tccacggagt gcttaatcca acttaccaag ctgggttacg acgcgacgcc gttcaaccag atattgaagc agaacgcaaa aagagagatg agattgaggc tgggaaaagt tactgtagcc gacgttttgg cggcgcaacc tgtgacgaca aatctgctca aatttatgcg cgcttcgata aaaatgattg gcgtatccaa cctgca

4. Approaches to handle lots of data - Visualization


40 3 Microbial Genome Sequences

Base Atlases to Visualize Base Composition Features

Figure 3.3 is an ‘absolute’ Base Atlas, or a graphical representation of the entire φX174 DNA sequence plotted on a single figure (for the positive strand, the one represented in Fig. 3.1¹). Since we are interested in base composition analysis, the densities of the four bases are plotted by color intensity (the four outer circles). It is obvious that this DNA is quite T-rich, as there is far more red (T’s) than green (A’s), turquoise (G’s), or violet (C’s). It would be a challenge to see this at one glance from Fig. 3.1.
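The base densities plotted in such an atlas are, at heart, per-window base counts. The following is a minimal sketch of that idea, not the atlas software itself; the function names, the non-overlapping window scheme, and the short toy sequence are all assumptions for illustration.

```python
from collections import Counter


def base_fractions(seq):
    """Fraction of a, c, g and t in seq, ignoring any other symbols."""
    counts = Counter(b for b in seq.lower() if b in "acgt")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no a/c/g/t bases")
    return {b: counts[b] / total for b in "acgt"}


def windowed_fraction(seq, base, window):
    """Density of one base in consecutive non-overlapping windows."""
    seq = seq.lower()
    return [
        seq[i:i + window].count(base) / window
        for i in range(0, len(seq) - window + 1, window)
    ]


toy = "gagttttatcgcttccatgacgcagaagtt"  # first 30 bases of the sequence slide
print(base_fractions(toy))
print(windowed_fraction(toy, "t", 10))
```

Mapping each window's density to a color intensity and drawing the windows around a circle gives one ring of the atlas; a T-rich genome shows up as a consistently darker T ring.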

Continuing to read this plot from outside to inside, the coding sequences are plotted next, and since they are all on one strand only one color is needed here (in case there are coding sequences on the strand complementary to the strand that is published, we color them red). The next circle is called AT skew, and is a measure of the bias of A’s towards one strand (and T’s towards the other). As will be discussed in Chapter 7, for some bacteria, the A’s are biased towards the replication leading strand, but in other bacterial chromosomes, including E. coli, which this phage normally infects, the T’s are biased towards the leading strand. The strong red color in this lane means that T’s are biased towards the strand represented by the sequence, implying that this is the leading replication strand. The next circle shows the GC skew, and since the scale is the same as that of the AT skew (±0.20), the absence of dark colors indicates that the bias of G’s towards one strand or the other is not as

¹ Phage φX174 is a virus that packs its DNA as single-stranded DNA (ssDNA) in virion particles, so it only contains this positive strand in virion form.

[Figure 3.2: two circular maps of phiX174 (5,386 bp). Left map labels: restriction sites for HindII, TaqI, HaeIII, MboII, HapII, HphI and PstI. Right map labels: genes p1 to p11 and the origin.]

Fig. 3.2 Two views of the nucleotide sequence of the φX174 genome. The left view shows a selection of the restriction enzyme recognition sites originally described in the paper (the unique PstI site is red), and the right view shows all 11 protein encoding genes, along with their predicted transcripts. The origin of replication is indicated by an arrow


Page 32: Cautionary Tales of Next-generation ... - DTU … · Cautionary Tales of Next-generation and Next-next Generation Sequencing Dave Ussery ... . Comparative Microbial Genomics group


The Importance of Visualization 41

strong as for the A’s in the previous circle. AT and GC skew are further explained in Chapter 7. Finally, the deviation of AT content from the chromosomal average percentage AT is plotted, ranging from 40% to 60% AT, with 50% AT in the middle; thus bright red regions contain lots of A’s or T’s, and the blue regions are GC-rich. There are four dark red regions in the innermost circle that are much more AT-rich than the rest of the chromosome.
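The AT skew, GC skew and percent AT lanes can be computed per window along the sequence. The following is a minimal sketch, assuming the commonly used definitions AT skew = (A - T)/(A + T) and GC skew = (G - C)/(G + C), which the text here does not spell out; the function name `window_stats` and the window size are illustrative, and the windows do not wrap around the circular genome.

```python
def window_stats(seq, window=100, step=100):
    """Yield (start, at_skew, gc_skew, fraction_at) for consecutive windows."""
    seq = seq.upper()
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        a, c, g, t = (chunk.count(base) for base in "ACGT")
        at, gc = a + t, g + c
        # Assumed sign convention: negative AT skew means more T than A on this strand.
        at_skew = (a - t) / at if at else 0.0
        gc_skew = (g - c) / gc if gc else 0.0
        fraction_at = at / (at + gc) if (at + gc) else 0.0
        yield start, at_skew, gc_skew, fraction_at

# Tiny synthetic example: one AT-only window followed by one GC-only window.
rows = list(window_stats("AAATGGGC", window=4, step=4))
```

With this sign convention, a T-rich strand such as the one described above would give negative AT skew values in most windows.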

The plot of Fig. 3.4 shows the same data as in Fig. 3.3, but now as a ‘relative’ Base Atlas: the data are normalized to the genomic average for the values in each lane; only values greater than three standard deviations above the average are colored. At first sight, this is a rather bleached version of the previous figure, but it does reveal different information. For instance, there is a region where A’s are highly overrepresented compared to the global A content (around 3.5 k), and a relatively small stretch where G’s are overrepresented (around 1 k). This is not obvious from the absolute Base Atlas of Fig. 3.3 because that figure is too colorful. Atlas figures can display lanes as absolute ranges, as regions that deviate by more than three standard deviations from the chromosomal average, or as a combination of fixed and average lanes. The way to tell the scale is to look at the legend, which is always oriented with the outermost circle on the top, going towards the innermost circle at the bottom. At the right of each scale in the legend, ‘fix’ indicates a fixed range,

[Figure 3.3: circular Base Atlas of coliphage phiX174 (5,386 bp; axis ticks every 0.5k from 0k to 5k; resolution 3). Legend, from outer to inner circle: G Content (fix avg, 0.00 to 0.40); A Content (fix avg, 0.00 to 0.40); T Content (fix avg, 0.00 to 0.40); C Content (fix avg, 0.00 to 0.40); Annotations: CDS +; AT Skew (fix avg, -0.20 to 0.20); GC Skew (fix avg, -0.20 to 0.20); Percent AT (fix avg, 0.40 to 0.60).]

Fig. 3.3 Absolute DNA Base Atlas of the nucleotide sequence of the φX174 genome. The legend to the right explains what is represented from the outer to the inner circle. Shown are the fraction of each nucleotide along the genome (first four circles counting inwards), the coding sequences on the positive (clockwise) strand, the AT and GC skew, and the percent AT. In an ‘absolute’ Atlas all lanes are plotted with a fixed range

Page 33: Cautionary Tales of Next-generation ... - DTU … · Cautionary Tales of Next-generation and Next-next Generation Sequencing Dave Ussery ... . Comparative Microbial Genomics group


42 3 Microbial Genome Sequences

while ‘dev’ means that the average is in the middle value (usually light gray) and the extreme ends represent plus or minus three standard deviations from the average.
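The difference between a ‘fix’ lane and a ‘dev’ lane can be illustrated with a small calculation. The following is a minimal sketch, assuming each lane is a list of per-window values; the function name `deviation_lane`, the use of the population standard deviation, and the symmetric (plus or minus) threshold are assumptions made for illustration, not a description of the Atlas software itself.

```python
from statistics import mean, pstdev

def deviation_lane(values, n_sd=3.0):
    """Return (z_score, is_extreme) per window, relative to the lane's own average."""
    avg = mean(values)
    sd = pstdev(values)
    if sd == 0:
        # A perfectly flat lane has no deviating windows.
        return [(0.0, False) for _ in values]
    return [((v - avg) / sd, abs(v - avg) > n_sd * sd) for v in values]

# Synthetic lane: twenty ordinary windows and one window with a very high A content.
a_content = [0.25] * 20 + [0.90]
flags = [is_extreme for _, is_extreme in deviation_lane(a_content)]
```

In a ‘dev’ lane only the flagged windows would receive strong color, which is why the relative Base Atlas looks bleached compared to the absolute one.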

Genome Atlases to Visualize Chromosomes

The analysis of DNA base composition is interesting in itself, but a Base Atlas displays only a fraction of the type of information a genomic atlas can provide. The next step is to combine this with the presence of genes, and to also indicate regions containing repeats in DNA sequences. Structural features of the DNA can also be plotted. That way, we start to produce what we call a Genome Atlas, providing a quick overview of some of the most important and informative features in a microbial chromosome, plasmid, or phage. Figure 3.5 represents a Genome Atlas of the φX174 genome.

Of the circles of the Base Atlas of Fig. 3.4 we have chosen to represent only AT skew (as a fixed average) and percent AT (as deviation). Three outer circles have been added to the atlas, representing DNA structural properties: intrinsic DNA curvature in the outermost, followed by stacking energy and position preference.

[Figure 3.4: circular Base Atlas of coliphage phiX174 (5,386 bp; axis ticks every 0.5k from 0k to 5k; resolution 3). Legend, from outer to inner circle: G Content (dev avg, 0.07 to 0.39); A Content (dev avg, 0.01 to 0.47); T Content (dev avg, 0.10 to 0.53); C Content (dev avg, 0.04 to 0.39); Annotations: CDS +; AT Skew (dev avg, -0.33 to 0.18); GC Skew (dev avg, -0.10 to 0.14); Percent AT (dev avg, 0.47 to 0.63).]

Fig. 3.4 Relative Base Atlas of the φX174 genome. In this Atlas the colors represent the regions where the base density varies more than three standard deviations from the genomic average. To the right of each scale is indicated whether fixed average or three standard deviations are plotted. The numbers below the scales indicate how color intensity was chosen. This relative Base Atlas (and not the absolute version of Fig. 3.3) is the default Base Atlas used in the remainder of the book

Page 34: Cautionary Tales of Next-generation ... - DTU … · Cautionary Tales of Next-generation and Next-next Generation Sequencing Dave Ussery ... . Comparative Microbial Genomics group

gagttttatc gcttccatga cgcagaagtt aacactttcg gatatttctg atgagtcgaa aaattatctt gataaagcag gaattactac tgcttgttta cgaattaaat cgaagtggac tgctggcgga aaatgagaaa attcgaccta tccttgcgca gctcgagaag ctcttacttt gcgacctttc gccatcaact aacgattctg tcaaaaactg acgcgttgga tgaggagaag tggcttaata tgcttggcac gttcgtcaag gactggttta gatatgagtc acattttgtt catggtagag attctcttgt tgacatttta aaagagcgtg gattactatc tgagtccgat gctgttcaac cactaatagg taagaaatca tgagtcaagt tactgaacaa tccgtacgtt tccagaccgc tttggcctct attaagctca ttcaggcttc tgccgttttg gatttaaccg aagatgattt cgattttctg acgagtaaca aagtttggat tgctactgac cgctctcgtg ctcgtcgctg cgttgaggct tgcgtttatg gtacgctgga ctttgtggga taccctcgct ttcctgctcc tgttgagttt attgctgccg tcattgctta ttatgttcat cccgtcaaca ttcaaacggc ctgtctcatc atggaaggcg ctgaatttac ggaaaacatt attaatggcg tcgagcgtcc ggttaaagcc gctgaattgt tcgcgtttac cttgcgtgta cgcgcaggaa acactgacgt tcttactgac gcagaagaaa acgtgcgtca aaaattacgt gcggaaggag tgatgtaatg tctaaaggta aaaaacgttc tggcgctcgc cctggtcgtc cgcagccgtt gcgaggtact aaaggcaagc gtaaaggcgc tcgtctttgg tatgtaggtg gtcaacaatt ttaattgcag gggcttcggc cccttacttg aggataaatt atgtctaata ttcaaactgg cgccgagcgt atgccgcatg acctttccca tcttggcttc cttgctggtc agattggtcg tcttattacc atttcaacta ctccggttat cgctggcgac tccttcgaga tggacgccgt tggcgctctc cgtctttctc cattgcgtcg tggccttgct attgactcta ctgtagacat ttttactttt tatgtccctc atcgtcacgt ttatggtgaa cagtggatta agttcatgaa ggatggtgtt aatgccactc ctctcccgac tgttaacact actggttata ttgaccatgc cgcttttctt ggcacgatta accctgatac caataaaatc cctaagcatt tgtttcaggg ttatttgaat atctataaca actattttaa agcgccgtgg atgcctgacc gtaccgaggc taaccctaat gagcttaatc aagatgatgc tcgttatggt ttccgttgct gccatctcaa aaacatttgg actgctccgc ttcctcctga gactgagctt tctcgccaaa tgacgacttc taccacatct attgacatta tgggtctgca agctgcttat gctaatttgc atactgacca agaacgtgat tacttcatgc agcgttacca tgatgttatt tcttcatttg gaggtaaaac ctcttatgac gctgacaacc gtcctttact tgtcatgcgc tctaatctct gggcatctgg ctatgatgtt gatggaactg accaaacgtc gttaggccag ttttctggtc gtgttcaaca gacctataaa 
cattctgtgc cgcgtttctt tgttcctgag catggcacta tgtttactct tgcgcttgtt cgttttccgc ctactgcgac taaagagatt cagtacctta acgctaaagg tgctttgact tataccgata ttgctggcga ccctgttttg tatggcaact tgccgccgcg tgaaatttct atgaaggatg ttttccgttc tggtgattcg tctaagaagt ttaagattgc tgagggtcag tggtatcgtt atgcgccttc gtatgtttct cctgcttatc accttcttga aggcttccca ttcattcagg aaccgccttc tggtgatttg caagaacgcg tacttattcg ccaccatgat tatgaccagt gtttccagtc cgttcagttg ttgcagtgga atagtcaggt taaatttaat gtgaccgttt atcgcaatct gccgaccact cgcgattcaa tcatgacttc gtgataaaag attgagtgtg aggttataac gccgaagcgg taaaaatttt aatttttgcc gctgaggggt tgaccaagcg aagcgcggta ggttttctgc ttaggagttt aatcatgttt cagactttta tttctcgcca taattcaaac tttttttctg ataagctggt tctcacttct gttactccag cttcttcggc acctgtttta cagacaccta aagctacatc gtcaacgtta tattttgata gtttgacggt taatgctggt aatggtggtt ttcttcattg cattcagatg gatacatctg tcaacgccgc taatcaggtt gtttctgttg gtgctgatat tgcttttgat gccgacccta aattttttgc ctgtttggtt cgctttgagt cttcttcggt tccgactacc ctcccgactg cctatgatgt ttatcctttg aatggtcgcc atgatggtgg ttattatacc gtcaaggact gtgtgactat tgacgtcctt ccccgtacgc cgggcaataa cgtttatgtt ggtttcatgg tttggtctaa ctttaccgct actaaatgcc gcggattggt ttcgctgaat aagagattat ttgtctccag ccacttaagt gaggtgattt atgtttggtg ctattgctgg cggtattgct tctgctcttg ctggtggcgc catgtctaaa ttgtttggag gcggtcaaaa agccgcctcc ggtggcattc aaggtgatgt gcttgctacc gataacaata ctgtaggcat gggtgatgct ggtattaaat ctgccattca aggctctaat gttcctaacc ctgatgaggc cgcccctagt tttgtttctg gtgctatggc taaagctggt aaaggacttc ttgaaggtac gttgcaggct ggcacttctg ccgtttctga taagttgctt gatttggttg gacttggtgg caagtctgcc gctgataaag gaaaggatac tcgtgattat cttgctgctg catttcctga gcttaatgct tgggagcgtg ctggtgctga tgcttcctct gctggtatgg ttgacgccgg atttgagaat caaaaagagc ttactaaaat gcaactggac aatcagaaag agattgccga gatgcaaaat gagactcaaa aagagattgc tggcattcag tcggcgactt cacgccagaa tacgaaagac caggtatatg cacaaaatga gatgcttgct tatcaacaga aggagtctac tgctcgcgtt gcgtctatta tggaaaacac caatctttcc aagcaacagc aggtttccga gattatgcgc caaatgctta 
ctcaagctca aacggctggt cagtatttta ccaatgacca aatcaaagaa atgactcgca aggttagtgc tgaggttgac ttagttcatc agcaaacgca gaatcagcgg tatggctctt ctcatattgg cgctactgca aaggatattt ctaatgtcgt cactgatgct gcttctggtg tggttgatat ttttcatggt attgataaag ctgttgccga tacttggaac aatttctgga aagacggtaa agctgatggt attggctcta atttgtctag gaaataaccg tcaggattga caccctccca attgtatgtt ttcatgcctc caaatcttgg aggctttttt atggttcgtt cttattaccc ttctgaatgt cacgctgatt attttgactt tgagcgtatc gaggctctta aacctgctat tgaggcttgt ggcatttcta ctctttctca atccccaatg cttggcttcc ataagcagat ggataaccgc atcaagctct tggaagagat tctgtctttt cgtatgcagg gcgttgagtt cgataatggt gatatgtatg ttgacggcca taaggctgct tctgacgttc gtgatgagtt tgtatctgtt actgagaagt taatggatga attggcacaa tgctacaatg tgctccccca acttgatatt aataacacta tagaccaccg ccccgaaggg gacgaaaaat ggtttttaga gaacgagaag acggttacgc agttttgccg caagctggct gctgaacgcc ctcttaagga tattcgcgat gagtataatt accccaaaaa gaaaggtatt aaggatgagt gttcaagatt gctggaggcc tccactatga aatcgcgtag aggctttgct attcagcgtt tgatgaatgc aatgcgacag gctcatgctg atggttggtt tatcgttttt gacactctca cgttggctga cgaccgatta gaggcgtttt atgataatcc caatgctttg cgtgactatt ttcgtgatat tggtcgtatg gttcttgctg ccgagggtcg caaggctaat gattcacacg ccgactgcta tcagtatttt tgtgtgcctg agtatggtac agctaatggc cgtcttcatt tccatgcggt gcactttatg cggacacttc ctacaggtag cgttgaccct aattttggtc gtcgggtacg caatcgccgc cagttaaata gcttgcaaaa tacgtggcct tatggttaca gtatgcccat cgcagttcgc tacacgcagg acgctttttc acgttctggt tggttgtggc ctgttgatgc taaaggtgag ccgcttaaag ctaccagtta tatggctgtt ggtttctatg tggctaaata cgttaacaaa aagtcagata tggaccttgc tgctaaaggt ctaggagcta aagaatggaa caactcacta aaaaccaagc tgtcgctact tcccaagaag ctgttcagaa tcagaatgag ccgcaacttc gggatgaaaa tgctcacaat gacaaatctg tccacggagt gcttaatcca acttaccaag ctgggttacg acgcgacgcc gttcaaccag atattgaagc agaacgcaaa aagagagatg agattgaggc tgggaaaagt tactgtagcc gacgttttgg cggcgcaacc tgtgacgaca aatctgctca aatttatgcg cgcttcgata aaaatgattg gcgtatccaa cctgca

Page 35: Cautionary Tales of Next-generation ... - DTU … · Cautionary Tales of Next-generation and Next-next Generation Sequencing Dave Ussery ... . Comparative Microbial Genomics group


[Figure: Genome Atlases of two H. influenzae strains.

H. influenzae Rd KW20, 1,830,138 bp (axis ticks every 250k from 0k to 1500k; resolution 733). Legend: Intrinsic Curvature (dev avg, 0.19 to 0.26); Stacking Energy (dev avg, -7.91 to -7.09); Position Preference (dev avg, 0.14 to 0.17); Annotations: CDS +, CDS -, rRNA, tRNA; Global Direct Repeats (fix avg, 5.00 to 7.50); Global Inverted Repeats (fix avg, 5.00 to 7.50); GC Skew (dev avg, -0.07 to 0.07); Percent AT (fix avg, 0.20 to 0.80).

H. influenzae 86-028NP, 1,913,428 bp (axis ticks every 250k from 0k to 1750k; resolution 766). Legend: same lanes and ranges as for Rd KW20, except Stacking Energy (dev avg, -7.89 to -7.11).]

Page 36: Cautionary Tales of Next-generation ... - DTU … · Cautionary Tales of Next-generation and Next-next Generation Sequencing Dave Ussery ... . Comparative Microbial Genomics group


[Figure: Genome Atlases of H. influenzae Rd KW20 (1,830,138 bp; resolution 733) and H. influenzae 86-028NP (1,913,428 bp; resolution 766), with the same lanes and legend ranges as on the previous slide.]

Page 37: Cautionary Tales of Next-generation ... - DTU … · Cautionary Tales of Next-generation and Next-next Generation Sequencing Dave Ussery ... . Comparative Microbial Genomics group


[Figure: Genome Atlases of two versions of plasmid pO157 of E. coli O157:H7.

pO157 of E. coli O157:H7 strain Sakai, 92,721 bp (axis ticks every 12.5k from 0k to 87.5k; resolution 38). Labeled features: etpD, etpE, etpF, hlyA, hlyB, hlyD, nikB, toxB, traI, katP, espP, unnamed CDS, Origin; toxB is highlighted. Legend: Intrinsic Curvature (dev avg, 0.08 to 0.30); Stacking Energy (dev avg, -9.51 to -6.41); Position Preference (dev avg, 0.11 to 0.17); Annotations: CDS +, CDS -; Global Direct Repeats (fix avg, 5.00 to 7.50); Global Inverted Repeats (fix avg, 5.00 to 7.50); GC Skew (fix avg, -0.12 to 0.12); Percent AT (fix avg, 0.30 to 0.70).

pO157 of E. coli O157:H7, 92,077 bp (axis ticks every 12.5k from 0k to 75k; resolution 37). Labeled features: katP, espP, L7028, L7031, etpE, etpL, EHEC-hlyA, hlyB, L7072, L7081, L7095, Origin; hlyA is highlighted. Legend: same lanes and ranges as for strain Sakai, except Stacking Energy (dev avg, -9.52 to -6.41).]

Page 38: Cautionary Tales of Next-generation ... - DTU … · Cautionary Tales of Next-generation and Next-next Generation Sequencing Dave Ussery ... . Comparative Microbial Genomics group


Proteome comparison of Burkholderia mallei

ALR: 0.75, e-value: 1e-10

Genomes compared:

Escherichia coli strain K-12 isolate MG1655: 4289 genes, 4,639,221 bp

Escherichia coli strain K-12 isolate W3110: 4387 genes, 4,641,433 bp

Escherichia coli strain K-12 substr. DH10B: 4126 genes, 4,686,137 bp

Escherichia coli strain CFT073: 5379 genes, 5,231,428 bp

Comparison matrix (hits / genes in the row genome, with percentage; the comparison of a genome with itself is the homology within that genome):

MG1655 (4289 genes): vs MG1655 391 / 4289 (9.1%); vs W3110 4165 / 4289 (97.1%); vs DH10B 3920 / 4289 (91.4%); vs CFT073 3567 / 4289 (83.2%)

W3110 (4387 genes): vs MG1655 4217 / 4387 (96.1%); vs W3110 464 / 4387 (10.6%); vs DH10B 3939 / 4387 (89.8%); vs CFT073 3616 / 4387 (82.4%)

DH10B (4126 genes): vs MG1655 4010 / 4126 (97.2%); vs W3110 3994 / 4126 (96.8%); vs DH10B 564 / 4126 (13.7%); vs CFT073 3539 / 4126 (85.8%)

CFT073 (5379 genes): vs MG1655 3625 / 5379 (67.4%); vs W3110 3637 / 5379 (67.6%); vs DH10B 3502 / 5379 (65.1%); vs CFT073 680 / 5379 (12.6%)

Homology within genomes (legend range): 9.12 to 13.67

Homology between genomes (legend range): 65.11 to 97.19
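The percentages in the E. coli comparison follow directly from the counts shown in each cell. The following is a minimal sketch that recomputes a few of them; the function name `percent_shared` and the dictionary layout are illustrative, and the pairing of each count with a (row genome, compared genome) key is inferred from the denominators (the gene count of the row genome) rather than stated in the source.

```python
def percent_shared(hits, total):
    """Percentage of `total` genes that have a hit, rounded to one decimal place."""
    return round(100.0 * hits / total, 1)

# Counts copied from the E. coli matrix above; keys are (row genome, compared genome).
counts = {
    ("MG1655", "MG1655"): (391, 4289),   # homology within the genome
    ("MG1655", "W3110"): (4165, 4289),
    ("DH10B", "MG1655"): (4010, 4126),
    ("CFT073", "DH10B"): (3502, 5379),
}
percentages = {pair: percent_shared(hits, total) for pair, (hits, total) in counts.items()}
```

The smallest and largest between-genome values recomputed this way (65.1 and 97.2) match the legend range given for homology between genomes.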

Page 39: Cautionary Tales of Next-generation ... - DTU … · Cautionary Tales of Next-generation and Next-next Generation Sequencing Dave Ussery ... . Comparative Microbial Genomics group


Pairwise proteome comparison matrix (32 genomes, triangular). Each cell gives the homology percentage and, in parentheses, the counts it is computed from. Each row starts with the comparison of one genome with itself and continues with its comparisons with each of the following genomes, in the order of the genome labels listed after the matrix.

Row 1: 3.0 % (110 / 3,665); 88.1 % (3,495 / 3,966); 80.4 % (3,303 / 4,109); 77.1 % (3,186 / 4,134); 92.9 % (3,489 / 3,754); 79.6 % (3,139 / 3,944); 85.8 % (3,291 / 3,837); 83.2 % (3,325 / 3,995); 82.4 % (3,302 / 4,009); 83.4 % (3,315 / 3,973); 76.9 % (3,165 / 4,117); 64.7 % (2,888 / 4,463); 70.4 % (2,922 / 4,153); 68.7 % (2,874 / 4,181); 75.9 % (3,094 / 4,077); 71.6 % (3,010 / 4,205); 73.6 % (3,045 / 4,135); 70.3 % (2,933 / 4,174); 40.0 % (2,421 / 6,055); 42.2 % (2,215 / 5,254); 44.3 % (2,515 / 5,683); 41.7 % (2,552 / 6,116); 38.4 % (2,291 / 5,971); 40.3 % (2,326 / 5,771); 35.0 % (1,949 / 5,561); 34.0 % (1,963 / 5,771); 32.1 % (1,846 / 5,747); 38.7 % (2,143 / 5,536); 35.8 % (2,018 / 5,637); 32.5 % (2,385 / 7,336); 31.2 % (2,143 / 6,862); 27.2 % (1,946 / 7,165)

Row 2: 4.2 % (155 / 3,729); 88.8 % (3,489 / 3,927); 74.9 % (3,187 / 4,253); 89.7 % (3,485 / 3,884); 78.1 % (3,147 / 4,032); 82.5 % (3,278 / 3,971); 80.7 % (3,321 / 4,117); 80.8 % (3,319 / 4,106); 81.3 % (3,320 / 4,085); 76.7 % (3,195 / 4,167); 64.9 % (2,940 / 4,533); 70.3 % (2,965 / 4,217); 67.2 % (2,880 / 4,288); 77.2 % (3,155 / 4,088); 72.6 % (3,068 / 4,226); 74.9 % (3,101 / 4,142); 69.2 % (2,953 / 4,267); 39.6 % (2,438 / 6,154); 41.6 % (2,225 / 5,354); 43.7 % (2,527 / 5,781); 41.2 % (2,564 / 6,224); 38.0 % (2,307 / 6,067); 39.8 % (2,339 / 5,873); 34.8 % (1,967 / 5,647); 33.7 % (1,977 / 5,865); 32.1 % (1,873 / 5,828); 38.3 % (2,156 / 5,631); 35.9 % (2,049 / 5,713); 32.6 % (2,405 / 7,380); 31.1 % (2,163 / 6,948); 27.1 % (1,964 / 7,245)

Row 3: 4.3 % (157 / 3,665); 80.2 % (3,271 / 4,079); 81.1 % (3,280 / 4,046); 80.2 % (3,169 / 3,953); 83.4 % (3,267 / 3,919); 81.4 % (3,309 / 4,067); 81.6 % (3,311 / 4,057); 81.9 % (3,311 / 4,041); 81.6 % (3,264 / 4,000); 67.6 % (2,983 / 4,413); 72.2 % (2,986 / 4,136); 69.7 % (2,916 / 4,183); 78.0 % (3,149 / 4,038); 73.5 % (3,065 / 4,172); 75.5 % (3,089 / 4,092); 69.7 % (2,944 / 4,221); 40.0 % (2,440 / 6,094); 42.3 % (2,236 / 5,283); 44.5 % (2,539 / 5,707); 41.9 % (2,575 / 6,151); 38.5 % (2,311 / 6,004); 40.4 % (2,345 / 5,808); 35.3 % (1,972 / 5,581); 34.2 % (1,983 / 5,797); 32.5 % (1,873 / 5,769); 38.8 % (2,162 / 5,566); 36.4 % (2,055 / 5,647); 33.1 % (2,415 / 7,299); 31.5 % (2,169 / 6,884); 27.5 % (1,971 / 7,179)

Row 4: 3.3 % (120 / 3,599); 77.3 % (3,164 / 4,093); 75.1 % (3,024 / 4,028); 79.4 % (3,143 / 3,956); 76.0 % (3,147 / 4,141); 75.3 % (3,136 / 4,162); 76.3 % (3,142 / 4,120); 77.5 % (3,153 / 4,066); 65.1 % (2,880 / 4,423); 68.5 % (2,869 / 4,191); 69.0 % (2,860 / 4,145); 71.5 % (2,986 / 4,175); 68.5 % (2,914 / 4,256); 69.8 % (2,942 / 4,217); 66.3 % (2,833 / 4,271); 38.4 % (2,348 / 6,120); 41.3 % (2,181 / 5,277); 42.8 % (2,449 / 5,718); 40.0 % (2,473 / 6,185); 37.0 % (2,227 / 6,026); 38.6 % (2,251 / 5,839); 33.6 % (1,884 / 5,612); 32.5 % (1,896 / 5,827); 30.6 % (1,777 / 5,804); 37.9 % (2,110 / 5,560); 34.7 % (1,968 / 5,677); 31.7 % (2,323 / 7,337); 30.4 % (2,098 / 6,893); 26.3 % (1,893 / 7,208)

Row 5: 2.8 % (99 / 3,560); 80.7 % (3,108 / 3,853); 87.0 % (3,272 / 3,762); 82.9 % (3,277 / 3,954); 82.6 % (3,267 / 3,954); 83.7 % (3,275 / 3,915); 78.4 % (3,157 / 4,029); 67.3 % (2,909 / 4,320); 73.7 % (2,947 / 4,001); 71.8 % (2,896 / 4,036); 79.5 % (3,125 / 3,932); 74.1 % (3,024 / 4,083); 76.0 % (3,059 / 4,025); 73.2 % (2,952 / 4,034); 41.4 % (2,445 / 5,905); 43.8 % (2,234 / 5,101); 45.9 % (2,535 / 5,526); 42.9 % (2,564 / 5,977); 39.7 % (2,312 / 5,825); 41.7 % (2,346 / 5,626); 36.3 % (1,965 / 5,413); 35.3 % (1,981 / 5,619); 33.3 % (1,863 / 5,593); 40.3 % (2,167 / 5,378); 37.3 % (2,045 / 5,477); 33.6 % (2,410 / 7,181); 32.3 % (2,164 / 6,706); 28.0 % (1,962 / 7,016)

Row 6: 1.8 % (59 / 3,353); 83.0 % (3,126 / 3,768); 83.2 % (3,208 / 3,855); 85.6 % (3,244 / 3,790); 86.6 % (3,253 / 3,757); 74.3 % (2,987 / 4,018); 67.8 % (2,836 / 4,184); 81.0 % (2,989 / 3,688); 71.5 % (2,801 / 3,915); 77.9 % (3,002 / 3,856); 73.0 % (2,908 / 3,986); 75.2 % (2,954 / 3,928); 73.5 % (2,863 / 3,897); 42.4 % (2,408 / 5,683); 45.9 % (2,223 / 4,842); 47.1 % (2,503 / 5,310); 44.1 % (2,533 / 5,743); 40.6 % (2,270 / 5,592); 42.9 % (2,314 / 5,388); 37.7 % (1,947 / 5,165); 36.6 % (1,964 / 5,371); 34.4 % (1,846 / 5,360); 41.6 % (2,140 / 5,139); 38.7 % (2,021 / 5,225); 34.3 % (2,377 / 6,932); 33.0 % (2,137 / 6,467); 28.7 % (1,944 / 6,766)

Row 7: 2.9 % (100 / 3,429); 91.4 % (3,373 / 3,689); 90.1 % (3,346 / 3,715); 91.2 % (3,355 / 3,679); 78.6 % (3,113 / 3,962); 68.1 % (2,876 / 4,226); 74.9 % (2,944 / 3,932); 72.2 % (2,861 / 3,960); 83.0 % (3,135 / 3,777); 78.5 % (3,061 / 3,901); 80.4 % (3,080 / 3,831); 76.4 % (2,970 / 3,887); 41.8 % (2,434 / 5,818); 44.3 % (2,232 / 5,038); 46.4 % (2,534 / 5,464); 43.3 % (2,559 / 5,916); 39.9 % (2,299 / 5,768); 41.9 % (2,334 / 5,571); 36.6 % (1,961 / 5,354); 35.5 % (1,974 / 5,563); 33.4 % (1,852 / 5,547); 40.6 % (2,159 / 5,323); 37.4 % (2,032 / 5,428); 33.8 % (2,403 / 7,116); 32.4 % (2,155 / 6,649); 28.2 % (1,960 / 6,957)

Row 8: 2.8 % (102 / 3,619); 90.4 % (3,439 / 3,805); 91.7 % (3,455 / 3,766); 75.5 % (3,125 / 4,141); 66.4 % (2,917 / 4,393); 73.3 % (2,983 / 4,071); 69.0 % (2,868 / 4,158); 79.4 % (3,144 / 3,961); 75.8 % (3,073 / 4,056); 77.3 % (3,098 / 4,009); 73.1 % (2,977 / 4,072); 40.8 % (2,445 / 5,993); 43.0 % (2,240 / 5,212); 45.2 % (2,546 / 5,633); 42.3 % (2,572 / 6,075); 38.9 % (2,309 / 5,932); 40.9 % (2,343 / 5,733); 35.7 % (1,969 / 5,522); 34.6 % (1,982 / 5,732); 32.7 % (1,868 / 5,705); 39.5 % (2,169 / 5,493); 36.7 % (2,048 / 5,585); 33.3 % (2,418 / 7,252); 31.8 % (2,169 / 6,817); 27.6 % (1,965 / 7,122)

Row 9: 3.0 % (109 / 3,575); 96.0 % (3,531 / 3,678); 74.6 % (3,103 / 4,160); 66.3 % (2,908 / 4,386); 73.6 % (2,979 / 4,045); 69.1 % (2,864 / 4,145); 79.3 % (3,138 / 3,958); 75.1 % (3,061 / 4,077); 77.1 % (3,088 / 4,007); 73.4 % (2,971 / 4,050); 41.1 % (2,450 / 5,957); 43.4 % (2,242 / 5,171); 45.5 % (2,548 / 5,599); 42.7 % (2,578 / 6,032); 39.3 % (2,314 / 5,892); 41.2 % (2,346 / 5,697); 35.9 % (1,971 / 5,485); 34.8 % (1,984 / 5,694); 33.0 % (1,872 / 5,667); 39.7 % (2,168 / 5,459); 37.0 % (2,051 / 5,545); 33.5 % (2,420 / 7,225); 32.1 % (2,173 / 6,778); 27.7 % (1,965 / 7,093)

Row 10: 2.6 % (92 / 3,593); 75.4 % (3,111 / 4,124); 67.0 % (2,915 / 4,348); 74.3 % (2,982 / 4,014); 70.0 % (2,873 / 4,102); 80.3 % (3,147 / 3,918); 76.3 % (3,076 / 4,029); 78.0 % (3,097 / 3,971); 73.8 % (2,975 / 4,030); 41.3 % (2,449 / 5,936); 43.6 % (2,244 / 5,143); 45.8 % (2,552 / 5,568); 42.9 % (2,579 / 6,005); 39.4 % (2,314 / 5,868); 41.4 % (2,346 / 5,670); 36.1 % (1,971 / 5,458); 35.0 % (1,985 / 5,667); 33.2 % (1,872 / 5,641); 40.0 % (2,171 / 5,428); 37.2 % (2,052 / 5,516); 33.6 % (2,420 / 7,193); 32.2 % (2,173 / 6,752); 27.8 % (1,967 / 7,064)

Row 11: 3.5 % (125 / 3,567); 64.7 % (2,847 / 4,403); 68.5 % (2,844 / 4,150); 68.3 % (2,820 / 4,126); 73.1 % (3,000 / 4,103); 69.2 % (2,925 / 4,226); 71.3 % (2,953 / 4,142); 65.6 % (2,791 / 4,256); 37.9 % (2,313 / 6,099); 41.1 % (2,153 / 5,242); 42.2 % (2,409 / 5,711); 39.4 % (2,432 / 6,172); 36.9 % (2,208 / 5,989); 38.0 % (2,213 / 5,823); 32.8 % (1,843 / 5,611); 31.9 % (1,857 / 5,817); 30.2 % (1,747 / 5,786); 38.2 % (2,104 / 5,504); 34.4 % (1,944 / 5,645); 31.0 % (2,282 / 7,362); 30.3 % (2,079 / 6,856); 25.7 % (1,850 / 7,198)

Row 12: 2.8 % (99 / 3,586); 70.6 % (2,886 / 4,087); 68.0 % (2,806 / 4,126); 69.5 % (2,908 / 4,185); 64.3 % (2,805 / 4,365); 70.2 % (2,930 / 4,172); 65.2 % (2,768 / 4,246); 38.2 % (2,320 / 6,080); 41.2 % (2,152 / 5,228); 41.4 % (2,372 / 5,735); 39.1 % (2,413 / 6,176); 36.4 % (2,186 / 6,003); 38.0 % (2,209 / 5,811); 33.1 % (1,848 / 5,585); 32.1 % (1,861 / 5,795); 30.0 % (1,736 / 5,791); 37.3 % (2,064 / 5,537); 34.8 % (1,952 / 5,606); 30.8 % (2,270 / 7,379); 29.7 % (2,044 / 6,887); 25.6 % (1,841 / 7,194)

Row 13: 2.5 % (84 / 3,305); 73.1 % (2,818 / 3,854); 76.7 % (2,961 / 3,861); 71.6 % (2,868 / 4,006); 77.2 % (2,983 / 3,866); 71.6 % (2,802 / 3,915); 42.3 % (2,399 / 5,675); 46.0 % (2,220 / 4,821); 46.3 % (2,464 / 5,326); 43.0 % (2,483 / 5,781); 39.9 % (2,238 / 5,609); 41.8 % (2,264 / 5,418); 36.7 % (1,906 / 5,192); 35.6 % (1,922 / 5,398); 33.6 % (1,805 / 5,377); 41.6 % (2,134 / 5,135); 38.1 % (1,994 / 5,228); 33.3 % (2,327 / 6,984); 32.4 % (2,098 / 6,481); 28.1 % (1,904 / 6,782)

Row 14: 2.2 % (73 / 3,311); 71.8 % (2,849 / 3,968); 67.9 % (2,780 / 4,092); 71.8 % (2,855 / 3,974); 68.6 % (2,743 / 4,001); 39.7 % (2,303 / 5,796); 42.7 % (2,113 / 4,953); 43.5 % (2,367 / 5,437); 40.1 % (2,373 / 5,919); 38.0 % (2,171 / 5,707); 39.9 % (2,202 / 5,514); 34.9 % (1,843 / 5,282); 34.2 % (1,872 / 5,473); 31.9 % (1,743 / 5,469); 39.1 % (2,048 / 5,244); 36.4 % (1,935 / 5,317); 32.1 % (2,268 / 7,062); 31.2 % (2,045 / 6,565); 26.9 % (1,851 / 6,869)

Row 15: 2.1 % (72 / 3,454); 78.6 % (3,059 / 3,894); 82.5 % (3,117 / 3,780); 77.1 % (2,975 / 3,861); 42.9 % (2,463 / 5,745); 44.3 % (2,213 / 5,001); 45.9 % (2,501 / 5,451); 43.5 % (2,550 / 5,859); 40.1 % (2,293 / 5,719); 42.0 % (2,320 / 5,530); 36.8 % (1,951 / 5,307); 35.7 % (1,970 / 5,513); 33.5 % (1,845 / 5,501); 40.4 % (2,139 / 5,293); 37.7 % (2,026 / 5,367); 34.2 % (2,394 / 7,002); 32.6 % (2,153 / 6,600); 28.2 % (1,949 / 6,915)

Row 16: 2.4 % (83 / 3,442); 78.0 % (3,045 / 3,906); 73.0 % (2,911 / 3,989); 40.8 % (2,400 / 5,876); 43.5 % (2,200 / 5,058); 46.1 % (2,513 / 5,455); 42.2 % (2,506 / 5,941); 39.2 % (2,272 / 5,796); 41.1 % (2,303 / 5,603); 36.1 % (1,940 / 5,373); 34.9 % (1,952 / 5,586); 32.9 % (1,824 / 5,545); 39.4 % (2,118 / 5,370); 37.3 % (2,022 / 5,418); 33.4 % (2,359 / 7,060); 31.8 % (2,123 / 6,682); 27.9 % (1,942 / 6,969)

Row 17: 2.9 % (99 / 3,427); 74.7 % (2,922 / 3,914); 42.4 % (2,451 / 5,781); 44.9 % (2,242 / 4,998); 45.7 % (2,506 / 5,480); 42.9 % (2,536 / 5,906); 39.8 % (2,292 / 5,756); 41.6 % (2,314 / 5,569); 36.2 % (1,940 / 5,352); 35.2 % (1,954 / 5,558); 33.0 % (1,831 / 5,549); 40.3 % (2,145 / 5,320); 37.3 % (2,019 / 5,407); 33.4 % (2,375 / 7,115); 32.0 % (2,130 / 6,656); 27.9 % (1,941 / 6,954)

Row 18: 1.9 % (62 / 3,316); 40.9 % (2,360 / 5,769); 43.5 % (2,155 / 4,958); 45.1 % (2,444 / 5,415); 42.3 % (2,475 / 5,845); 39.2 % (2,230 / 5,684); 41.5 % (2,272 / 5,479); 36.3 % (1,906 / 5,250); 35.3 % (1,926 / 5,455); 33.1 % (1,804 / 5,448); 39.8 % (2,086 / 5,245); 37.8 % (2,001 / 5,288); 33.0 % (2,325 / 7,056); 32.0 % (2,097 / 6,549); 27.9 % (1,909 / 6,851)

Row 19: 3.2 % (147 / 4,662); 43.5 % (2,547 / 5,858); 48.9 % (2,994 / 6,128); 46.4 % (3,042 / 6,550); 41.4 % (2,698 / 6,523); 43.9 % (2,762 / 6,293); 35.9 % (2,233 / 6,219); 35.5 % (2,270 / 6,400); 31.9 % (2,074 / 6,494); 64.9 % (3,384 / 5,214); 47.0 % (2,741 / 5,827); 46.4 % (3,371 / 7,266); 35.2 % (2,581 / 7,333); 29.6 % (2,295 / 7,753)

Row 20: 2.1 % (79 / 3,683); 47.2 % (2,608 / 5,524); 43.2 % (2,597 / 6,013); 43.7 % (2,492 / 5,705); 45.4 % (2,507 / 5,523); 36.2 % (1,976 / 5,464); 34.5 % (1,965 / 5,696); 32.3 % (1,842 / 5,705); 45.0 % (2,357 / 5,232); 37.8 % (2,081 / 5,503); 34.9 % (2,496 / 7,160); 34.3 % (2,276 / 6,634); 27.9 % (1,972 / 7,061)

Row 21: 2.8 % (121 / 4,337); 67.5 % (3,741 / 5,540); 41.9 % (2,637 / 6,301); 43.7 % (2,670 / 6,112); 36.3 % (2,179 / 6,005); 35.3 % (2,191 / 6,211); 32.6 % (2,040 / 6,250); 46.1 % (2,626 / 5,697); 39.9 % (2,372 / 5,942); 38.7 % (2,880 / 7,439); 34.4 % (2,472 / 7,184); 29.4 % (2,212 / 7,534)

Row 22: 3.1 % (150 / 4,773); 39.7 % (2,672 / 6,728); 40.8 % (2,680 / 6,565); 34.2 % (2,205 / 6,448); 33.1 % (2,209 / 6,672); 30.5 % (2,050 / 6,715); 43.2 % (2,655 / 6,143); 37.5 % (2,396 / 6,387); 37.0 % (2,900 / 7,832); 33.0 % (2,516 / 7,615); 27.8 % (2,222 / 7,979)

Row 23: 2.3 % (103 / 4,463); 72.3 % (3,688 / 5,101); 36.6 % (2,201 / 6,016); 35.7 % (2,219 / 6,213); 33.1 % (2,074 / 6,269); 40.2 % (2,413 / 5,999); 36.9 % (2,259 / 6,124); 34.6 % (2,682 / 7,762); 36.5 % (2,593 / 7,105); 28.1 % (2,155 / 7,667)

Row 24: 2.8 % (118 / 4,277); 38.7 % (2,246 / 5,808); 38.1 % (2,277 / 5,979); 34.9 % (2,114 / 6,065); 42.3 % (2,451 / 5,795); 38.6 % (2,289 / 5,931); 36.7 % (2,759 / 7,516); 36.7 % (2,562 / 6,982); 29.5 % (2,198 / 7,456)

Row 25: 2.6 % (96 / 3,691); 75.0 % (3,261 / 4,346); 55.5 % (2,683 / 4,838); 34.5 % (1,963 / 5,692); 33.6 % (1,915 / 5,695); 29.6 % (2,214 / 7,478); 30.4 % (2,085 / 6,866); 30.3 % (2,110 / 6,968)

Row 26: 2.9 % (112 / 3,894); 52.4 % (2,666 / 5,085); 33.9 % (1,991 / 5,874); 32.5 % (1,919 / 5,903); 29.3 % (2,244 / 7,665); 29.4 % (2,083 / 7,082); 29.7 % (2,127 / 7,169)

Row 27: 3.3 % (111 / 3,378); 30.5 % (1,813 / 5,939); 30.3 % (1,795 / 5,923); 28.3 % (2,095 / 7,406); 26.7 % (1,916 / 7,168); 28.3 % (1,980 / 6,989)

Row 28: 2.7 % (103 / 3,886); 46.2 % (2,452 / 5,307); 43.4 % (2,981 / 6,875); 34.5 % (2,335 / 6,762); 28.0 % (2,022 / 7,222)

Row 29: 2.3 % (88 / 3,822); 45.0 % (3,018 / 6,702); 30.9 % (2,144 / 6,948); 25.5 % (1,872 / 7,339)

Row 30: 3.9 % (201 / 5,117); 30.1 % (2,581 / 8,574); 26.1 % (2,254 / 8,624)

Row 31: 3.9 % (200 / 5,078); 25.9 % (2,170 / 8,370)

Row 32: 5.0 % (243 / 4,897)

V.cholerae

N16961

V.cholerae0395 TEDA

V.cholerae0395 TIGR

V.choleraeV52

V.choleraeM66-2MO10

V.choleraeBX330286

V.choleraeRC9

V.choleraeMJ1236

V.choleraeB33VCE

V.cholerae2740-80

V.cholerae

V.cholerae1587

V.choleraeAM-19226

V.choleraeMZO-2

V.cholerae12129

V.choleraeTM11079-80

V.choleraeTMA21

V.choleraeVL426

V.parahaemolyticus2210633

V.parahaemolyticus 16

V.vulnificusCMCP6

V.vulnificusYJ016

V.speciesMED222

V.splendidusLGP32

A.fischeri ES114

A.fischeri MJ11

A.salmonicidaLFI1238

V.speciesEx25

V.campbellii AND4

V.harveyi BAA1116

V.shilonii AK1

P.profundumSS9 V.c

holer

aeN16

961

V.cho

lerae

0395

TEDA

V.cho

lerae

0395

TIGR

V.cho

lerae

V52

V.cho

lerae

M66-2

V.cho

lerae

MO10

V.cho

lerae

BX3302

86

V.cho

lerae

RC9

V.cho

lerae

MJ123

6

V.cho

lerae

B33VCE

V.cho

lerae

2740

-80

V.cho

lerae

1587

V.cho

lerae

AM-192

26

V.cho

lerae

MZO-2

V.cho

lerae

1212

9

V.cho

lerae

TM1107

9-80

V.cho

lerae

TMA21

V.cho

lerae

VL426

V.par

ahae

molytic

us 22

1063

3

V.par

ahae

molytic

us16

V.vuln

ificus

CMCP6

V.vuln

ificus

YJ016

V.spe

cies M

ED222

V.sple

ndidu

sLG

P32

A.fisch

eri ES11

4

A.fisch

eri MJ1

1

A.salm

onici

daLF

I1238

V.spe

cies

Ex25

V.cam

pbell

ii AND4

V.har

veyi

BAA1116

V.shil

onii

AK1

P.pro

fundu

mSS9

Homology within proteomes

6.0 %0.0 %

Homology between proteomes

90.0 %30.0 %

Figure

4BLAST

matrix

ofthe

32Vibrionaceae

genomes.

The

colourshighlighting

thespecies

arethe

sameas

inFig.1.

Sincethe

reciprocalsim

ilarity(reported

aspercent)

isnot

readableatthis

resolution,everymatrix

celliscoloured

usingthe

scalesas

indicated.The

bottomrow

identifieshits

(otherthan

hits-to-self)found

within

age-

nome.

Fourmatrix

cellsreport-

inghigh

pairwise

similarities

areoutlined;

theirnum

bersare

specifiedin

thetext

Origins

ofV.

cholerae7
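The cell labels in the BLAST matrix pair a percentage with two counts, for example "43.5 %2,550 / 5,859". A minimal Python sketch of how such a label can be read and sanity-checked follows, assuming the percentage is simply the first count divided by the second (the slide does not define what the counts are, and the names `parse_cell` and `recomputed_percent` are hypothetical, not from the talk):

```python
import re

# Cell text as it appears in the extracted figure: percent, then "count / total".
CELL_PATTERN = re.compile(r"^\s*([\d.]+)\s*%\s*([\d,]+)\s*/\s*([\d,]+)\s*$")


def parse_cell(text):
    """Return (reported_percent, numerator, denominator) for one matrix cell."""
    match = CELL_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a matrix cell: {text!r}")
    percent = float(match.group(1))
    # Counts are printed with thousands separators, e.g. "2,550".
    numerator = int(match.group(2).replace(",", ""))
    denominator = int(match.group(3).replace(",", ""))
    return percent, numerator, denominator


def recomputed_percent(numerator, denominator):
    """Percent implied by the two counts, rounded to one decimal place."""
    return round(100.0 * numerator / denominator, 1)


# Three cells copied from the figure text; the check is only that the
# printed percentage agrees with the ratio of the printed counts.
cells = ["43.5 %2,550 / 5,859", "2.4 %83 / 3,442", "78.0 %3,045 / 3,906"]
for cell in cells:
    reported, num, den = parse_cell(cell)
    print(cell, "->", recomputed_percent(num, den), "reported:", reported)
```

A check of this kind is useful when matrix values are copied out of a figure by hand or by text extraction, since a transposed digit in either count shows up as a mismatch with the printed percentage.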

Page 40: Cautionary Tales of Next-generation ... - DTU … · Cautionary Tales of Next-generation and Next-next Generation Sequencing Dave Ussery ... . Comparative Microbial Genomics group

[Figure: circular genome plots, image not reproduced. T. Vesth et al.]

V. cholerae O1 El Tor N16961 chromosome 1, 2,961,149 bp (axis 0M to 2.5M). Marked regions: Gap A, Gap B, Gap C, Gap D, Gap E, Gap F, Gap G

V. cholerae O1 El Tor N16961 chromosome 2, 1,072,310 bp (axis 0k to 1000k). Marked region: Superintegron

Legend (labelled "Outer circle" and "Inner circle"): P. profundum SS9, V. shilonii AK1, V. harveyi BAA-1116, V. campbellii AND4, V. parahaemolyticus 16, V. parahaemolyticus 2210633, Vibrio spp. Ex25, A. salmonicida LFI1238, A. fischeri MJ11, A. fischeri ES114, V. splendidus LGP32, V. species MED222, V. vulnificus YJ016, V. vulnificus CMCP6, V. cholerae VL426, V. cholerae 12129, V. cholerae TMA21, V. cholerae TM11079-80, V. cholerae 1587, V. cholerae AM-19226, V. cholerae MZO-2, V. cholerae 2740-80, V. cholerae BX330286, V. cholerae B33VCE, V. cholerae RC9, V. cholerae MJ1236, V. cholerae M66-2, V. cholerae V52, V. cholerae MO10, V. cholerae O395 TEDA, V. cholerae O395 TIGR, V. cholerae N16961, Stacking energy, Position preference, Global direct repeats, GC skew, genes positive strand, genes negative strand

Page 41: Cautionary Tales of Next-generation ... - DTU … · Cautionary Tales of Next-generation and Next-next Generation Sequencing Dave Ussery ... . Comparative Microbial Genomics group

[Figure: repeats the circular genome plots from the previous slide, image not reproduced. V. cholerae O1 El Tor N16961 chromosome 1 (2,961,149 bp, Gaps A to G marked) and chromosome 2 (1,072,310 bp, Superintegron marked), with the same legend of 32 Vibrionaceae genomes, Stacking energy, Position preference, Global direct repeats, GC skew, and genes on the positive and negative strands. T. Vesth et al.]

Page 42: Cautionary Tales of Next-generation ... - DTU … · Cautionary Tales of Next-generation and Next-next Generation Sequencing Dave Ussery ... . Comparative Microbial Genomics group