Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing

Talk from the Genome Informatics Alliance 2009 meeting.

Challenges with data quality, sharing, and versioning

David Dooling <ddooling@wustl.edu>, GIA 2009

Production Centers

• Tony Cox, Sanger: Sequencing, Scale, Infrastructure, Data flow

• Toby Bloom, Broad: Quality, Integration, Standards, Sharing

• David Dooling, WUStL: Scale, Quality, Sharing, Versioning

sub scale {

Moore’s Law

[Chart, 2000 to 2010. Series: Personnel, Moore’s Law, CPU, Storage, Sequence]

Images

200 TB/week

Images

10 PB/year
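The step from 200 TB/week to 10 PB/year is a unit conversion. A minimal sketch, assuming 52 weeks of continuous production per year and decimal units (1 PB = 1000 TB); the function name is illustrative and does not come from the talk:

```python
WEEKS_PER_YEAR = 52  # assumption: production runs every week of the year

def tb_per_week_to_pb_per_year(tb_per_week):
    """Convert a weekly volume in TB to a yearly volume in PB (1 PB = 1000 TB)."""
    return tb_per_week * WEEKS_PER_YEAR / 1000

images_pb_per_year = tb_per_week_to_pb_per_year(200)
print(images_pb_per_year)  # 10.4, which the slide rounds to 10 PB/year
```

The same conversion applies to any of the other weekly volumes quoted in the talk.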

Perspective

20 PB/day

Perspective

2 PB/s

FASTQ

@HWI-EAS404:5:1:6:180#0/1

GCTGGTTTAACTCGAGTATTTGTCCATTCTACTAATTTGAGTGTCTGCTGTGGAAAGGTGTTTGTCATGTATTTT

+HWI-EAS404:5:1:6:180#0/1

aaaa`]aaaa`aa^aa]aaaa^\`_\``____`W]a_`T\[[b__`\YXUW][MSTNZX^[[`_Z[^``\X`^a\

@HWI-EAS404:5:1:6:396#0/1

TATTTACTCTATCCCATTATATACATATTATGATTTCAAAATAACAATGCCAATATAAAAACTAACAATATGATA

+HWI-EAS404:5:1:6:396#0/1

Yaaa_baa`^a]Wa___aaa^I^V]^]NQ_`\^ZPP[__^_\a`^a`JYQWVNFFMRQSX_X^a_Y[`^a^\\NZ

@HWI-EAS404:5:1:6:1344#0/1

GAGGACTTGCATGCTAGGTTTGGTTCTTGGCTGAATTGCTGAAACTGTCCAAGTATCAGTAGCAAAACATGGGTG

+HWI-EAS404:5:1:6:1344#0/1

aabaaa__]^a`[\^`]]Y``[ST_]`]WW]]WZ]`^ZT[_X``\`_WVNYWKDNLTW[Y\XSVZ^ZTZZVRUX[

@HWI-EAS404:5:1:6:1814#0/1

AAAGCTTACTGCTGTTTAGAATTCTTGCTACAGTCAGGAGAAAGCCGAAAGCTGAACGGGTACTGAATCTTCTAC

+HWI-EAS404:5:1:6:1814#0/1

aa````aa^a`_^`\`a`XY`^ZX^YW\^[X\UWUYOMVZZ\\_W^^\XXTSMHMLLNTTDWU__[WVVY]Y_]X

7 TB/week
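Each FASTQ record on this slide spans four lines: an @ header, the bases, a + separator, and one quality character per base. A minimal sketch of decoding one record, assuming the quality characters use an ASCII offset of 64 (typical of Illumina pipelines of this period, but an assumption here, since the slide does not state the encoding). The record is the first one from the slide, truncated to ten bases for brevity; parse_fastq_record is a hypothetical helper:

```python
def parse_fastq_record(lines, offset=64):
    """Split a four-line FASTQ record into (name, bases, quality scores)."""
    header, bases, separator, quality = [line.strip() for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(bases) != len(quality):
        raise ValueError("bases and qualities differ in length")
    # Assumed encoding: score = ASCII code minus the offset.
    scores = [ord(ch) - offset for ch in quality]
    return header[1:], bases, scores

# First record from the slide, truncated to its first ten bases.
record = [
    "@HWI-EAS404:5:1:6:180#0/1",
    "GCTGGTTTAA",
    "+HWI-EAS404:5:1:6:180#0/1",
    "aaaa`]aaaa",
]
name, bases, scores = parse_fastq_record(record)
print(name, bases, scores)
```

Storing one quality character next to every base is what roughly doubles the size of the sequence data.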

FASTQ

350 TB/year

Mapping

2 TB/week

Mapping

100 TB/year

Mapping

42,000 core-hr/week

Mapping

5 core-yr/week

Mapping

250 core cluster
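The three mapping compute figures are the same quantity in different units. A minimal sketch of the arithmetic, assuming every core runs at full utilization around the clock (real clusters need headroom beyond this):

```python
HOURS_PER_WEEK = 7 * 24    # 168
HOURS_PER_YEAR = 365 * 24  # 8760

core_hours_per_week = 42_000  # figure from the slide

# Express one week's workload in core-years.
core_years_per_week = core_hours_per_week / HOURS_PER_YEAR

# Cores needed to finish one week's workload within one week.
cores_needed = core_hours_per_week / HOURS_PER_WEEK

print(round(core_years_per_week, 1))  # 4.8, rounded to 5 core-yr/week on the slide
print(cores_needed)                   # 250.0
```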

The Weakest Link

The Balanced PC

• Clock speed
• AGP
• Front-side bus
• Hypertransport
• 1 Gbps
• PCI-X
• SATA
• PCI-Express
• Infiniband
• Multi-core
• Front-side bus
• GPU
• 10 Gbps

The balanced PS¹

10 gosub get(sequencers)

20 gosub get(disk)

30 gosub get(backup_capacity)

40 gosub get(network_capacity)

50 gosub get(cluster_nodes)

¹ Pipeline for Sequencing

The unbalanced PS

10 gosub get(sequencers)

20 gosub get(disk)

30 gosub get(backup_capacity)

40 gosub get(network_capacity)

50 gosub get(cluster_nodes)

60 goto 10
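One reading of the weakest-link slides is that a pipeline moves data only as fast as its slowest stage, so adding sequencers just shifts the bottleneck to the next component. A minimal sketch of that idea; the capacities are hypothetical numbers chosen for illustration, and none of them come from the talk:

```python
def bottleneck(capacities_tb_per_week):
    """Return (stage, capacity) for the stage with the lowest weekly capacity."""
    stage = min(capacities_tb_per_week, key=capacities_tb_per_week.get)
    return stage, capacities_tb_per_week[stage]

# Hypothetical weekly capacities in TB, keyed by the stage names on the slide.
pipeline = {
    "sequencers": 7,
    "disk": 10,
    "backup_capacity": 5,
    "network_capacity": 12,
    "cluster_nodes": 8,
}

stage, capacity = bottleneck(pipeline)
print(stage, capacity)  # backup_capacity 5
```

In this made-up configuration the sequencers produce 7 TB/week but backup absorbs only 5, so backup is the next purchase, after which the loop starts again.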

The GHz race

} # scale

sub quality {

Honda

Ford

Quality is Job 1

...must be more than just a slogan

Quality missteps

Initial low fidelity between predicted base quality values and observed quality

Tsonev, S. SEP 2007

An aside

“basecall calibration predicted vs. observed”
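Calibration compares the quality a base caller predicts with the error rate actually observed. A minimal sketch using the standard Phred definition, Q = -10 log10(p); the tallies are hypothetical and only illustrate the comparison, and both function names are made up for this example:

```python
import math

def phred_to_error_probability(q):
    """Error probability implied by a Phred-scaled quality value."""
    return 10 ** (-q / 10)

def observed_quality(mismatches, bases):
    """Empirical Phred-scaled quality from counted errors (mismatches must be > 0)."""
    return -10 * math.log10(mismatches / bases)

# Hypothetical tallies: predicted Q -> (mismatches, bases called at that Q).
tallies = {20: (150, 10_000), 30: (10, 10_000)}

for predicted, (mismatches, bases) in sorted(tallies.items()):
    observed = observed_quality(mismatches, bases)
    print(predicted, round(observed, 1))
```

A well calibrated caller gives observed values close to the predicted ones; in these invented numbers the bases predicted at Q20 perform slightly worse than claimed.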

Cult of traces

Quality is the key

Need high fidelity between predicted and observed quality

3 bits per base

50 bytes per base

20 bytes per base

2 bytes per base

} # quality

sub sharing {

1000 Genomes

3.8 Tb

~50 B/b

190 TB
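The 190 TB figure follows from the two numbers before it. A minimal sketch, reading "Tb" as terabases and "B/b" as bytes stored per base (an interpretation of the slide notation that the arithmetic supports):

```python
bases_sequenced = 3.8e12  # 3.8 Tb (terabases), from the slide
bytes_per_base = 50       # ~50 B/b, from the slide

total_bytes = bases_sequenced * bytes_per_base
total_tb = total_bytes / 1e12  # decimal terabytes

print(total_tb)
```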

Submitted to central repositories

... and replicated across the pond

The goal of this project is to provide a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods.

Write-only databases

Search limited to sequence and values of specific XML entities submitted as metadata

Speaking of XML

<?xml version="1.0" encoding="UTF-8"?><STUDY_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <STUDY alias="LowSalternSDbayVir111005" accession="SRP000145"> <DESCRIPTOR> <STUDY_TITLE>Solar Salterns, viral fraction from low salinity saltern in San Diego, CA </STUDY_TITLE> <STUDY_TYPE existing_study_type="Metagenomics"/> <STUDY_ABSTRACT>Viral community from a "low" salinity saltern and sequenced at 454 Life Sciences. </STUDY_ABSTRACT> <CENTER_NAME>SDSU</CENTER_NAME> <CENTER_PROJECT_NAME>LowSalternSDbayVir111005</CENTER_PROJECT_NAME> <PROJECT_ID>28373</PROJECT_ID> </DESCRIPTOR> <STUDY_ATTRIBUTES> <STUDY_ATTRIBUTE> <TAG>NCBI parent project ID</TAG> <VALUE>28725</VALUE> </STUDY_ATTRIBUTE> </STUDY_ATTRIBUTES> </STUDY></STUDY_SET>

<?xml version="1.0" encoding="UTF-8"?><SAMPLE_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <SAMPLE alias="28373" accession="SRS000373"> <SAMPLE_NAME> <TAXON_ID>496920</TAXON_ID> <COMMON_NAME>saltern metagenome</COMMON_NAME> </SAMPLE_NAME> <DESCRIPTION>viral fraction from low salinity saltern in San Diego, CA </DESCRIPTION> <SAMPLE_ATTRIBUTES> <SAMPLE_ATTRIBUTE> <TAG>collection_date</TAG> <VALUE>11/10/05</VALUE> </SAMPLE_ATTRIBUTE> <SAMPLE_ATTRIBUTE> <TAG>lat_lon</TAG> <VALUE>32.599040, -117.107356</VALUE> </SAMPLE_ATTRIBUTE> </SAMPLE_ATTRIBUTES> </SAMPLE></SAMPLE_SET>

<?xml version="1.0" encoding="UTF-8"?><EXPERIMENT_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <EXPERIMENT alias="LowSalternSDbayVir111005_experiment" expected_number_runs="2" accession="SRX000217"> <TITLE>454 sequencing of saltern metagenome fragment library</TITLE> <STUDY_REF accession="SRP000145" refname="LowSalternSDbayVir111005"/> <DESIGN> <DESIGN_DESCRIPTION>454 Sequencing of viral fraction from low salinity saltern in San Diego, CA</DESIGN_DESCRIPTION> <SAMPLE_DESCRIPTOR accession="SRS000373" refname="28373"/> <LIBRARY_DESCRIPTOR> <LIBRARY_NAME>lowSalternSDbayVir111005</LIBRARY_NAME> <LIBRARY_STRATEGY>OTHER</LIBRARY_STRATEGY> <LIBRARY_SOURCE>OTHER</LIBRARY_SOURCE> <LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION> <LIBRARY_LAYOUT> <SINGLE/> </LIBRARY_LAYOUT> <LIBRARY_CONSTRUCTION_PROTOCOL> none provided </LIBRARY_CONSTRUCTION_PROTOCOL> </LIBRARY_DESCRIPTOR> <SPOT_DESCRIPTOR> <SPOT_DECODE_SPEC> <NUMBER_OF_READS_PER_SPOT>2</NUMBER_OF_READS_PER_SPOT> <READ_SPEC> <READ_INDEX>0</READ_INDEX> <READ_CLASS>Technical Read</READ_CLASS> <READ_TYPE>Adapter</READ_TYPE> <BASE_COORD>1</BASE_COORD> </READ_SPEC> <READ_SPEC> <READ_INDEX>1</READ_INDEX> <READ_CLASS>Application Read</READ_CLASS> <READ_TYPE>Forward</READ_TYPE> <BASE_COORD>5</BASE_COORD> </READ_SPEC> </SPOT_DECODE_SPEC> </SPOT_DESCRIPTOR> </DESIGN> <PLATFORM>

<LS454> <INSTRUMENT_MODEL>GS 20</INSTRUMENT_MODEL> <FLOW_SEQUENCE>TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG</FLOW_SEQUENCE> <FLOW_COUNT>168</FLOW_COUNT> </LS454> </PLATFORM> <PROCESSING> <BASE_CALLS> <SEQUENCE_SPACE>Base Space</SEQUENCE_SPACE> <BASE_CALLER>454BaseCaller</BASE_CALLER> </BASE_CALLS> <QUALITY_SCORES qtype="phred"> <QUALITY_SCORER>454BaseCaller</QUALITY_SCORER> <NUMBER_OF_LEVELS>64</NUMBER_OF_LEVELS> <MULTIPLIER>1</MULTIPLIER> </QUALITY_SCORES> </PROCESSING> </EXPERIMENT></EXPERIMENT_SET>

<?xml version="1.0" encoding="UTF-8"?><RUN_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <RUN alias="D0IIGP3" instrument_model="454 GS 20" run_date="2006-03-17T09:39:51Z" run_file="D0IIGP3" run_center="454MSC" total_data_blocks="1" accession="SRR001053"> <EXPERIMENT_REF accession="SRX000217" refname="LowSalternSDbayVir111005_experiment"/> <DATA_BLOCK name="D0IIGP3" region="1" total_spots="51121" total_reads="51121" number_channels="1" format_code="1" sector="0"> <FILES> <FILE filename="D0IIGP301.sff" filetype="sff"/> </FILES> </DATA_BLOCK> <RUN_ATTRIBUTES> <RUN_ATTRIBUTE> <TAG>flow_count</TAG> <VALUE>168</VALUE> </RUN_ATTRIBUTE> <RUN_ATTRIBUTE> <TAG>flow_sequence</TAG>

<VALUE>TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG</VALUE> </RUN_ATTRIBUTE> <RUN_ATTRIBUTE> <TAG>key_sequence</TAG> <VALUE>TCAG</VALUE> </RUN_ATTRIBUTE> </RUN_ATTRIBUTES> </RUN> <RUN alias="D1LDSHL" instrument_model="454 GS 20" run_date="2006-04-06T09:25:19Z" run_file="D1LDSHL" run_center="454MSC" total_data_blocks="1" accession="SRR001054"> <EXPERIMENT_REF accession="SRX000217" refname="LowSalternSDbayVir111005_experiment"/> <DATA_BLOCK name="D1LDSHL" region="1" total_spots="70935" total_reads="70935" number_channels="1" format_code="1" sector="0"> <FILES> <FILE filename="D1LDSHL01.sff" filetype="sff"/> </FILES> </DATA_BLOCK> <RUN_ATTRIBUTES> <RUN_ATTRIBUTE> <TAG>flow_count</TAG> <VALUE>168</VALUE> </RUN_ATTRIBUTE> <RUN_ATTRIBUTE> <TAG>flow_sequence</TAG> <VALUE>TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG</VALUE> </RUN_ATTRIBUTE> <RUN_ATTRIBUTE> <TAG>key_sequence</TAG> <VALUE>TCAG</VALUE> </RUN_ATTRIBUTE> </RUN_ATTRIBUTES> </RUN></RUN_SET>
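Because search is limited to values submitted as XML metadata, anything else has to be pulled out of the documents by the consumer. A minimal sketch of extracting the TAG/VALUE pairs from a sample record with Python's standard xml.etree.ElementTree. The XML is abridged from the SAMPLE_SET shown on the slide (declaration, namespace attribute, and description omitted); sample_attributes is a hypothetical helper, not part of any repository API:

```python
import xml.etree.ElementTree as ET

# Abridged from the SAMPLE_SET document on the slide.
SAMPLE_XML = """<SAMPLE_SET>
  <SAMPLE alias="28373" accession="SRS000373">
    <SAMPLE_NAME>
      <TAXON_ID>496920</TAXON_ID>
      <COMMON_NAME>saltern metagenome</COMMON_NAME>
    </SAMPLE_NAME>
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE>
        <TAG>collection_date</TAG>
        <VALUE>11/10/05</VALUE>
      </SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE>
        <TAG>lat_lon</TAG>
        <VALUE>32.599040, -117.107356</VALUE>
      </SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>"""

def sample_attributes(xml_text):
    """Map each sample accession to a dict of its TAG/VALUE attributes."""
    root = ET.fromstring(xml_text)
    result = {}
    for sample in root.iter("SAMPLE"):
        attributes = {
            attribute.findtext("TAG"): attribute.findtext("VALUE")
            for attribute in sample.iter("SAMPLE_ATTRIBUTE")
        }
        result[sample.get("accession")] = attributes
    return result

attributes_by_sample = sample_attributes(SAMPLE_XML)
print(attributes_by_sample)
```

Note that the tags are free text chosen by the submitter, so two centers can describe the same property under different names.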

} # sharing

sub versioning {

The Cathedral and the Bazaar

Linux overturned much of what I thought I knew. I had been preaching the Unix gospel of small tools, rapid prototyping and evolutionary programming for years. But I also believed there was a certain critical complexity above which a more centralized, a priori approach was required. I believed that the most important software (operating systems and really large tools like the Emacs programming editor) needed to be built like cathedrals, carefully crafted by individual wizards or small bands of mages working in splendid isolation, with no beta to be released before its time.

The Vatican and the Reformation

The popes

Will this scale?

The Human Reference

>7 dna:chromosome chromosome:NCBI36:7:1:158821424:1

...AATAACTATATAAGTAAATAAGCAAGCTGTATGAATATACAAAGCTCTCTGGTAAAGGTAAATACATAAACAAACATAAAAACAGTCCTATTGTAATTTTGGTTTGTAACTCTGCTTTTTATTTTCTACATAATTTAAAAGGCAAATGCATAAAATGTAATTGTAAATCTGTTAGCTGGTATACAATGAATAAAGATATAATTTGTCACATCAATAACATAAAAAGAGTAGAGCTATATATATAGCAGTAGAATTTTGGTATGTGATTGAACTTAAGTTGAAATAAATTCAAATTAAAATGTTATAACTCTAGGATGTTATATGTAATTCTCATAGTAACCAAAAATGAAATATACATAGAATATAAACAAAAGGAAATGAGACTAGAAACAAAATGTGTCACTACAAAAAAATCAACTAAAGATAAAAAAGAAATAATTGAGAAAATGATTGGCAAAAATCAGTAACTCTGACGTATTAAAACTTTCCATGCTACATAAATCTGAAAACTCTATTTCACATAAAACTGGAGCTGAAAGAAACAAATATTTACCTATAAAGTTAAAAGTTATATAGGGAACAAACACTAATTTTTTTTAGAAAAAATTATAAAAAGAGTAAAAATATGCCTTATACTACCGTAATTTCATGTTTTACAGCTCTGGGAAAATAGAAAATAAAATGTTCTGTTAGCATGAATCCCTCTGTGCCCCC...

The Human Reference

D Zhi, BJ Raphael, AL Price, H Tang and PA Pevzner. Identifying repeat domains in large genomes. Genome Biology 2006, 7:R7

[Figure from Zhi et al.: panels (a), (b), and (c)]

} # versioning

sub thank {"you"}