Challenges with data quality, sharing, and versioning

David Dooling <[email protected]>, GIA 2009

Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing


Talk from the Genome Informatics Alliance 2009 meeting.



Production Centers

• Tony Cox, Sanger: Sequencing, Scale, Infrastructure, Data flow
• Toby Bloom, Broad: Quality, Integration, Standards, Sharing
• David Dooling, WUStL: Scale, Quality, Sharing, Versioning


Moore’s Law

[Chart, 2000 to 2010. Series: Personnel, Moore’s Law, CPU, Storage, Sequence]


FASTQ

@HWI-EAS404:5:1:6:180#0/1

GCTGGTTTAACTCGAGTATTTGTCCATTCTACTAATTTGAGTGTCTGCTGTGGAAAGGTGTTTGTCATGTATTTT

+HWI-EAS404:5:1:6:180#0/1

aaaa`]aaaa`aa^aa]aaaa^\`_\``____`W]a_`T\[[b__`\YXUW][MSTNZX^[[`_Z[^``\X`^a\

@HWI-EAS404:5:1:6:396#0/1

TATTTACTCTATCCCATTATATACATATTATGATTTCAAAATAACAATGCCAATATAAAAACTAACAATATGATA

+HWI-EAS404:5:1:6:396#0/1

Yaaa_baa`^a]Wa___aaa^I^V]^]NQ_`\^ZPP[__^_\a`^a`JYQWVNFFMRQSX_X^a_Y[`^a^\\NZ

@HWI-EAS404:5:1:6:1344#0/1

GAGGACTTGCATGCTAGGTTTGGTTCTTGGCTGAATTGCTGAAACTGTCCAAGTATCAGTAGCAAAACATGGGTG

+HWI-EAS404:5:1:6:1344#0/1

aabaaa__]^a`[\^`]]Y``[ST_]`]WW]]WZ]`^ZT[_X``\`_WVNYWKDNLTW[Y\XSVZ^ZTZZVRUX[

@HWI-EAS404:5:1:6:1814#0/1

AAAGCTTACTGCTGTTTAGAATTCTTGCTACAGTCAGGAGAAAGCCGAAAGCTGAACGGGTACTGAATCTTCTAC

+HWI-EAS404:5:1:6:1814#0/1

aa````aa^a`_^`\`a`XY`^ZX^YW\^[X\UWUYOMVZZ\\_W^^\XXTSMHMLLNTTDWU__[WVVY]Y_]X
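Each record above follows the four-line FASTQ layout: an `@` header, the bases, a `+` separator, and one quality character per base. The following is a minimal sketch of reading such records in Python. It assumes the quality characters use an ASCII offset of 64, which was common for Illumina pipelines of this period; the slide does not state the encoding, and the function name `read_fastq` is illustrative.

```python
from typing import Iterator, List, Tuple

PHRED_OFFSET = 64  # assumption: Illumina-style offset; Sanger-style FASTQ uses 33


def read_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_name, bases, qualities) from four-line FASTQ records."""
    rows = [ln.strip() for ln in lines if ln.strip()]  # tolerate blank lines
    for i in range(0, len(rows) - 3, 4):
        header, bases, plus, quals = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record at row {i}")
        if len(bases) != len(quals):
            raise ValueError(f"length mismatch in {header}")
        yield header[1:], bases, [ord(c) - PHRED_OFFSET for c in quals]


# First read from the slide, shortened to five bases for illustration.
record = [
    "@HWI-EAS404:5:1:6:180#0/1",
    "GCTGG",
    "+HWI-EAS404:5:1:6:180#0/1",
    "aaaa`",
]
name, bases, quals = next(read_fastq(record))
print(name, bases, quals)
```

With the assumed offset, `a` decodes to 33 and a backtick to 32. If the data were Sanger-encoded, only `PHRED_OFFSET` would need to change.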

7 TB/week


350 TB/year
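The weekly and yearly figures agree if one assumes roughly 50 production weeks per year. That assumption is not stated on the slide; the sketch below only checks the arithmetic.

```python
TB_PER_WEEK = 7        # from the slide
WEEKS_PER_YEAR = 50    # assumption: about two weeks per year without production


def yearly_volume_tb(tb_per_week: float, weeks: int) -> float:
    """Scale a weekly data rate to a yearly total."""
    return tb_per_week * weeks


print(yearly_volume_tb(TB_PER_WEEK, WEEKS_PER_YEAR))  # 350
```

With a full 52 weeks the same rate gives 364 TB, so the slide's yearly figure is best read as a round number.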


The Balanced PC

• Clock speed
• AGP
• Front-side bus
• Hypertransport
• 1 Gbps
• PCI-X
• SATA
• PCI-Express
• Infiniband
• Multi-core
• Front-side bus
• GPU
• 10 Gbps


The balanced PS¹

10 gosub get(sequencers)

20 gosub get(disk)

30 gosub get(backup_capacity)

40 gosub get(network_capacity)

50 gosub get(cluster_nodes)

¹ PS: Pipeline for Sequencing


The unbalanced PS

10 gosub get(sequencers)

20 gosub get(disk)

30 gosub get(backup_capacity)

40 gosub get(network_capacity)

50 gosub get(cluster_nodes)

60 goto 10
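The loop above makes the point that a sequencing pipeline runs at the speed of its slowest component, so adding capacity in one place exposes the next bottleneck. A minimal Python sketch of that idea follows. The capacities, expressed in TB per week, and the `bottleneck` helper are hypothetical; the slide gives no figures.

```python
from typing import Dict, Tuple


def bottleneck(capacity_tb_per_week: Dict[str, float]) -> Tuple[str, float]:
    """Return the component with the lowest throughput and that throughput."""
    name = min(capacity_tb_per_week, key=capacity_tb_per_week.get)
    return name, capacity_tb_per_week[name]


# Hypothetical capacities, named after the components on the slide.
pipeline = {
    "sequencers": 7.0,
    "disk": 10.0,
    "backup_capacity": 5.0,
    "network_capacity": 12.0,
    "cluster_nodes": 8.0,
}

first = bottleneck(pipeline)
print(first)   # ('backup_capacity', 5.0)

pipeline["backup_capacity"] = 15.0   # upgrade the slowest stage
second = bottleneck(pipeline)
print(second)  # ('sequencers', 7.0)
```

Fixing one stage simply moves the limit elsewhere, which is why the unbalanced version never leaves the loop.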


...must be more than just a slogan


Quality missteps

Initial low fidelity between base quality values and observed quality

Tsonev, S. SEP 2007


An aside

“basecall calibration predicted vs. observed”
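Calibration compares the error rate that a quality value predicts with the error rate actually observed. Under the Phred convention, a quality Q predicts an error probability of 10^(-Q/10). The sketch below assumes each base call has already been compared with a trusted reference, so that we know whether it was an error. The input format and function names are illustrative.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def phred_to_error(q: float) -> float:
    """Error probability predicted by a Phred quality value."""
    return 10 ** (-q / 10)


def observed_quality(calls: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Map each predicted quality to the empirical quality.

    calls: (predicted_q, is_error) pairs. Bins with no observed
    errors are reported as infinity.
    """
    totals: Dict[int, int] = defaultdict(int)
    errors: Dict[int, int] = defaultdict(int)
    for q, is_error in calls:
        totals[q] += 1
        errors[q] += int(is_error)
    result = {}
    for q in sorted(totals):
        rate = errors[q] / totals[q]
        result[q] = -10 * math.log10(rate) if rate > 0 else math.inf
    return result


# Made-up data: one error per 100 calls in both the Q20 and the Q30 bin.
calls = [(20, True)] + [(20, False)] * 99 + [(30, True)] + [(30, False)] * 99
calibration = observed_quality(calls)
print(calibration)
```

In this made-up data both bins come out at about Q20, so the calls labelled Q30 are over-confident. That gap between predicted and observed quality is what a calibration plot makes visible.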


Quality is the key

Need high fidelity between predicted and observed quality

3 bits per base

50 bytes per base

20 bytes per base

2 bytes per base
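The per-base figures above translate directly into storage cost. The slide does not say which data representation each figure refers to, so the sketch below only does the arithmetic. The 100-gigabase data set is a hypothetical example.

```python
BITS_PER_BASE = {            # figures from the slide, converted to bits
    "3 bits per base": 3,
    "2 bytes per base": 2 * 8,
    "20 bytes per base": 20 * 8,
    "50 bytes per base": 50 * 8,
}


def storage_gb(n_bases: int, bits_per_base: int) -> float:
    """Storage in gigabytes (10**9 bytes) for n_bases at the given density."""
    return n_bases * bits_per_base / 8 / 1e9


N_BASES = 100 * 10**9   # hypothetical: 100 gigabases
for label, bits in BITS_PER_BASE.items():
    print(f"{label:>18}: {storage_gb(N_BASES, bits):8.1f} GB")
```

For the same number of bases, the densest and the most verbose figures differ by a factor of more than 100, which is why trustworthy quality values matter for deciding what can be discarded.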


Submitted to central repositories


... and replicated across the pond


The goal of this project is to provide a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods.


Write-only databases

Search limited to sequence and values of specific XML entities submitted as metadata



Speaking of XML

<?xml version="1.0" encoding="UTF-8"?><STUDY_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <STUDY alias="LowSalternSDbayVir111005" accession="SRP000145"> <DESCRIPTOR> <STUDY_TITLE>Solar Salterns, viral fraction from low salinity saltern in San Diego, CA </STUDY_TITLE> <STUDY_TYPE existing_study_type="Metagenomics"/> <STUDY_ABSTRACT>Viral community from a "low" salinity saltern and sequenced at 454 Life Sciences. </STUDY_ABSTRACT> <CENTER_NAME>SDSU</CENTER_NAME> <CENTER_PROJECT_NAME>LowSalternSDbayVir111005</CENTER_PROJECT_NAME> <PROJECT_ID>28373</PROJECT_ID> </DESCRIPTOR> <STUDY_ATTRIBUTES> <STUDY_ATTRIBUTE> <TAG>NCBI parent project ID</TAG> <VALUE>28725</VALUE> </STUDY_ATTRIBUTE> </STUDY_ATTRIBUTES> </STUDY></STUDY_SET>

<?xml version="1.0" encoding="UTF-8"?><SAMPLE_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <SAMPLE alias="28373" accession="SRS000373"> <SAMPLE_NAME> <TAXON_ID>496920</TAXON_ID> <COMMON_NAME>saltern metagenome</COMMON_NAME> </SAMPLE_NAME> <DESCRIPTION>viral fraction from low salinity saltern in San Diego, CA </DESCRIPTION> <SAMPLE_ATTRIBUTES> <SAMPLE_ATTRIBUTE> <TAG>collection_date</TAG> <VALUE>11/10/05</VALUE> </SAMPLE_ATTRIBUTE> <SAMPLE_ATTRIBUTE> <TAG>lat_lon</TAG> <VALUE>32.599040, -117.107356</VALUE> </SAMPLE_ATTRIBUTE> </SAMPLE_ATTRIBUTES> </SAMPLE></SAMPLE_SET>

<?xml version="1.0" encoding="UTF-8"?><EXPERIMENT_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <EXPERIMENT alias="LowSalternSDbayVir111005_experiment" expected_number_runs="2" accession="SRX000217"> <TITLE>454 sequencing of saltern metagenome fragment library</TITLE> <STUDY_REF accession="SRP000145" refname="LowSalternSDbayVir111005"/> <DESIGN> <DESIGN_DESCRIPTION>454 Sequencing of viral fraction from low salinity saltern in San Diego, CA</DESIGN_DESCRIPTION> <SAMPLE_DESCRIPTOR accession="SRS000373" refname="28373"/> <LIBRARY_DESCRIPTOR> <LIBRARY_NAME>lowSalternSDbayVir111005</LIBRARY_NAME> <LIBRARY_STRATEGY>OTHER</LIBRARY_STRATEGY> <LIBRARY_SOURCE>OTHER</LIBRARY_SOURCE> <LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION> <LIBRARY_LAYOUT> <SINGLE/> </LIBRARY_LAYOUT> <LIBRARY_CONSTRUCTION_PROTOCOL> none provided </LIBRARY_CONSTRUCTION_PROTOCOL> </LIBRARY_DESCRIPTOR> <SPOT_DESCRIPTOR> <SPOT_DECODE_SPEC> <NUMBER_OF_READS_PER_SPOT>2</NUMBER_OF_READS_PER_SPOT> <READ_SPEC> <READ_INDEX>0</READ_INDEX> <READ_CLASS>Technical Read</READ_CLASS> <READ_TYPE>Adapter</READ_TYPE> <BASE_COORD>1</BASE_COORD> </READ_SPEC> <READ_SPEC> <READ_INDEX>1</READ_INDEX> <READ_CLASS>Application Read</READ_CLASS> <READ_TYPE>Forward</READ_TYPE> <BASE_COORD>5</BASE_COORD> </READ_SPEC> </SPOT_DECODE_SPEC> </SPOT_DESCRIPTOR> </DESIGN> <PLATFORM>

<LS454> <INSTRUMENT_MODEL>GS 20</INSTRUMENT_MODEL> <FLOW_SEQUENCE>TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG</FLOW_SEQUENCE> <FLOW_COUNT>168</FLOW_COUNT> </LS454> </PLATFORM> <PROCESSING> <BASE_CALLS> <SEQUENCE_SPACE>Base Space</SEQUENCE_SPACE> <BASE_CALLER>454BaseCaller</BASE_CALLER> </BASE_CALLS> <QUALITY_SCORES qtype="phred"> <QUALITY_SCORER>454BaseCaller</QUALITY_SCORER> <NUMBER_OF_LEVELS>64</NUMBER_OF_LEVELS> <MULTIPLIER>1</MULTIPLIER> </QUALITY_SCORES> </PROCESSING> </EXPERIMENT></EXPERIMENT_SET>

<?xml version="1.0" encoding="UTF-8"?><RUN_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <RUN alias="D0IIGP3" instrument_model="454 GS 20" run_date="2006-03-17T09:39:51Z" run_file="D0IIGP3" run_center="454MSC" total_data_blocks="1" accession="SRR001053"> <EXPERIMENT_REF accession="SRX000217" refname="LowSalternSDbayVir111005_experiment"/> <DATA_BLOCK name="D0IIGP3" region="1" total_spots="51121" total_reads="51121" number_channels="1" format_code="1" sector="0"> <FILES> <FILE filename="D0IIGP301.sff" filetype="sff"/> </FILES> </DATA_BLOCK> <RUN_ATTRIBUTES> <RUN_ATTRIBUTE> <TAG>flow_count</TAG> <VALUE>168</VALUE> </RUN_ATTRIBUTE> <RUN_ATTRIBUTE> <TAG>flow_sequence</TAG>

<VALUE>TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG</VALUE> </RUN_ATTRIBUTE> <RUN_ATTRIBUTE> <TAG>key_sequence</TAG> <VALUE>TCAG</VALUE> </RUN_ATTRIBUTE> </RUN_ATTRIBUTES> </RUN> <RUN alias="D1LDSHL" instrument_model="454 GS 20" run_date="2006-04-06T09:25:19Z" run_file="D1LDSHL" run_center="454MSC" total_data_blocks="1" accession="SRR001054"> <EXPERIMENT_REF accession="SRX000217" refname="LowSalternSDbayVir111005_experiment"/> <DATA_BLOCK name="D1LDSHL" region="1" total_spots="70935" total_reads="70935" number_channels="1" format_code="1" sector="0"> <FILES> <FILE filename="D1LDSHL01.sff" filetype="sff"/> </FILES> </DATA_BLOCK> <RUN_ATTRIBUTES> <RUN_ATTRIBUTE> <TAG>flow_count</TAG> <VALUE>168</VALUE> </RUN_ATTRIBUTE> <RUN_ATTRIBUTE> <TAG>flow_sequence</TAG> <VALUE>TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG</VALUE> </RUN_ATTRIBUTE> <RUN_ATTRIBUTE> <TAG>key_sequence</TAG> <VALUE>TCAG</VALUE> </RUN_ATTRIBUTE> </RUN_ATTRIBUTES> </RUN></RUN_SET>
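Since search is limited to the values of specific XML elements, it helps to see how those values are pulled out of a submission. The sketch below uses Python's standard `xml.etree.ElementTree` on a trimmed copy of the sample record above, with the XML declaration omitted. The helper name `sample_attributes` is illustrative.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Trimmed copy of the SAMPLE_SET record from the slide.
SAMPLE_XML = """
<SAMPLE_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <SAMPLE alias="28373" accession="SRS000373">
    <SAMPLE_NAME>
      <TAXON_ID>496920</TAXON_ID>
      <COMMON_NAME>saltern metagenome</COMMON_NAME>
    </SAMPLE_NAME>
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE>
        <TAG>collection_date</TAG>
        <VALUE>11/10/05</VALUE>
      </SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE>
        <TAG>lat_lon</TAG>
        <VALUE>32.599040, -117.107356</VALUE>
      </SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>
"""


def sample_attributes(xml_text: str) -> Dict[str, Dict[str, str]]:
    """Map each sample accession to its TAG/VALUE attribute pairs."""
    root = ET.fromstring(xml_text.strip())
    out: Dict[str, Dict[str, str]] = {}
    for sample in root.iter("SAMPLE"):
        attrs = {}
        for attr in sample.iter("SAMPLE_ATTRIBUTE"):
            attrs[attr.findtext("TAG")] = attr.findtext("VALUE")
        out[sample.get("accession")] = attrs
    return out


parsed = sample_attributes(SAMPLE_XML)
print(parsed)
```

Anything not captured as a tagged value, such as the free-text description, is invisible to a search of this kind.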


The Cathedral and the Bazaar

Linux overturned much of what I thought I knew. I had been preaching the Unix gospel of small tools, rapid prototyping and evolutionary programming for years. But I also believed there was a certain critical complexity above which a more centralized, a priori approach was required. I believed that the most important software (operating systems and really large tools like the Emacs programming editor) needed to be built like cathedrals, carefully crafted by individual wizards or small bands of mages working in splendid isolation, with no beta to be released before its time.


The Human Reference

>7 dna:chromosome chromosome:NCBI36:7:1:158821424:1
...AATAACTATATAAGTAAATAAGCAAGCTGTATGAATATACAAAGCTCTCTGGTAAAGGTAAATACATAAACAAACATAAAAACAGTCCTATTGTAATTTTGGTTTGTAACTCTGCTTTTTATTTTCTACATAATTTAAAAGGCAAATGCATAAAATGTAATTGTAAATCTGTTAGCTGGTATACAATGAATAAAGATATAATTTGTCACATCAATAACATAAAAAGAGTAGAGCTATATATATAGCAGTAGAATTTTGGTATGTGATTGAACTTAAGTTGAAATAAATTCAAATTAAAATGTTATAACTCTAGGATGTTATATGTAATTCTCATAGTAACCAAAAATGAAATATACATAGAATATAAACAAAAGGAAATGAGACTAGAAACAAAATGTGTCACTACAAAAAAATCAACTAAAGATAAAAAAGAAATAATTGAGAAAATGATTGGCAAAAATCAGTAACTCTGACGTATTAAAACTTTCCATGCTACATAAATCTGAAAACTCTATTTCACATAAAACTGGAGCTGAAAGAAACAAATATTTACCTATAAAGTTAAAAGTTATATAGGGAACAAACACTAATTTTTTTTAGAAAAAATTATAAAAAGAGTAAAAATATGCCTTATACTACCGTAATTTCATGTTTTACAGCTCTGGGAAAATAGAAAATAAAATGTTCTGTTAGCATGAATCCCTCTGTGCCCCC...
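The FASTA header of the reference carries the assembly name (NCBI36) alongside the coordinates, which is what ties a result to a specific version of the reference. The sketch below extracts those fields. It assumes the colon-separated location follows the Ensembl convention of coordinate system, assembly, sequence name, start, end, and strand; the field names are my reading of that convention and are not stated on the slide.

```python
from typing import NamedTuple


class ReferenceLocation(NamedTuple):
    coord_system: str
    assembly: str
    name: str
    start: int
    end: int
    strand: int


def parse_reference_header(header: str) -> ReferenceLocation:
    """Parse a header such as
    '>7 dna:chromosome chromosome:NCBI36:7:1:158821424:1'.
    """
    location = header.lstrip(">").split()[-1]   # last whitespace-separated field
    coord_system, assembly, name, start, end, strand = location.split(":")
    return ReferenceLocation(
        coord_system, assembly, name, int(start), int(end), int(strand)
    )


loc = parse_reference_header(">7 dna:chromosome chromosome:NCBI36:7:1:158821424:1")
print(loc.assembly, loc.name, loc.end)  # NCBI36 7 158821424
```

Recording `loc.assembly` with every alignment or variant call is a cheap way to make later comparisons across reference versions possible.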


The Human Reference

D Zhi, BJ Raphael, AL Price, H Tang and PA Pevzner. Identifying repeat domains in large genomes. Genome Biology 2006, 7:R7

[Figure from Zhi et al. 2006: repeat domain graphs, panels (a), (b), and (c); node and edge labels omitted]