Challenges with data quality, sharing, and versioning
David Dooling <[email protected]>
GIA 2009
Production Centers
• Tony Cox, Sanger: Sequencing, Scale, Infrastructure, Data flow
• Toby Bloom, Broad: Quality, Integration, Standards, Sharing
• David Dooling, WUStL: Scale, Quality, Sharing, Versioning
Moore’s Law
[Chart: Personnel, Moore’s Law, CPU, Storage, and Sequence, by year from 2000 to 2010]
FASTQ
@HWI-EAS404:5:1:6:180#0/1
GCTGGTTTAACTCGAGTATTTGTCCATTCTACTAATTTGAGTGTCTGCTGTGGAAAGGTGTTTGTCATGTATTTT
+HWI-EAS404:5:1:6:180#0/1
aaaa`]aaaa`aa^aa]aaaa^\`_\``____`W]a_`T\[[b__`\YXUW][MSTNZX^[[`_Z[^``\X`^a\
@HWI-EAS404:5:1:6:396#0/1
TATTTACTCTATCCCATTATATACATATTATGATTTCAAAATAACAATGCCAATATAAAAACTAACAATATGATA
+HWI-EAS404:5:1:6:396#0/1
Yaaa_baa`^a]Wa___aaa^I^V]^]NQ_`\^ZPP[__^_\a`^a`JYQWVNFFMRQSX_X^a_Y[`^a^\\NZ
@HWI-EAS404:5:1:6:1344#0/1
GAGGACTTGCATGCTAGGTTTGGTTCTTGGCTGAATTGCTGAAACTGTCCAAGTATCAGTAGCAAAACATGGGTG
+HWI-EAS404:5:1:6:1344#0/1
aabaaa__]^a`[\^`]]Y``[ST_]`]WW]]WZ]`^ZT[_X``\`_WVNYWKDNLTW[Y\XSVZ^ZTZZVRUX[
@HWI-EAS404:5:1:6:1814#0/1
AAAGCTTACTGCTGTTTAGAATTCTTGCTACAGTCAGGAGAAAGCCGAAAGCTGAACGGGTACTGAATCTTCTAC
+HWI-EAS404:5:1:6:1814#0/1
aa````aa^a`_^`\`a`XY`^ZX^YW\^[X\UWUYOMVZZ\\_W^^\XXTSMHMLLNTTDWU__[WVVY]Y_]X
7 TB/week
350 TB/year
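The FASTQ records shown earlier follow a four-line layout: an `@` header, the bases, a `+` separator, and one quality character per base. A minimal sketch of reading that layout is given below. It assumes unwrapped four-line records and a quality offset of 64, which is consistent with the characters in the sample but is not stated on the slides; the function and type names are hypothetical.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]


def read_fastq(handle: TextIO, offset: int = 64) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text.

    offset=64 is an assumption about the quality encoding; other
    pipelines use 33. Line-wrapped FASTQ is not handled.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(
            name=header[1:],
            sequence=sequence,
            qualities=[ord(char) - offset for char in quality],
        )


if __name__ == "__main__":
    # A short made-up record, not one of the reads from the slide.
    sample = io.StringIO("@example_read\nACGT\n+example_read\nab`D\n")
    for record in read_fastq(sample):
        print(record.name, record.sequence, record.qualities)
```

Because every base carries its own quality character plus a share of two header lines, a text format like this costs at least two bytes per base before compression.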
The Balanced PC
• Clock speed
• AGP
• Front-side bus
• Hypertransport
• 1 Gbps
• PCI-X
• SATA
• PCI-Express
• Infiniband
• Multi-core
• Front-side bus
• GPU
• 10 Gbps
The balanced PS¹
10 gosub get(sequencers)
20 gosub get(disk)
30 gosub get(backup_capacity)
40 gosub get(network_capacity)
50 gosub get(cluster_nodes)
¹ Pipeline for Sequencing
The unbalanced PS
10 gosub get(sequencers)
20 gosub get(disk)
30 gosub get(backup_capacity)
40 gosub get(network_capacity)
50 gosub get(cluster_nodes)
60 goto 10
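One way to make the balanced versus unbalanced distinction concrete is to treat each component as a stage with a sustainable rate and look for the slowest one. The sketch below does that; the stage names come from the listing above, while the TB/week figures, the tolerance, and the function names are hypothetical and chosen only for illustration.

```python
from typing import Dict, Tuple


def bottleneck(capacity_tb_per_week: Dict[str, float]) -> Tuple[str, float]:
    """Return the stage with the lowest capacity; it caps the whole pipeline."""
    name = min(capacity_tb_per_week, key=capacity_tb_per_week.get)
    return name, capacity_tb_per_week[name]


def is_balanced(capacity_tb_per_week: Dict[str, float], tolerance: float = 0.25) -> bool:
    """Call the pipeline balanced when no stage exceeds the slowest one
    by more than the given fraction (an arbitrary illustrative threshold)."""
    lowest = min(capacity_tb_per_week.values())
    highest = max(capacity_tb_per_week.values())
    return highest <= lowest * (1 + tolerance)


if __name__ == "__main__":
    # Hypothetical capacities, all expressed in TB/week for comparison.
    pipeline = {
        "sequencers": 7.0,
        "disk": 8.0,
        "backup_capacity": 3.0,
        "network_capacity": 7.5,
        "cluster_nodes": 7.0,
    }
    print(bottleneck(pipeline))
    print(is_balanced(pipeline))
```

In this framing, buying more sequencers without rechecking the other stages is what sends the loop back to line 10.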
...must be more than just a slogan
Quality missteps
Initial low fidelity between base quality values and actual quality
Tsonev, S. SEP 2007
An aside
“basecall calibration predicted vs. observed”
Quality is the key
Need high fidelity between predicted and observed quality
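Fidelity between predicted and observed quality can be checked by binning base calls on their predicted Phred score and comparing each bin with the error rate actually seen, for example against a known reference. The sketch below uses the standard Phred relation Q = -10 log10(p); the input layout (pairs of predicted score and an error flag) and the function names are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple


def phred_to_error_probability(quality: float) -> float:
    """Convert a Phred score to the error probability it predicts."""
    return 10 ** (-quality / 10)


def error_probability_to_phred(probability: float) -> float:
    """Convert an error probability (must be > 0) to a Phred score."""
    return -10 * math.log10(probability)


def observed_quality_by_bin(calls: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Map each predicted score to the Phred score implied by observed errors.

    calls holds (predicted_quality, is_error) pairs. A bin with no observed
    errors is reported as infinity, since its error rate cannot be estimated.
    """
    totals: Counter = Counter()
    errors: Counter = Counter()
    for predicted, is_error in calls:
        totals[predicted] += 1
        if is_error:
            errors[predicted] += 1
    observed = {}
    for predicted in sorted(totals):
        if errors[predicted] == 0:
            observed[predicted] = math.inf
        else:
            rate = errors[predicted] / totals[predicted]
            observed[predicted] = error_probability_to_phred(rate)
    return observed


if __name__ == "__main__":
    # Made-up calls: the Q20 bin is well calibrated, the Q30 bin is not.
    calls = [(20, True)] + [(20, False)] * 99 + [(30, True)] + [(30, False)] * 9
    for predicted, observed in observed_quality_by_bin(calls).items():
        print(predicted, round(observed, 1))
```

A well-calibrated base caller puts every bin close to the diagonal where observed equals predicted.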
3 bits per base
50 bytes per base
20 bytes per base
2 bytes per base
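To see what the per-base figures above mean at scale, the following sketch converts a storage cost per base into total storage. The slides do not say which format or platform each figure belongs to, so the labels simply repeat the numbers; the count of 10^12 bases and the use of decimal terabytes are assumptions.

```python
def storage_terabytes(n_bases: int, bits_per_base: float) -> float:
    """Storage for n_bases at the given cost per base, in decimal TB (10**12 bytes)."""
    total_bytes = n_bases * bits_per_base / 8
    return total_bytes / 10**12


if __name__ == "__main__":
    n_bases = 10**12  # hypothetical: one terabase of sequence
    costs_in_bits = {
        "3 bits per base": 3,
        "2 bytes per base": 2 * 8,
        "20 bytes per base": 20 * 8,
        "50 bytes per base": 50 * 8,
    }
    for label, bits in costs_in_bits.items():
        print(f"{label}: {storage_terabytes(n_bases, bits):g} TB")
```

The spread between the smallest and largest figure is more than two orders of magnitude, which is why the choice of what to keep per base dominates storage planning.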
The down side
http://mammoth.psu.edu/labPhotos/imageOfFlowgram.jpg
http://www3.appliedbiosystems.com/cms/groups/mcb_marketing/documents/generaldocuments/cms_057559.pdf
Submitted to central repositories
... and replicated across the pond
The goal of this project is to provide a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods.
Write-only databases
Search limited to sequence and values of specific XML entities
submitted as metadata
Speaking of XML
<?xml version="1.0" encoding="UTF-8"?><STUDY_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <STUDY alias="LowSalternSDbayVir111005" accession="SRP000145"> <DESCRIPTOR> <STUDY_TITLE>Solar Salterns, viral fraction from low salinity saltern in San Diego, CA </STUDY_TITLE> <STUDY_TYPE existing_study_type="Metagenomics"/> <STUDY_ABSTRACT>Viral community from a "low" salinity saltern and sequenced at 454 Life Sciences. </STUDY_ABSTRACT> <CENTER_NAME>SDSU</CENTER_NAME> <CENTER_PROJECT_NAME>LowSalternSDbayVir111005</CENTER_PROJECT_NAME> <PROJECT_ID>28373</PROJECT_ID> </DESCRIPTOR> <STUDY_ATTRIBUTES> <STUDY_ATTRIBUTE> <TAG>NCBI parent project ID</TAG> <VALUE>28725</VALUE> </STUDY_ATTRIBUTE> </STUDY_ATTRIBUTES> </STUDY></STUDY_SET>
<?xml version="1.0" encoding="UTF-8"?><SAMPLE_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <SAMPLE alias="28373" accession="SRS000373"> <SAMPLE_NAME> <TAXON_ID>496920</TAXON_ID> <COMMON_NAME>saltern metagenome</COMMON_NAME> </SAMPLE_NAME> <DESCRIPTION>viral fraction from low salinity saltern in San Diego, CA </DESCRIPTION> <SAMPLE_ATTRIBUTES> <SAMPLE_ATTRIBUTE> <TAG>collection_date</TAG> <VALUE>11/10/05</VALUE> </SAMPLE_ATTRIBUTE> <SAMPLE_ATTRIBUTE> <TAG>lat_lon</TAG> <VALUE>32.599040, -117.107356</VALUE> </SAMPLE_ATTRIBUTE> </SAMPLE_ATTRIBUTES> </SAMPLE></SAMPLE_SET>
<?xml version="1.0" encoding="UTF-8"?><EXPERIMENT_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <EXPERIMENT alias="LowSalternSDbayVir111005_experiment" expected_number_runs="2" accession="SRX000217"> <TITLE>454 sequencing of saltern metagenome fragment library</TITLE> <STUDY_REF accession="SRP000145" refname="LowSalternSDbayVir111005"/> <DESIGN> <DESIGN_DESCRIPTION>454 Sequencing of viral fraction from low salinity saltern in San Diego, CA</DESIGN_DESCRIPTION> <SAMPLE_DESCRIPTOR accession="SRS000373" refname="28373"/> <LIBRARY_DESCRIPTOR> <LIBRARY_NAME>lowSalternSDbayVir111005</LIBRARY_NAME> <LIBRARY_STRATEGY>OTHER</LIBRARY_STRATEGY> <LIBRARY_SOURCE>OTHER</LIBRARY_SOURCE> <LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION> <LIBRARY_LAYOUT> <SINGLE/> </LIBRARY_LAYOUT> <LIBRARY_CONSTRUCTION_PROTOCOL> none provided </LIBRARY_CONSTRUCTION_PROTOCOL> </LIBRARY_DESCRIPTOR> <SPOT_DESCRIPTOR> <SPOT_DECODE_SPEC> <NUMBER_OF_READS_PER_SPOT>2</NUMBER_OF_READS_PER_SPOT> <READ_SPEC> <READ_INDEX>0</READ_INDEX> <READ_CLASS>Technical Read</READ_CLASS> <READ_TYPE>Adapter</READ_TYPE> <BASE_COORD>1</BASE_COORD> </READ_SPEC> <READ_SPEC> <READ_INDEX>1</READ_INDEX> <READ_CLASS>Application Read</READ_CLASS> <READ_TYPE>Forward</READ_TYPE> <BASE_COORD>5</BASE_COORD> </READ_SPEC> </SPOT_DECODE_SPEC> </SPOT_DESCRIPTOR> </DESIGN> <PLATFORM>
<LS454> <INSTRUMENT_MODEL>GS 20</INSTRUMENT_MODEL> <FLOW_SEQUENCE>TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG</FLOW_SEQUENCE> <FLOW_COUNT>168</FLOW_COUNT> </LS454> </PLATFORM> <PROCESSING> <BASE_CALLS> <SEQUENCE_SPACE>Base Space</SEQUENCE_SPACE> <BASE_CALLER>454BaseCaller</BASE_CALLER> </BASE_CALLS> <QUALITY_SCORES qtype="phred"> <QUALITY_SCORER>454BaseCaller</QUALITY_SCORER> <NUMBER_OF_LEVELS>64</NUMBER_OF_LEVELS> <MULTIPLIER>1</MULTIPLIER> </QUALITY_SCORES> </PROCESSING> </EXPERIMENT></EXPERIMENT_SET>
<?xml version="1.0" encoding="UTF-8"?><RUN_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <RUN alias="D0IIGP3" instrument_model="454 GS 20" run_date="2006-03-17T09:39:51Z" run_file="D0IIGP3" run_center="454MSC" total_data_blocks="1" accession="SRR001053"> <EXPERIMENT_REF accession="SRX000217" refname="LowSalternSDbayVir111005_experiment"/> <DATA_BLOCK name="D0IIGP3" region="1" total_spots="51121" total_reads="51121" number_channels="1" format_code="1" sector="0"> <FILES> <FILE filename="D0IIGP301.sff" filetype="sff"/> </FILES> </DATA_BLOCK> <RUN_ATTRIBUTES> <RUN_ATTRIBUTE> <TAG>flow_count</TAG> <VALUE>168</VALUE> </RUN_ATTRIBUTE> <RUN_ATTRIBUTE> <TAG>flow_sequence</TAG>
<VALUE>TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG</VALUE> </RUN_ATTRIBUTE> <RUN_ATTRIBUTE> <TAG>key_sequence</TAG> <VALUE>TCAG</VALUE> </RUN_ATTRIBUTE> </RUN_ATTRIBUTES> </RUN> <RUN alias="D1LDSHL" instrument_model="454 GS 20" run_date="2006-04-06T09:25:19Z" run_file="D1LDSHL" run_center="454MSC" total_data_blocks="1" accession="SRR001054"> <EXPERIMENT_REF accession="SRX000217" refname="LowSalternSDbayVir111005_experiment"/> <DATA_BLOCK name="D1LDSHL" region="1" total_spots="70935" total_reads="70935" number_channels="1" format_code="1" sector="0"> <FILES> <FILE filename="D1LDSHL01.sff" filetype="sff"/> </FILES> </DATA_BLOCK> <RUN_ATTRIBUTES> <RUN_ATTRIBUTE> <TAG>flow_count</TAG> <VALUE>168</VALUE> </RUN_ATTRIBUTE> <RUN_ATTRIBUTE> <TAG>flow_sequence</TAG> <VALUE>TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG</VALUE> </RUN_ATTRIBUTE> <RUN_ATTRIBUTE> <TAG>key_sequence</TAG> <VALUE>TCAG</VALUE> </RUN_ATTRIBUTE> </RUN_ATTRIBUTES> </RUN></RUN_SET>
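When the repositories only index specific XML entities, any other question has to be answered by parsing the submission documents directly. The sketch below pulls run accessions, the experiment each run refers to, and spot counts out of a RUN_SET. The element and attribute names are taken from the example above; the embedded document is abridged from it (declaration, namespace, and most attributes dropped), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# Abridged from the RUN_SET shown above; only the fields used here are kept.
RUN_SET_XML = """
<RUN_SET>
  <RUN alias="D0IIGP3" accession="SRR001053">
    <EXPERIMENT_REF accession="SRX000217"/>
    <DATA_BLOCK name="D0IIGP3" total_spots="51121"/>
  </RUN>
  <RUN alias="D1LDSHL" accession="SRR001054">
    <EXPERIMENT_REF accession="SRX000217"/>
    <DATA_BLOCK name="D1LDSHL" total_spots="70935"/>
  </RUN>
</RUN_SET>
"""


def summarize_runs(xml_text: str) -> List[Tuple[Optional[str], Optional[str], int]]:
    """Return (run accession, experiment accession, total spots) per RUN."""
    root = ET.fromstring(xml_text)
    summary = []
    for run in root.findall("RUN"):
        experiment = run.find("EXPERIMENT_REF")
        experiment_accession = experiment.get("accession") if experiment is not None else None
        # A run may carry several data blocks; add up their spot counts.
        spots = sum(int(block.get("total_spots", "0")) for block in run.findall("DATA_BLOCK"))
        summary.append((run.get("accession"), experiment_accession, spots))
    return summary


if __name__ == "__main__":
    for row in summarize_runs(RUN_SET_XML):
        print(row)
```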
The Cathedral and the Bazaar
Linux overturned much of what I thought I knew. I had been preaching the Unix gospel of small tools, rapid prototyping and evolutionary programming for years. But I also believed there was a certain critical complexity above which a more centralized, a priori approach was required. I believed that the most important software (operating systems and really large tools like the Emacs programming editor) needed to be built like cathedrals, carefully crafted by individual wizards or small bands of mages working in splendid isolation, with no beta to be released before its time.
The Vatican and the Reformation
GenBank genome
http://betterexplained.com/articles/intro-to-distributed-version-control-illustrated/
git genome
http://betterexplained.com/articles/intro-to-distributed-version-control-illustrated/
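The slides do not spell out how a "git genome" would work. One ingredient git would bring is content addressing: every version of a file is named by a hash of its bytes, so two sites holding the same identifier hold the same sequence. The sketch below computes the identifier git assigns to a blob (SHA-1 over a `blob <size>` header, a NUL byte, and the content); applying it to sequence data is an illustration, and the example sequences are made up.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Return the object id git would assign to a blob with this content."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


if __name__ == "__main__":
    # Made-up sequences standing in for two revisions of a reference.
    revision_a = b">example\nACGTACGTAC\n"
    revision_b = b">example\nACGTACGTAG\n"  # one base changed
    print(git_blob_id(revision_a))
    print(git_blob_id(revision_b))
    print(git_blob_id(revision_a) == git_blob_id(revision_b))
```

Under this scheme an analysis can record exactly which revision of a reference it used by storing a single identifier, without depending on a central authority to issue version numbers.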
The Human Reference
>7 dna:chromosome chromosome:NCBI36:7:1:158821424:1
...AATAACTATATAAGTAAATAAGCAAGCTGTATGAATATACAAAGCTCTCTGGTAAAGGTAAATACATAAACAAACATAAAAACAGTCCTATTGTAATTTTGGTTTGTAACTCTGCTTTTTATTTTCTACATAATTTAAAAGGCAAATGCATAAAATGTAATTGTAAATCTGTTAGCTGGTATACAATGAATAAAGATATAATTTGTCACATCAATAACATAAAAAGAGTAGAGCTATATATATAGCAGTAGAATTTTGGTATGTGATTGAACTTAAGTTGAAATAAATTCAAATTAAAATGTTATAACTCTAGGATGTTATATGTAATTCTCATAGTAACCAAAAATGAAATATACATAGAATATAAACAAAAGGAAATGAGACTAGAAACAAAATGTGTCACTACAAAAAAATCAACTAAAGATAAAAAAGAAATAATTGAGAAAATGATTGGCAAAAATCAGTAACTCTGACGTATTAAAACTTTCCATGCTACATAAATCTGAAAACTCTATTTCACATAAAACTGGAGCTGAAAGAAACAAATATTTACCTATAAAGTTAAAAGTTATATAGGGAACAAACACTAATTTTTTTTAGAAAAAATTATAAAAAGAGTAAAAATATGCCTTATACTACCGTAATTTCATGTTTTACAGCTCTGGGAAAATAGAAAATAAAATGTTCTGTTAGCATGAATCCCTCTGTGCCCCC...
The Human Reference
D Zhi, BJ Raphael, AL Price, H Tang and PA Pevzner. Identifying repeat domains in large genomes. Genome Biology 2006, 7:R7
[Figure: repeat domain graphs, panels (a), (b), and (c); nodes labeled A to H with numeric edge labels]