Talk from the Genome Informatics Alliance 2009 meeting.
Challenges with data quality, sharing, and versioning
David Dooling <ddooling@wustl.edu>
GIA 2009
Production Centers
• Tony Cox, Sanger: Sequencing, Scale, Infrastructure, Data flow
• Toby Bloom, Broad: Quality, Integration, Standards, Sharing
• David Dooling, WUStL: Scale, Quality, Sharing, Versioning
Moore’s Law
[Chart, 2000 to 2010. Series: Personnel, Moore’s Law, CPU, Storage, Sequence]
FASTQ
@HWI-EAS404:5:1:6:180#0/1
GCTGGTTTAACTCGAGTATTTGTCCATTCTACTAATTTGAGTGTCTGCTGTGGAAAGGTGTTTGTCATGTATTTT
+HWI-EAS404:5:1:6:180#0/1
aaaa`]aaaa`aa^aa]aaaa^\`_\``____`W]a_`T\[[b__`\YXUW][MSTNZX^[[`_Z[^``\X`^a\
@HWI-EAS404:5:1:6:396#0/1
TATTTACTCTATCCCATTATATACATATTATGATTTCAAAATAACAATGCCAATATAAAAACTAACAATATGATA
+HWI-EAS404:5:1:6:396#0/1
Yaaa_baa`^a]Wa___aaa^I^V]^]NQ_`\^ZPP[__^_\a`^a`JYQWVNFFMRQSX_X^a_Y[`^a^\\NZ
@HWI-EAS404:5:1:6:1344#0/1
GAGGACTTGCATGCTAGGTTTGGTTCTTGGCTGAATTGCTGAAACTGTCCAAGTATCAGTAGCAAAACATGGGTG
+HWI-EAS404:5:1:6:1344#0/1
aabaaa__]^a`[\^`]]Y``[ST_]`]WW]]WZ]`^ZT[_X``\`_WVNYWKDNLTW[Y\XSVZ^ZTZZVRUX[
@HWI-EAS404:5:1:6:1814#0/1
AAAGCTTACTGCTGTTTAGAATTCTTGCTACAGTCAGGAGAAAGCCGAAAGCTGAACGGGTACTGAATCTTCTAC
+HWI-EAS404:5:1:6:1814#0/1
aa````aa^a`_^`\`a`XY`^ZX^YW\^[X\UWUYOMVZZ\\_W^^\XXTSMHMLLNTTDWU__[WVVY]Y_]X
7 TB/week
350 TB/year
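The FASTQ records above follow a four-line layout: an `@` header, the bases, a `+` separator, and one quality character per base. The sketch below reads that layout and turns quality characters into integer scores. The offset of 64 is an assumption based on the Illumina-style read names; other FASTQ variants use 33, and the helper name `read_fastq` and the sample record are hypothetical.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO

# Assumption: qualities are encoded as chr(score + 64); some FASTQ variants use 33.
QUALITY_OFFSET = 64


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]


def read_fastq(handle: TextIO, offset: int = QUALITY_OFFSET) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-read FASTQ stream.

    Reading stops at end of file or at the first empty header line.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        scores = [ord(char) - offset for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


# Hypothetical short record, not taken from the slides.
sample = io.StringIO("@read1\nGATTACA\n+read1\naaaa`]a\n")
for record in read_fastq(sample):
    print(record.name, record.sequence, record.qualities)
```

Every base carries a quality character, so the quality line is as long as the sequence line and accounts for roughly half of each record.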
Mapping
42,000 core-hr/week
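To put the mapping figure in hardware terms, the arithmetic below converts core-hours per week into the number of cores that would have to stay busy around the clock. The function name and the 80% utilization figure are illustrative assumptions, not numbers from the talk.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def sustained_cores(core_hours_per_week: float, utilization: float = 1.0) -> float:
    """Cores needed to deliver the given core-hours in one week.

    `utilization` is the fraction of wall-clock time a core spends on useful work.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return core_hours_per_week / (HOURS_PER_WEEK * utilization)


# 42,000 core-hr/week is the figure on the slide.
print(sustained_cores(42_000))  # 250.0

# Hypothetical: the same load on a cluster that is busy 80% of the time.
print(round(sustained_cores(42_000, utilization=0.8), 1))
```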
The Balanced PC
• Clock speed
• AGP
• Front-side bus
• Hypertransport
• 1 Gbps
• PCI-X
• SATA
• PCI-Express
• Infiniband
• Multi-core
• Front-side bus
• GPU
• 10 Gbps
The balanced PS¹
10 gosub get(sequencers)
20 gosub get(disk)
30 gosub get(backup_capacity)
40 gosub get(network_capacity)
50 gosub get(cluster_nodes)
¹ PS: Pipeline for Sequencing
The unbalanced PS
10 gosub get(sequencers)
20 gosub get(disk)
30 gosub get(backup_capacity)
40 gosub get(network_capacity)
50 gosub get(cluster_nodes)
60 goto 10
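One way to read the two listings: the pipeline is balanced when no single stage caps the others, and each new batch of sequencers sends you back through the whole list. The sketch below finds the limiting stage. Every capacity is a hypothetical number, expressed as TB of sequence data handled per week, and the function name is an assumption.

```python
from typing import Dict, Tuple


def bottleneck(capacity_tb_per_week: Dict[str, float]) -> Tuple[str, float]:
    """Return the stage with the smallest capacity; it caps pipeline throughput."""
    if not capacity_tb_per_week:
        raise ValueError("at least one stage is required")
    stage = min(capacity_tb_per_week, key=capacity_tb_per_week.get)
    return stage, capacity_tb_per_week[stage]


# Hypothetical capacities; stage names follow the listing above.
pipeline = {
    "sequencers": 7.0,
    "disk": 9.0,
    "backup_capacity": 5.0,
    "network_capacity": 12.0,
    "cluster_nodes": 8.0,
}

stage, limit = bottleneck(pipeline)
print(f"{stage} limits the pipeline to {limit} TB/week")
```

Raising only the sequencer capacity in this example changes nothing downstream: the backup stage still sets the pace.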
...must be more than just a slogan
Quality missteps
Initial low fidelity between base quality values and observed quality
Tsonev, S. SEP 2007
An aside
“basecall calibration predicted vs. observed”
Quality is the key
Need high fidelity between predicted and observed quality
3 bits per base
50 bytes per base
20 bytes per base
2 bytes per base
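Fidelity between predicted and observed quality can be checked by binning base calls by their predicted score and comparing against the error rate actually seen in each bin. The sketch below assumes the standard Phred relation Q = -10·log10(p) and assumes each call has already been marked right or wrong, for example by alignment to a trusted reference. The function names and sample counts are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def phred_to_error_probability(quality: float) -> float:
    """Error probability implied by a Phred-scaled quality value."""
    return 10 ** (-quality / 10)


def observed_quality(calls: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Map each predicted quality to the empirical quality of its bin.

    `calls` holds (predicted_quality, is_error) pairs. A bin with no
    observed errors is reported as math.inf.
    """
    totals: Dict[int, int] = defaultdict(int)
    errors: Dict[int, int] = defaultdict(int)
    for predicted, is_error in calls:
        totals[predicted] += 1
        if is_error:
            errors[predicted] += 1
    result: Dict[int, float] = {}
    for predicted in sorted(totals):
        rate = errors[predicted] / totals[predicted]
        result[predicted] = math.inf if rate == 0 else -10 * math.log10(rate)
    return result


# Hypothetical data: both bins show 10 errors in 1000 calls.
calls = (
    [(20, True)] * 10 + [(20, False)] * 990
    + [(30, True)] * 10 + [(30, False)] * 990
)
for predicted, observed in observed_quality(calls).items():
    print(f"predicted Q{predicted}: observed Q{observed:.1f}")
```

In this made-up data the Q20 bin is well calibrated, while the Q30 bin overstates its accuracy by a factor of ten.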
The down side
http://mammoth.psu.edu/labPhotos/imageOfFlowgram.jpg
http://www3.appliedbiosystems.com/cms/groups/mcb_marketing/documents/generaldocuments/cms_057559.pdf
Submitted to central repositories
... and replicated across the pond
The goal of this project is to provide a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods.
Write-only databases
Search limited to sequence and values of specific XML entities submitted as metadata
Speaking of XML
<?xml version="1.0" encoding="UTF-8"?>
<STUDY_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <STUDY alias="LowSalternSDbayVir111005" accession="SRP000145">
    <DESCRIPTOR>
      <STUDY_TITLE>Solar Salterns, viral fraction from low salinity saltern in San Diego, CA</STUDY_TITLE>
      <STUDY_TYPE existing_study_type="Metagenomics"/>
      <STUDY_ABSTRACT>Viral community from a "low" salinity saltern and sequenced at 454 Life Sciences.</STUDY_ABSTRACT>
      <CENTER_NAME>SDSU</CENTER_NAME>
      <CENTER_PROJECT_NAME>LowSalternSDbayVir111005</CENTER_PROJECT_NAME>
      <PROJECT_ID>28373</PROJECT_ID>
    </DESCRIPTOR>
    <STUDY_ATTRIBUTES>
      <STUDY_ATTRIBUTE>
        <TAG>NCBI parent project ID</TAG>
        <VALUE>28725</VALUE>
      </STUDY_ATTRIBUTE>
    </STUDY_ATTRIBUTES>
  </STUDY>
</STUDY_SET>
<?xml version="1.0" encoding="UTF-8"?>
<SAMPLE_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <SAMPLE alias="28373" accession="SRS000373">
    <SAMPLE_NAME>
      <TAXON_ID>496920</TAXON_ID>
      <COMMON_NAME>saltern metagenome</COMMON_NAME>
    </SAMPLE_NAME>
    <DESCRIPTION>viral fraction from low salinity saltern in San Diego, CA</DESCRIPTION>
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE>
        <TAG>collection_date</TAG>
        <VALUE>11/10/05</VALUE>
      </SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE>
        <TAG>lat_lon</TAG>
        <VALUE>32.599040, -117.107356</VALUE>
      </SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>
<?xml version="1.0" encoding="UTF-8"?>
<EXPERIMENT_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <EXPERIMENT alias="LowSalternSDbayVir111005_experiment" expected_number_runs="2" accession="SRX000217">
    <TITLE>454 sequencing of saltern metagenome fragment library</TITLE>
    <STUDY_REF accession="SRP000145" refname="LowSalternSDbayVir111005"/>
    <DESIGN>
      <DESIGN_DESCRIPTION>454 Sequencing of viral fraction from low salinity saltern in San Diego, CA</DESIGN_DESCRIPTION>
      <SAMPLE_DESCRIPTOR accession="SRS000373" refname="28373"/>
      <LIBRARY_DESCRIPTOR>
        <LIBRARY_NAME>lowSalternSDbayVir111005</LIBRARY_NAME>
        <LIBRARY_STRATEGY>OTHER</LIBRARY_STRATEGY>
        <LIBRARY_SOURCE>OTHER</LIBRARY_SOURCE>
        <LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION>
        <LIBRARY_LAYOUT>
          <SINGLE/>
        </LIBRARY_LAYOUT>
        <LIBRARY_CONSTRUCTION_PROTOCOL>none provided</LIBRARY_CONSTRUCTION_PROTOCOL>
      </LIBRARY_DESCRIPTOR>
      <SPOT_DESCRIPTOR>
        <SPOT_DECODE_SPEC>
          <NUMBER_OF_READS_PER_SPOT>2</NUMBER_OF_READS_PER_SPOT>
          <READ_SPEC>
            <READ_INDEX>0</READ_INDEX>
            <READ_CLASS>Technical Read</READ_CLASS>
            <READ_TYPE>Adapter</READ_TYPE>
            <BASE_COORD>1</BASE_COORD>
          </READ_SPEC>
          <READ_SPEC>
            <READ_INDEX>1</READ_INDEX>
            <READ_CLASS>Application Read</READ_CLASS>
            <READ_TYPE>Forward</READ_TYPE>
            <BASE_COORD>5</BASE_COORD>
          </READ_SPEC>
        </SPOT_DECODE_SPEC>
      </SPOT_DESCRIPTOR>
    </DESIGN>
    <PLATFORM>
      <LS454>
        <INSTRUMENT_MODEL>GS 20</INSTRUMENT_MODEL>
        <FLOW_SEQUENCE>TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG</FLOW_SEQUENCE>
        <FLOW_COUNT>168</FLOW_COUNT>
      </LS454>
    </PLATFORM>
    <PROCESSING>
      <BASE_CALLS>
        <SEQUENCE_SPACE>Base Space</SEQUENCE_SPACE>
        <BASE_CALLER>454BaseCaller</BASE_CALLER>
      </BASE_CALLS>
      <QUALITY_SCORES qtype="phred">
        <QUALITY_SCORER>454BaseCaller</QUALITY_SCORER>
        <NUMBER_OF_LEVELS>64</NUMBER_OF_LEVELS>
        <MULTIPLIER>1</MULTIPLIER>
      </QUALITY_SCORES>
    </PROCESSING>
  </EXPERIMENT>
</EXPERIMENT_SET>
<?xml version="1.0" encoding="UTF-8"?>
<RUN_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RUN alias="D0IIGP3" instrument_model="454 GS 20" run_date="2006-03-17T09:39:51Z" run_file="D0IIGP3" run_center="454MSC" total_data_blocks="1" accession="SRR001053">
    <EXPERIMENT_REF accession="SRX000217" refname="LowSalternSDbayVir111005_experiment"/>
    <DATA_BLOCK name="D0IIGP3" region="1" total_spots="51121" total_reads="51121" number_channels="1" format_code="1" sector="0">
      <FILES>
        <FILE filename="D0IIGP301.sff" filetype="sff"/>
      </FILES>
    </DATA_BLOCK>
    <RUN_ATTRIBUTES>
      <RUN_ATTRIBUTE>
        <TAG>flow_count</TAG>
        <VALUE>168</VALUE>
      </RUN_ATTRIBUTE>
      <RUN_ATTRIBUTE>
        <TAG>flow_sequence</TAG>
        <VALUE>TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG</VALUE>
      </RUN_ATTRIBUTE>
      <RUN_ATTRIBUTE>
        <TAG>key_sequence</TAG>
        <VALUE>TCAG</VALUE>
      </RUN_ATTRIBUTE>
    </RUN_ATTRIBUTES>
  </RUN>
  <RUN alias="D1LDSHL" instrument_model="454 GS 20" run_date="2006-04-06T09:25:19Z" run_file="D1LDSHL" run_center="454MSC" total_data_blocks="1" accession="SRR001054">
    <EXPERIMENT_REF accession="SRX000217" refname="LowSalternSDbayVir111005_experiment"/>
    <DATA_BLOCK name="D1LDSHL" region="1" total_spots="70935" total_reads="70935" number_channels="1" format_code="1" sector="0">
      <FILES>
        <FILE filename="D1LDSHL01.sff" filetype="sff"/>
      </FILES>
    </DATA_BLOCK>
    <RUN_ATTRIBUTES>
      <RUN_ATTRIBUTE>
        <TAG>flow_count</TAG>
        <VALUE>168</VALUE>
      </RUN_ATTRIBUTE>
      <RUN_ATTRIBUTE>
        <TAG>flow_sequence</TAG>
        <VALUE>TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG</VALUE>
      </RUN_ATTRIBUTE>
      <RUN_ATTRIBUTE>
        <TAG>key_sequence</TAG>
        <VALUE>TCAG</VALUE>
      </RUN_ATTRIBUTE>
    </RUN_ATTRIBUTES>
  </RUN>
</RUN_SET>
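Because the submission metadata is plain XML, the fields a repository does not index can still be pulled out locally. The sketch below uses Python's standard `xml.etree.ElementTree` on an abridged copy of the STUDY document shown above. The helper names and the choice of fields are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional

# Abridged from the STUDY_SET document above.
STUDY_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<STUDY_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <STUDY alias="LowSalternSDbayVir111005" accession="SRP000145">
    <DESCRIPTOR>
      <STUDY_TYPE existing_study_type="Metagenomics"/>
      <CENTER_NAME>SDSU</CENTER_NAME>
      <PROJECT_ID>28373</PROJECT_ID>
    </DESCRIPTOR>
  </STUDY>
</STUDY_SET>"""


def attribute_of(element: Optional[ET.Element], name: str) -> Optional[str]:
    """Return an attribute value, or None when the element is missing."""
    return None if element is None else element.get(name)


def summarize_studies(xml_bytes: bytes) -> List[Dict[str, Optional[str]]]:
    """Collect a few identifying fields from every STUDY element."""
    root = ET.fromstring(xml_bytes)
    summaries = []
    for study in root.findall("STUDY"):
        summaries.append({
            "accession": study.get("accession"),
            "alias": study.get("alias"),
            "study_type": attribute_of(
                study.find("DESCRIPTOR/STUDY_TYPE"), "existing_study_type"
            ),
            "center_name": study.findtext("DESCRIPTOR/CENTER_NAME"),
            "project_id": study.findtext("DESCRIPTOR/PROJECT_ID"),
        })
    return summaries


for summary in summarize_studies(STUDY_XML):
    print(summary)
```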
The Cathedral and the Bazaar
Linux overturned much of what I thought I knew. I had been preaching the Unix gospel of small tools, rapid prototyping and evolutionary programming for years. But I also believed there was a certain critical complexity above which a more centralized, a priori approach was required. I believed that the most important software (operating systems and really large tools like the Emacs programming editor) needed to be built like cathedrals, carefully crafted by individual wizards or small bands of mages working in splendid isolation, with no beta to be released before its time.
The Vatican and the Reformation
GenBank genome
http://betterexplained.com/articles/intro-to-distributed-version-control-illustrated/
git genome
http://betterexplained.com/articles/intro-to-distributed-version-control-illustrated/
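A distributed version control system such as git names each stored object by a hash of its content, so two copies of the same data get the same identifier wherever they live. The sketch below reproduces git's blob naming (SHA-1 over a `blob <size>\0` header plus the bytes). Applying it to sequence records is an assumption about how a "git genome" might identify versions, not something the slides specify, and the sequences are hypothetical.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Return the identifier git assigns to a blob with this content."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


# The empty blob has a well-known identifier in every git repository.
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# Hypothetical sequences: a single-base change yields a different identifier.
original = b"GATTACAGATTACA\n"
patched = b"GATTACAGATTACC\n"
print(git_blob_id(original) != git_blob_id(patched))  # True
```

Identical content always maps to the same identifier, which lets independent sites confirm they hold the same version without a central authority assigning numbers.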
The Human Reference
>7 dna:chromosome chromosome:NCBI36:7:1:158821424:1
...AATAACTATATAAGTAAATAAGCAAGCTGTATGAATATACAAAGCTCTCTGGTAAAGGTAAATACATAAACAAACATAAAAACAGTCCTATTGTAATTTTGGTTTGTAACTCTGCTTTTTATTTTCTACATAATTTAAAAGGCAAATGCATAAAATGTAATTGTAAATCTGTTAGCTGGTATACAATGAATAAAGATATAATTTGTCACATCAATAACATAAAAAGAGTAGAGCTATATATATAGCAGTAGAATTTTGGTATGTGATTGAACTTAAGTTGAAATAAATTCAAATTAAAATGTTATAACTCTAGGATGTTATATGTAATTCTCATAGTAACCAAAAATGAAATATACATAGAATATAAACAAAAGGAAATGAGACTAGAAACAAAATGTGTCACTACAAAAAAATCAACTAAAGATAAAAAAGAAATAATTGAGAAAATGATTGGCAAAAATCAGTAACTCTGACGTATTAAAACTTTCCATGCTACATAAATCTGAAAACTCTATTTCACATAAAACTGGAGCTGAAAGAAACAAATATTTACCTATAAAGTTAAAAGTTATATAGGGAACAAACACTAATTTTTTTTAGAAAAAATTATAAAAAGAGTAAAAATATGCCTTATACTACCGTAATTTCATGTTTTACAGCTCTGGGAAAATAGAAAATAAAATGTTCTGTTAGCATGAATCCCTCTGTGCCCCC...
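The only version information in this reference file is the assembly name embedded in the FASTA header. The sketch below splits the header's last field on colons, assuming it follows the layout coordinate system, assembly, name, start, end, strand. That layout and the names `Location` and `parse_header` are assumptions based on how the header reads, not a documented contract.

```python
from typing import NamedTuple


class Location(NamedTuple):
    coord_system: str
    assembly: str
    name: str
    start: int
    end: int
    strand: int


def parse_header(line: str) -> Location:
    """Parse the colon-separated location at the end of a FASTA header line."""
    if not line.startswith(">"):
        raise ValueError("FASTA header must start with '>'")
    fields = line[1:].split()
    if not fields:
        raise ValueError("empty FASTA header")
    parts = fields[-1].split(":")
    if len(parts) != 6:
        raise ValueError(f"unexpected location field: {fields[-1]!r}")
    coord_system, assembly, name, start, end, strand = parts
    return Location(coord_system, assembly, name, int(start), int(end), int(strand))


location = parse_header(">7 dna:chromosome chromosome:NCBI36:7:1:158821424:1")
print(location.assembly)                      # NCBI36
print(location.end - location.start + 1)      # 158821424
```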
The Human Reference
D Zhi, BJ Raphael, AL Price, H Tang and PA Pevzner. Identifying repeat domains in large genomes. Genome Biology 2006, 7:R7
[Figure, panels (a), (b), (c): repeat domain graphs with nodes labeled A through H, from the paper cited above]