bosc-2010
www.hdfgroup.org
The HDF Group
July 9, 2010
BioHDF: Open Binary File Formats for Next-Generation Sequencing Data
Dana Robinson
The HDF Group
Copyright © 2010 The HDF Group. All Rights Reserved
Current Status and Future Directions
NGS Data Challenges
Very large quantities of data (100s of GB)
"Drinking from the firehose"
Analysis methods vary greatly, so a flexible yet unified data store would be useful.
What is Needed
A Data Model: A data model that accurately describes the data and can be expanded to contain new types of data.
A Data Store: A file format or data store that is efficient in access time and storage size and that scales well.
A Toolkit: A flexible software toolkit that can be used to create tools and pipelines based on the data model and file format.
What is BioHDF?
An open-source, community-driven project, funded by an NIH SBIR grant and led by Geospiza, Inc. in collaboration with The HDF Group.
BioHDF is a particular arrangement of objects in an HDF5 file (similar to a database schema)
BioHDF is a library and C API which can be used to write applications (coming soon)
BioHDF is a set of command line tools for storing, retrieving and manipulating data in BioHDF files
HDF = Hierarchical Data Format
[Figure: tree view of the file somefile.h5. Object labels: /, Reads/, Alignments/, References. Legend: groups, datasets, attributes (example attribute: is_sorted).]
An example of how data is stored in HDF5
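The hierarchy in the figure can be illustrated with a small sketch. This is a pure-Python toy model of groups, datasets, and attributes, not the HDF5 or BioHDF API. The class names, the `lookup` helper, and the `positions` dataset are assumptions made for illustration; only the group names and the `is_sorted` attribute come from the slide.

```python
class Node:
    """Any object in the file; it can carry attributes (small metadata)."""
    def __init__(self):
        self.attrs = {}


class Dataset(Node):
    """A leaf object holding an array of values."""
    def __init__(self, values):
        super().__init__()
        self.values = list(values)


class Group(Node):
    """A container of named groups and datasets, like a directory."""
    def __init__(self):
        super().__init__()
        self.members = {}

    def create_group(self, name):
        group = Group()
        self.members[name] = group
        return group

    def create_dataset(self, name, values):
        dataset = Dataset(values)
        self.members[name] = dataset
        return dataset

    def lookup(self, path):
        """Resolve a slash-separated path such as '/Alignments/positions'."""
        node = self
        for part in path.strip("/").split("/"):
            if part:
                node = node.members[part]
        return node


# Build a layout resembling the one on the slide.
root = Group()                       # "/"
root.create_group("Reads")
alignments = root.create_group("Alignments")
# "positions" is a hypothetical dataset name used only for this sketch.
positions = alignments.create_dataset("positions", [100, 250, 400])
positions.attrs["is_sorted"] = True  # attribute attached to the dataset
```

A reader can then navigate by path, for example `root.lookup("/Alignments/positions").attrs["is_sorted"]`, in the same way a path addresses an object inside a real HDF5 file.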
Benefits of BioHDF
• Portability and data sharing: Platform independent, endian independent, self-describing, common data models.
• High performance: Fast random access and efficient, scalable, petabyte-level compressed storage.
• Widespread adoption: MATLAB, IDL, NASA Earth Observing System, Pacific Biosciences, SOLiD, hundreds of products.
• 20-year history: Robust, performance tuned, and well supported by The HDF Group, an independent non-profit entity.
HDF in Bioinformatics
• Baylor Imaging Group
• Life Technologies
• Pacific Biosciences
• Oxford Nanopore
• GenomeData (UW)
• Geospiza
• Others
Data Stored
The prototype BioHDF stores:
Reads
Alignments
Annotations
Clusters of Aligned Reads
Reference Sequences
Indexes (NCList or simple)
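To illustrate what an NCList (nested containment list) index does, here is a minimal in-memory sketch for overlap queries on half-open intervals. It is an assumption-laden illustration of the general technique only: the slides do not describe how BioHDF lays out its index on disk, and the function names are hypothetical.

```python
def build_nclist(intervals):
    """Nest each interval under the nearest earlier interval that contains it.

    Returns a list of (start, end, children) tuples. Within every sibling
    list, both starts and ends are in ascending order.
    """
    ordered = sorted(intervals, key=lambda iv: (iv[0], -iv[1]))
    root = []
    stack = []  # chain of intervals that are still "open" as containers
    for start, end in ordered:
        # Drop containers that do not fully contain the new interval.
        while stack and not (stack[-1][0] <= start and end <= stack[-1][1]):
            stack.pop()
        node = (start, end, [])
        (stack[-1][2] if stack else root).append(node)
        stack.append(node)
    return root


def _first_end_after(level, pos):
    """Binary search: index of the first sibling whose end is > pos."""
    lo, hi = 0, len(level)
    while lo < hi:
        mid = (lo + hi) // 2
        if level[mid][1] > pos:
            hi = mid
        else:
            lo = mid + 1
    return lo


def query_overlaps(level, q_start, q_end):
    """Return every stored interval overlapping [q_start, q_end)."""
    hits = []
    i = _first_end_after(level, q_start)
    while i < len(level) and level[i][0] < q_end:
        start, end, children = level[i]
        hits.append((start, end))
        # Only children of an overlapping interval can overlap the query.
        hits.extend(query_overlaps(children, q_start, q_end))
        i += 1
    return hits


# Example intervals standing in for aligned-read coordinates.
index = build_nclist([(0, 100), (10, 20), (15, 18), (30, 40), (120, 130)])
found = query_overlaps(index, 16, 35)
```

Because contained intervals are pushed down into child lists, each sibling list can be searched with a binary search followed by a short scan, instead of testing every interval.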
Data Stored
Additional user-specific data can be stored without breaking the library or tools.
[Figure: a file containing BioHDF Data alongside User-Specific Data.]
Similar to how adding additional tables to a database schema does not invalidate existing queries.
Project Stages
A "pipeline prototype" set of tools to demonstrate the suitability of HDF5 for NGS data storage.
A version 1.0 release of a BioHDF library and C API targeting the functionality of samtools.
A higher-level C API that abstracts out and hides the underlying storage technology.
[Figure: software stack diagram. Labels: BioHDF Applications and Wrappers (e.g. Perl, Python); High-Level API; BioHDF API; HDF5 API and Applications; HDF5 API; Physical Storage.]
A Higher-Level API
[Figure: API layering diagram. Labels: tool; wrapper; high-level C API; low-level C APIs; BioHDF API; samtools; BAM API.]
A high-level API will encapsulate and hide the underlying storage technology.
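The idea of hiding the storage technology behind one interface can be sketched as follows. This is not the planned BioHDF C API, which the slides do not specify. Every name here (`Alignment`, `AlignmentStore`, `FlatBackend`, `GroupedBackend`, `count_alignments`) is hypothetical, and the two backends are simple in-memory stand-ins for different low-level stores.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Iterator, List, NamedTuple


class Alignment(NamedTuple):
    read_id: str
    reference: str
    position: int


class AlignmentStore(ABC):
    """Hypothetical high-level interface; tools depend only on this."""

    @abstractmethod
    def alignments(self, reference: str) -> Iterator[Alignment]:
        """Yield the alignments recorded against one reference sequence."""


class FlatBackend(AlignmentStore):
    """Stand-in for one low-level store: all records in a single list."""

    def __init__(self, rows: Iterable[Alignment]) -> None:
        self._rows: List[Alignment] = list(rows)

    def alignments(self, reference: str) -> Iterator[Alignment]:
        return (a for a in self._rows if a.reference == reference)


class GroupedBackend(AlignmentStore):
    """Stand-in for a different store: records grouped per reference."""

    def __init__(self, rows: Iterable[Alignment]) -> None:
        self._by_ref: Dict[str, List[Alignment]] = {}
        for a in rows:
            self._by_ref.setdefault(a.reference, []).append(a)

    def alignments(self, reference: str) -> Iterator[Alignment]:
        return iter(self._by_ref.get(reference, []))


def count_alignments(store: AlignmentStore, reference: str) -> int:
    """A 'tool' written once against the high-level interface."""
    return sum(1 for _ in store.alignments(reference))


rows = [
    Alignment("r1", "chr1", 100),
    Alignment("r2", "chr1", 250),
    Alignment("r3", "chr2", 40),
]
flat_count = count_alignments(FlatBackend(rows), "chr1")
grouped_count = count_alignments(GroupedBackend(rows), "chr1")
```

The tool function never learns which backend it is talking to, which is the property the slide describes: the storage layer can change without the tools being rewritten.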
Acknowledgements
BioHDF is supported by NIH SBIR Phase II grant HG003792 awarded to Geospiza, Inc.
Geospiza: Todd Smith, Mark Welsh
The HDF Group: Mike Folk
Thank you for your time!
If you are interested in using or contributing to BioHDF, please contact us!
Dana Robinson ([email protected])
http://www.biohdf.org
BOSC BoF: Friday 5:10-6:00
ISMB Poster J18: Monday, July 12: 12:40-2:30
ISMB BoF: Tuesday, July 13 1-2 pm, room 306