I present an overview of BioTorrents.net. This was presented at the Open Science Summit 2010 conference in Berkeley, CA.
Morgan Langille, PhD
Open Science Summit 2010
Berkeley, California
July 29th, 2010
Acknowledgements
iSEEM project
Dr. Jonathan Eisen, UC Davis
Questions/Comments
Twitter: @BetaScience
Motivation
Data in science is growing rapidly
Transfer times increasing
Reliability of data transfer
Sharing scientific data openly
Personal Challenges
1. Improve download speed and reliability from large data providers
2. Encourage sharing of all data associated with a study
3. Allow easier sharing of unpublished data
Traditional file transfer methods
Single source server
Bandwidth limitations
No data redundancy
No data verification
Peer-to-peer file transfer: BitTorrent
Data is shared between all computers
Bandwidth grows as the number of users increases
Data redundancy
Data is verified (SHA-1 cryptographic hash)
25-50% of all Internet traffic is BitTorrent
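The data verification point can be illustrated with a minimal sketch. It assumes the usual BitTorrent approach of splitting a file into fixed-size pieces and storing one SHA-1 digest per piece. The 4-byte piece length, the sample data, and the function names are hypothetical and chosen only to keep the example small.

```python
import hashlib

PIECE_LENGTH = 4  # bytes; tiny on purpose, real torrents use far larger pieces


def piece_hashes(data: bytes, piece_length: int = PIECE_LENGTH) -> list[bytes]:
    """Split data into fixed-size pieces and return the SHA-1 digest of each."""
    return [
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    ]


def verify_piece(piece: bytes, index: int, expected: list[bytes]) -> bool:
    """Accept a received piece only if its SHA-1 digest matches the published one."""
    return hashlib.sha1(piece).digest() == expected[index]


original = b"ACGTACGTTTGA"          # hypothetical dataset contents
expected = piece_hashes(original)   # digests a downloader would get from the .torrent

print(verify_piece(original[4:8], 1, expected))  # True: piece arrived intact
print(verify_piece(b"ACGA", 1, expected))        # False: piece was corrupted
```

A piece that fails the check can be discarded and requested again from another peer, which is why a corrupted transfer does not silently produce a bad file.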
BitTorrent: How it works
1. User installs BitTorrent client software
2. User downloads a small “.torrent” descriptor file
3. Client software connects to “Tracker” to obtain a list of other “peers” with same data
4. Client begins downloading/uploading
[Diagram: “.torrent” files and the “Tracker” server]
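To make the “.torrent” descriptor from step 2 more concrete, here is a rough sketch of how such a file could be assembled. It assumes the standard bencoded layout (an `announce` URL plus an `info` dictionary with `name`, `length`, `piece length`, and `pieces`). The tracker URL, file name, data, and tiny piece length are hypothetical, and the encoder is a minimal illustration rather than a full implementation.

```python
import hashlib


def bencode(value) -> bytes:
    """Minimal bencoder for ints, strings, bytes, lists, and dicts."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = sorted(
            ((k.encode("utf-8") if isinstance(k, str) else k, v)
             for k, v in value.items()),
            key=lambda kv: kv[0],
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


data = b"ACGTACGTTTGA"   # hypothetical dataset contents
piece_length = 4         # unrealistically small, for illustration only
pieces = b"".join(
    hashlib.sha1(data[i:i + piece_length]).digest()
    for i in range(0, len(data), piece_length)
)

metainfo = {
    "announce": "http://tracker.example.org/announce",  # hypothetical tracker URL
    "info": {
        "name": "example.fasta",                        # hypothetical file name
        "length": len(data),
        "piece length": piece_length,
        "pieces": pieces,
    },
}

torrent_bytes = bencode(metainfo)  # what would be saved as example.torrent
info_hash = hashlib.sha1(bencode(metainfo["info"])).hexdigest()
print(len(torrent_bytes), info_hash)
```

The `info_hash` value is the identifier a client presents to the tracker in step 3 when asking for peers that hold the same data.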
Other BitTorrent Advantages
Every dataset is given a unique ID (SHA-1 hash)
Distributed Hash Table (DHT) & Peer Exchange (PEX): tracker-less peer identification
Local Peer Discovery (LPD): finds peers on the local area network (LAN), allowing much faster data transfer
Web Seeds: FTP or HTTP resources can be added to the torrent
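Because each torrent is identified by a SHA-1 hash, that hash alone is enough for a DHT-capable client to look up peers without a tracker. A common way to pass the identifier around is a magnet link. The sketch below builds one; the hashed input and the display name are placeholders, not a real BioTorrents dataset.

```python
import hashlib
from urllib.parse import quote


def magnet_link(info_hash_hex: str, display_name: str) -> str:
    """Build a magnet URI from a 40-character hex info hash and a display name."""
    return f"magnet:?xt=urn:btih:{info_hash_hex}&dn={quote(display_name)}"


# Placeholder: a real info hash is the SHA-1 of the torrent's bencoded info dictionary.
info_hash = hashlib.sha1(b"placeholder for a bencoded info dictionary").hexdigest()

link = magnet_link(info_hash, "GenBank example")
print(link)
```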
BitTorrent Trackers
Many trackers already exist
Almost all have legal issues with copyright infringement
None are tailored to hosting scientific datasets
BioTorrents is a file sharing website for scientists
BioTorrents provides a central listing of datasets
Anyone can upload their own data
All data must be “open”; no illegal file sharing
Data is not hosted on BioTorrents**
Langille & Eisen, 2010, PLoS ONE 5: e10071.
BioTorrents: Advanced Features
Browse and search by
○ Keyword (dataset title and description)
○ Category (Genomics, Proteomics, Chemistry, etc.)
○ License (Public Domain, Creative Commons, GPL, etc.)
○ Username (mlangill, jeisen, NCBI, etc.)
RSS feeds and automatic downloading
Torrents linked into “Versions”
Upload script for bulk torrent creation
BioTorrents progress
1000 registered users
43 datasets (107 GB)
766 downloads
1386 GB data transferred
Real Example
Download GenBank (~230 GB) from NCBI to UC Davis

Method                 Download speed   Time
BitTorrent             Max 30 MB/s      2 hours
FTP to other server    ~10 MB/s         6 hours
FTP to NCBI            ~0.5 MB/s        5 days
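The times in this example follow directly from dividing the dataset size by the download speed. The short calculation below reproduces them as a sanity check, assuming 1 GB = 1000 MB and a constant transfer rate (both simplifications); the function name is illustrative.

```python
DATASET_GB = 230  # approximate GenBank size quoted in the example


def transfer_time_hours(size_gb: float, speed_mb_per_s: float) -> float:
    """Hours needed to move size_gb at a constant speed, assuming 1 GB = 1000 MB."""
    seconds = size_gb * 1000 / speed_mb_per_s
    return seconds / 3600


for speed in (30, 10, 0.5):  # MB/s values from the example
    hours = transfer_time_hours(DATASET_GB, speed)
    print(f"{speed:>4} MB/s: {hours:6.1f} hours ({hours / 24:.1f} days)")
```

This gives roughly 2.1 hours, 6.4 hours, and 5.3 days, in line with the rounded figures on the slide.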
Who will use BioTorrents?
1. Existing large data providers
○ More reliable and faster downloads for users
○ Lower bandwidth requirements for the provider
2. Scientists sharing published data
○ All data is bundled together and given a unique ID
○ Easier than setting up a Web/FTP server
3. Scientists sharing unpublished data
○ Data that might not be suitable for existing databases
○ Results that may not be sufficient for publication
Issues
BitTorrent works best for large, popular datasets
Long-term seeding: at least 1 seeder has to exist
Many institutions block/limit BitTorrent activity
Future
Metalink: XML link protocol that combines multiple sources
○ FTP, HTTP, BitTorrent, etc.
Volunteer Storage: parallel to volunteer computing
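As an illustration of the Metalink idea, the sketch below generates a small XML document that lists FTP, HTTP, and BitTorrent sources for one file. The element names follow the Metalink format as described in RFC 5854, to the best of my understanding; all URLs, the file name, and the size are hypothetical, and the helper function is not part of any library.

```python
import xml.etree.ElementTree as ET

METALINK_NS = "urn:ietf:params:xml:ns:metalink"


def build_metalink(name: str, size: int, urls: list[str], torrent_url: str) -> str:
    """Return a Metalink document listing several sources for a single file."""
    root = ET.Element("metalink", xmlns=METALINK_NS)
    file_el = ET.SubElement(root, "file", name=name)
    ET.SubElement(file_el, "size").text = str(size)
    for url in urls:  # plain FTP/HTTP mirrors
        ET.SubElement(file_el, "url").text = url
    # A torrent is referenced as a metadata URL rather than a direct mirror.
    ET.SubElement(file_el, "metaurl", mediatype="torrent").text = torrent_url
    return ET.tostring(root, encoding="unicode")


doc = build_metalink(
    name="example.fasta",                                  # hypothetical file
    size=12,
    urls=[
        "ftp://ftp.example.org/pub/example.fasta",         # hypothetical mirrors
        "http://mirror.example.org/example.fasta",
    ],
    torrent_url="http://example.org/example.fasta.torrent",
)
print(doc)
```

A download client that understands this format can pull from whichever listed source is available, which is the "combines multiple sources" point above.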
Final Message
Data transfer should be fast and easy
The scientific community should embrace existing technologies such as BitTorrent
BioTorrents uses the strengths of BitTorrent and provides features unique to scientific data