Morgan Langille, PhD, Open Science Summit 2010, Berkeley, California, July 29th, 2010

BioTorrents: A File Sharing Service for Scientific Data


DESCRIPTION

I present an overview of BioTorrents.net. This was presented at the Open Science Summit 2010 conference in Berkeley, CA.


Page 1: BioTorrents: A File Sharing Service for Scientific Data

Morgan Langille, PhD

Open Science Summit 2010

Berkeley, California

July 29th, 2010

Page 2: BioTorrents: A File Sharing Service for Scientific Data

Acknowledgements

iSEEM project

Dr. Jonathan Eisen, UC Davis

Questions/Comments Twitter: @BetaScience

Page 3: BioTorrents: A File Sharing Service for Scientific Data

Motivation

Data in science is growing rapidly

Transfer times increasing

Reliability of data transfer

Sharing scientific data openly

Page 4: BioTorrents: A File Sharing Service for Scientific Data

Personal Challenges

1. Improve download speed and reliability from large data providers

2. Encourage sharing of all data associated with a study

3. Allow easier sharing of unpublished data

Page 5: BioTorrents: A File Sharing Service for Scientific Data

Traditional file transfer methods

Single source server

Bandwidth limitations

No data redundancy

No data verification

Page 6: BioTorrents: A File Sharing Service for Scientific Data

Peer-to-peer file transfer: BitTorrent

Data is shared between all computers

Bandwidth grows as the number of users increases

Data redundancy

Data is verified with a SHA-1 cryptographic hash

25-50% of all Internet traffic is BitTorrent
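The verification point can be sketched in a few lines of Python. This is a minimal illustration, not BitTorrent client code: the tiny piece size and the sample data are assumptions made for readability, and real torrents use much larger pieces. The idea is that a digest is published for every piece, so a corrupted piece can be detected and fetched again from another peer.

```python
import hashlib

PIECE_LENGTH = 4  # assumption: tiny pieces for illustration only


def piece_hashes(data: bytes, piece_length: int = PIECE_LENGTH) -> list:
    """Split data into fixed-size pieces and return the SHA-1 digest of each."""
    return [
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    ]


def verify_piece(piece: bytes, expected_digest: bytes) -> bool:
    """Return True if a received piece matches the digest published for it."""
    return hashlib.sha1(piece).digest() == expected_digest


dataset = b"ACGTACGTTTGA"          # hypothetical stand-in for a dataset
expected = piece_hashes(dataset)   # what the publisher would distribute

print(verify_piece(b"ACGT", expected[0]))  # intact piece
print(verify_piece(b"ACGA", expected[0]))  # corrupted piece
```

The first call reports a match and the second does not, because a single changed byte produces a different SHA-1 digest.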

Page 7: BioTorrents: A File Sharing Service for Scientific Data

BitTorrent: How it works

1. User installs BitTorrent client software

2. User downloads a small “.torrent” descriptor file

3. Client software connects to “Tracker” to obtain a list of other “peers” with same data

4. Client begins downloading/uploading

[Figure: “.torrent” descriptor files and the “Tracker” server]
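Steps 2 and 3 can be illustrated with a small sketch. A “.torrent” descriptor is a bencoded dictionary, and the client reads the tracker address from its "announce" key. The decoder below is a minimal version written for this example, and the tracker URL and file name are hypothetical. The sample descriptor is also simplified: a real one carries the piece hashes as well.

```python
def bdecode(data: bytes, pos: int = 0):
    """Decode one bencoded value starting at pos; return (value, next_pos)."""
    token = data[pos:pos + 1]
    if token == b"i":                      # integer: i<digits>e
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if token == b"l":                      # list: l<items>e
        pos += 1
        items = []
        while data[pos:pos + 1] != b"e":
            item, pos = bdecode(data, pos)
            items.append(item)
        return items, pos + 1
    if token == b"d":                      # dictionary: d<key><value>...e
        pos += 1
        result = {}
        while data[pos:pos + 1] != b"e":
            key, pos = bdecode(data, pos)
            result[key], pos = bdecode(data, pos)
        return result, pos + 1
    colon = data.index(b":", pos)          # byte string: <length>:<bytes>
    length = int(data[pos:colon])
    start = colon + 1
    return data[start:start + length], start + length


# Hypothetical, simplified descriptor (no piece hashes).
sample = (
    b"d8:announce35:http://tracker.example.org/announce"
    b"4:infod6:lengthi12e4:name11:genbank.seq12:piece lengthi4eee"
)

meta, end = bdecode(sample)
print(meta[b"announce"].decode())   # tracker the client would contact
print(meta[b"info"][b"name"].decode(), meta[b"info"][b"length"])
```

After reading the tracker address, a client would ask that tracker for peers holding the same data and then begin exchanging pieces with them, which is step 4.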

Page 8: BioTorrents: A File Sharing Service for Scientific Data

Other BitTorrent Advantages

Every dataset is given a unique id (SHA-1 hash)

Distributed Hash Table (DHT) & Peer Exchange (PEX): tracker-less peer identification

Local Peer Discovery (LPD): finds peers on the local area network (LAN), allowing much faster data transfer

Web Seeds: FTP or HTTP resources can be added to the torrent
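As a rough sketch of the unique id and web seed points: the id of a torrent (its "info hash") is the SHA-1 digest of the bencoded "info" dictionary, and web seeds are listed under a separate top-level "url-list" key. The encoder below is a minimal version written for this example, and the file name, tracker URL, and mirror URL are hypothetical.

```python
import hashlib


def bencode(value) -> bytes:
    """Encode ints, strings, bytes, lists and dicts in bencoding."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = sorted(
            ((k.encode("utf-8") if isinstance(k, str) else k, v)
             for k, v in value.items()),
            key=lambda kv: kv[0],
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def info_hash(metainfo: dict) -> str:
    """SHA-1 of the bencoded 'info' dictionary, as a hex string."""
    return hashlib.sha1(bencode(metainfo["info"])).hexdigest()


data = b"ACGTACGTTTGA"   # hypothetical stand-in for a dataset
piece_length = 4         # assumption: tiny pieces for illustration
info = {
    "name": "example.seq",
    "length": len(data),
    "piece length": piece_length,
    "pieces": b"".join(
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    ),
}

plain = {"announce": "http://tracker.example.org/announce", "info": info}
with_seed = {**plain, "url-list": ["http://mirror.example.org/example.seq"]}

print(info_hash(plain))
print(info_hash(plain) == info_hash(with_seed))  # True
```

Because the web seed list sits outside the "info" dictionary, adding an FTP or HTTP source does not change the id of the dataset.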

Page 9: BioTorrents: A File Sharing Service for Scientific Data

BitTorrent Trackers

Many trackers already exist

Almost all have legal issues with copyright infringement

None are tailored to hosting scientific datasets

Page 10: BioTorrents: A File Sharing Service for Scientific Data

BioTorrents is a file sharing website for scientists

BioTorrents provides a central listing of datasets

Anyone can upload their own data

All data must be “open”; no illegal file sharing

Data is not hosted on BioTorrents

Langille & Eisen, 2010, PLoS ONE 5: e10071.

Page 11: BioTorrents: A File Sharing Service for Scientific Data

BioTorrents: Advanced Features

Browse and search by:

Keyword (dataset title and description)

Category (Genomics, Proteomics, Chemistry, etc.)

License (Public Domain, Creative Commons, GPL, etc.)

Username (mlangill, jeisen, NCBI, etc.)

RSS feeds and automatic downloading

Torrents linked into “Versions”

Upload script for bulk torrent creation
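To make "RSS feeds and automatic downloading" concrete, here is a minimal sketch of how a script might pick new torrents out of a feed. The feed content, URLs, and function name are all hypothetical, and the actual layout of the BioTorrents feeds is not described in these slides. The sketch assumes a plain RSS 2.0 feed whose items link directly to “.torrent” files.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content; a real script would fetch this over HTTP.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example dataset feed</title>
    <item>
      <title>Example genomes, version 2</title>
      <link>http://example.org/torrents/genomes_v2.torrent</link>
    </item>
    <item>
      <title>Example genomes, version 1</title>
      <link>http://example.org/torrents/genomes_v1.torrent</link>
    </item>
  </channel>
</rss>"""


def new_torrent_links(feed_xml: str, seen: set) -> list:
    """Return .torrent links from the feed that have not been seen before."""
    root = ET.fromstring(feed_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.endswith(".torrent") and link not in seen:
            links.append(link)
    return links


already_downloaded = {"http://example.org/torrents/genomes_v1.torrent"}
for link in new_torrent_links(SAMPLE_FEED, already_downloaded):
    print("would hand to the BitTorrent client:", link)
```

Many BitTorrent clients can watch a feed in this way, so a lab could pick up each new version of a dataset without manual steps.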

Page 12: BioTorrents: A File Sharing Service for Scientific Data

BioTorrents progress

1000 registered users

43 datasets (107 GB)

766 downloads

1386 GB data transferred

Page 13: BioTorrents: A File Sharing Service for Scientific Data

Real Example

Download GenBank (~230GB) from NCBI

NCBI to UC Davis        Download speed    Time
[label not recovered]   Max 30MB/s        2 hours
FTP to other server     ~10MB/s           6 hours
FTP to NCBI             ~0.5MB/s          5 days
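The times on this slide follow directly from the dataset size and the quoted speeds. The short calculation below reproduces them, assuming decimal units (1 GB = 1000 MB) and a constant transfer rate, both of which are simplifications.

```python
DATASET_GB = 230  # approximate GenBank size quoted on the slide


def transfer_hours(size_gb: float, speed_mb_per_s: float) -> float:
    """Hours needed to move size_gb at a constant speed (1 GB = 1000 MB)."""
    seconds = size_gb * 1000 / speed_mb_per_s
    return seconds / 3600


for speed in (30, 10, 0.5):  # MB/s, the three speeds from the slide
    hours = transfer_hours(DATASET_GB, speed)
    print(f"{speed} MB/s: {hours:.1f} hours ({hours / 24:.1f} days)")
```

This gives roughly 2.1 hours, 6.4 hours, and 5.3 days, which matches the rounded figures in the table.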

Page 14: BioTorrents: A File Sharing Service for Scientific Data

Who will use BioTorrents?

1. Existing large data providers

More reliable and faster downloads for users

Lower bandwidth requirements for the provider

2. Scientists sharing published data

All data is bundled together and given a unique id

Easier than setting up a Web/FTP server

3. Scientists sharing unpublished data

Data that might not be suitable for existing databases

Results that may not be sufficient for publication

Page 15: BioTorrents: A File Sharing Service for Scientific Data

Issues

BitTorrent works best for large, popular datasets

Long-term seeding: at least 1 seeder has to exist

Many institutions block/limit BitTorrent activity

Page 16: BioTorrents: A File Sharing Service for Scientific Data

Future

Metalink: XML link protocol that combines multiple sources (FTP, HTTP, BitTorrent, etc.)

Volunteer storage: parallel to volunteer computing
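As an illustration of the Metalink idea, the sketch below builds a small Metalink document that lists HTTP, FTP, and BitTorrent sources for one file. It follows the element names of the Metalink 4 format (RFC 5854) as far as this example needs them. The file name, URLs, and helper function are hypothetical, and nothing here implies that BioTorrents produces such files.

```python
import hashlib
import xml.etree.ElementTree as ET

NS = "urn:ietf:params:xml:ns:metalink"  # Metalink 4 namespace (RFC 5854)


def build_metalink(name: str, data: bytes, urls: list, torrent_url: str) -> str:
    """Return a Metalink document listing several sources for one file."""
    ET.register_namespace("", NS)
    root = ET.Element(f"{{{NS}}}metalink")
    file_el = ET.SubElement(root, f"{{{NS}}}file", {"name": name})
    ET.SubElement(file_el, f"{{{NS}}}size").text = str(len(data))
    digest = ET.SubElement(file_el, f"{{{NS}}}hash", {"type": "sha-256"})
    digest.text = hashlib.sha256(data).hexdigest()
    for url in urls:                       # direct HTTP/FTP sources
        ET.SubElement(file_el, f"{{{NS}}}url").text = url
    torrent = ET.SubElement(file_el, f"{{{NS}}}metaurl", {"mediatype": "torrent"})
    torrent.text = torrent_url             # BitTorrent source
    return ET.tostring(root, encoding="unicode")


document = build_metalink(
    "example.seq",
    b"ACGTACGTTTGA",  # hypothetical stand-in for the file contents
    ["http://mirror.example.org/example.seq", "ftp://ftp.example.org/example.seq"],
    "http://example.org/torrents/example.seq.torrent",
)
print(document)
```

A Metalink-aware download tool can then choose among the listed sources, or use several at once, and check the result against the published hash.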

Page 17: BioTorrents: A File Sharing Service for Scientific Data

Final Message

Data transfer should be fast and easy

The scientific community should embrace existing technologies such as BitTorrent

BioTorrents uses the strengths of BitTorrent and provides features unique to scientific data