Academic Torrents - XSEDE


Academic Torrents: Scalable Distribution for Science

Joseph Paul Cohen (NSF Graduate Fellow) and Henry Z Lo
UMass Boston Computer Science Ph.D. Candidates

Entire Presentation

Datasets
- Searchable central index
- Dynamic hosting locations
- Ability to cache on campuses
- Long term persistence
- Aggregate sources

Publications
- Long term persistence
- New publication model: distributed publishing
- Library Smart Nodes

NSF Data Sharing Policy

“Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants. Grantees are expected to encourage and facilitate such sharing.” See Award & Administration Guide (AAG) Chapter VI.D.4.

NIH Data Sharing Policies

“Expects investigators seeking more than $500K in direct support in any given year to submit a data sharing plan with their application or to indicate why data sharing is not possible.”

“Requires data for all NIDA-funded human genetics studies to be available for sharing, independent of direct costs, membership in the NIDA Genetics Consortium, or the type of genetics data generated.”

http://www.nlm.nih.gov/NIHbmic/nih_data_sharing_policies.html

We need to share!

Stick figures taken from xkcd

Sharing is Hard

Considerations:
● Maintenance - how much work?
● Bandwidth - how scalable?
● Speed - how fast are downloads?
● Robustness - susceptible to failure?
● Cost - how much will it be?


Single Server Model

One machine hosts a file from one location

● Benefits
○ Simple (relatively)

● Pains
○ Single point of failure (hard drive/network/power outage)
○ Limited bandwidth (one machine serving the world)


Apache Mirroring

Multiple machines host copies of a file
A central point sends the file to each mirror node (via scp, rsync)
A central index publishes hash of file to verify correctness

● Benefits
○ Solves the single point of failure
○ Might be faster if you download from a closer node

● Pains
○ Each mirror must have high bandwidth
○ Verification of each file is responsibility of the users
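In the mirroring model, the user is the one who checks a download against the hash published by the central index. A minimal sketch of that check follows. It assumes the index publishes a SHA-256 hex digest (the slides do not say which hash is used), and the file name and contents are made up for the demo.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, published_hex):
    """True when the local file matches the digest published by the index."""
    return file_sha256(path) == published_hex.strip().lower()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file standing in for a copy fetched from a mirror.
        mirror_copy = Path(tmp) / "dataset.bin"
        mirror_copy.write_bytes(b"example dataset contents")
        published = hashlib.sha256(b"example dataset contents").hexdigest()
        print(verify_download(mirror_copy, published))
        # Simulate a corrupted mirror copy.
        mirror_copy.write_bytes(b"corrupted contents")
        print(verify_download(mirror_copy, published))
```

Reading in chunks keeps memory use flat for multi-gigabyte datasets. The check only runs after the whole file has arrived, so a corrupted copy means downloading the file again.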

Academic Torrents

Maintains list of data locations dynamically (via API)
Supports HTTP, FTP, and BitTorrent mirrors

● Benefits
○ Long term preservation of data
○ Automatic verification of data to ensure consistency
○ Can extend existing data dissemination systems
○ Download from multiple mirrors at once (on campus CDN!)

● Pains
○ Clients are not designed for research (until now)
○ Network firewalls can block BitTorrent (HTTP and FTP are not blocked)
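Automatic verification and downloading from several mirrors at once both rest on the same idea: BitTorrent (version 1) describes a file as fixed-size pieces, each with its own SHA-1 hash. The sketch below illustrates that idea only. The mirrors are in-memory byte strings standing in for HTTP, FTP, or BitTorrent sources, the piece length is tiny for readability, and `assemble` is a hypothetical helper, not part of any real client.

```python
import hashlib


def split_pieces(data, piece_length):
    """Cut data into consecutive pieces of at most piece_length bytes."""
    return [data[i:i + piece_length] for i in range(0, len(data), piece_length)]


def piece_hashes(data, piece_length):
    """SHA-1 digest of every piece, in order."""
    return [hashlib.sha1(piece).digest() for piece in split_pieces(data, piece_length)]


def assemble(expected_hashes, mirrors, piece_length):
    """Take each piece from the first mirror whose copy matches its hash."""
    result = []
    for index, expected in enumerate(expected_hashes):
        start = index * piece_length
        for mirror in mirrors:
            candidate = mirror[start:start + piece_length]
            if hashlib.sha1(candidate).digest() == expected:
                result.append(candidate)
                break
        else:
            raise ValueError(f"no mirror holds a valid copy of piece {index}")
    return b"".join(result)


if __name__ == "__main__":
    original = b"scalable distribution for science"
    hashes = piece_hashes(original, 4)
    # Two hypothetical mirrors, each with one damaged piece.
    mirror_a = bytearray(original)
    mirror_a[0] ^= 0xFF
    mirror_b = bytearray(original)
    mirror_b[-1] ^= 0xFF
    rebuilt = assemble(hashes, [bytes(mirror_a), bytes(mirror_b)], 4)
    print(rebuilt == original)
```

Because each piece is checked on its own, a bad piece from one source can be replaced from another without restarting the download.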


Method Comparison

Method                    Maintenance  Bandwidth limits  Speed     Robustness  Cost
Single Server             Moderate     Somewhat          Slow      No          Moderate
Multiple Servers          High         Somewhat          Moderate  Somewhat    Moderate
Mailing Disks             High         No                High      No          Low
Free Repositories         Low          Yes               Moderate  Somewhat    Free
Proprietary Repositories  Low          Moderate          Moderate  Somewhat    High
Academic Torrents         Moderate     No                High      Yes         Low

Academic Torrents

● Create torrent from data
● Upload torrent to Academic Torrents
● Peers get torrent from AT
● Share data with peers
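The "create torrent from data" step can be sketched with the BitTorrent version 1 metainfo format: a bencoded dictionary whose `info` part lists the file name, length, piece length, and concatenated SHA-1 piece hashes. This is an illustration of the file format, not the tool the authors use. The tracker URL and file name are placeholders, and the piece length is far smaller than a real torrent would use.

```python
import hashlib


def bencode(value):
    """Encode ints, bytes, str, lists, and dicts using bencoding."""
    if isinstance(value, int):
        return b"i" + str(value).encode("ascii") + b"e"
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return str(len(value)).encode("ascii") + b":" + value
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys must be byte strings in sorted order.
        items = [(k.encode("utf-8") if isinstance(k, str) else k, v)
                 for k, v in value.items()]
        items.sort(key=lambda pair: pair[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def build_metainfo(name, data, piece_length, announce):
    """Build a single-file metainfo dictionary for in-memory data."""
    pieces = b"".join(
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    )
    info = {"name": name, "length": len(data),
            "piece length": piece_length, "pieces": pieces}
    return {"announce": announce, "info": info}


def info_hash(metainfo):
    """SHA-1 of the bencoded info dictionary, which identifies the torrent."""
    return hashlib.sha1(bencode(metainfo["info"])).hexdigest()


if __name__ == "__main__":
    meta = build_metainfo(
        "dataset.bin",                          # hypothetical file name
        b"0123456789",
        4,                                      # demo-sized piece length
        "http://tracker.example.org/announce",  # placeholder tracker URL
    )
    torrent_bytes = bencode(meta)  # what would be written to a .torrent file
    print(len(torrent_bytes), info_hash(meta))
```

Peers that fetch the resulting file from the index learn the piece hashes from it, which is what lets them verify data received from any other peer or mirror.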

Transmission torrent client

Academic Torrents Portal

Each entry contains:

BibTeX metadata (keys -> values)

File listing with hashes (verify authenticity)

Listing of hosting locations (global mirror locations)
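One way to picture a portal entry is as a small record holding the three parts listed above. The sketch below is an assumed data model for illustration: the class names, field names, sample values, and mirror URLs are all hypothetical and do not describe the portal's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    path: str
    size_bytes: int
    sha1_hex: str  # hash used to verify authenticity


@dataclass
class Entry:
    bibtex: dict                                 # BibTeX keys -> values
    files: list = field(default_factory=list)    # FileRecord items
    mirrors: list = field(default_factory=list)  # hosting location URLs

    def total_size(self):
        """Total size of all files in the entry, in bytes."""
        return sum(record.size_bytes for record in self.files)

    def to_bibtex(self, cite_key, entry_type="misc"):
        """Render the metadata as a BibTeX record."""
        lines = [f"@{entry_type}{{{cite_key},"]
        for key, value in self.bibtex.items():
            lines.append(f"  {key} = {{{value}}},")
        lines.append("}")
        return "\n".join(lines)


if __name__ == "__main__":
    entry = Entry(
        bibtex={"title": "Example Dataset", "year": "2014"},
        files=[FileRecord("part1.bin", 600, "0" * 40),
               FileRecord("part2.bin", 400, "1" * 40)],
        mirrors=["http://mirror-a.example.org/", "http://mirror-b.example.org/"],
    )
    print(entry.total_size())
    print(entry.to_bibtex("example2014"))
```

Keeping the metadata as plain BibTeX keys and values means an entry can be cited directly from its listing.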

Curated collections

Each collection is:

Curated by a user (allows trust)

An updatable folder of entries (modifiable)

Accessible via APIs (RSS, CSV, RESTful)
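Because collections are exposed in machine-readable form, a script can process one without scraping the site. The sketch below parses a CSV listing with Python's standard `csv` module. The column names and rows are invented for the example and are not the portal's real export format.

```python
import csv
import io

# Hypothetical CSV export of a collection; columns and rows are made up.
SAMPLE_CSV = """TYPE,NAME,INFOHASH,SIZEBYTES
Dataset,Example dataset A,0000000000000000000000000000000000000001,1000
Paper,Example paper B,0000000000000000000000000000000000000002,250
Dataset,Example dataset C,0000000000000000000000000000000000000003,4000
"""


def load_collection(csv_text):
    """Parse the CSV text into a list of dictionaries with typed sizes."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "type": row["TYPE"],
            "name": row["NAME"],
            "infohash": row["INFOHASH"],
            "size_bytes": int(row["SIZEBYTES"]),
        })
    return rows


def total_size(rows, entry_type=None):
    """Sum entry sizes, optionally restricted to one entry type."""
    return sum(r["size_bytes"] for r in rows
               if entry_type is None or r["type"] == entry_type)


if __name__ == "__main__":
    collection = load_collection(SAMPLE_CSV)
    print(len(collection), total_size(collection), total_size(collection, "Dataset"))
```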

Command Line Interface (atdown)
https://github.com/AcademicTorrents/AcademicTorrents-Downloader


Use Case: Wikipedia XML Offline Version

10 GB of data
Community hosted

766 downloads in 2014 (7.66 TB!)
~15 persistent mirror locations
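The 7.66 TB figure follows from multiplying the download count by the dataset size, using decimal units (1 TB = 1000 GB). The short sketch below reproduces that arithmetic and adds a per-mirror estimate. The even split across 15 mirrors is an assumption made for illustration; the slides only say speeds and mirrors vary.

```python
def total_transfer_tb(downloads, dataset_gb):
    """Total data served, in decimal terabytes."""
    return downloads * dataset_gb / 1000


def per_mirror_gb(downloads, dataset_gb, mirrors):
    """Load per mirror in GB if traffic were split evenly (an assumption)."""
    return downloads * dataset_gb / mirrors


if __name__ == "__main__":
    print(total_transfer_tb(766, 10))            # total served in 2014, in TB
    print(round(per_mirror_gb(766, 10, 15), 1))  # GB per mirror under an even split
```

Under the single server model, one machine would have carried the full total on its own.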

Wikipedia data (10 GB): global mirror locations

Speeds vary

Bytes!

At UMass Boston Campus, Boston, MA

At XSEDE14, Atlanta, GA

Different Mirror Access

Mirrors have different speeds

Use Case: Direct Numerical Simulation of Turbulent Flows

5 TB of data in 63 files

Able to use AT infrastructure as management tool.

Direct Numerical Simulation of Turbulent Flows: 250GB/5TB in 2 locations

Entire Presentation

Datasets
- Searchable central index
- Dynamic hosting locations
- Ability to cache on campuses
- Long term persistence
- Aggregate sources

Publications
- Long term persistence
- New publication model: distributed publishing
- Library Smart Nodes

Questions

Why can you expect papers to be accessible?

What is the cost of a research paper?

Figure: publishing models plotted by cost (Reader/Library Pays vs. Author Pays) against who can access (Subscribers vs. Everyone)

● Current Publishing Model: Elsevier, IEEE/ACM Journal
● Distributed publishing model: Academic Torrents Library Smart Node
● :( IEEE/ACM Conference
● Current Open Access Model: PLOS, F1000 Journals

Library Smart Node Overview

Student

Library Database, OpenURL, or AtoZ Server

Elsevier

Springer

IEEE Xplore

Academic Torrents Curated SmartNode

ScholarWorks

PLOS

JLMR

$$$$$$$$$$

Smart Node

Management software for dealing with data

Deals with:
● Bandwidth Limits
● Space Limits
● Content (subscriptions)
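The three concerns a Smart Node deals with can be sketched as a small planning step: keep only entries from subscribed collections, fit them under the space limit, and divide the bandwidth limit among active transfers. This is a hypothetical illustration, not the Smart Node V1 logic. The function names, the smallest-first policy, and the sample entries are all assumptions.

```python
def plan_mirroring(entries, subscriptions, space_limit_bytes):
    """Pick entries from subscribed collections that fit in the space limit.

    Uses a simple smallest-first greedy policy (an assumption).
    Returns the chosen entry names and the bytes they occupy.
    """
    wanted = [e for e in entries if e["collection"] in subscriptions]
    wanted.sort(key=lambda e: e["size_bytes"])
    chosen, used = [], 0
    for entry in wanted:
        if used + entry["size_bytes"] <= space_limit_bytes:
            chosen.append(entry["name"])
            used += entry["size_bytes"]
    return chosen, used


def per_transfer_rate(bandwidth_limit_bps, active_transfers):
    """Split the node's bandwidth limit evenly across active transfers."""
    if active_transfers <= 0:
        return 0
    return bandwidth_limit_bps // active_transfers


if __name__ == "__main__":
    # Hypothetical entries; sizes are in bytes.
    entries = [
        {"name": "a", "collection": "physics", "size_bytes": 40},
        {"name": "b", "collection": "physics", "size_bytes": 70},
        {"name": "c", "collection": "biology", "size_bytes": 10},
        {"name": "d", "collection": "physics", "size_bytes": 30},
    ]
    print(plan_mirroring(entries, {"physics"}, 100))
    print(per_transfer_rate(1000, 4))
```

Smallest-first maximizes the number of entries held, and a node that cares more about rarely mirrored data could sort by current mirror count instead.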

Smart Node V1

CS410 - Software Design
Team of Undergraduates
GPL/C++

V2 will be in Java

https://github.com/AcademicTorrents/AcademicTorrents-SmartNodeV1

Open Journal System Integration
Simon Fraser University Library

Academic Torrents

Is this my dissertation topic? No.
-> Object detection in remotely sensed imagery using machine learning
+ Ad-Hoc pervasive mobile networks
+ Semi-structured information extraction
+ CS and Cyber Security Education

blucat Throw Platform Feature Selection

Building Detection

Crater Detection