Academic Torrents - XSEDE

Academic Torrents
Scalable Distribution for Science

Joseph Paul Cohen (NSF Graduate Fellow) and Henry Z Lo
UMass Boston Computer Science Ph.D Candidates



Entire Presentation

Datasets
- Searchable central index
- Dynamic hosting locations
- Ability to cache on campuses
- Long term persistence
- Aggregate sources

Publications
- Long term persistence
- New publication model: distributed publishing
- Library Smart Nodes


NSF Data Sharing Policy
“Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants. Grantees are expected to encourage and facilitate such sharing.”
See Award & Administration Guide (AAG) Chapter VI.D.4.

NIH Data Sharing Policies
“Expects investigators seeking more than $500K in direct support in any given year to submit a data sharing plan with their application or to indicate why data sharing is not possible.”
“Requires data for all NIDA-funded human genetics studies to be available for sharing, independent of direct costs, membership in the NIDA Genetics Consortium, or the type of genetics data generated.”
http://www.nlm.nih.gov/NIHbmic/nih_data_sharing_policies.html

We need to share!


Stick figures taken from xkcd


Sharing is Hard

Considerations:
● Maintenance - how much work?
● Bandwidth - how scalable?
● Speed - how fast are downloads?
● Robustness - susceptible to failure?
● Cost - how much will it be?


Single Server Model

One machine hosts a file from one location

● Benefits
  ○ Simple (relatively)

● Pains
  ○ Single point of failure (hard drive/network/power outage)
  ○ Limited bandwidth (one machine serving the world)


Apache Mirroring

Multiple machines host copies of a file
A central point sends the file to each mirror node (via scp, rsync)
A central index publishes hash of file to verify correctness

● Benefits
  ○ Solves the single point of failure
  ○ Might be faster if you download from a closer node

● Pains
  ○ Each mirror must have high bandwidth
  ○ Verification of each file is responsibility of the users
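
In the mirroring model, users must check each downloaded file against the hash published by the central index. The following is a minimal sketch of that check, assuming the index publishes a SHA-256 hex digest (the slides do not name the hash algorithm). The function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, published_hex):
    """Compare a file fetched from a mirror against the hash on the central index.

    Assumption: the index publishes a SHA-256 hex string; swap the algorithm
    if the index uses a different one.
    """
    return sha256_of_file(path) == published_hex.strip().lower()
```

A user would call `matches_published_hash` once per downloaded file and discard any copy for which it returns `False`.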


Academic Torrents

Maintains list of data locations dynamically (via API)
Supports HTTP, FTP, and BitTorrent mirrors

● Benefits
  ○ Long term preservation of data
  ○ Automatic verification of data to ensure consistency
  ○ Can extend existing data dissemination systems
  ○ Download from multiple mirrors at once (on campus CDN!)

● Pains
  ○ Clients are not designed for research (until now)
  ○ Network firewalls (HTTP and FTP not blocked)
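
Automatic verification and multi-source download both rest on how BitTorrent (v1) handles data: content is split into fixed-size pieces, and the torrent stores a SHA-1 digest for each piece. A piece can then come from any mirror and still be checked independently. The sketch below illustrates the idea only; the piece length and function names are assumptions, not values taken from the slides.

```python
import hashlib

PIECE_LENGTH = 256 * 1024  # illustrative; each torrent declares its own piece length


def piece_hashes(data, piece_length=PIECE_LENGTH):
    """SHA-1 digest of each fixed-size piece (the last piece may be shorter)."""
    return [
        hashlib.sha1(data[offset:offset + piece_length]).digest()
        for offset in range(0, len(data), piece_length)
    ]


def verify_piece(index, piece, expected_hashes):
    """Accept a piece from any mirror only if its digest matches the torrent's."""
    return hashlib.sha1(piece).digest() == expected_hashes[index]
```

Because each piece is verified on its own, a client can request different pieces from different mirrors and re-request only the pieces that fail the check.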



Method Comparison

Method                   | Maintenance | Bandwidth limits | Speed    | Robustness | Cost
-------------------------|-------------|------------------|----------|------------|---------
Single Server            | Moderate    | Somewhat         | Slow     | No         | Moderate
Multiple Servers         | High        | Somewhat         | Moderate | Somewhat   | Moderate
Mailing Disks            | High        | No               | High     | No         | Low
Free Repositories        | Low         | Yes              | Moderate | Somewhat   | Free
Proprietary Repositories | Low         | Moderate         | Moderate | Somewhat   | High
Academic Torrents        | Moderate    | No               | High     | Yes        | Low


Academic Torrents

1. Create torrent from data
2. Upload torrent to Academic Torrents
3. Peers get torrent from AT
4. Share data with peers

[Screenshot: Transmission torrent client]
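
The "create torrent from data" step is normally done by a torrent client. As a rough sketch of what that step produces, the code below bencodes a single-file BitTorrent v1 metainfo dictionary and derives its info hash. The tracker URL and function names are placeholders for illustration, not Academic Torrents endpoints.

```python
import hashlib


def bencode(value):
    """Encode ints, bytes/str, lists, and dicts using BitTorrent's bencoding."""
    if isinstance(value, bool):
        raise TypeError("bencoding has no boolean type")
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = sorted(
            ((k.encode("utf-8") if isinstance(k, str) else k, v) for k, v in value.items()),
            key=lambda pair: pair[0],
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def single_file_metainfo(name, data, announce_url, piece_length=256 * 1024):
    """Return (torrent file bytes, hex info hash) for one in-memory file."""
    pieces = b"".join(
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    )
    info = {
        "name": name,
        "length": len(data),
        "piece length": piece_length,
        "pieces": pieces,  # concatenated 20-byte SHA-1 digests, one per piece
    }
    info_hash = hashlib.sha1(bencode(info)).hexdigest()
    return bencode({"announce": announce_url, "info": info}), info_hash


# Placeholder tracker URL, for illustration only.
torrent_bytes, info_hash = single_file_metainfo(
    "dataset.bin", b"example data", "http://tracker.example.org/announce"
)
```

The info hash identifies the content regardless of where it is hosted, which is what lets an index list hosting locations dynamically.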


Academic Torrents Portal


Each entry contains:

BibTeX Metadata (keys -> values)

File listing with hashes (verify authenticity)

Listing of hosting locations (global mirror locations)
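
To illustrate the keys -> values view of an entry's BibTeX metadata, here is a small sketch that extracts flat fields from one BibTeX record. It handles only simple `key = {value}` or `key = "value"` fields without nested braces, and the sample record is invented for the example.

```python
import re

# Matches key = {value} or key = "value"; nested braces are not supported.
FIELD_PATTERN = re.compile(r'(\w+)\s*=\s*(?:\{([^{}]*)\}|"([^"]*)")')


def bibtex_fields(entry_text):
    """Extract flat key -> value pairs from a single BibTeX entry."""
    fields = {}
    for key, braced, quoted in FIELD_PATTERN.findall(entry_text):
        fields[key.lower()] = (braced or quoted).strip()
    return fields


# Invented sample record, for illustration only.
SAMPLE_ENTRY = """
@article{example2014,
  title = {Example Dataset},
  author = {Doe, Jane},
  year = "2014"
}
"""

metadata = bibtex_fields(SAMPLE_ENTRY)
```

A full BibTeX parser would be needed for real entries with nested braces or string concatenation.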


Curated collections


Each collection is:

Curated by a user (allows trust)

An updatable folder of entries (modifiable)

Accessible via APIs (RSS, CSV, RESTful)
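
As a sketch of how a script might consume a collection exported as CSV, the code below parses a listing and totals its size. The column names (`name`, `infohash`, `size_bytes`) and the sample rows are hypothetical; the slides do not show the actual export format, so check the real header before relying on this.

```python
import csv
import io


def parse_collection_csv(csv_text):
    """Parse a collection listing into dictionaries (hypothetical column names)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "name": row["name"],
            "infohash": row["infohash"],
            "size_bytes": int(row["size_bytes"]),
        }
        for row in reader
    ]


def total_size_bytes(entries):
    """Sum the sizes of all entries in a collection."""
    return sum(entry["size_bytes"] for entry in entries)


# Invented sample data with placeholder info hashes.
SAMPLE_CSV = (
    "name,infohash,size_bytes\n"
    f"Example Dataset A,{'0' * 40},10000000000\n"
    f"Example Dataset B,{'1' * 40},250000000\n"
)

entries = parse_collection_csv(SAMPLE_CSV)
```

Because a collection is an updatable folder, a script like this would be rerun periodically to pick up entries the curator has added.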


Command Line Interface (atdown)
https://github.com/AcademicTorrents/AcademicTorrents-Downloader


Use Case: Wikipedia XML Offline Version

10GB of Data
Community Hosted

766 Downloads in 2014 (7.66TB!)
~15 Persistent mirror locations
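
The 7.66TB figure follows from 766 downloads of a 10GB dataset, using decimal units (1 TB = 1000 GB). The sketch below reproduces that arithmetic and, as a purely hypothetical idealization, shows the per-mirror load if traffic were split evenly across 15 mirrors. Real swarms do not split evenly, and the function names are illustrative.

```python
def total_transfer_tb(downloads, dataset_gb):
    """Total data served, in decimal terabytes (1 TB = 1000 GB)."""
    return downloads * dataset_gb / 1000


def even_share_gb(downloads, dataset_gb, mirrors):
    """Per-mirror load in GB if traffic were split evenly (an idealization)."""
    return downloads * dataset_gb / mirrors


print(total_transfer_tb(766, 10))            # 7.66
print(round(even_share_gb(766, 10, 15), 1))  # 510.7
```

Under the single server model, the same 7.66TB would all come from one machine.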


[Figure: Wikipedia data (10GB), global mirror locations]


Speeds vary (Bytes!)

[Figure: mirror download speeds at UMass Boston Campus, Boston, MA, and at XSEDE14, Atlanta, GA]

Different mirror access: mirrors have different speeds


Use Case: Direct Numerical Simulation of Turbulent Flows

5TB of Data in 63 files

Able to use AT infrastructure as a management tool.


[Figure: Direct Numerical Simulation of Turbulent Flows, 250GB/5TB in 2 Locations]


Entire Presentation

Datasets
- Searchable central index
- Dynamic hosting locations
- Ability to cache on campuses
- Long term persistence
- Aggregate sources

Publications
- Long term persistence
- New publication model: distributed publishing
- Library Smart Nodes


Questions

Why can you expect papers to be accessible?

What is the cost of a research paper?


[Figure: publishing models arranged by cost (Reader/Library Pays vs. Author Pays) and by who can access (Subscribers vs. Everyone)]
- Current Publishing Model: Elsevier, IEEE/ACM Journal
- Distributed publishing model: Academic Torrents Library Smart Node
- :( IEEE/ACM Conference
- Current Open Access Model: PLOS, F1000 Journals


Library Smart Node Overview

[Diagram elements]
- Student
- Library Database, OpenURL, or AtoZ Server
- Elsevier, Springer, IEEE Xplore
- Academic Torrents Curated SmartNode
- ScholarWorks, PLOS, JMLR
- $$$$$$$$$$


Smart Node

Management software for dealing with data

Deals with:
- Bandwidth Limits
- Space Limits
- Content (subscriptions)
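
To make the three limits concrete, here is a hypothetical sketch of how a node could decide what to host. It is not the Smart Node's actual algorithm, which the slides do not describe. The policy shown (host the least-mirrored subscribed entries first, within a space limit, then split an upload cap evenly) and all names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    size_gb: float
    mirrors: int  # how many other locations already host this entry


def plan_hosting(entries, space_limit_gb):
    """Greedily pick subscribed entries, least-mirrored first, within the space limit."""
    chosen, used = [], 0.0
    for entry in sorted(entries, key=lambda e: (e.mirrors, e.size_gb)):
        if used + entry.size_gb <= space_limit_gb:
            chosen.append(entry.name)
            used += entry.size_gb
    return chosen


def per_torrent_upload_kbps(bandwidth_limit_kbps, hosted_count):
    """Split an upload cap evenly across hosted torrents (one simple policy)."""
    return bandwidth_limit_kbps / hosted_count if hosted_count else 0.0


# Invented subscription contents, for illustration only.
subscribed = [Entry("A", 100, 1), Entry("B", 10, 5), Entry("C", 50, 2)]
plan = plan_hosting(subscribed, space_limit_gb=120)
```

Favoring entries with few existing mirrors is one way to serve the long term persistence goal; a campus cache might instead favor the entries its own users request most.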


Smart Node V1

CS410 - Software Design
Team of Undergraduates
GPL/C++

V2 will be in Java

https://github.com/AcademicTorrents/AcademicTorrents-SmartNodeV1


Open Journal Systems Integration
Simon Fraser University Library


Academic Torrents


Is this my dissertation topic? No.
-> Object detection in remote sensed imagery using machine learning
+ Ad-Hoc pervasive mobile networks
+ Semi-structured information extraction
+ CS and Cyber Security Education

blucat Throw Platform Feature Selection

Building Detection

Crater Detection