Transforming Science Through Data-driven Discovery
How CyVerse.org enables scalable data discoverability and re-use
Matt Vaughn, co-PI (@mattdotvaughn)
History and Context
~$100M direct NSF investment over 10 years
Currently working to sustain its successes beyond 2018
iPlant (2008): Empowering a New Plant Biology
iPlant (2013): Cyberinfrastructure for Life Science
CyVerse (2016): Transforming Science Through Data-Driven Discovery
Plant Science Cyberinfrastructure Collaborative
A "new type of organization" that is "community-driven" uniting "biologists, computer and information scientists and experts from other disciplines working in an integrated team" to provide "computational and cyberinfrastructure capabilities and expertise that are capable of handling large and heterogeneous plant biology data sets"
What is Cyberinfrastructure?
•Data storage and retrieval
•Software (system & user)
•Computing capability
•Human expertise and support
Organized into systems that solve problems of size and scope that would not otherwise be solvable
Platform Overview
[Figure labels: Ready-to-use Platforms; Foundational Capabilities; Established CI Components; Extensible Services. Axis: Ease of Use]
Adoption and Outputs
• Over 40K registered users (15-20% active)
• Millions of computing hours on XSEDE, campus HPC, CyVerse systems, and commercial cloud
• 2+ PB user data stored in CyVerse Data Store
• Hundreds of publications, courses, and discoveries
• Spin-off technologies
  • Jetstream: NSF production cloud
  • Syndicate: Software-defined storage system
  • Agave API: Multitenant science PaaS
• Communities such as iAnimal, iMicrobe, iPlant.UK
• 3rd party software resources using it as a platform
Finding and re-using Data (1)
[Diagram labels: Federation; Metadata; iRODS (2+ PB); ElasticSearch; Catalog Servers; Tucson Resources; Austin Resources; CSHL Resource; iPlant.UK Resources; Data Store APIs; Agave API; AWS S3; Public FTP; SFTP]
At the heart of all CyVerse applications is a data-centric architecture, designed to be scaled and extended
Finding and re-using Data (2)
• Browser-based file manager
• Upload from local or URI
• Download
• Add/Edit comments and tags
• AVU metadata + structured templates
• Share with collaborators or any CyVerse user
The CyVerse Discovery Environment Data Window
Finding and re-using Data (3)
• Browser-based file manager
• Upload from local or URI
• Download
• Add/Edit comments and tags
• AVU metadata + structured templates
• Share with collaborators or any CyVerse user
Google Drive, for big data
The CyVerse Discovery Environment Data Window
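The AVU (attribute-value-unit) metadata and structured templates listed above can be pictured with a small sketch. This is not the Data Store's API: the `AVU` class, the example paths, the in-memory `catalog`, and both helper functions are hypothetical and only illustrate how triples attached to files support search and template checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AVU:
    """One attribute-value-unit triple attached to a file or folder."""
    attribute: str
    value: str
    unit: str = ""  # the unit is optional


# Hypothetical in-memory catalog: logical path -> list of AVU triples.
catalog = {
    "/example/home/alice/reads_1.fastq": [
        AVU("organism", "Zea mays"),
        AVU("read_length", "150", "bp"),
    ],
    "/example/home/alice/notes.txt": [
        AVU("project", "maize-gwas"),
    ],
}


def find_by_avu(catalog, attribute, value=None):
    """Return sorted paths that carry the attribute (and value, if given)."""
    return sorted(
        path
        for path, avus in catalog.items()
        if any(
            a.attribute == attribute and (value is None or a.value == value)
            for a in avus
        )
    )


def missing_from_template(avus, required_attributes):
    """List the template attributes that an item has not filled in yet."""
    present = {a.attribute for a in avus}
    return [name for name in required_attributes if name not in present]


print(find_by_avu(catalog, "organism", "Zea mays"))
print(missing_from_template(catalog["/example/home/alice/notes.txt"],
                            ["project", "organism"]))
```

A structured template, in this picture, is simply a list of attribute names that an item is expected to carry before it is shared.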
Finding and re-using Software (1)
• Extendable App Catalog
• Provide Dockerfile + GUI specification
• Develop VM image
• Deploy application web service
Info view for a CyVerse Discovery Environment application
Finding and re-using Software (2)
• Extendable App Catalog
• Provide Dockerfile + GUI specification
• Develop VM image
• Deploy application web service
• Require links to documentation, example files and usage, appropriate software and domain ontologies
Public or shared Atmosphere VM images tagged with “GWAS”
Finding and re-using Software (3)
• Extendable App Catalog
• Provide Dockerfile + GUI specification
• Develop VM image
• Deploy application web service
• Require links to documentation, example files and usage, appropriate software and domain ontologies
• Give credit to app author and software author
Application and Data catalogs available to 3rd parties
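The catalog requirements above (links to documentation, example files and usage, ontology terms, and credit for both the app author and the software author) amount to a completeness check on each submission. A minimal sketch follows, assuming a submission is a plain dictionary. The field names are hypothetical; the slides do not define a submission schema.

```python
# Hypothetical field names, chosen only to mirror the requirements on the slide.
REQUIRED_FIELDS = (
    "documentation_url",
    "example_files_url",
    "usage_url",
    "ontology_terms",
    "app_author",
    "software_author",
)


def missing_submission_fields(app):
    """Return the required fields that are absent or empty in a submission."""
    return [name for name in REQUIRED_FIELDS if not app.get(name)]


# Illustrative submission with an empty usage link and no software author.
example_app = {
    "name": "example-aligner",
    "documentation_url": "https://example.org/docs",
    "example_files_url": "https://example.org/examples",
    "usage_url": "",
    "ontology_terms": ["hypothetical:0001"],
    "app_author": "A. Integrator",
}

print(missing_submission_fields(example_app))
```

Keeping the app author and the software author as separate fields reflects the slide's point that the person who integrates a tool and the person who wrote it both deserve credit.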
CyVerse Data Commons (1)
Data Commons Landing Page (1.0)
Persistent URL for each data set. No authentication required. Fast browsing and retrieval.
NCBI SRA Submission Workflow in DE
CyVerse is the analysis home for a lot of genomics data. To get it off our systems, we need to help get it into the SRA!
CyVerse Data Commons (2)
Actively facilitating publication and discovery of data stored with CyVerse
Candidate Research Data @ Data Store → Identify, organize, rename files and folders → Prepare a DataCite metadata document → Submit to CyVerse Curation Team → Data snapshot made public. DOI issued.
Candidate VM image → Document contents & capabilities → Prepare a DataCite metadata document → Submit to CyVerse Curation Team → Public image released. DOI issued.
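Both workflows pass through "Prepare a DataCite metadata document". As a rough sketch of what that step involves, the code below assembles a record and checks it against the six properties that the DataCite Metadata Schema (version 4) marks as mandatory: identifier, creators, titles, publisher, publication year, and resource type. The function names, the simplified dictionary layout, and the example values (including the publisher string) are assumptions for illustration and are not CyVerse's actual submission format.

```python
import json

# Mandatory properties in the DataCite Metadata Schema (v4), in simplified form.
MANDATORY = (
    "identifier",
    "creators",
    "titles",
    "publisher",
    "publicationYear",
    "resourceType",
)


def build_record(title, creators, publisher, year, resource_type, doi=None):
    """Assemble a simplified DataCite-style record (hypothetical layout)."""
    return {
        "identifier": doi,  # stays empty until a DOI has been issued
        "creators": [{"name": name} for name in creators],
        "titles": [{"title": title}] if title else [],
        "publisher": publisher,
        "publicationYear": year,
        "resourceType": resource_type,
    }


def missing_properties(record, doi_issued=False):
    """List mandatory properties that are empty; skip the DOI before issuance."""
    required = MANDATORY if doi_issued else MANDATORY[1:]
    return [name for name in required if not record.get(name)]


# Example values are invented for illustration.
draft = build_record(
    title="Example maize GWAS read set",
    creators=["Researcher, Example"],
    publisher="CyVerse Data Commons",
    year=2016,
    resource_type="Dataset",
)

print(json.dumps(draft, indent=2))
print(missing_properties(draft))
print(missing_properties(draft, doi_issued=True))
```

The `doi_issued` flag mirrors the order of the workflow: a draft can be complete at submission time, and the identifier is filled in only once the curation team makes the snapshot public.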
Summary
• CyVerse is a model for providing cyberinfrastructure to diverse bioscience user communities
• State of the art has shifted at least twice since we started work
• Had to overcome initial reticence to “give data” to CyVerse
• Still hard to get developers and providers to maintain their contributions after contributing
• Cost recovery model: we have started using the term ‘subsidized’ rather than ‘free’, but it might be too late.
• Natural synergy between our organization and ODEN objectives
CyVerse Executive Team: Parker Antin, Nirav Merchant, Eric Lyons, Matt Vaughn (@mattdotvaughn), Doreen Ware, Dave Micklos
CyVerse is supported by the National Science Foundation under Grant Nos. DBI-0735191 and DBI-1265383.