Research Data Management



Science for the Future: Strategies for distributing and sharing data

Research Data Management

www.globusonline.org
Rachana Ananthakrishnan
University of Chicago & Argonne National Lab

We started with technology proven in many large-scale grids

  • GridFTP
  • GRAM
  • MyProxy
  • GSI-OpenSSH

Big science has achieved big successes with advanced community services

Community services built on Globus Toolkit software

  • LIGO: 1 PB data in last science run, distributed worldwide
  • ESG: 1.2 PB climate data delivered to 23,000 users; 600+ pubs
  • OSG: 1.4M CPU-hours/day, >90 sites, >3000 users, >260 pubs in 2010

  • Substantial teams
  • Sustained effort
  • Leverage common technology
  • Application-specific solutions
  • Production focus

But small and medium science is suffering

  • Data deluge
  • Ad-hoc solutions
  • Inadequate software, hardware & IT staff

DES as an example of medium science

Medium science: Dark Energy Survey

  • Every night, they receive 100,000 files in Illinois
  • They transmit files to Texas for analysis, then move results back to Illinois and make them available to users
  • Process must be reliable, routine, and efficient
  • The cyberinfrastructure team is not large!

Blanco 4m on Cerro Tololo
Image credit: Roger Smith/NOAO/AURA/NSF

Not just small labs: medium science too. E.g., Dark Energy Survey.

Time-consuming Tasks in Research

  • Run experiments
  • Collect data
  • Manage data
  • Move data
  • Acquire computers
  • Analyze data
  • Run simulations
  • Compare experiment with simulation
  • Search the literature
  • Communicate with colleagues
  • Publish papers
  • Find, configure, install relevant software
  • Find, access, analyze relevant data
  • Order supplies
  • Write proposals
  • Write reports

Excerpts from ESNet reports

  • Transfers often take longer than expected based on available network capacities
  • Lack of an easy to use interface to some of the high-performance tools
  • Tools [are] too difficult to install and use
  • Time and interruption to other work required to supervise large data transfers
  • Need data transfer tools that are easy to use, well-supported, and permitted by site and facility cybersecurity organizations

We envisage a world where data flows rapidly, reliably, and securely among: experimental facilities, online and archival storage, computing facilities, and remote institutions

We envisage a world where data is easily integrated into dynamic datasets that also include metadata and programs necessary to understand and regenerate it

We envisage a world where data is readily discoverable and accessible to collaborators, regardless of their and the data's location

We believe a new approach is needed to deliver data management infrastructure

  • Frictionless
  • Affordable
  • Sustainable

Like [...] but for science!

Focusing on frictionless, we've started to do this with the Globus Online service

Transfer and sharing of large data sets with dropbox-like characteristics directly from your own storage systems

Reliable, secure, high-performance file transfer

  • Fire-and-forget transfers
  • Automatic fault recovery
  • Auto tuning
  • Seamless security integration

[Diagram: Data Source, Data Destination]

1. User initiates transfer request
2. Globus Online moves and syncs files
3. Globus Online notifies user
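The three steps above (request, move with automatic fault recovery, notify) can be sketched in a few lines of Python. This is only an illustration of the fire-and-forget pattern, not the Globus Online implementation or API: `run_transfer`, `TransferReport`, and `make_flaky_copier` are hypothetical names, and the "network" is simulated by a copier that fails a fixed number of times.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class TransferReport:
    """Outcome of one transfer request (hypothetical structure)."""
    succeeded: list = field(default_factory=list)
    failed: list = field(default_factory=list)
    attempts: dict = field(default_factory=dict)


def run_transfer(files: Iterable[str],
                 copy_file: Callable[[str], None],
                 notify: Callable[[TransferReport], None],
                 max_attempts: int = 3) -> TransferReport:
    """Copy each file, retrying on failure, then notify the user once."""
    report = TransferReport()
    for name in files:
        for attempt in range(1, max_attempts + 1):
            report.attempts[name] = attempt
            try:
                copy_file(name)          # step 2: move the file
            except OSError:
                continue                 # fault recovery: try this file again
            report.succeeded.append(name)
            break
        else:
            report.failed.append(name)   # every attempt failed
    notify(report)                       # step 3: report the outcome
    return report


def make_flaky_copier(failures_before_success: dict) -> Callable[[str], None]:
    """Build a stand-in copier that fails a fixed number of times per file."""
    remaining = dict(failures_before_success)

    def copy_file(name: str) -> None:
        if remaining.get(name, 0) > 0:
            remaining[name] -= 1
            raise OSError(f"simulated network fault while copying {name}")

    return copy_file


if __name__ == "__main__":
    copier = make_flaky_copier({"b.dat": 2, "c.dat": 5})
    run_transfer(["a.dat", "b.dat", "c.dat"], copier, notify=print)
```

The caller hands over the file list and a notification callback and does not supervise the copy; retries happen per file, so one bad file does not restart the whole request.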

Simple, secure sharing off existing storage systems

[Diagram: Data Source]

1. User A selects file(s) to share, selects user or group, and sets permissions
2. Globus Online tracks shared files; no need to move files to cloud storage!
3. User B logs in to Globus Online and accesses shared file
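The sharing steps above amount to recording a grant (path, user or group, permissions) and checking later requests against it, while the files stay where they are. A minimal sketch of that idea follows; `ShareRegistry`, the `'r'`/`'rw'` permission strings, and the example paths and user names are all assumptions made for illustration, not the Globus Online data model.

```python
from dataclasses import dataclass, field


@dataclass
class ShareRegistry:
    """Tracks who may access which path; the files themselves never move."""
    groups: dict = field(default_factory=dict)   # group name -> set of users
    grants: list = field(default_factory=list)   # (path, principal, permissions)

    def share(self, path: str, principal: str, permissions: str) -> None:
        """Step 1: the owner shares a path with a user or group ('r' or 'rw')."""
        if permissions not in ("r", "rw"):
            raise ValueError("permissions must be 'r' or 'rw'")
        self.grants.append((path.rstrip("/"), principal, permissions))

    def can_access(self, user: str, path: str, mode: str = "r") -> bool:
        """Step 3: check a request against the recorded grants."""
        for shared_path, principal, permissions in self.grants:
            is_member = (principal == user
                         or user in self.groups.get(principal, set()))
            covers = path == shared_path or path.startswith(shared_path + "/")
            if is_member and covers and mode in permissions:
                return True
        return False


if __name__ == "__main__":
    registry = ShareRegistry(groups={"tomo-team": {"userB", "userC"}})
    registry.share("/data/run42/", "tomo-team", "r")
    print(registry.can_access("userB", "/data/run42/scan.h5"))
```

Only the grant list is stored by the service in this sketch; step 2 ("tracks shared files") is the `grants` list itself, which is why no copy to cloud storage is needed.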

  • Easily share large data with any user or group
  • No cloud storage required

Globus Online is SaaS

  • Web, command line, and REST interfaces
  • Reduced IT operational costs
  • New features automatically available
  • Consolidated support & troubleshooting
  • Easy to add your laptop, server, cluster, supercomputer, etc. with Globus Connect

Globus Connect Multiuser

  • Create endpoint in minutes; no complex GridFTP install
  • Enable all users with local accounts to transfer files
  • Native packages: RPMs and DEBs
  • Also available as part of the Globus Toolkit

[Diagram: Local Storage System (RCC cluster, campus server, …); Globus Connect Multiuser (MyProxy Online CA, GridFTP Server); Local system users]

Early adoption is encouraging

~24PB and 1B files moved

10x (or better) performance vs. scp

99.9% availability

B. Winjum (UCLA) moves 900K-file plasma physics datasets between UCLA and NERSC

Dan Kozak (Caltech) replicates 1 PB LIGO astronomy data for resilience

Erin Miller (PNNL) collects data at Advanced Photon Source, renders at PNNL, and views at ANL

This image shows a 3D rendering of a Shewanella biofilm grown on a flat plastic substrate in a Constant Depth bioFilm Fermenter (CDFF). The image was generated using x-ray microtomography at the Advanced Photon Source, Argonne National Laboratory.

Credit: Kerstin Kleese-van Dam

Globus Online as a platform

  • Globus Nexus (Identity, Group, Profile)
  • Sharing Service
  • Transfer Service
  • Dataset Services
  • Globus Toolkit
  • Globus Online APIs
  • Globus Connect

Early platform adopters

More capabilities underway

  • Globus Toolkit
  • Sharing Service
  • Transfer Service
  • Dataset Services
  • Globus Nexus (Identity, Group, Profile)
  • Globus Online APIs
  • Globus Connect

Introducing the dataset

  • Group data based on use, not location
      • Logical grouping to organize, reorganize, search, and describe usage
  • Tag with characteristics that reflect content
      • Capture as much existing information as we can
  • ... or to reflect current status in investigation
      • Stage of processing, provenance, validation, ...
  • Share data sets for collaboration
      • Control access to data and metadata
  • Operate on datasets as units
      • Copy, export, analyze, tag, archive, ...

Expanding Globus Online services

  • Ingest and publication
      • Imagine a DropBox that not only replicates, but also extracts metadata, catalogs, converts
  • Cataloging
      • Virtual views of data based on user-defined and/or automatically extracted metadata
  • Integration with computation
      • Associate computational procedures, orchestrate application, catalog results, record provenance

Tomography

[Catalog card: mydata42]
  owner: Francesco
  type: 3dtomo
  format: HDF5
  beamline: 2BM

[Diagram labels: Organization; Orchestration; Define dataset; Infer type; Extract metadata; Populate catalog(s); Locate datasets; Access files; Analyze; Catalog derived products; Transfer/schedule; Record provenance; Annotate, share; Browse, search]

Image sources:
http://www.blyberg.net/card-generator/
http://www.sciencemag.org/content/332/6025/88/F1.large.jpg
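The dataset idea described above (a logical grouping of files, tagged with metadata such as owner, type, format, and beamline, and searchable through a catalog) can be sketched as a small in-memory structure. This is an assumption-laden illustration only: `Dataset`, `Catalog`, and the example file paths are hypothetical and do not reflect the Globus Online Dataset Services interface; the tag values for `mydata42` are the ones shown on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A logical group of files plus descriptive tags."""
    name: str
    files: list = field(default_factory=list)   # member files, wherever they live
    tags: dict = field(default_factory=dict)    # metadata, e.g. owner, format


class Catalog:
    """In-memory index of datasets, searchable by tag."""

    def __init__(self) -> None:
        self._datasets = {}

    def add(self, dataset: Dataset) -> None:
        self._datasets[dataset.name] = dataset

    def tag(self, name: str, **tags: str) -> None:
        """Attach or update tags, e.g. the current stage of processing."""
        self._datasets[name].tags.update(tags)

    def search(self, **criteria: str) -> list:
        """Return sorted names of datasets whose tags match every criterion."""
        return sorted(
            ds.name for ds in self._datasets.values()
            if all(ds.tags.get(key) == value for key, value in criteria.items())
        )


if __name__ == "__main__":
    catalog = Catalog()
    catalog.add(Dataset(
        "mydata42",
        files=["/hypothetical/path/scan_0001.h5"],   # illustrative path
        tags={"owner": "Francesco", "type": "3dtomo",
              "format": "HDF5", "beamline": "2BM"},
    ))
    catalog.tag("mydata42", stage="raw")             # status in the investigation
    print(catalog.search(format="HDF5", beamline="2BM"))
```

Because searches run over tags rather than paths, the same files can be regrouped or relabeled without being moved, which is the "based on use, not location" point.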

We believe a new approach is needed to deliver data management infrastructure

  • Frictionless
  • Affordable
  • Sustainable

We've got a handle on frictionless

  • Web interface, REST API, command line
  • InCommon, OAuth, OpenID, X.509, Credential management
  • Group definition and management
  • Transfer management and optimization
  • Reliability via transfer retries
  • One-click Globus Connect install
  • 5-minute Globus Connect Multiuser install

Affordable and sustainable?

Common expectation is either:

  • High-priced commercial software (with generally higher levels of quality)

Or:

  • Free, open source software (with generally lower levels of quality)

We aim to offer the best of all worlds!

We are a non-profit service provider to the non-profit research community

Our challenge: Sustainability

Globus Online Provider Plans

  • Support ongoing operations
  • Offer value-added capabilities
  • Engage more closely with users

Provider Plans offer

  • Endpoint management console
  • Usage reporting
  • MSS optimizations
  • Globus Plus subscriptions
  • Branded web sites
  • Alternate identity provider

Starting at $10k/year

Researchers may use Globus file transfer for free

  • File transfer and synchronization to/from servers
  • Personal endpoints with Globus Connect
  • Access to shared endpoints created by others

Globus Plus: $7/month (or $70/year)

  • Create and manage shared endpoints
  • Transfer and sharing between Globus Connect Personal endpoints

We hope you will join us

  • Provider Plan not required to get started
  • Use Globus Connect Multiuser to easily connect your resources with Globus Online
  • Go to: globusonline.org/gcmu

[Diagram: Registry, Staging Store, Ingest Store, Analysis Store, Community Store, Archive, Mirror]

Our research is supported by:

U.S. DEPARTMENT OF ENERGY

Questions

Contact: [email protected]
globusonline.org/provider-plans
Researchers: globusonline.org/plus
www.globusonline.org