BD2K & the Commons @ NIH
Vivien Bonazzi, Ph.D.
Senior Advisor for Data Science Technologies, Office of Data Science (ADDS), National Institutes of Health
A Digital Story
NIH Data
US Government Memo - Increasing Access to Results of Federally Funded Scientific Research
In Feb 2013 the US OSTP issued a memo calling for all US Federal Agencies to make digital assets from federally funded research available.
OSTP: Office of Science and Technology Policy at the White House
Public Access to Data Memo: http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf
Each agency’s public access plan shall:
Maximize access, by the general public and without charge, to digitally formatted scientific data created with Federal funds while:
i) protecting confidentiality and personal privacy
ii) recognizing proprietary interests, business confidential information, and intellectual property rights and avoiding significant negative impact on intellectual property rights, innovation, and U.S. competitiveness, and
iii) preserving the balance between the relative value of long-term preservation and access and the associated cost and administrative burden.
NIH Response
In response to the incredible growth of large biomedical (digital) datasets, the Director of NIH established a special Data and Informatics Working Group (DIWG)
http://acd.od.nih.gov/diwg.htm
Establish new data science research and training programs
Fulfilling the recommendation of the ACD WG report
Big Data to Knowledge (BD2K) - 2013
http://datascience.nih.gov/bd2k
Establish a new position: NIH Associate Director of Data Science (ADDS)
Phil Bourne – 2014
BD2K – Big Data to Knowledge
• Expanding training programs in data science
• Finding and sharing data & software through indexes
• Targeted software tools and methods
  • Data wrangling
  • Privacy and security of data
  • Data repurposing
  • Applications of metadata
• Advancing Big Data methods, tools and applications (BD2K Centers of Excellence)
https://datascience.nih.gov/bd2k/funded-programs
To enable biomedical research as a digital enterprise through which new discoveries are made and knowledge generated by maximizing community engagement and productivity.
NIH ADDS Mission Statement
To use data science to foster an Open Digital Ecosystem that will accelerate efficient, cost-effective biomedical research to enhance health, lengthen life, and reduce illness and disability
Enabling digital Ecosystems via a Commons & BD2K
Leveraging BD2K efforts
Harnessing e-infrastructures - Public-private partnerships & Interagency collaborations
Collaborating with external communities
Commons: Achieving a Balance
Biomedical Use Cases + Data Science + e-infrastructures
Supporting open biomedical science using robust, scalable and flexible digital technologies
In collaboration with global communities
What are the PRINCIPLES of a Commons?
• Supports a digital biomedical ecosystem
• Treats products of research (data, software, methods, papers etc.) as digital objects
• Digital objects exist in a shared virtual space
• Find, Deposit, Manage, Share and Reuse data, software, metadata and workflows
• Digital objects need to conform to FAIR principles:
  • Findable
  • Accessible (and usable)
  • Interoperable
  • Reusable
Developing a Commons Framework
• Exploits new scalable computing technologies: Cloud
• Making digital objects FAIR: Indexable/Findable, Accessible & Usable, Interoperable, Reproducible
• Simplifies access, sharing and interoperability of digital objects such as data, software, metadata and workflows
• Provides physical or logical access to digital objects
• Provides understanding and accounting of usage patterns
• Is potentially more cost effective given digital growth
• Gives currency to digital objects and the people who develop and support them
Commons Framework
• Compute Platform: Cloud or SC Facilities
• Services: APIs, Containers, Indexing
• Software: Services & Tools (scientific analysis tools/workflows)
• Data: “Reference” Data Sets, User defined data
• Digital Object Compliance
• App store/User Interface
https://datascience.nih.gov/commons
Commons Framework
[Figure: the Commons Framework layers (Compute Platform, Services, Software, Data, Digital Object Compliance, App store/User Interface), annotated with IaaS, PaaS and SaaS]
Commons: Digital Object Compliance
Attributes of digital research objects in the Commons
Initial Phase
• Unique digital object identifiers, resolvable to the original authoritative source
• Machine readable
• A minimal set of searchable metadata
• Physically available in a cloud-based Commons provider
• Clear access rules (especially important for human subjects data)
• An entry (with metadata) in one or more indices
Future Phases
• Standard, community-based unique digital object identifiers
• Conform to community-approved standard metadata and ontologies for enhanced searching
• Digital objects accessible via open standard APIs
• Are physically and logically available to the Commons
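The initial-phase attributes listed above could be checked mechanically. The following is a minimal sketch, not an NIH specification: the `DigitalObject` class, its field names and the `REQUIRED_METADATA` set are all hypothetical, since the slides do not define a concrete schema.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """Hypothetical record for a digital research object (field names are illustrative)."""
    identifier: str = ""                           # unique ID, resolvable to the authoritative source
    machine_readable: bool = False
    metadata: dict = field(default_factory=dict)   # minimal searchable metadata
    cloud_location: str = ""                       # where the object lives in a Commons provider
    access_rules: str = ""                         # e.g. "open" or "controlled"
    indices: list = field(default_factory=list)    # indices that hold an entry for the object


# Assumed minimal metadata set; the slides do not specify one.
REQUIRED_METADATA = ("title", "creator", "description")


def initial_phase_gaps(obj: DigitalObject) -> list:
    """Return the initial-phase attributes that the object is still missing."""
    gaps = []
    if not obj.identifier:
        gaps.append("unique identifier")
    if not obj.machine_readable:
        gaps.append("machine readable")
    missing = [key for key in REQUIRED_METADATA if not obj.metadata.get(key)]
    if missing:
        gaps.append("metadata: " + ", ".join(missing))
    if not obj.cloud_location:
        gaps.append("cloud location")
    if not obj.access_rules:
        gaps.append("access rules")
    if not obj.indices:
        gaps.append("index entry")
    return gaps
```

An empty result would mean the object meets every initial-phase attribute under these assumptions; the future-phase attributes (community identifiers, ontologies, open APIs) are not modeled here.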
Towards Data Commons’
Co-locate data, storage and computing infrastructure with commonly used tools for accessing, analyzing and sharing data, to create an open, interoperable resource for the research community.
NIH Commons PILOTS
Current Commons Pilots
Commons Framework Pilots
• Explore feasibility of the Commons framework
• Provide data objects to populate the Commons
• Facilitate collaboration and interoperability
Cloud Credit Model
• Provide access to cloud (IaaS) and PaaS/SaaS via credits
• Connecting credits to NIH Grants
Reference Data Sets
• Making large and/or high value NIH funded data sets and tools accessible in the cloud
Resource Search & Index
• Developing Data & Software Indexing methods
• Leveraging BD2K efforts (bioCADDIE et al.)
• Collaborating with external groups
Other Commons Activities
HMP Cloud (NIAID/Common Fund)
NCI Cloud Pilots & GDC
NIH affiliated Commons projects
• Testing cloud environments to enable access, sharing, use and reuse of large data sets and accompanying tools
  • The Cancer Genome Atlas (TCGA) - NCI
  • Human Microbiome Project (HMP) - NIAID
• Providing a portal to view representation and analysis of large data sets (Genomic Data Commons – NCI)
?
Other Commons’
Commons Framework Pilots
Exploring feasibility of the Commons framework using the BD2K Centers, MODs, and HMP groups
Facilitating connectivity, interoperability and access to digital objects
Providing digital research objects to populate the Commons
Enable biomedical science to happen more easily and robustly
Connecting biology use cases with data science
Commons Framework Pilots: BD2K Centers, MODs, HMP
[Figure: Mapping to the Commons framework: Commons Framework Pilots. The BD2K Centers, MODs and HMP are shown against the framework layers (Compute Platform: Cloud or HPC; Services; Software; Data; Digital Object Compliance; App store/User Interface), with PaaS and SaaS labeled.]
Does your work map to the Commons framework?
• Good
• Bad
• Ugly
How does it enable science?
• Using robust computational methods
• Enabling biomedical use cases
Commons Framework Pilots
PI | Parent grant’s IC | Project description

TOGA | NIBIB
• Cloud-hosted data publication system
• Allows the automatic creation and publication of data a personalized data repository

MUSEN | NIAID
• Smart APIs – improved handling for metadata within APIs
• Ontological support for metadata within an API
• Improving smart API discoverability: a registry of APIs

HAN | NIGMS
• Docker container hub for BD2K community
• Docker containers for genomic analysis applications and pipelines
• Benchmark, evaluation & best practices

COOPER/KOHANE | NHGRI
• Cloud based authenticated API access and exchange of causal modeling data, tools + genomic and phenomic data (PICI)
• Docker containers for CCD tools available in AWS

HAUSSLER | NHGRI
• Secure sharing of germline genetic variations for a targeted panel of breast cancer susceptibility genes and variations
• (GA4GH) API: being able to query this data and metadata

Ohno-Machado | NHLBI
• Development of an ecosystem for repeatable science
• Easy reuse of data AND software; tracking of provenance
• Use of container technologies for software and data reuse

Sternberg | NHGRI
• Development of a cloud-based literature curation system for specific curation tasks of the collaborating sites
• An API to provide programmatic access to the relevant papers in PMC

White | NHGRI
• The entire HMP1 data set made accessible on AWS
• Analysis tools for microbiome data in AWS

Westerfield | NHGRI
• Development of a common data model for the MODs
• Development of APIs accessing data across the MODs
More specifically, from a Data Science perspective:
• Open standards for APIs and Docker containers
• Docker registry and best practices
• Improved metadata handling in APIs
• Data object registry and indexing
  • Reusing what is currently available: bioCADDIE and schema.org
• Publication: preprint server with links to all digital objects
Example of a biomedical Use Case:
• Develop a common gene model for all the MODs
• Develop an open, well structured, reusable and documented API that can be used across the MOD data
Why?
• To be able to query a human gene against all MOD orthologs
• Improved understanding of health and disease states
• Improved understanding of genome structure & organization
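The use case above (one gene model shared by all MODs, queried by human gene) can be illustrated with a small sketch. Everything here is hypothetical: the `Gene` fields, the `MOD_RECORDS` list and the placeholder identifiers are assumptions for illustration, not the data model the pilots developed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical common gene model shared by every MOD."""
    gene_id: str         # MOD-specific identifier (placeholder values below)
    symbol: str
    species: str
    source: str          # which MOD supplied the record
    human_ortholog: str  # symbol of the human ortholog, "" if none is known


# Toy records standing in for what each MOD's API might return.
# The identifiers are placeholders, not real database accessions.
MOD_RECORDS = [
    Gene("MOD1:0001", "Trp53", "Mus musculus", "mouse MOD", "TP53"),
    Gene("MOD2:0001", "tp53", "Danio rerio", "zebrafish MOD", "TP53"),
    Gene("MOD3:0001", "p53", "Drosophila melanogaster", "fly MOD", "TP53"),
    Gene("MOD1:0002", "Brca1", "Mus musculus", "mouse MOD", "BRCA1"),
]


def orthologs_of(human_symbol, records=MOD_RECORDS):
    """Return every MOD gene annotated as an ortholog of the given human gene."""
    wanted = human_symbol.upper()
    return [gene for gene in records if gene.human_ortholog.upper() == wanted]


for gene in orthologs_of("TP53"):
    print(f"{gene.source}: {gene.symbol} ({gene.species})")
```

Because every record follows the same model, a single query function covers all MODs; without a common model each database would need its own lookup logic.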
The purpose of the Commons Framework is to support
BOTH
Biological use cases + Data Science methods
To allow biological research to happen at scale
Commons Credits Model
The Cloud Credits Model
[Figure: NIH provides credits to the Investigator, who uses the credits in the Commons. The Commons contains Cloud Provider A, Cloud Provider B and an HPC Provider. Labels: Enabling search: Index; Commons Compliance; Commons Conformance.]
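The credits flow in the Cloud Credits Model (NIH provides credits, the investigator uses them with a provider in the Commons) can be sketched as a simple ledger. This is an illustrative toy under stated assumptions: the `CreditLedger` class and its methods are hypothetical, and real credit accounting would involve billing, grants linkage and auditing that are not shown.

```python
class CreditLedger:
    """Toy ledger: NIH issues credits, investigators spend them with conformant providers."""

    def __init__(self, conformant_providers):
        self.conformant = set(conformant_providers)
        self.balances = {}  # investigator -> remaining credits
        self.spend = {}     # provider -> credits received so far

    def issue(self, investigator, credits):
        """NIH provides credits to an investigator."""
        if credits <= 0:
            raise ValueError("credits must be positive")
        self.balances[investigator] = self.balances.get(investigator, 0) + credits

    def use(self, investigator, provider, credits):
        """The investigator spends credits with a Commons-conformant provider."""
        if provider not in self.conformant:
            raise ValueError(f"{provider} is not a conformant provider")
        if credits <= 0:
            raise ValueError("credits must be positive")
        if self.balances.get(investigator, 0) < credits:
            raise ValueError("insufficient credits")
        self.balances[investigator] -= credits
        self.spend[provider] = self.spend.get(provider, 0) + credits


ledger = CreditLedger(["Cloud Provider A", "Cloud Provider B", "HPC Provider"])
ledger.issue("investigator-1", 100)
ledger.use("investigator-1", "Cloud Provider A", 30)
print(ledger.balances["investigator-1"])  # remaining credits for this investigator
```

The investigator, not NIH, chooses where to spend, which is what creates a marketplace among conformant providers.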
Drivers of the Cloud Credits Model
• Scalability
• Exploiting new computing models
• Potential cost effectiveness
• Simplified sharing of digital objects
Cloud computing supports many of these objectives
Cloud credits model (CCM)
[Figure: Mapping pilots to the Commons framework: Cloud Credits Model. Framework layers labeled with IaaS, PaaS and SaaS.]
Advantages of this Model
• Supports simplified data sharing by driving science into publicly accessible computing environments that still provide for investigator level access control
• Scalable for the needs of the scientific community for the next 5 years
• Democratizes access to data and computational tools
• Cost effective
• Competitive marketplace for biomedical computing services
• Reduces redundancy
• Uses resources efficiently
Potential Disadvantages of this Model
• Novelty: Never been tried, so we don’t have data about likelihood of success
• Cost Models: Assumes stable or declining prices among providers. True for the last several years, but we can’t guarantee that it will continue, particularly if there is significant consolidation in industry
• Service Providers: Assumes that providers are willing to make the investment to become conformant. Market research suggests 3-5 providers within 2-3 months of launch
• Persistence: The model is ‘Pay As You Go’, which means if you stop paying it stops going. Giving investigators an unprecedented level of control over what lives (or dies) in the Commons
Cloud Commons Reference Data Sets
Data Sets in a Cloud Commons
Making high value and/or high volume NIH funded data sets available in a cloud commons
Co-location of large data sets and compute power enables access, use, reuse and sharing of data and tools
• Data must adhere to FAIR/Commons compliance principles
• Helps “seed” the Commons with FAIR/Commons compliant data
• Provides indexable test data sets for bioCADDIE (and other indexing efforts)
[Figure: Mapping pilots to the Commons framework: Large, high value Data Sets. Label: NIH defined data sets.]
Data Sets in the Cloud Commons: preliminary possible data sets
• GTEx (Genotype-Tissue Expression)
• LINCS (Library of Integrated Network-based Cellular Signatures)
• Model Organism Databases (MODs)
• UniProt
• Neuroimaging Resource (NITRC)
• Radiology Image Share
• Epigenomics
• GenPort
• The Cancer Genome Atlas Project (TCGA): this data set is currently housed at the GDC, but there are plans to move it to AWS and Google
• BTRIS Data (NIH Clinical Center)
• NIAID AIDS Data
• dbGaP
• GEO
[Figure: Mapping pilots to the Commons framework: Community Defined Data Sets. Label: Community defined data sets.]
Data Sets in a Cloud Commons: Opportunities
Ability to share data more easily
Ability to access and compute on data more easily
Reduced costs:
• Cost is paid by NIH, not the individual PI
• Stops continual uploads of the same data sets
FAIR/Commons compliance of data sets
Data Sets in a Cloud Commons: Challenges
• Supporting sensitive (human) data in commercial clouds
• Updating, versioning, maintaining
• Consents for data
  • Can be very strict and only valid across 1 data set
  • Analysis across data sets may be constrained by consents
• Optimizing for cloud environments: performance
• Incentivizing data (and tool) generators to move and maintain their data in the cloud
• Data peering across clouds
  • Commercial clouds are resistant: cylinders of excellence
  • Peering and virtualization of services
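The consent challenge above (analysis across data sets may be constrained by consents) amounts to a set intersection: a combined analysis is only permitted for uses that every participating data set allows. The sketch below is illustrative only; the data set names and consent labels are hypothetical, and real consent terms are more nuanced than flat labels.

```python
def permitted_uses(consents_by_dataset, datasets):
    """Return the uses allowed on every selected data set (a set intersection)."""
    selected = [set(consents_by_dataset[name]) for name in datasets]
    if not selected:
        return set()
    return set.intersection(*selected)


# Hypothetical data sets and consent labels, for illustration only.
CONSENTS = {
    "dataset_a": {"general_research", "cancer_research", "methods_development"},
    "dataset_b": {"cancer_research", "methods_development"},
    "dataset_c": {"cardiac_research"},
}

print(sorted(permitted_uses(CONSENTS, ["dataset_a", "dataset_b"])))
print(sorted(permitted_uses(CONSENTS, ["dataset_a", "dataset_b", "dataset_c"])))
```

Adding a strictly consented data set can shrink the permitted uses to nothing, which is why consents valid across only one data set limit cross-data-set analysis.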
Making things Findable
Indexing & Search methods
Commons Pilots: Search & Index
Indexing and searching digital objects in a Commons
Leveraging indexing methods within BD2K
• bioCADDIE, other approaches within BD2K
• Schema.org
Coexisting efforts
[Figure: Mapping pilots to the Commons framework: Indexing & Searching. Label: BD2K Indexing, e.g. bioCADDIE, other, schema.org.]
What is bioCADDIE?
biomedical and healthCAre Data Discovery Index Ecosystem
University of California San Diego, PI Lucila Ohno-Machado
Development of a prototype of a Data Discovery Index (DDI)
Aims – “PubMed” for Data
1. Help users find shared data
2. Build a prototype data discovery index
3. Evaluate requirements for next phase
Ecosystem components for finding data
• Policies: criteria for inclusion, sustainability
• Standards: metadata, data
• Identifiers: reuse of existing ID issuing services
• Metadata: minimal set, guidelines for mapping, accessibility information, provenance
• Search engine: connection to other engines, repositories, data sets
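Several of the components above (existing identifiers, a minimal metadata set, a search engine) can be combined in a toy index. This is a minimal sketch and not bioCADDIE's design: the `DataDiscoveryIndex` class, its metadata fields and the example identifiers are all hypothetical.

```python
import re
from collections import defaultdict


class DataDiscoveryIndex:
    """Toy inverted index over minimal data set metadata (not bioCADDIE's design)."""

    def __init__(self):
        self.records = {}                 # identifier -> metadata dict
        self.postings = defaultdict(set)  # token -> identifiers whose text contains it

    @staticmethod
    def _tokens(text):
        """Lowercase the text and split it into alphanumeric tokens."""
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, identifier, title, description, repository):
        """Register a data set under an existing identifier and index its text."""
        self.records[identifier] = {
            "title": title,
            "description": description,
            "repository": repository,
        }
        for token in self._tokens(f"{title} {description}"):
            self.postings[token].add(identifier)

    def search(self, query):
        """Return sorted identifiers whose metadata contains every query term."""
        terms = self._tokens(query)
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(hits)


ddi = DataDiscoveryIndex()
ddi.add("ex:1", "Gut microbiome survey", "16S sequencing of stool samples", "repo-x")
ddi.add("ex:2", "Tumor expression atlas", "RNA sequencing of tumor samples", "repo-y")
print(ddi.search("tumor sequencing"))
```

The index stores only metadata and a pointer to the hosting repository; the data stay where they are, which matches the idea of an index that connects to repositories rather than replacing them.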
Commons Pilots Leveraging Schema.org
Marking up a biomedical resource using schema.org
• Flexible and scalable
• Developing a bioschema.org approach
Helps drive a community standard for reuse by other groups
Harnesses the power of search engines to find digital objects
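Marking up a resource with schema.org typically means embedding a JSON-LD description in its web page. The sketch below builds such markup for the schema.org `Dataset` type; the helper name `dataset_jsonld` and all example values (names, URLs, keywords) are placeholders chosen for illustration, not an actual Commons resource.

```python
import json


def dataset_jsonld(name, description, url, identifier, keywords):
    """Build schema.org Dataset markup as a JSON-LD string."""
    doc = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "identifier": identifier,
        "keywords": list(keywords),
    }
    return json.dumps(doc, indent=2)


# Placeholder values for illustration only.
markup = dataset_jsonld(
    name="Example microbiome data set",
    description="Placeholder description for illustration only.",
    url="https://example.org/datasets/1",
    identifier="https://example.org/datasets/1",
    keywords=["microbiome", "example"],
)

# Embedding the markup in a page lets search engine crawlers read it.
print(f'<script type="application/ld+json">\n{markup}\n</script>')
```

Because the vocabulary is shared, the same few properties can describe resources from different groups, which is how a community standard for reuse could emerge.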
Commons: Achieving a Balance
Biomedical Use Cases + Data Science + e-infrastructures
Supporting open biomedical science using robust, scalable and flexible digital technologies
In collaboration with global communities
Thank you
ADDS Office: Phil Bourne, Michelle Dunn, Jennie Larkin, Mark Guyer, Sonynka Ngosso
NCBI: George Komatsoulis
NHGRI: Valentina di Francesco, Kevin Lee
CIT: Debbie Sinmao, Andrea Norris, Stacy Charland
Trans NIH BD2K Executive Committee & Working groups
NCI: Warren Kibbe, Tony Kerlavage, Lou Staudt, Tanja Davidsen, Ian Fore
NIAID: Nick Weber, Darrell Hurt, Maria Giovanni, JJ McGowan
Many biomedical researchers, cloud providers, IT professionals
The end