Page 1: Common Framework Working Groups - Data Science · Ronak Patel (Baylor College of Medicine) Lisa Federer (NIH Library) CFWG API Interoperability Working Group Improving the discoverability,

+

Common Framework Working Groups

Owen White and many more

+ Why this is confusing

■ Several different initiatives
  ■ BD2K, Common Fund, Global Alliance, Genomic Data Commons

■ Several different virtual spaces
  ■ GDC, Hutch Data Commonwealth, Cloud pilots

■ Co-opting several pre-existing activities
  ■ MODs, GA4GH, HMP

+Why this is REALLY confusing

+ Ready or not …we are building an ecosystem

■ Living, thriving, dynamic and very new concept

■ Composed of many incubators

■ Some technologies will prevail, some will not

■ It is not appropriate or possible to:
  ■ do this in isolation
  ■ burn resources while just doing our own research

+Data Commons Components

Reference Data Sets

Resource Search & Index

Cloud Credit Model

Commons Framework Pilots

• GDC
• Human Microbiome Project
• Global Alliance
• MODs
• RFI – engage community

Winter 2017

• FOAs – Place high impact data sets in the cloud

Spring 2017

+Data Commons Components

Reference Data Sets

Resource Search & Index

Cloud Credit Model

Commons Framework Pilots

• Data Discovery Index (DDI) Consortium (bioCADDIE, dataMed, omicsDI others)

• Aggregation of metadata presented on web

• Driving metadata standards

• Search/query services

+Data Commons Components

Reference Data Sets

Resource Search & Index

Cloud Credit Model

Commons Framework Pilots

• 3 year pilot to test business model

• Investigators receive credits for use with cloud providers

• Provider debits against account in pay-as-you-go model

• Amazon Reseller, IBM, Google Reseller, Broad and NCI Cloud Pilots

+Data Commons Components

Reference Data Sets

Resource Search & Index

Cloud Credit Model

Commons Framework Pilots

Several examples…...

+The CEDAR Approach to Metadata

+AzTec

Building a Technology Platform to Integrate Biomedical Resources

https://aztec.bio

+

smartAPI Interoperability Pilot

Development of a community-based standard

Intelligent authoring of API metadata

Faceted search

Metadata editor

API testing

Repository API

+

Brian O’Connor - UCSC

+ Motivation – broad goals…and why you should participate

■ Everyone has a lot to share – let’s ensure we socialize our research products

■ Vision for Commons implementation

■ Self-governance

■ Managing standards proliferation

■ We are not in competition with each other

■ Setting guidelines in RFAs

+ Common Framework Working Group

■ Development of FAIR-ness Metrics
  ■ objective measures for the degree of data availability

■ Metadata documentation of APIs
  ■ creating a "minimal list" that describes available APIs

■ Data-object registry / Indexing
  ■ approaches to make all data findable

■ Workflow sharing and Docker Registry
  ■ we have lots of workflows; how do we share them?

■ Commons publication initiative
  ■ a coordinated publication plan

FAIRness

FAIRness Metrics

Optimizing the FAIR alignment of research assets, roles & relationships

Neil McKenna, Ph.D.
Baylor College of Medicine
Co-Chair, FAIRness Metrics Subgroup

Co-chair: Michel Dumontier, Ph.D.

What are FAIRness Metrics?

• The FAIR principles articulate ideals in research

• FAIRness Metrics give effect to these principles to advance FAIRness in research

• Commons FAIRness Metrics Subgroup (FMSG) has been tasked with developing FAIRness Metrics

• First (ongoing) step for the FMSG is to comprehensively define the components of the research ecosystem:
  – Research assets
  – Research roles
  – Research relationships

What are the assets of research?

Datasets

Metadata & Standards

Research Resources

Applications, services & tools

Defining research roles: two examples

Asset Producers

Asset Stewards

Individual bench researchers

Ontology organizations

Tool & app developers

Primary data repos

Software registries

Research Resource Stores

What are the roles in research?

Asset Consumers

Asset Producers

Asset Stewards

Publishers

Asset sponsors

Asset indexers & registries

What are the relationships between these roles?

What are the relationships between assets and roles?

+

Web-based analysis widget

Asset Producers

Asset Stewards

Publishers

Asset Sponsors

Asset Consumers

Asset Indexers & Registries

FAIRness Metrics & Indexes

• FAIRness metrics seek to optimize the alignment of research assets, roles and relationships with the FAIR principles

• Unique roles & relationships require unique sets of metrics
  – FAIRness Index: custom set of metrics tailored to a specific research role, its assets and its relationships with other roles
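
One way to picture a FAIRness Index is as a role-specific checklist scored against an asset. The sketch below is only an illustration under that assumption: the role names echo the slides, but every metric name and the scoring rule (fraction of checks passed) are hypothetical and are not taken from the FMSG's work.

```python
# Hypothetical metric sets per research role; the slides do not list actual metrics.
INDEXES = {
    "asset_producer": [
        "has_persistent_identifier",
        "deposited_in_repository",
        "has_license",
    ],
    "asset_steward": [
        "machine_readable_metadata",
        "human_readable_landing_page",
    ],
}


def fairness_index(role, asset):
    """Return the fraction of the role's checks that the asset satisfies."""
    checks = INDEXES[role]
    passed = [name for name in checks if asset.get(name, False)]
    return len(passed) / len(checks)


# Illustrative asset description (made-up values).
dataset = {
    "has_persistent_identifier": True,
    "deposited_in_repository": True,
    "has_license": False,
}

score = fairness_index("asset_producer", dataset)
print(f"asset_producer index: {score:.2f}")
```

Keeping a separate checklist per role mirrors the point that unique roles and relationships need unique sets of metrics.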

FAIRness Indexes: holding a mirror up to research roles

• Asset Producers: How well are the products of my research shared with other researchers?

• Asset Stewards: Are assets optimally exposed to both machines & humans?

• Publishers: Is the relationship between research articles and their supporting assets properly recognized?

• Consumers: Do I give full attribution when I re-use assets?

Asset Producers

Publishers

Asset Sponsors

Asset Consumers

Asset Indexers & Registries

Get involved – please!

• Long term goal is to have FAIRness Indexes adopted by funding agencies & incorporated into FOAs

• We need help from the community:
  – Defining & identifying research assets and roles
  – Developing & refining use cases that define the relationships between research assets & roles
  – The more roles that are represented in the FMSG, the better FAIRness Indexes will reflect the real research world

• To get involved in the FMSG, complain, or just find out more about what we’re doing, contact
  – Neil McKenna ([email protected])
  – Michel Dumontier ([email protected])

• Or stop by Poster 135 later on!

Current FMSG roster: thank you

Mark Wilkinson (University of Madrid)
Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Susanna Sansone (Oxford University)
Allen Dearry, Elaine Collier (NIH)
Lucila Ohno-Machado, Jeff Grethe (UCSD)
Mark Musen (Stanford University)
Tim Clark (Harvard Medical School)
Nolan Nichols (SRI/Stanford)
Tobias Kuhn (VU University Amsterdam)
Carole Goble (The University of Manchester)
Jo McEntyre (EBI)
Luiz Bonino (DTL/VU)
Alasdair Gray (Heriot-Watt University)
Marco Roos, Katy Wolstencroft, Mark Thompson (Leiden University Medical Center)
Richard Finkers (Wageningen UR)
Christina Lohr, Holly Falk-Krzesinski, Anita deWaard, Paul Groth (Elsevier)
Ronak Patel (Baylor College of Medicine)
Lisa Federer (NIH Library)

CFWG API Interoperability Working Group

Improving the discoverability, accessibility, interoperability and reuse of web APIs

Co-Chairs: Michel Dumontier and Chunlei Wu

@micheldumontier::CFWG:30-11-2016

Motivation

Biomedical science is increasingly being done using cloud-based, web-friendly application programming interfaces (APIs).

BUT it’s pretty much impossible to automatically discover which API to use and how to connect these together to create an effective workflow.

-> barrier to discovery.

51 APIs

1,184 APIs

14,952 APIs

Examining the metadata for the myGene.info web API

Gene

myGene.info

?

GenBank identifier

Affymetrix identifier

Taxonomy identifier

… 1340 lines …

HGNC symbol

?

NCBI Gene Terminology

Profiling the API output

What do these symbols refer to?

How do we find out more?
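
Profiling an API's output can be as simple as walking one response and listing each field path with its type. The sketch below does that for a small hand-written JSON record. The field names and values are illustrative assumptions and are not actual myGene.info output, which the slides show only as a screenshot of some 1340 lines.

```python
import json
from collections import Counter

# Hand-written, abbreviated record used purely for illustration.
SAMPLE = json.loads("""
{
  "symbol": "CDK2",
  "taxid": 9606,
  "entrezgene": "1017",
  "accession": {"genomic": ["ACC_0001"], "rna": ["ACC_0002"]},
  "reporter": {"array_a": ["probe_1", "probe_2"]}
}
""")


def profile(value, path=""):
    """Yield (field path, Python type name) for every leaf value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from profile(child, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        for child in value:
            yield from profile(child, path + "[]")
    else:
        yield path, type(value).__name__


# Count how often each (path, type) pair occurs in the record.
field_types = Counter(profile(SAMPLE))
for (field, kind), count in sorted(field_types.items()):
    print(f"{field:25} {kind:5} x{count}")
```

A listing like this shows which fields exist and what primitive type they hold, but not what a field means (for example, which identifier scheme `entrezgene` follows). That gap is what the semantic annotations discussed in the slides are meant to close.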

How does myGene.info connect with myVariant.info?

Gene

myGene.info

?

myVariant.info

Knowing how APIs connect is essential for (automated) workflow composition

Problem Statement

There is an overwhelming lack of explicit knowledge pertaining to the structure and datatype of web API inputs and outputs

If web APIs were annotated with semantic metadata, they would be easier to discover, connect together, and reuse.
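
To make the point concrete: if each API declared the type of identifier it accepts and the type it returns, a chain of calls could be found by a plain graph search. The sketch below assumes such annotations exist. The API names and type labels are hypothetical and do not reflect the metadata of any real service.

```python
from collections import deque

# Hypothetical annotations: API name -> (input type, output type).
API_ANNOTATIONS = {
    "symbol_lookup": ("hgnc_symbol", "ncbi_gene_id"),
    "gene_to_variant": ("ncbi_gene_id", "variant_id"),
    "variant_annotator": ("variant_id", "clinical_significance"),
    "taxonomy_lookup": ("taxonomy_id", "species_name"),
}


def compose(start_type, goal_type, annotations=API_ANNOTATIONS):
    """Breadth-first search for the shortest chain of APIs linking two types."""
    queue = deque([(start_type, [])])
    seen = {start_type}
    while queue:
        current, chain = queue.popleft()
        if current == goal_type:
            return chain
        for name, (accepts, returns) in annotations.items():
            if accepts == current and returns not in seen:
                seen.add(returns)
                queue.append((returns, chain + [name]))
    return None  # no chain of annotated APIs connects the two types


print(compose("hgnc_symbol", "clinical_significance"))
print(compose("taxonomy_id", "variant_id"))
```

Without the input and output annotations, the search has nothing to match on, which is the barrier to automated workflow composition described above.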

API Interoperability CFWG

To foster a collaborative environment for the discussion, development and evaluation of infrastructure and guidelines that facilitate the discoverability, implementation, deployment, interoperability and reuse of web APIs

API Interoperability WG People

Michel Dumontier
Amrapali Zaveri
Shima Dastgheib
Chunlei Wu
Caty Chung
Raymond Terryn
Paul Avillach
Kevin Osborn
David Steinberg
Kathleen Jagodnik
Gregg Kellogg
Nolan Nichols
Mark Wilkinson
Ruben Verborgh
Mary Shimoyama
Jeff De Pons
Denise Luna

http://mygene.info
http://ruben.verborgh.org/blog/2013/11/29/the-lie-of-the-api/
http://dumontierlab.com
http://www.lincsproject.org
http://bd2k-picsure.hms.harvard.edu
https://spec-ops.io
http://nidm.nidash.org/
https://cgl.genomics.ucsc.edu/
http://sadiframework.org
https://bd2kccc.org/
http://rgd.mcw.edu/

Metadata Survey

We surveyed 3 repositories (BioCatalogue, ProgrammableWeb, ELIXIR Tools & Services Registry) and 4 specifications (MIAS, OpenAPI, SADI, schema.org), plus a preliminary smartAPI metadata specification.

Metadata Elements 20 basic, 6 provider, 10 operation, 12 parameters, 6 response

smartAPI metadata authoring tool

Metadata authoring made easy. We extended the Swagger Editor to validate using the smartAPI specification and to suggest metadata elements and values from the smartAPI repository API.

Unify API data with Linked Open Data

WG members are documenting their APIs!

API Interoperability CFWG

Mission: To foster a collaborative environment for the discussion, development and evaluation of infrastructure and guidelines that facilitate the discoverability, implementation, deployment, interoperability and reuse of web APIs.

Planned Activities
– Finalizing vision and API metadata specification
– Demonstrations and evaluations of usability and utility of our work
– Implementation and use of smartAPIs in reproducible discovery science
– Coordinate activities with the GA4GH API group
– Investigating FAIR metrics for APIs
– Your idea here!

Participation
– Join mailing list and participate in biweekly teleconference calls
– Work with an excellent group of people with broad expertise
– Take credit for transforming the API ecosystem in BD2K … and beyond!

[email protected]: http://dumontierlab.com

Presentations: http://slideshare.com/micheldumontier

BD2K Indexing Working Group

a consolidated effort of the Commons Framework Pilots WG, the Centers of Excellence Coordination Center,

and the Data Discovery Index Consortium

Current co-chairs

Wei Wang, UC Los Angeles

Michel Dumontier, Stanford

Lucila Ohno-Machado, UC San Diego

Founding members (everyone is welcome to join)

George Alter, Univ. Michigan
Elizabeth Bell, UCSD
Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Susanna Sansone, Univ. Oxford
Judith Blake, The Jackson Laboratory
Brian Bleakley, BD2K Centers Coordinating Center
Benjamin Hitz, Stanford
Iyad Obeid, Temple
Joe Picone, Temple
Kevin Read, NYU

Operating Principles

• Data integration is key to functional and comparative biomedicine (-omics, clinical medicine, public health, health economics)
  – Allows data to be evaluated in new contexts

• Standards are key to data integration
  – Nomenclature
    • Standardized nomenclature, keywords, etc.
  – Knowledge representation
    • Gene Ontology (GO)
    • Mammalian Phenotype Ontology
    • Others

Adapted from J Blake’s slide, The Jackson Laboratory

Gaps in the Metadata Workflow

Most data are “born digital,” but metadata are orphans

• Curating data is an expensive manual process

• When data are created in silico, why are annotations entered manually?

• There are gaps in the scientific workflow because tools for managing, transforming, and analyzing data are not metadata-aware

• Tools to automate the capture and maintenance of metadata are needed

• Example:
  – Many types of data are analyzed in statistical packages (R, SAS, etc.) that do not read or write metadata (data transformations from statistical software must be annotated by hand)
  – Other analytical software should also read/write metadata (and be indexed)

Adapted from G Alter’s slide, University of Michigan
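
As a rough illustration of what "metadata-aware" could mean for an analysis tool, the sketch below wraps a transformation so that the descriptive metadata travels with the values and a provenance entry is recorded automatically rather than by hand. The `AnnotatedData` class, the `transform` helper, and the metadata keys are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from statistics import mean


@dataclass
class AnnotatedData:
    """Values bundled with descriptive metadata and a transformation history."""
    values: list
    metadata: dict
    history: list = field(default_factory=list)


def transform(data, func, description):
    """Apply func to the values and append a provenance entry automatically."""
    entry = {
        "step": description,
        "function": func.__name__,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return AnnotatedData(func(data.values), dict(data.metadata), data.history + [entry])


def center(values):
    """Subtract the mean from every value."""
    m = mean(values)
    return [v - m for v in values]


raw = AnnotatedData([2.0, 4.0, 6.0], {"variable": "expression", "unit": "log2 counts"})
centered = transform(raw, center, "subtract the sample mean")

print(centered.values)
print(centered.history[0]["step"], "via", centered.history[0]["function"])
```

The original object is left untouched, so each derived dataset carries its own complete history.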

Annotating Data Repositories

Metadata Ingestion

Terminology server
• Query expansion
• Result ranking

DataMed User Interface

Search Engine

Metadata Management
• Mapping
• Indexing

Repositories

Data Sets

Funding Agencies

Data Producers

Publishers

Data sources
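
The two terminology-server functions named in the diagram, query expansion and result ranking, can be sketched in a few lines. This is a toy illustration only: the synonym table, dataset titles, and scoring rule are assumptions made for the example and say nothing about how DataMed is actually implemented.

```python
# Hypothetical synonym table standing in for a terminology server.
SYNONYMS = {"heart attack": ["myocardial infarction"]}

# Hypothetical indexed dataset titles.
DATASETS = {
    "ds1": "Myocardial infarction cohort expression profiles",
    "ds2": "Heart attack risk factors survey",
    "ds3": "Zebrafish fin regeneration imaging",
}


def expand(query):
    """Return the query followed by any known synonyms."""
    normalized = query.lower()
    return [normalized] + SYNONYMS.get(normalized, [])


def search(query):
    """Rank datasets: a literal query match scores 2, a synonym match scores 1."""
    terms = expand(query)
    scored = []
    for dataset_id, title in DATASETS.items():
        text = title.lower()
        score = sum((2 if i == 0 else 1) for i, term in enumerate(terms) if term in text)
        if score:
            scored.append((score, dataset_id))
    # Highest score first; ties broken by dataset id for a stable order.
    return [dataset_id for score, dataset_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


print(search("Heart attack"))
```

Without expansion, the query would miss the first dataset entirely, even though it describes the same condition.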

Dataset Ingestion Challenges and Costs (1)

Challenges we have encountered, and their costs

1. Lack of metadata documentation
   Cost: Human labor and time spent on investigating the repository website to understand the data it provides, and to find solutions for obtaining metadata

2. Limited readily accessible metadata
   Cost: Human labor and time spent on designing a web crawler to collect available data from the repository website before translating them into the metadata required for indexing
   Cost: Hardware to meet computational needs for web crawling tasks

3. Lack of domain knowledge (from the indexing team)
   Cost: Human labor and time spent on understanding the biological and/or technical contents of the data repository

4. Heterogeneity in metadata and data formats
   Cost: Human labor and time spent on iterative refinement of DATS mapping as well as transformation and ingestion scripts (or code)

Adapted from H Kim’s slide, UC San Diego

Dataset Ingestion Challenges and Costs (2)

Challenges

• Setting up the ingestion pipeline is complicated and time-consuming (one-time process)

• Metadata download and ingestion requires domain expertise to verify validity & granularity

• Domain experts required to verify indexing

• Heterogeneity across curators during the mapping process

• Code for harvesting metadata invariably needs to be customized for each repository

• Poor documentation (including lack of APIs, no defined metadata) in a large number of repositories

• Requires interaction and communication with repository personnel (time-consuming) to initiate the ingestion process

Costs

• Personnel (domain experts & programmers)

• Time-consuming process

Adapted from H Xu’s slide, University of Texas Houston
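
The per-repository customization described above often takes the shape of one small mapping function per source, all feeding a shared minimal record. The sketch below shows that pattern. The repository names, source field names, and target fields are hypothetical, and the target record is not the actual DATS schema.

```python
def map_repo_a(record):
    """Hypothetical repository A: flat fields, keywords already a list."""
    return {
        "identifier": record["acc"],
        "title": record["name"],
        "keywords": list(record.get("tags", [])),
    }


def map_repo_b(record):
    """Hypothetical repository B: keywords packed into a ';'-separated string."""
    subjects = record.get("subject", "")
    return {
        "identifier": record["id"],
        "title": record["dataset_title"],
        "keywords": [s.strip() for s in subjects.split(";") if s.strip()],
    }


MAPPERS = {"repo_a": map_repo_a, "repo_b": map_repo_b}
REQUIRED = ("identifier", "title")


def ingest(source, record):
    """Map a source record to the common form and check required fields."""
    mapped = MAPPERS[source](record)
    missing = [name for name in REQUIRED if not mapped.get(name)]
    if missing:
        raise ValueError(f"{source}: missing required fields {missing}")
    mapped["source"] = source
    return mapped


a = ingest("repo_a", {"acc": "A-001", "name": "Liver RNA-seq", "tags": ["liver"]})
b = ingest("repo_b", {"id": "B-7", "dataset_title": "Gut microbiome", "subject": "gut; microbiome"})
print(a)
print(b)
```

Every new repository adds one mapper, which is where the personnel and time costs listed above accumulate.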

Just as JATS (Journal Article Tag Suite) is used by PubMed to index literature, DATS (DatA Tag Suite) is needed for a scalable way to index data sources in the DataMed prototype

A community effort

Adapted from a slide by Sansone, Gonzalez-Beltran, and Rocca-Serra, University of Oxford

Example of a model for scalable indexing

Convergence of elements extracted from competency questions and existing (generic and biomedical) data models (incl. DataCite, DCAT, schema.org, HCLS dataset, RIF-CS, ISA-Tab, SRA-xml, etc.)

[Figure: adoption of the extracted elements as core entities and extended entities]

Adapted from a slide by Sansone, Gonzalez-Beltran, and Rocca-Serra, University of Oxford

Interlinking to other indexes

Adapted from a slide by Sansone, Gonzalez-Beltran, and Rocca-Serra, University of Oxford

Two Fronts

Annotating existing data

• Continue to work with data repositories to map into a minimal standard

• Incentive$ for data producers/repositories to facilitate mapping

• Incentive$ for data reuse/citation

Annotating new data

• Could be done at the source, like publishers do for JATS

• Additional re$ource$ need to be provided for data producers/repositories to prepare data for sharing (e.g., after grant funding period ends)

Re$ource$ for data producers and/or repositories to maintain data and their annotations are needed

Leveraging resources from various paid projects, consolidation of efforts, and incentivizing data producers/keepers saves time and money

Working Group Charter

Make recommendations to funders to increase adoption of standardized metadata by the biomedical science community

• Establish framework for calculation of costs and sustainability

• Propose mechanisms to enable effective metadata curation
  – What: Re$ource$
  – When: Timelines
  – How: Minimal metadata
  – Who: Self- or assisted mapping

Workflow Sharing and Docker Registries Work Group

Umberto Ravaioli
University of Illinois

Brian O’Connor
University of California Santa Cruz

FAIR-ness

• Adherence to FAIR principles: Registries to make tools Findable and Accessible and (Docker) container adoption to make tool components Interoperable and Reusable.

• Important mission of the NIH Commons should be to develop a culture of open source development, data sharing, and accessible tools for reproducible science.

Overview of Activities - Membership

Umberto Ravaioli, University of Illinois at Urbana-Champaign
Brian O'Connor, University of California, Santa Cruz
Mark Diekhans, University of California, Santa Cruz
Benedict Paten, University of California, Santa Cruz
Charles Blatti, University of Illinois at Urbana-Champaign
Milt Epstein, University of Illinois at Urbana-Champaign
Don Armstrong, University of Illinois at Urbana-Champaign
Ravi Madduri, University of Chicago & Argonne National Lab
Rommie Amaro, University of California, San Diego
Stephen Ramsey, Oregon State University
Benjamin Hitz, Stanford University
Michael Crusoe, Common Workflow Language Project
Heidi Sofia, National Human Genome Research Institute, NIH
Steve Hsinyi Tsang, NIH/NCI & Attain

Organization of Activities

• Monthly conference calls (3rd Thursday of the month)

• Use of Google tools and workspaces to communicate and share documents

• Administrative assistance received from Coordinating Center (UCLA – Denise Luna)

Position Paper

• Discussion of the State-of-the-Art
• Goals of the Work Group
• Sharing mechanisms
• Docker containers
• Workflow Languages
• Case Studies/Prototypes/etc.
• Recommendations based on experience, future path of technologies, e.g.:
  • Standards, APIs / External Collaborations
  • Other considerations (security, legal, etc.)
  • Adherence to FAIR Commons Concepts

Workflow Languages and Specs

• This area keeps evolving

• There are two main languages (CWL and WDL) used by the genomics community

• Workflow Execution Services:
  – Seven Bridges
  – FireCloud (Broad Institute, specialized for Google)
  – Consonance (Java)
  – TOIL (UCSC – Python. Wide support of computing systems)

Common Workflow Language (CWL)

• CWL is a way to describe command line tools and connect them to create workflows.

• CWL is a specification and not a piece of software

• Tools and workflows described using CWL are portable across a variety of platforms that support the CWL standard.

• CWL approach emphasizes execution features and machine-readability, and serves a core target audience of software and platform developers.
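
To give a feel for what "describing a command line tool" means, the sketch below builds a minimal CWL tool description for `echo` as a Python dictionary and serializes it to JSON, which CWL accepts alongside YAML. The tool itself and the `message` input name are chosen for illustration and are not part of the slides.

```python
import json

# Minimal CWL CommandLineTool description wrapping the `echo` command.
tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "baseCommand": "echo",
    "inputs": {
        # One string input, passed as the first positional argument.
        "message": {"type": "string", "inputBinding": {"position": 1}},
    },
    "outputs": [],
}

cwl_text = json.dumps(tool, indent=2)
print(cwl_text)
```

Saved to a file, a description like this would be handed to a CWL-aware runner rather than executed directly. The description is only data, which is what makes it portable across platforms that implement the standard.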

Workflow Description Language (WDL)

• Developed by the Broad Institute engineering team supporting genome analysis pipelines

• WDL emphasizes scripting and is designed from the ground up as a human-readable and -writable way to express tasks and workflows.

• WDL script provides a complete analysis solution: workflow, task, call, command and output

• Work is underway to ensure interoperability between CWL and WDL, through conversion and related utilities.

Reaching out to GA4GH

• We are in contact with the GA4GH Containers and Workflows Group to coordinate technical discussions and possibly to merge development of position paper into a joint activity (Brian taking the lead on this)

Page 65:

GA4GH – API proposal for:

• ability to request a workflow run using CWL or WDL (and maybe future formats)

• ability to parameterize that workflow using a JSON schema that's simple and used in common between CWL and WDL

• ability to get information about running workflows, status, errors, output file locations

• ISSUE: standardization of terms
– job, workflow, steps, tools, etc
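
The three abilities above can be sketched as an in-memory mock. This is only an illustration of the shape of such an interface: the class `MockWorkflowService`, its method names, the state names and the field names are assumptions made for this sketch, not the schema of the GA4GH proposal.

```python
import json
import uuid
from dataclasses import dataclass, field

# Formats accepted by this mock; the proposal leaves room for future formats.
SUPPORTED_TYPES = {"CWL", "WDL"}


@dataclass
class Run:
    run_id: str
    workflow_type: str
    workflow_url: str
    params: dict
    state: str = "QUEUED"  # hypothetical state name
    errors: list = field(default_factory=list)
    outputs: dict = field(default_factory=dict)


class MockWorkflowService:
    """In-memory stand-in for a workflow execution service (hypothetical)."""

    def __init__(self):
        self._runs = {}

    def submit(self, workflow_type: str, workflow_url: str, params_json: str) -> str:
        """Request a run; parameters arrive as JSON for both CWL and WDL."""
        if workflow_type not in SUPPORTED_TYPES:
            raise ValueError(f"unsupported workflow type: {workflow_type}")
        params = json.loads(params_json)
        run_id = uuid.uuid4().hex
        self._runs[run_id] = Run(run_id, workflow_type, workflow_url, params)
        return run_id

    def complete(self, run_id: str, outputs: dict) -> None:
        """Test helper: mark a run as finished and record output locations."""
        run = self._runs[run_id]
        run.state = "COMPLETE"
        run.outputs = dict(outputs)

    def status(self, run_id: str) -> dict:
        """Report state, errors and output file locations for a run."""
        run = self._runs[run_id]
        return {
            "run_id": run.run_id,
            "state": run.state,
            "errors": list(run.errors),
            "outputs": dict(run.outputs),
        }
```

The ISSUE on terminology shows up even here: "run", "job" and "workflow" could each name the object this sketch calls `Run`.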

Page 66:

GA4GH – API (continued)

• Having this standard API supported by multiple execution engines will give options of processing the same workflow (e.g., CWL or WDL) across different workflow execution platforms running across various clouds/environments.

• Example of possible scenario:
– Get workflow in CWL on Dockstore.org
– Use Dockstore to generate a JSON parameterization file
– Submit to Seven Bridges/FireCloud/Consonance or some other GA4GH-compliant workflow execution service (if API is supported!)
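
A rough sketch of the scenario above, under stated assumptions: `parameter_template` derives a fill-in JSON parameter template from the inputs of a CWL description (it is not how Dockstore does it), and `Engine` is a hypothetical stand-in for any execution service that may or may not support the standard API. The engine names are placeholders, not claims about real platforms.

```python
import json


def parameter_template(cwl_tool: dict) -> dict:
    """Build a fill-in parameter template from CWL inputs given in map form."""
    placeholders = {"string": "", "int": 0, "boolean": False}
    template = {}
    for name, spec in cwl_tool.get("inputs", {}).items():
        # Inputs may be written as {"type": "string"} or as the shorthand "string".
        cwl_type = spec["type"] if isinstance(spec, dict) else spec
        template[name] = placeholders.get(cwl_type)  # None for unknown types
    return template


class Engine:
    """Hypothetical execution service that may or may not support the standard API."""

    def __init__(self, name: str, supports_standard_api: bool):
        self.name = name
        self.supports_standard_api = supports_standard_api

    def submit(self, descriptor: dict, params_json: str) -> str:
        if not self.supports_standard_api:
            raise NotImplementedError(f"{self.name} does not expose the standard API")
        return f"{self.name}:accepted"


def submit_to_first_compliant(engines, descriptor: dict, params: dict) -> str:
    """Send the same workflow and parameters to the first compliant engine."""
    payload = json.dumps(params)
    for engine in engines:
        if engine.supports_standard_api:
            return engine.submit(descriptor, payload)
    raise RuntimeError("no compliant engine available")
```

Because the descriptor and the parameter file are engine-neutral, choosing an engine reduces to checking whether it supports the API.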

Page 67:

Containerization

• How do we approach standardization of Docker containers to promote reusability?

• Computational efficiency goes hand in hand with workflow definition and execution.

• Parallelization: Macrotasking vs Microtasking.

• Optimization of numerical procedures is of paramount importance.

• Discoverability: Standardization of terms mentioned before is very important.
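
One way to read the macrotasking vs microtasking distinction, offered as an assumption rather than the presenters' definition: macrotasking runs whole units of work (e.g. one sample per container) side by side, while microtasking splits a single unit into small chunks. The sketch below shows only the structure; `analyze` is a made-up stand-in, and Python threads will not speed up CPU-bound work like this.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze(values: list) -> int:
    """Stand-in for an analysis step: sum of squares."""
    return sum(x * x for x in values)


def macrotask(samples: list, workers: int = 4) -> list:
    """Coarse-grained: one task per sample, results kept per sample."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analyze, samples))


def microtask(sample: list, chunk_size: int = 2, workers: int = 4) -> int:
    """Fine-grained: split one sample into chunks and combine partial results."""
    chunks = [sample[i:i + chunk_size] for i in range(0, len(sample), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(analyze, chunks))


if __name__ == "__main__":
    print(macrotask([[1, 2, 3], [4, 5]]))
    print(microtask([1, 2, 3, 4, 5]))
```

The choice affects workflow and container design: macrotasks map naturally onto separate container runs, while microtasks usually live inside one tool.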

Page 68:

Computing Landscape

• Cloud architectures and providers are proliferating in a climate of competition

• Will platforms standardize and perhaps consolidate over time?

• Need to understand the trade-off between cost, efficiency and adherence to FAIR principles.

Page 69:

Disruptive Technologies on the Horizon

• Amazon “Lambda” serverless computing paradigm is intended to maximize utilization of resources.

• A server virtual machine is not “allocated permanently” to a given system; compute instances are fired up only when needed.

• Considerable cost reduction with present charging scheme.

• Need to understand how design of workflows and containers may be affected.
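
The cost argument can be illustrated with simple arithmetic. All numbers below are invented for the example and are not any provider's actual prices; the point is only that billing for busy seconds instead of allocated hours favors workloads that are idle most of the time.

```python
def always_on_cost(hourly_rate: float, hours: float) -> float:
    """Cost of a VM that stays allocated whether or not it is busy."""
    return hourly_rate * hours


def on_demand_cost(rate_per_second: float, invocations: int, seconds_each: float) -> float:
    """Cost when compute is billed only while an invocation is running."""
    return rate_per_second * invocations * seconds_each


# Hypothetical figures, chosen only to make the comparison concrete.
HOURLY_RATE = 0.10          # currency units per VM-hour (assumed)
PER_SECOND_RATE = 0.00005   # currency units per busy second (assumed)
HOURS_PER_MONTH = 720
INVOCATIONS = 10_000
SECONDS_EACH = 30

if __name__ == "__main__":
    vm = always_on_cost(HOURLY_RATE, HOURS_PER_MONTH)
    serverless = on_demand_cost(PER_SECOND_RATE, INVOCATIONS, SECONDS_EACH)
    print(f"always-on: {vm:.2f}  on-demand: {serverless:.2f}")
```

With these assumed figures the machine is busy for 300,000 of the 2,592,000 seconds in the month, so most of the always-on bill pays for idle time.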

Page 70:

Page 71:

+

BD2k Collections Issue

Owen White
Ian Foster
Xinzhi Zhang
Susanna-Assunta Sansone

Page 72:

+ The idea

http://collections.plos.org/hmp

Page 73:

+ Oversight Committee

Develop rules of engagement regarding consortium membership, disclosing intended publications to the group, and areas of professional conduct.

Discussion of topical areas.

Search and open call for possible manuscript authors.

Promote publication plan across BD2k network.

Promote coordination with European or other international networks.

Page 74:

+ Oversight Committee

Generation of an overview publication describing the BD2k commons, and general NIH data management ecosystem.

Organization and general announcements to the larger group.

Hold periodic meetings to discuss progress.

Coordination and milestone completion.

Generation of artwork for special collection.

Page 75:

+ Timeline

November 2016 BD2k meeting: broad announcement for special collections

November: formation of Steering Committee

November: Oversight Committee representative contacts potential journal editors

January 2017: Finalize target journal for special collections

Present to November 2017: Manuscript generation

July - November: Periodic meetings for exposure of content, discussion of progress

November 2017: Submission deadline

December 2017/January 2018: Review and revision

February 2018: publication appears

Page 76:

+ Open Issues

Iterative process / Multiple deadlines

Publish an earlier marker paper or set of position papers

Page 77:

+ The CFWG: Looming Issues

Page 78:

+ Looming Issues

■ Consortium-wide tools
■ Diversity of datatypes
  ■ Genomic / 'omic / variants
  ■ Phenotypes / patient
  ■ Clinical studies
■ Overlapping working groups
  ■ Funding / identify / mandate
  ■ NIH, trans-agency, international, NGOs
■ Awareness
■ Longevity / sustainability