



Deliverable 9.3

Project ID: 654241

Project Title: A comprehensive and standardised e-infrastructure for analysing medical metabolic phenotype data

Project Acronym: PhenoMeNal

Start Date of the Project: 1st September 2015

Duration of the Project: 36 Months

Work Package Number: 9

Work Package Title: WP9 Tools, Workflows, Audit and Data Management

Deliverable Title: D9.3 Report API access to PhenoMeNal Resources

Delivery Date: M24

Work Package Leader: IPB

Contributing Partners: IPB, UU, EMBL-EBI, CEA, CSR4

Authors: Steffen Neumann, Kristian Peters, Christoph Ruttkies, Etienne Thévenot, Pablo Moreno, Ola Spjuth, Daniel Schober, Philippe Rocca-Serra

Abstract: Developers working on PhenoMeNal need a modular environment, where functionality can be accessed programmatically through an Application Programming Interface (API). In PhenoMeNal, APIs can be used to define and manage infrastructure, to develop and use containerised packaged metabolomics tools, and to access tools and workflows on Galaxy programmatically. This deliverable reports on these APIs and gives usage examples.

Table of Contents

1 EXECUTIVE SUMMARY

2 CONTRIBUTION TOWARDS PROJECT OBJECTIVES

3 DETAILED REPORT OF THE DELIVERABLE
3.1 Overview
3.2 Infrastructure as Code
3.3 Accessing workflows on Galaxy with the API
3.4 API-based workflow installation during infrastructure deployment
3.5 Case study: batch processing MetaboLights studies with univariate and multivariate statistics
3.6 Accessing containerised packaged metabolomics tools
3.7 Accessing containers in a deployed VRE from external applications

4 DELIVERY AND SCHEDULE

5 CONCLUSION


1 EXECUTIVE SUMMARY

The web-based user-facing components and web applications on the PhenoMeNal portal are primarily aimed at biomedical end users. IT-savvy bioinformaticians and service developers working on PhenoMeNal, deploying and modifying it, need a flexible, modular environment where functionality can be accessed programmatically through an Application Programming Interface (API). These APIs can be accessed remotely or from the command line and are very important for unattended and automated operation, not least for continuous integration tests.

In PhenoMeNal, APIs can be used to 1) define Infrastructure as Code (IaC) for the deployment, management and destruction of infrastructure resources, as well as the installation and configuration of the infrastructure software, 2) access containerised packaged metabolomics tools individually, and 3) provide API access to tools and workflows within Galaxy.

This deliverable reports on the status and availability of the above-mentioned APIs and provides usage examples.

2 CONTRIBUTION TOWARDS PROJECT OBJECTIVES

The deliverable has contributed to the following project objectives:

Objective 9.1 Specify and integrate software pipelines and tools utilised in the PhenoMeNal e-Infrastructure into VMIs, adhering to data standards developed in WP8 and supporting the interoperability and federation middleware developed in WP5.

Objective 9.2 Develop methods to scale-up software pipelines for high-throughput analysis, supporting execution on e.g. local clusters, private clouds, federated clouds, or GRIDs.

Objective 9.3 Add quality control and quality assurance to pipelines to ensure high quality and reliable data, keep an audit trail of intermediate steps and results.

3 DETAILED REPORT OF THE DELIVERABLE

3.1 Overview

Many of the web-based user-facing components and web applications on the PhenoMeNal portal are primarily aimed at biomedical end users. Developers working on PhenoMeNal, deploying and modifying it, need a modular environment where functionality can be accessed programmatically, through an Application Programming Interface (API). These APIs can be accessed remotely or from the command line and are very important for unattended and automated operation, not least for the continuous integration tests.

In PhenoMeNal, APIs can be used to:

● Define Infrastructure as Code for the deployment, management and destruction of infrastructure resources, as well as the installation and configuration of the infrastructure software

● Access individual tools and whole workflows on Galaxy with the API

● Access containerised packaged metabolomics tools individually and independently

In the following, we will give detailed information on the above.

3.2 Infrastructure as Code

The PhenoMeNal infrastructure deployment is handled by the cloud-deploy-kubenow software1 and is applicable to installations on local workstations, on local clusters, and in public or private cloud environments.

A general introduction to the deployment process is given in our tutorial resources at ‘Starting a PhenoMeNal CRE via PhenoMeNal Portal2’. Technical tutorials, e.g. for clinical IT managers intending to launch a local PhenoMeNal VRE3, are provided on the wiki page ‘Starting a PhenoMeNal CRE on a local server (bare-metal)4’; guidance on CRE deployment in cloud environments can be found at ‘Starting a PhenoMeNal CRE on a public or private cloud provider5’ and is also included in the material from the de.NBI-summer-school-2017-on-cloud-computing6.

When the deployment is invoked from the cloud portal7, the configuration details are queried through the portal web application8. When it is invoked from the command line, the requested number of machines, their flavours, and other details that can be site-specific are configured through files such as config.ostack.sh9. While the interfaces and options found at public cloud providers like Google, Amazon and Microsoft are the same for all customers, OpenStack installations (local and commercial alike) can differ from one another: the differences range from the installed OpenStack version and the available OpenStack modules to the available networks and virtual machine flavours. All of these settings can be configured, and the documentation ‘Starting-up-a-PhenoMeNal-VRE-on-OpenStack10’ shows where to find this information in different OpenStack clusters.

1 https://github.com/phnmnl/cloud-deploy-kubenow/
2 https://github.com/phnmnl/phenomenal-h2020/wiki/Deployment-Cloud-Research-Environment
3 e.g. in order to keep ELSI-sensitive patient data behind a clinical firewall (bring-the-compute-to-the-data approach)
4 https://github.com/phnmnl/phenomenal-h2020/wiki/Starting-a-PhenoMeNal-CRE-on-a-local-server-(bare-metal)
5 https://portal.phenomenal-h2020.eu/help/Starting-a-PhenoMeNal-CRE-on-a-public-or-private-cloud-provider
6 https://github.com/phnmnl/phenomenal-h2020/wiki/de.NBI-summer-school-2017-on-cloud-computing
7 https://portal.phenomenal-h2020.eu/home
8 https://github.com/phnmnl/phenomenal-h2020/wiki/PhenoMeNal-Portal-Infrastructure
9 https://github.com/phnmnl/cloud-deploy-kubenow/blob/master/config.ostack.sh-template
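To give a flavour of such a site-specific configuration, the fragment below sketches what a config.ostack.sh-style file contains. The variable names and values here are illustrative assumptions only; the authoritative list is in the config.ostack.sh-template of cloud-deploy-kubenow.

```shell
# Illustrative OpenStack deployment settings (hypothetical variable names --
# consult config.ostack.sh-template in cloud-deploy-kubenow for the real ones).
export TF_VAR_cluster_prefix="phenomenal-test"   # prefix for all created resources
export TF_VAR_master_flavor="m1.medium"          # VM flavour for the master node
export TF_VAR_node_count="3"                     # number of worker nodes
export TF_VAR_node_flavor="m1.large"             # VM flavour for the worker nodes
export TF_VAR_floating_ip_pool="public"          # site-specific network name
```

Such a file is sourced before the deployment scripts run, so the same command line works unchanged across differently configured OpenStack sites.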

For software provisioning, i.e. the deployment and configuration of software inside the virtual machine infrastructure, the administration automation engine Ansible is used. The software management tasks are defined in so-called playbooks. In PhenoMeNal, they are split into 1) the generic playbooks11 for setting up the operating system and the Kubernetes cluster, and 2) playbooks12 defining the PhenoMeNal services running in the Kubernetes cluster. The latter are rather simple, since here the Helm package manager is used for the actual definition. Helm is a package manager for Kubernetes application deployments with a large user base in the IT industry. It allows administrators and users to manage parameterised deployments, upgrade them, and keep track of the deployment history (which versions were deployed with which variables, with the option to roll back after an update if needed).

We have written Helm charts for the application deployment of Galaxy13, which have been contributed to the Galaxy community. The documentation for developers who want to extend the functionality is in the README.md14 and values.yaml15 files. Further Helm charts exist for Jupyter and the Portal16. We also use Helm for the continuous delivery of tests inside Kubernetes clusters for tools with larger data sets17.
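To sketch how such a chart is used, the following writes a minimal values override and shows the corresponding Helm commands. The override key and the chart path are illustrative assumptions; the authoritative parameters are documented in the chart's README.md and values.yaml.

```shell
# Sketch: deploying the Galaxy Helm chart with a small values override.
# The override key and the chart path below are illustrative assumptions.
cat > galaxy-overrides.yaml <<'EOF'
galaxy:
  admin_email: admin@example.org
EOF

# Against a running Kubernetes cluster one would then (commented out here):
# helm install -f galaxy-overrides.yaml ./galaxy          # parameterised deployment
# helm upgrade my-galaxy -f galaxy-overrides.yaml ./galaxy
# helm rollback my-galaxy 1                               # revert to a previous revision
```

The install/upgrade/rollback cycle is what provides the deployment history tracking described above.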

3.3 Accessing workflows on Galaxy with the API

The normal access route to the tools and workflows created and maintained by PhenoMeNal is through the GUI-based web applications Galaxy and Jupyter. For the integration into other applications, or for batch processing of workflows, an API for remote execution access is required.

The Galaxy API addresses these and other situations by exposing Galaxy internals through an additional interface, known as an Application Programming Interface, or API18. In PhenoMeNal, we have created the wft4galaxy19 command line suite, which makes it possible to describe, parameterise and run workflows from the command line through YAML files, e.g. for workflow testing and for processing data sets (see the “Case study: batch processing ...” section below). wft4galaxy is a Python module that automates the running of Galaxy workflow tests. It can be used either as a local Python library or as a Docker image running inside a Docker container. It is available for download20 and documented online21.

10 https://github.com/phnmnl/phenomenal-h2020/wiki/Starting-up-a-PhenoMeNal-VRE-on-OpenStack
11 https://github.com/kubenow/KubeNow/tree/master/playbooks
12 https://github.com/phnmnl/cloud-deploy-kubenow/tree/master/playbooks
13 https://github.com/galaxyproject/galaxy-kubernetes
14 https://github.com/galaxyproject/galaxy-kubernetes/blob/master/README.md
15 https://github.com/galaxyproject/galaxy-kubernetes/blob/master/galaxy/values.yaml
16 https://github.com/phnmnl/helm-charts/
17 https://github.com/phnmnl/phenomenal-h2020/wiki/Testing-Guide-Proposal-3
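To give a feel for the YAML-based workflow description, the sketch below writes a minimal test definition. The file names, dataset labels and keys are hypothetical; the exact schema is documented at wft4galaxy.readthedocs.io.

```shell
# Write a minimal wft4galaxy test definition. File names, dataset labels and
# keys are hypothetical; see wft4galaxy.readthedocs.io for the full schema.
cat > workflow-test-suite.yml <<'EOF'
workflows:
  statistics_test:
    file: "statistics-workflow.ga"          # exported Galaxy workflow
    inputs:
      "Input Dataset": "inputs/study.tsv"
    expected:
      univariate_report: "expected/univariate.tsv"
EOF

# Run against a Galaxy instance (commented out; needs a server and API key):
# wft4galaxy --server https://public.phenomenal-h2020.eu \
#            --api-key "$GALAXY_API_KEY" -f workflow-test-suite.yml
```

Because the whole run is described in one file, the same definition serves both as a regression test and as a batch-processing recipe.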

The Galaxy API itself is described at https://galaxyproject.org/develop/api/. To use the API, you must first generate an API key for the account from which you want to access Galaxy. Please note that this key acts as an alternative means of accessing your account and should be treated with the same care as your login password. You can generate it in the Galaxy UI under user preferences (while logged in); alternatively, you can retrieve your API key by sending a baseauth GET request to /api/authenticate/baseauth. The bioblend Python library used by wft4galaxy is developed by the Galaxy community; it is available from https://github.com/galaxyproject/bioblend and documented at http://bioblend.readthedocs.io.
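As a sketch of direct API access with such a key, the helper below composes an endpoint URL that can then be passed to curl. The instance URL and the key variable are placeholders.

```shell
# Compose a Galaxy API request URL; Galaxy accepts the API key as a 'key'
# query parameter. The instance URL and the key variable are placeholders.
galaxy_api_url() {
  printf '%s/api/%s?key=%s' "$1" "$2" "$3"
}

url=$(galaxy_api_url "https://public.phenomenal-h2020.eu" "histories" "${GALAXY_API_KEY:-YOUR_KEY}")
echo "$url"
# curl -s "$url"   # uncomment to list your histories as JSON
```

The same pattern applies to the other endpoints (workflows, datasets, jobs) described in the Galaxy API documentation.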

This option makes it possible to access PhenoMeNal services instantiated as part of the VRE from an application outside of the VRE, and to take advantage of the benefits of an elastic cloud environment.

3.4 API-based workflow Installation during infrastructure deployment

Another very important usage of the Galaxy API in PhenoMeNal is the configuration and customisation of the Galaxy workflow application during the deployment of the infrastructure. During installation, several Ansible playbooks are executed to tailor the Galaxy instance after its initialisation through Helm. These reside in container-galaxy-k8s-runtime/ansible22. The configure_galaxy.py23 script uses the API to set up the initial user and to populate the instance with the PhenoMeNal workflows.

3.5 Case study: batch processing MetaboLights studies with univariate and multivariate statistics

In this case study we used the wft4galaxy24 API to batch process MetaboLights data sets with the univariate and multivariate statistical workflow25 on ISA-Tab formatted studies (quantified molecules, study factors, …). The workflow processes a MetaboLights study and generates univariate and multivariate statistical reports. It is also available through the web interface of our public Galaxy instance26 under Shared Data / Workflows / W4M Omics generic biosigner feature selection statistics27. Specifically, this statistical workflow performs 1) univariate hypothesis testing with correction for multiple testing, and 2) multivariate Orthogonal Partial Least Squares (OPLS) modelling. Such analyses can be applied to metabolomics datasets for content-based quality assurance, plausibility checks, and a first-pass statistical exploration (e.g. to discover discriminant features and to assess the quality of a PLS predictive model). The ability to perform these analyses in an automated way is therefore of interest as a preliminary statistical analysis of a dataset. In addition, they add value to the MetaboLights repository, contributing to outlier and error detection, quality management, and mining of the datasets. To validate the results, however, manual checking of the diagnostics returned by the workflow (metrics and plots) remains a critical step; the available documentation of the tools and the workflow tutorial provide comprehensive information on the interpretation of the statistical model diagnostics.

18 https://galaxyproject.org/develop/api/
19 http://wft4galaxy.readthedocs.io
20 https://github.com/phnmnl/wft4galaxy
21 http://wft4galaxy.readthedocs.io
22 https://github.com/phnmnl/container-galaxy-k8s-runtime/tree/develop/ansible
23 https://github.com/phnmnl/container-galaxy-k8s-runtime/blob/develop/ansible/configure_galaxy.py
24 http://wft4galaxy.readthedocs.io

We developed the phnmnl_statistical_workflow_api tool28 using the wft4galaxy API, which accesses a Galaxy instance with an installed PhenoMeNal e-infrastructure. It processes a given MTBLS study, selecting each assay and performing the statistics for each available factor of interest. For the study MTBLS404, for example, the available assay is ‘a_sacurine’ with ‘age’, ‘body mass index’ and ‘gender’ as three factors of interest. After an initial check that the given study is applicable for running the workflow, which includes checking the allowed number of levels for the factor of interest, a YAML file is generated to be used as input for wft4galaxy, and the workflow is invoked and processed on the PhenoMeNal cluster connected to the Galaxy instance. The generated results are downloaded locally and the history is kept in the user space of Galaxy. The phnmnl_statistical_workflow_api tool was used to process all available MetaboLights studies in batch mode, as shown in Figure 1.

25 https://portal.phenomenal-h2020.eu/help/Sacurine-statistical-workflow
26 https://public.phenomenal-h2020.eu
27 https://public.phenomenal-h2020.eu/workflow/list_published
28 https://github.com/c-ruttkies/phmnl_statistical_worfklow_api
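The per-factor YAML generation described above can be sketched as a small shell loop. The file layout and YAML keys are illustrative and do not reproduce the actual implementation of the phnmnl_statistical_workflow_api tool; the study, assay and factors are taken from the MTBLS404 example in the text.

```shell
# Sketch of the batch step: one wft4galaxy input file per factor of interest.
# Study, assay and factors follow the MTBLS404 example above; the YAML keys
# are illustrative, not the tool's actual file format.
study="MTBLS404"; assay="a_sacurine"
for factor in "age" "body mass index" "gender"; do
  slug=$(printf '%s' "$factor" | tr ' ' '_')
  cat > "${study}_${assay}_${slug}.yml" <<EOF
workflows:
  ${study}_${slug}:
    file: "statistics-workflow.ga"
    params:
      factor_of_interest: "${factor}"
EOF
done
ls "${study}_${assay}"_*.yml
```

Each generated file is then passed to wft4galaxy, so that one study yields one workflow invocation per applicable factor.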


Figure 1. Simplified workflow of the case study for batch API processing of the univariate and multivariate statistical workflow using wft4galaxy accessing a PhenoMeNal e-infrastructure.

The analysis revealed several issues in both the data and some of the tools, which resulted in corrections on different levels. For example, the initial version of the workflow was designed to handle studies with just a single declared ISA assay type; the statistics workflow was consequently improved to handle studies with multiple ISA assay types (e.g. an NMR assay, positive or negative acquisition MS assays, or even different samples obtained from one biosource or patient). Another issue was that several MetaboLights studies had inconsistent metadata annotations, empty sample names, etc., which caused the automatic processing to fail. These were reported to the MetaboLights curation team and will help to improve the overall metadata annotation and validation for metabolomics datasets.

Currently, 24 MetaboLights studies (see Table 1) are applicable to this batch processing on a technical level, which is one component of meeting Milestone MS9.2. Among the uni- and multivariate results are the OPLS plots shown in Figure 2. As mentioned above, these statistical results must not be used without ensuring that the statistical analysis itself is appropriate for the study design. To this end, as part of deliverable 8.4.2, UOXF (WP8) carried out a batch analysis of MetaboLights ISA-Tab formatted metadata. A full report is in preparation and will be provided to EMBL-EBI MetaboLights for corrections to be made.


Study     Journal
MTBLS30   Molecular Endocrinology
MTBLS32   Cell
MTBLS33   Cell
MTBLS71   Metabolomics
MTBLS96   PLOS One
MTBLS119  Phytochemistry
MTBLS120  Journal of Agricultural and Food Chemistry
MTBLS127  PLOS One
MTBLS143  Scientific Data
MTBLS144  Marine Chemistry
MTBLS155  Environmental Microbiology
MTBLS157  The ISME Journal
MTBLS171  Metabolomics
MTBLS173  Journal of Proteome Research
MTBLS265  PNAS
MTBLS267  PNAS
MTBLS327  Genome Medicine
MTBLS341  International Journal of Molecular Sciences
MTBLS345  Nature Microbiology
MTBLS350  Frontiers in Plant Science
MTBLS354  Diagnostic Microbiology and Infectious Disease
MTBLS366  Rapid Communications in Mass Spectrometry
MTBLS404  Journal of Proteome Research
MTBLS414  Metabolomics

Table 1. MetaboLights studies processed with the phnmnl_statistical_workflow_api tool. Not all annotation issues in these studies had been rectified by curators at the time of writing.


Figure 2. Score plots from the 24 OPLS analyses performed from the command line via the API. In this first version of the automated workflow, only qualitative factors with two levels were studied (hence OPLS-Discriminant Analysis models were built). Manual inspection of the diagnostics confirmed that 19 of the models were valid (i.e. performed significantly better than models built after random permutation of the response; overfitted models are boxed in red). For these models, the Q2Y metric (at the bottom of the plots) estimates the (normalised) prediction performance.

3.6 Accessing containerised packaged metabolomics tools

The containerised packaging of the metabolomics tools in the app-library makes it possible to execute the tools on any computer with a Docker installation, which is important for developers and for external users who wish to include the tools in their own setups. Tools can include usage examples in their README.md, which is automatically included in the tool description in the app-library.

The usage strings can be retrieved directly from the app-library container via:

docker run -it --rm \ container-registry.phenomenal-h2020.eu/phnmnl/phenomenal-portal-app-library \ bash -c "bin/run.sh && grep 'docker run' \ /var/www/html/php-phenomenal-portal-app-library/wiki-markdown/container-*/README.md"

With this information, developers in the bioinformatics community can use the containers with the packaged metabolomics tools in frameworks other than Galaxy, or from shell scripts, which can even be scheduled on classic HPC queueing systems like Grid Engine, LSF or SLURM. This was demonstrated at the de.NBI workshop preceding GCB2016 in Berlin, which Ruttkies and Neumann (IPB) attended.
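As a sketch of such scheduling, a SLURM batch script wrapping a containerised tool could look like the following. The image name, resource values and tool command line are placeholders, not an actual PhenoMeNal container invocation.

```shell
# Sketch: wrapping a containerised tool in a SLURM batch script. The image
# name, resources and tool command line below are placeholders.
cat > run_tool.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=phnmnl-tool
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
docker run --rm \
  container-registry.phenomenal-h2020.eu/phnmnl/<tool-image> \
  <tool-command> input.mzML
EOF
# sbatch run_tool.sbatch   # submit on a SLURM cluster
```

The actual docker run line for a given tool can be taken from the usage string retrieved with the command above.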

3.7 Accessing containers in a deployed VRE from external applications

Another use case for the tools in the VRE are programs running locally on a workstation that use a remote (non-Galaxy) API to access compute functionality in the VRE, thus taking advantage of the benefits of an elastic cloud environment. We are currently investigating examples of this client-server architecture, using e.g. REST interfaces for the server components inside the VRE.
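A minimal sketch of such a client follows, assuming a hypothetical REST endpoint and payload; both the endpoint path and the JSON fields are invented for illustration only.

```shell
# Hypothetical client-side call to a REST service inside the VRE. The
# endpoint path and JSON fields are invented for illustration only.
cat > job-request.json <<'EOF'
{"tool": "example-tool", "input_url": "https://example.org/data/input.mzML"}
EOF
# curl -s -X POST -H 'Content-Type: application/json' \
#      -d @job-request.json "https://<vre-host>/api/v1/jobs"
```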

4 DELIVERY AND SCHEDULE

The deliverable was submitted on time.

5 CONCLUSION

This deliverable summarises how different parts of a PhenoMeNal infrastructure can be accessed programmatically, ranging from the infrastructure itself to individual tools on the command line and via the Galaxy API. The usability of batch processing of studies in the MetaboLights repository has been demonstrated with the automatic processing of all applicable studies.