
TRANSFoRm FP7-247787    D7.2 Federated Infrastructure for Data Linkage


Translational Research and Patient Safety in Europe

D7.2 Federated Infrastructure for Data Linkage

Work Package Number: WP7

Work Package Title: Federated Infrastructure for Data Linkage

Nature of Deliverable: Report

Dissemination Level: Confidential

Version: 0.4

Delivery Date From Annex 1: M51

Principal Authors: S. Hajebi, A. Raj, E. O’Toole, S. Clarke (TCD)

Contributing Authors: L. Zhao, C. Golby, T. N. Arvanitis (UW); M. McGilchrist, F. Culross (UD)

Partner Institutions: Trinity College Dublin (TCD), University of Dundee (UD), University of Warwick (UW)

Internal reviewers: Ita Richardson, Theodoros N. Arvanitis (UW)

This project has received funding from the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement no 247787 [TRANSFoRm].


Version History

Version  Date        Author (partner)                                                          Changes/reason
0.1      18.04.2014  Saeed Hajebi, Amit Raj, Eamonn O'Toole                                    Initial Version for Internal Review
0.2      22.05.2014  Saeed Hajebi, Amit Raj, Eamonn O'Toole, Mark McGilchrist, Frank Culross   Incorporated feedback from internal review and added DNC
0.3      26.05.2014  Eamonn O'Toole, Theodoros N. Arvanitis (UW)                               Internal Review
0.4      29.05.2014  Eamonn O'Toole, Mark McGilchrist, Vasa Curcin                             Internal Review
0.5      30.05.2014  Eamonn O'Toole                                                            Final Edits
1.0      31.05.2014  Brendan Delaney (KCL), Vasa Curcin (IC)                                   Internal review


Table of Contents

Version History .......................................................................................................... 2  

List of Figures ............................................................................................................ 6  

List of Tables ............................................................................................................. 7  

Abbreviations ............................................................................................................. 8  

Executive Summary .................................................................................................. 9  

1   Introduction ....................................................................................................... 11  

2   Overview of Requirements ............................................................................... 13  

2.1   Service Based Infrastructure ........................................................................ 13  

2.2   Federated Secure Data Access ................................................................... 14  

2.3   Semantically Rich Registry Services ............................................................ 14  

2.4   Provenance Integration ................................................................................ 15  

2.5   Load-balancing and fault tolerance mechanisms ......................................... 15  

2.6   Investigation of Service Based Middleware Technologies ........................... 16  

2.7   Summary ...................................................................................................... 18  

3   TRANSFoRm Federated Infrastructure for Data Linkage Architecture ....... 19  

3.1   Conceptual Architecture ............................................................................... 19  

3.2   Distributed Platform ...................................................................................... 20  

3.2.1   Authentication Framework ............................................................................................ 20  

3.2.2   Secure Data Transport .................................................................................................. 20  

3.2.3   Registry Services .......................................................................................................... 21  

3.3   Data Extraction ............................................................................................. 21  

3.3.1   Data Node Connector ................................................................................................... 21  

3.4   Non-Infrastructure Components ................................................................... 22  

3.4.1   Query Formulation Workbench ..................................................................................... 22  

3.4.2   Provenance Framework ................................................................................................ 23  

3.5   Summary ...................................................................................................... 23  

4   Implementation of the Distributed Platform ................................................... 24  


4.1   Technology Stack ......................................................................................... 24  

4.2   Distributed Platform for Query and Result Data ........................................... 25  

4.2.1   Components .................................................................................................................. 25  

4.2.2   Query Lifecycle ............................................................................................................. 27  

4.2.3   Load Balancing and Fault Tolerance ............................................................................ 28  

4.3   Semantically Rich Registry Services ............................................................ 29  

4.3.1   Registry Service Capabilities ........................................................................................ 29  

4.4   Security Integration ...................................................................................... 30  

4.4.1   Secure Data Transport .................................................................................................. 30  

4.4.2   Authentication Framework ............................................................................................ 32  

4.5   Provenance Integration ................................................................................ 35  

4.5.1   Reception of unencrypted CDIM query ......................................................................... 35  

4.5.2   Query to EHR Repository .............................................................................................. 36  

4.5.3   Results from Data Source ............................................................................................. 37  

4.5.4   Retrieval of results by QWB User ................................................................................. 37  

4.6   Global User Management ............................................................................ 39  

4.6.1   User Roles .................................................................................................................... 39  

4.6.2   User Repository ............................................................................................................ 40  

4.6.3   User Management Tool ................................................................................................. 42  

4.7   Summary ...................................................................................................... 43  

5   Data Extraction .................................................................................................. 44  

5.1   DNC-WB (Query Formulation Workbench) .................................................. 45  

5.2   DNC-DS (Data Source) ................................................................................ 47  

5.3   Summary ...................................................................................................... 48  

6   Concluding Remarks ........................................................................................ 49  

7   References ......................................................................................................... 51  

8   Appendix 1 ......................................................................................................... 52  

9   Appendix 2 ......................................................................................................... 54  

9.1   CSV Template .............................................................................................. 54  


9.2   User Management Tool Screenshots ........................................................... 54  

10   Appendix 3 ..................................................................................................... 57  

10.1   Example Query ......................................................................................... 57  

10.2   Query post-substitution ............................................................................. 59  


List of Figures

Figure 1 Conceptual Architecture .............................................................................. 19  

Figure 2 Query lifecycle across platform ................................................................... 25  

Figure 3 TRANSFoRm Authentication Framework .................................................... 34  

Figure 4 Reception of unencrypted data ................................................................... 35  

Figure 5 Query to EHR data source .......................................................................... 36  

Figure 6 Results returned from the data source ........................................................ 37  

Figure 7 Retrieve query results for user .................................................................... 38  

Figure 8 Conceptual Architecture for Data Extraction and Linkage ........................... 44  

Figure 9 DNC-WB use of semantic mediator to translate data elements expressed as archetypes to local database queries, usually SQL queries. DNC-DS is not shown. ............. 47

Figure 10 TRANSFoRm User Management Tool: Login Page .................................. 54  

Figure 11 TRANSFoRm User Management Tool: Home Screen ................... 55

Figure 12 TRANSFoRm User Management Tool: Invite New User ................ 56

Figure 13 TRANSFoRm User Management Tool: View TRANSFoRm Users ........... 56  


List of Tables

Table 1 Registry Services: Data Source Information ................................................. 30  

Table 2 Classification Information for each data source ............................................ 30  

Table 3 TRANSFoRm Global User roles ................................................................... 40  

Table 4 TRANSFoRm User objects ........................................................................... 42  


Abbreviations

CROM Clinician Reported Outcome Measures

DNC Data Node Connector

QWB Query Formulation Workbench

CRIM Clinical Research Information Model

CDIM Clinical Data Integration Model

EC Eligibility Criteria

eCRF electronic Case Report Form

EHR Electronic Health Record

ODM Operational Data Model

PROM Patient Reported Outcome Measures

VarQs form variables expressed using the TRANSFoRm query model

SDB Study Database

SS Study System

TAM Technical Acceptance Model

W/S Web Service

SDM Study Design Model

SF12 Short Form 12

LDAP Lightweight Directory Access Protocol


Executive Summary

This deliverable describes the outcome of WT 7.5 (Infrastructure), which manages the extraction and linkage of data from heterogeneous datasets for the purposes of the TRANSFoRm project's use cases. This work task has developed a federated infrastructure to facilitate secure communication of query data and query results between research and clinical systems. The federated infrastructure was developed using service-based technologies, with asynchronous messaging used between the numerous distributed components that compose it.

This document outlines a list of requirements identified through examination of the work task description provided in the TRANSFoRm description of work and through consultation with the project partners developing tools that would depend on the federated infrastructure. These requirements informed an analysis of existing integration framework technologies, carried out to select the most suitable framework to provide the foundations for the TRANSFoRm federated infrastructure. Apache Camel, an open-source lightweight framework that provides a Java-based DSL and built-in load-balancing and fault tolerance mechanisms, was selected.

The federated infrastructure can be summarised as providing an authentication framework for TRANSFoRm users, secure data transport, and a semantically rich registry service holding information on available EHR data sources. This document describes each of the core components implemented to achieve these functionalities. The backbone of the infrastructure is provided by a set of proxy libraries that are deployed locally to the user-facing and data-source-facing applications developed elsewhere in TRANSFoRm. These libraries expose the functionality of the federated infrastructure through well-defined interfaces, encapsulating its complexity from the application layer at each end of the TRANSFoRm infrastructure. The proxy libraries also provide a means of integrating the TRANSFoRm provenance framework and technical security solution into the distributed infrastructure, ensuring secure transportation of query data and query results between distributed endpoints.

This document also describes the implementation of user management mechanisms for TRANSFoRm, through which users can be registered and assigned one or more user roles. These mechanisms allow users accessing TRANSFoRm to be authenticated and authorised.

Finally, this document describes the data extraction mechanisms that allow data queries to be executed against heterogeneous data sources and the results returned to the infrastructure. Data extraction is provided by the Data Node Connector component, which acts as an interface between TRANSFoRm and target clinical systems. This component is split in two, enabling different technologies to be used by TRANSFoRm and by clinical systems, and allowing clinical data to be electronically isolated from the TRANSFoRm infrastructure if required by the data owner.


1 Introduction

Health organizations and their electronic record systems are geographically dispersed throughout Europe and are subject to a heterogeneous set of security and information governance policies on sharing their data. This presents a significant challenge to TRANSFoRm [2], which seeks to provide researchers with efficient and effective access to these data in order to find suitable patients to recruit for clinical studies. This goal requires a distributed infrastructure that facilitates communication between these research and clinical systems.

This deliverable describes the outcomes of WT 7.5, the infrastructure that manages the extraction and linkage of data between heterogeneous datasets. This work task has developed a federated infrastructure to facilitate secure communication of query data and query results between research and clinical systems and to provide data extraction and linkage capabilities. To achieve this, the federated infrastructure is conceptually divided into two components: a distributed platform component and a data extraction component.

The distributed platform provides a service-based infrastructure to facilitate asynchronous communication between distributed endpoints in TRANSFoRm. This platform provides complex infrastructure workflows, encapsulated behind well-defined interfaces provided by a set of distributed components. The distributed platform also integrates the technical security solution outlined in WT 3.3 to deliver a flexible authentication framework providing policy-driven authentication and authorization for access to TRANSFoRm user-facing tools. A second feature of this integration is the provision of secure data transport across the federated infrastructure, using the signature, encryption and decryption capabilities provided by the security solution.

The distributed platform also maintains a semantically rich registry service that provides dynamic discovery of, and binding to, remote services. This registry service allows research users to identify target EHR data sources that may contain suitable patients to recruit. Provenance information is also gathered and communicated to the TRANSFoRm provenance framework [3]. Finally, load balancing and fault tolerance mechanisms are included in the technical solution to provide a highly available set of services.

Data extraction is provided by the Data Node Connector (DNC) component. This component is deployed locally to target EHR repository systems and acts as an interface between TRANSFoRm and these data sources. It receives queries through the distributed platform and translates them into queries that can be run against the local repository. When ready, it returns the results of these queries to the distributed platform, where they are securely transferred back to the researcher.

The document is organised as follows. Section 2 outlines the requirements identified to deliver the federated infrastructure; it also discusses the investigation undertaken to identify the most suitable integration framework on which to build the infrastructure. Section 3 discusses the system architecture and the functional components that comprise and integrate with the federated infrastructure. Section 4 provides a detailed implementation account of the distributed platform component of the federated infrastructure, discussing all sub-components, workflows and communication. Section 5 outlines the data extraction functionality of the federated infrastructure. Finally, the document includes concluding remarks, references and appendices, the latter containing example policies and implementation screenshots.


2 Overview of Requirements

A requirements-driven approach was used throughout the design and implementation of the TRANSFoRm federated infrastructure for data linkage. An initial set of requirements for the deliverable is outlined in the project's description of work [1]. The most fundamental of these is the provision of a service-based infrastructure enabling communication between research and clinical systems. This infrastructure provides specified TRANSFoRm users with data access to, and query of, electronic health systems and research databanks, with security ensured through the integration of security policies and the technical security solution described in deliverable D3.3 [4]. The deliverable should also include semantically rich registry services to identify and describe potential target EHR data sources, which are both geographically dispersed and conceptually and technologically distinct. The infrastructure is also required to integrate with the provenance framework to ensure that auditing of data access events is possible. Finally, to ensure highly available services, suitable load-balancing and fault tolerance mechanisms for a federated infrastructure of this nature must be investigated.

Through consultation with relevant project partners and analysis of suitable technology solutions, these initial requirements were expanded into the comprehensive list of functional requirements described in greater detail below. This section also describes the rationale behind our selection of Apache Camel as the most suitable integration framework to provide the foundation of this deliverable.

2.1 Service Based Infrastructure

Research institutes require data from clinical systems. However, different clinical systems may use heterogeneous technology infrastructures and data formats. Furthermore, the communication between these systems should be asynchronous, as the data needed for some research studies may require approval from the data owner before it can be provided, and this process will typically involve a time delay. An asynchronous service-based infrastructure should be in place to enable this communication.

The above motivates the following functional requirements:

• A mechanism to receive queries from the Query Formulation Workbench (WT 5.3) and securely route them to the targeted data sources, where they are decrypted.

• A mechanism to deliver queries to EHR repositories, where they are passed to the local Data Node Connector (DNC) instance.

• A mechanism to receive query results from the DNC and securely store them until they are requested by the researcher.

• A mechanism to deliver the results to the researcher.

• A mechanism to track and report the status of queries.

• A mechanism to uniquely identify queries.

• A mechanism to cancel an ongoing query.
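The last three requirements (unique identification, status tracking and cancellation) can be sketched as follows. This is an illustrative stdlib-only sketch, not the actual implementation (which Section 4 describes): the class, method and status names are assumptions introduced here for illustration.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: shows how queries can be uniquely identified and
// their status tracked while asynchronous approval/extraction is in progress.
// Status names are hypothetical, not taken from the TRANSFoRm implementation.
public class QueryTracker {

    enum Status { RECEIVED, ROUTED_TO_DNC, AWAITING_APPROVAL, RESULTS_READY, CANCELLED }

    private final Map<String, Status> statusById = new ConcurrentHashMap<>();

    // Assign a globally unique identifier to a newly received query.
    public String submit() {
        String queryId = UUID.randomUUID().toString();
        statusById.put(queryId, Status.RECEIVED);
        return queryId;
    }

    // Update the status as the query moves through the infrastructure.
    public void update(String queryId, Status status) {
        statusById.replace(queryId, status);
    }

    // Report the current status back to the researcher.
    public Status report(String queryId) {
        return statusById.get(queryId);
    }

    // Cancel an ongoing query.
    public void cancel(String queryId) {
        statusById.replace(queryId, Status.CANCELLED);
    }

    public static void main(String[] args) {
        QueryTracker tracker = new QueryTracker();
        String id = tracker.submit();
        tracker.update(id, Status.ROUTED_TO_DNC);
        tracker.cancel(id);
        System.out.println(id + " -> " + tracker.report(id));
    }
}
```

A concurrent map is used because, in an asynchronous messaging setting, status updates and researcher status requests arrive on different threads.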

2.2 Federated Secure Data Access

Given the sensitive nature of the information involved, the communication between research institutes and clinical systems or data sources must be secure. D3.3 Security Solution Layer describes the proposed technical security solution for TRANSFoRm. One key task of this deliverable is to integrate the D3.3 security policies and technical security solutions to enable flexible yet strongly secure policy-driven authentication and authorization.

Within the TRANSFoRm software ecosystem, different partner institutions use different authentication frameworks. A federated authentication framework (Shibboleth, as outlined in D3.3) should therefore be used to enable partners to access TRANSFoRm functionality through federated single sign-on (SSO) mechanisms.

These security needs motivate the following functional requirements:

• Integration with the security library to apply encryption/decryption to queries and results before they are transmitted across TRANSFoRm.

• Integration with the security library to apply secure policy-driven authentication and authorization to access to TRANSFoRm tools.
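The encrypt-before-transport pattern that the first requirement describes can be illustrated generically. The actual TRANSFoRm security library and its interfaces are specified in D3.3, not here; the sketch below is a standard JDK (javax.crypto) AES-GCM example with hypothetical names, shown only to make the pattern concrete.

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Generic illustration of encrypting query data before transmission and
// decrypting it at the receiving endpoint. Not the TRANSFoRm security library.
public class TransportCrypto {

    private static final int GCM_TAG_BITS = 128;
    private static final int IV_BYTES = 12;

    public static byte[] encrypt(SecretKey key, byte[] plaintext) throws Exception {
        byte[] iv = new byte[IV_BYTES];
        new SecureRandom().nextBytes(iv);               // fresh IV per message
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(GCM_TAG_BITS, iv));
        byte[] ct = cipher.doFinal(plaintext);
        // Prepend the IV so the receiving endpoint can initialise its cipher.
        byte[] out = new byte[IV_BYTES + ct.length];
        System.arraycopy(iv, 0, out, 0, IV_BYTES);
        System.arraycopy(ct, 0, out, IV_BYTES, ct.length);
        return out;
    }

    public static byte[] decrypt(SecretKey key, byte[] message) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key,
                new GCMParameterSpec(GCM_TAG_BITS, message, 0, IV_BYTES));
        return cipher.doFinal(message, IV_BYTES, message.length - IV_BYTES);
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();
        byte[] wire = encrypt(key, "CDIM query".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(decrypt(key, wire), StandardCharsets.UTF_8));
    }
}
```

An authenticated mode such as GCM is used here because, on top of confidentiality, it detects tampering with the message in transit; key distribution between endpoints is out of scope of this sketch.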

2.3 Semantically Rich Registry Services

The scope of TRANSFoRm presents a challenge: a number of heterogeneous EHR data sources may be added or removed over time, with each data source holding a different set of data and using different coding schemes. Therefore, a semantically rich registry service must be provided that can dynamically discover new data sources and bind to their services, providing researchers with details of the available data sources to choose from. Conversely, if an existing data source decides to leave TRANSFoRm, the registry service should be updated to reflect this.

This motivates the following functional requirements:

• Dynamically discover new data sources and bind to their services.

• Automatically update the list of data sources.

• Provide a list of data sources and detailed information about each of them.

• Provide semantic information about data sources and the ability to search for and select a data source.
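The register/deregister/search behaviour required of the registry can be sketched minimally as below. This is an illustrative sketch, not the implemented registry service (its actual content is listed in Section 4.3): the class and field names, example endpoints, and the reduction of "semantic search" to coding-scheme matching are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of a registry in which EHR data sources register and
// deregister themselves and researchers search for suitable sources.
public class DataSourceRegistry {

    static class DataSource {
        final String name;          // unique registry key (hypothetical)
        final String endpoint;      // service binding, e.g. a DNC URL (hypothetical)
        final String codingScheme;  // e.g. the clinical coding scheme in use
        DataSource(String name, String endpoint, String codingScheme) {
            this.name = name;
            this.endpoint = endpoint;
            this.codingScheme = codingScheme;
        }
    }

    private final Map<String, DataSource> sources = new ConcurrentHashMap<>();

    // A newly discovered data source registers itself and its service binding.
    public void register(DataSource ds) { sources.put(ds.name, ds); }

    // A data source leaving TRANSFoRm is removed from the registry.
    public void deregister(String name) { sources.remove(name); }

    // "Semantic" search, reduced here to matching on coding scheme.
    public List<DataSource> findByCodingScheme(String scheme) {
        List<DataSource> hits = new ArrayList<>();
        for (DataSource ds : sources.values()) {
            if (ds.codingScheme.equals(scheme)) hits.add(ds);
        }
        return hits;
    }

    public static void main(String[] args) {
        DataSourceRegistry registry = new DataSourceRegistry();
        registry.register(new DataSource("gp-db-A", "https://a.example/dnc", "READ2"));
        registry.register(new DataSource("gp-db-B", "https://b.example/dnc", "ICPC-2"));
        registry.deregister("gp-db-B");
        System.out.println(registry.findByCodingScheme("READ2").get(0).endpoint);
    }
}
```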

2.4 Provenance Integration

All data access events within the scope of the TRANSFoRm project should be captured. To enable this across all of TRANSFoRm, the federated infrastructure for data linkage should be integrated with the provenance and auditing service (WT 3.4).

This motivates the following functional requirements:

• Mechanisms to record, via the provenance service, the transmission of query data and query results across the distributed infrastructure.

• Mechanisms to record, via the provenance service, the encryption and decryption of query data and query results by the distributed infrastructure.

• Mechanisms to provide details of user access events to TRANSFoRm tools such as the Query Formulation Workbench (QWB).

2.5 Load-balancing and fault tolerance mechanisms

The functionality of the QWB depends on the federated infrastructure for data linkage; the infrastructure should therefore be highly available, to facilitate uninterrupted processing of query and result data. To achieve this, at least two instances of the components must be deployed at two different sites. The query load should be divided between these sites, and each site should be kept up to date with the latest changes from the other.

The requirements for fault tolerance imply that there must be no single point of failure: all functionality must be redundant. If a system experiences a failure, the overall workflow must continue to operate without interruption during the repair process. Additionally, when a failure occurs, the system must be able to isolate the fault to the failing component.
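The behaviour these requirements imply — dividing load between two sites and failing over transparently when one is down — can be sketched as follows. This is an illustrative stdlib sketch of round-robin-with-failover, not the mechanism actually deployed (Section 4.2.3 covers that); the class name and the use of a function per site are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Sketch: queries are spread round-robin over the deployed sites; if one site
// fails, the call is retried on the next so the workflow continues without
// interruption, and the fault stays isolated to the failing site.
public class FailoverBalancer {

    private final List<Function<String, String>> sites;
    private final AtomicInteger next = new AtomicInteger();

    public FailoverBalancer(List<Function<String, String>> sites) {
        this.sites = sites;
    }

    public String send(String query) {
        RuntimeException last = null;
        // Try each site at most once, starting from the round-robin position.
        for (int attempt = 0; attempt < sites.size(); attempt++) {
            int i = Math.floorMod(next.getAndIncrement(), sites.size());
            try {
                return sites.get(i).apply(query);
            } catch (RuntimeException e) {
                last = e;  // isolate the fault to this site and move on
            }
        }
        throw last;  // only reached if every site is down
    }

    public static void main(String[] args) {
        Function<String, String> siteA = q -> { throw new RuntimeException("site A down"); };
        Function<String, String> siteB = q -> "handled by B: " + q;
        FailoverBalancer lb = new FailoverBalancer(Arrays.asList(siteA, siteB));
        System.out.println(lb.send("q1"));
    }
}
```

Keeping the round-robin counter shared across calls is what divides load between healthy sites; the failover loop is what removes the single point of failure.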

2.6 Investigation of Service Based Middleware Technologies

The need to exchange data between different and distributed applications is not

specific to TRANSFoRm and is common across all systems composed of two or

more applications. There are a number of existing integration frameworks that aim to

provide a foundation on which to build such systems, providing functionalities and

patterns that are commonly required. We identified the following criteria to help select

the most suitable integration framework available:

• Open source

• testability

• Java based Domain specific language (DSL)

• popularity

• IDE-Support

• error handling

• monitoring support

• number of components for interfaces, technologies and protocols

• expandability

We identified a comprehensive set of potential candidates that included:

• Apache Camel

• Spring Integration

• Spring Batch

• JBPM

• nexusBPM

• Drools

• hadoop/cascading

• full ESBs such as Service Mix, Mule ESB, OpenESB, and JBossESB [5].

After an investigation of the possible alternatives with the considerations outlined

above, Apache Camel was chosen as the most suitable framework. Camel is a lightweight, open-source (Apache License version 2.0)1 framework with a Java-based DSL. It provides the following features [6], [7]:

• Concrete implementations of all the widely used Enterprise Integration Patterns (EIPs)

• Connectivity to a great variety of transports and APIs

• Easy-to-use Domain Specific Languages (DSLs) to wire EIPs and transports together

• Pluggable data formats and type converters for easy message transformation between CSV, EDI, Flatpack, HL7, JAXB, JSON, XmlBeans, XStream, Zip, etc.

• Pluggable languages to create expressions or predicates for use in the DSL, including EL, JXPath, Mvel, OGNL, BeanShell, JavaScript, Groovy, Python, PHP, Ruby, SQL, XPath, and XQuery

• Support for the integration of beans and POJOs in various places in Camel

• Support for testing distributed and asynchronous systems using a messaging approach

Camel is a popular integration framework with a vibrant community, and it supports an array of different data formats. It provides a comprehensive list of more than 130 pre-packaged components for accessing various backend systems, and its overall architecture is highly extensible. It is built on the Spring framework, allowing a great deal of customization and extension. Because of this openness, it can be deployed stand-alone or embedded as a component within other applications or frameworks [6].
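To illustrate the style of integration Camel enables, the following is a minimal sketch of a route in Camel's Spring XML DSL. The endpoint URIs and bean name are hypothetical, not taken from the TRANSFoRm codebase; the sketch only shows how a message can be received, handed to a Spring bean, and forwarded to a JMS queue without any custom transport code.

```xml
<camelContext xmlns="http://camel.apache.org/schema/spring">
  <route id="querySketch">
    <!-- receive a message from an in-memory endpoint -->
    <from uri="direct:incomingQuery"/>
    <!-- hand the message body to a Spring bean for processing -->
    <to uri="bean:queryEncryptor"/>
    <!-- forward the processed message to a JMS queue -->
    <to uri="jms:queue:outgoingQueries"/>
  </route>
</camelContext>
```

The `direct:`, `bean:`, and `jms:` components are wired together declaratively, which is the property that makes Camel attractive for connecting distributed components.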

With specific consideration for TRANSFoRm, Camel allows developers to create

dynamic workflows that connect distinct functional components for seamless

integration. It also allows proxy interfaces to be deployed; this hides the complexity of

the distributed infrastructure from the applications, such as the QWB, that use it.

Camel also provides a selection of load-balancing policies and fault-tolerance mechanisms during communication, making it an ideal candidate for use in the TRANSFoRm federated infrastructure for data linkage.

1 Apache License version 2.0 is a permissive license similar to the MIT License, but it also provides an express grant of patent rights from contributors to users.

2.7 Summary

In this section, the requirements of the federated infrastructure for data linkage (WT 7.5) have been explained. The main requirements include:

• A service-based infrastructure that enables communication between research

and clinical systems

• Providing federated secure data access and query of electronic health record

systems and research databanks

• Integrating the security policy and the technical security solutions enabling

flexible yet strongly secure policy-driven authentication and authorization

access.

• Integrating with the provenance and auditing service to enable accurate

capture of all occurrences of data access.

• Provision of semantically rich registry services

• Load-balancing and fault-tolerance mechanisms to provide a highly available

service.

• Investigation of a variety of non-disruptive and asynchronous data linkage

mechanisms with EHR data repositories.

This section also provided a discussion on the alternatives for non-disruptive and

asynchronous data linkage mechanisms, detailing the reasons behind choosing

Apache Camel as our integration framework.


3 TRANSFoRm Federated Infrastructure for Data Linkage Architecture

This section provides a brief overview and description of the system architecture

across the TRANSFoRm Federated Infrastructure for Data Linkage. Additionally, the

functional components that form and interact with this infrastructure are outlined.

3.1 Conceptual Architecture

Figure 1 Conceptual Architecture

TRANSFoRm aims to develop methods to integrate primary care clinical and

research activities with an essential component being the provision of a secure

means of communication between these geographically and conceptually distinct

endpoints. This communication is provided by the TRANSFoRm distributed

infrastructure for data extraction and linkage (WT 7.5) which connects researchers,

using the Query Formulation Workbench (QWB), to target EHR data sources by

means of service-based technologies.

As described in Figure 1, this deliverable is composed of two broad components,

reflecting the two goals to be achieved. The first of these is a distributed platform

enabling communication across TRANSFoRm. This component of the deliverable is itself composed of three parts. The first is the Authentication Framework, which

provides user authentication and role based authorisation to user-facing


TRANSFoRm tools such as the QWB. The second component, secure data

transport, provides secure transmission of data queries and data results using

asynchronous messaging and data encryption. This allows queries to be securely

delivered to the target EHR repositories.

The second component of the Federated Infrastructure involves functionality for data

extraction. This is provided by the Data Node Connector (DNC). When the query

arrives at the target EHR site, a local instance of the DNC translates the query and

allows the data owner to execute it against the data source. When combined with the

distributed platform, this achieves the goal of the federated infrastructure for data

linkage.

Provenance information for the entire query workflow is captured by the distributed

provenance system (Deliverable 5.2 TRANSFoRm Provenance Tool), which is

integrated with each component across the architecture.

3.2 Distributed Platform

The distributed platform provides an essential backbone for the TRANSFoRm

system, providing a secure means of communication between research and clinical

systems. The objectives of the distributed platform can be summarised under three

broad purposes: an authentication framework for TRANSFoRm users, secure data transport for queries and results across distributed endpoints, and registry services.

3.2.1 Authentication Framework

The Authentication Framework is a dynamic extensible authentication service for the

various and distributed TRANSFoRm services. Based on the Security Assertion Markup Language (SAML), and in particular Shibboleth, the framework provides federated

single sign-on authentication for user-facing TRANSFoRm tools. The framework

assigns global roles to registered users and allows them to perform certain actions in

each TRANSFoRm tool. This enables role-based access and authorization to be

implemented across the various TRANSFoRm software tools.

3.2.2 Secure Data Transport

Secure data transport is the fundamental goal of the Distributed Platform where

communication is enabled between research and clinical systems to provide secure

data access and query of electronic health record systems and research databanks.

This is achieved by using a service-based infrastructure with asynchronous


messaging being used across distributed components of the platform. These

distributed components receive an initial Clinical Data Integration Model (CDIM)

query from the QWB and transmit that query to the target EHR data sources where it

is delivered to the local DNC.

Security is ensured by integrating the security library, described in D3.3 Security

Solution Layer [4], which encrypts all query requests and query results before they

are transmitted across the distributed infrastructure. Specifically, when clinical

researchers design eligibility criteria and instruct the QWB to query selected EHR

data sources, the QWB invokes the infrastructure API, which encrypts the request

using the security library. The encryption of the query request happens before any

external communication takes place. The query remains encrypted until it arrives at

the target EHR data source, where the security library is used to decrypt the query

as it is now safe to do so. The lifecycle of a query is described in greater detail in

Section 4.2 below.

In addition to providing secure messaging, the distributed platform interacts with the

TRANSFoRm provenance service to capture relevant provenance information across

different phases of this secure transmission process.

3.2.3 Registry Services

A semantically rich registry service is provided that can dynamically discover new data sources and bind to their services, providing the QWB with details of available data sources to choose from. It connects to each DNC to discover new data sources as they appear, automatically updates the list of data sources, and provides this list, together with detailed information about each data source, to the QWB.

3.3 Data Extraction

3.3.1 Data Node Connector

The Data Node Connector (DNC) component acts as the interface between queries

arriving via Secure Data Transport and the local data source from which the data

needs to be extracted. It provides data extraction functionality necessary for the

Federated Infrastructure. Once the Secure Data Transport has decrypted the query, this interface translates the query, arriving from the QWB in CDIM format, into an executable form that can be processed and executed by the local data


source. The Semantic Mediator is used in this translation process, with the tool also providing a user-facing console, residing at the data provider site, that displays each arriving data query in a form meaningful to the data controller at the site. In addition to the

query formulation in local coding, each arriving entry contains context information for

the query: study agreement details, approved person attached to that study,

approved organization attached to that study, and explanation and purpose of the

query in natural language.

Once the query results are ready to be returned to the researcher, the DNC passes

them to the Secure Data Transport, where they are encrypted before being

transferred across the distributed infrastructure back to the QWB.

The functionalities of the Distributed Platform are provided to the DNC via a proxy,

which provides a clear interface and encapsulates the complexities of the underlying

infrastructure.

3.4 Non-Infrastructure Components

3.4.1 Query Formulation Workbench

The Query Formulation Workbench (QWB) is used to create, manage, store and

deploy queries of clinical data to identify subjects for clinical studies, evaluate trial

feasibility and to analyse the numbers of matching subjects in cohort studies, while

facilitating the extraction of data relating to epidemiological studies. Specifically, the

QWB provides a user interface for clinical researchers to create clinical studies,

design eligibility criteria, initiate distributed queries, monitor query progress, and

report query results. The QWB is based on the TRANSFoRm Clinical Research Information Model (CRIM) and uses the CDIM model for constructing queries,

together with the Vocabulary service for coding query concepts in supported

terminologies.

The QWB integrates with both aspects of the Distributed Platform. It uses the

Authentication framework to authenticate users and handle different user roles thus

allowing access to the QWB to be limited to registered TRANSFoRm users only. In

addition, the QWB utilises the Secure Data Transport layer provided by the

Distributed Platform to securely route queries to target EHR repositories, provide updates on submitted queries, and retrieve results when they are ready.

The functionalities of the federated infrastructure are provided to the QWB via a proxy,


which provides a clear interface and encapsulates the complexities of the underlying

platform.

3.4.2 Provenance Framework

The TRANSFoRm Provenance framework controls and manages the access to

provenance data created during the operation of TRANSFoRm tools. Making

TRANSFoRm tools provenance aware enables the investigation of data sources and

the services that produced a particular output, together with the individuals who

instigated the requests and received the outputs. In such a way, user behaviour and

data manipulation can be audited, to verify that correct decisions were made and

appropriate procedures were followed. Data privacy, legal and ethical regulations

restrict provenance data from being stored in a central repository. The provenance

framework mirrors the distributed EHR data access infrastructure, by implementing a

decentralised platform for provenance capture, storage and querying. More details

about the provenance service can be found in [28].

The distributed infrastructure invokes provenance services to annotate events

throughout a query’s lifecycle. This is achieved by connecting to the central

provenance service at specific points to store data describing query reception,

encryption/decryption and execution events. More details on this are provided in

Section 4.5 below.

3.5 Summary

This section provided a high-level description of the TRANSFoRm architecture and

the components composing and interacting with the federated infrastructure for data

linkage. The infrastructure is conceptually divided into two sets of components: those providing a distributed platform and those providing data extraction. The distributed platform is responsible for secure communication between distributed endpoints and is itself composed of a number of distributed components that communicate using a service-based infrastructure. Secure data

transport is provided by integrating the TRANSFoRm Security Solution (WT 3.3) into the

distributed platform, with provenance being used to annotate and audit the query

lifecycle across the platform. Data extraction is provided by the DNC component

which acts as an interface between the distributed infrastructure and the local data

repository.


4 Implementation of the Distributed Platform

In this section, we explain the details of the implementation of the first component of the Federated Infrastructure for Data Linkage, the Distributed Platform. In the next section, Section 5, we discuss the data extraction aspect of this deliverable. The purpose of the Distributed Platform is to provide secure transportation of query and result data, as well as to facilitate authentication of users on the TRANSFoRm platform. We first review the technology stack used in this project, and then explain the different components of this infrastructure: the Middleware Proxy (Front Side), the Security Library, the Middleware Services, the Data Source Registry Services, the Middleware Proxy (Backend Side), and the Results Processor Component. We also explain the workflow of a query. The details of Security Integration and Provenance Integration are discussed separately.

4.1 Technology Stack

The Distributed Platform of TRANSFoRm is responsible for collaborating with a

number of different software applications running on different partners’ sites to

process the query data and results. This infrastructure uses a stack of different

technologies for the enterprise communication and middleware services. The

different technologies in our technology stack are:

• Apache web server

• Apache Tomcat application server

• Apache Camel

• Spring framework

• Java SDK

• Shibboleth Single Sign-on (SSO) framework

• Java Messaging Service (JMS)

• MySQL Database

• Maven compiler plugin

• LDAP

The Shibboleth SSO framework was used to implement the user authentication.

Once users are authenticated, they are allowed to use the middleware services.

The Apache web server is used to configure the domain name of the Middleware


services and SSL for secure communication over HTTP protocol. The Middleware

web application was deployed in the Tomcat server. In the Middleware web

application, several Camel endpoints were published as web services. The frontend

and backend can communicate with Middleware web services over secure HTTP.

The actual business logic of web services is written as Spring services. Maven is

used to compile the applications. Maven also enables easy migration from a lower

version to a higher version of the included libraries, such as Spring framework 2.5 to

3.0. The Middleware library for the backend side uses a JMS queue for communication with the Data Node Connector. The metadata regarding the query and data source is saved in a MySQL database located at the Middleware site.

4.2 Distributed Platform for Query and Result Data

This section provides a detailed description of the distributed platform in terms of the lifecycle of query and result data. The description includes the constituent components, the query lifecycle and workflow, and the load balancing and fault tolerance of the software. The platform supports query construction, secure transportation to a designated data source, and storage of the decrypted results in a secure FTP location, as shown in Figure 2.

Figure 2 Query lifecycle across platform

4.2.1 Components

This platform is implemented as a distributed system in which the secure query transportation functionality is distributed among the following components.


Middleware Proxy (Front Side): This component is deployed in the frontend site

where it acts as a proxy of our middleware services. This proxy contains a set of

methods that allow the QWB to execute a query and get the results. Additionally, it

contains a set of methods to obtain the detailed information about the data sources

registered in the Registry Services. The QWB provides the original CDIM query and

the HTTPRequest to the executeQuery() method as two parameters. The

HTTPRequest contains the user’s browser request to the server (i.e. QWB) and is

used by the Security Library (outlined below). The executeQuery() method propagates the query through a Camel workflow, where it is encrypted and passed on to the Middleware. The Camel workflow uses the security library to encrypt the original query.

Security Library: The Security Library has two primary functions: (1) checking whether the HTTPRequest received from the QWB is authenticated with the Shibboleth SSO framework, and (2) encrypting/decrypting the query. Upon authentication, SAML assertions

are attached to the corresponding HTTPRequest. When SAML assertions are

available in the request, the library encrypts the query with a private key. A security

policy is created for different users to allow dedicated functionality to a class of users.

Middleware Services: This component is used to store the encrypted query, update

the status of query processing and enable a secure transportation of the query to the

designated data source.

Data Source Registry Services: Whenever a new data source becomes available,

it is first registered in the Data Source Registry. This registry uses a MySQL

database where all the relevant information regarding a data source is saved. The

QWB can get the list of registered data sources along with their detailed information

through the corresponding methods available in the Middleware Proxy (Front End).

This is a web component that provides the data sources information over HTTP

protocol in the JSON format.
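As an illustration, a registry entry served over HTTP might look like the following JSON. The field names mirror Table 1 in Section 4.3.1, and the values are sample data only, not a real registry record.

```json
{
  "dataSource_id": 1,
  "name_of_registry": "Clinical Practice Research Datalink",
  "host_instititution": "Medicines and Healthcare Regulatory Agency (MHRA)",
  "geographical_coverage": "UK",
  "type_of_system": "General Practice Repository",
  "alive": "Yes"
}
```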

Middleware Proxy (Backend Side): This component is deployed at the backend site along with the DNC. Its business logic is written as a Camel route. This route

polls the Middleware to check if a new query is available. When a new query

becomes available, it retrieves the encrypted query and uses the security library to

decrypt the query. The decrypted query is sent to a JMS queue for processing at

DNC. Another Camel route receives the results, encrypts them and sends them to


the secure FTP location.

Results Processor Component: This component is deployed at Custodix, Belgium,

the TRANSFoRm results store partner. When the encrypted results are ready from the DNC, our Middleware Proxy (Backend Side) puts them in a secure FTP location at the Custodix site. The QWB can retrieve these encrypted results from this FTP location. Moreover, the Results Processor Component continuously polls this secure FTP to identify whether a new result is available. When one is, the component retrieves and decrypts it. The decrypted results are then

placed in another secure FTP at the Custodix site, in a directory corresponding to the

user who created the query.

4.2.2 Query Lifecycle

The query flows across several components via the secure data transport layer to be processed. The following steps describe the sequential workflow of query

processing. The step numbers outlined here correspond to the numbers described in

Figure 2 above.

1. A CDIM query created in the QWB is passed to the platform using the

Middleware Proxy, a library located in the QWB application. This library

packages the query by encrypting and signing it using the Security Library

from WT 3.3. The encrypted query is then sent to the Middleware server

where it is stored in a MySQL database.

2. At the data source side, another Middleware Proxy (backend side) periodically

polls the Middleware server requesting any new queries that are intended for

that data source. If new or unprocessed queries are available, they are sent to

the Data Node Connector, still encrypted, via the Middleware Proxy.

3. The Middleware Proxy decrypts the query using the Security Library from WT

3.3 and sends it to the Data Node Connector. The Data Node Connector takes

the CDIM query and manages the workflow of passing it to the Semantic

Mediator for translation into SQL, and executing the query against the

corresponding database.

4. Once the query is processed and results are ready, they are returned to the

Middleware Proxy (backend side) for encryption.

5. The Middleware Proxy once again packages the results by signing and

encrypting them using the Security Library from WT 3.3. The encrypted results

are placed into a secure FTP server (sFTP1) that the data source specifies.


6. The results processor component located local to the secure FTP (sFTP1)

retrieves the encrypted results and decrypts them using the Security Library

from WT 3.3. The decrypted results are stored in another secure FTP (sFTP2)

location.

7. Throughout the workflow, the query status is updated to reflect the current stage of the query processing life cycle. Once the results are ready, the user of the QWB can retrieve them with a request routed through the Middleware Proxy (Front End). The proxy pulls the encrypted results from the secure FTP (sFTP1) and returns them to the QWB application. The Middleware Proxy (Front End), located in the QWB application, decrypts the results using the Security Library from WT 3.3 and provides the decrypted results to the user.
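The steps above can be condensed into a plain-Java sketch of the data flow. Everything here is a simplified stand-in: encrypt and decrypt are toy markers in place of the WT 3.3 Security Library, and the maps stand in for the Middleware database and the two secure FTP locations.

```java
import java.util.HashMap;
import java.util.Map;

// Toy end-to-end model of the query lifecycle (steps 1-7 above).
public class QueryLifecycleSketch {
    // Stand-ins for the WT 3.3 Security Library.
    static String encrypt(String s) { return "ENC(" + s + ")"; }
    static String decrypt(String s) { return s.substring(4, s.length() - 1); }

    // Stand-ins for the Middleware store and the secure FTP locations.
    static Map<String, String> middlewareDb = new HashMap<>();
    static Map<String, String> sftp1 = new HashMap<>();
    static Map<String, String> sftp2 = new HashMap<>();

    public static void main(String[] args) {
        String query = "CDIM query";

        // Step 1: front-side proxy encrypts and stores the query.
        middlewareDb.put("q1", encrypt(query));

        // Steps 2-3: backend proxy polls, decrypts, and the DNC executes.
        String received = decrypt(middlewareDb.get("q1"));
        String results = "results for " + received;

        // Steps 4-5: results are encrypted and placed on sFTP1.
        sftp1.put("q1", encrypt(results));

        // Step 6: the results processor decrypts into sFTP2.
        sftp2.put("q1", decrypt(sftp1.get("q1")));

        System.out.println(sftp2.get("q1")); // results for CDIM query
    }
}
```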

4.2.3 Load Balancing and Fault Tolerance

In a large-scale heterogeneous system that is composed of several components

implemented in different languages and using different communication protocols,

enterprise integration is a crucial task. Apache Camel provides good integration support for most such technologies and languages. Moreover, it provides out-of-the-box load balancing and fault tolerance features.

The concepts of load balancing and fault tolerance are strongly connected with each other. A critical large-scale system is required to be fault tolerant in order to maintain high availability. The most important task here is to identify single points of failure (SPOFs), because these may cause catastrophic failures. To avoid the Middleware being a SPOF, it was replicated and deployed at two locations: the University of Warwick and King's College London.

In the Middleware, several temporary faults can occur such as a database deadlock

or temporary outage. In such cases, the Camel workflows inside the Middleware will

fail to process the exchange. In order to deal with such temporary faults, we use

the Camel Dead Letter Channel. This channel re-processes the failed exchanges after a time interval.
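As an illustration of this mechanism, a dead-letter error handler with delayed redelivery can be declared in Camel's Spring XML DSL along the following lines. The queue names, retry count, and delay are illustrative values, not the project's actual configuration.

```xml
<camelContext xmlns="http://camel.apache.org/schema/spring">
  <!-- failed exchanges are redelivered up to 3 times, one minute apart,
       before being parked on the dead-letter queue -->
  <errorHandler id="dlc" type="DeadLetterChannel"
                deadLetterUri="jms:queue:failedExchanges">
    <redeliveryPolicy maximumRedeliveries="3" redeliveryDelay="60000"/>
  </errorHandler>
  <route errorHandlerRef="dlc">
    <from uri="jms:queue:incomingQueries"/>
    <to uri="bean:queryProcessor"/>
  </route>
</camelContext>
```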

As the replicated copies of Middleware are deployed at different locations, it is

necessary to balance the load among them. Camel provides a load balancer, which we use in the project. This functionality delegates each request to one of the two available endpoints using a load-balancing policy. In addition to the existing policies, a user can create their own load-balancing policy. We placed the load balancer alongside the Middleware Proxy (Front End); as a consequence, it will be down when

the QWB application is down. The Camel endpoints, published by the Middleware instances at both the University of Warwick and King's College London, are provided to the Middleware Proxy (Front End) so that the load balancer can delegate the query processing requests in a balanced manner.
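The round-robin behaviour that such a policy provides can be sketched in plain Java as follows. This is an illustrative model of the policy's semantics, not Camel's implementation, and the endpoint names standing in for the Warwick and KCL Middleware deployments are hypothetical.

```java
import java.util.List;

// Sketch of the round-robin semantics that Camel's load balancer
// provides out of the box: each request goes to the next endpoint in turn.
public class RoundRobinSketch {
    private final List<String> endpoints;
    private int next = 0;

    public RoundRobinSketch(List<String> endpoints) {
        this.endpoints = endpoints;
    }

    // Delegates to the next endpoint, spreading load evenly across replicas.
    public String choose() {
        String endpoint = endpoints.get(next);
        next = (next + 1) % endpoints.size();
        return endpoint;
    }

    public static void main(String[] args) {
        RoundRobinSketch lb = new RoundRobinSketch(
                List.of("middleware-warwick", "middleware-kcl"));
        System.out.println(lb.choose()); // middleware-warwick
        System.out.println(lb.choose()); // middleware-kcl
        System.out.println(lb.choose()); // middleware-warwick
    }
}
```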

4.3 Semantically Rich Registry Services

Different data sources may be added to or removed from the project, each having a different set of data and using different coding schemes. A semantically rich registry service

must be provided. It must be able to dynamically discover new data sources and bind

to their services. If a data source is removed from the project, it should be removed

automatically from the list of data sources.

4.3.1 Registry Service Capabilities

• Dynamically discovering new data sources and binding to their services

• Automatically updating the list of data sources

• Providing the list of data sources and detailed information about them

• Providing semantic information about data sources and the possibility to search for and select a data source

The registry should hold provenance metadata about each data source (not informatics provenance, but the actual origin of the harvested data) and quality information. Tables 1 and 2 show the list of required fields for each data source and the list of classification information for each data source.

Fields | Description (and some sample data)
dataSource_id |
connection_address |
description |
registry | Registry: GPRD
name_of_registry | Name of the registry: Clinical Practice Research Datalink
host_instititution | Host institution: Medicines and Healthcare Regulatory Agency (MHRA)
host_contact_email | Host contact e-mail: [email protected]
host_contact_phone | Host contact phone: +44 (0) 20 7084 2383
controlling_institution | Controlling institution: MHRA
controller_contact | Controller contact: John Parkinson
controller_email | Controller e-mail: null
geographical_coverage | Geographical coverage: UK
legal_jurisdiction | Legal jurisdiction: England & Wales, Scotland
language | Language: English
type_of_system | Type of system: General Practice Repository
dbms | DBMS: Proprietary
publication_url | Publication URL: http://www.cprd.com
data_source | Data source: GPIS
start_year | The beginning of the period
end_year | The end of the period
committee | Committee: ISAC
number_of_practices | Number of practices: 600
number_of_patients | Number of patients: 54000000
patient_consent | Patient consent
contain_physical_examination_data | Contains physical examination data
contain_lifestyle_data | Contains lifestyle data
contain_medication_data | Contains medication data
contain_lab_results | Contains lab results
contain_genetic_markers | Contains genetic markers
linkable_to_genetic_data | Linkable to genetic data
linkable_to_a_cancer_registry | Linkable to a cancer registry
linkable_to_a_drug_registry | Linkable to a drug registry
linkable_to_a_hospital_registry | Linkable to a hospital registry
linkable_to_a_population_registry | Linkable to a population registry
already_linked_to | Already linked to? (Text)
linkage_planned | Linkage planned? (Text)
linkage_actually_not_foreseen | Linkage actually not foreseen, except from participation in TRANSFoRm (Text)
alive | Is the data source alive or not (Yes/No)
first_heartbeat | The first time the data source joined (date/time)
last_heartbeat | The last heartbeat we have from the data source (date/time)
last_update | The last time that the information of the data source was updated (date/time)

Table 1 Registry Services: Data Source Information

The functionalities of Registry Services is provided to the QWB by three methods

which are available as a part of Middleware Proxy. These methods include:

• MiddlewareService.getAllDataForAllDataSources();

• MiddlewareService.getAllDataSourceIDs();

• MiddlewareService.getAllDataForADataSource(int did); (did: data source id)

The return type of all three methods is a Map containing the relevant data.
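As an illustration of how a QWB-side caller might consume these registry methods, the following sketch uses the method names listed above, but the in-memory stub and field values are hypothetical stand-ins for the real Middleware Proxy:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the Middleware Proxy registry methods.
// The method names mirror those listed above; the stub data is illustrative only.
public class RegistrySketch {

    // Map of data source id -> (field name -> value), e.g. "alive" -> "Yes"
    private static final Map<Integer, Map<String, String>> REGISTRY = new HashMap<>();

    static {
        Map<String, String> cprd = new HashMap<>();
        cprd.put("data_source", "GPIS");
        cprd.put("alive", "Yes");
        REGISTRY.put(1, cprd);
    }

    public static Map<Integer, Map<String, String>> getAllDataForAllDataSources() {
        return REGISTRY;
    }

    public static List<Integer> getAllDataSourceIDs() {
        return new java.util.ArrayList<>(REGISTRY.keySet());
    }

    public static Map<String, String> getAllDataForADataSource(int did) {
        return REGISTRY.get(did);
    }

    public static void main(String[] args) {
        // A caller can enumerate data sources, then fetch the details for each.
        for (int did : getAllDataSourceIDs()) {
            System.out.println(did + " -> " + getAllDataForADataSource(did));
        }
    }
}
```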

Fields (with description and some sample data):
dataSource_id dataSource_id
terminology
version
comments Registry: GPRD

Table 2 Classification Information for each data source

4.4 Security Integration

4.4.1 Secure Data Transport

The secure data transport layer provides a secure transportation infrastructure for the data among different components of the system. Here, the data includes the query


generated by the QWB and the results obtained from the data sources. In order to

securely transport the query from the QWB to the designated data sources, we use the Security Library from WT 3.3 to encrypt the query. We have defined a security policy

for each user-role to bind the user actions with their roles. A policy file is an XML file

composed of one or more ‘policy’ blocks, which are wrapped in a ‘policies’ root

element. The security library is directed to apply one of these policies, which can involve encrypting, decrypting, or transforming the data through an XSLT. An example

of a policy file is included in Appendix 1. The security library also includes a private

key that is used in the encryption process.
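The structure described above might look like the following sketch. Only the ‘policies’ root and ‘policy’ blocks come from the description above; the attribute and child-element names are invented for illustration, and the real schema is shown in Appendix 1:

```xml
<policies>
  <!-- Each 'policy' block tells the security library what to do with the data. -->
  <policy id="encrypt-query">        <!-- 'id' attribute: hypothetical -->
    <encrypt algorithm="AES"/>       <!-- encryption step: hypothetical -->
  </policy>
  <policy id="transform-results">
    <xslt href="results.xsl"/>       <!-- XSLT transformation step: hypothetical -->
  </policy>
</policies>
```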

We use the HTTP protocol for transportation between the front-end (QWB side) and the Middleware. In order to secure this transportation, we use the secure HTTP (HTTPS) protocol, configured through the Apache web server. In a web communication over HTTPS, the request is encrypted at the producer side using session keys negotiated during the SSL handshake, which is based on the SSL certificate, and decrypted at the consumer side using the same session keys. This ensures that only authenticated consumers holding the right SSL credentials can decrypt and see the original request.
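Configuration of this kind in the Apache web server typically looks like the following sketch. The directives are standard mod_ssl directives, but the hostname and file paths are invented for illustration:

```apache
<VirtualHost *:443>
    ServerName middleware.example.org
    SSLEngine on
    SSLCertificateFile    /etc/ssl/certs/middleware.crt
    SSLCertificateKeyFile /etc/ssl/private/middleware.key
    # Require clients to present a trusted certificate (mutual SSL),
    # so only authenticated consumers can use the channel.
    SSLVerifyClient require
    SSLCACertificateFile  /etc/ssl/certs/transform-ca.crt
</VirtualHost>
```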

At the QWB side, the encrypted query and user id are sent to the Middleware component over SSL using the HTTPS protocol. The Middleware component cannot

decrypt the encrypted query as it does not have the security library used to encrypt

the query. Instead, it saves the encrypted query package in a database to enable

status updates to be provided. The query waits in this database until it is requested

by the backend (data source) side of the platform.

We issue another SSL certificate for HTTPS communication between the Middleware and the back-end side. When the back-end requests a new query to process, the

encrypted query is sent from the Middleware component to the back-end over this

secure HTTPS channel. The back-end component contains the security library, thus

it can decrypt the encrypted query to obtain the original CDIM query. Then, this query

is pushed into the JMS queue provided by DNC. The results, obtained from DNC, are

encrypted using the Security Library. The encrypted results are sent to the secure

FTP location (on the Custodix site) over a secure channel.

When the QWB users issue a request for the results of their query, the corresponding

results are retrieved from the secure FTP location and passed on to the QWB over a

secure channel. When the encrypted results are received at the QWB side, the


results are decrypted using the same security library that was used to encrypt the

original query. Finally, the decrypted results are presented to the user through the

QWB.
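The encrypt-at-the-edges pattern above can be sketched in a few lines of Java. This is a minimal illustration using plain JDK AES as a stand-in for the project's Security Library; the class and method names are invented, and a real deployment would use the Seclib policies described earlier:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.nio.charset.StandardCharsets;

// Illustrative sketch: the query is encrypted before it leaves the QWB side,
// so the Middleware (which lacks the key) can only store and forward it.
public class QueryTransportSketch {

    public static SecretKey newKey() {
        try {
            return KeyGenerator.getInstance("AES").generateKey();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static byte[] encrypt(SecretKey key, byte[] plain) {
        try {
            Cipher c = Cipher.getInstance("AES");
            c.init(Cipher.ENCRYPT_MODE, key);
            return c.doFinal(plain);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static byte[] decrypt(SecretKey key, byte[] ciphertext) {
        try {
            Cipher c = Cipher.getInstance("AES");
            c.init(Cipher.DECRYPT_MODE, key);
            return c.doFinal(ciphertext);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        SecretKey key = newKey(); // held only at the QWB and back-end sides
        byte[] query = "<cdim-query>...</cdim-query>".getBytes(StandardCharsets.UTF_8);
        byte[] packaged = encrypt(key, query);
        // The Middleware stores 'packaged' without being able to read it;
        // only a holder of the key can recover the original query.
        byte[] recovered = decrypt(key, packaged);
        System.out.println(new String(recovered, StandardCharsets.UTF_8));
    }
}
```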

4.4.2 Authentication Framework

The TRANSFoRm Authentication Framework is provided using the Security Assertion Markup Language (SAML), which is a flexible authentication standard aimed at web

services. The SAML standard defines an XML-based framework for describing and

exchanging security information between on-line partners. This information is

expressed in the form of portable SAML assertions that applications working across

security domain boundaries can trust.

The SAML standard defines precise syntax and rules for requesting, creating,

communicating and using these SAML assertions. As a result, SAML provides the

ideal foundation to handle the broad range of organisations and the significant

geographical dispersal involved in TRANSFoRm.

In particular, the TRANSFoRm Authentication Framework uses Shibboleth, an open

source software package built upon and including SAML that provides federated

single sign-on authentication. The use of SAML as the underlying standard ensures

that non-Shibboleth implementations of SAML, such as simpleSAMLphp for example,

can also be integrated into the Authentication Framework.

SAML relies upon a number of core concepts and roles that create the architecture

for the Authentication Framework. These are:

• Service Provider: A Service Provider (SP) is any TRANSFoRm application

that requires users to be authenticated in order to access the application.

• Identity Provider: The Identity Provider (IDP) provides the Single Sign-On

Service as part of the Authentication Framework. The IDP stores information

about the user and authenticates the user’s identity by requiring the user to log

in with their username and password.

• Assertion: The IDP can assert security information to the SP in the form of

XML statements about the user. These statements are known as an Assertion.

For instance, a SAML assertion could state the user’s name, project role

and contact details.

• Metadata: Metadata is used to express and share configuration between the


IDP and SP. This describes the IDP and SP to each other and tells them where they can be found. Sharing this metadata is a fundamental part of integrating a new SP into the authentication framework.
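A minimal SAML 2.0 assertion carrying a user role, as the IDP might return it, is sketched below. The element names follow the SAML 2.0 assertion schema, but the issuer, subject and attribute values are invented, and a real assertion would also carry a digital signature and validity conditions:

```xml
<saml:Assertion xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion"
                ID="_example" Version="2.0" IssueInstant="2014-01-01T00:00:00Z">
  <saml:Issuer>https://idp.example.org/idp</saml:Issuer>
  <saml:Subject>
    <saml:NameID>jbloggs</saml:NameID>
  </saml:Subject>
  <saml:AttributeStatement>
    <saml:Attribute Name="employeeType">
      <saml:AttributeValue>ROLE_RESEARCH</saml:AttributeValue>
    </saml:Attribute>
  </saml:AttributeStatement>
</saml:Assertion>
```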

TRANSFoRm uses a centralised identity provider where information on users and their global user roles is stored in an LDAP repository. Although centralised in this

case, Shibboleth supports a federated identity provider architecture, allowing the

authentication to be easily expanded if necessary. Users are added to this repository

using a web based management tool that is not itself a part of the identity provider.

The user roles and this web based tool are described in greater detail in Section 4.6.

When a user tries to access a TRANSFoRm user-facing application, the

authentication framework is invoked to authenticate the user’s identity and provide

details on the user to the application allowing authorisation decisions to be made on

user actions in that application. The steps taken during this authentication process

are:

1. The user attempts to access a resource on the SP sp.example.com.

2. The SP sends an HTTP redirect response to the browser. This redirection

contains the destination address of the Sign-On Service at the IDP together with

an authentication request <AuthnRequest>.

3. The IDP challenges the user, via their browser, to provide valid credentials

(username and password).

4. The user provides the valid credentials and a local logon security context (login

session) is created for the user at the IDP.

5. The IDP builds a SAML assertion representing the user’s logon security context.

The assertion is digitally signed and then placed within an HTML form.

6. The browser issues an HTTP POST request to send the form to the SP’s

assertion consumer service.

7. An access check is made to establish whether the user has the correct

authorization to access the resource. If the access check passes, the resource is

then returned to the browser. Please note, this access check is made by the

application, using the information contained in the SAML assertion returned from

the IDP.


Figure 3 TRANSFoRm Authentication Framework


4.5 Provenance Integration

The TRANSFoRm provenance framework is integrated into the federated

infrastructure for data linkage by a series of workflows that are triggered at different

points throughout the query lifecycle. These workflows invoke the central provenance

service using web services, passing information to annotate and audit the event. The occasions when this occurs, as well as a description of the workflows, are outlined below.

4.5.1 Reception of unencrypted CDIM query

The federated infrastructure connects research and clinical systems to target EHR

data-sources. At both ends of this infrastructure, unencrypted data, such as a CDIM

query from the QWB, or results from the Data Node Connector (DNC) are passed to

the federated infrastructure. However, for the purpose of this section we will begin

with the creation of the query.

Figure 4 Reception of unencrypted data

In this instance we include the Query Formulation Workbench (QWB) as the initiating

application, which triggers the workflow by passing a query to the Middleware Proxy

library (described in section 4.2). The proxy library invokes the provenance service to

tell it that it has received a new query and it is about to begin packaging the query for

transmission by encrypting it. Included in this call is a provenance URI that is passed

from the QWB along with the query, which enables provenance to associate the

packaging with connected events in the QWB.


The initial call to provenance returns a URI, middlewareProvUri, which is added to

the query package about to be encrypted. This URI is included as it will be used

when the query is unencrypted at the EHR repository. Next the proxy library uses the

security library (Seclib) to encrypt and sign the package and once this is complete

another call is made to provenance to annotate the end of this packaging process.

Once this is complete, the proxy library submits the query to the middleware services

before once again calling the provenance services to inform it that the packaged

query has now been sent.
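The sequence of provenance calls in this packaging workflow can be sketched as follows. The ProvenanceService interface, the event names, and the in-memory recording stub are hypothetical illustrations, not the project's actual API (which is invoked over web services):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the 4.5.1 workflow: each step is reported to a
// provenance service before and after the query is packaged and submitted.
public class PackagingWorkflowSketch {

    interface ProvenanceService {
        // Records an event linked to an existing provenance URI and
        // returns a URI identifying the recorded event.
        String record(String provUri, String event);
    }

    // In-memory stand-in that logs event names and mints simple URIs.
    static class RecordingProvenance implements ProvenanceService {
        final List<String> events = new ArrayList<>();
        public String record(String provUri, String event) {
            events.add(event);
            return "urn:prov:" + events.size();
        }
    }

    static String packageAndSubmit(ProvenanceService prov, String qwbProvUri, String query) {
        // 1. Announce reception of the new query; the returned middlewareProvUri
        //    travels inside the encrypted package for later use at the EHR repository.
        String middlewareProvUri = prov.record(qwbProvUri, "receivedQuery");
        String packaged = "encrypted(" + query + "|" + middlewareProvUri + ")"; // stand-in for Seclib
        // 2. Annotate the end of the packaging (encryption) process.
        prov.record(middlewareProvUri, "packagedQuery");
        // 3. Submit to the middleware services, then report the send.
        prov.record(middlewareProvUri, "sentQuery");
        return packaged;
    }

    public static void main(String[] args) {
        RecordingProvenance prov = new RecordingProvenance();
        packageAndSubmit(prov, "urn:prov:qwb:42", "<cdim-query/>");
        System.out.println(prov.events); // prints [receivedQuery, packagedQuery, sentQuery]
    }
}
```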

4.5.2 Query to EHR Repository

The next event that is annotated in provenance is the transmission of the encrypted

query to the EHR data source site. The workflow begins when the Middleware Proxy library requests a new query and a query exists for that data source. In that

instance, the middleware service calls the provenance service to record that it is

sending the encrypted query data to the middleware proxy located at the data

source.

Figure 5 Query to EHR data source

When the query arrives at the middleware proxy library, provenance is informed that

the query has arrived and the proxy library then begins to unencrypt the query using

the security library (Seclib). Once unencrypted, the contents of the package are

accessible, and the middlewareProvUri that was initially encrypted with the package is sent to provenance to record that the package has been unpackaged (unpackageData). The original CDIM query is passed to the DNC for processing.


4.5.3 Results from Data Source

The next point in the query life cycle where provenance information is recorded is

when the query results are ready and returned from the Data Node Connector. This

workflow is similar to the reception of an unencrypted query from the QWB and involves annotating the reception of unencrypted results and the encryption process

that is applied to those results. Once the results are encrypted they are placed in a

secure sFTP location until the researcher requests that they are retrieved.

Figure 6 Results returned from the data source

4.5.4 Retrieval of results by QWB User

The final workflow involving provenance is triggered when a QWB user requests the

count results of their query to be retrieved from the sFTP location. The request is

sent to the Middleware Proxy library, which retrieves the encrypted results and

makes a call to the central provenance service to record that it has received that

data. Next the query results are unencrypted using the security library (Seclib) and

provenance is informed that the results are now unencrypted. Finally, the result set is returned to the QWB application, where it can be accessed by the requesting researcher.


Figure 7 Retrieve query results for user


4.6 Global User Management

TRANSFoRm is composed of many heterogeneous applications and users. This

provides a number of challenges when trying to manage these users and their

access to TRANSFoRm tools. For certain applications, the same user may possess

several different roles based on application specific concepts, which are impossible

to capture at a global project level due to the complexity involved. To address this

problem, we define user roles at two levels: a global level, with project-wide roles that are maintained by the federated infrastructure, and an application level, where the

user’s specific application roles are maintained (e.g. researcher access to a particular

study on the QWB). This distinction is discussed in Section 4.4 where the

architecture of the authentication framework is outlined.

4.6.1 User Roles

TRANSFoRm’s operational security policy [8] presents the identified user roles at

both levels of the system. Here, we concentrate on just the global user roles which

are managed centrally by the federated infrastructure. The table below provides a

description of the global user roles that capture all required user features at a global

level across the TRANSFoRm infrastructure. The “Administrator” role is a key role

and is limited to certain authorised individuals. These users manage the global

management system for TRANSFoRm with the power to invite new users and delete

or edit existing users. They are also the only users who can access the full

functionality of the User Management Tool (described in Table 3) with all other users

being limited to changing the passwords for their accounts.


Role Name Role-Id Description

Researcher ROLE_RESEARCH A typical research user of TRANSFoRm user-facing tools such as the QWB.

General Practitioner ROLE_GP Clinical general practitioners who access TRANSFoRm.

Administrator ROLE_ADMIN Administrators may create new users and invite them to TRANSFoRm. They may also manage existing users by changing their user role and other information.

Table 3 TRANSFoRm Global User roles

4.6.2 User Repository

OpenLDAP [9], an open source implementation of the Lightweight Directory Access

Protocol (LDAP), is used to store and manage TRANSFoRm user information in the federated infrastructure. LDAP was chosen as it is an open industry standard for managing distributed directory information, making it ideally suited to a federated

infrastructure such as TRANSFoRm. In LDAP, user and group entries in the repository are represented as objects, with a tree structure being used to provide a

hierarchy between the objects in the repository. Every entry contains a set of

attributes with an attribute being defined in a schema and possessing one or more

values. All entries are also uniquely identified with a Distinguished Name (DN) which

is constructed from attributes of the entry and the parent entry’s DN.

TRANSFoRm uses a shallow hierarchy to store user information. The repository is

split into two branches with one branch, “People”, storing all the user entries,

including information on each user’s global role. The second branch contains a set of

groups, one for each type of user role. These groups contain the DN for each user

who has that role in TRANSFoRm. This effectively duplicates the user role

information contained in the user’s individual entry. However, it provides easy access

to information about which users hold which roles and thus makes managing the

repository easier. It also simplifies the implementation of authorisation policies on the

repositories as, for example, the ability to add new users or change existing users

can be limited to members of the “Administrator” group.


Figure 8 TRANSFoRm User Repository Structure

User entries are stored as a combination of “InetOrgPerson” and “Person”, two object classes defined by LDAP to create entries for users and provide specific attributes. The set of attributes stored for each user, and their descriptions, is outlined in Table 4.


InetOrgPerson Attribute

TRANSFoRm Attribute Name

Description

DN Distinguished Name This is a required attribute of all LDAP entries. In TRANSFoRm it is composed of the entry UID plus the branch of the repository it is in.

UID User ID A unique user id for each user

Title Title The title of the user, e.g. Mr, Mrs, Dr, Prof etc.

SN Surname The user’s surname

givenName First Name The user’s first name(s)

postalAddress Institution/Organisation The institution, organisation or clinic that the user belongs to

mail Email The email of the user

CN Common Name This is a required attribute of the inetOrgPerson object class. It is composed by combining the user’s first name and surname.

employeeType User Role This contains the user’s global role. It will be one or more of ROLE_RESEARCH, ROLE_GP and ROLE_ADMIN; an individual user may hold more than one role.

Table 4 TRANSFoRm User objects
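An entry following the attribute scheme in Table 4 might look like the LDIF sketch below. The attribute names come from Table 4, but the base DN, the group object class and all values are invented for illustration:

```ldif
# Hypothetical user entry in the "People" branch
dn: uid=jbloggs,ou=People,dc=transform,dc=eu
objectClass: inetOrgPerson
uid: jbloggs
cn: Joe Bloggs
givenName: Joe
sn: Bloggs
title: Dr
mail: j.bloggs@example.org
postalAddress: Example Institution
employeeType: ROLE_RESEARCH

# Hypothetical group entry duplicating the role membership
dn: cn=Researcher,ou=Groups,dc=transform,dc=eu
objectClass: groupOfNames
cn: Researcher
member: uid=jbloggs,ou=People,dc=transform,dc=eu
```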

4.6.3 User Management Tool

The TRANSFoRm user management tool is a web based application that is

predominantly designed to allow administrators to manage TRANSFoRm users and

their roles at a global system level. The user management tool is implemented in

Spring MVC and provides a set of core functionalities to users to enable them to

connect to and update the LDAP repository that contains global user information.

These functionalities are:

1. Create a single user (invite a new user to TRANSFoRm)

2. Create a batch set of users (invite a group of users to TRANSFoRm)

3. Edit User information

o Roles

o Email

o Title

4. Complete Registration


5. Change Password

Numbers 1 to 3 are limited to the administrator users, whilst all users may use

functionalities 4 and 5. Inviting a single user is completed via a web form, where the

administrator enters the user’s name, title, email address, institution and role. Once

submitted this completes the first part of registration of a user, where the user is

added to the LDAP repository. However, to enable full registration to be completed,

an email is automatically sent to the user containing a uniquely generated URL. The

user is invited to click the link to complete registration, where they are returned to the

User Management Tool and prompted to set their password. Once set, this

completes the user creation process.

The User Management tool also allows administrators to invite a group of users at

once to TRANSFoRm. This functionality, which is limited to general practitioner users

only, allows the user to specify a CSV file from which to upload user information. Each entry in

this file is then created in turn in the same manner as a single user with emails being

sent to all included users.

A number of screenshots from the User Management tool, as well as a template for

the batch user creation CSV file are provided in Appendix 2 of this deliverable.

4.7 Summary

In this section we explained the implementation details of the distributed platform

component of the TRANSFoRm Federated Infrastructure for Data Linkage. We

started by reviewing the technology stack which is used in this project, then we

explained different components of this platform. We also explained the workflow of a

query and details of Security and Provenance Integration. In the next section we

outline the data linkage and extraction components of the federated infrastructure.


5 Data Extraction

The TRANSFoRm platform realises two basic operations: (1) a request for patient

counts or data, issued by a study QWB and targeted at a clinical or genetic

repository, and (2) requests for data embedded within an ODM, issued by a study

system and targeted at an EHR. In this report we focus on the first kind of operation.

The platform’s conceptual components, which mediate these operations, are shown below in Figure 8.

Figure 8 Conceptual Architecture for Data Extraction and Linkage

Between the study system (including QWB) and the data source (clinical repository,

genetic repository) are various connectors, and a set of components which together

are called the Data Node Connector. The former simply offer a means of moving

information around using message queues, web service calls or files; while the latter

provide the management of activities necessary to complete the operations.

The DNC for this scenario is split into two parts, one platform-facing, and the other data source facing. This was necessary for two reasons: (1) there was a requirement

that the networks on which the platform and clinical repositories reside could be

electronically isolated, requiring a physical intervention to complete operations, and

(2) to permit distinct technologies to be used by the platform (presently Java-based)


and components accessing the clinical repository or EHR (presently variable in terms

of specific relational database management system (RDBMS) and access methods).

The QWB is the source of single queries (for counts) and multiple queries (for data).

In the single count query and multiple data query scenarios, DNC-WB polls the distributed platform connector (outlined in Section 4 above) for available query messages. These connectors are message queues such as JMS. Both these connectors provide security for the messages, since they will cross ‘foreign’ networks between the QWB and the DNCs. Messages are available if a query has been submitted by the study QWB; the embedded queries are subject to translation by the semantic mediator, and DNC-WB ensures this.

After semantic mediation the queries are held as files and are transferred to the

control of DNC-DS. This may occur automatically, or under the supervision of the

controller of the data source. For example, for one repository within the TRANSFoRm project (NIVEL), it was a requirement that the file containing a query be moved physically between two separated networks and inspected before continuing its execution; the console component permits this inspection and authorisation.

For a clinical repository, DNC-DS’s workflow is relatively straightforward: a single

count-query or multiple data-request-queries are parsed according to the query

model and individual SQL queries executed against the relational database using a

SQL connector; and the results consolidated into counts for return to the study

system QWB, or data files for delivery to a safe location for analysis.

5.1 DNC-WB (Query Formulation Workbench)

The DNC-WB is used to receive queries from the QWB by polling the message

queues of the middleware. The query or queries contained within the message are

targeted at clinical or genetic repositories to establish the numbers of patients

satisfying study eligibility criteria, or to provide data for cohorts previously identified.

This DNC underpins the Diabetes use-case (WT 1.1) and the BMS use-case (WT?).

The workflow for this DNC is relatively straightforward. Queries are extracted from

the encrypted messages and parsed according to the query model (WT 6.4). This

yields a set of (CDIM-augmented, openEHR) archetypes defining the required data


elements and their constraints, e.g. ‘Laboratory HbA1c > 7.5%’.

Each archetype, specified in ADL, is submitted to the Semantic Mediator which uses

the data source model (DSM) and CDIM-DSM mapping model to translate the

archetypes to an equivalent query for the local data source (see Figure 9). In all

cases so far this is a SQL query. These SQL queries are then re-embedded within

the overall query XML document in place of the archetype elements. See Appendix 3

for examples.
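For instance, the archetype constraint ‘Laboratory HbA1c > 7.5%’ might be translated for one local schema roughly as follows. The table, column names and local code are purely illustrative; the real translated examples are in Appendix 3:

```sql
-- Hypothetical local schema: lab_results(patient_id, read_code, value, event_date)
SELECT patient_id, event_date
FROM lab_results
WHERE read_code = '42W5.'   -- local code for an HbA1c result (illustrative)
  AND value > 7.5;
```

Note that, as described in Section 5.2, the translated query returns the patient identifier and a time-point for each matching value, so the results can feed the logical and temporal operators of the overall query.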

The updated query is now placed as a message within a file to which the console

component has access. The console component will determine whether this

message file will be ‘carried’ to the data-facing part of the DNC (named DNC-DS).

This will happen automatically if local governance allows this. However, any data

source can opt to inspect the query in the message to decide whether it is to be

executed. Rejected messages will generate an exception message response for use

by the QWB. DNC-DS will process the query: data retrieved (for each archetype in

the original query) are subsequently combined according to the logical and temporal

operators within the remainder of the overall query. DNC-DS will then compose a

response using the same format as the original, but with counts inserted, and return

the response to the file-based message queue. DNC-WB posts this file back to the

QWB using the middleware provided by the distributed platform. The results are

encrypted by the distributed platform before they leave the data owner’s jurisdiction.

Note that the file-based communication between DNC-WB and DNC-DS is not

secured explicitly by TRANSFoRm as both components reside within the same

organisational jurisdiction, and the organisation itself is expected to take all the

necessary precautions. (Both DNC-WB and DNC-DS will be given access rights to

file systems and databases associated with the user account(s) under which they

run.)

The activities of DNC-WB are reported to the provenance service at key points in the

workflow. There is no reporting of workflow in relation to the DNC-DS. The hosting

organisation may however audit this activity using its own mechanisms.


Figure 9 DNC-WB use of semantic mediator to translate data elements expressed as archetypes to local database queries, usually SQL queries. DNC-DS is not shown.

5.2 DNC-DS (Data Source)

As discussed in the section above, the DNC-DS receives the message through the file-queue boundary connector, parses the query contained in the message file, and executes the embedded local (SQL) queries at the appropriate points in the parse using the database access connector. The SQL queries always produce a

record set which includes the patient identifier and time-point for the data values

satisfying the data element criteria, the patient identifier being applied to logical

operators and the time-points to temporal operators. The final result of the logic is a

single record set, which forms the final patient count arising from the query. All record

sets are held in memory to avoid the need for a local database for use by the DNC.
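The application of logical operators over patient identifiers can be sketched as plain set operations on the in-memory record sets. This is an illustrative simplification (class and method names invented; temporal operators over time-points are omitted):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch: each data element's record set is reduced to its set of
// patient identifiers, and logical operators combine these sets in memory.
public class RecordSetLogic {

    // AND: patients present in both record sets
    public static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> out = new HashSet<>(a);
        out.retainAll(b);
        return out;
    }

    // OR: patients present in either record set
    public static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> out = new HashSet<>(a);
        out.addAll(b);
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical patient ids matching two data element criteria
        Set<Integer> hba1c = new HashSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> t2dm = new HashSet<>(Arrays.asList(2, 3, 4));
        // The size of the combined set is the final patient count
        System.out.println("count = " + and(hba1c, t2dm).size()); // prints count = 2
    }
}
```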

DNC-DS then embeds the counts in the original XML message which is placed in the

file-queue boundary connector for inspection by the console. As with the incoming

message, the onward transmission of this message to DNC-WB can be automatic, or

the message can be inspected before transmission. The message can obviously be

rejected at this point and substituted with an exception message which the QWB can

parse.

When extracting data for submission to the study system, a set of extract queries – one for each archetype of interest – is used to extract data, which is placed in an output file for transmission (by sFTP) to the study system. CDIM provides the meta-data for


these data elements and the analysts can store and structure this data as required. A

further ‘flag patient’ query is also provided as part of the data extract request to

specify the patients for which data is required.
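The extract step can be sketched as follows. This is an illustration only, not the DNC code; the function names, row shapes, and archetype labels are hypothetical:

```python
# Illustrative sketch: the 'flag patient' query selects the patients of
# interest, then one extract query per archetype contributes rows to a
# single output file for transmission to the study system.
import csv
import io

def build_extract(flagged_patients, archetype_rows):
    """Write rows for flagged patients only, tagged with their archetype."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["archetype", "patient_id", "value"])
    for archetype, rows in archetype_rows.items():
        for patient_id, value in rows:
            if patient_id in flagged_patients:
                writer.writerow([archetype, patient_id, value])
    return out.getvalue()

extract = build_extract(
    flagged_patients={1, 2},
    archetype_rows={
        "dob.v1": [(1, "1970-05-01"), (3, "1981-02-14")],
        "diagnosis.v1": [(2, "E11")],
    },
)
# The resulting file would then be transmitted (e.g. by sFTP) to the
# study system, where analysts store and structure the data as required.
```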

It should be noted that for the current version of the platform no reverse mapping of

local code systems to common coding systems is provided. Therefore, the coded

data at the study system consists of local codes – whether truly local, national or

international.

5.3 Summary

This section described the implementation of data extraction mechanisms for the

TRANSFoRm Federated Infrastructure for Data Linkage. These functionalities are

provided by the DNC component. The DNC is located local to target EHR data

sources and acts as an interface between TRANSFoRm and the data owner's

infrastructure. Due to the heterogeneous technical characteristics and requirements

of different data sources, the operation of data extraction requires that the DNC is

split into two parts: one facing the platform (DNC-WB), the other facing the data

source (DNC-DS).


6 Concluding Remarks

Within the TRANSFoRm project we have developed a federated infrastructure to

achieve data linkage between research and clinical systems. To achieve this we

have conceptually split the infrastructure into two components. The first is a

distributed platform to handle communication between distributed endpoints, provide

a semantically rich registry service, and support user authentication. The second

conceptual component provides data extraction functionalities to the federated

infrastructure.

The distributed platform component is built upon Apache Camel, an open-source,

lightweight integration framework that provides a comprehensive set of Enterprise

Integration Patterns (EIPs) as well as load balancing and fault tolerance

mechanisms. This has facilitated a service based infrastructure with asynchronous

messaging across a set of distributed components, providing integrated research and

clinical systems with a highly available set of services.
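Camel routes are written in a Java-based DSL; purely to illustrate the kind of Enterprise Integration Pattern involved, a round-robin load-balancing step between a message source and a set of endpoints can be sketched as follows (this is not Camel itself, and the endpoint names are hypothetical):

```python
# Minimal sketch of the round-robin load-balancing EIP that integration
# frameworks such as Apache Camel provide out of the box.
from itertools import cycle

class RoundRobinBalancer:
    """Deliver each incoming message to the next endpoint in turn."""

    def __init__(self, endpoints):
        self._endpoints = cycle(endpoints)

    def route(self, message):
        endpoint = next(self._endpoints)
        return endpoint, message

balancer = RoundRobinBalancer(["endpointA", "endpointB"])
first, _ = balancer.route("query-1")   # goes to endpointA
second, _ = balancer.route("query-2")  # goes to endpointB
third, _ = balancer.route("query-3")   # wraps around to endpointA
```

Combined with asynchronous messaging, this is what lets the platform present a highly available set of services even when individual endpoints are slow or unavailable.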

The distributed platform provides secure data transport between these systems using

three components. Two of these components are deployed local to research and

data source applications as middleware proxy libraries. These libraries provide the

functionalities of the distributed platform – data communication, data encryption and

registry service – in a well-defined interface that encapsulates complex distributed

workflows across the federated infrastructure. They also integrate the TRANSFoRm

technical security policy (D3.3), signing and encrypting query and results data before

they are sent across the federated infrastructure, which ensures the security and

integrity of this sensitive information.
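The sign-before-send idea can be sketched as follows. This is an illustration only: the actual scheme is defined by the D3.3 technical security policy, HMAC stands in here for the real signature mechanism, and encryption of the payload is omitted:

```python
# Sketch only: a proxy library signs the query payload before transmission
# so the receiver can verify integrity and origin. HMAC over SHA-256 is
# used purely as a stand-in for the platform's actual signature scheme.
import hashlib
import hmac

def sign(payload: bytes, key: bytes) -> bytes:
    return hmac.new(key, payload, hashlib.sha256).digest()

def verify(payload: bytes, signature: bytes, key: bytes) -> bool:
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(sign(payload, key), signature)

key = b"shared-secret"  # hypothetical key material
query = b"<EligibleSubjectCountRequest>...</EligibleSubjectCountRequest>"
tag = sign(query, key)

assert verify(query, tag, key)          # untampered payload verifies
assert not verify(b"tampered", tag, key)  # modified payload is rejected
```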

The third component in the distributed platform is the middleware server which

communicates with deployed proxy libraries. This component maintains an index of

submitted queries as well as a semantically rich register of available data source

information. Both of these can be queried through the deployed proxy libraries

throughout TRANSFoRm. In each component, provenance information is recorded

and communicated to the TRANSFoRm provenance service to allow all data access

requests to be fully audited across the federated infrastructure.
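A lookup against such a register might conceptually resemble the following. This is a sketch only; the registry entries, metadata fields, and function name are all hypothetical:

```python
# Illustrative sketch of querying a semantically rich data-source register:
# each entry carries metadata, and callers filter on the properties they need.
registry = [
    {"name": "sourceA", "coding": "ICPC2", "country": "NL"},
    {"name": "sourceB", "coding": "READ", "country": "UK"},
]

def find_sources(**criteria):
    """Return data-source entries whose metadata matches all criteria."""
    return [entry for entry in registry
            if all(entry.get(k) == v for k, v in criteria.items())]

matches = find_sources(country="NL")  # entries for Dutch data sources
```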

The other aspect of the distributed platform, described in this deliverable, is the

authentication framework, which is used to authenticate users accessing

TRANSFoRm user-facing tools. This framework is based on SAML, with Shibboleth,


an industry standard implementation of SAML, being selected to power a centralised

identity provider for TRANSFoRm users. This authentication framework redirects

users to the identity provider whenever they attempt to access TRANSFoRm, where

they authenticate using a unique username and password. Once authenticated, the

user is returned to the application together with information describing the user,

which enables the application to determine what actions each user is authorised

to perform.
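The redirect flow can be sketched as follows. This is a simplified illustration, not Shibboleth itself; the IdP URL, session shape, and role names are hypothetical:

```python
# Highly simplified sketch of the SP-initiated redirect flow: requests
# without an authenticated user are redirected to the identity provider;
# once user attributes are present, the application decides authorisation.

def handle_request(session, idp_url="https://idp.example.org/sso"):
    if "user" not in session:
        return ("redirect", idp_url)  # send the user to the IdP to log in
    user = session["user"]
    # The application uses the returned attributes to authorise actions.
    allowed = "researcher" in user.get("roles", [])
    return ("ok", allowed)

assert handle_request({}) == ("redirect", "https://idp.example.org/sso")
assert handle_request({"user": {"roles": ["researcher"]}}) == ("ok", True)
```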

The authentication framework requires a repository of users and their information to

be maintained, and a user-friendly tool for adding and managing users.

The Lightweight Directory Access Protocol (LDAP) is used to create a repository of

users, whilst a web-based User Management Tool has been developed to manage

this user directory.
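As an illustration of the kind of entry such a directory holds, a user record might look like the following LDIF fragment. The DN layout, organisation values, and attribute choices here are hypothetical; only the standard inetOrgPerson attributes are assumed:

```ldif
# Hypothetical user entry; the actual directory layout is deployment-specific.
dn: uid=jdoe,ou=users,dc=transformproject,dc=eu
objectClass: inetOrgPerson
uid: jdoe
cn: Jane Doe
sn: Doe
mail: jane.doe@example.org
o: Example Institution
```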

Data extraction is provided by the Data Node Connector (DNC). The DNC acts as the

interface between queries arriving via the distributed platform and the local data

source from which the data needs to be extracted. This interface involves the

translation of a query arriving in CDIM format into an executable form that can be

processed and executed by the local data source. A semantic mediator is used in this

translation process, with the DNC also providing a user-facing console, residing at the

data provider site, that displays arriving data queries in a form meaningful to the data

controller at the site.
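The translation idea can be sketched as follows: a mapping from CDIM data elements to local table and column names (the kind of knowledge the semantic mediator supplies) is used to rewrite an archetype criterion as local SQL. The mapping content here mirrors the NIVEL example in Appendix 3, but the function and structure names are illustrative:

```python
# Illustrative CDIM-to-local-SQL rewriting; not the mediator implementation.
# The cdim_000007 -> CLIENT.GEBOORTEDATUM pairing follows Appendix 3.
MAPPING = {
    "cdim_000007": ("CLIENT", "GEBOORTEDATUM"),  # date of birth
}

def criterion_to_sql(cdim_id, op, value):
    """Rewrite one archetype criterion as SQL against the local schema."""
    table, column = MAPPING[cdim_id]
    return f"SELECT DISTINCT ID_CLIENT FROM {table} WHERE {column} {op} '{value}'"

sql = criterion_to_sql("cdim_000007", "<=", "1977-01-01")
```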

Due to the complex and heterogeneous technical requirements of different data

sources, the DNC is split into two parts: one facing the platform and other

TRANSFoRm-developed tools (DNC-WB), the other facing the data source (DNC-

DS). This enforces electronic separation of clinical repositories and permits distinct

technologies to be used by the distributed platform and data access components.


7 References

[1] “Translational Research And Patient Safety In Europe, ICT-2009.5.2-247787, Annex I- Description of Work,” 2011.

[2] “TRANSFoRm Project.” [Online]. Available: http://www.transformproject.eu/.

[3] V. C. A. Anjum, “D3.1: TRANSFoRm Provenance Framework,” 2011.

[4] S. Farrell, “TRANSFoRm Technical Security Framework,” 2011.

[5] “Spoilt for Choice: Which Integration Framework to use – Spring Integration, Mule ESB or Apache Camel.” [Online]. Available: http://www.kai-waehner.de/blog/2012/01/10/spoilt-for-choice-which-integration-framework-to-use-spring-integration-mule-esb-or-apache-camel/.

[6] “Apache Camel.” [Online]. Available: http://www.methodsandtools.com/tools/tools.php?camel.

[7] “Open Source Integration with Apache Camel and How Fuse IDE Can Help.” [Online]. Available: http://java.dzone.com/articles/open-source-integration-apache.

[8] “User roles in TRANSFoRm tools – operational security policy,” 2013.

[9] “OpenLDAP.” [Online]. Available: http://www.openldap.org/.


8 Appendix 1

This Appendix includes a sample policy file that can be used by TRANSFoRm’s

federated infrastructure. The policies included contain signature and encryption

policies, for example “researcherSignature”. Additionally, the file contains decryption

policies for returned query results, “researcherResults”.

<policies>

<policy name="researcherSignature">

<sign>

<signer>researcher</signer>

</sign>

<encrypt content-only='true'>

<recipient>queryProcessor</recipient>

</encrypt>

</policy>

<policy name="analystSignature">

<sign>

<signer>analyst</signer>

</sign>

<encrypt content-only='true'>

<recipient>queryProcessor</recipient>

</encrypt>

</policy>

<policy name="researcherResults">

<decrypt match='//body'>

<recipient>researcher</recipient>

</decrypt>

<verify>

<allowed-signer>queryProcessor</allowed-signer>

<allowed-signer>relayInstitution</allowed-signer>

</verify>

<!-- Convert to HTML (to demonstrate) -->

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output indent="no" omit-xml-declaration="yes" method="html" cdata-section-elements="pre"/>

<xsl:template match="//body/results">

<html>

<head><title>Results for <xsl:value-of select="//security-header[@name='principleAuthenticationName']"/></title></head>

<body>

<xsl:apply-templates select="@*|node()"/>

</body>

</html>

</xsl:template>

<xsl:template match="//results/response">

<h2>Response</h2>

<pre><xsl:value-of select="."/></pre>

</xsl:template>


<xsl:template match="@*|node()">

<xsl:apply-templates select="@*|node()"/>

</xsl:template>

</xsl:stylesheet>

</policy>

<policy name="analystResults">

<decrypt match='//body'>

<recipient>analyst</recipient>

</decrypt>

<verify>

<allowed-signer>queryProcessor</allowed-signer>

<allowed-signer>relayInstitution</allowed-signer>

</verify>

<!-- Convert to HTML (to demonstrate) -->

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output indent="no" omit-xml-declaration="yes" method="html" cdata-section-elements="pre"/>

<xsl:template match="//body/results">

<html>

<head><title>Results for <xsl:value-of select="//security-header[@name='principleAuthenticationName']"/></title></head>

<body>

<xsl:apply-templates select="@*|node()"/>

</body>

</html>

</xsl:template>

<xsl:template match="//results/response">

<h2>Response</h2>

<pre><xsl:value-of select="."/></pre>

</xsl:template>

<xsl:template match="@*|node()">

<xsl:apply-templates select="@*|node()"/>

</xsl:template>

</xsl:stylesheet>

</policy>

</policies>


9 Appendix 2

This Appendix includes several screenshots of the User Management Tool that is

described in Section 4.6 above. It also includes the template for the CSV to batch

create general practitioner users.

9.1 CSV Template

The CSV should contain the following fields, in order:

• Title

• First Name

• Surname

• Organisation/Institution

• Email

The first line of the file should contain these headers, with the first user on line 2.
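For illustration, a file following this template (with hypothetical values) might begin:

```
Title,First Name,Surname,Organisation/Institution,Email
Dr,Jane,Doe,Example Practice,jane.doe@example.org
```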

9.2 User Management Tool Screenshots

Figure 10 TRANSFoRm User Management Tool: Login Page


Figure 11 Transform User Management Tool: Home Screen


Figure 12 Transform User Management Tool: Invite New User

Figure 13 TRANSFoRm User Management Tool: View TRANSFoRm Users


10 Appendix 3

This Appendix includes some sample queries illustrating the data extraction

component of the federated infrastructure.

10.1 Example Query

An example query generated by the Query Formulation Workbench using the eligibility

criteria Query Model

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<EligibleSubjectCountRequest>

<QueryCriteria id="1248">

<Criteria type="criteriaGroup" operator="AND" id="1249">

<Criteria type="singleCriterion" id="1251">

<Archetype abbreviated="yes">

(adl_version=1.4)

TRANSFoRm-CRIM-DATAENTRY.dob.v1

value <=1977-01-01

ontology cdim_000007

</Archetype>

</Criteria>

<Criteria type="singleCriterion" id="1255">

<Archetype abbreviated="yes">

(adl_version=1.4)

TRANSFoRm-CRIM-DATAENTRY.diagnosis.v1

value [ICD10::E11][ICPC2EDUT::T90][RCDV3::X40J5][SNOMEDCT::44054006]

ontology cdim_000011, cdim_000012

</Archetype>

</Criteria>

<Criteria type="criteriaGroup" operator="OR" id="1257">

<Criteria type="singleCriterion" id="1258">


<Archetype abbreviated="yes">

archetype (adl_version=1.4)

TRANSFoRm-CRIM-DATAENTRY.medication.v1

value [ATC::A10BA02]

ontology cdim_000037, cdim_000045

</Archetype>

</Criteria>

<Criteria type="singleCriterion" id="1259">

<Archetype abbreviated="yes">

archetype (adl_version=1.4)

TRANSFoRm-CRIM-DATAENTRY.medication.v1

[RCDV3::X80NJ,XM0lF,f3...]

ontology cdim_000037, cdim_000045

</Archetype>

</Criteria>

</Criteria>

<Criteria type="criteriaGroup" operator="OR" id="1261">

<Criteria type="singleCriterion" id="1262">

<Archetype abbreviated="yes">

archetype (adl_version=1.4)

TRANSFoRm-CRIM-DATAENTRY.lab_test.v1

Value [SNOMEDCT::40402000]

ontology OGMS_0000056, CDIM_000032, IAO_0000003, CDIM_000029

</Archetype>

</Criteria>

<Criteria type="singleCriterion" id="1263">

<Archetype abbreviated="yes">

archetype (adl_version=1.4)

TRANSFoRm-CRIM-DATAENTRY.lab_test.v1

Value [SNOMEDCT::36048009]

ontology OGMS_0000056, CDIM_000032, IAO_0000003, CDIM_000029


</Archetype>

</Criteria>

<Criteria type="singleCriterion" id="1264">

<Archetype abbreviated="yes">

archetype (adl_version=1.4)

TRANSFoRm-CRIM-DATAENTRY.lab_test.v1

Value [SNOMEDCT::144185003,144167005,166893007,166911009,271062006]

ontology OGMS_0000056, CDIM_000032, IAO_0000003, CDIM_000029

</Archetype>

</Criteria>

</Criteria>

</Criteria>

</QueryCriteria>

<Destination name="NIVEL">

<Practice>8872</Practice>

<Practice>8711</Practice>

<Practice>8087</Practice>

</Destination>

</EligibleSubjectCountRequest>

10.2 Query post-substitution

An example query after substitution of archetype elements by SQL elements

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<EligibleSubjectCountRequest>

<QueryCriteria id="1248">

<Criteria type="criteriaGroup" operator="AND" id="1249">

<Criteria type="singleCriterion" id="1251">

<SQL comment="Date of birth &lt; 1977">

<![CDATA[SELECT DISTINCT CLIENT.ID_CLIENT AS CDIM_000003, CLIENT.GEBOORTEDATUM AS CDIM_000007

FROM CLIENT


INNER JOIN PRAKTIJK ON CLIENT.ID_PRAKTIJK = PRAKTIJK.ID_PRAKTIJK

WHERE (DATEDIFF(day, CLIENT.GEBOORTEDATUM, '2002-01-01') > 0)

ORDER BY CDIM_000003, CDIM_000007]]>

</SQL>

</Criteria>

<Criteria type="singleCriterion" id="1255">

<SQL comment="has Diabetes">

<![CDATA[SELECT DISTINCT CLIENT.ID_CLIENT AS CDIM_000003, MORBIDITEIT.DATUM AS CDIM_000012

FROM MORBIDITEIT

INNER JOIN CLIENT ON MORBIDITEIT.ID_CLIENT = CLIENT.ID_CLIENT

INNER JOIN PRAKTIJK ON CLIENT.ID_PRAKTIJK = PRAKTIJK.ID_PRAKTIJK

WHERE MORBIDITEIT.DIAGNOSE IN (219000,219001,219002)

ORDER BY CDIM_000003, CDIM_000012]]>

</SQL>

</Criteria>

<Criteria type="criteriaGroup" operator="OR" id="1257">

<Criteria type="singleCriterion" id="1258">

<SQL comment="takes metformin">

<![CDATA[SELECT DISTINCT CLIENT.ID_CLIENT AS CDIM_000003, PRESCRIPTIE.RECEPTDATUM AS

CDIM_000105

FROM PRESCRIPTIE

INNER JOIN CLIENT ON PRESCRIPTIE.ID_CLIENT = CLIENT.ID_CLIENT

INNER JOIN PRAKTIJK ON CLIENT.ID_PRAKTIJK = PRAKTIJK.ID_PRAKTIJK

WHERE PRESCRIPTIE.ATC IN ('A10BA02')

ORDER BY CDIM_000003, CDIM_000105]]>

</SQL>

</Criteria>

<Criteria type="singleCriterion" id="1259">

<SQL comment="takes Sulphonylurea compounds">

<![CDATA[SELECT DISTINCT CLIENT.ID_CLIENT AS CDIM_000003, PRESCRIPTIE.RECEPTDATUM AS

CDIM_000105

FROM PRESCRIPTIE


INNER JOIN CLIENT ON PRESCRIPTIE.ID_CLIENT = CLIENT.ID_CLIENT

INNER JOIN PRAKTIJK ON CLIENT.ID_PRAKTIJK = PRAKTIJK.ID_PRAKTIJK

WHERE PRESCRIPTIE.ATC IN ('X80NJ', 'XM0lF', 'f3...', 'X80NJ', 'XM0lF', 'f3...', '372711004',

'34012005', '259552008', '273950002', 'C-A2400', '372711004', '34012005', '259552008',

'273950002', 'NOCODE', 'C0038766')

ORDER BY CDIM_000003, CDIM_000105]]>

</SQL>

</Criteria>

</Criteria>

<Criteria type="criteriaGroup" operator="OR" id="1261">

<Criteria type="singleCriterion" id="1262">

<SQL comment="has HbA1c &gt; 6.5 mmol/l">

<![CDATA[SELECT DISTINCT CLIENT.ID_CLIENT AS CDIM_000003, UITSLAGEN.REGISTRATIEDATUM AS

CDIM_000029

FROM UITSLAGEN

INNER JOIN CLIENT ON UITSLAGEN.ID_CLIENT = CLIENT.ID_CLIENT

INNER JOIN PRAKTIJK ON CLIENT.ID_PRAKTIJK = PRAKTIJK.ID_PRAKTIJK

INNER JOIN HULP_UITSLAGHIS ON UITSLAGEN.NHGNUMMER = HULP_UITSLAGHIS.nhgnummer

WHERE UITSLAGEN.TYPEUITSLAG = 1

AND UITSLAGEN.NHGNUMMER IN (368) AND (LTRIM(UITSLAGEN.WAARDE) >= '06.5')

ORDER BY CDIM_000003, CDIM_000029]]>

</SQL>

</Criteria>

<Criteria type="singleCriterion" id="1263">

<SQL comment="has Random glucose &gt; 9.9 mmol/l">

<![CDATA[SELECT DISTINCT CLIENT.ID_CLIENT AS CDIM_000003, UITSLAGEN.REGISTRATIEDATUM AS

CDIM_000029

FROM UITSLAGEN

INNER JOIN CLIENT ON UITSLAGEN.ID_CLIENT = CLIENT.ID_CLIENT

INNER JOIN PRAKTIJK ON CLIENT.ID_PRAKTIJK = PRAKTIJK.ID_PRAKTIJK

INNER JOIN HULP_UITSLAGHIS ON UITSLAGEN.NHGNUMMER = HULP_UITSLAGHIS.nhgnummer

WHERE UITSLAGEN.TYPEUITSLAG = 1

AND UITSLAGEN.NHGNUMMER IN (372)


AND (LTRIM(UITSLAGEN.WAARDE) >= '09.9') AND HULP_UITSLAGHIS.eenheid = ' mmol/l'

ORDER BY CDIM_000003, CDIM_000029]]>

</SQL>

</Criteria>

<Criteria type="singleCriterion" id="1264">

<SQL comment="has fasting glucose &gt; 7.0 mmol/l">

<![CDATA[SELECT DISTINCT CLIENT.ID_CLIENT AS CDIM_000003, UITSLAGEN.REGISTRATIEDATUM AS

CDIM_000029

FROM UITSLAGEN

INNER JOIN CLIENT ON UITSLAGEN.ID_CLIENT = CLIENT.ID_CLIENT

INNER JOIN PRAKTIJK ON CLIENT.ID_PRAKTIJK = PRAKTIJK.ID_PRAKTIJK

INNER JOIN HULP_UITSLAGHIS ON UITSLAGEN.NHGNUMMER = HULP_UITSLAGHIS.nhgnummer

WHERE UITSLAGEN.TYPEUITSLAG = 1

AND UITSLAGEN.NHGNUMMER IN (371)

AND (LTRIM(UITSLAGEN.WAARDE) >= '07.0' AND HULP_UITSLAGHIS.eenheid = ' mmol/l')

ORDER BY CDIM_000003, CDIM_000029]]>

</SQL>

</Criteria>

</Criteria>

</Criteria>

</QueryCriteria>

<Destination name="NIVEL">

<Practice>8872</Practice>

<Practice>8711</Practice>

<Practice>8087</Practice>

</Destination>

</EligibleSubjectCountRequest>
