TRANSFoRm FP7-247787 D7.2 Federated Infrastructure for Data Linkage
Translational Research and Patient Safety in Europe
D7.2 Federated Infrastructure for Data Linkage
Work Package Number: WP7
Work Package Title: Federated Infrastructure for Data Linkage
Nature of Deliverable: Report
Dissemination Level: Confidential
Version: 0.4
Delivery Date From Annex 1: M51
Principal Authors: S. Hajebi, A. Raj, E. O’Toole, S. Clarke (TCD)
Contributing Authors: L. Zhao, C. Golby, T. N. Arvanitis (UW)
M.McGilchrist, F.Culross (UD)
Partner Institutions: Trinity College Dublin (TCD), University of Dundee
(UD), University of Warwick (UW)
Internal reviewers: Ita Richardson, Theodoros N. Arvanitis (UW)
This project has received funding from the European Union’s
Seventh Framework Programme for research, technological
development and demonstration under grant agreement no 247787 [TRANSFoRm].
Version History

Version  Date        Author (partner)                                                          Changes/reason
0.1      18.04.2014  Saeed Hajebi, Amit Raj, Eamonn O’Toole                                    Initial version for internal review
0.2      22.05.2014  Saeed Hajebi, Amit Raj, Eamonn O’Toole, Mark McGilchrist, Frank Culross   Incorporated feedback from internal review and added DNC
0.3      26.05.2014  Eamonn O’Toole, Theodoros N. Arvanitis (UW)                               Internal review
0.4      29.05.2014  Eamonn O’Toole, Mark McGilchrist, Vasa Curcin                             Internal review
0.5      30.05.2014  Eamonn O’Toole                                                            Final edits
1.0      31.05.2014  Brendan Delaney (KCL), Vasa Curcin (IC)                                   Internal review
Table of Contents

Version History
List of Figures
List of Tables
Abbreviations
Executive Summary
1 Introduction
2 Overview of Requirements
  2.1 Service Based Infrastructure
  2.2 Federated Secure Data Access
  2.3 Semantically Rich Registry Services
  2.4 Provenance Integration
  2.5 Load-balancing and fault tolerance mechanisms
  2.6 Investigation of Service Based Middleware Technologies
  2.7 Summary
3 TRANSFoRm Federated Infrastructure for Data Linkage Architecture
  3.1 Conceptual Architecture
  3.2 Distributed Platform
    3.2.1 Authentication Framework
    3.2.2 Secure Data Transport
    3.2.3 Registry Services
  3.3 Data Extraction
    3.3.1 Data Node Connector
  3.4 Non-Infrastructure Components
    3.4.1 Query Formulation Workbench
    3.4.2 Provenance Framework
  3.5 Summary
4 Implementation of the Distributed Platform
  4.1 Technology Stack
  4.2 Distributed Platform for Query and Result Data
    4.2.1 Components
    4.2.2 Query Lifecycle
    4.2.3 Load Balancing and Fault Tolerance
  4.3 Semantically Rich Registry Services
    4.3.1 Registry Service Capabilities
  4.4 Security Integration
    4.4.1 Secure Data Transport
    4.4.2 Authentication Framework
  4.5 Provenance Integration
    4.5.1 Reception of unencrypted CDIM query
    4.5.2 Query to EHR Repository
    4.5.3 Results from Data Source
    4.5.4 Retrieval of results by QWB User
  4.6 Global User Management
    4.6.1 User Roles
    4.6.2 User Repository
    4.6.3 User Management Tool
  4.7 Summary
5 Data Extraction
  5.1 DNC-WB (Query Formulation Workbench)
  5.2 DNC-DS (Data Source)
  5.3 Summary
6 Concluding Remarks
7 References
8 Appendix 1
9 Appendix 2
  9.1 CSV Template
  9.2 User Management Tool Screenshots
10 Appendix 3
  10.1 Example Query
  10.2 Query post-substitution
List of Figures

Figure 1  Conceptual Architecture
Figure 2  Query lifecycle across platform
Figure 3  TRANSFoRm Authentication Framework
Figure 4  Reception of unencrypted data
Figure 5  Query to EHR data source
Figure 6  Results returned from the data source
Figure 7  Retrieve query results for user
Figure 8  Conceptual Architecture for Data Extraction and Linkage
Figure 9  DNC-WB use of semantic mediator to translate data elements expressed as archetypes to local database queries, usually SQL queries. DNC-DS is not shown.
Figure 10 TRANSFoRm User Management Tool: Login Page
Figure 11 TRANSFoRm User Management Tool: Home Screen
Figure 12 TRANSFoRm User Management Tool: Invite New User
Figure 13 TRANSFoRm User Management Tool: View TRANSFoRm Users
List of Tables

Table 1  Registry Services: Data Source Information
Table 2  Classification Information for each data source
Table 3  TRANSFoRm Global User roles
Table 4  TRANSFoRm User objects
Abbreviations

CROM Clinician Reported Outcome Measures
DNC Data Node Connector
QWB Query Formulation Workbench
CRIM Clinical Research Information Model
CDIM Clinical Data Integration Model
EC Eligibility Criteria
eCRF electronic Case Report Form
EHR Electronic Health Record
ODM Operational Data Model
PROM Patient Reported Outcome Measures
VarQs form variables expressed using the TRANSFoRm query model
SDB Study Database
SS Study System
TAM Technical Acceptance Model
W/S Web Service
SDM Study Design Model
SF12 Short Form 12
LDAP Lightweight Directory Access Protocol
Executive Summary
This deliverable describes the outcome of WT 7.5 (Infrastructure), which manages the
extraction and linkage of data from heterogeneous datasets in support of the
use cases of the TRANSFoRm project. This work task has developed a federated
infrastructure to facilitate secure communication of query data and query results
between research and clinical systems. This federated infrastructure was developed
using service-based technologies with asynchronous messaging being used between
the numerous distributed components that compose it.
This document outlines a list of requirements identified through the examination of
the work task description provided in the TRANSFoRm description of work and
consultation with impacted project partners who were developing tools that would
depend on the federated infrastructure. These requirements helped inform an
analysis of existing integration framework technologies to select the most suitable
framework, in order to provide the foundations for the TRANSFoRm federated
infrastructure. Apache Camel, an open-source lightweight framework that provides
a Java-based DSL and built-in load-balancing and fault tolerance mechanisms, was
selected.
The federated infrastructure can be summarised as providing an authentication
framework for TRANSFoRm users, secure data transport, and a semantically rich
registry service offering information on available EHR data sources. This document
describes each of the core components implemented to achieve the aforementioned
functionalities, with the backbone of the infrastructure being provided by a set of
proxy libraries that are deployed locally to the user-facing and data-source-facing
applications developed elsewhere in TRANSFoRm. These libraries provide the
functionalities of the federated infrastructure through well-defined interfaces,
hiding the complexity of the federated infrastructure from the application layer
at each end of the TRANSFoRm system. These proxy libraries also
provide a means of integrating the TRANSFoRm provenance framework and technical
security solution into the distributed infrastructure, ensuring secure transportation
of query data and query results between distributed endpoints.
This document also describes the implementation of user management mechanisms
for TRANSFoRm, where users can be registered and assigned one or more user
roles. These mechanisms allow users accessing TRANSFoRm to be authenticated
and authorised.
Finally, this document describes the data extraction mechanisms that allow data
queries to be executed against heterogeneous data sources and the results returned
to the infrastructure. Data extraction is provided by the Data Node Connector
component, which acts as an interface between TRANSFoRm and target clinical
systems. This component is split in two, enabling different technologies to be used
by TRANSFoRm and the clinical systems, and allowing clinical data to be electronically
isolated from the TRANSFoRm infrastructure if required by the data owner.
1 Introduction
Health organizations and their electronic record systems are geographically
dispersed throughout Europe and are subject to a heterogeneous set of security and
information governance policies on sharing their data. This presents a significant
challenge to TRANSFoRm [2], which seeks to provide researchers with efficient and
effective access to this data in order to find suitable patients to recruit for clinical
studies. This goal requires a distributed infrastructure that facilitates communication
between these research and clinical systems.
This deliverable describes the outcomes of WT 7.5, the infrastructure that manages the
extraction and linkage of data between heterogeneous datasets. This work task has
developed a federated infrastructure to facilitate secure communication of query data
and query results between research and clinical systems and to provide data extraction
and linkage capabilities. To achieve this, the federated infrastructure is conceptually
divided into two components: a distributed platform component and a data extraction
component.
The distributed platform provides a service-based infrastructure to facilitate
asynchronous communication between distributed endpoints in TRANSFoRm. This
platform provides complex infrastructure workflows, encapsulated behind well-
defined interfaces provided by a set of distributed components. The distributed
platform also integrates the technical security solution outlined in WT 3.3, to deliver a
flexible authentication framework providing policy-driven authentication and
authorization access to TRANSFoRm user-facing tools. A second feature of this
integration is the provision of secure data transport across the federated
infrastructure using signature, encryption and decryption capabilities provided by the
security solution.
The distributed platform also maintains a semantically rich registry service, that
provides dynamic discovery and binding to remote services. This registry service
allows research users to identify target EHR data sources that may contain suitable
patients to recruit. Provenance information is also gathered and communicated to the
TRANSFoRm provenance framework [3]. Finally, load balancing and fault tolerance
mechanisms are included in the technical solution to provide a highly available set of
services.
Data extraction is provided by the Data Node Connector (DNC) component. This
component is located local to target EHR repository systems and acts as an interface
between TRANSFoRm and these data sources. It receives queries through the
distributed platform and translates them into queries that can be run against the local
repository. When ready, it returns the results of these queries to the distributed
platform where they are securely transferred back to the researcher.
The document is organised as follows. Section 2 outlines the requirements identified
to deliver the federated infrastructure. It also provides a discussion on the
investigation undertaken to identify the most suitable integration framework on which
to build the infrastructure. Section 3 discusses the system architecture and the
functional components that comprise and integrate with the federated infrastructure.
Section 4 provides a detailed implementation account of the distributed platform
component of the federated infrastructure, with all sub-components, workflows and
communication being discussed. Section 5 outlines the data extraction functionality
of the federated infrastructure. Finally, the document also includes some concluding
remarks, references and appendices, which include example policies and
implementation screenshots.
2 Overview of Requirements
A requirements-driven approach was used throughout the design and implementation
of the TRANSFoRm federated infrastructure for data linkage. An initial set of
requirements for the deliverable is outlined in the project’s description of work [1].
The most fundamental of these is the provision of a service-based infrastructure
enabling communication between research and clinical systems. This infrastructure
will provide specified TRANSFoRm users with data access to, and querying of, electronic
health record systems and research databanks, with security ensured through the integration
of security policies and the technical security solution described in deliverable D3.3
[4]. The current deliverable should also include semantically rich registry services to
identify and describe potential target EHR data sources that are both geographically
dispersed and conceptually and technologically distinct. The infrastructure is also
required to integrate with the provenance framework to ensure that auditing of data
access events is possible. Finally, to ensure highly available services, suitable load-
balancing and fault tolerance mechanisms for a federated infrastructure of this nature
must be investigated.
Through consultation with relevant project partners and analysis of suitable
technology solutions, these initial requirements were expanded to a comprehensive
list of functional requirements that are described in greater detail below. Finally, this
section also describes the rationale behind our selection of Apache Camel as the
most suitable integration framework to provide the foundation of this deliverable.
2.1 Service Based Infrastructure
Research institutes require data from clinical systems. However, different clinical
systems may use heterogeneous technology infrastructures and data formats. Moreover,
the communication between these systems should be asynchronous, as the data needed
for some research may require approval from the data owner, a process that will
typically involve a time delay. An asynchronous service-based infrastructure should
be in place to enable this communication.
The above motivates the following functional requirements:
• A mechanism to receive queries from the Query Formulation Workbench (WT
5.3), securely route them to the targeted data sources, and decrypt them.
• A mechanism to deliver queries to EHR repositories where they are passed to
the local Data Node Connector (DNC) instance.
• A mechanism to receive query results from the DNC, and securely store them
until they are requested by the researcher.
• A mechanism to deliver the results to the researcher.
• A mechanism to track and report the status of the queries.
• A mechanism to uniquely identify queries.
• A mechanism to cancel an ongoing query.
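The last three requirements can be sketched together: a query envelope might carry a unique identifier and a status field that is tracked and can be cancelled. The following is a hypothetical illustration only; the class name, status values, and cancellation rule are assumptions, not the platform's actual implementation.

```java
import java.util.UUID;

// Hypothetical sketch: a query envelope with a unique id and a tracked,
// cancellable status. Names and states are illustrative assumptions.
class QueryEnvelope {
    enum Status { SUBMITTED, ROUTED, EXECUTING, RESULTS_READY, DELIVERED, CANCELLED }

    private final String queryId = UUID.randomUUID().toString();  // unique query identity
    private Status status = Status.SUBMITTED;

    String getQueryId() { return queryId; }
    Status getStatus() { return status; }

    // Advance the lifecycle; a cancelled query accepts no further transitions.
    void setStatus(Status next) {
        if (status == Status.CANCELLED) {
            throw new IllegalStateException("Query " + queryId + " was cancelled");
        }
        status = next;
    }

    void cancel() { status = Status.CANCELLED; }
}
```

In a distributed deployment the identifier would travel with the query so that any component can report or update its status.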
2.2 Federated Secure Data Access
Given the sensitive nature of the information involved, the communication between
research institutes and clinical systems or data sources must be secure. D3.3
Security Solution Layer describes the proposed technical security solution for
TRANSFoRm. One key task of this deliverable is to integrate the D3.3 security
policies and the technical security solutions to enable flexible yet strongly secure
policy-driven authentication and authorization access.
Within the TRANSFoRm software ecosystem, different partner institutions use different
authentication frameworks. A federated authentication framework (Shibboleth, as
outlined in D3.3) should be used to enable partners to use federated single sign-on
(SSO) mechanisms to access the functionalities of TRANSFoRm.
The security needs discussed motivate the following functional requirements:
• Integration with the security library to apply encryption/decryption to the
queries and the results before they are transmitted across TRANSFoRm.
• Integration with the security library to apply secure policy-driven authentication
and authorization access to TRANSFoRm tools.
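The encrypt-before-transmit requirement can be illustrated with a minimal sketch using standard AES-GCM from the JDK. This is not the D3.3 security library, whose API is defined elsewhere; the class and method names below are assumptions made for illustration.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

// Illustrative only: the TRANSFoRm security library (D3.3) defines its own API.
// This sketch shows the general encrypt-before-transmit idea with AES-GCM.
class QueryCrypto {
    private static final SecureRandom RNG = new SecureRandom();

    static SecretKey newKey() {
        try {
            KeyGenerator kg = KeyGenerator.getInstance("AES");
            kg.init(128);
            return kg.generateKey();
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    // Encrypts a serialised query; the 12-byte IV is prepended to the ciphertext.
    static byte[] encrypt(SecretKey key, String plaintext) {
        try {
            byte[] iv = new byte[12];
            RNG.nextBytes(iv);
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
            byte[] ct = c.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));
            byte[] out = new byte[iv.length + ct.length];
            System.arraycopy(iv, 0, out, 0, iv.length);
            System.arraycopy(ct, 0, out, iv.length, ct.length);
            return out;
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    static String decrypt(SecretKey key, byte[] message) {
        try {
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, message, 0, 12));
            byte[] pt = c.doFinal(message, 12, message.length - 12);
            return new String(pt, StandardCharsets.UTF_8);
        } catch (Exception e) { throw new RuntimeException(e); }
    }
}
```

In the actual infrastructure, key management and signing would be handled by the security solution rather than generated locally as shown here.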
2.3 Semantically Rich Registry Services
The scope of TRANSFoRm provides a challenge as a number of heterogeneous
EHR data sources may be added or removed over time, with each data source
having a different set of data and using different coding schemes. Therefore, a
semantically rich registry service must be provided that can dynamically discover new
data sources and bind to their services, providing researchers with details of
available data sources to choose from. Conversely, if an existing data source decides
to leave TRANSFoRm, the registry service should be updated to reflect this.
This motivates the following functional requirements:
• Dynamically discover new data sources and bind to their services.
• Automatically update the list of data sources.
• Provide a list of data sources and detailed information about them.
• Provide semantic information about data sources and the ability to
search for and select a data source.
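The register/deregister/search behaviour above can be sketched with an in-memory stand-in. The real registry service is a distributed component with richer semantic metadata; the class name, the free-text descriptions, and the keyword search standing in for semantic search are all assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory sketch of the registry's core operations.
class DataSourceRegistry {
    // data source id -> free-text description (coding scheme, location, etc.)
    private final Map<String, String> sources = new LinkedHashMap<>();

    void register(String id, String description) { sources.put(id, description); }

    void deregister(String id) { sources.remove(id); }  // data source leaves TRANSFoRm

    List<String> list() { return new ArrayList<>(sources.keySet()); }

    // Naive keyword search over descriptions, standing in for semantic search.
    List<String> search(String keyword) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : sources.entrySet()) {
            if (e.getValue().toLowerCase().contains(keyword.toLowerCase())) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}
```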
2.4 Provenance Integration
All occurrences of data access within the scope of the TRANSFoRm project should
be captured. To enable this across all of TRANSFoRm, the federated
infrastructure for data linkage should be integrated with the provenance and auditing
service (WT 3.4).
This motivates the following functional requirements:
• Mechanisms to annotate, via the provenance service, the transmission of query
data and query results across the distributed infrastructure.
• Mechanisms to annotate, via the provenance service, the encryption and
decryption of query data and query results by the distributed infrastructure.
• Mechanisms to provide details of user access events to TRANSFoRm tools
such as the Query Formulation Workbench (QWB).
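As a stand-in for these annotation mechanisms, one can picture each infrastructure action (transmission, encryption, decryption, user access) being recorded against the query it concerns. The real provenance framework [3] uses its own model; the class, method, and event names here are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for provenance annotation; not the real framework's API.
class ProvenanceLog {
    private final List<String[]> events = new ArrayList<>();  // {queryId, actor, action}

    // Record one provenance event against a query.
    void annotate(String queryId, String actor, String action) {
        events.add(new String[] { queryId, actor, action });
    }

    // Return the recorded actions for one query, in order of occurrence.
    List<String> actionsFor(String queryId) {
        List<String> out = new ArrayList<>();
        for (String[] e : events) {
            if (e[0].equals(queryId)) out.add(e[1] + ":" + e[2]);
        }
        return out;
    }
}
```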
2.5 Load-balancing and fault tolerance mechanisms
The functionalities of QWB are dependent on the federated infrastructure for data
linkage; therefore, the infrastructure should be highly available to facilitate
uninterrupted processing of query and result data. To achieve this, there must be at
least two instances of the components deployed on two different sites. The load of
queries should be divided between these sites. Each site should be updated with the
latest changes from the other.
The requirements for fault tolerance imply that there must be no single point of
failure: all functionality must be redundant. If a component experiences a failure, the
whole workflow must continue to operate without interruption during the repair
process. Additionally, faults must be isolated, so that a failure in one component
does not propagate to the rest of the system.
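The round-robin-with-failover behaviour described above can be sketched as follows. In the real infrastructure this is delegated to Apache Camel's load-balancing policies (Section 2.6); the site names and the health-check predicate here are assumptions for illustration.

```java
import java.util.List;
import java.util.function.Predicate;

// Sketch of round-robin load balancing across two sites with failover.
class SiteBalancer {
    private final List<String> sites;
    private final Predicate<String> isHealthy;  // stand-in for a real health check
    private int next = 0;

    SiteBalancer(List<String> sites, Predicate<String> isHealthy) {
        this.sites = sites;
        this.isHealthy = isHealthy;
    }

    // Pick the next site in rotation, skipping sites that fail the health check.
    String pickSite() {
        for (int i = 0; i < sites.size(); i++) {
            String candidate = sites.get((next + i) % sites.size());
            if (isHealthy.test(candidate)) {
                next = (next + i + 1) % sites.size();
                return candidate;
            }
        }
        throw new IllegalStateException("no healthy site available");
    }
}
```

With both sites healthy, queries alternate between them; if one site fails, all traffic flows to the surviving site until the repair completes.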
2.6 Investigation of Service Based Middleware Technologies
The need to exchange data between different and distributed applications is not
specific to TRANSFoRm and is common across all systems composed of two or
more applications. There are a number of existing integration frameworks that aim to
provide a foundation on which to build such systems, providing functionalities and
patterns that are commonly required. We identified the following criteria to help select
the most suitable integration framework available:
• Open source
• Testability
• Java-based domain-specific language (DSL)
• Popularity
• IDE support
• Error handling
• Monitoring support
• Number of components for interfaces, technologies and protocols
• Expandability
We identified a comprehensive set of potential candidates that included:
• Apache Camel
• Spring Integration
• Spring Batch
• jBPM
• nexusBPM
• Drools
• Hadoop/Cascading
• Full ESBs such as ServiceMix, Mule ESB, OpenESB, and JBoss ESB [5].
After an investigation of the possible alternatives with the considerations outlined
above, Apache Camel was chosen as the most suitable framework. Camel is a light
TRANSFoRm FP7-‐247787 D7.2 Federated Infrastructure for Data Linkage
17
weight open-source (Apache License version 2.0)1 framework with a Java based
DSL. It provides the following features [6], [7]
• Concrete implementations of all the widely used Enterprise Integration Patterns
(EIPs).
• Connectivity to a great variety of transports and APIs
• Easy to use Domain Specific Languages (DSLs) to wire EIPs and transports
together
• Pluggable data formats and type converters for easy message transformation
between CSV, EDI, Flatpack, HL7, JAXB, JSON, XmlBeans, XStream, Zip, etc.
• Pluggable languages to create expressions or predicates for use in the DSL.
Some of these languages include: EL, JXPath, Mvel, OGNL, BeanShell,
JavaScript, Groovy, Python, PHP, Ruby, SQL, XPath, XQuery, etc.
• Support for the integration of beans and POJOs in various places in Camel.
• Support for testing distributed and asynchronous systems using a messaging
approach.
Camel is a popular integration framework with a vibrant community, and it supports an
array of different data formats. It provides a comprehensive set of pre-packaged
components (more than 130) for accessing various backend systems, and has a
highly extensible overall architecture. It is built on the Spring framework, ensuring a
great deal of customization and extensibility. Due to its openness, it can be deployed
stand-alone as well as embedded within other applications or
frameworks [6].
With specific consideration for TRANSFoRm, Camel allows developers to create
dynamic workflows that connect distinct functional components for seamless
integration. It also allows proxy interfaces to be deployed; this hides the complexity of
the distributed infrastructure from the applications, such as the QWB, that use it.
Camel also provides a selection of load-balancing policies and fault-tolerance
mechanisms during communication, making it an ideal candidate for use in the
TRANSFoRm federated infrastructure for data linkage.

1 Apache License version 2.0 is a permissive license similar to the MIT License, but it
also provides an express grant of patent rights from contributors to users.
2.7 Summary

In this section, the requirements of the federated infrastructure for data linkage (WT
7.5) have been explained. The main requirements include:
• A service-based infrastructure that enables communication between research
and clinical systems
• Providing federated secure data access and query of electronic health record
systems and research databanks
• Integrating the security policy and the technical security solutions enabling
flexible yet strongly secure policy-driven authentication and authorization
access.
• Integrating with the provenance and auditing service to enable accurate
capture of all occurrences of data access.
• Provision of semantically rich registry services
• Load-balancing and fault-tolerance mechanisms to provide a highly available
service.
• Investigation of a variety of non-disruptive and asynchronous data linkage
mechanisms with EHR data repositories.
This section also provided a discussion on the alternatives for non-disruptive and
asynchronous data linkage mechanisms detailing the reasons behind choosing
Apache Camel as our integration framework.
3 TRANSFoRm Federated Infrastructure for Data Linkage Architecture
This section provides a brief overview and description of the system architecture
across the TRANSFoRm Federated Infrastructure for Data Linkage. Additionally, the
functional components that form and interact with this infrastructure are outlined.
3.1 Conceptual Architecture
Figure 1 Conceptual Architecture
TRANSFoRm aims to develop methods to integrate primary care clinical and
research activities with an essential component being the provision of a secure
means of communication between these geographically and conceptually distinct
endpoints. This communication is provided by the TRANSFoRm distributed
infrastructure for data extraction and linkage (WT 7.5) which connects researchers,
using the Query Formulation Workbench (QWB), to target EHR data sources by
means of service-based technologies.
As shown in Figure 1, this deliverable is composed of two broad components,
reflecting the two goals to be achieved. The first of these is a distributed platform
enabling communication across TRANSFoRm. This component of the deliverable is
itself composed of three sub-components. The first is the Authentication Framework, which
provides user authentication and role based authorisation to user-facing
TRANSFoRm tools such as the QWB. The second component, secure data
transport, provides secure transmission of data queries and data results using
asynchronous messaging and data encryption. This allows queries to be securely
delivered to the target EHR repositories.
The second component of the Federated Infrastructure involves functionality for data
extraction. This is provided by the Data Node Connector (DNC). When the query
arrives at the target EHR site, a local instance of the DNC translates the query and
allows the data owner to execute it against the data source. When combined with the
distributed platform, this achieves the goal of the federated infrastructure for data
linkage.
Provenance information for the entire query workflow is captured by the distributed
provenance system (Deliverable 5.2 TRANSFoRm Provenance Tool), which is
integrated with each component across the architecture.
3.2 Distributed Platform
The distributed platform provides an essential backbone for the TRANSFoRm
system, providing a secure means of communication between research and clinical
systems. The objectives of the distributed platform can be summarised under three
broad purposes: an authentication framework for TRANSFoRm users, secure data
transport for queries and results across distributed endpoints, and registry services.
3.2.1 Authentication Framework
The Authentication Framework is a dynamic extensible authentication service for the
various and distributed TRANSFoRm services. Based on the Security Assertion Markup
Language (SAML) and, in particular, Shibboleth, the framework provides federated
single sign-on authentication for user-facing TRANSFoRm tools. The framework
assigns global roles to registered users and allows them to perform certain actions in
each TRANSFoRm tool. This enables role-based access and authorization to be
implemented across the various TRANSFoRm software tools.
3.2.2 Secure Data Transport
Secure data transport is the fundamental goal of the Distributed Platform: it enables
communication between research and clinical systems to provide secure access to,
and querying of, electronic health record systems and research databanks.
This is achieved by using a service-based infrastructure with asynchronous
messaging being used across distributed components of the platform. These
distributed components receive an initial Clinical Data Integration Model (CDIM)
query from the QWB and transmit that query to the target EHR data sources where it
is delivered to the local DNC.
Security is ensured by integrating the security library, described in D3.3 Security
Solution Layer [4], which encrypts all query requests and query results before they
are transmitted across the distributed infrastructure. Specifically, when clinical
researchers design eligibility criteria and instruct the QWB to query selected EHR
data sources, the QWB invokes the infrastructure API, which encrypts the request
using the security library. The encryption of the query request happens before any
external communication takes place. The query remains encrypted until it arrives at
the target EHR data source, where the security library is used to decrypt the query,
as it is now safe to do so. The lifecycle of a query is described in greater detail in
Section 4.2 below.
In addition to providing secure messaging, the distributed platform interacts with the
TRANSFoRm provenance service to capture relevant provenance information across
different phases of this secure transmission process.
3.2.3 Registry Services
A semantically rich registry service is provided that can dynamically discover new
data sources and bind to their services, providing the QWB with details of the
available data sources to choose from. The registry connects to the DNC to discover
new data sources, automatically maintains the list of registered data sources, and
supplies this list, together with detailed information about each data source, to the
QWB.
3.3 Data Extraction
3.3.1 Data Node Connector
The Data Node Connector (DNC) component acts as the interface between queries
arriving via Secure Data Transport and the local data source from which the data
needs to be extracted. It provides data extraction functionality necessary for the
Federated Infrastructure. Once the Secure Data Transport has decrypted the
query, this interface translates the query arriving from the QWB in CDIM
format into an executable form that can be processed and executed by the local data
source. The Semantic Mediator is used in this translation process. The DNC also
provides a user-facing console, residing at the data provider site, that displays each
arriving query in a form meaningful to the data controller at that site. In addition to the
query formulated in local coding, each arriving entry contains context information for
the query: the study agreement details, the approved person and organisation
attached to that study, and an explanation of the purpose of the query in natural
language.
Once the query results are ready to be returned to the researcher, the DNC passes
them to the Secure Data Transport, where they are encrypted before being
transferred across the distributed infrastructure back to the QWB.
The functionalities of the Distributed Platform are provided to the DNC via a proxy,
which provides a clear interface and encapsulates the complexities of the underlying
infrastructure.
3.4 Non-Infrastructure Components
3.4.1 Query Formulation Workbench
The Query Formulation Workbench (QWB) is used to create, manage, store and
deploy queries of clinical data to identify subjects for clinical studies, evaluate trial
feasibility and to analyse the numbers of matching subjects in cohort studies, while
facilitating the extraction of data relating to epidemiological studies. Specifically, the
QWB provides a user interface for clinical researchers to create clinical studies,
design eligibility criteria, initiate distributed queries, monitor query progress, and
report query results. The QWB is based on the TRANSFoRm Clinical Research
Information Model (CRIM) model and uses the CDIM model for constructing queries,
together with the Vocabulary service for coding query concepts in supported
terminologies.
The QWB integrates with both aspects of the Distributed Platform. It uses the
Authentication framework to authenticate users and handle different user roles, thus
allowing access to the QWB to be limited to registered TRANSFoRm users only. In
addition, the QWB utilises the Secure Data Transport layer provided by the
Distributed Platform to securely route queries to target EHR repositories, provide
updates on submitted queries, and retrieve results when they are ready.
The functionalities of the federated infrastructure are provided to the QWB via a proxy,
which provides a clear interface and encapsulates the complexities of the underlying
platform.
3.4.2 Provenance Framework
The TRANSFoRm Provenance framework controls and manages the access to
provenance data created during the operation of TRANSFoRm tools. Making
TRANSFoRm tools provenance aware enables the investigation of data sources and
the services that produced a particular output, together with the individuals who
instigated the requests and received the outputs. In such a way, user behaviour and
data manipulation can be audited, to assess that correct decisions were made and
appropriate procedures were followed. Data privacy, legal and ethical regulations
restrict provenance data from being stored in a central repository. The provenance
framework mirrors the distributed EHR data access infrastructure, by implementing a
decentralised platform for provenance capture, storage and querying. More details
about the provenance service can be found in [28].
The distributed infrastructure invokes provenance services to annotate events
throughout a query’s lifecycle. This is achieved by connecting to the central
provenance service at specific points to store data describing query reception,
encryption/decryption and execution events. More details on this are provided in
Section 4.5 below.
3.5 Summary
This section provided a high-level description of the TRANSFoRm architecture and
the components composing and interacting with the federated infrastructure for data
linkage. The infrastructure is conceptually divided into two sets of components: those
providing a distributed platform and those providing data extraction. The distributed
platform is responsible for secure communication between distributed endpoints; it is
itself composed of a number of distributed components that communicate using
service-based technologies. Secure data transport is provided by integrating the
TRANSFoRm Security Solution (WT 3.3) into the
distributed platform, with provenance being used to annotate and audit the query
lifecycle across the platform. Data extraction is provided by the DNC component
which acts as an interface between the distributed infrastructure and the local data
repository.
4 Implementation of the Distributed Platform
In this section we explain the implementation of the first component of the
Federated Infrastructure for Data Linkage, the Distributed Platform. In the next
section, Section 5, we discuss the data extraction aspect of this deliverable. The
purpose of the Distributed Platform is to provide secure transportation of query
and result data, as well as to facilitate authentication of users on the TRANSFoRm
platform. We first review the technology stack used in this project and then explain
the different components of this infrastructure: the Middleware Proxy (Front Side),
Security Library, Middleware Services, Data Source Registry Services, Middleware
Proxy (Backend Side), and Results Processor Component. We also explain the
workflow of a query. The details of security integration and provenance integration
are discussed separately.
4.1 Technology Stack
The Distributed Platform of TRANSFoRm is responsible for collaborating with a
number of different software applications running on different partners’ sites to
process the query data and results. This infrastructure uses a stack of different
technologies for the enterprise communication and middleware services. The
different technologies in our technology stack are:
• Apache web server
• Apache Tomcat application server
• Apache Camel
• Spring framework
• Java SDK
• Shibboleth Single Sign-on (SSO) framework
• Java Messaging Service (JMS)
• MySQL Database
• Maven compiler plugin
• LDAP
The Shibboleth SSO framework was used to implement user authentication. Once
users are authenticated, they are allowed to use the middleware services. The
Apache web server is used to configure the domain name of the Middleware
services and SSL for secure communication over the HTTP protocol. The Middleware
web application was deployed in the Tomcat server, where several Camel endpoints
were published as web services. The frontend and backend communicate with the
Middleware web services over secure HTTP. The actual business logic of the web
services is written as Spring services. Maven is used to compile the applications; it
also enables easy migration from a lower version to a higher version of the included
libraries, such as from Spring framework 2.5 to 3.0. The Middleware library on the
backend side uses a JMS queue for communication with the Data Node Connector.
The metadata regarding the query and data source is saved in a MySQL database
located at the Middleware site.
4.2 Distributed Platform for Query and Result Data
This section provides a detailed description of the distributed platform in terms of the
lifecycle of query and result data. The description includes the constituent
components, the query lifecycle and workflow, and the load balancing and fault
tolerance of the software. The platform supports query construction, secure
transportation of the query to a designated data source, and storage of the decrypted
results in a secure FTP location, as shown in Figure 2.
Figure 2 Query lifecycle across platform
4.2.1 Components
This platform is implemented as a distributed system, in which the secure query
transportation functionality is distributed among the following components.
Middleware Proxy (Front Side): This component is deployed at the frontend site,
where it acts as a proxy for our middleware services. This proxy contains a set of
methods that allow the QWB to execute a query and get the results. Additionally, it
contains a set of methods to obtain detailed information about the data sources
registered in the Registry Services. The QWB provides the original CDIM query and
the HTTPRequest to the executeQuery() method as two parameters. The
HTTPRequest contains the user's browser request to the server (i.e. the QWB) and is
used by the Security Library (outlined below). The executeQuery() method
propagates the query through a Camel workflow, which uses the security library to
encrypt the original query before passing it on to the Middleware.
Security Library: The security library has two primary functions: (1) checking
whether the HTTPRequest received from the QWB has been authenticated with the
Shibboleth SSO framework, and (2) encrypting/decrypting the query. Upon
authentication, SAML assertions are attached to the corresponding HTTPRequest.
When SAML assertions are available in the request, the library encrypts the query
with a private key. A security policy is created for different users to allow dedicated
functionality to each class of users.
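The encrypt-before-transmit behaviour can be sketched as follows. This is a hypothetical stand-in, not the WT 3.3 Security Library itself: the class name, the use of AES-GCM, and the shared-key setup are all illustrative assumptions (the real library also signs the package with a private key and applies role-based policies).

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

// Hypothetical stand-in for the security library: encrypt a CDIM query
// before transmission, decrypt it on arrival at the data source.
public class QueryCipher {
    private static final SecretKey KEY = newKey();

    private static SecretKey newKey() {
        try {
            KeyGenerator kg = KeyGenerator.getInstance("AES");
            kg.init(128);
            return kg.generateKey();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Encrypt a query; the random 12-byte IV is prepended to the ciphertext.
    public static byte[] encrypt(String query) {
        try {
            byte[] iv = new byte[12];
            new SecureRandom().nextBytes(iv);
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.ENCRYPT_MODE, KEY, new GCMParameterSpec(128, iv));
            byte[] ct = c.doFinal(query.getBytes(StandardCharsets.UTF_8));
            byte[] pkg = new byte[iv.length + ct.length];
            System.arraycopy(iv, 0, pkg, 0, iv.length);
            System.arraycopy(ct, 0, pkg, iv.length, ct.length);
            return pkg;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Decrypt a package produced by encrypt(), as done at the data source side.
    public static String decrypt(byte[] pkg) {
        try {
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.DECRYPT_MODE, KEY, new GCMParameterSpec(128, pkg, 0, 12));
            return new String(c.doFinal(pkg, 12, pkg.length - 12), StandardCharsets.UTF_8);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The key point illustrated is that the query is unreadable between the two endpoints; only a component holding the key can recover the original CDIM text.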
Middleware Services: This component is used to store the encrypted query, update
the status of query processing and enable secure transportation of the query to the
designated data source.
Data Source Registry Services: Whenever a new data source becomes available,
it is first registered in the Data Source Registry. This registry uses a MySQL
database where all the relevant information regarding a data source is saved. The
QWB can get the list of registered data sources, along with their detailed information,
through the corresponding methods available in the Middleware Proxy (Front End).
This is a web component that provides the data source information over the HTTP
protocol in JSON format.
Middleware Proxy (Backend Side): This component is deployed at the backend
site along with the DNC. Its business logic is written as a Camel route, which
polls the Middleware to check if a new query is available. When a new query
becomes available, the route retrieves the encrypted query and uses the security
library to decrypt it. The decrypted query is sent to a JMS queue for processing at the
DNC. Another Camel route receives the results, encrypts them and sends them to
the secure FTP location.
Results Processor Component: This component is deployed at Custodix, Belgium,
the TRANSFoRm results store partner. When the encrypted results are ready from
the DNC, our Middleware Proxy (Backend Side) puts those results in a secure FTP
location at the Custodix site. The QWB can retrieve these encrypted results from this
FTP location. Moreover, the Results Processor component continuously polls this
secure FTP location to identify whether a new result is available. When a new result
is available, this component retrieves and decrypts it. The decrypted results are then
placed in another secure FTP location at the Custodix site, in a directory
corresponding to the user who created the query.
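The hand-off between these components can be simulated in miniature. The sketch below is purely illustrative: in-memory queues stand in for the Middleware's MySQL store and the DNC's JMS queue, and a string prefix stands in for real encryption.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Simplified, in-memory simulation of the query hand-off: the Middleware
// stores encrypted queries, and the backend proxy polls for them,
// "decrypts" them, and forwards them to the DNC's queue.
public class QueryHandOff {
    private final Queue<String> middlewareStore = new ArrayDeque<>(); // stands in for the MySQL store
    private final Queue<String> jmsQueue = new ArrayDeque<>();        // stands in for the DNC JMS queue

    // Called by the frontend proxy after encryption.
    public void submitEncrypted(String encryptedQuery) {
        middlewareStore.add(encryptedQuery);
    }

    // One polling cycle of the backend Middleware Proxy.
    public boolean pollOnce() {
        String q = middlewareStore.poll();
        if (q == null) return false;               // no new query for this data source
        String decrypted = q.replace("enc:", "");  // placeholder for the security library
        jmsQueue.add(decrypted);
        return true;
    }

    // Called by the DNC to pick up the next decrypted query.
    public String nextForDnc() {
        return jmsQueue.poll();
    }
}
```

The real platform replaces each queue with a durable store (MySQL, JMS) so that queries survive restarts between polling cycles.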
4.2.2 Query Lifecycle
The query flows through several components of the secure data transport layer as it
is processed. The following steps describe the sequential workflow of query
processing; the step numbers correspond to the numbers shown in Figure 2 above.
1. A CDIM query created in the QWB is passed to the platform using the
Middleware Proxy, a library located in the QWB application. This library
packages the query by encrypting and signing it using the Security Library
from WT 3.3. The encrypted query is then sent to the Middleware server
where it is stored in a MySQL database.
2. At the data source side, another Middleware Proxy (backend side) periodically
polls the Middleware server, requesting any new queries that are intended for
that data source. If new or unprocessed queries are available, they are returned,
still encrypted, to the Middleware Proxy.
3. The Middleware Proxy decrypts the query using the Security Library from WT
3.3 and sends it to the Data Node Connector. The Data Node Connector takes
the CDIM query and manages the workflow of passing it to the Semantic
Mediator, for translation into SQL, and executing the query against the
corresponding database.
4. Once the query is processed and results are ready, they are returned to the
Middleware Proxy (backend side) for encryption.
5. The Middleware Proxy once again packages the results by signing and
encrypting them using the Security Library from WT 3.3. The encrypted results
are placed into a secure FTP server (sFTP1) that the data source specifies.
6. The Results Processor component, located local to the secure FTP (sFTP1),
retrieves the encrypted results and decrypts them using the Security Library
from WT 3.3. The decrypted results are stored in another secure FTP (sFTP2)
location.
7. Throughout the workflow, the query status is updated to reflect the current
stage of the query processing lifecycle. Once the results are ready, the user of
the QWB can retrieve them via a request routed through the Middleware Proxy
(Front End). The proxy pulls the encrypted results from the secure FTP (sFTP1)
and returns them to the QWB application, where the Middleware Proxy (Front
End) decrypts the results using the Security Library from WT 3.3 and provides
the decrypted results to the user.
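The status updates in step 7 can be modelled as a simple monotonic state machine. The stage names below are invented for illustration; the platform's actual status values may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the status tracking performed throughout the query lifecycle.
// Statuses only ever advance forwards through the lifecycle.
public class QueryStatus {
    public enum Stage {
        SUBMITTED, STORED_AT_MIDDLEWARE, DELIVERED_TO_DNC,
        EXECUTED, RESULTS_ON_SFTP, RESULTS_RETRIEVED
    }

    private final List<Stage> history = new ArrayList<>();

    // Record the next lifecycle stage, rejecting any backwards transition.
    public void advance(Stage next) {
        if (!history.isEmpty() && next.ordinal() <= current().ordinal())
            throw new IllegalStateException("status cannot move backwards");
        history.add(next);
    }

    public Stage current() {
        return history.get(history.size() - 1);
    }
}
```

Keeping the full history (rather than a single field) mirrors how the Middleware's status records allow the QWB to report progress at any point in the workflow.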
4.2.3 Load Balancing and Fault Tolerance
In a large-scale heterogeneous system that is composed of several components
implemented in different languages and using different communication protocols,
enterprise integration is a crucial task. Apache Camel provides good integration
support for most of these technologies and languages. Moreover, it provides
out-of-the-box support for load balancing and fault tolerance.
The concepts of load balancing and fault tolerance are strongly connected with each
other. A critical large-scale system is required to be fault tolerant in order to maintain
high availability. In this case, the most important task is to identify single points of
failure (SPOFs), because these may cause catastrophic failures. To prevent the
Middleware from becoming a SPOF, it is replicated and deployed at two locations:
the University of Warwick and King's College London.
In the Middleware, several temporary faults can occur such as a database deadlock
or temporary outage. In such cases, the Camel workflows inside the Middleware will
fail to process the exchange. To deal with such temporary faults, we use the Camel
Dead Letter Channel, which re-processes failed exchanges after a time interval.
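The behaviour of the Dead Letter Channel can be approximated in plain Java as follows. This is a minimal analogue, assuming a fixed retry count and delay; in Camel itself the redelivery policy is configured declaratively on the route rather than coded by hand.

```java
import java.util.function.Supplier;

// Minimal analogue of Camel's Dead Letter Channel: a failed exchange is
// re-processed after a delay, up to a maximum number of retries, before
// the failure is finally propagated.
public class RetryChannel {
    public static <T> T process(Supplier<T> exchange, int maxRetries, long delayMillis) {
        RuntimeException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return exchange.get();
            } catch (RuntimeException e) {  // e.g. a database deadlock or temporary outage
                last = e;
                try {
                    Thread.sleep(delayMillis);  // wait before re-processing the exchange
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        throw last;  // retries exhausted: surface the failure (dead letter endpoint)
    }
}
```

A transient fault such as a deadlock typically clears within a retry or two, so the exchange completes without manual intervention.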
As the replicated copies of the Middleware are deployed at different locations, it is
necessary to balance the load among them. Camel provides a load balancer, which
we use in the project. This functionality delegates each request to one of the two
available endpoints using a load balancing policy. In addition to the existing policies, a
user can create their own load balancing policy. We placed the load balancer
alongside the Middleware Proxy (Front End); note, however, that it is unavailable
when the QWB application is down. The Camel endpoints published by the
Middleware instances at the University of Warwick and King's College London are
provided to the Middleware Proxy (Front End) so that the load balancer can delegate
query processing requests between them in a balanced manner.
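A round-robin policy over the two Middleware deployments could be sketched like this; the endpoint names are placeholders, and Camel's own load balancer would normally be configured on the route instead.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Round-robin delegation between the replicated Middleware deployments,
// in the spirit of Camel's load balancer.
public class MiddlewareBalancer {
    private final List<String> endpoints;
    private final AtomicInteger next = new AtomicInteger();

    public MiddlewareBalancer(List<String> endpoints) {
        this.endpoints = endpoints;
    }

    // Pick the endpoint for the next query processing request.
    public String choose() {
        int i = Math.floorMod(next.getAndIncrement(), endpoints.size());
        return endpoints.get(i);
    }
}
```

Because each deployment holds the same services, either endpoint can serve any request, and the alternation spreads load evenly while tolerating the loss of one site.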
4.3 Semantically Rich Registry Services
Data sources may be added to or removed from the project over time, each having a
different set of data and using different coding schemes. A semantically rich registry
service must therefore be provided that is able to dynamically discover new data
sources and bind to their services. If a data source is removed from the project, it
should be removed automatically from the list of data sources.
4.3.1 Registry Service Capabilities
• Dynamically discovering new data sources and binding to their services.
• Automatically updating the list of data sources.
• Providing the list of data sources and detailed information about them.
• Providing semantic information about data sources, with the ability to search
for and select a data source.
The registry should also hold provenance metadata about each data source (not
informatics provenance, but the actual origin of the harvested data) and quality
information. Tables 1 and 2 show the list of required fields for each data source and
the list of classification information for each data source.
Fields                               Description (and some sample data)
dataSource_id
connection_address
description
registry                             Registry: GPRD
name_of_registry                     Name of the registry: Clinical Practice Research Datalink
host_institution                     Host institution: Medicines and Healthcare Regulatory Agency (MHRA)
host_contact_email                   Host contact e-mail: [email protected]
host_contact_phone                   Host contact phone: +44 (0) 20 7084 2383
controlling_institution              Controlling institution: MHRA
controller_contact                   Controller contact: John Parkinson
controller_email                     Controller e-mail: null
geographical_coverage                Geographical coverage: UK
legal_jurisdiction                   Legal jurisdiction: England & Wales, Scotland
language                             Language: English
type_of_system                       Type of system: General Practice Repository
dbms                                 DBMS: Proprietary
publication_url                      Publication URL: http://www.cprd.com
data_source                          Data source: GPIS
start_year                           The beginning of the period
end_year                             The end of the period
committee                            Committee: ISAC
number_of_practices                  Number of practices: 600
number_of_patients                   Number of patients: 54000000
patient_consent                      Patient consent
contain_physical_examination_data    Contains physical examination data
contain_lifestyle_data               Contains lifestyle data
contain_medication_data              Contains medication data
contain_lab_results                  Contains lab results
contain_genetic_markers              Contains genetic markers
linkable_to_genetic_data             Linkable to genetic data
linkable_to_a_cancer_registry        Linkable to a cancer registry
linkable_to_a_drug_registry          Linkable to a drug registry
linkable_to_a_hospital_registry      Linkable to a hospital registry
linkable_to_a_population_registry    Linkable to a population registry
already_linked_to                    Already linked to? (text)
linkage_planned                      Linkage planned? (text)
linkage_actually_not_foreseen        Linkage actually not foreseen, except from participation in TRANSFoRm (text)
alive                                Is the data source alive or not (Yes/No)
first_heartbeat                      The first time the data source joined (date/time)
last_heartbeat                       The last heartbeat received from the data source (date/time)
last_update                          The last time the information about the data source was updated (date/time)

Table 1 Registry Services: Data Source Information
The functionalities of the Registry Services are provided to the QWB by three
methods, which are available as part of the Middleware Proxy:
• MiddlewareService.getAllDataForAllDataSources();
• MiddlewareService.getAllDataSourceIDs();
• MiddlewareService.getAllDataForADataSource(int did); (did: data source id)
The return type of all three methods is a Map containing the relevant data.
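An in-memory model of these three methods might look as follows. This is an illustrative sketch, not the actual Middleware Proxy implementation: the register() method is invented for the example, getAllDataSourceIDs() returns a list here for simplicity, and the field names follow Table 1.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative in-memory model of the Registry Services methods exposed
// through the Middleware Proxy.
public class DataSourceRegistry {
    // data source id -> map of Table 1 fields to values
    private final Map<Integer, Map<String, String>> sources = new HashMap<>();

    // Hypothetical registration hook, called when a data source joins.
    public void register(int did, Map<String, String> info) {
        sources.put(did, info);
    }

    public Map<Integer, Map<String, String>> getAllDataForAllDataSources() {
        return new HashMap<>(sources);
    }

    public List<Integer> getAllDataSourceIDs() {
        return new ArrayList<>(sources.keySet());
    }

    public Map<String, String> getAllDataForADataSource(int did) {
        return sources.get(did);
    }
}
```

The QWB would use getAllDataSourceIDs() to populate its data source picker and getAllDataForADataSource() to show the Table 1 details for a selected source.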
Fields          Description (and some sample data)
dataSource_id   dataSource_id
terminology
version
comments        Registry: GPRD

Table 2 Classification Information for each data source
4.4 Security Integration
4.4.1 Secure Data Transport
The secure data transport layer provides a secure transportation infrastructure for the
data among different components of the system. Here, the data includes the query
generated by the QWB and the results obtained from the data sources. In order to
securely transport the query from the QWB to designated data sources, we use
Security Library from WT 3.3 to encrypt the query. We have defined a security policy
for each user-role to bind the user actions with their roles. A policy file is an XML file
composed of one or more ‘policy’ blocks, which are wrapped in a ‘policies’ root
element. The security library is directed to apply one of these policies, which can
involve encrypting, decrypting or transforming the data through an XSLT. An example
of a policy file is included in Appendix 1. The security library also includes a private
key that is used in the encryption process.
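As a rough illustration of the structure just described (one or more 'policy' blocks wrapped in a 'policies' root element), a policy file might look like the fragment below. All element and attribute names other than 'policies' and 'policy' are invented for this sketch; Appendix 1 contains the actual example.

```xml
<policies>
  <!-- Hypothetical structure only; see Appendix 1 for the real example -->
  <policy id="encrypt-query">
    <action type="encrypt" key="transform-private-key"/>
  </policy>
  <policy id="anonymise-results">
    <action type="xslt" stylesheet="strip-identifiers.xsl"/>
  </policy>
</policies>
```

Binding a policy to a user role in this way lets the same library encrypt for one class of users and transform (e.g. anonymise) for another.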
We use the HTTP protocol for transportation between the front-end (QWB side) and
the Middleware. To secure this transportation, we use the secure HTTP (HTTPS)
protocol, configured through the Apache web server. In a web communication over
HTTPS, the request is encrypted using an SSL certificate at the producer side
and decrypted using the same SSL certificate at the consumer side. This ensures
that only authenticated consumers who have the right SSL certificate can
decrypt and see the original request.
At the QWB side, the encrypted query and user id are sent to the Middleware
component over SSL using the HTTPS protocol. The Middleware component cannot
decrypt the encrypted query as it does not have the security library used to encrypt
the query. Instead, it saves the encrypted query package in a database to enable
status updates to be provided. The query waits in this database until it is requested
by the backend (data source) side of the platform.
We issue another SSL certificate for HTTPS communication between Middleware
and backend side. When the back-end requests a new query to process, the
encrypted query is sent from the Middleware component to the back-end over this
secure HTTPS channel. The back-end component contains the security library, thus
it can decrypt the encrypted query to obtain the original CDIM query. Then, this query
is pushed into the JMS queue provided by DNC. The results, obtained from DNC, are
encrypted using the Security Library. The encrypted results are sent to the secure
FTP location (on Custodix site) over a secure channel.
When the QWB users issue a request for the results of their query, the corresponding
results are retrieved from the secure FTP location and passed on to the QWB over a
secure channel. When the encrypted results are received at the QWB side, the
results will be decrypted using the same security library that was used to encrypt the
original query. Finally, the decrypted results are presented to the user through the
QWB.
4.4.2 Authentication Framework
The TRANSFoRm Authentication Framework is provided using the Security Assertion
Markup Language (SAML), which is a flexible authentication standard aimed at web
services. The SAML standard defines an XML-based framework for describing and
exchanging security information between on-line partners. This information is
expressed in the form of portable SAML assertions that applications working across
security domain boundaries can trust.
The SAML standard defines precise syntax and rules for requesting, creating,
communicating and using these SAML assertions. As a result, SAML provides the
ideal foundation to handle the broad range of organisations and the significant
geographical dispersal involved in TRANSFoRm.
In particular, the TRANSFoRm Authentication Framework uses Shibboleth, an open
source software package built upon and including SAML that provides federated
single sign-on authentication. The use of SAML as the underlying standard ensures
that non-Shibboleth implementations of SAML, such as simpleSAMLphp for example,
can also be integrated into the Authentication Framework.
SAML relies upon a number of core concepts and roles that create the architecture
for the Authentication Framework. These are:
• Service Provider: A Service Provider (SP) is any TRANSFoRm application
that requires users to be authenticated in order to access the application.
• Identity Provider: The Identity Provider (IDP) provides the Single Sign-On
Service as part of the Authentication Framework. The IDP stores information
about the user and authenticates the user's identity by requiring the user to log
in with their username and password.
• Assertion: The IDP can assert security information to the SP in the form of
XML statements about the user. These statements are known as an Assertion.
For instance, a SAML assertion could state the user's name, project role
and contact details.
• Metadata: Metadata is used to express and share configuration between the
IDP and SP. This describes the IDP and SP to each other and tells them
where they can be found. Sharing this metadata is a fundamental part of
integrating a new SP into the authentication framework.
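For illustration, a simplified SAML 2.0 assertion carrying a user's project role might look like the fragment below. The namespace is the standard SAML 2.0 assertion namespace, but the subject value and the projectRole attribute name are assumptions for this sketch, not the actual TRANSFoRm attribute scheme.

```xml
<saml:Assertion xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion">
  <saml:Subject>
    <saml:NameID>jsmith</saml:NameID>
  </saml:Subject>
  <saml:AttributeStatement>
    <saml:Attribute Name="projectRole">
      <saml:AttributeValue>researcher</saml:AttributeValue>
    </saml:Attribute>
  </saml:AttributeStatement>
</saml:Assertion>
```

An SP receiving such an assertion can trust the attribute values (the assertion is signed by the IDP) and base its authorisation decisions on them.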
TRANSFoRm uses a centralised identity provider, where information on users and
their global user roles is stored in an LDAP repository. Although centralised in this
case, Shibboleth supports a federated identity provider architecture, allowing the
authentication to be easily expanded if necessary. Users are added to this repository
using a web based management tool that is not itself a part of the identity provider.
Details on the user roles and this web based tool are described in greater detail in
Section 4.6.
When a user tries to access a TRANSFoRm user-facing application, the
authentication framework is invoked to authenticate the user’s identity and provide
details on the user to the application allowing authorisation decisions to be made on
user actions in that application. The steps taken during this authentication process
are:
1. The user attempts to access a resource on the SP sp.example.com.
2. The SP sends an HTTP redirect response to the browser. This redirection
contains the destination address of the Sign-On Service at the IDP together with
an authentication request <AuthnRequest>.
3. The IDP challenges the user, via their browser, to provide valid credentials
(username and password).
4. The user provides valid credentials, and a local logon security context (login
session) is created for the user at the IDP.
5. The IDP builds a SAML assertion representing the user's logon security context.
The assertion is digitally signed and then placed within an HTML form.
6. The browser issues an HTTP POST request to send the form to the SP’s
assertion consumer service.
7. An access check is made to establish whether the user has the correct
authorisation to access the resource. If the access check passes, the resource
is returned to the browser. Note that this access check is made by the
application, using the information contained in the SAML assertion returned
from the IDP.
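The application-side access check in step 7 could be sketched as follows; the role and action names here are illustrative and do not reflect the actual TRANSFoRm role model.

```java
import java.util.Map;
import java.util.Set;

// Sketch of the access check made by the application (step 7): the SP reads
// the role attribute from the SAML assertion and decides whether the
// requested action is permitted for that role.
public class AccessCheck {
    private static final Map<String, Set<String>> PERMITTED = Map.of(
            "researcher", Set.of("create-study", "run-query", "view-results"),
            "data-controller", Set.of("approve-query", "view-results"));

    // role: taken from the SAML assertion; action: the resource being accessed.
    public static boolean isAllowed(String role, String action) {
        return PERMITTED.getOrDefault(role, Set.of()).contains(action);
    }
}
```

Keeping the check in the application (rather than the IDP) is what lets each TRANSFoRm tool enforce its own authorisation rules over the shared authentication.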
Figure 3 TRANSFoRm Authentication Framework
4.5 Provenance Integration
The TRANSFoRm provenance framework is integrated into the federated
infrastructure for data linkage by a series of workflows that are triggered at different
points throughout the query lifecycle. These workflows invoke the central provenance
service via web services, passing information that annotates and audits the event.
The occasions on which this occurs, together with a description of the workflows, are
outlined below.
4.5.1 Reception of unencrypted CDIM query
The federated infrastructure connects research and clinical systems to target EHR
data-sources. At both ends of this infrastructure, unencrypted data, such as a CDIM
query from the QWB, or results from the Data Node Connector (DNC) are passed to
the federated infrastructure. However, for the purpose of this section we will begin
with the creation of the query.
Figure 4 Reception of unencrypted data
In this instance we include the Query Formulation Workbench (QWB) as the initiating
application, which triggers the workflow by passing a query to the Middleware Proxy
library (described in Section 4.2). The proxy library invokes the provenance service to
record that it has received a new query and is about to begin packaging the query for
transmission by encrypting it. Included in this call is a provenance URI that is passed
from the QWB along with the query, which enables provenance to associate the
packaging with connected events in the QWB.
The initial call to provenance returns a URI, middlewareProvUri, which is added to
the query package about to be encrypted. This URI is included because it will be used
when the query is decrypted at the EHR repository. Next, the proxy library uses the
security library (Seclib) to encrypt and sign the package and once this is complete
another call is made to provenance to annotate the end of this packaging process.
Once this is complete, the proxy library submits the query to the middleware services
before once again calling the provenance service to record that the packaged
query has now been sent.
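The sequence of calls above can be sketched as follows. The `Provenance` interface and the event names used here are hypothetical stand-ins for the real provenance service API, which is defined by the TRANSFoRm provenance framework (D3.1); only the ordering of the calls reflects the workflow described in this section:

```java
import java.util.function.UnaryOperator;

// Sketch of the proxy-library packaging workflow: announce packaging,
// embed the returned provenance URI, encrypt, then record completion
// and submission. Interface and event names are illustrative.
public class QueryPackaging {

    public interface Provenance {
        // Annotate an event linked to a prior provenance URI; returns a new URI.
        String annotate(String event, String relatedUri);
    }

    // Returns the encrypted package, recording each step with provenance.
    public static String packageQuery(String query, String qwbProvUri,
                                      Provenance prov, UnaryOperator<String> encrypt) {
        // 1. Announce that packaging is starting; link back to the QWB event.
        String middlewareProvUri = prov.annotate("packageData", qwbProvUri);
        // 2. Embed the returned URI so the receiving site can link the
        //    later unpackaging event (unpackageData) to this one.
        String packaged = encrypt.apply(middlewareProvUri + "|" + query);
        // 3. Annotate the end of packaging, then the submission of the query.
        prov.annotate("packagingComplete", middlewareProvUri);
        prov.annotate("querySent", middlewareProvUri);
        return packaged;
    }
}
```

In the real system the encryption step is performed by Seclib, and submission goes to the middleware services rather than returning a string.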
4.5.2 Query to EHR Repository
The next event that is annotated in provenance is the transmission of the encrypted
query to the EHR data source site. The workflow begins when the Middleware
Proxy library requests a new query and one exists for that data source. In that
instance, the middleware service calls the provenance service to record that it is
sending the encrypted query data to the middleware proxy located at the data
source.
Figure 5 Query to EHR data source
When the query arrives at the middleware proxy library, provenance is informed that
the query has arrived and the proxy library then decrypts the query using
the security library (Seclib). Once decrypted, the contents of the package are
accessible, and the middlewareProvUri that was originally encrypted with the package
is sent to provenance to record that the package has been unpackaged
(unpackageData). The original CDIM query is then passed to the DNC for processing.
4.5.3 Results from Data Source
The next point in the query life cycle where provenance information is recorded is
when the query results are ready and returned from the Data Node Connector. This
workflow is similar to the reception of an unencrypted query from the QWB and
involves annotating the reception of the unencrypted results and the encryption
process applied to those results. Once encrypted, the results are placed in a
secure sFTP location until the researcher requests their retrieval.
Figure 6 Results returned from the data source
4.5.4 Retrieval of results by QWB User
The final workflow involving provenance is triggered when a QWB user requests the
count results of their query to be retrieved from the sFTP location. The request is
sent to the Middleware Proxy library, which retrieves the encrypted results and
makes a call to the central provenance service to record that it has received the
data. Next, the query results are decrypted using the security library (Seclib) and
provenance is informed that the results are now decrypted. Finally, the result set
is returned to the QWB application, where it can be accessed by the requesting
researcher.
Figure 7 Retrieve query results for user
4.6 Global User Management
TRANSFoRm is composed of many heterogeneous applications and users. This
presents a number of challenges in managing these users and their
access to TRANSFoRm tools. For certain applications, the same user may possess
several different roles based on application specific concepts, which are impossible
to capture at a global project level due to the complexity involved. To address this
problem, we define user roles at two levels: a global level, with project-wide roles
maintained by the federated infrastructure, and an application level, where the
user’s application-specific roles are maintained (e.g. researcher access to a particular
study on the QWB). This distinction is discussed in Section 4.4 where the
architecture of the authentication framework is outlined.
4.6.1 User Roles
TRANSFoRm’s operational security policy [8] presents the identified user roles at
both levels of the system. Here, we concentrate on just the global user roles, which
are managed centrally by the federated infrastructure. Table 3 below provides a
description of the global user roles that capture all required user features at a global
level across the TRANSFoRm infrastructure. The “Administrator” role is a key role
and is limited to certain authorised individuals. These users manage the global
management system for TRANSFoRm with the power to invite new users and delete
or edit existing users. They are also the only users who can access the full
functionality of the User Management Tool (described in Table 3) with all other users
being limited to changing the passwords for their accounts.
Role Name Role-Id Description
Researcher ROLE_RESEARCH A typical research user of TRANSFoRm user-facing tools such as the QWB.
General Practitioner ROLE_GP Clinical general practitioners who access TRANSFoRm tools.
Administrator ROLE_ADMIN Administrators may create new users and invite them to TRANSFoRm. They may also manage existing users by changing their user role and other information.
Table 3 TRANSFoRm Global User roles
4.6.2 User Repository
OpenLDAP [9], an open source implementation of the Lightweight Directory Access
Protocol (LDAP) is used to store and manage TRANSFoRm user information in the
federated infrastructure. LDAP was chosen as it is an open industry standard for
managing distributed directory information, making it ideally suited to a federated
infrastructure such as TRANSFoRm. In LDAP, users and groups are represented as
entries (objects) in the repository, with a tree structure providing a
hierarchy between them. Every entry contains a set of
attributes with an attribute being defined in a schema and possessing one or more
values. All entries are also uniquely identified with a Distinguished Name (DN) which
is constructed from attributes of the entry and the parent entry’s DN.
TRANSFoRm uses a shallow hierarchy to store user information. The repository is
split into two branches with one branch, “People”, storing all the user entries,
including information on each user’s global role. The second branch contains a set of
groups, one for each type of user role. These groups contain the DN for each user
who has that role in TRANSFoRm. This effectively duplicates the user role
information contained in the user’s individual entry. However, it provides easy access
to information about which users hold which roles and thus makes managing the
repository easier. It also simplifies the implementation of authorisation policies on the
repositories as, for example, the ability to add new users or change existing users
can be limited to members of the “Administrator” group.
Figure 8 TRANSFoRm User Repository Structure
User entries are stored as a combination of the “inetOrgPerson” and “person”
object classes, two standard LDAP classes used to create entries for users and
provide user-specific attributes. The attributes stored for each user, and their
descriptions, are outlined in Table 4.
InetOrgPerson Attribute
TRANSFoRm Attribute Name
Description
DN Distinguished Name This is a required attribute of all LDAP entries. In TRANSFoRm it is composed of the entry UID plus the branch of the repository it is in.
UID User ID A unique user id for each user
Title Title The title of the user, e.g. Mr, Mrs, Dr, Prof.
SN Surname The user’s surname
givenName First Name The user’s first name(s)
postalAddress Institution/Organisation The institution, organisation or clinic that the user belongs to
mail Email The email of the user
CN Common Name This is a required attribute of the inetOrgPerson object class. It is composed by combining the user’s first name and surname.
employeeType User Role This contains the user’s global role: one or more of ROLE_RESEARCH, ROLE_GP and ROLE_ADMIN. An individual user may hold more than one role.
Table 4 TRANSFoRm User objects
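Combining the role and attribute definitions in Tables 3 and 4, a user entry and its corresponding role group might look like the following LDIF fragment. The base DN, branch names and the `groupOfNames` object class are illustrative assumptions; only the attribute names follow Table 4:

```
# Illustrative user entry in the "People" branch
dn: uid=jdoe,ou=People,dc=transform,dc=eu
objectClass: inetOrgPerson
objectClass: person
uid: jdoe
title: Dr
givenName: Jane
sn: Doe
cn: Jane Doe
mail: jane.doe@example.org
postalAddress: Example Clinic
employeeType: ROLE_RESEARCH

# Corresponding group entry in the second branch, holding the member's DN
dn: cn=ROLE_RESEARCH,ou=Groups,dc=transform,dc=eu
objectClass: groupOfNames
cn: ROLE_RESEARCH
member: uid=jdoe,ou=People,dc=transform,dc=eu
```

Note how the user's role appears twice, once as the `employeeType` attribute of the user entry and once via membership of the role group, matching the deliberate duplication described above.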
4.6.3 User Management Tool
The TRANSFoRm user management tool is a web based application that is
predominantly designed to allow administrators to manage TRANSFoRm users and
their roles at a global system level. The user management tool is implemented in
Spring MVC and provides a set of core functionalities to users to enable them to
connect to and update the LDAP repository that contains global user information.
These functionalities are:
1. Create a single user (invite a new user to TRANSFoRm)
2. Create a batch set of users (invite a group of users to TRANSFoRm)
3. Edit User information
o Roles
o Email
o Title
4. Complete Registration
5. Change Password
Numbers 1 to 3 are limited to the administrator users, whilst all users may use
functionalities 4 and 5. Inviting a single user is completed via a web form, where the
administrator enters the user’s name, title, email address, institution and role. Once
submitted, this completes the first part of user registration: the user is
added to the LDAP repository. To complete full registration,
an email containing a uniquely generated URL is automatically sent to the user. The
user is invited to click the link, which returns them to the
User Management Tool, where they are prompted to set their password. Once set, this
completes the user creation process.
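The uniquely generated URL can be implemented by binding a random, single-use token to the invited user's UID. The sketch below illustrates this scheme only; it is not the tool's actual implementation, and the class and method names are invented:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Sketch of single-use registration links: a random token is stored against
// the invited user's UID and redeemed exactly once when the password is set.
public class RegistrationLinks {

    private final Map<String, String> pending = new HashMap<>(); // token -> uid

    // Generates the link included in the invitation email.
    public String inviteUrl(String uid, String baseUrl) {
        String token = UUID.randomUUID().toString(); // unguessable random token
        pending.put(token, uid);
        return baseUrl + "/register?token=" + token;
    }

    // Returns the UID if the token is valid, consuming it; null otherwise,
    // so a link cannot be used twice.
    public String redeem(String token) {
        return pending.remove(token);
    }
}
```

A production version would also expire tokens after a time limit and persist them rather than holding them in memory.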
The User Management tool also allows administrators to invite a group of users at
once to TRANSFoRm. This functionality, which is limited to creating general
practitioner users, allows the administrator to upload user information in a CSV file.
Each entry in this file is then processed in turn, in the same manner as a single
user, with invitation emails being sent to all included users.
A number of screenshots from the User Management tool, as well as a template for
the batch user creation CSV file, are provided in Appendix 2 of this deliverable.
4.7 Summary
In this section we described the implementation of the distributed platform
component of the TRANSFoRm Federated Infrastructure for Data Linkage. We
began by reviewing the technology stack used in the project, then described the
different components of the platform, the workflow of a
query, and the details of security and provenance integration. In the next section we
outline the data linkage and extraction components of the federated infrastructure.
5 Data Extraction
The TRANSFoRm platform realises two basic operations: (1) a request for patient
counts or data, issued by a study QWB and targeted at a clinical or genetic
repository, and (2) requests for data embedded within an ODM, issued by a study
system and targeted at an EHR. In this report we focus on the first kind of operation.
The platform’s conceptual components, which mediate these operations, are shown
below in Figure 8.
Figure 8 Conceptual Architecture for Data Extraction and Linkage
Between the study system (including QWB) and the data source (clinical repository,
genetic repository) are various connectors, and a set of components which together
are called the Data Node Connector. The former simply offer a means of moving
information around using message queues, web service calls or files; while the latter
provide the management of activities necessary to complete the operations.
The DNC for this scenario is split into two parts, one platform-facing, and the other
data source facing. This was necessary for two reasons: (1) there was a requirement
that the networks on which the platform and clinical repositories reside could be
electronically isolated, requiring a physical intervention to complete operations, and
(2) to permit distinct technologies to be used by the platform (presently Java-based)
and components accessing the clinical repository or EHR (presently variable in terms
of specific relational database management system (RDBMS) and access methods).
The QWB is the source of single queries (for counts) and multiple queries (for data).
In both the single count-query and multiple data-query scenarios, DNC-WB polls the
distributed platform connector (outlined in Section 4 above) for available query
messages. These connectors are message queues such as JMS, and they
provide security for the messages, since the messages cross ‘foreign’
networks between the QWB and the DNCs. A message is available once a query has
been submitted by the study QWB; the embedded queries are then subject to
translation by the semantic mediator, which DNC-WB ensures.
After semantic mediation the queries are held as files and are transferred to the
control of DNC-DS. This may occur automatically, or under the supervision of the
controller of the data source. For example, for one repository within the TRANSFoRm
project (NIVEL) it was a requirement that the file containing a query be moved
physically between two separated networks and inspected before its execution
continues, with the console component permitting this inspection and authorisation.
For a clinical repository, DNC-DS’s workflow is relatively straightforward: a single
count-query or multiple data-request-queries are parsed according to the query
model and individual SQL queries executed against the relational database using a
SQL connector; and the results consolidated into counts for return to the study
system QWB, or data files for delivery to a safe location for analysis.
5.1 DNC-WB (Query Formulation Workbench)
The DNC-WB is used to receive queries from the QWB by polling the message
queues of the middleware. The query or queries contained within the message are
targeted at clinical or genetic repositories to establish the number of patients
satisfying study eligibility criteria, or to provide data for previously identified cohorts.
This DNC underpins the Diabetes use-case (WT 1.1) and the BMS use-case (WT?).
The workflow for this DNC is relatively straightforward. Queries are extracted from
the encrypted messages and parsed according to the query model (WT 6.4). This
yields a set of (CDIM-augmented, openEHR) archetypes defining the required data
elements and their constraints, e.g. ‘Laboratory HbA1c > 7.5%’.
Each archetype, specified in ADL, is submitted to the Semantic Mediator which uses
the data source model (DSM) and CDIM-DSM mapping model to translate the
archetypes to an equivalent query for the local data source (see Figure 9). In all
cases so far this is a SQL query. These SQL queries are then re-embedded within
the overall query XML document in place of the archetype elements. See Appendix 3
for examples.
The updated query is now placed as a message within a file to which the console
component has access. The console component will determine whether this
message file will be ‘carried’ to the data-facing part of the DNC (named DNC-DS).
This will happen automatically if local governance allows this. However, any data
source can opt to inspect the query in the message to decide whether it is to be
executed. Rejected messages will generate an exception message response for use
by the QWB. DNC-DS will process the query: data retrieved (for each archetype in
the original query) are subsequently combined according to the logical and temporal
operators within the remainder of the overall query. DNC-DS will then compose a
response using the same format as the original, but with counts inserted, and return
the response to the file-based message queue. DNC-WB posts this file back to the
QWB using the middleware provided by the distributed platform. The results are
encrypted by the distributed platform before they leave the data owner’s jurisdiction.
Note that the file-based communication between DNC-WB and DNC-DS is not
secured explicitly by TRANSFoRm as both components reside within the same
organisational jurisdiction, and the organisation itself is expected to take all the
necessary precautions. (Both DNC-WB and DNC-DS will be given access rights to
file systems and databases associated with the user account(s) under which they
run.)
The activities of DNC-WB are reported to the provenance service at key points in the
workflow. There is no reporting of workflow in relation to the DNC-DS. The hosting
organisation may however audit this activity using its own mechanisms.
Figure 9 DNC-WB use of semantic mediator to translate data elements expressed as archetypes to local database queries, usually SQL queries. DNC-DS is not shown.
5.2 DNC-DS (Data Source)
As discussed in the section above, the DNC-DS receives the message through the
file-queue boundary connector, parses the query contained in the message file,
and executes the embedded local (SQL) queries at the appropriate points in the
parse using the database access connector. The SQL queries always produce a
record set which includes the patient identifier and time-point for the data values
satisfying the data element criteria, the patient identifier being applied to logical
operators and the time-points to temporal operators. The final result of the logic is a
single record set, from which the final patient count for the query is derived. All record
sets are held in memory to avoid the need for a local database for use by the DNC.
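The in-memory combination of record sets can be illustrated with plain sets of patient identifiers: a criteria group with the AND operator intersects the sets returned by its member queries, while OR unions them. Temporal operators, which additionally consult the time-points, are omitted from this sketch, and the class below is an illustration rather than the DNC's actual code:

```java
import java.util.HashSet;
import java.util.Set;

// Illustration of in-memory combination of per-archetype record sets.
// Each SQL query yields a set of patient identifiers; criteria groups
// combine them with AND (intersection) or OR (union).
public class RecordSets {

    public static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> out = new HashSet<>(a);
        out.retainAll(b); // keep only patients present in both record sets
        return out;
    }

    public static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> out = new HashSet<>(a);
        out.addAll(b); // patients satisfying either criterion
        return out;
    }

    // The final patient count is simply the size of the resulting set.
    public static int count(Set<Integer> result) {
        return result.size();
    }
}
```

For the diabetes example query in Appendix 3, the top-level AND would intersect the date-of-birth, diagnosis and medication record sets before the count is taken.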
DNC-DS then embeds the counts in the original XML message which is placed in the
file-queue boundary connector for inspection by the console. As with the incoming
message, the onward transmission of this message to DNC-WB can be automatic, or
the message can be inspected before transmission. The message can obviously be
rejected at this point and substituted with an exception message which the QWB can
parse.
When extracting data for submission to the study system, a set of extract queries –
one for each archetype of interest – is used to extract the data, which is placed in an
output file for transmission (by sFTP) to the study system. CDIM provides the meta-data for
these data elements and the analysts can store and structure this data as required. A
further ‘flag patient’ query is also provided as part of the data extract request to
specify the patients for which data is required.
It should be noted that for the current version of the platform no reverse mapping of
local code systems to common coding systems is provided. Therefore, the coded
data at the study system consists of local codes – whether truly local, national or
international.
5.3 Summary
This section described the implementation of data extraction mechanisms for the
TRANSFoRm Federated Infrastructure for Data Linkage. These functionalities are
provided by the DNC component. The DNC is located local to the target EHR data
sources and acts as an interface between TRANSFoRm and the data owner’s
infrastructure. Due to the heterogeneous technical characteristics and requirements
of different data sources, the operation of data extraction requires that the DNC is
split into two parts: one facing the platform (DNC-WB), the other facing the data
source (DNC-DS).
6 Concluding Remarks
Within the TRANSFoRm project we have developed a federated infrastructure to
achieve data linkage between research and clinical systems. To achieve this we
have conceptually split the infrastructure into two components. The first is a
distributed platform that handles communication between distributed endpoints and
provides a semantically rich registry service as well as user authentication. The second
conceptual component provides data extraction functionalities to the federated
infrastructure.
The distributed platform component is built upon Apache Camel, an open source
lightweight integration framework that provides a comprehensive set of Enterprise
Integration Patterns (EIPs) as well as load balancing and fault tolerance
mechanisms. This has facilitated a service based infrastructure with asynchronous
messaging across a set of distributed components, providing integrated research and
clinical systems with a highly available set of services.
The distributed platform provides secure data transport between these systems using
three components. Two of these components are deployed local to research and
data source applications as middleware proxy libraries. These libraries provide the
functionalities of the distributed platform (data communication, data encryption and
the registry service) in a well-defined interface that encapsulates complex distributed
workflows across the federated infrastructure. They also integrate the TRANSFoRm
technical security policy (D3.3), signing and encrypting query and results data before
they are sent across the federated infrastructure, which ensures the security and
integrity of this sensitive information.
The third component in the distributed platform is the middleware server which
communicates with deployed proxy libraries. This component maintains an index of
submitted queries as well as a semantically rich register of available data source
information. Both of these can be queried through the deployed proxy libraries
throughout TRANSFoRm. In each component, provenance information is recorded
and communicated to the TRANSFoRm provenance service to allow all data access
requests to be fully audited across the federated infrastructure.
The other aspect of the distributed platform, described in this deliverable, is the
authentication framework, which is used to authenticate users accessing
TRANSFoRm user-facing tools. This framework is based on SAML, with Shibboleth,
an industry standard implementation of SAML, being selected to power a centralised
identity provider for TRANSFoRm users. This authentication framework redirects
users to the identity provider whenever they attempt to access TRANSFoRm, where
they authenticate using a unique username and password. Once authenticated, the
user is returned to the application along with information describing the user,
which enables the application to determine what actions each user
is authorised to perform.
The authentication framework requires a repository of users and their information to
be maintained and a user friendly tool to enable users to be added and managed.
The Lightweight Directory Access Protocol (LDAP) is used to create a repository of
users whilst a web based User Management Tool has been developed to manage
this user directory.
Data extraction is provided by the Data Node Connector (DNC). The DNC acts as the
interface between queries arriving via the distributed platform and the local data
source from which the data needs to be extracted. This interface involves the
translation of a query arriving in CDIM format into an executable form that can be
processed and executed by the local data source. The semantic mediator is used in this
translation process, with the DNC also providing a user-facing console, residing at the
data provider site, that displays an arriving query in a form meaningful to the data
controller at the site.
Due to the complex and heterogeneous technical requirements of different data
sources, the DNC is split into two parts: one facing the platform and other
TRANSFoRm-developed tools (DNC-WB), the other facing the data source (DNC-
DS). This enforces electronic separation of clinical repositories and permits distinct
technologies to be used by the distributed platform and data access components.
7 References
[1] “Translational Research And Patient Safety In Europe, ICT-2009.5.2-247787, Annex I- Description of Work,” 2011.
[2] “TRANSFoRm Project.” [Online]. Available: http://www.transformproject.eu/.
[3] V. C. A. Anjum, “D3.1: TRANSFoRm Provenance Framework,” 2011.
[4] S. Farrell, “TRANSFoRm Technical Security Framework,” 2011.
[5] “Spoilt for Choice: Which Integration Framework to use – Spring Integration, Mule ESB or Apache Camel.” [Online]. Available: http://www.kai-waehner.de/blog/2012/01/10/spoilt-for-choice-which-integration-framework-to-use-spring-integration-mule-esb-or-apache-camel/.
[6] “Apache Camel.” [Online]. Available: http://www.methodsandtools.com/tools/tools.php?camel.
[7] “Open Source Integration with Apache Camel and How Fuse IDE Can Help.” [Online]. Available: http://java.dzone.com/articles/open-source-integration-apache.
[8] “User roles in TRANSFoRm tools – operational security policy,” 2013.
[9] “OpenLDAP.” [Online]. Available: http://www.openldap.org/.
8 Appendix 1

This Appendix includes a sample policy file that can be used by TRANSFoRm’s
federated infrastructure. The policies included contain signature and encryption
policies, for example “researcherSignature”. Additionally, the file contains decryption
policies for returned query results, “researcherResults”.
<policies>
<policy name="researcherSignature">
<sign>
<signer>researcher</signer>
</sign>
<encrypt content-only='true'>
<recipient>queryProcessor</recipient>
</encrypt>
</policy>
<policy name="analystSignature">
<sign>
<signer>analyst</signer>
</sign>
<encrypt content-only='true'>
<recipient>queryProcessor</recipient>
</encrypt>
</policy>
<policy name="researcherResults">
<decrypt match='//body'>
<recipient>researcher</recipient>
</decrypt>
<verify>
<allowed-signer>queryProcessor</allowed-signer>
<allowed-signer>relayInstitution</allowed-signer>
</verify>
<!-- Convert to HTML (to demonstrate) -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="no" omit-xml-declaration="yes" method="html" cdata-section-elements="pre"/>
<xsl:template match="//body/results">
<html>
<head><title>Results for <xsl:value-of select="//security-header[@name='principleAuthenticationName']"/></title></head>
<body>
<xsl:apply-templates select="@*|node()"/>
</body>
</html>
</xsl:template>
<xsl:template match="//results/response">
<h2>Response</h2>
<pre><xsl:value-of select="."/></pre>
</xsl:template>
<xsl:template match="@*|node()">
<xsl:apply-templates select="@*|node()"/>
</xsl:template>
</xsl:stylesheet>
</policy>
<policy name="analystResults">
<decrypt match='//body'>
<recipient>analyst</recipient>
</decrypt>
<verify>
<allowed-signer>queryProcessor</allowed-signer>
<allowed-signer>relayInstitution</allowed-signer>
</verify>
<!-- Convert to HTML (to demonstrate) -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="no" omit-xml-declaration="yes" method="html" cdata-section-elements="pre"/>
<xsl:template match="//body/results">
<html>
<head><title>Results for <xsl:value-of select="//security-header[@name='principleAuthenticationName']"/></title></head>
<body>
<xsl:apply-templates select="@*|node()"/>
</body>
</html>
</xsl:template>
<xsl:template match="//results/response">
<h2>Response</h2>
<pre><xsl:value-of select="."/></pre>
</xsl:template>
<xsl:template match="@*|node()">
<xsl:apply-templates select="@*|node()"/>
</xsl:template>
</xsl:stylesheet>
</policy>
</policies>
9 Appendix 2

This Appendix includes several screenshots of the User Management Tool that is
described in Section 4.6 above. It also includes the template for the CSV to batch
create general practitioner users.
9.1 CSV Template
The CSV should contain the following fields, in order:
• Title
• First Name
• Surname
• Organisation/Institution
The first line of the file should contain these headers with the first user being on line
2.
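For illustration, a file following this template might begin as below; all values are invented:

```
Title,First Name,Surname,Organisation/Institution
Dr,Jane,Doe,Example Clinic
Mr,John,Smith,Example Practice
```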
9.2 User Management Tool Screenshots
Figure 10 TRANSFoRm User Management Tool: Login Page
Figure 11 TRANSFoRm User Management Tool: Home Screen
Figure 12 TRANSFoRm User Management Tool: Invite New User
Figure 13 TRANSFoRm User Management Tool: View TRANSFoRm Users
10 Appendix 3
This Appendix includes some sample queries illustrating the data extraction
component of the federated infrastructure.
10.1 Example Query
Example Query generated by the Query Formulation Workbench using the eligibility
criteria Query Model
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<EligibleSubjectCountRequest>
<QueryCriteria id="1248">
<Criteria type="criteriaGroup" operator="AND" id="1249">
<Criteria type="singleCriterion" id="1251">
<Archetype abbreviated="yes">
(adl_version=1.4)
TRANSFoRm-CRIM-DATAENTRY.dob.v1
value <=1977-01-01
ontology cdim_000007
</Archetype>
</Criteria>
<Criteria type="singleCriterion" id="1255">
<Archetype abbreviated="yes">
(adl_version=1.4)
TRANSFoRm-CRIM-DATAENTRY.diagnosis.v1
value [ICD10::E11][ICPC2EDUT::T90][RCDV3::X40J5][SNOMEDCT::44054006]
ontology cdim_000011, cdim_000012
</Archetype>
</Criteria>
<Criteria type="criteriaGroup" operator="OR" id="1257">
<Criteria type="singleCriterion" id="1258">
<Archetype abbreviated="yes">
archetype (adl_version=1.4)
TRANSFoRm-CRIM-DATAENTRY.medication.v1
value [ATC::A10BA02]
ontology cdim_000037, cdim_000045
</Archetype>
</Criteria>
<Criteria type="singleCriterion" id="1259">
<Archetype abbreviated="yes">
archetype (adl_version=1.4)
TRANSFoRm-CRIM-DATAENTRY.medication.v1
[RCDV3::X80NJ,XM0lF,f3...]
ontology cdim_000037, cdim_000045
</Archetype>
</Criteria>
</Criteria>
<Criteria type="criteriaGroup" operator="OR" id="1261">
<Criteria type="singleCriterion" id="1262">
<Archetype abbreviated="yes">
archetype (adl_version=1.4)
TRANSFoRm-CRIM-DATAENTRY.lab_test.v1
Value [SNOMEDCT::40402000]
ontology OGMS_0000056, CDIM_000032, IAO_0000003, CDIM_000029
</Archetype>
</Criteria>
<Criteria type="singleCriterion" id="1263">
<Archetype abbreviated="yes">
archetype (adl_version=1.4)
TRANSFoRm-CRIM-DATAENTRY.lab_test.v1
Value [SNOMEDCT::36048009]
ontology OGMS_0000056, CDIM_000032, IAO_0000003, CDIM_000029
</Archetype>
</Criteria>
<Criteria type="singleCriterion" id="1264">
<Archetype abbreviated="yes">
archetype (adl_version=1.4)
TRANSFoRm-CRIM-DATAENTRY.lab_test.v1
Value [SNOMEDCT::144185003,144167005,166893007,166911009,271062006]
ontology OGMS_0000056, CDIM_000032, IAO_0000003, CDIM_000029
</Archetype>
</Criteria>
</Criteria>
</Criteria>
</QueryCriteria>
<Destination name="NIVEL">
<Practice>8872</Practice>
<Practice>8711</Practice>
<Practice>8087</Practice>
</Destination>
</EligibleSubjectCountRequest>
10.2 Query post-substitution
An example query after substitution of archetype elements by SQL elements
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<EligibleSubjectCountRequest>
<QueryCriteria id="1248">
<Criteria type="criteriaGroup" operator="AND" id="1249">
<Criteria type="singleCriterion" id="1251">
<SQL comment="Date of birth &lt; 1977">
<![CDATA[SELECT DISTINCT CLIENT.ID_CLIENT AS CDIM_000003, CLIENT.GEBOORTEDATUM AS CDIM_000007
FROM CLIENT
INNER JOIN PRAKTIJK ON CLIENT.ID_PRAKTIJK = PRAKTIJK.ID_PRAKTIJK
WHERE (DATEDIFF(day, CLIENT.GEBOORTEDATUM, '2002-01-01') > 0)
ORDER BY CDIM_000003, CDIM_000007]]>
</SQL>
</Criteria>
<Criteria type="singleCriterion" id="1255">
<SQL comment="has Diabetes">
<![CDATA[SELECT DISTINCT CLIENT.ID_CLIENT AS CDIM_000003, MORBIDITEIT.DATUM AS CDIM_000012
FROM MORBIDITEIT
INNER JOIN CLIENT ON MORBIDITEIT.ID_CLIENT = CLIENT.ID_CLIENT
INNER JOIN PRAKTIJK ON CLIENT.ID_PRAKTIJK = PRAKTIJK.ID_PRAKTIJK
WHERE MORBIDITEIT.DIAGNOSE IN (219000,219001,219002)
ORDER BY CDIM_000003, CDIM_000012]]>
</SQL>
</Criteria>
<Criteria type="criteriaGroup" operator="OR" id="1257">
<Criteria type="singleCriterion" id="1258">
<SQL comment="takes metformin">
<![CDATA[SELECT DISTINCT CLIENT.ID_CLIENT AS CDIM_000003, PRESCRIPTIE.RECEPTDATUM AS CDIM_000105
FROM PRESCRIPTIE
INNER JOIN CLIENT ON PRESCRIPTIE.ID_CLIENT = CLIENT.ID_CLIENT
INNER JOIN PRAKTIJK ON CLIENT.ID_PRAKTIJK = PRAKTIJK.ID_PRAKTIJK
WHERE PRESCRIPTIE.ATC IN ('A10BA02')
ORDER BY CDIM_000003, CDIM_000105]]>
</SQL>
</Criteria>
<Criteria type="singleCriterion" id="1259">
<SQL comment="takes Sulphonylurea compounds">
<![CDATA[SELECT DISTINCT CLIENT.ID_CLIENT AS CDIM_000003, PRESCRIPTIE.RECEPTDATUM AS CDIM_000105
FROM PRESCRIPTIE
INNER JOIN CLIENT ON PRESCRIPTIE.ID_CLIENT = CLIENT.ID_CLIENT
INNER JOIN PRAKTIJK ON CLIENT.ID_PRAKTIJK = PRAKTIJK.ID_PRAKTIJK
WHERE PRESCRIPTIE.ATC IN ('X80NJ', 'XM0lF', 'f3...', 'X80NJ', 'XM0lF', 'f3...', '372711004',
'34012005', '259552008', '273950002', 'C-A2400', '372711004', '34012005', '259552008',
'273950002', 'NOCODE', 'C0038766')
ORDER BY CDIM_000003, CDIM_000105]]>
</SQL>
</Criteria>
</Criteria>
<Criteria type="criteriaGroup" operator="OR" id="1261">
<Criteria type="singleCriterion" id="1262">
<SQL comment="has HbA1c > 6.5 mmol/l">
<![CDATA[SELECT DISTINCT CLIENT.ID_CLIENT AS CDIM_000003, UITSLAGEN.REGISTRATIEDATUM AS CDIM_000029
FROM UITSLAGEN
INNER JOIN CLIENT ON UITSLAGEN.ID_CLIENT = CLIENT.ID_CLIENT
INNER JOIN PRAKTIJK ON CLIENT.ID_PRAKTIJK = PRAKTIJK.ID_PRAKTIJK
INNER JOIN HULP_UITSLAGHIS ON UITSLAGEN.NHGNUMMER = HULP_UITSLAGHIS.nhgnummer
WHERE UITSLAGEN.TYPEUITSLAG = 1
AND UITSLAGEN.NHGNUMMER IN (368) AND (LTRIM(UITSLAGEN.WAARDE) >= '06.5')
ORDER BY CDIM_000003, CDIM_000029]]>
</SQL>
</Criteria>
<Criteria type="singleCriterion" id="1263">
<SQL comment="has Random glucose > 9.9 mmol/l">
<![CDATA[SELECT DISTINCT CLIENT.ID_CLIENT AS CDIM_000003, UITSLAGEN.REGISTRATIEDATUM AS CDIM_000029
FROM UITSLAGEN
INNER JOIN CLIENT ON UITSLAGEN.ID_CLIENT = CLIENT.ID_CLIENT
INNER JOIN PRAKTIJK ON CLIENT.ID_PRAKTIJK = PRAKTIJK.ID_PRAKTIJK
INNER JOIN HULP_UITSLAGHIS ON UITSLAGEN.NHGNUMMER = HULP_UITSLAGHIS.nhgnummer
WHERE UITSLAGEN.TYPEUITSLAG = 1
AND UITSLAGEN.NHGNUMMER IN (372)
AND (LTRIM(UITSLAGEN.WAARDE) >= '09.9') AND HULP_UITSLAGHIS.eenheid = ' mmol/l'
ORDER BY CDIM_000003, CDIM_000029]]>
</SQL>
</Criteria>
<Criteria type="singleCriterion" id="1264">
<SQL comment="has fasting glucose > 7.0 mmol/l">
<![CDATA[SELECT DISTINCT CLIENT.ID_CLIENT AS CDIM_000003, UITSLAGEN.REGISTRATIEDATUM AS CDIM_000029
FROM UITSLAGEN
INNER JOIN CLIENT ON UITSLAGEN.ID_CLIENT = CLIENT.ID_CLIENT
INNER JOIN PRAKTIJK ON CLIENT.ID_PRAKTIJK = PRAKTIJK.ID_PRAKTIJK
INNER JOIN HULP_UITSLAGHIS ON UITSLAGEN.NHGNUMMER = HULP_UITSLAGHIS.nhgnummer
WHERE UITSLAGEN.TYPEUITSLAG = 1
AND UITSLAGEN.NHGNUMMER IN (371)
AND (LTRIM(UITSLAGEN.WAARDE) >= '07.0' AND HULP_UITSLAGHIS.eenheid = ' mmol/l')
ORDER BY CDIM_000003, CDIM_000029]]>
</SQL>
</Criteria>
</Criteria>
</Criteria>
</QueryCriteria>
<Destination name="NIVEL">
<Practice>8872</Practice>
<Practice>8711</Practice>
<Practice>8087</Practice>
</Destination>
</EligibleSubjectCountRequest>
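Conceptually, each &lt;SQL&gt; element in the listing above yields a set of subject identifiers (CDIM_000003), and each criteriaGroup element merges its children's results: intersection for operator="AND", union for operator="OR". The sketch below illustrates this combination step only; it is not the TRANSFoRm query workbench implementation, and the run_sql callback standing in for database execution is hypothetical.

```python
# Illustrative sketch: combine per-criterion subject-ID sets according to
# the nested criteriaGroup AND/OR operators used in the listing above.
# run_sql is a hypothetical stand-in for executing an <SQL> block against
# a practice database and collecting the CDIM_000003 values it returns.
import xml.etree.ElementTree as ET

def evaluate(criteria, run_sql):
    """Return the set of eligible subject IDs for a <Criteria> element."""
    if criteria.get("type") == "singleCriterion":
        sql = criteria.find("SQL")
        return set(run_sql(sql.text))
    # criteriaGroup: fold child results with AND (intersection) / OR (union)
    children = [evaluate(c, run_sql) for c in criteria.findall("Criteria")]
    op = criteria.get("operator")
    result = children[0]
    for child in children[1:]:
        result = result & child if op == "AND" else result | child
    return result

# Toy example: two ORed criteria whose fake "SQL" text just names a result set
doc = ET.fromstring(
    '<Criteria type="criteriaGroup" operator="OR" id="g">'
    '<Criteria type="singleCriterion" id="a"><SQL>A</SQL></Criteria>'
    '<Criteria type="singleCriterion" id="b"><SQL>B</SQL></Criteria>'
    '</Criteria>'
)
fake_results = {"A": {1, 2}, "B": {2, 3}}
print(evaluate(doc, fake_results.get))  # union of the two toy sets
```

With operator="AND" on the outer group, the same traversal would instead return only the subjects present in every child result, which is how the top-level group (id="1249") narrows the eligible subject count.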