The cancer Biomedical Informatics Grid:
Connecting the Cancer Research Community
Scott Oster, Department of Biomedical Informatics
Ohio State University
Challenges of Large Applications in Distributed Environments (CLADE) 2007
Monterey Bay, California, June 25, 2007
2
Agenda
caBIG Overview
caGrid
Challenges of caBIG
3
Cancer Background
This year there will be approximately 1,400,000 Americans diagnosed with cancer
More than 500,000 Americans are expected to die from cancer this year
In 2005, the NIH estimated the overall cost of cancer at $209.9 billion, including direct medical costs of $74 billion
4
First, a visionary non-technical challenge…
5
National Cancer Institute 2015 Goal
“Relieve suffering and death due to cancer by the year 2015”
6
Origins of caBIG
Goal: Enable investigators and research teams nationwide to combine and leverage their findings and expertise in order to meet the NCI 2015 Goal.
Strategy: Create a scalable, actively managed organization that will connect members of the NCI-supported cancer enterprise by building a biomedical informatics network
7
caBIG Community
More than 50 Cancer Centers (of 61 total)
30 organizations from government, industry, and standards bodies
Over 800 people
8
caBIG Domain Workspaces: the data and tool producers
Clinical Trial Management Systems: Provides software tools for consistent, open, and comprehensive clinical trials management, including enrollment of patients, tracking of protocols, recording of outcomes information, administration of trials, and submission of data to regulatory authorities
Integrative Cancer Research: Builds software tools and systems to enable integration of clinical information (such as data collected from biospecimen donors) with molecular information (such as data from high-throughput genomic and proteomic technologies)
In Vivo Imaging: Provides technology for the sharing and analysis of in vivo (in the body) imaging data, such as MRI and PET scans, both in basic and clinical research settings
Tissue Banks and Pathology Tools: Develops software tools for the collection, processing, and dissemination of biospecimens, including the annotation of those biospecimens with donor clinical and protocol data, as well as for the operational and administrative aspects of biorepositories
9
caBIG Strategic Workspaces: the policy makers
Data Sharing and Intellectual Capital: Develops policies for the sharing of data, software, and inventions within the caBIG™-funded cancer community. This workspace addresses, for example, how to implement patient protection policies; the ethical, legal, and contractual obligations associated with the sharing of clinical data and biospecimens; and how the public and private sector should interact when using caBIG™ tools in collaboration
Documentation and Training: Provides technical training for software developers in the use of the caBIG™ resources, including online tutorials, workshops, and education programs
Strategic Planning: Assists in identifying strategic priorities for the development and evolution of caBIG™
10
caBIG Cross-Cutting Workspaces: the infrastructure and standards developers
Architecture: Develops communication standards and systems necessary for all other caBIG™ workspaces to interconnect as a grid via the Internet, including solutions for access control, security, and patient data protection
Vocabularies and Common Data Elements: Creates data standards, including the development, promotion, and support of vocabularies, ontologies, and common data elements to ensure that the entire caBIG™ community is speaking the same “language.” Such common data standards are a key component to ensure that large scale NCI projects generate interoperable information
11
What is caBIG?
Common, widely distributed infrastructure that permits the cancer research community to focus on innovation
Shared, harmonized set of terminology, data elements, and data models that facilitate information exchange
Collection of interoperable applications developed to common standards
Cancer research data available for mining and integration
12
Driving needs
A multitude of “legacy” information systems, most of which cannot be readily shared between institutions
An absence of tools to connect different databases
An absence of common data formats
A huge and growing volume of data must be collected, analyzed, and made accessible
Few common vocabularies, making it difficult, if not impossible, to interlink diverse research and clinical results
Difficulty in identifying and accessing available resources
An absence of information infrastructure to share data within an institution, or among different institutions
13
So there are technical challenges as well…
14
What is caGrid?
Development project of the Architecture Workspace
The Grid infrastructure for caBIG (the “G” in caBIG)
Driven from use cases and needs of the cancer research community
Service Oriented Architecture based on federation
Model driven
Object-oriented, semantically annotated data virtualization
15
What is caGrid? cont…
Builds on existing Grid technologies
Provides additional enterprise Grid components:
Grid Service Graphical Development Toolkit
Metadata Infrastructure
Advertisement and Discovery
Semantic Services
Data Service Infrastructure
Analytical Service Infrastructure
Identifiers
Workflow
Security Infrastructure
Client tooling
16
Agenda
caBIG Overview
caGrid
Challenges of caBIG
17
Issue: Disparate systems
No common infrastructure for applications, databases, etc.
Variety of programming languages
Variety of platforms and operating systems
Inability to interoperate with other systems throughout the virtual organization
18
Approach: Disparate systems
Create and leverage a standards-based Grid (caGrid): WSRF web services using SOAP/HTTP(S)
Creation of compatibility guidelines and a review process
Define a uniform query interface and language for data-providing systems (a sketch follows this list)
Provide common infrastructure services for most federation scenarios
Focus on tools for virtualizing existing systems and APIs behind these grid interfaces
Open Issue: some systems require more manual work than others
Open Issue: tradeoff between specificity and universal applicability
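A minimal sketch of what a uniform, object-oriented query contract might look like. The DataService interface and the exact CQL element names below are illustrative assumptions, not the actual caGrid client API; the values reuse the Agent/Taxol example from the semantics slide later in the deck.

// Hypothetical sketch of a uniform query contract over federated data services.
// Names (DataService) and the query syntax are illustrative, not the real caGrid API.
import java.util.List;

public class UniformQuerySketch {

    /** Every data-providing system exposes the same operation, regardless of backend. */
    interface DataService {
        /** Accepts an object-oriented query (serialized as XML) and returns matching objects as XML. */
        List<String> query(String cqlXml);
    }

    public static void main(String[] args) {
        // An illustrative CQL-style query: "all Agent objects whose name is Taxol".
        String query =
            "<CQLQuery>" +
            "  <Target name=\"Agent\">" +
            "    <Attribute name=\"name\" value=\"Taxol\" predicate=\"EQUAL_TO\"/>" +
            "  </Target>" +
            "</CQLQuery>";

        // A toy in-memory implementation standing in for a wrapped legacy system.
        DataService legacySystemFacade = cql -> List.of(
            "<Agent><name>Taxol</name><nSCNumber>007</nSCNumber></Agent>");

        // Because every service shares this interface, the same client code works
        // against any data service on the grid.
        legacySystemFacade.query(query).forEach(System.out::println);
    }
}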
19
Introduce: Graphical Development Environment for Grid Services
Provides a simple means to create a service skeleton that a developer can then implement, build, and deploy
Provides a set of tools which enable the developer to add/remove/modify/import methods of the service
Automatic code generation (WSDL, service and client APIs, JNDI, WSDDs, security descriptors, metadata, etc.); a sketch of a generated skeleton follows
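To make the "skeleton" idea concrete, a hypothetical sketch of the kind of provider class such tooling generates: the plumbing (WSDL, deployment descriptors, serialization) is produced automatically, and the developer only fills in the method body. The class, method, and placeholder logic below are invented for illustration.

// Hypothetical sketch of a generated grid-service skeleton.
// In the real toolchain the surrounding WSDL, WSDD, and (de)serialization code is
// generated; only the business logic below is written by the service developer.
public class GeneMicroarrayServiceImpl {

    /**
     * Operation added through the graphical tool; its signature was generated
     * from a registered data type (represented here with simple placeholder types).
     */
    public float[] normalizeExpression(float[] rawIntensities) {
        // TODO: developer-implemented business logic goes here.
        // A trivial placeholder: scale intensities to a 0..1 range.
        float max = 0f;
        for (float v : rawIntensities) {
            max = Math.max(max, v);
        }
        float[] normalized = new float[rawIntensities.length];
        for (int i = 0; i < rawIntensities.length; i++) {
            normalized[i] = (max == 0f) ? 0f : rawIntensities[i] / max;
        }
        return normalized;
    }

    public static void main(String[] args) {
        float[] out = new GeneMicroarrayServiceImpl().normalizeExpression(new float[] {2f, 4f, 8f});
        System.out.println(java.util.Arrays.toString(out));  // [0.25, 0.5, 1.0]
    }
}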
20
Issue: Lack of common Data Formats
Tools use widely varying and/or proprietary data formats
Lack of formal definition
Not all suitable for communication with remote systems
Lack of a uniform way to discover and understand the formats
21
Approach: Lack of common Data Formats
Adopt XML as the data exchange format
Leverage XML Schemas for definition
Global Model Exchange service for publishing, managing, and discovering XML Schemas
Leverage UML for logical definition of data models
Cancer Data Standards Repository (caDSR) captures the logical model with annotations; facilitates reuse and formal definition
Formal binding of logical model (UML) and exchange model (XML); a sketch of this binding follows the list
Community review of the use of standards for new systems
Open Issue: Data translation still necessary when an existing system can’t be easily changed (though some caBIG tools exist to address this; e.g. caAdapter)
Open Issue: tradeoff between reuse and creating the new “perfect model”
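A minimal sketch of the "formal binding" idea, reusing the Agent class from the semantics example later in the deck: the logical model is a plain class, and annotations pin down how it serializes to the registered XML form. This uses standard JAXB (javax.xml.bind, which shipped with the Java versions contemporary with caGrid) purely as an analogy, not caGrid's actual generated binding.

// A sketch of binding a logical model class to its XML exchange form using JAXB.
// caGrid's actual binding is generated from caDSR/GME artifacts; this is only an analogy.
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

public class AgentBindingSketch {

    /** Logical model class (the UML side); annotations define the XML exchange form. */
    @XmlRootElement(name = "Agent")
    @XmlAccessorType(XmlAccessType.FIELD)
    public static class Agent {
        @XmlElement(name = "name") String name;
        @XmlElement(name = "nSCNumber") String nSCNumber;

        public Agent() { }                       // required by JAXB
        public Agent(String name, String nsc) { this.name = name; this.nSCNumber = nsc; }
    }

    public static void main(String[] args) throws Exception {
        Agent taxol = new Agent("Taxol", "007");  // values taken from the semantics example slide

        Marshaller marshaller = JAXBContext.newInstance(Agent.class).createMarshaller();
        marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
        marshaller.marshal(taxol, System.out);
        // Produces XML of the same shape as the <Agent> document shown later in the deck.
    }
}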
22
Issue: Data Interoperability
Common data formats allow for syntactic data interoperability but are not sufficient for ensuring common semantics
May work with wholesale adoption of common domain-specific models, but breaks down cross-model
Need to understand the meaning of the value domains and terminology of a data format or system
Assumptions of meaning can be dangerous, even deadly, in the medical domain
23
Interoperability
The ability of multiple systems to exchange information and to be able to use the information that has been exchanged.
Syntactic interoperability
Semantic interoperability
Semantics Example
Example data:
<Agent>
  <name>Taxol</name>
  <nSCNumber>007</nSCNumber>
</Agent>
Class/Attribute: Agent
  CIA Definition: A sworn intelligence agent; a spy
  NCI Definition: Chemical compound administered to a human being to treat a disease or condition, or prevent the onset of a disease or condition
Class/Attribute: Agent.nSCNumber (example data: 007)
  CIA Definition: Identifier given to an intelligence agent by the National Security Council
  NCI Definition: Identifier given to chemical compound by the US Food and Drug Administration Nomenclature Standards Committee
Class/Attribute: Agent.name (example data: Taxol)
  CIA Definition: CIA code name given to intelligence agents
  NCI Definition: Common name of chemical compound used as an agent
25
Approach: Data Interoperability
Community-maintained and curated shared ontology
Enterprise Vocabulary Services (EVS) maintains and provides access to the data semantics and controlled vocabulary of all models: definitions, synonyms, relationships, etc.
All models in caDSR annotated with terminology and concepts from EVS
Focus on identifying “Common Data Elements” as semantically equivalent attributes
Based on ISO 11179 Information Technology – Metadata Registries (MDR) parts 1-6
Community review of the use of standards and harmonization for new systems
Open Issue: Is it possible to scale to federated terminologies?
Open Issue: High initial cost of entry; high overhead to maintaining quality
caGrid Data Description Infrastructure
• Client and service APIs are object oriented, and operate over well-defined and curated data types
• Objects are defined in UML and converted into ISO/IEC 11179 Administered Components, which are in turn registered in the Cancer Data Standards Repository (caDSR)
• Object definitions draw from controlled terminology and vocabulary registered in the Enterprise Vocabulary Services (EVS), and their relationships are thus semantically described
• XML serialization of objects adheres to XML schemas registered in the Global Model Exchange (GME)
[Diagram: a Grid Client’s client API and a Grid Service’s service API exchange XML objects that serialize to and validate against XSDs registered in the Global Model Exchange (GME); the service definition (WSDL) and data type definitions are registered in the Cancer Data Standards Repository, and object definitions are semantically described in the Enterprise Vocabulary Services]
27
Issue: Finding Resources
Creating infrastructure for programmatic interoperability is excessive without a way to dynamically find and use previously unknown resources
Resources need to be self-descriptive enough such that their use and value can be determined
28
Approach: Finding Resources
Rich set of standardized metadata publicly provided by each service
Operations and data types described in terms of structure and semantics extracted from caDSR and EVS
Services register existence with the Index Service, and metadata is aggregated
Tools for querying the Index Service and analyzing metadata are provided
Open Issue: Lines between data and metadata are blurry at best
Some key distinctions in caBIG are that metadata is publicly accessible, and describes "types" not instances
Advertisement and Discovery Process
[Diagram: grid services register to the Index Service and publish service metadata, which the Index Service subscribes to and aggregates; the metadata uses terminology described in the Enterprise Vocabulary Services and references objects defined in the Cancer Data Standards Repository; a Discovery Client API queries the aggregated service metadata]
All services register their service location and metadata information to an Index Service
The Index Service subscribes to the standardized metadata and aggregates their contents
Clients can discover services using a discovery API which facilitates inspection of data types
Leveraging semantic information (from which service metadata is drawn), services can be discovered by the semantics of their data types (see the sketch after these examples):
“Find me all the services from Cancer Center X”
“Which Analytical services take Genes as input?”
“Which Data services expose data relating to lung cancer?”
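A hedged sketch of what those three discovery queries might look like programmatically. The DiscoveryClient type and its method names are hypothetical stand-ins, not the actual caGrid discovery API; the point is that lookups are driven by metadata and semantics rather than hard-coded endpoint lists.

// Hypothetical discovery-client sketch: method names are illustrative only.
import java.util.List;

public class DiscoverySketch {

    interface DiscoveryClient {
        List<String> discoverServicesByResearchCenter(String centerName);
        List<String> discoverAnalyticalServicesByInputConcept(String conceptName);
        List<String> discoverDataServicesByDataConcept(String conceptName);
    }

    static void runSampleQueries(DiscoveryClient index) {
        // "Find me all the services from Cancer Center X"
        List<String> centerServices = index.discoverServicesByResearchCenter("Cancer Center X");

        // "Which Analytical services take Genes as input?"
        List<String> geneTools = index.discoverAnalyticalServicesByInputConcept("Gene");

        // "Which Data services expose data relating to lung cancer?"
        List<String> lungCancerData = index.discoverDataServicesByDataConcept("Lung Cancer");

        System.out.printf("%d center services, %d gene tools, %d lung-cancer data services%n",
                centerServices.size(), geneTools.size(), lungCancerData.size());
    }

    public static void main(String[] args) {
        // Toy in-memory stand-in for the Index Service, so the sketch runs end to end.
        DiscoveryClient stub = new DiscoveryClient() {
            public List<String> discoverServicesByResearchCenter(String c) { return List.of("https://host-a.example.org/ServiceA"); }
            public List<String> discoverAnalyticalServicesByInputConcept(String c) { return List.of("https://host-b.example.org/ServiceB"); }
            public List<String> discoverDataServicesByDataConcept(String c) { return List.of(); }
        };
        runSampleQueries(stub);
    }
}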
30
Issue: Data Size
Numerous sources of large data sets:
Imaging
Tumor Microenvironment: High Resolution Scanning = 25 TB/cm2 of tissue
Image repositories: multiple modalities, thousands of cases, millions of images, terabytes of data
Mouse Models: terabytes of data
Proteomics: modest example: 30 samples x 10 fractions x 10 runs x 1.5 MB per spectrum = 4.5 GB
Many others
31
Approach: Data Size
Often a tradeoff between optimized performance and interoperability, e.g. out-of-band binary transfer vs. XML/SOAP/HTTP
Currently leveraging:
Transfer: WS-Enumeration, GridFTP (with integrated security and metadata); a paging sketch follows
Avoid transfer: identifiers, federated query, workflow, co-location
Looking at:
Moving services to data (Imaging)
Binary data format descriptions for binary metadata (e.g. DFDL)
New area to address; much more to do…
32
Issue: User Accounting
Most legacy systems built with local users and permissions
Can’t require users to maintain hundreds of accounts, but still need to allow local policy
Central account management and identity vetting is not tractable, but there are too many organizations with differing infrastructures to try to establish point-to-point relationships
33
Approach: User Accounting
Provide Single Sign-On to the grid via X.509 proxy certificates
Grid Authentication and Authorization with Reliably Distributed Services (GAARDS)
Federated Identity Management (Dorian)
Rely on participating institutions to vouch for the identity of their members
Standardize on identity assertion language and attributes
Integrate existing institutional identity management systems, as Registration Authorities, into aggregate Certificate Authorities
Distribute revocations via the Grid Trust Service (GTS); discussed later
GAARDS in Action
Authenticate with the local credential provider (SAML Assertion)
The user authenticates to the local credential provider using their everyday user credentials
GAARDS in Action
SAML Assertion exchanged for Grid Credentials
The application obtains grid credentials from Dorian using the SAML assertion provided by the local provider.
GAARDS in Action
The application uses grid credentials to invoke secure grid services. (A sketch of the full flow follows.)
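A condensed, hypothetical sketch of the three "GAARDS in Action" steps above. None of these types are the real GAARDS or Dorian classes; they only name the hand-offs: local login yields a SAML assertion, a Dorian-like identity provider trades it for short-lived grid credentials, and those credentials are used to call a secure service.

// Hypothetical sketch of the GAARDS single-sign-on hand-offs; all types are invented
// placeholders for the real components (local credential provider, Dorian, grid service).
public class GaardsFlowSketch {

    record SamlAssertion(String subject, String issuer) { }
    record GridCredential(String distinguishedName, long expiresAtMillis) { }

    interface LocalCredentialProvider {
        SamlAssertion authenticate(String username, char[] password);
    }

    interface DorianLikeIdp {
        /** Trades an institutional SAML assertion for short-lived grid (proxy) credentials. */
        GridCredential requestGridCredential(SamlAssertion assertion);
    }

    interface SecureGridService {
        String invoke(GridCredential credential, String request);
    }

    static String runFlow(LocalCredentialProvider idp, DorianLikeIdp dorian, SecureGridService service) {
        // Step 1: user authenticates with everyday institutional credentials.
        SamlAssertion saml = idp.authenticate("researcher1", "secret".toCharArray());

        // Step 2: the application obtains grid credentials from the federated identity provider.
        GridCredential creds = dorian.requestGridCredential(saml);

        // Step 3: the application invokes secure grid services with those credentials.
        return service.invoke(creds, "<query/>");
    }

    public static void main(String[] args) {
        // Toy implementations so the sketch runs end to end.
        String response = runFlow(
            (user, pw) -> new SamlAssertion(user, "university-idp.example.org"),
            saml -> new GridCredential("/O=caBIG/CN=" + saml.subject(), System.currentTimeMillis() + 3_600_000),
            (cred, req) -> "ok for " + cred.distinguishedName());
        System.out.println(response);
    }
}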
37
Issue: Data Privacy
Lots of interesting data involves human subjects in some form
Numerous barriers to data and resource sharing in caBIG:
Federal, state, and local law; regulations; institutional policies
Institutional Review Boards (IRBs) involved for any protected health information (PHI), even for de-identified data
Grid is new technology; IRBs must give very detailed protocol approvals
Most regulations are about more than just “who”; “how” and “for what” also matter
Grid is multi-institutional, which means IRBs must reach agreements (read: separately employed lawyers working together)
Legal and policy requirements related to privacy and security drivers include: HIPAA Privacy and Security Rules; the Common Rule for Human Subjects Research; FDA Regulations on Human Subjects; 21 CFR Part 11; state and institutional requirements
38
Approach: Data Privacy
Though some aspects of the solution require technology (auditing, provenance, encryption/digital signing), the problem cannot be solved by technology alone
Data Sharing and Intellectual Capital Workspace (DSIC): identification of issues; development of guidelines; template agreements; education and training
Some caBIG (and external) tools exist for automated de-identification
Can leverage authorization solutions (GridGrouper for group-based policy; CSM for local policy; Globus PDPs for complex rules), as sketched below
Open Issue: What technologies and policies (if any) can be universally adopted?
Open Issue: To date the emphasis of security infrastructure development in caBIG has been around services, not data
Lots of work to do…
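A small hypothetical sketch of the layered authorization gate mentioned above: before a data service returns anything derived from protected health information, it checks the caller's grid identity against a named group (in the spirit of GridGrouper) and then consults a local policy decision point. The interfaces are invented; the actual caBIG components expose richer policy models.

// Hypothetical sketch of layered authorization in front of a data operation:
// a group-membership check (GridGrouper-like) plus a local policy hook (CSM/PDP-like).
public class AuthorizationSketch {

    interface GroupService {
        boolean isMember(String gridIdentity, String groupName);
    }

    interface LocalPolicy {
        boolean permits(String gridIdentity, String operation, String studyId);
    }

    static String queryProtectedStudy(String callerIdentity, String studyId,
                                      GroupService groups, LocalPolicy policy) {
        // Coarse-grained gate: caller must belong to the approved-researchers group.
        if (!groups.isMember(callerIdentity, "/cabig/studyX/approvedResearchers")) {
            throw new SecurityException("caller is not in the approved group");
        }
        // Fine-grained gate: local, institution-specific policy still applies.
        if (!policy.permits(callerIdentity, "query", studyId)) {
            throw new SecurityException("denied by local policy");
        }
        return "<DeidentifiedResults study=\"" + studyId + "\"/>";
    }

    public static void main(String[] args) {
        // Toy group service and local policy so the sketch runs end to end.
        String result = queryProtectedStudy(
            "/O=caBIG/CN=researcher1", "studyX",
            (id, group) -> true,
            (id, op, study) -> true);
        System.out.println(result);
    }
}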
39
Issue: Intellectual Capital
Social problem: “Publish or perish”
Justified hesitance to share pre-publication data
Justified reluctance to advance the cause of competitors (industrial and academic)
Can I rely on the data/results of some (potentially) unknown entity?
If cancer is cured, and caBIG resources play a role, there will be much interest in knowing who contributed what (and who funded them)
Proper attribution is not just ethical, it’s often required
40
Approach: Intellectual Capital
Technological: Provenance may or may not be enough (annotation vs. enforcement)
Socio-cultural:
A whole workspace in caBIG is dedicated to it (DSIC)
NCI is in a good position to “encourage” it: a large percentage of institutions’ cancer research funding comes from NCI
Hope is that motivation will be value-based once initially primed
Starting to see movement from “wait and see” to active engagement; industry involvement
Lots of work to do…
41
Issue: Complicated Trust Arrangements
When hundreds of organizations are sharing data and providing access to each other’s systems, defining a trust model is complicated, even for public data
For non-public data/systems, the simplest/safest policy is “deny all”
For many data sets and services, the owning organization may be virtual
Central authority is socially and technologically intractable
Rapid propagation of information on compromised systems/individuals is critical
42
Approach: Complicated Trust Arrangements
Grid Authentication and Authorization with Reliably Distributed Services (GAARDS)
Federated Trust Models (GTS)
Establish and manage trust relationships between institutions through adherence to mutually agreed-upon policy
Promote global policy distribution, but allow arbitrary local overrides
Provide enterprise tools and services for management and automated distribution of information
43
Grid Trust Service (GTS) Federation
A GTS can inherit Trusted Authorities and Trust Levels from other Grid Trust Services
Allows one to build a scalable Trust Fabric
Allows institutions to stand up their own GTS, inheriting all the trusted authorities in the wider grid, yet being able to add their own authorities that might not yet be trusted by the wider grid
A GTS can also be used to join the trust fabrics of two or more grids
GAARDS in Action
The application uses grid credentials to invoke secure grid services.
GAARDS in Action
The Grid Service authenticates the user by asking the GTS whether or not the signer of the credential should be trusted (“Should I trust the credential signer?”). A sketch of this check follows.
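A hypothetical sketch of the trust check in the slide above: when a secure call arrives, the service asks a (possibly federated) Grid Trust Service whether the authority that signed the caller's credential is trusted at the required level. The types and the "IdentityVerified" trust level are illustrative assumptions; real GTS validation also handles revocation lists and certificate-path checking.

// Hypothetical sketch of a GTS-style trust decision made by a grid service at call time.
// Types are invented; a real implementation validates full certificate chains and CRLs.
import java.util.Map;
import java.util.Set;

public class TrustCheckSketch {

    /** Stand-in for a federated Grid Trust Service: signer DN mapped to approved trust levels. */
    record TrustFabric(Map<String, Set<String>> trustedSigners) {
        boolean isTrusted(String signerDn, String requiredLevel) {
            return trustedSigners.getOrDefault(signerDn, Set.of()).contains(requiredLevel);
        }
    }

    static void authorizeCall(String callerDn, String signerDn, TrustFabric gts) {
        // "Should I trust the credential signer?" -- the question from the slide above.
        if (!gts.isTrusted(signerDn, "IdentityVerified")) {
            throw new SecurityException("credential signer " + signerDn + " is not trusted");
        }
        System.out.println("Accepting call from " + callerDn);
    }

    public static void main(String[] args) {
        TrustFabric gts = new TrustFabric(Map.of(
                "/O=caBIG/CN=Example Dorian CA", Set.of("IdentityVerified")));
        authorizeCall("/O=caBIG/CN=researcher1", "/O=caBIG/CN=Example Dorian CA", gts);
    }
}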
46
Issue: Computationally Expensive
Many studies on molecular data require expensive calculations on large data sets: statistical analysis, hypothesis testing, searches
Researchers lack necessary computing resources
47
Approach: Computationally Expensive
A variety of well-known solutions exist in the Grid and cluster space (a main driving force of their existence)
Challenge is in seamlessly integrating with the abstraction layer in use, i.e. operations on semantically annotated objects, not scheduled jobs on flat files
Leverage virtualization; domain-specific service interface over general computational resources (TeraGrid, supercomputer centers), as sketched below
Open Issue: Balancing abstraction vs. control (e.g. scheduling priorities, cost models, optimizations, etc.)
Open Issue: Appropriate level of control for service as resource broker
Open Issue: Complexity moved from client to service developer (working on tools to facilitate)
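A hypothetical sketch of that virtualization idea: the grid service keeps its semantically typed, object-level interface, and only internally translates the request into a batch job on a general-purpose compute resource such as a TeraGrid allocation. Both interfaces and the toy parsing are invented placeholders.

// Hypothetical sketch: a domain-specific analytical service front-ends a generic
// compute resource, so clients see objects and operations rather than job scripts.
import java.util.List;

public class ComputeVirtualizationSketch {

    record Gene(String symbol) { }
    record EnrichmentResult(String pathway, double pValue) { }

    /** Generic batch back end (cluster, supercomputer allocation, etc.). */
    interface BatchBackend {
        String submit(String jobDescription);   // returns a job id
        String collectOutput(String jobId);     // blocks until the job finishes
    }

    /** Domain-facing operation: semantically typed in, semantically typed out. */
    static List<EnrichmentResult> pathwayEnrichment(List<Gene> genes, BatchBackend backend) {
        // Translate the object-level request into a flat-file style job description.
        String job = "enrichment" + genes.stream().map(Gene::symbol).reduce("", (a, b) -> a + " " + b);
        String jobId = backend.submit(job);
        String rawOutput = backend.collectOutput(jobId);

        // Parse the job output back into annotated objects for the caller (toy parsing here).
        return List.of(new EnrichmentResult(rawOutput, 0.01));
    }

    public static void main(String[] args) {
        BatchBackend toyBackend = new BatchBackend() {
            public String submit(String jobDescription) { return "job-42"; }
            public String collectOutput(String jobId) { return "example-pathway"; }
        };
        pathwayEnrichment(List.of(new Gene("TP53"), new Gene("BRCA1")), toyBackend)
                .forEach(r -> System.out.println(r.pathway() + " p=" + r.pValue()));
    }
}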
48
caGrid/TeraGrid Overview
49
Issue: Evolving Infrastructure
Standards in the Web/Grid service domain are turbulent at best: competing interests of “big business” and multiple standards bodies
Major revisions of toolkits generally not backwards compatible
Interface stability vs. new features
Don’t want multiple grids
Upgrade or perish? Staying behind means lack of support
Application-layer abstractions help developers, but don’t address “wire incompatibility”
50
Approach: Evolving Infrastructure
Most traditional solutions are in conflict with strongly-typed requirements or complicate service development (unless extensibility is built into the spec), e.g. lax processing; must-ignore/must-understand with schema overloading; multiple (protocol) service interfaces
Abstract specifications from developers with tooling
Focus on rigid “data format” specifications; allow more freedom on composition into messages
Open Issue: Doesn’t address wire incompatibility
Open Issue: No good solution; do we need to just get it “good enough” and stabilize?
51
Summary
The bad news: Large-scale, distributed knowledge sharing is hard
The good news: The potential rewards are large
The good news (for computer scientists): There are lots of unsolved problems (and interest in getting them solved)
The cancer Biomedical Informatics Grid:
Connecting the Cancer Research Community
Scott Oster, Department of Biomedical Informatics
Ohio State University
Challenges of Large Applications in Distributed Environments (CLADE) 2007
Monterey Bay, California, June 25, 2007
53
BACKUP SLIDES
54
Standardized Service Metadata
Common Service Metadata
Provided by all services
Details the service’s capabilities, operations, contact information, and hosting research center
Service operations’ inputs and outputs defined in terms of structure and semantics extracted from caDSR and EVS
Service Security Metadata
Provided by all services
Details the service’s requirements on the communication channel for each operation
Can be used by a client to programmatically negotiate an acceptable means of communication
Data Service Metadata
Provided by all data services
Describes the Domain Model being exposed, in terms of a UML model linked to semantics
Provides the information needed to formulate the Object-Oriented Query
As with common metadata, data types defined in terms of structure and semantics extracted from caDSR and EVS
(A sketch of these metadata structures follows.)
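A hedged sketch of the three metadata documents as plain data classes, purely to make the structure above scannable. Field names paraphrase the slide text; they are not the actual caGrid metadata schema.

// Hypothetical data-class sketch of the standardized service metadata described above.
import java.util.List;

public class ServiceMetadataSketch {

    /** Common Service Metadata: provided by every service. */
    record CommonMetadata(String serviceName,
                          String hostingResearchCenter,
                          String contactEmail,
                          List<OperationDescription> operations) { }

    /** Each operation's inputs/outputs refer to registered, semantically annotated types. */
    record OperationDescription(String operationName,
                                List<String> inputTypeUmlClasses,
                                String outputTypeUmlClass) { }

    /** Service Security Metadata: per-operation channel requirements a client can inspect. */
    record SecurityMetadata(String operationName,
                            boolean requiresTransportSecurity,
                            boolean requiresMessageSigning) { }

    /** Data Service Metadata: the exposed Domain Model, which drives object-oriented queries. */
    record DataServiceMetadata(String domainModelName,
                               List<String> exposedUmlClasses) { }

    public static void main(String[] args) {
        CommonMetadata common = new CommonMetadata(
                "ExampleImagingDataService", "Example Cancer Center", "admin@example.org",
                List.of(new OperationDescription("query", List.of("CQLQuery"), "CQLQueryResults")));
        System.out.println(common.serviceName() + " hosted by " + common.hostingResearchCenter());
    }
}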
caBIG Data Hierarchy
Level I (Collection): Access control; Patient privacy; Data integrity; Provenance metadata for attribution; Authentication of authorship; Information security
Level II (Closed Distribution): External access controls; Dynamic permissions for limited access; Materials transfer issues; Mechanisms for data escrow
Level III (Public Distribution/Access): Data released from escrow; Data transmission security; Dynamic permissions for general access; Provision for IP ownership as opposed to access
Level IV (Post-Publication Attribution): Provenance metadata for publication; Community standards for attribution of authorship; Dynamic permissions for general release; Data escrow for publication
Issues affecting the data are cumulative, i.e. data functioning in Level II will also raise the issues raised for Level I data; Level III data will require attention to the issues raised by both Level I and II, et cetera.
56
Level I data issues
Level I data is all data collected by the caBIG system, including patient data, analyses, records, and research, regardless of whether that data is released to other researchers, the public, or parties other than the one that originally provides the data to the system. Issues raised include:
Access Controls. Management, operational, and technical controls are necessary to create a methodology for restricting access to data in caBIG consistent with the authorization of the individual or entity.
Patient Privacy. Data must be collected and stored in a manner that protects the privacy interests of the data subjects, consistent with the HIPAA Privacy Rule, the Common Rule of human subjects research (reflected in the Code of Federal Regulations), and other state, local, ethical, and institutional requirements.
Data integrity. Mechanisms must be available to ascertain that data has been entered accurately and will not be inappropriately modified in the transfer from its point of origin, while maintained in caBIG, or subsequently.
Provenance metadata for attribution. Individual contributors’ interests must be protected by assuring that the system allows data submitted to be associated with information concerning its authorship, collection, or creation, and that a mechanism exists for data originators to amend incorrect provenance information.
Authentication of authorship. Mechanisms and processes must be available to verify that provenance data correctly identifies the source of contributed data and information. Such protections may include digital signatures (as described in 21 CFR 11) and other methods.
Information Security. Data must be collected and stored in a manner that protects the privacy interests of the data subjects, consistent with the HIPAA Security Rule, the Federal Information Security Management Act of 2002, and other Federal, state, local, ethical, and institutional requirements.
57
Level II data issues
Level II data is data that is collected and then shared by some limited subset of potential data users, but not all caBIG users or the general public. These individuals could include, for example, the party that contributed the data only, individuals that have reached private agreements with those that have contributed the data, or individuals granted “role-based” access to certain categories. Issues raised at this level include:
External Access Controls. Level I data requires controls for access to caBIG; Level II data requires management, operational, and technical controls to limit access to the caBIG users authorized to view data originated elsewhere.
Dynamic permissions for limited access. Access controls will need to be flexible enough to change what data individuals have access to as roles, agreements, and activities of individual system users change over time.
Integration with materials transfer processes. Information sharing practices facilitated by caBIG must be aligned with practices for individuals or groups that share, transfer or provide access to tissues, cultures, cell lines, research animals, or other material shipped from one location to another.
Mechanisms for data escrow. Common research practices require data to be available for verification of research findings but not available for access, alteration, or further analysis until the validity of research findings is verified. caBIG will need to include a mechanism to allow data stored on the system to be partitioned off consistent with these requirements.
58
Level III data issues
Level III data is data made available to general audiences, including all caBIG users, all interested researchers, or the general public. Level III data issues include:
Data released from escrow. Once data has been cleared for general access (either due to the conclusion of the prepublication issues described under Level II data above, or pursuant to an arrangement with the data’s originator), it must be made available in a manner consistent with caBIG policy and in a way that does not compromise the data’s integrity or required attribution.
Data transmission security. Management, operational, and technical controls should assure that data integrity is not compromised in transit, and that poor security practices on the part of caBIG system users do not create platforms for security breaches of the caBIG system itself.
Dynamic permissions for general access. As with Level II data, access must be granted appropriately to users. As well as access levels for users, data must also be assigned security categories such that data can be re-categorized from having a specified, limited availability to becoming more generally available.
Provision for intellectual property ownership as opposed to access. Researchers may be willing to share data for limited purposes or a limited data set, and may wish to retain rights to be acknowledged for collecting data, generating analyses, or previous publications. Mechanisms must be in place to allow individuals the ongoing ability to benefit from their research or retain exclusive rights to it if contracts or other conditional agreements so require.
59
Level IV data issues
Level IV data is data that will be used for analyses, research, or other writing that will be attributed to one or more authors or individuals as the author, creator, sponsor, or other related party. Level IV data requires special protections such that the proper attribution is received for the particular accomplishments or expertise associated with that data. Level IV data issues include:
Provenance metadata for publication. Provenance metadata provisions for Level I data should govern the rights, restrictions and considerations relevant to authorship and attribution for data to be published.
Community standards for attribution of authorship. An appropriate, written protocol that accounts for existing law, policy and custom should exist to reflect how data and information is generated and in what capacity each participant contributed.
Dynamic permissions for general release. Data permissions for attributed data may require escrow; delivery to third parties for verification and analyses; and added provenance metadata for modified or concurrently developed material.
Data escrow for publication. Many journals require that data used in publications remain in escrow prior to (or following) publication to allow other researchers to validate findings. caBIG processes would need to be compatible with this requirement.