The cancer Biomedical Informatics Grid:
Connecting the Cancer Research Community
Scott Oster, Department of Biomedical Informatics
Ohio State University
Challenges of Large Applications in Distributed Environments (CLADE) 2007
Monterey Bay, California, June 25, 2007
2
Agenda
caBIG Overview
caGrid
Challenges of caBIG
3
Cancer Background
This year there will be approximately 1,400,000 Americans diagnosed with cancer
More than 500,000 Americans are expected to die from cancer this year
In 2005, the NIH estimated the overall cost of cancer at $209.9 billion, including direct medical costs of $74 billion
4
First, a visionary non-technical challenge…
5
National Cancer Institute 2015 Goal
“Relieve suffering and death due to cancer by the year 2015”
6
Origins of caBIG
Goal: Enable investigators and research teams nationwide to combine and leverage their findings and expertise in order to meet the NCI 2015 Goal.
Strategy: Create a scalable, actively managed organization that will connect members of the NCI-supported cancer enterprise by building a biomedical informatics network
7
caBIG Community
More than 50 Cancer Centers (of 61 total)
30 organizations from government, industry, and standards bodies
Over 800 people
8
caBIG Domain Workspaces: the data and tool producers
Clinical Trial Management Systems: Provides software tools for consistent, open, and comprehensive clinical trials management, including enrollment of patients, tracking of protocols, recording of outcomes information, administration of trials, and submission of data to regulatory authorities
Integrative Cancer Research: Builds software tools and systems to enable integration of clinical information (such as data collected from biospecimen donors) with molecular information (such as data from high-throughput genomic and proteomic technologies)
In Vivo Imaging: Provides technology for the sharing and analysis of in vivo (in the body) imaging data, such as MRI and PET scans, both in basic and clinical research settings
Tissue Banks and Pathology Tools: Develops software tools for the collection, processing, and dissemination of biospecimens, including the annotation of those biospecimens with donor clinical and protocol data, as well as for the operational and administrative aspects of biorepositories
9
caBIG Strategic Workspaces: the policy makers
Data Sharing and Intellectual Capital: Develops policies for the sharing of data, software, and inventions within the caBIG™-funded cancer community. This workspace addresses, for example, how to implement patient protection policies; the ethical, legal, and contractual obligations associated with the sharing of clinical data and biospecimens; and how the public and private sector should interact when using caBIG™ tools in collaboration
Documentation and Training: Provides technical training for software developers in the use of the caBIG™ resources, including online tutorials, workshops, and education programs
Strategic Planning: Assists in identifying strategic priorities for the development and evolution of caBIG™
10
caBIG Cross-Cutting Workspaces: the infrastructure and standards developers
Architecture: Develops communication standards and systems necessary for all other caBIG™ workspaces to interconnect as a grid via the Internet, including solutions for access control, security, and patient data protection
Vocabularies and Common Data Elements: Creates data standards, including the development, promotion, and support of vocabularies, ontologies, and common data elements to ensure that the entire caBIG™ community is speaking the same “language.” Such common data standards are a key component to ensure that large scale NCI projects generate interoperable information
11
What is caBIG?
Common, widely distributed infrastructure that permits the cancer research community to focus on innovation
Shared, harmonized set of terminology, data elements, and data models that facilitate information exchange
Collection of interoperable applications developed to common standards
Cancer research data available for mining and integration
12
Driving needs
A multitude of “legacy” information systems, most of which cannot be readily shared between institutions
An absence of tools to connect different databases
An absence of common data formats
A huge and growing volume of data must be collected, analyzed, and made accessible
Few common vocabularies, making it difficult, if not impossible, to interlink diverse research and clinical results
Difficulty in identifying and accessing available resources
An absence of information infrastructure to share data within an institution, or among different institutions
13
So there are technical challenges as well…
14
What is caGrid?
Development project of the Architecture Workspace
The Grid infrastructure for caBIG (the “G” in caBIG)
Driven from use cases and needs of the cancer research community
Service Oriented Architecture based on federation
Model driven
Object-oriented, semantically annotated data virtualization
15
What is caGrid? cont…
Builds on existing Grid technologies
Provides additional enterprise Grid components:
Grid Service Graphical Development Toolkit
Metadata Infrastructure
Advertisement and Discovery
Semantic Services
Data Service Infrastructure
Analytical Service Infrastructure
Identifiers
Workflow
Security Infrastructure
Client tooling
16
Agenda
caBIG Overview
caGrid
Challenges of caBIG
17
Issue: Disparate systems
No common infrastructure for applications, databases, etc.
Variety of programming languages
Variety of platforms and operating systems
Inability to interoperate with other systems throughout the virtual organization
18
Approach: Disparate systems
Create and leverage a standards-based Grid (caGrid): WSRF web services using SOAP/HTTP(S)
Creation of compatibility guidelines and a review process
Define a uniform query interface and language for data-providing systems (a sketch follows this list)
Provide common infrastructure services for most federation scenarios
Focus on tools for virtualizing existing systems and APIs behind these grid interfaces
Open Issue: some systems require more manual work than others
Open Issue: tradeoff between specificity and universal applicability
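A minimal sketch of what a uniform, object-oriented query contract might look like. The DataService interface and the exact CQL element names below are illustrative assumptions, not the actual caGrid client API; the values reuse the Agent/Taxol example from the semantics slide later in the deck.

// Hypothetical sketch of a uniform query contract over federated data services.
// Names (DataService) and the query syntax are illustrative, not the real caGrid API.
import java.util.List;

public class UniformQuerySketch {

    /** Every data-providing system exposes the same operation, regardless of backend. */
    interface DataService {
        /** Accepts an object-oriented query (serialized as XML) and returns matching objects as XML. */
        List<String> query(String cqlXml);
    }

    public static void main(String[] args) {
        // An illustrative CQL-style query: "all Agent objects whose name is Taxol".
        String query =
            "<CQLQuery>" +
            "  <Target name=\"Agent\">" +
            "    <Attribute name=\"name\" value=\"Taxol\" predicate=\"EQUAL_TO\"/>" +
            "  </Target>" +
            "</CQLQuery>";

        // A toy in-memory implementation standing in for a wrapped legacy system.
        DataService legacySystemFacade = cql -> List.of(
            "<Agent><name>Taxol</name><nSCNumber>007</nSCNumber></Agent>");

        // Because every service shares this interface, the same client code works
        // against any data service on the grid.
        legacySystemFacade.query(query).forEach(System.out::println);
    }
}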
19
Introduce: Graphical Development Environment for Grid Services
Provides a simple means to create a service skeleton that a developer can then implement, build, and deploy
Provides a set of tools which enable the developer to add/remove/modify/import methods of the service
Automatic code generation (WSDL, service and client APIs, JNDI, WSDDs, security descriptors, metadata, etc.); a sketch of a generated skeleton follows
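To make the "skeleton" idea concrete, a hypothetical sketch of the kind of provider class such tooling generates: the plumbing (WSDL, deployment descriptors, serialization) is produced automatically, and the developer only fills in the method body. The class, method, and placeholder logic below are invented for illustration.

// Hypothetical sketch of a generated grid-service skeleton.
// In the real toolchain the surrounding WSDL, WSDD, and (de)serialization code is
// generated; only the business logic below is written by the service developer.
public class GeneMicroarrayServiceImpl {

    /**
     * Operation added through the graphical tool; its signature was generated
     * from a registered data type (represented here with simple placeholder types).
     */
    public float[] normalizeExpression(float[] rawIntensities) {
        // TODO: developer-implemented business logic goes here.
        // A trivial placeholder: scale intensities to a 0..1 range.
        float max = 0f;
        for (float v : rawIntensities) {
            max = Math.max(max, v);
        }
        float[] normalized = new float[rawIntensities.length];
        for (int i = 0; i < rawIntensities.length; i++) {
            normalized[i] = (max == 0f) ? 0f : rawIntensities[i] / max;
        }
        return normalized;
    }

    public static void main(String[] args) {
        float[] out = new GeneMicroarrayServiceImpl().normalizeExpression(new float[] {2f, 4f, 8f});
        System.out.println(java.util.Arrays.toString(out));  // [0.25, 0.5, 1.0]
    }
}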
20
Issue: Lack of common Data Formats
Tools use widely varying and/or proprietary data formats
Lack of formal definition
Not all suitable for communication with remote systems
Lack of a uniform way to discover and understand the formats
21
Approach: Lack of common Data Formats
Adopt XML as the data exchange format
Leverage XML Schemas for definition
Global Model Exchange service for publishing, managing, and discovering XML Schemas
Leverage UML for logical definition of data models
Cancer Data Standards Repository (caDSR) captures the logical model with annotations; facilitates reuse and formal definition
Formal binding of logical model (UML) and exchange model (XML); a sketch of this binding follows the list
Community review of the use of standards for new systems
Open Issue: Data translation still necessary when an existing system can’t be easily changed (though some caBIG tools exist to address this; e.g. caAdapter)
Open Issue: tradeoff between reuse and creating the new “perfect model”
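A minimal sketch of the "formal binding" idea, reusing the Agent class from the semantics example later in the deck: the logical model is a plain class, and annotations pin down how it serializes to the registered XML form. This uses standard JAXB (javax.xml.bind, which shipped with the Java versions contemporary with caGrid) purely as an analogy, not caGrid's actual generated binding.

// A sketch of binding a logical model class to its XML exchange form using JAXB.
// caGrid's actual binding is generated from caDSR/GME artifacts; this is only an analogy.
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

public class AgentBindingSketch {

    /** Logical model class (the UML side); annotations define the XML exchange form. */
    @XmlRootElement(name = "Agent")
    @XmlAccessorType(XmlAccessType.FIELD)
    public static class Agent {
        @XmlElement(name = "name") String name;
        @XmlElement(name = "nSCNumber") String nSCNumber;

        public Agent() { }                       // required by JAXB
        public Agent(String name, String nsc) { this.name = name; this.nSCNumber = nsc; }
    }

    public static void main(String[] args) throws Exception {
        Agent taxol = new Agent("Taxol", "007");  // values taken from the semantics example slide

        Marshaller marshaller = JAXBContext.newInstance(Agent.class).createMarshaller();
        marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
        marshaller.marshal(taxol, System.out);
        // Produces XML of the same shape as the <Agent> document shown later in the deck.
    }
}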
22
Issue: Data Interoperability
Common data formats allow for syntactic data interoperability but are not sufficient for ensuring common semantics
May work with wholesale adoption of common domain-specific models, but breaks down cross-model
Need to understand the meaning of the value domains and terminology of a data format or system
Assumptions of meaning can be dangerous, even deadly, in the medical domain
23
Interoperability
The ability of multiple systems to exchange information and to be able to use the information that has been exchanged.
Syntactic interoperability
Semantic interoperability
Semantics Example
Example data:
<Agent>
  <name>Taxol</name>
  <nSCNumber>007</nSCNumber>
</Agent>
Class/Attribute: Agent
  CIA Definition: A sworn intelligence agent; a spy
  NCI Definition: Chemical compound administered to a human being to treat a disease or condition, or prevent the onset of a disease or condition
Class/Attribute: Agent.nSCNumber (example data: 007)
  CIA Definition: Identifier given to an intelligence agent by the National Security Council
  NCI Definition: Identifier given to chemical compound by the US Food and Drug Administration Nomenclature Standards Committee
Class/Attribute: Agent.name (example data: Taxol)
  CIA Definition: CIA code name given to intelligence agents
  NCI Definition: Common name of chemical compound used as an agent
25
Approach: Data Interoperability
Community-maintained and curated shared ontology
Enterprise Vocabulary Services (EVS) maintains and provides access to the data semantics and controlled vocabulary of all models: definitions, synonyms, relationships, etc.
All models in caDSR annotated with terminology and concepts from EVS
Focus on identifying “Common Data Elements” as semantically equivalent attributes
Based on ISO 11179 Information Technology – Metadata Registries (MDR) parts 1-6
Community review of the use of standards and harmonization for new systems
Open Issue: Is it possible to scale to federated terminologies?
Open Issue: High initial cost of entry; high overhead to maintaining quality
caGrid Data Description Infrastructure
• Client and service APIs are object oriented, and operate over well-defined and curated data types
• Objects are defined in UML and converted into ISO/IEC 11179 Administered Components, which are in turn registered in the Cancer Data Standards Repository (caDSR)
• Object definitions draw from controlled terminology and vocabulary registered in the Enterprise Vocabulary Services (EVS), and their relationships are thus semantically described
• XML serialization of objects adheres to XML schemas registered in the Global Model Exchange (GME)
[Diagram: a Grid Client’s client API and a Grid Service’s service API exchange XML objects that serialize to and validate against XSDs registered in the Global Model Exchange (GME); the service definition (WSDL) and data type definitions are registered in the Cancer Data Standards Repository, and object definitions are semantically described in the Enterprise Vocabulary Services]
27
Issue: Finding Resources
Creating infrastructure for programmatic interoperability is excessive without a way to dynamically find and use previously unknown resources
Resources need to be self-descriptive enough such that their use and value can be determined
28
Approach: Finding Resources
Rich set of standardized metadata publicly provided by each service
Operations and data types described in terms of structure and semantics extracted from caDSR and EVS
Services register existence with the Index Service, and metadata is aggregated
Tools for querying the Index Service and analyzing metadata are provided
Open Issue: Lines between data and metadata are blurry at best
Some key distinctions in caBIG are that metadata is publicly accessible, and describes "types" not instances
Advertisement and Discovery Process
[Diagram: grid services register to the Index Service and publish service metadata, which the Index Service subscribes to and aggregates; the metadata uses terminology described in the Enterprise Vocabulary Services and references objects defined in the Cancer Data Standards Repository; a Discovery Client API queries the aggregated service metadata]
All services register their service location and metadata information to an Index Service
The Index Service subscribes to the standardized metadata and aggregates their contents
Clients can discover services using a discovery API which facilitates inspection of data types
Leveraging semantic information (from which service metadata is drawn), services can be discovered by the semantics of their data types (see the sketch after these examples):
“Find me all the services from Cancer Center X”
“Which Analytical services take Genes as input?”
“Which Data services expose data relating to lung cancer?”
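A hedged sketch of what those three discovery queries might look like programmatically. The DiscoveryClient type and its method names are hypothetical stand-ins, not the actual caGrid discovery API; the point is that lookups are driven by metadata and semantics rather than hard-coded endpoint lists.

// Hypothetical discovery-client sketch: method names are illustrative only.
import java.util.List;

public class DiscoverySketch {

    interface DiscoveryClient {
        List<String> discoverServicesByResearchCenter(String centerName);
        List<String> discoverAnalyticalServicesByInputConcept(String conceptName);
        List<String> discoverDataServicesByDataConcept(String conceptName);
    }

    static void runSampleQueries(DiscoveryClient index) {
        // "Find me all the services from Cancer Center X"
        List<String> centerServices = index.discoverServicesByResearchCenter("Cancer Center X");

        // "Which Analytical services take Genes as input?"
        List<String> geneTools = index.discoverAnalyticalServicesByInputConcept("Gene");

        // "Which Data services expose data relating to lung cancer?"
        List<String> lungCancerData = index.discoverDataServicesByDataConcept("Lung Cancer");

        System.out.printf("%d center services, %d gene tools, %d lung-cancer data services%n",
                centerServices.size(), geneTools.size(), lungCancerData.size());
    }

    public static void main(String[] args) {
        // Toy in-memory stand-in for the Index Service, so the sketch runs end to end.
        DiscoveryClient stub = new DiscoveryClient() {
            public List<String> discoverServicesByResearchCenter(String c) { return List.of("https://host-a.example.org/ServiceA"); }
            public List<String> discoverAnalyticalServicesByInputConcept(String c) { return List.of("https://host-b.example.org/ServiceB"); }
            public List<String> discoverDataServicesByDataConcept(String c) { return List.of(); }
        };
        runSampleQueries(stub);
    }
}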
30
Issue: Data Size
Numerous sources of large data sets:
Imaging
Tumor Microenvironment: High Resolution Scanning = 25 TB/cm2 of tissue
Image repositories: multiple modalities, thousands of cases, millions of images, terabytes of data
Mouse Models: terabytes of data
Proteomics: modest example: 30 samples x 10 fractions x 10 runs x 1.5 MB per spectrum = 4.5 GB
Many others
31
Approach: Data Size
Often a tradeoff between optimized performance and interoperability, e.g. out-of-band binary transfer vs. XML/SOAP/HTTP
Currently leveraging:
Transfer: WS-Enumeration, GridFTP (with integrated security and metadata); a paging sketch follows
Avoid transfer: identifiers, federated query, workflow, co-location
Looking at:
Moving services to data (Imaging)
Binary data format descriptions for binary metadata (e.g. DFDL)
New area to address; much more to do…
32
Issue: User Accounting
Most legacy systems built with local users and permissions
Can’t require users to maintain hundreds of accounts, but still need to allow local policy
Central account management and identity vetting is not tractable, but there are too many organizations with differing infrastructures to try to establish point-to-point relationships
33
Approach: User Accounting
Provide Single Sign-On to the grid via X.509 proxy certificates
Grid Authentication and Authorization with Reliably Distributed Services (GAARDS)
Federated Identity Management (Dorian)
Rely on participating institutions to vouch for the identity of their members
Standardize on identity assertion language and attributes
Integrate existing institutional identity management systems, as Registration Authorities, into aggregate Certificate Authorities
Distribute revocations via the Grid Trust Service (GTS); discussed later
GAARDS in Action
Authenticate with the local credential provider (SAML Assertion)
The user authenticates to the local credential provider using their everyday user credentials
GAARDS in Action
SAML Assertion exchanged for Grid Credentials
The application obtains grid credentials from Dorian using the SAML assertion provided by the local provider.
GAARDS in Action
The application uses grid credentials to invoke secure grid services. (A sketch of the full flow follows.)
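A condensed, hypothetical sketch of the three "GAARDS in Action" steps above. None of these types are the real GAARDS or Dorian classes; they only name the hand-offs: local login yields a SAML assertion, a Dorian-like identity provider trades it for short-lived grid credentials, and those credentials are used to call a secure service.

// Hypothetical sketch of the GAARDS single-sign-on hand-offs; all types are invented
// placeholders for the real components (local credential provider, Dorian, grid service).
public class GaardsFlowSketch {

    record SamlAssertion(String subject, String issuer) { }
    record GridCredential(String distinguishedName, long expiresAtMillis) { }

    interface LocalCredentialProvider {
        SamlAssertion authenticate(String username, char[] password);
    }

    interface DorianLikeIdp {
        /** Trades an institutional SAML assertion for short-lived grid (proxy) credentials. */
        GridCredential requestGridCredential(SamlAssertion assertion);
    }

    interface SecureGridService {
        String invoke(GridCredential credential, String request);
    }

    static String runFlow(LocalCredentialProvider idp, DorianLikeIdp dorian, SecureGridService service) {
        // Step 1: user authenticates with everyday institutional credentials.
        SamlAssertion saml = idp.authenticate("researcher1", "secret".toCharArray());

        // Step 2: the application obtains grid credentials from the federated identity provider.
        GridCredential creds = dorian.requestGridCredential(saml);

        // Step 3: the application invokes secure grid services with those credentials.
        return service.invoke(creds, "<query/>");
    }

    public static void main(String[] args) {
        // Toy implementations so the sketch runs end to end.
        String response = runFlow(
            (user, pw) -> new SamlAssertion(user, "university-idp.example.org"),
            saml -> new GridCredential("/O=caBIG/CN=" + saml.subject(), System.currentTimeMillis() + 3_600_000),
            (cred, req) -> "ok for " + cred.distinguishedName());
        System.out.println(response);
    }
}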
37
Issue: Data Privacy
Lots of interesting data involves human subjects in some form
Numerous barriers to data and resource sharing in caBIG:
Federal, state, and local law; regulations; institutional policies
Institutional Review Boards (IRBs) involved for any protected health information (PHI), even for de-identified data
Grid is new technology; IRBs must give very detailed protocol approvals
Most regulations are about more than just “who”; “how” and “for what” also matter
Grid is multi-institutional, which means IRBs must reach agreements (read: separately employed lawyers working together)
Legal and policy requirements related to privacy and security drivers include: HIPAA Privacy and Security Rules; the Common Rule for Human Subjects Research; FDA Regulations on Human Subjects; 21 CFR Part 11; state and institutional requirements
38
Approach: Data Privacy
Though some aspects of the solution require technology (auditing, provenance, encryption/digital signing), the problem cannot be solved by technology alone
Data Sharing and Intellectual Capital Workspace (DSIC): identification of issues; development of guidelines; template agreements; education and training
Some caBIG (and external) tools exist for automated de-identification
Can leverage authorization solutions (GridGrouper for group-based policy; CSM for local policy; Globus PDPs for complex rules), as sketched below
Open Issue: What technologies and policies (if any) can be universally adopted?
Open Issue: To date the emphasis of security infrastructure development in caBIG has been around services, not data
Lots of work to do…
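A small hypothetical sketch of the layered authorization gate mentioned above: before a data service returns anything derived from protected health information, it checks the caller's grid identity against a named group (in the spirit of GridGrouper) and then consults a local policy decision point. The interfaces are invented; the actual caBIG components expose richer policy models.

// Hypothetical sketch of layered authorization in front of a data operation:
// a group-membership check (GridGrouper-like) plus a local policy hook (CSM/PDP-like).
public class AuthorizationSketch {

    interface GroupService {
        boolean isMember(String gridIdentity, String groupName);
    }

    interface LocalPolicy {
        boolean permits(String gridIdentity, String operation, String studyId);
    }

    static String queryProtectedStudy(String callerIdentity, String studyId,
                                      GroupService groups, LocalPolicy policy) {
        // Coarse-grained gate: caller must belong to the approved-researchers group.
        if (!groups.isMember(callerIdentity, "/cabig/studyX/approvedResearchers")) {
            throw new SecurityException("caller is not in the approved group");
        }
        // Fine-grained gate: local, institution-specific policy still applies.
        if (!policy.permits(callerIdentity, "query", studyId)) {
            throw new SecurityException("denied by local policy");
        }
        return "<DeidentifiedResults study=\"" + studyId + "\"/>";
    }

    public static void main(String[] args) {
        // Toy group service and local policy so the sketch runs end to end.
        String result = queryProtectedStudy(
            "/O=caBIG/CN=researcher1", "studyX",
            (id, group) -> true,
            (id, op, study) -> true);
        System.out.println(result);
    }
}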
39
Issue: Intellectual Capital
Social problem: “Publish or perish”
Justified hesitance to share pre-publication data
Justified reluctance to advance the cause of competitors (industrial and academic)
Can I rely on the data/results of some (potentially) unknown entity?
If cancer is cured, and caBIG resources play a role, there will be much interest in knowing who contributed what (and who funded them)
Proper attribution is not just ethical, it’s often required
40
Approach: Intellectual Capital
Technological: Provenance may or may not be enough (annotation vs. enforcement)
Socio-cultural:
A whole workspace in caBIG is dedicated to it (DSIC)
NCI is in a good position to “encourage” it: a large percentage of institutions’ cancer research funding comes from NCI
Hope is that motivation will be value-based once initially primed
Starting to see movement from “wait and see” to active engagement; industry involvement
Lots of work to do…
41
Issue: Complicated Trust Arrangements
When hundreds of organizations are sharing data and providing access to each other’s systems, defining a trust model is complicated, even for public data
For non-public data/systems, the simplest/safest policy is “deny all”
For many data sets and services, the owning organization may be virtual
Central authority is socially and technologically intractable
Rapid propagation of information on compromised systems/individuals is critical
42
Approach: Complicated Trust Arrangements
Grid Authentication and Authorization with Reliably Distributed Services (GAARDS)
Federated Trust Models (GTS)
Establish and manage trust relationships between institutions through adherence to mutually agreed-upon policy
Promote global policy distribution, but allow arbitrary local overrides
Provide enterprise tools and services for management and automated distribution of information
43
Grid Trust Service (GTS) Federation
A GTS can inherit Trusted Authorities and Trust Levels from other Grid Trust Services
Allows one to build a scalable Trust Fabric
Allows institutions to stand up their own GTS, inheriting all the trusted authorities in the wider grid, yet being able to add their own authorities that might not yet be trusted by the wider grid
A GTS can also be used to join the trust fabrics of two or more grids
GAARDS in Action
The application uses grid credentials to invoke secure grid services.
GAARDS in Action
The Grid Service authenticates the user by asking the GTS whether or not the signer of the credential should be trusted (“Should I trust the credential signer?”). A sketch of this check follows.
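A hypothetical sketch of the trust check in the slide above: when a secure call arrives, the service asks a (possibly federated) Grid Trust Service whether the authority that signed the caller's credential is trusted at the required level. The types and the "IdentityVerified" trust level are illustrative assumptions; real GTS validation also handles revocation lists and certificate-path checking.

// Hypothetical sketch of a GTS-style trust decision made by a grid service at call time.
// Types are invented; a real implementation validates full certificate chains and CRLs.
import java.util.Map;
import java.util.Set;

public class TrustCheckSketch {

    /** Stand-in for a federated Grid Trust Service: signer DN mapped to approved trust levels. */
    record TrustFabric(Map<String, Set<String>> trustedSigners) {
        boolean isTrusted(String signerDn, String requiredLevel) {
            return trustedSigners.getOrDefault(signerDn, Set.of()).contains(requiredLevel);
        }
    }

    static void authorizeCall(String callerDn, String signerDn, TrustFabric gts) {
        // "Should I trust the credential signer?" -- the question from the slide above.
        if (!gts.isTrusted(signerDn, "IdentityVerified")) {
            throw new SecurityException("credential signer " + signerDn + " is not trusted");
        }
        System.out.println("Accepting call from " + callerDn);
    }

    public static void main(String[] args) {
        TrustFabric gts = new TrustFabric(Map.of(
                "/O=caBIG/CN=Example Dorian CA", Set.of("IdentityVerified")));
        authorizeCall("/O=caBIG/CN=researcher1", "/O=caBIG/CN=Example Dorian CA", gts);
    }
}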
46
Issue: Computationally Expensive
Many studies on molecular data require expensive calculations on large data sets: statistical analysis, hypothesis testing, searches
Researchers lack necessary computing resources
47
Approach: Computationally Expensive
A variety of well-known solutions exist in the Grid and cluster space (a main driving force of their existence)
Challenge is in seamlessly integrating with the abstraction layer in use, i.e. operations on semantically annotated objects, not scheduled jobs on flat files
Leverage virtualization; domain-specific service interface over general computational resources (TeraGrid, supercomputer centers), as sketched below
Open Issue: Balancing abstraction vs. control (e.g. scheduling priorities, cost models, optimizations, etc.)
Open Issue: Appropriate level of control for service as resource broker
Open Issue: Complexity moved from client to service developer (working on tools to facilitate)
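A hypothetical sketch of that virtualization idea: the grid service keeps its semantically typed, object-level interface, and only internally translates the request into a batch job on a general-purpose compute resource such as a TeraGrid allocation. Both interfaces and the toy parsing are invented placeholders.

// Hypothetical sketch: a domain-specific analytical service front-ends a generic
// compute resource, so clients see objects and operations rather than job scripts.
import java.util.List;

public class ComputeVirtualizationSketch {

    record Gene(String symbol) { }
    record EnrichmentResult(String pathway, double pValue) { }

    /** Generic batch back end (cluster, supercomputer allocation, etc.). */
    interface BatchBackend {
        String submit(String jobDescription);   // returns a job id
        String collectOutput(String jobId);     // blocks until the job finishes
    }

    /** Domain-facing operation: semantically typed in, semantically typed out. */
    static List<EnrichmentResult> pathwayEnrichment(List<Gene> genes, BatchBackend backend) {
        // Translate the object-level request into a flat-file style job description.
        String job = "enrichment" + genes.stream().map(Gene::symbol).reduce("", (a, b) -> a + " " + b);
        String jobId = backend.submit(job);
        String rawOutput = backend.collectOutput(jobId);

        // Parse the job output back into annotated objects for the caller (toy parsing here).
        return List.of(new EnrichmentResult(rawOutput, 0.01));
    }

    public static void main(String[] args) {
        BatchBackend toyBackend = new BatchBackend() {
            public String submit(String jobDescription) { return "job-42"; }
            public String collectOutput(String jobId) { return "example-pathway"; }
        };
        pathwayEnrichment(List.of(new Gene("TP53"), new Gene("BRCA1")), toyBackend)
                .forEach(r -> System.out.println(r.pathway() + " p=" + r.pValue()));
    }
}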
48
caGrid/TeraGrid Overview
49
Issue: Evolving Infrastructure
Standards in the Web/Grid service domain are turbulent at best: competing interests of “big business” and multiple standards bodies
Major revisions of toolkits generally not backwards compatible
Interface stability vs. new features
Don’t want multiple grids
Upgrade or perish? Staying behind means lack of support
Application-layer abstractions help developers, but don’t address “wire incompatibility”
50
Approach: Evolving Infrastructure
Most traditional solutions are in conflict with strongly-typed requirements or complicate service development (unless extensibility is built into the spec), e.g. lax processing; must-ignore/must-understand with schema overloading; multiple (protocol) service interfaces
Abstract specifications from developers with tooling
Focus on rigid “data format” specifications; allow more freedom on composition into messages
Open Issue: Doesn’t address wire incompatibility
Open Issue: No good solution; do we need to just get it “good enough” and stabilize?
51
Summary
The bad news: Large-scale, distributed knowledge sharing is hard
The good news: The potential rewards are large
The good news (for computer scientists): There are lots of unsolved problems (and interest in getting them solved)
The cancer Biomedical Informatics Grid:
Connecting the Cancer Research Community
Scott Oster, Department of Biomedical Informatics
Ohio State University
Challenges of Large Applications in Distributed Environments (CLADE) 2007
Monterey Bay, California, June 25, 2007
53
BACKUP SLIDES
54
Standardized Service Metadata
Common Service Metadata
Provided by all services
Details the service’s capabilities, operations, contact information, and hosting research center
Service operations’ inputs and outputs defined in terms of structure and semantics extracted from caDSR and EVS
Service Security Metadata
Provided by all services
Details the service’s requirements on the communication channel for each operation
Can be used by a client to programmatically negotiate an acceptable means of communication
Data Service Metadata
Provided by all data services
Describes the Domain Model being exposed, in terms of a UML model linked to semantics
Provides the information needed to formulate the Object-Oriented Query
As with common metadata, data types defined in terms of structure and semantics extracted from caDSR and EVS
(A sketch of these metadata structures follows.)
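A hedged sketch of the three metadata documents as plain data classes, purely to make the structure above scannable. Field names paraphrase the slide text; they are not the actual caGrid metadata schema.

// Hypothetical data-class sketch of the standardized service metadata described above.
import java.util.List;

public class ServiceMetadataSketch {

    /** Common Service Metadata: provided by every service. */
    record CommonMetadata(String serviceName,
                          String hostingResearchCenter,
                          String contactEmail,
                          List<OperationDescription> operations) { }

    /** Each operation's inputs/outputs refer to registered, semantically annotated types. */
    record OperationDescription(String operationName,
                                List<String> inputTypeUmlClasses,
                                String outputTypeUmlClass) { }

    /** Service Security Metadata: per-operation channel requirements a client can inspect. */
    record SecurityMetadata(String operationName,
                            boolean requiresTransportSecurity,
                            boolean requiresMessageSigning) { }

    /** Data Service Metadata: the exposed Domain Model, which drives object-oriented queries. */
    record DataServiceMetadata(String domainModelName,
                               List<String> exposedUmlClasses) { }

    public static void main(String[] args) {
        CommonMetadata common = new CommonMetadata(
                "ExampleImagingDataService", "Example Cancer Center", "admin@example.org",
                List.of(new OperationDescription("query", List.of("CQLQuery"), "CQLQueryResults")));
        System.out.println(common.serviceName() + " hosted by " + common.hostingResearchCenter());
    }
}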
caBIG Data Hierarchy
Level I (Collection): Access control; Patient privacy; Data integrity; Provenance metadata for attribution; Authentication of authorship; Information security
Level II (Closed Distribution): External access controls; Dynamic permissions for limited access; Materials transfer issues; Mechanisms for data escrow
Level III (Public Distribution/Access): Data released from escrow; Data transmission security; Dynamic permissions for general access; Provision for IP ownership as opposed to access
Level IV (Post-Publication Attribution): Provenance metadata for publication; Community standards for attribution of authorship; Dynamic permissions for general release; Data escrow for publication
Issues affecting the data are cumulative, i.e. data functioning in Level II will also raise the issues raised for Level I data; Level III data will require attention to the issues raised by both Level I and II, et cetera.
56
Level I data issues
Level I data is all data collected by the caBIG system, including patient data, analyses, records, and research, regardless of whether that data is released to other researchers, the public, or parties other than the one that originally provides the data to the system. Issues raised include:
Access Controls. Management, operational, and technical controls are necessary to create a methodology for restricting access to data in caBIG consistent with the authorization of the individual or entity.
Patient Privacy. Data must be collected and stored in a manner that protects the privacy interests of the data subjects, consistent with the HIPAA Privacy Rule, the Common Rule of human subjects research (reflected in the Code of Federal Regulations), and other state, local, ethical, and institutional requirements.
Data integrity. Mechanisms must be available to ascertain that data has been entered accurately and will not be inappropriately modified in the transfer from its point of origin, while maintained in caBIG, or subsequently.
Provenance metadata for attribution. Individual contributors’ interests must be protected by assuring that the system allows data submitted to be associated with information concerning its authorship, collection, or creation, and that a mechanism exists for data originators to amend incorrect provenance information.
Authentication of authorship. Mechanisms and processes must be available to verify that provenance data correctly identifies the source of contributed data and information. Such protections may include digital signatures (as described in 21 CFR 11) and other methods.
Information Security. Data must be collected and stored in a manner that protects the privacy interests of the data subjects, consistent with the HIPAA Security Rule, the Federal Information Security Management Act of 2002, and other Federal, state, local, ethical, and institutional requirements.
57
Level II data issues
Level II data is data that is collected and then shared by some limited subset of potential data users, but not all caBIG users or the general public. These individuals could include, for example, the party that contributed the data only, individuals that have reached private agreements with those that have contributed the data, or individuals granted “role-based” access to certain categories. Issues raised at this level include:
External Access Controls. Level I data requires controls for access to caBIG; Level II data requires management, operational, and technical controls to limit access to the caBIG users authorized to view data originated elsewhere.
Dynamic permissions for limited access. Access controls will need to be flexible enough to change what data individuals have access to as roles, agreements, and activities of individual system users change over time.
Integration with materials transfer processes. Information sharing practices facilitated by caBIG must be aligned with practices for individuals or groups that share, transfer or provide access to tissues, cultures, cell lines, research animals, or other material shipped from one location to another.
Mechanisms for data escrow. Common research practices require data to be available for verification of research findings but not available for access, alteration, or further analysis until the validity of research findings is verified. caBIG will need to include a mechanism to allow data stored on the system to be partitioned off consistent with these requirements.
58
Level III data issues
Level III data is data made available to general audiences, including all caBIG users, all interested researchers, or the general public. Level III data issues include:
Data released from escrow. Once data has been cleared for general access (either due to the conclusion of the prepublication issues described under Level II data above, or pursuant to an arrangement with the data’s originator), it must be made available in a manner consistent with caBIG policy and in a way that does not compromise the data’s integrity or required attribution.
Data transmission security. Management, operational, and technical controls should assure that data integrity is not compromised in transit, and that poor security practices on the part of caBIG system users do not create platforms for security breaches of the caBIG system itself.
Dynamic permissions for general access. As with Level II data, access must be granted appropriately to users. As well as access levels for users, data must also be assigned security categories such that data can be re-categorized from having a specified, limited availability to becoming more generally available.
Provision for intellectual property ownership as opposed to access. Researchers may be willing to share data for limited purposes or a limited data set, and may wish to retain rights to be acknowledged for collecting data, generating analyses, or previous publications. Mechanisms must be in place to allow individuals the ongoing ability to benefit from their research or retain exclusive rights to it if contracts or other conditional agreements so require.
59
Level IV data issues
Level IV data is data that will be used for analyses, research, or other writing that will be attributed to one or more authors or individuals as the author, creator, sponsor, or other related party. Level IV data requires special protections such that the proper attribution is received for the particular accomplishments or expertise associated with that data. Level IV data issues include:
Provenance metadata for publication. Provenance metadata provisions for Level I data should govern the rights, restrictions and considerations relevant to authorship and attribution for data to be published.
Community standards for attribution of authorship. An appropriate, written protocol that accounts for existing law, policy and custom should exist to reflect how data and information is generated and in what capacity each participant contributed.
Dynamic permissions for general release. Data permissions for attributed data may require escrow; delivery to third parties for verification and analyses; and added provenance metadata for modified or concurrently developed material.
Data escrow for publication. Many journals require that data used in publications remain in escrow prior to (or following) publication to allow other researchers to validate findings. caBIG processes would need to be compatible with this requirement.