
Everybody Share: The Challenge of Data-Sharing Systems
Ken Smith, Len Seligman, and Vipin Swarup, The MITRE Corporation

Data sharing is increasingly important in modern society, yet researchers typically focus on a single technology, such as Web services, or explore only one aspect, such as semantic integration. The authors propose a technology-neutral framework for characterizing data-sharing problems and solutions and discuss open research challenges.

Individuals, businesses, and government agencies share vast quantities of data via a wide range of technologies, including Web services, data warehouses, portals, RSS feeds, and peer-to-peer (P2P) file sharing. Scientific data sharing has produced a new subdiscipline, data-driven science, in which researchers explore large data stores such as genomic databases and digital sky surveys to generate and test new hypotheses. Environments for sharing data have expanded dramatically—for example, a crawl of the Gnutella P2P network three years after its inception discovered 100,000 nodes with 20,000,000 files.1

On the heels of these successes, the demand for data sharing continues to grow:

• The US Government’s 9/11 commission cited inadequate information sharing as a key impediment to preventing acts of terrorism. Many governments are now enacting measures to improve counterterrorism data sharing.
• A key challenge in combating a pandemic disease outbreak is enabling a wide variety of government agencies, hospitals, and other organizations to rapidly share “biosurveillance” data.

At the same time, data sharing is a source of considerable controversy. P2P file-sharing users continue to clash with the entertainment industry, while government efforts to share terrorism information have raised strong concerns from privacy advocates.

Data sharing: Governments require it, vendors promise it, the media discuss it, and researchers in many fields study it. But what exactly is “it”? Among a diverse set of advertised “data-sharing solutions” are network routers, ontologies, federated databases, search engines, and portal software. While many studies have addressed technical aspects of data sharing, what the term means varies considerably.

Despite the increasing societal importance of data sharing, we are not aware of a published framework that defines data sharing, identifies its key dimensions and challenges, and explains how representative systems relate to it. Previous researchers have approached the problem from the perspective of a single technology, such as Web services, or have emphasized one aspect, such as semantic integration or digital rights management. However, enterprise architects and technology planners require a general framework that applies across a broad range of implementation technologies. We present such a framework and illustrate its conceptual features with examples from biomedical research, defense, and law enforcement.

MOTIVATING EXAMPLE: NEUROMORPHO.ORG

A growing area of science involves simulating the behavior of interconnected neurons to better understand the brain’s operation. Modelers who do this research require detailed representations of actual neurons, but, until recently, these have been hard to find, access, and interpret.



A second group of scientists studies neurons’ microscopic structure. Working with a microscope and a software application, these anatomists can create a detailed digital reconstruction of a single neuron in several weeks. Approximately 5,000 such reconstructions currently exist through the efforts of several dozen laboratories worldwide.

If shared, the cellular reconstructions that anatomists produce would provide modelers with accurate simulation inputs. Anatomists periodically do send modelers data by e-mail or DVD, but these modes of sharing are limited by the lack of supportive sharing services. The cost of discovering reconstructions is especially high: Modelers are unlikely to know anatomists or be familiar with their academic literature, and Internet search is insufficient to help modelers find suitable cellular reconstructions.

In response to this problem, the Computational Neuroanatomy Group at George Mason University’s Krasnow Institute for Advanced Study created and maintains NeuroMorpho.Org (http://neuromorpho.org). This system, which became publicly available in fall 2006, consists of a repository of 1,906 digitally reconstructed neurons (as of v2.1) from multiple labs and a website supporting data discovery and access, as shown in Figure 1.

The data, submitted in diverse formats, is normalized to facilitate its interpretation by consumers. The system also addresses a major policy barrier to scientific data sharing: concern about receiving credit. Before downloading a neuronal reconstruction, all users must agree to cite a publication, provided on the system, describing that neuron. The result has been a significant increase in data sharing: Users downloaded 41,000 cellular reconstructions in the system’s first year. In August 2007, the Journal of Neuroscience featured NeuroMorpho.Org, hailed by scientists as a “historic” database, on its cover.2 The system continues to acquire new data, with the ultimate goal of making available a large percentage of the world’s cellular reconstructions.

DATA-SHARING FRAMEWORK

We define data sharing as the act of enabling a set of consumers to find, access, and use digital data that was originally produced for others. Any collection of symbols—for example, words, numbers, or images—that expresses statements about the world constitutes data.

Basic requirements

Any large-scale data-sharing arrangement must meet four fundamental requirements:

• discovery,
• access,
• understanding, and
• policy management.

Table 1 lists typical provider and consumer actions that correspond to these requirements.

The first three requirements support consumers, who can only use the data if they have a means to discover its existence, access its contents, and understand it. Figure 1 shows NeuroMorpho.Org’s discovery tool; a Web server and back-end database provide access to its data. Understanding often requires data transformation and explanatory annotation, and is essential to the (re)usage of shared data. For example, NeuroMorpho.Org’s cellular reconstructions would be of little value without normalization and annotations describing experimental protocols.

The fourth requirement, policy management, supports the interests of all data-sharing participants.


Figure 1. NeuroMorpho.Org discovery interface. Clicking on a pie chart element displays all data available from the selected lab. Users can also search by species, brain region, and cell type.

Table 1. Basic data-sharing requirements and typical corresponding actions.

Requirement          Provider action                           Consumer action
Discovery            Advertise                                 Search
Access               Enable access                             Download data
Understanding        Annotate                                  Map and translate
Policy management    Express policy and monitor compliance     Meet obligation


Providers only let others use their data when the incentives to share outweigh the disincentives. Possible incentives include prestige, satisfaction of legal or organizational requirements, and anticipation of a reward such as participation in a winning cause or the offer of reciprocal sharing. Disincentives include fear that consumers will intentionally or unintentionally misuse data, steal ideas or intellectual property in the data, violate privacy, or fail to protect data from malicious third parties.

The balance can change dramatically when sharing partners are confident that certain obligations will be met. For example, anatomists are willing to share data through NeuroMorpho.Org largely because of the policy that consumers must cite data providers. Policy services provide mechanisms—such as access controls, usage constraints, and data quality guarantees—to express and realize desired behaviors in a data-sharing arrangement.

Data-sharing systems

A data-sharing system (DSS) can facilitate data sharing. A DSS consists of a set of objects to be shared; actors whose actions create flows of shared data; and services that meet a particular domain’s discovery, access, understanding, and policy-management requirements.
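To make this vocabulary concrete, the following minimal sketch (our illustration; none of the class or method names come from any system described here) shows how a DSS’s objects and four service requirements might be represented:

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class ActorType(Enum):
    """The five actor types discussed below."""
    PROVIDER = auto()
    CONSUMER = auto()
    DISTRIBUTOR = auto()
    FACILITATOR = auto()
    DEVELOPER = auto()

@dataclass
class SharedObject:
    """A unit of sharing: data of arbitrary granularity plus its metadata."""
    object_id: str
    content: bytes
    metadata: dict = field(default_factory=dict)  # e.g., {"brain_region": "hippocampus"}

class DataSharingSystem:
    """Skeletal services for the four basic requirements."""

    def __init__(self):
        self.objects = {}  # object_id -> SharedObject

    def discover(self, **criteria):
        """Discovery: match a consumer's search terms against metadata."""
        return [oid for oid, obj in self.objects.items()
                if all(obj.metadata.get(k) == v for k, v in criteria.items())]

    def access(self, object_id):
        """Access: deliver the object's contents."""
        return self.objects[object_id].content

    def annotate(self, object_id, key, value):
        """Understanding: attach an explanatory annotation."""
        self.objects[object_id].metadata[key] = value

    def add_obligation(self, object_id, obligation):
        """Policy management: record an obligation tied to this object."""
        self.objects[object_id].metadata.setdefault("obligations", []).append(obligation)
```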

Objects. The unit of data sharing is the object, which here refers to data of arbitrary granularity and type. The object can be monolithic, such as an image file, or structured to contain other objects, such as a database table containing rows and columns.

An object O can have metadata describing various properties of O. For example, a Wikipedia article has an edit history, and in NeuroMorpho.Org each cellular reconstruction has a brain region. Metadata annotations are critical to discovery, consumer understanding, and policy management.

Actors. In a DSS, individuals or organizations initiate actions through which data sharing occurs. As Figure 2 shows, there are five actor types: providers, consumers, distributors, facilitators, and developers.

Providers introduce data into a DSS for use by other actors. They often have a responsibility to protect data from corruption or inappropriate disclosure. Consumers seek to obtain new data through a DSS. Distributors collect data from multiple providers and make it available to consumers. They provide several value-added services, including improved discovery, translation into various consumer formats, and policy enforcement. Distributors such as NeuroMorpho.Org can substantially reduce the effort of both consumers and providers.

DSSs also can have indirect actors who do not share data themselves but nevertheless enable sharing. Facilitators can recruit providers and consumers, set policies favorable to sharing system success, establish data standards, or fund tool development. Developers produce the desired services.

As the “Reducing the ‘Activation Energy’ of Data Sharing” sidebar describes, actors can make it easier for beneficial sharing to occur, much like catalysts in chemical reactions.

Services. A DSS provides services that meet the four requirements of discovery, access, understanding, and policy management. In some cases, developers meet these requirements easily; data access in particular has been eased by a host of standards—for example, HTTP, XML, and WSDL—and off-the-shelf products. However, significant challenges remain in meeting the other requirements.

DATA DISCOVERY

The most straightforward approach to discovery uses generic search engines and service registries. However, many sources cannot be indexed by Web crawlers. This deep Web results from firewalls and security guards as well as the presence of valuable data in password-protected databases. One study estimates the deep Web to be 40 times the size of the crawlable Internet.3

Example: Defense Discovery Metadata Standard

Recently, the US Department of Defense (DoD) sought to improve the visibility of its data across an enormous enterprise. However, standard search engines did not address two key requirements.

Figure 2. DSS actor types. An individual or organization can perform the actions of one or more of five actor types. (The diagram shows the flow of object control among the major actor types, provider, distributor, and consumer; the indirect actor types, facilitator and developer, set policy and standards and fund and distribute tools.)


First, security constraints dictate managing much information behind organizational firewalls and security guards, yet it must nonetheless be discoverable by consumers outside those organizations. Second, traditional text keywords are inadequate for many searches; instead, spatial and temporal search terms—for example, “the region within this bounding box”—are needed.

In response, the DoD defined the Defense Discovery Metadata Standard (http://metadata.dod.mil/mdr/irs/DDMS). All DoD data sources must describe themselves using the DDMS and make those descriptions available to an enterprise content discovery service. This addresses the security problem, as DDMS descriptions are typically far less sensitive than the sources they describe. Providers must also describe their data source’s geospatial and temporal coverage where applicable. The DDMS minimizes the provider’s annotation burden by requiring only a few fields; it can also be extended, permitting communities within the DoD to define additional discovery metadata—for example, imagery analysts can define resolution metadata.
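For illustration, a discovery-metadata record in this spirit might look like the following sketch; the field names and values are invented stand-ins, not the actual DDMS elements:

```python
# Hypothetical discovery-metadata record in the spirit of the DDMS;
# every field name here is illustrative, not the real standard's schema.
ddms_like_record = {
    "identifier": "mda-track-feed-042",           # assumed ID scheme
    "title": "Merchant vessel track reports",
    "description": "Hourly position reports for commercial shipping.",
    "security_label": "UNCLASSIFIED",             # description is less sensitive than the source
    "geospatial_coverage": {                      # bounding box enables spatial search
        "west": -77.5, "east": -75.0, "south": 36.5, "north": 39.5,
    },
    "temporal_coverage": {"start": "2008-01-01", "end": "2008-06-30"},
    "extensions": {"imagery_resolution_m": None}, # community-defined extra fields
}
```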

The DoD has demonstrated the DDMS and a pilot content discovery service in several applications, including an effort to improve “maritime domain awareness”: a multiagency mission to help decision makers understand activity on the seas that might adversely affect the nation’s security, safety, economy, or environment.

Discovery challenges

In addition to the deep Web, two discovery challenges that are topics of current research are structure-aware discovery and the interaction of discovery and privacy.

Structure-aware discovery. It is possible to provide rudimentary discovery simply by making database contents indexable, but such an approach fails to exploit the semantics implicit in the information’s structure. For example, text-based search engines do not distinguish between occurrences of the term “cancer” in the records <diagnosis = “cancer”> and <diagnosis = “normal,” remarks = “no cancer!”>.
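A toy example (ours) makes the distinction concrete: a keyword match hits both records above, while a structured predicate retrieves only the true cancer case.

```python
records = [
    {"diagnosis": "cancer"},
    {"diagnosis": "normal", "remarks": "no cancer!"},
]

# Text-based search: "cancer" appears somewhere in both records.
keyword_hits = [r for r in records
                if any("cancer" in str(v) for v in r.values())]
assert len(keyword_hits) == 2

# Structure-aware search: only the record actually diagnosed with cancer.
structured_hits = [r for r in records if r.get("diagnosis") == "cancer"]
assert len(structured_hits) == 1
```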

While P2P file-sharing systems support discovery queries involving object structure at an Internet scale, search criteria are currently limited to a few predefined attributes—for example, artist and album title in music databases. Research is just beginning on how to provide more general structured queries at Internet scale and seamless discovery over both structured and unstructured sources.4

Another hurdle is automatically generating useful summaries of structured data sets to enable their discovery. For example, a medical patient database specifying age, sex, and alcohol consumption could be characterized by histograms of these attributes. Challenges include how to autogenerate descriptive content summaries and how discovery services can best exploit such summaries.
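A minimal sketch of such an autogenerated summary, assuming a toy patient table and decade-wide age buckets (both our inventions):

```python
from collections import Counter

patients = [
    {"age": 34, "sex": "F", "alcohol": "none"},
    {"age": 37, "sex": "M", "alcohol": "moderate"},
    {"age": 62, "sex": "F", "alcohol": "moderate"},
]

def content_summary(rows, attributes):
    """Histogram each attribute; ages are bucketed by decade."""
    summary = {}
    for attr in attributes:
        values = (10 * (r[attr] // 10) if attr == "age" else r[attr] for r in rows)
        summary[attr] = dict(Counter(values))
    return summary

print(content_summary(patients, ["age", "sex", "alcohol"]))
# {'age': {30: 2, 60: 1}, 'sex': {'F': 2, 'M': 1}, 'alcohol': {'none': 1, 'moderate': 2}}
```

A discovery service could publish these histograms as metadata, letting the epidemiologist mentioned below judge a data set’s relevance without seeing any individual record.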

Interaction of discovery and privacy. The DDMS example highlights problems arising from interactions of discovery and security. Interactions of discovery and privacy present similar concerns, with additional subtleties.

Reducing the ‘Activation Energy’ of Data Sharing

The progress of chemical reactions can be visualized using energy diagrams like that shown in Figure A. Many reactions result in a net production of energy but require an initial investment of activation energy. A reaction with a high activation energy is unlikely to occur, even if it ultimately generates energy. Catalysts reduce this activation energy barrier, increasing the reaction’s probability.

Data-sharing systems can likewise require a large investment of “activation energy” before producers and consumers benefit. By providing tools and infrastructure, standard data formats, favorable policies, and funding, certain actors can act as catalysts, making it easier for beneficial sharing to occur.

For example, Giorgio Ascoli, director of the Computational Neuroanatomy Group at George Mason University’s Krasnow Institute for Advanced Study, has catalyzed the sharing of neuronal reconstructions in his scientific community by playing a distributor role. His group’s NeuroMorpho.Org system significantly reduces the energy barriers of discovering relevant data, acquiring the actual data files, and normalizing their formats. The National Institutes of Health has indirectly served as a catalyst for sharing neuronal reconstructions by playing a facilitator role: funding NeuroMorpho.Org and encouraging NIH-funded data producers to develop a sharing plan.

[Figure A plots potential energy against reaction progress for the reaction 2BrNO → 2NO + Br2, marking the transition state, the activation energy Ea, and the net energy change ∆E.]

Figure A. Energy diagram. Just as catalysts reduce the activation energy (Ea) of a chemical reaction, making it more likely to occur, actors can serve as catalysts in data-sharing systems by providing appropriate tools and infrastructure, standard data formats, favorable policies, and funding.


For example, the Health Insurance Portability and Accountability Act of 1996 prohibits publicly sharing data sets that contain protected health information unless they are first “de-identified.” HIPAA’s safe-harbor de-identification method requires removing date information more exact than the year for personal events, such as medical exams. However, this would hinder an epidemiologist seeking medical records for research on, say, the daily spread of influenza in November and December. Thus, using the safe-harbor approach, the very act of making a data set sharable can deprive it of sufficient temporal detail to make it appear relevant.
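A sketch of the date rule alone makes the tension visible; note that real safe-harbor de-identification also strips names, identifiers, and many other fields beyond what this toy function shows:

```python
import re

record = {"exam_date": "2007-11-23", "diagnosis": "influenza"}

def generalize_dates(rec):
    """Truncate any ISO-format date to its year, per the safe-harbor rule
    that dates more exact than the year must be removed. The day-level
    detail the epidemiologist needs is irrecoverably lost."""
    return {k: (v[:4] if isinstance(v, str)
                and re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) else v)
            for k, v in rec.items()}

print(generalize_dates(record))  # {'exam_date': '2007', 'diagnosis': 'influenza'}
```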

The challenge is to describe an entire data set in sufficient detail to enable accurate discovery without violating privacy constraints on individual records. Then, upon discovery, providers and consumers can reach specific agreements to share data sets containing private information in an appropriately constrained manner.

Consumer queries also present privacy issues. For example, a pharmaceutical company’s searches in a protein database could reveal its strategies.

CONSUMER UNDERSTANDING OF DATA

Data is usually well understood in its environment of origin because important details such as data accuracy and units of measure are implicitly known. However, these details are often unknown in the consumer’s environment, leading to the naked-data problem: detail-poor data that consumers find difficult to interpret. It is thus necessary to “reclothe” shared data with metadata annotations to make it understandable to consumers.

Example: BIRN Mediator

The Biomedical Informatics Research Network (www.nbirn.net) consists of two dozen international research laboratories. BIRN was launched in 2001 to accelerate science by creating “a new biomedical collaborative infrastructure and culture.” Central to this infrastructure is a data-sharing system, the BIRN Mediator. Member labs share diverse types of data, including MRI studies of the human brain, detailed images of mouse brain tissue, and genetic expression (for example, microarray) data for mouse brain tissue. The BIRN Coordinating Center (BIRN-CC) serves as both facilitator and distributor.

Name and value heterogeneity. Among its efforts to improve consumer understanding, BIRN addresses inconsistency in both the names and values of shared data objects. For example, the identical concept might be named “brain region” by one BIRN member and “lesion location” by another. To deal with name heterogeneity, BIRN-CC developed a standard annotation schema that some labs adopted directly. Other labs published data in their own formats, and BIRN-CC engineers developed mappings to the standard.

However, even with standardized names, there was significant heterogeneity in annotation values. For example, one lab might use “hindbrain” and another “rhombencephalon” to describe the same brain region. BIRN-CC therefore developed an ontology of key concepts and encouraged providers to use those concepts in annotations.
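In miniature, and with made-up mapping tables rather than BIRN-CC’s actual schema or ontology, the two normalization steps might look like this:

```python
# Name heterogeneity: map a lab's local attribute names to the standard schema.
NAME_MAP = {"lesion location": "brain_region", "brain region": "brain_region"}

# Value heterogeneity: map synonymous values to a preferred ontology term.
VALUE_MAP = {"rhombencephalon": "hindbrain", "hindbrain": "hindbrain"}

def normalize(annotation):
    """Rewrite one lab-local {name: value} annotation into standard form."""
    return {NAME_MAP.get(k.lower(), k): VALUE_MAP.get(v.lower(), v)
            for k, v in annotation.items()}

print(normalize({"Lesion Location": "Rhombencephalon"}))
# {'brain_region': 'hindbrain'}
```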

While the initial ontology effort was quite broad, the current version focuses on a few key features of human and mouse neuroanatomy. The narrower scope has two advantages. First, large ontologies are expensive to develop, especially where there must be consensus over a broad community. In fact, the cost is proportional to the product of the number of concepts to be standardized and the number of stakeholders from whom agreement is required.5 Second, more providers can understand a smaller ontology, resulting in more consistent annotations.

Spatial heterogeneity. BIRN also tackles spatial heterogeneity in biomedical images. To overlay, compare, and explore spatial mouse brain information—for example, anatomic images and genetic expression maps—from different mice from various labs, BIRN-CC transforms such data onto a standard reference image called an atlas, shown in Figure 3, using a special Java-based toolkit (www.nbirn.net/tools/mbat/index.shtm). To align this data, a spatial normalization algorithm maps provider data into the atlas’s coordinate system.
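The final alignment step is, at its core, a coordinate transform. A minimal sketch (ours, not the BIRN toolkit’s registration algorithm) using an assumed affine matrix:

```python
import numpy as np

# Assumed 4x4 affine (rotation/scale/translation) taking one lab's image
# coordinates into the atlas's coordinate system; real spatial normalization
# would estimate this matrix from anatomical landmarks or image intensities.
lab_to_atlas = np.array([
    [0.9, 0.0, 0.0,  2.0],
    [0.0, 0.9, 0.0, -1.5],
    [0.0, 0.0, 1.1,  0.0],
    [0.0, 0.0, 0.0,  1.0],
])

def to_atlas(points):
    """Map an (n, 3) array of lab-space points into atlas space."""
    homogeneous = np.c_[points, np.ones(len(points))]   # (n, 4)
    return (homogeneous @ lab_to_atlas.T)[:, :3]

print(to_atlas(np.array([[10.0, 20.0, 5.0]])))  # [[11.   16.5   5.5]]
```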

Challenges in consumer understanding

As the BIRN experience illustrates, semantic heterogeneity is a major obstacle to consumer understanding of shared data.

Figure 3. Tackling spatial heterogeneity in biomedical images. The BIRN Coordinating Center transforms spatial mouse brain information from different labs onto a standard reference atlas.


Fortunately, the two main research camps—one focused on ontologies, the other on database integration—have begun to cooperate in recent years.6 Stronger tools should result from ontologists’ semantically richer models and database researchers’ emphasis on pragmatics and industrial scale.

Automating annotation. Sharing systems must increasingly automate metadata annotation to reduce costs. Opportunities include information extraction from free text, such as identifying the time and place of events, and automated reasoning about contexts—for example, recognizing that EU monetary amounts are in euros unless otherwise specified.
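A sketch of such a context rule, with the rule and field names our invention:

```python
def annotate_amount(record, origin):
    """Attach an explicit currency to a bare monetary amount, applying an
    illustrative context rule: amounts in EU-origin records are assumed
    to be euros unless the record already names a currency."""
    if "currency" not in record:
        record = dict(record, currency="EUR" if origin == "EU" else "unknown")
    return record

print(annotate_amount({"amount": 125000.0}, origin="EU"))
# {'amount': 125000.0, 'currency': 'EUR'}
```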

Complex annotation types. Some annotations, such as data lineage and uncertainty, are widely useful yet potentially complex to represent. These require greater rigor than the ad hoc techniques currently employed. Work on the Trio project by researchers at Stanford University and Google represents an important first step.7

Conflicting values. Even after semantic heterogeneity is resolved, two producers might offer conflicting opinions on values for the same real-world object. For example, if protein databases A and B list the function of protein-1 as “immune system” and “cell metabolism,” respectively, which value should the consumer use? A recent paper describes automated reconciliation techniques based on preferences for recency or perception of source reliability,8 but more research is required to produce practical reconciliation tools.
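In the simplest case, such a reconciliation preference might be expressed as follows; this is a sketch, and the dates and trust scores are assumed, not drawn from the cited work:

```python
from datetime import date

# Conflicting assertions about the same real-world object.
claims = [
    {"source": "protein_db_A", "value": "immune system",   "asserted": date(2006, 3, 1)},
    {"source": "protein_db_B", "value": "cell metabolism", "asserted": date(2007, 9, 15)},
]

# Consumer-supplied trust scores for sources (assumed values).
reliability = {"protein_db_A": 0.9, "protein_db_B": 0.4}

def reconcile(claims, prefer="recency"):
    """Pick one value, either by most recent assertion or by source trust."""
    key = (lambda c: c["asserted"]) if prefer == "recency" \
          else (lambda c: reliability[c["source"]])
    return max(claims, key=key)["value"]

print(reconcile(claims, prefer="recency"))      # cell metabolism
print(reconcile(claims, prefer="reliability"))  # immune system
```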

DATA-SHARING POLICY MANAGEMENT

Providers of medical images incur risks by sharing them. Against their wishes, a consumer might take credit for the images, sell them to others, or fail to protect subjects’ privacy. The consumer might also have expectations of the data-sharing relationship—for example, to promptly receive well-annotated images in the future. Policy management gives sharing partners a way to express, analyze, and realize desired sharing system behavior. Challenges in policy management arise primarily from the

• tension among security, privacy, and sharing needs, and
• conflicting interests of multiple parties.

Policy management highlights the social requirements of data sharing, including

• trust among sharing partners;
• alignment of participants’ locally perceived incentives with desired global behavior; and
• effective governance—the set of processes, customs, and laws affecting the administration or control of an enterprise.

Failure to consider these social factors has caused some high-profile data-sharing initiatives to fail—for example, the DoD’s massive data standardization efforts in the 1990s inadequately considered participants’ incentives.5

Example: Sharing sensitive but unclassified information

Federal intelligence agencies generate significant amounts of data regarding criminal activities and terrorism, including intelligence assessments, terrorist identity watch lists, and reports on drug movements and money laundering. State, local, and tribal law enforcement (SLTLE) agencies also generate much information during, for example, police patrols and traffic stops.

Data sharing between federal and SLTLE agencies can provide enormous value to their respective investigations. For example, in August 2004, two off-duty police officers observed individuals filming Maryland’s Chesapeake Bay Bridge (http://fas.org/irp/agency/doj/lei/chap11.pdf). They reported this suspicious activity and, after a series of information exchanges between federal and SLTLE agencies, a more complete and dangerous picture of the situation emerged. Authorities arrested a suspect and subsequently discovered valuable evidence at his residence.

Researchers have developed numerous systems to facilitate such sharing in recent years. However, much critical data from federal agencies is classified, and SLTLE staff often lack the appropriate security clearances. A primary reason for the higher classification is that the information reveals the sources and methods used to generate it. Hence, a typical solution is to remove these and to relabel the information as sensitive but unclassified (SBU). SBU information is unclassified and can be shared with SLTLE staff but not the public.

However, SBU data itself is subject to a multitude of legal requirements that restrict its sharing. For example, law-enforcement-sensitive information cannot be disseminated outside the domestic law-enforcement community without the author’s permission. Further, sharing systems that contain criminal intelligence data must comply with privacy-protection guidelines such as 28 Code of Federal Regulations Part 23. Adding to this complexity, many categories of SBU information have evolved, and each category is governed by a unique set of regulations or constraints; further, various agencies sometimes treat the same category differently.

Policy management challenges

As the SBU example shows, large-scale DSSs often have participants with conflicting perceptions of risks and benefits.


Laws and organizational policies might also limit sharing. Despite this complexity, however, there is little automated support for specifying, monitoring, and enforcing sharing policies. The result is inadequate data sharing in important domains including disaster response, pandemic monitoring, and counterterrorism. Research breakthroughs are needed in managing risk/benefit tradeoffs, handling policy conflicts, tracking and enforcing obligations, and building policy administration tools.

Managing risk/benefit tradeoffs. Existing access-control policies focus primarily on protection rather than sharing. Trust decisions are therefore based on the risks of sharing information and exclude consideration of the risks of not sharing. While protection is necessary, failing to provide information to the right entity at the right time can potentially be as disastrous as failing to protect it. For example, in disaster-response or military operations, failure to share data can result in loss of life. We need new sharing models, with well-understood properties, capable of making and reasoning about practical risk/benefit tradeoffs.9
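One way to frame such a tradeoff (our sketch, not the cited report’s model) is as an expected-value comparison that weighs the risk of misuse against the risk of withholding:

```python
def should_share(p_misuse, cost_of_misuse, p_needed, benefit_if_used):
    """Share when the expected benefit of sharing exceeds the expected
    cost of misuse; all parameter values are assumed estimates."""
    return p_needed * benefit_if_used > p_misuse * cost_of_misuse

# Disaster response: the data is very likely needed, and withholding it
# forfeits a large benefit, so sharing wins despite some misuse risk.
print(should_share(p_misuse=0.05, cost_of_misuse=100.0,
                   p_needed=0.8, benefit_if_used=50.0))  # True: 40.0 > 5.0
```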

Policy conflicts. Diverse participants rarely have the same perception of the benefits and risks of data sharing. For example, the policies of a provider of medical images may conflict with those of the original subjects, hospitals, insurance companies, and the government. Resolving the conflicts among diverse stakeholders’ preferences and enforcing the resulting policies remains a daunting challenge. Semantic heterogeneity research is relevant here as well because stakeholders express their preferences in terms of their local view of data.

Obligations. Obligations are commitments to future behavior; they can mitigate the risks and increase the benefits of sharing. For example, a consumer of SBU information might commit to sending the provider detailed usage logs as a condition of gaining access, while the provider might make guarantees of data accuracy or timeliness. Obligations are central to a sharing policy—confidence that sharing partners will keep obligations builds trust. Many challenges remain, however, in supporting the expression, management, analysis, monitoring, and enactment of obligations incurred in data sharing.

Researchers have developed formal obligation models and applied them to data sharing; one proposed model10 focuses on data-sharing agreements (DSAs), which are analogous to service-level agreements. DSAs are useful when consumers build value-added services over information supply chains they do not control. A DSA can require providers to ensure qualities of their information supply, such as availability and recency.
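In that spirit, an obligation might be recorded and monitored roughly as follows; this is a sketch under assumed terms, not the cited framework’s actual model:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Obligation:
    """One party's commitment to future behavior under a sharing agreement."""
    holder: str    # who owes the obligation, e.g., "consumer"
    action: str    # e.g., "send detailed usage logs"
    due: date
    fulfilled: bool = False

    def overdue(self, today: date) -> bool:
        return not self.fulfilled and today > self.due

# A two-sided agreement: the consumer owes usage logs, the provider recency.
agreement = [
    Obligation("consumer", "send detailed usage logs", date(2008, 10, 1)),
    Obligation("provider", "refresh shared data weekly", date(2008, 9, 8)),
]

today = date(2008, 9, 10)
print([o.action for o in agreement if o.overdue(today)])
# ['refresh shared data weekly']
```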

A key challenge is remote enforcement of obligations in consumer environments. Once data is shared beyond the provider’s administrative sphere, digital rights management (DRM) techniques permit data owners to control access to and usage of data; thus they hold potential for expressing and enforcing certain consumer obligations. For example, printing limits could be encoded into a PDF document, which the PDF reader’s print option would enforce in the consumer’s environment.

In current practice, however, provider DRM policies are often unilateral—that is, enforcement is unnegotiated—which does not promote the requisite cooperation for data sharing. Also, often the consumer can’t interpret the obligations encoded in DRM. When a consumer has a large database of shared data involving obligations from multiple sources, it is crucial to be able to understand and reason about the current set of incurred obligations.

Policy administration and language. Specifying and implementing expressed policy currently requires too much human skill and effort. While stakeholders express their policy in terms of concepts like “criminal intelligence data” and “law-enforcement professional,” policy administrators must

• understand these domain-specific concepts;
• be familiar with policy implementation mechanisms, such as database access controls and audit logs; and
• unambiguously map stakeholder concepts onto implementation mechanisms and objects, such as database tuples, in diverse systems.11

This process grows more difficult as sharing crosses organizational and security boundaries, and when access policy references dynamically determined attributes of the environment—for example, the threat level.

Well-designed policy languages can simplify policy administration. However, policy languages providing aggregation and other constructs for sharing policy and intuitive interfaces for nonspecialist administrators are currently rare.12 More research is needed on simplifying policy specification and automatically compiling expressed policy into code executable in heterogeneous environments.
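To illustrate the gap such languages must close, a stakeholder rule like “law-enforcement professionals may read criminal intelligence data when the threat level is elevated” might compile down to an attribute check along these lines (all attribute names invented):

```python
def permits(policy, subject, resource, environment):
    """Evaluate one attribute-based rule; attribute names are illustrative."""
    return (subject.get("role") == policy["role"]
            and resource.get("category") == policy["category"]
            and environment.get("threat_level") in policy["threat_levels"])

rule = {"role": "law_enforcement_professional",
        "category": "criminal_intelligence",
        "threat_levels": {"elevated", "high", "severe"}}

print(permits(rule,
              subject={"role": "law_enforcement_professional"},
              resource={"category": "criminal_intelligence"},
              environment={"threat_level": "elevated"}))  # True
```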

Our proposed technology-neutral framework to unify and understand the diverse aspects of data sharing is but a first step in developing effective data-sharing systems. Significant technical challenges remain for researchers, tool builders, and enterprise policy makers before everybody can, truly, share information. At the same time, engineers can’t afford the luxury of considering social factors as “out of scope”; failure to consider these can cause large, expensive data-sharing initiatives to fail. ■

Acknowledgments

This work was supported by a grant from the Human Brain Project (R01-MH64417-01), which is funded by the National Institute for Mental Health, and by the Mitre Innovation Program. We thank Barbara Blaustein, Peter Mork, Arnie Rosenthal, Jean Stanford, Anne Cady, Steve Scott, and Giorgio Ascoli for helpful comments and many useful discussions. Annual meetings of the Human Brain Project and the International Consortium on Brain Mapping (funded by NIMH grant P20-MHDA52176) have also been valuable forums for discussion. We also thank the anonymous reviewers whose feedback enhanced this article.

References

1. B.T. Loo et al., “Enhancing P2P File-Sharing with an Internet-Scale Query Processor,” Proc. 30th Int’l Conf. Very Large Databases, vol. 30, VLDB Endowment, 2004, pp. 432-443.

2. G.A. Ascoli, D.E. Donohue, and M. Halavi, “NeuroMorpho.Org: A Central Resource for Neuronal Morphologies,” J. Neuroscience, vol. 27, no. 35, 2007, pp. 9247-9251.

3. M.K. Bergman, “The Deep Web: Surfacing Hidden Value,” J. Electronic Publishing, vol. 7, no. 1, 2001; www.press.umich.edu/jep/07-01/bergman.html.

4. J. Madhavan et al., “Web-Scale Data Integration: You Can Only Afford to Pay as You Go,” Proc. 3rd Biennial Conf. Innovative Data Systems Research, 2007; www.cidrdb.org/cidr2007/papers/cidr07p40.pdf.

5. A. Rosenthal, L. Seligman, and S. Renner, “From Semantic Integration to Semantics Management: Case Studies and a Way Forward,” ACM SIGMOD Record, vol. 33, no. 4, 2004, pp. 44-50.

6. A. Doan, N.F. Noy, and A.Y. Halevy, “Introduction to the Special Issue on Semantic Integration,” ACM SIGMOD Record, vol. 33, no. 4, 2004, pp. 11-13.

7. O. Benjelloun et al., “ULDBs: Databases with Uncertainty and Lineage,” Proc. 32nd Int’l Conf. Very Large Databases, VLDB Endowment, 2006, pp. 953-964.

8. N.E. Taylor and Z.G. Ives, “Reconciling While Tolerating Disagreement in Collaborative Data Sharing,” Proc. 2006 ACM SIGMOD Int’l Conf. Management of Data, ACM Press, 2006, pp. 13-24.

9. Horizontal Integration: Broader Access Models for Realizing Information Dominance, JASON report JSR-04-132, Mitre Corp., 2004; www.fas.org/irp/agency/dod/jason/classpol.pdf.

10. V. Swarup, L. Seligman, and A. Rosenthal, “A Data Sharing Agreement Framework,” Proc. 2nd Int’l Conf. Information Systems Security, LNCS 4322, Springer, 2006, pp. 22-36.

11. A. Rosenthal, “Scalable Access Policy Administration: Opinions and a Research Agenda,” tech. paper, Mitre Corp., 2005; www.mitre.org/work/tech_papers/tech_papers_05/05_1186/05_1186.pdf.

12. K. Smith et al., “Enabling the Sharing of Neuroimaging Data through Well-Defined Intermediate Levels of Visibility,” NeuroImage, vol. 22, no. 4, 2004, pp. 1646-1656.

Ken Smith, a senior principal scientist in the Information and Computing Technologies Division at Mitre, is an adjunct professor at George Mason University. His research interests include scientific databases and information sensitivity. Smith received a PhD in computer science from the University of Illinois at Urbana-Champaign. Contact him at [email protected].

Len Seligman is chief technologist for the Information and Computing Technologies Division at Mitre. His research interests include data interoperability and information management for very large enterprises. Seligman received a PhD in information technology from George Mason University. Contact him at [email protected].

Vipin Swarup is a principal scientist in the Information Assurance Division at Mitre. His research interests include information security, secure information sharing, and information survivability. Swarup received a PhD in computer science from the University of Illinois at Urbana-Champaign. Contact him at [email protected].
