

Privacy-Aware Health Information Sharing

Thomas Trojer

Research Group Quality Engineering

University of Innsbruck

Innsbruck, Austria

C. K. Lee

Hong Kong Red Cross Blood Transfusion Service

Hong Kong

Benjamin C. M. Fung

CIISE, Concordia University

Montreal, QC, Canada

Lalita Narupiyakul and Patrick C. K. Hung

University of Ontario Institute of Technology

Oshawa, ON, Canada

October 15, 2009

Abstract

Gaining access to high-quality health data is a vital requirement for informed decision making for medical practitioners and pharmaceutical researchers. Driven by mutual benefits and by regulations, there is a demand and necessity for healthcare institutes to share patient data with various parties for research purposes. Health data in its raw form, however, often contains sensitive information about individuals, and publishing such data will violate individual privacy. The current practice in data publishing relies primarily on policies and guidelines about the types of data that can be published, and on agreements governing the use of shared data. This approach alone may lead to excessive data distortion or insufficient protection. A problem of utmost importance, known as privacy-aware information sharing, is to provide methods and tools for sharing person-specific, sensitive information for the purpose of data mining. This chapter exploits a real-life information sharing scenario in the Hong Kong Red Cross Blood Transfusion Service to bring out the challenges of preserving both individual privacy and data mining quality in the context of healthcare information systems. Furthermore, this chapter presents a unified privacy-aware information sharing method for two specific data mining tasks, namely classification analysis and cluster analysis. Based on an extended data

schema taken from the Health Level 7 (HL7) framework, this chapter employs the information system of the Red Cross transfusion service in Hong Kong as a use case to motivate the problem and to illustrate the essential steps of the anonymization methods for classification and cluster analysis.

1 Introduction

Gaining access to high-quality health data is a vital requirement for informed decision making for medical practitioners and pharmaceutical researchers. Driven by mutual benefits and by regulations, there is a demand and necessity for healthcare institutes to share patient data with various parties for research purposes. Health data in its raw form, however, often contains sensitive information about individuals, and publishing such data will violate individual privacy. The current practice in data publishing relies primarily on policies and guidelines about the types of data that can be published, and on agreements governing the use of shared data. This approach alone may lead to excessive data distortion or insufficient protection. A problem of utmost importance, known as privacy-aware information sharing, is to provide methods and tools for sharing person-specific, sensitive information for the purpose of performing data mining.

Based on an extended data schema taken from the Health Level 7 (HL7) framework, this chapter exploits a real-life information sharing scenario in the Hong Kong Red Cross Blood Transfusion Service (BTS) to bring out the challenges of preserving both individual privacy and data mining quality in the context of healthcare information systems. Furthermore, this chapter presents a unified privacy-aware information sharing method for two specific data mining tasks, namely classification analysis and cluster analysis. This chapter employs the information system of the Red Cross transfusion service in Hong Kong as a use case to motivate the problem and to illustrate the essential steps of the anonymization methods for classification and cluster analysis.

2 Reference Scenario

Figure 1 illustrates the data flow in the Hong Kong Red Cross BTS. After collecting and examining the blood collected from donors, the Red Cross BTS distributes the blood to different hospitals. The hospitals collect and maintain the health records of their patients, and transfuse the blood to them if necessary. All information used and maintained throughout the transfusion procedure, including the type of operation, the participating medical practitioners, and the reason for transfusion, is clearly documented and stored in the database of a hospital. The patient information as well as the blood usage listings are made accessible to the Red Cross BTS so that this institution can perform certain data mining and auditing tasks. The objectives of the data mining and auditing procedures are to improve the estimates of future blood consumption in different hospitals and to make recommendations on blood usage in future medical cases. In the final

Figure 1: Data Flow in Hong Kong Red Cross Blood Transfusion Service

step, the Red Cross BTS submits a report to the Hospital Authority. Under the applicable privacy regulations, such reports must keep patients' privacy protected while still preserving the useful patterns and structures in the data. The data published through the privacy-aware health information sharing service is therefore refined to meet certain privacy criteria. This Red Cross example brings out a typical dilemma in information sharing and privacy protection faced by many healthcare institutions in the world today. In general, we can study this use case from the following perspectives.

Information needs: The Red Cross BTS wants to have access to the health and blood usage data, not statistics, from the public hospitals for several reasons. First, the practitioners in hospitals have no expertise in data mining. They simply want to share the patient data with a party, such as the Red Cross BTS in this example, who needs the health data for a legitimate reason. Second, having access to the data gives the Red Cross BTS much better flexibility to perform the required data mining. It is impractical to request the practitioners in a hospital to produce different types of statistical information and fine-tune the results for research purposes. In this chapter, the term "data mining" has a broad sense. The data mining conducted by the Red Cross BTS could be anything from a simple count of men with diabetes using Type A blood to sophisticated classification and cluster analysis.

Privacy concerns: Allowing the Red Cross BTS to access the health and blood usage data clearly has a legitimate rationale. However, it also raises concerns about the patients' privacy. In the information collection phase, the public hospitals collect person-specific data from individual patients. In the information sharing phase, the hospitals release the collected information to the Red Cross BTS, who will then conduct data mining and auditing on the shared data. The patients are willing to submit their data to a hospital because the hospital is a trustworthy entity. Yet, the trust in the hospital does not necessarily transfer to a third party. Nowadays, many agencies and institutions consider released data to be privacy-preserving if the explicit identifying information, such as name, social security number, address, and telephone number, is removed. However, substantial research has shown that simply removing the explicit identifying information is often insufficient for privacy protection. Sweeney [Swe02] showed that many patients can be re-identified simply by matching their other attributes, called the quasi-identifier, such as gender, age, and postal code.

How can healthcare institutions share patient-specific information with a third party without compromising the privacy of individual patients? The study of privacy-aware information sharing addresses this problem by anonymizing the released data. Insufficient anonymization leads to privacy threats. Over-anonymization leads to heavy loss of information, which, in turn, lowers the quality of any subsequent data mining. The key challenge is to perform the anonymization so that both the privacy and the usefulness of the information for data mining are preserved in the derived data.

In the most basic form of privacy-aware information sharing, the data publisher (e.g., a hospital in the Red Cross case) has a table of the form

T(Explicit Identifier, Quasi Identifier, Sensitive Attributes, Non-Sensitive Attributes),

where Explicit Identifier is a set of attributes, such as name and social security number (SSN), containing information that explicitly identifies record owners; Quasi Identifier (QID) is a set of attributes that could potentially identify record owners; Sensitive Attributes consist of sensitive patient-specific information such as disease, medical history, and disability status; and Non-Sensitive Attributes contain all attributes that do not fall into the previous three categories [BA03]. The four sets of attributes are disjoint. Most work assumes that each record in the table represents a distinct record owner.

3 Literature Review

What is privacy protection? Dalenius [Dal77] provided a very stringent definition:

Access to the published data should not enable the attacker to learn anything extra about any target victim compared to no access to the database, even with the presence of any attacker's background knowledge obtained from other sources.

Dwork [Dwo06] showed that such absolute privacy protection is impossible due to the presence of the attacker's background knowledge. Let us consider the age of an individual as sensitive information. Assume an attacker knows that Alice's age is 5 years younger than the average age of American women. If the attacker has access to a statistical database that discloses the average age of American women, then Alice's privacy is considered to be compromised according to Dalenius' definition, regardless of whether or not Alice's record is in the database [Dwo06]. Most literature on privacy-aware information sharing, also known as privacy-preserving data publishing [FY10], considers a more relaxed, more practical notion of privacy protection by assuming that the attacker has limited background knowledge. Below, the term "victim" refers to a patient in the Red Cross example targeted by an attacker. We can broadly classify privacy models into categories based on their attack principles; this chapter considers two of them. In general, a privacy threat occurs when an attacker is able to link a record owner to a record in a published data table, or to a sensitive attribute in a published data table. We call these record linkage and attribute linkage, respectively. In both types of linkages, we assume that the attacker knows the QID of the victim.

3.1 Record Linkage

In a record linkage attack, some value qid on QID identifies a small number of records in the released table T, called a group. If the victim's QID matches the value qid, the victim is vulnerable to being linked to the small number of records in the group. In this case, the attacker faces only a small number of possibilities for the victim's record and, with the help of additional supporting knowledge, there is a high chance that the attacker can uniquely distinguish the victim's record within the group.

Example 1. Suppose that a hospital wants to publish the patients' records listed in table 1 to a research center, and attempts to protect the data table against privacy threats simply by removing all identifying attributes. Suppose that the research center has access to the external table, table 2, and knows that every person with a record in table 2 has a record in table 1. Joining the two tables on the common attributes Job, Gender, and Year of Birth may link the identity of a person to his/her sensitive information Diagnosis. For example, Doug, a male lawyer born in 1968, is identified as an HIV patient by qid = <Lawyer, Male, 1968> after the join.
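The join in Example 1 is easy to reproduce in code. Below is a minimal Python sketch of the attack, with the two tables hard-coded from tables 1 and 2; any quasi-identifier value that matches exactly one published record re-identifies its owner (Bob is linked as well as Doug, since <Engineer, Male, 1965> is also unique).

# Table 1 with explicit identifiers removed: (Job, Gender, Year of Birth, Diagnosis)
published = [
    ("Engineer", "Male", 1965, "Hepatitis"),
    ("Engineer", "Male", 1968, "Hepatitis"),
    ("Lawyer", "Male", 1968, "HIV"),
    ("Writer", "Female", 1960, "Flu"),
    ("Writer", "Female", 1960, "HIV"),
    ("Dancer", "Female", 1960, "HIV"),
    ("Dancer", "Female", 1960, "HIV"),
]
# Table 2: publicly known identities with their QID values
external = [
    ("Alice", "Writer", "Female", 1960),
    ("Bob", "Engineer", "Male", 1965),
    ("Doug", "Lawyer", "Male", 1968),
]

for name, job, gender, yob in external:
    matches = [r for r in published if r[:3] == (job, gender, yob)]
    if len(matches) == 1:  # a unique match re-identifies the person
        print(name, "is linked to diagnosis", matches[0][3])
# prints: Bob is linked to diagnosis Hepatitis
#         Doug is linked to diagnosis HIV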

k-Anonymity: To prevent record linkage through QID, Samarati and Sweeney [SS98a, SS98b, Sam01, Swe02] proposed the notion of k-anonymity: if one record in the table has some value qid, at least k-1 other records also have the value qid. In other words, the minimum group size on QID is at least k. A table

Job Gender Year of Birth Diagnosis

Engineer Male 1965 Hepatitis
Engineer Male 1968 Hepatitis
Lawyer Male 1968 HIV
Writer Female 1960 Flu
Writer Female 1960 HIV
Dancer Female 1960 HIV
Dancer Female 1960 HIV

Table 1: Patient Table

Name Job Gender Year of Birth

Alice Writer Female 1960
Bob Engineer Male 1965
Cathy Writer Female 1960
Doug Lawyer Male 1968
Emily Dancer Female 1960
Fred Engineer Male 1968
Gladys Dancer Female 1960
Henry Lawyer Male 1969
Irene Dancer Female 1962

Table 2: External Table

Job Gender Year of Birth Diagnosis

Professional Male [1965-1969] Hepatitis
Professional Male [1965-1969] Hepatitis
Professional Male [1965-1969] HIV
Artist Female [1960-1964] Flu
Artist Female [1960-1964] HIV
Artist Female [1960-1964] HIV
Artist Female [1960-1964] HIV

Table 3: Anonymous Patient Table

Figure 2: Taxonomy Trees

satisfying this requirement is called k-anonymous. In a k-anonymous table, each record is indistinguishable from at least k-1 other records with respect to QID. Consequently, the probability of linking a victim to a specific record through QID is at most 1/k.

Example 2. Table 3 shows a 3-anonymous table obtained by generalizing QID = {Job, Gender, Year of Birth} from table 1 using the taxonomy trees in figure 2. It has two distinct groups on QID, namely <Professional, Male, [1965-1969]> and <Artist, Female, [1960-1964]>. Since each group contains at least 3 records, the table is 3-anonymous. If we link the records in table 2 to the records in table 3 through QID, each record is linked to either no record or at least 3 records in table 3.
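Verifying k-anonymity reduces to computing qid group sizes. The following minimal Python check is run here against the generalized records of table 3 (QID attributes only, hard-coded for illustration):

from collections import Counter

def is_k_anonymous(table, qid_indices, k):
    """Check k-anonymity: every combination of QID values (a qid group)
    must occur in at least k records."""
    groups = Counter(tuple(row[i] for i in qid_indices) for row in table)
    return all(count >= k for count in groups.values())

table3 = [
    ("Professional", "Male", "[1965-1969]"),
    ("Professional", "Male", "[1965-1969]"),
    ("Professional", "Male", "[1965-1969]"),
    ("Artist", "Female", "[1960-1964]"),
    ("Artist", "Female", "[1960-1964]"),
    ("Artist", "Female", "[1960-1964]"),
    ("Artist", "Female", "[1960-1964]"),
]
print(is_k_anonymous(table3, qid_indices=(0, 1, 2), k=3))  # True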

We broadly classify record linkage anonymization algorithms into two families: optimal anonymization and minimal anonymization. Algorithms in both families primarily use generalization and suppression to achieve the k-anonymity privacy model.

The first family finds an optimal k-anonymization, for a given data metric, by limiting itself to full-domain generalization and record suppression. Since the search space of the full-domain generalization scheme is much smaller than that of other schemes, finding an optimal solution is feasible for small data sets.

Sweeney’s MinGen algorithm [Swe02] exhaustively examines all potential full-domain generalizations to identify the optimal generalization, measured in min-imal distortion. Sweeney acknowledged that this exhaustive search is imprac-tical even for the modest sized data sets, motivating the second family of k-anonymization algorithms to be discussed later. Samarati [Sam01] proposeda binary search algorithm that first identifies all minimal generalizations, andthen finds the global optimal generalization. Enumerating all minimal general-izations is an expensive operation and, therefore, not scalable for large data sets,especially if a more flexible anonymization scheme is employed. LeFevre et al.[LR05] presented a suite of optimal bottom-up generalization algorithms, calledIncognito, to generate all possible k-anonymous full-domain generalizations.

Another algorithm, K-Optimize [BA05], effectively prunes non-optimal anonymous tables by modeling the search space with a set enumeration tree, in which each node represents a k-anonymous solution. The algorithm assumes a totally ordered set of attribute values, examines the tree in a top-down manner starting from the most general table, and prunes a node when none of its descendants could be a global optimal solution based on the discernibility metric and the classification metric. Unlike the above algorithms, K-Optimize employs the subtree generalization and record suppression schemes. It is the only efficient optimal algorithm that uses the flexible subtree generalization.

The second family of algorithms produces a minimal k-anonymous table by employing a greedy search guided by a search metric. Being heuristic in nature, these algorithms find a minimally anonymous solution, but are more scalable than the previous family.

µ-argus and Datafly. The µ-argus algorithm [HW96] computes the frequency of all 3-value combinations of domain values, then greedily applies subtree generalizations and cell suppressions to achieve k-anonymity. Since the method limits the size of the attribute combinations, the resulting data may not be k-anonymous when more than 3 attributes are considered. Sweeney's Datafly system was the first k-anonymization algorithm scalable enough to handle real-life large data sets. It achieves k-anonymization by generating an array of qid group sizes and greedily generalizing those combinations with fewer than k occurrences, based on a heuristic search metric that selects the attribute having the largest number of distinct values. Datafly employs the full-domain generalization and record suppression schemes.
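The greedy loop behind Datafly can be sketched in a few lines of Python. The input encoding below (records as dicts and one child-to-parent taxonomy map per attribute) is an illustrative assumption, not the original system's interface, and the final record-suppression step of Datafly is omitted for brevity.

from collections import Counter

def datafly_like(table, qid, parent, k):
    """Greedy full-domain generalization in the spirit of Datafly (a sketch):
    while some qid group has fewer than k records, generalize the QID
    attribute with the most distinct values one level up its taxonomy."""
    table = [dict(row) for row in table]
    while True:
        groups = Counter(tuple(r[a] for a in qid) for r in table)
        if all(size >= k for size in groups.values()):
            return table
        # heuristic: pick the attribute with the most distinct values
        attr = max(qid, key=lambda a: len({r[a] for r in table}))
        changed = False
        for r in table:
            p = parent[attr].get(r[attr], r[attr])
            if p != r[attr]:
                r[attr] = p
                changed = True
        if not changed:  # all values already at the taxonomy root;
            return table  # Datafly would now suppress the residual records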

Iyengar [Iye02] was among the first to aim at preserving classification information in k-anonymous data, employing a genetic algorithm with an incomplete stochastic search based on a classification metric and a subtree generalization scheme. The idea is to encode each state of generalization as a "chromosome" and to encode data distortion in a fitness function. The search process is a genetic evolution that converges to the fittest chromosome. Iyengar's experiments suggested that, by considering the classification purpose, the classifier built from the anonymous data produces a lower classification error than a classifier built from data anonymized using a general-purpose metric. However, the experiments also showed that this genetic algorithm is inefficient for large data sets.

The Top-Down Refinement (TDR) method [FY05, FY07] masks a table by refining it from the most masked state, in which all values are generalized to the most general values of their taxonomy trees. At each step, TDR selects the refinement, according to its search metric, that maximizes the information gain and minimizes the privacy loss. The refinement process terminates when no refinement can be performed without violating k-anonymity. TDR handles both categorical attributes and numerical attributes in a uniform way, except that the taxonomy tree for a numerical attribute is grown on the fly as specializations are searched at each step. Fung et al. [FD08, FH09] further extended the k-anonymization algorithm to preserve the information needed for cluster analysis. The major challenge of anonymizing data for cluster analysis is the lack of class labels that could be used to guide the anonymization process. Their solution is to first partition the original data into clusters, convert the problem into the counterpart problem for classification analysis, in which class labels encode the cluster information in the data, and then apply TDR to preserve k-anonymity and the encoded cluster information. Since the framework of TDR fits the requirements of the Red Cross case well, this chapter studies TDR extensively in that context.

Multidimensional k-anonymity. Let D_Ai be the domain of an attribute Ai. A single-dimensional generalization, such as full-domain generalization or subtree generalization, is defined by a function fi : D_Ai → D' for each attribute Ai in QID. In contrast, a multidimensional generalization is defined by a single function f : D_A1 × ... × D_An → D', which generalizes qid = <v1, ..., vn> to qid' = <u1, ..., un>, where for every vi, either vi = ui or vi is a child node of ui in the taxonomy of Ai. This scheme flexibly allows two qid groups, even those sharing the same value v, to be generalized independently into different parent groups. For example, <Engineer, Male> can be generalized to <Engineer, ANY_Gender> while <Engineer, Female> can be generalized to <Professional, Female>; the generalized table then contains both Engineer and Professional. This scheme produces less distortion than the full-domain and subtree generalization schemes because it generalizes only the qid groups that violate the specified threshold. Note that in this multidimensional scheme, all records in a qid group are generalized to the same qid', whereas cell generalization has no such constraint. Both schemes, however, suffer from the data exploration problem. Nergiz and Clifton [NC07] further evaluated a family of clustering-based algorithms that even attempt to improve data utility by ignoring the restrictions of the given taxonomies.

LeFevre et al. [LR06] presented a greedy top-down specialization algorithm, Mondrian, for finding a minimal k-anonymization under the multidimensional generalization scheme. Both TDR and this algorithm perform a specialization on a value v one at a time. The major difference is that TDR specializes v in all qid groups containing v; in other words, a specialization is performed only if each specialized qid group contains at least k records. In contrast, Mondrian performs a specialization on a single qid group if each of its specialized qid groups contains at least k records. Due to this relaxed constraint, the resulting anonymous data under multidimensional generalization usually has better quality than under single-dimensional generalization. The trade-off is that multidimensional generalization is less scalable than other schemes due to the increased search space.
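The group-at-a-time specialization is easiest to see in a sketch. The following simplified Python fragment partitions top-down in the Mondrian style for numeric QID attributes only, using median splits; the attribute-selection rule and the interval output format are simplifying assumptions of this sketch.

def mondrian_like(records, qid, k):
    """Simplified Mondrian-style multidimensional partitioning: recursively
    split a group as long as both halves keep at least k records, then
    generalize each QID attribute to the range spanned by the final group."""
    def partition(rows):
        for a in qid:  # try attributes in turn (Mondrian picks the widest)
            values = sorted(r[a] for r in rows)
            median = values[len(values) // 2]
            left = [r for r in rows if r[a] < median]
            right = [r for r in rows if r[a] >= median]
            if len(left) >= k and len(right) >= k:
                return partition(left) + partition(right)
        # no allowable split: release the group with interval summaries
        summary = {a: (min(r[a] for r in rows), max(r[a] for r in rows))
                   for a in qid}
        return [dict(r, **summary) for r in rows]
    return partition(records)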

3.2 Attribute Linkage

In an attribute linkage attack, the attacker may not be able to precisely identify the record of the target victim, but can infer his/her sensitive values from the published data T, based on the set of sensitive values associated with the group that the victim belongs to. In case some sensitive values predominate in a group, a successful inference becomes relatively easy even if k-anonymity is satisfied. Several approaches have been proposed to address this type of threat. The general idea is to diminish the correlation between QID attributes and sensitive attributes.

Example 3. From table 1, an attacker can infer that all female dancers born in 1960 have HIV, i.e., <Dancer, Female, 1960> → HIV with 100% confidence. Applying this knowledge to table 2, the attacker can infer that Emily has HIV with 100% confidence, provided that Emily comes from the same population as table 1.

l-diversity. Machanavajjhala et al. [MV06, MV07] proposed the diversity principle, called l-diversity, to prevent attribute linkage. l-diversity requires every qid group to contain at least l "well-represented" sensitive values. There are several instantiations of this principle, which differ in the definition of being well-represented. The simplest is to ensure that there are at least l distinct values of the sensitive attribute in each qid group.
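A check for the distinct-values instantiation is again a small grouping computation; the Python sketch below illustrates it. Table 3, for instance, is 2-diverse: each of its two groups carries two distinct diagnoses.

from collections import defaultdict

def is_distinct_l_diverse(table, qid_indices, sa_index, l):
    """Check the distinct-values instantiation of l-diversity: every qid
    group must contain at least l distinct sensitive values."""
    values_per_group = defaultdict(set)
    for row in table:
        values_per_group[tuple(row[i] for i in qid_indices)].add(row[sa_index])
    return all(len(vals) >= l for vals in values_per_group.values())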

Confidence Bounding. Wang et al. [WY05, WY07] considered bounding the confidence of inferring a sensitive value from a qid group by specifying one or more privacy templates of the form <QID → s, h>, where s is a sensitive value, QID is a quasi-identifier, and h is a maximum confidence threshold. Let Conf(QID → s) be max{conf(qid → s)} over all qid groups on QID, where conf(qid → s) denotes the percentage of records containing s in the qid group. A table satisfies <QID → s, h> if Conf(QID → s) ≤ h. In other words, a privacy template bounds the attacker's confidence of inferring the sensitive value s in any group on QID by the maximum h. For example, with QID = {Job, Gender, Year of Birth}, <QID → HIV, 10%> states that the confidence of inferring HIV from any group on QID must be no more than 10%. For the data in table 3, this privacy template is violated because the confidence of inferring HIV is 75% in the group <Artist, Female, [1960-1964]>.
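Computing Conf(QID → s) is a two-counter pass over the table, as the Python sketch below shows; for table 3 and s = HIV it returns 0.75, confirming the violation of <QID → HIV, 10%> noted above.

from collections import Counter

def max_confidence(table, qid_indices, sa_index, s):
    """Compute Conf(QID -> s): the maximum, over all qid groups, of the
    fraction of records in the group carrying the sensitive value s."""
    group_size, group_s = Counter(), Counter()
    for row in table:
        qid = tuple(row[i] for i in qid_indices)
        group_size[qid] += 1
        if row[sa_index] == s:
            group_s[qid] += 1
    return max(group_s[q] / group_size[q] for q in group_size)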

t-closeness. Li et al. [LV07] observed that when the overall distribution of a sensitive attribute is skewed, l-diversity does not prevent attribute linkage attacks. Consider a patient table in which 95% of the records have Flu and 5% have HIV. Suppose that a qid group is 50% Flu and 50% HIV; it therefore satisfies 2-diversity. However, this group presents a serious privacy threat because any record owner in the group could be inferred as having HIV with 50% confidence, compared to 5% in the overall table. To prevent such skewness attacks, Li et al. [LV07] proposed a privacy model, called t-closeness, which requires the distribution of a sensitive attribute in any group on QID to be close to the distribution of the attribute in the overall table. t-closeness uses the Earth Mover Distance function to measure the closeness between two distributions of sensitive values, and requires the closeness to be within t.
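A t-closeness check can be sketched compactly for a categorical sensitive attribute: assuming a uniform ground distance between values, the Earth Mover Distance reduces to half the L1 distance between the two distributions, which is the simplification used below.

from collections import Counter

def t_closeness_violations(table, qid_indices, sa_index, t):
    """Return the qid groups whose sensitive-value distribution is farther
    than t from the overall distribution (uniform-ground-distance EMD)."""
    overall = Counter(row[sa_index] for row in table)
    n = len(table)
    groups = {}
    for row in table:
        groups.setdefault(tuple(row[i] for i in qid_indices), []).append(row[sa_index])
    violations = []
    for qid, values in groups.items():
        local = Counter(values)
        dist = 0.5 * sum(abs(local[v] / len(values) - overall[v] / n)
                         for v in overall)
        if dist > t:
            violations.append((qid, dist))
    return violations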

4 Privacy-Aware Information Sharing for Classification Analysis

Most work in section 3 focuses on achieving a privacy requirement. Another requirement, however, is making the released data useful for data mining, which is the primary purpose of sharing the information. For example, in the data flow of the Red Cross blood transfusion system depicted in figure 1, the Hospital Authority has to share patient-specific data with the Red Cross for classification analysis. Can both the privacy and data mining goals be achieved at the same time? Our insight is that these two goals in fact concern two different types of information. The privacy goal requires masking sensitive information, usually specific descriptions that identify individuals, whereas the classification goal requires extracting general structures that capture trends and patterns. Both goals can be achieved by carefully performing generalization and suppression on selected data. This insight is supported by extensive experimental results on real-life datasets [FY05, FY07]. This section summarizes the essential property and presents a high-level anonymization algorithm for achieving both privacy and data mining goals.

Figure 3: The Framework for Privacy-Aware Cluster Analysis

4.1 The Problem: k-Anonymity for Classification Analysis

To prevent record linkage, the Hospital Authority wants to k-anonymize the patient data before sharing it with the Red Cross for data mining and report generation. Sharing the patient data in the Red Cross case poses several challenges to the traditional anonymization algorithms [SS98a, SS98b, Sam01, Swe02, LR05].

• The Hong Kong Red Cross wants to use the patient data as training data for building a classification model, and then use the model to classify future cases. Most traditional anonymization methods aim at minimizing the distortion of the data. These methods may preserve some basic count statistics well, but experiments have shown that achieving minimal distortion does not imply preserving the data quality for classification analysis [FY05, FY07].

• The patient data contains both categorical and numerical attributes. Some categorical attributes come with a taxonomy tree, while many do not. Some anonymization methods suggest pre-discretizing numerical attributes into intervals; however, this approach does not take classification into account. We need an anonymization algorithm that can mask all three types of attributes while preserving a certain classification quality.

• The number of patient records can be huge. The anonymization algorithm must be efficient and scalable enough to perform well on large datasets.

The problem of k-anonymity for classification analysis in the Red Cross case can be formally described as follows: given a raw data table

T(QID, Sensitive Attributes, Class Attribute),

a specified k-anonymity requirement, and an optional taxonomy tree for each categorical attribute in QID, the Hospital Authority wants to determine a masked version of T, denoted by T*. In the modified and privacy-protected table T*, records are made k-anonymous while the structures essential for classification are preserved; that is, the masked table remains useful for classifying the Class Attribute. It is therefore important to extract certain patterns from T, e.g., the most dominant attributes (which are not necessarily sensitive ones), and to preserve them in T* so as to retain as much information as possible for analysis, while ensuring that no privacy violation occurs when generating T* from the raw table T. The Sensitive Attributes should be important for the task of classification analysis; otherwise, they should be removed.

We discuss a particular k-anonymization method, called Top-Down Refinement (TDR) [FY05, FY07], because TDR not only addresses the problem of k-anonymity for classification analysis and all the aforementioned challenges in the Red Cross case, but also extends to the problem of k-anonymity for cluster analysis, which will be discussed in section 5.

4.2 Masking Operations

To transform T to satisfy the k-anonymity requirement, we consider three types of masking operations on the attributes Dj in QID, listed below; a small Python sketch of all three operations follows the list.

Figure 4: Taxonomy Trees indicating a cut

1. Generalize Dj if Dj is a categorical attribute with a taxonomy tree. A leaf node represents a domain value and a parent node represents a less specific value. A generalized Dj can be viewed as a "cut" through its taxonomy tree. A cut of a tree is a subset of values in the tree, denoted by Cutj, which contains exactly one value on each root-to-leaf path. For example, in figure 4, the cut indicated by the dashed line represents the generalized values of the 3-anonymous table from table 3.

2. Suppress Dj if Dj is a categorical attribute with no taxonomy tree. The suppression of a value on Dj means replacing all occurrences of the value with the special value ⊥j. All suppressed values on Dj are represented by the same ⊥j, which is treated as a new value in Dj by a classification algorithm. We use Supj to denote the set of values suppressed by ⊥j. This type of suppression is at the value level, in that Supj is, in general, a subset of the values in the attribute Dj.

3. Discretize Dj if Dj is a continuous attribute. The discretization of a value v on Dj means replacing all occurrences of v with an interval containing the value. Our algorithm dynamically grows a taxonomy tree of intervals at runtime, where each node represents an interval and each non-leaf node has two child nodes representing some "optimal" binary split of the parent interval. A discretized Dj can be represented by the set of intervals, denoted by Intj, corresponding to the leaf nodes in the dynamically grown taxonomy tree of Dj.
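The following Python fragment is a minimal sketch of the three operations applied to a single value. The function signature and input encodings (a cut per attribute, a suppressed-value set, a sorted interval list, and a child-to-parent taxonomy map) are illustrative assumptions, not part of the published method.

def mask_value(value, attr, cuts, suppressed, intervals, parent):
    """Apply one masking operation to a single value. `cuts` maps an
    attribute to its current taxonomy cut, `suppressed` to its Sup_j set,
    `intervals` to a sorted list of (lo, hi) pairs, and `parent` to a
    child-to-parent taxonomy map."""
    if attr in cuts:  # 1. generalize up the taxonomy until a cut value is hit
        v = value
        while v not in cuts[attr]:
            v = parent[attr][v]
        return v
    if attr in suppressed:  # 2. value-level suppression to the symbol ⊥
        return "⊥" if value in suppressed[attr] else value
    lo, hi = next((lo, hi) for lo, hi in intervals[attr] if lo <= value < hi)
    return f"[{lo}-{hi})"  # 3. discretization into the covering interval

cuts = {"Job": {"Professional", "Artist"}}
parent = {"Job": {"Engineer": "Professional", "Lawyer": "Professional",
                  "Writer": "Artist", "Dancer": "Artist"}}
print(mask_value("Engineer", "Job", cuts, {}, {}, parent))  # Professional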

4.3 The Algorithm: Top-Down Refinement (TDR)

A table T can be masked by a sequence of refinements starting from the most masked state, in which each attribute is either generalized to its topmost value, suppressed to the special value ⊥, or represented by a single interval. Our method iteratively refines a masked value selected from the current set of cuts, suppressed values, and intervals, until no further refinement can be performed without violating the anonymity requirement. Each refinement increases the information and decreases the anonymity, since records with more specific values are more distinguishable. The key is to select the "best" refinement at each step, taking both impacts into account. Below, we formally describe the notion of refinement for the different types of attributes Dj in QID and define a selection criterion for a single refinement.

1. Refinement for Generalization: Consider a categorical attribute Dj with a user-specified taxonomy tree. Let child(v) be the set of child values of v in the taxonomy tree. A refinement, written v → child(v), replaces the parent value v with the child value in child(v) that generalizes the domain value in each (generalized) record containing v.

2. Refinement for Suppression: For a categorical attribute Dj without a taxonomy tree, a refinement ⊥j → {v, ⊥j} refers to disclosing one value v from the set of suppressed values Supj. Let R⊥j denote the set of suppressed records that currently contain ⊥j. Disclosing v means replacing ⊥j with v in all records in R⊥j that originally contain v.

3. Refinement for Discretization: For a continuous attribute, refinement is similar to that for generalization, except that no prior taxonomy tree is given and the taxonomy tree has to be grown dynamically during the refinement process. Initially, the interval covering the full range of the attribute forms the root. The refinement of an interval v, written v → child(v), refers to the optimal split of v into two child intervals child(v) that maximizes the information gain; anonymity is not used in finding a split good for classification. This is similar to defining a taxonomy tree, where the main consideration is how well the taxonomy describes the application. Due to this extra step of identifying the optimal split of the parent interval, we treat continuous attributes separately from categorical attributes with taxonomy trees. (A sketch of this optimal binary split follows below.)
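The "optimal" binary split in item 3 can be found by scanning the sorted values and maximizing the information gain of the induced two-way partition of the class labels. A minimal Python sketch (the function names are ours):

from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Find the binary split of a continuous attribute that maximizes the
    information gain with respect to the class labels (a sketch of how the
    interval taxonomy could be grown at runtime)."""
    pairs = sorted(zip(values, labels))
    n, base = len(pairs), entropy(labels)
    best_gain, best_threshold = -1.0, None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # only split between distinct attribute values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        gain = base - (i / n) * entropy(left) - ((n - i) / n) * entropy(right)
        if gain > best_gain:
            best_gain, best_threshold = gain, pairs[i][0]
    return best_threshold, best_gain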

A refinement is valid (with respect to T) if T satisfies the anonymity requirement after the refinement. A refinement is beneficial (with respect to T) if more than one class is involved in the refined records. A refinement is performed only if it is both valid and beneficial. Therefore, a refinement guarantees that every newly generated qid satisfies a(qid) ≥ k, where a(qid) denotes the number of records containing qid.

We propose a selection criterion for guiding our TDR process to heuristically maximize the classification goal. Consider a refinement v → child(v), where v ∈ Dj, and Dj is a categorical attribute with a user-specified taxonomy tree or a continuous attribute with a dynamically grown taxonomy tree. The refinement has two effects: it increases the information of the refined records with respect to classification, and it decreases the anonymity of the refined records with respect to privacy. These effects are measured by the "information gain", denoted by InfoGain(v), and the "anonymity loss", denoted by AnonyLoss(v). A value v is a good candidate for refinement if InfoGain(v) is large and AnonyLoss(v) is small. Our selection criterion is to choose, for the next refinement, the candidate v that has the maximum information-gain/anonymity-loss trade-off, defined as

Score(v) = InfoGain(v) / AnonyLoss(v).

Algorithm 1 Top-Down Refinement (TDR)

Require: Raw data table T

1: Initialize every value of Dj to the topmost value, suppress every value of Dj to ⊥j, or include every continuous value of Dj in the full-range interval, where Dj ∈ QID.
2: Initialize Cutj of Dj to include the topmost value, Supj of Dj to include all domain values of Dj, and Intj of Dj to include the full-range interval, where Dj ∈ QID.
3: while there exists a valid and beneficial candidate x in <Cutj, Supj, Intj> do
4:    Find the refinement Best from <Cutj, Supj, Intj> with the highest Score.
5:    Perform Best on T and update <Cutj, Supj, Intj>.
6:    Update Score(x) and the validity of every x in <Cutj, Supj, Intj>.
7: end while
8: return k-anonymous T* and the solution set <Cutj, Supj, Intj>

For InfoGain(v), we employ Shannon's information theory to measure the information gain of a refinement on v with respect to the Class attribute. AnonyLoss(v) is the decrease in the anonymity of QID caused by the refinement on v.
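To make these quantities concrete, the Python sketch below computes a from-scratch version of InfoGain, AnonyLoss, and Score for a candidate refinement, given the table before and after applying it. The published algorithm maintains these quantities incrementally rather than recomputing them, and the +1 in the denominator is our guard against division by zero, not part of the formula above.

from collections import Counter
from math import log2

def weighted_class_entropy(table, qid, class_attr):
    """Expected Shannon entropy of the Class attribute over all qid groups."""
    groups = {}
    for r in table:
        groups.setdefault(tuple(r[a] for a in qid), []).append(r[class_attr])
    n = len(table)
    total = 0.0
    for labels in groups.values():
        m = len(labels)
        h = -sum((c / m) * log2(c / m) for c in Counter(labels).values())
        total += (m / n) * h
    return total

def anonymity(table, qid):
    """A(QID): the size of the smallest qid group."""
    return min(Counter(tuple(r[a] for a in qid) for r in table).values())

def score(before, after, qid, class_attr):
    """Information-gain/anonymity-loss trade-off of one refinement."""
    info_gain = weighted_class_entropy(before, qid, class_attr) \
              - weighted_class_entropy(after, qid, class_attr)
    anony_loss = anonymity(before, qid) - anonymity(after, qid)
    return info_gain / (anony_loss + 1)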

We now present a high-level description of the TDR algorithm. In a preprocessing step, we compress the given table T by removing all attributes not in QID and collapsing duplicates into a single row, with the Class column storing the class frequency. The compressed table is typically much smaller than the original table; below, the term "data records" refers to data records in this compressed form. Algorithm 1 summarizes the conceptual algorithm. Initially, Cutj contains only the topmost value for a categorical attribute Dj with a taxonomy tree, Supj contains all domain values of a categorical attribute Dj without a taxonomy tree, and Intj contains the full-range interval for a continuous attribute Dj. The valid and beneficial refinements in <Cutj, Supj, Intj> form the set of candidates. At each iteration, we find the candidate with the highest Score, denoted by Best (Line 4), apply Best to T and update <Cutj, Supj, Intj> (Line 5), and update the Score and validity of the candidates in <Cutj, Supj, Intj> (Line 6). The algorithm terminates when there is no more candidate in <Cutj, Supj, Intj>, in which case it returns the masked table together with the solution set <Cutj, Supj, Intj>.

4.4 The HL7-Compliant Data Structure for the Blood Usage Record

The HL7 data hierarchy of the blood usage record is designed to conform to HL7 schema version 3, a standard for electronic data interchange (EDI) in healthcare organizations. The importance of HL7 lies in its unifying, well-defined structures for maintaining Electronic Health Records (EHR), both within domain boundaries and from a wider perspective. Any electronic document or record needs to follow the HL7 standard, and the current HL7 version (version 3) supports an XML platform representation. Therefore, the data fields of the blood usage record are transformed into the HL7-compliant XML document structure shown in (ToDo Figure 4). Each element of the hierarchy is represented by the character "e" inside a square box in that figure. To compare our data hierarchy of the blood usage record with the data fields in table 4, we match the elements of the hierarchy against the attributes listed in table 4, with the content between brackets matching the attribute names.

Field Name | Data Type | Description

LIS Hospital (Identifiable) | Char(3) | From LIS
Issue Date Time | DateTime | From LIS
Issue Date ID | Int | Issue Date ID of blood product
No of Unit | Int | From LIS
Blood Group | Char(6) | From LIS
Product | Char(11) | From LIS
LIS Case No | Char(12) | From LIS
Patient Name | Char(48) | From LIS
Reference Key | Int | Non-HKID identifier for patient, from Data Warehouse
Sex | Char(1) | From LIS
Age | Int | From LIS
Age Unit | Char(6) | From LIS
DOB | DateTime | From LIS
Patient Blood Group | Char(6) | From LIS
IP Hospital (Identifiable) | Char(3) | Hospital code of the IP episode for the patient at "Issue Date Time", from Data Warehouse
IP Case No | Char(12) | HN number of the IP episode for the patient at "Issue Date Time", from Data Warehouse
Specialty (EIS) | Char(7) | EIS specialty code of the IP episode for the patient at "Issue Date Time", from Data Warehouse
Pre Issue Hb | Decimal(11,4) | For study on a) red cells
Pre Issue Platelet Count | Decimal(11,4) | For study on b) platelet concentrates
Pre Issue PT | Decimal(11,4) | For study on c) fresh frozen plasma
Pre Issue INR | Decimal(11,4) | For study on c) fresh frozen plasma
Pre Issue APTT | Decimal(11,4) | For study on c) fresh frozen plasma
Pre Issue D-dimer | Decimal(11,4) | For study on c) fresh frozen plasma
Pre Issue Fibrinogen | Decimal(11,4) | For study on c) fresh frozen plasma; non-standardized test, pending entity ID(s) for searching
Dx1 ... Dx15 | Char(7) | Diagnosis codes (in ICD9CM) of the IP episode for the patient at "Issue Date Time", from Data Warehouse
OT Reference | Char(12) | OT Reference number of the IP episode for the patient at "Issue Date Time", from OT records interfaced from CMS into Data Warehouse; note that one HN may have multiple OT records
OT Reference Date Time | DateTime | See Algorithm 2
OT Nature | Char(1) | C - Elective, E - Emergency; from OT records interfaced from CMS into Data Warehouse
OT Magnitude Description | Char(16) | OT magnitude input by surgeon, from OT records interfaced from CMS into Data Warehouse
Blood Loss (ml) | Int | Blood loss in ml, from OT records interfaced from CMS into Data Warehouse
OT Related Px1 ... Px15 | Char(7) | OT-related procedure codes in ICD9CM, from OT records interfaced from CMS into Data Warehouse

Table 4: Data Field Description

Algorithm 2 OT Reference Date Time

Require: OT records interfaced from CMS into Data Warehouse

1: if the OT Record has a non-empty OT Start Date Time then
2:    use OT Start Date Time
3: else if OT Record Creation Date Time = OT Date then
4:    use OT Record Creation Date Time
5: else
6:    use OT Date (Time = 00:00)
7: end if

The HL7 data hierarchy consists of two main sections: a head and a body section. The head section contains the information related to patient identification, physicians, and the hospital, while the body section contains the information about the patient's medical history, laboratory results, and physical examination. For example, in (ToDo Figure 4(a)), the elements "title" and "patient" of the head section can be matched with the data fields "Product" and "Patient Name", respectively. In the body section (Figure 4(b) and 4(c)), each element "entry" carries a particular laboratory result or measurement value and can be matched with a data field from table 4, e.g., the element "entry" with code = (Blood Group) matches the "Blood Group" attribute in table 4. Using the HL7-compliant XML document structure provides the flexibility of simple interchange, updating, and analysis of the data, which improves the overall performance of blood usage record management. Referring to figure 1, the data in both the "Blood Donor Data and Blood Information" and the "Patient Health Data and Blood Usage" databases will be stored in this HL7-compliant format.
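As a rough illustration of what such a transformation could look like, the Python sketch below packages a few fields of one blood usage record into a head/body XML document. The element names ("document", "head", "entry", and so on) are illustrative placeholders for this sketch, not the normative HL7 v3 schema.

import xml.etree.ElementTree as ET

def blood_usage_document(record):
    """Build a minimal head/body XML document from one blood usage record
    (field names taken from Table 4)."""
    doc = ET.Element("document")
    head = ET.SubElement(doc, "head")
    ET.SubElement(head, "title").text = record["Product"]
    ET.SubElement(head, "patient").text = record["Patient Name"]
    body = ET.SubElement(doc, "body")
    for field in ("Blood Group", "Pre Issue Hb", "Blood Loss (ml)"):
        entry = ET.SubElement(body, "entry", code=field)
        entry.text = str(record[field])
    return ET.tostring(doc, encoding="unicode")

print(blood_usage_document({
    "Product": "Red Cells", "Patient Name": "REDACTED",
    "Blood Group": "A+", "Pre Issue Hb": 7.2, "Blood Loss (ml)": 350,
}))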

5 Privacy-Aware Information Sharing for Cluster Analysis

Section 4 considered performing classification analysis on anonymous data. Another data mining task to support is cluster analysis, used to obtain better insight into the common characteristics of blood transfusion cases. The major challenge of masking data for cluster analysis is the lack of class labels that could be used to guide the masking process. In this section we convert this problem into the counterpart problem of classification analysis, wherein class labels encode the cluster structure of the data, and present a framework to evaluate the cluster quality of the masked data.

5.1 The Problem: k-anonymity for Cluster Analysis

The data holder wants to transform a raw patient data table T into a masked table T* for the purpose of cluster analysis, whose goal is to group similar objects together (i.e., into the same cluster) and to separate dissimilar objects (i.e., into different clusters).

What kind of information should be preserved for cluster analysis? Unlike classification analysis, wherein the information utility of attributes can be measured by their power of identifying class labels [BA05, FY05, FY07, Iye02, LR06], no class labels are available for cluster analysis. One natural approach is to preserve the cluster structure in the raw data. Any loss of structure due to the anonymization is measured relative to this "raw cluster structure".

The problem of k-anonymity for cluster analysis in the Red Cross case can be formally described as follows: given a raw data table

T(QID, Sensitive Attributes),

a k-anonymity requirement, and an optional taxonomy tree for each categorical attribute in QID, the Hospital Authority wants to determine a masked version of T, denoted by T*, such that QID is k-anonymous and the masked table T* has a cluster structure as similar as possible to the cluster structure in the raw table T. The Sensitive Attributes should be important for the task of cluster analysis; otherwise, they should be removed from the set.

5.2 The Solution Framework

In this section, we present a framework showing the steps of the algorithm for generating a masked table T*, represented by a solution set <Cutj, Supj, Intj>, that satisfies a given k-anonymity requirement and preserves as much of the raw cluster structure as possible. Figure 3 provides an overview of our proposed framework. First, we generate the cluster structure in the raw table T and label each record in T with a class label. This labelled table, denoted by Tl, has a Class attribute that contains a class label for each record. Essentially, preserving the raw cluster structure amounts to preserving the power of identifying these class labels during masking, so masking operations that diminish the difference among records belonging to different clusters (classes) are penalized. Since this requirement is similar to the anonymity problem for classification analysis, we can apply TDR, the anonymization algorithm for classification analysis, to achieve the requested degree of anonymity. We explain each step of figure 3 below; a small sketch of the clustering and evaluation steps follows the list.

1. Convert T to a labelled table Tl. Apply a clustering algorithm to T to identify the raw cluster structure, and label each record in T with its class label. The resulting labelled table Tl has a Class attribute containing the labels.

2. Mask the labelled table Tl. Employ an anonymization algorithm for classification analysis, such as TDR, to mask Tl. The masked table Tl* satisfies the given k-anonymity requirement.

3. Cluster the masked table Tl*. Remove the labels from the masked Tl* and then apply a clustering algorithm to it, where the number of clusters is the same as in Step 1. By default, the clustering algorithm in this step is the same as the clustering algorithm in Step 1, but it can be replaced with any other choice requested by the Red Cross.

4. Evaluate the masked table Tl*. Compute the similarity between the cluster structure found in Step 3 and the raw cluster structure found in Step 1. The similarity measures the loss of cluster quality due to masking. If the evaluation is unsatisfactory, the Hospital Authority may repeat Steps 1-4 with different specifications of the taxonomy trees, choices of clustering algorithms, masking operations, numbers of clusters, and anonymity thresholds, if possible.

5. Release the masked table Tl*. If the evaluation in Step 4 is satisfactory, the Hospital Authority can release the masked Tl* together with some optional supplementary information: all taxonomy trees (including those generated at runtime for continuous attributes), the solution set, the similarity score computed in Step 4, and the class labels generated in Step 1.
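Steps 1, 3, and 4 can be prototyped with standard tooling. The sketch below uses scikit-learn's k-means and the adjusted Rand index as stand-ins for the clustering algorithm and the similarity measure, both of which the framework leaves open; the masking itself (Step 2, TDR) happens outside this function.

from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def evaluate_masking(raw_features, masked_features, n_clusters, threshold=0.8):
    """Cluster the raw and the masked data, then compare the two cluster
    structures; release only if the similarity is satisfactory."""
    # Step 1: raw cluster structure (these labels would form the Class
    # attribute of the labelled table Tl fed to TDR in Step 2)
    raw_labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(raw_features)
    # Step 3: re-cluster the masked data with the same number of clusters
    masked_labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(masked_features)
    # Step 4: similarity between the two structures
    similarity = adjusted_rand_score(raw_labels, masked_labels)
    return similarity, similarity >= threshold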

6 Conclusions and Future Research Directions

This chapter has presented the major privacy issues arising when sensitive information is shared in the context of healthcare applications and services. The strong need to protect individual privacy during data mining activities must be addressed by different healthcare stakeholders (e.g., the Hong Kong Red Cross Blood Transfusion Service) on arbitrary kinds of client-related datasets, in our discussion Health Level 7 (HL7)-compliant ones. In general, this is done by extracting the valuable structures and patterns needed to perform a given activity, e.g., classification analysis, over the medical datasets. We described some well-known methods suitable for securing person-specific data and used an iterative method called Top-Down Refinement (TDR) to show how medical datasets can be published in a way that protects individual privacy.

An approach in the area of privacy-aware information processing that accompanies the TDR method is k-anonymity, which gives a measure of privacy protection. This chapter dealt with the problem of producing k-anonymized data for classification analysis and, additionally, described a framework for solving the corresponding problem for cluster analysis. Future work will include further improvements of the discussed TDR method to cover other privacy-preserving models, such as l-diversity and t-closeness, as discussed in the literature review. Ongoing effort will also be put into designing methods that directly apply privacy-protection algorithms to the data represented in healthcare documents, which in the illustrative case of the Hong Kong BTS are compliant with HL7 version 3. During our past cooperation with the Hong Kong Red Cross BTS we were able to verify the applicability of our approach, and future efforts will be directed towards fully meeting the needs of privacy preservation in such a real-world scenario.

References

[BA03] L. Burnett, K. Barlow-Stewart, A. Pros, and H. Aizenberg. The gene trustee: A universal identification system that ensures privacy and confidentiality for human genetic databases. Journal of Law and Medicine, 10:506-513, 2003.

[BA05] R. J. Bayardo and R. Agrawal. Data privacy through optimal k-anonymization. In 21st IEEE International Conference on Data Engineering (ICDE), pages 217-228, Tokyo, Japan, 2005.

[Dal77] T. Dalenius. Towards a methodology for statistical disclosure control. Statistik Tidskrift, 15:429-444, 1977.

[Dwo06] C. Dwork. Differential privacy. In 33rd International Colloquium on Automata, Languages and Programming (ICALP), pages 1-12, Venice, Italy, 2006.

[FD08] B. C. M. Fung, K. Wang, L. Wang, and M. Debbabi. A framework for privacy-preserving cluster analysis. In 2008 IEEE International Conference on Intelligence and Security Informatics (ISI), pages 46-51, Taipei, Taiwan, 2008.

[FH09] B. C. M. Fung, K. Wang, L. Wang, and P. C. K. Hung. Privacy-preserving data publishing for cluster analysis. Data and Knowledge Engineering (DKE), 2009.

[FY05] B. C. M. Fung, K. Wang, and P. S. Yu. Top-down specialization for information and privacy preservation. In 21st IEEE International Conference on Data Engineering (ICDE), pages 205-216, Tokyo, Japan, 2005.

[FY07] B. C. M. Fung, K. Wang, and P. S. Yu. Anonymizing classification data for privacy preservation. IEEE Transactions on Knowledge and Data Engineering (TKDE), 19:711-725, 2007.

[FY10] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu. Privacy-preserving data publishing: A survey on recent developments. ACM Computing Surveys, 2010.

[HW96] A. Hundepool and L. Willenborg. µ- and τ-argus: Software for statistical disclosure control. In 3rd International Seminar on Statistical Confidentiality, Bled, Slovenia, 1996.

[Iye02] V. S. Iyengar. Transforming data to satisfy privacy constraints. In 8th ACM SIGKDD, pages 279-288, Edmonton, AB, Canada, 2002.

[LR05] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Incognito: Efficient full-domain k-anonymity. In ACM SIGMOD, pages 49-60, Baltimore, MD, USA, 2005.

[LR06] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Workload-aware anonymization. In 12th ACM SIGKDD, Philadelphia, PA, USA, 2006.

[LV07] N. Li, T. Li, and S. Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. In 23rd IEEE International Conference on Data Engineering (ICDE), Istanbul, Turkey, 2007.

[MV06] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. In 22nd IEEE International Conference on Data Engineering (ICDE), Atlanta, GA, USA, 2006.

[MV07] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. ACM TKDD, 1, 2007.

[NC07] M. E. Nergiz and C. Clifton. Thoughts on k-anonymization. Data and Knowledge Engineering, 63:622-645, 2007.

[Sam01] P. Samarati. Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and Data Engineering (TKDE), 13(6):1010-1027, 2001.

[SS98a] P. Samarati and L. Sweeney. Generalizing data to provide anonymity when disclosing information. In 17th ACM SIGACT-SIGMOD-SIGART PODS, page 188, Seattle, WA, USA, 1998.

[SS98b] P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical report, SRI International, 1998.

[Swe02] L. Sweeney. k-anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5):557-570, 2002.

[WY05] K. Wang, B. C. M. Fung, and P. S. Yu. Template-based privacy preservation in classification problems. In 5th IEEE International Conference on Data Mining (ICDM), pages 466-473, 2005.

[WY07] K. Wang, B. C. M. Fung, and P. S. Yu. Handicapping attacker's confidence: An alternative to k-anonymization. Knowledge and Information Systems (KAIS), 11(3):345-368, 2007.