Computer Methods and Programs in Biomedicine 104 (2011) 249–253
journal homepage: www.intl.elsevierhealth.com/journals/cmpb
Privacy-preserving models for comparing survival curves using the logrank test
Tingting Chen, Sheng Zhong ∗
Computer Science and Engineering Department, State University of New York at Buffalo, Amherst, NY 14260, USA
Article info
Article history:
Received 16 June 2010
Received in revised form
9 April 2011
Accepted 24 April 2011
Keywords:
Survival curves
Privacy preservation
Logrank test
Abstract
The incorporation of electronic health care in medical institutions will benefit and thus further boost the collaborations in medical research among clinics and research institutions. However, privacy regulations and security concerns make such collaborations very restricted. In this paper, we propose privacy preserving models for survival curves comparison based on the logrank test, in order to perform better survival analysis through the collaboration of multiple medical institutions and protect the data privacy. We distinguish two collaboration scenarios and for each scenario we present a privacy preserving model for the logrank test. We conduct experiments on real medical data to evaluate the effectiveness of our proposed models.
© 2011 Elsevier Ireland Ltd. All rights reserved.
∗ Corresponding author. Tel.: +1 716 645 4752; fax: +1 716 645 3464.
E-mail addresses: [email protected] (T. Chen), [email protected] (S. Zhong).
0169-2607/$ – see front matter © 2011 Elsevier Ireland Ltd. All rights reserved.
doi:10.1016/j.cmpb.2011.04.004
1. Introduction
With the development of information technology, there is an increasing need to incorporate electronic health records (EHRs) in medical institutions [1]. The availability of EHRs is believed to improve the efficiency and quality of the health care that patients receive. Moreover, by using EHRs instead of paper-based records, hospitals can store and manage more health care data than ever before. Consequently, it will benefit the development of more advanced clinical computer-based tools that help diagnosis and research. In particular, if multiple medical institutions can integrate their electronically stored health care data, this substantial amount of data allows better models with higher accuracy to be built to assist clinical treatment and medical research.
Survival analysis [2] is an important statistical tool often used in clinical trials to assess benefit and risk. With the collaboration of multiple medical institutions, researchers or doctors can build better survival analysis models, especially survival function comparison models. Here we illustrate two different scenarios as examples. In the first scenario, a hospital gives a new radiotherapy treatment to a group of pancreatic cancer patients. The doctors in the hospital can observe the survival events (death or cancer recurrence) of these patients and draw a survival curve for this new treatment. They want to compare this survival curve with those of other treatments to justify its effectiveness and advantage. Luckily, a medical research institution holds survival data of other treatment trials for pancreatic cancer with trial participants of similar background. Clearly, the collaborative data exchange between the hospital and the research institution will be beneficial for the result comparison. In the second scenario, three institutions are all studying the performance of a new medicine for stroke on patients of different ages. They want to build and compare the survival curves for different age intervals. However, the trial participants in any one of the three institutions are not sufficient to obtain results with high accuracy. If they can conduct survival curves comparison based on the trial participants from all of them, it will significantly increase the result accuracy.
However, sharing medical data is well known to be restricted because of privacy and security concerns. According to the privacy rules of the Health Insurance Portability and Accountability Act (HIPAA) [3], the privacy of patients must be protected, and it is illegal for research institutions and hospitals to distribute patients' medical data without appropriate privacy preservation. On the other hand, medical researchers are reluctant to share their data with others even if it is already anonymized, out of concern that their data could be misused or misinterpreted. For instance, neuroscientists at Dartmouth College found it difficult to encourage the sharing of brain imaging data [4]. In the two scenarios above, privacy concerns also exist and impede collaboration between medical institutions. Therefore, we need to develop new models for survival curves comparison that can protect the privacy of patients and relieve the data security concerns of the researchers or doctors.
In this paper, we propose novel privacy preserving models for the logrank test, which is a standard comparison test of survival curves. In particular, for each of the two collaboration scenarios we mentioned above, we design one privacy-preserving logrank test model. In the rest of this paper, we call the first scenario group partition, meaning each institution holds a survival curve for an entire group of participants. We call the second scenario sample partition, meaning each institution holds the survival data of some (but not all) participants in each group. Our goal is that, for each of the collaboration scenarios, our proposed logrank test model can learn the comparison result of survival curves built on the data from all medical institutions, without looking at the original survival data from other medical institutions. We utilize a cryptographic tool, secure sum [5], in our models. In this way, the privacy of medical data is protected. As far as we know, this is the first work on building privacy preserving models for survival curves comparison using the logrank test. We perform experiments on real medical data to show the effectiveness of our proposed models.
2. Methods
In this section, we first review the logrank test for comparing survival curves. Then we describe our two privacy preserving models for the logrank test. The first model enables the privacy preserving comparison of survival curves in the group partition scenario. Then we present the second privacy preserving model, which preserves the privacy in comparing the survival curves in the sample partition scenario.
2.1. Overview of logrank test
Suppose we have $n$ groups of individuals. The logrank test [2] is a statistical hypothesis test, where the hypothesis is that the $n$ groups have the same survival distribution, i.e., for each group the probability of the event (e.g., death) occurring at each time point is the same. In particular, we divide the time into $m$ intervals. Let $n_{kj}$ be the number of individuals that are alive in group $k$ at the beginning of time interval $j$. Let $d_{kj}$ be the number of events occurring in group $k$ in interval $j$. $n_j$ and $d_j$ are defined as Eq. (1) and Eq. (2) respectively.
\[ n_j = \sum_{k=1}^{n} n_{kj} \tag{1} \]
\[ d_j = \sum_{k=1}^{n} d_{kj} \tag{2} \]
The test statistic is calculated as
\[ Z = \sum_{k=1}^{n} \frac{(O_k - E_k)^2}{E_k}, \tag{3} \]
where $O_k$ represents the number of observed deaths in group $k$, i.e.,
\[ O_k = \sum_{j=1}^{m} d_{kj}. \tag{4} \]
$E_k$ is the expected number of deaths in group $k$, i.e.,
\[ E_k = \sum_{j=1}^{m} \frac{n_{kj} d_j}{n_j}. \tag{5} \]
A smaller test statistic $Z$ suggests a higher probability that the hypothesis is true.
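As a concrete reading of Eqs. (1)–(5), the following minimal Python sketch computes $Z$ when all data sit at one site. The function name `logrank_statistic`, the list-of-lists layout (`n[k][j]`, `d[k][j]`), and the sample counts are illustrative assumptions and are not taken from the paper's implementation.

```python
def logrank_statistic(n, d):
    """Compute Z of Eq. (3).

    n[k][j]: individuals alive in group k at the start of interval j.
    d[k][j]: events occurring in group k during interval j.
    """
    groups = len(n)
    intervals = len(n[0])
    # Eq. (1) and Eq. (2): totals over all groups for each interval.
    n_tot = [sum(n[k][j] for k in range(groups)) for j in range(intervals)]
    d_tot = [sum(d[k][j] for k in range(groups)) for j in range(intervals)]
    z = 0.0
    for k in range(groups):
        observed = sum(d[k])  # Eq. (4)
        # Eq. (5); intervals with nobody at risk are skipped (an assumption
        # made here to avoid dividing by zero).
        expected = sum(
            n[k][j] * d_tot[j] / n_tot[j]
            for j in range(intervals)
            if n_tot[j] > 0
        )
        z += (observed - expected) ** 2 / expected
    return z


# Hypothetical counts: two groups, three intervals.
n_example = [[10, 8, 5], [10, 9, 8]]
d_example = [[2, 3, 1], [1, 1, 1]]
z_example = logrank_statistic(n_example, d_example)
```

Two groups with identical counts give $O_k = E_k$ for every $k$ and hence $Z = 0$, which matches the reading that a small $Z$ supports the hypothesis of equal survival distributions.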
2.2. Privacy for each party
As mentioned above, it is often the case that the survival data for several groups are distributed in different places, e.g., medical research institutions and clinics. These organizations or parties want to compare their survival curves using the logrank test, but none of them is willing to reveal its own survival data to the other parties. We define what each party keeps private in each of the two collaboration scenarios, i.e., the group partition and the sample partition.
• The group partition
In the group partition scenario, each party holds the survival data collected from its own group of patients: the number of events occurring in each time interval and the number of surviving individuals at the beginning of each time interval. Without loss of generality, we assume that party $k$ holds the survival data of group $k$. In our proposed privacy-preserving logrank test model, we aim, for each party $k$, to protect the information $n_{kj}$ and $d_{kj}$ ($\forall j$) from all parties other than $k$, while still correctly computing the logrank test statistic. The group partition scenario is illustrated in Fig. 1.
• The sample partition
In the sample partition scenario, each party holds the survival data for some participants in each group. Formally, $\forall k, j$, each party $i$ holds its survival data $n^i_{kj}$ and $d^i_{kj}$, which are collected for time interval $j$ from the patients in group $k$ that party $i$ has. Each party $i$ wants to keep $n^i_{kj}$ and $d^i_{kj}$ private and build the logrank model using $n_{kj}$ and $d_{kj}$, such that $n_{kj} = \sum_i n^i_{kj}$ and $d_{kj} = \sum_i d^i_{kj}$. Fig. 2 shows the sample partition scenario.

Fig. 1 – The group partition scenario. [Figure: party $k$ (for $k = 1, \ldots, n$) holds group $k$, i.e., patients $k.1, k.2, k.3, \ldots$, together with the values $d_{k1}, n_{k1}, \ldots, d_{kj}, n_{kj}, \ldots, d_{km}, n_{km}$.]
2.3. Privacy-preserving logrank test model for the group partition scenario
Assume that there are $s$ parties ($s \geq 3$), each of which holds the survival data for one group of patients. The $s$ parties want to jointly perform the logrank test (i.e., computing the value $Z$), without revealing their private data to each other.

Fig. 2 – The sample partition scenario. [Figure: each party $i$ (for $i = 1, \ldots, n$) holds some of the patients of every group $1, \ldots, s$, together with the values $d^i_{11}, n^i_{11}, \ldots, d^i_{kj}, n^i_{kj}, \ldots, d^i_{sm}, n^i_{sm}$.]

From Eq. (3) we know that if party $k$ wants to compute $Z$, it needs to be able to compute $E_k$ first, but from Eqs. (1), (2) and (5), we know that
\[ E_k = \sum_{j=1}^{m} n_{kj} \frac{\sum_{k'} d_{k'j}}{\sum_{k'} n_{k'j}}, \tag{6} \]
which requires $d_{k'j}$ and $n_{k'j}$ where $k' \neq k$. Moreover, in order to compute $Z$, party $k$ needs to know $\sum_{k' \neq k} (O_{k'} - E_{k'})^2 / E_{k'}$ to obtain the sum.
Fig. 3 – Privately computing $E_k$ for each party $k$.
For interval $1 \leq j \leq m$:
(Step 1): Party 1 generates two random numbers $R_{j1}$ and $R_{j2}$ (s.t. $R_{j1} \in [0, D)$, $R_{j2} \in [0, D)$)$^{a}$ and sends $R_{j1} + d_{1j}$ and $R_{j2} + n_{1j}$ to party 2.
(Step 2): For each party $p$, s.t. $2 \leq p \leq s - 1$, party $p$ receives $R_{j1} + \sum_{k=1}^{p-1} d_{kj}$ and $R_{j2} + \sum_{k=1}^{p-1} n_{kj}$. Party $p$ computes $R_{j1} + \sum_{k=1}^{p-1} d_{kj} + d_{pj}$ and $R_{j2} + \sum_{k=1}^{p-1} n_{kj} + n_{pj}$ and sends them to party $p + 1$.
(Step 3): Party $s$ receives $R_{j1} + \sum_{k=1}^{s-1} d_{kj}$ and $R_{j2} + \sum_{k=1}^{s-1} n_{kj}$. Party $s$ computes $R_{j1} + \sum_{k=1}^{s-1} d_{kj} + d_{sj}$ and $R_{j2} + \sum_{k=1}^{s-1} n_{kj} + n_{sj}$ and sends them to party 1.
(Step 4): Party 1 receives $R_{j1} + \sum_{k=1}^{s} d_{kj}$ and $R_{j2} + \sum_{k=1}^{s} n_{kj}$. Party 1 subtracts $R_{j1}$ and $R_{j2}$ from the two received numbers respectively, and computes $d_j / n_j = \sum_{k=1}^{s} d_{kj} / \sum_{k=1}^{s} n_{kj}$. Party 1 sends $d_j / n_j$ to all other parties.
End For
Each party $k$ (s.t. $1 \leq k \leq s$) computes $E_k = \sum_{j=1}^{m} n_{kj} \, d_j / n_j$.
$^{a}$ Here $D$ is the range of $d_{ij}$ and $n_{ij}$, $\forall i, j$.
In our model, we utilize a randomization-based secure computation tool, secure sum [5], to tackle these two challenges. We first describe how to securely compute $E_k$ for each $k$, without knowing $d_{k'j}$ and $n_{k'j}$ where $k' \neq k$. Then we present our complete privacy-preserving logrank test model.
Suppose that all participating parties are numbered as parties $1, \ldots, s$. Our method of computing $E_k$ for each party $k$ is summarized in Fig. 3. The idea of our method is that, to compute each $d_j$ (resp. $n_j$) $\forall j$, party 1 generates a random number with the same range as that of $d_{ij}$ and $n_{ij}$ and adds its local value $d_{1j}$ (resp. $n_{1j}$) to the random number before passing it to the next party. In this way the actual values of $d_{1j}$ and $n_{1j}$ are hidden behind the random numbers. Similarly, every other party adds its local value to the sums that it receives and sends the new sums to the next party. Finally, party $s$ sends $R_{j1} + \sum_{k=1}^{s} d_{kj}$ and $R_{j2} + \sum_{k=1}^{s} n_{kj}$ back to party 1, and party 1 subtracts the two random numbers respectively and obtains $d_j$ and $n_j$. After party 1 sends $d_j / n_j$ to all other parties, each party $k$ can compute the value of $E_k$.
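The ring protocol of Fig. 3 can be simulated in a single process as in the sketch below. This is only an illustration under stated assumptions: the names `masked_ring_sum` and `expected_events`, the default `mask_range` standing in for $D$, and the seeded `random.Random` generator are all hypothetical, and no real network messages are exchanged.

```python
import random


def masked_ring_sum(local_values, mask_range, rng):
    """Simulate one secure sum round among parties 1..s.

    local_values[p] is the private value of party p + 1. Only the masked
    running total is ever passed from one party to the next.
    """
    mask = rng.randrange(mask_range)   # R in [0, D), known only to party 1
    running = mask + local_values[0]   # Step 1: party 1 sends this to party 2
    for value in local_values[1:]:     # Steps 2-3: parties 2..s add their values
        running += value
    return running - mask              # Step 4: party 1 removes the mask


def expected_events(n, d, mask_range=1000, seed=0):
    """Return [E_1, ..., E_s] for the group partition (party k holds group k)."""
    rng = random.Random(seed)
    s, m = len(n), len(n[0])
    ratios = []
    for j in range(m):
        d_j = masked_ring_sum([d[k][j] for k in range(s)], mask_range, rng)
        n_j = masked_ring_sum([n[k][j] for k in range(s)], mask_range, rng)
        ratios.append(d_j / n_j)       # party 1 broadcasts only d_j / n_j
    # Each party k combines the broadcast ratios with its own n_kj.
    return [sum(n[k][j] * ratios[j] for j in range(m)) for k in range(s)]


# Hypothetical counts for s = 3 parties and m = 3 intervals.
n_example = [[10, 8, 5], [10, 9, 8], [12, 10, 9]]
d_example = [[2, 3, 1], [1, 1, 1], [2, 1, 0]]
e_example = expected_events(n_example, d_example)
```

Because $\sum_k E_k = \sum_j d_j$, the expected counts returned here should add up to the total number of observed events, which gives a quick sanity check on any implementation.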
After each party $k$ gets $E_k$, it can easily compute $(O_k - E_k)^2 / E_k$ because obtaining $O_k$ does not require the survival data from other parties. As the final step of our privacy-preserving logrank test model, we need to compute $\sum_k (O_k - E_k)^2 / E_k$. Since $(O_k - E_k)^2 / E_k$ reveals how much the survival curve of group $k$ differs from those of the other groups, we also need to keep it private when calculating $Z$. Again we use the idea of secure sum. Party 1 generates a random number $R$ and adds $(O_1 - E_1)^2 / E_1$ to it before passing it to the next party. After every party $k$ has added its $(O_k - E_k)^2 / E_k$, party 1 receives $R + \sum_{k=1}^{s} (O_k - E_k)^2 / E_k$. Party 1 subtracts $R$ and sends the value of $Z$ to the other parties.
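The final masked summation can be sketched as follows. The helper names `group_term` and `secure_z`, the mask granularity, and the use of `fractions.Fraction` (chosen here so that adding and then subtracting $R$ is exact) are assumptions for illustration; the paper does not specify how the real-valued terms are represented.

```python
import random
from fractions import Fraction


def group_term(observed, expected):
    """Local contribution (O_k - E_k)^2 / E_k of one party."""
    return (observed - expected) ** 2 / expected


def secure_z(observed, expected, seed=0):
    """Simulate the last secure sum round that yields Z."""
    rng = random.Random(seed)
    # Party 1's secret R: a rational in [0, 1000) with step 0.001 (assumed range).
    mask = Fraction(rng.randrange(10**6), 1000)
    running = mask + group_term(observed[0], expected[0])  # party 1 -> party 2
    for o_k, e_k in zip(observed[1:], expected[1:]):       # parties 2..s
        running += group_term(o_k, e_k)
    return running - mask                                  # party 1 recovers Z


# Hypothetical values: two groups with O = (6, 3) and E = (4.5, 4.5).
z_example = secure_z([6, 3], [Fraction(9, 2), Fraction(9, 2)])
```

With floating-point terms instead of rationals the same steps apply, but subtracting a large mask can introduce small rounding differences in the recovered sum.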
We notice that in our privacy-preserving logrank test model, party 1 needs to generate random numbers. In the implementation, we use the pseudo random number generator function in the GNU C Library [6]. Applying a pseudo random number generation algorithm is a standard way to generate random numbers in cryptography.
2.4. Privacy-preserving logrank test model for the sample partition scenario
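Assume that $n$ ($n > 2$) parties want to compare $s$ survival curves in the sample partition scenario. As we have mentioned, in the sample partition scenario, each party $i$ holds its survival data $d^i_{kj}$ and $n^i_{kj}$. They would like to collaboratively compute $Z$, which can be written as follows:
\[ Z = \sum_{k=1}^{s} \frac{\left( \displaystyle\sum_{j=1}^{m} \sum_{i=1}^{n} d^i_{kj} - \sum_{j=1}^{m} \left( \sum_{i=1}^{n} n^i_{kj} \right) \frac{\sum_{i=1}^{n} d^i_j}{\sum_{i=1}^{n} n^i_j} \right)^2}{\displaystyle\sum_{j=1}^{m} \left( \sum_{i=1}^{n} n^i_{kj} \right) \frac{\sum_{i=1}^{n} d^i_j}{\sum_{i=1}^{n} n^i_j}}. \tag{7} \]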
As we can see, computing $Z$ in a privacy preserving way is more complicated for the sample partition scenario than for the group partition scenario. Again we utilize the idea of secure sum. For ease of presentation, we only describe the main steps in this model and skip the details of the secure sum computation, which is similar to what we have presented above.
First, using the secure sum algorithm, the $n$ parties can securely obtain
\[ O_k = \sum_{i=1}^{n} \sum_{j=1}^{m} d^i_{kj} \tag{8} \]
and $\sum_{i=1}^{n} n^i_{kj}$. Similarly, for each $j$, the $n$ parties can securely compute $\left( \sum_{i=1}^{n} d^i_j \right) / \left( \sum_{i=1}^{n} n^i_j \right)$, where $d^i_j = \sum_k d^i_{kj}$ and $n^i_j = \sum_k n^i_{kj}$ can be computed locally at each party $i$. With $\sum_{i=1}^{n} n^i_{kj}$ and $\left( \sum_{i=1}^{n} d^i_j \right) / \left( \sum_{i=1}^{n} n^i_j \right)$, $E_k$ can now be computed as in Eq. (9):
\[ E_k = \sum_{j=1}^{m} \left( \sum_{i=1}^{n} n^i_{kj} \right) \frac{\sum_{i=1}^{n} d^i_j}{\sum_{i=1}^{n} n^i_j}. \tag{9} \]
After obtaining $O_k$ and $E_k$ for each group $k$, apply the secure sum again to securely compute the sum of $(O_k - E_k)^2 / E_k$ for the $s$ groups.
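The main steps for the sample partition can be sketched in Python as below. The names `secure_sum` and `sample_partition_z`, the nested-list layout `n[i][k][j]`, the mask range, and the sample counts are illustrative assumptions; the secure sum is simulated in one process, and for brevity the last summation over the $s$ groups is done with a plain loop rather than another masked round.

```python
import random


def secure_sum(values, rng, mask_range=10**6):
    """Simulated secure sum: the running total is masked by a random number
    that the initiating party removes at the end."""
    mask = rng.randrange(mask_range)
    running = mask
    for value in values:   # each party adds its private value in turn
        running += value
    return running - mask


def sample_partition_z(n, d, seed=0):
    """n[i][k][j], d[i][k][j]: party i's counts for group k in interval j."""
    rng = random.Random(seed)
    parties, groups, intervals = len(n), len(n[0]), len(n[0][0])
    ratios = []
    for j in range(intervals):
        # d^i_j and n^i_j are local sums over groups at each party i.
        d_j = secure_sum([sum(d[i][k][j] for k in range(groups))
                          for i in range(parties)], rng)
        n_j = secure_sum([sum(n[i][k][j] for k in range(groups))
                          for i in range(parties)], rng)
        ratios.append(d_j / n_j)
    z = 0.0
    for k in range(groups):
        # Eq. (8): each party contributes its local event count for group k.
        o_k = secure_sum([sum(d[i][k]) for i in range(parties)], rng)
        # Eq. (9): pooled n_kj per interval, weighted by d_j / n_j.
        e_k = sum(
            secure_sum([n[i][k][j] for i in range(parties)], rng) * ratios[j]
            for j in range(intervals)
        )
        z += (o_k - e_k) ** 2 / e_k
    return z


# Hypothetical data: 3 parties, 2 groups, 2 intervals.
n_parts = [[[5, 4], [6, 5]], [[5, 4], [4, 4]], [[3, 2], [2, 2]]]
d_parts = [[[1, 1], [1, 0]], [[1, 2], [0, 1]], [[1, 0], [0, 1]]]
z_parts = sample_partition_z(n_parts, d_parts)
```

Calling the same function with a single party that holds the pooled counts should return the same value, which mirrors the claim that the result equals the one obtained with all data on one site.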
3. Discussion
Our proposed privacy-preserving logrank test models have the following characteristics that are different from previous work.
• They preserve the data privacy of each party, without revealing its data to others, in the process of survival data analysis using the logrank test. Relieving the privacy concerns of the medical institutions is of great importance because it will encourage more cooperation among these institutions on either research or clinical trials. Consequently, better survival data comparisons supported by larger databases will become available.
• Our privacy preserving models are accurate, meaning the logrank test results obtained by our models are the same as those obtained by having all data on one site.
• The two models, for the group partition scenario and for the sample partition scenario, only require the parties who hold the survival data to participate. In other words, as long as those parties with data use our models, all computation is conducted within those parties and thus no other agencies are needed. Therefore, our models are very practical in their implementation.
• We utilize a randomization-based method in our models. As a result, our models are much more efficient compared with cryptography-based approaches.
An existing work in the data mining community proposed privacy preserving Cox regression in survival analysis [7]. The proposed model was based on linearly projecting the data to a lower dimensional space through an optimal mapping. Different from their model, we focus on another very important function of survival data analysis in medical trials, the survival curves comparison. Furthermore, we do not lose any accuracy in our privacy preserving models.
4. Hardware and software specifications
The models are implemented using the GNU C Library. The programs run on Red Hat Linux 7.2 on 2.0 GHz computers. See [8] for more detailed implementation and evaluation results.
5. Conclusion
In this paper, we propose privacy preserving models for survival curves comparison based on the logrank test, in order to perform better survival analysis through the collaboration of multiple medical institutions and protect the data privacy. We distinguish two collaboration scenarios, the group partition scenario and the sample partition scenario. For each scenario we present a privacy preserving model for the logrank test. We conduct experiments on real medical data to evaluate the effectiveness of our proposed models.
References
[1] Health Information Technology for Economic and Clinical Health Act, available at: http://waysandmeans.house.gov/media/pdf/111/hitech.pdf.
[2] D.G. Altman, Practical Statistics for Medical Research, Chapman & Hall, London, 1991, ISBN 0-412-27630-5.
[3] HIPAA, National Standards to Protect the Privacy of Personal Health Information [Online], available at: http://www.hhs.gov/ocr/hipaa/finalreg.html.
[4] Editorial, Whose scans are they, anyway? Nature 406 (August (443)) (2000).
[5] M. Kantarcioglu, C. Clifton, Privacy-preserving distributed mining of association rules on horizontally partitioned data, in: The ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'02), Madison, Wisconsin, June 2, 2002, pp. 24–31.
[6] GNU C library, available at: http://www.gnu.org/software/libc/.
[7] S. Yu, G. Fung, R. Rosales, S. Krishnan, R.B. Rao, Privacy-preserving cox regression for survival analysis, in: Proceedings of KDD'08, Las Vegas, Nevada, USA, August, 2008.
[8] T. Chen, S. Zhong, Privacy-Preserving Models for Comparing Survival Curves Using the Logrank Test, Technical Report 2011-02, Computer Science and Engineering Department, SUNY Buffalo, http://www.cse.buffalo.edu/tech-reports/2011-02.pdf.