Computer Methods and Programs in Biomedicine 104 (2011) 249–253
Journal homepage: www.intl.elsevierhealth.com/journals/cmpb

Privacy-preserving models for comparing survival curves using the logrank test

Tingting Chen, Sheng Zhong*
Computer Science and Engineering Department, State University of New York at Buffalo, Amherst, NY 14260, USA

* Corresponding author. Tel.: +1 716 645 4752; fax: +1 716 645 3464.
E-mail addresses: [email protected] (T. Chen), [email protected] (S. Zhong).

Article history: Received 16 June 2010; received in revised form 19 April 2011; accepted 24 April 2011
Keywords: Survival curves; Privacy preservation; Logrank test

0169-2607/$ see front matter © 2011 Elsevier Ireland Ltd. All rights reserved.
doi:10.1016/j.cmpb.2011.04.004

Abstract

The incorporation of electronic health care in medical institutions will benefit and thus further boost collaborations in medical research among clinics and research institutions. However, privacy regulations and security concerns make such collaborations very restricted. In this paper, we propose privacy-preserving models for survival curve comparison based on the logrank test, in order to perform better survival analysis through the collaboration of multiple medical institutions while protecting data privacy. We distinguish two collaboration scenarios, and for each scenario we present a privacy-preserving model for the logrank test. We conduct experiments on real medical data to evaluate the effectiveness of our proposed models.

© 2011 Elsevier Ireland Ltd. All rights reserved.

1. Introduction

With the development of information technology, there is an increasing need to incorporate electronic health records (EHRs) in medical institutions [1]. The availability of EHRs is believed to improve the efficiency and quality of the health care that patients receive. Moreover, by using EHRs instead of paper-based records, hospitals can store and manage more health care data than ever before. Consequently, EHRs will benefit the development of more advanced clinical computer-based tools that help diagnosis and research. In particular, if multiple medical institutions can integrate their electronically stored health care data, this substantial amount of data makes it possible to build better models with higher accuracy to assist clinical treatment and medical research.

Survival analysis [2] is an important statistical tool often used in clinical trials to assess benefit and risk. With the collaboration of multiple medical institutions, researchers or doctors can build better survival analysis models, especially survival function comparison models. Here we illustrate two different scenarios as examples. In the first scenario, a hospital performs a new radiotherapy treatment on a group of pancreatic cancer patients. The doctors in the hospital can observe the survival events (death or cancer recurrence) of these patients and draw a survival curve for this new treatment. They want to compare this survival curve with those of other treatments to justify its effectiveness and advantage. Fortunately, a medical research institution holds survival data from other treatment trials for pancreatic cancer with trial participants of similar background. Clearly, collaborative data exchange between the hospital and the research institution will be beneficial for the comparison. In the second scenario, three institutions are all studying the performance of a new medicine for stroke on patients of different ages. They want to build and compare the survival curves for different age intervals. However, the trial participants in any one of the three institutions are not sufficient to obtain results with high accuracy. If they can compare survival curves based on the trial participants from all three institutions, the accuracy of the result will increase significantly.


However, sharing medical data is well known to be restricted because of privacy and security concerns. According to the privacy rules of the Health Insurance Portability and Accountability Act (HIPAA) [3], the privacy of patients must be protected, and it is illegal for research institutions and hospitals to distribute patients' medical data without appropriate privacy preservation. In addition, medical researchers are reluctant to share their data with others even if it is already anonymized, out of concern that the data could be misused or misinterpreted. For instance, neuroscientists at Dartmouth College found it difficult to encourage the sharing of brain imaging data [4]. The same privacy concerns arise in the two scenarios above and impede collaboration between medical institutions. Therefore, we need to develop new models for survival curve comparison that protect the privacy of patients and relieve the data security concerns of researchers and doctors.

In this paper, we propose novel privacy-preserving models for the logrank test, which is a standard test for comparing survival curves. In particular, for each of the two collaboration scenarios mentioned above, we design one privacy-preserving logrank test model. In the rest of this paper, we call the first scenario group partition, meaning that each institution holds a survival curve for an entire group of participants. We call the second scenario sample partition, meaning that each institution holds the survival data of some (but not all) participants in each group. Our goal is that, for each collaboration scenario, the proposed logrank test model can learn the comparison result of survival curves built on the data from all medical institutions without looking at the original survival data from other medical institutions. We utilize a cryptographic tool, secure sum [5], in our models. In this way, the privacy of medical data is protected. As far as we know, this is the first work on building privacy-preserving models for survival curve comparison using the logrank test. We perform experiments on real medical data to show the effectiveness of our proposed models.

2. Methods

In this section, we first review the logrank test for comparing survival curves. Then we describe our two privacy-preserving models for the logrank test. The first model enables the privacy-preserving comparison of survival curves in the group partition scenario. The second model preserves privacy when comparing survival curves in the sample partition scenario.

2.1. Overview of logrank test

Suppose we have n groups of individuals. The logrank test [2] is a statistical hypothesis test, where the hypothesis is that the n groups have the same survival distribution, i.e., for each group the probability that the event (e.g., death) occurs at each time point is the same. In particular, we divide the time into m intervals. Let $n_{kj}$ be the number of individuals that are alive in group k at the beginning of time interval j. Let $d_{kj}$ be the number of events occurring in group k in interval j. $n_j$ and $d_j$ are defined as in Eq. (1) and Eq. (2), respectively.

$$ n_j = \sum_{k=1}^{n} n_{kj} \tag{1} $$

$$ d_j = \sum_{k=1}^{n} d_{kj} \tag{2} $$

The test statistic is calculated as

$$ Z = \sum_{k=1}^{n} \frac{(O_k - E_k)^2}{E_k}, \tag{3} $$

where $O_k$ represents the number of observed deaths in group k, i.e.,

$$ O_k = \sum_{j=1}^{m} d_{kj}. \tag{4} $$

$E_k$ is the expected number of deaths in group k, i.e.,

$$ E_k = \sum_{j=1}^{m} \frac{n_{kj}\, d_j}{n_j}. \tag{5} $$

A smaller test statistic Z is more consistent with the hypothesis, i.e., it provides weaker evidence that the groups differ.
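As an illustration of Eqs. (1)-(5), the following is a minimal Python sketch, not the authors' implementation. The function name `logrank_statistic` and the nested-list layout (`n[k][j]`, `d[k][j]`) are assumptions made here for illustration, and the sketch assumes that every interval has at least one individual at risk.

```python
from typing import Sequence


def logrank_statistic(n: Sequence[Sequence[int]], d: Sequence[Sequence[int]]) -> float:
    """Compute Z from Eq. (3).

    n[k][j]: individuals alive in group k at the start of interval j.
    d[k][j]: events occurring in group k during interval j.
    (Hypothetical layout chosen for this sketch.)
    """
    groups = len(n)
    intervals = len(n[0])

    # Eq. (1) and Eq. (2): pooled at-risk and event counts per interval.
    n_tot = [sum(n[k][j] for k in range(groups)) for j in range(intervals)]
    d_tot = [sum(d[k][j] for k in range(groups)) for j in range(intervals)]

    z = 0.0
    for k in range(groups):
        observed = sum(d[k])  # Eq. (4)
        # Eq. (5); assumes n_tot[j] > 0 for every interval j.
        expected = sum(n[k][j] * d_tot[j] / n_tot[j] for j in range(intervals))
        z += (observed - expected) ** 2 / expected  # one term of Eq. (3)
    return z
```

When all groups have identical counts, every $O_k$ equals $E_k$ and the statistic is zero, which matches the reading of Z given above.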

2.2. Privacy for each party

As mentioned above, it is often the case that the survival data for several groups are distributed in different places, e.g., medical research institutions and clinics. These organizations or parties want to compare their survival curves using the logrank test, but none of them is willing to reveal its own survival data to the other parties. We distinguish the privacy of each party for the two collaboration scenarios, i.e., the group partition and the sample partition.

• The group partition. In the group partition scenario, each party holds the survival data collected from its own group of patients: the number of events occurring in each time interval and the number of surviving individuals at the beginning of each time interval. Without loss of generality, we assume that party k holds the survival data of group k. In our proposed privacy-preserving logrank test model, we aim to protect, for each party k, the information $n_{kj}$ and $d_{kj}$ ($\forall j$) from all parties other than k while still correctly computing the logrank test statistic. The group partition scenario is illustrated in Fig. 1.

• The sample partition. In the sample partition scenario, each party holds the survival data for some participants in each group. Formally, $\forall k, j$, each party i holds its survival data $n^i_{kj}$ and $d^i_{kj}$, which are collected for time interval j from the patients in group k that party i has. Each party i wants to keep $n^i_{kj}$ and $d^i_{kj}$ private and build the logrank model using $n_{kj}$ and $d_{kj}$, such that $n_{kj} = \sum_i n^i_{kj}$ and $d_{kj} = \sum_i d^i_{kj}$. Fig. 2 shows the sample partition scenario.

Fig. 1 – The group partition scenario. Party k holds group k (patients k.1, k.2, k.3, ...) together with the counts $d_{k1}, n_{k1}, \ldots, d_{km}, n_{km}$, for k = 1, ..., n.

2.3. Privacy-preserving logrank test model for the group partition scenario

Assume that there are s parties (s ≥ 3), each of which holds the survival data for one group of patients. The s parties want to jointly perform the logrank test (i.e., compute the value Z) without revealing their private data to each other.

From Eq. (3) we know that if party k wants to compute Z, it needs to be able to compute $E_k$ first, but from Eqs. (1), (2) and (5) we know that

$$ E_k = \sum_{j=1}^{m} n_{kj}\, \frac{\sum_{k'=1}^{s} d_{k'j}}{\sum_{k'=1}^{s} n_{k'j}}, \tag{6} $$

which requires $d_{k'j}$ and $n_{k'j}$ where $k' \neq k$. Moreover, in order to compute Z, party k needs to know $\sum_{k' \neq k} (O_{k'} - E_{k'})^2 / E_{k'}$ to obtain the sum.

Fig. 2 – The sample partition scenario. Each party i (i = 1, ..., n) holds some patients from every group 1, ..., s, together with the counts $d^i_{kj}$, $n^i_{kj}$ for all k and j.

For each interval j, $1 \le j \le m$:

(Step 1): Party 1 generates two random numbers $R_{j1}$ and $R_{j2}$ (s.t. $R_{j1} \in [0, D)$, $R_{j2} \in [0, D)$) (a) and sends $R_{j1} + d_{1j}$ and $R_{j2} + n_{1j}$ to party 2.

(Step 2): For each party p, s.t. $2 \le p \le s - 1$, party p receives $R_{j1} + \sum_{k=1}^{p-1} d_{kj}$ and $R_{j2} + \sum_{k=1}^{p-1} n_{kj}$. Party p computes $R_{j1} + \sum_{k=1}^{p-1} d_{kj} + d_{pj}$ and $R_{j2} + \sum_{k=1}^{p-1} n_{kj} + n_{pj}$ and sends them to party p + 1.

(Step 3): Party s receives $R_{j1} + \sum_{k=1}^{s-1} d_{kj}$ and $R_{j2} + \sum_{k=1}^{s-1} n_{kj}$. Party s computes $R_{j1} + \sum_{k=1}^{s-1} d_{kj} + d_{sj}$ and $R_{j2} + \sum_{k=1}^{s-1} n_{kj} + n_{sj}$ and sends them to party 1.

(Step 4): Party 1 receives $R_{j1} + \sum_{k=1}^{s} d_{kj}$ and $R_{j2} + \sum_{k=1}^{s} n_{kj}$. Party 1 subtracts $R_{j1}$ and $R_{j2}$ from the two received numbers, respectively, and computes $d_j / n_j$. Party 1 sends $d_j / n_j$ to all other parties.

End For

Each party k (s.t. $1 \le k \le s$) computes $E_k = \sum_{j=1}^{m} n_{kj}\, d_j / n_j$.

(a) Here D is the range of $d_{ij}$ and $n_{ij}$, $\forall i, j$.

Fig. 3 – Privately computing $E_k$ for each party k.

In our model, we utilize a randomization-based secure computation tool, secure sum [5], to tackle these two challenges. We first describe how to securely compute $E_k$ for each k without knowing $d_{k'j}$ and $n_{k'j}$ where $k' \neq k$. Then we present our complete privacy-preserving logrank test model.

Suppose that all participating parties are numbered as parties 1, ..., s. Our method of computing $E_k$ for each party k is summarized in Fig. 3. The idea of our method is that, to compute each $d_j$ (resp. $n_j$) $\forall j$, party 1 generates a random number with the same range as that of $d_{ij}$ and $n_{ij}$ and adds its local value $d_{1j}$ (resp. $n_{1j}$) to the random number before passing it to the next party. In this way the actual values of $d_{1j}$ and $n_{1j}$ are hidden behind the random numbers. Similarly, every other party adds its local value to the sums that it receives and sends the new sums to the next party. Finally, party s sends $R_{j1} + \sum_{k=1}^{s} d_{kj}$ and $R_{j2} + \sum_{k=1}^{s} n_{kj}$ back to party 1, and party 1 subtracts the two random numbers, respectively, and obtains $d_j$ and $n_j$. After party 1 sends $d_j / n_j$ to all other parties, each party k can compute the value of $E_k$.
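The ring-style masking described above can be sketched in Python as follows. This is a single-process simulation under stated assumptions rather than the authors' C implementation: there is no real message passing, the names `ring_secure_sum` and `expected_events` are hypothetical, and the mask is drawn with Python's `random` module purely for illustration.

```python
import random


def ring_secure_sum(local_values, bound, rng=random):
    """Simulate one pass around the ring for a single quantity.

    local_values[p] is the private count of party p + 1; bound plays the
    role of D, the range of the counts. (Hypothetical helper.)
    """
    mask = rng.randrange(bound)        # party 1 draws R from [0, D)
    running = mask + local_values[0]   # party 1 sends R + its value to party 2
    for value in local_values[1:]:     # parties 2..s add their own values in turn
        running += value
    return running - mask              # party 1 removes the mask


def expected_events(n, d, bound, rng=random):
    """Return [E_1, ..., E_s] using masked sums for d_j and n_j."""
    groups, intervals = len(n), len(n[0])
    ratios = []
    for j in range(intervals):
        d_j = ring_secure_sum([d[k][j] for k in range(groups)], bound, rng)
        n_j = ring_secure_sum([n[k][j] for k in range(groups)], bound, rng)
        ratios.append(d_j / n_j)       # party 1 broadcasts d_j / n_j
    # Each party k combines the broadcast ratios with its own n_kj only.
    return [sum(n[k][j] * ratios[j] for j in range(intervals)) for k in range(groups)]
```

Because each party needs only its own $n_{kj}$ and the broadcast ratios $d_j / n_j$, the expected counts sum to the total number of events across all groups, which gives a quick sanity check on a run.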

After each party k gets $E_k$, it can easily compute $(O_k - E_k)^2 / E_k$ because obtaining $O_k$ does not require the survival data from other parties. As the final step of our privacy-preserving logrank test model, we need to compute $\sum_k (O_k - E_k)^2 / E_k$. Since $(O_k - E_k)^2 / E_k$ reveals how much the survival curve of group k differs from those of other groups, we also need to keep it private when calculating Z. Again we use the idea of secure sum. Party 1 generates a random number R and adds $(O_1 - E_1)^2 / E_1$ to it before passing it to the next party. When each party k has added its $(O_k - E_k)^2 / E_k$, party 1 receives $R + \sum_{k=1}^{s} (O_k - E_k)^2 / E_k$. Party 1 subtracts R and sends the value of Z to the other parties.
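A minimal sketch of this final masked summation is shown below. It is an illustration only: the helper names `chi_term` and `masked_total` are hypothetical, and exact rationals (`fractions.Fraction`) are used here, as an assumption of this sketch, so that adding and then removing the mask R introduces no rounding error.

```python
import random
from fractions import Fraction


def chi_term(observed, expected):
    """Local term (O_k - E_k)^2 / E_k computed by party k."""
    return (observed - expected) ** 2 / expected


def masked_total(terms, mask_bound=10**6, rng=random):
    """Simulate the ring: party 1 masks, each party adds its term, party 1 unmasks."""
    mask = Fraction(rng.randrange(mask_bound))  # random number R chosen by party 1
    running = mask
    for term in terms:       # party 1 adds its term first, then parties 2..s
        running += term
    return running - mask    # party 1 subtracts R to recover Z
```

For example, with $O_1 = 3$, $E_1 = 83/34$, $O_2 = 2$ and $E_2 = 87/34$, the unmasked total equals the directly computed sum of the two terms.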

We note that in our privacy-preserving logrank test model, party 1 needs to generate random numbers. In our implementation, we use the pseudo-random number generator function in the GNU C Library [6]. Applying a pseudo-random number generation algorithm is a standard way to generate random numbers in cryptography.

2.4. Privacy-preserving logrank test model for the sample partition scenario

Assume that n parties (n > 2) want to compare s survival curves in the sample partition scenario. As mentioned above, in the sample partition scenario each party i holds its survival data $d^i_{kj}$ and $n^i_{kj}$. The parties would like to collaboratively compute Z, which can be written as follows:

$$ Z = \sum_{k=1}^{s} \frac{\left( \sum_{j=1}^{m} \sum_{i=1}^{n} d^i_{kj} - \sum_{j=1}^{m} \left( \sum_{i=1}^{n} n^i_{kj} \right) \frac{\sum_{i=1}^{n} d^i_j}{\sum_{i=1}^{n} n^i_j} \right)^2}{\sum_{j=1}^{m} \left( \sum_{i=1}^{n} n^i_{kj} \right) \frac{\sum_{i=1}^{n} d^i_j}{\sum_{i=1}^{n} n^i_j}}. \tag{7} $$

As we can see, computing Z in a privacy-preserving way is more complicated for the sample partition scenario than for the group partition scenario. Again we utilize the idea of secure sum. For ease of presentation, we only describe the main steps in this model and skip the details of the secure sum computation, which is similar to what we have presented above.

First, using the secure sum algorithm, the n parties can securely obtain

$$ O_k = \sum_{i=1}^{n} \sum_{j=1}^{m} d^i_{kj} \tag{8} $$

and $\sum_{i=1}^{n} n^i_{kj}$. Similarly, for each j, the n parties can securely compute $\left( \sum_{i=1}^{n} d^i_j \right) / \left( \sum_{i=1}^{n} n^i_j \right)$, where $d^i_j = \sum_k d^i_{kj}$ and $n^i_j = \sum_k n^i_{kj}$ can be computed locally at each party i. With $\sum_{i=1}^{n} n^i_{kj}$ and $\left( \sum_{i=1}^{n} d^i_j \right) / \left( \sum_{i=1}^{n} n^i_j \right)$, $E_k$ can now be computed as in Eq. (9):

$$ E_k = \sum_{j=1}^{m} \left( \sum_{i=1}^{n} n^i_{kj} \right) \frac{\sum_{i=1}^{n} d^i_j}{\sum_{i=1}^{n} n^i_j}. \tag{9} $$

After obtaining $O_k$ and $E_k$ for each group k, the parties apply the secure sum again to securely compute the sum of $(O_k - E_k)^2 / E_k$ over the s groups.
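The main steps for the sample partition scenario can be sketched as below. This is a simplified single-process simulation, not the authors' implementation: `secure_sum` stands in for the ring protocol, the names and the `n_parts[i][k][j]` layout are assumptions made for illustration, and the last summation over groups is done in the clear here for brevity, whereas the text applies secure sum to that step as well.

```python
import random


def secure_sum(values, bound=10**6, rng=random):
    """Stand-in for the ring protocol: mask, accumulate, unmask."""
    mask = rng.randrange(bound)
    return sum(values, mask) - mask


def sample_partition_statistic(n_parts, d_parts):
    """n_parts[i][k][j], d_parts[i][k][j]: counts held by party i for
    group k and interval j (hypothetical layout)."""
    parties = range(len(n_parts))
    groups = range(len(n_parts[0]))
    intervals = range(len(n_parts[0][0]))

    # Eq. (8): observed events per group, summed over parties.
    observed = [secure_sum([sum(d_parts[i][k]) for i in parties]) for k in groups]
    # Pooled at-risk counts per group and interval.
    at_risk = [[secure_sum([n_parts[i][k][j] for i in parties]) for j in intervals]
               for k in groups]

    ratios = []
    for j in intervals:
        # Each party first sums its own counts over all groups (local step).
        d_j = secure_sum([sum(d_parts[i][k][j] for k in groups) for i in parties])
        n_j = secure_sum([sum(n_parts[i][k][j] for k in groups) for i in parties])
        ratios.append(d_j / n_j)

    # Eq. (9): expected events per group.
    expected = [sum(at_risk[k][j] * ratios[j] for j in intervals) for k in groups]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

Splitting a pooled data set across several parties and running this sketch should reproduce the statistic obtained from the pooled counts, up to floating-point rounding.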

3. Discussion

Our proposed privacy-preserving logrank test models have the following characteristics that are different from previous work.

• They preserve the data privacy of each party, without revealing the data to others, in the process of survival data analysis using the logrank test. Relieving the privacy concerns of medical institutions is of great importance because it will encourage more cooperation among these institutions on either research or clinical trials. Consequently, better survival data comparisons supported by larger databases will become available.

• Our privacy-preserving models are accurate, meaning that the logrank test results obtained by our models are the same as those obtained by having all data on one site.

• The two models, for the group partition scenario and for the sample partition scenario, only require the parties who hold the survival data to participate. In other words, as long as those parties with data use our models, all computation is conducted within those parties and thus no other agencies are needed. Therefore, our models are very practical to implement.

• We utilize a randomization-based method in our models. As a result, our models are much more efficient than cryptography-based approaches.

An existing work in the data mining community proposed privacy-preserving Cox regression for survival analysis [7]. The proposed model was based on linearly projecting the data to a lower dimensional space through an optimal mapping. Different from their model, we focus on another very important function of survival data analysis in medical trials, the comparison of survival curves. Furthermore, our privacy-preserving models do not lose any accuracy.

4. Hardware and software specifications

The models are implemented using the GNU C Library. The programs run on Red Hat Linux 7.2 on 2.0 GHz computers. See [8] for more detailed implementation and evaluation results.

5. Conclusion

In this paper, we propose privacy-preserving models for survival curve comparison based on the logrank test, in order to perform better survival analysis through the collaboration of multiple medical institutions while protecting data privacy. We distinguish two collaboration scenarios, the group partition scenario and the sample partition scenario. For each scenario we present a privacy-preserving model for the logrank test. We conduct experiments on real medical data to evaluate the effectiveness of our proposed models.

References

[1] Health Information Technology for Economic and Clinical Health Act, available at: http://waysandmeans.house.gov/media/pdf/111/hitech.pdf.

[2] D.G. Altman, Practical Statistics for Medical Research, Chapman & Hall, London, 1991, ISBN 0-412-27630-5.

[3] HIPAA, National Standards to Protect the Privacy of Personal Health Information [Online], available at: http://www.hhs.gov/ocr/hipaa/finalreg.html.

[4] Editorial, Whose scans are they, anyway? Nature 406 (August (443)) (2000).

[5] M. Kantarcioglu, C. Clifton, Privacy-preserving distributed mining of association rules on horizontally partitioned data, in: The ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'02), Madison, Wisconsin, June 2, 2002, pp. 24–31.

[6] GNU C library, available at: http://www.gnu.org/software/libc/.

[7] S. Yu, G. Fung, R. Rosales, S. Krishnan, R.B. Rao, Privacy-preserving cox regression for survival analysis, in: Proceedings of KDD'08, Las Vegas, Nevada, USA, August, 2008.

[8] T. Chen, S. Zhong, Privacy-Preserving Models for Comparing Survival Curves Using the Logrank Test, Technical Report 2011-02, Computer Science and Engineering Department, SUNY Buffalo, http://www.cse.buffalo.edu/tech-reports/2011-02.pdf.