
On A New Scheme on Privacy Preserving Data Classification ∗

Nan Zhang Shengquan Wang Wei Zhao †

Abstract

We address the privacy preserving data classification problem in a distributed system. Randomization has been proposed to preserve privacy in such circumstances. However, this approach was challenged in [12] by a privacy intrusion technique that is capable of reconstructing the private data in a relatively accurate manner. In this paper, we introduce a scheme based on an algebraic technique. Compared to the randomization approach, our new scheme can build classifiers with better accuracy but disclose less private information. We also show that our scheme is immune to privacy intrusion attacks. Performance lower bounds in terms of both accuracy and privacy are established.

Keywords: Privacy, security, classification

1 Introduction

In this paper, we address issues related to privacy preserving data mining techniques. The purpose of data mining is to extract knowledge from large amounts of data [10]. Classification is one of the biggest challenges in data mining. In this paper, we focus on privacy preserving data classification.

The objective of classification is to construct a model (classifier) that is capable of predicting the (categorical) class labels of data [10]. The model is usually represented by classification rules, decision trees, neural networks, or mathematical formulae that can be used for classification. The model is constructed by analyzing data tuples (i.e., samples) in a training data set, where the class label of each data tuple is provided. For example, suppose that a company has a database containing the age, occupation, and income of customers and wants to know whether a new customer is a potential buyer of a new product. To answer this question, the company first builds a model which details the existing customers in the database, based on whether they have bought the new product. The model consists of a set of classification rules (e.g., (occupation = student) ∧ (age ≤ 20)

∗This work was supported in part by the National Science Foundation under Contracts 0081761, 0324988, 0329181, by the Defense Advanced Research Projects Agency under Contract F30602-99-1-0531, and by Texas A&M University under its Telecommunication and Information Task Force Program. Any opinions, findings, conclusions, and/or recommendations expressed in this material, either expressed or implied, are those of the authors and do not necessarily reflect the views of the sponsors listed above.

†The authors are with the Department of Computer Science, Texas A&M University, College Station, TX 77843, USA. Email: {nzhang, swang, zhao}@cs.tamu.edu.

→ buyer). Then, the company uses the model to determine whether the new customer is a potential buyer of the product.

Classification techniques have been extensively studied for over twenty years [14]. However, only in recent years has the issue of privacy protection in classification been raised [2, 13]. In many situations, privacy is a very important concern. In the above example, the customers may not want to disclose their personal information (e.g., incomes) to the company. The objective of research on privacy preserving data classification is to develop techniques that can build accurate classification models without disclosing private information in the data being mined. The performance of privacy preserving techniques should be analyzed and compared in terms of both accuracy and privacy.

We consider a distributed environment where training data tuples for classification are stored in multiple sites. We can classify distributed privacy preserving classification systems into two categories based on their infrastructures: Server-to-Server (S2S) and Client-to-Server (C2S).

In the first category (S2S), data tuples in the training data set are distributed across several servers. Each server holds a part of the training data set that contains numerous data tuples. The servers collaborate with each other to construct a classifier spanning all servers. Since the number of servers in the system is usually small (e.g., less than 10), the problem can be formulated as a variation of secure multiparty computation [13]. Existing privacy preserving classification algorithms in this category include decision tree classifiers [4, 13], the naïve Bayesian classifier for vertically partitioned data [16], and the naïve Bayesian classifier for horizontally partitioned data [11].

In the second category (C2S), a system usually consists of a data miner (server) and numerous data providers (clients). Each data provider holds only one data tuple. The data miner performs the data classification process on aggregated data offered by the data providers. An online survey is a typical example of this type of system, as the system can be modeled as one survey collector/analyzer (data miner) and thousands of survey respondents (data providers).

Needless to say, both S2S and C2S systems have a broad range of applications. In this paper, we focus on studying privacy preserving data classification in C2S systems. Most of the current studies on C2S systems tacitly assume that randomization is an effective approach to preserving privacy.

However, this assumption has been challenged in [12]. It was shown that an illegal data miner may be able to reconstruct the private data even if they have been randomized. In this paper, we take an algebraic approach and develop a new scheme. Our new scheme has the following important features that distinguish it from previous approaches.

• Our scheme can help to build classifiers that have better accuracy but disclose less private information. A lower bound on accuracy is derived and can be used to predict the system accuracy in practice.

• Our scheme is immune to privacy intrusion attacks. That is, the data miner cannot derive more private information from the data tuples it receives from the data providers if the data are properly perturbed based on our scheme.

• Our scheme allows every user to play a role in determining the tradeoff between accuracy and privacy. Specifically, we allow explicit negotiation between each data provider and the data miner in terms of the tradeoff between accuracy and privacy. This makes our system meet the needs of a wide range of users, from hard-core privacy protectionists to the marginally concerned.

• Our scheme is flexible and easy to implement. It does not require a distribution reconstruction component, as previous approaches do. Thus, our privacy preserving component is transparent to the data classification process and can be readily integrated with existing systems as a middleware.

The rest of the paper is organized as follows: The model of data miners is introduced in Section 2. We briefly review previous approaches in Section 3. In Section 4, we introduce our new scheme and its basic components. We present a theoretical analysis of accuracy and privacy in Section 5. Theoretical bounds on accuracy and privacy metrics are also derived in this section. Then we compare the performance of our scheme and the previous approach in Section 6. Experimental results are presented in this section. The implementation of our scheme is discussed in Section 7, followed by final remarks in Section 8.

2 Model

In this section, we introduce our model of data miners. Due to the privacy concern introduced to the system, we classify data miners into two categories. One category is legal data miners. These data miners always act legally in that they only perform regular data mining tasks and would never intentionally compromise the privacy of data. The other category is illegal data miners. These data miners purposely attempt to discover private information in the data being mined.

Like adversaries in distributed systems, illegal data miners come in many forms. In most forms, their behavior is restricted from arbitrarily deviating from the protocol. In this paper, we focus on a particular subclass of illegal miners, curious data miners. That is, in our system, illegal data miners are honest but curious¹: they follow the protocol strictly (i.e., they are honest), but they may analyze all intermediate communications and received data to compromise the privacy of data providers (i.e., they are curious) [8].

3 Randomization Approach and its Problems

Based on the model of data miners, we review the randomization approach, which is currently used to preserve privacy in classification. We also point out the problems associated with the randomization approach that motivate us to design a new privacy preserving scheme for data classification.

To prevent privacy invasions by curious data miners, countermeasures must be implemented in the data classification process. Randomization is a commonly used approach. We briefly review it below.

Based on the randomization approach, the entire privacy preserving classification process can be considered a two-step process. The first step is to transmit (randomized) data from the data providers to the data miner. That is, in this step, a data provider applies a randomization operator R(·) to the data tuple that the data provider holds. Then, the data provider transmits the randomized data tuple to the data miner. In previous studies, several randomization operators have been proposed, including the random perturbation operator [2] and the random response operator [5]. These operators are shown in (3.1) and (3.2), respectively.

R(t) = t + r.    (3.1)

R(t) = { t,   if r ≥ θ,
       { t̄,   if r < θ.    (3.2)

Here, t is the original data tuple, t̄ denotes the complement of a binary data tuple t, r is the noise randomly generated from a predetermined distribution, and θ is a predetermined parameter. Note that the random response operator in (3.2) only applies to binary data tuples.
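To make the two operators concrete, the following is a minimal Python/NumPy sketch, not the exact implementations of [2] or [5]; the Gaussian noise scale and the parameter θ are placeholder choices, and the function names are ours.

    import numpy as np

    rng = np.random.default_rng(0)

    def random_perturbation(t, noise_std=1.0):
        # (3.1): R(t) = t + r, with noise r drawn from a predetermined
        # distribution (a zero-mean Gaussian here; uniform noise is another option).
        r = rng.normal(0.0, noise_std, size=np.shape(t))
        return np.asarray(t, dtype=float) + r

    def random_response(t, theta=0.2):
        # (3.2), for binary tuples: draw r uniformly on [0, 1); keep the tuple
        # unchanged if r >= theta and report its complement otherwise. Under this
        # convention theta acts as the flip probability; [5] may use the
        # complementary convention.
        t = np.asarray(t, dtype=int)
        return t if rng.uniform() >= theta else 1 - t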

In the second step, the legal data miner performs the data classification process on the aggregated data. With the randomization approach, the legal data miner must first employ a distribution reconstruction algorithm which intends to recover the original data distribution from the randomized data. There are several algorithms for reconstructing the original distribution [1, 2, 5]. In particular, an expectation maximization (EM) algorithm was proposed in [1]. The distribution reconstructed by the EM algorithm converges to the maximum likelihood estimate of the original distribution.

¹The honest-but-curious behavior model is also known as the semi-honest behavior model.

Also in the second step, a curious data miner may invade privacy by using a private data recovery algorithm on the randomized data supplied by the data providers.

Figure 1: Randomization Approach

Figure 1 depicts the classification process with the randomization approach. Clearly, any such data classification system should be measured by its capacity for both building accurate classifiers and preventing private data leakage.

3.1 The problems of the randomization approach. While the randomization approach is intuitive, researchers have recently identified privacy breaches as one of the major problems with the randomization approach. Kargupta, Datta, Wang, and Sivakumar showed that the spectral properties of randomized data could help curious data miners separate noise from private data [12]. Based on random matrix theory, they proposed a filtering method to reconstruct private data from the randomized data set. They demonstrated that randomization preserves very little privacy in many cases.

The randomization approach also suffers in efficiency, as it puts a heavy load on (legal) data miners at run time (because of the distribution reconstruction) [3]. It is shown that the cost of mining a randomized data set is "well within an order of magnitude" with respect to that of mining the original data set.²

Another problem with the randomization approach is that it cannot be adapted to meet the needs of different kinds of users. A survey of Internet users (potential data providers) showed that there are 17% privacy fundamentalists, 56% privacy pragmatists, and 27% marginally concerned people.

²Although that work is based on association rule mining, we believe that the similarity between the randomization operators used in association rule mining and in data classification makes this efficiency concern inherent in the randomization approach.

Privacy fundamentalists are extremely concerned about privacy. Privacy pragmatists are concerned about privacy, but much less so than the fundamentalists. Marginally concerned people are generally willing to provide their private data. The randomization approach treats all data providers in the same manner and cannot handle the differing needs of different data providers.

We believe that the following are the reasons behind the above mentioned problems.

• The randomization operator is user-invariant. In a system, the same perturbation algorithm is applied to all data providers. The reason is that in a system using the randomization approach, the communication is one-way: from the data providers to the data miner. As such, a data provider cannot obtain any user-specified guidance on the randomization of its private data.

• The randomization operator is attribute-invariant. The same perturbation algorithm is applied to all attributes. The distribution of every attribute, no matter how useful (or useless) it is in the classification, is equally maintained in the perturbation. For example, suppose that each data tuple has three attributes: age, occupation, and salary. Also, assume that more than 95% of test data tuples can be correctly classified using age as the only attribute. The wisest decision is to maintain only the distribution of age, which is the most useful attribute in classification. If the randomization approach is taken, the private information disclosed on the other two attributes is unnecessary (from the perspective of a data provider) because it does not contribute much to building the classifier. However, again, due to the lack of communication between the data miner and the data providers, a data provider cannot learn the correlation between different attributes. Thus, a data provider has no choice but to employ an attribute-invariant operator.

These properties are inherent in the randomization approach, and hence motivate us to develop a new scheme allowing two-way communication between the data miner and the data providers. The two-way communication helps preserve private information while not incurring too much overhead. Thereby, we significantly improve the performance in terms of accuracy, privacy, and efficiency. We describe the new scheme in the next section.

4 Our New Scheme

In this section, we introduce our scheme and its basic components. We take a two-way communication approach that substantially improves the performance while incurring little overhead.

Figure 2: Our New Scheme

4.1 Description of our new scheme. Figure 2 depicts the infrastructure of our new scheme. Our scheme has two key components: perturbation guidance (PG) on the data miner (server) side and perturbation on the data provider side. Compared to the randomization approach, our scheme does not have a distribution recovery component. Instead, the classifier construction procedure is performed on the perturbed data tuples (R(t)) directly.

Our scheme is a three-step process. In the first step, the data miner negotiates a perturbation level k with each data provider. The larger k is, the more contribution R(t) will make to the classification process. The smaller k is, the more private information is preserved. Thus, a privacy fundamentalist can choose a small k to preserve its privacy, while a privacy-unconcerned data provider can choose a large k to contribute to the classification.

The second step is to transmit (perturbed) data from the data providers to the data miner. Since each data provider comes at a different time, this step can be considered an iterative process. In each stage, the data miner dispatches a reference (perturbation guidance) V_k to a data provider P_i. Here V_k depends on the perturbation level k negotiated by the data miner and the data provider P_i in the first step. Based on the received V_k, the perturbation component of P_i computes the perturbed data tuple R(t_i) from the original data tuple t_i. Then, P_i transmits R(t_i) to the perturbation guidance (PG) component of the data miner. PG then updates V_k based on R(t_i) and forwards R(t_i) to the classifier construction procedure. A curious data miner can also obtain R(t_i). In this case, the curious data miner uses a private data recovery algorithm to discover private information from R(t_i).

In the third step, the perturbed data tuples received by the data miner are used by the classifier construction procedure as the training data tuples. A classifier is built and delivered to the data miner.

4.2 Basic components. The basic components of our scheme are: a) the method of computing V_k, and b) the perturbation function R(·). Before presenting the details of the components, we first introduce some notation for the training data set.

Notation for the training data set. Suppose that T is a training data set consisting of m data tuples t_1, . . . , t_m. Each data tuple belongs to a predefined class, which is determined by its class label attribute a_0. In this paper, we consider labeled data from two classes, named C_0 and C_1. The class label attribute has two distinct values, 0 and 1, corresponding to classes C_0 and C_1, respectively. Besides the class label attribute, each data tuple has n attributes, named a_1, . . . , a_n.

The class label attribute of each data tuple is public (i.e., privacy-insensitive). All other attributes contain private information which needs to be preserved. We represent the private part of the training data set by an m × n matrix T = [t_1; . . . ; t_m] = [a_1, . . . , a_n].³ We use T_0 and T_1 to represent the matrices of data tuples that belong to classes C_0 and C_1, respectively. We denote the j-th attribute of t_i by ⟨T⟩_ij. An example of T is shown in Table 1. As we can see from the matrix, the first data tuple in T is t_1 = [20, 3.2, 18, 1]. It belongs to class C_1.

Table 1: An Example of a Training Data Set

T:
         a_1    a_2    a_3    a_4
  t_1     20    3.2     18      1
  ...    ...    ...    ...    ...
  t_m     40    2.5     13      0

a_0:
  t_1     1
  ...    ...
  t_m     0

Notation used in this paper is listed in Appendix F.

Computation of V_k. In our scheme, V_k is an estimate of the first k eigenvectors of T_0'T_0 − T_1'T_1. The justification of V_k will be provided in Section 5. Now we show how to update V_k when a new data tuple is received.

As we are considering the case where data tuples are iteratively fed to the data miner, the data miner keeps a copy of all received data tuples and updates it when a new data tuple is received. Let the current matrix of received data tuples be T*. When a new data tuple R(t) is received by the data miner, R(t) is appended to the bottom of T*.

³Here t_i and a_i are used somewhat ambiguously. In the context of the training data set, t_i is a data tuple and a_i is an attribute. In the context of the matrix, t_i represents a row vector in T and a_i represents a column vector in T.

Besides the received data tuples T*, the data miner also keeps track of two additional matrices: A*_0 = T*_0' T*_0 and A*_1 = T*_1' T*_1, where T*_0 and T*_1 are the matrices of received data tuples that belong to classes C_0 and C_1, respectively. Note that the update of A*_0 and A*_1 (after R(t) is received) does not need access to any data tuple other than the recently received R(t). Thus, we do not require the matrix of data tuples T to remain in main memory. In particular, if the class label attribute of R(t) is c (c ∈ {0, 1}), A*_c is updated as follows.

A*_c = A*_c + R(t)' R(t).    (4.3)

Given the updated A*_0 and A*_1, the computation of V_k is done in the following steps. Using singular value decomposition (SVD), we can decompose A* = A*_0 − A*_1 as

A* = V* Σ V*',    (4.4)

where Σ = diag(s_1, . . . , s_n) is a diagonal matrix with s_1 ≥ · · · ≥ s_n, s_i is the i-th eigenvalue of A*, and V* is an n × n unitary matrix composed of the eigenvectors of A*.

V_k is composed of the first k eigenvectors of A* (i.e., the first k column vectors of V*), which correspond to the k largest eigenvalues of A*. In particular, if V* = [v_1, . . . , v_n], then

V_k = [v_1, . . . , v_k].    (4.5)

The computing cost of updating V_k is addressed in Section 6.
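As a concrete illustration of the PG component, the following is a minimal Python/NumPy sketch of (4.3)-(4.5); the class and method names are ours, a single fixed k is assumed for brevity, and a symmetric eigendecomposition is used for the symmetric matrix A*, which matches the decomposition step of (4.4) for this sketch.

    import numpy as np

    class PerturbationGuidance:
        # Hypothetical sketch of the data miner's PG component (Section 4.2).
        def __init__(self, n_attributes, k):
            self.k = k
            self.A0 = np.zeros((n_attributes, n_attributes))  # A*_0 = T*_0' T*_0
            self.A1 = np.zeros((n_attributes, n_attributes))  # A*_1 = T*_1' T*_1

        def update(self, perturbed_tuple, class_label):
            # (4.3): A*_c <- A*_c + R(t)' R(t); only the newly received tuple is needed.
            t = np.asarray(perturbed_tuple, dtype=float).reshape(1, -1)
            if class_label == 0:
                self.A0 += t.T @ t
            else:
                self.A1 += t.T @ t

        def current_Vk(self):
            # (4.4)-(4.5): decompose A* = A*_0 - A*_1 and keep the k eigenvectors
            # associated with the largest eigenvalues, as stated in the text.
            A = self.A0 - self.A1
            eigvals, eigvecs = np.linalg.eigh(A)   # eigenvalues in ascending order
            order = np.argsort(eigvals)[::-1]      # reorder to descending
            return eigvecs[:, order[:self.k]]      # n x k matrix V_k

In the actual protocol the value of k used for a given provider is the one negotiated in the first step, so the miner would extract V_k for that provider's k from the same decomposition.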

Perturbation function R(·). Once a data provider obtains V_k from the data miner, the data provider employs a perturbation function R(·) on the original data tuple t. The result is a perturbed data tuple that is transmitted to the data miner. In our scheme, the perturbation function R(·) is defined as follows.

R(t) = t V_k V_k'.    (4.6)
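On the data provider's side, applying (4.6) amounts to projecting the private tuple onto the subspace spanned by the columns of V_k. A minimal sketch (the function name perturb is ours):

    import numpy as np

    def perturb(t, Vk):
        # (4.6): R(t) = t Vk Vk', with t a 1 x n row vector and Vk an n x k matrix.
        t = np.asarray(t, dtype=float).reshape(1, -1)
        return (t @ Vk @ Vk.T).ravel()

A data provider would call perturb with the V_k it received for its negotiated perturbation level and transmit the result, together with the public class label, back to the PG component of the data miner.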

We have now described our new scheme and its basic components. We are ready to analyze our scheme in terms of accuracy, privacy, and their tradeoff.

5 Theoretical Analysis of Our Scheme

In this section, we analyze our new scheme. We define metrics for accuracy and privacy and derive bounds on them, in order to provide guidelines for the tradeoff between these two measures and hence help system managers set parameters in practice.

5.1 Accuracy analysis. An accuracy measure should reflect the capability of the system to correctly classify objects in a given population. We will define an accuracy metric and derive a lower bound on it.

Accuracy metric. Before we formally define an accuracy metric, let us review the process of building a classifier and observe what factors may impact accuracy. Recall that the process to build a classifier takes the following steps:

• The training data set T is sampled from a population 𝒯.

• Due to the privacy concern, we perturb each data tuple t in the training data set by the perturbation function R(·) and produce a perturbed data tuple R(t). In the previous section, we described our selection of R(·).

• A classifier construction procedure (CCP) is used to mine the perturbed training data set in order to produce a classifier.

Figure 3: Building a Classifier

Figure 3 shows the workflow of such a process. From this process, it is clear that many factors impact the accuracy. They include the characteristics of the training data set, the algorithm used by the CCP, and the perturbation function used to perturb the training data tuples. Formally, we define a metric for accuracy as the probability that a test data tuple sampled from the population 𝒯 can be correctly classified by the classifier produced by the CCP. We denote the accuracy metric by l_a(T, R, CCP), where T is the sample from the population, R is the perturbation function, and CCP is the classifier construction procedure.

Lower bounds on accuracy. As we have observed, the accuracy measure of a given system depends on many factors. It remains an open problem to develop a systematic methodology allowing one to derive the value of the accuracy measure for a given system. Nevertheless, in this paper, we derive lower bounds on accuracy and discuss their implications in practice. We will focus on a system where the classifier construction procedure uses the tree augmented naïve Bayesian classification algorithm (TAN) [7] to build the classifier. Note that this algorithm is an improved version of the traditional naïve Bayesian classification algorithm, as the independence assumption has been relaxed. The independence assumption assumes that all attributes in a data tuple are conditionally independent of each other. TAN relaxes this assumption by allowing certain dependencies between two attributes. An overview of TAN can be found in Appendix A.

We will start with two lemmas. First, we define a matrix

A = T_0'T_0 − T_1'T_1.    (5.7)

Recall that T_0 and T_1 are the matrices of original training data tuples that belong to classes C_0 and C_1, respectively. We assume that the data miner maintains an estimate Ā of A, defined as follows.

Ā = T̄_0'T̄_0 − T̄_1'T̄_1.    (5.8)

Here T̄_0 and T̄_1 are the matrices of perturbed data tuples that belong to classes C_0 and C_1, respectively. Note that Ā = A* when all the (perturbed) data tuples in the training data set have been received by the data miner. We found that the accuracy of our system depends on the estimation error of A, which is defined as follows.

ε = max_{r,q∈[1,n]} ⟨A − Ā⟩_rq.    (5.9)

Formally, the following lemma can be established. Please refer to Appendix B for the proof of the lemma.

LEMMA 5.1. For a given system where the classifier construction procedure uses TAN to build the classifier, if the estimation error of A is bounded by ε, then the accuracy of the system is approximately bounded by

l_a(T, R, TAN) ≳ p − 0.4 v_r v_q · √( ε² / (9 · Σ_{i=1}^{v_r} Σ_{j=1}^{v_q} α_i^r α_j^q) ),    (5.10)

where TAN denotes the classifier construction procedure, v_r and v_q are the numbers of possible values of a_r and a_q, respectively, α_i^r and α_j^q are the i-th and j-th possible values of a_r and a_q, respectively, and p is the accuracy of the system when ε = 0.

The following lemma provides an upper bound on the estimation error of A.

LEMMA 5.2. Let the (k + 1)-th largest eigenvalue of A be s_{k+1}. Then ∀ r, q ∈ [1, n], we have

⟨A − Ā⟩_rq ≤ s_{k+1}.    (5.11)

For the proof, please refer to Appendix C. With this lemma, we can establish a lower bound on accuracy as follows.

THEOREM 5.1.

l_a(T, R, TAN) ≳ p − 0.4 v_r v_q · √( s_{k+1}² / (9 · Σ_{i=1}^{v_r} Σ_{j=1}^{v_q} α_i^r α_j^q) ).    (5.12)

The theorem is established by simply substituting (5.11) into (5.10). From this theorem, we can make the following observations.

• The accuracy measure increases monotonically with increasing p, which is the accuracy measure of the original system without the privacy preserving countermeasure. That is, the higher the accuracy in the original system, the higher the accuracy in the privacy preserving one.

• Given a training data set T, the accuracy measure increases monotonically with increasing perturbation level k (i.e., decreasing s_{k+1}). Thus, a privacy-unconcerned data provider may contribute more to the classification by choosing a large k to help improve the accuracy.
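As a quick numerical illustration of how the bound in Theorem 5.1 can be evaluated, the snippet below plugs in hypothetical values (two attributes taking values 1 through 10, baseline accuracy p, and a small residual eigenvalue s_{k+1}); these numbers are illustrative only and are not taken from the paper's experiments.

    import numpy as np

    # Hypothetical setting: attributes a_r and a_q each take values 1..10,
    # p is the accuracy without perturbation, s_k1 is s_{k+1}.
    p, s_k1 = 0.95, 0.1
    alpha_r = np.arange(1, 11)
    alpha_q = np.arange(1, 11)
    v_r, v_q = len(alpha_r), len(alpha_q)

    denom = 9.0 * np.sum(np.outer(alpha_r, alpha_q))   # 9 * sum_i sum_j alpha_i^r alpha_j^q
    bound = p - 0.4 * v_r * v_q * np.sqrt(s_k1**2 / denom)
    print(bound)   # approximate lower bound on l_a(T, R, TAN) per (5.12)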

5.2 Privacy analysis. A privacy measure is needed to properly quantify the privacy loss in the system. We will define a privacy metric and derive a lower bound on it.

Metrics defined in previous studies. Several privacy metrics have been proposed in the literature. We briefly review them below.

In [2] and [18], two interval-based metrics were proposed. Suppose that, based on the perturbed data tuple R(t), the value of attribute a_r of an original data tuple can be estimated to lie in an interval (α_0^r, α_1^r) with c% confidence. By [2], the privacy measure of attribute a_r at level c% is then defined as the minimum width of the interval, which is min(α_1^r − α_0^r). In [18], this definition is revised by using the Lebesgue measure, and hence becomes more mathematically rigorous.

In [1], an information-theoretic metric was defined as follows. Given R(t), the privacy loss of t is given by P(t|R(t)) = 1 − 2^{−I(t;R(t))}, where I(t; R(t)) is the mutual information of t and R(t). This metric measures the average amount of privacy leakage. In contrast, a metric that measures the worst-case privacy loss was proposed in [6].

Each of these metrics has its pros and cons. Most of them are defined in the context of the randomization approach.⁴ Nevertheless, we now propose a general privacy metric to quantify privacy leakage in a given privacy preserving data mining system.

A general privacy metric. Let the matrix of perturbed data tuples R(t) be T̄. Suppose that T̂ is the best approximation of T that a (curious) data miner can derive from T̄. We would like to measure the privacy leakage by the distance between T̂ and T. Formally, we have the following:

⁴To make a fair comparison with the randomization approach, we will use the interval-based metric in [2] as the privacy metric in the performance comparison in Section 6.

DEFINITION 5.1. Given a matrix norm ‖ · ‖, we define the privacy measure l_p as

l_p = min_{T̂} ‖T̂ − T‖,    (5.13)

where T̂ can be derived from T̄.

In the above definition, the distance between T̂ and T can always be formulated by a matrix norm (e.g., the Frobenius norm, also known as the Euclidean norm) of T̂ − T. A list of commonly used matrix norms is shown in Table 2.

Table 2: Matrix Norms of an m × n Matrix M

norm      definition
‖M‖_1     max_j Σ_{i=1}^{m} |⟨M⟩_ij|
‖M‖_2     square root of the largest eigenvalue of M'M
‖M‖_∞     max_i Σ_{j=1}^{n} |⟨M⟩_ij|
‖M‖_F     √( Σ_{i,j=1}^{m,n} ⟨M⟩_ij² )

In comparison with the definitions used in previous studies, our definition of the privacy metric is a general one. The reason is that one can use different matrix norms to satisfy different measurement requirements. For example, if one wants to measure the average privacy loss, the Frobenius norm may be used. On the other hand, if one wants to analyze the worst case, the 1-norm and ∞-norm may be selected.
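Under Definition 5.1, evaluating the privacy measure for a candidate approximation reduces to computing a matrix norm of the difference. The sketch below (the function name is ours; T and T_hat are assumed to be m × n NumPy arrays) covers the norms of Table 2.

    import numpy as np

    def privacy_distance(T_hat, T, norm="fro"):
        # Distance ||T_hat - T|| for the matrix norms listed in Table 2.
        D = np.asarray(T_hat, dtype=float) - np.asarray(T, dtype=float)
        if norm == "1":        # maximum absolute column sum
            return np.abs(D).sum(axis=0).max()
        if norm == "inf":      # maximum absolute row sum
            return np.abs(D).sum(axis=1).max()
        if norm == "2":        # spectral norm: largest singular value of D
            return np.linalg.norm(D, 2)
        return np.linalg.norm(D, "fro")   # Frobenius (Euclidean) norm

The privacy measure l_p is then the minimum of this distance over all approximations T̂ that the data miner can derive from T̄; by the immunity property discussed next, in our scheme the perturbed matrix itself already attains that minimum.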

Immunity property. Before we derive an analytical bound on the privacy measure, we first show that our system is immune to privacy intrusion attacks. That is, an illegal data miner cannot derive a better approximation of the original data tuples from the perturbed data tuples it receives. This property can be established as follows. In our scheme, we have

R(t) = t V_k V_k'.    (5.14)

Since V_k is composed of only the first k eigenvectors of A*, V_k V_k' is a singular matrix with det(V_k V_k') = 0. That is, V_k V_k' does not have an inverse matrix. Thus, t cannot be deduced from R(t) deterministically. Furthermore, we also claim that no better approximation of t can be derived from R(t). Let the Moore-Penrose pseudoinverse of V_k V_k' be (V_k V_k')†. Due to the property of the pseudoinverse, given R(t), R(t)(V_k V_k')† is the shortest-length least squares solution for t in (5.14). Since the columns of V_k are orthonormal, V_k V_k' is an orthogonal projection matrix, and (V_k V_k')† is equal to V_k V_k' itself. Thus, the least squares approximation of t based on R(t) is t̂ = t V_k V_k' V_k V_k' = R(t). That is, no better approximation of t can be derived from R(t).
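This argument can be checked numerically: because V_k V_k' is an orthogonal projection, its pseudoinverse equals itself and the least squares reconstruction of t from R(t) simply returns R(t). The sanity-check sketch below uses random data with assumed dimensions n = 5, k = 2.

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 5, 2

    # An orthonormal Vk (stand-in for the first k eigenvectors of A*).
    Vk, _ = np.linalg.qr(rng.normal(size=(n, k)))
    P = Vk @ Vk.T                        # projection matrix Vk Vk'

    t = rng.normal(size=(1, n))          # a private data tuple
    Rt = t @ P                           # perturbed tuple R(t) = t Vk Vk'

    P_pinv = np.linalg.pinv(P)
    print(np.allclose(P_pinv, P))        # True: (Vk Vk')^dagger = Vk Vk'
    print(np.allclose(Rt @ P_pinv, Rt))  # True: least squares recovery is just R(t)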

Analytical bounds on l_p. We now derive a set of lower bounds on the privacy measure, depending on the matrix norm used.

THEOREM 5.2. The lower bounds on the privacy measure are given below:

norm           lower bound on l_p
‖T̂ − T‖_1      δ_{k+1} / (max_i ‖a_i + ā_i‖_∞)
‖T̂ − T‖_2      ρ_{k+1}
‖T̂ − T‖_∞      δ_{k+1} / (max_i ‖a_i + ā_i‖_1)
‖T̂ − T‖_F      √( Σ_{i=k+1}^{n} ρ_i² )

where the matrix norms are defined in [9], ρ_i is the i-th singular value of T, s_i is the i-th eigenvalue of A, σ_i is the i-th eigenvalue of T_0'T_0, τ_i is the i-th eigenvalue of T_1'T_1, and

δ_i = 2 max{σ_i, τ_i} − s_i.    (5.15)

The proof of the theorem can be found in Appendix D. Theorem 5.2 establishes a quantitative relationship between the privacy measure and the singular values of T and the eigenvalues of A, T_0'T_0, and T_1'T_1. Note that the lower bounds also implicitly depend on the value of k. By inspecting these formulas, one can easily see that the smaller k is, the higher the lower bounds are. This is consistent with our intuition that a smaller k better protects privacy.

6 Performance Evaluation

In this section, we first demonstrate the effectiveness of our scheme by presenting simulation results on a simple data set. Then, we compare the performance of our scheme and the randomization approach in two areas: a) the tradeoff between accuracy and privacy, and b) runtime efficiency.

Figure 4: Simulation results. (a) Original data (attribute a_1 highlighted); (b) randomized data; (c) perturbed data in our scheme. Each panel plots the number of elements against the attribute value (1 to 10).

6.1 Data distribution before and after randomization and perturbation. We use a training data set of 1,000 data tuples, equally split between two classes. Each data tuple has 10 privacy-sensitive attributes a_1, . . . , a_10. Each attribute is independently and uniformly chosen from 1 to 10. The classification function is c = (a_1 > 5). That is, a data tuple is in class C_0 if a_1 ≤ 5. Otherwise, the data tuple is in class C_1. Figure 4(a) shows the distribution of the original data. Each line represents the distribution of an attribute. Figure 4(b) shows the distribution of the randomized data after uniform randomization [2]. Figure 4(c) shows the distribution of the perturbed data in our scheme when the perturbation level is k = 2.

Our scheme preserves the private information in a_2, . . . , a_10 better than the randomization approach. As we can see, the variance of a_2, . . . , a_10 after perturbation in our scheme is smaller than that of the randomized attributes under the randomization approach. On the other hand, our scheme leaves a_1 barely perturbed. The reason is that our scheme identifies a_1 as the only attribute that is needed in the classification. Thus, our scheme can identify the most useful attributes in classification and maintain the distribution of these attributes only.
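The simple data set described above can be reproduced in a few lines; the sketch below follows the stated setup (uniform attributes on 1..10, class determined by a_1 > 5) and is not the authors' original simulation code.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 1000, 10

    # Each of the 10 privacy-sensitive attributes is uniform on {1, ..., 10}.
    T = rng.integers(1, 11, size=(m, n))

    # Classification function c = (a_1 > 5): C_1 contains tuples with a_1 > 5.
    labels = (T[:, 0] > 5).astype(int)

    print(np.bincount(labels))   # roughly equal split between C_0 and C_1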

6.2 Tradeoff between accuracy and privacy. In order to make a fair comparison between the performance of our scheme and the randomization approach proposed in [2], we use exactly the same training and testing data sets as in [2]. The training data set consists of 100,000 data tuples. The testing data set consists of 5,000 data tuples. Each data tuple has nine attributes. Five widely varied classification functions are used to measure the tradeoff between accuracy and privacy in different circumstances. A detailed description of the data set and the classification functions is available in Appendix E. We use the same classification algorithm, the ID3 decision tree algorithm, as in [2]. In our scheme, we update the perturbation guidance V_k once 1,000 data tuples are received.

The comparison of the accuracy measure while fixing the privacy level at 75% [2] is shown in Figure 5. As we can see, our scheme outperforms the randomization approach on all five functions.

A comparison between the accuracy of our scheme and that of the randomization approach at different privacy levels is shown in Figure 6. Function 2 is used in this figure. From this figure, we can observe a tradeoff between accuracy and privacy. We can also observe the role of the perturbation level k in our scheme. In all cases, our scheme outperforms the randomization approach for a wide range of k values.

6.3 Runtime efficiency. As we have addressed in Section 3, it is shown in [3] that the cost of mining a randomized data set is "well within an order of magnitude" with respect to that of mining the original data set. In particular, the randomization approach proposed in [2] requires the original data distribution to be reconstructed before a decision tree classifier can be built on the randomized data set.

Figure 5: Comparison of performance: accuracy (%) of our new scheme and the randomization approach on classification functions Fn1 through Fn5 at a privacy level of 75%.

Figure 6: Performance on Function 2: privacy (%) versus accuracy (%) of our new scheme at perturbation levels k = 3 through k = 9, compared with the randomization approach.

The distribution reconstruction is a three-step process. We use the "ByClass" reconstruction algorithm as an example because, as stated in [2], it represents a tradeoff between accuracy and efficiency.

In the first step, split points are determined to partition the domain of each attribute into intervals. There is an estimated number of data points in each interval. The second step partitions data values into different intervals. For each attribute, the values of the randomized data are sorted to be associated with an interval. In the third step, for each attribute, the original distribution is reconstructed for each class separately. The main purpose of the first two steps is to accelerate the computation of the third step. The time complexity of the algorithm is O(mn + nv²), where m is the number of training data tuples, n is the number of private attributes in a data tuple, and v is the number of intervals on each attribute. It is assumed in [2] that 10 ≤ v ≤ 100.

Note that the overhead of the randomization approach occurs on the critical time path. Since the distribution reconstruction is not an incremental algorithm, it has to be performed after all data tuples are collected and before the classifier is constructed. Besides, the distribution reconstruction algorithm requires access to the whole training data set, some of which may not be stored in main memory. This problem may incur even more serious overhead.

In our scheme, the perturbed data tuples are directly used to construct the classifier. The only overhead incurred on the data miner is updating the perturbation guidance V_k. Note that this overhead is not on the critical time path. Instead, it occurs during the collection of data. The time complexity of the updating process is O(n²). A heuristic for the number of updates is between 10 and 100. Since the number of attributes is much smaller than the number of data tuples (i.e., n ≪ m) in data classification, the overhead of our scheme is significantly less than the overhead of the randomization approach.

The space complexity of the updating process in our scheme is also O(n²). That is, the received data tuples need not remain in main memory. Besides, our scheme is inherently incremental. These features make our scheme scalable to very large training data sets.

Since the perturbation level k is always a small number (e.g., k ≤ 10), the communication overhead (O(nk) per data provider) incurred by the two-way communication in our scheme is not significant.

7 Implementation

A prototype system for privacy preserving data classification has been implemented using our new scheme. The goal of the system is to deliver an online survey solution that preserves the privacy of survey respondents. The survey collector/analyzer and the survey respondents are modeled as the data miner and the data providers, respectively. The system consists of a perturbation guidance component on web servers and a data perturbation component on web browsers. Both components are implemented as custom plug-ins that one can easily install into existing systems. The architecture of our system is shown in Figure 7.

Figure 7: System implementation

As shown in the figure, there are three separate layers in our system: the user interface layer, the perturbation layer, and the web layer. The top layer, named the user interface layer, provides an interface to the data providers and the data miner. The middle layer, named the perturbation layer, realizes our privacy preserving scheme and exploits the bottom layer to transfer information. The bottom layer, named the web layer, consists of web servers and web browsers. As an important feature of our system, the details of data perturbation on the middle layer are transparent to both the data providers and the data miner.

8 Final Remarks

In this paper, we propose a new scheme for privacy preserving data classification. Compared with previous approaches, we introduce a two-way communication mechanism between the data miner and the data providers with little overhead. In particular, we let the data miner send perturbation guidance to the data providers. Using this intelligence, the data providers perturb the data tuples to be transmitted to the miner. As a result, our scheme achieves a better tradeoff between accuracy and privacy.

Our work is preliminary and many extensions can be made. In addition to using a similar approach in association rule mining [17], we are currently investigating how to apply the approach to clustering problems. We would also like to investigate a behavior model that is stronger than the honest-but-curious model and can still be dealt with by our scheme.

References

[1] D. Agrawal and C. C. Aggarwal, "On the design and quantification of privacy preserving data mining algorithms," in Proceedings of the 20th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM Press, 2001, pp. 247–255.

[2] R. Agrawal and R. Srikant, "Privacy-preserving data mining," in Proceedings of the 19th ACM SIGMOD International Conference on Management of Data. ACM Press, 2000, pp. 439–450.

[3] S. Agrawal, V. Krishnan, and J. R. Haritsa, "On addressing efficiency concerns in privacy-preserving mining," in Proceedings of the 9th International Conference on Database Systems for Advanced Applications. Springer Verlag, 2004, pp. 439–450.

[4] W. Du and Z. Zhan, "Building decision tree classifier on private data," in Proceedings of the IEEE International Conference on Privacy, Security and Data Mining. Australian Computer Society, Inc., 2002, pp. 1–8.

[5] W. Du and Z. Zhan, "Using randomized response techniques for privacy-preserving data mining," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, 2003, pp. 505–510.

[6] A. Evfimievski, J. Gehrke, and R. Srikant, "Limiting privacy breaches in privacy preserving data mining," in Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM Press, 2003, pp. 211–222.

[7] N. Friedman, D. Geiger, and M. Goldszmidt, "Bayesian network classifiers," Machine Learning, vol. 29, no. 2-3, pp. 131–163, 1997.

[8] O. Goldreich, Secure Multi-Party Computation. Cambridge University Press, 2004, ch. 7.

[9] G. H. Golub and C. F. Van Loan, Matrix Computations. Johns Hopkins University Press, 1996.

[10] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.

[11] M. Kantarcioglu and J. Vaidya, "Privacy preserving naïve Bayes classifier for horizontally partitioned data," in Workshop on Privacy Preserving Data Mining held in association with the 3rd IEEE International Conference on Data Mining, 2003.

[12] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar, "On the privacy preserving properties of random data perturbation techniques," in Proceedings of the 3rd IEEE International Conference on Data Mining. IEEE Press, 2003, pp. 99–106.

[13] Y. Lindell and B. Pinkas, "Privacy preserving data mining," in Proceedings of the 20th Annual International Cryptology Conference on Advances in Cryptology. Springer Verlag, 2000, pp. 36–54.

[14] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.

[15] A. C. Tamhane and D. D. Dunlop, Statistics and Data Analysis: From Elementary to Intermediate. Prentice Hall, 1999.

[16] J. Vaidya and C. Clifton, "Privacy preserving naïve Bayes classifier for vertically partitioned data," in Proceedings of the 4th SIAM Conference on Data Mining. SIAM Press, 2004, pp. 330–334.

[17] N. Zhang, S. Wang, and W. Zhao, "A new scheme on privacy preserving association rule mining," in Proceedings of the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases. Springer Verlag, 2004.

[18] Y. Zhu and L. Liu, "Optimal randomization for privacy preserving data mining," in Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, 2004, pp. 761–766.

Appendix A Overview of TAN

For the completeness of the paper, we briefly introduce the tree augmented naïve Bayesian classification algorithm (TAN). For details, please refer to [7]. Bayesian classification uses the posterior probability, Pr{t ∈ C_0 | t}, to predict the class membership of test samples. Naïve Bayesian classification makes an additional assumption, which is called "conditional independence" [10]. It assumes that the values of the attributes of a data tuple are independent of each other. By Bayes' theorem, the posterior probability of the class label of a data tuple t is

Pr{t ∈ C_i | t} = Pr{C_i} · Π_{r=1}^{n} Pr{a_r = α_i^r | C_i} / Π_{r=1}^{n} Pr{a_r = α_i^r}.    (1.16)

Recall that α_i^r is a possible value of a_r. If we represent a naïve Bayesian classifier by a dependency graph (Figure 8), we can see that no connection between two attributes is allowed.

Figure 8: The structure of a naïve Bayesian classifier

TAN relaxes this assumption by allowing each attribute to have one other attribute as its parent. That is, TAN allows the dependencies between attributes to form a tree topology. The dependency graph of a TAN classifier is shown in Figure 9.

Figure 9: An example of a TAN classifier

For a TAN classifier, the posterior probability of the class label of a data tuple t is

Pr{t ∈ C_i | t} = Pr{C_i} · Π_{r=1}^{n} Pr{a_r = α_i^r | C_i, a_q = α_j^q} / Π_{r=1}^{n} Pr{a_r = α_i^r | a_q = α_j^q},    (1.17)

where a_q is the parent node of a_r in the dependency tree. For example, in the example shown in Figure 9, the parent node of a_2 is a_1.

Appendix B Proof of Lemma 5.1

LEMMA 5.1. For a given system where the classifier construction procedure uses TAN to build the classifier, if the estimation error of A is bounded by ε, then the accuracy of the system is approximately bounded by

l_a(T, R, TAN) ≳ p − 0.4 v_r v_q · √( ε² / (9 · Σ_{i=1}^{v_r} Σ_{j=1}^{v_q} α_i^r α_j^q) ),    (2.18)

where TAN denotes the classifier construction procedure, v_r and v_q are the numbers of possible values of a_r and a_q, respectively, α_i^r and α_j^q are the i-th and j-th possible values of a_r and a_q, respectively, and p is the accuracy of the system when ε = 0.

Proof. Let p − l_a(T, R, TAN) be ε_t. In order to derive an upper bound on ε_t, we first consider ε_rq, defined as follows.

ε_rq = Σ_{i=1}^{v_r} Σ_{j=1}^{v_q} | Pr{C_0, a_r = α_i^r, a_q = α_j^q} − P̂r{C_0, a_r = α_i^r, a_q = α_j^q} |.    (2.19)

Here r, q ∈ [1, n], and P̂r{·} denotes the corresponding probability estimated from the perturbed data. Recall that v_r and v_q are the numbers of possible values of a_r and a_q, respectively.

REMARK B.1.

ε_t ≤ max_{r,q∈[1,n]} ε_rq.    (2.20)

For simplicity of discussion, we do not include the details here. An intuitive explanation of the remark is to consider a TAN classifier built on two attributes only. That is, the dependency graph consists of three nodes: C_i, a_r, and a_q. Σ_i Σ_j Pr{C_0, a_r = α_i^r, a_q = α_j^q} is the probability that a test sample is correctly classified by such a classifier built on the original data. Σ_i Σ_j P̂r{C_0, a_r = α_i^r, a_q = α_j^q} is the correct classification rate of such a classifier built on the perturbed data. Thus, ε_rq is the classification error of such a classifier due to the perturbation.

REMARK B.2. Let the upper bound on ε_rq be δ. Suppose that the value of every element of A can be determined from the perturbed data tuples with error ε. An estimate of δ is

δ ≈ 0.4 v_r v_q · √( ε² / (9 · Σ_{i=1}^{v_r} Σ_{j=1}^{v_q} α_i^r α_j^q) ).    (2.21)

We represent Pr{a_r = α_i^r} by Pr{α_i^r}; other probabilities are abbreviated similarly. Recall that an element ⟨A⟩_rq of A = T_0'T_0 − T_1'T_1 is

⟨A⟩_rq = Σ_{i=1}^{v_r} Σ_{j=1}^{v_q} α_i^r α_j^q ( Pr{C_0, α_i^r, α_j^q} − Pr{C_1, α_i^r, α_j^q} )    (2.22)
       = Σ_{i=1}^{v_r} Σ_{j=1}^{v_q} α_i^r α_j^q Pr{α_i^r, α_j^q} ( 2 Pr{C_0 | α_i^r, α_j^q} − 1 ).    (2.23)

Given r and q, let the error of Pr{C_0, a_r = α_i^r, a_q = α_j^q} after perturbation be ε_ij. Here we make the assumption (approximation) that all ε_ij are independently and identically distributed (i.i.d.) random variables, which we describe by a normal distribution. Since Σ_i Σ_j ε_ij is always 0, we assume that the mean of ε_ij is 0. Let the variance of ε_ij be σ_ij². Then ⟨A − Ā⟩_rq follows a normal distribution with mean µ = 0 and variance

σ² = Σ_{i=1}^{v_r} Σ_{j=1}^{v_q} α_i^r α_j^q · 4σ_ij².    (2.24)

Recall that the error of ⟨A⟩_rq is upper bounded by ε:

max_{r,q∈[1,n]} |⟨A − Ā⟩_rq| ≤ ε.    (2.25)

Thus, we have σ ≤ ε/3. Due to (2.24), σ_ij² can be bounded by

σ_ij² ≤ (1/4) · ε² / ( 9 · Σ_{i=1}^{v_r} Σ_{j=1}^{v_q} α_i^r α_j^q ).    (2.26)

Consider the average absolute deviation (i.e., mean deviation) of ε_ij, which is mean_ij |ε_ij|. We have ε_rq = v_r v_q · mean_ij |ε_ij|. Since the average absolute deviation of a normal distribution is about 0.8 times the standard deviation [15], we can approximate δ by

δ ≈ 0.4 v_r v_q · √( ε² / (9 · Σ_{i=1}^{v_r} Σ_{j=1}^{v_q} α_i^r α_j^q) ).    (2.27)

Appendix C Proof of Lemma 5.2

LEMMA 5.2. Let the (k + 1)-th largest eigenvalue of A be s_{k+1}. Then ∀ r, q ∈ [1, n], we have

⟨A − Ā⟩_rq ≤ s_{k+1}.    (3.28)

Proof. We consider the case where V_k is composed of the first k eigenvectors of A = T_0'T_0 − T_1'T_1. Thus, the bound derived is an approximation, as in practice the value of V_k is estimated from the current copy of T* (i.e., the currently received data tuples).⁵ Given the data tuple matrix T, recall that the matrix of perturbed data tuples is T̄:

T = [t_1; . . . ; t_m],    (3.29)
T̄ = [t_1 V_k V_k'; . . . ; t_m V_k V_k'] = T V_k V_k'.    (3.30)

T̄_0 = T_0 V_k V_k' and T̄_1 = T_1 V_k V_k' are the matrices of perturbed data tuples that belong to classes C_0 and C_1, respectively. Recall that Ā = T̄_0'T̄_0 − T̄_1'T̄_1. Let the singular value decomposition of A be A = V Σ V'. With some algebraic manipulation, we have

Ā = (T_0 V_k V_k')'(T_0 V_k V_k') − (T_1 V_k V_k')'(T_1 V_k V_k')    (3.31)
  = V_k V_k' (T_0'T_0 − T_1'T_1) V_k V_k'    (3.32)
  = V_k V_k' V Σ V' V_k V_k'    (3.33)
  = V_k Σ_k V_k',    (3.34)

where Σ_k = diag(s_1, . . . , s_k) is the top-left k × k block of Σ.

Due to the theory of singular value decomposition [9], Ā is the optimal rank-k approximation of A. We have

max_{r,q∈[1,n]} |⟨A − Ā⟩_rq| ≤ ‖A − Ā‖_2 = s_{k+1},    (3.35)

where ‖A − Ā‖_2 is the spectral norm of A − Ā.

⁵Fortunately, the estimated V_k converges well to its accurate value. Besides, the convergence is fairly fast in most cases.

Appendix D Proof of Theorem 5.2

THEOREM 5.2. The lower bounds on the privacy measure are given below:

norm           lower bound on l_p
‖T̂ − T‖_1      δ_{k+1} / (max_i ‖a_i + ā_i‖_∞)
‖T̂ − T‖_2      ρ_{k+1}
‖T̂ − T‖_∞      δ_{k+1} / (max_i ‖a_i + ā_i‖_1)
‖T̂ − T‖_F      √( Σ_{i=k+1}^{n} ρ_i² )

where the matrix norms are defined in [9], ρ_i is the i-th singular value of T, s_i is the i-th eigenvalue of A, σ_i is the i-th eigenvalue of T_0'T_0, τ_i is the i-th eigenvalue of T_1'T_1, and

δ_i = 2 max{σ_i, τ_i} − s_i.    (4.36)

Proof. Recall that T̂ is the least squares approximation of T that can be derived from T̄. For simplicity of discussion, we consider T̂ equal to T̄. We first show simple lower bounds on the spectral norm ‖T̄ − T‖_2 and the Frobenius norm ‖T̄ − T‖_F. After that, we derive lower bounds on the maximum absolute column sum norm (i.e., the 1-norm) ‖T̄ − T‖_1 and the maximum absolute row sum norm (i.e., the ∞-norm) ‖T̄ − T‖_∞ based on the special structure of T̄. Please refer to Table 5 for the notation used in the proof. Recall that T̄ = T V_k V_k'. Since the rank of V_k is k, we have rank(T̄) ≤ k. Let the optimal rank-k approximation of T be T_k. We have

‖T̄ − T‖_2 ≥ ‖T_k − T‖_2 = ρ_{k+1},    (4.37)

and

‖T̄ − T‖_F ≥ ‖T_k − T‖_F = √( ρ_{k+1}² + · · · + ρ_n² ).    (4.38)

As we have proved in Appendix C, Ā is the optimal rank-k approximation of A. Thus, we have

‖A − Ā‖_2 = s_{k+1}.    (4.39)

Note that T̄_0'T̄_0 and T̄_1'T̄_1 are rank-k approximations of T_0'T_0 and T_1'T_1, respectively. Let the optimal rank-k approximations of T_0'T_0 and T_1'T_1 be (T_0'T_0)_k and (T_1'T_1)_k, respectively. We have

‖T_0'T_0 − T̄_0'T̄_0‖_2 ≥ ‖(T_0'T_0)_k − T_0'T_0‖_2 = σ_{k+1},    (4.40)

and

‖T_1'T_1 − T̄_1'T̄_1‖_2 ≥ ‖(T_1'T_1)_k − T_1'T_1‖_2 = τ_{k+1}.    (4.41)

Note that all matrix norms satisfy the triangle inequality. Let ∆ = T'T − T̄'T̄. Thus,

‖∆‖_2 = ‖2 T_0'T_0 − 2 T̄_0'T̄_0 − (A − Ā)‖_2    (4.42)
      ≥ ‖2 T_0'T_0 − 2 T̄_0'T̄_0‖_2 − ‖A − Ā‖_2    (4.43)
      ≥ 2σ_{k+1} − s_{k+1}.    (4.44)

Similarly, we have

‖∆‖_2 ≥ 2 max{σ_{k+1}, τ_{k+1}} − s_{k+1}.    (4.45)

Note that for any matrix M, ‖M‖_2² ≤ ‖M‖_1 ‖M‖_∞, where ‖M‖_1 is the maximum absolute column sum of M and ‖M‖_∞ is the maximum absolute row sum of M. Since ∆ is a symmetric matrix, the 1-norm of ∆ is equal to the ∞-norm of ∆. Thus, we have

‖∆‖_1 = ‖∆‖_∞ ≥ ‖∆‖_2 ≥ 2 max{σ_{k+1}, τ_{k+1}} − s_{k+1}.    (4.46)

Due to the definition of the 1-norm, we can always find a column of T with index i ∈ [1, n] such that

‖∆‖_1 = ‖T'a_i − T̄'ā_i‖_1    (4.47)
      ≲ ‖(T − T̄)'(a_i + ā_i)‖_1    (4.48)
      ≤ ‖T − T̄‖_∞ ‖a_i + ā_i‖_1.    (4.49)

Similarly, we can derive that

‖∆‖_1 ≤ ‖T − T̄‖_1 ‖a_i + ā_i‖_∞.    (4.50)

Thus, the lower bounds on ‖T̄ − T‖_1 and ‖T̄ − T‖_∞ are as follows:

‖T̄ − T‖_1 ≥ (2 max{σ_{k+1}, τ_{k+1}} − s_{k+1}) / (max_i ‖a_i + ā_i‖_∞),    (4.51)

and

‖T̄ − T‖_∞ ≥ (2 max{σ_{k+1}, τ_{k+1}} − s_{k+1}) / (max_i ‖a_i + ā_i‖_1).    (4.52)

Appendix E Table of Attributes and Classification Functions

Table 3: Description of Attributes

Attribute     Description
salary        uniformly distributed on [20k, 150k]
commission    if salary ≥ 75k then commission = 0; else uniformly distributed on [10k, 75k]
age           uniformly distributed on [20, 80]
elevel        uniformly chosen from 0..4
car           uniformly chosen from 1..20
zipcode       uniformly chosen from 0..9
hvalue        uniformly distributed on [zipcode × 50k, zipcode × 100k]
hyears        uniformly distributed on [1, 30]
loan          uniformly distributed on [0, 500k]

Table 4: Classification Functions

Function    Condition for t ∈ C_0
F1          (age < 40) ∨ (age > 60)
F2          ((age < 40) ∧ (50k ≤ salary ≤ 100k)) ∨ ((40 ≤ age < 60) ∧ (75k < salary < 125k)) ∨ ((age ≥ 60) ∧ (25k < salary < 75k))
F3          ((age < 40) ∧ (((elevel ∈ [0, 1]) ∧ (25k ≤ salary ≤ 75k)) ∨ ((elevel ∈ [2, 3]) ∧ (50k ≤ salary ≤ 100k)))) ∨ ((40 ≤ age < 60) ∧ (((elevel ∈ [1, 3]) ∧ (50k ≤ salary ≤ 100k)) ∨ ((elevel = 4) ∧ (75k ≤ salary ≤ 125k)))) ∨ ((age ≥ 60) ∧ (((elevel ∈ [2, 4]) ∧ (50k ≤ salary ≤ 100k)) ∨ ((elevel = 1) ∧ (25k ≤ salary ≤ 75k))))
F4          (0.67 × (salary + commission) − 0.2 × loan − 10k) > 0
F5          (0.67 × (salary + commission) − 0.2 × loan + 0.2 × equity − 10k) > 0, where equity = 0.1 × hvalue × max(hyears − 20, 0)
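For reference, the attribute generator and the first two classification functions can be sketched as follows. This is a hypothetical re-implementation following Tables 3 and 4 (monetary values written in full, e.g., 20k = 20000), not the original benchmark generator of [2].

    import numpy as np

    rng = np.random.default_rng(0)

    def gen_tuple():
        # Attributes per Table 3.
        salary = rng.uniform(20_000, 150_000)
        commission = 0.0 if salary >= 75_000 else rng.uniform(10_000, 75_000)
        age = rng.uniform(20, 80)
        elevel = rng.integers(0, 5)
        car = rng.integers(1, 21)
        zipcode = rng.integers(0, 10)
        hvalue = rng.uniform(zipcode * 50_000, zipcode * 100_000)
        hyears = rng.uniform(1, 30)
        loan = rng.uniform(0, 500_000)
        return dict(salary=salary, commission=commission, age=age, elevel=elevel,
                    car=car, zipcode=zipcode, hvalue=hvalue, hyears=hyears, loan=loan)

    def f1(t):
        # F1: t is in C_0 if age < 40 or age > 60.
        return t["age"] < 40 or t["age"] > 60

    def f2(t):
        # F2: the age/salary bands from Table 4.
        return ((t["age"] < 40 and 50_000 <= t["salary"] <= 100_000) or
                (40 <= t["age"] < 60 and 75_000 < t["salary"] < 125_000) or
                (t["age"] >= 60 and 25_000 < t["salary"] < 75_000))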

Appendix F Table of Notation

Table 5: Notation

notation    definition
m           number of data tuples in the training data set
n           number of attributes besides the class label attribute
t           a data tuple in the original training data set
R(t)        a data tuple in the perturbed training data set
a_r         the r-th attribute of an original data tuple
α_i^r       the i-th value of a_r
v_r         the number of possible values of a_r
ā_r         the r-th attribute of a perturbed data tuple
C_0         class 0
C_1         class 1
T           matrix of the original training data set
T_0         matrix of original data tuples that belong to C_0
T_1         matrix of original data tuples that belong to C_1
T̄           matrix of the perturbed training data set
T̄_0         matrix of perturbed data tuples that belong to C_0
T̄_1         matrix of perturbed data tuples that belong to C_1
T*          current matrix of received (perturbed) data tuples
T*_0        matrix of received data tuples that belong to C_0
T*_1        matrix of received data tuples that belong to C_1
A           T_0'T_0 − T_1'T_1
Ā           T̄_0'T̄_0 − T̄_1'T̄_1
A*_0        T*_0' T*_0
A*_1        T*_1' T*_1
A*          A*_0 − A*_1
ε           max_{r,q∈[1,n]} ⟨A − Ā⟩_rq
V*          matrix composed of the eigenvectors of A*
V_k         matrix composed of the first k eigenvectors of A*
∆           T'T − T̄'T̄
ρ_i         i-th singular value of T
s_i         i-th eigenvalue of A*
Σ           diag(s_1, · · · , s_n)
σ_i         i-th eigenvalue of T_0'T_0
τ_i         i-th eigenvalue of T_1'T_1
δ_i         2 max{σ_i, τ_i} − s_i
⟨·⟩_rq      the element of a matrix with indices r and q