
Towards A Differential Privacy and Utility Preserving Machine Learning Classifier

Kato Mivule, Claude Turner, and Soo-Yeon Ji

Computer Science Department, Bowie State University

Complex Adaptive Systems 2012 – Washington DC USA, November 14-16


Outline

Introduction
Related Work
Essential Terms
Methodology
Results
Conclusion


Introduction

Entities transact in ‘big data’ containing personally identifiable information (PII).

Organizations are bound by federal and state law to ensure data privacy.

In the process of achieving privacy, the utility of the privatized dataset diminishes.

Achieving a balance between privacy and utility is an ongoing problem.

Therefore, we investigate a differential privacy preserving machine learning classification approach that seeks an acceptable level of utility.


Related Work

There is growing interest in privacy preserving data mining solutions that provide a balance between data privacy and utility.

Kifer and Gehrke (2006) did a broad study of enhanced data utility in privacy preserving data publishing by using statistical approaches.

Wong (2007) described how achieving globally optimal privacy while maintaining utility is an NP-hard problem.

Krause and Horvitz (2010) noted that finding trade-offs between privacy and utility remains an NP-hard problem.

Muralidhar and Sarathy (2011) showed that differential privacy provides strong privacy guarantees but utility is still a problem due to noise levels.

Finding the optimal balance between privacy and utility remains a challenge, even with differential privacy.


Data Utility versus Privacy

Data utility is the extent to which a published dataset is useful to its consumers.

In the course of a data privacy process, the original data loses statistical value, even as privacy guarantees are gained.


Image Source: Kenneth Corbin/Internet News.


Objective

Achieving an optimal balance between data privacy and utility remains an ongoing challenge.

Achieving this optimality is highly desired and remains the goal of our investigation.


Image Source: Wikipedia, on Confidentiality.


Ensemble Classification

A machine learning process in which a collection of independently trained classifiers is combined to achieve better prediction.

For example, individually trained decision trees can be joined to make more accurate predictions, as sketched below.
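A minimal sketch of the idea in Python with scikit-learn (the study itself used MATLAB; the dataset here is a synthetic stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class dataset standing in for real data.
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_classes=3, random_state=0)

# Three independently trained trees of different depths, merged by majority vote.
trees = [(f"tree_depth_{d}", DecisionTreeClassifier(max_depth=d, random_state=0))
         for d in (1, 2, 3)]
ensemble = VotingClassifier(estimators=trees, voting="hard").fit(X, y)
print("training accuracy:", ensemble.score(X, y))
```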


AdaBoost Ensemble – Adaptive Boosting

Proposed by Freund and Schapire (1995), AdaBoost runs several iterations, adding weak learners to create a powerful learner and adjusting weights to focus on data misclassified in earlier iterations.

The classification error in the AdaBoost ensemble is computed as below.
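In the standard AdaBoost formulation, the weighted classification error of the weak learner $h_t$ at round $t$, over $N$ training examples with weights $w_t(i)$, is:

\[
\varepsilon_t = \sum_{i=1}^{N} w_t(i)\,\mathbf{1}\!\left[h_t(x_i) \neq y_i\right]
\]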


AdaBoost Ensemble (Cont’d)

The AdaBoost ensemble computes as follows.
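In the standard binary formulation (labels $y_i \in \{-1,+1\}$), each round computes a learner weight from the error, re-weights the training examples, and the final strong classifier is a weighted vote:

\[
\alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t},
\qquad
w_{t+1}(i) = \frac{w_t(i)\,e^{-\alpha_t y_i h_t(x_i)}}{Z_t},
\qquad
H(x) = \operatorname{sign}\Big(\sum_{t=1}^{T}\alpha_t h_t(x)\Big)
\]

where $Z_t$ normalizes the weights so they sum to one.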


Differential Privacy

Proposed by Cynthia Dwork (2006).

Imposes confidentiality by returning perturbed responses to database queries, so that:

The end user of the database cannot know if a data item has been altered.

An attacker cannot gain information about any data item in the database.


Differential Privacy (Cont’d)

ε-differential privacy is satisfied if the results of a query run on databases D1 and D2 are probabilistically similar, meeting the following condition:

\[
P\left[qn(D1) \in R\right] \;\leq\; e^{\varepsilon} \cdot P\left[qn(D2) \in R\right]
\]

where:

D1 and D2 are the two databases, differing in at most one record.
P is the probability of the perturbed query results on D1 and D2.
qn() is the privacy granting procedure (perturbation).
qn(D1) is the privacy granting procedure on query results from database D1.
qn(D2) is the privacy granting procedure on query results from database D2.
R is the set of possible perturbed query results from the databases D1 and D2.
e^ε is the exponential epsilon value bounding the ratio of the two probabilities.
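In practice, ε-differential privacy for numeric queries is commonly achieved with the Laplace mechanism, which adds Laplace noise scaled to the query’s sensitivity divided by ε (the Laplace noise level is revisited in the conclusion). A minimal Python sketch, assuming a numeric query over values bounded in [0, 100]:

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Perturb a numeric query answer with Laplace(0, sensitivity/epsilon) noise.

    Smaller epsilon means larger noise and stronger privacy.
    """
    rng = rng or np.random.default_rng()
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Hypothetical example: privatize a mean donation. With donations in [0, 100],
# changing one of n records shifts the mean by at most 100/n.
donations = np.array([25.0, 60.0, 90.0, 15.0, 75.0])
true_mean = donations.mean()
private_mean = laplace_mechanism(true_mean, sensitivity=100.0 / len(donations),
                                 epsilon=0.5)
print(true_mean, private_mean)
```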


Methodology

We utilized a publicly available Barack Obama 2008 campaign donations dataset.

The data set contained 17,695 records of original unperturbed data.

Two attributes, the donation amount and income status, are utilized to classify data into three classes.

The three classes are low income, middle income, and high income, for donations of $1 to $49, $50 to $80, and $81 and above, respectively.

To validate our approach, 50 percent of the dataset was used for training and the remainder for testing, for both the original and privatized datasets.

An Oracle database is queried via the MATLAB ODBC connector; MATLAB is used for the differential privacy procedure and for machine learning classification.
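As a sketch of this preprocessing in Python (the study used MATLAB and Oracle; the donation values here are randomly generated stand-ins):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def income_class(donation):
    """Map a donation amount to the study's three income classes."""
    if donation <= 49:
        return "low"
    if donation <= 80:
        return "middle"
    return "high"

# Stand-in for the 17,695-record campaign donations dataset.
rng = np.random.default_rng(0)
donations = rng.uniform(1, 200, size=17695)
labels = np.array([income_class(d) for d in donations])

# 50 percent for training, the remainder for testing, as in the study.
X_train, X_test, y_train, y_test = train_test_split(
    donations.reshape(-1, 1), labels, test_size=0.5, random_state=0)
```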


Results

Essential statistical traits of the original and differentially private datasets, a necessary requirement for publishing privatized datasets, are preserved.

As depicted, the mean, standard deviation, and variance of the original and differentially private datasets remained the same.
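This property can be checked directly; a quick sketch with hypothetical data and noise parameters (zero-mean Laplace noise leaves the mean unchanged in expectation and, at small scale, barely moves the spread):

```python
import numpy as np

rng = np.random.default_rng(0)
original = rng.uniform(1, 200, size=17695)                     # stand-in data
perturbed = original + rng.laplace(0.0, 1.0, size=original.shape)

# Compare summary statistics before and after perturbation.
for name, data in (("original", original), ("perturbed", perturbed)):
    print(f"{name}: mean={data.mean():.2f} std={data.std():.2f} var={data.var():.2f}")
```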


Results (Cont’d)

There is a strong positive covariance of 1060.8 between the two datasets, which means that they increase together, as illustrated below:


Results (Cont’d)

There is almost no correlation (0.0054) between the original and differentially privatized datasets.

This indicates some privacy assurance: an attacker working only with the privatized dataset would have difficulty correctly inferring any alterations.


Results (Cont’d)

After applying differential privacy, the AdaBoost ensemble classifier is run.

The donors’ dataset was classified into Low, Middle, and High income, for donations of 0 to 50, 51 to 80, and 81 to 100, respectively.

The same classification scheme is applied to the perturbed dataset to investigate whether the classifier would categorize the perturbed data correctly, as sketched below.
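A minimal sketch of this step, with scikit-learn’s AdaBoostClassifier standing in for the MATLAB ensemble tooling used in the study (data and noise parameters are hypothetical):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
donations = rng.uniform(1, 100, size=17695)
labels = np.digitize(donations, bins=[50, 80])        # 0=low, 1=middle, 2=high

# Hypothetical differentially private copy via Laplace perturbation;
# the original labels are kept, as described above.
private = donations + rng.laplace(0.0, 10.0, size=donations.shape)

for name, X in (("original", donations), ("privatized", private)):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X.reshape(-1, 1), labels, test_size=0.5, random_state=0)
    # Boost shallow decision trees, the default weak learner.
    model = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
    print(name, "test error:", round(1.0 - model.score(X_te, y_te), 3))
```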


Results (Cont’d)

The training dataset from the original data showed that the classification error dropped from 0.25 to 0 as the number of weak decision tree learners increased.

On the differentially private training data, however, the classification error dropped only from 0.588 to 0.58.


Results (Cont’d)

When the same procedure is applied to the test dataset of the original data, the classification error dropped from 0.03 to 0.

However, when the procedure is performed on the differentially private data, the error rate did not change even as the number of weak decision tree learners increased.


Conclusion

In this study, we found that while differential privacy might guarantee strong confidentiality, providing data utility remains a challenge.

However, this study is instructive in a variety of ways:

The level of Laplace noise does affect the classification error.

Increasing the number of weak learners does not significantly reduce the error on the privatized data.

Adjusting the Laplace noise parameter, ε, is essential for further study. However, accurate classification means loss of privacy.

Tradeoffs must be made between privacy and utility.

We plan on investigating optimization approaches for such tradeoffs.


Questions?

Contact: Kato Mivule: [email protected]

Thank You.
