
Towards A Differential Privacy and Utility Preserving Machine Learning Classifier

Kato Mivule, Claude Turner, and Soo-Yeon Ji

Computer Science Department, Bowie State University

Complex Adaptive Systems 2012 – Washington DC USA, November 14-16


Outline

Introduction
Related Work
Essential Terms
Methodology
Results
Conclusion


Introduction

Entities transact in ‘big data’ containing personally identifiable information (PII).

Organizations are bound by federal and state law to ensure data privacy.

In the process of achieving privacy, the utility of the privatized dataset diminishes.

Achieving a balance between privacy and utility is an ongoing problem.

Therefore, we investigate a differential privacy preserving machine learning classification approach that seeks an acceptable level of utility.


Related Work

There is growing interest in privacy preserving data mining solutions that provide a balance between data privacy and utility.

Kifer and Gehrke (2006) did a broad study of enhanced data utility in privacy preserving data publishing by using statistical approaches.

Wong (2007) described how achieving globally optimal privacy while maintaining utility is an NP-hard problem.

Krause and Horvitz (2010) noted that finding trade-offs between privacy and utility remains an NP-hard problem.

Muralidhar and Sarathy (2011) showed that differential privacy provides strong privacy guarantees but utility is still a problem due to noise levels.

Finding the optimal balance between privacy and utility remains a challenge, even with differential privacy.


Data Utility versus Privacy

Data utility is the extent to which a published dataset is useful to its consumers.

In the course of a data privacy process, the original data loses statistical value, even as privacy guarantees are gained.


Image Source: Kenneth Corbin/Internet News.


Objective

Achieving an optimal balance between data privacy and utility remains an ongoing challenge.

Achieving this optimality is highly desired and remains the goal of our investigation.


Image Source: Wikipedia, on Confidentiality.


Ensemble Classification

A machine learning process in which a collection of independently trained classifiers is combined to achieve better prediction.

For example, individually trained decision trees can be joined to make more accurate predictions, as sketched below.
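A minimal sketch of the idea in Python with scikit-learn (the study itself used MATLAB; the dataset here is a synthetic stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class dataset standing in for real data.
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_classes=3, random_state=0)

# Three independently trained trees of different depths, merged by majority vote.
trees = [(f"tree_depth_{d}", DecisionTreeClassifier(max_depth=d, random_state=0))
         for d in (1, 2, 3)]
ensemble = VotingClassifier(estimators=trees, voting="hard").fit(X, y)
print("training accuracy:", ensemble.score(X, y))
```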


AdaBoost Ensemble – Adaptive Boosting

Proposed by Freund and Schapire (1995), AdaBoost runs several iterations, adding weak learners to create a powerful learner and adjusting weights to focus on data misclassified in earlier iterations.

The classification error in the AdaBoost ensemble is computed as below.
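In the standard AdaBoost formulation, the weighted classification error of the weak learner $h_t$ at round $t$, over $N$ training examples with weights $w_t(i)$, is:

\[
\varepsilon_t = \sum_{i=1}^{N} w_t(i)\,\mathbf{1}\!\left[h_t(x_i) \neq y_i\right]
\]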


AdaBoost Ensemble (Cont’d)

The AdaBoost ensemble computes as follows.
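In the standard binary formulation (labels $y_i \in \{-1,+1\}$), each round computes a learner weight from the error, re-weights the training examples, and the final strong classifier is a weighted vote:

\[
\alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t},
\qquad
w_{t+1}(i) = \frac{w_t(i)\,e^{-\alpha_t y_i h_t(x_i)}}{Z_t},
\qquad
H(x) = \operatorname{sign}\Big(\sum_{t=1}^{T}\alpha_t h_t(x)\Big)
\]

where $Z_t$ normalizes the weights so they sum to one.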


Differential Privacy

Proposed by Cynthia Dwork (2006).

Imposes confidentiality by returning perturbed responses to database queries, so that:

The end user of the database cannot know if a data item has been altered.

An attacker cannot gain information about any data item in the database.


Differential Privacy (Cont’d)

ε-differential privacy is satisfied if the results of a query run on databases D1 and D2 are probabilistically similar, meeting the following condition:

\[
P\left[qn(D1) \in R\right] \;\leq\; e^{\varepsilon} \cdot P\left[qn(D2) \in R\right]
\]

where:

D1 and D2 are the two databases, differing in at most one record.
P is the probability of the perturbed query results on D1 and D2.
qn() is the privacy granting procedure (perturbation).
qn(D1) is the privacy granting procedure on query results from database D1.
qn(D2) is the privacy granting procedure on query results from database D2.
R is the set of possible perturbed query results from the databases D1 and D2.
e^ε is the exponential epsilon value bounding the ratio of the two probabilities.
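In practice, ε-differential privacy for numeric queries is commonly achieved with the Laplace mechanism, which adds Laplace noise scaled to the query’s sensitivity divided by ε (the Laplace noise level is revisited in the conclusion). A minimal Python sketch, assuming a numeric query over values bounded in [0, 100]:

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Perturb a numeric query answer with Laplace(0, sensitivity/epsilon) noise.

    Smaller epsilon means larger noise and stronger privacy.
    """
    rng = rng or np.random.default_rng()
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Hypothetical example: privatize a mean donation. With donations in [0, 100],
# changing one of n records shifts the mean by at most 100/n.
donations = np.array([25.0, 60.0, 90.0, 15.0, 75.0])
true_mean = donations.mean()
private_mean = laplace_mechanism(true_mean, sensitivity=100.0 / len(donations),
                                 epsilon=0.5)
print(true_mean, private_mean)
```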


Methodology

We utilized a publicly available Barack Obama 2008 campaign donations dataset.

The data set contained 17,695 records of original unperturbed data.

Two attributes, the donation amount and income status, are utilized to classify data into three classes.

The three classes are low income, middle income, and high income, for donations of $1 to $49, $50 to $80, and $81 and above, respectively.

To validate our approach, 50 percent of the dataset was used for training and the remainder for testing, for both the original and privatized datasets.

An Oracle database is queried via the MATLAB ODBC connector; MATLAB is used for the differential privacy procedure and for machine learning classification.
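As a sketch of this preprocessing in Python (the study used MATLAB and Oracle; the donation values here are randomly generated stand-ins):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def income_class(donation):
    """Map a donation amount to the study's three income classes."""
    if donation <= 49:
        return "low"
    if donation <= 80:
        return "middle"
    return "high"

# Stand-in for the 17,695-record campaign donations dataset.
rng = np.random.default_rng(0)
donations = rng.uniform(1, 200, size=17695)
labels = np.array([income_class(d) for d in donations])

# 50 percent for training, the remainder for testing, as in the study.
X_train, X_test, y_train, y_test = train_test_split(
    donations.reshape(-1, 1), labels, test_size=0.5, random_state=0)
```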


Results

Essential statistical traits of the original and differentially private datasets, a necessary requirement for publishing privatized datasets, are preserved.

As depicted, the mean, standard deviation, and variance of the original and differentially private datasets remained the same.
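This property can be checked directly; a quick sketch with hypothetical data and noise parameters (zero-mean Laplace noise leaves the mean unchanged in expectation and, at small scale, barely moves the spread):

```python
import numpy as np

rng = np.random.default_rng(0)
original = rng.uniform(1, 200, size=17695)                     # stand-in data
perturbed = original + rng.laplace(0.0, 1.0, size=original.shape)

# Compare summary statistics before and after perturbation.
for name, data in (("original", original), ("perturbed", perturbed)):
    print(f"{name}: mean={data.mean():.2f} std={data.std():.2f} var={data.var():.2f}")
```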


Results (Cont’d)

There is a strong positive covariance of 1060.8 between the two datasets, which means that they increase together, as illustrated below:


Results (Cont’d)

There is almost no correlation (0.0054) between the original and differentially privatized datasets.

This indicates some privacy assurance: an attacker working only with the privatized dataset would have difficulty correctly inferring any alterations.


Results (Cont’d)

After applying differential privacy, the AdaBoost ensemble classifier is run.

The donors’ dataset was classified into Low, Middle, and High income, for donations of 0 to 50, 51 to 80, and 81 to 100, respectively.

The same classification scheme is applied to the perturbed dataset to investigate whether the classifier would categorize the perturbed data correctly, as sketched below.
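A minimal sketch of this step, with scikit-learn’s AdaBoostClassifier standing in for the MATLAB ensemble tooling used in the study (data and noise parameters are hypothetical):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
donations = rng.uniform(1, 100, size=17695)
labels = np.digitize(donations, bins=[50, 80])        # 0=low, 1=middle, 2=high

# Hypothetical differentially private copy via Laplace perturbation;
# the original labels are kept, as described above.
private = donations + rng.laplace(0.0, 10.0, size=donations.shape)

for name, X in (("original", donations), ("privatized", private)):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X.reshape(-1, 1), labels, test_size=0.5, random_state=0)
    # Boost shallow decision trees, the default weak learner.
    model = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
    print(name, "test error:", round(1.0 - model.score(X_te, y_te), 3))
```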


Results (Cont’d)

The training dataset from the original data showed that the classification error dropped from 0.25 to 0 as the number of weak decision tree learners increased.

On the differentially private training data, however, the classification error dropped only from 0.588 to 0.58.


Results (Cont’d)

When the same procedure is applied to the test dataset of the original data, the classification error dropped from 0.03 to 0.

However, when the procedure is performed on the differentially private data, the error rate did not change even as the number of weak decision tree learners increased.


Conclusion

In this study, we found that while differential privacy might guarantee strong confidentiality, providing data utility remains a challenge.

However, this study is instructive in a variety of ways:

The level of Laplace noise does affect the classification error.

Increasing the number of weak learners does not significantly reduce the error on the privatized data.

Adjusting the Laplace noise parameter, ε, is essential for further study. However, accurate classification means loss of privacy.

Tradeoffs must be made between privacy and utility.

We plan on investigating optimization approaches for such tradeoffs.


Questions?

Contact: Kato Mivule: [email protected]

Thank You.
