Chair of Software Engineering for Business Information Systems (sebis) Faculty of Informatics Technische Universität München wwwmatthes.in.tum.de Investigating the Application of Differential Privacy to Mitigate Privacy Issues in Natural Language Processing Stephen Meisenbacher, 07.12.2020, Guided Research Kick-Off Presentation


Page 1: Investigating the Application of Differential Privacy to

Chair of Software Engineering for Business Information Systems (sebis)

Faculty of Informatics

Technische Universität München

wwwmatthes.in.tum.de

Investigating the Application of Differential Privacy to Mitigate Privacy Issues in Natural Language Processing

Stephen Meisenbacher, 07.12.2020, Guided Research Kick-Off Presentation

Page 2: Investigating the Application of Differential Privacy to

Outline

Introduction

Background

Motivation

Goals

Research Questions

Methodology

Initial Results

Source Collection

Interviews

Next Steps

Timeline

© sebis 201207 Meisenbacher Guided Research Kick-Off Presentation

Page 3: Investigating the Application of Differential Privacy to

Background #1 – How do People View Data Privacy?

https://www.pewresearch.org/internet/2019/11/15/americans-and-privacy-concerned-confused-and-feeling-lack-of-control-over-their-personal-information/

Page 4: Investigating the Application of Differential Privacy to

Background #2 – The Rise of Big Data

https://www.forbes.com/sites/louiscolumbus/2018/05/23/10-charts-that-will-change-your-perspective-of-big-datas-growth/

https://link.springer.com/article/10.1007/s11192-020-03371-2

Evolutionary trend in the number of publications covering data science and big data

Page 5: Investigating the Application of Differential Privacy to

Background #3 – Data Breaches and the Reaction

https://www.statista.com/statistics/273550/data-breaches-recorded-in-the-united-states-by-number-of-breaches-and-records-exposed/

Page 6: Investigating the Application of Differential Privacy to

Background #4 – Privacy Going Forward

“Data Privacy Will Be The Most Important Issue In The Next Decade”

https://www.forbes.com/sites/marymeehan/2019/11/26/data-privacy-will-be-the-most-important-issue-in-the-next-decade/

https://www.pewresearch.org/internet/2019/11/15/americans-and-privacy-concerned-confused-and-feeling-lack-of-control-over-their-personal-information/

Publications with “Privacy” in Title or Abstract, 2000-2020

https://app.dimensions.ai/analytics/publication/overview/timeline?search_mode=content&search_text=privacy&search_type=kws&search_field=text_search&year_from=2000&year_to=2020&local:indicator-y1=timeline-source-published

Page 7: Investigating the Application of Differential Privacy to

Motivation – Differential Privacy

• Context: using data in learning tasks

• Differential Privacy key concept: individual changes to the data still preserve privacy

    i.e. no new information can be learned from these minute changes

    Participation of an individual cannot be determined

• Goal: the difference between the outputs of a query/algorithm with and without a single change is bounded

    Bound can be controlled / quantified (ε)

• Not an algorithm – more of a guideline, schematic

• Provides privacy guarantee

• Advantages: composability, robustness regardless of attack type
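The bounded-difference goal above is usually written as Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D') ∈ S] for any two datasets D and D' that differ in a single record. As a minimal sketch, the Python snippet below applies the Laplace mechanism to a counting query. The slide does not name a mechanism, so the choice of the Laplace mechanism, all function names, and the toy records are assumptions made for illustration only. The helper worst_case_ratio checks numerically that the output densities for two neighbouring counts differ by a factor of at most e^ε.

```python
import math
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Noisy count. A count changes by at most 1 when one record changes,
    so the noise scale is 1 / epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def laplace_pdf(x, mean, scale):
    """Density of the Laplace distribution centred at `mean`."""
    return math.exp(-abs(x - mean) / scale) / (2.0 * scale)


def worst_case_ratio(count_a, count_b, epsilon, outputs):
    """Largest density ratio between two neighbouring counts over `outputs`."""
    scale = 1.0 / epsilon
    return max(
        laplace_pdf(o, count_a, scale) / laplace_pdf(o, count_b, scale)
        for o in outputs
    )


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy records, invented for this example.
    messages = ["my doctor said", "see you at lunch", "the doctor called"]
    epsilon = 0.5
    noisy = private_count(messages, lambda m: "doctor" in m, epsilon, rng)
    print("noisy count:", noisy)
    grid = [x / 10.0 for x in range(-200, 400)]
    print("worst-case ratio:", worst_case_ratio(2, 3, epsilon, grid))
    print("bound e^epsilon: ", math.exp(epsilon))
```

A smaller ε tightens the bound on the ratio and increases the noise, which is the sense in which the bound can be controlled.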

Page 8: Investigating the Application of Differential Privacy to

Motivation – why Differential Privacy + Natural Language Processing?

• Privacy in general is a hot topic

Coincides with the increasing importance of data

What if we could provide some sort of privacy guarantee?

• Result: novel privacy-preserving techniques

• Differential privacy: a promising and relatively new concept

Still very much in the theoretical phase

• NLP: usually dependent on user (human) data, hence potential privacy concerns

• Many papers address Differential Privacy with regard to Machine Learning or Deep Learning

But – very little specific to NLP

Page 9: Investigating the Application of Differential Privacy to

Goals

• Want: overview of the current state of DP in NLP

    • Privacy vulnerabilities

    • Feasibility

    • Technical applications

    • Use cases?

    • Pros, cons

    • Overall: current work + potential

• Method: systematic literature review

    • Academic research/literature

    • “Grey” literature

    • Contact with experts in the field

Page 10: Investigating the Application of Differential Privacy to

Research Questions

1. What vulnerabilities of current NLP techniques is Differential Privacy capable of preventing?

2. What are the foundations of Differential Privacy, and how can it be applied to NLP tasks?

3. What are the distinct benefits and limitations of applying Differential Privacy to NLP tasks?

Page 11: Investigating the Application of Differential Privacy to

Methodology

Page 12: Investigating the Application of Differential Privacy to

Initial Results – Source Collection

• Initial search strings:

    • ‘Differential Privacy (Natural Language Processing | NLP)’

    • ‘Privacy (Natural Language Processing | NLP)’

    • ‘Differential Privacy’

• Electronic Data Sources

    • IEEE Xplore

    • ACM Digital Library

    • Google Scholar

    • ScienceDirect

    • Springer

    • Wiley

    • Google (for grey literature)
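The search strings above use a compact "(A | B)" notation. As a rough sketch, and assuming that "|" denotes alternatives that are substituted one at a time, the following Python helper expands each string into the concrete queries that would be entered into a data source. The function name expand_search_string is hypothetical, and the snippet makes no claim about how the searches were actually run.

```python
import itertools
import re


def expand_search_string(pattern):
    """Expand every "(a | b | ...)" group into one query per alternative."""
    # re.split with a capture group returns literal text at even indices
    # and the contents of each parenthesised group at odd indices.
    parts = re.split(r"\(([^()]*)\)", pattern)
    choices = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            choices.append([alt.strip() for alt in part.split("|")])
        else:
            choices.append([part.strip()])
    queries = []
    for combo in itertools.product(*choices):
        queries.append(" ".join(piece for piece in combo if piece))
    return queries


if __name__ == "__main__":
    search_strings = [
        "Differential Privacy (Natural Language Processing | NLP)",
        "Privacy (Natural Language Processing | NLP)",
        "Differential Privacy",
    ]
    for search_string in search_strings:
        for query in expand_search_string(search_string):
            print(query)
```

Expanding the strings this way makes it easy to log each concrete query per data source when documenting the search process.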

Garousi, et al. “Multivocal Literature Reviews”

Page 13: Investigating the Application of Differential Privacy to

Initial Results – Source Collection (cont.)

• 100s: Search results (unfiltered)

• ~20: Filtered based upon title/abstract

• 60: Manual search through references in initial documents

• ??: Final set of documents (tbd)

Page 14: Investigating the Application of Differential Privacy to

Initial Results – Source Collection (cont.)

Multi-word Keyword Extraction (app.sketchengine.eu), ~340,000 word corpus

[Figure: Documents in Catalog by Year. x-axis: Year (2000-2020); y-axis: # Elements (0-20)]

Page 15: Investigating the Application of Differential Privacy to

Initial Results – Interviews: Organization and Methodology

• Manual search for expert contacts

Look for “privacy” + “NLP” in research interests

• Contacted via email

8 emails sent

• So far:

5 responses

3 interviews scheduled

2 interviews conducted

Both researchers who deal specifically with privacy and NLP

Hope: a few more interviews over time

• Format: ~30 minute video interview

• Prepared interview questionnaire, broken down into 4 categories:

General: current work, background with privacy, thoughts on privacy + NLP

RQ1: NLP privacy vulnerabilities, attack types, preventative work so far

RQ2: DP foundations, application to NLP, use cases, technical implementations

RQ3: major advantages, current limitations, future improvement, thoughts on future of private NLP

Total: 18 questions + sub-questions

• Main goal: obtain a strong background/motivation before diving into the literature review

Page 16: Investigating the Application of Differential Privacy to

Initial Results – Interviews: (Some) Takeaways So Far

1. The nature of NLP inherently creates privacy concerns

Reliant on human data: language is our main source of communication

Textual information is “rich in content” (potentially private/sensitive)

Vulnerable to certain attack types:

Membership inference, attribute inference, keyword inference, pattern reconstruction

Several use cases bolster this point:

Models trained on private messages can unintentionally memorize sensitive data [1]

Profiling fake news / hate speech spreaders from stylometry [2]

Authorship verification / profiling / clustering (e.g. from follower tweets) [2]

Hyperpartisan news detection [2]

*Gender identification from tweets [2, 3, 6]

*Personally identifiable search log text [4]

*Genome prediction using pattern reconstruction [5]

*Extracting disease keywords from BERT embeddings [5]

* (Metric) Differential Privacy can be used to mitigate vulnerabilities

RQ1 goal
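To illustrate what "(Metric) Differential Privacy" can look like at the word level, here is a minimal sketch under stated assumptions: a word is represented by an embedding vector, noise with density proportional to exp(-ε‖z‖) is added, and the noisy vector is mapped back to the nearest word in the vocabulary. The slide does not describe a concrete mechanism, and this is not presented as the method of any cited paper. The two-dimensional embeddings and all names below are hypothetical.

```python
import math
import random

# Hypothetical 2-D word embeddings, invented purely for illustration.
TOY_EMBEDDINGS = {
    "doctor": (0.9, 0.1),
    "nurse": (0.8, 0.2),
    "hospital": (0.7, 0.4),
    "banana": (-0.8, 0.6),
    "apple": (-0.7, 0.7),
}


def sample_noise(dim, epsilon, rng):
    """Draw noise with density proportional to exp(-epsilon * ||z||).

    The direction is uniform on the unit sphere and the radius follows a
    Gamma(dim, 1 / epsilon) distribution.
    """
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(d * d for d in direction))
    radius = rng.gammavariate(dim, 1.0 / epsilon)
    return [radius * d / norm for d in direction]


def perturb_word(word, embeddings, epsilon, rng):
    """Add noise to the word's vector and return the nearest vocabulary word."""
    vector = embeddings[word]
    noise = sample_noise(len(vector), epsilon, rng)
    noisy = [v + n for v, n in zip(vector, noise)]
    return min(embeddings, key=lambda w: math.dist(embeddings[w], noisy))


if __name__ == "__main__":
    rng = random.Random(0)
    for epsilon in (1.0, 10.0, 100.0):
        samples = [perturb_word("doctor", TOY_EMBEDDINGS, epsilon, rng) for _ in range(10)]
        print(epsilon, samples)
```

With a small ε the output word is frequently replaced by another vocabulary entry, while a large ε almost always returns the original word, so the privacy parameter directly controls how much of the original text survives.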

[1] Carlini, et al. “The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks”

[2] https://pan.webis.de/publications.html

[3] Koppel, et al. “Automatically Categorizing Written Texts by Author Gender”

[4] Li, et al. “Towards Robust and Privacy-preserving Text Representations”

[5] Pan, et al. “Privacy Risks of General-Purpose Language Models”

[6] Fernandes, et al. “Author Obfuscation using Generalised Differential Privacy”

Page 17: Investigating the Application of Differential Privacy to

Initial Results – Interviews: (Some) Takeaways So Far (cont.)

2. Features of DP make it an attractive option (“the best”)

Privacy guarantee (see 4.)

Relatively efficient (e.g. vs. homomorphic encryption)

Composability + robustness

3. However, DP is not a silver bullet for addressing privacy concerns in NLP

Must consider the task at hand

Also consider the privacy-utility tradeoff

Use DP: making generalizations/predictions on publicly visible data

Other privacy-preserving techniques might sometimes make more sense

4. Biggest challenge/limitation: Explainability

+ giving an epsilon value to an engineer – simple

- NLP is fuzzy, unstructured

- What does privacy mean for NLP (or in general)?

How can we explain DP in the frame of NLP???
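The composability and privacy-utility points above can be made concrete with a small calculation. This is a sketch that assumes the Laplace mechanism (not named on the slide) and basic sequential composition, where the ε values of individual queries add up to the total privacy budget. Under these assumptions the expected absolute error of one query is sensitivity / ε, so splitting a fixed budget over more queries makes each answer noisier. The function names are hypothetical.

```python
def expected_abs_error(sensitivity, epsilon):
    """Mean absolute error of Laplace noise with scale sensitivity / epsilon."""
    return sensitivity / epsilon


def split_budget(total_epsilon, num_queries):
    """Sequential composition: per-query epsilons must sum to the total."""
    return [total_epsilon / num_queries] * num_queries


if __name__ == "__main__":
    num_queries = 4
    for total_epsilon in (0.1, 1.0, 10.0):
        per_query = split_budget(total_epsilon, num_queries)[0]
        error = expected_abs_error(1.0, per_query)
        print(f"total eps={total_epsilon}: eps per query={per_query}, "
              f"expected error per count={error}")
```

A table like the one this prints is one simple way to explain an ε value to an engineer in terms of expected error rather than abstract guarantees.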

Page 18: Investigating the Application of Differential Privacy to

Initial Results – Next Steps

1. Conduct next interviews

Compile and transcribe results

Possible: find new contacts

2. Finalize catalog of literature

Filter down to << 60

3. Begin full literature review

Mark up and take notes

Follow Thematic Synthesis Process (Dyba, et al.; see Appendix D)

4. Concurrently, begin writing

Dyba, et al. “Applying Systematic Reviews to Diverse Study Types: An Experience Report”

Page 19: Investigating the Application of Differential Privacy to

Timeline

Stephen Meisenbacher Guided Research Schedule

Project Start Date: 10/1/2020 (Thursday)
Project End Date: 4/12/2021 (Monday)
Display Week: 4

WBS  TASK                                         START         END           DAYS  % DONE  WORK DAYS
1    Initial Stages                               Fri 10/02/20  Fri 10/30/20  29            21
1.1  Initial meeting with Alexandra               Fri 10/02/20  Fri 10/02/20  1     100%    1
1.2  Initial planning + meeting prep              Sat 10/03/20  Fri 10/16/20  14    100%    10
1.3  Prep meeting, review and finish slides       Mon 10/19/20  Wed 10/21/20  3     100%    3
1.4  Meeting with Prof. Matthes                   Thu 10/22/20  Thu 10/22/20  1     100%    1
1.5  Revision to methodology, etc.                Fri 10/23/20  Fri 10/30/20  8     100%    6
2    Data Collection                              Mon 11/02/20  Mon 12/21/20  50            36
2.1  Source collection, cataloging                Mon 11/02/20  Fri 11/20/20  19    100%    10
2.2  Source review, data collection and notation  Mon 11/16/20  Mon 12/21/20  36    0%      26
2.3  Seminar initial presentation                 Mon 12/07/20  Mon 12/07/20  1     0%      1
2.4  Conduct interviews                           Mon 11/16/20  Fri 12/11/20  26    0%      20
3    Writing and Presentation                     Mon 1/04/21   Mon 4/12/21   99            71
3.1  Writing of review draft                      Mon 1/04/21   Fri 2/12/21   40    0%      30
3.2  Editing and revisions                        Mon 2/15/21   Fri 2/26/21   12    0%      10
3.3  Final presentation prep                      Mon 3/01/21   Fri 3/05/21   5     0%      5
3.4  Final presentation                           Mon 3/08/21   Mon 3/08/21   1     0%      1
3.5  Submit final paper                           Fri 3/12/21   Fri 3/12/21   1     0%      1
3.6  Buffer                                       Sat 3/13/21   Mon 4/12/21   31    0%      21

[Gantt chart: Week 4 (19 Oct 2020) through Week 24 (8 Mar 2021)]

Page 20: Investigating the Application of Differential Privacy to

Technische Universität München

Faculty of Informatics

Chair of Software Engineering for Business

Information Systems

Boltzmannstraße 3

85748 Garching bei München

Tel +49.89.289.17132

Fax +49.89.289.17136

wwwmatthes.in.tum.de

Stephen [email protected]

[email protected]

Page 21: Investigating the Application of Differential Privacy to

Appendix A: Grey Literature (Garousi)

Page 22: Investigating the Application of Differential Privacy to

Appendix B: Quality Checklist (Garousi)

Page 23: Investigating the Application of Differential Privacy to

Appendix C: Search Process Documentation (Kitchenham)

Page 24: Investigating the Application of Differential Privacy to

Appendix D: Thematic Synthesis Process (Dyba)
