Wolf-Tilo Balke Jan-Christoph Kalo Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de Relational Database Systems 2 13. Data Privacy




13.1 What is Privacy?

13.2 Privacy Laws

13.3 K-Anonymity

13.4 Netflix Prize

13.5 Social Network Anonymization

13.6 Statistical Database Security

Relational Database Systems 2 – Wolf-Tilo Balke– Institut für Informationssysteme 2

13 Data Privacy

13.1 What is Privacy?

• “… the ability to determine for ourselves when, how, and to what extent information about us is communicated to others …” (Westin, 1967)
• “Privacy intrusion occurs when new information about an individual is released.” (Parent, 1983)

13.1 What is Privacy?

• Privacy has several aspects and definitions:
  – The right to be let alone
  – Limit the access others have to one's personal information
  – The option to conceal any information from others
  – Secrecy
  – States of privacy
  – Personhood and autonomy
  – Self-identity and personal growth
  – Protection of intimate relationships

13.1 What is Privacy?

• Unlike database security, privacy cannot be achieved by access control
• The challenge is to utilize data while protecting individuals' privacy preferences
  – Idea: Anonymize all attributes that could be used for de-anonymization

Data Privacy:
  • Detecting data inference
  • Controlling data inference
Database Security:
  • Controlling database access
  • Protecting database content

13.2 History of Privacy Laws

• Between 1850 and 1890, the popularity of newspapers grew tremendously
  – New camera technology allowed the first snapshots in public places
  – People feared that the new technology would be used by the “sensationalistic press”
• In 1890, Samuel D. Warren and Louis D. Brandeis published “The Right to Privacy”
  – “most influential law article of all time”
  – Privacy as “the right to be let alone”

13.2 History of Privacy Laws

• In the 1960s, John F. Kennedy planned to introduce national registration centers
  – This started a large discussion on privacy in the USA
  – Kennedy's plan failed in Congress
• In 1974, the Privacy Act was adopted in the USA
  – It regulates the collection, maintenance, use, and dissemination of information about individuals by federal agencies
  – Plans to extend the law to the private sector failed

13.2 History of Privacy Laws

• In 1970, the world's first privacy law was adopted in Hesse
  – Regulates the usage of personal data by public authorities
  – Ensures the informational self-determination of citizens
• In 1977, the German privacy law (BDSG) followed
  – Renewed in 1990 due to the “Volkszählungsurteil” (census ruling)
  – Renewed again in 2009 due to several scandals

13.2 Privacy in Germany

• Affected persons need to consent to the data acquisition
• Affected persons have the right to:
  – Obtain information on the data stored about them
  – Obtain information on why data about them is stored and where it comes from
  – Demand the deletion of their data
  – Prohibit the circulation of their data
• Germany has several organizations and institutions supervising privacy issues

13.2 Privacy in the US

• A general privacy law collides with the right to freedom of expression
• Specialized laws regulate privacy issues in the US
  – COPPA (Children's Online Privacy Protection Act)
  – HIPAA (Health Insurance Portability and Accountability Act)
• No supervising organization for privacy issues

13.2 Privacy Laws Today

• International Privacy Ranking from 2007

13.3 Privacy Risks

• 40% of the world's population uses the Internet
  – In developed countries, as much as 78% of the population uses the Internet
• Data explosion on the Internet
  – The average user produces about 8-10 GB of public data per day (in 2007)
• People produce a huge amount of private data
  – The popularity of social networks contributes tremendously to this data
• How can we use this data?

13.3 Privacy Risks

• Most of this private data is used by companies
  – Movie and product recommendation
  – Facebook Timeline optimization
  – Search recommendation
• Companies and researchers also use private data for statistical analysis
• Health and governmental institutions also use Big Data
  – Open health data helps researchers to improve the health system

13.3 Privacy Risks

• Problem: The growing availability of information and person-specific data leads to privacy risks
  – Who is allowed to see which information?
• Goal: Anonymize data to protect individuals' privacy
  – Even when data has no explicit identifying attributes, such as names or addresses, re-identification is possible

13.3 Protecting Privacy

• In 1997, medical information of the Governor of Massachusetts was re-identified by researchers from MIT
• The Massachusetts Group Insurance Commission (GIC) published medical data to improve healthcare and to control costs
  – William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting all identifiers

13.3 Protecting Privacy

• Latanya Sweeney, a student at MIT working on privacy, tried to identify the governor
  – She bought the voter register for Cambridge, Massachusetts, covering 54,805 persons, for $20
    • Actually, 101,391 people lived in Cambridge
    • About half of the population was not in the register and could not have been identified this way

13.3 Protecting Privacy

• It is estimated that 29,000 of these 54,805 have a unique combination of birth date, gender, and ZIP code
  – 29% of Cambridge's population have a plausible risk of re-identification
• In fact, 87% of the population of the United States can be uniquely identified by gender, date of birth, and ZIP code
  – Sweeney, L. A. Simple Demographics Often Identify People Uniquely. Carnegie Mellon Univ. Data Privacy Working Paper 3 (2000).

13.3 Protecting Privacy

• By combining this data with the GIC records, Sweeney found Governor Weld with ease
  – Only six people in Cambridge shared his birth date, only three of them were men, and of those, only he lived in his ZIP code

[Figure: attributes of the two linked data sets]
GIC data only: Ethnicity, Visit date, Diagnosis, Procedure, Medication, Total charge
Voter register only: Name, Address, Date registered, Party affiliation, Date last voted
In both data sets: ZIP, Birthdate, Sex
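The linking step can be sketched in a few lines of Python. The records below are invented for illustration (they are not the real GIC or voter data), and the field names `zip`, `birthdate`, and `sex` are assumptions. The sketch only shows that a join on the three shared attributes attaches a name to a "de-identified" medical record whenever the combination is unique in the voter register.

```python
from collections import defaultdict

# Hypothetical toy data; none of these records come from the real GIC or voter files.
voters = [
    {"name": "A. Adams", "zip": "02138", "birthdate": "1950-01-01", "sex": "M"},
    {"name": "B. Brown", "zip": "02138", "birthdate": "1950-01-01", "sex": "F"},
    {"name": "C. Clark", "zip": "02139", "birthdate": "1962-03-15", "sex": "M"},
    {"name": "D. Davis", "zip": "02139", "birthdate": "1962-03-15", "sex": "M"},
]
medical = [
    {"zip": "02138", "birthdate": "1950-01-01", "sex": "M", "diagnosis": "Hypertension"},
    {"zip": "02139", "birthdate": "1962-03-15", "sex": "M", "diagnosis": "Asthma"},
]

SHARED = ("zip", "birthdate", "sex")  # attributes present in both data sets


def link(medical_rows, voter_rows):
    """Return {name: diagnosis} for medical rows that match exactly one voter."""
    names_by_key = defaultdict(list)
    for voter in voter_rows:
        names_by_key[tuple(voter[a] for a in SHARED)].append(voter["name"])
    identified = {}
    for row in medical_rows:
        names = names_by_key[tuple(row[a] for a in SHARED)]
        if len(names) == 1:  # unique combination: the record is re-identified
            identified[names[0]] = row["diagnosis"]
    return identified


reidentified = link(medical, voters)
print(reidentified)
```

In this toy data the second medical record matches two voters, so it stays ambiguous; only the first one is attributed to a single person.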

13.3 Protecting Privacy

• Breach-and-patch method?
  – Identify privacy breaches
  – Design new algorithms and techniques to fix breaches
• Better: Formally define models and specify their properties
  – Formally specify the privacy model
  – Derive conditions for privacy
  – Design an algorithm that satisfies the privacy conditions
• Over the years, several privacy models have been developed
  – k-anonymity, l-diversity, t-closeness, (c,k)-safety
• There is no perfect privacy model

13.3 k-Anonymity Model

• k-anonymity is a protection model that can ensure the anonymity of individuals in relational data
  – Idea: Generalize, modify, or distort identifiers so that no individual is uniquely identifiable
• In recent years it has gained popularity and become one of the most important ideas with regard to privacy
  – Several fast algorithms for creating k-anonymous data sets exist

13.3 k-Anonymity Model

• k-anonymity is defined on relational data
  – Table $RT(A_1, \dots, A_n)$
  – Quasi-identifiers $QI_{RT}$ have to be identified in the data
    • What could be used to identify individuals?
• Definition of k-anonymity
  – The table $RT$ satisfies k-anonymity if and only if each sequence of values in $RT[QI_{RT}]$ appears with at least $k$ occurrences

Race   Birth  Gender  ZIP    Problem
Black  1988   Male    38102  Short Breath
Black  1988   Male    38102  Hypertension
Black  1989   Female  38102  Obesity
Black  1989   Female  38102  Chest Pain
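The definition translates directly into a group-and-count check. The following Python sketch assumes that Race, Birth, Gender, and ZIP are the quasi-identifiers of the example table (the slides do not state this explicitly); the function names are illustrative.

```python
from collections import Counter


def anonymity_level(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    counts = Counter(tuple(row[a] for a in quasi_identifiers) for row in rows)
    return min(counts.values(), default=0)


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return anonymity_level(rows, quasi_identifiers) >= k


COLUMNS = ("race", "birth", "gender", "zip", "problem")
table = [dict(zip(COLUMNS, values)) for values in [
    ("Black", 1988, "Male", "38102", "Short Breath"),
    ("Black", 1988, "Male", "38102", "Hypertension"),
    ("Black", 1989, "Female", "38102", "Obesity"),
    ("Black", 1989, "Female", "38102", "Chest Pain"),
]]
QI = ("race", "birth", "gender", "zip")  # assumed quasi-identifiers

print(anonymity_level(table, QI))
print(is_k_anonymous(table, QI, 3))
```

Each quasi-identifier combination occurs twice, so the table is 2-anonymous but not 3-anonymous.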

13.3 k-Anonymity Model

• Meaning of k-anonymity:
  – If the released data satisfies k-anonymity, linking RT with external data cannot match a record to fewer than k individuals
  – Each record is indistinguishable from at least k-1 other records within the dataset
• Larger values of k imply higher privacy
  – The probability of identifying an individual is at most $1/k$
• Finding the quasi-identifiers $QI_{RT}$ is hard
  – If the external data sources are not fully known, finding all quasi-identifiers is not possible
    • Sometimes a combination of seemingly meaningless attributes can identify individuals

13.3 k-Anonymity Model

• Algorithms for k-anonymization involve two basic operations
  – Deleting cell values or entire tuples
    • When generalization causes too much information loss
  – Generalizing cell values
    • Replace specific quasi-identifier values with less specific ones until k identical values are obtained
• Finding an optimal k-anonymization is NP-hard
  – Optimal means perturbing the input data as little as necessary
  – Perturbing data also implies a loss of data utility
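As a rough illustration of generalization, the Python sketch below masks trailing ZIP digits until every (race, ZIP) combination occurs at least k times. It is a naive single-attribute strategy chosen for brevity, not one of the optimized algorithms mentioned above; the function names and the six example rows are assumptions made for this sketch.

```python
from collections import Counter


def mask_zip(zip_code, digits):
    """Replace the last `digits` characters of a ZIP code with '*'."""
    return zip_code[:len(zip_code) - digits] + "*" * digits


def generalize_zip(rows, k):
    """Mask ZIP digits from the right until every (race, zip) group has >= k rows.

    Returns (generalized_rows, masked_digits), or (None, None) if even a fully
    masked ZIP does not suffice and tuples would have to be deleted instead.
    """
    zip_length = len(rows[0][1])
    for digits in range(zip_length + 1):
        candidate = [(race, mask_zip(z, digits)) for race, z in rows]
        if min(Counter(candidate).values()) >= k:
            return candidate, digits
    return None, None


rows = [
    ("Asian", "38102"), ("Asian", "38106"),
    ("Black", "38102"), ("Black", "38106"),
    ("White", "38102"), ("White", "38106"),
]
generalized, masked_digits = generalize_zip(rows, k=2)
print(masked_digits, generalized[0])
```

Masking one digit is already enough for k=2 here. For k=3 the sketch gives up, because each race value occurs only twice, which is the situation where deleting tuples or generalizing another attribute becomes necessary.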

13.3 Unsorted Matching Attack

• Real-world data is often sorted
  – Several k-anonymous releases of the same data can then be matched directly, row by row
  – Randomly sorting the tuples solves the problem

Original Data     K-Anonymous Data 1    K-Anonymous Data 2
Race   ZIP        Race    ZIP           Race   ZIP
Asian  38102      Person  38102         Asian  381**
Asian  38106      Person  38106         Asian  381**
Black  38102      Person  38102         Black  381**
Black  38106      Person  38106         Black  381**
White  38102      Person  38102         White  381**
White  38106      Person  38106         White  381**
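The attack and its countermeasure can be sketched in Python using the three tables of this slide. The helper name `shuffled` is illustrative; the only point is that position-wise combination works as long as both releases keep the original row order.

```python
import random

original = [
    ("Asian", "38102"), ("Asian", "38106"),
    ("Black", "38102"), ("Black", "38106"),
    ("White", "38102"), ("White", "38106"),
]
# Release 1 generalizes Race, release 2 generalizes ZIP.
release_1 = [("Person", zip_code) for _, zip_code in original]
release_2 = [(race, zip_code[:3] + "**") for race, zip_code in original]

# Both releases keep the original row order, so row i of one release
# describes the same tuple as row i of the other.
recovered = [(r2[0], r1[1]) for r1, r2 in zip(release_1, release_2)]
print(recovered == original)


def shuffled(rows, seed=None):
    """Return a randomly reordered copy of the rows."""
    rng = random.Random(seed)
    copy = list(rows)
    rng.shuffle(copy)
    return copy


# Publishing a shuffled copy removes the positional correspondence.
release_2_published = shuffled(release_2)
```

Shuffling changes only the order, not the content, so the published release stays k-anonymous.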

13.3 Complementary Release Attack

• Both data sets are k-anonymous with k=2
  – They can nevertheless be matched using the Problem attribute

Data set 1
Race    Birthyear  Gender  ZIP    Problem
Black   1965       Male    38100  Short Breath
Black   1965       Male    38100  Chest Pain
Person  1965       Female  381**  Painful Eye
Person  1965       Female  381**  Wheezing
Black   1964       Female  38106  Obesity
Black   1964       Female  38106  Chest Pain
White   1964       Male    381**  Short Breath
Person  1965       Female  381**  Hypertension
White   1964       Male    381**  Obesity
White   1964       Male    381**  Fever
White   1967       Male    38106  Vomiting
White   1967       Male    38106  Back Pain

Data set 2
Race    Birthyear  Gender  ZIP    Problem
Black   1965       Male    38100  Short Breath
Black   1965       Male    38100  Chest Pain
Black   1965       Female  38106  Painful Eye
Black   1965       Female  38106  Wheezing
Black   1964       Female  38106  Obesity
Black   1964       Female  38106  Chest Pain
White   19**       Male    38106  Short Breath
White   19**       Human   38102  Hypertension
White   19**       Human   38102  Obesity
White   19**       Human   38102  Fever
White   19**       Male    38106  Vomiting
White   19**       Male    38106  Back Pain
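A minimal Python sketch of the linking step, using the two tables above. It joins only on Problem values that occur exactly once in each release (a simplification made here to keep the join unambiguous) and keeps, per attribute, whichever value is not generalized. The helper names are illustrative.

```python
from collections import Counter

COLS = ("race", "birthyear", "gender", "zip", "problem")


def rows(tuples):
    return [dict(zip(COLS, t)) for t in tuples]


release_1 = rows([
    ("Black", "1965", "Male", "38100", "Short Breath"),
    ("Black", "1965", "Male", "38100", "Chest Pain"),
    ("Person", "1965", "Female", "381**", "Painful Eye"),
    ("Person", "1965", "Female", "381**", "Wheezing"),
    ("Black", "1964", "Female", "38106", "Obesity"),
    ("Black", "1964", "Female", "38106", "Chest Pain"),
    ("White", "1964", "Male", "381**", "Short Breath"),
    ("Person", "1965", "Female", "381**", "Hypertension"),
    ("White", "1964", "Male", "381**", "Obesity"),
    ("White", "1964", "Male", "381**", "Fever"),
    ("White", "1967", "Male", "38106", "Vomiting"),
    ("White", "1967", "Male", "38106", "Back Pain"),
])
release_2 = rows([
    ("Black", "1965", "Male", "38100", "Short Breath"),
    ("Black", "1965", "Male", "38100", "Chest Pain"),
    ("Black", "1965", "Female", "38106", "Painful Eye"),
    ("Black", "1965", "Female", "38106", "Wheezing"),
    ("Black", "1964", "Female", "38106", "Obesity"),
    ("Black", "1964", "Female", "38106", "Chest Pain"),
    ("White", "19**", "Male", "38106", "Short Breath"),
    ("White", "19**", "Human", "38102", "Hypertension"),
    ("White", "19**", "Human", "38102", "Obesity"),
    ("White", "19**", "Human", "38102", "Fever"),
    ("White", "19**", "Male", "38106", "Vomiting"),
    ("White", "19**", "Male", "38106", "Back Pain"),
])


def is_generalized(value):
    """Generalized values in these tables are 'Person', 'Human', or contain '*'."""
    return value in ("Person", "Human") or "*" in value


def link_on_problem(a_rows, b_rows):
    """Join two releases on Problem, using only unambiguous Problem values."""
    a_count = Counter(r["problem"] for r in a_rows)
    b_count = Counter(r["problem"] for r in b_rows)
    b_by_problem = {r["problem"]: r for r in b_rows}
    linked = []
    for a in a_rows:
        p = a["problem"]
        if a_count[p] == 1 and b_count[p] == 1:
            b = b_by_problem[p]
            # Take the other release's value wherever this one is generalized.
            linked.append({c: (b[c] if is_generalized(a[c]) else a[c]) for c in COLS})
    return linked


linked = link_on_problem(release_1, release_2)
hypertension = next(r for r in linked if r["problem"] == "Hypertension")
print(hypertension)
```

For the Hypertension record, the linked row carries a specific race, birth year, gender, and ZIP code, although each release on its own hid at least two of these values.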

13.3 Homogeneity Attack

• Little diversity with regard to the sensitive attributes can also lead to privacy problems
  – 4-anonymous patient data
    • Privacy is not guaranteed since all 4 individuals in the last class (198*, 38102) have cancer

Race   Birthyear  Gender  ZIP    Problem
Human  196*       *       38100  Short Breath
Human  196*       *       38100  Chest Pain
Human  196*       *       38100  Painful Eye
Human  196*       *       38100  Wheezing
Human  197*       *       38106  Obesity
Human  197*       *       38106  Chest Pain
Human  197*       *       38106  Short Breath
Human  197*       *       38106  Cancer
Human  198*       *       38102  Cancer
Human  198*       *       38102  Cancer
Human  198*       *       38102  Cancer
Human  198*       *       38102  Cancer

13.3 Background Knowledge Attack

• In datasets with little diversity, background knowledge can lead to privacy issues
• Goal: Find the medical problem of my neighbour in 38100
  – I know that he just came back from Vietnam
  – It is more likely that he suffers from Dengue fever
• The available background knowledge is not known when the data is released
  – Background knowledge might be arbitrarily complex

Race   Birthyear  Gender  ZIP    Problem
Human  196*       *       38100  Heart Disease
Human  196*       *       38100  Heart Disease
Human  196*       *       38100  Dengue fever
Human  196*       *       38100  Dengue fever
Human  197*       *       38106  Obesity
Human  197*       *       38106  Chest Pain
Human  197*       *       38106  Short Breath
Human  197*       *       38106  Cancer

13.3 L-Diversity

• Homogeneity and background knowledge attacks can be prevented by diversity
  – Sensitive attributes must be diverse within each class
  – If each class has at least l different values of the sensitive attribute, the data is l-diverse

Race   Birthyear  Gender  ZIP    Problem
Human  196*       *       38100  Short Breath
Human  196*       *       38100  Chest Pain
Human  196*       *       38100  Wheezing
Human  197*       *       38106  Obesity
Human  197*       *       38106  Chest Pain
Human  197*       *       38106  Short Breath
Human  198*       *       38102  Cancer
Human  198*       *       38102  Cancer
Human  198*       *       38102  Cancer
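The l-diversity condition can be checked much like k-anonymity, by counting distinct sensitive values per class. The Python sketch below implements this simplest reading of the definition (number of distinct values per class) and applies it to the table above; the function name and the choice of quasi-identifiers are assumptions.

```python
from collections import defaultdict


def diversity_level(rows, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values found in any class."""
    classes = defaultdict(set)
    for row in rows:
        classes[tuple(row[a] for a in quasi_identifiers)].add(row[sensitive])
    return min((len(values) for values in classes.values()), default=0)


COLUMNS = ("race", "birthyear", "gender", "zip", "problem")
table = [dict(zip(COLUMNS, values)) for values in [
    ("Human", "196*", "*", "38100", "Short Breath"),
    ("Human", "196*", "*", "38100", "Chest Pain"),
    ("Human", "196*", "*", "38100", "Wheezing"),
    ("Human", "197*", "*", "38106", "Obesity"),
    ("Human", "197*", "*", "38106", "Chest Pain"),
    ("Human", "197*", "*", "38106", "Short Breath"),
    ("Human", "198*", "*", "38102", "Cancer"),
    ("Human", "198*", "*", "38102", "Cancer"),
    ("Human", "198*", "*", "38102", "Cancer"),
]]
QI = ("race", "birthyear", "gender", "zip")  # assumed quasi-identifiers

print(diversity_level(table, QI, "problem"))      # whole table
print(diversity_level(table[:6], QI, "problem"))  # without the 198* class
```

The first two classes each contain three different problems, while the 198* class contains only Cancer, so the table as a whole is only 1-diverse even though it is 3-anonymous.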

13.3 L-Diversity

• l-diversity cannot guarantee perfect anonymization either
• Imagine an HIV test result as a sensitive attribute
  – Diversity is hard to achieve and unnecessary
• The overall distribution of sensitive attributes is not considered
  – A skewness attack is possible
    • See the HIV example with very few positive results
• Several other attacks on l-diverse data sets are possible

13.3 Problem: Curse of Dimensionality

• k-anonymizing data relies on generalization
  – Each record is generalized together with its k neighbours
• However, datasets often have many attributes
  – The data is very sparse
  – The k nearest neighbours are very far away
• In high-dimensional datasets, k-anonymization leads to a loss of information
  – Loss of data utility

13.4 The Netflix Prize

• Recommender systems play an important role in e-commerce
  – Video, product, and music recommendation
  – …but also Facebook, Twitter, Instagram
• Most systems are based on collaborative filtering
  – Users express their preferences by ratings
  – Match user ratings against each other and find users with similar taste
  – Recommend movies that users with similar taste rated highly
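The three collaborative filtering steps can be sketched as a tiny user-based recommender in Python. Users, movies, and ratings are invented, and cosine similarity over co-rated movies is just one possible similarity measure chosen for this sketch; real systems are considerably more elaborate.

```python
from math import sqrt

# Hypothetical users and ratings on a 1-5 scale.
ratings = {
    "ann": {"Alien": 5, "Brazil": 4, "Casablanca": 1},
    "bob": {"Alien": 5, "Brazil": 5, "Dune": 4},
    "cat": {"Alien": 1, "Casablanca": 5, "Eraserhead": 5},
}


def cosine(u, v):
    """Cosine similarity computed over the movies both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[m] * v[m] for m in common)
    norm_u = sqrt(sum(u[m] ** 2 for m in common))
    norm_v = sqrt(sum(v[m] ** 2 for m in common))
    return dot / (norm_u * norm_v)


def recommend(user, ratings, min_rating=4):
    """Movies the most similar other user rated highly and `user` has not rated."""
    others = [other for other in ratings if other != user]
    nearest = max(others, key=lambda other: cosine(ratings[user], ratings[other]))
    return sorted(
        movie for movie, rating in ratings[nearest].items()
        if rating >= min_rating and movie not in ratings[user]
    )


print(recommend("ann", ratings))
```

Here "ann" and "bob" agree on the movies they both rated, so "ann" is recommended the one movie "bob" liked that she has not rated yet.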

13.4 The Netflix Prize

• Netflix held an open competition for collaborative filtering algorithms
  – 20,000 teams from 150 countries registered for the competition
  – The training dataset contains:
    • 100,480,507 ratings from 480,189 users
• Goal: Improve Netflix's algorithm by 10% to win the prize of 1 million dollars
  – BellKor's Pragmatic Chaos achieved a 10.06% improvement in 2009

13.4 The Netflix Prize

• The Netflix dataset was anonymized
  – No information about users or films
• Each rating consists of the following data: user ID, movie ID, date of grade, grade from 1 to 5
• The average user rated about 200 movies
• The average movie was rated by more than 5,000 users
• Netflix claimed that the data had been perturbed
• Researchers de-anonymized the dataset by cross-matching it with the Internet Movie Database (IMDb)
  – 50 users have been identified
  – The researchers uncovered their political views and other sensitive information

13.4 The Netflix Prize

• Netflix has been criticized by privacy advocates since the release of the dataset
  – 2007: researchers identified individual users
  – 2009: four users filed a lawsuit against Netflix
• In March 2010, the planned follow-up to the Netflix Prize was cancelled
  – The dataset was removed from the Web
  – The plaintiffs dismissed the lawsuit after a settlement with Netflix

13.4 De-Anonymization of Sparse Data

• Arvind Narayanan and Vitaly Shmatikov from the University of Texas identified Netflix users in the IMDb
  – 50 users have been de-anonymized as a proof of concept
• The Netflix Prize data set is very sparse
  – 480,000 users have on average 200 reviews
  – 17,700 movies, i.e., attributes per user
• In contrast to the k-anonymity setting, there are no quasi-identifiers
  – Direct de-anonymization is not possible

13.4 De-Anonymization of Sparse Data

• The perturbation of the Netflix data cannot be too large
  – Introducing large errors massively harms data utility for recommendation
  – Comparisons to other Netflix datasets showed that only very little noise was introduced
    • For one user 5 out of 229 ratings differed, for another one 1 out of 306
• Ratings reveal some information about individuals
  – Non-null values are very rare
  – Some movies have been rated by only a few people

13.4 De-Anonymization of Sparse Data

• The rating profiles of most users are unique
  – The data is not k-anonymous for any k>1
• k-anonymization is bound to fail
  – The rating similarity among users is very low
  – Therefore, generalizing attributes would lead to a great loss of information

13.4 De-Anonymization of Sparse Data

• Similarity between records (individuals) can be computed by comparing their ratings
  – The similarity measure favors rarely rated movies
• Sparse data set example:

[Table: ratings (1 to 5) of users r1 to r5 (rows) for movies 1 to 10 (columns); most cells are empty, each user has rated only two to four of the movies]

13.4 De-Anonymization of Sparse Data

• Basic idea for de-anonymizing record p using dataset A
  1. Compute similarity scores from person p to all records in A
  2. If $(\max - \max_2)/\sigma$ is greater than some threshold $\Phi$, then the most similar record is a match
     ($\max$ and $\max_2$ are the highest and second-highest scores, $\sigma$ is the standard deviation of the scores)
• If no record could be identified
  – Even a small number of candidate records similar to the target reveals a lot of information
• For de-anonymizing the Netflix dataset, a more complex variant of the algorithm was used
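The two steps of the basic idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the algorithm used on the Netflix data: the rarity weight 1/log(1 + number of raters), the threshold value 1.5, and the five toy records are all choices made for this sketch.

```python
from math import log
from statistics import pstdev


def score(aux, record, raters):
    """Sum of rarity weights over movies where aux and record give the same rating."""
    return sum(
        1.0 / log(1 + raters[movie])  # rarely rated movies weigh more (assumed weight)
        for movie, rating in aux.items()
        if record.get(movie) == rating
    )


def deanonymize(aux, dataset, phi=1.5):
    """Return the id of the best-matching record, or None if no match stands out."""
    raters = {}
    for record in dataset.values():
        for movie in record:
            raters[movie] = raters.get(movie, 0) + 1
    scores = {rid: score(aux, record, raters) for rid, record in dataset.items()}
    ranked = sorted(scores.values(), reverse=True)
    sigma = pstdev(ranked)
    if sigma == 0:
        return None
    eccentricity = (ranked[0] - ranked[1]) / sigma  # (max - max_2) / sigma
    return max(scores, key=scores.get) if eccentricity > phi else None


# Hypothetical sparse dataset: record id -> {movie: rating}
dataset = {
    "r1": {"m1": 3, "m2": 4, "m5": 5, "m9": 1},
    "r2": {"m1": 3, "m2": 4, "m4": 1, "m7": 1},
    "r3": {"m2": 4, "m3": 5, "m8": 1},
    "r4": {"m1": 4, "m6": 1},
    "r5": {"m3": 5, "m6": 2, "m10": 4},
}

# Background knowledge about the target person: three known ratings.
aux = {"m2": 4, "m5": 5, "m9": 1}
print(deanonymize(aux, dataset))
print(deanonymize({"m2": 4}, dataset))
```

With three known ratings, two of them on movies nobody else rated, record `r1` clearly stands out. Knowing only the rating of the popular movie `m2` leaves three equally good candidates, so no match is reported.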

13.4 De-Anonymization of Sparse Data

• The algorithms are robust to noise in the background knowledge
• Very little information is needed to de-anonymize an average individual in the Netflix data set
  – 8 movie ratings and rating dates (±14 days) lead to an accuracy of 99%
  – Only 2 ratings and rating dates (±3 days) are needed to still achieve a matching quality of 68%
• Political orientation and religious views can be revealed by opinions on movies

13.5 Privacy in Social Networks

• In March 2016, Facebook had 1.65 billion active users
  – About 28 million active users in Germany
• Instagram has more than 500 million active users
• Snapchat has 150 million active users per day
  – More than 10 billion video views per day
• 310 million people actively use Twitter
  – ...but also lots of private data!


13.5 Privacy in Social Networks

• Social networks are even worse
  – Sexual orientation, ethnicity, religious and political views, personality traits, intelligence, happiness, and use of addictive substances can be predicted…
• Every interaction with a social network tells something about the user
  – Every like, friend request, or post can be used as an attribute for later classification

13.5 Privacy in Social Networks

• When a user joins a social network, he/she will explore the environment
  – What Facebook “likes” tell about members
    • Kosinski M, Stillwell D, Graepel T (2013) Private traits and attributes are predictable from digital records of human behavior. PNAS 110(15).
  – What Facebook friendships tell about sexual orientation
    • Jernigan C, Mistree BFT (2009) Gaydar: Facebook Friendships Expose Sexual Orientation. First Monday 14(10).
  – What Facebook knows about non-member friends of members
    • Horvát E-Á, Hanselmann M, Hamprecht FA, Zweig KA (2012) One Plus One Makes Three (for Social Networks). PLoS ONE 7(4): e34740.

13.5 Privacy in Social Networks

• Which information needs to be anonymized?
  – Individuals should not be re-identified
  – Friendships/interactions between two persons should not be inferred
  – Sensitive attributes of individuals should stay secret
• Similar to before:
  – Deleting/generalizing identifying attributes is not enough
    • Re-identification using the graph structure is possible

13.5 De-anonymization in Social Networks

• A simple anonymization of a social network would remove all identifiers
  – Only the structure of the network is available
• What can Tilo and Jan find out about the network?

[Figure: social network structure with the nodes Tilo, Jan, Jose, Stephan, Silviu, Kinda, and Christoph]

Page 47: Relational Database Systems 2 - TU Braunschweig · •Both data sets are k-anonymous with k=2 –Matched using the Problem attribute Relational Database Systems 2 –Wolf-Tilo Balke–Institut

• Tilo has 3 friends in the social network

– There is only a single node with node degree 3

– Jan can also re-identify himself


• They can share the information about their

friendships

– Both are friends with Jose


• Christoph is only friends with Tilo

– Tilo can also learn that Jose and Christoph are

friends


• Major challenge in anonymizing social networks

– Adding/deleting edges also affects the neighbourhood properties of other nodes

• Can we use k-anonymization in large social networks?

– We do not want to change the neighbourhoods of other nodes

– What do we want to protect against?

• Node re-identification

• Edge disclosure

• Sensitive attributes

– What do we want to anonymize?

• Node degrees

• Neighborhood of nodes

• Attributes

• Structural knowledge of the graph

13.5 Advanced Anonymization in Social Networks

• Anonymizing the degree distribution to protect against node re-identification

– Goal: Transform the social network into a k-degree anonymous graph

• For every degree that occurs in the graph, there are at least k nodes with that degree

• Basic Idea:

– Step 1: Construct a degree distribution that is close to the original distribution by minimally increasing the degrees of a few nodes.

– Step 2: Construct a graph that satisfies the new degree distribution and is close to the original graph by adding a minimum number of edges.
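The goal condition can be checked mechanically. The following is a minimal sketch, assuming the graph is given only by its list of node degrees; the function name `is_k_degree_anonymous` is chosen here for illustration and does not come from the lecture.

```python
from collections import Counter

def is_k_degree_anonymous(degrees, k):
    """Return True if every degree that occurs is shared by at least k nodes."""
    return all(count >= k for count in Counter(degrees).values())

# Degree sequences used in the lecture's example
print(is_k_degree_anonymous([5, 3, 2, 2, 1, 1, 0], 2))  # False: degrees 5, 3 and 0 are unique
print(is_k_degree_anonymous([5, 5, 2, 2, 1, 1, 1], 2))  # True
```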


• Building a 2-degree anonymous graph

– Build a degree sequence for the social network

• Only increasing node degrees is allowed

– Use dynamic programming to find an optimal k-degree anonymous sequence

• Minimize the sum of differences of node degrees

– The minimal sum of differences is 3
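The dynamic programming step can be sketched as follows. This is an illustrative O(n²) version under stated assumptions, not necessarily the exact algorithm used in the lecture: the degrees are sorted in descending order, split into consecutive groups of at least k nodes, and every node in a group is raised to the group's highest degree. The names `anonymize_degree_sequence`, `best`, and `split` are hypothetical.

```python
def anonymize_degree_sequence(degrees, k):
    """Return (anonymized sequence, cost) with minimal total degree increase."""
    d = sorted(degrees, reverse=True)
    n = len(d)
    if n < k:
        raise ValueError("need at least k nodes")

    # Prefix sums allow computing the cost of a group in constant time
    prefix = [0]
    for x in d:
        prefix.append(prefix[-1] + x)

    def group_cost(i, j):
        # Cost of raising d[i:j] to d[i], the largest degree in the group
        return d[i] * (j - i) - (prefix[j] - prefix[i])

    inf = float("inf")
    best = [inf] * (n + 1)   # best[i]: minimal cost for the first i degrees
    split = [0] * (n + 1)    # split[i]: start index of the last group
    best[0] = 0
    for i in range(k, n + 1):
        for t in range(0, i - k + 1):
            if best[t] == inf:
                continue
            cost = best[t] + group_cost(t, i)
            if cost < best[i]:
                best[i], split[i] = cost, t

    # Reconstruct the anonymized sequence from the stored split points
    result = [0] * n
    i = n
    while i > 0:
        t = split[i]
        for pos in range(t, i):
            result[pos] = d[t]
        i = t
    return result, best[n]

print(anonymize_degree_sequence([5, 3, 2, 2, 1, 1, 0], 2))
# ([5, 5, 2, 2, 1, 1, 1], 3)
```

For the example sequence, the groups are (5, 3), (2, 2), and (1, 1, 0), which gives the total increase of 3 stated above.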

(5, 3, 2, 2, 1, 1, 0) → (5, 5, 2, 2, 1, 1, 1)

– Construct a graph from the degree sequence using the original graph

• Not every degree sequence is realizable

– (5, 5, 2, 2, 1, 1, 1) has no graph representation

• The sum of the degrees has to be even (here it is 17)

– Start with 7 nodes and no edges

– Basic algorithm idea:

1. Pick node v as the node with the highest remaining degree

2. Add d(v) edges from v to other nodes w

3. Set the remaining degree of v to 0 and decrease the remaining degrees of the neighbor nodes w

4. IF all degrees are 0: RETURN the graph

5. IF the degree of any node is < 0: NO GRAPH exists; otherwise repeat from step 1

– If the degree sequence is not realizable, add random noise to the original degree sequence

• In most cases, the constructed degree sequences are realizable
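The basic algorithm idea can be sketched in a few lines. This is a minimal illustration that builds a graph from scratch and connects v to the nodes with the highest remaining degrees (a Havel-Hakimi style choice, which is an assumption here since the slide does not say how the nodes w are chosen). It does not reuse the edges of the original graph, and the name `construct_graph` is hypothetical.

```python
def construct_graph(degrees):
    """Return a list of edges realizing the degree sequence, or None."""
    residual = list(degrees)
    if sum(residual) % 2 != 0:
        return None  # an odd degree sum can never be realized
    edges = []
    while True:
        # Step 1: pick the node with the highest remaining degree
        v = max(range(len(residual)), key=lambda i: residual[i])
        dv = residual[v]
        if dv == 0:
            return edges  # Step 4: all remaining degrees are 0
        residual[v] = 0
        # Step 2: connect v to the dv other nodes with the highest remaining degree
        others = sorted(
            (i for i in range(len(residual)) if i != v),
            key=lambda i: residual[i],
            reverse=True,
        )[:dv]
        if len(others) < dv:
            return None
        for w in others:
            # Step 3: decrease the remaining degree of each neighbor
            residual[w] -= 1
            if residual[w] < 0:
                return None  # Step 5: no graph exists
            edges.append((v, w))

print(construct_graph([5, 3, 2, 2, 1, 1, 0]) is not None)  # True
print(construct_graph([5, 5, 2, 2, 1, 1, 1]))              # None
```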

After the first iteration: (5, 3, 2, 2, 1, 1, 0) → (0, 2, 1, 1, 0, 0, 0)

• A database can also be used for statistical

purposes without granting access to individual

records

– Statistical operations allow a view on the actual data

– Special protection techniques have to be applied to

protect the individual data records

• ‘Reengineering’ of actual individual values is sometimes

possible

• Statistical inference, especially taking

advantage of sequences of statistical

queries, must be prevented

13.6 Statistical Database Security

• In any case, a statistical filter for queries is needed

– Permits only statistical queries, while preventing access to individual records

• E.g., allow a ‘COUNT’ query for the number of employees whose salary is higher than $100,000, but deny queries selecting individuals having that characteristic

• But statistical filters are not sufficient to prevent inference

– E.g., first get the average salary of employees with job description ‘manager’ and then count their number

• If the number is 1, you know exactly how much your manager earns…
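The inference above can be reproduced with two purely statistical queries. The sketch below uses Python's built-in `sqlite3` module with an in-memory table; the table name `employee`, its columns, and all rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, job TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [
        ("Alice", "manager", 120000),   # hypothetical data
        ("Bob", "engineer", 80000),
        ("Carol", "engineer", 85000),
        ("Dave", "clerk", 50000),
    ],
)

# Both queries are aggregates and would pass a simple statistical filter
avg_salary = conn.execute(
    "SELECT AVG(salary) FROM employee WHERE job = 'manager'"
).fetchone()[0]
manager_count = conn.execute(
    "SELECT COUNT(*) FROM employee WHERE job = 'manager'"
).fetchone()[0]

if manager_count == 1:
    # The average over a single record is that record's value
    print("The manager earns", avg_salary)
```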


• Security measures have to be taken on top of

the authorization measures presented before

• A statistical database is

– Positively compromised, if a user finds out that an

individual has a specific characteristic (value)

– Negatively compromised, if a user finds out that a

given individual does not have a certain characteristic

• Also a simple anonymization of data does not

suffice to protect individuals


• The majority of inference protection techniques

can be classified as

– Conceptual techniques

• Involve the conceptual level of the underlying database

– Restriction-based techniques

• Deny statistical queries working on too small or too large

subsets of the data

– Perturbation-based techniques

• Introduce modifications to the data that change individual

values, but should have hardly any effect on the statistics

13.6 Inference Protection Techniques

• A good example for conceptual techniques is the lattice model

– Statistics over relational tables can be represented as a lattice, where vertexes reflect different combinations of attributes

– E.g., lattice for table T with three attributes A, B, and C

– By aggregating over some attribute, lower-dimensional tables are obtained
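As a rough illustration of the lattice, the following sketch computes a count table for every subset of the attributes A, B, and C. The empty subset corresponds to the fully aggregated table (the total count). The function name `count_lattice` and the sample records are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations

def count_lattice(records, attributes):
    """Map each attribute subset to a table of counts over its value combinations."""
    lattice = {}
    for size in range(len(attributes), -1, -1):
        for subset in combinations(attributes, size):
            lattice[subset] = Counter(
                tuple(record[a] for a in subset) for record in records
            )
    return lattice

records = [  # hypothetical table T with attributes A, B, C
    {"A": "x", "B": 1, "C": "u"},
    {"A": "x", "B": 2, "C": "u"},
    {"A": "y", "B": 1, "C": "v"},
]
lattice = count_lattice(records, ["A", "B", "C"])
print(lattice[("A", "B", "C")])  # finest level: one cell per record here
print(lattice[("A",)])           # aggregated over B and C
print(lattice[()])               # fully aggregated: total number of records
```

Any cell with count 1 at some level of this lattice points to a single individual, which is why such statistics need protection.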

13.6 Conceptual Techniques

[Figure: lattice of tables for T(A, B, C) with the levels T_ABC; T_AB, T_AC, T_BC; T_A, T_B, T_C; and T_all. Each aggregation step leads one level closer to T_all.]

• The lattice can be used to study inference protection mechanisms

– A statistic is considered to be harmful, if the n-respondent, k%-dominance criterion applies

• i.e., n or fewer records represent more than k% of the total, with n and k being fixed but secret values

– Consequently, for any vertex of the lattice, a statistic (e.g., a ‘count’) whose query set has size 1 is harmful

• By using operations involving vertexes at different levels the user can disclose sensitive statistics

– Generally, it is possible to permit a statistic in a vertex of the lattice, if the individual is not identified in some parent table in the lattice
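The n-respondent, k%-dominance criterion can be written as a small check. This is a sketch under the assumption that each record contributes a non-negative value to the statistic; the function name `violates_dominance` is hypothetical.

```python
def violates_dominance(values, n, k_percent):
    """True if the n largest contributions exceed k% of the total."""
    total = sum(values)
    if total == 0:
        return False
    top_n = sum(sorted(values, reverse=True)[:n])
    return top_n > total * k_percent / 100

print(violates_dominance([100, 5, 5], n=1, k_percent=80))      # True: one record dominates
print(violates_dominance([10, 10, 10, 10], n=1, k_percent=50)) # False
print(violates_dominance([1], n=1, k_percent=90))              # True: query set of size 1
```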


• The general aim of these techniques is to restrict statistical queries that could compromise the database

– The simplest restriction technique controls the size of the query set associated with a query

• Suppose that for some individual A, a user knows a certain characteristic ‘Ai = x’ and the respective count statistic is 1; then more information can be disclosed by issuing queries COUNT(Ai = x AND Aj = y) to find out about the Aj value, etc.

• For some secret parameter k, a statistical query is only permitted if the size of the query set is both larger than k and smaller than (database_size - k)
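A query set size control of this kind might look as follows. The sketch assumes records are Python dictionaries and queries are predicates; `restricted_count` and the sample data are invented for illustration.

```python
def restricted_count(records, predicate, k):
    """Answer a COUNT query only if k < query set size < database_size - k."""
    size = sum(1 for record in records if predicate(record))
    if not (k < size < len(records) - k):
        raise PermissionError("query set size outside the permitted range")
    return size

# Hypothetical database with 10 records
records = [{"dept": "sales", "id": i} for i in range(4)] + [
    {"dept": "it", "id": i} for i in range(4, 10)
]

print(restricted_count(records, lambda r: r["dept"] == "sales", k=2))  # 4

try:
    restricted_count(records, lambda r: r["id"] == 0, k=2)  # query set of size 1
except PermissionError as error:
    print("denied:", error)
```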

13.6 Restriction-based Techniques

• However, this simple technique is not safe e.g., against

tracker-based attacks

– A tracker is a set of formulas to pad out small size query

sets with additional records to fulfill the size restriction

• Assume an individual can be uniquely identified by the

characteristics (Ai = x AND Aj = y AND Ak = z)

• A tracker could be (Ai = x), (Ai = x AND NOT (Aj = y AND Ak = z))

• The forbidden statistic COUNT (Ai = x AND Aj = y AND Ak = z) could be calculated as COUNT (Ai = x) – COUNT (Ai = x AND NOT (Aj = y AND Ak = z))

• Once this statistic is known, more information can be obtained
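The tracker attack can be demonstrated against a simple size-restricted COUNT. In this sketch the records are tuples (dept, role, city), the target is the only sales manager in Berlin, and k = 2; all names and data are hypothetical.

```python
def restricted_count(records, predicate, k):
    """COUNT that is only answered if k < query set size < database_size - k."""
    size = sum(1 for record in records if predicate(record))
    if not (k < size < len(records) - k):
        raise PermissionError("query set size outside the permitted range")
    return size

# Hypothetical database: (dept, role, city), 12 records
records = (
    [("sales", "manager", "Berlin")]        # the targeted individual
    + [("sales", "clerk", "Berlin")] * 3
    + [("sales", "manager", "Hamburg")] * 2
    + [("it", "clerk", "Berlin")] * 6
)
K = 2

def is_target(r):
    return r[0] == "sales" and r[1] == "manager" and r[2] == "Berlin"

# The direct query has a query set of size 1 and is denied
try:
    restricted_count(records, is_target, K)
except PermissionError:
    print("direct query denied")

# Tracker: both padded queries satisfy the size restriction
count_x = restricted_count(records, lambda r: r[0] == "sales", K)
count_x_not_rest = restricted_count(
    records,
    lambda r: r[0] == "sales" and not (r[1] == "manager" and r[2] == "Berlin"),
    K,
)
inferred = count_x - count_x_not_rest
print(inferred)  # 1, the forbidden statistic
```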


• One way to deal with trackers is to generalize the query set size criterion to all logical combinations

– For a query on (A1 = a AND A2 = b AND … AND An = z), all 2^n combinations

(NOT A1 = a AND A2 = b AND … AND An = z), (A1 = a AND NOT A2 = b AND … AND An = z), …

have to fulfill the query set size restriction
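The generalized criterion can be sketched by enumerating all 2^n ways of negating the n conditions. The function name `all_combinations_permitted` and the representation of a query as a list of (attribute, value) pairs are assumptions for this example.

```python
from itertools import product

def all_combinations_permitted(records, conditions, k):
    """Check the size restriction for every negated/non-negated combination."""
    n_records = len(records)
    for negations in product([False, True], repeat=len(conditions)):
        size = sum(
            1
            for record in records
            if all(
                (record[attr] == value) != negated
                for (attr, value), negated in zip(conditions, negations)
            )
        )
        if not (k < size < n_records - k):
            return False
    return True

# Hypothetical data: every combination of a and b occurs three times
records = [{"a": x, "b": y} for x in (0, 1) for y in (0, 1) for _ in range(3)]
print(all_combinations_permitted(records, [("a", 0), ("b", 0)], k=2))  # True

# If one combination matches a single record, the query is rejected
skewed = [r for r in records if not (r["a"] == 0 and r["b"] == 0)] + [{"a": 0, "b": 0}]
print(all_combinations_permitted(skewed, [("a", 0), ("b", 0)], k=2))   # False
```

Since the number of combinations grows exponentially with n, this check quickly becomes expensive for queries with many conditions.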


• A prime example of perturbation-based techniques is data swapping

– The idea is to exchange attribute values between the records of the original database in such a way that

• The new database has no records in common with the original database

• While the statistics (up to a certain number of attributes involved in the statistics) stay correct

• A second technique is random sample queries, which are evaluated only on a random sample of the database

• Another technique is result rounding, where the response is perturbed

– Before being released, the response values are rounded up or down to the nearest multiple of a certain base b

– Users can then deduce the true value only within some interval
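Result rounding is easy to sketch for non-negative integer results such as counts. The helper `round_to_base` below is a hypothetical name, and rounding ties upwards is a choice made for this example.

```python
def round_to_base(value, base):
    """Round a non-negative integer to the nearest multiple of base (ties go up)."""
    return ((value + base // 2) // base) * base

for true_count in (7, 8, 12, 13):
    print(true_count, "->", round_to_base(true_count, 5))
# 7 -> 5, 8 -> 10, 12 -> 10, 13 -> 15
```

With base 5, a released value of 10 only tells the user that the true count lies between 8 and 12.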

13.6 Perturbation-based Techniques

• Specific attacks, however, can still disclose information

– E.g., suppose a user knows that an individual record matches a characteristic Ai = x and that the relative frequency of that value is 1/database_size

– The attacker can now discover whether the record shows the additional characteristic Aj = y by requesting the relative frequency of (Ai = x AND Aj = y)

– If the value is still 1/database_size, the record has the characteristic; if the value is 0, it does not
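The attack can be illustrated with exact relative frequencies. In this sketch the records are tuples (zip_code, diagnosis); the attribute names, the data, and the helper `relative_frequency` are invented for illustration, and no perturbation is applied to the answers.

```python
def relative_frequency(records, predicate):
    """Fraction of records that satisfy the predicate."""
    return sum(1 for record in records if predicate(record)) / len(records)

# Hypothetical database: (zip_code, diagnosis)
records = [
    ("38106", "flu"),
    ("38100", "flu"),
    ("38100", "asthma"),
    ("38102", "asthma"),
]

# The attacker knows the target is the only record with zip code 38106
base = relative_frequency(records, lambda r: r[0] == "38106")
with_flu = relative_frequency(records, lambda r: r[0] == "38106" and r[1] == "flu")

if with_flu == base:
    print("the target has the diagnosis 'flu'")
elif with_flu == 0:
    print("the target does not have the diagnosis 'flu'")
```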


14.1 Knowledge-based Systems and Deductive DBs

14.2 Distributed Databases

14.3 Information Retrieval and Web Search Engines

14.4 Spatial Databases and GIS

14.5 Multimedia Databases

14.6 Data Warehousing


14 Beyond Relational Databases