Relational Database Systems 2 - TU Braunschweig

Wolf-Tilo Balke

Jan-Christoph Kalo

Institut für Informationssysteme

Technische Universität Braunschweig

http://www.ifis.cs.tu-bs.de

Relational Database Systems 2

13. Data Privacy

13.1 What is Privacy?

13.2 Privacy Laws

13.3 K-Anonymity

13.4 Netflix Prize

13.5 Social Network Anonymization

13.6 Statistical Database Security


13 Data Privacy



• “… the ability to determine for ourselves when, how, and to what extent information about us is communicated to others …”
– Westin, 1967
• “Privacy intrusion occurs when new information about an individual is released.”
– Parent, 1983



• Privacy has several aspects and definitions:

– The right to be let alone

– Limit the access others have to one's personal information

– The option to conceal any information from others

– Secrecy

– States of privacy

– Personhood and autonomy

– Self-identity and personal growth

– Protection of intimate relationships

• Unlike database security, privacy cannot be achieved by access control
• The challenge is to utilize data while protecting individuals' privacy preferences
– Idea: Anonymize all attributes that could be used for de-anonymization

Data Privacy
• Detecting data inference
• Controlling data inference
Database Security
• Controlling database access
• Protecting database content



• Between 1850 and 1890 the popularity of newspapers grew tremendously
– New camera technology allowed the first snapshots in public places
– People feared that the new technology would be used by the “sensationalistic press”
• In 1890 Samuel D. Warren and Louis D. Brandeis published “The Right to Privacy”
– “most influential law article of all time”
– Privacy as “the right to be let alone”


13.2 History of Privacy Laws

• In the 1960s John F. Kennedy planned to introduce national registration centers
– This started a broad discussion on privacy in the USA
– Kennedy's plan failed in Congress
• In 1974 the Privacy Act was adopted in the USA
– It regulates the collection, maintenance, use, and dissemination of information about individuals by federal agencies
– Plans to extend the law to the private sector failed



• In 1970 the world's first privacy law was adopted in Hesse
– Regulates the usage of personal data by public authorities
– Ensures the informational self-determination of citizens
• In 1977 the German federal privacy law (BDSG) followed
– Revised in 1990 due to the “Volkszählungsurteil” (census ruling)
– Revised again in 2009 after several scandals



• Affected persons need to consent to the data acquisition
• Affected persons have the right to:
– Obtain information on the data stored about them
– Obtain information on why data about them is stored and where it comes from
– Demand the deletion of their data
– Prohibit the circulation of their data
• Germany has several organizations/institutions supervising privacy issues


13.2 Privacy in Germany

• A general privacy law collides with the right of

freedom of expression

• Specialized laws regulate privacy issues in the US

– COPPA

– HIPAA

• No supervising organization for privacy issues


13.2 Privacy in the US

• International Privacy Ranking from 2007


13.2 Privacy Laws Today

• 40% of the world's population uses the Internet
– In developed countries even 78% of the population use the Internet
• Data explosion on the Internet
– The average user produces about 8-10 GB of public data per day (in 2007)
• People produce a huge amount of private data
– The popularity of social networks contributes tremendously to this data

• How can we use this data?


13.3 Privacy Risks

• Most of this private data is used by companies

– Movie & Product recommendation

– Facebook Timeline optimization

– Search recommendation

• Companies and also researchers use private data for statistical analysis
• Also health and governmental institutions use Big Data
– Open health data helps researchers to improve the health system


13.3 Privacy Risks

• Problem: The growing availability of information and person-specific data leads to privacy risks
– Who is allowed to see which information?
• Goal: Anonymize data to protect individuals' privacy
– Even when data has no explicit identifying attributes, such as names or addresses, re-identification is possible


13.3 Privacy Risks

• In 1997 the medical information of the Governor of Massachusetts was re-identified by researchers from MIT
• The Massachusetts Group Insurance Commission (GIC) published medical data to improve healthcare and to control costs
– William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting all identifiers


13.3 Protecting Privacy

• Latanya Sweeney, a student at MIT working on privacy, tried to identify the governor
– She bought the voter register for Cambridge, Massachusetts, which included 54,805 persons, for $20
• Actually 101,391 people lived in Cambridge
• About half of the population was not in the register and could therefore not be identified



• It is estimated that 29,000 of these 54,805 have a unique combination of birth date, gender, and ZIP code
– 29% of Cambridge's population have a plausible risk of re-identification
• Actually, 87% of the population of the United States can be uniquely identified by gender, date of birth, and ZIP code

– Sweeney, L. A. Simple Demographics Often Identify People Uniquely. Carnegie Mellon Univ. Data Privacy Working Paper 3 (2000).



• By combining this data with the GIC records,

Sweeney found Governor Weld with ease

– Only six people in Cambridge shared his birth date,

only three of them men, and of them, only he lived

in his ZIP code



Figure: attributes of the two linked data sets
– Voters Register only: Name, Address, Date registered, Party affiliation, Date last voted
– GIC Data only: Ethnicity, Visit date, Diagnosis, Procedure, Medication, Total charge
– Contained in both: ZIP, Birthdate, Sex
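The linkage described above can be sketched in a few lines. This is only an illustration under stated assumptions: all records, names, and field names below are hypothetical toy values, not data from the GIC release or the Cambridge voter register. The join uses the three shared attributes ZIP, birth date, and sex.

```python
from collections import defaultdict

# Hypothetical toy records; none of these values come from the real data sets.
voters = [
    {"name": "Alice Adams", "zip": "02138", "birthdate": "1950-01-15", "sex": "F"},
    {"name": "Bob Brown", "zip": "02138", "birthdate": "1945-07-31", "sex": "M"},
    {"name": "Carl Clark", "zip": "02139", "birthdate": "1945-07-31", "sex": "M"},
]
medical = [
    {"zip": "02138", "birthdate": "1945-07-31", "sex": "M", "diagnosis": "Hypertension"},
    {"zip": "02139", "birthdate": "1962-03-02", "sex": "F", "diagnosis": "Asthma"},
]

QUASI_IDENTIFIERS = ("zip", "birthdate", "sex")


def link(voters, medical, keys=QUASI_IDENTIFIERS):
    """Return (name, diagnosis) pairs for medical records matching exactly one voter."""
    index = defaultdict(list)
    for voter in voters:
        index[tuple(voter[k] for k in keys)].append(voter)
    matches = []
    for record in medical:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match re-identifies the patient
            matches.append((candidates[0]["name"], record["diagnosis"]))
    return matches


print(link(voters, medical))  # [('Bob Brown', 'Hypertension')]
```

Removing names from the medical table does not help here, because the remaining attributes still single out one entry of the voter list.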

• Breach and Patch method?
– Identify privacy breaches
– Design new algorithms and techniques to fix breaches
• Better: Formally define models and specify their properties
– Formally specify the privacy model
– Derive conditions for privacy
– Design an algorithm that satisfies the privacy conditions
• Over the years several privacy models have been developed
– k-anonymity, l-diversity, t-closeness, (c,k)-safety
• There is no perfect privacy model



• k-Anonymity is a protection model that can ensure the anonymity of individuals in relational data
– Idea: Generalize, modify, or distort identifiers so that no individual is uniquely identifiable
• In recent years it has gained popularity and become one of the most important ideas with regard to privacy
– Several fast algorithms for creating k-anonymous data sets exist


13.3 k-Anonymity Model

• k-Anonymity is defined on relational data
– Table $RT(A_1, \dots, A_n)$
– The quasi-identifiers $QI_{RT}$ have to be identified in the data
• What could be used to identify individuals?
• Definition of k-anonymity
– The table $RT$ satisfies k-anonymity if and only if each sequence of values in $RT[QI_{RT}]$ appears at least $k$ times in $RT[QI_{RT}]$



Race Birth Gender ZIP Problem

Black 1988 Male 38102 Short Breath

Black 1988 Male 38102 Hypertension

Black 1989 Female 38102 Obesity

Black 1989 Female 38102 Chest Pain
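The definition can be checked mechanically by counting how often each quasi-identifier combination occurs. The following is a minimal sketch that uses the four rows of the example table; the function names and the choice of Race, Birth, Gender, and ZIP as quasi-identifiers are assumptions made for illustration.

```python
from collections import Counter

rows = [
    ("Black", 1988, "Male", "38102", "Short Breath"),
    ("Black", 1988, "Male", "38102", "Hypertension"),
    ("Black", 1989, "Female", "38102", "Obesity"),
    ("Black", 1989, "Female", "38102", "Chest Pain"),
]
# Assumed quasi-identifiers: Race, Birth, Gender, ZIP (Problem is the sensitive attribute).
QI_COLUMNS = (0, 1, 2, 3)


def anonymity_level(rows, qi_columns):
    """Return the largest k for which the table is k-anonymous (0 for an empty table)."""
    counts = Counter(tuple(row[i] for i in qi_columns) for row in rows)
    return min(counts.values(), default=0)


def is_k_anonymous(rows, qi_columns, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return anonymity_level(rows, qi_columns) >= k


print(anonymity_level(rows, QI_COLUMNS))    # 2
print(is_k_anonymous(rows, QI_COLUMNS, 3))  # False
```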

• Meaning of k-anonymity:
• If the released data satisfies k-anonymity, the combination of RT with external data cannot be matched to fewer than k individuals
• Each record is indistinguishable from at least k-1 other records within the dataset
• Larger k values imply higher privacy
– The probability of identifying an individual is at most $1/k$
• Finding the quasi-identifiers $QI_{RT}$ is hard
– If external data sources are not fully known, finding all quasi-identifiers is not possible
• Sometimes a combination of seemingly meaningless attributes can identify individuals



• Algorithms for k-anonymization involve two basic operations
– Deleting cell values or entire tuples
• When generalization causes too much information loss
– Generalizing cell values
• Replace specific quasi-identifier values with less specific values until at least k tuples share identical values
• Finding an optimal anonymization using k-anonymity is NP-hard
– Optimal means perturbing the input data as little as necessary
– Perturbing data also implies a loss of data utility
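To illustrate the generalization operation, the sketch below masks trailing ZIP digits one at a time until every (race, ZIP) combination occurs at least k times. It is a naive greedy strategy chosen for brevity, not an optimal anonymization, and it does not implement suppression; the function names and the toy rows are assumptions.

```python
from collections import Counter


def mask_zip(zip_code, hidden):
    """Replace the last `hidden` characters of a ZIP code with '*'."""
    if hidden == 0:
        return zip_code
    return zip_code[:-hidden] + "*" * hidden


def generalize_until_k(rows, k):
    """rows: non-empty list of (race, zip) tuples with ZIP codes of equal length.

    Masks ZIP digits step by step and returns the first generalized table in
    which every (race, zip) combination occurs at least k times, or None if
    generalizing the ZIP code alone is not sufficient.
    """
    zip_length = len(rows[0][1])
    for hidden in range(zip_length + 1):
        candidate = [(race, mask_zip(z, hidden)) for race, z in rows]
        if min(Counter(candidate).values()) >= k:
            return candidate
    return None


rows = [("Asian", "38102"), ("Asian", "38106"),
        ("Black", "38102"), ("Black", "38106"),
        ("White", "38102"), ("White", "38106")]
print(generalize_until_k(rows, 2))
```

For these rows a single masked digit is enough for k=2, while k=3 cannot be reached by generalizing the ZIP code alone, so the function returns None and the race attribute would have to be generalized as well.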



• Real-world data is often sorted

– Several k-anonymous data sets can be directly

matched

– Randomly sorting the tuples solves the problem


13.3 Unsorted Matching Attack

Original Data      K-Anonymous Data 1      K-Anonymous Data 2
Race   ZIP         Race    ZIP             Race   ZIP
Asian  38102       Person  38102           Asian  381**
Asian  38106       Person  38106           Asian  381**
Black  38102       Person  38102           Black  381**
Black  38106       Person  38106           Black  381**
White  38102       Person  38102           White  381**
White  38106       Person  38106           White  381**
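A short sketch of the attack on the two releases above: because both keep the row order of the original table, taking Race from one release and ZIP from the other, row by row, restores the original data. The helper names are assumptions; `shuffled` shows the countermeasure of randomly reordering tuples before release.

```python
import random

release_1 = [("Person", "38102"), ("Person", "38106"), ("Person", "38102"),
             ("Person", "38106"), ("Person", "38102"), ("Person", "38106")]
release_2 = [("Asian", "381**"), ("Asian", "381**"), ("Black", "381**"),
             ("Black", "381**"), ("White", "381**"), ("White", "381**")]


def match_by_position(zip_release, race_release):
    """Combine two releases row by row: ZIP from the first, Race from the second."""
    return [(race, zip_code)
            for (_, zip_code), (race, _) in zip(zip_release, race_release)]


def shuffled(rows, seed=None):
    """Return a randomly reordered copy of the rows (the countermeasure)."""
    rng = random.Random(seed)
    copy = list(rows)
    rng.shuffle(copy)
    return copy


print(match_by_position(release_1, release_2))
# [('Asian', '38102'), ('Asian', '38106'), ('Black', '38102'),
#  ('Black', '38106'), ('White', '38102'), ('White', '38106')]
```

If each release is shuffled independently, the row position no longer carries information and the positional match yields arbitrary pairings.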

• Both data sets are k-anonymous with k=2

– Matched using the Problem attribute


13.3 Complementary Release Attack

Race Birthyear Gender ZIP Problem


Black 1965 Male 38100 Chest Pain

Person 1965 Female 381** Painful Eye

Person 1965 Female 381** Wheezing



White 1964 Male 381** Short Breath

Person 1965 Female 381** Hypertension

White 1964 Male 381** Obesity

White 1964 Male 381** Fever

White 1967 Male 38106 Vomiting

White 1967 Male 38106 Back Pain



Black 1965 Male 38100 Chest Pain

Black 1965 Female 38106 Painful Eye

Black 1965 Female 38106 Wheezing



White 19** Male 38106 Short Breath

White 19** Human 38102 Hypertension

White 19** Human 38102 Obesity

White 19** Human 38102 Fever

White 19** Male 38106 Vomiting

White 19** Male 38106 Back Pain

• Also little diversity with regard to the sensitive

attributes can lead to privacy problems

– 4-anonymous patient data

• Privacy is not guaranteed

since all 4 individuals

have cancer


13.3 Homogeneity Attack


Human 196* * 38100 Short Breath

Human 196* * 38100 Chest Pain

Human 196* * 38100 Painful Eye

Human 196* * 38100 Wheezing

Human 197* * 38106 Obesity



Human 197* * 38106 Cancer





• In datasets with little diversity, background knowledge can lead to privacy issues
• Goal: Find the medical problem of my neighbour in 38100
– I know that he just came back from Vietnam
– It is more likely that he suffers from Dengue fever
• The available background knowledge is not known when data is released
– Background knowledge might be arbitrarily complex


13.3 Background Knowledge Attack


Human 196* * 38100 Heart Disease

Human 196* * 38100 Heart Disease

Human 196* * 38100 Dengue fever

Human 196* * 38100 Dengue fever





• Homogeneity and background knowledge attacks can be prevented by diversity
– Sensitive attributes must be diverse within each class
– If each class has at least l different values, it is l-diverse
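The condition "at least l different sensitive values per class" can be tested as follows. This is a sketch of that simple distinct-value variant only; the rows are hypothetical and follow the layout of the surrounding examples (Race, Birth, Gender, ZIP, Problem).

```python
from collections import defaultdict


def diversity_level(rows, qi_columns, sensitive_column):
    """Return the largest l such that every equivalence class (rows sharing the
    same quasi-identifier values) contains at least l distinct sensitive values."""
    classes = defaultdict(set)
    for row in rows:
        classes[tuple(row[i] for i in qi_columns)].add(row[sensitive_column])
    return min((len(values) for values in classes.values()), default=0)


QI = (0, 1, 2, 3)
SENSITIVE = 4

# Hypothetical 4-anonymous class in which everyone has the same diagnosis.
homogeneous = [("Human", "197*", "*", "38106", "Cancer")] * 4
# Hypothetical 4-anonymous class with two different diagnoses.
mixed = [
    ("Human", "196*", "*", "38100", "Heart Disease"),
    ("Human", "196*", "*", "38100", "Heart Disease"),
    ("Human", "196*", "*", "38100", "Dengue fever"),
    ("Human", "196*", "*", "38100", "Dengue fever"),
]

print(diversity_level(homogeneous, QI, SENSITIVE))  # 1
print(diversity_level(mixed, QI, SENSITIVE))        # 2
```

Both tables are 4-anonymous, but only the second one is 2-diverse; the first one reveals the diagnosis of every member of the class.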


13.3 L-Diversity




Human 196* * 38100 Wheezing







• L-Diversity cannot guarantee perfect anonymization either
• Imagine an HIV test result as a sensitive attribute

– Diversity is hard to achieve and unnecessary

• Overall distribution of sensitive attributes is not

considered

– Skewness attack is possible

• See the HIV example with very few positive results

• Several other attacks on l-diverse data sets are

possible


13.3 L-Diversity

• K-Anonymizing data relies on generalization

– Each record is generalized together with its k nearest neighbours

• However, datasets often have many attributes

– Data is very sparse

– K nearest neighbours are very far away

• In high dimensional datasets k-anonymization

leads to loss of information

– Loss of data utility


13.3 Problem: Curse of Dimensionality

• Recommender systems play an important role in e-commerce
• Video, product, and music recommendation
• …but also Facebook, Twitter, Instagram
• Most systems are based on collaborative filtering
– Users express their preferences by ratings
– Match user ratings against each other and find users with similar taste
– Recommend movies that users with a similar taste rated highly


13.4 The Netflix Prize

• Netflix held an open competition for

collaborative filtering algorithms

– 20,000 teams from 150 countries

registered for competition

– Training dataset contains:

• 100,480,507 ratings from 480,189 users

• Goal: Improve Netflix's algorithm by 10% to win the prize of 1 million dollars

– BellKor's Pragmatic Chaos achieved a 10.5%

improvement in 2009



• The Netflix dataset was anonymized
– No information about users or films
• Each rating consists of the following data: user id, movie id, date of rating, rating from 1 to 5
• The average user rated about 200 movies
• The average movie was rated by more than 5000 users
• Netflix claimed that the data had been perturbed
• Researchers de-anonymized the dataset by cross-matching it with the Internet Movie Database (IMDB)
– 50 users have been identified
– Researchers uncovered their political views and other sensitive information



• Netflix has been criticized by privacy advocates

since the release of the dataset

– In 2007 researchers identified individual users
– In 2009 four users filed a lawsuit against Netflix

• In March 2010 the Netflix Prize was cancelled

– The dataset was removed from the Web

– Plaintiffs dismissed lawsuits after settlement with

Netflix



• Arvind Narayanan and Vitaly Shmatikov from the

University of Texas identify Netflix users in the

IMDB

– 50 users have been de-anonymized as a proof of concept

• The Netflix Prize data set is very sparse

– 480,000 users have on average 200 reviews

– 17,700 movies or attributes per user

• Compared to k-Anonymity, there are no quasi-identifiers

– Direct de-anonymization is not possible


13.4 De-Anonymization of Sparse Data

• Perturbation of Netflix data cannot be too large

– Introducing large errors massively harms data utility

for recommendation

– Comparisons to other Netflix datasets showed that only very little noise was introduced
• For one user 5 out of 229 ratings had been changed, for another one 1 out of 306

• Ratings reveal some information of individuals

– Non-null values are very rare

– Some movies have been rated only by few people



• Rating profiles of most users are unique

– Data is not k-anonymized for k>1

• K-Anonymization is bound to failure

– The rating similarity among users is

very low

– Therefore generalizing attributes

would lead to a great loss of

information



• Similarity between records (individuals) can be

computed by comparing their ratings

– Similarity measure favors rarely rated movies

• Sparse data set example:



1 2 3 4 5 6 7 8 9 10

𝑟1 3 4 5 1

𝑟2 3 4 1 1

𝑟3 4 5 1

𝑟4 4 1

𝑟5 5 2 4

(rows: users r1 to r5, columns: movies 1 to 10)

• Basic idea for de-anonymizing record p using dataset A
1. Compute similarity scores from person p to all records in A
2. If the distance $\frac{max - max_2}{\sigma}$ between the highest score $max$ and the second-highest score $max_2$, measured in standard deviations $\sigma$ of all scores, is greater than some threshold $\Phi$, then the most similar record is a match
• If no record could be identified
– Even a small number of candidate records similar to the target reveals a lot of information
• For de-anonymizing the Netflix dataset a more complex variant of the algorithm was used
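The two steps can be sketched as follows. This is a simplified illustration, not the variant that was applied to the Netflix data: the similarity score (exact rating matches, weighted by `1 / log(1 + support)` so that rarely rated movies count more), the toy ratings, and all function names are assumptions.

```python
import math
from statistics import pstdev


def movie_weights(dataset):
    """Weight each movie by 1 / log(1 + number of users who rated it)."""
    support = {}
    for ratings in dataset.values():
        for movie in ratings:
            support[movie] = support.get(movie, 0) + 1
    return {movie: 1.0 / math.log(1 + count) for movie, count in support.items()}


def score(aux, ratings, weights):
    """Sum the weights of all movies on which aux and the record agree exactly."""
    return sum(weights.get(movie, 0.0)
               for movie, rating in aux.items()
               if ratings.get(movie) == rating)


def best_match(aux, dataset, phi):
    """Return the id of the best-scoring record if (max - max2) / sigma > phi, else None."""
    weights = movie_weights(dataset)
    scores = {rid: score(aux, ratings, weights) for rid, ratings in dataset.items()}
    if len(scores) < 2:
        return None
    ranked = sorted(scores, key=scores.get, reverse=True)
    best, second = scores[ranked[0]], scores[ranked[1]]
    sigma = pstdev(list(scores.values()))
    if sigma == 0:
        return None  # all records look the same, nothing stands out
    return ranked[0] if (best - second) / sigma > phi else None


# Hypothetical sparse rating data: record id -> {movie id: rating}
dataset = {
    "r1": {1: 3, 4: 4, 7: 5, 9: 1},
    "r2": {1: 3, 2: 4, 5: 1, 9: 1},
    "r3": {3: 4, 7: 5, 10: 1},
    "r4": {2: 4, 6: 1},
    "r5": {4: 5, 8: 2, 10: 4},
}
aux = {7: 5, 4: 4}  # background knowledge about person p

print(best_match(aux, dataset, phi=1.0))  # r1
print(best_match(aux, dataset, phi=2.0))  # None
```

With the lower threshold the best record stands out clearly enough to be reported; with the stricter threshold the algorithm refuses to name a match.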



• Algorithms are robust to noise in the background knowledge

• Very little information is needed to de-anonymize an average individual in the Netflix data set

– 8 movie ratings and rating dates (±14 days) lead to an accuracy of 99%

– Only 2 ratings and rating dates (±3 days) are needed to still achieve a matching quality of 68%

• Political orientation and religious views can be revealed by the opinion on movies



• In March 2016 Facebook had 1.65 billion active

users

– About 28 million active users in Germany

• Instagram has more than 500 million active

users

• Snapchat has 150 million active users per day

– More than 10 billion video views per day

• 310 million people actively use Twitter

– ...but also lots of private data!


13.5 Privacy in Social Networks



• Social networks are even worse
– Sexual orientation, ethnicity, religious and political views, personality traits, intelligence, happiness, and use of addictive substances can be predicted…
• Every interaction with social networks tells something about the user
– Every like, friend request, or post can be used as an attribute for later classification



• When a user joins a social network, he/she will explore the environment
– What Facebook “likes” tell about members
• Kosinski M, Stillwell D, Graepel T (2013) Private traits and attributes are predictable from digital records of human behavior. PNAS 110 (15).
– What Facebook friendships tell about sexual orientation
• Jernigan C, Mistree BFT (2009) Gaydar: Facebook Friendships Expose Sexual Orientation. First Monday 14(10).
– What Facebook knows about non-member friends of members
• Horvát E-Á, Hanselmann M, Hamprecht FA, Zweig KA (2012) One Plus One Makes Three (for Social Networks). PLoS ONE 7(4): e34740.



• Which information needs to be anonymized?

– Individuals should not be re-identified

– Friendships/interactions between two persons should not be inferred

– Sensitive attributes of individuals should stay secret

• Similar to before:

– Deleting/generalizing identifying attributes is not enough

• Re-identification using graph structure is possible



• A simple anonymization of a Social Network

would remove all identifiers

– Only the structure of the network is available

• What can Tilo and Jan find out about the

network?


13.5 De-anonymization in Social Networks

Figure: social network structure with the nodes Tilo, Jan, Jose, Stephan, Silviu, Kinda, and Christoph

• Tilo has 3 friends in the social network

– There is only a single node with node degree 3

– Jan can also re-identify himself






• They can share the information about their

friendships

– Both are friends with Jose






• Christoph is only friends with Tilo

– Tilo can also learn that Jose and Christoph are

friends






• Major challenge in anonymizing social networks
– Adding/deleting edges also affects the neighbourhood properties of other nodes
• Can we use k-anonymization in large social networks?
– We do not want to change the neighbourhoods of other nodes
– What do we want to protect against?
• Node re-identification
• Edge disclosure
• Sensitive attributes
– What do we want to anonymize?
• Node degrees
• Neighborhood of nodes
• Attributes
• Structural knowledge of the graph


13.5 Advanced Anonymization in Social Networks

• Anonymizing the degree distribution to protect against node re-identification

– Goal: Transform the social network into a k-degree anonymous graph

• For every node there are at least k-1 other nodes with the same degree

• Basic Idea:

– Step 1: Construct a degree distribution that is close to original distribution, by minimally increasing degrees of a few nodes.

– Step 2: Construct a graph satisfying the new degree distribution close to the original graph by adding minimum number of edges.
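Whether a graph is k-degree anonymous can be checked from its degree histogram. The sketch below uses a hypothetical four-node graph with placeholder node names (the structure of the network shown in the slides is not reproduced here) and also lists the nodes that could be re-identified by their degree alone.

```python
from collections import Counter


def degrees(edges):
    """Count the degree of every node that appears in an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def is_k_degree_anonymous(nodes, edges, k):
    """True if every degree value that occurs is shared by at least k nodes."""
    deg = degrees(edges)
    histogram = Counter(deg[node] for node in nodes)
    return all(count >= k for count in histogram.values())


def identifiable_by_degree(nodes, edges):
    """Return the nodes whose degree is unique in the graph."""
    deg = degrees(edges)
    histogram = Counter(deg[node] for node in nodes)
    return sorted(node for node in nodes if histogram[deg[node]] == 1)


# Hypothetical graph: a has degree 3, b and c have degree 2, d has degree 1.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]

print(is_k_degree_anonymous(nodes, edges, 2))  # False
print(identifiable_by_degree(nodes, edges))    # ['a', 'd']
```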



• Building a 2-degree anonymous graph

– Build a degree sequence for the social network

• Only increasing node degrees is allowed

– Use dynamic programming to find an optimal k-degree anonymous sequence

• Minimize the sum of differences of node degrees

– The minimal sum of differences is 3

(5, 3, 2, 2, 1, 1, 0) → (5, 5, 2, 2, 1, 1, 1)
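The dynamic program can be sketched as follows, assuming the degree sequence is sorted in descending order and split into consecutive groups of at least k nodes, each group being raised to its largest degree. This is a plain quadratic formulation written for illustration; the function name and the group-based formulation are assumptions, and published algorithms add further optimizations.

```python
def anonymize_degree_sequence(degrees, k):
    """Return (cost, sequence): a k-anonymous degree sequence obtained by only
    increasing degrees, with minimal total increase."""
    d = sorted(degrees, reverse=True)
    n = len(d)
    if n < k:
        raise ValueError("need at least k nodes")
    inf = float("inf")
    best = [inf] * (n + 1)  # best[i]: minimal cost for the first i degrees
    cut = [0] * (n + 1)     # cut[i]: start index of the last group in that solution
    best[0] = 0
    for i in range(k, n + 1):
        for j in range(0, i - k + 1):
            if best[j] == inf:
                continue
            # Raise d[j], ..., d[i-1] to d[j], the largest degree of the group.
            cost = best[j] + sum(d[j] - x for x in d[j:i])
            if cost < best[i]:
                best[i], cut[i] = cost, j
    # Reconstruct the anonymized sequence from the stored group boundaries.
    result = d[:]
    i = n
    while i > 0:
        j = cut[i]
        result[j:i] = [d[j]] * (i - j)
        i = j
    return best[n], result


print(anonymize_degree_sequence([5, 3, 2, 2, 1, 1, 0], k=2))
# (3, [5, 5, 2, 2, 1, 1, 1])
```

For the example sequence the groups are (5, 3), (2, 2), and (1, 1, 0), which gives the total increase of 3 stated above.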

– Construct a graph from the degree sequence using the original graph
• Not every degree sequence is realizable
– (5, 5, 2, 2, 1, 1, 1) has no graph representation
• The sum of the degrees has to be even
– Start with 7 nodes and no edges
– Basic algorithm idea:
1. Pick node v as the node with the highest remaining degree
2. Add deg(v) edges from v to other nodes w
3. Decrease the remaining degrees of these neighbor nodes and set the remaining degree of v to 0
4. IF all remaining degrees are 0: RETURN the graph
5. IF the remaining degree of any node is < 0: NO GRAPH exists
6. Otherwise repeat from step 1
– If the degree sequence is not realizable, add random noise to the original degree sequence
• In most cases, the constructed degree sequences are realizable



(5, 3, 2, 2, 1, 1, 0) → (0, 2, 1, 1, 0, 0, 0)
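A realizability test in the spirit of the steps above can be written with the Havel-Hakimi procedure, which always connects the node of highest remaining degree to the nodes with the next-highest remaining degrees. This choice of neighbors is an assumption made here, and the sketch only decides whether a simple graph exists; it does not construct the graph or take the original graph into account.

```python
def is_realizable(degrees):
    """Havel-Hakimi test: can the degree sequence be realized by a simple graph?"""
    remaining = sorted(degrees, reverse=True)
    if sum(remaining) % 2 != 0:
        return False  # every edge contributes 2 to the degree sum
    while remaining and remaining[0] > 0:
        d = remaining.pop(0)  # node with the highest remaining degree
        if d > len(remaining):
            return False      # not enough other nodes to connect to
        for i in range(d):    # connect to the d next-highest nodes
            remaining[i] -= 1
            if remaining[i] < 0:
                return False
        remaining.sort(reverse=True)
    return True


print(is_realizable([5, 3, 2, 2, 1, 1, 0]))  # True
print(is_realizable([5, 5, 2, 2, 1, 1, 1]))  # False (odd degree sum)
print(is_realizable([3, 3, 1, 1]))           # False (even sum, still no graph)
```

The first round on (5, 3, 2, 2, 1, 1, 0) leaves the remaining degrees 2, 1, 1, 0, 0, 0 for the other six nodes, which matches the step shown above.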

• A database can also be used for statistical

purposes without granting access to individual

records

– Statistical operations allow a view on the actual data

– Special protection techniques have to be applied to

protect the individual data records

• ‘Reengineering’ of actual individual values is sometimes

possible

• Statistical inference, especially taking

advantage of sequences of statistical

queries, must be prevented



• In any case, a statistical filter for queries is needed
– Permits only statistical queries, while preventing access to individual records
• E.g., allow a ‘COUNT’ query for the number of employees whose salary is higher than $100,000, but deny queries selecting individuals having that characteristic
• But statistical filters are not sufficient to prevent inference
– E.g., first get the average salary of employees with job description ‘manager’ and then count their number
• If the number is 1, you know exactly how much your manager earns…



• Security measures have to be taken on top of

the authorization measures presented before

• A statistical database is

– Positively compromised, if a user finds out that an

individual has a specific characteristic (value)

– Negatively compromised, if a user finds out that a

given individual does not have a certain characteristic

• Also a simple anonymization of data does not

suffice to protect individuals



• The majority of inference protection techniques

can be classified as

– Conceptual techniques

• Involve the conceptual level of the underlying database

– Restriction-based techniques

• Deny statistical queries working on too small or too large

subsets of the data

– Perturbation-based techniques

• Introduce modification to the data which change individual

values, but should have hardly any effect on the statistics


13.6 Inference Protection Techniques

• A good example for conceptual techniques is the lattice model
– Statistics over relational tables can be represented as a lattice, where vertices reflect different combinations of attributes
– E.g., lattice for table T with three attributes A, B, and C
– By aggregating over some attribute, lower-dimensional tables are obtained


13.6 Conceptual Techniques

Figure: aggregation lattice with the levels Tall; TA, TB, TC; TAB, TAC, TBC; TABC (edges denote aggregation)

• The lattice can be used to study inference protection mechanisms
– A statistic is considered to be harmful if the n-respondent, k%-dominance criterion applies
• i.e., n or fewer records represent more than k% of the total, with n and k being fixed but secret values
– Consequently, for any vertex of the lattice, e.g., a ‘count’ statistic with a query set of size 1 is harmful
• By using operations involving vertices at different levels the user can disclose sensitive statistics
– Generally, it is possible to permit a statistic in a vertex of the lattice, if the individual is not identified in some parent table in the lattice


13.6 Conceptual Techniques

• The general aim of these techniques is to restrict statistical queries that could compromise the database

– The simplest restriction technique controls the size of the query set associated with a query

• Suppose that for some individual A, a user knows a certain characteristic ‘Ai = x’ and the respective count statistic is 1; then more information can be disclosed by issuing queries COUNT(Ai = x AND Aj = y) to find out about the Aj value, etc.

• For some secret parameter k, a statistical query is only permitted if the size of the query set is both larger than k and smaller than (database_size - k)


13.6 Restriction-based Techniques

• However, this simple technique is not safe, e.g., against tracker-based attacks
– A tracker is a set of formulas that pad out small query sets with additional records to fulfill the size restriction
• Assume an individual can be uniquely identified by the characteristics (Ai = x AND Aj = y AND Ak = z)
• A tracker could be (Ai = x), (Ai = x AND NOT (Aj = y AND Ak = z))
• The forbidden statistic COUNT (Ai = x AND Aj = y AND Ak = z) could be calculated as COUNT (Ai = x) – COUNT (Ai = x AND NOT (Aj = y AND Ak = z))
• With that statistic, more information can be obtained
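A small sketch of a tracker attack against the query set size restriction. The employee records, attribute choices, and the value of k are hypothetical. Note that the padding query uses the complement NOT (Aj = y AND Ak = z), so that the two permitted counts differ exactly by the forbidden one.

```python
K = 2

# Hypothetical employee records: (department, sex, role)
records = [
    ("IT", "F", "manager"),  # the target: the only female manager in IT
    ("IT", "M", "developer"),
    ("IT", "M", "developer"),
    ("IT", "F", "developer"),
    ("IT", "M", "manager"),
] + [("Sales", "M", "clerk")] * 4 + [("Sales", "F", "clerk")] * 3


def restricted_count(records, predicate, k=K):
    """COUNT query that is only answered if k < query set size < database_size - k."""
    size = sum(1 for r in records if predicate(r))
    if k < size < len(records) - k:
        return size
    raise PermissionError("query set size violates the restriction")


def in_it(r):
    return r[0] == "IT"                                     # Ai = x


def target(r):
    return r[0] == "IT" and r[1] == "F" and r[2] == "manager"


def tracker(r):
    return in_it(r) and not (r[1] == "F" and r[2] == "manager")


try:
    restricted_count(records, target)
except PermissionError as error:
    print("direct query refused:", error)

# Both padded queries are permitted, and their difference is the forbidden count.
tracked = restricted_count(records, in_it) - restricted_count(records, tracker)
print(tracked)  # 1
```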



• One way to deal with trackers is to generalize the query set size criterion to all logical combinations
– For a query on (A1 = a AND A2 = b AND…AND An = z) all $2^n$ combinations
(NOT A1 = a AND A2 = b AND…AND An = z), (A1 = a AND NOT A2 = b AND…AND An = z),…
have to fulfill the query set size restriction



• A prime example for perturbation-based techniques is data swapping
– The idea is to exchange attribute values between the records of the original database in such a way that
• The new database has no common records with the original database
• While the statistics (up to a certain number of attributes involved in the statistics) stay correct
• A second technique is random sample queries, which are performed only on a random sample of the database
• Another technique is result rounding, where the response is perturbed
– Before being released the response values are rounded up or down to the nearest multiple of a certain base b
– Users can then deduce the true value only within some interval


13.6 Perturbation-based Techniques

• Specific attacks, however can still disclose information

– E.g., consider a user knows that an individual record matches a characteristic Ai = x and that the relative frequency of having that value is 1/database_size

– The attacker can now discover whether the record shows the additional characteristic Aj = y by requesting the relative frequency of (Ai = x AND Aj = y )

– If the value is still 1/database_size, it has the characteristic, if the value is 0 it does not…


13.6 Perturbation-based Techniques

14.1 Knowledge-based Systems

and Deductive DBs

14.2 Distributed Databases

14.3 Information Retrieval and

Web Search Engines

14.4 Spatial databases and GIS

14.5 Multimedia Databases

14.6 Data Warehousing


14 Beyond Relational Databases
