Wolf-Tilo Balke
Jan-Christoph Kalo
Institut für Informationssysteme
Technische Universität Braunschweig
http://www.ifis.cs.tu-bs.de
Relational Database Systems 2 – 13. Data Privacy
13.1 What is Privacy?
13.2 Privacy Laws
13.3 K-Anonymity
13.4 Netflix Prize
13.5 Social Network Anonymization
13.6 Statistical Database Security
Relational Database Systems 2 – Wolf-Tilo Balke– Institut für Informationssysteme 2
13 Data Privacy
13.1 What is Privacy?
• "… the ability to determine for ourselves when, how, and to what extent information about us is communicated to others …" (Westin, 1967)
• "Privacy intrusion occurs when new information about an individual is released." (Parent, 1983)
13.1 What is Privacy?
• Privacy has several aspects and definitions:
– The right to be let alone
– Limit the access others have to one's personal information
– The option to conceal any information from others
– Secrecy
– States of privacy
– Personhood and autonomy
– Self-identity and personal growth
– Protection of intimate relationships
• Unlike database security, privacy cannot be achieved by access control alone
• The challenge is to utilize data while protecting individuals' privacy preferences
– Idea: Anonymize all attributes
that could be used for
de-anonymization
• Data Privacy: detecting data inference, controlling data inference
• Database Security: controlling database access, protecting database content
13.1 What is Privacy?
• Between 1850 and 1890, the popularity of newspapers grew tremendously
– New camera technology allowed the first snapshots in public places
– People feared the new technology would be used by the "sensationalistic press"
• In 1890, Samuel D. Warren and Louis D. Brandeis published "The Right to Privacy"
– "most influential law article of all time"
– Privacy as "the right to be let alone"
13.2 History of Privacy Laws
• In the 1960s, John F. Kennedy planned to introduce national registration centers
– In the USA, a large discussion on privacy started
– Kennedy's plan failed in Congress
• In 1974, the Privacy Act was adopted in the USA
– It regulates the collection, maintenance, use, and dissemination of information about individuals by federal agencies
– Plans to extend the law to the private sector failed
13.2 History of Privacy Laws
• In 1970, the world's first privacy law was adopted in Hesse
– It regulates the usage of personal data by public authorities
– It ensures the informational self-determination of citizens
• In 1977, the German federal privacy law (BDSG) followed
– Renewed in 1990 due to the "Volkszählungsurteil" (the German census verdict)
– Due to several scandals, the law was renewed again in 2009
13.2 History of Privacy Laws
• Affected persons need to consent to the data acquisition
• Affected persons have the right to:
– Obtain information on the data stored about them
– Obtain information on why data about them is stored and where it comes from
– Demand the deletion of their data
– Prohibit their data from being circulated
• Germany has several organizations/institutions
supervising privacy issues
13.2 Privacy in Germany
• A general privacy law collides with the right to freedom of expression
• Specialized laws regulate privacy issues in the US
– COPPA (Children's Online Privacy Protection Act)
– HIPAA (Health Insurance Portability and Accountability Act)
• There is no supervising organization for privacy issues
13.2 Privacy in the US
• International Privacy Ranking from 2007
13.2 Privacy Laws Today
• 40% of the world's population uses the Internet
– In developed countries, even 78% of the population uses the Internet
• Data explosion on the Internet
– The average user produced about 8-10 GB of public data per day (in 2007)
• People produce a huge amount of private data
– The popularity of social networks contributes tremendously to this data
• How can we use this data?
13.3 Privacy Risks
• Most of this private data is used by companies
– Movie & product recommendation
– Facebook Timeline optimization
– Search recommendation
• Companies, and also researchers, use private data for statistical analysis
• Health and governmental institutions also use Big Data
– Open health data helps researchers to improve the health system
13.3 Privacy Risks
• Problem: The growing availability of information and person-specific data leads to privacy risks
– Who is allowed to see which information?
• Goal: Anonymize data to protect individuals' privacy
– Even when data has no explicit identifying attributes, such as names or addresses, re-identification is possible
13.3 Privacy Risks
• In 1997, the medical record of the Governor of Massachusetts was re-identified by researchers from MIT
• The Massachusetts Group Insurance Commission (GIC) had published medical data to improve healthcare and to control costs
– William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting all identifiers
13.3 Protecting Privacy
• Latanya Sweeney, a student at MIT working on privacy, tried to identify the governor
– She bought the voter register for Cambridge, Massachusetts, covering 54,805 persons, for $20
• Actually, 101,391 people lived in Cambridge
• About half of the population could therefore not have been identified this way
13.3 Protecting Privacy
• It is estimated that 29,000 of these 54,805 persons have a unique combination of birthdate, gender, and ZIP code
– 29% of Cambridge's population have a plausible risk of re-identification
• Actually, 87% of the population of the United States can be uniquely identified by gender, date of birth, and ZIP code
– Sweeney, L. "Simple Demographics Often Identify People Uniquely." Carnegie Mellon Univ., Data Privacy Working Paper 3 (2000)
13.3 Protecting Privacy
• By combining this data with the GIC records,
Sweeney found Governor Weld with ease
– Only six people in Cambridge shared his birth date,
only three of them men, and of them, only he lived
in his ZIP code
13.3 Protecting Privacy
• Attributes of the two linked data sets:
– GIC data: Ethnicity, Visit date, Diagnosis, Procedure, Medication, Total charge
– Voter register: Name, Address, Date registered, Party affiliation, Date last voted
– Shared by both (used for linking): ZIP, Birthdate, Sex
• Breach-and-patch method?
– Identify privacy breaches
– Design new algorithms and techniques to fix the breaches
• Better: Formally define models and specify their properties
– Formally specify the privacy model
– Derive conditions for privacy
– Design an algorithm that satisfies the privacy conditions
• Over the years, several privacy models have been developed
– k-anonymity, l-diversity, t-closeness, (c,k)-safety
• There is no perfect privacy model
13.3 Protecting Privacy
• k-anonymity is a protection model that can ensure the anonymity of individuals in relational data
– Idea: Generalize, modify, or distort identifiers so that no individual is uniquely identifiable
• In recent years, it has gained popularity and become one of the most important ideas with regard to privacy
– Several fast algorithms for creating k-anonymous data sets exist
13.3 k-Anonymity Model
• k-anonymity is defined on relational data
– Table RT(A1, …, An)
– The quasi-identifiers QI_RT have to be identified in the data
• What could be used to identify individuals?
• Definition of k-anonymity:
– The table RT satisfies k-anonymity if and only if each sequence of values in RT[QI_RT] appears at least k times
13.3 k-Anonymity Model
Race Birth Gender ZIP Problem
Black 1988 Male 38102 Short Breath
Black 1988 Male 38102 Hypertension
Black 1989 Female 38102 Obesity
Black 1989 Female 38102 Chest Pain
• Meaning of k-anonymity:
– If the released data satisfies k-anonymity, the combination of RT with external data cannot be matched to fewer than k individuals
– Each record is indistinguishable from at least k-1 other records within the data set
• Large k values imply higher privacy
– The probability of identifying an individual is 1/k
• Finding the quasi-identifiers QI_RT is hard
– If external data sources are not fully known, finding all quasi-identifiers is not possible
• Sometimes the combination of seemingly meaningless attributes can identify individuals
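The definition above can be checked mechanically by counting quasi-identifier combinations. A minimal sketch (the function and variable names are my own, not from the lecture), using the four-row patient table from this section:

```python
from collections import Counter

def is_k_anonymous(table, quasi_identifiers, k):
    """Check Sweeney's definition: every combination of quasi-identifier
    values must occur at least k times in the table."""
    combos = Counter(tuple(row[a] for a in quasi_identifiers) for row in table)
    return all(count >= k for count in combos.values())

# The 4-row patient table from the slide
patients = [
    {"Race": "Black", "Birth": 1988, "Gender": "Male",   "ZIP": 38102, "Problem": "Short Breath"},
    {"Race": "Black", "Birth": 1988, "Gender": "Male",   "ZIP": 38102, "Problem": "Hypertension"},
    {"Race": "Black", "Birth": 1989, "Gender": "Female", "ZIP": 38102, "Problem": "Obesity"},
    {"Race": "Black", "Birth": 1989, "Gender": "Female", "ZIP": 38102, "Problem": "Chest Pain"},
]
qi = ["Race", "Birth", "Gender", "ZIP"]
print(is_k_anonymous(patients, qi, 2), is_k_anonymous(patients, qi, 3))  # True False
```

Each of the two quasi-identifier combinations occurs exactly twice, so the table is 2-anonymous but not 3-anonymous.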
13.3 k-Anonymity Model
• Algorithms for k-anonymization involve two basic operations
– Deleting cell values or entire tuples (suppression)
• Used when generalization causes too much information loss
– Generalizing cell values
• Replace specific quasi-identifier values with less specific ones until at least k identical values remain
• Finding an optimal anonymization using k-anonymity is NP-hard
– Optimal means perturbing the input data as little as necessary
– Perturbing the data also implies a loss of data utility
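The generalization operation can be illustrated for the ZIP and birth-year attributes used in the examples of this section. A minimal sketch with hypothetical helper names (masking trailing digits is one common generalization hierarchy, not the only one):

```python
def generalize_zip(zip_code, level):
    """Replace the last `level` digits of a ZIP code with '*'."""
    s = str(zip_code)
    return s[: len(s) - level] + "*" * level if level > 0 else s

def generalize_year(year, level):
    """Coarsen a year: level 0 -> '1965', level 1 -> '196*', level 2 -> '19**'."""
    s = str(year)
    return s[: len(s) - level] + "*" * level if level > 0 else s

print(generalize_zip(38102, 2))   # 381**
print(generalize_year(1965, 1))   # 196*
```

An anonymization algorithm would raise the generalization level of a quasi-identifier step by step until every combination occurs at least k times, and fall back to suppression when the loss becomes too large.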
13.3 k-Anonymity Model
• Real-world data is often sorted
– Several k-anonymous data sets can then be directly matched
– Randomly shuffling the tuples solves the problem
13.3 Unsorted Matching Attack
Original Data:
Race   ZIP
Asian  38102
Asian  38106
Black  38102
Black  38106
White  38102
White  38106

K-Anonymous Data 1:
Race    ZIP
Person  38102
Person  38106
Person  38102
Person  38106
Person  38102
Person  38106

K-Anonymous Data 2:
Race   ZIP
Asian  381**
Asian  381**
Black  381**
Black  381**
White  381**
White  381**
• Both released data sets are k-anonymous with k = 2
– They can be matched using the Problem attribute
13.3 Complementary Release Attack
Released table 1:
Race    Birthyear  Gender  ZIP    Problem
Black   1965       Male    38100  Short Breath
Black   1965       Male    38100  Chest Pain
Person  1965       Female  381**  Painful Eye
Person  1965       Female  381**  Wheezing
Black   1964       Female  38106  Obesity
Black   1964       Female  38106  Chest Pain
White   1964       Male    381**  Short Breath
Person  1965       Female  381**  Hypertension
White   1964       Male    381**  Obesity
White   1964       Male    381**  Fever
White   1967       Male    38106  Vomiting
White   1967       Male    38106  Back Pain

Released table 2:
Race   Birthyear  Gender  ZIP    Problem
Black  1965       Male    38100  Short Breath
Black  1965       Male    38100  Chest Pain
Black  1965       Female  38106  Painful Eye
Black  1965       Female  38106  Wheezing
Black  1964       Female  38106  Obesity
Black  1964       Female  38106  Chest Pain
White  19**       Male    38106  Short Breath
White  19**       Human   38102  Hypertension
White  19**       Human   38102  Obesity
White  19**       Human   38102  Fever
White  19**       Male    38106  Vomiting
White  19**       Male    38106  Back Pain
• Little diversity with regard to the sensitive attribute can also lead to privacy problems
– 4-anonymous patient data
• Privacy is not guaranteed, since all 4 individuals of one equivalence class have cancer
13.3 Homogeneity Attack
Race Birthyear Gender ZIP Problem
Human 196* * 38100 Short Breath
Human 196* * 38100 Chest Pain
Human 196* * 38100 Painful Eye
Human 196* * 38100 Wheezing
Human 197* * 38106 Obesity
Human 197* * 38106 Chest Pain
Human 197* * 38106 Short Breath
Human 197* * 38106 Cancer
Human 198* * 38102 Cancer
Human 198* * 38102 Cancer
Human 198* * 38102 Cancer
Human 198* * 38102 Cancer
• In data sets with little diversity, background knowledge can lead to privacy issues
• Goal: Find the medical problem of my neighbor in 38100
– I know that he just came back from Vietnam
– It is more likely that he suffers from Dengue fever
• The available background knowledge is not known when the data is released
– Background knowledge might be arbitrarily complex
13.3 Background Knowledge Attack
Race Birthyear Gender ZIP Problem
Human 196* * 38100 Heart Disease
Human 196* * 38100 Heart Disease
Human 196* * 38100 Dengue fever
Human 196* * 38100 Dengue fever
Human 197* * 38106 Obesity
Human 197* * 38106 Chest Pain
Human 197* * 38106 Short Breath
Human 197* * 38106 Cancer
• To prevent homogeneity and background knowledge attacks, enforce diversity
– Sensitive attribute values must be diverse within each equivalence class
– If each class contains at least l different sensitive values, the data is l-diverse
13.3 L-Diversity
Race Birthyear Gender ZIP Problem
Human 196* * 38100 Short Breath
Human 196* * 38100 Chest Pain
Human 196* * 38100 Wheezing
Human 197* * 38106 Obesity
Human 197* * 38106 Chest Pain
Human 197* * 38106 Short Breath
Human 198* * 38102 Cancer
Human 198* * 38102 Cancer
Human 198* * 38102 Cancer
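The l-diversity condition can be sketched as a check over equivalence classes (names are my own). The 4-anonymous cancer class from the homogeneity attack above fails it:

```python
from collections import defaultdict

def is_l_diverse(table, quasi_identifiers, sensitive, l):
    """Check that each equivalence class (rows sharing the same
    quasi-identifier values) contains at least l distinct values
    of the sensitive attribute."""
    classes = defaultdict(set)
    for row in table:
        key = tuple(row[a] for a in quasi_identifiers)
        classes[key].add(row[sensitive])
    return all(len(values) >= l for values in classes.values())

# Two classes from the homogeneity-attack table: one diverse, one all "Cancer"
rows = [
    {"Birthyear": "196*", "ZIP": 38100, "Problem": p}
    for p in ["Short Breath", "Chest Pain", "Painful Eye", "Wheezing"]
] + [
    {"Birthyear": "198*", "ZIP": 38102, "Problem": "Cancer"}
    for _ in range(4)
]
print(is_l_diverse(rows, ["Birthyear", "ZIP"], "Problem", 2))  # False
```

The data is 4-anonymous but only 1-diverse, because the 198*/38102 class holds a single sensitive value.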
• l-diversity also cannot guarantee perfect anonymization
• Imagine an HIV test result as a sensitive attribute
– Diversity is hard to achieve and unnecessary
• The overall distribution of the sensitive attribute is not considered
– A skewness attack is possible
• See the HIV example with very few positive results
• Several other attacks on l-diverse data sets are possible
13.3 L-Diversity
• k-anonymizing data relies on generalization
– Each record is generalized together with its k neighbors
• However, data sets often have many attributes
– The data is very sparse
– The k nearest neighbors are very far away
• In high-dimensional data sets, k-anonymization leads to a loss of information
– Loss of data utility
13.3 Problem: Curse of Dimensionality
• Recommender systems play an important role in e-commerce
– Video, product, and music recommendation
– …but also Facebook, Twitter, Instagram
• Most systems are based on collaborative filtering
– Users express their preferences by ratings
– Match user ratings against each other and find users with similar taste
– Recommend movies that users with a similar taste rated highly
13.4 The Netflix Prize
• Netflix held an open competition for collaborative filtering algorithms
– 20,000 teams from 150 countries registered for the competition
– The training data set contains:
• 100,480,507 ratings from 480,189 users
• Goal: Improve Netflix's algorithm by 10% to win the prize of one million dollars
– BellKor's Pragmatic Chaos achieved a 10.5% improvement in 2009
13.4 The Netflix Prize
• The Netflix data set was anonymized
– No information about users or films
• Each rating consists of: user id, movie id, date of grade, grade from 1-5
• The average user rated about 200 movies
• The average movie has been rated by more than 5,000 users
• Netflix claimed that the data had been perturbed
• Researchers de-anonymized the data set by cross-matching it with the Internet Movie Database (IMDb)
– 50 users were identified
– The researchers uncovered their political views and other sensitive information
13.4 The Netflix Prize
• Netflix has been criticized by privacy advocates since the release of the data set
– In 2007, researchers identified individual users
– In 2009, four users filed a lawsuit against Netflix
• In March 2010, the Netflix Prize was cancelled
– The data set was removed from the Web
– The plaintiffs dismissed their lawsuit after a settlement with Netflix
13.4 The Netflix Prize
• Arvind Narayanan and Vitaly Shmatikov from the University of Texas identified Netflix users in the IMDb
– 50 users were de-anonymized as a proof of concept
• The Netflix Prize data set is very sparse
– The 480,000 users have about 200 reviews on average
– 17,700 movies, i.e., attributes per user
• Compared to k-anonymity, there are no quasi-identifiers
– Direct de-anonymization is not possible
13.4 De-Anonymization of Sparse Data
• The perturbation of the Netflix data cannot be too large
– Introducing large errors massively harms the data's utility for recommendation
– Comparisons to other Netflix data sets showed that only very little noise was introduced
• For one user 5 out of 229 ratings, for another one 1 out of 306
• Ratings reveal some information about individuals
– Non-null values are very rare
– Some movies have been rated by only a few people
13.4 De-Anonymization of Sparse Data
• The rating profiles of most users are unique
– The data is not k-anonymized for any k > 1
• k-anonymization is bound to fail
– The rating similarity among users is very low
– Therefore, generalizing attributes would lead to a great loss of information
13.4 De-Anonymization of Sparse Data
• Similarity between records (individuals) can be
computed by comparing their ratings
– Similarity measure favors rarely rated movies
• Sparse data set example:
13.4 De-Anonymization of Sparse Data
Movies 1–10 (columns) × users r1–r5 (rows); empty cells mean "not rated":
r1: 3 4 5 1
r2: 3 4 1 1
r3: 4 5 1
r4: 4 1
r5: 5 2 4
• Basic idea for de-anonymizing a record p using data set A:
1. Compute similarity scores from person p to all records in A
2. Let max and max2 be the two highest scores and σ the standard deviation of all scores; if (max − max2) / σ ≥ Φ for some threshold Φ, then the most similar record is a match
• If no single record can be identified:
– Even a small number of candidate records similar to the target reveals a lot of information
• For de-anonymizing the Netflix data set, a more complex variant of the algorithm was used
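A simplified sketch of this matching idea (the rarity weight 1/log(support + 1), the toy data, and all names are assumptions of mine; the actual paper uses a more refined scoring, and the data set must contain at least two records):

```python
import math
import statistics

def score(aux, record, supports):
    """Similarity of auxiliary knowledge `aux` to a candidate `record`
    (both map movie -> rating). Rarely rated movies weigh more."""
    s = 0.0
    for movie, rating in aux.items():
        if record.get(movie) == rating:
            s += 1.0 / math.log(supports[movie] + 1)  # rarity weight (assumed form)
    return s

def best_match(aux, dataset, supports, phi):
    """Return the best-scoring record if it stands out from the
    runner-up by phi standard deviations, else None."""
    scores = [score(aux, r, supports) for r in dataset]
    ranked = sorted(scores, reverse=True)
    sigma = statistics.pstdev(scores) or 1.0
    if (ranked[0] - ranked[1]) / sigma >= phi:
        return dataset[scores.index(ranked[0])]
    return None

supports = {1: 5000, 2: 10, 3: 10000}          # how many users rated each movie
dataset = [{1: 5, 2: 4}, {1: 5}, {3: 3}]        # anonymized rating records
aux = {1: 5, 2: 4}                              # attacker's background knowledge
print(best_match(aux, dataset, supports, 1.5))  # matches the first record here
```

The eccentricity test is what makes the attack robust: a match is only declared when one candidate is clearly separated from all others.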
13.4 De-Anonymization of Sparse Data
• The algorithms are robust to noise in the background knowledge
• Very little information is needed to de-anonymize an average individual in the Netflix data set
– 8 movie ratings and rating dates (±14 days) lead to an accuracy of 99%
– Only 2 ratings and rating dates (±3 days) are needed to still achieve a matching quality of 68%
• Political orientation and religious views can be revealed by opinions on movies
13.4 De-Anonymization of Sparse Data
• In March 2016, Facebook had 1.65 billion active users
– About 28 million active users in Germany
• Instagram has more than 500 million active users
• Snapchat has 150 million active users per day
– More than 10 billion video views per day
• 310 million people actively use Twitter
– …but also lots of private data!
13.5 Privacy in Social Networks
• Social networks are even worse
– Sexual orientation, ethnicity, religious and political views, personality traits, intelligence, happiness, and use of addictive substances can be predicted…
• Every interaction with a social network tells something about the user
– Every like, friend request, or post can be used as an attribute for later classification
13.5 Privacy in Social Networks
• When a user joins a social network, he/she will explore the environment
– What Facebook "likes" tell about members
• Kosinski M, Stillwell D, Graepel T (2013) Private traits and attributes are predictable from digital records of human behavior. PNAS 110(15).
– What Facebook friendships tell about sexual orientation
• Jernigan C, Mistree BFT (2009) Gaydar: Facebook Friendships Expose Sexual Orientation. First Monday 14(10).
– What Facebook knows about non-member friends of members
• Horvát E-Á, Hanselmann M, Hamprecht FA, Zweig KA (2012) One Plus One Makes Three (for Social Networks). PLoS ONE 7(4): e34740.
13.5 Privacy in Social Networks
• Which information needs to be anonymized?
– Individuals should not be re-identifiable
– Friendships/interactions between two persons should not be inferable
– Sensitive attributes of individuals should stay secret
• Similar to before:
– Deleting/generalizing identifying attributes is not enough
• Re-identification using the graph structure is possible
13.5 Privacy in Social Networks
• A simple anonymization of a Social Network
would remove all identifiers
– Only the structure of the network is available
• What can Tilo and Jan find out about the
network?
13.5 De-anonymization in Social Networks
[Figure: anonymized social network structure; the named nodes are Tilo, Jan, Jose, Stephan, Silviu, Kinda, and Christoph]
• Tilo has 3 friends in the social network
– There is only a single node with node degree 3
– Jan can also re-identify himself
13.5 De-anonymization in Social Networks
• They can share the information about their
friendships
– Both are friends with Jose
13.5 De-anonymization in Social Networks
• Christoph is only friends with Tilo
– Tilo can also learn that Jose and Christoph are
friends
13.5 De-anonymization in Social Networks
• Major challenge in anonymizing social networks
– Adding/deleting edges also affects the neighborhood properties of other nodes
• Can we use k-anonymization in large social networks?
– We do not want to change the neighborhoods of other nodes
– What do we want to protect against?
• Node re-identification
• Edge disclosure
• Sensitive attributes
– What do we want to anonymize?
• Node degrees
• Neighborhoods of nodes
• Attributes
• Structural knowledge of the graph
13.5 Advanced Anonymization in Social Networks
• Anonymizing the degree distribution to protect against node re-identification
– Goal: Transform the social network into a k-degree anonymous graph
• Every node shares its degree with at least k−1 other nodes
• Basic idea:
– Step 1: Construct a degree distribution that is close to the original distribution by minimally increasing the degrees of a few nodes
– Step 2: Construct a graph satisfying the new degree distribution that is close to the original graph by adding a minimum number of edges
13.5 Advanced Anonymization in Social Networks
• Building a 2-degree anonymous graph
– Build the degree sequence of the social network
• Only increasing node degrees is allowed
– Use dynamic programming to find an optimal k-degree anonymous sequence
• Minimize the sum of differences of the node degrees
– Here, the minimal sum of differences is 3
13.5 Advanced Anonymization in Social Networks
Original degree sequence: (5, 3, 2, 2, 1, 1, 0) → 2-degree anonymous sequence: (5, 5, 2, 2, 1, 1, 1)
– Construct a graph from the degree sequence using the original graph
• Not every degree sequence is realizable
– (5, 5, 2, 2, 1, 1, 1) has no graph representation
• The sum of the degree sequence has to be even
– Start with 7 nodes and no edges
– Basic algorithm idea:
1. Pick the node v with the highest remaining degree
2. Add deg(v) edges from v to the nodes w with the next-highest degrees
3. Decrease the degrees of the neighbor nodes accordingly
4. IF all degrees are 0: RETURN the graph
5. IF the degree of any node < 0: NO GRAPH exists
– If the degree sequence is not realizable, add random noise to the original degree sequence
• In most cases, the constructed degree sequences are realizable
13.5 Advanced Anonymization in Social Networks
Example: processing the node with degree 5 turns (5, 3, 2, 2, 1, 1, 0) into (0, 2, 1, 1, 0, 0, 0)
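The construction step resembles the classic Havel–Hakimi procedure. A minimal realizability check along those lines (names are my own), applied to the two sequences from this example:

```python
def is_realizable(degrees):
    """Havel-Hakimi test: can `degrees` be realized by a simple graph?"""
    seq = sorted(degrees, reverse=True)
    while seq and seq[0] > 0:
        d = seq.pop(0)          # node with the highest remaining degree
        if d > len(seq):
            return False        # not enough remaining nodes to connect to
        for i in range(d):
            seq[i] -= 1         # connect to the d next-highest nodes
            if seq[i] < 0:
                return False    # a degree dropped below 0: no graph
        seq.sort(reverse=True)
    return True

print(is_realizable([5, 3, 2, 2, 1, 1, 0]))  # True  (the original sequence)
print(is_realizable([5, 5, 2, 2, 1, 1, 1]))  # False (degree sum is odd)
```

This is why the anonymization step may have to add noise to the sequence: the optimal k-degree anonymous sequence is not guaranteed to pass this test.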
• A database can also be used for statistical
purposes without granting access to individual
records
– Statistical operations allow a view on the actual data
– Special protection techniques have to be applied to
protect the individual data records
• ‘Reengineering’ of actual individual values is sometimes
possible
• Statistical inference, especially taking
advantage of sequences of statistical
queries, must be prevented
13.6 Statistical Database Security
• In any case, a statistical filter for queries is needed
– It permits only statistical queries, while preventing access to individual records
• E.g., allow a COUNT query for the number of employees whose salary is higher than $100,000, but deny queries selecting the individuals having that characteristic
• But statistical filters are not sufficient to prevent inference
– E.g., first get the average salary of employees with job description 'manager' and then count their number
• If the count is 1, you know exactly how much your manager earns…
13.6 Statistical Database Security
• Security measures have to be taken on top of the authorization measures presented before
• A statistical database is
– positively compromised if a user finds out that an individual has a specific characteristic (value)
– negatively compromised if a user finds out that a given individual does not have a certain characteristic
• A simple anonymization of the data also does not suffice to protect individuals
13.6 Statistical Database Security
• The majority of inference protection techniques
can be classified as
– Conceptual techniques
• Involve the conceptual level of the underlying database
– Restriction-based techniques
• Deny statistical queries working on too small or too large
subsets of the data
– Perturbation-based techniques
• Introduce modifications to the data which change individual values, but should have hardly any effect on the statistics
13.6 Inference Protection Techniques
• A good example of a conceptual technique is the lattice model
– Statistics over relational tables can be represented as a lattice, where the vertices reflect different combinations of attributes
– E.g., the lattice for a table T with three attributes A, B, and C
– By aggregating over some attribute, lower-dimensional tables are obtained
13.6 Conceptual Techniques
[Figure: lattice for table T(A,B,C) — TABC at the bottom; TAB, TAC, TBC above it; TA, TB, TC above those; Tall at the top; upward edges denote aggregation]
• The lattice can be used to study inference protection mechanisms
– A statistic is considered harmful if the n-respondent, k%-dominance criterion applies
• i.e., n or fewer records represent more than k% of the total, with n and k being fixed but secret values
– Consequently, for any vertex of the lattice, e.g., a 'count' statistic holding a query set of size 1 is harmful
• By using operations involving vertices at different levels, a user can disclose sensitive statistics
– Generally, it is possible to permit a statistic in a vertex of the lattice if the individual is not identified in some parent table of the lattice
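The dominance criterion itself is easy to state in code. A minimal sketch (the salary figures are invented for illustration):

```python
def is_sensitive(values, n, k):
    """n-respondent, k%-dominance: a statistic over `values` is harmful
    if the n largest contributions make up more than k% of the total."""
    total = sum(values)
    top_n = sum(sorted(values, reverse=True)[:n])
    return total > 0 and 100.0 * top_n / total > k

# One salary dominates the sum, so publishing SUM(salary) would be harmful
print(is_sensitive([500000, 40000, 42000, 38000], n=1, k=60))  # True
```

Since n and k are kept secret, an attacker cannot tell exactly which aggregates the filter will refuse.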
13.6 Conceptual Techniques
• The general aim of these techniques is to restrict statistical queries that could compromise the database
– The simplest restriction technique controls the size of the query set associated with a query
• Suppose that for some individual A, a user knows a certain characteristic 'Ai = x' and the respective count statistic is 1; then more information can be disclosed by issuing queries COUNT(Ai = x AND Aj = y) to find out about the Aj value, etc.
• For some secret parameter k, a statistical query is only permitted if the size of the query set is both larger than k and smaller than (database_size − k)
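The size restriction amounts to a one-line filter; a sketch (names are my own):

```python
def query_allowed(query_set_size, database_size, k):
    """Permit a statistical query only if its query set is larger than k
    and smaller than database_size - k (so neither the set nor its
    complement singles out a few individuals)."""
    return k < query_set_size < database_size - k

print(query_allowed(1, 1000, 5))    # False: singles out an individual
print(query_allowed(999, 1000, 5))  # False: the complement is too small
print(query_allowed(200, 1000, 5))  # True
```

The upper bound matters as much as the lower one: a query matching almost everyone reveals the excluded individuals via its complement.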
13.6 Restriction-based Techniques
• However, this simple technique is not safe, e.g., against tracker-based attacks
– A tracker is a set of formulas that pad out small query sets with additional records to fulfill the size restriction
• Assume an individual can be uniquely identified by the characteristics (Ai = x AND Aj = y AND Ak = z)
• A tracker could be (Ai = x) and (Ai = x AND NOT (Aj = y AND Ak = z))
• The forbidden statistic COUNT(Ai = x AND Aj = y AND Ak = z) can then be calculated as COUNT(Ai = x) − COUNT(Ai = x AND NOT (Aj = y AND Ak = z))
• Having that statistic, more information can be obtained
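The tracker arithmetic can be reproduced on a toy database (the records and attribute values are invented for illustration):

```python
# Toy database: each record is (Ai, Aj, Ak, salary)
db = [
    ("x", "y", "z", 120000),   # the target: unique combination (x, y, z)
    ("x", "p", "q", 50000),
    ("x", "p", "z", 52000),
    ("w", "y", "z", 48000),
]

def count(pred):
    """COUNT query over the toy database."""
    return sum(1 for r in db if pred(r))

# Forbidden: COUNT(Ai=x AND Aj=y AND Ak=z) has a query set of size 1.
# The tracker recovers it from two larger, permitted query sets:
c1 = count(lambda r: r[0] == "x")                                        # 3
c2 = count(lambda r: r[0] == "x" and not (r[1] == "y" and r[2] == "z"))  # 2
print(c1 - c2)  # 1 -> the forbidden count, obtained indirectly
```

Both padded queries individually look harmless to a size filter; only their difference isolates the target.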
13.6 Restriction-based Techniques
• One way to deal with trackers is to generalize the query set size criterion to all logical combinations
– For a query on (A1 = a AND A2 = b AND … AND An = z), all 2^n combinations
(NOT A1 = a AND A2 = b AND … AND An = z), (A1 = a AND NOT A2 = b AND … AND An = z), …
have to fulfill the query set size restriction
13.6 Restriction-based Techniques
• A prime example of perturbation-based techniques is data swapping
– The idea is to exchange attribute values between the records of the original database in such a way that
• the new database has no records in common with the original database,
• while the statistics (up to a certain number of attributes involved in the statistic) stay correct
• A second technique are random sample queries, which are performed only on a random sample of the database
• Another technique is result rounding, where the response is perturbed
– Before being released, the response values are rounded up or down to the nearest multiple of a certain base b
– Users can then deduce the true value only within some interval
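Result rounding can be sketched in one line; rounding to the nearest multiple of b leaves an uncertainty interval of about ±b/2 around the released value:

```python
def round_response(value, b):
    """Round a statistical response to the nearest multiple of base b,
    so users can only deduce the true value within an interval."""
    return b * round(value / b)

print(round_response(1237, 10))  # 1240
print(round_response(1234, 10))  # 1230
```

Note that deterministic rounding must be applied consistently: if logically related queries are rounded independently, comparing their results can narrow the interval again.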
13.6 Perturbation-based Techniques
• Specific attacks, however, can still disclose information
– E.g., consider a user who knows that an individual's record matches a characteristic Ai = x and that the relative frequency of having that value is 1/database_size
– The attacker can now discover whether the record shows the additional characteristic Aj = y by requesting the relative frequency of (Ai = x AND Aj = y)
– If the value is still 1/database_size, the record has the characteristic; if the value is 0, it does not…
13.6 Perturbation-based Techniques
14.1 Knowledge-based Systems
and Deductive DBs
14.2 Distributed Databases
14.3 Information Retrieval and
Web Search Engines
14.4 Spatial databases and GIS
14.5 Multimedia Databases
14.6 Data Warehousing
14 Beyond Relational Databases