A
SEMINAR REPORT
ON
DATA LEAKAGE DETECTION
SUBMITTED BY
ANUSRI RAMCHANDRAN (3)
PRADEEP CHAUHAN ASHOK (7)
POOJA SHANTILAL CHHIPA (8)
UNDER THE GUIDANCE OF
MRS.JYOTI WADMARE
DEPARTMENT OF COMPUTER ENGINEERING
K.J. SOMAIYA INSTITUTE OF ENGINEERING AND INFORMATION TECHNOLOGY
EVERARD NAGAR, EASTERN EXPRESS HIGHWAY, SION, MUMBAI-42
Certificate
This is to certify that the seminar entitled “DATA LEAKAGE DETECTION” has been
submitted by the following students.
Anusri Ramchandran(3)
Pradeep Chauhan Ashok(7)
Pooja Shantilal Chhipa(8)
Under guidance of Prof. Jyoti Wadmare for the subject seminar in semester VI
Internal examiner Head of Department
(Prof._____________________) (Prof. Uday Rote)
External examiner Principal
(________________________) (Dr. S.G. Kirloskar)
Acknowledgement
Data Leakage Detection is a seminar report brought about by the effort of a dynamic team and a lot of other
people. We would like to extend our sincere thanks to them.
We express our deepest gratitude to Mrs. Jyoti Wadmare, who guided us through the entire seminar report.
We sincerely thank our college staff for their help and guidance.
We also extend our formal gratitude to Prof. Uday Rote, Head of the Department of Computer Engineering,
who provided us with the necessary facilities for the implementation of the seminar.
We are truly indebted to everyone involved in the realization of this project.
We are also thankful to our family members and friends for their patience and encouragement.
Anusri Ramchandran
Pradeep Chauhan
Pooja Shantilal Chhipa
Abstract
We study the following problem: A data distributor has given sensitive data to a set of
supposedly trusted agents (third parties). Some of the data are leaked and found in an
unauthorized place (e.g., on the web or somebody’s laptop). The distributor must assess the
likelihood that the leaked data came from one or more agents, as opposed to having been
independently gathered by other means. We propose data allocation strategies (across the
agents) that improve the probability of identifying leakages. These methods do not rely on
alterations of the released data (e.g., watermarks). In some cases, we can also inject “realistic
but fake” data records to further improve our chances of detecting leakage and identifying the
guilty party.
An organization typically secures its data/information from intruders (i.e., hackers) by protecting
its network. However, data is growing rapidly as organizations expand (e.g., due to globalization),
and the number of data points (machines and servers) that can access these data is rising, offering
ever simpler modes of communication. As a result, intentional or even unintentional data leakage
from within the organization has become a painful reality. This has led to growing
information security awareness in general and about outbound content management in particular.
Index

SR.NO. TOPIC
1. Introduction
2. Literature Review
3. Report on Present Investigation
3.1 Introduction to Data Leakage
3.2 The Leaking Faucet
3.3 Data Leakage Detection
3.3.1 Data Allocation Strategy
3.3.2 Guilt Detection Model
3.3.3 Symbols and Terminology
4. Agent Guilt Model
4.1 Guilty Agent
4.2 Guilt Agent Detection
5. Allocation Strategy
5.1 Explicit Data Requests
5.2 Sample Data Requests
5.3 Data Allocation Problem
5.3.1 Fake Objects
5.3.2 Optimization Problem
6. Intelligent Data Distribution
6.1 Hashed Distribution Algorithm
6.2 Detection Process
6.3 Benefits of Hashed Distribution
7. Summary
8. References and Links
Chapter 1
Introduction
In the course of business, sensitive data must sometimes be given to trusted third parties. For
example, a company may have partnerships with other companies that require sharing customer
data to complete a business process. Similarly, a hospital gives patient records to
researchers to devise new treatments. Another enterprise may outsource its data processing, so
data must be given to various other companies. We call the owner of the data the distributor and
the supposedly trusted third parties the agents [1]. Our goal is to detect, when the distributor
distributes sensitive data among various agents, which agent leaked the data, and if possible to
identify the guilty agent.
Leakage detection is traditionally handled by the process of watermarking, e.g., a unique
code is embedded in each distributed copy. If that copy is later discovered in the hands of an
unauthorized party, the leaker can be identified. Watermarks can be very useful in some cases
but, again, involve some modification of the original data. Furthermore, watermarks can
sometimes be destroyed if the data recipient is malicious. In this paper we study unobtrusive
techniques for detecting leakage of a set of objects or records. Specifically, we study the
following scenario: After giving a set of objects to agents, the distributor discovers some of those
same objects in an unauthorized place. For example, the data may be found on a web site, or may
be obtained through a legal discovery process. At this point the distributor can assess the
likelihood that the leaked data came from one or more agents, as opposed to having been
independently gathered by other means.
To detect a guilty agent, intelligent distribution techniques must be developed that help
identify which agent leaked a data object discovered in an unauthorized place, together with a
model for assessing the “guilt” of agents. In these algorithms we consider
the option of adding “fake” objects to the distributed set. Such objects do not correspond to real
entities but appear realistic to the agents; they act as a watermark for the distributed data set and
are used to identify the agent responsible for leaking a data object. Fake records are added,
introducing only some modification to the data set.
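The fake-object idea described above can be sketched in a few lines. The sketch below is illustrative only and is not the report's implementation; the function name and the way fake IDs are generated are assumptions made for this example.

```python
import secrets

def add_fake_objects(allocation, num_fakes):
    """Illustrative sketch (not the report's code): extend an agent's
    allocated set with 'realistic but fake' records. Here the fakes are
    just tagged random IDs; a real system would generate records that
    look like genuine data to the agent."""
    fakes = set()
    while len(fakes) < num_fakes:
        fakes.add("fake-" + secrets.token_hex(4))
    # Return the extended allocation plus a receipt of which fakes were
    # planted, so a leak containing them can implicate the agent.
    return allocation | fakes, fakes

R1 = {"t1", "t2"}
R1_with_fakes, planted = add_fake_objects(R1, 2)
```

If a leaked set later contains any object from `planted`, the distributor gains strong evidence against that agent, since the fake records exist nowhere else.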
Chapter 2
Literature Review

The introduction to the data leakage detection system presents the guilt detection approach and data
allocation strategies. Alongside these, various other works describe mechanisms that allow
only authorized users to access sensitive data through access control policies.
Such approaches prevent data leakage, in some sense, by sharing information only with trusted
parties. However, these policies are restrictive and may make it impossible to satisfy agents’
requests. The guilt detection approach is related to the data provenance problem, in which tracing
the lineage of leaked objects essentially amounts to detecting the guilty agents. As far as the
data allocation strategies are concerned, the work is most relevant to watermarking, which is used as a
means of establishing original ownership of distributed objects.
All algorithms presented in this paper have been implemented in a prototype lineage-tracing
system, and preliminary performance results are reported. Enabling lineage tracing in a
data warehousing environment has several benefits and applications, including in-depth data
analysis and data mining, authorization management, view update, efficient warehouse recovery,
and others as outlined [6].
Panagiotis Papadimitriou and Hector Garcia-Molina analyzed various data allocation
strategies and related the data allocation strategy with the process of watermarking in which the
work is mostly relevant to watermarking that is used as a means of establishing original
ownership of distributed objects. Watermarks were initially used in images, video, and audio
data whose digital representation includes considerable redundancy. Recently, other works have
also studied the insertion of marks into relational data. The approach and watermarking are similar in the
sense of providing agents with some kind of receiver-identifying information. However, by its
very nature, a watermark modifies the item being watermarked. If the object to be watermarked
cannot be modified, then a watermark cannot be inserted. In such cases, methods that attach
watermarks to the distributed data are not applicable [1]. The authors conducted experiments with
simulated data leakage problems to evaluate their performance and present the evaluation for
sample requests and explicit data requests, respectively.
To calculate the guilt probabilities and differences, p = 0.5 is used throughout this section.
Although not reported here, the authors experimented with other p values and observed that the
relative performance of the algorithms and the main conclusions do not change. If p approaches
0, it becomes easier to find guilty agents and algorithm performance converges. On the other
hand, if p approaches 1, the relative differences among the algorithms grow, since more evidence is
needed to find an agent guilty.
The presented experiments confirmed that fake objects can have a significant impact on
our chances of detecting a guilty agent. This means that by allocating 10 percent fake objects,
the distributor can detect a guilty agent even in the worst-case leakage scenario, while without
fake objects he will be unsuccessful not only in the worst case but also in the average case.
With explicit data requests, few objects are shared among multiple agents. These are
the most interesting scenarios, since object sharing makes it difficult to distinguish guilty from
non-guilty agents. Scenarios with more objects to distribute, or with objects shared
among fewer agents, are obviously easier to handle. Scenarios with many objects to
distribute and many overlapping agent requests are similar,
since we can map them to the distribution of many small subsets.
With sample data requests, agents are not interested in particular objects. Hence, object
sharing is not explicitly defined by their requests. The distributor is “forced” to allocate certain
objects to multiple agents only if the number of requested objects exceeds the number of objects
in set T. The more data objects the agents request in total, the more recipients, on average, an
object has; and the more objects are shared among different agents, the more difficult it is to
detect a guilty agent.
Piero Bonatti, Sabrina De Capitani di Vimercati, and Pierangela Samarati worked on
access control policies, which act as mechanisms that allow only authorized users to access
sensitive data. Such approaches prevent data leakage, in some
sense, by sharing information only with trusted parties. These access control approaches are all based
on monolithic and complete specifications.
The approach to data leakage detection is similar to the watermark added to an image,
audio, or video for detection purposes: a unique code is embedded for identification. In the case of a data
object set, some fake records are added instead.
Chapter 3
Report on Present Investigation
Data leakage happens every day when confidential business information such as customer or
patient data, source code or design specifications, price lists, intellectual property and trade
secrets, forecasts and budgets in spreadsheets are leaked out. In this report a problem is
considered in which a data distributor has given sensitive data to a set of supposedly trusted
agents, and some of the data are leaked and found in an unauthorized place by some means. The
problem with data leakage is that once the data is no longer within the domain of the distributor,
the company is at serious risk. The distributor must assess the likelihood that the leaked data
came from one or more agents, as opposed to having been independently gathered by other
means.
We propose data allocation strategies (across the agents) that improve the probability of
identifying leakages. These methods do not rely on alterations of the released data. In some
cases, we can also inject “realistic but fake” data records to further improve our chances of
detecting leakage and identifying the guilty party.
Further modifications are applied in order to overcome the problems of the current algorithm by
intelligently distributing data objects among various agents in such a way that identifying the
guilty agent becomes simple.
3.1 Introduction to Data Leakage
Data Leakage, put simply, is the unauthorized transmission of data (or information) from
within an organization to an external destination or recipient. Leakage may be intentional or
unintentional and may be caused by an internal or an external user. Internals are authorized users of the
system who can access the data through a valid access control policy, whereas an external intruder
accesses data through some attack, either active or passive, on the target machine.
This may be electronic, or may be via a physical method. Data Leakage is synonymous
with the term Information Leakage. It harms the image of the organization and makes partners wary of
continuing their relationship with the distributor, since it was not able to protect the sensitive
information.
The reader is encouraged to be mindful that unauthorized does not automatically mean
intentional or malicious. Unintentional or inadvertent data leakage is also unauthorized.
Figure 3.1: Data Leakage
According to data compiled from EPIC.org and PerkinsCoie.com, which surveyed
data leakage across various organizations, 52%
of data security breaches come from internal sources compared with the remaining 48% from external
hackers; hence protection is also needed against internal users.
Type of data leaked Percentage
Confidential information 15
Intellectual property 4
Customer data 73
Health record 8
Table 3.2: Types of data leaked [11]
The noteworthy aspect of these figures is that, when the internal breaches are examined,
the percentage due to malicious intent is remarkably low, at less than 1%, while the level of
inadvertent data breach is significant (96%). This further deconstructs into 46% due to
employee oversight and 50% due to poor business process.
3.2 The Leaking Faucet
Data protection programs at most organizations are concerned with protecting sensitive
data from external malicious attacks, relying on technical controls that include perimeter
security, network/wireless surveillance and monitoring, application and point security
management, and user awareness and education. But what about inadvertent data leaks that are not
so sensational, for example unencrypted information on a lost or stolen laptop, USB drive, or other device?
Like the steady drip from a leaking faucet, everyday data leaks are making headlines more often
than the nefarious attack scenarios around which organizations plan most, if not all, of their data
leakage prevention methods. However, to truly protect their critical data, organizations also need
to take a more data-centric approach in their security programs, to protect sensitive data against
such leaks.
Organizations are concerned with protecting sensitive data from external malicious
attacks and from internal staff who can access those data by any method, relying on technical
controls that include perimeter security, network/wireless surveillance and monitoring,
application and point security management, user awareness and education, and DLP
solutions. But what about data leaks by trusted third parties, called agents, who are not present
inside the network and whose activity is not easily traceable? In this situation, some care must be
taken so that the data is not misused by them.
Various kinds of sensitive information, such as financial data, private data, credit card
information, health record information, confidential information, and personal information, are
part of different organizations and can be protected in several ways, as
shown in figure 3.3. The protection measures include Education, Prevention, and Detection to control
the leakage of sensitive information. Education remains the most important factor among all the
protection measures; it includes training and awareness programs on handling sensitive
information and its importance to the organization.
Figure 3.3 represents the faucet for data leakage: in the center, different kinds of sensitive
information are placed, surrounded by the protection mechanisms that prevent valuable
information from leaking out of the organization, which would lead to major problems.
Figure 3.3: The Leaking Faucet
The prevention mechanism deals with DLP, a suite of technologies
that prevents leakage of data by classifying sensitive information, monitoring it, and
restricting access through various access control policies. Education prevents
leakage because, most of the time, leakage occurs unintentionally through internal users. The detection
process detects leakage of information distributed to trusted third parties, called agents,
and calculates their involvement in the leakage.
3.3 Data Leakage Detection
Organizations thought of data/information security only in terms of protecting their
network from intruders (e.g. hackers). But with growing amount of data, rapid growth in the
sizes of organizations (e.g. due to globalization), rise in number of data points (machines and
servers) and easier modes of communication, accidental or even deliberate leakage of data from
within the organization has become a painful reality. This has led to growing awareness about
information security in general and about outbound content management in particular.
Data Leakage, put simply, is the unauthorized transmission of data (or information) from
within an organization to an external destination or recipient. This may be electronic, or may be
via a physical method. Data Leakage is synonymous with the term Information Leakage. The
reader is encouraged to be mindful that unauthorized does not automatically mean intentional or
malicious. Unintentional or inadvertent data leakage is also unauthorized. In the course of doing
business, sometimes sensitive data must be handed over to supposedly trusted third parties.
For example, a hospital may give patient records to researchers who will devise new
treatments. Similarly, a company may have partnerships with other companies that require
sharing customer data. Another enterprise may outsource its data processing, so data must be
given to various other companies. We call the owner of the data the distributor and the
supposedly trusted third parties the agents. Our goal is to detect the guilty agent among all
trustworthy agents when the distributor’s sensitive data have been leaked by any one agent, and
if possible to identify the agent that leaked the data.
We consider applications where the original sensitive data cannot be perturbed.
Perturbation is a very useful technique in which the data are modified and made “less sensitive”
before being handed to agents. In some cases, however, sensitive data must be sent as they are:
an employee's contact number, for example, cannot be altered before being sent to a third party
for recruitment, and similarly a bank account number, if perturbed, is of no use to the receiver
for any process.
To handle such situations, an effective method is required that makes the distribution of
data possible without modifying the valuable information.
In such cases one can add random noise to certain attributes, or one can replace exact
values by ranges [4]. However, in some cases, it is important not to alter the original distributor’s
data. For example, if an outsourcer is doing our payroll, he must have the exact salary and
customer bank account numbers.
Fig 3.4: Data Leakage Detection Process
If medical researchers are treating patients (as opposed to simply computing statistics),
they may need accurate data about the patients. Traditionally, leakage detection is handled by
watermarking, e.g., a unique code is embedded in each distributed copy.
If that copy is later discovered in the hands of an unauthorized party, the leaker can be
identified. Watermarks can be very useful in some cases but, again, involve some modification
of the original data by adding redundancy. Furthermore, watermarks can sometimes be destroyed
if the data recipient is malicious and aware of various techniques to tamper with the watermark.
The Data Leakage Detection system is mainly divided into two modules:
3.3.1 Data allocation strategy
This module helps in the intelligent distribution of the data set so that, if the data is leaked, the
guilty agent can be identified.
Fig 3.5: Data Distribution Scenario
3.3.2 Guilt detection model
This model helps determine whether an agent is responsible for the leakage of data, or whether the
data set was obtained by the target through some other means.
It requires complete domain knowledge to calculate the probability p used to evaluate a guilty
agent. From domain knowledge, proper analysis, and experiment, a probability factor is
calculated that acts as a threshold of evidence to prove the guilt of an agent. When the number of
leaked records exceeds the specified threshold, the agent is considered guilty; when the number of leaked
records is lower, the agent is considered not guilty, because in that situation the leaked
objects may have been obtained by the target through some other means.
Fig 3.6: Guilt Detection Model
An unobtrusive technique for detecting leakage of a set of objects or records is proposed
in this report. After giving a set of objects to agents, the distributor discovers some of those same
objects in an unauthorized place. (For example, the data may be found on a website, or may be
obtained through a legal discovery process.) At this point, the distributor can assess the
likelihood that the leaked data came from one or more agents, as opposed to having been
independently gathered by other means. Using an analogy with cookies stolen from a cookie jar,
if we catch Freddie with a single cookie, he can argue that a friend gave him the cookie. But if
we catch Freddie with five cookies, it will be much harder for him to argue that his hands were
not in the cookie jar. If the distributor sees “enough evidence” that an agent leaked data, he may
stop doing business with him, or may initiate legal proceedings. In this paper, we develop a
model for assessing the “guilt” of agents. We also present algorithms for distributing objects to
agents, in a way that improves our chances of identifying a leaker. Finally, we also consider the
option of adding “fake” objects to the distributed set. Such objects do not correspond to real
entities but appear realistic to the agents. In a sense, the fake objects act as a type of watermark
for the entire set, without modifying any individual members. If it turns out that an agent was
given one or more fake objects that were leaked, then the distributor can be more confident that
agent was guilty.
3.3.3 Symbols and Terminology

A distributor owns a set T = {t1, t2, t3, …} of valuable and sensitive data objects. The
distributor wants to share some of the objects with a set of agents U1, U2… Un, but does not wish
the objects be leaked to other third parties. The objects in T could be of any type and size, e.g.,
they could be tuples in a relation, or relations in a database. An agent Ui receives a subset of
objects Ri ⊆ T, determined either by a sample request or an explicit request:
• Distributor: A distributor owns a set T of valuable and sensitive data objects.
Owner of data set T = {t1, t2, ……, tn}
• Agent (U): The distributor shares some of the objects with a set of agents U1, U2… Un,
but does not wish the objects be leaked to other third parties.
Receives a set Ri ⊆ T from the distributor.
• Target: Unauthorized third party caught with a leaked data set S ⊆ T.
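The roles just defined (distributor, agents, and their sets) can be captured in a small data model. The class and method names below (Distributor, Agent, give) are assumptions made for illustration, not code from the report.

```python
class Agent:
    """An agent U_i that receives a subset R_i of the distributor's set T."""
    def __init__(self, name):
        self.name = name
        self.received = set()   # R_i

class Distributor:
    """Owner of the sensitive data set T."""
    def __init__(self, objects):
        self.T = set(objects)
        self.agents = []

    def give(self, agent, requested):
        # An agent can only receive objects that actually exist in T,
        # so R_i is always a subset of T.
        subset = set(requested) & self.T
        agent.received |= subset
        if agent not in self.agents:
            self.agents.append(agent)
        return subset

d = Distributor({"t1", "t2", "t3"})
u1, u2 = Agent("U1"), Agent("U2")
d.give(u1, {"t1", "t2"})
d.give(u2, {"t1", "t3"})
```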
Example: Say T contains customer records for a given company A. Company A hires a
marketing agency U1 to do an on-line survey of customers. Since any customers will do for the
survey, U1 requests a sample of 1,000 customer records. At the same time, company A
subcontracts with agent U2 to handle billing for all California customers. Thus, U2 receives all T
records that satisfy the condition “state is California.” Suppose that after giving objects to agents,
the distributor discovers that a set S ⊆ T has leaked. This means that some third party
called the target has been caught in possession of S. For example, this target may be displaying S
on its web site, or perhaps as part of a legal discovery process, the target turned over S to the
distributor.
Since agents U1, U2, …, Un have some of the data, it is reasonable to suspect them of leaking the data.
However, the agents can argue that they are innocent, and that the S data was obtained by the
target through other means. For example, say one of the objects in S represents a customer X.
Perhaps X is also a customer of some other company, and that company provided the data to the
target. Or perhaps X can be reconstructed from various publicly available sources on the web.
Our goal is to estimate the likelihood that the leaked data came from the agents as opposed to
other sources. Intuitively, the more data in S, the harder it is for the agents to argue they did not
leak anything. Similarly, the “rarer” the objects, the harder it is to argue that the target obtained
them through other means.
Not only do we want to estimate the likelihood the agents leaked data, but we would also like
to find out if one of them in particular was more likely to be the leaker. For instance, if one of the
S objects was only given to agent U1, while the other objects were given to all agents, we may
suspect U1 more. The model we present next captures this intuition. We say an agent Ui is guilty
if it contributes one or more objects to the target. While performing the implementation and research
work for calculating the guilt of an agent, we follow various assumptions in order to reduce
complexity and computation.
Chapter 4
Agent Guilt Model
This model helps determine whether an agent is responsible for the leakage of data, or whether the
data set was obtained by the target through some other means. The distributor can assess the likelihood that the
leaked data came from one or more agents, as opposed to having been independently gathered by
other means. Using an analogy with cookies stolen from a cookie jar, if we catch Freddie with a
single cookie, he can argue that a friend gave him the cookie. But if we catch Freddie with five
cookies, it will be much harder for him to argue that his hands were not in the cookie jar.
If the distributor sees “enough evidence” that an agent leaked data, he may stop doing
business with him, or may initiate legal proceedings.
4.1 Guilty Agent
To compute the probability that an agent is guilty, we need an estimate for the probability that
values in S can be “guessed” by the target. For instance, say some of the objects in T are emails
of individuals. We can conduct an experiment and ask a person with approximately the expertise
and resources of the target to find the email of say 100 individuals. If this person can find say 90
emails, then we can reasonably guess that the probability of finding one email is 0.9. On the
other hand, if the objects in question are bank account numbers, the person may only discover
say 20, leading to an estimate of 0.2. We call this estimate pt , the probability that object t can be
guessed by the target.
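The estimation experiment described above amounts to a simple ratio. A minimal sketch, with a hypothetical function name:

```python
def estimate_guess_probability(found, asked):
    """Estimate p_t, the probability that the target can independently
    'guess' an object: the fraction of objects a person with roughly the
    target's expertise could find (e.g. 90 of 100 emails -> 0.9)."""
    if asked <= 0:
        raise ValueError("asked must be a positive count")
    return found / asked

p_email = estimate_guess_probability(90, 100)    # emails: easy to find
p_account = estimate_guess_probability(20, 100)  # bank accounts: hard
```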
To simplify the formulas, we assume that all T objects have the same probability, which we
call p. Next, we make two assumptions regarding the relationship among the various leakage
events. The first assumption simply states that an agent’s decision to leak an object is not related
to other objects.
Suppose that after giving objects to agents, the distributor discovers that a set S ⊆ T has
leaked. This means that some third party called the target has been caught in possession of S.
For example, this target may be displaying S on its web site, or perhaps as part of a legal
discovery process, the target turned over S to the distributor. Since the agents U1, …, Un have
some of the data, it is reasonable to suspect them of leaking the data. However, the agents can argue
that they are innocent, and that the S data was obtained by the target through other means.
For example, say one of the objects in S represents a customer X. Perhaps X is also a
customer of some other company, and that company provided the data to the target. Or
perhaps X can be reconstructed from various publicly available sources on the web. Our goal is
to estimate the likelihood that the leaked data came from the agents as opposed to other sources.
Intuitively, the more data in S, the harder it is for the agents to argue they did not leak anything.
Similarly, the “rarer” the objects, the harder it is to argue that the target obtained them through
other means. Not only do we want to estimate the likelihood the agents leaked data, but we
would also like to find out if one of them in particular was more likely to be the leaker.
For instance, if one of the S objects was only given to agent U1, while the other objects
were given to all agents, we may suspect U1 more. The model we present next captures this
intuition. We say an agent Ui is guilty if it contributes one or more objects to the target. We
denote the event that agent Ui is guilty for a given leaked set S by {Gi | S}. Our next step is to
estimate Pr{Gi | S}, i.e., the probability that agent Ui is guilty given evidence S.
4.2 Guilt Agent Detection
We can conduct an experiment and ask a person with approximately the expertise and
resources of the target to find the email of say 100 individuals.
If this person can find say 90 emails, then we can reasonably guess that the probability of
finding one email is 0.9. On the other hand, if the objects in question are bank account numbers,
the person may only discover say 20, leading to an estimate of 0.2. We call this estimate pt, the
probability that object t can be guessed by the target [2]. For simplicity we assume that all T
objects have the same pt, which we call p.
Next, we make two assumptions regarding the relationship among the various leakage
events. The first assumption simply states that an agent’s decision to leak an object is not related
to other objects.
Assumption 1. For all t, t′ ∈ S such that t ≠ t′, the provenance of t is independent of the
provenance of t′ [1].
The term “provenance” in this assumption statement refers to the source of a value t that
appears in the leaked set. The source can be any of the agents who have t in their sets or the
target itself (guessing). The following assumption states that joint events have a negligible
probability.
Assumption 2. An object t ∈ S can only be obtained by the target in one of two ways:
• A single agent Ui leaked t from its own Ri set; or
• The target guessed t (or obtained it through other means) without the help of any of the n agents.
In other words, for all t ∈ S, the event that the target guesses t and the events that agent Ui
(i = 1, …, n) leaks object t are disjoint. Assume
that the distributor set T, the agent sets Rs, and the target set S are:
T = { t1, t2, t3 }, R1 = { t1, t2 }, R2 = { t1, t3 }, S = { t1, t2, t3 }.
In this case, all three of the distributor’s objects have been leaked and appear in S. Let us first
consider how the target may have obtained object t1, which was given to both agents. From
Assumption 2, the target either guessed t1 or one of U1 or U2 leaked it.
We know that the probability of the former event is p, so assuming that the probability that each
of the two agents leaked t1 is the same, we have the following cases:
The target guessed t1 with probability p,
Agent U1 leaked t1 to S with probability (1-p)/2,
Agent U2 leaked t1 to S with probability (1-p)/2.
Similarly, we find that agent U1 leaked t2 to S with probability (1-p) since he is the only agent
that has t2.
Given these values, the probability that agent U1 is not guilty, namely that U1 did not leak
either object, is
Pr {Ḡ1 | S} = (1 − (1 − p)/2) × (1 − (1 − p)),
and the probability that U1 is guilty is
Pr {G1 | S} = 1 − Pr {Ḡ1 | S}.
If Assumption 2 did not hold, our analysis would be more complex because we would need to
consider joint events, e.g., the target guesses t1, and at the same time, one or two agents leak the
value. In our simplified analysis, we say that an agent is not guilty when the object can be
guessed, regardless of whether the agent leaked the value.
Since we are “not counting” instances when an agent leaks information, the simplified
analysis yields conservative (i.e., smaller) guilt probabilities.
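The worked example above can be checked with a short script. Under Assumptions 1 and 2, the probability that agent Ui did not leak anything is the product, over the leaked objects t in Ri, of 1 − (1 − p)/|Vt|, where Vt is the set of agents holding t. The sketch below (function and variable names are ours, not from [1]) computes Pr {Gi | S} for the example:

```python
def guilt_probability(agent, agents, leaked, p):
    """Pr{Gi | S}: probability that `agent` leaked at least one object.

    agents : dict mapping agent name -> set of objects it received (Ri)
    leaked : set of leaked objects S
    p      : probability that the target guessed any single object
    """
    pr_not_guilty = 1.0
    for t in leaked & agents[agent]:
        # V_t: number of agents that could have leaked t
        holders = sum(1 for r in agents.values() if t in r)
        # with probability p the target guessed t; otherwise each
        # holder is equally likely to be the source
        pr_not_guilty *= 1 - (1 - p) / holders
    return 1 - pr_not_guilty

# The example from the text: T = {t1, t2, t3}, R1 = {t1, t2}, R2 = {t1, t3}
agents = {"U1": {"t1", "t2"}, "U2": {"t1", "t3"}}
S = {"t1", "t2", "t3"}
print(guilt_probability("U1", agents, S, p=0.5))  # 0.625
```

With p = 0.5, U1's guilt probability is 1 − (1 − 0.25)(1 − 0.5) = 0.625, matching the case analysis above.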
Chapter 5
Allocation Strategy
This chapter discusses allocation strategies applicable to the different problem instances of
data requests. We deal with problems with explicit data requests and with sample data requests.
5.1 Explicit Data Requests
When the distributor is not allowed to add fake objects to the distributed data, the
data allocation is fully defined by the agents’ data requests. Say, for example, that T = {t1, t2}
and there are two agents with explicit data requests such that R1 = {t1, t2} and R2 = {t1}. The
value of the sum-objective in this case is 1.5. The distributor cannot remove or alter the R1 or R2
data to decrease the overlap R1 ∩ R2.
However, say that the distributor can create one fake object (B = 1) and both agents can receive
one fake object (b1 = b2 = 1). In this case, the distributor can add one fake object to either R1 or
R2 to increase the corresponding denominator of the summation term. Assume that the distributor
creates a fake object f and he gives it to agent R1. Agent U1 has now R1 = { t1, t2, f } and F1 = {f}
and the value of the sum-objective decreases to 1.33 < 1.5.
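The two objective values quoted here (1.5 and 1.33) can be reproduced with a few lines, taking the sum-objective as Σi (1/|Ri|) Σj≠i |Ri ∩ Rj|, which is consistent with both numbers; a minimal sketch:

```python
def sum_objective(sets):
    """Sum over agents of (1/|Ri|) times Ri's overlap with the other agents."""
    total = 0.0
    for i, ri in enumerate(sets):
        overlap = sum(len(ri & rj) for j, rj in enumerate(sets) if j != i)
        total += overlap / len(ri)
    return total

R1, R2 = {"t1", "t2"}, {"t1"}
print(round(sum_objective([R1, R2]), 2))           # 1.5
# adding a fake object f to R1 dilutes its overlap term:
print(round(sum_objective([R1 | {"f"}, R2]), 2))   # 1.33
```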
If the distributor were able to create more fake objects, he could further improve the
objective. Algorithm 1 is a general “driver” that is used for allocation in the case of explicit
requests with fake records. The algorithm first selects a random agent from the list and analyzes
its request; after this computation, fake records are created by the function
CREATEFAKEOBJECT(), added to the data set, and the augmented set is given back to the
agent that requested it. Fake records help in identifying the agent from a leaked data set.
Algorithm 1:
Allocation for Explicit Data Requests (EF)
Input: R1, …, Rn, cond1, …, condn, b1, …, bn, B
Output: R1, …, Rn, F1, …, Fn
R ← ∅          (agents that can receive fake objects)
for i = 1, …, n do
    if bi > 0 then
        R ← R ∪ {i}
    Fi ← ∅
while B > 0 do
    i ← SELECTAGENT(R, R1, …, Rn)
    f ← CREATEFAKEOBJECT(Ri, Fi, condi)
    Ri ← Ri ∪ {f}
    Fi ← Fi ∪ {f}
    bi ← bi − 1
    if bi = 0 then
        R ← R \ {i}
    B ← B − 1
Algorithm 2:
Agent Selection for e-random
function SELECTAGENT(R, R1, …, Rn)
    i ← select a random agent from R
    return i
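Algorithms 1 and 2 can be sketched together in Python; CREATEFAKEOBJECT() is modeled here as a simple stub that mints a unique placeholder record (a real implementation would have to produce a record satisfying condi and resembling Ri):

```python
import random

def create_fake_object(ri, fi, cond):
    """Stub for CREATEFAKEOBJECT(Ri, Fi, condi): mint a fresh fake record."""
    return f"fake_{cond}_{len(fi)}"

def select_agent(eligible):
    """e-random agent selection (Algorithm 2)."""
    return random.choice(sorted(eligible))

def allocate_explicit(requests, conds, budgets, B):
    """Algorithm 1: distribute up to B fake objects, at most budgets[i] each.

    requests : list of agent data sets Ri (mutated in place)
    conds    : per-agent conditions condi
    budgets  : per-agent fake-object limits bi
    """
    fakes = [set() for _ in requests]
    budgets = list(budgets)
    eligible = {i for i, b in enumerate(budgets) if b > 0}
    while B > 0 and eligible:   # `and eligible` guards against exhausted budgets
        i = select_agent(eligible)
        f = create_fake_object(requests[i], fakes[i], conds[i])
        requests[i].add(f)
        fakes[i].add(f)
        budgets[i] -= 1
        if budgets[i] == 0:
            eligible.discard(i)
        B -= 1
    return requests, fakes

R, F = allocate_explicit([{"t1", "t2"}, {"t1"}], ["condA", "condB"], [1, 1], B=1)
print(sum(len(f) for f in F))  # 1
```

The `and eligible` guard is our addition: it stops the loop early if every agent's fake budget bi is exhausted before B runs out.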
[Flow chart omitted: implementation of (a) Allocation for Explicit Data Requests (EF) with fake
objects and (b) Agent Selection for e-random and e-optimal. Starting from the user request,
while B > 0 an agent is selected, CREATEFAKEOBJECT() is invoked to generate a fake object,
and the loop iterates over the n requests until the user receives the output.]
5.2 Sample Data Requests
With sample data requests, each agent Ui may receive any subset of mi objects out of the
distributor’s set T, so there are C(|T|, mi) different object allocations for each agent. In every
allocation, the distributor can permute the T objects and keep the same chances of guilty agent
detection. The reason is that the guilt probability depends only on which agents have received the
leaked objects and not on the identity of the leaked objects. The distributor’s problem is to pick
one allocation so that he optimizes his objective. The distributor can also increase the number of
possible allocations by adding fake objects.
Algorithm 3:
Allocation for Sample Data Requests (SF)
Input: m1, …, mn, |T|
Output: R1, …, Rn
a ← 0|T|          (a[k] counts the agents that have received object tk)
R1 ← ∅, …, Rn ← ∅
rem ← Σi mi
while rem > 0 do
    for all i = 1, …, n : |Ri| < mi do
        k ← SELECTOBJECT(i, Ri)
        Ri ← Ri ∪ {tk}
        a[k] ← a[k] + 1
        rem ← rem − 1
Algorithm 4:
Object Selection for s-random
function SELECTOBJECT(i, Ri)
    k ← select at random an element from the set {k’ | tk’ ∉ Ri}
    return k
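A minimal Python sketch of Algorithms 3 and 4 (s-random object selection); the function and variable names are ours:

```python
import random

def select_object(ri, universe):
    """Algorithm 4: pick at random an object tk not already in Ri."""
    return random.choice(sorted(universe - ri))

def allocate_sample(m, universe):
    """Algorithm 3: give agent i a random sample of m[i] objects from T.

    m        : list of requested sample sizes mi (each <= |T|)
    universe : the distributor's object set T
    Returns the allocated sets R1..Rn and a[k], the per-object agent counts.
    """
    sets = [set() for _ in m]
    counts = {t: 0 for t in universe}   # the vector a of Algorithm 3
    rem = sum(m)
    while rem > 0:
        for i, ri in enumerate(sets):
            if len(ri) < m[i]:
                t = select_object(ri, universe)
                ri.add(t)
                counts[t] += 1
                rem -= 1
    return sets, counts

T = {"t1", "t2", "t3", "t4"}
sets, counts = allocate_sample([2, 3], T)
print([len(s) for s in sets])  # [2, 3]
```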
[Flow chart omitted: implementation of (a) Allocation for Sample Data Requests (SF) without
any fake objects and (b) Object Selection for s-random and s-optimal. Starting from the user
request, SELECTOBJECT() is invoked to pick each object, and the loop iterates over the n
requests until the user receives the output; in both cases the selection method considers the
overlaps Σ |Ri ∩ Rj|.]
5.3 Data Allocation Problem
The main focus of our work is the data allocation problem: how can the distributor
“intelligently” give data to agents so as to improve the chances of detecting a guilty agent?
As illustrated in Fig. 6.1, there are four instances of this problem we address, depending on
the type of data requests made by agents (E for Explicit and S for Sample requests) and whether
“fake objects” are allowed (F for the use of fake objects, and F̄ for the case where fake objects
are not allowed). Fake objects are objects generated by the distributor that are not in set T.
The objects are designed to look like real objects, and are distributed to agents together
with the T objects, in order to increase the chances of detecting agents that leak data.
Fig. 6.1: Leakage problem instances.
Assume that we have two agents with requests R1 = EXPLICIT( T, cond1 ) and R2 =
SAMPLE(T’, 1), where T’ = EXPLICIT( T, cond2 ).
Further, say that cond1 is “state = CA” (objects have a state field). If agent U2 has the
same condition cond2 = cond1, we can create an equivalent problem with sample data requests on
set T’. That is, our problem will be how to distribute the CA objects to two agents, with
R1 = SAMPLE( T’, |T’| ) and R2 = SAMPLE(T’, 1). If instead U2 uses condition “state = NY,”
we can solve two different problems for sets T’ and T – T’; each problem will have only
one agent. Finally, if the conditions partially overlap, i.e., R1 ∩ T’ ≠ ∅ but R1 ≠ T’, we can solve
three different problems for sets R1 – T’, R1 ∩ T’, and T’ – R1.
5.3.1 Fake Object
The distributor may be able to add fake objects to the distributed data in order to improve
his effectiveness in detecting guilty agents. However, fake objects may impact the correctness of
what agents do, so they may not always be allowable. The idea of perturbing data to detect
leakage is not new. However, in most cases individual objects are perturbed, e.g., by adding
random noise to sensitive salaries or by adding a watermark to an image. In our case, we instead
perturb the set of distributor objects by adding fake elements.
For example, say the distributed data objects are medical records and the agents are
hospitals. In this case, even small modifications to the records of actual patients may be
undesirable. However, the addition of some fake medical records may be acceptable, since no
patient matches these records, and hence no one will ever be treated based on fake records. A
trace file is maintained to identify the guilty agent. Trace files are a type of fake object that
helps to identify improper use of data. The creation of fake but real-looking objects is a nontrivial
problem whose thorough investigation is beyond the scope of this paper. Here, we model the
creation of a fake object for agent Ui as a black box function CREATEFAKEOBJECT ( Ri, Fi,
condi ) that takes as input the set of all objects Ri, the subset of fake objects Fi that Ui has
received so far, and condi, and returns a new fake object. This function needs condi to produce a
valid object that satisfies Ui’s condition. Set Ri is needed as input so that the created fake object
is not only valid but also indistinguishable from other real objects.
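As a purely illustrative sketch of this black-box interface, the function below generates a fake record that satisfies a simple (field, value) equality condition (our assumed shape for condi) and fills the remaining fields with values sampled from the real records in Ri, so that the fake resembles them:

```python
import random

def create_fake_object(ri, fi, cond):
    """Sketch of the black-box CREATEFAKEOBJECT(Ri, Fi, condi).

    ri   : list of real records (dicts) the agent receives
    fi   : fake records already created for this agent
    cond : (field, value) equality condition the fake must satisfy
    """
    field, value = cond
    template = random.choice(ri)        # copy the shape of a real record
    fake = {k: random.choice([r[k] for r in ri]) for k in template}
    fake[field] = value                 # make the record satisfy condi
    fake["_serial"] = len(fi)           # keep each generated fake unique
    return fake

records = [{"name": "A", "state": "CA"}, {"name": "B", "state": "CA"}]
fake = create_fake_object(records, fi=[], cond=("state", "CA"))
print(fake["state"])  # CA
```

The field names, the `_serial` marker, and the condition format are hypothetical choices for the sketch; a production implementation would have to hide any such marker from the agent.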
5.3.2 Optimization Problem
The distributor’s data allocation to agents has one constraint and one objective. The
distributor’s constraint is to satisfy agents’ requests, by providing them with the number of
objects they request or with all available objects that satisfy their conditions. His objective is to
be able to detect an agent who leaks any of his data objects.
We consider the constraint as strict. The distributor may not deny serving an agent
request and may not provide agents with different perturbed versions of the same objects. We
consider fake object allocation as the only possible constraint relaxation. Our detection objective
is ideal and intractable. Detection would be assured only if the distributor gave no data object to
any agent. We use instead the following objective: maximize the chances of detecting a guilty
agent that leaks all his objects.
We now introduce some notation to state the distributor’s objective formally. Recall that
Pr {Gj | S = Ri}, or simply Pr {Gj | Ri}, is the probability that agent Uj is guilty if the
distributor discovers a leaked table S that contains all Ri objects. We define the difference
function Δ(i, j) as:
Δ(i, j) = Pr {Gi | S = Ri} − Pr {Gj | S = Ri},   i, j = 1, …, n (6)
Note that the differences have nonnegative values: given that set Ri contains all the leaked
objects, agent Ui is at least as likely to be guilty as any other agent. Difference Δ(i, j) is positive
for any agent Uj whose set Rj does not contain all the data of S.
It is zero if Ri ⊆ Rj; in this case the distributor will consider both agents Ui and Uj equally
guilty, since they have both received all the leaked objects. The larger a Δ(i, j) value is, the easier
it is to identify Ui as the leaking agent. Thus, we want to distribute data so that Δ values are
large.
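The difference function can be computed directly from the guilt probabilities of Chapter 4. For the running example R1 = {t1, t2}, R2 = {t1, t3}, a self-contained sketch (function names are ours):

```python
def pr_guilty(i, agents, S, p):
    """Pr{Gi | S}: under Assumptions 1 and 2, Ui is innocent only if every
    leaked object it holds came from a guess or from another holder."""
    pr_innocent = 1.0
    for t in S & agents[i]:
        holders = sum(1 for r in agents.values() if t in r)
        pr_innocent *= 1 - (1 - p) / holders
    return 1 - pr_innocent

def delta(i, j, agents, p):
    """Difference function: Delta(i, j) = Pr{Gi | S=Ri} - Pr{Gj | S=Ri}."""
    S = agents[i]   # the discovered table S contains exactly Ri
    return pr_guilty(i, agents, S, p) - pr_guilty(j, agents, S, p)

agents = {1: {"t1", "t2"}, 2: {"t1", "t3"}}
print(delta(1, 2, agents, p=0.5))  # 0.375
```

A large Δ(1, 2) means a leak of R1 would implicate U1 much more strongly than U2, which is exactly what the distributor wants when choosing an allocation.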
Chapter 6
Intelligent Data Distribution
6.1 Hashed Distribution Algorithm
Input: Agent ID (UID), number of data items requested in the dataset (N), fake records (F)
Output: Distribution set (dataset + fake records)
1. Start.
2. Accept the data request from the agent and analyze:
a. the type of request {Sample, Explicit};
b. the probability of getting records by means other than the distributor, Pr {guessing};
c. the number of records in the dataset (N), used to calculate the number of fake records
added in order to determine the guilty agent;
d. the ID of the agent requesting the data (UID).
3. Generate the list of data to be sent to the agent (the dataset), and assign each record a unique
distribution ID (DID).
4. For i = 1 to F: (for each fake record)
Mapping_Function (UID, FID)
{
Hash (UID)
DID → FID
Store → DistributionDetails {FID, DID, UID}
}
5. For i = 1 to F:
AddFakeRecord (DistributionDetails)
Output: distribution set.
6. Stop.
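A possible Python rendering of the distribution step; the hash function, the ID formats, and the DistributionDetails layout are our own illustrative choices, not prescribed by the algorithm:

```python
import hashlib

def hash_uid(uid, n_slots):
    """Deterministically map an agent UID to one of n_slots hash locations."""
    digest = hashlib.sha256(uid.encode()).hexdigest()
    return int(digest, 16) % n_slots

def distribute(uid, records, n_fakes):
    """Assign a DID to every real record, then map each fake record to a
    DID slot derived from the agent's UID so the distributor can invert it."""
    dataset = {f"DID{i}": rec for i, rec in enumerate(records)}
    details = []  # the DistributionDetails {FID, DID, UID} store
    base = hash_uid(uid, len(records))
    for f in range(n_fakes):
        fid = f"FAKE{f}"
        did = f"DID{(base + f) % len(records)}-F"  # fake slot tied to the UID hash
        dataset[did] = {"fake": fid}
        details.append({"FID": fid, "DID": did, "UID": uid})
    return dataset, details

dataset, details = distribute("agent7", ["r1", "r2", "r3"], n_fakes=1)
print(len(details))  # 1
```

Because `hash_uid` is deterministic, the distributor can later recompute each agent's fake-record locations from the UID alone.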
6.2 Detection Process
The detection process starts when the set of distributed sensitive records is found in some
unauthorized place. The process completes in two phases. In phase one, the agent is identified by
the presence of fake records in the obtained set; if no matching fake record is identified, phase
two begins, which searches for the missing records for which fake records were substituted. The
advantage of the second phase is that it works even in a situation in which the agent identifies
and deletes the fake records before leaking to the target.
InverseMappingFunction (Leaked Data Set)
1. Attach a DID to every record.
2. Sort the records in order of DID.
3. Search for and map the fake records.
4. For every record:
if a fake record is present then
    MapAgent (FID)
else if a substituted record is absent then
    Map (UID), which gives the hash location
    MapAgent (DID)
else
    the objects were obtained by some other means.
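The two-phase detection can be sketched as follows, against a hypothetical DistributionDetails store in which each entry records which fake record (FID), at which distribution ID (DID), went to which agent (UID). Phase one looks for fake records that survived in the leak; phase two looks for distribution IDs that should carry a fake record but are missing. In this simplified form the store is assumed to describe a single candidate distribution at a time:

```python
def detect_agent(leaked_dids, details):
    """Two-phase guilty-agent detection over a leaked set of distribution IDs.

    leaked_dids : set of DIDs recovered from the leaked data
    details     : list of {FID, DID, UID} entries kept by the distributor
    Returns the suspected UID, or None when there is nothing to match.
    """
    # Phase 1: a fake record survived in the leak -> direct match
    for d in details:
        if d["DID"] in leaked_dids:
            return d["UID"]
    # Phase 2: the agent stripped the fakes before leaking; the absence
    # of a record at a substituted location still identifies the agent
    for d in details:
        if d["DID"] not in leaked_dids:
            return d["UID"]
    return None

details = [{"FID": "FAKE0", "DID": "DID2-F", "UID": "agent7"}]
# the agent deleted the fake record before leaking:
print(detect_agent({"DID0", "DID1"}, details))  # agent7
```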
6.3 Benefits of Hashed distribution:
Once the data is distributed, fake records are used to identify the guilty agent. Here, instead
of the record itself, we use its location to determine the guilty agent. So even when the agent
identifies the presence of a fake record and deletes it, the location can still be determined by the
distributor, and the very absence of the fake record reveals the identity of the agent through the
absence of the original record at that location.
This distribution technique solves the data distribution and optimization problems to some
extent.
Chapter 7
Summary
In a perfect world, there would be no need to hand over sensitive data to agents that may
unknowingly or maliciously leak it. And even if we had to hand over sensitive data, in a perfect
world, we could watermark each object so that we could trace its origins with absolute certainty.
However, in many cases, we must indeed work with agents that may not be 100 percent
trusted, and we may not be certain if a leaked object came from an agent or from some other
source, since certain data cannot admit watermarks. In spite of these difficulties, we have shown
that it is possible to assess the likelihood that an agent is responsible for a leak, based on the
overlap of his data with the leaked data and the data of other agents, and based on the probability
that objects can be “guessed” by other means. Our model is relatively simple, but we believe that
it captures the essential trade-offs. The algorithms we have presented implement a variety of data
distribution strategies that can improve the distributor’s chances of identifying a leaker. We have
shown that distributing objects judiciously can make a significant difference in identifying guilty
agents, especially in cases where there is large overlap in the data that agents must receive.
In some cases, “realistic but fake” data records are injected to improve the chances of
detecting leakage and of identifying the guilty party. In future work, the presented allocation
strategies, which assume a fixed set of agents with requests known in advance, can be extended
to handle agent requests in an online fashion.
References and Links
[1] P. Papadimitriou and H. Garcia-Molina, “Data Leakage Detection,” IEEE Transactions on
Knowledge and Data Engineering, vol. 23, no. 1, January 2011.
[2] S. Umamaheswari and H. Arthi Geetha, “Detection of Guilty Agents,” Coimbatore Institute
of Engineering and Technology.
[3] P. Papadimitriou and H. Garcia-Molina, “Data Leakage Detection,” technical report,
Stanford University.
[4] L. Sweeney, “Achieving k-Anonymity Privacy Protection Using Generalization and
Suppression,” http://en.scientificcommons.org/43196131, 2002.
[5] P. Gordon, “Data Leakage – Threats and Mitigation,” SANS Institute Reading Room,
October 15, 2007.
[6] S. Jajodia, P. Samarati, M.L. Sapino, and V.S. Subrahmanian, “Flexible Support for Multiple
Access Control Policies,” ACM Trans. Database Systems, vol. 26, no. 2, pp. 214-260, 2001.