A
SEMINAR REPORT
ON
DATA LEAKAGE DETECTION
SUBMITTED BY
ANUSRI RAMCHANDRAN (3)
PRADEEP CHAUHAN ASHOK (7)
POOJA SHANTILAL CHHIPA (8)
UNDER THE GUIDANCE OF
MRS.JYOTI WADMARE
DEPARTMENT OF COMPUTER ENGINEERING
K.J. SOMAIYA INSTITUTE OF ENGINEERING AND INFORMATION TECHNOLOGY
EVERARD NAGAR, EASTERN EXPRESS HIGHWAY, SION, MUMBAI-42
Certificate
This is to certify that the seminar entitled “DATA LEAKAGE DETECTION” has been
submitted by the following students.
Anusri Ramchandran(3)
Pradeep Chauhan Ashok(7)
Pooja Shantilal Chhipa(8)
Under guidance of Prof. Jyoti Wadmare for the subject seminar in semester VI
Internal examiner Head of Department
(Prof._____________________) (Prof. Uday Rote)
External examiner Principal
(________________________) (Dr. S.G. Kirloskar)
Acknowledgement
Data Leakage Detection is a seminar report brought about by the effort of a dynamic team and a lot of other
people. We would like to extend our sincere thanks to them.
We express our deepest gratitude to Mrs. Jyoti Wadmare, who guided us through the entire seminar report.
We sincerely thank our college staff for their help and guidance.
We also extend our formal gratitude to Prof. Uday Rote, Head of the Department of Computer Engineering,
who provided us with the necessary facilities for the implementation of the seminar.
We are truly indebted to everyone involved in the realization of this project.
We are also thankful to our family members and friends for their patience and encouragement.
Anusri Ramchandran
Pradeep Chauhan
Pooja Shantilal Chhipa
Abstract
We study the following problem: A data distributor has given sensitive data to a set of
supposedly trusted agents (third parties). Some of the data are leaked and found in an
unauthorized place (e.g., on the web or somebody’s laptop). The distributor must assess the
likelihood that the leaked data came from one or more agents, as opposed to having been
independently gathered by other means. We propose data allocation strategies (across the
agents) that improve the probability of identifying leakages. These methods do not rely on
alterations of the released data (e.g., watermarks). In some cases, we can also inject “realistic
but fake” data records to further improve our chances of detecting leakage and identifying the
guilty party.
An organization typically secures its data/information from intruders (i.e., hackers) by protecting
its network. However, data is growing rapidly as organizations expand (e.g., due to globalization),
and the number of data points (machines and servers) that can access these data is rising, offering
ever simpler modes of communication. As a result, intentional or even unintentional data leakage
from within the organization has become a painful reality. This has led to growing
information security awareness in general and about outbound content management in particular.
Index

SR.NO. TOPIC
1. Introduction
2. Literature Review
3. Report on Present Investigation
3.1 Introduction to Data Leakage
3.2 The Leaking Faucet
3.3 Data Leakage Detection
3.3.1 Data Allocation Strategy
3.3.2 Guilt Detection Model
3.3.3 Symbols and Terminology
4. Agent Guilt Model
4.1 Guilty Agent
4.2 Guilt Agent Detection
5. Allocation Strategy
5.1 Explicit Data Requests
5.2 Sample Data Requests
5.3 Data Allocation Problem
5.3.1 Fake Objects
5.3.2 Optimization Problem
6. Intelligent Data Distribution
6.1 Hashed Distribution Algorithm
6.2 Detection Process
6.3 Benefits of Hashed Distribution
7. Summary
8. References and Links
Chapter 1
Introduction
In the course of business, sensitive data must sometimes be given to trusted third parties. For
example, a company may have partnerships with other companies that require sharing customer
data to complete a business process. Similarly, a hospital gives patient records to
researchers to devise new treatments. Another enterprise may outsource its data processing, so
data must be given to various other companies. We call the owner of the data the distributor and
the supposedly trusted third parties the agents [1]. Our goal is to detect, when the distributor
distributes sensitive data among various agents, which agent leaked the data, and if possible to
identify the guilty agent.
Leakage detection is traditionally handled by the process of watermarking, e.g., a unique
code is embedded in each distributed copy. If that copy is later discovered in the hands of an
unauthorized party, the leaker can be identified. Watermarks can be very useful in some cases
but, again, involve some modification of the original data. Furthermore, watermarks can
sometimes be destroyed if the data recipient is malicious. In this paper we study unobtrusive
techniques for detecting leakage of a set of objects or records. Specifically, we study the
following scenario: After giving a set of objects to agents, the distributor discovers some of those
same objects in an unauthorized place. For example, the data may be found on a web site, or may
be obtained through a legal discovery process. At this point the distributor can assess the
likelihood that the leaked data came from one or more agents, as opposed to having been
independently gathered by other means.
To detect a guilty agent, intelligent distribution techniques must be developed that help
identify which agent leaked a data object discovered in an unauthorized place, together with a
model for assessing the “guilt” of agents. In these algorithms we consider
the option of adding “fake” objects to the distributed set. Such objects do not correspond to real
entities but appear realistic to the agents; they act as a watermark for the distributed data set and
are used to identify the agent responsible for leaking a data object. Fake records are added,
introducing only some modification to the data set.
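The fake-object idea described above can be sketched in a few lines. The sketch below is illustrative only and is not the report's implementation; the function name and the way fake IDs are generated are assumptions made for this example.

```python
import secrets

def add_fake_objects(allocation, num_fakes):
    """Illustrative sketch (not the report's code): extend an agent's
    allocated set with 'realistic but fake' records. Here the fakes are
    just tagged random IDs; a real system would generate records that
    look like genuine data to the agent."""
    fakes = set()
    while len(fakes) < num_fakes:
        fakes.add("fake-" + secrets.token_hex(4))
    # Return the extended allocation plus a receipt of which fakes were
    # planted, so a leak containing them can implicate the agent.
    return allocation | fakes, fakes

R1 = {"t1", "t2"}
R1_with_fakes, planted = add_fake_objects(R1, 2)
```

If a leaked set later contains any object from `planted`, the distributor gains strong evidence against that agent, since the fake records exist nowhere else.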
Chapter 2
Literature Review

The introduction to the data leakage detection system presents the guilt detection approach and data
allocation strategies. Alongside these, various other works describe mechanisms that allow
only authorized users to access sensitive data through access control policies.
Such approaches prevent data leakage, in some sense, by sharing information only with trusted
parties. However, these policies are restrictive and may make it impossible to satisfy agents’
requests. The guilt detection approach is related to the data provenance problem, in which tracing
the lineage of leaked objects essentially amounts to detecting the guilty agents. As far as the
data allocation strategies are concerned, the work is most relevant to watermarking, which is used as a
means of establishing original ownership of distributed objects.
All algorithms presented in this paper have been implemented in a prototype lineage-tracing
system, and preliminary performance results are reported. Enabling lineage tracing in a
data warehousing environment has several benefits and applications, including in-depth data
analysis and data mining, authorization management, view update, efficient warehouse recovery,
and others as outlined [6].
Panagiotis Papadimitriou and Hector Garcia-Molina analyzed various data allocation
strategies and related the data allocation strategy with the process of watermarking in which the
work is mostly relevant to watermarking that is used as a means of establishing original
ownership of distributed objects. Watermarks were initially used in images, video, and audio
data whose digital representation includes considerable redundancy. Recently, other works have
also studied the insertion of marks into relational data. The approach and watermarking are similar in the
sense of providing agents with some kind of receiver-identifying information. However, by its
very nature, a watermark modifies the item being watermarked. If the object to be watermarked
cannot be modified, then a watermark cannot be inserted. In such cases, methods that attach
watermarks to the distributed data are not applicable [1]. The authors conducted experiments with
simulated data leakage problems to evaluate their performance and present the evaluation for
sample requests and explicit data requests, respectively.
To calculate the guilt probabilities and differences, p = 0.5 is used throughout this section.
Although not reported here, the authors experimented with other p values and observed that the
relative performance of the algorithms and the main conclusions do not change. If p approaches
0, it becomes easier to find guilty agents and algorithm performance converges. On the other
hand, if p approaches 1, the relative differences among the algorithms grow, since more evidence is
needed to find an agent guilty.
The presented experiments confirmed that fake objects can have a significant impact on
our chances of detecting a guilty agent. This means that by allocating 10 percent fake objects,
the distributor can detect a guilty agent even in the worst-case leakage scenario, while without
fake objects he will be unsuccessful not only in the worst case but also in the average case.
With explicit data requests, few objects are shared among multiple agents. These are
the most interesting scenarios, since object sharing makes it difficult to distinguish guilty from
non-guilty agents. Scenarios with more objects to distribute, or with objects shared
among fewer agents, are obviously easier to handle. Scenarios with many objects to
distribute and many overlapping agent requests are similar,
since we can map them to the distribution of many small subsets.
With sample data requests, agents are not interested in particular objects. Hence, object
sharing is not explicitly defined by their requests. The distributor is “forced” to allocate certain
objects to multiple agents only if the number of requested objects exceeds the number of objects
in set T. The more data objects the agents request in total, the more recipients, on average, an
object has; and the more objects are shared among different agents, the more difficult it is to
detect a guilty agent.
Piero Bonatti, Sabrina De Capitani di Vimercati, and Pierangela Samarati worked on
access control policies, which act as mechanisms that allow only authorized users to access
sensitive data. Such approaches prevent data leakage, in some
sense, by sharing information only with trusted parties. These access control approaches are all based
on monolithic and complete specifications.
The approach to data leakage detection is similar to the watermark added to an image,
audio, or video for detection purposes: a unique code is embedded for identification. In the case of a data
object set, some fake records are added instead.
Chapter 3
Report on Present Investigation
Data leakage happens every day when confidential business information such as customer or
patient data, source code or design specifications, price lists, intellectual property and trade
secrets, forecasts and budgets in spreadsheets are leaked out. In this report a problem is
considered in which a data distributor has given sensitive data to a set of supposedly trusted
agents, and some of the data are leaked and found in an unauthorized place by some means. The
problem with data leakage is that once the data is no longer within the domain of the distributor,
the company is at serious risk. The distributor must assess the likelihood that the leaked data
came from one or more agents, as opposed to having been independently gathered by other
means.
We propose data allocation strategies (across the agents) that improve the probability of
identifying leakages. These methods do not rely on alterations of the released data. In some
cases, we can also inject “realistic but fake” data records to further improve our chances of
detecting leakage and identifying the guilty party.
Further modifications are applied in order to overcome the problems of the current algorithm by
intelligently distributing data objects among various agents in such a way that identifying the
guilty agent becomes simple.
3.1 Introduction to Data Leakage
Data Leakage, put simply, is the unauthorized transmission of data (or information) from
within an organization to an external destination or recipient. Leakage may be intentional or
unintentional and may be caused by an internal or an external user. Internals are authorized users of the
system who can access the data through a valid access control policy, whereas an external intruder
accesses data through some attack, either active or passive, on the target machine.
This may be electronic, or may be via a physical method. Data Leakage is synonymous
with the term Information Leakage. It harms the image of the organization and makes partners wary of
continuing their relationship with the distributor, since it was not able to protect the sensitive
information.
The reader is encouraged to be mindful that unauthorized does not automatically mean
intentional or malicious. Unintentional or inadvertent data leakage is also unauthorized.
Figure 3.1: Data Leakage
According to data compiled from EPIC.org and PerkinsCoie.com, which surveyed
data leakage across various organizations, 52%
of data security breaches come from internal sources compared with the remaining 48% from external
hackers; hence protection is also needed against internal users.
Type of data leaked Percentage
Confidential information 15
Intellectual property 4
Customer data 73
Health record 8
Table 3.2: Types of data leaked [11]
The noteworthy aspect of these figures is that, when the internal breaches are examined,
the percentage due to malicious intent is remarkably low, at less than 1%, while the level of
inadvertent data breach is significant (96%). This further deconstructs into 46% due to
employee oversight and 50% due to poor business process.
3.2 The Leaking Faucet
Data protection programs at most organizations are concerned with protecting sensitive
data from external malicious attacks, relying on technical controls that include perimeter
security, network/wireless surveillance and monitoring, application and point security
management, and user awareness and education. But what about inadvertent data leaks that are not
so sensational, for example unencrypted information on a lost or stolen laptop, USB drive, or other device?
Like the steady drip from a leaking faucet, everyday data leaks are making headlines more often
than the nefarious attack scenarios around which organizations plan most, if not all, of their data
leakage prevention methods. However, to truly protect their critical data, organizations also need
to take a more data-centric approach in their security programs, to protect sensitive data against
such leaks.
Organizations are concerned with protecting sensitive data from external malicious
attacks and from internal staff who can access those data by any method, relying on technical
controls that include perimeter security, network/wireless surveillance and monitoring,
application and point security management, user awareness and education, and DLP
solutions. But what about data leaks by trusted third parties, called agents, who are not present
inside the network and whose activity is not easily traceable? In this situation, some care must be
taken so that the data is not misused by them.
Various kinds of sensitive information, such as financial data, private data, credit card
information, health record information, confidential information, and personal information, are
part of different organizations and can be protected in several ways, as
shown in figure 3.3. The protection measures include Education, Prevention, and Detection to control
the leakage of sensitive information. Education remains the most important factor among all the
protection measures; it includes training and awareness programs on handling sensitive
information and its importance to the organization.
Figure 3.3 represents the faucet for data leakage: in the center, different kinds of sensitive
information are placed, surrounded by the protection mechanisms that prevent valuable
information from leaking out of the organization, which would lead to major problems.
Figure 3.3: The Leaking Faucet
The prevention mechanism deals with DLP, a suite of technologies
that prevents leakage of data by classifying sensitive information, monitoring it, and
restricting access through various access control policies. Education prevents
leakage because, most of the time, leakage occurs unintentionally through internal users. The detection
process detects leakage of information distributed to trusted third parties, called agents,
and calculates their involvement in the leakage.
3.3 Data Leakage Detection
Organizations thought of data/information security only in terms of protecting their
network from intruders (e.g. hackers). But with growing amount of data, rapid growth in the
sizes of organizations (e.g. due to globalization), rise in number of data points (machines and
servers) and easier modes of communication, accidental or even deliberate leakage of data from
within the organization has become a painful reality. This has led to growing awareness about
information security in general and about outbound content management in particular.
Data Leakage, put simply, is the unauthorized transmission of data (or information) from
within an organization to an external destination or recipient. This may be electronic, or may be
via a physical method. Data Leakage is synonymous with the term Information Leakage. The
reader is encouraged to be mindful that unauthorized does not automatically mean intentional or
malicious. Unintentional or inadvertent data leakage is also unauthorized. In the course of doing
business, sometimes sensitive data must be handed over to supposedly trusted third parties.
For example, a hospital may give patient records to researchers who will devise new
treatments. Similarly, a company may have partnerships with other companies that require
sharing customer data. Another enterprise may outsource its data processing, so data must be
given to various other companies. We call the owner of the data the distributor and the
supposedly trusted third parties the agents. Our goal is to detect the guilty agent among all
trustworthy agents when the distributor’s sensitive data have been leaked by any one agent, and
if possible to identify the agent that leaked the data.
We consider applications where the original sensitive data cannot be perturbed.
Perturbation is a very useful technique in which the data are modified and made “less sensitive”
before being handed to agents. In some cases, however, sensitive data must be sent as they are:
an employee's contact number, for example, cannot be altered before being sent to a third party
for recruitment, and similarly a bank account number, if perturbed, is of no use to the receiver
for any process.
To handle such situations, an effective method is required that makes the distribution of
data possible without modifying the valuable information.
In such cases one can add random noise to certain attributes, or one can replace exact
values by ranges [4]. However, in some cases, it is important not to alter the original distributor’s
data. For example, if an outsourcer is doing our payroll, he must have the exact salary and
customer bank account numbers.
Fig 3.4: Data Leakage Detection Process
If medical researchers are treating patients (as opposed to simply computing statistics),
they may need accurate data about the patients. Traditionally, leakage detection is handled by
watermarking, e.g., a unique code is embedded in each distributed copy.
If that copy is later discovered in the hands of an unauthorized party, the leaker can be
identified. Watermarks can be very useful in some cases but, again, involve some modification
of the original data by adding redundancy. Furthermore, watermarks can sometimes be destroyed
if the data recipient is malicious and aware of various techniques to tamper with the watermark.
The Data Leakage Detection system is mainly divided into two modules:
3.3.1 Data allocation strategy
This module helps in the intelligent distribution of the data set so that, if the data is leaked, the
guilty agent can be identified.
Fig 3.5: Data Distribution Scenario
3.3.2 Guilt detection model
This model helps determine whether an agent is responsible for the leakage of data, or whether the
data set was obtained by the target through some other means.
It requires complete domain knowledge to calculate the probability p used to evaluate a guilty
agent. From domain knowledge, proper analysis, and experiment, a probability factor is
calculated that acts as a threshold of evidence to prove the guilt of an agent. When the number of
leaked records exceeds the specified threshold, the agent is considered guilty; when the number of leaked
records is lower, the agent is considered not guilty, because in that situation the leaked
objects may have been obtained by the target through some other means.
Fig 3.6: Guilt Detection Model
An unobtrusive technique for detecting leakage of a set of objects or records is proposed
in this report. After giving a set of objects to agents, the distributor discovers some of those same
objects in an unauthorized place. (For example, the data may be found on a website, or may be
obtained through a legal discovery process.) At this point, the distributor can assess the
likelihood that the leaked data came from one or more agents, as opposed to having been
independently gathered by other means. Using an analogy with cookies stolen from a cookie jar,
if we catch Freddie with a single cookie, he can argue that a friend gave him the cookie. But if
we catch Freddie with five cookies, it will be much harder for him to argue that his hands were
not in the cookie jar. If the distributor sees “enough evidence” that an agent leaked data, he may
stop doing business with him, or may initiate legal proceedings. In this paper, we develop a
model for assessing the “guilt” of agents. We also present algorithms for distributing objects to
agents, in a way that improves our chances of identifying a leaker. Finally, we also consider the
option of adding “fake” objects to the distributed set. Such objects do not correspond to real
entities but appear realistic to the agents. In a sense, the fake objects act as a type of watermark
for the entire set, without modifying any individual members. If it turns out that an agent was
given one or more fake objects that were leaked, then the distributor can be more confident that
agent was guilty.
3.3.3 Symbols and Terminology

A distributor owns a set T = {t1, t2, t3, …} of valuable and sensitive data objects. The
distributor wants to share some of the objects with a set of agents U1, U2… Un, but does not wish
the objects be leaked to other third parties. The objects in T could be of any type and size, e.g.,
they could be tuples in a relation, or relations in a database. An agent Ui receives a subset of
objects Ri ⊆ T, determined either by a sample request or an explicit request:
• Distributor: A distributor owns a set T of valuable and sensitive data objects.
Owner of data set T = {t1, t2, ……, tn}
• Agent (U): The distributor shares some of the objects with a set of agents U1, U2… Un,
but does not wish the objects be leaked to other third parties.
Receives a set Ri ⊆ T from the distributor.
• Target: Unauthorized third party caught with a leaked data set S ⊆ T.
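The roles just defined (distributor, agents, and their sets) can be captured in a small data model. The class and method names below (Distributor, Agent, give) are assumptions made for illustration, not code from the report.

```python
class Agent:
    """An agent U_i that receives a subset R_i of the distributor's set T."""
    def __init__(self, name):
        self.name = name
        self.received = set()   # R_i

class Distributor:
    """Owner of the sensitive data set T."""
    def __init__(self, objects):
        self.T = set(objects)
        self.agents = []

    def give(self, agent, requested):
        # An agent can only receive objects that actually exist in T,
        # so R_i is always a subset of T.
        subset = set(requested) & self.T
        agent.received |= subset
        if agent not in self.agents:
            self.agents.append(agent)
        return subset

d = Distributor({"t1", "t2", "t3"})
u1, u2 = Agent("U1"), Agent("U2")
d.give(u1, {"t1", "t2"})
d.give(u2, {"t1", "t3"})
```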
Example: Say T contains customer records for a given company A. Company A hires a
marketing agency U1 to do an on-line survey of customers. Since any customers will do for the
survey, U1 requests a sample of 1,000 customer records. At the same time, company A
subcontracts with agent U2 to handle billing for all California customers. Thus, U2 receives all T
records that satisfy the condition “state is California.” Suppose that after giving objects to agents,
the distributor discovers that a set S ⊆ T has leaked. This means that some third party
called the target has been caught in possession of S. For example, this target may be displaying S
on its web site, or perhaps as part of a legal discovery process, the target turned over S to the
distributor.
Since agents U1, U2, …, Un have some of the data, it is reasonable to suspect them of leaking the data.
However, the agents can argue that they are innocent, and that the S data was obtained by the
target through other means. For example, say one of the objects in S represents a customer X.
Perhaps X is also a customer of some other company, and that company provided the data to the
target. Or perhaps X can be reconstructed from various publicly available sources on the web.
Our goal is to estimate the likelihood that the leaked data came from the agents as opposed to
other sources. Intuitively, the more data in S, the harder it is for the agents to argue they did not
leak anything. Similarly, the “rarer” the objects, the harder it is to argue that the target obtained
them through other means.
Not only do we want to estimate the likelihood the agents leaked data, but we would also like
to find out if one of them in particular was more likely to be the leaker. For instance, if one of the
S objects was only given to agent U1, while the other objects were given to all agents, we may
suspect U1 more. The model we present next captures this intuition. We say an agent Ui is guilty
if it contributes one or more objects to the target. While performing the implementation and research
work for calculating the guilt of an agent, we follow various assumptions in order to reduce
complexity and computation.
Chapter 4
Agent Guilt Model
This model helps determine whether an agent is responsible for the leakage of data, or whether the
data set was obtained by the target through some other means. The distributor can assess the likelihood that the
leaked data came from one or more agents, as opposed to having been independently gathered by
other means. Using an analogy with cookies stolen from a cookie jar, if we catch Freddie with a
single cookie, he can argue that a friend gave him the cookie. But if we catch Freddie with five
cookies, it will be much harder for him to argue that his hands were not in the cookie jar.
If the distributor sees “enough evidence” that an agent leaked data, he may stop doing
business with him, or may initiate legal proceedings.
4.1 Guilty Agent
To compute the probability that an agent is guilty, we need an estimate for the probability that
values in S can be “guessed” by the target. For instance, say some of the objects in T are emails
of individuals. We can conduct an experiment and ask a person with approximately the expertise
and resources of the target to find the email of say 100 individuals. If this person can find say 90
emails, then we can reasonably guess that the probability of finding one email is 0.9. On the
other hand, if the objects in question are bank account numbers, the person may only discover
say 20, leading to an estimate of 0.2. We call this estimate pt , the probability that object t can be
guessed by the target.
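The estimation experiment described above amounts to a simple ratio. A minimal sketch, with a hypothetical function name:

```python
def estimate_guess_probability(found, asked):
    """Estimate p_t, the probability that the target can independently
    'guess' an object: the fraction of objects a person with roughly the
    target's expertise could find (e.g. 90 of 100 emails -> 0.9)."""
    if asked <= 0:
        raise ValueError("asked must be a positive count")
    return found / asked

p_email = estimate_guess_probability(90, 100)    # emails: easy to find
p_account = estimate_guess_probability(20, 100)  # bank accounts: hard
```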
To simplify the formulas, we assume that all T objects have the same probability, which we
call p. Next, we make two assumptions regarding the relationship among the various leakage
events. The first assumption simply states that an agent’s decision to leak an object is not related
to other objects.
Suppose that after giving objects to agents, the distributor discovers that a set S ⊆ T has
leaked. This means that some third party called the target has been caught in possession of S.
For example, this target may be displaying S on its web site, or perhaps as part of a legal
discovery process, the target turned over S to the distributor. Since the agents U1, …, Un have
some of the data, it is reasonable to suspect them of leaking the data. However, the agents can argue
that they are innocent, and that the S data was obtained by the target through other means.
For example, say one of the objects in S represents a customer X. Perhaps X is also a
customer of some other company, and that company provided the data to the target. Or
perhaps X can be reconstructed from various publicly available sources on the web. Our goal is
to estimate the likelihood that the leaked data came from the agents as opposed to other sources.
Intuitively, the more data in S, the harder it is for the agents to argue they did not leak anything.
Similarly, the “rarer” the objects, the harder it is to argue that the target obtained them through
other means. Not only do we want to estimate the likelihood the agents leaked data, but we
would also like to find out if one of them in particular was more likely to be the leaker.
For instance, if one of the S objects was only given to agent U1, while the other objects
were given to all agents, we may suspect U1 more. The model we present next captures this
intuition. We say an agent Ui is guilty if it contributes one or more objects to the target. We
denote the event that agent Ui is guilty for a given leaked set S by {Gi | S}. Our next step is to
estimate Pr{Gi | S}, i.e., the probability that agent Ui is guilty given evidence S.
4.2 Guilt Agent Detection
We can conduct an experiment and ask a person with approximately the expertise and
resources of the target to find the email of say 100 individuals.
If this person can find say 90 emails, then we can reasonably guess that the probability of
finding one email is 0.9. On the other hand, if the objects in question are bank account numbers,
the person may only discover say 20, leading to an estimate of 0.2. We call this estimate pt, the
probability that object t can be guessed by the target [2]. For simplicity we assume that all T
objects have the same pt, which we call p.
Next, we make two assumptions regarding the relationship among the various leakage
events. The first assumption simply states that an agent’s decision to leak an object is not related
to other objects.
Assumption 1. For all t, t′ ∈ S such that t ≠ t′, the provenance of t is independent of the
provenance of t′ [1].
The term “provenance” in this assumption statement refers to the source of a value t that
appears in the leaked set. The source can be any of the agents who have t in their sets or the
target itself (guessing). The following assumption states that joint events have a negligible
probability.
Assumption 2. An object t ∈ S can only be obtained by the target in one of two ways:
• A single agent Ui leaked t from its own Ri set; or
• The target guessed t (or obtained it through other means) without the help of any of the n agents.
In other words, for all t ∈ S, the event that the target guesses t and the events that agent Ui
(i = 1, …, n) leaks object t are disjoint. Assume
that the distributor set T, the agent sets Rs, and the target set S are:
T = { t1, t2, t3 }, R1 = { t1, t2 }, R2 = { t1, t3 }, S = { t1, t2, t3 }.
In this case, all three of the distributor’s objects have been leaked and appear in S. Let us first
consider how the target may have obtained object t1, which was given to both agents. From
Assumption 2, the target either guessed t1 or one of U1 or U2 leaked it.
We know that the probability of the former event is p, so assuming that the probability that each
of the two agents leaked t1 is the same, we have the following cases:
The target guessed t1 with probability p,
Agent U1 leaked t1 to S with probability (1-p)/2,
Agent U2 leaked t1 to S with probability (1-p)/2.
Similarly, we find that agent U1 leaked t2 to S with probability (1-p) since he is the only agent
that has t2.
Given these values, the probability that agent U1 is not guilty, namely that U1 did not leak
either object, is
Pr {Ḡ1 | S} = (1 − (1 − p)/2) × (1 − (1 − p)),
and the probability that U1 is guilty is
Pr {G1 | S} = 1 − Pr {Ḡ1 | S}.
If Assumption 2 did not hold, our analysis would be more complex because we would need to
consider joint events, e.g., the target guesses t1, and at the same time, one or two agents leak the
value. In our simplified analysis, we say that an agent is not guilty when the object can be
guessed, regardless of whether the agent leaked the value.
Since we are “not counting” instances when an agent leaks information, the simplified
analysis yields conservative (i.e., smaller) guilt probabilities.
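The worked example above can be checked with a short script. Under Assumptions 1 and 2, the probability that agent Ui did not leak anything is the product, over the leaked objects t in Ri, of 1 − (1 − p)/|Vt|, where Vt is the set of agents holding t. The sketch below (function and variable names are ours, not from [1]) computes Pr {Gi | S} for the example:

```python
def guilt_probability(agent, agents, leaked, p):
    """Pr{Gi | S}: probability that `agent` leaked at least one object.

    agents : dict mapping agent name -> set of objects it received (Ri)
    leaked : set of leaked objects S
    p      : probability that the target guessed any single object
    """
    pr_not_guilty = 1.0
    for t in leaked & agents[agent]:
        # V_t: number of agents that could have leaked t
        holders = sum(1 for r in agents.values() if t in r)
        # with probability p the target guessed t; otherwise each
        # holder is equally likely to be the source
        pr_not_guilty *= 1 - (1 - p) / holders
    return 1 - pr_not_guilty

# The example from the text: T = {t1, t2, t3}, R1 = {t1, t2}, R2 = {t1, t3}
agents = {"U1": {"t1", "t2"}, "U2": {"t1", "t3"}}
S = {"t1", "t2", "t3"}
print(guilt_probability("U1", agents, S, p=0.5))  # 0.625
```

With p = 0.5, U1's guilt probability is 1 − (1 − 0.25)(1 − 0.5) = 0.625, matching the case analysis above.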
Chapter 5
Allocation Strategy
This chapter discusses allocation strategies applicable to the different problem instances of
data requests. We deal with problems with explicit data requests and with sample data requests.
5.1 Explicit Data Requests
When the distributor is not allowed to add fake objects to the distributed data, the
data allocation is fully defined by the agents’ data requests. Say, for example, that T = {t1, t2}
and there are two agents with explicit data requests such that R1 = {t1, t2} and R2 = {t1}. The
value of the sum-objective in this case is 1.5. The distributor cannot remove or alter the R1 or R2
data to decrease the overlap R1 ∩ R2.
However, say that the distributor can create one fake object (B = 1) and both agents can receive
one fake object (b1 = b2 = 1). In this case, the distributor can add one fake object to either R1 or
R2 to increase the corresponding denominator of the summation term. Assume that the distributor
creates a fake object f and he gives it to agent R1. Agent U1 has now R1 = { t1, t2, f } and F1 = {f}
and the value of the sum-objective decreases to 1.33 < 1.5.
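The two objective values quoted here (1.5 and 1.33) can be reproduced with a few lines, taking the sum-objective as Σi (1/|Ri|) Σj≠i |Ri ∩ Rj|, which is consistent with both numbers; a minimal sketch:

```python
def sum_objective(sets):
    """Sum over agents of (1/|Ri|) times Ri's overlap with the other agents."""
    total = 0.0
    for i, ri in enumerate(sets):
        overlap = sum(len(ri & rj) for j, rj in enumerate(sets) if j != i)
        total += overlap / len(ri)
    return total

R1, R2 = {"t1", "t2"}, {"t1"}
print(round(sum_objective([R1, R2]), 2))           # 1.5
# adding a fake object f to R1 dilutes its overlap term:
print(round(sum_objective([R1 | {"f"}, R2]), 2))   # 1.33
```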
If the distributor were able to create more fake objects, he could further improve the
objective. Algorithm 1 is a general “driver” that is used for allocation in the case of explicit
requests with fake records. The algorithm first selects a random agent from the list and analyzes
its request; after this computation, fake records are created by the function
CREATEFAKEOBJECT(), added to the data set, and the augmented set is given back to the
agent that requested it. Fake records help in identifying the agent from a leaked data set.
Algorithm 1:
Allocation for Explicit Data Requests (EF)
Input: R1, …, Rn, cond1, …, condn, b1, …, bn, B
Output: R1, …, Rn, F1, …, Fn
R ← ∅          (agents that can receive fake objects)
for i = 1, …, n do
    if bi > 0 then
        R ← R ∪ {i}
    Fi ← ∅
while B > 0 do
    i ← SELECTAGENT(R, R1, …, Rn)
    f ← CREATEFAKEOBJECT(Ri, Fi, condi)
    Ri ← Ri ∪ {f}
    Fi ← Fi ∪ {f}
    bi ← bi − 1
    if bi = 0 then
        R ← R \ {i}
    B ← B − 1
Algorithm 2:
Agent Selection for e-random
function SELECTAGENT(R, R1, …, Rn)
    i ← select a random agent from R
    return i
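Algorithms 1 and 2 can be sketched together in Python; CREATEFAKEOBJECT() is modeled here as a simple stub that mints a unique placeholder record (a real implementation would have to produce a record satisfying condi and resembling Ri):

```python
import random

def create_fake_object(ri, fi, cond):
    """Stub for CREATEFAKEOBJECT(Ri, Fi, condi): mint a fresh fake record."""
    return f"fake_{cond}_{len(fi)}"

def select_agent(eligible):
    """e-random agent selection (Algorithm 2)."""
    return random.choice(sorted(eligible))

def allocate_explicit(requests, conds, budgets, B):
    """Algorithm 1: distribute up to B fake objects, at most budgets[i] each.

    requests : list of agent data sets Ri (mutated in place)
    conds    : per-agent conditions condi
    budgets  : per-agent fake-object limits bi
    """
    fakes = [set() for _ in requests]
    budgets = list(budgets)
    eligible = {i for i, b in enumerate(budgets) if b > 0}
    while B > 0 and eligible:   # `and eligible` guards against exhausted budgets
        i = select_agent(eligible)
        f = create_fake_object(requests[i], fakes[i], conds[i])
        requests[i].add(f)
        fakes[i].add(f)
        budgets[i] -= 1
        if budgets[i] == 0:
            eligible.discard(i)
        B -= 1
    return requests, fakes

R, F = allocate_explicit([{"t1", "t2"}, {"t1"}], ["condA", "condB"], [1, 1], B=1)
print(sum(len(f) for f in F))  # 1
```

The `and eligible` guard is our addition: it stops the loop early if every agent's fake budget bi is exhausted before B runs out.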
[Flow chart omitted: implementation of (a) Allocation for Explicit Data Requests (EF) with fake
objects and (b) Agent Selection for e-random and e-optimal. Starting from the user request,
while B > 0 an agent is selected, CREATEFAKEOBJECT() is invoked to generate a fake object,
and the loop iterates over the n requests until the user receives the output.]
5.2 Sample Data Requests
With sample data requests, each agent Ui may receive any subset of mi objects out of the
distributor’s set T, so there are C(|T|, mi) different object allocations for each agent. In every
allocation, the distributor can permute the T objects and keep the same chances of guilty agent
detection. The reason is that the guilt probability depends only on which agents have received the
leaked objects and not on the identity of the leaked objects. The distributor’s problem is to pick
one allocation so that he optimizes his objective. The distributor can also increase the number of
possible allocations by adding fake objects.
Algorithm 3:
Allocation for Sample Data Requests (SF)
Input: m1, …, mn, |T|
Output: R1, …, Rn
a ← 0|T|          (a[k] counts the agents that have received object tk)
R1 ← ∅, …, Rn ← ∅
rem ← Σi mi
while rem > 0 do
    for all i = 1, …, n : |Ri| < mi do
        k ← SELECTOBJECT(i, Ri)
        Ri ← Ri ∪ {tk}
        a[k] ← a[k] + 1
        rem ← rem − 1
Algorithm 4:
Object Selection for s-random
function SELECTOBJECT(i, Ri)
    k ← select at random an element from the set {k’ | tk’ ∉ Ri}
    return k
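A minimal Python sketch of Algorithms 3 and 4 (s-random object selection); the function and variable names are ours:

```python
import random

def select_object(ri, universe):
    """Algorithm 4: pick at random an object tk not already in Ri."""
    return random.choice(sorted(universe - ri))

def allocate_sample(m, universe):
    """Algorithm 3: give agent i a random sample of m[i] objects from T.

    m        : list of requested sample sizes mi (each <= |T|)
    universe : the distributor's object set T
    Returns the allocated sets R1..Rn and a[k], the per-object agent counts.
    """
    sets = [set() for _ in m]
    counts = {t: 0 for t in universe}   # the vector a of Algorithm 3
    rem = sum(m)
    while rem > 0:
        for i, ri in enumerate(sets):
            if len(ri) < m[i]:
                t = select_object(ri, universe)
                ri.add(t)
                counts[t] += 1
                rem -= 1
    return sets, counts

T = {"t1", "t2", "t3", "t4"}
sets, counts = allocate_sample([2, 3], T)
print([len(s) for s in sets])  # [2, 3]
```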
[Flow chart omitted: implementation of (a) Allocation for Sample Data Requests (SF) without
any fake objects and (b) Object Selection for s-random and s-optimal. Starting from the user
request, SELECTOBJECT() is invoked to pick each object, and the loop iterates over the n
requests until the user receives the output; in both cases the selection method considers the
overlaps Σ |Ri ∩ Rj|.]
5.3 Data Allocation Problem
The main focus of our work is the data allocation problem: how can the distributor
“intelligently” give data to agents so as to improve the chances of detecting a guilty agent?
As illustrated in Fig. 6.1, there are four instances of this problem we address, depending on
the type of data requests made by agents (E for Explicit and S for Sample requests) and whether
“fake objects” are allowed (F for the use of fake objects, and F̄ for the case where fake objects
are not allowed). Fake objects are objects generated by the distributor that are not in set T.
The objects are designed to look like real objects, and are distributed to agents together
with the T objects, in order to increase the chances of detecting agents that leak data.
Fig. 6.1: Leakage problem instances.
Assume that we have two agents with requests R1 = EXPLICIT( T, cond1 ) and R2 =
SAMPLE(T’, 1), where T’ = EXPLICIT( T, cond2 ).
Further, say that cond1 is “state = CA” (objects have a state field). If agent U2 has the
same condition cond2 = cond1, we can create an equivalent problem with sample data requests on
set T’. That is, our problem will be how to distribute the CA objects to two agents, with
R1 = SAMPLE( T’, |T’| ) and R2 = SAMPLE(T’, 1). If instead U2 uses condition “state = NY,”
we can solve two different problems for sets T’ and T – T’; each problem will have only
one agent. Finally, if the conditions partially overlap, i.e., R1 ∩ T’ ≠ ∅ but R1 ≠ T’, we can solve
three different problems for sets R1 – T’, R1 ∩ T’, and T’ – R1.
5.3.1 Fake Object
The distributor may be able to add fake objects to the distributed data in order to improve
his effectiveness in detecting guilty agents. However, fake objects may impact the correctness of
what agents do, so they may not always be allowable. The idea of perturbing data to detect
leakage is not new. However, in most cases individual objects are perturbed, e.g., by adding
random noise to sensitive salaries or by adding a watermark to an image. In our case, we instead
perturb the set of distributor objects by adding fake elements.
For example, say the distributed data objects are medical records and the agents are
hospitals. In this case, even small modifications to the records of actual patients may be
undesirable. However, the addition of some fake medical records may be acceptable, since no
patient matches these records, and hence no one will ever be treated based on fake records. A
trace file is maintained to identify the guilty agent. Trace files are a type of fake object that
helps to identify improper use of data. The creation of fake but real-looking objects is a nontrivial
problem whose thorough investigation is beyond the scope of this paper. Here, we model the
creation of a fake object for agent Ui as a black box function CREATEFAKEOBJECT ( Ri, Fi,
condi ) that takes as input the set of all objects Ri, the subset of fake objects Fi that Ui has
received so far, and condi, and returns a new fake object. This function needs condi to produce a
valid object that satisfies Ui’s condition. Set Ri is needed as input so that the created fake object
is not only valid but also indistinguishable from other real objects.
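As a purely illustrative sketch of this black-box interface, the function below generates a fake record that satisfies a simple (field, value) equality condition (our assumed shape for condi) and fills the remaining fields with values sampled from the real records in Ri, so that the fake resembles them:

```python
import random

def create_fake_object(ri, fi, cond):
    """Sketch of the black-box CREATEFAKEOBJECT(Ri, Fi, condi).

    ri   : list of real records (dicts) the agent receives
    fi   : fake records already created for this agent
    cond : (field, value) equality condition the fake must satisfy
    """
    field, value = cond
    template = random.choice(ri)        # copy the shape of a real record
    fake = {k: random.choice([r[k] for r in ri]) for k in template}
    fake[field] = value                 # make the record satisfy condi
    fake["_serial"] = len(fi)           # keep each generated fake unique
    return fake

records = [{"name": "A", "state": "CA"}, {"name": "B", "state": "CA"}]
fake = create_fake_object(records, fi=[], cond=("state", "CA"))
print(fake["state"])  # CA
```

The field names, the `_serial` marker, and the condition format are hypothetical choices for the sketch; a production implementation would have to hide any such marker from the agent.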
5.3.2 Optimization Problem
The distributor’s data allocation to agents has one constraint and one objective. The
distributor’s constraint is to satisfy agents’ requests, by providing them with the number of
objects they request or with all available objects that satisfy their conditions. His objective is to
be able to detect an agent who leaks any of his data objects.
We consider the constraint as strict. The distributor may not deny serving an agent
request and may not provide agents with different perturbed versions of the same objects. We
consider fake object allocation as the only possible constraint relaxation. Our detection objective
is ideal and intractable. Detection would be assured only if the distributor gave no data object to
any agent. We use instead the following objective: maximize the chances of detecting a guilty
agent that leaks all his objects.
We now introduce some notation to state the distributor’s objective formally. Recall that
Pr {Gj | S = Ri}, or simply Pr {Gj | Ri}, is the probability that agent Uj is guilty if the
distributor discovers a leaked table S that contains all Ri objects. We define the difference
function Δ(i, j) as:
Δ(i, j) = Pr {Gi | S = Ri} − Pr {Gj | S = Ri},   i, j = 1, …, n (6)
Note that the differences have nonnegative values: given that set Ri contains all the leaked
objects, agent Ui is at least as likely to be guilty as any other agent. Difference Δ(i, j) is positive
for any agent Uj whose set Rj does not contain all the data of S.
It is zero if Ri ⊆ Rj; in this case the distributor will consider both agents Ui and Uj equally
guilty, since they have both received all the leaked objects. The larger a Δ(i, j) value is, the easier
it is to identify Ui as the leaking agent. Thus, we want to distribute data so that Δ values are
large.
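The difference function can be computed directly from the guilt probabilities of Chapter 4. For the running example R1 = {t1, t2}, R2 = {t1, t3}, a self-contained sketch (function names are ours):

```python
def pr_guilty(i, agents, S, p):
    """Pr{Gi | S}: under Assumptions 1 and 2, Ui is innocent only if every
    leaked object it holds came from a guess or from another holder."""
    pr_innocent = 1.0
    for t in S & agents[i]:
        holders = sum(1 for r in agents.values() if t in r)
        pr_innocent *= 1 - (1 - p) / holders
    return 1 - pr_innocent

def delta(i, j, agents, p):
    """Difference function: Delta(i, j) = Pr{Gi | S=Ri} - Pr{Gj | S=Ri}."""
    S = agents[i]   # the discovered table S contains exactly Ri
    return pr_guilty(i, agents, S, p) - pr_guilty(j, agents, S, p)

agents = {1: {"t1", "t2"}, 2: {"t1", "t3"}}
print(delta(1, 2, agents, p=0.5))  # 0.375
```

A large Δ(1, 2) means a leak of R1 would implicate U1 much more strongly than U2, which is exactly what the distributor wants when choosing an allocation.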
Chapter 6
Intelligent Data Distribution
6.1 Hashed Distribution Algorithm
Input: Agent ID (UID), number of data items requested in the dataset (N), fake records (F)
Output: Distribution set (dataset + fake records)
1. Start.
2. Accept the data request from the agent and analyze:
a. the type of request {Sample, Explicit};
b. the probability of getting records by means other than the distributor, Pr {guessing};
c. the number of records in the dataset (N), used to calculate the number of fake records
added in order to determine the guilty agent;
d. the ID of the agent requesting the data (UID).
3. Generate the list of data to be sent to the agent (the dataset), and assign each record a unique
distribution ID (DID).
4. For i = 1 to F: (for each fake record)
Mapping_Function (UID, FID)
{
Hash (UID)
DID → FID
Store → DistributionDetails {FID, DID, UID}
}
5. For i = 1 to F:
AddFakeRecord (DistributionDetails)
Output: distribution set.
6. Stop.
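A possible Python rendering of the distribution step; the hash function, the ID formats, and the DistributionDetails layout are our own illustrative choices, not prescribed by the algorithm:

```python
import hashlib

def hash_uid(uid, n_slots):
    """Deterministically map an agent UID to one of n_slots hash locations."""
    digest = hashlib.sha256(uid.encode()).hexdigest()
    return int(digest, 16) % n_slots

def distribute(uid, records, n_fakes):
    """Assign a DID to every real record, then map each fake record to a
    DID slot derived from the agent's UID so the distributor can invert it."""
    dataset = {f"DID{i}": rec for i, rec in enumerate(records)}
    details = []  # the DistributionDetails {FID, DID, UID} store
    base = hash_uid(uid, len(records))
    for f in range(n_fakes):
        fid = f"FAKE{f}"
        did = f"DID{(base + f) % len(records)}-F"  # fake slot tied to the UID hash
        dataset[did] = {"fake": fid}
        details.append({"FID": fid, "DID": did, "UID": uid})
    return dataset, details

dataset, details = distribute("agent7", ["r1", "r2", "r3"], n_fakes=1)
print(len(details))  # 1
```

Because `hash_uid` is deterministic, the distributor can later recompute each agent's fake-record locations from the UID alone.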
6.2 Detection Process
The detection process starts when the set of distributed sensitive records is found in some
unauthorized place. The process completes in two phases. In phase one, the agent is identified by
the presence of fake records in the obtained set; if no matching fake record is identified, phase
two begins, which searches for the missing records for which fake records were substituted. The
advantage of the second phase is that it works even in a situation in which the agent identifies
and deletes the fake records before leaking to the target.
InverseMappingFunction (Leaked Data Set)
1. Attach a DID to every record.
2. Sort the records in order of DID.
3. Search for and map the fake records.
4. For every record:
if a fake record is present then
    MapAgent (FID)
else if a substituted record is absent then
    Map (UID), which gives the hash location
    MapAgent (DID)
else
    the objects were obtained by some other means.
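The two-phase detection can be sketched as follows, against a hypothetical DistributionDetails store in which each entry records which fake record (FID), at which distribution ID (DID), went to which agent (UID). Phase one looks for fake records that survived in the leak; phase two looks for distribution IDs that should carry a fake record but are missing. In this simplified form the store is assumed to describe a single candidate distribution at a time:

```python
def detect_agent(leaked_dids, details):
    """Two-phase guilty-agent detection over a leaked set of distribution IDs.

    leaked_dids : set of DIDs recovered from the leaked data
    details     : list of {FID, DID, UID} entries kept by the distributor
    Returns the suspected UID, or None when there is nothing to match.
    """
    # Phase 1: a fake record survived in the leak -> direct match
    for d in details:
        if d["DID"] in leaked_dids:
            return d["UID"]
    # Phase 2: the agent stripped the fakes before leaking; the absence
    # of a record at a substituted location still identifies the agent
    for d in details:
        if d["DID"] not in leaked_dids:
            return d["UID"]
    return None

details = [{"FID": "FAKE0", "DID": "DID2-F", "UID": "agent7"}]
# the agent deleted the fake record before leaking:
print(detect_agent({"DID0", "DID1"}, details))  # agent7
```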
6.3 Benefits of Hashed distribution:
Once the data is distributed, fake records are used to identify the guilty agent. Here, instead
of the record itself, we use its location to determine the guilty agent. So even when the agent
identifies the presence of a fake record and deletes it, the location can still be determined by the
distributor, and the very absence of the fake record reveals the identity of the agent through the
absence of the original record at that location.
This distribution technique solves the data distribution and optimization problems to some
extent.
Chapter 7
Summary
In a perfect world, there would be no need to hand over sensitive data to agents that may
unknowingly or maliciously leak it. And even if we had to hand over sensitive data, in a perfect
world, we could watermark each object so that we could trace its origins with absolute certainty.
However, in many cases, we must indeed work with agents that may not be 100 percent
trusted, and we may not be certain if a leaked object came from an agent or from some other
source, since certain data cannot admit watermarks. In spite of these difficulties, we have shown
that it is possible to assess the likelihood that an agent is responsible for a leak, based on the
overlap of his data with the leaked data and the data of other agents, and based on the probability
that objects can be “guessed” by other means. Our model is relatively simple, but we believe that
it captures the essential trade-offs. The algorithms we have presented implement a variety of data
distribution strategies that can improve the distributor’s chances of identifying a leaker. We have
shown that distributing objects judiciously can make a significant difference in identifying guilty
agents, especially in cases where there is large overlap in the data that agents must receive.
In some cases, “realistic but fake” data records are injected to improve the chances of
detecting leakage and of identifying the guilty party. In future work, the presented allocation
strategies, which assume a fixed set of agents with requests known in advance, can be extended
to handle agent requests in an online fashion.
References and Links
[1] P. Papadimitriou and H. Garcia-Molina, “Data Leakage Detection,” IEEE Transactions on
Knowledge and Data Engineering, vol. 23, no. 1, January 2011.
[2] S. Umamaheswari and H. Arthi Geetha, “Detection of Guilty Agents,” Coimbatore Institute
of Engineering and Technology.
[3] P. Papadimitriou and H. Garcia-Molina, “Data Leakage Detection,” technical report,
Stanford University.
[4] L. Sweeney, “Achieving k-Anonymity Privacy Protection Using Generalization and
Suppression,” http://en.scientificcommons.org/43196131, 2002.
[5] P. Gordon, “Data Leakage – Threats and Mitigation,” SANS Institute Reading Room,
October 15, 2007.
[6] S. Jajodia, P. Samarati, M.L. Sapino, and V.S. Subrahmanian, “Flexible Support for Multiple
Access Control Policies,” ACM Trans. Database Systems, vol. 26, no. 2, pp. 214-260, 2001.