A PROJECT SYNOPSIS
On
Data Leakage Detection
Submitted By
1) Nisha Jain (19)
2) Ankita Patil (37)
3) Yogita Patil (41)
4) Vidya Shelke (51)
Under the Guidance of
Prof. Shilpa Kolte
Department of Information Technology
Saraswati Education Society’s
SARASWATI COLLEGE OF ENGINEERING
Kharghar, Navi Mumbai
(Affiliated to University of Mumbai)
Academic Year: 2012-13
_____________________________________________________________________________________
PLOT NO. 46/46A, SECTOR NO 5, BEHIND MSEB SUBSTATION, KHARGHAR,NAVI
MUMBAI-410210
Tel.: 022-27743706 to 11 * Fax : 022-27743712 * Website: www.scoe.edu.in
CERTIFICATE
This is to certify that the requirements for the synopsis entitled “Data Leakage Detection”
have been successfully completed by the following students:
Roll numbers Name
1) 19 Nisha Jain
2) 37 Ankita Patil
3) 41 Yogita Patil
4) 51 Vidya Shelke
in partial fulfillment of Semester VII of the Bachelor of Engineering in Information
Technology of the University of Mumbai, at Saraswati College of Engineering, Kharghar,
during the academic year 2012-13.
Internal Guide External Examiner
Prof. Shilpa Kolte
Project co-ordinator Head of Department
Prof. Archana Sharma Prof. Vaishali Jadhav
Principal
Dr. B. B. Shrivastava
SARASWATI EDUCATION SOCIETY’S
SARASWATI COLLEGE OF ENGINEERING
(Approved by AICTE, recognized by Maharashtra Govt. DTE, affiliated to the University of Mumbai)
Acknowledgement
A project of this kind could not have materialized without the cooperation of many people. This
project would be incomplete if we did not convey our heartfelt gratitude to those people from whom
we have received considerable support and encouragement.
It is a matter of great pleasure for us to have the respected Prof. Shilpa Kolte as our project guide. We
are thankful to her for being a constant source of inspiration.
We would also like to give our sincere thanks to Prof. Vaishali Jadhav, H.O.D., Information
Technology Department, and Prof. Archana Sharma, Project Co-ordinator, for their kind support.
We would like to express our deepest gratitude to Dr. B. B. Shrivastava, Principal of Saraswati
College of Engineering, Kharghar, Navi Mumbai.
Last but not least, we would also like to thank the entire staff of the Information Technology
Department of Saraswati College of Engineering, whose valuable guidance, interest, and
suggestions brightened our work.
Name of Students
1) Nisha Jain (19)
2) Ankita Patil (37)
3) Yogita Patil (41)
4) Vidya Shelke (51)
Data Leakage Detection
ABSTRACT
A data distributor has given sensitive data to a set of supposedly trusted agents (third
parties). Some of the data are leaked and found in an unauthorized place (e.g., on the web or on
somebody’s laptop). The distributor must assess the likelihood that the leaked data came from
one or more agents, as opposed to having been independently gathered by other means. If data
distributed to third parties is found in a public or private domain, identifying the guilty party is
a nontrivial task for the distributor. Traditionally, such leakage is handled by watermarking
techniques, which require modification of the data. To overcome the disadvantages of
watermarking, data allocation strategies are used to improve the probability of identifying guilty
third parties. In this project, we implement and analyze a guilt model that detects guilty agents
using allocation strategies, without modifying the original data. A guilty agent is one who leaks a
portion of the distributed data. The idea is to distribute the data intelligently to agents, based on
sample data requests and explicit data requests, in order to improve the chance of detecting
guilty agents. Algorithms that use fake objects further improve the distributor’s chance of
detecting guilty agents; it is observed that minimizing the sum objective increases this chance.
We also developed a framework for generating fake objects.
INDEX
1. Introduction ..……………………………………………………………...
2. Literature survey .………………………………………………….……..
3. Data Leakage Detection
3.1 Problem statement .………………………………………………...
3.2 Scope ...…………………………………………………………….
3.3 Proposed system …………………………………………………...
3.4 Analysis ……………………………………………………………
3.5 Details of Hardware and Software .……………………………......
3.6 Design details .……………………………………………………..
3.7 Conclusion …………………………………………………………
4. Implementation Plan for Next semester ..……………………………….
5. References ..…………………………………………………………….....
1. INTRODUCTION
1.1 Need:
In today’s competitive world, many companies outsource certain business processes
(e.g., marketing, human resources) and associated activities to third parties. This
allows companies to focus on their core competency by subcontracting other activities to
specialists, resulting in reduced operational costs and increased productivity. In most
cases, the service providers need access to the company's intellectual property and other
confidential information to carry out their services. For example, an outsourcer doing
payroll for a bank must have the salaries and customer bank account numbers. The
main security problem is that the service provider may not be fully trusted or may not be
securely administered. Business agreements try to regulate how the data will be
handled by service providers, but it is almost impossible to truly enforce or verify such
policies across different administrative domains. Because of their digital nature, relational
databases are easy to duplicate, and in many cases a service provider may have financial
incentives to redistribute commercially valuable data, or may simply fail to handle it
properly. Hence, we need powerful techniques that can detect and deter such dishonest behavior.
1.2 Basic Concept:
A model for assessing the “guilt” of agents has been developed. An algorithm for
distributing objects to agents, in a way that improves our chances of identifying a leaker,
has been proposed. The option of adding “fake” objects to the distributed set has also been
considered. The allocation of fake objects depends on the sample data request or explicit
data request made by the agent. Such objects do not correspond to real entities but appear
realistic to the agents. In a sense, the fake objects act as a type of watermark for the
entire set, without modifying any individual members. If it turns out that an agent was given
one or more fake objects that were leaked, then the distributor can be more confident that the
agent was guilty. Unobtrusive techniques for detecting leakage of a set of objects or
records have been studied: after giving a set of objects to agents, the distributor discovers
some of those same objects in an unauthorized place. At this point, the distributor can
assess the likelihood that the leaked data came from one or more agents.
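The fake-object idea above can be sketched in a few lines. This is an illustrative Python fragment, not part of the proposed VB.Net/ASP.Net implementation; the agent names and object ids are hypothetical. Since a fake object cannot be obtained from any real-world source, its appearance in a leaked set strongly implicates whichever agent received it.

```python
def implicated_by_fakes(leaked, fake_given):
    """Return the agents whose fake objects appear in the leaked set.

    leaked     : set of leaked object ids
    fake_given : dict agent -> set of fake object ids given to that agent
    """
    # Intersect each agent's fake objects with the leaked set.
    return {agent for agent, fakes in fake_given.items() if fakes & leaked}

# Hypothetical allocation: only U2 received fake object "f1".
fake_given = {"U1": {"f0"}, "U2": {"f1"}, "U3": set()}
print(implicated_by_fakes({"t3", "f1", "t7"}, fake_given))  # {'U2'}
```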
2. LITERATURE SURVEY
2.1 Existing Systems:
2.1.1 Perturbation:
Perturbation is a very useful technique in which the data are modified and made less
sensitive before being handed to agents. For example, one can add random noise to
certain attributes, or replace exact values by ranges [5]. However, it is not
suitable for applications in which the original sensitive data cannot be altered.
For example, if an outsourcer is doing our payroll, he must have the exact salaries and
customer bank account numbers. If medical researchers will be treating patients, as
opposed to simply computing statistics, they may need accurate data about those patients.
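As a small illustration of the two perturbation styles mentioned above (random noise and value ranges), the following Python sketch uses assumed attribute names and an assumed bucket width; it is not tied to any particular dataset in this project.

```python
import random

def add_noise(salary, scale=500):
    """Perturb an exact salary with uniform random noise in [-scale, +scale]."""
    return salary + random.uniform(-scale, scale)

def to_range(age, width=10):
    """Replace an exact age by a coarse range such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

print(to_range(25))  # 20-29
```

Either transformation makes the released values less sensitive, at the cost of accuracy, which is exactly why perturbation is unusable when agents need exact data.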
2.1.2 Watermarking:
Watermarks were initially used in images [2], video [3], and audio data [4] whose digital
representation includes considerable redundancy. Traditionally, leakage detection is
handled by watermarking, e.g., a unique code is embedded in each distributed copy. If
that copy is later discovered in the hands of an unauthorized party, the leaker can be
identified. Watermarks can be very useful in some cases, but again, involve some
modification of the original data. Furthermore, watermarks can sometimes be destroyed if
the data recipient is malicious.
2.2 Disadvantages:
1. In some applications the original sensitive data cannot be perturbed.
2. In some cases it is important not to alter the original distributor’s data.
3. For example, we cannot perturb patient records when giving them to a research center
that is searching for a cure for a disease.
4. Watermarks can be very useful in some cases but involve some modification of the
original data.
5. Watermarks can sometimes be destroyed if the data recipient is malicious.
2.3 Related Work:
1. The guilt detection approach we present is related to the data provenance problem [1]:
tracing the lineage of the objects in S essentially amounts to detecting the guilty agents.
Tutorial [6] provides a good overview on the research conducted in this field. Suggested
solutions are domain specific, such as lineage tracing for data warehouses [7], and
assume some prior knowledge on the way a data view is created out of data sources.
2. Our problem formulation with objects and sets is more general and simplifies lineage
tracing, since we do not consider any data transformation from Ri sets to S. As far as the
data allocation strategies are concerned, our work is mostly relevant to watermarking that
is used as a means of establishing original ownership of distributed objects.
3. There are also lots of other works on mechanisms that allow only authorized users to
access sensitive data through access control policies [8], [9]. Such approaches prevent in
some sense data leakage by sharing information only with trusted parties. However, these
policies are restrictive and may make it impossible to satisfy agents’ requests.
4. The concept used in this project is very useful as it does not modify the original data, but
adds fake objects that appear realistic to the agents. Only the distributor knows about the
fake objects so there is no way that the agent can come to know which objects are fake.
5. The distributor may be able to add fake objects to the distributed data in order to improve
his effectiveness in detecting guilty agents. However, fake objects may impact the
correctness of what agents do, so they may not always be allowable. The idea of
perturbing data to detect leakage is not new, e.g., [10]. However, in most cases,
individual objects are perturbed, e.g., by adding random noise to sensitive salaries, or
adding a watermark to an image.
6. In our case, we perturb the set of distributor objects by adding fake elements. In
some applications, fake objects may cause fewer problems than perturbing real objects.
For example, say that the distributed data objects are customer records of a bank and
the agents are outsourcers who will do the marketing. In this case, even small
modifications to the records of actual customers may be undesirable. However, the
addition of some fake customer records may be acceptable, since no real customer matches
these records, and hence no one will ever be contacted based on a fake record.
3. DATA LEAKAGE DETECTION
3.1 Problem statement :
Suppose a distributor owns a set T = {t1, ..., tm} of valuable data objects. The distributor
wants to share some of the objects with a set of agents U1, U2, ..., Un, but does not wish the
objects to be leaked to other third parties. An agent Ui receives a subset of objects Ri ⊆ T,
determined either by a sample request or an explicit request. The objects in T
could be of any type and size; e.g., they could be tuples in a relation, or relations in a
database. After giving objects to agents, the distributor discovers that a subset S of T has
leaked. This means that some third party, called the target, has been caught in possession of
S. For example, this target may be displaying S on its web site, or perhaps, as part of a legal
discovery process, the target turned over S to the distributor. Since the agents U1, U2, ...,
Un have some of the data, it is reasonable to suspect them of leaking the data. However, the
agents can argue that they are innocent, and that the data in S was obtained by the target
through other means.
3.2 Scope :
Our goal is to detect when the distributor’s sensitive data have been leaked by agents, and
also to identify the agent that leaked the data. Data allocation strategies (across the agents)
are used that improve the probability of identifying leakages. These methods do not
rely on alterations of the released data (e.g., watermarks). In some cases, the distributor can
also inject “realistic but fake” data records to further improve the chances of detecting leakage
and identifying the guilty party. This system can be useful for various organizations or
companies that need to distribute sensitive data to third parties. For example, the
system can be implemented in banks to avoid revealing the identity of any employee or
customer. Similarly, a company may have partnerships with other companies that require
sharing customer data. Another enterprise may outsource its data processing, so data must
be given to various other companies.
3.3 Proposed System :
In the course of doing business, sensitive data must sometimes be handed over to
supposedly trusted third parties. A data distributor gives sensitive data to a set of
supposedly trusted agents. When some of the data is found in an unauthorized place, the
distributor may assess the likelihood that the leaked data came from one or more agents. We
call the owner of the data the distributor and the supposedly trusted third parties the
agents. The guilty agent is the one who leaks a portion of the distributed data, and the one who
receives the data from the guilty agent is an unauthorized person.
3.3.1 Guilty Agent:
Suppose that after giving objects to agents, the distributor discovers that a set S ⊆ T has
leaked. This means that some third party, called the target, has been caught in possession of
S. For example, this target may be displaying S on its website, or perhaps, as part of a legal
discovery process, the target turned over S to the distributor. Since the agents U1, ..., Un
have some of the data, it is reasonable to suspect them of leaking the data. However, the agents
can argue that they are innocent, and that the data in S was obtained by the target through
other means. An agent Ui is guilty if it contributes one or more objects to the target. We
denote the event that agent Ui is guilty by Gi, and the event that agent Ui is guilty for a given
leaked set S by Gi|S. Our next step is to estimate Pr{Gi|S}, i.e., the probability that agent Ui
is guilty given the evidence S.
3.3.2 Agent Guilt Model:
To compute Pr{Gi|S}, we need an estimate of the probability that values in S can be
“guessed” by the target. We assume that all objects in T have the same guessing probability pt,
which we call p; our equations can easily be generalized to different pt values. Next, we make
two assumptions regarding the relationships among the various leakage events.
Assumption 1: For all t, t′ ∈ S such that t ≠ t′, the provenance of t is independent of the
provenance of t′.
The term provenance in this assumption statement refers to the source of a value t that
appears in the leaked set. The source can be any of the agents who have t in their sets or the
target itself.
Assumption 2: An object t ∈ S can only be obtained by the target in one of two ways:
1. A single agent Ui leaked t from its own Ri set, or
2. The target guessed (or obtained through other means) t without the help of any of the n
agents.
To find the probability that an agent Ui is guilty given a set S, consider that the target guesses
t1 with probability p, and that an agent leaks t1 to S with probability 1 − p. First compute the
probability that an agent leaks a single object t to S. To compute this, define the set of agents
Vt = { Ui | t ∈ Ri } that have t in their data sets. Then, using Assumption 2 and the known
probability p, we have:
Pr{some agent leaked t to S} = 1 − p                                        (1.1)
Assuming that all agents that belong to Vt can leak t to S with equal probability, and using
Assumption 2, we obtain:
Pr{Ui leaked t to S} = (1 − p) / |Vt| if Ui ∈ Vt, and 0 otherwise           (1.2)
Given that agent Ui is guilty if he leaks at least one value to S, Assumption 1 and
Equation 1.2 give the probability Pr{Gi|S} that agent Ui is guilty:
Pr{Gi|S} = 1 − ∏ over t ∈ S ∩ Ri of ( 1 − (1 − p) / |Vt| )                  (1.3)
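The guilt estimate described above can be computed directly. The following Python sketch is illustrative only (the project itself targets VB.Net/ASP.Net); agent names, object ids, and the guessing probability are hypothetical. For each leaked object t, Vt is the set of agents holding t, an agent in Vt leaks t with probability (1 − p)/|Vt|, and Ui is guilty if it leaked at least one object.

```python
def guilt_probability(agent, S, R, p):
    """Pr{Gi|S}: probability that `agent` is guilty given leaked set S.

    S : set of leaked object ids
    R : dict agent -> set of objects allocated to that agent
    p : probability the target guessed an object on its own
    """
    prob_innocent = 1.0
    for t in S:
        holders = [u for u, objs in R.items() if t in objs]  # the set Vt
        if agent in holders:
            # Multiply the chances the agent leaked none of its objects.
            prob_innocent *= 1.0 - (1.0 - p) / len(holders)
    return 1.0 - prob_innocent

R = {"U1": {"t1", "t2"}, "U2": {"t1", "t3"}}
# t1 is held by both agents, t2 only by U1; guessing probability p = 0.5.
print(round(guilt_probability("U1", {"t1", "t2"}, R, 0.5), 3))  # 0.625
```

Note how t2, held only by U1, contributes a large factor (1 − 0.5)/1 and therefore dominates the suspicion, matching the intuition that rarely shared objects are the most incriminating.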
3.3.3 Allocation Strategies:
Explicit Data Request
In the case of an explicit data request where fake objects are not allowed, the distributor
cannot add fake objects to the distributed data, so the data allocation is fully defined by the
agents’ data requests.
In the case of an explicit data request where fake objects are allowed, the distributor cannot
remove or alter the requests R from the agents; however, the distributor can add fake objects.
The input to the data allocation algorithm for explicit requests is a set of requests
R1, R2, ..., Rn from n agents, together with the conditions of the requests. The e-optimal
algorithm finds the agents that are eligible to receive fake objects, then creates one fake
object per iteration and allocates it to the selected agent. The e-optimal algorithm minimizes
every term of the objective summation by adding the maximum number bi of fake objects to
every set Ri, yielding the optimal solution.
Algorithm:
Step 1: Calculate the total number of fake records as the sum of the fake records allowed per agent.
Step 2: While total fake objects > 0:
Step 3: Select the agent that will yield the greatest improvement in the sum objective.
Step 4: Create a fake record.
Step 5: Add this fake record to the agent’s set and also to the fake record set.
Step 6: Decrement the count of remaining fake records.
The algorithm makes a greedy choice by selecting the agent that will yield the greatest
improvement in the sum objective.
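The greedy choice above can be sketched as follows. This is a hedged illustration, assuming the sum objective is Σi (Σj≠i |Ri ∩ Rj|) / |Ri| and ignoring the per-agent cap bi for brevity; agent names and data are hypothetical.

```python
import itertools

def allocate_fakes(R, budget):
    """Greedily hand out `budget` fake objects, one per iteration."""
    fake_ids = (f"fake{k}" for k in itertools.count())
    given = {i: set() for i in R}
    for _ in range(budget):
        def improvement(i):
            # The i-th term of the sum objective is overlap/|Ri|; a fake
            # object (which overlaps no other agent's set) shrinks it to
            # overlap/(|Ri| + 1).
            overlap = sum(len(R[i] & R[j]) for j in R if j != i)
            return overlap / len(R[i]) - overlap / (len(R[i]) + 1)
        best = max(R, key=improvement)
        f = next(fake_ids)
        R[best].add(f)       # fakes never appear in other agents' sets
        given[best].add(f)
    return given

R = {"U1": {"t1", "t2"}, "U2": {"t1"}}
print(allocate_fakes(R, 1))  # {'U1': set(), 'U2': {'fake0'}}
```

Here U2 gets the fake object: its single-object set overlaps U1 completely, so diluting it reduces the objective more than diluting U1's larger set would.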
Sample Data Request
With sample data requests, each agent Ui may receive any subset of T of size mi, out of
(|T| choose mi) different ones. Hence, there are ∏i (|T| choose mi) different object
allocations. In every allocation, the distributor can permute the objects of T and keep the
same chances of guilty agent detection. The reason is that the guilt probability depends only
on which agents have received the leaked objects, and not on the identities of the leaked
objects. Therefore, from the distributor’s perspective, there are ∏i (|T| choose mi) / |T|!
different allocations.
An object allocation that satisfies the requests while ignoring the distributor’s objective is to
give each agent a unique subset of T of the requested size. The s-max algorithm instead
allocates to an agent the data record that yields the minimum increase of the maximum
relative overlap among any pair of agents. The s-max algorithm is as follows.
Algorithm:
Step 1: Initialize min_overlap ← 1, the minimum out of the maximum relative overlaps that
the allocations of different objects to Ui yield.
Step 2: For each k ∈ { k | tk ∈ Ri } do:
Initialize max_rel_ov ← 0, the maximum relative overlap between Ri and any set Rj that the
allocation of tk to Ui yields.
Step 3: For all j = 1, ..., n such that j ≠ i and tk ∈ Rj do:
Calculate the absolute overlap as abs_ov ← |Ri ∩ Rj| + 1.
Calculate the relative overlap as rel_ov ← abs_ov / min(mi, mj).
Step 4: Find the maximum relative overlap as max_rel_ov ← MAX(max_rel_ov, rel_ov).
If max_rel_ov <= min_overlap then:
min_overlap ← max_rel_ov
ret_k ← k
Return ret_k
It can be shown that algorithm s-max is optimal for the sum-objective and the max-objective
in problems where M <= |T| and n < |T|. It is also optimal for the max-objective if |T| <= M
<= 2 |T| or all agents request data of the same size.
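The selection step of s-max can be sketched in Python as follows. This is an illustrative fragment with hypothetical data; `m` holds each agent's requested sample size mi, and the function returns the candidate object whose allocation to Ui minimizes the maximum relative overlap with any other agent's set.

```python
def s_max_pick(i, candidates, R, m):
    """Return the object in `candidates` minimizing the max relative overlap.

    R : current allocation, dict agent -> set of objects
    m : dict agent -> requested sample size mi
    """
    best, min_overlap = None, 1.0
    for t in candidates:
        max_rel_ov = 0.0
        for j in R:
            if j != i and t in R[j]:
                abs_ov = len(R[i] & R[j]) + 1        # overlap if t is added
                rel_ov = abs_ov / min(m[i], m[j])
                max_rel_ov = max(max_rel_ov, rel_ov)
        if max_rel_ov <= min_overlap:                # keep the best object
            min_overlap, best = max_rel_ov, t
    return best

R = {"U1": {"t1"}, "U2": set()}
m = {"U1": 2, "U2": 1}
print(s_max_pick("U2", ["t1", "t2"], R, m))  # t2
```

U2 is offered t2 rather than t1, since t1 is already held by U1 and would create overlap, which is exactly the behavior the algorithm steps above describe.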
3.4 Analysis :
After analyzing the requirements of the task to be performed, the next step is to analyze
the problem and understand its context. The first activity in this phase is studying the
existing system; the other is understanding the requirements and domain of the new system.
Both activities are equally important, but the first serves as the basis for producing the
functional specifications and, subsequently, a successful design of the proposed system.
Understanding the properties and requirements of a new system is difficult and requires
creative thinking; understanding the existing, running system is also difficult, and an
improper understanding of the present system can lead to diversion from the solution.
The main focus of our project is the data allocation problem: how can the distributor
“intelligently” give data to agents in order to improve the chances of detecting a guilty
agent? As illustrated in the accompanying figure, there are four instances of this problem we
address, depending on the type of data request made by the agents and on whether “fake
objects” are allowed.
The two types of requests we handle are sample and explicit. A sample request is a query in
which an agent asks only for a fixed number of records, say 100. An explicit request is a
query in which an agent asks for data satisfying given conditions, say age = 25. We represent
our four problem instances by the names EF, EF̄, SF, and SF̄, where E stands for explicit
requests, S for sample requests, F for the use of fake objects, and F̄ for the case where fake
objects are not allowed.
3.5 Details of hardware and software :
Hardware requirements
Processor : Pentium Dual-Core Processor
Ram : 1GB of RAM
Hard Disk : 160GB
Software requirements
Operating System : Windows Family
Front End : VB.Net
Back End : Ms-Access
Language : ASP.Net using C#
3.6 Design Details :
SYSTEM ARCHITECTURE
(Figure omitted: system architecture diagram.)
DATA FLOW DIAGRAM
LEVEL 0:
(Figure omitted. The level-0 data flow diagram shows: agents send explicit or sample
requests to the distributor; at runtime the distributor creates fake objects and allocates
them, along with the sensitive data, to agents 1 through 5; data leaks out to an unauthorized
person; the distributor finds his data leaked in an unauthorized place and compares the
leaked file with his database to find the guilty agent.)
UML DIAGRAMS
USE CASE DIAGRAM:
(Figure omitted. Actors: Distributor, Agent, Unauthorized person. Use cases: the agent makes
an explicit or sample request; the distributor adds fake objects and distributes data to
agents; data is transferred to an unauthorized person; the distributor finds the guilt
probability and identifies the guilty agent.)
3.7 Conclusion:
When leaked data is found, the distributor can assess the likelihood that the leaked data came
from one or more agents, as opposed to having been independently gathered by other means.
If the distributor sees “enough evidence” that an agent leaked data, he may stop doing
business with him, or may initiate legal proceedings.
We also present allocation strategies for distributing objects to agents, in a way that improves
our chances of identifying a leaker.
4. Implementation Plan for Next Semester
4.1 Gantt chart:
(Gantt chart omitted. The plan spans the periods July 2012, Aug 2012, Sept-Oct 2012,
Nov 2012-Jan 2013, and Feb-March 2013.)
4.2 Project plan:
Task A- Requirement Gathering and Analysis
Task B- Planning
Task C- Designing
Task D- Coding
Task E- Implementation
Task F- Testing
5. References:
[1] P. Buneman, S. Khanna, and W.C. Tan, “Why and Where: A Characterization of Data
Provenance,” Proc. Eighth Int’l Conf. Database Theory (ICDT ’01), J.V. den Bussche
and V. Vianu, eds., pp. 316-330, Jan. 2001.
[2] J.J.K.O. Ruanaidh, W.J. Dowling, and F.M. Boland, “Watermarking Digital Images
for Copyright Protection,” IEE Proc. Vision, Signal and Image Processing, vol. 143, no.
4, pp. 250-256, 1996.
[3] F. Hartung and B. Girod, “Watermarking of Uncompressed and Compressed Video,”
Signal Processing, vol. 66, no. 3, pp. 283-301, 1998.
[4] S. Czerwinski, R. Fromm, and T. Hodes, “Digital Music Distribution and Audio
Watermarking,” http://www.scientificcommons.org/43025658, 2007.
[5] L. Sweeney, “Achieving K-Anonymity Privacy Protection Using Generalization and
Suppression,” http://en.scientificcommons.org/43196131, 2002.
[6] P. Buneman and W.-C. Tan, “Provenance in Databases,” Proc. ACM SIGMOD, pp.
1171-1173, 2007.
[7] Y. Cui and J. Widom, “Lineage Tracing for General Data Warehouse
Transformations,” The VLDB J., vol. 12, pp. 41-58, 2003.
[8] S. Jajodia, P. Samarati, M.L. Sapino, and V.S. Subrahmanian, “Flexible Support for
Multiple Access Control Policies,” ACM Trans. Database Systems, vol. 26, no. 2, pp.
214-260, 2001.
[9] P. Bonatti, S.D.C. di Vimercati, and P. Samarati, “An Algebra for Composing Access
Control Policies,” ACM Trans. Information and System Security, vol. 5, no. 1, pp. 1-35,
2002.
[10] R. Agrawal and J. Kiernan, “Watermarking Relational Databases,” Proc. 28th Int’l
Conf. Very Large Data Bases (VLDB ’02), VLDB Endowment, pp. 155-166, 2002.