A PROJECT SYNOPSIS
On
Data Leakage Detection
Submitted By
1) Nisha Jain (19)
2) Ankita Patil (37)
3) Yogita Patil (41)
4) Vidya Shelke (51)
Under the Guidance of
Prof. Shilpa Kolte
Department of Information Technology
Saraswati Education Society’s
SARASWATI COLLEGE OF ENGINEERING
Kharghar, Navi Mumbai
(Affiliated to University of Mumbai)
Academic Year: 2012-13
_____________________________________________________________________________________
PLOT NO. 46/46A, SECTOR NO 5, BEHIND MSEB SUBSTATION, KHARGHAR, NAVI MUMBAI - 410210
Tel.: 022-27743706 to 11 * Fax : 022-27743712 * Website: www.scoe.edu.in
CERTIFICATE
This is to certify that the requirements for the synopsis entitled “Data Leakage Detection”
have been successfully completed by the following students:
Roll numbers Name
1) 19 Nisha Jain
2) 37 Ankita Patil
3) 41 Yogita Patil
4) 51 Vidya Shelke
in partial fulfillment of Sem VII, Bachelor of Engineering in Information Technology of
Mumbai University, at Saraswati College of Engineering, Kharghar, during the academic
year 2012-13.
Internal Guide External Examiner
Prof. Shilpa Kolte
Project co-ordinator Head of Department
Prof. Archana Sharma Prof. Vaishali Jadhav
Principal
Dr. B. B. Shrivastava
SARASWATI EDUCATION SOCIETY’S
SARASWATI COLLEGE OF ENGINEERING
(Approved by AICTE, recognized by Maharashtra Govt. DTE, affiliated to Mumbai University)
Acknowledgement
A project is something that could not have materialized without the cooperation of many people. This
project would be incomplete if we did not convey our heartfelt gratitude to the people from whom we
received considerable support and encouragement.
It is a matter of great pleasure for us to have respected Prof. Shilpa Kolte as our project guide. We are
thankful to her for being a constant source of inspiration.
We would also like to give our sincere thanks to Prof. Vaishali Jadhav, H.O.D., Information
Technology Department, and Prof. Archana Sharma, Project Co-ordinator, for their kind support.
We would like to express our deepest gratitude to Dr. B. B. Shrivastava, Principal of Saraswati
College of Engineering, Kharghar, Navi Mumbai.
Last but not the least, we would also like to thank all the staff of Saraswati College of Engineering
(Information Technology Department), whose interest and valuable suggestions guided and
encouraged us.
Name of Students
1) Nisha Jain (19)
2) Ankita Patil (37)
3) Yogita Patil (41)
4) Vidya Shelke (51)
Data Leakage Detection
ABSTRACT
A data distributor has given sensitive data to a set of supposedly trusted agents (third
parties). Some of the data are leaked and found in an unauthorized place (e.g., on the web or
on somebody’s laptop). The distributor must assess the likelihood that the leaked data came
from one or more agents, as opposed to having been independently gathered by other means.
If data distributed to third parties is found in a public or private domain, identifying the guilty
party is a nontrivial task for the distributor. Traditionally, such leakage is handled by
watermarking techniques, which require modification of the data. To overcome the
disadvantages of watermarking, data allocation strategies are used to improve the probability
of identifying guilty third parties. In this project, we implement and analyze a guilt model
that detects leaking agents using allocation strategies, without modifying the original data.
A guilty agent is one who leaks a portion of the distributed data. The idea is to distribute the
data intelligently to agents, based on sample data requests and explicit data requests, in order
to improve the chance of detecting guilty agents. Algorithms that use fake objects improve
the distributor’s chance of detecting guilty agents; it is observed that minimizing the sum
objective increases this chance. We also develop a framework for generating fake objects.
INDEX
1. Introduction
2. Literature survey
3. Data Leakage Detection
3.1 Problem statement
3.2 Scope
3.3 Proposed system
3.4 Analysis
3.5 Details of Hardware and Software
3.6 Design details
3.7 Conclusion
4. Implementation Plan for Next Semester
5. References
1. INTRODUCTION
1.1 Need:
In today’s competitive world, many companies outsource certain business processes
(e.g., marketing, human resources) and associated activities to third parties. This
allows companies to focus on their core competencies by subcontracting other activities to
specialists, resulting in reduced operational costs and increased productivity. In most
cases, the service providers need access to a company’s intellectual property and other
confidential information to carry out their services. For example, an outsourcer doing
payroll for a bank must have the salaries and customer bank account numbers. The
main security problem is that the service provider may not be fully trusted or may not be
securely administered. Business agreements try to regulate how the data will be
handled by service providers, but it is almost impossible to truly enforce or verify such
policies across different administrative domains. Because of their digital nature, relational
databases are easy to duplicate, and in many cases a service provider may have financial
incentives to redistribute commercially valuable data, or may simply fail to handle it
properly. Hence, we need powerful techniques that can detect and deter such dishonest behavior.
1.2 Basic Concept:
A model for assessing the “guilt” of agents has been developed, along with an algorithm
for distributing objects to agents in a way that improves our chances of identifying a leaker.
The option of adding “fake” objects to the distributed set has also been considered. The
allocation of fake objects depends on the sample data request or explicit data request made
by the agent. Such objects do not correspond to real entities but appear realistic to the
agents. In a sense, the fake objects act as a type of watermark for the entire set, without
modifying any individual members. If it turns out that an agent was given one or more fake
objects that were leaked, the distributor can be more confident that the agent was guilty.
Unobtrusive techniques for detecting leakage of a set of objects or records have been
studied: after giving a set of objects to agents, the distributor may discover some of those
same objects in an unauthorized place, and at this point can assess the likelihood that the
leaked data came from one or more agents.
2. LITERATURE SURVEY
2.1 Existing Systems:
2.1.1 Perturbation:
Perturbation is a very useful technique in which the data are modified and made less
sensitive before being handed to agents. For example, one can add random noise to
certain attributes, or replace exact values by ranges [5]. However, it is not suitable for
applications in which the original sensitive data cannot be altered. For example, if an
outsourcer is doing our payroll, he must have the exact salaries and customer bank account
numbers. If medical researchers will be treating patients, as opposed to simply computing
statistics, they may need accurate data about the patients.
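To make the two perturbation styles above concrete, here is a minimal C# sketch with hypothetical salary values and bucket sizes; it is illustrative only and not part of the proposed system, which deliberately avoids altering data.

```csharp
using System;

class Perturbation
{
    static readonly Random Rng = new Random();

    // Add uniform random noise in [-spread, +spread] to a sensitive numeric value.
    static double AddNoise(double value, double spread) =>
        value + (Rng.NextDouble() * 2 - 1) * spread;

    // Replace an exact value by the range (bucket) it falls into.
    static string ToRange(double value, double bucketSize)
    {
        double lo = Math.Floor(value / bucketSize) * bucketSize;
        return $"{lo}-{lo + bucketSize}";
    }

    static void Main()
    {
        Console.WriteLine(AddNoise(52000, 1000));   // e.g., 51217.4 (differs each run)
        Console.WriteLine(ToRange(52000, 10000));   // prints "50000-60000"
    }
}
```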
2.1.2 Watermarking:
Watermarks were initially used in images [2], video [3], and audio data [4] whose digital
representation includes considerable redundancy. Traditionally, leakage detection is
handled by watermarking, e.g., a unique code is embedded in each distributed copy. If
that copy is later discovered in the hands of an unauthorized party, the leaker can be
identified. Watermarks can be very useful in some cases, but again, involve some
modification of the original data. Furthermore, watermarks can sometimes be destroyed if
the data recipient is malicious.
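For illustration, a hedged sketch of the per-copy coding idea: derive a stable, agent-specific tag that can be embedded in each distributed copy (as a hidden column, low-order noise, etc., left abstract here). The agent identifier and secret key are hypothetical, and this is again a technique the proposed system avoids because it modifies the data.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

class Watermark
{
    // Derive a short, stable per-agent code; the same agent always yields the
    // same tag, so a leaked copy carrying the tag identifies its recipient.
    static string AgentCode(string agentId, string secretKey)
    {
        using var hmac = new HMACSHA256(Encoding.UTF8.GetBytes(secretKey));
        byte[] code = hmac.ComputeHash(Encoding.UTF8.GetBytes(agentId));
        return Convert.ToHexString(code, 0, 4);   // 8 hex characters to embed
    }

    static void Main()
    {
        Console.WriteLine(AgentCode("agent-37", "distributor-secret"));
    }
}
```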
2.2 Disadvantages:
1. In some applications, the original sensitive data cannot be perturbed.
2. In some cases, it is important not to alter the original distributor’s data.
3. For example, we cannot perturb patient records when giving them to a research center
searching for a cure for a disease.
4. Watermarks can be very useful in some cases but involve some modification of the
original data.
5. Watermarks can sometimes be destroyed if the data recipient is malicious.
2.3 Related Work:
1. The guilt detection approach we present is related to the data provenance problem [1]:
tracing the lineage of S objects implies essentially the detection of the guilty agents.
Tutorial [6] provides a good overview on the research conducted in this field. Suggested
solutions are domain specific, such as lineage tracing for data warehouses [7], and
assume some prior knowledge on the way a data view is created out of data sources.
2. Our problem formulation with objects and sets is more general and simplifies lineage
tracing, since we do not consider any data transformation from Ri sets to S. As far as the
data allocation strategies are concerned, our work is mostly relevant to watermarking that
is used as a means of establishing original ownership of distributed objects.
3. There is also a good deal of work on mechanisms that allow only authorized users to
access sensitive data through access control policies [8], [9]. Such approaches prevent
data leakage, in some sense, by sharing information only with trusted parties. However,
these policies are restrictive and may make it impossible to satisfy agents’ requests.
4. The concept used in this project is very useful as it does not modify the original data, but
adds fake objects that appear realistic to the agents. Only the distributor knows about the
fake objects so there is no way that the agent can come to know which objects are fake.
5. The distributor may be able to add fake objects to the distributed data in order to improve
his effectiveness in detecting guilty agents. However, fake objects may impact the
correctness of what agents do, so they may not always be allowable. The idea of
perturbing data to detect leakage is not new, e.g., [10]. However, in most cases,
individual objects are perturbed, e.g., by adding random noise to sensitive salaries, or
by adding a watermark to an image.
6. In our case, we perturb the set of distributed objects by adding fake elements. In
some applications, fake objects may cause fewer problems than perturbing real objects.
For example, say that the distributed data objects are customer records of a bank and
the agents are outsourcers who will do the marketing. In this case, even small
modifications to the records of actual customers may be undesirable. However, the
addition of some fake customer records may be acceptable, since no customer matches
these records, and hence no one will ever be contacted based on a fake record.
3. DATA LEAKAGE DETECTION
3.1 Problem statement :
Suppose a distributor owns a set T = {t1, ..., tm} of valuable data objects. The distributor
wants to share some of the objects with a set of agents U1, U2, ..., Un, but does not wish the
objects to be leaked to other third parties. An agent Ui receives a subset of objects Ri ⊆ T,
determined either by a sample request or an explicit request. The objects in T could be of
any type and size; e.g., they could be tuples in a relation, or relations in a database. After
giving objects to agents, the distributor discovers that a set S ⊆ T has leaked. This means
that some third party, called the target, has been caught in possession of S. For example,
this target may be displaying S on its web site, or perhaps as part of a legal discovery
process, the target turned over S to the distributor. Since the agents U1, U2, ..., Un have
some of the data, it is reasonable to suspect them of leaking the data. However, the agents
can argue that they are innocent, and that the S data was obtained by the target through
other means.
3.2 Scope :
Our goal is to detect when the distributor’s sensitive data have been leaked by agents, and
also to identify the agent that leaked the data. Data allocation strategies (across the agents)
have been used that improve the probability of identifying leakages. These methods do not
rely on alterations of the released data (e.g., watermarks). In some cases, the distributor can
also inject “realistic but fake” data records to further improve the chances of detecting leakage
and identifying the guilty party. This system can be useful for various organizations or
companies that need to distribute sensitive data to third parties. For example, it can be
implemented in banks to avoid revealing the identity of any employee or customer.
Similarly, a company may have partnerships with other companies that require
sharing customer data. Another enterprise may outsource its data processing, so data must
be given to various other companies.
3.3 Proposed System :
In the course of doing business, sometimes sensitive data must be handed over to
supposedly trusted third parties. A data distributor will give sensitive data to a set of
supposedly trusted agents. When some of the data is found in an unauthorized place, the
distributor may assess the likelihood that the leaked data came from one or more agents. We
call the owner of the data the distributor and the supposedly trusted third parties the
agents. The guilty agent is one who leaks a portion of the distributed data, and the one who
receives the data from the guilty agent is an unauthorized person.
3.3.1 Guilty Agent:
Suppose that after giving objects to agents, the distributor discovers that a set S ⊆ T has
leaked. This means that some third party, called the target, has been caught in possession of
S. For example, this target may be displaying S on its website, or perhaps as part of a legal
discovery process, the target turned over S to the distributor. Since the agents U1, ..., Un
have some of the data, it is reasonable to suspect them of leaking the data. However, the
agents can argue that they are innocent, and that the S data were obtained by the target
through other means. An agent Ui is guilty if it contributes one or more objects to the
target. We denote the event that agent Ui is guilty by Gi, and the event that agent Ui is
guilty for a given leaked set S by Gi|S. Our next step is to estimate Pr{Gi|S}, i.e., the
probability that agent Ui is guilty given evidence S.
3.3.2 Agent Guilt Model:
To compute Pr{Gi|S}, we need an estimate for the probability that the values in S can be
“guessed” by the target. We assume that all objects in T have the same guessing probability
pt, which we call p; our equations can easily be generalized to diverse pt values. Next, we
make two assumptions regarding the relationships among the various leakage events.
Assumption 1: For all t, t′ ∈ S such that t ≠ t′, the provenance of t is independent of the
provenance of t′.
The term provenance in this assumption statement refers to the source of a value t that
appears in the leaked set. The source can be any of the agents who have t in their sets or the
target itself.
Assumption 2: An object t ∈ S can only be obtained by the target in one of two ways:
• A single agent Ui leaked t from its own Ri set, or
• The target guessed (or obtained through other means) t without the help of any of the n
agents.
To find the probability that an agent Ui is guilty given a set S, consider that the target
guesses an object t with probability p and that an agent leaks t to S with probability 1 − p.
First, compute the probability that some agent leaks a single object t to S. To do so, define
the set of agents Vt = { Ui | t ∈ Ri } that have t in their data sets. Then, using Assumption 2
and the known probability p, we have

Pr{some agent leaked t to S} = 1 − p.   (1.1)

Assuming that all agents belonging to Vt can leak t to S with equal probability, and using
Assumption 2, we obtain

Pr{Ui leaked t to S} = (1 − p) / |Vt| if Ui ∈ Vt, and 0 otherwise.   (1.2)

Given that agent Ui is guilty if he leaks at least one value to S, Assumption 1 and
Equation 1.2 yield the probability Pr{Gi|S} that agent Ui is guilty:

Pr{Gi|S} = 1 − ∏_{t ∈ S ∩ Ri} (1 − (1 − p) / |Vt|).   (1.3)
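To make Equation 1.3 concrete, the following minimal C# sketch (not the project’s actual code) computes Pr{Gi|S} for one agent; the object names, set contents, and the value of p are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class GuiltModel
{
    // Pr{Gi|S} = 1 - product over t in (S ∩ Ri) of (1 - (1 - p)/|Vt|)  (Equation 1.3)
    static double GuiltProbability(
        HashSet<string> ri,              // Ri: objects given to agent Ui
        HashSet<string> s,               // S: the leaked set
        Dictionary<string, int> vtSize,  // |Vt| for each leaked object t
        double p)                        // probability the target guessed an object
    {
        double innocent = 1.0;
        foreach (var t in s.Where(ri.Contains))   // only objects Ui actually holds
            innocent *= 1.0 - (1.0 - p) / vtSize[t];
        return 1.0 - innocent;
    }

    static void Main()
    {
        // Hypothetical example: S = {t1, t2, t3}; U1 holds t1 and t2;
        // t1 is held by 2 agents, t2 by 1, and t3 by 3.
        var s = new HashSet<string> { "t1", "t2", "t3" };
        var r1 = new HashSet<string> { "t1", "t2" };
        var vt = new Dictionary<string, int> { ["t1"] = 2, ["t2"] = 1, ["t3"] = 3 };
        Console.WriteLine(GuiltProbability(r1, s, vt, p: 0.5));  // prints 0.625
    }
}
```

For this example, Pr{G1|S} = 1 − (1 − 0.5/2)(1 − 0.5/1) = 0.625; the more leaked objects that only Ui could have supplied, the closer the probability gets to 1.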
3.3.3 Allocation Strategies:
Explicit Data Request
In the case of an explicit data request with fake objects not allowed, the distributor is not
allowed to add fake objects to the distributed data, so the data allocation is fully defined by
the agents’ data requests.
In the case of an explicit data request with fake objects allowed, the distributor cannot
remove or alter the requests R from the agents; however, the distributor can add fake
objects. In the algorithm for data allocation for explicit requests, the input is a set of
requests R1, R2, ..., Rn from n agents together with the conditions attached to each request.
The e-optimal algorithm finds the agents that are eligible to receive fake objects, then
creates one fake object per iteration and allocates it to the selected agent. The e-optimal
algorithm minimizes every term of the objective summation by adding the maximum
number bi of fake objects to every set Ri, yielding an optimal solution.
Algorithm:
Step 1: Calculate the total number of fake records as the sum of the fake records allowed per agent.
Step 2: While the total number of fake records > 0:
Step 3: Select the agent that will yield the greatest improvement in the sum objective.
Step 4: Create a fake record.
Step 5: Add this fake record to the agent’s set and also to the fake record set.
Step 6: Decrement the total fake record count.
The algorithm makes a greedy choice by selecting the agent that will yield the greatest
improvement in the sum-objective, as sketched below.
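A hedged sketch of this greedy loop, assuming the sum objective Σi (1/|Ri|) Σj≠i |Ri ∩ Rj|, under which giving one more object to agent Ui improves the objective by (1/|Ri| − 1/(|Ri|+1)) Σj≠i |Ri ∩ Rj|; the per-agent budgets b[i] and the fake-record generator are placeholders.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class EOptimal
{
    // r[i]: objects currently allocated to agent Ui (assumed non-empty);
    // b[i]: fake objects the distributor may still add for Ui.
    static void Allocate(List<HashSet<string>> r, int[] b)
    {
        int totalFakes = b.Sum();                       // Step 1: total fake records allowed
        int created = 0;
        while (totalFakes > 0)                          // Step 2
        {
            int best = -1;
            double bestGain = double.NegativeInfinity;
            for (int i = 0; i < r.Count; i++)           // Step 3: greatest improvement
            {
                if (b[i] == 0) continue;                // agent not eligible for more fakes
                double overlap = 0;
                for (int j = 0; j < r.Count; j++)
                    if (j != i) overlap += r[i].Intersect(r[j]).Count();
                double gain = (1.0 / r[i].Count - 1.0 / (r[i].Count + 1)) * overlap;
                if (gain > bestGain) { bestGain = gain; best = i; }
            }
            r[best].Add("FAKE_" + (++created));         // Steps 4-5: create and add a fake record
            b[best]--;                                  // Step 6: decrement the counts
            totalFakes--;
        }
    }
}
```

Because each fake record is unique, it never enlarges any intersection |Ri ∩ Rj|; it only grows |Ri|, which is exactly what shrinks agent Ui’s term of the sum objective.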
Sample Data Request
With sample data requests, each agent Ui may receive any one of C(|T|, mi) different
subsets of T of size mi. Hence, there are ∏i C(|T|, mi) different object allocations; for
example, with |T| = 10 and two agents each requesting mi = 5 objects, there are
C(10, 5)² = 63,504 of them. In every allocation, the distributor can permute the T objects
and keep the same chances of guilty agent detection. The reason is that the guilt probability
depends only on which agents have received the leaked objects and not on the identity of
the leaked objects. Therefore, from the distributor’s perspective, there are
∏i C(|T|, mi) / |T|! different allocations.
An object allocation that satisfies all requests while ignoring the distributor’s objective is to
give each agent a unique subset of T of size mi. The s-max algorithm instead allocates to an
agent the data record that yields the minimum increase of the maximum relative overlap
among any pair of agents. The s-max algorithm is as follows.
Algorithm:
Step 1: Initialize min_overlap ← 1, the minimum out of the maximum relative overlaps that
the allocations of different objects to Ui yield.
Step 2: For each k ∈ {k′ | tk′ ∉ Ri} do:
Initialize max_rel_ov ← 0, the maximum relative overlap between Ri and any set Rj that the
allocation of tk to Ui yields.
Step 3: For all j = 1, ..., n : j ≠ i and tk ∈ Rj do:
Calculate the absolute overlap as abs_ov ← |Ri ∩ Rj| + 1
Calculate the relative overlap as rel_ov ← abs_ov / min(mi, mj)
Step 4: Update the maximum relative overlap as max_rel_ov ← max(max_rel_ov, rel_ov)
If max_rel_ov <= min_overlap then
min_overlap ← max_rel_ov
ret_k ← k
Return ret_k
It can be shown that algorithm s-max is optimal for the sum-objective and the max-objective
in problems where M <= |T| and n < |T|. It is also optimal for the max-objective if |T| <= M
<= 2 |T| or all agents request data of the same size.
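A minimal C# sketch of this object-selection step, under two stated assumptions: the candidate objects are those not yet in Ri (matching Step 2 above), and min_overlap starts at +∞ rather than 1 so that a candidate is always returned; the indices and set representations are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class SMax
{
    // Select the object whose allocation to agent Ui yields the minimum
    // increase of the maximum relative overlap with any other agent's set.
    // r[j]: current allocation of agent Uj; m[j]: Uj's requested sample size;
    // objects are identified by indices 0 .. objectCount-1.
    static int SelectObject(int i, List<HashSet<int>> r, int[] m, int objectCount)
    {
        double minOverlap = double.PositiveInfinity;    // Step 1 (pseudocode starts at 1)
        int retK = -1;
        for (int k = 0; k < objectCount; k++)           // Step 2: candidates tk not in Ri
        {
            if (r[i].Contains(k)) continue;
            double maxRelOv = 0.0;
            for (int j = 0; j < r.Count; j++)           // Step 3: agents already holding tk
            {
                if (j == i || !r[j].Contains(k)) continue;
                int absOv = r[i].Intersect(r[j]).Count() + 1;       // overlap after adding tk
                double relOv = (double)absOv / Math.Min(m[i], m[j]);
                maxRelOv = Math.Max(maxRelOv, relOv);   // Step 4
            }
            if (maxRelOv <= minOverlap) { minOverlap = maxRelOv; retK = k; }
        }
        return retK;    // -1 only if every object is already in Ri
    }
}
```

A caller would invoke SelectObject repeatedly for each agent Ui until |Ri| reaches the requested sample size mi.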
3.4 Analysis :
After analyzing the requirements of the task to be performed, the next step is to analyze
the problem and understand its context. The first activity in this phase is studying the
existing system; the other is understanding the requirements and domain of the new system.
Both activities are equally important, but the first serves as the basis for giving the
functional specifications and then successfully designing the proposed system.
Understanding the properties and requirements of a new system is difficult and requires
creative thinking; understanding the existing running system is also difficult, and an
improper understanding of the present system can lead to diversion from the solution.
The main focus of our project is the data allocation problem: how can the distributor
“intelligently” give data to agents in order to improve the chances of detecting a guilty
agent? There are four instances of this problem, depending on the type of data requests
made by agents and on whether “fake objects” are allowed.
The two types of requests we handle are sample and explicit. A sample request is a query in
which an agent asks only for a fixed number of records, say 100. An explicit request is a
query in which an agent asks for data satisfying given conditions, say age = 25. We
represent the four problem instances with the names EF̄, EF, SF̄, and SF, where E stands
for explicit requests, S for sample requests, F for the use of fake objects, and F̄ for the case
where fake objects are not allowed.
3.5 Details of hardware and software :
Hardware requirements
• Processor: Pentium Dual-Core
• RAM: 1 GB
• Hard disk: 160 GB
Software requirements
• Operating system: Windows family
• Front end: VB.Net
• Back end: MS Access
• Language: ASP.Net using C#
3.6 Design Details :
SYSTEM ARCHITECTURE
DATA FLOW DIAGRAM
LEVEL 0:
[Level 0 DFD: an agent sends an explicit or sample request to the distributor; the distributor
sends sensitive data to Agents 1-5, creating fake objects at runtime and allocating them to
the agents; when data leaks out and the distributor finds the leaked file in an unauthorized
place held by an unauthorized person, he compares it with his database to find the guilty
agent.]
LEVEL 1:
LEVEL 2:
UML DIAGRAMS
USE CASE DIAGRAM:
[Use case diagram: the Distributor receives explicit or sample requests, adds fake objects,
distributes data to agents, finds probabilities, and finds the guilty agent; the Agent transfers
data to the Unauthorized person actor.]
SEQUENCE DIAGRAM:
COLLABORATION DIAGRAM:
CLASS DIAGRAM:
ACTIVITY DIAGRAM:
3.7 Conclusion:
When leaked data is found, the distributor can assess the likelihood that the leaked data came
from one or more agents, as opposed to having been independently gathered by other means.
If the distributor sees “enough evidence” that an agent leaked data, he may stop doing
business with him, or may initiate legal proceedings.
We also present allocation strategies for distributing objects to agents, in a way that improves
our chances of identifying a leaker.
4. Implementation Plan for Next Semester
4.1 Gantt chart:
[Gantt chart spanning July 2012, Aug 2012, Sept-Oct 2012, Nov 2012-Jan 2013, and
Feb-March 2013 for Tasks A-F below.]
4.2 Project plan:
Task A- Requirement Gathering and Analysis
Task B- Planning
Task C- Designing
Task D- Coding
Task E- Implementation
Task F- Testing
5. References:
[1] P. Buneman, S. Khanna, and W.C. Tan, “Why and Where: A Characterization of Data
Provenance,” Proc. Eighth Int’l Conf. Database Theory (ICDT ’01), J.V. den Bussche
and V. Vianu, eds., pp. 316-330, Jan. 2001.
[2] J.J.K.O. Ruanaidh, W.J. Dowling, and F.M. Boland, “Watermarking Digital Images
for Copyright Protection,” IEE Proc. Vision, Signal and Image Processing, vol. 143, no.
4, pp. 250-256, 1996.
[3] F. Hartung and B. Girod, “Watermarking of Uncompressed and Compressed Video,”
Signal Processing, vol. 66, no. 3, pp. 283-301, 1998.
[4] S. Czerwinski, R. Fromm, and T. Hodes, “Digital Music Distribution and Audio
Watermarking,” http://www.scientificcommons.org/43025658, 2007.
[5] L. Sweeney, “Achieving K-Anonymity Privacy Protection Using Generalization and
Suppression,” http://en.scientificcommons.org/43196131, 2002.
[6] P. Buneman and W.-C. Tan, “Provenance in Databases,” Proc. ACM SIGMOD, pp.
1171-1173, 2007.
[7] Y. Cui and J. Widom, “Lineage Tracing for General Data Warehouse
Transformations,” The VLDB J., vol. 12, pp. 41-58, 2003.
[8] S. Jajodia, P. Samarati, M.L. Sapino, and V.S. Subrahmanian, “Flexible Support for
Multiple Access Control Policies,” ACM Trans. Database Systems, vol. 26, no. 2, pp.
214-260, 2001.
[9] P. Bonatti, S.D.C. di Vimercati, and P. Samarati, “An Algebra for Composing Access
Control Policies,” ACM Trans. Information and System Security, vol. 5, no. 1, pp. 1-35,
2002.
[10] R. Agrawal and J. Kiernan, “Watermarking Relational Databases,” Proc. 28th Int’l
Conf. Very Large Data Bases (VLDB ’02), VLDB Endowment, pp. 155-166, 2002.

More Related Content

What's hot

data mining for security application
data mining for security applicationdata mining for security application
data mining for security applicationbharatsvnit
 
Privacy Preserving Based Cloud Storage System
Privacy Preserving Based Cloud Storage SystemPrivacy Preserving Based Cloud Storage System
Privacy Preserving Based Cloud Storage SystemKumar Goud
 
A Study on Genetic-Fuzzy Based Automatic Intrusion Detection on Network Datasets
A Study on Genetic-Fuzzy Based Automatic Intrusion Detection on Network DatasetsA Study on Genetic-Fuzzy Based Automatic Intrusion Detection on Network Datasets
A Study on Genetic-Fuzzy Based Automatic Intrusion Detection on Network DatasetsDrjabez
 
Privacy preservation techniques in data mining
Privacy preservation techniques in data miningPrivacy preservation techniques in data mining
Privacy preservation techniques in data miningeSAT Publishing House
 
Using Randomized Response Techniques for Privacy-Preserving Data Mining
Using Randomized Response Techniques for Privacy-Preserving Data MiningUsing Randomized Response Techniques for Privacy-Preserving Data Mining
Using Randomized Response Techniques for Privacy-Preserving Data Mining14894
 
Data mining and privacy preserving in data mining
Data mining and privacy preserving in data miningData mining and privacy preserving in data mining
Data mining and privacy preserving in data miningNeeda Multani
 
rpaper
rpaperrpaper
rpaperimu409
 
Phishing Websites Detection Using Back Propagation Algorithm: A Review
Phishing Websites Detection Using Back Propagation Algorithm: A ReviewPhishing Websites Detection Using Back Propagation Algorithm: A Review
Phishing Websites Detection Using Back Propagation Algorithm: A Reviewtheijes
 
User friendly pattern search paradigm
User friendly pattern search paradigmUser friendly pattern search paradigm
User friendly pattern search paradigmMigrant Systems
 
Performance Analysis of Hybrid Approach for Privacy Preserving in Data Mining
Performance Analysis of Hybrid Approach for Privacy Preserving in Data MiningPerformance Analysis of Hybrid Approach for Privacy Preserving in Data Mining
Performance Analysis of Hybrid Approach for Privacy Preserving in Data Miningidescitation
 
Enabling Use of Dynamic Anonymization for Enhanced Security in Cloud
Enabling Use of Dynamic Anonymization for Enhanced Security in CloudEnabling Use of Dynamic Anonymization for Enhanced Security in Cloud
Enabling Use of Dynamic Anonymization for Enhanced Security in CloudIOSR Journals
 
Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...
Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...
Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...IJECEIAES
 
Incentive Compatible Privacy Preserving Data Analysis
Incentive Compatible Privacy Preserving Data AnalysisIncentive Compatible Privacy Preserving Data Analysis
Incentive Compatible Privacy Preserving Data Analysisrupasri mupparthi
 
Cryptography for privacy preserving data mining
Cryptography for privacy preserving data miningCryptography for privacy preserving data mining
Cryptography for privacy preserving data miningMesbah Uddin Khan
 
A Review Study on the Privacy Preserving Data Mining Techniques and Approaches
A Review Study on the Privacy Preserving Data Mining Techniques and ApproachesA Review Study on the Privacy Preserving Data Mining Techniques and Approaches
A Review Study on the Privacy Preserving Data Mining Techniques and Approaches14894
 
New Hybrid Intrusion Detection System Based On Data Mining Technique to Enhan...
New Hybrid Intrusion Detection System Based On Data Mining Technique to Enhan...New Hybrid Intrusion Detection System Based On Data Mining Technique to Enhan...
New Hybrid Intrusion Detection System Based On Data Mining Technique to Enhan...ijceronline
 
Cluster Based Access Privilege Management Scheme for Databases
Cluster Based Access Privilege Management Scheme for DatabasesCluster Based Access Privilege Management Scheme for Databases
Cluster Based Access Privilege Management Scheme for DatabasesEditor IJMTER
 
Privacy preserving dm_ppt
Privacy preserving dm_pptPrivacy preserving dm_ppt
Privacy preserving dm_pptSagar Verma
 

What's hot (20)

data mining for security application
data mining for security applicationdata mining for security application
data mining for security application
 
C3602021025
C3602021025C3602021025
C3602021025
 
Privacy Preserving Based Cloud Storage System
Privacy Preserving Based Cloud Storage SystemPrivacy Preserving Based Cloud Storage System
Privacy Preserving Based Cloud Storage System
 
A Study on Genetic-Fuzzy Based Automatic Intrusion Detection on Network Datasets
A Study on Genetic-Fuzzy Based Automatic Intrusion Detection on Network DatasetsA Study on Genetic-Fuzzy Based Automatic Intrusion Detection on Network Datasets
A Study on Genetic-Fuzzy Based Automatic Intrusion Detection on Network Datasets
 
Privacy preservation techniques in data mining
Privacy preservation techniques in data miningPrivacy preservation techniques in data mining
Privacy preservation techniques in data mining
 
Using Randomized Response Techniques for Privacy-Preserving Data Mining
Using Randomized Response Techniques for Privacy-Preserving Data MiningUsing Randomized Response Techniques for Privacy-Preserving Data Mining
Using Randomized Response Techniques for Privacy-Preserving Data Mining
 
Data mining and privacy preserving in data mining
Data mining and privacy preserving in data miningData mining and privacy preserving in data mining
Data mining and privacy preserving in data mining
 
rpaper
rpaperrpaper
rpaper
 
Phishing Websites Detection Using Back Propagation Algorithm: A Review
Phishing Websites Detection Using Back Propagation Algorithm: A ReviewPhishing Websites Detection Using Back Propagation Algorithm: A Review
Phishing Websites Detection Using Back Propagation Algorithm: A Review
 
User friendly pattern search paradigm
User friendly pattern search paradigmUser friendly pattern search paradigm
User friendly pattern search paradigm
 
Performance Analysis of Hybrid Approach for Privacy Preserving in Data Mining
Performance Analysis of Hybrid Approach for Privacy Preserving in Data MiningPerformance Analysis of Hybrid Approach for Privacy Preserving in Data Mining
Performance Analysis of Hybrid Approach for Privacy Preserving in Data Mining
 
Kg2417521755
Kg2417521755Kg2417521755
Kg2417521755
 
Enabling Use of Dynamic Anonymization for Enhanced Security in Cloud
Enabling Use of Dynamic Anonymization for Enhanced Security in CloudEnabling Use of Dynamic Anonymization for Enhanced Security in Cloud
Enabling Use of Dynamic Anonymization for Enhanced Security in Cloud
 
Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...
Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...
Misusability Measure Based Sanitization of Big Data for Privacy Preserving Ma...
 
Incentive Compatible Privacy Preserving Data Analysis
Incentive Compatible Privacy Preserving Data AnalysisIncentive Compatible Privacy Preserving Data Analysis
Incentive Compatible Privacy Preserving Data Analysis
 
Cryptography for privacy preserving data mining
Cryptography for privacy preserving data miningCryptography for privacy preserving data mining
Cryptography for privacy preserving data mining
 
A Review Study on the Privacy Preserving Data Mining Techniques and Approaches
A Review Study on the Privacy Preserving Data Mining Techniques and ApproachesA Review Study on the Privacy Preserving Data Mining Techniques and Approaches
A Review Study on the Privacy Preserving Data Mining Techniques and Approaches
 
New Hybrid Intrusion Detection System Based On Data Mining Technique to Enhan...
New Hybrid Intrusion Detection System Based On Data Mining Technique to Enhan...New Hybrid Intrusion Detection System Based On Data Mining Technique to Enhan...
New Hybrid Intrusion Detection System Based On Data Mining Technique to Enhan...
 
Cluster Based Access Privilege Management Scheme for Databases
Cluster Based Access Privilege Management Scheme for DatabasesCluster Based Access Privilege Management Scheme for Databases
Cluster Based Access Privilege Management Scheme for Databases
 
Privacy preserving dm_ppt
Privacy preserving dm_pptPrivacy preserving dm_ppt
Privacy preserving dm_ppt
 

Viewers also liked

Synopsis for Public Garden automation with solar tracker by Punith urs
Synopsis for Public Garden automation with solar tracker by Punith ursSynopsis for Public Garden automation with solar tracker by Punith urs
Synopsis for Public Garden automation with solar tracker by Punith ursPunith Urs
 
Power grid synopsis
Power grid synopsisPower grid synopsis
Power grid synopsisParveen Jaat
 
Project Synopsis sample
Project Synopsis sampleProject Synopsis sample
Project Synopsis sampleRahul Pola
 
Software rejuvenation
Software rejuvenationSoftware rejuvenation
Software rejuvenationRVCE2
 
Behavioral malware detection in delay tolerant network
Behavioral malware detection in delay tolerant networkBehavioral malware detection in delay tolerant network
Behavioral malware detection in delay tolerant networkBittu Roy
 
PROJECT Room envirment monitoring system synopsis
PROJECT  Room envirment monitoring system synopsisPROJECT  Room envirment monitoring system synopsis
PROJECT Room envirment monitoring system synopsisprashant shukla
 
Anish Kumar (EEE)
Anish Kumar (EEE)Anish Kumar (EEE)
Anish Kumar (EEE)Anish Kumar
 
Chap2coulomb law
Chap2coulomb lawChap2coulomb law
Chap2coulomb lawgtamayov
 
DETECTING MALICIOUS FACEBOOK APPLICATIONS - IEEE PROJECTS IN PONDICHERRY,BUL...
DETECTING MALICIOUS FACEBOOK APPLICATIONS  - IEEE PROJECTS IN PONDICHERRY,BUL...DETECTING MALICIOUS FACEBOOK APPLICATIONS  - IEEE PROJECTS IN PONDICHERRY,BUL...
DETECTING MALICIOUS FACEBOOK APPLICATIONS - IEEE PROJECTS IN PONDICHERRY,BUL...Nexgen Technology
 
Van de graaff generator pp
Van de graaff generator ppVan de graaff generator pp
Van de graaff generator ppPaul Schumann
 
power grid synchronization failure detection
power grid synchronization failure detectionpower grid synchronization failure detection
power grid synchronization failure detectionJay Hind
 
Discovery and verification Documentation
Discovery and verification DocumentationDiscovery and verification Documentation
Discovery and verification DocumentationSambit Dutta
 
Audio steganography project presentation
Audio steganography project presentationAudio steganography project presentation
Audio steganography project presentationkartikeya upadhyay
 
Synopsis or minor project ppt's (2)
Synopsis or minor project ppt's (2)Synopsis or minor project ppt's (2)
Synopsis or minor project ppt's (2)sunil paswan
 
Canteen automation system (updated) revised
Canteen automation system (updated) revisedCanteen automation system (updated) revised
Canteen automation system (updated) revisedrinshi jain
 
B. Tech. Minor Project Synopsis
B. Tech. Minor Project Synopsis B. Tech. Minor Project Synopsis
B. Tech. Minor Project Synopsis Shashank Narayan
 
Fr app e detecting malicious facebook applications
Fr app e detecting malicious facebook applicationsFr app e detecting malicious facebook applications
Fr app e detecting malicious facebook applicationsCloudTechnologies
 

Viewers also liked (20)

Synopsis for Public Garden automation with solar tracker by Punith urs
Synopsis for Public Garden automation with solar tracker by Punith ursSynopsis for Public Garden automation with solar tracker by Punith urs
Synopsis for Public Garden automation with solar tracker by Punith urs
 
Power grid synopsis
Power grid synopsisPower grid synopsis
Power grid synopsis
 
Project Synopsis sample
Project Synopsis sampleProject Synopsis sample
Project Synopsis sample
 
Software rejuvenation
Software rejuvenationSoftware rejuvenation
Software rejuvenation
 
Behavioral malware detection in delay tolerant network
Behavioral malware detection in delay tolerant networkBehavioral malware detection in delay tolerant network
Behavioral malware detection in delay tolerant network
 
PROJECT Room envirment monitoring system synopsis
PROJECT  Room envirment monitoring system synopsisPROJECT  Room envirment monitoring system synopsis
PROJECT Room envirment monitoring system synopsis
 
Anish Kumar (EEE)
Anish Kumar (EEE)Anish Kumar (EEE)
Anish Kumar (EEE)
 
Chap2coulomb law
Chap2coulomb lawChap2coulomb law
Chap2coulomb law
 
DETECTING MALICIOUS FACEBOOK APPLICATIONS - IEEE PROJECTS IN PONDICHERRY,BUL...
DETECTING MALICIOUS FACEBOOK APPLICATIONS  - IEEE PROJECTS IN PONDICHERRY,BUL...DETECTING MALICIOUS FACEBOOK APPLICATIONS  - IEEE PROJECTS IN PONDICHERRY,BUL...
DETECTING MALICIOUS FACEBOOK APPLICATIONS - IEEE PROJECTS IN PONDICHERRY,BUL...
 
Van de graaff generator pp
Van de graaff generator ppVan de graaff generator pp
Van de graaff generator pp
 
Oruta project report
Oruta project reportOruta project report
Oruta project report
 
Ven de graaff generator 745
Ven de graaff generator 745Ven de graaff generator 745
Ven de graaff generator 745
 
power grid synchronization failure detection
power grid synchronization failure detectionpower grid synchronization failure detection
power grid synchronization failure detection
 
Discovery and verification Documentation
Discovery and verification DocumentationDiscovery and verification Documentation
Discovery and verification Documentation
 
Audio steganography project presentation
Audio steganography project presentationAudio steganography project presentation
Audio steganography project presentation
 
Synopsis or minor project ppt's (2)
Synopsis or minor project ppt's (2)Synopsis or minor project ppt's (2)
Synopsis or minor project ppt's (2)
 
Canteen automation system (updated) revised
Canteen automation system (updated) revisedCanteen automation system (updated) revised
Canteen automation system (updated) revised
 
Ieee 2015 2016
Ieee 2015 2016Ieee 2015 2016
Ieee 2015 2016
 
B. Tech. Minor Project Synopsis
B. Tech. Minor Project Synopsis B. Tech. Minor Project Synopsis
B. Tech. Minor Project Synopsis
 
Fr app e detecting malicious facebook applications
Fr app e detecting malicious facebook applicationsFr app e detecting malicious facebook applications
Fr app e detecting malicious facebook applications
 

Similar to DLD_SYNOPSIS

IRJET- Data Leakage Detection System
IRJET- Data Leakage Detection SystemIRJET- Data Leakage Detection System
IRJET- Data Leakage Detection SystemIRJET Journal
 
164788616_Data_Leakage_Detection_Complete_Project_Report__1_.docx.pdf
164788616_Data_Leakage_Detection_Complete_Project_Report__1_.docx.pdf164788616_Data_Leakage_Detection_Complete_Project_Report__1_.docx.pdf
164788616_Data_Leakage_Detection_Complete_Project_Report__1_.docx.pdfDrog3
 
Data Leakage Detection
Data Leakage DetectionData Leakage Detection
Data Leakage DetectionAshwini Nerkar
 
Dn31538540
Dn31538540Dn31538540
Dn31538540IJMER
 
Jpdcs1 data leakage detection
Jpdcs1 data leakage detectionJpdcs1 data leakage detection
Jpdcs1 data leakage detectionChaitanya Kn
 
A Comprehensive Study on Outlier Detection in Data Mining
A Comprehensive Study on Outlier Detection in Data MiningA Comprehensive Study on Outlier Detection in Data Mining
A Comprehensive Study on Outlier Detection in Data MiningBRNSSPublicationHubI
 
IRJET- Credit Card Fraud Detection using Isolation Forest
IRJET- Credit Card Fraud Detection using Isolation ForestIRJET- Credit Card Fraud Detection using Isolation Forest
IRJET- Credit Card Fraud Detection using Isolation ForestIRJET Journal
 
Privacy preserving detection of sensitive data exposure
Privacy preserving detection of sensitive data exposurePrivacy preserving detection of sensitive data exposure
Privacy preserving detection of sensitive data exposurePvrtechnologies Nellore
 
Fundamentals of data mining and its applications
Fundamentals of data mining and its applicationsFundamentals of data mining and its applications
Fundamentals of data mining and its applicationsSubrat Swain
 
Data discrimination prevention in customer relationship managment
Data discrimination prevention in customer relationship managmentData discrimination prevention in customer relationship managment
Data discrimination prevention in customer relationship managmenteSAT Publishing House
 
Secure Multimedia Content Protection and Sharing
Secure Multimedia Content Protection and SharingSecure Multimedia Content Protection and Sharing
Secure Multimedia Content Protection and SharingIRJET Journal
 
Privacy preserving detection of sensitive data exposure
Privacy preserving detection of sensitive data exposurePrivacy preserving detection of sensitive data exposure
Privacy preserving detection of sensitive data exposureredpel dot com
 
Data leakage detection
Data leakage detectionData leakage detection
Data leakage detectionkalpesh1908
 
Data leakage detection
Data leakage detectionData leakage detection
Data leakage detectionAjitkaur saini
 
IRJET- Big Data Driven Information Diffusion Analytics and Control on Social ...
IRJET- Big Data Driven Information Diffusion Analytics and Control on Social ...IRJET- Big Data Driven Information Diffusion Analytics and Control on Social ...
IRJET- Big Data Driven Information Diffusion Analytics and Control on Social ...IRJET Journal
 
Applying Data Mining Principles in the Extraction of Digital Evidence
Applying Data Mining Principles in the Extraction of Digital EvidenceApplying Data Mining Principles in the Extraction of Digital Evidence
Applying Data Mining Principles in the Extraction of Digital EvidenceDr. Richard Otieno
 
83504808-Data-Leakage-Detection-1-Final.ppt
83504808-Data-Leakage-Detection-1-Final.ppt83504808-Data-Leakage-Detection-1-Final.ppt
83504808-Data-Leakage-Detection-1-Final.pptnaresh2004s
 

Similar to DLD_SYNOPSIS (20)

IRJET- Data Leakage Detection System
IRJET- Data Leakage Detection SystemIRJET- Data Leakage Detection System
IRJET- Data Leakage Detection System
 
164788616_Data_Leakage_Detection_Complete_Project_Report__1_.docx.pdf
164788616_Data_Leakage_Detection_Complete_Project_Report__1_.docx.pdf164788616_Data_Leakage_Detection_Complete_Project_Report__1_.docx.pdf
164788616_Data_Leakage_Detection_Complete_Project_Report__1_.docx.pdf
 
Data Leakage Detection
Data Leakage DetectionData Leakage Detection
Data Leakage Detection
 
Dn31538540
Dn31538540Dn31538540
Dn31538540
 
Sub1555
Sub1555Sub1555
Sub1555
 
Jpdcs1 data leakage detection
Jpdcs1 data leakage detectionJpdcs1 data leakage detection
Jpdcs1 data leakage detection
 
A Comprehensive Study on Outlier Detection in Data Mining
A Comprehensive Study on Outlier Detection in Data MiningA Comprehensive Study on Outlier Detection in Data Mining
A Comprehensive Study on Outlier Detection in Data Mining
 
IRJET- Credit Card Fraud Detection using Isolation Forest
IRJET- Credit Card Fraud Detection using Isolation ForestIRJET- Credit Card Fraud Detection using Isolation Forest
IRJET- Credit Card Fraud Detection using Isolation Forest
 
Privacy preserving detection of sensitive data exposure
Privacy preserving detection of sensitive data exposurePrivacy preserving detection of sensitive data exposure
Privacy preserving detection of sensitive data exposure
 
Fundamentals of data mining and its applications
Fundamentals of data mining and its applicationsFundamentals of data mining and its applications
Fundamentals of data mining and its applications
 
Data discrimination prevention in customer relationship managment
Data discrimination prevention in customer relationship managmentData discrimination prevention in customer relationship managment
Data discrimination prevention in customer relationship managment
 
Secure Multimedia Content Protection and Sharing
Secure Multimedia Content Protection and SharingSecure Multimedia Content Protection and Sharing
Secure Multimedia Content Protection and Sharing
 
Privacy preserving detection of sensitive data exposure
Privacy preserving detection of sensitive data exposurePrivacy preserving detection of sensitive data exposure
Privacy preserving detection of sensitive data exposure
 
Data leakage detection
Data leakage detectionData leakage detection
Data leakage detection
 
Data leakage detection
Data leakage detectionData leakage detection
Data leakage detection
 
709 713
709 713709 713
709 713
 
IRJET- Big Data Driven Information Diffusion Analytics and Control on Social ...
IRJET- Big Data Driven Information Diffusion Analytics and Control on Social ...IRJET- Big Data Driven Information Diffusion Analytics and Control on Social ...
IRJET- Big Data Driven Information Diffusion Analytics and Control on Social ...
 
Applying Data Mining Principles in the Extraction of Digital Evidence
Applying Data Mining Principles in the Extraction of Digital EvidenceApplying Data Mining Principles in the Extraction of Digital Evidence
Applying Data Mining Principles in the Extraction of Digital Evidence
 
83504808-Data-Leakage-Detection-1-Final.ppt
83504808-Data-Leakage-Detection-1-Final.ppt83504808-Data-Leakage-Detection-1-Final.ppt
83504808-Data-Leakage-Detection-1-Final.ppt
 
F033026029
F033026029F033026029
F033026029
 

DLD_SYNOPSIS

  • 1. 1 A PROJECT SYNOPSIS On Data Leakage Detection Submitted By 1) Nisha Jain (19) 2) Ankita Patil (37) 3) Yogita Patil (41) 4) Vidya Shelke (51) Under the Guidance of Prof. Shilpa Kolte Department of Information Technology Saraswati Education Society’s SARASWATI COLLEGE OF ENGINEERING Kharghar, Navi Mumbai (Affiliated to University of Mumbai) Academic Year :-2012-13
  • 2. 2 _____________________________________________________________________________________ PLOT NO. 46/46A, SECTOR NO 5, BEHIND MSEB SUBSTATION, KHARGHAR,NAVI MUMBAI-410210 Tel.: 022-27743706 to 11 * Fax : 022-27743712 * Website: www.scoe.edu.in CERTIFICATE This is to certify that the requirements for the synopsis entitled “Data Leakage Detection” Have been successfully completed by the following students: Roll numbers Name 1) 19 Nisha Jain 2) 37 Ankita Patil 3) 41 Yogita Patil 4) 51 Vidya Shelke In partial fulfillment of Sem –VII , Bachelor of Engineering of Mumbai University in Information Technology of Saraswati college of Engineering , Kharghar during the academic year 2012-13. Internal Guide External Examiner Prof. Shilpa Kolte Project co-ordinator Head of Department Prof. Archana Sharma Prof. Vaishali Jadhav Principal Dr. B. B. Shrivastava SARASWATI EDUCATION SOCIETY’S SARASWATI COLLEGE OF ENGINEERING (Approved by AICTE, recg. By Maharashtra Govt. DTE,Affiliated to Mumbai University)
  • 3. 3 Acknowledgement A project is something that could not have been materialized without cooperation of many people. This project shall be incomplete if I do not convey my heart felt gratitude to those people from whom I have got considerable support and encouragement. It is a matter of great pleasure for us to have a respected Prof. Shilpa Kolte as our project guide. We are thankful to her for being constant source of inspiration. We would also like to give our sincere thanks to Prof. Vaishali Jadhav, H.O.D, Information Technology Department, Prof. Archana Sharma, Project co-ordinator for their kind support. We would like to express our deepest gratitude to Dr. B. B. Shrivastava, our principal of Saraswati college of Engineering, Kharghar, Navi Mumbai. Last but not the least I would also like to thank all the staff of Saraswati college of Engineering (Information Technology Department) for their valuable guidance with their interest and valuable suggestions brightened us. Name of Students 1) Nisha Jain (19) 2) Ankita Patil (37) 3) Yogita Patil (41) 4) Vidya Shelke (51)
  • 4. 4 Data Leakage Detection ABSTRACT A data distributor has given sensitive data to a set of supposedly trusted agents (third parties). Some of the data are leaked and found in an unauthorized place (e.g., on the web or somebody’s laptop). The distributor must assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means. ). If the data distributed to third parties is found in a public/private domain then finding the guilty party is a nontrivial task to distributor. Traditionally, this leakage of data is handled by water marking technique which requires modification of data. To overcome the disadvantages of using watermark, data allocation strategies are used to improve the probability of identifying guilty third parties. In this project, we implement and analyse a guilt model that detects the agents using allocation strategies without modifying the original data. The guilty agent is one who leaks a portion of distributed data. The idea is to distribute the data intelligently to agents based on sample data request and explicit data request in order to improve the chance of detecting the guilty agents. The algorithms implemented using fake objects will improve the distributor chance of detecting guilty agents. It is observed that by minimizing the sum objective the chance of detecting guilty agents will increase. We also developed a framework for generating fake objects.
  • 5. 5 INDEX 1. Introduction ..……………………………………………………………... 2. Literature survey .………………………………………………….…….. 3. Data Leakage Detection 3.1 Problem statement .………………………………………………... 3.2 Scope ...……………………………………………………………. 3.3 Proposed system …………………………………………………... 3.4 Analysis …………………………………………………………… 3.5 Details of Hardware and Software .……………………………...... 3.6 Design details .…………………………………………………….. 3.7 Conclusion ………………………………………………………… 4. Implementation Plan for Next semester ..………………………………. 5. References ..…………………………………………………………….....
  • 6. 6 1. INTRODUCTION 1.1 Need: Due to today’s competitive world, many companies outsource certain business processes (e.g. marketing, human resources, etc.) and associated activities to a third party. This allows companies to focus on their core competency by subcontracting other activities to specialists, resulting in reduced operational costs and increased productivity. In most cases, the service providers need access to a company's intellectual property and other confidential information to carry out their services. For example, if an outsourcer is doing payroll for a bank, he must have the salary and customer bank account numbers. The main security problem is that the service provider may not be fully trusted or may not be securely administered. Business agreements for try to regulate how the data will be handled by service providers, but it is almost impossible to truly enforce or verify such policies across different administrative domains. Because of their digital nature, relational databases are easy to duplicate and in many cases a service provider may have financial incentives to redistribute commercially valuable data or may simply fail to handle it properly. Hence, we need powerful techniques that can detect and deter such dishonest. 1.2 Basic Concept: A model for assessing the “guilt” of agents has been developed. An algorithm for distributing objects to agents, in a way that improves our chances of identifying a leaker has been proposed. The option of adding “fake” objects to the distributed set also been considered. The allocation of fake objects depends on the sample data request or explicit data request made by the agent. Such objects do not correspond to real entities but appear realistic to the agents. In a sense, the fake objects acts as a type of watermark for the entire set, without modifying any individual members. If it turns out an agent was given one or more fake objects that were leaked, then the distributor can be more confident that agent was guilty. Unobtrusive techniques for detecting leakage of a set of objects or records have been studied. After giving a set of objects to agents, the distributor discovers some of those same objects in an unauthorized place. At this point the distributor can assess the likelihood that the leaked data came from one or more agents.
  • 7. 7 2. LITERATURE SURVEY 2.1 Existing Systems: 2.1.1 Perturbation: Perturbation is a very useful technique where the data are modified and made less sensitive before being handed to agents. For example, one can add random noise to certain attributes, or one can replace exact values by ranges [5]. However, it is not suitable if an application where the original sensitive data cannot be altered is considered. For example, if an outsourcer is doing our payroll, he must have the exact salary and customer bank account numbers. If medical researchers will be treating patients, as opposed to simply computing statistics, they may need accurate data for the patients. 2.1.2 Watermarking: Watermarks were initially used in images [2], video [3], and audio data [4] whose digital representation includes considerable redundancy. Traditionally, leakage detection is handled by watermarking, e.g., a unique code is embedded in each distributed copy. If that copy is later discovered in the hands of an unauthorized party, the leaker can be identified. Watermarks can be very useful in some cases, but again, involve some modification of the original data. Furthermore, watermarks can sometimes be destroyed if the data recipient is malicious. 2.2 Disadvantages: 1. In some applications the original sensitive data cannot be perturbed. 2. In some cases it is important not to alter the original distributor’s data. 3. For example, we can't perturbed patient records while giving it to research center for finding cure for a disease. 4. Watermarks can be very useful in some cases but involve some modification of the original data. 5. Watermarks can sometimes be destroyed if the data recipient is malicious.
  • 8. 8 2.3 Related Work: 1. The guilt detection approach we present is related to the data provenance problem [1]: tracing the lineage of S objects implies essentially the detection of the guilty agents. Tutorial [6] provides a good overview on the research conducted in this field. Suggested solutions are domain specific, such as lineage tracing for data warehouses [7], and assume some prior knowledge on the way a data view is created out of data sources. 2. Our problem formulation with objects and sets is more general and simplifies lineage tracing, since we do not consider any data transformation from Ri sets to S. As far as the data allocation strategies are concerned, our work is mostly relevant to watermarking that is used as a means of establishing original ownership of distributed objects. 3. There are also lots of other works on mechanisms that allow only authorized users to access sensitive data through access control policies [8], [9]. Such approaches prevent in some sense data leakage by sharing information only with trusted parties. However, these policies are restrictive and may make it impossible to satisfy agents’ requests. 4. The concept used in this project is very useful as it does not modify the original data, but adds fake objects that appear realistic to the agents. Only the distributor knows about the fake objects so there is no way that the agent can come to know which objects are fake. 5. The distributor may be able to add fake objects to the distributed data in order to improve his effectiveness in detecting guilty agents. However, fake objects may impact the correctness of what agents do, so they may not always be allowable. The idea of perturbing data to detect leakage is not new, e.g., [10]. However, in most cases, individual objects are perturbed, e.g., by adding random noise to sensitive salaries, or adding a watermark to an image. 6. In our case, we are perturbing the set of distributor objects by adding fake elements. In some applications, fake objects may cause fewer problems that perturbing real objects. For example, say that the distributed data objects are customer’s records of a bank and the agents are outsourcers who will do the marketing. In this case, even small modifications to the records of actual customers may be undesirable. However, the addition of some fake customer’s records may be acceptable, since no customer matches these records, and hence, no one will ever be contacted based on fake records.
  • 9. 9 3. DATA LEAKAGE DETECTION 3.1 Problem statement : Suppose a distributor owns a set T = {t1, tm} of valuable data objects. The distributor wants to share some of the objects with a set of agents U1,U2,…,Un but does wish the objects be leaked to other third parties. An agent Ui receives a subset of objects Ri which belongs to T, determined either by a sample request or an explicit request. The objects in T could be of any type and size, e.g., they could be tuples in a relation, or relations in a database. After giving objects to agents, the distributor discovers that a set S of T has leaked. This means that some third party called the target has been caught in possession of S. For example, this target may be displaying S on its web site, or perhaps as part of a legal discovery process, the target turned over S to the distributor. Since the agents U1,U2 ,… ,Un, have some of the data, it is reasonable to suspect them leaking the data. However, the agents can argue that they are innocent, and that the S data was obtained by the target through other means. 3.2 Scope : Our goal is to detect when the distributor’s sensitive data have been leaked by agents, and also to identify the agent that leaked the data. Data allocation strategies (across the agents) have been used that improve the probability of identifying leakages. These methods do not rely on alterations of the released data (e.g., watermarks). In some cases distributor can also inject “realistic but fake” data records to further improve our chances of detecting leakage and identifying the guilty party. This system can be useful for various organizations or companies who need to distribute their sensitive data to some third parties. For example, this system can be implemented in Banks to avoid revealing the identity of any employee or customers. Similarly, a company may have partnerships with other companies that require sharing customer data. Another enterprise may outsource its data processing, so data must be given to various other companies.
3.3 Proposed System:

In the course of doing business, sensitive data must sometimes be handed over to supposedly trusted third parties. A data distributor gives sensitive data to a set of supposedly trusted agents. When some of the data is found in an unauthorized place, the distributor must assess the likelihood that the leaked data came from one or more agents. We call the owner of the data the distributor and the supposedly trusted third parties the agents. The guilty agent is the one who leaks a portion of the distributed data, and the one who receives the data from the guilty agent is an unauthorized person.

3.3.1 Guilty Agent:

Suppose that, after giving objects to agents, the distributor discovers that a set S ⊆ T has leaked. This means that some third party, called the target, has been caught in possession of S. For example, this target may be displaying S on its website, or perhaps, as part of a legal discovery process, the target turned over S to the distributor. Since the agents U1, ..., Un have some of the data, it is reasonable to suspect them of leaking the data. However, the agents can argue that they are innocent, and that the S data were obtained by the target through other means. An agent Ui is guilty if it contributes one or more objects to the target. We denote the event that agent Ui is guilty by Gi, and the event that agent Ui is guilty for a given leaked set S by Gi|S. Our next step is to estimate Pr{Gi|S}, i.e., the probability that agent Ui is guilty given the evidence S.

3.3.2 Agent Guilt Model:

To compute Pr{Gi|S}, we need an estimate of the probability that the values in S can be "guessed" by the target. For simplicity, we assume that all objects in T have the same such probability pt, which we call p; our equations can easily be generalized to diverse pt values. Next, we make two assumptions regarding the relationship among the various leakage events.

Assumption 1: For all t, t′ ∈ S such that t ≠ t′, the provenance of t is independent of the provenance of t′.
The term provenance in this assumption refers to the source of a value t that appears in the leaked set. The source can be any of the agents who have t in their sets, or the target itself.

Assumption 2: An object t ∈ S can only be obtained by the target in one of two ways:
• A single agent Ui leaked t from its own set Ri, or
• The target guessed (or obtained through other means) t without the help of any of the n agents.

To find the probability that an agent Ui is guilty given a set S, consider a leaked object t1: the target either guessed t1, with probability p, or one of the agents leaked it, with probability 1 − p. We first compute the probability that an agent leaks a single object t to S. To do so, define the set of agents Vt = {Ui | t ∈ Ri} that have t in their data sets. Then, using Assumption 2 and the known probability p, we have:

Pr{some agent leaked t to S} = 1 − p.    (1.1)

Assuming that all agents belonging to Vt can leak t to S with equal probability, and using Assumption 2, we obtain:

Pr{Ui leaked t to S} = (1 − p) / |Vt| if Ui ∈ Vt, and 0 otherwise.    (1.2)

Given that agent Ui is guilty if he leaks at least one value to S, we use Assumption 1 and Equation 1.2 to compute the probability Pr{Gi|S} that agent Ui is guilty:

Pr{Gi|S} = 1 − ∏_{t ∈ S ∩ Ri} ( 1 − (1 − p) / |Vt| ).    (1.3)
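A minimal C# sketch of this computation follows, assuming a fixed guessing probability p and representing each object by an integer id (the class and method names are illustrative, not part of the synopsis):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal sketch of the agent guilt model above. Each data object is
// an integer id; agentSets[i] holds R_i for agent i.
static class GuiltModel
{
    // Pr{Gi | S} via Equations 1.2 and 1.3: an agent in V_t leaks t
    // with probability (1 - p) / |V_t|, independently across objects.
    public static double GuiltProbability(int agent,
        List<HashSet<int>> agentSets, HashSet<int> leaked, double p)
    {
        double innocent = 1.0; // probability that the agent leaked nothing
        foreach (int t in leaked)
        {
            if (!agentSets[agent].Contains(t)) continue;   // agent never had t
            int vt = agentSets.Count(r => r.Contains(t));  // |V_t|
            innocent *= 1.0 - (1.0 - p) / vt;              // Eq. 1.2 per object
        }
        return 1.0 - innocent;                             // Eq. 1.3
    }
}
```

For example, with p = 0.5, R1 = {t1, t2}, R2 = {t1}, and S = {t1, t2}, the sketch gives Pr{G1|S} = 1 − (1 − 0.25)(1 − 0.5) = 0.625 and Pr{G2|S} = 0.25, reflecting that t2 could only have come from agent U1.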
3.3.3 Allocation Strategies:

Explicit Data Request

In the case of an explicit data request with fake objects not allowed, the distributor is not allowed to add fake objects to the distributed data, so the data allocation is fully defined by the agents' data requests.

In the case of an explicit data request with fake objects allowed, the distributor cannot remove or alter the objects requested by the agents; however, he can add fake objects to the sets he distributes. The algorithm for data allocation with explicit requests takes as input the requests R1, R2, ..., Rn from the n agents, together with the request conditions and the limits on fake objects. The e-optimal algorithm finds the agents that are eligible to receive fake objects; in each iteration it creates one fake object and allocates it to the selected agent. The e-optimal algorithm minimizes every term of the objective summation by adding the maximum number bi of fake objects to every set Ri, yielding an optimal solution.

Algorithm:
Step 1: Calculate the total number of fake records as the sum of the fake records allowed per agent.
Step 2: While the total number of fake records > 0:
Step 3: Select the agent that will yield the greatest improvement in the sum-objective.
Step 4: Create a fake record.
Step 5: Add this fake record to the selected agent's set and to the fake record set.
Step 6: Decrement the total fake record count.

The algorithm makes a greedy choice by selecting the agent that will yield the greatest improvement in the sum-objective; a sketch of this loop is given below.
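The following C# sketch is a hedged illustration of this greedy loop, not the synopsis's actual implementation; it assumes the sum-objective term for agent i is the relative overlap Σ_{j≠i} |Ri ∩ Rj| / |Ri|, which a unique fake object reduces by enlarging the denominator, and it assumes fake ids are drawn from a range outside the real ids:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical greedy allocator in the spirit of the e-optimal algorithm:
// while fake objects remain, give one to the agent whose enlarged set R_i
// most reduces sum over j != i of |R_i ∩ R_j| / |R_i| (the sum-objective).
static class FakeAllocator
{
    public static void Allocate(List<HashSet<int>> r, int[] budget,
                                int totalFakes, HashSet<int> fakeSet,
                                ref int nextFakeId)
    {
        while (totalFakes > 0)
        {
            int best = -1;
            double bestGain = -1.0;
            for (int i = 0; i < r.Count; i++)
            {
                if (budget[i] == 0) continue;          // agent not eligible
                double overlap = 0.0;                  // sum_j |R_i ∩ R_j|
                for (int j = 0; j < r.Count; j++)
                    if (j != i) overlap += r[i].Intersect(r[j]).Count();
                // fakes are unique, so they only grow the denominator |R_i|
                double gain = overlap / r[i].Count - overlap / (r[i].Count + 1);
                if (gain > bestGain) { bestGain = gain; best = i; }
            }
            if (best < 0) break;                       // no eligible agent left
            int fake = nextFakeId++;   // create a fresh fake record id, assumed
            r[best].Add(fake);         // to lie outside the real id range
            fakeSet.Add(fake);         // remembered for later leakage tracing
            budget[best]--;
            totalFakes--;
        }
    }
}
```

Because only the distributor keeps fakeSet, a later leaked set can be checked against it: any fake object found in S can have come only from the agent it was allocated to.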
Sample Data Request

With sample data requests, each agent Ui may receive any subset of T of size mi, out of C(|T|, mi) different ones. Hence, there are ∏_{i=1}^{n} C(|T|, mi) different object allocations. In every allocation, the distributor can permute the objects of T and keep the same chances of guilty agent detection. The reason is that the guilt probability depends only on which agents have received the leaked objects, and not on the identity of the leaked objects. Therefore, from the distributor's perspective, there are ∏_{i=1}^{n} C(|T|, mi) / |T|! different allocations.

An object allocation that satisfies the requests and ignores the distributor's objective is to give each agent a unique subset of T of size m. The s-max algorithm instead allocates to an agent the data record that yields the minimum increase of the maximum relative overlap among any pair of agents. Its selection step is as follows (a sketch is given after the algorithm).

Algorithm:
Step 1: Initialize min_overlap ← 1, the minimum, over the allocations of different objects to Ui, of the maximum relative overlap.
Step 2: For each k ∈ {k | tk ∈ T \ Ri} do:
Initialize max_rel_ov ← 0, the maximum relative overlap between Ri and any set Rj that the allocation of tk to Ui yields.
Step 3: For all j = 1, ..., n such that j ≠ i and tk ∈ Rj do:
Calculate the absolute overlap as abs_ov ← |Ri ∩ Rj| + 1.
Calculate the relative overlap as rel_ov ← abs_ov / min(mi, mj).
Step 4: Update the maximum relative overlap as max_rel_ov ← MAX(max_rel_ov, rel_ov).
If max_rel_ov ≤ min_overlap then
min_overlap ← max_rel_ov
ret_k ← k
Return ret_k.

It can be shown that algorithm s-max is optimal for the sum-objective and the max-objective in problems where M ≤ |T| and n < |T|. It is also optimal for the max-objective if |T| ≤ M ≤ 2|T|, or if all agents request data of the same size.
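A C# sketch of this selection step follows, under the same integer-id convention as the earlier sketches (the method and parameter names are illustrative assumptions):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch of one s-max selection step: among the candidate objects not yet
// in R_i, pick the one whose allocation to agent i yields the smallest
// maximum relative overlap with any other agent. m[j] is agent j's
// requested sample size.
static class SMax
{
    public static int SelectObject(int i, List<HashSet<int>> r, int[] m,
                                   IEnumerable<int> candidates)
    {
        double minOverlap = double.MaxValue; // smallest max overlap seen so far
        int retK = -1;
        foreach (int k in candidates)
        {
            double maxRelOv = 0.0;           // worst overlap if t_k goes to agent i
            for (int j = 0; j < r.Count; j++)
            {
                if (j == i || !r[j].Contains(k)) continue;
                int absOv = r[i].Intersect(r[j]).Count() + 1; // +1 for t_k itself
                double relOv = (double)absOv / Math.Min(m[i], m[j]);
                maxRelOv = Math.Max(maxRelOv, relOv);
            }
            if (maxRelOv <= minOverlap) { minOverlap = maxRelOv; retK = k; }
        }
        return retK; // index of the object to allocate to agent i
    }
}
```

Calling SelectObject repeatedly, once per remaining slot of each agent's sample request, fills all the sets Ri while keeping pairwise overlaps, and hence shared suspicion between agents, as small as the requests permit.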
3.4 Analysis:

After analyzing the requirements of the task to be performed, the next step is to analyze the problem and understand its context. The first activity in this phase is studying the existing system; the second is understanding the requirements and domain of the new system. Both activities are equally important, but the first serves as the basis for the functional specifications and, in turn, the successful design of the proposed system. Understanding the properties and requirements of a new system is difficult and requires creative thinking; understanding the existing running system is also difficult, and an improper understanding of the present system can lead to a diversion from the solution.

The main focus of our project is the data allocation problem: how can the distributor "intelligently" give data to agents in order to improve the chances of detecting a guilty agent? There are four instances of this problem, depending on the type of data requests made by agents and on whether fake objects are allowed. The two types of requests we handle are sample and explicit. A sample request asks only for a fixed number of records, say 100; an explicit request asks for records that satisfy given conditions, say age = 25. We represent the four problem instances with the names EF̄, EF, SF̄, and SF, where E stands for explicit requests, S for sample requests, F for the use of fake objects, and F̄ for the case where fake objects are not allowed.
3.5 Details of hardware and software:

Hardware requirements
• Processor: Pentium Dual-Core
• RAM: 1 GB
• Hard Disk: 160 GB

Software requirements
• Operating System: Windows family
• Front End: VB.Net
• Back End: MS Access
• Language: ASP.Net using C#
3.6 Design Details:

SYSTEM ARCHITECTURE / DATA FLOW DIAGRAM, LEVEL 0:

(Diagram: the distributor, Agents 1 through 5, and an unauthorized person. Agents send explicit or sample requests to the distributor's database, and the distributor sends sensitive data to the agents; at runtime the distributor creates fake objects and allocates them to the agents. Data leaks out to the unauthorized person; when the distributor finds the leaked file in an unauthorized place, he compares it with his database to find the guilty agent.)
UML DIAGRAMS

USE CASE DIAGRAM:

(Diagram: three actors, the distributor, the agent, and the unauthorized person. The distributor's use cases are receiving an explicit or sample request, adding fake objects, distributing data to agents, finding the guilt probability, and finding the guilty agent; the agent's use case is the data transfer to the unauthorized person.)
3.7 Conclusion:

When leaked data is found, the distributor can assess the likelihood that it came from one or more agents, as opposed to having been independently gathered by other means. If the distributor sees enough evidence that an agent leaked data, he may stop doing business with him, or may initiate legal proceedings. We also present allocation strategies for distributing objects to agents in a way that improves our chances of identifying a leaker.

4. Implementation Plan for Next Semester

4.1 Gantt chart:

(Gantt chart covering the periods July 2012, Aug 2012, Sept-Oct 2012, Nov 2012-Jan 2013, and Feb-March 2013.)
4.2 Project plan:

Task A - Requirement Gathering and Analysis
Task B - Planning
Task C - Designing
Task D - Coding
Task E - Implementation
Task F - Testing

5. References:

[1] P. Buneman, S. Khanna, and W.C. Tan, "Why and Where: A Characterization of Data Provenance," Proc. Eighth Int'l Conf. Database Theory (ICDT '01), J.V. den Bussche and V. Vianu, eds., pp. 316-330, Jan. 2001.
[2] J.J.K.O. Ruanaidh, W.J. Dowling, and F.M. Boland, "Watermarking Digital Images for Copyright Protection," IEE Proc. Vision, Signal and Image Processing, vol. 143, no. 4, pp. 250-256, 1996.
[3] F. Hartung and B. Girod, "Watermarking of Uncompressed and Compressed Video," Signal Processing, vol. 66, no. 3, pp. 283-301, 1998.
[4] S. Czerwinski, R. Fromm, and T. Hodes, "Digital Music Distribution and Audio Watermarking," http://www.scientificcommons.org/43025658, 2007.
[5] L. Sweeney, "Achieving K-Anonymity Privacy Protection Using Generalization and Suppression," http://en.scientificcommons.org/43196131, 2002.
[6] P. Buneman and W.-C. Tan, "Provenance in Databases," Proc. ACM SIGMOD, pp. 1171-1173, 2007.
[7] Y. Cui and J. Widom, "Lineage Tracing for General Data Warehouse Transformations," The VLDB J., vol. 12, pp. 41-58, 2003.
[8] S. Jajodia, P. Samarati, M.L. Sapino, and V.S. Subrahmanian, "Flexible Support for Multiple Access Control Policies," ACM Trans. Database Systems, vol. 26, no. 2, pp. 214-260, 2001.
[9] P. Bonatti, S.D.C. di Vimercati, and P. Samarati, "An Algebra for Composing Access Control Policies," ACM Trans. Information and System Security, vol. 5, no. 1, pp. 1-35, 2002.
[10] R. Agrawal and J. Kiernan, "Watermarking Relational Databases," Proc. 28th Int'l Conf. Very Large Data Bases (VLDB '02), VLDB Endowment, pp. 155-166, 2002.