Data mining with differential privacy

+Data Mining with Differential Privacy
Arik Friedman and Assaf Schuster / KDD’10
Chang Wei-Yuan
2014 / 10 / 3 (Fri.) @ MakeLab Group Meeting

+Outline
n Introduction
n Background
n Method
n Experiment
n Conclusion
n Though
2

+Introduction
n There is great value in data mining
solutions.
n reliable privacy guarantees
n available accuracy
n Differential privacy
n computations are insensitive to changes in
any particular individual's record
3

+Introduction (cont.)
n Once an individual is certain that his or
her data will remain private, being opted
in or out of the database should make
little difference.

+Introduction (cont.)
n Example 1
Name    Result
Tom     0
Jack    1
Henry   1
Diego   0
Alice   ?
n f(i) = count(i), the number of positive results among the first i records
n Alice is record i = 5
n count(5) - count(4) reveals Alice's result exactly

+Introduction (cont.)
n Example 2
n We can infer the attributes of the target (record 5)
from the information already in the table.
Id  Sex  Job      Hometown  Hobby
1   M    student  Hsinchu   sport
2   M    teacher  Taipei    writing
3   F    student  Hsinchu   singing
4   F    student  Taipei    singing
5   ?    ?        ?         ?

+Introduction (cont.)
n Goal：count(5) – count(4) ≈ 0
n Goal：” computations are insensitive to
changes in any particular individual's
record ”
7

+Differential Privacy
[Figure: probability (y-axis) of each output (x-axis) of M]
• M: a randomized computation
• f: a query function
• D, D’: datasets whose symmetric
difference is a single record
+Differential Privacy (cont.)
Definition (ε-Differential Privacy)
We say a randomized computation M provides
ε-differential privacy if for any datasets A and B with
symmetric difference |AΔB| = 1, and any set of possible
outcomes S ⊆ Range(M),
$$\Pr[M(A) \in S] \le e^{\varepsilon} \cdot \Pr[M(B) \in S]$$

+Laplace Mechanism
n Example of Laplace Mechanism
11
Name
Result
Tom
0
Jack
1
Henry
1
Diego
0
Alice
?
n count(4) = 2 + noise(4)
n count(5) = 3 + noise(5)
n count(5) – count(4) = eε

+Laplace Mechanism
n Laplace Mechanism
12
Theorem. （Laplace mechanism）
Given a function f over an arbitrary domain D, the
computation
provides differential privacy.
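The guarantee can be checked numerically for the count example. The sketch below assumes a count query (sensitivity 1), a hypothetical ε = 0.5, and true counts 2 and 3 for the neighbouring datasets; it compares the Laplace densities of the two noisy outputs over a grid and confirms the ratio never exceeds e^ε.

```python
import math

def laplace_pdf(x, mu, scale):
    """Density of the Laplace distribution centred at mu."""
    return math.exp(-abs(x - mu) / scale) / (2.0 * scale)

epsilon = 0.5            # hypothetical privacy parameter
scale = 1.0 / epsilon    # sensitivity of a count is 1

# Grid of candidate outputs from -10.0 to 15.0 in steps of 0.1.
outputs = [k / 10.0 for k in range(-100, 151)]

# Largest density ratio, in either direction, between the two datasets.
worst_ratio = max(
    max(laplace_pdf(x, 2, scale) / laplace_pdf(x, 3, scale),
        laplace_pdf(x, 3, scale) / laplace_pdf(x, 2, scale))
    for x in outputs
)
print(worst_ratio, "<=", math.exp(epsilon))
```

The bound is tight: outside the interval between the two true counts the ratio equals e^ε, which is what the definition allows.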

+Exponential Mechanism
n Example of Exponential Mechanism
13
item
q
ε=0
ε=0.1
ε=1
Football 30
0.46
0.42
0.92
Volleyball
25
0.38
0.33
0.07
Basketball
8
0.12
0.14
1.5E-05
Tennis
2
0.03
0.10
7.7E-07
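The selection probabilities in this example can be recomputed directly. The sketch assumes the quality function has sensitivity 1, which is the setting that matches the ε=0.1 and ε=1 columns up to rounding; `exp_mech_probs` is a hypothetical helper name.

```python
import math

def exp_mech_probs(scores, epsilon, sensitivity=1.0):
    """Probability of each outcome, proportional to exp(eps * q / (2 * S))."""
    weights = {item: math.exp(epsilon * q / (2.0 * sensitivity))
               for item, q in scores.items()}
    total = sum(weights.values())
    return {item: w / total for item, w in weights.items()}

scores = {"Football": 30, "Volleyball": 25, "Basketball": 8, "Tennis": 2}

for eps in (0.1, 1.0):
    probs = exp_mech_probs(scores, eps)
    print("eps =", eps, {item: float(f"{p:.2g}") for item, p in probs.items()})
```

A larger ε concentrates the probability on the best-scoring item, while a smaller ε flattens the distribution and hides the scores more.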

+Exponential Mechanism (cont.)
n Exponential Mechanism
14
Theorem. （Exponential Mechanism）
Let q be a quality function, given a database d,
assigns a score r to each outcome. Then the
mechanism M, defined by
maintains differential privacy.

+PINQ Framework
n PINQ Framework
n PINQ is a proposed architecture for data
analysis with differential privacy
n Another operator presented in PINQ is
partition which was dubbed parallel
composition.
n the costs do not add up when queries are executed
on disjoint datasets
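The difference in budget accounting can be sketched as follows. This is not PINQ's API; `sequential_cost` and `parallel_cost` are hypothetical names, and the three queries at ε = 0.1 each are made up for illustration.

```python
def sequential_cost(epsilons):
    """Queries over the same records: the privacy costs add up."""
    return sum(epsilons)

def parallel_cost(epsilons):
    """Queries over disjoint partitions: only the largest cost is charged."""
    return max(epsilons)

per_query = [0.1, 0.1, 0.1]   # hypothetical cost of three queries

print("same dataset:       ", sequential_cost(per_query))
print("disjoint partitions:", parallel_cost(per_query))
```

This is why a decision tree is a good fit for partitioning: the children of a node hold disjoint records, so all nodes at one depth can share a single slice of the budget.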

+Method
n SuLQ-based ID3
n DiffP-ID3
n DiffP-C4.5

+SuLQ-based ID3
n Based on the SuLQ framework and uses the
Laplace mechanism.
n It makes direct use of the NoisyCount
primitive to evaluate the information gain
criterion.
n The counts required to evaluate the information
gain must be obtained for each
attribute separately.
n the budget per query is small

+SuLQ-based ID3 (cont.)
n ID3 Classification
n Split point
n max( Gain(Job), Gain(Hometown), Gain(Hobby) )
Id  Sex  Job      Hometown  Hobby
1   M    student  Hsinchu   sport
2   M    teacher  Taipei    writing
3   F    student  Hsinchu   singing
4   F    student  Taipei    singing
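The non-private split choice can be worked through on the four records. The slide does not name the class attribute; the sketch below assumes it is Sex, since the other three attributes are the split candidates.

```python
import math
from collections import Counter

rows = [
    {"Sex": "M", "Job": "student", "Hometown": "Hsinchu", "Hobby": "sport"},
    {"Sex": "M", "Job": "teacher", "Hometown": "Taipei",  "Hobby": "writing"},
    {"Sex": "F", "Job": "student", "Hometown": "Hsinchu", "Hobby": "singing"},
    {"Sex": "F", "Job": "student", "Hometown": "Taipei",  "Hobby": "singing"},
]

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def gain(rows, attribute, target="Sex"):
    """Information gain of splitting rows on attribute (target is an assumption)."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

gains = {a: gain(rows, a) for a in ("Job", "Hometown", "Hobby")}
best = max(gains, key=gains.get)
print(gains, "split on:", best)
```

Under that assumption Hobby separates the classes perfectly, Hometown carries no information, and Job falls in between.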

+SuLQ-based ID3 (cont.)
n SuLQ-based ID3 Classification
n Split point
n max( Gain(Job)+Noise, Gain(Hometown)+Noise,
Gain(Hobby)+Noise )
Id  Sex  Job      Hometown  Hobby
1   M    student  Hsinchu   sport
2   M    teacher  Taipei    writing
3   F    student  Hsinchu   singing
4   F    student  Taipei    singing

+DiffP-ID3
n Based on PINQ framework and using
exponential mechanism.
n It evaluates all attributes simultaneously
in one query, the outcome of which is
the attribute to use for splitting.
n the quality function q provided to the scores
each attribute
22

+DiffP-ID3 (cont.)
n DiffP-ID3 Classification
n Split point
n Max( Gain(M(Job)), Gain(M(Job)),
Gain(M(Hobby)) )
n PINQ Partition
23
Id
Sex
Job
Hometown
Hobby
1
M
student
Hsinchu
sport
2
M
teacher
Taipei
writing
3
F
student
Hsinchu
Singing
4
F
student
Taipei
Singing

+DiffP-ID3 (cont.)
n Which quality function should be fed into
the exponential mechanism?
n the depth constraint
n the sensitivity of the splitting criterion
n Information gain will be the most
sensitive to noise, and Max operator will
be the least sensitive to noise.
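Two candidate quality functions, Max and Gini, might be sketched like this and fed to the exponential mechanism. The formulas are a common formulation rather than a transcription from the paper, the sensitivities (1 for Max, 2 for Gini) are assumed here and should be checked against the paper, and the class label Sex and ε = 1 are illustrative choices.

```python
import math

def q_max(table):
    """Max operator: records in the majority class of each branch (assumed sensitivity 1)."""
    return sum(max(by_cls.values()) for by_cls in table.values())

def q_gini(table):
    """Negated size-weighted Gini impurity of the branches (assumed sensitivity 2)."""
    score = 0.0
    for by_cls in table.values():
        n = sum(by_cls.values())
        if n > 0:
            score -= n * (1.0 - sum((c / n) ** 2 for c in by_cls.values()))
    return score

def selection_probs(scores, epsilon, sensitivity):
    """Exponential mechanism: probability proportional to exp(eps * q / (2 * S))."""
    weights = {a: math.exp(epsilon * q / (2.0 * sensitivity))
               for a, q in scores.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

# Class counts per attribute value on the toy table, with Sex as the assumed class.
tables = {
    "Job":      {"student": {"M": 1, "F": 2}, "teacher": {"M": 1, "F": 0}},
    "Hometown": {"Hsinchu": {"M": 1, "F": 1}, "Taipei": {"M": 1, "F": 1}},
    "Hobby":    {"sport": {"M": 1, "F": 0}, "writing": {"M": 1, "F": 0},
                 "singing": {"M": 0, "F": 2}},
}

epsilon = 1.0  # hypothetical budget for this single split
max_scores = {a: q_max(t) for a, t in tables.items()}
gini_scores = {a: q_gini(t) for a, t in tables.items()}
probs_max = selection_probs(max_scores, epsilon, sensitivity=1.0)
probs_gini = selection_probs(gini_scores, epsilon, sensitivity=2.0)
print(max_scores, probs_max)
print(gini_scores, probs_gini)
```

A lower sensitivity means the same ε separates good attributes from bad ones more sharply, which is the trade-off the slide points at.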

+DiffP-C4.5
n One important extension is the ability to
handle continuous attributes.
n First, the domain is divided into ranges where
the score is constant. Each range is
considered a discrete option.
n Then, a point from the range is sampled with
uniform distribution and returned as the output
of the exponential mechanism.
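The two steps for a continuous attribute could be sketched as below. Weighting each range by its length as well as by its score is an assumption of this sketch (it keeps the overall density over split points proportional to exp(εq/2S)); the ranges, scores, and the name `sample_split_point` are made up for illustration.

```python
import math
import random

def sample_split_point(ranges, epsilon, sensitivity, rng):
    """ranges is a list of (low, high, score), with a constant score inside each range."""
    # Step 1: treat each range as a discrete option, weighted by score and length.
    weights = [math.exp(epsilon * score / (2.0 * sensitivity)) * (high - low)
               for low, high, score in ranges]
    low, high, _ = rng.choices(ranges, weights=weights, k=1)[0]
    # Step 2: return a point drawn uniformly from the chosen range.
    return rng.uniform(low, high)

# Hypothetical attribute over [0, 100] whose score changes at 40 and 70.
ranges = [(0.0, 40.0, 3.0), (40.0, 70.0, 5.0), (70.0, 100.0, 2.0)]
rng = random.Random(0)
point = sample_split_point(ranges, epsilon=1.0, sensitivity=1.0, rng=rng)
print("sampled split point:", point)
```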

+Experiment
n It define a domain with ten nominal
attributes and a class attribute from
another paper.
n It introduces noise to the samples by
reassigning attributes and classes,
replacing each value with probability
noise.
n For testing, it generated similarly a
noiseless test set with 10, 000 records.
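The noise step might be sketched as follows. How the replacement value is drawn is not stated on the slide; this sketch assumes a uniform draw from the attribute's domain, and the attribute names, domains, and the 0.1 probability are made up.

```python
import random

def add_noise(record, domains, p_noise, rng):
    """Replace each value independently, with probability p_noise, by a uniform draw."""
    return {name: (rng.choice(domains[name]) if rng.random() < p_noise else value)
            for name, value in record.items()}

# Hypothetical domains for two nominal attributes and the class attribute.
domains = {"a1": ["x", "y", "z"], "a2": ["p", "q"], "class": ["0", "1"]}
record = {"a1": "x", "a2": "q", "class": "1"}

rng = random.Random(0)
noisy_record = add_noise(record, domains, p_noise=0.1, rng=rng)
print(noisy_record)
```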

+Experiment (cont.)
n the average accuracy is higher as more training
samples are available
n the influence of the noise weakens as the
number of samples grows using Gini and Max

+Experiment (cont.)
n three of the ten attributes were replaced with
numeric attributes over the domain [0, 100]
n Figure 4 presents the results of a similar
experiment

+Experiment (cont.)
n for smaller training sets, ID3 allows for better
accuracy
n for larger training sets, C4.5 is better than ID3

+Experiment (cont.)
n the accuracy results presented in Figure 6 were
around 5% and even lower than the results
presented in Figure 7
n when the size of the dataset is small, algorithms
that make efficient use of the privacy budget are
superior

+Conclusion
n When the number of training samples is
relatively small or the privacy constraints
set by the data provider are very limiting,
the sensitivity of the calculations
becomes crucial.

+Future work
n One solution might be to consider other
stopping rules when selecting nodes,
trading possible improvements in
accuracy for increased stability.
n In addition, it may be fruitful to consider
different tactics for budget distribution.

+Thanks for listening.
v123582@gmail.com
