3. +Introduction
n There is great value in data mining
solutions that provide both
n reliable privacy guarantees
n acceptable accuracy
n Differential privacy
n computations are insensitive to changes in
any particular individual's record
4. +Introduction (cont.)
n Once an individual is certain that his or
her data will remain private, being opted
in or out of the database should make
little difference.
5. +Introduction (cont.)
n Example 1
Name   Result
Tom    0
Jack   1
Henry  1
Diego  0
Alice  ?
n f(i) = count(i): the number of records with Result = 1 among the first i rows
n Alice is row i = 5
n count(5) – count(4) is exactly Alice's result
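The differencing attack in Example 1 can be sketched in a few lines of Python. Alice's result is unknown on the slide, so the value 1 used below is a hypothetical stand-in, and `count` is an illustrative helper rather than part of any library.

```python
# Toy records from Example 1. Alice's true result is not given on the slide;
# the 1 used here is a hypothetical value for illustration only.
records = [("Tom", 0), ("Jack", 1), ("Henry", 1), ("Diego", 0), ("Alice", 1)]


def count(i):
    """Exact number of positive results among the first i records."""
    return sum(result for _, result in records[:i])


# Two exact counting queries that differ only by Alice's record.
leaked = count(5) - count(4)
print(leaked)  # the difference equals Alice's result
```

Because both answers are exact, subtracting them isolates the single record that differs, which is the leak differential privacy is meant to prevent.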
6. +Introduction (cont.)
n Example 2
n We can make an informed guess about the
target record from the information in the
other records.
Id  Sex  Job      Hometown  Hobby
1   M    student  Hsinchu   sport
2   M    teacher  Taipei    writing
3   F    student  Hsinchu   Singing
4   F    student  Taipei    Singing
5   ?    ?        ?         ?
7. +Introduction (cont.)
n Goal: count(5) – count(4) ≈ 0
n Goal: "computations are insensitive to
changes in any particular individual's
record"
9. +Differential Privacy
n Differential privacy
[Figure; axes: Output, Probability]
• M: a randomized computation
• f: a query function
• D, D': datasets that differ in a single record
(symmetric difference of size 1)
10. +Differential Privacy (cont.)
n Differential privacy
Definition (ε-differential privacy).
We say a randomized computation M provides
ε-differential privacy if, for any datasets A and B with
symmetric difference |AΔB| = 1 and any set of possible
outcomes S ⊆ Range(M),
$$\Pr[M(A) \in S] \le e^{\varepsilon} \cdot \Pr[M(B) \in S].$$
11. +Laplace Mechanism
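n Example of Laplace Mechanism
Name   Result
Tom    0
Jack   1
Henry  1
Diego  0
Alice  ?
n count(4) = 2 + noise(4)
n count(5) = 3 + noise(5) (if Alice's result is 1)
n count(5) – count(4) = 1 + noise(5) – noise(4), so the
difference no longer pins down Alice's result; the
probability of any noisy answer changes by a factor
of at most e^ε when her record is added or removed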
12. +Laplace Mechanism
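n Laplace Mechanism
Theorem (Laplace mechanism).
Given a function f : D → R^d over an arbitrary domain D, the
computation
$$M(X) = f(X) + \left(\mathrm{Laplace}\left(S(f)/\varepsilon\right)\right)^{d}$$
provides ε-differential privacy, where S(f) is the sensitivity of f:
the largest change in f between two datasets with symmetric difference 1.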
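As a minimal sketch of the Laplace mechanism applied to a counting query, the Python below adds Laplace noise whose scale is the sensitivity divided by ε, assuming a sensitivity of 1 for a count. The helper names (`laplace_noise`, `noisy_count`, `laplace_pdf`) and the value ε = 0.5 are illustrative assumptions. The last part checks numerically, on a grid of outputs, that the output densities for the neighbouring true counts 2 and 3 never differ by more than a factor of e^ε.

```python
import math
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as a scaled difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def noisy_count(true_count, epsilon, rng):
    """Counting query (sensitivity assumed to be 1) plus Laplace(1/epsilon) noise."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


def laplace_pdf(x, mean, scale):
    """Density of the Laplace distribution centred at mean."""
    return math.exp(-abs(x - mean) / scale) / (2.0 * scale)


epsilon = 0.5  # hypothetical privacy parameter
rng = random.Random(0)
print(noisy_count(2, epsilon, rng), noisy_count(3, epsilon, rng))

# Compare the output densities for the neighbouring true counts 2 and 3.
grid = [x / 10.0 for x in range(-100, 101)]
worst_ratio = max(
    laplace_pdf(x, 2, 1.0 / epsilon) / laplace_pdf(x, 3, 1.0 / epsilon)
    for x in grid
)
print(worst_ratio, math.exp(epsilon))
```

A smaller ε widens the noise and tightens the bound on the ratio, at the price of less accurate counts.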
14. +Exponential Mechanism (cont.)
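n Exponential Mechanism
Theorem (Exponential mechanism).
Let q be a quality function that, given a database d,
assigns a score to each outcome r, and let S(q) be the
sensitivity of q. Then the mechanism M, defined by
$$M(d, q) = \{\text{return } r \text{ with probability} \propto \exp\left(\varepsilon\, q(d, r) / (2 S(q))\right)\},$$
maintains ε-differential privacy.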
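A minimal Python sketch of the exponential mechanism follows. It assumes the standard form in which an outcome r is returned with probability proportional to exp(ε·q(d, r) / (2·S)), with S the sensitivity of the quality function. The function name, the toy scores, and the parameter values are hypothetical.

```python
import math
import random


def exponential_mechanism(outcomes, quality, epsilon, sensitivity, rng):
    """Return one outcome r with probability proportional to
    exp(epsilon * quality(r) / (2 * sensitivity))."""
    scores = [quality(r) for r in outcomes]
    top = max(scores)  # shifting by the best score avoids overflow; ratios are unchanged
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(outcomes, weights=weights, k=1)[0]


# Hypothetical quality scores for three candidate outcomes.
toy_scores = {"a": 3.0, "b": 1.0, "c": 0.0}
rng = random.Random(1)
picks = [
    exponential_mechanism(list(toy_scores), toy_scores.get, 1.0, 1.0, rng)
    for _ in range(2000)
]
print({r: picks.count(r) for r in toy_scores})
```

Higher-scoring outcomes are chosen more often, but every outcome keeps a non-zero probability, which is what hides the influence of any single record.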
15. +PINQ Framework
n PINQ Framework
n PINQ is a proposed architecture for data
analysis with differential privacy.
n Another operator presented in PINQ is
Partition, which takes advantage of a property
dubbed parallel composition:
n the privacy costs do not add up when queries are
executed on disjoint datasets
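The budget accounting behind parallel composition can be illustrated with a small Python sketch. It is not PINQ's API. The two helper functions are hypothetical and only contrast queries on the same data, where the costs add up, with queries on disjoint partitions, where the overall cost is the largest single cost.

```python
def sequential_cost(epsilons):
    """Queries over the same records: the privacy costs add up."""
    return sum(epsilons)


def parallel_cost(epsilons):
    """Queries over disjoint partitions: the cost is the largest single cost."""
    return max(epsilons)


# Hypothetical per-query costs for three partitions.
per_partition = [0.1, 0.1, 0.1]
print(sequential_cost(per_partition), parallel_cost(per_partition))
```

This is why partitioning the data at each tree node is attractive: the counts for sibling branches can share one slice of the budget.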
19. +SuLQ-based ID3
n Based on the SuLQ framework, using the
Laplace mechanism.
n It makes direct use of the NoisyCount
primitive to evaluate the information gain
criterion.
n The counts required to evaluate the information
gain must be obtained for each
attribute separately,
n so the privacy budget per query is small
20. +SuLQ-based ID3
n ID3 Classification
n Split point
n max( Gain(Job), Gain(Hometown), Gain(Hobby) )
Id  Sex  Job      Hometown  Hobby
1   M    student  Hsinchu   sport
2   M    teacher  Taipei    writing
3   F    student  Hsinchu   Singing
4   F    student  Taipei    Singing
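The split selection on this slide can be worked through on the toy table. The sketch below computes the information gain of each candidate attribute and picks the maximum. Treating Sex as the class attribute is an assumption, since the slide does not say which column is the class, and the function names are illustrative.

```python
import math
from collections import Counter

# Toy table from the slide; using Sex as the class attribute is an assumption.
rows = [
    {"Sex": "M", "Job": "student", "Hometown": "Hsinchu", "Hobby": "sport"},
    {"Sex": "M", "Job": "teacher", "Hometown": "Taipei", "Hobby": "writing"},
    {"Sex": "F", "Job": "student", "Hometown": "Hsinchu", "Hobby": "Singing"},
    {"Sex": "F", "Job": "student", "Hometown": "Taipei", "Hobby": "Singing"},
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())


def gain(rows, attribute, target="Sex"):
    """Information gain obtained by splitting rows on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


gains = {a: gain(rows, a) for a in ("Job", "Hometown", "Hobby")}
best = max(gains, key=gains.get)
print(gains, best)
```

Under that assumption Hobby separates the two classes perfectly, so it has the largest gain, while Hometown carries no information about the class.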
21. +SuLQ-based ID3
n SuLQ-based ID3 Classification
n Split point
n max( Gain(Job) + Noise, Gain(Hometown) + Noise,
Gain(Hobby) + Noise )
Id  Sex  Job      Hometown  Hobby
1   M    student  Hsinchu   sport
2   M    teacher  Taipei    writing
3   F    student  Hsinchu   Singing
4   F    student  Taipei    Singing
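One way to picture the noisy split criterion is to compute the information gain from noisy counts only. The Python below is a rough sketch under several assumptions: each count receives independent Laplace noise with scale 1/ε, negative noisy counts are clipped to zero, and ε is the budget per count rather than the total budget. It is not the SuLQ implementation, and all names are hypothetical.

```python
import math
import random


def noisy_count(true_count, epsilon, rng):
    """NoisyCount sketch: exact count plus Laplace(1/epsilon) noise."""
    noise = (rng.expovariate(1.0) - rng.expovariate(1.0)) / epsilon
    return true_count + noise


def entropy_from_counts(counts):
    """Entropy (bits) from class counts; non-positive counts are ignored."""
    positive = [c for c in counts if c > 0]
    total = sum(positive)
    if total <= 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in positive)


def noisy_gain(class_counts_by_value, epsilon, rng):
    """Information gain of one attribute, computed only from noisy counts.
    class_counts_by_value maps each attribute value to its exact class counts."""
    noisy = {
        value: [max(noisy_count(c, epsilon, rng), 0.0) for c in counts]
        for value, counts in class_counts_by_value.items()
    }
    class_totals = [sum(column) for column in zip(*noisy.values())]
    n = sum(class_totals)
    if n <= 0:
        return 0.0
    remainder = sum(sum(cs) / n * entropy_from_counts(cs) for cs in noisy.values())
    return entropy_from_counts(class_totals) - remainder


# Job attribute of the toy table, with class counts ordered (M, F);
# using Sex as the class attribute is an assumption.
job_counts = {"student": [1, 2], "teacher": [1, 0]}
print(noisy_gain(job_counts, 1.0, random.Random(0)))
```

Every attribute needs its own set of noisy counts, so the budget is split across many queries and each count becomes noisier.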
22. +DiffP-ID3
n Based on the PINQ framework, using the
exponential mechanism.
n It evaluates all attributes simultaneously
in one query, the outcome of which is
the attribute to use for splitting.
n the quality function q provided to the
exponential mechanism scores each attribute
23. +DiffP-ID3 (cont.)
n DiffP-ID3 Classification
n Split point
n Max( Gain(M(Job)), Gain(M(Hometown)),
Gain(M(Hobby)) )
n PINQ Partition
Id  Sex  Job      Hometown  Hobby
1   M    student  Hsinchu   sport
2   M    teacher  Taipei    writing
3   F    student  Hsinchu   Singing
4   F    student  Taipei    Singing
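Choosing the split attribute in a single private query can be sketched as follows. The Python below scores each attribute with a Max-style quality function (for each attribute value, the size of its majority class, summed over values) and samples one attribute with the exponential mechanism. Using Sex as the class attribute, a sensitivity of 1 for this score, and the function names are all assumptions made for illustration.

```python
import math
import random
from collections import Counter, defaultdict

# Toy table from the slide; using Sex as the class attribute is an assumption.
rows = [
    {"Sex": "M", "Job": "student", "Hometown": "Hsinchu", "Hobby": "sport"},
    {"Sex": "M", "Job": "teacher", "Hometown": "Taipei", "Hobby": "writing"},
    {"Sex": "F", "Job": "student", "Hometown": "Hsinchu", "Hobby": "Singing"},
    {"Sex": "F", "Job": "student", "Hometown": "Taipei", "Hobby": "Singing"},
]


def max_score(rows, attribute, target="Sex"):
    """Max-style score: majority-class size per attribute value, summed."""
    by_value = defaultdict(Counter)
    for r in rows:
        by_value[r[attribute]][r[target]] += 1
    return sum(max(c.values()) for c in by_value.values())


def choose_split(rows, attributes, epsilon, rng, sensitivity=1.0):
    """Sample one attribute with probability proportional to
    exp(epsilon * score / (2 * sensitivity))."""
    scores = [max_score(rows, a) for a in attributes]
    top = max(scores)  # shift for numerical stability
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(attributes, weights=weights, k=1)[0]


attributes = ["Job", "Hometown", "Hobby"]
print({a: max_score(rows, a) for a in attributes})
print(choose_split(rows, attributes, 1.0, random.Random(0)))
```

Only the chosen attribute leaves the mechanism, so one query replaces the separate noisy counts that a per-attribute evaluation would need.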
24. +DiffP-ID3 (cont.)
n Which quality function should be fed into
the exponential mechanism?
n the depth constraint
n the sensitivity of the splitting criterion
n Information gain will be the most
sensitive to noise, and the Max operator will
be the least sensitive to noise.
25. +DiffP-C4.5
n One important extension is the ability to
handle continuous attributes.
n First, the domain is divided into ranges where
the score is constant. Each range is
considered a discrete option.
n Then, the exponential mechanism chooses a range,
and a point sampled uniformly from that range
is returned as its output.
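The two steps for continuous attributes can be sketched in Python as below. The input is a list of ranges on which the score is constant. Weighting each range by its length, so that the density over the whole domain is proportional to exp(ε·score / (2·S)), is an assumption of this sketch, as are the function name, the toy ranges, and the sensitivity of 1.

```python
import math
import random


def sample_split_point(ranges, epsilon, rng, sensitivity=1.0):
    """ranges: list of (low, high, score), with a constant score on each range.
    Pick a range with the exponential mechanism, then a uniform point in it."""
    top = max(score for _, _, score in ranges)
    # Weighting by range length is an assumption of this sketch.
    weights = [
        (high - low) * math.exp(epsilon * (score - top) / (2.0 * sensitivity))
        for low, high, score in ranges
    ]
    low, high, _ = rng.choices(ranges, weights=weights, k=1)[0]
    return rng.uniform(low, high)


# Hypothetical score profile of candidate split points over the domain [0, 100].
toy_ranges = [(0.0, 30.0, 2.0), (30.0, 70.0, 5.0), (70.0, 100.0, 1.0)]
print(sample_split_point(toy_ranges, 1.0, random.Random(0)))
```

Since the score is constant inside a range, every point of the selected range is equally good, which is why the second step can be a plain uniform draw.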
27. +Experiment
n The experiment defines a domain with ten nominal
attributes and a class attribute, taken from
another paper.
n Noise is introduced into the samples by
reassigning attributes and classes,
replacing each value with a given noise
probability.
n For testing, a noiseless test set with
10,000 records was generated in the same way.
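The noise injection step can be sketched as follows. Each value of a record is replaced independently with probability `p_noise`. Drawing the replacement uniformly from the attribute's domain is an assumption, since the slide does not say how the new value is chosen, and the record, domains, and function name are hypothetical.

```python
import random


def add_noise(record, domains, p_noise, rng):
    """Replace each value, independently with probability p_noise, by a value
    drawn uniformly from that attribute's domain (an assumed replacement rule)."""
    return {
        name: (rng.choice(domains[name]) if rng.random() < p_noise else value)
        for name, value in record.items()
    }


# Hypothetical two-attribute domain plus a class attribute.
domains = {"a1": ["x", "y", "z"], "a2": ["u", "v"], "class": ["pos", "neg"]}
record = {"a1": "x", "a2": "v", "class": "pos"}
print(add_noise(record, domains, 0.1, random.Random(0)))
```

Setting `p_noise` to 0 leaves the record untouched, which corresponds to the noiseless test set.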
28. +Experiment (cont.)
n the average accuracy is higher when more training
samples are available
n with Gini and Max, the influence of the noise
weakens as the number of samples grows
29. +Experiment (cont.)
n three of the ten attributes were replaced with
numeric attributes over the domain [0, 100]
n Figure 4 presents the results of a similar
experiment
30. +Experiment (cont.)
n for smaller training sets, ID3 allows for better
accuracy
n for larger training sets, C4.5 is better than ID3
31. +Experiment (cont.)
n the accuracy results presented in Figure 6 were
around 5% and even lower than the results
presented in Figure 7
n when the size of the dataset is small, algorithms
that make efficient use of the privacy budget are
superior
33. +Conclusion
n When the number of training samples is
relatively small or the privacy constraints
set by the data provider are very limiting,
the sensitivity of the calculations
becomes crucial.
34. +Future work
n One solution might be to consider other
stopping rules when selecting nodes,
trading possible improvements in
accuracy for increased stability.
n In addition, it may be fruitful to consider
different tactics for budget distribution.