Random Data Perturbation Techniques and Privacy Preserving Data Mining
(Authors: H. Kargupta, S. Datta, Q. Wang & K. Sivakumar)
April 26, 2005
Gunjan Gupta
Privacy & Good Service: Often Conflicting Goals
• Privacy
– Customer: I don't want you to share my personal information with anyone.
– Business: I don't want to share my data with a competitor.
• Quantity, Cost & Quality of Service
– Customer: I want you to provide me good quality of service at a lower cost.
• Paradox: lower cost often comes from being able to use or share sensitive data that can be used or misused:
– Provide better service by predicting consumer needs better, or sell information to marketers.
– Optimize load sharing between competing utilities, or preempt competition.
– A doctor saving a patient by knowing the patient's history, or an insurance company declining coverage to individuals with preexisting conditions.
Central Question:
Can we use privacy-sensitive data to optimize the cost and quality of a service without compromising any privacy?
Short Answer: 
No!
Long Answer:
Maybe compromise a small amount of privacy (a small cost increase) to substantially improve the quality and cost of service (a large cost savings).
Why are anonymous exact records not so secure?
• Example: medical insurance premium estimation based on patient history.
– Predictive fields are often generic: age, sex, disease history, first two digits of zip code (not allowed in Germany), number of kids, etc.
– Specifics such as record id (key), name, and address are omitted.
• This can easily be broken by matching non-secure records against the secure anonymous records:
– Yellowpages: Susan Calvin, 121 Norwood Cr., Austin, TX-78753.
– Personal website: "Hi, I am Susan, and here are pictures of me, my husband, and my 3 wonderful kids from my 43rd birthday party!"
– Anonymous "privacy preserving records": Female, 43, 3 kids, 78---, married (anonymous medical record 1); Female, 43, 2 kids, 78---, single (anonymous medical record 2).
– An internal human plus an automated hacker links these sources.
– Broken exact record: Susan Calvin, 43, 3 kids, address, 78733, now a labeled medical record!
Two Approaches to Privacy Preserving Data Mining
• Distributed:
– Suitable for multi-party platforms. Share sub-models.
– Unsupervised: Ensemble Clustering, Privacy Preserving Clustering, etc.
– Supervised: Meta-learners, Fourier Spectrum Decision Trees, Collective Hierarchical Clustering, and so on.
– Secure-communication based: secure sum, secure scalar product.
• Random Data Perturbation (our focus):
– Perturb data by small amounts to protect the privacy of individual records.
– Preserve the intrinsic distributions necessary for modeling.
Recovering approximately correct anonymous features also breaks privacy
• Somewhat inexactly recovered anonymous record values might also be sufficient:
– Yellowpages: Susan Calvin, 121 Norwood Cr., Austin, TX-78753.
– Personal website: "Hi, I am Susan, and here are pictures of me, my husband, and my 3 wonderful kids from my 43rd birthday party!"
– "Denoised" privacy preserving records: Female, 44.5, 3.2 kids, 78---, married (anonymous medical record 1); Female, 42.2, 2.1 kids, 78---, single (anonymous medical record 2).
– An internal human plus an automated hacker links these sources.
– Broken exact record: Susan Calvin, 43, 3 kids, address, 78733, now a labeled medical record!
Anonymous records (with or without small perturbations) are not secure: not a recently noticed phenomenon
• 1979, Denning & Denning, The Tracker: A Threat to Statistical Database Security:
– Show why anonymous records are not secure.
– Show an example of recovering the exact salary of a professor from anonymous records.
– Present a general algorithm for an Individual Tracker.
– Give a formal probabilistic model and a set of conditions under which a dataset supports such a tracker.
• 1984, Traub & Yemini, The Statistical Security of a Statistical Database:
– No free lunch: perturbations cause an irrecoverable loss in model accuracy.
– However, the holy grail of random perturbation:
We can try to find a perturbation algorithm that best trades off loss of privacy against loss of model accuracy.
Recovering Perturbed Distributions: Earlier Work
• Reconstructing the original distribution from the perturbed one. Setup:
– n samples U1, U2, ..., Un.
– n noise values V1, V2, ..., Vn, all drawn from a public (known) distribution V.
– Visible noisy data: W1 = U1 + V1, W2 = U2 + V2, ...
– Assumption: such noise still allows you to recover the distribution of U1, U2, ..., Un, but not the individual records.
• Two well-known methods and privacy definitions (written out below):
– Agrawal & Srikant: interval based. Privacy(X) at confidence 0.95 = X2 - X1.
– Agrawal & Aggarwal: distributional. Privacy(X) = 2^h(X), where h(X) is the differential entropy of X.
[Figure: original density f(x) and reconstructed density f'(x), with the interval (X1, X2) marked.]
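Written out (a reconstruction from the cited definitions; the confidence level c and the interval endpoints x1, x2 are notation introduced here for clarity), the two privacy measures are:

% Interval-based privacy (Agrawal & Srikant): the width of an interval that
% contains the original value X with confidence c (e.g. c = 0.95).
\mathrm{Privacy}_c(X) = x_2 - x_1, \qquad \Pr\{x_1 \le X \le x_2\} \ge c

% Distribution-based privacy (Agrawal & Aggarwal): 2 raised to the
% differential entropy h(X) of X with density f_X.
h(X) = -\int_{\Omega_X} f_X(x)\,\log_2 f_X(x)\,dx, \qquad \Pi(X) = 2^{h(X)}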
Interval Based Method: Agrawal & Srikant in More Detail
• n samples U1, U2, ..., Un.
• n noise values V1, V2, ..., Vn, all drawn from a public (known) distribution V.
• W1 = U1 + V1, W2 = U2 + V2, ...
• Visible noisy data: W1, W2, W3, ...
Given the noise density fV, Bayes' rule gives the cumulative posterior distribution of U in terms of the visible w, fV, and the unknown desired density fU.
Differentiating w.r.t. u gives an important recursive definition (both equations are written out below).
Notation issue (in the paper): f' simply means the approximation of the true f, not the derivative of f!
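The two equations referenced above appeared as images in the original slides; reconstructed from the surrounding definitions (with w_i = u_i + v_i visible and f_U the unknown density), they read:

% Posterior CDF of U given the visible w_i and the noise density f_V:
F_U'(a) = \frac{1}{n}\sum_{i=1}^{n}
          \frac{\int_{-\infty}^{a} f_V(w_i - z)\, f_U(z)\, dz}
               {\int_{-\infty}^{\infty} f_V(w_i - z)\, f_U(z)\, dz}

% Differentiating with respect to a gives the recursive density estimate:
f_U'(a) = \frac{1}{n}\sum_{i=1}^{n}
          \frac{f_V(w_i - a)\, f_U(a)}
               {\int_{-\infty}^{\infty} f_V(w_i - z)\, f_U(z)\, dz}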
Interval Based Method: Agrawal & Srikant in More Detail
Algorithm in practice, iterating from step j to step j+1 (see the Python sketch below):
• Seed with a uniform distribution for j = 0.
• Replace the outer integration with a summation over the i.i.d. samples.
• Sum over discrete z intervals instead of the inner integral, for speed.
• Does it converge only to a local minimum? An initialization other than uniform might give a different result; not explored by the authors.
• For large enough samples, we hope to get close to the true distribution.
• Stop when f_U^(j+1) - f_U^(j) becomes small.
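A minimal sketch of this iterative reconstruction in Python, assuming Gaussian perturbing noise with known sigma and a simple equal-width interval grid (the test data, grid size, and stopping threshold are illustrative choices, not taken from the paper):

import numpy as np

def reconstruct_distribution(w, noise_pdf, bins=50, tol=1e-4, max_iter=200):
    # Iteratively estimate the density f_U of the original data from noisy
    # observations w_i = u_i + v_i, given the (public) noise density f_V.
    lo, hi = w.min(), w.max()
    edges = np.linspace(lo, hi, bins + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])        # interval midpoints z
    width = edges[1] - edges[0]

    f_u = np.full(bins, 1.0 / (hi - lo))         # seed: uniform density
    for _ in range(max_iter):
        fv = noise_pdf(w[:, None] - mids[None, :])       # f_V(w_i - z), (n, bins)
        numer = fv * f_u[None, :]
        denom = numer.sum(axis=1, keepdims=True) * width  # per-sample normalizer
        f_new = (numer / denom).mean(axis=0)              # average over samples
        f_new /= f_new.sum() * width                      # renormalize density
        if np.abs(f_new - f_u).max() < tol:
            break
        f_u = f_new
    return mids, f_u

# Example: triangular original data perturbed with Gaussian noise of known sigma.
rng = np.random.default_rng(0)
u = rng.triangular(0.0, 5.0, 10.0, size=5000)
sigma = 2.0
w = u + rng.normal(0.0, sigma, size=u.shape)
noise_pdf = lambda x: np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
mids, f_hat = reconstruct_distribution(w, noise_pdf)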
Interval Based Method: good results for a variety of noise distributions
Revisiting an Essential Assumption in Random Perturbation
Assumption: such noise still allows you to recover the distribution of U1, U2, ..., Un, but not the individual records.
• The authors of this paper challenge this assumption.
• They claim the added randomness can be mostly visual rather than real: many simple forms of random perturbation are "breakable".
Exploit predictable properties of random data to design a filter that breaks the perturbation "encryption"?
[Figure: spiral data vs. random data; for the random data, all eigenvalues are close to 1.]
Spectral Filtering
Main idea: use the eigenvalue properties of the noise to filter it out.
[Figure, three panels: the U+V data; the decomposition of the eigenvalues of the noise and the original data; the recovered data.]
Decomposing eigenvalues: separating data from noise
Let U and V be the m x n data and noise matrices, and U_P = U + V the perturbed matrix.
Covariance matrix of U_P:
U_P^T U_P = (U + V)^T (U + V) = U^T U + V^T U + U^T V + V^T V
Since signal and noise are uncorrelated in random perturbation, for a large number of observations V^T U ~ 0 and U^T V ~ 0, therefore
U_P^T U_P ~ U^T U + V^T V
Since the above three matrices are covariance matrices, they are symmetric and positive semi-definite, so we can perform an eigen decomposition:
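The decomposition referred to here (shown as an image in the original slide) can be written, reconstructed from the surrounding definitions, as:

% Eigen decomposition of the (approximate) covariance of the perturbed data:
U_P^{T} U_P \approx U^{T}U + V^{T}V
            = Q_U \Lambda_U Q_U^{T} + Q_V \Lambda_V Q_V^{T}

where Q_U, Q_V hold the eigenvectors and Λ_U, Λ_V the eigenvalues of the data and noise covariance matrices, respectively.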
With a fair amount of algebra and theorems from matrix perturbation theory, the authors show that in the limit (lots of data):
Wigner's law describes the distribution of eigenvalues for random matrices:
• The eigenvalues of the noise component V concentrate, with high probability, in a narrow range bounded by λmin and λmax (example on the next page).
• This allows us to compute λmin and λmax. Solution!
Giving us the following algorithm:
1. Compute the eigenvalues of the (covariance of the) perturbed data P.
2. Separate out all eigenvalues inside [λmin, λmax] and save their indices I_V.
3. The remaining eigenvalues are the "perturbed" but non-noise eigenvalues coming from the true data U: save their indices I_U.
4. Break the perturbed eigenvector matrix Q_P into A_U = Q_P(I_U) and A_V = Q_P(I_V).
5. Estimate the true data as the projection:
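A minimal sketch of the whole attack in Python, under the assumption of i.i.d. additive noise with known (or estimated) standard deviation sigma. The bounds lambda = sigma^2 (1 -/+ 1/sqrt(Q))^2 with Q = m/n are the standard random-matrix bounds the slide alludes to, and the step-5 projection is taken to be U_hat = U_P A_U A_U^T; the low-rank test data is illustrative:

import numpy as np

def spectral_filter(P, sigma):
    # Estimate the original data U from perturbed data P = U + V, where V has
    # i.i.d. entries with standard deviation sigma (assumed known or estimated).
    m, n = P.shape                                  # m observations, n features
    q = m / n
    # Random-matrix bounds on the eigenvalues of the pure-noise covariance.
    lam_min = sigma**2 * (1.0 - 1.0 / np.sqrt(q))**2
    lam_max = sigma**2 * (1.0 + 1.0 / np.sqrt(q))**2

    cov = (P.T @ P) / m                             # covariance of perturbed data
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Eigenvalues inside the noise band are attributed to V; the rest to U.
    noise_band = (eigvals >= lam_min) & (eigvals <= lam_max)
    A_u = eigvecs[:, ~noise_band]                   # eigenvectors of the true data

    # Step-5 projection onto the signal subspace: U_hat = P A_u A_u^T.
    return P @ A_u @ A_u.T

# Illustrative use: data living in a 2-D subspace of 10 features, plus noise.
rng = np.random.default_rng(1)
m, n, k = 2000, 10, 2
U = rng.normal(size=(m, k)) @ rng.normal(size=(k, n))
sigma = 1.0
P = U + rng.normal(0.0, sigma, size=U.shape)
U_hat = spectral_filter(P, sigma)                   # de-perturbed estimate of U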
Exploit predictable properties of random data to design a filter that breaks the perturbation "encryption"?
[Figure: spiral data vs. random data; for the random data, all eigenvalues are close to 1.]
Results: quality of eigenvalue recovery
Only the real (signal) eigenvalues got captured, thanks to the nice automatic thresholding!
Results: comparison with the Agrawal & Srikant reconstruction
[Figure: Agrawal & Srikant reconstruction without breaking the encryption vs. Agrawal & Srikant reconstruction estimated from the broken encryption.]
Discussion
• An amazing amount of experimental results and comparisons are presented by the authors in the journal version.
• Extension to the situation where the form of the perturbing distribution is known but its exact first, second, or higher-order statistics are not: discussed but not presented.
• Comparison of performance with other obvious noise-reduction techniques from the signal processing community:
– Moving averages and Wiener filtering.
– PCA-based filtering.
• Pros and cons of the perturbation analysis by the authors (and in general):
– Effect of more and more noise: rapid degradation of results.
– Problems dealing with noise inherent in the original data.
– The technique fails when features are independent of each other, because it exploits the covariance matrix. This points to a major improvement possibility for the "encryption": perform ICA/PCA and then randomize?
– Results suggest that more complex noise models might be harder to break. Not clear whether this improves the privacy vs. model-quality tradeoff.
– The eigen decomposition has an inherent metric assumption?
A not-so-ominous* application of noise filtering: the Nulling Interferometer on Terrestrial Planet Finder-I
*but maybe not, if you believe Hollywood movies such as Independence Day!
[Image: alien ship]
