+Data Mining with Differential Privacy
Arik Friedman and Assaf Schuster / KDD’10 
Chang Wei-Yuan 
2014 / 10 / 3 (Fri.) @ MakeLab Group Meeting
+Outline 
n Introduction 
n Background 
n Method 
n Experiment 
n Conclusion 
n Though 
2
+Introduction 
n There is great value in data mining 
solutions. 
n reliable privacy guarantees 
n available accuracy 
n Differential privacy 
n computations are insensitive to changes in 
any particular individual's record 
3
+Introduction (cont.) 
n Once an individual is certain that his or 
her data will remain private, being opted 
in or out of the database should make 
little difference. 
+Introduction (cont.)
n Example 1
Name    Result
Tom     0
Jack    1
Henry   1
Diego   0
Alice   ?
n f(i) = count(i), the number of 1s among the first i results
n Alice is record i = 5
n count(5) – count(4) reveals Alice's result exactly
+Introduction (cont.)
n Example 2
n We can make inferences about the target (record 5) from the available information.
Id  Sex  Job      Hometown  Hobby
1   M    student  Hsinchu   sport
2   M    teacher  Taipei    writing
3   F    student  Hsinchu   Singing
4   F    student  Taipei    Singing
5   ?    ?        ?         ?
+Introduction (cont.)
n Goal: count(5) – count(4) ≈ 0
n Goal: "computations are insensitive to changes in any particular individual's record"
+Differential Privacy
n Differential privacy
[Figure: Probability plotted against Output]
• M: a randomized computation
• f: a query function
• D, D’: datasets whose symmetric difference is one record
+Differential Privacy (cont.)
n Differential privacy
Definition. (ε-Differential Privacy)
We say a randomized computation M provides ε-differential privacy if for any datasets A and B with symmetric difference |AΔB| = 1, and any set of possible outcomes S ⊆ Range(M),
Pr[M(A) ∈ S] ≤ e^ε × Pr[M(B) ∈ S]
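+Laplace Mechanism
n Example of Laplace Mechanism
Name    Result
Tom     0
Jack    1
Henry   1
Diego   0
Alice   ?
n count(4) = 2 + noise(4)
n count(5) = 3 + noise(5)
n count(5) – count(4) no longer reveals Alice's result: with Laplace noise of scale 1/ε, the probability of any noisy answer changes by at most a factor of e^ε when her record is added or removed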
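+Laplace Mechanism (cont.)
n Laplace Mechanism
Theorem. (Laplace mechanism)
Given a function f : D → R^d over an arbitrary domain D, the computation
M(X) = f(X) + (Laplace(S(f)/ε))^d
provides ε-differential privacy, where S(f) is the sensitivity of f: the largest change in f (in L1 norm) between datasets that differ in one record.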
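A minimal sketch of the Laplace mechanism for a counting query, assuming sensitivity 1 so that the noise scale is 1/ε. The helper names `laplace_noise` and `noisy_count` are hypothetical, Alice's value is assumed to be 1, and the Laplace sample is built from two exponential draws so that only the standard library is needed.

```python
import random

def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(values, epsilon, rng=random):
    """Counting query (sensitivity 1) released with Laplace(1/epsilon) noise."""
    return sum(values) + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)          # fixed seed only to make the sketch repeatable
results = [0, 1, 1, 0, 1]       # Alice's value (last) is assumed to be 1
samples = [noisy_count(results, epsilon=1.0, rng=rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
print(round(mean, 1))           # averages out near the true count of 3
```

A single release is noisy, but the average over many releases converges to the true count, which is why every query has to be charged against the privacy budget.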
+Exponential Mechanism
n Example of Exponential Mechanism
item        q   q/Σq  ε=0.1  ε=1
Football    30  0.46  0.42   0.92
Volleyball  25  0.38  0.33   0.07
Basketball  8   0.12  0.14   1.5E-05
Tennis      2   0.03  0.10   7.7E-07
+Exponential Mechanism (cont.)
n Exponential Mechanism
Theorem. (Exponential Mechanism)
Let q be a quality function that, given a database d, assigns a score to each outcome r. Then the mechanism M, defined by
M(d, q) = { return r with probability ∝ exp(ε q(d, r) / (2 S(q))) }
maintains ε-differential privacy, where S(q) is the sensitivity of q.
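The selection probabilities in the sports example can be reproduced with a short sketch. It assumes the quality function has sensitivity 1; the function name `exponential_mechanism_probs` is hypothetical, and the scores are the q values from the example table.

```python
import math

def exponential_mechanism_probs(scores, epsilon, sensitivity=1.0):
    """Selection probabilities proportional to exp(epsilon * q / (2 * S))."""
    top = max(scores.values())  # subtract the max score for numerical stability
    weights = {item: math.exp(epsilon * (q - top) / (2.0 * sensitivity))
               for item, q in scores.items()}
    total = sum(weights.values())
    return {item: w / total for item, w in weights.items()}

scores = {"Football": 30, "Volleyball": 25, "Basketball": 8, "Tennis": 2}
for eps in (0.1, 1.0):
    probs = exponential_mechanism_probs(scores, eps)
    print(eps, {item: f"{p:.2g}" for item, p in probs.items()})
```

A larger ε concentrates the probability on the best-scoring item, while a smaller ε flattens the distribution and gives stronger privacy.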
+PINQ Framework
n PINQ Framework
n PINQ is a proposed architecture for data analysis with differential privacy
n Another operator presented in PINQ is Partition, which gives rise to the property dubbed parallel composition.
n the privacy costs do not add up when queries are executed on disjoint datasets
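The budget accounting behind parallel composition can be sketched as follows. This is not PINQ's API (PINQ is a LINQ-based framework); `partition`, `sequential_cost`, and `parallel_cost` are hypothetical helpers that only illustrate how the privacy cost is charged.

```python
from collections import defaultdict

def partition(records, key):
    """Split records into disjoint groups by the value of one attribute."""
    parts = defaultdict(list)
    for record in records:
        parts[record[key]].append(record)
    return dict(parts)

def sequential_cost(epsilons):
    """Queries over the same data: the privacy costs add up."""
    return sum(epsilons)

def parallel_cost(epsilons):
    """Queries over disjoint partitions: only the largest cost is charged."""
    return max(epsilons)

records = [
    {"Id": 1, "Job": "student"}, {"Id": 2, "Job": "teacher"},
    {"Id": 3, "Job": "student"}, {"Id": 4, "Job": "student"},
]
parts = partition(records, "Job")
per_query = [0.5 for _ in parts]      # one 0.5-DP query per partition
print(sequential_cost(per_query))     # cost if both queries hit the same data
print(parallel_cost(per_query))       # cost when each query sees one partition
```

Each record falls into exactly one partition, so it can influence only one of the queries. That is why the total cost stays at the per-query ε.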
+Method
n SuLQ-based ID3
n DiffP-ID3
n DiffP-C4.5
+SuLQ-based ID3
n Based on the SuLQ framework and using the Laplace mechanism.
n It makes direct use of the NoisyCount primitive to evaluate the information gain criterion.
n The counts required to evaluate the information gain must be obtained for each attribute separately.
n so the privacy budget per query is small
+SuLQ-based ID3 (cont.)
n ID3 Classification
n Split point
n max( Gain(Job), Gain(Hometown), Gain(Hobby) )
Id  Sex  Job      Hometown  Hobby
1   M    student  Hsinchu   sport
2   M    teacher  Taipei    writing
3   F    student  Hsinchu   Singing
4   F    student  Taipei    Singing
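The split selection on this slide can be sketched with plain (non-private) ID3 information gain. The slide does not say which column is the class label; treating Sex as the class is an assumption made here only so the example is runnable, and the function names are hypothetical.

```python
import math
from collections import Counter

ROWS = [  # toy table from the slide; "Sex" as the class label is an assumption
    {"Sex": "M", "Job": "student", "Hometown": "Hsinchu", "Hobby": "sport"},
    {"Sex": "M", "Job": "teacher", "Hometown": "Taipei", "Hobby": "writing"},
    {"Sex": "F", "Job": "student", "Hometown": "Hsinchu", "Hobby": "Singing"},
    {"Sex": "F", "Job": "student", "Hometown": "Taipei", "Hobby": "Singing"},
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute, target="Sex"):
    """Entropy of the class minus the weighted entropy after the split."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

gains = {a: information_gain(ROWS, a) for a in ("Job", "Hometown", "Hobby")}
best = max(gains, key=gains.get)
print(gains, best)
```

Under that assumption Hobby separates the classes perfectly, so it has the highest gain and would be chosen as the split attribute.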
+SuLQ-based ID3 (cont.)
n SuLQ-based ID3 Classification
n Split point
n max( Gain(Job)+Noise, Gain(Hometown)+Noise, Gain(Hobby)+Noise )
n the noise enters through the NoisyCount queries used to evaluate each gain
Id  Sex  Job      Hometown  Hobby
1   M    student  Hsinchu   sport
2   M    teacher  Taipei    writing
3   F    student  Hsinchu   Singing
4   F    student  Taipei    Singing
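A rough sketch of a noisy split choice built only from noisy counts. It is not the SuLQ implementation: `noisy_count` is a hypothetical stand-in for the NoisyCount primitive, Sex is assumed to be the class label, attribute domains are assumed to be public, and the budget is simply divided evenly among the attributes.

```python
import math
import random

ROWS = [  # toy table from the slides; using "Sex" as the class is an assumption
    {"Sex": "M", "Job": "student", "Hometown": "Hsinchu", "Hobby": "sport"},
    {"Sex": "M", "Job": "teacher", "Hometown": "Taipei", "Hobby": "writing"},
    {"Sex": "F", "Job": "student", "Hometown": "Hsinchu", "Hobby": "Singing"},
    {"Sex": "F", "Job": "student", "Hometown": "Taipei", "Hobby": "Singing"},
]

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(rows, attribute, value, target, label, epsilon, rng):
    """Hypothetical NoisyCount-style query: exact count plus Laplace noise."""
    exact = sum(1 for r in rows if r[attribute] == value and r[target] == label)
    return exact + laplace_noise(1.0 / epsilon, rng)

def noisy_split_entropy(rows, attribute, target, epsilon, rng):
    """Entropy left after splitting on attribute, computed from noisy counts."""
    values = sorted({r[attribute] for r in rows})  # domains assumed public
    labels = sorted({r[target] for r in rows})
    total = 0.0
    for value in values:
        counts = [max(noisy_count(rows, attribute, value, target, label,
                                  epsilon, rng), 1e-9)  # clamp negative counts
                  for label in labels]
        size = sum(counts)
        total -= sum(n * math.log2(n / size) for n in counts)
    return total

def choose_split(rows, attributes, target, epsilon, rng):
    """Give each attribute an equal share of epsilon; the lowest remaining
    entropy corresponds to the highest information gain."""
    share = epsilon / len(attributes)
    return min(attributes,
               key=lambda a: noisy_split_entropy(rows, a, target, share, rng))

print(choose_split(ROWS, ["Job", "Hometown", "Hobby"], "Sex", 1.0, random.Random(0)))
```

With only four records and ε = 1 split three ways, the noise can easily outweigh the counts. This illustrates why a small per-query budget hurts accuracy.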
+DiffP-ID3
n Based on the PINQ framework and using the exponential mechanism.
n It evaluates all attributes simultaneously in one query, the outcome of which is the attribute to use for splitting.
n the quality function q provided to the exponential mechanism scores each attribute
+DiffP-ID3 (cont.)
n DiffP-ID3 Classification
n Split point
n max( Gain(M(Job)), Gain(M(Hometown)), Gain(M(Hobby)) )
n PINQ Partition
Id  Sex  Job      Hometown  Hobby
1   M    student  Hsinchu   sport
2   M    teacher  Taipei    writing
3   F    student  Hsinchu   Singing
4   F    student  Taipei    Singing
+DiffP-ID3 (cont.)
n Which quality function should be fed into the exponential mechanism?
n the depth constraint
n the sensitivity of the splitting criterion
n Information gain is expected to be the most sensitive to noise, and the Max operator the least sensitive.
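As an illustration of a low-sensitivity quality function, the sketch below scores each attribute with the Max operator (the sum of majority-class sizes over the attribute's values) and feeds the scores to the exponential mechanism. Adding or removing one record changes this score by at most 1, so a sensitivity of 1 is used. Treating Sex as the class label and the function names are assumptions for illustration.

```python
import math
import random
from collections import Counter, defaultdict

ROWS = [  # toy table from the slides; "Sex" as the class label is an assumption
    {"Sex": "M", "Job": "student", "Hometown": "Hsinchu", "Hobby": "sport"},
    {"Sex": "M", "Job": "teacher", "Hometown": "Taipei", "Hobby": "writing"},
    {"Sex": "F", "Job": "student", "Hometown": "Hsinchu", "Hobby": "Singing"},
    {"Sex": "F", "Job": "student", "Hometown": "Taipei", "Hobby": "Singing"},
]

def max_operator_score(rows, attribute, target="Sex"):
    """Sum, over the attribute's values, of the size of the majority class."""
    by_value = defaultdict(Counter)
    for r in rows:
        by_value[r[attribute]][r[target]] += 1
    return sum(max(c.values()) for c in by_value.values())

def choose_split(rows, attributes, epsilon, rng, sensitivity=1.0):
    """Sample one attribute with probability ∝ exp(eps * q / (2 * S))."""
    scores = [max_operator_score(rows, a) for a in attributes]
    top = max(scores)  # shift by the max score for numerical stability
    weights = [math.exp(epsilon * (q - top) / (2.0 * sensitivity))
               for q in scores]
    return rng.choices(attributes, weights=weights, k=1)[0]

attributes = ["Job", "Hometown", "Hobby"]
print({a: max_operator_score(ROWS, a) for a in attributes})
print(choose_split(ROWS, attributes, epsilon=1.0, rng=random.Random(0)))
```

Because all attributes are scored inside one exponential-mechanism query, the whole split decision is charged ε once instead of once per attribute.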
+DiffP-C4.5 
n One important extension is the ability to 
handle continuous attributes. 
n First, the domain is divided into ranges where 
the score is constant. Each range is 
considered a discrete option. 
n Then, a point from the range is sampled with 
uniform distribution and returned as the output 
of the exponential mechanism. 
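The two steps for continuous attributes can be sketched as below. The ranges and scores are made-up values for illustration, and the function name is hypothetical. Weighting each range by its length is an assumption about how the discrete options are weighted: it is the integral of the exponential weight over a range where the score is constant.

```python
import math
import random

def sample_split_point(ranges, epsilon, rng, sensitivity=1.0):
    """ranges: list of (low, high, score) with a constant score on each range."""
    top = max(score for _, _, score in ranges)
    # Step 1: pick a range; weight = range length * exp(eps * q / (2 * S)).
    weights = [(high - low) * math.exp(epsilon * (score - top) / (2.0 * sensitivity))
               for low, high, score in ranges]
    low, high, _ = rng.choices(ranges, weights=weights, k=1)[0]
    # Step 2: return a point sampled uniformly from the chosen range.
    return rng.uniform(low, high)

# Hypothetical piecewise-constant scores over the domain [0, 100].
ranges = [(0.0, 20.0, 2), (20.0, 55.0, 4), (55.0, 100.0, 3)]
print(sample_split_point(ranges, epsilon=1.0, rng=random.Random(0)))
```

With a very large ε the sampled point almost always falls in the best-scoring range, while a small ε spreads the choice across the whole domain.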
+Experiment
n The paper defines a domain with ten nominal attributes and a class attribute, taken from another paper.
n It introduces noise to the samples by reassigning attributes and classes, replacing each value with a given noise probability.
n For testing, a noiseless test set with 10,000 records is generated in the same way.
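The noise step can be sketched as follows. The slides do not say how replacement values are drawn, so the uniform choice from each attribute's domain is an assumption, as are the names `add_noise`, `domains`, and `p_noise` and the tiny two-attribute domain.

```python
import random

def add_noise(record, domains, p_noise, rng):
    """Replace each value, independently with probability p_noise, by a random
    value from that attribute's domain (uniform choice is an assumption)."""
    return {attr: (rng.choice(domains[attr]) if rng.random() < p_noise else value)
            for attr, value in record.items()}

# Hypothetical miniature domain; the experiment itself uses ten attributes.
domains = {"a1": ["x", "y", "z"], "class": ["pos", "neg"]}
clean = {"a1": "x", "class": "pos"}
rng = random.Random(0)
print(add_noise(clean, domains, p_noise=0.1, rng=rng))
```

Setting `p_noise` to 0 leaves a record untouched, which corresponds to the noiseless test set.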
+Experiment (cont.)
n the average accuracy is higher as more training samples are available
n with Gini and Max, the influence of the noise weakens as the number of samples grows
+Experiment (cont.)
n three of the ten attributes were replaced with numeric attributes over the domain [0, 100]
n Figure 4 presents the results of a similar experiment
+Experiment (cont.)
n for smaller training sets, ID3 allows for better accuracy
n for larger training sets, C4.5 is better than ID3
+Experiment (cont.)
n the accuracy results presented in Figure 6 were around 5% and even lower than the results presented in Figure 7
n when the size of the dataset is small, algorithms that make efficient use of the privacy budget are superior
+Conclusion 
n When the number of training samples is 
relatively small or the privacy constraints 
set by the data provider are very limiting, 
the sensitivity of the calculations 
becomes crucial. 
+Future work 
n One solution might be to consider other 
stopping rules when selecting nodes, 
trading possible improvements in 
accuracy for increased stability. 
n In addition, it may be fruitful to consider 
different tactics for budget distribution. 
+Thought
+Thanks for listening.
v123582@gmail.com

NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 

Data mining with differential privacy

  • 1. Data Mining with Differential Privacy. Arik Friedman and Assaf Schuster, KDD’10. Presented by Chang Wei-Yuan, 2014/10/3 (Fri.) @ MakeLab Group Meeting
  • 2. Outline
      • Introduction
      • Background
      • Method
      • Experiment
      • Conclusion
      • Thoughts
  • 3. Introduction
      • There is great value in data mining solutions that offer:
          • reliable privacy guarantees
          • available accuracy
      • Differential privacy
          • computations are insensitive to changes in any particular individual's record
  • 4. Introduction (cont.)
      • Once an individual is certain that his or her data will remain private, being opted in or out of the database should make little difference.
  • 5. Introduction (cont.)
      • Example 1
          Name    Result
          Tom     0
          Jack    1
          Henry   1
          Diego   0
          Alice   ?
      • f(i) = count(i), the count over the first i records
      • Alice is record i = 5
      • count(5) − count(4) reveals Alice's result
  • 6. Introduction (cont.)
      • Example 2
      • We can speculate about the target based on the other information.
          Id  Sex  Job      Hometown  Hobby
          1   M    student  Hsinchu   sport
          2   M    teacher  Taipei    writing
          3   F    student  Hsinchu   Singing
          4   F    student  Taipei    Singing
          5   ?    ?        ?         ?
  • 7. Introduction (cont.)
      • Goal: count(5) − count(4) ≈ 0
      • Goal: "computations are insensitive to changes in any particular individual's record"
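The differencing attack behind Example 1 can be sketched in a few lines of Python. This is only an illustration: the `records` list and the `count` helper are hypothetical, and Alice's value is set to 1 because the Laplace example on slide 11 uses count(5) = 3.

```python
# Hypothetical data mirroring the Name/Result table on slide 5.
# Alice's value is an assumption, chosen so that count(5) = 3 as on slide 11.
records = [("Tom", 0), ("Jack", 1), ("Henry", 1), ("Diego", 0), ("Alice", 1)]

def count(i):
    """Exact count query f(i): sum of Result over the first i records."""
    return sum(result for _, result in records[:i])

# Two exact aggregate queries isolate a single individual's record.
alice_result = count(5) - count(4)
print("Recovered value for Alice:", alice_result)
```

With exact answers the difference is always Alice's value, which is why the stated goal is to make count(5) and count(4) nearly indistinguishable.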
  • 9. Differential Privacy
      • Differential privacy (figure axes: Output, Probability)
          • M: a randomized computation
          • f: a query function
          • D, D’: datasets that differ in a single record (symmetric difference of size 1)
  • 10. Differential Privacy (cont.)
      • Definition (ε-differential privacy). We say a randomized computation M provides ε-differential privacy if, for any datasets A and B with symmetric difference |AΔB| = 1 and any set of possible outcomes S ⊆ Range(M),
          Pr[M(A) ∈ S] ≤ e^ε · Pr[M(B) ∈ S]
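  • 11. Laplace Mechanism
      • Example of the Laplace mechanism
          Name    Result
          Tom     0
          Jack    1
          Henry   1
          Diego   0
          Alice   ?
      • count(4) = 2 + noise(4)
      • count(5) = 3 + noise(5)
      • with Laplace noise of scale 1/ε, the probability of any given output differs between count(4) and count(5) by a factor of at most e^ε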
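  • 12. Laplace Mechanism
      • Theorem (Laplace mechanism). Given a function f: D → ℝ^d over an arbitrary domain D, the computation
          M(X) = f(X) + (Laplace(S(f)/ε))^d
        provides ε-differential privacy, where S(f) is the sensitivity of f: the largest change in f between any two datasets that differ in one record.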
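A minimal sketch of the noisy counts from the example, assuming a counting query with sensitivity 1, so that the Laplace scale is 1/ε. The helper names, the seed, and the value ε = 0.5 are illustrative choices, not taken from the slides.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(true_count, epsilon, rng):
    """Counting query: sensitivity is 1, so the noise scale is 1 / epsilon."""
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)  # fixed seed, only to make the sketch repeatable
epsilon = 0.5           # hypothetical privacy budget

# A single release of count(4) and count(5) is dominated by the noise.
print(noisy_count(2, epsilon, rng), noisy_count(3, epsilon, rng))

# Averaged over many draws the noise cancels, which is why every
# repeated query has to be charged against the privacy budget.
samples4 = [noisy_count(2, epsilon, rng) for _ in range(20000)]
samples5 = [noisy_count(3, epsilon, rng) for _ in range(20000)]
mean4 = sum(samples4) / len(samples4)
mean5 = sum(samples5) / len(samples5)
print(round(mean4, 1), round(mean5, 1))
```

A smaller ε gives a larger scale and therefore stronger privacy at the cost of accuracy.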
  • 13. Exponential Mechanism
      • Example of the exponential mechanism: selection probability of each item, given its quality score q
          Item        q    ε=0    ε=0.1  ε=1
          Football    30   0.46   0.42   0.92
          Volleyball  25   0.38   0.33   0.07
          Basketball  8    0.12   0.14   1.5E-05
          Tennis      2    0.03   0.10   7.7E-07
  • 14. Exponential Mechanism (cont.)
      • Theorem (exponential mechanism). Let q be a quality function that, given a database d, assigns a score q(d, r) to each outcome r, and let S(q) be the sensitivity of q. Then the mechanism M, defined by
          M(d, q) = { return r with probability ∝ exp(ε · q(d, r) / (2 · S(q))) }
        maintains ε-differential privacy.
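The example table can be checked with a short sketch that assumes a quality function of sensitivity 1 and probabilities proportional to exp(ε·q/2). The function names are illustrative. Under this formula the ε = 0.1 and ε = 1 columns are reproduced to within rounding; ε = 0 would give a uniform 0.25 per item, whereas the ε = 0 column on the slide equals q divided by the sum of the scores.

```python
import math
import random

def exponential_mechanism_probs(scores, epsilon, sensitivity=1.0):
    """Selection probabilities proportional to exp(epsilon * q / (2 * sensitivity))."""
    top = max(scores.values())
    # Subtracting the top score avoids overflow and leaves the
    # normalized probabilities unchanged.
    weights = {
        r: math.exp(epsilon * (q - top) / (2.0 * sensitivity))
        for r, q in scores.items()
    }
    total = sum(weights.values())
    return {r: w / total for r, w in weights.items()}

def exponential_mechanism(scores, epsilon, rng, sensitivity=1.0):
    """Draw one outcome according to the probabilities above."""
    probs = exponential_mechanism_probs(scores, epsilon, sensitivity)
    items = list(probs)
    return rng.choices(items, weights=[probs[r] for r in items], k=1)[0]

scores = {"Football": 30, "Volleyball": 25, "Basketball": 8, "Tennis": 2}
for eps in (0.1, 1.0):
    probs = exponential_mechanism_probs(scores, eps)
    print(eps, {r: f"{p:.2g}" for r, p in probs.items()})

print(exponential_mechanism(scores, 1.0, random.Random(0)))
```

Larger ε concentrates the probability on the highest-scoring item, while smaller ε flattens the distribution.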
  • 15. PINQ Framework
      • PINQ is a proposed architecture for data analysis with differential privacy.
      • PINQ also provides a Partition operator, whose budget accounting is dubbed parallel composition:
          • the costs do not add up when queries are executed on disjoint datasets
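The difference between the two accounting rules can be illustrated with a toy calculation. The two functions below are hypothetical helpers written for this sketch and are not PINQ's API (PINQ itself is a LINQ-based framework).

```python
def sequential_cost(epsilons):
    """Queries on the same data: the privacy costs add up."""
    return sum(epsilons)

def parallel_cost(epsilons_per_part):
    """Queries on disjoint partitions: only the most expensive part counts."""
    return max(epsilons_per_part)

# Three queries at epsilon = 0.1 each (illustrative values).
same_data = sequential_cost([0.1, 0.1, 0.1])
disjoint_parts = parallel_cost([0.1, 0.1, 0.1])
print(round(same_data, 2), disjoint_parts)
```

This is what makes partitioning attractive for decision trees: the children of a split hold disjoint sets of records, so all nodes at one depth can share a single budget allotment.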
  • 18. Method
      • SuLQ-based ID3
      • DiffP-ID3
      • DiffP-C4.5
  • 19. SuLQ-based ID3
      • Based on the SuLQ framework, using the Laplace mechanism.
      • It makes direct use of the NoisyCount primitive to evaluate the information gain criterion.
      • The calculations required to evaluate the information gain must be carried out for each attribute separately,
          • so the budget per query is small.
  • 20. SuLQ-based ID3
      • ID3 classification
      • Split point: max( Gain(Job), Gain(Home), Gain(Hobby) )
          Id  Sex  Job      Hometown  Hobby
          1   M    student  Hsinchu   sport
          2   M    teacher  Taipei    writing
          3   F    student  Hsinchu   Singing
          4   F    student  Taipei    Singing
  • 21. SuLQ-based ID3
      • SuLQ-based ID3 classification
      • Split point: max( Gain(Job) + noise, Gain(Home) + noise, Gain(Hobby) + noise )
          Id  Sex  Job      Hometown  Hobby
          1   M    student  Hsinchu   sport
          2   M    teacher  Taipei    writing
          3   F    student  Hsinchu   Singing
          4   F    student  Taipei    Singing
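A rough sketch of how information gain could be evaluated from noisy counts, in the spirit of the NoisyCount primitive. Several things here are assumptions made for illustration: Sex is treated as the class label, the budget is split evenly across the three attributes, the attribute values are treated as public, and all function names are made up for this sketch.

```python
import math
import random

# Rows from the slide's table; treating Sex as the class label is an assumption.
ROWS = [
    {"Sex": "M", "Job": "student", "Hometown": "Hsinchu", "Hobby": "sport"},
    {"Sex": "M", "Job": "teacher", "Hometown": "Taipei", "Hobby": "writing"},
    {"Sex": "F", "Job": "student", "Hometown": "Hsinchu", "Hobby": "Singing"},
    {"Sex": "F", "Job": "student", "Hometown": "Taipei", "Hobby": "Singing"},
]
ATTRIBUTES = ["Job", "Hometown", "Hobby"]

def laplace(scale, rng):
    """Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def entropy(counts):
    total = sum(counts)
    if total <= 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def info_gain(rows, attr, target, count_fn):
    """Information gain built from one (possibly noisy) count per (value, class) cell."""
    # Simplification: the value sets are read from the data, i.e. assumed public.
    classes = sorted({r[target] for r in rows})
    values = sorted({r[attr] for r in rows})
    cells = {
        (v, c): count_fn(sum(1 for r in rows if r[attr] == v and r[target] == c))
        for v in values for c in classes
    }
    class_totals = [sum(cells[(v, c)] for v in values) for c in classes]
    n = sum(class_totals)
    if n <= 0:
        return 0.0
    gain = entropy(class_totals)
    for v in values:
        branch = [cells[(v, c)] for c in classes]
        gain -= sum(branch) / n * entropy(branch)
    return gain

def exact(c):
    return float(c)

def make_noisy_count(epsilon, rng):
    # Clamping at zero is post-processing and does not weaken the guarantee.
    return lambda c: max(0.0, c + laplace(1.0 / epsilon, rng))

rng = random.Random(0)
total_epsilon = 1.0                                   # hypothetical budget for this node
eps_per_attribute = total_epsilon / len(ATTRIBUTES)   # each attribute is queried separately

exact_gains = {a: info_gain(ROWS, a, "Sex", exact) for a in ATTRIBUTES}
noisy_gains = {
    a: info_gain(ROWS, a, "Sex", make_noisy_count(eps_per_attribute, rng))
    for a in ATTRIBUTES
}
print("exact best:", max(exact_gains, key=exact_gains.get))
print("noisy best:", max(noisy_gains, key=noisy_gains.get))
```

Because every attribute needs its own set of noisy counts, each one receives only a fraction of the budget, and on a table this small the noise can easily change which attribute looks best.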
  • 22. DiffP-ID3
      • Based on the PINQ framework, using the exponential mechanism.
      • It evaluates all attributes simultaneously in one query, the outcome of which is the attribute to use for splitting.
      • The quality function q provided to the exponential mechanism scores each attribute.
  • 23. DiffP-ID3 (cont.)
      • DiffP-ID3 classification
      • Split point: max( Gain(M(Job)), Gain(M(Home)), Gain(M(Hobby)) )
      • PINQ Partition
          Id  Sex  Job      Hometown  Hobby
          1   M    student  Hsinchu   sport
          2   M    teacher  Taipei    writing
          3   F    student  Hsinchu   Singing
          4   F    student  Taipei    Singing
  • 24. DiffP-ID3 (cont.)
      • Which quality function should be fed into the exponential mechanism? The choice depends on:
          • the depth constraint
          • the sensitivity of the splitting criterion
      • Information gain is the most sensitive to noise, and the Max operator is the least sensitive to noise.
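A sketch of choosing the split attribute with a single exponential-mechanism query, using the Max operator as the quality function. The Max score here is the sum, over the values of an attribute, of the count of the majority class; adding or removing one record changes it by at most 1, so the sensitivity is taken as 1. Treating Sex as the class label and the function names are assumptions for illustration.

```python
import math
import random
from collections import Counter

# Rows from the slide's table; treating Sex as the class label is an assumption.
ROWS = [
    {"Sex": "M", "Job": "student", "Hometown": "Hsinchu", "Hobby": "sport"},
    {"Sex": "M", "Job": "teacher", "Hometown": "Taipei", "Hobby": "writing"},
    {"Sex": "F", "Job": "student", "Hometown": "Hsinchu", "Hobby": "Singing"},
    {"Sex": "F", "Job": "student", "Hometown": "Taipei", "Hobby": "Singing"},
]
ATTRIBUTES = ["Job", "Hometown", "Hobby"]

def max_quality(rows, attr, target):
    """Max operator: sum over attribute values of the majority-class count."""
    per_value = {}
    for r in rows:
        per_value.setdefault(r[attr], Counter())[r[target]] += 1
    return sum(max(counts.values()) for counts in per_value.values())

def choose_split(rows, attributes, target, epsilon, rng, sensitivity=1.0):
    """One exponential-mechanism query that returns the attribute to split on."""
    scores = {a: max_quality(rows, a, target) for a in attributes}
    top = max(scores.values())
    weights = [
        math.exp(epsilon * (scores[a] - top) / (2.0 * sensitivity))
        for a in attributes
    ]
    chosen = rng.choices(attributes, weights=weights, k=1)[0]
    return chosen, scores

chosen, scores = choose_split(ROWS, ATTRIBUTES, "Sex", epsilon=1.0, rng=random.Random(0))
print(scores, "->", chosen)
```

All attributes are scored inside one query, so the node's budget is spent once rather than being divided among the attributes.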
  • 25. DiffP-C4.5
      • One important extension is the ability to handle continuous attributes.
      • First, the domain is divided into ranges where the score is constant. Each range is considered a discrete option.
      • Then, a point from the range is sampled with uniform distribution and returned as the output of the exponential mechanism.
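The two steps can be sketched as follows. The ranges and scores are made-up values over a [0, 100] domain, and weighting each range by its length is an assumption of this sketch: it corresponds to applying the exponential mechanism to the whole continuous domain, where a wider range contains more candidate points.

```python
import math
import random

def sample_split_point(ranges, epsilon, rng, sensitivity=1.0):
    """ranges: list of (low, high, score) with a constant score on each range."""
    top = max(score for _, _, score in ranges)
    # Step 1: each range is one discrete option, weighted by its length
    # times the usual exponential-mechanism factor.
    weights = [
        (high - low) * math.exp(epsilon * (score - top) / (2.0 * sensitivity))
        for low, high, score in ranges
    ]
    low, high, _ = rng.choices(ranges, weights=weights, k=1)[0]
    # Step 2: a point is sampled uniformly inside the chosen range.
    return rng.uniform(low, high)

# Hypothetical piecewise-constant scores for a numeric attribute on [0, 100].
RANGES = [(0.0, 20.0, 3.0), (20.0, 55.0, 5.0), (55.0, 100.0, 4.0)]

rng = random.Random(0)
print([round(sample_split_point(RANGES, 1.0, rng), 1) for _ in range(5)])
```

The score only changes at the data points, so the number of ranges stays finite even though the attribute is continuous.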
  • 27. Experiment
      • The experiment uses a domain with ten nominal attributes and a class attribute, taken from another paper.
      • Noise is introduced into the training samples by reassigning attributes and classes: each value is replaced with a given noise probability.
      • For testing, a noiseless test set with 10,000 records is generated in the same way.
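The noise step could look like the sketch below. The attribute names, domains and the choice to draw the replacement uniformly from the attribute's domain are assumptions for illustration; the slides state only that each value is replaced with some noise probability.

```python
import random

def add_noise(records, domains, p_noise, rng):
    """Replace each value, independently with probability p_noise,
    by a uniform draw from that attribute's domain (an assumption)."""
    noisy = []
    for record in records:
        noisy.append({
            name: rng.choice(domains[name]) if rng.random() < p_noise else value
            for name, value in record.items()
        })
    return noisy

# Hypothetical two-attribute domain plus a class attribute.
DOMAINS = {"a1": ["x", "y", "z"], "a2": ["p", "q"], "class": ["0", "1"]}
clean = [{"a1": "x", "a2": "p", "class": "0"}, {"a1": "z", "a2": "q", "class": "1"}]

rng = random.Random(0)
print(add_noise(clean, DOMAINS, 0.1, rng))
```

Setting `p_noise` to 0 returns the data unchanged, which matches the noiseless test set described above.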
  • 28.
      • the average accuracy is higher as more training samples are available
      • the influence of the noise weakens as the number of samples grows when using Gini and Max
  • 29.
      • three of the ten attributes were replaced with numeric attributes over the domain [0, 100]
      • Figure 4 presents the results of a similar experiment
  • 30.
      • for smaller training sets, ID3 allows for better accuracy
      • for larger training sets, C4.5 is better than ID3
  • 31.
      • the accuracy results presented in Figure 6 were around 5% and even lower than the results presented in Figure 7
      • when the size of the dataset is small, algorithms that make efficient use of the privacy budget are superior
  • 33. Conclusion
      • When the number of training samples is relatively small or the privacy constraints set by the data provider are very limiting, the sensitivity of the calculations becomes crucial.
  • 34. Future work
      • One solution might be to consider other stopping rules when selecting nodes, trading possible improvements in accuracy for increased stability.
      • In addition, it may be fruitful to consider different tactics for budget distribution.
  • 37. Thanks for listening. 2014/10/3 (Fri.) @ MakeLab Group Meeting, v123582@gmail.com

Editor's Notes

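  1. In current data mining work, data privacy is taken seriously, and guaranteeing privacy without hurting accuracy has long been a valuable line of research. A recent technique, differential privacy, provides exactly this. So what is differential privacy? Its definition is that the data as a whole is insensitive to any single record.
  2. Put simply, changing a single record (removing it or adding it) does not change the overall data significantly. When this holds, the data is considered private, and differential privacy provides this property.
  3. As an example, suppose a dataset contains a name and whether the person has cancer. The dataset offers a count query, and suppose we know that Alice is in the fifth position. We can then learn Alice's record by computing "count over the first 5 records" minus "count over the first 4 records". The goal is therefore to apply some processing that prevents this kind of attack.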
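  4. PINQ is a platform developed by Microsoft that provides differential privacy. Notably, it offers a parallel operation: partition. In the traditional approach, every query on the data adds to the cost. With the parallel approach, a single query can operate on many independent pieces of data at once. Because these pieces are independent, their costs do not accumulate.
  5. The traditional method uses the SuLQ framework, which is a linear query system. It implements differential privacy by combining the classic ID3 decision tree with the Laplace distribution. That is, each time information gain is computed to choose a split point, Laplace noise is added. This method has a serious drawback: noise has to be added when evaluating every attribute, so the overall error easily becomes too large.
  6. All attributes are handled through PINQ, and the exponential mechanism is applied according to the quality function. The result is based on the score of each attribute. In other words, we obtain the attribute best suited to serve as the split point.
  7. Different quality functions may give different results under different conditions. Overall, the finding is that information gain is affected the most and the Max operator the least.
  8. This method supports classification with continuous attributes. The process has two steps. First, the continuous data is cut into ranges, each represented by one score. Second, the exponential mechanism maps the data onto these scores. The method therefore has to use the exponential mechanism to turn every continuous attribute into a discrete one, and then runs the exponential mechanism once more on the discretized data to find the best split point.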
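  9. The training set for the experiment is taken from another paper. It has 10 discrete attributes and 1 label attribute. Noise was added at random to the training data, and the test set has 10,000 records.
  10. The first experiment covers the discrete DiffP-ID3. The horizontal axis is the number of training samples and the vertical axis is the average accuracy. We can observe the following: the larger the training set, the higher the accuracy, and using Gini and Max as the evaluation function produces smaller error. Why does even the original ID3 reach only ninety-something percent accuracy? Because noise was added to our training set.
  11. The second experiment examines the continuous DiffP-C4.5, with three attributes of the original data replaced by continuous ones. The results are similar to the previous experiment: the larger the training set, the higher the accuracy.
  12. The third experiment compares, for continuous data, using DiffP-C4.5 against discretizing the dataset first and then running DiffP-ID3. The results show that with less data, DiffP-ID3 does better than DiffP-C4.5 (triangle > square, circle > cross), and with more data, DiffP-C4.5 does better than DiffP-ID3 (square > triangle, cross > circle).
  13. The last experiment adds a limit on the height of the decision tree. The left plot is for discrete attributes and the right plot is for continuous attributes. From the results we find that with large amounts of data DiffP-C4.5 has better accuracy, and with small amounts of data the choice of evaluation function makes a more noticeable difference.
  14. The performance of differential privacy varies with many conditions, for example the amount of data or constraints such as the depth of the decision tree. A suitable quality function therefore has to be chosen for each set of constraints.
  15. The current method still has much room for improvement. For example, when a differential privacy mechanism is used, the stopping condition of the decision tree may be inaccurate, which can affect accuracy. In addition, considering different distribution functions is another direction.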