47. 47
The Basic Problem for Data Science
How can data be de-identified so as to
• Enable “desirable uses” of the data while protecting
the “privacy” of the data subjects?
• Political policy
• Academic research
• Drug trial studies
• Security: searching for terrorists/criminals
• Market analysis, …
48. 48
Approach 1: Encrypt the Data
Original data:
Name   Sex  Blood
Jane   F    B
Perry  M    A
Smith  M    O
Ross   M    O
Huang  F    A
Chen   M    B
Encrypted data:
Name    Sex     Blood
100101  001001  110101
101010  111010  111111
001010  100100  011001
001110  010010  110101
110101  000000  111001
111110  110010  000101
Problems: Data cannot be analyzed.
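A minimal sketch of why encryption blocks analysis, assuming a toy one-time-pad (XOR) cipher chosen only for illustration; the slide does not name a cipher, and `xor_bytes`, `keys`, and `encrypted` are hypothetical names. Frequency counts on the plaintext are meaningful, while the same counts on the ciphertext say nothing about the blood types unless the keys are available.

```python
import secrets
from collections import Counter

# Table from the slide: (name, sex, blood type).
rows = [
    ("Jane", "F", "B"), ("Perry", "M", "A"), ("Smith", "M", "O"),
    ("Ross", "M", "O"), ("Huang", "F", "A"), ("Chen", "M", "B"),
]

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR data with a key at least as long as the data (toy one-time pad)."""
    return bytes(d ^ k for d, k in zip(data, key))

# One fresh random key per cell; illustration only, not a recommended scheme.
keys = [[secrets.token_bytes(16) for _ in row] for row in rows]
encrypted = [
    [xor_bytes(value.encode(), key) for value, key in zip(row, key_row)]
    for row, key_row in zip(rows, keys)
]

# Analysis on the plaintext works: each blood type appears twice.
plain_counts = Counter(blood for _, _, blood in rows)

# The same analysis on the ciphertext counts random bytes, not blood types.
cipher_counts = Counter(cells[2] for cells in encrypted)

# Only a key holder can undo the encryption and recover the table.
decrypted = [
    tuple(xor_bytes(cell, key).decode() for cell, key in zip(cells, key_row))
    for cells, key_row in zip(encrypted, keys)
]
```

Here `plain_counts` is usable for analysis, whereas `cipher_counts` changes on every run because the keys are random.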
49. 49
Approach 2: Anonymize the Data
Problems: “re-identification” by linking data [Sweeney '97]
Name Sex Blood HIV?
Chen F B Y
Jones M A N
Smith M O N
Ross M O Y
Lu F A N
Shah M B Y
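The linking problem can be sketched as follows, assuming that "anonymize" means dropping the Name column and that the attacker holds a separate list of names with the same Sex and Blood attributes. The `public` list and the `link` helper are hypothetical and exist only for this illustration; the released rows are the slide's table without names.

```python
# Released table: the slide's rows with the Name column removed.
released = [
    {"sex": "F", "blood": "B", "hiv": "Y"},
    {"sex": "M", "blood": "A", "hiv": "N"},
    {"sex": "M", "blood": "O", "hiv": "N"},
    {"sex": "M", "blood": "O", "hiv": "Y"},
    {"sex": "F", "blood": "A", "hiv": "N"},
    {"sex": "M", "blood": "B", "hiv": "Y"},
]

# Hypothetical auxiliary data the attacker already has.
public = [
    {"name": "Chen", "sex": "F", "blood": "B"},
    {"name": "Smith", "sex": "M", "blood": "O"},
    {"name": "Lu", "sex": "F", "blood": "A"},
]

def link(released_rows, public_rows, keys=("sex", "blood")):
    """Return {name: hiv} for every person who matches exactly one released row."""
    found = {}
    for person in public_rows:
        matches = [
            row for row in released_rows
            if all(row[k] == person[k] for k in keys)
        ]
        if len(matches) == 1:  # a unique match re-identifies the person
            found[person["name"]] = matches[0]["hiv"]
    return found

reidentified = link(released, public)
```

In this sketch Chen and Lu are each the only record with their Sex and Blood combination, so their HIV status is exposed; Smith matches two rows and stays ambiguous.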
50. 50
Approach 3: Mediate Access
Figure: trusted “curator” (C) between the data and the data analysts
Problems: “aggregated” statistics can reveal individual information; query selections
Name Sex Blood
Jane F B
Perry M A
Smith M O
Ross M O
Huang F A
Chen M B
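One way aggregated statistics can leak individual information is a differencing attack, sketched here under the assumption that the curator answers exact counting queries. The `Curator` class and its `count` method are hypothetical names, not an interface from the slides.

```python
# Table from the slide: (name, sex, blood type).
ROWS = [
    ("Jane", "F", "B"), ("Perry", "M", "A"), ("Smith", "M", "O"),
    ("Ross", "M", "O"), ("Huang", "F", "A"), ("Chen", "M", "B"),
]

class Curator:
    """Holds the raw rows and only ever returns exact counts."""

    def __init__(self, rows):
        self._rows = list(rows)

    def count(self, predicate):
        return sum(1 for row in self._rows if predicate(row))

curator = Curator(ROWS)

# Query 1: how many people have blood type B?
q1 = curator.count(lambda r: r[2] == "B")

# Query 2: how many people other than Jane have blood type B?
q2 = curator.count(lambda r: r[2] == "B" and r[0] != "Jane")

# The difference isolates Jane's record although both answers are aggregates.
jane_has_type_b = (q1 - q2) == 1
```

Neither answer names an individual, yet the pair of queries reveals Jane's blood type.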
52. 52
Synthetic data
“fake” people
Utility: preserves statistics for every set of attributes!
Problem: computation time
Sex Blood Cancer?
F B Y
F A N
M O N
M O Y
F A N
M B Y
Sex Blood Cancer?
M B N
F B Y
M O Y
M A N
F O N
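A rough sketch of how the utility claim could be checked, assuming "statistics" means the relative frequency of each value combination for a set of attributes. The slide does not say which of the two tables is the original, so they are simply called `table_a` and `table_b`; `marginal` and `max_marginal_gap` are hypothetical helpers, and this checks a release rather than generating one.

```python
from collections import Counter
from itertools import combinations

COLUMNS = ("sex", "blood", "cancer")

# The two tables from the slide, in the order they appear.
table_a = [("F", "B", "Y"), ("F", "A", "N"), ("M", "O", "N"),
           ("M", "O", "Y"), ("F", "A", "N"), ("M", "B", "Y")]
table_b = [("M", "B", "N"), ("F", "B", "Y"), ("M", "O", "Y"),
           ("M", "A", "N"), ("F", "O", "N")]

def marginal(table, idxs):
    """Relative frequency of each value combination over the chosen columns."""
    counts = Counter(tuple(row[i] for i in idxs) for row in table)
    return {combo: n / len(table) for combo, n in counts.items()}

def max_marginal_gap(a, b):
    """Largest frequency difference over every non-empty set of attributes."""
    worst = 0.0
    for size in range(1, len(COLUMNS) + 1):
        for idxs in combinations(range(len(COLUMNS)), size):
            ma, mb = marginal(a, idxs), marginal(b, idxs)
            for combo in set(ma) | set(mb):
                gap = abs(ma.get(combo, 0.0) - mb.get(combo, 0.0))
                worst = max(worst, gap)
    return worst

gap = max_marginal_gap(table_a, table_b)
```

A gap near zero would mean the two tables agree on every set of attributes. The check itself visits every non-empty subset of columns, and the number of such subsets doubles with each added attribute.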