"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
Synthetic Data for Big Data Privacy
1. Synthetic Data for Big Data Privacy
Helsinki Data Science Meetup
Michael Platzer – Mostly AI
2. Agenda
1. Anonymization is Hard 5min
2. The Promise of Synthetic Data 5min
3. Case Studies w/ Public Data 5min
4. A Word on Privacy 5min
5. Use Cases 5min
3. The Privacy vs. Innovation Clash
>> Data Sharing
>> Data Monetization
>> Behavioral Analytics
>> Machine Learning
>> Smart Services
>> New Opportunities
>> Zero Time-To-Data
>> Consumer Understanding
“share as much data with as
many people as possible”
Data Protection <<
Privacy Regulations <<
Compliance <<
Reputational Risk <<
Customer Consent <<
Business Justification <<
Restricted Environments <<
Need-to-Know Basis <<
“share as little data with as
few people as possible”
4. The Privacy vs. Innovation Clash
Why Not “Simply” Anonymize Data?
(anonymous data not subject to privacy regulations)
9. Anonymization is Hard for Images
for any High-Dimensional Data Asset
Curse of Dimensionality
= Exponential Growth in Data Space
A Curse for Analytics
A Curse for Machine Learning
A Curse for Preserving Privacy
A Blessing for Consumer-Centric Organizations
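A minimal sketch of the exponential growth claimed above, assuming uniformly random categorical attributes with 10 possible values each (a simplification for illustration, not from the talk). The helper names `n_cells` and `share_unique` are hypothetical. As attributes are added, the number of possible combinations explodes and almost every record ends up alone in its cell, which is what makes it re-identifiable.

```python
import random
from collections import Counter


def n_cells(n_attributes, n_values=10):
    """Number of distinct attribute combinations in the data space."""
    return n_values ** n_attributes


def share_unique(n_records, n_attributes, n_values=10, seed=0):
    """Share of records whose attribute combination occurs exactly once."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    records = [
        tuple(rng.randrange(n_values) for _ in range(n_attributes))
        for _ in range(n_records)
    ]
    counts = Counter(records)
    return sum(1 for r in records if counts[r] == 1) / n_records


# Same number of records, growing number of attributes.
for d in (1, 2, 4, 8):
    print(d, n_cells(d), round(share_unique(10_000, d), 3))
```

With one attribute no record is unique; with eight attributes nearly all of them are, even though the number of records has not changed.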
11. Anonymization is Hard
“We conjecture that the amount of perturbation that must be applied to the data to defeat our algorithm will completely destroy their utility [..] Sanitization techniques from the k-anonymity literature such as generalization and suppression do not provide meaningful privacy guarantees, and in any case fail on high-dimensional data.”
(100 million movie ratings for 17k+ movies)
12. Anonymization is Hard
of mobile phone owners are re-identified by just 2 antenna signals, even when coarsened to hour of day (Nature, 2013)
of credit card owners are re-identified by 3 transactions, even when only the merchant and the date of the transaction are revealed (Science, 2015)
of US citizens are re-identified by date of birth, gender and ZIP code (Health, 2000)
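The re-identification findings above all rest on the same check: how many people are the only ones sharing a given combination of seemingly harmless attributes. A minimal sketch of that check, using hypothetical toy records and a hypothetical helper `unique_share` (not taken from the cited studies):

```python
from collections import Counter


def unique_share(records, quasi_identifiers):
    """Share of records that are unique on the given quasi-identifiers."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


# Hypothetical toy records, for illustration only.
people = [
    {"dob": "1980-01-01", "gender": "F", "zip": "10001"},
    {"dob": "1980-01-01", "gender": "F", "zip": "10001"},
    {"dob": "1980-01-01", "gender": "M", "zip": "10001"},
    {"dob": "1975-06-30", "gender": "F", "zip": "10002"},
]

by_gender = unique_share(people, ["gender"])
by_all = unique_share(people, ["dob", "gender", "zip"])
print(by_gender, by_all)
```

Each attribute on its own singles out few people; the combination singles out many more, even without any name or ID column.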
14. Data Assets Get Locked Up
PERSONAL
DATA
No!
Data Scientists
Data Engineers
Program Mgmt
Business Analysts
Developers
Business Partners
Designers
Researchers
Integrators
Testers
Operations
Innovation
→ How to become data-driven & customer-centric if you can’t collaborate with your data?
Chief Data Officer
Data Protection Officer
17. AI-Generated Synthetic Behavioral Data
MOSTLY GENERATE
MOSTLY GENERATE is a flexible, scalable, automated and highly accurate synthetic data platform powered by generative deep neural network models for structured behavioral data (e.g. financial transaction data, healthcare data, mobility data). It is an easy-to-deploy software solution that runs on-premises or in a private cloud.
actual, privacy-sensitive data
your secure IT environment
synthetic, statistically representative data
fully anonymous, granular-level data
enables unrestricted big data utilization
retains ~99% of statistical value
19. Synthetic Data – How Accurate Is It?
Measuring Accuracy - An Open Challenge for Unsupervised Learning
− Turing Test
− Descriptive Statistics & Visualizations
− Distance Measure for Distributions (e.g. TVD)
− Benchmark Predictive Models
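Of the measures listed above, the total variation distance (TVD) is the easiest to sketch. For one categorical attribute it is half the sum of absolute differences between the category shares in the actual and the synthetic data: 0 means identical distributions, 1 means no overlap. The function name `tvd` and the sample values are hypothetical, and this is a generic illustration rather than the platform's own accuracy metric.

```python
from collections import Counter


def tvd(actual, synthetic):
    """Total variation distance between two samples of one categorical attribute."""
    p, q = Counter(actual), Counter(synthetic)
    n_p, n_q = len(actual), len(synthetic)
    categories = set(p) | set(q)
    # Counter returns 0 for categories missing from one of the samples.
    return 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in categories)


# Hypothetical workclass values, for illustration only.
actual_workclass = ["Private", "Private", "Local-gov", "Local-gov"]
synthetic_workclass = ["Private", "Local-gov", "Local-gov", "Local-gov"]
print(tvd(actual_workclass, synthetic_workclass))
```

For numeric attributes the same function can be applied after binning the values, with the bin edges fixed on the actual data.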
20. Synthetic Data – US Census
26’049 actual citizens
w/ 15 attributes
100’000 synthetic citizens
w/ 15 attributes
MOSTLY GENERATE
https://generate.mostly.ai/
(free public, yet limited demo)
21. Synthetic Data – US Census
Age: target = q01: 17y, q50: 37y, q99: 74y; synthetic = q01: 17y, q50: 37y, q99: 74y
Workclass: target = Local-gov: 6.5%; synthetic = Local-gov: 6.7%
Income: target = high-income: 24.1%; synthetic = high-income: 24.4%
35. Synthetic Mobility Traces – Porto Taxi
Original Data Synthetic Data
https://mostly.ai/2020/02/21/protecting-privacy-with-synthetic-location-data/
39. Synthetic Data – How Private Is It?
1. Report Differential Privacy as Theoretical Upper Limit
2. Calculate Empirical Differential Privacy (compute intensive)
3. Post hoc Privacy Analysis based on Individual-Level Distance
− Identical Match Count (IMC)
− Distance To Closest Record (DCR)
− Nearest Neighbor Distance Ratio (NNDR)
Synthetic Data shall be “as close as possible”, but “not too close” to
Actual Data. Holdout determines benchmark for “too close”.
A perfect solution generates new synthetic data that behaves exactly
like actual data that hasn’t been seen before (= holdout data).
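The three distance-based measures and the holdout benchmark can be sketched as follows. This is a simplified illustration, not the MOSTLY GENERATE implementation: it assumes purely numeric records, Euclidean distance and the median as summary statistic, and the names `nearest_two` and `privacy_metrics` as well as the toy data are hypothetical.

```python
import math
import statistics


def nearest_two(record, reference):
    """Distances from a record to its two closest reference records."""
    dists = sorted(math.dist(record, r) for r in reference)
    return dists[0], dists[1]


def privacy_metrics(candidates, training):
    """IMC, median DCR and median NNDR of candidates relative to training data."""
    imc, dcr, nndr = 0, [], []
    for rec in candidates:
        d1, d2 = nearest_two(rec, training)
        if d1 == 0:
            imc += 1  # identical match with a training record
        dcr.append(d1)
        # Convention chosen here: ratio is 0.0 if both neighbors are exact matches.
        nndr.append(d1 / d2 if d2 > 0 else 0.0)
    return {
        "imc": imc,
        "median_dcr": statistics.median(dcr),
        "median_nndr": statistics.median(nndr),
    }


# Hypothetical toy data: training records, synthetic records, unseen holdout records.
training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
synthetic = [(0.0, 0.0), (0.0, 2.0)]
holdout = [(1.0, 1.0), (3.0, 3.0)]

synthetic_metrics = privacy_metrics(synthetic, training)
holdout_metrics = privacy_metrics(holdout, training)
print(synthetic_metrics)
print(holdout_metrics)
```

In this toy example the synthetic records sit closer to the training data than the holdout records do and include one identical match, which is the "too close" signal described above. Synthetic data that is no closer to the training data than the holdout is would pass this check.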
40. Use Cases for Synthetic Data
for External Data Monetization
- Data Consortia
- Data Marketplaces
- Data Resellers
- Market Research Intel
for Internal Data Sharing
- Data Governance
- Cross-Border Data Sharing
- Cross-Department Data Sharing
- Testing & Development of BI / AI
- Data Literacy / Hackathons
- Data Retention
for External Data Sharing
- Group-Wide Data Sharing
- Open Innovation
- Research Collaborations
- Vendor Validation
- Sandboxes
- Public Data
current industry focus on finance, healthcare and public sector
41. One More – What’s Better Than Synthetic Data?
Synthetic Fair Data!
https://mostly.ai/2020/05/08/diving-deep-into-fair-synthetic-data-generation-fairness-series-part-5/
44. Synthetic Data – How Private Is It?
Chart series: Real, Synthetic, Real’, Synthetic’
Training on synthetic data does NOT decrease predictive accuracy,
but it fixes the privacy leak (memorization) of classic ML approaches.