Synthetic Data for Big Data Privacy

Helsinki Data Science Meetup
Michael Platzer – Mostly AI

Agenda
1. Anonymization is Hard 5min
2. The Promise of Synthetic Data 5min
3. Case Studies w/ Public Data 5min
4. A Word on Privacy 5min
5. Use Cases 5min

The Privacy vs. Innovation Clash
>> Data Sharing
>> Data Monetization
>> Behavioral Analytics
>> Machine Learning
>> Smart Services
>> New Opportunities
>> Zero Time-To-Data
>> Consumer Understanding
“share as much data with as many people as possible”
Data Protection <<
Privacy Regulations <<
Compliance <<
Reputational Risk <<
Customer Consent <<
Business Justification <<
Restricted Environments <<
need-to-know-basis <<
“share as little data with as few people as possible”

Why Not “Simply” Anonymize Data?
(anonymous data is not subject to privacy regulations)

Anonymization is Hard
useful but not private
private but not useful

Anonymization is Hard for Images
for any High-Dimensional Data Asset
Curse of Dimensionality
= Exponential Growth in Data Space
- A Curse for Analytics
- A Curse for Machine Learning
- A Curse for Preserving Privacy
- A Blessing for Consumer-Centric Organizations

(100 million movie ratings for 17k+ movies)

“We conjecture that the amount of perturbation that must be applied to the data to defeat our algorithm will completely destroy their utility [..] Sanitization techniques from the k-anonymity literature such as generalization and suppression do not provide meaningful privacy guarantees, and in any case fail on high-dimensional data.”
(100 million movie ratings for 17k+ movies)

[..] of mobile phone owners are re-identified by just 2 antenna signals, even when coarsened to the hour of day (Nature, 2013)
[..] of credit card owners are re-identified by 3 transactions, even when only the merchant and the date of the transaction are revealed (Science, 2015)
[..] of US citizens are re-identified by date of birth, gender and ZIP code (Health, 2000)
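
The re-identification results above come down to one question: how many people are unique on a small set of attributes? A minimal sketch of that check follows. The records and field names (`dob`, `gender`, `zip`) are hypothetical toy values, not data from the cited studies.

```python
from collections import Counter

def unique_fraction(records, quasi_identifiers):
    """Share of records whose quasi-identifier combination occurs exactly once."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)

# Hypothetical toy records; the cited studies worked with far larger populations.
people = [
    {"dob": "1980-01-01", "gender": "F", "zip": "00100"},
    {"dob": "1980-01-01", "gender": "F", "zip": "00100"},
    {"dob": "1980-01-01", "gender": "M", "zip": "00100"},
    {"dob": "1975-06-30", "gender": "F", "zip": "00200"},
]

by_gender = unique_fraction(people, ["gender"])
by_all = unique_fraction(people, ["dob", "gender", "zip"])
print(by_gender, by_all)
```

Adding attributes can only split groups further, so the unique share never shrinks as more columns are combined. This is the curse of dimensionality seen from the privacy side.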

Data Assets Get Locked Up
PERSONAL DATA
No!
Data Scientists
Data Engineers
Program Mgmt
Business Analysts
Developers
Business Partners
Designers
Researchers
Integrators
Testers
Operations
Innovation
→ How to become data-driven & customer-centric if you can’t collaborate with your data?
Chief Data Officer
Data Protection Officer

AI-Generated Synthetic Data

actual, privacy-sensitive data
synthetic, statistically representative data
AI-Generated Synthetic Behavioral Data

enables unrestricted big data utilization
retains ~99% of statistical value
MOSTLY GENERATE is a flexible, scalable, automated and highly accurate Synthetic Data Platform powered by generative deep neural network models for structured behavioral data (e.g. financial transaction data, healthcare data, mobility data). It is an easy-to-deploy software solution that runs on-premises or in a private cloud.
actual, privacy-sensitive data
synthetic, statistically representative data
your secure IT environment
MOSTLY GENERATE
AI-Generated Synthetic Behavioral Data
fully anonymous, granular-level data

A Game Changer for Big Data Anonymization

Synthetic Data – How Accurate Is It?
Measuring Accuracy - An Open Challenge for Unsupervised Learning
− Turing Test
− Descriptive Statistics & Visualizations
− Distance Measure for Distributions (e.g. TVD)
− Benchmark Predictive Models
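
One of the measures listed above, total variation distance (TVD), is easy to illustrate for a single categorical column. The sketch below is a minimal stdlib version, not the platform's own implementation. The sample values are made up for illustration.

```python
from collections import Counter

def tvd(actual, synthetic):
    """Total variation distance between two categorical samples.

    0.0 means identical category shares, 1.0 means no overlap at all.
    """
    p, q = Counter(actual), Counter(synthetic)
    n_p, n_q = len(actual), len(synthetic)
    categories = set(p) | set(q)
    # Counter returns 0 for categories missing from one of the samples.
    return 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in categories)

# Made-up samples of a "workclass"-style column.
actual = ["Private"] * 7 + ["Local-gov"] * 2 + ["Self-emp"] * 1
synthetic = ["Private"] * 6 + ["Local-gov"] * 3 + ["Self-emp"] * 1

distance = tvd(actual, synthetic)
print(round(distance, 3))
```

Here one tenth of the probability mass sits in a different category, so the distance comes out at roughly 0.1. The same function can be applied to binned numeric columns or to combinations of columns.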

Synthetic Data – US Census
26’049 actual citizens w/ 15 attributes
100’000 synthetic citizens w/ 15 attributes
MOSTLY GENERATE
https://generate.mostly.ai/
(free public, yet limited demo)

Attribute   Statistic         Target            Synthetic
Age         q01 / q50 / q99   17y / 37y / 74y   17y / 37y / 74y
Workclass   Local-gov         6.5%              6.7%
Income      high-income       24.1%             24.4%

Actual Data Synthetic Data

More Data >> Higher Accuracy


Synthetic Behavioral Data – CDNOW
MOSTLY GENERATE
23’570 actual customers 50’000 synthetic customers
https://mostly.ai/2020/05/28/how-to-unlock-your-behavioral-data-assets-part-ii/


Synthetic Mobility Traces – Porto Taxi
Original Data Synthetic Data
https://mostly.ai/2020/02/21/protecting-privacy-with-synthetic-location-data/

Original Sample Synthetic Sample


Synthetic Data – How Private Is It?
1. Report Differential Privacy as Theoretical Upper Limit
2. Calculate Empirical Differential Privacy (compute intensive)
3. Post hoc Privacy Analysis based on Individual-Level Distance
− Identical Match Count (IMC)
− Distance To Closest Record (DCR)
− Nearest Neighbor Distance Ratio (NNDR)
Synthetic data shall be “as close as possible”, but “not too close” to actual data. A holdout set determines the benchmark for “too close”: synthetic records should not be closer to the training data than holdout records are.
A perfect solution generates new synthetic data that behaves exactly like actual data that hasn’t been seen before (= holdout data).
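
The three distance-based measures above can be sketched in a few lines. This is a minimal illustration only. It assumes purely numeric records and Euclidean distance, and the helper names and toy points are hypothetical. Mixed-type data would first need an encoding step.

```python
import math

def nearest_two(record, reference):
    """Distances from `record` to its closest and second-closest reference records."""
    dists = sorted(math.dist(record, ref) for ref in reference)
    return dists[0], dists[1]

def privacy_metrics(candidates, training):
    """IMC, mean DCR and mean NNDR of `candidates` measured against `training`."""
    imc, dcrs, nndrs = 0, [], []
    for rec in candidates:
        d1, d2 = nearest_two(rec, training)
        imc += int(d1 == 0)                        # identical match
        dcrs.append(d1)                            # distance to closest record
        nndrs.append(d1 / d2 if d2 > 0 else 1.0)   # closest vs. second closest
    n = len(candidates)
    return {"imc": imc, "dcr": sum(dcrs) / n, "nndr": sum(nndrs) / n}

# Hypothetical toy data: training records, generated records, unseen holdout records.
training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synthetic = [(0.1, 0.0), (0.9, 1.0)]
holdout = [(0.5, 0.5), (2.0, 2.0)]

syn = privacy_metrics(synthetic, training)
ref = privacy_metrics(holdout, training)
too_close = syn["dcr"] < ref["dcr"]
print(syn, ref, too_close)
```

In this toy case the synthetic points sit closer to the training data than the holdout points do, so the check flags them as "too close". In practice one would compare the full distance distributions rather than only their means.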

Use Cases for Synthetic Data
for External Data Monetization
- Data Consortia
- Data Marketplaces
- Data Resellers
- Market Research Intel
for Internal Data Sharing
- Data Governance
- Cross-Border Data Sharing
- Cross-Department Data Sharing
- Testing & Development of BI / AI
- Data Literacy / Hackathons
- Data Retention
for External Data Sharing
- Group-Wide Data Sharing
- Open Innovation
- Research Collaborations
- Vendor Validation
- Sandboxes
- Public Data
current industry focus on finance, healthcare and public sector

One More – What’s Better Than Synthetic Data?
Synthetic Fair Data!
https://mostly.ai/2020/05/08/diving-deep-into-fair-synthetic-data-generation-fairness-series-part-5/

Questions?
michael.platzer@mostly.ai
Founder & Chief Strategy Officer
Michael Platzer, PhD

Synthetic Data – How Private Is It?
Real
Synthetic
Real’
Synthetic’
→ Training on Synthetic Data does NOT decrease predictive accuracy, but fixes privacy leak / memorization of classic ML approaches.

The Consequence: Huge Untapped Potential

AI-Generated Synthetic Data is a Game Changer
1. Synthetic Data is Fully Anonymous
2. Synthetic Data is As-Good-As-Real

Generative Deep Models - VAEs
Variational Autoencoders
actual data → Encoder → Latent Space Representation → Decoder → synthetic data

Generative Deep Models - GANs
Generative Adversarial Networks

Generative Deep Models - ARNs
Autoregressive Neural Networks
Synthetic Shakespeare Synthetic Linux Source Code
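
The autoregressive idea is to generate a sequence one element at a time, each conditioned on what came before. The sketch below shows that sampling loop only. As a loudly labeled simplification, it swaps the neural network for plain n-gram counts, and the corpus, function names and parameters are illustrative assumptions.

```python
import random
from collections import Counter, defaultdict

def fit(text, order=3):
    """Count next-character frequencies for every context of length `order`."""
    model = defaultdict(Counter)
    for i in range(len(text) - order):
        model[text[i:i + order]][text[i + order]] += 1
    return model

def generate(model, seed, length, rng):
    """Sample one character at a time, conditioned on the preceding context.

    `seed` must be exactly as long as the `order` used in `fit`.
    """
    order = len(seed)
    out = seed
    for _ in range(length):
        options = model.get(out[-order:])
        if not options:  # context never seen during fitting: stop early
            break
        chars, weights = zip(*options.items())
        out += rng.choices(chars, weights=weights)[0]
    return out

corpus = "to be or not to be that is the question "
model = fit(corpus, order=3)
sample = generate(model, seed="to ", length=40, rng=random.Random(0))
print(sample)
```

A neural autoregressive model replaces the count table with a learned network that can generalize to contexts it has never seen. The generation loop stays the same.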
