Synthetic Data for Big Data Privacy
Helsinki Data Science Meetup
Michael Platzer – Mostly AI
Agenda
1. Anonymization is Hard 5min
2. The Promise of Synthetic Data 5min
3. Case Studies w/ Public Data 5min
4. A Word on Privacy 5min
5. Use Cases 5min
The Privacy vs. Innovation Clash
>> Data Sharing
>> Data Monetization
>> Behavioral Analytics
>> Machine Learning
>> Smart Services
>> New Opportunities
>> Zero Time-To-Data
>> Consumer Understanding
“share as much data with as many people as possible”
Data Protection <<
Privacy Regulations <<
Compliance <<
Reputational Risk <<
Customer Consent <<
Business Justification <<
Restricted Environments <<
Need-to-Know Basis <<
“share as little data with as few people as possible”
Why Not “Simply” Anonymize Data?
(anonymous data not subject to privacy regulations)
Anonymization is Hard
useful but not private
private but not useful
Anonymization is Hard for Images
for any High-Dimensional Data Asset
Curse of Dimensionality
= Exponential Growth in Data Space
− A Curse for Analytics
− A Curse for Machine Learning
− A Curse for Preserving Privacy
− A Blessing for Consumer-Centric Organizations
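The exponential growth can be made concrete with a few lines of arithmetic. The sketch below assumes, purely for illustration, that every attribute takes 10 distinct values and that the data set holds one million records; neither number comes from the talk. It counts the possible attribute combinations and bounds the share of them that the records could occupy.

```python
# Illustrative arithmetic only: the attribute and record counts are assumed values.

def cells(values_per_attribute: int, attributes: int) -> int:
    """Number of distinct combinations of `attributes` categorical attributes."""
    return values_per_attribute ** attributes


def max_occupied_share(records: int, values_per_attribute: int, attributes: int) -> float:
    """Upper bound on the share of combinations that hold at least one record."""
    return min(1.0, records / cells(values_per_attribute, attributes))


if __name__ == "__main__":
    records = 1_000_000  # assumed data set size
    for d in (2, 5, 10, 20):
        share = max_occupied_share(records, 10, d)
        print(f"{d:>2} attributes: {cells(10, d):.1e} combinations, "
              f"at most {share:.1e} of them occupied")
```

Once the occupied share drops far below 1, records tend to sit alone in their combination, which is why coarsening a handful of attributes does little for privacy in high-dimensional data.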
(100 million movie ratings for 17k+ movies)
“We conjecture that the amount of perturbation that must be applied to the data to defeat our algorithm will completely destroy their utility [..] Sanitization techniques from the k-anonymity literature such as generalization and suppression do not provide meaningful privacy guarantees, and in any case fail on high-dimensional data.”
[..] of mobile phone owners are re-identified simply by 2 antenna signals, even when coarsened to hour of day (Nature, 2013)
[..] of credit card owners are re-identified by 3 transactions, even when only the merchant and the date of the transaction are revealed (Science, 2015)
[..] of US citizens are re-identified by date of birth, gender and ZIP code (Health, 2000)
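How exposed a table is to this kind of re-identification can be checked by counting how many records are unique on a set of quasi-identifiers. The following is a minimal sketch, not the method of the cited studies; the function names and the toy records are invented for illustration.

```python
from collections import Counter


def uniqueness_rate(records, quasi_identifiers):
    """Share of records whose quasi-identifier combination occurs exactly once."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


def smallest_group(records, quasi_identifiers):
    """Size of the smallest group sharing one combination (the k in k-anonymity)."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values())


# Toy records, invented for illustration only.
people = [
    {"dob": "1980-01-01", "gender": "F", "zip": "00100"},
    {"dob": "1980-01-01", "gender": "F", "zip": "00100"},
    {"dob": "1975-06-30", "gender": "M", "zip": "00100"},
    {"dob": "1990-12-24", "gender": "F", "zip": "00200"},
]

if __name__ == "__main__":
    qi = ["dob", "gender", "zip"]
    print(uniqueness_rate(people, qi), smallest_group(people, qi))
```

A uniqueness rate close to 1 on only a few attributes means that anyone who knows those attributes for a person can single out that person's record.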
Data Assets Get Locked Up
PERSONAL
DATA
No!
Data Scientists
Data Engineers
Program Mgmt
Business Analysts
Developers
Business Partners
Designers
Researchers
Integrators
Testers
Operations
Innovation
→ How to become data-driven & customer-centric if you can't collaborate with your data?
Chief Data Officer
Data Protection Officer
AI-Generated Synthetic Data
actual, privacy-sensitive data
synthetic, statistically representative data
AI-Generated Synthetic Behavioral Data
enables unrestricted big data utilization
retains ~99% of statistical value
MOSTLY GENERATE is a flexible, scalable, automated and highly accurate Synthetic Data Platform powered by generative deep neural network models for structured behavioral data (e.g. financial transaction data, healthcare data, mobility data, ...). It is an easy-to-deploy software solution that runs on-premises or in a private cloud.
actual, privacy-sensitive data
synthetic, statistically representative data
your secure IT environment
MOSTLY GENERATE
AI-Generated Synthetic Behavioral Data
fully anonymous, granular-level data
A Game Changer for Big Data Anonymization
Synthetic Data – How Accurate Is It?
Measuring Accuracy - An Open Challenge for Unsupervised Learning
− Turing Test
− Descriptive Statistics & Visualizations
− Distance Measure for Distributions (e.g. TVD)
− Benchmark Predictive Models
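The distance measure named in the list, total variation distance (TVD), is simple to compute for a categorical attribute: it is half the sum of the absolute differences between the category shares of the two samples. The sketch below is a generic implementation, not taken from the talk or from MOSTLY GENERATE, and the function name is an assumption.

```python
from collections import Counter


def total_variation_distance(actual, synthetic):
    """TVD between the empirical distributions of two categorical samples.

    0.0 means identical category shares, 1.0 means no category in common.
    """
    p, q = Counter(actual), Counter(synthetic)
    n_p, n_q = len(actual), len(synthetic)
    categories = set(p) | set(q)
    # A Counter returns 0 for categories it has not seen.
    return 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in categories)


if __name__ == "__main__":
    actual = ["a", "a", "a", "b"]      # toy sample: 75% a, 25% b
    synthetic = ["a", "a", "b", "b"]   # toy sample: 50% a, 50% b
    print(total_variation_distance(actual, synthetic))
```

The same function can be applied per attribute, or to tuples of attribute values to compare joint distributions, at the cost of sparser counts.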
Synthetic Data – US Census
26’049 actual citizens
w/ 15 attributes
100’000 synthetic citizens
w/ 15 attributes
MOSTLY GENERATE
https://generate.mostly.ai/
(free, public, yet limited demo)
Age: target = q01: 17y, q50: 37y, q99: 74y; synthetic = q01: 17y, q50: 37y, q99: 74y
Workclass: target = Local-gov: 6.5%; synthetic = Local-gov: 6.7%
Income: target = high-income: 24.1%; synthetic = high-income: 24.4%
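A comparison of this kind can be reproduced for any numeric column by computing the same percentiles on the target and on the synthetic sample. The sketch below uses the Python standard library; the choice of the "inclusive" interpolation method is an assumption, since the talk does not say how its percentiles were computed, and the function names are hypothetical.

```python
import statistics


def percentile_summary(values):
    """Return the 1st, 50th and 99th percentiles of a numeric sample."""
    # quantiles(n=100) returns the 99 cut points between the 100 percentile bins.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return {"q01": cuts[0], "q50": cuts[49], "q99": cuts[98]}


def max_percentile_gap(target, synthetic):
    """Largest absolute difference between matching percentiles of two samples."""
    t, s = percentile_summary(target), percentile_summary(synthetic)
    return max(abs(t[k] - s[k]) for k in t)


if __name__ == "__main__":
    target = list(range(0, 101))      # toy values, not the census ages
    synthetic = list(range(1, 102))   # the same values shifted by one
    print(percentile_summary(target), max_percentile_gap(target, synthetic))
```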
Actual Data Synthetic Data
More Data >> Higher Accuracy
Synthetic Behavioral Data
Synthetic Behavioral Data – CDNOW
MOSTLY GENERATE
23’570 actual customers
50’000 synthetic customers
https://mostly.ai/2020/05/28/how-to-unlock-your-behavioral-data-assets-part-ii/
Synthetic Mobility Traces – Porto Taxi
Original Data Synthetic Data
https://mostly.ai/2020/02/21/protecting-privacy-with-synthetic-location-data/
Original Sample Synthetic Sample
Synthetic Data – How Private Is It?
1. Report Differential Privacy as Theoretical Upper Limit
2. Calculate Empirical Differential Privacy (compute intensive)
3. Post hoc Privacy Analysis based on Individual-Level Distance
− Identical Match Count (IMC)
− Distance To Closest Record (DCR)
− Nearest Neighbor Distance Ratio (NNDR)
Synthetic Data shall be “as close as possible”, but “not too close” to Actual Data. A holdout set determines the benchmark for “too close”.
A perfect solution generates new synthetic data that behaves exactly like actual data that has not been seen before (= holdout data).
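The three distance-based checks can be sketched for numeric records as follows. This is one possible reading of the slide, not MOSTLY AI's implementation: records are assumed to be tuples of already-scaled numbers, distance is assumed to be Euclidean, and all function names are hypothetical. The holdout benchmark is applied by computing the same statistic once for synthetic records and once for holdout records, both measured against the training data.

```python
import math
import statistics


def _sorted_distances(record, reference):
    """Euclidean distances from one record to every reference record, ascending."""
    return sorted(math.dist(record, other) for other in reference)


def dcr(record, reference):
    """Distance to Closest Record: distance to the nearest reference record."""
    return _sorted_distances(record, reference)[0]


def nndr(record, reference):
    """Nearest Neighbor Distance Ratio: nearest over second-nearest distance."""
    nearest, second = _sorted_distances(record, reference)[:2]
    # If even the second neighbour is at distance 0, report the most suspicious value.
    return nearest / second if second > 0 else 0.0


def identical_match_count(sample, reference):
    """Identical Match Count: records of `sample` that appear verbatim in `reference`."""
    seen = set(reference)
    return sum(1 for record in sample if record in seen)


def median_dcr(sample, reference):
    """Median DCR of a sample; compare synthetic-vs-training with holdout-vs-training."""
    return statistics.median(dcr(record, reference) for record in sample)


# Toy 2-D records, invented for illustration only.
training = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]
holdout = [(0.0, 5.0), (6.0, 8.0)]
synthetic = [(0.0, 0.0), (3.0, 0.0)]

if __name__ == "__main__":
    print("IMC:", identical_match_count(synthetic, training))
    print("median DCR synthetic:", median_dcr(synthetic, training))
    print("median DCR holdout:  ", median_dcr(holdout, training))
```

Under this reading, synthetic data is "too close" when its distances to the training data are systematically smaller than those of the holdout records, which the model never saw.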
Use Cases for Synthetic Data
for External Data Monetization
- Data Consortia
- Data Marketplaces
- Data Resellers
- Market Research Intel
for Internal Data Sharing
- Data Governance
- Cross-Border Data Sharing
- Cross-Department Data Sharing
- Testing & Development of BI / AI
- Data Literacy / Hackathons
- Data Retention
for External Data Sharing
- Group-Wide Data Sharing
- Open Innovation
- Research Collaborations
- Vendor Validation
- Sandboxes
- Public Data
current industry focus on finance, healthcare and public sector
One More – What’s Better Than Synthetic Data?
Synthetic Fair Data!
https://mostly.ai/2020/05/08/diving-deep-into-fair-synthetic-data-generation-fairness-series-part-5/
Questions?
michael.platzer@mostly.ai
Founder & Chief Strategy Officer
Michael Platzer, PhD
© MOSTLY AI Solutions MP GmbH, All rights reserved
Synthetic Data – How Private Is It?
Real
Synthetic
Real’
Synthetic’
→ Training on synthetic data does NOT decrease predictive accuracy, but fixes the privacy leak / memorization of classic ML approaches.
The Consequence: Huge Untapped Potential
AI-Generated Synthetic Data is a Game Changer
1. Synthetic Data is Fully Anonymous
2. Synthetic Data is As-Good-As-Real
Generative Deep Models - VAEs
Variational Autoencoders
actual data → Encoder → Latent Space Representation → Decoder → synthetic data
Generative Deep Models - GANs
Generative Adversarial Networks
Generative Deep Models - ARNs
Autoregressive Neural Networks
Synthetic Shakespeare Synthetic Linux Source Code
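The idea behind autoregressive generation is that each element is sampled conditional on the elements generated so far. The toy below is a deliberately simplified stand-in, not a neural network: it replaces the network with a table of character counts over a fixed-length context (a Markov chain), which keeps the sampling loop visible in a few lines. The function names and the training string are invented for illustration.

```python
import random
from collections import Counter, defaultdict


def fit(text, order=2):
    """Count which character follows each context of `order` characters."""
    model = defaultdict(Counter)
    for i in range(len(text) - order):
        model[text[i:i + order]][text[i + order]] += 1
    return model


def sample(model, seed, length, rng):
    """Generate up to `length` characters, each conditioned on the preceding context.

    `seed` must be as long as the `order` used in `fit`.
    """
    order = len(seed)
    out = seed
    for _ in range(length):
        options = model.get(out[-order:])
        if not options:
            break  # unseen context: stop rather than invent a transition
        chars, weights = zip(*options.items())
        out += rng.choices(chars, weights=weights)[0]
    return out


if __name__ == "__main__":
    corpus = "to be or not to be that is the question "  # toy training text
    model = fit(corpus, order=2)
    print(sample(model, "to", 40, random.Random(0)))
```

An autoregressive neural network follows the same loop but replaces the count table with a learned model that can condition on much longer histories.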

Synthetic Data for Big Data Privacy

  • 1. Synthetic Data For Big Data Privacy Helsinki Data Science Meetup Michael Platzer – Mostly AI
  • 2. Agenda: 1. Anonymization is Hard (5 min); 2. The Promise of Synthetic Data (5 min); 3. Case Studies w/ Public Data (5 min); 4. A Word on Privacy (5 min); 5. Use Cases (5 min)
  • 3. The Privacy vs. Innovation Clash. Innovation: Data Sharing, Data Monetization, Behavioral Analytics, Machine Learning, Smart Services, New Opportunities, Zero Time-To-Data, Consumer Understanding; “share as much data with as many people as possible”. Privacy: Data Protection, Privacy Regulations, Compliance, Reputational Risk, Customer Consent, Business Justification, Restricted Environments, need-to-know basis; “share as little data with as few people as possible”.
  • 4. The Privacy vs. Innovation Clash: Why Not “Simply” Anonymize Data? (anonymous data is not subject to privacy regulations)
  • 8. Anonymization is Hard: useful but not private; private but not useful
  • 9. Anonymization is Hard for Images, for any High-Dimensional Data Asset. Curse of Dimensionality = Exponential Growth in Data Space: A Curse for Analytics; A Curse for Machine Learning; A Curse for Preserving Privacy; A Blessing for Consumer-Centric Organizations
  • 10. Anonymization is Hard (100 million movie ratings for 170k+ movies)
  • 11. Anonymization is Hard: “We conjecture that the amount of perturbation that must be applied to the data to defeat our algorithm will completely destroy their utility [..] Sanitization techniques from the k-anonymity literature such as generalization and suppression do not provide meaningful privacy guarantees, and in any case fail on high-dimensional data.” (100 million movie ratings for 170k+ movies)
  • 12. Anonymization is Hard: [..] of mobile phone owners are re-identified simply by 2 antenna signals, even when coarsened to hour of day (Nature, 2013); [..] of credit card owners are re-identified by 3 transactions, even when only the merchant and the date of transaction are revealed (Science, 2015); [..] of US citizens are re-identified by date of birth, gender and ZIP code (Health, 2000)
  • 14. Data Assets Get Locked Up. PERSONAL DATA: No! Data Scientists, Data Engineers, Program Mgmt, Business Analysts, Developers, Business Partners, Designers, Researcher, Integrators, Tester, Operations, Innovation. → How to become data-driven & customer-centric if you can’t collaborate with your data? Chief Data Officer, Data Protection Officer
  • 16. AI-Generated Synthetic Behavioral Data: actual, privacy-sensitive data; synthetic, statistically representative data
  • 17. AI-Generated Synthetic Behavioral Data: actual, privacy-sensitive data; MOSTLY GENERATE (your secure IT environment); synthetic, statistically representative data; fully anonymous, granular-level data. Enables unrestricted big data utilization; retains ~99% of statistical value. MOSTLY GENERATE is a flexible, scalable, automated and highly accurate Synthetic Data Platform powered by generative deep neural network models for structured behavioral data (e.g. financial transaction data, healthcare data, mobility data, ...). It is an easy-to-deploy software solution that runs on-premise or in a private cloud.
  • 18. A Game Changer for Big Data Anonymization
  • 19. Synthetic Data – How Accurate Is It? Measuring Accuracy - An Open Challenge for Unsupervised Learning: Turing Test; Descriptive Statistics & Visualizations; Distance Measure for Distributions (e.g. TVD); Benchmark Predictive Models
  • 20. Synthetic Data – US Census: 26’049 actual citizens w/ 15 attributes; MOSTLY GENERATE; 100’000 synthetic citizens w/ 15 attributes. https://generate.mostly.ai/ (free public, yet limited demo)
  • 21. Synthetic Data – US Census. Age: target = q01: 17y, q50: 37y, q99: 74y; synthetic = q01: 17y, q50: 37y, q99: 74y. Workclass: target = Local-gov: 6.5%; synthetic = Local-gov: 6.7%. Income: target = high-income: 24.1%; synthetic = high-income: 24.4%.
  • 22. Synthetic Data – US Census
  • 23. Synthetic Data – US Census: Actual Data; Synthetic Data
  • 24. Synthetic Data – US Census: More Data >> Higher Accuracy
  • 25. Synthetic Data – US Census
  • 26. Synthetic Data – US Census
  • 28. Synthetic Behavioral Data – CDNOW: 23’570 actual customers; MOSTLY GENERATE; 50’000 synthetic customers. https://mostly.ai/2020/05/28/how-to-unlock-your-behavioral-data-assets-part-ii/
  • 35. Synthetic Mobility Traces – Porto Taxi: Original Data; Synthetic Data. https://mostly.ai/2020/02/21/protecting-privacy-with-synthetic-location-data/
  • 36. Synthetic Mobility Traces – Porto Taxi: Original Sample; Synthetic Sample
  • 37. Synthetic Mobility Traces – Porto Taxi
  • 38. Synthetic Mobility Traces – Porto Taxi
  • 39. Synthetic Data – How Private Is It? 1. Report Differential Privacy as Theoretical Upper Limit; 2. Calculate Empirical Differential Privacy (compute-intensive); 3. Post hoc Privacy Analysis based on Individual-Level Distance: Identical Match Count (IMC), Distance To Closest Record (DCR), Nearest Neighbor Distance Ratio (NNDR). Synthetic Data shall be “as close as possible”, but “not too close” to Actual Data. Holdout determines the benchmark for “too close”. A perfect solution generates new synthetic data that behaves exactly like actual data that hasn’t been seen before (= holdout data).
  • 40. Use Cases for Synthetic Data. For External Data Monetization: Data Consortia, Data Marketplaces, Data Resellers, Market Research Intel. For Internal Data Sharing: Data Governance, Cross-Border Data Sharing, Cross-Department Data Sharing, Testing & Development of BI / AI, Data Literacy / Hackathons, Data Retention. For External Data Sharing: Group-Wide Data Sharing, Open Innovation, Research Collaborations, Vendor Validation, Sandboxes, Public Data. Current industry focus on finance, healthcare and public sector.
  • 41. One More – What’s Better Than Synthetic Data? Synthetic Fair Data! https://mostly.ai/2020/05/08/diving-deep-into-fair-synthetic-data-generation-fairness-series-part-5/
  • 42. Questions? Michael Platzer, PhD, Founder & Chief Strategy Officer, michael.platzer@mostly.ai
  • 43. © MOSTLY AI Solutions MP GmbH, All rights reserved
  • 44. Synthetic Data – How Private Is It? Real; Synthetic; Real’; Synthetic’. → Training on Synthetic Data does NOT decrease predictive accuracy, but fixes the privacy leak / memorization of classic ML approaches.
  • 45. The Consequence: Huge Untapped Potential (MOSTLY AI - CONFIDENTIAL)
  • 46. AI-Generated Synthetic Data is a Game Changer: 1. Synthetic Data is Fully Anonymous; 2. Synthetic Data is As-Good-As-Real
  • 47. Generative Deep Models - VAEs (Variational Autoencoders): actual data → Encoder → Latent Space Representation → Decoder → synthetic data
  • 48. Generative Deep Models - GANs (Generative Adversarial Networks)
  • 49. Generative Deep Models - ARNs (Autoregressive Neural Networks): Synthetic Shakespeare; Synthetic Linux Source Code
  • 50. The Privacy vs. Innovation Clash
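
Slides 19 and 39 name a distribution distance (TVD) and three individual-level privacy measures (IMC, DCR, NNDR) without defining them. The following is a minimal Python sketch of how they could be computed, not MOSTLY AI's implementation. It assumes that TVD stands for total variation distance over a categorical attribute, that records are numeric tuples compared by Euclidean distance, and that NNDR is the distance to the nearest actual record divided by the distance to the second nearest. The function names `tvd` and `privacy_metrics` are hypothetical.

```python
from collections import Counter
from math import dist  # Euclidean distance, available from Python 3.8


def tvd(actual, synthetic):
    """Total variation distance between two categorical samples.

    0.0 means identical category shares, 1.0 means no shared categories.
    """
    p, q = Counter(actual), Counter(synthetic)
    categories = p.keys() | q.keys()
    return 0.5 * sum(
        abs(p[c] / len(actual) - q[c] / len(synthetic)) for c in categories
    )


def privacy_metrics(synthetic, actual):
    """Return (IMC, DCR per record, NNDR per record) for synthetic vs. actual.

    Assumes numeric tuples of equal length and at least two actual records.
    """
    imc, dcr, nndr = 0, [], []
    for record in synthetic:
        # Two smallest distances from this synthetic record to the actual data.
        nearest, second = sorted(dist(record, other) for other in actual)[:2]
        imc += nearest == 0.0  # identical match: distance of exactly zero
        dcr.append(nearest)
        # Assumption: an identical match gets NNDR 0.0 (avoids dividing 0 by 0).
        nndr.append(0.0 if nearest == 0.0 else nearest / second)
    return imc, dcr, nndr
```

One way to apply the holdout idea from slide 39 would be to compute the same metrics for holdout records against the training data and compare the two sets of distances: synthetic records that sit systematically closer to the training data than holdout records do would count as “too close”.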