SlideShare a Scribd company logo
1 of 35
Download to read offline
Privacy-Preserving Sharing
of Financial Transaction Data
with Deep Generative Models
Michael Platzer
Founder & CEO
Mostly AI
Christoph Töglhofer
George Labs
Erste Group
Big Data
➔ drives scientific progress
➔ fosters innovation
➔ improves services, and
➔ helps optimize processes
But if Data is the new Oil...
...then let’s talk side effects
Data Protection is here for a reason
My position is not that there
should be no regulation.
Mark Zuckerberg in 2018
The Privacy vs. Innovation Clash in Finance
Data privacy restricts sharing of data
and thus hampers digital innovation.
1
The Privacy vs. Innovation Clash in Finance
Data privacy restricts sharing of data
and thus hampers digital innovation.
1
Pseudonymization offers no safety, while
Full Anonymization falls short for big data.
2
Pseudonymization Fails for Big Data
Günther Baumgartner Not So Anonymous User #3289384
Classic Anonymization...
Still Not So AnonymousStill Not So Anonymous
20-30€
40-50€
Classic Anonymization Falls Short for Big Data
AnonymousAnonymous
0-30€
30-60€
0-30€
Classic Anonymization Falls Short for Big Data
i.e. for High-Dimensional, Highly Correlated Data
masking - obfuscation adding noise - perturbations
generalizationgeneralization
Simple Demographics Often Identify People Uniquely (Sweeney, 2000)
→ 87% of US citizens identified by date-of-birth, gender and ZIP
AOL search data leak (2006)
Robust De-anonymization of Large Sparse Datasets (Narayanan, 2008)
→ Partial re-identification of the Netflix dataset
“Sanitization techniques from the k-anonymity literature [..] do not provide meaningful privacy guarantees, and in any
case fail on high-dimensional data.”
The privacy bounds of human mobility (Montjoye, 2013)
→ 2 spatio-temporal points are enough to uniquely identify 55% of 1.5m people
Unique in the shopping mall: On the reidentifiability of credit card metadata (Montjoye, 2015)
→ 4 spatio-temporal points are enough to uniquely identify 90% of 1.1m people
→ “even coarse datasets provide little anonymity”
Stalking Celebrities in NYC Taxi Dataset (2014; viz)
→ Everyone’s Digital Trail is Highly Unique
The Underestimated De-Anonymization Risk
amount of captured information
amount of
shareable
information
Classic Anonymization
can only share very limited amount of
information per person
...
Growing Gap
The Problem Untapped Big Data Potential
The Solution
Synthetic Data
Synthetic Data ?
John Doe Jane Doe
hand-crafted: simplistic and biased
Synthetic Data ?
John Doe Jane Doe
hand-engineered: simplistic and biased
AI-Generated Synthetic Data !
state-of-the-art deep neural networks, trained by NVIDIA on 30’000 celebrity photos
the generated new images look like real persons, but they aren’t
AI generated = realistic and representative
Scales with Data Growth
while fully preserving privacy of actual customers
amount of captured information
amount of
shareable
information
Classic Anonymization
can only share very limited amount of
information per person
AI-generated Synthetic Data
retains structure and variation, but no 1:1
relationship to actual persons
...
...
The Synthetic Data Engine by Mostly AI
AI-generated rich synthetic worlds of
customers and their behavior
actual
synthetic
Bayesian Model of Customer Transactions (Platzer and Reutterer, 2016)
flexible MCMC estimation of parametric customer model !
Bayesian Model of Customer Transactions (Platzer and Reutterer, 2016)
But computation & model capacity did not scale to big data
Learning by Being Taught
→ Supervised Learning
Learning by Observation
→ Unsupervised learning
Learning by Exploration
→ Reinforcement Learning
Machine Learning
Generative Deep Models - VAEs
Variational Autoencoders
Encoder Decoder
actual data
Latent Space Representation
synthetic data
- VAE introduced by Kingma & Welling in 2013
- 600+ papers published on VAE in 2017
Generative Deep Models - GANs
Generative Adversarial Networks
- Introduced by Goodfellow et al. in 2014
- 1500+ papers published on GANs in 2017
Generative Deep Models - ARNs
Autoregressive Neural Networks
Synthetic Shakespeare Synthetic Linux Source Code
The Customer Story
Product Development in Finance Industry
Customer Story The Business Problem
- UX Development & Testing w/ realistic data
- platform for 3rd
party development
- feature dev: forecasting account balances
- open research collaboration with university
- deep generative model trained on 100k+ customers w/ 100m+ financial transactions
- ability to simulate an unlimited number of synthetic profiles, accounts and transactions
- results are highly realistic and representative; retain detail, structure and variation
- independent audit by bank’s analytics team: ”over-achieved”
Customer Story The Solution
Hofer
Merkur
Starbucks
Mediamarkt
Thalia
Fressnapf
AirBnB
EVN
OMV
A1
Bauhaus
Santander
Generali
Amazon
Finanzamt
...
100+ merchants
Customer Story Data Quality
Transaction Level
Billa
Spar
Kirchenbeitrag
Ikea
Libro
Kik
Drei
UPC
ORF GIS
Uniqa
ÖBB
WGKK Zahlung
Wiener Netze
Radatz
...
→ actual & synthetic patterns near identical for 100+ merchants
synthetic
actuals
Customer Story Data Quality
Customer Level
€ / month correlations
→ actual & synthetic patterns near identical for 100+ merchants
Customer Story System Setup
- on-premise, secure
- GPU environment
2x Quadro P4000 (8GB)
(training runs ~18h)
- anywhere
- anytime
- CLI / REST API
integration /
pre-processing
SDE
Training Moduleactual
data
SDE
Generator Module synthetic
data
quality
assurance
reports
target
holdout
synthetic
subset
aggregation
integration /
pre-processing
integration /
pre-processing
integration /
post-processing
deep generative model
deep generative model
architecture, encodings & weights
Customer Story Integration
Synthetic data is anonymous.
Key Takeaway
Data privacy restricts sharing of data and
thus hampers digital innovation.
1
Pseudonymization offers no safety, while
Full Anonymization falls short for big data.
2
PROBLEM
SOLUTION
1
2
Generative AI allows highly accurate
synthetic data to be generated at scale.
Christoph Töglhofer
Data Scientist
YES,WE’RE
HIRINGYES, WE’RE
HIRING
george@erstegroup.com
https://george-labs.com/
michael.platzer@mostly.ai
https://mostly.ai/
Contact Details
christoph.toeglhofer@erstegroup.com
Erste Group IT International GmbH
Michael Platzer
Founder & CEO

More Related Content

What's hot

Future of value of data asia iot-asia presentation - 22 march 2018
Future of value of data asia   iot-asia presentation - 22 march 2018Future of value of data asia   iot-asia presentation - 22 march 2018
Future of value of data asia iot-asia presentation - 22 march 2018
Future Agenda
 

What's hot (19)

GraphDay Stockholm - Levaraging Graph-Technology to fight Financial Fraud
GraphDay Stockholm - Levaraging Graph-Technology to fight Financial FraudGraphDay Stockholm - Levaraging Graph-Technology to fight Financial Fraud
GraphDay Stockholm - Levaraging Graph-Technology to fight Financial Fraud
 
How to make business using AI and Big Data?
How to make business using AI and Big Data?How to make business using AI and Big Data?
How to make business using AI and Big Data?
 
Big Data: Industry trends and key players
Big Data: Industry trends and key playersBig Data: Industry trends and key players
Big Data: Industry trends and key players
 
The Web3 Data Economy: Ocean Protocol
The Web3 Data Economy: Ocean ProtocolThe Web3 Data Economy: Ocean Protocol
The Web3 Data Economy: Ocean Protocol
 
29 08-2018 blockchain-introduction_digital wednesday
29 08-2018 blockchain-introduction_digital wednesday29 08-2018 blockchain-introduction_digital wednesday
29 08-2018 blockchain-introduction_digital wednesday
 
A visual approach to fraud detection and investigation - Giuseppe Francavilla
A visual approach to fraud detection and investigation - Giuseppe FrancavillaA visual approach to fraud detection and investigation - Giuseppe Francavilla
A visual approach to fraud detection and investigation - Giuseppe Francavilla
 
Big data asap 2016 v1
Big data asap 2016 v1Big data asap 2016 v1
Big data asap 2016 v1
 
Unlocking Value in the Fragmented World of Big Data Analytics (POV Paper)
Unlocking Value in the Fragmented World of Big Data Analytics (POV Paper)Unlocking Value in the Fragmented World of Big Data Analytics (POV Paper)
Unlocking Value in the Fragmented World of Big Data Analytics (POV Paper)
 
"Big Data Dreams"
"Big Data Dreams""Big Data Dreams"
"Big Data Dreams"
 
Innovations from GAFAM, BATX threats or innovations for businesses
Innovations from GAFAM, BATX threats or innovations for businessesInnovations from GAFAM, BATX threats or innovations for businesses
Innovations from GAFAM, BATX threats or innovations for businesses
 
IT FUTURE- Big data
IT FUTURE- Big dataIT FUTURE- Big data
IT FUTURE- Big data
 
Fy13 performance final
Fy13 performance finalFy13 performance final
Fy13 performance final
 
Future of value of data asia iot-asia presentation - 22 march 2018
Future of value of data asia   iot-asia presentation - 22 march 2018Future of value of data asia   iot-asia presentation - 22 march 2018
Future of value of data asia iot-asia presentation - 22 march 2018
 
Tomorrow's Data Heros
Tomorrow's Data HerosTomorrow's Data Heros
Tomorrow's Data Heros
 
Dwika sharing bisnis Big Data v2a IDBigData Meetup 3rd UI Jakarta
Dwika sharing  bisnis Big Data v2a IDBigData Meetup 3rd UI JakartaDwika sharing  bisnis Big Data v2a IDBigData Meetup 3rd UI Jakarta
Dwika sharing bisnis Big Data v2a IDBigData Meetup 3rd UI Jakarta
 
Big Data in Asia
Big Data in AsiaBig Data in Asia
Big Data in Asia
 
Hacked: Threats, Trends and the Power of Connected Data
Hacked: Threats, Trends and the Power of Connected DataHacked: Threats, Trends and the Power of Connected Data
Hacked: Threats, Trends and the Power of Connected Data
 
Synthetic Data for Big Data Privacy
Synthetic Data for Big Data PrivacySynthetic Data for Big Data Privacy
Synthetic Data for Big Data Privacy
 
Present & Future of Marketing - Disruption: Social Media vs Social Networks =...
Present & Future of Marketing - Disruption: Social Media vs Social Networks =...Present & Future of Marketing - Disruption: Social Media vs Social Networks =...
Present & Future of Marketing - Disruption: Social Media vs Social Networks =...
 

Similar to Nvidia GTC18 Platzer Töglhofer

Similar to Nvidia GTC18 Platzer Töglhofer (20)

APD Presents Best of the Next
APD Presents Best of the Next APD Presents Best of the Next
APD Presents Best of the Next
 
Eric van Tol - Businesscases & Verdienmodellen
Eric van Tol - Businesscases & VerdienmodellenEric van Tol - Businesscases & Verdienmodellen
Eric van Tol - Businesscases & Verdienmodellen
 
BI, AI/ML, Use Cases, Business Impact and how to get started
BI, AI/ML, Use Cases, Business Impact and how to get startedBI, AI/ML, Use Cases, Business Impact and how to get started
BI, AI/ML, Use Cases, Business Impact and how to get started
 
Social Media Monitoring: your data with destiny
Social Media Monitoring: your data with destinySocial Media Monitoring: your data with destiny
Social Media Monitoring: your data with destiny
 
2014 11 24 big data bmm aformationdigitale2014 v print guy huyberechts
2014 11 24 big data bmm aformationdigitale2014 v print guy huyberechts2014 11 24 big data bmm aformationdigitale2014 v print guy huyberechts
2014 11 24 big data bmm aformationdigitale2014 v print guy huyberechts
 
From Info Science to Data Science & Smart Nation
From Info Science to Data Science & Smart Nation From Info Science to Data Science & Smart Nation
From Info Science to Data Science & Smart Nation
 
Keynote Sales Kickoff Interoute
Keynote Sales Kickoff InterouteKeynote Sales Kickoff Interoute
Keynote Sales Kickoff Interoute
 
How to design ai functions to the cloud native infra
How to design ai functions to the cloud native infraHow to design ai functions to the cloud native infra
How to design ai functions to the cloud native infra
 
Trend Report dld2017
Trend Report dld2017Trend Report dld2017
Trend Report dld2017
 
Fontys Eric van Tol
Fontys Eric van TolFontys Eric van Tol
Fontys Eric van Tol
 
Policy paper need for focussed big data & analytics skillset building throu...
Policy  paper  need for focussed big data & analytics skillset building throu...Policy  paper  need for focussed big data & analytics skillset building throu...
Policy paper need for focussed big data & analytics skillset building throu...
 
South By South Best 2018
South By South Best 2018 South By South Best 2018
South By South Best 2018
 
Qu'est ce que le Big Data ? Avec Victoria Galano Data Scientist chez Air France
Qu'est ce que le Big Data ? Avec Victoria Galano Data Scientist chez Air FranceQu'est ce que le Big Data ? Avec Victoria Galano Data Scientist chez Air France
Qu'est ce que le Big Data ? Avec Victoria Galano Data Scientist chez Air France
 
Ps095 - London College of Fashion
Ps095 - London College of FashionPs095 - London College of Fashion
Ps095 - London College of Fashion
 
Rulex big data and analytics
Rulex big data and analyticsRulex big data and analytics
Rulex big data and analytics
 
Neo4j Graph Data Platform: Making Your Data More Intelligent
Neo4j Graph Data Platform: Making Your Data More IntelligentNeo4j Graph Data Platform: Making Your Data More Intelligent
Neo4j Graph Data Platform: Making Your Data More Intelligent
 
Talk_AI_in_Africa
Talk_AI_in_AfricaTalk_AI_in_Africa
Talk_AI_in_Africa
 
From Paris Hilton to Walmart: welcome to the Big Data Revolution
From Paris Hilton to Walmart: welcome to the Big Data RevolutionFrom Paris Hilton to Walmart: welcome to the Big Data Revolution
From Paris Hilton to Walmart: welcome to the Big Data Revolution
 
Digital marketing for ninja e lazio innova giulia decina 2018
Digital marketing for ninja e lazio innova giulia decina 2018Digital marketing for ninja e lazio innova giulia decina 2018
Digital marketing for ninja e lazio innova giulia decina 2018
 
Stop Being A Third-Party Victim - Treat Your Customer Data Like A Pro - Mo Mi...
Stop Being A Third-Party Victim - Treat Your Customer Data Like A Pro - Mo Mi...Stop Being A Third-Party Victim - Treat Your Customer Data Like A Pro - Mo Mi...
Stop Being A Third-Party Victim - Treat Your Customer Data Like A Pro - Mo Mi...
 

More from MOSTLY AI

More from MOSTLY AI (9)

Everything You Always Wanted to Know About Synthetic Data
Everything You Always Wanted to Know About Synthetic DataEverything You Always Wanted to Know About Synthetic Data
Everything You Always Wanted to Know About Synthetic Data
 
Everything you always wanted to know about Synthetic Data
Everything you always wanted to know about Synthetic DataEverything you always wanted to know about Synthetic Data
Everything you always wanted to know about Synthetic Data
 
Synthetic Population Data with MOSTLY AI
Synthetic Population Data with MOSTLY AISynthetic Population Data with MOSTLY AI
Synthetic Population Data with MOSTLY AI
 
AI-based re-identification of behavioral data
AI-based re-identification of behavioral dataAI-based re-identification of behavioral data
AI-based re-identification of behavioral data
 
Artificial Intelligence - How Machines Learn
Artificial Intelligence - How Machines LearnArtificial Intelligence - How Machines Learn
Artificial Intelligence - How Machines Learn
 
PhD Seminar Riezlern 2016
PhD Seminar Riezlern 2016PhD Seminar Riezlern 2016
PhD Seminar Riezlern 2016
 
My Entry to the DMEF CLV Contest
My Entry to the DMEF CLV ContestMy Entry to the DMEF CLV Contest
My Entry to the DMEF CLV Contest
 
Stochastic Models of Noncontractual Consumer Relationships
Stochastic Models of Noncontractual Consumer RelationshipsStochastic Models of Noncontractual Consumer Relationships
Stochastic Models of Noncontractual Consumer Relationships
 
Incorporating Regularity into Models of Noncontractual Customer-Firm Relation...
Incorporating Regularity into Models of Noncontractual Customer-Firm Relation...Incorporating Regularity into Models of Noncontractual Customer-Firm Relation...
Incorporating Regularity into Models of Noncontractual Customer-Firm Relation...
 

Recently uploaded

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 

Nvidia GTC18 Platzer Töglhofer

  • 1. Privacy-Preserving Sharing of Financial Transaction Data with Deep Generative Models Michael Platzer Founder & CEO Mostly AI Christoph Töglhofer George Labs Erste Group
  • 2. Big Data ➔ drives scientific progress ➔ fosters innovation ➔ improves services, and ➔ helps optimize processes
  • 3. But if Data is the new Oil...
  • 4. ...then let’s talk side effects
  • 5. Data Protection is here for a reason
  • 6. My position is not that there should be no regulation. Mark Zuckerberg in 2018
  • 7. The Privacy vs. Innovation Clash in Finance Data privacy restricts sharing of data and thus hampers digital innovation. 1
  • 8. The Privacy vs. Innovation Clash in Finance Data privacy restricts sharing of data and thus hampers digital innovation. 1 Pseudonymization offers no safety, while Full Anonymization falls short for big data. 2
  • 9. Pseudonymization Fails for Big Data Günther Baumgartner Not So Anonymous User #3289384
  • 10. Classic Anonymization... Still Not So AnonymousStill Not So Anonymous 20-30€ 40-50€
  • 11. Classic Anonymization Falls Short for Big Data AnonymousAnonymous 0-30€ 30-60€ 0-30€
  • 12. Classic Anonymization Falls Short for Big Data i.e. for High-Dimensional, Highly Correlated Data masking - obfuscation adding noise - perturbations generalizationgeneralization
  • 13. Simple Demographics Often Identify People Uniquely (Sweeney, 2000) → 87% of US citizens identified by date-of-birth, gender and ZIP AOL search data leak (2006) Robust De-anonymization of Large Sparse Datasets (Narayanan, 2008) → Partial re-identification of the Netflix dataset “Sanitization techniques from the k-anonymity literature [..] do not provide meaningful privacy guarantees, and in any case fail on high-dimensional data.” The privacy bounds of human mobility (Montjoye, 2013) → 2 spatio-temporal points are enough to uniquely identify 55% of 1.5m people Unique in the shopping mall: On the reidentifiability of credit card metadata (Montjoye, 2015) → 4 spatio-temporal points are enough to uniquely identify 90% of 1.1m people → “even coarse datasets provide little anonymity” Stalking Celebrities in NYC Taxi Dataset (2014; viz) → Everyone’s Digital Trail is Highly Unique The Underestimated De-Anonymization Risk
  • 14. amount of captured information amount of shareable information Classic Anonymization can only share very limited amount of information per person ... Growing Gap The Problem Untapped Big Data Potential
  • 16. Synthetic Data ? John Doe Jane Doe hand-crafted: simplistic and biased
  • 17. Synthetic Data ? John Doe Jane Doe hand-engineered: simplistic and biased
  • 18. AI-Generated Synthetic Data ! state-of-the-art deep neural networks, trained by NVIDIA on 30’000 celebrity photos the generated new images look like real persons, but they aren’t AI generated = realistic and representative
  • 19. Scales with Data Growth while fully preserving privacy of actual customers amount of captured information amount of shareable information Classic Anonymization can only share very limited amount of information per person AI-generated Synthetic Data retains structure and variation, but no 1:1 relationship to actual persons ... ...
  • 20. The Synthetic Data Engine by Mostly AI AI-generated rich synthetic worlds of customers and their behavior actual synthetic
  • 21. Bayesian Model of Customer Transactions (Platzer and Reutterer, 2016) flexible MCMC estimation of parametric customer model !
  • 22. Bayesian Model of Customer Transactions (Platzer and Reutterer, 2016) But computation & model capacity did not scale to big data
  • 23. Learning by Being Taught → Supervised Learning Learning by Observation → Unsupervised learning Learning by Exploration → Reinforcement Learning Machine Learning
  • 24. Generative Deep Models - VAEs Variational Autoencoders Encoder Decoder actual data Latent Space Representation synthetic data - VAE introduced by Kingma & Welling in 2013 - 600+ papers published on VAE in 2017
  • 25. Generative Deep Models - GANs Generative Adversarial Networks - Introduced by Goodfellow et al. in 2014 - 1500+ papers published on GANs in 2017
  • 26. Generative Deep Models - ARNs Autoregressive Neural Networks Synthetic Shakespeare Synthetic Linux Source Code
  • 27. The Customer Story Product Development in Finance Industry
  • 28. Customer Story The Business Problem - UX Development & Testing w/ realistic data - platform for 3rd party development - feature dev: forecasting account balances - open research collaboration with university
  • 29. - deep generative model trained on 100k+ customers w/ 100m+ financial transactions - ability to simulate an unlimited number of synthetic profiles, accounts and transactions - results are highly realistic and representative; retain detail, structure and variation - independent audit by bank’s analytics team: ”over-achieved” Customer Story The Solution
  • 30. Hofer Merkur Starbucks Mediamarkt Thalia Fressnapf AirBnB EVN OMV A1 Bauhaus Santander Generali Amazon Finanzamt ... 100+ merchants Customer Story Data Quality Transaction Level Billa Spar Kirchenbeitrag Ikea Libro Kik Drei UPC ORF GIS Uniqa ÖBB WGKK Zahlung Wiener Netze Radatz ... → actual & synthetic patterns near identical for 100+ merchants synthetic actuals
  • 31. Customer Story Data Quality Customer Level € / month correlations → actual & synthetic patterns near identical for 100+ merchants
  • 32. Customer Story System Setup - on-premise, secure - GPU environment 2x Quadro P4000 (8GB) (training runs ~18h) - anywhere - anytime - CLI / REST API integration / pre-processing SDE Training Moduleactual data SDE Generator Module synthetic data quality assurance reports target holdout synthetic subset aggregation integration / pre-processing integration / pre-processing integration / post-processing deep generative model deep generative model architecture, encodings & weights
  • 34. Synthetic data is anonymous. Key Takeaway Data privacy restricts sharing of data and thus hampers digital innovation. 1 Pseudonymization offers no safety, while Full Anonymization falls short for big data. 2 PROBLEM SOLUTION 1 2 Generative AI allows highly accurate synthetic data to be generated at scale.
  • 35. Christoph Töglhofer Data Scientist YES,WE’RE HIRINGYES, WE’RE HIRING george@erstegroup.com https://george-labs.com/ michael.platzer@mostly.ai https://mostly.ai/ Contact Details christoph.toeglhofer@erstegroup.com Erste Group IT International GmbH Michael Platzer Founder & CEO