Jim Dowling
CEO / Co-Founder
Logical Clocks
Associate Prof at KTH – Royal Institute of Technology
Anti Money Laundering and GANs
Berlin Meetup
@jim_dowling
● Problem: Increase detection rate and reduce costs for AML.
● Solution: We used the Hopsworks platform to train GANs to classify
transactions as suspected of money laundering or not. We worked with
a large transaction dataset (~40 TB) and the solution uses Spark for Feature
Engineering and TensorFlow/GPUs to train a binary classifier, classifying
transactions as either clean or dirty. We use the open-source Hopsworks
platform to manage features, scale-out training, and manage models.
● Reference: Whitepaper
Agenda
● Money laundering involves turning the “dirty” money into “clean” money either
through an obscure sequence of banking transfers or through commercial
transactions.
● The three broad stages of money laundering* are:
○ Placement (smurf it)
○ Layering (spread it out fast)
○ Integration (buy stuff)
What is Money Laundering?
*https://towardsdatascience.com/the-art-of-engineering-features-for-a-strong-machine-learning-model-a47a876e654c
Rule-Based AML vs Deep Learning AML
AML as a Supervised ML Problem
● Anti-money laundering (AML) is a pattern
matching problem
● AML systems should automatically flag ‘suspect’
financial transactions
○ Followed by manual investigation
● Historical transaction datasets have massive
data imbalance between the number of ‘clean’
transactions versus ‘dirty’ transactions
Clean Transactions: Millions or Billions
Dirty Transactions: 100s or 1000s
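One common way to compensate for this kind of imbalance in a supervised classifier is to weight each class inversely to its frequency. The sketch below is an illustration only, not the method used in the talk; the counts are made-up values in the ranges quoted above, and the helper name `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(counts):
    """Weight each class by total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}

# Illustrative counts only: about a million clean rows, a few hundred dirty ones.
counts = {"clean": 1_000_000, "dirty": 500}
weights = balanced_class_weights(counts)
print(weights)
```

With these weights, each class contributes the same total weight to the loss, so the rare "dirty" class is not drowned out by the "clean" majority.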
Implications of AML as a Binary Classification Problem
True Positive
Reality:  A Money Laundering Transaction 
Prediction: “Dirty” transaction predicted
Result: Good
False Positive
Reality: Not a Money Laundering Transaction 
Prediction: “Dirty” transaction predicted
Result: Unnecessary work and cost!
False Negative
Reality: A Money Laundering Transaction 
Prediction: “Clean” transaction predicted
Result: Fines/jail by authorities/regulator!
True Negative
Reality: Not a Money Laundering Transaction 
Prediction: “Clean” transaction predicted
Result: Good
Confusion matrix of our Binary AML Classifier with all possible predictions and their consequences.
We evaluate models with a variant of the F1 score, because precision, recall, and fallout should not be weighted equally.
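The slides do not say which F1 variant is used. One plausible candidate is the F-beta score, where beta > 1 weights recall (missed laundering) more heavily than precision (wasted investigations). The sketch below assumes that choice; the confusion-matrix counts are invented for illustration.

```python
def classification_rates(tp, fp, fn, tn):
    """Return (precision, recall, fallout) from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fallout = fp / (fp + tn) if fp + tn else 0.0  # false positive rate
    return precision, recall, fallout

def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 favours recall, beta < 1 favours precision."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0

# Invented counts: 100 dirty transactions, 9,900 clean ones.
precision, recall, fallout = classification_rates(tp=60, fp=140, fn=40, tn=9760)
f1 = f_beta(precision, recall, beta=1.0)
f2 = f_beta(precision, recall, beta=2.0)
print(precision, recall, fallout, f1, f2)
```

Here precision is 0.3 and recall is 0.6, so F1 is about 0.40 while F2 is about 0.50: the recall-weighted score rewards catching more dirty transactions even at the cost of extra false positives.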
AML as an Anomaly Detection Problem
“Anomaly detection follows
quite naturally from a good
unsupervised model”
Alex Graves (DeepMind)
Traditional unsupervised approaches,
such as k-means clustering and
principal component analysis,
do not scale
[Image from Ruff et al., “Deep Semi-Supervised Anomaly Detection”, https://arxiv.org/pdf/1906.02694.pdf]
AML - Semi-Supervised Anomaly Detection
AML is not a classical use-case for anomaly
detection as we typically have labelled
datasets, albeit imbalanced.
“Semi-supervised learning is a class of machine
learning tasks and techniques that also make
use of unlabeled data for training – typically a
small amount of labeled data with a large
amount of unlabeled data.” Wikipedia
GANs and Other Methods for Anomaly Detection
● Variational Auto-Encoders for Anomaly Detection
○ Easier to train, performance not state-of-the-art
● Generative Adversarial Networks (GANs)
○ Learn the manifold of normal samples (what to do if anomaly-free
dataset is polluted)
○ One-Class Classifier for Novelty Detection GAN
○ BiGAN, BigGAN, BigBiGAN, GANOMALY, f-AnoGAN, GANs for Fraud
“[For GANs] the Convolutional Neural Network architecture is more important than how you
train them”, Marc Aurelio Ranzato (Facebook) at NeurIPS 2018.
GAN Discriminator-Based Anomaly Detection
GANs are hard to train
● Pick the right GAN Architecture
● Risk of mode-collapse
● Hard to tune Hyperparameters
Different Hyperparameter Tuning Strategies
GANs are hard to train
● Mode collapse
○ Transactional data distributions are multimodal. There will be multiple types of
transactional behaviour that will be perfectly normal.
○ The original GAN is based on a zero-sum, non-cooperative game. In this setting,
if the mini-max game reaches the Nash equilibrium too soon, the generator
learns to produce only a limited number of modes, and mode collapse occurs.
● GANs are highly sensitive to the hyperparameters.
○ Finding good hyperparameters takes time, especially for GANs. A list of possible
hyperparameters and tricks is available at https://github.com/soumith/ganhacks
○ It is essential to have a good optimization and hyperparameter tuning engine
How to address mode collapse problem
● MO-GAAL [Liu et al.] proposed using multiple generators, where different
generators are in charge of learning different modes of the distribution.
● Schlegl et al., in f-AnoGAN, proposed replacing DCGAN with WGAN-GP and
introducing an encoder, trained sequentially after the GAN, for image-to-latent
space mapping.
● Berg et al. improved f-AnoGAN by training the Generator and Encoder jointly, as
well as employing a progressive growing GAN.
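To make the encoder-based approach concrete: at inference time, f-AnoGAN-style methods score a sample by how poorly the generator reconstructs it from its encoding, plus a weighted residual in discriminator feature space. The sketch below shows only the shape of that score. The `encoder`, `generator`, and `disc_features` functions are hypothetical toy stand-ins (not trained networks), and `kappa` is an assumed weighting parameter.

```python
def encoder(x):
    # Toy encoder: least-squares projection of x onto the direction (1, 2).
    return (x[0] + 2.0 * x[1]) / 5.0

def generator(z):
    # Toy generator: maps a latent scalar onto the "normal" line x2 = 2 * x1.
    return (z, 2.0 * z)

def disc_features(x):
    # Toy stand-in for intermediate discriminator features.
    return (x[0] + x[1],)

def mean_sq_diff(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

def anomaly_score(x, kappa=1.0):
    """Reconstruction residual plus kappa times the feature-space residual."""
    x_hat = generator(encoder(x))
    residual = mean_sq_diff(x, x_hat)
    feature_residual = mean_sq_diff(disc_features(x), disc_features(x_hat))
    return residual + kappa * feature_residual

normal_score = anomaly_score((1.0, 2.0))      # lies on the learned manifold
anomalous_score = anomaly_score((1.0, -2.0))  # lies off the manifold
print(normal_score, anomalous_score)
```

A sample on the "normal" manifold is reconstructed exactly and scores zero, while the off-manifold sample gets a clearly higher score. In practice a threshold on this score decides which transactions are flagged.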
WGAN-Gradient-Penalty Based Anomaly Detection
[Image from Berg et al. - https://arxiv.org/pdf/1905.11034.pdf ]
Will GANs help improve AML predictions?
Expected results from using GANs (Anomaly Detection at Spark/AI EU Summit 2019)
“[In China] two commercial
banks have reduced losses
of about 10 million RMB in
twelve weeks and
significantly improved their
business reputation”
GAN-based telecom fraud
detection at the receiving
bank
Online
Feature Store
Offline
Feature Store
Train,
Batch App
Feature Store
<10ms
TBs/PBs
How can we manage the Features between Training/Serving?
Recent transaction counts
(Streaming App)
Streaming App pushes CDC data
Pandas App updates every hour
Batch PySpark App pushes
updates every day
Low
Latency
Features
High
Latency
Features
Real-time features
(cust IDs, amount, type, datetime)
Real-time
Data
Event Data
SQL
S3, HDFS
Online AML
App
SQL DW
DataFrameAPI
HOPSWORKS
Offline FS
Apache Hive
HopsFS
Read and Join Features
Online FS
MySQL Cluster
(External)
Spark Cluster
fs.get_features(["name", "Pclass",
"Sex", "Balance", "Survived"])
Storage
(S3, HopsFS, HDFS, ADLS)
.npy, .tfrecords, .csv
Create AML Training Datasets
Model Development Lifecycle in Hopsworks
Hopsworks Conventions
/training_datasets
/models
/logs
/notebooks
/featurestore
Conventions and Implicit Provenance in Hopsworks*
*https://www.usenix.org/conference/opml20/presentation/ormenisan
dataset = tf.data.Dataset.list_files("training_datasets/resnet/*.tfrecord")
tf.saved_model.save(model, 'models/ResNet')
maggy.lagom(....)
Exploration
Experimentation
Model
Training
Explainability
Validation
Serving
Feature
Pipelines
ML Model Development Lifecycle
Hyperparameter Tuning for GANs
Explore
and Design
Experimentation:
Tune and Search
Model Training
(Distributed)
Explainability and
Ablation Studies
ML Model Dev Lifecycle is Iterative
Rewrite your code at each stage => Iteration is impossible!
Ablation Studies | EDA | HParam Tuning | Training (Dist)
It’s the Frameworks’ fault – they make us rewrite it!
OBLIVIOUS TRAINING FUNCTION
# RUNS ON THE WORKERS
def train():
    def input_fn():  # return dataset
    model = …
    optimizer = …
    model.compile(…)
    ….
Ablation Studies | EDA | HParam Tuning | Training (Dist)
Oblivious Training Function – Write Once, Reuse Many Times
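import numpy as np
import tensorflow as tf
# Assumption: load_data is the Keras MNIST loader (28x28 images, 60000 training rows).
from tensorflow.keras.datasets.mnist import load_data
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.optimizers import SGD

def dataset(batch_size):
    # The Keras loader returns (train, test) tuples; only the training split is used.
    (x_train, y_train), _ = load_data()
    # Scale pixel values from [0, 255] to [0, 1].
    x_train = x_train / np.float32(255)
    y_train = y_train.astype(np.int64)
    train_dataset = (
        tf.data.Dataset.from_tensor_slices((x_train, y_train))
        .shuffle(60000)
        .repeat()
        .batch(batch_size)
    )
    return train_dataset

def build_and_compile_cnn_model(lr):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        # Conv2D expects a channel axis, so reshape to (28, 28, 1).
        tf.keras.layers.Reshape(target_shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10)
    ])
    model.compile(
        loss=SparseCategoricalCrossentropy(from_logits=True),
        optimizer=SGD(learning_rate=lr))
    return model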
NO CHANGES! The same dataset() and build_and_compile_cnn_model() functions are reused as-is at every stage.
What is Transparent Code in Practice?
def aml(kernel, pool, dropout, reporter):
    # This is your training iteration loop
    for i in range(number_iterations):
        ...
        # add the maggy reporter to report the metric to be optimized
        reporter.broadcast(metric=accuracy)
        ...
    # Return the same final metric
    return accuracy

from maggy import experiment, Searchspace
sp = Searchspace(kernel=('INTEGER', [2, 8]), pool=('INTEGER', [2, 8]))
# aml() also takes dropout, so it needs a search-space entry (range is illustrative)
sp.add('dropout', ('DOUBLE', [0.01, 0.99]))
result = experiment.lagom(train=aml, searchspace=sp, optimizer='randomsearch',
                          direction='max', num_trials=15, name='MNIST')
Maggy for HParam Optimization
Maggy is built on top of PySpark
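For readers unfamiliar with what a random-search optimizer does with a search space, the following is a minimal single-process sketch. It is not Maggy's implementation, which runs trials on Spark executors. The `random_search` helper, the dictionary-based search space, and `toy_objective` are all hypothetical stand-ins.

```python
import random

def random_search(train_fn, searchspace, num_trials, direction="max", seed=0):
    """Sample integer hyperparameters uniformly and keep the best trial."""
    rng = random.Random(seed)
    trials = []
    for _ in range(num_trials):
        params = {name: rng.randint(low, high)
                  for name, (low, high) in searchspace.items()}
        trials.append((train_fn(**params), params))
    pick = max if direction == "max" else min
    return pick(trials, key=lambda t: t[0]), trials

def toy_objective(kernel, pool):
    # Hypothetical stand-in for a real training run: peaks at kernel=5, pool=3.
    return 1.0 - 0.05 * abs(kernel - 5) - 0.05 * abs(pool - 3)

space = {"kernel": (2, 8), "pool": (2, 8)}
(best_score, best_params), all_trials = random_search(
    toy_objective, space, num_trials=15)
print(best_score, best_params)
```

The training function only sees its hyperparameters and returns a metric, which is what lets the same function be handed to a different optimizer, or to a distributed runner, without being rewritten.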
Get Started: Paysim AML Dataset (Kaggle)
● Graph-based Candidate Features, Concatenated Features
○ Link the origin account, destination account, and transaction type to track
the problem of smurfing and the higher cash withdrawals
● Frequency Candidate Features
○ Learn how frequently the account is used
● Amount Features
○ Magnitude of the amount of transactions.
● Time-Since Features
○  Learn the speed of transactions
● Velocity-Change Features
○ Identify a sudden change in the behaviour of accounts
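A few of these candidate features can be sketched in plain Python. The column names (`step`, `type`, `amount`, `nameOrig`, `nameDest`) follow the PaySim CSV, but the rows are invented, and the feature definitions and names such as `prior_txn_count` are illustrative assumptions rather than the features used in the talk. At the 40 TB scale mentioned earlier, this logic would be expressed as Spark window functions instead.

```python
from collections import defaultdict

# Toy transactions using PaySim-style column names (values are made up).
transactions = [
    {"step": 1, "type": "CASH_OUT", "amount": 100.0, "nameOrig": "C1", "nameDest": "C9"},
    {"step": 2, "type": "TRANSFER", "amount": 120.0, "nameOrig": "C1", "nameDest": "C8"},
    {"step": 2, "type": "PAYMENT", "amount": 50.0, "nameOrig": "C2", "nameDest": "M1"},
    {"step": 3, "type": "TRANSFER", "amount": 2200.0, "nameOrig": "C1", "nameDest": "C9"},
]

def add_account_features(rows):
    """Add per-origin-account features computed only from earlier transactions."""
    history = defaultdict(list)  # nameOrig -> [(step, amount), ...] seen so far
    out = []
    for row in sorted(rows, key=lambda r: r["step"]):
        past = history[row["nameOrig"]]
        mean_amount = sum(a for _, a in past) / len(past) if past else None
        feats = dict(row)
        # Concatenated feature: origin, destination, and transaction type.
        feats["link_key"] = f'{row["nameOrig"]}->{row["nameDest"]}:{row["type"]}'
        # Frequency feature: how often the account has been used so far.
        feats["prior_txn_count"] = len(past)
        # Time-since feature: steps elapsed since the account's last transaction.
        feats["steps_since_last"] = row["step"] - past[-1][0] if past else None
        # Velocity-change feature: current amount relative to the running mean.
        feats["amount_vs_mean"] = row["amount"] / mean_amount if mean_amount else None
        out.append(feats)
        past.append((row["step"], row["amount"]))
    return out

features = add_account_features(transactions)
for f in features:
    print(f["link_key"], f["prior_txn_count"], f["steps_since_last"], f["amount_vs_mean"])
```

The last transaction from account C1 is 20 times its running mean amount, which is the kind of sudden change a velocity-change feature is meant to surface.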
https://www.kaggle.com/ntnu-testimon/paysim1?select=PS_20174392719_1491204439457_log.csv
Hopsworks Cluster
Project-Based Multi-Tenant Security
API
KEY
IAM Profile
Users
Jobs
Dev Feature Store
Staging Feature Store
Prod Feature Store
User
Login
(LDAP, AD,
OAuth2, 2FA)
databricks
SageMaker
Kubeflow
Amazon EMR
Delta Lake, Snowflake, Amazon S3
Amazon
Redshift
Full Featured
AGPL-v3 License Model
Hopsworks Community
Kubernetes Support
• Model Serving
• Other services for robustness (Jupyter, more coming)
Authentication (LDAP, Kerberos, OAuth2)
Github support
Hopsworks Enterprise
Managed SAAS platform (currently only on AWS)
Hopsworks.ai
Hopsworks – open-source or managed platform
Thank you.
github.com/logicalclocks/hopsworks
-
@logicalclocks
-
www.logicalclocks.com

More Related Content

What's hot

Anti-Money Laundering Solution
Anti-Money Laundering SolutionAnti-Money Laundering Solution
Anti-Money Laundering SolutionSri Ambati
 
AWS Global Infrastructure Foundations
AWS Global Infrastructure Foundations AWS Global Infrastructure Foundations
AWS Global Infrastructure Foundations Amazon Web Services
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRAmazon Web Services
 
Introduction to AI/ML with AWS
Introduction to AI/ML with AWSIntroduction to AI/ML with AWS
Introduction to AI/ML with AWSSuman Debnath
 
Pre trained language model
Pre trained language modelPre trained language model
Pre trained language modelJiWenKim
 
ML-Ops how to bring your data science to production
ML-Ops  how to bring your data science to productionML-Ops  how to bring your data science to production
ML-Ops how to bring your data science to productionHerman Wu
 
Build, train, and deploy ML models at scale.pdf
Build, train, and deploy ML models at scale.pdfBuild, train, and deploy ML models at scale.pdf
Build, train, and deploy ML models at scale.pdfAmazon Web Services
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationKnoldus Inc.
 
LanGCHAIN Framework
LanGCHAIN FrameworkLanGCHAIN Framework
LanGCHAIN FrameworkKeymate.AI
 
Customizing LLMs
Customizing LLMsCustomizing LLMs
Customizing LLMsJim Steele
 
Introduction to Amazon Web Services
Introduction to Amazon Web ServicesIntroduction to Amazon Web Services
Introduction to Amazon Web ServicesAmazon Web Services
 
Generative AI for the rest of us
Generative AI for the rest of usGenerative AI for the rest of us
Generative AI for the rest of usMassimo Ferre'
 
Transformers, LLMs, and the Possibility of AGI
Transformers, LLMs, and the Possibility of AGITransformers, LLMs, and the Possibility of AGI
Transformers, LLMs, and the Possibility of AGISynaptonIncorporated
 
Conversational AI and Chatbot Integrations
Conversational AI and Chatbot IntegrationsConversational AI and Chatbot Integrations
Conversational AI and Chatbot IntegrationsCristina Vidu
 
An overview of foundation models.pdf
An overview of foundation models.pdfAn overview of foundation models.pdf
An overview of foundation models.pdfStephenAmell4
 

What's hot (20)

Intro to SageMaker
Intro to SageMakerIntro to SageMaker
Intro to SageMaker
 
Anti-Money Laundering Solution
Anti-Money Laundering SolutionAnti-Money Laundering Solution
Anti-Money Laundering Solution
 
AWS Global Infrastructure Foundations
AWS Global Infrastructure Foundations AWS Global Infrastructure Foundations
AWS Global Infrastructure Foundations
 
Intro to AI & ML at Amazon
Intro to AI & ML at AmazonIntro to AI & ML at Amazon
Intro to AI & ML at Amazon
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
Introduction to AI/ML with AWS
Introduction to AI/ML with AWSIntroduction to AI/ML with AWS
Introduction to AI/ML with AWS
 
Pre trained language model
Pre trained language modelPre trained language model
Pre trained language model
 
ML-Ops how to bring your data science to production
ML-Ops  how to bring your data science to productionML-Ops  how to bring your data science to production
ML-Ops how to bring your data science to production
 
Build, train, and deploy ML models at scale.pdf
Build, train, and deploy ML models at scale.pdfBuild, train, and deploy ML models at scale.pdf
Build, train, and deploy ML models at scale.pdf
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its application
 
LanGCHAIN Framework
LanGCHAIN FrameworkLanGCHAIN Framework
LanGCHAIN Framework
 
Como reducir costos en AWS
Como reducir costos en AWSComo reducir costos en AWS
Como reducir costos en AWS
 
Customizing LLMs
Customizing LLMsCustomizing LLMs
Customizing LLMs
 
Overview of Amazon Web Services
Overview of Amazon Web ServicesOverview of Amazon Web Services
Overview of Amazon Web Services
 
Introduction to Serverless
Introduction to ServerlessIntroduction to Serverless
Introduction to Serverless
 
Introduction to Amazon Web Services
Introduction to Amazon Web ServicesIntroduction to Amazon Web Services
Introduction to Amazon Web Services
 
Generative AI for the rest of us
Generative AI for the rest of usGenerative AI for the rest of us
Generative AI for the rest of us
 
Transformers, LLMs, and the Possibility of AGI
Transformers, LLMs, and the Possibility of AGITransformers, LLMs, and the Possibility of AGI
Transformers, LLMs, and the Possibility of AGI
 
Conversational AI and Chatbot Integrations
Conversational AI and Chatbot IntegrationsConversational AI and Chatbot Integrations
Conversational AI and Chatbot Integrations
 
An overview of foundation models.pdf
An overview of foundation models.pdfAn overview of foundation models.pdf
An overview of foundation models.pdf
 

Similar to GANs for Anti Money Laundering

Predictive analytics semi-supervised learning with GANs
Predictive analytics   semi-supervised learning with GANsPredictive analytics   semi-supervised learning with GANs
Predictive analytics semi-supervised learning with GANsterek47
 
Using GANs to improve generalization in a semi-supervised setting - trying it...
Using GANs to improve generalization in a semi-supervised setting - trying it...Using GANs to improve generalization in a semi-supervised setting - trying it...
Using GANs to improve generalization in a semi-supervised setting - trying it...PyData
 
Semi-supervised learning with GANs
Semi-supervised learning with GANsSemi-supervised learning with GANs
Semi-supervised learning with GANsterek47
 
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™Databricks
 
B4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearningB4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearningHoa Le
 
Computation graphs - Tensorflow & CNTK
Computation graphs - Tensorflow & CNTKComputation graphs - Tensorflow & CNTK
Computation graphs - Tensorflow & CNTKA H M Forhadul Islam
 
Everything you need to know about AutoML
Everything you need to know about AutoMLEverything you need to know about AutoML
Everything you need to know about AutoMLArpitha Gurumurthy
 
Machine learning-for-dummies-andrews-sobral-activeeon
Machine learning-for-dummies-andrews-sobral-activeeonMachine learning-for-dummies-andrews-sobral-activeeon
Machine learning-for-dummies-andrews-sobral-activeeonActiveeon
 
Presentation
PresentationPresentation
Presentationbutest
 
Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018
Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018
Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018Sri Ambati
 
When We Spark and When We Don’t: Developing Data and ML Pipelines
When We Spark and When We Don’t: Developing Data and ML PipelinesWhen We Spark and When We Don’t: Developing Data and ML Pipelines
When We Spark and When We Don’t: Developing Data and ML PipelinesStitch Fix Algorithms
 
Machine Learning for Dummies (without mathematics)
Machine Learning for Dummies (without mathematics)Machine Learning for Dummies (without mathematics)
Machine Learning for Dummies (without mathematics)ActiveEon
 
GNA 13552928 deep learning for GAN a.ppt
GNA 13552928 deep learning for GAN a.pptGNA 13552928 deep learning for GAN a.ppt
GNA 13552928 deep learning for GAN a.pptManiMaran230751
 
AI & ML in Defence Systems - Sunil Chomal
AI & ML in Defence Systems   - Sunil ChomalAI & ML in Defence Systems   - Sunil Chomal
AI & ML in Defence Systems - Sunil ChomalSunil Chomal
 
deepnet-lourentzou.ppt
deepnet-lourentzou.pptdeepnet-lourentzou.ppt
deepnet-lourentzou.pptyang947066
 
Web Traffic Time Series Forecasting
Web Traffic  Time Series ForecastingWeb Traffic  Time Series Forecasting
Web Traffic Time Series ForecastingBillTubbs
 

Similar to GANs for Anti Money Laundering (20)

Predictive analytics semi-supervised learning with GANs
Predictive analytics   semi-supervised learning with GANsPredictive analytics   semi-supervised learning with GANs
Predictive analytics semi-supervised learning with GANs
 
Data Parallel Deep Learning
Data Parallel Deep LearningData Parallel Deep Learning
Data Parallel Deep Learning
 
Using GANs to improve generalization in a semi-supervised setting - trying it...
Using GANs to improve generalization in a semi-supervised setting - trying it...Using GANs to improve generalization in a semi-supervised setting - trying it...
Using GANs to improve generalization in a semi-supervised setting - trying it...
 
Semi-supervised learning with GANs
Semi-supervised learning with GANsSemi-supervised learning with GANs
Semi-supervised learning with GANs
 
AI and Deep Learning
AI and Deep Learning AI and Deep Learning
AI and Deep Learning
 
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
 
C3 w1
C3 w1C3 w1
C3 w1
 
B4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearningB4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearning
 
Computation graphs - Tensorflow & CNTK
Computation graphs - Tensorflow & CNTKComputation graphs - Tensorflow & CNTK
Computation graphs - Tensorflow & CNTK
 
Everything you need to know about AutoML
Everything you need to know about AutoMLEverything you need to know about AutoML
Everything you need to know about AutoML
 
AutoML lectures (ACDL 2019)
AutoML lectures (ACDL 2019)AutoML lectures (ACDL 2019)
AutoML lectures (ACDL 2019)
 
Machine learning-for-dummies-andrews-sobral-activeeon
Machine learning-for-dummies-andrews-sobral-activeeonMachine learning-for-dummies-andrews-sobral-activeeon
Machine learning-for-dummies-andrews-sobral-activeeon
 
Presentation
PresentationPresentation
Presentation
 
Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018
Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018
Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018
 
When We Spark and When We Don’t: Developing Data and ML Pipelines
When We Spark and When We Don’t: Developing Data and ML PipelinesWhen We Spark and When We Don’t: Developing Data and ML Pipelines
When We Spark and When We Don’t: Developing Data and ML Pipelines
 
Machine Learning for Dummies (without mathematics)
Machine Learning for Dummies (without mathematics)Machine Learning for Dummies (without mathematics)
Machine Learning for Dummies (without mathematics)
 
GNA 13552928 deep learning for GAN a.ppt
GNA 13552928 deep learning for GAN a.pptGNA 13552928 deep learning for GAN a.ppt
GNA 13552928 deep learning for GAN a.ppt
 
AI & ML in Defence Systems - Sunil Chomal
AI & ML in Defence Systems   - Sunil ChomalAI & ML in Defence Systems   - Sunil Chomal
AI & ML in Defence Systems - Sunil Chomal
 
deepnet-lourentzou.ppt
deepnet-lourentzou.pptdeepnet-lourentzou.ppt
deepnet-lourentzou.ppt
 
Web Traffic Time Series Forecasting
Web Traffic  Time Series ForecastingWeb Traffic  Time Series Forecasting
Web Traffic Time Series Forecasting
 

More from Jim Dowling

ARVC and flecainide case report[EI] Jim.docx.pdf
ARVC and flecainide case report[EI] Jim.docx.pdfARVC and flecainide case report[EI] Jim.docx.pdf
ARVC and flecainide case report[EI] Jim.docx.pdfJim Dowling
 
PyData Berlin 2023 - Mythical ML Pipeline.pdf
PyData Berlin 2023 - Mythical ML Pipeline.pdfPyData Berlin 2023 - Mythical ML Pipeline.pdf
PyData Berlin 2023 - Mythical ML Pipeline.pdfJim Dowling
 
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdfPyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdfJim Dowling
 
_Python Ireland Meetup - Serverless ML - Dowling.pdf
_Python Ireland Meetup - Serverless ML - Dowling.pdf_Python Ireland Meetup - Serverless ML - Dowling.pdf
_Python Ireland Meetup - Serverless ML - Dowling.pdfJim Dowling
 
Building Hopsworks, a cloud-native managed feature store for machine learning
Building Hopsworks, a cloud-native managed feature store for machine learning Building Hopsworks, a cloud-native managed feature store for machine learning
Building Hopsworks, a cloud-native managed feature store for machine learning Jim Dowling
 
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022Jim Dowling
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupJim Dowling
 
Hops fs huawei internal conference july 2021
Hops fs huawei internal conference july 2021Hops fs huawei internal conference july 2021
Hops fs huawei internal conference july 2021Jim Dowling
 
Hopsworks MLOps World talk june 21
Hopsworks MLOps World talk june 21Hopsworks MLOps World talk june 21
Hopsworks MLOps World talk june 21Jim Dowling
 
Hopsworks Feature Store 2.0 a new paradigm
Hopsworks Feature Store  2.0   a new paradigmHopsworks Feature Store  2.0   a new paradigm
Hopsworks Feature Store 2.0 a new paradigmJim Dowling
 
Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Jim Dowling
 
Berlin buzzwords 2020-feature-store-dowling
Berlin buzzwords 2020-feature-store-dowlingBerlin buzzwords 2020-feature-store-dowling
Berlin buzzwords 2020-feature-store-dowlingJim Dowling
 
Invited Lecture on GPUs and Distributed Deep Learning at Uppsala University
Invited Lecture on GPUs and Distributed Deep Learning at Uppsala UniversityInvited Lecture on GPUs and Distributed Deep Learning at Uppsala University
Invited Lecture on GPUs and Distributed Deep Learning at Uppsala UniversityJim Dowling
 
Hopsworks data engineering melbourne april 2020
Hopsworks   data engineering melbourne april 2020Hopsworks   data engineering melbourne april 2020
Hopsworks data engineering melbourne april 2020Jim Dowling
 
The Bitter Lesson of ML Pipelines
The Bitter Lesson of ML Pipelines The Bitter Lesson of ML Pipelines
The Bitter Lesson of ML Pipelines Jim Dowling
 
Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy
Asynchronous Hyperparameter Search with Spark on Hopsworks and MaggyAsynchronous Hyperparameter Search with Spark on Hopsworks and Maggy
Asynchronous Hyperparameter Search with Spark on Hopsworks and MaggyJim Dowling
 
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleHopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleJim Dowling
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Jim Dowling
 
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019Jim Dowling
 
PyData Meetup - Feature Store for Hopsworks and ML Pipelines
PyData Meetup - Feature Store for Hopsworks and ML PipelinesPyData Meetup - Feature Store for Hopsworks and ML Pipelines
PyData Meetup - Feature Store for Hopsworks and ML PipelinesJim Dowling
 

More from Jim Dowling (20)

ARVC and flecainide case report[EI] Jim.docx.pdf
ARVC and flecainide case report[EI] Jim.docx.pdfARVC and flecainide case report[EI] Jim.docx.pdf
ARVC and flecainide case report[EI] Jim.docx.pdf
 
PyData Berlin 2023 - Mythical ML Pipeline.pdf
PyData Berlin 2023 - Mythical ML Pipeline.pdfPyData Berlin 2023 - Mythical ML Pipeline.pdf
PyData Berlin 2023 - Mythical ML Pipeline.pdf
 
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdfPyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
 
_Python Ireland Meetup - Serverless ML - Dowling.pdf
_Python Ireland Meetup - Serverless ML - Dowling.pdf_Python Ireland Meetup - Serverless ML - Dowling.pdf
_Python Ireland Meetup - Serverless ML - Dowling.pdf
 
Building Hopsworks, a cloud-native managed feature store for machine learning
Building Hopsworks, a cloud-native managed feature store for machine learning Building Hopsworks, a cloud-native managed feature store for machine learning
Building Hopsworks, a cloud-native managed feature store for machine learning
 
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science Meetup
 
Hops fs huawei internal conference july 2021
Hops fs huawei internal conference july 2021Hops fs huawei internal conference july 2021
Hops fs huawei internal conference july 2021
 
Hopsworks MLOps World talk june 21
Hopsworks MLOps World talk june 21Hopsworks MLOps World talk june 21
Hopsworks MLOps World talk june 21
 
Hopsworks Feature Store 2.0 a new paradigm
GANs for Anti Money Laundering

  • 1. Jim Dowling CEO / Co-Founder Logical Clocks Associate Prof at KTH – Royal Institute of Technology Anti Money Laundering and GANs Berlin Meetup @jim_dowling
  • 2. ● Problem: Increase detection rate and reduce costs for AML. ● Solution: We used the Hopsworks platform to train GANs to classify transactions as suspected for money laundering or not. We have worked with a large transaction dataset (~40 TB) and the solution uses Spark for Feature Engineering and TensorFlow/GPUs to train a binary classifier, classifying transactions as either clean or dirty. We use the open-source Hopsworks platform to manage features, scale-out training, and manage models. ● Reference: Whitepaper Agenda
  • 3. ● Money laundering involves turning the “dirty” money into “clean” money either through an obscure sequence of banking transfers or through commercial transactions. ● The three broad stages of money laundering* are: ○ Placement (smurf it) ○ Layering (spread it out fast) ○ Integration (buy stuff) What is Money Laundering? *https://towardsdatascience.com/the-art-of-engineering-features-for-a-strong-machine-learning-model-a47a876e654c
  • 4. Rules-Based AML vs Deep Learning AML
  • 5. AML as a Supervised ML Problem ● Anti-money laundering (AML) is a pattern matching problem ● AML systems should automatically flag ‘suspect’ financial transactions ○ Followed by manual investigation ● Historical transaction datasets have massive data imbalance between the number of ‘clean’ transactions versus ‘dirty’ transactions Clean Transactions Dirty Transactions Millions or Billions 100s or 1000s
  • 6. Implications of AML as a Binary Classification Problem. True Positive. Reality: a money laundering transaction. Prediction: “Dirty” transaction predicted. Result: Good. False Positive. Reality: not a money laundering transaction. Prediction: “Dirty” transaction predicted. Result: Unnecessary work and cost! False Negative. Reality: a money laundering transaction. Prediction: “Clean” transaction predicted. Result: Fines/jail by authorities/regulator! True Negative. Reality: not a money laundering transaction. Prediction: “Clean” transaction predicted. Result: Good. Confusion matrix of our Binary AML Classifier with all possible predictions and their consequences. We use a variant of the F1 score to evaluate models (precision, recall, and fall-out should not be weighted equally).
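The slide says only that a variant of the F1 score is used, without naming it. One common way to weight precision and recall unequally is the F-beta score, so the sketch below assumes that variant; the counts are hypothetical and chosen to mimic a heavily imbalanced dataset.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return (precision, recall, fall-out) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fallout = fp / (fp + tn) if (fp + tn) else 0.0  # false positive rate
    return precision, recall, fallout


def f_beta(precision, recall, beta=2.0):
    """Weighted harmonic mean; beta > 1 favours recall over precision."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


# Hypothetical counts: few dirty transactions, about a million clean ones.
tp, fp, fn, tn = 80, 400, 20, 999_500
precision, recall, fallout = confusion_metrics(tp, fp, fn, tn)
f1 = f_beta(precision, recall, beta=1.0)  # equal weighting
f2 = f_beta(precision, recall, beta=2.0)  # missed laundering costs more
```

With beta = 2 a missed laundering case (false negative) pulls the score down more than an unnecessary investigation (false positive) does, which matches the fines-versus-extra-work asymmetry in the confusion matrix. The choice of beta is an assumption, not a value from the talk.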
  • 7. AML as an Anomaly Detection Problem “Anomaly detection follows quite naturally from a good unsupervised model” Alex Graves (Deep Mind) Traditional unsupervised approaches do not scale: k-means clustering and principal component analysis [Image from Ruff et al., “Deep Semi-Supervised Anomaly Detection”, https://arxiv.org/pdf/1906.02694.pdf]
  • 8. AML - Semi-Supervised Anomaly Detection AML is not a classical use-case for anomaly detection as we typically have labelled datasets, albeit imbalanced. “Semi-supervised learning is a class of machine learning tasks and techniques that also make use of unlabeled data for training – typically a small amount of labeled data with a large amount of unlabeled data.” Wikipedia
  • 9. GANs and Other Methods for Anomaly Detection ● Variational Auto-Encoders for Anomaly Detection ○ Easier to train, performance not state-of-the-art ● Generative Adversarial Networks (GANs) ○ Learn the manifold of normal samples (what to do if anomaly-free dataset is polluted) ○ One-Class Classifier for Novelty Detection GAN ○ BiGAN, BigGAN, BigBiGAN, GANOMALY, f-AnoGAN, GANs for Fraud “[For GANs] the Convolutional Neural Network architecture is more important than how you train them”, Marc Aurelio Ranzato (Facebook) at NeurIPS 2018.
  • 12. GANs are hard to train ● Pick the right GAN Architecture ● Risk of mode-collapse ● Hard to tune Hyperparameters Different Hyperparameter Tuning Strategies
  • 13. GANs are hard to train ● Mode collapse ○ Transactional data distributions are multimodal: there are multiple types of transactional behaviour that are perfectly normal. ○ The original GAN is based on a zero-sum, non-cooperative game. In this setting, if the mini-max game reaches the Nash equilibrium too soon, the generator learns to produce only a limited number of modes and mode collapse occurs. ● GANs are highly sensitive to their hyperparameters. ○ Finding good hyperparameters takes time, especially for GANs. Possible hyperparameters and tricks are listed at https://github.com/soumith/ganhacks ○ It is essential to have a good optimization and hyperparameter tuning engine
  • 14. How to address the mode collapse problem ● MO-GAAL [Liu et al.] proposed using multiple generators, where different generators are in charge of learning different modes of the distribution. ● Schlegl et al., in f-AnoGAN, proposed replacing DCGAN with WGAN-GP and introducing an encoder, trained sequentially, for image-to-latent-space mapping. ● Berg et al. improved f-AnoGAN by training the Generator and Encoder jointly, as well as employing a progressive growing GAN.
  • 15. WGAN-Gradient-Penalty Based Anomaly Detection [Image from Berg et al. - https://arxiv.org/pdf/1905.11034.pdf]
  • 16. WGAN-Gradient-Penalty Based Anomaly Detection [Image from Berg et al. - https://arxiv.org/pdf/1905.11034.pdf]
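The two WGAN-GP slides are images only, so for reference here is the standard WGAN-GP critic objective as commonly stated in the literature. It is not taken from the slides, and the symbols ($D$ for the critic, $P_r$ and $P_g$ for the real and generated distributions, $\lambda$ for the penalty weight) are the usual conventions rather than the talk's notation.

```latex
L_D = \mathbb{E}_{\tilde{x} \sim P_g}\big[D(\tilde{x})\big]
    - \mathbb{E}_{x \sim P_r}\big[D(x)\big]
    + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}
      \Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big],
\qquad
\hat{x} = \epsilon x + (1 - \epsilon)\,\tilde{x}, \quad \epsilon \sim U[0, 1]
```

The last term penalises the critic when its gradient norm at points interpolated between real and generated samples departs from 1. This is a soft way of enforcing the 1-Lipschitz constraint and is generally credited with making training more stable than weight clipping.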
  • 17. Will GANs help improve AML predictions? Expected results from using GANs (Anomaly Detection at Spark/AI EU Summit 2019) “[In China] two commercial banks have reduced losses of about 10 million RMB in twelve weeks and significantly improved their business reputation” GAN-based telecom fraud detection at the receiving bank
  • 18. Online Feature Store Offline Feature Store Train, Batch App Feature Store <10ms TBs/PBs How can we manage the Features between Training/Serving? Recent transaction counts (Streaming App) Streaming App pushes CDC data Pandas App updates every hour Batch PySpark App pushes updates every day Low Latency Features High Latency Features Real-time features (cust IDs, amount, type, datetime) Real-time Data Event Data SQL S3, HDFS Online AML App SQL DW DataFrame API
  • 19. HOPSWORKS Offline FS Apache Hive HopsFS Read and Join Features Online FS MySQL Cluster (External) Spark Cluster fs.get_features(["name", "Pclass", "Sex", "Balance", "Survived"]) Storage (S3, HopsFS, HDFS, ADLS) .npy, .tfrecords, .csv Create AML Training Datasets
  • 21. Hopsworks Conventions /training_datasets /models /logs /notebooks /featurestore Conventions and Implicit Provenance in Hopsworks* *https://www.usenix.org/conference/opml20/presentation/ormenisan In [ ]: dataset = tf.data.Dataset.list_files("training_datasets/resnet/*.tfrecord") tf.saved_model.save(model, 'models/ResNet') maggy.lagom(....)
  • 25. Explore and Design Experimentation: Tune and Search Model Training (Distributed) Explainability and Ablation Studies ML Model Dev Lifecycle is Iterative
  • 26. Explore and Design Experimentation: Tune and Search Model Training (Distributed) Explainability and Ablation Studies Rewrite your code at each stage => Iteration is impossible!
  • 27. Ablation Studies EDA HParam Tuning Training (Dist) It’s the Frameworks’ fault – they make us rewrite it!
  • 28. OBLIVIOUS TRAINING FUNCTION # RUNS ON THE WORKERS def train(): def input_fn(): # return dataset model = … optimizer = … model.compile(…) …. Ablation Studies EDA HParam Tuning Training (Dist) Oblivious Training Function – Write Once, Reuse Many Times
  • 29. def dataset(batch_size): (x_train, y_train) = load_data() x_train = x_train / np.float32(255) y_train = y_train.astype(np.int64) train_dataset = tf.data.Dataset.from_tensor_slices( (x_train,y_train)).shuffle(60000) .repeat().batch(batch_size) return train_dataset def build_and_compile_cnn_model(lr): model = tf.keras.Sequential([ tf.keras.Input(shape=(28, 28)), tf.keras.layers.Conv2D(32, 3, activation='relu'), tf.keras.layers.Flatten(), tf.keras.layers.Dense(128, activation='relu'), tf.keras.layers.Dense(10) ]) model.compile( loss=SparseCategoricalCrossentropy(from_logits=True), optimizer=SGD(learning_rate=lr)) return model def dataset(batch_size): (x_train, y_train) = load_data() x_train = x_train / np.float32(255) y_train = y_train.astype(np.int64) train_dataset = tf.data.Dataset.from_tensor_slices( (x_train,y_train)).shuffle(60000) .repeat().batch(batch_size) return train_dataset def build_and_compile_cnn_model(lr): model = tf.keras.Sequential([ tf.keras.Input(shape=(28, 28)), tf.keras.layers.Conv2D(32, 3, activation='relu'), tf.keras.layers.Flatten(), tf.keras.layers.Dense(128, activation='relu'), tf.keras.layers.Dense(10) ]) model.compile( loss=SparseCategoricalCrossentropy(from_logits=True), optimizer=SGD(learning_rate=lr)) return model NO CHANGES! What is Transparent Code in Practice?
  • 30. def aml(kernel, pool, dropout, reporter): # This is your training iteration loop for i in range(number_iterations): ... # add the maggy reporter to report the metric to be optimized reporter.broadcast(metric=accuracy) ... # Return the same final metric return accuracy from maggy import experiment, Searchspace sp = Searchspace(kernel=('INTEGER', [2, 8]), pool=('INTEGER', [2, 8])) result = experiment.lagom(train=aml, searchspace=sp, optimizer='randomsearch', direction='max', num_trials=15, name='MNIST') Maggy for HParam Optimization
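To make concrete what a random-search optimizer does with a search space like the one on the slide, here is a minimal, framework-free sketch. It is not Maggy's implementation or API: `sample_trial`, `random_search` and `toy_train` are hypothetical names, only the `INTEGER` parameter type from the slide is handled, and the trials run sequentially in one process rather than in parallel on Spark.

```python
import random


def sample_trial(searchspace, rng):
    """Draw one configuration; entries are (type, [low, high]) as on the slide."""
    trial = {}
    for name, (kind, (low, high)) in searchspace.items():
        if kind != "INTEGER":
            raise ValueError(f"unsupported parameter type: {kind}")
        trial[name] = rng.randint(low, high)  # both bounds inclusive
    return trial


def random_search(train, searchspace, num_trials, direction="max", seed=0):
    """Evaluate num_trials random configurations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_metric = None, None
    for _ in range(num_trials):
        params = sample_trial(searchspace, rng)
        metric = train(**params)
        better = (
            best_metric is None
            or (direction == "max" and metric > best_metric)
            or (direction == "min" and metric < best_metric)
        )
        if better:
            best_params, best_metric = params, metric
    return best_params, best_metric


def toy_train(kernel, pool):
    """Hypothetical stand-in for a training function; best at kernel=5, pool=3."""
    return 1.0 - 0.01 * ((kernel - 5) ** 2 + (pool - 3) ** 2)


searchspace = {"kernel": ("INTEGER", [2, 8]), "pool": ("INTEGER", [2, 8])}
best_params, best_metric = random_search(toy_train, searchspace, num_trials=15)
```

The training function stays unaware of the search loop: it receives hyperparameters as arguments and returns a single metric, which is the same contract as the `aml` function on the slide.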
  • 31. Maggy is built on top of PySpark
  • 32. Get Started: Paysim AML Dataset (Kaggle) ● Graph-based Candidate Features, Concatenated Features ○ Link the origin account, destination account, and transaction type to track the problem of smurfing and the higher cash withdrawals ● Frequency Candidate Features ○ Learn how frequently the account is used ● Amount Features ○ Magnitude of the amount of transactions. ● Time-Since Features ○  Learn the speed of transactions ● Velocity-Change Features ○ Identify a sudden change in the behaviour of accounts https://www.kaggle.com/ntnu-testimon/paysim1?select=PS_20174392719_1491204439457_log.csv
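Several of the feature families listed above can be sketched in plain Python. The column names (`step`, `type`, `amount`, `nameOrig`, `nameDest`) are assumed to follow the PaySim CSV, the rows are made up, and the exact feature definitions (for example, the ratio of the amount to the account's running mean as a velocity-change proxy) are illustrative assumptions rather than the features used in the talk.

```python
from collections import defaultdict

# Hypothetical rows using PaySim-style column names (step = simulated hour).
transactions = [
    {"step": 1, "type": "CASH_OUT", "amount": 100.0, "nameOrig": "C1", "nameDest": "C9"},
    {"step": 2, "type": "TRANSFER", "amount": 120.0, "nameOrig": "C1", "nameDest": "C8"},
    {"step": 2, "type": "PAYMENT", "amount": 50.0, "nameOrig": "C2", "nameDest": "M1"},
    {"step": 5, "type": "TRANSFER", "amount": 9000.0, "nameOrig": "C1", "nameDest": "C9"},
]


def add_account_features(rows):
    """Add per-origin-account features, using only each row's past history."""
    history = defaultdict(lambda: {"count": 0, "total": 0.0, "last_step": None})
    out = []
    for row in sorted(rows, key=lambda r: r["step"]):
        h = history[row["nameOrig"]]
        mean_so_far = h["total"] / h["count"] if h["count"] else None
        features = dict(row)
        # Concatenated feature: origin, destination and transaction type.
        features["link"] = f'{row["nameOrig"]}|{row["nameDest"]}|{row["type"]}'
        # Frequency feature: how often the account has been used so far.
        features["prior_tx_count"] = h["count"]
        # Time-since feature: steps elapsed since the previous transaction.
        features["steps_since_last"] = (
            row["step"] - h["last_step"] if h["last_step"] is not None else None
        )
        # Velocity-change proxy: amount relative to the account's running mean.
        features["amount_vs_mean"] = (
            row["amount"] / mean_so_far if mean_so_far else None
        )
        out.append(features)
        # Update the history only after the features were computed (no leakage).
        h["count"] += 1
        h["total"] += row["amount"]
        h["last_step"] = row["step"]
    return out


featured = add_account_features(transactions)
```

In this toy data the last transaction of account `C1` is roughly 80 times its earlier average amount, which is the kind of sudden change the velocity-change features are meant to expose. At the scale mentioned in the talk, the same logic would be expressed as windowed aggregations in Spark rather than a Python loop.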
  • 34. Hopsworks Cluster Project-Based Multi-Tenant Security API KEY IAM Profile Users Jobs Dev Feature Store Staging Feature Store Prod Feature Store User Login (LDAP, AD, OAuth2, 2FA) Databricks SageMaker Kubeflow Amazon EMR Delta Lake Snowflake Amazon S3 Amazon Redshift
  • 35. Full Featured AGPL-v3 License Model Hopsworks Community Kubernetes Support • Model Serving • Other services for robustness (Jupyter, more coming) Authentication (LDAP, Kerberos, OAuth2) Github support Hopsworks Enterprise Managed SAAS platform (currently only on AWS) Hopsworks.ai Hopsworks – open-source or managed platform