http://flink-forward.org/kb_sessions/amidst-toolbox-scalable-probabilistic-machine-learning-with-flink/
In this session we would like to present our AMIDST toolbox for analysis of large-scale data sets using probabilistic machine learning models. AMIDST runs algorithms in a distributed fashion for learning and
inference in a wide spectrum of latent variable models such as Gaussian mixtures, (probabilistic) principal component analysis, Hidden Markov Models, Kalman Filters, Latent Dirichlet Allocation, etc. This toolbox is
able to perform Bayesian parameter learning on any user-defined probabilistic (graphical) model with billions of nodes using novel distributed message passing algorithms.
We plan to give an overview of the AMIDST toolbox (Java open source), some details about the API and the integration with Flink, and an analysis of the scalability of our learning algorithms. All this in the context of a real use case scenario in the financial domain (BCC group), where the profile of millions of customers is analyzed using Flink and the Amazon Web Services.
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Ana M Martinez - AMIDST Toolbox- Scalable probabilistic machine learning with Flink
1. AMIDST Toolbox – Flink Forward 1
A Java Toolbox for Scalable Probabilistic Machine Learning
12-14 SEP 2016, Berlin
Ana M. Martínez
Aalborg University
ana@cs.aau.dk
11. PGMS
11
Probabilistic graphical models (PGMs)
Specify your model using probabilistic graphical models with
latent variables and temporal dependencies
AMIDST Toolbox – Flink Forward
13. PGMS
13
Custom Gaussian Mixture Model
Hij defines a local mixture.
Hi defines a global mixture.
AMIDST Toolbox – Flink Forward
14. PGMS
14
RUNNING CODE EXAMPLE
//Set-up Flink session.
final ExecutionEnvironment env =
ExecutionEnvironment.getExecutionEnvironment();
//Load the data stream
String filename = "hdfs://dataFlink_month0.arff";
DataFlink<DataInstance> data =
DataFlinkLoader.loadDataFromFolder(env, filename,
false);
//Build the model
Model model = new CustomGaussianMixture(data.getAttributes());
AMIDST Toolbox – Flink Forward
17. SCALABLE INFERENCE
17
RUNNING CODE EXAMPLE
AMIDST Toolbox – Flink Forward
//Set-up Flink session.
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//Load the data stream
String filename = “hdfs://dataFlink_month0.arff";
DataFlink<DataInstance> data =
DataFlinkLoader.loadDataFromFolder(env, filename, false);
//Build the model
Model model = new CustomGaussianMixture(data.getAttributes());
//Learn the model
model.updateModel(data);
18. DATA STREAMS
18
Data Streams
Update your models when new data is available. This makes our
toolbox appropriate for learning from data streams.
AMIDST Toolbox – Flink Forward
19. DATA STREAMS
19
//Set-up Flink session.
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//Load the data stream
String filename = “hdfs://dataFlink_month0.arff";
DataFlink<DataInstance> data =
DataFlinkLoader.loadDataFromFolder(env, filename, false);
//Build the model
Model model = new CustomGaussianMixture(data.getAttributes());
//Learn the model
model.updateModel(data);
//Update your model
for(int i=1; i<12; i++) {
filename = “dataFlink_month"+i+".arff";
data = DataFlinkLoader.loadDataFromFolder(env, filename,false);
model.updateModel(data);
}
RUNNING CODE EXAMPLE
AMIDST Toolbox – Flink Forward
20. RUNNING USE CASE
20
Predicting Defaulting Clients
§ Old BCC’s models based on logistic regression got an AUC of 0.816.
§ AMIDST’s models gets an AUC of 0.952.
AMIDST Toolbox – Flink Forward
24. MODULAR DESIGN
24
Modular Design
The AMIDST Toolbox has been designed following a modular structure.
This makes easier:
§ The maintenance and enhancement of the software
§ The integration with external software: HUGIN, MOA, Weka, R.
AMIDST Toolbox – Flink Forward
29. CONCEPT DRIFT DETECTION
29
RUNNING CODE
AMIDST Toolbox – Flink Forward
//Set-up Flink session.
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//Load the data stream
String filename = “hdfs://dataFlink_month0.arff";
DataFlink<DataInstance> data =
DataFlinkLoader.loadDataFromFolder(env, filename, false);
//Build the model
Model model = new ConceptDriftDetector(data.getAttributes());
//Learn the model
model.updateModel(data);
//Update your model
for(int i=1; i<12; i++) {
filename = “dataFlink_month"+i+".arff";
data = DataFlinkLoader.loadDataFromFolder(env, filename,false);
model.updateModel(data);
System.out.println(model.getPosteriorDistribution(“hiddenVar”).
toString());
}