Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Ana M Martinez - AMIDST Toolbox- Scalable probabilistic machine learning with Flink

353 views

Published on

http://flink-forward.org/kb_sessions/amidst-toolbox-scalable-probabilistic-machine-learning-with-flink/

In this session we would like to present our AMIDST toolbox for analysis of large-scale data sets using probabilistic machine learning models. AMIDST runs algorithms in a distributed fashion for learning and
inference in a wide spectrum of latent variable models such as Gaussian mixtures, (probabilistic) principal component analysis, Hidden Markov Models, Kalman Filters, Latent Dirichlet Allocation, etc. This toolbox is
able to perform Bayesian parameter learning on any user-defined probabilistic (graphical) model with billions of nodes using novel distributed message passing algorithms.

We plan to give an overview of the AMIDST toolbox (Java open source), some details about the API and the integration with Flink, and an analysis of the scalability of our learning algorithms. All this in the context of a real use case scenario in the financial domain (BCC group), where the profile of millions of customers is analyzed using Flink and the Amazon Web Services.

Published in: Data & Analytics
  • Login to see the comments

Ana M Martinez - AMIDST Toolbox- Scalable probabilistic machine learning with Flink

  1. 1. AMIDST Toolbox – Flink Forward 1 A Java Toolbox for Scalable Probabilistic Machine Learning 12-14 SEP 2016, Berlin Ana M. Martínez Aalborg University ana@cs.aau.dk
  2. 2. AMIDST Toolbox – Flink Forward 2 Who are we?
  3. 3. THE AMIDST CONSORTIUM 3AMIDST Toolbox – Flink Forward
  4. 4. 4 Running Use Case AMIDST Toolbox – Flink Forward
  5. 5. RUNNING USE CASE 5 Predicting Defaulting Clients Predicts probability a customer will default within 2 years AMIDST Toolbox – Flink Forward
  6. 6. RUNNING USE CASE §  Daily data for millions of clients §  Tons of missing data. §  Odd distributions. 6AMIDST Toolbox – Flink Forward
  7. 7. 7 Toolbox presentation AMIDST Toolbox – Flink Forward
  8. 8. GENERAL DESCRIPTION 8AMIDST Toolbox – Flink Forward
  9. 9. AMIDST APPROACH 9 Data Knowledge Openbox Models Blackbox Inference Engine (Powered by Flink) AMIDST Toolbox – Flink Forward
  10. 10. 10 Main Features AMIDST Toolbox – Flink Forward
  11. 11. PGMS 11 Probabilistic graphical models (PGMs) Specify your model using probabilistic graphical models with latent variables and temporal dependencies AMIDST Toolbox – Flink Forward
  12. 12. RUNNING USE CASE 12AMIDST Toolbox – Flink Forward
  13. 13. PGMS 13 Custom Gaussian Mixture Model Hij defines a local mixture. Hi defines a global mixture. AMIDST Toolbox – Flink Forward
  14. 14. PGMS 14 RUNNING CODE EXAMPLE //Set-up Flink session.
 final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
 
 //Load the data stream
 String filename = "hdfs://dataFlink_month0.arff";
 DataFlink<DataInstance> data = DataFlinkLoader.loadDataFromFolder(env, filename, false);
 
 //Build the model
 Model model = new CustomGaussianMixture(data.getAttributes()); 
 
 AMIDST Toolbox – Flink Forward
  15. 15. SCALABLE INFERENCE 15 Scalable Learning Perform Bayesian inference on your probabilistic models with powerful approximate and scalable algorithms. AMIDST Toolbox – Flink Forward
  16. 16. SCALABLE INFERENCE 16 d-VMP Algorithm - Coded as iterative map-reduce task A state-of-the-art distributed variational message passing algorithm. AMIDST Toolbox – Flink Forward
  17. 17. SCALABLE INFERENCE 17 RUNNING CODE EXAMPLE AMIDST Toolbox – Flink Forward //Set-up Flink session.
 final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
 
 //Load the data stream
 String filename = “hdfs://dataFlink_month0.arff";
 DataFlink<DataInstance> data = DataFlinkLoader.loadDataFromFolder(env, filename, false);
 
 //Build the model
 Model model = new CustomGaussianMixture(data.getAttributes());
 
 //Learn the model
 model.updateModel(data); 
 

  18. 18. DATA STREAMS 18 Data Streams Update your models when new data is available. This makes our toolbox appropriate for learning from data streams. AMIDST Toolbox – Flink Forward
  19. 19. DATA STREAMS 19 //Set-up Flink session.
 final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
 
 //Load the data stream
 String filename = “hdfs://dataFlink_month0.arff";
 DataFlink<DataInstance> data = DataFlinkLoader.loadDataFromFolder(env, filename, false);
 
 //Build the model
 Model model = new CustomGaussianMixture(data.getAttributes());
 
 //Learn the model
 model.updateModel(data); 
 
 //Update your model
 for(int i=1; i<12; i++) {
 filename = “dataFlink_month"+i+".arff";
 data = DataFlinkLoader.loadDataFromFolder(env, filename,false);
 model.updateModel(data);
 
 } RUNNING CODE EXAMPLE AMIDST Toolbox – Flink Forward
  20. 20. RUNNING USE CASE 20 Predicting Defaulting Clients §  Old BCC’s models based on logistic regression got an AUC of 0.816. §  AMIDST’s models gets an AUC of 0.952. AMIDST Toolbox – Flink Forward
  21. 21. SCALABILITY ANALYSIS 21 Scalability analysis Use your defined models to process massive data sets in a distributed computer cluster using Flink. AMIDST Toolbox – Flink Forward
  22. 22. SCALABILITY ANALYSIS 22 One billion node probabilistic model Experiment on a Flink cluster with 16 nodes on AWS. AMIDST Toolbox – Flink Forward
  23. 23. SCALABILITY ANALYSIS §  Speedup (with respect to 2 nodes) AMIDST Toolbox – Flink Forward 23/32 0 1 2 3 4 5 6 7 8 4 nodes 8 nodes 16 nodes
  24. 24. MODULAR DESIGN 24 Modular Design The AMIDST Toolbox has been designed following a modular structure. This makes easier: §  The maintenance and enhancement of the software §  The integration with external software: HUGIN, MOA, Weka, R. AMIDST Toolbox – Flink Forward
  25. 25. MODULAR DESIGN 25AMIDST Toolbox – Flink Forward
  26. 26. 26 Running Use Case II AMIDST Toolbox – Flink Forward
  27. 27. CONCEPT DRIFT DETECTION 27 Tracking Concept Drift Detects changes in customer profiles during Spanish financial crisis AMIDST Toolbox – Flink Forward
  28. 28. CONCEPT DRIFT DETECTION 28 Hidden Variables are used to capture changes in customer profile MODEL AMIDST Toolbox – Flink Forward
  29. 29. CONCEPT DRIFT DETECTION 29 RUNNING CODE AMIDST Toolbox – Flink Forward //Set-up Flink session.
 final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
 
 //Load the data stream
 String filename = “hdfs://dataFlink_month0.arff";
 DataFlink<DataInstance> data = DataFlinkLoader.loadDataFromFolder(env, filename, false);
 
 //Build the model
 Model model = new ConceptDriftDetector(data.getAttributes());
 
 //Learn the model
 model.updateModel(data); 
 
 //Update your model
 for(int i=1; i<12; i++) {
 filename = “dataFlink_month"+i+".arff";
 data = DataFlinkLoader.loadDataFromFolder(env, filename,false);
 model.updateModel(data); System.out.println(model.getPosteriorDistribution(“hiddenVar”). toString());
 }
  30. 30. CONCEPT DRIFT DETECTION 30 Hidden Variable Captures Concept Drift Drift Pattern: Seasonal + Global trend RESULTS AMIDST Toolbox – Flink Forward
  31. 31. CONCEPT DRIFT DETECTION 31 Unemployment Rate main driver of Concept Drift Hidden Variable correlates with unemployment rate (rho = 0.961) RESULTS AMIDST Toolbox – Flink Forward
  32. 32. COLLABORATE 32 www.amidsttoolbox.com github.com/amidst/toolbox Appache License 2.0 AMIDST Toolbox – Flink Forward
  33. 33. AMIDST Toolbox – Flink Forward 33 Thanks for your attention @ contact@amidsttoolbox.com @AmidstToolbox www www.amidsttoolbox.com

×