SlideShare a Scribd company logo
1 of 27
Download to read offline
MLFlow at Company Scale
Jean-Denis Lesage
@jdlesage
Staff Dev Lead at Criteo
Agenda
Introduction
Why did we choose Mlflow ?
How did we setup a shared
Mlflow instance
Optimization, monitoring, automation,
authentication, etc.
Plugins and other Features
UI additions
Yarn execution backend
ElasticSearch Store
Machine Learning at Criteo
▪ Criteo is an AdTech company. ML is used for bidding,
recommendation, sales& clicks optimization, creative generation,
brand safety, invalid traffic detection, etc.
▪ 1000models in production
▪ 100 million predictions per seconds (latency ~50 µs)
▪ 1 PB of logs generated per day
▪ 300offline experiments per day
Why Mlflow ?
▪ Multi-framework
▪ Can run on any kind of infrastructure
▪ Extensible by plugins
▪ Opensource: lower the risk, can contribute if something is missing
▪ But not originally designed to run as a central service in a company.
Architecture
Central service
Pros:
▪ Factorize the maintenance.
▪ Open, improve collaboration. 1 place to store all ML results.
▪ We have time to contribute to the project
Cons:
▪ More constraints to the service (more QPS, etc.).
▪ No isolation: 1 team can impact all the whole company. (outage)
Central Mlflow versus 1 Mlflow per team/project
Architecture
Mlflow
client
HDFS
How did we setup a shared Mlflow instance ?
Scale SQLAlchemyStore
Back to October 2019…
Scale SQLAlchemyStore
Root cause (October 2019)
▪ Slow: All lines are filtered in python
▪ Very Slow: Pagination in python: restart from the beginning every
time
▪ Unstable: All runs are loaded in memory. => OoM risk
MariaDB (fast!)
Select * from
Experiment
Tons of lines!
Filter Pagination
100 lines
Scale SQLAlchemyStore
▪ Move filtering and pagination inside SQL queries.
▪ https://github.com/mlflow/mlflow/pull/2059
▪ Dramatically speed up (x10 on our usecases)
▪ No more Out of Memory Exceptions
▪ But have to join all tables
▪ Eagerly loading from sqlalchemy is memory consuming.
▪ Load runs attributes lazily
▪ https://github.com/mlflow/mlflow/pull/1878
Implementation
Scale SQLAlchemyStore
In first versions of the store, latest metrics were computed in python.
https://github.com/mlflow/mlflow/pull/1660
Compute in SQL: 4x speed up
https://github.com/mlflow/mlflow/pull/1767
Store in a dedicated table: 20 x speed up
Latest metrics. Python is slow compared to SQL
Scale SQLAlchemyStore
▪ Some experiments have >1000 columns (metrics, tags, params)
▪ But users often display ~10 of them.
▪ https://github.com/criteo-forks/mlflow/pull/178
▪ search_run complexity becomes O(number of columns)
Columns to whitelist: Prune unecessary columns in search results.
Whitelist Column
Implementation
def selectallin(
runs: List[SqlRun],
query_run_ids: sqlalchemy.sql.expression.Alias,
query: sqlalchemy.orm.query.Query,
attribute,
attr_name: str,
) -> None:
values = sorted(
query.filter(attribute.in_(query_run_ids)).all(), key=operator.attrgetter("run_uuid")
)
values_grouped_by = {
run_id: list(vals)
for run_id, vals in itertools.groupby(values, key=operator.attrgetter("run_uuid"))
}
for run in runs:
if run.run_uuid in values_grouped_by:
sqlalchemy.orm.attributes.set_committed_value(
run, attr_name, values_grouped_by[run.run_uuid]
)
Attach to
SQLRun objects
Query columns
Group by types
(tags, …)
Monitoring
▪ Add /metrics endpoint for prometheus
https://github.com/mlflow/mlflow/pull/2097
▪ A probe queries periodically search_runs
Configure the Gunicorn Application
▪ You can control by code the configuration of the Gunicorn Application
from gunicorn.app.base import Application
from mlflow.server import app
class StandaloneApplication(Application):
def __init__(self, app, opt=None):
self.app = app
super(StandaloneApplication, self).__init__()
def init(self, parser, opts, args):
pass
def load(self):
"""Return application to be ran."""
return self.app
Benefits:
• Add hooks (nicely close SQL connections
pool on worker exit)
• Add new endpoints to mlflow
• Use flask extensions (authentication,
CORS, etc.)
Automatic DB migration
▪ Apply DB migrations scripts at server startup.
▪ Use SQL to implement a mutex. Only 1 server will do the migration.
▪ Test before in docker (integration tests) then preprod cluster.
Benefits
▪ Continuous deployment (at least 1 release per week)
▪ Deploy master branch of mlflow
▪ No human actions on production servers
Periodic Jobs
▪ Automatize maintenance jobs
Use SQL to maintain a lock between servers.
▪ Examples of jobs
▪ Compress timestamp metrics. Remove some points after
XXX days in order to save space in database
▪ Archive artifacts on hdfs
https://hadoop.apache.org/docs/r1.2.1/hadoop_archives
.html
▪ Remove deleted runs after XX days. Mlflow gc
▪ Kill stuck jobs
Job name Lock acquire time
cleaning_metrics 1601991418
SQL
Server 0 Server 1
Integration to company SSO
▪ Payload to Mlflow server must contain a JWT token
▪ New endpoint connected to the OAuth2 service that generate a JWT
token from access token
▪ Javascript embeds the token in all payload to Mlflow.
From javascript (frontend)
JWT Integration in Javascript
Implementation
export const wrapDeferred = (deferred, data, timeLeftMs = 60000, sleepMs = 1000) => {
const token = localStorage.getItem('token');
if (token !== null) {
$.ajaxSetup({
headers: { Authorization: token },
});
}
[…]
if (xhr.status === 401) {
console.warn('Request failed with status 401');
const authService = new AuthService();
authService.redirectToSsoIfPossible(xhr);
}
[…]
},
});
});
};
Store JWT token
in local storage
(auth once per session)
If Mlflow backend
returns 401, then
redirect to SSO to
get a valid JWT token
Integration to company SSO
From python and java clients (based on Kerberos)
HTTP Client
Kerberized
client
KDC
JTC
Kerberized
server
Mlflow
protected by
JWT
1 Get TGT
2 return TGS
3 HTTP/SPNego Call
4 return JWT token
5 call Mlflow using
JWT token
Plugins and new features
Write Easily Fitler Queries
Mlflow-yarn
▪ Execution plugin to run MLProject on hadoop cluster.
▪ Based on skein and cluster-pack
▪ Cluster-pack EuroPython2020: https://youtu.be/d-XQqBclLnE
Mlflow-elasticsearchstore
▪ Experiment denormalization
for tracking server
▪ Speed up on large
experiments
▪ Experimental support. Open to
contributions.
Conclusion
Code Organization
▪ https://github.com/criteo-forks/mlflow (branch criteo-master)
Contains contributions not yet merged (or those that can’t be merged)
▪ Private git mlflow-criteo
Our gunicorn Application. Configuration, criteo business logic, periodic
jobs…
▪ Plugins repos:
▪ https://github.com/criteo/mlflow-yarn
▪ https://github.com/criteo/mlflow-elasticsearchstore
Some take away
▪ Delegate computations to SQL servers on critical endpoints
▪ Automatize as much as possible
▪ Create your own gunicorn app and extend Mlflow!
▪ Mlflow is open. Contributions, extend flask application or plugins
▪ Exciting journey to setup Mlflow as a central service for a company.
Lot of topics covered.
▪ It was great to collaborate with the community.

More Related Content

What's hot

From Data Science to MLOps
From Data Science to MLOpsFrom Data Science to MLOps
From Data Science to MLOpsCarl W. Handlin
 
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...Databricks
 
Productionzing ML Model Using MLflow Model Serving
Productionzing ML Model Using MLflow Model ServingProductionzing ML Model Using MLflow Model Serving
Productionzing ML Model Using MLflow Model ServingDatabricks
 
Productionalizing Models through CI/CD Design with MLflow
Productionalizing Models through CI/CD Design with MLflowProductionalizing Models through CI/CD Design with MLflow
Productionalizing Models through CI/CD Design with MLflowDatabricks
 
Databricks Overview for MLOps
Databricks Overview for MLOpsDatabricks Overview for MLOps
Databricks Overview for MLOpsDatabricks
 
MLOps Virtual Event: Automating ML at Scale
MLOps Virtual Event: Automating ML at ScaleMLOps Virtual Event: Automating ML at Scale
MLOps Virtual Event: Automating ML at ScaleDatabricks
 
MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...
 MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ... MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...
MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...Databricks
 
MLOps Using MLflow
MLOps Using MLflowMLOps Using MLflow
MLOps Using MLflowDatabricks
 
MLflow Model Serving
MLflow Model ServingMLflow Model Serving
MLflow Model ServingDatabricks
 
MLOps - The Assembly Line of ML
MLOps - The Assembly Line of MLMLOps - The Assembly Line of ML
MLOps - The Assembly Line of MLJordan Birdsell
 
MLflow with Databricks
MLflow with DatabricksMLflow with Databricks
MLflow with DatabricksLiangjun Jiang
 
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML EngineersIntro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML EngineersDaniel Zivkovic
 
MLOps Bridging the gap between Data Scientists and Ops.
MLOps Bridging the gap between Data Scientists and Ops.MLOps Bridging the gap between Data Scientists and Ops.
MLOps Bridging the gap between Data Scientists and Ops.Knoldus Inc.
 
MLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in ProductionMLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in ProductionProvectus
 
Managing the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflowManaging the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflowDatabricks
 
MLflow: A Platform for Production Machine Learning
MLflow: A Platform for Production Machine LearningMLflow: A Platform for Production Machine Learning
MLflow: A Platform for Production Machine LearningMatei Zaharia
 
Ml ops past_present_future
Ml ops past_present_futureMl ops past_present_future
Ml ops past_present_futureNisha Talagala
 
ML-Ops how to bring your data science to production
ML-Ops  how to bring your data science to productionML-Ops  how to bring your data science to production
ML-Ops how to bring your data science to productionHerman Wu
 

What's hot (20)

From Data Science to MLOps
From Data Science to MLOpsFrom Data Science to MLOps
From Data Science to MLOps
 
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
 
MLOps.pptx
MLOps.pptxMLOps.pptx
MLOps.pptx
 
Productionzing ML Model Using MLflow Model Serving
Productionzing ML Model Using MLflow Model ServingProductionzing ML Model Using MLflow Model Serving
Productionzing ML Model Using MLflow Model Serving
 
Productionalizing Models through CI/CD Design with MLflow
Productionalizing Models through CI/CD Design with MLflowProductionalizing Models through CI/CD Design with MLflow
Productionalizing Models through CI/CD Design with MLflow
 
Databricks Overview for MLOps
Databricks Overview for MLOpsDatabricks Overview for MLOps
Databricks Overview for MLOps
 
MLOps Virtual Event: Automating ML at Scale
MLOps Virtual Event: Automating ML at ScaleMLOps Virtual Event: Automating ML at Scale
MLOps Virtual Event: Automating ML at Scale
 
MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...
 MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ... MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...
MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...
 
MLOps for production-level machine learning
MLOps for production-level machine learningMLOps for production-level machine learning
MLOps for production-level machine learning
 
MLOps Using MLflow
MLOps Using MLflowMLOps Using MLflow
MLOps Using MLflow
 
MLflow Model Serving
MLflow Model ServingMLflow Model Serving
MLflow Model Serving
 
MLOps - The Assembly Line of ML
MLOps - The Assembly Line of MLMLOps - The Assembly Line of ML
MLOps - The Assembly Line of ML
 
MLflow with Databricks
MLflow with DatabricksMLflow with Databricks
MLflow with Databricks
 
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML EngineersIntro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
Intro to Vertex AI, unified MLOps platform for Data Scientists & ML Engineers
 
MLOps Bridging the gap between Data Scientists and Ops.
MLOps Bridging the gap between Data Scientists and Ops.MLOps Bridging the gap between Data Scientists and Ops.
MLOps Bridging the gap between Data Scientists and Ops.
 
MLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in ProductionMLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in Production
 
Managing the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflowManaging the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflow
 
MLflow: A Platform for Production Machine Learning
MLflow: A Platform for Production Machine LearningMLflow: A Platform for Production Machine Learning
MLflow: A Platform for Production Machine Learning
 
Ml ops past_present_future
Ml ops past_present_futureMl ops past_present_future
Ml ops past_present_future
 
ML-Ops how to bring your data science to production
ML-Ops  how to bring your data science to productionML-Ops  how to bring your data science to production
ML-Ops how to bring your data science to production
 

Similar to MLflow at Company Scale

Expanding your impact with programmability in the data center
Expanding your impact with programmability in the data centerExpanding your impact with programmability in the data center
Expanding your impact with programmability in the data centerCisco Canada
 
Autonomous Transaction Processing (ATP): In Heavy Traffic, Why Drive Stick?
Autonomous Transaction Processing (ATP): In Heavy Traffic, Why Drive Stick?Autonomous Transaction Processing (ATP): In Heavy Traffic, Why Drive Stick?
Autonomous Transaction Processing (ATP): In Heavy Traffic, Why Drive Stick?Jim Czuprynski
 
Apex Enterprise Patterns: Building Strong Foundations
Apex Enterprise Patterns: Building Strong FoundationsApex Enterprise Patterns: Building Strong Foundations
Apex Enterprise Patterns: Building Strong FoundationsSalesforce Developers
 
How to deploy & optimize eZ Publish
How to deploy & optimize eZ PublishHow to deploy & optimize eZ Publish
How to deploy & optimize eZ PublishKaliop-slide
 
Make your gui shine with ajax solr
Make your gui shine with ajax solrMake your gui shine with ajax solr
Make your gui shine with ajax solrlucenerevolution
 
Why you should be using structured logs
Why you should be using structured logsWhy you should be using structured logs
Why you should be using structured logsStefan Krawczyk
 
Azure SQL Database - Connectivity Best Practices
Azure SQL Database - Connectivity Best PracticesAzure SQL Database - Connectivity Best Practices
Azure SQL Database - Connectivity Best PracticesJose Manuel Jurado Diaz
 
Single sign on across drupal 8
Single sign on across drupal 8Single sign on across drupal 8
Single sign on across drupal 8Iwantha Lekamge
 
Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek PROIDEA
 
Docker Logging and analysing with Elastic Stack
Docker Logging and analysing with Elastic StackDocker Logging and analysing with Elastic Stack
Docker Logging and analysing with Elastic StackJakub Hajek
 
SQL Server - CLR integration
SQL Server - CLR integrationSQL Server - CLR integration
SQL Server - CLR integrationPeter Gfader
 
El camino a las Cloud Native Apps - Introduction
El camino a las Cloud Native Apps - IntroductionEl camino a las Cloud Native Apps - Introduction
El camino a las Cloud Native Apps - IntroductionPlain Concepts
 
Being HAPI! Reverse Proxying on Purpose
Being HAPI! Reverse Proxying on PurposeBeing HAPI! Reverse Proxying on Purpose
Being HAPI! Reverse Proxying on PurposeAman Kohli
 
High-Performance Hibernate - JDK.io 2018
High-Performance Hibernate - JDK.io 2018High-Performance Hibernate - JDK.io 2018
High-Performance Hibernate - JDK.io 2018Vlad Mihalcea
 
Serverless in-action
Serverless in-actionServerless in-action
Serverless in-actionAssaf Gannon
 
SQL Server 2008 Integration Services
SQL Server 2008 Integration ServicesSQL Server 2008 Integration Services
SQL Server 2008 Integration ServicesEduardo Castro
 
SQL Track: In Memory OLTP in SQL Server
SQL Track: In Memory OLTP in SQL ServerSQL Track: In Memory OLTP in SQL Server
SQL Track: In Memory OLTP in SQL ServerITProceed
 
Windows Server 2008 (PowerShell Scripting Uygulamaları)
Windows Server 2008 (PowerShell Scripting Uygulamaları)Windows Server 2008 (PowerShell Scripting Uygulamaları)
Windows Server 2008 (PowerShell Scripting Uygulamaları)ÇözümPARK
 
Altitude San Francisco 2018: Logging at the Edge
Altitude San Francisco 2018: Logging at the Edge Altitude San Francisco 2018: Logging at the Edge
Altitude San Francisco 2018: Logging at the Edge Fastly
 

Similar to MLflow at Company Scale (20)

Expanding your impact with programmability in the data center
Expanding your impact with programmability in the data centerExpanding your impact with programmability in the data center
Expanding your impact with programmability in the data center
 
Php forum2015 tomas_final
Php forum2015 tomas_finalPhp forum2015 tomas_final
Php forum2015 tomas_final
 
Autonomous Transaction Processing (ATP): In Heavy Traffic, Why Drive Stick?
Autonomous Transaction Processing (ATP): In Heavy Traffic, Why Drive Stick?Autonomous Transaction Processing (ATP): In Heavy Traffic, Why Drive Stick?
Autonomous Transaction Processing (ATP): In Heavy Traffic, Why Drive Stick?
 
Apex Enterprise Patterns: Building Strong Foundations
Apex Enterprise Patterns: Building Strong FoundationsApex Enterprise Patterns: Building Strong Foundations
Apex Enterprise Patterns: Building Strong Foundations
 
How to deploy & optimize eZ Publish
How to deploy & optimize eZ PublishHow to deploy & optimize eZ Publish
How to deploy & optimize eZ Publish
 
Make your gui shine with ajax solr
Make your gui shine with ajax solrMake your gui shine with ajax solr
Make your gui shine with ajax solr
 
Why you should be using structured logs
Why you should be using structured logsWhy you should be using structured logs
Why you should be using structured logs
 
Azure SQL Database - Connectivity Best Practices
Azure SQL Database - Connectivity Best PracticesAzure SQL Database - Connectivity Best Practices
Azure SQL Database - Connectivity Best Practices
 
Single sign on across drupal 8
Single sign on across drupal 8Single sign on across drupal 8
Single sign on across drupal 8
 
Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek
 
Docker Logging and analysing with Elastic Stack
Docker Logging and analysing with Elastic StackDocker Logging and analysing with Elastic Stack
Docker Logging and analysing with Elastic Stack
 
SQL Server - CLR integration
SQL Server - CLR integrationSQL Server - CLR integration
SQL Server - CLR integration
 
El camino a las Cloud Native Apps - Introduction
El camino a las Cloud Native Apps - IntroductionEl camino a las Cloud Native Apps - Introduction
El camino a las Cloud Native Apps - Introduction
 
Being HAPI! Reverse Proxying on Purpose
Being HAPI! Reverse Proxying on PurposeBeing HAPI! Reverse Proxying on Purpose
Being HAPI! Reverse Proxying on Purpose
 
High-Performance Hibernate - JDK.io 2018
High-Performance Hibernate - JDK.io 2018High-Performance Hibernate - JDK.io 2018
High-Performance Hibernate - JDK.io 2018
 
Serverless in-action
Serverless in-actionServerless in-action
Serverless in-action
 
SQL Server 2008 Integration Services
SQL Server 2008 Integration ServicesSQL Server 2008 Integration Services
SQL Server 2008 Integration Services
 
SQL Track: In Memory OLTP in SQL Server
SQL Track: In Memory OLTP in SQL ServerSQL Track: In Memory OLTP in SQL Server
SQL Track: In Memory OLTP in SQL Server
 
Windows Server 2008 (PowerShell Scripting Uygulamaları)
Windows Server 2008 (PowerShell Scripting Uygulamaları)Windows Server 2008 (PowerShell Scripting Uygulamaları)
Windows Server 2008 (PowerShell Scripting Uygulamaları)
 
Altitude San Francisco 2018: Logging at the Edge
Altitude San Francisco 2018: Logging at the Edge Altitude San Francisco 2018: Logging at the Edge
Altitude San Francisco 2018: Logging at the Edge
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Milind Agarwal
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingsocarem879
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 

Recently uploaded (20)

Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processing
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 

MLflow at Company Scale

  • 1. MLFlow at Company Scale Jean-Denis Lesage @jdlesage Staff Dev Lead at Criteo
  • 2. Agenda Introduction Why did we choose Mlflow ? How did we setup a shared Mlflow instance Optimization, monitoring, automation, authentication, etc. Plugins and other Features UI additions Yarn execution backend ElasticSearch Store
  • 3. Machine Learning at Criteo ▪ Criteo is an AdTech company. ML is used for bidding, recommendation, sales& clicks optimization, creative generation, brand safety, invalid traffic detection, etc. ▪ 1000models in production ▪ 100 million predictions per seconds (latency ~50 µs) ▪ 1 PB of logs generated per day ▪ 300offline experiments per day
  • 4. Why Mlflow ? ▪ Multi-framework ▪ Can run on any kind of infrastructure ▪ Extensible by plugins ▪ Opensource: lower the risk, can contribute if something is missing ▪ But not originally designed to run as a central service in a company.
  • 5. Architecture Central service Pros: ▪ Factorize the maintenance. ▪ Open, improve collaboration. 1 place to store all ML results. ▪ We have time to contribute to the project Cons: ▪ More constraints to the service (more QPS, etc.). ▪ No isolation: 1 team can impact all the whole company. (outage) Central Mlflow versus 1 Mlflow per team/project
  • 7. How did we setup a shared Mlflow instance ?
  • 9. Scale SQLAlchemyStore Root cause (October 2019) ▪ Slow: All lines are filtered in python ▪ Very Slow: Pagination in python: restart from the beginning every time ▪ Unstable: All runs are loaded in memory. => OoM risk MariaDB (fast!) Select * from Experiment Tons of lines! Filter Pagination 100 lines
  • 10. Scale SQLAlchemyStore ▪ Move filtering and pagination inside SQL queries. ▪ https://github.com/mlflow/mlflow/pull/2059 ▪ Dramatically speed up (x10 on our usecases) ▪ No more Out of Memory Exceptions ▪ But have to join all tables ▪ Eagerly loading from sqlalchemy is memory consuming. ▪ Load runs attributes lazily ▪ https://github.com/mlflow/mlflow/pull/1878 Implementation
  • 11. Scale SQLAlchemyStore In first versions of the store, latest metrics were computed in python. https://github.com/mlflow/mlflow/pull/1660 Compute in SQL: 4x speed up https://github.com/mlflow/mlflow/pull/1767 Store in a dedicated table: 20 x speed up Latest metrics. Python is slow compared to SQL
  • 12. Scale SQLAlchemyStore ▪ Some experiments have >1000 columns (metrics, tags, params) ▪ But users often display ~10 of them. ▪ https://github.com/criteo-forks/mlflow/pull/178 ▪ search_run complexity becomes O(number of columns) Columns to whitelist: Prune unecessary columns in search results.
  • 13. Whitelist Column Implementation def selectallin( runs: List[SqlRun], query_run_ids: sqlalchemy.sql.expression.Alias, query: sqlalchemy.orm.query.Query, attribute, attr_name: str, ) -> None: values = sorted( query.filter(attribute.in_(query_run_ids)).all(), key=operator.attrgetter("run_uuid") ) values_grouped_by = { run_id: list(vals) for run_id, vals in itertools.groupby(values, key=operator.attrgetter("run_uuid")) } for run in runs: if run.run_uuid in values_grouped_by: sqlalchemy.orm.attributes.set_committed_value( run, attr_name, values_grouped_by[run.run_uuid] ) Attach to SQLRun objects Query columns Group by types (tags, …)
  • 14. Monitoring ▪ Add /metrics endpoint for prometheus https://github.com/mlflow/mlflow/pull/2097 ▪ A probe queries periodically search_runs
  • 15. Configure the Gunicorn Application ▪ You can control by code the configuration of the Gunicorn Application from gunicorn.app.base import Application from mlflow.server import app class StandaloneApplication(Application): def __init__(self, app, opt=None): self.app = app super(StandaloneApplication, self).__init__() def init(self, parser, opts, args): pass def load(self): """Return application to be ran.""" return self.app Benefits: • Add hooks (nicely close SQL connections pool on worker exit) • Add new endpoints to mlflow • Use flask extensions (authentication, CORS, etc.)
  • 16. Automatic DB migration ▪ Apply DB migrations scripts at server startup. ▪ Use SQL to implement a mutex. Only 1 server will do the migration. ▪ Test before in docker (integration tests) then preprod cluster. Benefits ▪ Continuous deployment (at least 1 release per week) ▪ Deploy master branch of mlflow ▪ No human actions on production servers
  • 17. Periodic Jobs ▪ Automatize maintenance jobs Use SQL to maintain a lock between servers. ▪ Examples of jobs ▪ Compress timestamp metrics. Remove some points after XXX days in order to save space in database ▪ Archive artifacts on hdfs https://hadoop.apache.org/docs/r1.2.1/hadoop_archives .html ▪ Remove deleted runs after XX days. Mlflow gc ▪ Kill stuck jobs Job name Lock acquire time cleaning_metrics 1601991418 SQL Server 0 Server 1
  • 18. Integration to company SSO ▪ Payload to Mlflow server must contain a JWT token ▪ New endpoint connected to the OAuth2 service that generate a JWT token from access token ▪ Javascript embeds the token in all payload to Mlflow. From javascript (frontend)
  • 19. JWT Integration in Javascript Implementation export const wrapDeferred = (deferred, data, timeLeftMs = 60000, sleepMs = 1000) => { const token = localStorage.getItem('token'); if (token !== null) { $.ajaxSetup({ headers: { Authorization: token }, }); } […] if (xhr.status === 401) { console.warn('Request failed with status 401'); const authService = new AuthService(); authService.redirectToSsoIfPossible(xhr); } […] }, }); }); }; Store JWT token in local storage (auth once per session) If Mlflow backend returns 401, then redirect to SSO to get a valid JWT token
  • 20. Integration to company SSO From python and java clients (based on Kerberos) HTTP Client Kerberized client KDC JTC Kerberized server Mlflow protected by JWT 1 Get TGT 2 return TGS 3 HTTP/SPNego Call 4 return JWT token 5 call Mlflow using JWT token
  • 21. Plugins and new features
  • 23. Mlflow-yarn ▪ Execution plugin to run MLProject on hadoop cluster. ▪ Based on skein and cluster-pack ▪ Cluster-pack EuroPython2020: https://youtu.be/d-XQqBclLnE
  • 24. Mlflow-elasticsearchstore ▪ Experiment denormalization for tracking server ▪ Speed up on large experiments ▪ Experimental support. Open to contributions.
  • 26. Code Organization ▪ https://github.com/criteo-forks/mlflow (branch criteo-master) Contains contributions not yet merged (or those that can’t be merged) ▪ Private git mlflow-criteo Our gunicorn Application. Configuration, criteo business logic, periodic jobs… ▪ Plugins repos: ▪ https://github.com/criteo/mlflow-yarn ▪ https://github.com/criteo/mlflow-elasticsearchstore
  • 27. Some take away ▪ Delegate computations to SQL servers on critical endpoints ▪ Automatize as much as possible ▪ Create your own gunicorn app and extend Mlflow! ▪ Mlflow is open. Contributions, extend flask application or plugins ▪ Exciting journey to setup Mlflow as a central service for a company. Lot of topics covered. ▪ It was great to collaborate with the community.