50k runs, millions of metrics, parameters and tags, bursts of up to 20k QPS: that is the volume of data managed by our MLflow tracking servers this year at Criteo. In this talk, you will learn how we set up a shared instance of MLflow at company scale. We will present our contributions to the SQLAlchemyStore that make it responsive at this scale, and how we turned MLflow into a production-ready system: how we scaled a shared instance horizontally on a Mesos cluster, our Prometheus-based monitoring, integration with the company Single Sign-On (SSO) authentication, and how our data scientists register their runs from the largest Hadoop cluster in Europe.
MLflow at Company Scale
1. MLflow at Company Scale
Jean-Denis Lesage
@jdlesage
Staff Dev Lead at Criteo
2. Agenda
Introduction
Why did we choose MLflow?
How we set up a shared
MLflow instance
Optimization, monitoring, automation,
authentication, etc.
Plugins and other Features
UI additions
Yarn execution backend
ElasticSearch Store
3. Machine Learning at Criteo
▪ Criteo is an AdTech company. ML is used for bidding,
recommendation, sales & clicks optimization, creative generation,
brand safety, invalid traffic detection, etc.
▪ 1000 models in production
▪ 100 million predictions per second (latency ~50 µs)
▪ 1 PB of logs generated per day
▪ 300 offline experiments per day
4. Why MLflow?
▪ Multi-framework
▪ Can run on any kind of infrastructure
▪ Extensible by plugins
▪ Open source: lowers the risk, and we can contribute if something is missing
▪ But it was not originally designed to run as a central service in a company.
5. Architecture
Central service
Pros:
▪ Factorized maintenance.
▪ Openness improves collaboration: one place to store all ML results.
▪ We have time to contribute to the project.
Cons:
▪ More constraints on the service (more QPS, etc.).
▪ No isolation: one team can impact the whole company (outage).
Central MLflow versus one MLflow per team/project
9. Scale SQLAlchemyStore
Root cause (October 2019)
▪ Slow: all rows are filtered in Python
▪ Very slow: pagination in Python restarts from the beginning every time
▪ Unstable: all runs are loaded in memory => OoM risk
[Diagram: MariaDB answers `SELECT * FROM experiment` quickly, but returns tons of rows; filtering and pagination then happen in Python, reducing them to the ~100 displayed rows.]
10. Scale SQLAlchemyStore
▪ Move filtering and pagination inside the SQL queries.
▪ https://github.com/mlflow/mlflow/pull/2059
▪ Dramatic speedup (10x on our use cases)
▪ No more out-of-memory exceptions
▪ But all tables have to be joined
▪ Eager loading in SQLAlchemy is memory-consuming
▪ Load run attributes lazily
▪ https://github.com/mlflow/mlflow/pull/1878
Implementation
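The fix above can be sketched with a toy schema: filter, sort, and paginate inside the database so only one page of rows ever reaches Python. The table and column names below are illustrative, not MLflow's actual schema, and sqlite stands in for MariaDB.

```python
# Sketch only: a toy `runs` table, not MLflow's real schema.
import sqlalchemy as sa

engine = sa.create_engine("sqlite:///:memory:")
meta = sa.MetaData()
runs = sa.Table(
    "runs", meta,
    sa.Column("run_uuid", sa.String, primary_key=True),
    sa.Column("status", sa.String),
    sa.Column("start_time", sa.Integer),
)
meta.create_all(engine)

with engine.begin() as conn:
    conn.execute(runs.insert(), [
        {"run_uuid": "run-%d" % i,
         "status": "FINISHED" if i % 2 else "FAILED",
         "start_time": i}
        for i in range(1000)
    ])

def search_runs(conn, status, page_size=100, offset=0):
    """Filter, sort and paginate in SQL: only `page_size` rows are
    ever materialized in Python, so no OoM and no full rescans."""
    stmt = (
        runs.select()
        .where(runs.c.status == status)
        .order_by(runs.c.start_time.desc())
        .limit(page_size)
        .offset(offset)
    )
    return conn.execute(stmt).fetchall()

with engine.connect() as conn:
    page = search_runs(conn, "FINISHED", page_size=10)
```

The key point is that `WHERE`, `ORDER BY`, `LIMIT` and `OFFSET` all run inside the database, exactly what PR 2059 moved out of Python.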
11. Scale SQLAlchemyStore
Latest metrics: Python is slow compared to SQL.
In the first versions of the store, the latest metrics were computed in Python.
https://github.com/mlflow/mlflow/pull/1660
Compute in SQL: 4x speedup
https://github.com/mlflow/mlflow/pull/1767
Store in a dedicated table: 20x speedup
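The dedicated-table approach can be illustrated with a self-contained sketch (sqlite stands in for MariaDB, and the schema is a simplified stand-in for MLflow's real tables): latest values are upserted at write time, so reads never scan the metric history.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE metrics (run_uuid TEXT, key TEXT, value REAL,
                      timestamp INTEGER, step INTEGER);
CREATE TABLE latest_metrics (run_uuid TEXT, key TEXT, value REAL,
                             timestamp INTEGER, step INTEGER,
                             PRIMARY KEY (run_uuid, key));
""")

def log_metric(run, key, value, timestamp, step):
    # Full history, as before.
    conn.execute("INSERT INTO metrics VALUES (?, ?, ?, ?, ?)",
                 (run, key, value, timestamp, step))
    # Upsert into the dedicated table so reads never scan the history.
    conn.execute("""
        INSERT INTO latest_metrics VALUES (?, ?, ?, ?, ?)
        ON CONFLICT(run_uuid, key) DO UPDATE SET
            value = excluded.value,
            timestamp = excluded.timestamp,
            step = excluded.step
        WHERE excluded.step >= latest_metrics.step
    """, (run, key, value, timestamp, step))

for step, value in enumerate([0.9, 0.5, 0.1]):
    log_metric("run-1", "loss", value, 1000 + step, step)

latest = conn.execute(
    "SELECT value FROM latest_metrics WHERE run_uuid = 'run-1' AND key = 'loss'"
).fetchone()[0]
```

Reading the latest value is now a primary-key lookup instead of an aggregation over the whole `metrics` table.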
12. Scale SQLAlchemyStore
▪ Some experiments have >1000 columns (metrics, tags, params)
▪ But users typically display only ~10 of them.
▪ https://github.com/criteo-forks/mlflow/pull/178
▪ With a column whitelist, search_runs complexity becomes O(number of requested columns)
Column whitelist: prune unnecessary columns from search results.
13. Whitelist Column
Implementation
import itertools
import operator
from typing import List

import sqlalchemy
from mlflow.store.tracking.dbmodels.models import SqlRun

def selectallin(
    runs: List[SqlRun],
    query_run_ids: sqlalchemy.sql.expression.Alias,
    query: sqlalchemy.orm.query.Query,
    attribute,
    attr_name: str,
) -> None:
    # Query only the whitelisted columns, restricted to the selected run ids.
    values = sorted(
        query.filter(attribute.in_(query_run_ids)).all(),
        key=operator.attrgetter("run_uuid"),
    )
    # Group the rows by run (one group per type: tags, params, metrics, ...).
    values_grouped_by = {
        run_id: list(vals)
        for run_id, vals in itertools.groupby(
            values, key=operator.attrgetter("run_uuid")
        )
    }
    # Attach the pre-loaded values to the SqlRun objects without
    # triggering SQLAlchemy's lazy loading.
    for run in runs:
        if run.run_uuid in values_grouped_by:
            sqlalchemy.orm.attributes.set_committed_value(
                run, attr_name, values_grouped_by[run.run_uuid]
            )
14. Monitoring
▪ Add a /metrics endpoint for Prometheus
https://github.com/mlflow/mlflow/pull/2097
▪ A probe periodically queries search_runs
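A minimal sketch of such a probe, with a plain Python class standing in for the real prometheus_client registry and any callable standing in for the actual search_runs request:

```python
import time

class ProbeMetrics:
    """Tiny stand-in for a prometheus_client registry; the real probe
    would export a latency histogram and an error counter on /metrics."""
    def __init__(self):
        self.samples = []  # (latency_seconds, ok) tuples

    def observe(self, latency, ok):
        self.samples.append((latency, ok))

    def failure_ratio(self):
        if not self.samples:
            return 0.0
        return sum(1 for _, ok in self.samples if not ok) / len(self.samples)

def probe_once(search_runs, metrics):
    """Time one search_runs call and record latency plus success."""
    start = time.monotonic()
    try:
        search_runs()
        ok = True
    except Exception:
        ok = False
    metrics.observe(time.monotonic() - start, ok)
    return ok

metrics = ProbeMetrics()
probe_once(lambda: None, metrics)      # healthy server

def failing_call():
    raise RuntimeError("tracking server down")

probe_once(failing_call, metrics)      # outage
```

Running this on a schedule gives both a liveness signal and an end-to-end latency measurement of the most critical endpoint.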
15. Configure the Gunicorn Application
▪ You can control the Gunicorn application's configuration from code
from gunicorn.app.base import Application

from mlflow.server import app

class StandaloneApplication(Application):
    def __init__(self, app, opt=None):
        self.app = app
        self.opt = opt or {}
        super(StandaloneApplication, self).__init__()

    def init(self, parser, opts, args):
        # Return our own settings dict; gunicorn applies it as config.
        return self.opt

    def load(self):
        """Return the WSGI application to be run."""
        return self.app
Benefits:
• Add hooks (cleanly close the SQL connection pool on worker exit)
• Add new endpoints to MLflow
• Use Flask extensions (authentication, CORS, etc.)
16. Automatic DB migration
▪ Apply DB migration scripts at server startup.
▪ Use SQL to implement a mutex: only one server performs the migration.
▪ Test first in Docker (integration tests), then on the preprod cluster.
Benefits
▪ Continuous deployment (at least one release per week)
▪ Deploy the master branch of MLflow
▪ No human actions on production servers
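A minimal sketch of the SQL mutex, assuming a dedicated lock table (sqlite stands in for MariaDB here; a MariaDB deployment could also use `GET_LOCK`): the first server to insert the lock row wins and runs the migrations, the others skip them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE migration_lock (name TEXT PRIMARY KEY, owner TEXT)")

def try_acquire(conn, owner):
    """Atomically claim the lock row; only the first INSERT succeeds."""
    try:
        conn.execute(
            "INSERT INTO migration_lock VALUES ('db_migration', ?)", (owner,)
        )
        return True
    except sqlite3.IntegrityError:  # another server holds the lock
        return False

def migrate_if_leader(conn, owner, run_migrations):
    if try_acquire(conn, owner):
        run_migrations()
        return True
    return False

ran = []
first = migrate_if_leader(conn, "server-0", lambda: ran.append("migrated"))
second = migrate_if_leader(conn, "server-1", lambda: ran.append("migrated"))
```

The primary-key constraint is what makes the claim atomic: two servers starting at the same time cannot both insert the same row.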
17. Periodic Jobs
▪ Automate maintenance jobs.
Use SQL to maintain a lock between servers.
▪ Examples of jobs:
▪ Compress timestamp metrics: remove some points after
XXX days to save space in the database
▪ Archive artifacts on HDFS
https://hadoop.apache.org/docs/r1.2.1/hadoop_archives.html
▪ Remove deleted runs after XX days (mlflow gc)
▪ Kill stuck jobs
[Diagram: Server 0 and Server 1 coordinate through a SQL lock table, e.g. job name "cleaning_metrics" with lock acquire time 1601991418.]
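The metric-compression job can be sketched as a pure downsampling function (the retention policy and names are illustrative, not the production settings): points older than the cutoff are thinned to the last sample of each day.

```python
DAY_MS = 86_400_000  # milliseconds per day

def compress_metrics(points, now_ms, max_age_days=30):
    """points: list of (timestamp_ms, value) sorted by timestamp.
    Keeps all recent points; for older ones, keeps the last point per day."""
    cutoff = now_ms - max_age_days * DAY_MS
    last_per_day = {}
    recent = []
    for ts, value in points:
        if ts >= cutoff:
            recent.append((ts, value))
        else:
            last_per_day[ts // DAY_MS] = (ts, value)  # later point wins
    return sorted(last_per_day.values()) + recent

points = [
    (1 * DAY_MS + 100, 1.0),   # old, day 1
    (1 * DAY_MS + 200, 2.0),   # old, day 1 (kept: last of the day)
    (2 * DAY_MS + 50, 3.0),    # old, day 2
    (95 * DAY_MS, 4.0),        # recent, kept as-is
]
compressed = compress_metrics(points, now_ms=100 * DAY_MS)
```

The periodic job would run this per (run, metric key) and rewrite the thinned history back to the database under the SQL lock.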
18. Integration to company SSO
▪ Payloads to the MLflow server must contain a JWT token
▪ A new endpoint connected to the OAuth2 service generates a JWT
token from an access token
▪ JavaScript embeds the token in every payload to MLflow.
From JavaScript (frontend)
19. JWT Integration in Javascript
Implementation
export const wrapDeferred = (deferred, data, timeLeftMs = 60000, sleepMs = 1000) => {
  const token = localStorage.getItem('token');
  if (token !== null) {
    $.ajaxSetup({
      headers: { Authorization: token },
    });
  }
  […]
      if (xhr.status === 401) {
        console.warn('Request failed with status 401');
        const authService = new AuthService();
        authService.redirectToSsoIfPossible(xhr);
      }
      […]
    },
  });
});
};
Store the JWT token in local storage (authenticate once per session).
If the MLflow backend returns 401, redirect to the SSO to get a valid JWT token.
20. Integration to company SSO
From the Python and Java clients (based on Kerberos)
[Diagram components: HTTP client, Kerberized client, KDC, JTC, Kerberized server, MLflow protected by JWT]
1. Get TGT
2. Return TGS
3. HTTP/SPNEGO call
4. Return JWT token
5. Call MLflow using the JWT token
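The client side of steps 3 to 5 can be sketched with the SPNEGO token-endpoint call abstracted behind a `fetch_jwt` callable (a hypothetical name; the real clients use Kerberos-authenticated HTTP for it). The header shape matches what the frontend code above sends.

```python
class JwtSession:
    """Caches the JWT obtained through the SPNEGO-protected token
    endpoint and attaches it to every MLflow request."""
    def __init__(self, fetch_jwt):
        self._fetch_jwt = fetch_jwt  # performs steps 3-4 in the real client
        self._token = None

    def headers(self):
        # Step 5: every MLflow call carries the Authorization header.
        if self._token is None:
            self._token = self._fetch_jwt()
        return {"Authorization": self._token}

    def invalidate(self):
        """On a 401 from MLflow, drop the token and re-authenticate."""
        self._token = None

calls = []

def fake_fetch_jwt():
    """Stand-in for the Kerberos/SPNEGO token-endpoint round trip."""
    calls.append(1)
    return "Bearer example-jwt"

session = JwtSession(fake_fetch_jwt)
first = session.headers()
second = session.headers()   # cached: no second fetch
```

Caching the token means the Kerberos round trip happens once per session, mirroring the "auth once per session" behavior of the frontend.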
23. Mlflow-yarn
▪ Execution plugin to run MLProjects on a Hadoop cluster.
▪ Based on skein and cluster-pack
▪ cluster-pack at EuroPython 2020: https://youtu.be/d-XQqBclLnE
26. Code Organization
▪ https://github.com/criteo-forks/mlflow (branch criteo-master)
Contains contributions not yet merged (or those that can’t be merged)
▪ Private git repo mlflow-criteo
Our Gunicorn application: configuration, Criteo business logic, periodic
jobs…
▪ Plugins repos:
▪ https://github.com/criteo/mlflow-yarn
▪ https://github.com/criteo/mlflow-elasticsearchstore
27. Some takeaways
▪ Delegate computation to the SQL server on critical endpoints
▪ Automate as much as possible
▪ Create your own Gunicorn app and extend MLflow!
▪ MLflow is open: contribute, extend the Flask application, or write plugins
▪ Setting up MLflow as a central service for a company was an exciting
journey; lots of topics were covered.
▪ It was great to collaborate with the community.