50k runs, millions of metrics, parameters and tags, bursts of up to 20k QPS: that is the volume of data managed by our MLflow tracking servers this year at Criteo. In this talk, you will learn how we set up a shared instance of MLflow at company scale. We will present our contributions to the SQLAlchemyStore that make it responsive at this scale, and how we turned MLflow into a production-ready system: how we scaled a shared instance horizontally on a Mesos cluster, our Prometheus-based monitoring, integration with the company Single Sign-On (SSO) authentication, and how our data scientists register their runs from the largest Hadoop cluster in Europe.
MLflow at Company Scale
1. MLflow at Company Scale
Jean-Denis Lesage
@jdlesage
Staff Dev Lead at Criteo
2. Agenda
Introduction
Why did we choose MLflow?
How we set up a shared
MLflow instance
Optimization, monitoring, automation,
authentication, etc.
Plugins and other Features
UI additions
Yarn execution backend
ElasticSearch Store
3. Machine Learning at Criteo
▪ Criteo is an AdTech company. ML is used for bidding,
recommendation, sales & clicks optimization, creative generation,
brand safety, invalid traffic detection, etc.
▪ 1000 models in production
▪ 100 million predictions per second (latency ~50 µs)
▪ 1 PB of logs generated per day
▪ 300 offline experiments per day
4. Why MLflow?
▪ Multi-framework
▪ Can run on any kind of infrastructure
▪ Extensible by plugins
▪ Open source: lowers the risk, and we can contribute if something is missing
▪ But it was not originally designed to run as a central service in a company.
5. Architecture
Central service
Pros:
▪ Factorized maintenance.
▪ Openness improves collaboration: one place to store all ML results.
▪ We have time to contribute to the project.
Cons:
▪ More constraints on the service (more QPS, etc.).
▪ No isolation: one team can impact the whole company (outage).
Central MLflow versus one MLflow per team/project
9. Scale SQLAlchemyStore
Root cause (October 2019)
▪ Slow: all rows are filtered in Python
▪ Very slow: pagination in Python restarts from the beginning every time
▪ Unstable: all runs are loaded in memory => OoM risk
[Diagram: MariaDB answers `SELECT * FROM experiment` quickly, but returns tons of rows; filtering and pagination then happen in Python, reducing them to the ~100 displayed rows.]
10. Scale SQLAlchemyStore
▪ Move filtering and pagination inside the SQL queries.
▪ https://github.com/mlflow/mlflow/pull/2059
▪ Dramatic speedup (10x on our use cases)
▪ No more out-of-memory exceptions
▪ But all tables have to be joined
▪ Eager loading in SQLAlchemy is memory-consuming
▪ Load run attributes lazily
▪ https://github.com/mlflow/mlflow/pull/1878
Implementation
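The fix above can be sketched with a toy schema: filter, sort, and paginate inside the database so only one page of rows ever reaches Python. The table and column names below are illustrative, not MLflow's actual schema, and sqlite stands in for MariaDB.

```python
# Sketch only: a toy `runs` table, not MLflow's real schema.
import sqlalchemy as sa

engine = sa.create_engine("sqlite:///:memory:")
meta = sa.MetaData()
runs = sa.Table(
    "runs", meta,
    sa.Column("run_uuid", sa.String, primary_key=True),
    sa.Column("status", sa.String),
    sa.Column("start_time", sa.Integer),
)
meta.create_all(engine)

with engine.begin() as conn:
    conn.execute(runs.insert(), [
        {"run_uuid": "run-%d" % i,
         "status": "FINISHED" if i % 2 else "FAILED",
         "start_time": i}
        for i in range(1000)
    ])

def search_runs(conn, status, page_size=100, offset=0):
    """Filter, sort and paginate in SQL: only `page_size` rows are
    ever materialized in Python, so no OoM and no full rescans."""
    stmt = (
        runs.select()
        .where(runs.c.status == status)
        .order_by(runs.c.start_time.desc())
        .limit(page_size)
        .offset(offset)
    )
    return conn.execute(stmt).fetchall()

with engine.connect() as conn:
    page = search_runs(conn, "FINISHED", page_size=10)
```

The key point is that `WHERE`, `ORDER BY`, `LIMIT` and `OFFSET` all run inside the database, exactly what PR 2059 moved out of Python.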
11. Scale SQLAlchemyStore
Latest metrics: Python is slow compared to SQL.
In the first versions of the store, the latest metrics were computed in Python.
https://github.com/mlflow/mlflow/pull/1660
Compute in SQL: 4x speedup
https://github.com/mlflow/mlflow/pull/1767
Store in a dedicated table: 20x speedup
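The dedicated-table approach can be illustrated with a self-contained sketch (sqlite stands in for MariaDB, and the schema is a simplified stand-in for MLflow's real tables): latest values are upserted at write time, so reads never scan the metric history.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE metrics (run_uuid TEXT, key TEXT, value REAL,
                      timestamp INTEGER, step INTEGER);
CREATE TABLE latest_metrics (run_uuid TEXT, key TEXT, value REAL,
                             timestamp INTEGER, step INTEGER,
                             PRIMARY KEY (run_uuid, key));
""")

def log_metric(run, key, value, timestamp, step):
    # Full history, as before.
    conn.execute("INSERT INTO metrics VALUES (?, ?, ?, ?, ?)",
                 (run, key, value, timestamp, step))
    # Upsert into the dedicated table so reads never scan the history.
    conn.execute("""
        INSERT INTO latest_metrics VALUES (?, ?, ?, ?, ?)
        ON CONFLICT(run_uuid, key) DO UPDATE SET
            value = excluded.value,
            timestamp = excluded.timestamp,
            step = excluded.step
        WHERE excluded.step >= latest_metrics.step
    """, (run, key, value, timestamp, step))

for step, value in enumerate([0.9, 0.5, 0.1]):
    log_metric("run-1", "loss", value, 1000 + step, step)

latest = conn.execute(
    "SELECT value FROM latest_metrics WHERE run_uuid = 'run-1' AND key = 'loss'"
).fetchone()[0]
```

Reading the latest value is now a primary-key lookup instead of an aggregation over the whole `metrics` table.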
12. Scale SQLAlchemyStore
▪ Some experiments have >1000 columns (metrics, tags, params)
▪ But users typically display only ~10 of them.
▪ https://github.com/criteo-forks/mlflow/pull/178
▪ With a column whitelist, search_runs complexity becomes O(number of requested columns)
Column whitelist: prune unnecessary columns from search results.
13. Whitelist Column
Implementation
import itertools
import operator
from typing import List

import sqlalchemy
from mlflow.store.tracking.dbmodels.models import SqlRun

def selectallin(
    runs: List[SqlRun],
    query_run_ids: sqlalchemy.sql.expression.Alias,
    query: sqlalchemy.orm.query.Query,
    attribute,
    attr_name: str,
) -> None:
    # Query only the whitelisted columns, restricted to the selected run ids.
    values = sorted(
        query.filter(attribute.in_(query_run_ids)).all(),
        key=operator.attrgetter("run_uuid"),
    )
    # Group the rows by run (one group per type: tags, params, metrics, ...).
    values_grouped_by = {
        run_id: list(vals)
        for run_id, vals in itertools.groupby(
            values, key=operator.attrgetter("run_uuid")
        )
    }
    # Attach the pre-loaded values to the SqlRun objects without
    # triggering SQLAlchemy's lazy loading.
    for run in runs:
        if run.run_uuid in values_grouped_by:
            sqlalchemy.orm.attributes.set_committed_value(
                run, attr_name, values_grouped_by[run.run_uuid]
            )
14. Monitoring
▪ Add a /metrics endpoint for Prometheus
https://github.com/mlflow/mlflow/pull/2097
▪ A probe periodically queries search_runs
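A minimal sketch of such a probe, with a plain Python class standing in for the real prometheus_client registry and any callable standing in for the actual search_runs request:

```python
import time

class ProbeMetrics:
    """Tiny stand-in for a prometheus_client registry; the real probe
    would export a latency histogram and an error counter on /metrics."""
    def __init__(self):
        self.samples = []  # (latency_seconds, ok) tuples

    def observe(self, latency, ok):
        self.samples.append((latency, ok))

    def failure_ratio(self):
        if not self.samples:
            return 0.0
        return sum(1 for _, ok in self.samples if not ok) / len(self.samples)

def probe_once(search_runs, metrics):
    """Time one search_runs call and record latency plus success."""
    start = time.monotonic()
    try:
        search_runs()
        ok = True
    except Exception:
        ok = False
    metrics.observe(time.monotonic() - start, ok)
    return ok

metrics = ProbeMetrics()
probe_once(lambda: None, metrics)      # healthy server

def failing_call():
    raise RuntimeError("tracking server down")

probe_once(failing_call, metrics)      # outage
```

Running this on a schedule gives both a liveness signal and an end-to-end latency measurement of the most critical endpoint.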
15. Configure the Gunicorn Application
▪ You can control the Gunicorn application's configuration from code
from gunicorn.app.base import Application

from mlflow.server import app

class StandaloneApplication(Application):
    def __init__(self, app, opt=None):
        self.app = app
        self.opt = opt or {}
        super(StandaloneApplication, self).__init__()

    def init(self, parser, opts, args):
        # Return our own settings dict; gunicorn applies it as config.
        return self.opt

    def load(self):
        """Return the WSGI application to be run."""
        return self.app
Benefits:
• Add hooks (cleanly close the SQL connection pool on worker exit)
• Add new endpoints to MLflow
• Use Flask extensions (authentication, CORS, etc.)
16. Automatic DB migration
▪ Apply DB migration scripts at server startup.
▪ Use SQL to implement a mutex: only one server performs the migration.
▪ Test first in Docker (integration tests), then on the preprod cluster.
Benefits
▪ Continuous deployment (at least one release per week)
▪ Deploy the master branch of MLflow
▪ No human actions on production servers
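A minimal sketch of the SQL mutex, assuming a dedicated lock table (sqlite stands in for MariaDB here; a MariaDB deployment could also use `GET_LOCK`): the first server to insert the lock row wins and runs the migrations, the others skip them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE migration_lock (name TEXT PRIMARY KEY, owner TEXT)")

def try_acquire(conn, owner):
    """Atomically claim the lock row; only the first INSERT succeeds."""
    try:
        conn.execute(
            "INSERT INTO migration_lock VALUES ('db_migration', ?)", (owner,)
        )
        return True
    except sqlite3.IntegrityError:  # another server holds the lock
        return False

def migrate_if_leader(conn, owner, run_migrations):
    if try_acquire(conn, owner):
        run_migrations()
        return True
    return False

ran = []
first = migrate_if_leader(conn, "server-0", lambda: ran.append("migrated"))
second = migrate_if_leader(conn, "server-1", lambda: ran.append("migrated"))
```

The primary-key constraint is what makes the claim atomic: two servers starting at the same time cannot both insert the same row.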
17. Periodic Jobs
▪ Automate maintenance jobs.
Use SQL to maintain a lock between servers.
▪ Examples of jobs:
▪ Compress timestamp metrics: remove some points after
XXX days to save space in the database
▪ Archive artifacts on HDFS
https://hadoop.apache.org/docs/r1.2.1/hadoop_archives.html
▪ Remove deleted runs after XX days (mlflow gc)
▪ Kill stuck jobs
[Diagram: Server 0 and Server 1 coordinate through a SQL lock table, e.g. job name "cleaning_metrics" with lock acquire time 1601991418.]
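The metric-compression job can be sketched as a pure downsampling function (the retention policy and names are illustrative, not the production settings): points older than the cutoff are thinned to the last sample of each day.

```python
DAY_MS = 86_400_000  # milliseconds per day

def compress_metrics(points, now_ms, max_age_days=30):
    """points: list of (timestamp_ms, value) sorted by timestamp.
    Keeps all recent points; for older ones, keeps the last point per day."""
    cutoff = now_ms - max_age_days * DAY_MS
    last_per_day = {}
    recent = []
    for ts, value in points:
        if ts >= cutoff:
            recent.append((ts, value))
        else:
            last_per_day[ts // DAY_MS] = (ts, value)  # later point wins
    return sorted(last_per_day.values()) + recent

points = [
    (1 * DAY_MS + 100, 1.0),   # old, day 1
    (1 * DAY_MS + 200, 2.0),   # old, day 1 (kept: last of the day)
    (2 * DAY_MS + 50, 3.0),    # old, day 2
    (95 * DAY_MS, 4.0),        # recent, kept as-is
]
compressed = compress_metrics(points, now_ms=100 * DAY_MS)
```

The periodic job would run this per (run, metric key) and rewrite the thinned history back to the database under the SQL lock.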
18. Integration to company SSO
▪ Payloads to the MLflow server must contain a JWT token
▪ A new endpoint connected to the OAuth2 service generates a JWT
token from an access token
▪ JavaScript embeds the token in every payload to MLflow.
From JavaScript (frontend)
19. JWT Integration in Javascript
Implementation
export const wrapDeferred = (deferred, data, timeLeftMs = 60000, sleepMs = 1000) => {
  const token = localStorage.getItem('token');
  if (token !== null) {
    $.ajaxSetup({
      headers: { Authorization: token },
    });
  }
  […]
      if (xhr.status === 401) {
        console.warn('Request failed with status 401');
        const authService = new AuthService();
        authService.redirectToSsoIfPossible(xhr);
      }
      […]
    },
  });
});
};
Store the JWT token in local storage (authenticate once per session).
If the MLflow backend returns 401, redirect to the SSO to get a valid JWT token.
20. Integration to company SSO
From the Python and Java clients (based on Kerberos)
[Diagram components: HTTP client, Kerberized client, KDC, JTC, Kerberized server, MLflow protected by JWT]
1. Get TGT
2. Return TGS
3. HTTP/SPNEGO call
4. Return JWT token
5. Call MLflow using the JWT token
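The client side of steps 3 to 5 can be sketched with the SPNEGO token-endpoint call abstracted behind a `fetch_jwt` callable (a hypothetical name; the real clients use Kerberos-authenticated HTTP for it). The header shape matches what the frontend code above sends.

```python
class JwtSession:
    """Caches the JWT obtained through the SPNEGO-protected token
    endpoint and attaches it to every MLflow request."""
    def __init__(self, fetch_jwt):
        self._fetch_jwt = fetch_jwt  # performs steps 3-4 in the real client
        self._token = None

    def headers(self):
        # Step 5: every MLflow call carries the Authorization header.
        if self._token is None:
            self._token = self._fetch_jwt()
        return {"Authorization": self._token}

    def invalidate(self):
        """On a 401 from MLflow, drop the token and re-authenticate."""
        self._token = None

calls = []

def fake_fetch_jwt():
    """Stand-in for the Kerberos/SPNEGO token-endpoint round trip."""
    calls.append(1)
    return "Bearer example-jwt"

session = JwtSession(fake_fetch_jwt)
first = session.headers()
second = session.headers()   # cached: no second fetch
```

Caching the token means the Kerberos round trip happens once per session, mirroring the "auth once per session" behavior of the frontend.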
23. Mlflow-yarn
▪ Execution plugin to run MLProjects on a Hadoop cluster.
▪ Based on skein and cluster-pack
▪ cluster-pack at EuroPython 2020: https://youtu.be/d-XQqBclLnE
26. Code Organization
▪ https://github.com/criteo-forks/mlflow (branch criteo-master)
Contains contributions not yet merged (or those that can’t be merged)
▪ Private git repo mlflow-criteo
Our Gunicorn application: configuration, Criteo business logic, periodic
jobs…
▪ Plugins repos:
▪ https://github.com/criteo/mlflow-yarn
▪ https://github.com/criteo/mlflow-elasticsearchstore
27. Some takeaways
▪ Delegate computation to the SQL server on critical endpoints
▪ Automate as much as possible
▪ Create your own Gunicorn app and extend MLflow!
▪ MLflow is open: contribute, extend the Flask application, or write plugins
▪ Setting up MLflow as a central service for a company was an exciting
journey; lots of topics were covered.
▪ It was great to collaborate with the community.