Machine Learning development involves comparing models and storing the artifacts they produce. We often compare several algorithms to select the most efficient one, and we assess different hyper-parameters to fine-tune the model. Git helps us store multiple versions of our code. Additionally, we need to keep track of the datasets we use. This is important not only for audit purposes but also for assessing the performance of models developed at a later time. Git is the standard code versioning tool in software development. It can be used to store datasets, but it does not offer an optimal solution for them.
Data Versioning and Reproducible ML with DVC and MLflow
1. Data Versioning and Reproducible ML with DVC and MLflow
Petra Kaferle Devisschere
Data Scientist, Adaltas
2. Introduction
About me:
▪ Data Scientist
▪ 13 years of experience in Data Analytics in Life Sciences
▪ Developed solutions for automating frequently used protocols
About Adaltas:
▪ Involved with Big Data since 2010
▪ 16 experts in France, 2 in Morocco
▪ Partner with Cloudera, Databricks, D2IQ
▪ Actor in Open Source community
▪ Teaching Big Data at 6 universities
3. Agenda
▪ Versioning in Data Science
▪ Data Versioning
▪ Intro to DVC and MLflow
▪ Demo with DVC and MLflow
4. Problem definition
▪ Dataset
▪ Tabular
▪ Images, video, sound, text
▪ Code
▪ Functionality, Hyper-parameters
▪ Data pre-processing, Modeling
▪ Results
▪ Numeric (metrics)
▪ Plots
▪ Environment
▪ Dependencies
What do we need to track in a Data Science project to be able to reproduce the results?
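The items above can be captured even without a dedicated tool. As a minimal sketch (all names here are hypothetical, using only the standard library), one run record could bundle the dataset checksum, hyper-parameters, metrics, and environment info into a single JSON file:

```python
import hashlib
import json
import sys
from pathlib import Path

def dataset_checksum(path):
    """Hash the dataset file so the exact version used can be identified later."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def record_run(dataset_path, params, metrics, out="run.json"):
    """Write everything needed to reproduce one run into a single JSON record."""
    record = {
        "dataset": {"path": str(dataset_path),
                    "md5": dataset_checksum(dataset_path)},  # which data?
        "params": params,                  # which hyper-parameters?
        "metrics": metrics,                # which results?
        "python": sys.version.split()[0],  # which environment?
    }
    Path(out).write_text(json.dumps(record, indent=2))
    return record

# Track a tiny example dataset together with params and metrics.
Path("data.csv").write_text("x,y\n1,2\n")
run = record_run("data.csv", params={"alpha": 0.1}, metrics={"rmse": 0.42})
```

Doing this by hand for every run is exactly the bookkeeping that tools like DVC and MLflow automate.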
5. Data versioning use cases
▪ Data engineering during data cleaning and dataset preparation
▪ Test data in software engineering to ensure the functionality of the software
▪ Development and testing of database applications
▪ Ontology versioning for Semantic Web
▪ ...
Data beyond Data Science
6. Data is at the heart of any ML project
▪ Data quality management
▪ Reproducible training and re-training
▪ Automation of testing or deployment
▪ Use in applications and web services
▪ Audit
Why is the exact version of a dataset important?
7. Classical DV solutions and their downsides
▪ Git
▪ Not suitable for big files
▪ Difficulties with binary files
▪ Git-LFS
▪ File size limitation (e.g. 2 GB for a GitHub free account and 5 GB for GitHub Enterprise Cloud)
▪ Data must be stored on the Git provider's servers
▪ External storage
▪ Requires computing a checksum for every change to the dataset and placing the checksums under version control
Classical Software Engineering tools are not enough to solve the ML reproducibility crisis
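The external-storage approach in the last bullet can be sketched in a few lines of standard-library Python (the cache directory and pointer-file format are hypothetical): the data is copied to content-addressed storage, and only a small text pointer is committed to Git. This is, in essence, what DVC automates.

```python
import hashlib
import shutil
from pathlib import Path

CACHE = Path("datacache")  # hypothetical external storage location

def checkin(path):
    """Copy the data file into external storage under its checksum and
    write a small pointer file -- the pointer, not the data, goes into Git."""
    digest = hashlib.md5(Path(path).read_bytes()).hexdigest()
    CACHE.mkdir(exist_ok=True)
    shutil.copy(path, CACHE / digest)                   # content-addressed copy
    Path(f"{path}.ptr").write_text(f"md5: {digest}\n")  # tiny text file for Git
    return digest

# Version a small example dataset.
Path("train.csv").write_text("a,b\n1,2\n")
digest = checkin("train.csv")
```

Maintaining such a scheme by hand for every dataset change is the tedious part that motivates a dedicated tool.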
8. Our proposed solution
Combining Data Version Control (DVC) and MLflow
▪ Solves Git's limitations
▪ Easy to learn
▪ Extensive and easy tracking
▪ User-friendly UI
9. What is DVC?
https://dvc.org/
▪ Tracks datasets and ML projects
▪ Works with many types of storage (cloud, local, HDFS, HTTP…)
▪ Runs on top of a Git repository
▪ Supports building and running pipelines
We will focus on dataset tracking.
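As a sketch of the basic dataset-tracking workflow (assuming DVC is installed inside an existing Git repository; the remote name and S3 URL below are placeholders):

```shell
# Inside an existing Git repository, with DVC installed:
dvc init                       # set up DVC in the repo
dvc add data/raw.csv           # track the dataset; creates data/raw.csv.dvc
git add data/raw.csv.dvc data/.gitignore
git commit -m "Track raw dataset with DVC"

# Push the data itself to remote storage (URL is a placeholder):
dvc remote add -d myremote s3://mybucket/dvcstore
dvc push

# Later, restore the dataset version matching any Git revision:
git checkout <revision>
dvc checkout
```

Git versions only the small `.dvc` pointer files, while the data itself lives in the remote storage of your choice.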