Data & ML projects bring many new complexities beyond the traditional software development lifecycle. Unlike conventional software projects, they cannot be left alone once they have been delivered and deployed: they must be monitored continuously to check that model performance still satisfies all requirements. New data with new statistical characteristics can arrive at any time and break our pipelines or degrade model performance. These properties of data & ML projects make continuous testing and monitoring of our models and pipelines a necessity.
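As an illustration of the kind of continuous check this implies, here is a minimal sketch of a data drift test that compares the mean of a new batch against a reference sample. The function name `mean_shift_detected`, the `max_z` threshold and the sample values are all hypothetical and not part of CI/CD Templates; a real project would likely use a more robust statistical test.

```python
from statistics import mean, stdev


def mean_shift_detected(reference, new_batch, max_z=3.0):
    """Return True if the new batch mean lies more than `max_z`
    standard errors away from the reference mean.

    Hypothetical helper for illustration only.
    """
    ref_mean = mean(reference)
    ref_std = stdev(reference)
    if ref_std == 0:
        # A constant reference: any difference in mean counts as a shift.
        return mean(new_batch) != ref_mean
    # Standard error of the batch mean under the reference distribution.
    standard_error = ref_std / len(new_batch) ** 0.5
    z_score = abs(mean(new_batch) - ref_mean) / standard_error
    return z_score > max_z


# Made-up feature values used only to exercise the check.
reference = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8, 10.1, 9.9]
similar_batch = [10.1, 9.9, 10.3, 9.7]
shifted_batch = [14.0, 15.2, 14.8, 15.5]

print(mean_shift_detected(reference, similar_batch))
print(mean_shift_detected(reference, shifted_batch))
```

A check like this could run as a scheduled test against every new batch, so that a shift in the input data is flagged before it silently affects model performance.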
CI/CD Templates: Continuous Delivery of ML-Enabled Data Pipelines on Databricks
1. CI/CD Templates: Continuous
Delivery of ML-Enabled Data
Pipelines on Databricks
Michael Shtelma, Sr. Solutions Architect
Ivan Trusov, Solutions Architect
2. Agenda
The Challenges of implementing
CI/CD for ML pipelines
The CI/CD challenges forcing ML teams to choose
between Databricks notebooks or local IDEs
Introducing DatabricksLabs
CI/CD Templates
How CI/CD Templates solves ML team production
challenges
Demo and Next Steps
3. Problem:
Organizations are struggling to get the business to start using
their models to drive additional revenue
Cause:
Because of the complexity of the ML lifecycle, only a few models end up
in production and drive additional revenue for the business.
Most of them are either delayed or discontinued at
different stages of the ML project
It is challenging for organizations to
gain value from ML due to the complexity of
the ML lifecycle
5. ML teams struggle to combine traditional CI/CD
tools with Databricks notebooks
1. Benefits of Databricks notebooks
Easy to use
Scalable
Provide access to ML tools such as MLflow for model logging and serving
2. Challenges
Non-trivial to hook into traditional software development tools such as CI tools or local IDEs.
3. Result
Teams find themselves choosing between
using traditional IDE-based workflows but struggling to test and deploy at scale, or
using Databricks notebooks or other cloud notebooks but then struggling to ensure
testing and deployment reliability via CI/CD pipelines.
7. CI/CD Templates gives you the benefits of
traditional CI/CD workflows and the scale of
Databricks clusters
CI/CD Templates allows you to
● create a production pipeline from a template in a few steps
● that automatically hooks into GitHub Actions and
● runs tests and deployments on Databricks upon git commit or
whatever trigger you define and
● gives you a test success status directly in GitHub so you know if your
commit broke the build
8. A scalable CI/CD pipeline in 5 easy steps
1. Install and customize with a single command
2. Create a new GitHub repo containing your Databricks host
and token secrets
3. Initialize git in your repo and commit the code.
4. Push your new CI/CD Templates project to the repo. Your tests will
start running automatically on Databricks. Depending on whether your tests
succeed or fail, you will get a green checkmark or a red x next to your commit
status.
5. You’re done! You now have a fully scalable CI/CD pipeline.
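Step 2 relies on the Databricks host and token being available to the CI run. The following is a minimal sketch of a pre-flight check that could run before tests start. It assumes the two secrets are exposed as the environment variables `DATABRICKS_HOST` and `DATABRICKS_TOKEN`; these names are an assumption, since the slides only mention "host and token secrets", and the helper `missing_secrets` is hypothetical.

```python
import os

# Assumed environment variable names; the slides do not name them.
REQUIRED_SECRETS = ("DATABRICKS_HOST", "DATABRICKS_TOKEN")


def missing_secrets(env=None):
    """Return the names of required secrets that are unset or empty.

    `env` defaults to the process environment; passing a dict makes
    the check easy to test locally.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_secrets()
    if missing:
        print("Missing secrets: " + ", ".join(missing))
    else:
        print("All required secrets are set.")
```

Failing fast on a missing secret gives a clearer error than a connection failure partway through a test run on the cluster.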
9. Project structure
1. Python package where the logic of the project will be developed.
Your models and pipelines will be developed here.
2. Configuration where you can define the Databricks jobs
that run the pipelines developed in the Python package
3. Tests directory where local unit tests and integration tests will be
developed
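To make the split between the package and the tests directory concrete, here is a minimal sketch with both parts shown in one block for brevity. The function `add_risk_flag`, its columns and the module locations named in the comments are hypothetical and do not come from the template itself.

```python
import unittest


# --- Would live in the Python package, e.g. a hypothetical pipeline module ---
def add_risk_flag(rows, threshold=0.5):
    """Return copies of `rows` with a boolean 'high_risk' field
    set to True when the row's 'score' is at or above `threshold`."""
    return [{**row, "high_risk": row["score"] >= threshold} for row in rows]


# --- Would live in the tests directory as a local unit test ---
class AddRiskFlagTest(unittest.TestCase):
    def test_flags_rows_at_or_above_threshold(self):
        rows = [{"id": 1, "score": 0.2}, {"id": 2, "score": 0.9}]
        result = add_risk_flag(rows, threshold=0.5)
        self.assertEqual([r["high_risk"] for r in result], [False, True])

    def test_input_rows_are_not_mutated(self):
        rows = [{"id": 1, "score": 0.7}]
        add_risk_flag(rows)
        self.assertNotIn("high_risk", rows[0])
```

Keeping pipeline logic in plain functions like this lets the same code be unit tested locally and then run as a Databricks job defined in the configuration.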
10. CI/CD Templates executes tests and deployments
directly on Databricks while storing packages, logged
models and other artifacts in MLflow
11. CI/CD Templates - now powered by dbx
With dbx you can:
● customize project structure and specify it during deployments
● use new CI tools easily (PRs are welcome!)
● run custom data pipelines directly from your IDE on interactive clusters
15. Summary
The Challenges of implementing
CD4ML
The CI/CD challenges forcing ML teams to choose
between Databricks notebooks or local IDEs
Introducing DatabricksLabs
CI/CD Templates
How CI/CD Templates solves ML team production
challenges
Next Steps
Search DatabricksLabs cicd-templates or go
directly to
https://github.com/databrickslabs/cicd-templates
to get started
michael.shtelma@databricks.com
ivan.trusov@databricks.com