CI/CD Templates: Continuous Delivery of ML-Enabled Data Pipelines on Databricks

Data & ML projects bring many new complexities beyond the traditional software development lifecycle. Unlike software projects, they cannot be abandoned once they have been successfully delivered and deployed; they must be continuously monitored to verify that model performance still satisfies all requirements. New data with new statistical characteristics can arrive at any time and break our pipelines or degrade model performance. These qualities of data & ML projects make continuous testing and monitoring of our models and pipelines a necessity.
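
The continuous check described above can be pictured as a small gate that compares a model metric on fresh data against a required threshold and flags a shift in a feature's mean. This is only a minimal sketch and not part of CI/CD Templates; the function names, the thresholds, and the choice of a mean-shift check are all hypothetical.

```python
from statistics import mean, stdev


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_shift(reference, current):
    """Distance between the two sample means, in reference standard deviations."""
    spread = stdev(reference)
    if spread == 0:
        raise ValueError("reference sample has zero variance")
    return abs(mean(current) - mean(reference)) / spread


def check_model(y_true, y_pred, reference, current,
                min_accuracy=0.8, max_shift=3.0):
    """Return a list of problems; an empty list means the gate passes.

    min_accuracy and max_shift are illustrative defaults (assumptions),
    not values recommended by the talk.
    """
    problems = []
    acc = accuracy(y_true, y_pred)
    if acc < min_accuracy:
        problems.append(f"accuracy {acc:.2f} is below {min_accuracy:.2f}")
    shift = mean_shift(reference, current)
    if shift > max_shift:
        problems.append(f"feature mean shifted by {shift:.1f} standard deviations")
    return problems
```

A scheduled job or CI step could call `check_model` on each new batch and fail when the returned list is non-empty.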


  1. 1. CI/CD Templates: Continuous Delivery of ML-Enabled Data Pipelines on Databricks. Michael Shtelma, Sr. Solutions Architect; Ivan Trusov, Solutions Architect
  2. 2. Agenda: the challenges of implementing CI/CD for ML pipelines; the CI/CD challenges forcing ML teams to choose between Databricks notebooks and local IDEs; introducing DatabricksLabs CI/CD Templates; how CI/CD Templates solves ML team production challenges; demo and next steps
  3. 3. It is challenging for organizations to gain value from ML due to the complexity of the ML lifecycle. Problem: organisations are struggling to get the business to start using their models to drive additional revenue. Cause: due to the complexity of the ML lifecycle, only a few models end up in production and drive additional revenue for the business. Most of them are either delayed or discontinued at some stage of the ML project.
  4. 4. What challenges do ML teams face when they try to implement CD4ML?
  5. 5. ML teams struggle to combine traditional CI/CD tools with Databricks notebooks. 1. Benefits of Databricks notebooks: easy to use; scalable; provide access to ML tools such as MLflow for model logging and serving. 2. Challenges: non-trivial to hook into traditional software development tools such as CI tools or local IDEs. 3. Result: teams find themselves choosing between traditional IDE-based workflows, where they struggle to test and deploy at scale, and Databricks or other cloud notebooks, where they struggle to ensure testing and deployment reliability via CI/CD pipelines.
  6. 6. What’s the solution?
  7. 7. CI/CD Templates gives you the benefits of traditional CI/CD workflows and the scale of Databricks clusters. CI/CD Templates allows you to ● create a production pipeline from a template in a few steps ● that automatically hooks into GitHub Actions and ● runs tests and deployments on Databricks upon git commit, or whatever trigger you define, and ● gives you a test success status directly in GitHub, so you know if your commit broke the build
  8. 8. A scalable CI/CD pipeline in 5 easy steps: 1. Install and customize with a single command. 2. Create a new GitHub repo containing your Databricks host and token secrets. 3. Initialize git in your repo and commit the code. 4. Push your new CI/CD Templates project to the repo. Your tests will start running automatically on Databricks. When they succeed or fail, you will get a green checkmark or a red x next to your commit. 5. You’re done! You now have a fully scalable CI/CD pipeline.
  9. 9. Project structure: 1. Python package where the logic of the project is developed. Your models and pipelines are developed here. 2. Configuration where you define the Databricks jobs that run the pipelines developed in the Python package. 3. Tests directory where local unit tests and integration tests are developed.
  10. 10. CI/CD Templates executes tests and deployments directly on Databricks while storing packages, logged models and other artifacts in MLflow
  11. 11. CI/CD Templates - now powered by dbx. With dbx you can: ● customize the project structure and specify it during deployments ● use new CI tools easily (PRs are welcome!) ● run custom data pipelines directly from the IDE on interactive clusters
  12. 12. Push Flow
  13. 13. Release Flow
  14. 14. Demo: CI/CD Templates
  15. 15. Summary: the challenges of implementing CD4ML; the CI/CD challenges forcing ML teams to choose between Databricks notebooks and local IDEs; introducing DatabricksLabs CI/CD Templates; how CI/CD Templates solves ML team production challenges. Next steps: search for DatabricksLabs cicd-templates or go directly to https://github.com/databrickslabs/cicd-templates to get started. Contact: michael.shtelma@databricks.com, ivan.trusov@databricks.com
  16. 16. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
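
The project structure on slide 9 pairs a Python package with a tests directory. As a rough illustration of how the two relate, the sketch below shows one pipeline step and a local unit test for it in a single file. The function `add_total_price` and the record fields are hypothetical and do not come from the template itself; in a real project the function would live in the package and the test class in the tests directory.

```python
import unittest


def add_total_price(records):
    """Hypothetical pipeline step: add a 'total' field to each order record."""
    # Build new dicts so the caller's input records are left untouched.
    return [{**r, "total": r["quantity"] * r["unit_price"]} for r in records]


class AddTotalPriceTest(unittest.TestCase):
    """Local unit tests of the kind the tests directory is meant to hold."""

    def test_total_is_quantity_times_price(self):
        result = add_total_price([{"quantity": 3, "unit_price": 2.5}])
        self.assertEqual(result[0]["total"], 7.5)

    def test_input_is_not_mutated(self):
        records = [{"quantity": 1, "unit_price": 4.0}]
        add_total_price(records)
        self.assertNotIn("total", records[0])
```

Saved as a test module, this can be run locally with `python -m unittest` before the same logic is exercised on a cluster.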
