Developing ML-enabled Data Pipelines on Databricks using IDE & CI/CD at Runtastic
Data & ML projects bring many new complexities beyond the traditional software development lifecycle. Unlike software projects, they cannot be abandoned after they have been successfully delivered and deployed: they must be continuously monitored to check that model performance still satisfies all requirements, because new data with new statistical characteristics can break our pipelines or degrade model performance.
  1. Developing ML-enabled data pipelines on Databricks using IDE & CI/CD
  2. AGENDA: Company presentations · Challenges · CI/CD Templates · Runtastic integration · Demo
  3. EMANUELE VIGLIANISI: Data Engineer at Runtastic since January 2020; previously a security testing researcher in FinTech
  4. MICHAEL SHTELMA: Solutions Architect at Databricks since April 2019; previously Technical Lead, Data Foundation at Teradata
  5. Runtastic
  7. ADIDAS TRAINING
     ● 180+ HD exercise videos with step-by-step instructions
     ● 25+ standalone workouts to work out anytime, anywhere
     ● Guided video workouts let you exercise along with our fitness experts and your favorite athletes
     ● Special indoor workouts, suitable for home
     ● No additional equipment necessary
     ● Health and nutrition guide to complement your fitness
     ● Proven quality through development cooperation with Apple and Google
     ● Top-rated app on the Apple App Store and Google Play
  8. ADIDAS RUNNING
     ● Our original flagship app
     ● Allows you to track your sports activities using GPS technology
     ● 90+ available sport types
     ● Share your sports activities and reach your goals
     ● Participate in challenges
     ● Compare yourself with your friends on the Leaderboard
     ● Listen to Story Runs while you are active
     ● …and many more features
  9. Databricks
  10. A unified data analytics platform for accelerating innovation across data engineering, data science, and analytics
      ▪ Global company with over 5,000 customers and 450+ partners
      ▪ Original creators of popular data and machine learning open source projects
  11. Challenges
  12. Our goal (AS-IS → NEW): move the on-premise Analytics Backend to the cloud (Microsoft Azure and Databricks) while ensuring high-quality software
  13. The CI/CD challenge: CI/CD is fundamental in a software development workflow for ensuring high-quality code. Question: is there a way to integrate CI/CD with Databricks for our data engineering pipelines?
  14. CI/CD Benefits
      Key points of CI/CD:
      - Continuous integration (CI) is the practice of automating the integration of code changes from multiple contributors into a single software project. The CI process comprises automated tools that assert the code's correctness before and after integration (tests).
      - Continuous delivery (CD) is an approach where teams release quality products frequently and predictably, from source code repository to production, in an automated fashion.
      Our needs: CI/CD lets us automate long and error-prone deployment processes such as:
      - testing the code before every pull request merge
      - deploying the right code into the right environment (DEV, PRD)
  15. Why we need CI/CD
  16. Why we need CI/CD (cont.)
  17. CHALLENGES: What is the cloud challenge?
      DATA:
      - Tests require production-like data (static or dynamic)
      - Production-like data is available only in the cloud
      - Integration tests therefore have to run in the cloud
      CLOUD DEPENDENCIES: ETL pipelines make use of different cloud services:
      - Ingest data into the cloud from Azure Event Hub
      - Store it in Azure Data Lake
      - Require authorization for accessing the data via Azure Active Directory rules
      - Use secrets securely stored in the cloud with Azure Key Vault
  18. The problem we had: aiming to implement CI/CD using
      Option 1: Databricks notebooks. Limitations:
      - It is difficult to divide the code into different sub-modules/projects
      - Versioning is possible, but only one notebook at a time
      - No tooling for automatic tests
      - No natural place for tests
      Option 2: Databricks Connect. Limitations:
      - It does not support streaming jobs
      - It is not possible to run arbitrary code that is not part of a Spark job on the remote cluster
  19. CI/CD Templates by Databricks Labs
  20. ML teams struggle to combine traditional CI/CD tools with Databricks notebooks
      - Benefits of Databricks notebooks: easy to use, scalable, and provide access to ML tools such as MLflow for model logging and serving
      - Challenge: non-trivial to hook into traditional software development tools such as CI tools or local IDEs
      - Result: teams find themselves choosing between
        - using traditional IDE-based workflows but struggling to test and deploy at scale, or
        - using Databricks notebooks (or other cloud notebooks) but then struggling to ensure testing and deployment reliability via CI/CD pipelines
  21. CI/CD Templates gives you the benefits of traditional CI/CD workflows and the scale of Databricks clusters. CI/CD Templates allows you to:
      ● create a production pipeline from a template in a few steps,
      ● automatically hook into GitHub Actions,
      ● run tests and deployments on Databricks upon git commit or whatever trigger you define, and
      ● get a test success status directly in GitHub, so you know if your commit broke the build
  22. A scalable CI/CD pipeline in 5 easy steps
      1. Install and customize with a single command
      2. Create a new GitHub repo containing your Databricks host and token secrets
      3. Initialize git in your repo and commit the code
      4. Push your new cicd-templates project to the repo. Your tests will start running automatically on Databricks. Upon your tests' success or failure you will get a green checkmark or red X next to your commit status.
      5. You're done! You now have a fully scalable CI/CD pipeline.
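The five steps above can be sketched as a small dry-run helper that only lists the shell commands involved. The cookiecutter URL, repo URL, and commit message are illustrative assumptions, not values from the deck:

```python
# Dry-run sketch of the bootstrap steps for a CI/CD Templates project.
# The cookiecutter URL and commit message below are assumptions.
def setup_commands(repo_url):
    """Return the shell commands for bootstrapping the project (not executed)."""
    return [
        # 1. Install and customize the template with a single command
        "cookiecutter https://github.com/databrickslabs/cicd-templates.git",
        # 2. (In GitHub) create the repo and add the Databricks host/token secrets
        # 3. Initialize git and commit the generated code
        "git init",
        "git add .",
        'git commit -m "Initial cicd-templates project"',
        # 4. Push; GitHub Actions then runs the tests on Databricks
        f"git remote add origin {repo_url}",
        "git push -u origin master",
    ]

for cmd in setup_commands("git@github.com:example/analytics-backend.git"):
    print(cmd)
```

Printing instead of executing keeps the sketch safe to run anywhere; piping the output to a shell would perform the actual setup.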
  23. CI/CD Templates executes tests and deployments directly on Databricks while storing packages, model logs, and other artifacts in MLflow
  24. Push Flow
  25. Release Flow
  26. Runtastic Integration
  27. How we are using the CI/CD template: there are 4 environments in total, each with its own Databricks token:
      - DEV: playground for DS/DA/DE
      - PRE: release-candidate code on PRD data
      - STG: stable code on release-candidate data
      - PRD: stable code on PRD data
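A minimal sketch of how per-environment tokens might be selected, assuming one environment variable per stage; the variable naming scheme (`DATABRICKS_TOKEN_<ENV>`) is a hypothetical convention, since the slide only says each environment has its own token:

```python
import os

# The four stages described on the slide.
ENVS = ("DEV", "PRE", "STG", "PRD")

def databricks_token(env):
    """Return the Databricks token for one of the four environments.

    Assumes a hypothetical DATABRICKS_TOKEN_<ENV> variable per stage.
    """
    if env not in ENVS:
        raise ValueError(f"unknown environment: {env}")
    return os.environ[f"DATABRICKS_TOKEN_{env}"]

# Demo with a stand-in value; a real token would come from a secret store.
os.environ["DATABRICKS_TOKEN_DEV"] = "dapi-example"
print(databricks_token("DEV"))
```

Failing fast on an unknown environment name avoids accidentally deploying with the wrong token.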
  28. Project structure
      /analyticsbackend             Main Python module
      /pipelines                    Pipeline configurations
      /tests                        Tests folder, divided into unit and integration tests
        /unit
        /integration
      runtime_requirements.txt      Libraries installed on every cluster
  29. Pipeline example
      /analyticsbackend
      /pipelines
        /anonymization_pipeline
      /tests
        /unit
        /integration
          /anonymization_pipeline_test
  30. Pipeline structure
      /pipelines
        . . .
        /anonymization_pipeline
          databricks-config_dev.json     JSON containing input parameters for the pipeline (e.g. paths). One per environment.
          databricks-config_prd.json
          databricks-config_pre.json
          databricks-config_stg.json
          job_spec_azure_dev.json        Job configuration, one per environment, containing cluster properties, pool id, etc.
          job_spec_azure_prd.json
          job_spec_azure_pre.json
          job_spec_azure_stg.json
          /                              Pipeline entry point
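Given that layout, loading the right parameter file for an environment reduces to a one-liner; this is a sketch under the assumption that the pipeline reads its own `databricks-config_<env>.json`, with a made-up `input_path` parameter for the demo:

```python
import json
import os
import tempfile

def load_pipeline_config(pipeline_dir, env):
    """Load the per-environment parameter file databricks-config_<env>.json."""
    path = os.path.join(pipeline_dir, f"databricks-config_{env.lower()}.json")
    with open(path) as f:
        return json.load(f)

# Demonstrate with a temporary pipeline folder and a hypothetical parameter.
pipeline_dir = tempfile.mkdtemp()
with open(os.path.join(pipeline_dir, "databricks-config_dev.json"), "w") as f:
    json.dump({"input_path": "/mnt/raw/events"}, f)

print(load_pipeline_config(pipeline_dir, "DEV"))
```

Keeping one JSON file per environment means the same pipeline code runs unchanged in DEV, PRE, STG, and PRD; only the file it reads differs.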
  31. Run a (test) pipeline
      1. Move to the target environment:
         export DATABRICKS_ENV=DEV
         export DATABRICKS_TOKEN=<DB-DEV-TOKEN>
      2. Run the pipeline:
         python3 pipelines --pipeline-name anonymization_pipeline
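The `pipelines` entry point seen above could dispatch by name roughly like this; the registry and the pipeline body are hypothetical, since the deck only shows the command-line shape:

```python
import argparse

# Hypothetical pipeline body; the real one runs the Spark anonymization job.
def anonymization_pipeline():
    return "anonymization_pipeline finished"

# Hypothetical registry mapping --pipeline-name values to callables.
PIPELINES = {"anonymization_pipeline": anonymization_pipeline}

def main(argv=None):
    parser = argparse.ArgumentParser(prog="pipelines")
    parser.add_argument("--pipeline-name", required=True, choices=sorted(PIPELINES))
    args = parser.parse_args(argv)
    return PIPELINES[args.pipeline_name]()

print(main(["--pipeline-name", "anonymization_pipeline"]))
```

Using `choices` makes a typo in the pipeline name fail immediately with a usage message instead of a stack trace deep inside the job.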
  32. Deploy pipelines
      1. Move to the target environment:
         export DATABRICKS_ENV=DEV
         export DATABRICKS_TOKEN=<DB-DEV-TOKEN>
      2. Run the deployment:
         python -c "from databrickslabs_cicdtemplates import release_cicd_pipeline; release_cicd_pipeline.main('tests/integration', 'pipelines', True, env=DATABRICKS_ENV)"
      Arguments: 'tests/integration' is the folder with the testing pipelines; 'pipelines' is the folder of pipelines to deploy; True means run the tests before deploying; env is the deployment environment.
  33. GitHub integration
  34. GitHub Actions

      name: Release workflow
      on:
        # Trigger the workflow once you create a new release
        release:
          types:
            - created
      jobs:
        build:
          runs-on: ubuntu-latest
          [ . . . ]
            - name: Deploy artifact on PRD and STG environments
              run: |
                export DATABRICKS_TOKEN=${{ secrets.DATABRICKS_TOKEN_PRD }}
                export DATABRICKS_ENV=PRD
                python -c "from databrickslabs_cicdtemplates import release_cicd_pipeline; release_cicd_pipeline.main('tests/integration', 'pipelines', True, env='${DATABRICKS_ENV}');"
  35. Our git flow
      FEATURE BRANCH: some work done; adding the "test-it" label triggers a test deployment (PRE); once the PR is tested and approved, it is merged
      MASTER BRANCH: on push, test + deploy (STG)
      NEW RELEASE: on release, deploy (PRD)
  36. Demo
  37. Conclusions: key takeaways
      1. The code and data of ETL pipelines need to be tested like everything else in software engineering. CI/CD is necessary for automating the testing and deployment processes and achieving high-quality software.
      2. CI/CD is not easy to implement: Databricks notebooks and Databricks Connect are not enough for complex scenarios.
      3. CI/CD Templates by Databricks Labs allows us to better organize our code into sub-modules and implement CI/CD through its easy integration with GitHub Actions.
  38. Feedback: your feedback is important to us. Don't forget to rate and review the sessions. Thank you!