CI/CD with Azure DevOps and Azure Databricks


GoDataDriven

Presentation given during Data Council meetup


CI/CD with Azure DevOps, Pre-Commit, and Azure Databricks
30-10-2019

A typical pipeline
Automate everything
• Deploy to production efficiently and reliably
• Allow everyone in the team to do so
• Smaller increments
• Roll forward, don’t roll back
Pipeline stages:
• Trigger: version control
• Test: code
• Build: artifact
• Deploy: Dev, integration tests
• Deploy: Prod, user facing
• Measure: capture performance

Overall project structure
• src, containing the library
• input, data used while testing
• notebook, containing the application
• tests, for tests

Testing
Our approach
• Use Pre-Commit
• Apply Black, Flake8
• Run PySpark tests in a Docker container
Pipeline steps
• Check out code
• Install requirements
• Apply linters
• Run unit tests
• Publish test results and coverage

Pre-Commit
E.g., solving the “Fixing lint issues” commit
• Framework for creating Git hooks
• I.e., scripts that run on each commit
• Comparable to a local CI
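
To make "scripts that run on each commit" concrete, here is a minimal sketch of what a hand-written Git hook could look like without the framework. The command list and the run_checks helper are illustrative assumptions, not something shown in the talk; Pre-Commit installs and manages the equivalent hook for you.

```python
#!/usr/bin/env python3
"""Hypothetical hand-written .git/hooks/pre-commit script (illustration only)."""
import subprocess
import sys

# Assumed checks: the talk runs Black and Flake8 on each commit.
CHECKS = [
    ["black", "--check", "."],
    ["flake8", "."],
]


def run_checks(checks, runner=subprocess.call):
    """Run each command in order; return the first non-zero exit code, else 0."""
    for command in checks:
        exit_code = runner(command)
        if exit_code != 0:
            return exit_code
    return 0


# In a real hook file the last line would be:
#     sys.exit(run_checks(CHECKS))
# Git aborts the commit when the hook exits with a non-zero status.
```

The runner parameter exists only so the logic can be exercised without Black or Flake8 installed.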

Pre-Commit
.pre-commit-config.yaml
• In our case
  • run Black/Flake8 on each commit
  • run pytest on each push

repos:
  - repo: https://github.com/psf/black
    rev: 19.3b0
    hooks:
      - id: black
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v2.3.0
    hooks:
      - id: flake8
      - id: check-merge-conflict
  - repo: https://github.com/godatadriven/pre-commit-docker-pyspark
    rev: master
    hooks:
      - id: pyspark-docker
        name: Run tests
        entry: /entrypoint.sh python setup.py test
        language: docker
        pass_filenames: false
        stages: [push]

Diagram: python setup.py test runs in Docker, using setup.py, setup.cfg, conftest.py and test_etl.py.

Testing PySpark
• conftest.py creates a pytest fixture called spark

def test_load_df(spark):
    df = load_df(spark, "input/data.csv")
    assert df.count() == 891
    assert df.filter(df.Name == "Sandstrom, Miss. Marguerite Rut").count() == 1


def test_fill_na(spark):
    input_df = spark.createDataFrame(
        [(None, None, None)], "Age: double, Cabin: string, Fare: double"
    )
    output_df = fill_na(input_df)
    output = df_to_list_dict(output_df)
    expected_output = [{"Age": -0.5, "Cabin": "N", "Fare": -0.5}]
    assert output == expected_output
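
The spark argument in the tests above is supplied by a fixture in conftest.py, which the talk mentions but does not show. The following is one plausible sketch of such a fixture; the session scope, the local[2] master, the app name and the SPARK_TEST_CONF settings are all assumptions chosen to keep a local test run small.

```python
# conftest.py (sketch; the talk does not show this file)
import pytest

# Assumed settings for a small, quiet local test session.
SPARK_TEST_CONF = {
    "spark.sql.shuffle.partitions": "1",
    "spark.ui.enabled": "false",
}


@pytest.fixture(scope="session")
def spark():
    """Yield one local SparkSession shared by all tests, then stop it."""
    # Imported lazily so this module can be loaded where pyspark is absent.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master("local[2]").appName("unit-tests")
    for key, value in SPARK_TEST_CONF.items():
        builder = builder.config(key, value)
    session = builder.getOrCreate()
    yield session
    # Teardown: runs once after the last test that used the fixture.
    session.stop()
```

A session-scoped fixture avoids paying the Spark start-up cost for every test, at the price of tests sharing one session.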

Test output
Integrates with Azure DevOps
• Which tests fail frequently
• Full stack traces of a failed test
• Code coverage

Building
Our approach
• Python wheel of library
• Modify notebook/version.py
• Create a build artifact of notebook
Pipeline steps
• Check out code
• Build wheel
• Authenticate with Azure DevOps Artifacts
• Push wheel
• Publish notebook folder as build artifact
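
The "Modify notebook/version.py" step can be sketched as a small build script. The talk does not show the contents of version.py, so the VERSION variable name, the helper names and the major.minor.build scheme are assumptions, chosen to match the version numbers (such as 1.0.100) that appear later in the deck.

```python
from pathlib import Path


def build_version(major_minor, build_number):
    """Combine a fixed major.minor with the CI build number, e.g. 1.0 and 100."""
    return f"{major_minor}.{build_number}"


def write_version_file(notebook_dir, version):
    """Overwrite <notebook_dir>/version.py so it pins the given library version."""
    target = Path(notebook_dir) / "version.py"
    # VERSION is an assumed variable name; the real file may look different.
    target.write_text(f'VERSION = "{version}"\n', encoding="utf-8")
    return target
```

In an Azure DevOps job the build number could, for example, be read from the predefined BUILD_BUILDID environment variable before calling write_version_file("notebook", ...).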

Deployment
Our approach
• Copy version.py to the DEV workspace
• After a manual step
  • Copy notebook/* to the PROD workspace

Pipeline steps
• Authenticate with the Databricks CLI
• Copy notebook/version.py to the DEV workspace
• Authenticate with the Databricks CLI
• Copy notebook/* to the PROD workspace
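
The two copy steps could be scripted roughly as follows. This is a sketch that assumes the legacy databricks-cli package (current at the time of the talk) and its workspace import and workspace import_dir commands; the profile names and workspace paths in the usage note are hypothetical, and newer Databricks CLI versions use a different syntax.

```python
import subprocess


def import_file_command(source, target, profile):
    """Build the command that uploads one Python source file to a workspace."""
    return [
        "databricks", "workspace", "import",
        "--language", "PYTHON",
        "--overwrite",
        "--profile", profile,
        source, target,
    ]


def import_dir_command(source_dir, target_dir, profile):
    """Build the command that uploads a whole local folder to a workspace."""
    return [
        "databricks", "workspace", "import_dir",
        "--overwrite",
        "--profile", profile,
        source_dir, target_dir,
    ]


def deploy(command, runner=subprocess.check_call):
    """Run the command; check_call raises if the CLI exits with an error."""
    runner(command)
```

For example, the DEV step would be deploy(import_file_command("notebook/version.py", "/Shared/app/version", "dev")) and the PROD step deploy(import_dir_command("notebook", "/Shared/app", "prod")), with each profile holding the credentials for its workspace.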

version.py
• A successful change to master results in a new version of the library
• Deploy that version to DEV
• and maybe at a later time to PROD
Diagram: Azure DevOps Pipeline, Azure DevOps Artifacts, and the Dev and Prod notebooks in Azure Databricks, with versions 1.0.100 and 1.0.200 shown.

• On Dev, only version.py is deployed by our CI/CD
• On Prod, the whole notebook folder is deployed
  • i.e. our application
• Using dbutils.library and version.py
  • we can install a specific version of our library
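
Installing a pinned version from inside a notebook might look like the sketch below. It assumes the dbutils.library.installPyPI and restartPython utilities that Databricks runtimes offered at the time of the talk; newer runtimes favour %pip install instead. The function name, package name and index URL are hypothetical.

```python
def install_pinned_library(dbutils, package, version, index_url):
    """Install one exact version of a package as a notebook-scoped library."""
    # repo points at the package index that hosts the wheel, for example
    # an Azure DevOps Artifacts feed (the URL is supplied by the caller).
    dbutils.library.installPyPI(package, version=version, repo=index_url)
    # Restart the Python process so the installed version becomes importable.
    dbutils.library.restartPython()
```

In a notebook this would be called with the built-in dbutils object and the version string taken from version.py, so changing that one file changes which library version the notebook runs against.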

The complete pipeline
• Run black, flake8 and pytest using pre-commit
• Upload wheel to DevOps artifacts, export Notebook folder with modified version.py
• Copy version.py to the DEV workspace
• Copy the Notebook folder to the PROD workspace


