Enabling Scalable Data Science
Pipelines with MLflow and Model
Registry at Thermo Fisher
Scientific
Allison Wu
Data Scientist, Data Science Center of Excellence
Thermo Fisher Scientific
Key Summary
▪ We standardized development of machine learning models by
integrating MLflow tracking into the development pipeline.
▪ We improved reproducibility of machine learning models by
integrating GitHub and Delta Lake into the development and deployment
pipelines.
▪ We streamlined our deployment process for machine learning models
on different platforms through MLflow and a centralized Model
Registry.
What do data scientists at our Data Science Center of
Excellence do?
▪ Generate novel algorithms that
can be applied across different
divisions
▪ Work with cross-divisional teams
for model migration and
standardization
▪ Enable data science in different
divisions and functions
▪ Establish data science best
practices
[Diagram: Data Science at Thermo Fisher, with Operations, Human Resources, Commercial & Marketing, and R&D]
Commercial & Marketing Data Science Life Cycle
Actionable insights (data) from customer interactions, which create a competitive advantage and drive growth and profitability
Data Delivery
Install base
Cloud
Transactional
External data
Web behavioral
Customer interaction
Call center
Model Development &
Deployment
Automatic email campaigns
Website marketing strategies
Prescriptive rec. for sales reps.
Machine Learning Models
Feedback for machine learning
Relevant offer
Customer
• Engagement
• Leads
• Revenue ($)
Rule-based Legacy Models
Model Development and Deployment Cycle
Development (DEV)
▪ Exploratory analysis
▪ Model development: feature engineering, selection, model optimization
Deployment (PRD)
▪ Deployment to production environment
▪ Audit
▪ Scoring
Delivery (PRD)
▪ Web recommendation
▪ Email campaign
▪ Commercial dashboard
Management (PRD)
▪ Monitoring
▪ Feedback
Production model retraining and retuning (DEV ← PRD)
New model development
An Example Model Development / Deployment Cycle
A model that makes product recommendations based on customer behaviors, such as web
activities and sales transactions.
6-8 weeks of EDA
and prototyping
• Scoring daily.
• Retrain/Retune based on
new data in production
every 2 weeks.
• Deliver through email
campaign or commercial
sales rep. channels.
• Monitor model
performance metrics
What we used to do…
• All work is in Databricks Notebooks
• No version control on either data or
model
• No unit testing
• No regression testing against
different versions of models
• Hard to share modularized functions
across projects (Lots of copy-pasting)
What we now do…
Databricks notebook
• Exploratory Analysis
• Feature engineering
Notebook & MLflow
• ML model experiment
• Hyperparameter tracking
• Feature selection
• Model comparison
DEV
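The experiment tracking step above could look roughly like the following. This is a minimal sketch, not the team's actual code: the function names `log_experiment` and `best_run`, and the idea of comparing runs through a plain dictionary, are assumptions made for illustration. Only `mlflow.start_run`, `mlflow.log_params`, and `mlflow.log_metrics` are MLflow calls.

```python
from typing import Dict


def best_run(runs: Dict[str, Dict[str, float]], metric: str) -> str:
    """Return the name of the run with the highest value of `metric`."""
    return max(runs, key=lambda name: runs[name][metric])


def log_experiment(params: dict, metrics: dict, run_name: str) -> str:
    """Log one experiment to MLflow and return its run ID (hypothetical helper)."""
    import mlflow  # imported lazily so best_run() works without MLflow installed

    with mlflow.start_run(run_name=run_name) as run:
        mlflow.log_params(params)    # hyperparameters, e.g. {"max_depth": 6}
        mlflow.log_metrics(metrics)  # evaluation metrics, e.g. {"precision": 0.8}
        return run.info.run_id
```

Once every experiment is logged this way, model comparison becomes a lookup over the recorded metrics rather than a search through notebooks.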
Development Model Registry
• Streamline regression testing against
previous model versions
• Documented model review process
• Clean version management for better
collaboration within the same DEV
environment
DEV
• ML model library: Python modules for shareable
and testable ML functions, such as feature
functions, utility functions, and ML tuning functions.
• Version controlled on GitHub
• Integrate with Databricks Projects to version
control Databricks notebooks
• Documented code review process
• Version controlled data source with Delta Lake
Tracking Feature Improvements Becomes Easy
Boss: What are the important features in
this version versus the previous version?
What we used to do…
▪ “Let me find out how the features do in
my… uh… model_version_10.dbc?
Maybe?”
▪ “I wish I had a screenshot of the
feature importance figure from before…”
What we now do…
▪ “I got it. Let me pull it out from MLflow…”
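One way to make feature importances retrievable later is to store them as a run artifact. The sketch below assumes the importances are available as a plain name-to-score dictionary; `rank_features`, `log_feature_importance`, and the artifact file name are hypothetical. `mlflow.log_dict` is the MLflow call that writes a dictionary as a JSON artifact on the active run.

```python
def rank_features(importances: dict) -> dict:
    """Sort features from most to least important."""
    return dict(sorted(importances.items(), key=lambda kv: kv[1], reverse=True))


def log_feature_importance(importances: dict) -> None:
    """Store ranked feature importances with the active MLflow run."""
    import mlflow  # imported lazily so rank_features() works without MLflow

    # The artifact file name is an assumption chosen for this example.
    mlflow.log_dict(rank_features(importances), "feature_importance.json")
```

With the artifact attached to each run, answering "which features mattered in this version versus the last" means opening two runs side by side.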
Sharing ML features Becomes Easy
Colleague: I really like the feature you
used in your last model. Can I use that as
well?
What we used to do…
▪ “Sure! Just copy-paste this part of the
notebook…oh but I also have a slightly
different version in this other part of
the notebook…. I THINK this is the one I
used….”
What we now do….
▪ “Sure! I added that feature to the
shared ML repo. Feel free to use it by
importing the module and if you
modify the feature, just contribute to
the repo so that I can use it next time
as well!”
▪ What’s even cooler… You can log the
exact version of the repo you used
in MLflow, so that even if the repo
evolves after your model
development, you can still trace back
to the exact version you used for your
own model.
Internal Shared ML repo
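Logging the exact repo version with a run can be as simple as tagging the run with the current commit hash. The following is a sketch under assumptions: the tag keys, the helper names, and the repo name passed in are all hypothetical, and the shared repo is assumed to be a local Git checkout. `mlflow.set_tags` is the MLflow call used.

```python
import subprocess


def current_commit(repo_dir: str = ".") -> str:
    """Return the commit hash currently checked out in `repo_dir`."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def repo_tags(commit: str, repo_name: str) -> dict:
    """Build the tags that tie a run to one version of the shared repo."""
    # Tag keys are illustrative, not an MLflow convention.
    return {"shared_ml_repo.name": repo_name, "shared_ml_repo.commit": commit}


def tag_run_with_repo_version(repo_dir: str, repo_name: str) -> None:
    """Attach the shared repo's commit hash to the active MLflow run."""
    import mlflow  # imported lazily so the helpers above work without MLflow

    mlflow.set_tags(repo_tags(current_commit(repo_dir), repo_name))
```

Checking out the tagged commit later restores the feature code exactly as it was when the model was trained.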
What We Learned
• Reproducing model results relies not just on version control of
code and notebooks but also on the training data, environments, and
dependencies.
• MLflow and Delta Lake allow us to track everything needed
to reproduce the model results.
• GitHub allows us to:
• establish best practices for accessing our data warehouses
• standardize our ML models
• encourage collaboration and review among different data scientists.
Let’s talk about deployment….
What we used to do…
• Manually export Databricks notebooks
and dependent libraries.
• Manually set up clusters in PRD
instance to match cluster settings in
DEV.
• Difficult to troubleshoot differences
between the PRD and DEV shard
environments, because data scientists
don’t have the access required to
pre-deploy in the PRD environment.
What we now can do….
Centralized Model Registry
• Regression testing in production
environment
• Allows model version management in
a centralized workspace
• Manage production models from
different DEV environments
• Streamlined deployment with logged
dependencies and environment setup.
PRD
Development Model Registry
DEV
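Promoting a model from a development workspace into a centralized registry might be sketched as below. The helper names and the `registry_uri` argument are assumptions; the URI format depends on how the central workspace is set up and is deliberately left as a parameter. `mlflow.set_registry_uri` and `mlflow.register_model` are the MLflow calls used.

```python
def run_model_uri(run_id: str, artifact_path: str = "model") -> str:
    """Build the URI of a model artifact logged under a given run."""
    return f"runs:/{run_id}/{artifact_path}"


def register_in_central_registry(run_id: str, model_name: str, registry_uri: str) -> str:
    """Register a logged model in the central registry and return its version."""
    import mlflow  # imported lazily so run_model_uri() works without MLflow

    mlflow.set_registry_uri(registry_uri)  # point at the centralized registry
    model_version = mlflow.register_model(run_model_uri(run_id), model_name)
    return model_version.version
```

Because the dependencies are logged together with the model, the production side can rebuild the environment from the registered version instead of matching cluster settings by hand.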
What we now can do….
PRD
Notebook
• Execute model pipelines
• Deliver results through
various channels
• Monitor regular model
retraining/retuning and
scoring processes
• Model feedback logging
What we can also do….
Deploying and Managing Models Across Different Platforms through a
Centralized Model Registry
[Diagram: two Development Model Registries (DEV) alongside one Centralized Model Registry (PRD) and the PRD notebook, each with the responsibilities listed on the previous slides.]
Regression Testing Becomes Easy
Boss: How does your new model
performance compare to the old model in
production?
What we used to do…
▪ “Let me look through the previous
colleague’s notebook to find out
what the performance was…”
▪ After digging through the notebook,
you can’t find performance metrics
logged anywhere…
What we now do…
▪ “From the record in the model
registry, it looks like I have
improved the precision by X%.”
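A regression check against a registered model version could be sketched as follows, assuming each registered version links back to a run where the metric was logged. The helper names are hypothetical; `MlflowClient.get_model_version` and `MlflowClient.get_run` are the MLflow calls used.

```python
def relative_change(new: float, old: float) -> float:
    """Relative improvement of `new` over `old` (0.05 means +5%)."""
    return (new - old) / old


def metric_of_version(model_name: str, version: str, metric: str) -> float:
    """Look up a logged metric for one registered model version."""
    from mlflow.tracking import MlflowClient  # lazy import, MLflow is optional here

    client = MlflowClient()
    run_id = client.get_model_version(model_name, version).run_id
    return client.get_run(run_id).data.metrics[metric]
```

For example, `relative_change(metric_of_version(name, new, "precision"), metric_of_version(name, old, "precision"))` yields the "improved by X%" figure directly from the registry record, where `name`, `new`, and `old` are placeholders for a real model name and version numbers.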
Troubleshooting Transient Data Discrepancies Becomes Easy
Data Engineer: The daily run yesterday
yielded only <1000 rows of predictions. Do you
know what happened?
What we used to do…
▪ “Uh… the input table has already been overwritten by today’s
run. I can rerun the model and see if the prediction
comes back to normal now?”
What we now do…
▪ “Let me pull out that
version of the input table,
since it’s saved as a Delta
table. Looks like
there were a lot fewer
rows in the input table
due to a delay in the data
refresh job.”
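Pulling out an earlier version of an input table relies on Delta Lake time travel. The sketch below assumes an existing Spark session with Delta Lake available; the function names, table path, and timestamp are placeholders. The `timestampAsOf` read option is part of Delta Lake.

```python
def read_snapshot(spark, table_path: str, timestamp: str):
    """Read a Delta table as it existed at `timestamp` (time travel)."""
    return (
        spark.read.format("delta")
        .option("timestampAsOf", timestamp)  # e.g. the time of yesterday's run
        .load(table_path)
    )


def row_drop_ratio(snapshot_rows: int, typical_rows: int) -> float:
    """Fraction of rows missing from a snapshot relative to a typical day."""
    return 1 - snapshot_rows / typical_rows
```

Counting the rows of the snapshot and comparing them to a normal day shows quickly whether a delayed refresh job, rather than the model, explains the short prediction output.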
What We Learned
• Data scientists like the freedom to try out new platforms and tools.
• Allowing that freedom can be a nightmare for
deployment in a production environment.
• The MLflow tracking server and Model Registry allow logging a wide range
of “flavors” of ML models, from Spark ML and scikit-learn to SageMaker.
This allows management and comparison across different platforms in
the same centralized workspace.
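Managing models of different flavors in one place works partly because models logged with MLflow's built-in flavors generally also carry a generic `python_function` flavor. A minimal sketch of loading a registered model without caring how it was trained is shown below; the helper names and the model name in the usage note are hypothetical, while `mlflow.pyfunc.load_model` and the `models:/` URI scheme are part of MLflow.

```python
def registry_model_uri(name: str, version_or_stage: str) -> str:
    """Build a Model Registry URI for a model version or stage."""
    return f"models:/{name}/{version_or_stage}"


def load_any_flavor(name: str, version_or_stage: str):
    """Load a registered model through the generic pyfunc interface."""
    import mlflow.pyfunc  # lazy import so registry_model_uri() works without MLflow

    return mlflow.pyfunc.load_model(registry_model_uri(name, version_or_stage))
```

A scoring notebook could call `load_any_flavor("recommender", "Production").predict(df)` in the same way for a Spark ML model and a scikit-learn model, which is what makes side-by-side comparison practical.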
Thank you!

FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdf
 
Non Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfNon Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdf
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use
 

Enabling Scalable Data Science Pipeline with Mlflow at Thermo Fisher Scientific

  • 2. Enabling Scalable Data Science Pipelines with MLflow and Model Registry at Thermo Fisher Scientific
      Allison Wu, Data Scientist, Data Science Center of Excellence, Thermo Fisher Scientific
  • 3. Key Summary
      ▪ We standardized development of machine learning models by integrating MLflow tracking into the development pipeline.
      ▪ We improved reproducibility of machine learning models by integrating GitHub and Delta Lake into the development and deployment pipelines.
      ▪ We streamlined deployment of machine learning models on different platforms through MLflow and a centralized Model Registry.
  • 4. What do data scientists at our Data Science Center of Excellence do?
      ▪ Generate novel algorithms that can be applied across different divisions
      ▪ Work with cross-divisional teams on model migration and standardization
      ▪ Enable data science in different divisions and functions
      ▪ Establish data science best practices
      [Diagram: Data Science at Thermo Fisher, with Operations, Human Resources, Commercial & Marketing, and R&D]
  • 5. Commercial & Marketing Data Science Life Cycle
      Actionable insights (data) from customer interactions, which create a competitive advantage and drive growth and profitability.
      [Diagram]
      ▪ Data: install base, cloud, transactional, external data, web behavioral, customer interaction, call center
      ▪ Model Development & Deployment: machine learning models, rule-based legacy models
      ▪ Delivery: automatic email campaigns, website marketing strategies, prescriptive recommendations for sales reps
      ▪ Customer: engagement, leads, revenue ($)
      ▪ Arrows: relevant offer; feedback for machine learning
  • 6. Model Development and Deployment Cycle
      ▪ Development: exploratory analysis; model development (feature engineering, feature selection, model optimization)
      ▪ Deployment: deployment to production environment; audit; scoring
      ▪ Delivery: web recommendation; email campaign; commercial dashboard
      ▪ Management: monitoring; feedback
      [Diagram loops: production model retraining and retuning; new model development. Environment labels: DEV, PRD, DEV ← PRD]
  • 7. An Example Model Development / Deployment Cycle
      A model that makes product recommendations based on customer behaviors, such as web activities and sales transactions.
      • 6-8 weeks of EDA and prototyping.
      • Score daily.
      • Retrain/retune on new production data every 2 weeks.
      • Deliver through email campaigns or commercial sales rep channels.
      • Monitor model performance metrics.
  • 8. What we used to do…
      • All work was done in Databricks notebooks
      • No version control on either data or models
      • No unit testing
      • No regression testing against different versions of models
      • Hard to share modularized functions across projects (lots of copy-pasting)
  • 9. What we now do…
      Databricks notebook
        • Exploratory analysis
        • Feature engineering
      Notebook & MLflow
        • ML model experiments
        • Hyperparameter tracking
        • Feature selection
        • Model comparison
      Development Model Registry
        • Streamlined regression testing against previous model versions
        • Documented model review process
        • Clean version management for better collaboration within the same DEV environment
      ML model library
        • Python modules for sharable and testable ML functions, such as feature functions, utility functions, and ML tuning functions
        • Version controlled on GitHub
        • Integrated with Databricks Projects to version control Databricks notebooks
        • Documented code review process
      Version-controlled data source with Delta Lake
      [Diagram environment label: DEV]
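The "Notebook & MLflow" step above can be sketched roughly as follows. This is a minimal illustration, not the team's actual code: the function names `log_candidate` and `best_candidate`, the candidate names, and the metric names are all hypothetical. Only `mlflow.start_run`, `mlflow.log_params`, and `mlflow.log_metrics` are real MLflow calls.

```python
from typing import Any, Dict, Tuple


def log_candidate(name: str, params: Dict[str, Any], metrics: Dict[str, float]) -> str:
    """Log one candidate model's hyperparameters and metrics as an MLflow run."""
    # Imported lazily so the pure helper below is usable without MLflow installed.
    import mlflow

    with mlflow.start_run(run_name=name) as run:
        mlflow.log_params(params)    # hyperparameter tracking
        mlflow.log_metrics(metrics)  # values used later for model comparison
        return run.info.run_id


def best_candidate(results: Dict[str, Dict[str, float]], metric: str) -> Tuple[str, float]:
    """Return (name, value) of the candidate with the highest value of `metric`.

    `results` maps a hypothetical candidate name to the metrics logged for it.
    """
    name = max(results, key=lambda n: results[n][metric])
    return name, results[name][metric]
```

In a notebook, each experiment would call `log_candidate(...)` once per model variant, and `best_candidate(...)` (or the MLflow UI's run comparison view) would then pick the variant to promote.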
  • 10. Tracking Feature Improvements Becomes Easy
      Boss: What are the important features in this version versus the previous version?
      What we used to do…
      ▪ “Let me find out how the features do in my… uh… model_version_10.dbc? Maybe?”
      ▪ “I wish I had a screenshot of the feature importance figure from before…”
  • 11. Tracking Feature Improvements Becomes Easy
      What we now do…
      ▪ “I got it. Let me pull it out of MLflow…”
  • 13. Sharing ML Features Becomes Easy
      Colleague: I really like the feature you used in your last model. Can I use that as well?
      What we used to do…
      ▪ “Sure! Just copy-paste this part of the notebook… oh, but I also have a slightly different version in this other part of the notebook… I THINK this is the one I used…”
  • 14. Sharing ML Features Becomes Easy
      What we now do…
      ▪ “Sure! I added that feature to the shared ML repo. Feel free to use it by importing the module, and if you modify the feature, just contribute to the repo so that I can use it next time as well!”
      ▪ What’s even cooler: you can log the exact version of the repo you used in MLflow, so that even if the repo evolves after your model development, you can still trace back to the exact version you used for your own model.
      [Diagram: internal shared ML repo]
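One way to record "the exact version of the repo" on a run is to store the repo's git commit as an MLflow tag. The sketch below assumes the shared repo is checked out locally and the git CLI is available. The tag keys `shared_repo.name` and `shared_repo.commit`, and all function names, are made up for illustration; `mlflow.set_tags` is the real MLflow call.

```python
import re
import subprocess
from typing import Dict


def current_commit(repo_dir: str) -> str:
    """Return the full commit hash checked out in `repo_dir` (requires the git CLI)."""
    out = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()


def repo_tags(commit: str, repo_name: str) -> Dict[str, str]:
    """Build tags recording which version of the shared repo a run used.

    Assumes a 40-character SHA-1 commit hash; the tag keys are hypothetical.
    """
    if not re.fullmatch(r"[0-9a-f]{40}", commit):
        raise ValueError(f"not a full git commit hash: {commit!r}")
    return {"shared_repo.name": repo_name, "shared_repo.commit": commit}


def tag_active_run(repo_dir: str, repo_name: str) -> None:
    """Attach the shared repo's commit to the currently active MLflow run."""
    import mlflow  # lazy import: the helpers above do not need MLflow

    mlflow.set_tags(repo_tags(current_commit(repo_dir), repo_name))
```

With the commit stored on the run, checking out that commit later reproduces the feature code exactly as it was when the model was trained.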
  • 15. What We Learned
      • Reproducing model results relies not only on version control of code and notebooks but also on the training data, environments, and dependencies.
      • MLflow and Delta Lake allow us to track everything needed to reproduce the model results.
      • GitHub allows us to:
        • establish best practices for accessing our data warehouses
        • standardize our ML models
        • encourage collaboration and review among different data scientists.
  • 16. Let’s talk about deployment…
  • 17. What we used to do…
      • Manually export Databricks notebooks and dependent libraries.
      • Manually set up clusters in the PRD instance to match cluster settings in DEV.
      • Struggle to troubleshoot differences between the PRD and DEV shard environments, as data scientists don’t have the access required to pre-deploy in the PRD environment.
  • 18. What we now can do…
      Centralized Model Registry
        • Regression testing in the production environment
        • Model version management in a centralized workspace
        • Management of production models from different DEV environments
        • Streamlined deployment with logged dependencies and environment setup
      [Diagram: Development Model Registry (DEV), Centralized Model Registry (PRD)]
  • 19. What we now can do…
      PRD Notebook
        • Executes model pipelines
        • Delivers results through various channels
        • Monitors regular model retraining/retuning and scoring processes
        • Logs model feedback
      Centralized Model Registry
        • Regression testing in the production environment
        • Model version management in a centralized workspace
        • Management of production models from different DEV environments
      [Diagram environment label: PRD]
  • 20. What we can also do…
      Deploying and managing models across different platforms through a Centralized Model Registry
      Centralized Model Registry
        • Regression testing in the production environment
        • Model version management in a centralized workspace
        • Management of production models from different DEV environments
      PRD Notebook
        • Executes model pipelines
        • Delivers results through various channels
        • Monitors regular model retraining/retuning and scoring processes
        • Logs model feedback
      [Diagram: two Development Model Registries (DEV), Centralized Model Registry (PRD), PRD Notebook]
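Registering a model from a DEV workspace into a central registry might look like the sketch below. It assumes the central registry lives in a separate Databricks workspace whose credentials are stored under a secret scope and key prefix (the `databricks://<scope>:<prefix>` URI form). The slides do not show the actual setup, so the scope, prefix, model name, and function names here are all hypothetical.

```python
def central_registry_uri(scope: str, prefix: str) -> str:
    """Databricks-style URI for a Model Registry hosted in another workspace.

    `scope` and `prefix` name the secret scope and key prefix holding that
    workspace's credentials; both are assumptions for this sketch.
    """
    return f"databricks://{scope}:{prefix}"


def run_model_uri(run_id: str, artifact_path: str) -> str:
    """URI of a model artifact logged under an MLflow run."""
    return f"runs:/{run_id}/{artifact_path}"


def register_centrally(run_id: str, artifact_path: str, model_name: str,
                       scope: str, prefix: str):
    """Register a model logged in this (DEV) workspace in the central registry."""
    import mlflow  # lazy import: the URI helpers above do not need MLflow

    mlflow.set_registry_uri(central_registry_uri(scope, prefix))
    # Creates a new version under `model_name` and returns its ModelVersion.
    return mlflow.register_model(run_model_uri(run_id, artifact_path), model_name)
```

Because every DEV environment points at the same registry URI, production model versions from different teams end up in one place, which is what makes centralized version management and regression testing possible.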
  • 22. Regression Testing Becomes Easy
      Boss: How does your new model’s performance compare to the old model in production?
      What we used to do…
      ▪ “Let me look through the previous colleague’s notebook to find out what the performance was…”
      ▪ After digging through the notebook, you can’t find performance metrics logged anywhere…
      What we now do…
      ▪ “From the record in the model registry, it looks like I have improved the precision by X%.”
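A regression check against the production model can be reduced to comparing two metric dictionaries. The sketch below is illustrative only: it assumes metrics were logged on the run behind each registered version, that the same MLflow client can reach both the registry and that run, and that the registry uses the classic "Production" stage (newer MLflow releases favor aliases). All function names are hypothetical.

```python
from typing import Dict


def production_metrics(model_name: str) -> Dict[str, float]:
    """Fetch the metrics logged by the run behind the newest Production version."""
    from mlflow.tracking import MlflowClient  # lazy import

    client = MlflowClient()
    versions = client.search_model_versions(f"name='{model_name}'")
    prod = [v for v in versions if v.current_stage == "Production"]
    if not prod:
        raise LookupError(f"no Production version registered for {model_name!r}")
    newest = max(prod, key=lambda v: int(v.version))
    return dict(client.get_run(newest.run_id).data.metrics)


def metric_changes(production: Dict[str, float], candidate: Dict[str, float]) -> Dict[str, float]:
    """Per-metric change (candidate minus production) for metrics both runs logged."""
    return {k: candidate[k] - production[k] for k in production if k in candidate}


def passes_regression(production: Dict[str, float], candidate: Dict[str, float],
                      metric: str, min_gain: float = 0.0) -> bool:
    """True if the candidate improves `metric` by at least `min_gain`."""
    return metric_changes(production, candidate)[metric] >= min_gain
```

A candidate would be promoted only if `passes_regression(production_metrics(name), new_metrics, "precision")` holds, which gives a recorded answer to the boss's question.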
  • 24. Troubleshooting Transient Data Discrepancies Becomes Easy
      Data Engineer: Yesterday’s daily run yielded fewer than 1,000 rows of predictions. Do you know what happened?
      What we used to do…
      ▪ “Uh… the input table has already been overwritten by today’s run. I can rerun the model and see if the prediction comes back to normal now?”
  • 25. Troubleshooting Transient Data Discrepancies Becomes Easy
      What we now do…
      ▪ “Let me pull out that version of the input table, since it’s saved as a Delta table. Looks like there were far fewer rows in the input table because the data refresh job was delayed?”
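Pulling out "that version of the input table" corresponds to Delta Lake time travel. The sketch below assumes a Spark session with Delta Lake configured and a path-based Delta table. The function names and the 50% drop threshold are illustrative choices, not something stated in the talk.

```python
from typing import Dict, List


def rows_at_version(spark, table_path: str, version: int) -> int:
    """Count rows in a Delta table as of a given table version (time travel)."""
    return (
        spark.read.format("delta")
        .option("versionAsOf", version)  # read the historical snapshot
        .load(table_path)
        .count()
    )


def suspicious_versions(counts: Dict[int, int], min_ratio: float = 0.5) -> List[int]:
    """Versions whose row count fell below `min_ratio` times the previous version's.

    `counts` maps table version -> row count; the threshold is a made-up default.
    """
    ordered = sorted(counts)
    return [
        cur
        for prev, cur in zip(ordered, ordered[1:])
        if counts[cur] < min_ratio * counts[prev]
    ]
```

Counting rows for the last few versions and passing them to `suspicious_versions` points at the run whose input shrank, even after the table has been overwritten by a later run.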
  • 26. What We Learned
      • Data scientists like the freedom of trying out new platforms and tools.
      • Allowing that freedom of platforms and tools can be a nightmare for deployment in a production environment.
      • The MLflow tracking server and model registry allow logging a wide range of ML model “flavors”, from Spark ML and scikit-learn to SageMaker. This allows management and comparison across different platforms in the same centralized workspace.
  • 28. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.