Thermo Fisher Scientific has one of the most extensive product portfolios in the industry, ranging from reagents to capital instruments, serving customers in biotechnology, pharmaceuticals, academia, and more.
Enabling Scalable Data Science Pipelines with MLflow at Thermo Fisher Scientific
2. Enabling Scalable Data Science Pipelines with MLflow and Model Registry at Thermo Fisher Scientific
Allison Wu
Data Scientist, Data Science Center of Excellence
Thermo Fisher Scientific
3. Key Summary
▪ We standardized development of machine learning models by integrating MLflow tracking into the development pipeline.
▪ We improved reproducibility of machine learning models by integrating GitHub and Delta Lake into our development and deployment pipelines.
▪ We streamlined deployment of machine learning models across different platforms through MLflow and the Centralized Model Registry.
4. What do data scientists at our Data Science Center of Excellence do?
▪ Generate novel algorithms that can be applied across different divisions
▪ Work with cross-divisional teams for model migration and standardization
▪ Enable data science in different divisions and functions
▪ Establish data science best practices
[Diagram: data science at Thermo Fisher spans Operations, Human Resources, Commercial & Marketing, and R&D]
5. Commercial & Marketing Data Science Life Cycle
Actionable insights from customer interaction data create a competitive advantage and drive growth and profitability.
[Diagram: a cycle from Data Delivery through Model Development & Deployment to a Relevant Offer for the Customer, with customer feedback flowing back into the machine learning models]
▪ Data Delivery: install base, cloud, transactional, external data, web behavioral, customer interaction, call center
▪ Model Development & Deployment: machine learning models alongside rule-based legacy models
▪ Relevant Offer: automatic email campaigns, website marketing strategies, prescriptive recommendations for sales reps
▪ Customer: engagement, leads, revenue ($)
6. Model Development and Deployment Cycle
Development (DEV)
▪ Exploratory analysis
▪ Model development: feature engineering, feature selection, model optimization
Deployment (DEV → PRD)
▪ Deployment to production environment
▪ Audit
▪ Scoring
Delivery (PRD)
▪ Web recommendation
▪ Email campaign
▪ Commercial dashboard
Management (PRD → DEV)
▪ Monitoring
▪ Feedback
The cycle repeats through production model retraining and retuning, and through new model development.
7. An Example Model Development / Deployment Cycle
A model that makes product recommendations based on customer behaviors, such as web activities, sales transactions, etc.
• 6–8 weeks of EDA and prototyping
• Score daily
• Retrain/retune on new data in production every 2 weeks
• Deliver through email campaigns or commercial sales-rep channels
• Monitor model performance metrics
8. What we used to do…
• All work is in Databricks notebooks
• No version control on either data or models
• No unit testing
• No regression testing against different versions of models
• Hard to share modularized functions across projects (lots of copy-pasting)
9. What we now do…
Databricks notebook (DEV)
• Exploratory analysis
• Feature engineering
Notebook & MLflow (DEV)
• ML model experiments
• Hyperparameter tracking
• Feature selection
• Model comparison
Development Model Registry (DEV)
• Streamlined regression testing against previous model versions
• Documented model review process
• Clean version management for better collaboration within the same DEV environment
Shared code and data
• ML model library: Python modules for sharable and testable ML functions, such as feature functions, utility functions, and ML tuning functions
• Version controlled on GitHub
• Integrated with Databricks Projects to version-control Databricks notebooks
• Documented code review process
• Version-controlled data sources with Delta Lake
10. Tracking Feature Improvements Becomes Easy
Boss: What are the important features in this version versus the previous version?
What we used to do…
▪ “Let me find out how the features do in my…uh…model_version_10.dbc? Maybe?”
▪ “I wish I had a screenshot of the feature importance figure before….”
13. Sharing ML Features Becomes Easy
Colleague: I really like the feature you used in your last model. Can I use that as well?
What we used to do…
▪ “Sure! Just copy-paste this part of the notebook…oh, but I also have a slightly different version in this other part of the notebook…. I THINK this is the one I used….”
14. Sharing ML Features Becomes Easy
What we now do…
▪ “Sure! I added that feature to the shared ML repo. Feel free to use it by importing the module, and if you modify the feature, just contribute it back to the repo so that I can use it next time as well!”
▪ What’s even cooler: you can log the exact version of the repo you used in MLflow, so that even if the repo evolves after your model development, you can still trace back to the exact version you used for your own model.
Internal Shared ML repo
15. What We Learned
• Reproducing model results relies not just on version control of code and notebooks, but also on the training data, environments, and dependencies.
• MLflow and Delta Lake allow tracking everything needed to reproduce model results.
• GitHub allows us to:
  • establish best practices for accessing our data warehouses
  • standardize our ML models
  • encourage collaboration and review among different data scientists.
17. What we used to do…
• Manually export Databricks notebooks and dependent libraries.
• Manually set up clusters in the PRD instance to match cluster settings in DEV.
• Difficult to troubleshoot differences between PRD and DEV shard environments, as data scientists don’t have the access required to pre-deploy in the PRD environment.
18. What we now can do…
[Diagram: the Development Model Registry (DEV) promotes models to the Centralized Model Registry (PRD)]
Centralized Model Registry
• Regression testing in the production environment
• Model version management in a centralized workspace
• Manage production models from different DEV environments
• Streamlined deployment with logged dependencies and environment set-up
19. What we now can do…
Centralized Model Registry (PRD)
• Regression testing in the production environment
• Model version management in a centralized workspace
• Manage production models from different DEV environments
PRD Notebook
• Execute model pipelines
• Deliver results through various channels
• Monitor regular model retraining/retuning and scoring processes
• Model feedback logging
20. What we can also do…
Deploying and Managing Models Across Different Platforms through a Centralized Model Registry
[Diagram: Development Model Registries in multiple DEV workspaces feed one Centralized Model Registry, which serves multiple PRD workspaces]
Centralized Model Registry (PRD)
• Regression testing in the production environment
• Model version management in a centralized workspace
• Manage production models from different DEV environments
PRD Notebook
• Execute model pipelines
• Deliver results through various channels
• Monitor regular model retraining/retuning and scoring processes
• Model feedback logging
22. Regression Testing Becomes Easy
Boss: How does your new model’s performance compare to the old model in production?
What we used to do…
▪ “Let me look through the previous colleague’s notebook to find out what the performance was….”
▪ After digging through the notebook, you can’t find performance metrics logged anywhere….
What we now do…
▪ “From the record in the model registry, it looks like I have improved the precision by X%.”
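Once metrics live in the registry, the comparison reduces to a mechanical gate. A minimal sketch, with made-up metric names and values, of the kind of check that can run before a candidate is promoted:

```python
def passes_regression_gate(prod_metrics, candidate_metrics, tolerance=0.0):
    """Candidate must match or beat production on every tracked metric.

    Both arguments are metric-name -> value dicts, as they would be
    read from the model registry. A metric missing from the candidate
    counts as a failure.
    """
    return all(
        candidate_metrics.get(name, float("-inf")) >= value - tolerance
        for name, value in prod_metrics.items()
    )

# Illustrative values, not real model results:
prod = {"precision_at_10": 0.40, "recall_at_10": 0.25}
candidate = {"precision_at_10": 0.42, "recall_at_10": 0.26}
print(passes_regression_gate(prod, candidate))  # True
```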
24. Troubleshooting Transient Data Discrepancies Becomes Easy
Data Engineer: The daily run yesterday yielded only <1,000 rows of predictions. Do you know what happened?
What we used to do…
▪ “Uh…the input table is already overwritten by today’s run. I can rerun the model and see if the prediction comes back to normal now?”
25. Troubleshooting Transient Data Discrepancies Becomes Easy
What we now do…
▪ “Let me pull out that version of the input table, since it’s saved as Delta tables. Looks like there were a lot fewer rows in the input table due to a delay in the data refresh job.”
26. What We Learned
• Data scientists like the freedom of trying out new platforms and tools.
• Allowing that freedom can be a nightmare for deployment in a production environment.
• The MLflow tracking server and model registry can log a wide range of “flavors” of ML models, from Spark ML and scikit-learn to SageMaker. This allows management and comparison across different platforms in the same centralized workspace.