© 2019 KNIME AG. All Rights Reserved.
Moving from Artisanal to Industrial
Machine Learning
Greg Landrum
(greg.landrum@knime.com)
This talk
• Motivation
• Creating a reproducible/industrial artisan
• An artisanal side trip into working with imbalanced
data
Context
Artisanal Industrial
Images: https://flic.kr/p/RJ5xEs (License: CC BY 2.0); https://flic.kr/p/a3LLdm (License: CC BY 2.0)
Context
Artisanal
• Creative/Exploratory
• Flexible
Industrial
• Automated
• Reproducible
• Repeatable
• Quality control
Motivation: utility
• Thinking about the models that are useful in the
design-make-test cycle of a med-chem project
• Perhaps something project-specific for the main
target + important anti-targets.
• Likely a host of additional global models that could
be used (solubility, pKa, hERG, CYPs, synthetic
accessibility, etc.)
Aspirations
• Can we figure out how to help the artisan be more
reproducible/repeatable?
• Can we provide an “industrial” framework the
artisan can work within?
• Can this somehow be practical?
A process for data mining
Cross-industry standard process for data mining
• An EU-funded project from the late ’90s run by
Integral Solutions (bought by SPSS, bought by IBM),
Teradata, Daimler-Benz, NCR, and OHRA.
I can guess what you’re thinking…
Shockingly, this actually produced
something useful
The CRISP-DM Process
CRISP-DM (CRoss Industry
Standard Process for Data
Mining) is a standard
process for data mining
solutions.
Image from: https://upload.wikimedia.org/wikipedia/commons/b/b9/CRISP-DM_Process_Diagram.png
Establishing context
• Business understanding
– What problem are we trying to solve?
– What would a solution look like?
• Data understanding
– What data do we have available?
– Is it any good?
– What might be useful for this problem?
Domain expertise required here
The problem
• Build predictive models for bioactivity based
on the data in screening assays
The datasets we’ll be working with
• qHTS data from eight PubChem assays
captured in ChEMBL
• The assays have very different numbers of
actives in them, so to get started we’ll use
two at different ends of the spectrum
The datasets we’ll be working with
• Assay CHEMBL1614166 (PubChem BioAssay.
qHTS Assay for Inhibitors of MBNL1-poly(CUG)
RNA binding. (Class of assay: confirmatory))
– https://www.ebi.ac.uk/chembl/assay_report_card/CHEMBL1614166/
– https://pubchem.ncbi.nlm.nih.gov/bioassay/2675
• 34018 inactives, 98 actives (using the
annotations from PubChem)
Nature of the actives (CHEMBL1614166)
The datasets we’ll be working with
• Assay CHEMBL1614421 (PUBCHEM_BIOASSAY: qHTS
for Inhibitors of Tau Fibril Formation, Thioflavin T
Binding. (Class of assay: confirmatory))
– https://www.ebi.ac.uk/chembl/assay_report_card/CHEMBL1614421/
– https://pubchem.ncbi.nlm.nih.gov/bioassay/1460
• 43345 inactives, 5602 actives (using the annotations
from PubChem)
Model building
• Data Preparation
– Making it machine-useable
– Cleanup
– Feature engineering
• Modeling
– The cool ML/AI stuff
Data Preparation
• Structures are taken from ChEMBL
– Already some standardization done
– Processed with RDKit
• Fingerprints: RDKit Morgan-2, 2048 bits
Modeling
• Stratified 80-20 training/holdout split
• KNIME random forest classifier
– 500 trees
– Max depth 15
– Min node size 2
This is a first pass through the cycle; we will try
other fingerprints, learning algorithms, and
hyperparameters in future iterations.
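A minimal sketch of a stratified 80-20 split, in plain Python rather than the KNIME node used in the talk. The function name `stratified_split`, the seed, and the rounding rule are assumptions for illustration; the class counts are the CHEMBL1614166 numbers quoted earlier.

```python
import random
from collections import defaultdict


def stratified_split(labels, holdout_fraction=0.2, seed=42):
    """Return (train_idx, holdout_idx), keeping the class ratio in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train_idx, holdout_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Split each class separately so rare actives land in both parts.
        n_holdout = round(len(indices) * holdout_fraction)
        holdout_idx.extend(indices[:n_holdout])
        train_idx.extend(indices[n_holdout:])
    return sorted(train_idx), sorted(holdout_idx)


# Class counts from the CHEMBL1614166 slide: 34018 inactives, 98 actives.
labels = [0] * 34018 + [1] * 98
train_idx, holdout_idx = stratified_split(labels)
n_holdout_actives = sum(labels[i] for i in holdout_idx)
print(len(train_idx), len(holdout_idx), n_holdout_actives)
```

With so few actives, a non-stratified random split could leave the holdout set with almost none of them, which is why the split is done per class.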
Evaluation
• Does the model work?
• Does it actually solve the problem?
• Was the problem well posed?
• Is it implying data problems?
Evaluation
• AUROC, overall accuracy and Cohen’s kappa
on the holdout data
Many, many, many options here. I’m using global
metrics because in the end I want to use the
“active/inactive” predictions made by the model
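The three metrics can be sketched in plain Python as follows. The helper names are hypothetical, and the toy labels and scores are made up for illustration; they are not results from the talk.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement between truth and prediction, corrected for chance agreement."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    p_expected = 0.0
    for c in set(y_true) | set(y_pred):
        p_expected += (y_true.count(c) / n) * (y_pred.count(c) / n)
    if p_expected == 1.0:  # degenerate case: only one class anywhere
        return 0.0
    return (p_observed - p_expected) / (1.0 - p_expected)


def auroc(y_true, scores):
    """Probability that a random active outscores a random inactive (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up): 2 actives among 10 compounds, scores are P(active).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.40, 0.15, 0.30, 0.10, 0.05, 0.05, 0.02, 0.02, 0.01, 0.01]
y_pred = [int(s >= 0.5) for s in scores]  # default decision threshold

print(accuracy(y_true, y_pred), cohen_kappa(y_true, y_pred), auroc(y_true, scores))
```

Accuracy and kappa depend on the active/inactive calls, while AUROC depends only on how the scores rank the compounds.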
Using
• Deployment
– How do you actually use the model?
– How do you keep it up to date?
– How do you get people to accept the
results?
Deployment: technical
• Easy since I’m using KNIME
• Deploy as a web service
– Easy to validate/test
• Automated rebuild/re-evaluate when new data
are available
Deployment: practical
• Providing “active/inactive” classifications and
predicted probabilities likely not enough
• Similar compounds from training set?
• Applicability domain?
• Conformal prediction?
• “Explanation” of the prediction (i.e. similarity
maps)?
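One possible way to show "similar compounds from training set" next to a prediction, sketched in plain Python. The talk does not say which similarity measure would be used; Tanimoto similarity on fingerprint on-bits is an assumption here, and the function names, compound names, and bit sets are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def nearest_training_compounds(query_bits, training_fps, k=3):
    """Return the k most similar training compounds as (name, similarity) pairs."""
    scored = [(tanimoto(query_bits, bits), name) for name, bits in training_fps.items()]
    # Highest similarity first; break ties by name so the order is stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, sim) for sim, name in scored[:k]]


# Hypothetical fingerprints (on-bit indices) for three training compounds.
training_fps = {
    "train_1": {1, 5, 9, 12},
    "train_2": {1, 5, 20},
    "train_3": {30, 31},
}
query = {1, 5, 9}
neighbours = nearest_training_compounds(query, training_fps, k=2)
print(neighbours)
```

A low similarity to every training compound is also a crude hint that the query may sit outside the model's applicability domain.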
Results
Evaluation CHEMBL1614166: holdout data
Evaluation CHEMBL1614166: test data
AUROC=0.72
Results CHEMBL1614421: holdout data
Evaluation CHEMBL1614421: holdout data
AUROC=0.75
Taking stock
• Both models have:
– Good overall accuracies (because of imbalance)
– Decent AUROC values
– Terrible Cohen kappas
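To see why accuracy alone says little here, consider a hypothetical classifier that calls every compound inactive, applied to the full CHEMBL1614166 counts (34018 inactives, 98 actives). This is an illustration, not a result from the talk; p_o is the observed agreement and p_e the agreement expected by chance.

```latex
\[ p_o = \frac{34018}{34018 + 98} = \frac{34018}{34116} \approx 0.9971 \]
\[ p_e = \frac{34018}{34116} \cdot 1 + \frac{98}{34116} \cdot 0 \approx 0.9971 \]
\[ \kappa = \frac{p_o - p_e}{1 - p_e} = 0 \]
```

The do-nothing classifier is about 99.7% accurate but has a kappa of exactly zero, so kappa is the metric that exposes the problem.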
Now what?
Let’s get artisanal…
Quick diversion on bag classifiers
When making predictions, each tree in the
classifier votes on the result.
Majority wins
The predicted class probabilities are often the
means of the predicted probabilities from the
individual trees
We construct the ROC curve by sorting the
predictions in decreasing order of predicted
probability of being active.
Note that the actual predictions are irrelevant for an ROC curve. As long
as true actives tend to have a higher predicted probability of being active
than true inactives, the AUC will be good.
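A small sketch of the points above, using made-up per-tree probabilities for a hypothetical 5-tree ensemble (the compound names and helper functions are illustrative only). It shows the vote, the mean probability, and that the ROC ordering survives any rescaling of the probabilities.

```python
# Hypothetical per-tree P(active) for three compounds from a 5-tree ensemble.
tree_probs = {
    "cmpd_A": [0.9, 0.6, 0.7, 0.4, 0.8],
    "cmpd_B": [0.4, 0.3, 0.6, 0.2, 0.1],
    "cmpd_C": [0.1, 0.0, 0.2, 0.1, 0.0],
}


def mean_probability(probs):
    """Ensemble probability: the mean of the per-tree probabilities."""
    return sum(probs) / len(probs)


def majority_vote(probs):
    """Each tree votes 'active' if its own probability is >= 0.5; majority wins."""
    votes_active = sum(p >= 0.5 for p in probs)
    return int(votes_active > len(probs) / 2)


def roc_ranking(mean_probs):
    """Order compounds by decreasing predicted probability of being active."""
    return sorted(mean_probs, key=mean_probs.get, reverse=True)


mean_probs = {name: mean_probability(p) for name, p in tree_probs.items()}
ranking = roc_ranking(mean_probs)

# Dividing every probability by 10 would flip cmpd_A to "inactive" at a 0.5
# threshold, but the ranking, and therefore the ROC curve, stays the same.
squashed = {name: p / 10 for name, p in mean_probs.items()}
print(ranking, roc_ranking(squashed))
```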
Handling imbalanced data
• The standard decision rule for a random forest (or
any bag classifier) is that the majority wins¹, i.e. the
predicted probability of being active must be
>= 0.5 in order for the model to predict "active"
• Shift that threshold to a lower value for models built
on highly imbalanced datasets²
¹ This is only strictly true for binary classifiers
² Chen, J. J., et al. “Decision Threshold Adjustment in Class Prediction.” SAR and
QSAR in Environmental Research 17 (2006): 337–52.
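The effect of shifting the threshold can be sketched in a few lines. The probabilities below are made up, and the shifted value of 0.2 is an arbitrary example rather than a recommendation from the talk.

```python
def predict_active(prob_active, threshold=0.5):
    """Call a compound active if its predicted probability reaches the threshold."""
    return prob_active >= threshold


# Hypothetical ensemble-averaged P(active) values for five compounds.
probs = [0.62, 0.31, 0.22, 0.08, 0.03]

default_calls = [predict_active(p) for p in probs]
shifted_calls = [predict_active(p, threshold=0.2) for p in probs]

print(default_calls)  # only the first compound is called active
print(shifted_calls)  # the first three compounds are called active
```

The model and its probabilities are unchanged; only the rule that turns a probability into a call moves.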
Picking a new decision threshold
• Generate a random forest for the dataset using the
training set
• Generate out-of-bag predicted probabilities using
the training set
• Try a number of different decision thresholds¹ and
pick the one that gives the best kappa
• Once we have the decision threshold, use it to
generate predictions for the test set.
¹ Here we use: [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]
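The threshold scan can be sketched as below. This is not the KNIME workflow from the talk: the out-of-bag probabilities would come from the trained random forest, whereas here `y_train` and `oob_probs` are small made-up lists, and the function names are hypothetical. Only the threshold grid is taken from the slide.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for two label lists."""
    n = len(y_true)
    p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    p_expected = sum(
        (y_true.count(c) / n) * (y_pred.count(c) / n)
        for c in set(y_true) | set(y_pred)
    )
    if p_expected == 1.0:
        return 0.0
    return (p_observed - p_expected) / (1.0 - p_expected)


def pick_threshold(y_true, oob_probs, thresholds):
    """Return (threshold, kappa) for the threshold with the best kappa.

    On ties the lowest threshold is kept, an arbitrary choice for this sketch.
    """
    best_threshold, best_kappa = None, float("-inf")
    for t in thresholds:
        y_pred = [int(p >= t) for p in oob_probs]
        kappa = cohen_kappa(y_true, y_pred)
        if kappa > best_kappa:
            best_threshold, best_kappa = t, kappa
    return best_threshold, best_kappa


# Threshold grid from the slide.
THRESHOLDS = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]

# Made-up training labels and out-of-bag P(active) values (3 actives, 9 inactives).
y_train = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
oob_probs = [0.45, 0.30, 0.12, 0.22, 0.08, 0.06, 0.04, 0.03, 0.02, 0.02, 0.01, 0.01]

best_threshold, best_kappa = pick_threshold(y_train, oob_probs, THRESHOLDS)
print(best_threshold, round(best_kappa, 3))
```

The chosen threshold is then applied unchanged to the test-set probabilities, so the test set plays no part in selecting it.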
Results CHEMBL1614166
• Balanced confusion matrix
Previously 0.181
Results CHEMBL1614421
• Balanced confusion matrix
Previously 0.005
Does it work in general?
ChEMBL data, random-split validation
Does it work in general?
Proprietary data, time-split validation
Coming back to validation
• CHEMBL1614166:
– Overall accuracy: 99.8%
– Kappa: 0.53
– AUROC: 0.72
• CHEMBL1614421:
– Overall accuracy: 89.6%
– Kappa: 0.30
– AUROC: 0.75
Wrapping up
Image from: https://upload.wikimedia.org/wikipedia/commons/b/b9/CRISP-DM_Process_Diagram.png
Maybe useful…
• “Practical Machine Learning Canvas”
Data/Scripts
• KNIME workflow for adjusting the decision
threshold: https://kni.me/w/HRDmzyQy0UL0k7H2
• RDKit blog post about adjusting the decision
threshold (includes links to code):
http://rdkit.blogspot.com/2018/11/working-with-unbalanced-data-part-i.html
• Practical ML Canvas: https://bit.ly/2JLLsRC
Acknowledgements
• Dean Abbott (Abbott Analytics)
• KNIME:
– Daria Goldmann
– Rosaria Silipo
• NIBR:
– Nik Stiefl
– Nadine Schneider
– Niko Fechner
For more amazing car pictures: do an image search for “rat rod”

More Related Content

What's hot

Optalysys Optical Processing for HPC
Optalysys Optical Processing for HPCOptalysys Optical Processing for HPC
Optalysys Optical Processing for HPCinside-BigData.com
 
Introduction aux algorithmes génétiques
Introduction aux algorithmes génétiquesIntroduction aux algorithmes génétiques
Introduction aux algorithmes génétiquesJUG Lausanne
 
Modern ML & AI Operations to Advance Healthcare
Modern ML & AI Operations to Advance HealthcareModern ML & AI Operations to Advance Healthcare
Modern ML & AI Operations to Advance HealthcareHolden Ackerman
 
Scoring Metrics for Classification Models
Scoring Metrics for Classification ModelsScoring Metrics for Classification Models
Scoring Metrics for Classification ModelsKNIMESlides
 
This Helix Nebula Science Cloud Pilot Phase Open Session
This Helix Nebula Science Cloud Pilot Phase Open SessionThis Helix Nebula Science Cloud Pilot Phase Open Session
This Helix Nebula Science Cloud Pilot Phase Open SessionHelix Nebula The Science Cloud
 
Python tutorial for ML
Python tutorial for MLPython tutorial for ML
Python tutorial for MLBin Han
 
OpenACC Monthly Highlights: March 2021
OpenACC Monthly Highlights: March 2021OpenACC Monthly Highlights: March 2021
OpenACC Monthly Highlights: March 2021OpenACC
 
Association Rule Mining using RHadoop
Association Rule Mining using RHadoopAssociation Rule Mining using RHadoop
Association Rule Mining using RHadoopIRJET Journal
 
Container and Kubernetes without limits
Container and Kubernetes without limitsContainer and Kubernetes without limits
Container and Kubernetes without limitsAntje Barth
 
OpenACC Monthly Highlights: October2020
OpenACC Monthly Highlights: October2020OpenACC Monthly Highlights: October2020
OpenACC Monthly Highlights: October2020OpenACC
 
OpenACC Monthly Highlights: May 2019
OpenACC Monthly Highlights: May 2019OpenACC Monthly Highlights: May 2019
OpenACC Monthly Highlights: May 2019OpenACC
 
Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...
Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...
Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...Joachim Schlosser
 
LKCE18 Boris Kneisel - SMART Option Picking for DISCOVERY Kanban
LKCE18 Boris Kneisel - SMART Option Picking for DISCOVERY KanbanLKCE18 Boris Kneisel - SMART Option Picking for DISCOVERY Kanban
LKCE18 Boris Kneisel - SMART Option Picking for DISCOVERY KanbanLean Kanban Central Europe
 
Introducing the TPCx-HS Benchmark for Big Data
Introducing the TPCx-HS Benchmark for Big DataIntroducing the TPCx-HS Benchmark for Big Data
Introducing the TPCx-HS Benchmark for Big Datainside-BigData.com
 
Master the RETE algorithm
Master the RETE algorithmMaster the RETE algorithm
Master the RETE algorithmMasahiko Umeno
 
FPGA-accelerated High-Performance Computing – Close to Breakthrough or Pipedr...
FPGA-accelerated High-Performance Computing – Close to Breakthrough or Pipedr...FPGA-accelerated High-Performance Computing – Close to Breakthrough or Pipedr...
FPGA-accelerated High-Performance Computing – Close to Breakthrough or Pipedr...Christian Plessl
 
Distributing big astronomical catalogues with Greenplum - Greenplum Summit 2019
Distributing big astronomical catalogues with Greenplum - Greenplum Summit 2019Distributing big astronomical catalogues with Greenplum - Greenplum Summit 2019
Distributing big astronomical catalogues with Greenplum - Greenplum Summit 2019VMware Tanzu
 
OpenACC Monthly Highlights: August 2020
OpenACC Monthly Highlights: August 2020OpenACC Monthly Highlights: August 2020
OpenACC Monthly Highlights: August 2020OpenACC
 

What's hot (18)

Optalysys Optical Processing for HPC
Optalysys Optical Processing for HPCOptalysys Optical Processing for HPC
Optalysys Optical Processing for HPC
 
Introduction aux algorithmes génétiques
Introduction aux algorithmes génétiquesIntroduction aux algorithmes génétiques
Introduction aux algorithmes génétiques
 
Modern ML & AI Operations to Advance Healthcare
Modern ML & AI Operations to Advance HealthcareModern ML & AI Operations to Advance Healthcare
Modern ML & AI Operations to Advance Healthcare
 
Scoring Metrics for Classification Models
Scoring Metrics for Classification ModelsScoring Metrics for Classification Models
Scoring Metrics for Classification Models
 
This Helix Nebula Science Cloud Pilot Phase Open Session
This Helix Nebula Science Cloud Pilot Phase Open SessionThis Helix Nebula Science Cloud Pilot Phase Open Session
This Helix Nebula Science Cloud Pilot Phase Open Session
 
Python tutorial for ML
Python tutorial for MLPython tutorial for ML
Python tutorial for ML
 
OpenACC Monthly Highlights: March 2021
OpenACC Monthly Highlights: March 2021OpenACC Monthly Highlights: March 2021
OpenACC Monthly Highlights: March 2021
 
Association Rule Mining using RHadoop
Association Rule Mining using RHadoopAssociation Rule Mining using RHadoop
Association Rule Mining using RHadoop
 
Container and Kubernetes without limits
Container and Kubernetes without limitsContainer and Kubernetes without limits
Container and Kubernetes without limits
 
OpenACC Monthly Highlights: October2020
OpenACC Monthly Highlights: October2020OpenACC Monthly Highlights: October2020
OpenACC Monthly Highlights: October2020
 
OpenACC Monthly Highlights: May 2019
OpenACC Monthly Highlights: May 2019OpenACC Monthly Highlights: May 2019
OpenACC Monthly Highlights: May 2019
 
Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...
Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...
Den Datenschatz heben und Zeit- und Energieeffizienz steigern: Mathematik und...
 
LKCE18 Boris Kneisel - SMART Option Picking for DISCOVERY Kanban
LKCE18 Boris Kneisel - SMART Option Picking for DISCOVERY KanbanLKCE18 Boris Kneisel - SMART Option Picking for DISCOVERY Kanban
LKCE18 Boris Kneisel - SMART Option Picking for DISCOVERY Kanban
 
Introducing the TPCx-HS Benchmark for Big Data
Introducing the TPCx-HS Benchmark for Big DataIntroducing the TPCx-HS Benchmark for Big Data
Introducing the TPCx-HS Benchmark for Big Data
 
Master the RETE algorithm
Master the RETE algorithmMaster the RETE algorithm
Master the RETE algorithm
 
FPGA-accelerated High-Performance Computing – Close to Breakthrough or Pipedr...
FPGA-accelerated High-Performance Computing – Close to Breakthrough or Pipedr...FPGA-accelerated High-Performance Computing – Close to Breakthrough or Pipedr...
FPGA-accelerated High-Performance Computing – Close to Breakthrough or Pipedr...
 
Distributing big astronomical catalogues with Greenplum - Greenplum Summit 2019
Distributing big astronomical catalogues with Greenplum - Greenplum Summit 2019Distributing big astronomical catalogues with Greenplum - Greenplum Summit 2019
Distributing big astronomical catalogues with Greenplum - Greenplum Summit 2019
 
OpenACC Monthly Highlights: August 2020
OpenACC Monthly Highlights: August 2020OpenACC Monthly Highlights: August 2020
OpenACC Monthly Highlights: August 2020
 

Similar to Moving from Artisanal to Industrial Machine Learning

How Do You Build and Validate 1500 Models and What Can You Learn from Them?
How Do You Build and Validate 1500 Models and What Can You Learn from Them? How Do You Build and Validate 1500 Models and What Can You Learn from Them?
How Do You Build and Validate 1500 Models and What Can You Learn from Them? Greg Landrum
 
Processing malaria HTS results using KNIME: a tutorial
Processing malaria HTS results using KNIME: a tutorialProcessing malaria HTS results using KNIME: a tutorial
Processing malaria HTS results using KNIME: a tutorialGreg Landrum
 
KNIME Data Science Learnathon: From Raw Data To Deployment - Paris - November...
KNIME Data Science Learnathon: From Raw Data To Deployment - Paris - November...KNIME Data Science Learnathon: From Raw Data To Deployment - Paris - November...
KNIME Data Science Learnathon: From Raw Data To Deployment - Paris - November...KNIMESlides
 
Interactive and reproducible data analysis with the open-source KNIME Analyti...
Interactive and reproducible data analysis with the open-source KNIME Analyti...Interactive and reproducible data analysis with the open-source KNIME Analyti...
Interactive and reproducible data analysis with the open-source KNIME Analyti...Greg Landrum
 
SGI HPC Systems Help Fuel Manufacturing Rebirth 2015
SGI HPC Systems Help Fuel Manufacturing Rebirth 2015SGI HPC Systems Help Fuel Manufacturing Rebirth 2015
SGI HPC Systems Help Fuel Manufacturing Rebirth 2015Josh Goergen
 
ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...
ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...
ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...Alok Singh
 
“A Practical Guide to Implementing ML on Embedded Devices,” a Presentation fr...
“A Practical Guide to Implementing ML on Embedded Devices,” a Presentation fr...“A Practical Guide to Implementing ML on Embedded Devices,” a Presentation fr...
“A Practical Guide to Implementing ML on Embedded Devices,” a Presentation fr...Edge AI and Vision Alliance
 
Introduction to Machine Learning on IBM Power Systems
Introduction to Machine Learning on IBM Power SystemsIntroduction to Machine Learning on IBM Power Systems
Introduction to Machine Learning on IBM Power SystemsDavid Spurway
 
KNIME Data Science Learnathon: From Raw Data To Deployment - Dublin - June 2019
KNIME Data Science Learnathon: From Raw Data To Deployment - Dublin - June 2019KNIME Data Science Learnathon: From Raw Data To Deployment - Dublin - June 2019
KNIME Data Science Learnathon: From Raw Data To Deployment - Dublin - June 2019KNIMESlides
 
IBM Cloud Côte d'Azur Meetup - 20190328 - Optimisation
IBM Cloud Côte d'Azur Meetup - 20190328 - OptimisationIBM Cloud Côte d'Azur Meetup - 20190328 - Optimisation
IBM Cloud Côte d'Azur Meetup - 20190328 - OptimisationIBM France Lab
 
Key strategies for discrete manufacturers j caie arc japan 2008
Key strategies for discrete manufacturers j caie arc japan 2008Key strategies for discrete manufacturers j caie arc japan 2008
Key strategies for discrete manufacturers j caie arc japan 2008ARC Advisory Group
 
Emerging Best Practises for Machine Learning Engineering- Lex Toumbourou (By ...
Emerging Best Practises for Machine Learning Engineering- Lex Toumbourou (By ...Emerging Best Practises for Machine Learning Engineering- Lex Toumbourou (By ...
Emerging Best Practises for Machine Learning Engineering- Lex Toumbourou (By ...Thoughtworks
 
Trender, inspirationer och visioner - Mikael Haglund #ibmbpsse18
Trender, inspirationer och visioner - Mikael Haglund #ibmbpsse18Trender, inspirationer och visioner - Mikael Haglund #ibmbpsse18
Trender, inspirationer och visioner - Mikael Haglund #ibmbpsse18IBM Sverige
 
IBM Power Systems Update 1Q17
IBM Power Systems Update 1Q17IBM Power Systems Update 1Q17
IBM Power Systems Update 1Q17David Spurway
 
Building a guided analytics forecasting platform with Knime
Building a guided analytics forecasting platform with KnimeBuilding a guided analytics forecasting platform with Knime
Building a guided analytics forecasting platform with KnimeKnoldus Inc.
 
Building Simulation, Its Role, Softwares & Their Limitations
Building Simulation, Its Role, Softwares & Their LimitationsBuilding Simulation, Its Role, Softwares & Their Limitations
Building Simulation, Its Role, Softwares & Their LimitationsPrasad Thanthratey
 
Scaling up deep learning by scaling down
Scaling up deep learning by scaling downScaling up deep learning by scaling down
Scaling up deep learning by scaling downNick Pentreath
 
Open Source Story and what’s new in KNIME Software
Open Source Story and what’s new in KNIME SoftwareOpen Source Story and what’s new in KNIME Software
Open Source Story and what’s new in KNIME SoftwareKNIMESlides
 

Similar to Moving from Artisanal to Industrial Machine Learning (20)

How Do You Build and Validate 1500 Models and What Can You Learn from Them?
How Do You Build and Validate 1500 Models and What Can You Learn from Them? How Do You Build and Validate 1500 Models and What Can You Learn from Them?
How Do You Build and Validate 1500 Models and What Can You Learn from Them?
 
Processing malaria HTS results using KNIME: a tutorial
Processing malaria HTS results using KNIME: a tutorialProcessing malaria HTS results using KNIME: a tutorial
Processing malaria HTS results using KNIME: a tutorial
 
KNIME Data Science Learnathon: From Raw Data To Deployment - Paris - November...
KNIME Data Science Learnathon: From Raw Data To Deployment - Paris - November...KNIME Data Science Learnathon: From Raw Data To Deployment - Paris - November...
KNIME Data Science Learnathon: From Raw Data To Deployment - Paris - November...
 
Interactive and reproducible data analysis with the open-source KNIME Analyti...
Interactive and reproducible data analysis with the open-source KNIME Analyti...Interactive and reproducible data analysis with the open-source KNIME Analyti...
Interactive and reproducible data analysis with the open-source KNIME Analyti...
 
SGI HPC Systems Help Fuel Manufacturing Rebirth 2015
SGI HPC Systems Help Fuel Manufacturing Rebirth 2015SGI HPC Systems Help Fuel Manufacturing Rebirth 2015
SGI HPC Systems Help Fuel Manufacturing Rebirth 2015
 
ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...
ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...
ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...
 
“A Practical Guide to Implementing ML on Embedded Devices,” a Presentation fr...
“A Practical Guide to Implementing ML on Embedded Devices,” a Presentation fr...“A Practical Guide to Implementing ML on Embedded Devices,” a Presentation fr...
“A Practical Guide to Implementing ML on Embedded Devices,” a Presentation fr...
 
Knime & bioinformatics
Knime & bioinformaticsKnime & bioinformatics
Knime & bioinformatics
 
OpenPOWER/POWER9 AI webinar
OpenPOWER/POWER9 AI webinar OpenPOWER/POWER9 AI webinar
OpenPOWER/POWER9 AI webinar
 
Introduction to Machine Learning on IBM Power Systems
Introduction to Machine Learning on IBM Power SystemsIntroduction to Machine Learning on IBM Power Systems
Introduction to Machine Learning on IBM Power Systems
 
KNIME Data Science Learnathon: From Raw Data To Deployment - Dublin - June 2019
KNIME Data Science Learnathon: From Raw Data To Deployment - Dublin - June 2019KNIME Data Science Learnathon: From Raw Data To Deployment - Dublin - June 2019
KNIME Data Science Learnathon: From Raw Data To Deployment - Dublin - June 2019
 
IBM Cloud Côte d'Azur Meetup - 20190328 - Optimisation
IBM Cloud Côte d'Azur Meetup - 20190328 - OptimisationIBM Cloud Côte d'Azur Meetup - 20190328 - Optimisation
IBM Cloud Côte d'Azur Meetup - 20190328 - Optimisation
 
Key strategies for discrete manufacturers j caie arc japan 2008
Key strategies for discrete manufacturers j caie arc japan 2008Key strategies for discrete manufacturers j caie arc japan 2008
Key strategies for discrete manufacturers j caie arc japan 2008
 
Emerging Best Practises for Machine Learning Engineering- Lex Toumbourou (By ...
Emerging Best Practises for Machine Learning Engineering- Lex Toumbourou (By ...Emerging Best Practises for Machine Learning Engineering- Lex Toumbourou (By ...
Emerging Best Practises for Machine Learning Engineering- Lex Toumbourou (By ...
 
Trender, inspirationer och visioner - Mikael Haglund #ibmbpsse18
Trender, inspirationer och visioner - Mikael Haglund #ibmbpsse18Trender, inspirationer och visioner - Mikael Haglund #ibmbpsse18
Trender, inspirationer och visioner - Mikael Haglund #ibmbpsse18
 
IBM Power Systems Update 1Q17
IBM Power Systems Update 1Q17IBM Power Systems Update 1Q17
IBM Power Systems Update 1Q17
 
Building a guided analytics forecasting platform with Knime
Building a guided analytics forecasting platform with KnimeBuilding a guided analytics forecasting platform with Knime
Building a guided analytics forecasting platform with Knime
 
Building Simulation, Its Role, Softwares & Their Limitations
Building Simulation, Its Role, Softwares & Their LimitationsBuilding Simulation, Its Role, Softwares & Their Limitations
Building Simulation, Its Role, Softwares & Their Limitations
 
Scaling up deep learning by scaling down
Scaling up deep learning by scaling downScaling up deep learning by scaling down
Scaling up deep learning by scaling down
 
Open Source Story and what’s new in KNIME Software
Open Source Story and what’s new in KNIME SoftwareOpen Source Story and what’s new in KNIME Software
Open Source Story and what’s new in KNIME Software
 

More from Greg Landrum

Chemical registration
Chemical registrationChemical registration
Chemical registrationGreg Landrum
 
Mike Lynch Award Lecture, ICCS 2022
Mike Lynch Award Lecture, ICCS 2022Mike Lynch Award Lecture, ICCS 2022
Mike Lynch Award Lecture, ICCS 2022Greg Landrum
 
ACS San Diego - The RDKit: Open-source cheminformatics
ACS San Diego - The RDKit: Open-source cheminformaticsACS San Diego - The RDKit: Open-source cheminformatics
ACS San Diego - The RDKit: Open-source cheminformaticsGreg Landrum
 
Let’s talk about reproducible data analysis
Let’s talk about reproducible data analysisLet’s talk about reproducible data analysis
Let’s talk about reproducible data analysisGreg Landrum
 
Is one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical researchIs one enough? Data warehousing for biomedical research
Is one enough? Data warehousing for biomedical researchGreg Landrum
 
Some "challenges" on the open-source/open-data front
Some "challenges" on the open-source/open-data frontSome "challenges" on the open-source/open-data front
Some "challenges" on the open-source/open-data frontGreg Landrum
 
Large scale classification of chemical reactions from patent data
Large scale classification of chemical reactions from patent dataLarge scale classification of chemical reactions from patent data
Large scale classification of chemical reactions from patent dataGreg Landrum
 
Machine learning in the life sciences with knime
Machine learning in the life sciences with knimeMachine learning in the life sciences with knime
Machine learning in the life sciences with knimeGreg Landrum
 
Open-source from/in the enterprise: the RDKit
Open-source from/in the enterprise: the RDKitOpen-source from/in the enterprise: the RDKit
Open-source from/in the enterprise: the RDKitGreg Landrum
 
Open-source tools for querying and organizing large reaction databases
Open-source tools for querying and organizing large reaction databasesOpen-source tools for querying and organizing large reaction databases
Open-source tools for querying and organizing large reaction databasesGreg Landrum
 
Is that a scientific report or just some cool pictures from the lab? Reproduc...
Is that a scientific report or just some cool pictures from the lab? Reproduc...Is that a scientific report or just some cool pictures from the lab? Reproduc...
Is that a scientific report or just some cool pictures from the lab? Reproduc...Greg Landrum
 
Reproducibility in cheminformatics and computational chemistry research: cert...
Reproducibility in cheminformatics and computational chemistry research: cert...Reproducibility in cheminformatics and computational chemistry research: cert...
Reproducibility in cheminformatics and computational chemistry research: cert...Greg Landrum
 

More from Greg Landrum (12)


Moving from Artisanal to Industrial Machine Learning

  • 1. © 2019 KNIME AG. All Rights Reserved. Moving from Artisanal to Industrial Machine Learning Greg Landrum (greg.landrum@knime.com)
  • 2. This talk • Motivation • Creating a reproducible/industrial artisan • An artisanal side trip into working with imbalanced data
  • 3. Context Artisanal Industrial Image credits: https://flic.kr/p/RJ5xEs (License: CC BY 2.0); https://flic.kr/p/a3LLdm (License: CC BY 2.0)
  • 4. Context Artisanal • Creative/Exploratory • Flexible Industrial • Automated • Reproducible • Repeatable • Quality control
  • 5. Motivation: utility • Thinking about the models that are useful in the design-make-test cycle of a med-chem project • Perhaps something project-specific for the main target + important anti-targets. • Likely a host of additional global models that could be used (solubility, pKa, hERG, CYPs, synthetic accessibility, etc.)
  • 6. Aspirations • Can we figure out how to help the artisan be more reproducible/repeatable? • Can we provide an “industrial” framework the artisan can work within? • Can this somehow be practical?
  • 7. A process for data mining
  • 8. Cross-industry standard process for data mining • An EU-funded project from the late ’90s run by Integral Solutions (bought by SPSS, bought by IBM), Teradata, Daimler-Benz, NCR, and OHRA.
  • 9. Cross-industry standard process for data mining • An EU-funded project from the late ’90s run by Integral Solutions (bought by SPSS, bought by IBM), Teradata, Daimler-Benz, NCR, and OHRA. I can guess what you’re thinking…
  • 10. Cross-industry standard process for data mining • An EU-funded project from the late ’90s run by Integral Solutions (bought by SPSS, bought by IBM), Teradata, Daimler-Benz, NCR, and OHRA. I can guess what you’re thinking…
  • 11. Cross-industry standard process for data mining • An EU-funded project from the late ’90s run by Integral Solutions (bought by SPSS, bought by IBM), Teradata, Daimler-Benz, NCR, and OHRA. Shockingly, this actually produced something useful
  • 12. The CRISP-DM Process CRISP-DM (CRoss Industry Standard Process for Data Mining) is a standard process for data mining solutions. Image from: https://upload.wikimedia.org/wikipedia/commons/b/b9/CRISP-DM_Process_Diagram.png
  • 13. Establishing context • Business understanding – What problem are we trying to solve? – What would a solution look like? • Data understanding – What data do we have available? – Is it any good? – What might be useful for this problem? Image from: https://upload.wikimedia.org/wikipedia/commons/b/b9/CRISP-DM_Process_Diagram.png Domain expertise required here
  • 14. The problem • Build predictive models for bioactivity based on the data in screening assays
  • 15. The datasets we’ll be working with • qHTS data from eight PubChem assays captured in ChEMBL • The assays have very different numbers of actives in them, so to get started we’ll use two at different ends of the spectrum
  • 16. The datasets we’ll be working with • Assay CHEMBL1614166 (PubChem BioAssay. qHTS Assay for Inhibitors of MBNL1-poly(CUG) RNA binding. (Class of assay: confirmatory)) – https://www.ebi.ac.uk/chembl/assay_report_card/CHEMBL1614166/ – https://pubchem.ncbi.nlm.nih.gov/bioassay/2675 • 34018 inactives, 98 actives (using the annotations from PubChem)
  • 17. Nature of the actives (CHEMBL1614166)
  • 18. Nature of the actives (CHEMBL1614166)
  • 19. The datasets we’ll be working with • Assay CHEMBL1614421 (PUBCHEM_BIOASSAY: qHTS for Inhibitors of Tau Fibril Formation, Thioflavin T Binding. (Class of assay: confirmatory)) – https://www.ebi.ac.uk/chembl/assay_report_card/CHEMBL1614166/ – https://pubchem.ncbi.nlm.nih.gov/bioassay/1460 • 43345 inactives, 5602 actives (using the annotations from PubChem)
  • 20. Model building • Data Preparation – Making it machine-useable – Cleanup – Feature engineering • Modeling – The cool ML/AI stuff Image from: https://upload.wikimedia.org/wikipedia/commons/b/b9/CRISP-DM_Process_Diagram.png
  • 21. Data Preparation • Structures are taken from ChEMBL – Already some standardization done – Processed with RDKit • Fingerprints: RDKit Morgan-2, 2048 bits
  • 22. Modeling • Stratified 80-20 training/holdout split • KNIME random forest classifier – 500 trees – Max depth 15 – Min node size 2 This is a first pass through the cycle; we will try other fingerprints, learning algorithms, and hyperparameters in future iterations
  • 23. Evaluation • Does the model work? • Does it actually solve the problem? • Was the problem well posed? • Is it implying data problems? Image from: https://upload.wikimedia.org/wikipedia/commons/b/b9/CRISP-DM_Process_Diagram.png
  • 24. Evaluation • AUROC, overall accuracy, and Cohen’s kappa on the holdout data Many, many, many options here. I’m using global metrics because in the end I want to use the “active/inactive” predictions made by the model
  • 25. Using • Deployment – How do you actually use the model? – How do you keep it up to date? – How do you get people to accept the results? Image from: https://upload.wikimedia.org/wikipedia/commons/b/b9/CRISP-DM_Process_Diagram.png
  • 26. Deployment: technical • Easy since I’m using KNIME • Deploy as a web service – Easy to validate/test • Automated rebuild/re-evaluate when new data are available
  • 27. Deployment: practical • Providing “active/inactive” classifications and predicted probabilities is likely not enough • Similar compounds from training set? • Applicability domain? • Conformal prediction? • “Explanation” of the prediction (e.g. similarity maps)?
  • 28. Results
  • 29. Evaluation CHEMBL1614166: holdout data
  • 30. Evaluation CHEMBL1614166: test data AUROC=0.72
  • 31. Results CHEMBL1614421: holdout data
  • 32. Evaluation CHEMBL1614421: holdout data AUROC=0.75
  • 33. Taking stock • Both models have: – Good overall accuracies (because of imbalance) – Decent AUROC values – Terrible Cohen kappas Now what?
  • 34. Let’s get artisanal…
  • 35. Quick diversion on bag classifiers When making predictions, each tree in the classifier votes on the result. Majority wins. The predicted class probabilities are often the means of the predicted probabilities from the individual trees. We construct the ROC curve by sorting the predictions in decreasing order of predicted probability of being active. Note that the actual class predictions are irrelevant for an ROC curve. As long as true actives tend to have a higher predicted probability of being active than true inactives, the AUC will be good.
  • 36. Handling imbalanced data • The standard decision rule for a random forest (or any bag classifier) is that the majority wins [1], i.e. the predicted probability of being active must be >= 0.5 in order for the model to predict "active" • Shift that threshold to a lower value for models built on highly imbalanced datasets [2] [1] This is only strictly true for binary classifiers [2] Chen, J. J., et al. “Decision Threshold Adjustment in Class Prediction.” SAR and QSAR in Environmental Research 17 (2006): 337–52.
  • 37. Picking a new decision threshold • Generate a random forest for the dataset using the training set • Generate out-of-bag predicted probabilities using the training set • Try a number of different decision thresholds [1] and pick the one that gives the best kappa • Once we have the decision threshold, use it to generate predictions for the test set. [1] Here we use: [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]
  • 38. Results CHEMBL1614166 • Balanced confusion matrix Previously 0.181
  • 39. Results CHEMBL1614421 • Balanced confusion matrix Previously 0.005
  • 40. Does it work in general? ChEMBL data, random-split validation
  • 41. Does it work in general? Proprietary data, time-split validation
  • 42. Coming back to validation • CHEMBL1614166: – Overall accuracy: 99.8% – Kappa: 0.53 – AUROC: 0.72 • CHEMBL1614421: – Overall accuracy: 89.6% – Kappa: 0.30 – AUROC: 0.75
  • 43. Wrapping up Image from: https://upload.wikimedia.org/wikipedia/commons/b/b9/CRISP-DM_Process_Diagram.png
  • 44. Maybe useful… • “Practical Machine Learning Canvas”
  • 45. Data/Scripts • KNIME workflow for adjusting the decision threshold: https://kni.me/w/HRDmzyQy0UL0k7H2 • RDKit blog post about adjusting the decision threshold (includes links to code): http://rdkit.blogspot.com/2018/11/working-with-unbalanced-data-part-i.html • Practical ML Canvas: https://bit.ly/2JLLsRC
  • 46. Acknowledgements • Dean Abbott (Abbott Analytics) • KNIME: – Daria Goldmann – Rosaria Silipo • NIBR: – Nik Stiefl – Nadine Schneider – Niko Fechner For more amazing car pictures: do an image search for “rat rod”
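
Slide 35 notes that an ROC curve depends only on how the compounds are ranked, not on the class predictions themselves. A minimal sketch of that point in plain Python follows. The function name `auroc` and the toy labels and probabilities are illustrative assumptions, not data or code from the talk.

```python
from typing import Sequence


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve computed from pairwise rankings.

    Equals the fraction of (active, inactive) pairs in which the active
    has the higher score; ties count as half a win.
    """
    actives = [s for t, s in zip(y_true, scores) if t == 1]
    inactives = [s for t, s in zip(y_true, scores) if t == 0]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive")
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(actives) * len(inactives))


# Toy, made-up data: 8 inactives followed by 2 actives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.01, 0.02, 0.03, 0.04, 0.06, 0.08, 0.12, 0.22, 0.27, 0.41]

# Every probability is below 0.5, so the default rule predicts "inactive"
# for all compounds, yet the ranking of actives over inactives is perfect.
default_preds = [1 if p >= 0.5 else 0 for p in probs]
ranking_quality = auroc(y_true, probs)
```

In this toy case the default 0.5 threshold finds no actives at all even though every active outranks every inactive, which is the situation the threshold shift on slide 36 is meant to address.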
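
The threshold-picking procedure on slide 37 can be sketched in plain Python as below. This is an illustration under stated assumptions rather than the KNIME workflow or the blog-post code: the helper names `cohen_kappa` and `pick_threshold` are hypothetical, and the probabilities are made-up stand-ins for the out-of-bag predicted probabilities that a random forest would supply for the training set. Only the threshold grid is taken from the slide.

```python
from typing import Sequence, Tuple

# Threshold grid quoted on slide 37.
THRESHOLDS = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]


def cohen_kappa(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Cohen's kappa for binary labels (1 = active, 0 = inactive)."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    observed = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if expected == 1.0:
        return 0.0  # degenerate case: only one class present everywhere
    return (observed - expected) / (1.0 - expected)


def pick_threshold(
    y_true: Sequence[int],
    probs: Sequence[float],
    thresholds: Sequence[float] = THRESHOLDS,
) -> Tuple[float, float]:
    """Return (threshold, kappa) for the threshold with the best kappa.

    `probs` is assumed to hold out-of-bag predicted probabilities of
    being active for the training compounds. Ties go to the lowest
    threshold tried first.
    """
    best_t, best_k = thresholds[0], float("-inf")
    for t in thresholds:
        preds = [1 if p >= t else 0 for p in probs]
        k = cohen_kappa(y_true, preds)
        if k > best_k:
            best_t, best_k = t, k
    return best_t, best_k


# Toy, made-up training data: 8 inactives followed by 2 actives.
y_train = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
oob_probs = [0.01, 0.02, 0.03, 0.04, 0.06, 0.08, 0.12, 0.22, 0.27, 0.41]

threshold, kappa = pick_threshold(y_train, oob_probs)
# The chosen threshold would then be applied to test-set probabilities:
# test_preds = [1 if p >= threshold else 0 for p in test_probs]
```

Selecting the threshold on out-of-bag probabilities, as the slide describes, keeps the holdout set untouched until the final evaluation.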