We live with an abundance of ML resources: from open source tools, to GPU workstations, to cloud-hosted AutoML. What’s more, the lines between AI research and everyday ML have blurred; you can recreate a state-of-the-art model from arXiv papers at home. But can you afford to? In this talk, we explore ways to recession-proof your ML process without sacrificing accuracy, explainability, or value.
11. The Supervised Learning Problem
Labeled Training Data
Define a set of target classes &
build a training dataset that
has been annotated with those
class labels.
Feature Transformation(s)
Take raw data and convert it
into vector form ahead of model
training.
Classifier Algorithm
Train a model to recognize
target classes using labeled
training data. Tune parameters
to reduce false positives
and/or false negatives.
This part is slow and boring.
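A minimal sketch of those three pieces in scikit-learn (the tiny inline dataset and the TF-IDF/logistic-regression pairing are illustrative assumptions, not part of the talk):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Labeled training data: raw inputs annotated with target class labels.
raw_texts = ["great product", "terrible service", "loved it", "awful experience"]
labels = ["pos", "neg", "pos", "neg"]

# Feature transformation: convert raw data into vector form.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(raw_texts)

# Classifier algorithm: train a model to recognize the target classes.
clf = LogisticRegression()
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["what a great experience"])))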
12.
13. “Mais il faut cultiver
notre jardin.”
- Voltaire, Candide
14. Getting Labeled Data
● Started with Amazon Mechanical Turk.
● Now there are many commercial providers of data labeling
and data annotation services.
● It can be quite expensive.
○ 100,000 samples for $25,000 - $75,000
● It’s just people, actually…
○ Semuels, Alana. “The Internet Is Enabling a New Kind of Poorly Paid
Hell.” The Atlantic, January 23, 2018.
● Doesn’t usually work for domain-specific data.
● Quality tends to vary.
15. Which Model is Best?
Bayesian | Decision Tree | Dense Feedforward
16. Which Model is Best?
Bayesian | Decision Tree | Dense Feedforward
for this dataset/problem space
17. from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection as ms

# Candidate models, evaluated in parallel on the same data.
classifiers = [
    KNeighborsClassifier(5),
    SVC(kernel="linear", C=0.025),
    RandomForestClassifier(max_depth=5),
    AdaBoostClassifier(),
    GaussianNB(),
]

# X is the feature matrix, y the labels; score each model with 12-fold
# cross-validation and keep the best mean score as the baseline.
kfold = ms.KFold(n_splits=12)
max([
    ms.cross_val_score(model, X, y, cv=kfold).mean()
    for model in classifiers
])
18. How to Select a Model
● Start with a simple model, or better yet, try several in
parallel!
● Filter out the weak performers, and only tune the best.
● Set an initial baseline.
● Use these preliminary steps to prepare for hyperparameter
tuning.
19. Hyperparameter Tuning
● Grid search
● Randomized search
● Bayesian optimization
● Evolutionary optimization
● Population-based training
● Gradient-based optimization
● “Auto ML” (see above, but pay $$$)
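As a concrete example from the cheaper end of that list, here is a minimal randomized-search sketch using scikit-learn’s RandomizedSearchCV; the dataset and parameter ranges below are illustrative assumptions, not recommendations:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Illustrative ranges only: know what you're searching before you search it.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=10,   # sample 10 combinations instead of exhausting the grid
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)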
20. Unfortunately...
● Search is difficult, particularly in high dimensional space.
● Even with clever optimization techniques, there is no guarantee of a solution.
● As the search space gets larger, the amount of time required increases exponentially.
24. Thoughtful Tuning
● Only tune the best performing models.
● Try to reduce your feature space.
● Understand the parameter ranges you’re searching.
● Move towards complexity purposefully.
○ Understand whether error comes from variance or from bias.
○ Add complexity when the model underfits or the error doesn’t converge.
● Move towards complexity gradually.
○ Only while both train and test scores are increasing (or error decreasing); see the sketch below.
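One way to see where you sit on that bias/variance spectrum is a validation curve: score the model at increasing complexity and watch when train and test scores stop improving together. A minimal sketch (the synthetic data and max_depth range are assumptions for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

depths = [1, 2, 4, 8, 16]
train_scores, test_scores = validation_curve(
    RandomForestClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5,
)

for depth, tr, te in zip(depths, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    # Low train and test scores suggest bias (underfitting);
    # a high train score with a low test score suggests variance (overfitting).
    print(f"max_depth={depth}: train={tr:.3f} test={te:.3f}")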
27. Prototype Locally First
● Consider: are these conveniences really necessary/useful
at the prototyping phase?
○ Probably not
● Don’t default to using cloud-hosted, Spark-running
notebooks for everything!
● Configure Python to run locally (one-time cost).
● VSCode, PyCharm, etc., support Jupyter notebooks now.
● Downsampling your data is cheap!
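Downsampling really is a one-liner; a minimal sketch with pandas (the inline DataFrame is a stand-in for whatever your full dataset is):

import pandas as pd

# Stand-in for a full dataset; substitute your own load step.
df = pd.DataFrame({"feature": range(1_000_000), "label": [0, 1] * 500_000})

# Prototype against a 1% sample; fix random_state so the sample is reproducible.
sample = df.sample(frac=0.01, random_state=42)
print(len(sample))  # 10,000 rows instead of 1,000,000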
30. Serialize Everything
● The model
● Engineered features
● Feature vectors/embeddings
● Stopwords
● Lexicons
● Scores
● Diagnostic plots
● Training times
And any other artifacts or
metadata!
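A minimal sketch of what that might look like with joblib plus a JSON metadata sidecar (the file names, synthetic data, and choice of metadata fields are assumptions for illustration):

import json
import time
from joblib import dump
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

model = RandomForestClassifier(random_state=0)
start = time.time()
model.fit(X, y)
elapsed = time.time() - start

# Serialize the fitted model itself...
dump(model, "model.joblib")

# ...and the scores, timings, and other metadata alongside it.
metadata = {
    "cv_scores": cross_val_score(model, X, y, cv=5).tolist(),
    "training_time_seconds": elapsed,
    "params": model.get_params(),
}
with open("model_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2, default=str)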
31. When is an ML Model “done”?
● When you have achieved an accuracy measure above your
threshold.
● When your error bounds are within your pre-defined target
range.
● When your cross-validation demonstrates a convergence in
training and test data.
● When the sprint is over.
● When the project is due.
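Tongue-in-cheek endings aside, the first two criteria can be a literal check in code; a toy sketch in which the 0.85 accuracy floor and 0.05 spread target are made-up thresholds, not recommendations:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)

accuracy_threshold = 0.85   # pre-defined accuracy floor (made up)
max_score_spread = 0.05     # pre-defined target range for error bounds (made up)

done = scores.mean() >= accuracy_threshold and scores.std() <= max_score_spread
print(f"mean={scores.mean():.3f} std={scores.std():.3f} done={done}")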
34. When we shift our collective
mindset toward model
thriftiness rather than the
relentless pursuit of a tiny bit
more F1, there’s no telling what
new things we might discover…