Machine learning has become a must for improving insight, quality, and time to market. But it has also been called the 'high-interest credit card of technical debt', with challenges in managing both how it is applied and how its results are consumed.
6. Problem: I can't fully specify the behavior I want.
Solution: Machine Learning.
7. Where does machine learning fit in the technology universe?
Valuable: "... a star of the Data Science orchestra." - John Mount, Win-Vector
Central: "... the new algorithms ... at the heart of most of what computer science does." - Hal Daumé III, U. Maryland Professor
Last Resort: "... for cases when the desired behavior cannot be effectively expressed in software logic without dependency on external data." - D. Sculley et al., Google
8. Where does machine learning fit in developing technology?
(Roadmap diagram: Demonstrable Value · Stuff to do now · Stuff to do)
9. How does machine learning affect value demonstration?
Demonstrable Value
● Distill business goal into a repeatable, balanced metric.
● Measure on the most representative data you can get.
● Distinguish intrinsic errors from implementation bugs.
● Let your customer override the model when they absolutely must get some answer.
10. Distill business goal into a repeatable, balanced metric.
Demonstrable Value
Business goals in our example:
● fewer incorrect candidates sent to analysts for review
● no increased volume of work for analysts
● confidence to help analysts prioritize
Example metric: area under an error trade-off curve based on confidence, constrained to a maximum volume. Sometimes called an 'overall evaluation criterion' (OEC); a sketch follows below.
Note that the more skewed the OEC (e.g., if the number of positives varies by day and season), the more samples are required to be sure of statistical significance.
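As a minimal sketch of such a metric, here is one hypothetical way to compute the area under an error-vs-volume trade-off curve with a volume cap. The talk gives no implementation; the names (constrained_oec, max_volume) and the data are illustrative only.

```python
import numpy as np

def constrained_oec(scores, labels, max_volume=0.2):
    """Area under the error-rate-vs-volume curve for candidates sent to
    analysts, considering only operating points within the volume cap.
    labels: 1 = correct candidate, 0 = incorrect. Lower OEC is better."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    order = np.argsort(-scores)          # most confident candidates first
    labels = labels[order]
    n = len(labels)
    volumes, errors = [], []
    for k in range(1, n + 1):
        volume = k / n                   # fraction sent for review
        if volume > max_volume:
            break                        # respect the analysts' workload cap
        errors.append(1.0 - labels[:k].mean())   # fraction incorrect so far
        volumes.append(volume)
    return np.trapz(errors, volumes)     # integrate error over volume

# Illustrative data: higher confidence loosely tracks correctness.
rng = np.random.default_rng(0)
scores = rng.random(1000)
labels = (scores + rng.normal(0, 0.3, 1000) > 0.5).astype(int)
print(constrained_oec(scores, labels, max_volume=0.2))
```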
11. Measure on the most representative data you can get.
Demonstrable Value
Considerations when selecting data:
● online v offline: A/B test in production with feature flags (one or two variables at a time, agile-y) vs. a stable data set
● implicit v explicit: implicit can correlate more with value but omits unseen states
● broad v targeted: if explicitly annotating, consider targeting based on diagnostic value or on where systems disagree
Resist the temptation to 'clean' data -- you may kill it. Instead, include normalization in your model, as sketched below.
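One way to read "include normalization in your model" is to fold the normalization step into the model pipeline itself, so raw data is never destructively cleaned. A minimal sketch, assuming scikit-learn (a library the talk does not name):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# The scaler is fit only on training data inside the pipeline, and the
# same normalization is applied automatically at serving time -- no
# separate, potentially lossy 'cleaning' pass over the data set.
model = Pipeline([
    ("normalize", StandardScaler()),
    ("classify", LogisticRegression()),
])

# Tiny illustrative data set:
X = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 240.0], [4.0, 210.0]])
y = np.array([0, 0, 1, 1])
model.fit(X, y)
print(model.predict([[2.5, 205.0]]))   # new input is normalized for us
```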
12. Distinguish intrinsic errors from implementation bugs.
Demonstrable Value
The distinction:
● Error: incorrect output from a model despite the model being correctly implemented.
● Bug: an incorrect implementation, doing something other than what was intended.
Useful for managing expectations about quality and about the effort required to improve or fix.
Providing an explanation for the output can help make this distinction.
13. Let your customer override the model when they absolutely must get some answer.
Demonstrable Value
Varieties of overrides:
● Always give this answer.
● Never give this answer.
Overrides can apply to sub-models or to the system overall.
Beware the potential for 'whack-a-mole'. A sketch of an override layer follows below.
Feel sad every time they use it.
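A hypothetical sketch of such an override layer wrapped around a model; the talk describes the idea but not an implementation, so the class, the rank() interface, and the demo data are all invented for illustration.

```python
class OverridableModel:
    """Wraps a model with customer-controlled 'always' and 'never' overrides."""
    def __init__(self, model):
        self.model = model
        self.always = {}     # input key -> forced answer ("always give this")
        self.never = set()   # answers that must not be returned ("never give this")

    def predict(self, x, key=None):
        if key is not None and key in self.always:
            return self.always[key]          # customer-pinned answer
        for answer in self.model.rank(x):    # model's answers, best first
            if answer not in self.never:
                return answer                # best non-blocked answer
        return None                          # everything was overridden

class DummyModel:
    """Stand-in model for the demo."""
    def rank(self, x):
        return ["A", "B", "C"]

m = OverridableModel(DummyModel())
m.never.add("A")                 # customer: "never give this answer"
print(m.predict("some input"))   # -> "B"; log each use, and feel sad
```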
14. Where does machine learning fit in developing technology?
(Roadmap diagram: Demonstrable Value · Stuff to do now · Stuff to do)
15. How does machine learning affect team organization?
Machine Learning Expert
A spectrum of options between:
● Integrate machine learning expertise into every team that needs it.
● Separate it into an independent, specialist team.
16. Option 1: integrated teams with cross-team interest groups
● Encourages alignment with business goals.
● Challenges machine learning collaboration, depth, and reuse.
● Best for small, diverse products.
17. Option 2: independent machine learning team delivering models
● Encourages machine learning collaboration, depth, and reuse.
● Challenges alignment with business goals.
● Best for products with large, complex model(s).
18. How does machine learning affect iteration structure?
Pros for shorter iterations:
● More simple experiments are better than fewer complex ones.
● The value of machine learning leads to a high cost of delay.
Pros for longer iterations:
● Innovation takes deep thinking.
● More time to control the creation of technical debt.
19. Where does machine learning fit in developing technology?
(Roadmap diagram: Demonstrable Value · Stuff to do now · Stuff to do)
20. How does machine learning affect chunks of work?
Stuff to do now
● Focus on experiments following the scientific method: hypothesis, measurement, and error analysis.
● Continuously test for regression versus expected measurements.
● Decouple functional tests from model variations.
22. Continuously test for regression versus expected measurements.
Stuff to do now
With machine learning's dependence on data, changing anything changes everything. This is what makes it the "high-interest credit card of technical debt".
Determine what counts as a significant change, including looking at the aggregate effect across different data sets. A sketch of such a check follows below.
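A minimal sketch of a metric regression test over several evaluation sets, assuming a scikit-learn-style model.score interface; the data set names, expected scores, and tolerance are invented examples, not values from the talk.

```python
# Scores recorded from the last accepted model, one per evaluation set.
EXPECTED = {"news": 0.91, "tweets": 0.83}
TOLERANCE = 0.01   # what counts as a 'significant' change, per data set

def assert_no_metric_regression(model, eval_sets):
    """eval_sets: {"news": (X, y), ...}. Fail if any data set -- or the
    aggregate across data sets -- drops noticeably below expectations."""
    scores = {name: model.score(X, y) for name, (X, y) in eval_sets.items()}
    for name, score in scores.items():
        assert score >= EXPECTED[name] - TOLERANCE, (
            f"regression on {name}: {score:.3f} < {EXPECTED[name]:.3f}")
    # Aggregate check: small per-set dips can add up to a real regression.
    assert sum(scores.values()) >= sum(EXPECTED.values()) - TOLERANCE, (
        "aggregate regression across data sets")
```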
23. Decouple functional tests from model variations.
Stuff to do now
Options:
● Black-box style: enforce "can't be wrong" ("earmark") input/output pairs. Might lead to spurious test failures.
● Clear-box style: use a mock implementation of the model that produces expected answers (sketched below).
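A hypothetical sketch of the clear-box option: the functional test runs against a mock model, so retraining the real model can never break tests of the surrounding application logic. The routing function and document IDs are invented for illustration.

```python
class MockModel:
    """Stands in for the real model; returns canned, expected answers."""
    def __init__(self, answers):
        self.answers = answers

    def predict(self, x):
        return self.answers[x]

def route_candidate(model, candidate):
    # The application logic under test (assumed for illustration).
    label = model.predict(candidate)
    return "send_to_analyst" if label == "suspect" else "auto_clear"

def test_routing_logic():
    model = MockModel({"doc-1": "suspect", "doc-2": "benign"})
    assert route_candidate(model, "doc-1") == "send_to_analyst"
    assert route_candidate(model, "doc-2") == "auto_clear"

test_routing_logic()   # passes no matter how the real model varies
```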
26. Where does machine learning fit in developing technology?
(Roadmap diagram: Demonstrable Value · Stuff to do now · Stuff to do)
27. How does machine learning affect prioritization?
Stuff to do
● Do we need more training data?
● Do we need a richer representation of our data?
● Do we need a combination of models?
● How much could improving a sub-component of the model help?
● What development milestones should we target?
28. Do we need more training data?
Stuff to do
The learning curve implies that adding training data should bring the test error down, closer to the desired level. A sketch of reading a learning curve follows below.
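A minimal sketch of producing a learning curve to support this decision, assuming scikit-learn (not named in the talk); the data and model are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, random_state=0)
sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, te in zip(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"n={n:5d}  train={tr:.3f}  test={te:.3f}")

# If the test score is still climbing toward a high train score at the
# largest size, more data should help; if the two curves have already
# converged below the target, more data alone won't get you there.
```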
29. Do we need a richer representation of our data?
Stuff to do
The learning curve implies adding data won't help, but a richer data representation may.
It could mean more features, identified by someone with domain expertise analyzing errors. Remember, though, that more features often means less speed.
It could require a new model if the domain information identified is not representable in the existing one. A sketch of adding domain features follows below.
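A hypothetical sketch of enriching the representation with features suggested by domain-expert error analysis; the column names and the specific features are invented examples, assuming pandas.

```python
import pandas as pd

def add_domain_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # A ratio an analyst noticed separates many of the error cases:
    out["amount_per_item"] = out["amount"] / out["item_count"].clip(lower=1)
    # A day-of-week feature so the model can see weekly seasonality:
    out["day_of_week"] = pd.to_datetime(out["timestamp"]).dt.dayofweek
    return out

df = pd.DataFrame({
    "amount": [120.0, 30.0], "item_count": [4, 0],
    "timestamp": ["2016-03-01", "2016-03-06"],
})
print(add_domain_features(df))
```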
30. Do we need a combination of models?
Stuff to do
The learning curve implies the model is overfitting the training set.
Consider training multiple models on random subsets of the data and combining them at runtime to decrease the variance while retaining a low bias -- presuming you can spend the compute. A sketch follows below.
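The combination described here is essentially bagging. A minimal sketch, assuming scikit-learn's BaggingClassifier (a library choice not made in the talk), compares a single high-variance model against the bagged ensemble:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

single = DecisionTreeClassifier(random_state=0)      # low bias, high variance
bagged = BaggingClassifier(single, n_estimators=50,  # 50 models trained on
                           max_samples=0.7,          # random 70% subsets,
                           random_state=0)           # averaged at runtime

# The ensemble costs ~50x the training and serving compute of one model.
print("single:", cross_val_score(single, X, y, cv=5).mean())
print("bagged:", cross_val_score(bagged, X, y, cv=5).mean())
```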
31. How much could improving a sub-component of the model help?
Stuff to do
● Build an 'oracle' for the sub-component -- something that takes its perfect output from data.
● Annotate some test data to get that perfect output to feed the oracle.
● Measure the overall system with the oracle turned on (sketched below).
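A hypothetical sketch of such an oracle experiment: one sub-component (a tagger, as an invented example) is swapped for a lookup over annotated gold output, and the full system is re-measured.

```python
class OracleTagger:
    """Replaces the real tagger; returns hand-annotated 'perfect' tags."""
    def __init__(self, gold):
        self.gold = gold                   # sentence -> annotated tags

    def tag(self, sentence):
        return self.gold[sentence]

def end_to_end_accuracy(tagger, pipeline_tail, test_set):
    """Measure the whole system with either the real or oracle tagger."""
    correct = 0
    for sentence, expected in test_set:
        tags = tagger.tag(sentence)        # oracle or real sub-component
        if pipeline_tail(sentence, tags) == expected:
            correct += 1
    return correct / len(test_set)

# The gap between accuracy with the real tagger and with OracleTagger
# bounds how much perfecting that sub-component could help overall.
```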
32. What development milestones should we target?
Stuff to do
Make it...
● glued together with some rules (Prototype)
● function (Alpha)
● measurable & inspectable (early Beta)
● accurate, not slow, a nice demo, documented & configurable (late Beta)
● simple & fast (GA)
● handle new kinds of input (post-GA)
33. Questions?
(Roadmap diagram: Demonstrable Value · Stuff to do now · Stuff to do)
Suggested questions:
● Say more about integrating domain expertise?
● Say more about online vs. offline testing?
● How to manage acquiring data?
● How to recruit machine learning folks?
● What bad habits can ML enable?
Where can I try your stuff? api.rosette.com
You hiring? Yes - basistech.com/careers/
@dmurga
36. Recruiting machine learning experts
Who:
◦ expertise in sequence models > expertise in the domain
◦ depth in a specific model > breadth over many
Where to find them:
◦ local network: meet-ups, LinkedIn
◦ academic conferences
◦ communities (e.g., Kaggle, users of ML tools)
How to attract them:
◦ explain the purpose & uniqueness of the problem
37. Online vs. offline evaluation
Online (e.g., A/B)
● Individual decisions need to not be mission-critical.
● Needs enough use to get sufficient statistics in a short time.
● Helps motivate aligning production and development environments.
● If the model is updated online, validate it against offline data periodically to watch for drift (a sketch follows below).
● Usually focused on extrinsic or distant measures.
Offline
● Always have some of this for long-term protection against regression.
● May be required for intrinsic measurement.
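A minimal sketch of that periodic drift check: an online-updated model is re-scored on a frozen offline set. The scikit-learn-style score interface, the threshold, and the schedule are assumptions, not from the talk.

```python
import datetime

DRIFT_TOLERANCE = 0.02   # assumed threshold for a worrying drop

def check_for_drift(live_model, offline_X, offline_y, baseline_score):
    """Run on a schedule (e.g., nightly) while the model learns online;
    compares the live model against a frozen offline evaluation set."""
    score = live_model.score(offline_X, offline_y)
    drifted = score < baseline_score - DRIFT_TOLERANCE
    print(f"{datetime.date.today()} offline score={score:.3f} "
          f"baseline={baseline_score:.3f} drifted={drifted}")
    return drifted
```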
38. Epistemology

Epistemology     Exact sciences    Experimental sciences  Engineering  Art
Example ...      Theoretical C.S.  Physics                Software     Management
Deals with ...   Theorems          Theories               Artifacts    People
Truth is ...     Forever           Temporary              "It works"   In the eye of the beholder
Parts of ML ...  Learning theory   Model & measure        Systems      Users

Parts of machine learning fit all four. This is great, as long as we don't confuse one kind of work for another.
(This table is an expansion of one in Bottou's ICML 2015 talk.)
Editor's Notes
Balance:
Consistency v correctness
Extrinsic v intrinsic
Interpretability v correctness
Precision v recall (volume)
Exploitation v exploration
Data:
Historic
Diagnostic
Online v offline
For online A/B tests, choose control.
Oracle
Experiments both for data collection and for speed (esp. of adding caches)