AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production Environments
1. How to Make Analytic Operations Look More Like DevOps: Lessons Learned Moving Machine-Learning Algorithms to Production Environments
Robert L. Grossman
University of Chicago and Open Data Group
O'Reilly Strata Conference, March 30, 2016
rgrossman.com | @bobgrossman
3. Software Development → Quality Assurance → Operations: DevOps
The goal of DevOps is to establish a culture and an environment where building, testing, releasing, and operating software can happen rapidly, frequently, and more reliably.*
*Adapted from Wikipedia, en.wikipedia.org/wiki/DevOps.
4. Analytic Modeling → Quality Assurance → Analytic Operations: AnalyticOps
The goal of AnalyticOps is to establish a culture and an environment where building, validating, deploying, and running analytic models happen rapidly, frequently, and reliably.
5. Analytic Modeling → Quality Assurance → Analytic Operations: AnalyticOps
The goal of AnalyticOps is to establish a culture and an environment where building, validating, deploying, and running analytic models happen rapidly, frequently, and reliably.
• Software
• Model
• Data
6. Diagram: Analytic strategy and planning / Analytic models & algorithms / Analytic operations / Analytic infrastructure.*
*Source: Robert L. Grossman, The Strategy and Practice of Analytics, O'Reilly, 2016, to appear.
7. A Problem
There are platforms and tools for managing and processing big data (Hadoop) and for building analytics (SAS, SPSS, R, Statistica, Spark, Skytree, Mahout), but few options for deploying analytics into operations or for embedding analytics into products and services.
Diagram: data scientists developing analytic models & algorithms → (deploying analytics) → enterprise IT deploying analytics into products, services, and operations, over a common analytic infrastructure.
8. More Problems
Diagram: the same picture, with two additional flows: monitoring operational analytics, and ETL and datamarts for the modelers.
10. Life Cycle of a Predictive Model
Analytic modeling: select analytic problem & approach; exploratory data analysis; get and clean the data; build model in the dev/modeling environment.
Analytic operations: deploy model in operational systems with a scoring application; scale up deployment; monitor performance and employ champion-challenger methodology to develop an improved model; retire model and deploy the improved model.
The two halves are linked by model deployment in one direction and performance data in the other.
11. The same life cycle with the two halves labeled: the modeling side is ModelDev and the operations side is AnalyticOps, again linked by model deployment in one direction and performance data in the other.
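Since the deck never shows the champion-challenger step in code, here is a minimal, hypothetical Python sketch of it (the metric, the promotion threshold, and the function name are assumptions, not from the slides): both models score the same monitored traffic, and the challenger replaces the champion only when it clearly wins.

```python
from sklearn.metrics import roc_auc_score

def champion_challenger_round(champion, challenger, X, y_true, min_lift=0.01):
    """One round of the champion-challenger loop: score a monitored batch
    with both models and decide whether to retire the champion."""
    champ_auc = roc_auc_score(y_true, champion.predict_proba(X)[:, 1])
    chall_auc = roc_auc_score(y_true, challenger.predict_proba(X)[:, 1])
    # Require a minimum lift so noise alone does not trigger a model swap.
    if chall_auc > champ_auc + min_lift:
        return challenger, {"promoted": True, "champion_auc": champ_auc,
                            "challenger_auc": chall_auc}
    return champion, {"promoted": False, "champion_auc": champ_auc,
                      "challenger_auc": chall_auc}
```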
12. Differences Between the Modeling and Deployment Environments
• Typically, modelers use specialized languages such as SAS, SPSS, or R.
• Usually, developers responsible for products and services use languages such as Java, JavaScript, Python, C++, etc.
• This can result in significant effort moving the model from the modeling environment to the deployment environment.
13. Ways to Deploy Models into Products/Services/Operations
• Export and import tables of scores.
• Export and import tables of parameters.
• Have the product/service interact with the model as a web or message service (a sketch follows this slide).
• Import the models into a database.
• Embed the model into a product or service.
• Push code.
How quickly can the model be updated?
• Model parameters?
• New features?
• New pre- & post-processing?
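As one concrete reading of the web-service option, here is a minimal Flask sketch (the endpoint name, payload shape, and model file are illustrative assumptions): the product calls an HTTP endpoint, so the model behind it can be updated without redeploying the product.

```python
from flask import Flask, jsonify, request
import joblib  # assumed: the model was serialized with joblib

app = Flask(__name__)
model = joblib.load("model.pkl")  # hypothetical serialized model file

@app.route("/score", methods=["POST"])
def score():
    # Expect a JSON body such as {"features": [1.0, 2.0, 3.0]}.
    features = request.get_json()["features"]
    return jsonify({"score": float(model.predict([features])[0])})

if __name__ == "__main__":
    app.run(port=8080)
```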
14. What is a Scoring Engine?
• A scoring engine is a component, integrated into products or enterprise IT, that deploys analytic models in operational workflows for products and services.
• A model interchange format is a format that supports the export of a model by one application and the import of that model by another application (a sketch follows this slide).
• Model interchange formats include the Predictive Model Markup Language (PMML), the Portable Format for Analytics (PFA), and various in-house or custom formats.
• Scoring engines are integrated once, but allow applications to update models as quickly as reading a model interchange format file.
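To make the interchange-format idea concrete, here is a short sketch, assuming the sklearn2pmml package (the deck does not prescribe a specific exporter): the model producer exports a fitted pipeline as PMML, and any PMML-aware scoring engine can then import it.

```python
# A minimal sketch, assuming the sklearn2pmml package is installed.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

X, y = load_iris(return_X_y=True)

# Model producer: fit a pipeline and export it in an interchange format.
pipeline = PMMLPipeline([("classifier", DecisionTreeClassifier(max_depth=3))])
pipeline.fit(X, y)
sklearn2pmml(pipeline, "iris_tree.pmml")

# Model consumer: any PMML-aware scoring engine can now import
# iris_tree.pmml and serve scores without re-implementing the model.
```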
15. Deploying analytic models
Diagram: a Model Producer on the analytic algorithms & models side exports a model (as PMML or PFA); a Model Consumer on the analytic operations side imports it; both run over the analytic infrastructure.
16. Case Study 2: Scaling Bioinformatics Pipelines for the Genomic Data Commons*
*This case study describes work by the NCI Genomic Data Commons Project and the University of Chicago Center for Data Intensive Science.
17. AnalyticOps for the Genomic Data Commons
TCGA dataset: 1.54 PB consisting of 577,878 files about 14,052 cases (patients), in 42 cancer types, across 29 primary sites.
2.5+ PB of cancer genomics data + Bionimbus data commons technology, running multiple community-developed variant calling pipelines.
Over 12,000 cores and 10 PB of raw storage in 18+ racks, running for months.
18. DevOps
• Virtualization and the requirement for massive scale-out spawned infrastructure automation ("infrastructure as code").
• The requirement to reduce the time to deploy code created tools for continuous integration and testing.
19. ModelDev / AnalyticOps
• Use virtualization/containers, infrastructure automation, and scale-out to support large-scale analytics.
• Requirement: reduce the time and cost to do high-quality analytics over large amounts of data.
24. Ten Factors Affecting AnalyticOps
• Model quality (confusion matrix; see the sketch after this list)
• Data quality (six dimensions)
• Lack of ground truth
• Software errors
• Workflow with monitoring
• Scheduling
• Bottlenecks, stragglers, hot spots, etc.
• Analytic configuration problems*
• System failures
• Human errors
*DMS = data-model-system
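For the first factor, model quality, the confusion matrix mentioned above can be tracked per scoring window; a minimal scikit-learn sketch with illustrative labels:

```python
from sklearn.metrics import confusion_matrix

# Monitored labels vs. model predictions for one scoring window (illustrative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```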
25. Monitor Data Quality and Model Performance and Summarize with Dashboards
Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
27. Data Quality: Batch Effects Can Be Significant
Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
28. Model Quality: Differences in Three Somatic Mutation Detection Algorithms
Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
29. Often Software Must Be Written so that It Can Be Run Efficiently in Automated Environments
• Generally, community software in bioinformatics is designed to be run manually over local clusters.
• Example:
– We patched one piece of software over 400 times so that it could run over 12,000 genomes.
– Although only 3.3% of genomes had problems, it required significant manual effort.
• AnalyticOps requires operating the software in automated environments.
30. Decide What Not to Compute
[Figure: histogram of VarScan rate; x-axis: Rate (GB/hour), 0.0 to 2.0; y-axis: Frequency, 0 to 1200.]
Manage these cases carefully.
31. Model Expected Performance
[Figure: scatter plot of processing time vs. tumor BAM size (GB).]
Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
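Slides 30 and 31 together suggest a simple guard: model expected processing time as a function of input size, then decide what not to compute. A hypothetical Python sketch (the data, the linear model, and the budget are illustrative assumptions, not from the deck):

```python
import numpy as np

# Illustrative history: tumor BAM size (GB) vs. processing time (hours).
sizes = np.array([10.0, 40.0, 80.0, 150.0, 220.0, 300.0])
times = np.array([1.0, 3.5, 7.0, 13.0, 20.0, 27.0])

# Fit a simple linear model of expected processing time vs. input size.
slope, intercept = np.polyfit(sizes, times, deg=1)

def should_compute(size_gb, budget_hours=24.0):
    """Decide what not to compute: route cases whose expected processing
    time exceeds the budget to manual handling instead of running them."""
    expected_hours = slope * size_gb + intercept
    return expected_hours <= budget_hours
```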
32. Case Study 3: Deploying Gaussian Process Models to the Industrial Internet*
*Thanks to the DMG PMML and PFA Working Groups.
34. PFA is Based Upon Defining Primitives for Analytic Models
• What would a standard look like that…
– Defines primitives for data transformations, data aggregations, and statistical and analytic models.
– Supports composition of data mining primitives (which makes it easy to specify machine learning algorithms and pre-/post-processing of data).
– Is extensible.
– Is "safe" to deploy in enterprise IT operational environments.
• This philosophy is different from, and complementary to, that of the Predictive Model Markup Language (PMML).
35. Benefits of PFA
• PFA is based upon JSON and Avro and integrates easily into modern big data environments.
• PFA allows models to be easily chained and composed (see the sketch after this list).
• PFA allows developers and users of analytic systems to pre-process inputs and to post-process outputs of models.
• PFA is easily integrated with Storm, Akka, and other streaming environments.
• PFA can be used to integrate multiple tools and applications within an analytic ecosystem.
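A small sketch of the chaining point (hypothetical; `engines` is any sequence of loaded PFA engines, e.g. pre-processor, model, post-processor):

```python
def chain(engines, record):
    """Compose PFA engines by feeding each engine's output to the next,
    e.g. pre-processor -> model -> post-processor."""
    for engine in engines:
        record = engine.action(record)
    return record
```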
38. Example of a PFA model
input: {type: array, items: double}
output: {type: array, items: double}
cells:
table:
type:
{type: array, items: {type: record, name: GP, fields: [
- {name: x, type: {type: array, items: double}}
- {name: to, type: {type: array, items: double}}
- {name: sigma, type: {type: array, items: double}}]}}
init:
- {x: [ 0, 0], to: [0.01870587, 0.96812508], sigma: [0.2, 0.2]}
- {x: [ 0, 36], to: [0.00242101, 0.95369720], sigma: [0.2, 0.2]}
- {x: [ 0, 72], to: [0.13131668, 0.53822666], sigma: [0.2, 0.2]}
...
- {x: [324, 324], to: [-0.6815587, 0.82271760], sigma: [0.2, 0.2]}
action:
model.reg.gaussianProcess:
- input
- {cell: table}
- null
- {fcn: m.kernel.rbf, fill: {gamma: 2.0}}
Callouts: the cell's type (also Avro) and its value (as JSON, truncated here); the table holds the Gaussian Process model parameters.
39. Example of a PFA model
input: {type: array, items: double}
output: {type: array, items: double}
cells:
table:
type:
{type: array, items: {type: record, name: GP, fields: [
- {name: x, type: {type: array, items: double}}
- {name: to, type: {type: array, items: double}}
- {name: sigma, type: {type: array, items: double}}]}}
init:
- {x: [ 0, 0], to: [0.01870587, 0.96812508], sigma: [0.2, 0.2]}
- {x: [ 0, 36], to: [0.00242101, 0.95369720], sigma: [0.2, 0.2]}
- {x: [ 0, 72], to: [0.13131668, 0.53822666], sigma: [0.2, 0.2]}
...
- {x: [324, 324], to: [-0.6815587, 0.82271760], sigma: [0.2, 0.2]}
action:
model.reg.gaussianProcess:
- input
- {cell: table}
- null
- {fcn: m.kernel.rbf, fill: {gamma: 2.0}}
Callouts on the action block:
• model.reg.gaussianProcess: the calling method; parameters expressed as JSON.
• input: get the interpolation point from the input.
• {cell: table}: get the parameters from the table.
• null: no explicit Kriging weight (universal).
• {fcn: …}: the kernel function.
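A sketch of executing a PFA document like the one above with Titus, Open Data Group's PFA engine for Python (assumed installed; the file name is illustrative):

```python
# A minimal sketch, assuming the Titus package is available.
from titus.genpy import PFAEngine

# pfa_document holds the YAML text of the model shown above.
with open("gaussian_process.pfa.yaml") as f:
    pfa_document = f.read()

# fromYaml returns a list of engine instances; we expect exactly one.
engine, = PFAEngine.fromYaml(pfa_document)

# Scoring one input is a single call to the engine's action() method.
print(engine.action([0.0, 36.0]))
```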
40. Example of a PFA model
• Appears declarative, but this is a function call.
– The fourth parameter is another function: m.kernel.rbf (radial basis kernel, a.k.a. squared exponential).
– m.kernel.rbf was intended for SVMs, but is reusable anywhere.
– One argument (gamma) is preapplied so that it fits the signature expected by model.reg.gaussianProcess (see the Python analogy below).
• Any kernel function could be used, including user-defined functions written in PFA "code."
• The Gaussian Process could be used anywhere, even as a pre-processing or post-processing step.
model.reg.gaussianProcess:
- input
- {cell: table}
- null
- {fcn: m.kernel.rbf, fill: {gamma: 2.0}}
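The preapplied gamma is ordinary partial application. A Python analogy (not PFA itself; the names are illustrative):

```python
import math
from functools import partial

def rbf(x, y, gamma):
    """Radial basis (squared exponential) kernel."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

# Pre-apply gamma so the kernel matches a two-argument signature,
# just as {fcn: m.kernel.rbf, fill: {gamma: 2.0}} does in PFA.
kernel = partial(rbf, gamma=2.0)
print(kernel([0.0, 0.0], [0.1, 0.2]))
```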
42. Ten AnalyticOps Rules
1. Team up a modeler, a software engineer, and a systems engineer.
2. Instrument and monitor the analytics, software, and systems, and populate an AnalyticOps dashboard.
3. Use an automated testing and deployment environment to improve model quality.
4. Use scoring engines with languages such as PFA & PMML.
5. Put in place a data quality program.
6. For complex workloads, use workflow systems and schedulers (even if you think you don't need them initially), and model the scale-up.
7. Optimize the end-to-end performance of AnalyticOps, not of individual analytics.
8. Distinguish scores from actions.
9. Identify and eliminate performance hot spots, system stragglers, etc.
10. Invest in root cause analysis of AnalyticOps problems.