AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production Environments
1. How to Make Analytic Operations Look More Like DevOps: Lessons Learned Moving Machine-Learning Algorithms to Production Environments
Robert L. Grossman
University of Chicago and Open Data Group
O'Reilly Strata Conference, March 30, 2016
rgrossman.com | @bobgrossman
3. Software Development → Quality Assurance → Operations: DevOps
The goal of DevOps is to establish a culture and an environment where building, testing, releasing, and operating software can happen rapidly, frequently, and more reliably.*
*Adapted from Wikipedia, en.wikipedia.org/wiki/DevOps.
4. Analytic Modeling → Quality Assurance → Analytic Operations: AnalyticOps
The goal of AnalyticOps is to establish a culture and an environment where building, validating, deploying, and running analytic models happen rapidly, frequently, and reliably.
5. Analytic Modeling → Quality Assurance → Analytic Operations: AnalyticOps
The goal of AnalyticOps is to establish a culture and an environment where building, validating, deploying, and running analytic models happen rapidly, frequently, and reliably.
• Software
• Model
• Data
6. Diagram: Analytic strategy and planning / Analytic models & algorithms / Analytic operations / Analytic infrastructure.*
*Source: Robert L. Grossman, The Strategy and Practice of Analytics, O'Reilly, 2016, to appear.
7. A Problem
There are platforms and tools for managing and processing big data (Hadoop) and for building analytics (SAS, SPSS, R, Statistica, Spark, Skytree, Mahout), but few options for deploying analytics into operations or for embedding analytics into products and services.
Diagram: data scientists developing analytic models & algorithms → (deploying analytics) → enterprise IT deploying analytics into products, services, and operations, over a common analytic infrastructure.
8. More Problems
Diagram: the same picture, with two additional flows: monitoring operational analytics, and ETL and datamarts for the modelers.
10. Life Cycle of a Predictive Model
Analytic modeling: select analytic problem & approach; exploratory data analysis; get and clean the data; build model in the dev/modeling environment.
Analytic operations: deploy model in operational systems with a scoring application; scale up deployment; monitor performance and employ champion-challenger methodology to develop an improved model; retire model and deploy the improved model.
The two halves are linked by model deployment in one direction and performance data in the other.
11. The same life cycle with the two halves labeled: the modeling side is ModelDev and the operations side is AnalyticOps, again linked by model deployment in one direction and performance data in the other.
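Since the deck never shows the champion-challenger step in code, here is a minimal, hypothetical Python sketch of it (the metric, the promotion threshold, and the function name are assumptions, not from the slides): both models score the same monitored traffic, and the challenger replaces the champion only when it clearly wins.

```python
from sklearn.metrics import roc_auc_score

def champion_challenger_round(champion, challenger, X, y_true, min_lift=0.01):
    """One round of the champion-challenger loop: score a monitored batch
    with both models and decide whether to retire the champion."""
    champ_auc = roc_auc_score(y_true, champion.predict_proba(X)[:, 1])
    chall_auc = roc_auc_score(y_true, challenger.predict_proba(X)[:, 1])
    # Require a minimum lift so noise alone does not trigger a model swap.
    if chall_auc > champ_auc + min_lift:
        return challenger, {"promoted": True, "champion_auc": champ_auc,
                            "challenger_auc": chall_auc}
    return champion, {"promoted": False, "champion_auc": champ_auc,
                      "challenger_auc": chall_auc}
```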
12. Differences Between the Modeling and Deployment Environments
• Typically, modelers use specialized languages such as SAS, SPSS, or R.
• Usually, developers responsible for products and services use languages such as Java, JavaScript, Python, C++, etc.
• This can result in significant effort moving the model from the modeling environment to the deployment environment.
13. Ways to Deploy Models into Products/Services/Operations
• Export and import tables of scores.
• Export and import tables of parameters.
• Have the product/service interact with the model as a web or message service (a sketch follows this slide).
• Import the models into a database.
• Embed the model into a product or service.
• Push code.
How quickly can the model be updated?
• Model parameters?
• New features?
• New pre- & post-processing?
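As one concrete reading of the web-service option, here is a minimal Flask sketch (the endpoint name, payload shape, and model file are illustrative assumptions): the product calls an HTTP endpoint, so the model behind it can be updated without redeploying the product.

```python
from flask import Flask, jsonify, request
import joblib  # assumed: the model was serialized with joblib

app = Flask(__name__)
model = joblib.load("model.pkl")  # hypothetical serialized model file

@app.route("/score", methods=["POST"])
def score():
    # Expect a JSON body such as {"features": [1.0, 2.0, 3.0]}.
    features = request.get_json()["features"]
    return jsonify({"score": float(model.predict([features])[0])})

if __name__ == "__main__":
    app.run(port=8080)
```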
14. What is a Scoring Engine?
• A scoring engine is a component, integrated into products or enterprise IT, that deploys analytic models in operational workflows for products and services.
• A model interchange format is a format that supports the export of a model by one application and the import of that model by another application (a sketch follows this slide).
• Model interchange formats include the Predictive Model Markup Language (PMML), the Portable Format for Analytics (PFA), and various in-house or custom formats.
• Scoring engines are integrated once, but allow applications to update models as quickly as reading a model interchange format file.
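To make the interchange-format idea concrete, here is a short sketch, assuming the sklearn2pmml package (the deck does not prescribe a specific exporter): the model producer exports a fitted pipeline as PMML, and any PMML-aware scoring engine can then import it.

```python
# A minimal sketch, assuming the sklearn2pmml package is installed.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

X, y = load_iris(return_X_y=True)

# Model producer: fit a pipeline and export it in an interchange format.
pipeline = PMMLPipeline([("classifier", DecisionTreeClassifier(max_depth=3))])
pipeline.fit(X, y)
sklearn2pmml(pipeline, "iris_tree.pmml")

# Model consumer: any PMML-aware scoring engine can now import
# iris_tree.pmml and serve scores without re-implementing the model.
```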
15. Deploying analytic models
Diagram: a Model Producer on the analytic algorithms & models side exports a model (as PMML or PFA); a Model Consumer on the analytic operations side imports it; both run over the analytic infrastructure.
16. Case Study 2: Scaling Bioinformatics Pipelines for the Genomic Data Commons*
*This case study describes work by the NCI Genomic Data Commons Project and the University of Chicago Center for Data Intensive Science.
17. AnalyticOps for the Genomic Data Commons
TCGA dataset: 1.54 PB consisting of 577,878 files about 14,052 cases (patients), in 42 cancer types, across 29 primary sites.
2.5+ PB of cancer genomics data + Bionimbus data commons technology, running multiple community-developed variant calling pipelines.
Over 12,000 cores and 10 PB of raw storage in 18+ racks, running for months.
18. DevOps
• Virtualization and the requirement for massive scale-out spawned infrastructure automation ("infrastructure as code").
• The requirement to reduce the time to deploy code created tools for continuous integration and testing.
19. ModelDev / AnalyticOps
• Use virtualization/containers, infrastructure automation, and scale-out to support large-scale analytics.
• Requirement: reduce the time and cost to do high-quality analytics over large amounts of data.
24. Ten Factors Affecting AnalyticOps
• Model quality (confusion matrix; see the sketch after this list)
• Data quality (six dimensions)
• Lack of ground truth
• Software errors
• Workflow with monitoring
• Scheduling
• Bottlenecks, stragglers, hot spots, etc.
• Analytic configuration problems*
• System failures
• Human errors
*DMS = data-model-system
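For the first factor, model quality, the confusion matrix mentioned above can be tracked per scoring window; a minimal scikit-learn sketch with illustrative labels:

```python
from sklearn.metrics import confusion_matrix

# Monitored labels vs. model predictions for one scoring window (illustrative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```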
25. Monitor Data Quality and Model Performance and Summarize with Dashboards
Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
27. Data Quality: Batch Effects Can Be Significant
Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
28. Model Quality: Differences in Three Somatic Mutation Detection Algorithms
Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
29. Often Software Must Be Written so that It Can Be Run Efficiently in Automated Environments
• Generally, community software in bioinformatics is designed to be run manually over local clusters.
• Example:
– We patched one piece of software over 400 times so that it could run over 12,000 genomes.
– Although only 3.3% of genomes had problems, it required significant manual effort.
• AnalyticOps requires operating the software in automated environments.
30. Decide What Not to Compute
[Figure: histogram of VarScan rate; x-axis: Rate (GB/hour), 0.0 to 2.0; y-axis: Frequency, 0 to 1200.]
Manage these cases carefully.
31. Model Expected Performance
[Figure: scatter plot of processing time vs. tumor BAM size (GB).]
Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
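Slides 30 and 31 together suggest a simple guard: model expected processing time as a function of input size, then decide what not to compute. A hypothetical Python sketch (the data, the linear model, and the budget are illustrative assumptions, not from the deck):

```python
import numpy as np

# Illustrative history: tumor BAM size (GB) vs. processing time (hours).
sizes = np.array([10.0, 40.0, 80.0, 150.0, 220.0, 300.0])
times = np.array([1.0, 3.5, 7.0, 13.0, 20.0, 27.0])

# Fit a simple linear model of expected processing time vs. input size.
slope, intercept = np.polyfit(sizes, times, deg=1)

def should_compute(size_gb, budget_hours=24.0):
    """Decide what not to compute: route cases whose expected processing
    time exceeds the budget to manual handling instead of running them."""
    expected_hours = slope * size_gb + intercept
    return expected_hours <= budget_hours
```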
32. Case Study 3: Deploying Gaussian Process Models to the Industrial Internet*
*Thanks to the DMG PMML and PFA Working Groups.
34. PFA is Based Upon Defining Primitives for Analytic Models
• What would a standard look like that…
– Defines primitives for data transformations, data aggregations, and statistical and analytic models.
– Supports composition of data mining primitives (which makes it easy to specify machine learning algorithms and pre-/post-processing of data).
– Is extensible.
– Is "safe" to deploy in enterprise IT operational environments.
• This philosophy is different from, and complementary to, that of the Predictive Model Markup Language (PMML).
35. Benefits of PFA
• PFA is based upon JSON and Avro and integrates easily into modern big data environments.
• PFA allows models to be easily chained and composed (see the sketch after this list).
• PFA allows developers and users of analytic systems to pre-process inputs and to post-process outputs of models.
• PFA is easily integrated with Storm, Akka, and other streaming environments.
• PFA can be used to integrate multiple tools and applications within an analytic ecosystem.
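A small sketch of the chaining point (hypothetical; `engines` is any sequence of loaded PFA engines, e.g. pre-processor, model, post-processor):

```python
def chain(engines, record):
    """Compose PFA engines by feeding each engine's output to the next,
    e.g. pre-processor -> model -> post-processor."""
    for engine in engines:
        record = engine.action(record)
    return record
```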
38. Example of a PFA model
input: {type: array, items: double}
output: {type: array, items: double}
cells:
table:
type:
{type: array, items: {type: record, name: GP, fields: [
- {name: x, type: {type: array, items: double}}
- {name: to, type: {type: array, items: double}}
- {name: sigma, type: {type: array, items: double}}]}}
init:
- {x: [ 0, 0], to: [0.01870587, 0.96812508], sigma: [0.2, 0.2]}
- {x: [ 0, 36], to: [0.00242101, 0.95369720], sigma: [0.2, 0.2]}
- {x: [ 0, 72], to: [0.13131668, 0.53822666], sigma: [0.2, 0.2]}
...
- {x: [324, 324], to: [-0.6815587, 0.82271760], sigma: [0.2, 0.2]}
action:
model.reg.gaussianProcess:
- input
- {cell: table}
- null
- {fcn: m.kernel.rbf, fill: {gamma: 2.0}}
Callouts: the cell's type (also Avro) and its value (as JSON, truncated here); the table holds the Gaussian Process model parameters.
39. Example of a PFA model
input: {type: array, items: double}
output: {type: array, items: double}
cells:
table:
type:
{type: array, items: {type: record, name: GP, fields: [
- {name: x, type: {type: array, items: double}}
- {name: to, type: {type: array, items: double}}
- {name: sigma, type: {type: array, items: double}}]}}
init:
- {x: [ 0, 0], to: [0.01870587, 0.96812508], sigma: [0.2, 0.2]}
- {x: [ 0, 36], to: [0.00242101, 0.95369720], sigma: [0.2, 0.2]}
- {x: [ 0, 72], to: [0.13131668, 0.53822666], sigma: [0.2, 0.2]}
...
- {x: [324, 324], to: [-0.6815587, 0.82271760], sigma: [0.2, 0.2]}
action:
model.reg.gaussianProcess:
- input
- {cell: table}
- null
- {fcn: m.kernel.rbf, fill: {gamma: 2.0}}
Callouts on the action block:
• model.reg.gaussianProcess: the calling method; parameters expressed as JSON.
• input: get the interpolation point from the input.
• {cell: table}: get the parameters from the table.
• null: no explicit Kriging weight (universal).
• {fcn: …}: the kernel function.
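A sketch of executing a PFA document like the one above with Titus, Open Data Group's PFA engine for Python (assumed installed; the file name is illustrative):

```python
# A minimal sketch, assuming the Titus package is available.
from titus.genpy import PFAEngine

# pfa_document holds the YAML text of the model shown above.
with open("gaussian_process.pfa.yaml") as f:
    pfa_document = f.read()

# fromYaml returns a list of engine instances; we expect exactly one.
engine, = PFAEngine.fromYaml(pfa_document)

# Scoring one input is a single call to the engine's action() method.
print(engine.action([0.0, 36.0]))
```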
40. Example of a PFA model
• Appears declarative, but this is a function call.
– The fourth parameter is another function: m.kernel.rbf (radial basis kernel, a.k.a. squared exponential).
– m.kernel.rbf was intended for SVMs, but is reusable anywhere.
– One argument (gamma) is preapplied so that it fits the signature expected by model.reg.gaussianProcess (see the Python analogy below).
• Any kernel function could be used, including user-defined functions written in PFA "code."
• The Gaussian Process could be used anywhere, even as a pre-processing or post-processing step.
model.reg.gaussianProcess:
- input
- {cell: table}
- null
- {fcn: m.kernel.rbf, fill: {gamma: 2.0}}
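The preapplied gamma is ordinary partial application. A Python analogy (not PFA itself; the names are illustrative):

```python
import math
from functools import partial

def rbf(x, y, gamma):
    """Radial basis (squared exponential) kernel."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

# Pre-apply gamma so the kernel matches a two-argument signature,
# just as {fcn: m.kernel.rbf, fill: {gamma: 2.0}} does in PFA.
kernel = partial(rbf, gamma=2.0)
print(kernel([0.0, 0.0], [0.1, 0.2]))
```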
42. Ten AnalyticOps Rules
1. Team up a modeler, a software engineer, and a systems engineer.
2. Instrument and monitor the analytics, software, and systems, and populate an AnalyticOps dashboard.
3. Use an automated testing and deployment environment to improve model quality.
4. Use scoring engines with languages such as PFA & PMML.
5. Put in place a data quality program.
6. For complex workloads, use workflow systems and schedulers (even if you think you don't need them initially), and model the scale-up.
7. Optimize the end-to-end performance of AnalyticOps, not of individual analytics.
8. Distinguish scores from actions.
9. Identify and eliminate performance hot spots, system stragglers, etc.
10. Invest in root cause analysis of AnalyticOps problems.