How to Lower the Cost of Deploying Analytics: An Introduction to the Portable Format for Analytics
1. How to Lower the Cost of Deploying Analytics: An Introduction to the Portable Format for Analytics (PFA)
Robert L. Grossman
University of Chicago and Open Data Group
The Data Science Conference (Chicago), April 22, 2016
rgrossman.com | @bobgrossman
2. Life Cycle of a Predictive Model
Analytic modeling: select the analytic problem & approach; get and clean the data; exploratory data analysis; build the model in a dev/modeling environment.
Analytic operations: deploy the model in operational systems with a scoring application; scale up deployment; monitor performance and employ a champion-challenger methodology; retire the model and deploy an improved model. Perf. data feeds back into modeling.
3. Life Cycle of a Predictive Model (continued)
The same life-cycle diagram as the previous slide, now annotated with the boundary between the two environments: analytic modeling happens in the Model Env, analytic operations in the Deployment Env, with perf. data crossing back from deployment to modeling.
4. Differences Between the Modeling and Deployment Environments
• Typically, modelers use specialized languages such as SAS, SPSS, or R.
• Usually, developers responsible for products and services use languages such as Java, JavaScript, Python, C++, etc.
• This can result in significant effort moving the model from the modeling environment to the deployment environment.
5. Ways to Deploy Models into Products/Services/Operations
• Push code.
• Embed a static model into a product or service.
• Export and import tables of scores.
• Export and import tables of parameters.
• Have the product/service interact with the model as a web or message service.
• Import the models into a database.
How quickly can the model be updated?
• Model parameters?
• New features?
• New pre- & post-processing?
6. "I write all my models in R, why do I need a model interchange format?"
(Alice, Data Scientist)
8. Deploying Analytic Models
Analytic models flow into analytic operations: the Model Producer exports the model, and the Model Consumer, running on the analytic infrastructure, imports it. PMML & PFA are the interchange formats for the export/import step.
9. What is a Scoring Engine?
• A scoring engine is a component, integrated into products or enterprise IT, that deploys analytic models in operational workflows for products and services.
• A model interchange format is a format that supports the exporting of a model by one application and the importing of that model by another application.
• Model interchange formats include the Predictive Model Markup Language (PMML), the Portable Format for Analytics (PFA), and various in-house or custom formats.
• Scoring engines are integrated once, but allow applications to update models as quickly as reading a model interchange format file.
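The last bullet, updating a model as quickly as reading a file, can be sketched generically. The loop below is a hypothetical miniature scoring engine in Python; the "interchange format" here is an invented JSON table of linear coefficients, used purely for illustration, not PMML or PFA itself:

```python
import json

# Hypothetical miniature scoring engine: the engine is integrated once
# as a generic linear scorer, and swapping the model file updates the
# deployed model without touching the engine's code.
class ScoringEngine:
    def __init__(self):
        self.coef = {}

    def load_model(self, text):
        # "Importing" a model is just reading an interchange file.
        self.coef = json.loads(text)

    def score(self, record):
        # Linear score: sum of coefficient * feature value.
        return sum(self.coef.get(k, 0.0) * v for k, v in record.items())

engine = ScoringEngine()
engine.load_model('{"age": 0.5, "balance": 2.0}')   # champion model
s1 = engine.score({"age": 10, "balance": 1})        # 0.5*10 + 2.0*1 = 7.0
engine.load_model('{"age": 0.1, "balance": 3.0}')   # challenger model
s2 = engine.score({"age": 10, "balance": 1})        # 0.1*10 + 3.0*1 = 4.0
print(s1, s2)
```

Updating the model is a single `load_model` call, which is why the integrate-once, update-often pattern keeps deployment cost low.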
10. PMML Philosophy
• PMML is a specification of a model, not an implementation of a model.
• PMML allows a simple means of binding parameters to values for an agreed-upon set of data mining models & transformations.
• Because of the specification nature of PMML, a compliant scoring engine must support a large combinatorial set of specifications, and it can be challenging to develop a consistent scoring engine.
11. PFA Philosophy
• Define primitives for data transformations, data aggregations, and statistical and analytic models.
• Support composition of data mining primitives (which makes it easy to specify machine learning algorithms and pre-/post-processing of data).
• Be extensible.
• Be designed to be "safe" to deploy in enterprise IT operational environments.
• This philosophy is different from, and complementary to, that of the Predictive Model Markup Language (PMML).
12. PFA Case Study 1
• A 20+ person data science group developing models in R, Python, scikit-learn, and MATLAB.
• All the data scientists export their models in PFA.
• The company's product imports models in PFA and runs them on their customers' data as required.
(Diagram: widget records flow in, the model is exported and imported as PFA, and widget scores flow out.)
13. PFA Case Study 2
• Data scientist teams developing analytic models for an adversarial analytics project.
• Models developed in Hadoop and exported in PFA every 2 weeks.
• Models updated in client systems every week.
(Diagram: event records in, event scores out; models are exported in PFA during weeks 1/2, 3/4, …, imported during weeks 2/3, 4/5, …, and applied in weeks 2, 3, 4, 5, ….)
14. PFA Functionality
• PFA codes arbitrary mathematical algorithms in a tightly controlled environment.
• PFA has all the standard flow control of a programming language: if/then/else & for/while loops.
• PFA has function calls and function callbacks.
• PFA has algebraic data types.
• PFA is encoded as function calls in JSON: {function: [arg 1, arg 2, …, arg n]}
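The function-call-in-JSON encoding can be made concrete with a small sketch. The document below is a minimal PFA-style example built in Python, assuming PFA's "+" function and its top-level input/output/action layout; it is an illustration of the encoding, not an excerpt from the slides:

```python
import json

# A minimal PFA-style document. The action is a tree of
# {"function-name": [arg 1, arg 2, ...]} objects in JSON:
# here, a single call that adds 10 to the scalar input.
pfa_doc = {
    "input": "double",
    "output": "double",
    "action": [
        {"+": ["input", 10.0]}   # one function call, encoded as JSON
    ],
}

text = json.dumps(pfa_doc, indent=2)
print(text)
```

Because the whole model is plain JSON, any producer that can emit this structure and any consumer that can parse it can exchange models.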
17. Benefits of PFA
• PFA is based upon JSON and Avro and integrates easily into modern big data environments.
• PFA allows models to be easily chained and composed.
• PFA allows developers and users of analytic systems to pre-process inputs and to post-process outputs to models.
• PFA is easily integrated with Hadoop, Spark, etc.
• PFA is easily integrated with Kafka, Storm, Akka, and other streaming environments.
• PFA can be used to integrate multiple tools and applications within an analytic ecosystem.
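The chaining and pre-/post-processing benefits can be sketched generically. The helpers below are hypothetical Python stand-ins for composed scoring stages, not part of PFA; they illustrate why composition lets a pipeline of primitives act as a single engine:

```python
# Hypothetical sketch of model chaining: each stage is a plain function,
# and compose() wires pre-processing, a model, and post-processing into
# one scoring pipeline, mirroring how PFA composes primitives.
def compose(*stages):
    def pipeline(record):
        for stage in stages:
            record = stage(record)
        return record
    return pipeline

preprocess = lambda x: x / 100.0        # scale the raw input
model = lambda x: 2.0 * x + 1.0         # stand-in linear model
postprocess = lambda y: round(y, 2)     # format the score

score = compose(preprocess, model, postprocess)
result = score(250)   # 250 -> 2.5 -> 6.0 -> 6.0
print(result)
```

Swapping any stage (a different pre-processor, a different model) changes the pipeline without touching the other stages, which is the practical payoff of composition.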
20. Gaussian Process Model (3 of 5)
input: {type: array, items: double}
output: {type: array, items: double}
cells:
table:
type:
{type: array, items: {type: record, name: GP, fields: [
- {name: x, type: {type: array, items: double}}
- {name: to, type: {type: array, items: double}}
- {name: sigma, type: {type: array, items: double}}]}}
init:
- {x: [ 0, 0], to: [0.01870587, 0.96812508], sigma: [0.2, 0.2]}
- {x: [ 0, 36], to: [0.00242101, 0.95369720], sigma: [0.2, 0.2]}
- {x: [ 0, 72], to: [0.13131668, 0.53822666], sigma: [0.2, 0.2]}
...
- {x: [324, 324], to: [-0.6815587, 0.82271760], sigma: [0.2, 0.2]}
action:
model.reg.gaussianProcess:
- input
- {cell: table}
- null
- {fcn: m.kernel.rbf, fill: {gamma: 2.0}}
The cell declares a type (as Avro) and a value (as JSON, truncated here): the Gaussian Process model parameters.
Source: dmg.org/pfa
21. Gaussian Process Model (4 of 5)
input: {type: array, items: double}
output: {type: array, items: double}
cells:
table:
type:
{type: array, items: {type: record, name: GP, fields: [
- {name: x, type: {type: array, items: double}}
- {name: to, type: {type: array, items: double}}
- {name: sigma, type: {type: array, items: double}}]}}
init:
- {x: [ 0, 0], to: [0.01870587, 0.96812508], sigma: [0.2, 0.2]}
- {x: [ 0, 36], to: [0.00242101, 0.95369720], sigma: [0.2, 0.2]}
- {x: [ 0, 72], to: [0.13131668, 0.53822666], sigma: [0.2, 0.2]}
...
- {x: [324, 324], to: [-0.6815587, 0.82271760], sigma: [0.2, 0.2]}
action:
model.reg.gaussianProcess:
- input
- {cell: table}
- null
- {fcn: m.kernel.rbf, fill: {gamma: 2.0}}
Calling method: parameters expressed as JSON.
• input: get the interpolation point from the input
• {cell: table}: get the parameters from the table
• null: no explicit Kriging weight (universal)
• {fcn: …}: the kernel function
Source: dmg.org/pfa
22. Gaussian Process Model (5 of 5)
• Appears declarative, but this is a function call.
– The fourth parameter is another function: m.kernel.rbf (radial basis kernel, a.k.a. squared exponential).
– m.kernel.rbf was intended for SVM, but is reusable anywhere.
– One argument (gamma) is preapplied so that it fits the signature for model.reg.gaussianProcess.
• Any kernel function could be used, including user-defined functions written with PFA "code."
• The Gaussian Process could be used anywhere, even as a pre-processing or post-processing step.
model.reg.gaussianProcess:
- input
- {cell: table}
- null
- {fcn: m.kernel.rbf, fill: {gamma: 2.0}}
Source: dmg.org/pfa
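For intuition, the squared-exponential kernel that m.kernel.rbf names can be sketched directly. This is a generic illustration of the math, with gamma preapplied the way the fill form does above; it is not the PFA reference implementation:

```python
import math
from functools import partial

# Squared-exponential (RBF) kernel: k(x, y) = exp(-gamma * ||x - y||^2).
def rbf(gamma, x, y):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Preapplying gamma (as "fill: {gamma: 2.0}" does) leaves a
# two-argument kernel matching the signature the GP model expects.
kernel = partial(rbf, 2.0)

k_same = kernel([0.0, 0.0], [0.0, 0.0])   # identical points -> 1.0
k_far = kernel([0.0, 0.0], [1.0, 0.0])    # exp(-2.0) ~ 0.1353
print(k_same, k_far)
```

Partial application is why one argument can be fixed in the PFA document while the remaining two-argument function slots into model.reg.gaussianProcess.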
23. Summary
• The Portable Format for Analytics (PFA) is a model interchange format for building analytic models in one environment and deploying them in another.
• It is based upon data mining primitives.
• It supports pre-processing, analytic models, post-processing, and composition of primitives and models.
• You can easily add your own PFA models, since you can add your own PFA functions.
• There is a reference implementation and thousands of compliance tests.
• The standard is being developed by the not-for-profit DMG, which developed PMML.