How to Lower the Cost of Deploying Analytics: An Introduction to the Portable Format for Analytics
1. How to Lower the Cost of Deploying Analytics: An Introduction to the Portable Format for Analytics (PFA)
Robert L. Grossman
University of Chicago and Open Data Group
The Data Science Conference (Chicago), April 22, 2016
rgrossman.com | @bobgrossman
2. Life Cycle of a Predictive Model
Analytic modeling: select the analytic problem & approach; get and clean the data; exploratory data analysis; build the model in a dev/modeling environment.
Analytic operations: deploy the model in operational systems with a scoring application; scale up deployment; monitor performance and employ a champion-challenger methodology; retire the model and deploy an improved model. Perf. data feeds back into modeling.
3. Life Cycle of a Predictive Model (continued)
The same life-cycle diagram as the previous slide, now annotated with the boundary between the two environments: analytic modeling happens in the Model Env, analytic operations in the Deployment Env, with perf. data crossing back from deployment to modeling.
4. Differences Between the Modeling and Deployment Environments
• Typically, modelers use specialized languages such as SAS, SPSS, or R.
• Usually, developers responsible for products and services use languages such as Java, JavaScript, Python, C++, etc.
• This can result in significant effort moving the model from the modeling environment to the deployment environment.
5. Ways to Deploy Models into Products/Services/Operations
• Push code.
• Embed a static model into a product or service.
• Export and import tables of scores.
• Export and import tables of parameters.
• Have the product/service interact with the model as a web or message service.
• Import the models into a database.
How quickly can the model be updated?
• Model parameters?
• New features?
• New pre- & post-processing?
6. "I write all my models in R, why do I need a model interchange format?"
(Alice, Data Scientist)
8. Deploying Analytic Models
Analytic models flow into analytic operations: the Model Producer exports the model, and the Model Consumer, running on the analytic infrastructure, imports it. PMML & PFA are the interchange formats for the export/import step.
9. What is a Scoring Engine?
• A scoring engine is a component, integrated into products or enterprise IT, that deploys analytic models in operational workflows for products and services.
• A model interchange format is a format that supports the exporting of a model by one application and the importing of that model by another application.
• Model interchange formats include the Predictive Model Markup Language (PMML), the Portable Format for Analytics (PFA), and various in-house or custom formats.
• Scoring engines are integrated once, but allow applications to update models as quickly as reading a model interchange format file.
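The last bullet, updating a model as quickly as reading a file, can be sketched generically. The loop below is a hypothetical miniature scoring engine in Python; the "interchange format" here is an invented JSON table of linear coefficients, used purely for illustration, not PMML or PFA itself:

```python
import json

# Hypothetical miniature scoring engine: the engine is integrated once
# as a generic linear scorer, and swapping the model file updates the
# deployed model without touching the engine's code.
class ScoringEngine:
    def __init__(self):
        self.coef = {}

    def load_model(self, text):
        # "Importing" a model is just reading an interchange file.
        self.coef = json.loads(text)

    def score(self, record):
        # Linear score: sum of coefficient * feature value.
        return sum(self.coef.get(k, 0.0) * v for k, v in record.items())

engine = ScoringEngine()
engine.load_model('{"age": 0.5, "balance": 2.0}')   # champion model
s1 = engine.score({"age": 10, "balance": 1})        # 0.5*10 + 2.0*1 = 7.0
engine.load_model('{"age": 0.1, "balance": 3.0}')   # challenger model
s2 = engine.score({"age": 10, "balance": 1})        # 0.1*10 + 3.0*1 = 4.0
print(s1, s2)
```

Updating the model is a single `load_model` call, which is why the integrate-once, update-often pattern keeps deployment cost low.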
10. PMML Philosophy
• PMML is a specification of a model, not an implementation of a model.
• PMML allows a simple means of binding parameters to values for an agreed-upon set of data mining models & transformations.
• Because of the specification nature of PMML, a compliant scoring engine must support a large combinatorial set of specifications, and it can be challenging to develop a consistent scoring engine.
11. PFA Philosophy
• Define primitives for data transformations, data aggregations, and statistical and analytic models.
• Support composition of data mining primitives (which makes it easy to specify machine learning algorithms and pre-/post-processing of data).
• Be extensible.
• Be designed to be "safe" to deploy in enterprise IT operational environments.
• This philosophy is different from, and complementary to, that of the Predictive Model Markup Language (PMML).
12. PFA Case Study 1
• A 20+ person data science group developing models in R, Python, scikit-learn, and MATLAB.
• All the data scientists export their models in PFA.
• The company's product imports models in PFA and runs them on their customers' data as required.
(Diagram: widget records flow in, the model is exported and imported as PFA, and widget scores flow out.)
13. PFA Case Study 2
• Data scientist teams developing analytic models for an adversarial analytics project.
• Models developed in Hadoop and exported in PFA every 2 weeks.
• Models updated in client systems every week.
(Diagram: event records in, event scores out; models are exported in PFA during weeks 1/2, 3/4, …, imported during weeks 2/3, 4/5, …, and applied in weeks 2, 3, 4, 5, ….)
14. PFA Functionality
• PFA codes arbitrary mathematical algorithms in a tightly controlled environment.
• PFA has all the standard flow control of a programming language: if/then/else & for/while loops.
• PFA has function calls and function callbacks.
• PFA has algebraic data types.
• PFA is encoded as function calls in JSON: {function: [arg 1, arg 2, …, arg n]}
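The function-call-in-JSON encoding can be made concrete with a small sketch. The document below is a minimal PFA-style example built in Python, assuming PFA's "+" function and its top-level input/output/action layout; it is an illustration of the encoding, not an excerpt from the slides:

```python
import json

# A minimal PFA-style document. The action is a tree of
# {"function-name": [arg 1, arg 2, ...]} objects in JSON:
# here, a single call that adds 10 to the scalar input.
pfa_doc = {
    "input": "double",
    "output": "double",
    "action": [
        {"+": ["input", 10.0]}   # one function call, encoded as JSON
    ],
}

text = json.dumps(pfa_doc, indent=2)
print(text)
```

Because the whole model is plain JSON, any producer that can emit this structure and any consumer that can parse it can exchange models.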
17. Benefits of PFA
• PFA is based upon JSON and Avro and integrates easily into modern big data environments.
• PFA allows models to be easily chained and composed.
• PFA allows developers and users of analytic systems to pre-process inputs and to post-process outputs to models.
• PFA is easily integrated with Hadoop, Spark, etc.
• PFA is easily integrated with Kafka, Storm, Akka, and other streaming environments.
• PFA can be used to integrate multiple tools and applications within an analytic ecosystem.
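The chaining and pre-/post-processing benefits can be sketched generically. The helpers below are hypothetical Python stand-ins for composed scoring stages, not part of PFA; they illustrate why composition lets a pipeline of primitives act as a single engine:

```python
# Hypothetical sketch of model chaining: each stage is a plain function,
# and compose() wires pre-processing, a model, and post-processing into
# one scoring pipeline, mirroring how PFA composes primitives.
def compose(*stages):
    def pipeline(record):
        for stage in stages:
            record = stage(record)
        return record
    return pipeline

preprocess = lambda x: x / 100.0        # scale the raw input
model = lambda x: 2.0 * x + 1.0         # stand-in linear model
postprocess = lambda y: round(y, 2)     # format the score

score = compose(preprocess, model, postprocess)
result = score(250)   # 250 -> 2.5 -> 6.0 -> 6.0
print(result)
```

Swapping any stage (a different pre-processor, a different model) changes the pipeline without touching the other stages, which is the practical payoff of composition.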
20. Gaussian Process Model (3 of 5)
input: {type: array, items: double}
output: {type: array, items: double}
cells:
table:
type:
{type: array, items: {type: record, name: GP, fields: [
- {name: x, type: {type: array, items: double}}
- {name: to, type: {type: array, items: double}}
- {name: sigma, type: {type: array, items: double}}]}}
init:
- {x: [ 0, 0], to: [0.01870587, 0.96812508], sigma: [0.2, 0.2]}
- {x: [ 0, 36], to: [0.00242101, 0.95369720], sigma: [0.2, 0.2]}
- {x: [ 0, 72], to: [0.13131668, 0.53822666], sigma: [0.2, 0.2]}
...
- {x: [324, 324], to: [-0.6815587, 0.82271760], sigma: [0.2, 0.2]}
action:
model.reg.gaussianProcess:
- input
- {cell: table}
- null
- {fcn: m.kernel.rbf, fill: {gamma: 2.0}}
The cell declares a type (as Avro) and a value (as JSON, truncated here): the Gaussian Process model parameters.
Source: dmg.org/pfa
21. Gaussian Process Model (4 of 5)
input: {type: array, items: double}
output: {type: array, items: double}
cells:
table:
type:
{type: array, items: {type: record, name: GP, fields: [
- {name: x, type: {type: array, items: double}}
- {name: to, type: {type: array, items: double}}
- {name: sigma, type: {type: array, items: double}}]}}
init:
- {x: [ 0, 0], to: [0.01870587, 0.96812508], sigma: [0.2, 0.2]}
- {x: [ 0, 36], to: [0.00242101, 0.95369720], sigma: [0.2, 0.2]}
- {x: [ 0, 72], to: [0.13131668, 0.53822666], sigma: [0.2, 0.2]}
...
- {x: [324, 324], to: [-0.6815587, 0.82271760], sigma: [0.2, 0.2]}
action:
model.reg.gaussianProcess:
- input
- {cell: table}
- null
- {fcn: m.kernel.rbf, fill: {gamma: 2.0}}
Calling method: parameters expressed as JSON.
• input: get the interpolation point from the input
• {cell: table}: get the parameters from the table
• null: no explicit Kriging weight (universal)
• {fcn: …}: the kernel function
Source: dmg.org/pfa
22. Gaussian Process Model (5 of 5)
• Appears declarative, but this is a function call.
– The fourth parameter is another function: m.kernel.rbf (radial basis kernel, a.k.a. squared exponential).
– m.kernel.rbf was intended for SVM, but is reusable anywhere.
– One argument (gamma) is preapplied so that it fits the signature for model.reg.gaussianProcess.
• Any kernel function could be used, including user-defined functions written with PFA "code."
• The Gaussian Process could be used anywhere, even as a pre-processing or post-processing step.
model.reg.gaussianProcess:
- input
- {cell: table}
- null
- {fcn: m.kernel.rbf, fill: {gamma: 2.0}}
Source: dmg.org/pfa
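For intuition, the squared-exponential kernel that m.kernel.rbf names can be sketched directly. This is a generic illustration of the math, with gamma preapplied the way the fill form does above; it is not the PFA reference implementation:

```python
import math
from functools import partial

# Squared-exponential (RBF) kernel: k(x, y) = exp(-gamma * ||x - y||^2).
def rbf(gamma, x, y):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Preapplying gamma (as "fill: {gamma: 2.0}" does) leaves a
# two-argument kernel matching the signature the GP model expects.
kernel = partial(rbf, 2.0)

k_same = kernel([0.0, 0.0], [0.0, 0.0])   # identical points -> 1.0
k_far = kernel([0.0, 0.0], [1.0, 0.0])    # exp(-2.0) ~ 0.1353
print(k_same, k_far)
```

Partial application is why one argument can be fixed in the PFA document while the remaining two-argument function slots into model.reg.gaussianProcess.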
23. Summary
• The Portable Format for Analytics (PFA) is a model interchange format for building analytic models in one environment and deploying them in another.
• It is based upon data mining primitives.
• It supports pre-processing, analytic models, post-processing, and composition of primitives and models.
• You can easily add your own PFA models, since you can add your own PFA functions.
• There is a reference implementation and thousands of compliance tests.
• The standard is being developed by the not-for-profit DMG, which developed PMML.