Best Practices for Building and Deploying Predictive Models over Big Data
1. Best Practices for Building and Deploying Predictive Models Over Big Data
Robert Grossman, Open Data Group & Univ. of Chicago
Collin Bennett, Open Data Group
October 23, 2012
2. 1. Introduction
2. Building Predictive Models Using R – EDA, Features
3. Case Study: MalStone
4. Deploying Predictive Models
5. Working with Multiple Models
6. Case Study: CTA
7. Building Models over Hadoop
8. Case Study: Building Trees over Big Data
9. Analytic Operations – SAMS
10. Case Study: AdReady
11. Building Predictive Models – Life of a Model
12. Case Study: Matsu

The tutorial is divided into 12 modules. You can download all the modules and related materials from tutorials.opendatagroup.com.
3. Best Practices for Building and Deploying Predictive Models Over Big Data
Module 1: Introduction
Robert Grossman, Open Data Group & Univ. of Chicago
Collin Bennett, Open Data Group
October 23, 2012
4. What Is Small Data?
• 100 million movie ratings
• 480 thousand customers
• 17,000 movies
• From 1998 to 2005
• Less than 2 GB of data
• Fits into memory, but very sophisticated models are required to win.
5. What Is Big Data?
• The leaders in big data measure data in megawatts.
– As in, Facebook's leased data centers are typically between 2.5 MW and 6.0 MW.
– Facebook's new Prineville data center is 30 MW.
• At the other end of the spectrum, databases can manage the data required for many projects.
6. An algorithm and computing infrastructure is "big-data scalable" if adding a rack of data (and corresponding processors) allows you to compute over more data in the same time.
7. What Is Predictive Analytics?
Short Definition
• Using data to make decisions.
Longer Definition
• Using data to take actions and make decisions using models that are statistically valid and empirically derived.
9. Key Modeling Activities
§ Exploratory data analysis – goal is to understand the data so that features can be generated and the model evaluated
§ Generating features – goal is to shape events into meaningful features that enable meaningful prediction of the dependent variable
§ Building models – goal is to derive statistically valid models from a learning set
§ Evaluating models – goal is to develop meaningful reports so that the performance of one model can be compared to another model
12. What Is Analytic Infrastructure?
Analytic infrastructure refers to the software components, software services, applications and platforms for managing data, processing data, producing analytic models, and using analytic models to generate alerts, take actions and make decisions.
16. [Diagram: a SAM (deployed model) produces scores; measures feed dashboards and drive actions.]
17. [Diagram: Analytic Infrastructure (operational data → ETL → data warehouse → analytic datamart), Analytic Modeling (data cleaning & exploratory data analysis → build features → candidate model → validated model), and Analytic Operations (deployment → scores → measures → actions).]
18. Life Cycle of a Predictive Model
Analytic modeling: exploratory data analysis (R); process the data (Hadoop); build model in dev/modeling environment; export as PMML.
Operations: deploy model in operational systems with a scoring application; refine model against log files (Hadoop, streams & Augustus); retire model and deploy improved model.
19. Key Abstractions

Abstraction | Credit Card Fraud | Online advertising | Change Detection
Events | Credit card transactions | Impressions and clicks | Status updates
Entities | Credit card accounts | Cookies, user IDs | Computers, devices, etc.
Features (examples) | # transactions past n minutes | # impressions per category, previous time period | # events in a category past n minutes, RFM related, etc.
Models | CART | Clusters, trees, recommendation, etc. | Baseline models
Scores | Indicates likelihood of fraudulent account | Likelihood of clicking | Indicates likelihood that a change has taken place
20. Event-Based Updating of Feature Vectors
[Diagram: events flow in; feature vectors (FVs) are updated and rescored to produce scores and alerts.]
Event loop:
1. take next event
2. update one or more associated FVs
3. rescore updated FVs to compute alerts
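The three-step event loop above can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's code: the feature-vector layout, update rule, scoring function, and threshold are all assumptions.

```python
# Sketch of the slide's event loop: take an event, update the associated
# feature vector (FV), rescore it, and emit an alert when warranted.
# The FV fields, update rule, score, and threshold are illustrative.
from collections import defaultdict

THRESHOLD = 5

def make_fv():
    return {"count": 0}

def update(fv, event):
    # step 2: update the FV associated with the event's entity
    fv["count"] += 1

def score(fv):
    # step 3: rescore the updated FV; here the score is just the count
    return fv["count"]

def event_loop(events):
    fvs = defaultdict(make_fv)
    alerts = []
    for event in events:          # step 1: take the next event
        fv = fvs[event["entity"]]
        update(fv, event)
        if score(fv) > THRESHOLD:
            alerts.append(event["entity"])
    return alerts

alerts = event_loop([{"entity": "a"}] * 7 + [{"entity": "b"}] * 2)
print(alerts)  # ['a', 'a'] -- "a" alerts once its count exceeds 5
```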
21. Example Architecture
• Data lives in the Hadoop Distributed File System (HDFS) or other distributed file systems.
• Hadoop streams and MapReduce are used to process the data.
• R is used for Exploratory Data Analysis (EDA) and models are exported as PMML.
• PMML models are imported, and Hadoop, Python streams and Augustus are used to score data in production.
22. Questions?
For the most current version of these slides, please see tutorials.opendatagroup.com
23. Best Practices for Building and Deploying Predictive Models Over Big Data
Module 2: Building Models – Data Exploration and Feature Generation
Robert Grossman, Open Data Group & Univ. of Chicago
Collin Bennett, Open Data Group
October 23, 2012
24. Key Modeling Activities
§ Exploratory data analysis – goal is to understand the data so that features can be generated and the model evaluated
§ Generating features – goal is to shape events into meaningful features that enable meaningful prediction of the dependent variable
§ Building models – goal is to derive statistically valid models from a learning set
§ Evaluating models – goal is to develop meaningful reports so that the performance of one model can be compared to another model
25. ? Naïve Bayes
? K nearest neighbor
1957 K-means
1977 EM Algorithm
1984 CART
1993 C4.5
1994 A Priori
1995 SVM
1997 AdaBoost
1998 PageRank

Beware of any vendor whose unique value proposition includes some "secret analytic sauce."

Source: X. Wu, et al., Top 10 algorithms in data mining, Knowledge and Information Systems, Volume 14, 2008, pages 1-37.
26. Pessimistic View: We Get a Significantly New Algorithm Every Decade or So
1970's neural networks
1980's classification and regression trees
1990's support vector machines
2000's graph algorithms
27. In general, understanding the data through exploratory analysis and generating good features is much more important than the type of predictive model you use.
28. Questions About Datasets
• Number of records
• Number of data fields
• Size
– Does it fit into memory?
– Does it fit on a single disk or disk array?
• How many missing values?
• How many duplicates?
• Are there labels (for the dependent variable)?
• Are there keys?
29. Questions About Data Fields
• Data fields
– What is the mean, variance, …?
– What is the distribution? Plot the histograms.
– What are extreme values, missing values, unusual values?
• Pairs of data fields
– What is the correlation? Plot the scatter plots.
• Visualize the data
– Histograms, scatter plots, etc.
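The per-field and per-pair questions above can be answered in a few lines of standard-library Python; the sample columns here are illustrative, not part of any dataset in the tutorial.

```python
# Answering the slide's questions for one field and one pair of fields:
# mean, variance, a crude histogram, and a correlation. Sample values
# are illustrative.
import statistics

field = [2.0, 4.0, 4.0, 6.0, 9.0]
other = [1.0, 2.0, 2.0, 3.0, 4.5]

print(statistics.mean(field))      # 5.0
print(statistics.variance(field))  # 7.0 (sample variance)

def pearson(xs, ys):
    # correlation for a pair of data fields
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(pearson(field, other))       # 1.0 (other is exactly field / 2)

# crude histogram: counts per unit-width bin
hist = {}
for x in field:
    hist[int(x)] = hist.get(int(x), 0) + 1
print(hist)                        # {2: 1, 4: 2, 6: 1, 9: 1}
```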
32. Computing Count Distributions

def valueDistrib(b, field, select=None):
    # given a dataframe b and an attribute field,
    # returns the class distribution of the field
    ccdict = {}
    if select:
        for i in select:
            count = ccdict.get(b[i][field], 0)
            ccdict[b[i][field]] = count + 1
    else:
        for i in range(len(b)):
            count = ccdict.get(b[i][field], 0)
            ccdict[b[i][field]] = count + 1
    return ccdict
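A quick usage sketch for valueDistrib as it appears on the slide, assuming the "dataframe" b is a list of dict records (one plausible reading; the sample records are illustrative):

```python
# valueDistrib as defined on the slide, applied to a toy list-of-dicts
# "dataframe". The species records are illustrative.
def valueDistrib(b, field, select=None):
    # returns the class distribution of the field,
    # optionally restricted to the row indices in select
    ccdict = {}
    if select:
        for i in select:
            ccdict[b[i][field]] = ccdict.get(b[i][field], 0) + 1
    else:
        for i in range(len(b)):
            ccdict[b[i][field]] = ccdict.get(b[i][field], 0) + 1
    return ccdict

b = [{"species": "A"}, {"species": "C"}, {"species": "C"}, {"species": "B"}]
print(valueDistrib(b, "species"))                 # {'A': 1, 'C': 2, 'B': 1}
print(valueDistrib(b, "species", select=[1, 2]))  # {'C': 2}
```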
33. Features
• Normalize – [0, 1] or [-1, 1]
• Aggregate
• Discretize or bin
• Transform continuous to continuous or discrete to discrete
• Compute ratios
• Use percentiles
• Use indicator variables
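A few of the transforms listed above sketched in Python; the data, bin edges, and category are illustrative assumptions.

```python
# Sketches of three feature transforms from the list: normalization to
# [0, 1], discretization into bins, and an indicator variable.
def normalize01(xs):
    # map values linearly onto [0, 1]
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def to_bin(x, edges):
    # discretize: index of the first bin edge that x falls below
    for i, e in enumerate(edges):
        if x < e:
            return i
    return len(edges)

def indicator(x, category):
    # indicator variable: 1 if x equals the category, else 0
    return 1 if x == category else 0

xs = [2.0, 4.0, 6.0, 10.0]
print(normalize01(xs))            # [0.0, 0.25, 0.5, 1.0]
print(to_bin(6.0, [3, 5, 8]))     # 2
print(indicator("visa", "visa"))  # 1
```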
34. Predicting Office Building Renewal Using Text-Based Features
Classification Avg R Classification Avg R
imports 5.17 technology 2.03
clothing 4.17 apparel 2.03
van 3.18 law 2.00
fashion 2.50 casualty 1.91
investment 2.42 bank 1.88
marketing 2.38 technologies 1.84
oil 2.11 trading 1.82
air 2.09 associates 1.81
system 2.06 staffing 1.77
foundation 2.04 securities 1.77
35. Twitter Intention & Prediction
• Use machine learning classification techniques:
– Support Vector Machines
– Trees
– Naïve Bayes
• Models require clean data and lots of hand-scored examples
36. What Is the Size of Your Data?
• Small
– Fits into memory
• Medium
– Too large for memory
– But fits into a database
• Large
– Too large for a database
– But you can use a NoSQL database (HBase, etc.)
– Or a DFS (Hadoop DFS)
37. If You Can Ask For Something
• Ask for more data
• Ask for "orthogonal data"
38. Questions?
For the most current version of these slides, please see tutorials.opendatagroup.com
39. Best Practices for Building and Deploying Predictive Models Over Big Data
Module 3: Case Study - MalStone
Robert Grossman, Open Data Group & Univ. of Chicago
Collin Bennett, Open Data Group
October 23, 2012
40. Part 1. Statistics over Log Files
• Log files are everywhere
• Advertising systems
• Network and system logs for intrusion detection
• Health and status monitoring
41. What are the Common Elements?
• Time stamps
• Sites – e.g. web sites, computers, network devices
• Entities – e.g. visitors, users, flows
• Log files fill disks, many, many disks
• Behavior occurs at all scales
• Want to identify phenomena at all scales
• Need to group "similar behavior"
• Need to do statistics (not just sorting)
42. Abstract the Problem Using Site-Entity Logs

Example | Sites | Entities
Measuring online advertising | Web sites | Consumers
Drive-by exploits | Web sites | Computers (identified by cookies or IP)
Compromised systems | Compromised computers | User accounts
43. MalStone Schema
• Event ID
• Time stamp
• Site ID
• Entity ID
• Mark (categorical variable)
• Fit into 100 bytes
44. Toy Example
[Diagram] map/shuffle → reduce:
• Events collected by device or processor in time order
• Map events by site
• For each site, compute counts and ratios of events by type
45. Distributions
• Tens of millions of sites
• Hundreds of millions of entities
• Billions of events
• Most sites have a small number of events
• Some sites have many events
• Most entities visit a few sites
• Some visitors visit many sites
46. MalStone B
[Diagram: sites and entities over time, with time windows dk-2, dk-1, dk.]
47. The Mark Model
• Some sites are marked (the percent of marked sites is a parameter and the type of sites marked is a draw from a distribution)
• Some entities become marked after visiting a marked site (this is a draw from a distribution)
• There is a delay between the visit and when the entity becomes marked (this is a draw from a distribution)
• There is a background process that marks some entities independent of visits (this adds noise to the problem)
49. Notation
• Fix a site s[j]
• Let A[j] be entities that transact during ExpWin; if an entity is marked, then the visit occurs before the mark
• Let B[j] be all entities in A[j] that become marked sometime during the MonWin
• Subsequent proportion of marks is r[j] = | B[j] | / | A[j] |
50. ExpWin, MonWin 1, MonWin 2
B[j, t] are entities that become marked during MonWin[j]
r[j, t] = | B[j, t] | / | A[j] |
[Diagram: exposure and monitoring windows over time periods dk-2, dk-1, dk.]
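The subsequent proportion of marks r[j] can be computed directly from a site-entity log. A minimal sketch: the window bounds and the toy visit/mark records below are illustrative, and the log layout (entity, time pairs) is an assumption consistent with the MalStone schema.

```python
# Compute r[j] = |B[j]| / |A[j]| for one site from a toy site-entity log.
# A[j]: entities that transact at the site during the exposure window.
# B[j]: entities in A[j] that become marked during the monitoring window.
EXP_WIN = (0, 10)    # exposure window (start, end), illustrative
MON_WIN = (10, 20)   # monitoring window (start, end), illustrative

visits = [           # (entity, time) visits to site s[j]
    ("e1", 2), ("e2", 5), ("e3", 9),
]
marks = [            # (entity, time the entity became marked)
    ("e1", 12), ("e3", 25),
]

A = {e for e, t in visits if EXP_WIN[0] <= t < EXP_WIN[1]}
B = {e for e, t in marks if e in A and MON_WIN[0] <= t < MON_WIN[1]}
r = len(B) / len(A)
print(r)  # 1 entity of 3 becomes marked inside MonWin -> 0.333...
```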
51. Part 2. MalStone Benchmarks
code.google.com/p/malgen/
MalGen and MalStone implementations are open source
52. MalStone Benchmark
• Benchmark developed by the Open Cloud Consortium for clouds supporting data intensive computing.
• Code to generate the required synthetic data is available from code.google.com/p/malgen
• Stylized analytic computation that is easy to implement in MapReduce and its generalizations.
54. • MalStone B running on 10 billion 100-byte records
• Hadoop version 0.18.3
• 20 nodes in the Open Cloud Testbed
• MapReduce required 799 minutes
• Hadoop streams required 142 minutes
55. Importance of Hadoop Streams

MalStone B
Hadoop v0.18.3: 799 min
Hadoop Streaming v0.18.3: 142 min
Sector/Sphere: 44 min

# Nodes: 20 nodes
# Records: 10 Billion
Size of Dataset: 1 TB
56. Questions?
For the most current version of these slides, please see tutorials.opendatagroup.com
57. Best Practices for Building and Deploying Predictive Models Over Big Data
Module 4: Deploying Analytics
Robert Grossman, Open Data Group & Univ. of Chicago
Collin Bennett, Open Data Group
October 23, 2012
59. Problems with Current Techniques
• Analytics built in the modeling group and deployed by IT.
• Time required to deploy models and to integrate models with other applications can be long.
• Models are deployed in proprietary formats
• Models are application dependent
• Models are system dependent
• Models are architecture dependent
60. Model Independence
• A model should be independent of the environment:
o Production / Staging / Development
o MapReduce / Elastic MapReduce / Standalone
o Developer's desktop / Analyst's laptop
o Production Workflow / Historical Data Store
• A model should be independent of coding language.
61. (Very Simplified) Architectural View
[Diagram: Data → Model Producer (CART, SVM, k-means, etc.) → PMML Model]
• The Predictive Model Markup Language (PMML) is an XML language for statistical and data mining models (www.dmg.org).
• With PMML, it is easy to move models between applications and platforms.
62. Predictive Model Markup Language (PMML)
• Based on XML
• Benefits of PMML
– Open standard for Data Mining & Statistical Models
– Provides independence from application, platform, and operating system
• PMML Producers
– Some applications create predictive models
• PMML Consumers
– Other applications read or consume models
63. PMML
• PMML is the leading standard for statistical and data mining models and is supported by over 20 vendors and organizations.
• With PMML, it is easy to develop a model on one system using one application and deploy the model on another system using another application.
• Applications exist in R, Python, Java, SAS, SPSS, …
64. Vendor Support
• SPSS • SAS • IBM • Microstrategy • Salford • Tibco • Zementis • & other vendors
• R • Augustus • KNIME • & other open source applications
65. Philosophy
• Very important to understand what PMML is not concerned with …
• PMML is a specification of a model, not an implementation of a model
• PMML allows a simple means of binding parameters to values for an agreed upon set of data mining models & transformations
• Also, PMML includes the metadata required to deploy models
66. PMML Document Components
• Data dictionary
• Mining schema
• Transformation dictionary
• Multiple models, including segments and ensembles
• Model verification, …
• Univariate statistics (ModelStats)
• Optional extensions
67. PMML Models
• trees
• associations
• neural nets
• naïve Bayes
• general regression
• sequences
• text models
• support vector machines
• ruleset
• polynomial regression
• logistic regression
• center-based clusters
• density-based clusters
68. PMML Data Flow
• Data Dictionary defines data
• Mining Schema defines the specific inputs (MiningFields) required for the model
• Transformation Dictionary defines optional additional derived fields
• Attributes of feature vectors are of two types
– attributes defined by the mining schema
– derived attributes defined via transformations
• Models themselves can also support certain transformations
69. (Simplified) Architectural View
[Diagram: Data → Pre-processing → features → Model Producer (algorithms to estimate models) → PMML Model]
• PMML also supports XML elements to describe data preprocessing.
70. Three Important Interfaces
[Diagram] Modeling environment: data (1) → pre-processing → model producer → PMML model (2).
Deployment environment: PMML model (2) → model consumer (3) → post processing → scores and actions.
71. Single Model PMML
Order of evaluation: DD → MS → TD → LT → Model
[PMML Field Diagram (v8): the Data Dictionary (DD) maps data fields 1-to-1 to the Mining Schema (MS, MiningFields); the Transformation Dictionary (TD) produces derived fields (DF); Local Transformations (LT) feed the model (PMML supported models such as trees, regression, Naïve Bayes, baselines, etc.), which produces Targets and Output fields (OF); Model Verification (MV) checks model execution.]
72. Augustus PMML Producer
[Diagram: various data formats (including csv files, http streams, & databases) + Producer Schema (part of PMML) + Producer Configuration File → Model Producer → PMML Model]
Augustus (augustus.googlecode.com)
73. Augustus PMML Consumer
[Diagram: various data formats (including csv files, http streams, & databases) + PMML Model + Consumer Configuration File → Model Consumer → scores → Post Processing → actions, alerts, reports, etc.]
Augustus (augustus.googlecode.com)
74. Best Practices for Building and Deploying Predictive Models Over Big Data
Module 5: Multiple Models
Robert Grossman, Open Data Group & Univ. of Chicago
Collin Bennett, Open Data Group
October 23, 2012
75. 1. Trees
For CART trees: L. Breiman, J. Friedman, R. A. Olshen, C. J. Stone, Classification and Regression Trees, 1984, Chapman & Hall.
76. Classification Trees

Petal Len. | Petal Width | Sepal Len. | Sepal Width | Species
02 | 14 | 33 | 50 | A
24 | 56 | 31 | 67 | C
23 | 51 | 31 | 69 | C
13 | 45 | 28 | 57 | B

• Want a function Y = g(X), which predicts the red variable Y using one or more of the blue variables X[1], …, X[4]
• Assume each row is classified A, B, or C
77. Trees Partition Feature Space
[Diagram: decision tree and the corresponding partition of the (Petal Length, Petal Width) plane:
Petal Length < 2.45? → A;
else Petal Width < 1.75? → (Petal Length < 4.95? → B, else C);
else → C]
• Trees partition the feature space into regions by asking whether an attribute is less than a threshold.
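The partition on this slide can be written directly as a nested-if classifier; the thresholds are the ones in the diagram, and the inputs are assumed to be in the same units as the thresholds.

```python
# The slide's tree as nested ifs: each split asks whether an attribute
# is less than a threshold, carving feature space into regions A, B, C.
def classify(petal_len, petal_width):
    if petal_len < 2.45:
        return "A"
    if petal_width < 1.75:
        if petal_len < 4.95:
            return "B"
        return "C"
    return "C"

print(classify(1.4, 0.2))  # A
print(classify(4.5, 1.3))  # B
print(classify(5.6, 2.4))  # C
```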
81. Ensembles
• Ensembles are collections of classifiers with a rule for combining the results.
• The simplest combining rules:
– Use the average for regression trees.
– Use majority voting for classification.
• Ensembles are unreasonably effective at building models.
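The two simplest combining rules named above can be sketched in a few lines; the toy label and value lists stand in for the outputs of trees in an ensemble.

```python
# Combining rules from the slide: majority vote for classification and
# averaging for regression. Inputs are illustrative per-tree outputs.
from collections import Counter

def vote(labels):
    # majority voting for classification
    return Counter(labels).most_common(1)[0][0]

def average(values):
    # model averaging for regression
    return sum(values) / len(values)

print(vote(["A", "C", "A"]))     # A
print(average([1.0, 2.0, 6.0]))  # 3.0
```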
83. Ensemble Models
1. Partition data and scatter
2. Build models (e.g. tree-based model)
3. Gather models
4. Form collection of models into ensemble (e.g. majority vote for classification & model averaging for regression)
84. Example: Fraud
• Ensembles of trees have proved remarkably effective in detecting fraud
• Trees can be very large
• Very fast to score
• Lots of variants
– Random forests
– Random trees
• Sometimes need to reverse engineer reason codes.
85. 3. CUSUM
[Diagram: baseline distribution f0(x) and observed distribution f1(x).]
f0(x) = baseline density
f1(x) = observed density
g(x) = log(f1(x)/f0(x))
Z0 = 0
Zn+1 = max{Zn + g(x), 0}
Alert if Zn+1 > threshold
§ Assume that we have a mean and standard deviation for both distributions.
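The CUSUM recursion above can be sketched for Gaussian baseline and observed densities (means and standard deviations assumed known, as the slide states); the sample stream and threshold here are illustrative.

```python
# CUSUM as defined on the slide: g(x) = log(f1(x)/f0(x)),
# Z_{n+1} = max(Z_n + g(x), 0), alert when Z exceeds a threshold.
# f0 and f1 are taken to be Gaussian densities with known parameters.
import math

def log_likelihood_ratio(x, mu0, sd0, mu1, sd1):
    # g(x) = log f1(x) - log f0(x) for Gaussian f0, f1
    def logpdf(x, mu, sd):
        return -math.log(sd * math.sqrt(2 * math.pi)) - (x - mu) ** 2 / (2 * sd ** 2)
    return logpdf(x, mu1, sd1) - logpdf(x, mu0, sd0)

def cusum(stream, mu0=0.0, sd0=1.0, mu1=3.0, sd1=1.0, threshold=5.0):
    z, alerts = 0.0, []
    for n, x in enumerate(stream):
        z = max(z + log_likelihood_ratio(x, mu0, sd0, mu1, sd1), 0.0)
        if z > threshold:
            alerts.append(n)
    return alerts

# baseline-like values followed by shifted values trip the alarm
stream = [0.1, -0.2, 0.3, 3.1, 2.9, 3.2]
print(cusum(stream))  # alerts at indices 4 and 5
```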
87. Example 1: Cubes of Models for Change Detection for a Payments System
• Build a separate segmented model for every entity of interest
• Divide & conquer (segment) the data using multidimensional data cubes
• For each distinct cube, establish separate baselines for each quantity of interest
• Detect changes from baselines
[Diagram: cube with axes Geospatial region × Type of Transaction × Entity (bank, etc.); estimate separate baselines for each quantity of interest.]
88. Example 2 – Cubes of Models for Detecting Anomalies in Traffic
• Highway Traffic Data
– each day (7) x each hour (24) x each sensor (hundreds) x each weather condition (5) x each special event (dozens)
– 50,000 baseline models used in current testbed
89. 5. Multiple Models in Hadoop
• Trees in the Mapper
• Build lots of small trees (over historical data in Hadoop)
• Load all of them into the mapper
• Mapper
– For each event, score against each tree
– Map emits
• key \t value = event id \t tree id, score
• Reducer
– Aggregate scores for each event
– Select a "winner" for each event
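The trees-in-the-mapper pattern above can be sketched as plain functions; in Hadoop streaming the mapper and reducer would read stdin and write stdout. The two stand-in "trees" and the winner rule (average score) are illustrative assumptions, not the tutorial's ensemble.

```python
# Sketch: the mapper scores each event against every (stand-in) tree and
# emits (event id, (tree id, score)); the reducer aggregates scores per
# event and selects a winner (here, the average score across trees).
from collections import defaultdict

TREES = {                       # illustrative stand-ins for small trees
    "t1": lambda ev: 1.0 if ev["amount"] > 100 else 0.0,
    "t2": lambda ev: 0.5,
}

def mapper(events):
    for ev in events:
        for tree_id, tree in TREES.items():
            yield ev["id"], (tree_id, tree(ev))

def reducer(pairs):
    scores = defaultdict(list)
    for event_id, (tree_id, score) in pairs:
        scores[event_id].append(score)
    return {eid: sum(s) / len(s) for eid, s in scores.items()}

events = [{"id": "e1", "amount": 150}, {"id": "e2", "amount": 20}]
print(reducer(mapper(events)))  # {'e1': 0.75, 'e2': 0.25}
```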
90. Questions?
For the most current version of these slides, please see tutorials.opendatagroup.com
91. Best Practices for Building and Deploying Predictive Models Over Big Data
Module 6: Case Study - CTA
Robert Grossman, Open Data Group & Univ. of Chicago
Collin Bennett, Open Data Group
October 23, 2012
92. Chicago Transit Authority
• The Chicago Transit Authority provides service to Chicago and surrounding suburbs
• CTA operates 24 hours each day and on an average weekday provides 1.7 million rides on buses and trains
• It has 1,800 buses, operating over 140 routes, along 2,230 route miles
• Buses provide approximately 1 million rides a day and serve more than 12,000 posted bus stops
93. CTA Data
• The City of Chicago has a public data portal
• https://data.cityofchicago.org/
• Some of the available categories are:
• Administration & Finance
• Community & Economic Development
• Education
• Facilities & Geographic Boundaries
• Transportation
94. Moving to Hadoop
• For an example, we model ridership by bus route
Data set: CTA - Ridership - Bus Routes - Daily Totals by Route
• MapReduce
– Map Input:
• Route number, Day type, Number of riders
– Map Output / Reduce Input:
• key \t value = Route \t day, date, riders
– Reduce Output:
• Model for each route
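Slide 99 notes the mapper is written in Python against Hadoop's streaming interface. A minimal sketch of that map step: the csv column order (route, date, daytype, rides) is an assumption, and the toy records are illustrative.

```python
# Minimal Hadoop-streaming-style mapper for the CTA ridership data:
# read csv records, emit "route \t daytype,date,rides". The column
# order (route, date, daytype, rides) is an assumption.
import sys

def map_line(line):
    route, date, daytype, rides = line.strip().split(",")
    return "%s\t%s,%s,%s" % (route, daytype, date, rides)

def run(stream=sys.stdin):
    # in Hadoop streaming, records arrive on stdin and pairs leave on stdout
    for line in stream:
        if line.strip():
            print(map_line(line))

# toy records in the assumed column order
run(["3,01/01/2001,U,7354", "4,01/01/2001,U,9288"])
```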
95. Example Data Set
• Data goes back to 2001
– Route Number
– Date
– Day of Week Type
– Number of riders
• Demo size:
– 11 MB
– 534,208 records
• Saved as a csv file in HDFS
99. Mapper
• Using Hadoop's Streaming Interface, we can use a language appropriate for each step
• The mapper needs to read the records quickly
• We use Python
100. Reducer
• For this example, we calculate a simple statistic:
– Gaussian distribution of riders for (route, day of week) pairs, described by the mean and variance
• We use the Baseline Model in PMML to describe this.
• The Baseline model is often useful for Change Detection.
• Do the records matching a segment fit the expected distribution describing the segment?
102. Building PMML in the Reducer
• You can embed a PMML engine into the reducer if you are using
– Java
– R
– Python
• We present a light-weight solution which builds valid models, but also allows you to construct invalid ones
107. Reducer Trade-Offs
• Standard trade-offs apply
– Holding events in the reducer (easier, RAM) vs.
– Computing running statistics (more coding, CPU)
• We present two techniques using R
– Holding all events for a given key in a dataframe and then calculating the statistic when a key transition is identified
– Holding the minimum data needed to calculate running statistics, computed as each event is seen
108. Two Approaches
• Holding events requires extra memory and may get slow as the number of events shuffled to the reducer for a given key grows large
– But there are a lot of statistical and data mining methods available in R that operate on a dataframe
• Tracking running statistics requires more coding
– But may be the only way to process a batch without spilling to disk
109. Dataframe
• Accumulate each mapped record
• For each event, add it to a temporary table

doAny <- function(v, date, rides) {
    # add a row to the temporary table
    v[["rows"]][[length(v[["rows"]]) + 1]] <- list(date, rides)
}

• Create the dataframe when we have them all
110. Dataframe continued
• When you have seen all of the events, process and compute

doLast <- function(v) {
    # make a data frame out of the temporary table
    dataframe <- do.call("rbind", v[["rows"]])
    colnames(dataframe) <- c("date", "rides")
    # pull a column out and call the function on it
    rides <- unlist(dataframe[,"rides"])
    result <- ewup2(rides, alpha)
    count <- result[["count"]][[length(rides)]]
    mean <- result[["mean"]][[length(rides)]]
    variance <- result[["variance"]][[length(rides)]]
    varn <- result[["varn"]][[length(rides)]]
}
112. Running Totals continued
• Update statistics as each event is encountered

doLast <- function(v) {
    if (v[["count"]] >= 2) {
        # when N >= 2, correct for 1/(N - 1)
        variance <- v[["varn"]] / (v[["count"]] - 1.0)
    }
    else {
        variance <- v[["varn"]]
    }
}
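The running-statistics technique holds only a count, a running mean, and a running sum of squared deviations (varn), updating them per event and applying the 1/(N-1) correction at the end, as the R doLast above does. A Python sketch of the same idea (Welford-style updates); the sample ride counts are illustrative.

```python
# Running mean/variance in the spirit of the slide's reducer: keep only
# (count, mean, varn), update as each event is seen, and divide varn
# by N-1 at the end. Sample rides are illustrative.
def update(state, x):
    count, mean, varn = state
    count += 1
    delta = x - mean
    mean += delta / count
    varn += delta * (x - mean)  # running sum of squared deviations
    return count, mean, varn

def finalize(state):
    count, mean, varn = state
    variance = varn / (count - 1.0) if count >= 2 else varn
    return count, mean, variance

state = (0, 0.0, 0.0)
for rides in [100.0, 120.0, 80.0, 100.0]:
    state = update(state, rides)
print(finalize(state))  # count 4, mean 100.0, variance 800/3
```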
113. Results
• As new data are seen, models can be updated
• New routes are added as new segments
• Model is not bound to Hadoop
[Diagram: collection of per-route models.]
116. Questions?
For the most current version of these slides, please see tutorials.opendatagroup.com
117. About Robert Grossman
Robert Grossman is the Founder and a Partner of Open Data Group, which specializes in building predictive models over big data. He is the Director of Informatics at the Institute for Genomics and Systems Biology (IGSB) and a faculty member in the Computation Institute (CI) at the University of Chicago. His areas of research include: big data, predictive analytics, bioinformatics, data intensive computing and analytic infrastructure. He has led the development of new open source software tools for analyzing big data, cloud computing, data mining, distributed computing and high performance networking. Prior to starting Open Data Group, he founded Magnify, Inc. in 1996, which provides data mining solutions to the insurance industry. Grossman was Magnify's CEO until 2001 and its Chairman until it was sold to ChoicePoint in 2005. He blogs about big data, data science, and data engineering at rgrossman.com.
118. About Robert Grossman (cont'd)
• You can find more information about big data and related areas on my blog: rgrossman.com.
• Some of my technical papers are available online at rgrossman.com
• I just finished a book for the non-technical reader called The Structure of Digital Computing: From Mainframes to Big Data, which is available from Amazon.
119. About Collin Bennett
Collin Bennett is a principal at Open Data Group. In three and a half years with the company, Collin has worked on the open source Augustus scoring engine and a cloud-based environment for rapid analytic prototyping called RAP. Additionally, he has released open source projects for the Open Cloud Consortium. One of these, MalGen, has been used to benchmark several parallel computation frameworks. Previously, he led software development for the Product Development Team at Acquity Group, an IT consulting firm headquartered in Chicago. He also worked at startups Orbitz (when it still was one) and Business Logic Corporation. He has co-authored papers on Weyl tensors, large data clouds, and high performance wide area cloud testbeds. He holds degrees in English, Mathematics and Computer Science.
120. About Open Data
• Open Data began operations in 2001 and has built predictive models for companies for over ten years
• Open Data provides management consulting, outsourced analytic services, & analytic staffing
• For more information
• www.opendatagroup.com
• info@opendatagroup.com
121. About Open Data Group (cont'd)
Open Data Group has been building predictive models over big data for its clients for over ten years. Open Data Group is one of the pioneers using technologies such as Hadoop and NoSQL databases so that companies can build predictive models efficiently over all of their data. Open Data has also participated in the development of the Predictive Model Markup Language (PMML) so that companies can more quickly deploy analytic models into their operational systems. For the past several years, Open Data has also helped companies build predictive models and score data in real time using elastic clouds, such as provided with Amazon's AWS.
Open Data's consulting philosophy is nicely summed up by Marvin Bower, one of McKinsey & Company's founders. Bower led his firm using three rules: (1) put the interests of the client ahead of revenues; (2) tell the truth and don't be afraid to challenge a client's opinion; and (3) only agree to perform work that is necessary and something [you] can do well.
www.opendatagroup.com