Pattern: an open source project for migrating predictive models onto Apache Hadoop


Paco Nathan
Concurrent, Inc.
San Francisco, CA
@pacoid

Copyright @2013, Concurrent, Inc.

Sunday, 17 March 2013

Pattern: predictive models at scale
[flow diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (RHS: Stop Word List) → Regex token → GroupBy token → Count → Word Count]

• Enterprise Data Workflows
• Sample Code
• A Little Theory…
• Pattern
• PMML
• Roadmap
• Customer Experiments


Cascading – origins

API author Chris Wensel worked as a system architect
at an Enterprise firm well-known for many popular
data products.
Wensel was following the Nutch open source project –
where Hadoop started.
Observation: would be difficult to find Java developers
to write complex Enterprise apps in MapReduce –
potential blocker for leveraging new open source
technology.


Cascading – functional programming

Key insight: MapReduce is based on functional programming
– back to LISP in 1970s. Apache Hadoop use cases are
mostly about data pipelines, which are functional in nature.
To ease staffing problems as “Main Street” Enterprise firms
began to embrace Hadoop, Cascading was introduced
in late 2007, as a new Java API to implement functional
programming for large-scale data workflows:

• leverages JVM and Java-based tools without any
need to create new languages
• allows programmers who have J2EE expertise
to leverage the economics of Hadoop clusters


functional programming… in production

• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc.,
have invested in open source projects atop Cascading
– used for their large-scale production deployments
• new case studies for Cascading apps are mostly
based on domain-specific languages (DSLs) in JVM
languages which emphasize functional programming:

Cascalog in Clojure (2010)
Scalding in Scala (2012)

github.com/nathanmarz/cascalog/wiki
github.com/twitter/scalding/wiki


Cascading – definitions

• a pattern language for Enterprise Data Workflows
• simple to build, easy to test, robust in production
• design principles ⟹ ensure best practices at scale

[diagram: Enterprise data workflow on a Hadoop cluster, with labels: Customers, Web App, logs, Cache, Support, source tap, sink tap, trap tap, Data Workflow, Modeling, PMML, Analytics Cubes, customer profile DBs, Customer Prefs, Reporting]


Cascading – usage

• Java API, DSLs in Scala, Clojure, Jython, JRuby, Groovy, ANSI SQL
• ASL 2 license, GitHub src, http://conjars.org
• 5+ yrs production use, multiple Enterprise verticals


Cascading – integrations

• partners: Microsoft Azure, Hortonworks, Amazon AWS, MapR, EMC, SpringSource, Cloudera
• taps: Memcached, Cassandra, MongoDB, HBase, JDBC, Parquet, etc.
• serialization: Avro, Thrift, Kryo, JSON, etc.
• topologies: Apache Hadoop, tuple spaces, local mode


Cascading – deployments

• case studies: Climate Corp, Twitter, Etsy,
Williams-Sonoma, uSwitch, Airbnb, Nokia,
YieldBot, Square, Harvard, etc.
• use cases: ETL, marketing funnel, anti-fraud,
social media, retail pricing, search analytics,
recommenders, eCRM, utility grids, telecom,
genomics, climatology, agronomics, etc.


workflow abstraction addresses:
• staffing bottleneck;
• system integration;
• operational complexity;
• test-driven development



• Sample Code
• Pattern
• PMML
• Roadmap


The Ubiquitous Word Count

Definition: count how often each word appears in a collection of text documents

This simple program provides an excellent test case for parallel processing, since it:

• requires a minimal amount of code
• demonstrates use of both symbolic and numeric values
• shows a dependency graph of tuples as an abstraction
• is not many steps away from useful search indexing
• serves as a “Hello World” for Hadoop apps

Any distributed computing framework which can run Word
Count efficiently in parallel at scale can handle much
larger and more interesting compute problems.

void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");

void reduce (String word, Iterator group):
  int count = 0;
  for each pc in group:
    count += Int(pc);
  emit(word, String(count));
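
The map and reduce pseudocode can be sketched in plain Java on a single machine. This is only an illustration of the two phases, not Hadoop or Cascading code: the class name `WordCountSketch`, the method names, and the delimiter regex are assumptions chosen for the example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    // "map" phase: split one document into (word, 1) pairs
    public static List<Map.Entry<String, Integer>> map(String docId, String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        // assumed delimiters: space, brackets, parentheses, comma, period
        for (String w : text.split("[ \\[\\](),.]+")) {
            if (!w.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(w, 1));
            }
        }
        return pairs;
    }

    // "reduce" phase: sum the partial counts for each word
    public static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // hypothetical input document, for illustration only
        List<Map.Entry<String, Integer>> pairs = map("doc01", "rain shadow, rain");
        System.out.println(reduce(pairs));
    }
}
```

In a real cluster the framework groups the emitted pairs by key between the two phases; here the `TreeMap` plays that role.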


word count – conceptual flow diagram

[flow diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]

1 map
1 reduce
18 lines code

cascading.org/category/impatient
gist.github.com/3900702


word count – Cascading app in Java

String docPath = args[ 0 ];
String wcPath = args[ 1 ];

Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps (tab-delimited, with header line)
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
 .addSource( docPipe, docTap )
 .addTailSink( wcPipe, wcTap );

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();


word count – generated flow diagram

[head]

Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']

[{2}:'doc_id', 'text']
[{2}:'doc_id', 'text']

map
Each('token')[RegexSplitGenerator[decl:'token'][args:1]]

[{1}:'token']
[{1}:'token']

GroupBy('wc')[by:['token']]

wc[{1}:'token']
[{1}:'token']

reduce
Every('wc')[Count[decl:'count']]

[{2}:'token', 'count']
[{1}:'token']

Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']


[tail]


word count – Cascalog / Clojure

(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\(\),.\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))

; Paul Lam
; github.com/Quantisan/Impatient


word count – Cascalog / Clojure

github.com/nathanmarz/cascalog/wiki

• implements Datalog in Clojure, with predicates backed
by Cascading – for a highly declarative language
• run ad-hoc queries from the Clojure REPL –
approx. 10:1 code reduction compared with SQL
• composable subqueries, used for test-driven development
(TDD) practices at scale
• Leiningen build: simple, no surprises, in Clojure itself
• more new deployments than other Cascading DSLs –
Climate Corp is largest use case: 90% Clojure/Cascalog
• has a learning curve, limited number of Clojure developers
• aggregators are the magic, and those take effort to learn


word count – Scalding / Scala

import com.twitter.scalding._

class WordCount(args : Args) extends Job(args) {
  Tsv(args("doc"),
      ('doc_id, 'text),
      skipHeader = true)
    .read
    .flatMap('text -> 'token) {
      text : String => text.split("[ \\[\\]\\(\\),.]")
    }
    .groupBy('token) { _.size('count) }
    .write(Tsv(args("wc"), writeHeader = true))
}



• extends the Scala collections API so that distributed lists
become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram
and function calls
• extensive libraries are available for linear algebra, abstract
algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale
• less of a learning curve than Cascalog


Cascalog and Scalding DSLs leverage the functional aspects
of MapReduce, helping limit complexity in process


Two Avenues to the App Layer…

Enterprise: must contend with
complexity at scale every day…
incumbents extend current practices and
infrastructure investments – using J2EE,
ANSI SQL, SAS, etc. – to migrate
workflows onto Apache Hadoop while
leveraging existing staff

Start-ups: crave complexity and
scale to become viable…
new ventures move into Enterprise space
to compete using relatively lean staff,
while leveraging sophisticated engineering
practices, e.g., Cascalog and Scalding

[diagram axes: complexity ➞, scale ➞]



• Sample Code
• Pattern
• PMML
• Roadmap


workflow abstraction – pattern language

Cascading uses a “plumbing” metaphor in the Java API,
to define workflows out of familiar elements: Pipes, Taps,
Tuple Flows, Filters, Joins, Traps, etc.

Data is represented as flows of tuples. Operations within
the flows bring functional programming aspects into Java

In formal terms, this provides a pattern language


references…

pattern language: a structured method for solving
large, complex design problems, where the syntax of
the language promotes the use of best practices

amazon.com/dp/0195019199

design patterns: the notion originated in consensus
negotiation for architecture, later applied in OOP
software engineering by “Gang of Four”


design principles of the pattern language ensure best practices
for robust, parallel data workflows at scale


workflow abstraction – literate programming

Cascading workflows generate their own visual
documentation: flow diagrams

In formal terms, flow diagrams leverage a methodology
called literate programming

Provides intuitive, visual representations for apps –
great for cross-team collaboration


references…

by Don Knuth
Literate Programming
Univ of Chicago Press, 1992
literateprogramming.com/

“Instead of imagining that our main task is
to instruct a computer what to do, let us
concentrate rather on explaining to human
beings what we want a computer to do.”


workflow abstraction – test-driven development

• assert patterns (regex) on the tuple streams
• adjust assert levels, like log4j levels
• trap edge cases as “data exceptions”
• TDD at scale:
  1. start from raw inputs in the flow graph
  2. define stream assertions for each stage of transforms
  3. verify exceptions, code to remove them
  4. when impl is complete, app has full test coverage

redirect traps in production
to Ops, QA, Support, Audit, etc.
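
The idea of asserting a regex on a tuple stream and diverting failures to a trap can be sketched in plain Java. This is a minimal illustration of the concept only; it does not use the Cascading assertion or trap APIs, and the class name `TrapSketch`, the `Result` holder, and the sample pattern are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class TrapSketch {
    // outcome of one stage: tuples that passed, and "data exceptions" sent to the trap
    public static class Result {
        public final List<String> passed = new ArrayList<>();
        public final List<String> trapped = new ArrayList<>();
    }

    // assert that every tuple matches the regex; divert non-matching tuples to the trap
    public static Result assertMatches(List<String> tuples, String regex) {
        Pattern pattern = Pattern.compile(regex);
        Result result = new Result();
        for (String tuple : tuples) {
            if (pattern.matcher(tuple).matches()) {
                result.passed.add(tuple);
            } else {
                result.trapped.add(tuple);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // hypothetical rows: an 8-character hex order id, a tab, then a 0/1 label
        List<String> rows = Arrays.asList("6f8e1014\t1", "bad row", "6f8ea22e\t0");
        Result r = assertMatches(rows, "[0-9a-f]{8}\\t[01]");
        System.out.println("passed: " + r.passed.size() + ", trapped: " + r.trapped.size());
    }
}
```

The trapped list stands in for a trap tap: the flow keeps running on good data while the edge cases are kept for inspection by another team.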


workflow abstraction – business process

Following the essence of literate programming, Cascading
workflows provide statements of business process
This recalls a sense of business process management
for Enterprise apps (think BPM/BPEL for Big Data)
Cascading creates a separation of concerns between
business process and implementation details (Hadoop, etc.)
This is especially apparent in large-scale Cascalog apps:
“Specify what you require, not how to achieve it.”
By virtue of the pattern language, the flow planner then
determines how to translate business process into efficient,
parallel jobs at scale


references…

by Edgar Codd
“A relational model of data for large shared data banks”
Communications of the ACM, 1970
dl.acm.org/citation.cfm?id=362685
Rather than arguing between SQL vs. NoSQL…
structured vs. unstructured data frameworks…
this approach focuses on what apps do:
the process of structuring data

Closely related to functional relational programming paradigm:
“Out of the Tar Pit”
Moseley & Marks 2006
http://goo.gl/SKspn


workflow abstraction – API design principles

• specify what is required, not how it must be achieved
• plan far ahead, before consuming cluster resources –
fail fast prior to submit
• fail the same way twice – deterministic flow planners
help reduce engineering costs for debugging at scale
• same JAR, any scale – app does not require a recompile
to change data taps or cluster topologies


workflow abstraction – building apps in layers

business process: separation of concerns: focus on specifying what is required, not how the computers
must accomplish it – not unlike BPM/BPEL for Big Data

test-driven development: assert expected patterns in tuple flows, adjust assertion levels, verify that tests fail,
code until tests pass, repeat … route exceptional data to appropriate department

pattern language: syntax of the pattern language conveys expertise – much like building a tower with
Lego blocks: ensure best practices for robust, parallel data workflows at scale

flow planner/optimizer: enables the functional programming aspects: compiler within a compiler, mapping
flows to topologies (e.g., create and sequence Hadoop job steps)

compiler/build: entire app is visible to the compiler: resolves issues of crossing boundaries for
troubleshooting, exception handling, notifications, etc.; one app = one JAR

topology: Apache Hadoop MR, IMDGs, etc. – upcoming MR2, etc.

JVM cluster: cluster scheduler, instrumentation, etc.


several theoretical aspects converge
into software engineering practices
which minimize the complexity of
building and maintaining Enterprise
data workflows



• Sample Code
• Pattern
• PMML
• Roadmap


Pattern – analytics workflows

• open source project – ASL 2, GitHub repo
• multiple companies contributing
• complementary to Apache Mahout – while leveraging
workflow abstraction, multiple topologies, etc.
• model scoring: generates workflows from PMML models
• model creation: estimation at scale, captured as PMML
• use sample Hadoop app at scale – no coding required
• integrate with 2 lines of Java (1 line Clojure or Scala)
• excellent use cases for customer experiments at scale

cascading.org/pattern


greatly reduced development costs, fewer
licensing issues at scale – leveraging the
economics of Apache Hadoop clusters,
plus the core competencies of analytics
staff, plus existing IP in predictive models



Pattern – model scoring

• migrate workloads: SAS, Teradata, etc.,
exporting predictive models as PMML
• great open source tools – R, Weka,
KNIME, Matlab, RapidMiner, etc.
• integrate with other libraries –
Matrix API, etc.
• leverage PMML as another kind
of DSL



Pattern – an example classifier

1. use customer order history as the training data set
2. train a risk classifier for orders, using Random Forest
3. export model from R to PMML
4. build a Cascading app to execute the PMML model
   4.1. generate flow from PMML description
   4.2. plan the flow for a topology (Hadoop)
   4.3. compile app to a JAR file
5. verify results with a regression test
6. deploy the app at scale to calculate scores
7. potentially, reuse classifier for real-time scoring

Pattern – an example classifier

[diagram labels: Cascading apps; risk classifier, dimension: customer 360; risk classifier, dimension: per-order; data prep; training data sets; analyst's laptop; PMML model; customer transactions; score new orders; predict model costs; detect fraudsters; anomaly detection; segment customers; velocity metrics; Hadoop batch workloads; IMDG real-time workloads; Customer DB; ETL; DW; chargebacks, etc.; partner data]


Pattern – create a model in R

## load the packages used below
library(randomForest)
library(pmml)
library(XML)

## train a RandomForest model

f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=50)

## test the model on the holdout test set

print(fit$importance)
print(fit)

predicted <- predict(fit, data)
data$predicted <- predicted
confuse <- table(pred = predicted, true = data[,1])
print(confuse)

## export predicted labels to TSV

write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
  quote=FALSE, sep="\t", row.names=FALSE)

## export RF model to PMML

saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))


Pattern – capture model parameters as PMML
<?xml version="1.0"?>
<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.dmg.org/PMML-4_0
http://www.dmg.org/v4-0/pmml-4-0.xsd">
<Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
  <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
  <Application name="Rattle/PMML" version="1.2.30"/>
  <Timestamp>2012-10-22 19:39:28</Timestamp>
</Header>
<DataDictionary numberOfFields="4">
  <DataField name="label" optype="categorical" dataType="string">
   <Value value="0"/>
   <Value value="1"/>
  </DataField>
  <DataField name="var0" optype="continuous" dataType="double"/>
</DataDictionary>
<MiningModel modelName="randomForest_Model" functionName="classification">
  <MiningSchema>
   <MiningField name="label" usageType="predicted"/>
   <MiningField name="var0" usageType="active"/>
  </MiningSchema>
  <Segmentation multipleModelMethod="majorityVote">
   <Segment id="1">
    <True/>
    <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit">
     <MiningSchema>
      <MiningField name="label" usageType="predicted"/>
     </MiningSchema>
...
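
The `Segmentation` element in the PMML listing uses `multipleModelMethod="majorityVote"`: each segment (one tree of the forest) predicts a label, and the most frequent label becomes the score. A minimal plain-Java sketch of that voting step follows. It is not the Pattern implementation; the class name `MajorityVoteSketch`, the stump "trees", and the tie rule (the first label to reach the top count wins) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class MajorityVoteSketch {
    // each "tree" maps a feature vector to a predicted label
    public static String score(List<Function<double[], String>> trees, double[] features) {
        Map<String, Integer> votes = new HashMap<>();
        String winner = null;
        int best = 0;
        for (Function<double[], String> tree : trees) {
            String label = tree.apply(features);
            int n = votes.merge(label, 1, Integer::sum);
            // assumed tie rule: the first label to reach the highest count wins
            if (n > best) {
                best = n;
                winner = label;
            }
        }
        return winner;
    }

    // three hypothetical decision stumps standing in for the forest's trees
    public static List<Function<double[], String>> exampleTrees() {
        List<Function<double[], String>> trees = new ArrayList<>();
        trees.add(x -> x[0] > 0.5 ? "1" : "0");
        trees.add(x -> x[1] > 0.5 ? "1" : "0");
        trees.add(x -> "0");
        return trees;
    }

    public static void main(String[] args) {
        System.out.println(score(exampleTrees(), new double[] { 1.0, 1.0 }));
    }
}
```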


Pattern – score a model, within an app
public class Main {
public static void main( String[] args ) {
  String pmmlPath = args[ 0 ];
  String ordersPath = args[ 1 ];
  String classifyPath = args[ 2 ];
  String trapPath = args[ 3 ];

  Properties properties = new Properties();
  AppProps.setApplicationJarClass( properties, Main.class );
  HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

  // create source and sink taps
  Tap ordersTap = new Hfs( new TextDelimited( true, "\t" ), ordersPath );
  Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
  Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );

  // define a "Classifier" model from PMML to evaluate the orders
  ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );
  Pipe classifyPipe = new Each( new Pipe( "classify" ), classFunc.getInputFields(), classFunc, Fields.ALL );

  // connect the taps, pipes, etc., into a flow
  FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
   .addSource( classifyPipe, ordersTap )
   .addTrap( classifyPipe, trapTap )
   .addSink( classifyPipe, classifyTap );

  // write a DOT file and run the flow
  Flow classifyFlow = flowConnector.connect( flowDef );
  classifyFlow.writeDOT( "dot/classify.dot" );
  classifyFlow.complete();
}
}


Pattern – score a model, using pre-defined Cascading app

[flow diagram: Customer Orders and PMML Model → Classify (M) → Scored Orders → Assert → GroupBy token (R) → Count → Confusion Matrix; failures go to Failure Traps]

## run an RF classifier at scale

hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
  --pmml data/sample.rf.xml

## run an RF classifier at scale, assert regression test, measure confusion matrix

hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
  --pmml data/sample.rf.xml --assert --measure out/measure

## run a predictive model at scale, measure RMSE

hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap \
  --pmml data/iris.lm_p.xml --rmse out/measure


Pattern – evaluating results

bash-3.2$ head out/classify/part-00000
label  var0  var1  var2  order_id  predicted  score
1      0     1     0     6f8e1014  1          1
0      0     0     1     6f8ea22e  0          0
1      0     1     0     6f8ea435  1          1
0      0     0     1     6f8ea5e1  0          0
1      0     1     0     6f8ea785  1          1
1      0     1     0     6f8ea91e  1          1
0      1     0     0     6f8eaaba  0          0
1      0     1     0     6f8eac54  1          1
0      1     1     0     6f8eade3  1          1


Lingual – connecting Hadoop and R

# load the JDBC package
library(RJDBC)

# set up the driver
drv <- JDBC("cascading.lingual.jdbc.Driver",
"~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")

# set up a database connection to a local repository
connection <- dbConnect(drv,
  "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")

# query the repository: in this case the MySQL sample database (CSV files)
df <- dbGetQuery(connection,
"SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
head(df)

# use R functions to summarize and visualize part of the data
df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
summary(df$hire_age)

library(ggplot2)
m <- ggplot(df, aes(x=hire_age))
m <- m + ggtitle("Age at hire, people named Gina")
m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()


Lingual – connecting Hadoop and R

> summary(df$hire_age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
20.86 27.89 31.70 31.61 35.01 43.92

cascading.org/lingual
launchpad.net/test-db



• Sample Code
• Pattern
• PMML
• Roadmap


PMML – standard

• established XML standard for predictive model markup
• organized by Data Mining Group (DMG), since 1997
http://dmg.org/
• members: IBM, SAS, Visa, NASA, Equifax, Microstrategy,
Microsoft, etc.
• PMML concepts for metadata, ensembles, etc., translate
directly into Cascading tuple flows

“PMML is the leading standard for statistical and data mining models and
supported by over 20 vendors and organizations. With PMML, it is easy
to develop a model on one system using one application and deploy the
model on another system using another application.”

wikipedia.org/wiki/Predictive_Model_Markup_Language


PMML – models

• Association Rules: AssociationModel element
• Cluster Models: ClusteringModel element
• Decision Trees: TreeModel element
• Naïve Bayes Classifiers: NaiveBayesModel element
• Neural Networks: NeuralNetwork element
• Regression: RegressionModel and GeneralRegressionModel elements
• Rulesets: RuleSetModel element
• Sequences: SequenceModel element
• Support Vector Machines: SupportVectorMachineModel element
• Text Models: TextModel element
• Time Series: TimeSeriesModel element

ibm.com/developerworks/industry/library/ind-PMML2/


PMML – vendor coverage



• Sample Code
• Pattern
• PMML
• Roadmap


roadmap – existing algorithms for scoring

• Random Forest
• Decision Trees
• Linear Regression
• GLM
• Logistic Regression
• K-Means Clustering
• Hierarchical Clustering
• Support Vector Machines



roadmap – top priorities for creating models at scale

• Random Forest
• Logistic Regression
• K-Means Clustering

a wealth of recent research indicates many opportunities
to parallelize popular algorithms for training models at scale
on Apache Hadoop…



roadmap – next priorities for scoring

• Time Series (ARIMA forecast)
• Association Rules (basket analysis)
• Naïve Bayes
• Neural Networks

algorithms extended based on customer use cases –
contact @pacoid




• Sample Code
• Pattern
• PMML
• Roadmap


experiments – comparing models

• much customer interest in leveraging Cascading and
Apache Hadoop to run customer experiments at scale
• run multiple variants, then measure relative “lift”
• Concurrent runtime – tag and track models

the following example compares two models trained
with different machine learning algorithms

this is exaggerated, one has an important variable
intentionally omitted to help illustrate the experiment


experiments – Random Forest model

## train a Random Forest model
## example: http://mkseo.pe.kr/stats/?p=220

f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
print(fit)
saveXML(pmml(fit), file=paste(out_folder, "sample.rf.xml", sep="/"))

OOB estimate of error rate: 14%
Confusion matrix:
    0   1 class.error
0  69  16   0.1882353
1  12 103   0.1043478


experiments – Logistic Regression model

## train a Logistic Regression model (special case of GLM)
## example: http://www.stat.cmu.edu/~cshalizi/490/clustering/clustering01.r

f <- as.formula("as.factor(label) ~ var0 + var2")
fit <- glm(f, family=binomial, data=data)
print(summary(fit))
saveXML(pmml(fit), file=paste(out_folder, "sample.lr.xml", sep="/"))

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.8524     0.3803   4.871 1.11e-06 ***
var0         -1.3755     0.4355  -3.159  0.00159 **
var2         -3.7742     0.5794  -6.514 7.30e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

NB: this model has “var1” intentionally omitted


experiments – comparing results

• use a confusion matrix to compare results for the classifiers
• Logistic Regression has a lower “false negative” rate (5% vs. 11%)
however it has a much higher “false positive” rate (52% vs. 14%)
• assign a cost model to select a winner –
for example, in an ecommerce anti-fraud classifier:
FN ∼ chargeback risk
FP ∼ customer support costs
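
One way to turn the error rates into a decision is an expected cost per order. The sketch below uses the FN and FP rates quoted on this slide; everything else is an assumption for illustration: the class name `CostModelSketch`, the reading of each rate as a share of the actual positives or negatives, the share of fraudulent orders, and the per-error costs.

```java
public class CostModelSketch {
    // expected cost per order, assuming fnRate applies to actual positives (fraud)
    // and fpRate applies to actual negatives (legitimate orders)
    public static double expectedCost(double fnRate, double fpRate,
                                      double positiveShare, double fnCost, double fpCost) {
        return positiveShare * fnRate * fnCost
             + (1.0 - positiveShare) * fpRate * fpCost;
    }

    // compare the two classifiers using the rates quoted on the slide
    public static String winner(double positiveShare, double fnCost, double fpCost) {
        double rf = expectedCost(0.11, 0.14, positiveShare, fnCost, fpCost);
        double lr = expectedCost(0.05, 0.52, positiveShare, fnCost, fpCost);
        return rf <= lr ? "Random Forest" : "Logistic Regression";
    }

    public static void main(String[] args) {
        // hypothetical inputs: 10% of orders fraudulent, chargeback costs 100, support call costs 5
        System.out.println(winner(0.1, 100.0, 5.0));
        // same share and chargeback cost, but a much cheaper support call
        System.out.println(winner(0.1, 100.0, 1.0));
    }
}
```

Under these made-up costs the winner changes when the price of a false positive drops, which is why the cost model, rather than either error rate alone, selects the classifier.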


references…

Enterprise Data Workflows with Cascading
O’Reilly, 2013


drill-down…

blog, dev community, code/wiki/gists, maven repo,
commercial products, career opportunities:
cascading.org
zest.to/group11
github.com/Cascading
conjars.org
goo.gl/KQtUL
concurrentinc.com
