Hadoop Summit: Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop

Paco Nathan
Concurrent, Inc.
San Francisco, CA
@pacoid
“Pattern – an open source project
for migrating predictive models
from SAS, etc., onto Hadoop”
1Tuesday, 25 June 13

Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads
Cascading: background
The Workflow Abstraction
PMML: Predictive Model Markup
Pattern: PMML in Cascading
PMML for Customer Experiments
Ensemble Models with Pattern
Workflow Design Pattern

Cascading – origins
API author Chris Wensel worked as a system architect
at an Enterprise firm well-known for many popular
data products.
Wensel was following the Nutch open source project –
where Hadoop started.
Observation: would be difficult to find Java developers
to write complex Enterprise apps in MapReduce –
potential blocker for leveraging new open source
technology.

Cascading – functional programming
Key insight: MapReduce is based on functional programming
– back to LISP in 1970s. Apache Hadoop use cases are
mostly about data pipelines, which are functional in nature.
To ease staffing problems as “Main Street” Enterprise firms
began to embrace Hadoop, Cascading was introduced
in late 2007, as a new Java API to implement functional
programming for large-scale data workflows:
• leverages JVM and Java-based tools without any
need to create new languages
• allows programmers who have J2EE expertise
to leverage the economics of Hadoop clusters

Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Cascading – deﬁnitions
• a pattern language for Enterprise Data Workﬂows
• simple to build, easy to test, robust in production
• design principles ⟹ ensure best practices at scale

Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Cascading – usage
• Java API, DSLs in Scala, Clojure,
Jython, JRuby, Groovy,ANSI SQL
• ASL 2 license, GitHub src,
http://conjars.org
• 5+ yrs production use,
multiple Enterprise verticals

Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Cascading – integrations
• partners: Microsoft Azure, Hortonworks,
Amazon AWS, MapR, EMC, SpringSource,
Cloudera
• taps: Memcached, Cassandra, MongoDB,
HBase, JDBC, Parquet, etc.
• serialization: Avro, Thrift, Kryo,
JSON, etc.
• topologies: Apache Hadoop,
tuple spaces, local mode

Cascading – deployments
• case studies: Climate Corp, Twitter, Etsy,
Williams-Sonoma, uSwitch, Airbnb, Nokia,
YieldBot, Square, Harvard, Factual, etc.
• use cases: ETL, marketing funnel, anti-fraud,
social media, retail pricing, search analytics,
recommenders, eCRM, utility grids, telecom,
genomics, climatology, agronomics, etc.

Cascading – deployments
• case studies: Climate Corp, Twitter, Etsy,
Williams-Sonoma, uSwitch, Airbnb, Nokia,
YieldBot, Square, Harvard, Factual, etc.
• use cases: ETL, marketing funnel, anti-fraud,
social media, retail pricing, search analytics,
recommenders, eCRM, utility grids, telecom,
genomics, climatology, agronomics, etc.
workﬂow abstraction addresses:
• stafﬁng bottleneck;
• system integration;
• operational complexity;
• test-driven development

Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads

Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Enterprise Data Workﬂows
Let’s consider a “strawman” architecture
for an example app… at the front end
LOB use cases drive demand for apps

Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Same example… in the back ofﬁce
Organizations have substantial investments
in people, infrastructure, process

Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Same example… the heavy lifting!
“Main Street” ﬁrms are migrating
workﬂows to Hadoop, for cost
savings and scale-out

Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Cascading workﬂows – taps
• taps integrate other data frameworks, as tuple streams
• these are “plumbing” endpoints in the pattern language
• sources (inputs), sinks (outputs), traps (exceptions)
• text delimited, JDBC, Memcached,
HBase, Cassandra, MongoDB, etc.
• data serialization: Avro, Thrift,
Kryo, JSON, etc.
• extend a new kind of tap in just
a few lines of Java
schema and provenance get
derived from analysis of the taps

Cascading workﬂows – taps
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "t" ), wcPath );
// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ [](),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
.addSource( docPipe, docTap )
.addTailSink( wcPipe, wcTap );
// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
source and sink taps
for TSV data in HDFS

Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Cascading workflows – topologies
• topologies execute workflows on clusters
• flow planner is like a compiler for queries
- Hadoop (MapReduce jobs)
- local mode (dev/test or special config)
- in-memory data grids (real-time)
• flow planner can be extended
to support other topologies
blend flows in different topologies
into the same app – for example,
batch (Hadoop) + transactions (IMDG)

Cascading workﬂows – topologies
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Tap docTap = new Hfs( new TextDelimited( true, "t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "t" ), wcPath );
// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ [](),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
.addSource( docPipe, docTap )
.addTailSink( wcPipe, wcTap );
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
ﬂow planner for
Apache Hadoop
topology

Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Cascading workflows – test-driven development
• assert patterns (regex) on the tuple streams
• adjust assert levels, like log4j levels
• trap edge cases as “data exceptions”
• TDD at scale:
1.start from raw inputs in the flow graph
2.define stream assertions for each stage
of transforms
3.verify exceptions, code to remove them
4.when impl is complete, app has full
test coverage
redirect traps in production
to Ops, QA, Support,Audit, etc.

Workflow Abstraction – pattern language
Cascading uses a “plumbing” metaphor in the Java API,
to define workflows out of familiar elements: Pipes, Taps,
Tuple Flows, Filters, Joins, Traps, etc.
Scrub
token
Document
Collection
Tokenize
Word
Count
GroupBy
token
Count
Stop Word
List
Regex
token
HashJoin
Left
RHS
M
R
Data is represented as flows of tuples. Operations within
the flows bring functional programming aspects into Java
In formal terms, this provides a pattern language

Pattern Language
structured method for solving large, complex design
problems, where the syntax of the language ensures
the use of best practices – i.e., conveying expertise
Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads
A Pattern Language
Christopher Alexander, et al.
amazon.com/dp/0195019199

Workflow Abstraction – literate programming
Cascading workflows generate their own visual
documentation: flow diagrams
in formal terms, flow diagrams leverage a methodology
called literate programming
provides intuitive, visual representations for apps –
great for cross-team collaboration
Scrub
token
Document
Collection
Tokenize
Word
Count
GroupBy
token
Count
Stop Word
List
Regex
token
HashJoin
Left
RHS
M
R

Literate Programming
by Don Knuth
Literate Programming
Univ of Chicago Press, 1992
literateprogramming.com/
“Instead of imagining that our main task is
to instruct a computer what to do, let us
concentrate rather on explaining to human
beings what we want a computer to do.”

Workflow Abstraction – business process
following the essence of literate programming, Cascading
workflows provide statements of business process
this recalls a sense of business process management
for Enterprise apps (think BPM/BPEL for Big Data)
Cascading creates a separation of concerns between
business process and implementation details (Hadoop, etc.)
this is especially apparent in large-scale Cascalog apps:
“Specify what you require, not how to achieve it.”
by virtue of the pattern language, the flow planner then
determines how to translate business process into efficient,
parallel jobs at scale

Business Process
by Edgar Codd
“A relational model of data for large shared data banks”
Communications of the ACM, 1970
dl.acm.org/citation.cfm?id=362685
rather than arguing between SQL vs. NoSQL…
structured vs. unstructured data frameworks…
this approach focuses on what apps do:
the process of structuring data

Cascading – functional programming
• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc.,
have invested in open source projects atop Cascading
– used for their large-scale production deployments
• new case studies for Cascading apps are mostly
based on domain-speciﬁc languages (DSLs) in JVM
languages which emphasize functional programming:
Cascalog in Clojure (2010)
Scalding in Scala (2012)
github.com/nathanmarz/cascalog/wiki
github.com/twitter/scalding/wiki
Why Adopting the Declarative Programming PracticesWill ImproveYour Return fromTechnology
Dan Woods, 2013-04-17 Forbes
forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-
practices-will-improve-your-return-from-technology/

Functional Programming for Big Data
WordCount with token scrubbing…
Apache Hive: 52 lines HQL + 8 lines Python (UDF)
compared to
Scalding: 18 lines Scala/Cascading
functional programming languages help reduce
software engineering costs at scale, over time

Two Avenues to the App Layer…
scale ➞
complexity➞
Enterprise: must contend with
complexity at scale everyday…
incumbents extend current practices and
infrastructure investments – using J2EE,
ANSI SQL, SAS, etc. – to migrate
workﬂows onto Apache Hadoop while
leveraging existing staff
Start-ups: crave complexity and
scale to become viable…
new ventures move into Enterprise space
to compete using relatively lean staff,
while leveraging sophisticated engineering
practices, e.g., Cascalog and Scalding

Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads

• established XML standard for predictive model markup
• organized by Data Mining Group (DMG), since 1997
http://dmg.org/
• members: IBM, SAS, Visa, NASA, Equifax, Microstrategy,
Microsoft, etc.
• PMML concepts for metadata, ensembles, etc., translate
directly into Cascading tuple ﬂows
“PMML is the leading standard for statistical and data mining models and
supported by over 20 vendors and organizations.With PMML, it is easy
to develop a model on one system using one application and deploy the
model on another system using another application.”
PMML – standard
wikipedia.org/wiki/Predictive_Model_Markup_Language

• Association Rules: AssociationModel element
• Cluster Models: ClusteringModel element
• Decision Trees: TreeModel element
• Naïve Bayes Classiﬁers: NaiveBayesModel element
• Neural Networks: NeuralNetwork element
• Regression: RegressionModel and GeneralRegressionModel elements
• Rulesets: RuleSetModel element
• Sequences: SequenceModel element
• SupportVector Machines: SupportVectorMachineModel element
• Text Models: TextModel element
• Time Series: TimeSeriesModel element
PMML – model coverage
ibm.com/developerworks/industry/library/ind-PMML2/

PMML – vendor coverage

Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads

Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Pattern – model scoring
• migrate workloads: SAS,Teradata, etc.,
exporting predictive models as PMML
• great open source tools – R, Weka,
KNIME, Matlab, RapidMiner, etc.
• integrate with other libraries –
Matrix API, etc.
• leverage PMML as another kind
of DSL
cascading.org/pattern

## train a RandomForest model

f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=50)

## test the model on the holdout test set

print(fit$importance)
print(fit)

predicted <- predict(fit, data)
data$predicted <- predicted
confuse <- table(pred = predicted, true = data[,1])
print(confuse)

## export predicted labels to TSV

write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
quote=FALSE, sep="t", row.names=FALSE)

## export RF model to PMML

saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
Pattern – create a model in R

<?xml version="1.0"?>
<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.dmg.org/PMML-4_0
http://www.dmg.org/v4-0/pmml-4-0.xsd">
<Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
  <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
  <Application name="Rattle/PMML" version="1.2.30"/>
  <Timestamp>2012-10-22 19:39:28</Timestamp>
</Header>
<DataDictionary numberOfFields="4">
  <DataField name="label" optype="categorical" dataType="string">
   <Value value="0"/>
   <Value value="1"/>
  </DataField>
  <DataField name="var0" optype="continuous" dataType="double"/>
</DataDictionary>
<MiningModel modelName="randomForest_Model" functionName="classification">
  <MiningSchema>
   <MiningField name="label" usageType="predicted"/>
   <MiningField name="var0" usageType="active"/>
  </MiningSchema>
  <Segmentation multipleModelMethod="majorityVote">
   <Segment id="1">
    <True/>
    <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit">
     <MiningSchema>
      <MiningField name="label" usageType="predicted"/>
     </MiningSchema>
...
Pattern – capture model parameters as PMML

public static void main( String[] args ) throws RuntimeException {
String inputPath = args[ 0 ];
String classifyPath = args[ 1 ];
// set up the config properties
Tap inputTap = new Hfs( new TextDelimited( true, "t" ), inputPath );
Tap classifyTap = new Hfs( new TextDelimited( true, "t" ), classifyPath );
// handle command line options
OptionParser optParser = new OptionParser();
optParser.accepts( "pmml" ).withRequiredArg();
OptionSet options = optParser.parse( args );

FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
.addSource( "input", inputTap )
.addSink( "classify", classifyTap );

if( options.hasArgument( "pmml" ) ) {
String pmmlPath = (String) options.valuesOf( "pmml" ).get( 0 );
PMMLPlanner pmmlPlanner = new PMMLPlanner()
.setPMMLInput( new File( pmmlPath ) )
.retainOnlyActiveIncomingFields()
.setDefaultPredictedField( new Fields( "predict", Double.class ) ); // default value if missing from the model
flowDef.addAssemblyPlanner( pmmlPlanner );
}

Flow classifyFlow = flowConnector.connect( flowDef );
classifyFlow.writeDOT( "dot/classify.dot" );
classifyFlow.complete();
}
Pattern – score a model, within an app

Customer
Orders
Classify
Scored
Orders
GroupBy
token
Count
PMML
Model
M R
Failure
Traps
Assert
Confusion
Matrix
Pattern – score a model, using pre-deﬁned Cascading app

## run an RF classifier at scale

hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap
--pmml data/sample.rf.xml

## run an RF classifier at scale, assert regression test, measure confusion matrix

hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap
--pmml data/sample.rf.xml --assert --measure out/measure

## run a predictive model at scale, measure RMSE

hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap
--pmml data/iris.lm_p.xml --rmse out/measure
Pattern – score a model, using pre-deﬁned Cascading app

Roadmap – existing algorithms for scoring
•

Random Forest
• Decision Trees
• Linear Regression
• GLM
• Logistic Regression
• K-Means Clustering
• Hierarchical Clustering
• Multinomial
• SupportVector Machines (prepared for release)
also, model chaining and general support for ensembles

Roadmap – next priorities for scoring
•

Time Series (ARIMA forecast)
• Association Rules (basket analysis)
• Naïve Bayes
• Neural Networks
algorithms extended based on customer use cases –
contact groups.google.com/forum/?fromgroups#!forum/pattern-user

Roadmap – top priorities for creating models at scale
•

Random Forest
• Logistic Regression
• K-Means Clustering
• Association Rules
…plus all models which can be trained via sparse matrix
factorization (TQSR => PCA, SVD least squares, etc.)
a wealth of recent research indicates many opportunities
to parallelize popular algorithms for training models at scale
on Apache Hadoop…

Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads

Experiments – comparing models
• much customer interest in leveraging Cascading and
Apache Hadoop to run customer experiments at scale
• run multiple variants, then measure relative “lift”
• Concurrent runtime – tag and track models
the following example compares two models trained
with different machine learning algorithms
this is exaggerated, one has an important variable
intentionally omitted to help illustrate the experiment

## train a Random Forest model
## example: http://mkseo.pe.kr/stats/?p=220

f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
print(fit)
saveXML(pmml(fit), file=paste(out_folder, "sample.rf.xml", sep="/"))
Experiments – Random Forest model
OOB estimate of error rate: 14%
Confusion matrix:
0 1 class.error
0 69 16 0.1882353
1 12 103 0.1043478

## train a Logistic Regression model (special case of GLM)
## example: http://www.stat.cmu.edu/~cshalizi/490/clustering/clustering01.r

f <- as.formula("as.factor(label) ~ var0 + var2")
fit <- glm(f, family=binomial, data=data)
print(summary(fit))
saveXML(pmml(fit), file=paste(out_folder, "sample.lr.xml", sep="/"))
Experiments – Logistic Regression model
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.8524 0.3803 4.871 1.11e-06 ***
var0 -1.3755 0.4355 -3.159 0.00159 **
var2 -3.7742 0.5794 -6.514 7.30e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01
‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
NB: this model has “var1” intentionally omitted

Experiments – comparing results
•

use a confusion matrix to compare results for the classiﬁers
• Logistic Regression has a lower “false negative” rate (5% vs. 11%)
however it has a much higher “false positive” rate (52% vs. 14%)
• assign a cost model to select a winner –
for example, in an ecommerce anti-fraud classiﬁer:
FN ∼ chargeback risk
FP ∼ customer support costs

Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads

Two Cultures
“A new research community using these tools sprang up.Their goal
was predictive accuracy.The community consisted of young computer
scientists, physicists and engineers plus a few aging statisticians.
They began using the new tools in working on complex prediction
problems where it was obvious that data models were not applicable:
speech recognition, image recognition, nonlinear time series prediction,
handwriting recognition, prediction in ﬁnancial markets.”
Statistical Modeling: TheTwo Cultures
Leo Breiman, 2001
bit.ly/eUTh9L
in other words, seeing the forest for the trees…
this paper chronicled a sea change from data modeling practices
(silos, manual process) to the rising use of algorithmic modeling
(machine data for automation/optimization)

Why Do Ensembles Matter?
The World…
per Data Modeling
The World…

Algorithmic Modeling
“The trick to being a scientist is to be open to using
a wide variety of tools.” – Breiman
circa 2001: Random Forest, bootstrap aggregation, etc.,
yield dramatic increases in predictive power over earlier
modeling such as Logistic Regression
major learnings from the Netﬂix Prize: the power of
ensembles, model chaining, etc.
the problems at hand have become simply too big and too
complex for ONE distribution, ONE model, ONE team…

Ensemble Models
Breiman:“a multiplicity of data models”
BellKor team: 100+ individual models in 2007 Progress Prize
while the process of combining models adds complexity
(making it more difﬁcult to anticipate or explain predictions)
accuracy may increase substantially
Ensemble Learning: Better PredictionsThrough Diversity
Todd Holloway
ETech (2008)
abeautifulwww.com/EnsembleLearningETech.pdf
The Story of the Netﬂix Prize:An EnsemblersTale
Lester Mackey
National Academies Seminar,Washington, DC (2011)
stanford.edu/~lmackey/papers/

KDD 2013 PMML Workshop
Pattern: PMML for Cascading and Hadoop
Paco Nathan, Girish Kathalagiri
Chicago, 2013-08-11 (accepted)
19th ACM SIGKDD
Conference on Knowledge Discovery
and Data Mining
kdd13pmml.wordpress.com

Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads

Anatomy of an Enterprise app
Deﬁnition a typical Enterprise workﬂow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
uses

ETL
data
prep
predictive
model
data
sources
end
uses
ANSI SQL for ETL

ETL
data
prep
predictive
model
data
sources
end
usesJ2EE for business logic

ETL
data
prep
predictive
model
data
sources
end
uses
SAS for predictive models

ETL
data
prep
predictive
model
data
sources
end
uses
SAS for predictive modelsANSI SQL for ETL most of the licensing costs…

ETL
data
prep
predictive
model
data
sources
end
usesJ2EE for business logic
most of the project costs…

ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
SAS, R, etc. → PMML
business logic in Java,
Clojure, Scala, etc.
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
Cascading allows multiple departments to combine their workﬂow components
into an integrated app – one among many, typically – based on 100% open source
a compiler sees it all…
cascading.org

ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
FlowDef flowDef = FlowDef.flowDef()
.setName( "etl" )
.addSource( "example.employee", emplTap )
.addSource( "example.sales", salesTap )
.addSink( "results", resultsTap );

SQLPlanner sqlPlanner = new SQLPlanner()
.setSql( sqlStatement );

flowDef.addAssemblyPlanner( sqlPlanner );
cascading.org

ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
FlowDef flowDef = FlowDef.flowDef()
.setName( "classifier" )
.addSource( "input", inputTap )
.addSink( "classify", classifyTap );

PMMLPlanner pmmlPlanner = new PMMLPlanner()
.setPMMLInput( new File( pmmlModel ) )
.retainOnlyActiveIncomingFields();

flowDef.addAssemblyPlanner( pmmlPlanner );

cascading.org
ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
visual collaboration for the business logic is a great
way to improve how teams work together
Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads

ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
Failure
Traps
bonus
allocation
employee
PMML
classifier
quarterly
sales
Join
Count
leads
multiple departments, working in their respective
frameworks, integrate results into a combined app,
which runs at scale on a cluster… business process
combined in a common space (DAG) for ﬂow
planners, compiler, optimization, troubleshooting,
exception handling, notiﬁcations, security audit,
performance monitoring, etc.
cascading.org

Enterprise DataWorkﬂows
with Cascading
O’Reilly, 2013
amazon.com/dp/1449358721
references…
newsletter updates:
liber118.com/pxn/

Many thanks to others who have contributed code,
ideas, suggestions, etc., to Pattern:
• Chris Wensel @ Concurrent
• Girish Kathalagiri @ AgilOne
• Vijay Srinivas Agneeswaran @ Impetus
• Chris Severs @ eBay
• Ofer Mendelevitch @ Hortonworks
• Sergey Boldyrev @ Nokia
• Quinton Anderson @ IZAZI Solutions
• Chris Gutierrez @ Airbnb
• Villu Ruusmann @ JPMML project
acknowledgements…

blog, developer community, code/wiki/gists, maven repo,
commercial products, etc.:
cascading.org
zest.to/group11
github.com/Cascading
conjars.org
goo.gl/KQtUL
concurrentinc.com
drill-down…

Hadoop Summit: Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop

Recommended

Recommended

More Related Content

Similar to Hadoop Summit: Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop

Similar to Hadoop Summit: Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop (20)

More from Paco Nathan

More from Paco Nathan (20)

Recently uploaded

Recently uploaded (20)

Hadoop Summit: Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop