Pattern: an open source project for migrating predictive models onto Apache Hadoop
1. “Pattern – an open source project for migrating predictive models onto Apache Hadoop”
Paco Nathan
Concurrent, Inc.
San Francisco, CA
@pacoid
Copyright ©2013, Concurrent, Inc.
Sunday, 17 March 13 1
2. Pattern: predictive models at scale
• Enterprise Data Workflows
• Sample Code
• A Little Theory…
• Pattern
• PMML
• Roadmap
• Customer Experiments
[Figure: word count flow diagram – Document Collection, Tokenize, Scrub, Stop Word List, HashJoin, GroupBy, Count, Word Count]
3. Cascading – origins
API author Chris Wensel worked as a system architect
at an Enterprise firm well-known for many popular
data products.
Wensel was following the Nutch open source project –
where Hadoop started.
Observation: it would be difficult to find Java developers
to write complex Enterprise apps in MapReduce –
a potential blocker for leveraging new open source
technology.
4. Cascading – functional programming
Key insight: MapReduce is based on functional programming
– with roots going back to LISP (late 1950s). Apache Hadoop use cases are
mostly about data pipelines, which are functional in nature.
To ease staffing problems as “Main Street” Enterprise firms
began to embrace Hadoop, Cascading was introduced
in late 2007, as a new Java API to implement functional
programming for large-scale data workflows:
• leverages JVM and Java-based tools without any
need to create new languages
• allows programmers who have J2EE expertise
to leverage the economics of Hadoop clusters
5. functional programming… in production
• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc.,
have invested in open source projects atop Cascading
– used for their large-scale production deployments
• new case studies for Cascading apps are mostly
based on domain-specific languages (DSLs) in JVM
languages which emphasize functional programming:
Cascalog in Clojure (2010)
Scalding in Scala (2012)
github.com/nathanmarz/cascalog/wiki
github.com/twitter/scalding/wiki
6. Cascading – definitions
• a pattern language for Enterprise Data Workflows
• simple to build, easy to test, robust in production
• design principles ⟹ ensure best practices at scale
[Figure: enterprise data workflow diagram – Customers, Web App, logs, Cache, Support, source/sink/trap taps, Data Workflow, Modeling, PMML, Analytics Cubes, customer profile DBs, Customer Prefs, Reporting, Hadoop Cluster]
7. Cascading – usage
• Java API, DSLs in Scala, Clojure, Jython, JRuby, Groovy, ANSI SQL
• ASL 2 license, GitHub src, http://conjars.org
• 5+ yrs production use, multiple Enterprise verticals
[Figure: enterprise data workflow diagram]
8. Cascading – integrations
• partners: Microsoft Azure, Hortonworks, Amazon AWS, MapR, EMC, SpringSource, Cloudera
• taps: Memcached, Cassandra, MongoDB, HBase, JDBC, Parquet, etc.
• serialization: Avro, Thrift, Kryo, JSON, etc.
• topologies: Apache Hadoop, tuple spaces, local mode
[Figure: enterprise data workflow diagram]
9. Cascading – deployments
• case studies: Climate Corp, Twitter, Etsy,
Williams-Sonoma, uSwitch, Airbnb, Nokia,
YieldBot, Square, Harvard, etc.
• use cases: ETL, marketing funnel, anti-fraud,
social media, retail pricing, search analytics,
recommenders, eCRM, utility grids, telecom,
genomics, climatology, agronomics, etc.
10. Cascading – deployments
workflow abstraction addresses:
• staffing bottleneck;
• system integration;
• operational complexity;
• test-driven development
12. The Ubiquitous Word Count
Definition: count how often each word appears
in a collection of text documents
This simple program provides an excellent test case for
parallel processing, since it:
• requires a minimal amount of code
• demonstrates use of both symbolic and numeric values
• shows a dependency graph of tuples as an abstraction
• is not many steps away from useful search indexing
• serves as a “Hello World” for Hadoop apps
Any distributed computing framework which can run Word
Count efficiently in parallel at scale can handle much
larger and more interesting compute problems.
void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");
void reduce (String word, Iterator group):
  int count = 0;
  for each pc in group:
    count += Int(pc);
  emit(word, String(count));
[Figure: conceptual flow diagram – Document Collection, Tokenize (M), GroupBy token, Count (R), Word Count]
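The map and reduce pseudocode above can be sketched in plain Java. This is a minimal, single-process illustration, not Hadoop or Cascading code: the class name WordCountSketch and its method names are hypothetical, and the split regex is an assumption chosen to resemble the one in the sample apps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: runs in one JVM, no Hadoop or Cascading involved.
class WordCountSketch {
    // "map" step: split one document's text into (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        // assumed delimiter set: space, brackets, parentheses, comma, period
        for (String w : text.split("[ \\[\\]\\(\\),.]")) {
            if (!w.isEmpty()) {
                pairs.add(Map.entry(w, 1));
            }
        }
        return pairs;
    }

    // "reduce" step: group the pairs by word and sum their counts
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    // run map over every document, then reduce over all emitted pairs
    static Map<String, Integer> wordCount(List<String> docs) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String doc : docs) {
            pairs.addAll(map(doc));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        // made-up sample documents
        System.out.println(wordCount(List.of("rain shadow", "rain (dry) air, rain")));
    }
}
```

In a real cluster the grouping between map and reduce is done by the framework's shuffle; here a single in-memory map stands in for it.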
13. word count – conceptual flow diagram
[Figure: conceptual flow diagram – Document Collection, Tokenize (M), GroupBy token, Count (R), Word Count]
• 1 map
• 1 reduce
• 18 lines code
cascading.org/category/impatient
gist.github.com/3900702
14. word count – Cascading app in Java
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
// create source and sink taps (tab-delimited, with header row)
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
 .addSource( docPipe, docTap )
 .addTailSink( wcPipe, wcTap );
// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
15. word count – generated flow diagram
[head]
Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']
[{2}:'doc_id', 'text']
[{2}:'doc_id', 'text']
map
Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
[{1}:'token']
[{1}:'token']
GroupBy('wc')[by:['token']]
wc[{1}:'token']
[{1}:'token']
reduce
Every('wc')[Count[decl:'count']]
[{2}:'token', 'count']
[{1}:'token']
Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']
[{2}:'token', 'count']
[{2}:'token', 'count']
[tail]
16. word count – Cascalog / Clojure
(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))
(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\(\),.)\s]+"))
(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))
; Paul Lam
; github.com/Quantisan/Impatient
17. word count – Cascalog / Clojure
github.com/nathanmarz/cascalog/wiki
• implements Datalog in Clojure, with predicates backed
by Cascading – for a highly declarative language
• run ad-hoc queries from the Clojure REPL –
approx. 10:1 code reduction compared with SQL
• composable subqueries, used for test-driven development
(TDD) practices at scale
• Leiningen build: simple, no surprises, in Clojure itself
• more new deployments than other Cascading DSLs –
Climate Corp is largest use case: 90% Clojure/Cascalog
• has a learning curve, limited number of Clojure developers
• aggregators are the magic, and those take effort to learn
18. word count – Scalding / Scala
import com.twitter.scalding._
class WordCount(args : Args) extends Job(args) {
  Tsv(args("doc"),
      ('doc_id, 'text),
      skipHeader = true)
    .read
    .flatMap('text -> 'token) {
      text : String => text.split("[ \\[\\]\\(\\),.]")
    }
    .groupBy('token) { _.size('count) }
    .write(Tsv(args("wc"), writeHeader = true))
}
19. word count – Scalding / Scala
github.com/twitter/scalding/wiki
• extends the Scala collections API so that distributed lists
become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram
and function calls
• extensive libraries are available for linear algebra, abstract
algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale
• gentler learning curve than Cascalog
20. word count – Scalding / Scala
Cascalog and Scalding DSLs leverage the functional aspects
of MapReduce, helping limit complexity in process
21. Two Avenues to the App Layer…
Enterprise: must contend with
complexity at scale every day…
incumbents extend current practices and
infrastructure investments – using J2EE,
ANSI SQL, SAS, etc. – to migrate
workflows onto Apache Hadoop while
leveraging existing staff
Start-ups: crave complexity and
scale to become viable…
new ventures move into Enterprise space
to compete using relatively lean staff,
while leveraging sophisticated engineering
practices, e.g., Cascalog and Scalding
[Figure: chart with axes labeled complexity ➞ and scale ➞]
23. workflow abstraction – pattern language
Cascading uses a “plumbing” metaphor in the Java API,
to define workflows out of familiar elements: Pipes, Taps,
Tuple Flows, Filters, Joins, Traps, etc.
[Figure: flow diagram – Document Collection, Tokenize, Scrub, Stop Word List, HashJoin, GroupBy, Count, Word Count]
Data is represented as flows of tuples. Operations within
the flows bring functional programming aspects into Java.
In formal terms, this provides a pattern language.
24. references…
pattern language: a structured method for solving
large, complex design problems, where the syntax of
the language promotes the use of best practices
amazon.com/dp/0195019199
design patterns: the notion originated in consensus
negotiation for architecture, later applied in OOP
software engineering by “Gang of Four”
amazon.com/dp/0201633612
25. workflow abstraction – pattern language
design principles of the pattern
language ensure best practices
for robust, parallel data workflows
at scale
26. workflow abstraction – literate programming
Cascading workflows generate their own visual
documentation: flow diagrams
[Figure: flow diagram – Document Collection, Tokenize, Scrub, Stop Word List, HashJoin, GroupBy, Count, Word Count]
In formal terms, flow diagrams leverage a methodology
called literate programming
Provides intuitive, visual representations for apps –
great for cross-team collaboration
27. references…
by Don Knuth
Literate Programming
Univ of Chicago Press, 1992
literateprogramming.com/
“Instead of imagining that our main task is
to instruct a computer what to do, let us
concentrate rather on explaining to human
beings what we want a computer to do.”
28. workflow abstraction – test-driven development
• assert patterns (regex) on the tuple streams
• adjust assert levels, like log4j levels
• trap edge cases as “data exceptions”
• TDD at scale:
1. start from raw inputs in the flow graph
2. define stream assertions for each stage of transforms
3. verify exceptions, code to remove them
4. when impl is complete, app has full test coverage
redirect traps in production to Ops, QA, Support, Audit, etc.
[Figure: enterprise data workflow diagram]
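The idea of asserting a regex pattern on a tuple stream, while diverting failures to a trap, can be illustrated with a small plain-Java sketch. It does not use the Cascading assertion or trap API: StreamAssertSketch, its fields, and the tab-separated record layout are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch in plain Java; not the Cascading assertion API.
class StreamAssertSketch {
    // tuples that satisfy the assertion continue downstream
    final List<String> passed = new ArrayList<>();
    // tuples that violate it are set aside as "data exceptions"
    final List<String> trapped = new ArrayList<>();

    // check every tuple against the expected pattern; route failures to the trap
    void assertMatches(Pattern expected, List<String> tuples) {
        for (String t : tuples) {
            if (expected.matcher(t).matches()) {
                passed.add(t);
            } else {
                trapped.add(t);
            }
        }
    }

    public static void main(String[] args) {
        StreamAssertSketch stage = new StreamAssertSketch();
        // assumed record layout: numeric id, a tab, then a non-empty text field
        Pattern idThenText = Pattern.compile("\\d+\\t.+");
        stage.assertMatches(idThenText, List.of("1\train shadow", "oops", "2\tdry air"));
        System.out.println("passed=" + stage.passed.size() + " trapped=" + stage.trapped);
    }
}
```

The design point is that a malformed record does not abort the run; it lands in a side output that another team can inspect later.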
29. workflow abstraction – business process
Following the essence of literate programming, Cascading
workflows provide statements of business process
This recalls a sense of business process management
for Enterprise apps (think BPM/BPEL for Big Data)
Cascading creates a separation of concerns between
business process and implementation details (Hadoop, etc.)
This is especially apparent in large-scale Cascalog apps:
“Specify what you require, not how to achieve it.”
By virtue of the pattern language, the flow planner then
determines how to translate business process into efficient,
parallel jobs at scale
30. references…
by Edgar Codd
“A relational model of data for large shared data banks”
Communications of the ACM, 1970
dl.acm.org/citation.cfm?id=362685
Rather than arguing between SQL vs. NoSQL…
structured vs. unstructured data frameworks…
this approach focuses on what apps do:
the process of structuring data
Closely related to functional relational programming paradigm:
“Out of the Tar Pit”
Moseley & Marks 2006
http://goo.gl/SKspn
31. workflow abstraction – API design principles
• specify what is required, not how it must be achieved
• plan far ahead, before consuming cluster resources –
fail fast prior to submit
• fail the same way twice – deterministic flow planners
help reduce engineering costs for debugging at scale
• same JAR, any scale – app does not require a recompile
to change data taps or cluster topologies
32. workflow abstraction – building apps in layers
• business process – separation of concerns: focus on specifying what is required, not how the computers must accomplish it – not unlike BPM/BPEL for Big Data
• test-driven development – assert expected patterns in tuple flows, adjust assertion levels, verify that tests fail, code until tests pass, repeat … route exceptional data to appropriate department
• pattern language – syntax of the pattern language conveys expertise – much like building a tower with Lego blocks: ensure best practices for robust, parallel data workflows at scale
• flow planner/optimizer – enables the functional programming aspects: compiler within a compiler, mapping flows to topologies (e.g., create and sequence Hadoop job steps)
• compiler/build – entire app is visible to the compiler: resolves issues of crossing boundaries for troubleshooting, exception handling, notifications, etc.; one app = one JAR
• topology – Apache Hadoop MR, IMDGs, etc. – upcoming MR2, etc.
• JVM cluster – cluster scheduler, instrumentation, etc.
33. workflow abstraction – building apps in layers
several theoretical aspects converge
into software engineering practices
which minimize the complexity of
building and maintaining Enterprise
data workflows
35. Pattern – analytics workflows
• open source project – ASL 2, GitHub repo
• multiple companies contributing
• complementary to Apache Mahout – while leveraging
workflow abstraction, multiple topologies, etc.
• model scoring: generates workflows from PMML models
• model creation: estimation at scale, captured as PMML
• use sample Hadoop app at scale – no coding required
• integrate with 2 lines of Java (1 line Clojure or Scala)
• excellent use cases for customer experiments at scale
cascading.org/pattern
36. Pattern – analytics workflows
greatly reduced development costs, fewer
licensing issues at scale – leveraging the
economics of Apache Hadoop clusters,
plus the core competencies of analytics
staff, plus existing IP in predictive models
37. Pattern – model scoring
• migrate workloads: SAS, Teradata, etc., exporting predictive models as PMML
• great open source tools – R, Weka, KNIME, Matlab, RapidMiner, etc.
• integrate with other libraries – Matrix API, etc.
• leverage PMML as another kind of DSL
[Figure: enterprise data workflow diagram]
cascading.org/pattern
38. Pattern – an example classifier
1. use customer order history as the training data set
2. train a risk classifier for orders, using Random Forest
3. export model from R to PMML
4. build a Cascading app to execute the PMML model
4.1. generate flow from PMML description
4.2. plan the flow for a topology (Hadoop)
4.3. compile app to a JAR file
5. verify results with a regression test
6. deploy the app at scale to calculate scores
7. potentially, reuse classifier for real-time scoring
[Figure: example classifier architecture]
39. Pattern – an example classifier
[Figure: example classifier architecture – labels: risk classifier (dimension: customer 360), risk classifier (dimension: per-order), Cascading apps, analyst's laptop, data prep, training data sets, customer transactions, predict model costs, score new orders, PMML model, detect fraudsters, anomaly detection, segment customers, velocity metrics, Hadoop (batch workloads), IMDG (real-time workloads), Customer DB, ETL, DW, chargebacks etc., partner data]
40. Pattern – create a model in R
## train a RandomForest model
f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=50)
## test the model on the holdout test set
print(fit$importance)
print(fit)
predicted <- predict(fit, data)
data$predicted <- predicted
confuse <- table(pred = predicted, true = data[,1])
print(confuse)
## export predicted labels to TSV
write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
quote=FALSE, sep="\t", row.names=FALSE)
## export RF model to PMML
saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
42. Pattern – score a model, within an app
public class Main {
public static void main( String[] args ) {
String pmmlPath = args[ 0 ];
String ordersPath = args[ 1 ];
String classifyPath = args[ 2 ];
String trapPath = args[ 3 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
// create source and sink taps
Tap ordersTap = new Hfs( new TextDelimited( true, "\t" ), ordersPath );
Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );
// define a "Classifier" model from PMML to evaluate the orders
ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );
Pipe classifyPipe = new Each( new Pipe( "classify" ), classFunc.getInputFields(), classFunc, Fields.ALL );
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
.addSource( classifyPipe, ordersTap )
.addTrap( classifyPipe, trapTap )
.addSink( classifyPipe, classifyTap );
// write a DOT file and run the flow
Flow classifyFlow = flowConnector.connect( flowDef );
classifyFlow.writeDOT( "dot/classify.dot" );
classifyFlow.complete();
}
}
43. Pattern – score a model, using pre-defined Cascading app
[Figure: flow diagram of the pre-defined app – Customer Orders, PMML Model, Classify, Scored Orders, Assert, GroupBy token, Count, Confusion Matrix, Failure Traps]
44. Pattern – score a model, using pre-defined Cascading app
## run an RF classifier at scale
hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
  --pmml data/sample.rf.xml
## run an RF classifier at scale, assert regression test, measure confusion matrix
hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
  --pmml data/sample.rf.xml --assert --measure out/measure
## run a predictive model at scale, measure RMSE
hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap \
  --pmml data/iris.lm_p.xml --rmse out/measure
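The last command above measures RMSE (root mean square error). How Pattern computes it internally is not shown in these slides; the sketch below only applies the standard definition, the square root of the mean squared difference between predicted and actual values. The class name RmseSketch and the sample values are hypothetical.

```java
// Hypothetical sketch of the standard RMSE definition; not Pattern's own code.
class RmseSketch {
    // root mean square error between two equally sized arrays
    static double rmse(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and the same length");
        }
        double sumSquares = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double diff = predicted[i] - actual[i];
            sumSquares += diff * diff;
        }
        return Math.sqrt(sumSquares / predicted.length);
    }

    public static void main(String[] args) {
        // made-up values, for illustration only
        double[] predicted = { 5.0, 4.7, 6.1 };
        double[] actual = { 5.1, 4.9, 6.0 };
        System.out.println(rmse(predicted, actual));
    }
}
```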
46. Lingual – connecting Hadoop and R
# load the JDBC package
library(RJDBC)
# set up the driver
drv <- JDBC("cascading.lingual.jdbc.Driver",
"~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")
# set up a database connection to a local repository
connection <- dbConnect(drv,
"jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")
# query the repository: in this case the MySQL sample database (CSV files)
df <- dbGetQuery(connection,
"SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
head(df)
# use R functions to summarize and visualize part of the data
df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
summary(df$hire_age)
library(ggplot2)
m <- ggplot(df, aes(x=hire_age))
m <- m + ggtitle("Age at hire, people named Gina")
m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
47. Lingual – connecting Hadoop and R
> summary(df$hire_age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
20.86 27.89 31.70 31.61 35.01 43.92
cascading.org/lingual
launchpad.net/test-db
49. PMML – standard
• established XML standard for predictive model markup
• organized by Data Mining Group (DMG), since 1997
http://dmg.org/
• members: IBM, SAS, Visa, NASA, Equifax, Microstrategy,
Microsoft, etc.
• PMML concepts for metadata, ensembles, etc., translate
directly into Cascading tuple flows
“PMML is the leading standard for statistical and data mining models and
supported by over 20 vendors and organizations. With PMML, it is easy
to develop a model on one system using one application and deploy the
model on another system using another application.”
wikipedia.org/wiki/Predictive_Model_Markup_Language
50. PMML – models
• Association Rules: AssociationModel element
• Cluster Models: ClusteringModel element
• Decision Trees: TreeModel element
• Naïve Bayes Classifiers: NaiveBayesModel element
• Neural Networks: NeuralNetwork element
• Regression: RegressionModel and GeneralRegressionModel elements
• Rulesets: RuleSetModel element
• Sequences: SequenceModel element
• Support Vector Machines: SupportVectorMachineModel element
• Text Models: TextModel element
• Time Series: TimeSeriesModel element
ibm.com/developerworks/industry/library/ind-PMML2/
53. roadmap – existing algorithms for scoring
• Random Forest
• Decision Trees
• Linear Regression
• GLM
• Logistic Regression
• K-Means Clustering
• Hierarchical Clustering
• Support Vector Machines
cascading.org/pattern
54. roadmap – top priorities for creating models at scale
• Random Forest
• Logistic Regression
• K-Means Clustering
a wealth of recent research indicates many opportunities
to parallelize popular algorithms for training models at scale
on Apache Hadoop…
cascading.org/pattern
55. roadmap – next priorities for scoring
• Time Series (ARIMA forecast)
• Association Rules (basket analysis)
• Naïve Bayes
• Neural Networks
algorithms extended based on customer use cases –
contact @pacoid
cascading.org/pattern
57. experiments – comparing models
• much customer interest in leveraging Cascading and
Apache Hadoop to run customer experiments at scale
• run multiple variants, then measure relative “lift”
• Concurrent runtime – tag and track models
the following example compares two models trained
with different machine learning algorithms
this is exaggerated: one model has an important variable
intentionally omitted to help illustrate the experiment
58. experiments – Random Forest model
## train a Random Forest model
## example: http://mkseo.pe.kr/stats/?p=220
f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
print(fit)
saveXML(pmml(fit), file=paste(out_folder, "sample.rf.xml", sep="/"))
OOB estimate of error rate: 14%
Confusion matrix:
0 1 class.error
0 69 16 0.1882353
1 12 103 0.1043478
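The class.error column above can be reproduced from the confusion matrix counts. The sketch below assumes rows are true labels and columns are predicted labels, which is how randomForest prints the matrix in R; ConfusionSketch and its method names are hypothetical.

```java
// Hypothetical sketch: recompute error rates from a confusion matrix.
class ConfusionSketch {
    // fraction of row `cls` (true label) that was predicted as some other class
    static double classError(long[][] m, int cls) {
        long rowTotal = 0;
        for (long v : m[cls]) {
            rowTotal += v;
        }
        return (double) (rowTotal - m[cls][cls]) / rowTotal;
    }

    // fraction of all cases that fall off the diagonal
    static double overallError(long[][] m) {
        long total = 0;
        long correct = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                total += m[i][j];
                if (i == j) {
                    correct += m[i][j];
                }
            }
        }
        return (double) (total - correct) / total;
    }

    public static void main(String[] args) {
        // counts copied from the Random Forest output above
        long[][] rf = { { 69, 16 }, { 12, 103 } };
        System.out.printf("class 0 error: %.7f%n", classError(rf, 0));
        System.out.printf("class 1 error: %.7f%n", classError(rf, 1));
        System.out.printf("overall error: %.2f%n", overallError(rf));
    }
}
```

With these counts the per-class errors are 16/85 and 12/115, and the overall error is 28/200, which agrees with the class.error values and the 14% OOB estimate printed by R.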
59. experiments – Logistic Regression model
## train a Logistic Regression model (special case of GLM)
## example: http://www.stat.cmu.edu/~cshalizi/490/clustering/clustering01.r
f <- as.formula("as.factor(label) ~ var0 + var2")
fit <- glm(f, family=binomial, data=data)
print(summary(fit))
saveXML(pmml(fit), file=paste(out_folder, "sample.lr.xml", sep="/"))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.8524 0.3803 4.871 1.11e-06 ***
var0 -1.3755 0.4355 -3.159 0.00159 **
var2 -3.7742 0.5794 -6.514 7.30e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
NB: this model has “var1” intentionally omitted
60. experiments – comparing results
• use a confusion matrix to compare results for the classifiers
• Logistic Regression has a lower “false negative” rate (5% vs. 11%);
however, it has a much higher “false positive” rate (52% vs. 14%)
• assign a cost model to select a winner –
for example, in an ecommerce anti-fraud classifier:
FN ∼ chargeback risk
FP ∼ customer support costs
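One simple way to assign a cost model is to weight each error rate by a unit cost and compare the totals. The sketch below uses the FN and FP rates quoted above, but the unit costs are invented purely for illustration, and it simplifies by weighting the two rates directly (it ignores class balance). CostModelSketch and its method names are hypothetical.

```java
// Hypothetical sketch: pick a winner by weighting error rates with unit costs.
class CostModelSketch {
    // weighted cost = FN rate * cost per FN + FP rate * cost per FP
    static double expectedCost(double fnRate, double fpRate, double fnCost, double fpCost) {
        return fnRate * fnCost + fpRate * fpCost;
    }

    // name of the cheaper model under the given (assumed) unit costs
    static String winner(double fnCost, double fpCost) {
        double lr = expectedCost(0.05, 0.52, fnCost, fpCost); // Logistic Regression rates quoted above
        double rf = expectedCost(0.11, 0.14, fnCost, fpCost); // Random Forest rates quoted above
        return lr < rf ? "Logistic Regression" : "Random Forest";
    }

    public static void main(String[] args) {
        // assumed costs: chargebacks expensive, support contacts cheap
        System.out.println(winner(100.0, 5.0));
        // assumed costs: both kinds of error equally expensive
        System.out.println(winner(20.0, 20.0));
    }
}
```

The two calls pick different winners, which is the point of the slide: neither classifier is better in the abstract, and the choice depends on what each kind of error costs the business.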
61. references…
Enterprise Data Workflows
with Cascading
O’Reilly, 2013
amazon.com/dp/1449358721
62. drill-down…
blog, dev community, code/wiki/gists, maven repo,
commercial products, career opportunities:
cascading.org
zest.to/group11
github.com/Cascading
conjars.org
goo.gl/KQtUL
concurrentinc.com