ACM Big Data Mining Camp,
2013-10-12:

Cascading, Pattern, and PMML
Paco Nathan @pacoid
Chief Scientist, Mesosphere
Cascading, Pattern, and PMML:
1. PMML and R (30 min lab)
2. Cascading Overview (15 min)
3. Model Scoring (30 min lab)
4. < break/ >
5. Ensembles, Experiments, etc. (15 min)
6. Industry Practices (20 min)
7. Q & A

ACM, 2013-10-12
PMML – an industry standard

• established XML standard for predictive model markup
• organized by Data Mining Group (DMG), since 1997
  http://dmg.org/
• members: IBM, SAS, Visa, NASA, Equifax, Microstrategy,
  Microsoft, etc.
• PMML concepts for metadata, ensembles, etc., translate
  directly into Cascading tuple flows

“PMML is the leading standard for statistical and data mining models and
supported by over 20 vendors and organizations. With PMML, it is easy
to develop a model on one system using one application and deploy the
model on another system using another application.”

wikipedia.org/wiki/Predictive_Model_Markup_Language
PMML – vendor coverage
PMML – model coverage

• Association Rules: AssociationModel element
• Cluster Models: ClusteringModel element
• Decision Trees: TreeModel element
• Naïve Bayes Classifiers: NaiveBayesModel element
• Neural Networks: NeuralNetwork element
• Regression: RegressionModel and GeneralRegressionModel elements
• Rulesets: RuleSetModel element
• Sequences: SequenceModel element
• Support Vector Machines: SupportVectorMachineModel element
• Text Models: TextModel element
• Time Series: TimeSeriesModel element

ibm.com/developerworks/industry/library/ind-PMML2/
PMML – create a model in R
## train a RandomForest model
 
f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=50)
 
## test the model on the holdout test set
 
print(fit$importance)
print(fit)
 
predicted <- predict(fit, data)
data$predicted <- predicted
confuse <- table(pred = predicted, true = data[,1])
print(confuse)
 
## export predicted labels to TSV
 
write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
  quote=FALSE, sep="\t", row.names=FALSE)
 
## export RF model to PMML
 
saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
PMML – capture business logic of analytics workflows
<?xml version="1.0"?>
<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.dmg.org/PMML-4_0
http://www.dmg.org/v4-0/pmml-4-0.xsd">
 <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
  <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
  <Application name="Rattle/PMML" version="1.2.30"/>
  <Timestamp>2012-10-22 19:39:28</Timestamp>
 </Header>
 <DataDictionary numberOfFields="4">
  <DataField name="label" optype="categorical" dataType="string">
   <Value value="0"/>
   <Value value="1"/>
  </DataField>
  <DataField name="var0" optype="continuous" dataType="double"/>
  <DataField name="var1" optype="continuous" dataType="double"/>
  <DataField name="var2" optype="continuous" dataType="double"/>
 </DataDictionary>
 <MiningModel modelName="randomForest_Model" functionName="classification">
  <MiningSchema>
   <MiningField name="label" usageType="predicted"/>
   <MiningField name="var0" usageType="active"/>
   <MiningField name="var1" usageType="active"/>
   <MiningField name="var2" usageType="active"/>
  </MiningSchema>
  <Segmentation multipleModelMethod="majorityVote">
   <Segment id="1">
    <True/>
    <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit">
     <MiningSchema>
      <MiningField name="label" usageType="predicted"/>
      <MiningField name="var0" usageType="active"/>
      <MiningField name="var1" usageType="active"/>
      <MiningField name="var2" usageType="active"/>
     </MiningSchema>
...
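Because PMML is plain XML, the metadata in a listing like the one above can be read with any XML parser. The following is a minimal sketch in Python, using only the standard library; the inline document is a hypothetical, trimmed-down fragment modeled on the listing (it is not schema-complete), and `describe_pmml` is a name made up for illustration.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: a trimmed fragment modeled on the slide's listing, not a complete model.
PMML_TEXT = """<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0">
 <DataDictionary numberOfFields="4">
  <DataField name="label" optype="categorical" dataType="string">
   <Value value="0"/>
   <Value value="1"/>
  </DataField>
  <DataField name="var0" optype="continuous" dataType="double"/>
  <DataField name="var1" optype="continuous" dataType="double"/>
  <DataField name="var2" optype="continuous" dataType="double"/>
 </DataDictionary>
 <MiningModel modelName="randomForest_Model" functionName="classification">
  <MiningSchema>
   <MiningField name="label" usageType="predicted"/>
   <MiningField name="var0" usageType="active"/>
   <MiningField name="var1" usageType="active"/>
   <MiningField name="var2" usageType="active"/>
  </MiningSchema>
 </MiningModel>
</PMML>
"""

# PMML elements live in a versioned namespace, so lookups must be qualified.
NS = {"pmml": "http://www.dmg.org/PMML-4_0"}


def describe_pmml(text):
    """Return (field types, active field names, predicted field names)."""
    root = ET.fromstring(text)
    fields = {
        f.get("name"): (f.get("optype"), f.get("dataType"))
        for f in root.findall("pmml:DataDictionary/pmml:DataField", NS)
    }
    usage = {
        m.get("name"): m.get("usageType")
        for m in root.findall(".//pmml:MiningSchema/pmml:MiningField", NS)
    }
    active = [name for name, kind in usage.items() if kind == "active"]
    predicted = [name for name, kind in usage.items() if kind == "predicted"]
    return fields, active, predicted


fields, active, predicted = describe_pmml(PMML_TEXT)
print("inputs:", active, "output:", predicted)
```

The active fields are the inputs a scoring engine has to supply, and the predicted field names the output it adds.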
PMML – further study
PMML in Action
Alex Guazzelli, Wen-Ching Lin, Tridivesh Jena
amazon.com/dp/1470003244
See also excellent resources at:

zementis.com/pmml.htm
Lab: RStudio and PMML in R
set up RStudio…
rstudio.com/ide/

use the Iris data to build predictive models…

• github.com/Cascading/pattern
  pattern-examples/examples/r/rattle_pmml.R
• test/train hold-outs
• evaluating predictive power
• export as PMML
Model: data prep based on “Iris”
library(pmml)
library(randomForest)
library(nnet)
library(XML)
library(kernlab)
 
## split data into test and train sets
 
data(iris)
iris_full <- iris
colnames(iris_full) <- c("sepal_length", "sepal_width", "petal_length", "petal_width", "species")
 
idx <- sample(150, 100)
iris_train <- iris_full[idx,]
iris_test <- iris_full[-idx,]
Model: Random Forest
## http://mkseo.pe.kr/stats/?p=220
 
f <- as.formula("as.factor(species) ~ .")
fit <- randomForest(f, data=iris_train, proximity=TRUE, ntree=50)
 
print(fit$importance)
print(fit)
print(table(iris_test$species, predict(fit, iris_test, type="class")))
 
plot(fit, log="y", main="Random Forest")
varImpPlot(fit)
MDSplot(fit, iris_full$species)
 
out <- iris_full
out$predict <- predict(fit, out, type="class")
 
write.table(out, file=paste(dat_folder, "iris.rf.tsv", sep="/"),
  quote=FALSE, sep="\t", row.names=FALSE)
saveXML(pmml(fit), file=paste(dat_folder, "iris.rf.xml", sep="/"))
Model: Linear Regression
## http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/r/iris_lm/
 
f <- as.formula("sepal_length ~ .")
fit <- lm(f, data=iris_train)
 
print(summary(fit))
print(table(round(iris_test$sepal_length), round(predict(fit, iris_test))))
 
op <- par(mfrow = c(3, 2))
plot(predict(fit), main="Linear Regression")
plot(iris_full$petal_length, iris_full$petal_width, pch=21,
bg=c("red", "green3", "blue")[unclass(iris_full$species)],
main="Edgar Anderson's Iris Data", xlab="petal length", ylab="petal width")
plot(fit)
par(op)
 
out <- iris_full
out$predict <- predict(fit, out)
 
write.table(out, file=paste(dat_folder, "iris.lm_p.tsv", sep="/"),
  quote=FALSE, sep="\t", row.names=FALSE)
saveXML(pmml(fit), file=paste(dat_folder, "iris.lm_p.xml", sep="/"))
Model: Neural Network
## http://statisticsr.blogspot.com/2008/10/notes-for-nnet.html
 
samp <- c(sample(1:50,25), sample(51:100,25), sample(101:150,25))
 
ird <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]),
species=factor(c(rep("setosa",50), rep("versicolor", 50), rep("virginica", 50))))
 
f <- as.formula("species ~ .")
fit <- nnet(f, data=ird, subset=samp, size=2, rang=0.1, decay=5e-4, maxit=200)
 
print(fit)
print(summary(fit))
print(table(ird$species[-samp], predict(fit, ird[-samp,], type = "class")))
 
out <- ird
out$predict <- predict(fit, ird, type="class")
 
write.table(out, file=paste(dat_folder, "iris.nn.tsv", sep="/"),
  quote=FALSE, sep="\t", row.names=FALSE)
saveXML(pmml(fit), file=paste(dat_folder, "iris.nn.xml", sep="/"))
Model: K-Means Clustering
## http://mkseo.pe.kr/stats/?p=15
 
ds <- iris_full[,-5]
fit <- kmeans(ds, 3)
 
print(fit)
print(summary(fit))
print(table(fit$cluster, iris_full$species))
 
op <- par(mfrow = c(1, 1))
plot(iris_full$sepal_length, iris_full$sepal_width, pch = 23,
bg = c("blue", "red", "green")[fit$cluster], main="K-Means Clustering")
points(fit$centers[,c(1, 2)], col=1:3, pch=8, cex=2)
par(op)
 
out <- iris_full
out$predict <- fit$cluster
 
write.table(out, file=paste(dat_folder, "iris.kmeans.tsv", sep="/"),
  quote=FALSE, sep="\t", row.names=FALSE)
saveXML(pmml(fit), file=paste(dat_folder, "iris.kmeans.xml", sep="/"))
Model: Hierarchical Clustering
## http://mkseo.pe.kr/stats/?p=15
 
i = as.matrix(iris_full[,-5])
fit <- hclust(dist(i), method = "average")
 
initial <- tapply(i, list(rep(cutree(fit, 3), ncol(i)), col(i)), mean)
dimnames(initial) <- list(NULL, dimnames(i)[[2]])
kls = cutree(fit, 3)
 
print(fit)
print(table(iris_full$species, kls))
 
op <- par(mfrow = c(1, 1))
plclust(fit, main="Hierarchical Clustering")
par(op)
 
out <- iris_full
out$predict <- kls
 
write.table(out, file=paste(dat_folder, "iris.hc.tsv", sep="/"),
  quote=FALSE, sep="\t", row.names=FALSE)
saveXML(pmml(fit, data=iris, centers=initial),
file=paste(dat_folder, "iris.hc.xml", sep="/"))
Model: Support Vector Machine
## https://support.zementis.com/entries/21176632-what-types-of-svm-models-built-inr-can-i-export-to-pmml
 
f <- as.formula("species ~ .")
fit <- ksvm(f, data=iris_train, kernel="rbfdot", prob.model=TRUE)
 
print(fit)
print(table(iris_test$species, predict(fit, iris_test)))
 
out <- iris_full
out$predict <- predict(fit, out)
 
write.table(out, file=paste(dat_folder, "iris.svm.tsv", sep="/"),
  quote=FALSE, sep="\t", row.names=FALSE)
saveXML(pmml(fit, dataset=iris_train),
file=paste(dat_folder, "iris.svm.xml", sep="/"))
Anatomy of an Enterprise app
definition of a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…

[diagram labels: ETL, data sources, data prep, predictive model, end uses]

• ANSI SQL for ETL
• Java, Pig for business logic
• SAS for predictive models
• most of the licensing costs… ANSI SQL for ETL, SAS for predictive models
• most of the project costs… Java, Pig for business logic
Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
cascading.org

[diagram labels: ETL, data sources, data prep, predictive model, end uses]

• Lingual: DW → ANSI SQL
• Pattern: SAS, R, etc. → PMML
• business logic in Java, Clojure, Scala, etc.
• source taps for Cassandra, JDBC, Splunk, etc.
• sink taps for Memcached, HBase, MongoDB, etc.

a compiler sees it all…
one connected DAG:
• optimization
• troubleshooting
• exception handling
• notifications
Anatomy of an Enterprise app
Lingual: DW → ANSI SQL

FlowDef flowDef = FlowDef.flowDef()
  .setName( "etl" )
  .addSource( "example.employee", emplTap )
  .addSource( "example.sales", salesTap )
  .addSink( "results", resultsTap );
 
SQLPlanner sqlPlanner = new SQLPlanner()
  .setSql( sqlStatement );
 
flowDef.addAssemblyPlanner( sqlPlanner );
Anatomy of an Enterprise app
Pattern: SAS, R, etc. → PMML

FlowDef flowDef = FlowDef.flowDef()
  .setName( "classifier" )
  .addSource( "input", inputTap )
  .addSink( "classify", classifyTap );
 
PMMLPlanner pmmlPlanner = new PMMLPlanner()
  .setPMMLInput( new File( pmmlModel ) )
  .retainOnlyActiveIncomingFields();
 
flowDef.addAssemblyPlanner( pmmlPlanner );
Cascading – functional programming
Key insight: MapReduce is based on functional programming
– back to LISP in 1970s. Apache Hadoop use cases are
mostly about data pipelines, which are functional in nature.
to ease staffing problems as “Main Street” Enterprise firms
began to embrace Hadoop, Cascading was introduced
in late 2007, as a new Java API to implement functional
programming for large-scale data workflows:

• leverages JVM and Java-based tools without any
  need to create new languages
• allows programmers who have Java expertise
  to leverage the economics of Hadoop clusters

Edgar Codd alluded to this (DSLs for structuring data)
in his original paper about the relational model
Cascading – functional programming

• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc.,
  have invested in open source projects atop Cascading –
  used for their large-scale production deployments
• new case studies for Cascading apps are mostly based on
  domain-specific languages (DSLs) in JVM languages which
  emphasize functional programming:
  Cascalog in Clojure (2010)
  Scalding in Scala (2012)

github.com/nathanmarz/cascalog/wiki
github.com/twitter/scalding/wiki
Why Adopting the Declarative Programming Practices Will Improve Your Return from Technology
Dan Woods, 2013-04-17 Forbes
forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-practices-will-improve-your-return-from-technology/
Functional Programming for Big Data
WordCount with token scrubbing…
Apache Hive: 52 lines HQL + 8 lines Python (UDF)
compared to
Scalding: 18 lines Scala/Cascading
functional programming languages help reduce
software engineering costs at scale, over time
Cascading – deployments

• case studies: Climate Corp, Twitter, Etsy,
  Williams-Sonoma, uSwitch, Airbnb, Nokia,
  YieldBot, Square, Harvard, Factual, etc.
• use cases: ETL, marketing funnel, anti-fraud,
  social media, retail pricing, search analytics,
  recommenders, eCRM, utility grids, telecom,
  genomics, climatology, agronomics, etc.
Workflow Abstraction – pattern language
Cascading uses a “plumbing” metaphor in Java
to define workflows out of familiar elements:
Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.
[flow diagram labels: Document Collection, Tokenize, Scrub token, Regex token, Stop Word List, HashJoin Left (RHS), GroupBy token, Count, Word Count; M and R mark map and reduce]

data is represented as flows of tuples
operations in the flows bring functional
programming aspects into Java

A Pattern Language
Christopher Alexander, et al.
amazon.com/dp/0195019199
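The plumbing idea, with tuples flowing through a chain of small operations, can be sketched outside of Cascading. The following Python sketch uses plain generators; it is an analogy rather than the Cascading API, the function names and sample documents are invented for illustration, and a set lookup stands in for the HashJoin against the stop word list.

```python
import re
from collections import Counter


def tokenize(docs):
    """Turn each (doc_id, text) tuple into a stream of (doc_id, token) tuples."""
    for doc_id, text in docs:
        for token in re.split(r"[ \[\]\(\),.]", text):
            yield doc_id, token


def scrub(tuples):
    """Normalize tokens and drop the empty ones left behind by the split."""
    for doc_id, token in tuples:
        token = token.strip().lower()
        if token:
            yield doc_id, token


def drop_stop_words(tuples, stop_words):
    """Stand-in for the join against a stop word list: keep unmatched tokens."""
    for doc_id, token in tuples:
        if token not in stop_words:
            yield doc_id, token


def count_by_token(tuples):
    """Group by token and count, the step that runs on the reduce side."""
    return Counter(token for _, token in tuples)


# Hypothetical sample data.
docs = [
    ("doc01", "A rain shadow is a dry area."),
    ("doc02", "Rain falls on the windward side."),
]
stop_words = {"a", "is", "on", "the"}

# Each stage consumes the previous stage's output, like pipes joined end to end.
flow = drop_stop_words(scrub(tokenize(docs)), stop_words)
counts = count_by_token(flow)
print(counts.most_common(3))
```

Nothing is computed until `count_by_token` pulls tuples through the chain, which loosely mirrors how a flow is first described and only later planned and run.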
Workflow Abstraction – literate programming
Cascading workflows generate their own visual
documentation: flow diagrams
[flow diagram for the word-count workflow, as generated by Cascading]

in formal terms, flow diagrams leverage a methodology
called literate programming
provides intuitive, visual representations for apps –
great for cross-team collaboration

Literate Programming
Don Knuth
literateprogramming.com
Workflow Abstraction – business process
following the essence of literate programming, Cascading
workflows provide statements of business process
this recalls a sense of business process management
for Enterprise apps (think BPM/BPEL for Big Data)
Cascading creates a separation of concerns between
business process and implementation details (Hadoop, etc.)
this is especially apparent in large-scale Cascalog apps:
“Specify what you require, not how to achieve it.”
by virtue of the pattern language, the flow planner then
determines how to translate business process into efficient,
parallel jobs at scale
The Ubiquitous Word Count
Definition:
count how often each word appears
in a collection of text documents
this simple program provides an excellent test case
for parallel processing:
• requires a minimal amount of code
• demonstrates use of both symbolic and numeric values
• shows a dependency graph of tuples as an abstraction
• is not many steps away from useful search indexing
• serves as a “Hello World” for Hadoop apps

void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");

void reduce (String word, Iterator group):
  int count = 0;
  for each pc in group:
    count += Int(pc);
  emit(word, String(count));

a distributed computing framework that runs Word Count
efficiently in parallel at scale can handle much larger
and more interesting compute problems
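The map and reduce pseudocode above can be run on a single machine to see what the framework does between the two phases. This is a minimal Python sketch, assuming whitespace tokenization in place of `segment` and an in-memory dictionary in place of the distributed shuffle; `run_word_count` and the sample documents are hypothetical.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Emit (word, "1") for every word, as in the map pseudocode."""
    for word in text.split():
        yield word, "1"


def reduce_fn(word, group):
    """Sum the partial counts for one word, as in the reduce pseudocode."""
    count = 0
    for pc in group:
        count += int(pc)
    return word, str(count)


def run_word_count(docs):
    # Shuffle: collect every emitted value under its key, which the
    # framework would do across the cluster between map and reduce.
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    return dict(reduce_fn(word, group) for word, group in sorted(groups.items()))


result = run_word_count({"d1": "to be or not to be", "d2": "to do"})
print(result)
```

Counts stay as strings here only to match the pseudocode, which emits `String(count)`.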
WordCount – conceptual flow diagram
[flow diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count; M and R mark map and reduce]

1 map
1 reduce
18 lines code

cascading.org/category/impatient
gist.github.com/3900702
WordCount – Cascading app in Java

String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
.addSource( docPipe, docTap )
 .addTailSink( wcPipe, wcTap );
// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();

WordCount – generated flow diagram
[generated flow diagram, rendered from the DOT file; nodes from head to tail:]
[head]
Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']
Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
GroupBy('wc')[by:['token']]
Every('wc')[Count[decl:'count']]
Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']
[tail]
[edges are annotated with tuple fields such as [{2}:'doc_id', 'text'], [{1}:'token'], and [{2}:'token', 'count']; the diagram marks a map phase and a reduce phase]
WordCount – Cascalog / Clojure

(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))
(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\\\(\),.)\s]+"))
(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))
; Paul Lam
; github.com/Quantisan/Impatient

WordCount – Cascalog / Clojure
github.com/nathanmarz/cascalog/wiki

• implements Datalog in Clojure, with predicates backed
  by Cascading – for a highly declarative language
• run ad-hoc queries from the Clojure REPL –
  approx. 10:1 code reduction compared with SQL
• composable subqueries, used for test-driven development
  (TDD) practices at scale
• Leiningen build: simple, no surprises, in Clojure itself
• more new deployments than other Cascading DSLs –
  Climate Corp is largest use case: 90% Clojure/Cascalog
• has a learning curve, limited number of Clojure developers
• aggregators are the magic, and those take effort to learn
WordCount – Scalding / Scala

import com.twitter.scalding._
 
class WordCount(args : Args) extends Job(args) {
  Tsv(args("doc"),
      ('doc_id, 'text),
      skipHeader = true)
    .read
    .flatMap('text -> 'token) {
      text : String => text.split("[ \\[\\]\\(\\),.]")
    }
    .groupBy('token) { _.size('count) }
    .write(Tsv(args("wc"), writeHeader = true))
}

WordCount – Scalding / Scala
github.com/twitter/scalding/wiki

• extends the Scala collections API so that distributed lists
  become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram
  and function calls
• extensive libraries are available for linear algebra, abstract
  algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale
• gentler learning curve than Cascalog
WordCount – Apache Hive

CREATE TABLE text_docs (line STRING);
 
LOAD DATA LOCAL INPATH 'data/rain.txt'
OVERWRITE INTO TABLE text_docs
;
 
SELECT
  word, COUNT(*)
FROM
  (SELECT
     split(line, '\t')[1] AS text
   FROM text_docs
  ) t
  LATERAL VIEW explode(split(text, '[ ,.()]')) lTable AS word
GROUP BY word
;

WordCount – Apache Hive
hive.apache.org

pro:
‣ most popular abstraction atop Apache Hadoop
‣ SQL-like language is syntactically familiar to most analysts
‣ simple to load large-scale unstructured data and run ad-hoc queries

con:
‣ not a relational engine, many surprises at scale
‣ difficult to represent complex workflows, ML algorithms, etc.
‣ one poorly-trained analyst can bottleneck an entire cluster
‣ app-level integration requires other coding, outside of script language
‣ logical planner mixed with physical planner; cannot collect app stats
‣ non-deterministic exec: number of maps+reduces may change unexpectedly
‣ business logic must cross multiple language boundaries: difficult to
  troubleshoot, optimize, audit, handle exceptions, set notifications, etc.
WordCount – Apache Pig

docPipe = LOAD '$docPath' USING PigStorage('\t', 'tagsource')
  AS (doc_id, text);
docPipe = FILTER docPipe BY doc_id != 'doc_id';
-- specify regex to split "document" text lines into token stream
tokenPipe = FOREACH docPipe
  GENERATE doc_id, FLATTEN(TOKENIZE(text, ' [](),.')) AS token;
tokenPipe = FILTER tokenPipe BY token MATCHES '\\w.*';
-- determine the word counts
tokenGroups = GROUP tokenPipe BY token;
wcPipe = FOREACH tokenGroups
  GENERATE group AS token, COUNT(tokenPipe) AS count;
-- output
STORE wcPipe INTO '$wcPath' USING PigStorage('\t', 'tagsource');
EXPLAIN -out dot/wc_pig.dot -dot wcPipe;

WordCount – Apache Pig
pig.apache.org

pro:
‣ easy to learn data manipulation language (DML)
‣ interactive prompt (Grunt) makes it simple to prototype apps
‣ extensibility through UDFs

con:
‣ not a full programming language; must extend via UDFs outside of language
‣ app-level integration requires other coding, outside of script language
‣ simple problems are simple to do; hard problems become quite complex
‣ difficult to parameterize scripts externally; must rewrite to change taps!
‣ logical planner mixed with physical planner; cannot collect app stats
‣ non-deterministic exec: number of maps+reduces may change unexpectedly
‣ business logic must cross multiple language boundaries: difficult to
  troubleshoot, optimize, audit, handle exceptions, set notifications, etc.
Two Avenues to the App Layer…

Enterprise: must contend with
complexity at scale every day…
incumbents extend current practices and
infrastructure investments – using JVM,
ANSI SQL, SAS, etc. – to migrate
workflows onto Apache Hadoop while
leveraging existing staff

Start-ups: crave complexity and
scale to become viable…
new ventures move into Enterprise space
to compete using relatively lean staff,
while leveraging sophisticated engineering
practices, e.g., Cascalog and Scalding

[chart axes: complexity ➞, scale ➞]
Pattern – model scoring

• migrate workloads: SAS, Teradata, etc.,
  exporting predictive models as PMML
• great open source tools – R, Weka,
  KNIME, Matlab, RapidMiner, etc.
• integrate with other libraries –
  Matrix API, etc.
• leverage PMML as another kind
  of DSL

cascading.org/pattern

[architecture diagram labels: Customers, Web App, logs, Support, Modeling, PMML, Analytics Cubes, Reporting, Data Workflow, Cache, Customer Prefs, customer profile DBs, Hadoop Cluster; source taps, sink taps, trap tap]
Pattern – score a model, using pre-defined Cascading app

[flow diagram labels: Customer Orders, PMML Model, Classify, Scored Orders, Assert, Failure Traps, GroupBy token, Count, Confusion Matrix; M and R mark map and reduce]

cascading.org/pattern
Pattern – score a model, within an app
public static void main( String[] args ) throws RuntimeException {
  String inputPath = args[ 0 ];
  String classifyPath = args[ 1 ];
  // set up the config properties
  Properties properties = new Properties();
  AppProps.setApplicationJarClass( properties, Main.class );
  HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
 
  // create source and sink taps
  Tap inputTap = new Hfs( new TextDelimited( true, "\t" ), inputPath );
  Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
 
  // handle command line options
  OptionParser optParser = new OptionParser();
  optParser.accepts( "pmml" ).withRequiredArg();
 
  OptionSet options = optParser.parse( args );
 
  // connect the taps, pipes, etc., into a flow
  FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
    .addSource( "input", inputTap )
    .addSink( "classify", classifyTap );
 
  if( options.hasArgument( "pmml" ) ) {
    String pmmlPath = (String) options.valuesOf( "pmml" ).get( 0 );
    PMMLPlanner pmmlPlanner = new PMMLPlanner()
      .setPMMLInput( new File( pmmlPath ) )
      .retainOnlyActiveIncomingFields()
      .setDefaultPredictedField( new Fields( "predict", Double.class ) ); // default value if missing from the model
    flowDef.addAssemblyPlanner( pmmlPlanner );
  }
 
  // write a DOT file and run the flow
  Flow classifyFlow = flowConnector.connect( flowDef );
  classifyFlow.writeDOT( "dot/classify.dot" );
  classifyFlow.complete();
}
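The scoring workflow (classify each record, divert failures to a trap, then count label/prediction pairs into a confusion matrix) can be sketched in a few lines of plain Python. This is an analogy, not the Pattern API: the `classify` rule is a hypothetical stand-in for a model loaded from PMML, and the sample rows are invented.

```python
from collections import Counter


def classify(row):
    """ASSUMPTION: a made-up rule standing in for a PMML model."""
    return "1" if row["var0"] + row["var2"] < 1.0 else "0"


def score(rows):
    """Add a 'predict' field to each row; route rows that fail to a trap."""
    scored, trapped = [], []
    for row in rows:
        try:
            predict = classify(row)
        except (KeyError, TypeError) as err:
            # Bad input should not stop the flow: set the row aside instead.
            trapped.append((row, repr(err)))
            continue
        scored.append({**row, "predict": predict})
    return scored, trapped


def confusion_matrix(scored):
    """Count (actual label, predicted label) pairs."""
    return Counter((row["label"], row["predict"]) for row in scored)


# Hypothetical input records; the last one is deliberately malformed.
rows = [
    {"label": "1", "var0": 0.1, "var2": 0.2},
    {"label": "0", "var0": 0.9, "var2": 0.8},
    {"label": "1", "var0": 0.7, "var2": 0.6},
    {"label": "0", "var0": None, "var2": 0.5},
]

scored, trapped = score(rows)
matrix = confusion_matrix(scored)
print(dict(matrix), "trapped:", len(trapped))
```

Keeping the trap separate from the scored output lets the good records proceed while the failures remain available for inspection.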
Approach 1: Vagrant Cluster for Cascading and Hadoop
set up Vagrant (use v1.3.3 only!) and VirtualBox to run Cascading…
PS: we can share USB thumb drives to speed up box downloads!
github.com/Cascading/vagrant-cascading-hadoop-cluster
NB: when running Gradle builds, you must run as “root”…
then when running Hadoop, you must run as “mapred”
and use HDFS commands.
Approach 2: Laptop Setup for Java, Hadoop, Gradle, Cascading
set up a build environment locally and run Apache Hadoop
in “standalone” mode… works fine for Linux or MacOSX;
however, please no “cdh”, “hdp”, “homebrew”, or “cygwin”
liber118.com/pxn/course/itds/install.html
download as a ZIP file, or use Git to clone the repo…
github.com/Cascading/Impatient
NB: when running Hadoop, you will run in local mode –
no HDFS
Approach 3: Log in to a pre-configured EC2 Node
assuming you are familiar with using SSH on Linux or MacOSX,
or using Putty on Windows…
we will give instructions during the workshop
NB: when running Hadoop, you will run in local mode –
no HDFS
Experiments – comparing models

• much customer interest in leveraging Cascading and
  Apache Hadoop to run customer experiments at scale
• run multiple variants, then measure relative “lift”
• Concurrent runtime – tag and track models

the following example compares two models trained
with different machine learning algorithms
this is exaggerated: one model has an important variable
intentionally omitted to help illustrate the experiment
Experiments – Random Forest model
## train a Random Forest model
## example: http://mkseo.pe.kr/stats/?p=220
 
f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
print(fit)
saveXML(pmml(fit), file=paste(out_folder, "sample.rf.xml", sep="/"))

OOB estimate of  error rate: 14%
Confusion matrix:
   0   1 class.error
0 69  16   0.1882353
1 12 103   0.1043478
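The class.error column and the overall error rate in the output above follow directly from the four counts in the confusion matrix. A short Python check, assuming the rows are the actual classes and the columns the predicted classes, as in the randomForest printout:

```python
# Counts copied from the output above: (actual, predicted) -> number of cases.
matrix = {
    ("0", "0"): 69, ("0", "1"): 16,
    ("1", "0"): 12, ("1", "1"): 103,
}

classes = ["0", "1"]

# class.error: misclassified cases in a row divided by that row's total.
class_error = {}
for actual in classes:
    row_total = sum(matrix[(actual, predicted)] for predicted in classes)
    wrong = row_total - matrix[(actual, actual)]
    class_error[actual] = wrong / row_total

# Overall error rate: all off-diagonal cases divided by all cases.
total = sum(matrix.values())
errors = total - sum(matrix[(c, c)] for c in classes)
error_rate = errors / total

print(class_error, error_rate)
```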
Experiments – Logistic Regression model
## train a Logistic Regression model (special case of GLM)
## example: http://www.stat.cmu.edu/~cshalizi/490/clustering/clustering01.r
 
f <- as.formula("as.factor(label) ~ var0 + var2")
fit <- glm(f, family=binomial, data=data)
print(summary(fit))
saveXML(pmml(fit), file=paste(out_folder, "sample.lr.xml", sep="/"))

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.8524     0.3803   4.871 1.11e-06 ***
var0         -1.3755     0.4355  -3.159  0.00159 **
var2         -3.7742     0.5794  -6.514 7.30e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

NB: this model has “var1” intentionally omitted
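Scoring with this model needs only the three coefficients: the linear predictor is passed through the logistic function. The Python sketch below uses the estimates printed above; it assumes the label levels are "0" and "1", in which case a binomial glm in R models the probability of the second level. `predict_proba` is a name chosen for illustration.

```python
import math

# Estimates copied from the summary above.
COEF = {"(Intercept)": 1.8524, "var0": -1.3755, "var2": -3.7742}


def predict_proba(var0, var2):
    """Probability of label "1" (assumed to be the second factor level)."""
    z = COEF["(Intercept)"] + COEF["var0"] * var0 + COEF["var2"] * var2
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(var0, var2, threshold=0.5):
    """ASSUMPTION: a 0.5 cut-off; a cost model may suggest a different one."""
    return "1" if predict_proba(var0, var2) >= threshold else "0"


print(predict_proba(0.0, 0.0), predict_label(1.0, 1.0))
```

Both coefficients are negative, so larger values of var0 or var2 push the prediction toward label "0".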
Experiments – comparing results

• use a confusion matrix to compare results for the classifiers
• Logistic Regression has a lower “false negative” rate (5% vs. 11%)
  however it has a much higher “false positive” rate (52% vs. 14%)
• assign a cost model to select a winner –
  for example, in an ecommerce anti-fraud classifier:
  FN ∼ chargeback risk
  FP ∼ customer support costs
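A cost model of this kind can be sketched as an expected cost per order. In the Python sketch below, the error rates are the ones quoted above; the fraud share and the dollar costs are hypothetical, and it is an assumption that the FN rate is measured over fraudulent orders and the FP rate over legitimate ones.

```python
# Error rates quoted on the slide.
RATES = {
    "logistic_regression": {"fn": 0.05, "fp": 0.52},
    "random_forest": {"fn": 0.11, "fp": 0.14},
}


def expected_cost(rates, fraud_share, cost_fn, cost_fp):
    """Expected cost per order.

    ASSUMPTION: "fn" is the share of fraudulent orders that are missed and
    "fp" is the share of legitimate orders that are wrongly flagged.
    """
    missed_fraud = fraud_share * rates["fn"] * cost_fn
    false_alarms = (1.0 - fraud_share) * rates["fp"] * cost_fp
    return missed_fraud + false_alarms


def pick_winner(fraud_share, cost_fn, cost_fp):
    """Return the classifier with the lowest expected cost per order."""
    return min(
        RATES,
        key=lambda name: expected_cost(RATES[name], fraud_share, cost_fn, cost_fp),
    )


# Hypothetical scenarios: 2% of orders are fraudulent, a support call costs 5.
cheap_chargebacks = pick_winner(fraud_share=0.02, cost_fn=100.0, cost_fp=5.0)
costly_chargebacks = pick_winner(fraud_share=0.02, cost_fn=10000.0, cost_fp=5.0)
print(cheap_chargebacks, costly_chargebacks)
```

With these made-up numbers the winner flips as chargebacks get more expensive, which is why the cost model, not the raw error rates, decides between the classifiers.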
Why Do Ensembles Matter?

[images: “The World…” and “The World… per Data Modeling”]

Two Cultures
“A new research community using these tools sprang up. Their goal
was predictive accuracy. The community consisted of young computer
scientists, physicists and engineers plus a few aging statisticians.
They began using the new tools in working on complex prediction
problems where it was obvious that data models were not applicable:
speech recognition, image recognition, nonlinear time series prediction,
handwriting recognition, prediction in financial markets.”

Statistical Modeling: The Two Cultures
Leo Breiman, 2001
bit.ly/eUTh9L

chronicled a sea change from data modeling (silos, manual
process) to the rising use of algorithmic modeling (machine
data for automation/optimization) which led in turn to the
practice of leveraging inter-disciplinary teams
Ensemble Models
Breiman: “a multiplicity of data models”
BellKor team: 100+ individual models in 2007 Progress Prize
while the process of combining models adds complexity
(making it more difficult to anticipate or explain predictions)
accuracy may increase substantially
Ensemble Learning: Better Predictions Through Diversity
Todd Holloway
ETech (2008)
abeautifulwww.com/EnsembleLearningETech.pdf
The Story of the Netflix Prize: An Ensembler's Tale
Lester Mackey
National Academies Seminar, Washington, DC (2011)
stanford.edu/~lmackey/papers/
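The simplest way to combine models is a majority vote, which is also the method named in the Random Forest PMML shown earlier (multipleModelMethod="majorityVote"). A minimal Python sketch follows; the three single-split "trees" are hypothetical stand-ins for real models, and the tie-breaking rule (first label seen wins) is an assumption, not taken from the PMML specification.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common label; on a tie, the one that appears first."""
    counts = Counter(votes)
    best = max(counts.values())
    for label in votes:
        if counts[label] == best:
            return label


def make_stump(field, threshold):
    """Build a hypothetical one-split tree: "1" below the threshold, else "0"."""
    def tree(row):
        return "1" if row[field] < threshold else "0"
    return tree


# A toy ensemble of three stumps, one per input variable.
forest = [make_stump("var0", 0.5), make_stump("var1", 0.5), make_stump("var2", 0.5)]


def score(row):
    """Ask every model for a label, then combine the answers."""
    return majority_vote([tree(row) for tree in forest])


row = {"var0": 0.2, "var1": 0.9, "var2": 0.1}
print(score(row))
```

The individual stumps disagree on this row, yet the ensemble still returns a single label, which illustrates both the added opacity and the robustness mentioned above.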
KDD 2013 PMML Workshop
Pattern: PMML for Cascading and Hadoop
Paco Nathan, Girish Kathalagiri
Chicago (2013-08-11)
19th ACM SIGKDD
Conference on Knowledge Discovery
and Data Mining
kdd13pmml.wordpress.com
Pattern: Example App

• example integration of PMML and Cascading, using a sample app
  based on the crime dataset from the City of Chicago Open Data
• sample app implements a predictive model for expected crime
  rates based on location, hour of day, and month
• modeling performed in R, using the pmml package
• multiple models are captured as PMML, then integrated via
  Pattern to implement the entire workflow as a single app
• PMML provides a vector for migrating workloads off of SAS,
  SPSS, etc., onto Hadoop clusters for more cost-effective scaling
Pattern: Example App
City of Chicago Open Data portal
cityofchicago.org/city/en/narr/foia/CityData.html
Pattern open source project
github.com/Cascading/pattern
Observed benefits include greatly reduced development costs
and fewer licensing issues at scale, while leveraging the scalability
of Apache Hadoop clusters, existing intellectual property in
predictive models, and the core competencies of analytics staff.
Analysts can train predictive models in popular analytics
frameworks, such as SAS, Microstrategy, R, Weka, SQL Server,
etc., then run those models at scale on Apache Hadoop with
little or no coding required.
Pattern API:
Support for Model Chaining, Transforms, etc.
workflow used for data preparation:
Pattern API:
Support for Model Chaining, Transforms, etc.
workflow used for model scoring:
Statistical Thinking

Process · Variation · Data · Tools

employing a mode of thought which includes both logical and analytical reasoning:
evaluating the whole of a problem, as well as its component parts; attempting
to assess the effects of changing one or more variables
this approach attempts to understand not just problems and solutions,
but also the processes involved and their variances
particularly valuable in Big Data work when combined with hands-on experience in
physics – roughly 50% of my peers come from physics or physical engineering…
programmers typically don’t think this way…
however, both systems engineers and data scientists must
What is needed most?
approximately 80% of the costs for data-related projects
get spent on data preparation – mostly on cleaning up
data quality issues: ETL, log files, etc., generally by socializing
the problem
unfortunately, data-related budgets tend to go into
frameworks which can only be used after clean up
most valuable skills:
‣ learn to use programmable tools that prepare data
‣ learn to understand the audience and their priorities
‣ learn to generate compelling data visualizations
‣ learn to estimate the confidence for reported results
‣ learn to automate work, making analysis repeatable
d3js.org
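For the "estimate the confidence for reported results" skill listed above, one common tool is a bootstrap confidence interval. A minimal sketch, assuming a percentile bootstrap of the mean; the sample values, the function name `bootstrap_ci`, and its defaults are hypothetical:

```python
import random
import statistics


def bootstrap_ci(sample, n_resamples=2000, alpha=0.05, seed=42):
    """Percentile-bootstrap confidence interval for the mean of `sample`."""
    rng = random.Random(seed)  # fixed seed keeps the analysis repeatable
    n = len(sample)
    # resample with replacement, record the mean of each resample
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


# hypothetical measurements, e.g. a daily error rate in percent
data = [4.1, 5.3, 4.8, 6.0, 5.1, 4.4, 5.7, 5.0, 4.9, 5.5]
low, high = bootstrap_ci(data)
print(f"mean {statistics.fmean(data):.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

Reporting the interval alongside the point estimate, with a fixed seed, also serves the last item in the list: the analysis can be rerun and gives the same numbers.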
Team Process = Needs

• discovery: help people ask the right questions
• modeling: allow automation to place informed bets
• integration: deliver data products at scale to LOB end uses
• apps: build smarts into product features
• systems: keep infrastructure running, cost-effective

roles spanning these needs: analysts, engineers, inter-disciplinary leadership
Team Composition = Roles

• Domain Expert: business process, stakeholder
• Data Scientist: data prep, discovery, modeling, etc.
• App Dev: software engineering, automation
• Ops: systems engineering, availability

(diagram labels: data science; introduced capability)

leverage non-traditional
pairing among roles, to
complement skills and
tear down silos
Team Composition = Needs × Roles
needs (columns): discovery, modeling, integration, apps, systems

roles (rows):
• Domain Expert: business process, stakeholder
• Data Scientist: data prep, discovery, modeling, etc.
• App Dev: software engineering, automation
• Ops: systems engineering, availability

(diagram label: data science)
Alternatively, Data Roles × Skill Sets
Analyzing the Analyzers
Harlan Harris, Sean Murphy,
Marck Vaisman
O’Reilly, 2013
amazon.com/dp/B00DBHTE56

Harlan Harris, et al.
datacommunitydc.org/blog/wp-content/uploads/2012/08/SkillsSelfIDMosaic-edit-500px.png
Cluster Computing’s Dirty Little Secret
many of us make a good living by leveraging high ROI
apps based on clusters, and so execs agree to build
out more data centers…
clusters for Hadoop/HBase, for Storm, for MySQL,
for Memcached, for Cassandra, for Nginx, etc.
this becomes expensive!
a single class of workloads on a given cluster is simpler
to manage, but terrible for utilization… various notions
of “cloud” help…
Cloudera, Hortonworks, probably EMC soon: sell a notion
of “Hadoop as OS”
All your workloads are belong to us

Google Data Center, Fox News

~2002
Beyond Hadoop
Hadoop – an open source solution for fault-tolerant parallel
processing of batch jobs at scale, based on commodity
hardware… however, other priorities have emerged for the
analytics lifecycle:

• apps require integration beyond Hadoop
• multiple topologies, mixed workloads, multi-tenancy
• higher utilization
• lower latency
• highly-available, long running services
• more than “Just JVM” – e.g., Python growth

keep in mind the priority for multi-disciplinary efforts,
to break down even more silos – well beyond the
de facto “priesthood” of data engineering
Beyond Hadoop
Google has been doing data center computing for years,
to address the complexities of large-scale data workflows:

• leveraging the modern kernel: isolation in lieu of VMs
• mixed workloads, multi-tenancy
• “most (>80%) jobs are batch jobs, but the majority
of resources (55–80%) are allocated to service jobs”
• relatively high utilization rates
• JVM? not so much…
• reality: scheduling batch is simple;
scheduling services is hard/expensive
“Return of the Borg”
Return of the Borg: How Twitter Rebuilt Google’s
Secret Weapon
Cade Metz
wired.com/wiredenterprise/2013/03/google-borg-twitter-mesos
The Datacenter as a Computer: An Introduction
to the Design of Warehouse-Scale Machines
Luiz André Barroso, Urs Hölzle
research.google.com/pubs/pub35290.html
2011 GAFS Omega
John Wilkes, et al.
youtu.be/0ZFMlO98Jkc
“Return of the Borg”
Omega: flexible, scalable schedulers for large compute clusters
Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, John Wilkes
eurosys2013.tudos.org/wp-content/uploads/2013/paper/Schwarzkopf.pdf
Mesos – definitions
a common substrate for cluster computing
http://mesos.apache.org/
heterogeneous assets in your data center or cloud
made available as a homogeneous set of resources

• top-level Apache project
• scalability to 10,000s of nodes
• obviates the need for virtual machines
• isolation (pluggable) for CPU, RAM, I/O, FS, etc.
• fault-tolerant replicated master using ZooKeeper
• multi-resource scheduling (memory and CPU aware)
• APIs in C++, Java, Python
• web UI for inspecting cluster state
• available for Linux, OpenSolaris, Mac OS X
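To make "multi-resource scheduling (memory and CPU aware)" concrete: one well-known policy for this is Dominant Resource Fairness (DRF), which the Mesos project describes for its allocator. The slide does not name a policy, so treat DRF here as an assumption. A minimal sketch, with hypothetical framework names "A" and "B" and a hypothetical cluster of 9 CPUs and 18 GB:

```python
from fractions import Fraction


def drf_allocate(capacity, demands):
    """Greedy DRF sketch: repeatedly give one task to the framework with the
    smallest dominant share, among frameworks whose next task still fits.

    capacity: {resource: total units}
    demands:  {framework: {resource: units per task}}, each task is assumed
              to need a positive amount of at least one resource
    """
    used = {r: 0 for r in capacity}
    tasks = {f: 0 for f in demands}

    def dominant_share(f):
        # largest fraction of any single resource held by framework f
        return max(Fraction(tasks[f] * demands[f][r], capacity[r]) for r in capacity)

    def fits(f):
        return all(used[r] + demands[f][r] <= capacity[r] for r in capacity)

    while True:
        candidates = [f for f in demands if fits(f)]
        if not candidates:
            return tasks
        chosen = min(candidates, key=dominant_share)
        for r in capacity:
            used[r] += demands[chosen][r]
        tasks[chosen] += 1


# hypothetical cluster and per-task demands
capacity = {"cpu": 9, "mem_gb": 18}
demands = {
    "A": {"cpu": 1, "mem_gb": 4},  # memory-heavy tasks
    "B": {"cpu": 3, "mem_gb": 1},  # CPU-heavy tasks
}
print(drf_allocate(capacity, demands))  # → {'A': 3, 'B': 2}
```

Both frameworks end with a dominant share of 2/3 (A holds 12 of 18 GB, B holds 6 of 9 CPUs), which is the sense in which the allocation is fair across unlike resources. A real allocator works with resource offers and many more constraints; this loop only illustrates the fairness rule.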
Mesos – architecture
given the use of Mesos as a Data Center OS kernel…

• Chronos provides complex scheduling capabilities,
much like a distributed Unix “cron”

• Marathon provides highly-available long-running
services, much like a distributed Unix “init.d”

• next time you need to build a distributed app,
consider using these as building blocks

a major lesson learned from Spark:

• leveraging these kinds of building blocks,
one can rebuild Hadoop 100x faster,
in much less code
Mesos – architecture
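(architecture diagram; layers, top to bottom: Workloads, Apps, Frameworks, Kernel, DFS, Cluster)
workloads: services, batch
apps and frameworks shown: Scalding, MPI, Impala, Hadoop, Shark, Spark, MySQL, Kafka, JBoss, Django, Chronos, Storm, Rails, Marathon
language runtimes shown: Python, Ruby, C++, JVM
kernel: distributed resources: CPU, RAM, I/O, FS, rack locality, etc.
DFS: distributed file system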
Deployments
Case Study: Twitter (bare metal / on premise)
“Mesos is the cornerstone of our elastic compute infrastructure –
it’s how we build all our new services and is critical for Twitter’s
continued success at scale. It’s one of the primary keys to our
data center efficiency.”
Chris Fry, SVP Engineering

blog.twitter.com/2013/mesos-graduates-from-apache-incubation

• key services run in production: analytics, typeahead, ads

• Twitter engineers rely on Mesos to build all new services

• instead of thinking about static machines, engineers think
about resources like CPU, memory and disk

• allows services to scale and leverage a shared pool of
servers across data centers efficiently

• reduces the time between prototyping and launching
Case Study: Airbnb (fungible cloud infrastructure)
“We think we might be pushing data science in the field of travel
more so than anyone has ever done before… a smaller number
of engineers can have higher impact through automation on
Mesos.”
Mike Curtis, VP Engineering

gigaom.com/2013/07/29/airbnb-is-engineering-itself-into-a-data-driven...

• improves resource management and efficiency

• helps advance engineering strategy of building small teams
that can move fast

• key to letting engineers make the most of AWS-based
infrastructure beyond just Hadoop

• allowed company to migrate off Elastic MapReduce

• enables use of Hadoop along with Chronos, Spark, Storm, etc.
Arguments for Data Center Computing
rather than running several specialized clusters, each
at relatively low utilization rates, instead run many
mixed workloads
obvious benefits are realized in terms of:
• scalability, elasticity, fault tolerance, performance, utilization
• reduced equipment cap-ex, Ops overhead, etc.
• reduced licensing, eliminating need for VMs or
potential vendor lock-in

subtle benefits – arguably, more important for Enterprise IT:
• reduced time for engineers to ramp up new services at scale
• reduced latency between batch and services, enabling new
high-ROI use cases
• enables Dev/Test apps to run safely on a Production cluster
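The utilization argument above can be checked with simple arithmetic: a dedicated cluster must be sized for its own peak, while a shared cluster is sized for the peak of the combined load. A minimal sketch, assuming hypothetical demand profiles (in cores, over four time slots) and ignoring headroom and isolation overhead:

```python
def utilization(demand_by_workload):
    """Compare siloed clusters with one shared cluster for the same demand.

    demand_by_workload: {workload: [cores needed in each time slot]}
    """
    series = list(demand_by_workload.values())
    totals = [sum(slot) for slot in zip(*series)]  # combined demand per slot
    mean_demand = sum(totals) / len(totals)

    siloed_capacity = sum(max(s) for s in series)  # each cluster sized for its own peak
    shared_capacity = max(totals)                  # one cluster sized for the combined peak
    return {
        "siloed_capacity": siloed_capacity,
        "shared_capacity": shared_capacity,
        "siloed_utilization": mean_demand / siloed_capacity,
        "shared_utilization": mean_demand / shared_capacity,
    }


# hypothetical profiles: batch peaks when services are quiet
profiles = {
    "batch": [10, 10, 80, 80],
    "services": [70, 60, 20, 10],
    "adhoc": [10, 30, 0, 10],
}
for key, value in utilization(profiles).items():
    print(key, round(value, 3))
```

The gain comes entirely from workloads peaking at different times; if every workload peaked in the same slot, the two capacities would be equal.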
Media Coverage
Mesosphere Adds Docker Support To Its Mesos-Based Operating System For The Data Center
Frederic Lardinois
TechCrunch (2013-09-26)
techcrunch.com/2013/09/26/mesosphere...
Play Framework Grid Deployment with Mesos
James Ward, Flo Leibert, et al.
Typesafe blog (2013-09-19)
typesafe.com/blog/play-framework-grid...
Mesosphere Launches Marathon Framework
Adrian Bridgwater
Dr. Dobbs (2013-09-18)
drdobbs.com/open-source/mesosphere...
New open source tech Marathon wants to make your data center run like Google’s
Derrick Harris
GigaOM (2013-09-04)
gigaom.com/2013/09/04/new-open-source...
Running batch and long-running, highly available service jobs on the same cluster
Ben Lorica
O’Reilly (2013-09-01)
strata.oreilly.com/2013/09/running-batch...
Resources
Apache Mesos Project
mesos.apache.org
Mesosphere
mesosphere.io
Tutorial
mesosphere.io/2013/08/01/...
Documentation
mesos.apache.org/documentation
2011 USENIX Research Paper
usenix.org/legacy/event/nsdi11/tech/full_papers/Hindman_new.pdf
Collected Notes/Archives
goo.gl/jPtTP
Enterprise Data Workflows with Cascading
O’Reilly, 2013
shop.oreilly.com/product/0636920028536.do
monthly newsletter for updates, events,
conference summaries, etc.:
liber118.com/pxn/

More Related Content

What's hot

An introduction to multi-model databases
An introduction to multi-model databasesAn introduction to multi-model databases
An introduction to multi-model databasesBerta Hermida Plaza
 
GraphTech Ecosystem - part 2: Graph Analytics
 GraphTech Ecosystem - part 2: Graph Analytics GraphTech Ecosystem - part 2: Graph Analytics
GraphTech Ecosystem - part 2: Graph AnalyticsLinkurious
 
Reducing Development Time for Production-Grade Hadoop Applications
Reducing Development Time for Production-Grade Hadoop ApplicationsReducing Development Time for Production-Grade Hadoop Applications
Reducing Development Time for Production-Grade Hadoop ApplicationsCascading
 
Graph analytics in Linkurious Enterprise
Graph analytics in Linkurious EnterpriseGraph analytics in Linkurious Enterprise
Graph analytics in Linkurious EnterpriseLinkurious
 
Guacamole Fiesta: What do avocados and databases have in common?
Guacamole Fiesta: What do avocados and databases have in common?Guacamole Fiesta: What do avocados and databases have in common?
Guacamole Fiesta: What do avocados and databases have in common?ArangoDB Database
 
Practical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibPractical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibDatabricks
 
towards_analytics_query_engine
towards_analytics_query_enginetowards_analytics_query_engine
towards_analytics_query_engineNantia Makrynioti
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionChetan Khatri
 
What's new with Apache Spark?
What's new with Apache Spark?What's new with Apache Spark?
What's new with Apache Spark?Paco Nathan
 
Big Data - Lab A1 (SC 11 Tutorial)
Big Data - Lab A1 (SC 11 Tutorial)Big Data - Lab A1 (SC 11 Tutorial)
Big Data - Lab A1 (SC 11 Tutorial)Robert Grossman
 
Elegant and Scalable Code Querying with Code Property Graphs
Elegant and Scalable Code Querying with Code Property GraphsElegant and Scalable Code Querying with Code Property Graphs
Elegant and Scalable Code Querying with Code Property GraphsConnected Data World
 
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_productionPyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_productionChetan Khatri
 
From keyword-based search to language-agnostic semantic search
From keyword-based search to language-agnostic semantic searchFrom keyword-based search to language-agnostic semantic search
From keyword-based search to language-agnostic semantic searchCareerBuilder.com
 
ArangoDB 3.7 Roadmap: Performance at Scale
ArangoDB 3.7 Roadmap: Performance at ScaleArangoDB 3.7 Roadmap: Performance at Scale
ArangoDB 3.7 Roadmap: Performance at ScaleArangoDB Database
 
Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)Modern Data Stack France
 
Reproducible AI Using PyTorch and MLflow
Reproducible AI Using PyTorch and MLflowReproducible AI Using PyTorch and MLflow
Reproducible AI Using PyTorch and MLflowDatabricks
 
"Managing the Complete Machine Learning Lifecycle with MLflow"
"Managing the Complete Machine Learning Lifecycle with MLflow""Managing the Complete Machine Learning Lifecycle with MLflow"
"Managing the Complete Machine Learning Lifecycle with MLflow"Databricks
 
Data ops: Machine Learning in production
Data ops: Machine Learning in productionData ops: Machine Learning in production
Data ops: Machine Learning in productionStepan Pushkarev
 
Vital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent AppsVital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent AppsVital.AI
 

What's hot (20)

Spark - Philly JUG
Spark  - Philly JUGSpark  - Philly JUG
Spark - Philly JUG
 
An introduction to multi-model databases
An introduction to multi-model databasesAn introduction to multi-model databases
An introduction to multi-model databases
 
GraphTech Ecosystem - part 2: Graph Analytics
 GraphTech Ecosystem - part 2: Graph Analytics GraphTech Ecosystem - part 2: Graph Analytics
GraphTech Ecosystem - part 2: Graph Analytics
 
Reducing Development Time for Production-Grade Hadoop Applications
Reducing Development Time for Production-Grade Hadoop ApplicationsReducing Development Time for Production-Grade Hadoop Applications
Reducing Development Time for Production-Grade Hadoop Applications
 
Graph analytics in Linkurious Enterprise
Graph analytics in Linkurious EnterpriseGraph analytics in Linkurious Enterprise
Graph analytics in Linkurious Enterprise
 
Guacamole Fiesta: What do avocados and databases have in common?
Guacamole Fiesta: What do avocados and databases have in common?Guacamole Fiesta: What do avocados and databases have in common?
Guacamole Fiesta: What do avocados and databases have in common?
 
Practical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibPractical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlib
 
towards_analytics_query_engine
towards_analytics_query_enginetowards_analytics_query_engine
towards_analytics_query_engine
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
 
What's new with Apache Spark?
What's new with Apache Spark?What's new with Apache Spark?
What's new with Apache Spark?
 
Big Data - Lab A1 (SC 11 Tutorial)
Big Data - Lab A1 (SC 11 Tutorial)Big Data - Lab A1 (SC 11 Tutorial)
Big Data - Lab A1 (SC 11 Tutorial)
 
Elegant and Scalable Code Querying with Code Property Graphs
Elegant and Scalable Code Querying with Code Property GraphsElegant and Scalable Code Querying with Code Property Graphs
Elegant and Scalable Code Querying with Code Property Graphs
 
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_productionPyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production
 
From keyword-based search to language-agnostic semantic search
From keyword-based search to language-agnostic semantic searchFrom keyword-based search to language-agnostic semantic search
From keyword-based search to language-agnostic semantic search
 
ArangoDB 3.7 Roadmap: Performance at Scale
ArangoDB 3.7 Roadmap: Performance at ScaleArangoDB 3.7 Roadmap: Performance at Scale
ArangoDB 3.7 Roadmap: Performance at Scale
 
Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)
 
Reproducible AI Using PyTorch and MLflow
Reproducible AI Using PyTorch and MLflowReproducible AI Using PyTorch and MLflow
Reproducible AI Using PyTorch and MLflow
 
"Managing the Complete Machine Learning Lifecycle with MLflow"
"Managing the Complete Machine Learning Lifecycle with MLflow""Managing the Complete Machine Learning Lifecycle with MLflow"
"Managing the Complete Machine Learning Lifecycle with MLflow"
 
Data ops: Machine Learning in production
Data ops: Machine Learning in productionData ops: Machine Learning in production
Data ops: Machine Learning in production
 
Vital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent AppsVital.AI Creating Intelligent Apps
Vital.AI Creating Intelligent Apps
 

Viewers also liked

On the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) modelsOn the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) modelsVillu Ruusmann
 
PMML - Predictive Model Markup Language
PMML - Predictive Model Markup LanguagePMML - Predictive Model Markup Language
PMML - Predictive Model Markup Languageaguazzel
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsPaco Nathan
 
Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to productionGeorg Heiler
 
Machine Learning In Production
Machine Learning In ProductionMachine Learning In Production
Machine Learning In ProductionSamir Bessalah
 
Enabling Real-Time Analytics for IoT
Enabling Real-Time Analytics for IoTEnabling Real-Time Analytics for IoT
Enabling Real-Time Analytics for IoTSingleStore
 
AnalyticOps - Chicago PAW 2016
AnalyticOps - Chicago PAW 2016AnalyticOps - Chicago PAW 2016
AnalyticOps - Chicago PAW 2016Robert Grossman
 
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...Robert Grossman
 

Viewers also liked (8)

On the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) modelsOn the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) models
 
PMML - Predictive Model Markup Language
PMML - Predictive Model Markup LanguagePMML - Predictive Model Markup Language
PMML - Predictive Model Markup Language
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analytics
 
Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to production
 
Machine Learning In Production
Machine Learning In ProductionMachine Learning In Production
Machine Learning In Production
 
Enabling Real-Time Analytics for IoT
Enabling Real-Time Analytics for IoTEnabling Real-Time Analytics for IoT
Enabling Real-Time Analytics for IoT
 
AnalyticOps - Chicago PAW 2016
AnalyticOps - Chicago PAW 2016AnalyticOps - Chicago PAW 2016
AnalyticOps - Chicago PAW 2016
 
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
 

Similar to ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandFrançois Garillot
 
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit
 
More Stored Procedures and MUMPS for DivConq
More Stored Procedures and  MUMPS for DivConqMore Stored Procedures and  MUMPS for DivConq
More Stored Procedures and MUMPS for DivConqeTimeline, LLC
 
Multi faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & loggingMulti faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & logginglucenerevolution
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)wqchen
 
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraRustam Aliyev
 
SparkR-Advance Analytic for Big Data
SparkR-Advance Analytic for Big DataSparkR-Advance Analytic for Big Data
SparkR-Advance Analytic for Big Datasamuel shamiri
 
Java Web Programming [5/9] : EL, JSTL and Custom Tags
Java Web Programming [5/9] : EL, JSTL and Custom TagsJava Web Programming [5/9] : EL, JSTL and Custom Tags
Java Web Programming [5/9] : EL, JSTL and Custom TagsIMC Institute
 
JavaScript Functions
JavaScript Functions JavaScript Functions
JavaScript Functions Reem Alattas
 
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...DataStax
 
Using Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraUsing Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraJim Hatcher
 
Informatics Practices (new) solution CBSE 2021, Compartment, improvement ex...
Informatics Practices (new) solution CBSE  2021, Compartment,  improvement ex...Informatics Practices (new) solution CBSE  2021, Compartment,  improvement ex...
Informatics Practices (new) solution CBSE 2021, Compartment, improvement ex...FarhanAhmade
 
Big Data processing with Spark, Scala or Java?
Big Data processing with Spark, Scala or Java?Big Data processing with Spark, Scala or Java?
Big Data processing with Spark, Scala or Java?Erik-Berndt Scheper
 
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
A Rusty introduction to Apache Arrow and how it applies to a  time series dat...A Rusty introduction to Apache Arrow and how it applies to a  time series dat...
A Rusty introduction to Apache Arrow and how it applies to a time series dat...Andrew Lamb
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Massimo Schenone
 
Project A Data Modelling Best Practices Part II: How to Build a Data Warehouse?
Project A Data Modelling Best Practices Part II: How to Build a Data Warehouse?Project A Data Modelling Best Practices Part II: How to Build a Data Warehouse?
Project A Data Modelling Best Practices Part II: How to Build a Data Warehouse?Martin Loetzsch
 
Intermediate WhizzML Workflows
Intermediate WhizzML WorkflowsIntermediate WhizzML Workflows
Intermediate WhizzML WorkflowsBigML, Inc
 

Similar to ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop (20)

Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
 
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
 
More Stored Procedures and MUMPS for DivConq
More Stored Procedures and  MUMPS for DivConqMore Stored Procedures and  MUMPS for DivConq
More Stored Procedures and MUMPS for DivConq
 
Multi faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & loggingMulti faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & logging
 
SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)SparkR - Play Spark Using R (20160909 HadoopCon)
SparkR - Play Spark Using R (20160909 HadoopCon)
 
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and Cassandra
 
Lab manual asp.net
Lab manual asp.netLab manual asp.net
Lab manual asp.net
 
Converting R to PMML
Converting R to PMMLConverting R to PMML
Converting R to PMML
 
Deploying Machine Learning Models to Production
Deploying Machine Learning Models to ProductionDeploying Machine Learning Models to Production
Deploying Machine Learning Models to Production
 
SparkR-Advance Analytic for Big Data
SparkR-Advance Analytic for Big DataSparkR-Advance Analytic for Big Data
SparkR-Advance Analytic for Big Data
 
Java Web Programming [5/9] : EL, JSTL and Custom Tags
Java Web Programming [5/9] : EL, JSTL and Custom TagsJava Web Programming [5/9] : EL, JSTL and Custom Tags
Java Web Programming [5/9] : EL, JSTL and Custom Tags
 
JavaScript Functions
JavaScript Functions JavaScript Functions
JavaScript Functions
 
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...
 
Using Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraUsing Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into Cassandra
 
Informatics Practices (new) solution CBSE 2021, Compartment, improvement ex...
Informatics Practices (new) solution CBSE  2021, Compartment,  improvement ex...Informatics Practices (new) solution CBSE  2021, Compartment,  improvement ex...
Informatics Practices (new) solution CBSE 2021, Compartment, improvement ex...
 
Big Data processing with Spark, Scala or Java?
Big Data processing with Spark, Scala or Java?Big Data processing with Spark, Scala or Java?
Big Data processing with Spark, Scala or Java?
 
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
A Rusty introduction to Apache Arrow and how it applies to a  time series dat...A Rusty introduction to Apache Arrow and how it applies to a  time series dat...
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 
Project A Data Modelling Best Practices Part II: How to Build a Data Warehouse?
Project A Data Modelling Best Practices Part II: How to Build a Data Warehouse?Project A Data Modelling Best Practices Part II: How to Build a Data Warehouse?
Project A Data Modelling Best Practices Part II: How to Build a Data Warehouse?
 
Intermediate WhizzML Workflows
Intermediate WhizzML WorkflowsIntermediate WhizzML Workflows
Intermediate WhizzML Workflows
 

ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop

  • 1. ACM Big Data Mining Camp, 2013-10-12: Cascading, Pattern, and PMML Paco Nathan @pacoid Chief Scientist, Mesosphere
  • 2. Cascading, Pattern, and PMML: 1. PMML and R (30 min lab) 2. Cascading Overview (15 min) 3. Model Scoring (30 min lab) 4. < break/ > 5. Ensembles, Experiments, etc. (15 min) 6. Industry Practices (20 min) 7. Q & A ACM, 2013-10-12
  • 3. PMML – an industry standard • established XML standard for predictive model markup • organized by Data Mining Group (DMG), since 1997 http://dmg.org/ • members: IBM, SAS, Visa, NASA, Equifax, Microstrategy, Microsoft, etc. • PMML concepts for metadata, ensembles, etc., translate directly into Cascading tuple flows “PMML is the leading standard for statistical and data mining models and supported by over 20 vendors and organizations. With PMML, it is easy to develop a model on one system using one application and deploy the model on another system using another application.” wikipedia.org/wiki/Predictive_Model_Markup_Language
  • 4. PMML – vendor coverage
  • 5. PMML – model coverage • Association Rules: AssociationModel element • Cluster Models: ClusteringModel element • Decision Trees: TreeModel element • Naïve Bayes Classifiers: NaiveBayesModel element • Neural Networks: NeuralNetwork element • Regression: RegressionModel and GeneralRegressionModel elements • Rulesets: RuleSetModel element • Sequences: SequenceModel element • Support Vector Machines: SupportVectorMachineModel element • Text Models: TextModel element • Time Series: TimeSeriesModel element ibm.com/developerworks/industry/library/ind-PMML2/
  • 6. PMML – create a model in R
      ## train a RandomForest model
      f <- as.formula("as.factor(label) ~ .")
      fit <- randomForest(f, data_train, ntree=50)

      ## test the model on the holdout test set
      print(fit$importance)
      print(fit)

      predicted <- predict(fit, data)
      data$predicted <- predicted
      confuse <- table(pred = predicted, true = data[,1])
      print(confuse)

      ## export predicted labels to TSV
      write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)

      ## export RF model to PMML
      saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
  • 7. PMML – capture business logic of analytics workflows <?xml version="1.0"?> <PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0"  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"  xsi:schemaLocation="http://www.dmg.org/PMML-4_0 http://www.dmg.org/v4-0/pmml-4-0.xsd">  <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">   <Extension name="user" value="ceteri" extender="Rattle/PMML"/>   <Application name="Rattle/PMML" version="1.2.30"/>   <Timestamp>2012-10-22 19:39:28</Timestamp>  </Header>  <DataDictionary numberOfFields="4">   <DataField name="label" optype="categorical" dataType="string">    <Value value="0"/>    <Value value="1"/>   </DataField>   <DataField name="var0" optype="continuous" dataType="double"/>   <DataField name="var1" optype="continuous" dataType="double"/>   <DataField name="var2" optype="continuous" dataType="double"/>  </DataDictionary>  <MiningModel modelName="randomForest_Model" functionName="classification">   <MiningSchema>    <MiningField name="label" usageType="predicted"/>    <MiningField name="var0" usageType="active"/>    <MiningField name="var1" usageType="active"/>    <MiningField name="var2" usageType="active"/>   </MiningSchema>   <Segmentation multipleModelMethod="majorityVote">    <Segment id="1">     <True/>     <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit">      <MiningSchema>       <MiningField name="label" usageType="predicted"/>       <MiningField name="var0" usageType="active"/>       <MiningField name="var1" usageType="active"/>       <MiningField name="var2" usageType="active"/>      </MiningSchema> ...
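The Segmentation element in the PMML on this slide declares multipleModelMethod="majorityVote": each segment's tree casts one vote and the most frequent label wins. A minimal Java sketch of that idea follows. It is not Pattern's implementation; the class name, the stub "trees", and the tie-breaking rule (the first label to reach the top count wins) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration of majority-vote scoring over an ensemble of trees.
class MajorityVoteSketch {

    // Each "tree" maps a feature vector (var0, var1, var2) to a label "0" or "1".
    static String majorityVote(List<Function<double[], String>> trees, double[] features) {
        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (Function<double[], String> tree : trees) {
            String label = tree.apply(features);
            int n = votes.merge(label, 1, Integer::sum); // add one vote for this label
            if (n > bestCount) {                         // assumption: earliest leader wins ties
                best = label;
                bestCount = n;
            }
        }
        return best;
    }

    // Three stub trees, each thresholding one variable; stand-ins for real TreeModel segments.
    static List<Function<double[], String>> demoTrees() {
        List<Function<double[], String>> trees = new ArrayList<>();
        trees.add(x -> x[0] > 0.5 ? "1" : "0");
        trees.add(x -> x[1] > 0.5 ? "1" : "0");
        trees.add(x -> x[2] > 0.5 ? "1" : "0");
        return trees;
    }

    public static void main(String[] args) {
        // Votes are "1", "0", "1", so the ensemble predicts "1".
        System.out.println(majorityVote(demoTrees(), new double[] {0.9, 0.1, 0.8}));
    }
}
```

In a real deployment the stub lambdas would be replaced by evaluators built from each Segment's TreeModel.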
  • 8. PMML – further study PMML in Action Alex Guazzelli, Wen-Ching Lin, Tridivesh Jena amazon.com/dp/1470003244 See also excellent resources at: zementis.com/pmml.htm
  • 9. Lab: RStudio and PMML in R set up RStudio… rstudio.com/ide/ use the Iris data to build predictive models… github.com/Cascading/pattern pattern-examples/examples/r/rattle_pmml.R • test/train hold-outs • evaluating predictive power • export as PMML
  • 10. Model: data prep based on “Iris”
      library(pmml)
      library(randomForest)
      library(nnet)
      library(XML)
      library(kernlab)

      ## split data into test and train sets
      data(iris)
      iris_full <- iris
      colnames(iris_full) <- c("sepal_length", "sepal_width", "petal_length", "petal_width", "species")

      idx <- sample(150, 100)
      iris_train <- iris_full[idx,]
      iris_test <- iris_full[-idx,]
  • 11. Model: Random Forest
      ## http://mkseo.pe.kr/stats/?p=220
      f <- as.formula("as.factor(species) ~ .")
      fit <- randomForest(f, data=iris_train, proximity=TRUE, ntree=50)

      print(fit$importance)
      print(fit)
      print(table(iris_test$species, predict(fit, iris_test, type="class")))

      plot(fit, log="y", main="Random Forest")
      varImpPlot(fit)
      MDSplot(fit, iris_full$species)

      out <- iris_full
      out$predict <- predict(fit, out, type="class")

      write.table(out, file=paste(dat_folder, "iris.rf.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)
      saveXML(pmml(fit), file=paste(dat_folder, "iris.rf.xml", sep="/"))
  • 12. Model: Linear Regression
      ## http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/r/iris_lm/
      f <- as.formula("sepal_length ~ .")
      fit <- lm(f, data=iris_train)

      print(summary(fit))
      print(table(round(iris_test$sepal_length), round(predict(fit, iris_test))))

      op <- par(mfrow = c(3, 2))
      plot(predict(fit), main="Linear Regression")
      plot(iris_full$petal_length, iris_full$petal_width, pch=21, bg=c("red", "green3", "blue")[unclass(iris_full$species)], main="Edgar Anderson's Iris Data", xlab="petal length", ylab="petal width")
      plot(fit)
      par(op)

      out <- iris_full
      out$predict <- predict(fit, out)

      write.table(out, file=paste(dat_folder, "iris.lm_p.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)
      saveXML(pmml(fit), file=paste(dat_folder, "iris.lm_p.xml", sep="/"))
  • 13. Model: Neural Network
      ## http://statisticsr.blogspot.com/2008/10/notes-for-nnet.html
      samp <- c(sample(1:50,25), sample(51:100,25), sample(101:150,25))

      ird <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]), species=factor(c(rep("setosa",50), rep("versicolor", 50), rep("virginica", 50))))

      f <- as.formula("species ~ .")
      fit <- nnet(f, data=ird, subset=samp, size=2, rang=0.1, decay=5e-4, maxit=200)

      print(fit)
      print(summary(fit))
      print(table(ird$species[-samp], predict(fit, ird[-samp,], type = "class")))

      out <- ird
      out$predict <- predict(fit, ird, type="class")

      write.table(out, file=paste(dat_folder, "iris.nn.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)
      saveXML(pmml(fit), file=paste(dat_folder, "iris.nn.xml", sep="/"))
  • 14. Model: K-Means Clustering
      ## http://mkseo.pe.kr/stats/?p=15
      ds <- iris_full[,-5]
      fit <- kmeans(ds, 3)

      print(fit)
      print(summary(fit))
      print(table(fit$cluster, iris_full$species))

      op <- par(mfrow = c(1, 1))
      plot(iris_full$sepal_length, iris_full$sepal_width, pch = 23, bg = c("blue", "red", "green")[fit$cluster], main="K-Means Clustering")
      points(fit$centers[,c(1, 2)], col=1:3, pch=8, cex=2)
      par(op)

      out <- iris_full
      out$predict <- fit$cluster

      write.table(out, file=paste(dat_folder, "iris.kmeans.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)
      saveXML(pmml(fit), file=paste(dat_folder, "iris.kmeans.xml", sep="/"))
  • 15. Model: Hierarchical Clustering
      ## http://mkseo.pe.kr/stats/?p=15
      i = as.matrix(iris_full[,-5])
      fit <- hclust(dist(i), method = "average")

      initial <- tapply(i, list(rep(cutree(fit, 3), ncol(i)), col(i)), mean)
      dimnames(initial) <- list(NULL, dimnames(i)[[2]])
      kls = cutree(fit, 3)

      print(fit)
      print(table(iris_full$species, kls))

      op <- par(mfrow = c(1, 1))
      plclust(fit, main="Hierarchical Clustering")
      par(op)

      out <- iris_full
      out$predict <- kls

      write.table(out, file=paste(dat_folder, "iris.hc.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)
      saveXML(pmml(fit, data=iris, centers=initial), file=paste(dat_folder, "iris.hc.xml", sep="/"))
  • 16. Model: Support Vector Machine
      ## https://support.zementis.com/entries/21176632-what-types-of-svm-models-built-inr-can-i-export-to-pmml
      f <- as.formula("species ~ .")
      fit <- ksvm(f, data=iris_train, kernel="rbfdot", prob.model=TRUE)

      print(fit)
      print(table(iris_test$species, predict(fit, iris_test)))

      out <- iris_full
      out$predict <- predict(fit, out)

      write.table(out, file=paste(dat_folder, "iris.svm.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)
      saveXML(pmml(fit, dataset=iris_train), file=paste(dat_folder, "iris.svm.xml", sep="/"))
  • 17. Cascading, Pattern, and PMML: 1. PMML and R (30 min lab) 2. Cascading Overview (15 min) 3. Model Scoring (30 min lab) 4. < break/ > 5. Ensembles, Experiments, etc. (15 min) 6. Industry Practices (20 min) 7. Q & A ACM, 2013-10-12
  • 18. Anatomy of an Enterprise app definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data sources data prep predictive model end uses
  • 19. Anatomy of an Enterprise app definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ANSI SQL for ETL ETL data sources data prep predictive model end uses
  • 20. Anatomy of an Enterprise app definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data sources data prep Java, Pig for business logic predictive model end uses
  • 21. Anatomy of an Enterprise app definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… SAS for predictive models ETL data sources data prep predictive model end uses
  • 22. Anatomy of an Enterprise app definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ANSI SQL for ETL most of the licensing costs… ETL data sources data prep SAS for predictive models predictive model end uses
  • 23. Anatomy of an Enterprise app definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… most of the project costs… ETL data sources data prep Java, Pig for business logic predictive model end uses
  • 24. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source cascading.org [diagram stages: ETL, data sources, data prep, predictive model, end uses; annotations: source taps for Cassandra, JDBC, Splunk, etc.; Lingual: DW → ANSI SQL; business logic in Java, Clojure, Scala, etc.; Pattern: SAS, R, etc. → PMML; sink taps for Memcached, HBase, MongoDB, etc.] a compiler sees it all… one connected DAG: • optimization • troubleshooting • exception handling • notifications
  • 25. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source cascading.org [diagram stages: ETL, data sources, data prep, predictive model, end uses; annotations: source taps for Cassandra, JDBC, Splunk, etc.; Lingual: DW → ANSI SQL; business logic in Java, Clojure, Scala, etc.; Pattern: SAS, R, etc. → PMML; sink taps for Memcached, HBase, MongoDB, etc.] a compiler sees it all…
      FlowDef flowDef = FlowDef.flowDef()
        .setName( "etl" )
        .addSource( "example.employee", emplTap )
        .addSource( "example.sales", salesTap )
        .addSink( "results", resultsTap );

      SQLPlanner sqlPlanner = new SQLPlanner()
        .setSql( sqlStatement );

      flowDef.addAssemblyPlanner( sqlPlanner );
  • 26. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source [diagram stages: ETL, data sources, data prep, predictive model, end uses; annotations: source taps for Cassandra, JDBC, Splunk, etc.; Lingual: DW → ANSI SQL; business logic in Java, Clojure, Scala, etc.; Pattern: SAS, R, etc. → PMML; sink taps for Memcached, HBase, MongoDB, etc.] a compiler sees it all…
      FlowDef flowDef = FlowDef.flowDef()
        .setName( "classifier" )
        .addSource( "input", inputTap )
        .addSink( "classify", classifyTap );

      PMMLPlanner pmmlPlanner = new PMMLPlanner()
        .setPMMLInput( new File( pmmlModel ) )
        .retainOnlyActiveIncomingFields();

      flowDef.addAssemblyPlanner( pmmlPlanner );
  • 27. Cascading – functional programming Key insight: MapReduce is based on functional programming – back to LISP in 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature. To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007, as a new Java API to implement functional programming for large-scale data workflows: • leverages JVM and Java-based tools without any need to create new languages • allows programmers who have Java expertise to leverage the economics of Hadoop clusters Edgar Codd alluded to this (DSLs for structuring data) in his original paper about relational model
  • 28. Cascading – functional programming • Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc., have invested in open source projects atop Cascading – used for their large-scale production deployments • new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming: Cascalog in Clojure (2010) github.com/nathanmarz/cascalog/wiki Scalding in Scala (2012) github.com/twitter/scalding/wiki Why Adopting the Declarative Programming Practices Will Improve Your Return from Technology Dan Woods, 2013-04-17 Forbes forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-practices-will-improve-your-return-from-technology/
  • 29. Functional Programming for Big Data WordCount with token scrubbing… Apache Hive: 52 lines HQL + 8 lines Python (UDF) compared to Scalding: 18 lines Scala/Cascading functional programming languages help reduce software engineering costs at scale, over time
  • 30. Cascading – deployments • case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, Factual, etc. • use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc.
  • 31. Workflow Abstraction – pattern language Cascading uses a “plumbing” metaphor in Java to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc. [flow diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (RHS: Stop Word List) → Regex token → GroupBy token → Count → Word Count] data is represented as flows of tuples; operations in the flows bring functional programming aspects into Java A Pattern Language Christopher Alexander, et al. amazon.com/dp/0195019199
  • 32. Workflow Abstraction – literate programming Cascading workflows generate their own visual documentation: flow diagrams [flow diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (RHS: Stop Word List) → Regex token → GroupBy token → Count → Word Count] in formal terms, flow diagrams leverage a methodology called literate programming; provides intuitive, visual representations for apps – great for cross-team collaboration Literate Programming Don Knuth literateprogramming.com
  • 33. Workflow Abstraction – business process following the essence of literate programming, Cascading workflows provide statements of business process this recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data) Cascading creates a separation of concerns between business process and implementation details (Hadoop, etc.) this is especially apparent in large-scale Cascalog apps: “Specify what you require, not how to achieve it.” by virtue of the pattern language, the flow planner then determines how to translate business process into efficient, parallel jobs at scale
  • 34. The Ubiquitous Word Count. Definition: count how often each word appears in a collection of text documents. This simple program provides an excellent test case for parallel processing: • requires a minimal amount of code • demonstrates use of both symbolic and numeric values • shows a dependency graph of tuples as an abstraction • is not many steps away from useful search indexing • serves as a “Hello World” for Hadoop apps. Pseudocode:
      void map (String doc_id, String text):
        for each word w in segment(text):
          emit(w, "1");
      void reduce (String word, Iterator group):
        int count = 0;
        for each pc in group:
          count += Int(pc);
        emit(word, String(count));
    A distributed computing framework that runs Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems.
  • 35. WordCount – conceptual flow diagram [flow diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count; 1 map, 1 reduce, 18 lines code] cascading.org/category/impatient gist.github.com/3900702
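The map and reduce pseudocode can be tried on a single machine. The sketch below is a plain Python rendering under stated assumptions: the shuffle between map and reduce is simulated with an in-memory dictionary, and the helper names and sample documents are hypothetical. It is not how Hadoop executes the job, only the same logic at toy scale.

```python
import re
from collections import defaultdict

def map_doc(doc_id, text):
    """Map step: emit a (word, 1) pair for every non-empty token."""
    for word in re.split(r"[ \[\]\(\),.]", text):
        if word:  # adjacent delimiters produce empty strings; skip them
            yield word, 1

def reduce_word(word, group):
    """Reduce step: sum the partial counts collected for one word."""
    return word, sum(group)

def word_count(docs):
    """Run map, a simulated shuffle (group by key), then reduce."""
    groups = defaultdict(list)
    for doc_id, text in docs:
        for word, one in map_doc(doc_id, text):
            groups[word].append(one)
    return dict(reduce_word(word, group) for word, group in groups.items())

# hypothetical sample documents
docs = [
    ("doc01", "A rain shadow is a dry area"),
    ("doc02", "a dry area (the lee side)"),
]
counts = word_count(docs)
```

Note that "A" and "a" are counted separately here, since the pseudocode does no scrubbing of tokens.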
  • 36. WordCount – Cascading app in Java
      String docPath = args[ 0 ];
      String wcPath = args[ 1 ];
      Properties properties = new Properties();
      AppProps.setApplicationJarClass( properties, Main.class );
      HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
      // create source and sink taps (tab-delimited, with header line)
      Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
      Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
      // specify a regex to split "document" text lines into token stream
      Fields token = new Fields( "token" );
      Fields text = new Fields( "text" );
      RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
      // only returns "token"
      Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
      // determine the word counts
      Pipe wcPipe = new Pipe( "wc", docPipe );
      wcPipe = new GroupBy( wcPipe, token );
      wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
      // connect the taps, pipes, etc., into a flow
      FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
        .addSource( docPipe, docTap )
        .addTailSink( wcPipe, wcTap );
      // write a DOT file and run the flow
      Flow wcFlow = flowConnector.connect( flowDef );
      wcFlow.writeDOT( "dot/wc.dot" );
      wcFlow.complete();
  • 37. WordCount – generated flow diagram [figure: flow diagram generated from the DOT file; nodes from head to tail] [head] → Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']'] → Each('token')[RegexSplitGenerator[decl:'token'][args:1]] → GroupBy('wc')[by:['token']] → Every('wc')[Count[decl:'count']] → Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']'] → [tail]. Edge labels show the tuple fields: [{2}:'doc_id', 'text'] from the source, [{1}:'token'] through the split and grouping, [{2}:'token', 'count'] into the sink. The diagram marks the map and reduce phases.
  • 38. WordCount – Cascalog / Clojure
      (ns impatient.core
        (:use [cascalog.api]
              [cascalog.more-taps :only (hfs-delimited)])
        (:require [clojure.string :as s]
                  [cascalog.ops :as c])
        (:gen-class))
      (defmapcatop split [line]
        "reads in a line of string and splits it by regex"
        (s/split line #"[\[\]\\\(\),.)\s]+"))
      (defn -main [in out & args]
        (?<- (hfs-delimited out)
             [?word ?count]
             ((hfs-delimited in :skip-header? true) _ ?line)
             (split ?line :> ?word)
             (c/count ?count)))
      ; Paul Lam
      ; github.com/Quantisan/Impatient
  • 39. WordCount – Cascalog / Clojure github.com/nathanmarz/cascalog/wiki • implements Datalog in Clojure, with predicates backed by Cascading – for a highly declarative language • run ad-hoc queries from the Clojure REPL – approx. 10:1 code reduction compared with SQL • composable subqueries, used for test-driven development (TDD) practices at scale • Leiningen build: simple, no surprises, in Clojure itself • more new deployments than other Cascading DSLs – Climate Corp is largest use case: 90% Clojure/Cascalog • has a learning curve, limited number of Clojure developers • aggregators are the magic, and those take effort to learn
  • 40. WordCount – Scalding / Scala
      import com.twitter.scalding._
      class WordCount(args : Args) extends Job(args) {
        Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
          .read
          .flatMap('text -> 'token) { text : String => text.split("[ \\[\\]\\(\\),.]") }
          .groupBy('token) { _.size('count) }
          .write(Tsv(args("wc"), writeHeader = true))
      }
  • 41. WordCount – Scalding / Scala github.com/twitter/scalding/wiki • extends the Scala collections API so that distributed lists become “pipes” backed by Cascading • code is compact, easy to understand • nearly 1:1 between elements of conceptual flow diagram and function calls • extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc. • significant investments by Twitter, Etsy, eBay, etc. • great for data services at scale • less learning curve than Cascalog
  • 42. WordCount – Apache Hive
      CREATE TABLE text_docs (line STRING);
      LOAD DATA LOCAL INPATH 'data/rain.txt' OVERWRITE INTO TABLE text_docs;
      SELECT word, COUNT(*)
      FROM (SELECT split(line, '\t')[1] AS text FROM text_docs) t
      LATERAL VIEW explode(split(text, '[ ,.()]')) lTable AS word
      GROUP BY word;
  • 43. WordCount – Apache Hive hive.apache.org pro: ‣ most popular abstraction atop Apache Hadoop ‣ SQL-like language is syntactically familiar to most analysts ‣ simple to load large-scale unstructured data and run ad-hoc queries con: ‣ not a relational engine, many surprises at scale ‣ difficult to represent complex workflows, ML algorithms, etc. ‣ one poorly-trained analyst can bottleneck an entire cluster ‣ app-level integration requires other coding, outside of script language ‣ logical planner mixed with physical planner; cannot collect app stats ‣ non-deterministic exec: number of maps+reduces may change unexpectedly ‣ business logic must cross multiple language boundaries: difficult to troubleshoot, optimize, audit, handle exceptions, set notifications, etc.
  • 44. WordCount – Apache Pig
      docPipe = LOAD '$docPath' USING PigStorage('\t', 'tagsource') AS (doc_id, text);
      docPipe = FILTER docPipe BY doc_id != 'doc_id';
      -- specify regex to split "document" text lines into token stream
      tokenPipe = FOREACH docPipe GENERATE doc_id, FLATTEN(TOKENIZE(text, ' [](),.')) AS token;
      tokenPipe = FILTER tokenPipe BY token MATCHES '\\w.*';
      -- determine the word counts
      tokenGroups = GROUP tokenPipe BY token;
      wcPipe = FOREACH tokenGroups GENERATE group AS token, COUNT(tokenPipe) AS count;
      -- output
      STORE wcPipe INTO '$wcPath' USING PigStorage('\t', 'tagsource');
      EXPLAIN -out dot/wc_pig.dot -dot wcPipe;
  • 45. WordCount – Apache Pig pig.apache.org pro: ‣ easy to learn data manipulation language (DML) ‣ interactive prompt (Grunt) makes it simple to prototype apps ‣ extensibility through UDFs con: ‣ not a full programming language; must extend via UDFs outside of language ‣ app-level integration requires other coding, outside of script language ‣ simple problems are simple to do; hard problems become quite complex ‣ difficult to parameterize scripts externally; must rewrite to change taps! ‣ logical planner mixed with physical planner; cannot collect app stats ‣ non-deterministic exec: number of maps+reduces may change unexpectedly ‣ business logic must cross multiple language boundaries: difficult to troubleshoot, optimize, audit, handle exceptions, set notifications, etc.
  • 46. Two Avenues to the App Layer… Enterprise: must contend with complexity at scale every day… incumbents extend current practices and infrastructure investments – using JVM, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff. Start-ups: crave complexity and scale to become viable… new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding. [chart axes: complexity ➞, scale ➞]
  • 48. Pattern – model scoring • migrate workloads: SAS, Teradata, etc., exporting predictive models as PMML • great open source tools – R, Weka, KNIME, Matlab, RapidMiner, etc. • integrate with other libraries – Matrix API, etc. • leverage PMML as another kind of DSL [architecture diagram with elements: Customers, Web App, logs, Cache, Support, trap tap, source taps, sink taps, Modeling, PMML, Data Workflow, Analytics Cubes, Reporting, Customer Prefs, customer profile DBs, Hadoop Cluster] cascading.org/pattern
  • 49. Pattern – score a model, using pre-defined Cascading app [flow diagram with elements: Customer Orders, PMML Model, Classify, Scored Orders, Assert, Failure Traps, GroupBy token, Count, Confusion Matrix; map (M) and reduce (R) phases] cascading.org/pattern
  • 50. Pattern – score a model, within an app
      public static void main( String[] args ) throws RuntimeException {
        String inputPath = args[ 0 ];
        String classifyPath = args[ 1 ];
        // set up the config properties
        Properties properties = new Properties();
        AppProps.setApplicationJarClass( properties, Main.class );
        HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
        // create source and sink taps (tab-delimited, with header line)
        Tap inputTap = new Hfs( new TextDelimited( true, "\t" ), inputPath );
        Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
        // handle command line options
        OptionParser optParser = new OptionParser();
        optParser.accepts( "pmml" ).withRequiredArg();
        OptionSet options = optParser.parse( args );
        // connect the taps, pipes, etc., into a flow
        FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
          .addSource( "input", inputTap )
          .addSink( "classify", classifyTap );
        if( options.hasArgument( "pmml" ) ) {
          String pmmlPath = (String) options.valuesOf( "pmml" ).get( 0 );
          PMMLPlanner pmmlPlanner = new PMMLPlanner()
            .setPMMLInput( new File( pmmlPath ) )
            .retainOnlyActiveIncomingFields()
            .setDefaultPredictedField( new Fields( "predict", Double.class ) ); // default value if missing from the model
          flowDef.addAssemblyPlanner( pmmlPlanner );
        }
        // write a DOT file and run the flow
        Flow classifyFlow = flowConnector.connect( flowDef );
        classifyFlow.writeDOT( "dot/classify.dot" );
        classifyFlow.complete();
      }
  • 51. Approach 1: Vagrant Cluster for Cascading and Hadoop set up Vagrant (use v1.3.3 only!) and VirtualBox to run Cascading… PS: we can share USB thumb drives to speed up box downloads! github.com/Cascading/vagrant-cascading-hadoop-cluster NB: when running Gradle builds, you must run as “root”… then when running Hadoop, you must run as “mapred” and use HDFS commands.
  • 52. Approach 2: Laptop Setup for Java, Hadoop, Gradle, Cascading set up a build environment locally and run Apache Hadoop in “standalone” mode… works fine for Linux or MacOSX; however, please no “cdh”, “hdp”, “homebrew”, or “cygwin” liber118.com/pxn/course/itds/install.html download as a ZIP file, or use Git to clone the repo… github.com/Cascading/Impatient NB: when running Hadoop, you will run in local mode – no HDFS
  • 53. Approach 3: Log in to a pre-configured EC2 Node. Assuming you are familiar with using SSH on Linux or MacOSX, or using Putty on Windows… we will give instructions during the workshop. NB: when running Hadoop, you will run in local mode – no HDFS
  • 56. Experiments – comparing models • much customer interest in leveraging Cascading and Apache Hadoop to run customer experiments at scale • run multiple variants, then measure relative “lift” • Concurrent runtime – tag and track models. The following example compares two models trained with different machine learning algorithms. This is exaggerated: one has an important variable intentionally omitted to help illustrate the experiment.
  • 57. Experiments – Random Forest model
      ## train a Random Forest model
      ## example: http://mkseo.pe.kr/stats/?p=220
      f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
      fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
      print(fit)
      saveXML(pmml(fit), file=paste(out_folder, "sample.rf.xml", sep="/"))
    output:
      OOB estimate of error rate: 14%
      Confusion matrix:
          0   1 class.error
      0  69  16   0.1882353
      1  12 103   0.1043478
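The figures in the printed output can be checked by hand. The sketch below recomputes the per-class error and the overall error rate from the confusion matrix on the slide, assuming rows are the actual classes and columns the predicted classes (the layout `randomForest` prints). The function names are hypothetical.

```python
# Confusion matrix from the slide: outer key = actual class, inner key = predicted class
confusion = {
    0: {0: 69, 1: 16},
    1: {0: 12, 1: 103},
}

def class_error(matrix, label):
    """Fraction of rows with this actual label that were predicted as something else."""
    row = matrix[label]
    total = sum(row.values())
    return (total - row[label]) / total

def overall_error(matrix):
    """Fraction of all observations that were misclassified."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(matrix[label][label] for label in matrix)
    return (total - correct) / total

error_class_0 = class_error(confusion, 0)   # 16 / 85
error_class_1 = class_error(confusion, 1)   # 12 / 115
error_rate = overall_error(confusion)       # 28 / 200
```

The three values agree with the `class.error` column and the 14% error rate reported above.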
  • 58. Experiments – Logistic Regression model
      ## train a Logistic Regression model (special case of GLM)
      ## example: http://www.stat.cmu.edu/~cshalizi/490/clustering/clustering01.r
      f <- as.formula("as.factor(label) ~ var0 + var2")
      fit <- glm(f, family=binomial, data=data)
      print(summary(fit))
      saveXML(pmml(fit), file=paste(out_folder, "sample.lr.xml", sep="/"))
    output:
      Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
      (Intercept)   1.8524     0.3803   4.871 1.11e-06 ***
      var0         -1.3755     0.4355  -3.159  0.00159 **
      var2         -3.7742     0.5794  -6.514 7.30e-11 ***
      ---
      Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    NB: this model has “var1” intentionally omitted
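To make concrete what scoring this model involves, here is a minimal sketch that applies the fitted coefficients from the output above with the logistic function. It assumes the model predicts the probability of label 1 (the second factor level, as R's binomial `glm` does). The 0.5 threshold and the input rows are hypothetical, and this is not the code Pattern generates from the PMML.

```python
import math

# Coefficient estimates copied from the model summary on the slide
COEFFICIENTS = {"(Intercept)": 1.8524, "var0": -1.3755, "var2": -3.7742}

def predict_probability(row):
    """Logistic regression score: sigmoid of the linear predictor."""
    z = COEFFICIENTS["(Intercept)"]
    z += COEFFICIENTS["var0"] * row["var0"]
    z += COEFFICIENTS["var2"] * row["var2"]
    return 1.0 / (1.0 + math.exp(-z))

def classify(row, threshold=0.5):
    """Assign label 1 when the predicted probability reaches the (assumed) threshold."""
    return 1 if predict_probability(row) >= threshold else 0

# hypothetical input rows
low_row = {"var0": 0.0, "var2": 0.0}    # linear predictor is the intercept alone
high_row = {"var0": 1.0, "var2": 1.0}   # both negative coefficients apply
labels = [classify(low_row), classify(high_row)]
```

Since both coefficients are negative, larger values of `var0` and `var2` push the prediction toward label 0.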
  • 59. Experiments – comparing results • use a confusion matrix to compare results for the classifiers • assign a cost model to select a winner – for example, in an ecommerce anti-fraud classifier: FN ∼ chargeback risk, FP ∼ customer support costs. Logistic Regression has a lower “false negative” rate (5% vs. 11%); however, it has a much higher “false positive” rate (52% vs. 14%).
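A cost model of this kind can be sketched in a few lines. The example below uses the error rates quoted on the slide and treats them as conditional rates (FN among fraudulent orders, FP among legitimate ones), which is an assumption. The per-error costs and the fraud share are hypothetical inputs, not figures from the talk.

```python
# Error rates quoted on the slide
RATES = {
    "random_forest": {"fn": 0.11, "fp": 0.14},
    "logistic_regression": {"fn": 0.05, "fp": 0.52},
}

def expected_cost(rates, fn_cost, fp_cost, fraud_share):
    """Expected cost per order: missed fraud (FN) plus false alarms (FP)."""
    missed_fraud = fraud_share * rates["fn"] * fn_cost
    false_alarms = (1.0 - fraud_share) * rates["fp"] * fp_cost
    return missed_fraud + false_alarms

def pick_winner(fn_cost, fp_cost, fraud_share):
    """Return the model name with the lowest expected cost per order."""
    return min(
        RATES,
        key=lambda name: expected_cost(RATES[name], fn_cost, fp_cost, fraud_share),
    )

# hypothetical scenario A: fraud is rare, support calls are relatively costly
winner_a = pick_winner(fn_cost=100.0, fp_cost=5.0, fraud_share=0.02)

# hypothetical scenario B: fraud is common and chargebacks are very costly
winner_b = pick_winner(fn_cost=1000.0, fp_cost=1.0, fraud_share=0.2)
```

The two scenarios pick different winners, which is the point of assigning costs rather than comparing raw error rates.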
  • 60. Why Do Ensembles Matter? [two images, captioned “The World…” and “The World… per Data Modeling”]
  • 61. Two Cultures “A new research community using these tools sprang up. Their goal was predictive accuracy. The community consisted of young computer scientists, physicists and engineers plus a few aging statisticians. They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.” Statistical Modeling: The Two Cultures, Leo Breiman, 2001, bit.ly/eUTh9L. The paper chronicled a sea change from data modeling (silos, manual process) to the rising use of algorithmic modeling (machine data for automation/optimization), which led in turn to the practice of leveraging inter-disciplinary teams.
  • 62. Ensemble Models. Breiman: “a multiplicity of data models”. BellKor team: 100+ individual models in 2007 Progress Prize. While the process of combining models adds complexity (making it more difficult to anticipate or explain predictions), accuracy may increase substantially. Ensemble Learning: Better Predictions Through Diversity, Todd Holloway, ETech (2008), abeautifulwww.com/EnsembleLearningETech.pdf. The Story of the Netflix Prize: An Ensembler’s Tale, Lester Mackey, National Academies Seminar, Washington, DC (2011), stanford.edu/~lmackey/papers/
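One simple way to combine models is a majority vote. The sketch below is a generic illustration, not the BellKor approach (which blended predictions in more elaborate ways). The three threshold classifiers and the input rows are hypothetical stand-ins for independently trained models.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the largest number of models."""
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(models, row):
    """Score one row with every model, then combine the votes."""
    return majority_vote([model(row) for model in models])

# three hypothetical weak classifiers, each looking at a single variable;
# an odd number of binary voters avoids ties
models = [
    lambda row: 1 if row["var0"] > 0.5 else 0,
    lambda row: 1 if row["var1"] > 0.5 else 0,
    lambda row: 1 if row["var2"] > 0.5 else 0,
]

# hypothetical input rows
label_a = ensemble_predict(models, {"var0": 0.9, "var1": 0.2, "var2": 0.7})  # votes: 1, 0, 1
label_b = ensemble_predict(models, {"var0": 0.1, "var1": 0.9, "var2": 0.2})  # votes: 0, 1, 0
```

The trade-off noted on the slide shows up even here: the combined prediction is harder to explain than any single threshold rule.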
  • 63. KDD 2013 PMML Workshop Pattern: PMML for Cascading and Hadoop Paco Nathan, Girish Kathalagiri Chicago (2013-08-11) 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining kdd13pmml.wordpress.com
  • 64. Pattern: Example App • example integration of PMML and Cascading, using a sample app based on the crime dataset from the City of Chicago Open Data • sample app implements a predictive model for expected crime rates based on location, hour of day, and month • modeling performed in R, using the pmml package • multiple models are captured as PMML, then integrated via Pattern to implement the entire workflow as a single app • PMML provides a vector for migrating workloads off of SAS, SPSS, etc., onto Hadoop clusters for more cost-effective scaling
  • 65. Pattern: Example App. City of Chicago Open Data portal: cityofchicago.org/city/en/narr/foia/CityData.html. Pattern open source project: github.com/Cascading/pattern. Observed benefits include greatly reduced development costs and fewer licensing issues at scale, while leveraging the scalability of Apache Hadoop clusters, existing intellectual property in predictive models, and the core competencies of analytics staff. Analysts can train predictive models in popular analytics frameworks, such as SAS, Microstrategy, R, Weka, SQL Server, etc., then run those models at scale on Apache Hadoop with little or no coding required.
  • 66. Pattern API: Support for Model Chaining, Transforms, etc. workflow used for data preparation:
  • 67. Pattern API: Support for Model Chaining, Transforms, etc. workflow used for model scoring:
  • 69. Statistical Thinking [diagram labels: Process, Variation, Data, Tools] Employing a mode of thought which includes both logical and analytical reasoning: evaluating the whole of a problem, as well as its component parts; attempting to assess the effects of changing one or more variables. This approach attempts to understand not just problems and solutions, but also the processes involved and their variances. Particularly valuable in Big Data work when combined with hands-on experience in physics – roughly 50% of my peers come from physics or physical engineering… programmers typically don’t think this way… however, both systems engineers and data scientists must.
  • 70. What is needed most? Approximately 80% of the costs for data-related projects get spent on data preparation – mostly on cleaning up data quality issues: ETL, log files, etc., generally by socializing the problem. Unfortunately, data-related budgets tend to go into frameworks which can only be used after cleanup. Most valuable skills: ‣ learn to use programmable tools that prepare data ‣ learn to understand the audience and their priorities ‣ learn to generate compelling data visualizations ‣ learn to estimate the confidence for reported results ‣ learn to automate work, making analysis repeatable. d3js.org
  • 71. Team Process = Needs • discovery: help people ask the right questions • modeling: allow automation to place informed bets • integration: deliver data products at scale to LOB end uses • apps: build smarts into product features • systems: keep infrastructure running, cost-effective [side labels: analysts, engineers, inter-disciplinary leadership]
  • 72. Team Composition = Roles • Domain Expert: business process, stakeholder • Data Scientist (data science): data prep, discovery, modeling, etc. • App Dev: software engineering, automation • Ops: systems engineering, availability [annotation: introduced capability] Leverage non-traditional pairing among roles, to complement skills and tear down silos.
  • 73. Team Composition = Needs × Roles [matrix: columns are the needs (discovery, modeling, integration, apps, systems); rows are the roles] • Domain Expert: business process, stakeholder • Data Scientist (data science): data prep, discovery, modeling, etc. • App Dev: software engineering, automation • Ops: systems engineering, availability
  • 74. Alternatively, Data Roles × Skill Sets. Analyzing the Analyzers, Harlan Harris, Sean Murphy, Marck Vaisman, O’Reilly, 2013, amazon.com/dp/B00DBHTE56. [figure: Harlan Harris, et al.] datacommunitydc.org/blog/wp-content/uploads/2012/08/SkillsSelfIDMosaic-edit-500px.png
  • 75. Cluster Computing’s Dirty Little Secret many of us make a good living by leveraging high ROI apps based on clusters, and so execs agree to build out more data centers… clusters for Hadoop/HBase, for Storm, for MySQL, for Memcached, for Cassandra, for Nginx, etc. this becomes expensive! a single class of workloads on a given cluster is simpler to manage, but terrible for utilization… various notions of “cloud” help… Cloudera, Hortonworks, probably EMC soon: sell a notion of “Hadoop as OS” All your workloads are belong to us Google Data Center, Fox News ~2002
  • 76. Beyond Hadoop. Hadoop – an open source solution for fault-tolerant parallel processing of batch jobs at scale, based on commodity hardware… however, other priorities have emerged for the analytics lifecycle: • apps require integration beyond Hadoop • multiple topologies, mixed workloads, multi-tenancy • higher utilization • lower latency • highly-available, long running services • more than “Just JVM” – e.g., Python growth. Keep in mind the priority for multi-disciplinary efforts, to break down even more silos – well beyond the de facto “priesthood” of data engineering.
  • 77. Beyond Hadoop. Google has been doing data center computing for years, to address the complexities of large-scale data workflows: • leveraging the modern kernel: isolation in lieu of VMs • mixed workloads, multi-tenancy • “most (>80%) jobs are batch jobs, but the majority of resources (55–80%) are allocated to service jobs” • relatively high utilization rates • JVM? not so much… • reality: scheduling batch is simple; scheduling services is hard/expensive
  • 78. “Return of the Borg” • Return of the Borg: How Twitter Rebuilt Google’s Secret Weapon, Cade Metz, wired.com/wiredenterprise/2013/03/google-borg-twitter-mesos • The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Luiz André Barroso, Urs Hölzle, research.google.com/pubs/pub35290.html • 2011 GAFS Omega, John Wilkes, et al., youtu.be/0ZFMlO98Jkc
  • 79. “Return of the Borg” Omega: flexible, scalable schedulers for large compute clusters Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, John Wilkes eurosys2013.tudos.org/wp-content/uploads/2013/paper/Schwarzkopf.pdf
  • 80. Mesos – definitions: a common substrate for cluster computing http://mesos.apache.org/ Heterogeneous assets in your data center or cloud made available as a homogeneous set of resources: • top-level Apache project • scalability to 10,000s of nodes • obviates the need for virtual machines • isolation (pluggable) for CPU, RAM, I/O, FS, etc. • fault-tolerant replicated master using ZooKeeper • multi-resource scheduling (memory and CPU aware) • APIs in C++, Java, Python • web UI for inspecting cluster state • available for Linux, OpenSolaris, Mac OSX
  • 81. Mesos – architecture given the use of Mesos as a Data Center OS kernel… • Chronos provides complex scheduling capabilities, much like a distributed Unix “cron” • Marathon provides highly-available long-running services, much like a distributed Unix “init.d” • next time you need to build a distributed app, consider using these as building blocks a major lesson learned from Spark: • leveraging these kinds of building blocks, one can rebuild Hadoop 100x faster, in much less code
  • 84. Case Study: Twitter (bare metal / on premise) “Mesos is the cornerstone of our elastic compute infrastructure – it’s how we build all our new services and is critical for Twitter’s continued success at scale. It’s one of the primary keys to our data center efficiency.” Chris Fry, SVP Engineering blog.twitter.com/2013/mesos-graduates-from-apache-incubation • key services run in production: analytics, typeahead, ads • Twitter engineers rely on Mesos to build all new services • instead of thinking about static machines, engineers think about resources like CPU, memory and disk • allows services to scale and leverage a shared pool of servers across data centers efficiently • reduces the time between prototyping and launching
  • 85. Case Study: Airbnb (fungible cloud infrastructure) “We think we might be pushing data science in the field of travel more so than anyone has ever done before… a smaller number of engineers can have higher impact through automation on Mesos.” Mike Curtis, VP Engineering gigaom.com/2013/07/29/airbnb-is-engineering-itself-into-a-data-driven... • improves resource management and efficiency • helps advance engineering strategy of building small teams that can move fast • key to letting engineers make the most of AWS-based infrastructure beyond just Hadoop • allowed company to migrate off Elastic MapReduce • enables use of Hadoop along with Chronos, Spark, Storm, etc.
  • 86. Arguments for Data Center Computing. Rather than running several specialized clusters, each at relatively low utilization rates, instead run many mixed workloads. Obvious benefits are realized in terms of: • scalability, elasticity, fault tolerance, performance, utilization • reduced equipment cap-ex, Ops overhead, etc. • reduced licensing, eliminating need for VMs or potential vendor lock-in. Subtle benefits – arguably, more important for Enterprise IT: • reduced time for engineers to ramp-up new services at scale • reduced latency between batch and services, enabling new high-ROI use cases • enables Dev/Test apps to run safely on a Production cluster
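The utilization argument can be made concrete with a small calculation. The sketch below compares provisioning one cluster per workload (each sized for its own peak) against one shared pool (sized for the peak of the combined demand). The demand figures are hypothetical and chosen so that the two workloads peak at different times, which is the assumption the argument rests on.

```python
# Hypothetical core demand per time slot for two workloads whose peaks do not coincide
demand = {
    "batch": [900, 200, 100, 300],
    "services": [100, 400, 350, 150],
}

def separate_capacity(demand):
    """Each specialized cluster must be provisioned for its own peak."""
    return sum(max(series) for series in demand.values())

def shared_capacity(demand):
    """A shared pool only needs to cover the peak of the combined demand."""
    combined = [sum(slot) for slot in zip(*demand.values())]
    return max(combined)

def average_demand(demand):
    """Mean combined demand across all time slots."""
    combined = [sum(slot) for slot in zip(*demand.values())]
    return sum(combined) / len(combined)

cores_separate = separate_capacity(demand)                     # 900 + 400
cores_shared = shared_capacity(demand)                         # max of 1000, 600, 450, 450
utilization_separate = average_demand(demand) / cores_separate
utilization_shared = average_demand(demand) / cores_shared
```

With these made-up numbers the shared pool needs fewer cores and runs at higher average utilization; when workload peaks do coincide, the gap narrows.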
  • 87. Media Coverage Mesosphere Adds Docker Support To Its Mesos-Based Operating System For The Data Center Frederic Lardinois TechCrunch (2013-09-26) techcrunch.com/2013/09/26/mesosphere... Play Framework Grid Deployment with Mesos James Ward, Flo Leibert, et al. Typesafe blog (2013-09-19) typesafe.com/blog/play-framework-grid... Mesosphere Launches Marathon Framework Adrian Bridgwater Dr. Dobbs (2013-09-18) drdobbs.com/open-source/mesosphere... New open source tech Marathon wants to make your data center run like Google’s Derrick Harris GigaOM (2013-09-04) gigaom.com/2013/09/04/new-open-source... Running batch and long-running, highly available service jobs on the same cluster Ben Lorica O’Reilly (2013-09-01) strata.oreilly.com/2013/09/running-batch...
  • 88. Resources Apache Mesos Project mesos.apache.org Mesosphere mesosphere.io Tutorial mesosphere.io/2013/08/01/... Documentation mesos.apache.org/documentation 2011 USENIX Research Paper usenix.org/legacy/event/nsdi11/tech/full_papers/Hindman_new.pdf Collected Notes/Archives goo.gl/jPtTP
  • 90. Enterprise Data Workflows with Cascading, O’Reilly, 2013, shop.oreilly.com/product/0636920028536.do. Monthly newsletter for updates, events, conference summaries, etc.: liber118.com/pxn/