Data Workflows for Machine Learning

Seattle, WA
2014-01-29

Paco Nathan
@pacoid
http://liber118.com/pxn/
meetup.com/Seattle-DAML/events/159043422/
Why is this talk here?

Machine Learning in production apps is less and less about algorithms (even though that work is quite fun and vital).

Performing real work is more about:

• socializing a problem within an organization
• feature engineering (“Beyond Product Managers”)
• tournaments in CI/CD environments
• operationalizing high-ROI apps at scale
• etc.

So I’ll just crawl out on a limb and state that leveraging great frameworks to build data workflows is more important than chasing after diminishing returns on highly nuanced algorithms.

Because Interwebs!
Data Workflows for Machine Learning

Middleware has been evolving for Big Data, and there are some great examples; we’ll review several. Process has been evolving too, right along with the use cases.

Popular frameworks typically provide some Machine Learning capabilities within their core components, or at least among their major use cases.

Let’s consider features from Enterprise Data Workflows as a basis for what’s needed in Data Workflows for Machine Learning.

Their requirements for scale, robustness, cost trade-offs, interdisciplinary teams, etc., serve as guides in general.
Caveat Auditor

I won’t claim to be expert with each of the frameworks and environments described in this talk. Expert with a few of them perhaps, but more to the point: embroiled in many use cases.

This talk attempts to define a “scorecard” for evaluating important ML data workflow features: what’s needed for use cases, how the available options compare and contrast, plus some indication of which frameworks are likely to be best for a given scenario.

Seriously, this is a work in progress.
Outline

• Definition: Machine Learning
• Definition: Data Workflows
• A whole bunch o’ examples across several platforms
• Nine points to discuss, leading up to a scorecard
• A crazy little thing called PMML
• Questions, comments, flying tomatoes…
Data Workflows for Machine Learning:

Frame the question…

“A Basis for What’s Needed”
Definition: Machine Learning

“Machine learning algorithms can figure out how to perform important tasks by generalizing from examples. This is often feasible and cost-effective where manual programming is not. As more data becomes available, more ambitious problems can be tackled. As a result, machine learning is widely used in computer science and other fields.”

Pedro Domingos, U Washington
A Few Useful Things to Know about Machine Learning

[learning as generalization] = [representation] + [evaluation] + [optimization]

• overfitting (variance)
• underfitting (bias)
• “perfect classifier” (no free lunch)
Definition: Machine Learning … Conceptually

1. real-world data
2. graph theory for representation
3. convert to sparse matrix for production work, leveraging abstract algebra + func programming
4. cost-effective parallel processing for ML apps at scale, live use cases
Definition: Machine Learning … Conceptually

f(x): loss function
g(z): regularization term

1. real-world data [generalization]
2. graph theory for representation [representation]
3. convert to sparse matrix for production work, leveraging abstract algebra + func programming [evaluation]
4. cost-effective parallel processing for ML apps at scale, live use cases [optimization]
Definition: Machine Learning … Interdisciplinary Teams, Needs × Roles

[Figure: matrix of needs (discovery, modeling, integration, apps, systems) against roles]

• Domain Expert: business process, stakeholder
• Data Scientist: data science; data prep, discovery, modeling, etc.
• App Dev: software engineering, automation
• Ops: systems engineering, availability
Definition: Machine Learning … Process, Feature Engineering

[Figure: process diagram. Data sources feed a data prep pipeline (feature engineering); train and hold-out sets feed learners, which yield classifiers; scoring on test data leads to use cases. Stages are labeled representation, evaluation, optimization.]
Definition: Machine Learning … Process, Tournaments

[Figure: the same process diagram, with tournaments added between the data prep pipeline and the learners.]

representation:
• obtain other data?
• improve metadata?
• refine representation?
• improve optimization?

evaluation:
• iterate with stakeholders?
• can models (inference) inform first principles approaches?

use cases, quantify and measure:
• benefit?
• risk?
• operational costs?
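
As a rough illustration of the tournament idea, the sketch below scores several candidate classifiers on one shared hold-out set and picks the best. It is not code from the talk: the function names, the toy threshold "classifiers", and the hold-out data are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a trained classifier maps a feature value to a label.
Classifier = Callable[[float], int]


def accuracy(clf: Classifier, holdout: List[Tuple[float, int]]) -> float:
    """Fraction of hold-out examples the classifier labels correctly."""
    hits = sum(1 for x, label in holdout if clf(x) == label)
    return hits / len(holdout)


def run_tournament(
    entrants: Dict[str, Classifier], holdout: List[Tuple[float, int]]
) -> Tuple[str, Dict[str, float]]:
    """Score every entrant on the same hold-out set and return the winner."""
    scores = {name: accuracy(clf, holdout) for name, clf in entrants.items()}
    winner = max(scores, key=scores.get)
    return winner, scores


# Toy hold-out set of (feature, label) pairs, invented for illustration.
HOLDOUT = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]

# Toy entrants: threshold rules standing in for trained models.
ENTRANTS = {
    "threshold_0.5": lambda x: int(x > 0.5),
    "threshold_0.3": lambda x: int(x > 0.3),
    "always_one": lambda x: 1,
}

winner, scores = run_tournament(ENTRANTS, HOLDOUT)
print(winner, scores)
```

In a CI/CD setting the same comparison would presumably run on every build, with the metric chosen to reflect benefit, risk, and operational cost rather than plain accuracy.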
Definition: Machine Learning … Invasion of the Monoids

monoid (an alien life form, courtesy of abstract algebra):

• binary associative operator
• closure within a set of objects
• unique identity element

what are those good for?

• composable functions in workflows
• compiler “hints” on steroids, to parallelize
• reassemble results minimizing bottlenecks at scale
• reusable components, like Lego blocks
• think: Docker for business logic

Monoidify! Monoids as a Design Principle for Efficient MapReduce Algorithms
Jimmy Lin, U Maryland/Twitter

kudos to Oscar Boykin, Sam Ritchie, et al.
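
To make the three monoid properties concrete, here is a minimal sketch. The `Monoid` class and `parallel_style_fold` helper are hypothetical names for illustration, not an API from Algebird or any framework mentioned in this talk. It shows why associativity plus an identity lets partial results be computed per chunk and reassembled in any grouping.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, Generic, Iterable, List, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Monoid(Generic[T]):
    """An associative binary operator plus its identity element."""
    identity: T
    combine: Callable[[T, T], T]

    def fold(self, items: Iterable[T]) -> T:
        return reduce(self.combine, items, self.identity)


def parallel_style_fold(monoid: Monoid[T], items: List[T], chunks: int) -> T:
    """Fold each chunk independently, then fold the partial results.

    Associativity is what makes this regrouping safe; a real system
    would run the per-chunk folds on separate workers.
    """
    size = max(1, -(-len(items) // chunks))  # ceiling division
    partials = [monoid.fold(items[i:i + size]) for i in range(0, len(items), size)]
    return monoid.fold(partials)


SUM = Monoid(0, lambda a, b: a + b)
MAX = Monoid(float("-inf"), max)

data = [3, 1, 4, 1, 5, 9, 2, 6]
chunked_sum = parallel_style_fold(SUM, data, 3)
chunked_max = parallel_style_fold(MAX, data, 3)
print(chunked_sum, chunked_max)
```

The identity element matters for the edge cases: an empty chunk folds to the identity and drops out of the final result.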
Definition: Machine Learning

[Figure: the process diagram (feature engineering, data prep pipeline, tournaments, learners, classifiers, scoring, use cases, with the representation / evaluation / optimization questions) overlaid with the team roles (Domain Expert, Data Scientist, App Dev, Ops) and the needs they cover (discovery, modeling, integration, apps, systems), ending in an introduced capability.]
Definition

Can we fit the required process into generalized workflow definitions?

[Figure: the combined process and roles diagram, repeated.]
Definition: Data Workflows

Middleware, effectively, is evolving for Big Data and Machine Learning…

The following design pattern, a DAG, shows up in many places, via many frameworks:

[Figure: DAG with stages ETL, data sources, data prep, predictive model, end uses]
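
The DAG pattern can be sketched with nothing but the Python standard library. The stage names come from the slide; the specific edges (a simple chain) and the `execution_order` helper are assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key depends on the stages listed in its value set.
# Stage names follow the slide; the edge layout is an assumption.
WORKFLOW = {
    "data sources": set(),
    "ETL": {"data sources"},
    "data prep": {"ETL"},
    "predictive model": {"data prep"},
    "end uses": {"predictive model"},
}


def execution_order(dag):
    """Return one valid stage ordering; raises CycleError if not a DAG."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(WORKFLOW)
print(" -> ".join(order))
```

Because the whole graph is visible before anything runs, a planner can validate it, reorder independent stages, or report exactly which stage failed.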
Definition: Data Workflows

Definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…

• ETL: ANSI SQL for ETL (most of the licensing costs…)
• data prep: Java, Pig for business logic (most of the project costs…)
• predictive model: SAS for predictive models (most of the licensing costs…)

[Figure: DAG with stages ETL, data sources, data prep, predictive model, end uses]

Something emerges to fill these needs, for instance…
Definition: Data Workflows

For example, Cascading and related projects implement the following components, based on 100% open source:

• ETL: Lingual, DW → ANSI SQL
• data sources: source taps for Cassandra, JDBC, Splunk, etc.
• data prep: business logic in Java, Clojure, Scala, etc.
• predictive model: Pattern, SAS, R, etc. → PMML
• end uses: sink taps for Memcached, HBase, MongoDB, etc.

a compiler sees it all… one connected DAG:

• troubleshooting
• exception handling
• notifications
• some optimizations

cascading.org
Definition: Data Workflows

Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source

Lingual: DW → ANSI SQL

FlowDef flowDef = FlowDef.flowDef()
  .setName( "etl" )
  .addSource( "example.employee", emplTap )
  .addSource( "example.sales", salesTap )
  .addSink( "results", resultsTap );

SQLPlanner sqlPlanner = new SQLPlanner()
  .setSql( sqlStatement );

flowDef.addAssemblyPlanner( sqlPlanner );

[Figure: the Cascading workflow DAG (source taps for Cassandra, JDBC, Splunk, etc.; business logic in Java, Clojure, Scala, etc.; Pattern: SAS, R, etc. → PMML; sink taps for Memcached, HBase, MongoDB, etc.), with the ETL stage highlighted]

cascading.org
Definition: Data Workflows

Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source

Pattern: SAS, R, etc. → PMML

FlowDef flowDef = FlowDef.flowDef()
  .setName( "classifier" )
  .addSource( "input", inputTap )
  .addSink( "classify", classifyTap );

PMMLPlanner pmmlPlanner = new PMMLPlanner()
  .setPMMLInput( new File( pmmlModel ) )
  .retainOnlyActiveIncomingFields();

flowDef.addAssemblyPlanner( pmmlPlanner );

[Figure: the Cascading workflow DAG (Lingual: DW → ANSI SQL; source taps for Cassandra, JDBC, Splunk, etc.; business logic in Java, Clojure, Scala, etc.; sink taps for Memcached, HBase, MongoDB, etc.), with the predictive model stage highlighted]
Enterprise Data Workflows with Cascading
O’Reilly, 2013
shop.oreilly.com/product/0636920028536.do

[Figure: DAG with stages ETL, data sources, data prep, predictive model, end uses]

This begs an update… Because…
Data Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning:

Examples…

“Neque per exemplum”
Example: KNIME

“a user-friendly graphical workbench for the entire analysis process: data access, data transformation, initial investigation, powerful predictive analytics, visualisation and reporting.”

• large number of integrations (over 1000 modules)
• ranked #1 in customer satisfaction among open source analytics frameworks
• visual editing of reusable modules
• leverage prior work in R, Perl, etc.
• Eclipse integration
• easily extended for new integrations

knime.org
Example: Python stack

Python has much to offer – ranging across an organization, not just for the analytics staff

• ipython.org
• pandas.pydata.org
• scikit-learn.org
• numpy.org
• scipy.org
• code.google.com/p/augustus
• continuum.io
• nltk.org
• matplotlib.org
Example: Julia

julialang.org

“a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments”

• significantly faster than most alternatives
• built to leverage parallelism, cloud computing
• still relatively new: one to watch!

importall Base

type BubbleSort <: Sort.Algorithm end

function sort!(v::AbstractVector, lo::Int, hi::Int, ::BubbleSort, o::Sort.Ordering)
    while true
        clean = true
        for i = lo:hi-1
            if Sort.lt(o, v[i+1], v[i])
                v[i+1], v[i] = v[i], v[i+1]
                clean = false
            end
        end
        clean && break
    end
    return v
end
Example: Summingbird

github.com/twitter/summingbird

“a library that lets you write streaming MapReduce programs that look like native Scala or Java collection transformations and execute them on a number of well-known distributed MapReduce platforms like Storm and Scalding.”

• switch between Storm, Scalding (Hadoop)
• Spark support is in progress
• leverage Algebird, Storehaus, Matrix API, etc.

def wordCount[P <: Platform[P]]
  (source: Producer[P, String], store: P#Store[String, Long]) =
    source.flatMap { sentence =>
      toWords(sentence).map(_ -> 1L)
    }.sumByKey(store)
Example: Scalding

import com.twitter.scalding._

class WordCount(args : Args) extends Job(args) {
  Tsv(args("doc"),
      ('doc_id, 'text),
      skipHeader = true)
    .read
    .flatMap('text -> 'token) {
      text : String => text.split("[ \\[\\]\\(\\),.]")
    }
    .groupBy('token) { _.size('count) }
    .write(Tsv(args("wc"), writeHeader = true))
}

github.com/twitter/scalding
Example: Scalding

• extends the Scala collections API so that distributed lists become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram and function calls
• extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• gentler learning curve than Cascalog
• build scripts… those take a while to run

github.com/twitter/scalding
Example: Cascalog

(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\\\(\),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))

; Paul Lam
; github.com/Quantisan/Impatient

cascalog.org
Example: Cascalog

• implements Datalog in Clojure, with predicates backed by Cascading – for a highly declarative language
• run ad-hoc queries from the Clojure REPL – approx. 10:1 code reduction compared with SQL
• composable subqueries, used for test-driven development (TDD) practices at scale
• Leiningen build: simple, no surprises, in Clojure itself
• more new deployments than other Cascading DSLs – Climate Corp is largest use case: 90% Clojure/Cascalog
• has a learning curve, limited number of Clojure developers
• aggregators are the magic, and those take effort to learn

cascalog.org
Example: Apache Spark

spark-project.org

in-memory cluster computing, by Matei Zaharia, Ram Venkataraman, et al.

• intended to make data analytics fast to write and to run
• load data into memory and query it repeatedly, much more quickly than with Hadoop
• APIs in Scala, Java and Python, shells in Scala and Python
• Shark (Hive on Spark), Spark Streaming (like Storm), etc.
• integrations with MLbase, Summingbird (in progress)

// word count
val file = spark.textFile("hdfs://…")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://…")
Example: MLbase

“distributed machine learning made easy”

• sponsored by UC Berkeley EECS AMP Lab
• MLlib – common algorithms, low-level, written atop Spark
• MLI – feature extraction, algorithm dev atop MLlib
• ML Optimizer – automates model selection, compiler/optimizer
• see article: http://strata.oreilly.com/2013/02/mlbase-scalable-machine-learning-made-accessible.html

data = load("hdfs://path/to/als_clinical")
// the features are stored in columns 2-10
X = data[, 2 to 10]
y = data[, 1]
model = do_classify(y, X)

mlbase.org
Example: Titan

thinkaurelius.github.io/titan

distributed graph database, by Matthias Broecheler, Marko Rodriguez, et al.

• scalable graph database optimized for storing and querying graphs
• supports hundreds of billions of vertices and edges
• transactional database that can support thousands of concurrent users executing complex graph traversals
• supports search through Lucene, ElasticSearch
• can be backed by HBase, Cassandra, BerkeleyDB
• TinkerPop native graph stack with Gremlin, etc.

// who is hercules' paternal grandfather?
g.V('name','hercules').out('father').out('father').name
Example: MBrace

“a .NET based software stack that enables easy large-scale distributed computation and big data analysis.”

• declarative and concise distributed algorithms in F# asynchronous workflows
• scalable for apps in private datacenters or public clouds, e.g., Windows Azure or Amazon EC2
• tooling for interactive, REPL-style deployment, monitoring, management, debugging in Visual Studio
• leverages monads (similar to Summingbird)
• MapReduce as a library, but many patterns beyond

m-brace.net

let rec mapReduce (map: 'T -> Cloud<'R>)
                  (reduce: 'R -> 'R -> Cloud<'R>)
                  (identity: 'R)
                  (input: 'T list) =
    cloud {
        match input with
        | [] -> return identity
        | [value] -> return! map value
        | _ ->
            let left, right = List.split input
            let! r1, r2 =
                (mapReduce map reduce identity left)
                <||>
                (mapReduce map reduce identity right)
            return! reduce r1 r2
    }
Data Workflows for Machine Learning:

Update the definitions…

“Nine Points to Discuss”

Data Workflows

1. Formally speaking, workflows include people (cross-dept) and define oversight for exceptional data

…otherwise, data flow or pipeline would be more apt
…examples include Cascading traps, with exceptional tuples routed to a review organization, e.g., Customer Support

[Figure: Customers use a Web App that produces logs and reads a Cache; a Data Workflow on a Hadoop Cluster reads via source taps (logs, customer profile DBs), takes a PMML model from Modeling, and writes via sink taps to Analytics Cubes, Reporting, and Customer Prefs; a trap tap routes exceptional tuples to Support.]

includes people, defines oversight for exceptional data
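
The trap idea can be sketched independently of Cascading. The code below is a hypothetical illustration, not the Cascading API: `run_with_trap`, `parse_order`, and the sample records are invented names and data. Records that fail a processing step are diverted to a separate collection for human review instead of aborting the whole run.

```python
from typing import Callable, Iterable, List, Tuple

Record = dict


def run_with_trap(
    records: Iterable[Record], step: Callable[[Record], Record]
) -> Tuple[List[Record], List[Tuple[Record, str]]]:
    """Apply `step` to each record; divert failures to a trap instead of aborting."""
    results, trapped = [], []
    for record in records:
        try:
            results.append(step(record))
        except (KeyError, ValueError, TypeError) as err:
            # Keep the offending record and the reason, for human review.
            trapped.append((record, repr(err)))
    return results, trapped


def parse_order(record: Record) -> Record:
    """Hypothetical business-logic step: needs a customer and a numeric amount."""
    return {"customer": record["customer"], "amount": float(record["amount"])}


ORDERS = [
    {"customer": "a", "amount": "12.50"},
    {"customer": "b", "amount": "n/a"},  # malformed amount
    {"amount": "3.00"},                  # missing customer
]

good, for_support = run_with_trap(ORDERS, parse_order)
print(len(good), "processed;", len(for_support), "routed to review")
```

The point of the design is organizational as much as technical: the trapped records land somewhere a person, such as Customer Support, is expected to look.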
Data Workflows

2. Workflows impose a separation of concerns, allowing for multiple abstraction layers in the tech stack

…specify what is required, not how to accomplish it
…articulating the business logic
…Cascading leverages pattern language
…related notions from Knuth, literate programming
…not unlike BPM/BPEL for Big Data
…examples: IPython, R Markdown, etc.

separation of concerns, allows for literate programming
Data Workflows

3. Multiple abstraction layers in the tech stack are needed, but still emerging

…feedback loops based on machine data
…optimizers feed on abstraction
…metadata accounting:
• track data lineage
• propagate schema
• model / feature usage
• ensemble performance
…app history accounting:
• util stats, mixing workloads
• heavy-hitters, bottlenecks
• throughput, latency

[Figure: stack of abstraction layers: Business Process, Portable Models, Reusable Components, DSLs, Planners/Optimizers, Mixed Topologies, Cluster Scheduler, Clusters. Upper layers are annotated with data lineage, schema propagation, feature selection, tournaments; lower layers feed back Machine Data: app history, util stats, bottlenecks.]

multiple abstraction layers for metadata, feedback, and optimization
Data Workflows

3. Multiple abstraction layers in the tech stack are needed, but still emerging

Um, will compilers ever look like this??

[Figure: the stack of abstraction layers, repeated.]
Data Workflows

4. Workflows must provide for test, which is no simple matter

…testing is required on three abstraction layers:
• model portability/evaluation, ensembles, tournaments
• TDD in reusable components and business process
• continuous integration / continuous deployment
…examples: Cascalog for TDD, PMML for model eval
…still so much more to improve
…keep in mind that workflows involve people, too

[Figure: the stack of abstraction layers, repeated.]

testing: model evaluation, TDD, app deployment
Data Workflows

5. Workflows enable system integration

…future-proof integrations and scale-out
…build components and apps, not jobs and command lines
…allow compiler optimizations across the DAG, i.e., cross-dept contributions
…minimize operationalization costs:
• troubleshooting, debugging at scale
• exception handling
• notifications, instrumentation
…examples: KNIME, Cascading, etc.

[Figure: the Customers / Web App / Data Workflow / Hadoop Cluster diagram with source, sink, and trap taps, repeated.]

future-proof system integration, scale-out, ops
Data Workflows

6. Visualizing workflows, what a great idea.

…the practical basis for:
• collaboration
• rapid prototyping
• component reuse
…examples: KNIME wins best in category
…Cascading generates flow diagrams, which are a nice start

visualizing allows people to collaborate through code
Data Workflows

7. Abstract algebra, containerizing workflow metadata

…monoids, semigroups, etc., allow for reusable components with well-defined properties for running in parallel at scale
…let business process be agnostic about underlying topologies, with analogy to Linux containers (Docker, etc.)
…compose functions, take advantage of sparsity (monoids)
…because “data is fully functional” – FP for s/w eng benefits
…aggregators are almost always “magic”, now we can solve for common use cases
…read Monoidify! Monoids as a Design Principle for Efficient MapReduce Algorithms by Jimmy Lin
…Cascading introduced some notions of this circa 2007, but didn’t make it explicit
…examples: Algebird in Summingbird, Simmer, Spark, etc.

abstract algebra and functional programming containerize business process
Data Workflows

8. Workflow needs vary in time, and need to blend time

…something something batch, something something low-latency
…scheduling batch isn’t hard; scheduling low-latency is computationally brutal (see the Omega paper)
…because people like saying “Lambda Architecture”, it gives them goosebumps, or something
…because real-time means so many different things
…because batch windows are so 1964
…examples: Summingbird, Oryx

blend results from different time scales: batch plus low-latency

Big Data
Nathan Marz, James Warren
manning.com/marz
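
Blending a batch result with a low-latency result can be sketched as a merge of two views. This is a toy illustration under stated assumptions, not Summingbird or Oryx code: the `blend` function and the counts are invented, and it relies on the counts being positive, since `Counter` addition drops entries that are zero or negative.

```python
from collections import Counter


def blend(batch_view: Counter, realtime_view: Counter) -> Counter:
    """Merge a batch view with a low-latency delta.

    Counter addition is associative with an empty Counter as identity,
    so the two views can be combined in any grouping.
    """
    return batch_view + realtime_view


# Invented counts: the batch view covers history up to the last batch run,
# the realtime view covers only events seen since then.
batch_view = Counter({"spark": 120, "storm": 45})
realtime_view = Counter({"spark": 3, "oryx": 1})

merged = blend(batch_view, realtime_view)
print(merged)
```

The merge is only trivial because the aggregate is a monoid; aggregates without that structure need more care when the batch and streaming windows overlap.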
Data Workflows

9. Workflows may define contexts in which model selection possibly becomes a compiler problem

…for example, see Boyd, Parikh, et al.
…ultimately optimizing for loss function + regularization term
…probably not ready for prime-time tomorrow
…examples: MLbase

f(x): loss function
g(z): regularization term

optimize learners in context, to make model selection potentially a compiler problem
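
The f(x) and g(z) labels match the splitting form commonly used in the ADMM literature by Boyd, Parikh, et al. Assuming that is the reference intended here, the optimization problem is usually written as:

```latex
\begin{aligned}
\min_{x,\,z}\quad & f(x) + g(z) \\
\text{subject to}\quad & x - z = 0
\end{aligned}
```

Here f is the loss function and g is the regularization term; keeping them on separate variables tied by a constraint is what lets each be handled by a different, possibly distributed, solver step.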
Data Workflows

bonus: For one of the best papers about what workflows really truly require, see Out of the Tar Pit by Moseley and Marks
Nine Points for Data Workflows: a wish list

1. includes people, defines oversight for exceptional data
2. separation of concerns, allows for literate programming
3. multiple abstraction layers for metadata, feedback, and optimization
4. testing: model evaluation, TDD, app deployment
5. future-proof system integration, scale-out, ops
6. visualizing workflows allows people to collaborate through code
7. abstract algebra and functional programming containerize business process
8. blend results from different time scales: batch plus low-latency
9. optimize learners in context, to make model selection potentially a compiler problem

[Figure: the stack of abstraction layers (Business Process, Portable Models, Reusable Components, DSLs, Planners/Optimizers, Mixed Topologies, Cluster Scheduler, Clusters), with the nine points placed alongside the layers.]
Nine Points for Data Workflows: a scorecard

Frameworks (columns): Spark/MLbase, Oryx, Summingbird, Cascalog, Cascading, KNIME, Py Data, R Markdown, MBrace

Criteria (rows):
• includes people, exceptional data
• separation of concerns
• multiple abstraction layers
• testing in depth
• future-proof system integration
• visualize to collab
• can haz monoids
• blends batch + “real-time”
• optimize learners in context
• can haz PMML

[Scorecard table: the positions of the ✔ marks are not recoverable from the slide text.]
Data Workflows for Machine Learning:

PMML…

“Oh, by the way…”
PMML – an industry standard

• established XML standard for predictive model markup
• organized by Data Mining Group (DMG), since 1997 http://dmg.org/
• members: IBM, SAS, Visa, FICO, Equifax, NASA, Microstrategy, Microsoft, etc.
• PMML concepts for metadata, ensembles, etc., translate directly into workflow abstractions, e.g., Cascading

“PMML is the leading standard for statistical and data mining models and supported by over 20 vendors and organizations. With PMML, it is easy to develop a model on one system using one application and deploy the model on another system using another application.”

wikipedia.org/wiki/Predictive_Model_Markup_Language
PMML – vendor coverage
PMML – model coverage

• Association Rules: AssociationModel element
• Cluster Models: ClusteringModel element
• Decision Trees: TreeModel element
• Naïve Bayes Classifiers: NaiveBayesModel element
• Neural Networks: NeuralNetwork element
• Regression: RegressionModel and GeneralRegressionModel elements
• Rulesets: RuleSetModel element
• Sequences: SequenceModel element
• Support Vector Machines: SupportVectorMachineModel element
• Text Models: TextModel element
• Time Series: TimeSeriesModel element

ibm.com/developerworks/industry/library/ind-PMML2/
PMML – further study

PMML in Action
Alex Guazzelli, Wen-Ching Lin, Tridivesh Jena
amazon.com/dp/1470003244

See also excellent resources at:
zementis.com/pmml.htm
Pattern – model scoring

• a PMML library for Cascading workflows
• migrate workloads: SAS, Teradata, etc., exporting predictive models as PMML
• great open source tools – R, Weka, KNIME, Matlab, RapidMiner, etc.
• leverage PMML as another kind of DSL
• only for scoring models, not training

[Figure: the Customers / Web App / Data Workflow / Hadoop Cluster diagram with source, sink, and trap taps, repeated, with the Modeling → PMML input highlighted.]
Pattern – model scoring

• originally intended as a general purpose, tuple-oriented model scoring engine for JVM-based frameworks
• library plugs into Cascading (Hadoop), and ostensibly Cascalog, Scalding, Storm, Spark, Summingbird, etc.
• dev forum: https://groups.google.com/forum/#!forum/pattern-user
• original GitHub repo: https://github.com/ceteri/pattern/
• somehow, the other fork became deeply intertwined with Cascading dependencies…
• work in progress to integrate with other frameworks, add models, and also train models at scale using Spark
Pattern – create a model in R

library(randomForest)  # randomForest()
library(pmml)          # pmml()
library(XML)           # saveXML()

## train a RandomForest model

f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=50)

## test the model on the holdout test set

print(fit$importance)
print(fit)

predicted <- predict(fit, data)
data$predicted <- predicted
confuse <- table(pred = predicted, true = data[,1])
print(confuse)

## export predicted labels to TSV

write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
  quote=FALSE, sep="\t", row.names=FALSE)

## export RF model to PMML

saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
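
The confusion table computed by `table(pred = predicted, true = data[,1])` can be sketched in plain Python for readers less familiar with R. The `confusion_matrix` helper and the label lists below are invented for illustration; they are not data from the talk.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def confusion_matrix(
    predicted: Iterable[int], actual: Iterable[int]
) -> Dict[Tuple[int, int], int]:
    """Count (predicted, actual) label pairs, similar to R's table(pred, true)."""
    return dict(Counter(zip(predicted, actual)))


# Invented labels for illustration only.
predicted = [1, 0, 1, 1, 0, 0]
actual = [1, 0, 0, 1, 0, 1]

confuse = confusion_matrix(predicted, actual)
for (pred, true), count in sorted(confuse.items()):
    print(f"pred={pred} true={true}: {count}")
```

Keys where the two labels agree are the correct predictions; the off-diagonal keys are the false positives and false negatives.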
Pattern – capture model parameters as PMML

<?xml version="1.0"?>
<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.dmg.org/PMML-4_0 http://www.dmg.org/v4-0/pmml-4-0.xsd">
 <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
  <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
  <Application name="Rattle/PMML" version="1.2.30"/>
  <Timestamp>2012-10-22 19:39:28</Timestamp>
 </Header>
 <DataDictionary numberOfFields="4">
  <DataField name="label" optype="categorical" dataType="string">
   <Value value="0"/>
   <Value value="1"/>
  </DataField>
  <DataField name="var0" optype="continuous" dataType="double"/>
  <DataField name="var1" optype="continuous" dataType="double"/>
  <DataField name="var2" optype="continuous" dataType="double"/>
 </DataDictionary>
 <MiningModel modelName="randomForest_Model" functionName="classification">
  <MiningSchema>
   <MiningField name="label" usageType="predicted"/>
   <MiningField name="var0" usageType="active"/>
   <MiningField name="var1" usageType="active"/>
   <MiningField name="var2" usageType="active"/>
  </MiningSchema>
  <Segmentation multipleModelMethod="majorityVote">
   <Segment id="1">
    <True/>
    <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit">
     <MiningSchema>
      <MiningField name="label" usageType="predicted"/>
      <MiningField name="var0" usageType="active"/>
      <MiningField name="var1" usageType="active"/>
      <MiningField name="var2" usageType="active"/>
     </MiningSchema>
...
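
Because PMML is plain XML, a workflow can inspect a model's metadata before scoring with it. The sketch below uses only the Python standard library; the `mining_fields` helper is a hypothetical name, and the fragment is abridged from the excerpt above with closing tags added so that it parses, since the slide truncates the document.

```python
import xml.etree.ElementTree as ET

# Abridged, well-formed fragment based on the excerpt above; the closing
# tags are added here because the slide truncates the document.
PMML_TEXT = """<?xml version="1.0"?>
<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0">
 <DataDictionary numberOfFields="4">
  <DataField name="label" optype="categorical" dataType="string"/>
  <DataField name="var0" optype="continuous" dataType="double"/>
  <DataField name="var1" optype="continuous" dataType="double"/>
  <DataField name="var2" optype="continuous" dataType="double"/>
 </DataDictionary>
 <MiningModel modelName="randomForest_Model" functionName="classification">
  <MiningSchema>
   <MiningField name="label" usageType="predicted"/>
   <MiningField name="var0" usageType="active"/>
   <MiningField name="var1" usageType="active"/>
   <MiningField name="var2" usageType="active"/>
  </MiningSchema>
 </MiningModel>
</PMML>
"""

# Namespace taken from the xmlns attribute of the PMML 4.0 root element.
NS = {"pmml": "http://www.dmg.org/PMML-4_0"}


def mining_fields(pmml_text: str) -> dict:
    """Group MiningField names by usageType for the top-level model."""
    root = ET.fromstring(pmml_text)
    schema = root.find("pmml:MiningModel/pmml:MiningSchema", NS)
    fields: dict = {}
    for field in schema.findall("pmml:MiningField", NS):
        fields.setdefault(field.get("usageType"), []).append(field.get("name"))
    return fields


fields = mining_fields(PMML_TEXT)
print(fields)
```

A check like this is one way a workflow could confirm that the incoming tuple fields cover the model's active fields before any scoring starts.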
Pattern – score a model, within an app

public static void main( String[] args ) throws RuntimeException {
  String inputPath = args[ 0 ];
  String classifyPath = args[ 1 ];

  // set up the config properties
  Properties properties = new Properties();
  AppProps.setApplicationJarClass( properties, Main.class );
  HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

  // create source and sink taps
  Tap inputTap = new Hfs( new TextDelimited( true, "\t" ), inputPath );
  Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );

  // handle command line options
  OptionParser optParser = new OptionParser();
  optParser.accepts( "pmml" ).withRequiredArg();

  OptionSet options = optParser.parse( args );

  // connect the taps, pipes, etc., into a flow
  FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
    .addSource( "input", inputTap )
    .addSink( "classify", classifyTap );

  if( options.hasArgument( "pmml" ) ) {
    String pmmlPath = (String) options.valuesOf( "pmml" ).get( 0 );
    PMMLPlanner pmmlPlanner = new PMMLPlanner()
      .setPMMLInput( new File( pmmlPath ) )
      .retainOnlyActiveIncomingFields()
      .setDefaultPredictedField( new Fields( "predict", Double.class ) ); // default value if missing from the model
    flowDef.addAssemblyPlanner( pmmlPlanner );
  }

  // write a DOT file and run the flow
  Flow classifyFlow = flowConnector.connect( flowDef );
  classifyFlow.writeDOT( "dot/classify.dot" );
  classifyFlow.complete();
}
Pattern – score a model, using pre-defined Cascading app
[Diagram: flow with nodes Customer Orders, Classify, PMML Model, Scored Orders, Assert, GroupBy token, Count, Failure Traps, Confusion Matrix; M and R stage markers]
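The GroupBy and Count steps that feed the Confusion Matrix amount to counting rows per (actual, predicted) pair. A minimal in-memory sketch of that counting follows. It assumes the actual value comes from the `label` field and the score from the `predict` field named in the earlier slides; the `ConfusionMatrix` class and the `"actual->predicted"` key format are hypothetical, and this is not how the Cascading app itself is implemented.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; the pre-defined app does this with Cascading pipes.
class ConfusionMatrix {
    // Key is "actual->predicted"; value is the number of rows with that pair.
    static Map<String, Integer> count(String[] actual, String[] predicted) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must be the same length");
        }
        // TreeMap keeps the cells in a stable, sorted order for printing.
        Map<String, Integer> cells = new TreeMap<>();
        for (int i = 0; i < actual.length; i++) {
            cells.merge(actual[i] + "->" + predicted[i], 1, Integer::sum);
        }
        return cells;
    }

    public static void main(String[] args) {
        String[] label = {"1", "0", "1", "1"};    // values of the "label" field
        String[] predict = {"1", "0", "0", "1"};  // values of the "predict" field
        System.out.println(count(label, predict));
    }
}
```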
Pattern – score a model, using pre-defined Cascading app
## run an RF classifier at scale

hadoop jar build/libs/pattern.jar \
  data/sample.tsv out/classify out/trap \
  --pmml data/sample.rf.xml
Data Workflows for Machine Learning: Summary

“feedback and discussion”
Nine Points for Data Workflows — a scorecard	

Frameworks compared (columns):
• Spark, MLbase
• Oryx
• Summingbird
• Cascalog
• Cascading
• KNIME
• Py Data
• R Markdown
• MBrace

Criteria (rows):
• includes people, exceptional data
• separation of concerns
• multiple abstraction layers
• testing in depth
• future-proof system integration
• visualize to collab
• can haz monoids
• blends batch + “real-time”
• optimize learners in context
• can haz PMML

[Scorecard grid: five ✔ marks; their cell positions are not recoverable from this transcript]
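The “can haz monoids” criterion refers to aggregations that have an identity element and an associative combine step, so partial results can be merged in any grouping. That property is what makes it practical to blend batch and “real-time” results. A minimal sketch under stated assumptions follows: the `Monoid` interface, `CountMonoid`, and `MonoidDemo` are hypothetical names written for this example and do not come from any of the frameworks listed.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interface for illustration: an identity value plus an associative combine.
interface Monoid<T> {
    T zero();
    T plus(T a, T b);
}

// Per-key counts form a monoid: the empty map is the identity, key-wise addition is associative.
class CountMonoid implements Monoid<Map<String, Long>> {
    public Map<String, Long> zero() {
        return new HashMap<>();
    }

    public Map<String, Long> plus(Map<String, Long> a, Map<String, Long> b) {
        Map<String, Long> out = new HashMap<>(a);
        b.forEach((key, value) -> out.merge(key, value, Long::sum));
        return out;
    }
}

class MonoidDemo {
    // Folds any number of partial results; grouping does not matter because plus is associative.
    static <T> T sum(Monoid<T> monoid, List<T> parts) {
        T acc = monoid.zero();
        for (T part : parts) {
            acc = monoid.plus(acc, part);
        }
        return acc;
    }

    public static void main(String[] args) {
        Map<String, Long> batch = new HashMap<>();   // counts from a nightly batch job
        batch.put("1", 40L);
        batch.put("0", 60L);
        Map<String, Long> recent = new HashMap<>();  // counts from a streaming path
        recent.put("1", 3L);
        System.out.println(sum(new CountMonoid(), Arrays.asList(batch, recent)));
    }
}
```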
PMML – what’s needed?	

in the language standard:
• data preparation
• data sources/sinks as parameters
• track data lineage
• more monoid, less XML
• handle data exceptions => TDD
• updates/active learning?
• tournaments?

in the workflow frameworks:
• include feature engineering w/ model?
• support evaluation better
• build ensembles

[Diagram: ML workflow with nodes data sources, data prep pipeline, feature engineering, train, test, hold-outs, learners, classifiers, tournaments, scoring, use cases; labeled representation, evaluation, optimization]

Diagram annotations:
• obtain other data?
• improve metadata?
• refine representation?
• improve optimization?
• iterate with stakeholders?
• can models (inference) inform first principles approaches?

quantify and measure:
• benefit?
• risk?
• operational costs?
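The “tournaments” and “hold-outs” items can be read as: score several candidate classifiers on data they were not trained on, then keep the best. A minimal sketch of that selection step follows. It assumes accuracy as the evaluation metric and represents each classifier as a plain function; the `Tournament` class and the toy entrants are hypothetical and do not reflect any particular framework's API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper for illustration; accuracy is an assumed metric.
class Tournament {
    // Fraction of hold-out rows the model labels correctly.
    static double accuracy(Function<double[], String> model, double[][] x, String[] y) {
        if (x.length == 0) {
            return 0.0;
        }
        int correct = 0;
        for (int i = 0; i < x.length; i++) {
            if (model.apply(x[i]).equals(y[i])) {
                correct++;
            }
        }
        return (double) correct / x.length;
    }

    // Returns the name of the entrant with the highest hold-out accuracy (first wins on ties).
    static String winner(Map<String, Function<double[], String>> entrants,
                         double[][] holdOutX, String[] holdOutY) {
        String best = null;
        double bestAccuracy = -1.0;
        for (Map.Entry<String, Function<double[], String>> entry : entrants.entrySet()) {
            double acc = accuracy(entry.getValue(), holdOutX, holdOutY);
            if (acc > bestAccuracy) {
                best = entry.getKey();
                bestAccuracy = acc;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Function<double[], String>> entrants = new LinkedHashMap<>();
        entrants.put("always0", row -> "0");                          // trivial baseline
        entrants.put("threshold", row -> row[0] > 0.5 ? "1" : "0");   // toy one-feature rule
        double[][] holdOutX = {{0.1}, {0.9}, {0.8}};
        String[] holdOutY = {"0", "1", "1"};
        System.out.println(winner(entrants, holdOutX, holdOutY));
    }
}
```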
Enterprise Data Workflows with Cascading	

O’Reilly, 2013
shop.oreilly.com/product/0636920028536.do

monthly newsletter for updates, events, conference summaries, etc.:
liber118.com/pxn/

Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
GenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation IncGenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation IncObject Automation
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
20200723_insight_release_plan
20200723_insight_release_plan20200723_insight_release_plan
20200723_insight_release_planJamie (Taka) Wang
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 

Recently uploaded (20)

Babel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxBabel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptx
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
PicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer ServicePicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer Service
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
GenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation IncGenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation Inc
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
20200723_insight_release_plan
20200723_insight_release_plan20200723_insight_release_plan
20200723_insight_release_plan
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 

Data Workflows for Machine Learning - Seattle DAML

  • 1. Data Workflows for Machine Learning:
 Seattle, WA, 2014-01-29
 Paco Nathan, @pacoid
 http://liber118.com/pxn/
 meetup.com/Seattle-DAML/events/159043422/
  • 2. Why is this talk here?
 Machine Learning in production apps is less and less about algorithms (even though that work is quite fun and vital).
 Performing real work is more about:
      • socializing a problem within an organization
      • feature engineering (“Beyond Product Managers”)
      • tournaments in CI/CD environments
      • operationalizing high-ROI apps at scale
      • etc.
 So I’ll just crawl out on a limb and state that leveraging great frameworks to build data workflows is more important than chasing after diminishing returns on highly nuanced algorithms.
 Because Interwebs!
  • 3. Data Workflows for Machine Learning
 Middleware has been evolving for Big Data, and there are some great examples; we’ll review several. Process has been evolving too, right along with the use cases.
 Popular frameworks typically provide some Machine Learning capabilities within their core components, or at least among their major use cases.
 Let’s consider features from Enterprise Data Workflows as a basis for what’s needed in Data Workflows for Machine Learning.
 Their requirements for scale, robustness, cost trade-offs, interdisciplinary teams, etc., serve as guides in general.
  • 4. Caveat Auditor
 I won’t claim to be expert with each of the frameworks and environments described in this talk. Expert with a few of them perhaps, but more to the point: embroiled in many use cases.
 This talk attempts to define a “scorecard” for evaluating important ML data workflow features: what’s needed for use cases, compare and contrast of what’s available, plus some indication of which frameworks are likely to be best for a given scenario.
 Seriously, this is a work in progress.
  • 5. Outline • Definition: Machine Learning • Definition: Data Workflows • A whole bunch o’ examples across several platforms • Nine points to discuss, leading up to a scorecard • A crazy little thing called PMML • Questions, comments, flying tomatoes…
  • 6. Data Workflows for Machine Learning:
 Frame the question…
 “A Basis for What’s Needed”
  • 7. Definition: Machine Learning
 “Machine learning algorithms can figure out how to perform important tasks by generalizing from examples. This is often feasible and cost-effective where manual programming is not. As more data becomes available, more ambitious problems can be tackled. As a result, machine learning is widely used in computer science and other fields.”
 Pedro Domingos, U Washington, A Few Useful Things to Know about Machine Learning
 [learning as generalization] = [representation] + [evaluation] + [optimization]
      • overfitting (variance)
      • underfitting (bias)
      • “perfect classifier” (no free lunch)
  • 8. Definition: Machine Learning … Conceptually
      1. real-world data
      2. graph theory for representation
      3. convert to sparse matrix for production work, leveraging abstract algebra + func programming
      4. cost-effective parallel processing for ML apps at scale, live use cases
  • 9. Definition: Machine Learning … Conceptually
 f(x): loss function
 g(z): regularization term
      1. real-world data [generalization]
      2. graph theory for representation [representation]
      3. convert to sparse matrix for production work, leveraging abstract algebra + func programming [evaluation]
      4. cost-effective parallel processing for ML apps at scale, live use cases [optimization]
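A minimal sketch of the "loss function plus regularization term" idea from this slide, in plain Python. The one-parameter linear model, the squared-error loss, the L2 penalty, the toy data, and the learning rate are all assumptions chosen for illustration; the slide only names f and g.

```python
def loss(w, data):
    """f: mean squared error of the one-parameter model y ~ w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def regularizer(w, lam):
    """g: L2 penalty, lam * w^2."""
    return lam * w * w


def fit(data, lam=0.1, rate=0.05, steps=500):
    """Minimize loss + regularizer by plain gradient descent."""
    w = 0.0
    n = len(data)
    for _ in range(steps):
        grad_f = sum(2 * (w * x - y) * x for x, y in data) / n
        grad_g = 2 * lam * w
        w -= rate * (grad_f + grad_g)
    return w


# Toy data (made up): roughly y = 2x.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
w_plain = fit(data, lam=0.0)   # no penalty
w_ridge = fit(data, lam=1.0)   # penalty shrinks the weight toward zero
```

The regularization term pulls the fitted weight toward zero, trading a little training error for a simpler model, which is the overfitting versus underfitting tension listed on the earlier definition slide.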
  • 10. Definition: Machine Learning … Interdisciplinary Teams, Needs × Roles
 [Figure: matrix of needs (discovery, modeling, integration, apps, systems) against roles]
      • Domain Expert: business process, stakeholder
      • Data Scientist: data science; data prep, discovery, modeling, etc.
      • App Dev: software engineering, automation
      • Ops: systems engineering, availability
  • 11. Definition: Machine Learning … Process, Feature Engineering
 [Figure labels: feature engineering; data sources; data prep pipeline; hold-outs, train, test; learners; classifiers; scoring; use cases; representation, evaluation, optimization]
  • 12. Definition: Machine Learning … Process, Tournaments
 [Figure labels: feature engineering; data sources; data prep pipeline; tournaments; hold-outs, train, test; learners; classifiers; scoring; use cases; representation, evaluation, optimization]
 questions along the way:
      • obtain other data?
      • improve metadata?
      • refine representation?
      • improve optimization?
      • iterate with stakeholders?
      • can models (inference) inform first principles approaches?
 quantify and measure:
      • benefit?
      • risk?
      • operational costs?
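One way to picture a tournament is sketched below in plain Python: several candidate learners are trained on the same training split, scored on a hold-out, and the best scorer wins. The two learners, the toy data, the split fraction, and every function name here are hypothetical stand-ins, not anything taken from the talk.

```python
import random


def majority_learner(train):
    """Baseline: always predict the most common training label."""
    ones = sum(label for _, label in train)
    guess = int(ones * 2 >= len(train))
    return lambda x: guess


def midpoint_learner(train):
    """Predict 1 when x exceeds the midpoint between the two class means."""
    xs0 = [x for x, label in train if label == 0]
    xs1 = [x for x, label in train if label == 1]
    cut = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: int(x > cut)


def accuracy(model, rows):
    """Fraction of rows the model labels correctly."""
    return sum(model(x) == label for x, label in rows) / len(rows)


def tournament(learners, rows, holdout_fraction=0.25, seed=7):
    """Train every learner on one split, score each on the hold-out, pick the best."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - holdout_fraction))
    train, holdout = rows[:cut], rows[cut:]
    scores = {name: accuracy(fit(train), holdout) for name, fit in learners.items()}
    winner = max(scores, key=scores.get)
    return winner, scores


# Toy data (made up): label is 1 when x >= 10.
rows = [(x, int(x >= 10)) for x in range(20)]
learners = {"majority": majority_learner, "midpoint": midpoint_learner}
winner, scores = tournament(learners, rows)
```

In a CI/CD setting the same loop would presumably run on every change to the features or the learners, with the hold-out score acting as the gate.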
  • 13. Definition: Machine Learning … Invasion of the Monoids
 monoid (an alien life form, courtesy of abstract algebra):
      • binary associative operator
      • closure within a set of objects
      • unique identity element
 what are those good for?
      • composable functions in workflows
      • compiler “hints” on steroids, to parallelize
      • reassemble results minimizing bottlenecks at scale
      • reusable components, like Lego blocks
      • think: Docker for business logic
 Monoidify! Monoids as a Design Principle for Efficient MapReduce Algorithms, Jimmy Lin, U Maryland/Twitter
 kudos to Oscar Boykin, Sam Ritchie, et al.
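The monoid properties listed on this slide can be sketched in a few lines of Python. The `Monoid` class and `merge_counts` helper are hypothetical names for illustration, not part of Algebird or any other library mentioned in the talk. Because the combine operation is associative and has an identity, partial results computed per partition can be reassembled in any grouping.

```python
from functools import reduce


class Monoid:
    """An identity element plus an associative binary operator."""

    def __init__(self, identity, combine):
        self.identity = identity
        self.combine = combine

    def fold(self, values):
        """Combine any iterable of values, starting from the identity."""
        return reduce(self.combine, values, self.identity)


def merge_counts(a, b):
    """Associative merge of two word-count dictionaries."""
    out = dict(a)
    for key, n in b.items():
        out[key] = out.get(key, 0) + n
    return out


word_counts = Monoid({}, merge_counts)


def count_words(text):
    """Map each word to a singleton count, then fold with the monoid."""
    return word_counts.fold({w: 1} for w in text.split())


partitions = ["a rose is a rose", "is a rose"]
partials = [count_words(p) for p in partitions]  # each could run on a separate worker
total = word_counts.fold(partials)               # reassemble the partial results
```

The same `fold` serves both the per-partition step and the final merge, which is the "reusable components" point: only the identity and the operator change from one aggregation to the next.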
  • 14. Definition: Machine Learning
 [Figure labels: feature engineering; data sources; data prep pipeline; tournaments; hold-outs, train, test; learners; classifiers; scoring; use cases; representation, evaluation, optimization]
 questions along the way:
      • obtain other data?
      • improve metadata?
      • refine representation?
      • improve optimization?
      • iterate with stakeholders?
      • can models (inference) inform first principles approaches?
 quantify and measure:
      • benefit?
      • risk?
      • operational costs?
 needs: discovery, modeling, integration, apps, systems
 roles: Domain Expert, Data Scientist (data science), App Dev, Ops
 introduced capability
  • 15. Definition
 Can we fit the required process into generalized workflow definitions?
 [Figure labels: feature engineering; data sources; data prep pipeline; tournaments; hold-outs, train, test; learners; classifiers; scoring; use cases; representation, evaluation, optimization]
 questions along the way:
      • obtain other data?
      • improve metadata?
      • refine representation?
      • improve optimization?
      • iterate with stakeholders?
      • can models (inference) inform first principles approaches?
 quantify and measure:
      • benefit?
      • risk?
      • operational costs?
 needs: discovery, modeling, integration, apps, systems
 roles: Domain Expert, Data Scientist (data science), App Dev, Ops
 introduced capability
  • 16. Definition: Data Workflows
 Middleware, effectively, is evolving for Big Data and Machine Learning…
 The following design pattern, a DAG, shows up in many places, via many frameworks:
 [Figure: DAG of ETL, data sources, data prep, predictive model, end uses]
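The DAG pattern can be sketched with Python's standard `graphlib` module (Python 3.9 or later). The node names follow the diagram labels on this slide, but the wiring between them, the toy step functions, and the `run` helper are assumptions made for illustration only.

```python
from graphlib import TopologicalSorter

# node -> set of upstream nodes it depends on (wiring is an assumption)
dag = {
    "etl": {"data sources"},
    "data prep": {"etl"},
    "predictive model": {"data prep"},
    "end uses": {"predictive model"},
}

# One toy function per node; each receives a dict of its upstream results.
steps = {
    "data sources": lambda inputs: [" 3", "4 ", "x", "5"],
    "etl": lambda inputs: [s.strip() for s in inputs["data sources"]],
    "data prep": lambda inputs: [int(s) for s in inputs["etl"] if s.isdigit()],
    "predictive model": lambda inputs: [2 * v + 1 for v in inputs["data prep"]],
    "end uses": lambda inputs: sum(inputs["predictive model"]),
}


def run(dag, steps):
    """Execute every step after all of its upstream dependencies have run."""
    results = {}
    for node in TopologicalSorter(dag).static_order():
        upstream = {dep: results[dep] for dep in dag.get(node, ())}
        results[node] = steps[node](upstream)
    return results


results = run(dag, steps)
```

Declaring the graph separately from the step functions is what lets a planner see the whole workflow at once, rather than a sequence of unrelated jobs.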
  • 17. Definition: Data Workflows
 definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…
 [Figure: DAG of ETL, data sources, data prep, predictive model, end uses; callout: ANSI SQL for ETL]
  • 18. Definition: Data Workflows
 definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…
 [Figure: DAG of ETL, data sources, data prep, predictive model, end uses; callout: Java, Pig for business logic]
  • 19. Definition: Data Workflows
 definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…
 [Figure: DAG of ETL, data sources, data prep, predictive model, end uses; callout: SAS for predictive models]
  • 20. Definition: Data Workflows
 definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…
 [Figure: DAG of ETL, data sources, data prep, predictive model, end uses; callouts: ANSI SQL for ETL; SAS for predictive models; most of the licensing costs…]
  • 21. Definition: Data Workflows
 definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…
 [Figure: DAG of ETL, data sources, data prep, predictive model, end uses; callouts: Java, Pig for business logic; most of the project costs…]
  • 22. Definition: Data Workflows
 definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…
 Something emerges to fill these needs, for instance…
 [Figure: DAG of ETL, data sources, data prep, predictive model, end uses; callouts: Java, Pig for business logic; most of the project costs…]
  • 23. Definition: Data Workflows
 For example, Cascading and related projects implement the following components, based on 100% open source (cascading.org):
      • Lingual: DW → ANSI SQL
      • Pattern: SAS, R, etc. → PMML
      • business logic in Java, Clojure, Scala, etc.
      • source taps for Cassandra, JDBC, Splunk, etc.
      • sink taps for Memcached, HBase, MongoDB, etc.
 a compiler sees it all… one connected DAG:
      • troubleshooting
      • exception handling
      • notifications
      • some optimizations
 [Figure: DAG of ETL, data sources, data prep, predictive model, end uses]
  • 24. Definition: Data Workflows
 Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source (cascading.org)
      • Lingual: DW → ANSI SQL
      • Pattern: SAS, R, etc. → PMML
      • business logic in Java, Clojure, Scala, etc.
      • source taps for Cassandra, JDBC, Splunk, etc.
      • sink taps for Memcached, HBase, MongoDB, etc.
 [Figure: DAG of ETL, data sources, data prep, predictive model, end uses]
     FlowDef flowDef = FlowDef.flowDef()
       .setName( "etl" )
       .addSource( "example.employee", emplTap )
       .addSource( "example.sales", salesTap )
       .addSink( "results", resultsTap );

     SQLPlanner sqlPlanner = new SQLPlanner()
       .setSql( sqlStatement );

     flowDef.addAssemblyPlanner( sqlPlanner );
  • 25. Definition: Data Workflows
 Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source
      • Lingual: DW → ANSI SQL
      • Pattern: SAS, R, etc. → PMML
      • business logic in Java, Clojure, Scala, etc.
      • source taps for Cassandra, JDBC, Splunk, etc.
      • sink taps for Memcached, HBase, MongoDB, etc.
 [Figure: DAG of ETL, data sources, data prep, predictive model, end uses]
     FlowDef flowDef = FlowDef.flowDef()
       .setName( "classifier" )
       .addSource( "input", inputTap )
       .addSink( "classify", classifyTap );

     PMMLPlanner pmmlPlanner = new PMMLPlanner()
       .setPMMLInput( new File( pmmlModel ) )
       .retainOnlyActiveIncomingFields();

     flowDef.addAssemblyPlanner( pmmlPlanner );
  • 26. Enterprise Data Workflows with Cascading, O’Reilly, 2013
 shop.oreilly.com/product/0636920028536.do
 [Figure: DAG of ETL, data sources, data prep, predictive model, end uses]
  • 27. Enterprise Data Workflows with Cascading, O’Reilly, 2013
 shop.oreilly.com/product/0636920028536.do
 This begs an update…
 [Figure: DAG of ETL, data sources, data prep, predictive model, end uses]
  • 28. Enterprise Data Workflows with Cascading, O’Reilly, 2013
 shop.oreilly.com/product/0636920028536.do
 Because…
 [Figure: DAG of ETL, data sources, data prep, predictive model, end uses]
  • 30. Data Workflows for Machine Learning:
 Examples…
 “Neque per exemplum”
  • 31. Example: KNIME
 “a user-friendly graphical workbench for the entire analysis process: data access, data transformation, initial investigation, powerful predictive analytics, visualisation and reporting.”
      • large number of integrations (over 1000 modules)
      • ranked #1 in customer satisfaction among open source analytics frameworks
      • visual editing of reusable modules
      • leverage prior work in R, Perl, etc.
      • Eclipse integration
      • easily extended for new integrations
 knime.org
  • 32. Example: Python stack
 Python has much to offer – ranging across an organization, not just for the analytics staff
      • ipython.org
      • pandas.pydata.org
      • scikit-learn.org
      • numpy.org
      • scipy.org
      • code.google.com/p/augustus
      • continuum.io
      • nltk.org
      • matplotlib.org
  • 33. Example: Julia
 julialang.org
 “a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments”
      • significantly faster than most alternatives
      • built to leverage parallelism, cloud computing
      • still relatively new, one to watch!
     importall Base

     type BubbleSort <: Sort.Algorithm end

     function sort!(v::AbstractVector, lo::Int, hi::Int, ::BubbleSort, o::Sort.Ordering)
         while true
             clean = true
             for i = lo:hi-1
                 if Sort.lt(o, v[i+1], v[i])
                     v[i+1], v[i] = v[i], v[i+1]
                     clean = false
                 end
             end
             clean && break
         end
         return v
     end
  • 34. Example: Summingbird
 github.com/twitter/summingbird
 “a library that lets you write streaming MapReduce programs that look like native Scala or Java collection transformations and execute them on a number of well-known distributed MapReduce platforms like Storm and Scalding.”
      • switch between Storm, Scalding (Hadoop)
      • Spark support is in progress
      • leverage Algebird, Storehaus, Matrix API, etc.
     def wordCount[P <: Platform[P]]
       (source: Producer[P, String], store: P#Store[String, Long]) =
         source.flatMap { sentence =>
           toWords(sentence).map(_ -> 1L)
         }.sumByKey(store)
  • 35. Example: Scalding
 github.com/twitter/scalding
     import com.twitter.scalding._

     class WordCount(args : Args) extends Job(args) {
       Tsv(args("doc"),
           ('doc_id, 'text),
           skipHeader = true)
         .read
         .flatMap('text -> 'token) {
           text : String => text.split("[ \\[\\]\\(\\),.]")
         }
         .groupBy('token) { _.size('count) }
         .write(Tsv(args("wc"), writeHeader = true))
     }
  • 36. Example: Scalding • extends the Scala collections API so that distributed lists become “pipes” backed by Cascading • code is compact, easy to understand • nearly 1:1 between elements of conceptual flow diagram 
 and function calls • extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc. • significant investments by Twitter, Etsy, eBay, etc. • less learning curve than Cascalog • build scripts… those take a while to run : github.com/twitter/scalding
  • 37. Example: Cascalog
 cascalog.org
     (ns impatient.core
       (:use [cascalog.api]
             [cascalog.more-taps :only (hfs-delimited)])
       (:require [clojure.string :as s]
                 [cascalog.ops :as c])
       (:gen-class))

     (defmapcatop split [line]
       "reads in a line of string and splits it by regex"
       (s/split line #"[\[\]\\\(\),.)\s]+"))

     (defn -main [in out & args]
       (?<- (hfs-delimited out)
            [?word ?count]
            ((hfs-delimited in :skip-header? true) _ ?line)
            (split ?line :> ?word)
            (c/count ?count)))

     ; Paul Lam
     ; github.com/Quantisan/Impatient
  • 38. Example: Cascalog • implements Datalog in Clojure, with predicates backed 
 by Cascading – for a highly declarative language • run ad-hoc queries from the Clojure REPL –
 approx. 10:1 code reduction compared with SQL • composable subqueries, used for test-driven development (TDD) practices at scale • Leiningen build: simple, no surprises, in Clojure itself • more new deployments than other Cascading DSLs – 
 Climate Corp is largest use case: 90% Clojure/Cascalog • has a learning curve, limited number of Clojure developers • aggregators are the magic, and those take effort to learn cascalog.org
  • 39. Example: Apache Spark
 spark-project.org
 in-memory cluster computing, by Matei Zaharia, Ram Venkataraman, et al.
      • intended to make data analytics fast to write and to run
      • load data into memory and query it repeatedly, much more quickly than with Hadoop
      • APIs in Scala, Java and Python, shells in Scala and Python
      • Shark (Hive on Spark), Spark Streaming (like Storm), etc.
      • integrations with MLbase, Summingbird (in progress)
     // word count
     val file = spark.textFile("hdfs://…")
     val counts = file.flatMap(line => line.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
     counts.saveAsTextFile("hdfs://…")
  • 40. Example: MLbase
 mlbase.org
 “distributed machine learning made easy”
      • sponsored by UC Berkeley EECS AMP Lab
      • MLlib – common algorithms, low-level, written atop Spark
      • MLI – feature extraction, algorithm dev atop MLlib
      • ML Optimizer – automates model selection, compiler/optimizer
      • see article: http://strata.oreilly.com/2013/02/mlbase-scalable-machine-learning-made-accessible.html
     data = load("hdfs://path/to/als_clinical")
     // the features are stored in columns 2-10
     X = data[, 2 to 10]
     y = data[, 1]
     model = do_classify(y, X)
  • 41. Example: Titan
 thinkaurelius.github.io/titan
 distributed graph database, by Matthias Broecheler, Marko Rodriguez, et al.
      • scalable graph database optimized for storing and querying graphs
      • supports hundreds of billions of vertices and edges
      • transactional database that can support thousands of concurrent users executing complex graph traversals
      • supports search through Lucene, ElasticSearch
      • can be backed by HBase, Cassandra, BerkeleyDB
      • TinkerPop native graph stack with Gremlin, etc.
     // who is hercules' paternal grandfather?
     g.V('name','hercules').out('father').out('father').name
  • 42. Example: MBrace
 m-brace.net
 “a .NET based software stack that enables easy large-scale distributed computation and big data analysis.”
      • declarative and concise distributed algorithms in F# asynchronous workflows
      • scalable for apps in private datacenters or public clouds, e.g., Windows Azure or Amazon EC2
      • tooling for interactive, REPL-style deployment, monitoring, management, debugging in Visual Studio
      • leverages monads (similar to Summingbird)
      • MapReduce as a library, but many patterns beyond
     let rec mapReduce (map: 'T -> Cloud<'R>)
                       (reduce: 'R -> 'R -> Cloud<'R>)
                       (identity: 'R)
                       (input: 'T list) =
       cloud {
         match input with
         | [] -> return identity
         | [value] -> return! map value
         | _ ->
           let left, right = List.split input
           let! r1, r2 =
             (mapReduce map reduce identity left)
             <||>
             (mapReduce map reduce identity right)
           return! reduce r1 r2
       }
  • 43. Data Workflows for Machine Learning:
 Update the definitions…
 “Nine Points to Discuss”

  • 44. Data Workflows
 Formally speaking, workflows include people (cross-dept) and define oversight for exceptional data
 …otherwise, data flow or pipeline would be more apt
 …examples include Cascading traps, with exceptional tuples routed to a review organization, e.g., Customer Support
 [Figure labels: Customers; Web App; logs; Cache; Support; trap tap; Modeling; PMML; source tap; Data Workflow; sink tap; Analytics Cubes; Reporting; customer profile DBs; Customer Prefs; Hadoop Cluster]
 1. includes people, defines oversight for exceptional data
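The trap idea can be sketched in plain Python as follows. This is not the Cascading API; `parse_order`, `run_with_trap`, and the sample records are hypothetical, and serve only to show exceptional records being diverted for human review instead of failing the whole run.

```python
def parse_order(line):
    """Parse 'customer,amount'; raises ValueError on malformed input."""
    customer, amount = line.split(",")
    return {"customer": customer.strip(), "amount": float(amount)}


def run_with_trap(lines, step):
    """Apply step to each record; divert failures to a trap instead of aborting."""
    sink, trap = [], []
    for line in lines:
        try:
            sink.append(step(line))
        except ValueError as err:
            # Keep the raw record and the reason, so a reviewer can act on it.
            trap.append({"record": line, "reason": str(err)})
    return sink, trap


# Made-up input: two clean records, two exceptional ones.
lines = ["alice, 12.50", "bob, n/a", "carol", "dave, 3"]
sink, trap = run_with_trap(lines, parse_order)
```

The trap output is what would be routed to a review organization such as Customer Support, which is where the "includes people" part of the definition comes in.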
  • 45. Data Workflows
 Workflows impose a separation of concerns, allowing for multiple abstraction layers in the tech stack
 …specify what is required, not how to accomplish it
 …articulating the business logic
 …Cascading leverages pattern language
 …related notions from Knuth, literate programming
 …not unlike BPM/BPEL for Big Data
 …examples: IPython, R Markdown, etc.
 2. separation of concerns, allows for literate programming
  • 46. Data Workflows
 Multiple abstraction layers in the tech stack are needed, but still emerging
 …feedback loops based on machine data
 …optimizers feed on abstraction
 …metadata accounting:
      • track data lineage
      • propagate schema
      • model / feature usage
      • ensemble performance
 …app history accounting:
      • util stats, mixing workloads
      • heavy-hitters, bottlenecks
      • throughput, latency
 [Figure: stack of layers: Business Process; Portable Models; Reusable Components; DSLs; Planners/Optimizers; Mixed Topologies; Cluster Scheduler; Machine Data; Clusters. Annotations: data lineage, schema propagation, feature selection, tournaments; app history, util stats, bottlenecks]
 3. multiple abstraction layers for metadata, feedback, and optimization
  • 47. Data Workflows
 Um, will compilers ever look like this??
 [Overlay on the previous slide; the underlying text is partly obscured]
 Multiple abstraction layers in the tech stack are needed, but still emerging
 …feedback loops based on machine data
 …optimizers feed on abstraction
 …metadata accounting:
      • track data lineage
      • propagate schema
      • model / feature usage
      • ensemble performance
 …app history accounting:
      • util stats, mixing workloads
      • heavy-hitters, bottlenecks
      • throughput, latency
 [Figure: stack of layers: Business Process; Portable Models; Reusable Components; DSLs; Planners/Optimizers; Mixed Topologies; Cluster Scheduler; Machine Data; Clusters. Annotations: data lineage, schema propagation, feature selection, tournaments; app history, util stats, bottlenecks]
 3. multiple abstraction layers for metadata, feedback, and optimization
  • 48. Data Workflows
 Workflows must provide for test, which is no simple matter
 …testing is required on three abstraction layers:
      • model portability/evaluation, ensembles, tournaments
      • TDD in reusable components and business process
      • continuous integration / continuous deployment
 …examples: Cascalog for TDD, PMML for model eval
 …still so much more to improve
 …keep in mind that workflows involve people, too
 [Figure: stack of layers: Business Process; Portable Models; Reusable Components; DSLs; Planners/Optimizers; Mixed Topologies; Cluster Scheduler; Machine Data; Clusters. Annotations: data lineage, schema propagation, feature selection, tournaments; app history, util stats, bottlenecks]
 4. testing: model evaluation, TDD, app deployment
  • 49. Data Workflows
 Workflows enable system integration
 …future-proof integrations and scale-out
 …build components and apps, not jobs and command lines
 …allow compiler optimizations across the DAG, i.e., cross-dept contributions
 …minimize operationalization costs:
      • troubleshooting, debugging at scale
      • exception handling
      • notifications, instrumentation
 …examples: KNIME, Cascading, etc.
 [Figure labels: Customers; Web App; logs; Cache; Support; trap tap; Modeling; PMML; source tap; Data Workflow; sink tap; Analytics Cubes; Reporting; customer profile DBs; Customer Prefs; Hadoop Cluster]
 5. future-proof system integration, scale-out, ops
  • 50. Data Workflows
 Visualizing workflows, what a great idea.
 …the practical basis for:
      • collaboration
      • rapid prototyping
      • component reuse
 …examples: KNIME wins best in category
 …Cascading generates flow diagrams, which are a nice start
 6. visualizing allows people to collaborate through code
 • 51. Data Workflows
Abstract algebra, containerizing workflow metadata
…monoids, semigroups, etc., allow for reusable components with well-defined properties for running in parallel at scale
…let business process be agnostic about underlying topologies, with analogy to Linux containers (Docker, etc.)
…compose functions, take advantage of sparsity (monoids)
…because “data is fully functional” – FP for s/w eng benefits
…aggregators are almost always “magic”, now we can solve for common use cases
…read Monoidify! Monoids as a Design Principle for Efficient MapReduce Algorithms by Jimmy Lin
…Cascading introduced some notions of this circa 2007, but didn’t make it explicit
…examples: Algebird in Summingbird, Simmer, Spark, etc.
7. abstract algebra and functional programming containerize business process
 • 52. Data Workflows
Workflow needs vary in time, and need to blend time
…something something batch, something something low-latency
…scheduling batch isn’t hard; scheduling low-latency is computationally brutal (see the Omega paper)
…because people like saying “Lambda Architecture”, it gives them goosebumps, or something
…because real-time means so many different things
…because batch windows are so 1964
…examples: Summingbird, Oryx
8. blend results from different time scales: batch plus low-latency
Big Data
Nathan Marz, James Warren
manning.com/marz
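A minimal Java sketch of what "blend results from different time scales" can look like at query time: a precomputed batch view is merged with a low-latency view of recent events. It assumes the metric is an additive count and that the two views cover non-overlapping time ranges. The names (`BlendSketch`, `blend`, the page keys) are hypothetical, and this is not how Summingbird or Oryx implement blending.

```java
import java.util.HashMap;
import java.util.Map;

public class BlendSketch {
    /**
     * Merges a batch view with a low-latency view by summing counts per key.
     * Neither input map is modified.
     */
    public static Map<String, Long> blend(Map<String, Long> batchView,
                                          Map<String, Long> recentView) {
        Map<String, Long> merged = new HashMap<>(batchView);
        recentView.forEach((key, count) -> merged.merge(key, count, Long::sum));
        return merged;
    }

    public static void main(String[] args) {
        // Counts computed by last night's batch job.
        Map<String, Long> batch = new HashMap<>();
        batch.put("page_a", 1000L);
        batch.put("page_b", 250L);

        // Counts accumulated since the batch job finished.
        Map<String, Long> recent = new HashMap<>();
        recent.put("page_a", 7L);
        recent.put("page_c", 3L);

        System.out.println(blend(batch, recent).get("page_a")); // prints 1007
    }
}
```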
 • 53. Data Workflows
Workflows may define contexts in which model selection possibly becomes a compiler problem
…for example, see Boyd, Parikh, et al.
…ultimately optimizing for loss function + regularization term
…probably not ready for prime-time tomorrow
…examples: MLbase
f(x): loss function
g(z): regularization term
9. optimize learners in context, to make model selection potentially a compiler problem
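One way to read "loss function + regularization term" is as a split problem in which f and g share a variable only through a constraint. This assumes the Boyd, Parikh, et al. reference is to their work on the alternating direction method of multipliers; the slide does not name the paper.

```latex
\begin{aligned}
\min_{x,\,z} \quad & f(x) + g(z) \\
\text{subject to} \quad & x - z = 0
\end{aligned}
```

As an illustration that is not from the slides, the lasso fits this form with $f(x) = \tfrac{1}{2}\lVert Ax - b \rVert_2^2$ and $g(z) = \lambda \lVert z \rVert_1$. Keeping f and g separate is what might let a planner swap loss functions or regularizers independently.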
 • 54. Data Workflows
bonus
For one of the best papers about what workflows really truly require, see Out of the Tar Pit by Moseley and Marks
 • 55. Nine Points for Data Workflows: a wish list
1. includes people, defines oversight for exceptional data
2. separation of concerns, allows for literate programming
3. multiple abstraction layers for metadata, feedback, and optimization
4. testing: model evaluation, TDD, app deployment
5. future-proof system integration, scale-out, ops
6. visualizing workflows allows people to collaborate through code
7. abstract algebra and functional programming containerize business process
8. blend results from different time scales: batch plus low-latency
9. optimize learners in context, to make model selection potentially a compiler problem
[Diagram: abstraction layers (Business Process, Portable Models, Reusable Components; Planners/Optimizers, Mixed Topologies; Cluster Scheduler, Machine Data, Clusters), annotated with data lineage, schema propagation, feature selection, tournaments, DSLs, app history, util stats, bottlenecks]
 • 57. Nine Points for Data Workflows: a scorecard
[Scorecard grid: check-mark positions not recoverable]
Columns: Spark, MLbase; Oryx; Summingbird; Cascalog; Cascading; KNIME; PyData; R Markdown; MBrace
Rows:
 • includes people, exceptional data
 • separation of concerns
 • multiple abstraction layers
 • testing in depth
 • future-proof system integration
 • visualize to collab
 • can haz monoids
 • blends batch + “real-time”
 • optimize learners in context
 • can haz PMML
 • 58. Data Workflows for Machine Learning:
PMML…
“Oh, by the way…”
 • 59. PMML – an industry standard
 • established XML standard for predictive model markup
 • organized by Data Mining Group (DMG), since 1997
   http://dmg.org/
 • members: IBM, SAS, Visa, FICO, Equifax, NASA, Microstrategy, Microsoft, etc.
 • PMML concepts for metadata, ensembles, etc., translate directly into workflow abstractions, e.g., Cascading
“PMML is the leading standard for statistical and data mining models and supported by over 20 vendors and organizations. With PMML, it is easy to develop a model on one system using one application and deploy the model on another system using another application.”
wikipedia.org/wiki/Predictive_Model_Markup_Language
  • 60. PMML – vendor coverage
 • 61. PMML – model coverage
 • Association Rules: AssociationModel element
 • Cluster Models: ClusteringModel element
 • Decision Trees: TreeModel element
 • Naïve Bayes Classifiers: NaiveBayesModel element
 • Neural Networks: NeuralNetwork element
 • Regression: RegressionModel and GeneralRegressionModel elements
 • Rulesets: RuleSetModel element
 • Sequences: SequenceModel element
 • Support Vector Machines: SupportVectorMachineModel element
 • Text Models: TextModel element
 • Time Series: TimeSeriesModel element
ibm.com/developerworks/industry/library/ind-PMML2/
 • 62. PMML – further study
PMML in Action
Alex Guazzelli, Wen-Ching Lin, Tridivesh Jena
amazon.com/dp/1470003244
See also excellent resources at: zementis.com/pmml.htm
 • 63. Pattern – model scoring
 • a PMML library for Cascading workflows
 • migrate workloads: SAS, Teradata, etc., exporting predictive models as PMML
 • great open source tools – R, Weka, KNIME, Matlab, RapidMiner, etc.
 • leverage PMML as another kind of DSL
 • only for scoring models, not training
[Diagram: Customers, Web App, logs, Cache, Support, Modeling, PMML, Data Workflow with source taps, sink taps, and a trap tap, Analytics Cubes, Reporting, Customer Prefs, customer profile DBs, Hadoop Cluster]
 • 64. Pattern – model scoring
 • originally intended as a general purpose, tuple-oriented model scoring engine for JVM-based frameworks
 • library plugs into Cascading (Hadoop), and ostensibly Cascalog, Scalding, Storm, Spark, Summingbird, etc.
 • dev forum:
   https://groups.google.com/forum/#!forum/pattern-user
 • original GitHub repo:
   https://github.com/ceteri/pattern/
 • somehow, the other fork became deeply intertwined with Cascading dependencies…
 • work in progress to integrate with other frameworks, add models, and also train models at scale using Spark
 • 65. Pattern – create a model in R
## packages that provide randomForest(), pmml(), and saveXML()
library(randomForest)
library(pmml)
library(XML)

## data_train, data, and dat_folder are assumed to be defined earlier in the script

## train a RandomForest model

f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=50)

## test the model on the holdout test set

print(fit$importance)
print(fit)

predicted <- predict(fit, data)
data$predicted <- predicted
confuse <- table(pred = predicted, true = data[,1])
print(confuse)

## export predicted labels to TSV

write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)

## export RF model to PMML

saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
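The `table(pred = predicted, true = data[,1])` call above builds a confusion matrix. For readers working on the JVM side of the workflow, the same tally can be sketched in Java as below. The class name, method name, and the "predicted|true" key format are hypothetical choices for illustration, and the sample labels are made up.

```java
import java.util.Map;
import java.util.TreeMap;

public class ConfusionSketch {
    /** Tallies (predicted, true) label pairs, keyed as "predicted|true". */
    public static Map<String, Integer> confusion(String[] predicted, String[] truth) {
        if (predicted.length != truth.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        // TreeMap keeps the cells in a stable, sorted order for printing.
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < predicted.length; i++) {
            counts.merge(predicted[i] + "|" + truth[i], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up labels for a binary classifier.
        String[] predicted = {"1", "0", "1", "1", "0"};
        String[] truth     = {"1", "0", "0", "1", "1"};
        System.out.println(confusion(predicted, truth));
    }
}
```

Cells that never occur are simply absent from the map, so a caller that needs a full grid should treat a missing key as zero.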
 • 66. Pattern – capture model parameters as PMML
<?xml version="1.0"?>
<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.dmg.org/PMML-4_0 http://www.dmg.org/v4-0/pmml-4-0.xsd">
 <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
  <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
  <Application name="Rattle/PMML" version="1.2.30"/>
  <Timestamp>2012-10-22 19:39:28</Timestamp>
 </Header>
 <DataDictionary numberOfFields="4">
  <DataField name="label" optype="categorical" dataType="string">
   <Value value="0"/>
   <Value value="1"/>
  </DataField>
  <DataField name="var0" optype="continuous" dataType="double"/>
  <DataField name="var1" optype="continuous" dataType="double"/>
  <DataField name="var2" optype="continuous" dataType="double"/>
 </DataDictionary>
 <MiningModel modelName="randomForest_Model" functionName="classification">
  <MiningSchema>
   <MiningField name="label" usageType="predicted"/>
   <MiningField name="var0" usageType="active"/>
   <MiningField name="var1" usageType="active"/>
   <MiningField name="var2" usageType="active"/>
  </MiningSchema>
  <Segmentation multipleModelMethod="majorityVote">
   <Segment id="1">
    <True/>
    <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit">
     <MiningSchema>
      <MiningField name="label" usageType="predicted"/>
      <MiningField name="var0" usageType="active"/>
      <MiningField name="var1" usageType="active"/>
      <MiningField name="var2" usageType="active"/>
     </MiningSchema>
...
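The `multipleModelMethod="majorityVote"` attribute above says that each segment (here, each tree in the forest) casts one vote and the most common label wins. A minimal Java sketch of that combination step follows. The names are hypothetical, the tie-breaking rule is an assumption made for this example rather than something taken from the PMML specification, and this is not Pattern's implementation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MajorityVoteSketch {
    /**
     * Returns the label predicted by the most segments, or null for no votes.
     * Assumption: on a tie, the label that reached the top count first wins.
     */
    public static String majorityVote(List<String> segmentPredictions) {
        Map<String, Integer> votes = new HashMap<>();
        String winner = null;
        int best = 0;
        for (String label : segmentPredictions) {
            int n = votes.merge(label, 1, Integer::sum);
            if (n > best) {
                best = n;
                winner = label;
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        // One predicted label per tree, for a single input tuple.
        List<String> perTree = Arrays.asList("1", "0", "1", "1", "0");
        System.out.println(majorityVote(perTree)); // prints 1
    }
}
```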
 • 67. Pattern – score a model, within an app
public static void main( String[] args ) throws RuntimeException {
  String inputPath = args[ 0 ];
  String classifyPath = args[ 1 ];

  // set up the config properties
  Properties properties = new Properties();
  AppProps.setApplicationJarClass( properties, Main.class );
  HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

  // create source and sink taps (tab-delimited text with a header row)
  Tap inputTap = new Hfs( new TextDelimited( true, "\t" ), inputPath );
  Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );

  // handle command line options
  OptionParser optParser = new OptionParser();
  optParser.accepts( "pmml" ).withRequiredArg();

  OptionSet options = optParser.parse( args );

  // connect the taps, pipes, etc., into a flow
  FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
    .addSource( "input", inputTap )
    .addSink( "classify", classifyTap );

  if( options.hasArgument( "pmml" ) ) {
    String pmmlPath = (String) options.valuesOf( "pmml" ).get( 0 );

    PMMLPlanner pmmlPlanner = new PMMLPlanner()
      .setPMMLInput( new File( pmmlPath ) )
      .retainOnlyActiveIncomingFields()
      .setDefaultPredictedField( new Fields( "predict", Double.class ) ); // default value if missing from the model

    flowDef.addAssemblyPlanner( pmmlPlanner );
  }

  // write a DOT file and run the flow
  Flow classifyFlow = flowConnector.connect( flowDef );
  classifyFlow.writeDOT( "dot/classify.dot" );
  classifyFlow.complete();
}
 • 68. Pattern – score a model, using pre-defined Cascading app
[Flow diagram labels: Customer Orders, PMML Model, Classify, Scored Orders, Assert, GroupBy token, Count, Confusion Matrix, Failure Traps; phases marked M and R]
 • 69. Pattern – score a model, using pre-defined Cascading app
## run an RF classifier at scale

hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap --pmml data/sample.rf.xml
 • 70. Data Workflows for Machine Learning:
Summary
“feedback and discussion” ?
 • 73. PMML – what’s needed?
in the language standard:
 • data preparation
 • data sources/sinks as parameters
 • track data lineage
 • more monoid, less XML
 • handle data exceptions => TDD
 • updates/active learning?
 • tournaments?
in the workflow frameworks:
 • include feature engineering w/ model?
 • support evaluation better
 • build ensembles
[Diagram: feature engineering, data sources, data prep pipeline, tournaments, hold-outs, train, test, learners, classifiers, scoring, representation, evaluation, optimization, use cases]
 • obtain other data?
 • improve metadata?
 • refine representation?
 • improve optimization?
 • iterate with stakeholders?
 • can models (inference) inform first principles approaches?
quantify and measure:
 • benefit?
 • risk?
 • operational costs?
 • 74. Enterprise Data Workflows with Cascading
O’Reilly, 2013
shop.oreilly.com/product/0636920028536.do
monthly newsletter for updates, events, conference summaries, etc.:
liber118.com/pxn/