Cascading meetup held jointly with Enterprise Big Data meetup at Tata Consultancy Services in Santa Clara on 2012-12-17
http://www.meetup.com/cascading/events/94079162/
1. Enterprise Data Workflows with Cascading

Paco Nathan
Concurrent, Inc.
pnathan@concurrentinc.com
@pacoid

[diagram: the Word Count workflow used throughout — Document Collection → Scrub/Tokenize → HashJoin (Left) with a Stop Word List (RHS) → GroupBy token → Count → Word Count, with M/R marking the map and reduce phases]

Copyright @2012, Concurrent, Inc.
2. Unstructured Data meets Enterprise Scale
1. Cascading API: a few facts & quotes
2. Example #1: distributed file copy
3. Example #2: word count
4. Pattern Language: workflow abstraction
5. Compare: Scalding, Cascalog, Hive, Pig
3. Intro to Cascading
[diagram: Word Count workflow, as above]

Cascading API: a few facts & quotes
4. Enterprise apps, pre-Hadoop
[diagram: pre-Hadoop Enterprise stack — ETL moves data from data sources into a data warehouse; analysts run SQL queries and ad-hoc dashboard analysis against it; developers build analytics apps and modeling tools; ops manages the warehouse; insights flow back out, driven by domain priorities]
5. Enterprise apps, pre-Hadoop
the devil you know:
‣ “scale up” as needed – larger proprietary hardware
‣ data warehouse: e.g., Oracle, Teradata, etc. – expensive
‣ analytics: e.g., SAS, Microstrategy, etc. – expensive
‣ highly trained staff in specific roles – lots of “silos”
however, to be competitive now, the data rates must scale
by orders of magnitude...
( alternatively, can we get hired onto the SAS sales team? )
6. Enterprise apps, with Hadoop
Apache Hadoop offers an attractive migration path:
‣ open source software – less expensive
‣ commodity hardware – less expensive
‣ fault tolerance for large-scale parallel workloads
‣ great use cases: Yahoo!, Facebook, Twitter, Amazon, Apple, etc.
‣ offload workflows from licensed platforms, based on “scale-out”
7. Enterprise apps, with Hadoop
[diagram: with Hadoop — analysts submit queries and models, developers submit Java apps, to a Hadoop cluster (job tracker, name node); ETL feeds the cluster per business needs; ops manages it]
8. Enterprise apps, with Hadoop
anything odd about that diagram?

[same diagram as the previous slide: analysts, developers, ETL, and ops around a Hadoop cluster]

‣ demands expert Hadoop developers
‣ experts are hard to find, expensive
‣ even harder to train from among existing staff
‣ early adopter abstractions are not suitable for Enterprise IT
‣ importantly: Hadoop is almost never used in isolation
9. Cascading API: purpose
‣ simplify data processing development and deployment
‣ improve application developer productivity
‣ enable data processing application manageability
10. Cascading API: a few facts
Java open source project (ASL 2) using Git, Gradle, Maven, JUnit, etc.
in production (~5 yrs) at hundreds of enterprise Hadoop deployments:
Finance, Health Care, Transportation, other verticals
studies published about large use cases: Twitter, Etsy, eBay, Airbnb, Square,
Climate Corp, FlightCaster, Williams-Sonoma, Trulia, TeleNav
partnerships and distribution with SpringSource, Amazon AWS,
Microsoft Azure, Hortonworks, MapR, EMC
several open source projects built atop, managed by Twitter, Etsy, eBay, etc.,
which provide substantial Machine Learning libraries
DSLs available in Scala, Clojure, Python (Jython), Ruby (JRuby), Groovy
data “taps” integrate popular data frameworks via JDBC, Memcached, HBase,
plus serialization in Apache Thrift, Avro, Kryo, etc.
entire app compiles into a single JAR: fully connected for compiler optimization,
exception handling, debug, config, scheduling, notifications, provenance, etc.
11. Cascading API: a few quotes
“Cascading gives Java developers the ability to build Big Data applications
on Hadoop using their existing skillset … Management can really go out
and build a team around folks that are already very experienced with Java.
Switching over to this is really a very short exercise.”
CIO, Thor Olavsrud, 2012-06-06
cio.com/article/707782/Ease_Big_Data_Hiring_Pain_With_Cascading
“Masks the complexity of MapReduce, simplifies the programming, and
speeds you on your journey toward actionable analytics … A vast
improvement over native MapReduce functions or Pig UDFs.”
2012 BOSSIE Awards, James Borck, 2012-09-18
infoworld.com/slideshow/65089
“Company’s promise to application developers is an opportunity to build
and test applications on their desktops in the language of choice with
familiar constructs and reusable components”
Dr. Dobb’s, Adrian Bridgwater, 2012-06-08
drdobbs.com/jvm/where-does-big-data-go-to-get-data-inten/240001759
12. Enterprise concerns
“Notes from the Mystery Machine Bus”
by Steve Yegge, Google
goo.gl/SeRZa
“conservative”                           “liberal”
(mostly) Enterprise                      (mostly) Start-Up
risk management                          customer experiments
assurance                                flexibility
well-defined schema                      schema follows code
explicit configuration                   convention
type-checking compiler                   interpreted scripts
wants no surprises                       wants no impediments
Java, Scala, Clojure, etc.               PHP, Ruby, Python, etc.
Cascading, Scalding, Cascalog, etc.      Hive, Pig, Hadoop Streaming, etc.
13. Enterprise adoption
As Enterprise apps move into Hadoop and related Big Data frameworks, risk profiles shift toward more conservative programming practices.

Cascading provides a popular API – formally speaking, a pattern language – for defining and managing Enterprise data workflows.
14. Migration of batch toolsets
                    Enterprise   Migration   Start-Ups
define pipelines    J2EE         Cascading   Pig
query data          SQL          Lingual     Hive
predictive models   SAS          Pattern     Mahout
15. Summary
Cascading API benefits:
‣ addresses staffing bottlenecks due to Hadoop adoption
‣ reduces costs, while servicing risk concerns and “conservatism”
‣ manages complexity as the data continues to scale massively
‣ provides a pattern language for system integration
‣ leverages a workflow abstraction for Enterprise apps
‣ utilizes existing practices for JVM-based clusters
16. Intro to Cascading
[diagram: Word Count workflow, as above]

Code Example #1: distributed file copy
17. 1: distributed file copy
[diagram: source tap → M → sink tap]

import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class Main
  {
  public static void main( String[] args )
    {
    String inPath = args[ 0 ];
    String outPath = args[ 1 ];

    Properties props = new Properties();
    AppProps.setApplicationJarClass( props, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

    // create the source tap
    Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

    // create the sink tap
    Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

    // specify a pipe to connect the taps
    Pipe copyPipe = new Pipe( "copy" );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
      .addSource( copyPipe, inTap )
      .addTailSink( copyPipe, outTap );

    // run the flow
    flowConnector.connect( flowDef ).complete();
    }
  }

1 mapper, 0 reducers, 10 lines of code
18. 1: distributed file copy
shown:
‣ a source tap – input data
‣ a sink tap – output data
‣ a pipe connecting a source to a sink
‣ simplest possible Cascading app
not shown:
‣ what kind of taps? and what size of input data set?
‣ could be: JDBC, HBase, Cassandra, XML, flat files, etc.
‣ what kind of topology? and what size of cluster?
‣ could be: Hadoop, in-memory, etc.
as system architects, we leverage a pattern language
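To make the “not shown” points concrete — a minimal sketch, assuming the Cascading 2.x local-mode classes, of re-binding the same copyPipe to local file taps; the file paths are illustrative:

import cascading.flow.FlowDef;
import cascading.flow.local.LocalFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.local.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.local.FileTap;

// same pipe assembly as Example #1 – only the taps and planner change
Pipe copyPipe = new Pipe( "copy" );

// local file taps instead of Hfs taps on HDFS
Tap inTap = new FileTap( new TextDelimited( true, "\t" ), "data/in.tsv" );
Tap outTap = new FileTap( new TextDelimited( true, "\t" ), "data/out.tsv" );

FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
  .addSource( copyPipe, inTap )
  .addTailSink( copyPipe, outTap );

// plan for the in-memory local topology rather than Hadoop
new LocalFlowConnector().connect( flowDef ).complete();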
19. principle: same JAR, any scale
MegaCorp Enterprise IT:
PBs of data
1000+ node private cluster
EVP calls you when app fails
runtime: days+

Production Cluster:
TBs of data
EMR w/ 50 HPC Instances
Ops monitors results
runtime: hours – days

Staging Cluster:
GBs of data
EMR + 4 Spot Instances
CI shows red or green lights
runtime: minutes – hours

Your Laptop:
MBs of data
Hadoop standalone mode
passes unit tests, or not
runtime: seconds – minutes
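A minimal sketch of what “same JAR, any scale” looks like in code, assuming the Cascading 2.x API; the --local flag is a hypothetical launch convention, not part of Cascading, and taps must still match the topology (Hfs for Hadoop, FileTap for local):

import cascading.flow.FlowConnector;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.flow.local.LocalFlowConnector;

// one JAR, one flow definition – pick the topology at launch time
boolean runLocal = args.length > 2 && args[ 2 ].equals( "--local" );

FlowConnector connector = runLocal
  ? new LocalFlowConnector()          // laptop: unit tests, seconds
  : new HadoopFlowConnector( props ); // staging / production / EMR

connector.connect( flowDef ).complete();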
20. principle: fail the same way twice
troubleshooting at scale:
‣ physical plan for a query provides a deterministic strategy
‣ avoid non-deterministic behavior – expensive when troubleshooting
‣ otherwise, edge cases become nightmares on large clusters
‣ again, addresses “conservative” need for predictability
‣ a core value which is unique to Cascading
21. principle: plan ahead
flow planner per topology:
‣ leverage the flow graph (DAG)
‣ catch as many errors as possible before an app gets submitted
‣ potential problems caught at compile time or at flow planner stage
‣ …long before large, expensive resources start getting consumed
‣ …or worse, before the wrong results get propagated downstream
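For example — a minimal sketch, assuming Cascading 2.x and reusing docTap, wcTap, and flowConnector from the word-count example; the misspelled field name is deliberate:

import cascading.flow.planner.PlannerException;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.tuple.Fields;

// "tokn" is a deliberate typo – no upstream operation emits that field,
// so the flow planner rejects the flow at connect() time
Pipe badPipe = new GroupBy( docPipe, new Fields( "tokn" ) );

try
  {
  flowConnector.connect( docTap, wcTap, badPipe ).complete();
  }
catch( PlannerException exception )
  {
  // caught before any cluster resources were consumed
  System.err.println( "plan failed: " + exception.getMessage() );
  }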
22. Intro to Cascading
[diagram: Word Count workflow, as above]

Code Example #2: word count
23. 2: word count
defined: count how often each word appears in a collection of text documents

this simple program provides a great test case for parallel processing, since it:
‣ requires a minimal amount of code
‣ demonstrates use of both symbolic and numeric values
‣ shows a dependency graph of tuples as an abstraction
‣ is not many steps away from useful search indexing
‣ serves as a “Hello World” for Hadoop apps

any distributed computing framework which runs Word Count efficiently in parallel at scale can handle much larger, more interesting compute problems
24. 2: word count
[diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count, with M/R marking the map and reduce phases]

1 mapper, 1 reducer, 18 lines of code: gist.github.com/3900702
25. 2: word count

[diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count]

String docPath = args[ 0 ];
String wcPath = args[ 1 ];

Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
27. 2: word count
deltas between Example #1 and Example #2:
‣ defines source tap as a collection of text documents
‣ defines sink tap to produce word count tuples (desired end result)
‣ uses named fields, applying structure to unstructured data
‣ adds semantics to the workflow, specifying business logic
‣ inserts operations into the pipe: Tokenize, GroupBy, Count
‣ shows function and aggregation applied to data tuples in parallel
[diagram: source tap → Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count → sink tap]
28. Intro to Cascading
[diagram: Word Count workflow, as above]

Pattern Language: the workflow abstraction
29. enterprise data workflows
Tuples, Pipelines, Taps, Operations, Joins, Assertions, Traps, etc. …in other words, “plumbing” as a pattern language for handling Big Data in Enterprise IT
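Of those primitives, assertions may be the least familiar — a minimal sketch, assuming the Cascading 2.x API, of adding a strict assertion to the word-count pipe:

import cascading.operation.AssertionLevel;
import cascading.operation.assertion.AssertNotNull;
import cascading.pipe.Each;
import cascading.pipe.Pipe;

// assertions document expectations about the tuple stream; the planner
// can strip them by level, and failing tuples divert to a trap if one is set
Pipe checkedPipe = new Each( docPipe, AssertionLevel.STRICT, new AssertNotNull() );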
[diagram: Word Count workflow, as above]
30. pattern language
defined: a structured method for solving large, complex design problems, where the syntax of the language promotes the use of best practices

the “plumbing” metaphor of pipes and operators in Cascading helps indicate: algorithms to be used at particular points, appropriate architectural trade-offs, frameworks which must be integrated, etc.

design patterns: originated in consensus negotiation for architecture, later used in software engineering
wikipedia.org/wiki/Pattern_language
31. data workflows: team
‣ Business Stakeholder POV:
business process management for workflow orchestration (think BPM/BPEL)
‣ Systems Integrator POV:
system integration of heterogeneous data sources and compute platforms
‣ Data Scientist POV:
a directed, acyclic graph (DAG) on which we can apply Amdahl's Law, etc.
‣ Data Architect POV:
a physical plan for large-scale data flow management
‣ Software Architect POV:
a pattern language, similar to plumbing or circuit design
‣ App Developer POV:
API bindings for Java, Scala, Clojure, Jython, JRuby, etc.
‣ Systems Engineer POV:
a JAR file, has passed CI, available in a Maven repo

[diagram: Word Count workflow, as above]
32. data workflows: layers
business process: domain expertise, business trade-offs, operating parameters, market position, etc.
API language: Java, Scala, Clojure, Jython, JRuby, Groovy, etc. …envision whatever runs in a JVM
optimize / schedule: major changes in technology now
physical plan: “assembler” code
topology: Apache Hadoop, in-memory local mode …envision GPUs, streaming, etc.
machine data: Splunk, New Relic, Typesafe, Nagios, etc.

[diagram: Word Count workflow, as above]
33. data workflows: example
[diagram: a Cascading app running on a Hadoop cluster — source taps read web logs and Customer Profile DBs; a Recommender System assembly computes recommendations; a sink tap writes to a Memcached cluster behind a web API serving Customers; a trap tap captures failed tuples for Customer Support review]
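A note on the trap tap shown above — a minimal sketch of wiring one into a FlowDef, assuming the Cascading 2.x API; the pipe and tap names are illustrative:

import cascading.flow.FlowDef;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

// tuples which throw exceptions inside the trapped branch are diverted
// to the trap tap for later review, instead of failing the whole flow
Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), "trap/reco" );

FlowDef flowDef = FlowDef.flowDef().setName( "reco" )
  .addSource( recoPipe, logsTap )
  .addTailSink( recoPipe, memcachedTap )
  .addTrap( recoPipe, trapTap );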
34. data workflows: SQL vs. JVM
abstraction     SQL
parser          SQL parser
optimizer       logical plan, optimized based on stats
planner         physical plan
machine data    query history, table stats
topology        b-trees, etc.
visualization   ERD
schema          table schema
catalog         relational catalog
35. data workflows: SQL vs. JVM
abstraction     SQL                                      JVM
parser          SQL parser                               SQL-92 compliant parser (in progress)
optimizer       logical plan, optimized based on stats   logical plan, optimized based on stats
planner         physical plan                            API “plumbing”
machine data    query history, table stats               app history, tuple stats
topology        b-trees, etc.                            heterogeneous, distributed: Hadoop, in-memory, etc.
visualization   ERD                                      flow diagram
schema          table schema                             tuple schema
catalog         relational catalog                       tap usage DB
36. Cascading taxonomy
[diagram: Cascading taxonomy — the scheduler launches app instances from an app JAR (with an owner, pulled from a Maven repo); an app contains flows; a flow connects source taps, sink taps, and trap taps, and decomposes into steps and slices; slice kind: mapper | reducer; topology: hadoop | local]
37. MapReduce architecture
‣ name node / data node
‣ job tracker / task tracker
‣ submit queue
‣ task slots
‣ HDFS
‣ distributed cache
(diagrams: Wikipedia, Apache)
38. Summary
If you were leading a team responsible for Enterprise apps:
‣ which of the previous two slides seems easier to understand?
‣ which is simpler to use for training and managing a team?
‣ which costs the most in the long run?
39. Intro to Cascading
[diagram: Word Count workflow, as above]

Compare & Contrast: other approaches
40. wc: pseudocode

[diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count]

void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");

void reduce (String word, Iterator partial_counts):
  int count = 0;
  for each pc in partial_counts:
    count += Int(pc);
  emit(word, String(count));
41. Scalding / Scala

[diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count]

// Sujit Pal
// sujitpal.blogspot.com/2012/08/scalding-for-impatient.html
package com.mycompany.impatient

import com.twitter.scalding._

class Part2(args : Args) extends Job(args) {
  val input = Tsv(args("input"), ('docId, 'text))
  val output = Tsv(args("output"))
  input.read.
    flatMap('text -> 'word) {
      text : String => text.split("""\s+""")
    }.
    groupBy('word) { group => group.size }.
    write(output)
}
42. Scalding / Scala
github.com/twitter/scalding/wiki

[diagram: Word Count workflow]
notes:
‣ code is compact, easy to understand
‣ functional programming is great for expressing
complex workflows in MapReduce, etc.
‣ very large-scale, complex problems can be handled
in just a few lines of code
‣ many large-scale apps in production deployments
‣ significant investments by Twitter, Etsy, eBay, etc.,
in this open source project
‣ extensive libraries are available for linear algebra,
machine learning – e.g., “Matrix API”
43. Cascalog / Clojure

[diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count]

; Paul Lam
; github.com/Quantisan/Impatient
(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\](),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))
44. Cascalog / Clojure
github.com/nathanmarz/cascalog/wiki

[diagram: Word Count workflow]
notes:
‣ code is compact, easy to understand
‣ functional programming is great for expressing
complex workflows in MapReduce, etc.
‣ significant investments by Twitter, Climate Corp, etc.,
in this open source project
‣ can run queries from the Clojure REPL
‣ compelling for very large-scale use cases where code
correctness can be verified before deployment
45. Apache Hive

[diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count]

-- Steve Severance
-- stackoverflow.com/questions/10039949/word-count-program-in-hive
CREATE TABLE input (line STRING);

LOAD DATA LOCAL INPATH 'input.tsv'
OVERWRITE INTO TABLE input;

SELECT
  word, COUNT(*)
FROM input
LATERAL VIEW explode(split(line, ' ')) lTable AS word
GROUP BY word
;
46. Apache Hive
hive.apache.org

[diagram: Word Count workflow]
pro:
‣ most popular abstraction atop Apache Hadoop
‣ SQL-like language is syntactically familiar to most analysts
‣ simple to load large-scale unstructured data and run ad-hoc queries
con:
‣ not a relational engine, many surprises at scale
‣ difficult to represent complex workflows, ML algorithms, etc.
‣ one poorly-trained analyst can bottleneck an entire cluster
‣ app-level integration requires other coding, outside of script language
‣ logical planner mixed with physical planner; cannot collect app stats
‣ non-deterministic exec: number of mappers+reducers changes unexpectedly
‣ business logic must cross multiple language boundaries: difficult to
troubleshoot, optimize, audit, handle exceptions, set notifications, etc.
47. Apache Pig

[diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count]

-- kudos to Dmitriy Ryaboy
docPipe = LOAD '$docPath' USING PigStorage('\t', 'tagsource')
  AS (doc_id, text);
docPipe = FILTER docPipe BY doc_id != 'doc_id';

-- specify regex to split "document" text lines into token stream
tokenPipe = FOREACH docPipe
  GENERATE doc_id, FLATTEN(TOKENIZE(text, ' [](),.')) AS token;
tokenPipe = FILTER tokenPipe BY token MATCHES '\\w.*';

-- determine the word counts
tokenGroups = GROUP tokenPipe BY token;
wcPipe = FOREACH tokenGroups
  GENERATE group AS token, COUNT(tokenPipe) AS count;

-- output
STORE wcPipe INTO '$wcPath' USING PigStorage('\t', 'tagsource');
EXPLAIN -out dot/wc_pig.dot -dot wcPipe;
48. Apache Pig
pig.apache.org

[diagram: Word Count workflow]
pro:
‣ easy to learn data manipulation language (DML)
‣ interactive prompt (Grunt) makes it simple to prototype apps
‣ extensibility through UDFs
con:
‣ not a full programming language; must extend via UDFs outside of language
‣ app-level integration requires other coding, outside of script language
‣ simple problems are simple to do; hard problems become quite complex
‣ difficult to parameterize scripts externally; must rewrite to change taps!
‣ logical planner mixed with physical planner; cannot collect app stats
‣ non-deterministic exec: number of mappers+reducers changes unexpectedly
‣ business logic must cross multiple language boundaries: difficult to
troubleshoot, optimize, audit, handle exceptions, set notifications, etc.
49. Intro to Cascading
[diagram: Word Count workflow, as above]

Code Example #N: City of Palo Alto, etc.
50. extend: wc + scrub + stop words
[diagram: Document Collection → Scrub/Tokenize → HashJoin (Left) with Stop Word List (RHS) → GroupBy token → Count → Word Count]

1 mapper, 1 reducer, 28+10 lines of code
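The stop-words branch isn't listed as code in the deck; here is a minimal sketch, assuming the Cascading 2.x API and mirroring the approach used in the “Impatient” tutorial series — stopPath is illustrative, and docPipe/token come from Example #2:

import cascading.operation.regex.RegexFilter;
import cascading.pipe.Each;
import cascading.pipe.HashJoin;
import cascading.pipe.Pipe;
import cascading.pipe.joiner.LeftJoin;
import cascading.tuple.Fields;

// RHS: the stop word list, small enough to replicate to each mapper;
// remember to addSource( stopPipe, stopTap ) in the FlowDef
Fields stop = new Fields( "stop" );
Tap stopTap = new Hfs( new TextDelimited( stop, true, "\t" ), stopPath );
Pipe stopPipe = new Pipe( "stop" );

// left join tokens against stop words
Pipe tokenPipe = new HashJoin( docPipe, token, stopPipe, stop, new LeftJoin() );

// unmatched rows carry a null "stop" field – keep only those
tokenPipe = new Each( tokenPipe, stop, new RegexFilter( "^$" ) );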
51. extend: a simple search engine
[diagram: a simple search engine as a Cascading flow — the Word Count assembly (Scrub/Tokenize → HashJoin Left against a Stop Word List RHS) branches three ways: Unique doc_id → Insert 1 → SumBy doc_id (D); Unique token → CountBy token (DF); CountBy on doc_id, token (TF); a CoGroup joins D, DF, and TF, an ExprFunction computes tf-idf, and a final CountBy/Sort on token count yields TF-IDF scores plus Word Count]

10 mappers, 8 reducers, 68+14 lines of code
52. City of Palo Alto open data
[diagram: CoPA workflow — a CoPA GIS export plus curated metadata is parsed by Regex parsers/filters into tree, road, and park records, with failures routed to traps; tree records are scrubbed, joined (HashJoin Left) with Tree Metadata for species, geohashed, and checkpointed; a Tree Filter/GroupBy/CoGroup computes tree_dist, tree_name, and shade; road records join (HashJoin Left) with Road Metadata, get an Estimate Albedo and geohash, and CoGroup with Road Segments; GPS logs join by geohash against park records to produce the reco]
github.com/Cascading/CoPA/wiki
‣ GIS export for parks, roads, trees (unstructured / open data)
‣ log files of personalized/frequented locations in Palo Alto via iPhone GPS tracks
‣ curated metadata, used to enrich the dataset
‣ could extend via mash-up with many available public data APIs
Enterprise-scale app: road albedo + tree species metadata + geospatial indexing
“Find a shady spot on a summer day to walk near downtown and take a call…”
54. CoPA: results

[plot: density of Estimated Tree Height (meters) by avg_height, 0–50 m, shaded by sample count 0–300]
‣ addr: 115 HAWTHORNE AVE
‣ lat/lng: 37.446, -122.168
‣ geohash: 9q9jh0
‣ tree: 413 site 2
‣ species: Liquidambar styraciflua
‣ avg height 23 m
‣ road albedo: 0.12
‣ distance: 10 m
‣ a short walk from my train stop ✔
55. Intro to Cascading
[diagram: Word Count workflow, as above]

PMML: predictive modeling
57. cascading.pattern
example:
1. use customer order history as the training data set
2. train a risk classifier for orders, using Random Forest
3. export model from R to PMML
4. build a Cascading app to execute the PMML model (see the sketch after this list)
4.1. generate a pipeline from PMML description
4.2. planner builds the flow for a topology (Hadoop)
4.3. compile app to a JAR file
5. deploy the app at scale to calculate scores
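A hedged sketch of steps 4.1–4.3, based on the Cascading “Pattern” project's PMMLPlanner approach — treat the class and method names as assumptions about that API, and the taps/paths as illustrative:

import java.io.File;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pattern.pmml.PMMLPlanner;  // assumed package name

// 4.1: generate a pipe assembly from the PMML description
PMMLPlanner pmmlPlanner = new PMMLPlanner()
  .setPMMLInput( new File( pmmlPath ) )
  .retainOnlyActiveIncomingFields();

FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
  .addSource( "input", ordersTap )
  .addSink( "classify", scoresTap )
  .addAssemblyPlanner( pmmlPlanner );

// 4.2: the flow planner builds the flow for the Hadoop topology
// (4.3: packaging into a single JAR happens at build time)
new HadoopFlowConnector( props ).connect( flowDef ).complete();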
58. cascading.pattern
[diagram: cascading.pattern deployment — two risk classifiers, one on the customer-360 dimension and one per-order; Cascading apps handle training data prep, scoring customer transactions, predicting model costs, and scoring new orders; the analyst's laptop trains the model and exports PMML; batch workloads (Hadoop, DW, ETL, chargebacks, partner data) segment customers and detect fraudsters; real-time workloads (Customer DB, IMDG) compute velocity metrics and anomaly detection]
59. 1: “orders” data set… train/test in R… exported as PMML
60. R modeling
## train a RandomForest model
f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=50)

## test the model on the holdout test set
print(fit$importance)
print(fit)

predicted <- predict(fit, data)
data$predicted <- predicted
confuse <- table(pred = predicted, true = data[,1])
print(confuse)

## export predicted labels to TSV
write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
  quote=FALSE, sep="\t", row.names=FALSE)

## export RF model to PMML
saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))