Cascading meetup held jointly with Enterprise Big Data meetup at Tata Consultancy Services in Santa Clara on 2012-12-17
http://www.meetup.com/cascading/events/94079162/
1. Enterprise Data Workflows with Cascading

Paco Nathan
Concurrent, Inc.
pnathan@concurrentinc.com
@pacoid

[diagram: the Word Count workflow used throughout — Document Collection → Scrub/Tokenize → HashJoin (Left) with a Stop Word List (RHS) → GroupBy token → Count → Word Count, with M/R marking the map and reduce phases]

Copyright @2012, Concurrent, Inc.
2. Unstructured Data meets Enterprise Scale
1. Cascading API: a few facts & quotes
2. Example #1: distributed file copy
3. Example #2: word count
4. Pattern Language: workflow abstraction
5. Compare: Scalding, Cascalog, Hive, Pig
3. Intro to Cascading
[diagram: Word Count workflow, as above]

Cascading API: a few facts & quotes
4. Enterprise apps, pre-Hadoop
[diagram: pre-Hadoop Enterprise stack — ETL moves data from data sources into a data warehouse; analysts run SQL queries and ad-hoc dashboard analysis against it; developers build analytics apps and modeling tools; ops manages the warehouse; insights flow back out, driven by domain priorities]
5. Enterprise apps, pre-Hadoop
the devil you know:
‣ “scale up” as needed – larger proprietary hardware
‣ data warehouse: e.g., Oracle, Teradata, etc. – expensive
‣ analytics: e.g., SAS, Microstrategy, etc. – expensive
‣ highly trained staff in specific roles – lots of “silos”
however, to be competitive now, the data rates must scale
by orders of magnitude...
( alternatively, can we get hired onto the SAS sales team? )
6. Enterprise apps, with Hadoop
Apache Hadoop offers an attractive migration path:
‣ open source software – less expensive
‣ commodity hardware – less expensive
‣ fault tolerance for large-scale parallel workloads
‣ great use cases: Yahoo!, Facebook, Twitter, Amazon, Apple, etc.
‣ offload workflows from licensed platforms, based on “scale-out”
7. Enterprise apps, with Hadoop
[diagram: with Hadoop — analysts submit queries and models, developers submit Java apps, to a Hadoop cluster (job tracker, name node); ETL feeds the cluster per business needs; ops manages it]
8. Enterprise apps, with Hadoop
anything odd about that diagram?

[same diagram as the previous slide: analysts, developers, ETL, and ops around a Hadoop cluster]

‣ demands expert Hadoop developers
‣ experts are hard to find, expensive
‣ even harder to train from among existing staff
‣ early adopter abstractions are not suitable for Enterprise IT
‣ importantly: Hadoop is almost never used in isolation
9. Cascading API: purpose
‣ simplify data processing development and deployment
‣ improve application developer productivity
‣ enable data processing application manageability
10. Cascading API: a few facts
Java open source project (ASL 2) using Git, Gradle, Maven, JUnit, etc.
in production (~5 yrs) at hundreds of enterprise Hadoop deployments:
Finance, Health Care, Transportation, other verticals
studies published about large use cases: Twitter, Etsy, eBay, Airbnb, Square,
Climate Corp, FlightCaster, Williams-Sonoma, Trulia, TeleNav
partnerships and distribution with SpringSource, Amazon AWS,
Microsoft Azure, Hortonworks, MapR, EMC
several open source projects built atop, managed by Twitter, Etsy, eBay, etc.,
which provide substantial Machine Learning libraries
DSLs available in Scala, Clojure, Python (Jython), Ruby (JRuby), Groovy
data “taps” integrate popular data frameworks via JDBC, Memcached, HBase,
plus serialization in Apache Thrift, Avro, Kryo, etc.
entire app compiles into a single JAR: fully connected for compiler optimization,
exception handling, debug, config, scheduling, notifications, provenance, etc.
11. Cascading API: a few quotes
“Cascading gives Java developers the ability to build Big Data applications
on Hadoop using their existing skillset … Management can really go out
and build a team around folks that are already very experienced with Java.
Switching over to this is really a very short exercise.”
CIO, Thor Olavsrud, 2012-06-06
cio.com/article/707782/Ease_Big_Data_Hiring_Pain_With_Cascading
“Masks the complexity of MapReduce, simplifies the programming, and
speeds you on your journey toward actionable analytics … A vast
improvement over native MapReduce functions or Pig UDFs.”
2012 BOSSIE Awards, James Borck, 2012-09-18
infoworld.com/slideshow/65089
“Company’s promise to application developers is an opportunity to build
and test applications on their desktops in the language of choice with
familiar constructs and reusable components”
Dr. Dobb’s, Adrian Bridgwater, 2012-06-08
drdobbs.com/jvm/where-does-big-data-go-to-get-data-inten/240001759
12. Enterprise concerns
“Notes from the Mystery Machine Bus”
by Steve Yegge, Google
goo.gl/SeRZa
“conservative”                           “liberal”
(mostly) Enterprise                      (mostly) Start-Up
risk management                          customer experiments
assurance                                flexibility
well-defined schema                      schema follows code
explicit configuration                   convention
type-checking compiler                   interpreted scripts
wants no surprises                       wants no impediments
Java, Scala, Clojure, etc.               PHP, Ruby, Python, etc.
Cascading, Scalding, Cascalog, etc.      Hive, Pig, Hadoop Streaming, etc.
13. Enterprise adoption
As Enterprise apps move into Hadoop and related Big Data frameworks, risk profiles shift toward more conservative programming practices.

Cascading provides a popular API – formally speaking, a pattern language – for defining and managing Enterprise data workflows.
14. Migration of batch toolsets
                    Enterprise   Migration   Start-Ups
define pipelines    J2EE         Cascading   Pig
query data          SQL          Lingual     Hive
predictive models   SAS          Pattern     Mahout
15. Summary
Cascading API benefits:
‣ addresses staffing bottlenecks due to Hadoop adoption
‣ reduces costs, while servicing risk concerns and “conservatism”
‣ manages complexity as the data continues to scale massively
‣ provides a pattern language for system integration
‣ leverages a workflow abstraction for Enterprise apps
‣ utilizes existing practices for JVM-based clusters
16. Intro to Cascading
[diagram: Word Count workflow, as above]

Code Example #1: distributed file copy
17. 1: distributed file copy
[diagram: source tap → M → sink tap]

import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class Main
  {
  public static void main( String[] args )
    {
    String inPath = args[ 0 ];
    String outPath = args[ 1 ];

    Properties props = new Properties();
    AppProps.setApplicationJarClass( props, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

    // create the source tap
    Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

    // create the sink tap
    Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

    // specify a pipe to connect the taps
    Pipe copyPipe = new Pipe( "copy" );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
      .addSource( copyPipe, inTap )
      .addTailSink( copyPipe, outTap );

    // run the flow
    flowConnector.connect( flowDef ).complete();
    }
  }

1 mapper, 0 reducers, 10 lines of code
18. 1: distributed file copy
shown:
‣ a source tap – input data
‣ a sink tap – output data
‣ a pipe connecting a source to a sink
‣ simplest possible Cascading app
not shown:
‣ what kind of taps? and what size of input data set?
‣ could be: JDBC, HBase, Cassandra, XML, flat files, etc.
‣ what kind of topology? and what size of cluster?
‣ could be: Hadoop, in-memory, etc.
as system architects, we leverage a pattern language
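To make the “not shown” points concrete — a minimal sketch, assuming the Cascading 2.x local-mode classes, of re-binding the same copyPipe to local file taps; the file paths are illustrative:

import cascading.flow.FlowDef;
import cascading.flow.local.LocalFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.local.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.local.FileTap;

// same pipe assembly as Example #1 – only the taps and planner change
Pipe copyPipe = new Pipe( "copy" );

// local file taps instead of Hfs taps on HDFS
Tap inTap = new FileTap( new TextDelimited( true, "\t" ), "data/in.tsv" );
Tap outTap = new FileTap( new TextDelimited( true, "\t" ), "data/out.tsv" );

FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
  .addSource( copyPipe, inTap )
  .addTailSink( copyPipe, outTap );

// plan for the in-memory local topology rather than Hadoop
new LocalFlowConnector().connect( flowDef ).complete();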
19. principle: same JAR, any scale
MegaCorp Enterprise IT:
PBs of data
1000+ node private cluster
EVP calls you when app fails
runtime: days+

Production Cluster:
TBs of data
EMR w/ 50 HPC Instances
Ops monitors results
runtime: hours – days

Staging Cluster:
GBs of data
EMR + 4 Spot Instances
CI shows red or green lights
runtime: minutes – hours

Your Laptop:
MBs of data
Hadoop standalone mode
passes unit tests, or not
runtime: seconds – minutes
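A minimal sketch of what “same JAR, any scale” looks like in code, assuming the Cascading 2.x API; the --local flag is a hypothetical launch convention, not part of Cascading, and taps must still match the topology (Hfs for Hadoop, FileTap for local):

import cascading.flow.FlowConnector;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.flow.local.LocalFlowConnector;

// one JAR, one flow definition – pick the topology at launch time
boolean runLocal = args.length > 2 && args[ 2 ].equals( "--local" );

FlowConnector connector = runLocal
  ? new LocalFlowConnector()          // laptop: unit tests, seconds
  : new HadoopFlowConnector( props ); // staging / production / EMR

connector.connect( flowDef ).complete();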
20. principle: fail the same way twice
troubleshooting at scale:
‣ physical plan for a query provides a deterministic strategy
‣ avoid non-deterministic behavior – expensive when troubleshooting
‣ otherwise, edge cases become nightmares on large clusters
‣ again, addresses “conservative” need for predictability
‣ a core value which is unique to Cascading
21. principle: plan ahead
flow planner per topology:
‣ leverage the flow graph (DAG)
‣ catch as many errors as possible before an app gets submitted
‣ potential problems caught at compile time or at flow planner stage
‣ …long before large, expensive resources start getting consumed
‣ …or worse, before the wrong results get propagated downstream
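For example — a minimal sketch, assuming Cascading 2.x and reusing docTap, wcTap, and flowConnector from the word-count example; the misspelled field name is deliberate:

import cascading.flow.planner.PlannerException;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.tuple.Fields;

// "tokn" is a deliberate typo – no upstream operation emits that field,
// so the flow planner rejects the flow at connect() time
Pipe badPipe = new GroupBy( docPipe, new Fields( "tokn" ) );

try
  {
  flowConnector.connect( docTap, wcTap, badPipe ).complete();
  }
catch( PlannerException exception )
  {
  // caught before any cluster resources were consumed
  System.err.println( "plan failed: " + exception.getMessage() );
  }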
22. Intro to Cascading
[diagram: Word Count workflow, as above]

Code Example #2: word count
23. 2: word count
defined: count how often each word appears in a collection of text documents

this simple program provides a great test case for parallel processing, since it:
‣ requires a minimal amount of code
‣ demonstrates use of both symbolic and numeric values
‣ shows a dependency graph of tuples as an abstraction
‣ is not many steps away from useful search indexing
‣ serves as a “Hello World” for Hadoop apps

any distributed computing framework which runs Word Count efficiently in parallel at scale can handle much larger, more interesting compute problems
24. 2: word count
[diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count, with M/R marking the map and reduce phases]

1 mapper, 1 reducer, 18 lines of code: gist.github.com/3900702
25. 2: word count

[diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count]

String docPath = args[ 0 ];
String wcPath = args[ 1 ];

Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
27. 2: word count
deltas between Example #1 and Example #2:
‣ defines source tap as a collection of text documents
‣ defines sink tap to produce word count tuples (desired end result)
‣ uses named fields, applying structure to unstructured data
‣ adds semantics to the workflow, specifying business logic
‣ inserts operations into the pipe: Tokenize, GroupBy, Count
‣ shows function and aggregation applied to data tuples in parallel
[diagram: source tap → Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count → sink tap]
28. Intro to Cascading
[diagram: Word Count workflow, as above]

Pattern Language: the workflow abstraction
29. enterprise data workflows
Tuples, Pipelines, Taps, Operations, Joins, Assertions, Traps, etc. …in other words, “plumbing” as a pattern language for handling Big Data in Enterprise IT
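Of those primitives, assertions may be the least familiar — a minimal sketch, assuming the Cascading 2.x API, of adding a strict assertion to the word-count pipe:

import cascading.operation.AssertionLevel;
import cascading.operation.assertion.AssertNotNull;
import cascading.pipe.Each;
import cascading.pipe.Pipe;

// assertions document expectations about the tuple stream; the planner
// can strip them by level, and failing tuples divert to a trap if one is set
Pipe checkedPipe = new Each( docPipe, AssertionLevel.STRICT, new AssertNotNull() );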
[diagram: Word Count workflow, as above]
30. pattern language
defined: a structured method for solving large, complex design problems, where the syntax of the language promotes the use of best practices

the “plumbing” metaphor of pipes and operators in Cascading helps indicate: algorithms to be used at particular points, appropriate architectural trade-offs, frameworks which must be integrated, etc.

design patterns: originated in consensus negotiation for architecture, later used in software engineering
wikipedia.org/wiki/Pattern_language
31. data workflows: team
‣ Business Stakeholder POV:
business process management for workflow orchestration (think BPM/BPEL)
‣ Systems Integrator POV:
system integration of heterogeneous data sources and compute platforms
‣ Data Scientist POV:
a directed, acyclic graph (DAG) on which we can apply Amdahl's Law, etc.
‣ Data Architect POV:
a physical plan for large-scale data flow management
‣ Software Architect POV:
a pattern language, similar to plumbing or circuit design
‣ App Developer POV:
API bindings for Java, Scala, Clojure, Jython, JRuby, etc.
‣ Systems Engineer POV:
a JAR file, has passed CI, available in a Maven repo

[diagram: Word Count workflow, as above]
32. data workflows: layers
business process: domain expertise, business trade-offs, operating parameters, market position, etc.
API language: Java, Scala, Clojure, Jython, JRuby, Groovy, etc. …envision whatever runs in a JVM
optimize / schedule: major changes in technology now
physical plan: “assembler” code
topology: Apache Hadoop, in-memory local mode …envision GPUs, streaming, etc.
machine data: Splunk, New Relic, Typesafe, Nagios, etc.

[diagram: Word Count workflow, as above]
33. data workflows: example
[diagram: a Cascading app running on a Hadoop cluster — source taps read web logs and Customer Profile DBs; a Recommender System assembly computes recommendations; a sink tap writes to a Memcached cluster behind a web API serving Customers; a trap tap captures failed tuples for Customer Support review]
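A note on the trap tap shown above — a minimal sketch of wiring one into a FlowDef, assuming the Cascading 2.x API; the pipe and tap names are illustrative:

import cascading.flow.FlowDef;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

// tuples which throw exceptions inside the trapped branch are diverted
// to the trap tap for later review, instead of failing the whole flow
Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), "trap/reco" );

FlowDef flowDef = FlowDef.flowDef().setName( "reco" )
  .addSource( recoPipe, logsTap )
  .addTailSink( recoPipe, memcachedTap )
  .addTrap( recoPipe, trapTap );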
34. data workflows: SQL vs. JVM
abstraction     SQL
parser          SQL parser
optimizer       logical plan, optimized based on stats
planner         physical plan
machine data    query history, table stats
topology        b-trees, etc.
visualization   ERD
schema          table schema
catalog         relational catalog
35. data workflows: SQL vs. JVM
abstraction     SQL                                      JVM
parser          SQL parser                               SQL-92 compliant parser (in progress)
optimizer       logical plan, optimized based on stats   logical plan, optimized based on stats
planner         physical plan                            API “plumbing”
machine data    query history, table stats               app history, tuple stats
topology        b-trees, etc.                            heterogeneous, distributed: Hadoop, in-memory, etc.
visualization   ERD                                      flow diagram
schema          table schema                             tuple schema
catalog         relational catalog                       tap usage DB
36. Cascading taxonomy
[diagram: Cascading taxonomy — the scheduler launches app instances from an app JAR (with an owner, pulled from a Maven repo); an app contains flows; a flow connects source taps, sink taps, and trap taps, and decomposes into steps and slices; slice kind: mapper | reducer; topology: hadoop | local]
37. MapReduce architecture
‣ name node / data node
‣ job tracker / task tracker
‣ submit queue
‣ task slots
‣ HDFS
‣ distributed cache
(diagrams: Wikipedia, Apache)
38. Summary
If you were leading a team responsible for Enterprise apps:
‣ which of the previous two slides seems easier to understand?
‣ which is simpler to use for training and managing a team?
‣ which costs the most in the long run?
39. Intro to Cascading
[diagram: Word Count workflow, as above]

Compare & Contrast: other approaches
40. wc: pseudocode

[diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count]

void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");

void reduce (String word, Iterator partial_counts):
  int count = 0;
  for each pc in partial_counts:
    count += Int(pc);
  emit(word, String(count));
41. Scalding / Scala

[diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count]

// Sujit Pal
// sujitpal.blogspot.com/2012/08/scalding-for-impatient.html
package com.mycompany.impatient

import com.twitter.scalding._

class Part2(args : Args) extends Job(args) {
  val input = Tsv(args("input"), ('docId, 'text))
  val output = Tsv(args("output"))
  input.read.
    flatMap('text -> 'word) {
      text : String => text.split("""\s+""")
    }.
    groupBy('word) { group => group.size }.
    write(output)
}
42. Scalding / Scala
github.com/twitter/scalding/wiki

[diagram: Word Count workflow]
notes:
‣ code is compact, easy to understand
‣ functional programming is great for expressing
complex workflows in MapReduce, etc.
‣ very large-scale, complex problems can be handled
in just a few lines of code
‣ many large-scale apps in production deployments
‣ significant investments by Twitter, Etsy, eBay, etc.,
in this open source project
‣ extensive libraries are available for linear algebra,
machine learning – e.g., “Matrix API”
43. Cascalog / Clojure

[diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count]

; Paul Lam
; github.com/Quantisan/Impatient
(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\](),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))
44. Cascalog / Clojure
github.com/nathanmarz/cascalog/wiki

[diagram: Word Count workflow]
notes:
‣ code is compact, easy to understand
‣ functional programming is great for expressing
complex workflows in MapReduce, etc.
‣ significant investments by Twitter, Climate Corp, etc.,
in this open source project
‣ can run queries from the Clojure REPL
‣ compelling for very large-scale use cases where code
correctness can be verified before deployment
45. Apache Hive

[diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count]

-- Steve Severance
-- stackoverflow.com/questions/10039949/word-count-program-in-hive
CREATE TABLE input (line STRING);

LOAD DATA LOCAL INPATH 'input.tsv'
OVERWRITE INTO TABLE input;

SELECT
  word, COUNT(*)
FROM input
LATERAL VIEW explode(split(line, ' ')) lTable AS word
GROUP BY word
;
46. Apache Hive
hive.apache.org

[diagram: Word Count workflow]
pro:
‣ most popular abstraction atop Apache Hadoop
‣ SQL-like language is syntactically familiar to most analysts
‣ simple to load large-scale unstructured data and run ad-hoc queries
con:
‣ not a relational engine, many surprises at scale
‣ difficult to represent complex workflows, ML algorithms, etc.
‣ one poorly-trained analyst can bottleneck an entire cluster
‣ app-level integration requires other coding, outside of script language
‣ logical planner mixed with physical planner; cannot collect app stats
‣ non-deterministic exec: number of mappers+reducers changes unexpectedly
‣ business logic must cross multiple language boundaries: difficult to
troubleshoot, optimize, audit, handle exceptions, set notifications, etc.
47. Apache Pig

[diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count]

-- kudos to Dmitriy Ryaboy
docPipe = LOAD '$docPath' USING PigStorage('\t', 'tagsource')
  AS (doc_id, text);
docPipe = FILTER docPipe BY doc_id != 'doc_id';

-- specify regex to split "document" text lines into token stream
tokenPipe = FOREACH docPipe
  GENERATE doc_id, FLATTEN(TOKENIZE(text, ' [](),.')) AS token;
tokenPipe = FILTER tokenPipe BY token MATCHES '\\w.*';

-- determine the word counts
tokenGroups = GROUP tokenPipe BY token;
wcPipe = FOREACH tokenGroups
  GENERATE group AS token, COUNT(tokenPipe) AS count;

-- output
STORE wcPipe INTO '$wcPath' USING PigStorage('\t', 'tagsource');
EXPLAIN -out dot/wc_pig.dot -dot wcPipe;
48. Apache Pig
pig.apache.org

[diagram: Word Count workflow]
pro:
‣ easy to learn data manipulation language (DML)
‣ interactive prompt (Grunt) makes it simple to prototype apps
‣ extensibility through UDFs
con:
‣ not a full programming language; must extend via UDFs outside of language
‣ app-level integration requires other coding, outside of script language
‣ simple problems are simple to do; hard problems become quite complex
‣ difficult to parameterize scripts externally; must rewrite to change taps!
‣ logical planner mixed with physical planner; cannot collect app stats
‣ non-deterministic exec: number of mappers+reducers changes unexpectedly
‣ business logic must cross multiple language boundaries: difficult to
troubleshoot, optimize, audit, handle exceptions, set notifications, etc.
49. Intro to Cascading
[diagram: Word Count workflow, as above]

Code Example #N: City of Palo Alto, etc.
50. extend: wc + scrub + stop words
[diagram: Document Collection → Scrub/Tokenize → HashJoin (Left) with Stop Word List (RHS) → GroupBy token → Count → Word Count]

1 mapper, 1 reducer, 28+10 lines of code
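The stop-words branch isn't listed as code in the deck; here is a minimal sketch, assuming the Cascading 2.x API and mirroring the approach used in the “Impatient” tutorial series — stopPath is illustrative, and docPipe/token come from Example #2:

import cascading.operation.regex.RegexFilter;
import cascading.pipe.Each;
import cascading.pipe.HashJoin;
import cascading.pipe.Pipe;
import cascading.pipe.joiner.LeftJoin;
import cascading.tuple.Fields;

// RHS: the stop word list, small enough to replicate to each mapper;
// remember to addSource( stopPipe, stopTap ) in the FlowDef
Fields stop = new Fields( "stop" );
Tap stopTap = new Hfs( new TextDelimited( stop, true, "\t" ), stopPath );
Pipe stopPipe = new Pipe( "stop" );

// left join tokens against stop words
Pipe tokenPipe = new HashJoin( docPipe, token, stopPipe, stop, new LeftJoin() );

// unmatched rows carry a null "stop" field – keep only those
tokenPipe = new Each( tokenPipe, stop, new RegexFilter( "^$" ) );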
51. extend: a simple search engine
[diagram: a simple search engine as a Cascading flow — the Word Count assembly (Scrub/Tokenize → HashJoin Left against a Stop Word List RHS) branches three ways: Unique doc_id → Insert 1 → SumBy doc_id (D); Unique token → CountBy token (DF); CountBy on doc_id, token (TF); a CoGroup joins D, DF, and TF, an ExprFunction computes tf-idf, and a final CountBy/Sort on token count yields TF-IDF scores plus Word Count]

10 mappers, 8 reducers, 68+14 lines of code
52. City of Palo Alto open data
[diagram: CoPA workflow — a CoPA GIS export plus curated metadata is parsed by Regex parsers/filters into tree, road, and park records, with failures routed to traps; tree records are scrubbed, joined (HashJoin Left) with Tree Metadata for species, geohashed, and checkpointed; a Tree Filter/GroupBy/CoGroup computes tree_dist, tree_name, and shade; road records join (HashJoin Left) with Road Metadata, get an Estimate Albedo and geohash, and CoGroup with Road Segments; GPS logs join by geohash against park records to produce the reco]
github.com/Cascading/CoPA/wiki
‣ GIS export for parks, roads, trees (unstructured / open data)
‣ log files of personalized/frequented locations in Palo Alto via iPhone GPS tracks
‣ curated metadata, used to enrich the dataset
‣ could extend via mash-up with many available public data APIs
Enterprise-scale app: road albedo + tree species metadata + geospatial indexing
“Find a shady spot on a summer day to walk near downtown and take a call…”
54. CoPA: results

[plot: density of Estimated Tree Height (meters) by avg_height, 0–50 m, shaded by sample count 0–300]
‣ addr: 115 HAWTHORNE AVE
‣ lat/lng: 37.446, -122.168
‣ geohash: 9q9jh0
‣ tree: 413 site 2
‣ species: Liquidambar styraciflua
‣ avg height 23 m
‣ road albedo: 0.12
‣ distance: 10 m
‣ a short walk from my train stop ✔
55. Intro to Cascading
[diagram: Word Count workflow, as above]

PMML: predictive modeling
57. cascading.pattern
example:
1. use customer order history as the training data set
2. train a risk classifier for orders, using Random Forest
3. export model from R to PMML
4. build a Cascading app to execute the PMML model (see the sketch after this list)
4.1. generate a pipeline from PMML description
4.2. planner builds the flow for a topology (Hadoop)
4.3. compile app to a JAR file
5. deploy the app at scale to calculate scores
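A hedged sketch of steps 4.1–4.3, based on the Cascading “Pattern” project's PMMLPlanner approach — treat the class and method names as assumptions about that API, and the taps/paths as illustrative:

import java.io.File;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pattern.pmml.PMMLPlanner;  // assumed package name

// 4.1: generate a pipe assembly from the PMML description
PMMLPlanner pmmlPlanner = new PMMLPlanner()
  .setPMMLInput( new File( pmmlPath ) )
  .retainOnlyActiveIncomingFields();

FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
  .addSource( "input", ordersTap )
  .addSink( "classify", scoresTap )
  .addAssemblyPlanner( pmmlPlanner );

// 4.2: the flow planner builds the flow for the Hadoop topology
// (4.3: packaging into a single JAR happens at build time)
new HadoopFlowConnector( props ).connect( flowDef ).complete();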
58. cascading.pattern
[diagram: cascading.pattern deployment — two risk classifiers, one on the customer-360 dimension and one per-order; Cascading apps handle training data prep, scoring customer transactions, predicting model costs, and scoring new orders; the analyst's laptop trains the model and exports PMML; batch workloads (Hadoop, DW, ETL, chargebacks, partner data) segment customers and detect fraudsters; real-time workloads (Customer DB, IMDG) compute velocity metrics and anomaly detection]
59. 1: “orders” data set… train/test in R… exported as PMML
60. R modeling
## train a RandomForest model
f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=50)

## test the model on the holdout test set
print(fit$importance)
print(fit)

predicted <- predict(fit, data)
data$predicted <- predicted
confuse <- table(pred = predicted, true = data[,1])
print(confuse)

## export predicted labels to TSV
write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
  quote=FALSE, sep="\t", row.names=FALSE)

## export RF model to PMML
saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))