The document describes how to build text analytics workflows in Cascading (word count, stop word filtering, TF-IDF), from ingesting documents to deployment on Amazon EMR. It shows the code required at each step, and how adding features such as testing and checkpoints costs only a few extra lines of code while the same workflow runs on datasets of any scale.
Cascading for the Impatient
1. Cascading for the Impatient
Paco Nathan
Concurrent, Inc.
pnathan@concurrentinc.com
@pacoid
[Workflow diagram: Document Collection, Tokenize, Scrub token, HashJoin Left (Stop Word List as RHS), Regex token, GroupBy token, Count, Word Count; stages labeled M and R]
Copyright ©2012, Concurrent, Inc.
3. how?
Cascading.org/
[Workflow diagram: word count with stop word filtering, same as on the title slide]
4. who?
• Business Stakeholder POV: business process management for workflow orchestration (think BPM/BPEL)
• Systems Integrator POV: system integration of heterogeneous data sources and compute platforms
• Data Scientist POV: a directed, acyclic graph (DAG) on which we can apply Amdahl's Law
• Data Architect POV: a physical plan for large-scale data flow management
• Software Architect POV: a pattern language, similar to plumbing or circuit design
• App Developer POV: API bindings for Scala, Clojure, Python, Ruby, Java
• Systems Engineer POV: a JAR file, has passed CI, available in a Maven repo
5. where?
• business process: Domain expertise, business trade-offs, operating parameters, etc.
• API language: Scala, Clojure, Python, Ruby, Java, etc. …envision whatever else runs in a JVM
• logical plan / optimize: (raw human intellect, unless…)
• physical plan: [workflow diagram: Document Collection, Tokenize, Scrub token, HashJoin Left (Stop Word List as RHS), Regex token, GroupBy token, Count, Word Count]
• compute framework: Apache Hadoop, in-memory local mode …envision GPUs, other frameworks, etc.
• “assembler” code
• notification: monitors, Nagios, etc.
6. 1: copy
[Diagram: Source → M → Sink]

import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class Main
  {
  public static void main( String[] args )
    {
    String inPath = args[ 0 ];
    String outPath = args[ 1 ];

    Properties props = new Properties();
    AppProps.setApplicationJarClass( props, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

    // create the source tap (tab-delimited text with a header line)
    Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

    // create the sink tap
    Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

    // specify a pipe to connect the taps
    Pipe copyPipe = new Pipe( "copy" );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
      .addSource( copyPipe, inTap )
      .addTailSink( copyPipe, outTap );

    // run the flow
    flowConnector.connect( flowDef ).complete();
    }
  }

1 mapper
0 reducers
10 lines code
7. wait!
ten lines of code for a file copy… seems like a lot.
8. same JAR, any scale…
MegaCorp Enterprise IT:
PBs of data
1000+ node cluster
EVP calls you when app fails
runtime: days+
Production Cluster:
TBs of data
EMR + 50 HPC Instances
Ops monitors results
runtime: hours – days
Staging Cluster:
GBs of data
EMR + 4 Spot Instances
CI shows red or green lights
runtime: minutes – hours
Your Mom’s Laptop:
MBs of data
Hadoop standalone mode
passes unit tests, or not
runtime: seconds – minutes
10. 3: wc + scrub
[Workflow diagram: Document Collection, Tokenize, Scrub token, GroupBy token, Count, Word Count; stages labeled M and R]
1 mapper
1 reducer
22+10 lines code
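To make the data flow on this slide concrete, here is a minimal plain-Java sketch of the same steps (tokenize, scrub, group by token, count) on an in-memory list. It does not use the Cascading API. The class name `WordCountSketch`, the delimiter set, and the scrub rule (trim plus lowercase) are assumptions for illustration, not taken from the deck.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

public class WordCountSketch {
    // Hypothetical delimiter set: spaces, brackets, parentheses, commas, periods.
    private static final Pattern DELIMITERS = Pattern.compile("[ \\[\\](),.]+");

    // "Scrub" step (assumed rule): trim whitespace and lowercase the token.
    static String scrub(String token) {
        return token.trim().toLowerCase(Locale.ROOT);
    }

    // "Tokenize" + "Scrub" + "GroupBy token" + "Count" in one pass.
    static Map<String, Long> wordCount(List<String> texts) {
        Map<String, Long> counts = new TreeMap<>();
        for (String text : texts) {
            for (String raw : DELIMITERS.split(text)) {
                String token = scrub(raw);
                if (token.isEmpty()) {
                    continue; // drop empty tokens left over from splitting
                }
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "A rain shadow is a dry area on the lee back side of a mountainous area.",
            "Two Women. Secrets. A Broken Land. [DVD Australia]");
        wordCount(docs).forEach((token, count) -> System.out.println(token + "\t" + count));
    }
}
```

In the Cascading version the grouping and counting would run on the reduce side; here everything happens in one loop, which is only reasonable at laptop scale.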
11. 4: wc + scrub + stop words
[Workflow diagram: Document Collection, Tokenize, Scrub token, HashJoin Left (Stop Word List as RHS), Regex token, GroupBy token, Count, Word Count; stages labeled M and R]
1 mapper
1 reducer
28+10 lines code
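The stop word step on this slide can be read as a left join of the token stream against the stop word list, followed by a filter. The plain-Java sketch below illustrates that reading under two assumptions: that the Regex step after the join keeps only tokens with no match on the right-hand side, and that the stop word list is small enough to hold in memory (which is what makes a hash join attractive). `StopWordSketch` and the sample stop list are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordSketch {
    // Left join each token against the stop list, then keep the unmatched rows.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            // Right-hand side of the left join: the matching stop word, or null.
            String rhs = stopWords.contains(token) ? token : null;
            // Filter step (assumed): pass the token only when the join found nothing.
            if (rhs == null) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical stop list; the deck does not show the actual list.
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "is", "on", "the", "of"));
        List<String> tokens = Arrays.asList("a", "rain", "shadow", "is", "a", "dry", "area");
        System.out.println(removeStopWords(tokens, stopWords));
    }
}
```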
12. 5: tf-idf
[Workflow diagram: Document Collection, Tokenize, Scrub token, HashJoin Left (Stop Word List as RHS), Regex token, then four branches: D (Unique doc_id, Insert 1, SumBy doc_id), DF (Unique token, GroupBy token, Count), TF (GroupBy doc_id, token, Count), and Word Count (GroupBy token, Count). The D, DF, and TF branches are combined through HashJoin (RHS) and CoGroup, and ExprFunc tf-idf produces TF-IDF; stages labeled M and R]
11 mappers
9 reducers
65+10 lines code
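As a rough illustration of what the D, DF, and TF branches compute before they are joined, here is a plain-Java sketch on in-memory data. It is not the Cascading code. The weighting `tf * ln(D / (1 + df))` is an assumption; the deck only names an ExprFunc step, although this formula is consistent with the values on the results slide. `TfIdfSketch` and the sample token lists are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class TfIdfSketch {
    // Input: doc_id -> tokens that are already scrubbed and stop-word filtered.
    static Map<String, Map<String, Double>> tfIdf(Map<String, List<String>> docs) {
        int nDocs = docs.size();                                // D branch
        Map<String, Map<String, Integer>> tf = new TreeMap<>(); // TF branch
        Map<String, Integer> df = new HashMap<>();              // DF branch

        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            Map<String, Integer> counts = new TreeMap<>();
            for (String token : doc.getValue()) {
                counts.merge(token, 1, Integer::sum);           // count per (doc_id, token)
            }
            tf.put(doc.getKey(), counts);
            for (String token : counts.keySet()) {
                df.merge(token, 1, Integer::sum);               // one per document containing token
            }
        }

        // Join the three branches and apply the (assumed) weighting expression.
        Map<String, Map<String, Double>> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> doc : tf.entrySet()) {
            Map<String, Double> scores = new TreeMap<>();
            for (Map.Entry<String, Integer> t : doc.getValue().entrySet()) {
                double idf = Math.log((double) nDocs / (1.0 + df.get(t.getKey())));
                scores.put(t.getKey(), t.getValue() * idf);
            }
            result.put(doc.getKey(), scores);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical, heavily reduced token lists for five documents.
        Map<String, List<String>> docs = new LinkedHashMap<>();
        docs.put("doc01", Arrays.asList("rain", "shadow", "dry", "area", "area"));
        docs.put("doc02", Arrays.asList("rain", "area"));
        docs.put("doc03", Arrays.asList("rain", "area"));
        docs.put("doc04", Arrays.asList("rain"));
        docs.put("doc05", Arrays.asList("land"));
        tfIdf(docs).forEach((docId, scores) -> scores.forEach((token, score) ->
            System.out.println(String.format(Locale.ROOT, "%s\t%.4f\t%s", docId, score, token))));
    }
}
```

The three intermediate maps correspond to the three branches in the diagram; in the distributed version each becomes its own group-and-count stage, which is roughly why the mapper and reducer counts jump on this slide.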
13. 6: tf-idf + tdd
[Workflow diagram: the TF-IDF workflow from the previous slide (Document Collection, Tokenize, Scrub token, HashJoin Left with Stop Word List as RHS, Regex token, the D, DF, TF, and Word Count branches, HashJoin, CoGroup, ExprFunc tf-idf, TF-IDF), with three additions: an Assert on the input, a Checkpoint, and Failure Traps; stages labeled M and R]
12 mappers
9 reducers
76+14 lines code
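One way to picture the Assert and Failure Traps additions: records that fail a sanity check are diverted to a side output instead of failing the whole run. The plain-Java sketch below illustrates that idea only; it is not Cascading's trap mechanism. The validity rule (a doc_id of the form `doc` plus digits, and a non-null text field) and the names `TrapSketch` and `Result` are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class TrapSketch {
    // Hypothetical assertion: a valid doc_id is "doc" followed by digits.
    private static final Pattern VALID_ID = Pattern.compile("doc\\d+");

    // Holds the rows that continue down the flow and the rows diverted to the trap.
    static final class Result {
        final List<String[]> passed = new ArrayList<>();
        final List<String[]> trapped = new ArrayList<>();
    }

    // Each row is {doc_id, text}; bad rows go to the trap instead of raising an error.
    static Result validate(List<String[]> rows) {
        Result result = new Result();
        for (String[] row : rows) {
            boolean ok = row.length == 2
                && row[0] != null
                && VALID_ID.matcher(row[0]).matches()
                && row[1] != null;
            (ok ? result.passed : result.trapped).add(row);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"doc01", "A rain shadow is a dry area."},
            new String[] {"zoink", null});
        Result result = validate(rows);
        System.out.println("passed: " + result.passed.size()
            + ", trapped: " + result.trapped.size());
    }
}
```

Keeping the trapped rows, rather than dropping them, leaves something to inspect after the run.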
15. results?
doc_id text
doc01 A rain shadow is a dry area on the lee back side of a mountainous area.
doc02 This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03 A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04 This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05 Two Women. Secrets. A Broken Land. [DVD Australia]
zoink null

doc_id tf-idf token
doc02 0.9163 air
doc05 0.9163 australia
doc05 0.9163 broken
doc04 0.9163 california's
doc04 0.9163 cause
doc02 0.9163 cloudcover
doc04 0.9163 death
doc04 0.9163 deserts
doc03 0.9163 downwind
…
doc02 0.9163 sinking
doc04 0.9163 such
doc04 0.9163 valley
doc05 0.9163 women
doc03 0.5108 land
doc05 0.5108 land
doc01 0.5108 lee
doc02 0.5108 lee
doc03 0.5108 leeward
doc04 0.5108 leeward
doc01 0.4463 area
doc02 0.2231 area
doc03 0.2231 area
doc01 0.2231 dry
doc02 0.2231 dry
doc03 0.2231 dry
doc02 0.2231 mountain
doc03 0.2231 mountain
doc04 0.2231 mountain
doc01 0.0000 rain
doc02 0.0000 rain
doc03 0.0000 rain
doc04 0.0000 rain
doc01 0.0000 shadow
doc02 0.0000 shadow
doc03 0.0000 shadow
doc04 0.0000 shadow
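The slide does not state the weighting formula. The values above are, however, consistent with the following form, assuming D = 5 documents are counted (doc01 through doc05) and natural logarithms; the symbols tf, df, and D are labels chosen here for the term count, document frequency, and document total:

```latex
\[
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \ln\frac{D}{1 + \mathrm{df}(t)}, \qquad D = 5
\]
\begin{align*}
\text{air, doc02 } (\mathrm{tf}=1,\ \mathrm{df}=1): &\quad 1 \cdot \ln(5/2) \approx 0.9163 \\
\text{land, doc03 } (\mathrm{tf}=1,\ \mathrm{df}=2): &\quad 1 \cdot \ln(5/3) \approx 0.5108 \\
\text{area, doc01 } (\mathrm{tf}=2,\ \mathrm{df}=3): &\quad 2 \cdot \ln(5/4) \approx 0.4463 \\
\text{rain, doc01 } (\mathrm{tf}=1,\ \mathrm{df}=4): &\quad 1 \cdot \ln(5/5) = 0
\end{align*}
```

Under this reading, "rain" and "shadow" score zero because they appear in four of the five documents, so they carry no weight for telling documents apart.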
16. comparisons?
compare similar code in Scalding and Cascalog:
• Scalding: sujitpal.blogspot.com/2012/08/scalding-for-impatient.html (based on: github.com/twitter/scalding/wiki)
• Cascalog: github.com/Quantisan/Impatient (based on: github.com/nathanmarz/cascalog/wiki)