Cascading for the Impatient

Paco Nathan
Concurrent, Inc.
pnathan@concurrentinc.com
@pacoid

Copyright @2012, Concurrent, Inc.

why?

Unstructured Data
meets
Enterprise Scale

how?

Cascading.org/
[Figure: flow diagram with Document Collection, Tokenize, Scrub token, HashJoin Left (Stop Word List on the RHS), Regex token, GroupBy token, Count, and Word Count; M marks the map phase and R the reduce phase]

who?
• Business Stakeholder POV: business process management for workflow orchestration (think BPM/BPEL)

• Systems Integrator POV: system integration of heterogeneous data sources and compute platforms

• Data Scientist POV: a directed, acyclic graph (DAG) on which we can apply Amdahl's Law

• Data Architect POV: a physical plan for large-scale data flow management

• Software Architect POV: a pattern language, similar to plumbing or circuit design

• App Developer POV: API bindings for Scala, Clojure, Python, Ruby, Java

• Systems Engineer POV: a JAR file, has passed CI, available in a Maven repo

where?
business process: Domain expertise, business trade-offs, operating parameters, etc.

API language: Scala, Clojure, Python, Ruby, Java, etc. …envision whatever else runs in a JVM

logical plan / optimize: (raw human intellect, unless…)

physical plan: [Figure: flow diagram with Document Collection, Tokenize, Scrub token, HashJoin Left (Stop Word List on the RHS), Regex token, GroupBy token, Count, and Word Count]

compute framework: Apache Hadoop, in-memory local mode …envision GPUs, other frameworks, etc.

“assembler” code

notification: monitors, Nagios, etc.

1: copy
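import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class
  Main
  {
  public static void
  main( String[] args )
    {
    // input and output paths come from the command line
    String inPath = args[ 0 ];
    String outPath = args[ 1 ];

    Properties props = new Properties();
    AppProps.setApplicationJarClass( props, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

    // create the source tap (tab-delimited text with a header line)
    Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

    // create the sink tap
    Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

    // specify a pipe to connect the taps
    Pipe copyPipe = new Pipe( "copy" );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
     .addSource( copyPipe, inTap )
     .addTailSink( copyPipe, outTap );

    // run the flow
    flowConnector.connect( flowDef ).complete();
    }
  }

[Figure: flow diagram with Source, a single map phase (M), and Sink]

1 mapper
0 reducers
10 lines code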

wait!

ten lines of code
for a file copy …
seems like a lot.

same JAR, any scale…
MegaCorp Enterprise IT:
PBs of data
1000+ node cluster
EVP calls you when app fails
runtime: days+

Production Cluster:
TBs of data
EMR + 50 HPC Instances
Ops monitors results
runtime: hours – days

Staging Cluster:
GBs of data
EMR + 4 Spot Instances
CI shows red or green lights
runtime: minutes – hours

Your Mom’s Laptop:
MBs of data
Hadoop standalone mode
passes unit tests, or not
runtime: seconds – minutes

2: word count

[Figure: flow diagram with Document Collection, Tokenize, GroupBy token, Count, and Word Count; M marks the map phase and R the reduce phase]

1 mapper
1 reducer
18 lines code
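
The Tokenize, GroupBy token, and Count steps named on this slide can be sketched in plain Java. This is only an illustration of the data flow, not the Cascading API: the class name, the delimiter set, and the choice to drop empty tokens are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java analogue of Tokenize -> GroupBy token -> Count.
// NOT the Cascading API; every name here is hypothetical.
class WordCountSketch {
    // Assumed delimiters: space, square brackets, parentheses, comma, period.
    static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    static Map<String, Integer> count(List<String> docs) {
        // TreeMap keeps the tokens sorted, like the output of a grouping step.
        Map<String, Integer> counts = new TreeMap<>();
        for (String text : docs) {
            // "Tokenize": split each document's text on the delimiters.
            for (String token : text.split(DELIMITERS)) {
                if (token.isEmpty()) {
                    continue; // adjacent delimiters yield empty strings; skip them
                }
                // "GroupBy token" + "Count": one running total per distinct token.
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "A rain shadow is a dry area.",
            "dry air, less rain");
        count(docs).forEach((token, n) -> System.out.println(token + "\t" + n));
    }
}
```

Note that "A" and "a" are counted separately here, which is the gap the scrubbing step on the next slide closes.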

3: wc + scrub

[Figure: flow diagram with Document Collection, Tokenize, Scrub token, GroupBy token, Count, and Word Count; M marks the map phase and R the reduce phase]

1 mapper
1 reducer
22+10 lines code
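
A minimal sketch of what a Scrub step might do, assuming it trims whitespace, lower-cases each token, and discards tokens that end up empty. The lower-casing is consistent with the tokens on the results slide (for example "california's" and "australia"), but the exact rules, the class name, and the method name are assumptions rather than the deck's code.

```java
import java.util.Locale;
import java.util.Optional;

// Plain-Java analogue of a "Scrub token" step.
// NOT the Cascading API; the scrubbing rules are assumed.
class ScrubSketch {
    // Returns the cleaned token, or empty if nothing is left after cleaning.
    static Optional<String> scrub(String token) {
        String cleaned = token.trim().toLowerCase(Locale.ROOT);
        return cleaned.isEmpty() ? Optional.empty() : Optional.of(cleaned);
    }

    public static void main(String[] args) {
        for (String raw : new String[] { " Rain ", "California's", "   " }) {
            System.out.println("[" + raw + "] -> " + scrub(raw).orElse("(dropped)"));
        }
    }
}
```

Applying this before the grouping step means "Rain" and "rain" land in the same group.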

4: wc + scrub + stop words

[Figure: flow diagram with Document Collection, Tokenize, Scrub token, HashJoin Left (Stop Word List on the RHS), Regex token, GroupBy token, Count, and Word Count; M marks the map phase and R the reduce phase]

1 mapper
1 reducer
28+10 lines code
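
One plausible reading of the HashJoin Left plus Regex filter on this slide: left-join each token against the stop word list, then keep only the tokens that found no match. The plain-Java sketch below illustrates that idea; it is not the Cascading API, and the class name, method name, and stop words are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java analogue of "HashJoin Left" against a stop word list,
// followed by a filter. NOT the Cascading API; names are hypothetical.
class StopWordJoinSketch {
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            // Left join: every token survives the join; the right-hand side
            // is null when the stop word list has no matching entry.
            String stop = stopWords.contains(token) ? token : null;
            // Filter: keep only the rows where the right-hand side is null.
            if (stop == null) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "is", "the")); // assumed list
        List<String> tokens = Arrays.asList("a", "rain", "shadow", "is", "a", "dry", "area");
        System.out.println(removeStopWords(tokens, stopWords));
    }
}
```

Holding the small stop word list in memory on the right-hand side is what lets a join like this run without an extra reduce phase, which matches the unchanged mapper and reducer counts on the slide.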

5: tf-idf

[Figure: flow diagram. Document Collection feeds Tokenize, Scrub token, HashJoin Left (Stop Word List on the RHS), and Regex token. Separate branches compute D (Unique doc_id, Insert 1, SumBy doc_id), TF (GroupBy doc_id, token; Count), and DF (Unique token; GroupBy token; Count). HashJoin (RHS), CoGroup, and ExprFunc tf-idf combine them into TF-IDF; GroupBy token and Count produce Word Count. M and R mark the map and reduce phases.]

11 mappers
9 reducers
65+10 lines code
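
The TF, DF, and D branches can be combined as in the plain-Java sketch below. The weighting used, tf × ln(D / (1 + df)), is an assumption: it is chosen because it reproduces the scores on the results slide with D = 5 (for example, a token that occurs twice in one document and appears in 3 documents scores 2 × ln(5/4) ≈ 0.4463). The class and method names are hypothetical and this is not the Cascading API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java analogue of the TF, DF, and D branches plus the final expression.
// NOT the Cascading API; the formula is an assumption that matches the
// scores shown on the results slide.
class TfIdfSketch {
    // tf: count of the token in one document
    // df: number of documents containing the token
    // nDocs: total number of documents (D)
    static double tfidf(int tf, int df, int nDocs) {
        return tf * Math.log((double) nDocs / (1.0 + df));
    }

    // docs maps doc_id -> tokens (already scrubbed, stop words removed).
    // Returns doc_id -> (token -> tf-idf score).
    static Map<String, Map<String, Double>> score(Map<String, List<String>> docs) {
        Map<String, Map<String, Integer>> tf = new TreeMap<>();
        Map<String, Integer> df = new HashMap<>();
        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            // TF branch: count tokens per (doc_id, token).
            Map<String, Integer> counts = new TreeMap<>();
            for (String token : doc.getValue()) {
                counts.merge(token, 1, Integer::sum);
            }
            tf.put(doc.getKey(), counts);
            // DF branch: each distinct token counts once per document.
            for (String token : counts.keySet()) {
                df.merge(token, 1, Integer::sum);
            }
        }
        int nDocs = docs.size(); // D branch: number of distinct doc_ids
        Map<String, Map<String, Double>> scores = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> doc : tf.entrySet()) {
            Map<String, Double> row = new TreeMap<>();
            for (Map.Entry<String, Integer> e : doc.getValue().entrySet()) {
                row.put(e.getKey(), tfidf(e.getValue(), df.get(e.getKey()), nDocs));
            }
            scores.put(doc.getKey(), row);
        }
        return scores;
    }
}
```

With this weighting, a token present in all but one of the documents scores exactly zero, which is why very common tokens sink to the bottom of a ranked list.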

6: tf-idf + tdd

[Figure: the tf-idf flow diagram from part 5 with three additions: an Assert on the Document Collection input, Failure Traps, and a Checkpoint. The remaining elements are unchanged: Tokenize, Scrub token, HashJoin Left (Stop Word List on the RHS), Regex token, the D, TF, and DF branches, HashJoin, CoGroup, ExprFunc tf-idf, TF-IDF, and Word Count. M and R mark the map and reduce phases.]

12 mappers
9 reducers
76+14 lines code
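
The Assert and Failure Traps elements on this slide suggest that rows failing a check are diverted to a side output instead of stopping the run. The plain-Java sketch below illustrates that routing under stated assumptions: the validity rule (a doc_id shaped like "doc01" and non-null text) is hypothetical, as are the class and method names, and none of this is the Cascading API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java analogue of an assertion with a failure trap.
// NOT the Cascading API; the validity rule below is an assumption.
class TrapSketch {
    // Hypothetical rule: doc_id must look like "doc" followed by digits.
    static final Pattern DOC_ID = Pattern.compile("doc\\d+");

    final List<String[]> passed = new ArrayList<>();
    final List<String[]> trapped = new ArrayList<>();

    static boolean isValid(String[] row) {
        return row.length == 2
            && row[0] != null
            && row[1] != null
            && DOC_ID.matcher(row[0]).matches();
    }

    // Route each (doc_id, text) row to the main output or to the trap.
    static TrapSketch route(List<String[]> rows) {
        TrapSketch result = new TrapSketch();
        for (String[] row : rows) {
            (isValid(row) ? result.passed : result.trapped).add(row);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] { "doc01", "A rain shadow is a dry area." },
            new String[] { "zoink", null });
        TrapSketch result = route(rows);
        System.out.println("passed=" + result.passed.size()
            + " trapped=" + result.trapped.size());
    }
}
```

A row such as "zoink null" would land in the trap output, where it can be inspected after the run while the valid documents continue through the flow.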

deployed…

elastic-mapreduce --create --name "TF-IDF" \
  --jar s3n://temp.cascading.org/impatient/part6.jar \
  --arg s3n://temp.cascading.org/impatient/rain.txt \
  --arg s3n://temp.cascading.org/impatient/out/wc \
  --arg s3n://temp.cascading.org/impatient/en.stop \
  --arg s3n://temp.cascading.org/impatient/out/tfidf \
  --arg s3n://temp.cascading.org/impatient/out/trap \
  --arg s3n://temp.cascading.org/impatient/out/check

results?

input:

doc_id  text
doc01   A rain shadow is a dry area on the lee back side of a mountainous area.
doc02   This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03   A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04   This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05   Two Women. Secrets. A Broken Land. [DVD Australia]
zoink   null

output:

doc_id  tf-idf  token
doc02   0.9163  air
doc05   0.9163  australia
doc05   0.9163  broken
doc04   0.9163  california's
doc04   0.9163  cause
doc02   0.9163  cloudcover
doc04   0.9163  death
doc04   0.9163  deserts
doc03   0.9163  downwind
…
doc02   0.9163  sinking
doc04   0.9163  such
doc04   0.9163  valley
doc05   0.9163  women
doc03   0.5108  land
doc05   0.5108  land
doc01   0.5108  lee
doc02   0.5108  lee
doc03   0.5108  leeward
doc04   0.5108  leeward
doc01   0.4463  area
doc02   0.2231  area
doc03   0.2231  area
doc01   0.2231  dry
doc02   0.2231  dry
doc03   0.2231  dry
doc02   0.2231  mountain
doc01   0.0000  rain
doc02   0.0000  rain
doc03   0.0000  rain
doc04   0.0000  rain
doc01   0.0000  shadow
doc02   0.0000  shadow
doc03   0.0000  shadow
doc04   0.0000  shadow

comparisons?

compare similar code in Scalding and Cascalog:

sujitpal.blogspot.com/2012/08/scalding-for-impatient.html
based on: github.com/twitter/scalding/wiki

github.com/Quantisan/Impatient
based on: github.com/nathanmarz/cascalog/wiki

drill-down?

blog, code, wiki, gists, jars, list, DevOps products:

cascading.org/category/impatient/
github.com/Cascading/
conjars.org/
goo.gl/KQtUL
concurrentinc.com/
