Cascading 
or, “was it worth three days 
out of the office?”
Agenda 
What is Cascading? 
Building cascades and flows 
How does this fit our needs? 
Advantages/disadvantages 
Q&A
What is Cascading anyway?
Cascading 101 
JVM framework and SDK for creating abstracted data 
flows 
Translates data flows into actual Hadoop/RDBMS/local 
jobs
Huh? 
Okay, let’s back up a bit.
Data flows 
Think of an ETL: Extract-Transform-Load 
In simple terms, take data from a source, change it 
somehow, and stick the result into something (a “sink”) 
[Diagram: Data source -> Extract -> Transformation(s) -> Load -> Data sink]
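To make the source/transform/sink idea concrete, here is a minimal sketch in plain Java. It is not Cascading code: the class `EtlSketch` and its method names are hypothetical, and the "source" and "sink" are just in-memory lists standing in for real systems.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical illustration only: none of these names come from Cascading.
final class EtlSketch {

    // One data flow: extract records from the source, transform each one,
    // and load the result into the sink.
    static <I, O> void runFlow(Supplier<List<I>> source,
                               Function<I, O> transformation,
                               Consumer<O> sink) {
        for (I record : source.get()) {
            sink.accept(transformation.apply(record));
        }
    }

    // Wires an in-memory source and sink together and returns what was loaded.
    static List<String> demo() {
        Supplier<List<String>> source = () -> Arrays.asList("doc01", "doc02");
        Function<String, String> transformation = s -> s.toUpperCase();
        List<String> loaded = new ArrayList<>();
        Consumer<String> sink = loaded::add;
        runFlow(source, transformation, sink);
        return loaded;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [DOC01, DOC02]
    }
}
```

The point of the shape: `runFlow` never needs to know where the records come from or where they end up, which is the separation the rest of the deck is about.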
Data flow implementation 
Pretty much everything we do is some flavor of this 
Sources: Games, Hadoop, Hive/MySQL, Couchbase, 
web service 
Transformations: Aggregations, group-bys, combined 
fields, filtering, etc. 
Sinks: Hadoop, Hive/MySQL, Couchbase
Cascading 101 (Part Deux) 
JVM data flow framework 
Models data flows as abstractions: 
Separates details of where and how we get data 
from what we do with it 
Implements transform operations as SQL or 
MapReduce or whatever
In other words… 
An ETL framework. 
A Pentaho we can program.
Building cascades 
and flows
Cascading terminology 
Flow: A path for data with some number of inputs, 
some operations, and some outputs 
Cascade: A series of connected flows
More terminology 
Operation: A function applied to data, yielding new 
data 
Pipe: Moves data from someplace to some other place 
Tap: Feeds data from outside the flow into it 
and writes data from inside the flow out of it
Simplest possible flow 
// create the source tap 
Tap inTap = new Hfs(new TextDelimited(true, "\t"), inPath); 

// create the sink tap 
Tap outTap = new Hfs(new TextDelimited(true, "\t"), outPath); 

// specify a pipe to connect the taps 
Pipe copyPipe = new Pipe("copy"); 

// connect the taps, pipes, etc., into a flow 
FlowDef flowDef = FlowDef.flowDef() 
    .addSource(copyPipe, inTap) 
    .addTailSink(copyPipe, outTap); 

// run the flow 
flowConnector.connect(flowDef).complete();
We already 
have that. 
It’s called ‘cp’.
Actually… 
Runs entirely in the cluster 
Works fine on megabytes, gigabytes, terabytes or 
petabytes; i.e., IT SCALES 
Completely testable outside of the cluster 
Who gets shell access to a namenode to run the bash 
or python equivalent?
Reliability is 
ESSENTIAL 
if we, and our system, are to 
be taken srsly. 
Reliability is a feature, 
not a goal.
Let’s do something more 
interesting.
Real world use case: 
Word counting 
Read a simple file format 
Count the occurrence of every word in the file 
Output a list of all words and their counts
doc_id text 
doc01 A rain shadow is a dry area on the lee back side 
doc02 This sinking, dry air produces a rain shadow, or 
doc03 A rain shadow is an area of dry land that lies on 
doc04 This is known as the rain shadow effect and is the 
doc05 Two Women. Secrets. A Broken Land. [DVD Australia] 
Newline-delimited entries 
ID and text fields, separated by tabs 
Plan: Split each line into words, then count each word across all lines
Flow I/O 
Tap docTap = new Hfs(new TextDelimited(true, "\t"), docPath); 
Tap wcTap = new Hfs(new TextDelimited(true, "\t"), wcPath); 
No surprises here: 
docTap reads a file from HDFS 
wcTap will write the results to a different HDFS file
File parsing 
Fields token = new Fields("token"); 
Fields text = new Fields("text"); 
RegexSplitGenerator splitter = 
    new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]"); 
Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS); 
Fields are names for the tuple elements 
RegexSplitGenerator splits the input wherever the regex matches and 
emits each piece as a tuple under the “token” field 
docPipe applies the splitter to the “text” field of each input tuple 
and outputs only the resulting tokens
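To see what that delimiter class does to a line of text, here is a small plain-Java sketch using `java.util.regex.Pattern` with the same character class (space, square brackets, parentheses, comma, period). The class name `SplitSketch` is made up for illustration, and this shows JDK splitting behavior only, not Cascading's internals.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only: plain JDK regex splitting, not Cascading.
final class SplitSketch {

    // Same character class as the slide: space, [ ] ( ) , .
    static final Pattern DELIMITERS = Pattern.compile("[ \\[\\]\\(\\),.]");

    // Split the text at every delimiter character and return the pieces.
    static List<String> tokens(String text) {
        return Arrays.asList(DELIMITERS.split(text));
    }

    public static void main(String[] args) {
        // Two adjacent delimiters (", ") leave an empty token between them.
        System.out.println(tokens("A rain shadow, or")); // prints [A, rain, shadow, , or]
    }
}
```

Note the empty token where a comma is followed by a space. With this kind of splitting, a word count will include an entry for the empty string unless the tokens are filtered afterwards.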
Count the tokens (words) 
Pipe wcPipe = new Pipe("wc", docPipe); 
wcPipe = new GroupBy(wcPipe, token); 
wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL); 
wcPipe connects to docPipe, using it for input 
Fit a GroupBy onto wcPipe, grouping by the 
token field (the actual words) 
For every group in wcPipe (every distinct word), count 
its tuples and output the result
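The group-then-count step has a direct analogue in plain Java streams, which may help if the pipe vocabulary is new. This is a sketch of the same idea, not what Cascading runs; `GroupCountSketch` and `countTokens` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration only: the GroupBy + Count idea with JDK streams.
final class GroupCountSketch {

    // Group identical tokens together, then count the members of each group.
    // A TreeMap keeps the output sorted by token.
    static Map<String, Long> countTokens(List<String> tokens) {
        return tokens.stream()
                .collect(Collectors.groupingBy(t -> t, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> tokens =
                Arrays.asList("A", "rain", "shadow", "is", "a", "dry", "area", "rain");
        // prints {A=1, a=1, area=1, dry=1, is=1, rain=2, shadow=1}
        System.out.println(countTokens(tokens));
    }
}
```

"A" and "a" are counted separately here because nothing lowercases the tokens first; the flow on the slide has no such step either.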
Create and run the flow 
FlowDef flowDef = FlowDef.flowDef() 
    .setName("wc") 
    .addSource(docPipe, docTap) 
    .addTailSink(wcPipe, wcTap); 
Flow wcFlow = flowConnector.connect(flowDef); 
wcFlow.complete(); 
Define a new flow with name “wc” 
Feed the docTap (the original text file) into the 
docPipe 
Write the tail of wcPipe (the output word counts) to the 
wcTap 
Connect the flow with the flowConnector (Hadoop) and go!
Cascading flow 
100% Java 
Databases and processing 
are behind class 
abstractions 
Automatically scalable 
Easily testable
How could this help us?
Testing 
Create flows entirely in code on a local machine 
Write tests for controlled sample data sets 
Run tests as regular old Java without needing access 
to actual Hadoopery or databases 
Local machine and CI testing are easy!
Reusability 
Pipe assemblies are designed for reuse 
Once created and tested, use them in other flows 
Write logic to do something only once 
This is *essential* for data integrity as well as 
good programming
Common code base 
Infrastructure writes MR-type jobs in Cascading, 
warehouse writes data manipulations in Cascading 
Everybody uses the same terms and same tech 
Teams understand each other’s code 
Can be modified by anyone, not just tool experts
Simpler stack 
Cascading creates a DAG of dependent jobs for us 
Removes most of the need for Oozie (ew) 
Keeps track of where a flow fails and can rerun from 
that point on failure
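For anyone unsure what "a DAG of dependent jobs" buys us, the sketch below works out a valid run order from a dependency map using a standard topological sort (Kahn's algorithm). It is an assumption-laden illustration, not how Cascading plans jobs: the class `JobDagSketch` and the job names are invented.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: ordering dependent jobs with a topological sort.
final class JobDagSketch {

    // dependsOn maps each job to the jobs that must finish before it starts.
    // Every job must appear as a key, and dependency lists must not repeat a job.
    static List<String> runOrder(Map<String, List<String>> dependsOn) {
        Map<String, Integer> remaining = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : dependsOn.entrySet()) {
            remaining.put(e.getKey(), e.getValue().size());
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String done = ready.remove();
            order.add(done);
            // Each job waiting on the finished one has one fewer dependency left.
            for (Map.Entry<String, List<String>> e : dependsOn.entrySet()) {
                if (e.getValue().contains(done)
                        && remaining.merge(e.getKey(), -1, Integer::sum) == 0) {
                    ready.add(e.getKey());
                }
            }
        }
        if (order.size() != dependsOn.size()) {
            throw new IllegalStateException("cycle or unknown dependency in job graph");
        }
        return order;
    }

    // Made-up job graph: two jobs depend on an import, a report depends on both.
    static Map<String, List<String>> demoJobs() {
        Map<String, List<String>> jobs = new LinkedHashMap<>();
        jobs.put("report", Arrays.asList("wordcount", "sessions"));
        jobs.put("wordcount", Arrays.asList("import"));
        jobs.put("sessions", Arrays.asList("import"));
        jobs.put("import", Collections.<String>emptyList());
        return jobs;
    }

    public static void main(String[] args) {
        System.out.println(runOrder(demoJobs())); // prints [import, wordcount, sessions, report]
    }
}
```

Today this ordering is something we spell out by hand in workflow configuration; the appeal of the framework is that it derives the order from how the flows connect.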
Disadvantages 
“silver bullets are not a thing”
Some bad news 
JVM, which means Java (or Scala (or CLOJURE :) :) 
Argument: Java is the platform for big data, so we 
can’t avoid embracing it. 
PyCascading uses Jython, which kinda sucks
Some other bad news 
Doesn’t have a job scheduler 
Can figure out the dependency graph for jobs, but 
has nothing to run them on a regular interval 
We still need Jenkins or Quartz 
Concurrent is doing proprietary products (read: $) 
for this kind of thing, but they’re months away
Other bad news 
No real built-in monitoring 
Easy to have a flow report what it has done; 
hard to watch it in progress 
We’d have to roll our own (but we’d have to do that 
anyway, so whatevs)
Recommendations 
“Enough already!”
Yes, we should try it. 
It’s not everything we need, but it’s a lot 
Possibly replace MapReduce and Sqoop 
Proven tech; this isn’t bleeding edge work 
We need an ETL framework and we don’t have time 
to write one from scratch.
Let’s prototype a couple of jobs and 
see what people other than me think.
Questions? 
Satisfactory answers 
not guaranteed.
