Cascading for the Impatient
Paco Nathan
Concurrent, Inc.
pnathan@concurrentinc.com
@pacoid

[Figure: Word Count flow diagram. Document Collection -> Tokenize -> Scrub token -> HashJoin Left (RHS: Stop Word List) -> Regex token -> GroupBy token -> Count -> Word Count; M and R mark the map and reduce phases]

Copyright @2012, Concurrent, Inc.
why?



 Unstructured Data
   meets
  Enterprise Scale
how?


 Cascading.org/
 [Figure: Word Count flow diagram. Document Collection -> Tokenize -> Scrub token -> HashJoin Left (RHS: Stop Word List) -> Regex token -> GroupBy token -> Count -> Word Count; M and R mark the map and reduce phases]
who?
 • Business Stakeholder POV:
   business process management for workflow orchestration (think BPM/BPEL)

 • Systems Integrator POV:
   system integration of heterogeneous data sources and compute platforms

 • Data Scientist POV:
   a directed, acyclic graph (DAG) on which we can apply Amdahl's Law

 • Data Architect POV:
   a physical plan for large-scale data flow management

 • Software Architect POV:
   a pattern language, similar to plumbing or circuit design

 • App Developer POV:
   API bindings for Scala, Clojure, Python, Ruby, Java

 • Systems Engineer POV:
   a JAR file, has passed CI, available in a Maven repo

 [Figure: Word Count flow diagram]
where?
  business      Domain expertise, business trade-offs,
  process       operating parameters, etc.

     API        Scala, Clojure, Python, Ruby, Java, etc.
  language      …envision whatever else runs in a JVM

 logical plan   (raw human intellect, unless…)
  / optimize

  physical      [Figure: Word Count flow diagram. Document Collection -> Tokenize -> Scrub token -> HashJoin Left (RHS: Stop Word List) -> Regex token -> GroupBy token -> Count -> Word Count]
    plan

  compute       Apache Hadoop, in-memory local mode
 framework      …envision GPUs, other frameworks, etc.
                “assembler” code

  monitors,     Nagios, etc.
 notification
1: copy
[Figure: flow diagram. Source -> M -> Sink]

  import java.util.Properties;

  import cascading.flow.FlowDef;
  import cascading.flow.hadoop.HadoopFlowConnector;
  import cascading.pipe.Pipe;
  import cascading.property.AppProps;
  import cascading.scheme.hadoop.TextDelimited;
  import cascading.tap.Tap;
  import cascading.tap.hadoop.Hfs;

  public class
    Main
    {
    public static void
    main( String[] args )
      {
      String inPath = args[ 0 ];
      String outPath = args[ 1 ];

      Properties props = new Properties();
      AppProps.setApplicationJarClass( props, Main.class );
      HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

      // create the source tap (tab-delimited text with a header row)
      Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

      // create the sink tap
      Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

      // specify a pipe to connect the taps
      Pipe copyPipe = new Pipe( "copy" );

      // connect the taps, pipes, etc., into a flow
      FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
       .addSource( copyPipe, inTap )
       .addTailSink( copyPipe, outTap );

      // run the flow
      flowConnector.connect( flowDef ).complete();
      }
    }

1 mapper
0 reducers
10 lines code
wait!



  ten lines of code
  for a file copy …
  seems like a lot.
same JAR, any scale…
                                              MegaCorp Enterprise IT:
                                              Pb’s data
                                              1000+ node cluster
                                              EVP calls you when app fails
                                              runtime: days+

                              Production Cluster:
                              Tb’s data
                              EMR + 50 HPC Instances
                              Ops monitors results
                              runtime: hours – days

               Staging Cluster:
               Gb’s data
               EMR + 4 Spot Instances
               CI shows red or green lights
               runtime: minutes – hours

 Your Mom’s Laptop:
 Mb’s data
 Hadoop standalone mode
 passes unit tests, or not
 runtime: seconds – minutes
2: word count


[Figure: flow diagram. Document Collection -> Tokenize -> GroupBy token -> Count -> Word Count; M and R mark the map and reduce phases]

1 mapper
1 reducer
18 lines code
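
The Tokenize -> GroupBy -> Count shape of this flow can be sketched in plain Java. This is only an illustration of the data flow, not Cascading code: the class name `WordCountSketch`, the method `wordCount`, and the delimiter set in the regex are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-Java stand-in for the word count flow; none of this is Cascading API.
class WordCountSketch {
    // Tokenize: split each line on spaces and simple punctuation (assumed delimiters).
    // GroupBy token + Count: collect tokens into a sorted map of token -> occurrences.
    static Map<String, Long> wordCount(String... lines) {
        return Arrays.stream(lines)
            .flatMap(line -> Arrays.stream(line.split("[ \\[\\]\\(\\),.]")))
            .filter(token -> !token.isEmpty())
            .collect(Collectors.groupingBy(
                Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = wordCount(
            "A rain shadow is a dry area on the lee back side of a mountainous area.");
        counts.forEach((token, n) -> System.out.println(token + "\t" + n));
    }
}
```

Note that "A" and "a" are counted as different tokens here, which is the kind of noise a scrub step is meant to remove.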
3: wc + scrub


[Figure: flow diagram. Document Collection -> Tokenize -> Scrub token -> GroupBy token -> Count -> Word Count; M and R mark the map and reduce phases]

1 mapper
1 reducer
22+10 lines code
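
A scrub step normalizes each token before it is grouped and counted. The sketch below assumes the rule is "trim whitespace, lower-case, drop empties"; the deck does not spell out the rule, so treat `ScrubSketch`, `scrub`, and `scrubAll` as hypothetical names in a plain-Java illustration rather than Cascading code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration of a token scrub step; the rule itself is an assumption.
class ScrubSketch {
    // Normalize one token: strip surrounding whitespace and lower-case it.
    static String scrub(String token) {
        return token.trim().toLowerCase(Locale.ROOT);
    }

    // Scrub every token and discard those that end up empty.
    static List<String> scrubAll(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String scrubbed = scrub(token);
            if (!scrubbed.isEmpty()) {
                out.add(scrubbed);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scrubAll(Arrays.asList("A", " Rain", "", "a")));
    }
}
```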
4: wc + scrub + stop words


[Figure: flow diagram. Document Collection -> Tokenize -> Scrub token -> HashJoin Left (RHS: Stop Word List) -> Regex token -> GroupBy token -> Count -> Word Count; M and R mark the map and reduce phases]

1 mapper
1 reducer
28+10 lines code
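
The stop-word step pairs a left join with a filter: every token is joined against the stop word list, and only rows with no match on the right-hand side are kept. The following plain-Java sketch shows that idea under stated assumptions: the stop words listed are made up for the example, and `StopWordSketch`, `Joined`, `leftJoin`, and `dropStopWords` are hypothetical names, not Cascading API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration of "left join against a stop list, then filter".
class StopWordSketch {
    // One joined row: the left-hand token plus the matching stop word, or null if none.
    static final class Joined {
        final String token;
        final String stop;

        Joined(String token, String stop) {
            this.token = token;
            this.stop = stop;
        }
    }

    // Left join: every token survives, paired with its stop-word match or null.
    static List<Joined> leftJoin(List<String> tokens, Set<String> stopWords) {
        List<Joined> rows = new ArrayList<>();
        for (String token : tokens) {
            rows.add(new Joined(token, stopWords.contains(token) ? token : null));
        }
        return rows;
    }

    // Filter: keep only the rows whose right-hand side is empty.
    static List<String> dropStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (Joined row : leftJoin(tokens, stopWords)) {
            if (row.stop == null) {
                kept.add(row.token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical stop words, chosen only for this example.
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "is", "on", "the", "of"));
        List<String> tokens = Arrays.asList("a", "rain", "shadow", "is", "a", "dry", "area");
        System.out.println(dropStopWords(tokens, stopWords));
    }
}
```

Holding the stop list in an in-memory set mirrors the reason a hash join suits this step: the right-hand side is small enough to keep in memory while the tokens stream past.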
5: tf-idf


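[Figure: TF-IDF flow diagram. The Word Count flow (Document Collection -> Tokenize -> Scrub token -> HashJoin Left with Stop Word List as RHS -> Regex token) fans out into branches: D (Unique doc_id -> Insert 1 -> SumBy doc_id), DF (Unique token -> GroupBy token -> Count), TF (GroupBy doc_id, token -> Count), and GroupBy token -> Count -> Word Count. HashJoin and CoGroup steps combine the branches, and ExprFunc tf-idf produces the TF-IDF output; M and R mark the map and reduce phases]
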
  11 mappers
  9 reducers
  65+10 lines code
6: tf-idf + tdd


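[Figure: TF-IDF flow diagram with Assert, Checkpoint, and Failure Traps added. The Word Count flow (Document Collection -> Assert -> Tokenize -> Scrub token -> HashJoin Left with Stop Word List as RHS -> Regex token) fans out into branches: D (Unique doc_id -> Insert 1 -> SumBy doc_id), DF (Unique token -> GroupBy token -> Count), TF (GroupBy doc_id, token -> Count), and GroupBy token -> Count -> Word Count. HashJoin, Checkpoint, and CoGroup steps combine the branches, and ExprFunc tf-idf produces the TF-IDF output; M and R mark the map and reduce phases]
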
  12 mappers
  9 reducers
  76+14 lines code
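
The Assert and Failure Traps labels in this flow suggest a pattern where rows that fail a check are diverted to a side output instead of failing the whole job. Here is a minimal plain-Java sketch of that pattern. The validation pattern (a `doc` id followed by digits, a tab, then text) and the names `TrapSketch`, `VALID_ROW`, and `accept` are assumptions for illustration, not the deck's actual assertion or Cascading API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java illustration of "assert on each row, trap the failures".
class TrapSketch {
    // Assumed row shape: a doc_id such as "doc01", a tab, then the text.
    static final Pattern VALID_ROW = Pattern.compile("doc\\d+\\t.*");

    // Rows that pass the check are returned; the rest are appended to the trap.
    static List<String> accept(List<String> rows, List<String> trap) {
        List<String> accepted = new ArrayList<>();
        for (String row : rows) {
            if (VALID_ROW.matcher(row).matches()) {
                accepted.add(row);
            } else {
                trap.add(row);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<String> trap = new ArrayList<>();
        List<String> ok = accept(Arrays.asList(
            "doc05\tTwo Women. Secrets. A Broken Land. [DVD Australia]",
            "zoink\tnull"), trap);
        System.out.println("accepted: " + ok.size() + ", trapped: " + trap.size());
    }
}
```

The `zoink` row is borrowed from the sample input shown on the results slide, where it is the one line that does not look like a document.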
deployed…


 elastic-mapreduce --create --name "TF-IDF" \
   --jar s3n://temp.cascading.org/impatient/part6.jar \
   --arg s3n://temp.cascading.org/impatient/rain.txt \
   --arg s3n://temp.cascading.org/impatient/out/wc \
   --arg s3n://temp.cascading.org/impatient/en.stop \
   --arg s3n://temp.cascading.org/impatient/out/tfidf \
   --arg s3n://temp.cascading.org/impatient/out/trap \
   --arg s3n://temp.cascading.org/impatient/out/check
results?

input:

doc_id  text
doc01   A rain shadow is a dry area on the lee back side of a mountainous area.
doc02   This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03   A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04   This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05   Two Women. Secrets. A Broken Land. [DVD Australia]
zoink   null

output:

doc_id  tf-idf  token
doc02   0.9163  air
doc05   0.9163  australia
doc05   0.9163  broken
doc04   0.9163  california's
doc04   0.9163  cause
doc02   0.9163  cloudcover
doc04   0.9163  death
doc04   0.9163  deserts
doc03   0.9163  downwind
…
doc02   0.9163  sinking
doc04   0.9163  such
doc04   0.9163  valley
doc05   0.9163  women
doc03   0.5108  land
doc05   0.5108  land
doc01   0.5108  lee
doc02   0.5108  lee
doc03   0.5108  leeward
doc04   0.5108  leeward
doc01   0.4463  area
doc02   0.2231  area
doc03   0.2231  area
doc01   0.2231  dry
doc02   0.2231  dry
doc03   0.2231  dry
doc02   0.2231  mountain
doc03   0.2231  mountain
doc04   0.2231  mountain
doc01   0.0000  rain
doc02   0.0000  rain
doc03   0.0000  rain
doc04   0.0000  rain
doc01   0.0000  shadow
doc02   0.0000  shadow
doc03   0.0000  shadow
doc04   0.0000  shadow
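
The scores in the results can be reproduced with the weighting tf × ln(D / (1 + df)), where tf is the token's count within a document, df is the number of documents containing the token, and D = 5 is the number of valid documents. That formula is inferred from the numbers shown, not quoted from the deck, and `TfIdfSketch` and `tfidf` are hypothetical names in this small plain-Java check.

```java
import java.util.Locale;

// Plain-Java check of a tf-idf weighting that matches the scores in the results.
class TfIdfSketch {
    // Assumed formula: tf * ln( nDocs / (1 + df) ).
    static double tfidf(long tf, long df, long nDocs) {
        return tf * Math.log((double) nDocs / (1.0 + df));
    }

    static String fmt(double value) {
        return String.format(Locale.ROOT, "%.4f", value);
    }

    public static void main(String[] args) {
        long nDocs = 5; // doc01 through doc05
        System.out.println("air   (tf=1, df=1): " + fmt(tfidf(1, 1, nDocs)));
        System.out.println("land  (tf=1, df=2): " + fmt(tfidf(1, 2, nDocs)));
        System.out.println("dry   (tf=1, df=3): " + fmt(tfidf(1, 3, nDocs)));
        System.out.println("area  (tf=2, df=3): " + fmt(tfidf(2, 3, nDocs)));
        System.out.println("rain  (tf=1, df=4): " + fmt(tfidf(1, 4, nDocs)));
    }
}
```

Under this weighting a token that appears in four of the five documents, such as "rain", scores zero, while "area" scores twice as high in doc01 as elsewhere because it occurs twice in that document.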
comparisons?


 compare similar code in Scalding and Cascalog:

 sujitpal.blogspot.com/2012/08/scalding-for-impatient.html
 based on: github.com/twitter/scalding/wiki


 github.com/Quantisan/Impatient
 based on: github.com/nathanmarz/cascalog/wiki
drill-down?


  blog, code, wiki, gists, jars, list, DevOps products:

  cascading.org/category/impatient/
  github.org/Cascading/
  conjars.org/
  goo.gl/KQtUL
  concurrentinc.com/

 
Humans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryHumans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryPaco Nathan
 
Computable Content
Computable ContentComputable Content
Computable ContentPaco Nathan
 
Computable Content: Lessons Learned
Computable Content: Lessons LearnedComputable Content: Lessons Learned
Computable Content: Lessons LearnedPaco Nathan
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonPaco Nathan
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsPaco Nathan
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving UpPaco Nathan
 
Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?Paco Nathan
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusPaco Nathan
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataPaco Nathan
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learningPaco Nathan
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in SparkPaco Nathan
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataPaco Nathan
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingPaco Nathan
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MorePaco Nathan
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedPaco Nathan
 

More from Paco Nathan (20)

Human in the loop: a design pattern for managing teams working with ML
Human in the loop: a design pattern for managing  teams working with MLHuman in the loop: a design pattern for managing  teams working with ML
Human in the loop: a design pattern for managing teams working with ML
 
Human-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLHuman-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage ML
 
Human-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLHuman-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage ML
 
Humans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIHumans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AI
 
Humans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryHumans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industry
 
Computable Content
Computable ContentComputable Content
Computable Content
 
Computable Content: Lessons Learned
Computable Content: Lessons LearnedComputable Content: Lessons Learned
Computable Content: Lessons Learned
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in Python
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analytics
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving Up
 
Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and Erasmus
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About Data
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big Data
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark Streaming
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML Unpaused
 

Recently uploaded

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 

Recently uploaded (20)

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

Cascading for the Impatient

  • 1. Cascading for the Impatient. Paco Nathan, Concurrent, Inc. pnathan@concurrentinc.com, @pacoid. Copyright @2012, Concurrent, Inc. [Diagram: Document Collection -> Tokenize -> Scrub token -> HashJoin Left (RHS: Stop Word List) -> Regex token -> GroupBy token -> Count -> Word Count; M and R mark the map and reduce phases]
  • 2. why? Unstructured Data meets Enterprise Scale
  • 3. how? Cascading.org/ [Diagram: the word-count flow with stop-word filtering from the title slide]
  • 4. who?
      • Business Stakeholder POV: business process management for workflow orchestration (think BPM/BPEL)
      • Systems Integrator POV: system integration of heterogeneous data sources and compute platforms
      • Data Scientist POV: a directed, acyclic graph (DAG) on which we can apply Amdahl's Law
      • Data Architect POV: a physical plan for large-scale data flow management
      • Software Architect POV: a pattern language, similar to plumbing or circuit design
      • App Developer POV: API bindings for Scala, Clojure, Python, Ruby, Java
      • Systems Engineer POV: a JAR file, has passed CI, available in a Maven repo
      [Diagram: word-count flow with stop-word filtering]
  • 5. where?
      • business process: Domain expertise, business trade-offs, operating parameters, etc.
      • API language: Scala, Clojure, Python, Ruby, Java, etc. …envision whatever else runs in a JVM
      • logical plan / optimize: (raw human intellect, unless…)
      • physical plan: [Diagram: word-count flow with stop-word filtering]
      • compute framework: Apache Hadoop, in-memory local mode …envision GPUs, other frameworks, etc.
      • “assembler” code
      • notification: monitors, Nagios, etc.
  • 6. 1: copy [Diagram: Source -> M -> Sink] 1 mapper, 0 reducers, 10 lines code

      import java.util.Properties;

      import cascading.flow.FlowDef;
      import cascading.flow.hadoop.HadoopFlowConnector;
      import cascading.pipe.Pipe;
      import cascading.property.AppProps;
      import cascading.scheme.hadoop.TextDelimited;
      import cascading.tap.Tap;
      import cascading.tap.hadoop.Hfs;

      public class Main
        {
        public static void main( String[] args )
          {
          String inPath = args[ 0 ];
          String outPath = args[ 1 ];

          Properties props = new Properties();
          AppProps.setApplicationJarClass( props, Main.class );
          HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

          // create the source tap (tab-delimited text with a header line)
          Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

          // create the sink tap
          Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

          // specify a pipe to connect the taps
          Pipe copyPipe = new Pipe( "copy" );

          // connect the taps, pipes, etc., into a flow
          FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
           .addSource( copyPipe, inTap )
           .addTailSink( copyPipe, outTap );

          // run the flow
          flowConnector.connect( flowDef ).complete();
          }
        }
  • 7. wait! ten lines of code for a file copy … seems like a lot.
  • 8. same JAR, any scale…
      • MegaCorp Enterprise IT: Pb’s data; 1000+ node cluster; EVP calls you when app fails; runtime: days+
      • Production Cluster: Tb’s data; EMR + 50 HPC Instances; Ops monitors results; runtime: hours – days
      • Staging Cluster: Gb’s data; EMR + 4 Spot Instances; CI shows red or green lights; runtime: minutes – hours
      • Your Mom’s Laptop: Mb’s data; Hadoop standalone mode; passes unit tests, or not; runtime: seconds – minutes
  • 9. 2: word count [Diagram: Document Collection -> Tokenize -> GroupBy token -> Count -> Word Count] 1 mapper, 1 reducer, 18 lines code
  • 10. 3: wc + scrub [Diagram: Document Collection -> Tokenize -> Scrub token -> GroupBy token -> Count -> Word Count] 1 mapper, 1 reducer, 22+10 lines code
  • 11. 4: wc + scrub + stop words [Diagram: Document Collection -> Tokenize -> Scrub token -> HashJoin Left (RHS: Stop Word List) -> Regex token -> GroupBy token -> Count -> Word Count] 1 mapper, 1 reducer, 28+10 lines code
  • 12. 5: tf-idf [Diagram: the stop-word-filtered token stream feeds four branches: D (Unique doc_id, Insert 1, SumBy doc_id), DF (Unique token, GroupBy token, Count), TF (GroupBy doc_id, token, Count), and Word Count (GroupBy token, Count); the D, DF, and TF branches are combined with HashJoin and CoGroup, then ExprFunc tf-idf writes the TF-IDF output] 11 mappers, 9 reducers, 65+10 lines code
  • 13. 6: tf-idf + tdd [Diagram: the tf-idf flow from the previous slide, with an Assert and Failure Traps added on the input side and a Checkpoint added after the stop-word HashJoin] 12 mappers, 9 reducers, 76+14 lines code
  • 14. deployed…

      elastic-mapreduce --create --name "TF-IDF" \
        --jar s3n://temp.cascading.org/impatient/part6.jar \
        --arg s3n://temp.cascading.org/impatient/rain.txt \
        --arg s3n://temp.cascading.org/impatient/out/wc \
        --arg s3n://temp.cascading.org/impatient/en.stop \
        --arg s3n://temp.cascading.org/impatient/out/tfidf \
        --arg s3n://temp.cascading.org/impatient/out/trap \
        --arg s3n://temp.cascading.org/impatient/out/check
  • 15. results?

      Input (doc_id, text):

      doc_id  text
      doc01   A rain shadow is a dry area on the lee back side of a mountainous area.
      doc02   This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
      doc03   A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
      doc04   This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
      doc05   Two Women. Secrets. A Broken Land. [DVD Australia]
      zoink   null

      Output (doc_id, tf-idf, token):

      doc_id  tf-idf  token
      doc02   0.9163  air
      doc05   0.9163  australia
      doc05   0.9163  broken
      doc04   0.9163  california's
      doc04   0.9163  cause
      doc02   0.9163  cloudcover
      doc04   0.9163  death
      doc04   0.9163  deserts
      doc03   0.9163  downwind
      …
      doc02   0.9163  sinking
      doc04   0.9163  such
      doc04   0.9163  valley
      doc05   0.9163  women
      doc03   0.5108  land
      doc05   0.5108  land
      doc01   0.5108  lee
      doc02   0.5108  lee
      doc03   0.5108  leeward
      doc04   0.5108  leeward
      doc01   0.4463  area
      doc02   0.2231  area
      doc03   0.2231  area
      doc01   0.2231  dry
      doc02   0.2231  dry
      doc03   0.2231  dry
      doc02   0.2231  mountain
      doc03   0.2231  mountain
      doc04   0.2231  mountain
      doc01   0.0000  rain
      doc02   0.0000  rain
      doc03   0.0000  rain
      doc04   0.0000  rain
      doc01   0.0000  shadow
      doc02   0.0000  shadow
      doc03   0.0000  shadow
      doc04   0.0000  shadow
  • 16. comparisons? compare similar code in Scalding and Cascalog:
      • sujitpal.blogspot.com/2012/08/scalding-for-impatient.html based on: github.com/twitter/scalding/wiki
      • github.com/Quantisan/Impatient based on: github.com/nathanmarz/cascalog/wiki
  • 17. drill-down? blog, code, wiki, gists, jars, list, DevOps products:
      • cascading.org/category/impatient/
      • github.org/Cascading/
      • conjars.org/
      • goo.gl/KQtUL
      • concurrentinc.com/
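
Slides 9 to 11 grow the word count one stage at a time: tokenize, scrub, then a left join against a stop word list followed by a filter. The deck does not show the code for those parts, so the following is only a plain-JDK sketch of the same data flow and not the Cascading API. The class name, method names, and the delimiter set are assumptions made for illustration; the trim-and-lower-case scrub is inferred from the lower-cased tokens (for example `california's`) in the slide 15 results.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Plain-JDK sketch of the slide 9-11 flow; none of this is Cascading API. */
final class WordCountPipelineSketch {

    /** Tokenize: split a line on an assumed delimiter set (space, brackets, parens, comma, period). */
    static List<String> tokenize(String line) {
        return Arrays.asList(line.split("[ \\[\\]\\(\\),.]"));
    }

    /** Scrub token: trim and lower-case; returns null when nothing is left. */
    static String scrub(String token) {
        String cleaned = token.trim().toLowerCase(Locale.ROOT);
        return cleaned.isEmpty() ? null : cleaned;
    }

    /** Tokenize -> Scrub -> stop-word left join and filter -> GroupBy token -> Count. */
    static Map<String, Integer> wordCount(List<String> lines, Set<String> stopWords) {
        Map<String, Integer> counts = new TreeMap<>();     // sorted by token
        for (String line : lines) {
            for (String raw : tokenize(line)) {
                String token = scrub(raw);
                if (token == null) {
                    continue;                              // empty gap between two delimiters
                }
                // Left join: the small stop list sits in memory; unmatched tokens get null.
                String stop = stopWords.contains(token) ? token : null;
                if (stop != null) {
                    continue;                              // filter: drop rows that matched the stop list
                }
                counts.merge(token, 1, Integer::sum);      // group by token and count
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = wordCount(
            Arrays.asList("A rain shadow is a dry area on the lee back side of a mountainous area."),
            Set.copyOf(Arrays.asList("a", "is", "on", "the", "of")));
        counts.forEach((token, n) -> System.out.println(token + "\t" + n));
    }
}
```

Passing an empty stop word set gives the slide 10 behaviour (word count plus scrub), which is one way to see that each slide adds a stage without changing the earlier ones.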

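The scores on slide 15 are consistent with tf-idf = tf × ln(D / (1 + df)), with D = 5 once the `zoink` record has been diverted to the trap: `air` (df = 1) gives ln(5/2) ≈ 0.9163, `land` (df = 2) gives ln(5/3) ≈ 0.5108, `area` in doc01 (tf = 2, df = 3) gives 2 × ln(5/4) ≈ 0.4463, and `rain` (df = 4) gives ln(1) = 0. The deck does not state the formula, so treat it as inferred from those numbers. The sketch below uses only the JDK; the class and method names, and the `doc\d+` check standing in for the assertion and failure trap of slide 13, are assumptions for illustration.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-JDK sketch of the slide 12-15 calculation; none of this is Cascading API. */
final class TfIdfSketch {

    /** Trap stand-in: keeps records whose doc_id looks like "doc01", diverts the rest. */
    static Map<String, List<String>> validate(Map<String, List<String>> docs, List<String> trap) {
        Map<String, List<String>> passed = new TreeMap<>();
        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            if (doc.getKey().matches("doc\\d+")) {         // assumed doc_id shape
                passed.put(doc.getKey(), doc.getValue());
            } else {
                trap.add(doc.getKey());                    // bad record goes to the trap
            }
        }
        return passed;
    }

    /** DF: number of documents that contain each token at least once. */
    static Map<String, Integer> documentFrequency(Map<String, List<String>> docs) {
        Map<String, Integer> df = new TreeMap<>();
        for (List<String> tokens : docs.values()) {
            for (String token : new HashSet<>(tokens)) {   // unique tokens per document
                df.merge(token, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Score inferred from the slide 15 figures: tf * ln(D / (1 + df)). */
    static double score(int tf, int df, int nDocs) {
        return tf * Math.log(nDocs / (1.0 + df));
    }

    /** Returns "doc_id<TAB>token" -> tf-idf for every token of every document. */
    static Map<String, Double> tfIdf(Map<String, List<String>> docs) {
        Map<String, Integer> df = documentFrequency(docs);
        int nDocs = docs.size();                           // D: number of documents
        Map<String, Double> scores = new TreeMap<>();
        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            Map<String, Integer> tf = new TreeMap<>();     // TF: token counts within this document
            for (String token : doc.getValue()) {
                tf.merge(token, 1, Integer::sum);
            }
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                scores.put(doc.getKey() + "\t" + e.getKey(),
                           score(e.getValue(), df.get(e.getKey()), nDocs));
            }
        }
        return scores;
    }
}
```

Calling `validate` first and `tfIdf` on its result mirrors the order in the slide 13 diagram, where the assertion and trap sit ahead of the scoring branches.
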
Editor's Notes

  1. responsible for net lift, or we work on something else