Enterprise Data Workflows with Cascading

                Paco Nathan
                Concurrent, Inc.
                pnathan@concurrentinc.com
                @pacoid

                [diagram: Word Count workflow — Document Collection → (M) Tokenize → Scrub token → HashJoin (Left) against Stop Word List (RHS) → Regex token → GroupBy token → (R) Count → Word Count]

                Copyright @2012, Concurrent, Inc.

Unstructured Data meets Enterprise Scale

         1. Cascading API: a few facts & quotes
         2. Example #1: distributed file copy
         3. Example #2: word count
         4. Pattern Language: workflow abstraction
         5. Compare: Scalding, Cascalog, Hive, Pig




Intro to Cascading
                         Cascading API: a few facts & quotes

Enterprise apps, pre-Hadoop
                [diagram: pre-Hadoop enterprise architecture — analysts run SQL queries and ad-hoc analysis against a Data Warehouse fed via ETL from data sources; ops manages the data sources; developers build Apps driven by business priorities; Analytics Tools support modeling, dashboards, and insights from data sets]

Enterprise apps, pre-Hadoop
               the devil you know:

                 ‣ “scale up” as needed – larger proprietary hardware
                 ‣ data warehouse: e.g., Oracle, Teradata, etc. – expensive
                 ‣ analytics: e.g., SAS, MicroStrategy, etc. – expensive
                 ‣ highly trained staff in specific roles – lots of “silos”

               however, to be competitive now, the data rates must scale
               by orders of magnitude...

               ( alternatively, can we get hired onto the SAS sales team? )




Enterprise apps, with Hadoop
               Apache Hadoop offers an attractive migration path:

                 ‣ open source software – less expensive
                 ‣ commodity hardware – less expensive
                 ‣ fault tolerance for large-scale parallel workloads
                 ‣ great use cases: Yahoo!, Facebook, Twitter, Amazon, Apple, etc.
                 ‣ offload workflows from licensed platforms, based on “scale-out”




Enterprise apps, with Hadoop


                [diagram: analysts hand queries, models, and ETL needs to developers, who write Java apps against a Hadoop Cluster (job tracker, name node); ops runs the cluster]

Enterprise apps, with Hadoop
               anything odd about that diagram?

                 ‣ demands expert Hadoop developers
                 ‣ experts are hard to find, expensive
                 ‣ even harder to train from among existing staff
                 ‣ early adopter abstractions are not suitable for Enterprise IT
                 ‣ importantly: Hadoop is almost never used in isolation

Cascading API: purpose
                ‣ simplify data processing development and deployment

                ‣ improve application developer productivity

                ‣ enable data processing application manageability




Cascading API: a few facts
                Java open source project (ASL 2) using Git, Gradle, Maven, JUnit, etc.

                in production (~5 yrs) at hundreds of enterprise Hadoop deployments:
                Finance, Health Care, Transportation, other verticals

                studies published about large use cases: Twitter, Etsy, eBay, Airbnb, Square,
                Climate Corp, FlightCaster, Williams-Sonoma, Trulia, TeleNav

                partnerships and distribution with SpringSource, Amazon AWS,
                Microsoft Azure, Hortonworks, MapR, EMC

                several open source projects built atop, managed by Twitter, Etsy, eBay, etc.,
                which provide substantial Machine Learning libraries

                DSLs available in Scala, Clojure, Python (Jython), Ruby (JRuby), Groovy

                data “taps” integrate popular data frameworks via JDBC, Memcached, HBase,
                plus serialization in Apache Thrift, Avro, Kryo, etc.

                entire app compiles into a single JAR: fully connected for compiler optimization,
                exception handling, debug, config, scheduling, notifications, provenance, etc.




Cascading API: a few quotes
              “Cascading gives Java developers the ability to build Big Data applications
               on Hadoop using their existing skillset … Management can really go out
               and build a team around folks that are already very experienced with Java.
               Switching over to this is really a very short exercise.”
                  CIO, Thor Olavsrud, 2012-06-06
                  cio.com/article/707782/Ease_Big_Data_Hiring_Pain_With_Cascading


              “Masks the complexity of MapReduce, simplifies the programming, and
               speeds you on your journey toward actionable analytics … A vast
               improvement over native MapReduce functions or Pig UDFs.”
                  2012 BOSSIE Awards, James Borck, 2012-09-18
                  infoworld.com/slideshow/65089


              “Company’s promise to application developers is an opportunity to build
               and test applications on their desktops in the language of choice with
               familiar constructs and reusable components”
                  Dr. Dobb’s, Adrian Bridgwater, 2012-06-08
                  drdobbs.com/jvm/where-does-big-data-go-to-get-data-inten/240001759




Enterprise concerns
              “Notes from the Mystery Machine Bus”
               by Steve Yegge, Google
               goo.gl/SeRZa
                          “conservative”                             “liberal”
                            (mostly) Enterprise                   (mostly) Start-Up

                             risk management                    customer experiments

                                 assurance                            flexibility

                           well-defined schema                   schema follows code
                           explicit configuration                     convention

                          type-checking compiler                 interpreted scripts

                            wants no surprises                  wants no impediments

                          Java, Scala, Clojure, etc.            PHP, Ruby, Python, etc.

                   Cascading, Scalding, Cascalog, etc.   Hive, Pig, Hadoop Streaming, etc.



Enterprise adoption

                         As Enterprise apps move into
                         Hadoop and related BigData
                         frameworks, risk profiles shift
                         toward more conservative
                         programming practices

                         Cascading provides a popular
                         API – formally speaking, as a
                         pattern language – for defining
                         and managing Enterprise data
                         workflows


Migration of batch toolsets


                                       Enterprise   Migration    Start-Ups
                     define pipelines      J2EE       Cascading      Pig

                         query data       SQL         Lingual       Hive

                   predictive models      SAS         Pattern      Mahout




Summary
               Cascading API benefits:


                 ‣ addresses staffing bottlenecks due to Hadoop adoption
                 ‣ reduces costs, while servicing risk concerns and “conservatism”
                 ‣ manages complexity as the data continues to scale massively
                 ‣ provides a pattern language for system integration
                 ‣ leverages a workflow abstraction for Enterprise apps
                 ‣ utilizes existing practices for JVM-based clusters




Intro to Cascading
                         Code Example #1: distributed file copy

1: distributed file copy
        // Cascading 2.x imports (assumed; the original slide shows only the class body)
        import java.util.Properties;

        import cascading.flow.FlowDef;
        import cascading.flow.hadoop.HadoopFlowConnector;
        import cascading.pipe.Pipe;
        import cascading.property.AppProps;
        import cascading.scheme.hadoop.TextDelimited;
        import cascading.tap.Tap;
        import cascading.tap.hadoop.Hfs;

        public class
          Main
          {
          public static void
          main( String[] args )
            {
            String inPath = args[ 0 ];
            String outPath = args[ 1 ];

            Properties props = new Properties();
            AppProps.setApplicationJarClass( props, Main.class );
            HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

            // create the source tap
            Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

            // create the sink tap
            Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

            // specify a pipe to connect the taps
            Pipe copyPipe = new Pipe( "copy" );

            // connect the taps, pipes, etc., into a flow
            FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
             .addSource( copyPipe, inTap )
             .addTailSink( copyPipe, outTap );

            // run the flow
            flowConnector.connect( flowDef ).complete();
            }
          }

        [diagram: Source tap → (M) → Sink tap]

         1 mapper
         0 reducers
        10 lines code

1: distributed file copy
               shown:
                ‣ a source tap – input data
                 ‣ a sink tap – output data
                 ‣ a pipe connecting a source to a sink
                 ‣ simplest possible Cascading app

               not shown:
                ‣ what kind of taps? and what size of input data set?
                 ‣ could be: JDBC, HBase, Cassandra, XML, flat files, etc.
                 ‣ what kind of topology? and what size of cluster?
                 ‣ could be: Hadoop, in-memory, etc.

               as system architects, we leverage a pattern language – e.g., swapping taps or topologies without touching the business logic (see the sketch below)
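
               to make the “not shown” point concrete, here is a minimal sketch (not from the original deck) of the same copy app rebound to the local, in-memory topology: the pipe is unchanged, only the taps and the flow connector differ. it assumes the Cascading 2.x local-mode classes (LocalFlowConnector, FileTap, local TextDelimited); the class name LocalCopy is hypothetical.

        import java.util.Properties;

        import cascading.flow.FlowDef;
        import cascading.flow.local.LocalFlowConnector;
        import cascading.pipe.Pipe;
        import cascading.property.AppProps;
        import cascading.scheme.local.TextDelimited;
        import cascading.tap.Tap;
        import cascading.tap.local.FileTap;

        public class
          LocalCopy
          {
          public static void
          main( String[] args )
            {
            String inPath = args[ 0 ];
            String outPath = args[ 1 ];

            Properties props = new Properties();
            AppProps.setApplicationJarClass( props, LocalCopy.class );

            // local in-memory topology instead of a Hadoop cluster
            LocalFlowConnector flowConnector = new LocalFlowConnector( props );

            // local filesystem taps instead of HDFS taps
            Tap inTap = new FileTap( new TextDelimited( true, "\t" ), inPath );
            Tap outTap = new FileTap( new TextDelimited( true, "\t" ), outPath );

            // same pipe and flow definition as the Hadoop version
            Pipe copyPipe = new Pipe( "copy" );

            FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
             .addSource( copyPipe, inTap )
             .addTailSink( copyPipe, outTap );

            flowConnector.connect( flowDef ).complete();
            }
          }
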


principle: same JAR, any scale
                                                                     MegaCorp Enterprise IT:
                                                                     Pb’s data
                                                                     1000+ node private cluster
                                                                     EVP calls you when app fails
                                                                     runtime: days+

                                                      Production Cluster:
                                                      Tb’s data
                                                      EMR w/ 50 HPC Instances
                                                      Ops monitors results
                                                      runtime: hours – days

                                  Staging Cluster:
                                  Gb’s data
                                  EMR + 4 Spot Instances
                                  CI shows red or green lights
                                  runtime: minutes – hours

               Your Laptop:
               Mb’s data
               Hadoop standalone mode
               passes unit tests, or not
               runtime: seconds – minutes



principle: fail the same way twice
               troubleshooting at scale:


                 ‣ physical plan for a query provides a deterministic strategy
                 ‣ avoid non-deterministic behavior – expensive when troubleshooting
                 ‣ otherwise, edge cases become nightmares on large clusters
                 ‣ again, addresses “conservative” need for predictability
                 ‣ a core value which is unique to Cascading




principle: plan ahead
               flow planner per topology:


                 ‣ leverage the flow graph (DAG)
                 ‣ catch as many errors as possible before an app gets submitted
                 ‣ potential problems caught at compile time or at flow planner stage
                 ‣ …long before large, expensive resources start getting consumed
                 ‣ …or worse, before the wrong results get propagated downstream (see the sketch below)
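
               a minimal sketch (not from the deck) of the planner failing fast: a field-name typo in the assembly typically surfaces as a PlannerException when connect() is called, before any job gets submitted. this is a fragment meant to sit inside a main() like the word count example in the next section (docTap, wcTap, and flowConnector as defined there), assuming Cascading 2.x.

          import cascading.flow.planner.PlannerException;

          // ...inside main(), after creating docTap, wcTap, and flowConnector
          // exactly as in the word count example:

          Fields token = new Fields( "token" );
          Fields text = new Fields( "text" );
          RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
          Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

          // deliberate typo: the stream declares "token", not "tken"
          Pipe wcPipe = new GroupBy( "wc", docPipe, new Fields( "tken" ) );

          FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
            .addSource( docPipe, docTap )
            .addTailSink( wcPipe, wcTap );

          try
            {
            // the planner resolves fields across the whole DAG here;
            // the unresolvable "tken" field fails the plan long before
            // large, expensive cluster resources start getting consumed
            flowConnector.connect( flowDef );
            }
          catch( PlannerException exception )
            {
            // nothing was submitted – fix the assembly and re-plan
            exception.printStackTrace();
            }
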




Intro to Cascading
                         Code Example #2: word count

2: word count
               defined: count how often each word appears in a collection of text documents

               this simple program provides a great test case for parallel processing,
               since it:
                 ‣ requires a minimal amount of code
                 ‣ demonstrates use of both symbolic and numeric values
                 ‣ shows a dependency graph of tuples as an abstraction
                 ‣ is not many steps away from useful search indexing
                 ‣ serves as a “Hello World” for Hadoop apps

               any distributed computing framework which runs Word Count
               efficiently in parallel at scale can handle much larger,
               more interesting compute problems




2: word count


          [diagram: Document Collection → (M) Tokenize → GroupBy token → (R) Count → Word Count]

           1 mapper
           1 reducer
          18 lines code                        gist.github.com/3900702

2: word count

          [diagram: Document Collection → (M) Tokenize → GroupBy token → (R) Count → Word Count]

             String docPath = args[ 0 ];
             String wcPath = args[ 1 ];
             Properties properties = new Properties();
             AppProps.setApplicationJarClass( properties, Main.class );
             HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

             // create source and sink taps
             Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
             Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

             // specify a regex to split "document" text lines into a token stream
             Fields token = new Fields( "token" );
             Fields text = new Fields( "text" );
             RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );

             // only returns "token"
             Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

             // determine the word counts
             Pipe wcPipe = new Pipe( "wc", docPipe );
             wcPipe = new GroupBy( wcPipe, token );
             wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

             // connect the taps, pipes, etc., into a flow
             FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
              .addSource( docPipe, docTap )
              .addTailSink( wcPipe, wcTap );

             // write a DOT file and run the flow
             Flow wcFlow = flowConnector.connect( flowDef );
             wcFlow.writeDOT( "dot/wc.dot" );
             wcFlow.complete();

2: word count
               flow plan written by wcFlow.writeDOT(), read top to bottom:

               [head]
                 Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']
                   [{2}:'doc_id', 'text']
                 Each('token')[RegexSplitGenerator[decl:'token'][args:1]]           (map)
                   [{1}:'token']
                 GroupBy('wc')[by:['token']]
                   wc[{1}:'token']
                 Every('wc')[Count[decl:'count']]                                   (reduce)
                   [{2}:'token', 'count']
                 Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']
                   [{2}:'token', 'count']
               [tail]

                1 mapper
                1 reducer
               18 lines code

2: word count
               deltas between Example #1 and Example #2:

                 ‣ defines source tap as a collection of text documents
                 ‣ defines sink tap to produce word count tuples (desired end result)
                 ‣ uses named fields, applying structure to unstructured data
                 ‣ adds semantics to the workflow, specifying business logic
                 ‣ inserts operations into the pipe: Tokenize, GroupBy, Count
                 ‣ shows function and aggregation applied to data tuples in parallel

                 [diagram: Example #1 (Source tap → M → Sink tap) vs. Example #2 (Document Collection → Tokenize → GroupBy token → Count → Word Count)]

Intro to Cascading
                         Pattern Language: the workflow abstraction

enterprise data workflows
               Tuples, Pipelines, Taps, Operations, Joins, Assertions, Traps, etc.
               …in other words, “plumbing” as a pattern language
               for handling Big Data in Enterprise IT

pattern language
              defined: a structured method for solving large, complex
              design problems, where the syntax of the language
              promotes the use of best practices

              the “plumbing” metaphor of pipes and operators in
              Cascading helps indicate: algorithms to be used at
              particular points, appropriate architectural trade-offs,
              frameworks which must be integrated, etc.

              design patterns: originated in consensus negotiation
              for architecture, later used in software engineering



                 wikipedia.org/wiki/Pattern_language



data workflows: team
                ‣ Business Stakeholder POV:
                   business process management for workflow orchestration (think BPM/BPEL)

                ‣ Systems Integrator POV:
                   system integration of heterogenous data sources and compute platforms

                ‣ Data Scientist POV:
                   a directed, acyclic graph (DAG) on which we can apply Amdahl's Law, etc.

                ‣ Data Architect POV:
                   a physical plan for large-scale data flow management

                ‣ Software Architect POV:
                   a pattern language, similar to plumbing or circuit design

                ‣ App Developer POV:
                   API bindings for Java, Scala, Clojure, Jython, JRuby, etc.

                ‣ Systems Engineer POV:
                   a JAR file, has passed CI, available in a Maven repo

data workflows: layers
                    business     domain expertise, business trade-offs,
                    process      operating parameters, market position, etc.

                    API          Java, Scala, Clojure, Jython, JRuby, Groovy, etc.
                    language     …envision whatever runs in a JVM

                    optimize /   major changes in technology now
                    schedule

                    physical     “assembler” code
                    plan

                    topology     Apache Hadoop, in-memory local mode
                                 …envision GPUs, streaming, etc.

                    machine      Splunk, New Relic, Typesafe, Nagios, etc.
                    data

data workflows: example
                 [diagram: a Cascading app running on a Hadoop cluster — source taps read web logs and customer profile DBs; the Recommender System flow writes through a sink tap to a Memcached cluster backing a web API for Customers; a trap tap routes bad records to Support review]

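                 a sketch (all tap and pipe names hypothetical) of how a workflow like this gets wired up in Cascading — multiple source taps, a sink tap, and a trap tap so that malformed records are set aside for Support review rather than failing the whole flow; assumes Cascading 2.x:

                 FlowDef flowDef = FlowDef.flowDef().setName( "recommender" )
                   .addSource( logsPipe, webLogsTap )           // web logs
                   .addSource( profilePipe, profileDbTap )      // customer profile DBs
                   .addTailSink( recommendPipe, memcachedTap )  // results pushed to the Memcached cluster
                   .addTrap( logsPipe, supportReviewTap );      // bad records diverted for Support review

                 Flow flow = new HadoopFlowConnector( props ).connect( flowDef );
                 flow.complete();
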
data workflows: SQL vs. JVM
                     abstraction      SQL

                     parser           SQL parser

                     optimizer        logical plan,
                                      optimized based on stats

                     planner          physical plan

                     machine data     query history,
                                      table stats

                     topology         b-trees, etc.

                     visualization    ERD

                     schema           table schema

                     catalog          relational catalog

data workflows: SQL vs. JVM
                     abstraction      SQL                          JVM

                     parser           SQL parser                   SQL-92 compliant parser
                                                                   (in progress)

                     optimizer        logical plan,                logical plan,
                                      optimized based on stats     optimized based on stats

                     planner          physical plan                API “plumbing”

                     machine data     query history,               app history,
                                      table stats                  tuple stats

                     topology         b-trees, etc.                heterogeneous, distributed:
                                                                   Hadoop, in-memory, etc.

                     visualization    ERD                          flow diagram

                     schema           table schema                 tuple schema

                     catalog          relational catalog           tap usage DB

Cascading taxonomy


                 [diagram: Cascading taxonomy — an app (with an owner, versioned in a Maven repo) is scheduled as app instances; each instance runs flows; a flow breaks into steps, and steps into slices (kind: mapper | reducer); flows connect source, sink, and trap taps; topology: hadoop | local]

MapReduce architecture
               ‣ name node / data node
               ‣ job tracker / task tracker
               ‣ submit queue
               ‣ task slots
               ‣ HDFS
               ‣ distributed cache

                 [diagrams: MapReduce architecture, from Apache and Wikipedia]

Summary
               If you were leading a team responsible for Enterprise apps:


                 ‣ which of the previous two slides seems easier to understand?
                 ‣ which is simpler to use for training and managing a team?
                 ‣ which costs the most in the long run?




Intro to Cascading
                         Compare & Contrast:
                         other approaches




Monday, 17 December 12                                                                                                 39
wc: pseudocode

       void map (String doc_id, String text):
         for each word w in segment(text):
           emit(w, "1");



       void reduce (String word, Iterator partial_counts):
         int count = 0;

           for each pc in partial_counts:
             count += Int(pc);

           emit(word, String(count));




Scalding / Scala

       // Sujit Pal
       // sujitpal.blogspot.com/2012/08/scalding-for-impatient.html

       package com.mycompany.impatient

       import com.twitter.scalding._

       class Part2(args : Args) extends Job(args) {
         val input = Tsv(args("input"), ('docId, 'text))
         val output = Tsv(args("output"))
         input.read.
           flatMap('text -> 'word) {
              text : String => text.split("""\s+""")
           }.
           groupBy('word) { group => group.size }.
           write(output)
       }




Scalding / Scala

       github.com/twitter/scalding/wiki

       notes:
        ‣ code is compact, easy to understand

          ‣ functional programming is great for expressing
             complex workflows in MapReduce, etc.
          ‣ very large-scale, complex problems can be handled
             in just a few lines of code
          ‣ many large-scale apps in production deployments

          ‣ significant investments by Twitter, Etsy, eBay, etc.,
             in this open source project
          ‣ extensive libraries are available for linear algebra,
             machine learning – e.g., “Matrix API”




Cascalog / Clojure

       ; Paul Lam
       ; github.com/Quantisan/Impatient

       (ns impatient.core
         (:use [cascalog.api]
               [cascalog.more-taps :only (hfs-delimited)])
         (:require [clojure.string :as s]
                   [cascalog.ops :as c])
         (:gen-class))

       (defmapcatop split [line]
         "reads in a line of string and splits it by regex"
          (s/split line #"[\[\](),.)\s]+"))

       (defn -main [in out & args]
         (?<- (hfs-delimited out)
              [?word ?count]
              ((hfs-delimited in :skip-header? true) _ ?line)
              (split ?line :> ?word)
              (c/count ?count)))



Cascalog / Clojure

       github.com/nathanmarz/cascalog/wiki

       notes:
        ‣ code is compact, easy to understand

          ‣ functional programming is great for expressing
             complex workflows in MapReduce, etc.
          ‣ significant investments by Twitter, Climate Corp, etc.,
             in this open source project
          ‣ can run queries from the Clojure REPL

          ‣ compelling for very large-scale use cases where code
             correctness can be verified before deployment




Apache Hive

       -- Steve Severance
       -- stackoverflow.com/questions/10039949/word-count-program-in-hive

       CREATE TABLE input (line STRING);

       LOAD DATA LOCAL INPATH 'input.tsv'
       OVERWRITE INTO TABLE input;

       SELECT
        word, COUNT(*)
       FROM input
         LATERAL VIEW explode(split(line, ' ')) lTable AS word
       GROUP BY word
       ;




Apache Hive

       hive.apache.org

       pro:
        ‣ most popular abstraction atop Apache Hadoop

          ‣ SQL-like language is syntactically familiar to most analysts

          ‣ simple to load large-scale unstructured data and run ad-hoc queries

       con:
        ‣ not a relational engine, many surprises at scale

          ‣ difficult to represent complex workflows, ML algorithms, etc.

          ‣ one poorly-trained analyst can bottleneck an entire cluster

          ‣ app-level integration requires other coding, outside of script language

          ‣ logical planner mixed with physical planner; cannot collect app stats

          ‣ non-deterministic exec: number of mappers+reducers changes unexpectedly

          ‣ business logic must cross multiple language boundaries: difficult to
             troubleshoot, optimize, audit, handle exceptions, set notifications, etc.


Apache Pig

       -- kudos to Dmitriy Ryaboy

       docPipe = LOAD '$docPath' USING PigStorage('\t', 'tagsource')
         AS (doc_id, text);
       docPipe = FILTER docPipe BY doc_id != 'doc_id';

       -- specify regex to split "document" text lines into token stream
       tokenPipe = FOREACH docPipe
         GENERATE doc_id, FLATTEN(TOKENIZE(text, ' [](),.')) AS token;
       tokenPipe = FILTER tokenPipe BY token MATCHES '\\w.*';

       -- determine the word counts
       tokenGroups = GROUP tokenPipe BY token;
       wcPipe = FOREACH tokenGroups
         GENERATE group AS token, COUNT(tokenPipe) AS count;

       -- output
       STORE wcPipe INTO '$wcPath' USING PigStorage('\t', 'tagsource');
       EXPLAIN -out dot/wc_pig.dot -dot wcPipe;


Apache Pig

       pig.apache.org

       pro:
        ‣ easy to learn data manipulation language (DML)

          ‣ interactive prompt (Grunt) makes it simple to prototype apps

          ‣ extensibility through UDFs

       con:
        ‣ not a full programming language; must extend via UDFs outside of language

          ‣ app-level integration requires other coding, outside of script language

          ‣ simple problems are simple to do; hard problems become quite complex

          ‣ difficult to parameterize scripts externally; must rewrite to change taps!

          ‣ logical planner mixed with physical planner; cannot collect app stats

          ‣ non-deterministic exec: number of mappers+reducers changes unexpectedly

          ‣ business logic must cross multiple language boundaries: difficult to
             troubleshoot, optimize, audit, handle exceptions, set notifications, etc.


Monday, 17 December 12                                                                                                      48
Intro to Cascading
                                           Document
                                           Collection



                                                                        Scrub
                                                        Tokenize
                                                                        token

                                                   M



                                                                                HashJoin   Regex
                                                                                  Left     token
                                                                                                   GroupBy    R
                                                                   Stop Word                        token
                                                                      List
                                                                                  RHS




                                                                                                      Count




                                                                                                                  Word
                                                                                                                  Count




                         Code Example #N:
                         City of Palo Alto, etc.




Monday, 17 December 12                                                                                                    49
extend: wc + scrub + stop words


         Document
         Collection



                                         Scrub
                         Tokenize
                                         token

                 M



                                                 HashJoin   Regex
                                                   Left     token
                                                                    GroupBy    R
                                    Stop Word                        token
                                       List
                                                   RHS




                                                                       Count



          1 mapper                                                                 Word

          1 reducer                                                                Count


         28+10 lines code
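
         As a rough illustration of how the scrub and stop-word steps extend the basic
         word count, here is a minimal sketch in the Cascading 2.x Java API, patterned on
         the "Impatient" tutorial series; ScrubFunction is a hypothetical custom Function
         (e.g., lowercase and trim tokens) and the field names are illustrative rather
         than taken from this deck.

         // a minimal sketch, assuming Cascading 2.x
         Fields token = new Fields( "token" );
         Fields text = new Fields( "text" );
         Fields fieldSelector = new Fields( "doc_id", "token" );

         // tokenize, then scrub each token with a custom Function
         RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
         Pipe docPipe = new Each( "token", text, splitter, fieldSelector );
         docPipe = new Each( docPipe, fieldSelector, new ScrubFunction( fieldSelector ), Fields.RESULTS );

         // left join against the stop word list, then keep only rows which did not match
         Fields stop = new Fields( "stop" );
         Pipe stopPipe = new Pipe( "stop" );
         Pipe tokenPipe = new HashJoin( docPipe, token, stopPipe, stop, new LeftJoin() );
         tokenPipe = new Each( tokenPipe, stop, new RegexFilter( "^$" ) );
         tokenPipe = new Retain( tokenPipe, fieldSelector );

         // count the surviving tokens, exactly as in the basic word count
         Pipe wcPipe = new Pipe( "wc", tokenPipe );
         wcPipe = new GroupBy( wcPipe, token );
         wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );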



Monday, 17 December 12                                                                     50
extend: a simple search engine


         [flow diagram: Document Collection → Tokenize → Scrub token → HashJoin(Left) against
          the Stop Word List (RHS); the token stream branches into D (Unique doc_id → Insert 1
          → SumBy doc_id), DF (Unique token → CountBy token), and TF (CountBy doc_id, token);
          a HashJoin and CoGroup bring the branches back together, and an ExprFunc computes
          tf-idf → TF-IDF output; a side branch (CountBy token → Sort count) still produces
          the Word Count]




         10 mappers
          8 reducers
         68+14 lines code
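
         The heart of the tf-idf branch is a CoGroup of the term-frequency and
         document-frequency streams followed by an ExpressionFunction. A minimal sketch,
         assuming the Cascading 2.x Java API; the upstream tfPipe and dfPipe assemblies
         (and the n_docs total joined into dfPipe by the CountBy/SumBy branches) are
         assumed to exist, and field names are illustrative.

         // tfPipe carries (doc_id, tf_token, tf_count); dfPipe carries (df_token, df_count, n_docs)
         Pipe tfPipe = new Pipe( "tf" );   // stands in for the upstream TF branch
         Pipe dfPipe = new Pipe( "df" );   // stands in for the upstream DF branch, joined with D
         Pipe tfidfPipe = new CoGroup( tfPipe, new Fields( "tf_token" ),
                                       dfPipe, new Fields( "df_token" ) );

         // tf-idf = tf * log( n_docs / (1 + df) ), evaluated per tuple as a Janino expression
         String exp = "(double) tf_count * Math.log( (double) n_docs / ( 1.0 + df_count ) )";
         ExpressionFunction tfidfExp = new ExpressionFunction( new Fields( "tfidf" ), exp, Double.class );
         tfidfPipe = new Each( tfidfPipe, new Fields( "tf_count", "df_count", "n_docs" ), tfidfExp, Fields.ALL );

         // keep only the fields needed for the TF-IDF sink
         tfidfPipe = new Retain( tfidfPipe, new Fields( "doc_id", "tf_token", "tfidf" ) );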



Monday, 17 December 12                                                                                                                                                                  51
City of Palo Alto open data

         [flow diagram: CoPA GIS export → Regex parser → tsv Checkpoint, then Regex filter/
          parser branches for tree, road, and park records, with Failure Traps catching bad
          rows; the tree branch joins Tree Metadata and scrubs species, the road branch joins
          Road Metadata, estimates albedo, and emits Road Segments; both branches are keyed by
          Geohash, CoGrouped with personal GPS logs, filtered by Tree Distance, grouped by
          tree_name into a "shade" Checkpoint, and finally CoGrouped into the "reco" output]




         github.com/Cascading/CoPA/wiki
           ‣ GIS export for parks, roads, trees (unstructured / open data)
           ‣ log files of personalized/frequented locations in Palo Alto via iPhone GPS tracks
           ‣ curated metadata, used to enrich the dataset
           ‣ could extend via mash-up with many available public data APIs

         Enterprise-scale app: road albedo + tree species metadata + geospatial indexing
         “Find a shady spot on a summer day to walk near downtown and take a call…”
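
         To make the geospatial join tractable at scale, each lat/lng pair can be reduced to
         a geohash prefix which then serves as the join key. A minimal sketch, assuming the
         Cascading 2.x Java API; GeohashFunction is a hypothetical custom Function standing
         in for the geohash step in the CoPA flow, and all field names are illustrative.

         Pipe treePipe = new Pipe( "tree" );   // curated tree records: lat/lng + species metadata
         Pipe gpsPipe = new Pipe( "gps" );     // personalized GPS track points

         // enrich both branches with a 6-character geohash prefix (hypothetical custom Function)
         treePipe = new Each( treePipe, new Fields( "tree_lat", "tree_lng" ),
           new GeohashFunction( new Fields( "tree_geohash" ), 6 ), Fields.ALL );
         gpsPipe = new Each( gpsPipe, new Fields( "gps_lat", "gps_lng" ),
           new GeohashFunction( new Fields( "gps_geohash" ), 6 ), Fields.ALL );

         // co-group trees and GPS points that land in the same geohash cell;
         // downstream steps filter on walking distance and rank by shade
         Pipe shadePipe = new CoGroup( treePipe, new Fields( "tree_geohash" ),
                                       gpsPipe, new Fields( "gps_geohash" ) );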



Monday, 17 December 12                                                                                                                                                                                                                          52
CoPA: log events




Monday, 17 December 12    53
CoPA: results

         [density plot: "Estimated Tree Height (meters)" – density of avg_height (0–50 m),
          shaded by sample count (0–300)]




            ‣   addr: 115 HAWTHORNE AVE
            ‣   lat/lng: 37.446, -122.168
            ‣   geohash: 9q9jh0
            ‣   tree: 413 site 2
            ‣   species: Liquidambar styraciflua
            ‣   avg height 23 m
            ‣   road albedo: 0.12
            ‣   distance: 10 m
            ‣   a short walk from my train stop ✔



Monday, 17 December 12                                                                                                        54
Intro to Cascading
                                         Document
                                         Collection



                                                                      Scrub
                                                      Tokenize
                                                                      token

                                                 M



                                                                              HashJoin   Regex
                                                                                Left     token
                                                                                                 GroupBy    R
                                                                 Stop Word                        token
                                                                    List
                                                                                RHS




                                                                                                    Count




                                                                                                                Word
                                                                                                                Count




                         PMML:
                         predictive modeling




Monday, 17 December 12                                                                                                  55
PMML model




Monday, 17 December 12   56
cascading.pattern
                  example:
                  1. use customer order history as the training data set
                  2. train a risk classifier for orders, using Random Forest
                  3. export model from R to PMML
                  4. build a Cascading app to execute the PMML model
                         4.1. generate a pipeline from PMML description
                         4.2. planner builds the flow for a topology (Hadoop)
                         4.3. compile app to a JAR file
                  5. deploy the app at scale to calculate scores




Monday, 17 December 12                                                         57
cascading.pattern
                [architecture diagram: risk classifier along two dimensions – "customer 360"
                 (batch) and "per-order" (real-time). Cascading apps on the Hadoop batch side:
                 data prep, predict model costs, detect fraudsters, segment customers – fed by
                 training data sets, ETL, the DW, chargebacks, and partner data. The analyst's
                 laptop exports the PMML model. On the real-time side (Customer DB + IMDG):
                 score new orders, anomaly detection, velocity metrics over customer transactions.]




Monday, 17 December 12                                                                                       58
1:
                         “orders” data set...
                         train/test in R...
                         exported as PMML



Monday, 17 December 12                          59
R modeling
        library(randomForest)
        library(pmml)
        library(XML)    # provides saveXML() for writing the PMML document

        ## train a RandomForest model

       f <- as.formula("as.factor(label) ~ .")
       fit <- randomForest(f, data_train, ntree=50)

       ## test the model on the holdout test set

       print(fit$importance)
       print(fit)

       predicted <- predict(fit, data)
       data$predicted <- predicted
       confuse <- table(pred = predicted, true = data[,1])
       print(confuse)

       ## export predicted labels to TSV

        write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
          quote=FALSE, sep="\t", row.names=FALSE)

       ## export RF model to PMML

       saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))




Monday, 17 December 12                                                        60
R output
            MeanDecreaseGini
       var0        0.6591701
       var1       33.8625179
       var2        8.0290020

               OOB estimate of   error rate: 13.83%
       Confusion matrix:
           0  1 class.error
        0 28  5   0.1515152
        1  8 53   0.1311475

       [1] "./data/sample.rf.xml"




Monday, 17 December 12                                61
2:
                         Cascading app
                         takes PMML as
                         a parameter...



Monday, 17 December 12                    62
PMML model
       <?xml version="1.0"?>
       <PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.dmg.org/PMML-4_0
        http://www.dmg.org/v4-0/pmml-4-0.xsd">
        <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
         <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
         <Application name="Rattle/PMML" version="1.2.30"/>
         <Timestamp>2012-10-22 19:39:28</Timestamp>
        </Header>
        <DataDictionary numberOfFields="4">
         <DataField name="label" optype="categorical" dataType="string">
          <Value value="0"/>
          <Value value="1"/>
         </DataField>
         <DataField name="var0" optype="continuous" dataType="double"/>
         <DataField name="var1" optype="continuous" dataType="double"/>
         <DataField name="var2" optype="continuous" dataType="double"/>
        </DataDictionary>
        <MiningModel modelName="randomForest_Model" functionName="classification">
         <MiningSchema>
          <MiningField name="label" usageType="predicted"/>
          <MiningField name="var0" usageType="active"/>
          <MiningField name="var1" usageType="active"/>
          <MiningField name="var2" usageType="active"/>
         </MiningSchema>
         <Segmentation multipleModelMethod="majorityVote">
          <Segment id="1">
           <True/>
           <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest"
       splitCharacteristic="binarySplit">
            <MiningSchema>
             <MiningField name="label" usageType="predicted"/>
             <MiningField name="var0" usageType="active"/>
             <MiningField name="var1" usageType="active"/>
             <MiningField name="var2" usageType="active"/>
            </MiningSchema>
       ...


Monday, 17 December 12                                                                                            63
Cascading app
       public class Main {
         public static void main( String[] args ) {
           String pmmlPath = args[ 0 ];
           String ordersPath = args[ 1 ];
           String classifyPath = args[ 2 ];
           String trapPath = args[ 3 ];

             Properties properties = new Properties();
             AppProps.setApplicationJarClass( properties, Main.class );
             HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

             // create source and sink taps
              Tap ordersTap = new Hfs( new TextDelimited( true, "\t" ), ordersPath );
              Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
              Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );

             // define a "Classifier" model from PMML to evaluate the orders
             Classifier classifier = new Classifier( pmmlPath );
             Pipe classifyPipe = new Each( new Pipe( "classify" ), classifier.getFields(),
               new ClassifierFunction( new Fields( "score" ), classifier ), Fields.ALL );

             // connect the taps, pipes, etc., into a flow
             FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
              .addSource( classifyPipe, ordersTap )
              .addTrap( classifyPipe, trapTap )
              .addSink( classifyPipe, classifyTap );

             // write a DOT file and run the flow
             Flow classifyFlow = flowConnector.connect( flowDef );
             classifyFlow.writeDOT( "dot/classify.dot" );
             classifyFlow.complete();
           }
       }



Monday, 17 December 12                                                                       64
3:
                         app deployed on
                         a cluster to score
                         customers at scale...



Monday, 17 December 12                           65
deploy to cloud
        elastic-mapreduce --create --name "RF" \
          --jar s3n://temp.cascading.org/pattern/pattern.jar \
          --arg s3n://temp.cascading.org/pattern/sample.rf.xml \
          --arg s3n://temp.cascading.org/pattern/sample.tsv \
          --arg s3n://temp.cascading.org/pattern/out/classify \
          --arg s3n://temp.cascading.org/pattern/out/trap



       aws.amazon.com/elasticmapreduce/




Monday, 17 December 12                                            66
results
       bash-3.2$ head output/classify/part-00000
       label" var0" var1" var2" order_id" predicted"score
       1" 0" 1" 0" 6f8e1014" 1" 1
       0" 0" 0" 1" 6f8ea22e" 0" 0
       1" 0" 1" 0" 6f8ea435" 1" 1
       0" 0" 0" 1" 6f8ea5e1" 0" 0
       1" 0" 1" 0" 6f8ea785" 1" 1
       1" 0" 1" 0" 6f8ea91e" 1" 1
       0" 1" 0" 0" 6f8eaaba" 0" 0
       1" 0" 1" 0" 6f8eac54" 1" 1
       0" 1" 1" 0" 6f8eade3" 1" 1




Monday, 17 December 12                                      67
drill-down

                blog, code/wiki/gists, JARs, community, DevOps products:
                cascading.org
                github.com/Cascading
                conjars.org
                meetup.com/cascading
                goo.gl/KQtUL
                concurrentinc.com

                 pnathan@concurrentinc.com
                 @pacoid
                                                        Copyright @2012, Concurrent, Inc.



Monday, 17 December 12                                                                      68

More Related Content

What's hot

Isis Papyrus Ti Billing Energy E
Isis Papyrus Ti Billing Energy EIsis Papyrus Ti Billing Energy E
Isis Papyrus Ti Billing Energy EFriso de Jong
 
Introduction to Business Intelligence in Microsoft SQL Server 2008 R2
Introduction to Business Intelligence in Microsoft SQL Server 2008 R2Introduction to Business Intelligence in Microsoft SQL Server 2008 R2
Introduction to Business Intelligence in Microsoft SQL Server 2008 R2Quang Nguyễn Bá
 
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scalaSunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scalaMopuru Babu
 
BI Dashboards with SQL Server
BI Dashboards with SQL ServerBI Dashboards with SQL Server
BI Dashboards with SQL ServerEduardo Castro
 
adrian coyler open tour keynote
adrian coyler open tour keynoteadrian coyler open tour keynote
adrian coyler open tour keynotemarklucovsky
 
Implementing a QbD program to make Process Validation a Lifestyle
Implementing a QbD program to make Process Validation a LifestyleImplementing a QbD program to make Process Validation a Lifestyle
Implementing a QbD program to make Process Validation a LifestyleInstitute of Validation Technology
 
SSRS integration with share point
SSRS integration with share pointSSRS integration with share point
SSRS integration with share pointJacob Chang
 
Denny Lee\'s Data Camp v1.0 talk on SSRS Best Practices for IT
Denny Lee\'s Data Camp v1.0 talk on SSRS Best Practices for ITDenny Lee\'s Data Camp v1.0 talk on SSRS Best Practices for IT
Denny Lee\'s Data Camp v1.0 talk on SSRS Best Practices for ITBala Subra
 
GlassFish Mobility Platform - Hans Hrasna
GlassFish Mobility Platform - Hans HrasnaGlassFish Mobility Platform - Hans Hrasna
GlassFish Mobility Platform - Hans HrasnaEduardo Pelegri-Llopart
 
Sap and alfresco integrations with ctac connector 19 april2011
Sap and alfresco integrations with ctac connector 19 april2011Sap and alfresco integrations with ctac connector 19 april2011
Sap and alfresco integrations with ctac connector 19 april2011Alfresco Software
 
Resume_Asad_updated_DEC2016
Resume_Asad_updated_DEC2016Resume_Asad_updated_DEC2016
Resume_Asad_updated_DEC2016Asadullah Khan
 
Hadoop - Now, Next and Beyond
Hadoop - Now, Next and BeyondHadoop - Now, Next and Beyond
Hadoop - Now, Next and BeyondTeradata Aster
 
SAP Alfresco Integration For The Public Sector With Ctac
SAP Alfresco Integration For The Public Sector With CtacSAP Alfresco Integration For The Public Sector With Ctac
SAP Alfresco Integration For The Public Sector With CtacAlfresco Software
 
Samuel Bayeta
Samuel BayetaSamuel Bayeta
Samuel BayetaSam B
 

What's hot (18)

Isis Papyrus Ti Billing Energy E
Isis Papyrus Ti Billing Energy EIsis Papyrus Ti Billing Energy E
Isis Papyrus Ti Billing Energy E
 
Introduction to Business Intelligence in Microsoft SQL Server 2008 R2
Introduction to Business Intelligence in Microsoft SQL Server 2008 R2Introduction to Business Intelligence in Microsoft SQL Server 2008 R2
Introduction to Business Intelligence in Microsoft SQL Server 2008 R2
 
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scalaSunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
 
BI Dashboards with SQL Server
BI Dashboards with SQL ServerBI Dashboards with SQL Server
BI Dashboards with SQL Server
 
adrian coyler open tour keynote
adrian coyler open tour keynoteadrian coyler open tour keynote
adrian coyler open tour keynote
 
Implementing a QbD program to make Process Validation a Lifestyle
Implementing a QbD program to make Process Validation a LifestyleImplementing a QbD program to make Process Validation a Lifestyle
Implementing a QbD program to make Process Validation a Lifestyle
 
SSRS integration with share point
SSRS integration with share pointSSRS integration with share point
SSRS integration with share point
 
Denny Lee\'s Data Camp v1.0 talk on SSRS Best Practices for IT
Denny Lee\'s Data Camp v1.0 talk on SSRS Best Practices for ITDenny Lee\'s Data Camp v1.0 talk on SSRS Best Practices for IT
Denny Lee\'s Data Camp v1.0 talk on SSRS Best Practices for IT
 
101 ab 1600-1630
101 ab 1600-1630101 ab 1600-1630
101 ab 1600-1630
 
GlassFish Mobility Platform - Hans Hrasna
GlassFish Mobility Platform - Hans HrasnaGlassFish Mobility Platform - Hans Hrasna
GlassFish Mobility Platform - Hans Hrasna
 
Sap and alfresco integrations with ctac connector 19 april2011
Sap and alfresco integrations with ctac connector 19 april2011Sap and alfresco integrations with ctac connector 19 april2011
Sap and alfresco integrations with ctac connector 19 april2011
 
Resume_Asad_updated_DEC2016
Resume_Asad_updated_DEC2016Resume_Asad_updated_DEC2016
Resume_Asad_updated_DEC2016
 
Hadoop - Now, Next and Beyond
Hadoop - Now, Next and BeyondHadoop - Now, Next and Beyond
Hadoop - Now, Next and Beyond
 
Os Lonergan
Os LonerganOs Lonergan
Os Lonergan
 
SAP Alfresco Integration For The Public Sector With Ctac
SAP Alfresco Integration For The Public Sector With CtacSAP Alfresco Integration For The Public Sector With Ctac
SAP Alfresco Integration For The Public Sector With Ctac
 
Samuel Bayeta
Samuel BayetaSamuel Bayeta
Samuel Bayeta
 
SharePoint 2010: ECM-ready?
SharePoint 2010: ECM-ready?SharePoint 2010: ECM-ready?
SharePoint 2010: ECM-ready?
 
SAP_HANA_FAQ
SAP_HANA_FAQSAP_HANA_FAQ
SAP_HANA_FAQ
 

Similar to Enterprise Data Workflows with Cascading

Cascading for the Impatient
Cascading for the ImpatientCascading for the Impatient
Cascading for the ImpatientPaco Nathan
 
Intro to Cascading (SpringOne2GX)
Intro to Cascading (SpringOne2GX)Intro to Cascading (SpringOne2GX)
Intro to Cascading (SpringOne2GX)Paco Nathan
 
Chicago Hadoop Users Group: Enterprise Data Workflows
Chicago Hadoop Users Group: Enterprise Data WorkflowsChicago Hadoop Users Group: Enterprise Data Workflows
Chicago Hadoop Users Group: Enterprise Data WorkflowsPaco Nathan
 
Intro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataIntro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataPaco Nathan
 
Cascading meetup #4 @ BlueKai
Cascading meetup #4 @ BlueKaiCascading meetup #4 @ BlueKai
Cascading meetup #4 @ BlueKaiPaco Nathan
 
Using Cascalog to build
 an app based on City of Palo Alto Open Data
Using Cascalog to build
 an app based on City of Palo Alto Open DataUsing Cascalog to build
 an app based on City of Palo Alto Open Data
Using Cascalog to build
 an app based on City of Palo Alto Open DataPaco Nathan
 
Functional programming for optimization problems in Big Data
Functional programming for optimization problems in Big DataFunctional programming for optimization problems in Big Data
Functional programming for optimization problems in Big DataPaco Nathan
 
The Workflow Abstraction
The Workflow AbstractionThe Workflow Abstraction
The Workflow AbstractionOReillyStrata
 
The Workflow Abstraction
The Workflow AbstractionThe Workflow Abstraction
The Workflow AbstractionPaco Nathan
 
Advanced analytics with sap hana and r
Advanced analytics with sap hana and rAdvanced analytics with sap hana and r
Advanced analytics with sap hana and rSAP Technology
 

Similar to Enterprise Data Workflows with Cascading (10)

Cascading for the Impatient
Cascading for the ImpatientCascading for the Impatient
Cascading for the Impatient
 
Intro to Cascading (SpringOne2GX)
Intro to Cascading (SpringOne2GX)Intro to Cascading (SpringOne2GX)
Intro to Cascading (SpringOne2GX)
 
Chicago Hadoop Users Group: Enterprise Data Workflows
Chicago Hadoop Users Group: Enterprise Data WorkflowsChicago Hadoop Users Group: Enterprise Data Workflows
Chicago Hadoop Users Group: Enterprise Data Workflows
 
Intro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataIntro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big Data
 
Cascading meetup #4 @ BlueKai
Cascading meetup #4 @ BlueKaiCascading meetup #4 @ BlueKai
Cascading meetup #4 @ BlueKai
 
Using Cascalog to build
 an app based on City of Palo Alto Open Data
Using Cascalog to build
 an app based on City of Palo Alto Open DataUsing Cascalog to build
 an app based on City of Palo Alto Open Data
Using Cascalog to build
 an app based on City of Palo Alto Open Data
 
Functional programming for optimization problems in Big Data
Functional programming for optimization problems in Big DataFunctional programming for optimization problems in Big Data
Functional programming for optimization problems in Big Data
 
The Workflow Abstraction
The Workflow AbstractionThe Workflow Abstraction
The Workflow Abstraction
 
The Workflow Abstraction
The Workflow AbstractionThe Workflow Abstraction
The Workflow Abstraction
 
Advanced analytics with sap hana and r
Advanced analytics with sap hana and rAdvanced analytics with sap hana and r
Advanced analytics with sap hana and r
 

More from Paco Nathan

Human in the loop: a design pattern for managing teams working with ML
Human in the loop: a design pattern for managing  teams working with MLHuman in the loop: a design pattern for managing  teams working with ML
Human in the loop: a design pattern for managing teams working with MLPaco Nathan
 
Human-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLHuman-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLPaco Nathan
 
Human-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLHuman-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLPaco Nathan
 
Humans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIHumans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIPaco Nathan
 
Humans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryHumans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryPaco Nathan
 
Computable Content
Computable ContentComputable Content
Computable ContentPaco Nathan
 
Computable Content: Lessons Learned
Computable Content: Lessons LearnedComputable Content: Lessons Learned
Computable Content: Lessons LearnedPaco Nathan
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonPaco Nathan
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsPaco Nathan
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving UpPaco Nathan
 
Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?Paco Nathan
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusPaco Nathan
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataPaco Nathan
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learningPaco Nathan
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in SparkPaco Nathan
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataPaco Nathan
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingPaco Nathan
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MorePaco Nathan
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedPaco Nathan
 

More from Paco Nathan (20)

Human in the loop: a design pattern for managing teams working with ML
Human in the loop: a design pattern for managing  teams working with MLHuman in the loop: a design pattern for managing  teams working with ML
Human in the loop: a design pattern for managing teams working with ML
 
Human-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage MLHuman-in-the-loop: a design pattern for managing teams that leverage ML
Human-in-the-loop: a design pattern for managing teams that leverage ML
 
Human-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage MLHuman-in-a-loop: a design pattern for managing teams which leverage ML
Human-in-a-loop: a design pattern for managing teams which leverage ML
 
Humans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AIHumans in a loop: Jupyter notebooks as a front-end for AI
Humans in a loop: Jupyter notebooks as a front-end for AI
 
Humans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industryHumans in the loop: AI in open source and industry
Humans in the loop: AI in open source and industry
 
Computable Content
Computable ContentComputable Content
Computable Content
 
Computable Content: Lessons Learned
Computable Content: Lessons LearnedComputable Content: Lessons Learned
Computable Content: Lessons Learned
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in Python
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analytics
 
Data Science in 2016: Moving Up
Data Science in 2016: Moving UpData Science in 2016: Moving Up
Data Science in 2016: Moving Up
 
Data Science Reinvents Learning?
Data Science Reinvents Learning?Data Science Reinvents Learning?
Data Science Reinvents Learning?
 
Jupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and ErasmusJupyter for Education: Beyond Gutenberg and Erasmus
Jupyter for Education: Beyond Gutenberg and Erasmus
 
GalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About DataGalvanizeU Seattle: Eleven Almost-Truisms About Data
GalvanizeU Seattle: Eleven Almost-Truisms About Data
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big Data
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark Streaming
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML Unpaused
 

Enterprise Data Workflows with Cascading

  • 1. Enterprise Data Workflows with Cascading Document Collection Paco Nathan Scrub Tokenize token M HashJoin Regex Left token GroupBy R Concurrent, Inc. Stop Word token List RHS Count Word Count pnathan@concurrentinc.com @pacoid Copyright @2012, Concurrent, Inc. Monday, 17 December 12 1
  • 2. Unstructured Data meets Enterprise Scale 1. Cascading API: a few facts & quotes 2. Example #1: distributed file copy 3. Example #2: word count 4. Pattern Language: workflow abstraction 5. Compare: Scalding, Cascalog, Hive, Pig Monday, 17 December 12 2
  • 3. Intro to Cascading Document Collection Scrub Tokenize token M HashJoin Regex Left token GroupBy R Stop Word token List RHS Count Word Count Cascading API: a few facts & quotes Monday, 17 December 12 3
  • 4. Enterprise apps, pre-Hadoop SQL queries Data analyst Warehouse ops ETL data data sets sources insights data sources Analytics Apps modeling Tools developer priorities ad-hoc dashboards analysis queries domain Monday, 17 December 12 4
  • 5. Enterprise apps, pre-Hadoop the devil you know: ‣ “scale up” as needed – larger proprietary hardware ‣ data warehouse: e.g., Oracle,Teradata, etc. – expensive ‣ analytics: e.g., SAS, Microstrategy, etc. – expensive ‣ highly trained staff in specific roles – lots of “silos” however, to be competitive now, the data rates must scale by orders of magnitude... ( alternatively, can we get hired onto the SAS sales team? ) Monday, 17 December 12 5
  • 6. Enterprise apps, with Hadoop Apache Hadoop offers an attractive migration path: ‣ open source software – less expensive ‣ commodity hardware – less expensive ‣ fault tolerance for large-scale parallel workloads ‣ great use cases: Yahoo!, Facebook, Twitter, Amazon, Apple, etc. ‣ offload workflows from licensed platforms, based on “scale-out” Monday, 17 December 12 6
  • 7. Enterprise apps, with Hadoop queries, Java job tracker models apps name node Hadoop Cluster analyst developer ETL needs ops Monday, 17 December 12 7
  • 8. Enterprise apps, with Hadoop anything odd about that diagram? queries, models Java apps job tracker name node Hadoop Cluster analyst developer ETL needs ‣ demands expert Hadoop developers ops ‣ experts are hard to find, expensive ‣ even harder to train from among existing staff ‣ early adopter abstractions are not suitable for Enterprise IT ‣ importantly: Hadoop is almost never used in isolation Monday, 17 December 12 8
  • 9. Cascading API: purpose ‣ simplify data processing development and deployment ‣ improve application developer productivity ‣ enable data processing application manageability Monday, 17 December 12 9
  • 10. Cascading API: a few facts Java open source project (ASL 2) using Git, Gradle, Maven, JUnit, etc. in production (~5 yrs) at hundreds of enterprise Hadoop deployments: Finance, Health Care, Transportation, other verticals studies published about large use cases: Twitter, Etsy, eBay, Airbnb, Square, Climate Corp, FlightCaster, Williams-Sonoma, Trulia, TeleNav partnerships and distribution with SpringSource, Amazon AWS, Microsoft Azure, Hortonworks, MapR, EMC several open source projects built atop, managed by Twitter, Etsy, eBay, etc., which provide substantial Machine Learning libraries DSLs available in Scala, Clojure, Python (Jython), Ruby (JRuby), Groovy data “taps” integrate popular data frameworks via JDBC, Memcached, HBase, plus serialization in Apache Thrift, Avro, Kyro, etc. entire app compiles into a single JAR: fully connected for compiler optimization, exception handling, debug, config, scheduling, notifications, provenance, etc. Monday, 17 December 12 10
  • 11. Cascading API: a few quotes “Cascading gives Java developers the ability to build Big Data applications on Hadoop using their existing skillset … Management can really go out and build a team around folks that are already very experienced with Java. Switching over to this is really a very short exercise.” CIO, Thor Olavsrud, 2012-06-06 cio.com/article/707782/Ease_Big_Data_Hiring_Pain_With_Cascading “Masks the complexity of MapReduce, simplifies the programming, and speeds you on your journey toward actionable analytics … A vast improvement over native MapReduce functions or Pig UDFs.” 2012 BOSSIE Awards, James Borck, 2012-09-18 infoworld.com/slideshow/65089 “Company’s promise to application developers is an opportunity to build and test applications on their desktops in the language of choice with familiar constructs and reusable components” Dr. Dobb’s, Adrian Bridgwater, 2012-06-08 drdobbs.com/jvm/where-does-big-data-go-to-get-data-inten/240001759 Monday, 17 December 12 11
  • 12. Enterprise concerns “Notes from the Mystery Machine Bus” by Steve Yegge, Google goo.gl/SeRZa “conservative” “liberal” (mostly) Enterprise (mostly) Start-Up risk management customer experiments assurance flexibility well-defined schema schema follows code explicit configuration convention type-checking compiler interpreted scripts wants no surprises wants no impediments Java, Scala, Clojure, etc. PHP, Ruby, Python, etc. Cascading, Scalding, Cascalog, etc. Hive, Pig, Hadoop Streaming, etc. Monday, 17 December 12 12
  • 13. Enterprise adoption As Enterprise apps move into Hadoop and related BigData frameworks, risk profiles shift toward more conservative programming practices Cascading provides a popular API – formally speaking, as a pattern language – for defining and managing Enterprise data workflows Monday, 17 December 12 13
  • 14. Migration of batch toolsets Enterprise Migration Start-Ups define pipelines J2EE Cascading Pig query data SQL Lingual Hive predictive models SAS Pattern Mahout Monday, 17 December 12 14
  • 15. Summary Cascading API benefits: ‣ addresses staffing bottlenecks due to Hadoop adoption ‣ reduces costs, while servicing risk concerns and “conservatism” ‣ manages complexity as the data continues to scale massively ‣ provides a pattern language for system integration ‣ leverages a workflow abstraction for Enterprise apps ‣ utilizes existing practices for JVM-based clusters Monday, 17 December 12 15
  • 16. Intro to Cascading Document Collection Scrub Tokenize token M HashJoin Regex Left token GroupBy R Stop Word token List RHS Count Word Count Code Example #1: distributed file copy Monday, 17 December 12 16
  • 17. 1: distributed file copy public class   Main   {   public static void   main( String[] args )     {     String inPath = args[ 0 ];     String outPath = args[ 1 ]; Source     Properties props = new Properties();     AppProps.setApplicationJarClass( props, Main.class );     HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );     // create the source tap     Tap inTap = new Hfs( new TextDelimited( true, "t" ), inPath ); M     // create the sink tap     Tap outTap = new Hfs( new TextDelimited( true, "t" ), outPath ); Sink     // specify a pipe to connect the taps     Pipe copyPipe = new Pipe( "copy" );     // connect the taps, pipes, etc., into a flow     FlowDef flowDef = FlowDef.flowDef().setName( "copy" )      .addSource( copyPipe, inTap )      .addTailSink( copyPipe, outTap );     // run the flow     flowConnector.connect( flowDef ).complete(); 1 mapper     }   } 0 reducers 10 lines code Monday, 17 December 12 17
  • 18. 1: distributed file copy shown: ‣ a source tap – input data ‣ a sink tap – output data ‣ a pipe connecting a source to a sink ‣ simplest possible Cascading app not shown: ‣ what kind of taps? and what size of input data set? ‣ could be: JDBC, HBase, Cassandra, XML, flat files, etc. ‣ what kind of topology? and what size of cluster? ‣ could be: Hadoop, in-memory, etc. as system architects, we leverage pattern Monday, 17 December 12 18
  • 19. principle: same JAR, any scale MegaCorp Enterprise IT: Pb’s data 1000+ node private cluster EVP calls you when app fails runtime: days+ Production Cluster: Tb’s data EMR w/ 50 HPC Instances Ops monitors results runtime: hours – days Staging Cluster: Gb’s data EMR + 4 Spot Instances CI shows red or green lights runtime: minutes – hours Your Laptop: Mb’s data Hadoop standalone mode passes unit tests, or not runtime: seconds – minutes Monday, 17 December 12 19
  • 20. principle: fail the same way twice troubleshooting at scale: ‣ physical plan for a query provides a deterministic strategy ‣ avoid non-deterministic behavior – expensive when troubleshooting ‣ otherwise, edge cases become nightmares on large clusters ‣ again, addresses “conservative” need for predictability ‣ a core value which is unique to Cascading Monday, 17 December 12 20
  • 21. principle: plan ahead flow planner per topology: ‣ leverage the flow graph (DAG) ‣ catch as many errors as possible before an app gets submitted ‣ potential problems caught at compile time or at flow planner stage ‣ …long before large, expensive resources start getting consumed ‣ …or worse, before the wrong results get propagated downstream Monday, 17 December 12 21
  • 22. Intro to Cascading Document Collection Scrub Tokenize token M HashJoin Regex Left token GroupBy R Stop Word token List RHS Count Word Count Code Example #2: word count Monday, 17 December 12 22
  • 23. 2: word count defined: count how often each word appears in a collection of text documents a simple program provides a great test case for parallel processing, since it illustrates: ‣ requires a minimal amount of code ‣ demonstrates use of both symbolic and numeric values ‣ shows a dependency graph of tuples as an abstraction ‣ is not many steps away from useful search indexing ‣ serves as a “Hello World” for Hadoop apps any distributed computing framework which runs Word Count efficiently in parallel at scale, can handle much larger, more interesting compute problems Monday, 17 December 12 23
  • 24. 2: word count Document Collection Tokenize GroupBy M token Count R Word Count 1 mapper 1 reducer 18 lines code gist.github.com/3900702 Monday, 17 December 12 24
  • 25. 2: word count Document Collection M Tokenize GroupBy token Count String docPath = args[ 0 ]; R Word Count String wcPath = args[ 1 ]; Properties properties = new Properties(); AppProps.setApplicationJarClass( properties, Main.class ); HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties ); // create source and sink taps Tap docTap = new Hfs( new TextDelimited( true, "t" ), docPath ); Tap wcTap = new Hfs( new TextDelimited( true, "t" ), wcPath ); // specify a regex to split "document" text lines into token stream Fields token = new Fields( "token" ); Fields text = new Fields( "text" ); RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ [](),.]" ); // only returns "token" Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS ); // determine the word counts Pipe wcPipe = new Pipe( "wc", docPipe ); wcPipe = new GroupBy( wcPipe, token ); wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL ); // connect the taps, pipes, etc., into a flow FlowDef flowDef = FlowDef.flowDef().setName( "wc" ) .addSource( docPipe, docTap )  .addTailSink( wcPipe, wcTap ); // write a DOT file and run the flow Flow wcFlow = flowConnector.connect( flowDef ); wcFlow.writeDOT( "dot/wc.dot" ); wcFlow.complete(); Monday, 17 December 12 25
  • 26. 2: word count [head] Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']'] [{2}:'doc_id', 'text'] [{2}:'doc_id', 'text'] map Each('token')[RegexSplitGenerator[decl:'token'][args:1]] [{1}:'token'] [{1}:'token'] GroupBy('wc')[by:['token']] wc[{1}:'token'] [{1}:'token'] reduce Every('wc')[Count[decl:'count']] [{2}:'token', 'count'] [{1}:'token'] Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']'] 1 mapper [{2}:'token', 'count'] 1 reducer [{2}:'token', 'count'] 18 lines code [tail] Monday, 17 December 12 26
  • 27. 2: word count deltas between Example #1 and Example #2: ‣ defines source tap as a collection of text documents ‣ defines sink tap to produce word count tuples (desired end result) ‣ uses named fields, applying structure to unstructured data ‣ adds semantics to the workflow, specifying business logic ‣ inserts operations into the pipe: Tokenize, GroupBy, Count ‣ shows function and aggregation applied to data tuples in parallel Document Collection Source Tokenize GroupBy M token Count M Sink R Word Count Monday, 17 December 12 27
  • 28. Intro to Cascading Document Collection Scrub Tokenize token M HashJoin Regex Left token GroupBy R Stop Word token List RHS Count Word Count Pattern Language: the workflow abstraction Monday, 17 December 12 28
  • 29. enterprise data workflows Tuples, Pipelines, Taps, Operations, Joins, Assertions, Traps, etc. …in other words, “plumbing” as a pattern language for handling Big Data in Enterprise IT Document Collection Scrub Tokenize token M HashJoin Regex Left token GroupBy R Stop Word token List RHS Count Word Count Monday, 17 December 12 29
  • 30. pattern language defined: a structured method for solving large, complex design problems, where the syntax of the language promotes the use of best practices “plumbing” metaphor of pipes and operators in Cascading helps indicate: algorithms to be used at particular points, appropriate architectural trade-offs, frameworks which must be integrated, etc. design patterns: originated in consensus negotiation for architecture, later used in software engineering wikipedia.org/wiki/Pattern_language Monday, 17 December 12 30
  • 31. data workflows: team ‣ Business Stakeholder POV: business process management for workflow orchestration (think BPM/BPEL) ‣ Systems Integrator POV: system integration of heterogenous data sources and compute platforms ‣ Data Scientist POV: a directed, acyclic graph (DAG) on which we can apply Amdahl's Law, etc. ‣ Data Architect POV: a physical plan for large-scale data flow management ‣ Software Architect POV: a pattern language, similar to plumbing or circuit design Document Collection ‣ App Developer POV: M Tokenize Scrub token API bindings for Java, Scala, Clojure, Jython, JRuby, etc. Stop Word List HashJoin Left RHS Regex token GroupBy token R Count ‣ Systems Engineer POV: Word Count a JAR file, has passed CI, available in a Maven repo Monday, 17 December 12 31
• 32. data workflows: layers
business process: domain expertise, business trade-offs, operating parameters, market position, etc.
API language: Java, Scala, Clojure, Jython, JRuby, Groovy, etc. …envision whatever runs in a JVM
optimize / schedule: major changes in technology now
physical plan: “assembler” code
topology: Apache Hadoop, in-memory local mode …envision GPUs, streaming, etc.
machine data: Splunk, New Relic, Typesafe, Nagios, etc.
[workflow diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (Stop Word List) → Regex token → GroupBy token → Count → Word Count]
Monday, 17 December 12 32
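As one concrete illustration of the "topology" layer (not code from the talk): the same flow definition can be planned onto Apache Hadoop or onto the in-memory local mode simply by choosing a different flow connector. A minimal sketch, assuming the Cascading 2.x API; class names may differ across versions, and the taps used in the FlowDef must also match the chosen platform (e.g., Hfs vs. FileTap):

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.flow.local.LocalFlowConnector;

public class TopologyChoice {
  public static Flow plan( FlowDef flowDef, boolean runOnHadoop ) {
    Properties properties = new Properties();

    // the "physical plan" is produced by whichever planner backs the connector
    FlowConnector connector = runOnHadoop
      ? new HadoopFlowConnector( properties )   // Apache Hadoop topology
      : new LocalFlowConnector( properties );   // in-memory local mode

    return connector.connect( flowDef );
  }
}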
• 33. data workflows: example
[architecture diagram: web logs and customer profile DBs feed source taps into a Cascading app running on a Hadoop cluster; a sink tap feeds a Memcached cluster behind a web API serving customers via a recommender system; a trap tap routes bad records to customer support review]
Monday, 17 December 12 33
• 34. data workflows: SQL vs. JVM
abstraction      SQL
parser           SQL parser
optimizer        logical plan, optimized based on stats
planner          physical plan
machine data     query history, table stats
topology         b-trees, etc.
visualization    ERD
schema           table schema
catalog          relational catalog
Monday, 17 December 12 34
• 35. data workflows: SQL vs. JVM
abstraction      SQL                                       JVM
parser           SQL parser                                SQL-92 compliant parser (in progress)
optimizer        logical plan, optimized based on stats    logical plan, optimized based on stats
planner          physical plan                             API “plumbing”
machine data     query history, table stats                app history, tuple stats
topology         b-trees, etc.                             heterogeneous, distributed: Hadoop, in-memory, etc.
visualization    ERD                                       flow diagram
schema           table schema                              tuple schema
catalog          relational catalog                        tap usage DB
Monday, 17 December 12 35
• 36. Cascading taxonomy
[taxonomy diagram: the Cascading scheduler runs an app; an app instance (with an owner) is deployed from a JAR in a Maven repo; an app is composed of flows, a flow of steps, and a step of slices; flows read from source taps, write to sink taps, and may route failures to trap taps; slice kind: mapper | reducer; topology: hadoop | local]
Monday, 17 December 12 36
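To make the app / flow levels of that taxonomy concrete, here is a minimal sketch (not from the talk) assuming the Cascading 2.x CascadeConnector, which schedules multiple flows as one app:

import cascading.cascade.Cascade;
import cascading.cascade.CascadeConnector;
import cascading.flow.Flow;

public class AppExample {
  public static void run( Flow docFlow, Flow wcFlow ) {
    // wire the flows into a single app; inter-flow dependencies are inferred from their taps
    Cascade cascade = new CascadeConnector().connect( docFlow, wcFlow );
    cascade.complete();   // the scheduler runs each flow once its sources are available
  }
}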
• 37. MapReduce architecture
‣ name node / data node
‣ job tracker / task tracker
‣ submit queue
‣ task slots
‣ HDFS
‣ distributed cache
(diagram sources: Wikipedia, Apache)
Monday, 17 December 12 37
• 38. Summary
If you were leading a team responsible for Enterprise apps:
‣ which of the previous two slides seems easier to understand?
‣ which is simpler to use for training and managing a team?
‣ which costs the most in the long run?
Monday, 17 December 12 38
• 39. Intro to Cascading
[workflow diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (Stop Word List on the RHS) → Regex token → GroupBy token → Count → Word Count]
Compare & Contrast: other approaches
Monday, 17 December 12 39
• 40. wc: pseudocode   [workflow: Document Collection → Tokenize → GroupBy token → Count → Word Count]

void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");

void reduce (String word, Iterator partial_counts):
  int count = 0;
  for each pc in partial_counts:
    count += Int(pc);
  emit(word, String(count));

Monday, 17 December 12 40
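For comparison with the abstractions on the following slides, the pseudocode above corresponds roughly to the classic Hadoop MapReduce WordCount in Java. This is a sketch, not code from the talk; it assumes the Hadoop 2.x mapreduce API and tokenizes on whitespace rather than the regex used in Example #2:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable( 1 );
    private final Text word = new Text();

    @Override
    public void map( Object key, Text value, Context context ) throws IOException, InterruptedException {
      // segment(text): naive whitespace tokenization
      StringTokenizer itr = new StringTokenizer( value.toString() );
      while( itr.hasMoreTokens() ) {
        word.set( itr.nextToken() );
        context.write( word, one );   // emit(w, "1")
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce( Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException {
      int count = 0;
      for( IntWritable pc : values )
        count += pc.get();            // count += Int(pc)
      context.write( key, new IntWritable( count ) );   // emit(word, count)
    }
  }

  public static void main( String[] args ) throws Exception {
    Job job = Job.getInstance( new Configuration(), "word count" );
    job.setJarByClass( WordCount.class );
    job.setMapperClass( TokenizerMapper.class );
    job.setCombinerClass( IntSumReducer.class );
    job.setReducerClass( IntSumReducer.class );
    job.setOutputKeyClass( Text.class );
    job.setOutputValueClass( IntWritable.class );
    FileInputFormat.addInputPath( job, new Path( args[ 0 ] ) );
    FileOutputFormat.setOutputPath( job, new Path( args[ 1 ] ) );
    System.exit( job.waitForCompletion( true ) ? 0 : 1 );
  }
}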
• 41. Scalding / Scala   [workflow: Tokenize → GroupBy token → Count]

// Sujit Pal
// sujitpal.blogspot.com/2012/08/scalding-for-impatient.html
package com.mycompany.impatient

import com.twitter.scalding._

class Part2(args : Args) extends Job(args) {
  val input = Tsv(args("input"), ('docId, 'text))
  val output = Tsv(args("output"))
  input.read.
    flatMap('text -> 'word) { text : String => text.split("""\s+""") }.
    groupBy('word) { group => group.size }.
    write(output)
}

Monday, 17 December 12 41
• 42. Scalding / Scala
github.com/twitter/scalding/wiki
notes:
‣ code is compact, easy to understand
‣ functional programming is great for expressing complex workflows in MapReduce, etc.
‣ very large-scale, complex problems can be handled in just a few lines of code
‣ many large-scale apps in production deployments
‣ significant investments by Twitter, Etsy, eBay, etc., in this open source project
‣ extensive libraries are available for linear algebra, machine learning – e.g., “Matrix API”
Monday, 17 December 12 42
• 43. Cascalog / Clojure   [workflow: Tokenize → GroupBy token → Count]

; Paul Lam
; github.com/Quantisan/Impatient
(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\](),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))

Monday, 17 December 12 43
• 44. Cascalog / Clojure
github.com/nathanmarz/cascalog/wiki
notes:
‣ code is compact, easy to understand
‣ functional programming is great for expressing complex workflows in MapReduce, etc.
‣ significant investments by Twitter, Climate Corp, etc., in this open source project
‣ can run queries from the Clojure REPL
‣ compelling for very large-scale use cases where code correctness can be verified before deployment
Monday, 17 December 12 44
• 45. Apache Hive   [workflow: Tokenize → GroupBy token → Count]

-- Steve Severance
-- stackoverflow.com/questions/10039949/word-count-program-in-hive
CREATE TABLE input (line STRING);
LOAD DATA LOCAL INPATH 'input.tsv' OVERWRITE INTO TABLE input;

SELECT word, COUNT(*)
FROM input
LATERAL VIEW explode(split(line, ' ')) lTable AS word
GROUP BY word;

Monday, 17 December 12 45
• 46. Apache Hive
hive.apache.org
pro:
‣ most popular abstraction atop Apache Hadoop
‣ SQL-like language is syntactically familiar to most analysts
‣ simple to load large-scale unstructured data and run ad-hoc queries
con:
‣ not a relational engine, many surprises at scale
‣ difficult to represent complex workflows, ML algorithms, etc.
‣ one poorly-trained analyst can bottleneck an entire cluster
‣ app-level integration requires other coding, outside of the script language
‣ logical planner mixed with physical planner; cannot collect app stats
‣ non-deterministic exec: number of mappers+reducers changes unexpectedly
‣ business logic must cross multiple language boundaries: difficult to troubleshoot, optimize, audit, handle exceptions, set notifications, etc.
Monday, 17 December 12 46
• 47. Apache Pig   [workflow: Tokenize → GroupBy token → Count]

-- kudos to Dmitriy Ryaboy
docPipe = LOAD '$docPath' USING PigStorage('\t', 'tagsource') AS (doc_id, text);
docPipe = FILTER docPipe BY doc_id != 'doc_id';

-- specify a regex to split "document" text lines into a token stream
tokenPipe = FOREACH docPipe GENERATE doc_id, FLATTEN(TOKENIZE(text, ' [](),.')) AS token;
tokenPipe = FILTER tokenPipe BY token MATCHES '\\w.*';

-- determine the word counts
tokenGroups = GROUP tokenPipe BY token;
wcPipe = FOREACH tokenGroups GENERATE group AS token, COUNT(tokenPipe) AS count;

-- output
STORE wcPipe INTO '$wcPath' USING PigStorage('\t', 'tagsource');
EXPLAIN -out dot/wc_pig.dot -dot wcPipe;

Monday, 17 December 12 47
• 48. Apache Pig
pig.apache.org
pro:
‣ easy to learn data manipulation language (DML)
‣ interactive prompt (Grunt) makes it simple to prototype apps
‣ extensibility through UDFs
con:
‣ not a full programming language; must extend via UDFs outside of the language
‣ app-level integration requires other coding, outside of the script language
‣ simple problems are simple to do; hard problems become quite complex
‣ difficult to parameterize scripts externally; must rewrite to change taps!
‣ logical planner mixed with physical planner; cannot collect app stats
‣ non-deterministic exec: number of mappers+reducers changes unexpectedly
‣ business logic must cross multiple language boundaries: difficult to troubleshoot, optimize, audit, handle exceptions, set notifications, etc.
Monday, 17 December 12 48
• 49. Intro to Cascading
[workflow diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (Stop Word List on the RHS) → Regex token → GroupBy token → Count → Word Count]
Code Example #N: city of palo alto, etc.
Monday, 17 December 12 49
• 50. extend: wc + scrub + stop words
[workflow diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (Stop Word List on the RHS) → Regex token → GroupBy token → Count → Word Count]
1 mapper, 1 reducer, 28+10 lines of code
Monday, 17 December 12 50
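The scrub and stop-word additions on this slide correspond roughly to the following Cascading assembly, continuing from the Example #2 code on slide 25. This is a sketch loosely based on the "Impatient" tutorial series, not code shown here; ScrubFunction stands in for a hypothetical custom Function, and the field names are illustrative:

// scrub each token (e.g., lower-case, drop empty/invalid tokens)
// ScrubFunction is a hypothetical custom cascading.operation.Function
Fields token = new Fields( "token" );
Fields scrubArguments = new Fields( "doc_id", "token" );
docPipe = new Each( docPipe, scrubArguments, new ScrubFunction( scrubArguments ), Fields.RESULTS );

// the stop-word list arrives on the right-hand side ("RHS") of the join
Fields stop = new Fields( "stop" );
Pipe stopPipe = new Pipe( "stop" );

// left join keeps every token; a non-empty "stop" field means the token matched the list
Pipe tokenPipe = new HashJoin( docPipe, token, stopPipe, stop, new LeftJoin() );

// keep only tuples whose "stop" field is empty (token not in the stop-word list),
// then drop the join field before the GroupBy/Count from Example #2
tokenPipe = new Each( tokenPipe, stop, new RegexFilter( "^$" ) );
tokenPipe = new Retain( tokenPipe, token );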
• 51. extend: a simple search engine
[workflow diagram: after the scrub/stop-word steps the token pipe branches — one branch computes TF (CountBy on doc_id, token), another computes DF (Unique + CountBy on token), another computes D (Unique doc_id + Insert + SumBy); a CoGroup joins the branches and an ExprFunc computes tf-idf, followed by a CountBy/Sort of the results]
10 mappers, 8 reducers, 68+14 lines of code
Monday, 17 December 12 51
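Within that flow, the tf-idf weight itself reduces to a single ExpressionFunction once the TF, DF, and D branches have been joined. A rough sketch with illustrative field names, loosely based on the "Impatient" tutorial rather than code shown here:

// declare the output field and the tf-idf expression over the joined fields
Fields tfidf = new Fields( "tfidf" );
String expression = "(double) tf_count * Math.log( (double) n_docs / ( 1.0 + df_count ) )";
ExpressionFunction tfidfExpression = new ExpressionFunction( tfidf, expression, Double.class );

// joinedPipe (hypothetical name) carries the CoGroup'd TF / DF / D tuples
Fields tfidfArguments = new Fields( "doc_id", "tf_token", "tf_count", "df_count", "n_docs" );
Pipe tfidfPipe = new Each( joinedPipe, tfidfArguments, tfidfExpression, Fields.ALL );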
• 52. City of Palo Alto open data
[workflow diagram: the CoPA GIS export is parsed and filtered into tree, road, and park branches; tree and road metadata are joined in (HashJoin, CoGroup) with geohash indexing; tree distance/species/height and estimated road albedo combine into a shade recommendation; GPS log failures are routed to traps]
github.com/Cascading/CoPA/wiki
‣ GIS export for parks, roads, trees (unstructured / open data)
‣ log files of personalized/frequented locations in Palo Alto via iPhone GPS tracks
‣ curated metadata, used to enrich the dataset
‣ could extend via mash-up with many available public data APIs
Enterprise-scale app: road albedo + tree species metadata + geospatial indexing
“Find a shady spot on a summer day to walk near downtown and take a call…”
Monday, 17 December 12 52
  • 53. CoPA: log events Monday, 17 December 12 53
• 54. CoPA: results
[plot: density of estimated tree height (meters) vs. avg_height, shaded by count]
‣ addr: 115 HAWTHORNE AVE
‣ lat/lng: 37.446, -122.168
‣ geohash: 9q9jh0
‣ tree: 413 site 2
‣ species: Liquidambar styraciflua
‣ avg height 23 m
‣ road albedo: 0.12
‣ distance: 10 m
‣ a short walk from my train stop ✔
Monday, 17 December 12 54
• 55. Intro to Cascading
[workflow diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (Stop Word List on the RHS) → Regex token → GroupBy token → Count → Word Count]
PMML: predictive modeling
Monday, 17 December 12 55
  • 56. PMML model Monday, 17 December 12 56
• 57. cascading.pattern example:
1. use customer order history as the training data set
2. train a risk classifier for orders, using Random Forest
3. export model from R to PMML
4. build a Cascading app to execute the PMML model
   4.1. generate a pipeline from the PMML description
   4.2. planner builds the flow for a topology (Hadoop)
   4.3. compile app to a JAR file
5. deploy the app at scale to calculate scores
Monday, 17 December 12 57
• 58. cascading.pattern
[architecture diagram: an analyst trains risk classifiers (customer 360 and per-order dimensions) on a laptop and exports PMML models; Cascading apps handle data prep, scoring of new orders against costs, anomaly detection for fraudsters, and customer segmentation / velocity metrics; batch workloads run on Hadoop against the Customer DB, DW, ETL, and partner data (chargebacks, etc.), while real-time workloads score against an IMDG]
Monday, 17 December 12 58
  • 59. 1: “orders” data set... train/test in R... exported as PMML Monday, 17 December 12 59
• 60. R modeling

## train a RandomForest model
f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=50)

## test the model on the holdout test set
print(fit$importance)
print(fit)
predicted <- predict(fit, data)
data$predicted <- predicted
confuse <- table(pred = predicted, true = data[,1])
print(confuse)

## export predicted labels to TSV
write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
  quote=FALSE, sep="\t", row.names=FALSE)

## export RF model to PMML
saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))

Monday, 17 December 12 60
• 61. R output

         MeanDecreaseGini
var0            0.6591701
var1           33.8625179
var2            8.0290020

OOB estimate of error rate: 13.83%
Confusion matrix:
      0    1   class.error
0    28    5     0.1515152
1     8   53     0.1311475

[1] "./data/sample.rf.xml"

Monday, 17 December 12 61
  • 62. 2: Cascading app takes PMML as a parameter... Monday, 17 December 12 62
• 63. PMML model

<?xml version="1.0"?>
<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.dmg.org/PMML-4_0 http://www.dmg.org/v4-0/pmml-4-0.xsd">
 <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
  <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
  <Application name="Rattle/PMML" version="1.2.30"/>
  <Timestamp>2012-10-22 19:39:28</Timestamp>
 </Header>
 <DataDictionary numberOfFields="4">
  <DataField name="label" optype="categorical" dataType="string">
   <Value value="0"/>
   <Value value="1"/>
  </DataField>
  <DataField name="var0" optype="continuous" dataType="double"/>
  <DataField name="var1" optype="continuous" dataType="double"/>
  <DataField name="var2" optype="continuous" dataType="double"/>
 </DataDictionary>
 <MiningModel modelName="randomForest_Model" functionName="classification">
  <MiningSchema>
   <MiningField name="label" usageType="predicted"/>
   <MiningField name="var0" usageType="active"/>
   <MiningField name="var1" usageType="active"/>
   <MiningField name="var2" usageType="active"/>
  </MiningSchema>
  <Segmentation multipleModelMethod="majorityVote">
   <Segment id="1">
    <True/>
    <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit">
     <MiningSchema>
      <MiningField name="label" usageType="predicted"/>
      <MiningField name="var0" usageType="active"/>
      <MiningField name="var1" usageType="active"/>
      <MiningField name="var2" usageType="active"/>
     </MiningSchema>
...

Monday, 17 December 12 63
• 64. Cascading app

public class Main {
  public static void main( String[] args ) {
    String pmmlPath = args[ 0 ];
    String ordersPath = args[ 1 ];
    String classifyPath = args[ 2 ];
    String trapPath = args[ 3 ];

    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create source and sink taps
    Tap ordersTap = new Hfs( new TextDelimited( true, "\t" ), ordersPath );
    Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
    Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );

    // define a "Classifier" model from PMML to evaluate the orders
    Classifier classifier = new Classifier( pmmlPath );
    Pipe classifyPipe = new Each( new Pipe( "classify" ), classifier.getFields(),
      new ClassifierFunction( new Fields( "score" ), classifier ), Fields.ALL );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
      .addSource( classifyPipe, ordersTap )
      .addTrap( classifyPipe, trapTap )
      .addSink( classifyPipe, classifyTap );

    // write a DOT file and run the flow
    Flow classifyFlow = flowConnector.connect( flowDef );
    classifyFlow.writeDOT( "dot/classify.dot" );
    classifyFlow.complete();
  }
}

Monday, 17 December 12 64
  • 65. 3: app deployed on a cluster to score customers at scale... Monday, 17 December 12 65
• 66. deploy to cloud

elastic-mapreduce --create --name "RF" \
  --jar s3n://temp.cascading.org/pattern/pattern.jar \
  --arg s3n://temp.cascading.org/pattern/sample.rf.xml \
  --arg s3n://temp.cascading.org/pattern/sample.tsv \
  --arg s3n://temp.cascading.org/pattern/out/classify \
  --arg s3n://temp.cascading.org/pattern/out/trap

aws.amazon.com/elasticmapreduce/

Monday, 17 December 12 66
• 67. results

bash-3.2$ head output/classify/part-00000
label  var0  var1  var2  order_id  predicted  score
1      0     1     0     6f8e1014  1          1
0      0     0     1     6f8ea22e  0          0
1      0     1     0     6f8ea435  1          1
0      0     0     1     6f8ea5e1  0          0
1      0     1     0     6f8ea785  1          1
1      0     1     0     6f8ea91e  1          1
0      1     0     0     6f8eaaba  0          0
1      0     1     0     6f8eac54  1          1
0      1     1     0     6f8eade3  1          1

Monday, 17 December 12 67
• 68. drill-down
blog, code/wiki/gists, JARs, community, DevOps products:
cascading.org
github.org/Cascading
conjars.org
meetup.com/cascading
goo.gl/KQtUL
concurrentinc.com
pnathan@concurrentinc.com
@pacoid
Copyright @2012, Concurrent, Inc.
Monday, 17 December 12 68