“Pattern –
              an open source project for migrating
              predictive models onto Apache Hadoop”

                  Paco Nathan
                  Concurrent, Inc.
                  San Francisco, CA
                  @pacoid




                 Copyright @2013, Concurrent, Inc.




Sunday, 17 March 13                                   1
Pattern: predictive models at scale

            • Enterprise Data Workflows
            • Sample Code
            • A Little Theory…
            • Pattern
            • PMML
            • Roadmap
            • Customer Experiments

            [Figure: word count flow diagram: Document Collection, Tokenize, Scrub token, HashJoin (Left; RHS: Stop Word List), Regex token, GroupBy token, Count, Word Count; M and R mark the map and reduce phases]

Cascading – origins

           API author Chris Wensel worked as a system architect
           at an Enterprise firm well-known for many popular
           data products.
           Wensel was following the Nutch open source project –
           where Hadoop started.
           Observation: would be difficult to find Java developers
           to write complex Enterprise apps in MapReduce –
           potential blocker for leveraging new open source
           technology.




Cascading – functional programming

           Key insight: MapReduce is based on functional programming
           – back to LISP in 1970s. Apache Hadoop use cases are
           mostly about data pipelines, which are functional in nature.
           To ease staffing problems as “Main Street” Enterprise firms
           began to embrace Hadoop, Cascading was introduced
           in late 2007, as a new Java API to implement functional
           programming for large-scale data workflows:

             • leverages JVM and Java-based tools without any
               need to create new languages
             • allows programmers who have J2EE expertise
               to leverage the economics of Hadoop clusters
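
To make the "data pipelines are functional in nature" point concrete, here is a minimal sketch in plain Java. It does not use the Cascading API; the class name, stage names, and sample records are all hypothetical. Each stage is a pure function over a list of records, and the workflow is simply the composition of those stages.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration: none of these names come from Cascading.
class PipelineSketch {
    // Each stage is a pure function from one list of records to another.
    static final Function<List<String>, List<String>> TRIM =
        lines -> lines.stream().map(String::trim).collect(Collectors.toList());

    static final Function<List<String>, List<String>> DROP_EMPTY =
        lines -> lines.stream().filter(s -> !s.isEmpty()).collect(Collectors.toList());

    static final Function<List<String>, List<String>> LOWERCASE =
        lines -> lines.stream().map(s -> s.toLowerCase(Locale.ROOT)).collect(Collectors.toList());

    // The workflow is the composition of its stages; no shared mutable state.
    static final Function<List<String>, List<String>> WORKFLOW =
        TRIM.andThen(DROP_EMPTY).andThen(LOWERCASE);

    public static void main(String[] args) {
        // prints [rain, shadow]
        System.out.println(WORKFLOW.apply(Arrays.asList("  Rain ", "", " Shadow")));
    }
}
```

Because every stage is side-effect free, stages can be tested in isolation and recombined, which is the property a workflow API can exploit when planning jobs.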




functional programming… in production

             • Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc.,
               have invested in open source projects atop Cascading
               – used for their large-scale production deployments
             • new case studies for Cascading apps are mostly
               based on domain-specific languages (DSLs) in JVM
               languages which emphasize functional programming:

                 Cascalog in Clojure (2010)
                 Scalding in Scala (2012)


           github.com/nathanmarz/cascalog/wiki
           github.com/twitter/scalding/wiki




Cascading – definitions

             • a pattern language for Enterprise Data Workflows
             • simple to build, easy to test, robust in production
             • design principles ⟹ ensure best practices at scale

             [Figure: Data Workflow on a Hadoop Cluster, linked by source, sink, and trap taps to: Customers, Web App, logs, Cache, Support, Modeling (PMML), Analytics Cubes, Reporting, Customer Prefs, customer profile DBs]

Cascading – usage

             • Java API, DSLs in Scala, Clojure,
               Jython, JRuby, Groovy, ANSI SQL
             • ASL 2 license, GitHub src,
               http://conjars.org
             • 5+ yrs production use,
               multiple Enterprise verticals

Cascading – integrations

             • partners: Microsoft Azure, Hortonworks,
               Amazon AWS, MapR, EMC, SpringSource,
               Cloudera
             • taps: Memcached, Cassandra, MongoDB,
               HBase, JDBC, Parquet, etc.
             • serialization: Avro, Thrift, Kryo,
               JSON, etc.
             • topologies: Apache Hadoop,
               tuple spaces, local mode

Cascading – deployments

             • case studies: Climate Corp, Twitter, Etsy,
                 Williams-Sonoma, uSwitch, Airbnb, Nokia,
                 YieldBot, Square, Harvard, etc.
             • use cases: ETL, marketing funnel, anti-fraud,
                 social media, retail pricing, search analytics,
                 recommenders, eCRM, utility grids, telecom,
                 genomics, climatology, agronomics, etc.




Cascading – deployments

             workflow abstraction addresses:
             • staffing bottleneck;
             • system integration;
             • operational complexity;
             • test-driven development

Pattern: predictive models at scale

            • Enterprise Data Workflows
            • Sample Code
            • A Little Theory…
            • Pattern
            • PMML
            • Roadmap
            • Customer Experiments

The Ubiquitous Word Count

           Definition:
             count how often each word appears
             in a collection of text documents

           This simple program provides an excellent test case for
           parallel processing, since it:

            • requires a minimal amount of code
            • demonstrates use of both symbolic and numeric values
            • shows a dependency graph of tuples as an abstraction
            • is not many steps away from useful search indexing
            • serves as a “Hello World” for Hadoop apps

           Pseudocode:

             void map (String doc_id, String text):
               for each word w in segment(text):
                 emit(w, "1");

             void reduce (String word, Iterator group):
               int count = 0;
               for each pc in group:
                 count += Int(pc);
               emit(word, String(count));

           Any distributed computing framework which can run Word
           Count efficiently in parallel at scale can handle much
           larger and more interesting compute problems.
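
The map and reduce steps described on this slide can be simulated in a single process. The sketch below is plain Java, not Hadoop or Cascading code: the class name, the `run` driver, and the sample documents are hypothetical, and splitting on whitespace stands in for the unspecified `segment(text)` step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process sketch; not the Hadoop or Cascading API.
class MapReduceWordCountSketch {

    // map: emit a (word, "1") pair for each word in the text.
    // Assumption: whitespace splitting stands in for segment(text).
    static List<String[]> map(String docId, String text) {
        List<String[]> pairs = new ArrayList<>();
        for (String w : text.split("\\s+")) {
            if (!w.isEmpty()) {
                pairs.add(new String[] { w, "1" });
            }
        }
        return pairs;
    }

    // reduce: sum the partial counts collected for one word.
    static String reduce(String word, Iterable<String> group) {
        int count = 0;
        for (String pc : group) {
            count += Integer.parseInt(pc);
        }
        return String.valueOf(count);
    }

    // driver: map every document, group pairs by key (the "shuffle"),
    // then reduce each group to a single count.
    static Map<String, String> run(Map<String, String> docs) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String[] pair : map(doc.getKey(), doc.getValue())) {
                groups.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
            }
        }
        Map<String, String> counts = new TreeMap<>();
        for (Map.Entry<String, List<String>> g : groups.entrySet()) {
            counts.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new TreeMap<>();
        docs.put("doc01", "rain shadow rain");
        docs.put("doc02", "shadow of a mountain");
        // prints {a=1, mountain=1, of=1, rain=2, shadow=2}
        System.out.println(run(docs));
    }
}
```

In a real cluster the grouping step is performed by the framework between the map and reduce phases; here it is an in-memory map so the data dependency is easy to see.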


word count – conceptual flow diagram


               [Figure: Document Collection → Tokenize → GroupBy token → Count → Word Count; M and R mark the map and reduce phases]

              1 map
              1 reduce
             18 lines code

              cascading.org/category/impatient
              gist.github.com/3900702


word count – Cascading app in Java

           String docPath = args[ 0 ];
           String wcPath = args[ 1 ];

           Properties properties = new Properties();
           AppProps.setApplicationJarClass( properties, Main.class );
           HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

           // create source and sink taps (tab-delimited text with a header row)
           Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
           Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

           // specify a regex to split "document" text lines into token stream
           Fields token = new Fields( "token" );
           Fields text = new Fields( "text" );
           RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
           // only returns "token"
           Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

           // determine the word counts
           Pipe wcPipe = new Pipe( "wc", docPipe );
           wcPipe = new GroupBy( wcPipe, token );
           wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

           // connect the taps, pipes, etc., into a flow
           FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
            .addSource( docPipe, docTap )
            .addTailSink( wcPipe, wcTap );

           // write a DOT file and run the flow
           Flow wcFlow = flowConnector.connect( flowDef );
           wcFlow.writeDOT( "dot/wc.dot" );
           wcFlow.complete();
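
To see what the tokenizing step does to a line of text, the delimiter pattern intended for `RegexSplitGenerator` can be tried on its own with `String.split`. This is a sketch in plain Java, not Cascading code: the class, the helper, and the sample sentence are hypothetical, the pattern is written in its escaped form as an assumption, and dropping empty strings is a choice made here for readability rather than something the slide specifies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Cascading: applies the delimiter
// pattern with String.split to show which tokens it yields.
class TokenizeSketch {
    // Assumed escaped form of the delimiter class:
    // space, square brackets, parentheses, comma, period.
    static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split(DELIMITERS)) {
            // Adjacent delimiters produce empty strings; this sketch skips them.
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [A, rain, shadow, is, a, dry, area, see, 1, on, the, lee, side]
        System.out.println(tokenize("A rain shadow is a dry area (see [1]), on the lee side."));
    }
}
```

Note that the pattern matches a single delimiter character at a time, so runs of punctuation yield empty tokens unless something downstream filters them.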



word count – generated flow diagram

           [head]

           Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']

             [{2}:'doc_id', 'text']
             [{2}:'doc_id', 'text']

           map
           Each('token')[RegexSplitGenerator[decl:'token'][args:1]]

             [{1}:'token']
             [{1}:'token']

           GroupBy('wc')[by:['token']]

             wc[{1}:'token']
             [{1}:'token']

           reduce
           Every('wc')[Count[decl:'count']]

             [{2}:'token', 'count']
             [{1}:'token']

           Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']

             [{2}:'token', 'count']
             [{2}:'token', 'count']

           [tail]


word count – Cascalog / Clojure

           (ns impatient.core
             (:use [cascalog.api]
                   [cascalog.more-taps :only (hfs-delimited)])
             (:require [clojure.string :as s]
                       [cascalog.ops :as c])
             (:gen-class))

           (defmapcatop split [line]
             "reads in a line of string and splits it by regex"
             (s/split line #"[\[\]\\\(\),.)\s]+"))

           (defn -main [in out & args]
             (?<- (hfs-delimited out)
                  [?word ?count]
                  ((hfs-delimited in :skip-header? true) _ ?line)
                  (split ?line :> ?word)
                  (c/count ?count)))

           ; Paul Lam
           ; github.com/Quantisan/Impatient




word count – Cascalog / Clojure

            github.com/nathanmarz/cascalog/wiki

             • implements Datalog in Clojure, with predicates backed
               by Cascading – for a highly declarative language
             • run ad-hoc queries from the Clojure REPL –
               approx. 10:1 code reduction compared with SQL
             • composable subqueries, used for test-driven development
               (TDD) practices at scale
             • Leiningen build: simple, no surprises, in Clojure itself
             • more new deployments than other Cascading DSLs –
               Climate Corp is largest use case: 90% Clojure/Cascalog
             • has a learning curve, limited number of Clojure developers
             • aggregators are the magic, and those take effort to learn




word count – Scalding / Scala

          import com.twitter.scalding._

          class WordCount(args : Args) extends Job(args) {
            Tsv(args("doc"),
                 ('doc_id, 'text),
                 skipHeader = true)
              .read
              .flatMap('text -> 'token) {
                 text : String => text.split("[ \\[\\]\\(\\),.]")
               }
              .groupBy('token) { _.size('count) }
              .write(Tsv(args("wc"), writeHeader = true))
          }




word count – Scalding / Scala

           github.com/twitter/scalding/wiki

             • extends the Scala collections API so that distributed lists
               become “pipes” backed by Cascading
             • code is compact, easy to understand
             • nearly 1:1 between elements of conceptual flow diagram
               and function calls
             • extensive libraries are available for linear algebra, abstract
               algebra, machine learning – e.g., Matrix API, Algebird, etc.
             • significant investments by Twitter, Etsy, eBay, etc.
             • great for data services at scale
             • gentler learning curve than Cascalog
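
The "nearly 1:1 between elements of conceptual flow diagram and function calls" point can be illustrated by analogy with Java's own collections-style API. This is a sketch using `java.util.stream`, not Scalding (which is a Scala API); the class name and sample lines are hypothetical, and the empty-token filter is an addition made here, not something taken from the slides.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical analogy using java.util.stream; Scalding itself is a Scala API.
class StreamWordCountSketch {
    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
            // Tokenize: one line in, many tokens out (compare flatMap in the flow).
            .flatMap(line -> Arrays.stream(line.split("[ \\[\\]\\(\\),.]")))
            // Assumption of this sketch: drop empty tokens left by adjacent delimiters.
            .filter(token -> !token.isEmpty())
            // GroupBy token, then Count (compare groupBy followed by size).
            .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // prints {lee=1, rain=1, shadow=2, side=1}
        System.out.println(wordCount(Arrays.asList("rain shadow", "shadow (lee side)")));
    }
}
```

Each call in the chain corresponds to one box in the conceptual diagram (Tokenize, GroupBy, Count), which is the readability property the bullet describes; the difference is that a Cascading-backed pipe is planned into distributed jobs rather than evaluated in one JVM.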




word count – Scalding / Scala

             Cascalog and Scalding DSLs leverage the functional aspects
             of MapReduce, helping limit complexity in process

Two Avenues to the App Layer…

            Enterprise: must contend with
            complexity at scale every day…
            incumbents extend current practices and
            infrastructure investments – using J2EE,
            ANSI SQL, SAS, etc. – to migrate
            workflows onto Apache Hadoop while
            leveraging existing staff

            Start-ups: crave complexity and
            scale to become viable…
            new ventures move into Enterprise space
            to compete using relatively lean staff,
            while leveraging sophisticated engineering
            practices, e.g., Cascalog and Scalding

            [Figure: chart with axes complexity ➞ and scale ➞]

Pattern: predictive models at scale
            • Enterprise Data Workflows
            • Sample Code
            • A Little Theory…
            • Pattern
            • PMML
            • Roadmap
            • Customer Experiments




workflow abstraction – pattern language

           Cascading uses a “plumbing” metaphor in the Java API,
           to define workflows out of familiar elements: Pipes, Taps,
           Tuple Flows, Filters, Joins, Traps, etc.

           [flow diagram: Document Collection, Tokenize, Scrub token, Stop Word List,
            HashJoin (Left, RHS), Regex token, GroupBy token, Count, Word Count;
            stages marked M and R]

           Data is represented as flows of tuples. Operations within
           the flows bring functional programming aspects into Java




            In formal terms, this provides a pattern language
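
As a rough illustration of tuple flows and functional operations, the Word Count flow can be sketched as a chain of small functions. This is plain Python rather than the Cascading Java API, and every name in it (`tokenize`, `scrub`, `remove_stop_words`, `count_tokens`, the sample documents, the stop word set) is a hypothetical stand-in, not something taken from Cascading.

```python
import re
from collections import Counter

# Hypothetical stand-in for the "Stop Word List" input in the diagram.
STOP_WORDS = {"a", "the", "of"}


def tokenize(docs):
    """Split each (doc_id, text) tuple into (doc_id, token) tuples."""
    for doc_id, text in docs:
        for token in re.split(r"[ \[\]\(\),.]", text):
            yield (doc_id, token)


def scrub(tuples):
    """Trim and lowercase each token, dropping empty ones."""
    for doc_id, token in tuples:
        token = token.strip().lower()
        if token:
            yield (doc_id, token)


def remove_stop_words(tuples, stop_words):
    """Keep only tuples whose token has no match in the stop word list."""
    return (t for t in tuples if t[1] not in stop_words)


def count_tokens(tuples):
    """Group by token and count occurrences."""
    return Counter(token for _, token in tuples)


def word_count(docs, stop_words=STOP_WORDS):
    """Compose the stages; nothing runs until the final count consumes the stream."""
    return count_tokens(remove_stop_words(scrub(tokenize(docs)), stop_words))


docs = [
    ("doc01", "A rain shadow is a dry area"),
    ("doc02", "The rain of the area"),
]
counts = word_count(docs)
```

Each stage takes a stream of tuples and returns a stream of tuples, which is what lets the stages snap together like pipe segments.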



references…

                      pattern language: a structured method for solving
                      large, complex design problems, where the syntax of
                      the language promotes the use of best practices

                      amazon.com/dp/0195019199



                      design patterns: the notion originated in consensus
                      negotiation for architecture, later applied in OOP
                      software engineering by the “Gang of Four”

                      amazon.com/dp/0201633612




workflow abstraction – pattern language

           design principles of the pattern language ensure best practices
           for robust, parallel data workflows at scale



workflow abstraction – literate programming

           Cascading workflows generate their own visual
           documentation: flow diagrams


           [flow diagram of the Word Count workflow: Document Collection, Tokenize,
            Scrub token, Stop Word List, HashJoin (Left, RHS), Regex token,
            GroupBy token, Count, Word Count; stages marked M and R]

            In formal terms, flow diagrams leverage a methodology
            called literate programming
            Provides intuitive, visual representations for apps –
            great for cross-team collaboration


references…

                      by Don Knuth
                      Literate Programming
                      Univ of Chicago Press, 1992
                      literateprogramming.com/

                      “Instead of imagining that our main task is
                       to instruct a computer what to do, let us
                       concentrate rather on explaining to human
                       beings what we want a computer to do.”




workflow abstraction – test-driven development

             • assert patterns (regex) on the tuple streams
             • adjust assert levels, like log4j levels
             • trap edge cases as “data exceptions”
             • TDD at scale:
                 1. start from raw inputs in the flow graph
                 2. define stream assertions for each stage of transforms
                 3. verify exceptions, code to remove them
                 4. when impl is complete, app has full test coverage

           [architecture diagram: Customers, Web App, logs, Cache, Support, Modeling,
            PMML, Data Workflow (source taps, sink taps, trap tap), Analytics Cubes,
            Reporting, Customer Prefs, customer profile DBs, Hadoop Cluster]

           redirect traps in production
           to Ops, QA, Support, Audit, etc.
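
A minimal sketch of the idea, assuming a stage is just a function applied to each tuple: a regex assertion checks every result, the assertion can be switched off by lowering the active level, and any tuple that raises or fails the assertion is diverted to a trap instead of stopping the run. The names `AssertLevel`, `assert_matches`, and `run_stage` are hypothetical and are not the Cascading API.

```python
import re
from enum import IntEnum


class AssertLevel(IntEnum):
    """Hypothetical assertion levels, adjustable much like logging levels."""
    NONE = 0
    VALID = 1
    STRICT = 2


def assert_matches(pattern, level):
    """Build an assertion that applies only when the active level reaches `level`."""
    regex = re.compile(pattern)

    def check(value, active_level):
        if active_level < level:
            return True  # assertion is disabled at this level
        return regex.fullmatch(value) is not None

    return check


def run_stage(tuples, transform, assertion, active_level, trap):
    """Apply `transform` to each tuple; divert failures to `trap` and keep going."""
    out = []
    for t in tuples:
        try:
            result = transform(t)
            if not assertion(result, active_level):
                raise ValueError(f"assertion failed: {result!r}")
            out.append(result)
        except Exception as exc:  # the "data exception" goes to the trap
            trap.append((t, str(exc)))
    return out


def double(s):
    """Toy transform: parse an integer and double it."""
    return str(int(s) * 2)


raw = ["42", "n/a", "75"]
at_most_two_digits = assert_matches(r"\d{1,2}", AssertLevel.STRICT)

strict_trap = []
strict_out = run_stage(raw, double, at_most_two_digits, AssertLevel.STRICT, strict_trap)

relaxed_trap = []
relaxed_out = run_stage(raw, double, at_most_two_digits, AssertLevel.NONE, relaxed_trap)
```

With the strict level, both the unparseable input and the out-of-range result land in the trap; with assertions off, only the unparseable input does. In production the trap list would be replaced by whatever destination Ops, QA, or Support reads from.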


workflow abstraction – business process

           Following the essence of literate programming, Cascading
           workflows provide statements of business process

           This recalls a sense of business process management
           for Enterprise apps (think BPM/BPEL for Big Data)

           Cascading creates a separation of concerns between
           business process and implementation details (Hadoop, etc.)

           This is especially apparent in large-scale Cascalog apps:
               “Specify what you require, not how to achieve it.”

           By virtue of the pattern language, the flow planner then
           determines how to translate business process into efficient,
           parallel jobs at scale




references…

                      by Edgar Codd
                      “A relational model of data for large shared data banks”
                      Communications of the ACM, 1970
                      dl.acm.org/citation.cfm?id=362685

                      Rather than arguing between SQL vs. NoSQL…
                      structured vs. unstructured data frameworks…
                      this approach focuses on what apps do:
                        the process of structuring data

                      Closely related to the functional relational programming paradigm:
                        “Out of the Tar Pit”
                        Moseley & Marks 2006
                        http://goo.gl/SKspn


workflow abstraction – API design principles

             • specify what is required, not how it must be achieved
             • plan far ahead, before consuming cluster resources –
                 fail fast prior to submit

             • fail the same way twice – deterministic flow planners
                 help reduce engineering costs for debugging at scale

             • same JAR, any scale – app does not require a recompile
                 to change data taps or cluster topologies
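
The "fail fast prior to submit" and "fail the same way twice" principles can be sketched as a planning pass that checks the whole flow before any data is read. This is an illustrative toy, not Cascading's planner: `plan`, `PlannerError`, and the step tuples of (name, required fields, added fields) are all hypothetical.

```python
class PlannerError(Exception):
    """Raised at planning time, before any cluster resources are used."""


def plan(source_fields, steps, sink_fields):
    """Verify that every step and the sink can resolve the fields they need.

    `steps` is a list of (name, needs, adds) tuples. Returns the step names
    in execution order, or raises PlannerError on the first unmet dependency.
    """
    available = set(source_fields)
    ordered = []
    for name, needs, adds in steps:
        # sorted() keeps the error message identical from run to run
        missing = sorted(set(needs) - available)
        if missing:
            raise PlannerError(f"step {name!r} is missing fields {missing}")
        available |= set(adds)
        ordered.append(name)
    missing = sorted(set(sink_fields) - available)
    if missing:
        raise PlannerError(f"sink is missing fields {missing}")
    return ordered


steps = [
    ("tokenize", ["text"], ["token"]),
    ("count", ["token"], ["count"]),
]

# A well-formed flow plans successfully.
good_plan = plan(["doc_id", "text"], steps, ["token", "count"])

# A flow whose source lacks a required field fails before it ever runs.
try:
    plan(["doc_id"], steps, ["count"])
    error_message = None
except PlannerError as exc:
    error_message = str(exc)
```

Because the check depends only on the declared fields, the same broken flow produces the same error every time, regardless of data volume or cluster size.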




workflow abstraction – building apps in layers

                      business process          separation of concerns: focus on specifying what is required, not how the computers
                                                must accomplish it – not unlike BPM/BPEL for BigData

                      test-driven development   assert expected patterns in tuple flows, adjust assertion levels, verify that tests fail,
                                                code until tests pass, repeat … route exceptional data to appropriate department

                      pattern language          syntax of the pattern language conveys expertise – much like building a tower with
                                                Lego blocks: ensure best practices for robust, parallel data workflows at scale

                      flow planner/optimizer    enables the functional programming aspects: compiler within a compiler, mapping
                                                flows to topologies (e.g., create and sequence Hadoop job steps)

                      compiler/build            entire app is visible to the compiler: resolves issues of crossing boundaries for
                                                troubleshooting, exception handling, notifications, etc.; one app = one JAR

                      topology                  Apache Hadoop MR, IMDGs, etc., – upcoming MR2, etc.

                      JVM cluster               cluster scheduler, instrumentation, etc.



workflow abstraction – building apps in layers

                      several theoretical aspects converge
                      into software engineering practices
                      which minimize the complexity of
                      building and maintaining Enterprise
                      data workflows



Pattern: predictive models at scale
            • Enterprise Data Workflows
            • Sample Code
            • A Little Theory…
            • Pattern
            • PMML
            • Roadmap
            • Customer Experiments




Pattern – analytics workflows

             • open source project – ASL 2, GitHub repo
             • multiple companies contributing
             • complementary to Apache Mahout – while leveraging
               workflow abstraction, multiple topologies, etc.
             • model scoring: generates workflows from PMML models
             • model creation: estimation at scale, captured as PMML
             • use sample Hadoop app at scale – no coding required
             • integrate with 2 lines of Java (1 line Clojure or Scala)
             • excellent use cases for customer experiments at scale




             cascading.org/pattern


Pattern – analytics workflows

             greatly reduced development costs, fewer
             licensing issues at scale – leveraging the
             economics of Apache Hadoop clusters,
             plus the core competencies of analytics
             staff, plus existing IP in predictive models


Pattern – model scoring

             • migrate workloads: SAS, Teradata, etc.,
               exporting predictive models as PMML
             • great open source tools – R, Weka,
               KNIME, Matlab, RapidMiner, etc.
             • integrate with other libraries –
               Matrix API, etc.
             • leverage PMML as another kind
               of DSL

             [architecture diagram: Customers, Web App, logs, Cache, Support, Modeling,
              PMML, Data Workflow (source taps, sink taps, trap tap), Analytics Cubes,
              Reporting, Customer Prefs, customer profile DBs, Hadoop Cluster]

             cascading.org/pattern


Pattern – an example classifier

               1. use customer order history as the training data set
               2. train a risk classifier for orders, using Random Forest
               3. export model from R to PMML
               4. build a Cascading app to execute the PMML model
                      4.1. generate flow from PMML description
                      4.2. plan the flow for a topology (Hadoop)
                      4.3. compile app to a JAR file
               5. verify results with a regression test
               6. deploy the app at scale to calculate scores
               7. potentially, reuse classifier for real-time scoring
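
Step 5 can be sketched as a comparison between the labels predicted by the original model and the labels produced by the deployed scorer on the same rows. The sketch below assumes a TSV with `var0`..`var2` feature columns and a `predicted` column; `regression_test`, `toy_scorer`, and the sample rows are hypothetical illustrations, and the toy scorer is a simple threshold, not a Random Forest.

```python
import csv
import io


def regression_test(tsv_text, score, label_field="predicted"):
    """Re-score every row and report rows where the label differs.

    Returns a list of (row_number, expected_label, actual_label) tuples;
    an empty list means the scorer reproduces the reference predictions.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    mismatches = []
    for row_num, row in enumerate(reader, start=1):
        expected = row[label_field]
        features = {k: float(v) for k, v in row.items() if k.startswith("var")}
        actual = score(features)
        if actual != expected:
            mismatches.append((row_num, expected, actual))
    return mismatches


def toy_scorer(features):
    """Hypothetical stand-in for the deployed model."""
    return "1" if features["var0"] > 0.5 else "0"


# Made-up sample rows, for illustration only.
sample_tsv = (
    "label\tvar0\tvar1\tvar2\tpredicted\n"
    "1\t0.9\t0.1\t0.3\t1\n"
    "0\t0.2\t0.4\t0.8\t0\n"
    "0\t0.7\t0.5\t0.1\t0\n"
)

mismatches = regression_test(sample_tsv, toy_scorer)
```

Here the third row is reported because the toy scorer disagrees with the reference label; a real regression test would expect an empty mismatch list before the app is deployed at scale.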


Pattern – an example classifier

                      [architecture diagram]
                      left:    risk classifier, dimension: customer 360 – Cascading apps (data prep,
                               predict model costs, detect fraudsters, segment customers), Hadoop,
                               batch workloads
                      center:  training data sets, analyst's laptop, PMML model, Customer DB
                      right:   risk classifier, dimension: per-order – customer transactions,
                               score new orders, anomaly detection, velocity metrics, IMDG,
                               real-time workloads
                      bottom:  ETL, DW, chargebacks, etc., partner data




Pattern – create a model in R

                      ## data_train, data, and dat_folder are assumed to be defined earlier

                      library(randomForest)
                      library(pmml)   # provides pmml()
                      library(XML)    # provides saveXML()

                      ## train a RandomForest model

                      f <- as.formula("as.factor(label) ~ .")
                      fit <- randomForest(f, data_train, ntree=50)

                      ## test the model on the holdout test set

                      print(fit$importance)
                      print(fit)

                      predicted <- predict(fit, data)
                      data$predicted <- predicted
                      confuse <- table(pred = predicted, true = data[,1])
                      print(confuse)

                      ## export predicted labels to TSV

                      write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
                        quote=FALSE, sep="\t", row.names=FALSE)

                      ## export RF model to PMML

                      saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))




Pattern – capture model parameters as PMML
                      <?xml version="1.0"?>
                      <PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0"
                       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                       xsi:schemaLocation="http://www.dmg.org/PMML-4_0
                       http://www.dmg.org/v4-0/pmml-4-0.xsd">
                       <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
                        <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
                        <Application name="Rattle/PMML" version="1.2.30"/>
                        <Timestamp>2012-10-22 19:39:28</Timestamp>
                       </Header>
                       <DataDictionary numberOfFields="4">
                        <DataField name="label" optype="categorical" dataType="string">
                         <Value value="0"/>
                         <Value value="1"/>
                        </DataField>
                        <DataField name="var0" optype="continuous" dataType="double"/>
                        <DataField name="var1" optype="continuous" dataType="double"/>
                        <DataField name="var2" optype="continuous" dataType="double"/>
                       </DataDictionary>
                       <MiningModel modelName="randomForest_Model" functionName="classification">
                        <MiningSchema>
                         <MiningField name="label" usageType="predicted"/>
                         <MiningField name="var0" usageType="active"/>
                         <MiningField name="var1" usageType="active"/>
                         <MiningField name="var2" usageType="active"/>
                        </MiningSchema>
                        <Segmentation multipleModelMethod="majorityVote">
                         <Segment id="1">
                          <True/>
                          <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit">
                           <MiningSchema>
                            <MiningField name="label" usageType="predicted"/>
                            <MiningField name="var0" usageType="active"/>
                            <MiningField name="var1" usageType="active"/>
                            <MiningField name="var2" usageType="active"/>
                           </MiningSchema>
                      ...
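
The `multipleModelMethod="majorityVote"` attribute above means each tree segment casts one vote and the most frequent label wins. A minimal sketch of that rule, assuming each tree's predicted label is already available as a string. The class and method names are hypothetical (this is not Pattern's API), and breaking ties in favor of the first-seen label is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Pattern: combines per-tree labels by majority vote.
public class MajorityVote {

    // Returns the label predicted by the most trees.
    // Ties go to the label that appeared first (an assumption of this sketch).
    public static String vote(List<String> treePredictions) {
        if (treePredictions.isEmpty()) {
            throw new IllegalArgumentException("need at least one prediction");
        }
        // LinkedHashMap keeps first-seen order so tie-breaking is deterministic
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String label : treePredictions) {
            counts.merge(label, 1, Integer::sum);
        }
        String winner = null;
        int best = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > best) {
                best = entry.getValue();
                winner = entry.getKey();
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        // five trees vote on one order: three say "1", two say "0"
        System.out.println(vote(Arrays.asList("1", "0", "1", "1", "0")));
    }
}
```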

Pattern – score a model, within an app
                      public class Main {
                        public static void main( String[] args ) {
                          String pmmlPath = args[ 0 ];
                          String ordersPath = args[ 1 ];
                          String classifyPath = args[ 2 ];
                          String trapPath = args[ 3 ];

                          Properties properties = new Properties();
                          AppProps.setApplicationJarClass( properties, Main.class );
                          HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

                          // create source and sink taps (tab-delimited text with a header row)
                          Tap ordersTap = new Hfs( new TextDelimited( true, "\t" ), ordersPath );
                          Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
                          Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );

                          // define a "Classifier" model from PMML to evaluate the orders
                          ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );
                          Pipe classifyPipe = new Each( new Pipe( "classify" ), classFunc.getInputFields(), classFunc, Fields.ALL );

                          // connect the taps, pipes, etc., into a flow
                          FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
                           .addSource( classifyPipe, ordersTap )
                           .addTrap( classifyPipe, trapTap )
                           .addSink( classifyPipe, classifyTap );

                          // write a DOT file and run the flow
                          Flow classifyFlow = flowConnector.connect( flowDef );
                          classifyFlow.writeDOT( "dot/classify.dot" );
                          classifyFlow.complete();
                        }
                      }

Pattern – score a model, using pre-defined Cascading app



             [Flow diagram: Customer Orders, PMML Model, Classify, Scored Orders, Assert, GroupBy token, Count, Failure Traps, Confusion Matrix; M and R phase markers]




Pattern – score a model, using pre-defined Cascading app

                      ## run an RF classifier at scale
                       
                      hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
                        --pmml data/sample.rf.xml
                       


                      ## run an RF classifier at scale, assert regression test, measure confusion matrix
                       
                      hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
                        --pmml data/sample.rf.xml --assert --measure out/measure


                       
                      ## run a predictive model at scale, measure RMSE
                       
                      hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap \
                        --pmml data/iris.lm_p.xml --rmse out/measure
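
For reference, RMSE is the square root of the mean squared difference between predicted and actual values. A minimal sketch of that arithmetic follows. The class name and the sample values are hypothetical, and this is not how Pattern computes the measure internally.

```java
// Hypothetical helper, not part of Pattern: root mean squared error.
public class Rmse {

    // sqrt( mean( (predicted - actual)^2 ) ) over paired observations
    public static double rmse(double[] actual, double[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sumSquares = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = predicted[i] - actual[i];
            sumSquares += diff * diff;
        }
        return Math.sqrt(sumSquares / actual.length);
    }

    public static void main(String[] args) {
        // illustrative values only, not taken from the iris data set
        double[] actual = { 1.4, 4.7, 6.0 };
        double[] predicted = { 1.5, 4.5, 5.8 };
        System.out.println(rmse(actual, predicted));
    }
}
```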




Pattern – evaluating results

                      bash-3.2$ head out/classify/part-00000
                      label  var0  var1  var2  order_id  predicted  score
                      1      0     1     0     6f8e1014  1          1
                      0      0     0     1     6f8ea22e  0          0
                      1      0     1     0     6f8ea435  1          1
                      0      0     0     1     6f8ea5e1  0          0
                      1      0     1     0     6f8ea785  1          1
                      1      0     1     0     6f8ea91e  1          1
                      0      1     0     0     6f8eaaba  0          0
                      1      0     1     0     6f8eac54  1          1
                      0      1     1     0     6f8eade3  1          1
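
The `label` and `predicted` columns in this output are enough to tally a confusion matrix by hand. A minimal sketch, assuming binary labels 0 and 1 and using only the nine rows shown above. The class name is hypothetical and this is not the code behind the `--measure` option.

```java
// Hypothetical helper, not part of Pattern: tallies a 2x2 confusion matrix.
public class ConfusionTally {

    // Each input row is {label, predicted}; result is counts[label][predicted].
    public static int[][] tally(int[][] labelAndPredicted) {
        int[][] counts = new int[2][2];
        for (int[] row : labelAndPredicted) {
            counts[row[0]][row[1]]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // {label, predicted} pairs copied from the nine output rows above
        int[][] rows = {
            {1, 1}, {0, 0}, {1, 1}, {0, 0}, {1, 1},
            {1, 1}, {0, 0}, {1, 1}, {0, 1}
        };
        int[][] counts = tally(rows);
        System.out.println("true negatives:  " + counts[0][0]); // 3
        System.out.println("false positives: " + counts[0][1]); // 1
        System.out.println("false negatives: " + counts[1][0]); // 0
        System.out.println("true positives:  " + counts[1][1]); // 5
    }
}
```

In this sample only the last row (order 6f8eade3) is misclassified: its label is 0 but the model predicts 1.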




Lingual – connecting Hadoop and R

                      # load the JDBC package
                      library(RJDBC)
                       
                      # set up the driver
                      drv <- JDBC("cascading.lingual.jdbc.Driver",
                        "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")
                       
                      # set up a database connection to a local repository
                      connection <- dbConnect(drv,
                        "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")
                       
                      # query the repository: in this case the MySQL sample database (CSV files)
                      df <- dbGetQuery(connection,
                        "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
                      head(df)
                       
                      # use R functions to summarize and visualize part of the data
                      df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
                      summary(df$hire_age)

                      library(ggplot2)
                      m <- ggplot(df, aes(x=hire_age))
                      m <- m + ggtitle("Age at hire, people named Gina")
                      m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()


Lingual – connecting Hadoop and R

                      > summary(df$hire_age)
                         Min. 1st Qu. Median     Mean 3rd Qu.    Max.
                        20.86   27.89   31.70   31.61   35.01   43.92




             cascading.org/lingual
             launchpad.net/test-db


Pattern: predictive models at scale

            • Enterprise Data Workflows
            • Sample Code
            • A Little Theory…
            • Pattern
            • PMML
            • Roadmap
            • Customer Experiments




PMML – standard

             • established XML standard for predictive model markup
             • organized by Data Mining Group (DMG), since 1997
                 http://dmg.org/
             • members: IBM, SAS, Visa, NASA, Equifax, MicroStrategy,
                 Microsoft, etc.
             • PMML concepts for metadata, ensembles, etc., translate
                 directly into Cascading tuple flows

           “PMML is the leading standard for statistical and data mining models and
            supported by over 20 vendors and organizations. With PMML, it is easy
            to develop a model on one system using one application and deploy the
            model on another system using another application.”


             wikipedia.org/wiki/Predictive_Model_Markup_Language


PMML – models

             •   Association Rules: AssociationModel element
             •   Cluster Models: ClusteringModel element
             •   Decision Trees: TreeModel element
             •   Naïve Bayes Classifiers: NaiveBayesModel element
             •   Neural Networks: NeuralNetwork element
             •   Regression: RegressionModel and GeneralRegressionModel elements
             •   Rulesets: RuleSetModel element
             •   Sequences: SequenceModel element
             •   Support Vector Machines: SupportVectorMachineModel element
             •   Text Models: TextModel element
             •   Time Series: TimeSeriesModel element

             ibm.com/developerworks/industry/library/ind-PMML2/
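
Since each model family has its own top-level element, a consumer can tell which kind of model a PMML file holds by inspecting the children of the root `PMML` element. A minimal sketch using only the JDK's DOM parser, with the element names taken from the list above plus the `MiningModel` element seen in the earlier Random Forest export. The class name is hypothetical and this is not how Pattern parses PMML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Pattern: reports which model element a PMML document contains.
public class PmmlModelType {

    private static final Set<String> MODEL_ELEMENTS = new HashSet<>(Arrays.asList(
        "AssociationModel", "ClusteringModel", "TreeModel", "NaiveBayesModel",
        "NeuralNetwork", "RegressionModel", "GeneralRegressionModel", "RuleSetModel",
        "SequenceModel", "SupportVectorMachineModel", "TextModel", "TimeSeriesModel",
        "MiningModel"));

    // Returns the name of the first recognized model element under the root, or null if none.
    public static String modelType(String pmmlXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(pmmlXml.getBytes(StandardCharsets.UTF_8)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.ELEMENT_NODE
                        && MODEL_ELEMENTS.contains(child.getNodeName())) {
                    return child.getNodeName();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse PMML", e);
        }
    }

    public static void main(String[] args) {
        // trimmed-down document for illustration; a real export carries far more detail
        String pmml = "<?xml version=\"1.0\"?>"
            + "<PMML version=\"4.0\" xmlns=\"http://www.dmg.org/PMML-4_0\">"
            + "<Header/><DataDictionary/><MiningModel/></PMML>";
        System.out.println(modelType(pmml));
    }
}
```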


PMML – vendor coverage




Pattern: predictive models at scale

            • Enterprise Data Workflows
            • Sample Code
            • A Little Theory…
            • Pattern
            • PMML
            • Roadmap
            • Customer Experiments




roadmap – existing algorithms for scoring

             •   Random Forest
             •   Decision Trees
             •   Linear Regression
             •   GLM
             •   Logistic Regression
             •   K-Means Clustering
             •   Hierarchical Clustering
             •   Support Vector Machines




             cascading.org/pattern


roadmap – top priorities for creating models at scale

             • Random Forest
             • Logistic Regression
             • K-Means Clustering


           a wealth of recent research indicates many opportunities
           to parallelize popular algorithms for training models at scale
           on Apache Hadoop…




             cascading.org/pattern


roadmap – next priorities for scoring

             •   Time Series (ARIMA forecast)
             •   Association Rules (basket analysis)
             •   Naïve Bayes
             •   Neural Networks


           algorithms extended based on customer use cases –
           contact @pacoid




             cascading.org/pattern


Pattern: predictive models at scale

            • Enterprise Data Workflows
            • Sample Code
            • A Little Theory…
            • Pattern
            • PMML
            • Roadmap
            • Customer Experiments




experiments – comparing models

             • much customer interest in leveraging Cascading and
                 Apache Hadoop to run customer experiments at scale
             • run multiple variants, then measure relative “lift”
             • Concurrent runtime – tag and track models

           the following example compares two models trained
           with different machine learning algorithms

           the comparison is exaggerated: one model has an important variable
           intentionally omitted to help illustrate the experiment




experiments – Random Forest model

                      ## train a Random Forest model
                      ## example: http://mkseo.pe.kr/stats/?p=220
                       
                      f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
                      fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
                      print(fit)
                      saveXML(pmml(fit), file=paste(out_folder, "sample.rf.xml", sep="/"))



                              OOB estimate of  error rate: 14%
                      Confusion matrix:
                         0   1 class.error
                      0 69  16   0.1882353
                      1 12 103   0.1043478




experiments – Logistic Regression model

                      ## train a Logistic Regression model (special case of GLM)
                      ## example: http://www.stat.cmu.edu/~cshalizi/490/clustering/clustering01.r
                       
                      f <- as.formula("as.factor(label) ~ var0 + var2")
                      fit <- glm(f, family=binomial, data=data)
                      print(summary(fit))
                      saveXML(pmml(fit), file=paste(out_folder, "sample.lr.xml", sep="/"))



                      Coefficients:
                                  Estimate Std. Error z value Pr(>|z|)
                      (Intercept)   1.8524     0.3803   4.871 1.11e-06 ***
                      var0         -1.3755     0.4355  -3.159  0.00159 **
                      var2         -3.7742     0.5794  -6.514 7.30e-11 ***
                      ---
                      Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1




                      NB: this model has “var1” intentionally omitted


experiments – comparing results

             • use a confusion matrix to compare results for the classifiers
             • Logistic Regression has a lower “false negative” rate (5% vs. 11%);
                 however, it has a much higher “false positive” rate (52% vs. 14%)
             • assign a cost model to select a winner –
                 for example, in an ecommerce anti-fraud classifier:
                      FN ∼ chargeback risk
                      FP ∼ customer support costs
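
One way to apply such a cost model is to weight each kind of error by a per-case cost and compare the totals across models. A minimal sketch, assuming a 2x2 matrix laid out as counts[actual][predicted]. The counts in `main` are the OOB confusion matrix printed on the Random Forest slide (they need not match the percentages quoted above), while the class name and the two cost figures are hypothetical.

```java
// Hypothetical helper, not part of Pattern: error rates and a simple linear cost model.
public class CostModel {

    // share of actual negatives (label 0) that were predicted positive
    public static double falsePositiveRate(int[][] counts) {
        return counts[0][1] / (double) (counts[0][0] + counts[0][1]);
    }

    // share of actual positives (label 1) that were predicted negative
    public static double falseNegativeRate(int[][] counts) {
        return counts[1][0] / (double) (counts[1][0] + counts[1][1]);
    }

    // total cost = (false negatives x cost per FN) + (false positives x cost per FP)
    public static double totalCost(int[][] counts, double costPerFn, double costPerFp) {
        return counts[1][0] * costPerFn + counts[0][1] * costPerFp;
    }

    public static void main(String[] args) {
        // counts[actual][predicted], from the Random Forest OOB confusion matrix
        int[][] randomForest = { { 69, 16 }, { 12, 103 } };

        // assumed costs for illustration: a chargeback hurts more than a support call
        double costPerFn = 50.0;
        double costPerFp = 5.0;

        System.out.println("FP rate: " + falsePositiveRate(randomForest));
        System.out.println("FN rate: " + falseNegativeRate(randomForest));
        System.out.println("cost:    " + totalCost(randomForest, costPerFn, costPerFp));
    }
}
```

Running the same calculation on each candidate model's matrix and picking the lowest total gives a winner under the chosen costs.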




references…


                      Enterprise Data Workflows
                      with Cascading
                      O’Reilly, 2013
                      amazon.com/dp/1449358721




drill-down…


                      blog, dev community, code/wiki/gists, maven repo,
                      commercial products, career opportunities:
                        cascading.org
                        zest.to/group11
                        github.com/Cascading
                        conjars.org
                        goo.gl/KQtUL
                        concurrentinc.com

                                                                          Copyright @2013, Concurrent, Inc.





More Related Content

What's hot

Bi303 data warehousing with fast track and pdw - Assaf Fraenkel
Bi303 data warehousing with fast track and pdw - Assaf FraenkelBi303 data warehousing with fast track and pdw - Assaf Fraenkel
Bi303 data warehousing with fast track and pdw - Assaf Fraenkelsqlserver.co.il
 
Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)Hortonworks
 
RIPEstat Public demo 16 April 2012
RIPEstat Public demo 16 April 2012RIPEstat Public demo 16 April 2012
RIPEstat Public demo 16 April 2012RIPE NCC
 
Hw09 Production Deep Dive With High Availability
Hw09   Production Deep Dive With High AvailabilityHw09   Production Deep Dive With High Availability
Hw09 Production Deep Dive With High AvailabilityCloudera, Inc.
 
HP Microsoft SQL Server Data Management Solutions
HP Microsoft SQL Server Data Management SolutionsHP Microsoft SQL Server Data Management Solutions
HP Microsoft SQL Server Data Management SolutionsEduardo Castro
 
Cloudera Sessions - Clinic 1 - Getting Started With Hadoop
Cloudera Sessions - Clinic 1 - Getting Started With HadoopCloudera Sessions - Clinic 1 - Getting Started With Hadoop
Cloudera Sessions - Clinic 1 - Getting Started With HadoopCloudera, Inc.
 
Standards for Semantic Mashups
Standards for Semantic MashupsStandards for Semantic Mashups
Standards for Semantic MashupsLaurent Lefort
 
Laserdata i skyen - Geomatikkdagene 2013
Laserdata i skyen - Geomatikkdagene 2013Laserdata i skyen - Geomatikkdagene 2013
Laserdata i skyen - Geomatikkdagene 2013Geodata AS
 
Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopBrock Noland
 

What's hot (13)

The Other Way of Doing Big Data
The Other Way of Doing Big DataThe Other Way of Doing Big Data
The Other Way of Doing Big Data
 
Bi303 data warehousing with fast track and pdw - Assaf Fraenkel
Bi303 data warehousing with fast track and pdw - Assaf FraenkelBi303 data warehousing with fast track and pdw - Assaf Fraenkel
Bi303 data warehousing with fast track and pdw - Assaf Fraenkel
 
Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)
 
RIPEstat Public demo 16 April 2012
RIPEstat Public demo 16 April 2012RIPEstat Public demo 16 April 2012
RIPEstat Public demo 16 April 2012
 
Hw09 Production Deep Dive With High Availability
Hw09   Production Deep Dive With High AvailabilityHw09   Production Deep Dive With High Availability
Hw09 Production Deep Dive With High Availability
 
HP Microsoft SQL Server Data Management Solutions
HP Microsoft SQL Server Data Management SolutionsHP Microsoft SQL Server Data Management Solutions
HP Microsoft SQL Server Data Management Solutions
 
Cloudera Sessions - Clinic 1 - Getting Started With Hadoop
Cloudera Sessions - Clinic 1 - Getting Started With HadoopCloudera Sessions - Clinic 1 - Getting Started With Hadoop
Cloudera Sessions - Clinic 1 - Getting Started With Hadoop
 
Standards for Semantic Mashups
Standards for Semantic MashupsStandards for Semantic Mashups
Standards for Semantic Mashups
 
Laserdata i skyen - Geomatikkdagene 2013
Laserdata i skyen - Geomatikkdagene 2013Laserdata i skyen - Geomatikkdagene 2013
Laserdata i skyen - Geomatikkdagene 2013
 
Hana Offerings Engl
Hana Offerings EnglHana Offerings Engl
Hana Offerings Engl
 
User Group Bi
User Group BiUser Group Bi
User Group Bi
 
Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache Hadoop
 
Hdfs high availability
Hdfs high availabilityHdfs high availability
Hdfs high availability
 

Viewers also liked

Panorama de l'utilisation des médias sociaux dans les collectivités locales
Panorama de l'utilisation des médias sociaux dans les collectivités localesPanorama de l'utilisation des médias sociaux dans les collectivités locales
Panorama de l'utilisation des médias sociaux dans les collectivités localesEmilie Marquois
 
Open Data: From the Information Age to the Action Age (PDF with notes)
Open Data: From the Information Age to the Action Age (PDF with notes)Open Data: From the Information Age to the Action Age (PDF with notes)
Open Data: From the Information Age to the Action Age (PDF with notes)Tim O'Reilly
 
Some Lessons for Startups (pdf with notes)
Some Lessons for Startups (pdf with notes)Some Lessons for Startups (pdf with notes)
Some Lessons for Startups (pdf with notes)Tim O'Reilly
 
OSCON 2012 US Patriot Act Implications for Cloud Computing - Diane Mueller, A...
OSCON 2012 US Patriot Act Implications for Cloud Computing - Diane Mueller, A...OSCON 2012 US Patriot Act Implications for Cloud Computing - Diane Mueller, A...
OSCON 2012 US Patriot Act Implications for Cloud Computing - Diane Mueller, A...OSCON Byrum
 
clearScienceStrataRx2012
clearScienceStrataRx2012clearScienceStrataRx2012
clearScienceStrataRx2012OReillyStrata
 
The Workflow Abstraction
The Workflow AbstractionThe Workflow Abstraction
The Workflow AbstractionPaco Nathan
 
What to Do Once You Have an Idea (case study)
What to Do Once You Have an Idea (case study)What to Do Once You Have an Idea (case study)
What to Do Once You Have an Idea (case study)Sergey Sundukovskiy
 
AWS Start-Up Tour 2009 / ShareThis
AWS Start-Up Tour 2009 / ShareThisAWS Start-Up Tour 2009 / ShareThis
AWS Start-Up Tour 2009 / ShareThisPaco Nathan
 
BodyTrack: Open Source Tools for Health Empowerment through Self-Tracking
BodyTrack: Open Source Tools for Health Empowerment through Self-Tracking BodyTrack: Open Source Tools for Health Empowerment through Self-Tracking
BodyTrack: Open Source Tools for Health Empowerment through Self-Tracking OSCON Byrum
 
Traffic Signal Movie Preview
Traffic Signal Movie PreviewTraffic Signal Movie Preview
Traffic Signal Movie PreviewKapil Mohan
 
25 Words Of Social Media Wisdom Project
25 Words Of Social Media Wisdom Project25 Words Of Social Media Wisdom Project
25 Words Of Social Media Wisdom ProjectLiz Strauss
 
Solving the Wanamaker Problem for Healthcare (keynote file)
Solving the Wanamaker Problem for Healthcare (keynote file)Solving the Wanamaker Problem for Healthcare (keynote file)
Solving the Wanamaker Problem for Healthcare (keynote file)Tim O'Reilly
 
Localized methods for diffusions in large graphs
Localized methods for diffusions in large graphsLocalized methods for diffusions in large graphs
Localized methods for diffusions in large graphsDavid Gleich
 
Ermes, internet veloce per la regione Friuli Venezia Giulia
Ermes, internet veloce per la regione Friuli Venezia GiuliaErmes, internet veloce per la regione Friuli Venezia Giulia
Ermes, internet veloce per la regione Friuli Venezia GiuliaSimone Puksic
 
DSSG Speaker Series: Paco Nathan
DSSG Speaker Series: Paco NathanDSSG Speaker Series: Paco Nathan
DSSG Speaker Series: Paco NathanPaco Nathan
 
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ...Holden Karau
 
When Ruby Meets Java - The Power of Torquebox
When Ruby Meets Java - The Power of TorqueboxWhen Ruby Meets Java - The Power of Torquebox
When Ruby Meets Java - The Power of Torqueboxrockyjaiswal
 
The roadtrip that led to my first rails commit and how you could make yours too
The roadtrip that led to my first rails commit and how you could make yours tooThe roadtrip that led to my first rails commit and how you could make yours too
The roadtrip that led to my first rails commit and how you could make yours tooMohnish Jadwani
 

Viewers also liked (20)

Panorama de l'utilisation des médias sociaux dans les collectivités locales
Panorama de l'utilisation des médias sociaux dans les collectivités localesPanorama de l'utilisation des médias sociaux dans les collectivités locales
Panorama de l'utilisation des médias sociaux dans les collectivités locales
 
Open Data: From the Information Age to the Action Age (PDF with notes)
Open Data: From the Information Age to the Action Age (PDF with notes)Open Data: From the Information Age to the Action Age (PDF with notes)
Pattern: an open source project for migrating predictive models onto Apache Hadoop

  • 1. “Pattern – an open source project for migrating predictive models onto Apache Hadoop” Paco Nathan Concurrent, Inc. San Francisco, CA @pacoid Copyright @2013, Concurrent, Inc. Sunday, 17 March 13 1
  • 2. Pattern: predictive models at scale
      [Diagram: word count workflow with Document Collection, Tokenize, Scrub token, HashJoin (Left; Stop Word List as RHS), Regex token, GroupBy token, Count, Word Count]
      • Enterprise Data Workflows
      • Sample Code
      • A Little Theory…
      • Pattern
      • PMML
      • Roadmap
      • Customer Experiments
  • 3. Cascading – origins
      API author Chris Wensel worked as a system architect at an Enterprise firm well known for many popular data products. Wensel was following the Nutch open source project, where Hadoop started.
      Observation: it would be difficult to find Java developers to write complex Enterprise apps in MapReduce, a potential blocker for leveraging the new open source technology.
  • 4. Cascading – functional programming
      Key insight: MapReduce is based on functional programming, going back to LISP in the 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature.
      To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007 as a new Java API to implement functional programming for large-scale data workflows:
      • leverages the JVM and Java-based tools without any need to create new languages
      • allows programmers who have J2EE expertise to leverage the economics of Hadoop clusters
  • 5. functional programming… in production
      • Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc., have invested in open source projects atop Cascading, used for their large-scale production deployments
      • new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming:
          Cascalog in Clojure (2010): github.com/nathanmarz/cascalog/wiki
          Scalding in Scala (2012): github.com/twitter/scalding/wiki
  • 6. Cascading – definitions
      • a pattern language for Enterprise Data Workflows
      • simple to build, easy to test, robust in production
      • design principles ⟹ ensure best practices at scale
      [Diagram: Enterprise data workflow on a Hadoop cluster, with source, trap, and sink taps connecting Customers, Web App, logs, Cache, Support, Data Modeling (PMML), Analytics Cubes, customer profile DBs, Customer Prefs, and Reporting]
  • 7. Cascading – usage
      • Java API, DSLs in Scala, Clojure, Jython, JRuby, Groovy, ANSI SQL
      • ASL 2 license, GitHub src, http://conjars.org
      • 5+ yrs production use, multiple Enterprise verticals
      [Diagram: Enterprise data workflow, as on slide 6]
  • 8. Cascading – integrations
      • partners: Microsoft Azure, Hortonworks, Amazon AWS, MapR, EMC, SpringSource, Cloudera
      • taps: Memcached, Cassandra, MongoDB, HBase, JDBC, Parquet, etc.
      • serialization: Avro, Thrift, Kryo, JSON, etc.
      • topologies: Apache Hadoop, tuple spaces, local mode
      [Diagram: Enterprise data workflow, as on slide 6]
  • 9. Cascading – deployments
      • case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, etc.
      • use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc.
  • 10. Cascading – deployments (same case studies and use cases as slide 9)
      The workflow abstraction addresses:
      • staffing bottleneck
      • system integration
      • operational complexity
      • test-driven development
  • 11. Pattern: predictive models at scale (agenda, repeated from slide 2)
      • Enterprise Data Workflows
      • Sample Code
      • A Little Theory…
      • Pattern
      • PMML
      • Roadmap
      • Customer Experiments
  • 12. The Ubiquitous Word Count • Definition: count how often each word appears in a collection of text documents • This simple program provides an excellent test case for parallel processing, since it: • requires a minimal amount of code • demonstrates use of both symbolic and numeric values • shows a dependency graph of tuples as an abstraction • is not many steps away from useful search indexing • serves as a “Hello World” for Hadoop apps • Any distributed computing framework which can run Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems.
      void map (String doc_id, String text):
        for each word w in segment(text):
          emit(w, "1");

      void reduce (String word, Iterator group):
        int count = 0;
        for each pc in group:
          count += Int(pc);
        emit(word, String(count));
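The map and reduce pseudocode on this slide can be sketched in plain Java, run in memory on a single machine. This is only an illustration of the two steps, not Cascading or Hadoop API: the class name `WordCountSketch`, the helper `wordCount`, and the tokenizer regex are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; none of these names come from Cascading or Hadoop.
final class WordCountSketch {

    // "map" step: emit one (word, 1) pair for each token in a document
    static List<Map.Entry<String, Integer>> map(String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        // assumed tokenizer: split on spaces, brackets, parentheses, commas, periods
        for (String w : text.split("[ \\[\\](),.]+")) {
            if (!w.isEmpty()) {
                pairs.add(Map.entry(w, 1));
            }
        }
        return pairs;
    }

    // "reduce" step: sum the emitted counts for each distinct word
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    // run map over every document, then reduce over all emitted pairs
    static Map<String, Integer> wordCount(List<String> docs) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String doc : docs) {
            pairs.addAll(map(doc));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("rain shadow, rain.", "shadow (dry) area")));
    }
}
```

In a real cluster the grouping between the two steps is done by the framework's shuffle; here a single in-memory list stands in for it.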
  • 13. word count – conceptual flow diagram • [flow diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count] • 1 map, 1 reduce, 18 lines of code • cascading.org/category/impatient • gist.github.com/3900702
  • 14. word count – Cascading app in Java
      String docPath = args[ 0 ];
      String wcPath = args[ 1 ];
      Properties properties = new Properties();
      AppProps.setApplicationJarClass( properties, Main.class );
      HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

      // create source and sink taps
      Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
      Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

      // specify a regex to split "document" text lines into token stream
      Fields token = new Fields( "token" );
      Fields text = new Fields( "text" );
      RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
      // only returns "token"
      Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

      // determine the word counts
      Pipe wcPipe = new Pipe( "wc", docPipe );
      wcPipe = new GroupBy( wcPipe, token );
      wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

      // connect the taps, pipes, etc., into a flow
      FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
        .addSource( docPipe, docTap )
        .addTailSink( wcPipe, wcTap );

      // write a DOT file and run the flow
      Flow wcFlow = flowConnector.connect( flowDef );
      wcFlow.writeDOT( "dot/wc.dot" );
      wcFlow.complete();
  • 15. word count – generated flow diagram
      [head]
      Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']
        edge: [{2}:'doc_id', 'text'] / [{2}:'doc_id', 'text']
      map: Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
        edge: [{1}:'token'] / [{1}:'token']
      GroupBy('wc')[by:['token']]
        edge: wc[{1}:'token'] / [{1}:'token']
      reduce: Every('wc')[Count[decl:'count']]
        edge: [{2}:'token', 'count'] / [{1}:'token']
      Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']
        edge: [{2}:'token', 'count'] / [{2}:'token', 'count']
      [tail]
  • 16. word count – Cascalog / Clojure
      (ns impatient.core
        (:use [cascalog.api]
              [cascalog.more-taps :only (hfs-delimited)])
        (:require [clojure.string :as s]
                  [cascalog.ops :as c])
        (:gen-class))

      (defmapcatop split [line]
        "reads in a line of string and splits it by regex"
        (s/split line #"[\[\]\(\),.)\s]+"))

      (defn -main [in out & args]
        (?<- (hfs-delimited out)
             [?word ?count]
             ((hfs-delimited in :skip-header? true) _ ?line)
             (split ?line :> ?word)
             (c/count ?count)))

      ; Paul Lam
      ; github.com/Quantisan/Impatient
  • 17. word count – Cascalog / Clojure • github.com/nathanmarz/cascalog/wiki • implements Datalog in Clojure, with predicates backed by Cascading – for a highly declarative language • run ad-hoc queries from the Clojure REPL – approx. 10:1 code reduction compared with SQL • composable subqueries, used for test-driven development (TDD) practices at scale • Leiningen build: simple, no surprises, in Clojure itself • more new deployments than other Cascading DSLs – Climate Corp is largest use case: 90% Clojure/Cascalog • has a learning curve, limited number of Clojure developers • aggregators are the magic, and those take effort to learn
  • 18. word count – Scalding / Scala
      import com.twitter.scalding._

      class WordCount(args : Args) extends Job(args) {
        Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
          .read
          .flatMap('text -> 'token) { text : String => text.split("[ \\[\\]\\(\\),.]") }
          .groupBy('token) { _.size('count) }
          .write(Tsv(args("wc"), writeHeader = true))
      }
  • 19. word count – Scalding / Scala • github.com/twitter/scalding/wiki • extends the Scala collections API so that distributed lists become “pipes” backed by Cascading • code is compact, easy to understand • nearly 1:1 between elements of conceptual flow diagram and function calls • extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc. • significant investments by Twitter, Etsy, eBay, etc. • great for data services at scale • less learning curve than Cascalog
  • 20. word count – Scalding / Scala • Cascalog and Scalding DSLs leverage the functional aspects of MapReduce, helping limit complexity in process
  • 21. Two Avenues to the App Layer… • Enterprise: must contend with complexity at scale every day… incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff • Start-ups: crave complexity and scale to become viable… new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding • [chart axes: complexity ➞, scale ➞]
  • 22. Pattern: predictive models at scale • Enterprise Data Workflows • Sample Code • A Little Theory… • Pattern • PMML • Roadmap • Customer Experiments
  • 23. workflow abstraction – pattern language • Cascading uses a “plumbing” metaphor in the Java API, to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc. • Data is represented as flows of tuples. Operations within the flows bring functional programming aspects into Java. • In formal terms, this provides a pattern language.
  • 24. references… • pattern language: a structured method for solving large, complex design problems, where the syntax of the language promotes the use of best practices • amazon.com/dp/0195019199 • design patterns: the notion originated in consensus negotiation for architecture, later applied in OOP software engineering by “Gang of Four” • amazon.com/dp/0201633612
  • 25. workflow abstraction – pattern language • design principles of the pattern language ensure best practices for robust, parallel data workflows at scale
  • 26. workflow abstraction – literate programming • Cascading workflows generate their own visual documentation: flow diagrams • In formal terms, flow diagrams leverage a methodology called literate programming • Provides intuitive, visual representations for apps – great for cross-team collaboration
  • 27. references… • Literate Programming, by Don Knuth, Univ of Chicago Press, 1992 • literateprogramming.com/ • “Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”
  • 28. workflow abstraction – test-driven development • assert patterns (regex) on the tuple streams • adjust assert levels, like log4j levels • trap edge cases as “data exceptions” • TDD at scale: 1. start from raw inputs in the flow graph 2. define stream assertions for each stage of transforms 3. verify exceptions, code to remove them 4. when impl is complete, app has full test coverage • redirect traps in production to Ops, QA, Support, Audit, etc. • [diagram: Enterprise data workflow]
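The idea of asserting a regex on a tuple stream and trapping the failures as "data exceptions" can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the Cascading assertion or trap API: the class `StreamAssertSketch`, its `check` method, and the tab-separated tuple layout are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java illustration only; Cascading's own assertion and trap classes are not used here.
final class StreamAssertSketch {
    // tuples that matched the expected pattern and continue downstream
    final List<String> passed = new ArrayList<>();
    // tuples that failed the assertion, set aside instead of failing the whole job
    final List<String> trapped = new ArrayList<>();

    // apply one regex assertion to every tuple in the stream
    static StreamAssertSketch check(List<String> tuples, Pattern expected) {
        StreamAssertSketch result = new StreamAssertSketch();
        for (String tuple : tuples) {
            if (expected.matcher(tuple).matches()) {
                result.passed.add(tuple);
            } else {
                result.trapped.add(tuple);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // hypothetical layout: 8 hex digits for an order id, a tab, then a 0/1 label
        Pattern expected = Pattern.compile("[0-9a-f]{8}\\t[01]");
        StreamAssertSketch r = check(List.of("6f8e1014\t1", "bad row", "6f8ea22e\t0"), expected);
        System.out.println("passed: " + r.passed.size() + ", trapped: " + r.trapped);
    }
}
```

Keeping the trapped tuples in their own collection mirrors the slide's point that edge cases can be routed to Ops, QA, Support, or Audit rather than discarded.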
  • 29. workflow abstraction – business process • Following the essence of literate programming, Cascading workflows provide statements of business process • This recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data) • Cascading creates a separation of concerns between business process and implementation details (Hadoop, etc.) • This is especially apparent in large-scale Cascalog apps: “Specify what you require, not how to achieve it.” • By virtue of the pattern language, the flow planner then determines how to translate business process into efficient, parallel jobs at scale
  • 30. references… • “A relational model of data for large shared data banks”, by Edgar Codd, Communications of the ACM, 1970 • dl.acm.org/citation.cfm?id=362685 • Rather than arguing between SQL vs. NoSQL… structured vs. unstructured data frameworks… this approach focuses on what apps do: the process of structuring data • Closely related to functional relational programming paradigm: “Out of the Tar Pit”, Moseley & Marks, 2006 • http://goo.gl/SKspn
  • 31. workflow abstraction – API design principles • specify what is required, not how it must be achieved • plan far ahead, before consuming cluster resources – fail fast prior to submit • fail the same way twice – deterministic flow planners help reduce engineering costs for debugging at scale • same JAR, any scale – app does not require a recompile to change data taps or cluster topologies
  • 32. workflow abstraction – building apps in layers • business process: separation of concerns: focus on specifying what is required, not how the computers must accomplish it – not unlike BPM/BPEL for Big Data • test-driven development: assert expected patterns in tuple flows, adjust assertion levels, verify that tests fail, code until tests pass, repeat … route exceptional data to appropriate department • pattern language: syntax of the pattern language conveys expertise – much like building a tower with Lego blocks: ensure best practices for robust, parallel data workflows at scale • flow planner/optimizer: enables the functional programming aspects: compiler within a compiler, mapping flows to topologies (e.g., create and sequence Hadoop job steps) • compiler/build: entire app is visible to the compiler: resolves issues of crossing boundaries for troubleshooting, exception handling, notifications, etc.; one app = one JAR • topology: Apache Hadoop MR, IMDGs, etc., – upcoming MR2, etc. • JVM cluster: cluster scheduler, instrumentation, etc.
  • 33. workflow abstraction – building apps in layers • several theoretical aspects converge into software engineering practices which minimize the complexity of building and maintaining Enterprise data workflows
  • 34. Pattern: predictive models at scale • Enterprise Data Workflows • Sample Code • A Little Theory… • Pattern • PMML • Roadmap • Customer Experiments
  • 35. Pattern – analytics workflows • open source project – ASL 2, GitHub repo • multiple companies contributing • complementary to Apache Mahout – while leveraging workflow abstraction, multiple topologies, etc. • model scoring: generates workflows from PMML models • model creation: estimation at scale, captured as PMML • use sample Hadoop app at scale – no coding required • integrate with 2 lines of Java (1 line Clojure or Scala) • excellent use cases for customer experiments at scale • cascading.org/pattern
  • 36. Pattern – analytics workflows • greatly reduced development costs, fewer licensing issues at scale – leveraging the economics of Apache Hadoop clusters, plus the core competencies of analytics staff, plus existing IP in predictive models • cascading.org/pattern
  • 37. Pattern – model scoring • migrate workloads: SAS, Teradata, etc., exporting predictive models as PMML • great open source tools – R, Weka, KNIME, Matlab, RapidMiner, etc. • integrate with other libraries – Matrix API, etc. • leverage PMML as another kind of DSL • cascading.org/pattern • [diagram: Enterprise data workflow]
  • 38. Pattern – an example classifier • 1. use customer order history as the training data set • 2. train a risk classifier for orders, using Random Forest • 3. export model from R to PMML • 4. build a Cascading app to execute the PMML model • 4.1. generate flow from PMML description • 4.2. plan the flow for a topology (Hadoop) • 4.3. compile app to a JAR file • 5. verify results with a regression test • 6. deploy the app at scale to calculate scores • 7. potentially, reuse classifier for real-time scoring
  • 39. Pattern – an example classifier • [diagram: example risk classifier architecture; labels: risk classifier (dimension: customer 360), risk classifier (dimension: per-order), Cascading apps, analyst's laptop, data prep, training data sets, customer transactions, predict model costs, score new orders, PMML model, detect fraudsters, anomaly detection, segment customers, velocity metrics, Hadoop batch workloads, IMDG real-time workloads, Customer DB, ETL, partner data, DW, chargebacks, etc.]
  • 40. Pattern – create a model in R
      ## train a RandomForest model
      f <- as.formula("as.factor(label) ~ .")
      fit <- randomForest(f, data_train, ntree=50)

      ## test the model on the holdout test set
      print(fit$importance)
      print(fit)

      predicted <- predict(fit, data)
      data$predicted <- predicted
      confuse <- table(pred = predicted, true = data[,1])
      print(confuse)

      ## export predicted labels to TSV
      write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)

      ## export RF model to PMML
      saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
  • 41. Pattern – capture model parameters as PMML
      <?xml version="1.0"?>
      <PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.dmg.org/PMML-4_0 http://www.dmg.org/v4-0/pmml-4-0.xsd">
       <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
        <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
        <Application name="Rattle/PMML" version="1.2.30"/>
        <Timestamp>2012-10-22 19:39:28</Timestamp>
       </Header>
       <DataDictionary numberOfFields="4">
        <DataField name="label" optype="categorical" dataType="string">
         <Value value="0"/>
         <Value value="1"/>
        </DataField>
        <DataField name="var0" optype="continuous" dataType="double"/>
        <DataField name="var1" optype="continuous" dataType="double"/>
        <DataField name="var2" optype="continuous" dataType="double"/>
       </DataDictionary>
       <MiningModel modelName="randomForest_Model" functionName="classification">
        <MiningSchema>
         <MiningField name="label" usageType="predicted"/>
         <MiningField name="var0" usageType="active"/>
         <MiningField name="var1" usageType="active"/>
         <MiningField name="var2" usageType="active"/>
        </MiningSchema>
        <Segmentation multipleModelMethod="majorityVote">
         <Segment id="1">
          <True/>
          <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit">
           <MiningSchema>
            <MiningField name="label" usageType="predicted"/>
            <MiningField name="var0" usageType="active"/>
            <MiningField name="var1" usageType="active"/>
            <MiningField name="var2" usageType="active"/>
           </MiningSchema>
      ...
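The `multipleModelMethod="majorityVote"` attribute in the PMML above means the ensemble's label is the one predicted by the most segments (trees). A minimal plain-Java sketch of that combination step follows. It is not Pattern's implementation: the class `MajorityVoteSketch` is hypothetical, and the tie-breaking rule (first label seen wins) is an assumption made for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; not taken from the Pattern code base.
final class MajorityVoteSketch {

    // return the label predicted by the largest number of segments;
    // assumption: on a tie, the label that was seen first wins
    static String majorityVote(List<String> segmentPredictions) {
        // insertion-ordered map so the tie-break is deterministic
        Map<String, Integer> votes = new LinkedHashMap<>();
        for (String label : segmentPredictions) {
            votes.merge(label, 1, Integer::sum);
        }
        String winner = null;
        int best = 0;
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            if (e.getValue() > best) {
                best = e.getValue();
                winner = e.getKey();
            }
        }
        return winner; // null when there are no predictions
    }

    public static void main(String[] args) {
        // three trees vote on the "label" field, whose values are "0" and "1"
        System.out.println(majorityVote(List.of("1", "0", "1")));
    }
}
```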
  • 42. Pattern – score a model, within an app
      public class Main {
        public static void main( String[] args ) {
          String pmmlPath = args[ 0 ];
          String ordersPath = args[ 1 ];
          String classifyPath = args[ 2 ];
          String trapPath = args[ 3 ];

          Properties properties = new Properties();
          AppProps.setApplicationJarClass( properties, Main.class );
          HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

          // create source and sink taps
          Tap ordersTap = new Hfs( new TextDelimited( true, "\t" ), ordersPath );
          Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
          Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );

          // define a "Classifier" model from PMML to evaluate the orders
          ClassifierFunction classFunc = new ClassifierFunction( new Fields( "score" ), pmmlPath );
          Pipe classifyPipe = new Each( new Pipe( "classify" ), classFunc.getInputFields(), classFunc, Fields.ALL );

          // connect the taps, pipes, etc., into a flow
          FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
            .addSource( classifyPipe, ordersTap )
            .addTrap( classifyPipe, trapTap )
            .addSink( classifyPipe, classifyTap );

          // write a DOT file and run the flow
          Flow classifyFlow = flowConnector.connect( flowDef );
          classifyFlow.writeDOT( "dot/classify.dot" );
          classifyFlow.complete();
        }
      }
  • 43. Pattern – score a model, using pre-defined Cascading app • [flow diagram; labels: Customer Orders, PMML Model, Classify (M), Assert, Failure Traps, Scored Orders, GroupBy token, Count (R), Confusion Matrix]
  • 44. Pattern – score a model, using pre-defined Cascading app
      ## run an RF classifier at scale
      hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap --pmml data/sample.rf.xml

      ## run an RF classifier at scale, assert regression test, measure confusion matrix
      hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap --pmml data/sample.rf.xml --assert --measure out/measure

      ## run a predictive model at scale, measure RMSE
      hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap --pmml data/iris.lm_p.xml --rmse out/measure
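The last command above measures RMSE (root mean squared error) for a regression model. As a reminder of what that metric computes, here is a small plain-Java sketch. It is not the code behind the `--rmse` option: the class `RmseSketch` and its input arrays are hypothetical.

```java
// Plain-Java illustration only; not the implementation behind the --rmse option.
final class RmseSketch {

    // RMSE = sqrt( mean( (predicted - actual)^2 ) )
    static double rmse(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and the same length");
        }
        double sumSquares = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double diff = predicted[i] - actual[i];
            sumSquares += diff * diff;
        }
        return Math.sqrt(sumSquares / predicted.length);
    }

    public static void main(String[] args) {
        // errors of 3 and 4 give sqrt((9 + 16) / 2)
        System.out.println(rmse(new double[] {0.0, 0.0}, new double[] {3.0, 4.0}));
    }
}
```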
  • 45. Pattern – evaluating results
      bash-3.2$ head out/classify/part-00000
      label  var0  var1  var2  order_id  predicted  score
      1      0     1     0     6f8e1014  1          1
      0      0     0     1     6f8ea22e  0          0
      1      0     1     0     6f8ea435  1          1
      0      0     0     1     6f8ea5e1  0          0
      1      0     1     0     6f8ea785  1          1
      1      0     1     0     6f8ea91e  1          1
      0      1     0     0     6f8eaaba  0          0
      1      0     1     0     6f8eac54  1          1
      0      1     1     0     6f8eade3  1          1
  • 46. Lingual – connecting Hadoop and R
      # load the JDBC package
      library(RJDBC)

      # set up the driver
      drv <- JDBC("cascading.lingual.jdbc.Driver", "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")

      # set up a database connection to a local repository
      connection <- dbConnect(drv, "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")

      # query the repository: in this case the MySQL sample database (CSV files)
      df <- dbGetQuery(connection, "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
      head(df)

      # use R functions to summarize and visualize part of the data
      df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
      summary(df$hire_age)

      library(ggplot2)
      m <- ggplot(df, aes(x=hire_age))
      m <- m + ggtitle("Age at hire, people named Gina")
      m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
  • 47. Lingual – connecting Hadoop and R
      > summary(df$hire_age)
         Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
        20.86   27.89   31.70   31.61   35.01   43.92
      cascading.org/lingual
      launchpad.net/test-db
  • 48. Pattern: predictive models at scale • Enterprise Data Workflows • Sample Code • A Little Theory… • Pattern • PMML • Roadmap • Customer Experiments
  • 49. PMML – standard • established XML standard for predictive model markup • organized by Data Mining Group (DMG), since 1997 http://dmg.org/ • members: IBM, SAS, Visa, NASA, Equifax, Microstrategy, Microsoft, etc. • PMML concepts for metadata, ensembles, etc., translate directly into Cascading tuple flows • “PMML is the leading standard for statistical and data mining models and supported by over 20 vendors and organizations. With PMML, it is easy to develop a model on one system using one application and deploy the model on another system using another application.” wikipedia.org/wiki/Predictive_Model_Markup_Language
  • 50. PMML – models • Association Rules: AssociationModel element • Cluster Models: ClusteringModel element • Decision Trees: TreeModel element • Naïve Bayes Classifiers: NaiveBayesModel element • Neural Networks: NeuralNetwork element • Regression: RegressionModel and GeneralRegressionModel elements • Rulesets: RuleSetModel element • Sequences: SequenceModel element • Support Vector Machines: SupportVectorMachineModel element • Text Models: TextModel element • Time Series: TimeSeriesModel element • ibm.com/developerworks/industry/library/ind-PMML2/
  • 51. PMML – vendor coverage
  • 52. Pattern: predictive models at scale • Enterprise Data Workflows • Sample Code • A Little Theory… • Pattern • PMML • Roadmap • Customer Experiments
  • 53. roadmap – existing algorithms for scoring • Random Forest • Decision Trees • Linear Regression • GLM • Logistic Regression • K-Means Clustering • Hierarchical Clustering • Support Vector Machines • cascading.org/pattern
  • 54. roadmap – top priorities for creating models at scale • Random Forest • Logistic Regression • K-Means Clustering • a wealth of recent research indicates many opportunities to parallelize popular algorithms for training models at scale on Apache Hadoop… • cascading.org/pattern
  • 55. roadmap – next priorities for scoring • Time Series (ARIMA forecast) • Association Rules (basket analysis) • Naïve Bayes • Neural Networks • algorithms extended based on customer use cases – contact @pacoid • cascading.org/pattern
  • 56. Pattern: predictive models at scale • Enterprise Data Workflows • Sample Code • A Little Theory… • Pattern • PMML • Roadmap • Customer Experiments
  • 57. experiments – comparing models • much customer interest in leveraging Cascading and Apache Hadoop to run customer experiments at scale • run multiple variants, then measure relative “lift” • Concurrent runtime – tag and track models • the following example compares two models trained with different machine learning algorithms • the comparison is exaggerated: one model has an important variable intentionally omitted to help illustrate the experiment
  • 58. experiments – Random Forest model
      ## train a Random Forest model
      ## example: http://mkseo.pe.kr/stats/?p=220
      f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
      fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
      print(fit)
      saveXML(pmml(fit), file=paste(out_folder, "sample.rf.xml", sep="/"))

      OOB estimate of error rate: 14%
      Confusion matrix:
          0   1 class.error
      0  69  16   0.1882353
      1  12 103   0.1043478
  • 59. experiments – Logistic Regression model
      ## train a Logistic Regression model (special case of GLM)
      ## example: http://www.stat.cmu.edu/~cshalizi/490/clustering/clustering01.r
      f <- as.formula("as.factor(label) ~ var0 + var2")
      fit <- glm(f, family=binomial, data=data)
      print(summary(fit))
      saveXML(pmml(fit), file=paste(out_folder, "sample.lr.xml", sep="/"))

      Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
      (Intercept)   1.8524     0.3803   4.871 1.11e-06 ***
      var0         -1.3755     0.4355  -3.159  0.00159 **
      var2         -3.7742     0.5794  -6.514 7.30e-11 ***
      ---
      Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

      NB: this model has “var1” intentionally omitted
  • 60. experiments – comparing results • use a confusion matrix to compare results for the classifiers • Logistic Regression has a lower “false negative” rate (5% vs. 11%); however, it has a much higher “false positive” rate (52% vs. 14%) • assign a cost model to select a winner – for example, in an ecommerce anti-fraud classifier: FN ∼ chargeback risk, FP ∼ customer support costs
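One way to "assign a cost model to select a winner" is to weight each error rate by what that kind of error costs. The plain-Java sketch below uses the false negative and false positive rates quoted on this slide. The class `CostModelSketch`, the cost figures, and the simple weighted-sum formula (which ignores class balance) are assumptions made for illustration, not part of the original talk.

```java
// Plain-Java illustration only; cost figures and the weighted-sum formula are hypothetical.
final class CostModelSketch {

    // simple weighted sum: each error rate times the cost of that kind of error
    static double expectedCost(double fnRate, double fpRate, double fnCost, double fpCost) {
        return fnRate * fnCost + fpRate * fpCost;
    }

    // compare the two classifiers using the error rates quoted on the slide
    static String cheaperModel(double fnCost, double fpCost) {
        double randomForest = expectedCost(0.11, 0.14, fnCost, fpCost);
        double logisticRegression = expectedCost(0.05, 0.52, fnCost, fpCost);
        return randomForest <= logisticRegression ? "Random Forest" : "Logistic Regression";
    }

    public static void main(String[] args) {
        // hypothetical costs: a chargeback (FN) vs. a customer support contact (FP)
        System.out.println(cheaperModel(100.0, 10.0));
        System.out.println(cheaperModel(100.0, 50.0));
    }
}
```

The choice flips as the cost of a false positive rises relative to a false negative, which is why the slide asks for a cost model rather than a single accuracy number.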
  • 61. references… • Enterprise Data Workflows with Cascading, O’Reilly, 2013 • amazon.com/dp/1449358721
  • 62. drill-down… • blog, dev community, code/wiki/gists, maven repo, commercial products, career opportunities: cascading.org, zest.to/group11, github.com/Cascading, conjars.org, goo.gl/KQtUL, concurrentinc.com • Copyright @2013, Concurrent, Inc.