“Functional programming for optimization problems in Big Data”

Paco Nathan
Concurrent, Inc.
San Francisco, CA
@pacoid

Copyright @2013, Concurrent, Inc.




The Workflow Abstraction
                         [flow diagram: Document Collection, Tokenize, Scrub token, Regex token, HashJoin Left with Stop Word List (RHS), GroupBy token, Count, Word Count; M and R mark the map and reduce phases]




                      1. Data Science
                      2. Functional Programming
                      3. Workflow Abstraction
                      4. Typical Use Cases
                      5. Open Data Example



Let’s consider the trendline following the Q3 1997 inflection point, which enabled huge ecommerce successes and commercialized Big Data.
Where did Big Data come from, and where is this kind of work headed?
Q3 1997: inflection point

             Four independent teams were working toward horizontal
             scale-out of workflows based on commodity hardware.
             This effort prepared the way for huge Internet successes
             in the 1997 holiday season… AMZN, EBAY, Inktomi
             (YHOO Search), then GOOG

             MapReduce and the Apache Hadoop open source stack
             emerged from this.




Q3 1997: Greg Linden, et al., @ Amazon, Randy Shoup, et al., @ eBay -- independent teams arrived at the same conclusion:

parallelize workloads onto clusters of commodity servers to scale out horizontally.
Google and Inktomi (YHOO Search) were working along the same lines.
Circa 1996: pre-inflection point

             [architecture diagram: Stakeholder, Customers, BI Analysts producing Excel pivot tables and PowerPoint slide decks, strategy, Product, requirements, Engineering delivering optimized code, Web App, transactions, RDBMS, SQL Query result sets]




Perl and C++ for CGI :)
Feedback loops shown in red represent data innovations at the time… these are rather static.

Characterized by slow, manual processes:
data modeling / business intelligence; “throw it over the wall”…
this thinking led to impossible silos
Circa 2001: post-big ecommerce successes

             [architecture diagram: Stakeholder, Product, Customers, dashboards, UX, Engineering, models, servlets, Algorithmic Modeling producing recommenders + classifiers, Web Apps, Middleware, aggregation, event history, SQL Query result sets, customer transactions, Logs, DW, ETL, RDBMS]




Machine data (unstructured logs) captured social interactions. Data from aggregated logs fed into algorithmic modeling to produce recommenders, classifiers, and other predictive models -- e.g., ad networks automating parts of the
marketing funnel, as in our case study.

LinkedIn, Facebook, Twitter, Apple, etc., followed these early successes. Algorithmic modeling, leveraging machine data, allowed Big Data to be monetized.
Circa 2013: clusters everywhere

             [architecture diagram: Domain Expert (business process), Data Scientist (data science: discovery + modeling), App Dev, and Ops as the introduced capability, alongside the existing SDLC (Prod, s/w dev, Eng, Ops); Data Products, Workflow, dashboard metrics, History, Planner, taps, optimized capacity; Use Cases Across Topologies: Hadoop etc., Log Events, In-Memory Data Grid, batch / near time, Cluster Scheduler, DW, RDBMS; Web Apps, Mobile, etc. serving Customers via services, social interactions, transactions, content]


Here’s what our more savvy customers are using for architecture and process today: traditional SDLC, but also Data Science inter-disciplinary teams.
Also, machine data (app history) driving planners and schedulers for advanced multi-tenant cluster computing fabric.

Not unlike a practice at LLL, where much more data gets collected about the machine than about the experiment.

We see this feeding into cluster optimization in YARN, Apache Mesos, etc.
references…

                       by Leo Breiman
                       Statistical Modeling: The Two Cultures
                       Statistical Science, 2001
                       bit.ly/eUTh9L




Leo Breiman wrote an excellent paper in 2001, “Two Cultures”, chronicling this evolution and the sea change from data modeling (silos, manual process) to algorithmic modeling (machine data for automation/optimization)
references…

                   Amazon
                   “Early Amazon: Splitting the website” – Greg Linden
                   glinden.blogspot.com/2006/02/early-amazon-splitting-website.html

                   eBay
                   “The eBay Architecture” – Randy Shoup, Dan Pritchett
                   addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
                   addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf

                   Inktomi (YHOO Search)
                   “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
                   youtube.com/watch?v=E91oEn1bnXM

                   Google
                   “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
                   youtube.com/watch?v=qsan-GQaeyk
                   perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx




In their own words…
core values

                   Data Science teams develop actionable insights, building
                   confidence for decisions

                   that work may influence a few decisions worth billions
                   (e.g., M&A) or billions of small decisions (e.g., AdWords)

                   probably somewhere in-between…
                   solving for pattern, at scale.

                   by definition, this is a multi-disciplinary
                   pursuit which requires teams, not sole
                   players




team process = needs


                         discovery:     help people ask the right questions

                         modeling:      allow automation to place informed bets

                         integration:   deliver products at scale to customers

                         apps:          build smarts into product features

                         systems:       keep infrastructure running, cost-effective

                         [image: Gephi]




team composition = roles
                                             [flow diagram, as above]

                                             Domain Expert       business process, stakeholder

                                             Data Scientist      data prep, discovery, modeling, etc.
                                                                 (data science)

                                             App Dev             software engineering, automation

                                             Ops                 systems engineering, access

                                             introduced capability




This is an example of multi-disciplinary team composition for data science,
while other emerging problem spaces will require more specific kinds of team roles.
matrix: evaluate needs × roles

                         [matrix: rows = stakeholder, scientist, developer, ops;
                          columns = discovery, modeling, integration, apps, systems]




most valuable skills

                   approximately 80% of the costs for data-related projects
                   get spent on data preparation – mostly on cleaning up
                   data quality issues: ETL, log file analysis, etc.

                   unfortunately, data-related budgets for many companies tend
                   to go into frameworks which can only be used after clean up

                   most valuable skills:
                         ‣ learn to use programmable tools that prepare data

                         ‣ learn to generate compelling data visualizations

                         ‣ learn to estimate the confidence for reported results

                         ‣ learn to automate work, making analysis repeatable
                   the rest of the skills – modeling,
                   algorithms, etc. – those are secondary

                   [image: D3]



science in data science?

                     in a nutshell, what we do…

                     [image: backdrop of raw event-log entries (mirrored in the original),
                      e.g. “NUI:DressUpMode”, “Client Inventory Panel Apply Product”,
                      “Customer Made Purchase Cart Page Step 2”, “Address space remaining: 512M”]

                     ‣ estimate probability
                     ‣ calculate analytic variance
                     ‣ manipulate order complexity
                     ‣ leverage use of learning theory

                     +   collab with DevOps, Stakeholders
                     +   reduce work to cron entries




references…

                       by DJ Patil
                       Data Jujitsu
                       O’Reilly, 2012
                       amazon.com/dp/B008HMN5BE
                       Building Data Science Teams
                       O’Reilly, 2011
                       amazon.com/dp/B005O4U3ZE



The Workflow Abstraction
                         [flow diagram: Document Collection, Tokenize, Scrub token, Regex token, HashJoin Left with Stop Word List (RHS), GroupBy token, Count, Word Count; M and R mark the map and reduce phases]




                      1. Data Science
                      2. Functional Programming
                      3. Workflow Abstraction
                      4. Typical Use Cases
                      5. Open Data Example



Origin and overview of Cascading API as a workflow abstraction for Enterprise Big Data apps.
Cascading – origins

             API author Chris Wensel worked as a system architect
             at an Enterprise firm well-known for several popular
             data products.
             Wensel was following the Nutch open source project –
             before Hadoop even had a name.
             He noted that it would become difficult to find Java
             developers to write complex Enterprise apps directly
             in Apache Hadoop – a potential blocker for leveraging
             this new open source technology.




Cascading initially grew from interaction with the Nutch project, before Hadoop had a name

API author Chris Wensel recognized that MapReduce would be too complex for J2EE developers to perform substantial work in an Enterprise context without an abstraction layer.
Cascading – functional programming

             Key insight: MapReduce is based on functional programming
             – going back to LISP in the 1970s. Apache Hadoop use cases are
             mostly about data pipelines, which are functional in nature.
             To ease staffing problems as “Main Street” Enterprise firms
             began to embrace Hadoop, Cascading was introduced
             in late 2007, as a new Java API to implement functional
             programming for large-scale data workflows.




Years later, Enterprise app deployments on Hadoop are limited by staffing issues: difficulty of retraining staff, scarcity of Hadoop experts.
examples…

                       • Twitter, Etsy, eBay, YieldBot, uSwitch, etc., have invested
                           in functional programming open source projects atop
                           Cascading – used for their large-scale production
                           deployments
                       •   new case studies for Cascading apps are mostly
                           based on domain-specific languages (DSLs) in JVM
                           languages which emphasize functional programming:

                           Cascalog in Clojure (2010)
                           Scalding in Scala (2012)


                     github.com/nathanmarz/cascalog/wiki
                     github.com/twitter/scalding/wiki




Many case studies, many Enterprise production deployments now for 5+ years.
The Ubiquitous Word Count

             [flow diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]

             Definition:
                count how often each word appears
                in a collection of text documents

             This simple program provides an excellent test case for
             parallel processing, since it:
              • requires a minimal amount of code
              • demonstrates use of both symbolic and numeric values
              • shows a dependency graph of tuples as an abstraction
              • is not many steps away from useful search indexing
              • serves as a “Hello World” for Hadoop apps

             void map (String doc_id, String text):
               for each word w in segment(text):
                 emit(w, "1");

             void reduce (String word, Iterator group):
               int count = 0;
               for each pc in group:
                 count += Int(pc);
               emit(word, String(count));

             Any distributed computing framework which can run Word
             Count efficiently in parallel at scale can handle much
             larger and more interesting compute problems.


Taking a wild guess, most people who’ve written any MapReduce code have seen this example app already...
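
To make that pseudocode concrete, here is a minimal, purely local sketch in plain Scala (not from the slides; no Hadoop or Cascading involved, and the sample documents and the regex are illustrative only):

             // local, in-memory analogue of the map/reduce pseudocode above
             object LocalWordCount {
               // "map" phase: split each document into (token, 1) pairs
               def mapPhase(docs: Seq[String]): Seq[(String, Int)] =
                 docs.flatMap(_.split("""[ \[\](),.]+""").filter(_.nonEmpty).map(w => (w, 1)))

               // "reduce" phase: group by token and sum the partial counts
               def reducePhase(pairs: Seq[(String, Int)]): Map[String, Int] =
                 pairs.groupBy(_._1).map { case (token, group) => token -> group.map(_._2).sum }

               def main(args: Array[String]): Unit = {
                 val docs = Seq("rain shadow rain", "shadow of a rain shadow")
                 reducePhase(mapPhase(docs)).foreach { case (w, n) => println(s"$w\t$n") }
               }
             }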
word count – conceptual flow diagram

                 [flow diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]

                 1 map
                 1 reduce
                18 lines code

                 cascading.org/category/impatient
                 gist.github.com/3900702


Based on a Cascading implementation of Word Count, this is a conceptual flow diagram: the pattern language in use to specify the business process, using a literate programming methodology to describe a data workflow.
word count – Cascading app in Java

             [flow diagram, as above]

             String docPath = args[ 0 ];
             String wcPath = args[ 1 ];
             Properties properties = new Properties();
             AppProps.setApplicationJarClass( properties, Main.class );
             HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

             // create source and sink taps
             Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
             Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

             // specify a regex to split "document" text lines into token stream
             Fields token = new Fields( "token" );
             Fields text = new Fields( "text" );
             RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
             // only returns "token"
             Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

             // determine the word counts
             Pipe wcPipe = new Pipe( "wc", docPipe );
             wcPipe = new GroupBy( wcPipe, token );
             wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

             // connect the taps, pipes, etc., into a flow
             FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
              .addSource( docPipe, docTap )
              .addTailSink( wcPipe, wcTap );

             // write a DOT file and run the flow
             Flow wcFlow = flowConnector.connect( flowDef );
             wcFlow.writeDOT( "dot/wc.dot" );
             wcFlow.complete();



Based on a Cascading implementation of Word Count, here is sample code --
approx 1/3 the code size of the Word Count example from Apache Hadoop

2nd to last line: generates a DOT file for the flow diagram
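
If you want to render that flow diagram yourself, the generated dot/wc.dot file can be fed to Graphviz (assuming Graphviz is installed; the output filename is arbitrary):

             dot -Tpng dot/wc.dot -o dot/wc.png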
word count – generated flow diagram

                 [flow diagram, as above]

                 [head]
                 Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']
                     [{2}:'doc_id', 'text']
                     [{2}:'doc_id', 'text']
                                                                              map
                 Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
                     [{1}:'token']
                     [{1}:'token']
                 GroupBy('wc')[by:['token']]
                     wc[{1}:'token']
                     [{1}:'token']
                                                                              reduce
                 Every('wc')[Count[decl:'count']]
                     [{2}:'token', 'count']
                     [{1}:'token']
                 Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']
                     [{2}:'token', 'count']
                     [{2}:'token', 'count']
                 [tail]


As a concrete example of literate programming in Cascading,
here is the DOT representation of the flow plan -- generated by the app itself.
word count – Cascalog / Clojure

              [flow diagram, as above]

              (ns impatient.core
                (:use [cascalog.api]
                      [cascalog.more-taps :only (hfs-delimited)])
                (:require [clojure.string :as s]
                          [cascalog.ops :as c])
                (:gen-class))

              (defmapcatop split [line]
                "reads in a line of string and splits it by regex"
                (s/split line #"[\[\]\(\),.)\s]+"))

              (defn -main [in out & args]
                (?<- (hfs-delimited out)
                     [?word ?count]
                     ((hfs-delimited in :skip-header? true) _ ?line)
                     (split ?line :> ?word)
                     (c/count ?count)))

              ; Paul Lam
              ; github.com/Quantisan/Impatient




Here is the same Word Count app written in Clojure, using Cascalog.
word count – Cascalog / Clojure
              github.com/nathanmarz/cascalog/wiki

              [flow diagram, as above]




               • implements Datalog in Clojure, with predicates backed
                 by Cascading – for a highly declarative language
               • run ad-hoc queries from the Clojure REPL –
                 approx. 10:1 code reduction compared with SQL
               • composable subqueries, used for test-driven development
                 (TDD) practices at scale
               • Leiningen build: simple, no surprises, in Clojure itself
               • more new deployments than other Cascading DSLs –
                 Climate Corp is largest use case: 90% Clojure/Cascalog
               • has a learning curve, limited number of Clojure developers
               • aggregators are the magic, and those take effort to learn




From what we see about language features, customer case studies, and best practices in general --
Cascalog represents some of the most sophisticated uses of Cascading, as well as some of the largest deployments.

Great for large-scale, complex apps, where small teams must limit the complexities in their process.
word count – Scalding / Scala

            [flow diagram, as above]

            import com.twitter.scalding._

            class WordCount(args : Args) extends Job(args) {
              Tsv(args("doc"),
                   ('doc_id, 'text),
                   skipHeader = true)
                .read
                .flatMap('text -> 'token) {
                   text : String => text.split("[ \\[\\]\\(\\),.]")
                 }
                .groupBy('token) { _.size('count) }
                .write(Tsv(args("wc"), writeHeader = true))
            }




Here is the same Word Count app written in Scala, using Scalding.

Very compact, easy to understand; however, also more imperative than Cascalog.
word count – Scalding / Scala
              github.com/twitter/scalding/wiki

              [flow diagram, as above]




                 • extends the Scala collections API so that distributed lists
                   become “pipes” backed by Cascading – see the sketch below
                • code is compact, easy to understand
                • nearly 1:1 between elements of conceptual flow diagram
                  and function calls
                • extensive libraries are available for linear algebra, abstract
                  algebra, machine learning – e.g., Matrix API, Algebird, etc.
                • significant investments by Twitter, Etsy, eBay, etc.
                • great for data services at scale
                • less learning curve than Cascalog,
                  not as much of a high-level language




Wednesday, 06 March 13                                                                                                                                                                                                              27
If you wanted to see what a data services architecture for machine learning at, say, Google scale would look like as an open source project -- that’s Scalding. That’s what they’re doing.
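To make the first bullet above concrete, here is a minimal local sketch (not from the deck): the same word count expressed against ordinary Scala collections, showing how closely Scalding’s pipe operations mirror the standard collections API. The document IDs and text are made-up sample values.

            // Hypothetical local analogue of the Scalding job above, using only
            // the standard Scala collections API.
            object LocalWordCount extends App {
              val docs: List[(String, String)] = List(
                ("doc01", "A rain shadow is a dry area"),
                ("doc02", "rain and shadow and rain")
              )

              val counts: Map[String, Int] =
                docs
                  .flatMap { case (_, text) => text.split("[ \\[\\]\\(\\),.]") }  // ~ flatMap('text -> 'token)
                  .filter(_.nonEmpty)
                  .groupBy(identity)                                              // ~ groupBy('token)
                  .map { case (token, occurrences) => token -> occurrences.size } // ~ _.size('count)

              counts.toSeq.sortBy(_._1).foreach { case (t, n) => println(s"$t\t$n") }
            }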
word count – Scalding / Scala
              github.com/twitter/scalding/wiki

              [flow diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]

                • extends the Scala collections API so that distributed lists
                  become “pipes” backed by Cascading
                • code is compact, easy to understand
                • nearly 1:1 between elements of conceptual flow diagram
                  and function calls
                • extensive libraries are available for linear algebra, abstract
                  algebra, machine learning – e.g., Matrix API, Algebird, etc.
                • significant investments by Twitter, Etsy, eBay, etc.
                • great for data services at scale
                  (imagine SOA infra @ Google as an open source project)
                • less learning curve than Cascalog,
                  not as much of a high-level language

                  Cascalog and Scalding DSLs leverage the functional aspects
                  of MapReduce, helping to limit complexity in process



Wednesday, 06 March 13                                                                                                                                                                                28
Arguably, using a functional programming language to build flows is better than trying to represent functional programming constructs within Java…
The Workflow Abstraction
                              [flow diagram: Document Collection → Tokenize / Scrub token (M) → HashJoin Left (Stop Word List as RHS) / Regex token → GroupBy token (R) → Count → Word Count]

                      1. Data Science
                      2. Functional Programming
                      3. Workflow Abstraction
                      4. Typical Use Cases
                      5. Open Data Example



Wednesday, 06 March 13                                                                                                                                  29
CS theory related to data workflow abstraction, to manage complexity
Cascading workflows – pattern language

             Cascading uses a “plumbing” metaphor in the Java API,
             to define workflows out of familiar elements: Pipes, Taps,
             Tuple Flows, Filters, Joins, Traps, etc.
              [flow diagram: Document Collection → Tokenize / Scrub token (M) → HashJoin Left (Stop Word List as RHS) / Regex token → GroupBy token (R) → Count → Word Count]

              Data is represented as flows of tuples. Operations within
              the tuple flows bring functional programming aspects into
              Java apps.
              In formal terms, this provides a pattern language.


Wednesday, 06 March 13                                                                                                                   30
A pattern language, based on the metaphor of “plumbing”
references…

                      pattern language: a structured method for solving
                      large, complex design problems, where the syntax of
                      the language promotes the use of best practices.

                      amazon.com/dp/0195019199



                      design patterns: the notion originated in consensus
                      negotiation for architecture, later applied in OOP
                      software engineering by “Gang of Four”.
                      amazon.com/dp/0201633612




Wednesday, 06 March 13                                                                                              31
Chris Alexander originated the use of pattern language in a project called “The Oregon Experiment”, in the 1970s.
Cascading workflows – literate programming

             Cascading workflows generate their own visual
             documentation: flow diagrams

               [flow diagram: Document Collection → Tokenize / Scrub token (M) → HashJoin Left (Stop Word List as RHS) / Regex token → GroupBy token (R) → Count → Word Count]

               In formal terms, flow diagrams leverage a methodology
               called literate programming
               Provides intuitive, visual representations for apps, great
               for cross-team collaboration.


Wednesday, 06 March 13                                                                                                                                                                                                                       32
Formally speaking, the pattern language in Cascading gets leveraged as a visual representation used for literate programming.

Several good examples exist, but the phenomenon of different developers troubleshooting a program together over the “cascading-users” email list is most telling -- expert developers generally ask a novice to provide a flow diagram first
references…

                       by Don Knuth
                       Literate Programming
                       Univ of Chicago Press, 1992
                       literateprogramming.com/

                       “Instead of imagining that our main task is
                        to instruct a computer what to do, let us
                        concentrate rather on explaining to human
                        beings what we want a computer to do.”




Wednesday, 06 March 13                                                                                    33
Don Knuth originated the notion of literate programming, or code as “literature” which explains itself.
examples…

                         • Scalding apps have nearly 1:1 correspondence
                             between function calls and the elements in their
                             flow diagrams – excellent elision and literate
                             representation
                         •   noticed on cascading-users email list:
                             when troubleshooting issues, Cascading experts ask
                             novices to provide an app’s flow diagram (generated
                             as a DOT file), sometimes in lieu of showing code

                       In formal terms, a flow diagram is a directed, acyclic
                       graph (DAG) on which lots of interesting math applies
                       for query optimization, predictive models about app
                       execution, parallel efficiency metrics, etc.

                       [DOT flow diagram for the Word Count app:
                        [head] → Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']
                        → (map) Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
                        → GroupBy('wc')[by:['token']]
                        → (reduce) Every('wc')[Count[decl:'count']]
                        → Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc'] → [tail]]




Wednesday, 06 March 13                                                                                                                                                                              34
Literate programming examples observed on the email list are some of the best illustrations of this methodology.
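As a small illustration of the DAG point above (a sketch, not Cascading’s actual planner): model the Word Count flow diagram as a graph in plain Scala and compute a topological order, the kind of graph computation a flow planner performs before scheduling parallel jobs. Node names are abbreviated from the DOT output shown above.

            // Minimal sketch: the Word Count flow as a DAG, plus a topological sort.
            object FlowDagSketch extends App {
              // edges abbreviated from the DOT output above
              val edges: Map[String, List[String]] = Map(
                "head"          -> List("Hfs:rain.txt"),
                "Hfs:rain.txt"  -> List("Each:token"),
                "Each:token"    -> List("GroupBy:wc"),
                "GroupBy:wc"    -> List("Every:count"),
                "Every:count"   -> List("Hfs:output/wc"),
                "Hfs:output/wc" -> List("tail"),
                "tail"          -> Nil
              )

              def topoSort(graph: Map[String, List[String]]): List[String] = {
                var visited = Set.empty[String]
                var order   = List.empty[String]
                def visit(node: String): Unit =
                  if (!visited(node)) {
                    visited += node
                    graph.getOrElse(node, Nil).foreach(visit)
                    order = node :: order  // prepend after children => topological order
                  }
                graph.keys.foreach(visit)
                order
              }

              println(topoSort(edges).mkString(" -> "))
            }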
Cascading workflows – business process

             Following the essence of literate programming, Cascading
             workflows provide statements of business process.
             This recalls a sense of business process management
             for Enterprise apps (think BPM/BPEL for Big Data),
             as a separation of concerns between business process
             and implementation details (Hadoop, etc.)
             This is especially apparent in large-scale Cascalog apps:
                 “Specify what you require, not how to achieve it.”
             By virtue of the pattern language, the flow planner used
             in a Cascading app determines how to translate business
             process into efficient, parallel jobs at scale.




Wednesday, 06 March 13                                                   35
Business Stakeholder POV:
business process management for workflow orchestration (think BPM/BPEL)
references…

                      by Edgar Codd
                      “A relational model of data for large shared data banks”
                      Communications of the ACM, 1970
                      dl.acm.org/citation.cfm?id=362685
                      Rather than arguing between SQL vs. NoSQL…
                      structured vs. unstructured data frameworks…
                      this approach focuses on:
                            the process of structuring data
                      That’s what apps do – Making Data Work




Wednesday, 06 March 13                                                                                                                                          36
Focus on *the process of structuring data*
which must happen before the large-scale joins, predictive models, visualizations, etc.

Just because your data is loaded into a “structured” store, that does not imply that your app has finished structuring it for the purpose of making data work.

BTW, anybody notice that the O’Reilly “animal” for the Cascading book is an Atlantic Cod? (pun intended)
Cascading workflows – functional relational programming

             The combination of functional programming, pattern language,
             DSLs, literate programming, business process, etc., traces back
             to the original definition of the relational model (Codd, 1970)
             prior to SQL.
             Cascalog, in particular, implements more of what Codd intended
             for a “data sublanguage” and is considered to be close to a full
             implementation of the functional relational programming
             paradigm defined in:
                    Moseley & Marks, 2006
                    “Out of the Tar Pit”
                    goo.gl/SKspn




Wednesday, 06 March 13                                                          37
A more contemporary statement along similar lines...
Two Avenues…

              Enterprise: must contend with
              complexity at scale every day…
              incumbents extend current practices and
              infrastructure investments – using J2EE,
              ANSI SQL, SAS, etc. – to migrate
              workflows onto Apache Hadoop while
              leveraging existing staff

              Start-ups: crave complexity and
              scale to become viable…
              new ventures move into Enterprise space
              to compete using relatively lean staff,
              while leveraging sophisticated engineering
              practices, e.g., Cascalog and Scalding

              [chart axes: complexity ➞ and scale ➞]

Wednesday, 06 March 13                                                                                                                        38
Enterprise data workflows are observed in two modes: start-ups approaching complexity and incumbent firms grappling with complexity
Cascading workflows – functional relational programming

           The combination of functional programming, pattern language,
           DSLs, literate programming, business process, etc., traces back
           to the original definition of the relational model (Codd, 1970)
           prior to SQL.
            Cascalog, in particular, implements more of what Codd intended
            for a “data sublanguage” and is considered to be close to a full
            implementation of the functional relational programming
            paradigm defined in:
                  Moseley & Marks, 2006
                  “Out of the Tar Pit”
                  goo.gl/SKspn

            several theoretical aspects converge into software engineering
            practices which mitigate the complexity of building and
            maintaining Enterprise data workflows



Wednesday, 06 March 13                                                            39
The Workflow Abstraction
                             [flow diagram: Document Collection → Tokenize / Scrub token (M) → HashJoin Left (Stop Word List as RHS) / Regex token → GroupBy token (R) → Count → Word Count]

                      1. Data Science
                      2. Functional Programming
                      3. Workflow Abstraction
                      4. Typical Use Cases
                      5. Open Data Example



Wednesday, 06 March 13                                                                                                                                 40
Here are a few use cases to consider, for Enterprise data workflows
Cascading – deployments

               • 5+ years of Enterprise production deployments,
                   ASL 2 license, GitHub src, http://conjars.org

              • partners: Amazon AWS, Microsoft Azure, Hortonworks,
                   MapR, EMC, SpringSource, Cloudera

              • case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma,
                   uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, etc.

              • use cases: ETL, marketing funnel, anti-fraud, social media,
                   retail pricing, search analytics, recommenders, eCRM,
                   utility grids, genomics, climatology, etc.




Wednesday, 06 March 13                                                               41
Several published case studies about Cascading, Cascalog, Scalding, etc.
Wide range of use cases.

Significant investment by Twitter, Etsy, and other firms for OSS based on Cascading.
Partnerships with the various Hadoop distro vendors, cloud providers, etc.
Finance: Ecommerce Risk

                   Problem:




                                                                                    stat.berkeley.edu
                   <1% chargeback rate allowed by Visa, others follow
                    • may leverage CAPTURE/AUTH wait period
                     • Cybersource, Vindicia, and others haven’t stopped fraud
                   >15% chargeback rate common for mobile in US:
                    • not much info shared with merchant
                    • carrier as judge/jury/executioner; customer assumed correct
                   most common: professional fraud (identity theft, etc.)
                    • patterns of attack change all the time
                    • widespread use of IP proxies, to mask location
                    • global market for stolen credit card info
                   other common case is friendly fraud
                    • teenager billing to parent’s cell phone



Wednesday, 06 March 13                                                                                  42
Finance: Ecommerce Risk

                   KPI:




                                                                                            stat.berkeley.edu
                   chargeback rate (CB)
                    • ground truth for how much fraud the bank/carrier claims
                    • 7-120 day latencies from the bank
                   false positive rate (FP)
                    • estimated cost: predicts customer support issues
                    • complaints due to incorrect fraud scores on valid orders (or lies)
                   false negative rate (FN)
                    • estimated risk: how much fraud may pass undetected in future orders
                    • changes with new product features/services/inventory/marketing




Wednesday, 06 March 13                                                                                          43
Finance: Ecommerce Risk

                   Data Science Issues:




                                                                                          stat.berkeley.edu
                   • chargeback limits imply few training cases
                   • sparse data implies lots of missing values – must impute
                   • long latency on chargebacks – “good” flips to “bad”
                   • most detection occurs within large-scale batch,
                         decisions required during real-time event processing
                   • not just one pattern to detect – many, ever-changing
                   • many unknowns: blocked orders scare off professional fraud,
                         inferences cannot be confirmed
                   • cannot simply use raw data as input – requires lots of
                         data preparation and statistical modeling
                   • each ecommerce firm has shopping/policy nuances
                         which get exploited differently – hard to generalize solutions



Wednesday, 06 March 13                                                                                        44
Finance: Ecommerce Risk

                   Predictive Analytics:




                                                                                             stat.berkeley.edu
                   batch
                    •    cluster/segment customers for expected behaviors
                    •    adjust for seasonal variation
                    •    geospatial indexing / bayesian point estimates (fraud by lat/lng)
                    •    impute missing values (“guesses” to fill-in sparse data)
                    •    run anti-fraud classifier (customer 360)
                   real-time
                     • exponential smoothing (estimators for velocity)
                     • calculate running medians (anomaly detection) – both sketched below
                    • run anti-fraud classifier (per order)




Wednesday, 06 March 13                                                                                           45
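A minimal sketch in plain Scala of the two real-time estimators named above: exponential smoothing for order velocity and a running median for anomaly detection. The smoothing factor, window size, and sample order rates are hypothetical values, not details from the case study.

            // Sketch of the real-time estimators: EWMA velocity + running-median anomaly flag.
            object RealTimeEstimators extends App {
              // exponentially weighted moving average; alpha is a hypothetical smoothing factor
              def ewma(values: Seq[Double], alpha: Double = 0.3): Double =
                values.reduceLeft((prev, x) => alpha * x + (1.0 - alpha) * prev)

              // running median over a sliding window; window size is a hypothetical parameter
              def runningMedian(values: Seq[Double], window: Int = 5): Seq[Double] =
                values.sliding(window).map { w =>
                  val sorted = w.sorted
                  if (window % 2 == 1) sorted(window / 2)
                  else (sorted(window / 2 - 1) + sorted(window / 2)) / 2.0
                }.toSeq

              val ordersPerMinute = Seq(3.0, 4.0, 2.0, 3.0, 5.0, 4.0, 3.0, 4.0, 47.0)

              val velocity = ewma(ordersPerMinute)
              val medians  = runningMedian(ordersPerMinute)
              // crude anomaly flag: latest observation far above the latest running median
              val anomaly  = ordersPerMinute.last > 5.0 * medians.last

              println(f"smoothed velocity = $velocity%.2f, latest median = ${medians.last}%.1f, anomaly = $anomaly")
            }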
Finance: Ecommerce Risk

                   1. Data Preparation (batch)




                                                                                        stat.berkeley.edu
                    ‣ ETL from bank, log sessionization, customer profiles, etc.
                         - large-scale joins of customers + orders
                    ‣ apply time window
                         - too long: patterns lose currency
                         - too short: not enough wait for chargebacks
                    ‣ segment customers
                         - temporary fraud (identity theft which has been resolved)
                         - confirmed fraud (chargebacks from the bank)
                         - estimated fraud (blocked/banned by Customer Support)
                         - valid orders (but different clusters of expected behavior)
                     ‣ subsample to rebalance data
                          - produce training set + test holdout (sketched below)
                         - adjust balance for FP/FN bias (company risk profile)



Wednesday, 06 March 13                                                                                      46
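A minimal sketch of the last step above: downsample the majority class to rebalance, then split into a training set and a test holdout. The Order fields, the 3:1 rebalancing ratio, and the 80/20 split are hypothetical choices for illustration.

            // Sketch: rebalance a heavily imbalanced fraud dataset, then train/test split.
            object RebalanceAndSplit extends App {
              case class Order(id: Long, features: Vector[Double], fraud: Boolean)

              val rng = new scala.util.Random(42)

              // toy data: roughly 2% fraud, mirroring an imbalanced population
              val orders: Seq[Order] = (1L to 5000L).map { id =>
                Order(id, Vector(rng.nextDouble(), rng.nextDouble()), fraud = rng.nextDouble() < 0.02)
              }

              val (fraudulent, valid) = orders.partition(_.fraud)

              // downsample valid orders to a hypothetical 3:1 ratio against fraud cases
              val rebalanced = rng.shuffle(valid).take(fraudulent.size * 3) ++ fraudulent

              // 80/20 train/test holdout
              val shuffled = rng.shuffle(rebalanced)
              val cut      = (shuffled.size * 0.8).toInt
              val (train, test) = shuffled.splitAt(cut)

              println(s"fraud=${fraudulent.size} valid=${valid.size} train=${train.size} test=${test.size}")
            }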
Finance: Ecommerce Risk

                   2. Model Creation (analyst)




                                                                                                stat.berkeley.edu
                    ‣ distinguish between different IV data types
                          - continuous (e.g., age)
                          - boolean (e.g., paid lead)
                          - categorical (e.g., gender)
                          - computed (e.g., geo risk, velocities)
                    ‣ use geospatial smoothing for lat/lng
                    ‣ determine distributions for IV
                    ‣ adjust IV for seasonal variation, where appropriate
                    ‣ impute missing values based on density functions / medians
                    ‣ factor analysis: determine which IV to keep (too many creates problems)
                    ‣ train model: random forest (RF) classifiers predict likely fraud
                     ‣ calculate the confusion matrix (TP/FP/TN/FN) – sketched below




Wednesday, 06 March 13                                                                                              47
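A minimal sketch of the final step above: tally the confusion matrix (TP/FP/TN/FN) from predicted vs. actual fraud labels, along with the FP/FN rates used as KPIs earlier. The scored pairs are toy values, not real results.

            // Sketch: confusion matrix and FP/FN rates from (predicted, actual) fraud labels.
            object ConfusionMatrix extends App {
              // (predictedFraud, actualFraud) pairs -- toy values
              val scored: Seq[(Boolean, Boolean)] = Seq(
                (true, true), (false, false), (true, false), (false, true),
                (false, false), (true, true), (false, false), (false, false)
              )

              val tp = scored.count { case (p, a) => p && a }
              val fp = scored.count { case (p, a) => p && !a }
              val tn = scored.count { case (p, a) => !p && !a }
              val fn = scored.count { case (p, a) => !p && a }

              val fpRate = fp.toDouble / (fp + tn)   // the FP KPI: valid orders flagged as fraud
              val fnRate = fn.toDouble / (fn + tp)   // the FN KPI: fraud that slips through

              println(s"TP=$tp FP=$fp TN=$tn FN=$fn  FP rate=$fpRate  FN rate=$fnRate")
            }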
Finance: Ecommerce Risk

                   3. Test Model (analyst/batch loop)




                                                                                          stat.berkeley.edu
                    ‣    calculate estimated fraud rates
                    ‣    identify potential found fraud cases
                    ‣    report to Customer Support for review
                    ‣    generate risk vs. benefit curves
                    ‣    visualize estimated impact of new model


                   4. Decision (stakeholder)
                    ‣ decide risk vs. benefit (minimize fraud + customer support costs)
                    ‣ coordinate with bank/carrier if there are current issues
                    ‣ determine go/no-go, when to deploy in production, size of rollout




Wednesday, 06 March 13                                                                                        48
Finance: Ecommerce Risk

                   5. Production Deployment (near-time)




                                                                                     stat.berkeley.edu
                    ‣ run model on in-memory grid / transaction processing
                    ‣ A/B test to verify model in production (progressive rollout)
                    ‣ detect anomalies
                         - use running medians on continuous IVs
                         - use exponential smoothing on computed IVs (velocities)
                         - trigger notifications
                    ‣ monitor KPI and other metrics in dashboards




Wednesday, 06 March 13                                                                                   49
Finance: Ecommerce Risk

                          [architecture diagram: Cascading apps in two groups –
                           risk classifier, dimension: customer 360 (batch workloads on Hadoop):
                             ETL from DW (chargebacks, etc.) and partner data → data prep → training data sets →
                             predict model costs → detect fraudsters → segment customers;
                             an analyst's laptop exports a PMML model
                           risk classifier, dimension: per-order (real-time workloads on IMDG, with Customer DB):
                             customer transactions → score new orders → anomaly detection → velocity metrics]

Wednesday, 06 March 13                                                                                              50
Ecommerce: Marketing Funnel

                    Problem:
                    • must optimize large ad spend budget




                                                                            Wikipedia
                    • different vendors report different kinds of metrics
                    • some campaigns are much smaller than others
                    • seasonal variation distorts performance
                    • inherent latency in spend vs. effect
                    • ads channels cannot scale up immediately
                    • must “scrub” leads to dispute payments/refunds
                    • hard to predict ROI for incremental ad spend
                    • many issues of diminishing returns in general




Wednesday, 06 March 13                                                                  51
Ecommerce: Marketing Funnel

                    KPI:
                    cost per paying user (CPP)




                                                                                      Wikipedia
                    • must align metrics for different ad channels
                    • generally need to estimate to end-of-month
                    customer lifetime value (LTV)
                    • big differences based on geographic region, age, gender, etc.
                    • assumes that new customers behave like previous customers
                    return on investment (ROI)
                     • relationship between CPP and LTV (sketched below)
                    • adjust to invest in marketing (>CPP) vs. extract profit (>LTV)
                    other metrics
                    • reach: how many people get a brand message
                    • customer satisfaction: would recommend to a friend, etc.



Wednesday, 06 March 13                                                                            52
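A minimal sketch of how the KPIs above relate: CPP as spend over paying users, with a rough ROI derived from the gap between LTV and CPP. All figures are hypothetical.

            // Sketch: marketing funnel KPI arithmetic with made-up numbers.
            object FunnelKpis extends App {
              val adSpend     = 50000.0   // monthly spend on one channel
              val payingUsers = 1250      // new paying users attributed to that channel
              val ltvEstimate = 95.0      // estimated customer lifetime value for the cohort

              val cpp = adSpend / payingUsers          // cost per paying user
              val roi = (ltvEstimate - cpp) / cpp      // return on the acquisition cost

              // invest more while LTV comfortably exceeds CPP; extract profit otherwise
              println(f"CPP = $$${cpp}%.2f, LTV = $$${ltvEstimate}%.2f, ROI = ${roi * 100}%.1f%%")
            }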
Ecommerce: Marketing Funnel

                    Predictive Analytics:
                    batch




                                                                                             Wikipedia
                    •    log aggregation, followed with cohort analysis
                     •    bayesian point estimates compare different-sized ad tests (sketched below)
                    •    time series analysis normalizes for seasonal variation
                    •    geolocation adjusts for regional cost/benefit
                    •    customer lifetime value estimates ROI of new leads
                    •    linear programming models estimate elasticity of demand
                    real-time
                    •    determine whether this is actually a new customer…
                    •    new: modify initial UX based on ad channel, region, friends, etc.
                    •    old: recommend products/services/friends based on behaviors
                    •    adjust spend on poorly performing channels
                    •    track back to top referring sites/partners



Wednesday, 06 March 13                                                                                   53
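A small sketch of the Bayesian point estimates named in the batch list above, assuming a uniform Beta(1,1) prior: smoothing lets a small ad test be compared against a much larger one without the small sample dominating. The conversion counts are made-up.

            // Sketch: posterior-mean conversion rates for two differently sized ad tests.
            object AdTestEstimates extends App {
              // posterior mean of a conversion rate under a uniform Beta(1,1) prior
              def posteriorMean(conversions: Long, impressions: Long): Double =
                (conversions + 1.0) / (impressions + 2.0)

              val smallTest = posteriorMean(conversions = 9L,    impressions = 40L)      // tiny campaign
              val largeTest = posteriorMean(conversions = 2100L, impressions = 12000L)   // established channel

              // raw rates would be 0.225 vs 0.175; the small sample is pulled toward the prior mean
              println(f"small test ≈ $smallTest%.3f, large test ≈ $largeTest%.3f")
            }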
Airlines

                    Problem:
                    • minimize schedule delays
                    • re-route around weather and airport
                         conditions
                    • manage supplier channels and inventories
                         to minimize AOG


                    KPI:
                    forecast future passenger demand
                    customer loyalty
                    aircraft on ground (AOG)
                    mean time between failures (MTBF)




Wednesday, 06 March 13                                           54
Airlines

                    Predictive Analytics:
                    batch
                    • predict “last mile” failures
                    • optimize capacity utilization
                    • operations research problem to optimize stocking /
                         minimize fuel waste
                     • boost customer loyalty by adjusting incentives in
                          frequent flyer programs
                    real-time
                    • forecast schedule delays
                    • monitor factors for travel conditions: weather,
                         airports, etc.




Wednesday, 06 March 13                                                     55
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big Data
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark Streaming
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML Unpaused
 

Recently uploaded

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 

Recently uploaded (20)

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 

Functional programming for optimization problems in Big Data

  • 1. “Functional programming for optimization problems in Big Data” Paco Nathan, Concurrent, Inc., San Francisco, CA, @pacoid. Copyright @2013, Concurrent, Inc.
  • 2. The Workflow Abstraction (flow diagram: Document Collection → Tokenize / Scrub token → HashJoin Left with Stop Word List RHS → GroupBy token → Count → Word Count). Agenda: 1. Data Science 2. Functional Programming 3. Workflow Abstraction 4. Typical Use Cases 5. Open Data Example. Notes: Let’s consider a trendline subsequent to the Q3 1997 inflection point which enabled huge ecommerce successes and commercialized Big Data. Where did Big Data come from, and where is this kind of work headed?
  • 3. Q3 1997: inflection point. Four independent teams were working toward horizontal scale-out of workflows based on commodity hardware. This effort prepared the way for huge Internet successes in the 1997 holiday season: AMZN, EBAY, Inktomi (YHOO Search), then GOOG. MapReduce and the Apache Hadoop open source stack emerged from this. Notes: Q3 1997: Greg Linden, et al., at Amazon, and Randy Shoup, et al., at eBay; independent teams arrived at the same conclusion: parallelize workloads onto clusters of commodity servers to scale out horizontally. Google and Inktomi (YHOO Search) were working along the same lines.
  • 4. Circa 1996: pre-inflection point (architecture diagram: Stakeholder, Customers, Excel pivot tables, PowerPoint slide decks, strategy, BI, Product, Analysts, requirements, SQL Query, optimized code, Engineering, Web App, result sets, transactions, RDBMS). Notes: Perl and C++ for CGI :) Feedback loops shown in red represent data innovations at the time… these are rather static. Characterized by slow, manual processes: data modeling / business intelligence; “throw it over the wall”… this thinking led to impossible silos.
  • 5. Circa 2001: post big-ecommerce successes (architecture diagram: Stakeholder, Product, Customers, dashboards, UX, Engineering, models, servlets, recommenders, Algorithmic Modeling, Web Apps, classifiers, Middleware, aggregation, event history, SQL Query, result sets, customer transactions, Logs, DW, ETL, RDBMS). Notes: Machine data (unstructured logs) captured social interactions. Data from aggregated logs fed into algorithmic modeling to produce recommenders, classifiers, and other predictive models, e.g., ad networks automating parts of the marketing funnel, as in our case study. LinkedIn, Facebook, Twitter, Apple, etc., followed early successes. Algorithmic modeling, leveraging machine data, allowed Big Data to become monetized.
  • 6. Circa 2013: clusters everywhere (architecture diagram: Data Products, Customers, business process, Domain Expert, Workflow, dashboard, metrics, Web Apps / Mobile, s/w dev, History services, Data Scientist, Planner, social interactions + transactions, content, discovery, modeling, optimized taps, capacity, App Dev, Use Cases Across Topologies, Hadoop, Log Events, In-Memory Data Grid, Ops, DW Ops, batch, near time, Cluster Scheduler, SDLC, RDBMS; legend: introduced capability vs. existing capability). Notes: Here’s what our more savvy customers are using for architecture and process today: traditional SDLC, but also inter-disciplinary Data Science teams. Also, machine data (app history) driving planners and schedulers for an advanced multi-tenant cluster computing fabric. Not unlike a practice at LLL, where much more data gets collected about the machine than about the experiment. We see this feeding into cluster optimization in YARN, Apache Mesos, etc.
  • 7. References: Leo Breiman, “Statistical Modeling: The Two Cultures”, Statistical Science, 2001, bit.ly/eUTh9L. Notes: Leo Breiman wrote an excellent paper in 2001, “Two Cultures”, chronicling this evolution and the sea change from data modeling (silos, manual process) to algorithmic modeling (machine data for automation/optimization).
  • 8. References, in their own words: Amazon: “Early Amazon: Splitting the website” by Greg Linden, glinden.blogspot.com/2006/02/early-amazon-splitting-website.html. eBay: “The eBay Architecture” by Randy Shoup and Dan Pritchett, addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html and addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf. Inktomi (YHOO Search): “Inktomi’s Wild Ride” by Eric Brewer (0:05:31 ff), youtube.com/watch?v=E91oEn1bnXM. Google: “Underneath the Covers at Google” by Jeff Dean (0:06:54 ff), youtube.com/watch?v=qsan-GQaeyk and perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx.
  • 9. Core values: Data Science teams develop actionable insights, building confidence for decisions. That work may influence a few decisions worth billions (e.g., M&A) or billions of small decisions (e.g., AdWords), or probably somewhere in between: solving for pattern, at scale. By definition, this is a multi-disciplinary pursuit which requires teams, not sole players.
  • 10. Team process = needs. Discovery: help people ask the right questions. Modeling: allow automation to place informed bets. Integration: deliver products at scale to customers. Apps: build smarts into product features. Systems: keep infrastructure running, cost-effective. (Slide also shows Gephi.)
  • 11. Team composition = roles (overlaid on the Word Count flow diagram). Domain Expert: business process, stakeholder. Data Scientist: data prep, discovery, modeling, etc. App Dev: software engineering, automation. Ops: systems engineering, access. Legend: introduced capability. Notes: This is an example of multi-disciplinary team composition for data science, while other emerging problem spaces will require other, more specific kinds of team roles.
  • 12. Matrix: evaluate needs × roles. Columns (needs): discovery, modeling, integration, apps, systems. Rows (roles): stakeholder, scientist, developer, ops.
  • 13. Most valuable skills: approximately 80% of the costs for data-related projects get spent on data preparation – mostly on cleaning up data quality issues: ETL, log file analysis, etc. Unfortunately, data-related budgets for many companies tend to go into frameworks which can only be used after clean-up. Most valuable skills: ‣ learn to use programmable tools that prepare data ‣ learn to generate compelling data visualizations ‣ learn to estimate the confidence for reported results ‣ learn to automate work, making analysis repeatable. (Slide shows D3.) The rest of the skills – modeling, algorithms, etc. – those are secondary.
  • 14. Science in data science? (Background graphic: reversed event-log strings, e.g., “edoMpUsserD:IUN” reads as “NUI:DressUpMode”.) In a nutshell, what we do: ‣ estimate probability ‣ calculate analytic variance ‣ manipulate order complexity ‣ leverage use of learning theory + collab with DevOps, Stakeholders + reduce work to cron entries.
  • 15. References, by DJ Patil: “Data Jujitsu”, O’Reilly, 2012, amazon.com/dp/B008HMN5BE; “Building Data Science Teams”, O’Reilly, 2011, amazon.com/dp/B005O4U3ZE.
  • 16. The Workflow Abstraction (Word Count flow diagram). Agenda: 1. Data Science 2. Functional Programming 3. Workflow Abstraction 4. Typical Use Cases 5. Open Data Example. Notes: Origin and overview of the Cascading API as a workflow abstraction for Enterprise Big Data apps.
  • 17. Cascading – origins. API author Chris Wensel worked as a system architect at an Enterprise firm well known for several popular data products. Wensel was following the Nutch open source project, before Hadoop even had a name. He noted that it would become difficult to find Java developers to write complex Enterprise apps directly in Apache Hadoop, a potential blocker for leveraging this new open source technology. Notes: Cascading initially grew from interaction with the Nutch project, before Hadoop had a name. API author Chris Wensel recognized that MapReduce would be too complex for J2EE developers to perform substantial work in an Enterprise context without an abstraction layer.
  • 18. Cascading – functional programming. Key insight: MapReduce is based on functional programming – back to LISP in the 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature. To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007 as a new Java API to implement functional programming for large-scale data workflows. Notes: Years later, Enterprise app deployments on Hadoop are limited by staffing issues: difficulty of retraining staff, scarcity of Hadoop experts.
  • 19. Examples: • Twitter, Etsy, eBay, YieldBot, uSwitch, etc., have invested in functional programming open source projects atop Cascading – used for their large-scale production deployments • new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming: Cascalog in Clojure (2010), github.com/nathanmarz/cascalog/wiki; Scalding in Scala (2012), github.com/twitter/scalding/wiki. Notes: Many case studies, many Enterprise production deployments now for 5+ years.
  • 20. The Ubiquitous Word Count. Definition: count how often each word appears in a collection of text documents. This simple program provides an excellent test case for parallel processing, since it illustrates: • requires a minimal amount of code • demonstrates use of both symbolic and numeric values • shows a dependency graph of tuples as an abstraction • is not many steps away from useful search indexing • serves as a “Hello World” for Hadoop apps. Any distributed computing framework which can run Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems. Pseudocode:
    void map (String doc_id, String text):
      for each word w in segment(text):
        emit(w, "1");

    void reduce (String word, Iterator group):
      int count = 0;
      for each pc in group:
        count += Int(pc);
      emit(word, String(count));
Notes: Taking a wild guess, most people who’ve written any MapReduce code have seen this example app already...
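Before the distributed versions on the following slides, the same definition can be sketched over local Scala collections. This sketch is not from the deck; it only shows the functional shape of the computation (tokenize, group, count), and the tokenizing regex is an assumption borrowed from the later examples.

    // Minimal local sketch of Word Count over in-memory collections,
    // mirroring the tokenize -> group -> count shape used in the slides.
    object LocalWordCount {
      def wordCount(docs: Seq[String]): Map[String, Int] =
        docs
          .flatMap(_.toLowerCase.split("[ \\[\\](),.]+"))   // tokenize each document
          .filter(_.nonEmpty)
          .groupBy(identity)                                 // group by token
          .map { case (token, occurrences) => token -> occurrences.size }  // count

      def main(args: Array[String]): Unit =
        println(wordCount(Seq("rain shadow", "rain, rain")))  // Map(rain -> 3, shadow -> 1)
    }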
  • 21. Word count – conceptual flow diagram (Document Collection → Tokenize → GroupBy token → Count → Word Count): 1 map, 1 reduce, 18 lines of code. cascading.org/category/impatient, gist.github.com/3900702. Notes: Based on a Cascading implementation of Word Count, this is a conceptual flow diagram: the pattern language in use to specify the business process, using a literate programming methodology to describe a data workflow.
  • 22. Word count – Cascading app in Java (shown alongside the flow diagram):
    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];
    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    // specify a regex to split "document" text lines into a token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\](),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );

    // write a DOT file and run the flow
    Flow wcFlow = flowConnector.connect( flowDef );
    wcFlow.writeDOT( "dot/wc.dot" );
    wcFlow.complete();
Notes: Based on a Cascading implementation of Word Count, here is sample code, approx. 1/3 the code size of the Word Count example from Apache Hadoop. The 2nd-to-last line generates a DOT file for the flow diagram.
  • 23. Word count – generated flow diagram. (DOT-generated diagram: [head] → Hfs source tap TextDelimited['doc_id', 'text'] on 'data/rain.txt' → map: Each('token') with RegexSplitGenerator → GroupBy('wc') by 'token' → reduce: Every('wc') with Count → Hfs sink tap TextDelimited['token', 'count'] on 'output/wc' → [tail].) Notes: As a concrete example of literate programming in Cascading, here is the DOT representation of the flow plan, generated by the app itself.
  • 24. Word count – Cascalog / Clojure:
    (ns impatient.core
      (:use [cascalog.api]
            [cascalog.more-taps :only (hfs-delimited)])
      (:require [clojure.string :as s]
                [cascalog.ops :as c])
      (:gen-class))

    (defmapcatop split [line]
      "reads in a line of string and splits it by regex"
      (s/split line #"[\[\](),.)\s]+"))

    (defn -main [in out & args]
      (?<- (hfs-delimited out)
           [?word ?count]
           ((hfs-delimited in :skip-header? true) _ ?line)
           (split ?line :> ?word)
           (c/count ?count)))

    ; Paul Lam
    ; github.com/Quantisan/Impatient
Notes: Here is the same Word Count app written in Clojure, using Cascalog.
  • 25. Word count – Cascalog / Clojure, github.com/nathanmarz/cascalog/wiki: • implements Datalog in Clojure, with predicates backed by Cascading – for a highly declarative language • run ad-hoc queries from the Clojure REPL – approx. 10:1 code reduction compared with SQL • composable subqueries, used for test-driven development (TDD) practices at scale • Leiningen build: simple, no surprises, in Clojure itself • more new deployments than other Cascading DSLs – Climate Corp is largest use case: 90% Clojure/Cascalog • has a learning curve, limited number of Clojure developers • aggregators are the magic, and those take effort to learn. Notes: From what we see about language features, customer case studies, and best practices in general, Cascalog represents some of the most sophisticated uses of Cascading, as well as some of the largest deployments. Great for large-scale, complex apps, where small teams must limit the complexities in their process.
  • 26. Word count – Scalding / Scala:
    import com.twitter.scalding._

    class WordCount(args : Args) extends Job(args) {
      Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
        .read
        .flatMap('text -> 'token) { text : String => text.split("[ \\[\\](),.]") }
        .groupBy('token) { _.size('count) }
        .write(Tsv(args("wc"), writeHeader = true))
    }
Notes: Here is the same Word Count app written in Scala, using Scalding. Very compact, easy to understand; however, also more imperative than Cascalog.
  • 27. Word count – Scalding / Scala, github.com/twitter/scalding/wiki: • extends the Scala collections API so that distributed lists become “pipes” backed by Cascading • code is compact, easy to understand • nearly 1:1 between elements of the conceptual flow diagram and function calls • extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc. • significant investments by Twitter, Etsy, eBay, etc. • great for data services at scale • less learning curve than Cascalog, not as much of a high-level language. Notes: If you wanted to see what a data services architecture for machine learning work at, say, Google scale would look like as an open source project, that’s Scalding. That’s what they’re doing.
  • 28. Word count – Scalding / Scala, github.com/twitter/scalding/wiki (same bullets as the previous slide, with a callout overlaid): Cascalog and Scalding DSLs leverage the functional aspects of MapReduce, helping to limit complexity in process. Also: great for data services at scale (imagine SOA infra @ Google as an open source project). Notes: Arguably, using a functional programming language to build flows is better than trying to represent functional programming constructs within Java…
  • 29. The Workflow Abstraction (Word Count flow diagram). Agenda: 1. Data Science 2. Functional Programming 3. Workflow Abstraction 4. Typical Use Cases 5. Open Data Example. Notes: CS theory related to data workflow abstraction, to manage complexity.
  • 30. Cascading workflows – pattern language. Cascading uses a “plumbing” metaphor in the Java API to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc. Data is represented as flows of tuples. Operations within the tuple flows bring functional programming aspects into Java apps. In formal terms, this provides a pattern language. Notes: A pattern language, based on the metaphor of “plumbing”.
  • 31. References: pattern language – a structured method for solving large, complex design problems, where the syntax of the language promotes the use of best practices; amazon.com/dp/0195019199. Design patterns – the notion originated in consensus negotiation for architecture, later applied in OOP software engineering by the “Gang of Four”; amazon.com/dp/0201633612. Notes: Chris Alexander originated the use of pattern language in a project called “The Oregon Experiment”, in the 1970s.
  • 32. Cascading workflows – literate programming. Cascading workflows generate their own visual documentation: flow diagrams. In formal terms, flow diagrams leverage a methodology called literate programming. Provides intuitive, visual representations for apps, great for cross-team collaboration. Notes: Formally speaking, the pattern language in Cascading gets leveraged as a visual representation used for literate programming. Several good examples exist, but the phenomenon of different developers troubleshooting a program together over the “cascading-users” email list is most telling: expert developers generally ask a novice to provide a flow diagram first.
  • 33. References: Don Knuth, “Literate Programming”, Univ of Chicago Press, 1992, literateprogramming.com/. “Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.” Notes: Don Knuth originated the notion of literate programming, or code as “literature” which explains itself.
  • 34. Examples: • Scalding apps have nearly 1:1 correspondence between function calls and the elements in their flow diagrams – excellent elision and literate representation • noticed on the cascading-users email list: when troubleshooting issues, Cascading experts ask novices to provide an app’s flow diagram (generated as a DOT file), sometimes in lieu of showing code. In formal terms, a flow diagram is a directed, acyclic graph (DAG) on which lots of interesting math applies for query optimization, predictive models about app execution, parallel efficiency metrics, etc. (Slide repeats the generated DOT flow diagram for Word Count.) Notes: Literate programming examples observed on the email list are some of the best illustrations of this methodology.
  • 35. Cascading workflows – business process. Following the essence of literate programming, Cascading workflows provide statements of business process. This recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data), as a separation of concerns between business process and implementation details (Hadoop, etc.). This is especially apparent in large-scale Cascalog apps: “Specify what you require, not how to achieve it.” By virtue of the pattern language, the flow planner used in a Cascading app determines how to translate business process into efficient, parallel jobs at scale. Notes: Business Stakeholder POV: business process management for workflow orchestration (think BPM/BPEL).
  • 36. References: Edgar Codd, “A relational model of data for large shared data banks”, Communications of the ACM, 1970, dl.acm.org/citation.cfm?id=362685. Rather than arguing between SQL vs. NoSQL, or structured vs. unstructured data frameworks, this approach focuses on: the process of structuring data. That’s what apps do – Making Data Work. Notes: Focus on *the process of structuring data*, which must happen before the large-scale joins, predictive models, visualizations, etc. Just because your data is loaded into a “structured” store, that does not imply that your app has finished structuring it for the purpose of making data work. BTW, anybody notice that the O’Reilly “animal” for the Cascading book is an Atlantic Cod? (pun intended)
  • 37. Cascading workflows – functional relational programming. The combination of functional programming, pattern language, DSLs, literate programming, business process, etc., traces back to the original definition of the relational model (Codd, 1970) prior to SQL. Cascalog, in particular, implements more of what Codd intended for a “data sublanguage” and is considered to be close to a full implementation of the functional relational programming paradigm defined in: Moseley & Marks, 2006, “Out of the Tar Pit”, goo.gl/SKspn. Notes: A more contemporary statement along similar lines...
  • 38. Two Avenues (chart axes: complexity ➞, scale ➞). Enterprise: must contend with complexity at scale every day; incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff. Start-ups: crave complexity and scale to become viable; new ventures move into the Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding. Notes: Enterprise data workflows are observed in two modes: start-ups approaching complexity and incumbent firms grappling with complexity.
  • 39. Cascading workflows – functional relational programming (repeat of slide 37, with a callout overlaid): several theoretical aspects converge into software engineering practices which mitigate the complexity of building and maintaining Enterprise data workflows.
  • 40. The Workflow Abstraction (Word Count flow diagram). Agenda: 1. Data Science 2. Functional Programming 3. Workflow Abstraction 4. Typical Use Cases 5. Open Data Example. Notes: Here are a few use cases to consider, for Enterprise data workflows.
  • 41. Cascading – deployments: • 5+ year history of Enterprise production deployments, ASL 2 license, GitHub src, http://conjars.org • partners: Amazon AWS, Microsoft Azure, Hortonworks, MapR, EMC, SpringSource, Cloudera • case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, etc. • use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, genomics, climatology, etc. Notes: Several published case studies about Cascading, Cascalog, Scalding, etc. Wide range of use cases. Significant investment by Twitter, Etsy, and other firms for OSS based on Cascading. Partnerships with the various Hadoop distro vendors, cloud providers, etc.
  • 42. Finance: Ecommerce Risk. Problem: <1% chargeback rate allowed by Visa, others follow • may leverage CAPTURE/AUTH wait period • Cybersource, Vindicia, others haven’t stopped fraud. >15% chargeback rate common for mobile in the US: • not much info shared with merchant • carrier as judge/jury/executioner; customer assumed correct. Most common: professional fraud (identity theft, etc.) • patterns of attack change all the time • widespread use of IP proxies, to mask location • global market for stolen credit card info. Other common case is friendly fraud • teenager billing to parent’s cell phone. (Image credit: stat.berkeley.edu)
  • 43. Finance: Ecommerce Risk. KPI: chargeback rate (CB) • ground truth for how much fraud the bank/carrier claims • 7–120 day latencies from the bank. False positive rate (FP) • estimated cost: predicts customer support issues • complaints due to incorrect fraud scores on valid orders (or lies). False negative rate (FN) • estimated risk: how much fraud may pass undetected in future orders • changes with new product features/services/inventory/marketing.
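To make the KPI definitions on slide 43 concrete, here is a minimal sketch, not taken from the deck, of tallying chargeback, false positive, and false negative rates from labeled order outcomes. The Outcome type, its fields, and the use of simple shares of all orders as denominators are illustrative assumptions.

    // Minimal sketch: risk KPIs from labeled order outcomes (hypothetical types).
    // "fraud" stands in for ground truth learned later, e.g., from chargebacks.
    case class Outcome(blocked: Boolean, fraud: Boolean)

    object RiskKpi {
      def rates(orders: Seq[Outcome]): (Double, Double, Double) = {
        val shipped = orders.filterNot(_.blocked)
        // chargeback rate (CB): share of shipped orders later reported as fraud
        val cb = shipped.count(_.fraud).toDouble / shipped.size
        // false positive rate (FP): valid orders that were blocked, as a share of all orders
        val fp = orders.count(o => o.blocked && !o.fraud).toDouble / orders.size
        // false negative rate (FN): fraud that passed undetected, as a share of all orders
        val fn = orders.count(o => !o.blocked && o.fraud).toDouble / orders.size
        (cb, fp, fn)
      }

      def main(args: Array[String]): Unit = {
        val sample = Seq(Outcome(false, false), Outcome(false, true),
                         Outcome(true, false), Outcome(false, false))
        val (cb, fp, fn) = rates(sample)
        println(f"CB=$cb%.2f FP=$fp%.2f FN=$fn%.2f")
      }
    }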
  • 44. Finance: Ecommerce Risk. Data Science Issues: • chargeback limits imply few training cases • sparse data implies lots of missing values – must impute • long latency on chargebacks – “good” flips to “bad” • most detection occurs within large-scale batch, while decisions are required during real-time event processing • not just one pattern to detect – many, ever-changing • many unknowns: blocked orders scare off professional fraud, inferences cannot be confirmed • cannot simply use raw data as input – requires lots of data preparation and statistical modeling • each ecommerce firm has shopping/policy nuances which get exploited differently – hard to generalize solutions.
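Slide 44 notes that sparse data forces imputation of missing values. A minimal sketch of median imputation for one continuous variable follows; this is not from the deck, and representing missing values as Option[Double] is an assumption for illustration.

    // Minimal sketch: impute missing values of a continuous variable with the median.
    object Impute {
      def median(xs: Seq[Double]): Double = {
        val sorted = xs.sorted
        sorted(sorted.size / 2)
      }

      // Replace each missing entry with the median of the observed entries.
      def imputeWithMedian(column: Seq[Option[Double]]): Seq[Double] = {
        val fill = median(column.flatten)
        column.map(_.getOrElse(fill))
      }

      def main(args: Array[String]): Unit = {
        println(imputeWithMedian(Seq(Some(21.0), None, Some(35.0), Some(42.0))))
        // List(21.0, 35.0, 35.0, 42.0)
      }
    }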
  • 45. Finance: Ecommerce Risk. Predictive Analytics – batch: • cluster/segment customers for expected behaviors • adjust for seasonal variation • geospatial indexing / bayesian point estimates (fraud by lat/lng) • impute missing values (“guesses” to fill in sparse data) • run anti-fraud classifier (customer 360). Real-time: • exponential smoothing (estimators for velocity) • calculate running medians (anomaly detection) • run anti-fraud classifier (per order).
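As a rough illustration of the real-time estimators mentioned on slide 45, here is a minimal, self-contained sketch of exponential smoothing and a windowed running median over a stream of velocity readings. It is not from the deck; the smoothing factor and window size are arbitrary assumptions that would be tuned in practice.

    // Minimal sketch: streaming estimators for velocity and anomaly detection.
    object StreamingEstimators {
      // exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}
      def expSmooth(xs: Seq[Double], alpha: Double = 0.3): Seq[Double] =
        if (xs.isEmpty) xs
        else xs.tail.scanLeft(xs.head)((s, x) => alpha * x + (1 - alpha) * s)

      // running median over a sliding window; large deviations flag anomalies
      def runningMedian(xs: Seq[Double], window: Int = 5): Seq[Double] =
        xs.sliding(window).map { w =>
          val sorted = w.sorted
          sorted(sorted.size / 2)
        }.toSeq

      def main(args: Array[String]): Unit = {
        val velocities = Seq(1.0, 1.2, 0.9, 1.1, 8.0, 1.0, 1.3)   // 8.0 is an outlier
        println(expSmooth(velocities))
        println(runningMedian(velocities))
      }
    }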
  • 46. Finance: Ecommerce Risk. 1. Data Preparation (batch): ‣ ETL from bank, log sessionization, customer profiles, etc. – large-scale joins of customers + orders ‣ apply a time window – too long: patterns lose currency; too short: not enough wait for chargebacks ‣ segment customers – temporary fraud (identity theft which has been resolved), confirmed fraud (chargebacks from the bank), estimated fraud (blocked/banned by Customer Support), valid orders (but different clusters of expected behavior) ‣ subsample to rebalance data – produce training set + test holdout, adjust balance for FP/FN bias (company risk profile).
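The “subsample to rebalance data” step in slide 46 might look roughly like this sketch, which downsamples the majority (valid) class toward a target ratio. It is illustrative only; the (features, isFraud) pair representation and the default ratio are assumptions, not the deck's actual code.

    import scala.util.Random

    // Minimal sketch: rebalance a training set by downsampling the majority class.
    object Rebalance {
      def rebalance[A](rows: Seq[(A, Boolean)],           // (features, isFraud)
                       fraudToValidRatio: Double = 0.5,   // e.g., keep 2 valid per fraud
                       seed: Long = 42L): Seq[(A, Boolean)] = {
        val rng = new Random(seed)
        val (fraud, valid) = rows.partition(_._2)
        val keepValid = math.min(valid.size, (fraud.size / fraudToValidRatio).toInt)
        rng.shuffle(valid).take(keepValid) ++ fraud
      }
    }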
  • 47. Finance: Ecommerce Risk. 2. Model Creation (analyst): ‣ distinguish between different IV data types – continuous (e.g., age), boolean (e.g., paid lead), categorical (e.g., gender), computed (e.g., geo risk, velocities) ‣ use geospatial smoothing for lat/lng ‣ determine distributions for IVs ‣ adjust IVs for seasonal variation, where appropriate ‣ impute missing values based on density functions / medians ‣ factor analysis: determine which IVs to keep (too many creates problems) ‣ train model: random forest (RF) classifiers predict likely fraud ‣ calculate the confusion matrix (TP/FP/TN/FN).
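Slide 47 ends with calculating the confusion matrix. A minimal sketch of that step follows, assuming boolean predicted and actual labels; it is an illustration, not the deck's code.

    // Minimal sketch: confusion matrix (TP/FP/TN/FN) from (predicted, actual) labels.
    object ConfusionMatrix {
      case class Counts(tp: Int, fp: Int, tn: Int, fn: Int)

      def confusion(pairs: Seq[(Boolean, Boolean)]): Counts =
        pairs.foldLeft(Counts(0, 0, 0, 0)) { case (c, (predicted, actual)) =>
          (predicted, actual) match {
            case (true, true)   => c.copy(tp = c.tp + 1)   // flagged fraud, was fraud
            case (true, false)  => c.copy(fp = c.fp + 1)   // flagged fraud, was valid
            case (false, false) => c.copy(tn = c.tn + 1)   // passed, was valid
            case (false, true)  => c.copy(fn = c.fn + 1)   // passed, was fraud
          }
        }
    }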
  • 48. Finance: Ecommerce Risk. 3. Test Model (analyst/batch loop): ‣ calculate estimated fraud rates ‣ identify potential found fraud cases ‣ report to Customer Support for review ‣ generate risk vs. benefit curves ‣ visualize estimated impact of the new model. 4. Decision (stakeholder): ‣ decide risk vs. benefit (minimize fraud + customer support costs) ‣ coordinate with bank/carrier if there are current issues ‣ determine go/no-go, when to deploy in production, size of rollout.
  • 49. Finance: Ecommerce Risk. 5. Production Deployment (near-time): ‣ run model on in-memory grid / transaction processing ‣ A/B test to verify model in production (progressive rollout) ‣ detect anomalies – use running medians on continuous IVs, use exponential smoothing on computed IVs (velocities), trigger notifications ‣ monitor KPI and other metrics in dashboards.
  • 50. Finance: Ecommerce Risk – architecture diagram: Cascading apps for data prep and model training (analyst’s laptop, customer data sets, transactions), exporting a PMML model; risk classifiers along two dimensions (customer 360 and per-order); predict score for new orders, estimate costs; anomaly detection, segment customers, velocity metrics, detect fraudsters; Hadoop batch workloads alongside IMDG real-time workloads; Customer DB; ETL from chargebacks, partner data, DW, etc.
  • 51. Ecommerce: Marketing Funnel. Problem: • must optimize a large ad spend budget • different vendors report different kinds of metrics • some campaigns are much smaller than others • seasonal variation distorts performance • inherent latency in spend vs. effect • ad channels cannot scale up immediately • must “scrub” leads to dispute payments/refunds • hard to predict ROI for incremental ad spend • many issues of diminishing returns in general. (Image credit: Wikipedia)
  • 52. Ecommerce: Marketing Funnel. KPI: cost per paying user (CPP) • must align metrics for different ad channels • generally need to estimate to end-of-month. Customer lifetime value (LTV) • big differences based on geographic region, age, gender, etc. • assumes that new customers behave like previous customers. Return on investment (ROI) • relationship between CPP and LTV • adjust to invest in marketing (>CPP) vs. extract profit (>LTV). Other metrics • reach: how many people get a brand message • customer satisfaction: would recommend to a friend, etc.
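To make the marketing KPIs on slide 52 concrete, here is a small arithmetic sketch relating ad spend, paying users, lifetime value, and ROI. It is not from the deck, and all figures are invented for illustration.

    // Minimal sketch: cost per paying user (CPP), lifetime value (LTV), and ROI.
    object FunnelKpi {
      def cpp(adSpend: Double, payingUsers: Long): Double = adSpend / payingUsers

      // ROI as the relationship between what a customer is worth (LTV)
      // and what it cost to acquire that customer (CPP)
      def roi(ltv: Double, cpp: Double): Double = (ltv - cpp) / cpp

      def main(args: Array[String]): Unit = {
        val costPerPayer = cpp(adSpend = 50000.0, payingUsers = 2500L)   // 20.0
        println(roi(ltv = 35.0, cpp = costPerPayer))                     // 0.75
      }
    }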
  • 53. Ecommerce: Marketing Funnel. Predictive Analytics – batch: • log aggregation, followed with cohort analysis • bayesian point estimates compare different-sized ad tests • time series analysis normalizes for seasonal variation • geolocation adjusts for regional cost/benefit • customer lifetime value estimates ROI of new leads • linear programming models estimate elasticity of demand. Real-time: • determine whether this is actually a new customer… • new: modify initial UX based on ad channel, region, friends, etc. • old: recommend products/services/friends based on behaviors • adjust spend on poorly performing channels • track back to top referring sites/partners.
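One way to read “bayesian point estimates compare different-sized ad tests” on slide 53 is a Beta-Binomial posterior mean for each channel's conversion rate, which keeps a tiny test from looking artificially better or worse than a large one. This is a hedged sketch under that interpretation; the uniform Beta(1,1) prior is an assumption, not something stated in the deck.

    // Minimal sketch: Beta-Binomial posterior mean for conversion rates,
    // so small and large ad tests can be compared on the same footing.
    object AdTestEstimate {
      def posteriorMean(conversions: Long, impressions: Long,
                        priorA: Double = 1.0, priorB: Double = 1.0): Double =
        (conversions + priorA) / (impressions + priorA + priorB)

      def main(args: Array[String]): Unit = {
        println(posteriorMean(conversions = 3, impressions = 40))      // small test
        println(posteriorMean(conversions = 220, impressions = 4000))  // large test
      }
    }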
  • 54. Airlines. Problem: • minimize schedule delays • re-route around weather and airport conditions • manage supplier channels and inventories to minimize AOG. KPI: • forecast future passenger demand • customer loyalty • aircraft on ground (AOG) • mean time between failures (MTBF).
  • 55. Airlines. Predictive Analytics – batch: • predict “last mile” failures • optimize capacity utilization • operations research problem to optimize stocking / minimize fuel waste • boost customer loyalty by adjusting incentives in frequent flyer programs. Real-time: • forecast schedule delays • monitor factors for travel conditions: weather, airports, etc.