“A Data Scientist And A Log File Walk Into A Bar…”

Paco Nathan
Concurrent, Inc.
pnathan@concurrentinc.com
@pacoid

[flow diagram: Document Collection → Tokenize → Scrub token → HashJoin (Left; RHS = Stop Word List) → Regex token → GroupBy token → Count → Word Count; M and R mark the map and reduce stages]

Copyright @2012, Concurrent, Inc.
opportunity


 Unstructured Data
   meets
  Enterprise Scale

 1. backstory: how we got here
 2. overview: typical use cases
 3. example: a Cascading app
Intro to Data Science
1. backstory:
how we got here
inflection point
 • huge Internet successes after 1997 holiday season…
   AMZN, EBAY, then GOOG, Inktomi (YHOO Search)

 • consider this metric:
     annual revenue per customer / amount of data stored
   dropped 100x within a few years after 1997

 • storage and processing costs plummeted; now we must
   work much smarter to extract ROI from Big Data…
   our methods must adapt

 • “conventional wisdom” of RDBMS and BI tools became
   less viable; business cadre still focused on pivot tables
   and pie charts… tends toward inertia!

 • MapReduce and the Hadoop open source stack grew
   directly out of that contention… but only solve portions

 [timeline: 1997, 1998, 2004]

 massive disruption in retail, advertising, etc.,
 “All of Fortune 500 is now on notice over the next 10-year period.”
  – Geoffrey Moore, 2012 (Mohr Davidow Ventures)
the world before…

BI, SQL, and highly
optimized code
data innovation: circa 1996
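 [diagram: Customers use a Web App backed by an RDBMS (transactions); BI Analysts turn SQL Query result sets into Excel pivot tables and PowerPoint slide decks for the Stakeholder; strategy goes to Product, requirements go to Engineering, and Engineering delivers optimized code to the Web App]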
the world after…

machine learning,
leveraging log files
data innovation: circa 2001
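 [diagram: Stakeholder, Product, Customers across the top; Customers (UX) use Web Apps on Middleware (servlets) backed by an RDBMS (customer transactions); event history goes to Logs, then ETL into a DW; Algorithmic Modeling uses SQL Query result sets and aggregation, feeding dashboards to the Stakeholder and models to Engineering, which builds recommenders + classifiers]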
the world ahead…

what our customers
are doing now
data innovation: circa 2013
 [diagram: roles on the data side: Domain Expert, Data Scientist, App Dev, Ops; roles on the product side: Prod, Eng (s/w dev), Ops; Data Apps layer: business process, Workflow, History, Planner, endpoints, with dashboard metrics and data science (discovery + modeling); optimized capacity and services feed Web Apps, Mobile, etc. used by Customers (social interactions, transactions, content); Data Access Patterns span DW, Hadoop etc., Log Events, In-Memory Data Grid, and RDBMS; a Cluster Scheduler covers batch and "real time"]
a key difference…
statistical thinking

         Process           Variation            Data           Tools




 employing a mode of thought which includes both logical and analytical reasoning:
 evaluating the whole of a problem, as well as its component parts; attempting
 to assess the effects of changing one or more variables

 this approach attempts to understand not just problems and solutions,
 but also the processes involved and their variances

 particularly valuable in Big Data work when combined with hands-on experience in
 physics – roughly 50% of my peers come from physics or physical engineering…

 programmers typically don’t think this way…
 however, both systems engineers and data scientists must!
references

 by Leo Breiman
 Statistical Modeling:
 The Two Cultures
 Statistical Science, 2001
 http://bit.ly/eUTh9L

 also check out RStudio:
 http://rstudio.org/
 http://rpubs.com/
most valuable skills
 • approximately 80% of the costs for data-related projects
   get spent on data preparation – mostly on cleaning up
   data quality issues: ETL, log file analysis, etc.

 • unfortunately, data-related budgets for many companies tend
   to go into frameworks which can only be used after clean up

 • most valuable skills:
   ‣ learn to use programmable tools that prepare data

   ‣ learn to generate compelling data visualizations

   ‣ learn to estimate the confidence for reported results

   ‣ learn to automate work, making analysis repeatable
   [image: D3]
 the rest of the skills – modeling,
 algorithms, etc. – those are secondary
team process

    discovery     help people ask the right questions
    modeling      allow automation to place informed bets
    integration   deliver products at scale to customers
    apps          leverage smarts in product features
    systems       keep infrastructure running, cost-effective

    [image: Gephi]
matrix: usage
 conceptual tool for managing Data Science teams

 overlay your project requirements (needs)
 with your team’s strengths (roles)
 that will show very quickly where to focus

 NB: bring in individuals who cover 2-3 needs,
 particularly for team leads

 [matrix: columns (needs) = discovery, modeling, integration, apps, systems; rows (roles) = stakeholder, scientist, developer, ops]
building teams
 [matrix: columns = discovery, modeling, integration, apps, systems; rows = stakeholder, scientist, developer, ops]
references

 by DJ Patil

 Data Jujitsu
 O’Reilly, 2012
 http://www.amazon.com/dp/B008HMN5BE

 Building Data Science Teams
 O’Reilly, 2011
 http://www.amazon.com/dp/B005O4U3ZE
2. overview:
typical use cases
using science in data science
  in a nutshell, what we do…

  • estimate probability
  • calculate analytic variance
  • manipulate order complexity
  • make use of learning theory
  • collab with DevOps, Stakeholders

  [figure: axes labeled with application event names, e.g. NUI:DressUpMode, Website Login, Add Friend, Chat Now, Buy an Item: web, Customer Made Purchase Cart Page Step 2, Unique Registration]
use case: marketing funnel
  • must optimize a very large ad spend
  • different vendors report different metrics
  • seasonal variation distorts performance
  • some campaigns are much smaller than others
  • hard to predict ROI for incremental spend

  [image: Wikipedia]

  approach:
  • log aggregation, followed with cohort analysis
  • bayesian point estimates compare different-sized ad tests
  • customer lifetime value quantifies ROI of new leads
  • time series analysis normalizes for seasonal variation
  • geolocation adjusts for regional cost/benefit
  • linear programming models estimate elasticity of demand
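
  One way to read “bayesian point estimates compare different-sized ad tests”: a minimal sketch, assuming a Beta prior on each campaign's conversion rate and taking the posterior mean as the point estimate. The class name, the uniform Beta(1, 1) prior, and the campaign numbers are illustrative assumptions, not from the talk.

```java
/** Sketch only: Beta-Binomial posterior mean as a point estimate for a conversion rate. */
final class AdTestEstimate {
    private AdTestEstimate() {}

    /**
     * Posterior mean of the conversion rate after observing `conversions`
     * successes in `trials`, starting from a Beta(alpha, beta) prior.
     */
    static double posteriorMean(long conversions, long trials, double alpha, double beta) {
        if (conversions < 0 || trials < conversions || alpha <= 0 || beta <= 0) {
            throw new IllegalArgumentException(
                "need 0 <= conversions <= trials and positive prior parameters");
        }
        return (conversions + alpha) / (trials + alpha + beta);
    }

    public static void main(String[] args) {
        // Hypothetical campaigns: a small test and a much larger one.
        double small = posteriorMean(3, 20, 1.0, 1.0);
        double large = posteriorMean(900, 10_000, 1.0, 1.0);
        System.out.printf("small test: raw=%.4f estimate=%.4f%n", 3 / 20.0, small);
        System.out.printf("large test: raw=%.4f estimate=%.4f%n", 900 / 10_000.0, large);
    }
}
```

  The small test is pulled toward the prior far more than the large one, which is what makes campaigns of very different sizes comparable on one scale.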
use case: ecommerce fraud
  • sparse data means lots of missing values
  • “needle in a haystack” lack of training cases
  • answers are available in large-scale batch, results
    are needed in real-time event processing
  • not just one pattern to detect – many, ever-changing

  [image: stat.berkeley.edu]

  approach:
  • random forest (RF) classifiers predict likely fraud
  • subsampled data to re-balance training sets
  • impute missing values based on density functions
  • train on massive log files, run on in-memory grid
  • adjust metrics to minimize customer support costs
  • detect novelty – report anomalies via notifications
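
  The “subsampled data to re-balance training sets” step can be sketched as below, assuming the rare class (fraud) is kept whole and the common class is randomly down-sampled to a fixed ratio. The class name, the ratio parameter, and the seed are hypothetical; the talk does not say how the subsampling was done.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch only: down-sample the majority class so a classifier sees a less skewed training set. */
final class RebalanceSketch {
    private RebalanceSketch() {}

    /**
     * Keeps every minority example and a random subset of the majority class,
     * at most `majorityPerMinority` majority examples per minority example.
     */
    static <T> List<T> rebalance(List<T> minority, List<T> majority,
                                 int majorityPerMinority, long seed) {
        if (majorityPerMinority < 1) {
            throw new IllegalArgumentException("majorityPerMinority must be >= 1");
        }
        Random rng = new Random(seed);
        List<T> shuffled = new ArrayList<>(majority);
        Collections.shuffle(shuffled, rng);
        int keep = (int) Math.min((long) minority.size() * majorityPerMinority, shuffled.size());

        List<T> training = new ArrayList<>(minority);
        training.addAll(shuffled.subList(0, keep));
        Collections.shuffle(training, rng); // mix the two classes together
        return training;
    }

    public static void main(String[] args) {
        List<String> fraud = new ArrayList<>();
        List<String> legit = new ArrayList<>();
        for (int i = 0; i < 5; i++) fraud.add("fraud-" + i);
        for (int i = 0; i < 1000; i++) legit.add("legit-" + i);
        List<String> training = rebalance(fraud, legit, 3, 42L);
        System.out.println("training set size: " + training.size());
    }
}
```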
use case: customer segmentation
  • many millions of customers, hard to determine
    which features resonate
  • multi-modal distributions get obscured by the
    practice of calculating an “average”
  • not much is known about individual customers

  [image: Mathworks]

  approach:
  • connected components for sessionization, determining
    uniques from logs
  • estimates for age, gender, income, geo, etc.
  • clustering algorithms to group into market segments
  • social graph infers “unknown” relationships
  • covariance/heat maps visualize segments vs. feature sets
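
  “connected components … determining uniques from logs” can be sketched with a union-find structure, assuming each log line pairs two identifiers seen together (say a cookie and a login). The identifier format and the class name are hypothetical; at scale this would run as a distributed job rather than in one JVM.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch only: union-find over identifiers; each connected component is one unique visitor. */
final class UniquesSketch {
    private final Map<String, String> parent = new HashMap<>();

    /** Returns the representative identifier of the component containing `id`. */
    String find(String id) {
        parent.putIfAbsent(id, id);
        String root = id;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        // Path compression: point every node on the walked path at the root.
        String cur = id;
        while (!cur.equals(root)) {
            String next = parent.get(cur);
            parent.put(cur, root);
            cur = next;
        }
        return root;
    }

    /** Records that two identifiers were observed together in a log line. */
    void link(String a, String b) {
        String ra = find(a);
        String rb = find(b);
        if (!ra.equals(rb)) {
            parent.put(ra, rb);
        }
    }

    /** Number of connected components, i.e. estimated unique visitors. */
    int countUniques() {
        Set<String> roots = new HashSet<>();
        for (String id : new ArrayList<>(parent.keySet())) {
            roots.add(find(id));
        }
        return roots.size();
    }

    public static void main(String[] args) {
        UniquesSketch uniques = new UniquesSketch();
        uniques.link("cookie:a", "user:1");
        uniques.link("user:1", "cookie:b");
        uniques.link("cookie:c", "user:2");
        System.out.println("unique visitors: " + uniques.countUniques());
    }
}
```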
use case: monetizing content
  • need to suggest relevant content which would
    otherwise get buried in the back catalog
  • big disconnect between inventory and limited
    performance ad market
  • enormous amounts of text, hard to categorize

  [image: Digital Humanities]

  approach:
  • text analytics glean key phrases from documents
  • hierarchical clustering of character frequencies detects language
  • latent Dirichlet allocation (LDA) reduces dimension to topic
    models
  • recommenders suggest similar topics to customers
  • collaborative filters connect known users with less-known ones
plus some great tools…
   reporting:           Graphite, PowerPivot, Pentaho, Jaspersoft, SAS
   visualization:       ggplot2, D3, Gephi
   analytics/modeling:  R, Weka, Matlab, PMML, GLPK
   text:                LDA, WordNet, OpenNLP, Mallet, Bixo, NLTK
   apps:                Cascading, Scalding, Cascalog, R markdown, SWF
   scale-out:           Scalr, RightScale, CycleComputing, vFabric, Beanstalk
   graph:               Gremlin, GraphLab, Neo4J
   column:              Vertica, HBase, Drill, Dynamo
   key/val:             Redis, Membase, MySQL
   index:               Lucene/Solr, ElasticSearch
   relational:          usual suspects
   imdg:                Spark, Storm, Gigaspaces
   hadoop:              EMR, HW, MapR, EMC, Azure, Compute
   machine data:        Splunk, collectd, Nagios
   durable storage:     S3, ASV, GCS, Riak, Couch
3. example:
a Cascading app
getting started

  cascading.org/category/impatient/
composition of a workflow
  business process      domain expertise, business trade-offs,
                        market position, operating parameters, etc.

  API language          Scala, Clojure, Python, Ruby, Java, etc.
                        …envision whatever else runs in a JVM

  optimize / schedule   major changes in technology now

  physical plan         “assembler” code
                        [flow diagram: Document Collection → Tokenize → Scrub token → HashJoin (Left; RHS = Stop Word List) → Regex token → GroupBy token → Count → Word Count]

  compute substrate     Apache Hadoop, in-memory local mode
                        …envision GPUs, other frameworks, etc.

  machine data          Splunk, Nagios, Collectd, etc.
1: copy
  import java.util.Properties;

  import cascading.flow.FlowDef;
  import cascading.flow.hadoop.HadoopFlowConnector;
  import cascading.pipe.Pipe;
  import cascading.property.AppProps;
  import cascading.scheme.hadoop.TextDelimited;
  import cascading.tap.Tap;
  import cascading.tap.hadoop.Hfs;

  public class
    Main
    {
    public static void
    main( String[] args )
      {
      String inPath = args[ 0 ];
      String outPath = args[ 1 ];

      Properties props = new Properties();
      AppProps.setApplicationJarClass( props, Main.class );
      HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

      // create the source tap: tab-delimited text with a header row
      Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

      // create the sink tap
      Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

      // specify a pipe to connect the taps
      Pipe copyPipe = new Pipe( "copy" );

      // connect the taps, pipes, etc., into a flow
      FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
       .addSource( copyPipe, inTap )
       .addTailSink( copyPipe, outTap );

      // run the flow
      flowConnector.connect( flowDef ).complete();
      }
    }

  [flow diagram: Source → M → Sink]

  1 mapper
  0 reducers
  10 lines code
wait!



  ten lines of code
  for a file copy…
  seems like a lot.
same JAR, any scale…
 MegaCorp Enterprise IT:
 PBs of data
 1000+ node private cluster
 EVP calls you when app fails
 runtime: days+

 Production Cluster:
 TBs of data
 EMR w/ 50 HPC Instances
 Ops monitors results
 runtime: hours – days

 Staging Cluster:
 GBs of data
 EMR + 4 Spot Instances
 CI shows red or green lights
 runtime: minutes – hours

 Your Laptop:
 MBs of data
 Hadoop standalone mode
 passes unit tests, or not
 runtime: seconds – minutes
2: word count


 [flow diagram: Document Collection → Tokenize (M) → GroupBy token (R) → Count → Word Count]

 1 mapper
 1 reducer
 18 lines code
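
 To make the Tokenize → GroupBy token → Count steps concrete, here is a minimal single-JVM sketch in plain Java. It is not the Cascading API and not the 18-line app from the slide; the class name and the delimiter regex are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

/** Sketch only: the word count pipeline's semantics without Cascading or Hadoop. */
final class WordCountSketch {
    private WordCountSketch() {}

    /** Splits each line into tokens, groups identical tokens, and counts each group. */
    static Map<String, Long> count(Iterable<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            // Tokenize: split on whitespace, brackets, parentheses, commas, periods.
            for (String token : line.split("[\\s\\[\\](),.]+")) {
                if (token.isEmpty()) {
                    continue; // a leading delimiter yields one empty token
                }
                // GroupBy token + Count, folded into one map update.
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = count(Arrays.asList(
            "rain shadow rain",
            "(rain), lee."));
        counts.forEach((token, n) -> System.out.println(token + "\t" + n));
    }
}
```

 In the Hadoop version the tokenizing runs in the mapper and the grouping and counting in the reducer; here both collapse into one loop because everything fits in memory.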
3: City of Palo Alto open data
 [flow diagram: CoPA GIS export → Regex parser → tsv Checkpoint, with Failure Traps; three Regex filter branches for tree, road, and park records; tree branch: Regex parser → Scrub species → HashJoin with Tree Metadata → Geohash; road branch: Regex parser → HashJoin with Road Metadata → Estimate Albedo → Road Segments → Geohash; tree and road are CoGrouped → Tree Distance → Filter tree_dist → GroupBy tree_name → Checkpoint shade; GPS logs → Geohash → CoGroup with shade → reco; M and R mark the map and reduce stages]




github.com/Cascading/CoPA/wiki
  • GIS export for parks, roads, trees (unstructured / open data)
  • log files of personalized/frequented locations in Palo Alto via iPhone GPS tracks
  • curated metadata, used to enrich the dataset
  • could extend via mash-up with many available public data APIs
Enterprise-scale app: road albedo + tree species metadata + geospatial indexing
“Find a shady spot on a summer day to walk near downtown and take a call…”
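
The join at the heart of this app (trees and road segments matched on a shared geohash cell, then filtered by distance) can be sketched in a few lines. This is a minimal illustration in plain Python, not the Cascading code from the CoPA repository: the record fields, the second sample row, the geohash strings used as join keys, and the thresholds `max_dist_m` and `min_height_m` are all assumptions made up for the example.

```python
from collections import defaultdict
from math import cos, radians, sqrt

# Hypothetical sample records; field names and values are illustrative only.
# The geohash strings act purely as join keys here and were not computed.
trees = [
    {"tree_id": 1, "species": "Liquidambar styraciflua", "height_m": 23.0,
     "lat": 37.4460, "lng": -122.1680, "geohash": "9q9jh0"},
    {"tree_id": 2, "species": "Quercus agrifolia", "height_m": 8.0,
     "lat": 37.4300, "lng": -122.1500, "geohash": "9q9jh1"},
]
roads = [
    {"road": "HAWTHORNE AVE", "albedo": 0.12,
     "lat": 37.4461, "lng": -122.1680, "geohash": "9q9jh0"},
    {"road": "EXAMPLE ST", "albedo": 0.30,
     "lat": 37.4301, "lng": -122.1500, "geohash": "9q9jh1"},
]


def distance_m(lat1, lng1, lat2, lng2):
    """Approximate ground distance in meters (equirectangular projection)."""
    earth_radius_m = 6_371_000.0
    x = radians(lng2 - lng1) * cos(radians((lat1 + lat2) / 2.0))
    y = radians(lat2 - lat1)
    return earth_radius_m * sqrt(x * x + y * y)


def shady_segments(trees, roads, max_dist_m=25.0, min_height_m=10.0):
    """Join trees to road segments on geohash, keep tall trees close to a road."""
    # Index the right-hand side by join key, similar in spirit to a hash join.
    roads_by_cell = defaultdict(list)
    for road in roads:
        roads_by_cell[road["geohash"]].append(road)

    results = []
    for tree in trees:
        if tree["height_m"] < min_height_m:
            continue  # assumed threshold: short trees cast little shade
        for road in roads_by_cell.get(tree["geohash"], []):
            d = distance_m(tree["lat"], tree["lng"], road["lat"], road["lng"])
            if d <= max_dist_m:
                results.append({
                    "road": road["road"],
                    "albedo": road["albedo"],
                    "species": tree["species"],
                    "height_m": tree["height_m"],
                    "distance_m": d,
                })
    # Closest tree/road pairs first.
    return sorted(results, key=lambda r: r["distance_m"])


results = shady_segments(trees, roads)
for row in results:
    print(row["road"], row["species"], round(row["distance_m"], 1))
```

Joining only on an identical cell keeps the example short, but it misses pairs that sit on opposite sides of a cell boundary; a fuller version would also probe neighboring cells.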
log events
example results
[Plot: Estimated Tree Height (meters). x-axis: avg_height (0 to 50); y-axis: density (0.00 to 0.12); legend: count (0 to 300)]


 •   addr: 115 HAWTHORNE AVE
 •   lat/lng: 37.446, -122.168
 •   geohash: 9q9jh0
 •   tree: 413 site 2
 •   species: Liquidambar styraciflua
 •   avg height: 23 m
 •   road albedo: 0.12
 •   distance: 10 m
 •   a short walk from my train stop ✔
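
The geohash in the result above is what lets trees, roads, and GPS tracks be joined on a common key. As a rough sketch of how such a key is derived, the function below implements the standard public geohash encoding (alternating longitude and latitude bisection, five bits per base-32 character). The function name `geohash_encode` is an assumption for illustration and is not taken from the CoPA code. Because the coordinates on the slide are rounded, only the leading characters are expected to agree with the slide's `9q9jh0`; the trailing characters may differ.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lng, precision=6):
    """Encode a latitude/longitude pair as a geohash string of given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    chars = []
    bits = 0          # accumulates the current 5-bit group
    bit_count = 0
    use_lng = True    # bits alternate, starting with longitude

    while len(chars) < precision:
        if use_lng:
            mid = (lng_lo + lng_hi) / 2.0
            if lng >= mid:
                bits = (bits << 1) | 1
                lng_lo = mid
            else:
                bits <<= 1
                lng_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2.0
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lng = not use_lng
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0

    return "".join(chars)


# Rounded coordinates from the slide, at a coarse precision.
print(geohash_encode(37.446, -122.168, precision=4))  # 9q9j
```

Shorter hashes name larger cells, so truncating the string is a cheap way to widen the join key when nearby points should still match.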
drill-down


  blog, code/wiki/gists, jars, list, DevOps products:
  cascading.org/
  github.com/Cascading/
  conjars.org/
  goo.gl/KQtUL
  concurrentinc.com/

Take control of your SAP testing with UiPath Test Suite
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 

A Data Scientist And A Log File Walk Into A Bar...

  • 1. “A Data Scientist And A Log File Walk Into A Bar…”
      Paco Nathan, Concurrent, Inc.
      pnathan@concurrentinc.com, @pacoid
      Copyright @2012, Concurrent, Inc.
      [diagram: word count workflow. Labels: Document Collection, Tokenize, Scrub token, HashJoin Left (RHS: Stop Word List), Regex token, GroupBy token, Count, Word Count; M and R mark the map and reduce phases]
  • 2. opportunity: Unstructured Data meets Enterprise Scale
      1. backstory: how we got here
      2. overview: typical use cases
      3. example: a Cascading app
  • 3. Intro to Data Science
      1. backstory: how we got here
      [diagram: word count workflow]
  • 4. inflection point
      [timeline graphic: 1997, 1998, 2004]
      • huge Internet successes after 1997 holiday season… AMZN, EBAY, then GOOG, Inktomi (YHOO Search)
      • consider this metric: annual revenue per customer / amount of data stored dropped 100x within a few years after 1997
      • storage and processing costs plummeted, now we must work much smarter to extract ROI from Big Data… our methods must adapt
      • “conventional wisdom” of RDBMS and BI tools became less viable; business cadre still focused on pivot tables and pie charts… tends toward inertia!
      • MapReduce and the Hadoop open source stack grew directly out of that contention… but only solve portions
      massive disruption in retail, advertising, etc., “All of Fortune 500 is now on notice over the next 10-year period.” – Geoffrey Moore, 2012 (Mohr Davidow Ventures)
  • 5. the world before… BI, SQL, and highly optimized code
  • 6. data innovation: circa 1996
      [diagram labels: Stakeholder, Product, Customers, Analysts, Engineering, strategy, requirements, BI, Excel pivot tables, PowerPoint slide decks, SQL Query, result sets, optimized code, Web App, transactions, RDBMS]
  • 7. the world after… machine learning, leveraging log files
  • 8. data innovation: circa 2001
      [diagram labels: Stakeholder, Product, Customers, Engineering, UX, dashboards, Algorithmic Modeling, models, recommenders + classifiers, Web Apps, servlets, Middleware, aggregation, event history, customer transactions, SQL Query, result sets, Logs, DW, ETL, RDBMS]
  • 9. the world ahead… what our customers are doing now
  • 10. data innovation: circa 2013
      [diagram labels: Customers; Data Apps; business process; Domain Expert; Workflow; Prod; dashboard metrics; Web Apps, Mobile, etc.; History; services; data science; s/w dev; Data Scientist; Planner; social interactions; discovery + modeling; optimized capacity; transactions, content; Eng; endpoints; App Dev; Data Access Patterns; Hadoop, etc.; Log Events; In-Memory Data Grid; Ops; DW; batch; "real time"; Cluster Scheduler; RDBMS]
  • 12. statistical thinking
      [diagram labels: Process, Variation, Data, Tools]
      employing a mode of thought which includes both logical and analytical reasoning: evaluating the whole of a problem, as well as its component parts; attempting to assess the effects of changing one or more variables
      this approach attempts to understand not just problems and solutions, but also the processes involved and their variances
      particularly valuable in Big Data work when combined with hands-on experience in physics – roughly 50% of my peers come from physics or physical engineering…
      programmers typically don’t think this way… however, both systems engineers and data scientists must!
  • 13. references
      by Leo Breiman
      Statistical Modeling: The Two Cultures
      Statistical Science, 2001
      http://bit.ly/eUTh9L
      also check out RStudio: http://rstudio.org/ http://rpubs.com/
  • 14. most valuable skills
      [image: D3]
      • approximately 80% of the costs for data-related projects get spent on data preparation – mostly on cleaning up data quality issues: ETL, log file analysis, etc.
      • unfortunately, data-related budgets for many companies tend to go into frameworks which can only be used after clean up
      • most valuable skills:
        ‣ learn to use programmable tools that prepare data
        ‣ learn to generate compelling data visualizations
        ‣ learn to estimate the confidence for reported results
        ‣ learn to automate work, making analysis repeatable
      the rest of the skills – modeling, algorithms, etc. – those are secondary
  • 15. team process
      [image: Gephi]
      • discovery: help people ask the right questions
      • modeling: allow automation to place informed bets
      • integration: deliver products at scale to customers
      • apps: leverage smarts in product features
      • systems: keep infrastructure running, cost-effective
  • 16. matrix: usage
      [matrix: columns (needs) discovery, modeling, integration, apps, systems; rows (roles) stakeholder, scientist, developer, ops]
      conceptual tool for managing Data Science teams
      overlay your project requirements (needs) with your team’s strengths (roles)
      that will show very quickly where to focus
      NB: bring in individuals who cover 2-3 needs, particularly for team leads
  • 17. building teams
      [matrix: columns discovery, modeling, integration, apps, systems; rows stakeholder, scientist, developer, ops]
  • 18. references
      by DJ Patil
      Data Jujitsu, O’Reilly, 2012, http://www.amazon.com/dp/B008HMN5BE
      Building Data Science Teams, O’Reilly, 2011, http://www.amazon.com/dp/B005O4U3ZE
  • 19. Intro to Data Science
      2. overview: typical use cases
      [diagram: word count workflow]
  • 20. using science in data science
      [figure: graph of user event names; labels not recoverable]
      in a nutshell, what we do…
      • estimate probability
      • calculate analytic variance
      • manipulate order complexity
      • make use of learning theory
      • collab with DevOps, Stakeholders
  • 21. use case: marketing funnel
      [image: Wikipedia]
      • must optimize a very large ad spend
      • different vendors report different metrics
      • seasonal variation distorts performance
      • some campaigns are much smaller than others
      • hard to predict ROI for incremental spend
      approach:
      • log aggregation, followed by cohort analysis
      • Bayesian point estimates compare different-sized ad tests
      • customer lifetime value quantifies ROI of new leads
      • time series analysis normalizes for seasonal variation
      • geolocation adjusts for regional cost/benefit
      • linear programming models estimate elasticity of demand
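One way to read the "Bayesian point estimates" bullet is a Beta-Binomial posterior mean, which pulls the observed conversion rate of a small ad test toward a prior so that it can be compared with a much larger test. The following is a minimal sketch under that assumption, not the method used in the talk; the class name `AdTestEstimate`, the prior parameters, and the sample counts are all hypothetical.

```java
// Hypothetical sketch: posterior-mean conversion rate under a Beta(alpha, beta) prior.
final class AdTestEstimate {
    // Posterior mean of a Beta-Binomial model:
    // (alpha + conversions) / (alpha + beta + impressions).
    static double posteriorMean(long conversions, long impressions, double alpha, double beta) {
        if (conversions < 0 || impressions < conversions) {
            throw new IllegalArgumentException("need 0 <= conversions <= impressions");
        }
        return (alpha + conversions) / (alpha + beta + impressions);
    }

    public static void main(String[] args) {
        // Assumed prior: about a 2% conversion rate, worth 100 pseudo-impressions.
        double alpha = 2.0, beta = 98.0;
        double small = posteriorMean(3, 50, alpha, beta);       // raw rate 6%
        double large = posteriorMean(300, 10_000, alpha, beta); // raw rate 3%
        System.out.printf("small test: %.4f, large test: %.4f%n", small, large);
    }
}
```

With these assumed numbers the small test's estimate is pulled well below its raw 6% rate, while the large test barely moves, which is the property that makes different-sized tests comparable.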
  • 22. use case: ecommerce fraud
      [image: stat.berkeley.edu]
      • sparse data means lots of missing values
      • “needle in a haystack” lack of training cases
      • answers are available in large-scale batch, results are needed in real-time event processing
      • not just one pattern to detect – many, ever-changing
      approach:
      • random forest (RF) classifiers predict likely fraud
      • subsampled data to re-balance training sets
      • impute missing values based on density functions
      • train on massive log files, run on in-memory grid
      • adjust metrics to minimize customer support costs
      • detect novelty – report anomalies via notifications
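The "subsampled data to re-balance training sets" step can be sketched as down-sampling the majority class until it matches the rare class. This is only one possible reading of the bullet; the class name `Rebalance`, the 1:1 target ratio, and the fixed seed are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: down-sample the majority (non-fraud) class to match the minority class.
final class Rebalance {
    static <T> List<T> downsample(List<T> majority, List<T> minority, long seed) {
        List<T> shuffled = new ArrayList<>(majority);
        Collections.shuffle(shuffled, new Random(seed)); // seeded so runs are repeatable
        int keep = Math.min(minority.size(), shuffled.size());
        List<T> balanced = new ArrayList<>(shuffled.subList(0, keep));
        balanced.addAll(minority); // keep every rare (fraud) example
        return balanced;
    }
}
```

The resulting list would then be passed to whatever classifier is being trained; the inputs themselves are left unmodified.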
  • 23. use case: customer segmentation
      [image: Mathworks]
      • many millions of customers, hard to determine which features resonate
      • multi-modal distributions get obscured by the practice of calculating an “average”
      • not much is known about individual customers
      approach:
      • connected components for sessionization, determining uniques from logs
      • estimates for age, gender, income, geo, etc.
      • clustering algorithms to group into market segments
      • social graph infers “unknown” relationships
      • covariance/heat maps visualize segments vs. feature sets
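For "connected components … determining uniques from logs", a common approach is union-find over identifiers that appear together in a log event (cookie, login, device), so that each component stands for one visitor. The sketch below assumes that reading; `VisitorComponents` and the identifier strings are hypothetical, and a production job would run this as a distributed workflow rather than in one process.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: union-find over identifiers seen together in log events.
final class VisitorComponents {
    private final Map<String, String> parent = new HashMap<>();

    // Find the representative of an identifier, registering it on first sight.
    String find(String id) {
        parent.putIfAbsent(id, id);
        String root = id;
        while (!root.equals(parent.get(root))) {
            root = parent.get(root);
        }
        // Path compression: point every node on the path straight at the root.
        String cur = id;
        while (!cur.equals(root)) {
            String next = parent.get(cur);
            parent.put(cur, root);
            cur = next;
        }
        return root;
    }

    // Record that two identifiers appeared in the same log event.
    void link(String a, String b) {
        parent.put(find(a), find(b));
    }

    // Number of connected components, i.e. the estimated count of unique visitors.
    int uniques() {
        Set<String> roots = new HashSet<>();
        for (String id : new HashSet<>(parent.keySet())) {
            roots.add(find(id));
        }
        return roots.size();
    }
}
```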
  • 24. use case: monetizing content
      [image: Digital Humanities]
      • need to suggest relevant content which would otherwise get buried in the back catalog
      • big disconnect between inventory and limited performance ad market
      • enormous amounts of text, hard to categorize
      approach:
      • text analytics glean key phrases from documents
      • hierarchical clustering of character frequencies detects language
      • latent Dirichlet allocation (LDA) reduces dimension to topic models
      • recommenders suggest similar topics to customers
      • collaborative filters connect known users with less known
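Clustering documents by character frequencies needs a feature vector per document and a similarity measure between vectors. The sketch below shows only that building block, using letter counts and cosine similarity; the clustering itself is omitted, and the class name `CharProfile` and the choice of cosine similarity are assumptions rather than anything stated on the slide.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: letter-frequency profiles compared by cosine similarity.
final class CharProfile {
    // Count each letter in the text, ignoring case, digits, and punctuation.
    static Map<Character, Integer> counts(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : text.toLowerCase().toCharArray()) {
            if (Character.isLetter(c)) {
                freq.merge(c, 1, Integer::sum);
            }
        }
        return freq;
    }

    // Cosine similarity of two count vectors; 0.0 if either vector is empty.
    static double cosine(Map<Character, Integer> a, Map<Character, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<Character, Integer> e : a.entrySet()) {
            normA += e.getValue() * (double) e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * (double) other;
            }
        }
        for (int v : b.values()) {
            normB += v * (double) v;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }
}
```

A distance such as `1 - cosine(...)` could then feed an agglomerative clustering step, with clusters expected to line up roughly with languages.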
  • 25. plus some great tools…
      • reporting: Graphite, PowerPivot, Pentaho, Jaspersoft, SAS
      • visualization: ggplot2, D3, Gephi
      • analytics/modeling: R, Weka, Matlab, PMML, GLPK
      • text: LDA, WordNet, OpenNLP, Mallet, Bixo, NLTK
      • apps: Cascading, Scalding, Cascalog, R markdown, SWF
      • scale-out: Scalr, RightScale, CycleComputing, vFabric, Beanstalk
      • graph: Gremlin, GraphLab, Neo4J
      • column: Vertica, HBase, Drill, Dynamo
      • key/val: Redis, Membase
      • index: Lucene/Solr, ElasticSearch
      • relational: usual suspects, MySQL
      • imdg: Spark, Storm, Gigaspaces
      • hadoop: EMR, HW, MapR, EMC, Azure, Compute
      • machine data: Splunk, collectd, Nagios
      • durable storage: S3, ASV, GCS, Riak, Couch
  • 26. Intro to Data Science
      3. example: a Cascading app
      [diagram: word count workflow]
  • 27. getting started
      cascading.org/category/impatient/
      [diagram: word count workflow]
  • 28. composition of a workflow
      • business process: domain expertise, business trade-offs, market position, operating parameters, etc.
      • API language: Scala, Clojure, Python, Ruby, Java, etc. …envision whatever else runs in a JVM
      • optimize / schedule: major changes in technology now
      • physical plan: [diagram: word count workflow]
      • compute substrate (“assembler” code): Apache Hadoop, in-memory local mode …envision GPUs, other frameworks, etc.
      • machine data: Splunk, Nagios, Collectd, etc.
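  • 29. 1: copy
      [diagram labels: Source, M, Sink]

        // imports added for completeness; not counted in the slide's line total
        import java.util.Properties;

        import cascading.flow.FlowDef;
        import cascading.flow.hadoop.HadoopFlowConnector;
        import cascading.pipe.Pipe;
        import cascading.property.AppProps;
        import cascading.scheme.hadoop.TextDelimited;
        import cascading.tap.Tap;
        import cascading.tap.hadoop.Hfs;

        public class Main
          {
          public static void main( String[] args )
            {
            String inPath = args[ 0 ];
            String outPath = args[ 1 ];

            Properties props = new Properties();
            AppProps.setApplicationJarClass( props, Main.class );
            HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

            // create the source tap (tab-delimited text with a header row)
            Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

            // create the sink tap
            Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

            // specify a pipe to connect the taps
            Pipe copyPipe = new Pipe( "copy" );

            // connect the taps, pipes, etc., into a flow
            FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
             .addSource( copyPipe, inTap )
             .addTailSink( copyPipe, outTap );

            // run the flow
            flowConnector.connect( flowDef ).complete();
            }
          }

      1 mapper, 0 reducers, 10 lines code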
  • 30. wait! ten lines of code for a file copy… seems like a lot.
  • 31. same JAR, any scale…
      • MegaCorp Enterprise IT: Pb’s data; 1000+ node private cluster; EVP calls you when app fails; runtime: days+
      • Production Cluster: Tb’s data; EMR w/ 50 HPC Instances; Ops monitors results; runtime: hours – days
      • Staging Cluster: Gb’s data; EMR + 4 Spot Instances; CI shows red or green lights; runtime: minutes – hours
      • Your Laptop: Mb’s data; Hadoop standalone mode; passes unit tests, or not; runtime: seconds – minutes
  • 32. 2: word count
      [diagram labels: Document Collection, Tokenize, GroupBy token, Count, Word Count; M and R mark the map and reduce phases]
      1 mapper, 1 reducer, 18 lines code
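The word count dataflow on this slide (tokenize, group by token, count) can be mimicked in plain Java to see what each stage contributes. This is a local analogy only, not the Cascading API and not the 18-line app the slide refers to; the class name `WordCountSketch` and the tokenizing pattern are assumptions.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: the tokenize -> group by token -> count dataflow in plain Java.
final class WordCountSketch {
    static Map<String, Integer> count(Iterable<String> documents) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by token, like grouped output
        for (String doc : documents) {
            // "Tokenize": split on whitespace and simple punctuation (an assumed pattern).
            for (String token : doc.split("[\\s\\[\\](),.]+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum); // "GroupBy token" plus "Count"
                }
            }
        }
        return counts;
    }
}
```

Tokens are case-sensitive here; lowercasing would belong to the separate "Scrub" stage shown in the fuller workflow diagram.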
  • 33. 3: City of Palo Alto open data
      [diagram: CoPA workflow. Labels include CoPA GIS export, Regex parsers and filters for tree, road, and park records, Scrub species, HashJoin with Tree Metadata and Road Metadata, Estimate Albedo, Geohash, CoGroup, Tree Distance, Checkpoint, GPS logs, Failure Traps, and the outputs tree, road, park, shade, and reco]
      github.com/Cascading/CoPA/wiki
      • GIS export for parks, roads, trees (unstructured / open data)
      • log files of personalized/frequented locations in Palo Alto via iPhone GPS tracks
      • curated metadata, used to enrich the dataset
      • could extend via mash-up with many available public data APIs
      Enterprise-scale app: road albedo + tree species metadata + geospatial indexing
      “Find a shady spot on a summer day to walk near downtown and take a call…”
  • 35. example results
      [plot: Estimated Tree Height (meters); x-axis avg_height (0 to 50), y-axis density (0.00 to 0.12), legend count (0 to 300)]
      • addr: 115 HAWTHORNE AVE
      • lat/lng: 37.446, -122.168
      • geohash: 9q9jh0
      • tree: 413 site 2
      • species: Liquidambar styraciflua
      • avg height 23 m
      • road albedo: 0.12
      • distance: 10 m
      • a short walk from my train stop ✔
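The geohash in these results is what lets tree, road, and GPS records be joined by location. The standard encoding interleaves longitude and latitude bits and emits one base32 character per five bits, as in the sketch below. It is not the code used in the CoPA app, and `GeohashSketch` is a hypothetical name. Because the slide's coordinates are rounded to three decimals, only the leading characters should be expected to agree with the slide's `9q9jh0`.

```java
// Hypothetical sketch of standard geohash encoding (base32, longitude bit first).
final class GeohashSketch {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    static String encode(double lat, double lng, int length) {
        double latLo = -90, latHi = 90, lngLo = -180, lngHi = 180;
        StringBuilder hash = new StringBuilder();
        boolean useLng = true; // bits alternate, starting with longitude
        int bits = 0, value = 0;
        while (hash.length() < length) {
            if (useLng) {
                double mid = (lngLo + lngHi) / 2;
                if (lng >= mid) { value = (value << 1) | 1; lngLo = mid; }
                else { value <<= 1; lngHi = mid; }
            } else {
                double mid = (latLo + latHi) / 2;
                if (lat >= mid) { value = (value << 1) | 1; latLo = mid; }
                else { value <<= 1; latHi = mid; }
            }
            useLng = !useLng;
            if (++bits == 5) { // every 5 bits become one base32 character
                hash.append(BASE32.charAt(value));
                bits = 0;
                value = 0;
            }
        }
        return hash.toString();
    }
}
```

Records whose hashes share a prefix fall in the same grid cell, so a plain equality join on a truncated geohash approximates a "nearby" join.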
  • 36. drill-down
      • blog, DevOps: cascading.org/
      • code/wiki/gists: github.org/Cascading/
      • jars: conjars.org/
      • list: goo.gl/KQtUL
      • products: concurrentinc.com/

Editor's Notes

  1-36. responsible for net lift, or we work on something else