Revolution Confidential




The Rise of Data Science in the Age of Big Data Analytics
Why Data Distillation and Machine Learning Aren’t Enough



David M Smith
VP Marketing and Community
Revolution Analytics
Today, we’ll discuss:

• What is Data Science?
• Why machine learning isn’t enough
• Why Data Science works
• The Data Scientist’s Toolkit
• The Future of Big Data Analytics
• Closing thoughts and resources



                                                          2




© Dov Harrington, CC By-2.0
http://www.flickr.com/photos/idovermani/4110546683/                     3
Where is it safe to fish near San Francisco?




                   San Francisco Estuary Institute
                   http://www.sfei.org/tools/wqt                       4
Hurricane Sandy




           Bob Rudis
           http://rud.is/b/2012/10/28/watch-sandy-in-r-including-forecast-cone/



                                                                                                    5
Hurricane Sandy




  Ed Chen
  http://blog.echen.me/hurricane-sandy-outages/



                                                                    6
When did Michael Jackson have his biggest hits?




                 New York Times, June 25 2009 (3 hours after Michael Jackson’s death)
                 http://www.nytimes.com/interactive/2009/06/25/arts/0625-jackson-graphic.html          7
Three Essential Skills of Data Scientists

[Venn diagram; label groups as extracted]
• Data Integration, Mashups, Applications
• Models, Visualization, Predictions, Uncertainty
• Problems, Data Sources, Credibility
• Effective Data Applications




                     Drew Conway
                     http://www.dataists.com/2010/09/the-data-science-venn-diagram/                      8




Image © Abode of Chaos, CC BY 2.0
http://www.flickr.com/photos/home_of_chaos/6418989233/                     9
Machine learning (ML) for predictions

[Diagram: three stages]
• Building the Model: Features, Responses → ML → scoring rules
• Validating the Model: Validation set → scoring rules → Predictions, checked against Response → “Accuracy”
• Scoring new data: New Data → scoring rules → Predictions (scores)


                                                                                                          10
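The build, validate and score cycle on this slide can be sketched in a few lines of Python. This is only an illustration, not the method used in the talk: the single-threshold "scoring rule", the function names (`build_model`, `score`, `accuracy`) and the toy numbers are all assumptions made for the example.

```python
# Minimal sketch of the build / validate / score cycle.
# Assumption: the "scoring rule" is a single threshold on one numeric
# feature; all names and data below are hypothetical.

def build_model(features, responses):
    """Pick the threshold that best separates the training responses."""
    best_threshold, best_hits = None, -1
    for threshold in sorted(set(features)):
        hits = sum((x >= threshold) == y for x, y in zip(features, responses))
        if hits > best_hits:
            best_threshold, best_hits = threshold, hits
    return best_threshold

def score(threshold, new_features):
    """Apply the scoring rule to unseen data."""
    return [x >= threshold for x in new_features]

def accuracy(predictions, responses):
    """Share of predictions that match the known responses."""
    matches = sum(p == r for p, r in zip(predictions, responses))
    return matches / len(responses)

# Building the model: features plus known responses.
train_x = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
train_y = [False, False, False, True, True, True]
rule = build_model(train_x, train_y)

# Validating the model: a held-out set with known responses.
valid_x = [2.5, 5.0, 6.5, 9.0]
valid_y = [False, False, True, True]
valid_accuracy = accuracy(score(rule, valid_x), valid_y)

# Scoring new data: no responses available, only predictions.
new_scores = score(rule, [0.5, 7.5])
```

Accuracy on the validation set is the only feedback this loop produces, which is the figure the slide places in quotation marks.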
Problem: A lack of perspective




                Image © 2010 David M Smith. Some rights reserved CC BY-2.0                    11
Problem: Lack of credibility




                                                    12
Problem: Complexity




                                         13
Data Science to the Rescue!




                              14
Answer Unasked Questions




               Revolutions blog: “The Uncanny Valley of Big Data”
               http://blog.revolutionanalytics.com/2012/02/the-uncanny-valley-of-big-data.html        15
Fill in knowledge gaps

“Companies that have massive amounts of data without massive amounts of clue are going to be displaced by startups that have less data but more clue.” -- Tim O’Reilly

“More data beats better algorithms, every time” -- Google




             Google Research, “The Unreasonable Effectiveness of Data”:
              http://googleresearch.blogspot.com/2009/03/unreasonable-effectiveness-of-data.html
             Tim O’Reilly on Google+: https://plus.google.com/107033731246200681024/posts/4Xa76AtxYwd
             TechnoCalifornia: http://technocalifornia.blogspot.com/2012/07/more-data-or-better-models.html              16
Avoid ineffective reactions


   S&P 500




                Stupid Data Miner Tricks
                http://nerdsonwallstreet.typepad.com/my_weblog/files/dataminejune_2000.pdf         17




© Henricks Photos CC-BY-ND 2.0
http://www.flickr.com/photos/hendricksphotos/3240667626/                    18
0. Data (Big & Messy)




                                            19
1. A language for programming with data




                           Download the White Paper
                                 R is Hot
                                 bit.ly/r-is-hot




                                                             20
Grant awards to homeless veterans FY09
Data: Data.gov
Analysis: Drew Conway

[Figure callouts]
• Data import and pre-processing
• User-defined functions
• Internet API interface
• XML parsing
• Iterative data processing
• Custom graphics




                                                                 21
2. Speed. Lots and lots of speed.

[Diagram: workflow stages]
• Data
• Feature Selection, Sampling, Aggregation
• Variable Transformation
• Model Estimation
• Model Refinement
• Model Comparison / Benchmarking
• Predictions




                                                                                       22
Use all available computing cycles

[Diagram: Multicore Processor (4, 8, 16+ cores)]
• Disk
• Shared Memory: Data, Data, Data
• Core 0 (Thread 0), Core 1 (Thread 1), Core 2 (Thread 2), ... Core n (Thread n)




                                                                                    23
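One way to picture the diagram is a single in-memory dataset whose slices are handed to one worker per core. The sketch below is an assumption-laden illustration rather than anything shown in the talk: the data, the function names (`chunk_bounds`, `partial_sum_of_squares`) and the choice of a sum of squares as the workload are all hypothetical.

```python
# Sketch: split one in-memory dataset across a worker per core.
# Assumptions: the data and function names are hypothetical; threads are
# used because they share memory, as in the diagram. In CPython, pure
# Python arithmetic in threads is limited by the global interpreter lock,
# so a process pool or a native numeric library is the usual route to
# real speed-ups.
import os
from concurrent.futures import ThreadPoolExecutor

def chunk_bounds(n_items, n_chunks):
    """Return (start, stop) index pairs that cover range(n_items)."""
    size, extra = divmod(n_items, n_chunks)
    bounds, start = [], 0
    for i in range(n_chunks):
        stop = start + size + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds

def partial_sum_of_squares(data, start, stop):
    """Work done by one thread on its slice of the shared data."""
    return sum(x * x for x in data[start:stop])

data = list(range(1, 1001))          # shared, in-memory data
n_workers = os.cpu_count() or 1      # one worker per available core

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    futures = [
        pool.submit(partial_sum_of_squares, data, start, stop)
        for start, stop in chunk_bounds(len(data), n_workers)
    ]
    total = sum(f.result() for f in futures)
```

The result does not depend on the number of cores, because each partial sum is computed independently and the partial results are simply added.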
3. Algorithms that don’t choke on Big Data

[Diagram: BIG DATA is split into Data Partitions; each Data Partition goes to a Compute Node; the Compute Nodes report to a Master Node]

PEMAs: Parallel External-Memory Algorithms
                                                        24
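The slide does not show how these algorithms are implemented. One pattern that fits the diagram is for each compute node to reduce its partition to a small summary, and for the master node to merge the summaries. The sketch below assumes that pattern and applies it to a mean and variance; the function names and data are hypothetical, and the partitions are read from a list rather than from disk.

```python
# Sketch of the partition / compute node / master node pattern.
# Assumption: each partition is summarised independently and only the
# small summaries travel to the master. Names and data are hypothetical.

def summarise_partition(partition):
    """Compute node: reduce one partition to (count, sum, sum of squares)."""
    count = len(partition)
    total = sum(partition)
    total_sq = sum(x * x for x in partition)
    return count, total, total_sq

def combine(summaries):
    """Master node: merge the summaries into an overall mean and variance."""
    count = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / count
    variance = total_sq / count - mean * mean   # population variance
    return mean, variance

def read_partitions(values, partition_size):
    """Stand-in for reading blocks from disk: yield one partition at a time."""
    for start in range(0, len(values), partition_size):
        yield values[start:start + partition_size]

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
summaries = [summarise_partition(p) for p in read_partitions(values, 3)]
mean, variance = combine(summaries)
```

Only one partition needs to be in memory at a time, and the summaries can be computed in any order or in parallel. The sum-of-squares formula is kept for brevity; it can lose precision when values are large relative to their spread.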
Drink less coffee!

• Single Threaded, Non-optimized algorithms
• Optimized, Parallelized Algorithms




                                                          25
4. Move code to data (not vice versa)




            Map-Reduce




                 RHadoop: http://bit.ly/RHadoop                    26
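The map-reduce model named on this slide can be sketched in plain Python. This is a single-process illustration under stated assumptions, not RHadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the word-count task are chosen for the example, and in a real cluster the map and reduce functions would be shipped to the nodes that hold the data.

```python
# Sketch of the map-reduce pattern in plain Python.
# Assumption: this runs in one process for illustration; in Hadoop the map
# and reduce functions would run on the nodes that hold the data.
from collections import defaultdict

def map_phase(record):
    """Map: emit (key, value) pairs for one input record."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)

records = ["big data", "Big Data analytics", "data science"]
pairs = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because `map_phase` looks at one record at a time and `reduce_phase` at one key at a time, both can run wherever the corresponding data lives, and only the small intermediate pairs need to move.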
Big Data Appliances




         More info: http://bit.ly/R-Netezza

                                                               27
Play Nice with Others




    Presentation Layer
    • Business Intelligence Tools
    • Web-based data apps
    • Reporting / Spreadsheets

    Analytics Layer
    • R

    Data Layer
    • Relational datastores
    • Unstructured datastores

                                                     28
What every data scientist needs

                                        Open-Source R   Revolution R Enterprise
Interface with multiple data sources    ✓               ✓✓
Exploratory data analysis               ✓✓              ✓✓
Wide range of statistical methods       ✓✓              ✓✓
High-speed computation                  ✘               ✓✓
Big Data support                        ✘               ✓✓
Data/code locality (Hadoop, etc.)       ✘               ✓✓
Print-quality data visualization        ✓               ✓
Scheduled batch production              ✓               ✓✓
Works in a multi-tool ecosystem         ✓✓              ✓✓
Integration into Data Apps              ✘               ✓✓



                                                                               29
Revolution R Enterprise: Big-Data R

                                        Open-Source R   Revolution R Enterprise
Interface with multiple data sources    ✓               ✓✓
Exploratory data analysis               ✓✓              ✓✓
Wide range of statistical methods       ✓✓              ✓✓
High-speed computation                  ✘               ✓✓
Big Data support                        ✘               ✓✓
Data/code locality (Hadoop, etc.)       ✘               ✓✓
Print-quality data visualization        ✓✓              ✓✓
Scheduled batch production              ✓               ✓✓
Works in a multi-tool ecosystem         ✓✓              ✓✓
Integration into Data Apps              ✘               ✓✓



                             www.revolutionanalytics.com/products                       30




Image © www.tinyplanetphotography.com                    31
And … the future?

• Even more data
• Cloud computing
• Demand for Data Scientists
• Diverging paradigms for data analytics

                    http://www.indeed.com/jobtrends                    32
Diverging data paradigms

[Diagram labels]
• More data, better fault tolerance
• Files, Clusters
• Data Appliances
• Hadoop, NoSQL
• Exploration, Modeling
• Production
• Storage, Preprocessing
• Easier programming, better performance



                                                                          33
Data Science in Production

Real-time Big Data Analytics: From Deployment to Production
Thursday, November 29, 2012
10:00AM - 11:00AM Pacific Time
www.revolutionanalytics.com/news-events/free-webinars/



                                                              34
Building Data Science Teams

• DJ Patil in O’Reilly Radar: http://oreil.ly/I3H5fI
• Statistics and Data Science graduates
• Kaggle and Chorus
• Revolution Analytics R Training: http://www.revolutionanalytics.com/services/training/


                                                                 35
Closing Thoughts

• The Data Science process leads to more powerful and more useful models
• Data Scientists need a technology platform to think about, explore, and model data
• Revolution R Enterprise is R for Big Data


                                                        36
Resources

• Revolution R Enterprise: R for Big Data
  www.revolutionanalytics.com/products
• RHadoop: Connecting R and Hadoop
  bit.ly/r-hadoop
• Contact David Smith
  david@revolutionanalytics.com
  @revodavid
  blog.revolutionanalytics.com
                                                            37
Thank you.




           The leading commercial provider of software and support for the popular
                            open source R statistics language.




 www.revolutionanalytics.com            650.646.9545                  Twitter: @RevolutionR




                                                                                                  38

The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough

  • 1. Revolution Confidential T he R is e of Data S c ienc e in the age of B ig Data A nalytic s Why Data Dis tillation and Mac hine L earning A ren’t E nough David M S mith V P Marketing and C ommunity R evolution Analytic s
  • 2. Today, we’ll dis c us s : Revolution Confidential  What is Data Science?  Why machine learning isn’t enough  Why Data Science works  The Data Scientists Toolkit  The Future of Big Data Analytics  Closing thoughts and resources 2
  • 3. Revolution Confidential © Dov Harrington, CC By-2.0 http://www.flickr.com/photos/idovermani/4110546683/ 3
  • 4. Where is it s afe to fis h near S an F ranc is c o? Revolution Confidential San Francisco Estuary Institute http://www.sfei.org/tools/wqt 4
  • 5. Hurric ane S andy Revolution Confidential Bob Rudis http://rud.is/b/2012/10/28/watch-sandy-in-r-including-forecast-cone/ 5
  • 6. Hurric ane S andy Revolution Confidential Ed Chen http://blog.echen.me/hurricane-sandy-outages/ 6
  • 7. When did Mic hael J ac ks on have his bigges t hits ? Revolution Confidential New York Times, June 25 2009 (3 hours after Michael Jackson’s death) http://www.nytimes.com/interactive/2009/06/25/arts/0625-jackson-graphic.html 7
  • 8. T hree E s s ential S kills of Data S c ientis ts Revolution Confidential Models Data Integration Visualization Mashups Predictions Applications Uncertainty Problems Effective Data Sources Data Credibility Applications Drew Conway http://www.dataists.com/2010/09/the-data-science-venn-diagram/ 8
  • 9. Revolution Confidential Image © Abode of Chaos, CC BY 2.0 http://www.flickr.com/photos/home_of_chaos/6418989233/ 9
  • 10. Mac hine learning (ML ) for predic tions Revolution Confidential Building the Model Responses Features scoring Scoring new data ML rules Predictions (scores) New Data scoring Validating the Model Predictions rules Response Validation scoring set rules “Accuracy” 10
  • 11. P roblem: A lac k of pers pec tive Revolution Confidential Image © 2010 David M Smith. Some rights reserved CC BY-2.0 11
  • 12. P roblem: L ac k of c redibility Revolution Confidential 12
  • 13. P roblem: C omplexity Revolution Confidential 13
  • 14. Data Science to the Revolution Confidential Rescue! 14
  • 15. A ns wer Unas ked Ques tions Revolution Confidential Revolutions blog: “The Uncanny Valley of Big Data” http://blog.revolutionanalytics.com/2012/02/the-uncanny-valley-of-big-data.html 15
  • 16. F ill in knowledge gaps Revolution Confidential “Companies that have massive amounts of data without massive amounts of clue are going to be displaced by startups that have less data but more clue.” -- Tim O’Reilly “More data beats better algorithms, every time” – Google Google Research, “The Unreasonable Effectiveness of Data”: http://googleresearch.blogspot.com/2009/03/unreasonable-effectiveness-of-data.html Tim O’Reilly on Google+: https://plus.google.com/107033731246200681024/posts/4Xa76AtxYwd TechnoCalifornia: http://technocalifornia.blogspot.com/2012/07/more-data-or-better-models.html 16
  • 17. Avoid ineffec tive reac tions Revolution Confidential S&P 500 Stupid Data Miner Tricks http://nerdsonwallstreet.typepad.com/my_weblog/files/dataminejune_2000.pdf 17
  • 18. Revolution Confidential © Henricks Photos CC-BY-ND 2.0 http://www.flickr.com/photos/hendricksphotos/3240667626/ 18
  • 19. 0. Data (B ig & Mes s y) Revolution Confidential 19
  • 20. 1. A language for programming with data. Download the White Paper “R is Hot”: bit.ly/r-is-hot
  • 21. Data import and pre-processing. User-defined functions; Internet API interface; XML parsing; Iterative data processing; Custom graphics. Example: Grant awards to homeless veterans FY09 (Data: Data.gov; Analysis: Drew Conway)
  • 22. 2. Speed. Lots and lots of speed. Variable Transformation; Feature Selection; Data Sampling; Aggregation; Model Estimation; Predictions; Model Refinement; Model Comparison / Benchmarking
  • 23. Use all available computing cycles. Diagram: Data on Disk is loaded into Shared Memory and processed by Core 0 (Thread 0), Core 1 (Thread 1), Core 2 (Thread 2), and Core n (Thread n) of a Multicore Processor (4, 8, 16+ cores)
  • 24. 3. Algorithms that don’t choke on Big Data. Diagram: BIG DATA is split into Data Partitions; each Data Partition is processed by a Compute Node, and a Master Node coordinates the Compute Nodes. PEMAs: Parallel External-Memory Algorithms
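The partition, compute, and combine pattern behind a parallel external-memory algorithm can be sketched as follows. This is a minimal Python illustration, not Revolution Analytics' implementation: the function names are hypothetical, an in-memory list stands in for data that would normally be read from disk one partition at a time, and a thread pool stands in for the compute nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List, Sequence, Tuple


def partitions(values: Sequence[float], size: int) -> Iterator[Sequence[float]]:
    """Yield fixed-size partitions; a real PEMA would stream these from disk."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def partial_stats(chunk: Sequence[float]) -> Tuple[int, float, float]:
    """Compute-node step: summarise one partition as (count, sum, sum of squares)."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)


def combine(parts: List[Tuple[int, float, float]]) -> Tuple[float, float]:
    """Master-node step: merge the partial summaries into mean and population variance."""
    n = sum(p[0] for p in parts)
    total = sum(p[1] for p in parts)
    total_sq = sum(p[2] for p in parts)
    mean = total / n
    return mean, total_sq / n - mean * mean


data = [float(x) for x in range(1, 11)]

# Each partition is summarised independently, so only one partition at a time
# needs to be in a worker's memory; the master sees only the small summaries.
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = list(pool.map(partial_stats, partitions(data, 3)))

mean, variance = combine(parts)
```

The key design point is that each partition reduces to a small fixed-size summary, so the full data set never has to fit in memory at once. Two caveats: threads in CPython do not run pure-Python arithmetic in parallel, so the pool here only shows the structure, and the sum-of-squares variance formula can lose precision on large or badly scaled data.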
  • 25. Drink less coffee! Single Threaded Non-optimized algorithms; Optimized Parallelized Algorithms
  • 26. 4. Move code to data (not vice versa). Map-Reduce. RHadoop: http://bit.ly/RHadoop
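The Map-Reduce idea named on this slide can be shown with a small single-process sketch. It is written in plain Python rather than RHadoop, and the function names and the word-count task are illustrative assumptions. The small map function is applied where each data partition lives, and only the compact key/value pairs move on to the shuffle and reduce steps.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(record: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


# Two partitions stand in for data blocks stored on different nodes.
data_partitions = [
    ["big data", "data science"],
    ["data analytics"],
]

# The same map code is run against every partition; the data itself stays put.
mapped = [pair
          for partition in data_partitions
          for record in partition
          for pair in map_phase(record)]

word_counts = dict(reduce_phase(key, values)
                   for key, values in shuffle(mapped).items())
```

On a real cluster the map calls would run on the nodes that hold each partition and the shuffle would move data across the network; here everything runs in one process to show the data flow.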
  • 27. Big Data Appliances. More info: http://bit.ly/R-Netezza
  • 28. Play Nice with Others. Presentation Layer: Business Intelligence Tools; Web-based data apps; Reporting / Spreadsheets. Analytics Layer: R. Data Layer: Relational datastores; Unstructured datastores
  • 29. What every data scientist needs (ratings: Open-Source R | Revolution R Enterprise)
      Interface with multiple data sources: ✓ | ✓✓
      Exploratory data analysis: ✓✓ | ✓✓
      Wide range of statistical methods: ✓✓ | ✓✓
      High-speed computation: ✘ | ✓✓
      Big Data support: ✘ | ✓✓
      Data/code locality (Hadoop, etc.): ✘ | ✓✓
      Print-quality data visualization: ✓ | ✓
      Scheduled batch production: ✓ | ✓✓
      Works in a multi-tool ecosystem: ✓✓ | ✓✓
      Integration into Data Apps: ✘ | ✓✓
  • 30. Revolution R Enterprise: Big-Data R (ratings: Open-Source R | Revolution R Enterprise)
      Interface with multiple data sources: ✓ | ✓✓
      Exploratory data analysis: ✓✓ | ✓✓
      Wide range of statistical methods: ✓✓ | ✓✓
      High-speed computation: ✘ | ✓✓
      Big Data support: ✘ | ✓✓
      Data/code locality (Hadoop, etc.): ✘ | ✓✓
      Print-quality data visualization: ✓✓ | ✓✓
      Scheduled batch production: ✓ | ✓✓
      Works in a multi-tool ecosystem: ✓✓ | ✓✓
      Integration into Data Apps: ✘ | ✓✓
      www.revolutionanalytics.com/products
  • 31. Image © www.tinyplanetphotography.com
  • 32. And … the future? Even more data; Cloud computing; Demand for Data Scientists; Diverging paradigms for data analytics. http://www.indeed.com/jobtrends
  • 33. Diverging data paradigms. Diagram labels: More data, better fault tolerance; Files; Data; Hadoop; Clusters; Appliances; NoSQL; Exploration; Storage; Modeling; Preprocessing; Easier programming, better performance; Production
  • 34. Data Science in Production. Real-time Big Data Analytics: From Deployment to Production. Thursday, November 29, 2012, 10:00AM - 11:00AM Pacific Time. www.revolutionanalytics.com/news-events/free-webinars/
  • 35. Building Data Science Teams. DJ Patil in O’Reilly Radar: http://oreil.ly/I3H5fI; Statistics and Data Science graduates; Kaggle and Chorus; Revolution Analytics R Training: http://www.revolutionanalytics.com/services/training/
  • 36. Closing Thoughts. The Data Science process leads to more powerful and more useful models; Data Scientists need a technology platform to think about, explore, and model data; Revolution R Enterprise is R for Big Data
  • 37. Resources. Revolution R Enterprise: R for Big Data, www.revolutionanalytics.com/products; RHadoop: Connecting R and Hadoop, bit.ly/r-hadoop; Contact David Smith: david@revolutionanalytics.com, @revodavid, blog.revolutionanalytics.com
  • 38. Thank you. The leading commercial provider of software and support for the popular open source R statistics language. www.revolutionanalytics.com 650.646.9545 Twitter: @RevolutionR