Big Data matters because we want to use it to make sense of our world. It’s tempting to think there’s some “magic bullet” for analyzing big data, but simple “data distillation” often isn’t enough, and unsupervised machine-learning systems can be dangerous. (Like, bringing-down-the-entire-financial-system dangerous.) Data Science is the key to unlocking insight from Big Data: by combining computer science skills with statistical analysis and a deep understanding of the data and the problem, we can not only make better predictions but also fill in gaps in our knowledge, and even find answers to questions we hadn’t thought of yet.
The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough
1. The Rise of Data Science in the Age of Big Data Analytics
Why Data Distillation and Machine Learning Aren’t Enough
David M Smith
VP Marketing and Community
Revolution Analytics
Revolution Confidential
2. Today, we’ll discuss:
What is Data Science?
Why machine learning isn’t enough
Why Data Science works
The Data Scientist’s Toolkit
The Future of Big Data Analytics
Closing thoughts and resources
4. Where is it safe to fish near San Francisco?
San Francisco Estuary Institute
http://www.sfei.org/tools/wqt
5. Hurricane Sandy
Bob Rudis
http://rud.is/b/2012/10/28/watch-sandy-in-r-including-forecast-cone/
6. Hurricane Sandy
Ed Chen
http://blog.echen.me/hurricane-sandy-outages/
7. When did Michael Jackson have his biggest hits?
New York Times, June 25 2009 (3 hours after Michael Jackson’s death)
http://www.nytimes.com/interactive/2009/06/25/arts/0625-jackson-graphic.html
8. Three Essential Skills of Data Scientists
[Figure: Venn diagram. Labels: Models, Data Integration, Visualization, Mashups, Predictions, Applications, Uncertainty, Problems, Data Sources, Credibility, Effective Data Applications]
Drew Conway
http://www.dataists.com/2010/09/the-data-science-venn-diagram/
10. Machine learning (ML) for predictions
Building the Model: Features and Responses go into ML, which produces scoring rules
Scoring new data: New Data plus scoring rules gives Predictions (scores)
Validating the Model: Validation set plus scoring rules gives Predictions, which are compared with the Response to measure “Accuracy”
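The three steps on this slide (build, score, validate) can be sketched in a few lines of Python. This is only an illustrative sketch, not the method shown in the talk: the data is invented, the "model" is a single threshold on one feature, and the function names `build_model`, `score`, and `validate` are hypothetical.

```python
# Toy data, invented for illustration: (feature, response) pairs.
training_set = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]
validation_set = [(1.5, 0), (2.5, 0), (5.5, 1), (7.5, 1), (4.0, 1)]


def build_model(rows):
    """Learn a scoring rule: the threshold that best separates the responses."""
    features = sorted(x for x, _ in rows)
    # Candidate thresholds are midpoints between neighbouring feature values.
    candidates = [(a + b) / 2 for a, b in zip(features, features[1:])]

    def training_accuracy(t):
        return sum((x > t) == bool(y) for x, y in rows) / len(rows)

    return max(candidates, key=training_accuracy)


def score(threshold, feature):
    """Apply the scoring rule to one new observation."""
    return 1 if feature > threshold else 0


def validate(threshold, rows):
    """Compare predictions with known responses on held-out data."""
    hits = sum(score(threshold, x) == y for x, y in rows)
    return hits / len(rows)


threshold = build_model(training_set)                    # building the model
predictions = [score(threshold, x) for x in (0.5, 9.0)]  # scoring new data
accuracy = validate(threshold, validation_set)           # validating the model
```

The rule fits the training set perfectly yet still misses one validation row, which is why "Accuracy" is measured on data the model has not seen.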
15. Answer Unasked Questions
Revolutions blog: “The Uncanny Valley of Big Data”
http://blog.revolutionanalytics.com/2012/02/the-uncanny-valley-of-big-data.html
16. Fill in knowledge gaps
“Companies that have massive amounts of data without massive amounts of clue are going to be displaced by startups that have less data but more clue.” -- Tim O’Reilly
“More data beats better algorithms, every time” -- Google
Google Research, “The Unreasonable Effectiveness of Data”:
http://googleresearch.blogspot.com/2009/03/unreasonable-effectiveness-of-data.html
Tim O’Reilly on Google+: https://plus.google.com/107033731246200681024/posts/4Xa76AtxYwd
TechnoCalifornia: http://technocalifornia.blogspot.com/2012/07/more-data-or-better-models.html
19. 0. Data (Big & Messy)
20. 1. A language for programming with data
Download the White Paper
R is Hot
bit.ly/r-is-hot
21. Data import and pre-processing
User-defined functions
Internet API interface
XML parsing
Iterative data processing
Custom graphics
Example: Grant awards to homeless veterans FY09
Data: Data.gov
Analysis: Drew Conway
22. 2. Speed. Lots and lots of speed.
[Figure: modeling workflow. Labels: Data, Sampling, Aggregation, Variable Transformation, Feature Selection, Model Estimation, Model Refinement, Model Comparison / Benchmarking, Predictions]
23. Use all available computing cycles
[Figure: Disk; Data blocks in Shared Memory; Core 0 (Thread 0), Core 1 (Thread 1), Core 2 (Thread 2), ... Core n (Thread n); Multicore Processor (4, 8, 16+ cores)]
24. 3. Algorithms that don’t choke on Big Data
[Figure: BIG DATA split into Data Partitions, each paired with a Compute Node, plus a Master Node]
PEMAs: Parallel External-Memory Algorithms
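One common way to structure a computation in the spirit of a parallel external-memory algorithm is to process the data one chunk at a time, keep only a small summary per chunk, and merge the summaries at the end. The Python sketch below illustrates that idea for a mean and variance. It is not Revolution Analytics' implementation: the sample values are invented, the helper names (`read_chunks`, `partial_stats`, `combine`, `finalize`) are hypothetical, and the chunks are processed in a plain loop rather than on separate nodes.

```python
from functools import reduce


def read_chunks(values, chunk_size):
    """Stand-in for reading a large file block by block."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def partial_stats(chunk):
    """Work done on one chunk: count, sum, and sum of squares."""
    return (len(chunk), sum(chunk), sum(v * v for v in chunk))


def combine(a, b):
    """Merge two partial results; the order of merging does not matter."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def finalize(stats):
    """Turn the merged totals into a mean and a population variance."""
    n, total, total_sq = stats
    mean = total / n
    return mean, total_sq / n - mean * mean


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # invented sample values
partials = [partial_stats(chunk) for chunk in read_chunks(data, chunk_size=3)]
mean, variance = finalize(reduce(combine, partials))
```

Because each partial result depends only on its own chunk, the list comprehension could in principle be handed to separate cores or compute nodes, with only the small tuples sent back for merging. The sum-of-squares formula is used here for brevity; it can lose precision on large or badly scaled data.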
25. Drink less coffee!
[Figure: comparison of single-threaded, non-optimized algorithms with optimized, parallelized algorithms]
26. 4. Move code to data (not vice versa)
Map-Reduce
RHadoop: http://bit.ly/RHadoop
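The Map-Reduce pattern named on this slide can be illustrated with a small single-process Python sketch. It does not use Hadoop or RHadoop. The records, the partition layout, and the function names `map_phase`, `shuffle`, and `reduce_phase` are all assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (key, value) pairs for one input record."""
    region, amount = record
    yield region, amount


def shuffle(pairs):
    """Group values by key, as a framework would do between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


# Invented records, split into partitions as if stored on separate nodes.
partitions = [
    [("east", 10), ("west", 5)],
    [("east", 7), ("north", 3)],
]

mapped = [pair for part in partitions for rec in part for pair in map_phase(rec)]
totals = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

In a real cluster the small `map_phase` function would be shipped to wherever each partition is stored, so only the grouped intermediate pairs travel over the network. That is the sense in which code moves to the data.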
27. Big Data Appliances
More info: http://bit.ly/R-Netezza
28. Play Nice with Others
Presentation Layer
• Business Intelligence Tools
• Web-based data apps
• Reporting / Spreadsheets
Analytics Layer
• R
Data Layer
• Relational datastores
• Unstructured datastores
29. What every data scientist needs
                                        Open-Source R   Revolution R Enterprise
Interface with multiple data sources    ✓               ✓✓
Exploratory data analysis               ✓✓              ✓✓
Wide range of statistical methods       ✓✓              ✓✓
High-speed computation                  ✘               ✓✓
Big Data support                        ✘               ✓✓
Data/code locality (Hadoop, etc.)       ✘               ✓✓
Print-quality data visualization        ✓               ✓
Scheduled batch production              ✓               ✓✓
Works in a multi-tool ecosystem         ✓✓              ✓✓
Integration into Data Apps              ✘               ✓✓
30. Revolution R Enterprise: Big-Data R
                                        Open-Source R   Revolution R Enterprise
Interface with multiple data sources    ✓               ✓✓
Exploratory data analysis               ✓✓              ✓✓
Wide range of statistical methods       ✓✓              ✓✓
High-speed computation                  ✘               ✓✓
Big Data support                        ✘               ✓✓
Data/code locality (Hadoop, etc.)       ✘               ✓✓
Print-quality data visualization        ✓✓              ✓✓
Scheduled batch production              ✓               ✓✓
Works in a multi-tool ecosystem         ✓✓              ✓✓
Integration into Data Apps              ✘               ✓✓
www.revolutionanalytics.com/products
32. And … the future?
Even more data
Cloud computing
Demand for Data Scientists
Diverging paradigms for data analytics
http://www.indeed.com/jobtrends
33. Diverging data paradigms
[Figure: diagram with the captions “More data, better fault tolerance” and “Easier programming, better performance”. Labels: Files, Clusters, Data Appliances, Hadoop, NoSQL, Exploration, Modeling, Storage, Preprocessing, Production]
34. Data Science in Production
Real-time Big Data Analytics: From Deployment to Production
Thursday, November 29, 2012
10:00AM - 11:00AM Pacific Time
www.revolutionanalytics.com/news-events/free-webinars/
35. Building Data Science Teams
DJ Patil in O’Reilly Radar: http://oreil.ly/I3H5fI
Statistics and Data Science graduates
Kaggle and Chorus
Revolution Analytics R Training:
http://www.revolutionanalytics.com/services/training/
36. Closing Thoughts
The Data Science process leads to more powerful and more useful models
Data Scientists need a technology platform to think about, explore, and model data
Revolution R Enterprise is R for Big Data
37. Resources
Revolution R Enterprise: R for Big Data
www.revolutionanalytics.com/products
RHadoop: Connecting R and Hadoop
bit.ly/r-hadoop
Contact David Smith
david@revolutionanalytics.com
@revodavid
blog.revolutionanalytics.com
38. Thank you.
The leading commercial provider of software and support for the popular open source R statistics language.
www.revolutionanalytics.com 650.646.9545 Twitter: @RevolutionR