This talk uses a case study to demonstrate core data science capabilities in Big Data, infrastructure requirements, and talent profiles that translate to early success. Using the challenge of classifying events in a consumer-oriented website, the discussion is for a wide audience:
- Practitioners will learn two key techniques for early success
- Technologists will learn how teams rely on key infrastructure and where engineers play a valuable role in data science
- Hiring managers will expand their knowledge of the skills required to bring business value with data
3. CONFIDENTIAL | 3
TODAY
• High-level exploration of the skills, tools, and techniques needed to achieve early success and to help you build your data science practice.
4. CONFIDENTIAL | 4
INFRASTRUCTURE, TALENT, & CAPABILITIES
Understand our organizational needs for data science:
Infrastructure: Technological tools and platforms (Hadoop, NoSQL, Analytics, SQL/MPP, Real Time).
Talent: Staff hired and trained (Scripting, MapReduce, Data Exploration, Basic Modeling, PhD Math).
Capabilities: Data science techniques utilized (Visualization, Clustering, Categorization, Continuous Models, Text Analysis).
5. CONFIDENTIAL | 5
ANALYTICS TOOLS
Boxed Solutions: Mahout & Platform
Toolkits: RHadoop, Scikit, etc.
You will need toolkits to solve unique problems, but smart techniques make that easier.
Boxed solutions are limited, but can be a good source of early velocity.
6. CONFIDENTIAL | 6
DATA
Presenter: Gigabytes from Stack Overflow. Questions from users, with metadata. Users have reputations. Questions are open or closed.
Audience: Follow along, thinking about your data, to learn in a familiar context and plan.
7. CONFIDENTIAL | 7
STEP 1: EXPLORE
select count(1) as total
     , sum(has_code)
     , avg(body_count)
     , stddev_samp(body_count)
     , corr(reputation, owner_questions)
     , histogram_numeric(body_count, 10)
from questions
;
Patterns through Hive; patterns through Tableau.
8. CONFIDENTIAL | 8
STEP 2: FEATURE BUILDING
Summaries of unstructured data
Time-since metrics
select transform(…) using 'python …'
Clustering: Browsing cohorts
/bin/mahout canopy
SQL Windowing: Cross-Record Features
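As a rough illustration of the "time-since metrics" idea, here is a minimal sketch of a Python script that Hive could stream rows through with a transform clause. The column layout (question id, user creation time, question creation time), the timestamp format, and the function names are all assumptions made for this example; the deck does not show the actual script.

```python
import io
import sys
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def time_since_features(line):
    """Turn one tab-separated input row into an output row with a time-since metric.

    Assumed input columns: question_id, user_created, question_created.
    """
    question_id, user_created, question_created = line.rstrip("\n").split("\t")
    created = datetime.strptime(user_created, TIME_FORMAT)
    asked = datetime.strptime(question_created, TIME_FORMAT)
    days_since_signup = (asked - created).days  # whole days between the two events
    return "\t".join([question_id, str(days_since_signup)])


def main(stream_in=sys.stdin, stream_out=sys.stdout):
    # A streaming script reads rows on stdin and writes transformed rows to stdout.
    for line in stream_in:
        if line.strip():
            stream_out.write(time_since_features(line) + "\n")


# Local demonstration with an in-memory stream instead of Hive.
sample = io.StringIO("42\t2012-01-01 00:00:00\t2012-01-11 12:00:00\n")
result = io.StringIO()
main(sample, result)
print(result.getvalue(), end="")
```

When used as a streaming script, the demonstration lines at the bottom would be replaced by a plain call to `main()` so that rows come from stdin.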
9. CONFIDENTIAL | 9
PARALLEL MODELS IN HADOOP
• Sample (don’t parallelize)
• Naturally parallel
  - SVD
  - Random Forests
• Estimators and Ensembles
  - Bootstrapping
  - Localizing
• Advanced Parallelization
  - Linear models with SGD
  - Neural networks
10. CONFIDENTIAL | 10
STEP 3: STRUCTURED MODEL (BAGGING)
Single R model, run many times over samples and aggregated.
m <- C5.0(status ~ …)
Bagging a Model
Mapper 1:
  Define n reducer keys
  Send any record to reducer i with probability p
Reducer 1:
  Key: Id of sample
  Value: List of records
  Perform analysis over records
Reducer 2:
  Key: One
  Value: List of models
  Aggregate the models (e.g. average)
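The mapper and reducer steps above can be simulated in a single process. This is a minimal sketch under stated assumptions, not the presenter's implementation: it swaps the C5.0 tree for a one-variable least-squares line (so that "average the models" has an obvious meaning), and the function names and toy data are hypothetical.

```python
import random
from collections import defaultdict
from statistics import mean


def mapper(records, n_samples, p, rng):
    """Mapper 1: send each record to reducer key i with probability p."""
    for record in records:
        for i in range(n_samples):
            if rng.random() < p:
                yield i, record


def fit_line(sample):
    """Reducer 1 analysis (assumed model): least-squares fit of y = a + b * x."""
    xs = [x for x, _ in sample]
    ys = [y for _, y in sample]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in sample) / sxx
    return y_bar - slope * x_bar, slope


def bag(records, n_samples=10, p=0.5, seed=0):
    rng = random.Random(seed)
    # Shuffle phase: group records by sample id, as the framework would.
    samples = defaultdict(list)
    for key, record in mapper(records, n_samples, p, rng):
        samples[key].append(record)
    # Reducer 1: one model per sample (skip samples too small to fit a line).
    models = [fit_line(s) for s in samples.values() if len({x for x, _ in s}) > 1]
    # Reducer 2: aggregate the models by averaging their coefficients.
    intercept = mean(a for a, _ in models)
    slope = mean(b for _, b in models)
    return intercept, slope


# Toy data lying exactly on y = 1 + 2x, so every sample recovers the same line.
records = [(float(x), 1.0 + 2.0 * x) for x in range(20)]
intercept, slope = bag(records)
print(round(intercept, 6), round(slope, 6))
```

Because each record is sent to each sample independently, the samples overlap and differ in size, which is what lets the step run without any coordination between mappers.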
11. CONFIDENTIAL | 11
WHERE ARE WE?
We’ve created a structured model to flag questions that won’t be closed, using Big Data.
But we haven’t used unstructured data.
12. CONFIDENTIAL | 12
TEXT ANALYSIS
• Is “the big dog” really different from “dog is big”?
• How about “I like eggs but hate tofu” and “I hate eggs but like tofu”?
• Language has lexical and syntactical features
• Different techniques leverage these in different ways
Bag of Words: Structure doesn’t matter
n-gram: Structure matters (but not that much)
Feature Extraction: BACON! BACON! BACON!
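The difference between bag of words and n-grams can be seen on the eggs-and-tofu sentences from this slide. The sketch below assumes simple lowercase whitespace tokenization, and the helper names are made up for illustration.

```python
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def bag_of_words(text):
    """Count tokens, ignoring the order they appear in."""
    return Counter(tokenize(text))


def ngrams(text, n=2):
    """Count runs of n consecutive tokens, which keeps local word order."""
    tokens = tokenize(text)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


a = "I like eggs but hate tofu"
b = "I hate eggs but like tofu"
print(bag_of_words(a) == bag_of_words(b))  # same words, so the bags match
print(ngrams(a) == ngrams(b))              # word order differs, so the bigrams do not
```

A bag-of-words model cannot tell these two sentences apart, while bigrams such as ("like", "eggs") and ("hate", "eggs") separate them.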
13. CONFIDENTIAL | 13
STEP 4: UNSTRUCTURED MODEL
Similar to Hadoop’s Word Count
Create counts for token/category pairs
Use counts to calculate Information Gain
Information Gain
MR Job 1:
  Calculate information gain (IG) for all tokens.
MR Job 2:
  Select tokens with largest IG.
  Create structured data for record, tokens:
  question #4 | 0 | 1 | 0 | 1 | 1
MR Job 3:
  Build a classifier over the newly structured data (prior slides)
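The first two jobs above can be sketched in a few lines of single-process Python. This is a minimal illustration, assuming the standard definition of information gain (category entropy minus the entropy remaining after splitting on token presence); the toy questions, categories, and function names are invented for the example.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a list of category counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs):
    """docs: list of (set_of_tokens, category). Returns {token: IG}."""
    category_totals = Counter(cat for _, cat in docs)
    # Word-count style tally of token/category pairs.
    pair_counts = Counter((tok, cat) for toks, cat in docs for tok in toks)
    n = len(docs)
    base = entropy(list(category_totals.values()))
    gains = {}
    for token in {tok for tok, _ in pair_counts}:
        with_tok = [pair_counts[(token, cat)] for cat in category_totals]
        without_tok = [category_totals[cat] - pair_counts[(token, cat)]
                       for cat in category_totals]
        n_with = sum(with_tok)
        gains[token] = (base
                        - (n_with / n) * entropy(with_tok)
                        - ((n - n_with) / n) * entropy(without_tok))
    return gains


def to_row(tokens, selected):
    """Structured record: 1 if the selected token is present, else 0."""
    return [int(tok in tokens) for tok in selected]


# Hypothetical toy data: token sets per question with an open/closed label.
docs = [
    ({"please", "help", "urgent"}, "closed"),
    ({"python", "error", "help"}, "open"),
    ({"urgent", "homework"}, "closed"),
    ({"python", "traceback"}, "open"),
]
gains = information_gain(docs)
selected = sorted(gains, key=lambda t: (-gains[t], t))[:2]
print(selected, to_row(docs[0][0], selected))
```

Tokens that appear equally often in both categories, such as "help" in the toy data, score zero and are dropped before the classifier is built.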
14. CONFIDENTIAL | 14
WHERE ARE WE?
We’ve created two models: one structured, one unstructured.
But they don’t work together.
15. CONFIDENTIAL | 15
STEP 5: ENSEMBLE MODEL
Join many models together by using their output as input to an ensemble model.
Best when models perform differently.
Exploit differences with nonlinearities, like interaction effects.
Ensembling
Mapper 1:
  Load multiple models
  Score the models per record and output
Reducer 1:
  Key: Id of record
  Value: List of model outputs
  Join model outputs to make new records
MR Job 2:
  Build a model over the output data as if it were raw data.
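The ensembling steps above can be sketched as follows. Everything here is an assumption for illustration: the two base models are trivial stand-ins for the structured and unstructured models, the records are made up, and the second-stage model is a simple lookup table (majority label per combination of model outputs), chosen because it captures interactions between the base models.

```python
from collections import Counter, defaultdict


def structured_model(record):
    # Hypothetical stand-in for the structured model: favors high-reputation users.
    return int(record["reputation"] >= 100)


def text_model(record):
    # Hypothetical stand-in for the text model: penalizes the token "urgent".
    return int("urgent" not in record["text"].lower())


def score(records, models):
    """Mapper 1 / Reducer 1: score every record with every model, joined by record id."""
    return {r["id"]: tuple(m(r) for m in models) for r in records}


def fit_ensemble(joined, labels):
    """MR Job 2: treat the model outputs as raw data and learn a label per combination."""
    votes = defaultdict(Counter)
    for record_id, outputs in joined.items():
        votes[outputs][labels[record_id]] += 1
    return {outputs: counts.most_common(1)[0][0] for outputs, counts in votes.items()}


records = [
    {"id": 1, "reputation": 500, "text": "How to parse JSON"},
    {"id": 2, "reputation": 500, "text": "URGENT fix my code"},
    {"id": 3, "reputation": 5, "text": "How to sort a list"},
    {"id": 4, "reputation": 5, "text": "urgent homework"},
]
labels = {1: "open", 2: "closed", 3: "closed", 4: "closed"}

joined = score(records, [structured_model, text_model])
ensemble = fit_ensemble(joined, labels)
print(ensemble[(1, 1)], ensemble[(1, 0)])
```

In this toy data neither base model alone separates the labels, but the combination of both outputs does, which is the kind of interaction effect the slide refers to.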
16. CONFIDENTIAL | 16
WHERE ARE WE?
We’ve created two models, one structured and one unstructured, and have ensembled them to create a single, powerful model and solve a practical business problem.
17. CONFIDENTIAL | 17
HOW DID WE GET HERE?
This required simple infrastructure, a blend of analysis and scripting skills, and an understanding of BIG data science techniques, but not a team of PhDs or a billion dollars.