BIG Data Science: A Path Forward

June 2013

CONFIDENTIAL | 2
linkedin.com/in/danmallinger/
@danmallinger
www.thinkbiganalytics.com
• Data Science Lead @ Think Big
• Product/Brand Obsessive
• Teacher
• Occasional Engineer

TODAY
• High-level exploration of the skills, tools, and techniques needed to achieve early success and to help you build your data science practice.

INFRASTRUCTURE, TALENT, & CAPABILITIES
• Understand our organizational needs for data science
• Infrastructure: Technological tools and platforms.
• Talent: Staff hired and trained.
• Capabilities: Data science techniques utilized.
Infrastructure: Hadoop, NoSQL, Analytics, SQL/MPP, Real Time
Talent: Scripting, MapReduce, Data Exploration, Basic Modeling, PhD Math
Capabilities: Visualization, Clustering, Categorization, Continuous Models, Text Analysis

ANALYTICS TOOLS
• Boxed Solutions: Mahout & Platform
• Toolkits: RHadoop, Scikit, etc.
• You will need toolkits to solve unique problems, but smart techniques make that easier.
• Boxed solutions are limited, but can be a good source of early velocity.

DATA
Presenter
• Gigabytes from Stack Overflow
• Questions from users, with metadata
• Users have reputations
• Questions open or closed
Audience
• Follow along, thinking about your data, to learn in a familiar context and plan.
Talent: Scripting, MapReduce, Exploration, Basic Modeling, PhD Math
Capabilities: Visualization, Clustering, Categorization, Continuous, Text Analysis

STEP 1: EXPLORE
Patterns through Hive
select count(1) as total
     , sum(has_code)
     , avg(body_count)
     , stddev_samp(body_count)
     , corr(reputation, owner_questions)
     , histogram_numeric(body_count, 10)
from questions
;
Patterns through Tableau

STEP 2: FEATURE BUILDING
SQL, Windowing, Cross-Record Features
• Summaries of unstructured data
• Time-since metrics
select transform(…)
using 'python …'
• Clustering: Browsing cohorts
/bin/mahout canopy
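The `select transform(…) using 'python …'` pattern streams rows to a script as tab-separated lines on standard input and reads tab-separated lines back from standard output. The following is a minimal sketch of such a script, not the presenter's actual code. It assumes a hypothetical two-column input (`question_id`, `body`), treats `body_count` as a word count, and detects code by looking for a `<code>` tag; all three choices are illustrative assumptions.

```python
import sys


def summarize(line):
    """Turn one tab-separated input row into summary features.

    Assumed (hypothetical) input columns: question_id, body.
    Output columns: question_id, body_count, has_code.
    """
    question_id, body = line.rstrip("\n").split("\t", 1)
    body_count = len(body.split())      # assumption: body_count is a word count
    has_code = int("<code>" in body)    # assumption: code is marked with <code> tags
    return f"{question_id}\t{body_count}\t{has_code}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read rows from Hive on stdin and write feature rows to stdout."""
    for line in stdin:
        if line.strip():
            stdout.write(summarize(line) + "\n")

# When saved as the TRANSFORM script, call main() at module level.
```

Keeping the per-row logic in `summarize` makes it easy to check on a laptop before running it over the full table.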

PARALLEL MODELS IN HADOOP
• Sample (don’t parallelize)
• Naturally parallel
  • SVD
  • Random Forests
• Estimators and Ensembles
  • Bootstrapping
  • Localizing
• Advanced Parallelization
  • Linear models with SGD
  • Neural networks

STEP 3: STRUCTURED MODEL (BAGGING)
• Single R model run many times over samples and aggregated
m <- C5.0(status ~ …)
Bagging a Model
Mapper 1:
  Define n reducer keys
  Send each record to reducer i with probability p
Reducer 1:
  Key: ID of sample
  Value: List of records
  Perform analysis over records
Reducer 2:
  Key: One (a single shared key)
  Value: List of models
  Aggregate the models (e.g. average)
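The mapper and reducer outline above can be simulated locally to see how the pieces fit. This is a rough sketch under stated assumptions, not the deck's implementation: the per-sample model is a one-variable least-squares line rather than the C5.0 tree on the slide, "send to reducer i with probability p" is read as an independent coin flip per record and sample, and the function names are made up for illustration.

```python
import random
from collections import defaultdict
from statistics import mean


def mapper(records, n_samples, p, rng):
    """Mapper 1: emit (sample_id, record); each record joins each sample with probability p."""
    for record in records:
        for sample_id in range(n_samples):
            if rng.random() < p:
                yield sample_id, record


def fit_line(records):
    """Reducer 1 body: least-squares fit of y = intercept + slope * x over one sample."""
    xs = [x for x, _ in records]
    ys = [y for _, y in records]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def bag(records, n_samples=10, p=0.5, seed=0):
    """Run the mapper, group by sample id, fit one model per sample, then average."""
    rng = random.Random(seed)
    groups = defaultdict(list)  # stands in for the shuffle phase
    for sample_id, record in mapper(records, n_samples, p, rng):
        groups[sample_id].append(record)
    # Skip degenerate samples that cannot support a line fit.
    models = [fit_line(g) for g in groups.values() if len({x for x, _ in g}) > 1]
    # Reducer 2: every model arrives under one key; aggregate by averaging coefficients.
    return mean(m[0] for m in models), mean(m[1] for m in models)
```

Averaging coefficients only makes sense for models with comparable parameters; for trees such as C5.0, the aggregate would more plausibly be a vote or an average over predictions.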

WHERE ARE WE?
• Using Big Data, we’ve created a structured model to flag questions that won’t be closed.
• But we haven’t used unstructured data.

TEXT ANALYSIS
• Is “the big dog” really different from “dog is big”?
• How about “I like eggs but hate tofu” and “I hate eggs but like tofu”?
• Language has lexical and syntactical features
• Different techniques leverage these in different ways
• Bag of Words: Structure doesn’t matter
• n-gram: Structure matters (but not that much)
• Feature Extraction: BACON! BACON! BACON!
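A small sketch of the difference between the first two techniques, using the eggs and tofu sentences from this slide. Whitespace tokenization and lowercasing are simplifying assumptions; real text would need more careful handling.

```python
from collections import Counter


def tokens(text):
    """Naive tokenizer: lowercase and split on whitespace (an assumption)."""
    return text.lower().split()


def bag_of_words(text):
    """Count tokens, ignoring their order."""
    return Counter(tokens(text))


def ngrams(text, n=2):
    """Count runs of n adjacent tokens, which keeps local word order."""
    words = tokens(text)
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


a = "I like eggs but hate tofu"
b = "I hate eggs but like tofu"

same_bag = bag_of_words(a) == bag_of_words(b)   # order is discarded
same_bigrams = ngrams(a) == ngrams(b)           # order is partly kept
```

Here the two sentences collapse to the same bag of words, while their bigram counts differ, which is why n-grams can separate opposite opinions that a bag of words cannot.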

STEP 4: UNSTRUCTURED MODEL
• Similar to Hadoop’s Word Count
• Create counts for token/category pairs
• Use counts to calculate Information Gain
Information Gain
MR Job 1:
  Calculate information gain (IG) for all tokens.
MR Job 2:
  Select tokens with largest IG.
  Create structured data for record, tokens:
  question #4 | 0 | 1 | 0 | 1 | 1
MR Job 3:
  Build a classifier over the newly structured data (prior slides)
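The first two jobs can be sketched on a single machine to make the calculation concrete. This is an illustrative sketch rather than the deck's code: it assumes each question is represented as a set of tokens plus a category label, measures information gain in bits on token presence or absence, and uses hypothetical function names.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(docs):
    """MR Job 1 stand-in. docs is a list of (set_of_tokens, category) pairs."""
    category_counts = Counter(cat for _, cat in docs)
    pair_counts = Counter((tok, cat) for toks, cat in docs for tok in toks)
    n = len(docs)
    h_prior = entropy(list(category_counts.values()))
    gains = {}
    for token in {tok for tok, _ in pair_counts}:
        present = [pair_counts[(token, c)] for c in category_counts]
        absent = [category_counts[c] - pair_counts[(token, c)] for c in category_counts]
        n_present = sum(present)
        h_cond = (n_present / n) * entropy(present) + ((n - n_present) / n) * entropy(absent)
        gains[token] = h_prior - h_cond
    return gains


def top_tokens(gains, k):
    """MR Job 2, part 1: keep the k tokens with the largest IG (ties broken by name)."""
    return [tok for tok, _ in sorted(gains.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]


def to_row(doc_tokens, selected):
    """MR Job 2, part 2: one 0/1 column per selected token."""
    return [1 if tok in doc_tokens else 0 for tok in selected]
```

The only inputs the gain needs are the token/category pair counts and the category totals, which is what makes the first job look so much like Word Count.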

WHERE ARE WE?
• We’ve created two models: one structured, one unstructured.
• But they don’t work together.

STEP 5: ENSEMBLE MODEL
• Join many models together by using their output as input to an ensemble model.
• Best when models perform differently
• Exploit differences with nonlinearities, like interaction effects.
Ensembling
Mapper 1:
  Load multiple models
  Score the models per record and output
Reducer 1:
  Key: ID of record
  Value: List of model outputs
  Join model outputs to make new records
MR Job 2:
  Build a model over the output data as if it were raw data.
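The ensembling flow above can be sketched locally as follows. This is one possible reading, not the presenter's implementation: base models are assumed to be plain functions that return a score, the interaction effect is modeled as the pairwise product of scores, and the second-stage model is a small logistic regression trained by stochastic gradient descent. All names are hypothetical.

```python
import math
from collections import defaultdict


def score_records(records, models):
    """Mapper 1: emit (record_id, (model_name, score)) for every record and model."""
    for record_id, features in records:
        for name, model in models.items():
            yield record_id, (name, model(features))


def join_scores(pairs, model_names):
    """Reducer 1: group by record id and lay the scores out in a fixed column order."""
    grouped = defaultdict(dict)
    for record_id, (name, score) in pairs:
        grouped[record_id][name] = score
    return {rid: [scores[n] for n in model_names] for rid, scores in grouped.items()}


def with_interactions(row):
    """Append pairwise products so the ensemble can exploit model disagreements."""
    products = [row[i] * row[j] for i in range(len(row)) for j in range(i + 1, len(row))]
    return row + products


def train_logistic(rows, labels, epochs=500, lr=0.5):
    """MR Job 2 stand-in: logistic regression by SGD, treating model outputs as raw data."""
    weights = [0.0] * (len(rows[0]) + 1)  # last weight is the bias
    for _ in range(epochs):
        for row, y in zip(rows, labels):
            x = row + [1.0]
            z = sum(w * v for w, v in zip(weights, x))
            pred = 1.0 / (1.0 + math.exp(-z))
            weights = [w + lr * (y - pred) * v for w, v in zip(weights, x)]
    return weights


def predict(weights, row):
    """Classify as 1 when the linear score is positive."""
    z = sum(w * v for w, v in zip(weights, row + [1.0]))
    return 1 if z > 0 else 0
```

In practice the second-stage model should be trained on records the base models did not see during their own training, otherwise it tends to over-trust whichever base model overfit the most.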

WHERE ARE WE?
• We’ve created two models, one structured and one unstructured, and have ensembled them to create a single, powerful model and solve a practical business problem.

HOW DID WE GET HERE?
• This required simple infrastructure, a blend of analysis and scripting skills, and an understanding of BIG data science techniques, but not a team of PhDs or a billion dollars.

Questions?
www.thinkbiganalytics.com
@danmallinger
