2. AGENDA
1. Introduction to Big Data / ML
2. What is H2O.ai?
3. Use Cases:
a) Beat Bill Belichick
b) Fight Crime in Chicago
c) Ham/Spam Text Messages
d) Cycling Article Search
4. Data Science Competition
3. 1. INTRO TO BIG DATA / ML
BIG DATA IS LIKE TEENAGE SEX:
everyone talks about it,
nobody really knows how to do it,
everyone thinks everyone else is
doing it, so everyone claims
they are doing it…
Dan Ariely, Prof. @ Duke
4. BIG VS. SMALL DATA
When you try to open
the file in Excel, Excel
CRASHES
SMALL = Data fits in RAM
BIG = Data does NOT fit in RAM
Basically…
Big Data is data too big
to process using conventional
methods
(e.g. Excel, Access)
5. V + V + V
Today, we have access to more data than we know what to do with!
1) Wearables (Fitbit, iWatch, etc.)
2) Click streams from web visitors
3) Sensor readings
4) Social Media Outlets (e.g. Twitter, Facebook, etc.)
Volume - Data volumes are becoming unmanageable
Variety - More data types being captured
Velocity - Data arrives rapidly and must
be processed / stored
6. THE HOPE OF BIG DATA
1. Data contains information of great business / personal value
Examples:
a) Predicting future stock movements = $$$
b) Netflix movie recommendations = Better experience = $$$
2. IF you can extract those insights from the data, you can make better
decisions
Enter, Machine Learning (ML)…
So how the hell do you do it?
7. MACHINE LEARNING
The Wikipedia Definition:
…a scientific discipline that explores the construction and study
of algorithms that can learn from data. Such algorithms operate
by building a model…. ZZZzzzzzZZZzzzzzz
My Definition:
The development, analysis, and application of algorithms that enable
machines to: make predictions and / or better understand data
2 Types of Learning:
SUPERVISED + UNSUPERVISED
8. SUPERVISED LEARNING
What is it?
Methods that infer a function from labeled training data. Key task:
Predicting ________ . (Insert your task here)
Examples of supervised learning tasks:
1. Classification Tasks - Benign / Malignant tumor
2. Regression Tasks - Predicting future stock market prices
3. Image Recognition - Highlighting faces in pictures
9. UNSUPERVISED LEARNING
What is it?
Methods to understand the general structure of input data where
no prediction is needed.
Examples of unsupervised learning tasks:
1. Clustering - Discovering customer segments
2. Topic Extraction - What topics are people tweeting about?
3. Information Retrieval - IBM Watson: Question + Answer
4. Anomaly Detection - Detecting irregular heart-beats
NO CURATION NEEDED!
10. 2. WHAT IS H2O?
What is H2O? (water, duh!)
It is ALSO an open-source, parallel processing engine for machine
learning.
What makes H2O different?
Cutting-edge algorithms + parallel architecture + ease-of-use
=
Happy Data Scientists / Analysts
13. TRY IT!
Don’t take my word for it… www.h2o.ai
Simple Instructions
1. CD to Download Location
2. unzip h2o file
3. java -jar h2o.jar
4. Point browser to: localhost:54321
GUI
R
15. TB + BB
Bill Belichick + Tom Brady =
15 years together
3 Super Bowls
16. PASS OR RUN?
On any given offensive play…
Coach Bill can either call a PASS or a RUN
What determines this?
Game situation
Opposing team
Yards to go (until 1st down)
Personnel
Time remaining, etc, etc
Basically, LOTS of stuff.
17. BUT WHAT IF??
Question:
Can we try to predict whether the next play will be PASS or RUN
using historical data?
Approach:
Download every offensive play from Belichick-Brady era since 2000
Extract known features to build model inputs
Use various Machine Learning approaches to model PASS / RUN
Disclaimer: I’m not a Seahawks fan!
18. DATA COLLECTION
Data:
13 years of data (2002-2013 seasons)
194 games total
14,547 total offensive plays (excludes punts, kickoffs, returns)
Response Variable: PASS / RUN
Model Inputs:
Quarter, Minutes, Seconds, OpposingTeam, Down, Distance,
Line of Scrimmage, NE-Score, OpposingTeam Score, Season,
Formation, Game Status (is NE losing / winning / tied)
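One way to picture a single row of model inputs is as a small record, with Game Status derived from the two scores. This is only a sketch: the field names and the `Play` class are hypothetical, and it covers just a subset of the inputs listed above.

```python
from dataclasses import dataclass


@dataclass
class Play:
    """One offensive play; field names are illustrative, not from the real dataset."""
    quarter: int
    down: int
    distance: int      # yards to go until 1st down
    ne_score: int      # New England score before the play
    opp_score: int     # opposing team score before the play


def game_status(play):
    """Derive the Game Status input: is NE winning, losing, or tied?"""
    if play.ne_score > play.opp_score:
        return "winning"
    if play.ne_score < play.opp_score:
        return "losing"
    return "tied"
```

For example, `game_status(Play(quarter=4, down=3, distance=7, ne_score=17, opp_score=20))` gives `"losing"`.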
20. OPEN CITY, OPEN DATA
“…my kind of town” - F. Sinatra
~4.6 Million rows of crimes from 2001, updated weekly*
External data source considerations???
Crime Data + Weather Data? + U.S. Census Data?
21. ML WORKFLOW
1. Collect datasets (Crime + Weather + Census)
2. Do some feature extraction (e.g. dates, times)
3. Join Crime Data + Weather Data + Census Data
4. Build deep learning model to predict
arrest / no arrest made
GOAL:
For a given crime,
predict if an arrest is
more / less likely to be made!
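Step 2 (feature extraction from dates and times) might look roughly like the sketch below. The timestamp format and the choice of derived features are assumptions made for illustration; check them against the actual crime data before relying on them.

```python
from datetime import datetime


def date_features(timestamp, fmt="%m/%d/%Y %I:%M:%S %p"):
    """Split one timestamp string into simple model features.

    The default format string is an assumption about how the dates are written.
    """
    dt = datetime.strptime(timestamp, fmt)
    return {
        "year": dt.year,
        "month": dt.month,
        "day": dt.day,
        "hour": dt.hour,
        "day_of_week": dt.weekday(),       # Monday = 0 ... Sunday = 6
        "is_weekend": dt.weekday() >= 5,
    }
```

Features like hour and day of week give the model something to work with that a raw date string does not.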
22. SPARK SQL + H2O RDD
3 table join using Spark SQL
Convert joined table to H2O RDD
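The shape of a 3 table join can be sketched without a cluster. The example below uses Python's built-in sqlite3 as a stand-in for Spark SQL, so it is not the Spark code itself; the table names, column names, join keys, and the two toy rows are all made up for illustration.

```python
import sqlite3

# In-memory database standing in for the three Spark tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE crime   (id INTEGER, date TEXT, community_area INTEGER, arrest INTEGER);
    CREATE TABLE weather (date TEXT, mean_temp REAL);
    CREATE TABLE census  (community_area INTEGER, per_capita_income INTEGER);

    -- Toy rows, invented purely so the join has something to return.
    INSERT INTO crime   VALUES (1, '2015-02-08', 35, 0), (2, '2015-02-09', 8, 1);
    INSERT INTO weather VALUES ('2015-02-08', -2.5), ('2015-02-09', 1.0);
    INSERT INTO census  VALUES (35, 23000), (8, 88000);
""")

# Crime joins Weather on the date and Census on the community area (assumed keys).
rows = conn.execute("""
    SELECT c.id, c.arrest, w.mean_temp, s.per_capita_income
    FROM crime c
    JOIN weather w ON c.date = w.date
    JOIN census  s ON c.community_area = s.community_area
    ORDER BY c.id
""").fetchall()
```

Each row of the result carries the arrest label alongside the weather and census columns, which is the table that then gets converted for modeling.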
25. HAM / SPAM TEXTS
Problem:
No one likes to be spammed. Can we look at text messages and
come up with a ham (real text) / spam classifier using Spark feature
processing + h2o deep learning?
ML Workflow:
1. Tokenize words in text messages (1,024 texts)
2. Transform each text using Spark’s implementation of TF-IDF
3. Convert TF-IDF Spark RDD → H2O RDD
4. Run Deep Learning on Train / Test Data
26. FEATURE EXTRACTION
Original Text:
“Ok…But they said i’ve got wisdom teeth hidden inside n mayb need 2
remove.”
Post Data Cleaning & Tokenization:
( but, they, said, got, wisdom, teeth, hidden, inside,
maybe, need, remove)
lower case
ignore stopwords
strip punctuation
remove numbers
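The cleaning steps above can be sketched in plain Python. This is not the Spark code from the talk: the stopword list and the slang map are hypothetical and were chosen only so that the slide's example text comes out the same way.

```python
import re

# Hypothetical lists, picked to reproduce the slide's example.
STOPWORDS = {"ok", "i", "ve", "n"}
SLANG = {"mayb": "maybe"}


def tokenize(text):
    """Lower-case, remove numbers, strip punctuation, then drop stopwords."""
    text = text.lower()
    text = re.sub(r"[0-9]+", " ", text)      # remove numbers
    text = re.sub(r"[^a-z\s]", " ", text)    # strip punctuation and other symbols
    tokens = [SLANG.get(tok, tok) for tok in text.split()]
    return [tok for tok in tokens if tok not in STOPWORDS]
```

Run on the original text above, this returns the same eleven tokens shown on the slide, from `but` through `remove`.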
27. FEATURE TRANSFORMATION
Post Data Cleaning & Tokenization:
( but, they, said, got, wisdom, teeth, hidden, inside,
maybe, need, remove)
Term Frequency - Inverse Document Frequency (TF-IDF)
1. TF - How often does “wisdom” occur in above text?
2. IDF - Normalization which calculates the frequency of “wisdom” across all
other text messages.
tf-idf(t, d) = tf(t, d) x idf(t) WHERE idf(t) = log(N / n)
N = total number of text messages, n = number of text messages containing t
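To make the formula concrete, here is a minimal pure-Python sketch of tf-idf(t, d) = tf(t, d) x log(N / n). It assumes tf is the raw count of the term in the message, N is the number of messages, and n is the number of messages containing the term. Spark's own implementation may smooth or hash terms differently, so treat this as the textbook version only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # n for each term: how many documents contain it at least once
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights
```

A word that shows up in every message gets log(N / N) = 0, so it carries no weight, while a rare word like “wisdom” scores high in the one message that contains it.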
30. DEEP AUTOENCODERS + K-MEANS EXAMPLE
Help cyclists with their health related questions!
31. CYCLING + __________
Problem:
New and Experienced Cyclists have questions about cycling + ______
(given topic). Let’s build a question + answer system to help!
ML Workflow:
1) Scrape thousands of article titles from internet about cycling /
cycling tips / cycling health, etc from various sources.
2) Build Bag-of-Words Dataset on article titles corpus
3) Reduce # of dimensions via deep autoencoder
4) Extract ‘last layer’ of deep features and cluster using k-means
5) Inspect Results!
32. BAG-OF-WORDS
Build dataset of cycling-related articles from various sources:
Article Title: The Basics of Exercise Nutrition
lower case
remove ‘stopwords’
remove punctuation
basics exercise nutrition
[ 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, …, 0 ]
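A bag-of-words vector for a title can be built in a few lines. In this sketch the stopword list is hypothetical, and each position is a 0/1 indicator for one vocabulary word, which is an assumption based on the vector shown on the slide (counts would work similarly).

```python
import re

STOPWORDS = {"the", "of"}  # hypothetical, just enough for the example title


def clean(title):
    """Lower-case, remove punctuation, remove stopwords."""
    words = re.sub(r"[^a-z\s]", " ", title.lower()).split()
    return [w for w in words if w not in STOPWORDS]


def build_vocabulary(titles):
    """Sorted list of every distinct word across all titles."""
    return sorted({w for title in titles for w in clean(title)})


def bag_of_words(title, vocabulary):
    """One slot per vocabulary word: 1 if the title contains it, else 0."""
    present = set(clean(title))
    return [1 if word in present else 0 for word in vocabulary]
```

With thousands of titles the vocabulary grows to thousands of words, so each vector is long and mostly zeros.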
33. DIMENSIONALITY REDUCTION
Use deep autoencoder to reduce # features (~2,700 words!)
Input: The Basics of Exercise Nutrition
Encoder: 2,700 words → 500 hidden features → 250 H.F. → 125 H.F. → 50
Decoder: 50 → 125 H.F. → 250 H.F. → 500 hidden features → 2,700 words
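To show where the 50 deep features come from, the sketch below pushes one bag-of-words vector through the encoder half of the layer sizes on this slide. It is not H2O code and it does no training: the weights are random, and the tanh activation and helper names are assumptions. It only illustrates how the vector shrinks from 2,700 values to 50.

```python
import math
import random

# Encoder half of the slide's layer sizes; the decoder mirrors it back to 2,700.
ENCODER_SIZES = [2700, 500, 250, 125, 50]


def init_weights(sizes, seed=0):
    """Random, UNTRAINED weights: one (n_out x n_in) matrix per layer."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]
        for n_in, n_out in zip(sizes, sizes[1:])
    ]


def encode(x, weights):
    """Apply each layer in turn (weighted sum, then tanh) and return the last output."""
    for layer in weights:
        x = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in layer]
    return x
```

Calling `encode(vector, init_weights(ENCODER_SIZES))` on a 2,700-long vector returns 50 numbers. In the real workflow the weights would come from training the full encoder + decoder to reconstruct its own input.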
34. K-MEANS CLUSTERING
For each article: Extract ‘last’ layer of autoencoder (50 deep features)
The Basics of Exercise Nutrition → 50 ‘deep features’
DF1          DF2          DF3           DF4          DF5           DF6
-0.09330833  0.167881429  -0.234307408  0.247723639  -0.067700267  -0.094107866
K-Means Clustering
Inputs: Extracted 50 deep features for each cycling-related article
K = 50 clusters after grid-search of values
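For readers who have not seen k-means before, here is a small pure-Python sketch of the basic algorithm (Lloyd's iterations). It is not H2O's implementation; the function names, the random initialization, and the fixed iteration count are choices made for this illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda c: sq_dist(point, centroids[c]))


def kmeans(points, k, iterations=20, seed=0):
    """Return (labels, centroids) after a fixed number of Lloyd iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # start from k random points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assignment step
            clusters[nearest(p, centroids)].append(p)
        for c, members in enumerate(clusters): # update step
            if members:                        # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    labels = [nearest(p, centroids) for p in points]
    return labels, centroids
```

In the workflow above, `points` would be the 50 deep features per article and `k` would be 50; titles that share a label are the ones inspected together.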
35. RESULT: CYCLING + A.I.
Now we inspect the clusters!
Test Article Title:
Fluid & Carbohydrate Ingestion Improve Performance During 1 Hour of
Intense Exercise
Result:
Clustered w/ 17 other titles (out of ~5,700)
Top 5 similar titles within cluster:
Caffeine ingestion does not alter performance during a 100-km cycling time-trial performance
Immuno-endocrine response to cycling following ingestion of caffeine and carbohydrate
Metabolism and performance following carbohydrate ingestion late in exercise
Increases in cycling performance in response to caffeine ingestion are repeatable
Fluid ingestion does not influence intense 1-h exercise performance in a mild environment
36. HOW TO GET FASTER?
Test Article Title:
Muscle Coordination is Key to Power Output & Mechanical Efficiency of
Limb Movements
Result:
Clustered w/ 29 other titles (out of ~5,700)
Top 5 similar titles within cluster:
Muscle fibre type efficiency and mechanical optima affect freely chosen pedal rate during cycling.
Standard mechanical energy analyses do not correlate with muscle work in cycling.
The influence of body position on leg kinematics and muscle recruitment during cycling.
Influence of repeated sprint training on pulmonary O2 uptake and muscle deoxygenation kinetics in humans
Influence of pedaling rate on muscle mechanical energy in low power recumbent pedaling
using forward dynamic simulations