H2O for non-data scientists Meetup 07.22.15 with Jessica Lanford.
Video: https://livestream.com/h2oai/events/4186354
Example Flows: https://github.com/h2oai/h2o-3/tree/master/h2o-docs/src/product/flow/packs/examples
2. I’m no data scientist…
(tech writer by trade)
…but I can use H2O
(and so can you!)
3. What is H2O?
• “Scalable machine learning for
smarter applications”
• Let’s break that down…
• Scalable = as large (or small)
as you like
• Machine learning = let the
computers do the math (so
you don’t have to)
• Smarter applications = learn
from your data
4. The H2O Team
• Sales/Marketing
• Engineering/
Development
• Operations
• Advisors
5. Why use data science?
• Learn from your data to improve your business
• Improve efficiency
• Identify, predict, & prevent supply chain gaps
• Lifehacks
• find better bordeaux
• find better deals on shoes
• improve your chances in your office sports pool
• tonight we’ll learn how to simplify your flight with GLM!
6. First, let’s download our data:
1. Go to data.h2o.ai
2. Click the airlines/allyears2k.csv dataset
and save the file.
This dataset represents a subset
of flight arrival/departure info for
all commercial flights within the
US (Oct. 1987-Apr. 2008).
For more accurate results, please
try on your own with the larger
dataset at
https://s3.amazonaws.com/h2o-airlines-unpacked/airlines_all.05p.csv
7. Now, Let’s Download H2O
1) Go to h2o.ai/download.
2) Click the Simons link.
3) Click the Download H2O button at
the top of the screen.
4) After the file downloads, open
Terminal (OS X).
5) Type the following (OS X):
cd ~/Downloads
unzip h2o-3.0.1.2.zip
cd h2o-3.0.1.2
java -jar h2o.jar
6) In your browser, enter localhost:54321
You are now running H2O!
Windows Users
1) Go to Downloads folder
2) Unzip h2o-3.0.1.2.zip
3) Double-click the h2o
executable jar file.
4) In browser, enter
localhost:54321
Make sure you have Java installed first!
https://java.com/en/download/
8. Flow Web UI
Assist Me
Flow Pro Tips:
• To run Flow strictly as a GUI, click
Flow > Toggle All Cell Inputs
• View your saved flows by clicking
the Flows tab in the Help sidebar
• Use the keyboard shortcuts (Help
> Keyboard Shortcuts)
9. Importing Data
1. Click the importFiles link under
Assistance.
2. In the Search field, start typing the
location where you saved the
airlines dataset.
Use the exact file path (e.g.,
Users\Home\Downloads\allyears2k.csv for Windows,
/Users/<LoginName>/Downloads for
OS X)
3. Click to select the airlines dataset,
then press the Enter key.
4. Click the Add all link.
5. Click the Import button.
• Flow Pro Tip: Find any file available
in H2O using Data > List All Frames
10. Parsing Data
• Why parse?
• Can customize data before
creating model
• Confirm data transferred correctly
• Feeds in data in H2O-friendly
format
1. Click the Parse these files…
button.
2. Review the data in the Edit Column
Names and Types section.
3. Click the Parse button.
11. Parse Setup
• Sources: Original file
• Destination key: H2O-friendly
file name
• Parser: File type (CSV, XLS,
XLSX, etc.)
• Separator: what separates the
entries (comma, semicolon,
space, etc.)
• Column headers: Is 1st row
data or names?
Flow Pro Tip: Save actions and paste them in other sections of
your workflow using the Clips feature (the paperclip icon)
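Flow guesses the separator and header settings for you during parse setup. To see what those two settings mean, here is a minimal sketch using Python's standard csv module. This is an illustration only, not how H2O's parser works internally, and the sample rows are invented, not taken from the airlines file.

```python
import csv

# Invented sample: a header row plus three data rows, separated by semicolons.
sample = (
    "Year;Month;Origin;Distance\n"
    "1987;10;SAN;447\n"
    "1987;10;SFO;447\n"
    "1988;1;BUR;358\n"
)

sniffer = csv.Sniffer()

# Separator: which character splits each row into entries?
dialect = sniffer.sniff(sample, delimiters=",;\t ")
separator = dialect.delimiter

# Column headers: does the first row look like names rather than data?
first_row_is_names = sniffer.has_header(sample)

print("separator:", repr(separator))
print("first row is column names:", first_row_is_names)
```

If either guess is wrong for your own file, that is the moment to change the matching setting on the parse setup screen before clicking Parse.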
12. What type of model should I use?
• Classification (what category?)
• Binomial (2): Naive Bayes, GLM (logistic
regression), DRF, Deep Learning
- airlines: delay/no delay
• Multinomial (2+): Deep Learning, GBM, DRF
- prediction; identify handwritten numbers
• Regression (what value?)
- how much supply to meet demand?
• CoxPH
• GLM
• Dimensionality reduction (many —> few)
- narrow # of features to find important ones
• DL Autoencoder
• PCA
• Clustering (what’s similar?)
- tracking customer behavior; use
when you don’t know what the response
will be
• K-Means
Flow Pro Tip: Access each model
type from the Model menu
Data contains values or categories?
• Values: GBM, GLM, DRF, DL
• Categories: How many categories?
• 2 only: NB, GBM, GLM, DRF, DL
• 2+: GBM, DRF, DL
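The decision chart above can also be read as a small lookup. The sketch below is one reading of the chart, with a hypothetical helper name (suggest_models) that is not part of H2O; the grouping of algorithms under each branch is inferred from the slide layout and the model list earlier on this slide.

```python
def suggest_models(response_kind, n_categories=None):
    """Return the algorithms the chart lists for a given kind of response.

    response_kind: "values" (predict a number) or "categories" (predict a label).
    n_categories:  number of distinct categories, needed only for "categories".
    """
    if response_kind == "values":
        return ["GBM", "GLM", "DRF", "DL"]
    if response_kind == "categories":
        if n_categories is None or n_categories < 2:
            raise ValueError("a categorical response needs at least 2 categories")
        if n_categories == 2:
            return ["NB", "GBM", "GLM", "DRF", "DL"]
        return ["GBM", "DRF", "DL"]
    raise ValueError("response_kind must be 'values' or 'categories'")


# The airlines question (delayed / not delayed) has exactly 2 categories.
airline_choice = suggest_models("categories", 2)
print(airline_choice)
```

GLM appears in the two-category list, which is why it suits the delay/no delay question used in this session.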
13. What is GLM?
With some problems, we
don’t know what the
answer is.
But with GLM, you
already know which
outcome you want to
predict; the model helps
you figure out which
factors lead to it.
• The dependent variable is the
response (the result we want).
For our dataset, this is
IsDepDelayed.
• The independent variables are
the predictors: the factors that
influence the results. For our
dataset, these include Origin,
Dest, Distance…
• The weights represent how
much each independent
variable (factor) influences the
dependent variable (result).
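To make the idea of weights concrete, here is a minimal sketch of how a binomial GLM (logistic regression) combines predictors and weights into a probability. The feature names, weights, and intercept are invented for illustration; they are not output from H2O or from the airlines dataset.

```python
import math


def predict_delay_probability(weights, intercept, features):
    """Weighted sum of the predictors, squashed into the range 0..1."""
    score = intercept + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical weights: a positive weight pushes toward "delayed",
# a negative weight pushes toward "not delayed".
weights = {"origin_is_busy_hub": 0.8, "distance_hundreds_of_miles": -0.05}
intercept = -0.2

# One hypothetical flight: leaves a busy hub and travels 400 miles.
flight = {"origin_is_busy_hub": 1.0, "distance_hundreds_of_miles": 4.0}

p = predict_delay_probability(weights, intercept, flight)
print(f"estimated chance of delay: {p:.2f}")
```

Building the model in Flow is the step that finds the weights; once they are known, a prediction is just this weighted sum.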
14. Building Models
1. Click the View button after parsing.
2. Click the Build Model button (below
allyears2k.hex).
3. Select Generalized Linear Model from the drop-
down list.
4. Select allyears2k.hex from the validation_frame
drop-down list.
5. Select IsDepDelayed from the Response
Column drop-down list.
6. Click the All button in the ignored_columns
section, then uncheck the following:
- Year
- Month
- DayofWeek
- UniqueCarrier
- FlightNum
- Origin
- Dest
- Distance
- IsDepDelayed
7. Select binomial from the family drop-down list.
8. Click the Build Model button at the bottom of the
page. After the model is complete, click View.
15. Interpreting Results
If your parameters were added correctly,
you should see results that look like this.
—>
So what do these results tell us?
Airports at the top of the list have the
strongest effect: flights there are more likely
to be delayed (positive coefficient) or less
likely to be delayed (negative coefficient).
Flights through Lynchburg are more likely to
be delayed than flights through Corpus Christi Intl.
Coefficient: relationship of feature to
response variable; for airlines data, positive
= more likely to be delayed, negative = less
likely to be delayed
ROC curve = plot of true positive rate against
false positive rate (here, about 50/50 due to
smaller dataset - try with the larger dataset
to see the difference!)
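Each point on a ROC curve comes from picking a cutoff and counting how many delayed flights were caught (true positive rate) against how many on-time flights were wrongly flagged (false positive rate). The sketch below shows that counting on a handful of made-up labels and scores; it is not H2O's implementation and the numbers are not from the airlines model.

```python
def roc_point(labels, scores, threshold):
    """Return (false positive rate, true positive rate) at one cutoff.

    labels: 1 = actually delayed, 0 = actually on time.
    scores: predicted probability of delay for each flight.
    """
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return fp / negatives, tp / positives


# Made-up example: six flights with their true outcome and predicted score.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

# Sweep the cutoff from strict to lenient to trace the curve.
curve = [roc_point(labels, scores, t) for t in (1.0, 0.75, 0.5, 0.0)]
for fpr, tpr in curve:
    print(f"false positive rate {fpr:.2f}, true positive rate {tpr:.2f}")
```

A curve that hugs the diagonal (true positive rate roughly equal to false positive rate) means the model is doing little better than guessing.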
16. Getting Help
• To get help, click the question mark button (last
button in row), or click Help > Assist Me
• docs.h2o.ai (Select H2O 3.0)
• FAQ
• Flow Web UI (also available in Flow in the “Help”
tab in the sidebar)
• h2ostream - https://groups.google.com/forum/#!forum/h2ostream
• learn.h2o.ai
17. I’m still having trouble…
What do I do?
• First, check the FAQ & h2ostream to see if it is already solved
• If not, get the logs:
• Admin > Download Logs
• The logs will help our support team identify the issue faster
• Then send the logs to support@h2o.ai or post them on
h2ostream (if you don’t mind sharing your data)
• Can also use Help > Report an Issue to file a JIRA ticket
18. Data Science Resources
• StackExchange - stats.stackexchange.com
• Kaggle - kaggle.com/wiki/Home
(also see Tutorials on kaggle.com)
• Quora - quora.com/Machine-Learning
• StatTrek - stattrek.com
• Data sets - docs.h2o.ai/h2oclassic/resources/publicdata.html
19. Vocabulary
• Regression: goal is to predict a numerical (continuous,
quantitative) value; input = numerical values
• Classification: goal is to predict a category for the
response; output = labels (categorical)
• Binomial: only 2 possible outcomes (yes/no)
• Multinomial: more than 2 outcomes; the model is based on
the outcomes of a specified number of independent trials
(the outcome of one trial does not affect the next), each of
which has a whole number of possible outcomes with
constant probabilities
• Response col: Dependent variable - what do you want to
measure/observe/predict?
• Predictors: Independent variables - what factors can be
manipulated/controlled?
• Validation: How well does the model fit?
• Confusion matrix: Table of predicted vs. actual classes that
shows an algorithm’s performance (usually for supervised
learning)
• Supervised vs unsupervised: in supervised learning, input
is labeled so that the algorithm can identify it and learn from it;
in unsupervised learning, input data is not labeled
• Hit ratio: number of correct predictions out of total
predictions
• Iteration: one round of model training/testing
• Standardization: Rescales variables to a common scale
(mean 0, standard deviation 1); helps prevent precision
problems
• Seed: Starting number for the random number generator that
controls sampling and initialization; reusing the same seed
makes results repeatable
• Offset: Compensates for differences in units of observation
(different populations, different geographic sizes) to make
sure outcome is proportional
• Weights: A way to specify certain outcomes as more
significant than others (for example, if you are trying to
identify incidence of disease, one "yes" can be more
meaningful than 50 "no" responses). More weight = more
importance.
• Trees: Similar to a flowchart - each decision moves you
further down the flowchart (tree). Remember the game "20
questions"? If your only choices are "animal", "vegetable",
or "mineral", and the responses are "no" for both "animal"
and "vegetable", the decision tree will tell you that the
answer is "mineral."
• Distribution: how values are spread out, expressed as
proportions rather than fixed counts (the number of red
candies in a pack is not the same every time)
• Balance classes: similar to weights; helps to give more
importance to specific responses
• Enum: Categorical variable
• Separator: what separates the entries in the dataset
(usually a comma, semicolon, etc.)
• Integer vs real: Integer = whole number (can be negative);
real = number that can have a fractional part
• Binary: See “binomial” (only 2 possibilities)
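Two of the terms above, confusion matrix and hit ratio, are easy to compute by hand. The following is a minimal sketch on made-up delay predictions (not results from H2O or the airlines dataset) showing how both come from comparing predicted labels with actual labels.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count each (actual, predicted) pair; this is the table's contents."""
    return Counter(zip(actual, predicted))


def hit_ratio(actual, predicted):
    """Share of predictions that matched the actual label."""
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return hits / len(actual)


# Made-up example: was each of six flights delayed, and what did the model say?
actual = ["YES", "YES", "NO", "NO", "YES", "NO"]
predicted = ["YES", "NO", "NO", "YES", "YES", "NO"]

matrix = confusion_matrix(actual, predicted)
ratio = hit_ratio(actual, predicted)

for (a, p), count in sorted(matrix.items()):
    print(f"actual {a:3} predicted {p:3}: {count}")
print(f"hit ratio: {ratio:.2f}")
```

The pairs where actual and predicted agree are the correct predictions; the hit ratio is their total divided by the number of flights.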