H2O for Non-Data
Scientists
An Introduction to Data Science with H2O
with Jessica Lanford and Amy Wang
I’m no data scientist…
(tech writer by trade)
…but I can use H2O
(and so can you!)
What is H2O?
• “Scalable machine learning for
smarter applications”
• Let’s break that down…
• Scalable = as large (or small)
as you like
• Machine learning = let the
computers do the math (so
you don’t have to)
• Smarter applications = learn
from your data
The H2O Team
• Sales/Marketing
• Engineering/
Development
• Operations
• Advisors
Why use data science?
• Learn from your data to improve your business
• Improve efficiency
• Identify, predict, & prevent supply chain gaps
• Lifehacks
• find better bordeaux
• find better deals on shoes
• improve your chances in your office sports pool
• tonight we’ll learn how to simplify your flight with GLM!
First, let’s download our data:
1. Go to data.h2o.ai
2. Click the airlines/
allyears2k.csv dataset
and save the file.
This dataset represents a subset
of flight arrival/departure info for
all US domestic commercial
flights (Oct. 1987-Apr. 2008).
For more accurate results, please
try on your own with the larger
dataset at
https://s3.amazonaws.com/h2o-airlines-unpacked/airlines_all.05p.csv
Now, Let’s Download H2O
1) Go to h2o.ai/download.
2) Click the Simons link.
3) Click the Download H2O button at
the top of the screen.
4) After the file downloads, open
Terminal (OS X).
5) Type the following (OS X):
cd ~/Downloads
unzip h2o-3.0.1.2.zip
cd h2o-3.0.1.2
java -jar h2o.jar
6) In your browser, enter localhost:54321
You are now running H2O!
Windows Users
1) Go to Downloads folder
2) Unzip h2o-3.0.1.2.zip
3) Double-click the h2o
executable jar file.
4) In browser, enter
localhost:54321
Make sure you have Java installed first! 

https://java.com/en/download/
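If the browser shows nothing at localhost:54321, two quick checks can narrow things down: is Java on your PATH, and is anything listening on the port? The sketch below uses only the Python standard library. The function names are made up for illustration, and an open port only tells you that something is listening there, not that it is H2O.

```python
import shutil
import socket


def java_on_path() -> bool:
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None


def port_is_open(host: str = "localhost", port: int = 54321, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


if __name__ == "__main__":
    print("Java found:", java_on_path())
    print("Something listening on localhost:54321:", port_is_open())
```

If the first check fails, install Java; if only the second fails, the `java -jar h2o.jar` command is probably not running.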
Flow Web UI
Assist Me
Flow Pro Tips:
• To run Flow strictly as a GUI, click
Flow > Toggle All Cell Inputs
• View your saved flows by clicking
the Flows tab in the Help sidebar
• Use the keyboard shortcuts (Help
> Keyboard Shortcuts)
Importing Data
1. Click the importFiles link under
Assistance.
2. In the Search field, start typing the
location where you saved the
airlines dataset. 

Use the exact file path (e.g.,
Users\Home\Downloads\allyears2k.csv
for Windows,
/Users/<LoginName>/Downloads for
OS X)
3. Click to select the airlines dataset,
then press the Enter key.
4. Click the Add all link.
5. Click the Import button.
• Flow Pro Tip: Find any file available
in H2O using Data > List All Frames
Parsing Data
• Why parse?
• Can customize data before
creating model
• Confirm data transferred correctly
• Feeds in data in H2O-friendly
format
1. Click the Parse these files…
button.
2. Review the data in the Edit Column
Names and Types section.
3. Click the Parse button.
Parse Setup
• Sources: Original file
• Destination key: H2O-friendly
file name
• Parser: File type (CSV, XLS,
XLSX, etc.)
• Separator: what separates the
entries (comma, semicolon,
space, etc.)
• Column headers: Is 1st row
data or names?
Flow Pro Tip: Save actions and paste them in other sections of
your workflow using the Clips feature (the paperclip icon)
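To get a feel for what the Separator and Column headers settings mean, here is a deliberately simplified Python sketch. It is not H2O's parser: the two helper functions are invented for illustration, the sample rows are made up, and the guessing rules (most frequent candidate character, "no numbers in row 1") are assumptions that real files can easily break.

```python
def guess_separator(first_line, candidates=",;\t "):
    """Pick the candidate character that occurs most often in the first line."""
    return max(candidates, key=first_line.count)


def is_number(text):
    """Return True if the text can be read as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def first_row_is_header(lines, separator):
    """Guess that row 1 holds names if it has no numbers but row 2 has some."""
    first = lines[0].split(separator)
    second = lines[1].split(separator)
    return not any(is_number(f) for f in first) and any(is_number(f) for f in second)


# Made-up sample in the style of the airlines file.
sample = [
    "Year,Month,Origin,Dest,Distance",
    "1987,10,SAN,SFO,447",
]
sep = guess_separator(sample[0])
print(repr(sep), first_row_is_header(sample, sep))  # prints ',' True
```

This is why the Parse Setup screen is worth a glance: if the guess is wrong, you can correct the separator or header setting before parsing.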
What type of model should I use?
• Classification (what category?)
• Binomial (2): Naive Bayes, GLM (logistic
regression), DRF, Deep Learning
- airlines: delay/no delay
• Multinomial (2+): Deep Learning, GBM, DRF
- prediction; identify handwritten numbers
• Regression (what value?)
- how much supply to meet demand?
• CoxPH
• GLM
• Dimensionality reduction (many -> few)
- narrow # of features to find important ones
• DL Autoencoder
• PCA
• Clustering (what’s similar?)
- tracking customer behavior; use
when you don’t know what the response
will be
• K-Means
Flow Pro Tip: Access each model
type from the Model menu
Data contains values or categories?
• Values: GBM, GLM, DRF, DL
• Categories: how many categories?
• 2 only: NB, GBM, GLM, DRF, DL
• 2+: GBM, DRF, DL
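The "values or categories?" question can be written down as a tiny lookup. This is only a sketch: the function names are hypothetical, and the candidate lists are copied from the bullet list on this slide, covering only the three cases that have a response column (binomial, multinomial, regression).

```python
# Candidate algorithms per problem type, copied from the bullets on this slide.
CANDIDATES = {
    "binomial": ["Naive Bayes", "GLM", "DRF", "Deep Learning"],
    "multinomial": ["Deep Learning", "GBM", "DRF"],
    "regression": ["CoxPH", "GLM"],
}


def problem_type(response_is_category, n_categories=0):
    """Values -> regression; exactly 2 categories -> binomial; more -> multinomial."""
    if not response_is_category:
        return "regression"
    return "binomial" if n_categories == 2 else "multinomial"


def suggest_models(response_is_category, n_categories=0):
    """Look up the candidate algorithms for the detected problem type."""
    return CANDIDATES[problem_type(response_is_category, n_categories)]


# Airlines example: the response is delay / no delay, so 2 categories.
print(suggest_models(True, 2))
```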
What is GLM?
With some problems, we
don’t know what the
answer is.
But with GLM, you
already know what
answer you want -
the model helps you
figure out how to get
there.
• The dependent variable is the
response (the result we want).
For our dataset, this is
IsDepDelayed.
• The independent variables are
the predictors: the factors that
influence the results. For our
dataset, these include Origin,
Dest, Distance…
• The weights represent how
much each independent
variable (factor) influences the
dependent variable (result).
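Here is how weights turn predictors into a prediction, as a minimal Python sketch of logistic regression (the binomial flavor of GLM). The intercept and weights are invented for illustration; a real model learns them from the data. The feature names such as "Origin=SFO" are also assumptions about how a category might be encoded as 0 or 1.

```python
import math

# Hypothetical numbers, made up for this example only.
INTERCEPT = -0.5
WEIGHTS = {"Distance": 0.001, "Origin=SFO": 0.4, "Dest=ORD": 0.3}


def delay_probability(predictors):
    """Weighted sum of the predictors, squashed into the range 0..1."""
    score = INTERCEPT + sum(WEIGHTS[name] * value for name, value in predictors.items())
    return 1.0 / (1.0 + math.exp(-score))


# A 500-mile flight leaving SFO that does not land at ORD.
flight = {"Distance": 500, "Origin=SFO": 1, "Dest=ORD": 0}
print(round(delay_probability(flight), 3))  # prints 0.599
```

A positive weight pushes the probability of a delay up and a negative weight pushes it down, which is the same way the coefficients in the model results are read.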
Building Models
1. Click the View button after parsing.
2. Click the Build Model button (below
allyears2k.hex).
3. Select Generalized Linear Model from the drop-down
list.
4. Select allyears2k.hex from the validation_frame
drop-down list.
5. Select IsDepDelayed from the Response
Column drop-down list.
6. Click the All button in the ignored_columns
section, then uncheck the following:
- Year
- Month
- DayofWeek
- UniqueCarrier
- FlightNum
- Origin
- Dest
- Distance
- IsDepDelayed
7. Select binomial from the family drop-down list.
8. Click the Build Model button at the bottom of the
page. After the model is complete, click View.
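Steps 4 to 7 boil down to a handful of settings, and "click All, then uncheck" is just "ignore everything except these columns". The Python sketch below writes that out. It is not an H2O API call: the dictionary keys simply mirror the labels in the form, and ALL_COLUMNS is a shortened, assumed column list, so check your parsed frame for the real names.

```python
# Shortened, assumed column list for illustration only.
ALL_COLUMNS = [
    "Year", "Month", "DayofMonth", "DayofWeek", "DepTime", "UniqueCarrier",
    "FlightNum", "Origin", "Dest", "Distance", "ArrDelay", "DepDelay",
    "IsDepDelayed",
]

# The columns left unchecked in step 6.
KEEP = [
    "Year", "Month", "DayofWeek", "UniqueCarrier", "FlightNum",
    "Origin", "Dest", "Distance", "IsDepDelayed",
]

# Everything that is not kept stays checked, meaning it is ignored.
ignored = [column for column in ALL_COLUMNS if column not in KEEP]

settings = {
    "validation_frame": "allyears2k.hex",
    "response_column": "IsDepDelayed",
    "ignored_columns": ignored,
    "family": "binomial",
}

for name, value in settings.items():
    print(f"{name}: {value}")
```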
Interpreting Results
If your parameters were added correctly,
you should see a list of coefficients and an ROC curve
(shown in a screenshot on the original slide).
So what do these results tell us?
Airports at the top of the list are more likely
to be delayed (if they have a positive
coefficient) or less likely to be delayed (with
a negative coefficient). Lynchburg is more
likely to be delayed than Corpus Christi Intl.
Coefficient: relationship of feature to
response variable; for airlines data, positive
= more likely to be delayed, negative = less
likely to be delayed
ROC curve = plot of the true positive rate against
the false positive rate at each threshold (here, about
50/50, i.e. close to chance, due to smaller dataset - try
with the larger dataset to see the difference!)
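To make the ROC curve less mysterious, here is a small Python sketch that computes a few of its points by hand. The labels and scores are made up, and `roc_points` is a hypothetical helper, not something H2O provides under that name.

```python
def roc_points(labels, scores, thresholds):
    """Return (false positive rate, true positive rate) for each threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        # Predict "delayed" whenever the score reaches the threshold.
        predicted = [s >= t for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Made-up data: 1 = delayed, 0 = not delayed, with the model's scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

for fpr, tpr in roc_points(labels, scores, [0.0, 0.5, 1.0]):
    print(f"false positive rate {fpr:.2f}, true positive rate {tpr:.2f}")
```

A curve that hugs the diagonal (true positive rate about equal to false positive rate) means the model is doing little better than a coin flip; a curve that bends toward the top-left corner means it separates delayed from on-time flights well.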
Getting Help
• To get help, click the question mark button (last
button in row), or click Help > Assist Me
• docs.h2o.ai (Select H2O 3.0)
• FAQ
• Flow Web UI (also available in Flow in the “Help”
tab in the sidebar)
• h2ostream -
https://groups.google.com/forum/#!forum/h2ostream
• learn.h2o.ai
I’m still having trouble…
What do I do?
• First, check the FAQ & h2ostream to see if it is already solved
• If not, get the logs:
• Admin > Download Logs
• The logs will help our support team identify the issue faster
• Then send the logs to support@h2o.ai or post them on
h2ostream (if you don’t mind sharing your data)
• Can also use Help > Report an Issue to file a JIRA ticket
Data Science Resources
• StackExchange - stats.stackexchange.com
• Kaggle - kaggle.com/wiki/Home
(also see Tutorials on kaggle.com)
• Quora - quora.com/Machine-Learning
• StatTrek - stattrek.com
• Data sets -
docs.h2o.ai/h2oclassic/resources/publicdata.html
Vocabulary
• Regression: goal is to predict a numerical value; output =
continuous values; input = numerical values; model types:
(quantitative)
• Classification: goal is to predict the category of the response;
output = labels; input = ? model types: (categorical)
• Binomial: only 2 possible outcomes (yes/no)
• Multinomial: more than 2 possible outcomes; the model assumes
a specified number of independent trials (the outcome of one
trial does not affect the next), each with a whole number of
possible outcomes whose probabilities stay constant
• Response col: Dependent variable - what do you want to
measure/observe/predict?
• Predictors: Independent variables - what factors can be
manipulated/controlled?
• Validation: How well does the model fit?
• Confusion matrix: Table that depicts an algorithm’s
performance by comparing predicted and actual classes
(usually supervised learning)
• Supervised vs unsupervised: in supervised learning, input
is labeled so that the algorithm can identify it and learn from it;
in unsupervised learning, input data is not labeled
• Hit ratio: number of correct predictions divided by the total
number of predictions
• Iteration: one round of model training/testing
• Standardization: Converts variables to standardized
variables (mean 0, standard deviation 1); helps prevent
precision problems
• Seed: Number used to start the random number generator that
controls sampling and initialization; reusing the same seed
makes results repeatable
• Offset: Compensates for differences in units of observation
(different populations, different geographic sizes) to make
sure outcome is proportional
• Weights: A way to specify certain outcomes as more
significant than others (for example, if you are trying to
identify incidence of disease, one "yes" can be more
meaningful than 50 "no" responses). More weight = more
importance.
• Trees: Similar to a flowchart - each decision moves you
further down the flowchart (tree). Remember the game "20
questions"? If your only choices are "animal", "vegetable",
or "mineral", and the responses are "no" for both "animal"
and "vegetable", the decision tree will tell you that the
answer is "mineral."
• Distribution: proportion, not frequency (number of red
candies in a pack not the same every time)
• Balance classes: similar to weights; helps to give more
importance to specific responses
• Enum: Categorical variable
• Separator: what separates the entries in the dataset
(usually a comma, semicolon, etc.)
• Integer vs real: Integer = whole number (can be negative);
real = number that can have a fractional part
• Binary: See “binomial” (only 2 possibilities)
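Two of the terms above, confusion matrix and hit ratio, are easiest to understand by counting. The Python sketch below does that for six made-up predictions; the function names and the YES/NO data are invented for illustration.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair of labels occurs."""
    return Counter(zip(actual, predicted))


def hit_ratio(actual, predicted):
    """Correct predictions divided by total predictions."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)


# Made-up results: was the flight delayed, and what did the model say?
actual = ["YES", "YES", "NO", "NO", "YES", "NO"]
predicted = ["YES", "NO", "NO", "YES", "YES", "NO"]

matrix = confusion_matrix(actual, predicted)
for (a, p), count in sorted(matrix.items()):
    print(f"actual {a}, predicted {p}: {count}")
print("hit ratio:", round(hit_ratio(actual, predicted), 2))
```

Here the model gets 4 of 6 predictions right: 2 delayed flights and 2 on-time flights are classified correctly, and 1 of each is missed.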

H2O for non-data scientists Meetup 07.22.15
