Workshop at DataEngConf 2016, on April 7-8 2016, at Galvanize, 44 Tehama Street, San Francisco, CA.
Demo and labs for workshop are at https://github.com/asimjalis/data-science-workshop
8. WHAT IS THIS WORKSHOP ABOUT?
Using Data Science and Machine Learning
Building classifiers using Python and scikit-learn
By the end of the workshop you will be able to build Machine Learning classification systems
13. OUTLINE
What is Data Science and Machine Learning?
What is scikit-learn? Why is it so helpful?
What kinds of problems can we solve with this stuff?
16. DATA SCIENCE AND MACHINE LEARNING
Data Science = Machine Learning + Statistics + Domain Expertise
17. STATISTICS AND MACHINE LEARNING
Statistics asks whether milk causes heart disease
Machine Learning predicts your death
Focused on results and actionable predictions
Used in production software systems
18. HISTORY OF MACHINE LEARNING
Input    Features  Algorithm  Output
Machine  Human     Human      Machine
Machine  Human     Machine    Machine
Machine  Machine   Machine    Machine
19. WHAT IS MACHINE LEARNING?
Inputs: vectors, or points in a high-dimensional space
Outputs: either binary vectors or continuous vectors
Machine Learning finds the relationship between them using statistical techniques
23. CLASSIFICATION EXAMPLE: EMAIL SPAM DETECTION
Start with a large collection of emails, labeled spam/not-spam
Convert email text into vectors of 0s and 1s: 1 if a word occurs, 0 if it does not
These are called inputs or features
Split data set into training set (70%) and test set (30%)
Use algorithm like Random Forests to build model
Evaluate model by running it on test set and capturing success rate
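A minimal sketch of this pipeline in scikit-learn; the emails and labels below are invented toy data:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier
# Toy corpus; a real data set would have thousands of labeled emails
emails = ["win a free prize now", "meeting moved to noon",
          "free money click now", "lunch tomorrow?"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam
# Convert text to 0/1 word-occurrence vectors
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(emails)
# 70/30 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3)
# Train a Random Forest and capture the success rate on the test set
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
model.score(X_test, y_test)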
25. CHOOSING ALGORITHM
Evaluate different models on data
Look at the relative success rates
Use rules of thumb: some algorithms work better on some kinds of data
26. CLASSIFICATION EXAMPLES
Is this tumor benign or cancerous?
Is this lead profitable or not?
Who will win the presidential election?
27. CLASSIFICATION: POP QUIZ
Is classification supervised or unsupervised learning?
Supervised because you have to label the data.
28. CLUSTERING EXAMPLE: LOCATE CELL PHONE TOWERS
Start with GPS coordinates of all cell phone users
Represent data as vectors
Locate towers in biggest clusters
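A minimal sketch with k-means clustering; the coordinates below are invented:
import numpy as np
from sklearn.cluster import KMeans
# Invented (latitude, longitude) points for phone users
coords = np.array([
    [37.77, -122.42], [37.78, -122.41],   # one dense area
    [47.60, -122.33], [47.61, -122.34],   # another dense area
])
kmeans = KMeans(n_clusters=2).fit(coords)
kmeans.cluster_centers_  # candidate tower locations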
29. CLUSTERING EXAMPLE: T-SHIRTS
What size should a t-shirt be?
Everyone’s real t-shirt size is different
Lay out all sizes and cluster
Target large clusters with XS, S, M, L, XL
30. CLUSTERING: POP QUIZ
Is clustering supervised or unsupervised?
Unsupervised because no labeling is required
34. REGRESSION EXAMPLES
How many units of product will sell next month?
What will a student score on the SAT?
What is the market price of this house?
How long before this engine needs repair?
35. REGRESSION EXAMPLE: AIRCRAFT PART FAILURE
Cessna collects data from airplane sensors
Predict when part needs to be replaced
Ship part to customer’s service airport
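A hedged regression sketch with ordinary linear regression; the sensor features and hours-to-failure target below are invented for illustration:
import numpy as np
from sklearn.linear_model import LinearRegression
# Invented features: [engine hours, vibration level]
X = np.array([[100, 0.1], [500, 0.4], [900, 0.9]])
# Invented target: hours until the part fails
y = np.array([800, 400, 50])
model = LinearRegression().fit(X, y)
model.predict([[700, 0.6]])  # predicted hours before repair is needed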
37. ANOMALY DETECTION EXAMPLE: CREDIT CARD FRAUD
Train model on good transactions
Anomalous activity indicates fraud
Can pass the transaction to a human for investigation
38. ANOMALY DETECTION EXAMPLE: NETWORK INTRUSION
Train model on network login activity
Anomalous activity indicates threat
Can initiate alerts and lockdown procedures
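A minimal sketch using scikit-learn's one-class SVM, one of several possible anomaly detectors; the transaction amounts are invented:
import numpy as np
from sklearn.svm import OneClassSVM
# Train only on normal transaction amounts
normal = np.array([[20.0], [25.0], [22.0], [30.0]])
model = OneClassSVM(nu=0.1).fit(normal)
# 1 = looks normal, -1 = anomaly (flag for human investigation)
model.predict([[24.0], [5000.0]])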
39. ANOMALY DETECTION: POP QUIZ
Is anomaly detection supervised or unsupervised?
Unsupervised because we only train on normal data
42. IPYTHON CHEATSHEET
Command         Meaning
ipython         Start IPython
/help np        Help on module
/help np.array  Help on function
/help "hello"   Help on object
%cpaste         Paste blob of text
%timeit         Time function call
%load FILE      Load file as source
quit            Exit
54. INDEXING
# List
a = [0,1,2,3,4]
# First element
a[0]
# Second element
a[1]
# Last element
a[-1]
55. SLICING
# Start at index 1, stop before 5, step 2
a[1:5:2]
# Start at index 1, stop before 3, step 1
a[1:3]
# Start at index 1, stop at end, step 1
a[1:]
# Start at index 0, stop before 5, step 2
a[:5:2]
56. SLICING: POP QUIZ
What does a[::] give you?
Defaults for everything: start at 0, to end, step 1.
58. DATA FRAMES FROM CSV
# Use CSV file header for column names
df = pd.read_csv('file.csv',header=0)
# CSV file has no header
df = pd.read_csv('file.csv',header=None)
59. DATA FRAME
import pandas as pd
df = pd.DataFrame(
    columns=['City','State','Sales'],
    data=[
        ['SFO','CA',300],
        ['SEA','WA',200],
        ['PDX','OR',150],
    ])
City State Sales
0 SFO CA 300
1 SEA WA 200
2 PDX OR 150
64. SELECTING ROWS + COLUMNS WITH SLICES
# First 2 rows and all columns
df.iloc[0:2,::]
City State Sales
0 SFO CA 300
1 SEA WA 200
65. SELECTING ROWS + COLUMNS WITH SLICES
# First 2 rows and all but first column
df.iloc[0:2,1::]
State Sales
0 CA 300
1 WA 200
66. ADD NEW COLUMN
# New tax column
df['Tax'] = df.Sales * 0.085
df
City State Sales Tax
0 SFO CA 300 25.50
1 SEA WA 200 17.00
2 PDX OR 150 12.75
67. ADD NEW BOOLEAN COLUMN
# New boolean column
df['HighSales'] = (df.Sales >= 200)
df
City State Sales Tax HighSales
0 SFO CA 300 25.50 True
1 SEA WA 200 17.00 True
2 PDX OR 150 12.75 False
68. ADD NEW INTEGER COLUMN
# New integer 0/1 column
df['HighSales'] = (df.Sales >= 200).astype('int')
df
City State Sales Tax HighSales
0 SFO CA 300 25.50 1
1 SEA WA 200 17.00 1
2 PDX OR 150 12.75 0
69. APPLYING FUNCTION
# Arbitrary function
df['State2'] = df.State.apply(lambda x: x.lower())
df
City State Sales Tax HighSales State2
0 SFO CA 300 25.50 1 ca
1 SEA WA 200 17.00 1 wa
2 PDX OR 150 12.75 0 or
70. APPLYING FUNCTION ACROSS AXIS
# Calculate mean across all rows, columns 2,3,4
df.iloc[:,2:5].apply(lambda x: x.mean(),axis=0)
Sales 216.666667
Tax 18.416667
HighSales 0.666667
dtype: float64
71. VECTORIZED OPERATIONS
Which one is faster?
# Vectorized operation
%timeit df * 2
160 µs per loop
# Loop over rows one at a time
%timeit for i in range(len(df)): df.iloc[i,0] * 2
1.72 s per loop
77. PLOT FEATURES
Plot survival by port:
C = Cherbourg, France
Q = Queenstown, Ireland
S = Southampton, UK
# Load data
df = pd.read_csv('data/titanic.csv',header=0)
# Plot by port
df.groupby('Embarked')[['Survived']].mean()
df.groupby('Embarked')[['Survived']].mean().plot(kind='bar')
81. MISSING VALUES
from numpy import NaN
df = pd.DataFrame(
    columns=['State','Sales'],
    data=[
        ['CA',300],
        ['WA',NaN],
        ['OR',150],
    ])
State Sales
0 CA 300
1 WA NaN
2 OR 150
82. MISSING VALUES: BACK FILL
df
State Sales
0 CA 300
1 WA NaN
2 OR 150
df.fillna(method='bfill',axis=0)
State Sales
0 CA 300
1 WA 150
2 OR 150
83. FORWARD FILL: USE PREVIOUS VALID ROW’S VALUE
df
State Sales
0 CA 300
1 WA NaN
2 OR 150
df.fillna(method='ffill',axis=0)
State Sales
0 CA 300
1 WA 300
2 OR 150
85. CATEGORICAL DATA
How can we handle a column that has data like CA, WA, OR, etc.?
Replace the categorical feature with 0/1 columns
For example, replace a state column containing CA, WA, OR with a binary column for each state
86. DUMMY VARIABLES
df = pd.DataFrame(
    columns=['State','Sales'],
    data=[
        ['CA',300],
        ['WA',200],
        ['OR',150],
    ])
State Sales
0 CA 300
1 WA 200
2 OR 150
87. DUMMY VARIABLES
df
State Sales
0 CA 300
1 WA 200
2 OR 150
pd.get_dummies(df)
Sales State_CA State_OR State_WA
0 300 1.0 0.0 0.0
1 200 0.0 0.0 1.0
2 150 0.0 1.0 0.0
88. CENTERING AND SCALING DATA
Why center and scale data?
Features with large values can dominate the model
Centering subtracts the mean, so each feature is centered at zero
Scaling divides by the standard deviation
89. CENTERING AND SCALING DATA
from sklearn import preprocessing
import numpy as np
X = np.array([[1.0,-1.0, 2.0],
              [2.0, 0.0, 0.0],
              [3.0, 1.0, 2.0],
              [0.0, 1.0,-1.0]])
scaler = preprocessing.StandardScaler().fit(X)
X_scaled = scaler.transform(X)
X_scaled
X_scaled.mean(axis=0)
X_scaled.std(axis=0)
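The same fitted scaler can then be applied to new data (a hypothetical new row below), so it is centered and scaled with the training set's statistics:
# Reuse the training set's means and standard deviations on new data
X_new = np.array([[1.0, 0.0, 0.0]])
scaler.transform(X_new)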
96. BASIC IDEA
Collection of decision trees
Each decision tree looks only at some features
For the final decision the trees vote
Example of an ensemble method
97. DECISION TREES
Decision trees play 20 questions with your data
They find feature questions that split up the final output
They choose splits that produce the purest branches
98. RANDOM FORESTS
Collection of decision trees
Each tree sees a random sample of the data
Each split point is based on a random subset of the total features
To classify new data the trees vote
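A minimal sketch on toy data:
from sklearn.ensemble import RandomForestClassifier
X = [[0, 0], [1, 1], [1, 0], [0, 1]]  # toy features
y = [0, 1, 1, 0]                      # toy labels
# 10 trees; each trains on a bootstrap sample of rows and
# considers a random subset of features at each split
model = RandomForestClassifier(n_estimators=10).fit(X, y)
model.predict([[1, 1]])  # the trees vote; the majority wins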
99. RANDOM FORESTS: POP QUIZ
Which one takes more time: training random forests or classifying a new data point with trained random forests?
Training takes more time
This is when the trees are constructed
Running/evaluation is fast
You just walk down the trees
101. PROBLEM OF OVERFITTING
Model can get attached to sample data
Learns specific patterns instead of general patterns
102. OVERFITTING
Which one of these is overfitting, underfitting, just right?
1. Underfitting
2. Just right
3. Overfitting
103. DETECTING OVERFITTING
How do you know you are overfitting?
Model does great on training set and terrible on test set
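A quick check, assuming a trained model and the X_train/X_test split from the spam sketch earlier:
# Compare scores on seen vs. unseen data
model.score(X_train, y_train)  # score on data the model has seen
model.score(X_test, y_test)    # score on unseen data; a large gap signals overfitting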
104. RANDOM FORESTS AND OVERFITTING
Random Forests are not prone to overfitting. Why?
Random Forests are an ensemble method
Each tree only sees and captures part of the data
Tends to pick up general rather than specific patterns
106. PROBLEM
How can we find out how good our models are?
Is it enough for models to do well on the training set?
How can we know how the model will do on new, unseen data?
107. CROSS VALIDATION
Technique to test a model
Split data into train and test subsets
Train on the train data set
Measure the model on the test data set
108. CROSS VALIDATION: POP QUIZ
Why can’t we test our models on the training set?
The model already knows the training set
It will have an unfair advantage
It has to be tested on data it has not seen before
109. K-FOLD CROSS VALIDATION
Split data into k sets
Repeat k times: train on k-1 sets, test on the kth
Model’s score is the average of the k scores
110. CROSS VALIDATION CODE
from sklearn.cross_validation import cross_val_score
# 10-fold (default 3-fold)
scores10 = cross_val_score(model, X, y, cv=10)
# See score stats
pd.Series(scores10).describe()
112. PROBLEM: OIL EXPLORATION
Drilling holes is expensive
We want to find the biggest oilfield without wasting money on duds
Where should we place our next derrick?
113. PROBLEM: MACHINE LEARNING
Testing hyperparameters is expensive
We have an N-dimensional grid of parameters
How can we quickly zero in on the best combination of hyperparameters?
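One baseline is an exhaustive grid search. A hedged sketch with scikit-learn's GridSearchCV (in sklearn.grid_search in releases of this era, sklearn.model_selection later); the parameter values and X_train/y_train are illustrative:
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# Try every combination in a small 2-dimensional grid
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [3, 5, 10],
}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=3)
search.fit(X_train, y_train)
search.best_params_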
117. RANDOM SEARCH
Randomly search the grid
60 random samples get you within the top 5% of grid search with 95% probability
This is Bergstra and Bengio’s result; see Alice Zheng’s explanation (see References)
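A sketch of the same search done randomly, with scikit-learn's RandomizedSearchCV (same module-path caveat as the grid search sketch above); the parameter ranges are illustrative:
from sklearn.grid_search import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
# Sample 60 random combinations instead of walking the whole grid
param_dist = {
    'n_estimators': list(range(10, 200)),
    'max_depth': list(range(2, 20)),
}
search = RandomizedSearchCV(RandomForestClassifier(), param_dist,
                            n_iter=60, cv=3)
search.fit(X_train, y_train)  # X_train, y_train as before
search.best_params_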
126. REFERENCES
Bayesian Optimization by Dewancker et al (http://sigopt.com)
Random Search for Hyper-Parameter Optimization by Bergstra and Bengio (http://jmlr.org)
Evaluating Machine Learning Models by Alice Zheng (http://www.oreilly.com)
127. COURSES
Machine Learning by Andrew Ng (online course: https://www.coursera.org)
Intro to Statistical Learning by Hastie et al (PDF: http://usc.edu; videos: http://www.dataschool.io; online course: https://stanford.edu)
128. WORKSHOP DEMO AND LAB
Titanic Demo and Congress Lab for this workshop
https://github.com/asimjalis/data-science-workshop