Workshop at DataEngConf 2016, on April 7-8 2016, at Galvanize, 44 Tehama Street, San Francisco, CA.
Demo and labs for workshop are at https://github.com/asimjalis/data-science-workshop
8. WHAT IS THIS WORKSHOP ABOUT?
Using Data Science and Machine Learning
Building classifiers using Python and scikit-learn
By the end of the workshop you will be able to build Machine Learning classification systems
13. OUTLINE
What is Data Science and Machine Learning?
What is scikit-learn? Why is it so helpful?
What kinds of problems can we solve with this stuff?
16. DATA SCIENCE AND MACHINE LEARNING
Data Science = Machine Learning + Statistics + Domain Expertise
17. STATISTICS AND MACHINE LEARNING
Statistics asks whether milk causes heart disease
Machine Learning predicts your death
Focused on results and actionable predictions
Used in production software systems
18. HISTORY OF MACHINE LEARNING
Input    Features  Algorithm  Output
Machine  Human     Human      Machine
Machine  Human     Machine    Machine
Machine  Machine   Machine    Machine
19. WHAT IS MACHINE LEARNING?
Inputs: vectors, or points in a high-dimensional space
Outputs: either binary vectors or continuous vectors
Machine Learning finds the relationship between them using statistical techniques
23. CLASSIFICATION EXAMPLE: EMAIL SPAM DETECTION
Start with a large collection of emails, labeled spam/not-spam
Convert email text into vectors of 0s and 1s: 1 if a word occurs, 0 if it does not
These are called inputs or features
Split data set into training set (70%) and test set (30%)
Use algorithm like Random Forests to build model
Evaluate model by running it on test set and capturing success rate
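A minimal sketch of this pipeline in scikit-learn; the emails and labels below are invented toy data:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier
# Toy corpus; a real data set would have thousands of labeled emails
emails = ["win a free prize now", "meeting moved to noon",
          "free money click now", "lunch tomorrow?"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam
# Convert text to 0/1 word-occurrence vectors
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(emails)
# 70/30 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3)
# Train a Random Forest and capture the success rate on the test set
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
model.score(X_test, y_test)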
25. CHOOSING ALGORITHM
Evaluate different models on data
Look at the relative success rates
Use rules of thumb: some algorithms work better on some kinds of data
26. CLASSIFICATION EXAMPLES
Is this tumor benign or cancerous?
Is this lead profitable or not?
Who will win the presidential election?
27. CLASSIFICATION: POP QUIZ
Is classification supervised or unsupervised learning?
Supervised because you have to label the data.
28. CLUSTERING EXAMPLE: LOCATE CELL PHONE TOWERS
Start with GPS coordinates of all cell phone users
Represent data as vectors
Locate towers in biggest clusters
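A minimal sketch with k-means clustering; the coordinates below are invented:
import numpy as np
from sklearn.cluster import KMeans
# Invented (latitude, longitude) points for phone users
coords = np.array([
    [37.77, -122.42], [37.78, -122.41],   # one dense area
    [47.60, -122.33], [47.61, -122.34],   # another dense area
])
kmeans = KMeans(n_clusters=2).fit(coords)
kmeans.cluster_centers_  # candidate tower locations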
29. CLUSTERING EXAMPLE: T-SHIRTS
What size should a t-shirt be?
Everyone’s real t-shirt size is different
Lay out all sizes and cluster
Target large clusters with XS, S, M, L, XL
30. CLUSTERING: POP QUIZ
Is clustering supervised or unsupervised?
Unsupervised because no labeling is required
34. REGRESSION EXAMPLES
How many units of product will sell next month?
What will a student score on the SAT?
What is the market price of this house?
How long before this engine needs repair?
35. REGRESSION EXAMPLE: AIRCRAFT PART FAILURE
Cessna collects data from airplane sensors
Predict when part needs to be replaced
Ship part to customer’s service airport
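A hedged regression sketch with ordinary linear regression; the sensor features and hours-to-failure target below are invented for illustration:
import numpy as np
from sklearn.linear_model import LinearRegression
# Invented features: [engine hours, vibration level]
X = np.array([[100, 0.1], [500, 0.4], [900, 0.9]])
# Invented target: hours until the part fails
y = np.array([800, 400, 50])
model = LinearRegression().fit(X, y)
model.predict([[700, 0.6]])  # predicted hours before repair is needed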
37. ANOMALY DETECTION EXAMPLE: CREDIT CARD FRAUD
Train model on good transactions
Anomalous activity indicates fraud
Can pass the transaction to a human for investigation
38. ANOMALY DETECTION EXAMPLE: NETWORK INTRUSION
Train model on network login activity
Anomalous activity indicates threat
Can initiate alerts and lockdown procedures
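A minimal sketch using scikit-learn's one-class SVM, one of several possible anomaly detectors; the transaction amounts are invented:
import numpy as np
from sklearn.svm import OneClassSVM
# Train only on normal transaction amounts
normal = np.array([[20.0], [25.0], [22.0], [30.0]])
model = OneClassSVM(nu=0.1).fit(normal)
# 1 = looks normal, -1 = anomaly (flag for human investigation)
model.predict([[24.0], [5000.0]])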
39. ANOMALY DETECTION: POP QUIZ
Is anomaly detection supervised or unsupervised?
Unsupervised because we only train on normal data
42. IPYTHON CHEATSHEET
Command         Meaning
ipython         Start IPython
/help np        Help on module
/help np.array  Help on function
/help "hello"   Help on object
%cpaste         Paste blob of text
%timeit         Time function call
%load FILE      Load file as source
quit            Exit
54. INDEXING
# List
a = [0,1,2,3,4]
# First element
a[0]
# Second element
a[1]
# Last element
a[-1]
55. SLICING
# Start at index 1, stop before 5, step 2
a[1:5:2]
# Start at index 1, stop before 3, step 1
a[1:3]
# Start at index 1, stop at end, step 1
a[1:]
# Start at index 0, stop before 5, step 2
a[:5:2]
56. SLICING: POP QUIZ
What does a[::] give you?
Defaults for everything: start at 0, to end, step 1.
58. DATA FRAMES FROM CSV
# Use CSV file header for column names
df = pd.read_csv('file.csv',header=0)
# CSV file has no header
df = pd.read_csv('file.csv',header=None)
59. DATA FRAME
import pandas as pd
df = pd.DataFrame(
    columns=['City','State','Sales'],
    data=[
        ['SFO','CA',300],
        ['SEA','WA',200],
        ['PDX','OR',150],
    ])
City State Sales
0 SFO CA 300
1 SEA WA 200
2 PDX OR 150
64. SELECTING ROWS + COLUMNS WITH SLICES
# First 2 rows and all columns
df.iloc[0:2,::]
City State Sales
0 SFO CA 300
1 SEA WA 200
65. SELECTING ROWS + COLUMNS WITH SLICES
# First 2 rows and all but first column
df.iloc[0:2,1::]
State Sales
0 CA 300
1 WA 200
66. ADD NEW COLUMN
# New tax column
df['Tax'] = df.Sales * 0.085
df
City State Sales Tax
0 SFO CA 300 25.50
1 SEA WA 200 17.00
2 PDX OR 150 12.75
67. ADD NEW BOOLEAN COLUMN
# New boolean column
df['HighSales'] = (df.Sales >= 200)
df
City State Sales Tax HighSales
0 SFO CA 300 25.50 True
1 SEA WA 200 17.00 True
2 PDX OR 150 12.75 False
68. ADD NEW INTEGER COLUMN
# New integer 0/1 column
df['HighSales'] = (df.Sales >= 200).astype('int')
df
City State Sales Tax HighSales
0 SFO CA 300 25.50 1
1 SEA WA 200 17.00 1
2 PDX OR 150 12.75 0
69. APPLYING FUNCTION
# Arbitrary function
df['State2'] = df.State.apply(lambda x: x.lower())
df
City State Sales Tax HighSales State2
0 SFO CA 300 25.50 1 ca
1 SEA WA 200 17.00 1 wa
2 PDX OR 150 12.75 0 or
70. APPLYING FUNCTION ACROSS AXIS
# Calculate mean across all rows, columns 2,3,4
df.iloc[:,2:5].apply(lambda x: x.mean(),axis=0)
Sales 216.666667
Tax 18.416667
HighSales 0.666667
dtype: float64
71. VECTORIZED OPERATIONS
Which one is faster?
# Vectorized operation
%timeit df * 2
160 µs per loop
# Loop over rows one at a time
%timeit for i in range(len(df)): df.iloc[i,0] * 2
1.72 s per loop
77. PLOT FEATURES
Plot survival by port:
C = Cherbourg, France
Q = Queenstown, Ireland
S = Southampton, UK
# Load data
df = pd.read_csv('data/titanic.csv',header=0)
# Plot by port
df.groupby('Embarked')[['Survived']].mean()
df.groupby('Embarked')[['Survived']].mean().plot(kind='bar')
81. MISSING VALUES
from numpy import NaN
df = pd.DataFrame(
    columns=['State','Sales'],
    data=[
        ['CA',300],
        ['WA',NaN],
        ['OR',150],
    ])
State Sales
0 CA 300
1 WA NaN
2 OR 150
82. MISSING VALUES: BACK FILL
df
State Sales
0 CA 300
1 WA NaN
2 OR 150
df.fillna(method='bfill',axis=0)
State Sales
0 CA 300
1 WA 150
2 OR 150
83. FORWARD FILL: USE PREVIOUS VALID ROW’S VALUE
df
State Sales
0 CA 300
1 WA NaN
2 OR 150
df.fillna(method='ffill',axis=0)
State Sales
0 CA 300
1 WA 300
2 OR 150
85. CATEGORICAL DATA
How can we handle a column that has data like CA, WA, OR, etc.?
Replace the categorical feature with 0/1 columns
For example, replace a state column containing CA, WA, OR with a binary column for each state
86. DUMMY VARIABLES
df = pd.DataFrame(
    columns=['State','Sales'],
    data=[
        ['CA',300],
        ['WA',200],
        ['OR',150],
    ])
State Sales
0 CA 300
1 WA 200
2 OR 150
87. DUMMY VARIABLES
df
State Sales
0 CA 300
1 WA 200
2 OR 150
pd.get_dummies(df)
Sales State_CA State_OR State_WA
0 300 1.0 0.0 0.0
1 200 0.0 0.0 1.0
2 150 0.0 1.0 0.0
88. CENTERING AND SCALING DATA
Why center and scale data?
Features with large values can dominate the model
Centering subtracts the mean, so each feature is centered at zero
Scaling divides by the standard deviation
89. CENTERING AND SCALING DATA
from sklearn import preprocessing
import numpy as np
X = np.array([[1.0,-1.0, 2.0],
              [2.0, 0.0, 0.0],
              [3.0, 1.0, 2.0],
              [0.0, 1.0,-1.0]])
scaler = preprocessing.StandardScaler().fit(X)
X_scaled = scaler.transform(X)
X_scaled
X_scaled.mean(axis=0)
X_scaled.std(axis=0)
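The same fitted scaler can then be applied to new data (a hypothetical new row below), so it is centered and scaled with the training set's statistics:
# Reuse the training set's means and standard deviations on new data
X_new = np.array([[1.0, 0.0, 0.0]])
scaler.transform(X_new)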
96. BASIC IDEA
Collection of decision trees
Each decision tree looks only at some features
For the final decision the trees vote
Example of an ensemble method
97. DECISION TREES
Decision trees play 20 questions with your data
They find feature questions that split up the final output
They choose splits that produce the purest branches
98. RANDOM FORESTS
Collection of decision trees
Each tree sees a random sample of the data
Each split point is based on a random subset of the total features
To classify new data the trees vote
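A minimal sketch on toy data:
from sklearn.ensemble import RandomForestClassifier
X = [[0, 0], [1, 1], [1, 0], [0, 1]]  # toy features
y = [0, 1, 1, 0]                      # toy labels
# 10 trees; each trains on a bootstrap sample of rows and
# considers a random subset of features at each split
model = RandomForestClassifier(n_estimators=10).fit(X, y)
model.predict([[1, 1]])  # the trees vote; the majority wins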
99. RANDOM FORESTS: POP QUIZ
Which one takes more time: training random forests or classifying a new data point with trained random forests?
Training takes more time
This is when the trees are constructed
Running/evaluation is fast
You just walk down the trees
101. PROBLEM OF OVERFITTING
Model can get attached to sample data
Learns specific patterns instead of general patterns
102. OVERFITTING
Which one of these is overfitting, underfitting, just right?
1. Underfitting
2. Just right
3. Overfitting
103. DETECTING OVERFITTING
How do you know you are overfitting?
Model does great on training set and terrible on test set
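A quick check, assuming a trained model and the X_train/X_test split from the spam sketch earlier:
# Compare scores on seen vs. unseen data
model.score(X_train, y_train)  # score on data the model has seen
model.score(X_test, y_test)    # score on unseen data; a large gap signals overfitting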
104. RANDOM FORESTS AND OVERFITTING
Random Forests are not prone to overfitting. Why?
Random Forests are an ensemble method
Each tree only sees and captures part of the data
Tends to pick up general rather than specific patterns
106. PROBLEM
How can we find out how good our models are?
Is it enough for models to do well on the training set?
How can we know how the model will do on new, unseen data?
107. CROSS VALIDATION
Technique to test a model
Split data into train and test subsets
Train on the train data set
Measure the model on the test data set
108. CROSS VALIDATION: POP QUIZ
Why can’t we test our models on the training set?
The model already knows the training set
It will have an unfair advantage
It has to be tested on data it has not seen before
109. K-FOLD CROSS VALIDATION
Split data into k sets
Repeat k times: train on k-1 sets, test on the kth
Model’s score is the average of the k scores
110. CROSS VALIDATION CODE
from sklearn.cross_validation import cross_val_score
# 10-fold (default 3-fold)
scores10 = cross_val_score(model, X, y, cv=10)
# See score stats
pd.Series(scores10).describe()
112. PROBLEM: OIL EXPLORATION
Drilling holes is expensive
We want to find the biggest oilfield without wasting money on duds
Where should we place our next derrick?
113. PROBLEM: MACHINE LEARNING
Testing hyperparameters is expensive
We have an N-dimensional grid of parameters
How can we quickly zero in on the best combination of hyperparameters?
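One baseline is an exhaustive grid search. A hedged sketch with scikit-learn's GridSearchCV (in sklearn.grid_search in releases of this era, sklearn.model_selection later); the parameter values and X_train/y_train are illustrative:
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# Try every combination in a small 2-dimensional grid
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [3, 5, 10],
}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=3)
search.fit(X_train, y_train)
search.best_params_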
117. RANDOM SEARCH
Randomly search the grid
60 random samples get you within the top 5% of grid search with 95% probability
This is Bergstra and Bengio’s result; see Alice Zheng’s explanation (see References)
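A sketch of the same search done randomly, with scikit-learn's RandomizedSearchCV (same module-path caveat as the grid search sketch above); the parameter ranges are illustrative:
from sklearn.grid_search import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
# Sample 60 random combinations instead of walking the whole grid
param_dist = {
    'n_estimators': list(range(10, 200)),
    'max_depth': list(range(2, 20)),
}
search = RandomizedSearchCV(RandomForestClassifier(), param_dist,
                            n_iter=60, cv=3)
search.fit(X_train, y_train)  # X_train, y_train as before
search.best_params_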
126. REFERENCES
Bayesian Optimization by Dewancker et al (http://sigopt.com)
Random Search for Hyper-Parameter Optimization by Bergstra and Bengio (http://jmlr.org)
Evaluating Machine Learning Models by Alice Zheng (http://www.oreilly.com)
127. COURSES
Machine Learning by Andrew Ng (online course: https://www.coursera.org)
Intro to Statistical Learning by Hastie et al (PDF: http://usc.edu; videos: http://www.dataschool.io; online course: https://stanford.edu)
128. WORKSHOP DEMO AND LAB
Titanic Demo and Congress Lab for this workshop
https://github.com/asimjalis/data-science-workshop