Data Science Salon Miami Example - Churn Rate Predictor
1. 2/14/2018 churn_rate
file:///Users/gregwerner/Downloads/churn_rate.html 1/23
Table of Contents
1 3Blades Churn Rate Demo
1.1 Demo Goal
1.2 Define the problem
1.3 Import Data & Libraries
1.3.0.1 Here are our features:
1.4 Cleanse
1.5 Transform
1.6 Cross Validation and Model Preparation
1.7 Spot Check Models
1.8 Fit Model
1.9 What Does This Mean?
3Blades Churn Rate Demo
This notebook provides an example of how to automate Machine Learning (ML) workflows with scikit-learn (http://scikit-learn.org/) and Magics
(https://ipython.org/ipython-doc/3/interactive/magics.html), and applies these methodologies to a churn rate predictor. Use this demo by
changing the values of k-folds, X, and y in order to explore! The demo follows this workflow, adapted from Jason Brownlee of Machine
Learning Mastery (http://machinelearningmastery.com/author/jasonb/):
1. Define the problem & Imports
2. Prepare the data
3. Cross Validation and Model Preparation
4. Run Model
5. Spot Check Algorithm
Demo Goal
This demo's main goal is to make it easy to use the machine learning workflow to solve real-life marketing problems. It allows users to perform data
science quickly and cleanly.
This demo provides:
The ability to set X, y, and k-folds for a personalized demo experience.
A rich set of models, tools, and helpful hints to guide you through the process step by step.
The ability to code in Python and R and import from a variety of data sources, so you aren't locked in.
Fast run times without sacrificing folds, model types, or model accuracy.
The ability to change performance metrics so you can interpret models in the way with which you're most comfortable.
Define the problem
Churn measures the number of customers who stop subscribing to or buying a product over time. When consumers subscribe to a service, it is
important to measure how likely they are to stop subscribing. In this demo, churn holds the value 1 when it is true; when a customer does not
churn, it holds the value 0.
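The 1/0 encoding described above can be sketched with NumPy's `np.where`. This is a minimal illustration on a made-up mini-frame (the real notebook reads `churndata.csv`); the `"True."` string with a trailing period matches how the target column is stored in this data set.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame standing in for the real churn data
df = pd.DataFrame({"Churn?": ["True.", "False.", "True."]})

# 1 where the customer churned, 0 otherwise
y = np.where(df["Churn?"] == "True.", 1, 0)
print(y)  # [1 0 1]
```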
Import Data & Libraries
We will now load our data and libraries. We will also list our features and view the first few rows of data.
In [1]: #Import libraries
%run ./Libraries.ipynb
#Import data and view first few rows
churn_df = pd.read_csv("churndata.csv")
churn_df.head()
/Users/Annmarie/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
Out[1]:
| | State | Account Length | Area Code | Phone | Int'l Plan | VMail Plan | VMail Message | Day Mins | Day Calls | Day Charge | Eve Mins | Eve Calls | Eve Charge | Night Mins | Night Calls | Night Charge | Intl Min |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.30 | 162.6 | 104 | 7.32 | 12.2 |
| 3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.90 | 61.9 | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 |
| 4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | 148.3 | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 |
In [2]: churn_result = churn_df['Churn?']
y = np.where(churn_result == 'True.',1,0)
# We don't need these columns
to_drop = ['State','Area Code','Phone','Churn?']
churn_feat_space = churn_df.drop(to_drop,axis=1)
# 'yes'/'no' has to be converted to boolean values
# NumPy converts these from boolean to 1. and 0. later
yes_no_cols = ["Int'l Plan","VMail Plan"]
churn_feat_space[yes_no_cols] = churn_feat_space[yes_no_cols] == 'yes'
# Pull out features for future use
features = churn_feat_space.columns
X = churn_feat_space.as_matrix().astype(np.float)
Here are our features:
1. Account Length
2. International Plan
3. Voicemail Plan
4. Voicemail Messages
5. Day Minutes
6. Day Calls
7. Day Charge
8. Eve Minutes
9. Eve Calls
10. Eve Charge
11. Night Minutes
12. Night Calls
13. Night Charge
14. International Minutes
15. International Calls
16. International Charge
17. Customer Service Calls
Cleanse
Now let's do data cleaning. To do this we will isolate the target data, drop any columns we don't need, and convert 'yes'/'no' values to
1/0. This makes it easy to make sense of all of our data. At this point, all missing values have been removed. We will then pull out our
features and place them in a feature matrix for analysis. It is also important to standardize our data. This is a common requirement for many
models and can be done with scikit-learn's StandardScaler, which rescales each feature to zero mean and unit variance. Once completed, we
can graph our distribution and move on to transforming our data to prepare for modeling.
In [3]: # Isolate target data, remove unneeded columns
%run ./FeatureEngineering.ipynb
# This is important
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)
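The effect of `StandardScaler` can be verified directly: after `fit_transform`, each column has (approximately) zero mean and unit variance. A minimal sketch on a tiny made-up matrix, not the notebook's actual feature matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two illustrative columns on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0))  # ~[0. 0.]
print(X_std.std(axis=0))   # ~[1. 1.]
```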
Transform
Now it is time to transform our data. In the following cells we will begin to optimize it by looking at which features are most important to the
data set. We will check the relationships between the variables and make sure that we have the best features in their best possible format.
We now find that the following three features are the most important to the data set:
1. International Plan
2. Voicemail Plan
3. Account Length
Feature space holds 3333 observations and 17 features
Unique target labels: [0 1]
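One common way to rank features like this is a random forest's `feature_importances_` attribute. The sketch below uses synthetic data from `make_classification` (the sizes and seed are illustrative, not the notebook's data) to show the pattern: fit a forest, then sort features by importance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 5 features, 2 of them informative
X, y = make_classification(n_samples=300, n_features=5,
                           n_informative=2, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X, y)

# Importances are normalized to sum to 1; argsort descending ranks features
ranked = np.argsort(rf.feature_importances_)[::-1]
print(ranked)
```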
In [5]: # %load ./Visualizations.py.html
get_ipython().magic(u'run ./Libraries.ipynb')
# Isolate target data
churn_result = churn_df["Churn?"]
y = np.where(churn_result == 'True.', 1, 0)
fig = plt.figure(figsize=(18,6), dpi=1600)
alpha = alpha_scatterplot = 0.2
alpha_bar_chart = 0.55
ax1 = plt.subplot2grid((2,3), (0,0))
# Plot a bar graph of customers who churn vs. those who do not
churn_result.value_counts().plot(kind='bar', alpha=alpha_bar_chart)
ax1.set_xlim(-1, 2)
plt.title("Distribution of churn, (1 = Churn)")
Out[5]: <matplotlib.text.Text at 0x11f6dca10>
Cross Validation and Model Preparation
Now let's set up the parameters for model cross-validation; we will use K-Folds. In the next cell, change the "n_folds" argument to your desired
number of folds and then run!
How well will your ML algorithms perform?
Make predictions on new data for which the answers are already known.
Resampling methods estimate performance on new data.
It is important to evaluate ML algorithms using data different from the data used to train them, to avoid overfitting.
This notebook uses K-Folds to avoid overfitting and to ensure we make good predictions!
In [6]: #Set folds by changing the "n_folds" object
kf = KFold(len(y),n_folds=25,shuffle=True)
In [7]: # Construct a kfolds object
def run_cv(X, y, clf_class, **kwargs):
    y_pred = y.copy()
    # Iterate through folds
    for train_index, test_index in kf:
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        # Initialize a classifier with keyword arguments
        clf = clf_class(**kwargs)
        clf.fit(X_train, y_train)
        y_pred[test_index] = clf.predict(X_test)
    return y_pred
In [8]: run_cv
Out[8]: <function __main__.run_cv>
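The notebook's `run_cv` is built on the deprecated `cross_validation.KFold` iterator (hence the warning above). With the newer `model_selection` API, the same out-of-fold predictions can be obtained with `cross_val_predict`. A sketch on synthetic data; the sizes, seed, and fold count are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in for the churn feature matrix
X, y = make_classification(n_samples=200, random_state=0)

# n_splits replaces the old n_folds argument
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# One prediction per sample, each made by a model that never saw it
y_pred = cross_val_predict(RandomForestClassifier(random_state=0), X, y, cv=kf)
print(y_pred.shape)  # (200,)
```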
Spot Check Models
For this data set we will test Support Vector Machines, Random Forests, and K-Nearest Neighbors. Note that Random Forest seems to
perform the best. To exclude a model, add # before its line.
In [9]: def accuracy(y_true, y_pred):
    # NumPy interprets True and False as 1. and 0.
    return np.mean(y_true == y_pred)
In [10]: print "Support vector machines:"
print "%.3f" % accuracy(y, run_cv(X,y,SVC))
print "Random forest:"
print "%.3f" % accuracy(y, run_cv(X,y,RF))
print "K-nearest-neighbors:"
print "%.3f" % accuracy(y, run_cv(X,y,KNN))
Support vector machines:
0.921
Random forest:
0.945
K-nearest-neighbors:
0.893
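The hand-rolled `accuracy` above is equivalent to scikit-learn's `accuracy_score`: comparing the arrays elementwise and averaging the booleans gives the fraction of correct predictions. A quick sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and out-of-fold predictions
y_true = np.array([1, 0, 1, 1])
y_pred = np.array([1, 0, 0, 1])

# np.mean over booleans == fraction correct == accuracy_score
print(np.mean(y_true == y_pred), accuracy_score(y_true, y_pred))  # 0.75 0.75
```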
Fit Model
We will now fit a random forest model, as it is the highest-performing model after both standard cross-validation and Bayesian optimization.
In [32]: import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomTreesEmbedding, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
model = RandomForestClassifier(n_jobs=10, random_state=50)
In [39]: n_estimator = 10
X, y = make_classification(n_samples=5000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
# It is important to train the ensemble of trees on a different subset
# of the training data than the linear regression model to avoid
# overfitting, in particular if the total number of leaves is
# similar to the number of training samples
X_train, X_train_lr, y_train, y_train_lr = train_test_split(X_train, y_train,
                                                            test_size=0.5)
(Final iterations of the Bayesian optimization run: step | time | score | parameters)
 13 | 00m03s | 0.90582 | 0.2301 | 24.7868 | 249.9074 |
 14 | 00m00s | 0.90339 | 0.8018 | 2.3446 | 10.0055 |
 15 | 00m06s | 0.90693 | 0.8106 | 24.8361 | 249.8776 |
-----------------------------------------------------
Final Results
SVC: 0.898651
RFC: 0.912495
What Does This Mean?
After fitting multiple models and checking them via spot checking and cross-validation, we can now conclude that the
Random Forest Classifier performs the best. This means that we can predict whether or not a customer will churn with roughly 94.5 percent accuracy (the cross-validated score above).
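When interpreting a churn model's accuracy, it helps to look past the single number: a confusion matrix breaks the predictions into true/false positives and negatives, which matters because churners are usually the minority class. A sketch with made-up labels (not the notebook's results):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: four non-churners (0) and two churners (1)
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0])

# ravel() flattens the 2x2 matrix into tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 1
```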