
- 1. Introduction to Uplift Modelling: an online gaming application
- 2. A few words about me • Senior Data Scientist at Dataiku (worked on churn prediction, fraud detection, bot detection, recommender systems, graph analytics, smart cities, …) • Occasional Kaggle competitor • Mostly code in Python and SQL • Twitter: @prrgutierrez
- 3. Plan • Introduction / client situation • Uplift use case examples • Uplift modelling • Uplift evaluation & results
- 4. Client situation • French online gaming company (RPG) • A lot of users are leaving • Let's build a churn prediction model! • Target: no return within 14 or 28 days (14 days absent → ~80% chance of not coming back; 28 days absent → ~90%) • Features: • Connection features: • Time played in the last 1, 7, 15, 30, … days • Time since last connection • Connection frequency • Days of week / hours of day played • Equivalent features for payments and subscriptions • Age, sex, country • Number of accounts, is a bot, … • No in-game features (no data)
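The 14/28-day target above can be sketched as a simple labelling function. A minimal sketch in Python; the user names, dates, and the snapshot date are made up for illustration, not the client's actual data:

```python
from datetime import date

# Hypothetical sketch of the churn target: a user is labelled a churner
# if they have not connected within the last `window` days before the
# snapshot date (slide: 14 missing days ~ 80% chance of no return).
CHURN_WINDOW_DAYS = 14

def churn_label(last_seen: date, snapshot: date, window: int = CHURN_WINDOW_DAYS) -> int:
    """Return 1 (churn) if the user has been absent at least `window` days."""
    return int((snapshot - last_seen).days >= window)

users = {
    "alice": date(2015, 3, 1),   # 29 days absent -> churner
    "bob": date(2015, 3, 28),    # 2 days absent -> active
}
snapshot = date(2015, 3, 30)
labels = {u: churn_label(d, snapshot) for u, d in users.items()}
```

A second pass with `window=28` would give the stricter 28-day target mentioned on the slide.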
- 5. Client situation • Model results: • AUC 0.88 • Very stable model • Marketing actions: • 7 different actions based on customer segmentation (offers, promotions, …) • A/B test → −5% churn among people contacted by email • Going further: • Feature engineering: guilds, close network, in-game actions, … • Study long-term churn …
- 6. Client situation • But wait! • Strong hypothesis: target the people who are most likely to churn
- 7. Client situation • But wait! • Strong hypothesis: target the people who are most likely to churn • What is the gain per person for an action? • $c$: cost of the action • $v_i$: value of the customer • $X$: independent variables • $T$: "treated" population, $C$: "control" population • $Y = 1$ if the customer churns, $0$ otherwise • Value with action: $E_T(V_i) = v_i\,(1 - P_T(Y=1|X)) - c$ • Value without action: $E_C(V_i) = v_i\,(1 - P_C(Y=1|X))$ • Gain (if $v_i$ is independent of treatment): $E(G_i) = v_i\,(P_C(Y=1|X) - P_T(Y=1|X)) - c$
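The gain formula on this slide translates directly into code. A minimal sketch; the customer value, churn probabilities, and action cost below are illustrative numbers, not the client's figures:

```python
def expected_gain(v_i: float, p_churn_control: float,
                  p_churn_treated: float, cost: float) -> float:
    """E(G_i) = v_i * (P_C(Y=1|X) - P_T(Y=1|X)) - c  (slide 7)."""
    return v_i * (p_churn_control - p_churn_treated) - cost

# A customer worth 100 whose churn probability drops from 0.60 (no action)
# to 0.45 (action), with an action costing 5: gain = 100 * 0.15 - 5 = 10.
g = expected_gain(100.0, 0.60, 0.45, 5.0)
```

Note the gain is negative whenever the uplift $P_C - P_T$ is smaller than $c / v_i$, which is exactly why targeting by churn probability alone is not enough.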
- 8. Client situation • But wait! • Strong hypothesis: target the people who are most likely to churn • What is the gain per person for an action? $E(G_i) = v_i\,(P_C(Y=1|X) - P_T(Y=1|X)) - c$ • Objective: maximize this gain • Targeting highly probable churners minimizes $P_T(Y=1|X)$, but not the difference! → Uplift model • Intuitive examples: • $P_C(Y=1) < P_T(Y=1)$: the action is expected to make the situation worse. Spam? • $P_C(Y=1) \approx P_T(Y=1)$: the user does not care, or is already lost
- 9. Uplift • Model the effect of the action • 4 groups of customers / patients: • 1. Responded because of the action (the people we want) • 2. Responded, but would have responded anyway (unnecessary cost) • 3. Did not respond, and the action had no impact (unnecessary cost) • 4. Did not respond because the action had a negative impact (actively harmful) • Incomplete knowledge
- 10. Uplift Examples • Healthcare: • A typical medical trial: • the treatment group gets the treatment • the control group gets a placebo (or another treatment) • a statistical test shows whether the treatment is better than the placebo • With uplift modelling we can find out for whom the treatment works best • Personalized medicine • Ex: what is the gain in survival probability? → a classification/uplift problem
- 11. Uplift Examples • Churn: • e-gaming • another example: Coyote • Retail: • compare coupon campaigns
- 12. Uplift Examples • Mailing: Hillstrom challenge • 2 campaigns: • one email targeting men • one email targeting women • Question: who are the people to target / who have the best response rate?
- 13. Uplift Examples • Common pattern: • Experiment or A/B test → treatment and control groups • Warning: the control group can easily be biased: • targeting the most probable churners while the control is everyone else • calling only the people who come to a shop • Limited experimental trials → no bandit algorithms (once a medical trial is done, you don't continue the "exploration") → feedback is relatively large and discrete in time.
- 14. Uplift modelling • Three main methods : • Two models approach • Class variable modification • Modification of existing machine learning models
- 15. Uplift modelling: Two-model approach • Build a model on the treatment group to get $P_T(Y|X)$ • Build a model on the control group to get $P_C(Y|X)$ • Set: $P = P_T(Y|X) - P_C(Y|X)$
- 16. Uplift modelling: Two-model approach • Advantages: • Standard ML models can be used • In theory, two good estimators → a good uplift model • Works well in practice • Generalizes easily to regression and multi-treatment settings • Drawbacks: • The difference of two estimators is probably not the best estimator of the difference • The two classifiers can ignore the weaker uplift signal (since it's not their target) • Algorithms focusing on estimating the difference directly should perform better
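As a toy illustration of the two-model approach, the sketch below stands in a per-segment frequency estimator for a real classifier (an assumption made for brevity; in practice any standard ML model would be fit separately on each group). Here $Y=1$ is a conversion, so we score by $P = P_T - P_C$ as on slide 15:

```python
from collections import defaultdict

def fit_rate_model(rows):
    """Toy 'classifier': estimate P(Y=1 | segment) by empirical frequency."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [positives, total]
    for x, y in rows:
        counts[x][0] += y
        counts[x][1] += 1
    return {x: pos / tot for x, (pos, tot) in counts.items()}

# Made-up (segment, outcome) data for each experimental group.
treated = [("young", 1), ("young", 1), ("old", 0), ("old", 0)]
control = [("young", 1), ("young", 0), ("old", 0), ("old", 1)]

p_t = fit_rate_model(treated)   # model of P_T(Y=1|X)
p_c = fit_rate_model(control)   # model of P_C(Y=1|X)

# Uplift score per segment: P = P_T(Y=1|X) - P_C(Y=1|X)
uplift = {x: p_t[x] - p_c[x] for x in p_t}
```

The "young" segment gets a positive score (the action helps) while "old" gets a negative one (the action appears to hurt), which is exactly the signal a single churn model cannot see.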
- 17. Uplift modelling: Class variable modification • Introduced in Jaskowski & Jaroszewicz 2012 • Allows any classifier to be adapted to uplift modelling • Let $G \in \{T, C\}$ denote the group membership (treatment or control) • Define the new target variable: $Z = 1$ if ($G = T$ and $Y = 1$), $Z = 1$ if ($G = C$ and $Y = 0$), $Z = 0$ otherwise • This corresponds to flipping the target in the control dataset.
- 18. Uplift modelling: Class variable modification • Why does it work? • By design (A/B test warning!), $G$ should be independent from $X$, thus: $P(Z=1|X) = P_T(Y=1|X)\,P(G=T|X) + P_C(Y=0|X)\,P(G=C|X) = P_T(Y=1|X)\,P(G=T) + P_C(Y=0|X)\,P(G=C)$ • Possibly after reweighting the datasets, we have $P(G=T) = P(G=C) = 1/2$, thus: $2P(Z=1|X) = P_T(Y=1|X) + P_C(Y=0|X)$
- 19. Uplift modelling: Class variable modification • Why does it work? • $2P(Z=1|X) = P_T(Y=1|X) + P_C(Y=0|X) = P_T(Y=1|X) + 1 - P_C(Y=1|X)$ • Thus $P = 2P(Z=1|X) - 1$ • And sorting by $P(Z=1|X)$ is the same as sorting by $P$
- 20. Uplift modelling: Class variable modification • Summary: • Flip the class in the control dataset • Concatenate the treatment and control datasets • Build a classifier • Target the users with the highest probability • Advantages: • Any classifier can be used • Directly predicts uplift (and not each class separately) • A single model on a larger dataset (instead of two small ones) • Drawbacks: • Complex decision surface → the model can perform poorly • Interpretation: what does AUC mean in this case?
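The class-variable transformation of slides 17-19 is a few lines of code; a minimal sketch on toy data (the four example rows are made up to cover every case):

```python
def transform(y: int, group: str) -> int:
    """Z = 1 if (treated and Y=1) or (control and Y=0), else 0.
    This is the label flip of Jaskowski & Jaroszewicz 2012."""
    return int(y == 1) if group == "T" else int(y == 0)

rows = [
    (1, "T"),  # responded under treatment     -> Z = 1
    (0, "T"),  # no response under treatment   -> Z = 0
    (1, "C"),  # responded without treatment   -> Z = 0 (flipped)
    (0, "C"),  # no response without treatment -> Z = 1 (flipped)
]
z = [transform(y, g) for y, g in rows]

# Any classifier is then trained on the concatenated data with target Z;
# its output converts to an uplift score via P = 2 * P(Z=1|X) - 1.
def uplift_from_z(p_z: float) -> float:
    return 2 * p_z - 1
```

A predicted $P(Z=1|X) = 0.5$ maps to zero uplift, which matches the intuition: the model cannot tell the flipped controls from the treated.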
- 21. Uplift modelling: Other methods • Based on decision trees: • Rzepakowski & Jaroszewicz 2012: a new decision tree split criterion based on information theory • Soltys, Rzepakowski & Jaroszewicz 2013: ensemble methods for uplift modelling (out of today's scope)
- 22. Evaluation • We used: • The two-model approach → AUC? Not very informative. • The one-model approach (class modification) → does AUC mean anything? • How can we evaluate / compare them? • Cross-validation: • 4 datasets: treatment/control × train/test • Problem: • We don't have a clear 0/1 target. • We would need to know, for each customer: • the response to treatment • the response to control → not possible
- 23. Evaluation • Gain for a group of customers: • Gain for the 10% highest-scoring customers = % of successes for the top 10% treated customers − % of successes for the top 10% control customers • Uplift curve?: • Difference between two lift curves • Interpretation: net gain in success rate if a given percentage of the population is treated • Problem: no theoretical maximum • Problem 2: weird behaviour for two "wizard" models.
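The top-10% gain above can be computed directly. A toy sketch; the scores and outcomes are made up, and "success" means $Y=1$:

```python
def top_fraction_gain(treated, control, frac=0.10):
    """Success rate of the top-`frac` treated customers minus the
    success rate of the top-`frac` control customers (slide 23)."""
    def top_rate(rows):
        rows = sorted(rows, key=lambda r: -r[0])   # sort by score, descending
        k = max(1, int(len(rows) * frac))          # size of the top slice
        return sum(y for _, y in rows[:k]) / k
    return top_rate(treated) - top_rate(control)

# 10 treated and 10 control customers as (score, outcome) pairs:
treated = [(s / 10, int(s >= 9)) for s in range(10)]  # only the top-scored converts
control = [(s / 10, 0) for s in range(10)]            # nobody converts untreated
gain = top_fraction_gain(treated, control)
```

With these toy numbers the top decile of each group is a single customer, so the gain is simply 1.0 − 0.0; real data would need the full curves below.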
- 24. Evaluation: Qini • Qini measure: • Similar to Gini (area under the lift curve). Lift curve ↔ Qini curve • Parametric curve, defined when taking the first $t$ observations by: $f(t) = Y_T(t) - Y_C(t) \cdot N_T(t)/N_C(t)$ • $Y_T(t)$ is the number of 1s seen in the treated observations • $Y_C(t)$ is the number of 1s seen in the control observations • $N_T(t)$ is the number of treated observations • $N_C(t)$ is the number of control observations • Balanced setting: $f(t) = Y_T(t) - Y_C(t)$
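A minimal sketch of the Qini curve $f(t)$ as defined above, assuming each customer carries an uplift score, a group flag, and an outcome (all toy data):

```python
def qini_curve(scored):
    """Compute f(t) = Y_T(t) - Y_C(t) * N_T(t)/N_C(t) for t = 1..n,
    walking down the customers sorted by decreasing uplift score.
    `scored` is a list of (score, group 'T'/'C', outcome) triples."""
    scored = sorted(scored, key=lambda r: -r[0])
    yt = yc = nt = nc = 0
    curve = []
    for _, group, y in scored:
        if group == "T":
            nt += 1
            yt += y
        else:
            nc += 1
            yc += y
        # Until a control observation is seen, the scaling is undefined;
        # fall back to the raw treated count.
        curve.append(yt - yc * nt / nc if nc else float(yt))
    return curve

# Balanced toy data: alternating treated/control down the ranking.
scored = [(0.9, "T", 1), (0.8, "C", 0), (0.7, "T", 1),
          (0.6, "C", 1), (0.5, "T", 0), (0.4, "C", 0)]
curve = qini_curve(scored)
```

The final point of the curve is the overall treatment effect (the slide's "random model" endpoint); the area between the curve and the straight line to that endpoint gives the Qini measure.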
- 25. Evaluation: Qini • Personal intuition: • We can't know everything: • for the treated who convert and the non-treated who don't convert, what would have happened? • But we don't want to see: • treated not converting • non-treated converting (in our top list) • In the first $t$ observations we want to minimize: $N_T(t) - Y_T(t) + Y_C(t)$ • Very similar to a lift taking into account only negative examples.
- 26. Evaluation: Qini • [Qini curve plot] $f(t) = Y_T(t) - Y_C(t)$
- 27. Evaluation: Qini • Best model: • takes first all the positives in the treatment group, and last all the positives in the control group • No theoretical best model: • it depends on the possibility of a negative effect • Displayed here assuming no negative effect • Random model: • corresponds to the global effect of the treatment • Hillstrom dataset: • for women, the models are comparable and useful • for men, there are no clear individuals to target
- 28. Evaluation: Qini • [Qini curve plot] $f(t) = Y_T(t) - Y_C(t)$
- 29. Evaluation: Qini • Back to our study: • Class modification performs best • The two-model approach performs poorly • A/B test failure: • the control dataset is way too small! • The class modification model is very close to the lift • The two-model approach is only slightly better than random → need to redo the A/B test.
- 30. Conclusion • Uplift: • Surprisingly little literature / few examples • The theory is rather easy to test: • two models • class modification • The intuition and evaluation are not easy to grasp • On the client side: • I don't lose hope that we'll do the A/B test again • A good lead for selecting the best offer for a customer
- 31. A few references • Data : • Churn in gaming : WOWAH dataset (blog post to come) • Uplift for healthcare : Colon Dataset • Uplift in mailing : Hillstrom data challenge • Uplift in General : Simulated data : (blog post to come)
- 32. A few references • Application • Uplift modeling for clinical trial data (Jaskowski, Jaroszewicz) • Uplift Modeling in Direct Marketing (Rzepakowski, Jaroszewicz)
- 33. A few references • Modeling techniques : • Rzepakowski Jaroszewicz 2011 (decision trees) • Soltys Rzepakowski Jaroszewicz 2013 (ensemble for uplift) • Jaskowski Jaroszewicz 2012 (Class modification model)
- 34. A few references • Evaluation • Using Control Groups to Target on Predicted Lift (Radcliﬀe) • Testing a New Metric for Uplift Models (Mesalles Naranjo)
- 35. Thank you for your attention !
