Many people think that Data Science is like a Kaggle competition. There are, however, big differences in the approach. This presentation is about carefully designing your evaluation scheme to avoid overfitting and unexpected production performance.
2. • A Data Science competition platform
(there are others: DataScience.net in France)
• 332,000 Data Scientists
• Today: 192 competitions, 18 active
+ 516 "in class" competitions, 12 active
• Prestigious clients: AXA, CERN, Caterpillar, Facebook, GM, Microsoft, Yandex…
What is Kaggle?
3. • Prize pool?
• $325,000 to be won as of August 31st
• Good luck with that!
• Not a good hourly wage
• Today: 192 competitions, 18 active
Understand:
• Lots of datasets covering approximately every DS topic
• Lots of winners' solutions, tips and tricks, etc.
• Lots of "beat the benchmark" threads for beginners
I discovered/tested there: GBT, xgboost, Keras, word2vec, BeautifulSoup, hyperopt, ...
Why should I join?
4. Most of the time:
• You have a train set with labels and a test set without labels.
• You need to learn a model using the train features and predict the test set labels
• Your prediction is evaluated using a specific metric
• The best prediction wins
What is a Data Science Competition ?
5. Most of the time:
• You have a train set with labels and a test set without labels.
• You need to learn a model using the train features and predict the test set labels
• Your prediction is evaluated using a specific metric
• The best prediction wins
What is a Data Science Competition?
Why AUC? F1 score? Log loss? Could that depend on my train/test split?
Where do they come from? Do you always have some?
Why is the split this way? Random? Time?
6. What you don't learn on Kaggle (or in class?):
• How to model a business question as an ML problem
• How to manage/create labels (proxy / missing…)
• How to evaluate a model:
• How to choose your metric
• How to design your train/test split
• How to account for this in feature engineering
Understanding this actually helps you in Kaggle competitions:
• How to design your cross-validation scheme (and not overfit)
• How to create relevant features
• Hacks and tricks (leak exploitation :))
What is a Data Science Competition?
9. • Introduction
• Labels?
• Train and test split?
• Feature Engineering?
• Evaluation Metric?
Introduction
10. • Introduction
• Labels?
• Train and test split?
• Feature Engineering?
• Evaluation Metric?
Introduction
The newcomer disillusion
The production bad surprise
The business obfuscation
11. • Senior Data Scientist at Dataiku
(worked on churn prediction, fraud detection, bot detection, recommender systems,
graph analytics, smart cities,…)
• (More than) Occasional Kaggle competitor
• Twitter @prrgutierrez
Who I am
14. • Fraud is everywhere
E-business, telco, Medicare,…
• Easily defined as a classification problem
• Is the target well defined?
• E-business: yes, with a lag
• Elsewhere: needs checks,
labels are expensive
Fraud Detection
15. • Wikipedia:
“Churn rate (sometimes called attrition rate), in its broadest sense, is a measure of the
number of individuals or items moving out of a collective group over a specific period of
time”
= Customer leaving
Churn
16. • Subscription models:
• Telco
• E-gaming (WoW)
• Ex: Coyote -> 1-year subscription
-> you know when someone leaves
• Non-subscription models:
• E-business (Amazon, Price Minister, Vente Privée)
• E-gaming (Candy Crush, free MMORPGs)
-> you approximate when someone leaves
Candy Crush: days / weeks
MMORPG: 2 months (holidays)
Price Minister: months
Two types of Churn
17. • Predict whether a vehicle / machine / part is going to fail
• Classification problem:
• Given a future horizon and a failure type, will this happen for a given vehicle?
-> 2 parameters describe the target
• Varying the target a lot -> spurious correlations
• Choose it to match the exact business need
Predictive Maintenance
18. • Target is "will like" or "will buy"
• Target is often a proxy for real interest (implicit feedback)
Recommender System
19. • Can you model the problem as an ML problem?
• Ex: predictive maintenance
• Ask the right question from a business point of view,
not what you know how to do.
• Is your target a proxy?
• Ex: recommendation systems
• May need bandit algorithms
• Is it easy to get labels?
• Ex: fraud detection
• Can be expensive
• Mechanical Turk can be the answer
Summary on Labels
20. • Random Split
• Just like in school
Train / test split
• When and why?
-> When each line is independent from the rest (not that common!):
image and document classification, sentiment analysis ("but 'aha' is the new 'lol'")
-> When you want to quickly iterate / benchmark: "is it even possible?"
-> When you want to sell something to your boss
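For instance, a minimal random-split sketch with scikit-learn (the toy columns here are invented for illustration):

```python
# A random 80/20 split -- only valid when rows are independent.
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for a document-classification dataset.
df = pd.DataFrame({"length":  [120, 80, 300, 50, 210, 95],
                   "n_links": [1, 0, 4, 0, 2, 1],
                   "label":   [0, 0, 1, 0, 1, 1]})

X, y = df.drop(columns=["label"]), df["label"]
# Stratify to keep class proportions stable across the split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```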
21. • Column / group based
Ex: Caterpillar challenge
• Predict a price
• for each tube id
• Tube ids in train and test are different
Objective:
being able to generalize to other tubes!
Train / test split
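A hedged sketch of such a group-aware split using scikit-learn's GroupShuffleSplit (the tube data is invented):

```python
# Group-based split: every row of a given tube_id ends up on the same
# side, so the test score measures generalization to unseen tubes.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({"tube_id":  ["TA-1", "TA-1", "TA-2", "TA-3", "TA-3", "TA-4"],
                   "quantity": [1, 5, 2, 1, 10, 3],
                   "price":    [21.9, 12.3, 8.4, 17.0, 5.1, 9.9]})

gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(gss.split(df, groups=df["tube_id"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]

# No tube appears on both sides:
assert set(train["tube_id"]).isdisjoint(test["tube_id"])
```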
22. • Time based
• Simply separate train and test on a time variable
• When and why?
-> When you want a model that "predicts the future"
-> When things evolve with time (most problems!)
-> Examples:
ad click prediction, churn prediction, e-business fraud detection, predictive
maintenance,…
Train / test split
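A minimal time-based split sketch (the dates and the cutoff are arbitrary illustration values):

```python
# Time-based split: train on the past, evaluate on the future.
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2015-01-10", "2015-02-03",
                                           "2015-04-22", "2015-06-15",
                                           "2015-07-01"]),
                   "clicked": [0, 1, 0, 1, 0]})

cutoff = pd.Timestamp("2015-06-01")        # arbitrary cutoff for the example
train = df[df["date"] < cutoff]
test = df[df["date"] >= cutoff]
# No shuffling: shuffling would let the model "see the future".
```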
23. • Non-subscription example
• Target: 4 months without buying
• Features?
Train / test split : Churn example
24. Ex: Train and predict scheme (timeline diagram)
• T = present time.
• Data before T – 4 months is used for feature generation.
• Data from the last 4 months (T – 4 months to T) is used for target creation: activity during the last 4 months.
• Train the model using the features and target, then use it to predict future churn.
25. Ex: Train, evaluation and predict scheme (timeline diagram)
• T = present time; the timeline is cut at T – 8 months and T – 4 months.
• Training: data before T – 8 months is used for feature generation; activity during the following 4 months (T – 8 to T – 4) is used for target creation.
• Validation set: data before T – 4 months is used for feature generation; activity during the last 4 months (T – 4 to T) is used for target creation. Evaluate the model on the target of the validation set.
• Finally, use the model to predict future churn from T.
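A rough pandas sketch of this windowing scheme, assuming a hypothetical one-row-per-purchase table; the column names and toy dates are illustrative:

```python
# Windowed churn dataset following the T / T-4 / T-8 months scheme above.
import pandas as pd

purchases = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3, 3, 1],
    "date": pd.to_datetime(["2014-11-02", "2015-02-10", "2014-12-24",
                            "2015-06-30", "2015-01-15", "2015-08-01",
                            "2015-07-20"])})
T = pd.Timestamp("2015-09-01")             # "present" time

def make_dataset(feature_end):
    """Features from data before feature_end; churn target = no purchase
    during the 4 months after feature_end."""
    target_end = feature_end + pd.DateOffset(months=4)
    hist = purchases[purchases["date"] < feature_end]
    future = purchases[(purchases["date"] >= feature_end) &
                       (purchases["date"] < target_end)]
    feats = hist.groupby("customer_id")["date"].agg(
        n_purchases="count", last_purchase="max").reset_index()
    feats["days_since_last"] = (feature_end - feats["last_purchase"]).dt.days
    feats["churn"] = ~feats["customer_id"].isin(future["customer_id"])
    return feats.drop(columns=["last_purchase"])

train = make_dataset(T - pd.DateOffset(months=8))   # target fully known at T-4
valid = make_dataset(T - pd.DateOffset(months=4))   # target fully known at T
```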
26. • More complex designs
• Graph sampling (fraud rings?)
• Random sampling in a client's / machine's life
• Mix of column based and time based …
• The rule:
1) What is the problem?
2) To what would I like to generalize my model?
The future? Other individuals? …
3) => Train / test split
Train / test split
27. • Predictive maintenance problem
• Objective: predict failure in the next 3 days
• Metric is proportional to accuracy (and 0.57 is the best score!)
• Link to data:
https://www.phmsociety.org/events/conference/phm/14/data-challenge
Ex: PHM Society (a failure example)
31. • How to design the evaluation scheme?
• What is the probability that an asset fails in the next 3 days from now?
-> classification problem
-> time-based split
-> but how do I create a train and a test set?
• Choose a date and evaluate what happens 3 days later?
-> problem: not enough failures happening
• Choose several dates for each asset?
-> beware of asset over-fitting
• In the challenge: random selection of (asset, date) pairs in the future + oversampling of
failures (see the sketch below).
Ex: PHM Society
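A hedged sketch of what such a sampling could look like, with an invented schema and an arbitrary negative-to-positive ratio:

```python
# Sample (asset, date) evaluation points and oversample the rare failures.
import pandas as pd

points = pd.DataFrame({
    "asset": ["A", "A", "B", "B", "C", "C", "C", "D"],
    "date": pd.to_datetime(["2014-07-01", "2014-07-15", "2014-07-02",
                            "2014-07-20", "2014-07-05", "2014-07-11",
                            "2014-07-25", "2014-07-08"]),
    "fails_in_3d": [False, True, False, False, False, False, True, False]})

pos = points[points["fails_in_3d"]]
neg = points[~points["fails_in_3d"]]

# Keep every failure; subsample non-failures at 2 negatives per positive
# (the ratio here is an arbitrary choice for the example).
neg_sample = neg.sample(n=2 * len(pos), random_state=0)
test_set = pd.concat([pos, neg_sample]).sample(frac=1, random_state=0)
```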
37. • Beware of the distribution of your features!
• Is there a time dependency?
• Ex: counts, sums, … that only increase with time
• -> Compute counts and sums rescaled by time / in moving windows instead (see the sketch below)
• Found in churn, fraud detection, ad click prediction,…
• A categorical variable dependency?
• Ex: email flag in fraud detection
• Is there a network dependency?
• Ex: fraud / bot detection (network features can be useful but leaky)
Feature Engineering
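For the time-dependency case, a small pandas sketch contrasting a cumulative count with a trailing-window count (toy data, hypothetical column names):

```python
# Replace an ever-growing cumulative count with a moving-window count,
# so the feature's distribution stays comparable across train and test.
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2, 1, 2, 1],
    "ts": pd.to_datetime(["2015-01-01", "2015-01-20", "2015-02-01",
                          "2015-03-05", "2015-03-06", "2015-06-01"])})
events = events.sort_values("ts").set_index("ts")

# Drifts upward forever -- train and test live in different value ranges:
events["clicks_total"] = events.groupby("user_id").cumcount() + 1

# Stays stationary -- events per user over the trailing 30 days:
events["clicks_30d"] = (events.groupby("user_id")["user_id"]
                              .transform(lambda s: s.rolling("30D").count()))
```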
38. • Final trick:
- Stack train and test and add an is_test boolean
- Try to predict is_test
- Check whether the model is able to predict it
- If so:
- check the feature importances
- remove / modify the offending features and iterate
Feature Engineering
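A minimal sketch of this adversarial-validation trick with scikit-learn, on toy numeric data where a drifting count feature plays the role of the leak:

```python
# Adversarial validation: if a classifier can tell train from test rows,
# some features give away the split (often time-related ones).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
train = pd.DataFrame({"amount": rng.normal(100, 20, 500),
                      "count": rng.integers(0, 50, 500)})
test = pd.DataFrame({"amount": rng.normal(100, 20, 500),
                     "count": rng.integers(40, 120, 500)})  # drifted upward

both = pd.concat([train.assign(is_test=0), test.assign(is_test=1)],
                 ignore_index=True)
X, y = both.drop(columns=["is_test"]), both["is_test"]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
print(f"train-vs-test AUC: {auc:.3f}")    # ~0.5 = indistinguishable, good

# If the AUC is high, look at which features expose the split:
clf.fit(X, y)
print(dict(zip(X.columns, clf.feature_importances_.round(3))))
```

Here the count feature dominates the importances, which is exactly the signal to rescale it or move it into a window.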
39. • Final trick:
• Back to the PHM example:
Feature Engineering
Huge time leak!
40. • "Threshold dependent"
• Accuracy
• Precision and Recall
• F1 score
• "Threshold independent"
• AUC
• Log loss
• Others (mean average precision)…
Evaluation metric: Classification
41. • "Threshold dependent"
• Accuracy
• Precision and Recall
• F1 score
• "Threshold independent"
• AUC
• Log loss
• Others (mean average precision)…
• Custom metrics
Evaluation metric: Classification
Annotations on the metrics above:
• Accuracy: not good if the target is unbalanced
• Precision / Recall / F1: the accuracy alternative
• AUC: when you have an ordering problem
• Log loss: when you are going stochastic
• Custom: when you need to stick to the business
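To make the threshold distinction concrete, a small scikit-learn sketch computing both families of metrics on the same toy scores:

```python
# Threshold-dependent vs threshold-independent metrics on the same scores.
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss

y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]     # model scores
y_pred = [int(p >= 0.5) for p in y_prob]     # the 0.5 threshold matters here

print("accuracy:", accuracy_score(y_true, y_pred))   # needs a threshold
print("F1:      ", f1_score(y_true, y_pred))         # needs a threshold
print("AUC:     ", roc_auc_score(y_true, y_prob))    # ranking only
print("log loss:", log_loss(y_true, y_prob))         # probabilities only
```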
42. • Custom metrics
• Cost based
• Ex fraud:
• Mean loss of $50 per missed fraud (FN)
• Mean loss of $20 per wrongly cancelled transaction (FP)
• F1 score is often used in papers
• In practice, you often have a business cost (see the sketch below)
Evaluation metric: Classification
(Confusion matrix: TP / FN / FP / TN)
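A minimal sketch of such a cost-based metric using the slide's $50 / $20 figures (the function name and signature are made up for illustration):

```python
# Cost-based metric: a missed fraud (FN) costs $50 on average,
# a wrongly cancelled transaction (FP) costs $20.
import numpy as np

def business_cost(y_true, y_pred, fn_cost=50.0, fp_cost=20.0):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return fn * fn_cost + fp * fp_cost

print(business_cost([1, 1, 0, 0], [0, 1, 1, 0]))   # 1 FN + 1 FP = 70.0
```

One can then scan thresholds and pick the one that minimizes this cost rather than the one that maximizes F1.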
43. • Custom metrics
• Fraud example 1:
• "I have fraudsters on my e-business website"
• I generate a score for each transaction
• Transactions with a score above a threshold are reviewed manually
• I have 1 person doing this full time, able to deal with 100 transactions / day
• The rest is automatically accepted
-> AUC is not bad
-> Recall within the top 100 transactions / day (see the sketch below)
-> Total money blocked within the top 100 transactions / day
In practice AUC is more stable… but the money metric can also be used for communication.
Evaluation metric: Classification
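A hedged sketch of the capacity-constrained recall metric (recall_at_k is a hypothetical helper, not a standard API):

```python
# Capacity-constrained evaluation: the analyst reviews only the
# k highest-scoring transactions per day.
import numpy as np

def recall_at_k(y_true, scores, k=100):
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    top_k = np.argsort(-scores)[:k]          # indices of the k highest scores
    caught = y_true[top_k].sum()             # frauds inside the reviewed set
    return caught / max(y_true.sum(), 1)     # share of all frauds caught

y_true = [0, 1, 0, 0, 1, 1, 0, 1]
scores = [0.1, 0.9, 0.3, 0.2, 0.8, 0.4, 0.05, 0.7]
print(recall_at_k(y_true, scores, k=3))      # 0.75: 3 of the 4 frauds rank in the top 3
```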
44. • Custom metrics
• Fraud example 2:
• "I have fraudsters on my e-business website"
• I generate a score for each transaction
• I handle this automatically by blocking all transactions with a score above a threshold
-> AUC is not bad… but it doesn't tell you where to set the threshold.
-> F1 score?
-> Cost based is better
Evaluation metric: Classification
45. • My cheat sheet
Evaluation metric: Classification

Metric        | Optimized by ML model? | Threshold dependent? | Application example
--------------|------------------------|----------------------|---------------------------------------
Accuracy      | YES                    | YES                  | image classification, NLP, …
F1-score      | NO                     | YES                  | papers?
AUC           | NO                     | NO                   | fraud detection, churn, healthcare, …
Log loss      | YES                    | NO                   | ad click prediction
Custom metric | NO                     | ?                    | all?
46. • The business question dictates the evaluation scheme!
• test set design
• evaluation metric
• It indirectly impacts feature engineering
• It indirectly impacts label quality
• Think (not too much) before coding
• Don't try to optimize the wrong problem!
Conclusion