Before Kaggle
From a business goal to an ML problem
Pierre Gutierrez
@prrgutierrez
•  Data Science competitions platform
(There are others : DataScience.net in France)
•  332,000 Data Scientists
•  today : 192 competitions, 18 active
+ 516 In class, 12 active
•  Prestigious clients : Axa, Cern, Caterpillar, Facebook, GM, Microsoft, Yandex…
What is Kaggle?
•  Prize pool?
•  325,000 $ to make on August 31st
•  Good luck with that !
•  Not a good hourly wage
Understand :
•  Lots of datasets about approximately every DS topic
•  Lots of winning solutions, tips and tricks, etc…
•  Lots of “beat the benchmark” posts for beginners
I discovered/tested there : GBT, xgboost, Keras, word2vec, BeautifulSoup, hyperopt, ...
Why should I join ?
Most of the time:
•  You have a train set with labels and a test set without labels.
•  You need to learn a model using the train features and predict the test set labels
•  Your prediction is evaluated using a specific metric
•  The best prediction wins
What is a Data Science Competition ?
Why AUC? F1 score? Log loss?
Could that depend on my train/test split?
Where do they come from? Do you always have some?
Why is the split this way? Random? Time?
What you don’t learn on Kaggle (or in class?):
•  How to turn a business question into an ML problem.
•  How to manage/create labels. (proxy / missing…)
•  How to evaluate a model:
•  How to choose your metric
•  How to design your train/test split
•  How to account for this in feature engineering
Understanding this actually helps you in Kaggle competitions :
•  How to design your cross validation scheme (and not overfit)
•  How to create relevant features
•  Hacks and tricks (leak exploitation :) )
What is a Data Science Competition?
Scikit learn cheat sheet
Christophe Bourguignat DS cheat sheet
@chris_bour
Today
•  Introduction
•  Labels?
•  Train and test split?
•  Feature Engineering?
•  Evaluation Metric?
Introduction
The newcomer disillusion
The production bad surprise
The business obfuscation
•  Senior Data Scientist at Dataiku
(worked on churn prediction, fraud detection, bot detection, recommender systems,
graph analytics, smart cities,…)
•  (More than) Occasional Kaggle competitor
•  Twitter @prrgutierrez
Who I am
Before Kaggle
•  Fraud is everywhere
E-business, Telco, Medicare,…
•  Easily defined as a classification problem
•  Is the target well defined?
•  E-business : yes, with a lag
•  Elsewhere : needs checks,
labels are expensive
Fraud Detection
•  Wikipedia:
“Churn rate (sometimes called attrition rate), in its broadest sense, is a measure of the
number of individuals or items moving out of a collective group over a specific period of
time”
= Customer leaving
Churn
•  Subscription models:
•  Telco
•  E-gaming (WoW)
•  Ex : Coyote -> 1 year subscription
-> you know when someone leaves
•  Non-subscription models:
•  E-Business (Amazon, Price Minister, Vente Privée)
•  E-gaming (Candy Crush, free MMORPG)
-> you approximate someone leaving
Candy Crush: days / weeks
MMORPG: 2 months (holidays)
Price Minister: months
Two types of Churn
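For the non-subscription case, the "approximate someone leaving" idea can be sketched as an inactivity rule. This is only an illustration: the `events` table, its column names and the 60-day window are assumptions, and the right window depends on the business (days for Candy Crush, months for an e-business site).

```python
import pandas as pd

def label_churn(events: pd.DataFrame, cutoff: pd.Timestamp, inactivity_days: int) -> pd.Series:
    """Flag customers seen up to `cutoff` who show no activity in the following window."""
    window_end = cutoff + pd.Timedelta(days=inactivity_days)
    # Customers known at the cutoff date
    known = events.loc[events["date"] <= cutoff, "customer_id"].unique()
    # Customers with at least one event inside the observation window
    active_after = events.loc[
        (events["date"] > cutoff) & (events["date"] <= window_end), "customer_id"
    ].unique()
    churned = ~pd.Index(known).isin(active_after)
    return pd.Series(churned, index=pd.Index(known, name="customer_id"), name="churned")

# Hypothetical activity log
events = pd.DataFrame({
    "customer_id": ["a", "a", "b", "c", "c"],
    "date": pd.to_datetime(
        ["2016-01-05", "2016-03-10", "2016-01-20", "2016-02-01", "2016-06-15"]
    ),
})
labels = label_churn(events, pd.Timestamp("2016-02-15"), inactivity_days=60)
```

Customer "c" comes back after the window and is still labelled as churned here, which is exactly the kind of approximation the slide warns about.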
•  Predict if a vehicle / machine / part is going to fail
•  Classification Problem:
•  Given a future horizon and a failure type: will this failure happen for a given vehicle?
-> 2 parameters describe the target
•  Varying the target a lot -> spurious correlations
•  Just choose it from the exact business need
Predictive Maintenance
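A minimal sketch of a target described by those two parameters (horizon and failure type). The failure log, asset names and dates are made up for illustration.

```python
from datetime import date, timedelta

# Hypothetical failure log: (asset_id, failure_date, failure_type)
FAILURES = [
    ("truck_1", date(2016, 3, 10), "engine"),
    ("truck_1", date(2016, 5, 2), "brakes"),
    ("truck_2", date(2016, 3, 25), "engine"),
]

def will_fail(failures, asset_id, as_of, horizon_days, failure_type):
    """Target: does `failure_type` occur for `asset_id` within `horizon_days` after `as_of`?"""
    horizon_end = as_of + timedelta(days=horizon_days)
    return any(
        asset == asset_id and kind == failure_type and as_of < day <= horizon_end
        for asset, day, kind in failures
    )

# Same asset, same date: the label changes with the horizon
label_7d = will_fail(FAILURES, "truck_1", date(2016, 3, 5), 7, "engine")
label_3d = will_fail(FAILURES, "truck_1", date(2016, 3, 5), 3, "engine")
```

Because the label flips with the horizon, trying many (horizon, failure type) pairs until something correlates is risky; the pair should come from the business need.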
•  Target is “will like” or “will buy”
•  Target is often a proxy for the real interest (implicit feedback)
Recommender System
•  Can you model the problem as a ML problem?
•  Ex : predictive maintenance
•  Ask the right question from a business point of view.
Not what you know how to do.
•  Is your target a proxy?
•  Recommendation system
•  May need a bandit algorithm
•  Is it easy to get labels?
•  Ex : Fraud detection
•  Can be expensive
•  Mechanical Turk can be the answer
Summary on Labels
•  Random Split
•  Just like in school
Train / test split
	
  
•  When and why?
-> When each line is independent from the rest (not that common!)
image, document classification, sentiment analysis (“but aha is the new lol”)
-> When you want to quickly iterate / benchmark: “is it even possible?”
-> When you want to sell something to your boss
•  Column / group based
Ex : Caterpillar challenge
•  Predict a price
•  for each tube id
•  Tube ids in train and test
are different
Objective :
being able to generalize to
other tubes!
Train / test split
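A group-based split can be sketched with scikit-learn's `GroupShuffleSplit`, which keeps every row of a given group on one side of the split. The tube ids and the tiny feature matrix below are placeholders, not the Caterpillar data.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: two quote rows per tube id
tube_ids = np.array(["t1", "t1", "t2", "t2", "t3", "t3", "t4", "t4"])
X = np.arange(len(tube_ids)).reshape(-1, 1)

# Hold out 25% of the groups (tubes), not 25% of the rows
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=tube_ids))

train_tubes = set(tube_ids[train_idx])
test_tubes = set(tube_ids[test_idx])
```

The validation score then measures what the challenge asks for: performance on tubes the model has never seen.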
•  Time based
•  Simply separate train and test on a time variable
•  When and Why?
-> When you want a model that “predicts the future”
-> When things evolve with time! (most problems!)
-> Examples :
Ad click prediction, Churn prediction, E-business Fraud detection, Predictive
maintenance,…
Train / test split
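A time-based split is just a cut on the time variable. A small sketch with pandas, where the `clicks` table, its columns and the cutoff date are illustrative assumptions:

```python
import pandas as pd

def time_based_split(df, time_col, cutoff):
    """Rows strictly before `cutoff` go to train, the rest to test."""
    cutoff = pd.Timestamp(cutoff)
    train = df[df[time_col] < cutoff]
    test = df[df[time_col] >= cutoff]
    return train, test

# Hypothetical click log
clicks = pd.DataFrame({
    "date": pd.to_datetime(
        ["2016-01-03", "2016-02-14", "2016-03-01", "2016-03-20", "2016-04-02"]
    ),
    "clicked": [0, 1, 0, 0, 1],
})
train, test = time_based_split(clicks, "date", "2016-03-15")
```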
•  Non-subscription example
•  Target : 4 months without buying
•  Features ?
Train / test split : Churn example
Ex : Train and predict scheme
[Timeline: T – 4 months, T : present time]
•  Before T – 4 months: data is used for feature generation.
•  From T – 4 months to T: data is used for target creation (activity during the last 4 months).
•  Train model using features and target.
•  Use model to predict future churn.
Ex : Train Evaluation and Predict Scheme
[Timeline: T – 8 months, T – 4 months, T : present time]
•  Training: data before T – 8 months is used for feature generation; data from T – 8 months to T – 4 months is used for target creation (activity during the last 4 months).
•  Validation set: data before T – 4 months is used for feature generation; data from T – 4 months to T is used for target creation (activity during the last 4 months).
•  Evaluate on the target of the validation set.
•  Use model to predict future churn.
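The train / evaluate / predict scheme can be sketched as one function called with three different cutoffs. Everything here is an assumption made for illustration: the `purchases` table, the two toy features, and T = 2016-09-01.

```python
import pandas as pd

def make_dataset(purchases, feature_end, target_months=4, with_target=True):
    """Features from data before `feature_end`; target from the following months."""
    feature_end = pd.Timestamp(feature_end)
    target_end = feature_end + pd.DateOffset(months=target_months)
    past = purchases[purchases["date"] < feature_end]
    data = past.groupby("customer_id").agg(
        n_purchases=("date", "size"),
        days_since_last=("date", lambda d: (feature_end - d.max()).days),
    )
    if with_target:
        future = purchases[
            (purchases["date"] >= feature_end) & (purchases["date"] < target_end)
        ]
        # Churned = no purchase during the target window
        data["churned"] = ~data.index.isin(future["customer_id"])
    return data

# Hypothetical purchase log
purchases = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b", "c", "c"],
    "date": pd.to_datetime([
        "2015-11-10", "2016-02-15", "2016-06-20",
        "2015-12-01", "2015-10-05", "2016-03-01",
    ]),
})
T = pd.Timestamp("2016-09-01")
train = make_dataset(purchases, T - pd.DateOffset(months=8))   # target: T-8 to T-4
valid = make_dataset(purchases, T - pd.DateOffset(months=4))   # target: T-4 to T
to_score = make_dataset(purchases, T, with_target=False)       # target not observed yet
```

The same feature code runs for all three datasets, so no feature can use information from its own target window.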
  
•  More complex design
•  Graph sampling (fraud rings ? )
•  Random sampling in client / machine life
•  Mix of column based and time based …
•  The rule :
1)  What is the problem ?
2)  To what would I like to generalize my model ?
Future ? Other individuals ? …
3)  => Train / Test split
Train / test split
•  Predictive Maintenance problem
•  Objective : predict failure in next 3 days.
•  Metric is proportional to accuracy (and 0.57 is the best score !)
•  Link to data :
https://www.phmsociety.org/events/conference/phm/14/data-challenge
EX PHM Society (Fail example)
•  Failures
EX PHM Society
•  Usage
EX PHM Society
•  Part Replacements
EX PHM Society
•  How to design the evaluation scheme?
•  What is the probability that an asset fails in the next 3 days from now?
-> classification problem
-> Time based split
-> but how do I create a train and a test?
•  Choose a date and evaluate what happens 3 days later?
-> problem : not enough failures happening
•  Choose several dates for each asset?
-> beware of asset over-fitting
•  In the challenge : random selection of (asset, date) in the future + over sampling of
failures.
EX PHM Society
•  Basic Feature engineering
EX PHM Society
•  Random Sampling
EX PHM Society
This is decent!
« With some more work I could have a model that beats randomness enough to be useful »
•  Time based split
EX PHM Society
Wait what?
•  TIME LEAK
EX PHM Society
Tree cuts
•  Beware of the distribution of your features!
•  Is there a time dependency?
•  Ex : count, sum, … that will only increase with time
•  -> Calculate count and sum rescaled by time / in moving windows instead.
•  Can be found in Churn, Fraud detection, Ad click prediction,…
•  A categorical variable dependency?
•  Ex : email flag in fraud detection
•  Is there a Network dependency?
•  Ex : Fraud / Bot detection (network features can be useful but leaky)
Feature Engineering
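A sketch of the two alternatives to raw cumulative counts: a count in a moving window and a count rescaled by time. The event log and column names are assumptions.

```python
import pandas as pd

def windowed_count(events, as_of, window_days=30):
    """Number of events per customer in the `window_days` before `as_of`."""
    as_of = pd.Timestamp(as_of)
    start = as_of - pd.Timedelta(days=window_days)
    recent = events[(events["date"] > start) & (events["date"] <= as_of)]
    return recent.groupby("customer_id").size()

def rate_per_day(events, as_of):
    """Cumulative count divided by the customer's age in days at `as_of`."""
    as_of = pd.Timestamp(as_of)
    past = events[events["date"] <= as_of]
    by_customer = past.groupby("customer_id")["date"]
    age_days = (as_of - by_customer.min()).dt.days + 1
    return by_customer.size() / age_days

# Hypothetical log: "a" is an old customer, "b" a recent one
events = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b"],
    "date": pd.to_datetime(["2016-01-01", "2016-01-11", "2016-01-21", "2016-01-21"]),
})
rate = rate_per_day(events, "2016-01-30")
recent = windowed_count(events, "2016-01-30", window_days=15)
```

The raw counts (3 versus 1) mostly encode how long each customer has been around; the rescaled and windowed versions are comparable across time.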
•  Final trick :
-  Stack train and test and add is_test boolean
-  Try to predict is_test
-  Check if the model is able to predict
-  If so :
-  check the feature importance
-  Remove / modify feature and iterate
Feature Engineering
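The is_test trick can be sketched with scikit-learn on synthetic data. The two features are invented for the demo: `stable_feature` has the same distribution in train and test, while `cumulative_count` keeps growing with time and therefore differs between them.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 500
feature_names = ["stable_feature", "cumulative_count"]
# Synthetic train and test sets; only the second column drifts with time
train = np.column_stack([rng.normal(size=n), rng.uniform(0, 100, size=n)])
test = np.column_stack([rng.normal(size=n), rng.uniform(100, 200, size=n)])

# Stack train and test and add the is_test flag as the target
X = np.vstack([train, test])
is_test = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

# Try to predict is_test; an AUC near 0.5 means train and test look alike
clf = RandomForestClassifier(n_estimators=50, random_state=0)
proba = cross_val_predict(clf, X, is_test, cv=5, method="predict_proba")[:, 1]
auc = roc_auc_score(is_test, proba)

# If the model succeeds, feature importance points at the suspect feature
clf.fit(X, is_test)
suspect = feature_names[int(np.argmax(clf.feature_importances_))]
```

Here the AUC is close to 1 and the importance singles out `cumulative_count`, which is the feature to remove or rescale before iterating.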
•  Final trick:
•  Back to PHM example:
Feature Engineering
Huge time leak!
•  “Threshold dependent”
•  Accuracy
•  Precision and Recall
•  F1 score
•  “Threshold independent”
•  AUC
•  Log Loss
•  Others (Mean average precision)…
•  Custom
Evaluation metric : Classification
Not good if unbalanced target
When you have an order problem
When you are going stochastic
When you need to stick to business
Accuracy alternative
•  Custom metrics
•  Cost based
•  Ex Fraud:
•  Mean loss of 50 $ / fraud (FN)
•  Mean loss of 20 $ / wrongly cancelled transaction (FP)
•  F1 score often used in papers
•  in practice, you often have a business cost
Evaluation metric : Classification
TP   FN
TN   FP
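The cost-based metric from the fraud example can be written directly from the two mean losses (50 $ per missed fraud, 20 $ per wrongly cancelled transaction). The labels and predictions below are made up.

```python
def fraud_cost(y_true, y_pred, fn_cost=50.0, fp_cost=20.0):
    """Total business cost: missed frauds (FN) plus wrongly cancelled transactions (FP)."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return fn * fn_cost + fp * fp_cost

# Hypothetical labels (1 = fraud) and model decisions (1 = blocked)
y_true = [1, 0, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 0]
total = fraud_cost(y_true, y_pred)  # 2 FN and 1 FP
```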
  
•  Custom metrics
•  Fraud Example 1:
•  “I have fraudsters on my e-business website”
•  I generate a score for each transaction
•  I handle this by manually handling transactions with score higher than threshold
•  I have 1 person that does this fulltime and able to deal with 100 transactions / day
•  The rest is automatically accepted
-> AUC is not bad
-> Recall in the top 100 transactions / day
-> Total money blocked in the top 100 transactions / day
In practice AUC is more stable… But the money metric can also be used for communication.
Evaluation metric : Classification
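When a reviewer can only handle a fixed number of transactions per day, the metric can be computed on the k highest scores. A small sketch with k = 3 instead of 100 and invented scores and amounts:

```python
def recall_at_k(y_true, scores, k):
    """Share of all frauds found among the k highest-scored transactions."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    caught = sum(label for _, label in ranked[:k])
    total = sum(y_true)
    return caught / total if total else 0.0

def amount_blocked_at_k(y_true, scores, amounts, k):
    """Money saved: sum of fraudulent amounts among the k highest-scored transactions."""
    ranked = sorted(zip(scores, y_true, amounts), key=lambda row: row[0], reverse=True)
    return sum(amount for _, label, amount in ranked[:k] if label == 1)

# Hypothetical day of transactions (1 = fraud)
y_true = [0, 1, 0, 1, 1, 0]
scores = [0.1, 0.9, 0.8, 0.3, 0.7, 0.2]
amounts = [10, 200, 50, 30, 120, 15]

recall = recall_at_k(y_true, scores, k=3)
blocked = amount_blocked_at_k(y_true, scores, amounts, k=3)
```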
•  Custom metrics
•  Fraud Example 2:
•  “I have fraudsters on my e-business website”
•  I generate a score for each transaction
•  I handle this automatically by blocking all transactions with score higher than threshold
-> AUC is not bad… But it doesn’t give a threshold value.
-> F1–Score?
-> Cost based is better
Evaluation metric : Classification
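With automatic blocking, a cost-based metric also gives the threshold: scan candidate thresholds and keep the cheapest one. The costs reuse the 50 $ / 20 $ figures from the earlier fraud slide; the scores and labels are invented, and in practice the scan would run on a validation set.

```python
def best_threshold(y_true, scores, fn_cost=50.0, fp_cost=20.0):
    """Return (threshold, cost) minimising total cost when blocking score >= threshold."""
    best = None
    # Candidate thresholds: every observed score, plus "block nothing"
    for threshold in sorted(set(scores)) + [float("inf")]:
        cost = 0.0
        for label, score in zip(y_true, scores):
            blocked = score >= threshold
            if label == 1 and not blocked:
                cost += fn_cost   # missed fraud
            elif label == 0 and blocked:
                cost += fp_cost   # wrongly cancelled transaction
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best

# Hypothetical validation scores (1 = fraud)
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.35, 0.4, 0.8, 0.9]
threshold, cost = best_threshold(y_true, scores)
```

Because a missed fraud costs more than a wrongly blocked transaction, the cheapest threshold sits low enough to catch every fraud at the price of one false positive.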
•  My cheat sheet
Evaluation metric : Classification
Metric          Optimized by ML model?   Threshold dependent   Application example
Accuracy        YES                      YES                   image classification, NLP…
F1-score        NO                       YES                   ? Papers ?
AUC             NO                       NO                    fraud detection, churn, healthcare…
Log-Loss        YES                      NO                    ad click prediction
Custom metric   NO                       ?                     all ?
•  Business Question dictates Evaluation Scheme!
•  test set design
•  evaluation metric
•  Indirectly impacts feature engineering
•  Indirectly impacts label quality
•  Think (not too much) before coding
•  Don’t try to optimize the wrong problem!
Conclusion
Thank you for your attention!

 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationGiorgio Carbone
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptaigil2
 
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxTINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxDwiAyuSitiHartinah
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Guido X Jansen
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructuresonikadigital1
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Vladislav Solodkiy
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...PrithaVashisht1
 
YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.JasonViviers2
 

Recently uploaded (17)

The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayer
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices
 
CI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionCI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual intervention
 
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityStrategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
 
SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptx
 
AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introduction
 
How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - Presentation
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .ppt
 
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxTINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructure
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...
 
YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.
 

Before Kaggle

  • 1. Before Kaggle: from a business goal to an ML problem. Pierre Gutierrez, @prrgutierrez
  • 2. What is Kaggle?
      • Data science competition platform (there are others: DataScience.net in France)
      • 332,000 data scientists
      • Today: 192 competitions, 18 active, plus 516 in-class, 12 active
      • Prestigious clients: Axa, CERN, Caterpillar, Facebook, GM, Microsoft, Yandex…
  • 3. Why should I join?
      • Prize pool?
          • $325,000 to make on August 31st
          • Good luck with that!
          • Not a good hourly wage
      • Today: 192 competitions, 18 active
      • Understand:
          • Lots of datasets on nearly every DS topic
          • Lots of winning solutions, tips and tricks, etc.
          • Lots of “beat the benchmark” posts for beginners
      • I discovered/tested there: GBT, xgboost, Keras, word2vec, BeautifulSoup, hyperopt, ...
  • 4. What is a Data Science Competition?
      Most of the time:
      • You have a train set with labels and a test set without labels.
      • You need to learn a model using the train features and predict the test set labels.
      • Your prediction is evaluated using a specific metric.
      • The best prediction wins.
  • 5. What is a Data Science Competition?
      Most of the time:
      • You have a train set with labels and a test set without labels.
      • You need to learn a model using the train features and predict the test set labels.
      • Your prediction is evaluated using a specific metric.
      • The best prediction wins.
      Questions raised on the slide:
      • Why AUC? F1 score? Log loss? Could that depend on my train/test split?
      • Where do they come from? Do you always have some?
      • Why is the split this way? Random? Time?
  • 6. What is a Data Science Competition?
      What you don’t learn on Kaggle (or in class?):
      • How to model a business question as an ML problem.
      • How to manage/create labels (proxy / missing…).
      • How to evaluate a model:
          • How to choose your metric
          • How to design your train/test split
          • How to account for this in feature engineering
      Understanding this actually helps you in Kaggle competitions:
      • How to design your cross-validation scheme (and not overfit)
      • How to create relevant features
      • Hacks and tricks (leak exploitation :))
  • 8. Christophe Bourguignat, DS cheat sheet, @chris_bour. Today
  • 9. Introduction
      • Introduction
      • Labels?
      • Train and test split?
      • Feature engineering?
      • Evaluation metric?
  • 10. Introduction
      • Introduction
      • Labels?
      • Train and test split?
      • Feature engineering?
      • Evaluation metric?
      Annotations on the slide:
      • The newcomer disillusion
      • The production bad surprise
      • The business obfuscation
  • 11. Who I am
      • Senior Data Scientist at Dataiku (worked on churn prediction, fraud detection, bot detection, recommender systems, graph analytics, smart cities, …)
      • (More than) occasional Kaggle competitor
      • Twitter: @prrgutierrez
  • 14. Fraud Detection
      • Fraud is everywhere: e-business, telco, Medicare, …
      • Easily defined as a classification problem
      • Is the target well defined?
          • E-business: yes, with a lag
          • Elsewhere: needs checks, labels are expensive
  • 15. Churn
      • Wikipedia: “Churn rate (sometimes called attrition rate), in its broadest sense, is a measure of the number of individuals or items moving out of a collective group over a specific period of time”
      • = customers leaving
  • 16. Two types of Churn
      • Subscription models:
          • Telco
          • E-gaming (WoW)
          • Ex: Coyote -> 1-year subscription
          -> you know when someone leaves
      • Non-subscription models:
          • E-business (Amazon, Price Minister, Vente Privée)
          • E-gaming (Candy Crush, free MMORPG)
          -> you approximate someone leaving
          • Candy Crush: days / weeks
          • MMORPG: 2 months (holidays)
          • Price Minister: months
  • 17. Predictive Maintenance
      • Predict whether a vehicle / machine / part is going to fail
      • Classification problem:
          • Given a future horizon and a failure type, will this happen for a given vehicle?
          -> 2 parameters describe the target
          • Varying the target a lot -> spurious correlations
          • Just choose it according to the exact business need
  • 18. Recommender System
      • Target is “will like” or “will buy”
      • Target is often a proxy for the real interest (implicit feedback)
  • 19. Summary on Labels
      • Can you model the problem as an ML problem?
          • Ex: predictive maintenance
          • Ask the right question from a business point of view, not the one you already know how to answer.
      • Is your target a proxy?
          • Recommender system
          • May need a bandit algorithm
      • Is it easy to get labels?
          • Ex: fraud detection
          • Can be expensive
          • Mechanical Turk can be the answer
  • 20. Train / test split
      • Random split
      • Just like in school
      • When and why?
          -> When each line is independent from the rest (not that common!): image classification, document classification, sentiment analysis (“but aha is the new lol”)
          -> When you want to quickly iterate / benchmark: “is it even possible?”
          -> When you want to sell something to your boss
  • 21. Train / test split
      • Column / group based
      • Ex: Caterpillar challenge
          • Predict a price
          • for each tube id
          • Tube ids in train and test are different
      • Objective: being able to generalize to other tubes!
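A minimal sketch of the group-based split described on this slide, assuming rows are plain dicts and that `tube_id`, `quantity`, and `price` are hypothetical column names (they are not taken from the Caterpillar data). The split is drawn over group ids rather than rows, so no tube appears on both sides.

```python
import random

def group_split(rows, group_key, test_fraction=0.3, seed=0):
    """Split rows so that no group id appears in both train and test."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    # Hold out whole groups, never individual rows.
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test

# Hypothetical data: 5 tubes, 3 quoted quantities each.
rows = [
    {"tube_id": f"T{i % 5}", "quantity": q, "price": 10.0 + i}
    for i, q in enumerate([1, 5, 10] * 5)
]
train, test = group_split(rows, "tube_id")
```

A score measured this way estimates performance on tubes the model has never seen, which is the stated objective.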
  • 22. Train / test split
      • Time based
      • Simply separate train and test on a time variable
      • When and why?
          -> When you want a model that “predicts the future”
          -> When things evolve with time! (most problems!)
          -> Examples: ad click prediction, churn prediction, e-business fraud detection, predictive maintenance, …
  • 23. Train / test split: Churn example
      • Non-subscription example
      • Target: 4 months without buying
      • Features?
  • 24. Ex: Train and predict scheme
      [Timeline diagram: time axis with marks at T – 4 months and T (present time)]
      • Data before T – 4 months is used for feature generation.
      • Data between T – 4 months and T is used for target creation: activity during the last 4 months.
      • Train the model using features and target.
      • Use the model to predict future churn.
  • 25. Ex: Train, Evaluation and Predict Scheme
      [Timeline diagram: time axis with marks at T – 8 months, T – 4 months and T (present time)]
      • Training: data before T – 8 months is used for feature generation; data between T – 8 months and T – 4 months is used for target creation (activity during those 4 months).
      • Validation set: data before T – 4 months is used for feature generation; data between T – 4 months and T is used for target creation (activity during the last 4 months).
      • Evaluate on the target of the validation set.
      • Use the model to predict future churn.
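The scheme above can be sketched as one function that builds features and target relative to a cutoff date, then is called twice with shifted cutoffs. This is only an illustration: the purchase log, the two features, and the approximation of "4 months" as 120 days are all assumptions, not the author's implementation.

```python
from datetime import date, timedelta

HORIZON = timedelta(days=120)  # assumption: "4 months" approximated as 120 days

def build_dataset(purchases, cutoff, horizon=HORIZON):
    """Return {customer: (features, churned)} as seen at `cutoff`.

    Features use only purchases strictly before `cutoff`.
    The target uses only purchases in [cutoff, cutoff + horizon).
    """
    history = {}
    active_after = set()
    for customer, day in purchases:
        if day < cutoff:
            history.setdefault(customer, []).append(day)
        elif day < cutoff + horizon:
            active_after.add(customer)
    dataset = {}
    for customer, days in history.items():
        features = {
            "n_purchases": len(days),
            "days_since_last": (cutoff - max(days)).days,
        }
        churned = customer not in active_after  # no purchase during the horizon
        dataset[customer] = (features, churned)
    return dataset

today = date(2016, 1, 1)  # T, hypothetical present time
purchases = [  # hypothetical (customer_id, purchase_date) log
    ("a", date(2015, 2, 1)), ("a", date(2015, 7, 1)), ("a", date(2015, 11, 1)),
    ("b", date(2015, 3, 1)), ("b", date(2015, 6, 15)),
    ("c", date(2015, 4, 1)),
]

# Training: features before T - 8 months, target from T - 8 months to T - 4 months.
train_set = build_dataset(purchases, today - 2 * HORIZON)
# Validation: features before T - 4 months, target from T - 4 months to T.
valid_set = build_dataset(purchases, today - HORIZON)
```

Because the same function produces both sets, the validation features never see data from the validation target window.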
  • 26. Train / test split
      • More complex designs:
          • Graph sampling (fraud rings?)
          • Random sampling in client / machine life
          • Mix of column based and time based
          • …
      • The rule:
          1) What is the problem?
          2) To what would I like to generalize my model? The future? Other individuals? …
          3) => Train / test split
  • 27. EX PHM Society (fail example)
      • Predictive maintenance problem
      • Objective: predict failure in the next 3 days.
      • Metric is proportional to accuracy (and 0.57 is the best score!)
      • Link to data: https://www.phmsociety.org/events/conference/phm/14/data-challenge
  • 31. EX PHM Society
      • How to design the evaluation scheme?
      • What is the probability that an asset fails in the next 3 days from now?
          -> classification problem
          -> time-based split
          -> but how do I create a train and a test set?
      • Choose a date and evaluate what happens 3 days later?
          -> problem: not enough failures happening
      • Choose several dates for each asset?
          -> beware of asset over-fitting
      • In the challenge: random selection of (asset, date) in the future + over-sampling of failures.
  • 32. EX PHM Society
      • Basic feature engineering
  • 33. EX PHM Society
      • Random sampling
      • This is decent! “With some more work I could have a model that beats randomness enough to be useful”
  • 34. EX PHM Society
      • Time-based split
      • Wait, what?
  • 35. EX PHM Society
      • TIME LEAK
  • 36. EX PHM Society
      • TIME LEAK
      • Tree cuts
  • 37. Feature Engineering
      • Beware of the distribution of your features!
      • Is there a time dependency?
          • Ex: count, sum, … that will only increase with time
          -> Calculate counts and sums rescaled by time / in moving windows instead.
          • Can be found in churn, fraud detection, ad click prediction, …
      • Is there a categorical variable dependency?
          • Ex: email flag in fraud detection
      • Is there a network dependency?
          • Ex: fraud / bot detection (network features can be useful but leaky)
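A small sketch of the "rescaled by time / moving window" idea, assuming events are given as a list of dates per asset. The function names, the 30-day window, and the toy event stream are illustrative choices. The raw cumulative count keeps growing with the observation date, while the windowed count and the per-day rate stay comparable between an early and a late snapshot.

```python
from datetime import date, timedelta

def count_in_window(event_dates, as_of, window_days=30):
    """Number of events in [as_of - window_days, as_of)."""
    start = as_of - timedelta(days=window_days)
    return sum(1 for d in event_dates if start <= d < as_of)

def events_per_day(event_dates, first_seen, as_of):
    """Cumulative count rescaled by the observed lifetime in days."""
    lifetime_days = max((as_of - first_seen).days, 1)
    return sum(1 for d in event_dates if d < as_of) / lifetime_days

# Hypothetical asset with one event every 10 days during 2015.
first_seen = date(2015, 1, 1)
events = [first_seen + timedelta(days=10 * i) for i in range(36)]

early, late = date(2015, 3, 1), date(2015, 12, 1)
# The raw cumulative count leaks the observation date: it only grows.
raw_early = sum(1 for d in events if d < early)
raw_late = sum(1 for d in events if d < late)
```

A tree model can cut on the raw count to recover "is this row late in time?", which is exactly the kind of leak the slides warn about.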
  • 38. Feature Engineering
      • Final trick:
          - Stack train and test and add an is_test boolean
          - Try to predict is_test
          - Check if the model is able to predict it
          - If so:
              - check the feature importance
              - remove / modify the feature and iterate
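A simplified, dependency-free sketch of this trick. Instead of training a full classifier on is_test and reading feature importances, it scores each feature on its own by how well its raw value ranks test rows above train rows (a univariate AUC). That substitution is an assumption made to keep the example short, and the feature names are hypothetical. An AUC near 0.5 means the feature looks the same in train and test; an AUC near 0 or 1 flags a feature that gives the split away.

```python
def auc(scores, labels):
    """Probability that a random positive scores above a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def drift_report(train_rows, test_rows, feature_names):
    """Stack train and test, label rows with is_test, and score each feature alone."""
    stacked = train_rows + test_rows
    is_test = [0] * len(train_rows) + [1] * len(test_rows)
    return {
        name: auc([row[name] for row in stacked], is_test)
        for name in feature_names
    }

# Hypothetical data: the cumulative count only grows with time, temperature does not.
train_rows = [{"cumulative_count": c, "temperature": t}
              for c, t in [(1, 20), (2, 22), (3, 19), (4, 21)]]
test_rows = [{"cumulative_count": c, "temperature": t}
             for c, t in [(5, 21), (6, 20), (7, 22), (8, 19)]]

report = drift_report(train_rows, test_rows, ["cumulative_count", "temperature"])
```

In practice a multivariate model catches leaks that only show up through feature interactions, which this per-feature check would miss.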
  • 39. Feature Engineering
      • Final trick:
      • Back to the PHM example: huge time leak!
  • 40. Evaluation metric: Classification
      • “Threshold dependent”
          • Accuracy
          • Precision and recall
          • F1 score
      • “Threshold independent”
          • AUC
          • Log loss
          • Others (mean average precision)…
  • 41. Evaluation metric: Classification
      • “Threshold dependent”
          • Accuracy
          • Precision and recall
          • F1 score
      • “Threshold independent”
          • AUC
          • Log loss
          • Others (mean average precision)…
      • Custom
      Annotations on the slide:
      • Not good if unbalanced target
      • When you have an order problem
      • When you are going stochastic
      • When you need to stick to business
      • Accuracy alternative
  • 42. Evaluation metric: Classification
      • Custom metrics
      • Cost based
      • Ex fraud:
          • Mean loss of $50 / fraud (FN)
          • Mean loss of $20 / wrongly cancelled transaction (FP)
      • F1 score is often used in papers
      • In practice, you often have a business cost
      [Confusion matrix diagram: TP, FN, TN, FP]
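A minimal sketch of a cost-based metric using the two figures from this slide ($50 per missed fraud, $20 per wrongly cancelled transaction). The scores, labels, and the exhaustive threshold search are illustrative assumptions.

```python
COST_FN = 50.0  # mean loss per missed fraud (from the slide)
COST_FP = 20.0  # mean loss per wrongly cancelled transaction (from the slide)

def total_cost(scores, labels, threshold):
    """Business cost of blocking every transaction with score >= threshold."""
    cost = 0.0
    for score, is_fraud in zip(scores, labels):
        blocked = score >= threshold
        if is_fraud and not blocked:
            cost += COST_FN  # false negative: fraud went through
        elif blocked and not is_fraud:
            cost += COST_FP  # false positive: legitimate transaction cancelled
    return cost

def best_threshold(scores, labels):
    """Try every observed score (plus 'block nothing') and keep the cheapest."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates, key=lambda t: total_cost(scores, labels, t))

# Hypothetical model scores and true fraud labels.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 0, 0, 0]

threshold = best_threshold(scores, labels)
```

Unlike F1, this metric changes with the ratio of the two costs, so the chosen threshold follows the business trade-off.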
  • 43. Evaluation metric: Classification
      • Custom metrics
      • Fraud example 1:
          • “I have fraudsters on my e-business website”
          • I generate a score for each transaction
          • I manually review transactions with a score higher than a threshold
          • I have 1 person who does this full time and can deal with 100 transactions / day
          • The rest is automatically accepted
          -> AUC is not bad
          -> Recall within the 100 transactions reviewed per day
          -> Total money blocked within the 100 transactions reviewed per day
      • In practice AUC is more stable… but the money metric can also be used for communication.
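The two capacity-based metrics above can be sketched as follows, assuming one day of transactions is given as (score, is_fraud, amount) tuples. The tuple layout, the function name, and the toy day are hypothetical; the slide's capacity would be 100.

```python
def review_metrics(transactions, capacity):
    """Recall and money blocked when only the `capacity` top-scored transactions are reviewed.

    transactions: list of (score, is_fraud, amount) for one day.
    """
    ranked = sorted(transactions, key=lambda t: t[0], reverse=True)
    reviewed = ranked[:capacity]
    total_fraud = sum(1 for _, is_fraud, _ in transactions if is_fraud)
    caught = sum(1 for _, is_fraud, _ in reviewed if is_fraud)
    money_blocked = sum(amount for _, is_fraud, amount in reviewed if is_fraud)
    recall = caught / total_fraud if total_fraud else 0.0
    return recall, money_blocked

# Hypothetical day with 5 transactions and a reviewer who can handle 2 of them.
day = [
    (0.9, True, 100.0),
    (0.7, False, 30.0),
    (0.6, True, 50.0),
    (0.2, True, 10.0),
    (0.1, False, 5.0),
]
recall, money_blocked = review_metrics(day, capacity=2)
```

Both numbers depend only on the ranking of the scores, which is why AUC is a reasonable proxy in this setting.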
  • 44. Evaluation metric: Classification
      • Custom metrics
      • Fraud example 2:
          • “I have fraudsters on my e-business website”
          • I generate a score for each transaction
          • I handle this automatically by blocking all transactions with a score higher than a threshold
          -> AUC is not bad… but it doesn’t give a threshold value.
          -> F1 score?
          -> Cost based is better
  • 45. Evaluation metric: Classification
      • My cheat sheet
      Metric        | Optimized by ML model? | Threshold dependent | Application example
      Accuracy      | YES                    | YES                 | image classification, NLP, …
      F1 score      | NO                     | YES                 | ? Papers ?
      AUC           | NO                     | NO                  | fraud detection, churn, healthcare, …
      Log loss      | YES                    | NO                  | ad click prediction
      Custom metric | NO                     | ?                   | all ?
  • 46. Conclusion
      • The business question dictates the evaluation scheme!
          • test set design
          • evaluation metric
          • indirectly impacts feature engineering
          • indirectly impacts label quality
      • Think (not too much) before coding
      • Don’t try to optimize the wrong problem!
  • 47. Thank you for your attention!