Before Kaggle
From a business goal to a ML problem
Pierre Gutierrez
@prrgutierrez
•  Data Science competitions platform
(There are others : DataScience.net in France)
•  332,000 Data Scientists
•  today : 192 competitions, 18 active
+ 516 In class, 12 active
•  Prestigious clients : Axa, Cern, Caterpillar, Facebook, GM, Microsoft, Yandex…
What is Kaggle ?
•  Prize pool?
•  325,000 $ to make on August 31st
•  Good luck with that !
•  Not a good hourly wage
Understand :
•  Lots of datasets about approximately every DS topic
•  Lots of winning solutions, tips and tricks, etc…
•  Lots of “beat the benchmark” for beginners
I discovered/tested there : GBT, xgboost, Keras, word2vec, BeautifulSoup, hyperopt, ...
Why should I join ?
Most of the time:
•  You have a train set with labels and a test set without labels.
•  You need to learn a model using the train features and predict the test set labels
•  Your prediction is evaluated using a specific metric
•  The best prediction wins
What is a Data Science Competition ?
Why AUC? F1 score? Log loss?
Could that depend on my train/test split?
Where do they come from ? Do you always have some?
Why is the split this way? Random? Time?
What you don’t learn on Kaggle (or in class?):
•  How to model a business question into a ML problem.
•  How to manage/create labels. (proxy / missing…)
•  How to evaluate a model:
•  How to choose your metric
•  How to design your train/test split
•  How to account for this in feature engineering
Understanding this actually helps you in Kaggle competitions :
•  How to design your cross validation scheme (and not overfit)
•  How to create relevant features
•  Hacks and tricks (leak exploitation :))
What is a Data Science Competition?
Scikit learn cheat sheet
Christophe Bourguignat DS cheat sheet
@chris_bour
Today
•  Introduction
•  Labels?
•  Train and test split?
•  Feature Engineering?
•  Evaluation Metric?
Introduction
The newcomer disillusion
The production bad surprise
The business obfuscation
•  Senior Data Scientist at Dataiku
(worked on churn prediction, fraud detection, bot detection, recommender systems,
graph analytics, smart cities,…)
•  (More than) Occasional Kaggle competitor
•  Twitter @prrgutierrez
Who I am
•  Fraud is everywhere
E-business, Telco, Medicare,…
•  Easily defined as a classification problem
•  Is the target well defined ?
•  E-business : yes, with a lag
•  Elsewhere : needs checks,
labels are expensive
Fraud Detection
•  Wikipedia:
“Churn rate (sometimes called attrition rate), in its broadest sense, is a measure of the
number of individuals or items moving out of a collective group over a specific period of
time”
= Customer leaving
Churn
•  Subscription models:
•  Telco
•  E-gaming (WoW)
•  Ex : Coyote -> 1 year subscription
-> you know when someone leaves
•  Non subscription models:
•  E-Business (Amazon, Price Minister, Vente Privée)
•  E-gaming (Candy Crush, free MMORPG)
-> you approximate someone leaving
Candy Crush: days / weeks
MMORPG: 2 months (holidays)
Price Minister: months
Two types of Churn
•  Predict if a vehicle / machine / part is going to fail
•  Classification Problem:
•  Given a future horizon and a failure type, will this failure happen for a given vehicle ?
-> 2 parameters describe the target
•  Varying the target a lot -> spurious correlations
•  Just derive it from the exact business need
Predictive Maintenance
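The two parameters that describe the target (horizon and failure type) can be made explicit in a small labeling helper. This is a minimal sketch, assuming failures are stored as (asset, date, failure type) tuples; the function name, the tuple layout and the toy data are hypothetical, not taken from the talk.

```python
from datetime import date, timedelta

def failure_target(failures, asset, as_of, horizon_days, failure_type):
    """Return 1 if `asset` has a failure of `failure_type`
    in the window (as_of, as_of + horizon_days], else 0."""
    end = as_of + timedelta(days=horizon_days)
    return int(any(
        a == asset and kind == failure_type and as_of < day <= end
        for a, day, kind in failures
    ))

# Toy failure log (hypothetical data).
failures = [
    ("truck_1", date(2016, 3, 10), "engine"),
    ("truck_2", date(2016, 3, 25), "brakes"),
]

label = failure_target(failures, "truck_1", date(2016, 3, 8), 3, "engine")
```

Changing `horizon_days` or `failure_type` changes the label itself, which is why the slide advises fixing both from the business need rather than tuning them.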
•  Target is “will like” or “will buy”
•  Target is often a proxy for the real interest (implicit feedback)
Recommender System
•  Can you model the problem as a ML problem?
•  Ex : predictive maintenance
•  Ask the right question from a business point of view.
Not what you know how to do.
•  Is your target a proxy?
•  Recommendation system
•  May need a bandit algorithm
•  Is it easy to get labels?
•  Ex : Fraud detection
•  Can be expensive
•  Mechanical Turk can be the answer
Summary on Labels
•  Random Split
•  Just like in school
Train / test split
	
  
•  When and why ?
-> When each line is independent from the rest (not that common !)
image, document classification, sentiment analysis (“but aha is the new lol”)
-> When you want to quickly iterate / benchmark: “is it even possible?”
-> When you want to sell something to your boss
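A random split can be sketched in a few lines. This is only an illustration using the standard library; the helper name `random_split`, the 30% test fraction and the toy rows are assumptions made for the example.

```python
import random

def random_split(rows, test_fraction=0.2, seed=0):
    """Shuffle row indices and hold out a fraction of them as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test

# Toy rows, assumed independent from each other (e.g. documents to classify).
documents = [f"doc_{i}" for i in range(10)]
train_docs, test_docs = random_split(documents, test_fraction=0.3)
```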
  
•  Column / group based
Ex : Caterpillar challenge
•  Predict a price
•  for each tube id
•  Tube id in train and test
are different
Objective :
being able to generalize to
other tubes!
Train / test split
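A group-based split keeps every row of a given group on the same side, so the test score measures generalization to unseen groups. The sketch below assumes rows are dictionaries with a `tube_id` key; the field names and prices are made up for illustration and are not the Caterpillar data.

```python
import random

def group_split(rows, group_key, test_fraction=0.25, seed=0):
    """Hold out whole groups: no group appears in both train and test."""
    groups = sorted({row[group_key] for row in rows})
    random.Random(seed).shuffle(groups)
    n_test = max(1, int(round(len(groups) * test_fraction)))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test

# Hypothetical quotes: two rows per tube.
quotes = [
    {"tube_id": f"TA-{i}", "quantity": q, "price": 10.0 * i / q}
    for i in range(1, 5)
    for q in (1, 10)
]
train_quotes, test_quotes = group_split(quotes, "tube_id")
```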
•  Time based
•  Simply separate train and test on a time variable
•  When and Why?
-> When you want a model that “predicts the future”
-> When things evolve with time! (most problems!)
-> Examples :
Ad click prediction, Churn prediction, E-business Fraud detection, Predictive
maintenance,…
Train / test split
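A time-based split simply cuts on a date: everything before the cutoff is used for training, everything from the cutoff onward for testing. A minimal sketch, where the `day` field, the cutoff date and the toy events are assumptions for the example:

```python
from datetime import date

def time_split(rows, time_key, cutoff):
    """Train on rows strictly before `cutoff`, test on rows at or after it."""
    train = [r for r in rows if r[time_key] < cutoff]
    test = [r for r in rows if r[time_key] >= cutoff]
    return train, test

# Hypothetical click events.
events = [
    {"day": date(2016, 1, 5), "clicked": 0},
    {"day": date(2016, 2, 10), "clicked": 1},
    {"day": date(2016, 2, 28), "clicked": 0},
    {"day": date(2016, 3, 3), "clicked": 1},
    {"day": date(2016, 4, 12), "clicked": 0},
]
train_events, test_events = time_split(events, "day", date(2016, 3, 1))
```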
•  Non-subscription example
•  Target : 4 months without buying
•  Features ?
Train / test split : Churn example
Ex : Train and predict scheme
[Diagram: time axis with marks at T - 4 months and T (present time)]
From T - 4 months to T : data is used for target creation (activity during the last 4 months)
Before T - 4 months : data is used for feature generation
Train model using features and target
Use model to predict future churn
Ex : Train Evaluation and Predict Scheme
[Diagram: time axis with marks at T - 8 months, T - 4 months and T (present time)]
Training : data before T - 8 months is used for feature generation; data from T - 8 months to T - 4 months is used for target creation (activity during the last 4 months)
Validation set : data before T - 4 months is used for feature generation; data from T - 4 months to T is used for target creation (activity during the last 4 months)
Evaluate on the target of the validation set
Use model to predict future churn
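The train / evaluation scheme can be sketched by building the same dataset at two different cutoff dates. This is one reading of the diagram, not the author's code: "4 months" is approximated as 120 days, purchases are (customer, date) pairs, and the two features are placeholders chosen for illustration.

```python
from datetime import date, timedelta

HORIZON = timedelta(days=120)  # assumption: "4 months" approximated as 120 days

def build_dataset(purchases, cutoff, horizon=HORIZON):
    """Features use purchases strictly before `cutoff`;
    the churn target uses purchases in [cutoff, cutoff + horizon)."""
    history = {}
    active_after = set()
    for customer, day in purchases:
        if day < cutoff:
            history.setdefault(customer, []).append(day)
        elif day < cutoff + horizon:
            active_after.add(customer)
    rows = []
    for customer, days in sorted(history.items()):
        rows.append({
            "customer": customer,
            "n_purchases": len(days),
            "days_since_last": (cutoff - max(days)).days,
            "churned": int(customer not in active_after),
        })
    return rows

present = date(2016, 9, 1)  # T (hypothetical)
purchases = [
    ("alice", date(2015, 11, 3)), ("alice", date(2016, 2, 10)), ("alice", date(2016, 6, 1)),
    ("bob", date(2015, 12, 20)),
    ("carol", date(2016, 3, 15)), ("carol", date(2016, 7, 20)),
]

# Training rows: features before T - 8 months, target between T - 8 and T - 4 months.
train_rows = build_dataset(purchases, present - 2 * HORIZON)
# Validation rows: features before T - 4 months, target between T - 4 months and T.
valid_rows = build_dataset(purchases, present - HORIZON)
```

Because the feature window always ends where the target window starts, no feature can see the period that defines its own label.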
  
•  More complex design
•  Graph sampling (fraud rings ? )
•  Random sampling in client / machine life
•  Mix of column based and time based …
•  The rule :
1)  What is the problem ?
2)  To what would I like to generalize my model ?
Future ? Other individuals ? …
3)  => Train / Test split
Train / test split
•  Predictive Maintenance problem
•  Objective : predict failure in next 3 days.
•  Metric is proportional to accuracy (and 0.57 is the best score !)
•  Link to data :
https://www.phmsociety.org/events/conference/phm/14/data-challenge
EX PHM Society (Fail example)
•  Failures
EX PHM Society
•  Usage
EX PHM Society
•  Part Replacements
EX PHM Society
•  How to design the evaluation scheme?
•  What is the probability that an asset fails in the next 3 days from now?
-> classification problem
-> Time based split
-> but how do I create a train and a test?
•  Choose a date and evaluate what happens 3 days later?
-> problem : not enough failures happening
•  Choose several dates for each asset?
-> beware of asset over-fitting
•  In the challenge : random selection of (asset, date) in the future + over sampling of
failures.
EX PHM Society
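One way to picture the sampling of (asset, date) pairs with oversampling of failures is sketched below. It is an illustration under stated assumptions, not the challenge's actual procedure: every positive pair is kept, negatives are kept at a hypothetical rate of 10%, and the asset names and dates are invented.

```python
import random
from datetime import date, timedelta

def sample_pairs(assets, start, n_days, failure_days, horizon_days=3,
                 negative_rate=0.1, seed=0):
    """Label each (asset, day) by 'failure within the next horizon_days',
    keep all positives and a random fraction of the negatives."""
    rng = random.Random(seed)
    horizon = timedelta(days=horizon_days)
    pairs = []
    for asset in assets:
        for offset in range(n_days):
            day = start + timedelta(days=offset)
            label = int(any(day < f <= day + horizon
                            for f in failure_days.get(asset, [])))
            if label == 1 or rng.random() < negative_rate:
                pairs.append((asset, day, label))
    return pairs

failure_days = {"a1": [date(2016, 1, 10)]}  # hypothetical failure log
pairs = sample_pairs(["a1", "a2"], date(2016, 1, 1), 20, failure_days)
```

Several pairs from the same asset end up in the sample, so the asset over-fitting warning applies: such pairs should not be spread across both train and test.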
•  Basic Feature engineering
EX PHM Society
•  Random Sampling
EX PHM Society
This is decent! « With some more work I could have a model that beats randomness enough to be useful »
•  Time based split
EX PHM Society
Wait, what?
•  TIME LEAK
EX PHM Society
Tree cuts
•  Beware of the distribution of your features!
•  Is there a time dependency?
•  Ex : count, sum, … that will only increase with time
•  -> Calculate count and sum rescaled by time / in moving windows instead.
•  Can be found in Churn, Fraud detection, Ad click prediction,…
•  A categorical variable dependency?
•  Ex : email flag in fraud detection
•  Is there a Network dependency?
•  Ex : Fraud / Bot detection (network features can be useful but leaky)
Feature Engineering
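The difference between a raw count and its time-rescaled or moving-window versions can be sketched as follows. The function name, the 30-day window and the toy dates are assumptions made for the example.

```python
from datetime import date, timedelta

def activity_features(event_days, first_seen, as_of, window_days=30):
    """Compare a cumulative count with time-aware alternatives."""
    past = [d for d in event_days if d <= as_of]
    lifetime_days = max((as_of - first_seen).days, 1)
    window_start = as_of - timedelta(days=window_days)
    return {
        "count_total": len(past),                    # only increases with time
        "count_per_day": len(past) / lifetime_days,  # rescaled by time
        "count_last_window": sum(d > window_start for d in past),  # moving window
    }

events = [date(2016, 1, 1), date(2016, 1, 15), date(2016, 3, 1), date(2016, 3, 20)]
feats = activity_features(events, first_seen=date(2016, 1, 1), as_of=date(2016, 3, 31))
```

With a time-based split, `count_total` is systematically larger in the test period than in the train period, while the other two features stay comparable across periods.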
•  Final trick :
-  Stack train and test and add is_test boolean
-  Try to predict is_test
-  Check if the model is able to predict
-  If so :
-  check the feature importance
-  Remove / modify feature and iterate
Feature Engineering
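The is_test trick can be approximated without any ML library. The sketch below swaps the model and its feature importance for a simpler stand-in: a rank-based AUC of each feature taken alone against the is_test flag. Feature names and values are hypothetical.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def is_test_auc_per_feature(train_rows, test_rows):
    """Stack train and test, flag test rows, and score each feature alone."""
    stacked = train_rows + test_rows
    is_test = [0] * len(train_rows) + [1] * len(test_rows)
    return {
        name: auc([row[name] for row in stacked], is_test)
        for name in train_rows[0]
    }

train_rows = [
    {"cumulative_usage": 10, "temperature": 20},
    {"cumulative_usage": 25, "temperature": 22},
    {"cumulative_usage": 40, "temperature": 19},
]
test_rows = [
    {"cumulative_usage": 60, "temperature": 21},
    {"cumulative_usage": 75, "temperature": 19},
    {"cumulative_usage": 90, "temperature": 22},
]
report = is_test_auc_per_feature(train_rows, test_rows)
```

An AUC close to 0.5 means the feature does not separate train from test; a value close to 0 or 1 (here `cumulative_usage`) points to a feature worth removing or rescaling.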
•  Final trick:
•  Back to PHM example:
Feature Engineering
Huge time leak !
•  “Threshold dependent”
•  Accuracy
•  Precision and Recall
•  F1 score
•  “Threshold independent”
•  AUC
•  Log Loss
•  Others (Mean average precision)…
•  Custom metrics
Evaluation metric : Classification
Not good if unbalanced target
When you have an order problem
When you are going stochastic
When you need to stick to business
Accuracy alternative
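The difference between threshold-dependent and threshold-independent metrics shows up clearly on an unbalanced toy target. The sketch below computes the metrics from scratch with the standard library; the 5% positive rate, the scores and the 0.5 threshold are assumptions made for the example.

```python
import math

def auc(y_true, y_prob):
    """Rank-based AUC: ties between a positive and a negative count 1/2."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def classification_metrics(y_true, y_prob, threshold=0.5):
    y_pred = [int(p >= threshold) for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    eps = 1e-15  # avoid log(0)
    log_loss = -sum(
        t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
        for t, p in zip(y_true, y_prob)
    ) / len(y_true)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "auc": auc(y_true, y_prob),
        "log_loss": log_loss,
    }

# 5% positives; the model ranks them first but never scores above 0.5.
y_true = [0] * 95 + [1] * 5
y_prob = [0.1] * 95 + [0.4] * 5
metrics = classification_metrics(y_true, y_prob)
```

On this toy data accuracy is 0.95 although no positive is ever predicted (recall and F1 are 0), while the AUC is 1.0 because the ordering of the scores is perfect.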
  
•  Custom metrics
•  Cost based
•  Ex Fraud:
•  Mean loss of 50 $ / fraud (FN)
•  Mean loss of 20 $ / wrongly cancelled transaction (FP)
•  F1 score often used in papers
•  in practice, you often have a business cost
Evaluation metric : Classification
[Figure: confusion matrix with cells TP, FN, TN, FP]
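The cost-based metric from the fraud example (50 $ per missed fraud, 20 $ per wrongly cancelled transaction) can be turned into a threshold choice. A minimal sketch: the two costs come from the slide, while the labels, scores and candidate thresholds are invented for illustration.

```python
COST_FN = 50.0  # mean loss per missed fraud (from the slide)
COST_FP = 20.0  # mean loss per wrongly cancelled transaction (from the slide)

def total_cost(y_true, scores, threshold):
    """Business cost of blocking every transaction with score >= threshold."""
    cost = 0.0
    for is_fraud, score in zip(y_true, scores):
        blocked = score >= threshold
        if is_fraud and not blocked:
            cost += COST_FN
        elif blocked and not is_fraud:
            cost += COST_FP
    return cost

def best_threshold(y_true, scores, candidates):
    """Pick the candidate threshold with the lowest total cost."""
    return min(candidates, key=lambda t: total_cost(y_true, scores, t))

# Hypothetical labels and model scores.
y_true = [0, 0, 1, 0, 1, 0, 1, 0]
scores = [0.05, 0.20, 0.35, 0.40, 0.60, 0.65, 0.90, 0.10]
candidates = [0.1, 0.3, 0.5, 0.7]
chosen = best_threshold(y_true, scores, candidates)
```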
  
•  Custom metrics
•  Fraud Example 1:
•  “I have fraudsters on my e-business website”
•  I generate a score for each transaction
•  I handle this by manually reviewing transactions with a score higher than a threshold
•  I have 1 person doing this full time, able to deal with 100 transactions / day
•  The rest is automatically accepted
-> AUC is not bad
-> Recall within the top 100 transactions / day
-> Total money blocked within the top 100 transactions / day
In practice AUC is more stable… But the money metric can also be used for communication.
Evaluation metric : Classification
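When review capacity is fixed, the metric can be computed on the top-scored transactions only. The sketch below assumes that "money blocked" means the total amount of the frauds caught among the reviewed transactions; the field names, amounts and the capacity of 2 (instead of 100) are chosen to keep the example small.

```python
def review_capacity_metrics(transactions, capacity):
    """Recall and fraud amount caught when only the `capacity`
    highest-scored transactions are reviewed."""
    ranked = sorted(transactions, key=lambda t: t["score"], reverse=True)
    reviewed = ranked[:capacity]
    total_fraud = sum(t["is_fraud"] for t in transactions)
    caught = sum(t["is_fraud"] for t in reviewed)
    recall = caught / total_fraud if total_fraud else 0.0
    money_blocked = sum(t["amount"] for t in reviewed if t["is_fraud"])
    return recall, money_blocked

# One hypothetical day of transactions.
transactions = [
    {"score": 0.9, "amount": 120.0, "is_fraud": 1},
    {"score": 0.8, "amount": 30.0, "is_fraud": 0},
    {"score": 0.7, "amount": 200.0, "is_fraud": 1},
    {"score": 0.2, "amount": 15.0, "is_fraud": 0},
    {"score": 0.1, "amount": 60.0, "is_fraud": 1},
]
recall_at_capacity, money_blocked = review_capacity_metrics(transactions, capacity=2)
```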
•  Custom metrics
•  Fraud Example 2:
•  “I have fraudsters on my e-business website”
•  I generate a score for each transaction
•  I handle this automatically by blocking all transactions with score higher than threshold
-> AUC is not bad… but it doesn’t give a threshold value.
-> F1–Score?
-> Cost based is better
Evaluation metric : Classification
•  My cheat sheet
Evaluation metric : Classification
Metric          Optimized by ML model ?   Threshold dependent   Application example
Accuracy        YES                       YES                   image classification, NLP …
F1-score        NO                        YES                   ? Papers ?
AUC             NO                        NO                    fraud detection, churn, healthcare …
Log-Loss        YES                       NO                    ad click prediction
Custom metric   NO                        ?                     all ?
•  Business Question dictates Evaluation Scheme!
•  test set design
•  evaluation metric
•  Indirectly impacts feature engineering
•  Indirectly impacts label quality
•  Think (not too much) before coding
•  Don’t try to optimize the wrong problem!
Conclusion
Thank you for your attention!

More Related Content

What's hot

Online Games Analytics - Data Science for Fun
Online Games Analytics - Data Science for FunOnline Games Analytics - Data Science for Fun
Online Games Analytics - Data Science for FunDataiku
 
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum ShachamH2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum ShachamSri Ambati
 
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages JaunesBreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages JaunesDataiku
 
The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016 The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016 Dataiku
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by DatabricksCaserta
 
Dataiku - google cloud platform roadshow - october 2013
Dataiku  - google cloud platform roadshow - october 2013Dataiku  - google cloud platform roadshow - october 2013
Dataiku - google cloud platform roadshow - october 2013Dataiku
 
Course 3 : Types of data and opportunities by Nikolaos Deligiannis
Course 3 : Types of data and opportunities by Nikolaos DeligiannisCourse 3 : Types of data and opportunities by Nikolaos Deligiannis
Course 3 : Types of data and opportunities by Nikolaos DeligiannisBetacowork
 
Reducing Technology Risks Through Prototyping
Reducing Technology Risks Through Prototyping Reducing Technology Risks Through Prototyping
Reducing Technology Risks Through Prototyping Valdas Maksimavičius
 
OWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - DataikuOWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - DataikuDataiku
 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku
 
Course 1 - Introduction to Big Data by Toon Vanagt ( #BigDataBXL)
Course 1 - Introduction to Big Data by Toon Vanagt ( #BigDataBXL)Course 1 - Introduction to Big Data by Toon Vanagt ( #BigDataBXL)
Course 1 - Introduction to Big Data by Toon Vanagt ( #BigDataBXL)Betacowork
 
Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez Betacowork
 
H2O World - Data Science in Action @ 6sense - Viral Bajaria
H2O World - Data Science in Action @ 6sense - Viral BajariaH2O World - Data Science in Action @ 6sense - Viral Bajaria
H2O World - Data Science in Action @ 6sense - Viral BajariaSri Ambati
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkCaserta
 
You're the New CDO, Now What?
You're the New CDO, Now What?You're the New CDO, Now What?
You're the New CDO, Now What?Caserta
 
Washington DC DataOps Meetup -- Nov 2019
Washington DC DataOps Meetup   -- Nov 2019Washington DC DataOps Meetup   -- Nov 2019
Washington DC DataOps Meetup -- Nov 2019DataKitchen
 
Full-Stack Data Science: How to be a One-person Data Team
Full-Stack Data Science: How to be a One-person Data TeamFull-Stack Data Science: How to be a One-person Data Team
Full-Stack Data Science: How to be a One-person Data TeamGreg Goltsov
 
Dataiku r users group v2
Dataiku   r users group v2Dataiku   r users group v2
Dataiku r users group v2Cdiscount
 

What's hot (20)

Online Games Analytics - Data Science for Fun
Online Games Analytics - Data Science for FunOnline Games Analytics - Data Science for Fun
Online Games Analytics - Data Science for Fun
 
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum ShachamH2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
 
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages JaunesBreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
 
The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016 The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by Databricks
 
Dataiku - google cloud platform roadshow - october 2013
Dataiku  - google cloud platform roadshow - october 2013Dataiku  - google cloud platform roadshow - october 2013
Dataiku - google cloud platform roadshow - october 2013
 
Course 3 : Types of data and opportunities by Nikolaos Deligiannis
Course 3 : Types of data and opportunities by Nikolaos DeligiannisCourse 3 : Types of data and opportunities by Nikolaos Deligiannis
Course 3 : Types of data and opportunities by Nikolaos Deligiannis
 
Reducing Technology Risks Through Prototyping
Reducing Technology Risks Through Prototyping Reducing Technology Risks Through Prototyping
Reducing Technology Risks Through Prototyping
 
OWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - DataikuOWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - Dataiku
 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
 
Course 1 - Introduction to Big Data by Toon Vanagt ( #BigDataBXL)
Course 1 - Introduction to Big Data by Toon Vanagt ( #BigDataBXL)Course 1 - Introduction to Big Data by Toon Vanagt ( #BigDataBXL)
Course 1 - Introduction to Big Data by Toon Vanagt ( #BigDataBXL)
 
Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez
 
H2O World - Data Science in Action @ 6sense - Viral Bajaria
H2O World - Data Science in Action @ 6sense - Viral BajariaH2O World - Data Science in Action @ 6sense - Viral Bajaria
H2O World - Data Science in Action @ 6sense - Viral Bajaria
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache Spark
 
You're the New CDO, Now What?
You're the New CDO, Now What?You're the New CDO, Now What?
You're the New CDO, Now What?
 
Washington DC DataOps Meetup -- Nov 2019
Washington DC DataOps Meetup   -- Nov 2019Washington DC DataOps Meetup   -- Nov 2019
Washington DC DataOps Meetup -- Nov 2019
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Full-Stack Data Science: How to be a One-person Data Team
Full-Stack Data Science: How to be a One-person Data TeamFull-Stack Data Science: How to be a One-person Data Team
Full-Stack Data Science: How to be a One-person Data Team
 
Dataiku r users group v2
Dataiku   r users group v2Dataiku   r users group v2
Dataiku r users group v2
 
DataHub
DataHubDataHub
DataHub
 

Viewers also liked

Kaggle presentation
Kaggle presentationKaggle presentation
Kaggle presentationHJ van Veen
 
How to get started in Kaggle competition
How to get started in Kaggle competitionHow to get started in Kaggle competition
How to get started in Kaggle competitionMerja Kajava
 
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku
 
Defining Your Goal: Starting Your Own Business
Defining Your Goal: Starting Your Own BusinessDefining Your Goal: Starting Your Own Business
Defining Your Goal: Starting Your Own BusinessJoshua Drake
 
A Goal-oriented Approach for Business Process Improvement Using Process Wareh...
A Goal-oriented Approach for Business Process Improvement Using Process Wareh...A Goal-oriented Approach for Business Process Improvement Using Process Wareh...
A Goal-oriented Approach for Business Process Improvement Using Process Wareh...M Khurram Shahzad
 
121206 3-dirty-words-webinar
121206 3-dirty-words-webinar121206 3-dirty-words-webinar
121206 3-dirty-words-webinarLeanne Smith
 
Budgeting_ Wise Use of Credit_Understanding Your Credit Report and Score
Budgeting_ Wise Use of Credit_Understanding Your Credit Report and ScoreBudgeting_ Wise Use of Credit_Understanding Your Credit Report and Score
Budgeting_ Wise Use of Credit_Understanding Your Credit Report and ScoreSpringboard
 
Jopet Pedroso - Business Team Goal Clarity Creates Higher Profits (And Happy ...
Jopet Pedroso - Business Team Goal Clarity Creates Higher Profits (And Happy ...Jopet Pedroso - Business Team Goal Clarity Creates Higher Profits (And Happy ...
Jopet Pedroso - Business Team Goal Clarity Creates Higher Profits (And Happy ...courageasia
 
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learntKaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learntEugene Yan Ziyou
 

Viewers also liked (13)

Kaggle presentation
Kaggle presentationKaggle presentation
Kaggle presentation
 
How to get started in Kaggle competition
How to get started in Kaggle competitionHow to get started in Kaggle competition
How to get started in Kaggle competition
 
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
 
Defining Your Goal: Starting Your Own Business
Defining Your Goal: Starting Your Own BusinessDefining Your Goal: Starting Your Own Business
Defining Your Goal: Starting Your Own Business
 
A Goal-oriented Approach for Business Process Improvement Using Process Wareh...
A Goal-oriented Approach for Business Process Improvement Using Process Wareh...A Goal-oriented Approach for Business Process Improvement Using Process Wareh...
A Goal-oriented Approach for Business Process Improvement Using Process Wareh...
 
10 ways to boost your company sales
10 ways to boost your company sales10 ways to boost your company sales
10 ways to boost your company sales
 
Budgeting 101 Fall Institute 2011 Final
Budgeting 101 Fall Institute 2011 FinalBudgeting 101 Fall Institute 2011 Final
Budgeting 101 Fall Institute 2011 Final
 
Restaurant Profitability 101: Budgeting
Restaurant Profitability 101: BudgetingRestaurant Profitability 101: Budgeting
Restaurant Profitability 101: Budgeting
 
121206 3-dirty-words-webinar
121206 3-dirty-words-webinar121206 3-dirty-words-webinar
121206 3-dirty-words-webinar
 
Budgeting_ Wise Use of Credit_Understanding Your Credit Report and Score
Budgeting_ Wise Use of Credit_Understanding Your Credit Report and ScoreBudgeting_ Wise Use of Credit_Understanding Your Credit Report and Score
Budgeting_ Wise Use of Credit_Understanding Your Credit Report and Score
 
Jopet Pedroso - Business Team Goal Clarity Creates Higher Profits (And Happy ...
Jopet Pedroso - Business Team Goal Clarity Creates Higher Profits (And Happy ...Jopet Pedroso - Business Team Goal Clarity Creates Higher Profits (And Happy ...
Jopet Pedroso - Business Team Goal Clarity Creates Higher Profits (And Happy ...
 
B101 slideshow
B101 slideshowB101 slideshow
B101 slideshow
 
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learntKaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt
Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt
 

Similar to Before Kaggle : from a business goal to a Machine Learning problem

Churn prediction data modeling
Churn prediction data modelingChurn prediction data modeling
Churn prediction data modelingPierre Gutierrez
 
DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...Hakka Labs
 
An Overview of automated testing (1)
An Overview of automated testing (1)An Overview of automated testing (1)
An Overview of automated testing (1)Rodrigo Lopes
 
From science to engineering, the process to build a machine learning product
From science to engineering, the process to build a machine learning productFrom science to engineering, the process to build a machine learning product
From science to engineering, the process to build a machine learning productBruce Kuo
 
DutchMLSchool. ML Business Perspective
DutchMLSchool. ML Business PerspectiveDutchMLSchool. ML Business Perspective
DutchMLSchool. ML Business PerspectiveBigML, Inc
 
Business process simulations: from GREAT! to good, Razvan Radulian, Sept 2013
Business process simulations: from GREAT! to good, Razvan Radulian, Sept 2013Business process simulations: from GREAT! to good, Razvan Radulian, Sept 2013
Business process simulations: from GREAT! to good, Razvan Radulian, Sept 2013Why-What-How Consulting, LLC
 
Drifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in ProductionDrifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in ProductionDatabricks
 
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Rodney Joyce
 
Toolkits and tips for UX analytics CRO by Craig Sullivan
Toolkits and tips for UX analytics CRO by Craig SullivanToolkits and tips for UX analytics CRO by Craig Sullivan
Toolkits and tips for UX analytics CRO by Craig SullivanUXPA UK
 
UXPA UK - Toolkits and Tips for Blending UX, Analytics and CRO
UXPA UK - Toolkits and Tips for Blending UX, Analytics and CROUXPA UK - Toolkits and Tips for Blending UX, Analytics and CRO
UXPA UK - Toolkits and Tips for Blending UX, Analytics and CROCraig Sullivan
 
How to Apply Machine Learning by Lyft Senior Product Manager
How to Apply Machine Learning by Lyft Senior Product ManagerHow to Apply Machine Learning by Lyft Senior Product Manager
How to Apply Machine Learning by Lyft Senior Product ManagerProduct School
 
Agile Estimating and Planning
Agile Estimating and PlanningAgile Estimating and Planning
Agile Estimating and PlanningMojammel Haque
 
Machine Learning & Predictive Maintenance
Machine Learning &  Predictive MaintenanceMachine Learning &  Predictive Maintenance
Machine Learning & Predictive MaintenanceArnab Biswas
 
BMDSE v1 - Data Scientist Deck
BMDSE v1 - Data Scientist DeckBMDSE v1 - Data Scientist Deck
BMDSE v1 - Data Scientist DeckSasha Lazarevic
 
The Machine Learning Workflow with Azure
The Machine Learning Workflow with AzureThe Machine Learning Workflow with Azure
The Machine Learning Workflow with AzureIvo Andreev
 
Unit 1 introduction to simulation
Unit 1 introduction to simulationUnit 1 introduction to simulation
Unit 1 introduction to simulationDevaKumari Vijay
 
Machine Learning With ML.NET
Machine Learning With ML.NETMachine Learning With ML.NET
Machine Learning With ML.NETDev Raj Gautam
 
Data-driven product management
Data-driven product managementData-driven product management
Data-driven product managementArseny Kravchenko
 

Similar to Before Kaggle : from a business goal to a Machine Learning problem (20)

Churn prediction data modeling
Churn prediction data modelingChurn prediction data modeling
Churn prediction data modeling
 
DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...
 
An Overview of automated testing (1)
An Overview of automated testing (1)An Overview of automated testing (1)
An Overview of automated testing (1)
 
From science to engineering, the process to build a machine learning product
From science to engineering, the process to build a machine learning productFrom science to engineering, the process to build a machine learning product
From science to engineering, the process to build a machine learning product
 
DutchMLSchool. ML Business Perspective
DutchMLSchool. ML Business PerspectiveDutchMLSchool. ML Business Perspective
DutchMLSchool. ML Business Perspective
 
Business process simulations: from GREAT! to good, Razvan Radulian, Sept 2013
Business process simulations: from GREAT! to good, Razvan Radulian, Sept 2013Business process simulations: from GREAT! to good, Razvan Radulian, Sept 2013
Business process simulations: from GREAT! to good, Razvan Radulian, Sept 2013
 
Before Kaggle : from a business goal to a Machine Learning problem

  • 1. Before Kaggle: From a business goal to a ML problem. Pierre Gutierrez @prrgutierrez
  • 2. •  Data Science competitions platform (there are others: DataScience.net in France) •  332,000 Data Scientists •  Today: 192 competitions, 18 active + 516 In Class, 12 active •  Prestigious clients: Axa, CERN, Caterpillar, Facebook, GM, Microsoft, Yandex… What is Kaggle?
  • 3. •  Prize pool? •  $325,000 to make on August 31st •  Good luck with that! •  Not a good hourly wage •  Today: 192 competitions, 18 active Understand: •  Lots of datasets about approximately every DS topic •  Lots of winner solutions, tips and tricks, etc. •  Lots of “beat the benchmark” for beginners I discovered/tested there: GBT, xgboost, Keras, word2vec, BeautifulSoup, hyperopt, ... Why should I join?
  • 4. Most of the time: •  You have a train set with labels and a test set without labels. •  You need to learn a model using the train features and predict the test set labels •  Your prediction is evaluated using a specific metric •  The best prediction wins What is a Data Science Competition ?
  • 5. Most of the time: •  You have a train set with labels and a test set without labels. •  You need to learn a model using the train features and predict the test set labels •  Your prediction is evaluated using a specific metric •  The best prediction wins What is a Data Science Competition? Why AUC? F1 score? Log loss? Could that depend on my train/test split? Where do they come from? Do you always have some? Why is the split this way? Random? Time?
  • 6. What you don’t learn on Kaggle (or in class?): •  How to model a business question as a ML problem. •  How to manage/create labels (proxy / missing…) •  How to evaluate a model: •  How to choose your metric •  How to design your train/test split •  How to account for this in feature engineering Understanding this actually helps you in Kaggle competitions: •  How to design your cross validation scheme (and not overfit) •  How to create relevant features •  Hacks and tricks (leak exploitation :) ) What is a Data Science Competition?
  • 8. Christophe Bourguignat DS cheat sheet @chris_bour Today
  • 9. •  Introduction •  Labels? •  Train and test split? •  Feature Engineering? •  Evaluation Metric? Introduction
  • 10. •  Introduction •  Labels? •  Train and test split? •  Feature Engineering? •  Evaluation Metric? Introduction The newcomer disillusion. The production bad surprise. The business obfuscation.
  • 11. •  Senior Data Scientist at Dataiku (worked on churn prediction, fraud detection, bot detection, recommender systems, graph analytics, smart cities,…) •  (More than) Occasional Kaggle competitor •  Twitter @prrgutierrez Who I am
  • 14. •  Fraud is everywhere: e-business, telco, Medicare, … •  Easily defined as a classification problem •  Target well defined? •  E-business: yes, with a lag •  Elsewhere: needs checks, labels are expensive Fraud Detection
  • 15. •  Wikipedia: “Churn rate (sometimes called attrition rate), in its broadest sense, is a measure of the number of individuals or items moving out of a collective group over a specific period of time” = Customer leaving Churn
  • 16. •  Subscription models: •  Telco •  E-gaming (WoW) •  Ex: Coyote -> 1 year subscription -> you know when someone leaves •  Non-subscription models: •  E-business (Amazon, Price Minister, Vente Privée) •  E-gaming (Candy Crush, free MMORPG) -> you approximate someone leaving. Candy Crush: days / weeks; MMORPG: 2 months (holidays); Price Minister: months. Two types of Churn
  • 17. •  Predict if a vehicle / machine / part is going to fail •  Classification problem: •  Given a future horizon and a failure type, will this happen for a given vehicle? -> 2 parameters describe the target •  Varying the target a lot -> spurious correlations •  Just choose it as the result of the exact business need Predictive Maintenance
  • 18. •  Target is “will like” or “will buy” •  Target is often a proxy for the real interest (implicit feedback) Recommender System
  • 19. •  Can you model the problem as a ML problem? •  Ex: predictive maintenance •  Ask the right question from a business point of view, not the one you already know how to answer. •  Is your target a proxy? •  Recommendation system •  May need a bandit algorithm •  Is it easy to get labels? •  Ex: fraud detection •  Can be expensive •  Mechanical Turk can be the answer Summary on Labels
  • 20. •  Random split •  Just like in school •  When and why? -> When each line is independent from the rest (not that common!): image and document classification, sentiment analysis (“but aha is the new lol”) -> When you want to quickly iterate / benchmark: “is it even possible?” -> When you want to sell something to your boss Train / test split
  • 21. •  Column / group based. Ex: Caterpillar challenge •  Predict a price •  for each tube id •  Tube ids in train and test are different. Objective: being able to generalize to other tubes! Train / test split
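A group based split like the one described for the tube ids can be sketched as follows. This is a minimal illustration in plain Python, not the challenge's actual procedure: the `group_split` helper, the `tube_id` column name and the toy rows are all hypothetical.

```python
import random


def group_split(rows, group_key, test_fraction=0.25, seed=0):
    """Split rows so that each group ends up entirely in train or entirely in test."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test


# Hypothetical toy data: 8 tubes, 3 price quotes per tube.
rows = [{"tube_id": f"T{i % 8}", "quantity": q, "price": 10.0 + i}
        for i, q in enumerate([1, 5, 10, 25] * 6)]

train, test = group_split(rows, "tube_id")
train_ids = {r["tube_id"] for r in train}
test_ids = {r["tube_id"] for r in test}
print(sorted(test_ids), len(train), len(test))
```

With a plain random split, quotes for the same tube would land on both sides and the score would overstate how well the model generalizes to unseen tubes. scikit-learn's `GroupKFold` and `GroupShuffleSplit` implement the same idea for cross validation.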
  • 22. •  Time based •  Simply separate train and test on a time variable •  When and why? -> When you want a model that “predicts the future” -> When things evolve with time! (most problems!) -> Examples: ad click prediction, churn prediction, e-business fraud detection, predictive maintenance, … Train / test split
  • 23. •  No-subscription example •  Target: 4 months without buying •  Features? Train / test split: Churn example
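As a rough sketch of a time based split, the helper below separates rows on a date column. The `time_split` function, the `day` field and the toy click events are assumptions made for the illustration.

```python
from datetime import date


def time_split(rows, time_key, cutoff):
    """Train on everything strictly before the cutoff, test on the rest."""
    train = [r for r in rows if r[time_key] < cutoff]
    test = [r for r in rows if r[time_key] >= cutoff]
    return train, test


# Hypothetical toy data: one event on the first day of each month of 2015.
events = [{"day": date(2015, month, 1), "clicked": month % 2}
          for month in range(1, 13)]

train, test = time_split(events, "day", cutoff=date(2015, 10, 1))
print(len(train), len(test))
```

The test set then plays the role of "the future": every test row is later than every train row, which is the situation the model will face once deployed.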
  • 24. Ex: Train and predict scheme. [Timeline diagram: time axis marked T – 4 months and T (present time).] Data before T – 4 months is used for feature generation. Data between T – 4 months and T is used for target creation: activity during the last 4 months. Train model using features and target. Use model to predict future churn.
  • 25. Ex: Train, Evaluation and Predict Scheme. [Timeline diagram: time axis marked T – 8 months, T – 4 months and T (present time).] Training: data before T – 8 months is used for feature generation; data between T – 8 months and T – 4 months is used for target creation (activity during the last 4 months). Validation set: data before T – 4 months is used for feature generation; data between T – 4 months and T is used for target creation (activity during the last 4 months). Evaluate on the target of the validation set. Use model to predict future churn.
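The train / evaluation / predict scheme can be sketched in plain Python as below. This is only an illustration under stated assumptions: the `build_dataset` helper, the two toy features, the purchase log and the 120-day approximation of "4 months" are all hypothetical.

```python
from datetime import date, timedelta

HORIZON = timedelta(days=120)  # assumption: "4 months" approximated as 120 days


def build_dataset(purchases, cutoff, horizon=HORIZON, with_target=True):
    """Features use purchases strictly before `cutoff`;
    the target looks at activity in [cutoff, cutoff + horizon)."""
    per_customer = {}
    for customer, day in purchases:
        if day < cutoff:
            row = per_customer.setdefault(
                customer, {"n_purchases": 0, "last_purchase": day})
            row["n_purchases"] += 1
            row["last_purchase"] = max(row["last_purchase"], day)
    active_after = {c for c, day in purchases
                    if cutoff <= day < cutoff + horizon}
    dataset = []
    for customer, row in sorted(per_customer.items()):
        example = {
            "customer": customer,
            "n_purchases": row["n_purchases"],
            "days_since_last": (cutoff - row["last_purchase"]).days,
        }
        if with_target:
            # Churned = no purchase during the horizon following the cutoff.
            example["churned"] = int(customer not in active_after)
        dataset.append(example)
    return dataset


today = date(2016, 1, 1)  # plays the role of T
purchases = [  # hypothetical purchase log: (customer id, purchase date)
    ("a", date(2015, 2, 1)), ("a", date(2015, 6, 1)), ("a", date(2015, 11, 1)),
    ("b", date(2015, 3, 1)), ("b", date(2015, 4, 1)),
    ("c", date(2015, 7, 1)), ("c", date(2015, 10, 15)),
]

train = build_dataset(purchases, cutoff=today - 2 * HORIZON)  # target in [T-8, T-4)
valid = build_dataset(purchases, cutoff=today - HORIZON)      # target in [T-4, T)
to_score = build_dataset(purchases, cutoff=today, with_target=False)
```

Using the same function with three different cutoffs keeps feature generation identical across training, validation and scoring, so no feature can accidentally look at data from its own target window.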
  • 26. •  More complex design •  Graph sampling (fraud rings?) •  Random sampling in client / machine life •  Mix of column based and time based … •  The rule: 1) What is the problem? 2) To what would I like to generalize my model? Future? Other individuals? … 3) => Train / test split Train / test split
  • 27. •  Predictive Maintenance problem •  Objective : predict failure in next 3 days. •  Metric is proportional to accuracy (and 0.57 is the best score !) •  Link to data : https://www.phmsociety.org/events/conference/phm/14/data-challenge EX PHM Society (Fail example)
  • 31. •  How to design the evaluation scheme? •  What is the probability that an asset fails in the next 3 days from now? -> classification problem -> time based split -> but how do I create a train and a test set? •  Choose a date and evaluate what happens 3 days later? -> problem: not enough failures happening •  Choose several dates for each asset? -> beware of asset over-fitting •  In the challenge: random selection of (asset, date) in the future + oversampling of failures. EX PHM Society
  • 32. •  Basic Feature engineering EX PHM Society
  • 33. •  Random Sampling EX PHM Society This is decent! “With some more work I could have a model that beats randomness enough to be useful”
  • 34. •  Time based split EX PHM Society Wait, what?
  • 35. •  TIME LEAK EX PHM Society
  • 36. •  TIME LEAK EX PHM Society Tree cuts
  • 37. •  Beware of the distribution of your features! •  Is there a time dependency? •  Ex: count, sum, … that will only increase with time •  -> Calculate count and sum rescaled by time / in moving windows instead. •  Can be found in churn, fraud detection, ad click prediction, … •  Is there a categorical variable dependency? •  Ex: email flag in fraud detection •  Is there a network dependency? •  Ex: fraud / bot detection (network features can be useful but leaky) Feature Engineering
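The advice on counts and sums can be illustrated with a small sketch. It assumes a hypothetical asset that emits one event every 3 days; the `count_features` helper and its feature names are made up for the example.

```python
from datetime import date, timedelta


def count_features(event_days, as_of, first_seen, window_days=30):
    """Compare a raw cumulative count with two time-stable alternatives."""
    past = [d for d in event_days if d < as_of]
    lifetime_days = max((as_of - first_seen).days, 1)
    window_start = as_of - timedelta(days=window_days)
    return {
        # Only increases with time: its distribution shifts between train and test.
        "total_count": len(past),
        # Rescaled by the observed lifetime.
        "count_per_day": len(past) / lifetime_days,
        # Counted in a moving window that ends at the observation date.
        "count_last_window": sum(1 for d in past if d >= window_start),
    }


first_seen = date(2015, 1, 1)
events = [first_seen + timedelta(days=3 * i) for i in range(100)]

early = count_features(events, as_of=date(2015, 4, 1), first_seen=first_seen)
late = count_features(events, as_of=date(2015, 10, 1), first_seen=first_seen)
print(early)
print(late)
```

The behaviour of the asset is the same at both dates, yet `total_count` roughly triples, so a tree can cut on it to tell "early" from "late" rows. The rescaled and windowed versions stay comparable across time.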
  • 38. •  Final trick: - Stack train and test and add an is_test boolean - Try to predict is_test - Check if the model is able to predict it - If so: - check the feature importance - remove / modify features and iterate Feature Engineering
  • 39. •  Final trick: •  Back to the PHM example: Feature Engineering Huge time leak!
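A simplified, per-feature variant of this trick can be sketched without any ML library. Instead of fitting a model on all features and reading its feature importance, the sketch below computes, for each feature taken alone, the AUC obtained when using it to predict `is_test`. This is a swapped-in simplification, and the `auc` and `drift_report` helpers as well as the toy features are hypothetical.

```python
def auc(scores, labels):
    """Probability that a random positive scores above a random negative
    (ties count for one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def drift_report(train_rows, test_rows, features):
    """Stack train and test, label rows with is_test, score each feature alone."""
    stacked = train_rows + test_rows
    is_test = [0] * len(train_rows) + [1] * len(test_rows)
    return {f: auc([row[f] for row in stacked], is_test) for f in features}


# Hypothetical data: a cumulative counter that keeps growing over time,
# and a feature whose distribution is the same in train and test.
train_rows = [{"total_count": t, "stable_feature": (t * 7) % 10}
              for t in range(0, 100)]
test_rows = [{"total_count": t, "stable_feature": (t * 7) % 10}
             for t in range(100, 140)]

report = drift_report(train_rows, test_rows, ["total_count", "stable_feature"])
print(report)
```

An AUC close to 0.5 means the feature cannot tell train from test. An AUC close to 1 (or 0) flags a feature to remove or rework. A model-based version can also catch drift that only shows up in combinations of features, which this per-feature check misses.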
  • 40. •  “Threshold dependent” •  Accuracy •  Precision and Recall •  F1 score •  “Threshold independent” •  AUC •  Log Loss •  Others (Mean average precision)… Evaluation metric: Classification
  • 41. •  “Threshold dependent” •  Accuracy •  Precision and Recall •  F1 score •  “Threshold independent” •  AUC •  Log Loss •  Others (Mean average precision)… •  Custom Evaluation metric: Classification Not good if unbalanced target. When you have an order problem. When you are going stochastic. When you need to stick to business. Accuracy alternative.
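The remark that accuracy is not good with an unbalanced target can be checked with a small worked example. The numbers below are invented for the illustration (1% of positives), and the `confusion` and `report` helpers are hypothetical.

```python
def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels in {0, 1}."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    return tp, fp, fn, tn


def report(y_true, y_pred):
    tp, fp, fn, tn = confusion(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(y_true),
            "precision": precision, "recall": recall, "f1": f1}


# Invented data: 10 positives out of 1000 rows.
y_true = [1] * 10 + [0] * 990

# Classifier 1: always predicts the majority class.
always_negative = report(y_true, [0] * 1000)

# Classifier 2: flags 20 rows, 8 of which are real positives.
flagging = report(y_true, [1] * 8 + [0] * 2 + [1] * 12 + [0] * 978)

print(always_negative)
print(flagging)
```

The do-nothing classifier gets the higher accuracy even though it finds no positive at all, while precision, recall and F1 rank the two classifiers the way a practitioner would.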
  • 42. •  Custom metrics •  Cost based •  Ex fraud: •  Mean loss of $50 / fraud (FN) •  Mean loss of $20 / wrongly cancelled transaction (FP) •  F1 score often used in papers •  In practice, you often have a business cost Evaluation metric: Classification [Confusion matrix: TP, FN, TN, FP]
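A cost based metric along these lines could look like the sketch below. The $50 and $20 figures come from the slide; the `fraud_cost` helper, the convention that a transaction is cancelled when its score is at or above the threshold, and the toy scores are assumptions.

```python
def fraud_cost(y_true, scores, threshold, fn_cost=50.0, fp_cost=20.0):
    """Total business cost when transactions scoring >= threshold are cancelled."""
    # Missed frauds (false negatives): fraud that was let through.
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    # Wrongly cancelled transactions (false positives).
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return fn * fn_cost + fp * fp_cost


# Invented labels (1 = fraud) and model scores.
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.6, 0.05, 0.9]

for threshold in (0.3, 0.5, 0.7):
    print(threshold, fraud_cost(y_true, scores, threshold))
```

Unlike F1, this number is expressed in money, so it can be compared directly with the cost of doing nothing (here 3 missed frauds, that is $150).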
  • 43. •  Custom metrics •  Fraud Example 1: •  “I have fraudsters on my e-business website” •  I generate a score for each transaction •  I handle this by manually reviewing transactions with a score higher than a threshold •  I have 1 person who does this full time and is able to deal with 100 transactions / day •  The rest is automatically accepted -> AUC is not bad -> Recall in 100 transactions / day -> Total money blocked in 100 transactions / day. In practice AUC is more stable… But the money metric can also be used for communication. Evaluation metric: Classification
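The two capacity-limited metrics mentioned here (recall and money blocked within the transactions a reviewer can handle) can be sketched as follows. The `top_k_review` helper, the field names and the toy day of transactions are hypothetical; `k` stands for the daily review capacity (100 in the slide, 3 in the toy example).

```python
def top_k_review(transactions, k=100):
    """Metrics when only the k highest-scored transactions are reviewed."""
    ranked = sorted(transactions, key=lambda t: t["score"], reverse=True)
    reviewed = ranked[:k]
    total_fraud = sum(t["is_fraud"] for t in transactions)
    caught = sum(t["is_fraud"] for t in reviewed)
    return {
        "recall_at_k": caught / total_fraud if total_fraud else 0.0,
        "money_blocked": sum(t["amount"] for t in reviewed if t["is_fraud"]),
    }


# Invented day of transactions.
day = [
    {"score": 0.9, "amount": 120.0, "is_fraud": 1},
    {"score": 0.8, "amount": 30.0, "is_fraud": 0},
    {"score": 0.7, "amount": 60.0, "is_fraud": 1},
    {"score": 0.2, "amount": 500.0, "is_fraud": 1},
    {"score": 0.1, "amount": 15.0, "is_fraud": 0},
]

result = top_k_review(day, k=3)
print(result)
```

In the toy data the missed fraud is also the most expensive one, which shows why recall and money blocked can tell different stories for the same ranking.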
  • 44. •  Custom metrics •  Fraud Example 2: •  “I have fraudsters on my e-business website” •  I generate a score for each transaction •  I handle this automatically by blocking all transactions with a score higher than a threshold -> AUC is not bad… but it doesn’t give a threshold value. -> F1 score? -> Cost based is better Evaluation metric: Classification
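When transactions are blocked automatically, a cost based metric also gives a way to pick the threshold, which AUC does not. The sketch below scans candidate thresholds and keeps the cheapest one. The helper names and toy data are hypothetical, the $50 / $20 costs reuse the figures from the cost based slide, and in practice the scan would be done on a validation set.

```python
def blocking_cost(y_true, scores, threshold, fn_cost=50.0, fp_cost=20.0):
    """Cost when every transaction with score >= threshold is blocked."""
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return fn * fn_cost + fp * fp_cost


def best_threshold(y_true, scores):
    """Try each observed score (plus 'block nothing') and keep the cheapest."""
    candidates = sorted(set(scores)) + [float("inf")]
    best = min(candidates, key=lambda t: blocking_cost(y_true, scores, t))
    return best, blocking_cost(y_true, scores, best)


# Invented labels (1 = fraud) and model scores.
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.6, 0.05, 0.9]

threshold, cost = best_threshold(y_true, scores)
print(threshold, cost)
```

Changing the two cost parameters moves the chosen threshold, which makes the link between the business constraint and the model's operating point explicit.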
  • 45. •  My cheat sheet. Evaluation metric: Classification
      Metric | Optimized by ML model? | Threshold dependent | Application example
      Accuracy | YES | YES | image classification, NLP, …
      F1-score | NO | YES | ? Papers ?
      AUC | NO | NO | fraud detection, churn, healthcare, …
      Log-Loss | YES | NO | ad click prediction
      Custom metric | NO | ? | all ?
  • 46. •  Business question dictates evaluation scheme! •  test set design •  evaluation metric •  Indirectly impacts feature engineering •  Indirectly impacts label quality •  Think (not too much) before coding •  Don’t try to optimize the wrong problem! Conclusion
  • 47. Thank you for your attention!