How to Win Machine Learning Competitions
By Marios Michailidis
It’s not the destination…it’s the
journey!
What is Kaggle?
• The world's biggest predictive modelling
competition platform
• Half a million members
• Companies host data challenges.
• Usual tasks include:
– Predict topic or sentiment from text.
– Predict species/type from image.
– Predict store/product/area sales
– Marketing response
Inspired by Horse races…!
• At the University of Southampton, an
entrepreneur talked to us about how he was
able to predict the horse races with regression!
Was curious, wanted to learn more
• Learned statistical tools (like SAS, SPSS, R)
• I became more passionate!
• Picked up programming skills
Built KazAnova
• Generated a couple of algorithms and data
techniques and decided to make them public
so that others can gain from them.
• I released it at (http://www.kazanovaforanalytics.com/)
• Named it after ANOVA (statistics) and
KAZANI, mom’s last name.
Joined Kaggle!
• Was curious about Kaggle.
• Joined a few contests and learned lots.
• The community was very open to sharing and
collaboration.
Other interesting tasks!
• Predict the correct answer on a science test,
using Wikipedia.
• Predict which virus has infected a file.
• Predict the NCAA tournament!
• Predict the Higgs boson
• Out of many different numerical series predict
which one is the cause
3 Years of modelling competitions
• Over 90 competitions
• Participated with 45 different teams
• 22 top 10 finishes
• 12 times prize winner
• 3 different modelling platforms
• Ranked 1st out of 480,000 data scientists
What's next
• PhD (UCL) about recommender systems.
• More kaggling (but less intense)!
So… what wins competitions?
In short:
• Understand the problem
• Discipline
• Try problem-specific things or new approaches
• The hours you put in
• The right tools
• Collaboration
• Ensembling
More specifically…
● Understand the problem and the function to optimize
● Choose what seems to be a good algorithm for the problem and
iterate:
• Clean the data
• Scale the data
• Make feature transformations
• Make feature derivations
• Use cross-validation to
  – Make feature selections
  – Tune hyperparameters of the algorithm
● Do this for many other algorithms, always exploiting their
benefits
● Find the best way to combine or ensemble the different algorithms
● Different types of models and when to use them
Understand the metric to optimize
● For example, in the competition it was AUC (Area Under the ROC
Curve).
● This is a ranking metric.
● It shows how consistently your good cases get a higher score
than your bad cases.
● Not all algorithms optimize the metric you want.
● Common metrics:
– AUC,
– Classification accuracy,
– precision,
– NDCG,
– RMSE
– MAE
– Deviance
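To make the "ranking metric" point concrete, here is a minimal pure-Python sketch of AUC computed as the share of (good, bad) pairs where the good case gets the higher score. The function name `auc_by_pairs` and the toy labels and scores are made up for illustration; a real competition would use a library implementation.

```python
def auc_by_pairs(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative counts as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy data (illustrative only): 1 = good case, 0 = bad case.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.6]
print(auc_by_pairs(labels, scores))

# An order-preserving rescaling of the scores leaves AUC unchanged:
# only the ranking of the cases matters, not the score values.
rescaled = [s * 100 - 5 for s in scores]
print(auc_by_pairs(labels, rescaled))
```

The double loop is quadratic in the number of cases, so this is only meant to show what the metric measures, not how to compute it at scale.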
Choose the algorithm
● In most cases many algorithms are experimented with
before finding the right one(s).
● Those who try more models/parameters have a higher
chance of winning contests than others.
● Any algorithm that makes sense to use should be
used.
In the algorithm: Clean the Data
● The data cleaning step is not independent of the chosen
algorithm. Different algorithms call for different cleaning filters.
● Treating missing values is really important; certain algorithms
are more forgiving than others.
● On other occasions it may make more sense to carefully replace a
missing value with a sensible one (like the average of the feature), treat
it as a separate category or even remove the whole observation.
● Similarly we search for outliers. Again, some models are more
forgiving, while in others the impact of outliers is detrimental.
● How to decide the best method?
– Try them all or
– Experience and literature, but mostly the first
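The three ways of treating a missing value mentioned above can be sketched in a few lines of plain Python. This is only an illustration: the helper names, the `None` convention for "missing" and the `"MISSING"` label are assumptions, not part of the original talk.

```python
from statistics import mean


def impute_with_mean(values):
    """Replace None with the mean of the observed values.

    Also returns a 0/1 indicator column, so a model can still see
    which values were originally missing.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return filled, was_missing


def missing_as_category(values, label="MISSING"):
    """Treat a missing categorical value as its own category."""
    return [label if v is None else v for v in values]


def drop_incomplete_rows(rows):
    """Remove every observation that has at least one missing value."""
    return [row for row in rows if all(v is not None for v in row)]


# Toy columns (illustrative only).
age = [25, None, 35, 40, None]
filled, was_missing = impute_with_mean(age)
print(filled, was_missing)
print(missing_as_category(["red", None, "blue"]))
print(drop_incomplete_rows([[1, 2], [None, 3], [4, 5]]))
```

Which of the three works best depends on the algorithm, which is why trying them all under cross-validation is the safer route.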
In the algorithm: scaling the data
● Certain algorithms cannot deal with unscaled data.
● Scaling techniques:
– Max scaler: divide each feature by its highest absolute value
– Normalization: subtract the mean and divide by the standard
deviation
– Conditional scaling: scale only under certain conditions (e.g.
in medicine we tend to scale per subject to make their features
comparable)
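A minimal sketch of the three scaling techniques, in plain Python. The function names are made up here, the population standard deviation is an assumption (the slide does not say which one), and "conditional scaling" is shown as scaling separately within each group, e.g. per subject.

```python
from statistics import mean, pstdev


def max_abs_scale(column):
    """Divide each value by the highest absolute value in the column."""
    peak = max(abs(v) for v in column)
    if peak == 0:
        return [0.0 for _ in column]
    return [v / peak for v in column]


def standardize(column):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]


def scale_per_group(values, groups, scaler):
    """Conditional scaling: apply `scaler` separately within each group."""
    by_group = {}
    for i, g in enumerate(groups):
        by_group.setdefault(g, []).append(i)
    out = [0.0] * len(values)
    for indices in by_group.values():
        scaled = scaler([values[i] for i in indices])
        for i, s in zip(indices, scaled):
            out[i] = s
    return out


# Toy data (illustrative only): two subjects measured on different scales.
readings = [10, 20, 1, 2]
subject = ["a", "a", "b", "b"]
print(max_abs_scale([-4, 2, 1]))
print(standardize([1, 2, 3]))
print(scale_per_group(readings, subject, max_abs_scale))
```

In a competition setting the scaling constants would normally be learned on the training part only and then reused on the validation part.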
In the algorithm: feature transformations
● For certain algorithms there is a benefit in changing the features,
because it helps them converge faster and better.
● Common transformations include:
1. LOG, SQRT (variable): smooths variables
2. Dummies for categorical variables
3. Sparse matrices: to be able to compress the data
4. 1st derivatives: to smooth data
5. Weights of evidence (transforming variables while using
information of the target variable)
6. Unsupervised methods that reduce dimensionality (SVD, PCA,
ISOMAP, KDTREE, clustering)
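Of the transformations listed, weights of evidence is probably the least self-explanatory, so here is a small sketch. It assumes a binary target and the convention WoE = ln(share of positives in the category / share of negatives in the category); the opposite sign convention is also in use. The smoothing constant is an added assumption to avoid taking the log of zero.

```python
import math
from collections import Counter


def weight_of_evidence(categories, target, smoothing=0.5):
    """Map each category to ln(share of positives / share of negatives).

    `smoothing` is added to every count so that a category with no
    positives (or no negatives) still gets a finite value.
    """
    pos = Counter(c for c, y in zip(categories, target) if y == 1)
    neg = Counter(c for c, y in zip(categories, target) if y == 0)
    levels = set(categories)
    total_pos = sum(pos.values()) + smoothing * len(levels)
    total_neg = sum(neg.values()) + smoothing * len(levels)
    woe = {}
    for level in levels:
        p = (pos[level] + smoothing) / total_pos
        n = (neg[level] + smoothing) / total_neg
        woe[level] = math.log(p / n)
    return woe


# Toy data (illustrative only).
categories = ["a", "a", "a", "b", "b", "b"]
target = [1, 1, 0, 0, 0, 1]
woe = weight_of_evidence(categories, target)
encoded = [woe[c] for c in categories]  # numeric replacement for the column
print(woe)
```

Because the mapping uses the target, it should be learned on the training part of each split only; computing it on all rows leaks the answer into the feature.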
In the algorithm: feature derivations
● In many problems this is the most important thing. For example:
– Text classification: generate the corpus of words and make TF-IDF
– Sounds: convert sounds to frequencies through Fourier
transformations
– Images: make convolutions. E.g. break down an image to pixels and
extract different parts of the image.
– Interactions: really important for some models. For our algorithms
too! E.g. have variables that show if an item is popular AND the
customer likes it.
– Others that make sense: similarity features, dimensionality
reduction features or even predictions from other models as features.
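As one example of a derived feature, the TF-IDF step for text can be sketched in plain Python. This uses one common textbook weighting (term frequency times ln(N / document frequency)) and whitespace tokenization, both of which are assumptions; libraries typically apply smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {word: weight} dict per document.

    weight = (count of word in doc / words in doc)
             * ln(number of docs / docs containing the word)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))  # count each word once per document
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        vectors.append({
            w: (c / len(words)) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        })
    return vectors


# Toy corpus (illustrative only).
docs = ["the cat sat", "the dog sat", "the dog barked loudly"]
vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, vec)
```

A word that appears in every document ("the") gets weight zero, while a word unique to one document ("cat") gets the highest weight, which is the behaviour that makes TF-IDF useful for text classification.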
In the algorithm: Cross-Validation
● This basically means that from my main set, I RANDOMLY create 2
sets. I build (train) my algorithm with the first one (let’s call it the
training set) and score the other (let’s call it the validation set). I
repeat this process multiple times and always check how my model
performs on the validation set with respect to the metric I want to
optimize.
● The process may look like:
1. For 10 (you choose how many, X) times:
   1. Split the set into training (50%-90% of the original data)
   2. and validation (50%-10% of the original data)
   3. Then fit the algorithm on the training set
   4. Score the validation set
   5. Save the result of that scoring with respect to the chosen metric
2. Calculate the average of these 10 (X) results. That is the score you
can expect in real life, and it is generally a good estimate.
● Remember to use a SEED to be able to replicate these X splits
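The procedure above can be sketched as a small function. Everything here is illustrative: `repeated_holdout`, `fit_mean` and `mae` are hypothetical names, and the "model" is just the mean of the target, chosen so the example runs with the standard library only.

```python
import random
from statistics import mean


def repeated_holdout(rows, fit, score, n_repeats=10, train_fraction=0.8, seed=42):
    """Average a validation score over several seeded random splits."""
    rng = random.Random(seed)  # fixed seed: the same X splits on every run
    results = []
    for _ in range(n_repeats):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        train, valid = shuffled[:cut], shuffled[cut:]
        model = fit(train)                   # fit on the training set
        results.append(score(model, valid))  # score the validation set
    return mean(results), results


# Toy problem (illustrative only): rows are (x, y) pairs and the
# "model" simply predicts the mean of y seen in training.
def fit_mean(train):
    return mean(y for _, y in train)


def mae(model, valid):
    return mean(abs(y - model) for _, y in valid)


data = [(x, 2.0 * x + 1.0) for x in range(50)]
avg, per_split = repeated_holdout(data, fit_mean, mae)
print(avg, per_split)
```

Calling `repeated_holdout` again with the same seed reproduces the same splits, so two feature sets or two parameter settings can be compared on identical data.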
In Cross-Validation: Do feature selection
● It is unlikely that all features are useful, and some of them may be
damaging your model, because…
– They do not provide new information (collinearity)
– They are just not predictive (just noise)
● There are many ways to do feature selection:
– Run the algorithm and seek an internal measure to retrieve the most
important features (no cross-validation)
– Forward selection with or without cross-validation
– Backward selection with or without cross-validation
– Noise injection
– Hybrid of all methods
● Normally a forward method is chosen with CV. That is, we add a
feature, split the data with exactly the same X splits as
before and check whether our metric improved:
– If yes, the feature remains
– Else, Wiedersehen!
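A minimal sketch of that forward pass, assuming a metric where higher is better (such as AUC). The `cv_score` callable stands for "run the same seeded splits and return the average metric"; here it is replaced by a hypothetical toy function so the example runs on its own.

```python
def forward_select(candidates, cv_score, baseline=float("-inf")):
    """Add features one at a time; keep a feature only if the score improves.

    `cv_score(features)` is assumed to evaluate a model trained on
    `features` using the same cross-validation splits every time.
    """
    selected, best = [], baseline
    for feature in candidates:
        trial = selected + [feature]
        result = cv_score(trial)
        if result > best:      # metric improved: the feature remains
            selected, best = trial, result
        # otherwise the feature is dropped and we move on
    return selected, best


# Hypothetical stand-in for a real cross-validated score: two features
# help, every other feature costs a little.
USEFUL = {"age": 0.10, "income": 0.05}


def toy_cv_score(features):
    gain = sum(USEFUL.get(f, 0.0) for f in features)
    penalty = 0.01 * sum(1 for f in features if f not in USEFUL)
    return 0.5 + gain - penalty


selected, best = forward_select(
    ["age", "noise_1", "income", "noise_2"], toy_cv_score
)
print(selected, best)
```

Note that the result can depend on the order in which candidates are tried, which is one reason hybrid approaches exist.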
In Cross-Validation: Hyper Parameter Optimization
● This generally takes lots of time, depending on the algorithm. For
example in a random forest, the best model would need to be
determined based on parameters such as number of trees, max
depth, minimum cases in a split, features to consider at each split
etc.
● One way to find the best parameters is to manually change one
(e.g. max_depth=10 instead of 9) while you keep everything else
constant. I found that using this method helps you understand more
about what would work in a specific dataset.
● Another way is to try many possible combinations of
hyperparameters. We normally do that with Grid Search, where we
provide an array of all possible values to be trialled with cross-
validation (e.g. try max_depth {8,9,10,11} and number of trees
{90,120,200,500}).
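Grid search itself is little more than a loop over every combination. The sketch below uses the grid from the example above; `grid_search` and `toy_cv_score` are illustrative names, and the toy score is a made-up stand-in for a real cross-validated metric (higher is better).

```python
from itertools import product


def grid_search(param_grid, cv_score):
    """Try every combination in `param_grid`; return the best one.

    `cv_score(params)` is assumed to train with `params` and return the
    average cross-validated metric.
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        result = cv_score(params)
        if result > best_score:
            best_params, best_score = params, result
    return best_params, best_score


grid = {"max_depth": [8, 9, 10, 11], "n_trees": [90, 120, 200, 500]}


# Hypothetical score surface that peaks at max_depth=10, n_trees=200.
def toy_cv_score(params):
    return -abs(params["max_depth"] - 10) - abs(params["n_trees"] - 200) / 100


best_params, best_score = grid_search(grid, toy_cv_score)
print(best_params, best_score)
```

The 4 x 4 grid already means 16 full cross-validation runs, which is why this step takes so long. In Python, scikit-learn's `GridSearchCV` wraps the same idea around its own estimators.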
Train Many Algorithms
● Try to exploit their strengths
● For example, focus on using linear features with linear
regression (e.g. the higher the age the higher the
income) and non-linear ones with Random forest
● Make it so that each model tries to capture something
new or even focuses on a different part of the data
Ensemble
● A key part (in winning competitions at least) is combining the
various models made.
● Remember, even a crappy model can be useful to some small
extent.
● Possible ways to ensemble:
– Simple average: (model1 prediction + model2 prediction)/2
– Average ranks for AUC (simple average after converting to rank)
– Manually tune weights with cross-validation
– Using geomean weighted average
– Use meta-modelling (also called stacked generalization or stacking)
● Check GitHub for a complete example of these methods using the
Amazon comp hosted by Kaggle: https://github.com/kaz-Anova/ensemble_amazon
(top 60 rank).
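The simpler blending options can be sketched in plain Python. This is not the code from the repository above; the function names and toy predictions are made up, and the weighted geometric mean assumes strictly positive predictions.

```python
import math


def simple_average(*predictions):
    """Element-wise mean of several prediction lists."""
    return [sum(p) / len(p) for p in zip(*predictions)]


def to_ranks(scores):
    """Rank 1 = lowest score; tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def rank_average(*predictions):
    """Convert each model's scores to ranks, then average the ranks."""
    return simple_average(*(to_ranks(p) for p in predictions))


def weighted_geomean(predictions, weights):
    """Weighted geometric mean; predictions must be strictly positive."""
    total = sum(weights)
    return [
        math.exp(sum(w * math.log(v) for w, v in zip(weights, row)) / total)
        for row in zip(*predictions)
    ]


# Toy predictions from two models on three cases (illustrative only).
model1 = [0.9, 0.2, 0.6]
model2 = [0.7, 0.4, 0.8]
print(simple_average(model1, model2))
print(rank_average(model1, model2))
print(weighted_geomean([model1, model2], weights=[2, 1]))
```

Rank averaging suits AUC because AUC only looks at the ordering, so models whose scores live on different scales can still be blended fairly. The weights in the last call would normally be tuned with cross-validation.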
Different models I have experimented with, vol 1
● Logistic/Linear/discriminant regression: Fast, scalable,
comprehensible, solid under high dimensionality, can be memory-
light. Best when relationships are linear or all features are
categorical. Good for text classification too.
● Random Forests: Probably the best one-off overall algorithm out
there (in my experience). Fast, scalable, memory-medium. Best
when all features are numeric-continuous and there are strong non-
linear relationships. Does not cope well with high dimensionality.
● Gradient Boosting (Trees): Less memory-intensive than forests (as
individual predictors tend to be weaker). Fast, semi-scalable,
memory-medium. Good when forests are good.
● Neural Nets (AKA deep learning): Good for tasks humans are
good at: image recognition, sound recognition. Good with categorical
variables too (as they replicate on-and-off signals). Medium-speed,
scalable, memory-light. Generally good for linear and non-linear
tasks. May take long to train depending on structure. Many
parameters to tune. Very prone to over- and under-fitting.
● Support Vector Machines (SVMs): Medium-speed, not scalable,
memory-intensive. Still good at capturing linear and non-linear
relationships. Holding the kernel matrix takes too much memory.
Not advisable for data sets bigger than 20k.
● K Nearest Neighbours: Slow (depending on the size), not easily
scalable, memory-heavy. Good when telling the good cases from
the bad is a matter of how closely a case resembles specific individuals.
Also good when there are many target variables, as the
similarity measures remain the same across different
observations. Good for text classification too.
● Naïve Bayes: Quick, scalable, memory-ok. Good for quick
classifications on big datasets. Not particularly predictive.
● Factorization Machines: Good gateways between linear and
non-linear problems. Stand between regressions, KNNs and
neural networks. Memory-medium, semi-scalable, medium-speed.
Good for predicting the rating a customer will assign to a product.
Different models I have experimented with, vol 2
Tools vol 1
● Languages : Python, R, Java
● Liblinear : for linear models
http://www.csie.ntu.edu.tw/~cjlin/liblinear/
● LibSvm for Support Vector machines
www.csie.ntu.edu.tw/~cjlin/libsvm/
● Scikit-learn package in Python for text classification, random forests
and gradient boosting machines scikit-learn.org/stable/
● Xgboost for fast scalable gradient boosting
https://github.com/tqchen/xgboost
● LightGBM https://github.com/Microsoft/LightGBM
● Vowpal Wabbit hunch.net/~vw/ for fast memory efficient linear
models
● Encog for neural nets http://www.heatonresearch.com/encog
● H2O in R for many models
● LibFm www.libfm.org
● LibFFM : https://www.csie.ntu.edu.tw/~cjlin/libffm/
● Weka in Java (has everything) http://www.cs.waikato.ac.nz/ml/weka/
● Graphchi for factorizations : https://github.com/GraphChi
● GraphLab for lots of stuff. https://dato.com/products/create/open_source.html
● Cxxnet: One of the best implementations of convolutional neural nets out
there. Difficult to install and requires a GPU with an NVIDIA graphics card.
https://github.com/antinucleon/cxxnet
● RankLib: The best library out there made in Java suited for ranking algorithms
(e.g. rank products for customers) that supports optimization functions like
NDCG. people.cs.umass.edu/~vdang/ranklib.html
● Keras (http://keras.io/) and Lasagne (https://github.com/Lasagne/Lasagne)
for nets. This assumes you have Theano
(http://deeplearning.net/software/theano/) or TensorFlow
(https://www.tensorflow.org/).
Tools vol 2
Where to go next to prepare for competitions
● Coursera: https://www.coursera.org/course/ml Andrew Ng’s
class
● Kaggle.com: many competitions for learning. For instance:
http://www.kaggle.com/c/titanic-gettingStarted . Look for the
“knowledge flag”
● Very good slides from the University of Utah:
www.cs.utah.edu/~piyush/teaching/cs5350.html
● clopinet.com/challenges/ . Many past predictive modelling
competitions with tutorials.
● Wikipedia. Not to be underestimated. Still the best source of
information out there (collectively).

More Related Content

What's hot

Feature Engineering
Feature EngineeringFeature Engineering
Feature EngineeringHJ van Veen
 
Winning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to StackingWinning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to StackingTed Xiao
 
Introduction to XGBoost
Introduction to XGBoostIntroduction to XGBoost
Introduction to XGBoostJoonyoung Yi
 
Machine Learning - Ensemble Methods
Machine Learning - Ensemble MethodsMachine Learning - Ensemble Methods
Machine Learning - Ensemble MethodsAndrew Ferlitsch
 
Understanding Bagging and Boosting
Understanding Bagging and BoostingUnderstanding Bagging and Boosting
Understanding Bagging and BoostingMohit Rajput
 
Unsupervised learning: Clustering
Unsupervised learning: ClusteringUnsupervised learning: Clustering
Unsupervised learning: ClusteringDeepak George
 
Concept Drift: Monitoring Model Quality In Streaming ML Applications
Concept Drift: Monitoring Model Quality In Streaming ML ApplicationsConcept Drift: Monitoring Model Quality In Streaming ML Applications
Concept Drift: Monitoring Model Quality In Streaming ML ApplicationsLightbend
 
Lecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation MaximizationLecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation Maximizationbutest
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature EngineeringSri Ambati
 
Introduction of Xgboost
Introduction of XgboostIntroduction of Xgboost
Introduction of Xgboostmichiaki ito
 
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Gabriel Moreira
 
Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsMd. Main Uddin Rony
 
XGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competitionXGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competitionJaroslaw Szymczak
 
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...Xavier Amatriain
 

What's hot (20)

Decision Tree Learning
Decision Tree LearningDecision Tree Learning
Decision Tree Learning
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 
Causal Inference in Marketing
Causal Inference in MarketingCausal Inference in Marketing
Causal Inference in Marketing
 
Winning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to StackingWinning Kaggle 101: Introduction to Stacking
Winning Kaggle 101: Introduction to Stacking
 
K means Clustering Algorithm
K means Clustering AlgorithmK means Clustering Algorithm
K means Clustering Algorithm
 
Ensemble methods
Ensemble methods Ensemble methods
Ensemble methods
 
Model selection
Model selectionModel selection
Model selection
 
Introduction to XGBoost
Introduction to XGBoostIntroduction to XGBoost
Introduction to XGBoost
 
Machine Learning - Ensemble Methods
Machine Learning - Ensemble MethodsMachine Learning - Ensemble Methods
Machine Learning - Ensemble Methods
 
Ensemble methods
Ensemble methodsEnsemble methods
Ensemble methods
 
Understanding Bagging and Boosting
Understanding Bagging and BoostingUnderstanding Bagging and Boosting
Understanding Bagging and Boosting
 
Unsupervised learning: Clustering
Unsupervised learning: ClusteringUnsupervised learning: Clustering
Unsupervised learning: Clustering
 
Concept Drift: Monitoring Model Quality In Streaming ML Applications
Concept Drift: Monitoring Model Quality In Streaming ML ApplicationsConcept Drift: Monitoring Model Quality In Streaming ML Applications
Concept Drift: Monitoring Model Quality In Streaming ML Applications
 
Lecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation MaximizationLecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation Maximization
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 
Introduction of Xgboost
Introduction of XgboostIntroduction of Xgboost
Introduction of Xgboost
 
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017
 
Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning Algorithms
 
XGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competitionXGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competition
 
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
 

Viewers also liked

Open Source Tools & Data Science Competitions
Open Source Tools & Data Science Competitions Open Source Tools & Data Science Competitions
Open Source Tools & Data Science Competitions odsc
 
Lessons from 2MM machine learning models
Lessons from 2MM machine learning modelsLessons from 2MM machine learning models
Lessons from 2MM machine learning modelsExtract Data Conference
 
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioLecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioMarina Santini
 
Open Innovation - A Case Study
Open Innovation - A Case StudyOpen Innovation - A Case Study
Open Innovation - A Case StudyHackerEarth
 
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Spark Summit
 
Data Science Competition
Data Science CompetitionData Science Competition
Data Science CompetitionJeong-Yoon Lee
 
State of women in technical workforce
State of women in technical workforceState of women in technical workforce
State of women in technical workforceHackerEarth
 
Wapid and wobust active online machine leawning with Vowpal Wabbit
Wapid and wobust active online machine leawning with Vowpal Wabbit Wapid and wobust active online machine leawning with Vowpal Wabbit
Wapid and wobust active online machine leawning with Vowpal Wabbit Antti Haapala
 
Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field Domino Data Lab
 
Doing your first Kaggle (Python for Big Data sets)
Doing your first Kaggle (Python for Big Data sets)Doing your first Kaggle (Python for Big Data sets)
Doing your first Kaggle (Python for Big Data sets)Domino Data Lab
 
Leverage Social Media for Employer Brand and Recruiting
Leverage Social Media for Employer Brand and RecruitingLeverage Social Media for Employer Brand and Recruiting
Leverage Social Media for Employer Brand and RecruitingHackerEarth
 
Ethics in Data Science and Machine Learning
Ethics in Data Science and Machine LearningEthics in Data Science and Machine Learning
Ethics in Data Science and Machine LearningHJ van Veen
 
HackerEarth helping a startup hire developers - The Practo Case Study
HackerEarth helping a startup hire developers - The Practo Case StudyHackerEarth helping a startup hire developers - The Practo Case Study
HackerEarth helping a startup hire developers - The Practo Case StudyHackerEarth
 
Marriage - LIGHT Ministry
Marriage - LIGHT MinistryMarriage - LIGHT Ministry
Marriage - LIGHT MinistryJeong-Yoon Lee
 
Leveraged Analytics at Scale
Leveraged Analytics at ScaleLeveraged Analytics at Scale
Leveraged Analytics at ScaleDomino Data Lab
 
How hackathons can drive top line revenue growth
How hackathons can drive top line revenue growthHow hackathons can drive top line revenue growth
How hackathons can drive top line revenue growthHackerEarth
 
Smart Switchboard: An home automation system
Smart Switchboard: An home automation systemSmart Switchboard: An home automation system
Smart Switchboard: An home automation systemHackerEarth
 
How to recruit excellent tech talent
How to recruit excellent tech talentHow to recruit excellent tech talent
How to recruit excellent tech talentHackerEarth
 

Viewers also liked (20)

Open Source Tools & Data Science Competitions
Open Source Tools & Data Science Competitions Open Source Tools & Data Science Competitions
Open Source Tools & Data Science Competitions
 
Lessons from 2MM machine learning models
Lessons from 2MM machine learning modelsLessons from 2MM machine learning models
Lessons from 2MM machine learning models
 
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioLecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
 
Open Innovation - A Case Study
Open Innovation - A Case StudyOpen Innovation - A Case Study
Open Innovation - A Case Study
 
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
 
Data Science Competition
Data Science CompetitionData Science Competition
Data Science Competition
 
State of women in technical workforce
State of women in technical workforceState of women in technical workforce
State of women in technical workforce
 
Kill the wabbit
Kill the wabbitKill the wabbit
Kill the wabbit
 
Wapid and wobust active online machine leawning with Vowpal Wabbit
Wapid and wobust active online machine leawning with Vowpal Wabbit Wapid and wobust active online machine leawning with Vowpal Wabbit
Wapid and wobust active online machine leawning with Vowpal Wabbit
 
Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field
 
Doing your first Kaggle (Python for Big Data sets)
Doing your first Kaggle (Python for Big Data sets)Doing your first Kaggle (Python for Big Data sets)
Doing your first Kaggle (Python for Big Data sets)
 
Leverage Social Media for Employer Brand and Recruiting
Leverage Social Media for Employer Brand and RecruitingLeverage Social Media for Employer Brand and Recruiting
Leverage Social Media for Employer Brand and Recruiting
 
Ethics in Data Science and Machine Learning
Ethics in Data Science and Machine LearningEthics in Data Science and Machine Learning
Ethics in Data Science and Machine Learning
 
HackerEarth helping a startup hire developers - The Practo Case Study
HackerEarth helping a startup hire developers - The Practo Case StudyHackerEarth helping a startup hire developers - The Practo Case Study
HackerEarth helping a startup hire developers - The Practo Case Study
 
Marriage - LIGHT Ministry
Marriage - LIGHT MinistryMarriage - LIGHT Ministry
Marriage - LIGHT Ministry
 
Leveraged Analytics at Scale
Leveraged Analytics at ScaleLeveraged Analytics at Scale
Leveraged Analytics at Scale
 
How hackathons can drive top line revenue growth
How hackathons can drive top line revenue growthHow hackathons can drive top line revenue growth
How hackathons can drive top line revenue growth
 
Smart Switchboard: An home automation system
Smart Switchboard: An home automation systemSmart Switchboard: An home automation system
Smart Switchboard: An home automation system
 
No-Bullshit Data Science
No-Bullshit Data ScienceNo-Bullshit Data Science
No-Bullshit Data Science
 
How to recruit excellent tech talent
How to recruit excellent tech talentHow to recruit excellent tech talent
How to recruit excellent tech talent
 

Similar to How to Win Machine Learning Competitions ?

Machine Learning and Deep Learning 4 dummies
Machine Learning and Deep Learning 4 dummies Machine Learning and Deep Learning 4 dummies
Machine Learning and Deep Learning 4 dummies Dori Waldman
 
Machine learning4dummies
Machine learning4dummiesMachine learning4dummies
Machine learning4dummiesMichael Winer
 
Informs presentation new ppt
Informs presentation new pptInforms presentation new ppt
Informs presentation new pptSalford Systems
 
Initializing & Optimizing Machine Learning Models
Initializing & Optimizing Machine Learning ModelsInitializing & Optimizing Machine Learning Models
Initializing & Optimizing Machine Learning ModelsEng Teong Cheah
 
Hyperparameter Tuning
Hyperparameter TuningHyperparameter Tuning
Hyperparameter TuningJon Lederman
 
Ensemble hybrid learning technique
Ensemble hybrid learning techniqueEnsemble hybrid learning technique
Ensemble hybrid learning techniqueDishaSinha9
 
Intro to machine learning
Intro to machine learningIntro to machine learning
Intro to machine learningAkshay Kanchan
 
Experimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles BakerExperimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles BakerDatabricks
 
Machine Learning in the Financial Industry
Machine Learning in the Financial IndustryMachine Learning in the Financial Industry
Machine Learning in the Financial IndustrySubrat Panda, PhD
 
GLM & GBM in H2O
GLM & GBM in H2OGLM & GBM in H2O
GLM & GBM in H2OSri Ambati
 
Deep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter TuningDeep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter TuningShubhmay Potdar
 
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...Universitat Politècnica de Catalunya
 
Machine learning - session 3
Machine learning - session 3Machine learning - session 3
Machine learning - session 3Luis Borbon
 
Diabetes Prediction Using Machine Learning
Diabetes Prediction Using Machine LearningDiabetes Prediction Using Machine Learning
Diabetes Prediction Using Machine Learningjagan477830
 
20211229120253D6323_PERT 06_ Ensemble Learning.pptx
20211229120253D6323_PERT 06_ Ensemble Learning.pptx20211229120253D6323_PERT 06_ Ensemble Learning.pptx
20211229120253D6323_PERT 06_ Ensemble Learning.pptxRaflyRizky2
 

Similar to How to Win Machine Learning Competitions ? (20)

Machine Learning and Deep Learning 4 dummies
Machine Learning and Deep Learning 4 dummies Machine Learning and Deep Learning 4 dummies
Machine Learning and Deep Learning 4 dummies
 
Machine learning4dummies
Machine learning4dummiesMachine learning4dummies
Machine learning4dummies
 
Informs presentation new ppt
Informs presentation new pptInforms presentation new ppt
Informs presentation new ppt
 
Initializing & Optimizing Machine Learning Models
Initializing & Optimizing Machine Learning ModelsInitializing & Optimizing Machine Learning Models
Initializing & Optimizing Machine Learning Models
 
Hyperparameter Tuning
Hyperparameter TuningHyperparameter Tuning
Hyperparameter Tuning
 
Ensemble hybrid learning technique
Ensemble hybrid learning techniqueEnsemble hybrid learning technique
Ensemble hybrid learning technique
 
Random Forest Decision Tree.pptx
Random Forest Decision Tree.pptxRandom Forest Decision Tree.pptx
Random Forest Decision Tree.pptx
 
Intro to machine learning
Intro to machine learningIntro to machine learning
Intro to machine learning
 
Experimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles BakerExperimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles Baker
 
Machine Learning in the Financial Industry
Machine Learning in the Financial IndustryMachine Learning in the Financial Industry
Machine Learning in the Financial Industry
 
GLM & GBM in H2O
GLM & GBM in H2OGLM & GBM in H2O
GLM & GBM in H2O
 
Deep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter TuningDeep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter Tuning
 
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
 
Machine learning - session 3
Machine learning - session 3Machine learning - session 3
Machine learning - session 3
 
Diabetes Prediction Using Machine Learning
Diabetes Prediction Using Machine LearningDiabetes Prediction Using Machine Learning
Diabetes Prediction Using Machine Learning
 
Dnn guidelines
Dnn guidelinesDnn guidelines
Dnn guidelines
 
Regression ppt
Regression pptRegression ppt
Regression ppt
 
Rapid Miner
Rapid MinerRapid Miner
Rapid Miner
 
20211229120253D6323_PERT 06_ Ensemble Learning.pptx
20211229120253D6323_PERT 06_ Ensemble Learning.pptx20211229120253D6323_PERT 06_ Ensemble Learning.pptx
20211229120253D6323_PERT 06_ Ensemble Learning.pptx
 
AI Algorithms
AI AlgorithmsAI Algorithms
AI Algorithms
 

How to Win Machine Learning Competitions?

  • 1. How to Win Machine Learning Competitions
      By Marios Michailidis
      It’s not the destination… it’s the journey!
  • 2. What is Kaggle
      • World's biggest predictive modelling competition platform
      • Half a million members
      • Companies host data challenges
      • Usual tasks include:
          – Predict topic or sentiment from text
          – Predict species/type from an image
          – Predict store/product/area sales
          – Marketing response
  • 3. Inspired by horse races…!
      • At the University of Southampton, an entrepreneur talked to us about how he was able to predict horse races with regression!
  • 4. Was curious, wanted to learn more
      • Learned statistical tools (like SAS, SPSS, R)
      • I became more passionate!
      • Picked up programming skills
  • 5. Built KazAnova
      • Generated a couple of algorithms and data techniques and decided to make them public so that others can gain from them.
      • I released it at http://www.kazanovaforanalytics.com/
      • Named it after ANOVA (statistics) and KAZANI, mom’s last name.
  • 6. Joined Kaggle!
      • Was curious about Kaggle.
      • Joined a few contests and learned lots.
      • The community was very open to sharing and collaboration.
  • 7. Other interesting tasks!
      • Predict the correct answer from a science test, using Wikipedia.
      • Predict which virus has infected a file.
      • Predict the NCAA tournament!
      • Predict the Higgs boson.
      • Out of many different numerical series, predict which one is the cause.
  • 8. 3 years of modelling competitions
      • Over 90 competitions
      • Participated with 45 different teams
      • 22 top-10 finishes
      • 12 times prize winner
      • 3 different modelling platforms
      • Ranked 1st out of 480,000 data scientists
  • 9. What's next
      • PhD (UCL) about recommender systems.
      • More kaggling (but less intense)!
  • 10. So… what wins competitions?
      In short:
      • Understand the problem
      • Discipline
      • Try problem-specific things or new approaches
      • The hours you put in
      • The right tools
      • Collaboration
      • Ensembling
  • 11. More specifically…
      ● Understand the problem and the function to optimize
      ● Choose what seems to be a good algorithm for a problem and iterate
          • Clean the data
          • Scale the data
          • Make feature transformations
          • Make feature derivations
          • Use cross-validation to
              – Make feature selections
              – Tune hyperparameters of the algorithm
          – Do this for many other algorithms
          – Always exploiting their benefits
          – Find the best way to combine or ensemble the different algorithms
      ● Different types of models and when to use them
  • 12. Understand the metric to optimize
      ● For example, in the competition it was AUC (area under the ROC curve).
      ● This is a ranking metric.
      ● It shows how consistently your good cases have higher scores than your bad cases.
      ● Not all algorithms optimize the metric you want.
      ● Common metrics:
          – AUC
          – Classification accuracy
          – Precision
          – NDCG
          – RMSE
          – MAE
          – Deviance
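To make the "ranking metric" point concrete, here is a minimal sketch that computes AUC directly as the share of (good case, bad case) pairs in which the good case gets the higher score, counting ties as half. The function name `auc` and the toy labels and scores are illustrative assumptions, not taken from the talk; the pairwise loop is quadratic and only meant for small examples.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (made up for illustration): 1 = good case, 0 = bad case.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.8]

auc_value = auc(labels, scores)

# Only the ordering matters: rescaling every score leaves AUC unchanged.
auc_rescaled = auc(labels, [s * 100 for s in scores])
```

Because only the order of the scores enters the calculation, a model can have poorly calibrated probabilities and still score well on AUC, which is one reason an algorithm tuned for another metric may not be the best choice here.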
  • 13. Choose the algorithm
      ● In most cases many algorithms are tried before finding the right one(s).
      ● Those who try more models/parameters have a higher chance of winning contests than others.
      ● Any algorithm that makes sense to use should be used.
  • 14. In the algorithm: clean the data
      ● The data cleaning step is not independent of the chosen algorithm. For different algorithms, different cleaning filters should be applied.
      ● Treating missing values is really important; certain algorithms are more forgiving than others.
      ● On other occasions it may make more sense to carefully replace a missing value with a sensible one (like the average of the feature), treat it as a separate category or even remove the whole observation.
      ● Similarly, we search for outliers. Again, some models are more forgiving, while in others the impact of outliers is detrimental.
      ● How to decide the best method?
          – Try them all, or
          – Experience and literature, but mostly the first
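The missing-value options listed above can be sketched in a few lines. This is a minimal illustration under assumptions: missing values are represented as `None`, and the helper name `impute_mean` and the `ages` data are hypothetical.

```python
from statistics import mean


def impute_mean(values):
    """Replace None with the mean of the observed values.

    Also returns a 0/1 flag per row so a model can still see which
    values were originally missing (the "separate category" idea).
    """
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing


# Hypothetical feature with two missing entries.
ages = [22, None, 35, 41, None]

# Option 1: fill with the feature average and keep a missing flag.
imputed, flags = impute_mean(ages)

# Option 2: remove the observations with a missing value.
complete_only = [v for v in ages if v is not None]
```

Which option works better depends on the algorithm, so in the spirit of the slide it is usually worth scoring each variant with cross-validation rather than choosing up front.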
  • 15. In the algorithm: scaling the data
      ● Certain algorithms cannot deal with unscaled data.
      ● Scaling techniques:
          – Max scaler: divide each feature by its highest absolute value
          – Normalization: subtract the mean and divide by the standard deviation
          – Conditional scaling: scale only under certain conditions (e.g. in medicine we tend to scale per subject to make their features comparable)
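The first two scaling techniques can be written out directly. This is a minimal sketch for a single feature held in a Python list; the function names and the sample values are illustrative assumptions, and the population standard deviation (`pstdev`) is one of two reasonable choices.

```python
from statistics import mean, pstdev


def max_abs_scale(values):
    """Divide each value by the largest absolute value of the feature."""
    top = max(abs(v) for v in values)
    return [v / top for v in values] if top else list(values)


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature carries no spread
    return [(v - mu) / sigma for v in values]


feature = [2.0, -4.0, 6.0, 8.0]
scaled = max_abs_scale(feature)   # values now lie in [-1, 1]
z_scores = standardize(feature)   # mean 0, standard deviation 1
```

In practice the scaling statistics (maximum, mean, standard deviation) are usually computed on the training split only and then reused on validation and test data, so that no information leaks from the held-out rows.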
  • 16. In the algorithm: feature transformations
      ● For certain algorithms there is a benefit in changing the features, because this helps them converge faster and better.
      ● Common transformations include:
          1. LOG, SQRT (variable): smooths variables
          2. Dummies for categorical variables
          3. Sparse matrices, to be able to compress the data
          4. 1st derivatives, to smooth data
          5. Weights of evidence (transforming variables while using information of the target variable)
          6. Unsupervised methods that reduce dimensionality (SVD, PCA, ISOMAP, KDTREE, clustering)
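Two of the transformations above, the log transform and dummy variables, are easy to show without any library. The sketch below is illustrative only: it uses `log1p` (log of 1 + x) so that zeros are handled, which is an assumption on top of the slide's plain "LOG", and the helper names and sample data are hypothetical.

```python
import math


def log_transform(values):
    """Compress a long-tailed, non-negative feature with log(1 + x)."""
    return [math.log1p(v) for v in values]


def one_hot(categories):
    """Turn a categorical feature into 0/1 dummy columns.

    Returns the sorted list of levels and one row of dummies per input.
    """
    levels = sorted(set(categories))
    rows = [[1 if c == level else 0 for level in levels] for c in categories]
    return levels, rows


incomes = [0, 9, 99, 999]  # spans three orders of magnitude
logged = log_transform(incomes)

levels, dummies = one_hot(["red", "blue", "red", "green"])
```

After the log transform the values 9, 99 and 999 end up roughly evenly spaced, which tends to help linear models and neural nets; tree-based models are largely indifferent to monotone transformations like this one.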
  • 17. In the algorithm: feature derivations
      ● In many problems this is the most important thing. For example:
          – Text classification: generate the corpus of words and compute TF-IDF
          – Sounds: convert sounds to frequencies through Fourier transformations
          – Images: apply convolutions, e.g. break down an image into pixels and extract different parts of the image
          – Interactions: really important for some models. For our algorithms too! E.g. have variables that show if an item is popular AND the customer likes it.
          – Others that make sense: similarity features, dimensionality reduction features or even predictions from other models as features
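As an illustration of the text-classification bullet, here is a small TF-IDF sketch. Several TF-IDF variants exist; this one assumes term frequency = count / document length and inverse document frequency = log(N / document frequency), with whitespace tokenization and non-empty documents. The function name and the three toy documents are made up for the example.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {word: weight} dict per document.

    weight = (count in doc / doc length) * log(n_docs / docs containing word)
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Number of documents each word appears in.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors


docs = ["the cat sat", "the dog barked", "the cat purred"]
vectors = tfidf(docs)
first = vectors[0]
```

In the first document "the" gets weight zero because it occurs in every document, while "sat" outweighs "cat" because it is rarer across the corpus. That is the behaviour that makes TF-IDF useful as model input.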
  • 18. In the algorithm: cross-validation
      ● This basically means that from my main set, I RANDOMLY create 2 sets. I build (train) my algorithm with the first one (let's call it the training set) and score the other (let's call it the validation set). I repeat this process multiple times and always check how my model performs on the validation set with respect to the metric I want to optimize.
      ● The process may look like:
          1. For 10 (you choose how many, X) times:
              1. Split the set into training (50%-90% of the original data)
              2. and validation (50%-10% of the original data)
              3. Then fit the algorithm on the training set
              4. Score the validation set
              5. Save the result of that scoring with respect to the chosen metric
          2. Calculate the average of these 10 (X) results. That is how much you can expect the score to be in real life, and it is generally a good estimate.
      ● Remember to use a SEED to be able to replicate these X splits.
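The numbered procedure above can be sketched as a small generic loop. Everything named here is an assumption for illustration: `repeated_holdout` takes any `fit` and `score` callables, and the toy model (a least-squares slope through the origin, scored with mean absolute error) merely stands in for a real algorithm and metric.

```python
import random
from statistics import mean


def repeated_holdout(rows, fit, score, n_repeats=10, train_fraction=0.8, seed=42):
    """Repeat a random train/validation split and average the validation score."""
    rng = random.Random(seed)  # fixed seed -> the same X splits every run
    results = []
    for _ in range(n_repeats):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        train, valid = shuffled[:cut], shuffled[cut:]
        model = fit(train)                   # fit on the training set
        results.append(score(model, valid))  # score the validation set
    return mean(results), results


def fit_slope(train):
    """Toy model: least-squares slope of y on x through the origin."""
    return sum(x * y for x, y in train) / sum(x * x for x, _ in train)


def mae(slope, valid):
    """Mean absolute error of the toy model on the validation rows."""
    return mean(abs(y - slope * x) for x, y in valid)


# Made-up data: y is about 2x with a +/-1 wobble.
rows = [(float(x), 2.0 * x + (1 if x % 2 else -1)) for x in range(1, 21)]

avg_a, runs_a = repeated_holdout(rows, fit_slope, mae)
avg_b, runs_b = repeated_holdout(rows, fit_slope, mae)  # identical splits
```

Because both calls use the same seed they see exactly the same splits, so any difference between two experiments comes from the model or features being compared and not from the luck of the split.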
  • 19. In cross-validation: do feature selection
      ● It is unlikely that all features are useful, and some of them may be damaging your model, because…
          – They do not provide new information (collinearity)
          – They are just not predictive (just noise)
      ● There are many ways to do feature selection:
          – Run the algorithm and use an internal measure to retrieve the most important features (no cross-validation)
          – Forward selection with or without cross-validation
          – Backward selection with or without cross-validation
          – Noise injection
          – Hybrid of all methods
      ● Normally a forward method is chosen with CV. That is, we add a feature, split the data in exactly the same X ways as before and check whether our metric improved:
          – If yes, the feature remains
          – Else, goodbye!
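A minimal sketch of forward selection with CV follows. It is an illustration under assumptions, not the speaker's implementation: the model is a simple nearest-centroid classifier, the metric is accuracy, and the synthetic data is built so that feature 0 is informative, feature 1 is pure noise and feature 2 is an exact copy of feature 0 (perfect collinearity). All names are hypothetical.

```python
import random


def cv_accuracy(X, y, features, n_repeats=10, train_fraction=0.7, seed=0):
    """Mean validation accuracy of a nearest-centroid classifier.

    The fixed seed means every candidate feature set sees the same splits.
    """
    rng = random.Random(seed)
    idx = list(range(len(y)))
    scores = []
    for _ in range(n_repeats):
        rng.shuffle(idx)
        cut = int(len(idx) * train_fraction)
        train, valid = idx[:cut], idx[cut:]
        centroids = {}
        for label in set(y[i] for i in train):
            members = [i for i in train if y[i] == label]
            centroids[label] = [
                sum(X[i][f] for i in members) / len(members) for f in features
            ]
        correct = 0
        for i in valid:
            pred = min(
                centroids,
                key=lambda c: sum(
                    (X[i][f] - m) ** 2 for f, m in zip(features, centroids[c])
                ),
            )
            correct += pred == y[i]
        scores.append(correct / len(valid))
    return sum(scores) / len(scores)


def forward_selection(X, y, n_features):
    """Add features one at a time; keep each only if the CV score improves."""
    selected, best = [], 0.0
    for f in range(n_features):
        score = cv_accuracy(X, y, selected + [f])
        if score > best:  # metric improved: the feature remains
            selected, best = selected + [f], score
    return selected, best


# Synthetic data: feature 0 separates the classes, feature 1 is noise.
data_rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[10.0 * label + data_rng.uniform(-1, 1), data_rng.gauss(0, 10)] for label in y]
for row in X:
    row.append(row[0])  # feature 2 duplicates feature 0 (collinear)

selected, best = forward_selection(X, y, n_features=3)
```

On this data only feature 0 survives: the noise column cannot improve on a perfect score, and the duplicated column adds no new information. Note that a plain forward pass depends on the order in which features are tried, which is one reason hybrids with backward steps are sometimes used.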
  • 20. In cross-validation: hyperparameter optimization
      ● This generally takes lots of time, depending on the algorithm. For example, in a random forest the best model would need to be determined based on parameters such as the number of trees, max depth, minimum cases in a split, features to consider at each split etc.
      ● One way to find the best parameters is to manually change one (e.g. max_depth=10 instead of 9) while you keep everything else constant. I found that using this method helps you understand more about what would work in a specific dataset.
      ● Another way is to try many possible combinations of hyperparameters. We normally do that with grid search, where we provide an array of all possible values to be trialled with cross-validation (e.g. try max_depth {8,9,10,11} and number of trees {90,120,200,500}).
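Grid search itself is just a loop over every combination of candidate values, each scored with cross-validation. To keep the sketch dependency-free it swaps the random forest from the slide for a tiny k-nearest-neighbour regressor with two hyperparameters (`k` and `weighted`); the model, the data and all function names are assumptions for illustration.

```python
import itertools
import math
import random
from statistics import mean


def knn_predict(train, x, k, weighted):
    """Predict y at x from the k nearest training rows (1-D inputs)."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    if weighted:
        weights = [1.0 / (abs(nx - x) + 1e-9) for nx, _ in nearest]
        return sum(w * ny for w, (_, ny) in zip(weights, nearest)) / sum(weights)
    return mean(ny for _, ny in nearest)


def cv_rmse(rows, k, weighted, n_folds=5, seed=0):
    """Average RMSE over n_folds folds; the seed fixes the fold assignment."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    errors = []
    for i, valid in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        squared = [(y - knn_predict(train, x, k, weighted)) ** 2 for x, y in valid]
        errors.append(mean(squared) ** 0.5)
    return mean(errors)


def grid_search(rows, grid):
    """Score every combination in the grid and return the best plus all results."""
    names = list(grid)
    results = []
    for combo in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        results.append((cv_rmse(rows, **params), params))
    return min(results, key=lambda r: r[0]), results


# Made-up data: a noisy sine wave.
data_rng = random.Random(7)
rows = [(x / 10.0, math.sin(x / 10.0) + data_rng.gauss(0, 0.2)) for x in range(60)]

grid = {"k": [1, 3, 5, 9], "weighted": [False, True]}
(best_score, best_params), results = grid_search(rows, grid)
```

The grid above has 4 x 2 = 8 combinations, each costing 5 model fits. The cost multiplies with every extra hyperparameter, which is why the slide's manual one-at-a-time approach is often a useful first pass before a full grid.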
  • 21. Train many algorithms
      ● Try to exploit their strengths.
      ● For example, focus on using linear features with linear regression (e.g. the higher the age, the higher the income) and non-linear ones with random forests.
      ● Make it so that each model tries to capture something new or even focuses on a different part of the data.
  • 22. Ensemble
      ● A key part (in winning competitions at least) is combining the various models made.
      ● Remember, even a crappy model can be useful to some small extent.
      ● Possible ways to ensemble:
          – Simple average: (model 1 prediction + model 2 prediction) / 2
          – Average ranks for AUC (simple average after converting to ranks)
          – Manually tune weights with cross-validation
          – Use a geometric-mean weighted average
          – Use meta-modelling (also called stacked generalization or stacking)
      ● Check GitHub for a complete example of these methods using the Amazon competition hosted by Kaggle: https://github.com/kaz-Anova/ensemble_amazon (top 60 rank).
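The first few ensembling options can be sketched in plain Python. This is an illustration only and is unrelated to the code in the linked repository; the function names, the two prediction lists and the weights are made up. The rank conversion breaks ties by position rather than averaging them, and the geometric mean assumes strictly positive predictions.

```python
import math


def simple_average(preds):
    """Average the predictions of several models, row by row."""
    return [sum(col) / len(col) for col in zip(*preds)]


def to_ranks(scores):
    """Convert scores to ranks scaled into (0, 1]; higher score, higher rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank / len(scores)
    return ranks


def rank_average(preds):
    """Average ranks instead of raw scores; suited to ranking metrics like AUC."""
    return simple_average([to_ranks(p) for p in preds])


def weighted_geomean(preds, weights):
    """Weighted geometric mean of positive predictions, row by row."""
    total = sum(weights)
    return [
        math.exp(sum(w * math.log(p) for w, p in zip(weights, col)) / total)
        for col in zip(*preds)
    ]


# Hypothetical predictions from two models on four rows.
model1 = [0.9, 0.2, 0.6, 0.4]
model2 = [0.8, 0.1, 0.3, 0.7]

avg = simple_average([model1, model2])
ranked = rank_average([model1, model2])
geo = weighted_geomean([model1, model2], weights=[2, 1])
```

Rank averaging helps when the models output scores on different scales, since only the ordering of each model's predictions is kept. The weights in the last call would normally be tuned with cross-validation, as the slide suggests.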
  • 23. Different models I have experimented with, vol. 1
      ● Logistic/linear/discriminant regression: fast, scalable, comprehensible, solid under high dimensionality, can be memory-light. Best when relationships are linear or all features are categorical. Good for text classification too.
      ● Random forests: probably the best one-off overall algorithm out there (in my experience). Fast, scalable, memory-medium. Best when all features are numeric-continuous and there are strong non-linear relationships. Does not cope well with high dimensionality.
      ● Gradient boosting (trees): less memory-intensive than forests (as individual predictors tend to be weaker). Fast, semi-scalable, memory-medium. Good when forests are good.
      ● Neural nets (AKA deep learning): good for tasks humans are good at, such as image recognition and sound recognition. Good with categorical variables too (as they replicate on-and-off signals). Medium-speed, scalable, memory-light. Generally good for linear and non-linear tasks. May take a long time to train, depending on structure. Many parameters to tune. Very prone to over- and underfitting.
  • 24. Different models I have experimented with, vol. 2
      ● Support vector machines (SVMs): medium-speed, not scalable, memory-intensive. Still good at capturing linear and non-linear relationships. Holding the kernel matrix takes too much memory. Not advisable for data sets bigger than 20k.
      ● K nearest neighbours: slow (depending on the size), not easily scalable, memory-heavy. Good when telling the good from the bad is really a matter of how closely a case resembles specific individuals. Also good when there are many target variables, as the similarity measures remain the same across different observations. Good for text classification too.
      ● Naïve Bayes: quick, scalable, memory-OK. Good for quick classifications on big datasets. Not particularly predictive.
      ● Factorization machines: good gateways between linear and non-linear problems. They stand between regressions, KNNs and neural networks. Memory-medium, semi-scalable, medium-speed. Good for predicting the rating a customer will assign to a product.
  • 25. Tools, vol. 1
      ● Languages: Python, R, Java
      ● Liblinear for linear models: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
      ● LibSvm for support vector machines: www.csie.ntu.edu.tw/~cjlin/libsvm/
      ● Scikit package in Python for text classification, random forests and gradient boosting machines: scikit-learn.org/stable/
      ● Xgboost for fast, scalable gradient boosting: https://github.com/tqchen/xgboost
      ● LightGBM: https://github.com/Microsoft/LightGBM
      ● Vowpal Wabbit for fast, memory-efficient linear models: hunch.net/~vw/
      ● Encog for neural nets: http://www.heatonresearch.com/encog
      ● H2O in R for many models
  • 26. Tools, vol. 2
      ● LibFm: www.libfm.org
      ● LibFFM: https://www.csie.ntu.edu.tw/~cjlin/libffm/
      ● Weka in Java (has everything): http://www.cs.waikato.ac.nz/ml/weka/
      ● Graphchi for factorizations: https://github.com/GraphChi
      ● GraphLab for lots of stuff: https://dato.com/products/create/open_source.html
      ● Cxxnet: one of the best implementations of convolutional neural nets out there. Difficult to install and requires a GPU with an NVIDIA graphics card. https://github.com/antinucleon/cxxnet
      ● RankLib: the best library out there, made in Java, suited for ranking algorithms (e.g. rank products for customers); supports optimization functions like NDCG. people.cs.umass.edu/~vdang/ranklib.html
      ● Keras (http://keras.io/) and Lasagne (https://github.com/Lasagne/Lasagne) for nets. This assumes you have Theano (http://deeplearning.net/software/theano/) or TensorFlow (https://www.tensorflow.org/).
  • 27. Where to go next to prepare for competitions
      ● Coursera: Andrew Ng’s class, https://www.coursera.org/course/ml
      ● Kaggle.com: many competitions for learning. For instance: http://www.kaggle.com/c/titanic-gettingStarted . Look for the “knowledge flag”.
      ● Very good slides from the University of Utah: www.cs.utah.edu/~piyush/teaching/cs5350.html
      ● clopinet.com/challenges/ : many past predictive modelling competitions with tutorials.
      ● Wikipedia. Not to be underestimated. Still the best source of information out there (collectively).