How to Win Machine Learning Competitions?

By Marios Michailidis
It’s not the destination… it’s the journey!

What is Kaggle?
• World's biggest predictive modelling competition platform
• Half a million members
• Companies host data challenges.
• Usual tasks include:
– Predict topic or sentiment from text.
– Predict species/type from image.
– Predict store/product/area sales
– Marketing response

Inspired by Horse races…!
• At the University of Southampton, an
entrepreneur talked to us about how he was
able to predict the horse races with regression!

Was curious, wanted to learn more
• Learned statistical tools (like SAS, SPSS, R)
• I became more passionate!
• Picked up programming skills

Built KazAnova
• Generated a couple of algorithms and data techniques and decided to make them public so that others can benefit from them.
• I released it at http://www.kazanovaforanalytics.com/
• Named it after ANOVA (statistics) and KAZANI, my mom’s last name.

Joined Kaggle!
• Was curious about Kaggle.
• Joined a few contests and learned lots.
• The community was very open to sharing and
collaboration.

Other interesting tasks!
• Predict the correct answer on a science test, using Wikipedia.
• Predict which virus has infected a file.
• Predict the NCAA tournament!
• Predict the Higgs boson
• Out of many different numerical series, predict which one is the cause

3 Years of modelling competitions
• Over 90 competitions
• Participated with 45 different teams
• 22 top 10 finishes
• 12 times prize winner
• 3 different modelling platforms
• Ranked 1st out of 480,000 data scientists

What's next
• PhD (UCL) about recommender systems.
• More Kaggling (but less intense)!

So… what wins competitions?
In short:
• Understand the problem
• Discipline
• Try problem-specific things or new approaches
• The hours you put in
• The right tools
• Collaboration.
• Ensembling

More specifically…
● Understand the problem and the function to optimize
● Choose what seems to be a good algorithm for the problem and iterate:
• Clean the data
• Scale the data
• Make feature transformations
• Make feature derivations
• Use cross validation to:
– Make feature selections
– Tune hyperparameters of the algorithm
● Do this for many other algorithms, always exploiting their benefits
● Find the best way to combine or ensemble the different algorithms
● Different types of models and when to use them

Understand the metric to optimize
● For example, in the competition it was AUC (Area Under the ROC Curve).
● This is a ranking metric.
● It shows how consistently your good cases receive a higher score than your bad cases.
● Not all algorithms optimize the metric you want.
● Common metrics:
– AUC
– Classification accuracy
– Precision
– NDCG,
– RMSE
– MAE
– Deviance
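
To make the "ranking metric" idea concrete, here is a minimal plain-Python sketch of AUC computed directly from its pairwise definition: the share of (good case, bad case) pairs in which the good case gets the higher score. The function name and the toy numbers are illustrative assumptions, and the double loop is only practical for small data.

```python
def auc(labels, scores):
    """Share of (positive, negative) pairs where the positive case scores higher.

    Ties count as half a win. Labels are assumed to be 1 (good) or 0 (bad).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (made-up numbers): one positive case is ranked below a negative one.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auc(labels, scores))
```

Because only the ordering of the scores matters, multiplying every score by a positive constant leaves the result unchanged, which is why a model can score well on AUC without producing calibrated probabilities.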

Choose the algorithm
● In most cases you experiment with many algorithms before finding the right one(s).
● Those who try more models/parameters have a higher chance of winning contests than others.
● Any algorithm that makes sense to use should be used.

In the algorithm: Clean the Data
● The data cleaning step is not independent of the chosen algorithm. Different algorithms call for different cleaning filters.
● Treating missing values is really important; certain algorithms are more forgiving than others.
● On other occasions it may make more sense to carefully replace a missing value with a sensible one (like the average of the feature), treat it as a separate category or even remove the whole observation.
● Similarly we search for outliers; again, some models are more forgiving, while in others the impact of outliers is detrimental.
● How to decide the best method?
– Try them all or
– Experience and literature, but mostly the first (bold)
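
The three missing-value treatments mentioned above can be sketched in a few lines of plain Python. This is an illustration under simple assumptions: missing values are represented as None, and the function names and the "MISSING" label are made up for the example.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a feature with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def impute_category(values, label="MISSING"):
    """Treat a missing categorical value as its own category."""
    return [label if v is None else v for v in values]


def drop_incomplete(rows):
    """Remove every observation (row) that has at least one missing value."""
    return [row for row in rows if all(v is not None for v in row)]


ages = [20.0, None, 40.0]
colours = ["red", None, "blue"]
rows = [[20.0, "red"], [None, "blue"], [40.0, None]]

print(impute_mean(ages))
print(impute_category(colours))
print(drop_incomplete(rows))
```

Which of the three works best depends on the algorithm, so in practice each variant would be compared with cross validation.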

In the algorithm: scaling the data
● Certain algorithms cannot deal with unscaled data.
● Scaling techniques:
– Max scaler: divide each feature by its highest absolute value
– Normalization: subtract the mean and divide by the standard deviation
– Conditional scaling: scale only under certain conditions (e.g. in medicine we tend to scale per subject to make their features
comparable)
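
A minimal sketch of the three scaling techniques, using only the standard library. The function names are assumptions made for this example; the per-group scaler is one possible reading of "conditional scaling", applying a chosen scaler separately within each subject.

```python
import math


def max_abs_scale(values):
    """Max scaler: divide by the highest absolute value of the feature."""
    peak = max(abs(v) for v in values)
    if peak == 0:
        return [0.0 for _ in values]
    return [v / peak for v in values]


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def scale_per_group(values, groups, scaler):
    """Conditional scaling: apply `scaler` separately within each group."""
    scaled = [None] * len(values)
    for group in set(groups):
        idx = [i for i, g in enumerate(groups) if g == group]
        for i, s in zip(idx, scaler([values[i] for i in idx])):
            scaled[i] = s
    return scaled


readings = [1.0, 2.0, 10.0, 20.0]
subjects = ["a", "a", "b", "b"]
print(max_abs_scale(readings))
print(standardize(readings))
print(scale_per_group(readings, subjects, max_abs_scale))
```

In the last call each subject's readings are divided by that subject's own maximum, so both subjects end up on the same 0 to 1 range.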

In the algorithm: feature transformations
● Certain algorithms benefit from transformed features, because these help them converge faster and better.
● Common transformations include:
1. LOG, SQRT of a variable: smooths variables
2. Dummies for categorical variables
3. Sparse matrices: to be able to compress the data
4. 1st derivatives: to smooth data
5. Weights of evidence (transforming variables while using information of the target variable)
6. Unsupervised methods that reduce dimensionality (SVD, PCA, ISOMAP, KDTREE, clustering)
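
Three of the transformations listed above, sketched in plain Python. The function names are illustrative, and the weight-of-evidence version uses one common formulation (log of the category's share of positives over its share of negatives) with a small smoothing constant, which is an assumption added here to avoid dividing by zero.

```python
import math


def log_transform(values):
    """LOG transform via log(1 + x); assumes non-negative inputs."""
    return [math.log1p(v) for v in values]


def one_hot(values):
    """Dummies: one 0/1 column per distinct category (columns sorted by name)."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]


def weight_of_evidence(categories, targets, smoothing=0.5):
    """Map each category to log(share of positives / share of negatives).

    Targets are assumed to be 0/1. `smoothing` is an assumption of this sketch.
    """
    total_pos = sum(targets)
    total_neg = len(targets) - total_pos
    woe = {}
    for c in set(categories):
        pos = sum(t for cat, t in zip(categories, targets) if cat == c)
        neg = sum(1 - t for cat, t in zip(categories, targets) if cat == c)
        pos_share = (pos + smoothing) / (total_pos + smoothing)
        neg_share = (neg + smoothing) / (total_neg + smoothing)
        woe[c] = math.log(pos_share / neg_share)
    return woe


print(log_transform([0.0, 9.0, 99.0]))
print(one_hot(["red", "blue", "red"]))
print(weight_of_evidence(["a", "a", "b", "b"], [1, 1, 0, 0]))
```

Since weights of evidence use the target, they would normally be computed on the training part of each split only, otherwise the validation score becomes too optimistic.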

In the algorithm: feature derivations
● In many problems this is the most important thing. For example:
– Text classification: generate the corpus of words and compute TF-IDF
– Sounds: convert sounds to frequencies through Fourier transformations
– Images: apply convolutions, e.g. break an image down into pixels and extract different parts of the image.
– Interactions: really important for some models. For our algorithms too! E.g. have variables that show if an item is popular AND the customer likes it.
– Anything else that makes sense: similarity features, dimensionality reduction features or even predictions from other models as features.
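
Two of these derivations are easy to show in plain Python: a bare-bones TF-IDF and an interaction feature. This is a sketch under simple assumptions (whitespace tokenization, term frequency as count divided by document length, IDF as log(N / document frequency)); real libraries use smoothed variants, so the numbers will differ.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return (vocabulary, rows) with one TF-IDF weight per vocabulary word."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for doc in tokenized for word in doc})
    n_docs = len(tokenized)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(word for doc in tokenized for word in set(doc))
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        length = max(len(doc), 1)  # guard against empty documents
        rows.append([
            (counts[word] / length) * math.log(n_docs / doc_freq[word])
            for word in vocabulary
        ])
    return vocabulary, rows


def interaction(feature_a, feature_b):
    """Interaction of two 0/1 features: 1 only when both are 1."""
    return [a * b for a, b in zip(feature_a, feature_b)]


vocabulary, rows = tfidf(["good movie", "bad movie"])
print(vocabulary)
print(rows)

item_is_popular = [1, 0, 1]
customer_likes_it = [1, 1, 0]
print(interaction(item_is_popular, customer_likes_it))
```

The word "movie" appears in every document, so its weight is zero: it carries no information for telling the two texts apart.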

In the algorithm: Cross-Validation
● This basically means that from my main set, I RANDOMLY create 2 sets. I build (train) my algorithm with the first one (let's call it the training set) and score the other (let's call it the validation set). I repeat this process multiple times and always check how my model performs on the validation set with respect to the metric I want to optimize.
● The process may look like:
1. For 10 (or however many, X) times:
– Split the set into training (50%-90% of the original data) and validation (50%-10% of the original data) sets
– Fit the algorithm on the training set
– Score the validation set
– Save the result of that scoring with respect to the chosen metric
2. Calculate the average of these 10 (X) results. That is the score you can expect in real life, and it is generally a good estimate.
● Remember to use a SEED to be able to replicate these X splits
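
The loop above can be sketched in plain Python as follows. The `fit` and `score` callables are placeholders for whatever algorithm and metric you use; the toy model here (predict the training mean, scored with MAE) and all names are assumptions chosen only to keep the example self-contained.

```python
import random


def repeated_holdout(X, y, fit, score, n_repeats=10, train_fraction=0.8, seed=42):
    """Repeat a random train/validation split and average the validation score."""
    rng = random.Random(seed)  # fixed seed so the X splits can be replicated
    indices = list(range(len(X)))
    n_train = int(len(indices) * train_fraction)
    results = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        train_idx, valid_idx = indices[:n_train], indices[n_train:]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        results.append(
            score(model, [X[i] for i in valid_idx], [y[i] for i in valid_idx])
        )
    return sum(results) / len(results), results


def fit_mean(X_train, y_train):
    """Toy 'algorithm': always predict the mean of the training target."""
    return sum(y_train) / len(y_train)


def mae(model, X_valid, y_valid):
    """Mean absolute error of the constant prediction on the validation set."""
    return sum(abs(target - model) for target in y_valid) / len(y_valid)


X = [[float(i)] for i in range(20)]
y = [2.0 * i for i in range(20)]
mean_score, scores = repeated_holdout(X, y, fit_mean, mae)
print(mean_score)
```

Calling the function again with the same seed reproduces exactly the same splits, which is what makes later comparisons (new features, new parameters) fair.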

In Cross-Validation: Do feature selection
● It is unlikely that all features are useful and some of them may be
damaging your model, because…
– They do not provide new information (collinearity)
– They are just not predictive (just noise)
● There are many ways to do feature selection:
– Run the algorithm and seek an internal measure to retrieve the most
important features (no cross validation)
– Forward selection with or without cross validation
– Backward selection with or without cross validation
– Noise injection
– Hybrid of all methods
● Normally a forward method is chosen with CV. That is, we add a feature, split the data using the exact same X splits as before and check whether our metric improved:
– If yes, the feature remains
– Else, goodbye!
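
A minimal sketch of forward selection with cross validation, in plain Python. The splits are generated once from a seed and reused for every candidate feature. The nearest-centroid classifier, the accuracy metric, the zero baseline and the toy data are all assumptions picked to keep the example small; any model and metric could take their place.

```python
import random


def make_splits(n_rows, n_repeats=5, train_fraction=0.7, seed=0):
    """Create the train/validation index splits once so they can be reused."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    n_train = int(n_rows * train_fraction)
    splits = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        splits.append((indices[:n_train], indices[n_train:]))
    return splits


def centroid_accuracy(X, y, features, splits):
    """Average validation accuracy of a nearest-centroid classifier."""
    scores = []
    for train_idx, valid_idx in splits:
        centroids = {}
        for label in set(y[i] for i in train_idx):
            rows = [X[i] for i in train_idx if y[i] == label]
            centroids[label] = [sum(r[f] for r in rows) / len(rows) for f in features]
        correct = 0
        for i in valid_idx:
            point = [X[i][f] for f in features]
            predicted = min(
                centroids,
                key=lambda lab: sum((p - c) ** 2 for p, c in zip(point, centroids[lab])),
            )
            correct += predicted == y[i]
        scores.append(correct / len(valid_idx))
    return sum(scores) / len(scores)


def forward_selection(X, y, candidates, splits, baseline=0.0):
    """Keep a feature only if adding it improves the cross-validated score."""
    selected, best = [], baseline
    for feature in candidates:
        score = centroid_accuracy(X, y, selected + [feature], splits)
        if score > best:
            selected, best = selected + [feature], score
    return selected, best


# Toy data: feature 0 separates the classes, feature 1 is unrelated noise.
X = [[(10.0 if i % 2 else 0.0) + 0.1 * (i % 3), float((i * 7) % 5)] for i in range(20)]
y = [i % 2 for i in range(20)]
splits = make_splits(len(X))
print(forward_selection(X, y, candidates=[0, 1], splits=splits))
```

Note that this greedy version depends on the order in which candidates are tried; with the baseline at zero, the first candidate is almost always accepted.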

In Cross-Validation: Hyper Parameter Optimization
● This generally takes lots of time, depending on the algorithm. For example, in a random forest the best model would need to be determined based on parameters such as the number of trees, max depth, minimum cases in a split, features to consider at each split etc.
● One way to find the best parameters is to manually change one (e.g. max_depth=10 instead of 9) while you keep everything else constant. I found that using this method helps you understand more about what would work in a specific dataset.
● Another way is to try many possible combinations of hyperparameters. We normally do that with Grid Search, where we provide an array of all possible values to be trialled with cross validation (e.g. try max_depth {8,9,10,11} and number of trees {90,120,200,500})
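
Grid Search itself is just a loop over every combination of the supplied values. The sketch below uses the grid from the slide; `toy_cv_score` is a hypothetical stand-in for "train a random forest with these parameters and return its cross-validated score", invented here so the example runs without any library.

```python
from itertools import product


def grid_search(param_grid, cv_score):
    """Try every combination in the grid and keep the best-scoring one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


param_grid = {
    "max_depth": [8, 9, 10, 11],
    "n_trees": [90, 120, 200, 500],
}


def toy_cv_score(params):
    """Hypothetical score surface that peaks at max_depth=10, n_trees=200."""
    return -abs(params["max_depth"] - 10) - abs(params["n_trees"] - 200) / 100


best_params, best_score = grid_search(param_grid, toy_cv_score)
print(best_params, best_score)
```

The cost grows multiplicatively: 4 depths times 4 tree counts is already 16 full cross validations, which is why the slide warns that this step takes lots of time.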

Train Many Algorithms
● Try to exploit their strengths
● For example, focus on using linear features with linear
regression (e.g. the higher the age the higher the
income) and non-linear ones with Random forest
● Make it so that each model tries to capture something new or even focuses on a different part of the data

Ensemble
● A key part (in winning competitions at least) is combining the various models made.
● Remember, even a crappy model can be useful to some small extent.
● Possible ways to ensemble:
– Simple average: (model 1 prediction + model 2 prediction)/2
– Average ranks for AUC (simple average after converting to rank)
– Manually tune weights with cross validation
– Use a weighted geometric mean
– Use meta-modelling (also called stacked generalization or stacking)
● Check GitHub for a complete example of these methods using the Amazon comp hosted by Kaggle: https://github.com/kaz-Anova/ensemble_amazon (top 60 rank).
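
The simpler blending options can be sketched in plain Python as below. This is an independent illustration, not code from the repository linked above; the function names and the toy predictions are assumptions.

```python
import math


def simple_average(*predictions):
    """Average the models' predictions case by case."""
    return [sum(values) / len(values) for values in zip(*predictions)]


def to_ranks(scores):
    """Convert scores to ranks (1 = lowest); tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    return ranks


def rank_average(*predictions):
    """Average ranks instead of raw scores; useful when the metric is AUC."""
    return simple_average(*(to_ranks(p) for p in predictions))


def weighted_geometric_mean(predictions, weights):
    """Weighted geometric mean; assumes strictly positive predictions."""
    total = sum(weights)
    return [
        math.exp(sum(w * math.log(v) for w, v in zip(weights, values)) / total)
        for values in zip(*predictions)
    ]


model1 = [0.10, 0.90, 0.50]   # e.g. probabilities
model2 = [10.0, 30.0, 20.0]   # e.g. raw scores on a different scale

print(simple_average([0.2, 0.8], [0.4, 0.6]))
print(rank_average(model1, model2))
print(weighted_geometric_mean([[0.2, 0.8], [0.4, 0.6]], weights=[2, 1]))
```

Rank averaging lets the two models above be combined even though their outputs live on different scales, since AUC only cares about ordering. The weights in the last call would normally be tuned with cross validation.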

Different models I have experimented with, vol 1
● Logistic/Linear/discriminant regression: Fast, scalable, comprehensible, solid under high dimensionality, can be memory-light. Best when relationships are linear or all features are categorical. Good for text classification too.
● Random Forests: Probably the best one-off overall algorithm out there (in my experience). Fast, scalable, memory-medium. Best when all features are numeric-continuous and there are strong non-linear relationships. Does not cope well with high dimensionality.
● Gradient Boosting (Trees): Less memory-intensive than forests (as individual predictors tend to be weaker). Fast, semi-scalable, memory-medium. Is good when forests are good.
● Neural Nets (AKA deep learning): Good for tasks humans are good at: image recognition, sound recognition. Good with categorical variables too (as they replicate on-and-off signals). Medium-speed, scalable, memory-light. Generally good for linear and non-linear tasks. May take a long time to train depending on structure. Many parameters to tune. Very prone to over- and underfitting.

Different models I have experimented with, vol 2
● Support Vector Machines (SVMs): Medium-speed, not scalable, memory-intensive. Still good at capturing linear and non-linear relationships. Holding the kernel matrix takes too much memory. Not advisable for data sets bigger than 20k.
● K Nearest Neighbours: Slow (depending on the size), not easily scalable, memory-heavy. Good when telling the good from the bad is really a matter of how closely a case resembles specific individuals. Also good when there are many target variables, as the similarity measures remain the same across different observations. Good for text classification too.
● Naïve Bayes: Quick, scalable, memory-ok. Good for quick classifications on big datasets. Not particularly predictive.
● Factorization Machines: Good gateways between linear and non-linear problems. Stand between regressions, KNNs and neural networks. Memory-medium, semi-scalable, medium-speed. Good for predicting the rating a customer will assign to a product.

Tools vol 1
● Languages : Python, R, Java
● Liblinear : for linear models
http://www.csie.ntu.edu.tw/~cjlin/liblinear/
● LibSvm for Support Vector machines
www.csie.ntu.edu.tw/~cjlin/libsvm/
● Scikit-learn package in Python for text classification, random forests and gradient boosting machines scikit-learn.org/stable/
● Xgboost for fast scalable gradient boosting
https://github.com/tqchen/xgboost
● LightGBM https://github.com/Microsoft/LightGBM
● Vowpal Wabbit for fast, memory-efficient linear models hunch.net/~vw/
● Encog for neural nets http://www.heatonresearch.com/encog
● H2O in R for many models

Tools vol 2
● LibFm www.libfm.org
● LibFFM: https://www.csie.ntu.edu.tw/~cjlin/libffm/
● Weka in Java (has everything) http://www.cs.waikato.ac.nz/ml/weka/
● Graphchi for factorizations: https://github.com/GraphChi
● GraphLab for lots of stuff. https://dato.com/products/create/open_source.html
● Cxxnet: One of the best implementations of convolutional neural nets out there. Difficult to install and requires an NVIDIA GPU. https://github.com/antinucleon/cxxnet
● RankLib: The best library out there, made in Java, suited for ranking algorithms (e.g. rank products for customers) that supports optimization functions like NDCG. people.cs.umass.edu/~vdang/ranklib.html
● Keras (http://keras.io/) and Lasagne (https://github.com/Lasagne/Lasagne) for nets. This assumes you have Theano (http://deeplearning.net/software/theano/) or Tensorflow (https://www.tensorflow.org/).

Where to go next to prepare for competitions
● Coursera: https://www.coursera.org/course/ml Andrew Ng’s class
● Kaggle.com: many competitions for learning. For instance: http://www.kaggle.com/c/titanic-gettingStarted . Look for the “knowledge flag”
● Very good slides from the University of Utah: www.cs.utah.edu/~piyush/teaching/cs5350.html
● clopinet.com/challenges/ . Many past predictive modelling competitions with tutorials.
● Wikipedia. Not to be underestimated. Still the best source of information out there (collectively).
