2. What is Kaggle?
• The world's biggest predictive-modelling
competition platform
• Half a million members
• Companies host data challenges.
• Usual tasks include:
– Predict topic or sentiment from text.
– Predict species/type from image.
– Predict store/product/area sales.
– Predict marketing response.
3. Inspired by horse races…!
• At the University of Southampton, an
entrepreneur talked to us about how he was
able to predict horse races with regression!
4. Was curious, wanted to learn
more
• Learned statistical tools (like SAS, SPSS, R)
• Became more passionate!
• Picked up programming skills
5. Built KazAnova
• Generated a couple of algorithms and data
techniques and decided to make them public
so that others could gain from them.
• Released it at http://www.kazanovaforanalytics.com/
• Named after ANOVA (statistics) and
KAZANI, my mom's last name.
6. Joined dunnhumby … and Kaggle!
• Joined dunnhumby's science team
• They had already hosted 2 Kaggle contests!
• Was curious about Kaggle.
• Joined a few contests and learned lots.
• The community was very open to sharing and
collaboration.
9. And text classification and sentiment
Identify the writer…
Who wrote this? 'To be, or not to be':
Shakespeare or Molière?
Detect sentiment…
'The burger is not bad'
A negated bigram ("not bad") makes a negative word into a positive comment.
10. 3 years of modelling competitions
• Over 75 competitions
• Participated with 35 different teams
• 21 top-10 finishes
• 8-time prize winner
• 3 different modelling platforms
• Ranked 1st out of 480,000 data scientists
11. What's next
• Data science within dunnhumby
• PhD (UCL) about recommender systems.
• Kaggling for fun
12. Amazon.com - Employee Access Challenge
• Link:
https://www.kaggle.com/c/amazon-employee-access
• Objective: Predict whether an employee will require special
access (like manual access transactions).
• Lessons learned:
1. Logistic regression can be great when combined
with regularization to deal with high dimensionality
(e.g. many variables/features).
2. Keeping the data in sparse format speeds things up
a lot.
3. Sharing is caring! Great participation, a positive
attitude towards helping others, and lots of help from
the forum. Kaggle is the way to learn and improve!
4. Scikit-learn + Python is great!
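The first two lessons can be sketched together in scikit-learn: one-hot encode categorical IDs into a sparse matrix, then fit a regularized logistic regression. The data, column counts, and parameters below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the data: a few categorical ID columns and a binary target
rng = np.random.default_rng(0)
X_cat = rng.integers(0, 50, size=(200, 4))
y = rng.integers(0, 2, size=200)

# One-hot encoding produces a sparse matrix: high-dimensional but memory-cheap
enc = OneHotEncoder(handle_unknown="ignore")
X = enc.fit_transform(X_cat)                 # scipy sparse matrix

# L2-regularized logistic regression copes with the many resulting features
clf = LogisticRegression(C=1.0, max_iter=1000)
clf.fit(X, y)
probs = clf.predict_proba(X)[:, 1]
```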
13. RecSys Challenge 2013: Yelp business rating prediction
• Link:
https://www.kaggle.com/c/yelp-recsys-2013
• Objective: Predict what rating a customer will
give to a business
• Lessons learned:
1. Factorization machines, and specifically libFM
(http://www.libfm.org/), are great for summarizing
the relationship between a customer and a
business, as well as combining many other factors.
2. Basic data manipulation (like joins, merges,
aggregations) as well as feature engineering is
important.
3. Simpler/linear models did well for this task.
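A rough NumPy sketch of what a factorization machine scores (not libFM itself, just the model equation): a bias, linear terms, and factorized pairwise interactions, computed with the usual O(n·k) identity and checked against the naive pairwise sum.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Factorization-machine score for one feature vector x.

    w0: bias, w: linear weights (n,), V: factor matrix (n, k).
    Uses the O(n*k) identity for the pairwise-interaction term.
    """
    linear = w0 + w @ x
    Vx = V.T @ x                                  # per-factor weighted sums
    pair = 0.5 * np.sum(Vx ** 2 - (V ** 2).T @ (x ** 2))
    return linear + pair

# Tiny self-check against the naive O(n^2) pairwise sum
rng = np.random.default_rng(1)
n, k = 6, 3
x = rng.normal(size=n)
w0, w, V = 0.1, rng.normal(size=n), rng.normal(size=(n, k))
naive = w0 + w @ x + sum(V[i] @ V[j] * x[i] * x[j]
                         for i in range(n) for j in range(i + 1, n))
assert np.isclose(fm_predict(x, w0, w, V), naive)
```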
14. Cause-effect pairs
• Link: https://www.kaggle.com/c/cause-effect-pairs
• Objective: "Correlation does not mean
causation." Out of 2 series of numbers, find
which one is causing the other!
• Lessons learned:
1. In general, the series causing the other
has a higher chance of being able to predict it
better with a nonlinear model, given some noise.
2. Gradient boosting machines can be great for
this task.
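A minimal illustration of lesson 1 on synthetic data (the generating function is made up): fit a gradient boosting machine in both directions and compare fit errors; the causal direction tends to fit better.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=1000)
y = np.sin(x) * x + rng.normal(scale=0.1, size=1000)   # x causes y

def fit_error(a, b):
    """MSE of predicting b from a with a gradient boosting machine."""
    model = GradientBoostingRegressor(n_estimators=100)
    model.fit(a.reshape(-1, 1), b)
    return mean_squared_error(b, model.predict(a.reshape(-1, 1)))

err_xy = fit_error(x, y)   # causal direction: easy, y is a noisy function of x
err_yx = fit_error(y, x)   # anticausal: y maps back to two possible x values
```

Here the causal direction yields a much smaller error, which is the heuristic the lesson describes.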
15. StumbleUpon Evergreen Classification Challenge
• Link: https://www.kaggle.com/c/stumbleupon
• Objective: Build a classifier to categorize
webpages as evergreen (contain timeless quality)
or non-evergreen
• Lessons learned:
1. Some overfitting again (CV process not right yet).
Better safe than sorry from now on!
2. Impressive how tf-idf gives such a good
classification from the contents of the webpage as
text.
3. Dimensionality reduction with singular value
decomposition on sparse data (in a way that 'topics'
are created) is very powerful too.
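Lessons 2 and 3 chain naturally in scikit-learn: tf-idf on the page text, then TruncatedSVD (latent semantic analysis) to compress the sparse term matrix into a few dense 'topic' dimensions. The documents below are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

docs = [
    "how to boil the perfect egg every time",     # evergreen-ish content
    "recipe classic tomato soup from scratch",
    "breaking news election results tonight",     # non-evergreen-ish content
    "live scores from today's football match",
]

# tf-idf builds sparse term vectors; TruncatedSVD works directly on sparse
# input and yields dense low-dimensional 'topic' features
lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2, random_state=0))
topics = lsa.fit_transform(docs)                  # dense (n_docs, 2) matrix
```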
16. Multi-label Bird Species Classification - NIPS 2013
• Link:
https://www.kaggle.com/c/multilabel-bird-species-classification-nips2013
• Objective: Identify which of 87 classes of birds and
amphibians are present in 1,000 continuous wild sound
recordings
• Lessons learned:
1. Converting the sound clips to numbers using Mel
Frequency Cepstral Coefficients (MFCC) and then
creating some basic aggregate features based on them
was more than enough to get a good score.
2. This was a good tutorial:
http://practicalcryptography.com/miscellaneous/machine-l
3. Meta-modelling gave a good boost, as in using some
models' predictions as features for new models.
4. I can make predictions in a field I literally know nothing
about!
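A sketch of the aggregation step in lesson 1. The frame-level MFCC matrix here is random, standing in for what an MFCC library would produce from a real recording; the point is collapsing a variable-length clip into one fixed-length feature row.

```python
import numpy as np

# Hypothetical frame-level MFCC matrix for one recording, standing in for
# the (n_frames, n_coefficients) output of an MFCC library
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(500, 13))

def aggregate_features(frames):
    """Collapse a variable-length frame matrix into one fixed-length row."""
    return np.concatenate([
        frames.mean(axis=0), frames.std(axis=0),
        frames.min(axis=0), frames.max(axis=0),
    ])

features = aggregate_features(mfcc)   # 13 coefficients x 4 stats = 52 values
```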
17. March Machine Learning Mania
• Link:
https://www.kaggle.com/c/march-machine-learning-mania
• Objective: predict the 2014 NCAA Tournament
• Lessons learned:
1. Combining pleasure with data = double pleasure
(I am a huge NBA fan)! It was also my first top-10
finish!
2. Trust the rating agencies – they do a great job
and they have more data than you!
3. Simple models worked well.
18. The Allen AI Science Challenge
• Link: https://www.kaggle.com/c/the-allen-ai-science-challenge
• Objective: Make a model that predicts the right
answer in an 8th-grade science examination test
• Lessons learned:
1. Lucene (http://
www.docjar.com/html/api/org/apache/lucene/benchmark/utils/ExtractWikip
) was very efficient at indexing Wikipedia and
calculating distances between questions and answers.
2. Gensim word2vec (https://
radimrehurek.com/gensim/models/word2vec.html)
helped by representing each word as
a sequence of numbers.
19. Higgs Boson Machine Learning Challenge
• Link: https://www.kaggle.com/c/higgs-boson
• Objective: Use data from the ATLAS experiment (collected
by the Large Hadron Collider) to identify the Higgs boson
• Lessons learned:
1. XGBoost! (https://github.com/dmlc/xgboost) Extreme
gradient boosting. I knew this tool was going to
make a huge impact in the future: multithreaded,
sparse data, super accuracy, many objective
functions.
2. Deep learning showing some teeth.
3. RGF (http://stat.rutgers.edu/home/tzhang/software/rgf/) was
good.
4. Physics knowledge was probably useful.
20. Driver Telematics Analysis
• Link: https://www.kaggle.com/c/axa-driver-telematics-analysis
• Objective: Use telematic data to identify a driver
signature
• Lessons learned:
1. Geospatial stats were useful.
2. Extracting features like average speed or
acceleration was critical.
3. Treating this as a supervised problem seemed to
help.
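A sketch of lesson 2 on a made-up trip: deriving speed and acceleration features from per-second (x, y) positions. The trip data and feature names are illustrative.

```python
import numpy as np

# Hypothetical trip: one (x, y) position per second, in metres
rng = np.random.default_rng(0)
trip = np.cumsum(rng.normal(scale=5, size=(300, 2)), axis=0)

steps = np.diff(trip, axis=0)             # displacement between samples
speed = np.linalg.norm(steps, axis=1)     # speed in m/s (1 Hz sampling)
accel = np.diff(speed)                    # change in speed per second

# Driver-signature features of the kind described above
features = {
    "mean_speed": speed.mean(),
    "max_speed": speed.max(),
    "p95_speed": np.percentile(speed, 95),
    "mean_abs_accel": np.abs(accel).mean(),
}
```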
21. Microsoft Malware Classification Challenge (BIG 2015)
• Link: https://www.kaggle.com/c/malware-classification
• Objective: Classify viruses based on file contents
• Report:
http://blog.kaggle.com/2015/05/11/microsoft-malware-winners-intervie/
• Lessons learned:
1. Treating this problem as NLP (with bytes being
the words) worked really well, as certain
sequences of bytes were more characteristic
of certain viruses.
2. Information from different compression
techniques was also indicative of the virus.
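A sketch of the bytes-as-words idea from lesson 1, using scikit-learn's text tools on hypothetical hex dumps: byte tokens become words and byte sequences become n-grams, exactly as in ordinary text NLP.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical hex dumps: each file is a string of byte tokens
files = [
    "4d 5a 90 00 4d 5a 90 00 ff d8",
    "4d 5a 90 00 e8 00 00 00 ff d8",
    "7f 45 4c 46 02 01 01 00 aa bb",
]

# Treat bytes as words and byte pairs as bigrams; token_pattern keeps
# every whitespace-separated hex token
vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), token_pattern=r"\S+")
X = vec.fit_transform(files)              # sparse (n_files, n_byte_ngrams)
```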
22. Otto Group Product Classification Challenge
• Link:
https://www.kaggle.com/c/otto-group-product-classification-challenge
• Objective: Classify products into the correct
category with anonymized features
• Lessons learned:
1. Deep learning was very good for this task.
2. Lasagne (Theano-based:
http://lasagne.readthedocs.org/en/latest/) was
a very good tool.
3. Multi-level meta modelling gave a boost.
4. Pretty much every common model family
contributed!
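A minimal two-level sketch of the meta modelling in lesson 3, on synthetic data: out-of-fold predictions from base models become the features of a second-level model, so the meta model never sees predictions made on a row's own training fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=300, n_features=20, n_classes=3,
                           n_informative=5, random_state=0)

# Level 1: out-of-fold class probabilities from each base model
base_models = [RandomForestClassifier(n_estimators=50, random_state=0),
               LogisticRegression(max_iter=1000)]
meta_features = np.hstack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")
    for m in base_models
])

# Level 2: a simple meta model trained on the stacked predictions
meta = LogisticRegression(max_iter=1000).fit(meta_features, y)
```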
23. Click-Through Rate Prediction
• Link: https://www.kaggle.com/c/avazu-ctr-prediction
• Objective: Predict whether a mobile ad will be
clicked
• Lessons learned:
1. Follow The Regularized Leader (FTRL), which
uses the hashing trick, was extremely efficient at
making good predictions using less than 1 MB of RAM
on 40+ million data rows with thousands of different
categories.
2. Same old tricks (weight of evidence, algorithms on sparse
data, meta stacking).
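Lesson 1 can be sketched in pure Python, modelled on the well-known FTRL-proximal recipe (all parameters, feature names, and the toy stream below are illustrative): weights live in fixed-size arrays indexed by hashed feature, so memory stays bounded no matter how many raw categories appear.

```python
import math

class FTRL:
    """Minimal FTRL-proximal logistic regression with the hashing trick."""

    def __init__(self, D=2 ** 18, alpha=0.5, beta=1.0, l1=1.0, l2=1.0):
        self.D, self.alpha, self.beta, self.l1, self.l2 = D, alpha, beta, l1, l2
        self.n = [0.0] * D   # per-weight sums of squared gradients
        self.z = [0.0] * D   # lazily L1-regularized weight state

    def _w(self, i):
        # FTRL-proximal closed form: weights inside the L1 ball stay exactly 0
        sign = -1.0 if self.z[i] < 0 else 1.0
        if sign * self.z[i] <= self.l1:
            return 0.0
        return (sign * self.l1 - self.z[i]) / (
            (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2)

    def predict(self, features):
        wx = sum(self._w(hash(f) % self.D) for f in features)
        return 1.0 / (1.0 + math.exp(-max(min(wx, 35.0), -35.0)))

    def update(self, features, y):
        p = self.predict(features)
        g = p - y                      # logloss gradient for binary features
        for i in (hash(f) % self.D for f in features):
            sigma = (math.sqrt(self.n[i] + g * g)
                     - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * self._w(i)
            self.n[i] += g * g

# Tiny illustrative stream: two made-up feature patterns with fixed labels
model = FTRL()
stream = [({"site=a", "device=phone"}, 1), ({"site=b", "device=pc"}, 0)] * 200
for feats, label in stream:
    model.update(feats, label)
```

After one pass over the stream, the model separates the two patterns, while total memory is just the two fixed arrays of size D.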
24. Truly Native?
• Link: https://www.kaggle.com/c/dato-native
• Objective: Predict which web pages served by
StumbleUpon are sponsored
• Report:
http://blog.kaggle.com/2015/12/03/dato-winners-interview-/
• Lessons learned:
1. Modelling with a trained corpus of over 4-grams
was vital for winning.
2. Fully connected meta modelling at level 4 (StackNet).
3. Used many different input formats (zipped data or
not).
4. Generating over 40 models allowed for greater
generalization.
25. Homesite Quote Conversion
• Link: https://www.kaggle.com/c/homesite-quote-conversion
• Objective: Which customers will purchase a
quoted insurance plan?
• Lessons learned:
1. Generating a large pool of (500) models was
really useful in exploiting AUC to the maximum.
2. Feature engineering with XGBfi and with noise
imputation.
3. Exploring up to 4-way interactions.
4. Retraining already trained models.
5. Dynamic collaboration is best!
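A sketch of lesson 1's idea with a much smaller, entirely synthetic pool: greedy forward selection ("hill climbing") on out-of-fold predictions, repeatedly adding whichever model most raises the blended AUC, with repetition allowed.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)

# Hypothetical out-of-fold predictions from a pool of 5 models:
# each column is one model's predicted probability of the positive class
signal = y + rng.normal(scale=1.0, size=(5, 500))
preds = (1 / (1 + np.exp(-signal))).T              # shape (500, 5)

# Greedy forward selection on AUC: at each step, add the model whose
# inclusion in the running average gives the best blended AUC
blend, picked = np.zeros(500), []
for _ in range(20):
    scores = [roc_auc_score(y, (blend * len(picked) + preds[:, j])
                            / (len(picked) + 1))
              for j in range(preds.shape[1])]
    best = int(np.argmax(scores))
    blend = (blend * len(picked) + preds[:, best]) / (len(picked) + 1)
    picked.append(best)
```

With a real pool of hundreds of models, the same loop runs over many more columns; allowing repetition effectively weights the stronger models.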
26. So… what wins competitions?
In short:
• Understand the problem
• Discipline
• Try problem-specific things or new approaches
• The hours you put in
• The right tools
• Collaboration
• Experience
• Ensembling
• Luck
34. A Data Science Hero
• Me: "Don't get stressed."
• Lucas: "I want to. I want to win." (20/04/2016,
about the Santander competition)
• He passed away 4 days later (24/04/2016), after battling cancer
for 2.5 years.
• Find Lucas' winning solutions (and post-competition
threads) and learn from the best!