Kaggle Otto Challenge
How we achieved 85th out of 3,514 and what we learnt
Eugene Yan & Wang Weimin
Kaggle: A platform for predictive modeling competitions
Otto Product Classification
Challenge: Classify products
into 9 main product
categories
One of the most popular Kaggle
competitions ever
Our team achieved
85th position out of
3,514 teams
93 (obfuscated) numerical
features provided
Target with 9
categories
Let’s take a look at the data
Evaluation Metric: (minimize) multi-class log loss
$$\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log(p_{ij})$$
$N$ = no. of products in dataset
$M$ = no. of class labels (i.e., 9 classes)
$\log$ = natural logarithm
$y_{ij}$ = 1 if observation $i$ is in class $j$ and 0 otherwise
$p_{ij}$ = predicted probability that observation $i$ belongs to class $j$
Minimizing multi-class log loss
heavily penalizes falsely
confident predictions
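A minimal sketch of the metric in Python with NumPy, using the symbols defined above. The function name `multiclass_log_loss` and the clipping constant `eps` are illustrative assumptions, not taken from the competition code. The two toy predictions show how a falsely confident answer is penalized far more than a hedged one.

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: integer class labels, shape (N,)
    probs:  predicted probabilities, shape (N, M), rows sum to 1
    eps:    clipping constant (an assumption) to avoid log(0)
    """
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    n = probs.shape[0]
    # y_ij is 1 only for the true class, so the double sum reduces to
    # picking p_ij at the true class of each row.
    return -np.mean(np.log(probs[np.arange(n), np.asarray(y_true)]))

# One product whose true class is 0, with three classes for brevity.
hedged = multiclass_log_loss([0], [[0.4, 0.3, 0.3]])
confident_wrong = multiclass_log_loss([0], [[0.01, 0.98, 0.01]])
print(hedged, confident_wrong)
```

The hedged prediction costs about 0.92 while the confident wrong one costs about 4.6, which is why over-confident models score badly on this metric.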
Validation (two main approaches)
Approach 1: Local validation using holdout (training set: parameter tuning using 5-fold cross-validation; holdout: local validation)
• Train models on 80% train set and validate against 20% local holdout
• Ensemble by fitting predictions from 80% train set on 20% local holdout
• Reduces risk of overfitting leaderboard
• Build model twice for submission
– Once for local validation
– Once for leaderboard submission
Approach 2: Parameter tuning and validation using 5-fold cross-validation
• Train models on full data set with 5-fold cross-validation
• Build model once for submission
• Low risk of overfitting if CV score is close to leaderboard score (i.e., training data similar to testing data)
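The two validation set-ups can be sketched as index splits. This is an illustrative NumPy-only version: the helper names, the seed, and the row count are assumptions, and the split is not stratified by class, which a real pipeline would probably want.

```python
import numpy as np

def holdout_split(n_rows, holdout_frac=0.2, seed=0):
    """Shuffle row indices and split them into train / local holdout."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)
    n_holdout = int(round(n_rows * holdout_frac))
    return idx[n_holdout:], idx[:n_holdout]

def kfold_indices(indices, k=5):
    """Yield (train, validation) index arrays for k-fold cross-validation."""
    folds = np.array_split(indices, k)
    for i in range(k):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]

n_rows = 1000  # hypothetical number of training rows
train_idx, holdout_idx = holdout_split(n_rows)   # 80% / 20%
folds = list(kfold_indices(train_idx, k=5))      # tune on the 80% only
print(len(train_idx), len(holdout_idx), [len(v) for _, v in folds])
```

For the second approach, `kfold_indices` would be called on all row indices instead of only the 80% train portion.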
Feature
Engineering
Do we need all 93 features?
Can we reduce noise to
reveal more of the signal?
Dimensionality reduction
led nowhere: No clear ‘elbow’
from principal components
analysis
L1 regularization (lasso) L2 regularization
Feature Selection via elastic net/lasso dropped four
features, but led to significant drop in accuracy
The data looks very skewed;
should we make it more ‘normal’?
Would standardizing/ rescaling
the data help?
Feature Transformation: Surprisingly, transforming features helped with tree-based techniques
• z-standardization, $\frac{x - \mathrm{mean}(x)}{\mathrm{standard\ deviation}(x)}$: improved tree-based models a bit
• Difference from mean, $x - \mathrm{mean}(x)$: worked better than z-standardization
• Difference from median, $x - \mathrm{median}(x)$: didn’t help (because most medians were 0)
• Log-transformation, $\log(x + 1)$: helped with Neural Networks
• Adding flags, if $x > 0$ then 1 else 0: terrible =/
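The five transformations above, sketched column-wise in NumPy. The function name, the toy count matrix, the use of the population standard deviation, and the handling of constant columns are all assumptions made for illustration.

```python
import numpy as np

def transform_features(X):
    """Return the five column-wise transformations as a dict of arrays."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # assumption: leave constant columns unscaled
    return {
        "z_standardized": (X - mean) / std,
        "diff_from_mean": X - mean,
        "diff_from_median": X - np.median(X, axis=0),
        "log1p": np.log1p(X),                 # log(x + 1); needs x >= 0
        "nonzero_flag": (X > 0).astype(int),
    }

# Toy count data (made up): 5 rows, 3 features, mostly zeros.
X = np.array([[0, 2, 0],
              [1, 0, 0],
              [0, 0, 5],
              [0, 0, 0],
              [0, 3, 0]])
out = transform_features(X)
print(out["diff_from_median"])
```

Every column of this toy matrix has a median of 0, so the difference-from-median output equals the input, which mirrors why that transformation added nothing.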
Though models like GBM and Neural Nets can approximate deep interactions, we can help find patterns and explicitly define:
• Complex features (e.g., week on week increase)
• Interactions (e.g., ratios, sums, differences)
Feature Creation: Aggregated features created by
applying functions by row worked well
• Row sums of features 1-93
• Row variances of features 1-93
• Count of non-zero features (1-93)
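A short sketch of the three row-wise aggregates in NumPy. The function name and the choice of sample variance (`ddof=1`) are assumptions; the slides do not say which variance estimator was used.

```python
import numpy as np

def add_row_aggregates(X):
    """Append row sum, row variance, and non-zero count as new columns."""
    X = np.asarray(X, dtype=float)
    row_sum = X.sum(axis=1, keepdims=True)
    row_var = X.var(axis=1, ddof=1, keepdims=True)   # sample variance (assumption)
    row_nonzero = (X != 0).sum(axis=1, keepdims=True)
    return np.hstack([X, row_sum, row_var, row_nonzero])

# Two made-up rows with four features each.
X = np.array([[0, 2, 0, 4],
              [1, 1, 1, 1]])
X_aug = add_row_aggregates(X)
print(X_aug.shape)
```

On the real data the input would have 93 columns and the output 96.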
Feature Creation: Top features selected from RF, GBM,
and XGB to create interaction features; top interaction
features helped a bit
+ interaction: feat_34 + feat_48, feat_34 + feat_60, etc
- interaction: feat_34 - feat_48, feat_34 - feat_60, etc
* interaction: feat_34 * feat_48, feat_34 * feat_60, etc
/ interaction: feat_34 / feat_48, feat_34 / feat_60, etc
Top 20 features
from randomForest’s
variable importance
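The pairwise interactions listed above could be generated along these lines. The function name and the output column names are hypothetical, and adding 1 to the denominator is an assumption to sidestep division by zero on sparse counts; the slides do not say how zeros were handled.

```python
from itertools import combinations
import numpy as np

def interaction_features(X, names, top):
    """Build +, -, *, / features for every pair of the given top features."""
    col = {name: X[:, i].astype(float) for i, name in enumerate(names)}
    feats = {}
    for a, b in combinations(top, 2):
        feats[f"{a}_plus_{b}"] = col[a] + col[b]
        feats[f"{a}_minus_{b}"] = col[a] - col[b]
        feats[f"{a}_times_{b}"] = col[a] * col[b]
        # Assumption: add 1 to the denominator so zero counts do not divide by zero.
        feats[f"{a}_over_{b}"] = col[a] / (col[b] + 1.0)
    return feats

names = ["feat_34", "feat_48", "feat_60"]
X = np.array([[2, 0, 1],
              [0, 3, 4]])   # made-up values
feats = interaction_features(X, names, top=names)
print(sorted(feats))
```

With the top 20 features this yields 190 pairs and 760 candidate columns, so keeping only the most useful ones is a sensible follow-up step.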
Tree-based Models
R’s caret: adding custom log loss metric
Custom summary function (log loss) for use with caret
Bagging random forests: leads to minor improvement
Single rf with 150 trees, 12
randomly sampled features
(i.e., mtry), nodesize = 4
After bagging 10 rfs
gbm + caret: better than rf for this dataset
GBM params and local validation scores:
• Depth = 10, Trees = 350, Shrinkage = 0.02: local validation 0.52449
• Depth = 10, Trees = 1000, Shrinkage = 0.01, Node.size = 4, Bag.frac* = 0.8: local validation 0.51109
• Depth = 10, Trees = 1000, Shrinkage = 0.01, Node.size = 4, Bag.frac* = 0.8, + aggregated features: local validation 0.49964
Improvement as shrinkage decreases, no. of trees increases, and aggregated features are included
*Bag fraction: fraction of training set randomly selected to build the next tree in gbm. Introduces randomness and helps reduce variance.
XGBoost (extreme gradient boosting): better and faster than gbm; one of two main models in ensemble
xgb params and local validation scores:
• Depth = 10, Trees = 250, Shrinkage = 0.1, Gamma = 1, Node.size = 4, Col.sample = 0.8, Row.sample = 0.9: local validation 0.46278
• Depth = 10, Trees = 7500, Shrinkage = 0.005, Gamma = 1, Node.size = 4, Col.sample = 0.8, Row.sample = 0.9, original features + aggregated features: local validation 0.45173
• Depth = 10, Trees = 7500, Shrinkage = 0.005, Gamma = 0.5, Node.size = 4, Col.sample = 0.8, Row.sample = 0.9, original features only (difference from mean): local validation 0.44898
Improvement as shrinkage decreases and no. of trees increases
Feature creation and transformation helped too
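For readers who want to try the best-scoring setting, here is one way it might be written as an xgboost parameter dictionary in Python. The mapping from the slide's labels to xgboost parameter names (for example Node.size to `min_child_weight`) is an assumption, and the objective and metric lines are added because the task needs class probabilities; none of this is taken from the authors' code.

```python
# Slide label -> xgboost parameter name (the mapping is an assumption).
xgb_params = {
    "objective": "multi:softprob",  # per-class probabilities, needed for log loss
    "num_class": 9,
    "eval_metric": "mlogloss",
    "max_depth": 10,           # Depth
    "eta": 0.005,              # Shrinkage
    "gamma": 0.5,              # Gamma
    "min_child_weight": 4,     # Node.size
    "colsample_bytree": 0.8,   # Col.sample
    "subsample": 0.9,          # Row.sample
}
num_boost_round = 7500         # Trees

# With xgboost installed, training would look roughly like:
#   import xgboost as xgb
#   booster = xgb.train(xgb_params, xgb.DMatrix(X, label=y), num_boost_round)
```

A very low learning rate paired with thousands of rounds is slow to train, so it is usually worth settling other parameters at a higher rate first.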
Neural Networks
Nolearn + Lasagne: a simple two-layer network with dropout works great
Layers: Input (0.15 dropout), Hidden Layer 1 (1000 hidden units, 0.3 dropout), Hidden Layer 2 (500 hidden units, 0.3 dropout), Output
Neural Net Params
• Activation: Rectifier
• Output: Softmax
• Batch size: 256
• Epochs: 140
• Exponentially decreasing learning rate
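To make the architecture concrete, here is a forward pass of a 93 x 1000 x 500 x 9 network in plain NumPy. It is a sketch rather than the Nolearn/Lasagne model itself: the weight initialization, the inverted-dropout formulation, and all function names are assumptions, and no training loop is shown.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def dropout(a, rate, rng, train):
    """Inverted dropout: zero units with probability `rate` during training."""
    if not train or rate == 0.0:
        return a
    mask = rng.random(a.shape) >= rate
    return a * mask / (1.0 - rate)

def init_params(sizes, rng):
    """Small random weights and zero biases per layer (illustrative init)."""
    return [(rng.normal(0, 0.01, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(X, params, rng, train=False, rates=(0.15, 0.3, 0.3)):
    a = dropout(X, rates[0], rng, train)               # input dropout
    for (W, b), rate in zip(params[:-1], rates[1:]):   # two hidden layers
        a = dropout(relu(a @ W + b), rate, rng, train)
    W, b = params[-1]
    return softmax(a @ W + b)                          # class probabilities

rng = np.random.default_rng(0)
params = init_params([93, 1000, 500, 9], rng)
probs = forward(rng.random((4, 93)), params, rng, train=True)
print(probs.shape)
```

At prediction time `train=False` switches dropout off, so the same inputs always give the same probabilities.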
Tuning Neural Network hyper-parameters:
• Use validation data to tune:
– Layers, dropout, L2, batch size, etc.
• Start with a single network (say 93 x 100 x 50 x 9)
• Use less data to get faster feedback
• Early stopping
– No-improvement-in-10, 5, 3 …
– Visualize loss vs. epochs in a graph
• Use GPU
[Plots: LogLoss vs. Epochs]
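The "no improvement in N epochs" rule can be sketched as a small helper in Python. The function name and the loss values are made up for illustration; a real training loop would compute the validation loss after each epoch instead of reading it from a list.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return (best_epoch, best_loss), stopping once `patience` consecutive
    epochs fail to improve on the best validation loss seen so far."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement in the last `patience` epochs
    return best_epoch, best_loss

# Made-up validation log loss per epoch, for illustration only.
losses = [0.70, 0.60, 0.55, 0.56, 0.57, 0.54, 0.58, 0.59, 0.60, 0.61]
print(early_stopping_epoch(losses, patience=3))
print(early_stopping_epoch(losses, patience=2))
```

With a patience of 2 the run stops before the later dip at epoch 5 is reached, which shows the trade-off between fast feedback and stopping too early.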
Bagging NNs: leads to significant improvement
Single Neural Net
Bag of 10
Neural Nets
Bag of 50
Neural Nets
Neural nets are somewhat unstable; bagging reduces variance and improves LB score
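A toy simulation of why averaging helps, in NumPy. Each "net" here is only a stand-in that adds random noise to a fixed prediction; the noise level, the number of repetitions, and the function names are assumptions, and no real networks are trained.

```python
import numpy as np

def average_bag(prob_list):
    """Average class probabilities over models trained with different seeds."""
    return np.mean(np.stack(prob_list, axis=0), axis=0)

rng = np.random.default_rng(0)
true_probs = np.array([[0.7, 0.2, 0.1]])  # hypothetical "ideal" prediction

def noisy_model():
    """Stand-in for one trained net: the ideal prediction plus random noise."""
    p = np.clip(true_probs + rng.normal(0, 0.1, true_probs.shape), 1e-6, None)
    return p / p.sum(axis=1, keepdims=True)

# Spread of a single net's prediction vs. a bag of 10, over 200 repetitions.
single_runs = np.stack([noisy_model() for _ in range(200)])
bagged_runs = np.stack([average_bag([noisy_model() for _ in range(10)])
                        for _ in range(200)])
print(single_runs.std(axis=0).mean(), bagged_runs.std(axis=0).mean())
```

The bagged predictions vary much less from run to run, which is the variance reduction the slide refers to.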
So many ideas, so little time: Bagging + Stacking
• Randomly sample from training data (with replacement) – BAG
• Train base model on OOB, and predict on BAG data
• Boost the BAG data with meta model
[Diagram: training data rows 1 to 7; bootstrap sample (BAG) = rows 4, 3, 6, 3, 3, 5, 4; OOB data = rows 1, 2, 7. Labels: 1. sample, RF (base), XGBoost (meta), Test Data, Done!]
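The BAG / OOB split from the diagram can be reproduced in a few lines of NumPy. Only the index bookkeeping is shown; the base and meta models are left as comments, and the variable names are illustrative.

```python
import numpy as np

rows = np.arange(1, 8)                      # training data rows 1..7

# The bootstrap sample shown on the slide (drawn with replacement).
bag = np.array([4, 3, 6, 3, 3, 5, 4])
oob = np.setdiff1d(rows, bag)               # rows never drawn: out-of-bag
print(oob)

# Drawing a fresh bootstrap sample works the same way.
rng = np.random.default_rng(0)
new_bag = rng.choice(rows, size=len(rows), replace=True)
new_oob = np.setdiff1d(rows, new_bag)

# Next steps, not implemented here:
#   1. fit the base model (e.g. RF) on the OOB rows
#   2. predict on the BAG rows and append the predictions as features
#   3. fit the meta model (e.g. XGBoost) on the augmented BAG rows
```

Because the base model never sees the BAG rows, its predictions on them are honest inputs for the meta model.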
So many ideas, so little time: t-SNE
[Figure: example table of two new t-SNE features, tsne1 and tsne2, one value pair per row]
Ensemble our models
Wisdom of the Crowd: combining multiple models leads
to significant improvement in performance
Different classifiers make up for each other’s weaknesses
Ensemble: how do we combine to minimize log loss over multiple models?
• Competition metric: the goal is to minimize log loss
– How to do this over multiple models?
– Voting? Averaging?
• Create predictions using best classifiers on training set
• Find best weights for combining the classifiers by minimizing log loss on holdout set
• Our approach:
– Append various predictions
– Minimize overall log loss using scipy.optimize.minimize
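A minimal sketch of weight fitting with `scipy.optimize.minimize`. The slides name the function but not the settings, so the SLSQP method, the bounds, the sum-to-one constraint, the helper names, and the synthetic "model" predictions are all assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def log_loss(y_true, probs, eps=1e-15):
    probs = np.clip(probs, eps, 1 - eps)
    return -np.mean(np.log(probs[np.arange(len(y_true)), y_true]))

def fit_ensemble_weights(pred_list, y_true):
    """Find non-negative weights summing to 1 that minimize holdout log loss."""
    preds = np.stack(pred_list, axis=0)               # (n_models, N, M)
    n_models = preds.shape[0]

    def objective(w):
        blended = np.tensordot(w, preds, axes=1)      # weighted average, (N, M)
        return log_loss(y_true, blended)

    result = minimize(
        objective,
        x0=np.full(n_models, 1.0 / n_models),         # start from equal weights
        method="SLSQP",
        bounds=[(0.0, 1.0)] * n_models,
        constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0},
    )
    return result.x, result.fun

# Synthetic holdout labels and two noisy "models" (illustration only).
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=500)
onehot = np.eye(3)[y]

def fake_model(noise):
    raw = np.clip(onehot + rng.normal(0, noise, onehot.shape), 0.05, None)
    return raw / raw.sum(axis=1, keepdims=True)

preds = [fake_model(0.6), fake_model(0.8)]
weights, loss = fit_ensemble_weights(preds, y)
print(weights, loss)
```

The weights should be fitted on holdout predictions, not on rows the models were trained on, otherwise the blend favors whichever model overfits most.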
Ensemble: great improvement over best individual models, though we shouldn’t throw in everything
Our final ensemble:
(0.45 × XGBoost) + (0.55 × Bag of 50 NNs) = Ensemble (0.41540)
• XGBoost (0.43528): 415th position on leaderboard
• Bag of 50 NNs (0.43023): 350th position on leaderboard
• Ensemble (0.41540): 85th position on leaderboard
Another ensemble we tried:
(0.445 × XGBoost) + (0.545 × Bag of 110 NNs) + (0.01 × Bag of 10 RFs) = Ensemble (0.41542)
Sometimes, more ≠ better!
Ideas we didn’t have
time to implement
Ideas that worked well in Otto and other competitions:
Clamping predicted
probabilities between
some threshold (e.g.,
0.005) and 1
Adding an SVM
classifier into
the ensemble
Creating
new features
with t-SNE
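The clamping idea can be sketched as follows. The slide only gives the example threshold of 0.005; the function name and the renormalization step are assumptions.

```python
import numpy as np

def clamp_probs(probs, floor=0.005):
    """Raise probabilities below `floor`, then renormalize rows to sum to 1.

    After renormalizing, the smallest value can end up slightly under `floor`.
    """
    clipped = np.clip(np.asarray(probs, dtype=float), floor, 1.0)
    return clipped / clipped.sum(axis=1, keepdims=True)

# A prediction that is certain of class 0; suppose the true class is 1.
probs = np.array([[1.0, 0.0, 0.0]])
clamped = clamp_probs(probs)
print(clamped, -np.log(clamped[0, 1]))
```

Clamping caps the worst-case penalty for a single wrong prediction at roughly 5.3 here, at the price of a slightly worse score on rows that were correctly given probability 1.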
Top Solutions
5th place
https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/14297/share-your-models/79677#post79677
• Use TF-IDF to transform raw features
• Create new features by fitting models on raw and TF-IDF features
• Combine created features with original features
• Bag XGBoost and 2-layer NN 30 times and average predictions
2nd place
http://blog.kaggle.com/2015/06/09/otto-product-classification-winners-interview-2nd-place-alexander-guschin/
• Level 0: Split the data into two groups, Raw and TF-IDF
• Level 1: Create metafeature using different models
– Split data into k folds, training k models on k-1 parts and predict on the 1 part left aside for each k-1 group
• Level 2: Train metaclassifier with features and metafeatures and average/ensemble
1st place
https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/14335/1st-place-winner-solution-gilberto-titericz-stanislav-semenov
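The Level 1 step of the 2nd place solution (k-fold metafeatures) can be illustrated with out-of-fold predictions. This is a sketch under stated assumptions: `CentroidModel` is a toy stand-in for a real level-1 model, the data is synthetic, and none of it is taken from the winners' code.

```python
import numpy as np

class CentroidModel:
    """Toy level-1 model: softmax over negative squared distances to each
    class centroid. Any classifier that outputs probabilities would do."""
    def fit(self, X, y, n_classes):
        self.centroids = np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])
        return self

    def predict_proba(self, X):
        d = ((X[:, None, :] - self.centroids[None, :, :]) ** 2).sum(axis=2)
        e = np.exp(-(d - d.min(axis=1, keepdims=True)))
        return e / e.sum(axis=1, keepdims=True)

def out_of_fold_predictions(X, y, n_classes, k=5, seed=0):
    """Each row's metafeatures come from a model that never saw that row."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    oof = np.zeros((len(X), n_classes))
    for i, held_out in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = CentroidModel().fit(X[train], y[train], n_classes)
        oof[held_out] = model.predict_proba(X[held_out])
    return oof

# Synthetic data: 300 rows, 4 features, 3 classes (illustration only).
rng = np.random.default_rng(1)
y = rng.integers(0, 3, size=300)
X = rng.normal(size=(300, 4)) + y[:, None]
oof = out_of_fold_predictions(X, y, n_classes=3)
X_level2 = np.hstack([X, oof])   # features + metafeatures for the metaclassifier
print(X_level2.shape)
```

For the test set, a common choice is to refit the level-1 model on all training rows, or to average the k fold models, before producing its metafeatures.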
What’s a suggested
framework for
Kaggle?
Suggested framework for Kaggle competitions:
• Understand the problem, metric, and data
• Create a reliable validation process that resembles leaderboard
– Use early submissions for this
– Avoid overfitting!
• Understand how linear and non-linear models work on the problem
• Try many different approaches/models and do the following
– Transform data (rescale, normalize, PCA, etc.)
– Feature selection/creation
– Tune parameters
– If there is a large disparity between local validation and leaderboard, reassess validation process
• Ensemble
Largely adapted from KazAnova: http://blog.kaggle.com/2015/05/07/profiling-top-kagglers-kazanovacurrently-2-in-the-world/
Our code is available on GitHub:
https://github.com/eugeneyan/Otto
Thank you!
Eugene Yan (eugeneyanziyou@gmail.com) & Wang Weimin (wangweimin888@yahoo.com)

More Related Content

What's hot

ラベル付けのいろは
ラベル付けのいろはラベル付けのいろは
ラベル付けのいろはKensuke Mitsuzawa
 
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...Xavier Amatriain
 
How to Win Machine Learning Competitions ?
How to Win Machine Learning Competitions ? How to Win Machine Learning Competitions ?
How to Win Machine Learning Competitions ? HackerEarth
 
Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018HJ van Veen
 
Microsoft Malware Classification Challenge 上位手法の紹介 (in Kaggle Study Meetup)
Microsoft Malware Classification Challenge 上位手法の紹介 (in Kaggle Study Meetup)Microsoft Malware Classification Challenge 上位手法の紹介 (in Kaggle Study Meetup)
Microsoft Malware Classification Challenge 上位手法の紹介 (in Kaggle Study Meetup)Shotaro Sano
 
整数計画法に基づく説明可能性な機械学習へのアプローチ
整数計画法に基づく説明可能性な機械学習へのアプローチ整数計画法に基づく説明可能性な機械学習へのアプローチ
整数計画法に基づく説明可能性な機械学習へのアプローチKentaro Kanamori
 
Machine Learning for Q&A Sites: The Quora Example
Machine Learning for Q&A Sites: The Quora ExampleMachine Learning for Q&A Sites: The Quora Example
Machine Learning for Q&A Sites: The Quora ExampleXavier Amatriain
 
広告クリエイティブ制作におけるコンピュータビジョングラフィックデザイン CA Data Engineering & Data Analysis WS #9
広告クリエイティブ制作におけるコンピュータビジョングラフィックデザイン CA Data Engineering & Data Analysis WS #9広告クリエイティブ制作におけるコンピュータビジョングラフィックデザイン CA Data Engineering & Data Analysis WS #9
広告クリエイティブ制作におけるコンピュータビジョングラフィックデザイン CA Data Engineering & Data Analysis WS #9Kazuhiro Ota
 
Feature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.aiFeature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.aiSri Ambati
 
Winning Data Science Competitions
Winning Data Science CompetitionsWinning Data Science Competitions
Winning Data Science CompetitionsJeong-Yoon Lee
 
実践多クラス分類 Kaggle Ottoから学んだこと
実践多クラス分類 Kaggle Ottoから学んだこと実践多クラス分類 Kaggle Ottoから学んだこと
実践多クラス分類 Kaggle Ottoから学んだことnishio
 
Kaggle Avito Demand Prediction Challenge 9th Place Solution
Kaggle Avito Demand Prediction Challenge 9th Place SolutionKaggle Avito Demand Prediction Challenge 9th Place Solution
Kaggle Avito Demand Prediction Challenge 9th Place SolutionJin Zhan
 
SSII2020TS: 機械学習モデルの判断根拠の説明​ 〜 Explainable AI 研究の近年の展開 〜​
SSII2020TS: 機械学習モデルの判断根拠の説明​ 〜 Explainable AI 研究の近年の展開 〜​SSII2020TS: 機械学習モデルの判断根拠の説明​ 〜 Explainable AI 研究の近年の展開 〜​
SSII2020TS: 機械学習モデルの判断根拠の説明​ 〜 Explainable AI 研究の近年の展開 〜​SSII
 
機械学習の理論と実践
機械学習の理論と実践機械学習の理論と実践
機械学習の理論と実践Preferred Networks
 
合成変量とアンサンブル:回帰森と加法モデルの要点
合成変量とアンサンブル:回帰森と加法モデルの要点合成変量とアンサンブル:回帰森と加法モデルの要点
合成変量とアンサンブル:回帰森と加法モデルの要点Ichigaku Takigawa
 
探索と活用の戦略 ベイズ最適化と多腕バンディット
探索と活用の戦略 ベイズ最適化と多腕バンディット探索と活用の戦略 ベイズ最適化と多腕バンディット
探索と活用の戦略 ベイズ最適化と多腕バンディットH Okazaki
 
ブースティング入門
ブースティング入門ブースティング入門
ブースティング入門Retrieva inc.
 
Dimensionality reduction with t-SNE(Rtsne) and UMAP(uwot) using R packages.
Dimensionality reduction with t-SNE(Rtsne) and UMAP(uwot) using R packages. Dimensionality reduction with t-SNE(Rtsne) and UMAP(uwot) using R packages.
Dimensionality reduction with t-SNE(Rtsne) and UMAP(uwot) using R packages. Satoshi Kato
 
Noisy Labels と戦う深層学習
Noisy Labels と戦う深層学習Noisy Labels と戦う深層学習
Noisy Labels と戦う深層学習Plot Hong
 
異常検知と変化検知で復習するPRML
異常検知と変化検知で復習するPRML異常検知と変化検知で復習するPRML
異常検知と変化検知で復習するPRMLKatsuya Ito
 

What's hot (20)

ラベル付けのいろは
ラベル付けのいろはラベル付けのいろは
ラベル付けのいろは
 
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
 
How to Win Machine Learning Competitions ?
How to Win Machine Learning Competitions ? How to Win Machine Learning Competitions ?
How to Win Machine Learning Competitions ?
 
Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018
 
Microsoft Malware Classification Challenge 上位手法の紹介 (in Kaggle Study Meetup)
Microsoft Malware Classification Challenge 上位手法の紹介 (in Kaggle Study Meetup)Microsoft Malware Classification Challenge 上位手法の紹介 (in Kaggle Study Meetup)
Microsoft Malware Classification Challenge 上位手法の紹介 (in Kaggle Study Meetup)
 
整数計画法に基づく説明可能性な機械学習へのアプローチ
整数計画法に基づく説明可能性な機械学習へのアプローチ整数計画法に基づく説明可能性な機械学習へのアプローチ
整数計画法に基づく説明可能性な機械学習へのアプローチ
 
Machine Learning for Q&A Sites: The Quora Example
Machine Learning for Q&A Sites: The Quora ExampleMachine Learning for Q&A Sites: The Quora Example
Machine Learning for Q&A Sites: The Quora Example
 
広告クリエイティブ制作におけるコンピュータビジョングラフィックデザイン CA Data Engineering & Data Analysis WS #9
広告クリエイティブ制作におけるコンピュータビジョングラフィックデザイン CA Data Engineering & Data Analysis WS #9広告クリエイティブ制作におけるコンピュータビジョングラフィックデザイン CA Data Engineering & Data Analysis WS #9
広告クリエイティブ制作におけるコンピュータビジョングラフィックデザイン CA Data Engineering & Data Analysis WS #9
 
Feature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.aiFeature Engineering for ML - Dmitry Larko, H2O.ai
Feature Engineering for ML - Dmitry Larko, H2O.ai
 
Winning Data Science Competitions
Winning Data Science CompetitionsWinning Data Science Competitions
Winning Data Science Competitions
 
実践多クラス分類 Kaggle Ottoから学んだこと
実践多クラス分類 Kaggle Ottoから学んだこと実践多クラス分類 Kaggle Ottoから学んだこと
実践多クラス分類 Kaggle Ottoから学んだこと
 
Kaggle Avito Demand Prediction Challenge 9th Place Solution
Kaggle Avito Demand Prediction Challenge 9th Place SolutionKaggle Avito Demand Prediction Challenge 9th Place Solution
Kaggle Avito Demand Prediction Challenge 9th Place Solution
 
SSII2020TS: 機械学習モデルの判断根拠の説明​ 〜 Explainable AI 研究の近年の展開 〜​
SSII2020TS: 機械学習モデルの判断根拠の説明​ 〜 Explainable AI 研究の近年の展開 〜​SSII2020TS: 機械学習モデルの判断根拠の説明​ 〜 Explainable AI 研究の近年の展開 〜​
SSII2020TS: 機械学習モデルの判断根拠の説明​ 〜 Explainable AI 研究の近年の展開 〜​
 
機械学習の理論と実践
機械学習の理論と実践機械学習の理論と実践
機械学習の理論と実践
 
合成変量とアンサンブル:回帰森と加法モデルの要点
合成変量とアンサンブル:回帰森と加法モデルの要点合成変量とアンサンブル:回帰森と加法モデルの要点
合成変量とアンサンブル:回帰森と加法モデルの要点
 
探索と活用の戦略 ベイズ最適化と多腕バンディット
探索と活用の戦略 ベイズ最適化と多腕バンディット探索と活用の戦略 ベイズ最適化と多腕バンディット
探索と活用の戦略 ベイズ最適化と多腕バンディット
 
ブースティング入門
ブースティング入門ブースティング入門
ブースティング入門
 
Dimensionality reduction with t-SNE(Rtsne) and UMAP(uwot) using R packages.
Dimensionality reduction with t-SNE(Rtsne) and UMAP(uwot) using R packages. Dimensionality reduction with t-SNE(Rtsne) and UMAP(uwot) using R packages.
Dimensionality reduction with t-SNE(Rtsne) and UMAP(uwot) using R packages.
 
Noisy Labels と戦う深層学習
Noisy Labels と戦う深層学習Noisy Labels と戦う深層学習
Noisy Labels と戦う深層学習
 
異常検知と変化検知で復習するPRML
異常検知と変化検知で復習するPRML異常検知と変化検知で復習するPRML
異常検知と変化検知で復習するPRML
 

Similar to Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt

모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로 모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로 r-kor
 
Large data with Scikit-learn - Boston Data Mining Meetup - Alex Perrier
Large data with Scikit-learn - Boston Data Mining Meetup  - Alex PerrierLarge data with Scikit-learn - Boston Data Mining Meetup  - Alex Perrier
Large data with Scikit-learn - Boston Data Mining Meetup - Alex PerrierAlexis Perrier
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017StampedeCon
 
Tensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with HummingbirdTensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with HummingbirdDatabricks
 
StackNet Meta-Modelling framework
StackNet Meta-Modelling frameworkStackNet Meta-Modelling framework
StackNet Meta-Modelling frameworkSri Ambati
 
Heuristic design of experiments w meta gradient search
Heuristic design of experiments w meta gradient searchHeuristic design of experiments w meta gradient search
Heuristic design of experiments w meta gradient searchGreg Makowski
 
DeepLearningLecture.pptx
DeepLearningLecture.pptxDeepLearningLecture.pptx
DeepLearningLecture.pptxssuserf07225
 
Keras on tensorflow in R & Python
Keras on tensorflow in R & PythonKeras on tensorflow in R & Python
Keras on tensorflow in R & PythonLonghow Lam
 
Gradient Boosted Regression Trees in Scikit Learn by Gilles Louppe & Peter Pr...
Gradient Boosted Regression Trees in Scikit Learn by Gilles Louppe & Peter Pr...Gradient Boosted Regression Trees in Scikit Learn by Gilles Louppe & Peter Pr...
Gradient Boosted Regression Trees in Scikit Learn by Gilles Louppe & Peter Pr...PyData
 
XGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competitionXGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competitionJaroslaw Szymczak
 
deepnet-lourentzou.ppt
deepnet-lourentzou.pptdeepnet-lourentzou.ppt
deepnet-lourentzou.pptyang947066
 
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Gabriel Moreira
 
5 Coding Hacks to Reduce GC Overhead
5 Coding Hacks to Reduce GC Overhead5 Coding Hacks to Reduce GC Overhead
5 Coding Hacks to Reduce GC OverheadTakipi
 
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...Universitat Politècnica de Catalunya
 
Data Profiling in Apache Calcite
Data Profiling in Apache CalciteData Profiling in Apache Calcite
Data Profiling in Apache CalciteJulian Hyde
 
How to win data science competitions with Deep Learning
How to win data science competitions with Deep LearningHow to win data science competitions with Deep Learning
How to win data science competitions with Deep LearningSri Ambati
 
Recurrent Neural Networks for Text Analysis
Recurrent Neural Networks for Text AnalysisRecurrent Neural Networks for Text Analysis
Recurrent Neural Networks for Text Analysisodsc
 
Understanding GBM and XGBoost in Scikit-Learn
Understanding GBM and XGBoost in Scikit-LearnUnderstanding GBM and XGBoost in Scikit-Learn
Understanding GBM and XGBoost in Scikit-Learn철민 권
 

Similar to Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt (20)

모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로 모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
 
Deep learning
Deep learningDeep learning
Deep learning
 
C3 w1
C3 w1C3 w1
C3 w1
 
Large data with Scikit-learn - Boston Data Mining Meetup - Alex Perrier
Large data with Scikit-learn - Boston Data Mining Meetup  - Alex PerrierLarge data with Scikit-learn - Boston Data Mining Meetup  - Alex Perrier
Large data with Scikit-learn - Boston Data Mining Meetup - Alex Perrier
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
 
Tensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with HummingbirdTensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with Hummingbird
 
StackNet Meta-Modelling framework
StackNet Meta-Modelling frameworkStackNet Meta-Modelling framework
StackNet Meta-Modelling framework
 
Heuristic design of experiments w meta gradient search
Heuristic design of experiments w meta gradient searchHeuristic design of experiments w meta gradient search
Heuristic design of experiments w meta gradient search
 
DeepLearningLecture.pptx
DeepLearningLecture.pptxDeepLearningLecture.pptx
DeepLearningLecture.pptx
 
Keras on tensorflow in R & Python
Keras on tensorflow in R & PythonKeras on tensorflow in R & Python
Keras on tensorflow in R & Python
 
Gradient Boosted Regression Trees in Scikit Learn by Gilles Louppe & Peter Pr...
Gradient Boosted Regression Trees in Scikit Learn by Gilles Louppe & Peter Pr...Gradient Boosted Regression Trees in Scikit Learn by Gilles Louppe & Peter Pr...
Gradient Boosted Regression Trees in Scikit Learn by Gilles Louppe & Peter Pr...
 
XGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competitionXGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competition
 
deepnet-lourentzou.ppt
deepnet-lourentzou.pptdeepnet-lourentzou.ppt
deepnet-lourentzou.ppt
 
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017
 
5 Coding Hacks to Reduce GC Overhead
5 Coding Hacks to Reduce GC Overhead5 Coding Hacks to Reduce GC Overhead
5 Coding Hacks to Reduce GC Overhead
 
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
 
Data Profiling in Apache Calcite
Data Profiling in Apache CalciteData Profiling in Apache Calcite
Data Profiling in Apache Calcite
 
How to win data science competitions with Deep Learning
How to win data science competitions with Deep LearningHow to win data science competitions with Deep Learning
How to win data science competitions with Deep Learning
 
Recurrent Neural Networks for Text Analysis
Recurrent Neural Networks for Text AnalysisRecurrent Neural Networks for Text Analysis
Recurrent Neural Networks for Text Analysis
 
Understanding GBM and XGBoost in Scikit-Learn
Understanding GBM and XGBoost in Scikit-LearnUnderstanding GBM and XGBoost in Scikit-Learn
Understanding GBM and XGBoost in Scikit-Learn
 

More from Eugene Yan Ziyou

System design for recommendations and search
System design for recommendations and searchSystem design for recommendations and search
System design for recommendations and searchEugene Yan Ziyou
 
Recommender Systems: Beyond the user-item matrix
Recommender Systems: Beyond the user-item matrixRecommender Systems: Beyond the user-item matrix
Recommender Systems: Beyond the user-item matrixEugene Yan Ziyou
 
Predicting Hospital Bills at Pre-admission
Predicting Hospital Bills at Pre-admissionPredicting Hospital Bills at Pre-admission
Predicting Hospital Bills at Pre-admissionEugene Yan Ziyou
 
OLX Group Prod Tech 2019 Keynote: Asia's Tech Giants
OLX Group Prod Tech 2019 Keynote: Asia's Tech GiantsOLX Group Prod Tech 2019 Keynote: Asia's Tech Giants
OLX Group Prod Tech 2019 Keynote: Asia's Tech GiantsEugene Yan Ziyou
 
Data Science Challenges and Impact at Lazada (Big Data and Analytics Innovati...
Data Science Challenges and Impact at Lazada (Big Data and Analytics Innovati...Data Science Challenges and Impact at Lazada (Big Data and Analytics Innovati...
Data Science Challenges and Impact at Lazada (Big Data and Analytics Innovati...Eugene Yan Ziyou
 
INSEAD Sharing on Lazada Data Science and my Journey
INSEAD Sharing on Lazada Data Science and my JourneyINSEAD Sharing on Lazada Data Science and my Journey
INSEAD Sharing on Lazada Data Science and my JourneyEugene Yan Ziyou
 
SMU BIA Sharing on Data Science
SMU BIA Sharing on Data ScienceSMU BIA Sharing on Data Science
SMU BIA Sharing on Data ScienceEugene Yan Ziyou
 
Culture at Lazada Data Science
Culture at Lazada Data ScienceCulture at Lazada Data Science
Culture at Lazada Data ScienceEugene Yan Ziyou
 
Competition Improves Performance: Only when Competition Form matches Goal Ori...
Competition Improves Performance: Only when Competition Form matches Goal Ori...Competition Improves Performance: Only when Competition Form matches Goal Ori...
Competition Improves Performance: Only when Competition Form matches Goal Ori...Eugene Yan Ziyou
 
How Lazada ranks products to improve customer experience and conversion
How Lazada ranks products to improve customer experience and conversionHow Lazada ranks products to improve customer experience and conversion
How Lazada ranks products to improve customer experience and conversionEugene Yan Ziyou
 
Sharing about my data science journey and what I do at Lazada
Sharing about my data science journey and what I do at LazadaSharing about my data science journey and what I do at Lazada
Sharing about my data science journey and what I do at LazadaEugene Yan Ziyou
 
AXA x DSSG Meetup Sharing (Feb 2016)
AXA x DSSG Meetup Sharing (Feb 2016)AXA x DSSG Meetup Sharing (Feb 2016)
AXA x DSSG Meetup Sharing (Feb 2016)Eugene Yan Ziyou
 
Garuda Robotics x DataScience SG Meetup (Sep 2015)
Garuda Robotics x DataScience SG Meetup (Sep 2015)Garuda Robotics x DataScience SG Meetup (Sep 2015)
Garuda Robotics x DataScience SG Meetup (Sep 2015)Eugene Yan Ziyou
 
DataKind SG sharing of our first DataDive
DataKind SG sharing of our first DataDiveDataKind SG sharing of our first DataDive
DataKind SG sharing of our first DataDiveEugene Yan Ziyou
 
Social network analysis and growth recommendations for DataScience SG community
Social network analysis and growth recommendations for DataScience SG communitySocial network analysis and growth recommendations for DataScience SG community
Social network analysis and growth recommendations for DataScience SG communityEugene Yan Ziyou
 
Nielsen x DataScience SG Meetup (Apr 2015)
Nielsen x DataScience SG Meetup (Apr 2015)Nielsen x DataScience SG Meetup (Apr 2015)
Nielsen x DataScience SG Meetup (Apr 2015)Eugene Yan Ziyou
 
Statistical inference: Statistical Power, ANOVA, and Post Hoc tests
Statistical inference: Statistical Power, ANOVA, and Post Hoc testsStatistical inference: Statistical Power, ANOVA, and Post Hoc tests
Statistical inference: Statistical Power, ANOVA, and Post Hoc testsEugene Yan Ziyou
 
Statistical inference: Hypothesis Testing and t-tests
Statistical inference: Hypothesis Testing and t-testsStatistical inference: Hypothesis Testing and t-tests
Statistical inference: Hypothesis Testing and t-testsEugene Yan Ziyou
 
Statistical inference: Probability and Distribution
Statistical inference: Probability and DistributionStatistical inference: Probability and Distribution
Statistical inference: Probability and DistributionEugene Yan Ziyou
 
A Study on the Relationship between Education and Income in the US
A Study on the Relationship between Education and Income in the USA Study on the Relationship between Education and Income in the US
A Study on the Relationship between Education and Income in the USEugene Yan Ziyou
 

More from Eugene Yan Ziyou (20)

System design for recommendations and search
System design for recommendations and searchSystem design for recommendations and search
System design for recommendations and search
 
Recommender Systems: Beyond the user-item matrix
Recommender Systems: Beyond the user-item matrixRecommender Systems: Beyond the user-item matrix
Recommender Systems: Beyond the user-item matrix
 
Predicting Hospital Bills at Pre-admission
Predicting Hospital Bills at Pre-admissionPredicting Hospital Bills at Pre-admission
Predicting Hospital Bills at Pre-admission
 
OLX Group Prod Tech 2019 Keynote: Asia's Tech Giants
OLX Group Prod Tech 2019 Keynote: Asia's Tech GiantsOLX Group Prod Tech 2019 Keynote: Asia's Tech Giants
OLX Group Prod Tech 2019 Keynote: Asia's Tech Giants
 
Data Science Challenges and Impact at Lazada (Big Data and Analytics Innovati...
Data Science Challenges and Impact at Lazada (Big Data and Analytics Innovati...Data Science Challenges and Impact at Lazada (Big Data and Analytics Innovati...
Data Science Challenges and Impact at Lazada (Big Data and Analytics Innovati...
 
INSEAD Sharing on Lazada Data Science and my Journey
INSEAD Sharing on Lazada Data Science and my JourneyINSEAD Sharing on Lazada Data Science and my Journey
INSEAD Sharing on Lazada Data Science and my Journey
 
SMU BIA Sharing on Data Science
Culture at Lazada Data Science
Competition Improves Performance: Only when Competition Form matches Goal Ori...
How Lazada ranks products to improve customer experience and conversion
Sharing about my data science journey and what I do at Lazada
AXA x DSSG Meetup Sharing (Feb 2016)
Garuda Robotics x DataScience SG Meetup (Sep 2015)
DataKind SG sharing of our first DataDive
Social network analysis and growth recommendations for DataScience SG community
Nielsen x DataScience SG Meetup (Apr 2015)
Statistical inference: Statistical Power, ANOVA, and Post Hoc tests
Statistical inference: Hypothesis Testing and t-tests
Statistical inference: Probability and Distribution
A Study on the Relationship between Education and Income in the US
 


Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt

  • 1. Kaggle Otto Challenge: How we achieved 85th out of 3,514 and what we learnt. Eugene Yan & Wang Weimin
  • 2. Kaggle: A platform for predictive modeling competitions
  • 3. Otto Group Product Classification Challenge: Classify products into 9 main product categories
  • 4. One of the most popular Kaggle competitions ever. Our team achieved 85th position out of 3,514 teams
  • 5. Let’s take a look at the data: 93 (obfuscated) numerical features provided; target with 9 categories
  • 6. Evaluation Metric: (minimize) multi-class log loss, $\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log(p_{ij})$, where:
      – $N$ = no. of products in the dataset
      – $M$ = no. of class labels (i.e., 9 classes)
      – $\log$ = natural logarithm
      – $y_{ij}$ = 1 if observation $i$ is in class $j$ and 0 otherwise
      – $p_{ij}$ = predicted probability that observation $i$ belongs to class $j$
    Minimizing multi-class log loss heavily penalizes falsely confident predictions.
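To make the penalty on falsely confident predictions concrete, here is a minimal pure-Python sketch of the metric as defined on this slide. The function name, the `eps` clipping bound, and the toy predictions are illustrative assumptions, not taken from the deck.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: class index (0..M-1) for each observation.
    probs:  one list of M predicted probabilities per observation.
    eps:    clipping bound (an assumption) that avoids log(0).
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        # Only the true class contributes, because y_ij is 0 elsewhere.
        p = min(max(row[label], eps), 1.0 - eps)
        total += math.log(p)
    return -total / len(y_true)

labels = [0, 2]
# Second observation is predicted wrongly in both cases below.
hedged_miss = [[0.6, 0.2, 0.2], [0.4, 0.3, 0.3]]
confident_miss = [[0.6, 0.2, 0.2], [0.98, 0.01, 0.01]]

loss_hedged = multiclass_log_loss(labels, hedged_miss)
loss_confident = multiclass_log_loss(labels, confident_miss)
```

Both prediction sets miss the second observation, but the confident miss puts only 0.01 on the true class, so its loss is several times larger than that of the hedged miss.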
  • 8. [Diagram: Training set | Holdout]
      – Local validation using holdout (parameter tuning using 5-fold cross-validation)
          – Train models on 80% of the train set and validate against the 20% local holdout
          – Ensemble by fitting predictions from the 80% train set on the 20% local holdout
          – Reduces risk of overfitting the leaderboard
          – Build each model twice for submission: once for local validation, once for leaderboard submission
      – Parameter tuning and validation using 5-fold cross-validation
          – Train models on the full data set with 5-fold cross-validation
          – Build each model once for submission
          – Low risk of overfitting if the cv score is close to the leaderboard score (i.e., training data similar to testing data)
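The two splitting schemes can be sketched with the standard library alone. This is an illustration under stated assumptions: the helper names and the seed are hypothetical, and the split is a plain random shuffle with no class stratification, which the deck does not specify either way.

```python
import random

def holdout_split(n_rows, holdout_frac=0.2, seed=0):
    """Shuffle row indices and carve off a local holdout (20% by default)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_holdout = int(round(n_rows * holdout_frac))
    return indices[n_holdout:], indices[:n_holdout]

def k_fold_indices(indices, k=5):
    """Deal the given indices into k roughly equal folds."""
    return [indices[i::k] for i in range(k)]

train_idx, holdout_idx = holdout_split(100)
folds = k_fold_indices(train_idx, k=5)  # tune parameters on these folds
```

In the first approach, cross-validation runs only on `train_idx` and `holdout_idx` stays untouched until ensembling; in the second, `k_fold_indices` would be applied to all rows.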
  • 10. Do we need all 93 features? Can we reduce noise to reveal more of the signal?
  • 11. Dimensionality reduction led nowhere: No clear ‘elbow’ from principal components analysis
  • 12. Feature selection via elastic net/lasso dropped four features, but led to a significant drop in accuracy. [Figure panels: L1 regularization (lasso); L2 regularization]
  • 13. The data looks very skewed: should we make it more ‘normal’? Would standardizing/rescaling the data help?
  • 14. Feature Transformation: Surprisingly, transforming features helped with tree-based techniques
      – z-standardization, $(x - \text{mean}(x)) / \text{sd}(x)$: improved tree-based models a bit
      – Difference from mean, $x - \text{mean}(x)$: worked better than z-standardization
      – Difference from median, $x - \text{median}(x)$: didn’t help (because most medians were 0)
      – Log-transformation, $\log(x + 1)$: helped with neural networks
      – Adding flags (if $x > 0$ then 1, else 0): terrible =/
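The transformations on this slide are simple per-column operations. A minimal sketch follows; the function names and the sample column are illustrative, and the use of the population standard deviation is an assumption, since the slide only says "standard deviation".

```python
import math
from statistics import mean, median, pstdev

def z_standardize(column):
    mu, sigma = mean(column), pstdev(column)  # population sd: an assumption
    return [(x - mu) / sigma if sigma > 0 else 0.0 for x in column]

def difference_from_mean(column):
    mu = mean(column)
    return [x - mu for x in column]

def difference_from_median(column):
    med = median(column)
    return [x - med for x in column]

def log_transform(column):
    return [math.log(x + 1) for x in column]  # log(x + 1) keeps zeros at zero

def nonzero_flag(column):
    return [1 if x > 0 else 0 for x in column]

feat = [0, 0, 0, 4, 16]  # hypothetical skewed count feature
```

With a column like `feat`, whose median is 0, `difference_from_median` returns the column unchanged, which is consistent with the slide's remark about that transformation.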
  • 15. Though models like GBM and neural nets can approximate deep interactions, we can help find patterns and explicitly define:
      – Complex features (e.g., week-on-week increase)
      – Interactions (e.g., ratios, sums, differences)
  • 16. Feature Creation: Aggregated features created by applying functions by row worked well
      – Row sums of features 1–93
      – Row variances of features 1–93
      – Count of non-zero features (1–93)
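A row-wise aggregate is just a function applied across one product's features. The sketch below uses a four-feature toy row instead of 93 features; the output key names are hypothetical, and population variance is an assumption (R's `var`, for instance, computes the sample variance).

```python
from statistics import pvariance

def row_aggregates(row):
    """Summarize one product's feature vector with three extra features."""
    return {
        "row_sum": sum(row),
        "row_var": pvariance(row),  # population variance: an assumption
        "nonzero_count": sum(1 for x in row if x != 0),
    }

row = [0, 2, 0, 4]  # toy stand-in for feat_1 .. feat_93
agg = row_aggregates(row)
```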
  • 17. Feature Creation: Top features selected from RF, GBM, and XGB to create interaction features; top interaction features helped a bit. [Figure: top 20 features from randomForest’s variable importance]
      – + interaction: feat_34 + feat_48, feat_34 + feat_60, etc.
      – - interaction: feat_34 - feat_48, feat_34 - feat_60, etc.
      – * interaction: feat_34 * feat_48, feat_34 * feat_60, etc.
      – / interaction: feat_34 / feat_48, feat_34 / feat_60, etc.
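The pairwise interactions can be generated mechanically once a list of top features is chosen. In this sketch the generated column names and the zero-division fallback are assumptions; the deck does not say how division by zero was handled.

```python
from itertools import combinations

def interaction_features(row, top_features):
    """Build +, -, *, / interactions for every pair of top features.

    row: dict mapping feature name to value for one product.
    """
    out = {}
    for a, b in combinations(top_features, 2):
        x, y = row[a], row[b]
        out[f"{a}_plus_{b}"] = x + y
        out[f"{a}_minus_{b}"] = x - y
        out[f"{a}_times_{b}"] = x * y
        # Fallback of 0.0 for a zero denominator is an assumption.
        out[f"{a}_div_{b}"] = x / y if y != 0 else 0.0
    return out

row = {"feat_34": 3, "feat_48": 2, "feat_60": 0}  # toy values
feats = interaction_features(row, ["feat_34", "feat_48", "feat_60"])
```

With 20 top features this produces 190 pairs and 760 candidate columns, so some selection step afterwards is needed to keep only the interactions that help.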
  • 19. R’s caret: adding a custom log loss metric. [Figure: custom summary function (log loss) for use with caret]
  • 20. Bagging random forests: leads to minor improvement. [Results compared: single rf with 150 trees, 12 randomly sampled features (i.e., mtry), nodesize = 4; after bagging 10 rfs]
  • 21. gbm + caret: better than rf for this dataset
      – GBM params: Depth = 10, Trees = 350, Shrinkage = 0.02. Local validation: 0.52449
      – GBM params: Depth = 10, Trees = 1000, Shrinkage = 0.01, Node.size = 4, Bag.frac* = 0.8. Local validation: 0.51109
      – GBM params: Depth = 10, Trees = 1000, Shrinkage = 0.01, Node.size = 4, Bag.frac* = 0.8, + aggregated features. Local validation: 0.49964
    Improvement as shrinkage decreases, no. of trees increases, and aggregated features are included.
    *Bag fraction: fraction of the training set randomly selected to build the next tree in gbm. Introduces randomness and helps reduce variance.
  • 22. XGBoost (extreme gradient boosting): better and faster than gbm; one of two main models in ensemble
      – xgb params: Depth = 10, Trees = 250, Shrinkage = 0.1, Gamma = 1, Node.size = 4, Col.sample = 0.8, Row.sample = 0.9. Local validation: 0.46278
      – xgb params: Depth = 10, Trees = 7500, Shrinkage = 0.005, Gamma = 1, Node.size = 4, Col.sample = 0.8, Row.sample = 0.9, original features + aggregated features. Local validation: 0.45173
      – xgb params: Depth = 10, Trees = 7500, Shrinkage = 0.005, Gamma = 0.5, Node.size = 4, Col.sample = 0.8, Row.sample = 0.9, original features only (difference from mean). Local validation: 0.44898
    Improvement as shrinkage decreases and no. of trees increases. Feature creation and transformation helped too.
  • 24. Nolearn + Lasagne: a simple two-layer network with dropout works great
      – Architecture: Input (0.15 dropout) → Hidden layer 1 (1000 hidden units, 0.3 dropout) → Hidden layer 2 (500 hidden units, 0.3 dropout) → Output
      – Neural net params: activation: rectifier; output: softmax; batch size: 256; epochs: 140; exponentially decreasing learning rate
  • 25. Tuning neural network hyper-parameters:
      – Use validation data to tune layers, dropout, L2, batch size, etc.
      – Start with a single network (say 93 x 100 x 50 x 9)
      – Use less data to get a faster response
      – Early stopping: no-improvement-in-10, 5, 3 …; visualize loss vs. epochs in a graph
      – Use GPU
    [Figures: log loss vs. epochs]
  • 26. Bagging NNs: leads to significant improvement. Neural nets are somewhat unstable; bagging reduces variance and improves the leaderboard (LB) score. [Results compared: single neural net; bag of 10 neural nets; bag of 50 neural nets]
  • 27. So many ideas, so little time: Bagging + Stacking
      – Randomly sample from training data (with replacement): BAG
      – Train base model on OOB, and predict on BAG data
      – Boost the BAG data with meta model
    [Diagram: training data (1 2 3 4 5 6 7) is sampled into a bootstrap sample (BAG: 4 3 6 3 3 5 4), leaving OOB data (1 2 7); RF (base) and XGBoost (meta) are then applied to the test data]
  • 28. So many ideas, so little time: t-SNE. [Figure: scatter plot of tsne1 vs. tsne2, with a table of example embedding values]
      – tsne1: -12, 2, 3.2, -3.2, 3.3, 2.2, 1.1, 10.2, 3.1, 11
      – tsne2: 3.3, 10, -3.2, 2.3, 1.0, 21, 0.33, -1.1, 1, 22
  • 30. Wisdom of the Crowd: combining multiple models leads to significant improvement in performance. Different classifiers make up for each other’s weaknesses.
  • 31. Ensemble: how do we combine to minimize log loss over multiple models?
      – Competition metric: the goal is to minimize log loss
          – How to do this over multiple models?
          – Voting? Averaging?
      – Create predictions using best classifiers on training set
      – Find best weights for combining the classifiers by minimizing log loss on holdout set
      – Our approach:
          – Append various predictions
          – Minimize overall log loss using scipy.optimize.minimize
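The weight search can be illustrated without any dependencies. The sketch below blends two models and replaces `scipy.optimize.minimize` with a plain grid search over a single weight; that swap, the helper names, and the toy holdout predictions are all assumptions made to keep the example self-contained, and it is not the authors' code.

```python
import math

def log_loss(labels, probs, eps=1e-15):
    """Multi-class log loss on a holdout set."""
    return -sum(math.log(min(max(p[y], eps), 1 - eps))
                for y, p in zip(labels, probs)) / len(labels)

def blend(preds_a, preds_b, w):
    """Weighted average of two models' class probabilities."""
    return [[w * a + (1 - w) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(preds_a, preds_b)]

def best_weight(labels, preds_a, preds_b, steps=100):
    """Grid search over w in [0, 1]; stand-in for a numerical optimizer."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates,
               key=lambda w: log_loss(labels, blend(preds_a, preds_b, w)))

# Toy holdout: each model is strong where the other is weak.
labels = [0, 1, 1, 0]
preds_a = [[0.9, 0.1], [0.1, 0.9], [0.6, 0.4], [0.4, 0.6]]
preds_b = [[0.4, 0.6], [0.6, 0.4], [0.1, 0.9], [0.9, 0.1]]

w = best_weight(labels, preds_a, preds_b)
blended_loss = log_loss(labels, blend(preds_a, preds_b, w))
```

On this toy data the blend scores a lower holdout log loss than either model alone. With more than two models the grid becomes impractical, which is where a general-purpose optimizer over a weight vector is the natural choice.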
  • 32. Ensemble: great improvement over best individual models, though we shouldn’t throw in everything
      – Our final ensemble: (0.45 × XGBoost) + (0.55 × Bag of 50 NNs) = Ensemble
          – XGBoost (0.43528): 415th position on leaderboard
          – Bag of 50 NNs (0.43023): 350th position on leaderboard
          – Ensemble (0.41540): 85th position on leaderboard
      – Another ensemble we tried: (0.445 × XGBoost) + (0.545 × Bag of 110 NNs) + (0.01 × Bag of 10 RFs) = Ensemble (0.41542)
    Sometimes, more ≠ better!
  • 33. Ideas we didn’t have time to implement
  • 34. Ideas that worked well in Otto and other competitions:
      – Clamping predicted probabilities between some threshold (e.g., 0.005) and 1
      – Adding an SVM classifier into the ensemble
      – Creating new features with t-SNE
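Clamping limits how much a single confidently wrong prediction can cost under log loss. A minimal sketch: the function name is hypothetical, and renormalizing the row so it sums to 1 again is an added assumption that the slide does not mention.

```python
def clamp_and_renormalize(probs, floor=0.005):
    """Clamp each class probability into [floor, 1], then rescale to sum to 1."""
    clamped = [min(max(p, floor), 1.0) for p in probs]
    total = sum(clamped)
    return [p / total for p in clamped]

overconfident = [1.0, 0.0, 0.0]
safer = clamp_and_renormalize(overconfident)
```

After rescaling, the smallest entries sit slightly below the floor, but they stay far from zero, so a miss on one of those classes costs a bounded amount rather than an enormous one.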
  • 36. 5th place: https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/14297/share-your-models/79677#post79677
      – Use TF-IDF to transform raw features
      – Create new features by fitting models on raw and TF-IDF features
      – Combine created features with original features
      – Bag XGBoost and 2-layer NN 30 times and average predictions
  • 37. 2nd place: http://blog.kaggle.com/2015/06/09/otto-product-classification-winners-interview-2nd-place-alexander-guschin/
      – Level 0: Split the data into two groups, raw and TF-IDF
      – Level 1: Create metafeatures using different models
          – Split data into k folds, training k models on k-1 parts and predicting on the 1 part left aside for each k-1 group
      – Level 2: Train metaclassifier with features and metafeatures and average/ensemble
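The Level 1 step (out-of-fold metafeatures) can be sketched as follows. Everything here is illustrative: the base model is a toy class-frequency predictor standing in for the real models, and the function names and fold assignment are assumptions rather than the 2nd-place team's code.

```python
from collections import Counter

def fit_prior(rows, labels, n_classes):
    """Toy base model: class frequencies with add-one smoothing."""
    counts = Counter(labels)
    total = len(labels) + n_classes
    return [(counts[c] + 1) / total for c in range(n_classes)]

def predict_prior(model, row):
    """The toy model ignores the row and returns the learned class prior."""
    return list(model)

def out_of_fold_predictions(rows, labels, n_classes, k=5):
    """Each row is predicted by a model trained on the other k-1 folds."""
    n = len(rows)
    oof = [None] * n
    for fold in range(k):
        held_out = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        model = fit_prior([rows[i] for i in train],
                          [labels[i] for i in train], n_classes)
        for i in held_out:
            oof[i] = predict_prior(model, rows[i])
    return oof

rows = [[0.0]] * 6  # placeholder features
labels = [0, 0, 1, 1, 2, 2]
meta = out_of_fold_predictions(rows, labels, n_classes=3, k=3)
```

Because no row's metafeature comes from a model that saw that row's label, the Level 2 metaclassifier can be trained on `meta` alongside the original features without leaking the target.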
  • 40. Suggested framework for Kaggle competitions:
      – Understand the problem, metric, and data
      – Create a reliable validation process that resembles the leaderboard
          – Use early submissions for this
          – Avoid overfitting!
      – Understand how linear and non-linear models work on the problem
      – Try many different approaches/models and do the following:
          – Transform data (rescale, normalize, PCA, etc.)
          – Feature selection/creation
          – Tune parameters
          – If there is a large disparity between local validation and leaderboard, reassess the validation process
      – Ensemble
    Largely adopted from KazAnova: http://blog.kaggle.com/2015/05/07/profiling-top-kagglers-kazanovacurrently-2-in-the-world/
  • 41. Our code is available on GitHub: https://github.com/eugeneyan/Otto
  • 42. Thank you! Eugene Yan (eugeneyanziyou@gmail.com) & Wang Weimin (wangweimin888@yahoo.com)