Kaggle
Higgs Boson
Team Lead: Bernard Ong
Nanda Rajarathinam
Venkatesh Subramanian
Deepak Khurana
Machine Learning Challenge
Agenda
• Introduction
• Strategy and Approach
• Exploratory Data Analysis and Feature Engineering
• Training and Cross-Validation
• Results and Findings
• Conclusion
Introduction
Challenge Definition
Success Metrics
Challenge and Objectives
• Apply Machine Learning to predict CERN’s simulated particle collision events as either
a Higgs Boson signal (s) or background (b) noise
• Learn how to strategize, optimize, and fine-tune models, algorithms, and parameters
to achieve the best signal prediction possible
• Gain working experience differentiating theory from actual practice
• Develop new advanced skills in model ensembles and stacking techniques
Success Metrics
• Achieve the highest Approximate Median Significance (AMS) Score in the Kaggle
Private Leaderboard
• Achieve the Top 5% (or better) position placement in the Kaggle Private Leaderboard
out of 1,785 teams in the challenge
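For reference, the Approximate Median Significance metric named above can be computed directly from the event weights and predictions. A minimal sketch, assuming NumPy arrays for weights, true labels (1 = signal, 0 = background), and predictions, with the regularization term b_reg = 10 from the challenge definition:

import numpy as np

def ams_score(weights, y_true, y_pred, b_reg=10.0):
    # s: total weight of true signal events selected as signal
    # b: total weight of background events selected as signal
    selected = (y_pred == 1)
    s = weights[selected & (y_true == 1)].sum()
    b = weights[selected & (y_true == 0)].sum()
    return np.sqrt(2.0 * ((s + b + b_reg) * np.log(1.0 + s / (b + b_reg)) - s))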
Strategy and Approach
Process
Feature Analysis and Engineering
Technology Stack
Models, Ensembles, and Stacking
Process
• The machine learns, while we also learn how to learn
• Agile “chunking” process applied towards machine learning
• Move fast, fail fast, wash-rinse-repeat; then harvest good results and keep detailed logs
• Start with the basics to get the patterns and trends
• Treat data like it’s alive; understand its health and heartbeat
• Get a feel of how the data behaves and reacts to changes
• Increase complexity as the accuracy improves
• Optimize each and every model from the outset
• Develop a solid build::create::train::evaluate::predict::score pipeline
• Make the iterative process fast, repeatable, and reliable
• Institute machine learning Agile “kanban” process, applied towards multiple parallel production lines
• Build in feature analysis and engineering to complement the development, training, and cross-
validation processes
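A minimal sketch of one pass through the build/train/evaluate/predict/score loop described above, assuming a scikit-learn style estimator and the ams_score helper sketched earlier; the hypothetical log list stands in for the detailed run logs the process calls for:

from sklearn.model_selection import train_test_split

def run_experiment(model, X, y, weights, log):
    # build / train
    X_tr, X_val, y_tr, y_val, w_tr, w_val = train_test_split(
        X, y, weights, test_size=0.3, random_state=42)
    model.fit(X_tr, y_tr)
    # predict and score with the local AMS metric
    preds = model.predict(X_val)
    score = ams_score(w_val, y_val, preds)
    log.append({"model": type(model).__name__, "ams": score})
    return score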
Feature Analysis & Engineering
• Missingness Analysis and Imputation
• Correlation Charts and Multicollinearity
• Principal Component Analysis (Dimensional Reduction)
• Density Charts
• 5-PHI Fields
• Jet Fields
Technology Stack
• Python
• For all modeling and algorithmic related work
• Numpy, Pandas & Matplotlib.pyplot
• R
• For all Exploratory Data Analysis (EDA) work and plots
• Key Python Libraries
• Scikit Learn (linear_model.LogisticRegression, svm, DecisionTreeClassifier, naive_bayes.GaussianNB)
• Scikit Learn Ensemble (GradientBoostingClassifier, RandomForestClassifier, AdaBoostClassifier)
• Special Libraries
• XGBoost
• Neural Network
Models, Ensembles, Stacking
• Naïve Bayes
• Logistic Regression
• Support Vector Machines
• Decision Tree, Extra Trees, Random Forest
• Gradient Boosting, AdaBoost, XGBoost
• Neural Network
EDA, Feature Analysis
and Engineering
PCA Results
Correlation Results
Histograms
Imputation on “Missing” Data
Variable Importance Plot
Missing Values Plots
Features Correlation Plot
PCA Plot 1
PCA Plot for Feature Reduction 2
Least Influence Features
15% of Data Has No Influence on AMS
Imputation on “Missing” Data
• Zeros
• Mean
• Median
• Most Frequent
• K-Nearest Neighbor
• Interpolation
• “Do Nothing”
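A minimal sketch of how these imputation strategies could be set up, assuming the dataset's -999.0 sentinel is first exposed as NaN; the scikit-learn imputers shown are stand-ins for whatever implementation was actually used, and "Do Nothing" simply leaves -999.0 in place for the tree-based models:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

train = pd.read_csv("training.csv")   # Kaggle Higgs training file
X = train.drop(columns=["EventId", "Weight", "Label"]).replace(-999.0, np.nan)

imputers = {
    "zeros":         SimpleImputer(strategy="constant", fill_value=0.0),
    "mean":          SimpleImputer(strategy="mean"),
    "median":        SimpleImputer(strategy="median"),
    "most_frequent": SimpleImputer(strategy="most_frequent"),
    "knn":           KNNImputer(n_neighbors=5),   # slow on the full 250k rows
}
imputed = {name: imp.fit_transform(X) for name, imp in imputers.items()}
# "Do Nothing": keep -999.0 as-is and let tree-based models split on it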
Missing Values Plots
Training and Cross Validation
Parameters and Hyper-Parameters
Patterns and Trends
Tuning and Optimization
Naïve Bayes
All Input Data → Naïve Bayes Model → Parameter Tuning → Final Prediction
Local AMS Score = 1.00983
Logistic Regression
All Input Data → Logistic Regression Model → Parameter Tuning → Final Prediction
Local AMS Score = 1.81852
Neural Network Model
Dropped Set Input Data → Neural Network Model → Parameter Tuning → Final Prediction
Local AMS Score = 2.698
Private LB AMS Score = 2.745
Boosting, Random Forest, SVM
All Input Data → GBM / RF / SVM Ensemble → Parameter Tuning → Final Prediction
Local AMS Score = 3.53583
Private LB AMS Score = 3.58742
XGBoost
All Input Data → XGBoost Ensemble → Parameter Tuning → Final Prediction
Local AMS Score = 3.60760
Private LB AMS Score = 3.64051
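A minimal sketch of this XGBoost step, using the scikit-learn wrapper with illustrative (not the team's tuned) parameter values; X_train, y_train, and X_test are assumed, and the signal/background decision is made by thresholding the top-ranked fraction of predicted scores:

import numpy as np
from xgboost import XGBClassifier

clf = XGBClassifier(n_estimators=500, max_depth=6, learning_rate=0.1,
                    subsample=0.9, missing=-999.0)    # illustrative values
clf.fit(X_train, y_train)                             # y: 1 = signal, 0 = background

scores = clf.predict_proba(X_test)[:, 1]
cutoff = np.percentile(scores, 85)                    # label the top ~15% as signal
y_pred = (scores > cutoff).astype(int)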
Drop the 5-phi Features
Dropped Set Input Data → XGBoost Ensemble → Parameter Tuning → Final Prediction
Local AMS Score = 3.64040
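A minimal sketch of the feature drop tried here, assuming a pandas DataFrame with the original Kaggle column names (the five azimuthal-angle features all end in "_phi"):

phi_cols = [c for c in X.columns if c.endswith("_phi")]
X_no_phi = X.drop(columns=phi_cols)   # feed this reduced set to the XGBoost ensemble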
Drop Features with >70% Missing Data
Dropped Set Input Data → XGBoost Ensemble → Parameter Tuning → Final Prediction
Local AMS Score = 3.41482
Polynomial Features
Engineered Input Data → XGBoost Ensemble → Parameter Tuning → Final Prediction
Local AMS Score = 3.57455
Private LB AMS Score = 3.64580
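A minimal sketch of the engineered polynomial input tried in this step, using scikit-learn's PolynomialFeatures on a hypothetical subset of the derived-mass columns (expanding all 30 features quadratically would be expensive):

from sklearn.preprocessing import PolynomialFeatures

subset = X[["DER_mass_MMC", "DER_mass_transverse_met_lep", "DER_mass_vis"]]
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(subset)   # adds squares and pairwise products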
XGBoost Hyper-Parm Tuning
Varied Input Data → XGBoost Ensemble → Parameter Tuning → Final Prediction
Local AMS Score = 3.57257
Private LB AMS Score = 3.68305
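A minimal sketch of one-parameter-at-a-time tuning with xgboost's built-in cross-validation; the grid values and round counts are illustrative, and AUC is used here as a proxy metric in place of the local AMS score:

import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train, missing=-999.0)
base = {"objective": "binary:logitraw", "eta": 0.1, "max_depth": 6}

for depth in (4, 6, 8, 10):                     # vary one hyper-parameter at a time
    params = dict(base, max_depth=depth)
    cv = xgb.cv(params, dtrain, num_boost_round=300, nfold=5,
                metrics="auc", seed=42)
    print(depth, cv.iloc[-1]["test-auc-mean"])  # log the result for each value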
Model Ensemble Stacking Tuning
LEVEL 0: Gradient Boosting Classifier, XGBoost Classifier, Random Forest Trees Classifier, and AdaBoost Classifier, each trained on the Dropped Set Training Data with Parameter Tuning
LEVEL 1: Logistic Classifier → Final Prediction
Local AMS Score = 3.68743
Private LB AMS Score = 3.72254
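A minimal sketch of the Level 0 / Level 1 stacking diagrammed above, using scikit-learn's StackingClassifier as a stand-in for the team's own stacking pipeline (the actual pipeline generated out-of-fold Level 0 predictions and tuned each base model separately):

from sklearn.ensemble import (StackingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, AdaBoostClassifier)
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

level0 = [
    ("gbm", GradientBoostingClassifier()),
    ("xgb", XGBClassifier()),
    ("rf",  RandomForestClassifier()),
    ("ada", AdaBoostClassifier()),
]
stack = StackingClassifier(estimators=level0,
                           final_estimator=LogisticRegression(),  # Level 1
                           cv=5, stack_method="predict_proba")
stack.fit(X_dropped, y_train)   # dropped-feature training set, names assumed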
Results and Findings
Interpretation
Discovery and Insights
Feature Analysis
• Some variables are meaningless or cannot be computed; their value is -999.0, which is outside the normal range of all variables
• This is another reason not to impute -999.0
• Imputation is a bad idea here because -999.0 is not actually missing data
• -999.0 indicates that nothing is meant to be there
• Split training and test data into 6 categories according to level of undefined features
• PRI_jet_num is a categorical variable indicating which of four jet outcomes occurred: 0 (no jet), 1, 2, or 3 jets
• Seven variables are missing more than 70% of their data in the training set
• Last variables in PCA feature plots have least influence in their current form
• A chi-square test of dependence shows that Weights and Labels are dependent on each other
• Signal events have low weights and Background events have high weights
• Signal events weight range is exclusively separated from background event weight range
• Signal events seem to have only three distinct weights; it looks like the organizers are targeting three channels of the Higgs Boson
• Not worth spending time deriving the weight formula; it has already been done for us
Findings
• The Variable Importance plot as well as PCA showed that the 5 angle (‘phi’) variables seemed to be the least
important of the 30 predictors. Although removal of the above variables did not result in a significant
improvement of the AMS score using a single model, it did help increase the score quite a bit for the Ensemble
model
• Ensemble models performed better than single models and stacking the Ensemble models provided the best result
in achieving the final AMS score
• The final Stacking model consisted of AdaBoost, Gradient Boost, XGBoost as Level 0 classifiers followed by Logistic
Regression as Level 1 classifier
• Stacking several models in the base layer does not always enhance the final result (AMS score). In our testing,
three base classifiers (AdaBoost, Gradient Boost and XGBoost) provided almost equivalent results compared to
two base classifiers
• Cross Validation plays a key role in testing the model. Based on our testing, the optimal number of folds for cross
validation was determined as 5 since 3 folds seemed to underfit the model while 10 folds seemed to overfit the
model
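For illustration, a minimal sketch of the fold comparison described in the last bullet, assuming a scikit-learn compatible classifier clf and AUC as a proxy metric (the team's comparison used the local AMS score):

from sklearn.model_selection import cross_val_score

for k in (3, 5, 10):   # 3 folds underfit, 10 folds overfit, 5 was the sweet spot
    scores = cross_val_score(clf, X, y, cv=k, scoring="roc_auc")
    print(k, "folds:", scores.mean().round(4), "+/-", scores.std().round(4))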
ROC / AUC Results Curve
Sensitivity / Specificity Curve
Lessons Learned
• An increase in the local AMS score based on the training and cross-validation data does not
always translate to an increase in the private Kaggle Leaderboard score due to potential
overfitting of the model
• Combination of (no. of models) x (no. of hyperparameters) x (no. of features) x (no. of VABS)
is complex and overwhelming. Disciplined process, pipelining, and automation are a must
• Methodical Tuning of the parameters is critical during testing (focusing on one parameter at a
time). Random tuning of the parameters would make it difficult to keep track of the changes
and may result in over-tuning that would end up decreasing the Kaggle score
• SVM is highly computation intensive and the features and parameters need to be carefully
selected before the execution of the algorithm. The algorithm runs for several hours even
with reduced features and data
• The maximum number of trees used need not be the same for all the models in a stacked
classifier. Increasing the number of trees helped in improving the AMS score only up to a
certain threshold. Any further increase in the trees resulted in decreasing the overall AMS
score due to potential overfitting
Insights
• Feature engineering (creating new features) did not really work that well – it was not
worth the effort
• Ensembles and stacking clearly demonstrated significant improvements
• Hyper-parameters fine-tuning and optimization are very challenging and time consuming
• Each little change in hyper-parameters could decrease or increase accuracy significantly
• A fine line exists balancing across effort vs computing time vs complexity vs accuracy
• Imputation of missing data creates lower predictive accuracy – better to just leave it as is
• Managing theoretical feature significance to improve prediction accuracy was very
challenging in actual practice
• Prediction accuracy would have been even more challenging during a live competition due to
the absence of the private leaderboard
What We Would Have Liked to Do
• Find better techniques to auto-generate new features to increase prediction accuracy
• Explore SVM in depth and try various options for tuning parameters and analyze effects on final score
• Blending machine learning and physics (or business) models has big potential for increased accuracy
• Build out our expertise in ensembles and other aggregation strategies (VABS)
• Inject more random seed sampling for base models
• Invest more time and effort to optimize our neural network frameworks
• Enhance the current Ensemble model by including the Neural Network algorithm (Deep Learning) and
test the predictive accuracy
• Use Bayesian or Gaussian-based optimizer tools to automate pipeline iterations to find the
best combination of hyper-parameters (given the complexity of the search)
• Use GridSearch only as a contingency to determine the best combination of parameters for the tree-based
boosting models
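A minimal sketch of the GridSearch fallback mentioned in the last bullet, using scikit-learn's GridSearchCV over a small illustrative grid for a tree-based booster:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

param_grid = {
    "n_estimators": [200, 400],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(GradientBoostingClassifier(), param_grid,
                      cv=5, scoring="roc_auc", n_jobs=-1)
search.fit(X_train, y_train)   # X_train, y_train assumed from the earlier pipeline
print(search.best_params_, search.best_score_)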
Conclusion
Summary and Synopsis
Final Thoughts
Summary
• We have met our objectives in this Machine Learning Challenge, being able to apply
various models, algorithms, and strategies to achieve relatively good predictions
• We have obtained a final private leaderboard score of 3.72254 (34th position),
garnering a Top 2% spot in the challenge
• We have a renewed sense of appreciation of what machine learning can do, and also
the power behind the complexity
• We have learned and developed a lot of new skills from the challenge, but also
realized how much more we still don’t know
Final Thoughts
Winning or getting high scores in Kaggle does
not necessarily equate to being a good Data
Scientist.
Being great at machine learning is more than just
depending on an individual’s smarts and skills.
Knowing how to form a great team that works
well together plays a major role in how one
succeeds in the field of Data Science.
Final Private LB AMS Score
Top 2%
Appendix
More Details:
XGBoost Details
More EDA Plots
Benefits of XGBoost
• Regularization & Parallel Processing
• XGBoost implements regularization that helps in reducing overfitting. It also implements parallel processing that significantly
expedites the model fitting process compared to GBM.
• XGBoost utilizes OpenMP, which can parallelize the code on a multithreaded CPU automatically. XGBoost defines a data
structure, DMatrix, to store the data matrix. This data structure performs some preprocessing work on the data so that later
iterations run faster.
• High Flexibility
• XGBoost allows users to define custom optimization objectives and evaluation criteria. This adds a whole new dimension to the
model and significantly enhances its flexibility
• Handling Missing Values
• XGBoost has an in-built routine to handle missing values.
• The user supplies a value different from other observations and passes it as a parameter. XGBoost tries different directions as it
encounters a missing value at each node and learns which path to take for missing values in the future.
• Tree Pruning
• A GBM would stop splitting a node when it encounters a negative loss in the split. Thus it is more of a greedy algorithm.
• XGBoost, on the other hand, makes splits up to the specified max_depth and then starts pruning the tree backwards, removing
splits beyond which there is no positive gain.
• Another advantage: sometimes a split with negative loss, say -2, is followed by a split with positive loss, say +10. GBM would stop
when it encounters the -2, but XGBoost goes deeper, sees the combined effect of +8, and keeps both splits.
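A minimal sketch of the DMatrix construction and built-in missing-value handling described above, passing the dataset's -999.0 sentinel via the missing parameter (parameter values and round counts are illustrative):

import xgboost as xgb

# DMatrix preprocesses the data once so later boosting iterations run faster;
# 'missing' tells XGBoost which sentinel to learn default split directions for.
dtrain = xgb.DMatrix(X_train, label=y_train, missing=-999.0)
dtest  = xgb.DMatrix(X_test, missing=-999.0)

params = {"objective": "binary:logitraw", "eta": 0.1, "max_depth": 6, "nthread": 4}
booster = xgb.train(params, dtrain, num_boost_round=300)
raw_scores = booster.predict(dtest)   # raw scores, before the logistic transform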
XGBoost Hyper-Parameters
• General Parameters
• booster [default=gbtree]: The type of booster to be used. This value can be gbtree, gblinear or dart. gbtree and dart use tree-based models,
while gblinear uses a linear function.
• silent [default=0]: 0 means printing running messages, 1 means silent mode.
• nthread [default to maximum number of threads available if not set]: number of parallel threads used to run xgboost
• num_pbuffer [set automatically by xgboost, no need to be set by user]: size of prediction buffer, normally set to number of training instances.
The buffers are used to save the prediction results of last boosting step.
• num_feature [set automatically by xgboost, no need to be set by user]: feature dimension used in boosting, set to maximum dimension of
the feature
• Tree Booster Parameters
• eta [default=0.3]: step size shrinkage used in updates to prevent overfitting. After each boosting step, we can directly get the weights of new
features, and eta shrinks the feature weights to make the boosting process more conservative.
• max_depth [default=6]: maximum depth of a tree; increasing this value makes the model more complex and more likely to overfit.
• scale_pos_weight [default=0]: Control the balance of positive and negative weights, useful for unbalanced classes. A typical value to
consider: sum(negative cases) / sum(positive cases)
• subsample [default=1]: subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the data
instances to grow trees, which helps prevent overfitting.
• Learning Task Parameters
• objective [default=reg:linear]: we used "binary:logitraw" for this scenario, i.e., logistic regression for binary classification that outputs the score
before the logistic transformation
• eval_metric: evaluation metrics for validation data; a default metric is assigned according to the objective (RMSE for regression, error
for classification, mean average precision for ranking)
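Putting the parameters above together, a minimal sketch of a parameter dictionary in the style used for this challenge; the numeric values are illustrative, not the team's final tuned settings:

params = {
    # General parameters
    "booster": "gbtree",
    "silent": 0,
    "nthread": 8,
    # Tree booster parameters
    "eta": 0.1,
    "max_depth": 6,
    "subsample": 0.9,
    "scale_pos_weight": 1.0,        # roughly sum(negative) / sum(positive)
    # Learning task parameters
    "objective": "binary:logitraw", # score before the logistic transformation
    "eval_metric": "auc",
}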
Histograms 1
Histograms 2
Density Plot 1
Density Plot 2
s & b Weight Density Plot
Variance Inflation Factors
Variable Importance Plots
