Mastering Machine Learning with Competitions

Presented at Data Break 2018, the first Korean Kaggler workshop at Microsoft Korea in Seoul, Korea on 10/7/2018.

  1. Jeong-Yoon Lee, Ph.D.: Sr. Applied Machine Learning Scientist, Microsoft; Science Advisor, Neofect and Conversion Logic; KDD Cup 2012, 2015 Winner; Top 10, Kaggle 2015; KDD Cup 2018 Co-Chair, ACM SIGKDD; OneML Organizing Committee, Microsoft
  2. ML Competitions: a timeline of competition platforms (since 1997; 2006-2009; since 2010; one started in 8/2018). For the latest list of competitions, see https://github.com/iphysresearch/DataSciComp
  3. KDD Cup
  4. Kaggle
  5. Kaggle rankers by country (http://kagglerank.azurewebsites.net/): USA 863, Russia 266, India 220, China 197, France 153, Germany 143, Japan 139, UK 119, South Korea 19, North Korea 1
  6. Why ML Competitions?
  7. Fun
  8. Networking
  9. Learning - Data
  10. Learning - Languages
  11. Learning - Approaches
  12. https://imaginecup.microsoft.com/en-us/winners/2018WorldChampions and https://www.sciencealert.com/this-teenage-girl-invented-a-brilliant-ai-based-app-that-can-quickly-diagnose-eye-disease
  13. Best Practices
  14. Feature Engineering, by feature type:
      - Numerical: Log, Log2(1 + x), Box-Cox, Normalization, Binning
      - Categorical: One-hot-encoding, Label-encoding, Count, Weight-of-Evidence
      - Text: Bag-of-Words, TF-IDF, N-gram, Character-n-gram, K-skip-n-gram
      - Timeseries/Sensor data: Descriptive Statistics, Derivatives, FFT, MFCC, ERP
      - Network Graph: Degree, Closeness, Betweenness, PageRank
      - Numerical/Timeseries: Convert to categorical features using RF/GBM
      - Dimensionality Reduction: PCA, SVD, Autoencoder, Hashing Trick
      - Interaction: Addition/subtraction/multiplication/division, Hashing Trick
      For a more comprehensive overview of feature engineering, see HJ van Veen: https://www.slideshare.net/HJvanVeen/feature-engineering-72376750
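Two of the encodings listed above (one-hot and count encoding for categoricals) plus the log transform for numericals can be sketched in plain Python; the function names below are illustrative, not from the slides:

```python
import math
from collections import Counter

def log_transform(x):
    """log(1 + x): a standard transform to tame right-skewed numerical features."""
    return math.log1p(x)

def one_hot(values):
    """One-hot-encoding: map each categorical value to a 0/1 indicator vector."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def count_encode(values):
    """Count encoding: replace each category with its frequency in the data."""
    counts = Counter(values)
    return [counts[v] for v in values]

colors = ["red", "blue", "red", "green"]
print(one_hot(colors))       # indicator columns ordered blue, green, red
print(count_encode(colors))  # [2, 1, 2, 1]
```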
  15. Diverse Algorithms (algorithm, tools, note):
      - Gradient Boosting Machine (XGBoost, LightGBM): the most popular algorithm in competitions
      - Random Forests (Scikit-Learn, randomForest): used to be popular before GBM
      - Extremely Randomized Trees (Scikit-Learn)
      - Neural Networks/Deep Learning (Keras, MXNet, PyTorch, CNTK): blends well with GBM; best at image and speech recognition competitions
      - Logistic/Linear Regression (Scikit-Learn, Vowpal Wabbit): fastest; good for ensembles
      - Support Vector Machine (Scikit-Learn)
      - FTRL (Vowpal Wabbit): competitive solution for CTR estimation competitions
      - Factorization Machine (libFM, fastFM): winning solution for KDD Cup 2012
      - Field-aware Factorization Machine (libFFM): winning solution for CTR estimation competitions (Criteo, Avazu)
  16. A Tale of Two Algorithms: GBM vs. Deep Learning
      - Highlight: No. 1 winning algorithm at most machine learning competitions vs. most popular algorithm across media, industry, and academia
      - Base algorithm: Decision Tree (Morgan & Sonquist 1963) vs. Perceptron (Rosenblatt 1958)
      - Use cases: structured, categorical data vs. image, speech, and natural language data
      - Feature engineering: a crucial step vs. architecture design and finding pre-trained models
      - Open-source tools: LightGBM, XGBoost, CatBoost, H2O vs. Keras, PyTorch, TensorFlow, CNTK, MXNet, Caffe
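To make the GBM side of the comparison concrete, here is a minimal from-scratch sketch of gradient boosting for squared loss: each round fits a one-split regression stump to the current residuals and adds a damped copy of it to the model. This illustrates the idea only; competition code would reach for LightGBM or XGBoost instead.

```python
def fit_stump(x, y):
    """Fit a one-split regression stump on a single numerical feature."""
    best = None
    for t in sorted(set(x))[:-1]:  # the largest value would leave the right side empty
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((yi - (lm if xi <= t else rm)) ** 2 for xi, yi in zip(x, y))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def gbm_fit(x, y, n_rounds=20, lr=0.5):
    """Gradient boosting for squared loss: each round fits a stump to residuals."""
    base = sum(y) / len(y)            # start from the target mean
    pred = [base] * len(x)
    stumps = []
    for _ in range(n_rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, resid)   # for squared loss, residuals = negative gradient
        stumps.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: base + lr * sum(s(xi) for s in stumps)

model = gbm_fit([1, 2, 3, 4, 5, 6], [1, 1, 1, 5, 5, 5])
```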
  17. Cross Validation: training data are split into five folds where the sample size and dropout rate are preserved (stratified).
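The stratified split described here can be sketched without any library: group sample indices by class, then deal each class round-robin across the k folds so every fold keeps roughly the same class proportions. This is a simplified stand-in for scikit-learn's StratifiedKFold:

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=42):
    """Return k disjoint folds of indices with class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for j, i in enumerate(indices):
            folds[j % k].append(i)  # deal round-robin so each fold gets its share
    return folds

# 50 negatives and 10 positives -> each of 5 folds holds 10 negatives, 2 positives
labels = [0] * 50 + [1] * 10
for fold in stratified_kfold(labels, k=5):
    print(len(fold), sum(labels[i] for i in fold))  # 12 2
```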
  18. Ensemble - Stacking. * For other types of ensembles, see http://mlwave.com/kaggle-ensembling-guide/
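The core mechanic of stacking is the out-of-fold prediction: each base model predicts only for samples it never trained on, and those leak-free predictions become input features for a second-stage (meta) model. A minimal sketch, where `fit` stands in for any base learner's training routine (here a trivial mean predictor, purely for illustration):

```python
def oof_predictions(x, y, fit, k=3):
    """Out-of-fold predictions for stacking: every sample is predicted by a
    model trained on the other k-1 folds, so the meta-features never leak."""
    n = len(x)
    oof = [None] * n
    for f in range(k):
        train = [i for i in range(n) if i % k != f]  # simple modulo folding
        valid = [i for i in range(n) if i % k == f]
        model = fit([x[i] for i in train], [y[i] for i in train])
        for i in valid:
            oof[i] = model(x[i])
    return oof

# Trivial base learner: ignore the features, predict the training-target mean.
def mean_learner(xs, ys):
    mean = sum(ys) / len(ys)
    return lambda _: mean

oof = oof_predictions([0] * 6, [0, 3, 6, 9, 12, 15], mean_learner, k=3)
# The oof column (one per base model) would then be fed to the meta model.
```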
  19. KDD Cup 2015 Solution
  20. Collaboration
  21. Collaboration - Git Repo + S3/Dropbox
  22. Collaboration - Common Validation
  23. Collaboration - Internal Leaderboard
  24. Pipeline: https://gitlab.com/jeongyoonlee/allstate-claims-severity
  25. How to Explore
  26. Resources: 캐글뽀개기 (a Korean Kaggle community), Introduction to Machine Learning for Coders, Practical Deep Learning for Coders, How to Win a Data Science Competition, Winning Tips on Machine Learning Competitions, Feature Engineering, mlwave.com
  27. Active Competitions
  28. Contact: jeol@microsoft.com, https://linkedin.com/in/jeongyoonlee, https://kaggle.com/jeongyoonlee
  29. Misconceptions on Competitions
  30. No ETL? - Deloitte Western Australia Rental Prices
  31. No ETL? - Outbrain Click Prediction: 2B page views, 16.9MM clicks, 700MM users, 560 sites
  32. No ETL? - YouTube-8M Video Understanding Challenge: 1.7TB feature-level data, 31GB video-level data
  33. No ETL?
  34. No EDA?
  35. Not worth it?