2nd Place Solution
Instacart Market Basket Analysis
Agenda
• My Background
• Problem Overview
• Main Approach
• Feature Engineering
• Feature Importance
• Important Findings
• F1 maximization
My Background
• Bachelor of Economics
• Programmer in the financial industry
• Consultant in the financial industry
• 2nd place at KDD Cup 2015
• Data Scientist at Yahoo! JAPAN
Problem Overview
• In this competition, we have to predict which products a user will reorder.
• So it is a little different from general recommendation.
• I mean,
Problem Overview
• How hot (user)?
*prior is treated as train
Problem Overview
• How hot (item)?
*Clipped at 500
Problem Overview
• The evaluation metric is the mean F1 score, computed per order and then averaged
• Precision and Recall
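As a rough illustration of the metric, here is a minimal sketch of a per-order F1 averaged over orders. The function names and the toy data are assumptions made for this example, not the competition's scoring code.

```python
def f1_score(predicted, actual):
    """F1 between two sets of product ids for a single order."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


def mean_f1(predictions, truths):
    """Average the per-order F1 over all orders (both dicts keyed by order id)."""
    return sum(f1_score(predictions[o], truths[o]) for o in truths) / len(truths)


# Hypothetical toy data: order id -> set of product ids.
truths = {1: {"apple", "milk"}, 2: {"bread"}}
predictions = {1: {"apple"}, 2: {"bread", "eggs"}}
print(round(mean_f1(predictions, truths), 4))
```

Order 1 has perfect precision but half recall, order 2 the reverse, so both orders score 2/3 and so does the mean.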
Problem Overview
• Links between the files
Main Approach
• We are given orders.csv
Main Approach
• We are given order_products.csv
Main Approach
• Reorder Prediction
user_id product_id label
Main Approach
• None Prediction
user_id label
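The two target tables (user_id, product_id, label) and (user_id, label) could be built roughly as follows. This is a sketch under assumptions: the slides only show the column names, so the labeling rule (label 1 when a previously bought product appears in the user's latest order, None label 1 when no previously bought product appears) and the toy rows are illustrative guesses.

```python
from collections import defaultdict

# Hypothetical toy rows: (user_id, order_number, product_id).
# The last order of each user plays the role of the target order.
rows = [
    (1, 1, "apple"), (1, 1, "milk"),
    (1, 2, "apple"),
    (1, 3, "apple"), (1, 3, "bread"),   # target order for user 1
    (2, 1, "eggs"),
    (2, 2, "tea"),                      # target order for user 2
]

orders = defaultdict(lambda: defaultdict(set))   # user -> order_number -> products
for user_id, order_number, product_id in rows:
    orders[user_id][order_number].add(product_id)

reorder_rows = []   # (user_id, product_id, label)
none_rows = []      # (user_id, label)
for user_id, by_order in sorted(orders.items()):
    last = max(by_order)
    target = by_order[last]
    history = set().union(*(p for n, p in by_order.items() if n != last))
    for product_id in sorted(history):
        reorder_rows.append((user_id, product_id, int(product_id in target)))
    # Assumed meaning of "None": the target order has no previously bought product.
    none_rows.append((user_id, int(not (history & target))))

print(reorder_rows)
print(none_rows)
```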
Feature Engineering
• I made 4 types of features
1. User
• What this user is like
2. Item
• What this item is like
3. User x Item
• How the user feels about the item
4. Datetime
• What this day and hour are like
*For the None model, I can’t use the above features directly, except the user and datetime ones. So I convert the others to per-user stats (min, mean, max, sum, std…).
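The conversion to stats for the None model can be sketched like this: collapse a user x item feature into one row per user. The feature values and dictionary layout are hypothetical, and the use of the population standard deviation is a choice made here so that users with a single item still get a value.

```python
import statistics
from collections import defaultdict

# Hypothetical user x item feature values: (user_id, product_id) -> value.
useritem_feature = {
    (1, "apple"): 3.0,
    (1, "milk"): 1.0,
    (1, "bread"): 2.0,
    (2, "eggs"): 5.0,
}

values_by_user = defaultdict(list)
for (user_id, _product_id), value in useritem_feature.items():
    values_by_user[user_id].append(value)

user_stats = {}
for user_id, values in values_by_user.items():
    user_stats[user_id] = {
        "min": min(values),
        "mean": statistics.mean(values),
        "max": max(values),
        "sum": sum(values),
        # pstdev is defined for a single value (0.0); stdev would raise.
        "std": statistics.pstdev(values),
    }

print(user_stats[1])
```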
Feature Importance for reorder
Feature Importance for None
Important Findings for reorder - 1
• user_id: 54035
Important Findings for reorder - 2
• days_last_order-max is the difference between days_since_last_order_this_item and useritem_order_days_max
• days_since_last_order_this_item is a user x item feature: how many days have passed since the user last ordered this item
• useritem_order_days_max is also a user x item feature: the max span (in days) between the user's orders of this item
• For more detail, see the next page
Important Findings for reorder - 2
• See index 0: the user bought this item 14 days ago, and the max span is 30 days
• So I think this feature says whether or not the user has become bored with that item
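A minimal sketch of this feature for one user and one item. It assumes that "max span" means the longest gap between consecutive orders of the item; the function name and the day offsets are made up for illustration.

```python
def days_last_order_minus_max(order_days, today):
    """order_days: sorted day offsets on which the user ordered the item.

    Returns days since the last order minus the longest past gap
    between consecutive orders of that item.
    """
    days_since_last = today - order_days[-1]
    gaps = [later - earlier for earlier, later in zip(order_days, order_days[1:])]
    longest_gap = max(gaps)  # assumes at least two past orders of the item
    return days_since_last - longest_gap


# Same numbers as the example: bought 14 days ago, max span 30 days.
print(days_last_order_minus_max([0, 30, 40], today=54))   # 14 - 30 = -16
```

A negative value means the user is still within their usual rhythm for the item; a large positive value means they have waited longer than ever before.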
Important Findings for reorder - 3
• We already know fruits are reordered more frequently than vegetables (3 Million Instacart Orders, Open Sourced)
• I wanted to know how often
• So I made an item_10to1_ratio feature, defined as the reorder ratio after an item is ordered vs. not ordered.
• See the next page for more details
Important Findings for reorder - 3
• Let’s say userA bought itemA at order_number 1 and 4
• And userB bought itemA at order_number 1 and 3
• item_10to1_ratio is 0.5
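One reading that reproduces the 0.5 in this example: look at every place where an item was ordered and then skipped (the pattern 1, 0), and measure how often the very next order contains it again. This interpretation of the name "10to1" is an assumption, as is the function below.

```python
def item_10to1_ratio(histories):
    """histories: per-user lists of 0/1 flags, one flag per order, for one item."""
    hits = total = 0
    for flags in histories:
        for i in range(len(flags) - 2):
            if flags[i] == 1 and flags[i + 1] == 0:   # ordered, then skipped
                total += 1
                hits += flags[i + 2]                  # reordered right after?
    return hits / total if total else float("nan")


user_a = [1, 0, 0, 1]   # bought itemA at order_number 1 and 4
user_b = [1, 0, 1]      # bought itemA at order_number 1 and 3
print(item_10to1_ratio([user_a, user_b]))   # 0.5
```

userA shows the 1, 0 pattern once and does not reorder in the next order; userB shows it once and does, giving 1 / 2.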
Important Findings for None - 1
• Useritem_sum_pos_cart(User A, Item B) is the average position in User A’s cart
that Item B falls into
• Useritem_sum_pos_cart-mean(User A) is the mean of the above feature across all
items
• So this feature essentially captures
the average position of an item in a user’s
cart, and we can see that users who
don’t buy many items all at once are
more likely to be None
Important Findings for None - 2
• total_buy is the total number of times a user has ordered an item
• If userA bought itemA 3 times in the past, this would be 3
• So total_buy-max is the max of the above feature per user
• We can see that it predicts
whether or not a user will make a reorder
Important Findings for None - 3
• t-1_is_None(User A) is a binary feature that says whether or not the
user’s previous order was None.
• If the previous order is None,
then the next order will also be
None with 30% probability.
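A figure like this can be estimated by counting consecutive order pairs. The sketch below uses made-up histories, so its output is not the 30% quoted above; the function name is hypothetical.

```python
def p_none_given_prev_none(none_flags_by_user):
    """none_flags_by_user: per-user lists of 0/1 flags, 1 = that order was None."""
    prev_none = both_none = 0
    for flags in none_flags_by_user:
        for prev, nxt in zip(flags, flags[1:]):
            if prev == 1:
                prev_none += 1
                both_none += nxt
    return both_none / prev_none if prev_none else float("nan")


# Hypothetical toy histories, not competition data.
histories = [[0, 1, 0, 0], [1, 1, 0], [0, 0, 1, 0]]
print(p_none_given_prev_none(histories))   # 0.25
```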
F1 maximization
• In this competition, the evaluation metric was an F1 score, which is a way of
capturing both precision and recall in a single metric.
• Thus, we needed to convert reorder probabilities into binary 1/0 (Yes/No)
numbers.
• However, in order to perform this conversion, we need to know a threshold. At
first, I used grid search to find a universal threshold of 0.2. But I saw
comments on the Kaggle discussion boards that said different orders should
have different thresholds.
• To understand why, let’s look at an example.
F1 maximization
• In the first example, the threshold is between 0.9 and 0.3
• In the second example, the threshold is lower than 0.2
• As I showed, each order should have its own threshold
• But with the above calculation, we first have to prepare all patterns of probability
• Thus I needed to come up with another calculation
• See the next page
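To make the cost of "all patterns" concrete, here is a sketch that computes the expected F1 exactly by enumerating every possible basket. It assumes the item probabilities are independent and scores an empty basket as 0 (None is not modeled here); the function names are illustrative. With n items the loop visits 2**n patterns, which is why this does not scale.

```python
from itertools import product


def f1(predicted, actual):
    tp = len(predicted & actual)
    if tp == 0:
        return 0.0
    return 2 * tp / (len(predicted) + len(actual))


def expected_f1(probs, predicted):
    """probs: item -> reorder probability (assumed independent)."""
    items = list(probs)
    total = 0.0
    for outcome in product([0, 1], repeat=len(items)):   # 2**n patterns
        weight = 1.0
        for item, bought in zip(items, outcome):
            weight *= probs[item] if bought else 1 - probs[item]
        actual = {item for item, bought in zip(items, outcome) if bought}
        total += weight * f1(set(predicted), actual)
    return total


probs = {"A": 0.9, "B": 0.3}
print(round(expected_f1(probs, ["A"]), 3))        # 0.81
print(round(expected_f1(probs, ["A", "B"]), 3))   # 0.71
```

For these two items, predicting only A has the higher expected F1, so the cut falls between 0.9 and 0.3.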
F1 maximization
• Let’s say our model predicts Item A will be reordered with probability 0.9, and Item B with probability 0.3. I then
simulate 9,999 target labels (whether A and B will be ordered or not) using these probabilities.
• For example, the simulated labels might look like this.
• I then calculate the expected F1 score for each set of labels,
starting from the highest probability items, and then adding items
(e.g., [A], then [A, B], then [A, B, C], etc) until the F1 score
peaks and then decreases.
• We don’t need to calculate all the patterns like A, B, AB…
• Because if we select itemB, we should select itemA as well
F1 maximization
• F1score_mean( , [A]) -> 0.809747641431
• F1score_mean( , [A,B]) -> 0.709004233757
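The simulation approach can be sketched as follows. This is an illustrative reconstruction, not the author's code: the helper names, the fixed seed, the independence assumption, and the rule of stopping at the first non-improving step are all choices made for this example.

```python
import random


def f1(predicted, actual):
    tp = len(predicted & actual)
    return 2 * tp / (len(predicted) + len(actual)) if tp else 0.0


def simulate_labels(probs, n_samples, rng):
    """Draw n_samples baskets; each item is included independently with its probability."""
    return [{item for item, p in probs.items() if rng.random() < p}
            for _ in range(n_samples)]


def f1score_mean(samples, predicted):
    predicted = set(predicted)
    return sum(f1(predicted, actual) for actual in samples) / len(samples)


def select_items(probs, n_samples=9999, seed=0):
    rng = random.Random(seed)
    samples = simulate_labels(probs, n_samples, rng)
    ranked = sorted(probs, key=probs.get, reverse=True)
    best, best_score = [], 0.0
    for k in range(1, len(ranked) + 1):
        score = f1score_mean(samples, ranked[:k])
        if score <= best_score:      # the score has peaked; stop adding items
            break
        best, best_score = ranked[:k], score
    return best, best_score


best, score = select_items({"A": 0.9, "B": 0.3})
print(best, round(score, 2))
```

Because candidates are added in descending probability, only n prefixes are scored instead of 2**n subsets. The simulated means should land close to the values above, up to sampling noise.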
F1 maximization - Predicting None
• One way to think about None is as the probability (1 - Item A)
* (1 - Item B) * …
• But another method is to try to predict None as a special
case.
• By using our None model and treating None as just another
item, we can boost the F1 score from 0.400 to 0.407.
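The first way of thinking about None, the product of (1 - p) over all items, is a one-liner. The sketch assumes the item probabilities are independent; the function name is made up.

```python
import math


def p_none_independent(item_probs):
    """P(no item is reordered), assuming the item predictions are independent."""
    return math.prod(1 - p for p in item_probs)


print(round(p_none_independent([0.9, 0.3]), 2))   # 0.1 * 0.7 = 0.07
```

A dedicated None model replaces this product with a learned probability, which can then be ranked alongside the item probabilities.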
Appendix
1 month to go…
7 days to go…
2 days to go…
1 hour to go…
30 minutes to go…
Did we do it?!
(We did not.)
20 minutes to go…
EOP

Kaggle meetup #3 instacart 2nd place solution
