KAGGLE AVITO DEMAND PREDICTION CHALLENGE 9TH PLACE SOLUTION
Kaggle Meetup Tokyo 5th – 2018.12.01
senkin13
About Me
¨  詹金 (せんきん)
¨  Kaggle ID: senkin13
¨  Infrastructure & DB Engineer
[Prefect World] [Square Enix]
¨  Big Data Engineer
[Square Enix] [OPT] [Line] [FastRetailing]
¨  Machine Learning Engineer
[FastRetailing]
[Figures: Background, Kaggle Name]
Agenda
¨  Avito Demand Prediction Overview
¨  Competition Pipeline
¨  Best Single Model (LightGBM)
¨  Diverse Models
¨  Ynktk’s Best NN
¨  Kohei’s Ensemble
¨  China Competitions/Kagglers
¨  Q & A
Our Team
Public LB: 8th, Private LB: 9th
Description
[Prediction]
Predict demand for an online advertisement based on its full description (title, description, images, etc.), its context (geographically where it was posted, similar ads already posted), and historical demand for similar ads in similar contexts.
Avito: Russia’s largest classified advertisements website
Evaluation
item_id          deal_probability
b912c3c6a6ad     0.12789
2dac0150717d     0.00000
ba83aefab5dc     0.43177
02996f1dd2eas    0.80323
[Target]
deal_probability is the likelihood that an ad actually sold something.
Range: 0 to 1
Data Description
¨  ID: item_id, user_id
¨  Numeric: price
¨  Category: region, city, parent_category_name, user_type, category_name, param_1, param_2, param_3, image_top_1
¨  Text: title, description
¨  Image: image
¨  Sequence: item_seq_number
¨  Date: activation_date, date_from, date_to
Tables:
Train/Test: user_id, ……, item_id, target
Active: user_id, ……, item_id (supplemental data: the train columns minus deal_probability, image, and image_top_1)
Periods: item_id, date_from, date_to
Train Test Period
Train: 2017-03-15 ~ 2017-04-05
Test: 2017-04-12 ~ 2017-04-20
Pipeline
Timeline: One Week, One Week, One Month
One Week: Description, Kernel, Discussion
One Week: Baseline Design
[Baseline]
1. Table Data Model
2. Text Data Model
(reduce wait time)
[Validation]
Kfold: 5
Feature Validation: one by one
Validate Score: 5 folds
One Month:
[Feature Engineering]
LightGBM (Table + Text + Image)
Feature Save: 1 feature per pickle file
[Validation]
Kfold: 5
Feature Validation: one by one or by group
Validate Score: 1 fold
[Parameter Tuning]
Manually
Teammates’ feature reuse
Diverse models’ OOF (out-of-fold) predictions
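The "1 feature per pickle file" idea can be sketched as follows. This is only an illustration: the helper names save_features and load_features, the directory layout, and the toy columns are assumptions, not the team's actual code.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_features(df: pd.DataFrame, columns, feature_dir: Path) -> None:
    """Write each feature column to its own pickle file."""
    feature_dir.mkdir(parents=True, exist_ok=True)
    for col in columns:
        df[col].to_pickle(feature_dir / f"{col}.pkl")


def load_features(names, feature_dir: Path) -> pd.DataFrame:
    """Rebuild a feature matrix from the selected per-feature files."""
    series = [pd.read_pickle(feature_dir / f"{name}.pkl") for name in names]
    return pd.concat(series, axis=1)


# Toy data standing in for the real feature table (hypothetical columns).
toy = pd.DataFrame(
    {"price_log": [0.0, 1.1, 2.3], "wday": [0, 3, 6], "user_count": [1, 5, 2]}
)
feature_dir = Path(tempfile.mkdtemp()) / "features"
save_features(toy, toy.columns, feature_dir)

# Pick any subset of features for an experiment without recomputing the rest.
subset = load_features(["price_log", "wday"], feature_dir)
```

Storing features separately makes it cheap to validate them one by one or by group, and to reuse a teammate's features by copying files.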
Preprocessing
¨  Tabular data
df_all['price'] = np.log1p(df_all['price'])
df_all['city'] = df_all['city'] + '_' + df_all['region']
¨  Text data
import re
def clean_text(s):
    # Lowercase, then drop 'м²' and patterns such as '2/5', '2-к', '2к'
    s = re.sub(r'м²|\d+/\d|\d+-к|\d+к', ' ', s.lower())
    # Collapse runs of whitespace into a single space
    s = re.sub(r'\s+', ' ', s)
    s = s.strip()
    return s
¨  Image data
Delete 4 empty images
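A runnable sketch of the text-cleaning step, assuming the regular expressions on the slide were meant to use \d and \s (the backslashes appear to have been lost in extraction). The sample string is made up for illustration.

```python
import re

# Patterns reconstructed from the slide; the escaping is an assumption.
UNIT_PATTERN = re.compile(r"м²|\d+/\d|\d+-к|\d+к")
SPACE_PATTERN = re.compile(r"\s+")


def clean_text(s: str) -> str:
    """Lowercase, remove unit-like tokens, and collapse whitespace."""
    s = UNIT_PATTERN.sub(" ", s.lower())
    s = SPACE_PATTERN.sub(" ", s)
    return s.strip()


sample = "Продам  2-к квартиру,\n 54 м²"  # hypothetical ad text
cleaned = clean_text(sample)
print(cleaned)
```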
Feature Engineering
¨  Date Feature
df_all['wday'] = df_all['activation_date'].dt.weekday
※ Use only date columns that exist in both train and test
¨  Extended Text Feature
df_all['param_123'] = (df_all['param_1'].fillna('') + ' ' +
                       df_all['param_2'].fillna('') + ' ' +
                       df_all['param_3'].fillna('')).astype(str)

df_all['text'] = (df_all['description'].fillna('').astype(str) + ' ' +
                  df_all['title'].fillna('').astype(str) + ' ' +
                  df_all['param_123'].fillna('').astype(str))
※ This increases the number of words available for training
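A small self-contained sketch of the date and extended-text features on toy rows. The helper join_text is hypothetical, and unlike the slide code it also collapses the extra spaces left by missing values.

```python
import pandas as pd

# Toy rows; the values are made up for illustration.
df_all = pd.DataFrame(
    {
        "activation_date": pd.to_datetime(["2017-03-15", "2017-04-12"]),
        "title": ["Диван", None],
        "description": [None, "Почти новый"],
        "param_1": ["Мебель", "Телефоны"],
        "param_2": [None, "iPhone"],
        "param_3": [None, None],
    }
)

# Day of week (Monday=0), available for both train and test rows.
df_all["wday"] = df_all["activation_date"].dt.weekday


def join_text(frame: pd.DataFrame, columns) -> pd.Series:
    """Concatenate text columns with spaces, treating missing values as empty."""
    parts = frame[columns].fillna("").astype(str)
    joined = parts.apply(" ".join, axis=1)
    # Collapse the extra spaces left behind by empty parts.
    return joined.str.split().str.join(" ")


df_all["param_123"] = join_text(df_all, ["param_1", "param_2", "param_3"])
df_all["text"] = join_text(df_all, ["description", "title", "param_123"])
```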
Aggregation Feature
¨  Unique
{'groupby': ['category_name'], 'target': 'image_top_1', 'agg': 'nunique'},
¨  Count
{'groupby': ['user_id'], 'target': 'item_id', 'agg': 'count'},
¨  Sum
{'groupby': ['parent_category_name'], 'target': 'price', 'agg': 'sum'},
¨  Mean
{'groupby': ['user_id'], 'target': 'price', 'agg': 'mean'},
¨  Median
{'groupby': ['image_top_1'], 'target': 'price', 'agg': 'median'},
¨  Max
{'groupby': ['image_top_1', 'user_id'], 'target': 'price', 'agg': 'max'},
¨  Min
{'groupby': ['user_id'], 'target': 'price', 'agg': 'min'},
※ Designing these features from a business perspective is the most efficient approach
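The slides list only the aggregation specs, not the code that consumes them. One possible way to apply such specs with pandas is sketched below; the function add_aggregations and the column naming scheme (group keys, then aggregation, then target) are assumptions chosen to match names like user_id_mean_price used elsewhere in the deck.

```python
import pandas as pd

AGGREGATIONS = [
    {"groupby": ["user_id"], "target": "item_id", "agg": "count"},
    {"groupby": ["user_id"], "target": "price", "agg": "mean"},
    {"groupby": ["image_top_1", "user_id"], "target": "price", "agg": "max"},
]


def add_aggregations(df: pd.DataFrame, specs) -> pd.DataFrame:
    """Add one column per spec, broadcasting the group statistic to every row."""
    out = df.copy()
    for spec in specs:
        name = "_".join(spec["groupby"] + [spec["agg"], spec["target"]])
        grouped = out.groupby(spec["groupby"])[spec["target"]]
        out[name] = grouped.transform(spec["agg"])
    return out


# Toy data for illustration only.
toy = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u2"],
        "item_id": ["a", "b", "c"],
        "image_top_1": [1, 1, 2],
        "price": [10.0, 30.0, 5.0],
    }
)
features = add_aggregations(toy, AGGREGATIONS)
```

Using transform keeps the row count unchanged, so no merge back onto the main table is needed.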
Interaction Feature
¨  Difference between two features
df_all['image_top_1_diff_price'] = df_all['price'] - df_all['image_top_1_mean_price']
df_all['category_name_diff_price'] = df_all['price'] - df_all['category_name_mean_price']
df_all['param_1_diff_price'] = df_all['price'] - df_all['param_1_mean_price']
df_all['param_2_diff_price'] = df_all['price'] - df_all['param_2_mean_price']
df_all['user_id_diff_price'] = df_all['price'] - df_all['user_id_mean_price']
df_all['region_diff_price'] = df_all['price'] - df_all['region_mean_price']
df_all['city_diff_price'] = df_all['price'] - df_all['city_mean_price']
※ Arithmetic (add, subtract, multiply, divide) features that make business sense are strong
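The repeated "price minus group mean price" lines can be generated in a loop. The sketch below is one way to do that; the helper add_price_diffs is hypothetical and computes the group mean on the fly instead of relying on precomputed *_mean_price columns.

```python
import pandas as pd


def add_price_diffs(df: pd.DataFrame, keys, price_col: str = "price") -> pd.DataFrame:
    """For each key, add price minus the mean price of rows sharing that key."""
    out = df.copy()
    for key in keys:
        group_mean = out.groupby(key)[price_col].transform("mean")
        out[f"{key}_diff_{price_col}"] = out[price_col] - group_mean
    return out


# Toy data for illustration only.
toy = pd.DataFrame(
    {
        "category_name": ["phones", "phones", "sofas"],
        "region": ["A", "B", "B"],
        "price": [100.0, 300.0, 50.0],
    }
)
diffs = add_price_diffs(toy, ["category_name", "region"])
```

A positive value means the ad is priced above what is typical for its group, which is the kind of business signal the slide points to.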
Supplemental Data Feature
¨  Calculate each item’s days up
all_periods['days_up'] = (all_periods['date_to'].dt.dayofyear -
                          all_periods['date_from'].dt.dayofyear)
¨  Count and sum of each item’s days up
{'groupby': ['item_id'], 'target': 'days_up', 'agg': 'count'},
{'groupby': ['item_id'], 'target': 'days_up', 'agg': 'sum'},
¨  Merge to main table
df_all = df_all.merge(all_periods, on='item_id', how='left')
※ It is important to dig deep into the business-relevant parts of the supplemental data
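A minimal end-to-end sketch of the days-up features on toy data. Two choices here are assumptions rather than the slide's code: the duration is computed as a date difference (which also works across a year boundary), and the periods are aggregated to one row per item before merging so the main table keeps its row count.

```python
import pandas as pd

# Toy periods table; one item can have several display periods.
all_periods = pd.DataFrame(
    {
        "item_id": ["a", "a", "b"],
        "date_from": pd.to_datetime(["2017-03-15", "2017-03-25", "2017-03-20"]),
        "date_to": pd.to_datetime(["2017-03-18", "2017-03-30", "2017-03-21"]),
    }
)
all_periods["days_up"] = (all_periods["date_to"] - all_periods["date_from"]).dt.days

# One row per item: number of periods and total days up.
per_item = all_periods.groupby("item_id")["days_up"].agg(["count", "sum"]).reset_index()
per_item.columns = ["item_id", "item_id_count_days_up", "item_id_sum_days_up"]

# Left merge keeps items that never appear in the periods table (NaN features).
df_all = pd.DataFrame({"item_id": ["a", "b", "c"]})
df_all = df_all.merge(per_item, on="item_id", how="left")
```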
Impute Null Values
¨  Fillna with 0
df_all['price'] = df_all['price'].fillna(0)
¨  Fillna with median
enc = df_all.groupby('category_name')['item_id_count_days_up'].agg('median').reset_index()
enc.columns = ['category_name', 'count_days_up_impute']

df_all = pd.merge(df_all, enc, how='left', on='category_name')
# Assumption: the imputed column starts as a copy of item_id_count_days_up
df_all['item_id_count_days_up_impute'] = df_all['item_id_count_days_up'].fillna(df_all['count_days_up_impute'])
¨  Fillna with model prediction value
RNN(text) -> image_top_1 (renamed image_top_2)

※ The magic feature we did not find: df['price'] - (price predicted from text by an RNN)
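The group-median imputation can also be written without the intermediate merge. This is a sketch on toy data, assuming the imputed column is meant to be the original column with gaps filled by the category median.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
df_all = pd.DataFrame(
    {
        "category_name": ["phones", "phones", "phones", "sofas", "sofas"],
        "item_id_count_days_up": [2.0, np.nan, 4.0, np.nan, 1.0],
    }
)

# Median of the non-missing values within each category, aligned to every row.
group_median = df_all.groupby("category_name")["item_id_count_days_up"].transform("median")

# Keep observed values; fill only the gaps with the category median.
df_all["item_id_count_days_up_impute"] = df_all["item_id_count_days_up"].fillna(group_median)
```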
Text Feature
¨  TF-IDF for text, title, param_123
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Shared settings, defined before they are used below
tfidf_para = {
    "stop_words": russian_stop,
    "analyzer": 'word',
    "token_pattern": r'\w{1,}',
    "lowercase": True,
    "sublinear_tf": True,
    "dtype": np.float32,
    "norm": 'l2',
    "smooth_idf": False
}
vectorizer = FeatureUnion([
    ('text', TfidfVectorizer(
        ngram_range=(1, 2),
        max_features=200000,
        **tfidf_para)),
    ('title', TfidfVectorizer(
        ngram_range=(1, 2),
        stop_words=russian_stop)),
    ('param_123', TfidfVectorizer(
        ngram_range=(1, 2),
        stop_words=russian_stop))
])
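The slide does not show how each vectorizer is pointed at its own column; a FeatureUnion on its own passes the same input to every transformer. The sketch below swaps in scikit-learn's ColumnTransformer to do the routing, which is an alternative technique and not necessarily what the team used. The stop-word list and the toy rows are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

russian_stop = ["и", "в", "на"]  # placeholder list, not the real stop words

tfidf_para = {
    "stop_words": russian_stop,
    "analyzer": "word",
    "token_pattern": r"\w{1,}",
    "lowercase": True,
    "sublinear_tf": True,
    "dtype": np.float32,
    "norm": "l2",
    "smooth_idf": False,
}

# A plain string column selector hands a 1-D column to each text vectorizer.
vectorizer = ColumnTransformer(
    [
        ("text", TfidfVectorizer(ngram_range=(1, 2), max_features=200000, **tfidf_para), "text"),
        ("title", TfidfVectorizer(ngram_range=(1, 2), stop_words=russian_stop), "title"),
        ("param_123", TfidfVectorizer(ngram_range=(1, 2), stop_words=russian_stop), "param_123"),
    ]
)

# Toy rows for illustration only.
df_all = pd.DataFrame(
    {
        "text": ["продам диван в хорошем состоянии", "новый телефон", "детская коляска"],
        "title": ["диван угловой", "телефон новый", "коляска детская"],
        "param_123": ["мебель диваны", "телефоны смартфоны", "детские коляски"],
    }
)
features = vectorizer.fit_transform(df_all)
```

The resulting matrix is the horizontal concatenation of the three TF-IDF blocks, one row per ad.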
Text Feature
¨  SVD for Title
tfidf_vec = TfidfVectorizer(ngram_range=(1,1))	
svd_title_obj = TruncatedSVD(n_components=40, algorithm='arpack')	
	
svd_title_obj.fit(full_title_tfidf)	
	
train_title_svd = pd.DataFrame(svd_title_obj.transform(train_title_tfidf))	
test_title_svd = pd.DataFrame(svd_title_obj.transform(test_title_tfidf))
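The snippet above omits how the TF-IDF matrices are produced. A complete toy version might look like this; the titles and the small n_components are illustrative assumptions (the slide uses 40 components on the real data), and the svd_title_ column prefix is hypothetical.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy titles for illustration only.
train_titles = ["red sofa", "blue sofa", "iphone 6 black", "iphone 7 red"]
test_titles = ["green chair", "black chair"]

# Fit the vocabulary on train and test titles together, as on the slide.
tfidf_vec = TfidfVectorizer(ngram_range=(1, 1))
full_title_tfidf = tfidf_vec.fit_transform(train_titles + test_titles)
train_title_tfidf = tfidf_vec.transform(train_titles)
test_title_tfidf = tfidf_vec.transform(test_titles)

# 'arpack' needs n_components to be smaller than the number of features.
n_components = 2
svd_title_obj = TruncatedSVD(n_components=n_components, algorithm="arpack")
svd_title_obj.fit(full_title_tfidf)

columns = [f"svd_title_{i}" for i in range(n_components)]
train_title_svd = pd.DataFrame(svd_title_obj.transform(train_title_tfidf), columns=columns)
test_title_svd = pd.DataFrame(svd_title_obj.transform(test_title_tfidf), columns=columns)
```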
Text Feature
¨  Count Unique Feature
for cols in ['text', 'title', 'param_123']:
    df_all[cols + '_num_cap'] = df_all[cols].apply(lambda x: count_regexp_occ('[А-ЯA-Z]', x))
    df_all[cols + '_num_low'] = df_all[cols].apply(lambda x: count_regexp_occ('[а-яa-z]', x))
    df_all[cols + '_num_rus_cap'] = df_all[cols].apply(lambda x: count_regexp_occ('[А-Я]', x))
    df_all[cols + '_num_eng_cap'] = df_all[cols].apply(lambda x: count_regexp_occ('[A-Z]', x))
    df_all[cols + '_num_rus_low'] = df_all[cols].apply(lambda x: count_regexp_occ('[а-я]', x))
    df_all[cols + '_num_eng_low'] = df_all[cols].apply(lambda x: count_regexp_occ('[a-z]', x))
    df_all[cols + '_num_dig'] = df_all[cols].apply(lambda x: count_regexp_occ('[0-9]', x))
    df_all[cols + '_num_pun'] = df_all[cols].apply(lambda x: sum(c in punct for c in x))
    df_all[cols + '_num_space'] = df_all[cols].apply(lambda x: sum(c.isspace() for c in x))
    df_all[cols + '_num_emo'] = df_all[cols].apply(lambda x: sum(c in emoji for c in x))
    df_all[cols + '_num_row'] = df_all[cols].apply(lambda x: x.count('\n'))  # number of line breaks
    df_all[cols + '_num_chars'] = df_all[cols].apply(len)  # number of characters
    df_all[cols + '_num_words'] = df_all[cols].apply(lambda comment: len(comment.split()))
    df_all[cols + '_num_unique_words'] = df_all[cols].apply(lambda comment: len(set(comment.split())))
    df_all[cols + '_ratio_unique_words'] = df_all[cols + '_num_unique_words'] / (df_all[cols + '_num_words'] + 1)
    df_all[cols + '_num_stopwords'] = df_all[cols].apply(lambda x: len([w for w in x.split() if w in stopwords]))
    df_all[cols + '_num_words_upper'] = df_all[cols].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
    df_all[cols + '_num_words_lower'] = df_all[cols].apply(lambda x: len([w for w in str(x).split() if w.islower()]))
    df_all[cols + '_num_words_title'] = df_all[cols].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))
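The helper count_regexp_occ and the punct set are used above but not defined on the slide. A plausible minimal version is sketched here; both definitions are assumptions (string.punctuation stands in for punct), and only a few of the counts are reproduced.

```python
import re
import string


def count_regexp_occ(regexp: str, text: str) -> int:
    """Count non-overlapping matches of regexp in text."""
    return len(re.findall(regexp, text))


def text_stats(text: str) -> dict:
    """Compute a handful of the count features for one string."""
    words = text.split()
    return {
        "num_cap": count_regexp_occ(r"[А-ЯA-Z]", text),
        "num_dig": count_regexp_occ(r"[0-9]", text),
        "num_pun": sum(c in string.punctuation for c in text),
        "num_row": text.count("\n"),
        "num_words": len(words),
        "num_unique_words": len(set(words)),
        "ratio_unique_words": len(set(words)) / (len(words) + 1),
    }


# Hypothetical ad text for illustration.
stats = text_stats("Продам iPhone 6S!\nЦена 100, торг торг")
```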
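Text Feature
¨  Ynktk’s WordEmbedding
from gensim.models import FastText, Word2Vec
from gensim.models.word2vec import PathLineSentences
# gensim 3.x API as shown on the slide; gensim 4.0+ renamed `size` to `vector_size`
    ¨  Self-trained FastText
model = FastText(PathLineSentences(train+test+train_active+test_active),
                 size=300, window=5, min_count=5, word_ngrams=1, seed=seed, workers=32)
    ¨  Self-trained Word2Vec
model = Word2Vec(PathLineSentences(train+test+train_active+test_active),
                 size=300, window=5, min_count=5, seed=seed, workers=32)
※ Embeddings trained on the provided text were more effective than embeddings pre-trained on Wikipedia and similar corpora, probably because proper nouns such as product names were informative for the target.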
Image Feature
¨  Meta Feature
    ¨  Image_size, Height, Width, Average_pixel_width, Average_blue, Average_red, Average_green, Blurrness, Whiteness, Dullness
    ¨  Dullness - Whiteness (interaction feature)
¨  Pre-trained Prediction Feature
    ¨  VGG16 prediction value
    ¨  ResNet50 prediction value
¨  Ynktk’s Feature
    ¨  Top finishers extracted image features with VGG and similar networks, but hand-crafted features were also effective
    ¨  NIMA [1]
    ¨  Brightness, Saturation, Contrast, Colorfulness, Dullness, Blurriness, Interest Points, Saliency Map, Human Faces, etc. [2]
[1] Talebi, H., & Milanfar, P. (2018). NIMA: Neural Image Assessment
[2] Cheng, H. et al. (2012). Multimedia Features for Click Prediction of New Ads in Display Advertising
Parameter Tuning
¨  Chosen manually, using multiple servers
params = {
    'boosting_type': 'gbdt',
    'objective': 'xentropy',  # the target behaves like a binary classification probability
    'metric': 'rmse',
    'learning_rate': 0.02,
    'num_leaves': 600,
    'max_depth': -1,
    'max_bin': 256,
    'bagging_fraction': 1,
    'feature_fraction': 0.1,  # sparse text vectors
    'verbose': 1
}
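The choice of 'xentropy' as objective with 'rmse' as metric can be illustrated with NumPy alone. The sketch below is not LightGBM code: it only shows that cross-entropy is well defined for targets anywhere in [0, 1], that it is smallest when the prediction equals the target, and that a sigmoid output always stays inside the target range. The raw scores are made-up numbers.

```python
import numpy as np


def cross_entropy(y_true: np.ndarray, y_prob: np.ndarray, eps: float = 1e-15) -> float:
    """Mean cross-entropy for targets that are probabilities in [0, 1]."""
    p = np.clip(y_prob, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))


def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared error, used here as the evaluation metric."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def sigmoid(raw: np.ndarray) -> np.ndarray:
    """Map raw model scores to the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-raw))


deal_probability = np.array([0.0, 0.12789, 0.43177, 0.80323])
raw_scores = np.array([-3.0, -2.0, 0.0, 1.5])  # hypothetical raw model outputs
pred = sigmoid(raw_scores)

loss_at_pred = cross_entropy(deal_probability, pred)
loss_at_target = cross_entropy(deal_probability, deal_probability)
score = rmse(deal_probability, pred)
```

Because predictions pass through a sigmoid, no clipping to [0, 1] is needed before scoring, which is one practical reason to prefer this objective over plain regression for a probability target.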
Submission Analysis
[Figure: single LightGBM submission file vs. stacking submission file]
1.  Bug Check
2.  Diverse Model Comparison
3.  Prediction Value Trend
Best LightGBM Summary
¨  Table Feature Number: ~250
¨  Text Feature Number: 1,500,000+
¨  Image Feature Number: 50+
¨  Total Feature Number: 1,503,424
¨  Public LB: better than 0.2174
¨  Private LB: better than 0.2210
Diversity
Type | Loss | Data Set | Feature Set | Parameter | NN Structure
LightGBM | xentropy, regression, huber, fair, auc | With/Without Active data | Table; Table + Text; Table + Text + Image; Table + Text + Image + Ridge_meta | Learning_rate, Num_leaves | -
XGBoost | reg:linear, binary:logistic | With/Without Active data | Table + Text; Table + Text + Image | - | -
CatBoost | binary_crossentropy | With Active data | Table + Image | - | -
Random Forest | regression | With Active data | Table + Text + Image | - | -
Ridge Regression | regression | Without Active data | Text; Table + Text + Image | Tfidf max_features | -
Neural network | regression, binary_crossentropy | With Active data | Table + Text + Image + wordembedding | Layer size, Dropout, BatchNorm, Pooling | rnn-dnn, rnn-cnn-dnn, rnn-attention-dnn
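Feature sets such as "Ridge_meta" and the reuse of diverse models' OOF predictions rely on out-of-fold prediction. A generic sketch follows, assuming Ridge_meta means Ridge predictions fed to another model as a feature; the function ridge_oof and the synthetic data are illustrative, not the team's code.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold


def ridge_oof(X_train, y_train, X_test, n_splits=5, alpha=1.0, seed=0):
    """Return out-of-fold predictions for train and fold-averaged predictions for test."""
    oof = np.zeros(len(X_train))
    test_pred = np.zeros(len(X_test))
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, val_idx in folds.split(X_train):
        model = Ridge(alpha=alpha)
        model.fit(X_train[fit_idx], y_train[fit_idx])
        # Each training row is predicted by a model that never saw it.
        oof[val_idx] = model.predict(X_train[val_idx])
        test_pred += model.predict(X_test) / n_splits
    return oof, test_pred


# Synthetic data standing in for real features and deal_probability.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
noise = rng.normal(scale=0.05, size=100)
y_train = np.clip(0.3 + 0.1 * X_train[:, 0] + noise, 0, 1)
X_test = rng.normal(size=(20, 5))

ridge_meta_train, ridge_meta_test = ridge_oof(X_train, y_train, X_test)
```

The two returned arrays can be appended as a new column to the train and test feature tables of a second-stage model without leaking the target.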
Ynktk’s Best NN
[Figure: network diagram. Inputs: Numerical, Categorical, Image, Text. Layers shown: Embedding, Dense, SpatialDropout, LSTM, GRU, Conv1D, LeakyReLU, GAP (global average pooling), GMP (global max pooling), Concat, BatchNorm, Dropout]
Callbacks: EarlyStopping, ReduceLROnPlateau
Optimizer: Adam with clipvalue 0.5, learning rate 2e-03
Loss: binary cross entropy
Private LB: 0.2225
Public LB: 0.2181
China Competitions & Platform
Item | Kaggle | China Comp
Platform | Kaggle | Tianchi, Tencent, Jdata, Kesci, DataCastle, Biendata, DataFountain …… [1]
Round | Round 1: 2~3 months, Public/Private LB | Round 1: 1.5 months Public, 3 days Private; Round 2: 2 weeks Public, 3 days Private; Round 3: Presentation
Sub/day | 5 | Public: 3, Private: 1
Prize | Top 3 | Top 5/10/50
[1] https://github.com/iphysresearch/DataSciComp
Knowledge Sharing
https://github.com/Smilexuhc/Data-Competition-TopSolution/blob/master/README.md
Learn From GrandMasters
1.  EDA in Excel
2.  Join every competition
3.  Reuse pipelines & features
4.  Strict time management
5.  Use knowledge from different areas
6.  Family support
Thank You!
Q & A

Kaggle Avito Demand Prediction Challenge 9th Place Solution

  • 1. KAGGLE AVITO DEMAND PREDICTION CHALLENGE 9TH SOLUTION. Kaggle Meetup Tokyo 5th, 2018.12.01. senkin13
  • 2. About Me
      - 詹金 (せんきん, Senkin)
      - Kaggle ID: senkin13
      - Infrastructure & DB Engineer: [Prefect World] [Square Enix]
      - Big Data Engineer: [Square Enix] [OPT] [Line] [FastRetailing]
      - Machine Learning Engineer: [FastRetailing]
  • 3. Agenda
      - Avito-demand-prediction Overview
      - Competition Pipeline
      - Best Single Model (LightGBM)
      - Diverse Models
      - Ynktk’s Best NN
      - Kohei’s Ensemble
      - China Competitions/Kagglers
      - Q & A
  • 6. Description
      Avito: Russia’s largest classified advertisements website.
      [Prediction] Predict demand for an online advertisement based on its full description (title, description, images, etc.), its context (geographically where it was posted, similar ads already posted) and historical demand for similar ads in similar contexts.
  • 7. Evaluation
      Item_id          Deal_probability
      b912c3c6a6ad     0.12789
      2dac0150717d     0.00000
      ba83aefab5dc     0.43177
      02996f1dd2eas    0.80323
      [Target] The likelihood that an ad actually sold something. Range: 0 to 1.
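The target is a probability-like value in [0, 1], and the parameter slide later in the deck lists `'metric': 'rmse'`. A minimal sketch of scoring predictions that way follows; the `rmse` helper and the clipping step are illustrative assumptions, not code from the slides.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error after clipping predictions to the valid target range."""
    y_true = np.asarray(y_true, dtype=float)
    # Assumption: predictions outside [0, 1] are clipped, since the target cannot leave that range.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0.0, 1.0)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# The out-of-range 1.2 is clipped to 1.0 before scoring.
score = rmse([0.0, 0.5, 1.0], [0.0, 0.5, 1.2])
```

Clipping can only help the score here, because any prediction outside [0, 1] is farther from every possible target than the nearest boundary.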
  • 8. Data Description
      - ID: item_id, user_id
      - Numeric: price
      - Category: region, city, parent_category_name, category_name, user_type, param_1, param_2, param_3, image_top_1
      - Text: title, description
      - Image: image
      - Sequence: item_seq_number
      - Date: activation_date, date_from, date_to
      Tables:
      - Train/Test: User_id, ……, Item_id, Target
      - Active: User_id, ……, Item_id (supplemental data of train minus deal_probability, image, and image_top_1)
      - Periods: Item_id, Date_from, Date_to
  • 9. Train Test Period Train: 2017-03-15 ~ 2017-04-05 Test: 2017-04-12 ~ 2017-04-20
  • 10. Pipeline
      Timeline: One Week, One Week, One Month
      - Description, Kernel, Discussion
      - Baseline Design
          [Baseline] 1. Table Data Model 2. Text Data Model (reduce wait time)
          [Validation] KFold: 5; Feature Validation: one by one; Validate Score: 5 folds
      - [Feature Engineering] LightGBM (Table + Text + Image); Feature Save: 1 feature per pickle file
          [Validation] KFold: 5; Feature Validation: one by one or by group; Validate Score: 1 fold
          [Parameter Tuning] Manually
      - Teammates’ feature reuse; diverse models’ OOF predictions
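The "1 feature 1 pickle file" idea can be sketched as follows. This is only an illustration under assumptions: the helper names `save_features` and `load_features`, the file naming scheme, and the toy columns are hypothetical and do not come from the slides.

```python
import os
import tempfile

import pandas as pd


def save_features(df, columns, feature_dir):
    """Write each feature column to its own pickle file, named after the column."""
    os.makedirs(feature_dir, exist_ok=True)
    for col in columns:
        df[col].to_pickle(os.path.join(feature_dir, f"{col}.pkl"))


def load_features(columns, feature_dir):
    """Rebuild a frame from any subset of previously saved features."""
    series = [pd.read_pickle(os.path.join(feature_dir, f"{col}.pkl")) for col in columns]
    return pd.concat(series, axis=1)


feature_dir = tempfile.mkdtemp()  # hypothetical location for the feature store
df_demo = pd.DataFrame({"price_log": [1.0, 2.5, 0.0], "wday": [2, 5, 6]})
save_features(df_demo, ["price_log", "wday"], feature_dir)

# Load only the features needed for one experiment.
subset = load_features(["wday"], feature_dir)
```

Storing features separately makes it cheap to validate them one by one or by group, and lets teammates reuse a feature without rerunning the code that produced it.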
  • 11. Preprocessing
      - Tabular data
          df_all['price'] = np.log1p(df_all['price'])
          df_all['city'] = df_all['city'] + '_' + df_all['region']
      - Text data
          def clean_text(s):
              s = re.sub(r'м²|\d+/\d|\d+-к|\d+к', ' ', s.lower())
              s = re.sub(r'\s+', ' ', s)
              s = s.strip()
              return s
      - Image data: delete 4 empty images
  • 12. Feature Engineering
      - Date Feature
          df_all['wday'] = df_all['activation_date'].dt.weekday
          Note: use date columns that exist in both train and test.
      - Extended Text Feature
          df_all['param_123'] = (df_all['param_1'].fillna('') + ' ' + df_all['param_2'].fillna('') + ' ' + df_all['param_3'].fillna('')).astype(str)
          df_all['text'] = df_all['description'].fillna('').astype(str) + ' ' + df_all['title'].fillna('').astype(str) + ' ' + df_all['param_123'].fillna('').astype(str)
          Note: this increases the number of words available for training.
  • 13. Aggregation Feature
      - Unique: {'groupby': ['category_name'], 'target': 'image_top_1', 'agg': 'nunique'},
      - Count: {'groupby': ['user_id'], 'target': 'item_id', 'agg': 'count'},
      - Sum: {'groupby': ['parent_category_name'], 'target': 'price', 'agg': 'sum'},
      - Mean: {'groupby': ['user_id'], 'target': 'price', 'agg': 'mean'},
      - Median: {'groupby': ['image_top_1'], 'target': 'price', 'agg': 'median'},
      - Max: {'groupby': ['image_top_1', 'user_id'], 'target': 'price', 'agg': 'max'},
      - Min: {'groupby': ['user_id'], 'target': 'price', 'agg': 'min'},
      Note: designing aggregations from a business perspective is the most efficient approach.
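The slide lists aggregation specs as dictionaries but not the code that consumes them. One way to apply such specs with pandas is sketched below; the function `add_agg_features` and the column naming scheme (`<groupby>_<agg>_<target>`, chosen to match names such as `user_id_mean_price` used elsewhere in the deck) are assumptions.

```python
import pandas as pd


def add_agg_features(df, specs):
    """Add one column per spec, broadcasting each group statistic back to the rows."""
    out = df.copy()
    for spec in specs:
        keys, target, agg = spec["groupby"], spec["target"], spec["agg"]
        # Assumed naming scheme: e.g. user_id_mean_price
        name = "_".join(keys) + f"_{agg}_{target}"
        out[name] = out.groupby(keys)[target].transform(agg)
    return out


df_demo = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "item_id": ["a", "b", "c"],
    "price": [10.0, 30.0, 5.0],
})
specs = [
    {"groupby": ["user_id"], "target": "item_id", "agg": "count"},
    {"groupby": ["user_id"], "target": "price", "agg": "mean"},
]
df_feat = add_agg_features(df_demo, specs)

# An interaction feature in the style of the deck: price relative to the user's mean price.
df_feat["user_id_diff_price"] = df_feat["price"] - df_feat["user_id_mean_price"]
```

Using `transform` keeps the result aligned with the original rows, so no merge back onto the main table is needed.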
  • 14. Interaction Feature
      - Difference between two features
          df_all['image_top_1_diff_price'] = df_all['price'] - df_all['image_top_1_mean_price']
          df_all['category_name_diff_price'] = df_all['price'] - df_all['category_name_mean_price']
          df_all['param_1_diff_price'] = df_all['price'] - df_all['param_1_mean_price']
          df_all['param_2_diff_price'] = df_all['price'] - df_all['param_2_mean_price']
          df_all['user_id_diff_price'] = df_all['price'] - df_all['user_id_mean_price']
          df_all['region_diff_price'] = df_all['price'] - df_all['region_mean_price']
          df_all['city_diff_price'] = df_all['price'] - df_all['city_mean_price']
      Note: arithmetic features (add, subtract, multiply, divide) that make business sense are strong.
  • 15. Supplemental Data Feature
      - Calculate the number of days each item was up
          all_periods['days_up'] = all_periods['date_to'].dt.dayofyear - all_periods['date_from'].dt.dayofyear
      - Count and sum of each item’s up days
          {'groupby': ['item_id'], 'target': 'days_up', 'agg': 'count'},
          {'groupby': ['item_id'], 'target': 'days_up', 'agg': 'sum'},
      - Merge to main table
          df_all = df_all.merge(all_periods, on='item_id', how='left')
      Note: it is important to dig deep into the business-relevant parts of the supplemental data.
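A small sketch of the up-days feature on toy data follows. It deviates from the slide in two labeled ways: it subtracts the dates directly instead of using `dayofyear` (an assumption made so that periods crossing a year boundary stay positive), and it merges the per-item aggregate rather than the raw periods table so the main table keeps one row per item. The toy frames are made up.

```python
import pandas as pd

# Toy stand-in for the periods table (item_id, date_from, date_to).
periods = pd.DataFrame({
    "item_id": ["a", "a", "b"],
    "date_from": pd.to_datetime(["2017-03-15", "2017-03-20", "2016-12-30"]),
    "date_to": pd.to_datetime(["2017-03-18", "2017-03-21", "2017-01-02"]),
})

# Timedelta-based day count; unlike a dayofyear difference it handles the
# 2016-12-30 to 2017-01-02 period correctly.
periods["days_up"] = (periods["date_to"] - periods["date_from"]).dt.days

# Count and sum of up days per item, named like the deck's other aggregates.
per_item = (
    periods.groupby("item_id")["days_up"]
    .agg(["count", "sum"])
    .add_prefix("item_id_")
    .add_suffix("_days_up")
    .reset_index()
)

# Left merge keeps items without any period rows; their features become NaN.
df_main = pd.DataFrame({"item_id": ["a", "b", "c"]})
df_main = df_main.merge(per_item, on="item_id", how="left")
```

Items missing from the periods table end up with nulls, which is what the imputation step on the next slide addresses.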
  • 16. Impute Null Values
      - Fillna with 0
          df_all['price'] = df_all['price'].fillna(0)
      - Fillna with median
          enc = df_all.groupby('category_name')['item_id_count_days_up'].agg('median').reset_index()
          enc.columns = ['category_name', 'count_days_up_impute']
          df_all = pd.merge(df_all, enc, how='left', on='category_name')
          df_all['item_id_count_days_up_impute'] = df_all['item_id_count_days_up_impute'].fillna(df_all['count_days_up_impute'])
      - Fillna with model prediction value
          Rnn(text) -> image_top_1 (renamed to image_top_2)
      Note: the magic feature we did not find: df['price'] - df[Rnn(text) -> price], i.e. the actual price minus the price predicted from text by an RNN.
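The group-median imputation can also be written without the intermediate merge. The sketch below uses `groupby(...).transform('median')` on toy data; the toy values are made up, and writing the result to a new `_impute` column is an assumption about how the original column was kept.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "category_name": ["cars", "cars", "cars", "phones", "phones"],
    "item_id_count_days_up": [2.0, np.nan, 4.0, 1.0, np.nan],
})

# Median of the observed values within each category, aligned to every row.
group_median = df_demo.groupby("category_name")["item_id_count_days_up"].transform("median")

# Fill only the missing entries; observed values are left untouched.
df_demo["item_id_count_days_up_impute"] = df_demo["item_id_count_days_up"].fillna(group_median)
```

Keeping the original column alongside the imputed one lets a tree model still see which rows were missing.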
  • 17. Text Feature
      - TF-IDF for text, title, param_123
          tfidf_para = {
              "stop_words": russian_stop,
              "analyzer": 'word',
              "token_pattern": r'\w{1,}',
              "lowercase": True,
              "sublinear_tf": True,
              "dtype": np.float32,
              "norm": 'l2',
              "smooth_idf": False
          }
          vectorizer = FeatureUnion([
              ('text', TfidfVectorizer(ngram_range=(1, 2), max_features=200000, **tfidf_para)),
              ('title', TfidfVectorizer(ngram_range=(1, 2), stop_words=russian_stop)),
              ('param_123', TfidfVectorizer(ngram_range=(1, 2), stop_words=russian_stop))
          ])
  • 18. Text Feature
      - SVD for Title
          tfidf_vec = TfidfVectorizer(ngram_range=(1, 1))
          svd_title_obj = TruncatedSVD(n_components=40, algorithm='arpack')
          svd_title_obj.fit(full_title_tfidf)
          train_title_svd = pd.DataFrame(svd_title_obj.transform(train_title_tfidf))
          test_title_svd = pd.DataFrame(svd_title_obj.transform(test_title_tfidf))
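The slide omits how `full_title_tfidf`, `train_title_tfidf` and `test_title_tfidf` are produced. A minimal end-to-end sketch on toy titles follows; the titles, the `n_components=2` setting (the slide uses 40) and the `title_svd_` column prefix are assumptions for illustration.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy titles standing in for the train and test title columns.
train_titles = ["red bike for sale", "blue bike almost new", "phone case red"]
test_titles = ["new phone for sale", "kids bike blue"]

# Fit the vocabulary on train + test so both share one feature space.
tfidf_vec = TfidfVectorizer(ngram_range=(1, 1))
full_title_tfidf = tfidf_vec.fit_transform(train_titles + test_titles)
train_title_tfidf = tfidf_vec.transform(train_titles)
test_title_tfidf = tfidf_vec.transform(test_titles)

# arpack needs n_components smaller than both matrix dimensions,
# hence 2 components for this 5-row toy matrix.
svd_title_obj = TruncatedSVD(n_components=2, algorithm="arpack", random_state=0)
svd_title_obj.fit(full_title_tfidf)

train_title_svd = pd.DataFrame(svd_title_obj.transform(train_title_tfidf)).add_prefix("title_svd_")
test_title_svd = pd.DataFrame(svd_title_obj.transform(test_title_tfidf)).add_prefix("title_svd_")
```

The dense SVD columns give tree models a compact view of the title text alongside the sparse TF-IDF matrix.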
  • 19. Text Feature
      - Count and unique-word features
          for cols in ['text', 'title', 'param_123']:
              df_all[cols + '_num_cap'] = df_all[cols].apply(lambda x: count_regexp_occ('[А-ЯA-Z]', x))
              df_all[cols + '_num_low'] = df_all[cols].apply(lambda x: count_regexp_occ('[а-яa-z]', x))
              df_all[cols + '_num_rus_cap'] = df_all[cols].apply(lambda x: count_regexp_occ('[А-Я]', x))
              df_all[cols + '_num_eng_cap'] = df_all[cols].apply(lambda x: count_regexp_occ('[A-Z]', x))
              df_all[cols + '_num_rus_low'] = df_all[cols].apply(lambda x: count_regexp_occ('[а-я]', x))
              df_all[cols + '_num_eng_low'] = df_all[cols].apply(lambda x: count_regexp_occ('[a-z]', x))
              df_all[cols + '_num_dig'] = df_all[cols].apply(lambda x: count_regexp_occ('[0-9]', x))
              df_all[cols + '_num_pun'] = df_all[cols].apply(lambda x: sum(c in punct for c in x))
              df_all[cols + '_num_space'] = df_all[cols].apply(lambda x: sum(c.isspace() for c in x))
              df_all[cols + '_num_emo'] = df_all[cols].apply(lambda x: sum(c in emoji for c in x))
              df_all[cols + '_num_row'] = df_all[cols].apply(lambda x: x.count('\n'))
              df_all[cols + '_num_chars'] = df_all[cols].apply(len)  # Count number of characters
              df_all[cols + '_num_words'] = df_all[cols].apply(lambda comment: len(comment.split()))
              df_all[cols + '_num_unique_words'] = df_all[cols].apply(lambda comment: len(set(w for w in comment.split())))
              df_all[cols + '_ratio_unique_words'] = df_all[cols + '_num_unique_words'] / (df_all[cols + '_num_words'] + 1)
              df_all[cols + '_num_stopwords'] = df_all[cols].apply(lambda x: len([w for w in x.split() if w in stopwords]))
              df_all[cols + '_num_words_upper'] = df_all[cols].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
              df_all[cols + '_num_words_lower'] = df_all[cols].apply(lambda x: len([w for w in str(x).split() if w.islower()]))
              df_all[cols + '_num_words_title'] = df_all[cols].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))
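The slide calls `count_regexp_occ` and uses `punct` without defining them. A plausible definition, assumed here rather than taken from the deck, is a thin wrapper around `re.findall`, with `punct` built from `string.punctuation`.

```python
import re
import string


def count_regexp_occ(regexp, text):
    """Number of non-overlapping matches of regexp in text."""
    return len(re.findall(regexp, text))


# Assumed definition: ASCII punctuation characters.
punct = set(string.punctuation)

sample = "Продам iPhone 7, 32GB!\nТорг."
n_rus_cap = count_regexp_occ("[А-Я]", sample)  # П, Т
n_eng_cap = count_regexp_occ("[A-Z]", sample)  # P, G, B
n_dig = count_regexp_occ("[0-9]", sample)      # 7, 3, 2
n_pun = sum(c in punct for c in sample)        # comma, exclamation mark, full stop
n_row = sample.count("\n")                     # one line break
```

Note that the range `[А-Я]` does not include the letter Ё, so a character class such as `[А-ЯЁ]` would be needed to count it as well.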
  • 20. Text Feature
      - Ynktk’s Word Embedding
          Self-trained FastText
              model = FastText(PathLineSentences(train+test+train_active+test_active), size=300, window=5, min_count=5, word_ngrams=1, seed=seed, workers=32)
          Self-trained Word2Vec
              model = Word2Vec(PathLineSentences(train+test+train_active+test_active), size=300, window=5, min_count=5, seed=seed, workers=32)
      Note: embeddings trained on the competition text were more effective than embeddings pre-trained on Wikipedia and similar corpora, probably because proper nouns such as product names were predictive of the target.
  • 21. Image Feature
      - Meta Feature
          Image_size, Height, Width, Average_pixel_width, Average_blue, Average_red, Average_green, Blurriness, Whiteness, Dullness
          Dullness - Whiteness (interaction feature)
      - Pre-trained Prediction Feature
          VGG16 prediction value
          ResNet50 prediction value
      - Ynktk’s Feature
          Top finishers extracted image features with VGG and similar networks, but hand-crafted features were also effective.
          NIMA [1]
          Brightness, Saturation, Contrast, Colorfulness, Dullness, Blurriness, Interest Points, Saliency Map, Human Faces, etc. [2]
      [1] Talebi, H., & Milanfar, P. (2018). NIMA: Neural Image Assessment
      [2] Cheng, H. et al. (2012). Multimedia Features for Click Prediction of New Ads in Display Advertising
  • 22. Parameter Tuning
      - Chosen manually, using multiple servers
          params = {
              'boosting_type': 'gbdt',
              'objective': 'xentropy',  # target behaves like a binary-classification probability
              'metric': 'rmse',
              'learning_rate': 0.02,
              'num_leaves': 600,
              'max_depth': -1,
              'max_bin': 256,
              'bagging_fraction': 1,
              'feature_fraction': 0.1,  # sparse text vector
              'verbose': 1
          }
  • 23. Submission Analysis
      Compare the single LightGBM submission file with the stacking submission file:
      1. Bug check
      2. Diverse model comparison
      3. Prediction value trend
  • 24. Best LightGBM Summary
      - Table feature number: ~250
      - Text feature number: 1,500,000+
      - Image feature number: 50+
      - Total feature number: 1,503,424
      - Public LB: better than 0.2174
      - Private LB: better than 0.2210
  • 25. Diversity
      Columns: Type, Loss, Data Set, Feature Set, Parameter, NN Structure
      - Lightgbm
          Loss: xentropy, regression, huber, fair, auc
          Data set: with/without Active data
          Feature set: Table; Table + Text; Table + Text + Image; Table + Text + Image + Ridge_meta
          Parameter: learning_rate, num_leaves
      - Xgboost
          Loss: reg:linear, binary:logistic
          Data set: with/without Active data
          Feature set: Table + Text; Table + Text + Image
      - Catboost
          Loss: binary_crossentropy
          Data set: with Active data
          Feature set: Table + Image
      - Random Forest
          Loss: regression
          Data set: with Active data
          Feature set: Table + Text + Image
      - Ridge Regression
          Loss: regression
          Data set: without Active data
          Feature set: Text; Table + Text + Image
          Parameter: Tfidf max_features
      - Neural network
          Loss: regression, binary_crossentropy
          Data set: with Active data
          Feature set: Table + Text + Image + word embedding
          Parameter: layer size, Dropout, BatchNorm, Pooling
          NN structure: rnn-dnn, rnn-cnn-dnn, rnn-attention-dnn
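The "Ridge_meta" feature set and the "diverse models' oof" item in the pipeline rely on out-of-fold (OOF) predictions. A minimal sketch of producing a Ridge OOF meta feature follows; the function `ridge_oof`, the dense toy data, `alpha=1.0` and the clipping to [0, 1] are all assumptions, and in the competition the input would more likely be the sparse TF-IDF matrix.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold


def ridge_oof(X_train, y_train, X_test, n_splits=5, alpha=1.0, seed=0):
    """Return out-of-fold predictions for train and fold-averaged predictions for test."""
    oof = np.zeros(X_train.shape[0])
    test_pred = np.zeros(X_test.shape[0])
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, val_idx in kf.split(X_train):
        model = Ridge(alpha=alpha)
        model.fit(X_train[fit_idx], y_train[fit_idx])
        # Each training row is predicted by a model that never saw it.
        oof[val_idx] = model.predict(X_train[val_idx])
        test_pred += model.predict(X_test) / n_splits
    # Assumption: clip to the valid target range before using as a feature.
    return np.clip(oof, 0, 1), np.clip(test_pred, 0, 1)


rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
y_train = np.clip(0.3 + 0.1 * X_train[:, 0] + rng.normal(scale=0.05, size=100), 0, 1)
X_test = rng.normal(size=(20, 5))

oof, test_pred = ridge_oof(X_train, y_train, X_test)
```

Because every OOF value comes from a model trained on the other folds, the column can be fed to LightGBM or a stacker without leaking the target.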
  • 26. Ynktk’s Best NN
      Inputs: Numerical, Categorical, Image, Text
      Diagram blocks (in extraction order): Embedding, Dense, SpatialDropout, LSTM, GRU, Conv1D, LeakyReLU, GAP, GMP, Concat, BatchNorm, LeakyReLU, Dropout, Dense, LeakyReLU, BatchNorm, Dense, Concat, Embedding
      Callbacks: EarlyStopping, ReduceLROnPlateau
      Optimizer: Adam with clipvalue 0.5, learning rate 2e-03
      Loss: binary cross entropy
      Private LB: 0.2225, Public LB: 0.2181
  • 29. China Competitions & Platform
      Columns: Kaggle vs. China competitions
      - Platform
          Kaggle: Kaggle
          China: Tianchi, Tencent, Jdata, Kesci, DataCastle, Biendata, DataFountain…… [1]
      - Round
          Kaggle: Round 1: 2~3 months, Public/Private LB
          China: Round 1: 1.5 months Public, 3 days Private; Round 2: 2 weeks Public, 3 days Private; Round 3: Presentation
      - Submissions per day
          Kaggle: 5
          China: Public: 3, Private: 1
      - Prize
          Kaggle: Top 3
          China: Top 5/10/50
      [1] https://github.com/iphysresearch/DataSciComp
  • 30. Knowledge Sharing
      https://github.com/Smilexuhc/Data-Competition-TopSolution/blob/master/README.md
      Learn From GrandMasters
      1. EDA by Excel
      2. Join every competition
      3. Reuse pipeline & features
      4. Strict time management
      5. Use knowledge from different areas
      6. Family support