3.
8.

Example row from the training data:

qa_id               1, 2, 3 …
question_title      What am I losing when using extension …
question_body       After playing around with macro …
question_user_name  ysap
question_user_page  https://photo.stackexchange.com/users/1024
answer              I just got extension tubes, so here's the skinny. …
answer_user_name    rfusca
answer_user_page    https://photo.stackexchange.com/users/1917
url                 …
category            LIFE_ARTS
host                photo.stackexchange.com

train data: 6079 rows / public test data: 476 rows (13%) / private test data: 3186 rows (87%)
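The field list above can be sketched as a one-row dataframe schema; the example values are the ones shown on the slide (a minimal sketch only — the real train data also carries the 30 target columns after these fields).

```python
import pandas as pd

# Minimal sketch of the train-data schema shown above (one example row;
# the elided values are kept elided, exactly as on the slide).
row = {
    "qa_id": 1,
    "question_title": "What am I losing when using extension …",
    "question_body": "After playing around with macro …",
    "question_user_name": "ysap",
    "question_user_page": "https://photo.stackexchange.com/users/1024",
    "answer": "I just got extension tubes, so here's the skinny. …",
    "answer_user_name": "rfusca",
    "answer_user_page": "https://photo.stackexchange.com/users/1917",
    "url": "…",
    "category": "LIFE_ARTS",
    "host": "photo.stackexchange.com",
}
df = pd.DataFrame([row])
print(df.shape)  # (1, 11)
```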
12.
• Head and tail of tokens
  • Use the first and the last tokens when the text exceeds max seq length
  • https://arxiv.org/abs/1905.05583
• Averaging (base uncased models)
• Post-processing: fit the distribution of the train data for target columns
  • Detail is in the following page
• Concatenate pooled outputs
• Global Average Pooling
• MultilabelStratifiedKFold
https://www.kaggle.com/c/google-quest-challenge/discussion/129885
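The head-and-tail truncation described in the arXiv paper linked above (1905.05583) can be sketched as follows. This is a hedged sketch: the 128-token head length is an illustrative choice, not necessarily the split the author used.

```python
def head_tail_truncate(token_ids, max_len=512, head_len=128):
    """Keep the first `head_len` and the last `max_len - head_len` tokens
    when a sequence is longer than `max_len` (head+tail truncation)."""
    if len(token_ids) <= max_len:
        return token_ids
    tail_len = max_len - head_len
    return token_ids[:head_len] + token_ids[-tail_len:]

# Example: a 600-token input is cut to 128 head + 384 tail tokens.
tokens = list(range(600))
out = head_tail_truncate(tokens)
print(len(out))  # 512
```

The rationale from the paper is that the beginning and the end of a long text tend to carry the most signal, so dropping the middle loses less than plain truncation.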
13.
[figure: model architecture diagram]
14.
15.
[figure: model ensemble diagram]
16.

import numpy as np
import pandas as pd
from scipy.stats import rankdata

def rank_average(preds):
    # Convert predictions to ranks scaled into [0, 1]
    ranked_pred = rankdata(preds)
    return (ranked_pred - np.min(ranked_pred)) / (np.max(ranked_pred) - np.min(ranked_pred))

class OptimPreds(object):
    def __init__(self, df_train):
        # For each target column, store its discrete values and their train frequencies
        self.score_range_dict = {}
        for i, c in enumerate(df_train.columns[11:]):
            cnt = df_train[c].value_counts(normalize=True).sort_index()
            self.score_range_dict[i] = [cnt.index.values.tolist(), cnt.values.tolist()]

    def predict(self, preds, i):
        # Snap ranked predictions to the train value distribution of column i
        return pd.cut(rank_average(preds),
                      [-np.inf] + np.cumsum(self.score_range_dict[i][1])[:-1].tolist() + [np.inf],
                      labels=self.score_range_dict[i][0])

def optim_predict(pred):
    # Apply the post-processing only to the selected target columns
    for i in range(pred.shape[1]):
        if i in [2, 5, 12, 13, 14, 15, 19]:
            pred[:, i] = optim.predict(pred[:, i], i)
    return pred

optim = OptimPreds(df_train)
valid_pred = optim_predict(valid_pred_org.copy())
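A toy illustration of the idea behind the OptimPreds post-processing above (the column values and frequencies here are hypothetical): Spearman correlation depends only on ranks, so mapping rank-normalized predictions onto the discrete values seen in train cannot break the ordering, while matching the train value distribution can help on columns whose targets take only a few distinct values. The `np.searchsorted` call below mirrors what `pd.cut` does with the cumulative-frequency bin edges.

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical toy target column: train values {0.0, 0.5, 1.0}
# occurring with frequencies 50% / 30% / 20%.
values = [0.0, 0.5, 1.0]
freqs = [0.5, 0.3, 0.2]

preds = np.array([0.11, 0.42, 0.38, 0.90, 0.07, 0.55])

# Rank-normalize into [0, 1], as rank_average() does on the slide.
r = rankdata(preds)
r = (r - r.min()) / (r.max() - r.min())

# Cut the ranks at the cumulative train frequencies and snap to train values.
edges = np.concatenate([[-np.inf], np.cumsum(freqs)[:-1], [np.inf]])
snapped = np.array(values)[np.searchsorted(edges, r, side="left") - 1]
print(snapped)  # [0.  0.5 0.  1.  0.  0.5]
```

The lowest-ranked 50% of predictions become 0.0, the next 30% become 0.5, and the top 20% become 1.0, reproducing the train distribution of the column.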
19. Didn’t work for me
✓ Pre-training with Stack Overflow data (150,000 sentences)
✓ Multi-sample dropout
✓ Other models
  ✓ RoBERTa
  ✓ ALBERT
  ✓ XLNet
✓ Concatenating the question-only output with an answer-only model
✓ Concatenating a category MLP with the BERT model
✓ LSTM head instead of a Dense head on the BERT model
✓ Freezing half of the BertLayers to reduce model complexity
✓ Skipping half of the BertLayers to reduce model complexity
✓ USE (Universal Sentence Encoder) + MLP
✓ LSTM model with gensim embeddings
✓ Custom losses
  ✓ BCE & MSE
  ✓ focal loss
✓ Word-count feature
✓ Concatenating title and question_body as one block (removing the [SEP] between them)
✓ Up-sampling for imbalanced target columns
https://www.kaggle.com/c/google-quest-challenge/discussion/129885
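For reference, the focal loss listed above can be sketched like this. This is a generic binary focal loss, not necessarily the exact form the author tried; with gamma = 0 it reduces to plain BCE.

```python
import numpy as np

def focal_loss(y_true, y_pred, gamma=2.0, eps=1e-7):
    """Binary focal loss: BCE down-weighted by (1 - p_t)**gamma,
    so easy, well-classified examples contribute less to the loss."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    p_t = np.where(y_true == 1, y_pred, 1 - y_pred)
    return float(np.mean(-((1 - p_t) ** gamma) * np.log(p_t)))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_pred = np.array([0.9, 0.1, 0.6, 0.4])
print(focal_loss(y_true, y_pred, gamma=2.0))
```

With gamma > 0 the confident examples (0.9/0.1) are down-weighted relative to the borderline ones (0.6/0.4), which is the intended behavior for imbalanced targets.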