word2vec does just that, by randomly sampling the size of the context window around each token. What you see here are the probabilities that each specific context word will be included in the training data.
GloVe also does something similar, but in a deterministic manner, and with a slightly different distribution.
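To make the difference concrete, here is a minimal sketch (not from the talk itself) of the expected weight each scheme gives to a context word at a given distance from the target. The function names are illustrative; the word2vec probability follows from sampling the window size uniformly between 1 and the maximum, and GloVe's weight is the deterministic harmonic factor.

```python
# Sketch: expected weight of a context word at a given distance from the
# target word, under word2vec's randomly sampled ("dynamic") window versus
# GloVe's deterministic harmonic weighting. Illustrative names only.

def word2vec_weight(distance, max_window):
    # word2vec samples the actual window size uniformly from 1..max_window,
    # so a word at `distance` is kept only when the sampled size >= distance.
    # Its inclusion probability is (max_window - distance + 1) / max_window.
    return (max_window - distance + 1) / max_window

def glove_weight(distance):
    # GloVe down-weights distant co-occurrences deterministically by 1/distance.
    return 1.0 / distance

if __name__ == "__main__":
    max_window = 5
    for d in range(1, max_window + 1):
        print(f"distance {d}: word2vec={word2vec_weight(d, max_window):.2f}, "
              f"glove={glove_weight(d):.2f}")
```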
There are, of course, other ways to do this, and they were all, apparently, applied to traditional algorithms about a decade ago.
What I'm going to do is show you an example of how previous work compared algorithms, and how things change when you compare apples to apples.
On the other hand, we have the recommended setting of word2vec, which was tuned for SGNS and does quite a bit more. Now, when we just look at these two settings side by side, it's easier to see that part of word2vec's contribution is this collection of hyperparameters.
But this isn't the end of the story. What if the word2vec setting isn't the best setting for this task? After all, it was tuned for a certain type of analogies. The right thing to do is to allow each method to tune every hyperparameter. As you can see, tuning hyperparameters can take us well beyond conventional settings, and also gives us a more elaborate and realistic comparison of different algorithms. Now, it's important to stress that these two settings can be, and usually are, very different in practice. In fact, each task and each algorithm usually require different settings, which is why we used cross-validation to tune hyperparameters when we compared the different algorithms.
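As an illustration of that tuning procedure, here is a minimal sketch of a grid search with k-fold cross-validation over the kinds of hyperparameters discussed in the paper. The `score_config` function is a hypothetical stand-in; a real version would train a model with the given settings and report, for example, Spearman correlation against human ratings on the held-out word-similarity pairs.

```python
# Sketch: per-task hyperparameter tuning via grid search with k-fold
# cross-validation. `score_config` is a hypothetical placeholder for
# training a model with `params` and scoring it on the held-out pairs.
from itertools import product
import random

def score_config(params, train_pairs, test_pairs):
    return random.random()  # placeholder score

def cross_validated_grid_search(pairs, param_grid, n_folds=5, seed=0):
    rng = random.Random(seed)
    pairs = pairs[:]
    rng.shuffle(pairs)
    folds = [pairs[i::n_folds] for i in range(n_folds)]
    best_score, best_params = float("-inf"), None
    keys = list(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        scores = []
        for i in range(n_folds):
            test = folds[i]
            train = [p for j, fold in enumerate(folds) if j != i for p in fold]
            scores.append(score_config(params, train, test))
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_score, best_params = mean, params
    return best_params, best_score

# Example grid over hyperparameters of the kind discussed in the paper.
param_grid = {
    "window": [2, 5, 10],       # context window size
    "neg": [1, 5, 15],          # number of negative samples
    "cds": [0.75, 1.0],         # context distribution smoothing exponent
    "w_plus_c": [False, True],  # adding context vectors
}
toy_pairs = [("pizza", "pasta", 0.9), ("pizza", "italy", 0.6)] * 10  # toy benchmark
print(cross_validated_grid_search(toy_pairs, param_grid))
```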
Now, this is only a little taste of our experiments, and you'll have to read the paper to get the entire picture, but I will try to point out three major observations:
First, tuning hyperparameters often has a stronger positive effect than switching to a fancier algorithm.
Second, tuning hyperparameters might be even more effective than adding more data.
And third, previous claims that one algorithm is categorically better than another do not always hold once hyperparameters are properly tuned.
So, in conclusion.
And besides, SGNS is so memory-efficient and easy to use that it's become my personal favorite.
Improving Distributional Similarity with Lessons Learned from Word Embeddings
Omer Levy, Yoav Goldberg, Ido Dagan
Twitter : Hironsan13
GitHub : Hironsan
Transactions of the Association for Computational Linguistics, 2015
• Improvements in word embedding performance come largely from hyperparameter tuning
• Four methods were evaluated over many combinations of hyperparameters
• On word similarity tasks, even count-based methods can achieve comparable performance
Word Similarity & Relatedness
• How similar is pizza to pasta?
• How related is pizza to Italy?
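Both questions are typically answered the same way: score the pair by the cosine similarity of the two word vectors. A minimal sketch with made-up toy vectors (real ones would come from a trained count-based or embedding model):

```python
# Sketch: scoring word pairs by cosine similarity of their vectors.
# The toy vectors below are made up for illustration.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

vectors = {
    "pizza": np.array([0.9, 0.1, 0.3]),
    "pasta": np.array([0.8, 0.2, 0.4]),
    "italy": np.array([0.5, 0.7, 0.1]),
}

print("similarity(pizza, pasta) =", round(cosine(vectors["pizza"], vectors["pasta"]), 3))
print("relatedness(pizza, italy) =", round(cosine(vectors["pizza"], vectors["italy"]), 3))
```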
Approaches for Representing Words
Word Embeddings (Predict)
• Inspired by deep learning
• word2vec (Mikolov et al., 2013)
• GloVe (Pennington et al., 2014)
Underlying Theory: The Distributional Hypothesis
Distributional Semantics (Count)
• Used since the 90’s
• Sparse word-context PMI/PPMI matrix
• Decomposed with SVD
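A minimal sketch of that count-based pipeline, using a toy corpus and simplified window handling, might look like this:

```python
# Sketch: the count-based pipeline on a toy corpus. Build a word-context
# co-occurrence matrix, convert counts to PPMI, and reduce it with a
# truncated SVD to obtain dense word vectors. Simplified for clarity.
from collections import Counter
import numpy as np

corpus = [["pizza", "is", "food"], ["pasta", "is", "food"], ["italy", "has", "pizza"]]
window = 2

# Count word-context co-occurrences within the window.
pair_counts = Counter()
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                pair_counts[(w, sent[j])] += 1

words = sorted({w for w, _ in pair_counts})
contexts = sorted({c for _, c in pair_counts})
M = np.zeros((len(words), len(contexts)))
for (w, c), n in pair_counts.items():
    M[words.index(w), contexts.index(c)] = n

# PPMI: positive pointwise mutual information.
total = M.sum()
p_w = M.sum(axis=1, keepdims=True) / total
p_c = M.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore"):
    pmi = np.log((M / total) / (p_w * p_c))
ppmi = np.maximum(pmi, 0)  # log(0) = -inf becomes 0

# Truncated SVD yields low-dimensional word vectors.
U, S, Vt = np.linalg.svd(ppmi, full_matrices=False)
k = 2
word_vectors = U[:, :k] * S[:k]
print(dict(zip(words, word_vectors.round(2).tolist())))
```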
Approaches for Representing Words
• Both rely on the distributional hypothesis
• "Neural Word Embedding as Implicit Matrix Factorization" (Levy & Goldberg, NIPS 2014)
• “Don’t Count, Predict!” (Baroni et al., ACL 2014)
The Contributions of Word Embeddings
Novel Algorithms (objective + training method)
• Skip Grams + Negative Sampling
• CBOW + Hierarchical Softmax
• Noise Contrastive Estimation
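As a rough illustration, the first two objectives map onto a common toolkit such as gensim as follows (a sketch with a toy corpus; parameter names assume gensim 4.x, and NCE is not exposed there):

```python
# Sketch: two word2vec training configurations expressed with gensim
# (parameter names assume gensim 4.x); the toy corpus is illustrative only.
from gensim.models import Word2Vec

sentences = [["pizza", "is", "food"], ["pasta", "is", "food"], ["italy", "has", "pizza"]]

# Skip-grams + negative sampling (SGNS).
sgns = Word2Vec(sentences, sg=1, negative=5, hs=0,
                vector_size=100, window=5, min_count=1)

# CBOW + hierarchical softmax.
cbow_hs = Word2Vec(sentences, sg=0, hs=1, negative=0,
                   vector_size=100, window=5, min_count=1)

print(sgns.wv.most_similar("pizza", topn=2))
```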
New Hyperparameters (preprocessing, smoothing, etc.)
• Dynamic Context Windows
• Context Distribution Smoothing
• Adding Context Vectors
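Two of these hyperparameters can be sketched directly. Context distribution smoothing raises the context (unigram) counts to a power, 0.75 in word2vec, before computing PMI, and adding context vectors represents each word as the sum of its word and context vectors. The arrays below are toy examples only:

```python
# Sketch: two hyperparameters transplanted from word2vec to the count-based
# setting. Context distribution smoothing raises context counts to a power
# (0.75 in word2vec) inside PMI; "adding context vectors" represents each
# word as w + c. Toy arrays only.
import numpy as np

# Context distribution smoothing (CDS) inside PMI.
counts = np.array([[4.0, 0.0, 1.0],
                   [2.0, 3.0, 0.0]])          # word-by-context counts
alpha = 0.75                                   # smoothing exponent
p_wc = counts / counts.sum()
p_w = counts.sum(axis=1, keepdims=True) / counts.sum()
smoothed = counts.sum(axis=0) ** alpha
p_c_smoothed = (smoothed / smoothed.sum())[None, :]
with np.errstate(divide="ignore"):
    pmi_cds = np.log(p_wc / (p_w * p_c_smoothed))
ppmi_cds = np.maximum(pmi_cds, 0)
print(ppmi_cds.round(2))

# Adding context vectors: represent each word by word vector + context vector.
W = np.random.rand(5, 50)  # word vectors (toy)
C = np.random.rand(5, 50)  # context vectors for the same vocabulary (toy)
combined = W + C
```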
1) Identifying the existence of new hyperparameters
• Not always mentioned in papers
2) Adapting the hyperparameters across algorithms
• Must understand the mathematical relation between algorithms and their hyperparameters
3) Comparing algorithms across all hyperparameter settings
• Over 5,000 experiments
Conclusions: Distributional Similarity
• Novel algorithms vs. new hyperparameters: which drives the improvements?
• Hyperparameters (mostly)
• Omer Levy and Yoav Goldberg. 2014. Neural Word Embeddings as Implicit Matrix Factorization.
• Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space.
• Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.
If you are interested in Machine Learning, Natural Language Processing, or Computer Vision, follow arXivTimes @ https://twitter.com/arxivtimes