1. Efficient Methods for Incorporating Knowledge into Topic Models
[Yang, Downey and Boyd-Graber 2015]
2015/10/24
EMNLP 2015 Reading
@shuyo
2. Large-scale Topic Model
• In academic papers
– Up to 10^3 topics
• Industrial applications
– 10^5~10^6 topics!
– Search engines, online ads, and so on
– To capture infrequent topics
• This paper handles up to 500 topics... really?
3. (Standard) LDA
[Blei+ 2003, Griffiths+ 2004]
• "Conventional" Gibbs sampling
– $P(z = t \mid \boldsymbol{z}_-, w) \propto q_t := (n_{d,t} + \alpha)\,\dfrac{n_{w,t} + \beta}{n_t + V\beta}$
– $T$: number of topics
– Draw $U \sim \mathcal{U}\!\left(0, \sum_{z=1}^{T} q_z\right)$ and find $t$ such that $\sum_{z=1}^{t-1} q_z < U < \sum_{z=1}^{t} q_z$
• For large $T$ this is computationally intensive
– $n_{w,t}$ is sparse
– When $T$ is very large, $n_{d,t}$ is sparse too, e.g. $T = 10^6 > n_d$
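A minimal sketch (not from the paper) of this conventional per-token draw, assuming dense count arrays n_dt, n_wt, n_t and a numpy random generator; it makes the O(T) cost of the cumulative-sum scan explicit:

```python
import numpy as np

def sample_topic(d, w, n_dt, n_wt, n_t, alpha, beta, V, rng):
    """One conventional Gibbs draw: build all T weights q_t, then invert
    the cumulative sum with a uniform draw U -- O(T) work per token."""
    # q_t = (n_{d,t} + alpha) * (n_{w,t} + beta) / (n_t + V*beta)
    q = (n_dt[d] + alpha) * (n_wt[w] + beta) / (n_t + V * beta)
    U = rng.uniform(0.0, q.sum())
    cum = 0.0
    for t in range(len(q)):          # linear scan over all T topics
        cum += q[t]
        if U < cum:
            return t
    return len(q) - 1                # guard against floating-point round-off
```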
6. Word correlation prior knowledge
• Must-link
– "quarterback" and "fumble" are both related to American football
• Cannot-link
– "fumble" and "bank" imply two different topics
7. SC-LDA [Yang+ 2015]
• SC stands for "Sparse Constrained"
• $m \in M$: an item of prior knowledge
• $f_m(z, w, d)$: potential function of prior knowledge $m$ about word $w$ with topic $z$ in document $d$
• $\psi(\boldsymbol{z}, M) = \prod_{z \in \boldsymbol{z}} \exp f_m(z, w, d)$
– maybe the product runs over $m \in M$ and all $w$ with $z$ in all $d$
• $P(\boldsymbol{w}, \boldsymbol{z} \mid \alpha, \beta, M) = P(\boldsymbol{w} \mid \boldsymbol{z}, \beta)\,P(\boldsymbol{z} \mid \alpha)\,\psi(\boldsymbol{z}, M)$
– maybe $\propto$ rather than $=$
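A toy sketch, under the reading noted above (product over m ∈ M and every token), of how the knowledge factor ψ would be accumulated in log space; `assignments` and `potentials` are hypothetical names for the token assignments and the list of f_m functions:

```python
def log_psi(assignments, potentials):
    """log psi(z, M) = sum over tokens (d, w, z) and knowledge items m
    of f_m(z, w, d); exp of this multiplies the usual LDA joint."""
    total = 0.0
    for d, w, z in assignments:      # one (doc, word, topic) triple per token
        for f_m in potentials:       # prior-knowledge items m in M
            total += f_m(z, w, d)
    return total
```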
9. Word correlation prior knowledge for SC-LDA
• $f_m(z, w, d) = \sum_{u \in M_w^m} \log \max(\lambda, n_{u,z}) + \sum_{v \in M_w^c} \log \dfrac{1}{\max(\lambda, n_{v,z})}$
– where $M_w^m$: must-links of $w$, $M_w^c$: cannot-links of $w$
• $P(z = t \mid \boldsymbol{z}_-, w, M) \propto \left( \dfrac{\alpha\beta}{n_t + V\beta} + \dfrac{n_{d,t}\,\beta}{n_t + V\beta} + \dfrac{(n_{d,t} + \alpha)\, n_{w,t}}{n_t + V\beta} \right) \prod_{u \in M_w^m} \max(\lambda, n_{u,z}) \prod_{v \in M_w^c} \dfrac{1}{\max(\lambda, n_{v,z})}$
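A sketch (my reconstruction, not the authors' implementation) of the sampling weights above: SparseLDA's three buckets times the must-link / cannot-link factor. `must_links` and `cannot_links` are hypothetical dicts mapping a word id to its linked word ids:

```python
import numpy as np

def knowledge_factor(w, t, n_wt, must_links, cannot_links, lam):
    """exp(f_m(z=t, w, d)) for word-correlation knowledge."""
    factor = 1.0
    for u in must_links.get(w, ()):      # must-linked words push the weight up
        factor *= max(lam, n_wt[u, t])
    for v in cannot_links.get(w, ()):    # cannot-linked words push it down
        factor /= max(lam, n_wt[v, t])
    return factor

def sc_lda_weights(d, w, n_dt, n_wt, n_t, alpha, beta, V,
                   must_links, cannot_links, lam):
    """Unnormalized P(z = t | z_-, w, M): smoothing + document + topic-word
    buckets, each over (n_t + V*beta), times the knowledge factor."""
    base = (alpha * beta + n_dt[d] * beta
            + (n_dt[d] + alpha) * n_wt[w]) / (n_t + V * beta)
    return np.array([base[t] * knowledge_factor(w, t, n_wt,
                                                must_links, cannot_links, lam)
                     for t in range(len(n_t))])
```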
10. Factor Graph
• The paper says prior knowledge is incorporated "by adding a factor graph to encode prior knowledge," but the factor graph itself is never drawn.
• The potential function $f_m(z, w, d)$ contains $n_{w,z}$, and $\varphi_{w,z} \propto n_{w,z} + \beta$.
• So the model seems to correspond to Fig. b:
(Fig. a and Fig. b: two candidate factor-graph structures shown on the slide; images not reproduced here)
11. [Ramage+ 2009] Labeled LDA
• Supervised LDA for labeled documents
– It is equivalent to SC-LDA with the following potential function:
$f_m(z, w, d) = \begin{cases} 1 & \text{if } z \in m_d \\ -\infty & \text{otherwise} \end{cases}$
where $m_d$ is the label set of document $d$
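For illustration only, that potential as code: since exp(−∞) = 0, topics outside the document's label set m_d get zero sampling probability (`doc_labels` is a hypothetical dict from document id to its label set):

```python
def labeled_lda_potential(z, w, d, doc_labels):
    """f_m(z, w, d) for Labeled LDA: allow only topics in the label set of d."""
    return 1.0 if z in doc_labels[d] else float("-inf")
```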
12. Experiments
• Baselines
– Dirichlet Forest-LDA [Andrzejewski+ 2009]
– Logic-LDA [Andrzejewski+ 2011]
– MRF-LDA [Xie+ 2015] (encodes word correlations in LDA as an MRF)
– SparseLDA [Yao+ 2009]
DATASET    DOCS       WORD TYPES  TOKENS (APPROX.)  EXPERIMENT
NIPS       1,500      12,419      1,900,000         Word correlation
NYT-NEWS   3,000,000  102,660     100,000,000       Word correlation
20NG       18,828     21,514      1,946,000         Labeled docs
13. Generate Word Correlation
• Must-link
– Obtain synsets from WordNet 3.0
– Keep a pair if the word2vec embedding similarity between the word and a synset member is higher than the threshold 0.2
• Cannot-link
– Nothing? (how cannot-links are generated does not seem to be described)
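A sketch of the must-link generation as I understand it, assuming NLTK's WordNet interface and gensim word2vec vectors; the vectors file name is a placeholder:

```python
from nltk.corpus import wordnet as wn
from gensim.models import KeyedVectors

# placeholder path; any pre-trained word2vec vectors would do
vectors = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)
THRESHOLD = 0.2  # similarity cutoff reported in the paper

def must_links(word):
    """Words sharing a WordNet 3.0 synset with `word` whose word2vec
    similarity to `word` exceeds the threshold."""
    links = set()
    if word not in vectors:
        return links
    for synset in wn.synsets(word):
        for lemma in synset.lemma_names():
            lemma = lemma.lower()
            if (lemma != word and lemma in vectors
                    and vectors.similarity(word, lemma) > THRESHOLD):
                links.add(lemma)
    return links
```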
14. Convergence Speed
• (Figure) The average running time per iteration over 100 iterations, averaged over 5 seeds, on the 20NG dataset.
15. Coherence [Mimno+ 2011]
• $C(t; V^{(t)}) = \sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \dfrac{F(v_m^{(t)}, v_l^{(t)}) + \epsilon}{F(v_l^{(t)})}$
– $F(v)$: document frequency of word type $v$ (does this mean the number of documents that include $v$?)
– $F(v, v')$: co-document frequency of word types $v$ and $v'$
– $\epsilon$ is a very small value such as $10^{-12}$ [Röder+ 2015]
(Figure: coherence results; the values −39.1 and −36.6 are highlighted on the slide)
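A small sketch of this coherence score, assuming precomputed document-frequency and co-document-frequency tables (the names are mine, not from the paper):

```python
import math

def coherence(top_words, doc_freq, co_doc_freq, eps=1e-12):
    """[Mimno+ 2011] coherence for one topic's top-M word list.
    doc_freq[v]        : number of documents containing word type v
    co_doc_freq[v, v'] : number of documents containing both v and v'"""
    score = 0.0
    for m in range(1, len(top_words)):       # m = 2..M in 1-based notation
        for l in range(m):                   # l = 1..m-1
            v_m, v_l = top_words[m], top_words[l]
            score += math.log((co_doc_freq[v_m, v_l] + eps) / doc_freq[v_l])
    return score
```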
16. References
• [Yang+ 2015] Efficient Methods for Incorporating Knowledge into Topic Models
• [Blei+ 2003] Latent Dirichlet allocation.
• [Griffiths+ 2004] Finding scientific topics.
• [Yao+ 2009] Efficient methods for topic model inference on streaming document
collections.
• [Ramage+ 2009] Labeled LDA: A supervised topic model for credit attribution in
multilabeled corpora.
• [Andrzejewski+ 2009] Incorporating domain knowledge into topic modeling via
Dirichlet forest priors.
• [Andrzejewski+ 2011] A framework for incorporating general domain knowledge
into latent Dirichlet allocation using first-order logic.
• [Xie+ 2015] Incorporating word correlation knowledge into topic modeling.
• [Mimno+ 2011] Optimizing semantic coherence in topic models.
• [Röder+ 2015] Exploring the space of topic coherence measures.