This research was published at IEEE SSCI 2017 in Hawaii.
The research goal was to construct a learning theory of Non-negative Matrix Factorization; we derived a tighter upper bound of the generalization error than in our previous research. Moreover, we carried out numerical experiments and proposed a conjecture about the exact value of the generalization error.
IEEESSCI2017-FOCI4-1039
1. Tighter Upper Bound of Real Log Canonical Threshold of Non-negative Matrix Factorization and its Application to Bayesian Inference
Naoki Hayashi* (TokyoTech, Dept. of MCS)
Sumio Watanabe (TokyoTech, Dept. of MCS)
2. Slide
• This slide is available at http://watanabe-www.math.dis.titech.ac.jp/~nhayashi/pdf/hayashi1039.pdf
3. Index
• Introduction
• Main Theorem
• Discussion
• Conclusion
• (Appendix: Sketch of Proof)
5. Index
• Introduction
– Non-negative Matrix Factorization
– Real Log Canonical Threshold
– Research Goal
• Main Theorem
• Discussion
• Conclusion
• (Appendix: Sketch of Proof)
6. NMF has been applied
• Non-negative Matrix Factorization (NMF) has been applied to many fields.
• E.g.,
– Purchase basket data → Consumer analysis
– Image, sound, … → Signal processing
– Text data → Text mining
– Microarray data → Bioinformatics
↑ Knowledge/Structure Discovery
NMF: data → knowledge
10. Suffering
• NMF has a hierarchical structure.
• HOWEVER, its likelihood cannot be approximated by a Gaussian function,
so traditional statistics (AIC, BIC) cannot be used.
• In addition:
– The result strongly depends on the initial value.
– The optimization suffers from many local minima; it seldom reaches the global minimum.
12. Learning Theory of NMF
• NMF has been used for ``data → knowledge''.
• Its mathematical properties are unknown:
– A learning theory has not yet been established.
– The prediction accuracy has not yet been clarified.
→ No guarantee for the correctness of numerical calculation.
→ No method for theoretical hyperparameter tuning.
13. Learning Theory of NMF
• NMF has been used for ``data → knowledge''.
• Its mathematical properties are unknown:
– A learning theory has not yet been established.
– The prediction accuracy has not yet been clarified.
→ Constructing its theory is an important problem.
14. Index
• Introduction
– Non-negative Matrix Factorization
– Real Log Canonical Threshold
– Research Goal
• Main Theorem
• Discussion
• Conclusion
• (Appendix: Sketch of Proof)
15. When does the RLCT appear?
• In general [Watanabe, 2001]:
– Let $n$ be the sample size.
– The Bayesian generalization error $G_n$ has an asymptotic behavior:
$$\mathbb{E}[G_n] = \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right).$$
• The learning coefficient $\lambda$ depends on the model.
• $\lambda$ is called the real log canonical threshold (RLCT).
16. Error: Bayes << Freq.
• In hierarchically structured models, the Bayesian $\lambda$ is smaller than the frequentist one and the maximum-a-posteriori one [Watanabe, 2001 and 2009].
• Bayesian inference is effective for reducing the generalization error.
• We consider the Bayesian inference framework.
– Bayesian inference for NMF has been proposed [Cemgil, 2009]. (Remark: that work treats only the discrete case.)
17. RLCT of NMF is unknown
• NMF has been used for ``data → knowledge''.
• Its mathematical properties are unknown:
– A learning theory has not been established.
– The prediction accuracy has not been clarified.
↑ This means that the RLCT of NMF has not been clarified.
19. Def. RLCT
• The RLCT is characterized as a learning coefficient.
• It is defined as minus the largest pole of the following zeta function:
$$\zeta(z) = \int K(w)^{z}\,\varphi(w)\,dw,$$
where $K$ is the KL divergence from the true distribution to the learning machine and $\varphi$ is the prior.
• A statistical model selection method that uses RLCTs, known as sBIC (singular BIC), has been proposed [Drton, et al. 2017].
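As a minimal worked example (my own illustration, not from the slides), take a one-parameter model with $K(w) = w^2$ and a uniform prior on $[0, 1]$:

```latex
% Illustrative example: K(w) = w^2, prior uniform on [0,1].
\zeta(z) = \int_0^1 (w^2)^z \, dw = \int_0^1 w^{2z} \, dw = \frac{1}{2z+1}
% The analytic continuation has a single pole at z = -1/2,
% so the RLCT is
\lambda = \frac{1}{2}
```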
20. Index
• Introduction
– Non-negative Matrix Factorization
– Real Log Canonical Threshold
– Research Goal
• Main Theorem
• Discussion
• Conclusion
• (Appendix: Sketch of Proof)
21. Research Goal
• Constructing a learning theory of NMF
→ focus on the theoretical generalization error
→ focus on the RLCT of NMF
• Recently, we derived an upper bound of the RLCT [Hayashi, et al. 2017].
• We used an algebraic geometrical method (resolution of singularities).
22. Research Goal
• Constructing a learning theory of NMF
→ focus on the theoretical generalization error
→ focus on the RLCT of NMF
• In this research, we newly derive the exact value of the RLCT of NMF in the case rank ≤ 2.
• Using this exact value, we make the upper bound tighter than the previous one.
24. Index
• Introduction
• Main Theorem
– Bayesian Framework of NMF
– Main Result
• Discussion
• Conclusion
• (Appendix: Sketch of Proof)
27. Formalizing and Setting
• Data matrices: $W^n = (W_1, \ldots, W_n)$, each of size $M \times N$ ($\times\, n$).
– In general, we treat not only n = 1 but also n > 1.
• True factorization: $A$ of size $M \times H_0$, $B$ of size $H_0 \times N$.
• Learner factorization: $X$ of size $M \times H$, $Y$ of size $H \times N$.
• What is the Bayesian framework for this?
[Figure: the $M \times N$ data matrix $W$ approximated by the product of $A$ ($M \times H_0$) and $B$ ($H_0 \times N$); adapted from Kohjima et al. 2016.]
28. Formalizing and Setting
• Notation of probability density functions (PDFs):
– $q(W)$ : true distribution,
– $p(W \mid X, Y)$ : learning machine,
– $p^*(W)$ : predictive distribution,
whose domains (the data $W$) are Euclidean space;
– $\varphi(X, Y)$ : prior distribution,
– $p(X, Y \mid W^n)$ : posterior distribution given data,
whose domains (the parameters $(X, Y)$) are compact subsets of Euclidean space.
29. Formalizing and Setting
• Assume
$$q(W) \propto \exp\Bigl(-\tfrac{1}{2}\|W - AB\|^2\Bigr), \qquad p(W \mid X, Y) \propto \exp\Bigl(-\tfrac{1}{2}\|W - XY\|^2\Bigr),$$
and that the prior $\varphi$ is strictly positive and bounded in a neighborhood of $(A, B)$.
• Remark: Poisson and exponential distributions can also be treated [Hayashi, et al. 2017].
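A minimal sketch (not from the slides) of how data from this true distribution could be generated; the sizes M, N, H0, n and the ranges of the true factors are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the paper).
M, N, H0, n = 4, 4, 2, 200

# True non-negative factors A (M x H0) and B (H0 x N).
A = rng.uniform(0.0, 1.0, size=(M, H0))
B = rng.uniform(0.0, 1.0, size=(H0, N))

# n data matrices W_i = AB + Gaussian noise, matching
# q(W) ∝ exp(-||W - AB||^2 / 2).
W = A @ B + rng.standard_normal(size=(n, M, N))
```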
30. Bayesian Framework
• The posterior is defined by
$$p(X, Y \mid W^n) = \frac{1}{Z_n} \prod_{i=1}^{n} p(W_i \mid X, Y)\, \varphi(X, Y),$$
where $Z_n$ is the normalizing constant.
• The predictive distribution is defined by
$$p^*(W) = \int p(W \mid X, Y)\, p(X, Y \mid W^n)\, dX\, dY.$$
31. Bayesian Framework
• The Bayesian generalization error is defined by the KL divergence from the true distribution to the predictive distribution:
$$G_n = \int q(W) \log \frac{q(W)}{p^*(W)}\, dW.$$
• $G_n$ depends on the training data; thus it is a random variable.
• Its expected value over the data has an asymptotic behavior:
$$\mathbb{E}[G_n] = \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right).$$
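A hedged sketch of how $G_n$ could be estimated numerically, following the Monte Carlo setup described later in these slides (K posterior samples, T test matrices); the function names are hypothetical, and the Gaussian normalizing constants cancel in the KL ratio:

```python
import numpy as np

def log_gaussian_likelihood(W, mean):
    """Unnormalized Gaussian log-likelihood; the normalizing constants
    of q and p(W|X,Y) are equal and cancel in log q - log p*."""
    return -0.5 * np.sum((W - mean) ** 2)

def estimate_gen_error(test_W, posterior_X, posterior_Y, A, B):
    """Monte Carlo estimate of G_n = E_q[log q(W) - log p*(W)].

    test_W:        (T, M, N) test matrices drawn from q
    posterior_X/Y: lists of K posterior samples of X and Y (e.g. MCMC)
    A, B:          true factors
    """
    g = 0.0
    for W in test_W:
        log_q = log_gaussian_likelihood(W, A @ B)
        # log p*(W) ≈ log of the posterior-averaged likelihood.
        log_ps = np.array([log_gaussian_likelihood(W, X @ Y)
                           for X, Y in zip(posterior_X, posterior_Y)])
        m = log_ps.max()  # log-sum-exp trick for numerical stability
        log_pred = m + np.log(np.mean(np.exp(log_ps - m)))
        g += log_q - log_pred
    return g / len(test_W)
```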
32. Index
• Introduction
• Main Theorem
– Bayesian Framework of NMF
– Main Result
• Discussion
• Conclusion
• (Appendix: Sketch of Proof)
35. Def. RLCT of NMF
• The RLCT of NMF is defined as minus the maximum pole of the following zeta function:
$$\zeta(z) = \iint \|XY - AB\|^{2z}\, dX\, dY.$$
• $\zeta(z)$ can be analytically continued to the entire complex plane, and its poles are negative rational numbers.
• The largest pole of $\zeta(z)$ equals $-\lambda$; this $\lambda$ is the RLCT of NMF.
[Figure: the poles of $\zeta(z)$ marked on the negative real axis of the complex plane $\mathbb{C}$; the largest pole is $z = -\lambda$.]
36. Main Theorem
• The RLCT of NMF $\lambda$ satisfies the following inequality:
$$\lambda \le \frac{1}{2}\Bigl[(H - H_0)\min(M, N) + H_0(M + N - 2) + \delta(H_0)\Bigr],$$
where
$$\delta(H_0) = \begin{cases} 1 & (H_0 \equiv 1 \bmod 2) \\ 0 & (\text{otherwise}). \end{cases}$$
• The equality holds if $H = H_0 = 1$ or $2$, or $H_0 = 0$.
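A small sketch (my own illustration, not code from the paper) that evaluates this upper bound:

```python
def delta(h0: int) -> int:
    """delta(H0) = 1 if H0 is odd, 0 otherwise."""
    return h0 % 2

def rlct_upper_bound(M: int, N: int, H: int, H0: int) -> float:
    """Upper bound on the RLCT of NMF given by the Main Theorem
    (called lambda_new on the comparison slide below)."""
    return 0.5 * ((H - H0) * min(M, N) + H0 * (M + N - 2) + delta(H0))

# Example: M = N = 4, learner rank H = 3, true rank H0 = 2:
print(rlct_upper_bound(4, 4, 3, 2))  # 0.5 * (1*4 + 2*6 + 0) = 8.0
```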
38. Index
• Introduction
• Main Theorem
• Discussion
– Tightness
– Theoretical Application
– Numerical Experiment and Conjecture
• Conclusion
• (Appendix: Sketch of Proof)
39. Tighter than previous
• The Main Theorem shows an upper bound of the RLCT of NMF.
• We derived another bound of it in previous research.
• How tight is the new bound?
40. Tighter than previous
• In previous work,
$$\lambda \le \lambda_{\mathrm{prv}} = \frac{1}{2}\Bigl[(H - H_0)\min(M, N) + H_0(M + N - 1)\Bigr].$$
• In this paper,
$$\lambda \le \lambda_{\mathrm{new}} = \frac{1}{2}\Bigl[(H - H_0)\min(M, N) + H_0(M + N - 2) + \delta(H_0)\Bigr].$$
• Comparing them, the improvement of the bound is
$$\lambda_{\mathrm{prv}} - \lambda_{\mathrm{new}} = \frac{1}{2}\bigl(H_0 - \delta(H_0)\bigr) \ge 0.$$
• The more complex the true distribution (the larger $H_0$), the tighter the new bound becomes relative to the previous one.
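Continuing the sketch above (again illustrative; `rlct_upper_bound` is the hypothetical helper defined after the Main Theorem):

```python
def rlct_upper_bound_prev(M: int, N: int, H: int, H0: int) -> float:
    """Previous upper bound lambda_prv [Hayashi, et al. 2017]."""
    return 0.5 * ((H - H0) * min(M, N) + H0 * (M + N - 1))

# Example: M = N = 4, H = 3, H0 = 2.
lam_prv = rlct_upper_bound_prev(4, 4, 3, 2)  # 0.5 * (4 + 2*7) = 9.0
lam_new = rlct_upper_bound(4, 4, 3, 2)       # 8.0
print(lam_prv - lam_new)                     # 1.0 = (H0 - delta(H0)) / 2
```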
41. Index
• Introduction
• Main Theorem
• Discussion
– Tightness
– Theoretical Application
– Numerical Experiment and Conjecture
• Conclusion
• (Appendix: Sketch of Proof)
42. Bound of Error
• The Main Theorem gives an upper bound of the Bayesian generalization error via
$$\mathbb{E}[G_n] = \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right).$$
• Actually, we have
$$\mathbb{E}[G_n] \le \frac{1}{2n}\Bigl[(H - H_0)\min(M, N) + H_0(M + N - 2) + \delta(H_0)\Bigr] + o\!\left(\frac{1}{n}\right).$$
– This gives a guarantee of accuracy!
• For which distributions can we bound the error?
43. Robustness on Dist.
• The Main Theorem assumes that the data matrix is normally distributed around the product of the parameter matrices:
$$q(W) \propto \mathcal{N}(W \mid AB), \qquad p(W \mid X, Y) \propto \mathcal{N}(W \mid XY).$$
• Can the Main Theorem be used even for other distributions?
44. Robustness on Dist.
• In the prior work [Hayashi, et al. 2017], we proved that the same zeta function can be applied to the Poisson and exponential distributions. Even if
$$q(W) \propto \mathrm{Poi}(W \mid AB), \quad p(W \mid X, Y) \propto \mathrm{Poi}(W \mid XY)$$
or
$$q(W) \propto \mathrm{Expo}(W \mid AB), \quad p(W \mid X, Y) \propto \mathrm{Expo}(W \mid XY),$$
we can use
$$\zeta(z) = \iint \|XY - AB\|^{2z}\, dX\, dY.$$
46. Robustness on Dist.
• The above result follows from the fact that the I-divergence and the Itakura–Saito (IS) divergence have the same RLCT as the squared error:

Distribution:  Normal    | Poisson      | Exponential
Similarity:    Sq. error | I-divergence | IS-divergence

• Since the RLCTs are the same, we can use
$$\zeta(z) = \iint \|XY - AB\|^{2z}\, dX\, dY$$
for all three distributions; thus the Main Theorem is attained in each case.
47. Index
• Introduction
• Main Theorem
• Discussion
– Tightness
– Theoretical Application
– Numerical Experiment and Conjecture
• Conclusion
• (Appendix: Sketch of Proof)
48. Numerical Experiment
• We carried out experiments to estimate the exact value of the RLCT,
– (+) and to compare it with the RLCT of reduced rank regression (RRR), i.e. non-restricted matrix factorization.
• The posterior cannot be derived analytically.
→ Markov chain Monte Carlo (MCMC)
– We used the Metropolis–Hastings method, sketched below.
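A minimal random-walk Metropolis–Hastings sketch for the NMF posterior (my own illustration under the Gaussian model above; the proposal distribution, step size, and prior support here are assumptions, as the slides do not specify them):

```python
import numpy as np

def mh_sample_posterior(W, H, n_iter=40_000, burn_in=20_000, thin=20,
                        step=0.05, seed=0):
    """Random-walk Metropolis-Hastings over (X, Y), assuming a uniform
    prior on [0, 1]^dim; returns thinned posterior samples."""
    rng = np.random.default_rng(seed)
    n, M, N = W.shape
    X = rng.uniform(size=(M, H))
    Y = rng.uniform(size=(H, N))

    def log_post(X, Y):
        # Sum of Gaussian log-likelihoods over the n data matrices;
        # the flat prior only adds a constant on its support.
        return -0.5 * np.sum((W - X @ Y) ** 2)

    lp = log_post(X, Y)
    samples = []
    for t in range(n_iter):
        Xp = X + step * rng.standard_normal(X.shape)
        Yp = Y + step * rng.standard_normal(Y.shape)
        in_support = (Xp.min() >= 0 and Xp.max() <= 1
                      and Yp.min() >= 0 and Yp.max() <= 1)
        if in_support:  # proposals outside the support are rejected
            lpp = log_post(Xp, Yp)
            if np.log(rng.uniform()) < lpp - lp:
                X, Y, lp = Xp, Yp, lpp
        if t >= burn_in and (t - burn_in) % thin == 0:
            samples.append((X.copy(), Y.copy()))
    return samples
```

With the defaults above (burn-in 20,000, thin 20, 40,000 iterations), this yields exactly K = 1,000 posterior samples, matching the experimental condition quoted later.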
49. Non-negative Rank
• We made artificial data and set up the following cases:
– 1. The exact value of the RLCT of NMF is known.
– 2. The exact value is unknown and rank = rank+.
– 3. The exact value is unknown and rank ≠ rank+.
• rank+ : the minimal inner dimension of an exact non-negative factorization.
– This is called the non-negative rank.
– In general, rank+ ≥ rank holds.
– If min{rows, columns} ≤ 3 or rank ≤ 2, then rank = rank+.
– There exist non-negative matrices such that rank < rank+ (see the example below).
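For illustration (a standard example, not taken from the slides): the following 4 × 4 non-negative matrix has ordinary rank 3, while its non-negative rank is known to be 4.

```python
import numpy as np

# Classic example with rank < rank+: ordinary rank 3,
# non-negative rank 4 (a known result).
W = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1]], dtype=float)

print(np.linalg.matrix_rank(W))  # 3
```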
50. Condition of Experiments
• Sample size n = 200 (parameter dimension ≤ 50).
• Number of data sets D = 100.
→ We empirically calculated the RLCT using
$$\mathbb{E}[G_n] = \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right) \;\Rightarrow\; \lambda \approx n\,\mathbb{E}[G_n] \approx \frac{n}{D}\sum_{j=1}^{D} G_n^{(j)}.$$
• MCMC sample size K = 1,000.
– Burn-in = 20,000, thin = 20; i.e., the number of sampling iterations is 40,000.
• For calculating $G_n$, we generated T = 20,000 test data from the true distribution.
• Total: 100 × (40,000 + 1,000 × 20,000) ≈ O(DKT) operations.
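The final estimator is simple once the per-dataset generalization errors are available (a sketch; `estimate_gen_error` is the hypothetical helper from the earlier snippet):

```python
import numpy as np

def estimate_rlct(gen_errors, n):
    """lambda ≈ n * E[G_n], approximating the expectation by the
    average of the D per-dataset generalization errors G_n^(j)."""
    return n * float(np.mean(gen_errors))

# With the slide's setting n = 200 and D = 100, gen_errors would hold
# one estimate_gen_error(...) result per artificial data set:
# lam_hat = estimate_rlct(gen_errors, n=200)
```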
51. Numerical Result
• Legend of the result table:
– $\lambda_N$ : numerically calculated value,
– $\lambda$ : the exact value of the RLCT of NMF (when known),
– $\lambda_B$ : the upper bound for NMF,
– $\lambda_R$ : the exact value for RRR,
– r : the true rank.
[Table of numerical results not recoverable from the transcript; see the paper.]
52. Numerical Result
• The numerical results equal the theoretical value: $\lambda_N = \lambda$.
→ The numerical calculation is correct!
53. Numerical Result
• If rank = rank+, then the numerical results equal the RRR case: $\lambda_N = \lambda_R$.
→ It seems that if rank = rank+, then the RLCT of NMF satisfies $\lambda = \lambda_R$.
54. Numerical Result
• If rank ≠ rank+, then the numerical results are larger than in the RRR case: $\lambda_N > \lambda_R$.
→ It seems that if rank ≠ rank+, then the RLCT of NMF is larger than in the RRR case: $\lambda > \lambda_R$.
57. Index
• Introduction
• Main Theorem
• Discussion
• Conclusion
• (Appendix: Sketch of Proof)
58. Conclusion
• (Main contribution) We mathematically improved the upper bound of the RLCT of NMF.
– This made the bound of the generalization error in Bayesian NMF tighter.
• (Minor contribution) We carried out experiments and suggested a conjecture about the exact value of the RLCT:
– rank = rank+ ⇒ RLCT of NMF = RLCT of RRR,
– rank ≠ rank+ ⇒ RLCT of NMF > RLCT of RRR.
60. Sketch of Proof
• We had already derived the exact value in the cases $H_0 = 0$ and $H = H_0 = 1$.
• We newly clarified the exact value in the case $H = H_0 = 2$ by considering the dimension of an algebraic subvariety in the parameter space.
• We bounded the RLCT in the case $H = H_0$ by using the exact value clarified above.
• We bounded the RLCT in the general case by using the above results.