This research was published at IEEE SSCI 2017 in Hawaii.
The research goal was to construct a learning theory of Non-negative Matrix Factorization; we derived a tighter upper bound of the generalization error than in our previous research. Moreover, we carried out numerical experiments and proposed a conjecture about the exact value of the generalization error.
IEEESSCI2017-FOCI4-1039
1. Tighter Upper Bound of Real Log Canonical Threshold of Non-negative Matrix Factorization and its Application to Bayesian Inference
Naoki Hayashi* (TokyoTech, Dept. of MCS)
Sumio Watanabe (TokyoTech, Dept. of MCS)
2. Slide
• This slide is available at http://watanabe-www.math.dis.titech.ac.jp/~nhayashi/pdf/hayashi1039.pdf
3. Index
• Introduction
• Main Theorem
• Discussion
• Conclusion
• (Appendix: Sketch of Proof)
5. Index
• Introduction
– Non-negative Matrix Factorization
– Real Log Canonical Threshold
– Research Goal
• Main Theorem
• Discussion
• Conclusion
• (Appendix: Sketch of Proof)
6. NMF has been applied
• Non-negative Matrix Factorization (NMF) has been applied to many fields.
• E.g.,
– Purchase basket data → Consumer analysis
– Image, sound, … → Signal processing
– Text data → Text mining
– Microarray data → Bioinformatics
↑ Knowledge/Structure Discovery
NMF: data → knowledge
10. Suffering
• NMF has a hierarchical structure.
• HOWEVER, its likelihood cannot be approximated by a Gaussian function,
so traditional statistics (AIC, BIC) cannot be used.
• In addition:
– The result strongly depends on the initial value.
– The optimization suffers from many local minima; it seldom reaches the global minimum.
12. Learning Theory of NMF
• NMF has been used for ``data → knowledge''.
• Its mathematical properties are unknown:
– A learning theory has not yet been established.
– The prediction accuracy has not yet been clarified.
→ No guarantee for the correctness of numerical calculation.
→ No method for theoretical hyperparameter tuning.
13. Learning Theory of NMF
• NMF has been used for ``data → knowledge''.
• Its mathematical properties are unknown:
– A learning theory has not yet been established.
– The prediction accuracy has not yet been clarified.
→ Constructing its theory is an important problem.
14. Index
• Introduction
– Non-negative Matrix Factorization
– Real Log Canonical Threshold
– Research Goal
• Main Theorem
• Discussion
• Conclusion
• (Appendix: Sketch of Proof)
15. When does the RLCT appear?
• In general [Watanabe, 2001]:
– Let $n$ be the sample size.
– The Bayesian generalization error $G_n$ has an asymptotic behavior:
$$\mathbb{E}[G_n] = \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right).$$
• The learning coefficient $\lambda$ depends on the model.
• $\lambda$ is called the real log canonical threshold (RLCT).
16. Error: Bayes << Freq.
• In hierarchically structured models, the Bayesian $\lambda$ is smaller than the frequentist one and the maximum-a-posteriori one [Watanabe, 2001 and 2009].
• Bayesian inference is effective for reducing the generalization error.
• We consider the Bayesian inference framework.
– Bayesian inference for NMF has been proposed [Cemgil, 2009]. (Remark: that work treats only the discrete case.)
17. RLCT of NMF is unknown
• NMF has been used for ``data → knowledge''.
• Its mathematical properties are unknown:
– A learning theory has not been established.
– The prediction accuracy has not been clarified.
↑ This means that the RLCT of NMF has not been clarified.
19. Def. RLCT
• The RLCT is characterized as a learning coefficient.
• It is defined as minus the largest pole of the following zeta function:
$$\zeta(z) = \int K(w)^{z}\,\varphi(w)\,dw,$$
where $K$ is the KL divergence from the true distribution to the learning machine and $\varphi$ is the prior.
• A statistical model selection method that uses RLCTs, known as sBIC (singular BIC), has been proposed [Drton, et al. 2017].
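As a minimal worked example (my own illustration, not from the slides), take a one-parameter model with $K(w) = w^2$ and a uniform prior on $[0, 1]$:

```latex
% Illustrative example: K(w) = w^2, prior uniform on [0,1].
\zeta(z) = \int_0^1 (w^2)^z \, dw = \int_0^1 w^{2z} \, dw = \frac{1}{2z+1}
% The analytic continuation has a single pole at z = -1/2,
% so the RLCT is
\lambda = \frac{1}{2}
```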
20. Index
• Introduction
– Non-negative Matrix Factorization
– Real Log Canonical Threshold
– Research Goal
• Main Theorem
• Discussion
• Conclusion
• (Appendix: Sketch of Proof)
21. Research Goal
• Constructing a learning theory of NMF
→ focus on the theoretical generalization error
→ focus on the RLCT of NMF
• Recently, we derived an upper bound of the RLCT [Hayashi, et al. 2017].
• We used an algebraic geometrical method (resolution of singularities).
22. Research Goal
• Constructing a learning theory of NMF
→ focus on the theoretical generalization error
→ focus on the RLCT of NMF
• In this research, we newly derive the exact value of the RLCT of NMF in the case rank ≤ 2.
• Using this exact value, we make the upper bound tighter than the previous one.
24. Index
• Introduction
• Main Theorem
– Bayesian Framework of NMF
– Main Result
• Discussion
• Conclusion
• (Appendix: Sketch of Proof)
27. Formalizing and Setting
• Data matrices: $W^n = (W_1, \ldots, W_n)$, each of size $M \times N$ ($\times\, n$).
– In general, we treat not only n = 1 but also n > 1.
• True factorization: $A$ of size $M \times H_0$, $B$ of size $H_0 \times N$.
• Learner factorization: $X$ of size $M \times H$, $Y$ of size $H \times N$.
• What is the Bayesian framework for this?
[Figure: the $M \times N$ data matrix $W$ approximated by the product of $A$ ($M \times H_0$) and $B$ ($H_0 \times N$); adapted from Kohjima et al. 2016.]
28. Formalizing and Setting
• Notation of probability density functions (PDFs):
– $q(W)$ : true distribution,
– $p(W \mid X, Y)$ : learning machine,
– $p^*(W)$ : predictive distribution,
whose domains (the data $W$) are Euclidean space;
– $\varphi(X, Y)$ : prior distribution,
– $p(X, Y \mid W^n)$ : posterior distribution given data,
whose domains (the parameters $(X, Y)$) are compact subsets of Euclidean space.
29. Formalizing and Setting
• Assume
$$q(W) \propto \exp\Bigl(-\tfrac{1}{2}\|W - AB\|^2\Bigr), \qquad p(W \mid X, Y) \propto \exp\Bigl(-\tfrac{1}{2}\|W - XY\|^2\Bigr),$$
and that the prior $\varphi$ is strictly positive and bounded in a neighborhood of $(A, B)$.
• Remark: Poisson and exponential distributions can also be treated [Hayashi, et al. 2017].
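A minimal sketch (not from the slides) of how data from this true distribution could be generated; the sizes M, N, H0, n and the ranges of the true factors are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the paper).
M, N, H0, n = 4, 4, 2, 200

# True non-negative factors A (M x H0) and B (H0 x N).
A = rng.uniform(0.0, 1.0, size=(M, H0))
B = rng.uniform(0.0, 1.0, size=(H0, N))

# n data matrices W_i = AB + Gaussian noise, matching
# q(W) ∝ exp(-||W - AB||^2 / 2).
W = A @ B + rng.standard_normal(size=(n, M, N))
```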
30. Bayesian Framework
• The posterior is defined by
$$p(X, Y \mid W^n) = \frac{1}{Z_n} \prod_{i=1}^{n} p(W_i \mid X, Y)\, \varphi(X, Y),$$
where $Z_n$ is the normalizing constant.
• The predictive distribution is defined by
$$p^*(W) = \int p(W \mid X, Y)\, p(X, Y \mid W^n)\, dX\, dY.$$
31. Bayesian Framework
• The Bayesian generalization error is defined by the KL divergence from the true distribution to the predictive distribution:
$$G_n = \int q(W) \log \frac{q(W)}{p^*(W)}\, dW.$$
• $G_n$ depends on the training data; thus it is a random variable.
• Its expected value over the data has an asymptotic behavior:
$$\mathbb{E}[G_n] = \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right).$$
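A hedged sketch of how $G_n$ could be estimated numerically, following the Monte Carlo setup described later in these slides (K posterior samples, T test matrices); the function names are hypothetical, and the Gaussian normalizing constants cancel in the KL ratio:

```python
import numpy as np

def log_gaussian_likelihood(W, mean):
    """Unnormalized Gaussian log-likelihood; the normalizing constants
    of q and p(W|X,Y) are equal and cancel in log q - log p*."""
    return -0.5 * np.sum((W - mean) ** 2)

def estimate_gen_error(test_W, posterior_X, posterior_Y, A, B):
    """Monte Carlo estimate of G_n = E_q[log q(W) - log p*(W)].

    test_W:        (T, M, N) test matrices drawn from q
    posterior_X/Y: lists of K posterior samples of X and Y (e.g. MCMC)
    A, B:          true factors
    """
    g = 0.0
    for W in test_W:
        log_q = log_gaussian_likelihood(W, A @ B)
        # log p*(W) ≈ log of the posterior-averaged likelihood.
        log_ps = np.array([log_gaussian_likelihood(W, X @ Y)
                           for X, Y in zip(posterior_X, posterior_Y)])
        m = log_ps.max()  # log-sum-exp trick for numerical stability
        log_pred = m + np.log(np.mean(np.exp(log_ps - m)))
        g += log_q - log_pred
    return g / len(test_W)
```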
32. Index
• Introduction
• Main Theorem
– Bayesian Framework of NMF
– Main Result
• Discussion
• Conclusion
• (Appendix: Sketch of Proof)
35. Def. RLCT of NMF
• The RLCT of NMF is defined as minus the maximum pole of the following zeta function:
$$\zeta(z) = \iint \|XY - AB\|^{2z}\, dX\, dY.$$
• $\zeta(z)$ can be analytically continued to the entire complex plane, and its poles are negative rational numbers.
• The largest pole of $\zeta(z)$ equals $-\lambda$; this $\lambda$ is the RLCT of NMF.
[Figure: the poles of $\zeta(z)$ marked on the negative real axis of the complex plane $\mathbb{C}$; the largest pole is $z = -\lambda$.]
36. Main Theorem
• The RLCT of NMF $\lambda$ satisfies the following inequality:
$$\lambda \le \frac{1}{2}\Bigl[(H - H_0)\min(M, N) + H_0(M + N - 2) + \delta(H_0)\Bigr],$$
where
$$\delta(H_0) = \begin{cases} 1 & (H_0 \equiv 1 \bmod 2) \\ 0 & (\text{otherwise}). \end{cases}$$
• The equality holds if $H = H_0 = 1$ or $2$, or $H_0 = 0$.
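A small sketch (my own illustration, not code from the paper) that evaluates this upper bound:

```python
def delta(h0: int) -> int:
    """delta(H0) = 1 if H0 is odd, 0 otherwise."""
    return h0 % 2

def rlct_upper_bound(M: int, N: int, H: int, H0: int) -> float:
    """Upper bound on the RLCT of NMF given by the Main Theorem
    (called lambda_new on the comparison slide below)."""
    return 0.5 * ((H - H0) * min(M, N) + H0 * (M + N - 2) + delta(H0))

# Example: M = N = 4, learner rank H = 3, true rank H0 = 2:
print(rlct_upper_bound(4, 4, 3, 2))  # 0.5 * (1*4 + 2*6 + 0) = 8.0
```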
38. Index
• Introduction
• Main Theorem
• Discussion
– Tightness
– Theoretical Application
– Numerical Experiment and Conjecture
• Conclusion
• (Appendix: Sketch of Proof)
39. Tighter than previous
• The Main Theorem shows an upper bound of the RLCT of NMF.
• We derived another bound of it in previous research.
• How tight is the new bound?
40. Tighter than previous
• In previous work,
$$\lambda \le \lambda_{\mathrm{prv}} = \frac{1}{2}\Bigl[(H - H_0)\min(M, N) + H_0(M + N - 1)\Bigr].$$
• In this paper,
$$\lambda \le \lambda_{\mathrm{new}} = \frac{1}{2}\Bigl[(H - H_0)\min(M, N) + H_0(M + N - 2) + \delta(H_0)\Bigr].$$
• Comparing them, the improvement of the bound is
$$\lambda_{\mathrm{prv}} - \lambda_{\mathrm{new}} = \frac{1}{2}\bigl(H_0 - \delta(H_0)\bigr) \ge 0.$$
• The more complex the true distribution (the larger $H_0$), the tighter the new bound becomes relative to the previous one.
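Continuing the sketch above (again illustrative; `rlct_upper_bound` is the hypothetical helper defined after the Main Theorem):

```python
def rlct_upper_bound_prev(M: int, N: int, H: int, H0: int) -> float:
    """Previous upper bound lambda_prv [Hayashi, et al. 2017]."""
    return 0.5 * ((H - H0) * min(M, N) + H0 * (M + N - 1))

# Example: M = N = 4, H = 3, H0 = 2.
lam_prv = rlct_upper_bound_prev(4, 4, 3, 2)  # 0.5 * (4 + 2*7) = 9.0
lam_new = rlct_upper_bound(4, 4, 3, 2)       # 8.0
print(lam_prv - lam_new)                     # 1.0 = (H0 - delta(H0)) / 2
```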
41. Index
• Introduction
• Main Theorem
• Discussion
– Tightness
– Theoretical Application
– Numerical Experiment and Conjecture
• Conclusion
• (Appendix: Sketch of Proof)
42. Bound of Error
• The Main Theorem gives an upper bound of the Bayesian generalization error via
$$\mathbb{E}[G_n] = \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right).$$
• Actually, we have
$$\mathbb{E}[G_n] \le \frac{1}{2n}\Bigl[(H - H_0)\min(M, N) + H_0(M + N - 2) + \delta(H_0)\Bigr] + o\!\left(\frac{1}{n}\right).$$
– This gives a guarantee of accuracy!
• For which distributions can we bound the error?
43. Robustness on Dist.
• The Main Theorem assumes that the data matrix is normally distributed around the product of the parameter matrices:
$$q(W) \propto \mathcal{N}(W \mid AB), \qquad p(W \mid X, Y) \propto \mathcal{N}(W \mid XY).$$
• Can the Main Theorem be used even for other distributions?
44. Robustness on Dist.
• In the prior work [Hayashi, et al. 2017], we proved that the same zeta function can be applied to the Poisson and exponential distributions. Even if
$$q(W) \propto \mathrm{Poi}(W \mid AB), \quad p(W \mid X, Y) \propto \mathrm{Poi}(W \mid XY)$$
or
$$q(W) \propto \mathrm{Expo}(W \mid AB), \quad p(W \mid X, Y) \propto \mathrm{Expo}(W \mid XY),$$
we can use
$$\zeta(z) = \iint \|XY - AB\|^{2z}\, dX\, dY.$$
46. Robustness on Dist.
• The above result follows from the fact that the I-divergence and the Itakura–Saito (IS) divergence have the same RLCT as the squared error:

Distribution:  Normal    | Poisson      | Exponential
Similarity:    Sq. error | I-divergence | IS-divergence

• Since the RLCTs are the same, we can use
$$\zeta(z) = \iint \|XY - AB\|^{2z}\, dX\, dY$$
for all three distributions; thus the Main Theorem is attained in each case.
47. Index
• Introduction
• Main Theorem
• Discussion
– Tightness
– Theoretical Application
– Numerical Experiment and Conjecture
• Conclusion
• (Appendix: Sketch of Proof)
48. Numerical Experiment
• We carried out experiments to estimate the exact value of the RLCT,
– (+) and to compare it with the RLCT of reduced rank regression (RRR), i.e. non-restricted matrix factorization.
• The posterior cannot be derived analytically.
→ Markov chain Monte Carlo (MCMC)
– We used the Metropolis–Hastings method, sketched below.
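A minimal random-walk Metropolis–Hastings sketch for the NMF posterior (my own illustration under the Gaussian model above; the proposal distribution, step size, and prior support here are assumptions, as the slides do not specify them):

```python
import numpy as np

def mh_sample_posterior(W, H, n_iter=40_000, burn_in=20_000, thin=20,
                        step=0.05, seed=0):
    """Random-walk Metropolis-Hastings over (X, Y), assuming a uniform
    prior on [0, 1]^dim; returns thinned posterior samples."""
    rng = np.random.default_rng(seed)
    n, M, N = W.shape
    X = rng.uniform(size=(M, H))
    Y = rng.uniform(size=(H, N))

    def log_post(X, Y):
        # Sum of Gaussian log-likelihoods over the n data matrices;
        # the flat prior only adds a constant on its support.
        return -0.5 * np.sum((W - X @ Y) ** 2)

    lp = log_post(X, Y)
    samples = []
    for t in range(n_iter):
        Xp = X + step * rng.standard_normal(X.shape)
        Yp = Y + step * rng.standard_normal(Y.shape)
        in_support = (Xp.min() >= 0 and Xp.max() <= 1
                      and Yp.min() >= 0 and Yp.max() <= 1)
        if in_support:  # proposals outside the support are rejected
            lpp = log_post(Xp, Yp)
            if np.log(rng.uniform()) < lpp - lp:
                X, Y, lp = Xp, Yp, lpp
        if t >= burn_in and (t - burn_in) % thin == 0:
            samples.append((X.copy(), Y.copy()))
    return samples
```

With the defaults above (burn-in 20,000, thin 20, 40,000 iterations), this yields exactly K = 1,000 posterior samples, matching the experimental condition quoted later.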
49. Non-negative Rank
• We made artificial data and set up the following cases:
– 1. The exact value of the RLCT of NMF is known.
– 2. The exact value is unknown and rank = rank+.
– 3. The exact value is unknown and rank ≠ rank+.
• rank+ : the minimal inner dimension of an exact non-negative factorization.
– This is called the non-negative rank.
– In general, rank+ ≥ rank holds.
– If min{rows, columns} ≤ 3 or rank ≤ 2, then rank = rank+.
– There exist non-negative matrices such that rank < rank+ (see the example below).
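For illustration (a standard example, not taken from the slides): the following 4 × 4 non-negative matrix has ordinary rank 3, while its non-negative rank is known to be 4.

```python
import numpy as np

# Classic example with rank < rank+: ordinary rank 3,
# non-negative rank 4 (a known result).
W = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1]], dtype=float)

print(np.linalg.matrix_rank(W))  # 3
```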
50. Condition of Experiments
• Sample size n = 200 (parameter dimension ≤ 50).
• Number of data sets D = 100.
→ We empirically calculated the RLCT using
$$\mathbb{E}[G_n] = \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right) \;\Rightarrow\; \lambda \approx n\,\mathbb{E}[G_n] \approx \frac{n}{D}\sum_{j=1}^{D} G_n^{(j)}.$$
• MCMC sample size K = 1,000.
– Burn-in = 20,000, thin = 20; i.e., the number of sampling iterations is 40,000.
• For calculating $G_n$, we generated T = 20,000 test data from the true distribution.
• Total: 100 × (40,000 + 1,000 × 20,000) ≈ O(DKT) operations.
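The final estimator is simple once the per-dataset generalization errors are available (a sketch; `estimate_gen_error` is the hypothetical helper from the earlier snippet):

```python
import numpy as np

def estimate_rlct(gen_errors, n):
    """lambda ≈ n * E[G_n], approximating the expectation by the
    average of the D per-dataset generalization errors G_n^(j)."""
    return n * float(np.mean(gen_errors))

# With the slide's setting n = 200 and D = 100, gen_errors would hold
# one estimate_gen_error(...) result per artificial data set:
# lam_hat = estimate_rlct(gen_errors, n=200)
```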
51. Numerical Result
• Legend of the result table:
– $\lambda_N$ : numerically calculated value,
– $\lambda$ : the exact value of the RLCT of NMF (when known),
– $\lambda_B$ : the upper bound for NMF,
– $\lambda_R$ : the exact value for RRR,
– r : the true rank.
[Table of numerical results not recoverable from the transcript; see the paper.]
52. Numerical Result
• The numerical results equal the theoretical value: $\lambda_N = \lambda$.
→ The numerical calculation is correct!
53. Numerical Result
• If rank = rank+, then the numerical results equal the RRR case: $\lambda_N = \lambda_R$.
→ It seems that if rank = rank+, then the RLCT of NMF satisfies $\lambda = \lambda_R$.
54. Numerical Result
• If rank ≠ rank+, then the numerical results are larger than in the RRR case: $\lambda_N > \lambda_R$.
→ It seems that if rank ≠ rank+, then the RLCT of NMF is larger than in the RRR case: $\lambda > \lambda_R$.
57. Index
• Introduction
• Main Theorem
• Discussion
• Conclusion
• (Appendix: Sketch of Proof)
58. Conclusion
• (Main contribution) We mathematically improved the upper bound of the RLCT of NMF.
– This made the bound of the generalization error in Bayesian NMF tighter.
• (Minor contribution) We carried out experiments and suggested a conjecture about the exact value of the RLCT:
– rank = rank+ ⇒ RLCT of NMF = RLCT of RRR,
– rank ≠ rank+ ⇒ RLCT of NMF > RLCT of RRR.
60. Sketch of Proof
• We had already derived the exact value in the cases $H_0 = 0$ and $H = H_0 = 1$.
• We newly clarified the exact value in the case $H = H_0 = 2$ by considering the dimension of an algebraic subvariety in the parameter space.
• We bounded the RLCT in the case $H = H_0$ by using the exact value clarified above.
• We bounded the RLCT in the general case by using the above results.