Document Analysis of Japanese Text Corpora Using TF-IDF and Co-Occurrence Methods
- 1. 1
Hilofumi Yamamoto
November 8, 2008
- 2. 2
• ( , 2005, 2006, 2007)
•
•
•
•
• ( , 1983; , 1989)
•
- 3. 3
[Figure: chart with a horizontal axis running from 900 to 1250; the labels and data series were not recoverable from the extracted text.]
- 4. 4
•
• (1976)
• (1991)
• (1998)
•
•
•
• →
- 5. 5
•
• 9484
( )
• kh (β )
• ( ) t2c
•
• (48732) (1408) (49)
- 6. 6
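/ の / 内 / に / 春 / は / 来 / に / けり / 鴬 / の / 凍れ / る / 涙 / 今 / や / 解く / らむ
(romanized: no / uchi / ni / haru / wa / ki / ni / keri / uguisu / no / kōre / ru / namida / ima / ya / toku / ramu)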
• – – – ...
- 7. 7
•
• ( , 1983)
• ( , 1996)
• idf (Spärck Jones, 1972)
$$\mathrm{idf}(t, N) = \log \frac{N}{\mathrm{df}(t)}$$
- 8. 8
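idf: inverse document frequency
\begin{align}
\mathrm{idf}(\textit{ari}, N) &= \log \frac{N}{\mathrm{df}(\textit{ari})} \tag{1} \\
&= \log \frac{9484}{1201} \tag{2} \\
&= \log 7.89\ldots \approx 2.07 \tag{3} \\
\mathrm{idf}(\textit{uguisu}, N) &= \log \frac{N}{\mathrm{df}(\textit{uguisu})} \tag{4} \\
&= \log \frac{9484}{101} \tag{5} \\
&= \log 93.90\ldots \approx 4.54 \tag{6}
\end{align}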
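The two worked values can be reproduced with a few lines of Python. This is only a sketch: the function name `idf` and its argument names are illustrative, and the natural logarithm is assumed because it is the base that reproduces the figures on the slide.

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency, log(N / df), using the natural logarithm."""
    if doc_freq <= 0 or doc_freq > n_docs:
        raise ValueError("doc_freq must be between 1 and n_docs")
    return math.log(n_docs / doc_freq)

N = 9484  # corpus size quoted on the slide

# A frequent term (ari, df = 1201) gets a low idf,
# a rarer term (uguisu, df = 101) gets a higher one.
print(f"idf(ari)    = {idf(N, 1201):.2f}")   # 2.07
print(f"idf(uguisu) = {idf(N, 101):.2f}")    # 4.54
```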
- 9. 9
[Figure: L-Shape Freq-Type. Horizontal axis: frequency (0 to 2000); vertical axis: number of type (0 to 3500).]
- 10. 10
[Figure: J-Shape IDF-Type. Horizontal axis: inverse document frequency (idf), 1 to 9; vertical axis: number of type (0 to 1200).]
- 11. 11
• ( )
•
• tfidf
$$w(t, K, N) = (1 + \log \mathrm{tf}(t, K)) \, \mathrm{idf}(t, N)$$
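A minimal sketch of this weight in Python follows. The function name `term_weight` and its arguments are illustrative. Returning 0 for a term that does not occur in K is an assumption made here, since the formula itself is undefined for tf = 0. Natural logarithms are assumed, as in the idf examples.

```python
import math

def term_weight(tf, n_docs, doc_freq):
    """w = (1 + log tf) * idf, with natural logarithms.

    tf       : occurrences of the term in the subset K
    n_docs   : number of documents in the whole collection N
    doc_freq : number of documents in N that contain the term
    """
    if tf <= 0:
        return 0.0  # assumption: an absent term carries no weight
    idf = math.log(n_docs / doc_freq)
    return (1.0 + math.log(tf)) * idf

# With tf = 1 the weight reduces to the idf itself; larger tf values
# raise it only logarithmically. The tf values below are made up.
for tf in (1, 2, 10):
    print(tf, round(term_weight(tf, 9484, 101), 2))
```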
- 12. 12
(cw)
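\begin{align}
w(t, K, N) &= (1 + \log \mathrm{tf}(t, K)) \, \mathrm{idf}(t, N) \tag{7} \\
\mathrm{cidf}(t_1, t_2, N) &= \sqrt{\mathrm{idf}(t_1, N) \, \mathrm{idf}(t_2, N)} \tag{8} \\
\mathrm{ctf}(t_1, t_2, K) &= 1 + \log |\{k : t_1, t_2 \in k\}| \tag{9}
\end{align}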
• K
• (8)
• (9) K
•
- 13. 13
[Figure: distribution of cidf. Horizontal axis: cidf (0 to 9); vertical axis: frequency of patterns (0 to 1000).]
- 14. 14
(cw)
\begin{align}
\mathrm{ictf}(t_1, t_2, N) &= 1 + \log \frac{|N|}{|\{n : t_1, t_2 \in n\}|} \tag{10} \\
\mathrm{cw}(t_1, t_2) &= \mathrm{ctf}(t_1, t_2, K) \, \mathrm{ictf}(t_1, t_2, N) \, \mathrm{cidf}(t_1, t_2, N) \tag{11}
\end{align}
• K N
•
• K
• N
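The co-occurrence weight can be sketched in Python as the product of its three factors. All function and argument names below are illustrative, and natural logarithms are assumed. The sample call takes idf values 3.18 and 8.06 from a row of the results tables and assumes that the pair co-occurs once in K and once in the whole collection of 9484 documents. Under those assumptions the result comes out close to the tabulated cw of 51.40, with the small gap attributable to the rounded idf inputs.

```python
import math

def cidf(idf1, idf2):
    """Geometric mean of the two terms' idf values."""
    return math.sqrt(idf1 * idf2)

def ctf(k_count):
    """1 + log of the number of documents in K containing both terms."""
    return 1.0 + math.log(k_count)

def ictf(n_docs, n_count):
    """1 + log(|N| / number of documents in N containing both terms)."""
    return 1.0 + math.log(n_docs / n_count)

def cw(k_count, n_count, n_docs, idf1, idf2):
    """Co-occurrence weight: ctf * ictf * cidf."""
    return ctf(k_count) * ictf(n_docs, n_count) * cidf(idf1, idf2)

# Assumed inputs: one co-occurrence in K, one in N, |N| = 9484.
print(round(cw(1, 1, 9484, 3.18, 8.06), 1))
```

Because ctf and ictf are both logarithmic, a pair that is frequent inside K but rare across N scores highest, while cidf favours pairs in which both terms are individually rare.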
- 15. 15
[Figure: distribution of cw, with regions numbered 1 to 8 and a z annotation. Horizontal axis: co-occurrence weight (cw), 0 to 100; vertical axis: cumulative frequency of patterns (0 to 900).]
- 21. 21
(1)
        t1–t2   cw      z      ctf   idf(t1)   idf(t2)
(24)    –       86.06   3.33   10    3.18      4.63
        –       65.15   1.76    5    3.18      3.26
        –       64.32   1.70    2    3.43      4.69
        –       63.36   1.62    2    3.18      4.92
        –       61.87   1.51    2    3.18      4.69
        –       60.36   1.40    4    3.18      3.18
        –       55.34   1.02    2    3.18      4.37
(11)    –       54.69   1.33    3    3.18      4.63
        –       52.40   1.12    3    3.18      3.26
        –       51.40   1.03    1    3.18      8.06
        –       51.28   1.02    2    3.43      4.63
(15)    –       80.25   3.74    8    3.18      4.63
        –       55.90   1.54    2    3.18      3.83
        –       54.92   1.46    8    3.18      2.08
        –       54.35   1.40    2    3.18      3.95
        –       52.42   1.23    2    3.18      3.37
        –       50.48   1.05    1    3.18      7.77
(3)     N/A
- 22. 22
(2)
        t1–t2   cw      z      ctf   idf(t1)   idf(t2)
(5)     –       72.27   3.34    4    3.43      4.63
        –       52.17   1.44    2    3.43      3.95
        –       51.68   1.40    2    3.43      3.71
        –       51.00   1.33    2    3.43      3.43
        –       49.48   1.19    4    3.43      2.08
        –       48.33   1.08    1    3.43      6.59
        –       47.56   1.01    1    3.43      6.38
(6)     N/A
(9)     N/A
(24)    –       63.56   1.64    3    3.43      4.63
        –       62.38   1.55    3    3.43      3.14
        –       62.18   1.53    4    3.18      4.63
        –       56.96   1.14    1    3.43      9.16
- 23. 23
•
• (cw) z 1σ
1σ(16 )
•
•
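The selection by z score at 1σ can be illustrated with a short Python sketch. It is not the author's implementation: the function name, the sample pairs, and the use of the population standard deviation are all assumptions. For a roughly normal distribution about 16% of values lie above +1σ, which appears to match the "16" noted on the slide. Every pair listed in the results tables has z of at least 1.0.

```python
import statistics

def select_by_z(cw_by_pair, threshold=1.0):
    """Return (pair, cw, z) tuples whose z score is at least `threshold`,
    sorted from highest to lowest z."""
    values = list(cw_by_pair.values())
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    if sd == 0:
        return []  # all weights identical, nothing stands out
    scored = [(pair, w, (w - mean) / sd) for pair, w in cw_by_pair.items()]
    kept = [item for item in scored if item[2] >= threshold]
    return sorted(kept, key=lambda item: item[2], reverse=True)

# Hypothetical weights for four co-occurrence patterns.
sample = {("a", "b"): 10.0, ("a", "c"): 12.0, ("a", "d"): 11.0, ("a", "e"): 30.0}
for pair, w, z in select_by_z(sample):
    print(pair, w, round(z, 2))
```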
- 24. 24
•
•
•
•
http://etymology.jp/waka/poem.cgi
XML(SVG)
•