1




Hilofumi Yamamoto
November 8, 2008
2




•   (   , 2005, 2006, 2007)
•
•
•
•
•          (     , 1983;      , 1989)
•
3




[Figure: timeline from 900 to 1250 marking eight dated items at 905, 951, 1007, 1086, 1124, 1144, 1188 and 1205, with intervals of 46, 56, 79, 38, 20, 44 and 17 years between consecutive items; item labels lost in extraction]
4




•
•       (1976)
•       (1991)
•     (1998)
•
•
•
• →
5




•
•             9484
    (                                        )
• kh              (β   )
•         (                     )      t2c


•
•       (48732)        (1408)       (49)
6




… / の / 内 / に / 春 / は / 来 / に / けり / 鴬 / の / 凍れ / る / 涙 / 今 / や / 解く / らむ




•     –            –            –        ...
7




•
•                                    (      , 1983)
•                                           (   , 1996)
• idf (Spärck Jones, 1972)

    \[ \mathrm{idf}(t, N) = \log \frac{N}{\mathrm{df}(t)} \]
8


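idf: inverse document frequency

\begin{align}
\mathrm{idf}(\mathit{ari}, N) &= \log \frac{N}{\mathrm{df}(\mathit{ari})} \tag{1} \\
&= \log \frac{9484}{1201} \tag{2} \\
&= \log 7.89\ldots = 2.07\ldots \tag{3} \\
\mathrm{idf}(\mathit{uguisu}, N) &= \log \frac{N}{\mathrm{df}(\mathit{uguisu})} \tag{4} \\
&= \log \frac{9484}{101} \tag{5} \\
&= \log 93.90\ldots = 4.54\ldots \tag{6}
\end{align}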
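As a quick check of the arithmetic above, the sketch below recomputes both values in Python. It assumes the natural logarithm, since log 7.89.. = 2.07.. only holds in base e. The function name `idf` and the constant `N_DOCS` are illustrative and do not come from the slides.

```python
import math

N_DOCS = 9484  # N in equations (2) and (5)


def idf(df: int, n_docs: int) -> float:
    """Inverse document frequency, log(N / df), with a natural logarithm (assumed)."""
    if not 0 < df <= n_docs:
        raise ValueError("df must satisfy 0 < df <= n_docs")
    return math.log(n_docs / df)


idf_ari = idf(1201, N_DOCS)     # df(ari) = 1201
idf_uguisu = idf(101, N_DOCS)   # df(uguisu) = 101

print(round(idf_ari, 2))     # 2.07
print(round(idf_uguisu, 2))  # 4.54
```

The rarer term (uguisu, 101 documents) receives roughly twice the weight of the common one (ari, 1201 documents).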
9



[Figure: "L-Shape Freq-Type" plot; x-axis: frequency (0 to 2000); y-axis: number of type (0 to 3500)]
10



[Figure: "J-Shape IDF-Type" plot; x-axis: inverse document frequency (idf) (1 to 9); y-axis: number of type (0 to 1200)]
11




•                                (     )




•


• tfidf

    \[ w(t, K, N) = (1 + \log \mathrm{tf}(t, K)) \, \mathrm{idf}(t, N) \]
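The tfidf weight w can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: the natural logarithm is assumed (it is the base that reproduces the idf values 2.07 and 4.54 given earlier), the function name `tfidf_weight` is hypothetical, and the term frequencies 1 and 5 are made-up inputs. Only N = 9484 and df = 101 are taken from the slides.

```python
import math


def tfidf_weight(tf: int, df: int, n_docs: int) -> float:
    """w(t, K, N) = (1 + log tf(t, K)) * idf(t, N), natural log assumed."""
    if tf <= 0:
        # Assumption: a term that never occurs in K gets weight 0.
        return 0.0
    return (1.0 + math.log(tf)) * math.log(n_docs / df)


# Hypothetical term frequencies; df and N as in the uguisu example.
w_once = tfidf_weight(tf=1, df=101, n_docs=9484)
w_five = tfidf_weight(tf=5, df=101, n_docs=9484)
print(f"{w_once:.2f} {w_five:.2f}")
```

Because tf enters through 1 + log tf, a term seen five times weighs about 2.6 times as much as a term seen once, rather than five times as much.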
12



                                      (cw)


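\begin{align}
w(t, K, N) &= (1 + \log \mathrm{tf}(t, K)) \, \mathrm{idf}(t, N) \tag{7} \\
\mathrm{cidf}(t_1, t_2, N) &= \sqrt{\mathrm{idf}(t_1, N) \, \mathrm{idf}(t_2, N)} \tag{8} \\
\mathrm{ctf}(t_1, t_2, K) &= 1 + \log |\{k : t_1, t_2 \in k\}| \tag{9}
\end{align}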


• K

• (8)

• (9)     K

•
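Equations (8) and (9) can be illustrated with a short Python sketch. It rests on several assumptions: the logarithm is natural, K is a list of documents each represented as a set of tokens, and a pair that never co-occurs gets ctf = 0. The helper names and the three toy documents are hypothetical. Only N = 9484 and the document frequencies 1201 and 101 come from the slides.

```python
import math


def idf(df: int, n_docs: int) -> float:
    """idf(t, N) = log(N / df(t)), natural log assumed."""
    return math.log(n_docs / df)


def cidf(df1: int, df2: int, n_docs: int) -> float:
    """Equation (8): geometric mean of the two idf values."""
    return math.sqrt(idf(df1, n_docs) * idf(df2, n_docs))


def ctf(t1: str, t2: str, docs_k: list[set[str]]) -> float:
    """Equation (9): 1 + log of the number of documents in K holding both terms."""
    n_both = sum(1 for doc in docs_k if t1 in doc and t2 in doc)
    # Assumption: pairs that never co-occur in K get 0 instead of log(0).
    return 1.0 + math.log(n_both) if n_both else 0.0


# cidf for the pair (ari, uguisu) using the document frequencies from the slides.
cidf_pair = cidf(1201, 101, 9484)
print(round(cidf_pair, 2))  # 3.06

# Toy sub-collection K (made up for illustration).
toy_k = [{"uguisu", "haru", "naku"}, {"uguisu", "haru"}, {"haru", "yuki"}]
ctf_pair = ctf("uguisu", "haru", toy_k)  # two documents contain both terms
```

The geometric mean keeps cidf on the same scale as a single idf value, so a pair of one common and one rare term lands between the two.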
13



cidf

[Figure: x-axis: cidf (0 to 9); y-axis: frequency of patterns (0 to 1000)]
14



                                                     (cw)


\begin{align}
\mathrm{ictf}(t_1, t_2, N) &= 1 + \log \frac{|N|}{|\{n : t_1, t_2 \in n\}|} \tag{10} \\
\mathrm{cw}(t_1, t_2) &= \mathrm{ctf}(t_1, t_2, K) \, \mathrm{ictf}(t_1, t_2, N) \, \mathrm{cidf}(t_1, t_2, N) \tag{11}
\end{align}

         • K                                     N

         •

         •                       K

         •                       N
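Putting equations (8) to (11) together, the co-occurrence weight can be sketched as below. This is a hedged illustration under stated assumptions, not the original implementation: documents are modelled as sets of tokens, K is taken to be a sub-collection of the whole collection N, the logarithm is natural, and a pair that never co-occurs gets cw = 0. The function names and the four-document toy corpus are hypothetical.

```python
import math


def pair_count(t1: str, t2: str, docs: list[set[str]]) -> int:
    """Number of documents that contain both terms."""
    return sum(1 for doc in docs if t1 in doc and t2 in doc)


def doc_freq(term: str, docs: list[set[str]]) -> int:
    """Number of documents that contain the term."""
    return sum(1 for doc in docs if term in doc)


def cw(t1: str, t2: str, docs_k: list[set[str]], docs_n: list[set[str]]) -> float:
    """Equation (11): cw = ctf * ictf * cidf (natural log assumed)."""
    in_k = pair_count(t1, t2, docs_k)
    in_n = pair_count(t1, t2, docs_n)
    if in_k == 0 or in_n == 0:
        return 0.0  # assumption: undefined logs are mapped to zero weight
    n_total = len(docs_n)
    ctf = 1.0 + math.log(in_k)                       # equation (9)
    ictf = 1.0 + math.log(n_total / in_n)            # equation (10)
    idf1 = math.log(n_total / doc_freq(t1, docs_n))
    idf2 = math.log(n_total / doc_freq(t2, docs_n))
    cidf = math.sqrt(idf1 * idf2)                    # equation (8)
    return ctf * ictf * cidf


# Toy collection N and sub-collection K (made up for illustration).
toy_n = [
    {"yuki", "haru", "uguisu"},
    {"haru", "uguisu", "naku"},
    {"haru", "kasumi"},
    {"aki", "kaze"},
]
toy_k = toy_n[:2]

cw_pair = cw("haru", "uguisu", toy_k, toy_n)
print(f"{cw_pair:.2f}")
```

The three factors pull in different directions: ctf rewards pairs that are frequent inside K, ictf penalises pairs that are frequent everywhere in N, and cidf favours pairs built from individually rare terms.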
15



cw

[Figure: cumulative frequency of patterns (0 to 900) against co-occurrence weight (cw) (0 to 100), with eight curves labelled 1 to 8]
16



1σ




         16
     (        )
17
18
19
20
21
                                       (1)

         t1–t2     cw       z   ctf   idf(t1)   idf(t2)
(24)       –     86.06   3.33    10     3.18      4.63
           –     65.15   1.76     5     3.18      3.26
           –     64.32   1.70     2     3.43      4.69
           –     63.36   1.62     2     3.18      4.92
           –     61.87   1.51     2     3.18      4.69
           –     60.36   1.40     4     3.18      3.18
           –     55.34   1.02     2     3.18      4.37
(11)       –     54.69   1.33     3     3.18      4.63
           –     52.40   1.12     3     3.18      3.26
           –     51.40   1.03     1     3.18      8.06
           –     51.28   1.02     2     3.43      4.63
(15)       –     80.25   3.74     8     3.18      4.63
           –     55.90   1.54     2     3.18      3.83
           –     54.92   1.46     8     3.18      2.08
           –     54.35   1.40     2     3.18      3.95
           –     52.42   1.23     2     3.18      3.37
           –     50.48   1.05     1     3.18      7.77
(3)      N/A
22
                                       (2)

         t1–t2     cw       z   ctf   idf(t1)   idf(t2)
(5)        –     72.27   3.34     4     3.43      4.63
           –     52.17   1.44     2     3.43      3.95
           –     51.68   1.40     2     3.43      3.71
           –     51.00   1.33     2     3.43      3.43
           –     49.48   1.19     4     3.43      2.08
           –     48.33   1.08     1     3.43      6.59
           –     47.56   1.01     1     3.43      6.38
(6)      N/A
(9)      N/A
(24)       –     63.56   1.64     3     3.43      4.63
           –     62.38   1.55     3     3.43      3.14
           –     62.18   1.53     4     3.18      4.63
           –     56.96   1.14     1     3.43      9.16
23




•


•   (cw)   z       1σ

      1σ(16    )
•


•
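The z column in the result tables and the 1σ cut-off can be illustrated with a small Python sketch. The slides do not say how z was computed, so this assumes the usual standardisation z = (cw - mean) / standard deviation over all patterns of one group, using the population standard deviation. The function names and the five toy weights are hypothetical. For a normal distribution, about 15.9% of values lie above +1σ, which may be what the "16" next to 1σ refers to.

```python
import statistics


def z_scores(values: list[float]) -> list[float]:
    """Standardise values with the population standard deviation (assumed)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]


def above_one_sigma(weights: dict[str, float]) -> dict[str, float]:
    """Keep only the patterns whose cw lies at least 1 sigma above the mean."""
    pairs = list(weights)
    zs = z_scores([weights[p] for p in pairs])
    return {p: z for p, z in zip(pairs, zs) if z >= 1.0}


# Toy cw values for five hypothetical patterns.
toy_weights = {"a-b": 10.0, "a-c": 20.0, "a-d": 30.0, "a-e": 40.0, "a-f": 90.0}
selected = above_one_sigma(toy_weights)
print(selected)

# Share of a standard normal distribution above +1 sigma (about 0.159).
upper_tail = 1.0 - statistics.NormalDist().cdf(1.0)
print(round(upper_tail, 3))
```

Every row in the two result tables has z of at least 1.0, which is consistent with a filter of this kind.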
24




•
•
•


•
    http://etymology.jp/waka/poem.cgi
    XML(SVG)
•

Document Analysis of Japanese Text Corpora Using TF-IDF and Co-Occurrence Methods
