Asialex 2011 Kyoto, Japan                                          1



       Development of the Thesaurus of Classical
             Japanese Poetic Vocabulary




                                Hilofumi Yamamoto
                            Tokyo Institute of Technology
                                   Makiro Tanaka
         National Institute of Japanese Language and Linguistics

                                  22nd Aug. 2011




       Outline
         1. Purpose of Study
              • Connotation of classical poetic vocabulary
              • Longitudinal study of transition of vocabulary
         2. Development of Thesaurus
         3. Applications




       Waka: Japanese Poetry




                            Tatsuta-Hime..
                            tamukuru KAMI no / arebakoso
                            aki no konoha no / nusa to chirurame

                            because Princess Tatsuta
                            has a god to whom she offers brocades,
                            the leaves of trees
                            in autumn will scatter
                            as an offering.

                                                 Prince Kanemi
                                                 No. 298 in the Kokinshū




       Problem: Orthography
                                in Chinese characters

                  in hiragana




                                → All Tatsuta (place name)




       Problem: Unit size / attribution
       The unit size and meaning of a word depend on its context.
         • unit →           or          (Nakano, 1998)
         • orthography →
           (sad)
         • attributions →         ∈ plant or       ∈ food
            (unohana = a deutzia or bean curd refuse)


       An Item of Thesaurus: God

                BG-01-2030-01-030-A-                                    -
                  ↑       ↑        ↑        ↑        ↑      ↑      ↑         ↑
                 (1)     (2)      (3)      (4)      (5)    (6)    (7)       (8)

          Figure 1: Structure of an item of BG database in the case of kami (god):
                    (1) database ID (BG = short-unit general vocabulary);
                    (2) part of speech ID (01 = noun);
                    (3) group ID (2030 = Shinto deities and Buddhas);
                    (4) field ID;
                    (5) exact ID (030 = god);
                    (6) era-flag (A = contemporary, C = classic);
                    (7) Chinese character reading;
                    (8) Chinese character
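
The item structure in Figure 1 can be illustrated with a small parser. This is only a sketch: the class `ThesaurusCode` and the function `parse_code` are hypothetical names introduced here, and the split on hyphens assumes that fields (7) and (8) contain no hyphen themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThesaurusCode:
    """One thesaurus item, following the eight fields listed in Figure 1."""
    database: str        # (1) database ID, e.g. "BG"
    pos: str             # (2) part of speech ID, e.g. "01" = noun
    group: str           # (3) group ID, e.g. "2030"
    field: str           # (4) field ID
    exact: str           # (5) exact ID, e.g. "030" = god
    era: str             # (6) era flag: "A" = contemporary, "C" = classic
    reading: str = ""    # (7) Chinese character reading (may be absent)
    character: str = ""  # (8) Chinese character (may be absent)


def parse_code(item: str) -> ThesaurusCode:
    """Split a hyphen-separated item into its fields (hypothetical helper)."""
    parts = item.split("-", 7)
    if len(parts) < 6:
        raise ValueError(f"expected at least 6 fields, got {len(parts)}: {item!r}")
    parts += [""] * (8 - len(parts))  # pad the optional fields (7) and (8)
    return ThesaurusCode(*parts)


kami = parse_code("BG-01-2030-01-030-A")
print(kami.database, kami.pos, kami.group, kami.field, kami.exact, kami.era)
```

Keeping the fields separate makes it straightforward to compare two items at the group, field, or exact level later on.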




       Development: Thesaurus, KH, and t2c
         • Thesaurus for classical poetic vocabulary
         • KH (tokenizer)
         • t2c (token to code converter)



        Materials: the Hachidaishū
           • The Hachidaishū: eight anthologies compiled by
             imperial orders during ca. 905–1205.
           • The database: compiled by the National Institute of
             Japanese Literature, Japan.
           • Old texts are taken from the Shōhobonban version of the
             Hachidaishū.

           Timeline of the eight anthologies (time axis 900–1250; the number
           on the right is the interval in years to the next anthology):
             Kokinshū      (905)     46
             Gosenshū      (951)     56
             Shūishū       (1007)    79
             Goshūishū     (1086)    38
             Kin'yōshū     (1124)    20
             Shikashū      (1144)    44
             Senzaishū     (1188)    17
             Shinkokinshū  (1205)




       Methods: Flowchart of data processing



          A  Corpus development
          B  Tokenisation
          C  Meta-code conversion
          D  Mathematical modelling
          E  Subtraction: CT−OP
          F  Visualisation

                  Table 1: An example of input for KH / Gosenshū No. 664
         input:  000664
         output: 000664
                 [The Japanese text of the poem and its token-by-token
                  analysis are not legible in this copy.]




       Development: Thesaurus

                                                     Thesaurus
                              Tokeniser              code tagger



         Poem Texts               kh                      t2c                    Hachidaishu
                                                                                  Thesaurus

                            add unknown entries             add new thesaurus codes

                            Dictionary            General, Place Name
                                                  Personal Name, etc
                                  (A)                     (B)




       (A) Corpus: Poems (OP)

             KW00029800|A|KANEMI NO Ō=kanemi no ō
             KW00029800|B|Tatsutahime[NOUN-PLNAME:TATSUTAHIME]/→
                        tamukuru[KASHIMO2-ATTR:TAMUkuru],kami[NOUN:KAMI]→
                        no[SUB]are[RAHEN-REAL]ba[CAUS]koso[KP]/→
                        aki[NOUN:AKI]no[CON],konoha[NOUN:KOnoHA]no[SUB]/→
                        nusa[NOUN:NUSA]to[P-CRD],chiru[RA4DAN-FF:CHIru]→
                        rame[CJR-REAL]/

          Figure 2: Format of the database of a poem: → indicates continuing to the
                    next line without breaks; the first line, which includes |A|, indicates
                    the name of the poet; the second line which includes |B|, indicates
                    the contents of the poem and added information.
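
The record format in Figure 2 lends itself to a regular-expression reader. The following is a minimal sketch, not the authors' tool: `parse_poem_record` and `TOKEN` are hypothetical names, the continuation arrows are assumed to have been joined already, and the part after the colon inside the brackets is kept as an uninterpreted label because the slides do not define it.

```python
import re

# surface form, then [TAG] or [TAG:LABEL]; "/" and "," separators are skipped
TOKEN = re.compile(r"([A-Za-z]+)\[([A-Z0-9-]+)(?::([A-Za-z]+))?\]")


def parse_poem_record(record: str):
    """Return (poem_id, line_type, tokens) for one '|'-separated record."""
    poem_id, line_type, body = record.split("|", 2)
    tokens = [
        (surface, tag, label or None)  # label is None when no ':' part exists
        for surface, tag, label in TOKEN.findall(body)
    ]
    return poem_id, line_type, tokens


record = (
    "KW00029800|B|Tatsutahime[NOUN-PLNAME:TATSUTAHIME]/"
    "tamukuru[KASHIMO2-ATTR:TAMUkuru],kami[NOUN:KAMI]"
    "no[SUB]are[RAHEN-REAL]ba[CAUS]koso[KP]/"
    "aki[NOUN:AKI]no[CON],konoha[NOUN:KOnoHA]no[SUB]/"
    "nusa[NOUN:NUSA]to[P-CRD],chiru[RA4DAN-FF:CHIru]"
    "rame[CJR-REAL]/"
)

poem_id, line_type, tokens = parse_poem_record(record)
for surface, tag, label in tokens:
    print(surface, tag, label)
```

Each token comes back as a `(surface, tag, label)` triple, so the grammatical tags can be filtered without touching the surface text.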




       (A) Corpus: Translations (CT)
           $A|000298
           $B|                                                         →

           $C|
           $D|                                                         →

           $I|                                                         →
                                                                        →


                            Figure 3: Format of the database of a CT




       (B) Tokenisation:
            original text


               ↓
            tokenising
                   /        / / /[     ]/ /    / / /         / / / /   /[   ]
               ↓
            converting into predicative form
                   /        / / /[     ]/ /    / / /         / / / /   /[   ]

                             Figure 4: Tokenisation of poem texts




       (C) meta-code conversion
          CH-29-2130-01-010-A                    Tatsutahime   Princess-Tatsuta
          CH-29-0000-14-010-A   --               -- Tatsuta    Tatsuta
          BG-01-2030-01-101-A   --               -- hime       princess
          BG-02-3770-04-080-C                    tamukuru      present(verb)
          BG-01-5730-02-010-A   --               -- te         hand
          BG-02-1700-01-040-A   --               -- mukeru     for
          BG-01-2030-01-030-A                    kami          god
          BG-08-0061-07-010-A                    no            SUB (particle)
          BG-02-1200-01-010-C                    are           be
          BG-08-0064-26-010-A                    ba            because (particle)
          BG-04-1120-05-150-A   --               -- ba         because (reason)
          BG-08-0065-01-010-A                    koso          KP (emphasis)

                        Figure 5: Meta-code conversion in case of OP
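
The conversion shown in Figure 5 can be pictured as a dictionary lookup. The sketch below is an illustration only and does not reproduce t2c itself: `TABLE` and `tokens_to_codes` are hypothetical, the table holds just a few rows copied from Figure 5, and reading the rows marked `--` as components of the preceding token is an interpretation of the figure.

```python
from typing import Dict, List, Tuple

# token -> (code, [(component, component code), ...]); rows taken from Figure 5
TABLE: Dict[str, Tuple[str, List[Tuple[str, str]]]] = {
    "Tatsutahime": (
        "CH-29-2130-01-010-A",
        [("Tatsuta", "CH-29-0000-14-010-A"), ("hime", "BG-01-2030-01-101-A")],
    ),
    "tamukuru": (
        "BG-02-3770-04-080-C",
        [("te", "BG-01-5730-02-010-A"), ("mukeru", "BG-02-1700-01-040-A")],
    ),
    "kami": ("BG-01-2030-01-030-A", []),
}


def tokens_to_codes(tokens: List[str]):
    """Return (rows, unknown): rows are (code, token, depth) with depth 1 for components."""
    rows: List[Tuple[str, str, int]] = []
    unknown: List[str] = []
    for token in tokens:
        entry = TABLE.get(token)
        if entry is None:
            unknown.append(token)  # candidates for a new thesaurus code
            continue
        code, components = entry
        rows.append((code, token, 0))
        rows.extend((c_code, c_token, 1) for c_token, c_code in components)
    return rows, unknown


rows, unknown = tokens_to_codes(["Tatsutahime", "tamukuru", "kami", "nusa"])
for code, token, depth in rows:
    print(code, "-- " * depth + token)
print("unknown:", unknown)
```

Collecting unknown tokens separately mirrors the "add new thesaurus codes" step in the development diagram.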



       (C) Structure of meta-code-1
                BG-01-2030-01-030-A-                                    -
                  ↑       ↑        ↑        ↑        ↑      ↑      ↑         ↑
                 (1)     (2)      (3)      (4)      (5)    (6)    (7)       (8)

          Figure 6: Structure of an item of BG database in the case of kami (god):
                    (1) database ID (BG = short-unit general vocabulary);
                    (2) part of speech ID (01 = noun);
                    (3) group ID (2030 = Shinto deities and Buddhas);
                    (4) field ID;
                    (5) exact ID (030 = god);
                    (6) era-flag (A = contemporary, C = classic);
                    (7) Chinese character reading;
                    (8) Chinese character




       (C) Structure of the meta-code-2
             BG-01-2600-01-020-A (1)     =   BG-01-2610-01-040-A (2)
             yononaka (world)                yo (world)


                                         +   BG-08-0010-01-021-A (3)
                                             no (of)


                                         +   BG-01-1770-01-080-A (4)
                                             naka (inside)



          Figure 7: Structure of an item of the semantic table in the case
                    of a compound word, yononaka (world)




       (C) meta-code conversion-3
          CH-29-2130-01-010-A                    Tatsutahime   Princess-Tatsuta
          CH-29-0000-14-010-A   --               -- Tatsuta    Tatsuta
          BG-01-2030-01-101-A   --               -- hime       princess
          BG-02-3770-04-080-C                    tamukuru      present(verb)
          BG-01-5730-02-010-A   --               -- te         hand
          BG-02-1700-01-040-A   --               -- mukeru     for
          BG-01-2030-01-030-A                    kami          god
          BG-08-0061-07-010-A                    no            SUB (particle)
          BG-02-1200-01-010-C                    are           be
          BG-08-0064-26-010-A                    ba            because (particle)
          BG-04-1120-05-150-A   --               -- ba         because (reason)
          BG-08-0065-01-010-A                    koso          KP (emphasis)

                        Figure 8: Meta-code conversion in case of OP




                   10th century, field of experience:
                     poet --write--> OP
                   20th century, field of experience (expert):
                     OP --read--> expert reader --write--> CT
                   20th century, field of experience (novice):
                     CT --read--> novice reader
                   OP and CT are compared.

                    Figure 9: Schema of relationship between OP and CT

           +-------- # of pair
           | +----- value of matching level, exact=17, field=13, group=10
           | | +-- # of POS
           | | |
           | | | # of element of OP ----+        +- # of element of CT
           | | |         element of OP -+ |      | +--- element of CT
           | | |                        | |      | |
           1 17 11                       00 <-> 12        (Tatsutahime)
           2 17 47                       04 <-> 25         (hand)
           3 17 47                       05 <-> 26        (toward)
           4 17 2                        06 <-> 32         (god)
           5 10 61                       07 <-> 33         (SUB)
           6 17 47                       08 <-> 34        (be)
           7 10 64                       09 <-> 35         (because)
           8 17 65                       11 <-> 36        (EM)
           9 17 2                        12 <-> 38         (autumn)
          10 17 71                       13 <-> 39         (CON)
          11 17 2                        14 <-> 40        (leaf of tree)
          12 17 2                        19 <-> 45         (present)
          13 17 61                       20 <-> 46         (CRD)
          14 17 47                       21 <-> 49        (fall)
          15 13 74                       22 <-> 54         (CJR)

                            Figure 10: Example of the matching process
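
The level values 17, 13 and 10 coincide with the character lengths of the code prefixes up to the exact ID, the field ID and the group ID (for example, `BG-01-2030-01-030` is 17 characters long). Assuming the levels are obtained that way, which the slides do not state, a comparison of two codes could be sketched as follows; `match_level` is a hypothetical helper.

```python
# Prefix lengths in characters: "BG-01-2030" / "BG-01-2030-01" / "BG-01-2030-01-030"
GROUP, FIELD, EXACT = 10, 13, 17


def match_level(code_a: str, code_b: str) -> int:
    """Return 17, 13 or 10 for an exact, field or group match, else 0."""
    for length in (EXACT, FIELD, GROUP):
        if len(code_a) >= length and code_a[:length] == code_b[:length]:
            return length
    return 0


kami = "BG-01-2030-01-030-A"
hime = "BG-01-2030-01-101-A"       # same field as kami, different exact ID
other_field = "BG-01-2030-02-010-A"  # made-up code: same group, other field
te = "BG-01-5730-02-010-A"         # different group

print(match_level(kami, "BG-01-2030-01-030-C"))
print(match_level(kami, hime))
print(match_level(kami, other_field))
print(match_level(kami, te))
```

Because the era flag lies beyond the 17th character, a classic (C) and a contemporary (A) entry with the same exact ID still count as an exact match under this reading.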




        Residual

   CT   (                                )         (                )
   OP   — —— — — — — — — —                         — — — — —— —


   CT   (        )                           ( ) (       )    (           )
   OP   — —                                  [ ]       — —    — — — —



            Figure 11: Example of the matching process in the case of kks 298 in
                       Komachiya (1982)




       Components of OP
          Table 2: Result of subtracting the elements of OP(298) from those
                   of CT(298, koma): the table gives the proportion of each
                   component of OP(298).
          OP   (valid number of elements)              =   16
          E    (ratio of exact match)        12/16     =   0.750
          F    (ratio of field match)         1/16     =   0.062
          G    (ratio of group match)         2/16     =   0.125
          T    (ratio of total match)        15/16     =   0.938
          U    (ratio of unmatched OP)       1 - T     =   0.062
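
The ratios in Table 2 follow directly from the matching levels of the 15 pairs listed in Figure 10. A minimal sketch, assuming the level codes 17 = exact, 13 = field, 10 = group and using a hypothetical function `op_components`:

```python
from collections import Counter

# Matching levels of the 15 pairs in Figure 10, in order
LEVELS = [17, 17, 17, 17, 10, 17, 10, 17, 17, 17, 17, 17, 17, 17, 13]
N_OP = 16  # valid number of OP elements (Table 2)


def op_components(levels, n_op):
    """Return the ratios E, F, G, T and U of Table 2 as a dict."""
    counts = Counter(levels)
    e = counts[17] / n_op  # exact match
    f = counts[13] / n_op  # field match
    g = counts[10] / n_op  # group match
    t = e + f + g          # total match
    return {"E": e, "F": f, "G": g, "T": t, "U": 1 - t}


ratios = op_components(LEVELS, N_OP)
for name, value in ratios.items():
    print(name, value)
```

The unrounded values are 0.75, 0.0625, 0.125, 0.9375 and 0.0625; the table reports them to three decimal places.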




       Calculation of Residual Rate



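                            D = 1 - \frac{P}{T}        (1)
                              = 1 - \frac{16}{41}      (2)
                              = 0.61                   (3)

        Here P = 16 is the number of valid OP elements and T = 41 is the
        number of valid CT elements for poem No. 298.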




       Components of CT
          Table 3: Components of CT in the case of kks 298 by Komachiya (1982):
                   fabs(D-H) is the absolute value of the practical value, D,
                   minus the theoretical value, H.

           CT (valid number of elements)                          = 41
           W  (ratio of original word use)    12/41               = 0.293   E/CT
           A  (ratio of annotation)           1-0.293             = 0.707   1-W
               ---breakdown of the annotation---
               P1 (ratio of FG paraphrased)   3/41                = 0.073   (F+G elements)/CT
               P2 (ratio of U paraphrased)    (0.707-0.073)*0.062 = 0.040   (A-P1)*U
               D  (ratio of purely added)     0.707-(0.073+0.040) = 0.595   A-(P1+P2)
           H  (theoretical value of D)        1-16/41             = 0.610   1-OP/CT
           Gap                                fabs(0.595-0.610)   = 0.015   fabs(D-H)
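
The chain of values in Table 3 can be recomputed from the element counts. This is a sketch under one stated assumption: P1 is taken as the number of field- and group-matched elements divided by CT (3/41), which reproduces the reported 0.073. The function name `ct_components` is hypothetical.

```python
def ct_components(n_exact, n_field, n_group, n_op, n_ct):
    """Recompute W, A, P1, P2, D, H and Gap of Table 3 from element counts."""
    w = n_exact / n_ct                # original words reused in CT
    a = 1 - w                         # annotation: everything else in CT
    p1 = (n_field + n_group) / n_ct   # assumption: paraphrases counted over CT
    u = 1 - (n_exact + n_field + n_group) / n_op  # unmatched share of OP
    p2 = (a - p1) * u                 # paraphrase of unmatched OP elements
    d = a - (p1 + p2)                 # purely added material
    h = 1 - n_op / n_ct               # theoretical value of D
    return {"W": w, "A": a, "P1": p1, "P2": p2, "D": d, "H": h, "Gap": abs(d - h)}


result = ct_components(n_exact=12, n_field=1, n_group=2, n_op=16, n_ct=41)
for name, value in result.items():
    print(f"{name:>3} = {value:.3f}")
```

Working from counts rather than from the rounded ratios avoids accumulating rounding error along the chain.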



       Subtraction: CT - OP


          OP(298): 16 elements
            Exact      12 (75.0%)
            Field       1 (6.2%)
            Group       2 (12.5%)
            Unmatched   1 (6.2%)

          CT(298, koma): 41 elements
            W    12 (29.3%)
            P1    3 (7.3%)
            P2    1 (4.0%)
            D    25 (59.5%)

          Figure 12: Pie charts illustrating the components of OP(298) and
                     CT(298, koma)




       (E) Mathematical modelling
          cw(t_1, t_2) = (1 + \log ctf(t_1, t_2)) \sqrt{idf(t_1)\, idf(t_2)}   (4)

          idf(t) = \log \frac{N}{df(t)}                                        (5)
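
Equations (4) and (5) can be evaluated with a few lines of code. The sketch below makes several assumptions that the slides leave open: the logarithm is taken as the natural logarithm, ctf is read as the co-occurrence frequency of the two terms, df as the number of poems containing a term, and N as the total number of poems. The function names are hypothetical.

```python
import math


def idf(n_docs: int, df: int) -> float:
    """Equation (5): idf(t) = log(N / df(t)); natural log is an assumption."""
    if df <= 0 or n_docs <= 0:
        raise ValueError("N and df(t) must be positive")
    return math.log(n_docs / df)


def cw(ctf: int, df1: int, df2: int, n_docs: int) -> float:
    """Equation (4): (1 + log ctf(t1, t2)) * sqrt(idf(t1) * idf(t2))."""
    if ctf <= 0:
        raise ValueError("ctf(t1, t2) must be positive")
    return (1 + math.log(ctf)) * math.sqrt(idf(n_docs, df1) * idf(n_docs, df2))


# Made-up counts for illustration only
weight = cw(ctf=1, df1=10, df2=10, n_docs=100)
print(weight)
```

With a single co-occurrence the first factor is 1, so the weight reduces to the geometric mean of the two idf values.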
        [Figure: network graph of poetic vocabulary with numeric edge
         weights. Node labels include: cuckoo, mountain cuckoo, warbler,
         cry.vi, sing.vi, voice, singing voice, a cry, treetop, plum,
         branch, break off, spring, summer, summer mountains, midsummer
         rain, May, last year, this morning, every morning, Otowa.PN,
         Tatsuta.PN, scatter.1.]
                                                                                                                                                                                                flower
                                                                                                                                                                                                 9

                                                                                                                                      10
                                                                                                                                           9
                                                                                                                                                                                   yet.1
                                                        iris.1              reason.1
                                                                   6


                                                                                                                                                                       touch                                    lure
                                                                                                                 stand.vi
                                                                                                                                                                                                                                         4
                                                                                                                                                                                                                                                       send
                                                                                                                             spring haze                                                                                    7

                                                                                                                                                                                                                        5
                                                                                                                                                                                                           4
                                                                                                                                                                         10
                                                                                                                                                                                                                                         fragrance.1


                                                                                                                                                                                                                       attach
                                                                                                                                                                  hand                    guidance.1

                                                                                                      warbler-CT-23-229-3.73-15 cuckoo-CT-40-370-3.27-16
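
The weight defined in Equations (4) and (5) under "(E) Mathematical modelling" can be sketched in Python as below. This is an illustration under stated assumptions, not the authors' code: it assumes the square root covers the product idf(t1) idf(t2), that log is the natural logarithm, that ctf is the number of times the two tokens co-occur, and that df and N count the texts containing a token and the total number of texts. The function and parameter names are hypothetical.

```python
import math


def idf(n_texts: int, df: int) -> float:
    """Equation (5): idf(t) = log(N / df(t)). Natural log is an assumption."""
    return math.log(n_texts / df)


def cw(ctf: int, df1: int, df2: int, n_texts: int) -> float:
    """Equation (4), read as (1 + log ctf) * sqrt(idf(t1) * idf(t2)).

    The placement of the square root is an assumption; the extracted
    slide text does not show its extent.
    """
    if ctf <= 0:
        # No co-occurrence: the weight is taken to be zero (assumption),
        # which also avoids log(0).
        return 0.0
    return (1 + math.log(ctf)) * math.sqrt(idf(n_texts, df1) * idf(n_texts, df2))
```

Under this reading, a pair that co-occurs once gets a weight equal to the geometric mean of the two idf values, and each further co-occurrence adds to it only logarithmically.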



       Conclusion
       The thesaurus annotated with meta-codes allows researchers

         1. to identify different orthographies as the same word;

         2. to attach an alternative semantic ID to a word that has the
            same form but more than one meaning (a polysemous word);

         3. to attach meta-codes not only to tokens recognised as a
            single, simple word but also to longer tokens;

         4. to indicate the similarity between tokens;

         5. to detect tokens that are shared or that differ across two or
            more texts, which indicates the similarities and differences
            between those texts;

         6. to indicate the relative differences between two words in
            literary works.
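
Points 1, 2 and 4 rest on comparing meta-codes field by field. The Python sketch below is an illustration only, not the authors' t2c implementation. It assumes that a code such as BG-01-2030-01-030-A splits on hyphens into database, part-of-speech, group, field and exact IDs followed by the era flag, and that the matching-level values from the matching example (exact=17, field=13, group=10) correspond to the number of leading characters of the code that agree. The names `split_code` and `match_level` are hypothetical.

```python
# Number of leading ID fields that must agree for each level, paired with the
# matching-level value shown in the matching example (exact=17, field=13,
# group=10). Reading those values as prefix lengths is an assumption.
LEVELS = (("exact", 5, 17), ("field", 4, 13), ("group", 3, 10))


def split_code(code: str) -> list:
    """Return the first five IDs: database, POS, group, field, exact."""
    return code.split("-")[:5]


def match_level(code_a: str, code_b: str) -> tuple:
    """Return (level name, level value) for the deepest level at which two
    meta-codes agree. The era flag is ignored (assumption)."""
    ids_a, ids_b = split_code(code_a), split_code(code_b)
    for name, n_fields, value in LEVELS:
        if ids_a[:n_fields] == ids_b[:n_fields]:
            return name, value
    return "none", 0
```

With the codes shown on the meta-code conversion slide, kami (BG-01-2030-01-030-A) and hime (BG-01-2030-01-101-A) share their group and field IDs, so this sketch reports a field-level match, whereas the particle ba (BG-08-0064-26-010-A) and the reason sense of ba (BG-04-1120-05-150-A) do not match at any level.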




       Questions
         • Computer Modelling of Classical Japanese Poetic
           Vocabulary
            http://etymology.jp/waka/poem.cgi
         • Inquiry:
            Hilofumi Yamamoto
            yamagen@ryu.titech.ac.jp
         • Thank you.
