
Asialex201103slide02



  1. Asialex 2011, Kyoto, Japan. Development of the Thesaurus of Classical Japanese Poetic Vocabulary. Hilofumi Yamamoto (Tokyo Institute of Technology) and Makiro Tanaka (National Institute of Japanese Language and Linguistics). 22nd Aug. 2011
  2. Outline
     1. Purpose of Study
        • Connotation of classical poetic vocabulary
        • Longitudinal study of transition of vocabulary
     2. Development of Thesaurus
     3. Applications
  3. Waka: Japanese Poetry
     Tatsuta-Hime tamukuru KAMI no / arebakoso aki no konoha no / nusa to chirurame
     Because Princess Tatsuta has a god to whom she offers brocades, the leaves of trees in autumn will scatter as an offering.
     Prince Kanemi, No. 298 in the Kokinshū
  4. Problem: Orthography. Variants written in Chinese characters and variants written in hiragana → all Tatsuta (place name)
  5. Problem: Unit size / attribution. The unit size and meaning of a word depend on context.
     • unit: the same string can be segmented in more than one way (Nakano, 1998)
     • orthography: one word, e.g. the word for "sad", can be written in more than one way
     • attribution: unohana ∈ plant or unohana ∈ food (unohana = a deutzia or bean curd refuse)
  6. An Item of Thesaurus: God
     BG-01-2030-01-030-A-(7)-(8), where the hyphen-separated fields are numbered (1) to (8) from left to right.
     Figure 1: Structure of an item of the BG database in the case of kami (god): (1) database ID (BG = short-unit general vocabulary); (2) part of speech ID (01 = noun); (3) group ID (2030 = Shinto deities and Buddhas); (4) field ID; (5) exact ID (030 = god); (6) era-flag (A = contemporary, C = classic); (7) Chinese character reading; (8) Chinese character
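The field layout in Figure 1 can be illustrated with a small parser. This is a minimal sketch, not the authors' tooling: the names `MetaCode` and `parse_meta_code` are hypothetical, and the sketch assumes that fields (1) to (6) are always hyphen-separated in the order the figure gives. Fields (7) and (8), the Chinese character reading and the Chinese character, are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetaCode:
    """Fields (1)-(6) of a thesaurus meta-code (hypothetical container)."""
    database: str  # (1) e.g. BG = short-unit general vocabulary
    pos: str       # (2) part of speech ID, e.g. 01 = noun
    group: str     # (3) group ID, e.g. 2030 = Shinto deities and Buddhas
    field: str     # (4) field ID
    exact: str     # (5) exact ID, e.g. 030 = god
    era: str       # (6) era-flag: A = contemporary, C = classic


def parse_meta_code(code: str) -> MetaCode:
    """Split a code such as 'BG-01-2030-01-030-A' into its six fields.

    Assumption: any further hyphen-separated parts (fields 7 and 8) are dropped.
    """
    parts = code.strip().split("-")
    if len(parts) < 6:
        raise ValueError(f"expected at least six fields, got {len(parts)}: {code!r}")
    return MetaCode(*parts[:6])


kami = parse_meta_code("BG-01-2030-01-030-A")
print(kami.group, kami.exact, kami.era)
```

Keeping the fields as strings preserves leading zeros, which matter because `030` and `30` would otherwise be indistinguishable.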
  7. Development: Thesaurus, KH, and t2c
     • Thesaurus for classical poetic vocabulary
     • KH (tokenizer)
     • t2c (token to code converter)
  8. Materials: the Hachidaishū
     • The Hachidaishū: eight anthologies compiled by imperial order during ca. 905–1205.
     • The database: compiled by the National Institute of Japanese Literature, Japan.
     • The old texts are based on the Shōhobonban version of the Hachidaishū.
     Timeline (axis 900 to 1250): Kokinshū (905), Gosenshū (951), Shūishū (1007), Goshūishū (1086), Kin'yōshū (1124), Shikashū (1144), Senzaishū (1188), Shinkokinshū (1205). Intervals between successive anthologies: 46, 56, 79, 38, 20, 44 and 17 years.
  9. Methods: Flowchart of data processing
     A: Corpus development → B: Tokenisation → C: Meta-code conversion → D: Subtraction: CT−OP → E: Mathematical modelling → F: Visualisation
  10. Development: Thesaurus, KH, and t2c
     • Thesaurus for classical poetic vocabulary
     • KH (tokenizer)
     • t2c (token to code converter)
  11. Table 1: An example of input for KH (Gosenshū No. 664). Input: 000664 followed by the poem text. Output: 000664 followed by one parenthesised record of colon-separated fields per token.
  12. Development: Thesaurus
     [Diagram: poem texts (Hachidaishū) pass through the tokeniser (kh) and the thesaurus code tagger (t2c). kh draws on the dictionary (general, place name, personal name, etc.) and t2c on the thesaurus. Unknown entries and new thesaurus codes are added back in two loops labelled (A) and (B).]
  13. (A) Corpus: Poems (OP)
     KW00029800|A|KANEMI NO Ō=kanemi no ō
     KW00029800|B|Tatsutahime[NOUN-PLNAME:TATSUTAHIME]/→
     tamukuru[KASHIMO2-ATTR:TAMUkuru],kami[NOUN:KAMI]→
     no[SUB]are[RAHEN-REAL]ba[CAUS]koso[KP]/→
     aki[NOUN:AKI]no[CON],konoha[NOUN:KOnoHA]no[SUB]/→
     nusa[NOUN:NUSA]to[P-CRD],chiru[RA4DAN-FF:CHIru]→
     rame[CJR-REAL]/
     Figure 2: Format of the database of a poem: → indicates that the record continues on the next line without a break; the first line, which includes |A|, gives the name of the poet; the second line, which includes |B|, gives the contents of the poem and the added information.
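The record format in Figure 2 can be read with a short regular expression. The sketch below is an illustration under stated assumptions, not the authors' parser: `parse_record` and `TOKEN` are hypothetical names, each token is assumed to have the shape `surface[TAG]` or `surface[TAG:LEMMA]`, and `/`, `,` and `→` are assumed to act only as separators.

```python
import re

# surface form, then [TAG] or [TAG:LEMMA]; separators are excluded from the surface.
TOKEN = re.compile(r"([^\[\]/,→\s]+)\[([^\]:]+)(?::([^\]]+))?\]")

# Part of the |B| record of Kokinshu No. 298, joined onto one line.
RECORD = (
    "KW00029800|B|Tatsutahime[NOUN-PLNAME:TATSUTAHIME]/"
    "tamukuru[KASHIMO2-ATTR:TAMUkuru],kami[NOUN:KAMI]"
    "no[SUB]are[RAHEN-REAL]ba[CAUS]koso[KP]/"
)


def parse_record(record: str):
    """Return (record key, record type, [(surface, tag, lemma or None), ...])."""
    key, kind, body = record.split("|", 2)
    tokens = [(m.group(1), m.group(2), m.group(3)) for m in TOKEN.finditer(body)]
    return key, kind, tokens


key, kind, tokens = parse_record(RECORD)
for surface, tag, lemma in tokens:
    print(surface, tag, lemma)
```

Particles such as `no[SUB]` carry no lemma in the figure, so the third element is `None` for them.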
  14. (A) Corpus: Translations (CT)
     Record fields: $A| (e.g. $A|000298), $B|, $C|, $D| and $I|; → indicates that a field continues on the next line.
     Figure 3: Format of the database of a CT
  15. (B) Tokenisation: original text ↓ tokenising ↓ converting into predicative form
     Figure 4: Tokenisation of poem texts
  16. (C) meta-code conversion
     CH-29-2130-01-010-A        Tatsutahime  Princess-Tatsuta
     CH-29-0000-14-010-A -- --  Tatsuta      Tatsuta
     BG-01-2030-01-101-A -- --  hime         princess
     BG-02-3770-04-080-C        tamukuru     present (verb)
     BG-01-5730-02-010-A -- --  te           hand
     BG-02-1700-01-040-A -- --  mukeru       for
     BG-01-2030-01-030-A        kami         god
     BG-08-0061-07-010-A        no           SUB (particle)
     BG-02-1200-01-010-C        are          be
     BG-08-0064-26-010-A        ba           because (particle)
     BG-04-1120-05-150-A -- --  ba           because (reason)
     BG-08-0065-01-010-A        koso         KP (emphasis)
     Figure 5: Meta-code conversion in the case of OP
  17. (C) Structure of meta-code-1
     BG-01-2030-01-030-A-(7)-(8), where the hyphen-separated fields are numbered (1) to (8) from left to right.
     Figure 6: Structure of an item of the BG database in the case of kami (god): (1) database ID (BG = short-unit general vocabulary); (2) part of speech ID (01 = noun); (3) group ID (2030 = Shinto deities and Buddhas); (4) field ID; (5) exact ID (030 = god); (6) era-flag (A = contemporary, C = classic); (7) Chinese character reading; (8) Chinese character
  18. (C) Structure of the meta-code-2
     BG-01-2600-01-020-A (1)  yononaka (world)
     = BG-01-2610-01-040-A (2)  yo (world)
     + BG-08-0010-01-021-A (3)  no (of)
     + BG-01-1770-01-080-A (4)  naka (inside)
     Figure 7: Structure of an item of the semantic table in the case of a compound word, yononaka (world)
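The compound entry in Figure 7 suggests a simple lookup from a compound code to its component codes. The following is a sketch only: the `COMPOUNDS` table and the `expand` function are hypothetical names, and the slides do not say how the semantic table is actually stored.

```python
# Compound code -> ordered components (code, reading, gloss), copied from Figure 7.
COMPOUNDS = {
    "BG-01-2600-01-020-A": [  # yononaka (world)
        ("BG-01-2610-01-040-A", "yo", "world"),
        ("BG-08-0010-01-021-A", "no", "of"),
        ("BG-01-1770-01-080-A", "naka", "inside"),
    ],
}


def expand(code: str) -> list[str]:
    """Return the component codes of a compound, or the code itself if simple."""
    components = COMPOUNDS.get(code)
    if components is None:
        return [code]
    return [component_code for component_code, _reading, _gloss in components]


print(expand("BG-01-2600-01-020-A"))
print(expand("BG-01-2030-01-030-A"))
```

A table of this kind would let a longer token carry both its own meta-code and the codes of its parts.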
  19. (C) meta-code conversion-3
     CH-29-2130-01-010-A        Tatsutahime  Princess-Tatsuta
     CH-29-0000-14-010-A -- --  Tatsuta      Tatsuta
     BG-01-2030-01-101-A -- --  hime         princess
     BG-02-3770-04-080-C        tamukuru     present (verb)
     BG-01-5730-02-010-A -- --  te           hand
     BG-02-1700-01-040-A -- --  mukeru       for
     BG-01-2030-01-030-A        kami         god
     BG-08-0061-07-010-A        no           SUB (particle)
     BG-02-1200-01-010-C        are          be
     BG-08-0064-26-010-A        ba           because (particle)
     BG-04-1120-05-150-A -- --  ba           because (reason)
     BG-08-0065-01-010-A        koso         KP (emphasis)
     Figure 8: Meta-code conversion in the case of OP
  20. [Diagram: in the 10th century the poet, within the poet's field of experience, writes the OP. In the 20th century an expert reader (field of experience: expert) reads the OP and writes the CT, and a novice reader (field of experience: novice) reads the CT. OP and CT are compared.]
     Figure 9: Schema of relationship between OP and CT
  21. Columns, left to right: pair number; matching level (exact = 17, field = 13, group = 10); POS number; OP element number <-> CT element number; element.
      1  17  11  00 <-> 12  (Tatsutahime)
      2  17  47  04 <-> 25  (hand)
      3  17  47  05 <-> 26  (toward)
      4  17   2  06 <-> 32  (god)
      5  10  61  07 <-> 33  (SUB)
      6  17  47  08 <-> 34  (be)
      7  10  64  09 <-> 35  (because)
      8  17  65  11 <-> 36  (EM)
      9  17   2  12 <-> 38  (autumn)
     10  17  71  13 <-> 39  (CON)
     11  17   2  14 <-> 40  (leaf of tree)
     12  17   2  19 <-> 45  (present)
     13  17  61  20 <-> 46  (CRD)
     14  17  47  21 <-> 49  (fall)
     15  13  74  22 <-> 54  (CJR)
     Figure 10: Example of the matching process
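The three matching levels in Figure 10 can be sketched as a comparison of two meta-codes. Only the values 17, 13 and 10 come from the slide. The comparison rule is an assumption made for illustration: two codes match at group level when their part of speech and group IDs agree, at field level when the field ID also agrees, and at exact level when the exact ID agrees too. The database ID and era-flag are ignored, and `match_level` is a hypothetical name.

```python
EXACT, FIELD, GROUP, NO_MATCH = 17, 13, 10, 0  # level values shown in Figure 10


def match_level(op_code: str, ct_code: str) -> int:
    """Compare two meta-codes such as 'BG-01-2030-01-030-A'.

    Assumed rule (not stated on the slides): the comparison walks down the
    hierarchy pos -> group -> field -> exact and stops at the first difference.
    """
    a, b = op_code.split("-"), ct_code.split("-")
    if len(a) < 5 or len(b) < 5:
        raise ValueError("a meta-code needs at least five hyphen-separated fields")
    if a[1] != b[1] or a[2] != b[2]:  # part of speech ID or group ID differs
        return NO_MATCH
    if a[3] != b[3]:                  # field ID differs
        return GROUP
    if a[4] != b[4]:                  # exact ID differs
        return FIELD
    return EXACT


print(match_level("BG-01-2030-01-030-A", "BG-01-2030-01-030-C"))
print(match_level("BG-01-2030-01-030-A", "BG-01-2030-01-101-A"))
```

Ignoring the era-flag lets a classical form (C) in the OP match its contemporary counterpart (A) in the CT, which seems to be the purpose of comparing the two texts, though the slides do not confirm this detail.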
  22. Residual
     [Figure: aligned CT and OP element sequences]
     Figure 11: Example of the matching process in the case of kks 298 in Komachiya (1982)
  23. Components of OP
     Table 2: Result of subtracting the elements of OP(298) from those of CT(298, koma); it indicates the ratio of the ingredients of OP(298).
     OP (valid number of elements) = 16
     E (ratio of exact match)     12/16 = 0.750
     F (ratio of field match)      1/16 = 0.062
     G (ratio of group match)      2/16 = 0.125
     T (ratio of total match)     15/16 = 0.938
     U (ratio of unmatched OP)    1 - T = 0.062
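The ratios in Table 2 follow directly from the match counts. The sketch below recomputes them; `op_ratios` is a hypothetical helper name, and the counts (16 elements, 12 exact, 1 field, 2 group) are the ones the table gives.

```python
def op_ratios(n_op: int, exact: int, field: int, group: int) -> dict[str, float]:
    """Ratios of exact, field, group, total and unmatched elements of an OP."""
    total = exact + field + group
    return {
        "E": exact / n_op,      # ratio of exact match
        "F": field / n_op,      # ratio of field match
        "G": group / n_op,      # ratio of group match
        "T": total / n_op,      # ratio of total match
        "U": 1 - total / n_op,  # ratio of unmatched OP
    }


for name, value in op_ratios(n_op=16, exact=12, field=1, group=2).items():
    print(f"{name} = {value:.4f}")
```

Unrounded, F and U are both 0.0625 and T is 0.9375; the table prints them to three decimals.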
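  24. Calculation of Residual Rate
     \begin{align}
     D &= 1 - \frac{P}{T} \tag{1} \\
       &= 1 - \frac{16}{41} \tag{2} \\
       &= 0.61 \tag{3}
     \end{align}
     Here 16 is the number of valid elements of OP(298) and 41 the number of valid elements of CT(298, koma).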
  25. Components of CT
     Table 3: Components of CT in the case of kks 298 by Komachiya (1982); fabs(D-H) stands for the absolute value of the practical value, D, minus the theoretical value, H.
     CT (valid number of elements) = 41
     W  (ratio of original word use)   12/41 = 0.293                     (E/CT)
     A  (ratio of annotation)          1 - 0.293 = 0.707                 (1 - W)
     Breakdown of the annotation:
     P1 (ratio of FG paraphrased)      (0.62+0.12)/0.707 = 0.073         ((F+G)/A)
     P2 (ratio of U paraphrased)       (0.707 - 0.073) * 0.062 = 0.040   ((A - P1) * U)
     D  (ratio of purely added)        0.707 - (0.073 + 0.040) = 0.595   (A - (P1 + P2))
     H  (theoretical value of D)       1 - 16/41 = 0.610                 (1 - OP/CT)
     Gap                               fabs(0.595 - 0.610) = 0.015       (fabs(D - H))
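The breakdown in Table 3 can be checked with a few lines of arithmetic. This is a sketch under one loudly stated assumption: the expression printed for P1 does not evaluate to the printed 0.073, so the code takes P1 as 3/41, the count of 3 that Figure 12 reports for P1 over the 41 elements of the CT. All other steps follow the formulas in the table. `ct_breakdown` is a hypothetical name.

```python
def ct_breakdown(n_op: int, n_ct: int, exact: int, paraphrased_fg: int, u: float) -> dict[str, float]:
    """Recompute the components of a CT from the counts given on the slides."""
    w = exact / n_ct             # W: ratio of original word use (E/CT)
    a = 1 - w                    # A: ratio of annotation
    p1 = paraphrased_fg / n_ct   # P1: ASSUMED to be 3/41, as in Figure 12
    p2 = (a - p1) * u            # P2: ratio of U paraphrased
    d = a - (p1 + p2)            # D: ratio of purely added
    h = 1 - n_op / n_ct          # H: theoretical value of D
    return {"W": w, "A": a, "P1": p1, "P2": p2, "D": d, "H": h, "Gap": abs(d - h)}


result = ct_breakdown(n_op=16, n_ct=41, exact=12, paraphrased_fg=3, u=1 / 16)
for name, value in result.items():
    print(f"{name} = {value:.3f}")
```

With these inputs the rounded values agree with the ones printed in the table, which supports the 3/41 reading of P1 without confirming it.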
  26. Subtraction: CT - OP
     OP(298), 16 elements: Exact 12 (75.0%), Group 2 (12.5%), Field 1 (6.2%), Unmatched 1 (6.2%)
     CT(298, koma), 41 elements: W 12 (29.3%), P1 3 (7.3%), P2 1 (4.0%), D 25 (59.5%)
     Figure 12: Pie-charts illustrating the components of OP(298) and CT(298, koma)
  27. (E) Mathematical modelling
     \begin{align}
     cw(t_1, t_2) &= \bigl(1 + \log ctf(t_1, t_2)\bigr)\sqrt{idf(t_1)\, idf(t_2)} \tag{4} \\
     idf(t) &= \log \frac{N}{df(t)} \tag{5}
     \end{align}
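Equations (4) and (5) can be tried out numerically. The sketch below rests on assumptions the slide does not spell out: the square root is read as covering the product of the two idf terms, ctf(t1, t2) is read as the number of times the two tokens co-occur, N as the number of texts, df(t) as the number of texts containing t, and the logarithm is taken as natural. The function names are hypothetical.

```python
import math


def idf(n_texts: int, df: int) -> float:
    """Equation (5): idf(t) = log(N / df(t)), natural logarithm assumed."""
    return math.log(n_texts / df)


def cw(ctf: int, n_texts: int, df1: int, df2: int) -> float:
    """Equation (4): co-occurrence weight of a pair of tokens.

    ctf must be at least 1, because log(0) is undefined.
    """
    if ctf < 1:
        raise ValueError("ctf must be a positive co-occurrence count")
    return (1 + math.log(ctf)) * math.sqrt(idf(n_texts, df1) * idf(n_texts, df2))


# Hypothetical counts: two tokens that each occur in 10 of 100 texts
# and co-occur once.
print(cw(ctf=1, n_texts=100, df1=10, df2=10))
```

Under this reading, a pair of rare tokens that co-occur often receives a high weight, while tokens found in nearly every text contribute a weight close to zero.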
  28. [Figure: two word-network graphs with numeric edge weights. Panel labels: warbler-CT-23-229-3.73-15 and cuckoo-CT-40-370-3.27-16. Node labels include warbler, cuckoo, mountain cuckoo, spring, spring haze, summer, summer mountains, midsummer rain, plum, flower, branch, green willow, voice, singing voice, treetop, Tatsuta.PN and Otowa.PN.]
  29. Conclusion. The thesaurus annotated with meta-codes allows researchers
     1. to identify different orthographies as the same word;
     2. to attach an alternative semantic ID to a word which has the same form but more than one meaning (a polysemic word);
     3. to attach meta-codes not only to tokens recognised as a single/simple word but also to longer tokens;
     4. to indicate a similarity between tokens;
     5. to detect common or different tokens among more than one text, which will tell us the similarities or differences between texts;
     6. to indicate the relative differences between two words in literary works.
  30. Questions
     • Computer Modelling of Classical Japanese Poetic Vocabulary: http://etymology.jp/waka/poem.cgi
     • Inquiry: Hilofumi Yamamoto, yamagen@ryu.titech.ac.jp
     • Thank you.
