1. Asialex 2011 Kyoto, Japan 1
Development of the Thesaurus of Classical
Japanese Poetic Vocabulary
Hilofumi Yamamoto
Tokyo Institute of Technology
Makiro Tanaka
National Institute of Japanese Language and Linguistics
22nd Aug. 2011
Outline
1. Purpose of Study
• Connotation of classical poetic vocabulary
• Longitudinal study of transition of vocabulary
2. Development of Thesaurus
3. Applications
Waka: Japanese Poetry
Tatsuta-hime /
tamukuru KAMI no / arebakoso
aki no konoha no / nusa to chirurame
because Princess Tatsuta
has a god to whom she offers brocades,
the leaves of trees
in autumn will scatter
as an offering.
Prince Kanemi
No. 298 in the Kokinshū
Problem: Orthography
in Chinese characters
in hiragana
→ All Tatsuta (place name)
Problem: Unit size / attribution
The unit size and meaning of a word depend on its context.
• unit → or (Nakano, 1998)
• orthography →
(sad)
• attributions → ∈ plant or ∈ food
(unohana = a deutzia or bean curd refuse)
An Item of Thesaurus: God
BG-01-2030-01-030-A- -
↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑
(1) (2) (3) (4) (5) (6) (7) (8)
Figure 1: Structure of an item of BG database in the case of kami (god):
(1) database ID (BG = short-unit general vocabulary);
(2) part of speech ID (01 = noun);
(3) group ID (2030 = Shinto deities and Buddhas);
(4) field ID;
(5) exact ID (030 = god);
(6) era-flag (A = contemporary, C = classic);
(7) Chinese character reading;
(8) Chinese character
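The structure in Figure 1 can be read as a hyphen-separated record. The following is a minimal sketch, not the authors' implementation: the `MetaCode` type and the `parse_meta_code` function are hypothetical names, and fields (7) and (8) are left out because they are blank in the example shown.

```python
from typing import NamedTuple


class MetaCode(NamedTuple):
    """First six fields of a meta-code (hypothetical container)."""
    database: str  # (1) e.g. "BG" = short-unit general vocabulary
    pos: str       # (2) e.g. "01" = noun
    group: str     # (3) e.g. "2030" = Shinto deities and Buddhas
    field: str     # (4) field ID
    exact: str     # (5) e.g. "030" = god
    era: str       # (6) "A" = contemporary, "C" = classic


def parse_meta_code(code: str) -> MetaCode:
    """Split a hyphen-separated meta-code and keep fields (1) to (6)."""
    parts = [p.strip() for p in code.split("-")]
    if len(parts) < 6 or not all(parts[:6]):
        raise ValueError(f"not a six-field meta-code: {code!r}")
    return MetaCode(*parts[:6])


kami = parse_meta_code("BG-01-2030-01-030-A")
print(kami.group, kami.exact, kami.era)
```

Keeping the fields separate makes it easy to compare two items at the group, field, or exact level later on.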
Development: Thesaurus, KH, and t2c
• Thesaurus for classical poetic vocabulary
• KH (tokenizer)
• t2c (token to code converter)
Materials: the Hachidaishū
• The Hachidaishū: eight anthologies compiled by
imperial order during ca. 905–1205.
• The database: compiled by the National Institute of
Japanese Literature, Japan.
• Old texts are based on the Shōhobonban version of the
Hachidaishū
[Figure: timeline, 900–1250, of the eight anthologies with compilation years: Kokinshū (905), Gosenshū (951), Shūishū (1007), Goshūishū (1086), Kin'yōshū (1124), Shikashū (1144), Senzaishū (1188), Shinkokinshū (1205). Intervals between successive anthologies: 46, 56, 79, 38, 20, 44, and 17 years.]
Methods: Flowchart of data processing
A: Corpus development
B: Tokenisation
C: Meta-code conversion
D: Mathematical modelling
E: Subtraction: CT−OP
F: Visualisation
Development: Thesaurus
Poem Texts → kh (tokeniser) → t2c (code tagger) → Hachidaishū Thesaurus
(A) kh uses the Dictionary (General, Place Name, Personal Name, etc.); unknown entries are added to it.
(B) t2c uses the Thesaurus; new thesaurus codes are added to it.
(A) Corpus: Poems (OP)
KW00029800|A|KANEMI NO Ō=kanemi no ō
KW00029800|B|Tatsutahime[NOUN-PLNAME:TATSUTAHIME]/→
tamukuru[KASHIMO2-ATTR:TAMUkuru],kami[NOUN:KAMI]→
no[SUB]are[RAHEN-REAL]ba[CAUS]koso[KP]/→
aki[NOUN:AKI]no[CON],konoha[NOUN:KOnoHA]no[SUB]/→
nusa[NOUN:NUSA]to[P-CRD],chiru[RA4DAN-FF:CHIru]→
rame[CJR-REAL]/
Figure 2: Format of the database of a poem: → indicates continuing to the
next line without breaks; the first line, which includes |A|, indicates
the name of the poet; the second line which includes |B|, indicates
the contents of the poem and added information.
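The record format in Figure 2 can be picked apart with a regular expression. This is a minimal sketch under stated assumptions: continuation arrows (→) have already been removed so that a record sits on one line, each token has the shape `surface[TAG]` or `surface[TAG:LEMMA]`, and `/` and `,` act only as separators. The name `parse_poem_record` is hypothetical.

```python
import re

# surface form, then [TAG] or [TAG:LEMMA]; "/" and "," are treated as separators
TOKEN_RE = re.compile(r"([^\[\]/,]+)\[([^\]:]+)(?::([^\]]+))?\]")


def parse_poem_record(record: str):
    """Return (poem_id, line_type, tokens) for one 'ID|TYPE|body' record."""
    poem_id, line_type, body = record.split("|", 2)
    tokens = [(m.group(1), m.group(2), m.group(3))
              for m in TOKEN_RE.finditer(body)]
    return poem_id, line_type, tokens


record = ("KW00029800|B|Tatsutahime[NOUN-PLNAME:TATSUTAHIME]/"
          "tamukuru[KASHIMO2-ATTR:TAMUkuru],kami[NOUN:KAMI]"
          "no[SUB]are[RAHEN-REAL]ba[CAUS]koso[KP]/")
poem_id, line_type, tokens = parse_poem_record(record)
for surface, tag, lemma in tokens:
    print(surface, tag, lemma)
```

Tokens without a lemma part, such as `no[SUB]`, come back with `None` in the third position.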
(A) Corpus: Translations (CT)
$A|000298
$B| →
$C|
$D| →
$I| →
→
Figure 3: Format of the database of a CT
(B) Tokenisation:
original text
↓
tokenising
/ / / /[ ]/ / / / / / / / / /[ ]
↓
converting into predicative form
/ / / /[ ]/ / / / / / / / / /[ ]
Figure 4: Tokenisation of poem texts
(C) meta-code conversion
CH-29-2130-01-010-A Tatsutahime Princess-Tatsuta
CH-29-0000-14-010-A -- -- Tatsuta Tatsuta
BG-01-2030-01-101-A -- -- hime princess
BG-02-3770-04-080-C tamukuru present(verb)
BG-01-5730-02-010-A -- -- te hand
BG-02-1700-01-040-A -- -- mukeru for
BG-01-2030-01-030-A kami god
BG-08-0061-07-010-A no SUB (particle)
BG-02-1200-01-010-C are be
BG-08-0064-26-010-A ba because (particle)
BG-04-1120-05-150-A -- -- ba because (reason)
BG-08-0065-01-010-A koso KP (emphasis)
Figure 5: Meta-code conversion in case of OP
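The listing in Figure 5 suggests a lookup from token to code, in which a compound entry is followed by rows for its components (marked `-- --`). The sketch below illustrates that idea only; it is not the actual t2c program. The `THESAURUS` dictionary is a hypothetical in-memory excerpt built from the rows above, and the `UNKNOWN` marker for missing tokens is an assumption.

```python
# Hypothetical excerpt: token -> (code, gloss, [(component code, token, gloss), ...])
THESAURUS = {
    "Tatsutahime": ("CH-29-2130-01-010-A", "Princess-Tatsuta",
                    [("CH-29-0000-14-010-A", "Tatsuta", "Tatsuta"),
                     ("BG-01-2030-01-101-A", "hime", "princess")]),
    "tamukuru": ("BG-02-3770-04-080-C", "present(verb)",
                 [("BG-01-5730-02-010-A", "te", "hand"),
                  ("BG-02-1700-01-040-A", "mukeru", "for")]),
    "kami": ("BG-01-2030-01-030-A", "god", []),
}


def t2c(tokens):
    """Convert tokens to (code, token, gloss) rows, listing components after compounds."""
    rows = []
    for tok in tokens:
        entry = THESAURUS.get(tok)
        if entry is None:
            # Assumed behaviour: flag the token so a code can be added later
            rows.append(("UNKNOWN", tok, ""))
            continue
        code, gloss, parts = entry
        rows.append((code, tok, gloss))
        rows.extend((c, "-- -- " + t, g) for c, t, g in parts)
    return rows


for row in t2c(["Tatsutahime", "tamukuru", "kami"]):
    print(*row)
```

Listing the components next to the compound keeps both the long-unit and the short-unit reading available for later matching.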
(C) Structure of meta-code-1
BG-01-2030-01-030-A- -
↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑
(1) (2) (3) (4) (5) (6) (7) (8)
Figure 6: Structure of an item of BG database in the case of kami (god):
(1) database ID (BG = short-unit general vocabulary);
(2) part of speech ID (01 = noun);
(3) group ID (2030 = Shinto deities and Buddhas);
(4) field ID;
(5) exact ID (030 = god);
(6) era-flag (A = contemporary, C = classic);
(7) Chinese character reading;
(8) Chinese character
(C) Structure of the meta-code-2
BG-01-2600-01-020-A (1) = BG-01-2610-01-040-A (2)
yononaka (world) yo (world)
+ BG-08-0010-01-021-A (3)
no (of)
+ BG-01-1770-01-080-A (4)
naka (inside)
Figure 7: Structure of an item of the semantic table in the case
of a compound word, yononaka (world)
(C) meta-code conversion-3
CH-29-2130-01-010-A Tatsutahime Princess-Tatsuta
CH-29-0000-14-010-A -- -- Tatsuta Tatsuta
BG-01-2030-01-101-A -- -- hime princess
BG-02-3770-04-080-C tamukuru present(verb)
BG-01-5730-02-010-A -- -- te hand
BG-02-1700-01-040-A -- -- mukeru for
BG-01-2030-01-030-A kami god
BG-08-0061-07-010-A no SUB (particle)
BG-02-1200-01-010-C are be
BG-08-0064-26-010-A ba because (particle)
BG-04-1120-05-150-A -- -- ba because (reason)
BG-08-0065-01-010-A koso KP (emphasis)
Figure 8: Meta-code conversion in case of OP
10th century, field of experience: the poet writes OP.
20th century, field of experience (expert): the expert reader reads OP and writes CT.
20th century, field of experience (novice): the novice reader reads CT.
OP and CT are compared.
Figure 9: Schema of relationship between OP and CT
+-------- # of pair
| +----- value of matching level, exact=17, field=13, group=10
| | +-- # of POS
| | |
| | | # of element of OP ----+ +- # of element of CT
| | | element of OP -+ | | +--- element of CT
| | | | | | |
1 17 11 00 <-> 12 (Tatsutahime)
2 17 47 04 <-> 25 (hand)
3 17 47 05 <-> 26 (toward)
4 17 2 06 <-> 32 (god)
5 10 61 07 <-> 33 (SUB)
6 17 47 08 <-> 34 (be)
7 10 64 09 <-> 35 (because)
8 17 65 11 <-> 36 (EM)
9 17 2 12 <-> 38 (autumn)
10 17 71 13 <-> 39 (CON)
11 17 2 14 <-> 40 (leaf of tree)
12 17 2 19 <-> 45 (present)
13 17 61 20 <-> 46 (CRD)
14 17 47 21 <-> 49 (fall)
15 13 74 22 <-> 54 (CJR)
Figure 10: Example of the matching process
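The matching levels in Figure 10 (exact = 17, field = 13, group = 10) can be illustrated by comparing two meta-codes segment by segment. This is a sketch under assumptions that the slides do not state: a level is taken to require agreement on all segments up to that level, the era-flag is ignored so that a classic (C) item can match a contemporary (A) one, and 0 is used for no match. The function name `match_level` is hypothetical.

```python
EXACT, FIELD, GROUP, NO_MATCH = 17, 13, 10, 0


def match_level(code_a: str, code_b: str) -> int:
    """Compare database, POS, group, field and exact IDs of two meta-codes."""
    a = code_a.split("-")[:5]
    b = code_b.split("-")[:5]
    if len(a) < 5 or len(b) < 5:
        raise ValueError("meta-codes need at least five segments")
    if a[:3] != b[:3]:      # database, part of speech, or group differs
        return NO_MATCH
    if a[3] != b[3]:        # same group, different field
        return GROUP
    if a[4] != b[4]:        # same field, different exact ID
        return FIELD
    return EXACT


print(match_level("BG-01-2030-01-030-A", "BG-01-2030-01-030-A"))
print(match_level("BG-01-2030-01-030-A", "BG-01-2030-01-101-A"))
```

Under these assumptions kami (god) and hime (princess), which share group 2030 and field 01, match at the field level.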
Residual
Figure 11: Example of the matching process in the case of kks 298 in Komachiya (1982)
Components of OP
Table 2: Result of subtracting the elements of OP(298) from those
of CT(298, koma): it indicates the ratio of the ingredients
of OP(298).
OP (valid number of element) = 16
E (ratio of exact match) 12/16 = 0.750
F (ratio of field match) 1/16 = 0.062
G (ratio of group match) 2/16 = 0.125
T (ratio of total match) 15/16 = 0.938
U (ratio of unmatched OP) 1 - T = 0.062
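The ratios in Table 2 follow directly from the match counts. A minimal sketch, with the hypothetical function name `op_ratios`, using the counts given for OP(298):

```python
def op_ratios(n_op: int, exact: int, field: int, group: int) -> dict:
    """Ratios of exact, field, group, total and unmatched elements of OP."""
    total = (exact + field + group) / n_op
    return {
        "E": exact / n_op,
        "F": field / n_op,
        "G": group / n_op,
        "T": total,
        "U": 1 - total,
    }


ratios = op_ratios(n_op=16, exact=12, field=1, group=2)
for name, value in ratios.items():
    print(name, value)
```

The table reports these values rounded to three decimal places.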
Calculation of Residual Rate
\begin{align}
D &= 1 - \frac{P}{T} \tag{1} \\
  &= 1 - \frac{16}{41} \tag{2} \\
  &= 0.61 \tag{3}
\end{align}
Components of CT
Table 3: Components of CT in the case of kks 298 by Komachiya (1982).
fabs(D-H) is the absolute value of the practical value, D, minus the
theoretical value, H.
CT (valid number of elements) = 41
W (ratio of original word use)   12/41 = 0.293                 (E/CT)
A (ratio of annotation)          1-0.293 = 0.707               (1-W)
---breakdown of the annotation---
P1 (ratio of FG paraphrased)     (0.62+0.12)/0.707 = 0.073     (F+G)/A
P2 (ratio of U paraphrased)      (0.707-0.073)*0.062 = 0.040   (A-P1)*U
D (ratio of purely added)        0.707-(0.073+0.040) = 0.595   A-(P1+P2)
H (theoretical value of D)       1-16/41 = 0.610               1-OP/CT
Gap                              fabs(0.595-0.610) = 0.015     fabs(D-H)
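The quantities in Table 3 can be recomputed from the element counts. The sketch below is an illustration rather than the authors' code, and it rests on one assumption that should be read with care: P1 is computed as the number of field- and group-matched elements divided by the CT size (3/41), by analogy with W = 12/41, because that reading reproduces the 0.073 shown in the table. The function name `ct_components` is hypothetical.

```python
def ct_components(n_op: int, n_ct: int, exact: int, field: int, group: int) -> dict:
    """Recompute W, A, P1, P2, D, H and Gap from match counts."""
    w = exact / n_ct                              # original word use
    a = 1 - w                                     # annotation
    p1 = (field + group) / n_ct                   # ASSUMPTION: counts over CT size
    u = 1 - (exact + field + group) / n_op        # unmatched ratio of OP
    p2 = (a - p1) * u
    d = a - (p1 + p2)                             # purely added
    h = 1 - n_op / n_ct                           # theoretical value of D
    return {"W": w, "A": a, "P1": p1, "P2": p2, "D": d, "H": h,
            "Gap": abs(d - h)}


components = ct_components(n_op=16, n_ct=41, exact=12, field=1, group=2)
for name, value in components.items():
    print(f"{name}: {value:.3f}")
```

With the counts for kks 298 (16 OP elements, 41 CT elements), the recomputed values agree with the table to within rounding.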
Subtraction: CT - OP
OP(298): 16 elements
  Exact 12 (75.0%)
  Group 2 (12.5%)
  Field 1 (6.2%)
  Unmatched 1 (6.2%)
CT(298, koma): 41 elements
  W 12 (29.3%)
  P1 3 (7.3%)
  P2 1 (4.0%)
  D 25 (59.5%)
Figure 12: Pie charts illustrating the components of OP(298) and CT(298, koma)
[Figure: network graphs of words linked to warbler and cuckoo in CT, with numeric edge weights. Node labels include treetop, summer mountains, midsummer rain, plum, spring haze, flower, voice, Tatsuta.PN, and Otowa.PN. Panel labels: warbler-CT-23-229-3.73-15 and cuckoo-CT-40-370-3.27-16.]
Conclusion
The thesaurus annotated with meta-codes allows researchers
1. to identify different orthographies as the same word;
2. to attach an alternative semantic ID to a word that has the
same form but more than one meaning (polysemous word);
3. to attach meta-codes not only to tokens recognised as a
single/simple word but also to longer tokens;
4. to indicate similarity between tokens;
5. to detect common or different tokens across more than one text,
which reveals the similarities or differences between texts;
6. to indicate the relative differences between two words in literary
works.
Questions
• Computer Modelling of Classical Japanese Poetic
Vocabulary
http://etymology.jp/waka/poem.cgi
• Inquiry:
Hilofumi Yamamoto
yamagen@ryu.titech.ac.jp
• Thank you.