The document discusses the development of a thesaurus of classical Japanese poetic vocabulary. It outlines how the thesaurus was created by analyzing poems from the Hachidaishu anthologies using techniques like tokenization, meta-code conversion, and matching original poems to scholarly translations to extract vocabulary terms and their meanings over time. The goal is to better understand the connotation and historical transition of classical poetic words in a longitudinal study.
1. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 1
Development of the Thesaurus of Classical
Japanese Poetic Vocabulary
Hilofumi Yamamoto
Tokyo Institute of Technology
15th July 2013
Outline
1. Purpose of Study
• Connotation of classical poetic vocabulary
• Longitudinal study of transition of vocabulary
2. Development of Thesaurus
3. Applications
Waka: Japanese Poetry
Tatsuta-hime /
tamukuru KAMI no / arebakoso
aki no konoha no / nusa to chirurame
because Princess Tatsuta
has a god to whom she offers brocades,
the leaves of trees
in autumn will scatter
as an offering.
Prince Kanemi
No. 298 in the Kokinshū
Problem: Orthography
in hiragana
たつた
in Chinese characters
立田
竜田
龍田
→ All Tatsuta (place name)
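One way to picture this normalisation is a lookup table from every spelling to a single canonical key. The sketch below is illustrative only: `VARIANTS` and `canonical` are hypothetical names and are not taken from the KH or t2c tools.

```python
# Hypothetical lookup table: each attested spelling maps to one canonical key.
VARIANTS = {
    "たつた": "Tatsuta",
    "立田": "Tatsuta",
    "竜田": "Tatsuta",
    "龍田": "Tatsuta",
}


def canonical(surface: str) -> str:
    """Return the canonical key for a spelling, or the spelling itself if unknown."""
    return VARIANTS.get(surface, surface)


for form in ["たつた", "立田", "竜田", "龍田"]:
    print(form, "->", canonical(form))
```

Unknown spellings pass through unchanged, so the table can grow as new variants are found.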
Problem: Unit size / attribution
The unit size and meaning of a word depend on its context.
• unit → 卯の花 or 卯/の/花 (Nakano, 1998)
• orthography → さびしい/さみしい/寂しい/淋しい
(sad)
• attributions → 卯の花 ∈ plant or 卯の花 ∈ food
(unohana = a deutzia or bean curd refuse)
An Item of Thesaurus: God
BG-01-2030-01-030-A-かみ-神
↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑
(1) (2) (3) (4) (5) (6) (7) (8)
Figure 1: Structure of an item of BG database in the case of kami (god):
(1) database ID (BG = short-unit general vocabulary);
(2) part of speech ID (01 = noun);
(3) group ID (2030 = Shinto deities and Buddhas);
(4) field ID;
(5) exact ID (030 = god);
(6) era-flag (A = contemporary, C = classic);
(7) Chinese character reading;
(8) Chinese character
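The eight-part structure can be read mechanically by splitting on the hyphen. The following is a minimal sketch, assuming every item has exactly eight hyphen-separated parts as in the kami example; the class and function names are hypothetical.

```python
from typing import NamedTuple


class ThesaurusItem(NamedTuple):
    database: str  # (1) database ID, e.g. BG
    pos: str       # (2) part of speech ID, e.g. 01 = noun
    group: str     # (3) group ID
    field: str     # (4) field ID
    exact: str     # (5) exact ID
    era: str       # (6) era-flag: A = contemporary, C = classic
    reading: str   # (7) Chinese character reading
    written: str   # (8) Chinese character


def parse_item(item: str) -> ThesaurusItem:
    """Split an item such as 'BG-01-2030-01-030-A-かみ-神' into its eight parts."""
    parts = item.split("-", 7)
    if len(parts) != 8:
        raise ValueError(f"expected 8 hyphen-separated parts, got {len(parts)}")
    return ThesaurusItem(*parts)


kami = parse_item("BG-01-2030-01-030-A-かみ-神")
print(kami.group, kami.exact, kami.written)
```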
Development: Thesaurus, KH, and t2c
• Thesaurus for classical poetic vocabulary
• KH (tokenizer)
• t2c (token to code converter)
Materials: the Hachidaishū
• The Hachidaishū: eight anthologies compiled by
imperial order between ca. 905 and 1205.
• The database: compiled by the National Institute of
Japanese Literature, Japan.
• Texts are based on the Shōhobonban version of the
Hachidaishū.
Timeline, 900–1250 (the number after each arrow is the interval in years to the next anthology):
Kokinshū (•905) → 46
Gosenshū (•951) → 56
Shūishū (•1007) → 79
Goshūishū (1086) → 38
Kin'yōshū (•1124) → 20
Shikashū (•1144) → 44
Senzaishū (1188) → 17
Shinkokinshū (1205)
Methods: Flowchart of data processing
A. Corpus development
B. Tokenisation
C. Meta-code conversion
D. Mathematical modelling
E. Subtraction: CT − OP
F. Visualisation
Table 1: An example of input and output for KH (Gosenshū No. 664)
input: 000664 わすられて思ふなげきのしげるをや身をはづかしのもりといふらん
output:000664
わすら (ラ四-未:忘る:わする:忘ら:わすら)
れ (自可受-用:る:る:れ:れ)
て (接助:て:て)
思ふ (ハ四-終体:思ふ:おもふ:思ふ:おもふ)
なげき (カ四-用:嘆く:なげく:嘆き:なげき)
の (格助:の:の)
しげる (ラ四-終体:茂る:しげる:茂る:しげる)
を (*助:を:を)
や (係助:や:や)
身 (名:身:み)
を (*助:を:を)
---
はづかし (名-地名:羽束師:はづかし)
の (格助:の:の)
---
はづかし (形シク-終:恥づかし:はづかし:恥づかし:はづかし)
の (格助:の:の)
---
もり (名:森:もり)
と (格助-引用:と:と)
いふ (ハ四-終体:言ふ:いふ:言ふ:いふ)
らん (推-終体:らむ:らむ:らむ:らむ)
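The output lines in Table 1 follow a regular shape, so they can be read back into records. This is a minimal sketch, assuming each token line is a surface form followed by colon-separated fields in parentheses, with the part-of-speech tag first. The names `KhToken` and `parse_kh_line` are hypothetical, and the remaining fields are kept uninterpreted because the slide does not define them.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Assumed line shape, read off Table 1: surface form, then "(POS:field:field...)".
KH_LINE = re.compile(r"^(?P<surface>\S+)\s*\((?P<fields>[^()]*)\)$")


class KhToken(NamedTuple):
    surface: str            # token as it appears in the poem
    pos: str                # part-of-speech tag, e.g. "名" or "ラ四-未"
    forms: Tuple[str, ...]  # remaining colon-separated fields, kept as-is


def parse_kh_line(line: str) -> Optional[KhToken]:
    """Parse one output line; return None for separators such as '---'."""
    match = KH_LINE.match(line.strip())
    if match is None:
        return None
    pos, *forms = match.group("fields").split(":")
    return KhToken(match.group("surface"), pos, tuple(forms))


sample = [
    "わすら (ラ四-未:忘る:わする:忘ら:わすら)",
    "身 (名:身:み)",
    "---",
]
for line in sample:
    print(parse_kh_line(line))
```

The `---` separators, which delimit the two competing readings of はづかし (place name versus adjective), come back as `None`, so a caller can use them to group alternatives.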
Development: Thesaurus
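Diagram components: poem texts (Hachidaishu); kh (tokeniser); t2c (thesaurus code tagger); thesaurus; dictionary (general, place name, personal name, etc.).
Feedback steps, labelled (A) and (B) in the diagram: add unknown entries to the dictionary; add new thesaurus codes to the thesaurus.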
(A) Corpus: Poems (OP)
KW00029800|A|KANEMI NO Ō=kanemi no ō
KW00029800|B|Tatsutahime[NOUN-PLNAME:TATSUTAHIME]/→
tamukuru[KASHIMO2-ATTR:TAMUkuru],kami[NOUN:KAMI]→
no[SUB]are[RAHEN-REAL]ba[CAUS]koso[KP]/→
aki[NOUN:AKI]no[CON],konoha[NOUN:KOnoHA]no[SUB]/→
nusa[NOUN:NUSA]to[P-CRD],chiru[RA4DAN-FF:CHIru]→
rame[CJR-REAL]/
Figure 2: Format of the database of a poem: → indicates continuing to the
next line without breaks; the first line, which includes |A|, indicates
the name of the poet; the second line which includes |B|, indicates
the contents of the poem and added information.
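The bracketed annotations in Figure 2 can be pulled apart with a regular expression. This is a minimal sketch, assuming each token is a romanised form followed by `[TAG]` or `[TAG:ANNOTATION]`; the pattern and the function name are hypothetical and are based only on the one poem shown.

```python
import re
from typing import List, Tuple

# Assumed token shape: romanised form, then [TAG] or [TAG:ANNOTATION].
TOKEN = re.compile(r"([A-Za-z]+)\[([^\]:]+)(?::([^\]]+))?\]")


def parse_poem_line(text: str) -> List[Tuple[str, str, str]]:
    """Return (form, tag, annotation) triples; annotation is '' when absent."""
    return TOKEN.findall(text)


line = ("tamukuru[KASHIMO2-ATTR:TAMUkuru],kami[NOUN:KAMI]"
        "no[SUB]are[RAHEN-REAL]ba[CAUS]koso[KP]/")
for form, tag, annotation in parse_poem_line(line):
    print(form, tag, annotation)
```

Separators such as `/` and `,` are skipped because they never match the token pattern.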
(A) Corpus: Translations (CT)
$A|000298
$B|秋の末近くなって帰り道についた龍田姫が、道中の無事を願って手向け →
をする神があるからこそ、秋の木の葉が幣となって散っているのだろう。
$C|秋の歌
$D|秋の末近くなって帰り道についた龍田姫が、道中の無事を願って手向け →
をする神があるからこそ、秋の木の葉が幣となって散っているのだろう。
$I|あきのすえちかくなってかえりみちについたたつたひめが、どうちゅう →
のぶじをねがってたむけをするかみがあるからこそ、あきのこのはがぬさ →
となってちっているのだろう。
Figure 3: Format of the database of a CT
(B) Tokenisation:
original text
立田姫手向ける神の有ればこそ秋の木の葉の幣と散るらめ
↓
tokenising
立田姫/手向ける/神/の/[有れ]/ば/こそ/秋/の/木の葉/の/幣/と/散る/[らめ]
↓
converting into predicative form
立田姫/手向ける/神/の/[有り]/ば/こそ/秋/の/木の葉/の/幣/と/散る/[らむ]
Figure 4: Tokenisation of poem texts
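The second step of Figure 4 can be sketched as a table lookup over the tokenised poem. The table below is hypothetical and covers only the two bracketed tokens from the figure; a real converter would need the full inflection paradigm.

```python
from typing import List

# Hypothetical table from inflected form to predicative form,
# limited to the two bracketed tokens in Figure 4.
PREDICATIVE = {"有れ": "有り", "らめ": "らむ"}


def to_predicative(tokens: List[str]) -> List[str]:
    """Replace each token that has a known predicative form; keep the rest."""
    return [PREDICATIVE.get(token, token) for token in tokens]


tokens = "立田姫/手向ける/神/の/有れ/ば/こそ/秋/の/木の葉/の/幣/と/散る/らめ".split("/")
print("/".join(to_predicative(tokens)))
```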
(C) meta-code conversion
CH-29-2130-01-010-A たつたひめ 立田姫 Tatsutahime Princess-Tatsuta
CH-29-0000-14-010-A -- 立田 -- Tatsuta Tatsuta
BG-01-2030-01-101-A -- 姫 -- hime princess
BG-02-3770-04-080-C たむくる 手向く tamukuru present(verb)
BG-01-5730-02-010-A -- 手 -- te hand
BG-02-1700-01-040-A -- 向ける -- mukeru for
BG-01-2030-01-030-A かみ 神 kami god
BG-08-0061-07-010-A の の no SUB (particle)
BG-02-1200-01-010-C あれ 有り are be
BG-08-0064-26-010-A ば ば ba because (particle)
BG-04-1120-05-150-A -- ば -- ba because (reason)
BG-08-0065-01-010-A こそ こそ koso KP (emphasis)
Figure 5: Meta-code conversion in case of OP
(C) Structure of meta-code-1
BG-01-2030-01-030-A-かみ-神
↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑
(1) (2) (3) (4) (5) (6) (7) (8)
Figure 6: Structure of an item of BG database in the case of kami (god):
(1) database ID (BG = short-unit general vocabulary);
(2) part of speech ID (01 = noun);
(3) group ID (2030 = Shinto deities and Buddhas);
(4) field ID;
(5) exact ID (030 = god);
(6) era-flag (A = contemporary, C = classic);
(7) Chinese character reading;
(8) Chinese character
(C) Structure of the meta-code-2
BG-01-2600-01-020-A
yononaka (world)
(1) = BG-01-2610-01-040-A
yo (world)
(2)
+ BG-08-0010-01-021-A
no (of)
(3)
+ BG-01-1770-01-080-A
naka (inside)
(4)
Figure 7: Structure of an item of the semantic table in the case
of a compound word, yononaka (world)
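A compound entry of this kind can be held as a mapping from the compound's code to the codes of its parts. The sketch below uses only the codes shown in Figure 7; the names `COMPOUNDS` and `expand` are hypothetical.

```python
from typing import Dict, List

# Compound code -> codes of its components, copied from Figure 7 (yononaka).
COMPOUNDS: Dict[str, List[str]] = {
    "BG-01-2600-01-020-A": [    # yononaka (world)
        "BG-01-2610-01-040-A",  # yo (world)
        "BG-08-0010-01-021-A",  # no (of)
        "BG-01-1770-01-080-A",  # naka (inside)
    ],
}


def expand(code: str) -> List[str]:
    """Return the code followed by its component codes, if it is a known compound."""
    return [code] + COMPOUNDS.get(code, [])


print(expand("BG-01-2600-01-020-A"))
```

Keeping both the compound and its parts lets a token be compared at either unit size.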
(C) meta-code conversion-3
CH-29-2130-01-010-A たつたひめ 立田姫 Tatsutahime Princess-Tatsuta
CH-29-0000-14-010-A -- 立田 -- Tatsuta Tatsuta
BG-01-2030-01-101-A -- 姫 -- hime princess
BG-02-3770-04-080-C たむくる 手向く tamukuru present(verb)
BG-01-5730-02-010-A -- 手 -- te hand
BG-02-1700-01-040-A -- 向ける -- mukeru for
BG-01-2030-01-030-A かみ 神 kami god
BG-08-0061-07-010-A の の no SUB (particle)
BG-02-1200-01-010-C あれ 有り are be
BG-08-0064-26-010-A ば ば ba because (particle)
BG-04-1120-05-150-A -- ば -- ba because (reason)
BG-08-0065-01-010-A こそ こそ koso KP (emphasis)
Figure 8: Meta-code conversion in case of OP
Figure 9: Schema of relationship between OP and CT. The poet (field of experience: 10th century) writes the OP; the expert reader (field of experience: 20th century, expert) reads the OP and writes the CT; the novice reader (field of experience: 20th century, novice) reads the CT; the OP and the CT are compared.
+-------- # of pair
| +----- value of matching level, exact=17, field=13, group=10
| | +-- # of POS
| | |
| | | # of element of OP ----+ +- # of element of CT
| | | element of OP -+ | | +--- element of CT
| | | | | | |
1 17 11 立田姫 00 <-> 12 龍田姫 (Tatsutahime)
2 17 47 手 04 <-> 25 手 (hand)
3 17 47 向ける 05 <-> 26 向ける (toward)
4 17 2 神 06 <-> 32 神 (god)
5 10 61 の 07 <-> 33 が (SUB)
6 17 47 有り 08 <-> 34 ある (be)
7 10 64 ば 09 <-> 35 から (because)
8 17 65 こそ 11 <-> 36 こそ (EM)
9 17 2 秋 12 <-> 38 秋 (autumn)
10 17 71 の 13 <-> 39 の (CON)
11 17 2 木の葉 14 <-> 40 木の葉 (leaf of tree)
12 17 2 幣 19 <-> 45 幣 (present)
13 17 61 と 20 <-> 46 と (CRD)
14 17 47 散る 21 <-> 49 散る (fall)
15 13 74 らむ 22 <-> 54 う (CJR)
Figure 10: Example of the matching process
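The three matching levels can be pictured as comparing two meta-codes part by part. The sketch below rests on an assumption: an exact match (17) means agreement through the exact ID, a field match (13) through the field ID, and a group match (10) through the group ID. The slides give the values but not the comparison rule, and the value 0 for "no match" is also an assumption.

```python
EXACT, FIELD, GROUP, NO_MATCH = 17, 13, 10, 0  # NO_MATCH = 0 is an assumption


def match_level(code_a: str, code_b: str) -> int:
    """Compare two meta-codes such as 'BG-01-2030-01-030-A' (assumed rule)."""
    a = code_a.split("-")[:5]  # database, POS, group, field, exact
    b = code_b.split("-")[:5]
    if a == b:
        return EXACT
    if a[:4] == b[:4]:
        return FIELD
    if a[:3] == b[:3]:
        return GROUP
    return NO_MATCH


kami = "BG-01-2030-01-030-A"
hime = "BG-01-2030-01-101-A"
te = "BG-01-5730-02-010-A"
print(match_level(kami, kami), match_level(kami, hime), match_level(kami, te))
```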
Residual
CT (秋の末近くなって帰り道についた)龍田姫(が道中の無事を願って)手 向け
OP — —— — — — — — — — 立田姫 — — — — — — — 手向ける
CT (をする)神があるからこそ秋の木の葉(が)幣(となって)散っ(ているのだろ) う
OP — — 神のあれ ば こそ秋の木の葉[の]幣 と — — 散る — — — — らめ
Figure 11: Example of the matching process in the case of kks 298 in
Komachiya (1982)
Components of OP
Table 2: Result of subtracting the elements of OP(298) from those
of CT(298, koma), showing the proportion of each component
of OP(298).
OP (valid number of elements) = 16
E (ratio of exact match) 12/16 = 0.750
F (ratio of field match) 1/16 = 0.062
G (ratio of group match) 2/16 = 0.125
T (ratio of total match) 15/16 = 0.938
U (ratio of unmatched OP) 1 - T = 0.062
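The ratios in Table 2 follow from counting the matching levels listed in Figure 10. A minimal sketch, with the level list typed in by hand from that figure and the variable names chosen for illustration:

```python
from collections import Counter

# Matching levels of the 15 pairs in Figure 10 (17 = exact, 13 = field, 10 = group).
levels = [17, 17, 17, 17, 10, 17, 10, 17, 17, 17, 17, 17, 17, 17, 13]
op_size = 16  # valid number of OP elements, from Table 2

counts = Counter(levels)
E = counts[17] / op_size  # ratio of exact match
F = counts[13] / op_size  # ratio of field match
G = counts[10] / op_size  # ratio of group match
T = E + F + G             # ratio of total match
U = 1 - T                 # ratio of unmatched OP

for name, value in [("E", E), ("F", F), ("G", G), ("T", T), ("U", U)]:
    print(f"{name} = {value:.3f}")
```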
Calculation of Residual Rate
\begin{align}
D &= 1 - \frac{P}{T} \tag{1} \\
  &= 1 - \frac{16}{41} \tag{2} \\
  &= 0.61 \tag{3}
\end{align}
Components of CT
Table 3: Components of CT in the case of kks 298 by Komachiya (1982):
fabs(D-H) is the absolute value of the practical value, D, minus
the theoretical value, H.
CT (valid number of elements) = 41
Item                              Calculation                     Formula
W  (ratio of original word use)   12/41 = 0.293                   E/CT
A  (ratio of annotation)          1 - 0.293 = 0.707               1-W
--- breakdown of the annotation ---
P1 (ratio of FG paraphrased)      (0.62+0.12)/0.707 = 0.073       (F+G)/A
P2 (ratio of U paraphrased)       (0.707-0.073)*0.062 = 0.040     (A-P1)*U
D  (ratio of purely added)        0.707-(0.073+0.040) = 0.595     A-(P1+P2)
H  (theoretical value of D)       1-16/41 = 0.610                 1-OP/CT
Gap                               fabs(0.595-0.610) = 0.015       fabs(D-H)
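The breakdown in Table 3 can be recomputed from the raw counts. The sketch below rests on one assumption that should be checked against the original study: P1 is taken as the number of field and group matches divided by the CT size (3/41), because that reproduces the printed 0.073 and the "P1 3 (7.3%)" of Figure 12, whereas the expression printed in the P1 row does not evaluate to 0.073. The function name is hypothetical.

```python
from typing import Dict


def ct_components(op: int, ct: int, exact: int, field: int, group: int) -> Dict[str, float]:
    """Recompute the Table 3 quantities from counts (P1 formula is an assumption)."""
    W = exact / ct                         # ratio of original word use
    A = 1 - W                              # ratio of annotation
    P1 = (field + group) / ct              # assumed: F and G counts over CT size
    U = (op - exact - field - group) / op  # ratio of unmatched OP
    P2 = (A - P1) * U                      # ratio of U paraphrased
    D = A - (P1 + P2)                      # ratio of purely added
    H = 1 - op / ct                        # theoretical value of D
    return {"W": W, "A": A, "P1": P1, "P2": P2, "D": D, "H": H, "Gap": abs(D - H)}


result = ct_components(op=16, ct=41, exact=12, field=1, group=2)
for name, value in result.items():
    print(f"{name} = {value:.3f}")
```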
Subtraction: CT - OP
OP(298): 16 elements
Exact 12 (75.0%)
Field 1 (6.2%)
Group 2 (12.5%)
Unmatched 1 (6.2%)
CT(298, koma): 41 elements
W 12 (29.3%)
P1 3 (7.3%)
P2 1 (4.0%)
D 25 (59.5%)
Figure 12: Pie charts illustrating the components of OP(298) and CT(298,
koma)
(E) Mathematical modelling
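\begin{align}
\mathrm{cw}(t_1, t_2) &= \bigl(1 + \log \mathrm{ctf}(t_1, t_2)\bigr)\,\sqrt{\mathrm{idf}(t_1)\,\mathrm{idf}(t_2)} \tag{4} \\
\mathrm{idf}(t) &= \log \frac{N}{\mathrm{df}(t)} \tag{5}
\end{align}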
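Equations (4) and (5) translate directly into code. The sketch below makes assumptions that the slide does not state: ctf is read as the co-occurrence frequency of the two terms, N as the number of poems, df(t) as the number of poems containing t, and log as the natural logarithm. The numbers in the example are made up for illustration.

```python
import math


def idf(n_docs: int, df: int) -> float:
    """Equation (5): idf(t) = log(N / df(t)); natural log is an assumption."""
    return math.log(n_docs / df)


def cw(ctf: int, idf_1: float, idf_2: float) -> float:
    """Equation (4): cw = (1 + log ctf(t1, t2)) * sqrt(idf(t1) * idf(t2))."""
    return (1 + math.log(ctf)) * math.sqrt(idf_1 * idf_2)


# Made-up counts, not taken from the Hachidaishu data.
n_poems = 1000
weight = cw(ctf=5, idf_1=idf(n_poems, 50), idf_2=idf(n_poems, 20))
print(f"cw = {weight:.3f}")
```

Both functions require positive counts, since the logarithm is undefined at zero.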
[Figure: two word-network graphs, titled warbler-CT-23-229-3.73-15 and cuckoo-CT-40-370-3.27-16. Nodes are words (e.g. warbler, spring, flower, plum, spring haze, Tatsuta.PN; cuckoo, summer, May, mountain, voice, midsummer rain, Otowa.PN) and the graphs carry numeric labels.]
Conclusion
The thesaurus annotated with meta-codes allows researchers
1. to identify different orthographies as the same word;
2. to attach an alternative semantic ID to a word which has the
same form but more than one meaning (polysemous word);
3. to attach meta-codes not only to tokens recognised as
single/simple words but also to longer tokens;
4. to indicate a similarity between tokens;
5. to detect common or different tokens among two or more texts,
which reveals the similarities or differences between the texts;
6. to indicate the relative differences between two words in literary
works.
Questions
• Computer Modelling of Classical Japanese Poetic
Vocabulary
http://warbler.ryu.titech.ac.jp/waka/poem.cgi
• Inquiry:
Hilofumi Yamamoto
yamagen@ryu.titech.ac.jp
• Thank you.