TokyoTech and U. of Chicago Seminar 2013, Tokyo, Japan
Development of the Thesaurus of Classical
Japanese Poetic Vocabulary
Hilofumi Yamamoto
Tokyo Institute of Technology
15th July 2013
Outline
1. Purpose of Study
• Connotation of classical poetic vocabulary
• Longitudinal study of transition of vocabulary
2. Development of Thesaurus
3. Applications
Waka: Japanese Poetry
Tatsuta-Hime..
tamukuru KAMI no / arebakoso
aki no konoha no / nusa to chirurame
because Princess Tatsuta
has a god to whom she offers brocades,
the leaves of trees
in autumn will scatter
as an offering.
Prince Kanemi
No. 298 in the Kokinshū
Problem: Orthography
in hiragana
たつた
in Chinese characters
立田
竜田
龍田
→ All Tatsuta (place name)
Problem: Unit size / attribution
The unit size and meaning of a word depend on context.
• unit → 卯の花 or 卯/の/花 (Nakano, 1998)
• orthography → さびしい/さみしい/寂しい/淋しい
(sad)
• attributions → 卯の花 ∈ plant or 卯の花 ∈ food
(unohana = a deutzia or bean curd refuse)
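The orthography and attribution problems above can be pictured with two small lookup tables. This is only an illustrative sketch: the names `VARIANTS`, `ATTRIBUTIONS`, `normalise`, and `categories` are hypothetical, and the keys and category labels are assumptions rather than entries taken from the actual thesaurus.

```python
# Hypothetical lookup tables; the real thesaurus is far larger.
# Several spellings share one key (the orthography problem).
VARIANTS = {
    "たつた": "Tatsuta",
    "立田": "Tatsuta",
    "竜田": "Tatsuta",
    "龍田": "Tatsuta",
    "さびしい": "sabishii",
    "さみしい": "sabishii",
    "寂しい": "sabishii",
    "淋しい": "sabishii",
}

# One written form may belong to several categories (the attribution problem).
ATTRIBUTIONS = {
    "卯の花": ["plant", "food"],
}


def normalise(surface: str) -> str:
    """Return the shared key for a spelling, or the spelling itself if unknown."""
    return VARIANTS.get(surface, surface)


def categories(surface: str) -> list[str]:
    """Return every category recorded for a written form (empty if none)."""
    return ATTRIBUTIONS.get(surface, [])


print(normalise("竜田") == normalise("龍田"))
print(categories("卯の花"))
```

A plain dictionary is enough to show the idea; choosing between the two readings of 卯の花 in a given poem still needs context.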
An Item of Thesaurus: God
BG-01-2030-01-030-A-かみ-神
↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑
(1) (2) (3) (4) (5) (6) (7) (8)
Figure 1: Structure of an item of BG database in the case of kami (god):
(1) database ID (BG = short-unit general vocabulary);
(2) part of speech ID (01 = noun);
(3) group ID (2030 = Shinto deities and Buddhas);
(4) field ID;
(5) exact ID (030 = god);
(6) era-flag (A = contemporary, C = classic);
(7) Chinese character reading;
(8) Chinese character
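The eight-part item in Figure 1 can be split mechanically on its hyphens. The sketch below assumes that the first six parts never contain a hyphen themselves; the `MetaCode` type and `parse_item` function are hypothetical names introduced for illustration, not part of the thesaurus tools.

```python
from typing import NamedTuple


class MetaCode(NamedTuple):
    database: str  # (1) database ID, e.g. "BG"
    pos: str       # (2) part of speech ID, e.g. "01" = noun
    group: str     # (3) group ID
    field: str     # (4) field ID
    exact: str     # (5) exact ID
    era: str       # (6) era flag: "A" = contemporary, "C" = classic
    reading: str   # (7) reading of the Chinese character
    kanji: str     # (8) Chinese character


def parse_item(item: str) -> MetaCode:
    """Split a thesaurus item into its eight parts (IDs stay strings to keep leading zeros)."""
    parts = item.split("-", 7)
    if len(parts) != 8:
        raise ValueError(f"expected 8 hyphen-separated parts: {item!r}")
    return MetaCode(*parts)


kami = parse_item("BG-01-2030-01-030-A-かみ-神")
print(kami.group, kami.exact, kami.era)
```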
Development: Thesaurus, KH, and t2c
• Thesaurus for classical poetic vocabulary
• KH (tokenizer)
• t2c (token to code converter)
Materials: the Hachidaishū
• The Hachidaishū: eight anthologies compiled by
imperial order between ca. 905 and 1205.
• The database: compiled by the National Institute of
Japanese Literature, Japan.
• Old texts are taken from the Shōhobonban version of the
Hachidaishū
[Timeline, 900–1250: anthology (year of compilation), followed by the number of years to the next anthology]
Kokinshū (•905), 46
Gosenshū (•951), 56
Shūishū (•1007), 79
Goshūishū (1086), 38
Kinyōshū (•1124), 20
Shikashū (•1144), 44
Senzaishū (1188), 17
Shinkokinshū (1205)
Methods: Flowchart of data processing
A. Corpus development
B. Tokenisation
C. Meta-code conversion
D. Mathematical modelling
E. Subtraction: CT − OP
F. Visualisation
Table 1: An example of input and output for KH (Gosenshū No. 664)
input: 000664 わすられて思ふなげきのしげるをや身をはづかしのもりといふらん
output:000664
わすら (ラ四-未:忘る:わする:忘ら:わすら)
れ (自可受-用:る:る:れ:れ)
て (接助:て:て)
思ふ (ハ四-終体:思ふ:おもふ:思ふ:おもふ)
なげき (カ四-用:嘆く:なげく:嘆き:なげき)
の (格助:の:の)
しげる (ラ四-終体:茂る:しげる:茂る:しげる)
を (*助:を:を)
や (係助:や:や)
身 (名:身:み)
を (*助:を:を)
---
はづかし (名-地名:羽束師:はづかし)
の (格助:の:の)
---
はづかし (形シク-終:恥づかし:はづかし:恥づかし:はづかし)
の (格助:の:の)
---
もり (名:森:もり)
と (格助-引用:と:と)
いふ (ハ四-終体:言ふ:いふ:言ふ:いふ)
らん (推-終体:らむ:らむ:らむ:らむ)
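Each output line in Table 1 pairs a surface token with colon-separated fields in parentheses. The sketch below reads one such line. The field interpretation is an assumption made for illustration (three fields: tag, lemma, reading; five fields: tag, lemma, lemma reading, attested form, form reading), and `KhToken` and `parse_kh_line` are hypothetical names, not part of KH itself.

```python
import re
from typing import NamedTuple, Optional


class KhToken(NamedTuple):
    surface: str
    pos: str
    lemma: str
    lemma_reading: str
    form: Optional[str]          # only present in five-field entries
    form_reading: Optional[str]  # only present in five-field entries


# "<surface> (<field>:<field>:...)" with ASCII parentheses and colons.
LINE = re.compile(r"^(\S+)\s+\((.+)\)$")


def parse_kh_line(line: str) -> KhToken:
    """Parse one token line; separator lines such as '---' raise ValueError."""
    m = LINE.match(line.strip())
    if m is None:
        raise ValueError(f"not a KH token line: {line!r}")
    surface, body = m.groups()
    fields = body.split(":")
    if len(fields) == 3:
        pos, lemma, reading = fields
        return KhToken(surface, pos, lemma, reading, None, None)
    if len(fields) == 5:
        return KhToken(surface, *fields)
    raise ValueError(f"unexpected number of fields: {line!r}")


print(parse_kh_line("わすら (ラ四-未:忘る:わする:忘ら:わすら)"))
print(parse_kh_line("身 (名:身:み)"))
```

The `---` lines in the table mark alternative analyses of はづかし (a place name or an adjective); a fuller reader would group the lines between separators instead of rejecting them.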
Development: Thesaurus
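[Diagram: poem texts (Hachidaishū) → kh (tokeniser) → t2c (code tagger) → Hachidaishū thesaurus. kh draws on a dictionary (general, place names, personal names, etc.) and t2c draws on the thesaurus. Two feedback loops, labelled (A) and (B): add unknown entries to the dictionary; add new thesaurus codes to the thesaurus.]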
(A) Corpus: Poems (OP)
KW00029800|A|KANEMI NO Ō=kanemi no ō
KW00029800|B|Tatsutahime[NOUN-PLNAME:TATSUTAHIME]/→
tamukuru[KASHIMO2-ATTR:TAMUkuru],kami[NOUN:KAMI]→
no[SUB]are[RAHEN-REAL]ba[CAUS]koso[KP]/→
aki[NOUN:AKI]no[CON],konoha[NOUN:KOnoHA]no[SUB]/→
nusa[NOUN:NUSA]to[P-CRD],chiru[RA4DAN-FF:CHIru]→
rame[CJR-REAL]/
Figure 2: Format of the database of a poem: → indicates continuing to the
next line without breaks; the first line, which includes |A|, indicates
the name of the poet; the second line which includes |B|, indicates
the contents of the poem and added information.
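The |B| line in Figure 2 can be read with one regular expression once the continuation arrows have been joined into a single string. The sketch below assumes that every token is a run of Latin letters followed by `[TAG]` or `[TAG:LEMMA]`, and that `/` and `,` are only separators; `TOKEN` and `parse_poem_record` are hypothetical names.

```python
import re

# word[TAG] or word[TAG:LEMMA]; "/" and "," between tokens are skipped.
TOKEN = re.compile(r"([A-Za-z]+)\[([^\]:]+)(?::([^\]]+))?\]")


def parse_poem_record(record: str):
    """Return (poem ID, line type, [(word, tag, lemma or None), ...])."""
    poem_id, kind, body = record.split("|", 2)
    tokens = [(word, tag, lemma or None) for word, tag, lemma in TOKEN.findall(body)]
    return poem_id, kind, tokens


record = (
    "KW00029800|B|Tatsutahime[NOUN-PLNAME:TATSUTAHIME]/"
    "tamukuru[KASHIMO2-ATTR:TAMUkuru],kami[NOUN:KAMI]"
    "no[SUB]are[RAHEN-REAL]ba[CAUS]koso[KP]/"
)
poem_id, kind, tokens = parse_poem_record(record)
print(poem_id, kind)
for token in tokens:
    print(token)
```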
(A) Corpus: Translations (CT)
$A|000298
$B|秋の末近くなって帰り道についた龍田姫が、道中の無事を願って手向け →
をする神があるからこそ、秋の木の葉が幣となって散っているのだろう。
$C|秋の歌
$D|秋の末近くなって帰り道についた龍田姫が、道中の無事を願って手向け →
をする神があるからこそ、秋の木の葉が幣となって散っているのだろう。
$I|あきのすえちかくなってかえりみちについたたつたひめが、どうちゅう →
のぶじをねがってたむけをするかみがあるからこそ、あきのこのはがぬさ →
となってちっているのだろう。
Figure 3: Format of the database of a CT
(B) Tokenisation:
original text
立田姫手向ける神の有ればこそ秋の木の葉の幣と散るらめ
↓
tokenising
立田姫/手向ける/神/の/[有れ]/ば/こそ/秋/の/木の葉/の/幣/と/散る/[らめ]
↓
converting into predicative form
立田姫/手向ける/神/の/[有り]/ば/こそ/秋/の/木の葉/の/幣/と/散る/[らむ]
Figure 4: Tokenisation of poem texts
(C) meta-code conversion
CH-29-2130-01-010-A たつたひめ 立田姫 Tatsutahime Princess-Tatsuta
CH-29-0000-14-010-A -- 立田 -- Tatsuta Tatsuta
BG-01-2030-01-101-A -- 姫 -- hime princess
BG-02-3770-04-080-C たむくる 手向く tamukuru present(verb)
BG-01-5730-02-010-A -- 手 -- te hand
BG-02-1700-01-040-A -- 向ける -- mukeru for
BG-01-2030-01-030-A かみ 神 kami god
BG-08-0061-07-010-A の の no SUB (particle)
BG-02-1200-01-010-C あれ 有り are be
BG-08-0064-26-010-A ば ば ba because (particle)
BG-04-1120-05-150-A -- ば -- ba because (reason)
BG-08-0065-01-010-A こそ こそ koso KP (emphasis)
Figure 5: Meta-code conversion in case of OP
(C) Structure of meta-code-1
BG-01-2030-01-030-A-かみ-神
↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑
(1) (2) (3) (4) (5) (6) (7) (8)
Figure 6: Structure of an item of BG database in the case of kami (god):
(1) database ID (BG = short-unit general vocabulary);
(2) part of speech ID (01 = noun);
(3) group ID (2030 = Shinto deities and Buddhas);
(4) field ID;
(5) exact ID (030 = god);
(6) era-flag (A = contemporary, C = classic);
(7) Chinese character reading;
(8) Chinese character
(C) Structure of the meta-code-2
  BG-01-2600-01-020-A  yononaka (world)  (1)
= BG-01-2610-01-040-A  yo (world)        (2)
+ BG-08-0010-01-021-A  no (of)           (3)
+ BG-01-1770-01-080-A  naka (inside)     (4)
Figure 7: Structure of an item of the semantic table in the case
of a compound word, yononaka (world)
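One way to hold the relation in Figure 7 in a program is a mapping from the compound's code to the codes of its parts, so that a matcher can try the long unit first and fall back to the components. This is a sketch under that assumption; `COMPOUNDS` and `expand` are hypothetical names, and only the four codes shown in the figure are used.

```python
# Compound code -> component codes, taken from Figure 7.
COMPOUNDS = {
    "BG-01-2600-01-020-A": [      # yononaka (world)
        "BG-01-2610-01-040-A",    # yo (world)
        "BG-08-0010-01-021-A",    # no (of)
        "BG-01-1770-01-080-A",    # naka (inside)
    ],
}


def expand(code: str) -> list[str]:
    """Return the code itself followed by its component codes, if it is a compound."""
    return [code] + COMPOUNDS.get(code, [])


for code in expand("BG-01-2600-01-020-A"):
    print(code)
```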
[Diagram: the poet (10th-century field of experience) writes the OP; an expert reader (20th-century field of experience, expert) reads the OP and writes the CT; a novice reader (20th-century field of experience, novice) reads the CT; the OP and the CT are compared.]
Figure 9: Schema of relationship between OP and CT
+-------- # of pair
| +----- value of matching level, exact=17, field=13, group=10
| | +-- # of POS
| | |
| | | # of element of OP ----+ +- # of element of CT
| | | element of OP -+ | | +--- element of CT
| | | | | | |
1 17 11 立田姫 00 <-> 12 龍田姫 (Tatsutahime)
2 17 47 手 04 <-> 25 手 (hand)
3 17 47 向ける 05 <-> 26 向ける (toward)
4 17 2 神 06 <-> 32 神 (god)
5 10 61 の 07 <-> 33 が (SUB)
6 17 47 有り 08 <-> 34 ある (be)
7 10 64 ば 09 <-> 35 から (because)
8 17 65 こそ 11 <-> 36 こそ (EM)
9 17 2 秋 12 <-> 38 秋 (autumn)
10 17 71 の 13 <-> 39 の (CON)
11 17 2 木の葉 14 <-> 40 木の葉 (leaf of tree)
12 17 2 幣 19 <-> 45 幣 (present)
13 17 61 と 20 <-> 46 と (CRD)
14 17 47 散る 21 <-> 49 散る (fall)
15 13 74 らむ 22 <-> 54 う (CJR)
Figure 10: Example of the matching process
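The three matching levels in Figure 10 (exact = 17, field = 13, group = 10) can be read as a hierarchical comparison of two meta-codes. The sketch below assumes that a group match needs equal group IDs, a field match needs equal group and field IDs, and an exact match needs equal group, field, and exact IDs. That rule is an assumption made for illustration; the slides do not spell it out. `match_level` and `NO_MATCH` are hypothetical names.

```python
EXACT, FIELD, GROUP, NO_MATCH = 17, 13, 10, 0


def match_level(code_a: str, code_b: str) -> int:
    """Compare the group, field, and exact IDs (parts 3 to 5) of two meta-codes."""
    group_a, field_a, exact_a = code_a.split("-")[2:5]
    group_b, field_b, exact_b = code_b.split("-")[2:5]
    if group_a != group_b:
        return NO_MATCH
    if field_a != field_b:
        return GROUP
    if exact_a != exact_b:
        return FIELD
    return EXACT


kami = "BG-01-2030-01-030-A"  # god
hime = "BG-01-2030-01-101-A"  # princess
te = "BG-01-5730-02-010-A"    # hand

print(match_level(kami, kami))  # same code
print(match_level(kami, hime))  # same group and field, different exact ID
print(match_level(kami, te))    # different group
```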
Residual
CT (秋の末近くなって帰り道についた)龍田姫(が道中の無事を願って)手 向け
OP — —— — — — — — — — 立田姫 — — — — — — — 手向ける
CT (をする)神があるからこそ秋の木の葉(が)幣(となって)散っ(ているのだろ) う
OP — — 神のあれ ば こそ秋の木の葉[の]幣 と — — 散る — — — — らめ
Figure 11: Example of the matching process in the case of kks 298 in Komachiya (1982)
Components of OP
Table 2: Result of subtracting the elements of OP(298) from those
of CT(298, koma): it indicates the ratio of the ingredients
of OP(298).
OP (valid number of element) = 16
E (ratio of exact match) 12/16 = 0.750
F (ratio of field match) 1/16 = 0.062
G (ratio of group match) 2/16 = 0.125
T (ratio of total match) 15/16 = 0.938
U (ratio of unmatched OP) 1 - T = 0.062
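The ratios in Table 2 follow directly from the matching levels listed in Figure 10. The sketch below recomputes them from the fifteen level values in that figure and the 16 valid OP elements; the function name `op_ratios` is hypothetical.

```python
from collections import Counter

# Matching levels of the 15 pairs in Figure 10 (17 = exact, 13 = field, 10 = group).
LEVELS = [17, 17, 17, 17, 10, 17, 10, 17, 17, 17, 17, 17, 17, 17, 13]
OP_ELEMENTS = 16  # valid number of OP elements (Table 2)


def op_ratios(levels, n_op):
    """Return the ratios E, F, G, T, and U for one poem."""
    counts = Counter(levels)
    e = counts[17] / n_op
    f = counts[13] / n_op
    g = counts[10] / n_op
    t = e + f + g
    return {"E": e, "F": f, "G": g, "T": t, "U": 1 - t}


for name, value in op_ratios(LEVELS, OP_ELEMENTS).items():
    print(f"{name} = {value:.4f}")
```

The unrounded values for F and U are 0.0625, which Table 2 shows as 0.062.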
Calculation of Residual Rate
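\begin{align}
D &= 1 - \frac{P}{T} \tag{1} \\
  &= 1 - \frac{16}{41} \tag{2} \\
  &= 0.61 \tag{3}
\end{align}
Here P = 16 is the number of valid OP elements and T = 41 is the number of valid CT elements for poem 298.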
Components of CT
Table 3: Components of CT in the case of kks 298 by Komachiya (1982):
fabs(D-H) is the absolute value of the practical value, D, minus the theoretical value, H.
CT (valid number of elements) = 41
Quantity                          Calculation            Value   Definition
W  (ratio of original word use)   12/41                  0.293   E/CT
A  (ratio of annotation)          1-0.293                0.707   1-W
--- breakdown of the annotation ---
P1 (ratio of FG paraphrased)      (0.062+0.125)/0.707    0.073   (F+G)/A
P2 (ratio of U paraphrased)       (0.707-0.073)*0.062    0.040   (A-P1)*U
D  (ratio of purely added)        0.707-(0.073+0.040)    0.595   A-(P1+P2)
H  (theoretical value of D)       1-16/41                0.610   1-OP/CT
Gap                               fabs(0.595-0.610)      0.015   fabs(D-H)
Subtraction: CT - OP
OP(298): 16 elements
Exact 12 (75.0%)
Field 1 (6.2%)
Group 2 (12.5%)
Unmatched 1 (6.2%)
CT(298, koma): 41 elements
W 12 (29.3%)
P1 3 (7.3%)
P2 1 (4.0%)
D 25 (59.5%)
Figure 12: Pie charts illustrating the components of OP(298) and CT(298, koma)
(E) Mathematical modelling
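\begin{align}
\mathrm{cw}(t_1, t_2) &= \bigl(1 + \log \mathrm{ctf}(t_1, t_2)\bigr) \sqrt{\mathrm{idf}(t_1)\,\mathrm{idf}(t_2)} \tag{4} \\
\mathrm{idf}(t) &= \log \frac{N}{\mathrm{df}(t)} \tag{5}
\end{align}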
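Equations (4) and (5) translate into a few lines of code. The sketch below assumes the natural logarithm (the slide does not state the base) and takes ctf as a co-occurrence count of at least 1; the counts in the usage lines are made-up numbers for illustration, not values from the corpus.

```python
import math


def idf(n_docs: int, df: int) -> float:
    """Equation (5): idf(t) = log(N / df(t)). Natural log is an assumption."""
    return math.log(n_docs / df)


def cw(ctf: int, idf_t1: float, idf_t2: float) -> float:
    """Equation (4): cw = (1 + log ctf(t1, t2)) * sqrt(idf(t1) * idf(t2))."""
    return (1 + math.log(ctf)) * math.sqrt(idf_t1 * idf_t2)


# Made-up counts: 1000 poems, two words found in 50 and 20 of them,
# co-occurring in 8 poems.
weight = cw(8, idf(1000, 50), idf(1000, 20))
print(round(weight, 3))
```

Because idf grows as a word becomes rarer, a pair of rare words that co-occur a few times can outweigh a pair of common words that co-occur often.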
[Network graphs for two words, labelled warbler-CT-23-229-3.73-15 and cuckoo-CT-40-370-3.27-16. Node labels include warbler, cuckoo, spring, summer, flower, plum, branch, voice, mountain, May, midsummer rain, Tatsuta (place name), and Otowa (place name); the graphs also carry numeric labels.]
Conclusion
The thesaurus annotated with meta-codes allows researchers
1. to identify different orthographies as the same word;
2. to attach alternative semantic IDs to a word that has one
form but more than one meaning (a polysemous word);
3. to attach meta-codes not only to tokens recognised as a
single, simple word but also to longer tokens;
4. to indicate similarity between tokens;
5. to detect common or differing tokens across two or more texts,
which reveals the similarities or differences between the texts;
6. to indicate the relative differences between two words in literary
works.
Questions
• Computer Modelling of Classical Japanese Poetic
Vocabulary
http://warbler.ryu.titech.ac.jp/waka/poem.cgi
• Inquiry:
Hilofumi Yamamoto
yamagen@ryu.titech.ac.jp
• Thank you.

More Related Content

Viewers also liked

Viewers also liked (10)

Ch2006slide
Ch2006slideCh2006slide
Ch2006slide
 
Wollongong02
Wollongong02Wollongong02
Wollongong02
 
Ch2007slide02
Ch2007slide02Ch2007slide02
Ch2007slide02
 
Ch2010slide01
Ch2010slide01Ch2010slide01
Ch2010slide01
 
Goiken2008 slide01
Goiken2008 slide01Goiken2008 slide01
Goiken2008 slide01
 
Ch2008slide01
Ch2008slide01Ch2008slide01
Ch2008slide01
 
Asialex201103slide02
Asialex201103slide02Asialex201103slide02
Asialex201103slide02
 
Kokken20100303
Kokken20100303Kokken20100303
Kokken20100303
 
Goiken2007slide
Goiken2007slideGoiken2007slide
Goiken2007slide
 
AyeteValdiviaCarlos_videoescollit+mpeg
AyeteValdiviaCarlos_videoescollit+mpegAyeteValdiviaCarlos_videoescollit+mpeg
AyeteValdiviaCarlos_videoescollit+mpeg
 

Similar to Tokyotech20130715

time_complexity_list_02_04_2024_22_pages.pdf
time_complexity_list_02_04_2024_22_pages.pdftime_complexity_list_02_04_2024_22_pages.pdf
time_complexity_list_02_04_2024_22_pages.pdfSrinivasaReddyPolamR
 
Common fixed point theorems for random operators in hilbert space
Common fixed point theorems  for  random operators in hilbert spaceCommon fixed point theorems  for  random operators in hilbert space
Common fixed point theorems for random operators in hilbert spaceAlexander Decker
 
Fuzzy Inventory Model of Deteriorating Items under Power Dependent Demand and...
Fuzzy Inventory Model of Deteriorating Items under Power Dependent Demand and...Fuzzy Inventory Model of Deteriorating Items under Power Dependent Demand and...
Fuzzy Inventory Model of Deteriorating Items under Power Dependent Demand and...orajjournal
 
Modeling Topical Trends over Continuous Time with Priors
Modeling Topical Trends over Continuous Time with PriorsModeling Topical Trends over Continuous Time with Priors
Modeling Topical Trends over Continuous Time with PriorsTomonari Masada
 
JGrass-NewAge probabilities backward component
JGrass-NewAge probabilities backward component JGrass-NewAge probabilities backward component
JGrass-NewAge probabilities backward component Marialaura Bancheri
 
Workshop presentations l_bworkshop_reis
Workshop presentations l_bworkshop_reisWorkshop presentations l_bworkshop_reis
Workshop presentations l_bworkshop_reisTim Reis
 
Estimating the economic quantities of different concrete slab types
Estimating the economic quantities of different concrete slab typesEstimating the economic quantities of different concrete slab types
Estimating the economic quantities of different concrete slab typesAhmed Ebid
 

Similar to Tokyotech20130715 (11)

time_complexity_list_02_04_2024_22_pages.pdf
time_complexity_list_02_04_2024_22_pages.pdftime_complexity_list_02_04_2024_22_pages.pdf
time_complexity_list_02_04_2024_22_pages.pdf
 
Dg mcqs (1)
Dg mcqs (1)Dg mcqs (1)
Dg mcqs (1)
 
CMSI計算科学技術特論C (2015) ALPS と量子多体問題①
CMSI計算科学技術特論C (2015) ALPS と量子多体問題①CMSI計算科学技術特論C (2015) ALPS と量子多体問題①
CMSI計算科学技術特論C (2015) ALPS と量子多体問題①
 
Common fixed point theorems for random operators in hilbert space
Common fixed point theorems  for  random operators in hilbert spaceCommon fixed point theorems  for  random operators in hilbert space
Common fixed point theorems for random operators in hilbert space
 
Fuzzy Inventory Model of Deteriorating Items under Power Dependent Demand and...
Fuzzy Inventory Model of Deteriorating Items under Power Dependent Demand and...Fuzzy Inventory Model of Deteriorating Items under Power Dependent Demand and...
Fuzzy Inventory Model of Deteriorating Items under Power Dependent Demand and...
 
Ck4201578592
Ck4201578592Ck4201578592
Ck4201578592
 
Modeling Topical Trends over Continuous Time with Priors
Modeling Topical Trends over Continuous Time with PriorsModeling Topical Trends over Continuous Time with Priors
Modeling Topical Trends over Continuous Time with Priors
 
JGrass-NewAge probabilities backward component
JGrass-NewAge probabilities backward component JGrass-NewAge probabilities backward component
JGrass-NewAge probabilities backward component
 
Workshop presentations l_bworkshop_reis
Workshop presentations l_bworkshop_reisWorkshop presentations l_bworkshop_reis
Workshop presentations l_bworkshop_reis
 
Estimating the economic quantities of different concrete slab types
Estimating the economic quantities of different concrete slab typesEstimating the economic quantities of different concrete slab types
Estimating the economic quantities of different concrete slab types
 
Lab no.08
Lab no.08Lab no.08
Lab no.08
 

Recently uploaded

Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 

Recently uploaded (20)

Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 

Tokyotech20130715

  • 1. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 1 Development of the Thesaurus of Classical Japanese Poetic Vocabulary Hilofumi Yamamoto Tokyo Institute of Technology 15th July 2013
  • 2. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 2 Outline 1. Purpose of Study • Connotation of classical poetic vocabulary • Longitudinal study of transition of vocabulary 2. Development of Thesaurus 3. Applications
  • 3. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 3 Waka: Japanese Poetry Tatsuta-Hime.. tamukuru KAMI no / arebakoso aki no konoha no / nusa to chirurame because Princess Tatsuta has a god to whom she offers brocades, the leaves of trees in autumn will scatter as an offering. Prince Kanemi No. 298 in the Kokinsh¯u
  • 4. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 4 Problem: Orthography in hiragana たつた in Chinese characters 立田 竜田 龍田 → All Tatsuta (place name)
  • 5. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 5 Problem: Unit size / attribution The unit size and meaning of a word depends on a context. • unit → 卯の花 or 卯/の/花 (Nakano, 1998) • orthography → さびしい/さみしい/寂しい/淋しい (sad) • attributions → 卯の花 ∈ plant or 卯の花 ∈ food (unohana = a deutzia or bean curd refuse)
  • 6. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 6 An Item of Thesaurus: God BG-01-2030-01-030-A-かみ-神 ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑ (1) (2) (3) (4) (5) (6) (7) (8) Figure 1: Structure of an item of BG database in the case of kami (god): (1) database ID (BG = short-unit general vocabulary); (2) part of speech ID (01 = noun); (3) group ID (2030 = Shinto deities and Buddhas); (4) field ID; (5) exact ID (030 = god); (6) era-flag (A = contemporary, C = classic); (7) Chinese character reading; (8) Chinese character
  • 7. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 7 Development: Thesaurus, KH, and t2c • Thesaurus for classical poetic vocabulary • KH (tokenizer) • t2c (token to code converter)
  • 8. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 8 Materials: the Hachidaish¯u • The Hachidaish¯u: eight anthologies compiled by imperial orders during ca. 905–2105. • The database: compiled by the National Institute of Japanese Literature, Japan. • Old texts taken based on Sh¯ohobonban version of the Hachidaish¯u 900 ⊲ K okinsh¯u (•905) 46 950 ⊲ G osensh¯u (•951) 56 1000 ⊲ J¯uish¯u (•1007) 79 1050 ⊲ G osh¯uish¯u (1086) 38 1100 ⊲ K iny¯osh¯u (•1124) 20 ⊲ Shikash¯u (•1144) 44 1150 ⊲ Senzaish¯u (1188) 17 1200 ⊲ Shinkokinsh¯u (1205) 1250
  • 9. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 9 Methods: Flowchart of data processing A Corpus development B Tokenisation C Meta-code conversion D Mathematical modelling E Subtraction: CT − OP F Visualisation
  • 10. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 10 Development: Thesaurus, KH, and t2c • Thesaurus for classical poetic vocabulary • KH (tokenizer) • t2c (token to code converter)
  • 11. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 11 Table 1: An example of input for KH / Gosensh¯u No. 664 input: 000664 わすられて思ふなげきのしげるをや身をはづかしのもりといふらん output:000664 わすら (ラ四-未:忘る:わする:忘ら:わすら) れ (自可受-用:る:る:れ:れ) て (接助:て:て) 思ふ (ハ四-終体:思ふ:おもふ:思ふ:おもふ) なげき (カ四-用:嘆く:なげく:嘆き:なげき) の (格助:の:の) しげる (ラ四-終体:茂る:しげる:茂る:しげる) を (*助:を:を) や (係助:や:や) 身 (名:身:み) を (*助:を:を) --- はづかし (名-地名:羽束師:はづかし) の (格助:の:の) --- はづかし (形シク-終:恥づかし:はづかし:恥づかし:はづかし) の (格助:の:の) --- もり (名:森:もり) と (格助-引用:と:と) いふ (ハ四-終体:言ふ:いふ:言ふ:いふ) らん (推-終体:らむ:らむ:らむ:らむ)
  • 12. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 12 Development: Thesaurus Poem Texts kh t2c Thesaurus code taggerTokeniser Hachidaishu Thesaurus (A) (B) add new thesaurus codes Dictionary General, Place Name Personal Name, etc add unknown entries
  • 13. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 13 (A) Corpus: Poems (OP) KW00029800|A|KANEMI NO ¯O=kanemi no ¯o KW00029800|B|Tatsutahime[NOUN-PLNAME:TATSUTAHIME]/→ tamukuru[KASHIMO2-ATTR:TAMUkuru],kami[NOUN:KAMI]→ no[SUB]are[RAHEN-REAL]ba[CAUS]koso[KP]/→ aki[NOUN:AKI]no[CON],konoha[NOUN:KOnoHA]no[SUB]/→ nusa[NOUN:NUSA]to[P-CRD],chiru[RA4DAN-FF:CHIru]→ rame[CJR-REAL]/ Figure 2: Format of the database of a poem: → indicates continuing to the next line without breaks; the first line, which includes |A|, indicates the name of the poet; the second line which includes |B|, indicates the contents of the poem and added information.
  • 14. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 14 (A) Corpus: Translations (CT) $A|000298 $B|秋の末近くなって帰り道についた龍田姫が、道中の無事を願って手向け → をする神があるからこそ、秋の木の葉が幣となって散っているのだろう。 $C|秋の歌 $D|秋の末近くなって帰り道についた龍田姫が、道中の無事を願って手向け → をする神があるからこそ、秋の木の葉が幣となって散っているのだろう。 $I|あきのすえちかくなってかえりみちについたたつたひめが、どうちゅう → のぶじをねがってたむけをするかみがあるからこそ、あきのこのはがぬさ → となってちっているのだろう。 Figure 3: Format of the database of a CT
  • 15. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 15 (B) Tokenisation: original text 立田姫手向ける神の有ればこそ秋の木の葉の幣と散るらめ ↓ tokenising 立田姫/手向ける/神/の/[有れ]/ば/こそ/秋/の/木の葉/の/幣/と/散る/[らめ] ↓ converting into predicative form 立田姫/手向ける/神/の/[有り]/ば/こそ/秋/の/木の葉/の/幣/と/散る/[らむ] Figure 4: Tokenisation of poem texts
  • 16. (C) meta-code conversion
      CH-29-2130-01-010-A  たつたひめ  立田姫   Tatsutahime  Princess-Tatsuta
      CH-29-0000-14-010-A  -- 立田 --           Tatsuta      Tatsuta
      BG-01-2030-01-101-A  -- 姫 --             hime         princess
      BG-02-3770-04-080-C  たむくる    手向く   tamukuru     present(verb)
      BG-01-5730-02-010-A  -- 手 --             te           hand
      BG-02-1700-01-040-A  -- 向ける --         mukeru       for
      BG-01-2030-01-030-A  かみ        神       kami         god
      BG-08-0061-07-010-A  の          の       no           SUB (particle)
      BG-02-1200-01-010-C  あれ        有り     are          be
      BG-08-0064-26-010-A  ば          ば       ba           because (particle)
      BG-04-1120-05-150-A  -- ば --             ba           because (reason)
      BG-08-0065-01-010-A  こそ        こそ     koso         KP (emphasis)
      Figure 5: Meta-code conversion in case of OP
  • 17. (C) Structure of meta-code-1
      BG-01-2030-01-030-A-かみ-神
      ↑  ↑  ↑    ↑  ↑   ↑ ↑    ↑
      (1)(2)(3)  (4)(5) (6)(7) (8)
      Figure 6: Structure of an item of BG database in the case of kami (god):
      (1) database ID (BG = short-unit general vocabulary);
      (2) part of speech ID (01 = noun);
      (3) group ID (2030 = Shinto deities and Buddhas);
      (4) field ID;
      (5) exact ID (030 = god);
      (6) era-flag (A = contemporary, C = classic);
      (7) Chinese character reading;
      (8) Chinese character
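The eight positions listed in Figure 6 map directly onto a small record type. The following is a minimal sketch, assuming the fields are always separated by ASCII hyphens and that only the last field may itself contain one; the names MetaCode and parse_meta_code are illustrative and do not come from the thesaurus software.

```python
from typing import NamedTuple


class MetaCode(NamedTuple):
    database: str  # (1) database ID, e.g. BG
    pos: str       # (2) part of speech ID, e.g. 01 = noun
    group: str     # (3) group ID, e.g. 2030
    field: str     # (4) field ID
    exact: str     # (5) exact ID, e.g. 030
    era: str       # (6) era-flag, A or C
    reading: str   # (7) reading of the Chinese character
    surface: str   # (8) Chinese character


def parse_meta_code(item):
    """Split one thesaurus item into its eight fields."""
    parts = item.split("-", 7)
    if len(parts) != 8:
        raise ValueError("expected 8 hyphen-separated fields: " + item)
    return MetaCode(*parts)


kami = parse_meta_code("BG-01-2030-01-030-A-かみ-神")
print(kami.group, kami.exact, kami.era, kami.surface)
```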
  • 18. (C) Structure of the meta-code-2
          BG-01-2600-01-020-A  yononaka (world)  (1)
        = BG-01-2610-01-040-A  yo (world)        (2)
        + BG-08-0010-01-021-A  no (of)           (3)
        + BG-01-1770-01-080-A  naka (inside)     (4)
      Figure 7: Structure of an item of the semantic table in the case of a compound word, yononaka (world)
  • 19. (C) meta-code conversion-3
      CH-29-2130-01-010-A  たつたひめ  立田姫   Tatsutahime  Princess-Tatsuta
      CH-29-0000-14-010-A  -- 立田 --           Tatsuta      Tatsuta
      BG-01-2030-01-101-A  -- 姫 --             hime         princess
      BG-02-3770-04-080-C  たむくる    手向く   tamukuru     present(verb)
      BG-01-5730-02-010-A  -- 手 --             te           hand
      BG-02-1700-01-040-A  -- 向ける --         mukeru       for
      BG-01-2030-01-030-A  かみ        神       kami         god
      BG-08-0061-07-010-A  の          の       no           SUB (particle)
      BG-02-1200-01-010-C  あれ        有り     are          be
      BG-08-0064-26-010-A  ば          ば       ba           because (particle)
      BG-04-1120-05-150-A  -- ば --             ba           because (reason)
      BG-08-0065-01-010-A  こそ        こそ     koso         KP (emphasis)
      Figure 8: Meta-code conversion in case of OP
  • 20. [Diagram: the poet writes the OP (10th century, field of experience); the expert reader reads the OP and writes the CT (20th century, field of experience of the expert); the novice reader reads the CT (20th century, field of experience of the novice); OP and CT are compared.]
      Figure 9: Schema of relationship between OP and CT
  • 21. Columns: # of pair; value of matching level (exact = 17, field = 13, group = 10); # of POS; element of OP; # of element of OP; # of element of CT; element of CT
      pair  level  POS  OP element  OP #       CT #  CT element
         1     17   11  立田姫      00    <->  12    龍田姫 (Tatsutahime)
         2     17   47  手          04    <->  25    手 (hand)
         3     17   47  向ける      05    <->  26    向ける (toward)
         4     17    2  神          06    <->  32    神 (god)
         5     10   61  の          07    <->  33    が (SUB)
         6     17   47  有り        08    <->  34    ある (be)
         7     10   64  ば          09    <->  35    から (because)
         8     17   65  こそ        11    <->  36    こそ (EM)
         9     17    2  秋          12    <->  38    秋 (autumn)
        10     17   71  の          13    <->  39    の (CON)
        11     17    2  木の葉      14    <->  40    木の葉 (leaf of tree)
        12     17    2  幣          19    <->  45    幣 (present)
        13     17   61  と          20    <->  46    と (CRD)
        14     17   47  散る        21    <->  49    散る (fall)
        15     13   74  らむ        22    <->  54    う (CJR)
      Figure 10: Example of the matching process
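The matching levels in Figure 10 can be illustrated with the meta-codes themselves. The following is a minimal sketch, assuming (this rule is not stated on the slides) that two codes are compared from left to right on POS, group, field and exact ID, ignoring the database ID and the era-flag. The scores 17, 13 and 10 are the ones given in the figure; the function match_level and the code marked as hypothetical are illustrative only.

```python
from collections import Counter

EXACT, FIELD, GROUP, NO_MATCH = 17, 13, 10, 0  # scores from Figure 10; 0 is an assumption


def match_level(code_a, code_b):
    """Assumed rule: compare POS, group, field and exact IDs from left to right."""
    a = code_a.split("-")[1:5]
    b = code_b.split("-")[1:5]
    if a[:2] != b[:2]:   # POS or group differs
        return NO_MATCH
    if a[2] != b[2]:     # same group, different field
        return GROUP
    if a[3] != b[3]:     # same field, different exact ID
        return FIELD
    return EXACT


kami = "BG-01-2030-01-030-A"            # god (Figure 5)
hime = "BG-01-2030-01-101-A"            # princess (Figure 5)
te = "BG-01-5730-02-010-A"              # hand (Figure 5)
same_group_only = "BG-01-2030-02-010-A"  # hypothetical code, for illustration only

print(match_level(kami, kami), match_level(kami, hime),
      match_level(kami, same_group_only), match_level(kami, te))

# Tallying the level column of Figure 10 gives the counts used for OP(298).
levels = [17, 17, 17, 17, 10, 17, 10, 17, 17, 17, 17, 17, 17, 17, 13]
counts = Counter(levels)
print(counts[EXACT], counts[FIELD], counts[GROUP])
```

The tally at the end is independent of the assumed comparison rule: it only counts the scores printed in the figure.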
  • 22. Residual
      CT                                  OP
      (秋の末近くなって帰り道についた)    -
      龍田姫                              立田姫
      (が道中の無事を願って)              -
      手向け                              手向ける
      (をする)                            -
      神                                  神
      が                                  の
      ある                                あれ
      から                                ば
      こそ                                こそ
      秋の木の葉                          秋の木の葉
      (が)                                [の]
      幣                                  幣
      (となって)                          と
      散っ                                散る
      (ているのだろ)                      -
      う                                  らめ
      Figure 11: Example of the matching process in the case of kks 298 in Komachiya (1982)
  • 23. Components of OP
      Table 2: Result of subtracting the elements of OP(298) from those of CT(298, koma): it indicates the ratio of the ingredients of OP(298).
      OP (valid number of elements)    = 16
      E (ratio of exact match)         12/16 = 0.750
      F (ratio of field match)          1/16 = 0.062
      G (ratio of group match)          2/16 = 0.125
      T (ratio of total match)         15/16 = 0.938
      U (ratio of unmatched OP)        1 - T = 0.062
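  • 24. Calculation of Residual Rate
      \begin{align}
      D &= 1 - \frac{P}{T} \tag{1} \\
        &= 1 - \frac{16}{41} \tag{2} \\
        &= 0.61 \tag{3}
      \end{align}
      Here P = 16 is the valid number of elements of the OP and T = 41 is the valid number of elements of the CT.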
  • 25. Components of CT
      Table 3: Components of CT in the case of kks 298 by Komachiya (1982): fabs(D-H) is the absolute value of the practical value, D, minus the theoretical value, H.
      CT (valid number of elements)        = 41
      W  (ratio of original word use)      12/41 = 0.293                      E/CT
      A  (ratio of annotation)             1 - 0.293 = 0.707                  1 - W
      --- breakdown of the annotation ---
      P1 (ratio of F and G paraphrased)    3/41 = 0.073                       (F + G)/CT
      P2 (ratio of U paraphrased)          (0.707 - 0.073) * 0.062 = 0.040    (A - P1) * U
      D  (ratio of purely added)           0.707 - (0.073 + 0.040) = 0.595    A - (P1 + P2)
      H  (theoretical value of D)          1 - 16/41 = 0.610                  1 - OP/CT
      Gap                                  fabs(0.595 - 0.610) = 0.015        fabs(D - H)
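The figures for kks 298 can be recomputed from the raw counts: 16 OP elements, 41 CT elements, and 12 exact, 1 field and 2 group matches. The following is a minimal sketch of that arithmetic. One assumption is made loudly: P1 is taken as the three field and group matches divided by the 41 CT elements, because that is the reading that reproduces the printed value 0.073 (7.3%). The variable names are illustrative.

```python
op_total, ct_total = 16, 41      # valid numbers of elements of OP and CT
exact, field, group = 12, 1, 2   # counts of matched pairs at each level

# Components of OP
E = exact / op_total
F = field / op_total
G = group / op_total
total_match = E + F + G          # T in the OP table
U = 1 - total_match              # unmatched OP

# Components of CT
W = exact / ct_total             # original word use
A = 1 - W                        # annotation
P1 = (field + group) / ct_total  # assumption: F and G counts over CT, giving 0.073
P2 = (A - P1) * U                # U paraphrased
D = A - (P1 + P2)                # purely added (practical value)
H = 1 - op_total / ct_total      # theoretical value of D
gap = abs(D - H)

for name, value in [("W", W), ("A", A), ("P1", P1), ("P2", P2),
                    ("D", D), ("H", H), ("Gap", gap)]:
    print(f"{name:3s} {value:.3f}")
```

Working from unrounded values is what yields D = 0.595; subtracting the already rounded 0.073 and 0.040 from 0.707 would give 0.594.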
  • 26. Subtraction: CT - OP
      OP(298): 16 elements
        Exact      12 (75.0%)
        Field       1 (6.2%)
        Group       2 (12.5%)
        Unmatched   1 (6.2%)
      CT(298, koma): 41 elements
        W          12 (29.3%)
        P1          3 (7.3%)
        P2          1 (4.0%)
        D          25 (59.5%)
      Figure 12: Pie-charts illustrating the components of OP(298) and CT(298, koma)
  • 27. (E) Mathematical modelling
      \begin{align}
      \mathrm{cw}(t_1, t_2) &= \bigl(1 + \log \mathrm{ctf}(t_1, t_2)\bigr)\,\sqrt{\mathrm{idf}(t_1)\,\mathrm{idf}(t_2)} \tag{4} \\
      \mathrm{idf}(t) &= \log \frac{N}{\mathrm{df}(t)} \tag{5}
      \end{align}
  • 28. [Figure: word graph for warbler and cuckoo in the CT, with numeric labels. Panel labels: warbler-CT-23-229-3.73-15, cuckoo-CT-40-370-3.27-16. Node labels: every morning, field, warbler, old age, woven hat, green willow, wear in (my) hair, sew.2, spring, Tatsuta.PN, branch, flower, stop.vi.1, break off, cry.vi, sing.vi, yet.1, summer side, cuckoo, a cry, May, Otowa.PN, voice, mountain, singing voice, midsummer rain, hear, be heard.1, last year, iris.1, treetop, this morning, go over, regret, treetop high.3, near, reason.1, guidance.1, lure, send, separation, fragrance.1, spring haze, stand.vi, summer mountains, force, plum, mountain cuckoo, hide.vi.2, scatter.1, touch, hand, attach, flutter.2, borrow, imperceptibly, treetop high.1, far.]
  • 29. Conclusion
      The thesaurus annotated with meta-codes allows researchers
      1. to identify different orthographies as the same word;
      2. to attach an alternative semantic ID to a word which has the same form but more than one meaning (a polysemous word);
      3. to attach meta-codes not only to tokens recognised as a single (simple) word but also to longer tokens;
      4. to indicate a similarity between tokens;
      5. to detect common or different tokens among two or more texts, which reveals the similarities or differences between the texts;
      6. to indicate the relative differences between two words in literary works.
  • 30. Questions
      • Computer Modelling of Classical Japanese Poetic Vocabulary: http://warbler.ryu.titech.ac.jp/waka/poem.cgi
      • Inquiry: Hilofumi Yamamoto, yamagen@ryu.titech.ac.jp
      • Thank you.