How the machine understands Korean
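How should we go about talking with machines? Until now, we have built programming languages that machines can understand and communicated with them through those languages. But AI research, which has gained momentum since around 2010, has reached into this area of communication as well, and is working toward the stage where machines understand human language and can converse in it. Underneath lie several ideas from linear algebra; methods that encode human language as symbols and project them into a vector space are considered especially central. These methods are called embeddings: they vectorize human language at various granularities, from words to sentences to documents, and derive latent meaning, such as semantic similarity and relational similarity, through vector operations in that space.
In this seminar, we survey language modeling approaches within the vector space model (VSM), from traditional methods (TF-IDF, SVD, etc.) to neural methods (word2vec, sent2vec, etc.), and examine through experiments, from several angles, how a machine can be interpreted as understanding meaning when these methods are applied to Korean.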
3. "a mechanically, electrically, or electronically operated device for performing a task"
https://www.merriam-webster.com/dictionary/machine
4. "a process by which information is exchanged between individuals through a common system of symbols, signs, or behavior"
https://www.merriam-webster.com/dictionary/communication
41. Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
• Each term $t_i$ generates a row vector $(a_{i1}, a_{i2}, \cdots, a_{in})$, referred to as a term vector, and each document $d_j$ generates a column vector
\[
d_j = \begin{pmatrix} a_{1j} \\ \vdots \\ a_{mj} \end{pmatrix}
\]
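As a rough illustration of the term-vector and document-vector definitions, the sketch below builds a small binary term-document matrix and reads off one row and one column. The three-document corpus and the helper names `term_vector` and `document_vector` are made up for this example and do not come from the slides; binary occurrence weights are also an assumption.

```python
# Hypothetical toy corpus, for illustration only.
docs = ["cat sat mat", "dog sat log", "cat dog"]

# Vocabulary in sorted order: one row of the matrix per term.
terms = sorted({word for doc in docs for word in doc.split()})

# A[i][j] = 1 if term i occurs in document j, else 0 (binary weights assumed).
A = [[1 if term in doc.split() else 0 for doc in docs] for term in terms]


def term_vector(matrix, i):
    """Row i of the matrix: how term i is distributed over the documents."""
    return matrix[i]


def document_vector(matrix, j):
    """Column j of the matrix: which terms occur in document j."""
    return [row[j] for row in matrix]


print(terms)
print(term_vector(A, 0))      # row for the first term
print(document_vector(A, 0))  # column for the first document
```

Rows answer "in which documents does this term occur?", while columns answer "which terms does this document contain?"; both views come from the same matrix.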
44.
\[
A = \begin{pmatrix}
1 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 \\
1 & 1 & 1 & 0 \\
1 & 0 & 0 & 1 \\
1 & 0 & 0 & 1 \\
1 & 0 & 0 & 1 \\
1 & 0 & 0 & 1 \\
1 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1
\end{pmatrix}
\]
47.
\[
\cos(d_1, d_2) = \frac{2}{2.83 \times 1.41} = 0.5
\]
[[1.0 , 0.5 , 0.5 , 0.67 ],
[0.5 , 1.0 , 0.5 , 0.0 ],
[0.5 , 0.5 , 1.0 , 0.0 ],
[0.67 , 0.0 , 0.0 , 1.0 ]]
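The document-by-document similarity matrix above can be reproduced with a few lines of standard-library Python. This is a minimal sketch that treats each column of the 10 × 4 matrix A from the slides as a document vector; the helper names `column` and `cosine` are chosen here for illustration.

```python
import math

# Term-document matrix A from the slides: 10 terms (rows) x 4 documents (columns).
A = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]


def column(matrix, j):
    """Return column j of the matrix, i.e. the vector for document j."""
    return [row[j] for row in matrix]


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


doc_vectors = [column(A, j) for j in range(len(A[0]))]

# Pairwise similarities, rounded to two decimals as on the slide.
sim = [[round(cosine(u, v), 2) for v in doc_vectors] for u in doc_vectors]

for row in sim:
    print(row)
```

For d_1 and d_2 the dot product is 2 and the norms are √8 ≈ 2.83 and √2 ≈ 1.41, which gives the 0.5 computed on the slide.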
50.
\[
A' = \begin{pmatrix}
0.95 & 0.54 & 0.54 & 0.04 \\
0.95 & 0.54 & 0.54 & 0.04 \\
1.23 & 0.8 & 0.8 & 0.18 \\
0.93 & 0.06 & 0.06 & 1.05 \\
0.93 & 0.06 & 0.06 & 1.05 \\
0.93 & 0.06 & 0.06 & 1.05 \\
0.93 & 0.06 & 0.06 & 1.05 \\
0.93 & 0.06 & 0.06 & 1.05 \\
0.26 & 0.22 & 0.22 & 0.8 \\
0.26 & 0.22 & 0.22 & 0.8
\end{pmatrix}
\]
[[ 1. , 0.67 , 0.67 , 0.71],
[ 0.67, 1. , 1. , -0.05],
[ 0.67, 1. , 1. , -0.05],
[ 0.71, -0.05, -0.05, 1. ]]
108. Nickel, M., & Kiela, D. (2017). Poincaré embeddings for learning hierarchical representations. Advances in Neural Information Processing Systems.