<Little Big Data #1> Machine Learning with Korean Chat Data
summatic@scatterlab.co.kr
1
• 

(@ , 2016. 1~)

• 

(2016. 8~)

• 

(2018. 5~)
!2
:
:
:• 

• (?) 

• B

•
.

•
.

•
.

•
.

• 

• ( , id )
.

• ( , , )
.
!3
• Intro

• 

• 

• 

• 

• 

• Preprocessing

• Word Embedding

• Document Similarity

•
!4
Intro
• 

• 

• “ ” -> “ " -> “ ” .

• “ ” .

• .

• .

• 

• .

• .
!6
-
• Hell 

• .

• 

• 

• 

• 

•
< >
- ?
- ? / ? ?
< > , ,
< > , ,
< >
< > , , ,
!7
• 

• 

•
-
< >
/ / ? / / ? / ? / 

< >
- (X) -> (O)
- ? (X) -> ? (O)
- ? (X) -> ? (O)
- (X) -> (O)
< >
-
-
!8
- preprocess
• Data Science 

• Garbage in, Garbage out

• , preprocess
.

• preprocess ?
!10
Preprocessing
Preprocessing -
• 

• preprocess (POS¹ tagger)
.

• : 

• KoNLPy²

• 

• , ,
1) POS: Part of speech
2) http://konlpy-ko.readthedocs.io/ko/v0.4.3/
!12
Preprocessing - ( )
• . ?
 • _NP _MAG _VV _ECE 

_VXA _EFN ._SF _MAG 

_VV _EFQ ?_SF
• 
 • _NP _MAG _NNG 

_XSV _ECE
• . 
 • _NNG _VA _ECD _VV 

_EFN ._SF _MAG _VV 

_ECE _NNG _XSV _ECS
< > < >
!13
Preprocessing - ( )
• . ?
 • _UN _JKS _MAG _MAG 

_VV _ECE _NNG _MAG 

_MAG _VV _ECS ?_SF
• 
 • _NP _NNG _NNG 

_JKM _VV
• . 
 • _NNG _VA _ECD _NP 

_UN ._SF _MAG _VV _ECE
_MAG _VV _ECS _EMO
< > < >
!15
Preprocessing -
• 

• ( , corpus)


• (corpus)

•
!17
: https://ko.wikipedia.org/wiki/
Preprocessing -
• Sejong Corpus

• National Institute of the Korean Language, 1998-2007.

• 

• (..)
!18
: https://ithub.korean.go.kr/user/guide/corpus/guide1.do
• preprocess

• normalize( )

• preprocessing

• 

• tokenizing
< >
count(“ ”) < count(“ ?”) , “ ” .
Preprocessing -
!19
Preprocessing - Tokenizing
• Tokenizing: 

• token , .

• , token 

• “ ” “ ” tokenizing
.
!20
< >
before tokenizing:
.
after tokenizing:
/ / / / / / / / / / / / / / / /
/ / / / .
• 

• 

• c1c2..cn-1 cn c1..cn 

•
Preprocessing - Tokenizing(Cohesion Probability)
!21
< >
“ ” “ ” .
: https://ratsgo.github.io/from%20frequency%20to%20semantics/2017/05/05/cohesion/
Preprocessing - Tokenizing(Cohesion Probability)
• ) 

• = +
!22
substring count
- count( ) = 20000
- count( ) = 1500
- count( ) = 1200
- count( ) = 30
- count( ) = 15
cohesion probability
- CP( ) = 0.2738
- CP( ) = 0.3914
- CP( ) = 0.1968
- CP( ) = 0.2371
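The cohesion values listed above can be reproduced from the substring counts. The sketch below assumes the score of a length-n prefix is (count(c1..cn) / count(c1))^(1/n). That exponent is inferred from the numbers on this slide, not from a stated formula, and other write-ups may normalize differently. The original substrings are not legible in this transcript, so each prefix is identified only by its length n, and the function name `cohesion` is hypothetical.

```python
def cohesion(prefix_count, first_char_count, n):
    """Cohesion score of a length-n prefix.

    Assumption: the exponent 1/n is inferred from the slide's numbers.
    """
    return (prefix_count / first_char_count) ** (1.0 / n)


# Counts of the prefixes of length 1..5, copied from the slide.
prefix_counts = [20000, 1500, 1200, 30, 15]

# Score every prefix of length 2 or more against the first character.
scores = {
    n: cohesion(prefix_counts[n - 1], prefix_counts[0], n)
    for n in range(2, len(prefix_counts) + 1)
}

# The prefix with the highest score is the most plausible left-hand token.
best_n = max(scores, key=scores.get)

for n, score in scores.items():
    print(f"n={n}: CP={score:.4f}")
print("best split after", best_n, "characters")
```

Under these assumptions the length-3 prefix scores highest, so the word would be split after its third character.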
Preprocessing - Tokenizing
• Cohesion probability .

• .

• [ 2017] NLP - 

• 

• https://www.slideshare.net/kimhyunjoonglovit/pycon2017-koreannlp

• 

• https://github.com/lovit/soynlp
!23
Word Embedding
Word Embedding - Word2Vec
• vector .

• word embedding word representation .

• word2vec

• You shall know a word by the company it keeps (Firth, J. R. 1957:11)
!25
Word Embedding - Word2Vec
• word2vec OOV
.

• OOV(Out-of-vocabulary): (=dictionary ) vocabulary
vector 

• training input vocabulary OOV
, inference .

• inference : 

•


• ( , )
, dictionary .
!26
• word2vec 

• word2vec:

• 

• fasttext: 

• where the set of n grams appearing in w

• subword
Word Embedding - Fasttext
!27
< >
w: Alpaca
n-grams of w (n=3): "<Al", "Alp", "lpa", "pac", "aca", "ca>"
Source: Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.
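The Alpaca example can be reproduced in a few lines. This is a minimal sketch of character n-gram extraction with the `<` and `>` boundary symbols described by Bojanowski et al.; the function name `char_ngrams` is hypothetical and this is not fastText's actual implementation, which also hashes n-grams and uses a range of n values.

```python
from typing import List


def char_ngrams(word: str, n: int = 3) -> List[str]:
    """Return the character n-grams of `word` after adding boundary symbols."""
    wrapped = f"<{word}>"  # "<" marks the word start, ">" the word end
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


print(char_ngrams("Alpaca"))
# ['<Al', 'Alp', 'lpa', 'pac', 'aca', 'ca>']
```

In the paper, a word's vector is the sum of the vectors of its n-grams, which is why an unseen word can still receive a vector as long as some of its n-grams were seen in training.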
Word Embedding - Fasttext +
• fasttext .

• (character) subword 

• subword 

• , OOV .
!28
< >
subwords( ) = < , , , , >
< >
= _ _ _
subwords( ) = < , _, _ , …, >
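Decomposing Hangul syllables into jamo before extracting subwords can be sketched with plain Unicode arithmetic. The names `to_jamo` and `jamo_ngrams`, and the use of `_` as a placeholder for a missing final consonant, are assumptions made for this illustration; the author's actual preprocessing may differ.

```python
# Compatibility jamo in Unicode syllable-composition order.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"  # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"  # 21 vowels
# "_" (assumed placeholder) stands for "no final consonant", then 27 finals.
JONGSEONG = "_ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ"

HANGUL_BASE = 0xAC00      # code point of the first precomposed syllable
HANGUL_COUNT = 11172      # 19 * 21 * 28 precomposed syllables


def to_jamo(text):
    """Replace each Hangul syllable with its initial, vowel and final jamo."""
    out = []
    for ch in text:
        idx = ord(ch) - HANGUL_BASE
        if 0 <= idx < HANGUL_COUNT:
            out.append(CHOSEONG[idx // 588])         # 588 = 21 * 28
            out.append(JUNGSEONG[(idx % 588) // 28])
            out.append(JONGSEONG[idx % 28])
        else:
            out.append(ch)  # leave non-Hangul characters untouched
    return "".join(out)


def jamo_ngrams(word, n=3):
    """Character n-grams over the jamo sequence, with boundary symbols."""
    wrapped = "<" + to_jamo(word) + ">"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


print(to_jamo("한"))
print(jamo_ngrams("가"))
```

Because every syllable expands to exactly three symbols, n-grams over the jamo sequence can share pieces between words that have no whole syllable in common.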
Word Embedding - Fasttext
•
!29
- , 0.8590
- , 0.8465
- , 0.8180
- , 0.8055
- , 0.8018
- , 0.8017
- , 0.8007
- , 0.7983
- , 0.7972
- , 0.7948
- , 0.9022
- , 0.8986
- , 0.8887
- , 0.8866
- , 0.8567
- , 0.8498
- , 0.8474
- , 0.8413
- , 0.8335
- , 0.8191
Word Embedding - Fasttext
• 

• 

• Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching word
vectors with subword information. arXiv preprint arXiv:1607.04606.

• 

• https://github.com/facebookresearch/fastText

• https://radimrehurek.com/gensim/models/fasttext.html

• https://github.com/summatic/hangul_jamo_fasttext
!30
Sentence Similarity
Sentence Similarity
• document
.

• document short sentence .

• word embedding vector embedding
cosine similarity .
!32
< >
sim( , ?)
Sentence Similarity - BOW + Word Embedding
• word vector 

• doc2vec 

• word embedding 

• word embedding ?

• word embedding 

• !=
!34
- similarity( , ) = 0.9011
- similarity( , ) = 0.8839
- similarity( , ) = 0.9707
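One common way to combine a bag of words with word embeddings is to average the word vectors of a sentence and compare the averages with cosine similarity. The sketch below assumes that reading of this slide. The two-dimensional vectors in `toy_embeddings` are invented for illustration and are not taken from any trained model.

```python
import math


def average_vector(tokens, embeddings):
    """Average the embeddings of the tokens that have a vector."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no token has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical embeddings: "love" and "hate" occur in similar contexts,
# so their vectors are deliberately placed close together.
toy_embeddings = {
    "I": [1.0, 0.0],
    "you": [0.0, 1.0],
    "love": [0.6, 0.8],
    "hate": [0.8, 0.6],
}

a = average_vector("I love you".split(), toy_embeddings)
b = average_vector("I hate you".split(), toy_embeddings)
print(f"similarity = {cosine(a, b):.4f}")
```

With these toy vectors the two sentences come out nearly identical even though their meanings are opposite, which illustrates why a high embedding similarity does not necessarily mean the same meaning.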
Sentence Similarity - RNN
• sentence embedding RNN (LSTM, Bi-RNN, GRU) .

• RNN language modeling

• “ .” <-> “ ”


• sequence embedding .

• .. “ ” “ ” embedding .

• “?”
!35
Sentence Similarity - Term vector
• vector embedding
embedding .

• embedding term vector 

• one hot encoding .

• term vector cosine similarity, edit distance
.
!36
< >
- I love you, you love me
- {“I”: 1, “love”: 2, “you”: 2, “me”: 1}
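The term vector in the example above is a plain word count, so it can be built with the standard library. This is a minimal sketch; the helper names `term_vector` and `cosine` are hypothetical, and splitting on `\w+` is an assumption that suits the English example, not Korean text.

```python
import math
import re
from collections import Counter


def term_vector(sentence):
    """Count how often each word occurs in the sentence."""
    return Counter(re.findall(r"\w+", sentence))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


full = term_vector("I love you, you love me")
print(dict(full))  # same counts as in the example above

part = term_vector("I love you")
print(f"cosine = {cosine(full, part):.4f}")
```

Only the words two sentences share contribute to the dot product, so this similarity needs no trained model at all.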
Sentence Similarity - Term vector
• term vector 

• . 

• 

• pair1 pair2 ?
!38
< >
pair1: I love you <-> I like you
pair2: I love you <-> I hate you
Sentence Similarity - ESA Similarity
• ESA: Explicit Semantic Analysis

• (=word vector) 

• cosine similarity

• ESA similarity
!39
I love you
I like you

similarity   I     love  you
I            1     0.2   0.5
like         0.3   0.9   0.4
you          0.5   0.4   1
max          1     0.9   1
Sentence Similarity - ESA Similarity
• ESA: Explicit Semantic Analysis

• (=word vector) 

• cosine similarity

• ESA similarity
!40
I love you
I hate you

similarity   I     love  you
I            1     0.2   0.5
hate         0.3   0.5   0.4
you          0.5   0.4   1
max          1     0.5   1
Sentence Similarity - ESA Similarity
• ESA: Explicit Semantic Analysis

• I love you 

• .
!41
          I like you   I hate you
cosine    0.667        0.667
ESA       0.967        0.833
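The ESA figures in this comparison are consistent with taking, for each word of "I love you", its best match in the other sentence and averaging those maxima. That reading is inferred from the numbers on the slides; the function name `esa_similarity` is hypothetical, and the matrices below are copied from the two preceding slides.

```python
def esa_similarity(sim_matrix):
    """Average over the columns of the largest value in each column."""
    n_cols = len(sim_matrix[0])
    col_max = [max(row[j] for row in sim_matrix) for j in range(n_cols)]
    return sum(col_max) / n_cols


# Columns: I, love, you. Rows: words of the sentence being compared.
like_matrix = [
    [1.0, 0.2, 0.5],  # I
    [0.3, 0.9, 0.4],  # like
    [0.5, 0.4, 1.0],  # you
]
hate_matrix = [
    [1.0, 0.2, 0.5],  # I
    [0.3, 0.5, 0.4],  # hate
    [0.5, 0.4, 1.0],  # you
]

print(round(esa_similarity(like_matrix), 3))  # 0.967
print(round(esa_similarity(hate_matrix), 3))  # 0.833
```

Plain term-vector cosine gives both pairs 2/3 because each shares two of three words, while the word-level similarities let this score separate "like" from "hate".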
Sentence Similarity - ESA Similarity
• .

• 

• Song, Y., & Roth, D. (2015). Unsupervised sparse vector densification for short
text similarity. In Proceedings of the 2015 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language
Technologies (pp. 1275-1280).

• 

• ( )
!42
• preprocessing 80% 

• Zipf’s law

• corpus ,


• ( ) .


• 

• 

• , count based


• unlabeled data label 

• label insight
!44
WE WANT YOU!