NLP Deep Learning with Tensorflow
1. Understanding NLP and applying it in practice with TensorFlow
WRITTEN BY SeungWooKim
tmddno1@gmail.com
2. Current: AI TFT leader, POSCO IT Division
Development leader of the AI project support framework, POSCO IT Division
Development leader of the POSCO AI chatbot pilot service
In-house Big Data & AI instructor, POSCO ICT
Computer Engineering major, Sungkyunkwan University
tmddno1@gmail.com
3. 1. Lecture Docker environment
https://github.com/TensorMSA/skp_edu_docker
2. Lecture source code
git clone https://github.com/TensorMSA/tensormsa_jupyter.git
4. Lecture Goals
"I want to offer pizza ordering as a service through a chatbot messenger...
What data should I collect, which neural networks should I use, and how should I
structure the architecture to reach that goal?"
Given a natural language processing problem like the one above, the goal is to gain
the insight to approach it from the perspective of data and deep learning.
[Next session]
A session on how to apply and adapt the building blocks learned here at the
application level, from an architecture perspective.
5. 1.NLP & Deep Learning
2.Language Analysis Process
2-1.Voice Recognition
2-2.Lexical Analysis
2-2-1.Lexical Analysis Basic Process
2-2-2.Deep Learning on Lexical Analysis
2-2-3.Prerequisite Knowledge
2-2-4.BiLstmCrf for Named Entity Recognition
2-3.Syntactic Analysis
2-3-1.Dependency Parsing
2-3-2.Google SyntaxNet with Docker
2-4.Semantic Analysis
2-4-1.Semantic Role Labeling
2-4-2.Char CNN for Sentence Classification
2-5.Discourse Analysis
2-5-1.RNN for understanding the global conversation
6. 3.Language Generation
3-1.Basic Seq2Seq
3-2.Other types of Seq2Seq (Attention, Pointer)
4.Tips
4-1.Hyper Parameter Random Search
4-2.Genetic Algorithm for Hyper Parameter Search
4-3.Auto Hyper Parameter Search with Multi GPU Server
8. NLP and Deep Learning
Today’s Focus
As in other areas such as image processing, deep learning shows strong performance,
but due to the nature of the field it cannot replace everything 100%.
Understanding the existing research in this field is important.
https://www.slideshare.net/ssuser06e0c5/ss-64417928
10. NLP Applications
Mostly Solved | Making Good Progress | Still Really Hard
Spam Detection
Text Categorization
Part of Speech Tagging
Named Entity Recognition
Information Extraction
Sentiment Analysis
Coreference Resolution
Word Sense Disambiguation
Syntactic Parsing
Machine Translation
Semantic Search
Question & Answer
Textual Inference
Summarization
Discourse & Dialog
11. NLP Applications
Text Categorization
Text classification assigns one or more classes to a document according to its content. Classes are
selected from a previously established taxonomy (a hierarchy of categories or classes).
Spam Detection
Spam Detection is also the part of Text Classification problem.
Part of Speech Tagging
POS tagging, also called grammatical tagging or word-category disambiguation, is the process of
marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both
its definition and its context.
13. NLP Applications
Information Extraction in a broader view
https://www.google.co.kr/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0ahUKEwievZKlmMzVAhVCgrwKHbM_D88QFggyMAE&url=https%3A%2F%2Fweb.stanford.edu%2Fclass%2Fcs124%2Flec%2FInformation_Extraction_and_Named_Entity_Recognition.pptx&usg=AFQjCNFUT9ZjvrDrxF9su0J9KiWobVP4Kg
(Diagram components: rule-based extraction, named entity recognition, syntax analysis, relation search, ontology, information extraction)
14. NLP Applications
Coreference Resolution
I did not vote for the Donald Trump because I think he is too reckless
Coreference resolution is the task of finding all expressions that refer to the same entity in a
text. It is an important step for a lot of higher level NLP tasks that involve natural language
understanding such as document summarization, question answering, and information
extraction.
Deep Reinforcement Learning for Mention-Ranking Coreference Models
Improving Coreference Resolution by Learning Entity-Level Distributed Representations
https://medium.com/huggingface/state-of-the-art-neural-coreference-resolution-for-chatbots-3302365dcf30
15. NLP Applications
Word Sense Disambiguation
[Example]
1. a type of fish
2. tones of low frequency
and the sentences:
1. I went fishing for some sea bass.
2. The bass line of the song is too weak.
http://www.cs.cornell.edu/courses/cs4740/2014sp/lectures/wsd-1.pdf
Supervised approach: labeled example data
Semi-supervised approach: ontology based
17. NLP Applications
Machine Translation
Machine translation (MT) is automated translation. It is the process by which computer software is
used to translate a text from one natural language (such as English) to another (such as Spanish).
18. NLP Applications
Semantic Search
Semantic search seeks to improve search accuracy by understanding a searcher’s intent through
contextual meaning.
Question and Answer
Able to answer questions in natural language based on knowledge data (usually an ontology)
ex) The best-known example is IBM Watson
Textual Inference
Recognize, generate, or extract pairs <T, H> of natural language
expressions, such that a human who reads (and trusts) T would infer that H is most likely also true.
Summarization
Extracting the interesting parts of a text and creating a summary from those parts, with some
rephrasing to make the summary more grammatically correct.
Discourse & Dialog
Carrying on a conversation while understanding the whole dialogue history and the speaker's intended meaning.
19. Level of NLP
○ Pragmatics : use of language
○ Semantics : meaning of words & sentences
○ (Surface) Syntax : phrases & sentences
○ Morphology : morphemes, words
○ Phonology : phonemes (abstract units of speech sound)
○ Phonetics : phones (acoustic units of speech sound)
(Diagram annotations, low to high: speech sounds and words; how words are formed; word order; word & sentence meaning; dialogue intent & context)
23. Language Analysis - Speech Recognition
(Diagram: AI speaker Alexa and its microphone system)
24. Language Analysis - Speech Recognition
Deep learning for classification, a Hidden Markov Model for the language model
25. 2.Language Analysis Process
2-1.Voice Recognition
2-2.Lexical Analysis
2-2-1.Lexical Analysis Basic Process
2-2-2.Deep Learning on Lexical Analysis
2-2-3.Prerequisite Knowledge
2-2-4.BiLstmCrf for Named Entity Recognition
26. Language Analysis - Lexical Analysis
Main steps in lexical analysis:
Sentence splitting
Tokenizing
Morphological analysis
Part-of-speech tagging
27. Lexical Analysis - Sentence Splitting & Tokenizing
What if there is no newline character ('\n')? Where is the end-of-sentence (EOS) point?
What if the sentence is not properly separated into words by spaces?
[Examples]
[Problems]
28. Language Analysis - Lexical Analysis - Morphological
Word | Stemming | Lemmatization
Love Lov Love
Loves Lov Love
Loved Lov Love
Loving Lov Love
Innovation Innovat Innovation
Innovations Innovat Innovation
Innovate Innovat Innovate
Innovates Innovat Innovate
Innovative Innovat Innovative
Morphing Examples Stemming & lemmatization
Morphology is the process of finding morphemes, the smallest "meaningful units" (carrying lexical
meaning or grammatical function), and other information-bearing features such as stems in a language.
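To make the stemming vs. lemmatization contrast concrete, here is a minimal sketch (my own example, not from the slides) using NLTK; it assumes the WordNet data has been downloaded with nltk.download('wordnet').

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["loves", "loved", "loving", "innovations", "innovative"]:
    # stemming chops suffixes by rule; lemmatization maps to a dictionary form
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word, pos="v"))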
29. Language Analysis - Lexical Analysis - Part of Speech Tagging
Ambiguity
“that” can be a subordinating conjunction or a relative pronoun
- The fact that/IN you’re here
- A man that/WDT I know
“Around” can be a preposition, particle, or adverb
- I bought it at the shop around/IN the corner.
- I never got around/RP to getting a car.
- A new Toyota Prius costs around/RB $25K.
Degree of ambiguity (in Brown corpus)
- 11.5% of word types (40% of word tokens) are ambiguous
# of Tags 1 2 3 4 5 6 7
# of Words 35340 3760 264 61 12 2 1
※ The ambiguity problem is much more serious in Korean.
Part-of-speech tagging is one of the most important text analysis tasks: it classifies words into
their parts of speech and labels them according to a tagset, the collection of tags used for POS
tagging. Parts of speech are also known as word classes or lexical categories.
30. Language Analysis - Lexical Analysis - Implementation
Hannanum Kkma Komoran Mecab Twitter
하늘 / N 하늘 / NNG 하늘 / NNG 하늘 / NNG 하늘 / Noun
을 / J 을 / JKO 을 / JKO 을 / JKO 을 / Josa
나 / N 날 / VV 나 / NP 나 / NP 나 / Noun
는 / J 는 / ETD 는 / JX 는 / JX 는 / Josa
자동차 / N 자동차 / NNG 자동차 / NNG 자동차 / NNG 자동차 / Noun
Analysis result comparison | Library performance comparison
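The comparison above can be reproduced with the KoNLPy package, which wraps several Korean morphological analyzers behind one interface. A minimal sketch (assuming KoNLPy and the underlying analyzers are installed):

from konlpy.tag import Kkma, Komoran

sentence = u"하늘을 나는 자동차"
print(Kkma().pos(sentence))      # e.g. [('하늘', 'NNG'), ('을', 'JKO'), ...]
print(Komoran().pos(sentence))   # tagsets differ slightly between analyzers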
34. 2.Language Analysis Process
2-1.Voice Recognition
2-2.Lexical Analysis
2-2-1.Lexical Analysis Basic Process
2-2-2.Deep Learning on Lexical Analysis
2-2-3.Prerequisite Knowledge
2-2-4.BiLstmCrf for Named Entity Recognition
35. Language Analysis - Lexical Analysis
[Deep Learning - Sequence Labeling - BiLSTM-CRF]
(1) Word Segmentation
(2) POS Tagging
(3) Chunking
(4) Clause Identification
(5) Named Entity Recognition
(6) Semantic Role Labeling
(7) Information Extraction
What's sequence labeling | What we can do with sequence labeling
36. Language Analysis - Lexical Analysis
[Deep Learning - Sequence Labeling - BiLSTM-CRF]
Word POS Chunk NE
West NNP B-NP B-MISC
Indian NNP I-NP I-MISC
all-around NN I-NP O
Phil NNP I-NP B-PER
Simons NNP I-NP I-PER
took VBD B-VP O
four CD B-NP O
for IN B-PP O
38 CD B-NP O
on IN B-PP O
Friday NNP B-NP O
IOB data set example
POS tag meanings:
https://docs.google.com/spreadsheet/ccc?key=0ApcJghR6UMXxdEdURGY2YzIwb3dSZ290RFpSaUkzZ0E&usp=sharing
Chunk tag meanings:
B : Begin of Chunk
I : Continuation of Chunk
E: End of Chunk
NP : Noun
VP : Verb
NER BIO tag meanings:
B : Start with new Chunk
I : word inside Chunk
O: Outside of Chunk
37. Language Analysis - Lexical Analysis - Sequence Labeling
[Deep Learning - BiLSTM-CRF]
BiLSTM-CRF Description
Before we talk about BiLSTM-CRF, which is a really important algorithm for sequence labelling,
let's briefly go over the background knowledge we need.
38. 2.Language Analysis Process
2-1.Voice Recognition
2-2.Lexical Analysis
2-2-1.Lexical Analysis Basic Process
2-2-2.Deep Learning on Lexical Analysis
2-2-3. Prerequisite Knowledge
2-2-4.BiLstmCrf for Named Entity Recognition
39. Language Analysis - Lexical Analysis - Check Prerequisite
[Those will be needed to understand what I am trying to explain]
Concept of perceptron
& Deep Neural Network
Concept of SoftMax
DNN & Matrix
Gradient Descent Back Propagation
Activation Functions
40. Language Analysis - Brief Explanation
import tensorflow as tf

learning_rate = 0.001  # assumed value; not shown on the original slide

# tf Graph input
x = tf.placeholder("float", [None, 784])
y = tf.placeholder("float", [None, 10])
# Store layers weight & bias
weights = {
    'h1': tf.Variable(tf.random_normal([784, 256])),
    'h2': tf.Variable(tf.random_normal([256, 256])),
    'out': tf.Variable(tf.random_normal([256, 10]))
}
biases = {
    'b1': tf.Variable(tf.random_normal([256])),
    'b2': tf.Variable(tf.random_normal([256])),
    'out': tf.Variable(tf.random_normal([10]))
}
# Hidden layer with RELU activation
layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
layer_1 = tf.nn.relu(layer_1)
# Hidden layer with RELU activation
layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
layer_2 = tf.nn.relu(layer_2)
# Output layer followed by softmax
pred = tf.matmul(layer_2, weights['out']) + biases['out']
hypothesis = tf.nn.softmax(pred)
# Define loss (cross entropy) and optimizer
cost = tf.reduce_mean(-tf.reduce_sum(y * tf.log(hypothesis), reduction_indices=1))
train_op = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
(Diagram: 784-unit input, two 256-unit hidden layers, and a 10-unit softmax output; each layer computes Y = Activation(W*x + b), and the error is measured with cross entropy)
41. Language Analysis - Lexical Analysis - Check Prerequisite
[Those will be needed to understand what I am trying to explain]
Dynamic RNN | Bidirectional LSTM
Word Embedding | Recurrent Neural Network | LSTM (Long Short Term Memory)
42. Language Analysis - Brief Explanation
START 오늘 날씨 는 ? PAD PAD END
START 오늘 날씨 는 어때 ? PAD END
START 오늘 비가 오 려 나 ? END
With long sentences, the vanishing-gradient problem occurs, and padding variable-length data to a
fixed length wastes computing power. This is where the concept of a dynamic RNN comes in.
A bidirectional LSTM additionally learns the given data in the backward direction.
(Diagram: Long Short Term Memory cell, showing the cell state with forget, update, and output gates)
https://brunch.co.kr/@chris-song/9
https://blog.altoros.com/the-magic-behind-google-translate-sequence-to-sequence-models-and-tensorflow.html
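As a concrete illustration of the two ideas above, here is a minimal TensorFlow 1.x sketch (shapes are assumed for illustration) of a dynamic bidirectional LSTM; passing sequence_length lets the RNN stop at each sentence's real length instead of running over the PAD tokens.

import tensorflow as tf

inputs = tf.placeholder(tf.float32, [None, 20, 128])   # batch x max_len x embedding_dim
seq_len = tf.placeholder(tf.int32, [None])             # true length of each sentence

cell_fw = tf.contrib.rnn.BasicLSTMCell(64)
cell_bw = tf.contrib.rnn.BasicLSTMCell(64)
(out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    cell_fw, cell_bw, inputs, sequence_length=seq_len, dtype=tf.float32)

# concatenate the forward and backward outputs: batch x max_len x 128
outputs = tf.concat([out_fw, out_bw], axis=-1)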
43. Language Analysis - Word embedding
What is word embedding?
A way of representing the units that make up text (phonemes, syllables, words, sentences,
documents) as numeric vectors.
Advantages: dimensionality reduction, representation of semantic similarity
Disadvantages: handling homonyms; weak training signal when there is little data
44. Language Analysis - Word embedding - OneHot Encoding
Concept of OneHot Encoding
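A minimal sketch of the idea (vocabulary chosen for illustration): each word gets a vector that is all zeros except for a single 1 at the word's index.

import numpy as np

vocab = ["김승우", "전화번호", "이메일", "검색"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab), dtype=np.float32)
    vec[word_to_idx[word]] = 1.0
    return vec

print(one_hot("이메일"))   # [0. 0. 1. 0.]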
45. Language Analysis - Word embedding - Word2Vec
https://www.tensorflow.org/tutorials/word2vec
http://w.elnn.kr/search/
Concept of Word2Vector
Word2Vector Demo Site
46. Language Analysis - Word embedding - Word2Vec
CBOW
Original text: the quick brown fox jumped over the lazy dog
Data set (window size 1): ([brown, jumped], fox)
(Diagram: one-hot context words of vocab size as input, a hidden layer of the embedding size, and the center word as output)
47. Language Analysis - Word embedding - Word2Vec
Skip-Gram
Original text: the quick brown fox jumped over the lazy dog
Data set (window size 1): (fox, brown), (fox, jumped)
(Diagram: the one-hot center word of vocab size as input, a hidden layer of the embedding size, and a context word as output)
48. Language Analysis - Word embedding - Doc2Vec
Document vector variants: (1) PV-DM, (2) PV-DBOW, (3) DM + DBOW (vector concat), (4) AVG(TF-IDF * W2V)
Original text: the quick brown fox jumped over the lazy dog
PV-DBOW data set: (paragraph, the), (paragraph, quick), (paragraph, brown), (paragraph, fox), (paragraph, jumped), ...
PV-DM data set: ([paragraph, quick, brown, fox, jumped], over), ([paragraph, quick, brown, fox, jumped, over], the)
(Diagram: for variant (4), each word's W2V vector is multiplied by its TF-IDF weight and the results are averaged into a document vector)
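A minimal gensim sketch (toy corpus assumed) of the PV-DM / PV-DBOW variants above; dm=1 selects PV-DM and dm=0 selects PV-DBOW (older gensim versions use size= instead of vector_size= and model.docvecs instead of model.dv).

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["the", "quick", "brown", "fox"], tags=["doc0"]),
        TaggedDocument(words=["jumped", "over", "the", "lazy", "dog"], tags=["doc1"])]

model = Doc2Vec(docs, vector_size=50, window=2, min_count=1, dm=1, epochs=20)
print(model.dv["doc0"][:5])   # the learned paragraph vector for doc0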
49. tfidf(t,d,D) = tf(t,d) x idf(t,D)
Language Analysis - Word embedding - TF-IDF
https://thinkwarelab.wordpress.com/2016/11/14/ir-tf-idf-%EC%97%90-%EB%8C%80%ED%95%B4-%EC%95%8C%EC%95%84%EB%B4%85%EC%8B%9C%EB%8B%A4/
http://www.popit.kr/bm25-elasticsearch-5-0%EC%97%90%EC%84%9C-%EA%B2%80%EC%83%89%ED%95%98%EB%8A%94-%EC%83%88%EB%A1%9C%EC%9A%B4-%EB%B0%A9%EB%B2%95/
Not exactly a word embedding, but used quite often in NLP with deep learning:
- Document similarity
- Word importance within a document
- Search engines (such as Elasticsearch, though it now uses BM25)
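A minimal scikit-learn sketch (toy documents assumed) of the formula above: tf counts how often a term appears in one document, while idf downweights terms that appear in many documents.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the quick brown fox",
        "the lazy dog",
        "the quick dog jumped over the lazy dog"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)        # shape: (3 documents, vocabulary size)
print(vectorizer.get_feature_names_out())     # get_feature_names() on older scikit-learn
print(tfidf.toarray().round(2))               # rows: documents, columns: term weights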
51. Language Analysis - Word embedding - Word+Char
the quick brown fox jumped over the lazy dog
(Diagram: each character f, o, x gets a one-hot encoding, the whole word fox gets a Word2Vec vector, and the two representations are concatenated)
1. Word2Vec-style embeddings represent semantic relatedness well.
2. One-hot encodings give a strong, distinct signal, which is effective for training.
3. Word-level embeddings memorize words well.
4. Character-level embeddings handle untrained (out-of-vocabulary) words well.
52. Language Analysis - Word embedding - NGram
In the case of Word2Vec, the model can only represent words it was trained on;
words that do not exactly match the pretrained dictionary come back as "UNKNOWN".
So FastText (by Facebook) uses character n-grams in its word embedding algorithm.
Comparing 에어컨 with 에어조단:
에어컨
['$$에', '$에어', '에어컨', '어컨$', '컨$$'] => 5
에어조단
['$$에', '$에어', '에어조', '어조단', '조단$', '단$$'] => 6
Matching n-grams
['$$에', '$에어'] => 2
Score
2 matches out of 9 distinct n-grams overall => 0.2222
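The score above can be reproduced with a few lines of Python; a minimal sketch ($ marks the padding character, and the score is the number of matched n-grams divided by all distinct n-grams of the two words):

def char_ngrams(word, n=3, pad="$"):
    padded = pad * (n - 1) + word + pad * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

a, b = char_ngrams(u"에어컨"), char_ngrams(u"에어조단")
score = len(a & b) / float(len(a | b))     # 2 matches out of 9 distinct n-grams ≈ 0.2222
print(sorted(a & b), round(score, 4))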
54. Language Analysis - Word embedding - Implementation
OneHot Encoding: simple test code showing the concept of one-hot encoding
http://ip:8888/tree/tensormsa_jupyter/chap05_nlp/wordembedding/
[Code]
55. Language Analysis - Word embedding - Implementation
Word2Vector : Using Gensim word2vec package
http://ip:8888/tree/tensormsa_jupyter/chap05_nlp/wordembedding/
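A minimal sketch (toy corpus assumed, not the notebook's exact code) of training word vectors with the gensim word2vec package mentioned above; sg=0 trains CBOW and sg=1 trains skip-gram (older gensim versions use size= instead of vector_size=).

from gensim.models import Word2Vec

sentences = [["the", "quick", "brown", "fox"],
             ["the", "lazy", "dog"]]

model = Word2Vec(sentences, vector_size=50, window=1, min_count=1, sg=1)
print(model.wv["fox"][:5])                   # the learned word vector
print(model.wv.most_similar("fox", topn=2))  # nearest neighbours in the vector space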
56. Language Analysis - Word embedding - Implementation
FastText: Facebook fastText with the gensim wrapper
http://ip:8888/tree/tensormsa_jupyter/chap05_nlp/wordembedding/
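A minimal sketch (toy corpus assumed, not the notebook's exact code) of gensim's FastText: because vectors are built from character n-grams, even out-of-vocabulary words get a usable vector (older gensim versions use size= instead of vector_size=).

from gensim.models import FastText

sentences = [["에어컨", "구매"], ["에어조단", "구매"]]
model = FastText(sentences, vector_size=50, min_count=1, min_n=2, max_n=3)
print(model.wv.similarity("에어컨", "에어조단"))   # related through shared character n-grams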
57. Language Analysis - Word embedding - Implementation
FastText: it is possible to use pretrained vectors and fine-tune them
http://ip:8888/tree/tensormsa_jupyter/chap05_nlp/wordembedding/
https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
58. Language Analysis - Word embedding - Implementation
N-grams are simply all combinations of adjacent words or letters of length n that you can
find in your source text.
59. Language Analysis - Word embedding - Implementation
For word2vec training on a large dataset, GPU acceleration is needed.
You can also consider using TensorFlow or Keras to train the model.
https://github.com/SimonPavlik/word2vec-keras-in-gensim/blob/keras106/word2veckeras/word2veckeras.py
https://github.com/tensorflow/models/blob/master/tutorials/embedding/word2vec.py
60. 2.Language Analysis Process
2-1.Voice Recognition
2-2.Lexical Analysis
2-2-1.Lexical Analysis Basic Process
2-2-2.Deep Learning on Lexical Analysis
2-2-3. Other prerequisite Knowledge
2-2-4.BiLstmCrf for Named Entity Recognition
62. Language Analysis - Lexical Analysis - Sequence Labeling
[Deep Learning - BiLSTM-CRF]
김승우 B-PERSON
전화번호 B-TARGET
검색 O
김승우 B-PERSON
이메일 B-TARGET
검색 O
김승우 B-PERSON
이미지 B-TARGET
검색 O
IOB Data
김승우 전화번호 검색
김승우 이메일 검색
김승우 이미지 검색
Plain Data
Sentence
Splitting
Token Morphing
Part of
Speech
Tagging
Lexical Analysis
Word2Vector
OneHot Encoding
1 0 0 0
0 1 0 0
0 0 1 0
김승우
전화번호
이메일
검색
B-PERSON
B-TARGET
김
우
승
Index
List
63. Language Analysis - Lexical Analysis - Sequence Labeling
[Deep Learning - BiLSTM-CRF]
김승우
전화번호
이메일
검색
B-PERSON
B-TARGET
김
우
승
Index
List
[Code]
66. Language Analysis - Lexical Analysis - Sequence Labeling
[Deep Learning - BiLSTM-CRF]
Conditional Random Field Soft Max
[Code]
67. Language Analysis - Lexical Analysis - Sequence Labeling
[Deep Learning - BiLSTM-CRF]
http://people.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf
Probabilistic Model for sequence data segmentation and labeling
https://www.slideshare.net/kanimozhiu/tdm-probabilistic-models-part-2
The first method makes local choices. In other words, even if we capture some information from the
context in the hidden states thanks to the bi-LSTM, the tagging decision is still local; we don't
make use of the neighboring tagging decisions. For instance, in "New York", the fact that we are
tagging "York" as a location should help us decide that "New" corresponds to the beginning of a
location. Given a sequence of words w_1, ..., w_m, a sequence of score vectors s_1, ..., s_m, and a
sequence of tags y_1, ..., y_m, a linear-chain CRF defines a global score s ∈ R.
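One common way to write that global score (notation assumed: s_t are the per-word score vectors produced by the BiLSTM and T is a learned tag-transition matrix) is, in LaTeX:

C(y_1,\dots,y_m) = \sum_{t=1}^{m} s_t[y_t] + \sum_{t=1}^{m-1} T[y_t, y_{t+1}],
\qquad
P(y_1,\dots,y_m \mid w_1,\dots,w_m) = \frac{\exp C(y_1,\dots,y_m)}{\sum_{y'} \exp C(y'_1,\dots,y'_m)}

Training maximizes the log-probability of the gold tag sequence, and prediction finds the best-scoring sequence with the Viterbi algorithm, so neighboring tagging decisions influence each other.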
68. Language Analysis - Lexical Analysis - Sequence Labeling
[Deep Learning - BiLSTM-CRF]
Gradient
Descent
Momentum
NAG
Adagrad
Adadelta
Rmsprop
Adam
[Code]
69. Language Analysis - Lexical Analysis - Sequence Labeling
[Deep Learning - BiLSTM-CRF]
https://arxiv.org/pdf/1705.08292.pdf
"Solutions found with gradient descent (GD) or stochastic gradient descent (SGD) generalize much
better than solutions found with adaptive methods (e.g. AdaGrad, RMSprop, and Adam)."
The Marginal Value of Adaptive Gradient Methods in Machine Learning. Ashia C. Wilson, Rebecca Roelofs,
Mitchell Stern, Nathan Srebro, and Benjamin Recht. University of California, Berkeley, and Toyota
Technological Institute at Chicago, May 24, 2017.
There is no optimizer that is best for all cases!
When should you use an adaptive optimizer?
If the input embedding vectors are sparse, an adaptive optimizer tends to work better.
70. Language Analysis - Lexical Analysis - Sequence Labeling
[Deep Learning - BiLSTM-CRF]
Real-project BiLSTM results | Sample-code prediction test results
Test data not included in the training set is still predicted well.
http://ip:8888/tree/tensormsa_jupyter/chap05_nlp/sequence_tagging/
71. 2.Language Analysis Process
2-1.Voice Recognition
2-2.Lexical Analysis
2-3.Syntactic Analysis
2-3-1.Dependency Parsing
2-3-2.Google SyntaxNet with Docker
72. Language Analysis - Syntactic Analysis
Syntactic parsing (구문 분석) determines the structure of a sentence by decomposing it into its
constituent parts and analyzing the hierarchical relations between them.
Graph-Based Models | Transition-Based Models
CYK-style parsing, MST-finding algorithms, projective & non-projective models
73. Language Analysis - Syntactic Analysis
Transition-Based Models
Sentence W
Repeat until all words have their head:
- Select two target words in the data structure
(one dependent and one head candidate)
- Deterministically predict the next parsing action with the parsing model
- Modify the structure according to the parsing action
C0 -> C1 -> C2 -> ……..C8 -> C9 -> C10 -> .… -> Cm D-tree
t1 t2 t3 t8 t9 t10 tm
Oracle
(Classifier)
Predict the best
transition
74. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
75. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
Assume that we are given an oracle :
- for any non-terminal configuration, it can predict the correct transition
(for deterministic parsing)
- That is, it takes two words and magically gives us the dependency
relation between them, if one exists
76. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
Shift :
Move Economic from buffer B to stack S
77. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
Left-arc :
Add left-arc (news, Economic, amod) to arc set A
Remove Economic from stack (since it now has head in A)
78. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
Shift :
Move news from buffer B to stack S
79. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
Left-arc :
Add left-arc (had, news, nsubj) to A
Remove news from stack (since it now has head in A)
80. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
Right-arc :
Add right-arc (ROOT, had, root) to A
keep had in stack : because it can have other dependents on the right
81. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
Left-arc :
Add left-arc (effect, little, amod) to A
Remove little from stack (since it now has head in A)
82. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
Right-arc :
Add right-arc (had, effect, dobj) to A
Keep effect in stack : because it can have other dependents on right
83. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
Right-arc :
Add right-arc (effect, on, prep) to A
Keep on in stack : because it can have other dependents on the right
84. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
Shift :
Move financial from buffer B to stack S
85. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
Left-arc :
Add left-arc (market, financial, amod) to A
Remove financial from stack (since it now has head in A)
86. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
Right-arc :
Add right-arc (on, markets, pmod) to A
Keep markets in stack : because it can have other dependents on the right
87. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
Reduce :
Remove markets, on, effect from stack (since they already have head in A)
※ All decisions like right-arc, left-arc, reduce, shift will be made by oracle
88. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
Right-arc :
Add right-arc (had, period, p) to A
Keep period in stack
Done !
89. 2.Language Analysis Process
2-1.Voice Recognition
2-2.Lexical Analysis
2-3.Syntactic Analysis
2-3-1.Dependency Parsing
2-3-2.Google SyntaxNet with Docker
90. Language Analysis - Syntactic Analysis - Syntax Net
We show this layout in the schematic below: the state of the system (a stack and a buffer, visualized
below for both the POS and the dependency parsing task) is used to extract sparse features, which
are fed into the network in groups. We show only a small subset of the features to simplify the
presentation in the schematic
Google SyntaxNet with Deep Learning - Pos Tagging
http://cs.stanford.edu/people/danqi/papers/emnlp2014.pdf
91. Language Analysis - Syntactic Analysis - Syntax Net
Google SyntaxNet with Deep Learning - A Fast and Accurate Dependency Parser using Neural Networks
https://arxiv.org/pdf/1603.06042.pdf
1 2 3
1 I _ PRP PRP _ 2 nsubj _ _
2 knew _ VBD VBD _ 0 ROOT _ _
3 I _ PRP PRP _ 5 nsubj _ _
4 could _ MD MD _ 5 aux _ _
5 do _ VB VB _ 2 ccomp _ _
6 it _ PRP PRP _ 5 dobj _ _
7 properly _ RB RB _ 5 advmod _ _
8 if _ IN IN _ 9 mark _ _
9 given _ VBN VBN _ 5 advcl _ _
10 the _ DT DT _ 12 det _ _
11 right _ JJ JJ _ 12 amod _ _
12 kind _ NN NN _ 9 dobj _ _
13 of _ IN IN _ 12 prep _ _
14 support _ NN NN _ 13 pobj _ _
15 . _ . . _ 2 punct _ _
18 units
(1),(2),(3)
18 units
(1),(2),(3)
12 units
(2),(3)
(1) The top 3 words on the stack and buffer: s1, s2, s3, b1, b2, b3; => 6
(2) The first and second leftmost / rightmost children of the top two words
on the stack: lc1(si), rc1(si), lc2(si), rc2(si), i = 1, 2. => 8
(3) The leftmost of leftmost / rightmost of rightmost children of the top two
words on the stack: lc1(lc1(si)), rc1(rc1(si)), i = 1, 2. => 4
92. Language Analysis - Syntactic Analysis - Syntax Net
Google SyntaxNet with Deep Learning - Local Parser
1. SHIFT: Push another word onto the top of the stack, i.e. shifting one token from the buffer to
the stack.
2. LEFT_ARC: Pop the top two words from the stack. Attach the second to the first, creating an
arc pointing to the left. Push the first word back on the stack.
3. RIGHT_ARC: Pop the top two words from the stack. Attach the second to the first, creating an
arc pointing to the right. Push the second word back on the stack.
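A toy Python sketch (not SyntaxNet code) of how such a transition system manipulates its state; arcs collects (head, dependent) pairs, and in a real parser the action at each step comes from the neural network described above.

def step(stack, buffer, arcs, action):
    if action == "SHIFT":            # move the next buffer word onto the stack
        stack.append(buffer.pop(0))
    elif action == "LEFT_ARC":       # top of stack becomes head of the word below it
        top, below = stack.pop(), stack.pop()
        arcs.append((top, below))
        stack.append(top)
    elif action == "RIGHT_ARC":      # word below becomes head of the top of stack
        top, below = stack.pop(), stack.pop()
        arcs.append((below, top))
        stack.append(below)
    return stack, buffer, arcs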
93. Language Analysis - Syntactic Analysis - Syntax Net
As we describe in the paper, there are several problems with the locally normalized models we just
trained. The most important is the label-bias problem: the model doesn't learn what a good parse
looks like, only what action to take given a history of gold decisions. This is because the scores are
normalized locally using a softmax for each decision.
Google SyntaxNet with Deep Learning - Global Training
94. Language Analysis - Syntactic Analysis - Syntax Net
What’s Beam Search Algorithm on RNN ?
https://www.youtube.com/watch?v=UXW6Cs82UKo
Instead of trying only the single best choice at every step, explore multiple candidates to the end
and choose the sequence whose total score is maximum. Computing every possible sequence would be far
too expensive, so only the best few hypotheses are kept at each step and the rest are removed
(pruning). The goal is a globally better prediction rather than a locally greedy one.
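A toy sketch of that idea (the per-step scores and the beam width are assumptions for illustration): at every step keep only the top-k partial sequences by accumulated score instead of the single best one.

import heapq

def beam_search(step_scores, beam_width=3):
    # step_scores: one dict {token: log_prob} per output position
    beams = [([], 0.0)]              # (sequence so far, total log probability)
    for scores in step_scores:
        candidates = [(seq + [tok], total + lp)
                      for seq, total in beams
                      for tok, lp in scores.items()]
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[1])
    return beams[0]                  # best-scoring complete sequence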
95. Language Analysis - Syntactic Analysis - Syntax Net
http://universaldependencies.org/
Google SyntaxNet does not support Korean as a default language.
But as we can see below, we can train the model with Sejong corpus data,
though we have to convert the format into one SyntaxNet understands.
Google SyntaxNet with Deep Learning - How about Korean
96. Language Analysis - Syntactic Analysis - Syntax Net
Demo Site (we also use samples on this site)
http://sejongpsg.ddns.net/syntaxnet/psg_tree.htm
SyntaxNet Korean with Docker (we pretrained on a Korean corpus and set up a web server for the service)
https://github.com/TensorMSA/tensormsa_syntax_docker
Google SyntaxNet with Deep Learning - Test it by yourself
97. 2.Language Analysis Process
2-1.Voice Recognition
2-2.Lexical Analysis
2-3.Syntactic Analysis
2-4.Semantic Analysis
2-4-1.Semantic Role Labeling
2-4-2.Char CNN for Sentence Classification
2-5.Discourse Analysis
98. Language Analysis - Semantic Analysis
What is semantics in the study of language?
Three perspectives on meaning
- Lexical semantics : individual words
- Sentential semantics : individual sentences
- Discourse or pragmatics : longer pieces of text or conversation
NLP tasks for sentential semantics
- Semantic role labeling (SRL)
- Phrase similarity (= paraphrase)
- Sentence classification, sentence emotion analysis, etc.
99. Language Analysis - Semantic Analysis - SRL
What is Semantic Role Labeling (SRL)
SRL = Semantic roles express the abstract role that arguments of a predicate
can take in the event.
The police arrested the suspect in the park last night
Agent predicate Theme Location Time
Who did what to whom where when
Can we figure out that these sentences have the same meaning?
Can we figure out that bought, sold, and purchase are used in sentences with the same meaning?
XYZ corporation bought the stock.
They sold the stock to XYZ corporation.
The stock was bought by XYZ corporation.
The purchase of the stock by XYZ corporation.
100. Language Analysis - Semantic Analysis - SRL
Common Semantic Role Labeling Architecture
http://naacl2013.naacl.org/Documents/semantic-role-labeling-part-1-naacl-2013-tutorial.pdf
(Pipeline diagram: syntactic parse -> prune constituents -> candidates -> argument identification -> arguments -> argument classification -> semantic roles -> structural inference)
Step-1 Candidate Selection
- Parse the sentence
- Prune/filter the parse tree
(eliminate some tree constituents to speed up the execution)
Step-2 Argument Identification
- A binary classification of each node as Argument or NONE
- Local scoring
Step-3 Argument Classification
- A multi class (one-of-N) classification of all the argument candidates
- Global /joint scoring
(A machine learning model is used at each of the three steps.)
101. Language Analysis - Semantic Analysis - SRL
Exceptions to the Standard Architecture
1. Specialized parsing for SRL
- Syntactic parser trained to predict argument
candidates
- Semantic parsing = parsing + SRL
- SRL based on dependency parsing
2. Sequential labeling (instead of tree traversing)
- Motivated by Lack of full parse trees
102. Language Analysis - Semantic Analysis - SRL
Semantic Role Labeling Applications
Information: Anna is a friend of mine.
http://localhost:8888/notebooks/tensormsa_jupyter/chap05_nlp/neo4j/neo4j_basic.ipynb
(Graph: Name -[Relation]-> Name)
Neo4j insert query (run with an open neo4j-driver session):
session.run("MATCH (you:Person {name:'You'}) "
            "FOREACH (name in ['Anna'] |"
            " CREATE (you)-[:FRIEND]->(:Person {name:name}))")
result = session.run("MATCH (you {name:'You'})-[:FRIEND]->(yourFriends)"
                     " RETURN you, yourFriends")
Neo4j Jupyter example & visualization
103. 2.Language Analysis Process
2-1.Voice Recognition
2-2.Lexical Analysis
2-3.Syntactic Analysis
2-4.Semantic Analysis
2-4-1.Semantic Role Labeling
2-4-2.Char CNN for Sentence Classification
2-5.Discourse Analysis
104. Language Analysis - Semantic Analysis - Text Classification
Can we figure out whether these sentences are positive or negative?
돈이 아깝지 않다 (positive)
다시는 오지 않을 거야 (negative)
음식이 정말 맛이 없다 (negative)
이 식당은 정말 맛있다 (positive)
Analyzing positive vs. negative with a dictionary:
the word "않다" is usually negative, but?
돈이 아깝지 않다 => Positive
다시는 오지 않을 거야 => Negative
105. There are many ways of doing text classification:
Traditional rule-based approaches
Machine learning - logistic regression & SVM
Deep learning - CharCNN, RNN, etc.
Language Analysis - Semantic Analysis - Text Classification
106. Language Analysis - Semantic Analysis - Char CNN
http://localhost:8888/notebooks/tensormsa_jupyter/chap05_nlp/charcnn/charcnn.ipynb
The deep learning method CharCNN can be a solution for this kind of problem.
1 2 3
107. Language Analysis - Semantic Analysis - Char CNN
http://localhost:8888/notebooks/tensormsa_jupyter/chap05_nlp/charcnn/charcnn.ipynb
Preparing the data for embedding is pretty similar to other neural networks.
1. Word embedding & one-hot didn't show that much difference.
2. Personally, I prefer to concatenate char one-hot + word2vec.
(Example of a padded sentence: 오늘 / 메뉴 / 는 / 뭐 / 지? / PAD / PAD)
1. Need to define a maximum sentence length
2. Need padding, like other NLP neural networks
108. Language Analysis - Semantic Analysis - Char CNN
http://localhost:8888/notebooks/tensormsa_jupyter/chap05_nlp/charcnn/charcnn.ipynb
Using multiple convolution filter sizes
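A minimal TensorFlow 1.x sketch (shapes assumed, not the notebook's exact code) of convolving the same sentence matrix with several filter sizes and concatenating the max-pooled features.

import tensorflow as tf

embedded = tf.placeholder(tf.float32, [None, 20, 128, 1])   # batch x max_len x embed_dim x 1
pooled = []
for filter_size in [2, 3, 4]:
    conv = tf.layers.conv2d(embedded, filters=32,
                            kernel_size=[filter_size, 128], activation=tf.nn.relu)
    # max over the remaining time dimension -> one 32-dim feature vector per filter size
    pooled.append(tf.reduce_max(conv, axis=[1, 2]))
features = tf.concat(pooled, axis=1)   # batch x (3 * 32), fed to the fully connected layers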
109. Language Analysis - Semantic Analysis - Char CNN
http://localhost:8888/notebooks/tensormsa_jupyter/chap05_nlp/charcnn/charcnn.ipynb
The other steps are the same (fully connected > softmax > loss > optimizer)
110. Language Analysis - Semantic Analysis - Char CNN
http://localhost:8888/notebooks/tensormsa_jupyter/chap05_nlp/charcnn/charcnn.ipynb
You can see that the Char CNN can distinguish the two sentences.
111. 2.Language Analysis Process
2-1.Voice Recognition
2-2.Lexical Analysis
2-3.Syntactic Analysis
2-4.Semantic Analysis
2-5.Discourse Analysis
2-5-1.RNN for understanding the global conversation
2-5-2.Memory Network for global context
112. Language Analysis - Dialogue Understand
https://research.fb.com/publications
Getting to a natural language dialogue state with a chatbot remains
a challenge and will require a number of research breakthroughs. At
FAIR we have chosen to tackle the problem from both ends:
general AI and reasoning by machines through communication as
well as conducting research grounded in current dialog systems,
using lessons learned from exposing actual chatbots to people.
The attempt to understand and interpret dialogue is not a new one.
As far back as 20 years ago, there were several efforts to build a machine
a person could talk to and teach how to have a conversation. These
incorporated technology and engineering, but were single purposed
with a very narrow focus, using pre-programmed scripted responses.
Thanks to progress in machine learning, particularly in the last few
years, having AI agents being able to converse with people in natural
language has become a more realistic endeavor that is garnering
attention from both the research community and industry.
However, most of today’s dialogue systems continue to be scripted:
their natural language understanding module may be based on
machine learning, but what they execute or answer is in general
dictated by if/then statements or rules engines. While they are an
improvement on what existed decades ago, this is in large part due to
the large databases of content used to create and script their
responses.
Amazing free papers! Read them right now!
113. Discourse Analysis with RNN
In a conversation the topic changes often, so keeping track of the topic of the conversation is important.
안녕
안녕
넌 뭐할줄 아니?
기능은 XX 가 있어요
사람 좀 찾아볼까해
누구를 찾아드려요?
포항 제강부 IT담당 홍길동 팀장의
그룹장을 좀 찾아줘 (지역:포항), 부서(제강부),업무 (IT), 이름
(홍길동), 직급(팀장), 상위자(그룹장) 을
검색합니다.
내일 점심 먹자고 문자 보내줘
“내일 점식 먹자고” 로 전송합니다.
아냐. 수고했어. 나가서 먹지
대화를 초기화 합니다.
State: initial state
State: help state
State: person search state
State: send a text message to the person found
State: initial state
114. Dialogue State Tracking Challenge and Accepted papers
Discourse Analysis with RNN
http://www.phontron.com/paper/yoshino16iwsds.pdf
http://www.colips.org/workshop/dstc4/papers.html
* Dialogue State Tracking using Long Short Term Memory Neural Networks
Koichiro Yoshino, Takuya Hiraoka, Graham Neubig and Satoshi Nakamura
115. Let's predict the intent of each sentence in the conversation.
The basic idea is to keep the RNN state and continue prediction from that point.
(Diagram: dialogue state tracking with an LSTM; along the timeline, each utterance is encoded with Doc2Vec, fed to the LSTM, and an intent is predicted at every step)
116. The key point of this code is using the RNN state vector as memory.
Discourse Analysis with RNN
http://localhost:8888/tree/chap05_nlp/state_tracking
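A minimal TensorFlow 1.x sketch of carrying the recurrent state across dialogue turns (a GRU cell is used here for brevity since its state is a single tensor; shapes are assumed): the final state of one utterance is fed back as the initial state of the next one.

import numpy as np
import tensorflow as tf

utterance = tf.placeholder(tf.float32, [1, 1, 100])   # one turn: batch=1, time=1, Doc2Vec dim
state_in = tf.placeholder(tf.float32, [1, 64])        # state carried over from the last turn
cell = tf.contrib.rnn.GRUCell(64)
outputs, state_out = tf.nn.dynamic_rnn(cell, utterance, initial_state=state_in)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    state = np.zeros((1, 64), dtype=np.float32)
    for turn in np.random.rand(3, 1, 1, 100).astype(np.float32):
        # an intent classifier would read `outputs`; here we only propagate the state
        state = sess.run(state_out, feed_dict={utterance: turn, state_in: state})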
117. 2.Language Analysis Process
2-1.Voice Recognition
2-2.Lexical Analysis
2-3.Syntactic Analysis
2-4.Semantic Analysis
2-5.Discourse Analysis
2-5-1.RNN for understanding the global conversation
2-5-2.Memory Network for global context
118. The goal of dialogue understanding and memory networks
Memory Network for Dialogue Understanding
https://arxiv.org/pdf/1503.08895v4.pdf
119. Here is the network architecture of the end-to-end memory network.
Memory Network for Dialogue Understanding
https://yerevann.github.io/2016/02/05/implementing-dynamic-memory-networks/
https://www.slideshare.net/mobile/carpedm20/ss-63116251
120. (1) Feed the data ("Sentences", "Question", "Target")
Memory Network for Dialogue Understanding
1
2
3
121. Convert word indices to embedding vectors (training the embedding matrices A, B, C)
Memory Network for Dialogue Understanding
(Diagram: embedding matrices of shape vocab size x embedding dim; the memory holds one slot per sentence)
122. The memories built with embedding A from the given context sentences are multiplied with the
question embedding (embedding B, which is not defined in this code). ※ This holds for the first
layer; in later layers the question input is the output of layer t-1.
Memory Network for Dialogue Understanding
123. Set up embedding C (in the code it is named B); this is also a trainable variable.
Memory Network for Dialogue Understanding
124. Multiply embedding C (in the code, B) by the softmax result.
Memory Network for Dialogue Understanding
125. At the last step, combine the question embedding with the output of the memory network once again.
Memory Network for Dialogue Understanding
127. Memory Network for Dialogue Understanding
Set up a fully connected layer and compute the error with softmax cross entropy.
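Putting slides 121 to 127 together, here is a minimal one-hop sketch (shapes assumed, variable names differ from the lecture code) of the attention step in an end-to-end memory network.

import tensorflow as tf

vocab, dim, mem, sent_len = 30, 20, 10, 6
stories = tf.placeholder(tf.int32, [None, mem, sent_len])   # context sentences as word indices
question = tf.placeholder(tf.int32, [None, sent_len])

A = tf.Variable(tf.random_normal([vocab, dim]))   # input memory embedding
C = tf.Variable(tf.random_normal([vocab, dim]))   # output memory embedding
B = tf.Variable(tf.random_normal([vocab, dim]))   # question embedding

m = tf.reduce_sum(tf.nn.embedding_lookup(A, stories), axis=2)    # batch x mem x dim
c = tf.reduce_sum(tf.nn.embedding_lookup(C, stories), axis=2)
u = tf.reduce_sum(tf.nn.embedding_lookup(B, question), axis=1)   # batch x dim

# attention over the memories, then a weighted sum of the output memories
p = tf.nn.softmax(tf.squeeze(tf.matmul(m, tf.expand_dims(u, -1)), -1))   # batch x mem
o = tf.squeeze(tf.matmul(tf.expand_dims(p, 1), c), 1)                    # batch x dim
logits = tf.layers.dense(o + u, vocab)   # final fully connected layer over the vocabulary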
128. Memory Network for Dialogue Understanding
In the given code I removed 90% of the data set because we are using a CPU for this class,
so the results may be poor.
129. Memory Network for Dialogue Understanding
bAbI test results (comparing DMN & MemNN)
https://research.fb.com/downloads/babi/
131. 1.NLP & Deep Learning
2.Language Analysis Process
3.Language Generation
3-1.Basic Seq2Seq
3-2.Other types of Seq2Seq (Attention, Pointer)
132. Response Generator - Seq2Seq Model
The Seq2Seq model can be applied to many cases where, as the name says, both the input and the
output are sequence data (machine translation, summarization, simple question answering, and so on),
and with a simple trick it can also be used to generate responses.
- Input : 딥 러닝 재미 즐거운 일
- Output : 딥 러닝은 재미있고 즐거운 일이다
https://arxiv.org/pdf/1406.1078.pdf
https://www.slideshare.net/KeonKim/attention-mechanisms-with-tensorflow
133. Attention Mechanism Pointer Network
https://medium.com/@devnag/pointer-networks-in-tensorflow-with-sample-code-14645063f264
Variant forms of Seq2Seq...
Response Generator - Seq2Seq Model
※ Details omitted; this will be covered in depth in the next lecture.
http://localhost:8888/tree/chap05_nlp/attention_seq2seq
134. Conclusion
In the end, natural language processing is a huge combination of traditional NLP algorithms,
deep learning algorithms, and all kinds of software architecture.
(Diagram: traditional NLP theory + deep learning theory + software architecture)
135. Conclusion
Let's connect everything discussed so far into one example.
(Pipeline diagram: web documents are collected by a web crawler and passed through lexical,
syntactic, semantic, and dialogue analysis, then filtered with an ontology and human review into
stored information; incoming user input (IN) goes through the same lexical, syntactic, semantic,
and dialogue analysis, and a web server performs response generation (OUT).)
137. Hyper Parameter Optimization
(Diagram: sets of graph flows sampled from a hyperparameter range, explored with hyperparameter random search and a genetic-algorithm approximation)
An explanation of the genetic algorithm for hyperparameter search
138. Hyper Parameter Optimization
An explanation of hyperparameter random search
http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
In this more challenging optimization problem random search is still effective, but not superior as
it was in the case of neural network optimization. Comparing to the 3-layer DBN results in
Larochelle et al. (2007), random search found a better model than the manual search in one data set
(convex), an equally good model in four (mnist basic, mnist rotated, rectangles, and rectangles
images), and an inferior model in three (mnist background images, mnist background random, mnist
rotated background images).
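A minimal sketch of hyperparameter random search (the parameter ranges and the train_and_eval function are placeholders to be supplied by the user): sample random configurations, train each one, and keep the best.

import random

def random_search(train_and_eval, n_trials=20):
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** random.uniform(-4, -1),
            "hidden_size": random.choice([64, 128, 256, 512]),
            "dropout": random.uniform(0.2, 0.8),
        }
        score = train_and_eval(params)   # user-supplied: trains a model, returns accuracy
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score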
140. Hyper Parameter Optimization
Genetic Algorithm on Hyper parameter optimization (Approximation)
https://blog.coast.ai/lets-evolve-a-neural-network-with-a-genetic-algorithm-code-included-8809bece164
Let’s say it takes five minutes to train and evaluate a network on your dataset. And let’s say we have four parameters with
five possible settings each. To try them all would take (5**4) * 5 minutes, or 3,125 minutes, or about 52 hours.
Now let’s say we use a genetic algorithm to evolve 10 generations with a population of 20 (more on what this means
below), with a plan to keep the top 25% plus a few more, so ~8 per generation. This means that in our first generation we
score 20 networks (20 * 5 = 100 minutes). Every generation after that only requires around 12 runs, since we don't have
to score the ones we keep. That's 100 + (9 generations * 5 minutes * 12 networks) = 640 minutes, or about 11 hours.
https://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol1/hmw/article1.html
(Diagram: use multi-GPU cluster servers for hyperparameter random search)
141. Hyper Parameter Optimization
Let's see how hyperparameter optimization with a genetic algorithm works.
http://localhost:8888/tree/chap05_nlp/automl
142. Goals for the next lecture
This lecture was meant to help you understand the data and the models needed to apply
deep learning from an NLP perspective.
In the next session we will bring these building blocks together and cover how to apply
and use them from an architecture perspective.
Thank you.