NLP Deep Learning with Tensorflow
1. Understanding NLP and applying it in practice with TensorFlow
WRITTEN BY SeungWooKim
tmddno1@gmail.com
2. Current: AI TFT leader, POSCO IT Division
Development leader of the AI project support framework, POSCO IT Division
Development leader of the POSCO AI chatbot pilot service
In-house Big Data & AI instructor, POSCO ICT
Computer Engineering major, Sungkyunkwan University
tmddno1@gmail.com
3. 1. Lecture Docker environment
https://github.com/TensorMSA/skp_edu_docker
2. Lecture source code
git clone https://github.com/TensorMSA/tensormsa_jupyter.git
4. Lecture Goals
"I want to offer pizza ordering as a service through a chatbot messenger...
What data should I collect, which neural networks should I use, and how should I
structure the architecture to reach that goal?"
Given a natural language processing problem like the one above, the goal is to gain
the insight to approach it from the perspective of data and deep learning.
[Next session]
A session on how to apply and adapt the building blocks learned here at the
application level, from an architecture perspective.
5. 1.NLP & Deep Learning
2.Language Analysis Process
2-1.Voice Recognition
2-2.Lexical Analysis
2-2-1.Lexical Analysis Basic Process
2-2-2.Deep Learning on Lexical Analysis
2-2-3.Prerequisite Knowledge
2-2-4.BiLstmCrf for Named Entity Recognition
2-3.Syntactic Analysis
2-3-1.Dependency Parsing
2-3-2.Google SyntaxNet with Docker
2-4.Semantic Analysis
2-4-1.Semantic Role Labeling
2-4-2.Char CNN for Sentence Classification
2-5.Discourse Analysis
2-5-1.RNN for understanding the global conversation
6. 3.Language Generation
3-1.Basic Seq2Seq
3-2.Other types of Seq2Seq (Attention, Pointer)
4.Tips
4-1.Hyper Parameter Random Search
4-2.Genetic Algorithm for Hyper Parameter Search
4-3.Auto Hyper Parameter Search with Multi GPU Server
8. NLP and Deep Learning
Today’s Focus
As in other areas such as image processing, deep learning shows strong performance,
but due to the nature of the field it cannot replace everything 100%.
Understanding the existing research in this field is important.
https://www.slideshare.net/ssuser06e0c5/ss-64417928
10. NLP Applications
Mostly Solved | Making Good Progress | Still Really Hard
Spam Detection
Text Categorization
Part of Speech Tagging
Named Entity Recognition
Information Extraction
Sentiment Analysis
Coreference Resolution
Word Sense Disambiguation
Syntactic Parsing
Machine Translation
Semantic Search
Question & Answer
Textual Inference
Summarization
Discourse & Dialog
11. NLP Applications
Text Categorization
Text classification assigns one or more classes to a document according to its content. Classes are
selected from a previously established taxonomy (a hierarchy of categories or classes).
Spam Detection
Spam Detection is also the part of Text Classification problem.
Part of Speech Tagging
POS tagging, also called grammatical tagging or word-category disambiguation, is the process of
marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both
its definition and its context.
13. NLP Applications
Information Extraction in a broader view
https://www.google.co.kr/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0ahUKEwievZKlmMzVAhVCgrwKHbM_D88QFggyMAE&url=https%3A%2F%2Fweb.stanford.edu%2Fclass%2Fcs124%2Flec%2FInformation_Extraction_and_Named_Entity_Recognition.pptx&usg=AFQjCNFUT9ZjvrDrxF9su0J9KiWobVP4Kg
(Diagram components: rule-based extraction, named entity recognition, syntax analysis, relation search, ontology, information extraction)
14. NLP Applications
Coreference Resolution
I did not vote for the Donald Trump because I think he is too reckless
Coreference resolution is the task of finding all expressions that refer to the same entity in a
text. It is an important step for a lot of higher level NLP tasks that involve natural language
understanding such as document summarization, question answering, and information
extraction.
Deep Reinforcement Learning for Mention-Ranking Coreference Models
Improving Coreference Resolution by Learning Entity-Level Distributed Representations
https://medium.com/huggingface/state-of-the-art-neural-coreference-resolution-for-chatbots-3302365dcf30
15. NLP Applications
Word Sense Disambiguation
[Example]
1. a type of fish
2. tones of low frequency
and the sentences:
1. I went fishing for some sea bass.
2. The bass line of the song is too weak.
http://www.cs.cornell.edu/courses/cs4740/2014sp/lectures/wsd-1.pdf
Supervised approach: labeled example data
Semi-supervised approach: ontology based
17. NLP Applications
Machine Translation
Machine translation (MT) is automated translation. It is the process by which computer software is
used to translate a text from one natural language (such as English) to another (such as Spanish).
18. NLP Applications
Semantic Search
Semantic search seeks to improve search accuracy by understanding a searcher’s intent through
contextual meaning.
Question and Answer
Able to answer questions in natural language based on knowledge data (usually an ontology)
ex) The best-known example is IBM Watson
Textual Inference
Recognize, generate, or extract pairs <T, H> of natural language
expressions, such that a human who reads (and trusts) T would infer that H is most likely also true.
Summarization
Extracting the interesting parts of a text and creating a summary from those parts, with some
rephrasing to make the summary more grammatically correct.
Discourse & Dialog
Carrying on a conversation while understanding the whole dialogue history and the speaker's intended meaning.
19. Level of NLP
○ Pragmatics : use of language
○ Semantics : meaning of words & sentences
○ (Surface) Syntax : phrases & sentences
○ Morphology : morphemes, words
○ Phonology : phonemes (abstract units of speech sound)
○ Phonetics : phones (acoustic units of speech sound)
(Diagram annotations, low to high: speech sounds and words; how words are formed; word order; word & sentence meaning; dialogue intent & context)
23. Language Analysis - Speech Recognition
(Diagram: AI speaker Alexa and its microphone system)
24. Language Analysis - Speech Recognition
Deep learning for classification, a Hidden Markov Model for the language model
25. 2.Language Analysis Process
2-1.Voice Recognition
2-2.Lexical Analysis
2-2-1.Lexical Analysis Basic Process
2-2-2.Deep Learning on Lexical Analysis
2-2-3.Prerequisite Knowledge
2-2-4.BiLstmCrf for Named Entity Recognition
26. Language Analysis - Lexical Analysis
Main steps in lexical analysis:
Sentence splitting
Tokenizing
Morphological analysis
Part-of-speech tagging
27. Lexical Analysis - Sentence Splitting & Tokenizing
What if there is no newline character ('\n')? Where is the end-of-sentence (EOS) point?
What if the sentence is not properly separated into words by spaces?
[Examples]
[Problems]
28. Language Analysis - Lexical Analysis - Morphological
Word | Stemming | Lemmatization
Love Lov Love
Loves Lov Love
Loved Lov Love
Loving Lov Love
Innovation Innovat Innovation
Innovations Innovat Innovation
Innovate Innovat Innovate
Innovates Innovat Innovate
Innovative Innovat Innovative
Morphing Examples Stemming & lemmatization
Morphology is the process of finding morphemes, the smallest "meaningful units" (carrying lexical
meaning or grammatical function), and other information-bearing features such as stems in a language.
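To make the stemming vs. lemmatization contrast concrete, here is a minimal sketch (my own example, not from the slides) using NLTK; it assumes the WordNet data has been downloaded with nltk.download('wordnet').

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["loves", "loved", "loving", "innovations", "innovative"]:
    # stemming chops suffixes by rule; lemmatization maps to a dictionary form
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word, pos="v"))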
29. Language Analysis - Lexical Analysis - Part of Speech Tagging
Ambiguity
“that” can be a subordinating conjunction or a relative pronoun
- The fact that/IN you’re here
- A man that/WDT I know
“Around” can be a preposition, particle, or adverb
- I bought it at the shop around/IN the corner.
- I never got around/RP to getting a car.
- A new Toyota Prius costs around/RB $25K.
Degree of ambiguity (in Brown corpus)
- 11.5% of word types (40% of word tokens) are ambiguous
# of Tags 1 2 3 4 5 6 7
# of Words 35340 3760 264 61 12 2 1
※ The ambiguity problem is much more serious in Korean.
Part-of-speech tagging is one of the most important text analysis tasks: it classifies words into
their parts of speech and labels them according to a tagset, the collection of tags used for POS
tagging. Parts of speech are also known as word classes or lexical categories.
30. Language Analysis - Lexical Analysis - Implementation
Hannanum Kkma Komoran Mecab Twitter
하늘 / N 하늘 / NNG 하늘 / NNG 하늘 / NNG 하늘 / Noun
을 / J 을 / JKO 을 / JKO 을 / JKO 을 / Josa
나 / N 날 / VV 나 / NP 나 / NP 나 / Noun
는 / J 는 / ETD 는 / JX 는 / JX 는 / Josa
자동차 / N 자동차 / NNG 자동차 / NNG 자동차 / NNG 자동차 / Noun
Analysis result comparison | Library performance comparison
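The comparison above can be reproduced with the KoNLPy package, which wraps several Korean morphological analyzers behind one interface. A minimal sketch (assuming KoNLPy and the underlying analyzers are installed):

from konlpy.tag import Kkma, Komoran

sentence = u"하늘을 나는 자동차"
print(Kkma().pos(sentence))      # e.g. [('하늘', 'NNG'), ('을', 'JKO'), ...]
print(Komoran().pos(sentence))   # tagsets differ slightly between analyzers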
34. 2.Language Analysis Process
2-1.Voice Recognition
2-2.Lexical Analysis
2-2-1.Lexical Analysis Basic Process
2-2-2.Deep Learning on Lexical Analysis
2-2-3.Prerequisite Knowledge
2-2-4.BiLstmCrf for Named Entity Recognition
35. Language Analysis - Lexical Analysis
[Deep Learning - Sequence Labeling - BiLSTM-CRF]
(1) Word Segmentation
(2) POS Tagging
(3) Chunking
(4) Clause Identification
(5) Named Entity Recognition
(6) Semantic Role Labeling
(7) Information Extraction
What's sequence labeling | What we can do with sequence labeling
36. Language Analysis - Lexical Analysis
[Deep Learning - Sequence Labeling - BiLSTM-CRF]
Word POS Chunk NE
West NNP B-NP B-MISC
Indian NNP I-NP I-MISC
all-around NN I-NP O
Phil NNP I-NP B-PER
Simons NNP I-NP I-PER
took VBD B-VP O
four CD B-NP O
for IN B-PP O
38 CD B-NP O
on IN B-PP O
Friday NNP B-NP O
IOB data set example
POS tag meanings:
https://docs.google.com/spreadsheet/ccc?key=0ApcJghR6UMXxdEdURGY2YzIwb3dSZ290RFpSaUkzZ0E&usp=sharing
Chunk tag meanings:
B : Begin of Chunk
I : Continuation of Chunk
E: End of Chunk
NP : Noun
VP : Verb
NER BIO tag meanings:
B : Start with new Chunk
I : word inside Chunk
O: Outside of Chunk
37. Language Analysis - Lexical Analysis - Sequence Labeling
[Deep Learning - BiLSTM-CRF]
BiLSTM-CRF Description
Before we talk about BiLSTM-CRF, which is a really important algorithm for sequence labelling,
let's briefly go over the background knowledge we need.
38. 2.Language Analysis Process
2-1.Voice Recognition
2-2.Lexical Analysis
2-2-1.Lexical Analysis Basic Process
2-2-2.Deep Learning on Lexical Analysis
2-2-3. Prerequisite Knowledge
2-2-4.BiLstmCrf for Named Entity Recognition
39. Language Analysis - Lexical Analysis - Check Prerequisite
[Those will be needed to understand what I am trying to explain]
Concept of perceptron
& Deep Neural Network
Concept of SoftMax
DNN & Matrix
Gradient Descent Back Propagation
Activation Functions
40. Language Analysis - Brief Explanation
import tensorflow as tf

learning_rate = 0.001  # assumed value; not shown on the original slide

# tf Graph input
x = tf.placeholder("float", [None, 784])
y = tf.placeholder("float", [None, 10])
# Store layers weight & bias
weights = {
    'h1': tf.Variable(tf.random_normal([784, 256])),
    'h2': tf.Variable(tf.random_normal([256, 256])),
    'out': tf.Variable(tf.random_normal([256, 10]))
}
biases = {
    'b1': tf.Variable(tf.random_normal([256])),
    'b2': tf.Variable(tf.random_normal([256])),
    'out': tf.Variable(tf.random_normal([10]))
}
# Hidden layer with RELU activation
layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
layer_1 = tf.nn.relu(layer_1)
# Hidden layer with RELU activation
layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
layer_2 = tf.nn.relu(layer_2)
# Output layer followed by softmax
pred = tf.matmul(layer_2, weights['out']) + biases['out']
hypothesis = tf.nn.softmax(pred)
# Define loss (cross entropy) and optimizer
cost = tf.reduce_mean(-tf.reduce_sum(y * tf.log(hypothesis), reduction_indices=1))
train_op = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
(Diagram: 784-unit input, two 256-unit hidden layers, and a 10-unit softmax output; each layer computes Y = Activation(W*x + b), and the error is measured with cross entropy)
41. Language Analysis - Lexical Analysis - Check Prerequisite
[Those will be needed to understand what I am trying to explain]
Dynamic RNN | Bidirectional LSTM
Word Embedding | Recurrent Neural Network | LSTM (Long Short Term Memory)
42. Language Analysis - Brief Explanation
START 오늘 날씨 는 ? PAD PAD END
START 오늘 날씨 는 어때 ? PAD END
START 오늘 비가 오 려 나 ? END
With long sentences, the vanishing-gradient problem occurs, and padding variable-length data to a
fixed length wastes computing power. This is where the concept of a dynamic RNN comes in.
A bidirectional LSTM additionally learns the given data in the backward direction.
(Diagram: Long Short Term Memory cell, showing the cell state with forget, update, and output gates)
https://brunch.co.kr/@chris-song/9
https://blog.altoros.com/the-magic-behind-google-translate-sequence-to-sequence-models-and-tensorflow.html
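As a concrete illustration of the two ideas above, here is a minimal TensorFlow 1.x sketch (shapes are assumed for illustration) of a dynamic bidirectional LSTM; passing sequence_length lets the RNN stop at each sentence's real length instead of running over the PAD tokens.

import tensorflow as tf

inputs = tf.placeholder(tf.float32, [None, 20, 128])   # batch x max_len x embedding_dim
seq_len = tf.placeholder(tf.int32, [None])             # true length of each sentence

cell_fw = tf.contrib.rnn.BasicLSTMCell(64)
cell_bw = tf.contrib.rnn.BasicLSTMCell(64)
(out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    cell_fw, cell_bw, inputs, sequence_length=seq_len, dtype=tf.float32)

# concatenate the forward and backward outputs: batch x max_len x 128
outputs = tf.concat([out_fw, out_bw], axis=-1)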
43. Language Analysis - Word embedding
What is word embedding?
A way of representing the units that make up text (phonemes, syllables, words, sentences,
documents) as numeric vectors.
Advantages: dimensionality reduction, representation of semantic similarity
Disadvantages: handling homonyms; weak training signal when there is little data
44. Language Analysis - Word embedding - OneHot Encoding
Concept of OneHot Encoding
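A minimal sketch of the idea (vocabulary chosen for illustration): each word gets a vector that is all zeros except for a single 1 at the word's index.

import numpy as np

vocab = ["김승우", "전화번호", "이메일", "검색"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab), dtype=np.float32)
    vec[word_to_idx[word]] = 1.0
    return vec

print(one_hot("이메일"))   # [0. 0. 1. 0.]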
45. Language Analysis - Word embedding - Word2Vec
https://www.tensorflow.org/tutorials/word2vec
http://w.elnn.kr/search/
Concept of Word2Vector
Word2Vector Demo Site
46. Language Analysis - Word embedding - Word2Vec
CBOW
Original text: the quick brown fox jumped over the lazy dog
Data set (window size 1): ([brown, jumped], fox)
(Diagram: one-hot context words of vocab size as input, a hidden layer of the embedding size, and the center word as output)
47. Language Analysis - Word embedding - Word2Vec
Skip-Gram
Original text: the quick brown fox jumped over the lazy dog
Data set (window size 1): (fox, brown), (fox, jumped)
(Diagram: the one-hot center word of vocab size as input, a hidden layer of the embedding size, and a context word as output)
48. Language Analysis - Word embedding - Doc2Vec
Document vector variants: (1) PV-DM, (2) PV-DBOW, (3) DM + DBOW (vector concat), (4) AVG(TF-IDF * W2V)
Original text: the quick brown fox jumped over the lazy dog
PV-DBOW data set: (paragraph, the), (paragraph, quick), (paragraph, brown), (paragraph, fox), (paragraph, jumped), ...
PV-DM data set: ([paragraph, quick, brown, fox, jumped], over), ([paragraph, quick, brown, fox, jumped, over], the)
(Diagram: for variant (4), each word's W2V vector is multiplied by its TF-IDF weight and the results are averaged into a document vector)
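A minimal gensim sketch (toy corpus assumed) of the PV-DM / PV-DBOW variants above; dm=1 selects PV-DM and dm=0 selects PV-DBOW (older gensim versions use size= instead of vector_size= and model.docvecs instead of model.dv).

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["the", "quick", "brown", "fox"], tags=["doc0"]),
        TaggedDocument(words=["jumped", "over", "the", "lazy", "dog"], tags=["doc1"])]

model = Doc2Vec(docs, vector_size=50, window=2, min_count=1, dm=1, epochs=20)
print(model.dv["doc0"][:5])   # the learned paragraph vector for doc0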
49. tfidf(t,d,D) = tf(t,d) x idf(t,D)
Language Analysis - Word embedding - TF-IDF
https://thinkwarelab.wordpress.com/2016/11/14/ir-tf-idf-%EC%97%90-%EB%8C%80%ED%95%B4-%EC%95%8C%EC%95%84%EB%B4%85%EC%8B%9C%EB%8B%A4/
http://www.popit.kr/bm25-elasticsearch-5-0%EC%97%90%EC%84%9C-%EA%B2%80%EC%83%89%ED%95%98%EB%8A%94-%EC%83%88%EB%A1%9C%EC%9A%B4-%EB%B0%A9%EB%B2%95/
Not exactly a word embedding, but used quite often in NLP with deep learning:
- Document similarity
- Word importance within a document
- Search engines (such as Elasticsearch, though it now uses BM25)
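A minimal scikit-learn sketch (toy documents assumed) of the formula above: tf counts how often a term appears in one document, while idf downweights terms that appear in many documents.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the quick brown fox",
        "the lazy dog",
        "the quick dog jumped over the lazy dog"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)        # shape: (3 documents, vocabulary size)
print(vectorizer.get_feature_names_out())     # get_feature_names() on older scikit-learn
print(tfidf.toarray().round(2))               # rows: documents, columns: term weights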
51. Language Analysis - Word embedding - Word+Char
the quick brown fox jumped over the lazy dog
(Diagram: each character f, o, x gets a one-hot encoding, the whole word fox gets a Word2Vec vector, and the two representations are concatenated)
1. Word2Vec-style embeddings represent semantic relatedness well.
2. One-hot encodings give a strong, distinct signal, which is effective for training.
3. Word-level embeddings memorize words well.
4. Character-level embeddings handle untrained (out-of-vocabulary) words well.
52. Language Analysis - Word embedding - NGram
In the case of Word2Vec, the model can only represent words it was trained on;
words that do not exactly match the pretrained dictionary come back as "UNKNOWN".
So FastText (by Facebook) uses character n-grams in its word embedding algorithm.
Comparing 에어컨 with 에어조단:
에어컨
['$$에', '$에어', '에어컨', '어컨$', '컨$$'] => 5
에어조단
['$$에', '$에어', '에어조', '어조단', '조단$', '단$$'] => 6
Matching n-grams
['$$에', '$에어'] => 2
Score
2 matches out of 9 distinct n-grams overall => 0.2222
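The score above can be reproduced with a few lines of Python; a minimal sketch ($ marks the padding character, and the score is the number of matched n-grams divided by all distinct n-grams of the two words):

def char_ngrams(word, n=3, pad="$"):
    padded = pad * (n - 1) + word + pad * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

a, b = char_ngrams(u"에어컨"), char_ngrams(u"에어조단")
score = len(a & b) / float(len(a | b))     # 2 matches out of 9 distinct n-grams ≈ 0.2222
print(sorted(a & b), round(score, 4))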
54. Language Analysis - Word embedding - Implementation
OneHot Encoding: simple test code showing the concept of one-hot encoding
http://ip:8888/tree/tensormsa_jupyter/chap05_nlp/wordembedding/
[Code]
55. Language Analysis - Word embedding - Implementation
Word2Vector : Using Gensim word2vec package
http://ip:8888/tree/tensormsa_jupyter/chap05_nlp/wordembedding/
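A minimal sketch (toy corpus assumed, not the notebook's exact code) of training word vectors with the gensim word2vec package mentioned above; sg=0 trains CBOW and sg=1 trains skip-gram (older gensim versions use size= instead of vector_size=).

from gensim.models import Word2Vec

sentences = [["the", "quick", "brown", "fox"],
             ["the", "lazy", "dog"]]

model = Word2Vec(sentences, vector_size=50, window=1, min_count=1, sg=1)
print(model.wv["fox"][:5])                   # the learned word vector
print(model.wv.most_similar("fox", topn=2))  # nearest neighbours in the vector space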
56. Language Analysis - Word embedding - Implementation
FastText: Facebook fastText with the gensim wrapper
http://ip:8888/tree/tensormsa_jupyter/chap05_nlp/wordembedding/
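A minimal sketch (toy corpus assumed, not the notebook's exact code) of gensim's FastText: because vectors are built from character n-grams, even out-of-vocabulary words get a usable vector (older gensim versions use size= instead of vector_size=).

from gensim.models import FastText

sentences = [["에어컨", "구매"], ["에어조단", "구매"]]
model = FastText(sentences, vector_size=50, min_count=1, min_n=2, max_n=3)
print(model.wv.similarity("에어컨", "에어조단"))   # related through shared character n-grams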
57. Language Analysis - Word embedding - Implementation
FastText: it is possible to use pretrained vectors and fine-tune them
http://ip:8888/tree/tensormsa_jupyter/chap05_nlp/wordembedding/
https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
58. Language Analysis - Word embedding - Implementation
N-grams are simply all combinations of adjacent words or letters of length n that you can
find in your source text.
59. Language Analysis - Word embedding - Implementation
For word2vec training on a large dataset, GPU acceleration is needed.
You can also consider using TensorFlow or Keras to train the model.
https://github.com/SimonPavlik/word2vec-keras-in-gensim/blob/keras106/word2veckeras/word2veckeras.py
https://github.com/tensorflow/models/blob/master/tutorials/embedding/word2vec.py
60. 2.Language Analysis Process
2-1.Voice Recognition
2-2.Lexical Analysis
2-2-1.Lexical Analysis Basic Process
2-2-2.Deep Learning on Lexical Analysis
2-2-3. Other prerequisite Knowledge
2-2-4.BiLstmCrf for Named Entity Recognition
62. Language Analysis - Lexical Analysis - Sequence Labeling
[Deep Learning - BiLSTM-CRF]
김승우 B-PERSON
전화번호 B-TARGET
검색 O
김승우 B-PERSON
이메일 B-TARGET
검색 O
김승우 B-PERSON
이미지 B-TARGET
검색 O
IOB Data
김승우 전화번호 검색
김승우 이메일 검색
김승우 이미지 검색
Plain Data
Sentence
Splitting
Token Morphing
Part of
Speech
Tagging
Lexical Analysis
Word2Vector
OneHot Encoding
1 0 0 0
0 1 0 0
0 0 1 0
김승우
전화번호
이메일
검색
B-PERSON
B-TARGET
김
우
승
Index
List
63. Language Analysis - Lexical Analysis - Sequence Labeling
[Deep Learning - BiLSTM-CRF]
김승우
전화번호
이메일
검색
B-PERSON
B-TARGET
김
우
승
Index
List
[Code]
66. Language Analysis - Lexical Analysis - Sequence Labeling
[Deep Learning - BiLSTM-CRF]
Conditional Random Field Soft Max
[Code]
67. Language Analysis - Lexical Analysis - Sequence Labeling
[Deep Learning - BiLSTM-CRF]
http://people.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf
Probabilistic Model for sequence data segmentation and labeling
https://www.slideshare.net/kanimozhiu/tdm-probabilistic-models-part-2
The first method makes local choices. In other words, even if we capture some information from the
context in the hidden states thanks to the bi-LSTM, the tagging decision is still local; we don't
make use of the neighboring tagging decisions. For instance, in "New York", the fact that we are
tagging "York" as a location should help us decide that "New" corresponds to the beginning of a
location. Given a sequence of words w_1, ..., w_m, a sequence of score vectors s_1, ..., s_m, and a
sequence of tags y_1, ..., y_m, a linear-chain CRF defines a global score s ∈ R.
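One common way to write that global score (notation assumed: s_t are the per-word score vectors produced by the BiLSTM and T is a learned tag-transition matrix) is, in LaTeX:

C(y_1,\dots,y_m) = \sum_{t=1}^{m} s_t[y_t] + \sum_{t=1}^{m-1} T[y_t, y_{t+1}],
\qquad
P(y_1,\dots,y_m \mid w_1,\dots,w_m) = \frac{\exp C(y_1,\dots,y_m)}{\sum_{y'} \exp C(y'_1,\dots,y'_m)}

Training maximizes the log-probability of the gold tag sequence, and prediction finds the best-scoring sequence with the Viterbi algorithm, so neighboring tagging decisions influence each other.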
68. Language Analysis - Lexical Analysis - Sequence Labeling
[Deep Learning - BiLSTM-CRF]
Gradient
Descent
Momentum
NAG
Adagrad
Adadelta
Rmsprop
Adam
[Code]
69. Language Analysis - Lexical Analysis - Sequence Labeling
[Deep Learning - BiLSTM-CRF]
https://arxiv.org/pdf/1705.08292.pdf
"Solutions found with gradient descent (GD) or stochastic gradient descent (SGD) generalize much
better than solutions found with adaptive methods (e.g. AdaGrad, RMSprop, and Adam)."
The Marginal Value of Adaptive Gradient Methods in Machine Learning. Ashia C. Wilson, Rebecca Roelofs,
Mitchell Stern, Nathan Srebro, and Benjamin Recht. University of California, Berkeley, and Toyota
Technological Institute at Chicago, May 24, 2017.
There is no optimizer that is best for all cases!
When should you use an adaptive optimizer?
If the input embedding vectors are sparse, an adaptive optimizer tends to work better.
70. Language Analysis - Lexical Analysis - Sequence Labeling
[Deep Learning - BiLSTM-CRF]
Real-project BiLSTM results | Sample-code prediction test results
Test data not included in the training set is still predicted well.
http://ip:8888/tree/tensormsa_jupyter/chap05_nlp/sequence_tagging/
71. 2.Language Analysis Process
2-1.Voice Recognition
2-2.Lexical Analysis
2-3.Syntactic Analysis
2-3-1.Dependency Parsing
2-3-2.Google SyntaxNet with Docker
72. Language Analysis - Syntactic Analysis
Syntactic parsing (구문 분석) determines the structure of a sentence by decomposing it into its
constituent parts and analyzing the hierarchical relations between them.
Graph-Based Models | Transition-Based Models
CYK-style parsing, MST-finding algorithms, projective & non-projective models
73. Language Analysis - Syntactic Analysis
Transition-Based Models
Sentence W
Repeat until all words have their head:
- Select two target words in the data structure
(one dependent and one head candidate)
- Deterministically predict the next parsing action with the parsing model
- Modify the structure according to the parsing action
C0 -> C1 -> C2 -> ……..C8 -> C9 -> C10 -> .… -> Cm D-tree
t1 t2 t3 t8 t9 t10 tm
Oracle
(Classifier)
Predict the best
transition
74. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
75. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
Assume that we are given an oracle :
- for any non-terminal configuration, it can predict the correct transition
(for deterministic parsing)
- That is, it takes two words and magically gives us the dependency
relation between them, if one exists
76. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
Shift :
Move Economic from buffer B to stack S
77. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
Left-arc :
Add left-arc (news, Economic, amod) to arc set A
Remove Economic from stack (since it now has head in A)
78. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
Shift :
Move news from buffer B to stack S
79. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
Left-arc :
Add left-arc (had, news, nsubj) to A
Remove news from stack (since it now has head in A)
80. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
Right-arc :
Add right-arc (ROOT, had, root) to A
keep had in stack : because it can have other dependents on the right
81. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
Left-arc :
Add left-arc (effect, little, amod) to A
Remove little from stack (since it now has head in A)
82. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
Right-arc :
Add right-arc (had, effect, dobj) to A
Keep effect in stack : because it can have other dependents on right
83. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
Right-arc :
Add right-arc (effect, on, prep) to A
Keep on in stack : because it can have other dependents on the right
84. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
Shift :
Move financial from buffer B to stack S
85. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
Left-arc :
Add left-arc (market, financial, amod) to A
Remove financial from stack (since it now has head in A)
86. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
Right-arc :
Add right-arc (on, markets, pmod) to A
Keep markets in stack : because it can have other dependents on the right
87. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
Reduce :
Remove markets, on, effect from stack (since they already have head in A)
※ All decisions like right-arc, left-arc, reduce, shift will be made by oracle
88. Language Analysis - Syntactic Analysis
Transition-Based Models - Arc Eager Transition System
Right-arc :
Add right-arc (had, period, p) to A
Keep period in stack
Done !
89. 2.Language Analysis Process
2-1.Voice Recognition
2-2.Lexical Analysis
2-3.Syntactic Analysis
2-3-1.Dependency Parsing
2-3-2.Google SyntaxNet with Docker
90. Language Analysis - Syntactic Analysis - Syntax Net
We show this layout in the schematic below: the state of the system (a stack and a buffer, visualized
below for both the POS and the dependency parsing task) is used to extract sparse features, which
are fed into the network in groups. We show only a small subset of the features to simplify the
presentation in the schematic
Google SyntaxNet with Deep Learning - Pos Tagging
http://cs.stanford.edu/people/danqi/papers/emnlp2014.pdf
91. Language Analysis - Syntactic Analysis - Syntax Net
Google SyntaxNet with Deep Learning - A Fast and Accurate Dependency Parser using Neural Networks
https://arxiv.org/pdf/1603.06042.pdf
1 2 3
1 I _ PRP PRP _ 2 nsubj _ _
2 knew _ VBD VBD _ 0 ROOT _ _
3 I _ PRP PRP _ 5 nsubj _ _
4 could _ MD MD _ 5 aux _ _
5 do _ VB VB _ 2 ccomp _ _
6 it _ PRP PRP _ 5 dobj _ _
7 properly _ RB RB _ 5 advmod _ _
8 if _ IN IN _ 9 mark _ _
9 given _ VBN VBN _ 5 advcl _ _
10 the _ DT DT _ 12 det _ _
11 right _ JJ JJ _ 12 amod _ _
12 kind _ NN NN _ 9 dobj _ _
13 of _ IN IN _ 12 prep _ _
14 support _ NN NN _ 13 pobj _ _
15 . _ . . _ 2 punct _ _
18 units
(1),(2),(3)
18 units
(1),(2),(3)
12 units
(2),(3)
(1) The top 3 words on the stack and buffer: s1, s2, s3, b1, b2, b3; => 6
(2) The first and second leftmost / rightmost children of the top two words
on the stack: lc1(si), rc1(si), lc2(si), rc2(si), i = 1, 2. => 8
(3) The leftmost of leftmost / rightmost of rightmost children of the top two
words on the stack: lc1(lc1(si)), rc1(rc1(si)), i = 1, 2. => 4
92. Language Analysis - Syntactic Analysis - Syntax Net
Google SyntaxNet with Deep Learning - Local Parser
1. SHIFT: Push another word onto the top of the stack, i.e. shifting one token from the buffer to
the stack.
2. LEFT_ARC: Pop the top two words from the stack. Attach the second to the first, creating an
arc pointing to the left. Push the first word back on the stack.
3. RIGHT_ARC: Pop the top two words from the stack. Attach the second to the first, creating an
arc pointing to the right. Push the second word back on the stack.
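A toy Python sketch (not SyntaxNet code) of how such a transition system manipulates its state; arcs collects (head, dependent) pairs, and in a real parser the action at each step comes from the neural network described above.

def step(stack, buffer, arcs, action):
    if action == "SHIFT":            # move the next buffer word onto the stack
        stack.append(buffer.pop(0))
    elif action == "LEFT_ARC":       # top of stack becomes head of the word below it
        top, below = stack.pop(), stack.pop()
        arcs.append((top, below))
        stack.append(top)
    elif action == "RIGHT_ARC":      # word below becomes head of the top of stack
        top, below = stack.pop(), stack.pop()
        arcs.append((below, top))
        stack.append(below)
    return stack, buffer, arcs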
93. Language Analysis - Syntactic Analysis - Syntax Net
As we describe in the paper, there are several problems with the locally normalized models we just
trained. The most important is the label-bias problem: the model doesn't learn what a good parse
looks like, only what action to take given a history of gold decisions. This is because the scores are
normalized locally using a softmax for each decision.
Google SyntaxNet with Deep Learning - Global Training
94. Language Analysis - Syntactic Analysis - Syntax Net
What’s Beam Search Algorithm on RNN ?
https://www.youtube.com/watch?v=UXW6Cs82UKo
Instead of trying only the single best choice at every step, explore multiple candidates to the end
and choose the sequence whose total score is maximum. Computing every possible sequence would be far
too expensive, so only the best few hypotheses are kept at each step and the rest are removed
(pruning). The goal is a globally better prediction rather than a locally greedy one.
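A toy sketch of that idea (the per-step scores and the beam width are assumptions for illustration): at every step keep only the top-k partial sequences by accumulated score instead of the single best one.

import heapq

def beam_search(step_scores, beam_width=3):
    # step_scores: one dict {token: log_prob} per output position
    beams = [([], 0.0)]              # (sequence so far, total log probability)
    for scores in step_scores:
        candidates = [(seq + [tok], total + lp)
                      for seq, total in beams
                      for tok, lp in scores.items()]
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[1])
    return beams[0]                  # best-scoring complete sequence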
95. Language Analysis - Syntactic Analysis - Syntax Net
http://universaldependencies.org/
Google SyntaxNet does not support Korean as a default language.
But as we can see below, we can train the model with Sejong corpus data,
though we have to convert the format into one SyntaxNet understands.
Google SyntaxNet with Deep Learning - How about Korean
96. Language Analysis - Syntactic Analysis - Syntax Net
Demo Site (we also use samples on this site)
http://sejongpsg.ddns.net/syntaxnet/psg_tree.htm
SyntaxNet Korean with Docker (we pretrained on a Korean corpus and set up a web server for the service)
https://github.com/TensorMSA/tensormsa_syntax_docker
Google SyntaxNet with Deep Learning - Test it by yourself
97. 2.Language Analysis Process
2-1.Voice Recognition
2-2.Lexical Analysis
2-3.Syntactic Analysis
2-4.Semantic Analysis
2-4-1.Semantic Role Labeling
2-4-2.Char CNN for Sentence Classification
2-5.Discourse Analysis
98. Language Analysis - Semantic Analysis
What is semantics in the study of language?
Three perspectives on meaning
- Lexical semantics : individual words
- Sentential semantics : individual sentences
- Discourse or pragmatics : longer pieces of text or conversation
NLP tasks for sentential semantics
- Semantic role labeling (SRL)
- Phrase similarity (= paraphrase)
- Sentence classification, sentence emotion analysis, etc.
99. Language Analysis - Semantic Analysis - SRL
What is Semantic Role Labeling (SRL)
SRL = Semantic roles express the abstract role that arguments of a predicate
can take in the event.
The police arrested the suspect in the park last night
Agent predicate Theme Location Time
Who did what to whom where when
Can we figure out that these sentences have the same meaning?
Can we figure out that bought, sold, and purchase are used in sentences with the same meaning?
XYZ corporation bought the stock.
They sold the stock to XYZ corporation.
The stock was bought by XYZ corporation.
The purchase of the stock by XYZ corporation.
100. Language Analysis - Semantic Analysis - SRL
Common Semantic Role Labeling Architecture
http://naacl2013.naacl.org/Documents/semantic-role-labeling-part-1-naacl-2013-tutorial.pdf
(Pipeline diagram: syntactic parse -> prune constituents -> candidates -> argument identification -> arguments -> argument classification -> semantic roles -> structural inference)
Step-1 Candidate Selection
- Parse the sentence
- Prune/filter the parse tree
(eliminate some tree constituents to speed up the execution)
Step-2 Argument Identification
- A binary classification of each node as Argument or NONE
- Local scoring
Step-3 Argument Classification
- A multi class (one-of-N) classification of all the argument candidates
- Global /joint scoring
(A machine learning model is used at each of the three steps.)
101. Language Analysis - Semantic Analysis - SRL
Exceptions to the Standard Architecture
1. Specialized parsing for SRL
- Syntactic parser trained to predict argument
candidates
- Semantic parsing = parsing + SRL
- SRL based on dependency parsing
2. Sequential labeling (instead of tree traversing)
- Motivated by Lack of full parse trees
102. Language Analysis - Semantic Analysis - SRL
Semantic Role Labeling Applications
Information: Anna is a friend of mine.
http://localhost:8888/notebooks/tensormsa_jupyter/chap05_nlp/neo4j/neo4j_basic.ipynb
(Graph: Name -[Relation]-> Name)
Neo4j insert query (run with an open neo4j-driver session):
session.run("MATCH (you:Person {name:'You'}) "
            "FOREACH (name in ['Anna'] |"
            " CREATE (you)-[:FRIEND]->(:Person {name:name}))")
result = session.run("MATCH (you {name:'You'})-[:FRIEND]->(yourFriends)"
                     " RETURN you, yourFriends")
Neo4j Jupyter example & visualization
103. 2.Language Analysis Process
2-1.Voice Recognition
2-2.Lexical Analysis
2-3.Syntactic Analysis
2-4.Semantic Analysis
2-4-1.Semantic Role Labeling
2-4-2.Char CNN for Sentence Classification
2-5.Discourse Analysis
104. Language Analysis - Semantic Analysis - Text Classification
Can we figure out whether these sentences are positive or negative?
돈이 아깝지 않다 (positive)
다시는 오지 않을 거야 (negative)
음식이 정말 맛이 없다 (negative)
이 식당은 정말 맛있다 (positive)
Analyzing positive vs. negative with a dictionary:
the word "않다" is usually negative, but?
돈이 아깝지 않다 => Positive
다시는 오지 않을 거야 => Negative
105. There are many ways of doing text classification:
Traditional rule-based approaches
Machine learning - logistic regression & SVM
Deep learning - CharCNN, RNN, etc.
Language Analysis - Semantic Analysis - Text Classification
106. Language Analysis - Semantic Analysis - Char CNN
http://localhost:8888/notebooks/tensormsa_jupyter/chap05_nlp/charcnn/charcnn.ipynb
The deep learning method CharCNN can be a solution for this kind of problem.
1 2 3
107. Language Analysis - Semantic Analysis - Char CNN
http://localhost:8888/notebooks/tensormsa_jupyter/chap05_nlp/charcnn/charcnn.ipynb
Preparing the data for embedding is pretty similar to other neural networks.
1. Word embedding & one-hot didn't show that much difference.
2. Personally, I prefer to concatenate char one-hot + word2vec.
(Example of a padded sentence: 오늘 / 메뉴 / 는 / 뭐 / 지? / PAD / PAD)
1. Need to define a maximum sentence length
2. Need padding, like other NLP neural networks
108. Language Analysis - Semantic Analysis - Char CNN
http://localhost:8888/notebooks/tensormsa_jupyter/chap05_nlp/charcnn/charcnn.ipynb
Using multiple convolution filter sizes
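A minimal TensorFlow 1.x sketch (shapes assumed, not the notebook's exact code) of convolving the same sentence matrix with several filter sizes and concatenating the max-pooled features.

import tensorflow as tf

embedded = tf.placeholder(tf.float32, [None, 20, 128, 1])   # batch x max_len x embed_dim x 1
pooled = []
for filter_size in [2, 3, 4]:
    conv = tf.layers.conv2d(embedded, filters=32,
                            kernel_size=[filter_size, 128], activation=tf.nn.relu)
    # max over the remaining time dimension -> one 32-dim feature vector per filter size
    pooled.append(tf.reduce_max(conv, axis=[1, 2]))
features = tf.concat(pooled, axis=1)   # batch x (3 * 32), fed to the fully connected layers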
109. Language Analysis - Semantic Analysis - Char CNN
http://localhost:8888/notebooks/tensormsa_jupyter/chap05_nlp/charcnn/charcnn.ipynb
The other steps are the same (fully connected > softmax > loss > optimizer)
110. Language Analysis - Semantic Analysis - Char CNN
http://localhost:8888/notebooks/tensormsa_jupyter/chap05_nlp/charcnn/charcnn.ipynb
You can see that the Char CNN can distinguish the two sentences.
111. 2.Language Analysis Process
2-1.Voice Recognition
2-2.Lexical Analysis
2-3.Syntactic Analysis
2-4.Semantic Analysis
2-5.Discourse Analysis
2-5-1.RNN for understanding the global conversation
2-5-2.Memory Network for global context
112. Language Analysis - Dialogue Understand
https://research.fb.com/publications
Getting to a natural language dialogue state with a chatbot remains
a challenge and will require a number of research breakthroughs. At
FAIR we have chosen to tackle the problem from both ends:
general AI and reasoning by machines through communication as
well as conducting research grounded in current dialog systems,
using lessons learned from exposing actual chatbots to people.
The attempt to understand and interpret dialogue is not a new one.
As far back as 20 years ago, there were several efforts to build a machine
a person could talk to and teach how to have a conversation. These
incorporated technology and engineering, but were single purposed
with a very narrow focus, using pre-programmed scripted responses.
Thanks to progress in machine learning, particularly in the last few
years, having AI agents being able to converse with people in natural
language has become a more realistic endeavor that is garnering
attention from both the research community and industry.
However, most of today’s dialogue systems continue to be scripted:
their natural language understanding module may be based on
machine learning, but what they execute or answer is in general
dictated by if/then statements or rules engines. While they are an
improvement on what existed decades ago, this is in large part due to
the large databases of content used to create and script their
responses.
Amazing free papers! Read them right now!
113. Discourse Analysis with RNN
In a conversation the topic changes often, so keeping track of the topic of the conversation is important.
안녕
안녕
넌 뭐할줄 아니?
기능은 XX 가 있어요
사람 좀 찾아볼까해
누구를 찾아드려요?
포항 제강부 IT담당 홍길동 팀장의
그룹장을 좀 찾아줘 (지역:포항), 부서(제강부),업무 (IT), 이름
(홍길동), 직급(팀장), 상위자(그룹장) 을
검색합니다.
내일 점심 먹자고 문자 보내줘
“내일 점식 먹자고” 로 전송합니다.
아냐. 수고했어. 나가서 먹지
대화를 초기화 합니다.
State: initial state
State: help state
State: person search state
State: send a text message to the person found
State: initial state
114. Dialogue State Tracking Challenge and Accepted papers
Discourse Analysis with RNN
http://www.phontron.com/paper/yoshino16iwsds.pdf
http://www.colips.org/workshop/dstc4/papers.html
* Dialogue State Tracking using Long Short Term Memory Neural Networks
Koichiro Yoshino, Takuya Hiraoka, Graham Neubig and Satoshi Nakamura
115. Let's predict the intent of each sentence in the conversation.
The basic idea is to keep the RNN state and continue prediction from that point.
(Diagram: dialogue state tracking with an LSTM; along the timeline, each utterance is encoded with Doc2Vec, fed to the LSTM, and an intent is predicted at every step)
116. The key point of this code is using the RNN state vector as memory.
Discourse Analysis with RNN
http://localhost:8888/tree/chap05_nlp/state_tracking
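A minimal TensorFlow 1.x sketch of carrying the recurrent state across dialogue turns (a GRU cell is used here for brevity since its state is a single tensor; shapes are assumed): the final state of one utterance is fed back as the initial state of the next one.

import numpy as np
import tensorflow as tf

utterance = tf.placeholder(tf.float32, [1, 1, 100])   # one turn: batch=1, time=1, Doc2Vec dim
state_in = tf.placeholder(tf.float32, [1, 64])        # state carried over from the last turn
cell = tf.contrib.rnn.GRUCell(64)
outputs, state_out = tf.nn.dynamic_rnn(cell, utterance, initial_state=state_in)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    state = np.zeros((1, 64), dtype=np.float32)
    for turn in np.random.rand(3, 1, 1, 100).astype(np.float32):
        # an intent classifier would read `outputs`; here we only propagate the state
        state = sess.run(state_out, feed_dict={utterance: turn, state_in: state})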
117. 2.Language Analysis Process
2-1.Voice Recognition
2-2.Lexical Analysis
2-3.Syntactic Analysis
2-4.Semantic Analysis
2-5.Discourse Analysis
2-5-1.RNN for understanding the global conversation
2-5-2.Memory Network for global context
118. The goal of dialogue understanding and memory networks
Memory Network for Dialogue Understanding
https://arxiv.org/pdf/1503.08895v4.pdf
119. Here is the network architecture of the end-to-end memory network.
Memory Network for Dialogue Understanding
https://yerevann.github.io/2016/02/05/implementing-dynamic-memory-networks/
https://www.slideshare.net/mobile/carpedm20/ss-63116251
120. (1) Feed the data ("Sentences", "Question", "Target")
Memory Network for Dialogue Understanding
1
2
3
121. Convert word indices to embedding vectors (training the embedding matrices A, B, C)
Memory Network for Dialogue Understanding
(Diagram: embedding matrices of shape vocab size x embedding dim; the memory holds one slot per sentence)
122. The memories built with embedding A from the given context sentences are multiplied with the
question embedding (embedding B, which is not defined in this code). ※ This holds for the first
layer; in later layers the question input is the output of layer t-1.
Memory Network for Dialogue Understanding
123. Set up embedding C (in the code it is named B); this is also a trainable variable.
Memory Network for Dialogue Understanding
124. Multiply embedding C (in the code, B) by the softmax result.
Memory Network for Dialogue Understanding
125. At the last step, combine the question embedding with the output of the memory network once again.
Memory Network for Dialogue Understanding
127. Memory Network for Dialogue Understanding
Set up a fully connected layer and compute the error with softmax cross entropy.
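Putting slides 121 to 127 together, here is a minimal one-hop sketch (shapes assumed, variable names differ from the lecture code) of the attention step in an end-to-end memory network.

import tensorflow as tf

vocab, dim, mem, sent_len = 30, 20, 10, 6
stories = tf.placeholder(tf.int32, [None, mem, sent_len])   # context sentences as word indices
question = tf.placeholder(tf.int32, [None, sent_len])

A = tf.Variable(tf.random_normal([vocab, dim]))   # input memory embedding
C = tf.Variable(tf.random_normal([vocab, dim]))   # output memory embedding
B = tf.Variable(tf.random_normal([vocab, dim]))   # question embedding

m = tf.reduce_sum(tf.nn.embedding_lookup(A, stories), axis=2)    # batch x mem x dim
c = tf.reduce_sum(tf.nn.embedding_lookup(C, stories), axis=2)
u = tf.reduce_sum(tf.nn.embedding_lookup(B, question), axis=1)   # batch x dim

# attention over the memories, then a weighted sum of the output memories
p = tf.nn.softmax(tf.squeeze(tf.matmul(m, tf.expand_dims(u, -1)), -1))   # batch x mem
o = tf.squeeze(tf.matmul(tf.expand_dims(p, 1), c), 1)                    # batch x dim
logits = tf.layers.dense(o + u, vocab)   # final fully connected layer over the vocabulary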
128. Memory Network for Dialogue Understanding
In the given code I removed 90% of the data set because we are using a CPU for this class,
so the results may be poor.
129. Memory Network for Dialogue Understanding
bAbI test results (comparing DMN & MemNN)
https://research.fb.com/downloads/babi/
131. 1.NLP & Deep Learning
2.Language Analysis Process
3.Language Generation
3-1.Basic Seq2Seq
3-2.Other types of Seq2Seq (Attention, Pointer)
132. Response Generator - Seq2Seq Model
The Seq2Seq model can be applied to many cases where, as the name says, both the input and the
output are sequence data (machine translation, summarization, simple question answering, and so on),
and with a simple trick it can also be used to generate responses.
- Input : 딥 러닝 재미 즐거운 일
- Output : 딥 러닝은 재미있고 즐거운 일이다
https://arxiv.org/pdf/1406.1078.pdf
https://www.slideshare.net/KeonKim/attention-mechanisms-with-tensorflow
133. Attention Mechanism Pointer Network
https://medium.com/@devnag/pointer-networks-in-tensorflow-with-sample-code-14645063f264
Variant forms of Seq2Seq...
Response Generator - Seq2Seq Model
※ Details omitted; this will be covered in depth in the next lecture.
http://localhost:8888/tree/chap05_nlp/attention_seq2seq
134. Conclusion
In the end, natural language processing is a huge combination of traditional NLP algorithms,
deep learning algorithms, and all kinds of software architecture.
(Diagram: traditional NLP theory + deep learning theory + software architecture)
135. Conclusion
Let's connect everything discussed so far into one example.
(Pipeline diagram: web documents are collected by a web crawler and passed through lexical,
syntactic, semantic, and dialogue analysis, then filtered with an ontology and human review into
stored information; incoming user input (IN) goes through the same lexical, syntactic, semantic,
and dialogue analysis, and a web server performs response generation (OUT).)
137. Hyper Parameter Optimization
(Diagram: sets of graph flows sampled from a hyperparameter range, explored with hyperparameter random search and a genetic-algorithm approximation)
An explanation of the genetic algorithm for hyperparameter search
138. Hyper Parameter Optimization
An explanation of hyperparameter random search
http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
In this more challenging optimization problem random search is still effective, but not superior as
it was in the case of neural network optimization. Comparing to the 3-layer DBN results in
Larochelle et al. (2007), random search found a better model than the manual search in one data set
(convex), an equally good model in four (mnist basic, mnist rotated, rectangles, and rectangles
images), and an inferior model in three (mnist background images, mnist background random, mnist
rotated background images).
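A minimal sketch of hyperparameter random search (the parameter ranges and the train_and_eval function are placeholders to be supplied by the user): sample random configurations, train each one, and keep the best.

import random

def random_search(train_and_eval, n_trials=20):
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** random.uniform(-4, -1),
            "hidden_size": random.choice([64, 128, 256, 512]),
            "dropout": random.uniform(0.2, 0.8),
        }
        score = train_and_eval(params)   # user-supplied: trains a model, returns accuracy
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score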
140. Hyper Parameter Optimization
Genetic Algorithm on Hyper parameter optimization (Approximation)
https://blog.coast.ai/lets-evolve-a-neural-network-with-a-genetic-algorithm-code-included-8809bece164
Let’s say it takes five minutes to train and evaluate a network on your dataset. And let’s say we have four parameters with
five possible settings each. To try them all would take (5**4) * 5 minutes, or 3,125 minutes, or about 52 hours.
Now let’s say we use a genetic algorithm to evolve 10 generations with a population of 20 (more on what this means
below), with a plan to keep the top 25% plus a few more, so ~8 per generation. This means that in our first generation we
score 20 networks (20 * 5 = 100 minutes). Every generation after that only requires around 12 runs, since we don't have
to score the ones we keep. That's 100 + (9 generations * 5 minutes * 12 networks) = 640 minutes, or about 11 hours.
https://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol1/hmw/article1.html
(Diagram: use multi-GPU cluster servers for hyperparameter random search)
141. Hyper Parameter Optimization
Let's see how hyperparameter optimization with a genetic algorithm works.
http://localhost:8888/tree/chap05_nlp/automl
142. Goals for the next lecture
This lecture was meant to help you understand the data and the models needed to apply
deep learning from an NLP perspective.
In the next session we will bring these building blocks together and cover how to apply
and use them from an architecture perspective.
Thank you.