新たなRNNと自然言語処理 (New RNNs and Natural Language Processing)
東北大学 小林颯介 (Sosuke Kobayashi, Tohoku University) @ NLP-DL
Language modeling with an RNN: given the input sentence "John killed a man yesterday .", the network is trained to predict the same sequence shifted by one position, "killed a man yesterday . <EOS>", i.e. the next word at every time step.
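As an illustration of the next-word-prediction setup behind this example, here is a minimal sketch (not from the slides): a toy vanilla-RNN language model scores the sentence, with the target sequence being the input shifted by one position. The vocabulary, sizes and weights are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = ["John", "killed", "a", "man", "yesterday", ".", "<EOS>"]
vocab = {w: i for i, w in enumerate(tokens)}          # toy vocabulary (assumption)
V, H = len(vocab), 16                                  # vocab size, hidden size

E  = rng.normal(0, 0.1, (V, H))   # word embeddings
Wh = rng.normal(0, 0.1, (H, H))   # hidden-to-hidden weights
Wo = rng.normal(0, 0.1, (H, V))   # hidden-to-output weights

inputs  = ["John", "killed", "a", "man", "yesterday", "."]
targets = ["killed", "a", "man", "yesterday", ".", "<EOS>"]   # shifted by one

h, nll = np.zeros(H), 0.0
for x, y in zip(inputs, targets):
    h = np.tanh(E[vocab[x]] + Wh @ h)                 # recurrent state update
    logits = h @ Wo
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    nll += -np.log(probs[vocab[y]])                   # next-word cross-entropy
print("per-token NLL:", nll / len(inputs))
```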
[Jozefowicz+15]
Figure 3. Receptive fields of 48 hidden units of an RNN-RBM trained on the bouncing balls dataset. Each square shows the input weights of a hidden unit as an image.

The human motion capture dataset (people.csail.mit.edu/ehsu/work/sig05stf) is represented by a sequence of joint angles, translations and rotations of the base of the spine in an exponential-map parameterization (Hsu et al., 2005; Taylor et al., 2007). Since the data consists of 49 real values per time step, we use the Gaussian RBM variant (Welling et al., 2005) for this task. We use up to 450 hidden units and an initial learning rate of 0.001. The mean squared prediction test error is 20.1 for the RTRBM and reduced substantially to 16.2 for the RNN-RBM.

6 Modeling sequences of polyphonic music

In this section, we show results with the main application of interest for this paper: probabilistic modeling of sequences of polyphonic music. We report our experiments on four datasets of varying complexity converted to our input format.

Piano-midi.de is a classical piano MIDI archive that was split according to Poliner & Ellis (2007).
Nottingham is a collection of 1200 folk tunes (ifdo.ca/~seymour/nottingham/nottingham.html) with chords instantiated from the ABC format.
MuseData is an electronic library of orchestral and piano classical music from CCARH.
JSB chorales refers to the entire corpus of 382 four-part harmonized chorales by J. S. Bach with the split of Allan & Williams (2005).
[Jozefowicz+15]

(... for Nottingham, N-dropout stands for Nottingham with nonzero dropout, and P stands for Piano-Midi.)

Arch.   | 5M-tst | 10M-v | 20M-v | 20M-tst
Tanh    | 4.811  | 4.729 | 4.635 | 4.582 (97.7)
LSTM    | 4.699  | 4.511 | 4.437 | 4.399 (81.4)
LSTM-f  | 4.785  | 4.752 | 4.658 | 4.606 (100.8)
LSTM-i  | 4.755  | 4.558 | 4.480 | 4.444 (85.1)
LSTM-o  | 4.708  | 4.496 | 4.447 | 4.411 (82.3)
LSTM-b  | 4.698  | 4.437 | 4.423 | 4.380 (79.83)
GRU     | 4.684  | 4.554 | 4.559 | 4.519 (91.7)
MUT1    | 4.699  | 4.605 | 4.594 | 4.550 (94.6)
MUT2    | 4.707  | 4.539 | 4.538 | 4.503 (90.2)
MUT3    | 4.692  | 4.523 | 4.530 | 4.494 (89.47)

Table 3. Perplexities on the PTB. The prefix (e.g., 5M) denotes the number of parameters in the model. The suffix "v" denotes validation negative log likelihood, the suffix "tst" refers to the test set. The perplexity for select architectures is reported in parentheses. We used dropout only on models that have 10M or 20M parameters, since the 5M models did not benefit from dropout at all, and most dropout-free models achieved a test perplexity of 108, and never greater than 120. In particular, the perplexity of the best models without dropout is below 110, which outperforms the results of Mikolov et al. (2014).
[Greff+15]
There has been a resurgence of new structural designs for recurrent neural networks (RNNs). Most designs are derived from popular structures including vanilla RNNs, Long Short-Term Memory networks (LSTMs) [4] and Gated Recurrent Units (GRUs) [5]. Despite their differences, most of them share a common computational building block, described by the following equation:

φ(Wx + Uz + b),   (1)

where x ∈ R^n and z ∈ R^m are state vectors coming from different information sources, W ∈ R^{d×n} and U ∈ R^{d×m} are state-to-state transition matrices, and b is a bias vector. This computational building block serves as a combinator for integrating information flow from x and z by a sum operation followed by a nonlinearity φ. We refer to it as the additive building block. Additive building blocks are widely implemented in various state computations in RNNs (e.g. hidden state computations for vanilla RNNs, gate/cell computations of LSTMs and GRUs).

The authors propose an alternative design for constructing the computational building block by changing the way of information integration. Specifically, instead of utilizing a sum operation, they use the Hadamard product "⊙" to fuse Wx and Uz:

φ(Wx ⊙ Uz + b)   (2)

Structure Description and Analysis; General Formulation of Multiplicative Integration. The general idea behind Multiplicative Integration is to integrate different information flows Wx and Uz by the Hadamard product "⊙". A more general formulation adds bias vectors β1 and β2 to Wx and Uz:

φ((Wx + β1) ⊙ (Uz + β2) + b),

where β1, β2 ∈ R^d are bias vectors. Notice that such a formulation contains the first-order terms of the additive building block, i.e. β1 ⊙ U h_{t-1} + β2 ⊙ W x_t. To make Multiplicative Integration more flexible, another bias vector α ∈ R^d is introduced to gate the term Wx ⊙ Uz, obtaining the following formulation:

φ(α ⊙ Wx ⊙ Uz + β1 ⊙ Uz + β2 ⊙ Wx + b).

Note that the number of parameters of Multiplicative Integration is about the same as that of the additive building block, since the number of new parameters (α, β1 and β2) is negligible compared to the total number of parameters. Multiplicative Integration can also be easily extended to LSTMs and GRUs, which adopt vanilla building blocks for computing gates and output states, where one can directly replace them with Multiplicative Integration. More generally, in any kind of structure where k information flows (k ≥ 2) are involved (e.g. RNNs with multiple skip connections or feedforward models like residual networks [12]), one can implement pairwise Multiplicative Integration for integrating all k information sources.
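To make the contrast above concrete, here is a small sketch (assuming tanh as the nonlinearity φ, with toy shapes and placeholder weights) of the additive building block next to the general Multiplicative Integration block; with α = 1 and β1 = β2 = 0 the latter reduces to φ(Wx ⊙ Uz + b).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 8, 5, 6
W, U = rng.normal(size=(d, n)), rng.normal(size=(d, m))
b = np.zeros(d)
alpha, beta1, beta2 = np.ones(d), np.zeros(d), np.zeros(d)

def additive_block(x, z):
    # phi(Wx + Uz + b): the common additive building block.
    return np.tanh(W @ x + U @ z + b)

def mi_block(x, z):
    # phi(alpha*Wx*Uz + beta1*Uz + beta2*Wx + b): the general
    # Multiplicative Integration block (elementwise products).
    wx, uz = W @ x, U @ z
    return np.tanh(alpha * wx * uz + beta1 * uz + beta2 * wx + b)

x, z = rng.normal(size=n), rng.normal(size=m)
print(additive_block(x, z))
print(mi_block(x, z))
```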
Figure 2: Several examples of cells with interpretable activations. [Karpathy+15]
[Kádár+16]

Omission score: the contribution of word i to the representation of sentence S is measured by removing the word and comparing the encoder's final hidden states:

omission(i, S) = 1 − cosine(h_end(S), h_end(S\i)),   (12)

where S\i is the sentence S with the i-th word removed and h_end(·) is the encoder's hidden state after reading the full input.
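A minimal sketch of the omission score above, assuming some sentence encoder that returns a final hidden state; the toy_encode stand-in below is an assumption for illustration, not the paper's model.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def omission_scores(sentence, encode):
    """omission(i, S) = 1 - cosine(h_end(S), h_end(S without word i))."""
    full = encode(sentence)
    return [1.0 - cosine(full, encode(sentence[:i] + sentence[i + 1:]))
            for i in range(len(sentence))]

# Toy stand-in encoder: mean of fixed random word vectors (assumption).
rng = np.random.default_rng(0)
vecs = {}
def toy_encode(words):
    return np.mean([vecs.setdefault(w, rng.normal(size=8)) for w in words], axis=0)

print(omission_scores("the cat sat on the mat".split(), toy_encode))
```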
Pixel Recurrent Neural Networks

Figure 2. Left: To generate pixel xi one conditions on all the previously generated pixels left and above of xi. Center: Illustration of a Row LSTM with a kernel of size 3. The dependency field of the Row LSTM does not reach pixels further away on the sides of the image. Right: Illustration of the two directions of the Diagonal BiLSTM. The dependency field of the Diagonal BiLSTM covers the entire available context in the image.

Figure 3. In the Diagonal BiLSTM, to allow for parallelization along the diagonals, the input map is skewed by offsetting each row by one position with respect to the previous row. When the spatial layer is computed left to right and column by column, the output map is shifted back into the original size. The convolution uses a kernel of size 2 × 1.

By contrast we model p(x) as a discrete distribution, with every conditional distribution ...

3.1 Row LSTM. The Row LSTM is a unidirectional layer that processes the image row by row from top to bottom, computing features for a whole row at once; the computation is performed with a one-dimensional convolution. For a pixel xi the layer captures a roughly triangular context above the pixel, as shown in Figure 2 (center). The kernel of the one-dimensional convolution has size k × 1; the larger the value of k, the broader the context that is captured. The weight sharing in the convolution ensures translation invariance of the computed features along each row. The computation proceeds as follows: an LSTM layer has an input-to-state component and a recurrent state-to-state component that together determine the gates and the update of the LSTM core. To enhance parallelization ...
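The skewing trick in Figure 3 can be shown in a few lines; this sketch (toy shapes, no LSTM or convolution) only demonstrates how offsetting each row by its index lines the diagonals up as columns, and how the map is shifted back to its original size afterwards.

```python
import numpy as np

def skew(x):
    """Offset row r of a (h, w) map by r positions to the right."""
    h, w = x.shape
    out = np.zeros((h, w + h - 1), dtype=x.dtype)
    for r in range(h):
        out[r, r:r + w] = x[r]
    return out

def unskew(x, w):
    """Shift the skewed map back into the original (h, w) size."""
    h = x.shape[0]
    return np.stack([x[r, r:r + w] for r in range(h)])

img = np.arange(12).reshape(3, 4)
sk = skew(img)                      # columns of sk now correspond to diagonals of img
assert np.array_equal(unskew(sk, 4), img)
print(sk)
```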
Grid LSTM (blocks shown: Standard LSTM block, 1d Grid LSTM block, 2d Grid LSTM block, 3d Grid LSTM block)

Figure 1: Blocks that form the standard LSTM and those that form Grid LSTM networks of N = 1, 2 and 3 dimensions. The dashed lines indicate identity transformations. The standard LSTM block does not have a memory vector in the vertical dimension; by contrast, the 2d Grid LSTM block has the memory vector m1 applied along the vertical dimension.
Long Short-Term Memory (LSTM) networks, a type of recurrent neural network with a more complex computational unit, have obtained strong results on a variety of sequence modeling tasks. The only underlying LSTM structure that has been explored so far is a linear chain. However, natural language exhibits syntactic properties that would naturally combine words to phrases. We introduce the Tree-LSTM, a generalization of LSTMs to tree-structured network topologies. Tree-LSTMs outperform all existing systems and strong LSTM baselines on two tasks: predicting the semantic relatedness of two sentences (SemEval 2014, Task 1) and sentiment classification (Stanford Sentiment Treebank).

1 Introduction

Most models for distributed representations of phrases and sentences (that is, models where real-valued vectors are used to represent meaning) fall into one of three classes: bag-of-words models, sequence models, and tree-structured models. In bag-of-words models, phrase and sentence representations are independent of word order; for example, they can be generated by averaging constituent word representations (Landauer and Dumais, 1997; Foltz et al., 1998). In contrast, sequence models construct sentence representations as an order-sensitive function of the sequence of tokens (Elman, 1990; Mikolov, 2012). Lastly, tree-structured models compose each phrase and sentence representation from its constituent subphrases according to a given syntactic structure over the sentence (Goller and Kuchler, 1996; Socher et al., 2011).

Figure 1: Top: A chain-structured LSTM network. Bottom: A tree-structured LSTM network with arbitrary branching factor.

Order-insensitive models are insufficient to fully capture the semantics of natural language due to their inability to account for differences in meaning as a result of differences in word order or syntactic structure (e.g., "cats climb trees" vs. "trees climb cats"). We therefore turn to order-sensitive sequential or tree-structured models. In particular, tree-structured models are a linguistically attractive option due to their relation to syntactic interpretations of sentence structure. A natural question, then, is the following: to what extent (if at all) can we do better with tree-structured models as opposed to sequential models for sentence representation? In this paper, we work towards addressing this question by directly comparing a type of sequential model that has recently been used to achieve state-of-the-art results in several NLP tasks against its tree-structured generalization.

Due to their capability for processing arbitrary-length sequences, recurrent neural networks ...
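For concreteness, here is a minimal sketch of a Child-Sum Tree-LSTM node update following the paper's equations: a parent state is composed from its children's (h, c) pairs, with one forget gate per child. The sizes, weights and the toy two-leaf tree below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wi, Wf, Wo, Wu = (rng.normal(0, 0.1, (d, d)) for _ in range(4))
Ui, Uf, Uo, Uu = (rng.normal(0, 0.1, (d, d)) for _ in range(4))
bi = bf = bo = bu = np.zeros(d)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def tree_lstm_node(x, children):
    """children: list of (h_k, c_k) pairs; returns (h_j, c_j) for the parent."""
    h_sum = sum((h for h, _ in children), np.zeros(d))      # sum of child hidden states
    i = sigmoid(Wi @ x + Ui @ h_sum + bi)                    # input gate
    o = sigmoid(Wo @ x + Uo @ h_sum + bo)                    # output gate
    u = np.tanh(Wu @ x + Uu @ h_sum + bu)                    # candidate update
    c = i * u
    for h_k, c_k in children:
        f_k = sigmoid(Wf @ x + Uf @ h_k + bf)                # one forget gate per child
        c += f_k * c_k
    return o * np.tanh(c), c

leaf = lambda x: tree_lstm_node(x, [])
h1, c1 = leaf(rng.normal(size=d))
h2, c2 = leaf(rng.normal(size=d))
h_root, c_root = tree_lstm_node(rng.normal(size=d), [(h1, c1), (h2, c2)])
print(h_root.shape)
```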
Figure 4: Generation of left and right dependents of node w0 (w1, w2, w3 to the left; w4, w5, w6 to the right), generated by four LSTMs with tied We and tied Who (GEN-L / GEN-NX-L and GEN-R / GEN-NX-R).

Figure 3: Generation process of left (w1, w2, w3) and right dependents of w0.

Who ∈ R^{|V|×d} is the output matrix of the model, where |V| is the vocabulary size, s the word embedding size and d the hidden unit size. Tied We and tied Who are used for the four LSTMs to reduce the number of parameters in the model. The four LSTMs also share their hidden states. Let H ∈ R^{d×(n+1)} denote the shared hidden states of all time steps and e(wt) the one-hot vector of wt. Then H[:, t] represents D(wt) at time step t, and the computation is: xt = We · e(wt′)   (2a) ...
Figure 2: Attentional Encoder-Decoder model. The context vector dj is calculated as the summation of the encoder hidden states weighted by αj(i):

dj = Σ_{i=1}^{n} αj(i) hi.   (6)

To incorporate the attention mechanism into the decoding process, the context vector is used for the j-th word prediction by putting an additional hidden layer s̃j:

s̃j = tanh(Wd [sj; dj] + bd).   (7)

Figure 3: Proposed model: tree-to-sequence attentional NMT model. ... a sentence inherent in language. We propose a novel tree-based encoder in order to explicitly take the syntactic structure into consideration in the NMT model. We focus on the phrase structure of a sentence and construct a sentence vector from phrase vectors in a bottom-up fashion. The sentence vector in the tree-based encoder is then ...
Figure 5: Neural architecture for defining a distribution over at given representations of the stack (St), output buffer (Tt) and history of actions (a<t). Details of the composition architecture of the NP, the action history LSTM, and the other elements of the stack are not shown. This architecture corresponds to the generator state at line 7 of Figure 4.

The outputs of the forward and reverse LSTMs are concatenated, passed through an affine transformation and a tanh nonlinearity to become the subtree embedding. Because each of the child node embeddings (u, v, w in Fig. 6) is computed similarly (if it corresponds to an internal node), this composition function is a kind of recursive neural network.

4.2 Word Generation ...

4.4 Discriminative Parsing Model. A discriminative parsing model can be obtained by replacing the embedding of Tt at each time step with an embedding of the input buffer Bt. To train this model, the conditional likelihood of each sequence of actions given the input string is maximized.

5 Inference via Importance Sampling. Our generative model p(x, y) defines a joint dis[tribution] ...
3.5 Comparison to Other Models. Our generation algorithm differs from previous stack-based parsing/generation algorithms in two ways. First, it constructs rooted tree structures top down (rather than bottom up), and second, the transition operators are capable of directly generating arbitrary tree structures rather than, e.g., assuming binarized trees, as is the case in much prior work that has used transition-based algorithms to produce phrase-structure trees (Sagae and Lavie, 2005; Zhang and Clark, 2011; Zhu et al., 2013).

4 Generative Model. RNNGs use the generator transition set just presented to define a joint distribution on syntax trees (y) and words (x). This distribution is defined as a sequence model over generator transitions that is parameterized using a continuous space embedding of the algorithm state at each time step (ut); i.e.,

p(x, y) = ∏_{t=1}^{|a(x,y)|} p(at | a<t) = ∏_{t=1}^{|a(x,y)|} exp(r_{at}ᵀ ut + b_{at}) / Σ_{a′ ∈ A_G(Tt, St, nt)} exp(r_{a′}ᵀ ut + b_{a′}),

where action-specific embeddings ra and bias vector b are parameters in Θ.

The representation of the algorithm state at time t, ut, is computed by combining the representations of the generator's three data structures: the output buffer, the stack, and the action history. The output buffer is encoded with a standard RNN encoding architecture. The stack (S) is more complicated for two reasons. First, the elements of the stack are more complicated objects than symbols from a discrete alphabet: open nonterminals, terminals, and full trees are all present on the stack. Second, it is manipulated using both push and pop operations. To efficiently obtain representations of S under push and pop operations, we use stack LSTMs (Dyer et al., 2015).

4.1 Syntactic Composition Function. When a REDUCE operation is executed, the parser pops a sequence of completed subtrees and/or tokens (together with their vector embeddings) from the stack and makes them children of the most recent open nonterminal on the stack, "completing" the constituent. To compute an embedding of this new subtree, we use a composition function based on bidirectional LSTMs, which is illustrated in Fig. 6.

Figure 6: Syntactic composition function based on bidirectional LSTMs that is executed during a REDUCE operation; the network on the right models the structure on the left (an NP with children u, v, w composed into a vector x).

The first vector read by the LSTM in both the forward and reverse directions is an embedding of the ... [Dyer+16]
Input: The hungry cat meows .

Figure 2: Top-down parsing example (each row: step, stack, buffer, action).
0    (empty)    The | hungry | cat | meows | .    NT(S)
1    (S    The | hungry | cat | meows | .    NT(NP)
2    (S | (NP    The | hungry | cat | meows | .    SHIFT
3    (S | (NP | The    hungry | cat | meows | .    SHIFT
4    (S | (NP | The | hungry    cat | meows | .    SHIFT
5    (S | (NP | The | hungry | cat    meows | .    REDUCE
6    (S | (NP The hungry cat)    meows | .    NT(VP)
7    (S | (NP The hungry cat) | (VP    meows | .    SHIFT
8    (S | (NP The hungry cat) | (VP meows    .    REDUCE
9    (S | (NP The hungry cat) | (VP meows)    .    SHIFT
10   (S | (NP The hungry cat) | (VP meows) | .    (empty)    REDUCE
11   (S (NP The hungry cat) (VP meows) .)

Figure 3: Generator transitions. Symbols defined as in Fig. 1, with the addition of T representing the history of generated terminals. For a stack S, terminals T and n open nonterminals:
NT(X):   (S, T, n)  →  (S | (X,  T,  n + 1)
GEN(x):  (S, T, n)  →  (S | x,  T | x,  n)
REDUCE:  (S | (X | τ1 | ... | τℓ,  T,  n)  →  (S | (X τ1 ... τℓ),  T,  n − 1)

Generation example (each row: step, stack, terminals, action).
0    (empty)    (empty)    NT(S)
1    (S    (empty)    NT(NP)
2    (S | (NP    (empty)    GEN(The)
3    (S | (NP | The    The    GEN(hungry)
4    (S | (NP | The | hungry    The | hungry    GEN(cat)
5    (S | (NP | The | hungry | cat    The | hungry | cat    REDUCE
6    (S | (NP The hungry cat)    The | hungry | cat    NT(VP)
7    (S | (NP The hungry cat) | (VP    The | hungry | cat    GEN(meows)
8    (S | (NP The hungry cat) | (VP meows    The | hungry | cat | meows    REDUCE
9    (S | (NP The hungry cat) | (VP meows)    The | hungry | cat | meows    GEN(.)
10   (S | (NP The hungry cat) | (VP meows) | .    The | hungry | cat | meows | .    REDUCE
11   (S (NP The hungry cat) (VP meows) .)    The | hungry | cat | meows | .
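The transition tables above can be replayed symbolically; this sketch only shows the NT / GEN / REDUCE state updates on the example sentence, leaving out the RNNG's stack LSTMs, composition function and action classifier.

```python
# Replay the generator actions from the table to rebuild the bracketed tree.
actions = [("NT", "S"), ("NT", "NP"), ("GEN", "The"), ("GEN", "hungry"),
           ("GEN", "cat"), ("REDUCE", None), ("NT", "VP"), ("GEN", "meows"),
           ("REDUCE", None), ("GEN", "."), ("REDUCE", None)]

stack, terminals = [], []        # stack holds (is_open_nonterminal, symbol) pairs
for act, arg in actions:
    if act == "NT":                       # push an open nonterminal "(X"
        stack.append((True, arg))
    elif act == "GEN":                    # generate a terminal word
        stack.append((False, arg))
        terminals.append(arg)
    else:                                 # REDUCE: pop children up to the open NT
        children = []
        while not stack[-1][0]:
            children.append(stack.pop()[1])
        label = stack.pop()[1]
        completed = "(" + label + " " + " ".join(reversed(children)) + ")"
        stack.append((False, completed))  # completed subtree is no longer open

print(stack[0][1])          # (S (NP The hungry cat) (VP meows) .)
print(" ".join(terminals))  # The hungry cat meows .
```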
[Bowman+16]

(a) The SPINN model unrolled for two transitions during the processing of the sentence "the cat sat down". 'Tracking', 'transition', and 'composition' are neural network layers. Gray arrows indicate connections which are blocked by a gating function.

(b) The fully unrolled SPINN for "the cat sat down", with neural network layers omitted for clarity: the buffer starts with the / cat / sat / down; after the first composition the stack holds the phrase (the cat); at the final step t = 7 = T the stack holds (the cat) (sat down), which is the output to the model for the semantic task.
Sentence representation: in our experiments we explore two different representations for the sentences. The first is the bag-of-words (BoW) representation that takes the sentence xi = {xi1, xi2, ..., xin}, embeds each word and sums the resulting vectors: e.g. mi = Σ_j A xij and ci = Σ_j C xij. The input vector u representing the question is also embedded as a bag of words. This has the drawback that it cannot capture the order of the words in the sentence, which is important for some tasks. We therefore propose a second representation that encodes the position of words within the sentence. This takes the form

mi = Σ_j lj · A xij,

where · is an element-wise multiplication and lj is a column vector with the structure

lkj = (1 − j/J) − (k/d)(1 − 2j/J)   (assuming 1-based indexing),

with J the number of words in the sentence and d the dimension of the embedding. This sentence representation, which we call position encoding (PE), means that the order of the words now affects mi. The same representation is used for questions, memory inputs and memory outputs.

Temporal Encoding: many of the QA tasks require some notion of temporal context, i.e. in the example of Section 2, the model needs to understand that Sam is in the bedroom after ... represented by a one-hot vector of length V (where the vocabulary is of size V = 177, due to the simplistic nature of the QA language). The same representation is used for the question q and answer a. Two versions of the data are used, one that has 1000 training problems per task and a second larger one with 10,000 per task.

Model details: unless otherwise stated, all experiments used a K = 3 hops model with the adjacent weight sharing scheme. For all tasks that output lists (i.e. the answers are multiple words), we take each possible combination of possible outputs and record them as a separate answer vocabulary word.
4.2 Attention Mechanisms

Neural models with memories coupled to differentiable addressing mechanisms have been successfully applied to handwriting generation and recognition (Graves, 2012), machine translation (Bahdanau et al., 2015a), and more general computation machines (Graves et al., 2014; Weston et al., 2015). Since we are interested in associative memories we employed a "content" based attention. This has the property that the vector retrieved from our memory would not change if we randomly shuffled the memory. This is crucial for proper treatment of the input set X as such. In particular, our process block based on an attention mechanism uses the following:

qt = LSTM(q*_{t−1})   (3)
e_{i,t} = f(mi, qt)   (4)
a_{i,t} = exp(e_{i,t}) / Σ_j exp(e_{j,t})   (5)
rt = Σ_i a_{i,t} mi   (6)
q*_t = [qt  rt]   (7)

Figure 1: The Read-Process-and-Write model.

where i indexes through each memory vector mi (typically equal to the cardinality of X), qt is a query vector which allows us to read rt from the memories, f is a function that computes a single scalar from mi and qt (e.g., a dot product), and LSTM is an LSTM which computes a recurrent state but which takes no inputs. q*_t is the state which this LSTM evolves, and is formed by concatenating the query qt with the resulting attention readout rt. t is the index which indicates how many "processing steps" are being carried out to compute the state to be fed to the decoder. Note that permuting mi and mi′ has no effect on the read vector rt.

4.3 Read, Process, Write

Our model, which naturally handles input sets, has three components (the exact equations and implementation will be released in an appendix prior to publication):
• A reading block, which simply embeds each element xi ∈ X using a small neural network onto a memory vector mi (the same neural network is used for all i).
• A process block, which is an LSTM without inputs or outputs performing T steps of computation over the memories mi. This LSTM keeps updating its state by reading mi repeatedly using the attention mechanism described in the previous section. At the end of this block, its hidden state q*_T is an embedding which is permutation invariant to the inputs. See eqs. (3)-(7) for more details.
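A minimal sketch of the content-based attention read in eqs. (4)-(6): a query retrieves a weighted sum over a set of memory vectors, and the readout is unchanged if the memories are shuffled. The LSTM that evolves q* is omitted and the query is just a fixed vector here, an assumption made for brevity.

```python
import numpy as np

def softmax(e):
    e = e - e.max()
    p = np.exp(e)
    return p / p.sum()

def attention_read(memories, q):
    e = memories @ q                 # e_{i,t} = f(m_i, q_t), here a dot product
    a = softmax(e)                   # a_{i,t}
    r = a @ memories                 # r_t = sum_i a_{i,t} m_i
    return r, a

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 8))          # five memory vectors m_i
q = rng.normal(size=8)
r1, _ = attention_read(M, q)
r2, _ = attention_read(M[::-1].copy(), q)   # same memories, shuffled order
print(np.allclose(r1, r2))           # True: the readout is permutation-invariant
```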
Recurrent dropout. For a vanilla RNN, dropout can be applied to the hidden-state update:

ht = f(Wh [xt, d(ht−1)] + bh),   (4)

where d is the dropout function from Equation 2. This is similar to dropout in feed-forward fully connected networks, but there is a significant difference: in feed-forward networks every fully-connected layer is applied only once, while this is not the case for the recurrent layer, where each training example is composed of a number of inputs; with dropout this results in hidden units being dropped on every step. This observation raises the question of how to sample the dropout mask. There are two options: sample it once per sequence (per-sequence) or sample a new mask on every step (per-step); the two sampling strategies are discussed in detail in Section 3.4.

For the LSTM, with ht = ot ∗ f(ct), where it, ft, ot are the input, output and forget gates at step t, gt is the vector of cell updates, ct is the updated cell vector used to update the hidden state ht, σ is the sigmoid function and ∗ is the element-wise multiplication, the approach is to apply dropout to the cell update vector:

ct = ft ∗ ct−1 + it ∗ d(gt).

In contrast, Moon et al. (2015) propose to apply dropout directly to the cell values and use per-sequence sampling:

ct = d(ft ∗ ct−1 + it ∗ gt).

The limitations of the approach of Moon et al. (2015) are discussed in Section 3.4.

Figure 1: Illustration of the three types [of dropout]; circles represent connections, hidden states ... where we apply dropout.

For the GRU, with

gt = f(Wg xt + Ug (rt ∗ ht−1) + bg),
ht = (1 − zt) ∗ ht−1 + zt ∗ gt,

dropout is similarly applied to the hidden-state update vector gt:

ht = (1 − zt) ∗ ht−1 + zt ∗ d(gt).

To the best of our knowledge, this work is the first to study the effect of recurrent dropout in GRU networks.

3.4 Dropout and memory. Before going further with the explanatio[n] ...
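A small sketch contrasting the two placements of dropout quoted above: dropping only the cell-update vector versus dropping the whole cell state. The gate activations below are random placeholders rather than a full LSTM.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p, training=True):
    """Inverted dropout: zero entries with probability p and rescale the rest."""
    if not training or p == 0.0:
        return x
    mask = (rng.random(x.shape) >= p) / (1.0 - p)
    return x * mask

d, p = 6, 0.3
c_prev = rng.normal(size=d)
i, f = rng.random(d), rng.random(d)                 # stand-in gate activations
g = np.tanh(rng.normal(size=d))                     # cell update vector g_t

c_update_dropout = f * c_prev + i * dropout(g, p)   # c_t = f*c_{t-1} + i*d(g_t)
c_state_dropout  = dropout(f * c_prev + i * g, p)   # c_t = d(f*c_{t-1} + i*g_t)
print(c_update_dropout)
print(c_state_dropout)
```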
Batch-normalized LSTM: batch normalization is introduced into the hidden-to-hidden transformations with the batch-normalizing transform BN(·; γ, β):

(f̃t; ĩt; õt; g̃t) = BN(Wh ht−1; γh, βh) + BN(Wx xt; γx, βx) + b   (6)
ct = σ(f̃t) ⊙ ct−1 + σ(ĩt) ⊙ tanh(g̃t)   (7)
ht = σ(õt) ⊙ tanh(BN(ct; γc, βc))   (8)

... the network by discarding the absolute scale of activations. We want to preserve the information in the network, by normalizing the activations in a training example relative to the statistics of the entire training data.

3 Normalization via Mini-Batch Statistics

Since the full whitening of each layer's inputs is costly and not everywhere differentiable, we make two necessary simplifications. The first is that instead of whitening the features in layer inputs and outputs jointly, we will normalize each scalar feature independently, by making it have the mean of zero and the variance of 1. For a layer with d-dimensional input x = (x(1) ... x(d)), we will normalize each dimension

x̂(k) = (x(k) − E[x(k)]) / sqrt(Var[x(k)]),

where the expectation and variance are computed over the training data set. As shown in (LeCun et al., 1998b), such normalization speeds up convergence, even when the features are not decorrelated. Note that simply normalizing each input of a layer may change what the layer can represent.

Let the normalized values be x̂1...m and their linear transformations be y1...m. We refer to the transform BNγ,β : x1...m → y1...m as the Batch Normalizing Transform, presented in Algorithm 1, where ε is a constant added to the mini-batch variance for numerical stability.

Algorithm 1: Batch Normalizing Transform. Input: values of x over a mini-batch B = {x1...m}; parameters to be learned: γ, β; output: {yi = BNγ,β(xi)}.
μB ← (1/m) Σ_{i=1}^{m} xi                  (mini-batch mean)
σ²B ← (1/m) Σ_{i=1}^{m} (xi − μB)²          (mini-batch variance)
x̂i ← (xi − μB) / sqrt(σ²B + ε)             (normalize)
yi ← γ x̂i + β ≡ BNγ,β(xi)                  (scale and shift)
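A sketch of the Batch Normalizing Transform of Algorithm 1 applied to a mini-batch of pre-activations; in the BN-LSTM above the same transform is applied separately to Wh ht−1, Wx xt and ct at every time step. The γ, β values and the toy batch are placeholders, and the running statistics used at test time are omitted.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                    # mini-batch mean per feature
    var = x.var(axis=0)                    # mini-batch variance per feature
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(32, 8))   # batch of 32, 8 features
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(3))             # roughly 0 per feature
print(y.std(axis=0).round(3))              # roughly 1 per feature
```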
Norm stabilizer. Overfitting in machine learning is addressed by restricting the space of models considered. This can be accomplished by reducing the number of parameters or with an inductive bias for simpler models, such as early stopping. Further regularization can be achieved by incorporating more sophisticated prior knowledge. ... Keeping activations on a reasonable path can be difficult, especially across long sequences. With this in mind, we devise a regularizer for the state representation learned by RNNs that aims to encourage stability of the path taken through representation space. Specifically, we propose the following additional cost term for recurrent neural networks:

(β / T) Σ_{t=1}^{T} (‖ht‖₂ − ‖ht−1‖₂)²,

where ht is the vector of hidden activations at time-step t, and β is a hyperparameter controlling the amount of regularization. We call this penalty the norm-stabilizer, as it successfully encourages the norms of the hiddens to be stable (i.e. approximately constant across time-steps). Unlike the "state coherence" penalty of Jonschkowski & Brock (2015), our penalty does not require the representation to remain constant, only its norm.

In the absence of inputs and nonlinearities, a constant norm would imply orthogonality of the hidden-to-hidden transition matrix for simple RNNs (SRNNs). However, even with such a transition matrix, inputs and nonlinearities can still change the norm of the hidden state, resulting in instability. This makes targeting the hidden activations directly a more attractive option for encouraging norm stability. Stability becomes especially important when we generalize to longer sequences at test time than those seen during training (the "training horizon"). (arXiv:1511.08400)
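A sketch of the norm-stabilizer penalty above. Note that it simply averages over the consecutive transitions of the given hidden states and does not treat a separate initial state, a small simplification of the formula.

```python
import numpy as np

def norm_stabilizer(hiddens, beta=1.0):
    """hiddens: array of shape (T, d) holding h_1..h_T; returns the scalar penalty."""
    norms = np.linalg.norm(hiddens, axis=1)          # ||h_t||_2 for every step
    return beta * np.mean((norms[1:] - norms[:-1]) ** 2)

rng = np.random.default_rng(0)
H = rng.normal(size=(10, 16))                        # toy hidden-state trajectory
print(norm_stabilizer(H, beta=50.0))                 # added to the training loss
```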
Multi-task sequence-to-sequence learning.

Figure 2: One-to-many setting: one encoder, multiple decoders (e.g. English → German translation, English → Tags parsing, English → English unsupervised). This scheme is useful for either multi-target translation as in Dong et al. (2015) or between different tasks. Here, English and German imply sequences of words in the respective languages. The α values give the proportions of parameter updates that are allocated for the different tasks. The decoders can produce (a) a sequence of tags for constituency parsing as used in (Vinyals et al., 2015a), (b) a sequence of German words for machine translation (Luong et al., 2015a), and (c) the same sequence of English words for autoencoders or a related sequence of English words for the skip-thought objective (Kiros et al., 2015).

3.2 Many-to-one setting. This scheme is the opposite of the one-to-many setting. As illustrated in Figure 3, it consists of multiple encoders and one decoder. This is useful for tasks in which only the decoder can be shared, for example, when our tasks include machine translation and image caption generation (Vinyals et al., 2015b). In addition, from a machine translation perspective, this setting can benefit from a large amount of monolingual data on the target side, which is a standard practice in machine translation systems and has also been explored for neural MT by Gulcehre et al. (2015).

Figure 3: Many-to-one setting: multiple encoders, one decoder (e.g. English unsupervised, Image captioning, German translation → English). This scheme is handy for tasks in which only the decoders can be shared.

3.3 Many-to-many setting. Lastly, as the name describes, this category is the most general one, consisting of multiple encoders and multiple decoders.

Figure 4: Many-to-many setting: multiple encoders, multiple decoders. We consider this in a limited context of machine translation to utilize the large monolingual corpora in both the source and the target languages. Here, we consider a single translation task (English and German) and two unsupervised autoencoder tasks. The skip-thought objective requires data that consist of ordered sentences, e.g., paragraphs. Unfortunately, in many applications such as machine translation, we only have sentence-level data where the sentences are unordered. To address that, we split each sentence into two halves; we then use one half to predict the other half.
CopyNet example: for the input "hello , my name is Tony Jebara ." the decoder produces "hi , Tony Jebara", copying the name from the source. The figure shows (a) the attention-based encoder-decoder (RNNSearch) with encoder states h1 ... h8 and decoder states s1 ... s4; (b) the generate-mode and copy-mode, whose scores over the source vocabulary are combined through a softmax so that Prob("Jebara") = Prob("Jebara", g) + Prob("Jebara", c); and (c) the state update, where the attentive read and a selective read for "Tony", together with the embedding of "Tony", feed the next decoder state s4.
The relationship between word forms and their meanings is non-trivial (de Saussure, 1916). While some compositional relationships exist, e.g., morphological processes such as adding -ing or -ly to a stem have relatively regular effects, many words with lexical similarities convey different meanings, such as the word pairs lesson ⇔ lessen and coarse ⇔ course.

3 C2W Model. Our compositional character to word (C2W) model is based on bidirectional LSTMs (Graves and Schmidhuber, 2005), which are able to learn complex non-local dependencies in sequence models. An illustration is shown in Figure 1. The input of the C2W model (illustrated on bottom) is a single word type w, and we wish to obtain a d-dimensional vector used to represent w. This model shares the same input and output as a word lookup table (illustrated on top), allowing it to easily replace it in any network.

As input, we define an alphabet of characters C. For English, this vocabulary would contain an entry for each uppercase and lowercase letter as well as numbers and punctuation. The input word w is decomposed into a sequence of characters c1, ..., cm, where m is the length of w. Each ci ... (Figure: the word "cats" is either looked up in a word lookup table or spelled out as c, a, t, s, passed through a character lookup table and a Bi-LSTM, producing the embedding for the word "cats".)
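A sketch of the C2W idea of composing a word vector from its characters with a forward and a backward recurrence. Plain tanh-RNN cells stand in for the paper's LSTMs purely to keep the sketch short, and the sizes, weights and output combination are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
chars = "abcdefghijklmnopqrstuvwxyz"
C = {c: rng.normal(0, 0.1, 16) for c in chars}        # character lookup table
Wx, Wh = rng.normal(0, 0.1, (16, 16)), rng.normal(0, 0.1, (16, 16))
Df, Db = rng.normal(0, 0.1, (32, 16)), rng.normal(0, 0.1, (32, 16))

def run_rnn(vectors):
    """Simple tanh recurrence; returns the final hidden state."""
    h = np.zeros(16)
    for v in vectors:
        h = np.tanh(Wx @ v + Wh @ h)
    return h

def c2w(word):
    xs = [C[c] for c in word]
    h_fwd = run_rnn(xs)                               # left-to-right pass
    h_bwd = run_rnn(xs[::-1])                         # right-to-left pass
    return Df @ h_fwd + Db @ h_bwd                    # d-dimensional word vector

print(c2w("cats").shape)                              # (32,)
```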
Figure 1: Illustration of the Scheduled Sampling approach, ...
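A sketch of the scheduled-sampling decision at a single decoding step: with probability eps the ground-truth previous token is fed to the decoder, otherwise the model's own previous prediction is fed. The linear decay schedule below is one possible choice of curriculum, not necessarily the one used in the paper.

```python
import random

def next_input(gold_prev, model_prev, eps):
    """Feed the gold token with probability eps, else the model's own prediction."""
    return gold_prev if random.random() < eps else model_prev

def linear_decay(step, k=1000, eps_min=0.1):
    # One possible schedule (an assumption): decay eps linearly from 1.0 to eps_min.
    return max(eps_min, 1.0 - step / k)

random.seed(0)
for step in (0, 500, 1000):
    eps = linear_decay(step)
    print(step, eps, next_input("gold_token", "sampled_token", eps))
```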
In order to apply the REINFORCE algorithm (Williams, 1992; Zaremba & Sutskever, 2015) to the problem of sequence generation we cast our problem in the reinforcement learning (RL) framework (Sutton & Barto, 1988). Our generative model (the RNN) can be viewed as an agent, which interacts with the external environment (the words and the context vector it sees as input at every time step). The parameters of this agent define a policy, whose execution results in the agent picking an action. In the sequence generation setting, an action refers to predicting the next word in the sequence at each time step. After taking an action the agent updates its internal state (the hidden units of the RNN). Once the agent has reached the end of a sequence, it observes a reward. We can choose any reward function. Here, we use BLEU (Papineni et al., 2002) and ROUGE-2 (Lin & Hovy, 2003) since these are the metrics we use at test time. BLEU is essentially a geometric mean over n-gram precision scores as well as a brevity penalty (Liang et al., 2006); in this work, we consider up to 4-grams. ROUGE-2 is instead recall over bi-grams. Like in imitation learning, we have a training set of optimal sequences of actions. During training we choose actions according to the current policy and only observe a reward at the end of the sequence (or after the maximum sequence length), by comparing the sequence of actions from the current policy against the optimal action sequence. The goal of training is to find the parameters of the agent that maximize the expected reward. We define our loss as the negative expected reward:

L_θ = − Σ_{w^g_1, ..., w^g_T} p_θ(w^g_1, ..., w^g_T) r(w^g_1, ..., w^g_T) = − E_{[w^g_1 ... w^g_T] ∼ p_θ} [ r(w^g_1, ..., w^g_T) ],   (9)

where w^g_n is the word chosen by our model at the n-th time step, and r is the reward associated with the generated sequence. In practice, we approximate this expectation with a single sample from the distribution of actions implemented by the RNN (right hand side of the equation above and Figure 9 of the Supplementary Material). We refer the reader to prior work (Zaremba & Sutskever, 2015; Williams, 1992) for the full derivation of the gradients. Here, we directly report the partial derivatives and their interpretation. The derivatives w.r.t. parameters are:

∂L_θ / ∂θ = Σ_t (∂L_θ / ∂ot) (∂ot / ∂θ).   (10)

Figure 3: Illustration of the End-to-End BackProp method. The first steps of the unrolled sequence (here just the first step) are exactly the same as in a regular RNN trained with cross-entropy. However, in the remaining steps the input to each module is a sparse vector whose non-zero entries are the k largest probabilities of the distribution predicted at the previous time step. Errors are back-propagated through these inputs as well.

While this algorithm is a simple way to expose the model to its own predictions, the loss function optimized is still XENT at each time step. There is no explicit supervision at the sequence level while training the model.

3.2 Sequence Level Training. We now introduce a novel algorithm for sequence level training, which we call Mixed Incremental Cross-Entropy Reinforce (MIXER). The proposed method avoids the exposure bias problem, and ...
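A toy sketch of the REINFORCE estimate described above: sample a sequence from a softmax policy, observe a sequence-level reward at the end, and update the parameters with the reward-scaled log-likelihood gradient of the sampled actions (with a running-average baseline for variance reduction). The per-step logits stand in for an RNN decoder and the toy reward is a placeholder for BLEU/ROUGE.

```python
import numpy as np

rng = np.random.default_rng(0)
V, T = 5, 4
theta = np.zeros((T, V))                 # per-step logits (stand-in for an RNN policy)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sample_sequence():
    words, grads = [], np.zeros_like(theta)
    for t in range(T):
        p = softmax(theta[t])
        w = int(rng.choice(V, p=p))      # sample an action (word)
        words.append(w)
        grads[t] = np.eye(V)[w] - p      # d log p(w_t) / d theta_t
    return words, grads

def reward(words, target=(1, 2, 3, 4)):
    # Toy sequence-level reward: fraction of positions matching a fixed target.
    return sum(w == g for w, g in zip(words, target)) / len(target)

lr, baseline = 0.5, 0.0
for _ in range(200):
    words, grads = sample_sequence()
    r = reward(words)
    baseline = 0.9 * baseline + 0.1 * r            # running-average baseline
    theta += lr * (r - baseline) * grads           # ascend the expected reward
print(sample_sequence()[0])
```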
ON FINDING MINIMUM AND MAXIMUM PATH LENGTH IN GRID-BASED WIRELESS NETWORKSON FINDING MINIMUM AND MAXIMUM PATH LENGTH IN GRID-BASED WIRELESS NETWORKS
ON FINDING MINIMUM AND MAXIMUM PATH LENGTH IN GRID-BASED WIRELESS NETWORKS
 
FEEDBACK SHIFT REGISTERS AS CELLULAR AUTOMATA BOUNDARY CONDITIONS
FEEDBACK SHIFT REGISTERS AS CELLULAR AUTOMATA BOUNDARY CONDITIONS FEEDBACK SHIFT REGISTERS AS CELLULAR AUTOMATA BOUNDARY CONDITIONS
FEEDBACK SHIFT REGISTERS AS CELLULAR AUTOMATA BOUNDARY CONDITIONS
 
FEEDBACK SHIFT REGISTERS AS CELLULAR AUTOMATA BOUNDARY CONDITIONS
FEEDBACK SHIFT REGISTERS AS CELLULAR AUTOMATA BOUNDARY CONDITIONSFEEDBACK SHIFT REGISTERS AS CELLULAR AUTOMATA BOUNDARY CONDITIONS
FEEDBACK SHIFT REGISTERS AS CELLULAR AUTOMATA BOUNDARY CONDITIONS
 
Comparative study of results obtained by analysis of structures using ANSYS, ...
Comparative study of results obtained by analysis of structures using ANSYS, ...Comparative study of results obtained by analysis of structures using ANSYS, ...
Comparative study of results obtained by analysis of structures using ANSYS, ...
 
Passive network-redesign-ntua
Passive network-redesign-ntuaPassive network-redesign-ntua
Passive network-redesign-ntua
 
Continuum Modeling and Control of Large Nonuniform Networks
Continuum Modeling and Control of Large Nonuniform NetworksContinuum Modeling and Control of Large Nonuniform Networks
Continuum Modeling and Control of Large Nonuniform Networks
 
Transport and routing on coupled spatial networks
Transport and routing on coupled spatial networksTransport and routing on coupled spatial networks
Transport and routing on coupled spatial networks
 
Dd 160506122947-160630175555-160701121726
Dd 160506122947-160630175555-160701121726Dd 160506122947-160630175555-160701121726
Dd 160506122947-160630175555-160701121726
 
Using spectral radius ratio for node degree
Using spectral radius ratio for node degreeUsing spectral radius ratio for node degree
Using spectral radius ratio for node degree
 
Spectral Properties Of Social Networks
Spectral Properties Of Social NetworksSpectral Properties Of Social Networks
Spectral Properties Of Social Networks
 
E010632226
E010632226E010632226
E010632226
 
Modelling Quantum Transport in Nanostructures
Modelling Quantum Transport in NanostructuresModelling Quantum Transport in Nanostructures
Modelling Quantum Transport in Nanostructures
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
MODELLING AND SIMULATION OF 128-BIT CROSSBAR SWITCH FOR NETWORK -ONCHIP
MODELLING AND SIMULATION OF 128-BIT CROSSBAR SWITCH FOR NETWORK -ONCHIPMODELLING AND SIMULATION OF 128-BIT CROSSBAR SWITCH FOR NETWORK -ONCHIP
MODELLING AND SIMULATION OF 128-BIT CROSSBAR SWITCH FOR NETWORK -ONCHIP
 

Recently uploaded

Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?IES VE
 
100+ ChatGPT Prompts for SEO Optimization
100+ ChatGPT Prompts for SEO Optimization100+ ChatGPT Prompts for SEO Optimization
100+ ChatGPT Prompts for SEO Optimizationarrow10202532yuvraj
 
99.99% of Your Traces Are (Probably) Trash (SRECon NA 2024).pdf
99.99% of Your Traces  Are (Probably) Trash (SRECon NA 2024).pdf99.99% of Your Traces  Are (Probably) Trash (SRECon NA 2024).pdf
99.99% of Your Traces Are (Probably) Trash (SRECon NA 2024).pdfPaige Cruz
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDELiveplex
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsSafe Software
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
Governance in SharePoint Premium:What's in the box?
Governance in SharePoint Premium:What's in the box?Governance in SharePoint Premium:What's in the box?
Governance in SharePoint Premium:What's in the box?Juan Carlos Gonzalez
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 

Recently uploaded (20)

Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
100+ ChatGPT Prompts for SEO Optimization
100+ ChatGPT Prompts for SEO Optimization100+ ChatGPT Prompts for SEO Optimization
100+ ChatGPT Prompts for SEO Optimization
 
99.99% of Your Traces Are (Probably) Trash (SRECon NA 2024).pdf
99.99% of Your Traces  Are (Probably) Trash (SRECon NA 2024).pdf99.99% of Your Traces  Are (Probably) Trash (SRECon NA 2024).pdf
99.99% of Your Traces Are (Probably) Trash (SRECon NA 2024).pdf
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
Governance in SharePoint Premium:What's in the box?
Governance in SharePoint Premium:What's in the box?Governance in SharePoint Premium:What's in the box?
Governance in SharePoint Premium:What's in the box?
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
20230104 - machine vision
20230104 - machine vision20230104 - machine vision
20230104 - machine vision
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 

新たなRNNと自然言語処理 (New RNNs and Natural Language Processing)

• 17. Multiplicative Integration (paper excerpt, condensed). Most recently proposed RNN variants, including vanilla RNNs, LSTMs and GRUs, share a common additive building block
  φ(Wx + Uz + b),
where x ∈ R^n and z ∈ R^m are state vectors from different information sources, W ∈ R^{d×n} and U ∈ R^{d×m} are state transition matrices, b is a bias vector, and φ is a nonlinearity. Multiplicative Integration instead fuses Wx and Uz with the Hadamard product:
  φ(Wx ⊙ Uz + b).
A more general formulation adds bias vectors β1, β2 to the two terms, φ((Wx + β1) ⊙ (Uz + β2) + b), and a gating vector α ∈ R^d yields
  φ(α ⊙ Wx ⊙ Uz + β1 ⊙ Uz + β2 ⊙ Wx + b),
which keeps roughly the same number of parameters as the additive block and can replace it in the gate and state computations of LSTMs and GRUs, or in any block that integrates two or more information flows.
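As a concrete illustration of the block above, here is a minimal NumPy sketch of a vanilla-RNN step that uses the multiplicative-integration form in place of the additive one. The tanh nonlinearity, the toy dimensions and the initialization of α, β1, β2 are assumptions for the example, not taken from the slide.

import numpy as np

def mi_rnn_step(x, h_prev, W, U, alpha, beta1, beta2, b):
    """One vanilla-RNN step with the multiplicative-integration block:
    phi(alpha*Wx*Uh + beta1*Uh + beta2*Wx + b), with phi = tanh."""
    wx = W @ x          # input contribution, shape (d,)
    uh = U @ h_prev     # recurrent contribution, shape (d,)
    pre = alpha * wx * uh + beta1 * uh + beta2 * wx + b
    return np.tanh(pre)

# Toy usage with assumed sizes (n = input dim, d = hidden dim).
rng = np.random.default_rng(0)
n, d = 4, 8
W = rng.normal(scale=0.1, size=(d, n))
U = rng.normal(scale=0.1, size=(d, d))
alpha, beta1, beta2, b = np.ones(d), np.ones(d), np.ones(d), np.zeros(d)
h = np.zeros(d)
for x in rng.normal(size=(5, n)):   # a length-5 input sequence
    h = mi_rnn_step(x, h, W, U, alpha, beta1, beta2, b)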
• 18. [Karpathy+15] Figure 2: several examples of LSTM cells with interpretable activations (figure omitted in the transcript).
• 19. [Kádár+16] Omission score for measuring how much word i contributes to a sentence representation:
  omission(i, S) = 1 − cosine(h_end(S), h_end(S\i))   (12)
where h_end(S) is the final hidden state for sentence S and S\i is S with word i removed.
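A minimal sketch of how the omission score above could be computed, assuming any sentence encoder that returns a final vector; the toy "encoder" below (a sum of random word vectors) is a stand-in, not the model from the slide.

import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def omission_scores(sentence, encode):
    """omission(i, S) = 1 - cosine(h_end(S), h_end(S without word i)).
    `encode` maps a list of tokens to a vector (any sentence encoder)."""
    h_full = encode(sentence)
    scores = []
    for i in range(len(sentence)):
        reduced = sentence[:i] + sentence[i + 1:]
        scores.append(1.0 - cosine(h_full, encode(reduced)))
    return scores

# Toy usage with a stand-in encoder.
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=16) for w in "the hungry cat meows .".split()}
encode = lambda toks: sum((vocab[t] for t in toks), np.zeros(16))
print(omission_scores("the hungry cat meows .".split(), encode))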
• 25. Pixel Recurrent Neural Networks and Grid LSTM (figure captions condensed). PixelRNN Figure 2, left: to generate pixel x_i one conditions on all previously generated pixels to the left of and above x_i; center: a Row LSTM with a kernel of size 3, whose dependency field does not reach pixels further away on the sides of the image; right: the two directions of the Diagonal BiLSTM, whose dependency field covers the entire available context in the image. PixelRNN Figure 3: in the Diagonal BiLSTM, the input map is skewed by offsetting each row by one position relative to the previous row so computation can be parallelized along the diagonals, and the output map is shifted back to the original size (convolution kernel of size 2×1). Grid LSTM Figure 1: blocks forming the standard LSTM and Grid LSTM networks of N = 1, 2 and 3 dimensions; dashed lines indicate identity transformations, and unlike the standard LSTM block, the 2d Grid LSTM block also has a memory vector m1 applied along the vertical dimension.
• 26. [Tai+15] Tree-LSTMs and top-down dependency generation (excerpts condensed). Tree-LSTMs generalize chain-structured LSTMs to tree-structured network topologies (Figure 1: top, a chain-structured LSTM network; bottom, a tree-structured LSTM network with arbitrary branching factor); they outperform sequential LSTM baselines on semantic relatedness (SemEval 2014 Task 1) and sentiment classification (Stanford Sentiment Treebank), since order-insensitive models cannot capture differences in meaning caused by word order or syntactic structure. The second excerpt shows a generative dependency model in which the left and right dependents of a head word w0 are generated by four LSTMs (GEN-L / GEN-NX-L and GEN-R / GEN-NX-R) with tied word-embedding matrix We and tied output matrix Who ∈ R^{|V|×d}, all sharing their hidden states.
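Below is a hedged NumPy sketch of the Child-Sum Tree-LSTM composition from Tai et al. (2015), i.e. one node combining its own input with the states of an arbitrary number of children; parameter names, dimensions and initialization are assumptions for the example.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def child_sum_treelstm_node(x, child_h, child_c, P):
    """Compose one node from its input x and its children's (h, c) states,
    following the Child-Sum Tree-LSTM. P holds Wi, Ui, bi, ..., Wu, Uu, bu."""
    h_sum = np.sum(child_h, axis=0) if len(child_h) else np.zeros(P["Ui"].shape[1])
    i = sigmoid(P["Wi"] @ x + P["Ui"] @ h_sum + P["bi"])
    o = sigmoid(P["Wo"] @ x + P["Uo"] @ h_sum + P["bo"])
    u = np.tanh(P["Wu"] @ x + P["Uu"] @ h_sum + P["bu"])
    c = i * u
    for h_k, c_k in zip(child_h, child_c):     # one forget gate per child
        f_k = sigmoid(P["Wf"] @ x + P["Uf"] @ h_k + P["bf"])
        c = c + f_k * c_k
    h = o * np.tanh(c)
    return h, c

# Toy usage: a node with two children, input dim 4, hidden dim 8 (assumed).
rng = np.random.default_rng(0)
n, d = 4, 8
P = {k: rng.normal(scale=0.1, size=(d, n) if k.startswith("W") else (d, d))
     for k in ["Wi", "Ui", "Wf", "Uf", "Wo", "Uo", "Wu", "Uu"]}
P.update({k: np.zeros(d) for k in ["bi", "bf", "bo", "bu"]})
children = [(rng.normal(size=d), rng.normal(size=d)) for _ in range(2)]
h, c = child_sum_treelstm_node(rng.normal(size=n),
                               [h for h, _ in children],
                               [c for _, c in children], P)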
• 27. Tree-to-sequence attentional NMT (figure captions condensed). Figure 2: attentional encoder-decoder model, where the context vector d_j is the attention-weighted sum of encoder states, d_j = Σ_{i=1..n} α_j(i) h_i (6), and is combined with the decoder state through an additional hidden layer s̃_j = tanh(W[s_j; d_j] + b) (7) for predicting the j-th word. Figure 3: the proposed tree-based encoder builds phrase vectors bottom-up over the phrase structure of the source sentence, so the sentence vector explicitly reflects the syntactic structure inherent in language.
• 28. [Dyer+16] Recurrent Neural Network Grammars (excerpt condensed). Figure 5: the neural architecture defines a distribution over the next action a_t from representations of the stack (S_t), the output buffer (T_t) and the action history (a_<t). RNNGs use the generator transition set to define a joint distribution over syntax trees y and words x as a sequence model over actions:
  p(x, y) = Π_t p(a_t | a_<t), with p(a_t | a_<t) ∝ exp(r_{a_t}ᵀ u_t + b_{a_t}) over the valid actions,
where u_t combines the representations of the three data structures; the stack is encoded with stack LSTMs so it can be represented efficiently under push and pop. On REDUCE, the completed children are popped and composed into a subtree embedding by a bidirectional-LSTM composition function (Figure 6), a kind of recursive neural network. Example generation of "The hungry cat meows .": NT(S), NT(NP), GEN(The), GEN(hungry), GEN(cat), REDUCE, NT(VP), GEN(meows), REDUCE, GEN(.), REDUCE. A discriminative parsing model is obtained by replacing the embedding of T_t with an embedding of the input buffer B_t, and inference for the generative model uses importance sampling.
• 29. [Bowman+16] SPINN (figure captions condensed). Figure (a): the SPINN model unrolled for two transitions while processing "the cat sat down"; 'tracking', 'transition' and 'composition' are neural network layers, and gray arrows indicate connections blocked by a gating function. Figure (b): the fully unrolled SPINN for the same sentence, showing the buffer and stack contents at t = 0 … 7 and the final representation (the cat)(sat down) that is output to the model for the semantic task.
• 30. End-to-end memory networks and the Read-Process-Write model (excerpts condensed). In the memory-network excerpt, a bag-of-words representation m_i = Σ_j A x_{ij} cannot capture word order, so position encoding (PE) is used instead: m_i = Σ_j l_j ⊙ A x_{ij}, with weights l_{kj} = (1 − j/J) − (k/d)(1 − 2j/J) (1-based indexing, J the number of words in the sentence, d the embedding dimension); the same representation is used for questions, memory inputs and memory outputs, and temporal encoding handles the ordering of facts. Experiments use a K = 3 hops model with adjacent weight sharing. The Read-Process-and-Write model (Figure 1) attends over a set of memories with content-based attention:
  q_t = LSTM(q*_{t−1})  (3)
  e_{i,t} = f(m_i, q_t)  (4)
  a_{i,t} = exp(e_{i,t}) / Σ_j exp(e_{j,t})  (5)
  r_t = Σ_i a_{i,t} m_i  (6)
  q*_t = [q_t; r_t]  (7)
The reading block embeds each element x_i ∈ X into a memory vector m_i with a small shared network; the process block is an LSTM without inputs or outputs that keeps updating its state by reading the memories for T steps, so its final state q*_T is permutation-invariant to the inputs (permuting the m_i has no effect on the read vector r_t).
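A minimal NumPy sketch of the process block (eqs. 3-7): content-based attention over a set of memories, repeated T times. The input-less LSTM is replaced here by a simple tanh recurrence over q* as a stand-in, and f(m_i, q_t) is a plain dot product; both are simplifying assumptions. The final print illustrates the permutation-invariance property mentioned on the slide.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def process_block(memories, R, T):
    """T steps of content-based attention over a set of memory vectors.
    R is the (assumed) recurrence matrix acting on q* = [q; r]."""
    d = R.shape[0]
    q_star = np.zeros(2 * d)
    for _ in range(T):
        q = np.tanh(R @ q_star)             # q_t from q*_{t-1} (LSTM stand-in)
        e = memories @ q                    # e_{i,t} = f(m_i, q_t)
        a = softmax(e)                      # attention weights a_{i,t}
        r = a @ memories                    # r_t = sum_i a_{i,t} m_i
        q_star = np.concatenate([q, r])     # q*_t = [q_t; r_t]
    return q_star

rng = np.random.default_rng(0)
d = 8
R = rng.normal(scale=0.1, size=(d, 2 * d))
mem = rng.normal(size=(6, d))               # a set of 6 memory vectors
out = process_block(mem, R, T=3)
out_perm = process_block(mem[rng.permutation(6)], R, T=3)
print(np.allclose(out, out_perm))           # True: the readout ignores memory order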
• 33. Recurrent dropout (excerpt condensed). In feed-forward networks dropout masks each fully-connected layer once per example, but a recurrent layer processes many steps per example, so the mask can be sampled once per sequence (per-sequence) or resampled on every step (per-step). The proposal is to apply dropout only to the cell-update vector of the LSTM,
  c_t = f_t ∗ c_{t−1} + i_t ∗ d(g_t),
whereas Moon et al. (2015) apply dropout directly to the cell values with per-sequence sampling,
  c_t = d(f_t ∗ c_{t−1} + i_t ∗ g_t).
The same idea applies to GRUs by dropping only the candidate update: h_t = (1 − z_t) ∗ h_{t−1} + z_t ∗ d(g_t).
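A small sketch contrasting the two cell updates above; the gate values are random stand-ins and inverted dropout is assumed, so this only illustrates where the mask d(·) is applied.

import numpy as np

def dropout_mask(shape, p, rng):
    """Inverted dropout mask: zero with probability p, scale survivors by 1/(1-p)."""
    return (rng.random(shape) >= p) / (1.0 - p)

def cell_update_proposed(c_prev, i, f, g, p, rng):
    # c_t = f * c_{t-1} + i * d(g_t): dropout only on the candidate update, per step.
    return f * c_prev + i * (g * dropout_mask(g.shape, p, rng))

def cell_update_moon(c_prev, i, f, g, mask):
    # Moon et al. (2015): c_t = d(f * c_{t-1} + i * g_t), with a per-sequence mask.
    return (f * c_prev + i * g) * mask

rng = np.random.default_rng(0)
d, p = 8, 0.25
seq_mask = dropout_mask((d,), p, rng)        # sampled once per sequence
c = np.zeros(d)
for _ in range(10):                          # toy gate values for 10 steps
    i, f, g = rng.random(d), rng.random(d), rng.normal(size=d)
    c = cell_update_proposed(c, i, f, g, p, rng)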
• 34. Recurrent batch normalization (excerpt condensed). Batch normalization normalizes each scalar feature over a mini-batch B = {x_1, …, x_m}:
  μ_B = (1/m) Σ_i x_i,  σ²_B = (1/m) Σ_i (x_i − μ_B)²,
  x̂_i = (x_i − μ_B) / √(σ²_B + ε),  y_i = γ x̂_i + β ≡ BN_{γ,β}(x_i).
Applied to the input-to-hidden and hidden-to-hidden transformations of an LSTM:
  (f̃_t; ĩ_t; õ_t; g̃_t) = BN(W_h h_{t−1}; γ_h, β_h) + BN(W_x x_t; γ_x, β_x) + b  (6)
  c_t = σ(f̃_t) ⊙ c_{t−1} + σ(ĩ_t) ⊙ tanh(g̃_t)  (7)
  h_t = σ(õ_t) ⊙ tanh(BN(c_t; γ_c, β_c))  (8)
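A minimal single-step sketch of eqs. (6)-(8), computing batch statistics over an assumed mini-batch; per-time-step statistics, population statistics for test time, and the exact initialization of the BN gains are simplifications not taken from the slide.

import numpy as np

def bn(x, gamma, beta, eps=1e-5):
    """Batch norm over the batch axis: x has shape (batch, features)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bn_lstm_step(x, h_prev, c_prev, P):
    """One batch-normalized LSTM step; x is (batch, n), h_prev/c_prev are (batch, d)."""
    pre = (bn(h_prev @ P["Wh"], P["g_h"], P["b_h"])
           + bn(x @ P["Wx"], P["g_x"], P["b_x"]) + P["b"])
    f, i, o, g = np.split(pre, 4, axis=1)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(bn(c, P["g_c"], P["b_c"]))
    return h, c

rng = np.random.default_rng(0)
batch, n, d = 16, 4, 8
P = dict(Wx=rng.normal(scale=0.1, size=(n, 4 * d)),
         Wh=rng.normal(scale=0.1, size=(d, 4 * d)),
         b=np.zeros(4 * d),
         g_h=np.full(4 * d, 0.1), b_h=np.zeros(4 * d),
         g_x=np.full(4 * d, 0.1), b_x=np.zeros(4 * d),
         g_c=np.ones(d), b_c=np.zeros(d))
h = np.zeros((batch, d))
c = np.zeros((batch, d))
for x in rng.normal(size=(5, batch, n)):
    h, c = bn_lstm_step(x, h, c, P)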
• 35. Norm stabilizer (excerpt condensed). To keep the path of hidden activations stable across time, an additional cost term penalizes changes in the norm of the hidden state:
  β (1/T) Σ_{t=1}^{T} (‖h_t‖₂ − ‖h_{t−1}‖₂)²,
where h_t is the hidden activation at step t and β controls the amount of regularization. Unlike the "state coherence" penalty of Jonschkowski & Brock (2015), only the norm, not the representation itself, is encouraged to stay constant; stability matters especially when test sequences are longer than those seen during training.
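The penalty above in a few lines of NumPy; the hidden states and the value β = 50 are stand-ins for the example.

import numpy as np

def norm_stabilizer_penalty(hidden_states, beta):
    """beta * (1/T) * sum_t (||h_t|| - ||h_{t-1}||)^2 over hidden states of
    shape (T+1, d), with h_0 included as the first row."""
    norms = np.linalg.norm(hidden_states, axis=1)
    return beta * np.mean((norms[1:] - norms[:-1]) ** 2)

rng = np.random.default_rng(0)
H = rng.normal(size=(21, 8))        # 20 steps of an assumed 8-dim hidden state
print(norm_stabilizer_penalty(H, beta=50.0))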
• 36. Multi-task sequence-to-sequence learning (figure captions condensed). Figure 2, one-to-many: one encoder, multiple decoders (e.g. English to German translation, parsing tags, and an unsupervised English objective), useful for multi-target translation or for combining different tasks; the α values give the proportions of parameter updates allocated to each task. Figure 3, many-to-one: multiple encoders, one decoder (e.g. machine translation plus image captioning), which can also exploit large amounts of monolingual target-side data. Figure 4, many-to-many: multiple encoders and decoders, e.g. one translation task plus two unsupervised autoencoder tasks in the source and target languages; since sentence-level MT data is unordered, each sentence is split into two halves and one half is used to predict the other.
• 40. CopyNet (figure condensed). Example: encoding "hello , my name is Tony Jebara ." and generating "hi , Tony Jebara <eos>". (a) an attention-based encoder-decoder (RNNSearch) with attentive read over the source; (b) the generate-mode and copy-mode, whose scores over the source vocabulary are combined in a single softmax, e.g. Prob("Jebara") = Prob("Jebara", g) + Prob("Jebara", c); (c) the state update s_4 uses both the DNN embedding of the previously generated word ("Tony") and a selective read for that word from the source memory M.
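An illustrative sketch of the generate/copy mixture above: scores for the vocabulary (generate mode) and for the source positions (copy mode) are normalized with a shared softmax, and the probability of a word sums over both modes, so an out-of-vocabulary word such as "Jebara" can still receive probability through the copy mode. The scores here are random stand-ins for the model's scoring functions.

import numpy as np

def copynet_output_distribution(gen_scores, copy_scores, source_tokens, vocab):
    """Combine generate-mode scores (over the vocabulary) and copy-mode scores
    (over source positions) into one distribution: P(w) = P(w, g) + P(w, c)."""
    all_scores = np.concatenate([gen_scores, copy_scores])
    probs = np.exp(all_scores - all_scores.max())
    probs /= probs.sum()
    p_gen, p_copy = probs[:len(vocab)], probs[len(vocab):]
    p_word = {w: p_gen[j] for j, w in enumerate(vocab)}
    for pos, tok in enumerate(source_tokens):      # copied tokens add their mass
        p_word[tok] = p_word.get(tok, 0.0) + p_copy[pos]
    return p_word

rng = np.random.default_rng(0)
vocab = ["<unk>", "hi", ",", "my", "name", "is", "."]
source = "hello , my name is Tony Jebara .".split()
p = copynet_output_distribution(rng.normal(size=len(vocab)),
                                rng.normal(size=len(source)), source, vocab)
print(p.get("Jebara"))    # OOV word gets probability only through the copy mode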
• 41. Compositional character-to-word (C2W) model (excerpt condensed). The relationship between word forms and meanings is largely non-trivial (de Saussure, 1916): some morphological processes are regular (adding -ing or -ly to a stem), but lexically similar words can convey different meanings (lesson vs. lessen, coarse vs. course). The C2W model replaces a word lookup table: a word w is decomposed into characters c_1 … c_m, each character is embedded via a character lookup table, and a bidirectional LSTM over the character embeddings produces a d-dimensional vector for w, so the model shares the same input and output as a word lookup table and can replace it in any network.
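A hedged NumPy sketch of the C2W composition: a forward and a backward LSTM read the character embeddings of a word and their final states are projected into a word vector. The minimal LSTM step, the dimensions and the character inventory are assumptions for the example.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W):
    """Minimal LSTM step; W maps [x; h] to the four gate pre-activations."""
    z = W @ np.concatenate([x, h])
    f, i, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    return sigmoid(o) * np.tanh(c), c

def c2w_embedding(word, char_emb, Wf, Wb, Pf, Pb, b):
    """Compose a word vector from its characters with a forward and a backward
    LSTM over character embeddings, then project the two final states."""
    d = Pf.shape[1]
    chars = [char_emb[ch] for ch in word]
    hf = cf = hb = cb = np.zeros(d)
    for e in chars:                 # forward pass over characters
        hf, cf = lstm_step(e, hf, cf, Wf)
    for e in reversed(chars):       # backward pass over characters
        hb, cb = lstm_step(e, hb, cb, Wb)
    return Pf @ hf + Pb @ hb + b    # word representation of assumed size d_out

rng = np.random.default_rng(0)
dc, d, d_out = 6, 8, 10
char_emb = {ch: rng.normal(size=dc) for ch in "abcdefghijklmnopqrstuvwxyz"}
Wf = rng.normal(scale=0.1, size=(4 * d, dc + d))
Wb = rng.normal(scale=0.1, size=(4 * d, dc + d))
Pf = rng.normal(scale=0.1, size=(d_out, d))
Pb = rng.normal(scale=0.1, size=(d_out, d))
print(c2w_embedding("cats", char_emb, Wf, Wb, Pf, Pb, np.zeros(d_out)))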
• 42. Scheduled sampling (figure caption, truncated in the transcript). Figure 1 illustrates the scheduled sampling approach, in which the decoder is fed either the ground-truth previous token or its own previous prediction at each step, with the probability of using the ground truth annealed over the course of training.
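A minimal sketch of the per-step choice scheduled sampling makes during training: with probability ε the decoder receives the gold previous token, otherwise its own previous prediction; the annealing of ε and the stand-in decoder below are assumptions to keep the sketch runnable.

import numpy as np

def scheduled_sampling_decode(step_fn, targets, epsilon, rng, bos=0):
    """Unroll a decoder where, at each step, the previous input is the ground
    truth with probability epsilon and the model's own prediction otherwise.
    `step_fn(prev_token, state) -> (logits, state)` is any decoder step."""
    state, prev, outputs = None, bos, []
    for gold in targets:
        logits, state = step_fn(prev, state)
        pred = int(np.argmax(logits))
        outputs.append(pred)
        prev = gold if rng.random() < epsilon else pred  # coin flip per step
    return outputs

# Toy usage: a stand-in decoder that just emits random logits over 5 tokens.
rng = np.random.default_rng(0)
step_fn = lambda prev, state: (rng.normal(size=5), state)
print(scheduled_sampling_decode(step_fn, targets=[1, 2, 3, 4], epsilon=0.8, rng=rng))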
• 43. Sequence-level training with REINFORCE / MIXER (excerpt condensed). Sequence generation is cast as reinforcement learning: the RNN is an agent whose policy picks the next word at each step, and at the end of the sequence it observes a reward such as BLEU (up to 4-grams) or ROUGE-2, the metrics also used at test time. Training minimizes the negative expected reward
  L_θ = − Σ_{w^g_1,…,w^g_T} p_θ(w^g_1, …, w^g_T) r(w^g_1, …, w^g_T) = − E_{(w^g_1,…,w^g_T) ∼ p_θ}[ r(w^g_1, …, w^g_T) ]  (9)
where w^g_t is the word chosen by the model at step t; in practice the expectation is approximated with a single sample from the policy, and the gradient decomposes as ∂L_θ/∂θ = Σ_t (∂L_θ/∂o_t)(∂o_t/∂θ) (10). Figure 3 illustrates the End-to-End BackProp baseline: the first steps of the unrolled sequence are trained exactly as with cross-entropy, while in the remaining steps each module receives a sparse vector holding the k largest probabilities from the previous step, through which errors are also back-propagated; this exposes the model to its own predictions but still optimizes XENT at each step, with no explicit sequence-level supervision. MIXER (Mixed Incremental Cross-Entropy Reinforce) is the proposed sequence-level training algorithm that avoids the exposure-bias problem. A second, heavily truncated excerpt describes a two-step procedure that computes candidate sequences and violations in a forward pass and back-propagates through the seq2seq RNNs in a backward pass, illustrated with beam candidates ending in "a red dog runs quickly today".
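A single-sample REINFORCE sketch for the loss above: sample a sequence from the current policy, observe a terminal reward (BLEU/ROUGE in the slide, a toy reward here), and scale the cross-entropy-style gradient of the sampled tokens by that reward. The variance-reducing baseline used by MIXER and its mixed XENT/REINFORCE schedule are omitted; the toy model and reward are stand-ins.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_sequence_gradient(logits_fn, reward_fn, T, theta, rng):
    """Single-sample estimate of the gradient of -E[r] w.r.t. the per-step
    logits: sample w_1..w_T from the policy, then scale the score-function
    term (p_t - onehot(w_t)) by the terminal reward."""
    prefix, grads = [], []
    for _ in range(T):
        logits = logits_fn(prefix, theta)
        p = softmax(logits)
        w = int(rng.choice(len(p), p=p))
        onehot = np.zeros(len(p))
        onehot[w] = 1.0
        grads.append(p - onehot)          # d(-log p(w_t)) / d logits_t
        prefix.append(w)
    r = reward_fn(prefix)                 # e.g. BLEU/ROUGE against a reference
    return [r * g for g in grads], prefix, r

# Toy usage with a fixed random "model" and a reward that counts token 3.
rng = np.random.default_rng(0)
theta = rng.normal(size=6)
logits_fn = lambda prefix, th: th
reward_fn = lambda seq: seq.count(3) / len(seq)
grads, sample, r = reinforce_sequence_gradient(logits_fn, reward_fn, 5, theta, rng)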