Slides presented at the ICLR/ICML 2019 reading group, introducing four ICLR 2019 oral papers on NLP.
Papers covered:
Shen, Yikang, et al. "Ordered neurons: Integrating tree structures into recurrent neural networks." In Proc. of ICLR, 2019.
Li, Xiang, et al. "Smoothing the Geometry of Probabilistic Box Embeddings." In Proc. of ICLR, 2019.
Wu, Felix, et al. "Pay less attention with lightweight and dynamic convolutions." In Proc. of ICLR, 2019.
Mao, Jiayuan, et al. "The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision." In Proc. of ICLR, 2019.
6. Papers introduced
• Shen et al. "Ordered neurons: Integrating tree structures into recurrent neural networks"
• Li et al. "Smoothing the Geometry of Probabilistic Box Embeddings"
• Wu et al. "Pay less attention with lightweight and dynamic convolutions"
• Mao et al. "The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision"
8. Paper 1
• Shen, Yikang, et al. "Ordered neurons: Integrating tree structures into recurrent neural networks." ICLR 2019.
• Proposes ON-LSTM, an LSTM variant whose neurons are ordered so that the gating mechanism induces a latent tree structure over the input.
9. Background and motivation
• Natural language is hierarchical: a sentence has an underlying constituency tree, but a plain LSTM gives all hidden units equal status, with no explicit hierarchy among them.
• Idea: impose an order on the neurons so that high-ranking neurons keep long-term, global information (nodes near the root of the tree) and low-ranking neurons keep short-term, local information (small constituents).
• If the ordering works, tree structure can be read off the trained model without any syntactic supervision.
10. ON-LSTM: the standard LSTM skeleton
• ON-LSTM keeps the usual LSTM gates and output:
• ① forget gate $f_t$ and ② input gate $i_t$ control erasing and writing of the cell state
• ③ output gate $o_t$, candidate cell $\hat{c}_t$, and hidden state $h_t = o_t \circ \tanh(c_t)$ are unchanged (Eqs. 1-5 below); only the update rule for $c_t$ is replaced
(Excerpt: Sec. 4 of the paper)
4 ON-LSTM
In this section, we present a new RNN unit, ON-LSTM ("ordered neurons LSTM"). The new model uses an architecture similar to the standard LSTM, reported below:
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \qquad (1)$$
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \qquad (2)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \qquad (3)$$
$$\hat{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c) \qquad (4)$$
$$h_t = o_t \circ \tanh(c_t) \qquad (5)$$
The difference with the LSTM is that we replace the update function for the cell state $c_t$ with a new function that will be explained in the following sections. The forget gates $f_t$ and input gates $i_t$ are used to control the erasing and writing operation on cell states $c_t$, as before. Since the gates in the LSTM act independently on each neuron, it may be difficult in general to discern a hierarchy of information between the neurons. To this end, we propose to make the gate for each neuron dependent on the others by enforcing the order in which neurons should be updated.
High-ranking neurons store long-term or global information that will last anywhere from several time steps to the entire sentence, representing nodes near the root of the tree. Low-ranking neurons encode short-term or local information that only lasts one or a few time steps, representing smaller constituents, as shown in Figure 2(b). The differentiation between high-ranking and low-ranking neurons is learnt in a completely data-driven fashion by controlling the update frequency of single neurons: to erase (or update) high-ranking neurons, the model should first erase (or update) all lower-ranking neurons.
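As a reference point, here is a minimal sketch (ours, not the authors' code) of the gate computations in Eqs. (1)-(4). The cell-state update is exactly the part ON-LSTM replaces, so it is left out here.

```python
import torch

def lstm_gates(x_t, h_prev, W, U, b):
    """Eqs. (1)-(4): one fused projection producing f_t, i_t, o_t, c_hat_t.
    Shapes (our assumption): W (4h, input), U (4h, h), b (4h,)."""
    z = W @ x_t + U @ h_prev + b
    f, i, o, c_hat = z.chunk(4)
    return torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o), torch.tanh(c_hat)
```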
11. ON-LSTM: master gates (overview)
• On top of the standard gates, ON-LSTM adds:
• ① a master forget gate $\tilde{f}_t$, monotonically increasing from 0 to 1 along the neuron order
• ② a master input gate $\tilde{i}_t$, monotonically decreasing from 1 to 0
• ③ effective gates $\hat{f}_t, \hat{i}_t$ that combine master and standard gates through the overlap $\omega_t$ (Eqs. 11-13)
• ④ the new cell update $c_t = \hat{f}_t \circ c_{t-1} + \hat{i}_t \circ \hat{c}_t$ (Eq. 14)
(Excerpt: Secs. 4.1-4.2 of the paper)
4.1 Activation function: cumax()
The gate vector has the form $g = (0, \dots, 0, 1, \dots, 1)$: $g_k = 1$ iff the split point $d$ falls at or before the $k$-th position, that is $d \le k \equiv (d = 0) \lor (d = 1) \lor \cdots \lor (d = k)$. Since the categories are mutually exclusive, we can do this by computing the cumulative distribution function:
$$p(g_k = 1) = p(d \le k) = \sum_{i \le k} p(d = i) \qquad (8)$$
Ideally, $g$ should take the form of a discrete variable. Unfortunately, computing gradients when a discrete variable is included in the computation graph is not trivial (Schulman et al., 2015), so in practice we use a continuous relaxation by computing the quantity $p(d \le k)$, obtained by taking a cumulative sum of the softmax. As $g_k$ is binary, this is equivalent to computing $E[g_k]$. Hence, $\hat{g} = E[g]$.
4.2 Structured gating mechanism
Based on the cumax() function, we introduce a master forget gate $\tilde{f}_t$ and a master input gate $\tilde{i}_t$:
$$\tilde{f}_t = \text{cumax}(W_{\tilde{f}} x_t + U_{\tilde{f}} h_{t-1} + b_{\tilde{f}}) \qquad (9)$$
$$\tilde{i}_t = 1 - \text{cumax}(W_{\tilde{i}} x_t + U_{\tilde{i}} h_{t-1} + b_{\tilde{i}}) \qquad (10)$$
Following the properties of the cumax() activation, the values in the master forget gate are monotonically increasing from 0 to 1, and those in the master input gate are monotonically decreasing from 1 to 0. These gates serve as high-level control for the update operations of cell states. Using the master gates, we define a new update rule:
$$\omega_t = \tilde{f}_t \circ \tilde{i}_t \qquad (11)$$
$$\hat{f}_t = f_t \circ \omega_t + (\tilde{f}_t - \omega_t) = \tilde{f}_t \circ (f_t \circ \tilde{i}_t + 1 - \tilde{i}_t) \qquad (12)$$
$$\hat{i}_t = i_t \circ \omega_t + (\tilde{i}_t - \omega_t) = \tilde{i}_t \circ (i_t \circ \tilde{f}_t + 1 - \tilde{f}_t) \qquad (13)$$
$$c_t = \hat{f}_t \circ c_{t-1} + \hat{i}_t \circ \hat{c}_t \qquad (14)$$
In order to explain the intuition behind the new update rule, we assume that the master gates are binary:
• The master forget gate $\tilde{f}_t$ controls the erasing behavior of the model. Suppose $\tilde{f}_t = (0, \dots, 0, 1, \dots, 1)$ and the split point is $d^f_t$. Given Eqs. (12) and (14), the information stored in the first $d^f_t$ neurons of the previous cell state $c_{t-1}$ will be completely erased. In a parse tree (e.g. Figure 2(a)), this operation is akin to closing previous constituents. A large number of zeroed neurons, i.e. a large $d^f_t$, represents the end of a high-level constituent in the parse tree, as most of the information in the state will be discarded. Conversely, a small $d^f_t$ …
12. ON-LSTM: the cumax() activation
• The master gates are produced by cumax(·), the cumulative sum of a softmax, which yields a soft, monotonically non-decreasing gate vector in $[0, 1]$ (the expectation of the binary gate $g = (0, \dots, 0, 1, \dots, 1)$):
$$\hat{p}_k = \frac{\exp(z_k)}{\sum_{k'} \exp(z_{k'})}, \qquad \hat{g}_k = \sum_{i \le k} \hat{p}_i$$
$$\text{cumax}(z) = \text{cumsum}(\text{softmax}(z))$$
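A minimal sketch of cumax in PyTorch (the function name matches the paper; the code itself is ours):

```python
import torch
import torch.nn.functional as F

def cumax(z, dim=-1):
    # cumax(z) = cumsum(softmax(z)): monotonically non-decreasing values
    # in (0, 1], a continuous relaxation of the binary gate (0,...,0,1,...,1).
    return torch.cumsum(F.softmax(z, dim=dim), dim=dim)

# e.g. cumax(torch.tensor([1.0, 2.0, 0.5])) ~= [0.23, 0.86, 1.00]
```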
13. ON-LSTM: master gates as binary vectors (intuition)
• Assume the master gates are binary. The master forget gate $\tilde{f}_t = (0, \dots, 0, 1, \dots, 1)$ erases everything below its split point $d^f_t$: in the parse tree this closes the previous constituents, and a large $d^f_t$ marks the end of a high-level constituent.
• Symmetrically, the master input gate $\tilde{i}_t = (1, \dots, 1, 0, \dots, 0)$ writes new information into the neurons below its split point $d^i_t$.
(Figure 2 of the paper: correspondences between a constituency parse tree and the hidden states of the proposed ON-LSTM for a sequence $S = (x_1, x_2, x_3)$; high-ranking neuron groups update less frequently, lower groups more frequently.)
14. ON-LSTM: where the master gates overlap
• The product $\omega_t = \tilde{f}_t \circ \tilde{i}_t$ (Eq. 11) is non-zero only on the segment where the two master gates overlap, i.e. where information kept from the previous state and information written from the current input compete.
• On that segment the standard gates $f_t$ and $i_t$ arbitrate, exactly as in an ordinary LSTM; outside it, the master gates decide alone ($\hat{f}_t = \tilde{f}_t$ and $\hat{i}_t = \tilde{i}_t$ wherever $\omega_t = 0$).
15. ON-LSTM: computing the master gates
• ① $\tilde{f}_t = \text{cumax}(W_{\tilde{f}} x_t + U_{\tilde{f}} h_{t-1} + b_{\tilde{f}})$, rising from 0 to 1 (Eq. 9)
• ② $\tilde{i}_t = 1 - \text{cumax}(W_{\tilde{i}} x_t + U_{\tilde{i}} h_{t-1} + b_{\tilde{i}})$, falling from 1 to 0 (Eq. 10)
• ③ overlap $\omega_t = \tilde{f}_t \circ \tilde{i}_t$ (Eq. 11)
16. ON-LSTM: effective forget gate
• ① From the master gates $\tilde{f}_t, \tilde{i}_t$ and ② their overlap $\omega_t$,
• ③ combine with the standard forget gate (Eq. 12): $\hat{f}_t = f_t \circ \omega_t + (\tilde{f}_t - \omega_t)$ — inside the overlap the standard $f_t$ decides; outside it, $\tilde{f}_t$ does.
17. ON-LSTM: effective input gate
• ④ Symmetrically (Eq. 13): $\hat{i}_t = i_t \circ \omega_t + (\tilde{i}_t - \omega_t)$ — inside the overlap the standard $i_t$ decides; outside it, $\tilde{i}_t$ does.
18. ON-LSTM: the full update
• ① master gates $\tilde{f}_t, \tilde{i}_t$ (Eqs. 9-10); ② overlap $\omega_t = \tilde{f}_t \circ \tilde{i}_t$ (Eq. 11)
• ③ $\hat{f}_t = f_t \circ \omega_t + (\tilde{f}_t - \omega_t)$ (Eq. 12); ④ $\hat{i}_t = i_t \circ \omega_t + (\tilde{i}_t - \omega_t)$ (Eq. 13)
• ⑤ cell update $c_t = \hat{f}_t \circ c_{t-1} + \hat{i}_t \circ \hat{c}_t$ (Eq. 14) and output $h_t = o_t \circ \tanh(c_t)$, as in a standard LSTM
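Putting Eqs. (1)-(5) and (9)-(14) together, one ON-LSTM step can be sketched as below. This is a minimal illustration, not the authors' code: the fused projection, all names, and the omission of master-gate chunking (the paper groups hidden units to shrink the master gates) are our simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cumax(z):
    return torch.cumsum(F.softmax(z, dim=-1), dim=-1)

class ONLSTMCell(nn.Module):
    """Sketch of one ON-LSTM step (Eqs. 1-5 and 9-14)."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        # 4 standard gates + 2 master gates from one linear map
        self.proj = nn.Linear(input_size + hidden_size, 6 * hidden_size)

    def forward(self, x_t, state):
        h_prev, c_prev = state
        f, i, o, c_hat, mf, mi = self.proj(
            torch.cat([x_t, h_prev], dim=-1)).chunk(6, dim=-1)
        f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)
        c_hat = torch.tanh(c_hat)                 # Eq. (4)
        f_tilde = cumax(mf)                       # Eq. (9): rises 0 -> 1
        i_tilde = 1.0 - cumax(mi)                 # Eq. (10): falls 1 -> 0
        omega = f_tilde * i_tilde                 # Eq. (11): overlap
        f_hat = f * omega + (f_tilde - omega)     # Eq. (12)
        i_hat = i * omega + (i_tilde - omega)     # Eq. (13)
        c_t = f_hat * c_prev + i_hat * c_hat      # Eq. (14)
        h_t = o * torch.tanh(c_t)                 # Eq. (5)
        return h_t, (h_t, c_t)
```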
19-23. Reading a tree out of ON-LSTM (unsupervised parsing)
• ① At each step, estimate the split point of the master forget gate from its sum: $\hat{d}^f_t = D_m - \sum_{k=1}^{D_m} \tilde{f}_{tk}$, where $D_m$ is the number of master-gate units; since $\tilde{f}_t$ rises from 0 to 1, the sum counts the units above the split point.
• ② A larger $\hat{d}^f_t$ means more neurons are being closed at step $t$, i.e. a higher-level constituent boundary between adjacent tokens; splitting the sentence top-down at the largest $\hat{d}^f_t$ and recursing induces a binary parse tree.
24. Experiments
• ON-LSTM is evaluated on language modeling (Penn Treebank), unsupervised constituency parsing, targeted syntactic evaluation, and logical inference.
• Language-modeling perplexity matches or improves on comparable LSTM baselines, and the trees read off the master forget gate agree well with gold parses despite the model never seeing tree supervision.
25. Summary (Ordered Neurons)
• Integrates tree structure into an LSTM by ordering the neurons and gating them with cumax-based master gates.
• The hierarchy is learned from raw text alone; latent trees can be recovered from the gates without syntactic supervision.
• Strong language-modeling results while providing interpretable structure.
26. Next paper
• Li, Xiang, et al. "Smoothing the Geometry of Probabilistic Box Embeddings." In Proc. of ICLR, 2019.
27. Smoothing the Geometry of Probabilistic Box Embeddings
• Goal: embed the concepts of an ontology as regions so that entailment and (conditional) probabilities fall out of the geometry.
• Box embeddings represent each concept as an axis-aligned box; containment and overlap of boxes express entailment.
(Excerpt) Figure 1: Comparison between the Order Embedding (vector lattice) and Box Embedding representations for a simple ontology: (a) Ontology, (b) Order Embeddings, (c) Box Embeddings. Regions represent concepts and overlaps represent their entailment. Shading represents density in the probabilistic case.
3.1 Partial orders and lattices: A non-strict partially ordered set (poset) is a pair $(P, \preceq)$, where $P$ is a set and $\preceq$ is a binary relation. For all $a, b, c \in P$, …
28-29. Background: families of ontology embeddings
• Point (vector) embeddings: similarity is symmetric, so asymmetric relations such as entailment are awkward to express.
• Order embeddings: each concept is a cone under the reversed-product order; entailment is region inclusion, but there is no probability measure.
• Probabilistic order embeddings (POE): attach densities to the cones, giving probabilities, but disjoint concepts cannot be modeled well, since every pair of cones overlaps.
• Box embeddings: each concept is a box; inclusion gives entailment, overlap volume gives a joint probability, and genuinely disjoint concepts are representable.
30-32. Desiderata for region representations
• Candidate region representations compared on these slides: vectors, Gaussian densities, cones (order embeddings), and boxes.
• Desired properties: the concept is a region with a well-defined measure; disjointness is expressible; and the family is closed under intersection, so the meet of two concepts is again a region of the same family.
• Boxes satisfy all of these: the intersection of two boxes is again a box, its volume gives the joint probability, and non-overlapping boxes express disjointness.
33. Probabilistic box embeddings (Vilnis et al., 2018)
• A concept $x$ is a box, i.e. a pair of corners $(x_m, x_M) \in [0, 1]^d$ with $x_{m,i} \le x_{M,i}$.
• Volume defines the probability model:
$$P(x) = \prod_i (x_{M,i} - x_{m,i})$$
$$P(x, y) = \prod_i \max\bigl(0, \min(x_{M,i}, y_{M,i}) - \max(x_{m,i}, y_{m,i})\bigr)$$
$$P(x \mid y) = \frac{P(x, y)}{P(y)}$$
• The marginals $P(x)$, $P(y)$ are box volumes; the joint $P(x, y)$ is the volume of the intersection box; the conditional is their ratio.
(Figure 1 of the paper: the toy probabilistic lattice used in Section 5.1; darker color corresponds to more unary marginal probability, and the associated CPD is obtained by a weighted aggregation of leaf elements. Panels: (a) POE lattice, (b) Box lattice.)
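A minimal NumPy sketch of these quantities (our code; boxes are given as (min, max) corner arrays):

```python
import numpy as np

def volume(mn, mx):
    """P(x): volume of the box with corners mn <= mx in [0,1]^d."""
    return float(np.prod(np.maximum(mx - mn, 0.0)))

def joint(x_mn, x_mx, y_mn, y_mx):
    """P(x, y): volume of the intersection box (hinge overlap)."""
    return volume(np.maximum(x_mn, y_mn), np.minimum(x_mx, y_mx))

def conditional(x_mn, x_mx, y_mn, y_mx):
    """P(x | y) = P(x, y) / P(y); equals 1 when box x contains box y."""
    return joint(x_mn, x_mx, y_mn, y_mx) / volume(y_mn, y_mx)
```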
34-36. Problem: no gradient for disjoint boxes
• Training adjusts box corners so that the model conditionals $P(x \mid y)$ match target probabilities.
• But $P(x, y)$ contains a hard $\max(0, \cdot)$: when the boxes of $x$ and $y$ are disjoint, $P(x, y) = 0$ exactly, the gradient with respect to the corners is zero, and the pair can never be pulled together.
• This plateau is common early in training and is especially harmful for sparse data with many negative pairs.
37-39. Proposed: softplus smoothing
• Replace the hard hinge with $\mathrm{softplus}(z) = \log(1 + e^z)$ so that even disjoint boxes receive gradient:
$$P(x) = \prod_i \mathrm{softplus}(x_{M,i} - x_{m,i})$$
$$P(x, y) = \prod_i \mathrm{softplus}\bigl(\min(x_{M,i}, y_{M,i}) - \max(x_{m,i}, y_{m,i})\bigr)$$
$$P(x \mid y) = \frac{P(x, y)}{P(y)}$$
• The smoothed approach retains the inductive bias of the original box model, is equivalent in the limit, and satisfies the necessary condition $p(x, x) = p(x)$; the softplus overlap behaves much better for highly disjoint boxes than a Gaussian smoothing, while also preserving the meet property (paper Figure 3).
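The corresponding smoothed quantities, sketched on top of the NumPy helpers above (the sharpness parameter `beta` is our addition for illustration; hinge behavior is recovered as beta grows):

```python
def softplus(z, beta=1.0):
    """log(1 + exp(beta * z)) / beta; tends to max(0, z) as beta -> inf."""
    return np.log1p(np.exp(beta * np.asarray(z))) / beta

def smoothed_volume(mn, mx, beta=1.0):
    """Smoothed P(x): softplus side lengths instead of hinged ones."""
    return float(np.prod(softplus(mx - mn, beta)))

def smoothed_joint(x_mn, x_mx, y_mn, y_mx, beta=1.0):
    """Smoothed P(x, y): strictly positive even for disjoint boxes,
    so the gradient never vanishes completely."""
    gap = np.minimum(x_mx, y_mx) - np.maximum(x_mn, y_mn)
    return float(np.prod(softplus(gap, beta)))
```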
40. Experiments: WordNet
• Task: hypernym prediction on WordNet (~837k edges); negative examples are generated by swapping one of the terms to a random word in the dictionary.
• The smoothed box model performs nearly as well as the original box lattice in test accuracy while requiring less hyper-parameter tuning, and its advantage grows as the data become sparser.

Table 4: Classification accuracy on the WordNet test set.
  Method            Test Accuracy %
  transitive        88.2
  word2gauss        86.6
  OE                90.6
  Li et al. (2017)  91.3
  POE               91.6
  Box               92.2
  Smoothed Box      92.0

(Excerpt, Sec. 5.2 Imbalanced WordNet) In order to confirm our intuition that the smoothed box model performs better in the sparse regime, we perform further experiments using different numbers of positive and negative examples from the WordNet mammal subset, comparing the box lattice, our smoothed approach, and order embeddings (OE) as a baseline. The training data is the transitive reduction of this subset of the mammal WordNet, while the dev/test is the transitive closure of the training data. The training data contains 1,176 positive examples, and the dev and test sets contain 209 positive examples. Negative examples are generated randomly using the ratio stated in the table. With balanced data, all models (the OE baseline, Box, and Smoothed Box) nearly match the full transitive closure. As the number of negative examples increases, the performance drops for the original box model, but Smoothed Box still outperforms OE and Box in all settings. This superior performance on imbalanced data is important for e.g. real-world entailment graph learning, where the number of negatives greatly outweighs the positives.

Table 5: F1 scores of the box lattice, order embeddings, and the smoothed model, for different levels of label imbalance on the WordNet mammal subset.
  Positive:Negative  Box     OE      Smoothed Box
  1:1                0.9905  0.9976  1.0
  1:2                0.8982  0.9139  1.0
  1:6                0.6680  0.6640  0.9561
  1:10               0.5495  0.5897  0.8800

(Figure 3 of the paper compares overlap functions for two boxes of width 0.3 as a function of their centers: (a) standard hinge overlap, (b) Gaussian overlap, (c) softplus overlap. To achieve high overlap, the Gaussian model must drastically lower its temperature, causing vanishing gradients in the tails.)
41. Experiments: Flickr and MovieLens
• Flickr caption entailment: ground-truth conditionals are relative co-occurrence frequencies, $P(x \mid y) = P(x, y)/P(y) = \#(x \text{ and } y \text{ co-occur}) \,/\, \#(y \text{ occurs})$, and the models are trained to reproduce them.
• The smoothed model fits the gold conditionals better than the original box model:

Table 6: KL and Pearson correlation between model and gold probability (Flickr).
  Model         KL     Pearson R
  Box           0.050  0.900
  Smoothed Box  0.036  0.917

(Excerpt, MovieLens) We randomly pick 100K conditional probabilities for training data and 10k probabilities for dev and test data. We compare with several baselines: low-rank matrix factorization, complex bilinear factorization (Trouillon et al., 2016), and two hierarchical embedding methods, POE (Lai & Hockenmaier, 2017) and the Box Lattice (Vilnis et al., 2018). Since the training matrix is asymmetric, we used separate embeddings for target and conditioned movies. For the complex bilinear model, we added one additional vector of parameters to capture the "imply" relation. We evaluate on the test set using KL divergence, Pearson correlation, and Spearman correlation with the ground truth probabilities. From the results in Table 7, the smoothed box embedding method outperforms the original box lattice as well as all other baselines, especially in Spearman correlation, the most relevant metric for recommendation, a ranking task.

Table 7: Performance of the smoothed model, the original box model, and several baselines on MovieLens.
  Model                           KL      Pearson R  Spearman R
  Matrix Factorization            0.0173  0.8549     0.8374
  Complex Bilinear Factorization  0.0141  0.8771     0.8636
  POE                             0.0170  0.8548     0.8511
  Box                             0.0147  0.8775     0.8768
  Smoothed Box                    0.0138  0.8985     0.8977

(Paper conclusion: an approach to smoothing the energy and optimization landscape of probabilistic box embeddings, with a theoretical justification for the smoothing.)
42. Next paper
• Wu, Felix, et al. "Pay less attention with lightweight and dynamic convolutions." In Proc. of ICLR, 2019.
43. Pay Less Attention with Lightweight and Dynamic Convolutions
• Question: is content-based self-attention over the entire context really necessary?
• Proposes lightweight convolutions (softmax-normalized, weight-shared depthwise convolutions) and dynamic convolutions (kernels predicted from the current time-step only); on the WMT English-German test set dynamic convolutions achieve a new state of the art of 29.7 BLEU.
(Excerpt, Sec. 1 Introduction) There has been much recent progress in sequence modeling through recurrent neural networks (RNN; Sutskever et al. 2014; Bahdanau et al. 2015; Wu et al. 2016), convolutional networks (CNN; Kalchbrenner et al. 2016; Gehring et al. 2016; 2017; Kaiser et al. 2017) and self-attention models (Paulus et al., 2017; Vaswani et al., 2017). RNNs integrate context information by updating a hidden state at every time-step, CNNs summarize a fixed-size context through multiple layers, whereas self-attention directly summarizes all context. Attention assigns context elements attention weights which define a weighted sum over context representations (Bahdanau et al., 2015; Sukhbaatar et al., 2015; Chorowski et al., 2015; Luong et al., 2015). Source-target attention summarizes information from another sequence such as in machine translation, whereas self-attention operates over the current sequence. Self-attention has been formulated as content-based, where attention weights are computed by comparing the current time-step to all elements in the context (Figure 1a). The ability to compute comparisons over such unrestricted context sizes is seen as a key characteristic of self-attention (Vaswani et al., 2017).
Figure 1: Self-attention computes attention weights by comparing all pairs of elements to each other (a), whereas dynamic convolutions predict separate kernels for each time-step (b).
44. Background: Self-Attention
• Attention weights are computed by comparing the current time-step to all elements in the context; each output is a weighted sum over the whole sequence
[Figure: every position among x1 … x5 is compared with every other position, so the cost grows quadratically with sequence length.]
45. Background: Convolution (CNN)
• A kernel of fixed width k slides over the sequence; the same weights are reused at every time-step, so the context is a fixed-size window
[Figure: a single shared kernel applied along x1 … x5.]
46. Toward LightConv: Depthwise Convolution
• The convolution is performed independently over every channel, reducing the number of parameters from d²k to dk
[Figure: per-channel kernels applied along x1 … x5.]
47. LightConv: Weight Sharing + Softmax Normalization
• Channels are tied into H groups (only H×k parameters in total) and the kernel weights are softmax-normalized across the temporal dimension
[Figure: H shared, softmax-normalized kernels applied along x1 … x5.]
48. Kernel Parameters at a Glance
• Regular convolution: W ∈ ℝ^{d×d×k}
• Depthwise convolution: W ∈ ℝ^{d×k}
• LightConv (weight sharing over H heads): W ∈ ℝ^{H×k}
• DynamicConv: a kernel in ℝ^{H×k} is predicted at every time-step by f : ℝ^d → ℝ^{H×k} with weights W^Q ∈ ℝ^{H×k×d}
• For d = 1024, k = 7: 7,340,032 → 7,168 → 112 (H = 16) weights
… attention weights from the previous time-step into account (Chorowski et al., 2015; Luong et al., 2015). Shen et al. (2018a) reduce complexity by performing attention within blocks of the input sequence and Shen et al. (2017; 2018b) perform more fine-grained attention over each feature.
Our experiments show that lightweight convolutions perform competitively to strong self-attention results and that dynamic convolutions can perform even better. On WMT English-German translation dynamic convolutions achieve a new state of the art of 29.7 BLEU, on WMT English-French they match the best reported result in the literature, and on IWSLT German-English dynamic convolutions outperform self-attention by 0.8 BLEU. Dynamic convolutions achieve 20% faster runtime than a highly-optimized self-attention baseline. For language modeling on the Billion word benchmark dynamic convolutions perform as well as or better than self-attention and on CNN-DailyMail abstractive document summarization we outperform a strong self-attention model.
2 BACKGROUND
We first outline sequence to sequence learning and self-attention. Our work builds on non-separable
convolutions as well as depthwise separable convolutions.
Sequence to sequence learning maps a source sequence to a target sequence via two separate networks such as in machine translation (Sutskever et al., 2014). The encoder network computes representations for the source sequence such as an English sentence and the decoder network autoregressively generates a target sequence based on the encoder output.
The self-attention module of Vaswani et al. (2017) applies three projections to the input X ∈ ℝ^{n×d} to obtain key (K), query (Q), and value (V) representations, where n is the number of time steps and d the input/output dimension (Figure 2a). It also defines a number of heads H where each head can learn separate attention weights over d_k features and attend to different positions. The module computes dot-products between key/query pairs, scales to stabilize training, and then softmax-normalizes the result. Finally, it computes a weighted sum using the output of the value projection (V):

    Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
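A minimal sketch of this scaled dot-product attention (PyTorch, single head, no masking or projections; an illustration only, not the fairseq implementation):

    import torch
    import torch.nn.functional as F

    def attention(Q, K, V):
        # softmax(Q K^T / sqrt(d_k)) V
        d_k = Q.size(-1)
        scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (n, n) pairwise comparisons
        return F.softmax(scores, dim=-1) @ V           # weighted sum of values

    n, d_k = 5, 64
    Q, K, V = (torch.randn(n, d_k) for _ in range(3))
    out = attention(Q, K, V)  # (n, d_k)

The (n, n) score matrix is exactly the quadratic cost that the convolutional alternatives below avoid.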
Depthwise convolutions perform a convolution independently over every channel. The number of parameters can be reduced from d²k to dk, where k is the kernel width. The output O ∈ ℝ^{n×d} of a depthwise convolution with weight W ∈ ℝ^{d×k} for element i and output dimension c is defined as:

    O_{i,c} = DepthwiseConv(X, W_{c,:}, i, c) = Σ_{j=1}^{k} W_{c,j} · X_{(i + j − ⌈(k+1)/2⌉), c}
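A direct transcription of this definition (PyTorch; a naive sketch that zero-pads outside the sequence, not the optimized kernel from the paper's codebase):

    import torch

    def depthwise_conv(X, W):
        # X: (n, d) sequence, W: (d, k) one kernel per channel.
        # O[i, c] = sum_{j=1..k} W[c, j] * X[i + j - ceil((k+1)/2), c], 0 outside.
        n, d = X.shape
        k = W.size(1)
        O = X.new_zeros(n, d)
        for j in range(1, k + 1):
            shift = j - (k + 2) // 2               # (k + 2) // 2 == ceil((k+1)/2)
            lo, hi = max(0, -shift), min(n, n - shift)
            O[lo:hi] += W[:, j - 1] * X[lo + shift:hi + shift]
        return O

    X, W = torch.randn(10, 8), torch.randn(8, 3)
    out = depthwise_conv(X, W)  # (10, 8)

For odd k this matches torch.nn.functional.conv1d with groups=d and padding=k//2.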
3 LIGHTWEIGHT CONVOLUTIONS
In this section, we introduce LightConv, a depthwise convolution which shares certain output channels and whose weights are normalized across the temporal dimension using a softmax. Compared to self-attention, LightConv has a fixed context window and it determines the importance of context elements with a set of weights that do not change over time steps. We will show that models equipped with lightweight convolutions show better generalization compared to regular convolutions and that they can be competitive to state-of-the-art self-attention models (§6). This is surprising because the common belief is that content-based self-attention mechanisms are crucial to obtaining state-of-the-art results in natural language processing applications. Furthermore, the low computational profile of LightConv enables us to formulate efficient dynamic convolutions (§4).
[Figure 2 panels: (a) Self-attention — Linear projections to Q, K, V; MatMul(Q, K); Scale; SoftMax; MatMul with V; Linear. (b) Lightweight convolution — Linear; GLU; LConv; Linear. (c) Dynamic convolution — Linear; GLU; LConv with dynamically predicted weights; Linear.]
Figure 2: Illustration of self-attention, lightweight convolutions and dynamic convolutions.
LightConv computes the following for the i-th element in the sequence and output channel c:

    LightConv(X, W_{⌈cH/d⌉,:}, i, c) = DepthwiseConv(X, softmax(W_{⌈cH/d⌉,:}), i, c)
Weight sharing. We tie the parameters of every subsequent number of d/H channels, which reduces the number of parameters by a factor of d/H. As an illustration, a regular convolution requires 7,340,032 (d² × k) weights for d = 1024 and k = 7, a depthwise separable convolution has 7,168 weights (d × k), and with weight sharing, H = 16, we have only 112 (H × k) weights. We will see that this vast reduction in the number of parameters is crucial to make dynamic convolutions possible on current hardware.
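Putting the pieces together, a minimal LightConv sketch (PyTorch; softmax normalization plus H-head weight sharing, assuming d is divisible by H; an illustration only):

    import torch
    import torch.nn.functional as F

    def lightconv(X, W, H):
        # X: (n, d); W: (H, k) -- only H*k parameters thanks to weight sharing.
        n, d = X.shape
        k = W.size(1)
        W = F.softmax(W, dim=-1)                     # normalize over the temporal dim k
        W = W.repeat_interleave(d // H, dim=0)       # tie every d/H consecutive channels -> (d, k)
        # depthwise convolution via conv1d with groups=d (odd k keeps length)
        out = F.conv1d(X.t().unsqueeze(0), W.unsqueeze(1), padding=k // 2, groups=d)
        return out.squeeze(0).t()                    # (n, d)

    # Parameter check from the text: d=1024, k=7 -> regular conv d*d*k = 7,340,032,
    # depthwise d*k = 7,168, weight-shared (H=16) H*k = 112.
    X, W = torch.randn(10, 1024), torch.randn(16, 7)
    out = lightconv(X, W, H=16)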
4 DYNAMIC CONVOLUTIONS
A dynamic convolution has kernels that vary over time as a learned function of the individual time steps. A dynamic version of standard convolutions would be impractical for current GPUs due to their large memory requirements. We address this problem by building on LightConv which drastically reduces the number of parameters (§3).
DynamicConv takes the same form as LightConv but uses a time-step dependent kernel that is computed using a function f : ℝ^d → ℝ^{H×k}:

    DynamicConv(X, i, c) = LightConv(X, f(X_i)_{h,:}, i, c)

We model f with a simple linear module with learned weights W^Q ∈ ℝ^{H×k×d}, i.e., f(X_i)_{h,j} = Σ_{c=1}^{d} W^Q_{h,j,c} X_{i,c}.
Similar to self-attention, DynamicConv changes the weights assigned to context elements over time. However, the weights of DynamicConv do not depend on the entire context; they are a function of the current time-step only, whereas self-attention requires a quadratic number of operations in the sentence length to compute its weights.
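A corresponding DynamicConv sketch (PyTorch), predicting a softmax-normalized kernel from the current time-step only; the softmax matches the ablation note in Table 3 below that training diverges without it (a naive loop, not the optimized CUDA kernel):

    import torch
    import torch.nn.functional as F

    def dynamicconv(X, WQ, H):
        # X: (n, d); WQ: (H, k, d) weights of the linear kernel predictor:
        # f(X_i)[h, j] = sum_c WQ[h, j, c] * X[i, c]
        n, d = X.shape
        k = WQ.size(1)
        kern = torch.einsum('hkc,nc->nhk', WQ, X)     # one kernel per time-step (n, H, k)
        kern = F.softmax(kern, dim=-1)                # softmax-normalize over k
        kern = kern.repeat_interleave(d // H, dim=1)  # share heads over d/H channels -> (n, d, k)
        O = X.new_zeros(n, d)
        for j in range(1, k + 1):
            shift = j - (k + 2) // 2
            lo, hi = max(0, -shift), min(n, n - shift)
            O[lo:hi] += kern[lo:hi, :, j - 1] * X[lo + shift:hi + shift]
        return O

    X, WQ = torch.randn(10, 64), 0.1 * torch.randn(8, 3, 64)
    out = dynamicconv(X, WQ, H=8)  # (10, 64)

Predicting the kernels costs O(n) in sequence length, in contrast with the O(n²) attention score matrix.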
49. Ablation on WMT English-German (newstest2013)
• Starting from the self-attention baseline and a plain CNN, design choices are added one at a time: wider kernels, DropConnect, weight sharing, softmax normalization (LightConv), and dynamic weights (DynamicConv)
Model Param BLEU Sent/sec
Vaswani et al. (2017) 213M 26.4 -
Self-attention baseline (k=inf, H=16) 210M 26.9 ± 0.1 52.1 ± 0.1
Self-attention baseline (k=3,7,15,31x3, H=16) 210M 26.9 ± 0.3 54.9 ± 0.2
CNN (k=3) 208M 25.9 ± 0.2 68.1 ± 0.3
CNN Depthwise (k=3, H=1024) 195M 26.1 ± 0.2 67.1 ± 1.0
+ Increasing kernel (k=3,7,15,31x4, H=1024) 195M 26.4 ± 0.2 63.3 ± 0.1
+ DropConnect (H=1024) 195M 26.5 ± 0.2 63.3 ± 0.1
+ Weight sharing (H=16) 195M 26.5 ± 0.1 63.7 ± 0.4
+ Softmax-normalized weights [LightConv] (H=16) 195M 26.6 ± 0.2 63.6 ± 0.1
+ Dynamic weights [DynamicConv] (H=16) 200M 26.9 ± 0.2 62.6 ± 0.4
Note: DynamicConv(H=16) w/o softmax-normalization 200M diverges
AAN decoder + self-attn encoder 260M 26.8 ± 0.1 59.5 ± 0.1
AAN decoder + AAN encoder 310M 22.5 ± 0.1 59.2 ± 2.1
Table 3: Ablation on WMT English-German newstest2013. (+) indicates that a result includes all
preceding features. Speed results based on beam size 4, batch size 256 on an NVIDIA P100 GPU.
6.2 MODEL ABLATION
In this section we evaluate the impact of the various choices we made for LightConv (§3) and DynamicConv (§4). We first show that limiting the maximum context size of self-attention has no impact on validation accuracy (Table 3). Note that our baseline is stronger than the original result of Vaswani et al. (2017). Next, we replace self-attention blocks with non-separable convolutions (CNN) with kernel size 3 and input/output dimension d = 1024. The CNN block has no input and output projections compared to the baseline and we add one more encoder layer to assimilate the parameter count. This CNN with a narrow kernel trails self-attention by 1 BLEU. We improve this result by switching to a depthwise separable convolution (CNN Depthwise) with input and output projections of size d = 1024. When we progressively increase the kernel width from 3 to 31 in higher layers, accuracy improves to 26.4 BLEU (Table 3).
50. Pay Less Attention: Summary
• Content-based self-attention is not strictly necessary: very lightweight, softmax-normalized convolutions are competitive with strong self-attention models
• Dynamic convolutions, conditioned only on the current time-step, perform even better (a new state of the art of 29.7 BLEU on WMT En-De) and run about 20% faster than an optimized self-attention baseline
51. The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences from Natural Supervision (Mao+)
• Learns visual concepts, word meanings, and the semantic parsing of sentences jointly, from images paired only with questions and answers (no program or attribute annotations)
• Combines object-based visual representations with explicit symbolic program execution over learned concept embeddings (Table 1)
• Concept embeddings are trained by back-propagation through the differentiable executor; the semantic parser is trained with REINFORCE
Models | Visual Features | Semantics | Extra Labels (# Prog. / Attr.) | Inference
FiLM (Perez et al., 2018) | Convolutional | Implicit | 0 / No | Feature Manipulation
IEP (Johnson et al., 2017b) | Convolutional | Explicit | 700K / No | Feature Manipulation
MAC (Hudson & Manning, 2018) | Attentional | Implicit | 0 / No | Feature Manipulation
Stack-NMN (Hu et al., 2018) | Attentional | Implicit | 0 / No | Attention Manipulation
TbD (Mascharka et al., 2018) | Attentional | Explicit | 700K / No | Attention Manipulation
NS-VQA (Yi et al., 2018) | Object-Based | Explicit | 0.2K / Yes | Symbolic Execution
NS-CL | Object-Based | Explicit | 0 / No | Symbolic Execution
Table 1: Comparison with other frameworks on the CLEVR VQA dataset, w.r.t. visual features, implicit or explicit semantics, and supervision.
[Figure: NS-CL training overview. A scene with objects 1–4 and the question "What is the shape of the red object left of the sphere?" The perception module produces an object-based Visual Representation (Obj 1–4); Semantic Parsing proposes candidate interpretations, e.g.
✓ Query(Shape, Filter(Red, Relate(Left, Filter(Sphere))))
✗ Query(Shape, Filter(Sphere, Relate(Left, Filter(Red))))
✗ Exist(AERelate(Shape, Filter(Red, Relate(Left, Filter(Sphere)))))
Symbolic Reasoning executes the chosen program (Answer: Cylinder vs. Groundtruth: Box). The concept embeddings (e.g., Sphere) receive gradients by back-propagation, while the parser is updated with REINFORCE.]
52. NS-CL: Curriculum Concept Learning
• Training starts from a DSL and program executor and proceeds through lessons of increasing difficulty before deployment on complex scenes and questions
[Figure 4: A. Curriculum concept learning — Initialized with DSL and executor → Lesson 1: object-based questions (Q: "What is the shape of the red object?" A: Cube.) → Lesson 2: relational questions (Q: "How many cubes are behind the sphere?" A: 3) → Lesson 3: more complex questions (Q: "Does the red object left of the green cube have the same shape as the purple matte thing?" A: No) → Deploy: complex scenes, complex questions. B. Illustrative execution of NS-CL — Step 1: Visual Parsing (Obj 1–4). Steps 2, 3: Semantic Parsing and Program Execution — Filter(Green Cube) → Relate(Left, Object 2) → Filter(Red) → Filter(Purple Matte) → AEQuery(Shape, Object 1, Object 3) → Output: No (0.98).]
53.–62. NS-CL: Step-by-Step Execution Example
Q: "What is the shape of the red object?" Parsed program: Filter(Red) → Query(Shape)
• Visual parsing: Mask R-CNN proposes objects and a ResNet extracts one feature vector per object (Obj:1, Obj:2) in a visual feature space
• Semantic parsing: an encoder-decoder maps the question to the program Filter(Red) → Query(Shape)
• Program execution: Filter(Red) maps each object into the color embedding space and scores its cosine similarity to the "Red" concept embedding, softly selecting the red object; Query(Shape) maps the selected object into the shape embedding space and compares it against the Cylinder / Sphere / Box concept embeddings to produce the answer
• All scores stay differentiable, so the answer loss trains the concept embeddings by back-propagation, while the choice of program is trained with REINFORCE
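To make the execution semantics concrete, here is a toy sketch of the execution step (PyTorch). All names, dimensions, and the sigmoid/cosine scoring are simplifications of NS-CL's quasi-symbolic executor, not the authors' implementation:

    import torch
    import torch.nn.functional as F

    d_vis, d_att = 256, 64
    to_color = torch.nn.Linear(d_vis, d_att)   # visual feature -> color embedding space
    to_shape = torch.nn.Linear(d_vis, d_att)   # visual feature -> shape embedding space
    concepts = {n: torch.randn(d_att) for n in ['Red', 'Cylinder', 'Sphere', 'Box']}

    objs = torch.randn(2, d_vis)               # Obj:1, Obj:2 from Mask R-CNN + ResNet

    def cos(a, b):
        return F.cosine_similarity(a, b.expand_as(a), dim=-1)

    def filter_concept(mask, space, name, tau=0.1):
        # Soft Filter: keep objects whose embedding is close to the concept.
        sim = cos(space(objs), concepts[name])      # similarity per object
        return mask * torch.sigmoid(sim / tau)      # differentiable selection mask

    def query_shape(mask):
        # Query(Shape): compare the selected object with each shape concept.
        feat = (mask.unsqueeze(1) * to_shape(objs)).sum(0) / mask.sum()
        names = ['Cylinder', 'Sphere', 'Box']
        scores = torch.stack([cos(feat, concepts[n]) for n in names])
        return names[int(scores.argmax())]

    mask = filter_concept(torch.ones(len(objs)), to_color, 'Red')   # Filter(Red)
    print(query_shape(mask))                                        # Query(Shape)

Because every operation above is differentiable in the concept embeddings and projection layers, the answer loss can train them end to end, while the discrete choice among candidate programs is handled by REINFORCE.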
63. NS-CL: Curriculum Learning (cont.)
• Concepts and parsing are learned in stages: object-level concepts from simple questions first, relational concepts next, then more complex questions; the trained model is finally deployed on complex scenes and questions
[Figure 4 (repeated from slide 52): curriculum concept learning, Lessons 1–3 and deployment, alongside an illustrative execution of NS-CL.]
64. NS-CL: Experiments
• Evaluated on the CLEVR VQA dataset (Johnson+, 2017)
• Learns concepts and semantic parsing from question-answer pairs alone, without the 700K program annotations required by IEP or TbD (cf. Table 1)
65. References
• Ordered Neurons
  • Shen, Yikang, et al. "Ordered neurons: Integrating tree structures into recurrent neural networks." in Proc. of ICLR, 2019.
  • Merity, Stephen, et al. "Regularizing and optimizing LSTM language models." in Proc. of ICLR, 2018.
  • Yang, Zhilin, et al. "Breaking the softmax bottleneck: A high-rank RNN language model." in Proc. of ICLR, 2018.
• Box Embeddings
  • Li, Xiang, et al. "Smoothing the Geometry of Probabilistic Box Embeddings." in Proc. of ICLR, 2019.
  • Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." in Proc. of NIPS, 2013.
  • Vilnis, Luke, et al. "Word representations via Gaussian embedding." in Proc. of ICLR, 2015.
  • Vendrov, Ivan, et al. "Order-embeddings of images and language." in Proc. of ICLR, 2016.
  • Lai, Alice, et al. "Learning to predict denotational probabilities for modeling entailment." in Proc. of EACL, 2017.
  • Vilnis, Luke, et al. "Probabilistic embedding of knowledge graphs with box lattice measures." in Proc. of ACL, 2018.
66. References (cont.)
• Pay Less Attention
  • Wu, Felix, et al. "Pay less attention with lightweight and dynamic convolutions." in Proc. of ICLR, 2019.
  • Vaswani, Ashish, et al. "Attention is all you need." in Proc. of NIPS, 2017.
• NS-CL
  • Mao, Jiayuan, et al. "The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision." in Proc. of ICLR, 2019.
  • Hudson, Drew A., et al. "Compositional attention networks for machine reasoning." in Proc. of ICLR, 2018.
  • Mascharka, David, et al. "Transparency by design: Closing the gap between performance and interpretability in visual reasoning." in Proc. of CVPR, 2018.
  • Yi, Kexin, et al. "Neural-symbolic VQA: Disentangling reasoning from vision and language understanding." in Proc. of NeurIPS, 2018.
  • Johnson, Justin, et al. "CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning." in Proc. of CVPR, 2017.