21. Importance-aware step width
Formulas for resetting the step width
Table 1: Importance Weight Aware Updates for Various Loss Functions

• Squared: $\ell(p,y) = (y-p)^2$; $s(h) = \frac{p-y}{x^\top x}\left(1 - e^{-h\eta x^\top x}\right)$
• Logistic: $\ell(p,y) = \log(1+e^{-yp})$; $s(h) = \frac{W\!\left(e^{h\eta x^\top x + yp + e^{yp}}\right) - h\eta x^\top x - e^{yp}}{y\, x^\top x}$ for $y \in \{-1, 1\}$
• Exponential: $\ell(p,y) = e^{-yp}$; $s(h) = \frac{yp - \log\!\left(h\eta x^\top x + e^{yp}\right)}{y\, x^\top x}$ for $y \in \{-1, 1\}$
• Logarithmic: $\ell(p,y) = y\log\frac{y}{p} + (1-y)\log\frac{1-y}{1-p}$; $s(h) = \frac{p - 1 + \sqrt{(p-1)^2 + 2h\eta x^\top x}}{x^\top x}$ if $y = 0$, $\frac{p - \sqrt{p^2 + 2h\eta x^\top x}}{x^\top x}$ if $y = 1$
• Hellinger: $\ell(p,y) = (\sqrt{p}-\sqrt{y})^2 + (\sqrt{1-p}-\sqrt{1-y})^2$; $s(h) = \frac{p - 1 + \frac{1}{4}\left(12h\eta x^\top x + 8(1-p)^{3/2}\right)^{2/3}}{x^\top x}$ if $y = 0$, $\frac{p - \frac{1}{4}\left(12h\eta x^\top x + 8p^{3/2}\right)^{2/3}}{x^\top x}$ if $y = 1$
• Hinge: $\ell(p,y) = \max(0, 1-yp)$; $s(h) = -y \min\!\left(h\eta, \frac{1-yp}{x^\top x}\right)$ for $y \in \{-1, 1\}$
• $\tau$-Quantile: $\ell(p,y) = \tau(y-p)$ if $y > p$, $(1-\tau)(p-y)$ if $y \le p$; $s(h) = -\tau \min\!\left(h\eta, \frac{y-p}{\tau x^\top x}\right)$ if $y > p$, $(1-\tau)\min\!\left(h\eta, \frac{p-y}{(1-\tau)x^\top x}\right)$ if $y \le p$
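To make the table concrete, here is a minimal Python sketch of a few of the closed-form updates (function names and example values are mine; the update is applied as $w \leftarrow w - s(h)x$):

```python
import numpy as np

def squared_update(p, y, xx, h, eta):
    # Squared-loss row of Table 1: p is the current prediction w^T x,
    # xx is x^T x, h the importance weight, eta the learning rate.
    return (p - y) / xx * (1.0 - np.exp(-h * eta * xx))

def exponential_update(p, y, xx, h, eta):
    # Exponential-loss row, labels y in {-1, +1}.
    return (y * p - np.log(h * eta * xx + np.exp(y * p))) / (y * xx)

def hinge_update(p, y, xx, h, eta):
    # Hinge-loss row: the step is capped so the prediction never
    # crosses the hinge point y * p = 1, no matter how large h is.
    return -y * min(h * eta, (1.0 - y * p) / xx)

w = np.array([0.5, -0.2])
x = np.array([1.0, 2.0])
p, xx = w @ x, x @ x
w = w - squared_update(p, y=1.0, xx=xx, h=3.0, eta=0.1) * x  # w <- w - s(h) x
```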
(6) gives a differential equation whose solution is the result of a continuous gradient descent process. As a sanity check we rederive (5) using (6). For squared loss $\partial\ell/\partial p = p - y$ and we get a linear ODE, $s'(h) = \eta\left((w - s(h)x)^\top x - y\right)$, whose solution is the squared-loss entry of Table 1.
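As a quick numerical illustration of this (my own check, with arbitrary values): Euler-integrating the ODE in tiny steps reproduces the closed form from Table 1.

```python
import numpy as np

# Euler-integrate s'(h) = eta * ((p0 - s * xx) - y), i.e. (6) for squared
# loss, where dl/dp = p - y is evaluated at the updated prediction,
# and compare against the closed form from Table 1.
p0, y, xx, eta, h = 0.3, 1.0, 2.5, 0.1, 4.0
s, steps = 0.0, 100_000
dh = h / steps
for _ in range(steps):
    s += dh * eta * ((p0 - s * xx) - y)

closed = (p0 - y) / xx * (1.0 - np.exp(-h * eta * xx))
print(s, closed)  # the two agree to ~1e-5
```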
Here $W(\cdot)$ denotes the Lambert W function. For the logarithmic and Hellinger losses the solution to (6) has no simple form for all $y \in [0, 1]$, but for $y \in \{0, 1\}$ we get the expressions in Table 1.
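The logistic entry can be checked the same way (my own sketch, arbitrary values); SciPy's `lambertw` supplies $W$:

```python
import numpy as np
from scipy.special import lambertw

p0, y, xx, eta, h = 0.3, 1.0, 2.0, 0.5, 2.0

# Closed-form logistic update from Table 1.
c = h * eta * xx + y * p0 + np.exp(y * p0)
s_closed = (lambertw(np.exp(c)).real - h * eta * xx - np.exp(y * p0)) / (y * xx)

# Euler integration of (6): s'(h) = eta * dl/dp, with
# dl/dp = -y / (1 + exp(y * p)) evaluated at the updated prediction.
s, steps = 0.0, 100_000
dh = h / steps
for _ in range(steps):
    s += dh * eta * (-y / (1.0 + np.exp(y * (p0 - s * xx))))

print(s_closed, s)  # the two agree to ~1e-5
```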
3.1.1 Hinge Loss and Quantile Loss

Two other commonly used loss functions are the hinge loss and the $\tau$-quantile loss, where $\tau \in [0, 1]$ is a parameter. These are differentiable everywhere except at a single point.

From (Karampatziakis+ 11)
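A minimal sketch of the $\tau$-quantile update from Table 1 (the hinge case is analogous; names and values are mine). The effect of the min is that even a huge importance weight cannot push the prediction past the kink at $p = y$:

```python
def quantile_update(p, y, xx, h, eta, tau):
    # tau-quantile row of Table 1: constant-slope descent, capped so
    # the prediction stops exactly at the kink p = y.
    if y > p:
        return -tau * min(h * eta, (y - p) / (tau * xx))
    return (1.0 - tau) * min(h * eta, (p - y) / ((1.0 - tau) * xx))

p, y, xx = 0.0, 1.0, 2.0
s = quantile_update(p, y, xx, h=100.0, eta=0.5, tau=0.3)
print(p - s * xx)  # -> 1.0: the updated prediction stops at y
```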
31. Normalized Online Learning
• Normalizes automatically while learning online
• Lets you run SGD without worrying (much) about feature scales!
• Comes with a proof of a regret bound in which even the feature scales are chosen adversarially
Algorithm 1 NG (learning rate $\eta_t$)
1. Initially $w_i = 0$, $s_i = 0$, $N = 0$
2. For each timestep $t$, observe example $(x, y)$
   (a) For each $i$, if $|x_i| > s_i$:
       i. $w_i \leftarrow w_i s_i^2 / |x_i|^2$
       ii. $s_i \leftarrow |x_i|$
   (b) $\hat{y} = \sum_i w_i x_i$
   (c) $N \leftarrow N + \sum_i x_i^2 / s_i^2$
   (d) For each $i$:
       i. $w_i \leftarrow w_i - \eta_t \frac{t}{N} \frac{1}{s_i^2} \frac{\partial L(\hat{y}, y)}{\partial w_i}$

Algorithm 2 NAG (learning rate $\eta$)
1. Initially $w_i = 0$, $s_i = 0$, $G_i = 0$, $N = 0$
2. For each timestep $t$, observe example $(x, y)$
   (a) For each $i$, if $|x_i| > s_i$:
       i. $w_i \leftarrow w_i s_i / |x_i|$
       ii. $s_i \leftarrow |x_i|$
   (b) $\hat{y} = \sum_i w_i x_i$
   (c) $N \leftarrow N + \sum_i x_i^2 / s_i^2$
   (d) For each $i$:
       i. $G_i \leftarrow G_i + \left(\frac{\partial L(\hat{y}, y)}{\partial w_i}\right)^2$
       ii. $w_i \leftarrow w_i - \eta \sqrt{\frac{t}{N}} \frac{1}{s_i \sqrt{G_i}} \frac{\partial L(\hat{y}, y)}{\partial w_i}$

From (Stéphane+ 13)
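A runnable sketch of Algorithm 2 (NAG), assuming squared loss $L(\hat{y}, y) = (\hat{y} - y)^2/2$ so that $\partial L/\partial w_i = (\hat{y} - y)x_i$; the paper states the algorithm for a generic loss, and the demo data and hyperparameters here are mine:

```python
import numpy as np

def nag(examples, eta=0.5):
    """Sketch of Algorithm 2 (NAG) with squared loss, dL/dw_i = (y_hat - y) x_i."""
    d = len(examples[0][0])
    w, s, G, N = np.zeros(d), np.zeros(d), np.zeros(d), 0.0
    for t, (x, y) in enumerate(examples, start=1):
        # (a) re-scale weights whose feature just exceeded its largest scale so far
        big = np.abs(x) > s
        w[big] *= s[big] / np.abs(x[big])   # NG would use (s/|x|)**2 here
        s[big] = np.abs(x[big])
        # (b) predict
        y_hat = w @ x
        # (c) accumulate the normalized squared norm of the example
        nz = s > 0
        N += np.sum(x[nz] ** 2 / s[nz] ** 2)
        # (d) adaptive, normalized per-coordinate update
        g = (y_hat - y) * x
        G += g ** 2
        upd = nz & (G > 0)
        w[upd] -= eta * np.sqrt(t / N) * g[upd] / (s[upd] * np.sqrt(G[upd]))
    return w

# Two features whose scales differ by 1e4; NAG still learns both coordinates.
rng = np.random.default_rng(0)
data = [(np.array([a, 1e4 * b]), 3 * a - 2 * b)
        for a, b in rng.uniform(-1, 1, size=(2000, 2))]
print(nag(data))  # approximately [3, -2e-4]
```

Swapping step (a) for the squared re-scaling $s_i^2/|x_i|^2$ and step (d) for the $\eta_t \frac{t}{N} \frac{1}{s_i^2}$ update gives NG (Algorithm 1).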