A General Framework for Communication-Efficient Distributed Optimization: Communication remains the most significant bottleneck in the performance of distributed optimization algorithms for large-scale machine learning. In light of this, we propose a general framework, CoCoA, that uses local computation in a primal-dual setting to dramatically reduce the amount of necessary communication. Our framework enjoys strong convergence guarantees and exhibits state-of-the-art empirical performance in the distributed setting. We demonstrate this performance with extensive experiments in Apache Spark, achieving speedups of up to 50x compared to leading distributed methods on common machine learning objectives.
Virginia Smith, Researcher, UC Berkeley at MLconf SF 2016
1. CoCoA: A General Framework for Communication-Efficient Distributed Optimization
Virginia Smith
Simone Forte ⋅ Chenxin Ma ⋅ Martin Takac ⋅ Martin Jaggi ⋅ Michael I. Jordan
21–23. Distributed Optimization
"always communicate" (reduce each round: $w := w + \alpha \sum_k \Delta w_k$):
  ✔ convergence guarantees
  ✗ high communication
"never communicate" (one-shot average: $w := \tfrac{1}{K} \sum_k w_k$):
  ✔ low communication
  ✗ convergence not guaranteed
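To make the two communication extremes concrete, here is a toy numpy sketch (the least-squares objective, function names, and step size are illustrative assumptions, not the talk's code): one-shot averaging communicates a single time, while the mini-batch-style scheme performs one reduce per round.

```python
import numpy as np

def local_solve(X, y):
    """Each machine fits its own least-squares model to its local data."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def one_shot_average(parts):
    """'Never communicate': solve locally, average the K solutions once."""
    return np.mean([local_solve(X, y) for X, y in parts], axis=0)

def minibatch_reduce(parts, d, rounds=100, lr=0.01):
    """'Always communicate': every round each machine sends a gradient-based
    update dw_k, and the driver applies the reduced sum w := w + lr * sum_k dw_k."""
    w = np.zeros(d)
    for _ in range(rounds):
        dws = [-(X.T @ (X @ w - y)) / len(y) for X, y in parts]  # local updates
        w = w + lr * sum(dws)                                    # one reduce per round
    return w

# usage: parts = [(X1, y1), (X2, y2), ...]; w_avg = one_shot_average(parts)
```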
39–40. Mini-batch Limitations → CoCoA-v1
1. ONE-OFF METHODS → Primal-Dual Framework
2. STALE UPDATES → Immediately apply local updates
3. AVERAGE OVER BATCH SIZE → Average over K << batch size
43–47. 1. Primal-Dual Framework
PRIMAL ≥ DUAL
Stopping criteria given by duality gap
Good performance in practice
Default in software packages, e.g. liblinear
Dual separates across machines (one dual variable per datapoint)

$$\min_{w \in \mathbb{R}^d}\ \Big[\, P(w) := \tfrac{\lambda}{2}\|w\|^2 + \tfrac{1}{n}\sum_{i=1}^n \ell_i(w^\top x_i) \,\Big]$$
$$\max_{\alpha \in \mathbb{R}^n}\ \Big[\, D(\alpha) := -\tfrac{\lambda}{2}\|A\alpha\|^2 - \tfrac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) \,\Big], \qquad A_i = \tfrac{1}{\lambda n}\, x_i$$
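Since weak duality gives P(w(α)) ≥ D(α) for any α, the duality gap P(w(α)) − D(α) certifies how close the current iterate is to optimal, which is what makes it a natural stopping criterion. A minimal single-machine sketch for the squared-loss (ridge) instance of the formulation above; the function and variable names are illustrative assumptions, not the talk's implementation:

```python
import numpy as np

def duality_gap_ridge(X, y, alpha, lam):
    """Duality gap P(w(alpha)) - D(alpha) for ridge regression.

    Loss      l_i(a)  = 0.5 * (a - y_i)^2
    Conjugate l_i*(u) = 0.5 * u^2 + u * y_i
    Primal point induced by the dual variables: w(alpha) = X^T alpha / (lam * n).
    """
    n = X.shape[0]
    w = X.T @ alpha / (lam * n)
    primal = 0.5 * lam * w @ w + 0.5 * np.mean((X @ w - y) ** 2)
    dual = -0.5 * lam * w @ w - np.mean(0.5 * alpha ** 2 - alpha * y)
    return primal - dual   # >= 0 by weak duality; stop once below a tolerance
```

A typical outer loop would check `duality_gap_ridge(X, y, alpha, lam) < 1e-6` after each communication round.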
53–55. 1. Primal-Dual Framework
Global objective:
$$\max_{\alpha \in \mathbb{R}^n}\ \Big[\, D(\alpha) := -\tfrac{\lambda}{2}\|A\alpha\|^2 - \tfrac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) \,\Big], \qquad A_i = \tfrac{1}{\lambda n}\, x_i$$
Local objective:
$$\max_{\Delta\alpha_{[k]} \in \mathbb{R}^n}\ -\tfrac{1}{n}\sum_{i \in \mathcal{P}_k} \ell_i^*\big(-\alpha_i - (\Delta\alpha_{[k]})_i\big) \;-\; \tfrac{1}{n}\, w^\top A\, \Delta\alpha_{[k]} \;-\; \tfrac{\lambda}{2}\,\big\|\tfrac{1}{\lambda n}\, A\, \Delta\alpha_{[k]}\big\|^2$$
Can solve the local objective using any internal optimization method
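Because the local subproblem touches only machine k's data and its block of dual coordinates, any local solver can be plugged in; CoCoA-v1 uses a few passes of dual coordinate ascent that apply each update to a local copy of w immediately. A toy single-process simulation of one outer round for squared loss (hedged: the partitioning, helper names, and averaging combiner are illustrative assumptions, not the reference Spark code):

```python
import numpy as np

def local_sdca(X, y, alpha, w, part, lam, n, local_passes=5):
    """Local solver sketch: a few passes of dual coordinate ascent (squared
    loss) over machine k's points, applying each update immediately to a
    *local* copy of w.  Returns the change in machine k's dual coordinates."""
    alpha_k = alpha[part].copy()
    w_local = w.copy()
    for _ in range(local_passes):
        for j, i in enumerate(part):
            x_i, old = X[i], alpha_k[j]
            xx = x_i @ x_i
            # closed-form coordinate maximizer of the dual for squared loss
            new = (y[i] - x_i @ w_local + old * xx / (lam * n)) / (1.0 + xx / (lam * n))
            alpha_k[j] = new
            w_local += (new - old) * x_i / (lam * n)   # keep local w in sync
    return alpha_k - alpha[part]

def cocoa_round(X, y, alpha, w, parts, lam):
    """One outer round: 'map' local solves on every partition (all starting
    from the same w), then 'reduce' by averaging the updates (CoCoA-v1 style)."""
    n, K = len(y), len(parts)
    deltas = [local_sdca(X, y, alpha, w, part, lam, n) for part in parts]   # map
    for part, d_alpha in zip(parts, deltas):                                # reduce
        alpha[part] += d_alpha / K
        w = w + (X[part].T @ d_alpha) / (lam * n * K)
    return alpha, w
```

Calling `alpha, w = cocoa_round(X, y, alpha, w, parts, lam)` repeatedly, with `parts = np.array_split(np.arange(len(y)), K)`, mirrors the map/reduce structure of the framework; swapping `local_sdca` for any other approximate solver of the local objective leaves the outer loop unchanged.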
62–66. COCOA-v1 Limitations
Can we avoid having to average the partial solutions? ✔
[CoCoA+, Ma & Smith, et al., ICML '15]
L1-regularized objectives not covered in initial framework
71–73. L1 Regularization
Encourages sparse solutions
Includes popular models:
- lasso regression
- sparse logistic regression
- elastic net-regularized problems
Beneficial to distribute by feature (see the sketch below)
Can we map this to the CoCoA setup?
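For orientation, a small numpy sketch of what distributing by feature (rather than by datapoint) means for the data matrix; the variable names and partitioning are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 8, 6, 3                      # datapoints, features, machines
X = rng.standard_normal((n, d))

# Distribute by datapoint: machine k stores a block of rows
# (and the matching block of dual variables alpha_i).
row_parts = np.array_split(np.arange(n), K)
X_by_point = [X[rows, :] for rows in row_parts]

# Distribute by feature: machine k stores a block of columns
# (and the matching block of model coordinates), the natural split for L1 models.
col_parts = np.array_split(np.arange(d), K)
X_by_feature = [X[:, cols] for cols in col_parts]

# The shared quantity X @ v decomposes into per-machine contributions,
# so each machine computes its share locally and only an n-vector is reduced.
v = rng.standard_normal(d)
assert np.allclose(sum(X[:, cols] @ v[cols] for cols in col_parts), X @ v)
```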
77–80. Solution: Solve Primal Directly
PRIMAL ≥ DUAL
Stopping criteria given by duality gap
Good performance in practice
Default in software packages, e.g. glmnet
Primal separates across machines (one primal variable per feature)

$$\min_{\alpha \in \mathbb{R}^n}\ f(A\alpha) + \sum_{i=1}^n g_i(\alpha_i) \qquad\qquad \min_{w \in \mathbb{R}^d}\ f^*(w) + \sum_{i=1}^n g_i^*(-x_i^\top w)$$
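To see how an L1 model fits this template, take f as the smooth data-fit term and g_i as the separable per-coordinate penalty; a solver then only needs the gradient of f(Aα) and the prox of g_i (soft-thresholding). A hedged sketch with illustrative helper names, shown as plain proximal gradient on a single machine rather than the distributed algorithm:

```python
import numpy as np

# Lasso in the template  min_alpha  f(A @ alpha) + sum_i g_i(alpha_i):
#   f(v)   = 0.5 * ||v - b||^2     (smooth data-fit term)
#   g_i(a) = reg * |a|             (separable L1 penalty, one term per feature)

def soft_threshold(z, t):
    """prox of t*|.|: the coordinate-wise building block for the L1 part."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_grad_step(A, b, alpha, reg, step):
    """One proximal-gradient step on the template objective."""
    grad = A.T @ (A @ alpha - b)                 # gradient of f(A alpha)
    return soft_threshold(alpha - step * grad, step * reg)

# usage
rng = np.random.default_rng(1)
A, b = rng.standard_normal((20, 10)), rng.standard_normal(20)
alpha, step = np.zeros(10), 1.0 / np.linalg.norm(A, 2) ** 2
for _ in range(200):
    alpha = prox_grad_step(A, b, alpha, reg=0.5, step=step)
```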
86–94. CoCoA: A Framework for Distributed Optimization
flexible: allows for arbitrary internal methods
efficient: fast! (strong convergence & low communication)
general: works for a variety of ML models
CoCoA Framework:
1. CoCoA-v1 [NIPS '14]
2. CoCoA+ [ICML '15]
3. ProxCoCoA+ [current work]
110–119. A First Approach: CoCoA+ with Smoothing
Issue: CoCoA+ requires strongly-convex regularizers
Approach: add a bit of L2 to the L1 regularizer: $\|\alpha\|_1 + \delta \|\alpha\|_2^2$

Amount of L2     | Final Sparsity
Ideal (δ = 0)    | 0.6030
δ = 0.0001       | 0.6035
δ = 0.001        | 0.6240
δ = 0.01         | 0.6465

✗ CoCoA+ with smoothing doesn't work
Additionally, CoCoA+ distributes by datapoint, not by feature
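For intuition, the proximal operator of the smoothed regularizer is easy to write down: it soft-thresholds exactly as for pure L1 and then shrinks every surviving coordinate by 1/(1 + 2tδ), so the problem being solved (and hence its final support) changes with δ, consistent with the sparsity figures above. A small illustrative sketch (standard elastic-net prox, not the experiment code):

```python
import numpy as np

def prox_l1(z, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_l1_plus_l2(z, t, delta):
    """Proximal operator of t * (||.||_1 + delta * ||.||_2^2):
    soft-threshold as before, then shrink the nonzeros by 1/(1 + 2*t*delta)."""
    return prox_l1(z, t) / (1.0 + 2.0 * t * delta)

z = np.array([-1.5, -0.3, 0.0, 0.4, 2.0])
print(prox_l1(z, 0.5))                # pure L1
print(prox_l1_plus_l2(z, 0.5, 0.01))  # smoothed: extra shrinkage on nonzeros
```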
126–128. Convergence
Assumption (Local Θ-Approximation): for Θ ∈ [0, 1), we assume the local solver finds an approximate solution satisfying
$$\mathbb{E}\big[\, \mathcal{G}_k^{\sigma'}(\Delta\alpha_{[k]}) - \mathcal{G}_k^{\sigma'}(\Delta\alpha_{[k]}^\star) \,\big] \;\le\; \Theta \,\big(\, \mathcal{G}_k^{\sigma'}(0) - \mathcal{G}_k^{\sigma'}(\Delta\alpha_{[k]}^\star) \,\big)$$

Theorem 1. Let $g_i$ have $L$-bounded support. Then
$$T \;\ge\; \tilde{\mathcal{O}}\Big( \tfrac{1}{1-\Theta} \Big( \tfrac{8L^2 n^2}{\tau \epsilon} + \tilde{c} \Big) \Big)$$

Theorem 2. Let $g_i$ be $\mu$-strongly convex. Then
$$T \;\ge\; \tfrac{1}{1-\Theta} \, \tfrac{\mu\tau + n}{\mu\tau} \, \log\tfrac{n}{\epsilon}$$