A General Framework for Communication-Efficient Distributed Optimization: Communication remains the most significant bottleneck in the performance of distributed optimization algorithms for large-scale machine learning. In light of this, we propose a general framework, CoCoA, that uses local computation in a primal-dual setting to dramatically reduce the amount of necessary communication. Our framework enjoys strong convergence guarantees and exhibits state-of-the-art empirical performance in the distributed setting. We demonstrate this performance with extensive experiments in Apache Spark, achieving speedups of up to 50x compared to leading distributed methods on common machine learning objectives.
Virginia Smith, Researcher, UC Berkeley at MLconf SF 2016
1. CoCoA: A General Framework for Communication-Efficient Distributed Optimization
Virginia Smith
Simone Forte ⋅ Chenxin Ma ⋅ Martin Takac ⋅ Martin Jaggi ⋅ Michael I. Jordan
21–23. Distributed Optimization
"always communicate" (reduce each round: $w := w + \alpha \sum_k \Delta w_k$):
  ✔ convergence guarantees
  ✗ high communication
"never communicate" (one-shot average: $w := \tfrac{1}{K} \sum_k w_k$):
  ✔ low communication
  ✗ convergence not guaranteed
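To make the two communication extremes concrete, here is a toy numpy sketch (the least-squares objective, function names, and step size are illustrative assumptions, not the talk's code): one-shot averaging communicates a single time, while the mini-batch-style scheme performs one reduce per round.

```python
import numpy as np

def local_solve(X, y):
    """Each machine fits its own least-squares model to its local data."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def one_shot_average(parts):
    """'Never communicate': solve locally, average the K solutions once."""
    return np.mean([local_solve(X, y) for X, y in parts], axis=0)

def minibatch_reduce(parts, d, rounds=100, lr=0.01):
    """'Always communicate': every round each machine sends a gradient-based
    update dw_k, and the driver applies the reduced sum w := w + lr * sum_k dw_k."""
    w = np.zeros(d)
    for _ in range(rounds):
        dws = [-(X.T @ (X @ w - y)) / len(y) for X, y in parts]  # local updates
        w = w + lr * sum(dws)                                    # one reduce per round
    return w

# usage: parts = [(X1, y1), (X2, y2), ...]; w_avg = one_shot_average(parts)
```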
39–40. Mini-batch Limitations → CoCoA-v1
1. ONE-OFF METHODS → Primal-Dual Framework
2. STALE UPDATES → Immediately apply local updates
3. AVERAGE OVER BATCH SIZE → Average over K << batch size
43–47. 1. Primal-Dual Framework
PRIMAL ≥ DUAL
Stopping criteria given by duality gap
Good performance in practice
Default in software packages, e.g. liblinear
Dual separates across machines (one dual variable per datapoint)

$$\min_{w \in \mathbb{R}^d}\ \Big[\, P(w) := \tfrac{\lambda}{2}\|w\|^2 + \tfrac{1}{n}\sum_{i=1}^n \ell_i(w^\top x_i) \,\Big]$$
$$\max_{\alpha \in \mathbb{R}^n}\ \Big[\, D(\alpha) := -\tfrac{\lambda}{2}\|A\alpha\|^2 - \tfrac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) \,\Big], \qquad A_i = \tfrac{1}{\lambda n}\, x_i$$
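Since weak duality gives P(w(α)) ≥ D(α) for any α, the duality gap P(w(α)) − D(α) certifies how close the current iterate is to optimal, which is what makes it a natural stopping criterion. A minimal single-machine sketch for the squared-loss (ridge) instance of the formulation above; the function and variable names are illustrative assumptions, not the talk's implementation:

```python
import numpy as np

def duality_gap_ridge(X, y, alpha, lam):
    """Duality gap P(w(alpha)) - D(alpha) for ridge regression.

    Loss      l_i(a)  = 0.5 * (a - y_i)^2
    Conjugate l_i*(u) = 0.5 * u^2 + u * y_i
    Primal point induced by the dual variables: w(alpha) = X^T alpha / (lam * n).
    """
    n = X.shape[0]
    w = X.T @ alpha / (lam * n)
    primal = 0.5 * lam * w @ w + 0.5 * np.mean((X @ w - y) ** 2)
    dual = -0.5 * lam * w @ w - np.mean(0.5 * alpha ** 2 - alpha * y)
    return primal - dual   # >= 0 by weak duality; stop once below a tolerance
```

A typical outer loop would check `duality_gap_ridge(X, y, alpha, lam) < 1e-6` after each communication round.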
53–55. 1. Primal-Dual Framework
Global objective:
$$\max_{\alpha \in \mathbb{R}^n}\ \Big[\, D(\alpha) := -\tfrac{\lambda}{2}\|A\alpha\|^2 - \tfrac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) \,\Big], \qquad A_i = \tfrac{1}{\lambda n}\, x_i$$
Local objective:
$$\max_{\Delta\alpha_{[k]} \in \mathbb{R}^n}\ -\tfrac{1}{n}\sum_{i \in \mathcal{P}_k} \ell_i^*\big(-\alpha_i - (\Delta\alpha_{[k]})_i\big) \;-\; \tfrac{1}{n}\, w^\top A\, \Delta\alpha_{[k]} \;-\; \tfrac{\lambda}{2}\,\big\|\tfrac{1}{\lambda n}\, A\, \Delta\alpha_{[k]}\big\|^2$$
Can solve the local objective using any internal optimization method
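Because the local subproblem touches only machine k's data and its block of dual coordinates, any local solver can be plugged in; CoCoA-v1 uses a few passes of dual coordinate ascent that apply each update to a local copy of w immediately. A toy single-process simulation of one outer round for squared loss (hedged: the partitioning, helper names, and averaging combiner are illustrative assumptions, not the reference Spark code):

```python
import numpy as np

def local_sdca(X, y, alpha, w, part, lam, n, local_passes=5):
    """Local solver sketch: a few passes of dual coordinate ascent (squared
    loss) over machine k's points, applying each update immediately to a
    *local* copy of w.  Returns the change in machine k's dual coordinates."""
    alpha_k = alpha[part].copy()
    w_local = w.copy()
    for _ in range(local_passes):
        for j, i in enumerate(part):
            x_i, old = X[i], alpha_k[j]
            xx = x_i @ x_i
            # closed-form coordinate maximizer of the dual for squared loss
            new = (y[i] - x_i @ w_local + old * xx / (lam * n)) / (1.0 + xx / (lam * n))
            alpha_k[j] = new
            w_local += (new - old) * x_i / (lam * n)   # keep local w in sync
    return alpha_k - alpha[part]

def cocoa_round(X, y, alpha, w, parts, lam):
    """One outer round: 'map' local solves on every partition (all starting
    from the same w), then 'reduce' by averaging the updates (CoCoA-v1 style)."""
    n, K = len(y), len(parts)
    deltas = [local_sdca(X, y, alpha, w, part, lam, n) for part in parts]   # map
    for part, d_alpha in zip(parts, deltas):                                # reduce
        alpha[part] += d_alpha / K
        w = w + (X[part].T @ d_alpha) / (lam * n * K)
    return alpha, w
```

Calling `alpha, w = cocoa_round(X, y, alpha, w, parts, lam)` repeatedly, with `parts = np.array_split(np.arange(len(y)), K)`, mirrors the map/reduce structure of the framework; swapping `local_sdca` for any other approximate solver of the local objective leaves the outer loop unchanged.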
62–66. COCOA-v1 Limitations
Can we avoid having to average the partial solutions? ✔
[CoCoA+, Ma & Smith, et al., ICML '15]
L1-regularized objectives not covered in initial framework
71–73. L1 Regularization
Encourages sparse solutions
Includes popular models:
- lasso regression
- sparse logistic regression
- elastic net-regularized problems
Beneficial to distribute by feature (see the sketch below)
Can we map this to the CoCoA setup?
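For orientation, a small numpy sketch of what distributing by feature (rather than by datapoint) means for the data matrix; the variable names and partitioning are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 8, 6, 3                      # datapoints, features, machines
X = rng.standard_normal((n, d))

# Distribute by datapoint: machine k stores a block of rows
# (and the matching block of dual variables alpha_i).
row_parts = np.array_split(np.arange(n), K)
X_by_point = [X[rows, :] for rows in row_parts]

# Distribute by feature: machine k stores a block of columns
# (and the matching block of model coordinates), the natural split for L1 models.
col_parts = np.array_split(np.arange(d), K)
X_by_feature = [X[:, cols] for cols in col_parts]

# The shared quantity X @ v decomposes into per-machine contributions,
# so each machine computes its share locally and only an n-vector is reduced.
v = rng.standard_normal(d)
assert np.allclose(sum(X[:, cols] @ v[cols] for cols in col_parts), X @ v)
```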
77–80. Solution: Solve Primal Directly
PRIMAL ≥ DUAL
Stopping criteria given by duality gap
Good performance in practice
Default in software packages, e.g. glmnet
Primal separates across machines (one primal variable per feature)

$$\min_{\alpha \in \mathbb{R}^n}\ f(A\alpha) + \sum_{i=1}^n g_i(\alpha_i) \qquad\qquad \min_{w \in \mathbb{R}^d}\ f^*(w) + \sum_{i=1}^n g_i^*(-x_i^\top w)$$
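To see how an L1 model fits this template, take f as the smooth data-fit term and g_i as the separable per-coordinate penalty; a solver then only needs the gradient of f(Aα) and the prox of g_i (soft-thresholding). A hedged sketch with illustrative helper names, shown as plain proximal gradient on a single machine rather than the distributed algorithm:

```python
import numpy as np

# Lasso in the template  min_alpha  f(A @ alpha) + sum_i g_i(alpha_i):
#   f(v)   = 0.5 * ||v - b||^2     (smooth data-fit term)
#   g_i(a) = reg * |a|             (separable L1 penalty, one term per feature)

def soft_threshold(z, t):
    """prox of t*|.|: the coordinate-wise building block for the L1 part."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_grad_step(A, b, alpha, reg, step):
    """One proximal-gradient step on the template objective."""
    grad = A.T @ (A @ alpha - b)                 # gradient of f(A alpha)
    return soft_threshold(alpha - step * grad, step * reg)

# usage
rng = np.random.default_rng(1)
A, b = rng.standard_normal((20, 10)), rng.standard_normal(20)
alpha, step = np.zeros(10), 1.0 / np.linalg.norm(A, 2) ** 2
for _ in range(200):
    alpha = prox_grad_step(A, b, alpha, reg=0.5, step=step)
```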
86–94. CoCoA: A Framework for Distributed Optimization
flexible: allows for arbitrary internal methods
efficient: fast! (strong convergence & low communication)
general: works for a variety of ML models
CoCoA Framework:
1. CoCoA-v1 [NIPS '14]
2. CoCoA+ [ICML '15]
3. ProxCoCoA+ [current work]
110–119. A First Approach: CoCoA+ with Smoothing
Issue: CoCoA+ requires strongly-convex regularizers
Approach: add a bit of L2 to the L1 regularizer: $\|\alpha\|_1 + \delta \|\alpha\|_2^2$

Amount of L2     | Final Sparsity
Ideal (δ = 0)    | 0.6030
δ = 0.0001       | 0.6035
δ = 0.001        | 0.6240
δ = 0.01         | 0.6465

✗ CoCoA+ with smoothing doesn't work
Additionally, CoCoA+ distributes by datapoint, not by feature
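For intuition, the proximal operator of the smoothed regularizer is easy to write down: it soft-thresholds exactly as for pure L1 and then shrinks every surviving coordinate by 1/(1 + 2tδ), so the problem being solved (and hence its final support) changes with δ, consistent with the sparsity figures above. A small illustrative sketch (standard elastic-net prox, not the experiment code):

```python
import numpy as np

def prox_l1(z, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_l1_plus_l2(z, t, delta):
    """Proximal operator of t * (||.||_1 + delta * ||.||_2^2):
    soft-threshold as before, then shrink the nonzeros by 1/(1 + 2*t*delta)."""
    return prox_l1(z, t) / (1.0 + 2.0 * t * delta)

z = np.array([-1.5, -0.3, 0.0, 0.4, 2.0])
print(prox_l1(z, 0.5))                # pure L1
print(prox_l1_plus_l2(z, 0.5, 0.01))  # smoothed: extra shrinkage on nonzeros
```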
126–128. Convergence
Assumption (Local Θ-Approximation): for Θ ∈ [0, 1), we assume the local solver finds an approximate solution satisfying
$$\mathbb{E}\big[\, \mathcal{G}_k^{\sigma'}(\Delta\alpha_{[k]}) - \mathcal{G}_k^{\sigma'}(\Delta\alpha_{[k]}^\star) \,\big] \;\le\; \Theta \,\big(\, \mathcal{G}_k^{\sigma'}(0) - \mathcal{G}_k^{\sigma'}(\Delta\alpha_{[k]}^\star) \,\big)$$

Theorem 1. Let $g_i$ have $L$-bounded support. Then
$$T \;\ge\; \tilde{\mathcal{O}}\Big( \tfrac{1}{1-\Theta} \Big( \tfrac{8L^2 n^2}{\tau \epsilon} + \tilde{c} \Big) \Big)$$

Theorem 2. Let $g_i$ be $\mu$-strongly convex. Then
$$T \;\ge\; \tfrac{1}{1-\Theta} \, \tfrac{\mu\tau + n}{\mu\tau} \, \log\tfrac{n}{\epsilon}$$