1. Machine Learning of Structured Outputs
Christoph Lampert
IST Austria
(Institute of Science and Technology Austria)
Klosterneuburg
Feb 2, 2011
2. Machine Learning of Structured Outputs
Overview...
Introduction to Structured Learning
Structured Support Vector Machines
Applications in Computer Vision
Slides available at
http://www.ist.ac.at/~chl
3. What is Machine Learning?
Definition [T. Mitchell]:
Machine Learning is the study of computer algorithms
that improve their performance in a certain task
through experience.
Example: Backgammon
Task: play backgammon
Experience: self-play
Performance measure: games won against humans
Example: Object Recognition
Task: determine which objects are visible in images
Experience: annotated training data
Performance measure: objects recognized correctly
4. What is structured data?
Definition [ad hoc]:
Data is structured if it consists of several parts, and
not only the parts themselves contain information, but
also the way in which the parts belong together.
Examples: text, molecules / chemical structures, documents / hypertext, images.
5. The right tool for the problem.
Example: Machine Learning for/of Structured Data
[figure: image → body model → model fit]
Task: human pose estimation
Experience: images with manually annotated body pose
Performance measure: number of correctly localized body parts
6. Other tasks:
Natural Language Processing:
Automatic Translation (output: sentences)
Sentence Parsing (output: parse trees)
Bioinformatics:
RNA Structure Prediction (output: bipartite graphs)
Enzyme Function Prediction (output: path in a tree)
Speech Processing:
Automatic Transcription (output: sentences)
Text-to-Speech (output: audio signal)
Robotics:
Planning (output: sequence of actions)
This talk: only Computer Vision examples
7. "Normal" Machine Learning:
f : X → R.
inputs X can be any kind of objects
images, text, audio, sequence of amino acids, . . .
output y is a real number
classification, regression, . . .
many ways to construct f:
f(x) = a · ϕ(x) + b,
f(x) = decision tree,
f(x) = neural network
8. Structured Output Learning:
f : X → Y.
inputs X can be any kind of objects
outputs y ∈ Y are complex (structured) objects
images, parse trees, folds of a protein, . . .
how to construct f ?
9. Predicting Structured Outputs: Image Denoising
[figure: noisy image → denoised image]
input: images, output: denoised images
input set X = {grayscale images} ≙ [0, 255]^{M·N}
output set Y = {grayscale images} ≙ [0, 255]^{M·N}
energy minimization: f(x) := argmin_{y∈Y} E(x, y)
E(x, y) = λ Σ_i (x_i − y_i)² + µ Σ_{i,j} |y_i − y_j|
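To make the energy concrete, here is a minimal NumPy sketch (illustrative, not from the talk) that evaluates E(x, y) for a candidate output; lam and mu play the roles of λ and µ, and (i, j) ranges over vertically and horizontally adjacent pixel pairs:

import numpy as np

def denoising_energy(x, y, lam=1.0, mu=0.1):
    """E(x, y) = lam * sum_i (x_i - y_i)^2 + mu * sum_{i,j} |y_i - y_j|,
    with (i, j) running over 4-neighbor pixel pairs."""
    xf, yf = x.astype(float), y.astype(float)
    data_term = lam * np.sum((xf - yf) ** 2)
    smooth_term = mu * (np.sum(np.abs(np.diff(yf, axis=0)))
                        + np.sum(np.abs(np.diff(yf, axis=1))))
    return data_term + smooth_term

The predictor f(x) then has to search all of Y for the minimizer; that is what the inference algorithms discussed later (e.g. graph cuts) are for, since enumerating [0, 255]^{M·N} is infeasible.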
10. Predicting Structured Outputs: Human Pose Estimation
[figure: image + body model → model fit]
input: image and body model, output: model fit
input set X = {images}
output set Y = {positions/angles of K body parts} ≙ R^{4K}
energy minimization: f(x) := argmin_{y∈Y} E(x, y)
E(x, y) = Σ_i w_i ϕ_fit(x_i, y_i) + Σ_{i,j} w_ij ϕ_pose(y_i, y_j)
11. Predicting Structured Outputs: Shape Matching
input: image pairs
output: mapping y : x_i ↔ y(x_i)
scoring function F(x, y) = Σ_i w_i ϕ_sim(x_i, y(x_i)) + Σ_{i,j} w_ij ϕ_dist(x_i, x_j, y(x_i), y(x_j))
predict f : X → Y by f(x) := argmax_{y∈Y} F(x, y)
[J. McAuley et al.: "Robust Near-Isometric Matching via Structured Learning of Graphical Models", NIPS, 2008]
12. Predicting Structured Outputs: Tracking (by Detection)
[figure: image → object position]
input set X = {images}
output set Y = R² (box center) or R⁴ (box coordinates)
predict f : X → Y by f(x) := argmax_{y∈Y} F(x, y)
scoring function F(x, y) = ⟨w, ϕ(x, y)⟩, e.g. an SVM score
images: [C. L., Jan Peters, "Active Structured Learning for High-Speed Object Detection", DAGM 2009]
13. Predicting Structured Outputs: Summary
Image Denoising: y* = argmin_y E(x, y), E(x, y) = w_1 Σ_i (x_i − y_i)² + w_2 Σ_{i,j} |y_i − y_j|
Pose Estimation: y* = argmin_y E(x, y), E(x, y) = Σ_i w_i ϕ(x_i, y_i) + Σ_{i,j} w_ij ϕ(y_i, y_j)
Point Matching: y* = argmax_y F(x, y), F(x, y) = Σ_i w_i ϕ(x_i, y_i) + Σ_{i,j} w_ij ϕ(y_i, y_j)
Tracking: y* = argmax_y F(x, y), F(x, y) = ⟨w, ϕ(x, y)⟩
14. Unified Formulation
Predict structured output by maximization
  y* = argmax_{y∈Y} F(x, y)
of a compatibility function
  F(x, y) = ⟨w, ϕ(x, y)⟩
that is linear in a parameter vector w.
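In code, the unified formulation is nothing more than a linear score plus an argmax. A sketch with a brute-force argmax over an explicit candidate set (real structured problems replace this enumeration with dedicated inference):

import numpy as np

def F(w, phi, x, y):
    """Compatibility function F(x, y) = <w, phi(x, y)>, linear in w."""
    return np.dot(w, phi(x, y))

def predict(w, phi, x, candidates):
    """f(x) = argmax_{y in Y} F(x, y); here Y is a small explicit list."""
    return max(candidates, key=lambda y: F(w, phi, x, y))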
15. Structured Prediction: how to evaluate argmax_{y∈Y} F(x, y)?
loop-free graphs (chain, tree): Shortest-Path / Belief Propagation (BP)
loopy graphs (grid, arbitrary graph): GraphCut, approximate inference (e.g. loopy BP)
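For loop-free graphs the argmax is exact and cheap. A minimal max-sum (Viterbi-style) sketch for a chain, with my own array conventions: unary[t, l] is the score of label l at node t, pairwise[l, l'] the score of the transition l → l'.

import numpy as np

def chain_argmax(unary, pairwise):
    """Exact argmax of sum_t unary[t, y_t] + sum_t pairwise[y_t, y_{t+1}]
    over label sequences, by dynamic programming (max-sum / Viterbi)."""
    T, L = unary.shape
    score = unary[0].copy()
    backptr = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + pairwise              # cand[prev, cur]
        backptr[t] = np.argmax(cand, axis=0)          # best predecessor
        score = cand[backptr[t], np.arange(L)] + unary[t]
    y = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):                     # backtrace
        y.append(int(backptr[t][y[-1]]))
    return y[::-1]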
Structured Learning: how to learn F (x, y) from examples?
16. Machine Learning for Structured Outputs
Learning Problem:
Task: predict structured objects f : X → Y
Experience: example pairs {(x¹, y¹), …, (x^N, y^N)} ⊂ X × Y:
typical inputs with "correct" outputs for them.
Performance measure: ∆ : Y × Y → R
Our choice:
parametric family: F(x, y; w) = ⟨w, ϕ(x, y)⟩
prediction method: f(x) = argmax_{y∈Y} F(x, y; w)
Task: determine a "good" w
17. Reminder: regularized risk minimization
Find w for the decision function F(x, y; w) = ⟨w, ϕ(x, y)⟩ by

  min_{w∈R^d} λ‖w‖² + Σ_{n=1}^N ℓ(y^n, F(x^n, ·; w))

Regularization + empirical loss (on training data).

Logistic loss (Conditional Random Fields):
  ℓ(y^n, F(x^n, ·; w)) = log Σ_{y∈Y} exp[ F(x^n, y; w) − F(x^n, y^n; w) ]
Hinge loss (Maximum-Margin Training):
  ℓ(y^n, F(x^n, ·; w)) = max_{y∈Y} [ ∆(y^n, y) + F(x^n, y; w) − F(x^n, y^n; w) ]
Exponential loss (Boosting):
  ℓ(y^n, F(x^n, ·; w)) = Σ_{y∈Y∖{y^n}} exp[ F(x^n, y; w) − F(x^n, y^n; w) ]
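For a small explicit output set, the hinge loss above can be evaluated directly (a sketch; for real structured outputs the max over Y is computed by loss-augmented inference rather than enumeration):

import numpy as np

def structured_hinge(w, phi, delta, x_n, y_n, candidates):
    """max_y [Delta(y^n, y) + <w, phi(x^n, y)> - <w, phi(x^n, y^n)>]."""
    score_true = np.dot(w, phi(x_n, y_n))
    return max(delta(y_n, y) + np.dot(w, phi(x_n, y)) - score_true
               for y in candidates)

Since y^n itself contributes the value 0 to the max, the loss is always non-negative.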
19. Structured Support Vector Machine:

  min_{w∈R^d} ½‖w‖² + (C/N) Σ_{n=1}^N max_{y∈Y} [ ∆(y^n, y) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩ ]

Unconstrained optimization, convex, non-differentiable objective.
[I. Tsochantaridis, T. Joachims, T. Hofmann, Y. Altun. "Large Margin Methods for Structured and Interdependent
Output Variables", JMLR, 2005.]
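One simple way to attack this convex, non-differentiable objective (a baseline, not the training method the talk advocates) is subgradient descent; loss_augmented_argmax below is a placeholder for the problem-specific inference routine:

import numpy as np

def ssvm_subgradient(data, phi, delta, loss_augmented_argmax,
                     dim, C=1.0, epochs=100, lr=0.01):
    """Subgradient descent on
    1/2 ||w||^2 + C/N sum_n max_y [Delta + <w,phi(y)> - <w,phi(y^n)>]."""
    w = np.zeros(dim)
    N = len(data)
    for _ in range(epochs):
        g = w.copy()                                    # d/dw of 1/2 ||w||^2
        for x_n, y_n in data:
            y_hat = loss_augmented_argmax(w, x_n, y_n)  # most violating output
            hinge = (delta(y_n, y_hat)
                     + w.dot(phi(x_n, y_hat) - phi(x_n, y_n)))
            if hinge > 0:                               # subgradient of the max
                g += (C / N) * (phi(x_n, y_hat) - phi(x_n, y_n))
        w -= lr * g
    return w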
22. Structured SVM (equivalent formulation):

  min_{w∈R^d, ξ∈R^N_+} ½‖w‖² + (C/N) Σ_{n=1}^N ξ_n
  subject to, for n = 1, …, N:
    max_{y∈Y} [ ∆(y^n, y) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩ ] ≤ ξ_n

N non-linear constraints, convex, differentiable objective.
23. Structured SVM (also equivalent formulation):

  min_{w∈R^d, ξ∈R^N_+} ½‖w‖² + (C/N) Σ_{n=1}^N ξ_n
  subject to, for n = 1, …, N:
    ∆(y^n, y) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩ ≤ ξ_n for all y ∈ Y

N·|Y| linear constraints, convex, differentiable objective.
24. Example: A "True" Multiclass SVM

  Y = {1, 2, …, K}, ∆(y, y′) = 1 for y ≠ y′, 0 otherwise.
  ϕ(x, y) = ( ⟦y = 1⟧ Φ(x), ⟦y = 2⟧ Φ(x), …, ⟦y = K⟧ Φ(x) )
          = Φ(x) e_y, with e_y the y-th unit vector

Solve:
  min_{w,ξ} ½‖w‖² + (C/N) Σ_{n=1}^N ξ_n
  subject to, for n = 1, …, N:
    ⟨w, ϕ(x^n, y^n)⟩ − ⟨w, ϕ(x^n, y)⟩ ≥ 1 − ξ_n for all y ∈ Y.

Classification (MAP): f(x) = argmax_{y∈Y} ⟨w, ϕ(x, y)⟩
This is the Crammer-Singer Multiclass SVM.
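The joint feature map Φ(x) e_y is easy to realize in code; an illustrative sketch, using 0-based labels y ∈ {0, …, K−1}:

import numpy as np

def multiclass_phi(features, y, K):
    """phi(x, y): Phi(x) placed in the y-th of K blocks, zeros elsewhere."""
    d = features.shape[0]
    phi = np.zeros(K * d)
    phi[y * d:(y + 1) * d] = features
    return phi

def multiclass_predict(w, features, K):
    """f(x) = argmax_y <w, phi(x, y)>."""
    return int(np.argmax([np.dot(w, multiclass_phi(features, y, K))
                          for y in range(K)]))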
25. Hierarchical Multiclass Classification
The loss function can reflect a label hierarchy, e.g. a tree over {cat, dog, car, bus}:

  ∆(y, y′) := ½ (distance in tree)
  ∆(cat, cat) = 0, ∆(cat, dog) = 1, ∆(cat, bus) = 2, etc.

Solve:
  min_{w,ξ} ½‖w‖² + (C/N) Σ_{n=1}^N ξ_n
  subject to, for n = 1, …, N:
    ⟨w, ϕ(x^n, y^n)⟩ − ⟨w, ϕ(x^n, y)⟩ ≥ ∆(y^n, y) − ξ_n for all y ∈ Y.
26. Kernelized S-SVM problem:
Define:
  joint kernel function k : (X × Y) × (X × Y) → R,
  kernel matrix K_{nn′yy′} = k( (x^n, y), (x^{n′}, y′) ).

  max_{α∈R^{N|Y|}_+} Σ_{n=1,…,N} Σ_{y∈Y} α_{ny} ∆(y^n, y) − ½ Σ_{n,n′=1,…,N} Σ_{y,y′∈Y} α_{ny} α_{n′y′} K_{nn′yy′}
  subject to, for n = 1, …, N:
    Σ_{y∈Y} α_{ny} ≤ C/N.

Kernelized prediction function:
  f(x) = argmax_{y∈Y} Σ_{n,y′} α_{ny′} k( (x^n, y^n), (x, y) )

Too many variables: train with a working set of the α_{ny}.
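Prediction in the kernelized model touches only the non-zero α's; a sketch, with support holding (α_{ny}, x^n, y^n) triples kept in the working set and an explicit candidate list standing in for Y:

def kernel_predict(support, k, x, candidates):
    """f(x) = argmax_y sum over the working set of alpha * k((x^n, y^n), (x, y)),
    using only the non-zero alphas."""
    def score(y):
        return sum(a * k((x_n, y_n), (x, y)) for a, x_n, y_n in support)
    return max(candidates, key=score)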
30. Object Localization ⇒ Scene Interpretation
A man inside of a car ⇒ he's driving. A man outside of a car ⇒ he's passing by.
31. Object Localization as Structured Learning:
Given: training examples (x^n, y^n)_{n=1,…,N}
Wanted: prediction function f : X → Y where
  X = {all images}
  Y = {all boxes}
[figure: f_car(image) = bounding box around the car]
32. Structured SVM framework
Define:
  feature function ϕ : X × Y → R^d,
  loss function ∆ : Y × Y → R,
  routine to solve argmax_{y∈Y} [ ∆(y^n, y) + ⟨w, ϕ(x^n, y)⟩ ].
Solve:
  min_{w,ξ} ½‖w‖² + C Σ_{n=1}^N ξ_n subject to
  ∀y ∈ Y : ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩ ≤ ξ_n.
Result:
  w* that determines the scoring function F(x, y) = ⟨w*, ϕ(x, y)⟩,
  localization function: f(x) = argmax_y F(x, y).
• M. Blaschko, C.L.: Learning to Localize Objects with Structured Output Regression, ECCV 2008.
33. Feature function: how to represent an (image, box)-pair (x, y)?
Observation: whether y is the right box for x depends only on x|_y, the image contents inside the box.
  ϕ(x, y) := h(x|_y)
where h(r) is a (bag-of-visual-words) histogram representation of the region r.
[illustration: boxes on the same object class yield similar histograms]
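With precomputed interest points and visual-word assignments, h(x|_y) is just a histogram over the words that fall inside the box. A hedged sketch (the point/word representation below is an assumption, not the paper's exact pipeline):

import numpy as np

def box_histogram(points, words, box, vocab_size):
    """h(x|_y): normalized bag-of-visual-words histogram of the interest
    points inside box = (left, top, right, bottom).
    points: list of (px, py); words: visual-word index per point."""
    left, top, right, bottom = box
    inside = [w for (px, py), w in zip(points, words)
              if left <= px <= right and top <= py <= bottom]
    hist = np.bincount(np.asarray(inside, dtype=int), minlength=vocab_size)
    return hist / max(1, hist.sum())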
35. Loss function: how to compare two boxes y and y′?

  ∆(y, y′) := 1 − (area overlap between y and y′)
            = 1 − area(y ∩ y′) / area(y ∪ y′)
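This overlap loss takes only a few lines; a sketch with boxes given as (left, top, right, bottom) tuples:

def box_loss(y, y_prime):
    """Delta(y, y') = 1 - area(y ∩ y') / area(y ∪ y')."""
    l1, t1, r1, b1 = y
    l2, t2, r2, b2 = y_prime
    iw = max(0, min(r1, r2) - max(l1, l2))     # intersection width
    ih = max(0, min(b1, b2) - max(t1, t2))     # intersection height
    inter = iw * ih
    union = (r1 - l1) * (b1 - t1) + (r2 - l2) * (b2 - t2) - inter
    return 1.0 - inter / union if union > 0 else 1.0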
38. Structured Support Vector Machine
S-SVM Optimization:
  min_{w,ξ} ½‖w‖² + C Σ_{n=1}^N ξ_n
  subject to, for n = 1, …, N:
  ∀y ∈ Y : ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩ ≤ ξ_n.
Solve via constraint generation. Iterate:
  Solve the minimization with the working set of constraints → new w
  Identify argmax_{y∈Y} [ ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩ ]
  Add violated constraints to the working set and iterate
Polynomial-time convergence to any precision ε.
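The loop in pseudocode-like Python (a sketch: solve_qp is a stand-in for any QP solver restricted to the current working set, not a real library call, and loss_augmented_argmax is the slide's argmax step):

def train_ssvm_constraint_generation(data, phi, delta, loss_augmented_argmax,
                                     solve_qp, C=1.0, eps=1e-3, max_rounds=100):
    """Cutting-plane training: re-solve the QP over the constraints found
    so far, then add the most violated constraint per example."""
    working_set = {n: [] for n in range(len(data))}
    w, xi = solve_qp(working_set, data, phi, delta, C)   # w = 0 when empty
    for _ in range(max_rounds):
        added = False
        for n, (x_n, y_n) in enumerate(data):
            y_hat = loss_augmented_argmax(w, x_n, y_n)
            violation = (delta(y_n, y_hat)
                         + w.dot(phi(x_n, y_hat)) - w.dot(phi(x_n, y_n)))
            if violation > xi[n] + eps:                  # eps-violated constraint
                working_set[n].append(y_hat)
                added = True
        if not added:
            break          # every constraint satisfied up to eps: done
        w, xi = solve_qp(working_set, data, phi, delta, C)
    return w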
40. Initialize: no constraints.
Solve the minimization with the working set of constraints ⇒ w = 0.
Identify argmax_{y∈Y} [ ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩ ]:
  ⟨w, ϕ(x^n, y)⟩ = 0 → pick any window with ∆(y, y^n) = 1.
Add the violated constraints to the working set and iterate:
  ⟨w, ϕ(x^n, y^n)⟩ − ⟨w, ϕ(x^n, y)⟩ ≥ 1 for four such windows y [box thumbnails omitted].
41. Working set of constraints: the four constraints from the previous step.
Solve the minimization with the working set of constraints.
Identify argmax_{y∈Y} [ ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩ ].
Add the violated constraints to the working set and iterate:
  ⟨w, ϕ(x^n, y^n)⟩ − ⟨w, ϕ(x^n, y)⟩ ≥ 1, ≥ 0.9, ≥ 0.8, ≥ 0.01 for four new windows y [box thumbnails omitted].
42. Working set of constraints: the eight constraints collected so far [box thumbnails omitted].
Solve the minimization with the working set of constraints.
Identify argmax_{y∈Y} [ ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩ ].
Add the violated constraints to the working set and iterate, …
43. Constraint generation, recapped:
Similar to classical bootstrap training, but:
force margin between correct and incorrect location scores,
handle overlapping detections by fractional scores.
44. Results: PASCAL VOC 2006
Example detections for VOC 2006 bicycle, bus and cat.
Precision–recall curves for VOC 2006 bicycle, bus and cat.
Structured training improves detection accuracy.
49. Why does it work?
Learned weights from binary (center) and structured training (right).
Both training methods: positive weights at object region.
Structured training: negative weights for features just outside
the bounding box position.
Posterior distribution over box coordinates becomes more
peaked.
51. Segmentation as Structured Learning:
Given: training examples (x^n, y^n)_{n=1,…,N} [images with ground-truth segmentation masks]
Wanted: prediction function f : X → Y with
  X = {all images}
  Y = {all binary segmentations}
52. Structured SVM framework
Define:
  Feature function ϕ : X × Y → R^d:
    unary terms ϕ_i(x, y_i) for each pixel i
    pairwise terms ϕ_ij(x, y_i, y_j) for neighbors (i, j)
  Loss function ∆ : Y × Y → R,
    ideally one that decomposes like ϕ does.
Solve:
  min_{w,ξ} ½‖w‖² + C Σ_{n=1}^N ξ_n subject to
  ∀y ∈ Y : ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩ ≤ ξ_n.
Result:
  w* that determines the scoring function F(x, y) = ⟨w*, ϕ(x, y)⟩,
  segmentation function: f(x) = argmax_y F(x, y).
53. Example choices:
Feature functions, unary terms c = {i}:
  ϕ_i(x, y_i) = (0, h(x_i)) for y_i = 0,  (h(x_i), 0) for y_i = 1,
  where h(x_i) is the color histogram of pixel i.
Feature functions, pairwise terms c = {i, j}:
  ϕ_ij(x, y_i, y_j) = ⟦y_i ≠ y_j⟧.
Loss function: Hamming loss
  ∆(y, y′) = Σ_i ⟦y_i ≠ y′_i⟧
54. How to solve argmax_y [ ∆(y^n, y) + F(x^n, y) ]?
  ∆(y^n, y) + F(x^n, y)
  = Σ_i ⟦y^n_i ≠ y_i⟧ + Σ_i ⟨w_{y_i}, h(x^n_i)⟩ + Σ_{i,j} w_ij ⟦y_i ≠ y_j⟧
  = Σ_i [ ⟨w_{y_i}, h(x^n_i)⟩ + ⟦y^n_i ≠ y_i⟧ ] + Σ_{i,j} w_ij ⟦y_i ≠ y_j⟧
If w_ij ≥ 0 (which makes sense), then E := −F is submodular:
  use the GraphCut algorithm to find the global optimum efficiently.
Also possible: (loopy) belief propagation, variational inference, greedy search, simulated annealing, …
• [M. Szummer, P. Kohli: "Learning CRFs using graph cuts", ECCV 2008]
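To make the reduction concrete: binary MAP inference with non-negative pairwise weights maps onto an s-t minimum cut. An illustrative construction (not the talk's implementation) using networkx; unary0/unary1 are the unary costs of labels 0 and 1, assumed non-negative (shift by a constant otherwise):

import networkx as nx

def graphcut_segment(unary0, unary1, edges, w_pair):
    """Globally minimize sum_i unary_{y_i}(i) + sum_{(i,j)} w_pair*[y_i != y_j]
    over binary labelings via s-t min-cut (valid because w_pair >= 0)."""
    G = nx.DiGraph()
    for i, (u0, u1) in enumerate(zip(unary0, unary1)):
        G.add_edge('s', i, capacity=u1)    # cut iff y_i = 1: pays unary1[i]
        G.add_edge(i, 't', capacity=u0)    # cut iff y_i = 0: pays unary0[i]
    for i, j in edges:
        G.add_edge(i, j, capacity=w_pair)  # cut once iff y_i != y_j
        G.add_edge(j, i, capacity=w_pair)
    _, (s_side, _) = nx.minimum_cut(G, 's', 't')
    return [0 if i in s_side else 1 for i in range(len(unary0))]

For the loss-augmented argmax, the Hamming loss just adds a constant to the unary term of each wrong label, so the identical cut construction solves it.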
55. Extension: Image segmentation with connectedness constraints
Knowing that the object is connected improves segmentation quality.
[figure: ordinary segmentation ← original image → connected segmentation]
56. Segmentation as Structured Learning:
Given: training examples (x n , y n )n=1,...,N
Wanted: prediction function f : X → Y where
X = {all images (as superpixels)}
Y = {all connected binary segmentations}
• S. Nowozin, C.L.: Global Connectivity Potentials for Random Field Models, CVPR 2009.
57. Feature functions, unary terms c = {i}:
  ϕ_i(x, y_i) = (0, h(x_i)) for y_i = 0,  (h(x_i), 0) for y_i = 1,
  where h(x_i) is the bag-of-visual-words histogram of superpixel i.
Feature functions, pairwise terms c = {i, j}:
  ϕ_ij(y_i, y_j) = ⟦y_i ≠ y_j⟧.
Loss function: Hamming loss
  ∆(y, y′) = Σ_i ⟦y_i ≠ y′_i⟧
58. How to solve argmax_{y connected} [ ∆(y^n, y) + F(x^n, y) ]?
Linear programming relaxation with connectivity constraints:
rewrite the energy so that it is linear in new variables µ^l_i and µ^{ll′}_{ij}:
  F(x, y) = Σ_i [ w_1 h_i(x) µ^1_i + w_2 h_i(x) µ^{−1}_i ] + Σ_{i,j} Σ_{l≠l′} w_3 µ^{ll′}_{ij}
subject to
  µ^l_i ∈ {0, 1}, µ^{ll′}_{ij} ∈ {0, 1},
  Σ_l µ^l_i = 1, Σ_{l′} µ^{ll′}_{ij} = µ^l_i, Σ_l µ^{ll′}_{ij} = µ^{l′}_j.
Relax to µ^l_i ∈ [0, 1] and µ^{ll′}_{ij} ∈ [0, 1], and
solve the linear program with additional linear constraints:
  µ^1_i + µ^1_j − Σ_{k∈S} µ^1_k ≤ 1 for any set S of nodes separating i and j.
59. Example Results:
[figure: original image, plain segmentation, segmentation with connectivity]
… still room for improvement …
60. Summary
Machine Learning of Structured Outputs
  Task: predict f : X → Y for (almost) arbitrary Y
Key idea:
  learn a scoring function F : X × Y → R
  predict using f(x) := argmax_y F(x, y)
Structured Support Vector Machines
  Parametrize F(x, y) = ⟨w, ϕ(x, y)⟩
  Learn w from training data by a maximum-margin criterion
Needs only:
  feature function ϕ(x, y)
  loss function ∆(y, y′)
  routine to solve argmax_y [ ∆(y^n, y) + F(x^n, y) ]
61. Applications
Many different applications in a unified framework:
  Natural Language Processing: parsing
  Computational Biology: secondary structure prediction
  Computer Vision: pose estimation, object localization/segmentation
  …
Open Problems
Theory:
  which output structures are useful?
  (how) can we use approximate argmax_y?
Practice:
  more applications? new domains?
  training speed!
Thank you!