1. Machine Learning of Structured Outputs
Christoph Lampert
IST Austria
(Institute of Science and Technology Austria)
Klosterneuburg
Feb 2, 2011
2. Machine Learning of Structured Outputs
Overview...
Introduction to Structured Learning
Structured Support Vector Machines
Applications in Computer Vision
Slides available at
http://www.ist.ac.at/~chl
3. What is Machine Learning?
Definition [T. Mitchell]:
Machine Learning is the study of computer algorithms
that improve their performance in a certain task
through experience.
Example: Backgammon
Task: play backgammon
Experience: self-play
Performance measure: games won against humans
Example: Object Recognition
Task: determine which objects are visible in images
Experience: annotated training data
Performance measure: objects recognized correctly
4. What is structured data?
Definition [ad hoc]:
Data is structured if it consists of several parts, and
not only the parts themselves contain information, but
also the way in which the parts belong together.
Examples: text, molecules / chemical structures, documents / hypertext, images.
5. The right tool for the problem.
Example: Machine Learning for/of Structured Data
[figure: image → body model → model fit]
Task: human pose estimation
Experience: images with manually annotated body pose
Performance measure: number of correctly localized body parts
6. Other tasks:
Natural Language Processing:
Automatic Translation (output: sentences)
Sentence Parsing (output: parse trees)
Bioinformatics:
RNA Structure Prediction (output: bipartite graphs)
Enzyme Function Prediction (output: path in a tree)
Speech Processing:
Automatic Transcription (output: sentences)
Text-to-Speech (output: audio signal)
Robotics:
Planning (output: sequence of actions)
This talk: only Computer Vision examples
7. "Normal" Machine Learning:
f : X → R.
inputs X can be any kind of objects
images, text, audio, sequence of amino acids, . . .
output y is a real number
classification, regression, . . .
many ways to construct f:
f(x) = a · ϕ(x) + b,
f(x) = decision tree,
f(x) = neural network
8. Structured Output Learning:
f : X → Y.
inputs X can be any kind of objects
outputs y ∈ Y are complex (structured) objects
images, parse trees, folds of a protein, . . .
how to construct f ?
9. Predicting Structured Outputs: Image Denoising
[figure: noisy image → denoised image]
input: images, output: denoised images
input set X = {grayscale images} ≙ [0, 255]^{M·N}
output set Y = {grayscale images} ≙ [0, 255]^{M·N}
energy minimization: f(x) := argmin_{y∈Y} E(x, y)
E(x, y) = λ Σ_i (x_i − y_i)² + µ Σ_{i,j} |y_i − y_j|
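To make the energy concrete, here is a minimal NumPy sketch (illustrative, not from the talk) that evaluates E(x, y) for a candidate output; lam and mu play the roles of λ and µ, and (i, j) ranges over vertically and horizontally adjacent pixel pairs:

import numpy as np

def denoising_energy(x, y, lam=1.0, mu=0.1):
    """E(x, y) = lam * sum_i (x_i - y_i)^2 + mu * sum_{i,j} |y_i - y_j|,
    with (i, j) running over 4-neighbor pixel pairs."""
    xf, yf = x.astype(float), y.astype(float)
    data_term = lam * np.sum((xf - yf) ** 2)
    smooth_term = mu * (np.sum(np.abs(np.diff(yf, axis=0)))
                        + np.sum(np.abs(np.diff(yf, axis=1))))
    return data_term + smooth_term

The predictor f(x) then has to search all of Y for the minimizer; that is what the inference algorithms discussed later (e.g. graph cuts) are for, since enumerating [0, 255]^{M·N} is infeasible.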
10. Predicting Structured Outputs: Human Pose Estimation
[figure: image + body model → model fit]
input: image and body model, output: model fit
input set X = {images}
output set Y = {positions/angles of K body parts} ≙ R^{4K}
energy minimization: f(x) := argmin_{y∈Y} E(x, y)
E(x, y) = Σ_i w_i ϕ_fit(x_i, y_i) + Σ_{i,j} w_ij ϕ_pose(y_i, y_j)
11. Predicting Structured Outputs: Shape Matching
input: image pairs
output: mapping y : x_i ↔ y(x_i)
scoring function F(x, y) = Σ_i w_i ϕ_sim(x_i, y(x_i)) + Σ_{i,j} w_ij ϕ_dist(x_i, x_j, y(x_i), y(x_j))
predict f : X → Y by f(x) := argmax_{y∈Y} F(x, y)
[J. McAuley et al.: "Robust Near-Isometric Matching via Structured Learning of Graphical Models", NIPS, 2008]
12. Predicting Structured Outputs: Tracking (by Detection)
[figure: image → object position]
input set X = {images}
output set Y = R² (box center) or R⁴ (box coordinates)
predict f : X → Y by f(x) := argmax_{y∈Y} F(x, y)
scoring function F(x, y) = ⟨w, ϕ(x, y)⟩, e.g. an SVM score
images: [C. L., Jan Peters, "Active Structured Learning for High-Speed Object Detection", DAGM 2009]
13. Predicting Structured Outputs: Summary
Image Denoising: y* = argmin_y E(x, y), E(x, y) = w_1 Σ_i (x_i − y_i)² + w_2 Σ_{i,j} |y_i − y_j|
Pose Estimation: y* = argmin_y E(x, y), E(x, y) = Σ_i w_i ϕ(x_i, y_i) + Σ_{i,j} w_ij ϕ(y_i, y_j)
Point Matching: y* = argmax_y F(x, y), F(x, y) = Σ_i w_i ϕ(x_i, y_i) + Σ_{i,j} w_ij ϕ(y_i, y_j)
Tracking: y* = argmax_y F(x, y), F(x, y) = ⟨w, ϕ(x, y)⟩
14. Unified Formulation
Predict structured output by maximization
  y* = argmax_{y∈Y} F(x, y)
of a compatibility function
  F(x, y) = ⟨w, ϕ(x, y)⟩
that is linear in a parameter vector w.
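In code, the unified formulation is nothing more than a linear score plus an argmax. A sketch with a brute-force argmax over an explicit candidate set (real structured problems replace this enumeration with dedicated inference):

import numpy as np

def F(w, phi, x, y):
    """Compatibility function F(x, y) = <w, phi(x, y)>, linear in w."""
    return np.dot(w, phi(x, y))

def predict(w, phi, x, candidates):
    """f(x) = argmax_{y in Y} F(x, y); here Y is a small explicit list."""
    return max(candidates, key=lambda y: F(w, phi, x, y))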
15. Structured Prediction: how to evaluate argmax_{y∈Y} F(x, y)?
loop-free graphs (chain, tree): Shortest-Path / Belief Propagation (BP)
loopy graphs (grid, arbitrary graph): GraphCut, approximate inference (e.g. loopy BP)
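For loop-free graphs the argmax is exact and cheap. A minimal max-sum (Viterbi-style) sketch for a chain, with my own array conventions: unary[t, l] is the score of label l at node t, pairwise[l, l'] the score of the transition l → l'.

import numpy as np

def chain_argmax(unary, pairwise):
    """Exact argmax of sum_t unary[t, y_t] + sum_t pairwise[y_t, y_{t+1}]
    over label sequences, by dynamic programming (max-sum / Viterbi)."""
    T, L = unary.shape
    score = unary[0].copy()
    backptr = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + pairwise              # cand[prev, cur]
        backptr[t] = np.argmax(cand, axis=0)          # best predecessor
        score = cand[backptr[t], np.arange(L)] + unary[t]
    y = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):                     # backtrace
        y.append(int(backptr[t][y[-1]]))
    return y[::-1]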
Structured Learning: how to learn F (x, y) from examples?
16. Machine Learning for Structured Outputs
Learning Problem:
Task: predict structured objects f : X → Y
Experience: example pairs {(x¹, y¹), …, (x^N, y^N)} ⊂ X × Y:
typical inputs with "correct" outputs for them.
Performance measure: ∆ : Y × Y → R
Our choice:
parametric family: F(x, y; w) = ⟨w, ϕ(x, y)⟩
prediction method: f(x) = argmax_{y∈Y} F(x, y; w)
Task: determine a "good" w
17. Reminder: regularized risk minimization
Find w for the decision function F(x, y; w) = ⟨w, ϕ(x, y)⟩ by

  min_{w∈R^d} λ‖w‖² + Σ_{n=1}^N ℓ(y^n, F(x^n, ·; w))

Regularization + empirical loss (on training data).

Logistic loss (Conditional Random Fields):
  ℓ(y^n, F(x^n, ·; w)) = log Σ_{y∈Y} exp[ F(x^n, y; w) − F(x^n, y^n; w) ]
Hinge loss (Maximum-Margin Training):
  ℓ(y^n, F(x^n, ·; w)) = max_{y∈Y} [ ∆(y^n, y) + F(x^n, y; w) − F(x^n, y^n; w) ]
Exponential loss (Boosting):
  ℓ(y^n, F(x^n, ·; w)) = Σ_{y∈Y∖{y^n}} exp[ F(x^n, y; w) − F(x^n, y^n; w) ]
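For a small explicit output set, the hinge loss above can be evaluated directly (a sketch; for real structured outputs the max over Y is computed by loss-augmented inference rather than enumeration):

import numpy as np

def structured_hinge(w, phi, delta, x_n, y_n, candidates):
    """max_y [Delta(y^n, y) + <w, phi(x^n, y)> - <w, phi(x^n, y^n)>]."""
    score_true = np.dot(w, phi(x_n, y_n))
    return max(delta(y_n, y) + np.dot(w, phi(x_n, y)) - score_true
               for y in candidates)

Since y^n itself contributes the value 0 to the max, the loss is always non-negative.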
19. Structured Support Vector Machine:

  min_{w∈R^d} ½‖w‖² + (C/N) Σ_{n=1}^N max_{y∈Y} [ ∆(y^n, y) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩ ]

Unconstrained optimization, convex, non-differentiable objective.
[I. Tsochantaridis, T. Joachims, T. Hofmann, Y. Altun. "Large Margin Methods for Structured and Interdependent
Output Variables", JMLR, 2005.]
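One simple way to attack this convex, non-differentiable objective (a baseline, not the training method the talk advocates) is subgradient descent; loss_augmented_argmax below is a placeholder for the problem-specific inference routine:

import numpy as np

def ssvm_subgradient(data, phi, delta, loss_augmented_argmax,
                     dim, C=1.0, epochs=100, lr=0.01):
    """Subgradient descent on
    1/2 ||w||^2 + C/N sum_n max_y [Delta + <w,phi(y)> - <w,phi(y^n)>]."""
    w = np.zeros(dim)
    N = len(data)
    for _ in range(epochs):
        g = w.copy()                                    # d/dw of 1/2 ||w||^2
        for x_n, y_n in data:
            y_hat = loss_augmented_argmax(w, x_n, y_n)  # most violating output
            hinge = (delta(y_n, y_hat)
                     + w.dot(phi(x_n, y_hat) - phi(x_n, y_n)))
            if hinge > 0:                               # subgradient of the max
                g += (C / N) * (phi(x_n, y_hat) - phi(x_n, y_n))
        w -= lr * g
    return w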
22. Structured SVM (equivalent formulation):

  min_{w∈R^d, ξ∈R^N_+} ½‖w‖² + (C/N) Σ_{n=1}^N ξ_n
  subject to, for n = 1, …, N:
    max_{y∈Y} [ ∆(y^n, y) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩ ] ≤ ξ_n

N non-linear constraints, convex, differentiable objective.
23. Structured SVM (also equivalent formulation):

  min_{w∈R^d, ξ∈R^N_+} ½‖w‖² + (C/N) Σ_{n=1}^N ξ_n
  subject to, for n = 1, …, N:
    ∆(y^n, y) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩ ≤ ξ_n for all y ∈ Y

N·|Y| linear constraints, convex, differentiable objective.
24. Example: A "True" Multiclass SVM

  Y = {1, 2, …, K}, ∆(y, y′) = 1 for y ≠ y′, 0 otherwise.
  ϕ(x, y) = ( ⟦y = 1⟧ Φ(x), ⟦y = 2⟧ Φ(x), …, ⟦y = K⟧ Φ(x) )
          = Φ(x) e_y, with e_y the y-th unit vector

Solve:
  min_{w,ξ} ½‖w‖² + (C/N) Σ_{n=1}^N ξ_n
  subject to, for n = 1, …, N:
    ⟨w, ϕ(x^n, y^n)⟩ − ⟨w, ϕ(x^n, y)⟩ ≥ 1 − ξ_n for all y ∈ Y.

Classification (MAP): f(x) = argmax_{y∈Y} ⟨w, ϕ(x, y)⟩
This is the Crammer-Singer Multiclass SVM.
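The joint feature map Φ(x) e_y is easy to realize in code; an illustrative sketch, using 0-based labels y ∈ {0, …, K−1}:

import numpy as np

def multiclass_phi(features, y, K):
    """phi(x, y): Phi(x) placed in the y-th of K blocks, zeros elsewhere."""
    d = features.shape[0]
    phi = np.zeros(K * d)
    phi[y * d:(y + 1) * d] = features
    return phi

def multiclass_predict(w, features, K):
    """f(x) = argmax_y <w, phi(x, y)>."""
    return int(np.argmax([np.dot(w, multiclass_phi(features, y, K))
                          for y in range(K)]))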
25. Hierarchical Multiclass Classification
The loss function can reflect a label hierarchy, e.g. a tree over {cat, dog, car, bus}:

  ∆(y, y′) := ½ (distance in tree)
  ∆(cat, cat) = 0, ∆(cat, dog) = 1, ∆(cat, bus) = 2, etc.

Solve:
  min_{w,ξ} ½‖w‖² + (C/N) Σ_{n=1}^N ξ_n
  subject to, for n = 1, …, N:
    ⟨w, ϕ(x^n, y^n)⟩ − ⟨w, ϕ(x^n, y)⟩ ≥ ∆(y^n, y) − ξ_n for all y ∈ Y.
26. Kernelized S-SVM problem:
Define:
  joint kernel function k : (X × Y) × (X × Y) → R,
  kernel matrix K_{nn′yy′} = k( (x^n, y), (x^{n′}, y′) ).

  max_{α∈R^{N|Y|}_+} Σ_{n=1,…,N} Σ_{y∈Y} α_{ny} ∆(y^n, y) − ½ Σ_{n,n′=1,…,N} Σ_{y,y′∈Y} α_{ny} α_{n′y′} K_{nn′yy′}
  subject to, for n = 1, …, N:
    Σ_{y∈Y} α_{ny} ≤ C/N.

Kernelized prediction function:
  f(x) = argmax_{y∈Y} Σ_{n,y′} α_{ny′} k( (x^n, y^n), (x, y) )

Too many variables: train with a working set of the α_{ny}.
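Prediction in the kernelized model touches only the non-zero α's; a sketch, with support holding (α_{ny}, x^n, y^n) triples kept in the working set and an explicit candidate list standing in for Y:

def kernel_predict(support, k, x, candidates):
    """f(x) = argmax_y sum over the working set of alpha * k((x^n, y^n), (x, y)),
    using only the non-zero alphas."""
    def score(y):
        return sum(a * k((x_n, y_n), (x, y)) for a, x_n, y_n in support)
    return max(candidates, key=score)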
30. Object Localization ⇒ Scene Interpretation
A man inside of a car ⇒ he's driving. A man outside of a car ⇒ he's passing by.
31. Object Localization as Structured Learning:
Given: training examples (x^n, y^n)_{n=1,…,N}
Wanted: prediction function f : X → Y where
  X = {all images}
  Y = {all boxes}
[figure: f_car(image) = bounding box around the car]
32. Structured SVM framework
Define:
  feature function ϕ : X × Y → R^d,
  loss function ∆ : Y × Y → R,
  routine to solve argmax_{y∈Y} [ ∆(y^n, y) + ⟨w, ϕ(x^n, y)⟩ ].
Solve:
  min_{w,ξ} ½‖w‖² + C Σ_{n=1}^N ξ_n subject to
  ∀y ∈ Y : ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩ ≤ ξ_n.
Result:
  w* that determines the scoring function F(x, y) = ⟨w*, ϕ(x, y)⟩,
  localization function: f(x) = argmax_y F(x, y).
• M. Blaschko, C.L.: Learning to Localize Objects with Structured Output Regression, ECCV 2008.
33. Feature function: how to represent an (image, box)-pair (x, y)?
Observation: whether y is the right box for x depends only on x|_y, the image contents inside the box.
  ϕ(x, y) := h(x|_y)
where h(r) is a (bag-of-visual-words) histogram representation of the region r.
[illustration: boxes on the same object class yield similar histograms]
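With precomputed interest points and visual-word assignments, h(x|_y) is just a histogram over the words that fall inside the box. A hedged sketch (the point/word representation below is an assumption, not the paper's exact pipeline):

import numpy as np

def box_histogram(points, words, box, vocab_size):
    """h(x|_y): normalized bag-of-visual-words histogram of the interest
    points inside box = (left, top, right, bottom).
    points: list of (px, py); words: visual-word index per point."""
    left, top, right, bottom = box
    inside = [w for (px, py), w in zip(points, words)
              if left <= px <= right and top <= py <= bottom]
    hist = np.bincount(np.asarray(inside, dtype=int), minlength=vocab_size)
    return hist / max(1, hist.sum())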
35. Loss function: how to compare two boxes y and y′?

  ∆(y, y′) := 1 − (area overlap between y and y′)
            = 1 − area(y ∩ y′) / area(y ∪ y′)
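This overlap loss takes only a few lines; a sketch with boxes given as (left, top, right, bottom) tuples:

def box_loss(y, y_prime):
    """Delta(y, y') = 1 - area(y ∩ y') / area(y ∪ y')."""
    l1, t1, r1, b1 = y
    l2, t2, r2, b2 = y_prime
    iw = max(0, min(r1, r2) - max(l1, l2))     # intersection width
    ih = max(0, min(b1, b2) - max(t1, t2))     # intersection height
    inter = iw * ih
    union = (r1 - l1) * (b1 - t1) + (r2 - l2) * (b2 - t2) - inter
    return 1.0 - inter / union if union > 0 else 1.0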
38. Structured Support Vector Machine
S-SVM Optimization:
  min_{w,ξ} ½‖w‖² + C Σ_{n=1}^N ξ_n
  subject to, for n = 1, …, N:
  ∀y ∈ Y : ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩ ≤ ξ_n.
Solve via constraint generation. Iterate:
  Solve the minimization with the working set of constraints → new w
  Identify argmax_{y∈Y} [ ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩ ]
  Add violated constraints to the working set and iterate
Polynomial-time convergence to any precision ε.
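The loop in pseudocode-like Python (a sketch: solve_qp is a stand-in for any QP solver restricted to the current working set, not a real library call, and loss_augmented_argmax is the slide's argmax step):

def train_ssvm_constraint_generation(data, phi, delta, loss_augmented_argmax,
                                     solve_qp, C=1.0, eps=1e-3, max_rounds=100):
    """Cutting-plane training: re-solve the QP over the constraints found
    so far, then add the most violated constraint per example."""
    working_set = {n: [] for n in range(len(data))}
    w, xi = solve_qp(working_set, data, phi, delta, C)   # w = 0 when empty
    for _ in range(max_rounds):
        added = False
        for n, (x_n, y_n) in enumerate(data):
            y_hat = loss_augmented_argmax(w, x_n, y_n)
            violation = (delta(y_n, y_hat)
                         + w.dot(phi(x_n, y_hat)) - w.dot(phi(x_n, y_n)))
            if violation > xi[n] + eps:                  # eps-violated constraint
                working_set[n].append(y_hat)
                added = True
        if not added:
            break          # every constraint satisfied up to eps: done
        w, xi = solve_qp(working_set, data, phi, delta, C)
    return w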
40. Initialize: no constraints.
Solve the minimization with the working set of constraints ⇒ w = 0.
Identify argmax_{y∈Y} [ ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩ ]:
  ⟨w, ϕ(x^n, y)⟩ = 0 → pick any window with ∆(y, y^n) = 1.
Add the violated constraints to the working set and iterate:
  ⟨w, ϕ(x^n, y^n)⟩ − ⟨w, ϕ(x^n, y)⟩ ≥ 1 for four such windows y [box thumbnails omitted].
41. Working set of constraints: the four constraints from the previous step.
Solve the minimization with the working set of constraints.
Identify argmax_{y∈Y} [ ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩ ].
Add the violated constraints to the working set and iterate:
  ⟨w, ϕ(x^n, y^n)⟩ − ⟨w, ϕ(x^n, y)⟩ ≥ 1, ≥ 0.9, ≥ 0.8, ≥ 0.01 for four new windows y [box thumbnails omitted].
42. Working set of constraints: the eight constraints collected so far [box thumbnails omitted].
Solve the minimization with the working set of constraints.
Identify argmax_{y∈Y} [ ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩ ].
Add the violated constraints to the working set and iterate, …
43. Constraint generation, recapped:
Similar to classical bootstrap training, but:
force margin between correct and incorrect location scores,
handle overlapping detections by fractional scores.
44. Results: PASCAL VOC 2006
Example detections for VOC 2006 bicycle, bus and cat.
Precision–recall curves for VOC 2006 bicycle, bus and cat.
Structured training improves detection accuracy.
49. Why does it work?
Learned weights from binary (center) and structured training (right).
Both training methods: positive weights at object region.
Structured training: negative weights for features just outside
the bounding box position.
Posterior distribution over box coordinates becomes more
peaked.
51. Segmentation as Structured Learning:
Given: training examples (x^n, y^n)_{n=1,…,N} [images with ground-truth segmentation masks]
Wanted: prediction function f : X → Y with
  X = {all images}
  Y = {all binary segmentations}
52. Structured SVM framework
Define:
  Feature function ϕ : X × Y → R^d:
    unary terms ϕ_i(x, y_i) for each pixel i
    pairwise terms ϕ_ij(x, y_i, y_j) for neighbors (i, j)
  Loss function ∆ : Y × Y → R,
    ideally one that decomposes like ϕ does.
Solve:
  min_{w,ξ} ½‖w‖² + C Σ_{n=1}^N ξ_n subject to
  ∀y ∈ Y : ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩ ≤ ξ_n.
Result:
  w* that determines the scoring function F(x, y) = ⟨w*, ϕ(x, y)⟩,
  segmentation function: f(x) = argmax_y F(x, y).
53. Example choices:
Feature functions, unary terms c = {i}:
  ϕ_i(x, y_i) = (0, h(x_i)) for y_i = 0,  (h(x_i), 0) for y_i = 1,
  where h(x_i) is the color histogram of pixel i.
Feature functions, pairwise terms c = {i, j}:
  ϕ_ij(x, y_i, y_j) = ⟦y_i ≠ y_j⟧.
Loss function: Hamming loss
  ∆(y, y′) = Σ_i ⟦y_i ≠ y′_i⟧
54. How to solve argmax_y [ ∆(y^n, y) + F(x^n, y) ]?
  ∆(y^n, y) + F(x^n, y)
  = Σ_i ⟦y^n_i ≠ y_i⟧ + Σ_i ⟨w_{y_i}, h(x^n_i)⟩ + Σ_{i,j} w_ij ⟦y_i ≠ y_j⟧
  = Σ_i [ ⟨w_{y_i}, h(x^n_i)⟩ + ⟦y^n_i ≠ y_i⟧ ] + Σ_{i,j} w_ij ⟦y_i ≠ y_j⟧
If w_ij ≥ 0 (which makes sense), then E := −F is submodular:
  use the GraphCut algorithm to find the global optimum efficiently.
Also possible: (loopy) belief propagation, variational inference, greedy search, simulated annealing, …
• [M. Szummer, P. Kohli: "Learning CRFs using graph cuts", ECCV 2008]
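To make the reduction concrete: binary MAP inference with non-negative pairwise weights maps onto an s-t minimum cut. An illustrative construction (not the talk's implementation) using networkx; unary0/unary1 are the unary costs of labels 0 and 1, assumed non-negative (shift by a constant otherwise):

import networkx as nx

def graphcut_segment(unary0, unary1, edges, w_pair):
    """Globally minimize sum_i unary_{y_i}(i) + sum_{(i,j)} w_pair*[y_i != y_j]
    over binary labelings via s-t min-cut (valid because w_pair >= 0)."""
    G = nx.DiGraph()
    for i, (u0, u1) in enumerate(zip(unary0, unary1)):
        G.add_edge('s', i, capacity=u1)    # cut iff y_i = 1: pays unary1[i]
        G.add_edge(i, 't', capacity=u0)    # cut iff y_i = 0: pays unary0[i]
    for i, j in edges:
        G.add_edge(i, j, capacity=w_pair)  # cut once iff y_i != y_j
        G.add_edge(j, i, capacity=w_pair)
    _, (s_side, _) = nx.minimum_cut(G, 's', 't')
    return [0 if i in s_side else 1 for i in range(len(unary0))]

For the loss-augmented argmax, the Hamming loss just adds a constant to the unary term of each wrong label, so the identical cut construction solves it.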
55. Extension: Image segmentation with connectedness constraints
Knowing that the object is connected improves segmentation quality.
[figure: ordinary segmentation ← original image → connected segmentation]
56. Segmentation as Structured Learning:
Given: training examples (x n , y n )n=1,...,N
Wanted: prediction function f : X → Y where
X = {all images (as superpixels)}
Y = {all connected binary segmentations}
• S. Nowozin, C.L.: Global Connectivity Potentials for Random Field Models, CVPR 2009.
57. Feature functions, unary terms c = {i}:
  ϕ_i(x, y_i) = (0, h(x_i)) for y_i = 0,  (h(x_i), 0) for y_i = 1,
  where h(x_i) is the bag-of-visual-words histogram of superpixel i.
Feature functions, pairwise terms c = {i, j}:
  ϕ_ij(y_i, y_j) = ⟦y_i ≠ y_j⟧.
Loss function: Hamming loss
  ∆(y, y′) = Σ_i ⟦y_i ≠ y′_i⟧
58. How to solve argmax_{y connected} [ ∆(y^n, y) + F(x^n, y) ]?
Linear programming relaxation with connectivity constraints:
rewrite the energy so that it is linear in new variables µ^l_i and µ^{ll′}_{ij}:
  F(x, y) = Σ_i [ w_1 h_i(x) µ^1_i + w_2 h_i(x) µ^{−1}_i ] + Σ_{i,j} Σ_{l≠l′} w_3 µ^{ll′}_{ij}
subject to
  µ^l_i ∈ {0, 1}, µ^{ll′}_{ij} ∈ {0, 1},
  Σ_l µ^l_i = 1, Σ_{l′} µ^{ll′}_{ij} = µ^l_i, Σ_l µ^{ll′}_{ij} = µ^{l′}_j.
Relax to µ^l_i ∈ [0, 1] and µ^{ll′}_{ij} ∈ [0, 1], and
solve the linear program with additional linear constraints:
  µ^1_i + µ^1_j − Σ_{k∈S} µ^1_k ≤ 1 for any set S of nodes separating i and j.
59. Example Results:
[figure: original image, plain segmentation, segmentation with connectivity]
… still room for improvement …
60. Summary
Machine Learning of Structured Outputs
  Task: predict f : X → Y for (almost) arbitrary Y
Key idea:
  learn a scoring function F : X × Y → R
  predict using f(x) := argmax_y F(x, y)
Structured Support Vector Machines
  Parametrize F(x, y) = ⟨w, ϕ(x, y)⟩
  Learn w from training data by a maximum-margin criterion
Needs only:
  feature function ϕ(x, y)
  loss function ∆(y, y′)
  routine to solve argmax_y [ ∆(y^n, y) + F(x^n, y) ]
61. Applications
Many different applications in a unified framework:
  Natural Language Processing: parsing
  Computational Biology: secondary structure prediction
  Computer Vision: pose estimation, object localization/segmentation
  …
Open Problems
Theory:
  which output structures are useful?
  (how) can we use approximate argmax_y?
Practice:
  more applications? new domains?
  training speed!
Thank you!