2. Visual representations
Geometric models (1970s-1990s)
Hand-coded models
Statistical classifiers (1990s-present)
Appearance-based representations
Large-scale training
3. Learned visual representations
Where is invariance built in?
Representation (linear classifier, ...)
Features
Viola Jones; Dalal Triggs
Learned model
$f_w(x) = w \cdot \Phi(x)$
Training
• Training data consists of images with labeled bounding boxes
• Need to learn the model structure, filters and deformation costs
positive weights, negative weights
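A minimal sketch of the linear scoring rule $f_w(x) = w \cdot \Phi(x)$ from this slide, trained on labeled positive and negative patches. The feature map `phi` (a single orientation histogram), the hinge-loss subgradient training loop and all parameter values are assumptions chosen for illustration; the slide does not specify the features or the optimizer used by the cited detectors.

```python
import numpy as np

def phi(patch, n_bins=8):
    """Toy feature map: magnitude-weighted histogram of gradient orientations."""
    gy, gx = np.gradient(patch.astype(float))   # derivatives along rows, columns
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)      # unsigned orientation in [0, pi)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    return hist / (np.linalg.norm(hist) + 1e-8)

def score(w, patch):
    """Linear classifier score f_w(x) = w . phi(x)."""
    return float(w @ phi(patch))

def train(patches, labels, epochs=50, lr=0.1, reg=1e-3):
    """Subgradient descent on the regularized hinge loss (labels are +1 / -1)."""
    X = np.stack([phi(p) for p in patches])
    y = np.asarray(labels, dtype=float)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            grad = reg * w
            if yi * (w @ xi) < 1:                # margin violated: hinge term is active
                grad = grad - yi * xi
            w -= lr * grad
    return w

# Hypothetical toy data: positives vary horizontally, negatives vertically.
ramp = np.tile(np.arange(8.0), (8, 1))
w = train([ramp, 2 * ramp, ramp.T, 2 * ramp.T], [1, 1, -1, -1])
```

With these toy patches the positive and negative features fall into different orientation bins, so the learned `w` separates them without a bias term.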
4. Learned visual representations
Where is invariance built in?
Representation (latent-variable classifier)
Features
[Figure: model visualization, panels (a), (b), (c)]
Felzenszwalb et al 09
Fig. 1. Detections obtained with a single component person model. The model is defined by a coarse root filter (a), several higher resolution part filters (b) and a spatial model for the location of each part relative to the root (c). The filters specify weights for histogram of oriented gradients features. Their visualization shows the positive weights at different orientations. The visualization of the spatial models reflects the “cost” of placing the center of a part at different locations relative to the root.
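The caption describes a root filter, part filters and a cost for displacing each part from its ideal location. A minimal sketch of how such a model could score every root location follows. It assumes the filter responses are already computed on a single shared grid, uses an isotropic quadratic deformation cost, and searches displacements by brute force; `shift`, `dpm_score` and all parameters are hypothetical names for illustration, not the authors' implementation.

```python
import numpy as np

def shift(a, oy, ox):
    """Return s with s[y, x] = a[y + oy, x + ox], and -inf where that falls outside a."""
    H, W = a.shape
    out = np.full((H, W), -np.inf)
    if abs(oy) >= H or abs(ox) >= W:
        return out
    dst_y = slice(max(0, -oy), min(H, H - oy))
    dst_x = slice(max(0, -ox), min(W, W - ox))
    src_y = slice(max(0, oy), min(H, H + oy))
    src_x = slice(max(0, ox), min(W, W + ox))
    out[dst_y, dst_x] = a[src_y, src_x]
    return out

def dpm_score(root_resp, part_resps, anchors, deform_w, radius=2):
    """Root response plus, for each part, the best (response - deformation cost).

    root_resp  : (H, W) response of the root filter
    part_resps : list of (H, W) part-filter responses (same grid: an assumption)
    anchors    : list of (dy, dx) ideal part offsets relative to the root
    deform_w   : list of scalar weights of the quadratic deformation cost
    """
    score = root_resp.astype(float).copy()
    for resp, (ay, ax), dw in zip(part_resps, anchors, deform_w):
        best = np.full(root_resp.shape, -np.inf)
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                cost = dw * (dy * dy + dx * dx)   # penalty for leaving the anchor
                best = np.maximum(best, shift(resp, ay + dy, ax + dx) - cost)
        score += best                              # latent part placement is maximized out
    return score
```

The maximization over part displacements is what makes this a latent-variable classifier: the part positions are not labeled, they are chosen per window to give the best score.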
5. Where does learning fit in?
[Diagram: Training images, Matching alg, Alg output, Ground truth; example detections labeled person, bottle, cat]
Tune parameters ( , ) till desired output on training set
“Graduate Student Descent” might take a while
(phrase from Marshall Tappen)
6. 5 years of PASCAL people detection
Matching results (after non-maximum suppression)
[Plot: average precision (y-axis, 0 to 50) by year (x-axis, 2005 to 2010)]
~1 second to search all scales
1% to 47% in 5 years
How do we move beyond the plateau?
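The matching results on this slide are reported after non-maximum suppression. A minimal sketch of a common greedy variant follows, assuming boxes are given as `(x1, y1, x2, y2)` and that overlap is measured by intersection over union with a 0.5 threshold; the overlap measure and threshold used by the detectors on the slide are not stated, so both are assumptions.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]            # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the kept box with every remaining box.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]          # drop boxes that overlap too much
    return keep
```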
7. How do we move beyond the plateau?
1. Develop more structured models with less invariant features
9. Invariance vs Parametric Search
Part-Based Models
[Figure: example detections labeled person, bottle, cat; part-based model panels (a), (b), (c)]
10. Learned visual representations
Where is invariance built in?
Representation
(latent-variable classifier)
Features
Yi & Ramanan 11
Buffy performance: 88% vs 73%
12. How do we move beyond the plateau?
1. Develop more structured models with less invariant features
2. Score syntax as semantics
13. The forgotten challenge....
Results
Two methods, both using detectors trained on other data
Oxford method does not attempt to detect feet
[Table: Head, Hand and Foot scores for BCNPCL_HumanLayout and OXFORD_SBD; numeric values not recoverable]
14. Structured classifiers
[Diagram labels: shape, classifier, reflectance; Estimated shape, Estimated reflectance]
Excerpt from the Pinocchio paper:

Figure 8: Top: heat equilibrium for two bones. Bottom: the result of rotating the right bone with the heat-based attachment

Treat the character volume as an insulated heat-conducting body and force the temperature of bone $i$ to be 1 while keeping the temperature of all of the other bones at 0. Then we can take the equilibrium temperature at each vertex on the surface as the weight of bone $i$ at that vertex. Figure 8 illustrates this in two dimensions.

Solving for heat equilibrium over a volume would require tessellating the volume and would be slow. Therefore, for simplicity, Pinocchio solves for equilibrium over the surface only, but at some vertices, it adds the heat transferred from the nearest bone. The equilibrium over the surface for bone $i$ is given by $\frac{\partial w^i}{\partial t} = \Delta w^i + H(p^i - w^i) = 0$, which can be written as

$-\Delta w^i + H w^i = H p^i$, (1)

where $\Delta$ is the discrete surface Laplacian, calculated with the cotangent formula [Meyer et al. 2003], $p^i$ is a vector with $p^i_j = 1$ if the nearest bone to vertex $j$ is $i$ and $p^i_j = 0$ otherwise, and $H$ is the diagonal matrix with $H_{jj}$ being the heat contribution weight of the nearest bone to vertex $j$. Because $\Delta$ has units of length$^{-2}$, so must $H$. Letting $d(j)$ be the distance from vertex $j$ to the nearest bone, Pinocchio uses $H_{jj} = c/d(j)^2$ if the shortest line segment from the vertex to the bone is contained in the character volume and $H_{jj} = 0$ if it is not. It uses the precomputed distance field to determine whether a line segment is entirely contained in the character volume. For $c \approx 0.22$, this method gives weights with similar transitions to those computed by finding the equilibrium over the volume. Pinocchio uses $c = 1$ (corresponding to anisotropic heat diffusion) because the results look more natural. When $k$ bones are equidistant from vertex $j$, heat contributions from all of them are used: $p^i_j$ is $1/k$ for all of them, and $H_{jj} = kc/d(j)^2$.

Equation (1) is a sparse linear system, and the left hand side matrix $-\Delta + H$ does not depend on $i$, the bone we are interested in. Thus we can factor the system once and back-substitute to find the weights for each bone. Botsch et al. [2005] show how to use a sparse Cholesky solver to compute the factorization for this kind of system. Pinocchio uses the TAUCS [Toledo 2003] library for this computation. Note also that the weights $w^i$ sum to 1 for each vertex: if we sum (1) over $i$, we get $(-\Delta + H) \sum_i w^i = H \cdot \mathbf{1}$, which yields $\sum_i w^i = \mathbf{1}$.

It is possible to speed up this method slightly by finding vertices that are unambiguously attached to a single bone and forcing their weight to 1. An earlier variant of our algorithm did this, but the improvement was negligible, and this introduced occasional artifacts.

5 Results
We evaluate Pinocchio with respect to the three criteria stated in the introduction: generality, quality, and performance. To ensure an objective evaluation, we use inputs that were not used during development. To this end, once the development was complete, we tested Pinocchio on 16 biped Cosmic Blobs models that we had not previously tried.

Figure 10: A centaur pirate with a centaur skeleton embedded looks at a cat with a quadruped skeleton embedded

Figure 11: The human scan on the left is rigged by Pinocchio and is posed on the right by changing joint angles in the embedded skeleton. The well-known deficiencies of LBS can be seen in the right knee and hip areas.

5.1 Generality
Figure 9 shows our 16 test characters and the skeletons Pinocchio embedded. The skeleton was correctly embedded into 13 of these models (81% success). For Models 7, 10 and 13, a hint for a single joint was sufficient to produce a good embedding.
These tests demonstrate the range of proportions that our method can tolerate: we have a well-proportioned human (Models 1–4, 8), large arms and tiny legs (6; in 10, this causes problems), and large legs and small arms (15; in 13, the small arms cause problems). For other characters we tested, skeletons were almost always correctly embedded into well-proportioned characters whose pose matched the given skeleton. Pinocchio was even able to transfer a biped walk onto a human hand, a cat on its hind legs, and a donut.
The most common issues we ran into on other characters were:
• The thinnest limb into which we may hope to embed a bone has a radius of $2\tau$. Characters with extremely thin limbs often fail because the graph we extract is disconnected. Reducing $\tau$, however, hurts performance.
• Degree 2 joints such as knees and elbows are often positioned incorrectly within a limb. We do not know of a reliable way to identify the right locations for them: on some characters they are thicker than the rest of the limb, and on others they are thinner.
Although most of our tests were done with the biped skeleton, we have also used other skeletons for other characters (Figure 10).

5.2 Quality
Figure 11 shows the results of manually posing a human scan using our attachment. Our video [Baran and Popović 2007b] demonstrates the quality of the animation produced by Pinocchio.
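The heat-equilibrium attachment in the excerpt amounts to solving $(-\Delta + H)\, w^i = H p^i$ for every bone $i$ with a single factorization. A minimal sketch follows, under loud assumptions: a dense NumPy Cholesky factorization stands in for the sparse solver, a plain graph Laplacian stands in for the cotangent Laplacian, each vertex has exactly one nearest bone, and the visibility test that sets $H_{jj} = 0$ is omitted. `heat_weights` and its arguments are hypothetical names.

```python
import numpy as np

def heat_weights(laplacian, nearest_bone, dist, n_bones, c=1.0):
    """Solve (-Lap + H) w^i = H p^i for all bones at once.

    laplacian    : (n, n) symmetric Laplacian whose rows sum to zero (assumed form)
    nearest_bone : (n,) index of the nearest bone for each vertex
    dist         : (n,) distance from each vertex to its nearest bone
    Returns W of shape (n, n_bones), column i holding the weights w^i.
    """
    n = laplacian.shape[0]
    H = np.diag(c / np.asarray(dist, dtype=float) ** 2)   # H_jj = c / d(j)^2
    P = np.zeros((n, n_bones))
    P[np.arange(n), nearest_bone] = 1.0                    # p^i_j = 1 for nearest bone
    A = -laplacian + H                                     # same matrix for every bone
    L = np.linalg.cholesky(A)                              # factor once: A = L L^T
    Y = np.linalg.solve(L, H @ P)                          # forward substitution
    return np.linalg.solve(L.T, Y)                         # back substitution

# Hypothetical toy "surface": six vertices on a path, two bones.
n = 6
adj = np.zeros((n, n))
idx = np.arange(n - 1)
adj[idx, idx + 1] = adj[idx + 1, idx] = 1.0
lap = adj - np.diag(adj.sum(axis=1))                       # graph Laplacian, rows sum to 0
W = heat_weights(lap, np.array([0, 0, 0, 1, 1, 1]), np.ones(n), n_bones=2)
```

Because the Laplacian rows sum to zero, each row of `W` sums to 1, which matches the property derived in the excerpt.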
15. Structured object reports
“If you’re not winning the game, change the rules”
Lead: Jitendra Malik (UC Berkeley)
Participants: Deva Ramanan (UC Irvine), Steve Seitz (U Washington), …
Introduction/goal: Human detection and pose estimation are tasks with many applications, … including next-generation human-computer interfaces and activity understanding. Detection … as a classification problem (does this window contain a person or not?), while pose estimation is often cast as a regression problem, where given an image or sequence of frames, one … joint angles. This project will take a more general view and cast both tasks as one of “p… a full syntactic parse will report the number of people present (if any), their body …
16. Caveat: we need more pixels
We should focus on high-resolution data
(in contrast to most learning methods)
Poster: Multiresolution models for object detection
Dennis Park, Deva Ramanan, Charless Fowlkes
Motivation & Goal
Objects in images come with various resolutions.
Most recognition systems are scale-invariant, i.e. fixed-size template
More pixels mean more information!
We want to use the information when it is available.
Goal:
1. We want to use more pixels.
2. We want to detect small instances as well.
3. In addition, we try to address the correlation between resolution and the role of context.
Building blocks
HOG features [1]
SVM & …
Model
[Poster right column, cut off in the source: S3. Now we re…; star model; eliminate bl…; LR global tem…; naturally fits; LR template; HR templat…; trained by La…; part locatio…; $\Phi(x, s, z) =$ …; scoring funct…; $f(x, s)$; detect…; S4. final mod…; The boundar…; Test image]
17. Caltech Pedestrian Benchmark
Multiresolution model
Park et al. 2010
Multiresolution representations decrease the error by 2X compared to previous work
[Figure: detections and missed detections for each model. Caption, cut off on the left in the source: “…, we show the result of our low-resolution rigid-template baseline. …s to detect large instances. On the right, we show detections of …, part-based baseline, which fails to find small instances. On the … detections of our multiresolution model that is able to detect both … instances. The threshold of each model is set to yield the same rate of …”]
18. How do we move beyond the plateau?
1. Develop more structured models with less invariant features
2. Score syntax as semantics
3. Generate ground-truth datasets of structured labels
25. How do we move beyond the plateau?
1. Develop more structured models with less invariant features
2. Score “nuisance” variables as meaningful output
3. Generate ground-truth datasets of structured labels
26. Diagram for Eero
Machine Learning
Vision
Vision as applied machine learning
27. Diagram for Eero
Vision
Graphics Machine Learning
(shape & appearance)
Vision as structured pattern recognition