
Fcv learn ramanan

  1. Learning structured representations. Deva Ramanan, UC Irvine
  2. Training a learned model: fw(x) = w · Φ(x)
     • Training data consists of images with labeled bounding boxes
     • Need to learn the model structure, filters and deformation costs
     From geometric models (1970s-1990s: hand-coded models) to statistical classifiers (1990s-present: appearance-based representations, learned weights, large-scale training).
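The linear scoring rule fw(x) = w · Φ(x) on slide 2 is just a dot product between a learned weight template and the window's feature vector. A minimal sketch (the weights and features below are made-up stand-ins for a trained template and HOG features):

```python
import numpy as np

def score_window(w, phi):
    """Linear detector score fw(x) = w . Phi(x)."""
    return float(np.dot(w, phi))

# Toy 4-D feature vector standing in for the HOG features of one
# candidate window; weights and features here are illustrative only.
w = np.array([0.5, -1.0, 2.0, 0.1])   # learned template weights
phi = np.array([1.0, 0.2, 0.5, 3.0])  # features Phi(x) of the window
print(score_window(w, phi))  # higher score = more person-like
```

A detector thresholds this score at every window position and scale.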
  3. Learned visual representations: where is the invariance built in?
     Representation (linear classifier, ...) vs. features. Examples: Viola & Jones, Dalal & Triggs.
  4. Learned visual representations: where is the invariance built in?
     Representation (latent-variable classifier) vs. features. Felzenszwalb et al. 09.
     [Figure: detections obtained with a single-component person model. The model is defined by a coarse root filter (a), several higher-resolution part filters (b), and a spatial model (c) for the location of each part relative to the root. The filters specify weights for histogram-of-oriented-gradients features; their visualization shows the positive weights at different orientations. The spatial models reflect the "cost" of placing the center of a part at different locations relative to the root.]
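The star model on slide 4 scores a detection as the root-filter response plus, for each part, the best trade-off between the part filter's response and its deformation cost. A minimal sketch over a discrete set of candidate displacements (all numbers and the displacement grid are made up for illustration):

```python
def dpm_score(root_resp, part_resps, def_costs):
    """Star-model score: root response plus, for each part, the best
    placement trading appearance response against deformation cost.

    part_resps[i][d] : response of part filter i at displacement d
    def_costs[i][d]  : spatial penalty for that displacement
    """
    score = root_resp
    for resp, cost in zip(part_resps, def_costs):
        score += max(r - c for r, c in zip(resp, cost))
    return score

# Two parts, three candidate displacements each (hypothetical numbers).
print(dpm_score(1.0,
                [[0.5, 0.9, 0.2], [0.3, 0.1, 0.8]],
                [[0.0, 0.2, 0.4], [0.4, 0.0, 0.3]]))
```

Because part placements are latent, the same maximization runs inside both detection and training.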
  5. Where does learning fit in?
     Training: images → matching algorithm → output, compared against ground truth.
     Tune parameters ( , ) till the desired output is produced on the training set. 'Graduate Student Descent' might take a while (phrase from Marshall Tappen).
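The tuning loop slide 5 pokes fun at can be written down as brute-force search over a small parameter grid. The error function and parameter names below are hypothetical stand-ins for "run the matcher on the training set and compare to ground truth":

```python
import itertools

def train_error(c, gamma):
    """Stand-in for: run matching with parameters (c, gamma) and
    measure disagreement with ground truth. This toy surrogate is
    minimized at c=1.0, gamma=0.1 by construction."""
    return (c - 1.0) ** 2 + (gamma - 0.1) ** 2

best = min(itertools.product([0.1, 1.0, 10.0], [0.01, 0.1, 1.0]),
           key=lambda p: train_error(*p))
print(best)  # → (1.0, 0.1), the grid point with lowest training error
```

Grid search is the honest automation of this process; the slide's point is that the real version is slow because each evaluation is a full training-set run.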
  6. 5 years of PASCAL people detection
     [Chart: average precision of matching results (after non-maximum suppression), 2005-2010.]
     ~1 second to search all scales; 1% to 47% average precision in 5 years. How do we move beyond the plateau?
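Slide 6 reports results after non-maximum suppression, the standard post-processing step that keeps only the highest-scoring detection among heavily overlapping ones. A minimal greedy sketch (box format and the 0.5 overlap threshold are conventional choices, not specified in the talk):

```python
def nms(detections, overlap_thresh=0.5):
    """Greedy non-maximum suppression on (score, x1, y1, x2, y2) boxes."""
    def iou(a, b):
        ax1, ay1, ax2, ay2 = a
        bx1, by1, bx2, by2 = b
        iw = max(0, min(ax2, bx2) - max(ax1, bx1))
        ih = max(0, min(ay2, by2) - max(ay1, by1))
        inter = iw * ih
        union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
        return inter / union if union > 0 else 0.0

    keep = []
    for _score, *box in sorted(detections, reverse=True):
        if all(iou(box, kept) < overlap_thresh for kept in keep):
            keep.append(box)
    return keep

# Two overlapping boxes and one distant box: the weaker overlap is suppressed.
dets = [(0.9, 0, 0, 10, 10), (0.8, 1, 1, 11, 11), (0.7, 50, 50, 60, 60)]
print(nms(dets))  # → [[0, 0, 10, 10], [50, 50, 60, 60]]
```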
  7. How do we move beyond the plateau?
     1. Develop more structured models with less invariant features
  8. Invariance vs Search: Projective Invariants, View-Based Mixtures
  9. Invariance vs Parametric Search: Part-Based Models
  10. Learned visual representations: where is the invariance built in?
      Representation (latent-variable classifier) vs. features. Yi & Ramanan 11. Buffy performance: 88% vs 73%.
  11. Qualitative Results
  12. How do we move beyond the plateau?
      1. Develop more structured models with less invariant features
      2. Score syntax as semantics
  13. The forgotten challenge...
      Results of two methods, both using detectors trained on other data. The Oxford method does not attempt to detect feet.
      [Table: Head / Hand / Foot localization on PASCAL Human Layout; the numeric entries are unrecoverable due to symbol-font encoding.]
  14. Structured classifiers
      [Figure residue: the slide's labels include "shape", "Estimated shape", "reflectance", "Estimated reflectance", and "classifier". The remaining extracted text is spill-over from the source of the embedded figures, Baran and Popović's Pinocchio paper ("Automatic Rigging and Animation of 3D Characters"), and is not part of the talk.]
  15. Structured object reports. "If you're not winning the game, change the rules"
      Lead: Jitendra Malik (UC Berkeley). Participants: Deva Ramanan (UC Irvine), Steve Seitz (U Washington).
      Introduction/goal: Human detection and pose estimation are tasks with many applications, including next-generation human-computer interfaces and activity understanding. Detection is cast as a classification problem (does this window contain a person or not?), while pose estimation is often cast as a regression problem, where given an image or sequence of frames, one must report joint angles. This project will take a more general view and cast both tasks as one of "parsing": a full syntactic parse will report the number of people present (if any), their body...
  16. Caveat: we need more pixels. Multiresolution models for object detection. Dennis Park, Deva Ramanan, Charless Fowlkes.
      Motivation & goal: Objects in images come at various resolutions. Most recognition systems are scale-invariant, i.e. a fixed-size template, but more pixels mean more information; we want to use that information when it is available.
      Goals: 1. use more pixels; 2. detect small instances as well; 3. address the correlation between resolution and the role of context.
      Building blocks: HOG features [1], SVM; features Φ(x, s, z); scoring function f(x, s). We should focus on high-resolution data (in contrast to most learning methods). The model combines a low-resolution rigid template with a high-resolution part-based template with latent part locations.
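The multiresolution idea on slide 16 can be sketched as a scale-dependent switch: score an instance with the high-resolution part-based template when enough pixels are available, otherwise fall back to the low-resolution rigid template. The 80-pixel cutoff and the scoring callables below are illustrative, not values from the paper:

```python
def multires_score(height_px, score_lr, score_hr, min_hr_height=80):
    """Score a candidate detection with the template its size supports:
    the high-resolution part-based template when the instance is tall
    enough in pixels, otherwise the low-resolution rigid template.
    (The 80-pixel threshold is a hypothetical choice.)"""
    if height_px >= min_hr_height:
        return score_hr(height_px)
    return score_lr(height_px)

# Hypothetical scoring callables standing in for the two templates.
print(multires_score(120, lambda h: 0.4, lambda h: 0.9))  # → 0.9 (HR used)
print(multires_score(40,  lambda h: 0.4, lambda h: 0.9))  # → 0.4 (LR fallback)
```

In the actual model the switch is encoded in the feature map Φ(x, s, z), so a single weight vector is trained jointly over both resolutions.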
  17. Caltech Pedestrian Benchmark. Park et al. 2010.
      [Figure: the low-resolution rigid-template baseline fails to detect large instances; the part-based baseline fails to find small instances; the multiresolution model detects both, with each model's threshold set to the same miss rate.]
      Multiresolution representations decrease the error by 2X compared to previous work.
  18. How do we move beyond the plateau?
      1. Develop more structured models with less invariant features
      2. Score syntax as semantics
      3. Generate ground-truth datasets of structured labels
  19. Case study: small or big parts? Skeleton / Parts (Poselets) / Mini-parts
  20. What are good representations? Exemplars, Parts, Attributes, Visual Phrases, Grammars, ?
  21. Even worse: what are the parts (if any)? Is there any structure to label here?
  22. Sharing surfaces?
  23. Selective parameter sharing: Exemplars => Parts => Attributes => Grammars
      Multi-task training of instance-specific classifiers
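One simple way to realize "multi-task training of instance-specific classifiers" with parameter sharing is to regularize each per-instance weight vector toward a shared vector. The alternating ridge-regression scheme below is an illustrative sketch of that idea, not the talk's actual method; all names and hyperparameters are made up:

```python
import numpy as np

def shared_ridge(X_list, y_list, lam=0.1, mu=1.0, iters=50):
    """Per-instance regressors w_i pulled toward a shared vector w0.
    Illustrative objective, minimized by alternating over w_i and w0:
        sum_i ||X_i w_i - y_i||^2 + lam ||w_i||^2 + mu ||w_i - w0||^2
    """
    d = X_list[0].shape[1]
    w0 = np.zeros(d)
    ws = [np.zeros(d) for _ in X_list]
    for _ in range(iters):
        for i, (X, y) in enumerate(zip(X_list, y_list)):
            # Closed-form ridge update with a pull toward the shared w0.
            A = X.T @ X + (lam + mu) * np.eye(d)
            ws[i] = np.linalg.solve(A, X.T @ y + mu * w0)
        w0 = np.mean(ws, axis=0)  # shared component = mean of task weights
    return w0, ws

# Four synthetic "instances" drawn around one underlying linear model.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X_list = [rng.normal(size=(20, 3)) for _ in range(4)]
y_list = [X @ w_true + 0.1 * rng.normal(size=20) for X in X_list]
w0, ws = shared_ridge(X_list, y_list)
print(np.round(w0, 2))  # shared weights end up close to w_true
```

Varying mu moves the model along the slide's spectrum: mu → ∞ forces a single shared classifier, mu → 0 leaves fully independent exemplar classifiers.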
  24. Human-in-the-loop structure learning
  25. How do we move beyond the plateau?
      1. Develop more structured models with less invariant features
      2. Score "nuisance" variables as meaningful output
      3. Generate ground-truth datasets of structured labels
  26. Diagram for Eero: Machine Learning → Vision. Vision as applied machine learning.
  27. Diagram for Eero: Vision, Graphics (shape & appearance), Machine Learning. Vision as structured pattern recognition.
