Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Visual7W Grounded Question Answering in Images

Slides by Issey Masuda about the paper:

Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.

We have seen great progress in basic perceptual tasks such as object recognition and detection. However, AI models still fail to match humans in high-level vision tasks due to the lack of capacities for deeper reasoning. Recently the new task of visual question answering (QA) has been proposed to evaluate a model's capacity for deep image understanding. Previous works have established a loose, global association between QA sentences and images. However, many questions and answers, in practice, relate to local regions in the images. We establish a semantic link between textual descriptions and image regions by object-level grounding. It enables a new type of QA with visual answers, in addition to textual answers used in previous work. We study the visual QA tasks in a grounded setting with a large collection of 7W multiple-choice QA pairs. Furthermore, we evaluate human performance and several baseline models on the QA tasks. Finally, we propose a novel LSTM model with spatial attention to tackle the 7W QA tasks.

  • Login to see the comments

Visual7W Grounded Question Answering in Images

  1. 1. Visual7W Grounded Question Answering in Images Yuke Zhu, Oliver Groth, Michael Bernstein, Li Fei-Fei Slides by Issey Masuda Mora Computer Vision Reading Group (09/05/2016) [arXiv] [web] [GitHub]
  2. 2. Context
  3. 3. Visual Question Answering Goal: predict the answer of a given question related to an image
  4. 4. Motivation New Turing test? How to evaluate AI’s image understanding?
  5. 5. Visual7W
  6. 6. The 7W WHAT WHERE WHEN WHO WHY HOW WHICH Questions: multi-choice 4 candidates, only one correct
  7. 7. Grounding: image-text correspondences Exploit the relation between image regions and nouns in the questions
  8. 8. The new answer is... Question-Answer types: ● Telling questions: the answer is text ● Pointing questions: a new type of QA that they introduce where the answers are image regions
  9. 9. Related work
  10. 10. Common approach Who is under the umbrella? Extract visual features Embedding Merge Predict answer Two women
  11. 11. The Dataset
  12. 12. Visual7W Dataset Characteristics: ● 47.300 images from COCO dataset ● 327.939 QA pairs ● 561.459 object bounding boxes spread across 36.579 categories
  13. 13. Creating the Dataset Procedure: ● Write QA pairs ● 3 AMT workers evaluate as good or bad each pair ● Only the ones with at least 2 good evaluations are considered ● Write the 3 wrong answers (having the right one) ● Extract object names and draw bounding boxes for each one
  14. 14. The Model
  15. 15. Attention-based model Pointing questions model
  16. 16. Experiments & Results
  17. 17. Experiments Different experiments have been conducted depending on the information given to the subject: ● Only the question ● Question + Image Subjects/models: ● Human ● Logistic regression ● LSTM ● LSTM + attention model
  18. 18. Results
  19. 19. Conclusions
  20. 20. Conclusions ● Visual QA model has been presented ● Attention model to focus on local regions of the image ● Dataset created with goundings