Visual7W Grounded Question Answering in Images


Slides by Issey Masuda about the paper: Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.

We have seen great progress in basic perceptual tasks such as object recognition and detection. However, AI models still fail to match humans in high-level vision tasks due to the lack of capacities for deeper reasoning. Recently the new task of visual question answering (QA) has been proposed to evaluate a model's capacity for deep image understanding. Previous works have established a loose, global association between QA sentences and images. However, many questions and answers, in practice, relate to local regions in the images. We establish a semantic link between textual descriptions and image regions by object-level grounding. It enables a new type of QA with visual answers, in addition to textual answers used in previous work. We study the visual QA tasks in a grounded setting with a large collection of 7W multiple-choice QA pairs. Furthermore, we evaluate human performance and several baseline models on the QA tasks. Finally, we propose a novel LSTM model with spatial attention to tackle the 7W QA tasks.

Visual7W
Grounded Question Answering in Images
Yuke Zhu, Oliver Groth, Michael Bernstein, Li Fei-Fei
Slides by Issey Masuda Mora
Computer Vision Reading Group (09/05/2016)
[arXiv] [web] [GitHub]

Visual Question Answering
Goal: predict the answer of a given question related to an image

Motivation
New Turing test? How to evaluate AI’s image understanding?

The 7W
WHAT
WHERE
WHEN
WHO
WHY
HOW
WHICH
Questions: multi-choice
4 candidates, only one correct
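
The multiple-choice format above can be sketched as a small record type. This is only an illustration, not the dataset's actual schema: the class name, the field names, and the three wrong candidates are hypothetical; only the question and the answer "Two women" come from the slides.

```python
from dataclasses import dataclass
from typing import Tuple

# The seven question words that give Visual7W its name.
SEVEN_W = ("what", "where", "when", "who", "why", "how", "which")


@dataclass(frozen=True)
class MultipleChoiceQA:
    """Hypothetical container for one multiple-choice QA pair."""
    question: str
    candidates: Tuple[str, str, str, str]
    correct_index: int

    def __post_init__(self) -> None:
        # Enforce "4 candidates, only one correct".
        if len(self.candidates) != 4:
            raise ValueError("expected exactly 4 candidates")
        if not 0 <= self.correct_index < 4:
            raise ValueError("correct_index must be between 0 and 3")

    def is_correct(self, chosen_index: int) -> bool:
        return chosen_index == self.correct_index


def question_type(question: str) -> str:
    """Return the leading W word of the question, or 'other'."""
    words = question.strip().lower().split()
    if words and words[0] in SEVEN_W:
        return words[0]
    return "other"


# The wrong candidates below are invented for illustration.
qa = MultipleChoiceQA(
    question="Who is under the umbrella?",
    candidates=("Two women", "A child", "A dog", "Nobody"),
    correct_index=0,
)
```

Storing the index of the correct candidate, rather than a flag per candidate, makes it impossible to mark zero or several candidates as correct.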

Grounding: image-text correspondences
Exploit the relation between image regions and nouns in the questions

The new answer is...
Question-Answer types:
● Telling questions: the answer is text
● Pointing questions: a new type of QA, introduced by the authors, in which the answers are image regions
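
One way to picture the two answer types is as a union of a text string and an image region. The sketch below assumes a region is an axis-aligned box given by its top-left corner and size; the `Box` type and `answer_kind` helper are hypothetical names, not part of any Visual7W tooling.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Box:
    """Hypothetical image region: top-left corner plus size, in pixels."""
    x: int
    y: int
    width: int
    height: int


# A telling answer is text; a pointing answer is a region of the image.
Answer = Union[str, Box]


def answer_kind(answer: Answer) -> str:
    """Classify an answer as 'telling' (text) or 'pointing' (region)."""
    if isinstance(answer, Box):
        return "pointing"
    if isinstance(answer, str):
        return "telling"
    raise TypeError("answer must be a str or a Box")
```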

Common approach
Example question: Who is under the umbrella?
Pipeline: extract visual features (image) and embedding (question) → merge → predict answer
Example answer: Two women
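
The extract / embed / merge / predict pipeline can be sketched end to end with toy stand-ins. Everything here is an assumption made for illustration: the visual features are hard-coded numbers (in practice they would typically come from a CNN), the question embedding is a bag-of-words count vector rather than a learned embedding, merging is plain concatenation, and the classifier weights are set by hand instead of trained.

```python
import math
from typing import List, Sequence, Tuple


def embed_question(question: str, vocab: Sequence[str]) -> List[float]:
    """Bag-of-words counts over a fixed vocabulary (stand-in embedding)."""
    words = question.lower().replace("?", "").split()
    return [float(words.count(w)) for w in vocab]


def merge(visual: List[float], textual: List[float]) -> List[float]:
    """Merge the two modalities by concatenation."""
    return visual + textual


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(joint: Sequence[float],
            weights: Sequence[Sequence[float]],
            answers: Sequence[str]) -> Tuple[str, List[float]]:
    """Linear classifier with one weight row per candidate answer."""
    scores = [sum(w * x for w, x in zip(row, joint)) for row in weights]
    probs = softmax(scores)
    best = max(range(len(answers)), key=probs.__getitem__)
    return answers[best], probs


# Toy inputs: pretend visual features and a two-word vocabulary.
visual = [0.9, 0.1]
textual = embed_question("Who is under the umbrella?", ["who", "umbrella"])
joint = merge(visual, textual)

# Hand-set weights, one row per answer; a real model would learn these.
answers = ["Two women", "A dog"]
weights = [[1.0, 0.0, 1.0, 1.0],
           [0.0, 1.0, 0.0, 0.0]]
answer, probs = predict(joint, weights, answers)
```

Concatenation is only one possible merge; an elementwise product of equally sized vectors is another common choice.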

Visual7W Dataset
Characteristics:
● 47,300 images from the COCO dataset
● 327,939 QA pairs
● 561,459 object bounding boxes spread across 36,579 categories

Creating the Dataset
Procedure:
● Write QA pairs
● 3 AMT workers rate each pair as good or bad
● Only the pairs with at least 2 good ratings are kept
● Write the 3 wrong answers (given the correct one)
● Extract object names and draw a bounding box for each one
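
The rating step of the procedure above amounts to a majority vote over three workers. A minimal sketch, assuming each rating is stored as a boolean (`True` for "good"); the function names and the sample pairs are hypothetical.

```python
from typing import List, Sequence, Tuple


def keep_qa_pair(ratings: Sequence[bool], min_good: int = 2) -> bool:
    """Keep a QA pair if at least `min_good` workers rated it good."""
    return sum(ratings) >= min_good


def filter_pairs(rated: Sequence[Tuple[str, Sequence[bool]]],
                 min_good: int = 2) -> List[str]:
    """Return the QA pairs that pass the rating threshold."""
    return [pair for pair, ratings in rated if keep_qa_pair(ratings, min_good)]


# Invented examples: each entry is (QA pair, ratings from 3 workers).
rated_pairs = [
    ("Who is under the umbrella? / Two women", [True, True, False]),
    ("What is this? / Something", [False, True, False]),
]
kept = filter_pairs(rated_pairs)
```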

Attention-based model
Pointing questions model
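
The idea of attending to local image regions can be sketched as soft attention: score each region against a query vector, normalise the scores with a softmax, and take the weighted sum of the region features. This is a simplified illustration only. It uses a plain dot product as the scoring function, whereas the paper's model learns its scoring function jointly with an LSTM; the names `attend`, `regions`, and `query` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(regions: Sequence[Sequence[float]],
           query: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Soft attention over region features.

    Returns the attention weights (one per region, summing to 1) and the
    attended feature vector (the weighted sum of the region features).
    """
    # Dot-product relevance of each region to the query (an assumption).
    scores = [sum(r * q for r, q in zip(region, query)) for region in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    context = [sum(w * region[i] for w, region in zip(weights, regions))
               for i in range(dim)]
    return weights, context


# Two toy regions; the query (e.g. a question state) matches the first.
regions = [[1.0, 0.0], [0.0, 1.0]]
query = [2.0, 0.0]
weights, context = attend(regions, query)
```

Because the weights sum to 1, the attended vector stays on the same scale as the individual region features regardless of how many regions there are.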

Experiments
Different experiments have been conducted depending on the information given to the subject:
● Only the question
● Question + Image
Subjects/models:
● Human
● Logistic regression
● LSTM
● LSTM + attention model
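
Comparing subjects and models on multiple-choice questions is naturally done with accuracy, and with 4 candidates a uniform random guess is expected to score 25%. The sketch below assumes accuracy is the metric (the slides do not name one) and that predictions and ground truth are stored as candidate indices; the sample values are invented.

```python
from typing import Sequence

# Expected accuracy of uniform random guessing among 4 candidates.
CHANCE_ACCURACY = 1 / 4


def accuracy(predictions: Sequence[int], ground_truth: Sequence[int]) -> float:
    """Fraction of questions where the chosen candidate is the correct one."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth differ in length")
    if not ground_truth:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)


# Invented example: chosen vs. correct candidate index for 4 questions.
score = accuracy([0, 2, 1, 3], [0, 2, 3, 3])
```

Running the same function on question-only and question + image predictions gives a direct measure of how much the image contributes.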

Conclusions
● A visual QA model has been presented
● An attention model focuses on local regions of the image
● A dataset with groundings has been created
