Multimodal Sequential Learning for Video QA


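Speaker: Eun-Sol Kim (Ph.D. student, Seoul National University)
Date: June 2017
The speaker has been enrolled in the integrated M.S./Ph.D. program of the Department of Computer Science and Engineering at Seoul National University since September 2010, and was selected as a Young Woman Scientist in June 2014.
Overview: This talk introduces a machine learning engine that lets a person and a machine watch content together and then ask and answer questions about it in natural language. Built on a hierarchical multimodal recurrent neural network, the method sequentially combines the image, subtitle (text), and sound information in the content to build a multimodal episodic memory, selects the memory needed for a given question, and extracts the answer. The talk also introduces a method that borrows ideas from reinforcement learning to learn long-term sequences efficiently when the multimodal memory is built with a recurrent neural network.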
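As a rough illustration of selecting the memory needed for a question, the sketch below scores each memory slot against a question vector with a dot product and a softmax, then reads out a weighted sum of the slots. This is a generic attention-style read, assumed here for illustration only; the memory network used in the talk is not specified in this summary, and `select_memory`, the toy vectors, and their dimensions are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def select_memory(question_vec, memory):
    """Return attention weights over memory slots and the weighted read-out.

    Assumption: each slot is a fixed-size vector that already summarizes the
    image, text, and sound of one segment of the video.
    """
    weights = softmax([dot(question_vec, slot) for slot in memory])
    dim = len(memory[0])
    readout = [sum(w * slot[d] for w, slot in zip(weights, memory))
               for d in range(dim)]
    return weights, readout


# Toy memory with three slots (hypothetical values).
memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
question = [0.0, 2.0, 0.0]  # most similar to the second slot

weights, readout = select_memory(question, memory)
best_slot = max(range(len(weights)), key=weights.__getitem__)
print(best_slot, [round(w, 3) for w in weights])
```

In a full system the read-out vector would be passed to an answer module; here it only shows how a question can weight some memory slots more than others.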


Multimodal Sequential Learning
Eun-Sol Kim
Department of Computer Science and Engineering
Seoul National University
Seoul 08826, Korea

Contents
• Automatic Schema Construction
  • DeepSchema: Automatic Schema Acquisition from Wearable Sensor Data in Restaurant Situations
  • Neurosymbolic Knowledge Graphs Learned from Multimodal Sequential Data
• Video Question and Answering
  • Multimodal Memory Network
  • A reinforcement approach to multimodal sequential learning

Motivation
• To describe human knowledge in formal languages
  • Sheds new light on SCRIPTs (Schank et al., 1997)
  • Conceptual dependency theory
  • (+) Abstracted knowledge, generalization
  • (-) Should be designed in advance, not flexible, hard to apply to new types of knowledge
• To extract abstracted representation from low-level sensory data
  • Deep neural networks
  • (+) Can be applied to low-level dataset, hierarchical structures
  • (-) Hard to interpret the results, not formal languages

Hierarchical Event Network
• A machine learning method which automatically constructs the hierarchical schema for restaurant situations from low-level sensory data
• Multimodal deep neural network architecture
• A three-layer hierarchy
• Inputs: low-level sensory data streams from wearable devices
• Action primitives, Events and Probabilistic scripts
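One simple way to picture a probabilistic script is as a table of transition probabilities between events. The sketch below is an illustration under that assumption, not the method from the talk: it estimates first-order transition probabilities from labelled event sequences. The function name `estimate_script` and the two toy sequences are hypothetical; only the event labels are borrowed from the DineAid annotation list.

```python
from collections import Counter, defaultdict


def estimate_script(sequences):
    """Estimate P(next event | current event) from event sequences.

    Assumption: a script is approximated by a first-order Markov chain
    over event labels.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1

    script = {}
    for prev, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        script[prev] = {nxt: c / total for nxt, c in nxt_counts.items()}
    return script


# Two made-up restaurant visits (not real DineAid data).
toy_days = [
    ["Greeting", "Having a seat", "Selecting menu", "Ordering menu"],
    ["Greeting", "Having a seat", "Ordering menu"],
]

script = estimate_script(toy_days)
for event, transitions in script.items():
    print(event, "->", transitions)
```

With these toy sequences, "Having a seat" is followed by "Selecting menu" and by "Ordering menu" equally often, so each transition gets probability 0.5.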

Data Acquisition
• Real-life dataset: DineAid
  • Restaurant situations
  • 7 days of data
  • About 4000 seconds in each
  • Annotated with 11 situations
    • Greeting, Having a seat, Selecting menu, Ordering menu, etc.
• Multiple wearable devices
  • Glass-type
    • Video data
    • Audio data
  • Watch-type
    • Accelerometer
    • EDA (electrodermal activity)
    • BVP (blood volume pulse)

Experimental Results (1/2)
- Event Prediction
• Classify the corresponding event using the event schema
• The hierarchical event network is learned on separate training data
• The corresponding events of the test data are then predicted

Experimental Results (2/2)
- Probabilistic SCRIPT
• Classify the corresponding event using the event schema
• The hierarchical event network is learned on separate training data
• The corresponding events of the test data are then predicted

Content-based question answering technology
• Question-answer data collected for 75 Pinkfong animations
  • 400 question-answer pairs collected per animation
  • Collected with Amazon Mechanical Turk
• The subtitles, images, sound, and question-answer data of the animations are learned with a machine learning algorithm
• Technology that can answer users' questions about an animation

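Framework
[Figure: system framework. Labels in the diagram: Server (data collection, preprocessing, question-answer data collection; training with machine learning algorithms), interface, Android, Web, HTTP, Bluetooth.]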

Multimodal Sequential Learning with RL
[Figure: the network unrolled over time steps 1, 2, 3, ..., T. Each step takes Image, Text, and Sound inputs with hidden states $h_i^k$, $h_t^k$, $h_s^k$ (k = 1, ..., T), shows eight actions $a_1, \dots, a_8$, and produces a combined state $h^k$. Labelled components: RNN weight $W_{GRU}$, combining policy $\pi$, combining weight $W_c$ (appearing as $W_c^{ts}$, $W_c^{t}$, and $W_c^{its}$ at steps 1, 3, and T), and the terms $E_{GRU}$ and $R$.]

Multimodal Sequential Learning with RL
* The error function for the GRU part is trivial
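The slides do not define the eight actions or the reward. One reading, suggested by the figure labels $W_c^{ts}$, $W_c^{t}$, and $W_c^{its}$, is that each action picks one of the $2^3 = 8$ subsets of {image, text, sound} to combine. The sketch below follows that assumption with a state-independent softmax policy trained by a REINFORCE-style update. Everything in it is hypothetical: the averaging in `combine`, the `toy_reward` function, and the learning rate are placeholders, not the author's method.

```python
import itertools
import math
import random

MODALITIES = ("image", "text", "sound")
# Assumption: the 8 actions are the 2^3 subsets of the three modalities.
ACTIONS = [tuple(m for m, keep in zip(MODALITIES, mask) if keep)
           for mask in itertools.product((0, 1), repeat=3)]


def softmax(prefs):
    """Numerically stable softmax over action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def combine(hidden, action):
    """Average the hidden vectors of the selected modalities (zeros if none)."""
    dim = len(next(iter(hidden.values())))
    if not action:
        return [0.0] * dim
    return [sum(hidden[m][d] for m in action) / len(action) for d in range(dim)]


def reinforce_step(prefs, action_idx, reward, lr=0.1):
    """One REINFORCE update for a softmax policy over discrete actions."""
    probs = softmax(prefs)
    return [p + lr * reward * ((1.0 if i == action_idx else 0.0) - probs[i])
            for i, p in enumerate(prefs)]


def toy_reward(action):
    """Hypothetical reward: pretend only text plus sound is informative."""
    return 1.0 if action == ("text", "sound") else 0.0


rng = random.Random(0)
prefs = [0.0] * len(ACTIONS)
for _ in range(2000):
    probs = softmax(prefs)
    idx = rng.choices(range(len(ACTIONS)), weights=probs)[0]
    prefs = reinforce_step(prefs, idx, toy_reward(ACTIONS[idx]))

best_action = ACTIONS[max(range(len(prefs)), key=prefs.__getitem__)]

# Combine toy per-modality hidden states with the learned best action.
hidden = {"image": [1.0, 1.0], "text": [2.0, 0.0], "sound": [0.0, 2.0]}
combined = combine(hidden, best_action)
print(best_action, combined)
```

In a real model the policy would presumably depend on the current hidden states and the reward would come from the downstream task; both are left out here to keep the update rule visible.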


Slide outline
1. Multimodal Sequential Learning
2. Contents
3. Automatic Schema Construction
4. Motivation
6. Motivation
7. Hierarchical Event Network
8. Hierarchical Event Network
9. Derivation - Learning
10. Restaurant Situations
11. Data Acquisition
12. Experimental Results (1/2) - Event Prediction
13. Experimental Results (2/2) - Probabilistic SCRIPT
14. WikiHow Dataset
15. System Architecture (1)
16. System Architecture (2)
17. Result
18. Video Question and Answering
19. Content-based question answering technology
20. Framework
21. Multimodal Memory Network
22. Multimodal Sequential Learning with RL
23. Multimodal Sequential Learning with RL
24. Thank you!
