Serena is a Ph.D. student in the Stanford Vision Lab, advised by Prof. Fei-Fei Li. Her research interests are in computer vision, machine learning, and deep learning. She is particularly interested in the areas of video understanding, human action recognition, and healthcare applications. She interned at Facebook AI Research in Summer 2016.
Before starting her Ph.D., she received a B.S. in Electrical Engineering in 2010 and an M.S. in Electrical Engineering in 2013, both from Stanford. She also worked as a software engineer at Rockmelt (acquired by Yahoo) from 2009 to 2011.
Abstract
Towards Scaling Video Understanding
The quantity of video data is vast, yet our capabilities for visual recognition and understanding in videos lag significantly behind those for images. In this talk, I will first discuss some of the challenges of scale in labeling, modeling, and inference behind this gap. I will then present some of our recent work towards addressing these challenges, in particular using reinforcement learning-based formulations to tackle efficient inference in videos and to learn classifiers from noisy web search results. Finally, I will conclude with a discussion of promising future directions for scaling video understanding.
5-7. State-of-the-art in video understanding
• Classification (Abu-El-Haija et al. 2016): 4,800 categories, 15.2% Top-5 error
• Detection (Idrees et al. 2017; Sigurdsson et al. 2016): tens of categories, ~10-20 mAP at 0.5 overlap
• Captioning (Yu et al. 2016): just getting started; short clips, niche domains
10-12. Comparing video with image understanding
Classification:
• Videos: 4,800 categories, 15.2% Top-5 error (Abu-El-Haija et al. 2016)
• Images: 1,000 categories*, 3.1% Top-5 error (Krizhevsky 2012; Xie 2016)
Detection:
• Videos: tens of categories, ~10-20 mAP at 0.5 overlap (Idrees et al. 2017; Sigurdsson et al. 2016)
• Images: hundreds of categories*, ~60 mAP at 0.5 overlap, plus pixel-level segmentation (He 2017)
Captioning:
• Videos: just getting started; short clips, niche domains
• Images: dense captioning, coherent paragraphs (Johnson 2016; Krause 2017)
Beyond:
• Videos: —
• Images: significant work on question-answering (Yang 2016)
*Transfer learning widespread
13-14. The challenge of scale
• Training labels: video annotation is labor-intensive
• Models: the temporal dimension adds complexity
• Inference: video processing is computationally expensive
First, inference: Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.
15. Task: temporal action detection
Input: a video spanning t = 0 to t = T. Output: the temporal extent of each action instance (e.g., "Running", "Talking").
Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.
16. Efficient video processing
Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.
17-27. Our model for efficient action detection
The model takes glimpses at a sparse sequence of frames between t = 0 and t = T. At each glimpse:
• A convolutional neural network encodes the current frame (frame information)
• A recurrent neural network aggregates observations across glimpses (time information)
• Output: the next frame to glimpse, plus an optional detection instance [start, end]
The process repeats until the end of the video.
Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.
28. Our model for efficient action detection
• Train differentiable outputs (detection output class and bounds) using standard backpropagation
• Train non-differentiable outputs (where to look next, when to emit a prediction) using reinforcement learning (the REINFORCE algorithm); see the sketch below
• Achieves detection performance on par with dense sliding-window approaches, while observing only 2% of frames
Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.
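To make the hybrid training concrete, here is a minimal sketch of the idea, not the paper's implementation: the [start, end] regression head is trained with backpropagation, while the sampled glimpse location is trained with REINFORCE. The GRU cell, network sizes, overlap-based reward, and toy data are all illustrative assumptions.

```python
# Hedged sketch: backprop for detection bounds, REINFORCE for glimpse choice.
import torch
import torch.nn as nn

class GlimpseAgent(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_frames=100):
        super().__init__()
        self.rnn = nn.GRUCell(feat_dim, hidden)        # time information
        self.detect = nn.Linear(hidden, 2)             # differentiable head: [start, end]
        self.where = nn.Linear(hidden, num_frames)     # stochastic head: next frame index

    def step(self, frame_feat, h):
        h = self.rnn(frame_feat, h)
        bounds = torch.sigmoid(self.detect(h))         # normalized [start, end]
        dist = torch.distributions.Categorical(logits=self.where(h))
        next_frame = dist.sample()                     # non-differentiable choice
        return h, bounds, next_frame, dist.log_prob(next_frame)

agent = GlimpseAgent()
opt = torch.optim.Adam(agent.parameters(), lr=1e-4)

# Toy episode: random features stand in for per-frame CNN descriptors.
feats = torch.randn(100, 512)
gt_bounds = torch.tensor([[0.2, 0.5]])

h = torch.zeros(1, 256)
frame, log_probs, bounds = 0, [], None
for _ in range(6):                                     # six glimpses
    h, bounds, frame_t, lp = agent.step(feats[frame].unsqueeze(0), h)
    log_probs.append(lp)
    frame = int(frame_t)

# Differentiable output trained by backprop; sampled decisions trained by
# REINFORCE, rewarded here by (toy) temporal overlap with the ground truth.
det_loss = nn.functional.mse_loss(bounds, gt_bounds)
inter = (torch.min(bounds[0, 1], gt_bounds[0, 1]) - torch.max(bounds[0, 0], gt_bounds[0, 0])).clamp(min=0)
union = (torch.max(bounds[0, 1], gt_bounds[0, 1]) - torch.min(bounds[0, 0], gt_bounds[0, 0])).clamp(min=1e-6)
reward = (inter / union).detach()                      # treat reward as a constant
rl_loss = -(reward * torch.stack(log_probs)).mean()

opt.zero_grad()
(det_loss + rl_loss).backward()
opt.step()
```

In the real model the reward also accounts for when a prediction is emitted; this sketch only shows how the two gradient paths coexist in one update.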
29-30. Learned policy in action
[Video examples of the learned glimpse policy]
Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.
31. The challenge of scale
• Training labels: video annotation is labor-intensive
• Models: the temporal dimension adds complexity
• Inference: video processing is computationally expensive (Frame Glimpses, CVPR 2016)
Next: Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
32. Dense action labeling
Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
33. MultiTHUMOS
• Extends the THUMOS'14 action detection dataset with dense, multilevel, frame-level action annotations for 30 hours across 400 videos

                           THUMOS    MultiTHUMOS
  Annotations              6,365     38,690
  Classes                  20        65
  Density (labels/frame)   0.3       1.5
  Classes per video        1.1       10.5
  Max actions per frame    2         9
  Max actions per video    3         25

Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
34. Modeling dense, multilabel actions
• Need to reason about multiple potential actions simultaneously
• High degree of temporal dependency
• In standard recurrent models for action recognition, all state lives in the hidden layer representation
• At each time step, the model predicts labels for the current frame from the current frame and the previous hidden representation
Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
35. MultiLSTM
• An extension of the LSTM that expands the temporal receptive field of its input and output connections
• Key idea: giving the model more freedom in both reading input and writing output reduces the burden placed on the hidden layer representation
Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
36-38. MultiLSTM
• Standard LSTM (Donahue 2014): a single input frame and a single frame-class prediction per time step
• MultiLSTM: multiple input frames and multiple frame-class predictions per time step
Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
39. MultiLSTM
• Multiple inputs: soft attention over a window of input frames
• Multiple outputs: predictions for a window of frames, combined by weighted averaging
• Multilabel loss
A sketch of these three pieces follows below.
Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
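As a concrete illustration, here is a minimal sketch of the MultiLSTM idea under stated assumptions, not the authors' implementation: soft attention pools a window of frame features into each LSTM input, each step emits predictions for every frame in its window, per-frame predictions are averaged across overlapping windows, and training uses a multilabel (binary cross-entropy) loss. The dimensions, window size, attention form, and uniform averaging are illustrative choices.

```python
# Hedged sketch of MultiLSTM: attended multi-frame input, multi-frame output.
import torch
import torch.nn as nn

class MultiLSTMSketch(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_classes=65, window=5):
        super().__init__()
        self.window = window
        self.num_classes = num_classes
        self.attn = nn.Linear(feat_dim + hidden, 1)          # score per window frame
        self.cell = nn.LSTMCell(feat_dim, hidden)
        self.head = nn.Linear(hidden, window * num_classes)  # prediction per window frame

    def forward(self, feats):                                # feats: (T, feat_dim)
        T, _ = feats.shape
        h = torch.zeros(1, self.cell.hidden_size)
        c = torch.zeros_like(h)
        sums = torch.zeros(T, self.num_classes)              # accumulated logits per frame
        counts = torch.zeros(T, 1)                           # windows covering each frame
        for t in range(self.window - 1, T):
            win = feats[t - self.window + 1 : t + 1]         # (window, feat_dim)
            score = self.attn(torch.cat([win, h.expand(self.window, -1)], dim=1))
            alpha = torch.softmax(score, dim=0)              # soft attention over window
            pooled = (alpha * win).sum(dim=0, keepdim=True)  # attended multi-frame input
            h, c = self.cell(pooled, (h, c))
            out = self.head(h).view(self.window, self.num_classes)
            sums[t - self.window + 1 : t + 1] += out         # multiple outputs per step
            counts[t - self.window + 1 : t + 1] += 1
        return sums / counts.clamp(min=1)                    # average overlapping outputs

model = MultiLSTMSketch()
logits = model(torch.randn(100, 512))                        # 100 frames of toy features
labels = torch.randint(0, 2, (100, 65)).float()              # dense multilabel targets
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)  # multilabel loss
```

Because each frame is read and predicted several times, the hidden state no longer has to carry every detail alone, which is exactly the burden-reduction argument on slide 35.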
40. MultiLSTM
Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
41. Retrieving sequential actions
Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
42. Retrieving co-occurring actions
Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
43. The challenge of scale
• Training labels: video annotation is labor-intensive
• Models: the temporal dimension adds complexity (Every Moment Counts, IJCV 2017)
• Inference: video processing is computationally expensive (Frame Glimpses, CVPR 2016)
Next: Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.
44. Labeling videos is expensive
• It takes significantly longer to label a video than an image
• If spatial or temporal bounds are desired, the cost is even higher
• How can we practically learn about new concepts in video?
47. Can we effectively learn from noisy web queries?
• Our approach: learn how to select positive training examples from noisy queries in order to train classifiers for new classes
• Use a reinforcement learning-based formulation to learn a data labeling policy that achieves strong performance on a small, manually labeled dataset of classes
• Then use this policy to automatically label noisy web data for new classes
Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.
48. Balancing diversity vs. semantic drift
• Want diverse training examples to improve the classifier
• But too much diversity can also lead to semantic drift
• Our approach: balance diversity and drift by training labeling policies using an annotated reward set which the policy must successfully classify
Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.
49-51. Overview of approach
• Candidate web queries come from YouTube autocomplete (e.g., "Boomerang", "Boomerang on a beach", "Boomerang music video", ...)
• The agent selects a query (e.g., "Boomerang on a beach") and labels its results as new positives, adding them to the current positive set
• The classifier is updated from the current positive set and a fixed negative set
• The training reward comes from evaluating the classifier on the reward set; the agent's state is updated and the loop repeats
A minimal sketch of this loop appears below.
Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.
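To make the loop concrete, here is a hedged sketch, not the authors' implementation: an agent repeatedly picks a candidate query, tentatively adds its videos to the positive set, re-fits a classifier against the fixed negative set, and uses accuracy on a small labeled reward set as the reward, which is what counters semantic drift. The features, the nearest-mean "classifier", and the epsilon-greedy agent are toy stand-ins.

```python
# Hedged sketch of the query-selection loop with a reward-set signal.
import numpy as np

rng = np.random.default_rng(0)
DIM, NUM_QUERIES = 128, 10

# Toy stand-ins: per-query pools of video features and a labeled reward set.
query_pools = [rng.normal(loc=(q % 3 == 0), scale=1.0, size=(20, DIM))
               for q in range(NUM_QUERIES)]
reward_x = rng.normal(size=(50, DIM)); reward_x[:25] += 1.0
reward_y = np.array([1] * 25 + [0] * 25)

negatives = rng.normal(size=(100, DIM))        # fixed negative set
positives = query_pools[0][:5].copy()          # seed positives from the class-name query

def train_and_score(pos):
    # Nearest-mean classifier: a cheap stand-in for retraining a real model.
    mu_p, mu_n = pos.mean(0), negatives.mean(0)
    pred = ((reward_x - mu_n) ** 2).sum(1) > ((reward_x - mu_p) ** 2).sum(1)
    return (pred == reward_y).mean()           # reward: accuracy on the reward set

q_values = np.zeros(NUM_QUERIES)               # agent's value estimate per query
for step in range(20):
    q = rng.integers(NUM_QUERIES) if rng.random() < 0.2 else int(q_values.argmax())
    candidate = np.vstack([positives, query_pools[q]])   # label new positives
    reward = train_and_score(candidate)                  # eval on reward set
    if reward >= train_and_score(positives):             # reject drift-inducing queries
        positives = candidate                            # update positive set (state)
    q_values[q] += 0.5 * (reward - q_values[q])          # update the agent
```

The reward set plays the balancing role from slide 48: a query can add diverse positives only as long as the retrained classifier still does well on the annotated examples.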
54. Novel classes
[Qualitative comparison: greedy classifier vs. ours]
Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.
55-57. The challenge of scale
• Training labels: video annotation is labor-intensive (Learning to learn from noisy web videos, CVPR 2017)
• Models: the temporal dimension adds complexity (Every Moment Counts, IJCV 2017)
• Inference: video processing is computationally expensive (Frame Glimpses, CVPR 2016)
Looking ahead: learning to learn; unsupervised learning
58. Towards knowledge
Addressing the challenges of scale in training labels, models, and inference takes us from videos to knowledge of the dynamic visual world.