Serena is a Ph.D. student in the Stanford Vision Lab, advised by Prof. Fei-Fei Li. Her research interests are in computer vision, machine learning, and deep learning. She is particularly interested in the areas of video understanding, human action recognition, and healthcare applications. She interned at Facebook AI Research in Summer 2016.
Before starting her Ph.D., she received a B.S. in Electrical Engineering in 2010 and an M.S. in Electrical Engineering in 2013, both from Stanford. She also worked as a software engineer at Rockmelt (acquired by Yahoo) from 2009 to 2011.
Abstract
Towards Scaling Video Understanding
The quantity of video data is vast, yet our capabilities for visual recognition and understanding in videos lag significantly behind those for images. In this talk, I will first discuss some of the challenges of scale in labeling, modeling, and inference behind this gap. I will then present some of our recent work towards addressing these challenges, in particular using reinforcement learning-based formulations to tackle efficient inference in videos and to learn classifiers from noisy web search results. Finally, I will conclude with a discussion of promising future directions for scaling video understanding.
5-7. State-of-the-art in video understanding
• Classification (Abu-El-Haija et al. 2016): 4,800 categories, 15.2% Top-5 error
• Detection (Idrees et al. 2017; Sigurdsson et al. 2016): tens of categories, ~10-20 mAP at 0.5 overlap
• Captioning (Yu et al. 2016): just getting started; short clips, niche domains
10-12. Comparing video with image understanding
Classification:
• Videos: 4,800 categories, 15.2% Top-5 error (Abu-El-Haija et al. 2016)
• Images: 1,000 categories*, 3.1% Top-5 error (Krizhevsky 2012; Xie 2016)
Detection:
• Videos: tens of categories, ~10-20 mAP at 0.5 overlap (Idrees et al. 2017; Sigurdsson et al. 2016)
• Images: hundreds of categories*, ~60 mAP at 0.5 overlap, plus pixel-level segmentation (He 2017)
Captioning:
• Videos: just getting started; short clips, niche domains
• Images: dense captioning, coherent paragraphs (Johnson 2016; Krause 2017)
Beyond:
• Videos: —
• Images: significant work on question-answering (Yang 2016)
*Transfer learning widespread
13-14. The challenge of scale
• Training labels: video annotation is labor-intensive
• Models: the temporal dimension adds complexity
• Inference: video processing is computationally expensive
First, inference: Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.
15. Task: temporal action detection
Input: a video spanning t = 0 to t = T. Output: the temporal extent of each action instance (e.g., "Running", "Talking").
Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.
16. Efficient video processing
Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.
17-27. Our model for efficient action detection
The model takes glimpses at a sparse sequence of frames between t = 0 and t = T. At each glimpse:
• A convolutional neural network encodes the current frame (frame information)
• A recurrent neural network aggregates observations across glimpses (time information)
• Output: the next frame to glimpse, plus an optional detection instance [start, end]
The process repeats until the end of the video.
Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.
28. Our model for efficient action detection
• Train differentiable outputs (detection output class and bounds) using standard backpropagation
• Train non-differentiable outputs (where to look next, when to emit a prediction) using reinforcement learning (the REINFORCE algorithm); see the sketch below
• Achieves detection performance on par with dense sliding-window approaches, while observing only 2% of frames
Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.
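To make the hybrid training concrete, here is a minimal sketch of the idea, not the paper's implementation: the [start, end] regression head is trained with backpropagation, while the sampled glimpse location is trained with REINFORCE. The GRU cell, network sizes, overlap-based reward, and toy data are all illustrative assumptions.

```python
# Hedged sketch: backprop for detection bounds, REINFORCE for glimpse choice.
import torch
import torch.nn as nn

class GlimpseAgent(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_frames=100):
        super().__init__()
        self.rnn = nn.GRUCell(feat_dim, hidden)        # time information
        self.detect = nn.Linear(hidden, 2)             # differentiable head: [start, end]
        self.where = nn.Linear(hidden, num_frames)     # stochastic head: next frame index

    def step(self, frame_feat, h):
        h = self.rnn(frame_feat, h)
        bounds = torch.sigmoid(self.detect(h))         # normalized [start, end]
        dist = torch.distributions.Categorical(logits=self.where(h))
        next_frame = dist.sample()                     # non-differentiable choice
        return h, bounds, next_frame, dist.log_prob(next_frame)

agent = GlimpseAgent()
opt = torch.optim.Adam(agent.parameters(), lr=1e-4)

# Toy episode: random features stand in for per-frame CNN descriptors.
feats = torch.randn(100, 512)
gt_bounds = torch.tensor([[0.2, 0.5]])

h = torch.zeros(1, 256)
frame, log_probs, bounds = 0, [], None
for _ in range(6):                                     # six glimpses
    h, bounds, frame_t, lp = agent.step(feats[frame].unsqueeze(0), h)
    log_probs.append(lp)
    frame = int(frame_t)

# Differentiable output trained by backprop; sampled decisions trained by
# REINFORCE, rewarded here by (toy) temporal overlap with the ground truth.
det_loss = nn.functional.mse_loss(bounds, gt_bounds)
inter = (torch.min(bounds[0, 1], gt_bounds[0, 1]) - torch.max(bounds[0, 0], gt_bounds[0, 0])).clamp(min=0)
union = (torch.max(bounds[0, 1], gt_bounds[0, 1]) - torch.min(bounds[0, 0], gt_bounds[0, 0])).clamp(min=1e-6)
reward = (inter / union).detach()                      # treat reward as a constant
rl_loss = -(reward * torch.stack(log_probs)).mean()

opt.zero_grad()
(det_loss + rl_loss).backward()
opt.step()
```

In the real model the reward also accounts for when a prediction is emitted; this sketch only shows how the two gradient paths coexist in one update.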
29-30. Learned policy in action
[Video examples of the learned glimpse policy]
Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.
31. The challenge of scale
• Training labels: video annotation is labor-intensive
• Models: the temporal dimension adds complexity
• Inference: video processing is computationally expensive (Frame Glimpses, CVPR 2016)
Next: Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
32. Dense action labeling
Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
33. MultiTHUMOS
• Extends the THUMOS'14 action detection dataset with dense, multilevel, frame-level action annotations for 30 hours across 400 videos

                           THUMOS    MultiTHUMOS
  Annotations              6,365     38,690
  Classes                  20        65
  Density (labels/frame)   0.3       1.5
  Classes per video        1.1       10.5
  Max actions per frame    2         9
  Max actions per video    3         25

Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
34. Modeling dense, multilabel actions
• Need to reason about multiple potential actions simultaneously
• High degree of temporal dependency
• In standard recurrent models for action recognition, all state lives in the hidden layer representation
• At each time step, the model predicts labels for the current frame from the current frame and the previous hidden representation
Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
35. MultiLSTM
• An extension of the LSTM that expands the temporal receptive field of its input and output connections
• Key idea: giving the model more freedom in both reading input and writing output reduces the burden placed on the hidden layer representation
Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
36-38. MultiLSTM
• Standard LSTM (Donahue 2014): a single input frame and a single frame-class prediction per time step
• MultiLSTM: multiple input frames and multiple frame-class predictions per time step
Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
39. MultiLSTM
• Multiple inputs: soft attention over a window of input frames
• Multiple outputs: predictions for a window of frames, combined by weighted averaging
• Multilabel loss
A sketch of these three pieces follows below.
Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
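As a concrete illustration, here is a minimal sketch of the MultiLSTM idea under stated assumptions, not the authors' implementation: soft attention pools a window of frame features into each LSTM input, each step emits predictions for every frame in its window, per-frame predictions are averaged across overlapping windows, and training uses a multilabel (binary cross-entropy) loss. The dimensions, window size, attention form, and uniform averaging are illustrative choices.

```python
# Hedged sketch of MultiLSTM: attended multi-frame input, multi-frame output.
import torch
import torch.nn as nn

class MultiLSTMSketch(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_classes=65, window=5):
        super().__init__()
        self.window = window
        self.num_classes = num_classes
        self.attn = nn.Linear(feat_dim + hidden, 1)          # score per window frame
        self.cell = nn.LSTMCell(feat_dim, hidden)
        self.head = nn.Linear(hidden, window * num_classes)  # prediction per window frame

    def forward(self, feats):                                # feats: (T, feat_dim)
        T, _ = feats.shape
        h = torch.zeros(1, self.cell.hidden_size)
        c = torch.zeros_like(h)
        sums = torch.zeros(T, self.num_classes)              # accumulated logits per frame
        counts = torch.zeros(T, 1)                           # windows covering each frame
        for t in range(self.window - 1, T):
            win = feats[t - self.window + 1 : t + 1]         # (window, feat_dim)
            score = self.attn(torch.cat([win, h.expand(self.window, -1)], dim=1))
            alpha = torch.softmax(score, dim=0)              # soft attention over window
            pooled = (alpha * win).sum(dim=0, keepdim=True)  # attended multi-frame input
            h, c = self.cell(pooled, (h, c))
            out = self.head(h).view(self.window, self.num_classes)
            sums[t - self.window + 1 : t + 1] += out         # multiple outputs per step
            counts[t - self.window + 1 : t + 1] += 1
        return sums / counts.clamp(min=1)                    # average overlapping outputs

model = MultiLSTMSketch()
logits = model(torch.randn(100, 512))                        # 100 frames of toy features
labels = torch.randint(0, 2, (100, 65)).float()              # dense multilabel targets
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)  # multilabel loss
```

Because each frame is read and predicted several times, the hidden state no longer has to carry every detail alone, which is exactly the burden-reduction argument on slide 35.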
40. MultiLSTM
Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
41. Retrieving sequential actions
Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
42. Retrieving co-occurring actions
Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
43. The challenge of scale
• Training labels: video annotation is labor-intensive
• Models: the temporal dimension adds complexity (Every Moment Counts, IJCV 2017)
• Inference: video processing is computationally expensive (Frame Glimpses, CVPR 2016)
Next: Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.
44. Labeling videos is expensive
• It takes significantly longer to label a video than an image
• If spatial or temporal bounds are desired, the cost is even higher
• How can we practically learn about new concepts in video?
47. Can we effectively learn from noisy web queries?
• Our approach: learn how to select positive training examples from noisy queries in order to train classifiers for new classes
• Use a reinforcement learning-based formulation to learn a data labeling policy that achieves strong performance on a small, manually labeled dataset of classes
• Then use this policy to automatically label noisy web data for new classes
Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.
48. Balancing diversity vs. semantic drift
• Want diverse training examples to improve the classifier
• But too much diversity can also lead to semantic drift
• Our approach: balance diversity and drift by training labeling policies using an annotated reward set which the policy must successfully classify
Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.
49-51. Overview of approach
• Candidate web queries come from YouTube autocomplete (e.g., "Boomerang", "Boomerang on a beach", "Boomerang music video", ...)
• The agent selects a query (e.g., "Boomerang on a beach") and labels its results as new positives, adding them to the current positive set
• The classifier is updated from the current positive set and a fixed negative set
• The training reward comes from evaluating the classifier on the reward set; the agent's state is updated and the loop repeats
A minimal sketch of this loop appears below.
Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.
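To make the loop concrete, here is a hedged sketch, not the authors' implementation: an agent repeatedly picks a candidate query, tentatively adds its videos to the positive set, re-fits a classifier against the fixed negative set, and uses accuracy on a small labeled reward set as the reward, which is what counters semantic drift. The features, the nearest-mean "classifier", and the epsilon-greedy agent are toy stand-ins.

```python
# Hedged sketch of the query-selection loop with a reward-set signal.
import numpy as np

rng = np.random.default_rng(0)
DIM, NUM_QUERIES = 128, 10

# Toy stand-ins: per-query pools of video features and a labeled reward set.
query_pools = [rng.normal(loc=(q % 3 == 0), scale=1.0, size=(20, DIM))
               for q in range(NUM_QUERIES)]
reward_x = rng.normal(size=(50, DIM)); reward_x[:25] += 1.0
reward_y = np.array([1] * 25 + [0] * 25)

negatives = rng.normal(size=(100, DIM))        # fixed negative set
positives = query_pools[0][:5].copy()          # seed positives from the class-name query

def train_and_score(pos):
    # Nearest-mean classifier: a cheap stand-in for retraining a real model.
    mu_p, mu_n = pos.mean(0), negatives.mean(0)
    pred = ((reward_x - mu_n) ** 2).sum(1) > ((reward_x - mu_p) ** 2).sum(1)
    return (pred == reward_y).mean()           # reward: accuracy on the reward set

q_values = np.zeros(NUM_QUERIES)               # agent's value estimate per query
for step in range(20):
    q = rng.integers(NUM_QUERIES) if rng.random() < 0.2 else int(q_values.argmax())
    candidate = np.vstack([positives, query_pools[q]])   # label new positives
    reward = train_and_score(candidate)                  # eval on reward set
    if reward >= train_and_score(positives):             # reject drift-inducing queries
        positives = candidate                            # update positive set (state)
    q_values[q] += 0.5 * (reward - q_values[q])          # update the agent
```

The reward set plays the balancing role from slide 48: a query can add diverse positives only as long as the retrained classifier still does well on the annotated examples.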
54. Novel classes
[Qualitative comparison: greedy classifier vs. ours]
Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.
55-57. The challenge of scale
• Training labels: video annotation is labor-intensive (Learning to learn from noisy web videos, CVPR 2017)
• Models: the temporal dimension adds complexity (Every Moment Counts, IJCV 2017)
• Inference: video processing is computationally expensive (Frame Glimpses, CVPR 2016)
Looking ahead: learning to learn; unsupervised learning
58. Towards knowledge
Addressing the challenges of scale in training labels, models, and inference takes us from videos to knowledge of the dynamic visual world.