Speaker: 고영준 (Ph.D. student, Korea University)
Date: June 2017
Abstract:
This talk presents algorithms for segmenting objects in video sequences.
First, I introduce a primary object segmentation algorithm based on region augmentation and reduction. Second, I present collaborative detection, tracking, and segmentation for online multiple object segmentation.
3. Segmentation
• Divide data into meaningful segments
[Figure: examples of superpixels, image segmentation, video segmentation, and video object segmentation]
4. Video Object Segmentation
• Semi-supervised video object segmentation
• Primary object segmentation
• Multiple object segmentation
5. Semi-supervised Video Object Segmentation
• Track and segment a target object
• Annotated by a user in the first frame
[Figure: first frame with user annotation, and the resulting segment track]
6. Primary Object Segmentation
• Segment a primary object in a video automatically
Primary object: Diver
Primary object: Tennis player
9. Primary Object Segmentation
• Primary object segmentation
• Initial region estimation
• Motion boundaries
• Object proposal
• Saliency maps
• Refinement
• Construct models for the primary object and the background,
e.g. Gaussian mixture models (GMMs)
• Propose augmentation and reduction process (ARP)
10. Primary Object Segmentation in Videos Based on
Region Augmentation and Reduction
• Overview
• Input: A set of consecutive video frames
• Output: A set of pixel-wise segments to delineate the primary
object
11. Candidate Region Generation
• Candidate regions
• Ultrametric contour map (UCM)
• Obtain color-based and motion-based UCMs
• Each region in UCM becomes a superpixel
12. Candidate Region Generation
• Candidate regions
• Generate candidate regions by merging neighboring superpixels
• Determine the pair, $s_m$ and $s_n$, sharing the weakest boundary
• Merge $s_m$ and $s_n$ into a single superpixel
• Repeat this process until only one superpixel remains
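The merging loop above can be sketched as follows. This is a minimal illustration, not the presenter's implementation: the function name `merge_superpixels`, the input format (a dict from adjacent superpixel id pairs to boundary strengths), and the rule of averaging boundary strengths after a merge are all assumptions made for the example.

```python
def merge_superpixels(strength):
    """Greedily merge the pair of regions sharing the weakest boundary.

    strength maps a pair (a, b) of adjacent superpixel ids to a boundary
    strength. Returns the merged regions in the order they were created;
    each region is a frozenset of the original superpixel ids it covers.
    """
    edges = {}
    for (a, b), w in strength.items():
        edges[frozenset((frozenset([a]), frozenset([b])))] = w

    candidates = []
    while edges:
        # Pair (s_m, s_n) with the weakest shared boundary.
        weakest = min(edges, key=edges.get)
        s_m, s_n = tuple(weakest)
        merged = s_m | s_n
        candidates.append(merged)

        pooled = {}
        for pair, w in edges.items():
            if pair == weakest:
                continue
            # Redirect boundaries that touched s_m or s_n to the merged region.
            new_pair = frozenset(
                merged if region in (s_m, s_n) else region for region in pair
            )
            pooled.setdefault(new_pair, []).append(w)
        # Assumption: the strength of a pooled boundary is the average.
        edges = {pair: sum(ws) / len(ws) for pair, ws in pooled.items()}
    return candidates


chain = {(0, 1): 0.2, (1, 2): 0.9, (2, 3): 0.1}
for region in merge_superpixels(chain):
    print(sorted(region))
```

Every intermediate region produced by the loop is a candidate region, so a frame with N superpixels yields N - 1 merged candidates in addition to the superpixels themselves.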
13. Candidate Region Generation
• Foreground confidence
• Measure the foreground confidence of each candidate region
• Appearance confidence $\phi_i^{(t)}$
• Obtain a saliency map using the technique in [1]
• Average the saliency values within the candidate region
• Edge confidence $\psi_i^{(t)}$
• Combine the color-based edge map and the motion-based edge map
$c_i^{(t)} = \phi_i^{(t)} + \psi_i^{(t)}$
[1] W.-D. Jang, C. Lee, and C.-S. Kim, “Primary object segmentation in videos via alternate convex optimization of foreground and background distributions,” CVPR, 2016.
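As a rough sketch of the foreground confidence, the snippet below sums an appearance term and an edge term for one candidate region. The slides do not say how the two edge maps are combined or where the edge strength is measured, so taking the per-pixel maximum of the two maps and averaging it over the region's boundary pixels are assumptions, as are all function names.

```python
NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def boundary_pixels(region):
    """Pixels of the region that have a 4-neighbor outside the region."""
    return {
        (r, c)
        for (r, c) in region
        if any((r + dr, c + dc) not in region for dr, dc in NEIGHBORS)
    }


def appearance_confidence(region, saliency):
    """Average saliency value inside the candidate region."""
    return sum(saliency[r][c] for r, c in region) / len(region)


def edge_confidence(region, color_edges, motion_edges):
    """Assumed form: mean of the combined edge map along the region boundary."""
    border = boundary_pixels(region)
    combined = [max(color_edges[r][c], motion_edges[r][c]) for r, c in border]
    return sum(combined) / len(combined)


def foreground_confidence(region, saliency, color_edges, motion_edges):
    """c = phi + psi for one candidate region (a set of (row, col) pixels)."""
    return appearance_confidence(region, saliency) + edge_confidence(
        region, color_edges, motion_edges
    )
```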
14. Candidate Region Generation
• Foreground confidence
• Select the top 20 candidate regions
• Warp the selected candidate regions to neighboring frames
• Rearrange the set of candidate regions $\mathcal{Q}^{(t)} = \{q_1^{(t)}, q_2^{(t)}, \ldots, q_N^{(t)}\}$
• Feature description
• Describe the feature $\mathbf{f}_i^{(t)}$ of each candidate region $q_i^{(t)}$ using the bag-of-visual-words approach
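A bag-of-visual-words feature can be sketched as a normalized histogram of nearest-codeword assignments. The slides do not specify the local descriptors or the codebook, so the inputs here (tuples of numbers and a pre-built codebook) and the name `bag_of_words` are illustrative assumptions.

```python
def bag_of_words(descriptors, codebook):
    """L1-normalized histogram of nearest visual words.

    descriptors: local descriptors sampled inside one candidate region.
    codebook: visual words (same dimensionality as the descriptors).
    """
    histogram = [0.0] * len(codebook)
    for d in descriptors:
        # Assign the descriptor to its nearest codeword (squared Euclidean).
        distances = [sum((x - y) ** 2 for x, y in zip(d, word)) for word in codebook]
        histogram[distances.index(min(distances))] += 1.0
    total = sum(histogram)
    return [h / total for h in histogram] if total else histogram


codebook = [(0.0, 0.0), (10.0, 10.0)]
region_descriptors = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (9.0, 11.0)]
print(bag_of_words(region_descriptors, codebook))
```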
15. Initial Region Estimation
• Selecting initial primary object regions
• Choose the main region $q_\delta^{(t)}$ among the candidate regions
• Exploit the recurrence property that a primary object appears
repeatedly in a video sequence
[Figure: pipeline from input frames to candidate region generation to initial region estimation]
16. Initial Region Estimation
• Selecting initial primary object regions
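• Assume that the feature of the main region $q_\delta^{(t)}$ should be similar to the features of the main regions in the other frames
• $\mathbf{p}^{(\tau)}$ denotes the feature of the main region in frame $I^{(\tau)}$
$\delta = \arg\min_i \sum_{\tau=1, \tau \neq t}^{T} d_\chi(\mathbf{f}_i^{(t)}, \mathbf{p}^{(\tau)})$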
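The slides do not define the distance $d_\chi$. Assuming it is the usual chi-square distance between two histograms with $K$ bins (the factor 1/2 is a convention and may differ from the presenter's definition), it can be written as:

```latex
d_\chi(\mathbf{a}, \mathbf{b}) = \frac{1}{2} \sum_{k=1}^{K} \frac{(a_k - b_k)^2}{a_k + b_k}
```

For example, with $\mathbf{a} = (0.5, 0.5)$ and $\mathbf{b} = (1, 0)$, the two terms are $0.25 / 1.5 = 1/6$ and $0.25 / 0.5 = 1/2$, so $d_\chi(\mathbf{a}, \mathbf{b}) = \frac{1}{2}(1/6 + 1/2) = 1/3$.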
17. Initial Region Estimation
• Selecting initial primary object regions
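• Initialization of $\mathbf{p}^{(\tau)}$
• Superpose the features of all candidate regions in $\mathcal{Q}^{(\tau)}$
• Combine the features of the candidate regions, $\mathbf{F}^{(\tau)} = [\mathbf{f}_1^{(\tau)}, \ldots, \mathbf{f}_N^{(\tau)}]$, using the foreground confidence vector $\mathbf{c}^{(\tau)} = [c_1^{(\tau)}, \ldots, c_N^{(\tau)}]^T$
$\mathbf{p}^{(\tau)} = \mathbf{F}^{(\tau)} \mathbf{c}^{(\tau)}$
• Obtain the main region $q_\delta^{(t)}$ by applying $\mathbf{p}^{(\tau)}$ for each frame
• Alternate update of the main regions
• Update $\mathbf{p}^{(\tau)}$ for each frame by $\mathbf{p}^{(\tau)} \leftarrow \mathbf{f}_\delta^{(\tau)}$
• Choose the main region using the updated features
$\delta = \arg\min_i \sum_{\tau=1, \tau \neq t}^{T} d_\chi(\mathbf{f}_i^{(t)}, \mathbf{p}^{(\tau)})$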
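The initialization and the alternate update can be sketched together as below. This is a hedged illustration under stated assumptions: `chi_square` is the standard chi-square distance (the slides do not define $d_\chi$), the stopping rule (stop when no main region changes, or after `max_iters` rounds) is assumed, and all names are hypothetical.

```python
def chi_square(a, b):
    """Assumed chi-square distance between two histograms."""
    return 0.5 * sum((x - y) ** 2 / (x + y) for x, y in zip(a, b) if x + y > 0)


def init_reference(features, confidences):
    """p = F c: confidence-weighted superposition of candidate features."""
    dim = len(features[0])
    return [sum(c * f[k] for f, c in zip(features, confidences)) for k in range(dim)]


def select_main_regions(features, confidences, max_iters=10):
    """features[t][i] is the feature of candidate i in frame t.

    Returns delta, where delta[t] is the index of the main region in frame t.
    """
    num_frames = len(features)
    p = [init_reference(features[t], confidences[t]) for t in range(num_frames)]
    delta = None
    for _ in range(max_iters):
        new_delta = []
        for t in range(num_frames):
            # Cost of candidate i: distance to the references of all other frames.
            costs = [
                sum(chi_square(f, p[tau]) for tau in range(num_frames) if tau != t)
                for f in features[t]
            ]
            new_delta.append(costs.index(min(costs)))
        if new_delta == delta:
            break
        delta = new_delta
        # Alternate update: each reference becomes the chosen main-region feature.
        p = [features[t][delta[t]] for t in range(num_frames)]
    return delta


features = [
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.2, 0.8], [0.85, 0.15]],
    [[0.88, 0.12], [0.5, 0.5]],
]
confidences = [[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
print(select_main_regions(features, confidences))
```

In this toy input the recurring object has a feature close to (0.9, 0.1) in every frame, so the alternation settles on the candidates that resemble each other across frames.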
18. Primary Object Region Refinement
• Refinement of primary object regions
• Initial regions may exclude parts of primary objects or include
noisy regions (background or other objects)
• Attempt to refine initial regions
• Augment initial regions with missing regions
• Reduce initial regions by removing noisy regions
19. Primary Object Region Refinement
• Augmented regions
• Augment initial regions $q_\delta^{(t)}$ with candidate region $q_i^{(t)}$ in $\mathcal{Q}^{(t)}$
$r_i^{(t)} = q_\delta^{(t)} \cup q_i^{(t)}$
• Reduced regions
• Reduce initial regions $q_\delta^{(t)}$ using candidate region $q_j^{(t)}$ in $\mathcal{Q}^{(t)}$
$r_j^{(t)} = q_\delta^{(t)} \cap q_j^{(t)}$
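If each region is stored as a set of pixel coordinates, the augmented and reduced regions map directly onto set union and intersection. The representation and the function names below are assumptions chosen for brevity.

```python
def augmented_regions(q_delta, candidates):
    """r_i = q_delta ∪ q_i for every candidate region q_i."""
    return [q_delta | q for q in candidates]


def reduced_regions(q_delta, candidates):
    """r_j = q_delta ∩ q_j for every candidate region q_j."""
    return [q_delta & q for q in candidates]


q_delta = {(0, 0), (0, 1)}
candidates = [{(0, 1), (0, 2)}, {(0, 0)}]
print(augmented_regions(q_delta, candidates))
print(reduced_regions(q_delta, candidates))
```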
20. Primary Object Region Refinement
• Augmentation and reduction process (ARP)
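• Determine whether to augment or reduce $q_\delta^{(t)}$ using a cost function
$C(r_i^{(t)}) = C_{\mathrm{data}}(r_i^{(t)}) + \gamma \cdot C_{\mathrm{seg}}(r_i^{(t)})$
• Data cost
• Constrain the refined region $r_i^{(t)}$ to be similar to the initial regions in all frames
$C_{\mathrm{data}}(r_i^{(t)}) = \frac{1}{T} \sum_{\tau=1}^{T} d_\chi(\mathbf{f}_{\mathrm{r},i}^{(t)}, \mathbf{f}_\delta^{(\tau)})$
• Segmentation cost
• Make the refined region $r_i^{(t)}$ as dissimilar from its nearby background as possible
$C_{\mathrm{seg}}(r_i^{(t)}) = -d_\chi(\mathbf{f}_{\mathrm{r},i}^{(t)}, \mathbf{f}_{\mathrm{b},i}^{(t)})$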
21. Primary Object Region Refinement
• Augmentation and reduction process (ARP)
• Minimize the cost function for the optimal refined region
$r_*^{(t)} = \arg\min_{r_i^{(t)}} C(r_i^{(t)})$
• Perform ARP iteratively
• Construct the set of augmented and reduced regions again by employing $r_*^{(t)}$ as the initial region
• Find the optimal $r_*^{(t)}$ by minimizing $C(r_i^{(t)})$
• Repeat until $r_*^{(t)}$ is unchanged
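The iterative ARP loop can be sketched on a toy frame as follows. Several simplifications are assumptions for illustration only: features are normalized histograms of quantized color labels rather than bag-of-visual-words features, the "nearby background" is taken to be the whole complement of the region, $d_\chi$ is the standard chi-square distance, and the current region is kept among the options so that the loop stops when nothing improves.

```python
def chi_square(a, b):
    """Assumed chi-square distance between two histograms."""
    return 0.5 * sum((x - y) ** 2 / (x + y) for x, y in zip(a, b) if x + y > 0)


def histogram(pixels, labels, bins):
    """Normalized histogram of quantized color labels over a pixel set."""
    h = [0.0] * bins
    for r, c in pixels:
        h[labels[r][c]] += 1.0
    total = sum(h)
    return [v / total for v in h] if total else h


def cost(region, labels, bins, main_feats, gamma):
    """C = C_data + gamma * C_seg for one refined region."""
    frame = {(r, c) for r in range(len(labels)) for c in range(len(labels[0]))}
    f_r = histogram(region, labels, bins)
    # Assumption: the whole complement stands in for the nearby background.
    f_b = histogram(frame - region, labels, bins)
    c_data = sum(chi_square(f_r, f) for f in main_feats) / len(main_feats)
    c_seg = -chi_square(f_r, f_b)
    return c_data + gamma * c_seg


def arp(initial, candidates, labels, bins, main_feats, gamma=1.0, max_iters=20):
    """Repeat augmentation/reduction until the best region is unchanged."""
    current = initial
    for _ in range(max_iters):
        options = [current]
        options += [current | q for q in candidates]  # augmented regions
        options += [current & q for q in candidates]  # reduced regions
        options = [r for r in options if r]           # drop empty regions
        best = min(options, key=lambda r: cost(r, labels, bins, main_feats, gamma))
        if best == current:
            break
        current = best
    return current


labels = [[0, 0, 1, 1, 1, 0]]            # one-row frame, object has label 1
initial = {(0, 2), (0, 3)}               # initial region misses pixel (0, 4)
candidates = [{(0, 4)}, {(0, 0), (0, 1)}, {(0, 3), (0, 4), (0, 5)}]
main_feats = [[0.0, 1.0], [0.0, 1.0]]    # main-region features in all frames
print(sorted(arp(initial, candidates, labels, 2, main_feats)))
```

Here the first pass augments the initial region with the missing pixel, and the second pass finds no augmented or reduced region with a lower cost, so the loop terminates.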
23. Experimental results
• DAVIS dataset [2]
• 50 video sequences (3,455 annotated frames)
• Performance measure
• Region similarity 𝒥: Intersection over union
• Contour accuracy ℱ: F-measure, i.e., the harmonic mean of the contour precision and recall rates
[2] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung, “A benchmark dataset and evaluation methodology for video object segmentation,” CVPR, 2016.
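The two measures reduce to short formulas once masks are stored as sets of pixels. The sketch below covers the intersection over union and the harmonic mean only; how contour precision and recall are obtained from boundary matching is part of the benchmark [2] and is not reproduced here. Function names are illustrative.

```python
def region_similarity(predicted, ground_truth):
    """Intersection over union of two masks given as sets of pixels."""
    union = predicted | ground_truth
    if not union:
        return 1.0  # assumption: two empty masks count as a perfect match
    return len(predicted & ground_truth) / len(union)


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


predicted = {(0, 0), (0, 1), (1, 0)}
ground_truth = {(0, 1), (1, 0), (1, 1)}
print(region_similarity(predicted, ground_truth))
print(f_measure(0.5, 1.0))
```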
24. Experimental results
• Impacts of ARP
• Compare ARP with the conventional refinement techniques [20, 36]
• Apply refinement techniques to our initial regions (IR)
[20] A. Papazoglou and V. Ferrari, “Fast object segmentation in unconstrained video,” ICCV, 2013.
[36] D. Zhang, O. Javed, and M. Shah, “Video object segmentation through spatially accurate and temporally dense extraction of primary object regions,” CVPR, 2013.
25. Experimental results
• Quantitative comparison
• Semi-supervised: Human annotation at the first frame
• Multiple VOS: Output multiple objects
• POS: Output a primary object
28. Multiple Object Segmentation
• Multiple object segmentation
• Motion segmentation
• Cluster point trajectories in a video
• Video object proposal
• Proposal matching
• Proposal clustering
• Segmentation guided by object detection and tracking
29. CDTS: Collaborative Detection, Tracking, and Segmentation for Online Multiple Object Segmentation in Videos
• Overview
• Input: A set of consecutive video frames
• Output: Multiple segment tracks
[Figure: CDTS pipeline. Labels: input frames; object track generation (joint detection and tracking); detection and tracking results; ASE segmentation]
30. Object Track Generation
• Joint detection and tracking
• Detector [3]
• Find object location without manual annotations
• Some objects may remain undetected
• Tracker [4]
• Boost the recall rate of objects using temporal correlations
• Three cases
• Both detection and tracking boxes
• Only detection box
• Only tracking box
[3] J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: Object detection via region-based fully convolutional networks,” NIPS, 2016.
[4] H.-U. Kim, D.-Y. Lee, J.-Y. Sim, and C.-S. Kim, “SOWP: Spatially ordered and weighted patch descriptor for visual tracking,” ICCV, 2015.
31. Object Track Generation
• Joint detection and tracking
• Both detection and tracking boxes
• Match detection and tracking boxes
• The Hungarian algorithm
• Choose the more accurate box for each matching pair
• Link the selected box to the corresponding object track
• Unmatched detection box
• Regard as newly appearing object
• Unmatched tracking box
• Link to the corresponding object track
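The three cases can be sketched as a one-to-one assignment between detection boxes and tracking boxes. To keep the example dependency-free, the code below searches all permutations, which finds the same optimal assignment as the Hungarian algorithm for a handful of boxes but is not the Hungarian algorithm itself. Using IoU as the matching score and 0.5 as the minimum overlap are assumptions, and the rule for choosing the more accurate box of a matched pair is not given in the slides, so it is left out.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_boxes(detections, tracks, min_iou=0.5):
    """Return (matched pairs, unmatched detections, unmatched tracking boxes)."""
    n_d, n_t = len(detections), len(tracks)
    size = max(n_d, n_t)

    def overlap(i, j):
        # Padding entries (index beyond the real boxes) have zero overlap.
        if i < n_d and j < n_t:
            return iou(detections[i], tracks[j])
        return 0.0

    # Exhaustive stand-in for the Hungarian algorithm (fine for few boxes).
    best = max(
        permutations(range(size)),
        key=lambda perm: sum(overlap(i, j) for i, j in enumerate(perm)),
    )
    pairs = [
        (i, j)
        for i, j in enumerate(best)
        if i < n_d and j < n_t and overlap(i, j) >= min_iou
    ]
    matched_d = {i for i, _ in pairs}
    matched_t = {j for _, j in pairs}
    new_objects = [i for i in range(n_d) if i not in matched_d]
    tracking_only = [j for j in range(n_t) if j not in matched_t]
    return pairs, new_objects, tracking_only


detections = [(0, 0, 10, 10), (50, 50, 60, 60)]
tracks = [(1, 1, 11, 11), (100, 100, 110, 110)]
print(match_boxes(detections, tracks))
```

Unmatched detections start new object tracks, and unmatched tracking boxes are linked to their existing tracks. For larger problems, `scipy.optimize.linear_sum_assignment` solves the same assignment problem in polynomial time.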
32. ASE Segmentation
• Alternate shrinking and expansion (ASE)
• Over-segment each frame into superpixels
• Dichotomize each superpixel within and near the box into
either foreground or background class
33. ASE Segmentation
• Over-segmentation
• Obtain superpixels using UCM
• Preliminary classification
• Exploit overlap ratio between the box and each superpixel
• Refine preliminary foreground regions
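A minimal sketch of the preliminary classification follows. The slides do not give the exact definition of the overlap ratio or the decision threshold, so the fraction of a superpixel's pixels that fall inside the box and the threshold of 0.5 are assumptions, as are the function names.

```python
def overlap_ratio(superpixel, box):
    """Fraction of the superpixel's pixels inside the box.

    superpixel: set of (row, col) pixels.
    box: (x1, y1, x2, y2) with x2 and y2 exclusive.
    """
    x1, y1, x2, y2 = box
    inside = sum(1 for r, c in superpixel if y1 <= r < y2 and x1 <= c < x2)
    return inside / len(superpixel)


def preliminary_foreground(superpixels, box, threshold=0.5):
    """Indices of superpixels preliminarily classified as foreground."""
    return [
        k for k, sp in enumerate(superpixels) if overlap_ratio(sp, box) >= threshold
    ]


superpixels = [{(0, 0), (0, 1)}, {(0, 2), (0, 3)}, {(0, 4), (0, 5)}]
print(preliminary_foreground(superpixels, box=(1, 0, 4, 1)))
```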
34. ASE Segmentation
• Intra-frame refinement
• Constrain foreground regions to have intense edge strengths
• Boundary cost
• Shrink foreground regions by removing superpixels to minimize the boundary cost in a greedy manner
$C_{\mathrm{bnd}}(F_i^{(t)}) = -\sum_{\mathbf{x} \in \partial F_i^{(t)}} U^{(t)}(\mathbf{x})$
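The greedy shrinking step can be sketched as below, assuming that $U^{(t)}$ is a per-pixel edge strength map (for example a UCM) and that the boundary of a foreground region consists of its pixels with a 4-neighbor inside the frame but outside the region. Both the boundary definition and the stopping rule (stop when no removal lowers the cost) are assumptions.

```python
NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def boundary(pixels, height, width):
    """Region pixels with an in-frame 4-neighbor outside the region."""
    border = set()
    for r, c in pixels:
        for dr, dc in NEIGHBORS:
            nr, nc = r + dr, c + dc
            if 0 <= nr < height and 0 <= nc < width and (nr, nc) not in pixels:
                border.add((r, c))
                break
    return border


def boundary_cost(pixels, U):
    """C_bnd = -(sum of edge strengths over the region boundary)."""
    return -sum(U[r][c] for r, c in boundary(pixels, len(U), len(U[0])))


def shrink(foreground, superpixels, U):
    """Greedily remove superpixels while the boundary cost decreases.

    foreground: indices of superpixels currently labeled foreground.
    superpixels: list of pixel sets.
    """
    fg = set(foreground)

    def pixels_of(ids):
        return set().union(*(superpixels[k] for k in ids))

    current = boundary_cost(pixels_of(fg), U)
    while len(fg) > 1:
        trials = {k: boundary_cost(pixels_of(fg - {k}), U) for k in fg}
        k_best = min(trials, key=trials.get)
        if trials[k_best] >= current:
            break  # no removal lowers the cost
        fg.remove(k_best)
        current = trials[k_best]
    return fg


U = [[0.0, 0.9, 0.0, 0.8, 0.1, 0.0]]        # strong edges at columns 1 and 3
superpixels = [{(0, 1), (0, 2), (0, 3)}, {(0, 4)}]
print(shrink({0, 1}, superpixels, U))
```

In this one-row example, removing the second superpixel moves the right boundary from a weak edge (0.1) to a strong one (0.8), so the cost drops and the superpixel is discarded.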
35. ASE Segmentation
• Inter-frame refinement
• Constrain the refined region to be similar to the segmentation results in previous frames
• Cost function
$C_{\mathrm{inter}}(F_i^{(t)}, \mathcal{B}_i^{(t)}) = \alpha \cdot C_{\mathrm{tmp}}(F_i^{(t)}) + C_{\mathrm{seg}}(F_i^{(t)}, \mathcal{B}_i^{(t)}) + C_{\mathrm{bnd}}(F_i^{(t)})$
• Expand foreground regions by augmenting superpixels
• Perform shrinking in a similar way
37. Experimental Results
• YouTube-Objects dataset
• Contains 126 videos in 10 object classes
• Performance measure
• Intersection over union (IoU)
[34] Y.-H. Tsai, G. Zhong, and M.-H. Yang, “Semantic cosegmentation in videos,” ECCV, 2016.
[42] Y. Zhang, X. Chen, J. Li, C. Wang, and C. Xia, “Semantic object segmentation via detection in weakly labeled video,” CVPR, 2015.