Video Analysis with Convolutional Neural Networks (Master Computer Vision Barcelona 2017)

http://pagines.uab.cat/mcv/



  1. 1. @DocXavi Module 4 - Lecture 6 Video Analysis with CNNs 31 January 2017 Xavier Giró-i-Nieto [http://pagines.uab.cat/mcv/]
  2. 2. Acknowledgments 2 Víctor Campos Alberto Montes
  3. 3. Linked slides
  4. 4. Outline 1. Recognition 2. Optical Flow 3. Object Tracking 4. Audio and Video 5. Generative models 4
  5. 5. Recognition Demo: Clarifai MIT Technology Review: “A Start-up’s Neural Network Can Understand Video” (3/2/2015) 5
  6. 6. Figure: Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 6 Recognition
  7. 7. 7 Recognition Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
  8. 8. 8 Recognition Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Previous lectures
  9. 9. 9 Recognition Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
  10. 10. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. Slides extracted from ReadCV seminar by Victor Campos 10 Recognition: DeepVideo
  11. 11. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 11 Recognition: DeepVideo: Demo
  12. 12. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 12 Recognition: DeepVideo: Architectures
  13. 13. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 13 Unsupervised learning [Le et al.’11] Supervised learning [Karpathy et al.’14] Recognition: DeepVideo: Features
  14. 14. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 14 Recognition: DeepVideo: Multiscale
  15. 15. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014, June). Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (pp. 1725-1732). IEEE. 15 Recognition: DeepVideo: Results
  16. 16. 16 Recognition Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
  17. 17. 17 Recognition: C3D Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015
  18. 18. 18 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Recognition: C3D: Demo
  19. 19. 19 K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, ICLR 2015. Recognition: C3D: Spatial dimension The spatial dimensions (XY) of the kernels are fixed to 3x3, following Simonyan & Zisserman (ICLR 2015).
  20. 20. 20 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Recognition: C3D: Temporal dimension 3D ConvNets are more suitable for spatiotemporal feature learning than 2D ConvNets. [Figure: accuracy as a function of temporal kernel depth, compared with 2D ConvNets]
  21. 21. 21 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 A homogeneous architecture with small 3 × 3 × 3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets Recognition: C3D: Temporal dimension
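  To make the operation concrete, here is a minimal PyTorch sketch of a single homogeneous 3×3×3 convolution applied to a 16-frame clip; the layer width (64 channels) is illustrative rather than the exact C3D configuration (the original was implemented in Caffe):

```python
import torch
import torch.nn as nn

# Input: a batch of clips shaped (N, C, T, H, W) = (1, 3, 16, 112, 112),
# i.e. 16 RGB frames of 112x112 pixels, as in the C3D paper.
clip = torch.randn(1, 3, 16, 112, 112)

# One homogeneous 3x3x3 convolution: the kernel spans 3 frames in time and
# a 3x3 spatial neighbourhood, with padding 1 to preserve the input size.
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
features = conv3d(clip)
print(features.shape)  # torch.Size([1, 64, 16, 112, 112])
```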
  22. 22. 22 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 No gain when varying the temporal depth across layers. Recognition: C3D: Temporal dimension
  23. 23. 23 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Recognition: C3D: Architecture Feature vector
  24. 24. 24 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Recognition: C3D: Feature vector The video sequence is split into 16-frame-long clips with an 8-frame overlap.
  25. 25. 25 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Recognition: C3D: Feature vector 16-frame clip + 16-frame clip + 16-frame clip + ... → Average → 4096-dim video descriptor → L2 norm → 4096-dim video descriptor
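  As a minimal NumPy sketch of the pipeline above, assuming the 4096-dim fc6 activations have already been extracted for each 16-frame clip (`clip_features` below is made-up data):

```python
import numpy as np

# Hypothetical fc6 activations for 10 overlapping 16-frame clips of one video.
clip_features = np.random.randn(10, 4096)

# Average the per-clip descriptors over the whole video, then L2-normalize,
# as in the C3D feature-extraction pipeline described above.
video_descriptor = clip_features.mean(axis=0)
video_descriptor /= np.linalg.norm(video_descriptor)
```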
  26. 26. 26 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Recognition: C3D: Visualization Based on Deconvnets by Zeiler and Fergus [ECCV 2014] - See [ReadCV Slides] for more details.
  27. 27. 27 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Recognition: C3D: Compactness
  28. 28. 28 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Convolutional 3D (C3D) features combined with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks, and are comparable with state-of-the-art methods on the other 2 benchmarks. Recognition: C3D: Performance
  29. 29. 29 Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497. 2015 Recognition: C3D: Software Implementation by Michael Gygli (GitHub)
  30. 30. 30 Simonyan, Karen, and Andrew Zisserman. "Two-stream convolutional networks for action recognition in videos." NIPS 2014. Recognition: Two stream Two CNNs in parallel: ● One for RGB images ● One for optical flow (hand-crafted features) Fusion after the softmax layer
  31. 31. 31 Feichtenhofer, Christoph, Axel Pinz, and Andrew Zisserman. "Convolutional two-stream network fusion for video action recognition." CVPR 2016. [code] Recognition: Two stream Two CNNs in parallel: ● One for RGB images ● One for optical flow (hand-crafted features) Fusion at a convolutional layer
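  A minimal sketch of the late-fusion variant (slide 30, fusion after the softmax); `rgb_net` and `flow_net` are hypothetical stand-ins for the spatial and temporal streams, and simple score averaging is assumed as the fusion rule:

```python
import torch.nn.functional as F

def two_stream_late_fusion(rgb_net, flow_net, rgb_frame, flow_stack):
    # Spatial stream: a single RGB frame; temporal stream: a stack of
    # optical-flow fields. Class scores are fused after the softmax.
    rgb_scores = F.softmax(rgb_net(rgb_frame), dim=1)
    flow_scores = F.softmax(flow_net(flow_stack), dim=1)
    return (rgb_scores + flow_scores) / 2  # simple averaging fusion
```

The conv-layer fusion of Feichtenhofer et al. would instead concatenate or sum intermediate feature maps of the two streams before the classifier.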
  32. 32. 32 Shou, Zheng, Dongang Wang, and Shih-Fu Chang. "Temporal action localization in untrimmed videos via multi-stage cnns." CVPR 2016. (Slidecast and Slides by Alberto Montes) Recognition: Localization
  33. 33. 33 Recognition: Localization Shou, Zheng, Dongang Wang, and Shih-Fu Chang. "Temporal action localization in untrimmed videos via multi-stage cnns." CVPR 2016. (Slidecast and Slides by Alberto Montes)
  34. 34. 34 Recognition: Localization Shou, Zheng, Dongang Wang, and Shih-Fu Chang. "Temporal action localization in untrimmed videos via multi-stage cnns." CVPR 2016. (Slidecast and Slides by Alberto Montes)
  35. 35. Outline 1. Recognition 2. Optical Flow 3. Object Tracking 4. Audio and Video 5. Generative models 35
  36. 36. Optical Flow Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 36
  37. 37. Optical Flow: DeepFlow Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 37 Andrei Bursuc, Postdoc at INRIA, @abursuc
  38. 38. Optical Flow: DeepFlow Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 38 ● Deep (hierarchy) ✔ ● Convolution ✔ ● Learning ❌
  39. 39. Optical Flow: Small vs Large Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 39
  40. 40. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 40 Optical Flow Classic approach: Rigid matching of HoG or SIFT descriptors Deep Matching: Allow each subpatch to move: ● independently ● in a limited range depending on its size
  41. 41. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 41 Optical Flow: Deep Matching
  42. 42. Source: Matlab R2015b documentation for normxcorr2 by Mathworks 42 Optical Flow: 2D correlation Image Sub-Image Offset of the sub-image with respect to the image [0,0].
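  A brute-force NumPy sketch of the normalized 2D cross-correlation illustrated here (MATLAB's normxcorr2 is far more efficient; this only shows how the argmax of the response map recovers the sub-image's offset):

```python
import numpy as np

def normxcorr2(template, image):
    """Normalized cross-correlation of a template against an image.
    The argmax of the returned map gives the template's offset."""
    th, tw = template.shape
    ih, iw = image.shape
    t = template - template.mean()
    out = np.zeros((ih - th + 1, iw - tw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = image[y:y + th, x:x + tw]
            p = patch - patch.mean()
            denom = np.sqrt((p ** 2).sum() * (t ** 2).sum())
            out[y, x] = (p * t).sum() / denom if denom > 0 else 0.0
    return out

image = np.random.rand(64, 64)
template = image[20:36, 30:46]            # a 16x16 sub-image
response = normxcorr2(template, image)
offset = np.unravel_index(response.argmax(), response.shape)
print(offset)  # (20, 30): offset of the sub-image within the image
```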
  43. 43. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 43 Instead of pre-trained filters, a convolution is defined between each: ● patch of the reference image ● the target image ...as a result, a correlation map is generated for each reference patch. Optical Flow: Deep Matching
  44. 44. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 44 Optical Flow: Deep Matching The most discriminative response map The least discriminative response map
  45. 45. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 45 Key idea: Build (bottom-up) a pyramid of correlation maps to run an efficient (top-down) search. Optical Flow: Deep Matching 4x4 patches 8x8 patches 16x16 patches 32x32 patches Top-down matching (TD) Bottom-up extraction (BU)
  46. 46. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 46 Key idea: Build (bottom-up) a pyramid of correlation maps to run an efficient (top-down) search. Optical Flow: Deep Matching 4x4 patches 8x8 patches 16x16 patches 32x32 patches Bottom-up extraction (BU)
  47. 47. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 47 Optical Flow: Deep Matching (BU)
  48. 48. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 48 Key idea: Build (bottom-up) a pyramid of correlation maps to run an efficient (top-down) search. Optical Flow: Deep Matching (TD) 4x4 patches 8x8 patches 16x16 patches 32x32 patches Top-down matching (TD)
  49. 49. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 49 Optical Flow: Deep Matching (TD) Each local maximum in the top layer corresponds to a shift of one of the biggest (32x32) patches. If we focus on a local maximum, we can retrieve the corresponding responses one scale below and focus on the shifts of the sub-patches that generated it.
  50. 50. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 50 Optical Flow: Deep Matching (TD)
  51. 51. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 51 Optical Flow: Deep Matching
  52. 52. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 52 Ground truth Dense HOG [Brox & Malik 2011] Deep Matching Optical Flow: Deep Matching
  53. 53. Weinzaepfel, P., Revaud, J., Harchaoui, Z., & Schmid, C. (2013, December). DeepFlow: Large displacement optical flow with deep matching. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1385-1392). IEEE 53 Optical Flow: Deep Matching
  54. 54. Optical Flow Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 54
  55. 55. Optical Flow: FlowNet Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 55
  56. 56. Optical Flow: FlowNet Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 56 End to end supervised learning of optical flow.
  57. 57. Optical Flow: FlowNet (contracting) Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 57 Option A: Stack both input images together and feed them through a generic network.
  58. 58. Optical Flow: FlowNet (contracting) Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 58 Option B: Create two separate, yet identical processing streams for the two images and combine them at a later stage.
  59. 59. Optical Flow: FlowNet (contracting) Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 59 Option B: Create two separate, yet identical processing streams for the two images and combine them at a later stage. Correlation layer: a convolution between patches of the two feature maps to be combined.
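  A minimal sketch of such a correlation layer: each position of the first feature map is compared (by dot product) with displaced positions of the second map within a limited range. Single-pixel "patches" are used here for brevity; FlowNet's actual layer uses larger patches and strides:

```python
import torch

def correlation_layer(f1, f2, max_disp=4):
    """f1, f2: feature maps of shape (N, C, H, W).
    Returns (N, (2*max_disp+1)**2, H, W): one correlation channel
    per candidate displacement of f2 relative to f1."""
    n, c, h, w = f1.shape
    padded = torch.nn.functional.pad(f2, [max_disp] * 4)
    maps = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = padded[:, :, dy:dy + h, dx:dx + w]
            maps.append((f1 * shifted).sum(dim=1, keepdim=True))
    return torch.cat(maps, dim=1) / c  # normalize by channel count
```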
  60. 60. Optical Flow: FlowNet (expanding) Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 60 Upconvolutional layers: unpooling of feature maps + convolution. Upconvolved feature maps are concatenated with the corresponding map from the contracting part.
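  One upconvolution step could be sketched in PyTorch as follows; the channel and spatial sizes are placeholders, not FlowNet's exact configuration:

```python
import torch
import torch.nn as nn

# Hypothetical channel sizes; FlowNet's exact configuration differs.
upconv = nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1)

coarse = torch.randn(1, 256, 8, 8)    # features from the previous layer
skip = torch.randn(1, 128, 16, 16)    # matching map from the contracting part

up = upconv(coarse)                   # -> (1, 128, 16, 16): doubled resolution
fused = torch.cat([up, skip], dim=1)  # -> (1, 256, 16, 16)
```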
  61. 61. Optical Flow: FlowNet Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. ICCV 2015 61 Since existing ground-truth datasets are not sufficiently large to train a ConvNet, a synthetic Flying Chairs dataset is generated… and augmented (translation, rotation and scaling transformations; additive Gaussian noise; changes in brightness, contrast, gamma and color). ConvNets trained on this unrealistic data generalize well to existing datasets such as Sintel and KITTI. Data augmentation
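  The photometric part of this augmentation can be sketched directly, since brightness/contrast/gamma changes and additive noise leave the ground-truth flow unchanged (geometric transformations would also have to be applied to the flow field). Parameter ranges below are made up:

```python
import numpy as np

def photometric_augment(frame1, frame2, rng=np.random):
    """Apply the same random brightness/contrast/gamma change plus
    additive Gaussian noise to both frames (values assumed in [0, 1]);
    the flow label is unaffected."""
    gamma = rng.uniform(0.7, 1.5)
    contrast = rng.uniform(0.8, 1.2)
    brightness = rng.uniform(-0.1, 0.1)
    out = []
    for f in (frame1, frame2):
        f = np.clip(f, 0, 1) ** gamma
        f = np.clip(contrast * f + brightness, 0, 1)
        f = np.clip(f + rng.normal(0, 0.02, f.shape), 0, 1)
        out.append(f)
    return out
```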
  62. 62. Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D. and Brox, T., 2015. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2758-2766). 62 Optical Flow: FlowNet
  63. 63. Outline 1. Recognition 2. Optical Flow 3. Object Tracking 4. Audio and Video 5. Generative models 63
  64. 64. Object tracking: MDNet 64 Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015)
  65. 65. Object tracking: MDNet 65 Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015)
  66. 66. Object tracking: MDNet: Architecture 66 Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015) Domain-specific layers are used during training for each sequence, but are replaced by a single one at test time.
  67. 67. Object tracking: MDNet: Online update 67 Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." ICCV VOT Workshop (2015) MDNet is updated online at test time with hard negative mining, that is, by selecting the negative samples with the highest positive score.
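  Hard negative mining reduces to a top-k selection; a minimal NumPy sketch, with hypothetical candidate scores:

```python
import numpy as np

def mine_hard_negatives(positive_scores, num_hard):
    """positive_scores: the tracker's 'target' score for each negative
    candidate box. The highest-scoring negatives are the hardest ones."""
    order = np.argsort(positive_scores)[::-1]  # descending score
    return order[:num_hard]                    # indices of hard negatives

scores = np.random.rand(256)        # made-up scores for 256 negative boxes
hard_idx = mine_hard_negatives(scores, num_hard=32)
```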
  68. 68. Object tracking: FCNT 68 Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." ICCV 2015 [code]
  69. 69. Object tracking: FCNT 69 Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 3119-3127. 2015 [code] Focus on conv4-3 and conv5-3 of the VGG-16 network pre-trained for ImageNet image classification. conv4-3 conv5-3
  70. 70. Object tracking: FCNT: Specialization 70 Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 3119-3127. 2015 [code] Most feature maps in VGG-16 conv4-3 and conv5-3 are not related to the foreground regions in a tracking sequence.
  71. 71. Object tracking: FCNT: Localization 71 Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 3119-3127. 2015 [code] Although trained for image classification, feature maps in conv5-3 enable object localization… ...but are not discriminative enough to distinguish different objects of the same category.
  72. 72. Object tracking: Localization 72 Zhou, Bolei, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. "Object detectors emerge in deep scene cnns." ICLR 2015. Other works have shown how feature maps in convolutional layers allow object localization.
  73. 73. Object tracking: FCNT: Localization 73 Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 3119-3127. 2015 [code] On the other hand, feature maps from conv4-3 are more sensitive to intra-class appearance variation… conv4-3 conv5-3
  74. 74. Object tracking: FCNT: Architecture 74 Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 3119-3127. 2015 [code] SNet=Specific Network (online update) GNet=General Network (fixed)
  75. 75. Object tracking: FCNT: Results 75 Wang, Lijun, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. "Visual Tracking with Fully Convolutional Networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 3119-3127. 2015 [code]
  76. 76. Outline 1. Recognition 2. Optical Flow 3. Object Tracking 4. Audio and Video 5. Generative models 76
  77. 77. 77 Audio and Video Audio Vision
  78. 78. 78 Audio and Video: Soundnet Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." In Advances in Neural Information Processing Systems, pp. 892-900. 2016. Object & scene recognition in videos by analysing only the audio track.
  79. 79. 79 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Videos for training are unlabeled. Relies on CNNs trained on labeled images. Audio and Video: Soundnet
  80. 80. 80 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Videos for training are unlabeled. Relies on CNNs trained on labeled images. Audio and Video: Soundnet
  81. 81. 81 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." In Advances in Neural Information Processing Systems, pp. 892-900. 2016. Audio and Video: Soundnet
  82. 82. 82 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Hidden layers of SoundNet are used to train a standard SVM classifier that outperforms the state of the art. Audio and Video: Soundnet
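  This setup could be sketched with scikit-learn, assuming SoundNet hidden-layer activations have already been extracted as fixed-length vectors (the feature dimension, labels and data below are made up):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical features from a SoundNet hidden layer (e.g. pooled over
# time), with one acoustic-scene label per training clip.
X_train = np.random.randn(500, 1024)
y_train = np.random.randint(0, 10, size=500)

clf = LinearSVC(C=1.0)       # the simple linear classifier on top
clf.fit(X_train, y_train)
pred = clf.predict(np.random.randn(5, 1024))
```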
  83. 83. 83 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Visualization of the 1D filters over raw audio in conv1. Audio and Video: Soundnet
  84. 84. 84 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Visualization of the 1D filters over raw audio in conv1. Audio and Video: Soundnet
  85. 85. 85 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS 2016. Visualization of the 1D filters over raw audio in conv1. Audio and Video: Soundnet
  86. 86. 86 Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." In Advances in Neural Information Processing Systems, pp. 892-900. 2016. Visualization of the video frames associated to the sounds that activate some of the last hidden units (conv7): Audio and Video: Soundnet
  87. 87. 87 Audio and Video: Sonorization Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016. Learn to synthesize sounds from videos of people hitting objects with a drumstick.
  88. 88. 88 Audio and Video: Visual Sounds Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016. Not end-to-end
  89. 89. 89 Audio and Video: Visual Sounds Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. "Visually indicated sounds." CVPR 2016.
  90. 90. 90 Learn more Ruslan Salakhutdinov, “Multimodal Machine Learning” (NIPS 2015 Workshop)
  91. 91. Generative models for Video 91 Slides D2L5 by Santi Pascual.
  92. 92. 92 What are Generative Models? We want a model with parameters θ = {weights, biases}, whose outputs are distributed as Pmodel, to estimate the distribution Pdata of our training data. Example: y = f(x), where y is a scalar; we make Pmodel similar to Pdata by training the parameters θ to maximize their similarity.
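  Stated as an equation (not on the original slide), "maximizing the similarity" is usually realized as maximum likelihood over the training samples:

```latex
\theta^{\ast} = \arg\max_{\theta} \; \mathbb{E}_{x \sim P_{\text{data}}}
  \left[ \log P_{\text{model}}(x;\, \theta) \right]
  \;\approx\; \arg\max_{\theta} \frac{1}{N} \sum_{i=1}^{N}
  \log P_{\text{model}}(x_i;\, \theta)
```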
  93. 93. Key Idea: our model cares about what distribution generated the input data points, and we want to mimic it with our probabilistic model. Our learned model should be able to make up new samples from the distribution, not just copy and paste existing samples! 93 What are Generative Models? Figure from NIPS 2016 Tutorial: Generative Adversarial Networks (I. Goodfellow)
  94. 94. 94 Video Frame Prediction Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error." ICLR 2016
  95. 95. 95 Video Frame Prediction Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error." ICLR 2016
  96. 96. 96 Video Frame Prediction Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error." ICLR 2016
  97. 97. Adversarial Training analogy Imagine we have a counterfeiter (G) trying to make fake money, and the police (D) has to detect whether money is real or fake. 100 100 It’s not even green
  98. 98. Adversarial Training analogy Imagine we have a counterfeiter (G) trying to make fake money, and the police (D) has to detect whether money is real or fake. 100 100 There is no watermark
  99. 99. Adversarial Training analogy Imagine we have a counterfeiter (G) trying to make fake money, and the police (D) has to detect whether money is real or fake. 100 100 Watermark should be rounded
  100. 100. Adversarial Training analogy Imagine we have a counterfeiter (G) trying to make fake money, and the police (D) has to detect whether money is real or fake. ? After enough iterations, and if the counterfeiter is good enough (in terms of G network it means “has enough parameters”), the police should be confused.
  101. 101. Adversarial Training (batch update) ● Pick a sample x from the training set ● Show x to D and update weights to output 1 (real)
  102. 102. Adversarial Training (batch update) ● G maps sample z to ẍ ● Show ẍ to D and update weights to output 0 (fake)
  103. 103. Adversarial Training (batch update) ● Freeze D weights ● Update G weights to make D output 1 (just G weights!) ● Unfreeze D weights and repeat (see the sketch below)
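  Putting the three slides above together, one training iteration could be sketched in PyTorch as follows; `G`, `D` and the two optimizers are assumed given, with `D` ending in a sigmoid:

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, x_real, opt_d, opt_g, z_dim=100):
    n = x_real.size(0)
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

    # 1) Update D: real samples -> 1, generated samples -> 0.
    z = torch.randn(n, z_dim)
    x_fake = G(z).detach()               # G is not updated in this step
    loss_d = F.binary_cross_entropy(D(x_real), ones) + \
             F.binary_cross_entropy(D(x_fake), zeros)
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # 2) Update G only: try to make D output 1 on generated samples.
    #    Only opt_g steps here, so D's weights stay effectively frozen.
    z = torch.randn(n, z_dim)
    loss_g = F.binary_cross_entropy(D(G(z)), ones)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```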
  104. 104. 104 Generative Adversarial Networks (GANs) Slide credit: Víctor Garcia Discriminator D(·) Generator G(·) Real World Random seed (z) Real/Synthetic
  105. 105. 105Slide credit: Víctor Garcia Conditional Adversarial Networks Real World Real/Synthetic Condition Discriminator D(·) Generator G(·) Generative Adversarial Networks (GANs)
  106. 106. Generating images/frames (Radford et al. 2015) The Deep Convolutional GAN (DCGAN) effectively generated 64x64 RGB images in a single shot, for example bedrooms from the LSUN dataset.
  107. 107. Generating images/frames conditioned on captions (Reed et al. 2016b) (Zhang et al. 2016)
  108. 108. Unsupervised feature extraction/learning representations Similarly to word2vec, GANs learn a distributed representation that disentangles concepts such that we can perform operations on the data manifold: v(Man with glasses) - v(man) + v(woman) = v(woman with glasses) (Radford et al. 2015)
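  A sketch of this latent-space arithmetic, assuming a latent code per concept is obtained by averaging the z vectors of a few generated examples; `G` is a hypothetical trained DCGAN generator:

```python
import torch

# Hypothetical latent codes (e.g. each averaged over a few samples whose
# generated images show the concept) from a trained generator G.
z_man_glasses = torch.randn(3, 100).mean(0)
z_man = torch.randn(3, 100).mean(0)
z_woman = torch.randn(3, 100).mean(0)

z = z_man_glasses - z_man + z_woman   # "woman with glasses"
# image = G(z.unsqueeze(0))           # decode with the generator
```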
  109. 109. Image super-resolution Bicubic: not using data statistics. SRResNet: trained with MSE. SRGAN is able to understand that there are multiple correct answers, rather than averaging. (Ledig et al. 2016)
  110. 110. Image super-resolution Averaging is a serious problem we face when dealing with complex distributions. (Ledig et al. 2016)
  111. 111. Manipulating images and assisted content creation https://youtu.be/9c4z6YsBGQ0?t=126 https://youtu.be/9c4z6YsBGQ0?t=161 (Zhu et al. 2016)
  112. 112. 112 Adversarial Networks Slide credit: Víctor Garcia Isola, Phillip, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. "Image-to-image translation with conditional adversarial networks." arXiv preprint arXiv:1611.07004 (2016). Generator Discriminator Generated Pairs Real World Ground Truth Pairs Loss → BCE
  113. 113. 113 Víctor Garcia and Xavier Giró-i-Nieto (work in progress) Generator Discriminator Loss2 GAN {Binary Crossentropy} 1/0 Generative Adversarial Networks (GANs)
  114. 114. Generative models for video 114 Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. "Generating videos with scene dynamics." NIPS 2016.
  115. 115. Generative models for video 115 Vondrick, Carl, Hamed Pirsiavash, and Antonio Torralba. "Generating videos with scene dynamics." NIPS 2016.
  116. 116. 116 Adversarial Networks Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. "Generative adversarial nets." NIPS 2014 Goodfellow, Ian. "NIPS 2016 Tutorial: Generative Adversarial Networks." arXiv preprint arXiv:1701.00160 (2016). F. Van Veen, “The Neural Network Zoo” (2016)
  117. 117. Outline 1. Recognition 2. Optical Flow 3. Object Tracking 4. Audio and Video 5. Generative models 117
  118. 118. 118 Thank you! https://imatge.upc.edu/web/people/xavier-giro https://twitter.com/DocXavi https://www.facebook.com/ProfessorXavi xavier.giro@upc.edu Xavier Giró-i-Nieto [Part B: Video and audio]
