



Invited talk on AR/SLAM and IoT in the ILAS Seminar: Introduction to IoT and
Security, Kyoto University, 2020.
Speaker: Tomoyuki Mukasa

Published in: Technology


  1. 1. June 25, 2020 Tomoyuki Mukasa Rakuten Institute of Technology Rakuten, Inc.
  2. 2. 2 2015 Ph.D. Student Engineer Researcher 2012 3D Reconstruction & Motion Analysis Tomoyuki MUKASA, Ph.D. 3D Vision Researcher VR for Exhibition AR for Tourism AR/VR/HCI for e-commerce
  3. 3. 4 Contributing to existing businesses Exploring new ideas Increasing tech-brand awareness Using Computer Vision & Human Computer Interaction
  4. 4. 5 Woman Red Blouse Category Attributes
  5. 5. 6 Commoditization of AR/SLAM: Prototypes in Rakuten • Impact of ARKit/ARCore
     Deep learning for AR/SLAM: Dense 3D reconstruction on SLAM • SLAM w/o Camera • Sensors on AR glasses
     AR/SLAM for Web: WebAR/SLAM • 5G + MEC
     Web for AR/SLAM: Web as Stock footage • Beyond the field of view • Learning from IoT data
  6. 6. 7 Augmented Reality (AR) and Simultaneous Localization and Mapping (SLAM) have been commoditized. • Prototypes in Rakuten • AR furniture app • Impact of ARKit/ARCore • Scale estimation solved by IMU fusion • Research on the shoulders of giants
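The "scale estimation solved by IMU fusion" point can be illustrated with a minimal Python sketch. Monocular SLAM recovers camera motion only up to an unknown scale factor, while integrated IMU measurements give displacements in metres. Assuming both are available as matched displacement vectors in the same orientation frame, a least-squares scale can be fitted in closed form. The function name `estimate_scale` and the input layout are hypothetical, and this is not a description of how ARKit or ARCore work internally.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def estimate_scale(visual: Sequence[Vec3], metric: Sequence[Vec3]) -> float:
    """Least-squares scale s minimising sum ||s * v_i - m_i||^2.

    visual: displacements from monocular SLAM (arbitrary units).
    metric: matching displacements from IMU integration (metres).
    Both inputs are assumed to share the same orientation frame.
    """
    if len(visual) != len(metric):
        raise ValueError("visual and metric must have the same length")
    # Closed form: s = sum(v_i . m_i) / sum(v_i . v_i)
    num = sum(v[i] * m[i] for v, m in zip(visual, metric) for i in range(3))
    den = sum(v[i] * v[i] for v in visual for i in range(3))
    if den == 0:
        raise ValueError("visual displacements are all zero; scale is unobservable")
    return num / den
```

Real visual-inertial systems estimate scale jointly with IMU biases and gravity direction; the sketch isolates only the scale term.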
  7. 7. 9
  8. 8. 10 Need to be tracked in 3D!
  9. 9. 11 Need to be tracked in 3D! Almost solved in ARKit/ARCore…
  10. 10. 12 Merchants’ pages 3D models SLAM w/ scale estimation Advanced visualization w/ inpainting & relighting AR app for everyone E. Zhang, M. F. Cohen, and B. Curless. "Emptying, Refurnishing, and Relighting Indoor Spaces”, SIGGRAPH Asia, 2016.
  11. 11. 13 ARKit / ARCore Merchants’ pages 3D models SLAM w/ scale estimation Advanced visualization w/ inpainting & relighting AR app for everyone E. Zhang, M. F. Cohen, and B. Curless. "Emptying, Refurnishing, and Relighting Indoor Spaces”, SIGGRAPH Asia, 2016.
  12. 12. 14 • Dense 3D reconstruction on SLAM • Depth prediction by CNN • SLAM + Depth prediction • SLAM w/o Camera • Sensors on AR glasses • Google Glass’s return • Revival of UWB
  13. 13. 15 • Direct method based on photo consistency • Multi-baseline stereo using GPU • Getting easier to run on the latest mobile devices, but still undesirable from the end user's point of view because of energy consumption, etc. R. A. Newcombe, S. J. Lovegrove and A. J. Davison, "DTAM: Dense tracking and mapping in real-time," ICCV, 2011.
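To make "photo consistency" and "multi-baseline stereo" concrete, here is a simplified Python sketch on rectified 1D scanlines. For one reference pixel it scores each inverse-depth hypothesis by summing absolute intensity differences across several views and keeps the best one. This is not DTAM itself, which works on full images with regularisation on a GPU. The function names, the sign convention (a point at `x` in the reference appears at `x - disparity` in the other view), and the out-of-view penalty are all assumptions made for illustration.

```python
from typing import Sequence


def photometric_cost(ref: Sequence[float], others: Sequence[Sequence[float]],
                     baselines: Sequence[float], focal: float,
                     x: int, inv_depth: float,
                     out_of_view_penalty: float = 255.0) -> float:
    """Sum of absolute intensity differences for one pixel and one hypothesis."""
    cost = 0.0
    for row, baseline in zip(others, baselines):
        # For rectified views, disparity = focal * baseline * inverse depth.
        disparity = focal * baseline * inv_depth
        xs = int(round(x - disparity))
        if 0 <= xs < len(row):
            cost += abs(ref[x] - row[xs])
        else:
            cost += out_of_view_penalty  # hypothesis projects outside this view
    return cost


def best_inverse_depth(ref: Sequence[float], others: Sequence[Sequence[float]],
                       baselines: Sequence[float], focal: float,
                       x: int, candidates: Sequence[float]) -> float:
    """Return the candidate inverse depth with the lowest photometric cost."""
    return min(candidates,
               key=lambda d: photometric_cost(ref, others, baselines, focal, x, d))
```

Summing the cost over several baselines is what makes the minimum less ambiguous than with a single stereo pair. The exhaustive search per pixel also hints at why the approach is expensive on mobile hardware.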
  14. 14. 16 D. Eigen, C. Puhrsch, and R. Fergus. “Depth map prediction from a single image using a multi-scale deep network.” NIPS, 2014. M. Kaneko, K. Sakurada and K. Aizawa. “MeshDepth: Disconnected Mesh-based Deep Depth Prediction.” ArXiv, 2019. Global Coarse-Scale Network + Local Fine-Scale Network Disconnected mesh representation
  15. 15. 17 Semi-dense SLAM + Prediction Compact and optimizable representation of dense geometry K. Tateno, F. Tombari, I. Laina and N. Navab, "CNN-SLAM: Real-Time Dense Monocular SLAM with Learned Depth Prediction," CVPR, 2017. M. Bloesch, J. Czarnowski, R. Clark, S. Leutenegger and A. J. Davison. “CodeSLAM - Learning a Compact, Optimisable Representation for Dense Visual SLAM.” CVPR, 2018.
  16. 16. 18 Threads: Image capturing & Visualization • 2D tracking • 3D Mapping • Depth prediction • Depth fusion (split between CLIENT-SIDE and SERVER-SIDE) Components: Monocular visual SLAM • Depth prediction by CNN • 3D reconstruction • Depth fusion by surface mesh deformation Timeline: key-frames at t, t+1, t+2, t+3, t+4 • ARAP deformation
  17. 17. 19 Figure 4. (Top) Distribution of weights w_i for the deformation and (bottom) the corresponding textured mesh. Larger intensity values in the top figure indicate higher weights. 4. Experiments frames detected by ORB-SLAM because these are selected based on visual changes. We filter out those key-frames using a spatio-temporal distance criterion similar to the other feature-based approaches, e.g., PTAM, and send them to the server. The key-frames are processed on the server and the depth image for each frame is estimated by the CNN architecture. In the fusion process, we convert the depth images to a refined mesh sequence as shown at the bottom of Figure 5. We also make the ground truth mesh sequence corresponding to the refined one from the raw depth maps captured by the depth sensor. We compute residual errors between the refined mesh and the ground truth as shown in Table 2 and Figure 6. We can observe that our framework efficiently reduces the residual errors for all sequences. Both the average and the median of the residual errors fall within the range from about two thirds to a half. We also evaluate the absolute scale estimated from depth prediction as shown in the rightmost column in Table 2. The average error of the estimated scales for our six office scenes is 20% of the ground truth scale. 5. Conclusion In this paper, we proposed a framework fusing the re-
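The key-frame filtering step mentioned in the excerpt can be sketched in Python as follows. The excerpt only says the criterion is spatio-temporal and similar to PTAM, so the exact rule here is an assumption: a key-frame is forwarded to the server only if enough time has passed and the camera has moved far enough since the last forwarded key-frame. The `KeyFrame` type, the function name, and both threshold values are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class KeyFrame:
    timestamp: float                       # seconds
    position: Tuple[float, float, float]   # camera position from SLAM


def filter_keyframes(keyframes: Sequence[KeyFrame],
                     min_interval_s: float = 0.5,
                     min_distance: float = 0.1) -> List[KeyFrame]:
    """Keep a key-frame only if it is far enough in time AND space
    from the last kept one (assumed criterion, thresholds hypothetical)."""
    kept: List[KeyFrame] = []
    for kf in keyframes:
        if not kept:
            kept.append(kf)  # always keep the first key-frame
            continue
        last = kept[-1]
        far_in_time = kf.timestamp - last.timestamp >= min_interval_s
        far_in_space = math.dist(kf.position, last.position) >= min_distance
        if far_in_time and far_in_space:
            kept.append(kf)
    return kept
```

Thinning key-frames this way limits how many images are uploaded and passed through the depth-prediction CNN on the server.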
  18. 18. 20 Sofa area 1 • Sofa area 2 • Sofa area 3 • Desk area 1 • Desk area 2 • Meeting room Figure 5. Input data for our depth fusion and the reconstructed scenes. From top to bottom row: color images, feature tracking result of SLAM, corresponding ground truth depth images, depth images estimated by DNN, and 3D reconstruction results on six office scenes, respectively. Table columns: Scene | Mesh from CNN depth map (Mean, Median, Std dev) | Refined mesh by our method (Mean, Median, Std dev) | Scale
  19. 19. 21
  20. 20. 22 • DeepFactors: Real-Time Probabilistic Dense Monocular SLAM. Jan Czarnowski, Tristan Laidlow, Ronald Clark, Andrew J. Davison. IEEE Robotics and Automation Letters (RA-L), 2020
  21. 21. 23 RoNIN: Robust Neural Inertial Navigation in the Wild: Benchmark, Evaluations, and New Methods Hang Yan, Sachini Herath, Yasutaka Furukawa • Now SLAM is possible only w/ IMU
  22. 22. 24 • The 1st Google Glass raised privacy concerns (cf. dashcams / cameras on connected cars) • Google Glass returned as an enterprise edition • Apple's U1 chip uses Ultra Wideband (UWB) technology • UWB devices can detect locations within 10 cm • Wide indoor area localization • Apple glasses w/o camera, but w/ U1? • Potential application: SLAM w/o camera
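One common way range measurements such as UWB become a position is trilateration, sketched below in Python for the 2D case with three fixed anchors. The slide does not say which method any particular UWB product uses, so this is only an illustration of the idea; the function name and the anchor layout are hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def trilaterate_2d(anchors: Sequence[Point], ranges: Sequence[float]) -> Point:
    """Locate a tag from its distances to three known anchors.

    Subtracting the first circle equation from the other two gives a
    2x2 linear system in (x, y), solved here with Cramer's rule.
    """
    (x1, y1), (x2, y2), (x3, y3) = anchors
    r1, r2, r3 = ranges
    a11, a12 = 2.0 * (x2 - x1), 2.0 * (y2 - y1)
    a21, a22 = 2.0 * (x3 - x1), 2.0 * (y3 - y1)
    b1 = r1**2 - r2**2 + x2**2 - x1**2 + y2**2 - y1**2
    b2 = r1**2 - r3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("anchors are collinear; position is ambiguous")
    x = (b1 * a22 - a12 * b2) / det
    y = (a11 * b2 - a21 * b1) / det
    return x, y
```

With noisy ranges and more than three anchors, a least-squares fit would replace the exact solve. Chaining such position fixes over time is what would make camera-free localization plausible.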
  23. 23. 25 • WebAR/SLAM • Marker-based WebAR for events • WebAR/SLAM w/ IMU fusion • 5G + MEC • The future of 5G-enabled Augmented Reality
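  24. 24. 26 Overview Pros: • AR without installing an app • Content can be created with HTML (+JavaScript) alone Cons: • Currently requires a dedicated marker • Limited supported environments (Safari on iOS 11 or later, Chrome on Android 5 or later) Implementation • AR.js: marker pose estimation • A-Frame: content creation Future work • Arbitrary image markers • Markerless AR (cf. ARKit, ARCore) • Integration of Geolocation and AR maps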
  25. 25. 27 Pros: • No need to install a native app • Easy to create with HTML (+JavaScript) alone Cons: • Marker-based • Needs a recent environment (Safari on iOS 11 or later, Chrome on Android 5 or later) Implementation • AR.js + A-Frame
  26. 26. 28 • AR photo booth: 240 groups • AR lottery: 510 people
  27. 27. 29 Trial in Mother’s day & Father’s day Trial @Tokyo Dome R-mobile campaign
  28. 28. 30 8th Wall © 2019 8th Wall built their own highly-optimized SLAM engine, and then re-architected it for the mobile web. AUGMENTED REALITY FOR THE WEB Javascript WebGL WebAssembly Six-Degrees-of-Freedom (6DoF) Tracking Point-Cloud Lighting Surface Estimation Image Detection
  29. 29. 31 Lightweight Web AR SOTA SLAM Schneider, Thomas et al. “Maplab: An Open Framework for Research in Visual-Inertial Mapping and Localization.” IEEE Robotics and Automation Letters, 2018.
  30. 30. 32 Offline loop closure and optimization Online recording Start End Office Space Mapping and Optimization Back end visualization of the location map
  31. 31. Objects Object detection & recognition Input image Surface orientation Partial view alignment 3D pose estimation Plane fitting 3D scene initialization Room Geometry Objects in 3D scene walls initialized with unknown scale
  32. 32. Output: 3D model reconstruction Potential 3D applications on server in future • Total3DUnderstanding: Joint Layout, Object Pose and Mesh Reconstruction for Indoor Scenes From a Single Image, Yinyu Nie, Xiaoguang Han, Shihui Guo, Yujian Zheng, Jian Chang, Jian Jun Zhang, CVPR, 2020
  33. 33. 35 The future of 5G-enabled Augmented Reality • Powered by Mobile Edge Computing (MEC) • Big data processing w/ ultra-low latency • Example by Scape + Samsung Scape Technologies © 2019
  34. 34. 36 • Web as Stock footage • “Understanding Media”: "Hot" and "cool" media • Stock footage for narrative • Google street view time-lapse • Beyond the field of view • Photo uncrop • Neural rendering in the wild • Learning from web • Learning human depth • Learning from IoT data
  35. 35. 37 Understanding Media: The Extensions of Man by Marshall McLuhan (1964) • "Hot" and "cool" media • Hot: "high definition” like film • Cool: require more active participation on the part of the user like TV • Content of every medium is always another (previous) medium. The birth of virtual reality as an art form by Chris Milk (TEDTalks, 2016) • Is VR the last medium? • What about AR?
  36. 36. 38 Edwin S. Porter: Life of an American Fireman (1903) Sameer Agarwal, et al.: Building a Rome in a Day (ICCV 2009)
  37. 37. 39 • Photo uncrop, Qi Shan, Brian Curless, Yasutaka Furukawa, Carlos Hernandez, and Steven M. Seitz, ECCV ‘14.
  38. 38. 40 M. Meshry, D. B. Goldman, S. Khamis, H. Hoppe, R. Pandey, N. Snavely and R. Martin-Brualla. “Neural Rendering in the Wild.” CVPR, 2019. Total Scene Capture • Encode the 3D structure of the scene, enabling rendering from an arbitrary viewpoint. • Capture all possible appearances of the scene and allow rendering the scene under any of them. • Understand the location and appearance of transient objects in the scene and allow for reproducing or omitting them.
  39. 39. 41 Z. Li, T. Dekel, F. Cole, R. Tucker, N. Snavely, C. Liu and W. T. Freeman. “Learning the Depths of Moving People by Watching Frozen People.” CVPR, 2019.
  40. 40. 42 Kinect returns as Azure Kinect • Higher resolution, more accurate depth • Multimodal sensing • Integrated to Azure Cognitive Services, Azure IoT
      Specification | Azure Kinect DK | Kinect for Windows v2
      Audio, details | 7-mic circular array | 4-mic linear phased array
      Motion sensor, details | 3-axis accelerometer, 3-axis gyro | 3-axis accelerometer
      RGB camera, details | 3840 x 2160 px @30 fps | 1920 x 1080 px @30 fps
      Depth camera, method | Time-of-Flight | Time-of-Flight
      Depth camera, resolution | 640 x 576 px @30 fps; 512 x 512 px @30 fps; 1024 x 1024 px @15 fps | 512 x 424 px @30 fps
      Connectivity, data | USB 3.1 Gen 1 with type USB-C | USB 3.1 Gen 1
      Connectivity, power | External PSU or USB-C | External PSU
      Connectivity, synchronization | RGB & Depth internal, external device-to-device | RGB & Depth internal only
      Mechanical, dimensions | 103 x 39 x 126 mm | 249 x 66 x 67 mm
      Mechanical, mass | 440 g | 970 g
      Mechanical, mounting | One ¼-20 UNC. Four internal screw points | One ¼-20 UNC
      Microsoft Azure IoT reference architecture
  41. 41. 43 • AR/SLAM technologies are commoditized, but still nice-to-have, not must-have. • Deep learning is pushing the boundaries of AR/SLAM • Web-based AR/SLAM has huge potential in the 5G era • AR/SLAM can be improved by learning from the Web, including IoT data Designed by macrovector / Freepik