Invited talk on AR/SLAM and IoT in ILAS Seminar: Introduction to IoT and
Security, Kyoto University, 2020.
(https://www.z.k.kyoto-u.ac.jp/freshman-guide/ilas-seminars/)
◆Speaker: Tomoyuki Mukasa
2.
Tomoyuki MUKASA, Ph.D., 3D Vision Researcher
[Career timeline figure: Ph.D. Student → Engineer → Researcher, with milestones in 2012 and 2015]
• 3D Reconstruction & Motion Analysis
• VR for Exhibition
• AR for Tourism
• AR/VR/HCI for e-commerce
6.
Outline:
• Commoditization of AR/SLAM
  • Prototypes in Rakuten
  • Impact of ARKit/ARCore
• Deep learning for AR/SLAM
  • Dense 3D reconstruction on SLAM
  • SLAM w/o Camera
  • Sensors on AR glasses
• AR/SLAM for Web
  • WebAR/SLAM
  • 5G + MEC
• Web for AR/SLAM
  • Web as stock footage
  • Beyond the field of view
  • Learning from IoT data
7.
Augmented Reality (AR) and Simultaneous Localization and Mapping (SLAM) have been commoditized.
• Prototypes in Rakuten
  • AR furniture app
• Impact of ARKit/ARCore
  • Scale estimation solved by IMU fusion (a toy version is sketched below)
  • Research on the shoulders of giants
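To make the IMU-fusion point concrete: monocular SLAM recovers positions only up to an unknown scale, while double-integrating the IMU gives metric positions, so a single least-squares factor recovers the metric scale. The sketch below is a hedged toy version with synthetic data, not Rakuten's or ARKit's actual implementation.

```python
import numpy as np

def estimate_metric_scale(p_slam, p_imu):
    """Least-squares scale aligning up-to-scale SLAM positions with
    metric positions obtained by integrating the IMU.
    p_slam, p_imu: (N, 3) arrays of corresponding positions.
    Solves min_s ||s * p_slam - p_imu||^2, i.e.
    s = <p_slam, p_imu> / <p_slam, p_slam>."""
    return np.sum(p_slam * p_imu) / np.sum(p_slam * p_slam)

# Toy data: a metric trajectory and its arbitrarily scaled SLAM estimate.
t = np.linspace(0, 1, 50)
p_metric = np.stack([t, np.sin(t), np.zeros_like(t)], axis=1)   # meters
p_slam = 0.37 * p_metric + np.random.normal(0, 1e-3, p_metric.shape)

s = estimate_metric_scale(p_slam, p_metric)
print(f"estimated scale: {s:.3f} (true: {1 / 0.37:.3f})")
```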
11.
Need to be tracked in 3D!
Almost solved in ARKit/ARCore…
12.
Concept: merchants' pages supply 3D models; SLAM w/ scale estimation plus advanced visualization w/ inpainting & relighting turn them into an AR app for everyone.
E. Zhang, M. F. Cohen, and B. Curless. "Emptying, Refurnishing, and Relighting Indoor Spaces." SIGGRAPH Asia, 2016.
13.
Same concept as above, with ARKit / ARCore now providing the SLAM w/ scale estimation building block.
14.
• Dense 3D reconstruction on SLAM
  • Depth prediction by CNN
  • SLAM + depth prediction
• SLAM w/o Camera
  • Sensors on AR glasses
  • Google Glass's return
  • Revival of UWB
15.
• Direct method based on photo consistency
• Multi-baseline stereo using the GPU
• Getting easier to run on the latest mobile devices, but still undesirable from the end-user point of view because of energy consumption, etc.
R. A. Newcombe, S. J. Lovegrove and A. J. Davison. "DTAM: Dense Tracking and Mapping in Real-Time." ICCV, 2011.
16.
• Global coarse-scale network + local fine-scale network
D. Eigen, C. Puhrsch, and R. Fergus. "Depth Map Prediction from a Single Image Using a Multi-Scale Deep Network." NIPS, 2014.
• Disconnected mesh representation
M. Kaneko, K. Sakurada and K. Aizawa. "MeshDepth: Disconnected Mesh-based Deep Depth Prediction." arXiv, 2019.
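A minimal PyTorch stand-in for the global-coarse + local-fine idea: a heavily downsampled branch supplies global context, and a full-resolution branch refines it given the image plus the upsampled coarse depth. This is an illustrative sketch, not the architecture of either paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseFineDepth(nn.Module):
    """Toy two-scale depth predictor in the spirit of Eigen et al."""
    def __init__(self):
        super().__init__()
        # Global coarse scale: aggressive downsampling for global context.
        self.coarse = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=4, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=4, padding=2), nn.ReLU(),
            nn.Conv2d(64, 1, 3, padding=1),
        )
        # Local fine scale: full resolution, conditioned on the coarse map.
        self.fine = nn.Sequential(
            nn.Conv2d(3 + 1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, img):
        coarse = self.coarse(img)
        coarse_up = F.interpolate(coarse, size=img.shape[-2:],
                                  mode="bilinear", align_corners=False)
        return self.fine(torch.cat([img, coarse_up], dim=1)) + coarse_up

depth = CoarseFineDepth()(torch.randn(1, 3, 128, 160))
print(depth.shape)  # torch.Size([1, 1, 128, 160])
```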
17.
• Semi-dense SLAM + depth prediction
K. Tateno, F. Tombari, I. Laina and N. Navab. "CNN-SLAM: Real-Time Dense Monocular SLAM with Learned Depth Prediction." CVPR, 2017.
• Compact and optimizable representation of dense geometry
M. Bloesch, J. Czarnowski, R. Clark, S. Leutenegger and A. J. Davison. "CodeSLAM: Learning a Compact, Optimisable Representation for Dense Visual SLAM." CVPR, 2018.
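One common way to combine the two signals, and a simplification of what CNN-SLAM does: rescale the dense CNN depth map so it agrees with the sparse depths SLAM triangulates at feature points. Names and data below are hypothetical.

```python
import numpy as np

def rescale_cnn_depth(depth_cnn, uv_slam, z_slam):
    """Correct the global scale of a CNN depth map using sparse SLAM depths.
    depth_cnn: (H, W) dense depth from the network (scale may be off).
    uv_slam:   (N, 2) integer pixel coordinates of triangulated features.
    z_slam:    (N,) depths of those features from SLAM.
    The median ratio makes the correction robust to outliers."""
    z_cnn = depth_cnn[uv_slam[:, 1], uv_slam[:, 0]]
    return depth_cnn * np.median(z_slam / z_cnn)

# Toy check: the CNN depth is the "true" map times a wrong global factor.
rng = np.random.default_rng(1)
true_depth = 2.0 + rng.random((48, 64))
depth_cnn = 0.6 * true_depth
uv = np.stack([rng.integers(0, 64, 30), rng.integers(0, 48, 30)], axis=1)
z = true_depth[uv[:, 1], uv[:, 0]]

fixed = rescale_cnn_depth(depth_cnn, uv, z)
print(np.max(np.abs(fixed - true_depth)))  # ~0
```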
18.
System overview (monocular visual SLAM on the client, 3D reconstruction on the server):
CLIENT-SIDE
• Image capturing & visualization thread
• 2D tracking thread
• 3D mapping thread
SERVER-SIDE
• Depth prediction thread (depth prediction by CNN)
• Depth fusion thread (depth fusion by surface mesh deformation, ARAP)
Key-frames selected over time (t, t+1, t+2, t+3, t+4, ...) are sent from the client to the server.
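As a structural sketch, the thread layout above maps onto a queue-connected pipeline. Every function body below is a hypothetical stub; only the data flow mirrors the slide.

```python
import queue
import threading
import time

key_frames = queue.Queue()   # client -> server
depth_maps = queue.Queue()   # depth prediction -> depth fusion

def is_key_frame(frame):
    return frame % 5 == 0          # stub: spatio-temporal criterion

def predict_depth_cnn(frame):
    return f"depth({frame})"       # stub: CNN forward pass

def fuse_depth_arap(frame, depth):
    print("fused", frame, depth)   # stub: ARAP surface-mesh deformation

def tracking_thread():
    """CLIENT: 2D tracking; forwards selected key-frames to the server."""
    for frame in range(20):        # stand-in for the capture thread
        if is_key_frame(frame):
            key_frames.put(frame)
        time.sleep(0.01)

def depth_prediction_thread():
    """SERVER: per-key-frame CNN depth prediction."""
    while True:
        kf = key_frames.get()
        depth_maps.put((kf, predict_depth_cnn(kf)))

def depth_fusion_thread():
    """SERVER: fuse predicted depths into one surface mesh."""
    while True:
        fuse_depth_arap(*depth_maps.get())

for fn in (depth_prediction_thread, depth_fusion_thread):
    threading.Thread(target=fn, daemon=True).start()
tracking_thread()
time.sleep(0.5)  # let the daemon threads drain the queues
```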
19.
Figure 4. (Top) Distribution of weights w_i for the deformation and (bottom) the corresponding textured mesh. Larger intensity values in the top figure indicate higher weights.
4. Experiments
…frames detected by ORB-SLAM, because these are selected based on visual changes. We filter out those key-frames using a spatio-temporal distance criterion similar to other feature-based approaches, e.g., PTAM, and send them to the server.
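A minimal version of such a spatio-temporal filter, in the PTAM style the text refers to; the thresholds are made up. A key-frame is kept only when it is far enough from the previously kept one in both space and time.

```python
import numpy as np

def filter_key_frames(key_frames, min_dist=0.10, min_frames=10):
    """Drop redundant key-frames before sending them to the server.
    key_frames: list of (frame_index, position), position an (3,) array.
    Keep a key-frame only if it is at least `min_dist` (in SLAM units)
    and `min_frames` away from the previously kept one."""
    kept = [key_frames[0]]
    for idx, pos in key_frames[1:]:
        last_idx, last_pos = kept[-1]
        if (idx - last_idx >= min_frames
                and np.linalg.norm(pos - last_pos) >= min_dist):
            kept.append((idx, pos))
    return kept

# ORB-SLAM often emits clustered key-frames; simulate a slow forward walk.
frames = [(i, np.array([0.02 * i, 0.0, 0.0])) for i in range(0, 100, 3)]
print(len(frames), "->", len(filter_key_frames(frames)))  # 34 -> 9
```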
The key-frames are processed on the server, and the depth image for each frame is estimated by the CNN architecture. In the fusion process, we convert the depth images to a refined mesh sequence, as shown at the bottom of Figure 5. We also build a corresponding ground-truth mesh sequence from the raw depth maps captured by the depth sensor. We compute residual errors between the refined mesh and the ground truth, as shown in Table 2 and Figure 6. We can observe that our framework efficiently reduces the residual errors for all sequences. Both the average and the median of the residual errors are reduced to between about one half and two thirds of their original values.
We also evaluate the absolute scale estimated from the depth prediction, as shown in the rightmost column of Table 2. The average error of the estimated scales over our six office scenes is 20% of the ground-truth scale.
5. Conclusion
In this paper, we proposed a framework fusing the re…
20.
Figure 5. Input data for our depth fusion and the reconstructed scenes (panels: Sofa area 1–3, Desk area 1–2, Meeting room). From top to bottom row: color images, feature tracking result of SLAM, corresponding ground truth depth images, depth images estimated by DNN, and 3D reconstruction results on six office scenes, respectively.
[Table 2 header: Scene | Mesh from CNN depth map: Mean, Median, Std dev | Refined mesh by our method: Mean, Median, Std dev, Scale]
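The Mean / Median / Std dev entries of such a table are statistics of per-vertex residuals. Below is a simplified sketch using nearest-vertex distances; a real mesh-to-mesh residual would typically use point-to-surface distances instead.

```python
import numpy as np

def residual_stats(verts_est, verts_gt):
    """Mean / median / std dev of nearest-neighbour vertex distances
    between an estimated and a ground-truth mesh (brute force)."""
    d = np.linalg.norm(verts_est[:, None, :] - verts_gt[None, :, :], axis=2)
    nearest = d.min(axis=1)
    return nearest.mean(), np.median(nearest), nearest.std()

rng = np.random.default_rng(2)
gt = rng.random((200, 3))
est = gt + rng.normal(0, 0.01, gt.shape)   # refined mesh: small residuals
print("mean %.4f  median %.4f  std %.4f" % residual_stats(est, gt))
```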
22.
• J. Czarnowski, T. Laidlow, R. Clark and A. J. Davison. "DeepFactors: Real-Time Probabilistic Dense Monocular SLAM." IEEE Robotics and Automation Letters (RA-L), 2020.
23.
H. Yan, S. Herath and Y. Furukawa. "RoNIN: Robust Neural Inertial Navigation in the Wild: Benchmark, Evaluations, and New Methods."
• Now SLAM is possible with an IMU alone (the integration step is sketched below)
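RoNIN regresses planar velocities from gyroscope and accelerometer windows with a neural network and integrates them into a trajectory. The integration half is sketched below; the network is replaced by a placeholder.

```python
import numpy as np

def integrate_velocities(v, dt):
    """Positions from per-step planar velocities.
    v: (N, 2) velocities in m/s (in RoNIN these come from a network fed
    with IMU windows); dt: time step in seconds."""
    return np.concatenate([np.zeros((1, 2)), np.cumsum(v * dt, axis=0)])

# Placeholder for the network output: walking a quarter circle at ~1.2 m/s.
t = np.linspace(0, np.pi / 2, 200)
v_pred = 1.2 * np.stack([np.cos(t), np.sin(t)], axis=1)

traj = integrate_velocities(v_pred, dt=0.05)
print("end point:", traj[-1])   # camera-free position estimate
```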
24.
• The first Google Glass raised privacy concerns (cf. dashcams / cameras on connected cars)
• Google Glass returned as an enterprise edition
• The U1 chip uses Ultra-Wideband (UWB) technology
  • UWB devices can localize each other to within 10 cm
  • Wide-area indoor localization
• Apple glasses w/o camera, but w/ U1?
  • Potential application: SLAM w/o camera (see the trilateration sketch below)
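UWB positioning reduces to multilateration from time-of-flight ranges to fixed anchors; a toy least-squares version follows (the anchor layout and noise level are made up).

```python
import numpy as np

def multilaterate(anchors, ranges):
    """Least-squares 2-D position from UWB ranges to fixed anchors.
    Linearized by subtracting the first anchor's range equation:
      |x - a_i|^2 - |x - a_0|^2 = r_i^2 - r_0^2   ->   A x = b."""
    a0, r0 = anchors[0], ranges[0]
    A = 2 * (anchors[1:] - a0)
    b = (r0**2 - ranges[1:]**2
         + np.sum(anchors[1:]**2, axis=1) - np.sum(a0**2))
    return np.linalg.lstsq(A, b, rcond=None)[0]

anchors = np.array([[0.0, 0.0], [8.0, 0.0], [8.0, 6.0], [0.0, 6.0]])
true_pos = np.array([3.0, 2.0])
ranges = np.linalg.norm(anchors - true_pos, axis=1)
ranges += np.random.normal(0, 0.05, ranges.shape)   # ~5 cm ranging noise

print("estimated position:", multilaterate(anchors, ranges))  # ~[3, 2]
```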
25.
• WebAR/SLAM
  • Marker-based WebAR for events
  • WebAR/SLAM w/ IMU fusion
• 5G + MEC
  • The future of 5G-enabled Augmented Reality
27.
Pros:
• No need to install a native app
• Easy to create with just HTML (+ JavaScript)
Cons:
• Marker-based
• Requires a recent environment (iOS 11+ Safari, Android 5+ Chrome)
Implementation:
• AR.js + A-Frame
28.
• AR photo booth: 240 groups
• AR lottery: 510 people
33.
3D scene initialization pipeline:
• Input image
• Objects: object detection & recognition → partial view alignment → 3D pose estimation → objects in 3D scene
• Room geometry: surface orientation → plane fitting → walls initialized with unknown scale
Both branches feed the 3D scene initialization; a structural sketch follows below.
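Read as code, the diagram above is two branches meeting in a single scene initializer. In the sketch below every stub is hypothetical; only the structure reflects the slide.

```python
# Hypothetical stubs standing in for real detectors / estimators.
def detect_and_recognize(image):         return ["sofa", "table"]
def align_partial_view(obj):             return obj
def estimate_3d_pose(obj):               return {"object": obj, "pose": "R|t"}
def estimate_surface_orientation(image): return "normal map"
def fit_planes(normals):                 return ["4 walls (unknown scale)"]
def place_in_scene(poses, walls):        return {"objects": poses, "room": walls}

def initialize_3d_scene(image):
    """Structural sketch of the slide-33 pipeline."""
    # Object branch: detection & recognition -> partial view alignment -> 3D pose.
    poses = [estimate_3d_pose(align_partial_view(o))
             for o in detect_and_recognize(image)]
    # Room-geometry branch: surface orientation -> plane fitting -> walls,
    # initialized with unknown scale.
    walls = fit_planes(estimate_surface_orientation(image))
    return place_in_scene(poses, walls)

print(initialize_3d_scene("input image"))
```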
34.
Output: 3D model reconstruction
Potential 3D applications on the server in the future:
• Y. Nie, X. Han, S. Guo, Y. Zheng, J. Chang and J. J. Zhang. "Total3DUnderstanding: Joint Layout, Object Pose and Mesh Reconstruction for Indoor Scenes From a Single Image." CVPR, 2020.
36.
• Web as stock footage
  • "Understanding Media": "hot" and "cool" media
  • Stock footage for narrative
  • Google Street View time-lapse
• Beyond the field of view
  • Photo Uncrop
  • Neural rendering in the wild
• Learning from the web
  • Learning human depth
  • Learning from IoT data
37.
Understanding Media: The Extensions of Man by Marshall McLuhan (1964)
• "Hot" and "cool" media
  • Hot: "high definition," like film
  • Cool: requires more active participation from the user, like TV
• The content of every medium is always another (previous) medium.
The Birth of Virtual Reality as an Art Form by Chris Milk (TED Talk, 2016)
• Is VR the last medium?
• What about AR?
38.
Edwin S. Porter: Life of an American Fireman (1903)
Sameer Agarwal, et al.: Building Rome in a Day (ICCV 2009)
39.
• Q. Shan, B. Curless, Y. Furukawa, C. Hernandez and S. M. Seitz. "Photo Uncrop." ECCV, 2014.
40.
M. Meshry, D. B. Goldman, S. Khamis, H. Hoppe, R. Pandey, N. Snavely and R. Martin-Brualla. "Neural Rerendering in the Wild." CVPR, 2019.
Total Scene Capture:
• Encode the 3D structure of the scene, enabling rendering from an arbitrary viewpoint.
• Capture all possible appearances of the scene and allow rendering the scene under any of them.
• Understand the location and appearance of transient objects in the scene and allow for reproducing or omitting them.
41.
Z. Li, T. Dekel, F. Cole, R. Tucker, N. Snavely, C. Liu and W. T. Freeman. "Learning the Depths of Moving People by Watching Frozen People." CVPR, 2019.
42.
Kinect returns as Azure Kinect
• Higher resolution, more accurate depth
• Multimodal sensing
• Integrated with Azure Cognitive Services and Azure IoT
                        Azure Kinect DK                     Kinect for Windows v2
Audio                   7-mic circular array                4-mic linear phased array
Motion sensor           3-axis accelerometer,               3-axis accelerometer
                        3-axis gyro
RGB camera              3840 x 2160 px @ 30 fps             1920 x 1080 px @ 30 fps
Depth camera (method)   Time-of-Flight                      Time-of-Flight
Depth resolution        640 x 576 px @ 30 fps,              512 x 424 px @ 30 fps
                        512 x 512 px @ 30 fps,
                        1024 x 1024 px @ 15 fps
Connectivity (data)     USB 3.1 Gen 1 with USB Type-C       USB 3.1 Gen 1
Power                   External PSU or USB-C               External PSU
Synchronization         RGB & depth internal,               RGB & depth internal only
                        external device-to-device
Dimensions              103 x 39 x 126 mm                   249 x 66 x 67 mm
Mass                    440 g                               970 g
Mounting                One 1/4-20 UNC,                     One 1/4-20 UNC
                        four internal screw points
Microsoft Azure IoT reference architecture
43.
• AR/SLAM technologies are commoditized, but still nice-to-have, not must-have.
• Deep learning is pushing the boundaries of AR/SLAM.
• Web-based AR/SLAM has huge potential in the 5G era.
• AR/SLAM can be improved by learning from the Web, including IoT data.
Designed by macrovector / Freepik