AR/SLAM and IoT

June 25, 2020
Tomoyuki Mukasa
Rakuten Institute of Technology
Rakuten, Inc.

2
2015Ph.D. Student Engineer Researcher2012
3D Reconstruction
& Motion Analysis
Tomoyuki MUKASA, Ph.D. 3D Vision Researcher
VR for
Exhibition
AR for Tourism
AR/VR/HCI for
e-commerce

4
Contributing to
existing businesses
Exploring
new ideas
Increasing
tech-brand awareness
Using Computer Vision & Human Computer Interaction

5
Woman
Red
Blouse
Category Attributes

6
Commoditization of AR/SLAM
• Prototypes in Rakuten
• Impact of ARKit/ARCore
Deep learning for AR/SLAM
• Dense 3D reconstruction on SLAM
• SLAM w/o Camera
• Sensors on AR glasses
AR/SLAM for Web
• WebAR/SLAM
• 5G + MEC
Web for AR/SLAM
• Web as Stock footage
• Beyond the field of view
• Learning from IoT data

7
Augmented Reality (AR) and Simultaneous Localization and Mapping (SLAM) have been commoditized.
• Prototypes in Rakuten
• AR furniture app
• Impact of ARKit/ARCore
• Scale estimation solved by IMU fusion
• Research on the shoulders of giants
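The bullet on scale estimation can be illustrated with a toy calculation. Monocular SLAM recovers camera motion only up to an unknown scale, while an IMU provides metric displacement over short windows. The sketch below fits one scale factor by least squares. It assumes the IMU displacements are already gravity- and bias-compensated and expressed in the same frame as the visual ones; the function name and the data are hypothetical, and this is not a description of how ARKit or ARCore work internally.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def estimate_metric_scale(visual_disp: Sequence[Vec3],
                          imu_disp: Sequence[Vec3]) -> float:
    """Least-squares scale s minimizing sum ||s * d_visual - d_imu||^2.

    Closed form: s = sum(d_visual . d_imu) / sum(d_visual . d_visual).
    """
    if len(visual_disp) != len(imu_disp) or not visual_disp:
        raise ValueError("need two non-empty sequences of equal length")
    num = 0.0
    den = 0.0
    for v, m in zip(visual_disp, imu_disp):
        num += sum(a * b for a, b in zip(v, m))
        den += sum(a * a for a in v)
    if den == 0.0:
        raise ValueError("visual displacements are all zero")
    return num / den


# Synthetic example: the IMU displacements are exactly 2.5x the visual ones.
visual = [(0.10, 0.0, 0.0), (0.0, 0.20, 0.0), (0.05, 0.05, 0.10)]
imu = [(0.25, 0.0, 0.0), (0.0, 0.50, 0.0), (0.125, 0.125, 0.25)]
scale = estimate_metric_scale(visual, imu)
print(scale)  # close to 2.5 for this synthetic data
```

Real visual-inertial systems estimate scale jointly with velocity, gravity direction, and sensor biases, so a single closed-form ratio like this is only a starting intuition.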

11
Need to be tracked in 3D!
Almost solved in ARKit/ARCore…

12
Merchants’ pages
3D models SLAM w/ scale estimation
Advanced visualization w/
inpainting & relighting
AR app for everyone
E. Zhang, M. F. Cohen, and B. Curless.
"Emptying, Refurnishing, and Relighting Indoor Spaces", SIGGRAPH Asia, 2016.

13
ARKit /
ARCore
Merchants’ pages
3D models SLAM w/ scale estimation
Advanced visualization w/
inpainting & relighting
AR app for everyone
E. Zhang, M. F. Cohen, and B. Curless.
"Emptying, Refurnishing, and Relighting Indoor Spaces", SIGGRAPH Asia, 2016.

14
• Dense 3D reconstruction on SLAM
• Depth prediction by CNN
• SLAM + Depth prediction
• SLAM w/o Camera
• Sensors on AR glasses
• Google Glass’s return
• Revival of UWB

15
• Direct method based on photo consistency
• Multi-baseline stereo using GPU
• Becoming easier to run on the latest mobile devices, but still unattractive
to end users because of energy consumption, etc.
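The photo-consistency idea in the bullets above can be sketched on a toy problem: for one reference pixel, try several inverse-depth hypotheses, look up where the pixel would land in each of several views, and keep the hypothesis with the lowest intensity difference. The sketch uses 1-D rectified scanlines and integer shifts. All names and the synthetic data are hypothetical, and it leaves out the GPU cost volume and regularization of a real dense method.

```python
from typing import Iterable, Sequence


def photometric_cost(ref: Sequence[float],
                     views: Sequence[Sequence[float]],
                     baselines: Sequence[float],
                     x: int,
                     inv_depth: float) -> float:
    """Mean absolute intensity difference of pixel x for one hypothesis.

    inv_depth is expressed as disparity in pixels per unit baseline
    (an assumption that folds the focal length into the hypothesis).
    """
    total, count = 0.0, 0
    for view, baseline in zip(views, baselines):
        u = x - round(baseline * inv_depth)  # where pixel x should appear
        if 0 <= u < len(view):
            total += abs(ref[x] - view[u])
            count += 1
    return total / count if count else float("inf")


def best_inverse_depth(ref: Sequence[float],
                       views: Sequence[Sequence[float]],
                       baselines: Sequence[float],
                       x: int,
                       hypotheses: Iterable[float]) -> float:
    """Pick the hypothesis with the lowest multi-baseline photometric cost."""
    return min(hypotheses,
               key=lambda d: photometric_cost(ref, views, baselines, x, d))


# Synthetic scanlines: a strictly increasing ramp, shifted by 2 px per unit baseline.
TRUE_INV_DEPTH = 2
ref = [float(i * i) for i in range(40)]
baselines = [1, 2, 3]
views = [[ref[min(u + b * TRUE_INV_DEPTH, len(ref) - 1)] for u in range(len(ref))]
         for b in baselines]

estimate = best_inverse_depth(ref, views, baselines, x=20, hypotheses=range(6))
print(estimate)  # prints 2
```

Summing the cost over several baselines is what makes the minimum distinctive; with a single view, repetitive texture can produce several equally good hypotheses.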
R. A. Newcombe, S. J. Lovegrove and A. J. Davison,
"DTAM: Dense tracking and mapping in real-time," ICCV, 2011

16
D. Eigen, C. Puhrsch, and R. Fergus.
“Depth map prediction from a single image using a multi-scale deep network.”
NIPS, 2014.
M. Kaneko, K. Sakurada and K. Aizawa.
“MeshDepth: Disconnected Mesh-based Deep Depth Prediction.”
ArXiv, 2019.
Global Coarse-Scale Network +
Local Fine-Scale Network
Disconnected mesh representation

17
Semi-dense SLAM + Prediction
Compact and optimizable representation of dense geometry
K. Tateno, F. Tombari, I. Laina and N. Navab, "CNN-SLAM: Real-Time Dense
Monocular SLAM with Learned Depth Prediction," CVPR, 2017.
M. Bloesch, J. Czarnowski, R. Clark, S. Leutenegger and A. J. Davison.
“CodeSLAM - Learning a Compact, Optimisable Representation for Dense Visual SLAM.”
CVPR, 2018.

18
Image capturing
& Visualization thread
2D tracking thread
3D Mapping thread
Depth prediction thread
Depth fusion thread
CLIENT-SIDE
SERVER-SIDE
Monocular visual SLAM
Depth prediction by CNN
3D reconstruction
Depth fusion
by surface mesh deformation
t t+1 t+2 t+3 t+4
Key-frame
ARAP deformation

19
Figure 4. (Top) Distribution of weights w_i for the deformation and
(bottom) the corresponding textured mesh. Larger intensity values
in the top figure indicate higher weights.
4. Experiments
frames detected by ORB-SLAM because these are selected
based on visual changes. We filter out those key-frames using
a spatio-temporal distance criterion similar to the other
feature-based approaches, e.g., PTAM, and send them to the
server.
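A spatio-temporal key-frame filter of the kind mentioned above might look like the following sketch. The excerpt does not give the exact criterion or thresholds, so the rule used here (keep a key-frame only if it is far enough in both time and space from the last kept one) and the values `min_dt` and `min_dist` are assumptions for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class KeyFrame:
    timestamp: float                       # seconds
    position: Tuple[float, float, float]   # camera center in SLAM units


def filter_keyframes(keyframes: Sequence[KeyFrame],
                     min_dt: float,
                     min_dist: float) -> List[KeyFrame]:
    """Keep a key-frame only if it is at least min_dt later AND at least
    min_dist away from the most recently kept key-frame (assumed rule)."""
    kept: List[KeyFrame] = []
    for kf in keyframes:
        if not kept:
            kept.append(kf)
            continue
        last = kept[-1]
        far_in_time = kf.timestamp - last.timestamp >= min_dt
        far_in_space = math.dist(kf.position, last.position) >= min_dist
        if far_in_time and far_in_space:
            kept.append(kf)
    return kept


# Hypothetical key-frames as they might arrive from the tracker.
candidates = [
    KeyFrame(0.0, (0.00, 0.0, 0.0)),
    KeyFrame(0.1, (0.01, 0.0, 0.0)),   # too soon
    KeyFrame(0.6, (0.02, 0.0, 0.0)),   # too close
    KeyFrame(1.2, (0.30, 0.0, 0.0)),   # kept
    KeyFrame(1.3, (0.90, 0.0, 0.0)),   # too soon after the previous kept one
]
to_server = filter_keyframes(candidates, min_dt=0.5, min_dist=0.1)
print([kf.timestamp for kf in to_server])  # prints [0.0, 1.2]
```

Thinning key-frames on the client reduces how many images have to be uploaded and run through the depth network on the server.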
The key-frames are processed on the server, and the depth
image for each frame is estimated by the CNN architecture.
In the fusion process, we convert the depth images to a refined
mesh sequence as shown at the bottom of Figure 5. We
also build a ground truth mesh sequence, corresponding to the
refined one, from the raw depth maps captured by the depth
sensor. We compute residual errors between the refined mesh
and the ground truth as shown in Table 2 and Figure 6.
We can observe that our framework efficiently reduces the
residual errors for all sequences. Both the average and the
median of the residual errors fall to roughly one half to two
thirds of their values before refinement.
We also evaluate the absolute scale estimated from depth
prediction, as shown in the rightmost column of Table 2.
The average error of the estimated scales for our six office
scenes is 20% of the ground truth scale.
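One simple way to obtain an absolute scale from depth prediction, and to score it as a fraction of the ground truth, is sketched below. The excerpt does not say how the scale was actually computed, so the median-ratio estimator is an assumption, and the depth values are synthetic rather than taken from Table 2.

```python
from statistics import median
from typing import Sequence


def estimate_scale(slam_depths: Sequence[float],
                   cnn_depths: Sequence[float]) -> float:
    """Median ratio of CNN-predicted (metric) depth to SLAM (unscaled) depth
    at the same feature points. The median is robust to a few bad predictions."""
    ratios = [c / s for s, c in zip(slam_depths, cnn_depths) if s > 0 and c > 0]
    if not ratios:
        raise ValueError("no valid depth pairs")
    return median(ratios)


def relative_scale_error(estimated: float, ground_truth: float) -> float:
    """Scale error expressed as a fraction of the ground truth scale."""
    return abs(estimated - ground_truth) / ground_truth


# Synthetic feature depths; the last CNN value is a deliberate outlier.
slam_depths = [1.0, 2.0, 4.0, 0.5, 3.0]
cnn_depths = [2.4, 4.8, 9.6, 1.2, 30.0]

scale = estimate_scale(slam_depths, cnn_depths)
error = relative_scale_error(scale, ground_truth=2.0)
print(scale, error)
```

With these made-up numbers the estimated scale is 2.4 against a true scale of 2.0, that is, a relative error of about 0.2.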
5. Conclusion
In this paper, we proposed a framework fusing the re-

20
Figure 5. Input data for our depth fusion and the reconstructed scenes. From top to bottom row: color images, feature tracking result
of SLAM, corresponding ground truth depth images, depth images estimated by DNN, and 3D reconstruction results on six office scenes,
respectively. Columns: Sofa area 1, Sofa area 2, Sofa area 3, Desk area 1, Desk area 2, Meeting room.
Table columns: Scene; Mesh from CNN depth map (Mean, Median, Std dev); Refined mesh by our method (Mean, Median, Std dev); Scale

22
• DeepFactors: Real-Time Probabilistic Dense Monocular SLAM. Jan Czarnowski, Tristan Laidlow,
Ronald Clark, Andrew J. Davison. IEEE Robotics and Automation Letters (RA-L), 2020

23
RoNIN: Robust Neural Inertial Navigation in the Wild: Benchmark, Evaluations, and New Methods
Hang Yan, Sachini Herath, Yasutaka Furukawa
• Now SLAM is possible only w/ IMU

24
• The first Google Glass raised privacy concerns (cf. dashcams / cameras on connected cars)
• Google Glass returned as the Enterprise Edition
• Apple's U1 chip uses Ultra Wideband (UWB) technology
• UWB devices can detect locations to within 10 cm
• Wide indoor area localization
• Apple glasses w/o camera, but w/ U1?
• Potential application: SLAM w/o camera
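To make the localization bullets concrete, here is a minimal sketch of 2-D trilateration from three range measurements, the kind of geometry a UWB setup could rely on. The anchor positions and ranges are hypothetical and noise-free. A real system would use more anchors, noisy ranges, and least squares or filtering, and nothing here reflects how the U1 chip works internally.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def trilaterate_2d(anchors: Sequence[Point], ranges: Sequence[float]) -> Point:
    """Solve for (x, y) from three anchors and their measured distances.

    Subtracting the first circle equation from the other two gives a
    2x2 linear system, solved here with Cramer's rule.
    """
    if len(anchors) != 3 or len(ranges) != 3:
        raise ValueError("exactly three anchors and three ranges are required")
    (x1, y1), (x2, y2), (x3, y3) = anchors
    r1, r2, r3 = ranges
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = r1**2 - r2**2 + x2**2 - x1**2 + y2**2 - y1**2
    b2 = r1**2 - r3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("anchors are collinear; position is ambiguous")
    return ((b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det)


# Hypothetical room: three fixed anchors (meters) and a tag at (1, 1).
anchors = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
tag = (1.0, 1.0)
ranges = [math.dist(tag, a) for a in anchors]

x, y = trilaterate_2d(anchors, ranges)
print(x, y)
```

Feeding a sequence of such position fixes into a tracker, possibly fused with IMU data, is one way the "SLAM w/o camera" idea could be approached.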

25
• WebAR/SLAM
• Marker-based WebAR for events
• WebAR/SLAM w/ IMU fusion
• 5G + MEC
• The future of 5G-enabled Augmented Reality

26
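Overview
Pros:
• AR without installing an app
• Content can be created with HTML (+JavaScript) only
Cons:
• Currently requires a dedicated marker
• Limited supported environments (Safari on iOS 11 or later, Chrome on Android 5 or later)
Implementation
• AR.js: marker pose estimation
• A-Frame: content creation
Future work
• Arbitrary image markers
• Markerless AR (cf. ARKit, ARCore)
• Integration of geolocation and AR maps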

27
Pros:
• No need to install a native app
• Easy to create with only HTML (+JavaScript)
Cons:
• Marker-based
• Needs a recent environment
(Safari on iOS 11 or later, Chrome on Android 5 or later)
Implementation
• AR.js + A-Frame
28
• AR photo booth: 240 groups
• AR lottery: 510 people

29
Trial on Mother’s Day &
Father’s Day
Trial
@Tokyo Dome
R-mobile campaign

30
8th Wall © 2019
8th Wall built their own highly-optimized SLAM engine, and then re-architected it for the mobile web.
AUGMENTED REALITY FOR THE WEB
JavaScript
WebGL
WebAssembly
Six-Degrees-of-Freedom (6DoF)
Tracking
Point-Cloud
Lighting
Surface Estimation
Image Detection

31
Lightweight Web AR
SOTA SLAM
Schneider, Thomas et al.
“Maplab: An Open Framework for Research
in Visual-Inertial Mapping and Localization.”
IEEE Robotics and Automation Letters, 2018.

32
Offline loop closure and optimization
Online recording
Start / End
Office Space Mapping and Optimization
Back end visualization of the location map

Objects
Object detection
& recognition
Input image
Surface orientation Partial view alignment
3D pose estimation
Plane fitting 3D scene initialization
Room Geometry
Objects in 3D scene
walls initialized
with unknown scale

Output:
3D model
reconstruction
Potential 3D applications on server in future
• Total3DUnderstanding: Joint Layout, Object Pose and Mesh Reconstruction for Indoor Scenes From a Single Image, Yinyu Nie,
Xiaoguang Han, Shihui Guo, Yujian Zheng, Jian Chang, Jian Jun Zhang, CVPR, 2020

35
The future of 5G-enabled Augmented Reality
• Powered by Mobile Edge Computing (MEC)
• Big data processing w/ ultra-low latency
• Example by Scape + Samsung
Scape Technologies © 2019

36
• Web as Stock footage
• “Understanding Media”: "Hot" and "cool" media
• Stock footage for narrative
• Google street view time-lapse
• Beyond the field of view
• Photo uncrop
• Neural rendering in the wild
• Learning from web
• Learning human depth
• Learning from IoT data

37
Understanding Media: The Extensions of Man by Marshall McLuhan (1964)
• "Hot" and "cool" media
• Hot: "high definition", like film
• Cool: requires more active participation on the part of the user, like TV
• Content of every medium is always another (previous) medium.
The birth of virtual reality as an art form by Chris Milk (TEDTalks, 2016)
• Is VR the last medium?
• What about AR?

38
Edwin S. Porter:
Life of an American Fireman
(1903)
Sameer Agarwal, et al.:
Building a Rome in a Day
(ICCV 2009)

39
• Photo uncrop, Qi Shan, Brian Curless, Yasutaka Furukawa, Carlos Hernandez, and Steven M. Seitz,
ECCV ‘14.

40
M. Meshry, D. B. Goldman, S. Khamis, H. Hoppe, R. Pandey, N. Snavely and R. Martin-Brualla.
“Neural Rendering in the Wild.” CVPR, 2019
Total Scene Capture
• Encode the 3D structure of the scene, enabling rendering from an arbitrary viewpoint,
• Capture all possible appearances of the scene and allow rendering the scene under any of them.
• Understand the location and appearance of transient objects in the scene
and allow for reproducing or omitting them.

41
Z. Li, T. Dekel, F. Cole, R. Tucker,
N. Snavely, C. Liu and W. T. Freeman.
“Learning the Depths of Moving People
by Watching Frozen People.”
CVPR, 2019.

42
Kinect returns as
Azure Kinect
• Higher resolution, more accurate depth
• Multimodal sensing
• Integrated to Azure Cognitive Services, Azure IoT
Specification | Azure Kinect DK | Kinect for Windows v2
Audio: details | 7-mic circular array | 4-mic linear phased array
Motion sensor: details | 3-axis accelerometer, 3-axis gyro | 3-axis accelerometer
RGB camera: details | 3840 x 2160 px @30 fps | 1920 x 1080 px @30 fps
Depth camera: method | Time-of-Flight | Time-of-Flight
Depth camera: resolution | 640 x 576 px @30 fps; 512 x 512 px @30 fps; 1024 x 1024 px @15 fps | 512 x 424 px @30 fps
Connectivity: data | USB 3.1 Gen 1 with USB-C | USB 3.1 Gen 1
Connectivity: power | External PSU or USB-C | External PSU
Connectivity: synchronization | RGB & depth internal, external device-to-device | RGB & depth internal only
Mechanical: dimensions | 103 x 39 x 126 mm | 249 x 66 x 67 mm
Mechanical: mass | 440 g | 970 g
Mechanical: mounting | One ¼-20 UNC, four internal screw points | One ¼-20 UNC
Microsoft Azure IoT reference architecture

43
• AR/SLAM technologies are commoditized, but still nice-to-have, not must-have.
• Deep learning is pushing the boundaries of AR/SLAM
• Web-based AR/SLAM has huge potential in the 5G era
• AR/SLAM can be improved by learning from the Web, including IoT data
Designed by
macrovector / Freepik
