Attention-based Long-term Modeling for Deep Visual Odometry

2021 Digital Image Computing: Techniques and Applications (DICTA) Pub Date: 2021-11-01 DOI: 10.1109/DICTA52665.2021.9647140

Sangni Xu, Hao Xiong, Qiuxia Wu, Zhiyong Wang


Citations: 2

Abstract

Visual odometry (VO) aims to determine the positions of a moving camera from the image sequence it acquires. It is widely used in applications such as AR/VR, autonomous driving, and robotics. Conventional VO methods rely largely on hand-crafted features and data association, which are unreliable and degrade under fast motion. Learning-based VO therefore uses neural networks to map an image sequence directly to the corresponding camera poses. Most existing learning-based methods also add Long Short-Term Memory (LSTM) networks to model the temporal context across images, since the camera pose estimated for one image is highly correlated with the other images in the same sequence. However, a conventional LSTM is limited to modelling short-term dependencies and captures little long-term temporal context or global information. To mitigate this issue, we propose an attention-based long-term modelling approach that introduces a new fusion gate into the LSTM cell. Our method consists of two modules: a convolutional motion encoder and a recurrent global motion refinement module. Specifically, the convolutional motion encoder extracts motion features from images, and the refinement module then fuses these features with longer-term temporal information. In the refinement module, the fusion gate generates long-term temporal information in two phases: (1) extracting correlated long-term information from previous predictions through a dedicated attention module; and (2) updating the current hidden state with the extracted long-term information. This enables our model to gather long-term temporal information and further improve estimation accuracy. We evaluate the proposed method on two public datasets, KITTI and Oxford RobotCar. The experimental results demonstrate that our method outperforms the baseline model.
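The two phases of the fusion gate can be illustrated with a minimal sketch. The abstract does not give the gate's equations, so everything below is an assumption made for illustration: scaled dot-product attention stands in for the attention module, the memory is a list of past state vectors (`past_states`), and the gate uses hypothetical per-element weights `w_h`, `w_c`, and `bias` rather than learned matrices. It is not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(hidden, past_states):
    """Phase 1 (assumed form): scaled dot-product attention.

    Scores every past state against the current hidden state and
    returns the weighted average (context) plus the attention weights.
    """
    if not past_states:
        # No history yet: the context is just the current state.
        return list(hidden), []
    scale = math.sqrt(len(hidden))
    scores = [sum(h * p for h, p in zip(hidden, past)) / scale
              for past in past_states]
    weights = softmax(scores)
    context = [sum(w * past[i] for w, past in zip(weights, past_states))
               for i in range(len(hidden))]
    return context, weights


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse(hidden, context, w_h, w_c, bias):
    """Phase 2 (assumed form): gated update of the hidden state.

    g = sigmoid(w_h * h + w_c * c + b), element-wise (a simplification
    of a learned linear layer); the output is g * c + (1 - g) * h.
    """
    fused = []
    for h, c, wh, wc, b in zip(hidden, context, w_h, w_c, bias):
        gate = sigmoid(wh * h + wc * c + b)
        fused.append(gate * c + (1.0 - gate) * h)
    return fused


# Toy example with 2-dimensional states (values are illustrative only).
past_states = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
hidden = [1.0, 0.0]
context, weights = attend(hidden, past_states)
refined = fuse(hidden, context, w_h=[0.5, 0.5], w_c=[0.5, 0.5], bias=[0.0, 0.0])
```

In this toy run the past states most similar to the current hidden state receive the largest attention weights, and each element of `refined` lies between the corresponding elements of `hidden` and `context`, because the gate is a convex blend of the two.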
