{"title":"DeepDCT-VO:基于深度学习的低复杂度单目视觉里程计三维方向坐标变换","authors":"E. Simsek , B. Ozyer","doi":"10.1016/j.imavis.2025.105742","DOIUrl":null,"url":null,"abstract":"<div><div>Deep learning-based monocular visual odometry has gained importance in robotics and autonomous navigation due to its robustness in visually challenging environments and minimal sensor requirements. However, many existing deep learning-based MVO methods suffer from high computational costs and large model sizes, making them less suitable for real-time applications in resource-limited systems. In this study, we propose DeepDCT-VO, a lightweight visual odometry method that combines three-dimensional directional coordinate transformation with a compact deep learning architecture. Unlike traditional approaches that estimate translation in a global coordinate system and are prone to drift accumulation, DeepDCT-VO uses local directional motion derived from composite rotations. This approach avoids global trajectory reconstruction, thereby improving the method’s stability and reliability. The proposed model operates on input images at multiple resolutions (120 × 120, 240 × 240, 360 × 360, and 480 × 480), leveraging attention-guided residual learning to extract robust features. Additionally, it incorporates multi-modal information—specifically depth and semantic maps—to further improve the accuracy of pose estimation. Evaluations on the KITTI odometry benchmark demonstrate that DeepDCT-VO achieves competitive trajectory estimation accuracy while maintaining real-time performance—8 ms per frame on GPU and 12 ms on CPU. Compared to the existing method with the lowest translational drift (<span><math><msub><mrow><mi>t</mi></mrow><mrow><mtext>rel</mtext></mrow></msub></math></span>), DeepDCT-VO reduces model size by approximately 96.3% (from 37.5 million to 1.4 million parameters). Conversely, when compared to the lightest model in terms of parameter count, DeepDCT-VO reduces <span><math><msub><mrow><mi>t</mi></mrow><mrow><mtext>rel</mtext></mrow></msub></math></span> from 8.57% to 1.69%, achieving an 80.3% reduction in translational drift. These results underscore the effectiveness of DeepDCT-VO in delivering accurate and efficient monocular visual odometry, particularly suited for embedded and resource-limited applications, while the proposed transformation method offers an auxiliary function in reducing translational complexity.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"163 ","pages":"Article 105742"},"PeriodicalIF":4.2000,"publicationDate":"2025-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"DeepDCT-VO: 3D directional coordinate transformation for low-complexity monocular visual odometry using deep learning\",\"authors\":\"E. Simsek , B. Ozyer\",\"doi\":\"10.1016/j.imavis.2025.105742\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Deep learning-based monocular visual odometry has gained importance in robotics and autonomous navigation due to its robustness in visually challenging environments and minimal sensor requirements. However, many existing deep learning-based MVO methods suffer from high computational costs and large model sizes, making them less suitable for real-time applications in resource-limited systems. 
In this study, we propose DeepDCT-VO, a lightweight visual odometry method that combines three-dimensional directional coordinate transformation with a compact deep learning architecture. Unlike traditional approaches that estimate translation in a global coordinate system and are prone to drift accumulation, DeepDCT-VO uses local directional motion derived from composite rotations. This approach avoids global trajectory reconstruction, thereby improving the method’s stability and reliability. The proposed model operates on input images at multiple resolutions (120 × 120, 240 × 240, 360 × 360, and 480 × 480), leveraging attention-guided residual learning to extract robust features. Additionally, it incorporates multi-modal information—specifically depth and semantic maps—to further improve the accuracy of pose estimation. Evaluations on the KITTI odometry benchmark demonstrate that DeepDCT-VO achieves competitive trajectory estimation accuracy while maintaining real-time performance—8 ms per frame on GPU and 12 ms on CPU. Compared to the existing method with the lowest translational drift (<span><math><msub><mrow><mi>t</mi></mrow><mrow><mtext>rel</mtext></mrow></msub></math></span>), DeepDCT-VO reduces model size by approximately 96.3% (from 37.5 million to 1.4 million parameters). Conversely, when compared to the lightest model in terms of parameter count, DeepDCT-VO reduces <span><math><msub><mrow><mi>t</mi></mrow><mrow><mtext>rel</mtext></mrow></msub></math></span> from 8.57% to 1.69%, achieving an 80.3% reduction in translational drift. These results underscore the effectiveness of DeepDCT-VO in delivering accurate and efficient monocular visual odometry, particularly suited for embedded and resource-limited applications, while the proposed transformation method offers an auxiliary function in reducing translational complexity.</div></div>\",\"PeriodicalId\":50374,\"journal\":{\"name\":\"Image and Vision Computing\",\"volume\":\"163 \",\"pages\":\"Article 105742\"},\"PeriodicalIF\":4.2000,\"publicationDate\":\"2025-09-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Image and Vision Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0262885625003300\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885625003300","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
DeepDCT-VO: 3D directional coordinate transformation for low-complexity monocular visual odometry using deep learning
Deep learning-based monocular visual odometry (MVO) has gained importance in robotics and autonomous navigation due to its robustness in visually challenging environments and minimal sensor requirements. However, many existing deep learning-based MVO methods suffer from high computational costs and large model sizes, making them less suitable for real-time applications in resource-limited systems. In this study, we propose DeepDCT-VO, a lightweight visual odometry method that combines a three-dimensional directional coordinate transformation with a compact deep learning architecture. Unlike traditional approaches that estimate translation in a global coordinate system and are prone to drift accumulation, DeepDCT-VO uses local directional motion derived from composite rotations. This approach avoids global trajectory reconstruction, thereby improving stability and reliability. The proposed model operates on input images at multiple resolutions (120 × 120, 240 × 240, 360 × 360, and 480 × 480), leveraging attention-guided residual learning to extract robust features. Additionally, it incorporates multi-modal information, specifically depth and semantic maps, to further improve the accuracy of pose estimation. Evaluations on the KITTI odometry benchmark demonstrate that DeepDCT-VO achieves competitive trajectory estimation accuracy while maintaining real-time performance (8 ms per frame on GPU and 12 ms on CPU). Compared with the existing method with the lowest translational drift (t_rel), DeepDCT-VO reduces model size by approximately 96.3% (from 37.5 million to 1.4 million parameters). Conversely, compared with the lightest model in terms of parameter count, DeepDCT-VO reduces t_rel from 8.57% to 1.69%, an 80.3% reduction in translational drift. These results underscore the effectiveness of DeepDCT-VO in delivering accurate and efficient monocular visual odometry, particularly suited for embedded and resource-limited applications, while the proposed transformation additionally helps reduce the complexity of translation estimation.
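The abstract only summarizes the core idea, estimating local directional motion from composite rotations instead of accumulating positions directly in a global frame, so the following is a minimal, hypothetical sketch of how per-frame motions expressed in the local camera frame can be chained through composed rotations. The function names, the yaw-only toy motion, and the use of NumPy are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): chaining per-frame motions
# expressed in the local camera frame via composed rotations, rather than
# estimating translation directly in a global coordinate system.
import numpy as np

def rot_y(theta):
    """Rotation about the camera y-axis (yaw); a stand-in for a full 3D rotation."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def integrate_local_motions(relative_motions):
    """Chain per-frame local motions (R_k, t_k) into a global trajectory.

    Each t_k is the translation expressed in the previous camera frame, so the
    accumulated orientation is the composite rotation R_0 R_1 ... R_{k-1} and
    each local step is rotated into the global frame before being added.
    """
    R_global = np.eye(3)
    p_global = np.zeros(3)
    trajectory = [p_global.copy()]
    for R_k, t_k in relative_motions:
        p_global = p_global + R_global @ t_k   # local step mapped to the global frame
        R_global = R_global @ R_k              # composite rotation
        trajectory.append(p_global.copy())
    return np.array(trajectory)

# Toy example: 1 m forward per frame (along the camera z-axis) with 5 degrees of yaw per frame.
step = np.array([0.0, 0.0, 1.0])
motions = [(rot_y(np.deg2rad(5.0)), step) for _ in range(10)]
print(integrate_local_motions(motions)[-1])    # end position of the toy trajectory
```

For reference, the reported reductions are consistent with the quoted figures: 1 - 1.4/37.5 ≈ 96.3% for model size and 1 - 1.69/8.57 ≈ 80.3% for translational drift.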
About the journal:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.