{"title":"DeepDCT-VO:基于深度学习的低复杂度单目视觉里程计三维方向坐标变换","authors":"E. Simsek , B. Ozyer","doi":"10.1016/j.imavis.2025.105742","DOIUrl":null,"url":null,"abstract":"<div><div>Deep learning-based monocular visual odometry has gained importance in robotics and autonomous navigation due to its robustness in visually challenging environments and minimal sensor requirements. However, many existing deep learning-based MVO methods suffer from high computational costs and large model sizes, making them less suitable for real-time applications in resource-limited systems. In this study, we propose DeepDCT-VO, a lightweight visual odometry method that combines three-dimensional directional coordinate transformation with a compact deep learning architecture. Unlike traditional approaches that estimate translation in a global coordinate system and are prone to drift accumulation, DeepDCT-VO uses local directional motion derived from composite rotations. This approach avoids global trajectory reconstruction, thereby improving the method’s stability and reliability. The proposed model operates on input images at multiple resolutions (120 × 120, 240 × 240, 360 × 360, and 480 × 480), leveraging attention-guided residual learning to extract robust features. Additionally, it incorporates multi-modal information—specifically depth and semantic maps—to further improve the accuracy of pose estimation. Evaluations on the KITTI odometry benchmark demonstrate that DeepDCT-VO achieves competitive trajectory estimation accuracy while maintaining real-time performance—8 ms per frame on GPU and 12 ms on CPU. Compared to the existing method with the lowest translational drift (<span><math><msub><mrow><mi>t</mi></mrow><mrow><mtext>rel</mtext></mrow></msub></math></span>), DeepDCT-VO reduces model size by approximately 96.3% (from 37.5 million to 1.4 million parameters). Conversely, when compared to the lightest model in terms of parameter count, DeepDCT-VO reduces <span><math><msub><mrow><mi>t</mi></mrow><mrow><mtext>rel</mtext></mrow></msub></math></span> from 8.57% to 1.69%, achieving an 80.3% reduction in translational drift. These results underscore the effectiveness of DeepDCT-VO in delivering accurate and efficient monocular visual odometry, particularly suited for embedded and resource-limited applications, while the proposed transformation method offers an auxiliary function in reducing translational complexity.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"163 ","pages":"Article 105742"},"PeriodicalIF":4.2000,"publicationDate":"2025-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"DeepDCT-VO: 3D directional coordinate transformation for low-complexity monocular visual odometry using deep learning\",\"authors\":\"E. Simsek , B. Ozyer\",\"doi\":\"10.1016/j.imavis.2025.105742\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Deep learning-based monocular visual odometry has gained importance in robotics and autonomous navigation due to its robustness in visually challenging environments and minimal sensor requirements. However, many existing deep learning-based MVO methods suffer from high computational costs and large model sizes, making them less suitable for real-time applications in resource-limited systems. 
In this study, we propose DeepDCT-VO, a lightweight visual odometry method that combines three-dimensional directional coordinate transformation with a compact deep learning architecture. Unlike traditional approaches that estimate translation in a global coordinate system and are prone to drift accumulation, DeepDCT-VO uses local directional motion derived from composite rotations. This approach avoids global trajectory reconstruction, thereby improving the method’s stability and reliability. The proposed model operates on input images at multiple resolutions (120 × 120, 240 × 240, 360 × 360, and 480 × 480), leveraging attention-guided residual learning to extract robust features. Additionally, it incorporates multi-modal information—specifically depth and semantic maps—to further improve the accuracy of pose estimation. Evaluations on the KITTI odometry benchmark demonstrate that DeepDCT-VO achieves competitive trajectory estimation accuracy while maintaining real-time performance—8 ms per frame on GPU and 12 ms on CPU. Compared to the existing method with the lowest translational drift (<span><math><msub><mrow><mi>t</mi></mrow><mrow><mtext>rel</mtext></mrow></msub></math></span>), DeepDCT-VO reduces model size by approximately 96.3% (from 37.5 million to 1.4 million parameters). Conversely, when compared to the lightest model in terms of parameter count, DeepDCT-VO reduces <span><math><msub><mrow><mi>t</mi></mrow><mrow><mtext>rel</mtext></mrow></msub></math></span> from 8.57% to 1.69%, achieving an 80.3% reduction in translational drift. These results underscore the effectiveness of DeepDCT-VO in delivering accurate and efficient monocular visual odometry, particularly suited for embedded and resource-limited applications, while the proposed transformation method offers an auxiliary function in reducing translational complexity.</div></div>\",\"PeriodicalId\":50374,\"journal\":{\"name\":\"Image and Vision Computing\",\"volume\":\"163 \",\"pages\":\"Article 105742\"},\"PeriodicalIF\":4.2000,\"publicationDate\":\"2025-09-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Image and Vision Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0262885625003300\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885625003300","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
DeepDCT-VO: 3D directional coordinate transformation for low-complexity monocular visual odometry using deep learning
Deep learning-based monocular visual odometry (MVO) has gained importance in robotics and autonomous navigation due to its robustness in visually challenging environments and minimal sensor requirements. However, many existing deep learning-based MVO methods suffer from high computational costs and large model sizes, making them less suitable for real-time applications in resource-limited systems. In this study, we propose DeepDCT-VO, a lightweight visual odometry method that combines a three-dimensional directional coordinate transformation with a compact deep learning architecture. Unlike traditional approaches that estimate translation in a global coordinate system and are prone to drift accumulation, DeepDCT-VO uses local directional motion derived from composite rotations. This approach avoids global trajectory reconstruction, thereby improving stability and reliability. The proposed model operates on input images at multiple resolutions (120 × 120, 240 × 240, 360 × 360, and 480 × 480), leveraging attention-guided residual learning to extract robust features. Additionally, it incorporates multi-modal information, specifically depth and semantic maps, to further improve the accuracy of pose estimation. Evaluations on the KITTI odometry benchmark demonstrate that DeepDCT-VO achieves competitive trajectory estimation accuracy while maintaining real-time performance (8 ms per frame on GPU and 12 ms on CPU). Compared with the existing method with the lowest translational drift (t_rel), DeepDCT-VO reduces model size by approximately 96.3% (from 37.5 million to 1.4 million parameters). Conversely, compared with the lightest model in terms of parameter count, DeepDCT-VO reduces t_rel from 8.57% to 1.69%, an 80.3% reduction in translational drift. These results underscore the effectiveness of DeepDCT-VO in delivering accurate and efficient monocular visual odometry, particularly suited for embedded and resource-limited applications, while the proposed transformation additionally helps reduce the complexity of translation estimation.
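The abstract only summarizes the core idea, estimating local directional motion from composite rotations instead of accumulating positions directly in a global frame, so the following is a minimal, hypothetical sketch of how per-frame motions expressed in the local camera frame can be chained through composed rotations. The function names, the yaw-only toy motion, and the use of NumPy are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): chaining per-frame motions
# expressed in the local camera frame via composed rotations, rather than
# estimating translation directly in a global coordinate system.
import numpy as np

def rot_y(theta):
    """Rotation about the camera y-axis (yaw); a stand-in for a full 3D rotation."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def integrate_local_motions(relative_motions):
    """Chain per-frame local motions (R_k, t_k) into a global trajectory.

    Each t_k is the translation expressed in the previous camera frame, so the
    accumulated orientation is the composite rotation R_0 R_1 ... R_{k-1} and
    each local step is rotated into the global frame before being added.
    """
    R_global = np.eye(3)
    p_global = np.zeros(3)
    trajectory = [p_global.copy()]
    for R_k, t_k in relative_motions:
        p_global = p_global + R_global @ t_k   # local step mapped to the global frame
        R_global = R_global @ R_k              # composite rotation
        trajectory.append(p_global.copy())
    return np.array(trajectory)

# Toy example: 1 m forward per frame (along the camera z-axis) with 5 degrees of yaw per frame.
step = np.array([0.0, 0.0, 1.0])
motions = [(rot_y(np.deg2rad(5.0)), step) for _ in range(10)]
print(integrate_local_motions(motions)[-1])    # end position of the toy trajectory
```

For reference, the reported reductions are consistent with the quoted figures: 1 - 1.4/37.5 ≈ 96.3% for model size and 1 - 1.69/8.57 ≈ 80.3% for translational drift.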
About the journal:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.