{"title":"深度引导的三轴融合网络,用于高效的广义立体匹配","authors":"Seunghun Moon;Haeuk Lee;Suk-Ju Kang","doi":"10.1109/LRA.2025.3606382","DOIUrl":null,"url":null,"abstract":"Stereo matching is a crucial task in computer vision that estimates pixel-level disparities from rectified image pairs to reconstruct three-dimensional depth information. It has diverse applications, ranging from augmented reality to autonomous driving. While deep learning-based methods have achieved remarkable progress through 3D CNNs and Transformer-based architectures, their reliance on domain-specific fine-tuning and localized feature extraction often hampers robustness and generalization in real-world scenarios. This letter introduces the Depth-Guided Tri-Axial Fusion Network (DGTFNet), which overcomes these limitations by integrating depth priors from a monocular depth foundation model via the Depth-Guided Cross-Modal Attention (DGCMA) module. Additionally, we propose a Tri-Axial Attention (TAA) module that employs directional strip convolutions to capture long-range dependencies across horizontal, vertical, and spatial dimensions. Extensive evaluations on public stereo benchmarks demonstrate that DGTFNet significantly outperforms state-of-the-art methods in zero-shot evaluations. Ablation studies further validate the contribution of each module in delivering robust and efficient stereo matching.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"10 10","pages":"10791-10798"},"PeriodicalIF":5.3000,"publicationDate":"2025-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"DGTFNet: Depth-Guided Tri-Axial Fusion Network for Efficient Generalizable Stereo Matching\",\"authors\":\"Seunghun Moon;Haeuk Lee;Suk-Ju Kang\",\"doi\":\"10.1109/LRA.2025.3606382\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Stereo matching is a crucial task in computer vision that estimates pixel-level disparities from rectified image pairs to reconstruct three-dimensional depth information. It has diverse applications, ranging from augmented reality to autonomous driving. While deep learning-based methods have achieved remarkable progress through 3D CNNs and Transformer-based architectures, their reliance on domain-specific fine-tuning and localized feature extraction often hampers robustness and generalization in real-world scenarios. This letter introduces the Depth-Guided Tri-Axial Fusion Network (DGTFNet), which overcomes these limitations by integrating depth priors from a monocular depth foundation model via the Depth-Guided Cross-Modal Attention (DGCMA) module. Additionally, we propose a Tri-Axial Attention (TAA) module that employs directional strip convolutions to capture long-range dependencies across horizontal, vertical, and spatial dimensions. Extensive evaluations on public stereo benchmarks demonstrate that DGTFNet significantly outperforms state-of-the-art methods in zero-shot evaluations. 
Ablation studies further validate the contribution of each module in delivering robust and efficient stereo matching.\",\"PeriodicalId\":13241,\"journal\":{\"name\":\"IEEE Robotics and Automation Letters\",\"volume\":\"10 10\",\"pages\":\"10791-10798\"},\"PeriodicalIF\":5.3000,\"publicationDate\":\"2025-09-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Robotics and Automation Letters\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11150692/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ROBOTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Robotics and Automation Letters","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11150692/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ROBOTICS","Score":null,"Total":0}
Stereo matching is a crucial task in computer vision that estimates pixel-level disparities from rectified image pairs to reconstruct three-dimensional depth information. It has diverse applications, ranging from augmented reality to autonomous driving. While deep learning-based methods have achieved remarkable progress through 3D CNNs and Transformer-based architectures, their reliance on domain-specific fine-tuning and localized feature extraction often hampers robustness and generalization in real-world scenarios. This letter introduces the Depth-Guided Tri-Axial Fusion Network (DGTFNet), which overcomes these limitations by integrating depth priors from a monocular depth foundation model via the Depth-Guided Cross-Modal Attention (DGCMA) module. Additionally, we propose a Tri-Axial Attention (TAA) module that employs directional strip convolutions to capture long-range dependencies across horizontal, vertical, and spatial dimensions. Extensive evaluations on public stereo benchmarks demonstrate that DGTFNet significantly outperforms state-of-the-art methods in zero-shot evaluations. Ablation studies further validate the contribution of each module in delivering robust and efficient stereo matching.
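To make the two mechanisms named in the abstract more concrete, below is a minimal PyTorch-style sketch of how directional strip convolutions (the Tri-Axial Attention idea) and a cross-modal attention that lets stereo features query monocular depth-prior features (the DGCMA idea) could be wired together. All class names, tensor shapes, kernel sizes, and the fusion order are illustrative assumptions, not the letter's actual DGTFNet implementation.

```python
# Illustrative sketch only -- NOT the published DGTFNet architecture.
# (a) Horizontal/vertical strip convolutions plus a spatial gate, loosely mirroring Tri-Axial Attention.
# (b) Cross-attention where stereo features query features from a monocular depth foundation model.
import torch
import torch.nn as nn


class TriAxialAttentionSketch(nn.Module):
    """Illustrative tri-axial attention: horizontal strip, vertical strip, and spatial gating."""

    def __init__(self, channels: int, strip_len: int = 11):
        super().__init__()
        pad = strip_len // 2
        # 1 x k depthwise strip convolution -> long-range context along the horizontal axis.
        self.horizontal = nn.Conv2d(channels, channels, kernel_size=(1, strip_len),
                                    padding=(0, pad), groups=channels)
        # k x 1 depthwise strip convolution -> long-range context along the vertical axis.
        self.vertical = nn.Conv2d(channels, channels, kernel_size=(strip_len, 1),
                                  padding=(pad, 0), groups=channels)
        # Small spatial branch producing a per-pixel attention map.
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Fuse the directional responses, then gate the input features with the spatial map.
        context = self.horizontal(x) + self.vertical(x) + x
        attention = self.spatial(context)
        return self.proj(x * attention)


class DepthGuidedCrossAttentionSketch(nn.Module):
    """Illustrative cross-modal attention: stereo features attend to depth-prior features."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=channels, num_heads=num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, stereo_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        # stereo_feat, depth_feat: (B, C, H, W) feature maps at the same resolution.
        b, c, h, w = stereo_feat.shape
        q = stereo_feat.flatten(2).transpose(1, 2)   # (B, H*W, C) queries from the stereo branch
        kv = depth_feat.flatten(2).transpose(1, 2)   # (B, H*W, C) keys/values from the depth prior
        fused, _ = self.attn(self.norm(q), self.norm(kv), self.norm(kv))
        fused = (q + fused).transpose(1, 2).reshape(b, c, h, w)  # residual fusion back to a map
        return fused


if __name__ == "__main__":
    feats = torch.randn(2, 64, 32, 64)        # stereo features (illustrative shape)
    depth_prior = torch.randn(2, 64, 32, 64)  # features from a monocular depth model
    out = TriAxialAttentionSketch(64)(DepthGuidedCrossAttentionSketch(64)(feats, depth_prior))
    print(out.shape)  # torch.Size([2, 64, 32, 64])
```

In this sketch, the strip convolutions are depthwise to keep the added cost low, which is in the spirit of the letter's emphasis on efficiency; the actual module designs, kernel sizes, and fusion order in DGTFNet may differ.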
About the journal:
The scope of this journal is to publish peer-reviewed articles that provide a timely and concise account of innovative research ideas and application results, reporting significant theoretical findings and application case studies in areas of robotics and automation.