MAF‐Stereo: Fast stereo matching through multi-branch attention fusion

IF 6.5 · CAS Zone 2 (Computer Science) · JCR Q1 (AUTOMATION & CONTROL SYSTEMS)

ISA Transactions · Pub Date: 2025-05-26 · DOI: 10.1016/j.isatra.2025.05.038

Lei Jin, Ke Xu

ISA Transactions, Volume 164, Pages 211-221. Publisher page: https://www.sciencedirect.com/science/article/pii/S0019057825002721


Abstract

With advancements in computer vision, stereo matching has become a critical component in applications such as autonomous driving and 3D reconstruction. Traditional methods for achieving accurate matching often rely on high-resolution image features or deeper network architectures, which substantially compromise inference speed. In contrast, methods designed for faster performance typically simplify network structures, sacrificing accuracy to improve efficiency. Our study identifies a key limitation of these rapid methods: their exclusive reliance on low-resolution features during the feature resolution recovery process, which results in insufficiently informative recovered features. To address this limitation, we propose a novel module, the Multi-branch Attention Fusion (MAF), which leverages shallow features extracted in the early stages of processing to enhance feature resolution recovery during the cost aggregation phase. Additionally, we introduce an improvement to the cost volume generation process by incorporating cosine similarity, which alleviates the issue of weak correlation between left and right image features often encountered in conventional four-dimensional cost volumes. Building upon these contributions, we present MAF-Stereo, a method that achieves an endpoint error (EPE) of 0.57 and an inference speed of 41 ms on the Scene Flow dataset. Comprehensive evaluations on the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) 2012 and 2015 datasets further demonstrate that MAF-Stereo outperforms existing fast matching methods in both speed and accuracy, establishing its effectiveness and robustness. The code is available at: https://github.com/LeiJ-USTB/MAF-Stereo/tree/main.
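The two quantities the abstract leans on, a cosine-similarity matching cost and the endpoint error (EPE), can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (that is in the linked repository). The function names `cosine_cost_volume` and `endpoint_error`, the `(C, H, W)` feature layout, and the choice of a simple three-dimensional `(D, H, W)` correlation-style volume are all assumptions made for illustration; the paper's own cost volume may be built differently.

```python
import numpy as np


def cosine_cost_volume(left, right, max_disp, eps=1e-8):
    """Hypothetical helper: cosine similarity between left and right features.

    left, right: feature maps of shape (C, H, W).
    Returns a volume of shape (max_disp, H, W); entry [d, y, x] is the cosine
    similarity between left[:, y, x] and right[:, y, x - d]. Positions where
    x - d falls outside the image are left at 0.
    """
    # L2-normalise along the channel axis so a dot product equals cosine similarity.
    left_n = left / (np.linalg.norm(left, axis=0, keepdims=True) + eps)
    right_n = right / (np.linalg.norm(right, axis=0, keepdims=True) + eps)

    _, height, width = left.shape
    volume = np.zeros((max_disp, height, width), dtype=left.dtype)
    for d in range(max_disp):
        if d == 0:
            volume[d] = (left_n * right_n).sum(axis=0)
        else:
            # A left pixel at column x is compared with the right pixel at x - d.
            volume[d, :, d:] = (left_n[:, :, d:] * right_n[:, :, :-d]).sum(axis=0)
    return volume


def endpoint_error(pred_disp, gt_disp, valid=None):
    """Mean absolute disparity error (EPE) over valid pixels."""
    err = np.abs(pred_disp - gt_disp)
    if valid is None:
        valid = np.ones(gt_disp.shape, dtype=bool)
    return float(err[valid].mean())
```

Normalising the features before correlating them bounds every matching score to [-1, 1], so the score reflects the direction of the feature vectors rather than their magnitude. A disparity estimate can then be read off, for example, with `volume.argmax(axis=0)`, and scored against ground truth with `endpoint_error`.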


Source journal

ISA Transactions (Engineering & Technology: Engineering, Multidisciplinary)

CiteScore: 11.70

Self-citation rate: 12.30%

Articles published: 824

Review time: 4.4 months

Journal description: ISA Transactions serves as a platform for showcasing advancements in measurement and automation, catering to both industrial practitioners and applied researchers. It covers a wide array of topics within measurement, including sensors, signal processing, data analysis, and fault detection, supported by techniques such as artificial intelligence and communication systems. Automation topics encompass control strategies, modelling, system reliability, and maintenance, alongside optimization and human-machine interaction. The journal targets research and development professionals in control systems, process instrumentation, and automation from academia and industry.