MatchFormer: Interleaving Attention in Transformers for Feature Matching

Qing Wang, Jiaming Zhang, Kailun Yang, Kunyu Peng, Rainer Stiefelhagen
{"title":"MatchFormer: Interleaving Attention in Transformers for Feature Matching","authors":"Qing Wang, Jiaming Zhang, Kailun Yang, Kunyu Peng, R. Stiefelhagen","doi":"10.48550/arXiv.2203.09645","DOIUrl":null,"url":null,"abstract":"Local feature matching is a computationally intensive task at the subpixel level. While detector-based methods coupled with feature descriptors struggle in low-texture scenes, CNN-based methods with a sequential extract-to-match pipeline, fail to make use of the matching capacity of the encoder and tend to overburden the decoder for matching. In contrast, we propose a novel hierarchical extract-and-match transformer, termed as MatchFormer. Inside each stage of the hierarchical encoder, we interleave self-attention for feature extraction and cross-attention for feature matching, yielding a human-intuitive extract-and-match scheme. Such a match-aware encoder releases the overloaded decoder and makes the model highly efficient. Further, combining self- and cross-attention on multi-scale features in a hierarchical architecture improves matching robustness, particularly in low-texture indoor scenes or with less outdoor training data. Thanks to such a strategy, MatchFormer is a multi-win solution in efficiency, robustness, and precision. Compared to the previous best method in indoor pose estimation, our lite MatchFormer has only 45% GFLOPs, yet achieves a +1.3% precision gain and a 41% running speed boost. The large MatchFormer reaches state-of-the-art on four different benchmarks, including indoor pose estimation (ScanNet), outdoor pose estimation (MegaDepth), homography estimation and image matching (HPatch), and visual localization (InLoc).","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2022-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"51","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2203.09645","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 51

Abstract

Local feature matching is a computationally intensive task at the subpixel level. While detector-based methods coupled with feature descriptors struggle in low-texture scenes, CNN-based methods with a sequential extract-to-match pipeline fail to make use of the matching capacity of the encoder and tend to overburden the decoder for matching. In contrast, we propose a novel hierarchical extract-and-match transformer, termed MatchFormer. Inside each stage of the hierarchical encoder, we interleave self-attention for feature extraction and cross-attention for feature matching, yielding a human-intuitive extract-and-match scheme. Such a match-aware encoder relieves the overloaded decoder and makes the model highly efficient. Further, combining self- and cross-attention on multi-scale features in a hierarchical architecture improves matching robustness, particularly in low-texture indoor scenes or with less outdoor training data. Thanks to this strategy, MatchFormer is a multi-win solution in efficiency, robustness, and precision. Compared to the previous best method in indoor pose estimation, our lite MatchFormer uses only 45% of the GFLOPs, yet achieves a +1.3% precision gain and a 41% running speed boost. The large MatchFormer reaches state-of-the-art on four different benchmarks, including indoor pose estimation (ScanNet), outdoor pose estimation (MegaDepth), homography estimation and image matching (HPatches), and visual localization (InLoc).
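To make the extract-and-match idea concrete, below is a minimal PyTorch sketch of one encoder stage that interleaves self-attention on each view (feature extraction) with cross-attention between the two views (feature matching). The module name, dimensions, shared weights across views, and layer layout are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of interleaved self-/cross-attention in one encoder stage.
# Hypothetical names and hyperparameters; not the official MatchFormer code.
import torch
import torch.nn as nn


class InterleavedAttentionStage(nn.Module):
    """One hypothetical encoder stage: self-attention within each view,
    followed by cross-attention from each view to the other."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor):
        # Self-attention: each view attends to itself (feature extraction).
        feat_a = self.norm1(feat_a + self.self_attn(feat_a, feat_a, feat_a)[0])
        feat_b = self.norm1(feat_b + self.self_attn(feat_b, feat_b, feat_b)[0])
        # Cross-attention: each view attends to the other (feature matching).
        out_a = self.norm2(feat_a + self.cross_attn(feat_a, feat_b, feat_b)[0])
        out_b = self.norm2(feat_b + self.cross_attn(feat_b, feat_a, feat_a)[0])
        return out_a, out_b


if __name__ == "__main__":
    # Two views, each flattened to a sequence of 1024 tokens with 256 channels.
    a = torch.randn(1, 1024, 256)
    b = torch.randn(1, 1024, 256)
    stage = InterleavedAttentionStage()
    out_a, out_b = stage(a, b)
    print(out_a.shape, out_b.shape)  # torch.Size([1, 1024, 256]) each
```

In the paper's hierarchical design, stages like this would operate on progressively downsampled feature maps, so the decoder receives features that are already match-aware rather than doing all matching itself.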