Sep-NMS: Unlocking the Aptitude of Two-Stage Referring Expression Comprehension

IF 7.3 · Zone 2 (Computer Science) · JCR Q1: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Jing Wang, Zhikang Wang, Xiaojie Wang, Fangxiang Feng, Bo Yang
DOI: 10.1049/cit2.70007
Journal: CAAI Transactions on Intelligence Technology, vol. 10, no. 4, pp. 1049-1061
Published: 2025-04-02
Publication type: Journal Article
Article page: https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/cit2.70007
PDF: https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/cit2.70007
Citations: 0

Abstract

Referring expression comprehension (REC) aims to locate the specific image region described by a natural-language expression. Existing two-stage methods generate multiple candidate proposals in the first stage and then select one of them as the grounding result in the second stage. However, the number of candidate proposals generated in the first stage far exceeds the number of ground-truth objects, and the recall of critical objects is inadequate, which severely limits overall network performance. To address these issues, the authors propose Separate Non-Maximum Suppression (Sep-NMS) for two-stage REC. Sep-NMS models information from the two stages both independently and collaboratively, improving comprehension and identification of the target objects. Specifically, a Ref-Relatedness module filters referent proposals rigorously, reducing their redundancy. A $\text{CLIP}^{\dagger}$ Relatedness module, built on robust multimodal pre-trained encoders, assesses the relevance between the language and each proposal to improve the recall of critical objects. The authors state that they are the first to use a multimodal pre-trained model for proposal filtering in the first stage. Moreover, an Information Fusion module combines the multimodal information across the two stages so that the available information is used as fully as possible. Extensive experiments demonstrate that the approach achieves performance competitive with previous state-of-the-art methods. The datasets used are publicly available: RefCOCO and RefCOCO+ (https://doi.org/10.1007/978-3-319-46475-6_5) and RefCOCOg (https://doi.org/10.1109/CVPR.2016.9).
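
The first-stage idea in the abstract (score each proposal for relatedness to the expression, discard weakly related ones, and suppress redundant overlapping boxes) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the Ref-Relatedness, CLIP† Relatedness, or Information Fusion modules are computed, so the code below assumes the relatedness scores are already given and applies plain greedy non-maximum suppression. The function names `iou`, `nms`, and `filter_proposals`, and the parameters `min_relatedness`, `iou_thresh`, and `top_k`, are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_thresh: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already kept box by more than iou_thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep


def filter_proposals(boxes: Sequence[Box], relatedness: Sequence[float],
                     min_relatedness: float = 0.3, iou_thresh: float = 0.5,
                     top_k: int = 5) -> List[int]:
    """Hypothetical first-stage filter: drop proposals whose (externally
    supplied) relatedness to the expression is low, then run NMS ranked
    by relatedness. Returns indices into the original proposal list."""
    cand = [i for i, r in enumerate(relatedness) if r >= min_relatedness]
    kept = nms([boxes[i] for i in cand], [relatedness[i] for i in cand], iou_thresh)
    return [cand[j] for j in kept][:top_k]
```

For example, with proposals `[(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (50, 50, 60, 60)]` and assumed relatedness scores `[0.9, 0.8, 0.6, 0.1]`, `filter_proposals` returns `[0, 2]`: the fourth box falls below the relatedness threshold, and the second is suppressed because it overlaps the first with an IoU of about 0.68. Ranking NMS by language relatedness rather than by detector confidence is the design choice this sketch is meant to highlight; the paper's actual scoring and fusion may differ.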

Source journal
CAAI Transactions on Intelligence Technology (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)
CiteScore: 11.00
Self-citation rate: 3.90%
Articles published: 134
Review time: 35 weeks
About the journal: CAAI Transactions on Intelligence Technology is a leading venue for original research on the theoretical and experimental aspects of artificial intelligence technology. It is a fully open access journal co-published by the Institution of Engineering and Technology (IET) and the Chinese Association for Artificial Intelligence (CAAI), making research openly accessible to read and share worldwide.