Sep-NMS: Unlocking the Aptitude of Two-Stage Referring Expression Comprehension

IF 7.3; Region 2 (Computer Science); JCR Q1 (Computer Science, Artificial Intelligence)

CAAI Transactions on Intelligence Technology. Pub Date: 2025-04-02. DOI: 10.1049/cit2.70007

Jing Wang, Zhikang Wang, Xiaojie Wang, Fangxiang Feng, Bo Yang

Volume 10, Issue 4, pages 1049-1061. Article page: https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/cit2.70007

Citations: 0

Abstract



Referring expression comprehension (REC) aims to locate the specific image region described by a natural-language expression. Existing two-stage methods generate multiple candidate proposals in the first stage and select one of them as the grounding result in the second stage. However, the number of candidate proposals generated in the first stage far exceeds the number of ground-truth objects, and the recall of critical objects is inadequate, which severely limits overall network performance. To address these issues, the authors propose Separate Non-Maximum Suppression (Sep-NMS) for two-stage REC. Sep-NMS models information from the two stages both independently and collaboratively, improving the comprehension and identification of target objects. Specifically, a Ref-Relatedness module filters referent proposals rigorously, reducing their redundancy. A $\text{CLIP}^{\dagger}$ Relatedness module, built on robust multimodal pre-trained encoders, assesses the relevance between the language and the proposals to improve the recall of critical objects. The authors state that they are the first to use a multimodal pre-trained model for proposal filtering in the first stage. In addition, an Information Fusion module combines the multimodal information from the two stages so that the available information is fully used. Extensive experiments show that the approach achieves performance competitive with previous state-of-the-art methods. The datasets used are publicly available: RefCOCO and RefCOCO+ (https://doi.org/10.1007/978-3-319-46475-6_5) and RefCOCOg (https://doi.org/10.1109/CVPR.2016.9).
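The abstract does not give algorithmic details, so the following is only a minimal sketch of the general idea of language-aware proposal filtering in a two-stage pipeline, not the authors' Sep-NMS. It assumes that each proposal and the expression have already been embedded as vectors by some encoder (toy vectors stand in for them here), scores each proposal by cosine similarity to the expression, drops weakly related proposals, and then applies classic greedy non-maximum suppression using that relatedness as the ranking score. The function names, the thresholds `iou_thresh` and `min_relatedness`, and the example data are all hypothetical.

```python
from math import sqrt
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical layout


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    nu, nv = sqrt(sum(x * x for x in u)), sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def filter_proposals(
    boxes: Sequence[Box],
    box_feats: Sequence[Sequence[float]],
    text_feat: Sequence[float],
    iou_thresh: float = 0.5,        # assumed value, not from the paper
    min_relatedness: float = 0.2,   # assumed value, not from the paper
) -> List[int]:
    """Return indices of kept proposals, most language-related first."""
    scores = [cosine(f, text_feat) for f in box_feats]
    # Discard proposals that are barely related to the expression.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= min_relatedness),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept: List[int] = []
    for i in order:
        # Greedy NMS: keep a box only if it does not overlap a better-scored kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in kept):
            kept.append(i)
    return kept


# Toy example: four proposals with 2-D stand-in embeddings.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (40, 40, 50, 50)]
feats = [(1.0, 0.0), (0.9, 0.1), (0.6, 0.8), (0.0, 1.0)]
text = (1.0, 0.0)
print(filter_proposals(boxes, feats, text))  # [0, 2]
```

In this toy run, proposal 3 is dropped for low relatedness and proposal 1 is suppressed because it overlaps the better-scored proposal 0. A second stage would then choose the grounding result among the surviving proposals.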


Source journal

CAAI Transactions on Intelligence Technology (Computer Science, Artificial Intelligence)

CiteScore: 11.00

Self-citation rate: 3.90%

Articles published: 134

Review time: 35 weeks

About the journal: CAAI Transactions on Intelligence Technology is a leading venue for original research on the theoretical and experimental aspects of artificial intelligence technology. It is a fully open access journal, co-published by the Institution of Engineering and Technology (IET) and the Chinese Association for Artificial Intelligence (CAAI), whose research is openly accessible to read and share worldwide.