Jing Wang, Zhikang Wang, Xiaojie Wang, Fangxiang Feng, Bo Yang
{"title":"Sep-NMS:开启两阶段指称表达理解的能力","authors":"Jing Wang, Zhikang Wang, Xiaojie Wang, Fangxiang Feng, Bo Yang","doi":"10.1049/cit2.70007","DOIUrl":null,"url":null,"abstract":"<p>Referring expression comprehension (REC) aims to locate a specific region in an image described by a natural language. Existing two-stage methods generate multiple candidate proposals in the first stage, followed by selecting one of these proposals as the grounding result in the second stage. Nevertheless, the number of candidate proposals generated in the first stage significantly exceeds ground truth and the recall of critical objects is inadequate, thereby enormously limiting the overall network performance. To address the above issues, the authors propose an innovative method termed Separate Non-Maximum Suppression (Sep-NMS) for two-stage REC. Particularly, Sep-NMS models information from the two stages independently and collaboratively, ultimately achieving an overall improvement in comprehension and identification of the target objects. Specifically, the authors propose a Ref-Relatedness module for filtering referent proposals rigorously, decreasing the redundancy of referent proposals. A <span></span><math>\n <semantics>\n <mrow>\n <msup>\n <mtext>CLIP</mtext>\n <mo>†</mo>\n </msup>\n </mrow>\n <annotation> ${\\text{CLIP}}^{{\\dagger}}$</annotation>\n </semantics></math> Relatedness module based on robust multimodal pre-trained encoders is built to precisely assess the relevance between language and proposals to improve the recall of critical objects. It is worth mentioning that the authors are the pioneers in utilising a multimodal pre-training model for proposal filtering in the first stage. Moreover, an Information Fusion module is designed to effectively amalgamate the multimodal information across two stages, ensuring maximum utilisation of the available information. Extensive experiments demonstrate that the approach achieves competitive performance with previous state-of-the-art methods. The datasets used are publicly available: RefCOCO, RefCOCO+: https://doi.org/10.1007/978-3-319-46475-6_5 and RefCOCOg: https://doi.org/10.1109/CVPR.2016.9.</p>","PeriodicalId":46211,"journal":{"name":"CAAI Transactions on Intelligence Technology","volume":"10 4","pages":"1049-1061"},"PeriodicalIF":7.3000,"publicationDate":"2025-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/cit2.70007","citationCount":"0","resultStr":"{\"title\":\"Sep-NMS: Unlocking the Aptitude of Two-Stage Referring Expression Comprehension\",\"authors\":\"Jing Wang, Zhikang Wang, Xiaojie Wang, Fangxiang Feng, Bo Yang\",\"doi\":\"10.1049/cit2.70007\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Referring expression comprehension (REC) aims to locate a specific region in an image described by a natural language. Existing two-stage methods generate multiple candidate proposals in the first stage, followed by selecting one of these proposals as the grounding result in the second stage. Nevertheless, the number of candidate proposals generated in the first stage significantly exceeds ground truth and the recall of critical objects is inadequate, thereby enormously limiting the overall network performance. To address the above issues, the authors propose an innovative method termed Separate Non-Maximum Suppression (Sep-NMS) for two-stage REC. Particularly, Sep-NMS models information from the two stages independently and collaboratively, ultimately achieving an overall improvement in comprehension and identification of the target objects. Specifically, the authors propose a Ref-Relatedness module for filtering referent proposals rigorously, decreasing the redundancy of referent proposals. A <span></span><math>\\n <semantics>\\n <mrow>\\n <msup>\\n <mtext>CLIP</mtext>\\n <mo>†</mo>\\n </msup>\\n </mrow>\\n <annotation> ${\\\\text{CLIP}}^{{\\\\dagger}}$</annotation>\\n </semantics></math> Relatedness module based on robust multimodal pre-trained encoders is built to precisely assess the relevance between language and proposals to improve the recall of critical objects. It is worth mentioning that the authors are the pioneers in utilising a multimodal pre-training model for proposal filtering in the first stage. Moreover, an Information Fusion module is designed to effectively amalgamate the multimodal information across two stages, ensuring maximum utilisation of the available information. Extensive experiments demonstrate that the approach achieves competitive performance with previous state-of-the-art methods. The datasets used are publicly available: RefCOCO, RefCOCO+: https://doi.org/10.1007/978-3-319-46475-6_5 and RefCOCOg: https://doi.org/10.1109/CVPR.2016.9.</p>\",\"PeriodicalId\":46211,\"journal\":{\"name\":\"CAAI Transactions on Intelligence Technology\",\"volume\":\"10 4\",\"pages\":\"1049-1061\"},\"PeriodicalIF\":7.3000,\"publicationDate\":\"2025-04-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/cit2.70007\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"CAAI Transactions on Intelligence Technology\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/cit2.70007\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"CAAI Transactions on Intelligence Technology","FirstCategoryId":"94","ListUrlMain":"https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/cit2.70007","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Sep-NMS: Unlocking the Aptitude of Two-Stage Referring Expression Comprehension
Referring expression comprehension (REC) aims to locate a specific region in an image described by a natural language. Existing two-stage methods generate multiple candidate proposals in the first stage, followed by selecting one of these proposals as the grounding result in the second stage. Nevertheless, the number of candidate proposals generated in the first stage significantly exceeds ground truth and the recall of critical objects is inadequate, thereby enormously limiting the overall network performance. To address the above issues, the authors propose an innovative method termed Separate Non-Maximum Suppression (Sep-NMS) for two-stage REC. Particularly, Sep-NMS models information from the two stages independently and collaboratively, ultimately achieving an overall improvement in comprehension and identification of the target objects. Specifically, the authors propose a Ref-Relatedness module for filtering referent proposals rigorously, decreasing the redundancy of referent proposals. A Relatedness module based on robust multimodal pre-trained encoders is built to precisely assess the relevance between language and proposals to improve the recall of critical objects. It is worth mentioning that the authors are the pioneers in utilising a multimodal pre-training model for proposal filtering in the first stage. Moreover, an Information Fusion module is designed to effectively amalgamate the multimodal information across two stages, ensuring maximum utilisation of the available information. Extensive experiments demonstrate that the approach achieves competitive performance with previous state-of-the-art methods. The datasets used are publicly available: RefCOCO, RefCOCO+: https://doi.org/10.1007/978-3-319-46475-6_5 and RefCOCOg: https://doi.org/10.1109/CVPR.2016.9.
期刊介绍:
CAAI Transactions on Intelligence Technology is a leading venue for original research on the theoretical and experimental aspects of artificial intelligence technology. We are a fully open access journal co-published by the Institution of Engineering and Technology (IET) and the Chinese Association for Artificial Intelligence (CAAI) providing research which is openly accessible to read and share worldwide.