{"title":"用于参考图像分割的混合尺度跨模态融合网络","authors":"Xiong Pan , Xuemei Xie , Jianxiu Yang","doi":"10.1016/j.neucom.2024.128793","DOIUrl":null,"url":null,"abstract":"<div><div>Referring image segmentation aims to segment the target by a given language expression. Recently, the bottom-up fusion network utilizes language features to highlight the most relevant regions during the visual encoder stage. However, it is not comprehensive that establish only the relationship between pixels and words. To alleviate this problem, we propose a mixed-scale cross-modal fusion method that widens the interaction between vision and language. Specially, at each stage, pyramid pooling is used to augment visual perception and improve the interaction between visual and linguistic features, thereby highlighting relevant regions in the visual data. Additionally, we employ a simple multi-scale feature fusion module to effectively combine multi-scale aligned features. Experiments conducted on Standard RIS benchmarks demonstrate that the proposed method achieves favorable performance against state-of-the- art approaches. Moreover, we conducted experiments on different visual backbones respectively, and the proposed method yielded better and significantly improved performance results.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"614 ","pages":"Article 128793"},"PeriodicalIF":5.5000,"publicationDate":"2024-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Mixed-scale cross-modal fusion network for referring image segmentation\",\"authors\":\"Xiong Pan , Xuemei Xie , Jianxiu Yang\",\"doi\":\"10.1016/j.neucom.2024.128793\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Referring image segmentation aims to segment the target by a given language expression. Recently, the bottom-up fusion network utilizes language features to highlight the most relevant regions during the visual encoder stage. However, it is not comprehensive that establish only the relationship between pixels and words. To alleviate this problem, we propose a mixed-scale cross-modal fusion method that widens the interaction between vision and language. Specially, at each stage, pyramid pooling is used to augment visual perception and improve the interaction between visual and linguistic features, thereby highlighting relevant regions in the visual data. Additionally, we employ a simple multi-scale feature fusion module to effectively combine multi-scale aligned features. Experiments conducted on Standard RIS benchmarks demonstrate that the proposed method achieves favorable performance against state-of-the- art approaches. 
Moreover, we conducted experiments on different visual backbones respectively, and the proposed method yielded better and significantly improved performance results.</div></div>\",\"PeriodicalId\":19268,\"journal\":{\"name\":\"Neurocomputing\",\"volume\":\"614 \",\"pages\":\"Article 128793\"},\"PeriodicalIF\":5.5000,\"publicationDate\":\"2024-10-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Neurocomputing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0925231224015649\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231224015649","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Mixed-scale cross-modal fusion network for referring image segmentation
Referring image segmentation aims to segment the target object described by a given language expression. Recently, bottom-up fusion networks have utilized language features to highlight the most relevant regions during the visual encoding stage. However, establishing only the relationship between pixels and words is not comprehensive. To alleviate this problem, we propose a mixed-scale cross-modal fusion method that widens the interaction between vision and language. Specifically, at each stage, pyramid pooling is used to augment visual perception and improve the interaction between visual and linguistic features, thereby highlighting the relevant regions in the visual data. Additionally, we employ a simple multi-scale feature fusion module to effectively combine the multi-scale aligned features. Experiments conducted on standard RIS benchmarks demonstrate that the proposed method achieves favorable performance against state-of-the-art approaches. Moreover, we conducted experiments on different visual backbones, and the proposed method yielded consistently and significantly improved results.
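To make the described fusion idea more concrete, the sketch below shows one plausible way a single cross-modal fusion stage with pyramid pooling could be organized, assuming PyTorch. The module name PyramidCrossModalFusion, the pooling sizes, the attention-based region-word interaction, and all feature dimensions are illustrative assumptions, not the authors' published implementation.

# Hypothetical sketch of one mixed-scale cross-modal fusion stage (not the paper's exact code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidCrossModalFusion(nn.Module):
    def __init__(self, vis_dim=512, lang_dim=768, pool_sizes=(1, 2, 3, 6)):
        super().__init__()
        self.pool_sizes = pool_sizes
        self.vis_proj = nn.Conv2d(vis_dim, vis_dim, kernel_size=1)
        self.lang_proj = nn.Linear(lang_dim, vis_dim)
        # Cross-attention: pooled visual tokens attend to word-level language features.
        self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads=8, batch_first=True)
        self.out_proj = nn.Conv2d(vis_dim * 2, vis_dim, kernel_size=1)

    def forward(self, vis_feat, lang_feat):
        # vis_feat:  (B, C, H, W) visual features from one encoder stage
        # lang_feat: (B, L, lang_dim) word-level language features
        B, C, H, W = vis_feat.shape
        lang = self.lang_proj(lang_feat)                       # (B, L, C)

        fused_levels = []
        for s in self.pool_sizes:
            pooled = F.adaptive_avg_pool2d(vis_feat, s)        # (B, C, s, s) pyramid level
            tokens = pooled.flatten(2).transpose(1, 2)         # (B, s*s, C) region tokens
            attended, _ = self.cross_attn(tokens, lang, lang)  # region-word interaction
            attended = attended.transpose(1, 2).reshape(B, C, s, s)
            fused_levels.append(F.interpolate(attended, size=(H, W),
                                              mode='bilinear', align_corners=False))

        mixed = torch.stack(fused_levels, dim=0).sum(0)        # combine all pyramid levels
        out = self.out_proj(torch.cat([self.vis_proj(vis_feat), mixed], dim=1))
        return out                                             # language-aware visual features

Under these assumptions, one such block would sit after each visual encoder stage, and the resulting multi-scale, language-aware features would then be combined by a multi-scale feature fusion module before the segmentation head.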
Journal introduction:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.