Multi-Perspective Cross-Modal Object Encoding for Referring Expression Comprehension

Jingcheng Ke, Jie Wen, Huiting Wang, Wen-Huang Cheng, Jia Wang

IEEE Transactions on Image Processing · Published 2025-10-16 · DOI: 10.1109/tip.2025.3620129
Citations: 0
Abstract
Referring expression comprehension (REC) is a crucial task in understanding how a given text description identifies a target object within an image. Existing two-stage REC methods have demonstrated strong performance due to their rational framework design. However, when encoding object candidates in an image, most two-stage methods rely exclusively on features extracted from pre-trained detectors, often neglecting the contextual relationships between an object and its neighboring elements. This limitation hinders the full capture of contextual and relational information, reducing the discriminative power of object representations and negatively impacting subsequent processing. In this paper, we propose two novel plug-and-adapt modules, the expression-guided label representation (ELR) module and the cross-modal calibrated semantic (CCS) module, designed to enhance two-stage REC methods. Specifically, the ELR module connects the noun phrases of the expression to the categorical labels of object candidates in the image, ensuring effective alignment between them. Guided by these connections, the CCS module represents each object candidate by integrating its features with those of neighboring candidates from multiple perspectives. This preserves the intrinsic information of each candidate while incorporating relational cues from other objects, enabling more precise embeddings and effective downstream processing in two-stage REC methods. Extensive experiments on six datasets demonstrate the importance of incorporating prior statistical knowledge, and detailed analysis shows that the proposed modules strengthen the alignment between image and text. As a result, our method achieves competitive performance and is compatible with most two-stage methods in the REC task. The code is available on GitHub: https://github.com/freedom6927/ELR_CCS.git.
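The abstract describes two ideas that lend themselves to a small illustration: aligning noun phrases with candidate category labels (the ELR idea) and refining each candidate's feature with expression-guided context from neighboring candidates (the CCS idea). The following is a minimal PyTorch sketch of those two ideas, not the authors' implementation (that is in the linked repository); the module names, dimensions, cosine-similarity alignment, and attention-based fusion are all assumptions made for illustration.

```python
# Minimal illustrative sketch of the ELR/CCS ideas from the abstract.
# NOT the authors' code; all design choices here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PhraseLabelAlignment(nn.Module):
    """ELR-style idea: score how well each noun-phrase embedding
    matches the categorical-label embedding of each object candidate."""

    def __init__(self, dim: int):
        super().__init__()
        self.phrase_proj = nn.Linear(dim, dim)
        self.label_proj = nn.Linear(dim, dim)

    def forward(self, phrases: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # phrases: (P, dim) noun-phrase embeddings; labels: (N, dim) label embeddings
        p = F.normalize(self.phrase_proj(phrases), dim=-1)
        l = F.normalize(self.label_proj(labels), dim=-1)
        return p @ l.t()  # (P, N) cosine alignment scores


class NeighborAwareEncoding(nn.Module):
    """CCS-style idea: refine each candidate's feature by attending to
    other candidates, weighted by how strongly the expression points at them."""

    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, feats: torch.Tensor, align: torch.Tensor) -> torch.Tensor:
        # feats: (N, dim) candidate features; align: (P, N) phrase-label scores
        bias = align.max(dim=0).values.softmax(dim=-1)  # (N,) expression guidance
        guided = feats * bias.unsqueeze(-1)             # expression-weighted neighbors
        ctx, _ = self.attn(feats.unsqueeze(0), guided.unsqueeze(0), guided.unsqueeze(0))
        # Concatenate each candidate's intrinsic feature with its relational
        # context, so the original information is preserved after fusion.
        return self.fuse(torch.cat([feats, ctx.squeeze(0)], dim=-1))


if __name__ == "__main__":
    dim, P, N = 256, 3, 10  # toy sizes: 3 noun phrases, 10 object candidates
    phrases, labels, feats = torch.randn(P, dim), torch.randn(N, dim), torch.randn(N, dim)
    align = PhraseLabelAlignment(dim)(phrases, labels)
    refined = NeighborAwareEncoding(dim)(feats, align)
    print(align.shape, refined.shape)  # torch.Size([3, 10]) torch.Size([10, 256])
```

In this toy version, the refined candidate embeddings would replace the raw detector features fed to a two-stage REC matcher; how the paper actually calibrates the cross-modal semantics is specified in the paper and repository, not here.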
About the Journal
The IEEE Transactions on Image Processing delves into groundbreaking theories, algorithms, and structures concerning the generation, acquisition, manipulation, transmission, scrutiny, and presentation of images, video, and multidimensional signals across diverse applications. Topics span mathematical, statistical, and perceptual aspects, encompassing modeling, representation, formation, coding, filtering, enhancement, restoration, rendering, halftoning, search, and analysis of images, video, and multidimensional signals. Pertinent applications range from image and video communications to electronic imaging, biomedical imaging, image and video systems, and remote sensing.