CoCNet: A Chain-of-Clues framework for zero-shot referring expression comprehension

IF 7.5 · Zone 1 (Computer Science) · JCR Q1, Computer Science, Artificial Intelligence

Expert Systems with Applications · Pub Date: 2025-04-19 · DOI: 10.1016/j.eswa.2025.127633

Xuanyu Zhou, Simin Zhang, Zengcan Xue, Xiao Lu, Tianxing Xiao, Lianhua Wu, Lin Liu, Xuan Li

Expert Systems with Applications, Volume 282, Article 127633. https://www.sciencedirect.com/science/article/pii/S0957417425012552

Citations: 0

Abstract

Zero-shot learning enables referring expression comprehension (REC) models to adapt to a wide range of visual domains without training. However, ambiguous linguistic expressions often lack a clear subject. Moreover, existing methods do not fully exploit visual context and spatial information, which lowers accuracy and robustness in complex scenes. To address these problems, we propose a Chain-of-Clues framework (CoCNet) that exploits multiple clues to resolve inference confusion in the zero-shot REC task step by step. First, the subject clue module uses the reasoning ability of large language models (LLMs) to reason about the subject category in the expression, which makes the expression clearer. In the attribute clue module, we propose dual-track scoring, which highlights the proposal by blurring its surroundings and enhances contextual sensitivity by blurring the proposal. Additionally, the spatial clue module uses a series of Gaussian-based soft heuristic rules to model location words and spatial relationships in the image. Experimental results show that CoCNet generalizes well in complex scenes. It significantly outperforms previous state-of-the-art zero-shot methods on RefCOCO, RefCOCO+, RefCOCOg, Flickr-Split-0 and Flickr-Split-1. Our code is released at https://github.com/CoCNetHub/CoCNet-main.
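The abstract gives no formulas for the attribute and spatial clue modules, so the following is only a minimal sketch of the two ideas it names, not the authors' implementation. It assumes a grayscale image stored as a list of rows, a box blur as a stand-in for whatever blur CoCNet uses, and a Gaussian falloff around a hypothetical anchor position for each location word. The function names, the `anchors` table, and the `sigma` default are all illustrative assumptions.

```python
import math


def box_blur(img, radius=1):
    """Box-blur a 2D grayscale image (list of rows); a stand-in for a real blur."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Average over the neighbourhood, clipped at the image border.
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            vals = [img[j][i] for j in ys for i in xs]
            out[y][x] = sum(vals) / len(vals)
    return out


def dual_track_views(img, box, radius=1):
    """Build the two views described in the abstract for box (x0, y0, x1, y1).

    focus:   proposal kept sharp, surroundings blurred.
    context: proposal blurred, surroundings kept sharp.
    """
    x0, y0, x1, y1 = box
    h, w = len(img), len(img[0])
    blurred = box_blur(img, radius)

    def inside(x, y):
        return x0 <= x < x1 and y0 <= y < y1

    focus = [[img[y][x] if inside(x, y) else blurred[y][x] for x in range(w)]
             for y in range(h)]
    context = [[blurred[y][x] if inside(x, y) else img[y][x] for x in range(w)]
               for y in range(h)]
    return focus, context


def soft_location_score(box, image_width, word, sigma=0.25):
    """Gaussian soft score for a horizontal location word (hypothetical rule)."""
    # Normalized horizontal center of the proposal, in [0, 1].
    cx = (box[0] + box[2]) / 2.0 / image_width
    # Assumed anchor positions; the paper's actual rules are not given here.
    anchors = {"left": 0.0, "middle": 0.5, "right": 1.0}
    return math.exp(-((cx - anchors[word]) ** 2) / (2.0 * sigma ** 2))
```

In a full pipeline each view would presumably be scored against the expression by a vision-language model and the scores combined with the spatial score. How CoCNet combines them is not stated in the abstract, so that step is left out of the sketch.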




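Source journal

Expert Systems with Applications (Engineering & Technology; Engineering: Electrical & Electronic)

CiteScore: 13.80
Self-citation rate: 10.60%
Articles published: 2045
Review time: 8.7 months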

About the journal: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.