CoCNet: A Chain-of-Clues framework for zero-shot referring expression comprehension

IF 7.5 · CAS Region 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Xuanyu Zhou, Simin Zhang, Zengcan Xue, Xiao Lu, Tianxing Xiao, Lianhua Wu, Lin Liu, Xuan Li
{"title":"CoCNet: A Chain-of-Clues framework for zero-shot referring expression comprehension","authors":"Xuanyu Zhou ,&nbsp;Simin Zhang ,&nbsp;Zengcan Xue ,&nbsp;Xiao Lu ,&nbsp;Tianxing Xiao ,&nbsp;Lianhua Wu ,&nbsp;Lin Liu ,&nbsp;Xuan Li","doi":"10.1016/j.eswa.2025.127633","DOIUrl":null,"url":null,"abstract":"<div><div>Zero-shot learning enables the reference expression comprehension (REC) model to adapt to a wide range of visual domains without training. However, the ambiguity of linguistic expression leads to the lack of a clear subject. Moreover, existing methods have not fully utilized the visual context and spatial information, resulting in low accuracy and robustness in complex scenes. To address these problems, we propose a Chain-of-Clues framework (CoCNet) to exploit multiple clues for zero-shot REC task to solve the inference confusion step by step. First, <strong>the subject clue module</strong> employs the strong ability of large language models (LLMs) to reason about the category in expression, which enhances the clarity of linguistic expression. In <strong>the attribute clue module</strong>, we propose the dual-track scoring which highlights the proposal by blurring its surroundings and enhances contextual sensitivity by blurring the proposal. Additionally, <strong>the spatial clue module</strong> utilizes a series of Gaussian-based soft heuristic rules to model the location words and the spatial relationship of the image. Experimental results show that CoCNet exhibits strong generalization capabilities in complex scenes. It significantly outperforms previous state-of-the-art zero-shot methods on RefCOCO, RefCOCO+, RefCOCOg, Flickr-Split-0 and Flickr-Split-1. Our code is released at <span><span>https://github.com/CoCNetHub/CoCNet-main</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50461,"journal":{"name":"Expert Systems with Applications","volume":"282 ","pages":"Article 127633"},"PeriodicalIF":7.5000,"publicationDate":"2025-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Expert Systems with Applications","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0957417425012552","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Zero-shot learning enables a referring expression comprehension (REC) model to adapt to a wide range of visual domains without training. However, the ambiguity of linguistic expressions often leaves the subject of the expression unclear. Moreover, existing methods do not fully exploit visual context and spatial information, resulting in low accuracy and robustness in complex scenes. To address these problems, we propose a Chain-of-Clues framework (CoCNet) that exploits multiple clues for the zero-shot REC task, resolving inference confusion step by step. First, the subject clue module leverages the strong reasoning ability of large language models (LLMs) to infer the subject category in the expression, which enhances the clarity of the linguistic expression. In the attribute clue module, we propose dual-track scoring, which highlights a proposal by blurring its surroundings and enhances contextual sensitivity by blurring the proposal itself. Additionally, the spatial clue module uses a series of Gaussian-based soft heuristic rules to model location words and the spatial relationships in the image. Experimental results show that CoCNet exhibits strong generalization in complex scenes and significantly outperforms previous state-of-the-art zero-shot methods on RefCOCO, RefCOCO+, RefCOCOg, Flickr-Split-0 and Flickr-Split-1. Our code is released at https://github.com/CoCNetHub/CoCNet-main.
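The abstract describes the attribute and spatial clue modules only at a high level. The two snippets below are minimal illustrative sketches, not the authors' released code: the first shows how a dual-track score of the kind described (blur the surroundings to highlight a proposal, blur the proposal to probe its context) could be computed with any off-the-shelf image-text scorer such as CLIP. The blur radius, the fusion weight `alpha`, and the `score_fn` callable are assumptions; the paper's actual fusion rule is not reproduced here.

```python
# Minimal sketch of a dual-track score for one region proposal (not the
# authors' implementation). Track 1 keeps the proposal sharp and blurs its
# surroundings; track 2 blurs the proposal and keeps the context sharp.
from PIL import ImageFilter


def dual_track_score(image, box, expression, score_fn,
                     blur_radius=12, alpha=0.5):
    """Return a combined score for how well `box` matches `expression`.

    image       -- PIL.Image of the full scene
    box         -- (left, top, right, bottom) proposal in pixel coordinates
    expression  -- referring expression text
    score_fn    -- callable (PIL.Image, str) -> float, e.g. CLIP similarity
    blur_radius -- Gaussian blur radius (illustrative value)
    alpha       -- fusion weight between the two tracks (assumption; the
                   abstract does not specify this simple weighted sum)
    """
    blurred = image.filter(ImageFilter.GaussianBlur(blur_radius))

    # Track 1: highlight the proposal by blurring everything around it.
    highlight = blurred.copy()
    highlight.paste(image.crop(box), box)

    # Track 2: probe contextual sensitivity by blurring the proposal itself.
    context = image.copy()
    context.paste(blurred.crop(box), box)

    return alpha * score_fn(highlight, expression) + \
        (1 - alpha) * score_fn(context, expression)
```

The second sketch shows one way a Gaussian-based soft rule could turn a location word into a score for a proposal; the toy vocabulary, anchor points, and `sigma` value are likewise illustrative assumptions rather than the paper's heuristic rules.

```python
import math


def gaussian_location_score(box, image_size, word, sigma=0.25):
    """Soft score in (0, 1] for a single location word (illustrative rule).

    box        -- (left, top, right, bottom) proposal in pixel coordinates
    image_size -- (width, height) of the image
    word       -- 'left', 'right', 'top' or 'bottom' (toy vocabulary)
    sigma      -- spread of the Gaussian in normalised coordinates
    """
    w, h = image_size
    cx = (box[0] + box[2]) / (2 * w)   # normalised centre x in [0, 1]
    cy = (box[1] + box[3]) / (2 * h)   # normalised centre y in [0, 1]

    # Illustrative anchors: the score peaks when the proposal centre sits
    # at the corresponding image edge and decays smoothly away from it.
    anchors = {'left': (0.0, cy), 'right': (1.0, cy),
               'top': (cx, 0.0), 'bottom': (cx, 1.0)}
    ax, ay = anchors[word]
    dist_sq = (cx - ax) ** 2 + (cy - ay) ** 2
    return math.exp(-dist_sq / (2 * sigma ** 2))
```

For example, a box near the left edge of a 640x480 image, say (10, 40, 60, 90), scores roughly 0.98 for "left" and far lower for "right", so the rule acts as a soft prior rather than a hard filter.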
Source journal
Expert Systems with Applications (Engineering Technology: Electrical & Electronic Engineering)
CiteScore: 13.80
Self-citation rate: 10.60%
Publication volume: 2045
Average review time: 8.7 months
Journal introduction: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.