Eye-movement-prompted large image captioning model

IF 7.5 · CAS Tier 1, Computer Science · JCR Q1, Computer Science, Artificial Intelligence
Zheng Yang , Bing Han , Xinbo Gao , Zhi-Hui Zhan
{"title":"Eye-movement-prompted large image captioning model","authors":"Zheng Yang ,&nbsp;Bing Han ,&nbsp;Xinbo Gao ,&nbsp;Zhi-Hui Zhan","doi":"10.1016/j.patcog.2024.111097","DOIUrl":null,"url":null,"abstract":"<div><div>Pretrained large vision-language models have shown outstanding performance on the task of image captioning. However, owing to the insufficient decoding of image features, existing large models sometimes lose important information, such as objects, scenes, and their relationships. In addition, the complex “black-box” nature of these models makes their mechanisms difficult to explain. Research shows that humans learn richer representations than machines do, which inspires us to improve the accuracy and interpretability of large image captioning models by combining human observation patterns. We built a new dataset, called saliency in image captioning (SIC), to explore relationships between human vision and language representation. One thousand images with rich context information were selected as image data of SIC. Each image was annotated with five caption labels and five eye-movement labels. Through analysis of the eye-movement data, we found that humans efficiently captured comprehensive information for image captioning during their observations. Therefore, we propose an eye-movement-prompted large image captioning model, which is embedded with two carefully designed modules: the eye-movement simulation module (EMS) and the eye-movement analyzing module (EMA). EMS combines the human observation pattern to simulate eye-movement features, including the positions and scan paths of eye fixations. EMA is a graph neural network (GNN) based module, which decodes graphical eye-movement data and abstracts image features as a directed graph. More accurate descriptions can be predicted by decoding the generated graph. Extensive experiments were conducted on the MS-COCO and NoCaps datasets to validate our model. The experimental results showed that our network was interpretable, and could achieve superior results compared with state-of-the-art methods, <em>i.e.</em>, 84.2% BLEU-4 and 145.1% CIDEr-D on MS-COCO Karpathy test split, indicating its strong potential for use in image captioning.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"159 ","pages":"Article 111097"},"PeriodicalIF":7.5000,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0031320324008483","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Pretrained large vision-language models have shown outstanding performance on the task of image captioning. However, owing to the insufficient decoding of image features, existing large models sometimes lose important information, such as objects, scenes, and their relationships. In addition, the complex “black-box” nature of these models makes their mechanisms difficult to explain. Research shows that humans learn richer representations than machines do, which inspires us to improve the accuracy and interpretability of large image captioning models by combining human observation patterns. We built a new dataset, called saliency in image captioning (SIC), to explore relationships between human vision and language representation. One thousand images with rich context information were selected as image data of SIC. Each image was annotated with five caption labels and five eye-movement labels. Through analysis of the eye-movement data, we found that humans efficiently captured comprehensive information for image captioning during their observations. Therefore, we propose an eye-movement-prompted large image captioning model, which is embedded with two carefully designed modules: the eye-movement simulation module (EMS) and the eye-movement analyzing module (EMA). EMS combines the human observation pattern to simulate eye-movement features, including the positions and scan paths of eye fixations. EMA is a graph neural network (GNN) based module, which decodes graphical eye-movement data and abstracts image features as a directed graph. More accurate descriptions can be predicted by decoding the generated graph. Extensive experiments were conducted on the MS-COCO and NoCaps datasets to validate our model. The experimental results showed that our network was interpretable, and could achieve superior results compared with state-of-the-art methods, i.e., 84.2% BLEU-4 and 145.1% CIDEr-D on MS-COCO Karpathy test split, indicating its strong potential for use in image captioning.
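As an illustration of the graph-based decoding idea described above, the following is a minimal sketch (not the authors' released code) of how a scan path of eye fixations could be represented as a directed graph and processed with one round of message passing. The `FixationGraphLayer` class, the feature dimension, and the toy scan path are assumptions made purely for this example.

```python
# Hypothetical sketch, in the spirit of the paper's EMA module: represent a scan
# path of eye fixations as a directed graph and run one message-passing step.
# Node features stand in for local image features at each fixation; edges follow
# the temporal scan-path order. This is illustrative, not the authors' code.
import torch
import torch.nn as nn

class FixationGraphLayer(nn.Module):
    """One directed message-passing step over a fixation scan-path graph."""
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(dim, dim)    # transform messages sent along edges
        self.upd = nn.GRUCell(dim, dim)   # update node states with aggregated messages

    def forward(self, node_feats, edge_index):
        # node_feats: (N, dim) features at each fixation (e.g. pooled image patches)
        # edge_index: (2, E) directed edges src -> dst following the scan path
        src, dst = edge_index
        messages = self.msg(node_feats[src])     # (E, dim) messages along edges
        agg = torch.zeros_like(node_feats)
        agg.index_add_(0, dst, messages)          # sum incoming messages per node
        return self.upd(agg, node_feats)          # (N, dim) updated node states

# Toy usage: five fixations visited in order 0 -> 1 -> 2 -> 3 -> 4.
dim = 16
fixation_feats = torch.randn(5, dim)              # placeholder local image features
edge_index = torch.tensor([[0, 1, 2, 3],
                           [1, 2, 3, 4]])         # directed scan-path edges
layer = FixationGraphLayer(dim)
out = layer(fixation_feats, edge_index)
print(out.shape)  # torch.Size([5, 16])
```

A full implementation along the lines of EMA would derive the node features from decoded image regions and the edges from the simulated eye-movement data; the details above are only a sketch of the directed message-passing pattern.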
Source Journal
Pattern Recognition
Category: Engineering & Technology; Engineering, Electrical & Electronic
CiteScore: 14.40
Self-citation rate: 16.20%
Articles published: 683
Review time: 5.6 months
About the journal: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas such as biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.