Multi-Keys Attention Network for Image Captioning

Impact Factor 4.3 · CAS Tier 3 (Computer Science) · JCR Q2 (Computer Science, Artificial Intelligence)
Ziqian Yang, Hui Li, Renrong Ouyang, Quan Zhang, Jimin Xiao
Journal: Cognitive Computation · DOI: 10.1007/s12559-023-10231-7 · Published: 2024-01-24 · Citations: 0

Abstract

The image captioning task aims to generate descriptions of the main content of images. Recently, the Transformer with a self-attention mechanism has been widely used for image captioning, where the attention mechanism helps the encoder generate image region features and guides caption output in the decoder. However, the vanilla decoder uses a simple conventional self-attention mechanism, resulting in captions with poor semantic information and incomplete sentence logic. In this paper, we propose a novel attention block, the Multi-Keys attention block, which fully enhances the relevance between explicit and implicit semantic information. Technically, the Multi-Keys attention block first concatenates the key vector and the value vector and spreads the result into both an explicit channel and an implicit channel. Then, a “related value” carrying richer semantic information is generated by applying element-wise multiplication to the two channels. Moreover, to perfect the sentence logic, a reverse key vector carrying another information flow is residually connected to the final attention result. We also apply the Multi-Keys attention block to the sentence decoder of the Transformer, yielding the Multi-Keys Transformer (MKTrans). Experiments demonstrate that MKTrans achieves a 138.6% CIDEr score on the MS COCO “Karpathy” offline test split. The proposed Multi-Keys attention block and MKTrans model prove more effective than, and superior to, state-of-the-art methods.
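
The abstract describes the block only at a high level. Below is a minimal, hypothetical PyTorch sketch of how such a Multi-Keys attention block might look, assuming that the explicit and implicit channels are two linear projections over the concatenated key-value features, that their element-wise product (the “related value”) replaces the value matrix in scaled dot-product attention, and that the “reverse key” flow is a sequence-reversed key projection added residually to the output. All module and variable names are our own; the authors' exact formulation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiKeysAttention(nn.Module):
    """Hypothetical sketch of a Multi-Keys self-attention block.

    Assumptions (not confirmed by the paper): the explicit/implicit channels
    are linear projections of the concatenated key-value features, their
    element-wise product ("related value") replaces the value matrix in
    scaled dot-product attention, and the "reverse key" flow is a
    sequence-reversed key projection added residually to the output.
    """

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Explicit and implicit channels over the concatenated key-value features.
        self.explicit_proj = nn.Linear(2 * d_model, d_model)
        self.implicit_proj = nn.Linear(2 * d_model, d_model)
        # Separate "reverse key" flow, residually connected to the attention result.
        self.reverse_key_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def _split_heads(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape
        return x.view(b, n, self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence, d_model) word features in the sentence decoder.
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)

        # 1) Concatenate key and value, spread into explicit/implicit channels,
        #    and multiply element-wise to obtain the "related value".
        kv = torch.cat([k, v], dim=-1)
        related_value = self.explicit_proj(kv) * self.implicit_proj(kv)

        # 2) Scaled dot-product attention over the enriched "related value"
        #    instead of the raw values (causal masking omitted for brevity).
        qh, kh, rh = map(self._split_heads, (q, k, related_value))
        scores = qh @ kh.transpose(-2, -1) / self.d_head ** 0.5
        attended = F.softmax(scores, dim=-1) @ rh
        b, _, n, _ = attended.shape
        attended = attended.transpose(1, 2).reshape(b, n, -1)

        # 3) Residual connection from the (assumed sequence-reversed) key flow.
        reverse_key = self.reverse_key_proj(k.flip(dims=[1]))
        return self.out_proj(attended + reverse_key)


# Usage: one decoder self-attention step over a batch of caption prefixes.
block = MultiKeysAttention(d_model=512, n_heads=8)
words = torch.randn(2, 20, 512)      # (batch, caption length, d_model)
out = block(words)                   # -> torch.Size([2, 20, 512])
```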


Source Journal

Cognitive Computation (Computer Science, Artificial Intelligence; Neurosciences)
CiteScore: 9.30
Self-citation rate: 3.70%
Articles per year: 116
Review time: >12 weeks
About the journal: Cognitive Computation is an international, peer-reviewed, interdisciplinary journal that publishes cutting-edge articles describing original basic and applied work involving biologically inspired computational accounts of all aspects of natural and artificial cognitive systems. It provides a platform for the dissemination of research, current practices, and future trends in the emerging discipline of cognitive computation, bridging the gap between the life sciences, social sciences, engineering, physical and mathematical sciences, and humanities.