A Position-Aware Transformer for Image Captioning

Impact Factor: 2.0 · CAS Tier 4 (Computer Science) · JCR Q3 (Computer Science, Information Systems)
Zelin Deng, Bo Zhou, Pei He, Jian Huang, O. Alfarraj, Amr M. Tolba
{"title":"A Position-Aware Transformer for Image Captioning","authors":"Zelin Deng, Bo Zhou, Pei He, Jian Huang, O. Alfarraj, Amr M. Tolba","doi":"10.32604/cmc.2022.019328","DOIUrl":null,"url":null,"abstract":": Image captioning aims to generate a corresponding description of an image. In recent years, neural encoder-decoder models have been the dominant approaches, in which the Convolutional Neural Network (CNN) and Long Short Term Memory (LSTM) are used to translate an image into a natural language description. Among these approaches, the visual attention mechanisms are widely used to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. However, most conventional visual attention mechanisms are based on high-level image features, ignoring the effects of other image features, and giving insufficient consideration to the relative positions between image features. In this work, we propose a Position-Aware Transformer model with image-feature attention and position-aware attention mechanisms for the above problems. The image-feature attention firstly extracts multi-level features by using Feature Pyramid Network (FPN), then utilizes the scaled-dot-product to fuse these features, which enables our model to detect objects of different scales in the image more effectively without increasing parameters. In the position-aware attention mechanism, the relative positions between image features are obtained at first, afterwards the relative positions are incorporated into the originalimage features to generate captions more accurately. Experiments are carried out on the MSCOCO dataset and our approach achieves competitive BLEU-4, METEOR, ROUGE-L, CIDEr scores compared with some state-of-the-art approaches, demonstrating the effectiveness of our approach.","PeriodicalId":10440,"journal":{"name":"Cmc-computers Materials & Continua","volume":"56 1","pages":""},"PeriodicalIF":2.0000,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Cmc-computers Materials & Continua","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.32604/cmc.2022.019328","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 2

Abstract

Image captioning aims to generate a natural-language description of an image. In recent years, neural encoder-decoder models have been the dominant approach, in which a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network are used to translate an image into a natural language description. Among these approaches, visual attention mechanisms are widely used to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. However, most conventional visual attention mechanisms rely only on high-level image features, ignoring the contributions of other feature levels and giving insufficient consideration to the relative positions between image features. In this work, we propose a Position-Aware Transformer model with image-feature attention and position-aware attention mechanisms to address these problems. The image-feature attention first extracts multi-level features using a Feature Pyramid Network (FPN), then fuses them with scaled dot-product attention, which enables our model to detect objects of different scales in the image more effectively without increasing the number of parameters. The position-aware attention mechanism first computes the relative positions between image features, then incorporates them into the original image features so that captions are generated more accurately. Experiments on the MSCOCO dataset show that our approach achieves competitive BLEU-4, METEOR, ROUGE-L, and CIDEr scores compared with state-of-the-art approaches, demonstrating its effectiveness.
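The page carries only the abstract, not the authors' code, so the two sketches below are illustrative readings of the mechanisms it describes. The first is a minimal PyTorch sketch of scaled dot-product fusion over multi-level FPN features; the function name fuse_fpn_levels and the single-head formulation are assumptions, and it presumes the features have already been projected to a shared dimension.

```python
# Minimal sketch (not the authors' released code): fusing multi-level FPN
# features with scaled dot-product attention. Assumes all features were
# already projected to a shared dimension d.
import math
import torch
import torch.nn.functional as F

def fuse_fpn_levels(queries: torch.Tensor, level_feats: torch.Tensor) -> torch.Tensor:
    """queries: (N, d) region features used as queries.
    level_feats: (M, d) features gathered from several FPN levels,
    used as both keys and values. Returns (N, d) fused features."""
    d = queries.size(-1)
    scores = queries @ level_feats.t() / math.sqrt(d)  # (N, M) similarities
    weights = F.softmax(scores, dim=-1)                # attention over level features
    return weights @ level_feats                       # (N, d) fused output
```

Note that this fusion is a parameter-free softmax over dot products, which is consistent with the abstract's claim of improving multi-scale detection without increasing parameters. For the position-aware attention, one plausible reading is a relative-geometry bias added to the attention logits, in the spirit of Relation Networks; the class name PositionAwareAttention and the (cx, cy, w, h) box parameterization below are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class PositionAwareAttention(nn.Module):
    """Hypothetical sketch: inject pairwise relative box geometry into
    single-head attention as an additive bias on the logits."""
    def __init__(self, d_model: int, d_geo: int = 64):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # Maps the 4-dim relative geometry of each region pair to a scalar bias.
        self.geo = nn.Sequential(nn.Linear(4, d_geo), nn.ReLU(),
                                 nn.Linear(d_geo, 1))

    def forward(self, feats: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # feats: (N, d_model); boxes: (N, 4) as (cx, cy, w, h), w and h > 0.
        cx, cy, w, h = boxes.unbind(-1)
        # Log-scaled pairwise offsets and size ratios (Relation-Network style).
        dx = torch.log((cx[:, None] - cx[None, :]).abs() / w[:, None] + 1e-6)
        dy = torch.log((cy[:, None] - cy[None, :]).abs() / h[:, None] + 1e-6)
        dw = torch.log(w[None, :] / w[:, None])
        dh = torch.log(h[None, :] / h[:, None])
        rel = torch.stack([dx, dy, dw, dh], dim=-1)   # (N, N, 4)
        bias = self.geo(rel).squeeze(-1)              # (N, N) geometry bias

        d = feats.size(-1)
        logits = self.q(feats) @ self.k(feats).t() / d ** 0.5
        attn = torch.softmax(logits + bias, dim=-1)   # geometry-aware weights
        return attn @ self.v(feats)                   # (N, d_model)
```

Under these assumptions, usage would be e.g. out = PositionAwareAttention(512)(feats, boxes), where feats come from the fusion step above and boxes are the detector's region coordinates.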
Source Journal

CMC-Computers Materials & Continua (Engineering/Technology: Materials Science, Multidisciplinary)
CiteScore: 5.30 · Self-citation rate: 19.40% · Articles per year: 345 · Review time: 1 month

Journal description: This journal publishes original research papers in the areas of computer networks, artificial intelligence, big data management, software engineering, multimedia, cyber security, internet of things, materials genome, integrated materials science, data analysis, modeling, and the design and manufacturing of modern functional and multifunctional materials. Novel high-performance computing methods, big data analysis, and artificial intelligence that advance material technologies are especially welcome.