Attention-based transformer models for image captioning across languages: An in-depth survey and evaluation
Authors: Israa A. Albadarneh, Bassam H. Hammo, Omar S. Al-Kadi
Journal: Computer Science Review, Volume 58, Article 100766 (published 2025-06-02)
DOI: 10.1016/j.cosrev.2025.100766
URL: https://www.sciencedirect.com/science/article/pii/S1574013725000425
Citations: 0
Abstract
Image captioning involves generating textual descriptions from input images, bridging the gap between computer vision and natural language processing. Recent advancements in transformer-based models have significantly improved caption generation by leveraging attention mechanisms for better scene understanding. While various surveys have explored deep learning-based approaches for image captioning, few have comprehensively analyzed attention-based transformer models across multiple languages. This survey reviews attention-based image captioning models, categorizing them into transformer-based, deep learning-based, and hybrid approaches. It explores benchmark datasets, discusses evaluation metrics such as BLEU, METEOR, CIDEr, and ROUGE, and highlights challenges in multilingual captioning. Additionally, this paper identifies key limitations of current models, including semantic inconsistencies, data scarcity in non-English languages, and limited reasoning ability. Finally, we outline future research directions, such as multimodal learning and real-time applications in AI-powered assistants, healthcare, and forensic analysis. This survey serves as a comprehensive reference for researchers aiming to advance the field of attention-based image captioning.
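To make the evaluation metrics named in the abstract concrete, the sketch below shows how caption-level BLEU scores are commonly computed against human reference captions using NLTK. This is an illustrative example only, not taken from the surveyed paper: the candidate caption, the references, and the choice of smoothing function are assumptions for demonstration. Metrics such as METEOR, CIDEr, and ROUGE-L are typically obtained from dedicated captioning toolkits (e.g., the COCO caption evaluation code) rather than computed by hand.

```python
# Illustrative sketch: BLEU-1 through BLEU-4 for a single generated caption.
# The caption text here is hypothetical and chosen only for demonstration.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical generated caption and two human reference captions, pre-tokenized.
candidate = "a dog is running across the grass".split()
references = [
    "a dog runs across a grassy field".split(),
    "a brown dog is running on the grass".split(),
]

# Smoothing avoids zero scores when a higher-order n-gram has no overlap,
# which happens frequently with short captions.
smooth = SmoothingFunction().method1

# BLEU-n uses equal weights over 1..n-gram precisions, as commonly reported
# in image captioning papers.
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))
    score = sentence_bleu(references, candidate, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```

In practice, captioning systems report corpus-level scores over a full test split (e.g., MSCOCO) rather than a single sentence, but the per-caption computation above captures the core n-gram overlap idea the survey's metric discussion relies on.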
Journal description:
Computer Science Review, a publication dedicated to research surveys and expository overviews of open problems in computer science, targets a broad audience within the field seeking comprehensive insights into the latest developments. The journal welcomes articles from various fields as long as their content impacts the advancement of computer science. In particular, articles that review the application of well-known Computer Science methods to other areas are in scope only if these articles advance the fundamental understanding of those methods.