{"title":"下一代图像字幕:从变形器到多模态大语言模型的方法和新挑战的调查","authors":"Huda Diab Abdulgalil, Otman A. Basir","doi":"10.1016/j.nlp.2025.100159","DOIUrl":null,"url":null,"abstract":"<div><div>The widespread availability of visual data on the Internet has fueled a significant interest in image-to-text captioning systems. Automated image captioning remains a challenging multimodal analytics task, integrating advances in both Computer Vision (CV) and Natural Language Processing (NLP) to understand image content and generate semantically meaningful textual descriptions. Modern deep learning-based approaches have supplanted traditional approaches in image captioning, leading to more efficient and sophisticated models. The development of attention mechanisms and transformer-based architectures has further enhanced the modeling of both language and visual data. Despite these gains, challenges such as long-tailed object recognition, bias in training data, and shortcomings in evaluation metrics constrain the capabilities of current models. Furthermore, an important breakthrough has been made with the recent emergence of Multimodal Large Language Models (MLLMs). By incorporating textual and visual data, MLLMs provide improved captioning flexibility, generative capabilities, and reasoning. However, these models introduce new challenges, including faithfulness, grounding, and computational cost. Although relatively few studies have comprehensively surveyed these developments, this paper provides a thorough analysis of Transformer-based captioning approaches, investigates the shift to MLLMs, and discusses associated challenges and opportunities. We also present a performance comparison of the latest models on the MS-COCO benchmark and conclude with perspectives on potential future research directions.</div></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"12 ","pages":"Article 100159"},"PeriodicalIF":0.0000,"publicationDate":"2025-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Next-generation image captioning: A survey of methodologies and emerging challenges from transformers to Multimodal Large Language Models\",\"authors\":\"Huda Diab Abdulgalil, Otman A. Basir\",\"doi\":\"10.1016/j.nlp.2025.100159\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>The widespread availability of visual data on the Internet has fueled a significant interest in image-to-text captioning systems. Automated image captioning remains a challenging multimodal analytics task, integrating advances in both Computer Vision (CV) and Natural Language Processing (NLP) to understand image content and generate semantically meaningful textual descriptions. Modern deep learning-based approaches have supplanted traditional approaches in image captioning, leading to more efficient and sophisticated models. The development of attention mechanisms and transformer-based architectures has further enhanced the modeling of both language and visual data. Despite these gains, challenges such as long-tailed object recognition, bias in training data, and shortcomings in evaluation metrics constrain the capabilities of current models. Furthermore, an important breakthrough has been made with the recent emergence of Multimodal Large Language Models (MLLMs). By incorporating textual and visual data, MLLMs provide improved captioning flexibility, generative capabilities, and reasoning. 
However, these models introduce new challenges, including faithfulness, grounding, and computational cost. Although relatively few studies have comprehensively surveyed these developments, this paper provides a thorough analysis of Transformer-based captioning approaches, investigates the shift to MLLMs, and discusses associated challenges and opportunities. We also present a performance comparison of the latest models on the MS-COCO benchmark and conclude with perspectives on potential future research directions.</div></div>\",\"PeriodicalId\":100944,\"journal\":{\"name\":\"Natural Language Processing Journal\",\"volume\":\"12 \",\"pages\":\"Article 100159\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2025-06-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Natural Language Processing Journal\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2949719125000354\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Natural Language Processing Journal","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2949719125000354","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
The widespread availability of visual data on the Internet has fueled a significant interest in image-to-text captioning systems. Automated image captioning remains a challenging multimodal analytics task, integrating advances in both Computer Vision (CV) and Natural Language Processing (NLP) to understand image content and generate semantically meaningful textual descriptions. Modern deep learning-based approaches have supplanted traditional approaches in image captioning, leading to more efficient and sophisticated models. The development of attention mechanisms and transformer-based architectures has further enhanced the modeling of both language and visual data. Despite these gains, challenges such as long-tailed object recognition, bias in training data, and shortcomings in evaluation metrics constrain the capabilities of current models. Furthermore, an important breakthrough has been made with the recent emergence of Multimodal Large Language Models (MLLMs). By incorporating textual and visual data, MLLMs provide improved captioning flexibility, generative capabilities, and reasoning. However, these models introduce new challenges, including faithfulness, grounding, and computational cost. Although relatively few studies have comprehensively surveyed these developments, this paper provides a thorough analysis of Transformer-based captioning approaches, investigates the shift to MLLMs, and discusses associated challenges and opportunities. We also present a performance comparison of the latest models on the MS-COCO benchmark and conclude with perspectives on potential future research directions.
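To make the transformer-based family of approaches discussed in the abstract concrete, the sketch below shows the core pattern such models share: an autoregressive text decoder that cross-attends to patch features produced by a visual encoder. This is a minimal illustration in PyTorch, assuming pre-extracted image features already projected to the decoder width; all module names, hyperparameters, and the greedy decoding loop are hypothetical choices for exposition, not the design of any specific model surveyed in the paper.

```python
# Minimal sketch of attention-based image captioning: a transformer
# decoder generates caption tokens autoregressively while its
# cross-attention layers attend to visual patch features.
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, n_heads=8, n_layers=6, max_len=40):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, image_feats):
        # tokens: (B, T) caption token ids.
        # image_feats: (B, N, d_model) patch features from a visual
        # encoder (e.g., a ViT), assumed already projected to d_model.
        B, T = tokens.shape
        pos = torch.arange(T, device=tokens.device).unsqueeze(0)
        x = self.token_emb(tokens) + self.pos_emb(pos)
        # Causal mask: each position may only attend to earlier tokens.
        causal = torch.triu(
            torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1
        )
        # Cross-attention to image_feats happens inside each decoder layer.
        h = self.decoder(tgt=x, memory=image_feats, tgt_mask=causal)
        return self.lm_head(h)  # (B, T, vocab_size) next-token logits

# Toy usage: greedy decoding from dummy image features (hypothetical
# vocabulary and special-token ids, untrained weights).
vocab_size, bos, eos = 10000, 1, 2
model = CaptionDecoder(vocab_size).eval()
image_feats = torch.randn(1, 196, 512)  # e.g., a 14x14 ViT patch grid
tokens = torch.tensor([[bos]])
with torch.no_grad():
    for _ in range(20):
        logits = model(tokens, image_feats)
        nxt = logits[:, -1].argmax(-1, keepdim=True)
        tokens = torch.cat([tokens, nxt], dim=1)
        if nxt.item() == eos:
            break
```

The MLLM-based approaches the paper turns to next can be read as replacing this task-specific decoder with a pretrained large language model conditioned on projected visual tokens, which is the source of both the added flexibility and the faithfulness, grounding, and computational-cost concerns noted in the abstract.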