Tiny TR-CAP: A novel small-scale benchmark dataset for general-purpose image captioning tasks

IF 5.1 · CAS Zone 2 (Engineering & Technology) · JCR Q1 ENGINEERING, MULTIDISCIPLINARY
Abbas Memiş, Serdar Yıldız
{"title":"Tiny TR-CAP: A novel small-scale benchmark dataset for general-purpose image captioning tasks","authors":"Abbas Memiş ,&nbsp;Serdar Yıldız","doi":"10.1016/j.jestch.2025.102009","DOIUrl":null,"url":null,"abstract":"<div><div>In the last decade, the outstanding performance of deep learning has also led to a rapid and inevitable rise in automatic image captioning, as well as the need for large amounts of data. Although well-known, conventional and publicly available datasets have been proposed for the image captioning task, the lack of ground-truth caption data still remains a major challenge in the generation of accurate image captions. To address this issue, in this paper we introduced a novel image captioning benchmark dataset called Tiny TR-CAP, which consists of 1076 original images and 5380 handwritten captions (5 captions for each image with high diversity). The captions, which were translated into English using two web-based language translation APIs and a novel multilingual deep machine translation model, were tested against 11 state-of-the-art and prominent deep learning-based models, including CLIPCap, BLIP, BLIP2, FUSECAP, OFA, PromptCap, Kosmos2, MiniGPT4, LlaVA, BakLlaVA, and GIT. In the experimental studies, the accuracy statistics of the captions generated by the related models were reported in terms of the BLEU, METEOR, ROUGE-L, CIDEr, SPICE, and WMD captioning metrics, and their performance was evaluated comparatively. In the performance analysis, quite promising captioning performances were observed, and the best success rates were achieved with the OFA model with scores of 0.7097 BLEU-1, 0.5389 BLEU-2, 0.3940 BLEU-3, 0.2875 BLEU-4, 0.1797 METEOR, 0.4627 ROUGE-L, 0.2938 CIDEr, 0.0626 SPICE, and 0.4605 WMD. To support research studies in the field of image captioning, the image and caption sets of Tiny TR-CAP will also be publicly available on GitHub (<span><span>https://github.com/abbasmemis/tiny_TR-CAP</span><svg><path></path></svg></span>) for academic research purposes.</div></div>","PeriodicalId":48609,"journal":{"name":"Engineering Science and Technology-An International Journal-Jestech","volume":"64 ","pages":"Article 102009"},"PeriodicalIF":5.1000,"publicationDate":"2025-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Engineering Science and Technology-An International Journal-Jestech","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2215098625000643","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, MULTIDISCIPLINARY","Score":null,"Total":0}
Citations: 0

Abstract

In the last decade, the outstanding performance of deep learning has driven a rapid rise in automatic image captioning, along with a need for large amounts of data. Although well-known, conventional, and publicly available datasets have been proposed for the image captioning task, the lack of ground-truth caption data remains a major challenge in generating accurate image captions. To address this issue, in this paper we introduce a novel image captioning benchmark dataset called Tiny TR-CAP, which consists of 1076 original images and 5380 handwritten captions (five highly diverse captions per image). The captions, translated into English using two web-based language translation APIs and a novel multilingual deep machine translation model, served as ground truth for evaluating 11 state-of-the-art deep learning-based models: CLIPCap, BLIP, BLIP2, FUSECAP, OFA, PromptCap, Kosmos2, MiniGPT4, LlaVA, BakLlaVA, and GIT. In the experimental studies, the accuracy of the captions generated by these models was reported in terms of the BLEU, METEOR, ROUGE-L, CIDEr, SPICE, and WMD captioning metrics, and their performance was evaluated comparatively. The analysis showed quite promising captioning performance, with the best scores achieved by the OFA model: 0.7097 BLEU-1, 0.5389 BLEU-2, 0.3940 BLEU-3, 0.2875 BLEU-4, 0.1797 METEOR, 0.4627 ROUGE-L, 0.2938 CIDEr, 0.0626 SPICE, and 0.4605 WMD. To support research in the field of image captioning, the image and caption sets of Tiny TR-CAP will also be publicly available on GitHub (https://github.com/abbasmemis/tiny_TR-CAP) for academic research purposes.
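The abstract reports corpus-level BLEU-1 through BLEU-4 (among other metrics) for each model's generated captions, scored against the reference captions per image. As a rough illustration of that evaluation step only, the sketch below computes BLEU-n with NLTK's corpus_bleu; the tokenization, smoothing choice, and toy sentences are assumptions for illustration, not the authors' protocol or released code.

```python
# Minimal sketch (assumption, not the authors' evaluation code): scoring
# model-generated captions against Tiny TR-CAP-style references (multiple
# human captions per image) with BLEU-1..4 via NLTK. The captions below
# are illustrative placeholders.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One inner list of tokenized reference captions per image.
references = [
    [
        "a brown dog runs across the grass".split(),
        "a dog is running on a green field".split(),
    ],
]
# One tokenized model-generated caption per image, in the same order.
hypotheses = [
    "a dog runs on the grass".split(),
]

smooth = SmoothingFunction().method1  # avoids zero scores on tiny toy data
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # uniform weights => BLEU-n
    score = corpus_bleu(references, hypotheses,
                        weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.4f}")
```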
Source journal
Engineering Science and Technology-An International Journal-Jestech
Category: Materials Science - Electronic, Optical and Magnetic Materials
CiteScore: 11.20
Self-citation rate: 3.50%
Articles per year: 153
Review time: 22 days
About the journal: Engineering Science and Technology, an International Journal (JESTECH) (formerly Technology), a peer-reviewed quarterly engineering journal, publishes both theoretical and experimental high-quality papers of permanent interest, not previously published in journals, in the field of engineering and applied science, and aims to promote the theory and practice of technology and engineering. In addition to peer-reviewed original research papers, the Editorial Board welcomes original research reports, state-of-the-art reviews, and communications in the broadly defined field of engineering science and technology. The scope of JESTECH includes a wide spectrum of subjects, including:

- Electrical/Electronics and Computer Engineering (Biomedical Engineering and Instrumentation; Coding, Cryptography, and Information Protection; Communications, Networks, Mobile Computing and Distributed Systems; Compilers and Operating Systems; Computer Architecture, Parallel Processing, and Dependability; Computer Vision and Robotics; Control Theory; Electromagnetic Waves, Microwave Techniques and Antennas; Embedded Systems; Integrated Circuits, VLSI Design, Testing, and CAD; Microelectromechanical Systems; Microelectronics, and Electronic Devices and Circuits; Power, Energy and Energy Conversion Systems; Signal, Image, and Speech Processing)
- Mechanical and Civil Engineering (Automotive Technologies; Biomechanics; Construction Materials; Design and Manufacturing; Dynamics and Control; Energy Generation, Utilization, Conversion, and Storage; Fluid Mechanics and Hydraulics; Heat and Mass Transfer; Micro-Nano Sciences; Renewable and Sustainable Energy Technologies; Robotics and Mechatronics; Solid Mechanics and Structure; Thermal Sciences)
- Metallurgical and Materials Engineering (Advanced Materials Science; Biomaterials; Ceramic and Inorganic Materials; Electronic-Magnetic Materials; Energy and Environment; Materials Characterization; Metallurgy; Polymers and Nanocomposites)