Amharic Language Image Captions Generation Using Hybridized Attention-Based Deep Neural Networks

Rodas Solomon, Mesfin Abebe
{"title":"Amharic Language Image Captions Generation Using Hybridized Attention-Based Deep Neural Networks","authors":"Rodas Solomon, Mesfin Abebe","doi":"10.1155/2023/9397325","DOIUrl":null,"url":null,"abstract":"This study aims to develop a hybridized deep learning model for generating semantically meaningful image captions in Amharic Language. Image captioning is a task that combines both computer vision and natural language processing (NLP) domains. However, existing studies in the English language primarily focus on visual features to generate captions, resulting in a gap between visual and textual features and inadequate semantic representation. To address this challenge, this study proposes a hybridized attention-based deep neural network (DNN) model. The model consists of an Inception-v3 convolutional neural network (CNN) encoder to extract image features, a visual attention mechanism to capture significant features, and a bidirectional gated recurrent unit (Bi-GRU) with attention decoder to generate the image captions. The model was trained on the Flickr8k and BNATURE datasets with English captions, which were translated into Amharic Language with the help of Google Translator and Amharic Language experts. The evaluation of the model showed improvement in its performance, with a 1G-BLEU score of 60.6, a 2G-BLEU score of 50.1, a 3G-BLEU score of 43.7, and a 4G-BLEU score of 38.8. Generally, this study highlights the effectiveness of the hybrid approach in generating Amharic Language image captions with better semantic meaning.","PeriodicalId":8218,"journal":{"name":"Appl. Comput. Intell. Soft Comput.","volume":"48 1","pages":"9397325:1-9397325:11"},"PeriodicalIF":0.0000,"publicationDate":"2023-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Appl. Comput. Intell. Soft Comput.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1155/2023/9397325","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

This study aims to develop a hybridized deep learning model for generating semantically meaningful image captions in the Amharic language. Image captioning is a task that combines the computer vision and natural language processing (NLP) domains. However, existing studies, conducted mostly on English, focus primarily on visual features when generating captions, resulting in a gap between visual and textual features and inadequate semantic representation. To address this challenge, this study proposes a hybridized attention-based deep neural network (DNN) model. The model consists of an Inception-v3 convolutional neural network (CNN) encoder that extracts image features, a visual attention mechanism that captures the most significant features, and a bidirectional gated recurrent unit (Bi-GRU) decoder with attention that generates the image captions. The model was trained on the Flickr8k and BNATURE datasets, whose English captions were translated into Amharic with the help of Google Translate and Amharic language experts. Evaluation showed improved performance, with a 1G-BLEU score of 60.6, a 2G-BLEU score of 50.1, a 3G-BLEU score of 43.7, and a 4G-BLEU score of 38.8. Overall, this study highlights the effectiveness of the hybrid approach in generating Amharic image captions with better semantic meaning.
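To make the pipeline the abstract describes concrete, below is a minimal sketch in TensorFlow/Keras of an Inception-v3 encoder, additive visual attention, and a Bi-GRU decoder with attention. This is an illustration assembled from the abstract's description, not the authors' released code; the hyperparameters (256-dim embeddings, 512 GRU units) and the Bahdanau-style attention formulation are assumptions.

```python
# A hedged sketch (assumed hyperparameters, not the authors' code) of the
# architecture described in the abstract: Inception-v3 encoder, visual
# attention, and a Bi-GRU decoder with attention.
import tensorflow as tf


def build_encoder():
    """Inception-v3 without its classifier head.

    For a 299x299 input it yields an (8, 8, 2048) feature map, which the
    caller reshapes to 64 region vectors of size 2048.
    """
    base = tf.keras.applications.InceptionV3(include_top=False,
                                             weights="imagenet")
    return tf.keras.Model(base.input, base.output)


class VisualAttention(tf.keras.layers.Layer):
    """Bahdanau-style additive attention over image region features."""

    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)   # projects image features
        self.W2 = tf.keras.layers.Dense(units)   # projects decoder state
        self.V = tf.keras.layers.Dense(1)        # scores each region

    def call(self, features, hidden):
        # features: (batch, 64, 2048); hidden: (batch, state_dim)
        score = self.V(tf.nn.tanh(
            self.W1(features) + self.W2(tf.expand_dims(hidden, 1))))
        weights = tf.nn.softmax(score, axis=1)          # (batch, 64, 1)
        context = tf.reduce_sum(weights * features, 1)  # (batch, 2048)
        return context, weights


class BiGRUDecoder(tf.keras.Model):
    """Emits one caption token per call, attending to the image each step."""

    def __init__(self, vocab_size, embed_dim=256, units=512):
        super().__init__()
        self.units = units
        self.attention = VisualAttention(units)
        self.embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)
        self.bigru = tf.keras.layers.Bidirectional(
            tf.keras.layers.GRU(units, return_sequences=True,
                                return_state=True))
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, token, features, hidden):
        context, weights = self.attention(features, hidden)
        x = self.embedding(token)                           # (batch, 1, embed)
        x = tf.concat([tf.expand_dims(context, 1), x], -1)  # fuse image context
        seq, fwd, bwd = self.bigru(x)
        hidden = tf.concat([fwd, bwd], -1)                  # (batch, 2*units)
        return self.fc(seq[:, -1]), hidden, weights

    def initial_state(self, batch_size):
        return tf.zeros((batch_size, 2 * self.units))
```

In use, each image passes through the encoder once, the feature map is reshaped to 64 region vectors, and the decoder is called token by token, with teacher forcing at training time and greedy or beam-search decoding at inference.

The 1G to 4G BLEU scores reported above correspond to BLEU computed with 1- through 4-gram weights. A hedged sketch of that evaluation with NLTK follows; the paper does not state which implementation it used, and the tokens below are English placeholders standing in for tokenized Amharic captions.

```python
from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu

# Placeholder data: one generated caption with one reference caption.
references = [[["a", "dog", "runs", "in", "the", "grass"]]]
hypotheses = [["a", "dog", "runs", "on", "grass"]]

smooth = SmoothingFunction().method1
grams = {"1G": (1.0,), "2G": (0.5, 0.5),
         "3G": (1 / 3,) * 3, "4G": (0.25,) * 4}
for name, w in grams.items():
    score = corpus_bleu(references, hypotheses,
                        weights=w, smoothing_function=smooth)
    print(f"{name}-BLEU: {100 * score:.1f}")  # scores on the 0-100 scale
```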