Amharic Language Image Captions Generation Using Hybridized Attention-Based Deep Neural Networks

Rodas Solomon, Mesfin Abebe
{"title":"Amharic Language Image Captions Generation Using Hybridized Attention-Based Deep Neural Networks","authors":"Rodas Solomon, Mesfin Abebe","doi":"10.1155/2023/9397325","DOIUrl":null,"url":null,"abstract":"This study aims to develop a hybridized deep learning model for generating semantically meaningful image captions in Amharic Language. Image captioning is a task that combines both computer vision and natural language processing (NLP) domains. However, existing studies in the English language primarily focus on visual features to generate captions, resulting in a gap between visual and textual features and inadequate semantic representation. To address this challenge, this study proposes a hybridized attention-based deep neural network (DNN) model. The model consists of an Inception-v3 convolutional neural network (CNN) encoder to extract image features, a visual attention mechanism to capture significant features, and a bidirectional gated recurrent unit (Bi-GRU) with attention decoder to generate the image captions. The model was trained on the Flickr8k and BNATURE datasets with English captions, which were translated into Amharic Language with the help of Google Translator and Amharic Language experts. The evaluation of the model showed improvement in its performance, with a 1G-BLEU score of 60.6, a 2G-BLEU score of 50.1, a 3G-BLEU score of 43.7, and a 4G-BLEU score of 38.8. Generally, this study highlights the effectiveness of the hybrid approach in generating Amharic Language image captions with better semantic meaning.","PeriodicalId":8218,"journal":{"name":"Appl. Comput. Intell. Soft Comput.","volume":"48 1","pages":"9397325:1-9397325:11"},"PeriodicalIF":0.0000,"publicationDate":"2023-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Appl. Comput. Intell. Soft Comput.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1155/2023/9397325","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

This study aims to develop a hybridized deep learning model for generating semantically meaningful image captions in the Amharic language. Image captioning is a task that combines the computer vision and natural language processing (NLP) domains. However, existing studies, conducted mostly on English, focus primarily on visual features when generating captions, resulting in a gap between visual and textual features and inadequate semantic representation. To address this challenge, this study proposes a hybridized attention-based deep neural network (DNN) model. The model consists of an Inception-v3 convolutional neural network (CNN) encoder that extracts image features, a visual attention mechanism that captures the most significant features, and a bidirectional gated recurrent unit (Bi-GRU) decoder with attention that generates the image captions. The model was trained on the Flickr8k and BNATURE datasets, whose English captions were translated into Amharic with the help of Google Translate and Amharic language experts. Evaluation showed improved performance, with a 1G-BLEU score of 60.6, a 2G-BLEU score of 50.1, a 3G-BLEU score of 43.7, and a 4G-BLEU score of 38.8. Overall, this study highlights the effectiveness of the hybrid approach in generating Amharic image captions with better semantic meaning.
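To make the pipeline the abstract describes concrete, below is a minimal sketch in TensorFlow/Keras of an Inception-v3 encoder, additive visual attention, and a Bi-GRU decoder with attention. This is an illustration assembled from the abstract's description, not the authors' released code; the hyperparameters (256-dim embeddings, 512 GRU units) and the Bahdanau-style attention formulation are assumptions.

```python
# A hedged sketch (assumed hyperparameters, not the authors' code) of the
# architecture described in the abstract: Inception-v3 encoder, visual
# attention, and a Bi-GRU decoder with attention.
import tensorflow as tf


def build_encoder():
    """Inception-v3 without its classifier head.

    For a 299x299 input it yields an (8, 8, 2048) feature map, which the
    caller reshapes to 64 region vectors of size 2048.
    """
    base = tf.keras.applications.InceptionV3(include_top=False,
                                             weights="imagenet")
    return tf.keras.Model(base.input, base.output)


class VisualAttention(tf.keras.layers.Layer):
    """Bahdanau-style additive attention over image region features."""

    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)   # projects image features
        self.W2 = tf.keras.layers.Dense(units)   # projects decoder state
        self.V = tf.keras.layers.Dense(1)        # scores each region

    def call(self, features, hidden):
        # features: (batch, 64, 2048); hidden: (batch, state_dim)
        score = self.V(tf.nn.tanh(
            self.W1(features) + self.W2(tf.expand_dims(hidden, 1))))
        weights = tf.nn.softmax(score, axis=1)          # (batch, 64, 1)
        context = tf.reduce_sum(weights * features, 1)  # (batch, 2048)
        return context, weights


class BiGRUDecoder(tf.keras.Model):
    """Emits one caption token per call, attending to the image each step."""

    def __init__(self, vocab_size, embed_dim=256, units=512):
        super().__init__()
        self.units = units
        self.attention = VisualAttention(units)
        self.embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)
        self.bigru = tf.keras.layers.Bidirectional(
            tf.keras.layers.GRU(units, return_sequences=True,
                                return_state=True))
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, token, features, hidden):
        context, weights = self.attention(features, hidden)
        x = self.embedding(token)                           # (batch, 1, embed)
        x = tf.concat([tf.expand_dims(context, 1), x], -1)  # fuse image context
        seq, fwd, bwd = self.bigru(x)
        hidden = tf.concat([fwd, bwd], -1)                  # (batch, 2*units)
        return self.fc(seq[:, -1]), hidden, weights

    def initial_state(self, batch_size):
        return tf.zeros((batch_size, 2 * self.units))
```

In use, each image passes through the encoder once, the feature map is reshaped to 64 region vectors, and the decoder is called token by token, with teacher forcing at training time and greedy or beam-search decoding at inference.

The 1G to 4G BLEU scores reported above correspond to BLEU computed with 1- through 4-gram weights. A hedged sketch of that evaluation with NLTK follows; the paper does not state which implementation it used, and the tokens below are English placeholders standing in for tokenized Amharic captions.

```python
from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu

# Placeholder data: one generated caption with one reference caption.
references = [[["a", "dog", "runs", "in", "the", "grass"]]]
hypotheses = [["a", "dog", "runs", "on", "grass"]]

smooth = SmoothingFunction().method1
grams = {"1G": (1.0,), "2G": (0.5, 0.5),
         "3G": (1 / 3,) * 3, "4G": (0.25,) * 4}
for name, w in grams.items():
    score = corpus_bleu(references, hypotheses,
                        weights=w, smoothing_function=smooth)
    print(f"{name}-BLEU: {100 * score:.1f}")  # scores on the 0-100 scale
```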