{"title":"预测异常引导的多模态语言语义阿拉伯图像字幕","authors":"Nahla Aljojo , Hanin Ardah , Araek Tashkandi , Safa Habibullah","doi":"10.1016/j.mlwa.2025.100706","DOIUrl":null,"url":null,"abstract":"<div><div>Deep learning has significantly advanced image captioning tasks, enabling models to generate accurate, descriptive sentences from visual content. While much progress has been made in English-language image captioning, Arabic remains underexplored despite its linguistic complexity and widespread usage. Existing Arabic image captioning systems suffer from limited datasets, insufficiently tuned models, and poor adaptation to Arabic morphology and semantics. This limitation hinders the development of accurate, coherent Arabic captions, especially in high-resource applications such as media indexing and content accessibility. This study aims to develop an effective Arabic Image Caption Generator that addresses the shortage of research and tools in this domain. The goal is to create a robust model capable of generating semantically rich, syntactically accurate Arabic captions for visual inputs. The proposed system integrates a DenseNet201 convolutional neural network (CNN) for image feature extraction with a deep Recurrent Neural Network using Long Short-Term Memory (RNN-LSTM) units for sequential caption generation. The model was trained and fine-tuned on a translated Arabic version of the Flickr8K dataset, consisting of over 8000 images, each paired with three Arabic captions. The fine-tuned DenseNet201 + LSTM model achieved BLEU-4 of 0.85, ROUGE-L of 0.90, METEOR of 0.72, CIDEr of 0.88, SPICE of 0.68, and a perplexity score of 1.1, surpassing baseline and prior models in Arabic image captioning tasks. This research provides a novel, end-to-end Arabic image captioning framework, addressing linguistic challenges through deep learning. It offers a benchmark model for future research and practical applications in Arabic-language image understanding.</div></div>","PeriodicalId":74093,"journal":{"name":"Machine learning with applications","volume":"21 ","pages":"Article 100706"},"PeriodicalIF":4.9000,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Predicting abnormality-guided multimodal linguistic semantics Arabic image captioning\",\"authors\":\"Nahla Aljojo , Hanin Ardah , Araek Tashkandi , Safa Habibullah\",\"doi\":\"10.1016/j.mlwa.2025.100706\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Deep learning has significantly advanced image captioning tasks, enabling models to generate accurate, descriptive sentences from visual content. While much progress has been made in English-language image captioning, Arabic remains underexplored despite its linguistic complexity and widespread usage. Existing Arabic image captioning systems suffer from limited datasets, insufficiently tuned models, and poor adaptation to Arabic morphology and semantics. This limitation hinders the development of accurate, coherent Arabic captions, especially in high-resource applications such as media indexing and content accessibility. This study aims to develop an effective Arabic Image Caption Generator that addresses the shortage of research and tools in this domain. The goal is to create a robust model capable of generating semantically rich, syntactically accurate Arabic captions for visual inputs. 
The proposed system integrates a DenseNet201 convolutional neural network (CNN) for image feature extraction with a deep Recurrent Neural Network using Long Short-Term Memory (RNN-LSTM) units for sequential caption generation. The model was trained and fine-tuned on a translated Arabic version of the Flickr8K dataset, consisting of over 8000 images, each paired with three Arabic captions. The fine-tuned DenseNet201 + LSTM model achieved BLEU-4 of 0.85, ROUGE-L of 0.90, METEOR of 0.72, CIDEr of 0.88, SPICE of 0.68, and a perplexity score of 1.1, surpassing baseline and prior models in Arabic image captioning tasks. This research provides a novel, end-to-end Arabic image captioning framework, addressing linguistic challenges through deep learning. It offers a benchmark model for future research and practical applications in Arabic-language image understanding.</div></div>\",\"PeriodicalId\":74093,\"journal\":{\"name\":\"Machine learning with applications\",\"volume\":\"21 \",\"pages\":\"Article 100706\"},\"PeriodicalIF\":4.9000,\"publicationDate\":\"2025-07-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Machine learning with applications\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2666827025000891\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Machine learning with applications","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666827025000891","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Predicting abnormality-guided multimodal linguistic semantics Arabic image captioning
Deep learning has significantly advanced image captioning, enabling models to generate accurate, descriptive sentences from visual content. While much progress has been made in English-language image captioning, Arabic remains underexplored despite its linguistic complexity and widespread use. Existing Arabic image captioning systems suffer from limited datasets, insufficiently tuned models, and poor adaptation to Arabic morphology and semantics. These limitations hinder the development of accurate, coherent Arabic captions, especially in high-resource applications such as media indexing and content accessibility. This study develops an effective Arabic image caption generator to address the shortage of research and tools in this domain; the goal is a robust model that generates semantically rich, syntactically accurate Arabic captions for visual inputs. The proposed system integrates a DenseNet201 convolutional neural network (CNN) for image feature extraction with a deep recurrent neural network built from long short-term memory units (RNN-LSTM) for sequential caption generation. The model was trained and fine-tuned on a translated Arabic version of the Flickr8K dataset, consisting of more than 8,000 images, each paired with three Arabic captions. The fine-tuned DenseNet201 + LSTM model achieved a BLEU-4 score of 0.85, ROUGE-L of 0.90, METEOR of 0.72, CIDEr of 0.88, SPICE of 0.68, and a perplexity of 1.1, surpassing baseline and prior models on Arabic image captioning tasks. This research provides a novel, end-to-end Arabic image captioning framework that addresses the language's challenges through deep learning, and it offers a benchmark model for future research and practical applications in Arabic-language image understanding.
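The abstract specifies only the high-level architecture (a DenseNet201 encoder feeding an LSTM decoder), not its internals. The sketch below is one plausible reading, assuming the common Keras "merge" encoder-decoder design often used with Flickr8K captioning; the vocabulary size, caption length, and layer widths are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of a DenseNet201 + LSTM captioning model (Keras "merge" design).
# VOCAB_SIZE, MAX_LEN, and layer widths are assumed, not from the paper.
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 10_000  # assumed Arabic caption vocabulary size
MAX_LEN = 35         # assumed maximum caption length in tokens

# Encoder: ImageNet-pretrained DenseNet201, classification head removed,
# global-average-pooled to a single 1920-dim feature vector per image.
cnn = tf.keras.applications.DenseNet201(
    include_top=False, weights="imagenet", pooling="avg"
)
cnn.trainable = False  # fine-tuning could later unfreeze the top blocks

image_in = layers.Input(shape=(224, 224, 3))
x = tf.keras.applications.densenet.preprocess_input(image_in)
img_feat = layers.Dense(256, activation="relu")(cnn(x))

# Decoder: embed the partial Arabic caption and run it through an LSTM.
seq_in = layers.Input(shape=(MAX_LEN,))
emb = layers.Embedding(VOCAB_SIZE, 256, mask_zero=True)(seq_in)
seq_feat = layers.LSTM(256)(emb)

# Merge image and text features and predict the next token.
merged = layers.add([img_feat, seq_feat])
hidden = layers.Dense(256, activation="relu")(merged)
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(hidden)

model = Model(inputs=[image_in, seq_in], outputs=next_word)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.summary()
```

In this design the model is trained to predict the next token given the image and the caption prefix; at inference time, captions are generated one token at a time (e.g. greedily or with beam search) starting from a start-of-sequence token.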
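Evaluation relies on standard captioning metrics. As a quick illustration of how two of them are obtained (assuming NLTK's corpus_bleu for BLEU-4; the Arabic tokens below are invented examples, and ROUGE-L, METEOR, CIDEr, and SPICE each come from their own toolkits):

```python
# Hedged sketch: corpus BLEU-4 via NLTK and perplexity as exp(mean token NLL).
import math
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One hypothesis scored against its reference set; the paper pairs each image
# with three Arabic captions, so real reference lists would hold three entries.
references = [[["قطة", "تجلس", "على", "العشب"]]]  # per-image lists of tokenized references
hypotheses = [["قطة", "تجلس", "على", "العشب"]]    # tokenized model outputs

bleu4 = corpus_bleu(
    references,
    hypotheses,
    weights=(0.25, 0.25, 0.25, 0.25),  # equal 1- to 4-gram weights = BLEU-4
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-4: {bleu4:.2f}")

def perplexity(token_nlls):
    """exp of the mean per-token negative log-likelihood (in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A perplexity of 1.1 corresponds to a mean NLL of ln(1.1) ≈ 0.095 nats/token.
print(f"perplexity: {perplexity([0.095, 0.095, 0.095]):.2f}")
```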