Title: Image Caption: Explaining Pictures by Text using Deep Learning
Authors: Sunil Varma, Nitika Kapoor
Published in: 2023 International Conference on Networking and Communications (ICNWC), 2023-04-05
DOI: 10.1109/ICNWC57852.2023.10127293 (https://doi.org/10.1109/ICNWC57852.2023.10127293)
Citations: 0
Abstract
In this paper, we propose a novel approach to generating image captions using deep learning techniques. Our model employs a pre-trained visual language model and matches picture label information to generate familiar picture captions that can describe novel objects. We also use a range of pre-training techniques for learning cross-modal representations on picture-text sets, which contribute to the model's ability to predict picture-text semantic arrangements. We demonstrate that our model outperforms state-of-the-art models on the Flickr 8K dataset. We also employ a combination of long short-term memory (LSTM) and convolutional neural network (CNN) layers to extract image features, which helps the model understand and highlight the relationship between image features and caption semantics. Our results suggest that our approach can provide a more effective and resource-efficient solution for generating image captions. Overall, this paper presents a comprehensive investigation into the use of deep learning techniques for image caption generation.
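The CNN-plus-LSTM architecture the abstract describes can be sketched roughly as follows: a convolutional encoder compresses the image into a feature vector, which is fed as the first step of an LSTM that then decodes a caption token by token. This is a minimal illustrative sketch only; the layer sizes, vocabulary size, and the small CNN standing in for the paper's pre-trained visual backbone are all assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    """Hypothetical sketch of a CNN-encoder / LSTM-decoder captioner.

    All hyperparameters here are illustrative assumptions; the paper's
    pre-trained visual language model is replaced by a tiny CNN.
    """
    def __init__(self, vocab_size=1000, embed_dim=128, hidden_dim=256):
        super().__init__()
        # Small CNN encoder standing in for a pre-trained visual backbone.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # The image feature acts as the first "token" fed to the LSTM,
        # so caption decoding is conditioned on the visual content.
        feats = self.cnn(images).unsqueeze(1)      # (B, 1, E)
        words = self.embed(captions)               # (B, T, E)
        seq = torch.cat([feats, words], dim=1)     # (B, T+1, E)
        hidden, _ = self.lstm(seq)                 # (B, T+1, H)
        return self.out(hidden)                    # (B, T+1, vocab)

model = CaptionModel()
# Dummy batch: 2 RGB images (64x64) and 2 five-token caption prefixes.
logits = model(torch.randn(2, 3, 64, 64), torch.randint(0, 1000, (2, 5)))
print(logits.shape)  # torch.Size([2, 6, 1000])
```

During training, the per-step logits would be scored against the shifted ground-truth caption with cross-entropy; at inference, tokens are sampled or greedily decoded one at a time.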