Pre-trained CNNs as Feature-Extraction Modules for Image Captioning

Q4 Computer Science

Electronic Letters on Computer Vision and Image Analysis. Pub Date: 2022-05-10. DOI: 10.5565/rev/elcvia.1436

Muhammad Abdelhadie Al-Malla, Assef Jafar, Nada Ghneim


Citations: 0

Abstract

In this work, we present a thorough experimental study of feature extraction using Convolutional Neural Networks (CNNs) for image captioning in the context of deep learning. We perform a set of 72 experiments on 12 image classification CNNs pre-trained on the ImageNet [29] dataset. The features are extracted from the last layer after removing the fully connected layer and fed into the captioning model. We use a unified captioning model with a fixed vocabulary size across all the experiments to study the effect of changing the CNN feature extractor on image captioning quality. The scores are calculated using the standard metrics in image captioning. We find a strong relationship between the model structure and the image captioning dataset, and show that VGG models give the lowest quality for image captioning feature extraction among the tested CNNs. Finally, we recommend a set of pre-trained CNNs for each image captioning evaluation metric one might want to optimise, and show the connection between our results and previous works. To our knowledge, this work is the most comprehensive comparison of feature extractors for image captioning.
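
The abstract does not say how the extracted features are shaped before they reach the captioning model. As an illustration only, the sketch below shows two common ways to turn the output of a CNN's last convolutional stage (an H x W x C feature map) into captioning input: a sequence of H*W region vectors, or a single globally averaged vector. The function names, the toy values, and the choice between the two layouts are assumptions made for this example and are not taken from the paper.

```python
from typing import List

# A feature map indexed as [row][col][channel]; a stand-in for real CNN output.
FeatureMap = List[List[List[float]]]


def to_region_sequence(feature_map: FeatureMap) -> List[List[float]]:
    """Flatten an H x W x C map into H*W region vectors of length C."""
    regions = []
    for row in feature_map:
        for cell in row:
            regions.append(list(cell))
    return regions


def global_average(feature_map: FeatureMap) -> List[float]:
    """Average over all spatial positions, giving one C-dimensional vector."""
    regions = to_region_sequence(feature_map)
    channels = len(regions[0])
    return [sum(r[c] for r in regions) / len(regions) for c in range(channels)]


# Hypothetical toy 2 x 2 x 3 map; a real extractor would produce a much
# larger map (VGG16, for instance, yields 7 x 7 x 512 for a 224 x 224 input).
toy_map = [
    [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    [[5.0, 4.0, 2.0], [7.0, 0.0, 2.0]],
]
regions = to_region_sequence(toy_map)  # 4 region vectors, each of length 3
pooled = global_average(toy_map)       # one vector of length 3
```

Because only the feature extractor changes between experiments, whichever layout is chosen would have to be applied identically to every CNN so that the captioning model stays fixed.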
