Pre-Trained CNN Architecture Analysis for Transformer-Based Indonesian Image Caption Generation Model

Q3 Decision Sciences

JOIV International Journal on Informatics Visualization, Pub Date: 2023-05-05, DOI: 10.30630/joiv.7.2.1387

Rifqi Mulyawan, A. Sunyoto, Alva Hendi Muhammad



Abstract

Classification and object recognition in image processing have significantly advanced computer vision. These tasks are commonly handled with Convolutional Neural Networks (CNNs), especially image classification. In image captioning, a popular task with active state-of-the-art (SOTA) research, a CNN is often used as the encoder to extract image features. Instead of being used for direct classification, the extracted features are passed from the encoder to the decoder, which generates the caption sequence, so the CNN layers specific to classification are not required. This study aims to determine which pre-trained CNN architecture performs best at extracting image features when a state-of-the-art Transformer model serves as the decoder. Unlike the original Transformer architecture, our model is implemented in a vector-to-sequence rather than a sequence-to-sequence fashion. The Indonesian Flickr8k and Flickr30k datasets were used in this research. Evaluations were carried out with several pre-trained architectures: ResNet18, ResNet34, ResNet50, ResNet101, VGG16, Efficientnet_b0, Efficientnet_b1, and Googlenet. Both the qualitative model inference results and the quantitative evaluation scores were analyzed. The test results show that the ResNet50 architecture produces stable sequence generation with the highest accuracy value. Our experiments also indicate that fine-tuning the encoder can significantly increase the model's evaluation score. For future work, further exploration with larger datasets such as Flickr30k, MS COCO 14, MS COCO 17, and other Indonesian image captioning datasets, together with newer Transformer-based methods, may yield a better Indonesian automatic image captioning model.
