Image Caption Enhancement with GRIT, Portable ResNet and BART Context-Tuning

Wuyang Zhang, Jianming Ma
{"title":"Image Caption Enhancement with GRIT, Portable ResNet and BART Context-Tuning","authors":"Wuyang Zhang, Jianming Ma","doi":"10.1109/UV56588.2022.10185494","DOIUrl":null,"url":null,"abstract":"This paper aims to create an image captioning novel architecture that infuses Grid and Region-based image caption transformer, ResNet, and BART language model to offer a more detail-oriented image captioning model. Conventional state-of-the-art image captioning models mainly focuses on region-based features. They rely on decent object detector architectures like Faster R-CNN to extract object-level information to describe the image’s content. Nevertheless, they cannot remove contextual information, high computational costs, and the ability to introduce in-depth external details of objects presented in the images—the replacement of conventional CNN-based detectors results in faster computation. The experiment can generate image captions comparatively fast with higher accuracy and details with contextual information.","PeriodicalId":211011,"journal":{"name":"2022 6th International Conference on Universal Village (UV)","volume":"187 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 6th International Conference on Universal Village (UV)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/UV56588.2022.10185494","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

This paper proposes a novel image captioning architecture that combines the Grid- and Region-based Image captioning Transformer (GRIT), ResNet, and the BART language model to produce a more detail-oriented captioning model. Conventional state-of-the-art image captioning models focus mainly on region-based features: they rely on object detector architectures such as Faster R-CNN to extract object-level information that describes the image's content. However, these models lack contextual information, incur high computational costs, and cannot introduce in-depth external details about the objects present in the images; replacing the conventional CNN-based detectors results in faster computation. In our experiments, the model generates image captions comparatively quickly, with higher accuracy and richer, context-aware detail.
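As a rough illustration of the pipeline the abstract describes, the sketch below wires a ResNet-50 backbone (grid features only; GRIT's region branch is omitted) into a pretrained BART decoder through cross-attention. The class name, the facebook/bart-base checkpoint, and the feature-projection layer are illustrative assumptions, not the authors' implementation.

```python
# Minimal, hypothetical sketch: ResNet grid features projected into BART's
# hidden size and consumed by the BART decoder via cross-attention.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights
from transformers import BartForConditionalGeneration, BartTokenizer
from transformers.modeling_outputs import BaseModelOutput


class ResNetBartCaptioner(nn.Module):
    def __init__(self, bart_name: str = "facebook/bart-base"):
        super().__init__()
        backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
        # Drop avgpool/fc, keeping the last conv stage: (B, 2048, 7, 7) grid features.
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.bart = BartForConditionalGeneration.from_pretrained(bart_name)
        # Project each 2048-d grid cell into BART's hidden size (768 for bart-base).
        self.proj = nn.Linear(2048, self.bart.config.d_model)

    def encode_image(self, images: torch.Tensor) -> BaseModelOutput:
        feats = self.backbone(images)                 # (B, 2048, 7, 7)
        feats = feats.flatten(2).transpose(1, 2)      # (B, 49, 2048) grid tokens
        return BaseModelOutput(last_hidden_state=self.proj(feats))

    def forward(self, images: torch.Tensor, labels: torch.Tensor):
        # Training: the decoder cross-attends to the projected grid features,
        # and `labels` supplies the reference caption token ids.
        enc = self.encode_image(images)
        return self.bart(encoder_outputs=enc, labels=labels)

    @torch.no_grad()
    def caption(self, images: torch.Tensor, tokenizer: BartTokenizer) -> list[str]:
        enc = self.encode_image(images)
        ids = self.bart.generate(encoder_outputs=enc, max_length=30, num_beams=4)
        return tokenizer.batch_decode(ids, skip_special_tokens=True)
```

Under these assumptions, training reduces to a standard cross-entropy objective on the caption tokens, and inference is beam search conditioned only on the image features.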