Transitional Adaptation of Pretrained Models for Visual Storytelling

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Pub Date: 2021-06-01
DOI: 10.1109/CVPR46437.2021.01247

Youngjae Yu, Jiwan Chung, Heeseung Yun, Jongseok Kim, Gunhee Kim


Citations: 19

Abstract

Previous models for vision-to-language generation tasks usually pretrain a visual encoder and a language generator in their respective domains and jointly finetune them on the target task. However, this direct transfer practice may suffer from discord between visual specificity and language fluency, since the two are often trained separately on large corpora of visual and text data with no common ground. In this work, we claim that a transitional adaptation task is required between pretraining and finetuning to harmonize the visual encoder and the language model for challenging downstream target tasks like visual storytelling. We propose a novel approach named Transitional Adaptation of Pre-trained Model (TAPM) that adapts the multi-modal modules to each other through a simpler alignment task between visual inputs only, with no need for text labels. Through extensive experiments, we show that the adaptation step significantly improves the performance of multiple language models on sequential video and image captioning tasks. We achieve new state-of-the-art performance on both language metrics and human evaluation in the multi-sentence description task of LSMDC 2019 [50] and the image storytelling task of VIST [18]. Our experiments reveal that this improvement in caption quality does not depend on the specific choice of language model.



