Video-Grounded Dialogues with Joint Video and Image Training

2022 IEEE International Conference on Image Processing (ICIP) | Pub Date: 2022-10-16 | DOI: 10.1109/ICIP46576.2022.9897613

Han Zhang, Yingming Li, Zhongfei Zhang

Citations: 1

Abstract

In this paper, we propose a multi-modal transformer model for end-to-end training of video-grounded dialogue generation. In particular, we first introduce LayerScale-regularized spatio-temporal self-attention blocks, which let the model train end-to-end on both video and image data without extracting offline visual features. Further, BART, a pre-trained generative language architecture, is employed to encode the different modalities and perform dialogue generation. Extensive experiments on the Audio-Visual Scene-Aware Dialog (AVSD) dataset demonstrate the model's effectiveness and its superiority over state-of-the-art methods.
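The abstract does not spell out how LayerScale enters the blocks, so the following is only a minimal sketch of the general LayerScale idea: the output of a residual branch is multiplied by a learnable per-channel scale that starts near zero, so each block initially behaves almost like the identity. The names `layer_scale_residual`, `toy_branch`, and `INIT_SCALE` are hypothetical, the branch is a stand-in for a self-attention or MLP sub-block, and nothing here is taken from the authors' implementation.

```python
from typing import Callable, List

Vector = List[float]


def layer_scale_residual(
    x: Vector,
    branch: Callable[[Vector], Vector],
    gamma: Vector,
) -> Vector:
    """Return x + gamma * branch(x), with gamma applied per channel."""
    out = branch(x)
    if not (len(x) == len(out) == len(gamma)):
        raise ValueError("x, branch(x) and gamma must have the same length")
    return [xi + gi * oi for xi, gi, oi in zip(x, gamma, out)]


def toy_branch(x: Vector) -> Vector:
    """Hypothetical stand-in for a self-attention or MLP sub-block."""
    return [2.0 * v + 1.0 for v in x]


# gamma starts at a small constant, so the block begins close to the
# identity map; in a real model gamma would be learned during training.
INIT_SCALE = 1e-4  # assumed value, chosen only for illustration
tokens = [1.0, -2.0, 0.5]
gamma = [INIT_SCALE] * len(tokens)
scaled = layer_scale_residual(tokens, toy_branch, gamma)
```

With a small initial scale, `scaled` stays very close to `tokens`, which is the property that makes deep stacks of such blocks easier to optimize from scratch.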
