Video-Grounded Dialogues with Joint Video and Image Training

2022 IEEE International Conference on Image Processing (ICIP). Pub Date: 2022-10-16. DOI: 10.1109/ICIP46576.2022.9897613

Han Zhang, Yingming Li, Zhongfei Zhang


Citations: 1

Abstract


Video-Grounded Dialogues with Joint Video and Image Training

In this paper, we propose a multi-modal transformer model for end-to-end training of video-grounded dialogue generation. In particular, LayerScale-regularized spatio-temporal self-attention blocks are introduced first, which lets us train flexibly end-to-end on both video and image data without extracting offline visual features. Further, the pre-trained generative language architecture BART is employed to encode the different modalities and perform dialogue generation. Extensive experiments on the Audio-Visual Scene-Aware Dialog (AVSD) dataset demonstrate the model's effectiveness and its superiority over state-of-the-art methods.
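The abstract does not give architectural details, so the following is only a minimal sketch of the two ideas it names: a LayerScale-style residual (the branch output is multiplied by a small per-channel scale before being added back) and a spatio-temporal attention block that accepts both videos and images by treating an image as a one-frame clip. The factorization into temporal-then-spatial attention, the identity query/key/value projections, the scale value `1e-4`, and all function names are assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (assumed, for brevity)."""
    dim = len(tokens[0])
    out = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        out.append([sum(w * tok[d] for w, tok in zip(weights, tokens))
                    for d in range(dim)])
    return out


def layerscale_residual(tokens, branch, gamma):
    """Residual connection with a per-channel scale: x + gamma * branch(x)."""
    update = branch(tokens)
    return [[x + g * u for x, g, u in zip(tok, gamma, upd)]
            for tok, upd in zip(tokens, update)]


def spatio_temporal_block(clip, gamma_time, gamma_space):
    """clip[t][n] is the token vector for frame t, patch n.

    Hypothetical divided attention: temporal first, then spatial.
    """
    num_frames, num_patches = len(clip), len(clip[0])
    # Temporal attention: each patch position attends across frames.
    columns = [
        layerscale_residual([clip[t][n] for t in range(num_frames)],
                            self_attention, gamma_time)
        for n in range(num_patches)
    ]
    after_time = [[columns[n][t] for n in range(num_patches)]
                  for t in range(num_frames)]
    # Spatial attention: patches within each frame attend to each other.
    return [layerscale_residual(frame, self_attention, gamma_space)
            for frame in after_time]


if __name__ == "__main__":
    dim = 4
    gamma = [1e-4] * dim  # small initial scale (assumed value)
    # A toy video: 2 frames, 3 patches per frame, 4 channels per token.
    video = [[[0.1 * (t + n + d) for d in range(dim)] for n in range(3)]
             for t in range(2)]
    image = [video[0]]  # an image is handled as a clip with a single frame
    print(len(spatio_temporal_block(video, gamma, gamma)))
    print(len(spatio_temporal_block(image, gamma, gamma)))
```

Because the same block runs unchanged on a one-frame clip, batches of images and videos can share weights in this sketch; with a small scale the block starts close to the identity mapping, which is the usual motivation for LayerScale.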

