Multimodal Dialogue State Tracking

North American Chapter of the Association for Computational Linguistics · Pub Date: 2022-06-16 · DOI: 10.48550/arXiv.2206.07898

Hung Le, Nancy F. Chen, S. Hoi


Citations: 5

Abstract

Designed to track user goals in dialogues, a dialogue state tracker is an essential component of a dialogue system. However, research on dialogue state tracking has largely been limited to a single modality, in which slots and slot values are constrained by knowledge domains (e.g., a restaurant domain with slots for restaurant name and price range) and defined by a specific database schema. In this paper, we propose extending the definition of dialogue state tracking to multimodality. Specifically, we introduce a novel dialogue state tracking task that tracks information about visual objects mentioned in video-grounded dialogues. Each new dialogue utterance may introduce a new video segment, new visual objects, or new object attributes, and a state tracker is required to update these information slots accordingly. We created a new synthetic benchmark and designed a novel baseline, Video-Dialogue Transformer Network (VDTN), for this task. VDTN combines object-level and segment-level features and learns contextual dependencies between videos and dialogues to generate multimodal dialogue states. We optimized VDTN for a state generation task as well as a self-supervised video understanding task that recovers video segment or object representations. Finally, we trained VDTN to use the decoded states in a response prediction task. Through comprehensive ablation and qualitative analysis, we report insights toward building more capable multimodal dialogue systems.
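To make the task description concrete, the following is a minimal sketch of what a multimodal dialogue state and its per-turn update might look like. It is not the paper's VDTN model or its benchmark format: the class `MultimodalDialogueState`, the function `update_state`, the frame-index segment representation, and the slot names (`color`, `shape`) are all hypothetical, and the choice to carry earlier object slots forward across turns is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class MultimodalDialogueState:
    """Hypothetical state: one video segment plus attribute slots per visual object."""
    # (start_frame, end_frame); the unit is an assumption made for this sketch
    segment: Optional[Tuple[int, int]] = None
    # object id -> {slot name: slot value}
    objects: Dict[str, Dict[str, str]] = field(default_factory=dict)


def update_state(state, segment=None, object_updates=None):
    """Return a new state after one dialogue turn.

    A turn may introduce a new video segment, new objects, or new attributes
    of known objects. Slots from earlier turns are carried over (assumption).
    """
    new_segment = segment if segment is not None else state.segment
    # Copy the nested dicts so the previous state is left unchanged
    new_objects = {obj: dict(slots) for obj, slots in state.objects.items()}
    for obj, slots in (object_updates or {}).items():
        new_objects.setdefault(obj, {}).update(slots)
    return MultimodalDialogueState(new_segment, new_objects)


state = MultimodalDialogueState()
# Turn 1: the utterance grounds a segment and mentions a red object
state = update_state(state, segment=(1, 40), object_updates={"obj1": {"color": "red"}})
# Turn 2: a new attribute for obj1 and a newly mentioned object
state = update_state(
    state,
    object_updates={"obj1": {"shape": "cube"}, "obj2": {"color": "blue"}},
)
print(state)
```

In a learned tracker such as the one the abstract describes, the updates would be generated by the model from the video and dialogue context rather than passed in by hand; the sketch only shows the shape of the state being tracked.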

