Knowledge-aware Multimodal Dialogue Systems

Lizi Liao, Yunshan Ma, Xiangnan He, Richang Hong, Tat-Seng Chua
DOI: 10.1145/3240508.3240605
Venue: Proceedings of the 26th ACM International Conference on Multimedia
Publication date: 2018-10-15
Citations: 106

Abstract

By offering a natural way of information seeking, multimodal dialogue systems are attracting increasing attention in domains such as retail and travel. However, most existing dialogue systems are limited to the textual modality, which cannot easily be extended to capture the rich semantics of visual modalities such as product images. For example, in the fashion domain, the visual appearance of clothes and matching styles plays a crucial role in understanding the user's intention. Without considering these, the dialogue agent may fail to generate desirable responses for users. In this paper, we present a Knowledge-aware Multimodal Dialogue (KMD) model to address this limitation of text-based dialogue systems. It gives special consideration to the semantics and domain knowledge revealed in visual content, and features three key components. First, we build a taxonomy-based learning module to capture the fine-grained semantics in images (e.g., the category and attributes of a product). Second, we propose an end-to-end neural conversational model to generate responses based on the conversation history, visual semantics, and domain knowledge. Lastly, to avoid inconsistent dialogues, we adopt a deep reinforcement learning method that accounts for future rewards to optimize the neural conversational model. We perform extensive evaluation on a multi-turn, task-oriented dialogue dataset in the fashion domain. Experiment results show that our method significantly outperforms state-of-the-art methods, demonstrating the efficacy of modeling the visual modality and domain knowledge for dialogue systems.
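The third component described above, a reinforcement learning step that "accounts for future rewards," typically relies on discounted returns to score each dialogue turn. The abstract does not give the authors' implementation; as a rough, hypothetical illustration only, the discounted return computation behind a REINFORCE-style update might be sketched as follows (the discount factor `gamma` and the per-turn reward values are illustrative assumptions, not taken from the paper):

```python
def discounted_returns(rewards, gamma=0.9):
    """Compute the discounted return G_t = sum_k gamma^k * r_{t+k}
    for every turn t of a finished dialogue episode.

    rewards: per-turn reward values, in dialogue order.
    gamma:   discount factor weighting future rewards (assumed value).
    """
    returns = []
    g = 0.0
    # Accumulate from the last turn backwards so each turn's return
    # folds in all discounted future rewards.
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# Example: a 3-turn dialogue where only the final turn is rewarded
# (illustrative values). Earlier turns still receive discounted credit.
print(discounted_returns([0.0, 0.0, 1.0]))
```

These per-turn returns would then weight the log-likelihood gradient of each generated response, so that turns leading to a successful dialogue outcome are reinforced even when the reward arrives only at the end.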