Evaluation of Real-time Deep Learning Turn-taking Models for Multiple Dialogue Scenarios

Proceedings of the 20th ACM International Conference on Multimodal Interaction. Pub Date: 2018-10-02. DOI: 10.1145/3242969.3242994

Divesh Lala, K. Inoue, Tatsuya Kawahara


Citations: 31

Abstract

The task of identifying when to take a conversational turn is an important function of spoken dialogue systems. The turn-taking system should also ideally be able to handle many types of dialogue, from structured conversation to spontaneous and unstructured discourse. Our goal is to determine how much a generalized model trained on many types of dialogue scenarios would improve on a model trained only for a specific scenario. To achieve this goal we created a large corpus of Wizard-of-Oz conversation data which consisted of several different types of dialogue sessions, and then compared a generalized model with scenario-specific models. For our evaluation we go further than simply reporting conventional metrics, which we show are not informative enough to evaluate turn-taking in a real-time system. Instead, we process results using a performance curve of latency and false cut-in rate, and further improve our model's real-time performance using a finite-state turn-taking machine. Our results show that the generalized model greatly outperformed the individual model for attentive listening scenarios but was worse in job interview scenarios. This implies that a model based on a large corpus is better suited to conversation which is more user-initiated and unstructured. We also propose that our method of evaluation leads to more informative performance metrics in a real-time system.
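The trade-off between latency and false cut-in rate mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not define either quantity precisely, so the `Pause` structure, the 100 ms frame spacing (`FRAME_SEC`), the threshold rule, and all function names below are assumptions made for illustration. The sketch assumes a model that emits a turn-end probability at each frame of a user pause, and a system that takes the turn at the first frame whose probability reaches a threshold.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

FRAME_SEC = 0.1  # assumed spacing between successive model outputs (hypothetical)


@dataclass
class Pause:
    """One silence after a user utterance (hypothetical structure)."""
    probs: List[float]  # assumed turn-end probability at each frame of the pause
    is_turn_end: bool   # True if the user really yielded the turn


def first_crossing(probs: List[float], threshold: float) -> Optional[int]:
    """Index of the first frame at or above the threshold, or None."""
    for i, p in enumerate(probs):
        if p >= threshold:
            return i
    return None


def operating_point(pauses: List[Pause], threshold: float) -> Tuple[float, float]:
    """Return (mean latency in seconds, false cut-in rate) for one threshold."""
    latencies: List[float] = []
    holds = 0
    cut_ins = 0
    for pause in pauses:
        idx = first_crossing(pause.probs, threshold)
        if pause.is_turn_end:
            # If the threshold is never reached, count the whole pause as waiting time.
            frames = idx if idx is not None else len(pause.probs)
            latencies.append(frames * FRAME_SEC)
        else:
            holds += 1
            if idx is not None:
                cut_ins += 1  # the system spoke while the user was holding the turn
    mean_latency = sum(latencies) / len(latencies) if latencies else 0.0
    cut_in_rate = cut_ins / holds if holds else 0.0
    return mean_latency, cut_in_rate


def performance_curve(
    pauses: List[Pause], thresholds: List[float]
) -> List[Tuple[float, float, float]]:
    """Sweep thresholds; each entry is (threshold, mean latency, false cut-in rate)."""
    return [(t, *operating_point(pauses, t)) for t in thresholds]
```

Under these assumptions, raising the threshold lowers the false cut-in rate at the cost of higher latency, so sweeping it traces a curve of operating points rather than a single score such as accuracy.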
