CMCC: A Comprehensive and Large-Scale Human-Human Dataset for Dialogue Systems

Proceedings of the Towards Semi-Supervised and Reinforced Task-Oriented Dialog Systems (SereTOD). DOI: 10.18653/v1/2022.seretod-1.7

Y. Huang, Xiaoting Wu, Si Chen, Wei Hu, Qing Zhu, Junlan Feng, Chao Deng, Zhijian Ou, Jiangjiang Zhao



Abstract

Dialogue modeling problems severely limit the real-world deployment of neural conversational models, and building a human-like dialogue agent remains an extremely challenging task. Data-driven models, which require large amounts of conversation data, have recently become increasingly prevalent. In this paper, we release around 100,000 dialogues drawn from real-world transcripts of conversations between real users and customer-service staff. We call this the CMCC (China Mobile Customer Care) dataset; it differs significantly from existing dialogue datasets in both size and nature. The dataset reflects several characteristics of human-human conversations, e.g., they are task-driven and care-oriented and exhibit long-term dependencies across the context. It also covers various dialogue types, including task-oriented dialogue, chitchat, and conversational recommendation, in real-world scenarios. To our knowledge, CMCC is the largest real human-human spoken dialogue dataset, with dozens of times the data scale of others, which should significantly promote the training and evaluation of dialogue modeling methods. The results of extensive experiments indicate that CMCC is challenging and that further effort is needed. We hope that this resource will enable more effective models to be built across various dialogue sub-problems in the future.


