Offline-to-Online Co-Evolutional User Simulator and Dialogue System

Proceedings of the Towards Semi-Supervised and Reinforced Task-Oriented Dialog Systems (SereTOD). Pub Date: 1900-01-01. DOI: 10.18653/v1/2022.seretod-1.11

Dafeng Chi, Yuzheng Zhuang, Yao Mu, Bin Wang, Jianzhu Bao, Yasheng Wang, Yuhan Dong, Xin Jiang, Qun Liu, Jianye Hao


Abstract

Reinforcement learning (RL) has emerged as a promising approach for fine-tuning offline-pretrained GPT-2 models in task-oriented dialogue (TOD) systems. To obtain human-like online interactions while extending the use of RL, it has become common to build pretrained user simulators (US) alongside dialogue systems (DS) and fine-tune the two jointly via RL. However, joint training introduces a distributional shift problem caused by compounding exposure bias. Existing methods usually update the US and DS iteratively to ameliorate the ensuing non-stationarity problem, which can lead to sub-optimal policies and lower sample efficiency. To take a step further in tackling the problem, we introduce an Offline-to-oNline Co-Evolutional (ONCE) framework, which enables bias-aware concurrent joint updates for RL-based fine-tuning while taking advantage of GPT-2-based end-to-end modeling of the US and DS. Extensive experiments demonstrate that ONCE builds high-quality loops of policy learning and dialogue data collection, and achieves state-of-the-art online and offline evaluation results on the MultiWOZ2.1 dataset. Open-source code will be implemented with MindSpore (MS, 2022) and released on our homepage.
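The contrast between iterative and concurrent updates of the US and DS can be pictured with a toy sketch. Everything below is an illustrative assumption and not the ONCE algorithm: the two "policies" are single scalars, the shared reward is a made-up quadratic, and the function names are hypothetical. The bias-aware component described in the abstract is not modeled at all. The sketch only shows the difference in update schedule.

```python
# Toy illustration only: scalar "policies" and a made-up shared reward.
# This is NOT the ONCE method; it only contrasts two update schedules.

def reward_grads(us, ds):
    """Gradients of a hypothetical shared reward
    R(us, ds) = -(us - 1)^2 - (ds - us)^2, maximised at us = ds = 1."""
    g_us = -2.0 * (us - 1.0) + 2.0 * (ds - us)
    g_ds = -2.0 * (ds - us)
    return g_us, g_ds


def iterative_update(us, ds, rounds=200, lr=0.1):
    """Alternate: on even rounds only the US moves (DS frozen),
    on odd rounds only the DS moves (US frozen)."""
    for step in range(rounds):
        g_us, g_ds = reward_grads(us, ds)
        if step % 2 == 0:
            us += lr * g_us
        else:
            ds += lr * g_ds
    return us, ds


def concurrent_update(us, ds, rounds=200, lr=0.1):
    """Update both US and DS in every round from the same interaction."""
    for _ in range(rounds):
        g_us, g_ds = reward_grads(us, ds)
        us += lr * g_us
        ds += lr * g_ds
    return us, ds


if __name__ == "__main__":
    print("iterative: ", iterative_update(0.0, 0.0))
    print("concurrent:", concurrent_update(0.0, 0.0))
```

In the iterative schedule each round's interaction is used to update only one of the two models, whereas the concurrent schedule uses it for both. In a real dialogue setting the concurrent schedule means each model learns against a partner that is itself changing, which is the setting the abstract's bias-aware update is meant to address.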


