{"title":"Multi-Document Grounded Multi-Turn Synthetic Dialog Generation","authors":"Young-Suk Lee, Chulaka Gunasekara, Danish Contractor, Ramón Fernandez Astudillo, Radu Florian","doi":"arxiv-2409.11500","DOIUrl":null,"url":null,"abstract":"We introduce a technique for multi-document grounded multi-turn synthetic\ndialog generation that incorporates three main ideas. First, we control the\noverall dialog flow using taxonomy-driven user queries that are generated with\nChain-of-Thought (CoT) prompting. Second, we support the generation of\nmulti-document grounded dialogs by mimicking real-world use of retrievers to\nupdate the grounding documents after every user-turn in the dialog. Third, we\napply LLM-as-a-Judge to filter out queries with incorrect answers. Human\nevaluation of the synthetic dialog data suggests that the data is diverse,\ncoherent, and includes mostly correct answers. Both human and automatic\nevaluations of answerable queries indicate that models fine-tuned on synthetic\ndialogs consistently out-perform those fine-tuned on existing human generated\ntraining data across four publicly available multi-turn document grounded\nbenchmark test sets.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"20 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11500","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
We introduce a technique for multi-document grounded multi-turn synthetic dialog generation that incorporates three main ideas. First, we control the overall dialog flow using taxonomy-driven user queries that are generated with Chain-of-Thought (CoT) prompting. Second, we support the generation of multi-document grounded dialogs by mimicking real-world use of retrievers to update the grounding documents after every user turn in the dialog. Third, we apply LLM-as-a-Judge to filter out queries with incorrect answers. Human evaluation of the synthetic dialog data suggests that the data is diverse, coherent, and includes mostly correct answers. Both human and automatic evaluations of answerable queries indicate that models fine-tuned on synthetic dialogs consistently outperform those fine-tuned on existing human-generated training data across four publicly available multi-turn document-grounded benchmark test sets.
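To make the three-stage pipeline concrete, here is a minimal sketch of how the generation loop described above might be wired together. This is an illustrative reconstruction from the abstract, not the authors' implementation: the `llm` and `retriever` callables, the prompt wording, and all function and variable names are assumptions.

```python
# Hypothetical sketch of the pipeline from the abstract. The llm() and
# retriever() callables and all prompts are placeholders, not the paper's code.

def synthesize_dialog(llm, retriever, taxonomy, num_turns=4):
    """Generate one multi-document grounded synthetic dialog."""
    dialog = []
    for _ in range(num_turns):
        # 1) Taxonomy-driven user query, generated with CoT prompting.
        query = llm(
            f"Taxonomy: {taxonomy}\nDialog so far: {dialog}\n"
            "Think step by step, then write the next user query."
        )
        # 2) Mimic real-world retrieval: refresh the grounding
        #    documents after every user turn.
        grounding_docs = retriever(query)
        answer = llm(
            f"Documents: {grounding_docs}\nQuery: {query}\nAnswer:"
        )
        # 3) LLM-as-a-Judge: keep only turns whose answers the judge
        #    deems grounded and correct; filter out the rest.
        verdict = llm(
            f"Documents: {grounding_docs}\nQuery: {query}\n"
            f"Answer: {answer}\nIs the answer correct? Reply yes or no:"
        )
        if verdict.strip().lower().startswith("yes"):
            dialog.append({"user": query, "assistant": answer})
    return dialog
```

In this sketch, re-running retrieval inside the loop is what makes the dialog multi-document grounded: each user turn can pull in new documents rather than conditioning the whole dialog on a single fixed context.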