Exploring the Dialogue Comprehension Ability of Large Language Models

arXiv (Cornell University) Pub Date : 2023-11-13 DOI:10.48550/arxiv.2311.07194

Shuaijie She, Shujian Huang, Xingyun Wang, Yanke Zhou, Jiajun Chen


Citations: 0

Abstract



The recent emergence of large language models (LLMs) has attracted considerable attention. LLMs may interact with users in the form of dialogue and generate responses following their instructions, which naturally requires dialogue comprehension abilities. Without correct comprehension of the dialogue, the model will inevitably generate incorrect responses. However, dialogue comprehension is a general language ability that is hard to evaluate directly. In this work, we propose to perform the evaluation with the help of the dialogue summarization task. Besides evaluating and analyzing dialogue summarization performance (DIAC-Sum), we also derive factual questions from the generated summaries and use them as a more flexible measurement of dialogue comprehension (DIAC-FactQA). Our evaluation shows that, on average, 27% of the summaries generated by LLMs contain factual inconsistencies. Even ChatGPT, the strongest evaluated model, has such errors in 16% of its summaries. On answering the factual questions, which is more challenging, the average accuracy of all evaluated LLMs is only 62.8%. Both results indicate serious deficiencies. Detailed analysis shows that understanding the subject/object of the conversation is still the most challenging problem for LLMs. Furthermore, to stimulate and enhance the dialogue comprehension ability of LLMs, we propose a fine-tuning paradigm with auto-constructed multi-task data. The experimental results demonstrate that our method achieves an accuracy improvement of 8.9% on DIAC-FactQA.
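The two headline figures in the abstract (the share of summaries with factual inconsistencies, and the accuracy on derived factual questions) can be illustrated with a minimal sketch. The abstract does not say how either number is computed, so everything here is an assumption: the record types `SummaryJudgment` and `FactQAItem`, the binary error label, and the exact-match scoring rule are hypothetical, and the toy values are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class SummaryJudgment:
    """One generated summary with a (hypothetical) binary consistency label."""
    model: str
    has_factual_error: bool


@dataclass
class FactQAItem:
    """One factual question derived from a summary, plus a model's answer."""
    model: str
    gold_answer: str
    predicted_answer: str


def inconsistency_rate(judgments):
    """Fraction of summaries flagged as containing a factual inconsistency."""
    if not judgments:
        raise ValueError("no judgments given")
    return sum(j.has_factual_error for j in judgments) / len(judgments)


def factqa_accuracy(items):
    """Fraction of questions answered correctly.

    Assumption: an answer counts as correct on an exact match after
    stripping whitespace and lowercasing; the paper may score differently.
    """
    if not items:
        raise ValueError("no QA items given")
    correct = sum(
        i.gold_answer.strip().lower() == i.predicted_answer.strip().lower()
        for i in items
    )
    return correct / len(items)


# Toy records for illustration only; these are not results from the paper.
toy_judgments = [
    SummaryJudgment("model-a", False),
    SummaryJudgment("model-a", True),
    SummaryJudgment("model-a", False),
    SummaryJudgment("model-a", False),
]
toy_items = [
    FactQAItem("model-a", "Yes", "yes"),
    FactQAItem("model-a", "No", "yes"),
    FactQAItem("model-a", "Yes", " Yes "),
]

print(f"inconsistency rate: {inconsistency_rate(toy_judgments):.2f}")
print(f"FactQA accuracy: {factqa_accuracy(toy_items):.2f}")
```

Averaging such per-model rates across all evaluated models would give aggregate figures of the kind the abstract reports, under the same assumptions.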
