{"title":"An exhaustive evaluation method for open-domain LLM dialogue by constructing recursive CoT","authors":"Shengjie Zhao , Zhenping Xie","doi":"10.1016/j.csl.2026.101957","DOIUrl":null,"url":null,"abstract":"<div><div>In recent years, evaluation methods based on large language models (LLMs) have demonstrated advanced performance in reference-free evaluation of open-domain dialogue quality. However, existing approaches often rely on simple, manually crafted evaluation instructions, lacking the depth and diversity to reflect complex human thinking processes. To address these limitations, we propose the Rec-CoT-Eval framework, a reference-free method for evaluating dialogue quality that automatically constructs a Chain-of-Thought (CoT) through interaction with LLMs. Unlike existing methods that depend on manually crafted instructions, our approach enables the automatic construction of a CoT for evaluation. We treat each evaluation metric as a root task and use prompts to guide the LLMs in recursively decomposing it into sub-problems in a top-down manner. By solving these sub-problems, a comprehensive evaluation CoT is constructed. Ultimately, this CoT is used as a prompt for the LLMs, enabling them to act as dialogue quality evaluation agents and perform reference-free evaluation of target dialogues. Furthermore, the framework incorporates an optional human-computer interaction mechanism, designed to meet the need for fine-grained and personalized customization of evaluation criteria in practical industrial applications. This mechanism allows evaluators to dynamically modify the generated CoT when necessary, integrating expert knowledge to enhance evaluation accuracy and personalization. Experimental results demonstrate that our proposed method achieves a higher correlation with human judgments and outperforms existing approaches.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101957"},"PeriodicalIF":3.4000,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0885230826000203","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2026/2/13 0:00:00","PubModel":"Epub","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
In recent years, evaluation methods based on large language models (LLMs) have demonstrated strong performance in reference-free evaluation of open-domain dialogue quality. However, existing approaches often rely on simple, manually crafted evaluation instructions that lack the depth and diversity needed to reflect complex human thinking processes. To address these limitations, we propose Rec-CoT-Eval, a reference-free framework for evaluating dialogue quality that automatically constructs a Chain-of-Thought (CoT) through interaction with LLMs, rather than depending on manually crafted instructions. We treat each evaluation metric as a root task and use prompts to guide the LLMs in recursively decomposing it into sub-problems in a top-down manner; solving these sub-problems yields a comprehensive evaluation CoT. This CoT is then used as a prompt, enabling the LLMs to act as dialogue quality evaluation agents and perform reference-free evaluation of target dialogues. Furthermore, the framework incorporates an optional human-computer interaction mechanism designed to meet the need for fine-grained, personalized customization of evaluation criteria in practical industrial applications: evaluators can dynamically modify the generated CoT when necessary, integrating expert knowledge to improve evaluation accuracy and personalization. Experimental results demonstrate that our method achieves higher correlation with human judgments than existing approaches.
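The recursive decomposition loop described in the abstract can be pictured with a short sketch. The Python below is a minimal, hypothetical illustration, not the authors' implementation: the llm() callable, the prompt wording, the depth limit, and the 1-to-5 scoring format are all assumptions made for the example.

```python
# A minimal sketch of recursive CoT construction for dialogue evaluation,
# assuming a caller-supplied llm(prompt) -> str function. All prompt text,
# the depth limit, and the scoring scale are illustrative assumptions.
from typing import Callable

def build_cot(metric: str, llm: Callable[[str], str],
              depth: int = 0, max_depth: int = 2) -> str:
    """Treat the metric as a root task, recursively decompose it top-down
    into sub-problems, solve them, and concatenate the answers into a CoT."""
    if depth >= max_depth:
        # Leaf: solve the sub-problem directly.
        return llm(f"Briefly explain how to judge '{metric}' in a dialogue.")
    # Ask the LLM to split the current task into concrete sub-problems.
    subs = llm(
        f"Decompose the dialogue-evaluation task '{metric}' into 2-3 "
        "concrete sub-problems, one per line."
    ).splitlines()
    steps = [build_cot(s.strip(), llm, depth + 1, max_depth)
             for s in subs if s.strip()]
    return f"To evaluate '{metric}':\n" + "\n".join(steps)

def evaluate_dialogue(dialogue: str, metric: str,
                      llm: Callable[[str], str]) -> str:
    """Use the constructed CoT as the prompt for reference-free evaluation."""
    cot = build_cot(metric, llm)
    return llm(
        f"{cot}\n\nFollowing the reasoning above, rate the dialogue below "
        f"on '{metric}' from 1 to 5 and justify the score.\n\n{dialogue}"
    )

if __name__ == "__main__":
    # Stub LLM so the sketch runs without an API; swap in a real client.
    fake = lambda p: ("check relevance\ncheck fluency"
                      if "Decompose" in p else "OK")
    print(evaluate_dialogue("A: Hi\nB: Hello!", "coherence", fake))
```

The optional human-in-the-loop mechanism would slot in between the two steps: an evaluator could inspect and edit the string returned by build_cot before it is passed to the scoring call, which is one plausible reading of how expert knowledge is integrated.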
Journal description:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of, and experimentation with, complex models of speech and language processing have become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.