{"title":"An exhaustive evaluation method for open-domain LLM dialogue by constructing recursive CoT","authors":"Shengjie Zhao , Zhenping Xie","doi":"10.1016/j.csl.2026.101957","DOIUrl":null,"url":null,"abstract":"<div><div>In recent years, evaluation methods based on large language models (LLMs) have demonstrated advanced performance in reference-free evaluation of open-domain dialogue quality. However, existing approaches often rely on simple, manually crafted evaluation instructions, lacking the depth and diversity to reflect complex human thinking processes. To address these limitations, we propose the Rec-CoT-Eval framework, a reference-free method for evaluating dialogue quality that automatically constructs a Chain-of-Thought (CoT) through interaction with LLMs. Unlike existing methods that depend on manually crafted instructions, our approach enables the automatic construction of a CoT for evaluation. We treat each evaluation metric as a root task and use prompts to guide the LLMs in recursively decomposing it into sub-problems in a top-down manner. By solving these sub-problems, a comprehensive evaluation CoT is constructed. Ultimately, this CoT is used as a prompt for the LLMs, enabling them to act as dialogue quality evaluation agents and perform reference-free evaluation of target dialogues. Furthermore, the framework incorporates an optional human-computer interaction mechanism, designed to meet the need for fine-grained and personalized customization of evaluation criteria in practical industrial applications. This mechanism allows evaluators to dynamically modify the generated CoT when necessary, integrating expert knowledge to enhance evaluation accuracy and personalization. Experimental results demonstrate that our proposed method achieves a higher correlation with human judgments and outperforms existing approaches.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"100 ","pages":"Article 101957"},"PeriodicalIF":3.4000,"publicationDate":"2026-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0885230826000203","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2026/2/13 0:00:00","PubModel":"Epub","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
In recent years, evaluation methods based on large language models (LLMs) have demonstrated strong performance in reference-free evaluation of open-domain dialogue quality. However, existing approaches often rely on simple, manually crafted evaluation instructions that lack the depth and diversity needed to reflect complex human thinking processes. To address these limitations, we propose Rec-CoT-Eval, a reference-free framework for evaluating dialogue quality that automatically constructs a Chain-of-Thought (CoT) through interaction with LLMs, rather than depending on manually crafted instructions. We treat each evaluation metric as a root task and use prompts to guide the LLMs in recursively decomposing it into sub-problems in a top-down manner; solving these sub-problems yields a comprehensive evaluation CoT. This CoT is then used as a prompt, enabling the LLMs to act as dialogue quality evaluation agents and perform reference-free evaluation of target dialogues. Furthermore, the framework incorporates an optional human-computer interaction mechanism designed to meet the need for fine-grained, personalized customization of evaluation criteria in practical industrial applications: evaluators can dynamically modify the generated CoT when necessary, integrating expert knowledge to improve evaluation accuracy and personalization. Experimental results demonstrate that our method achieves higher correlation with human judgments than existing approaches.
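The recursive decomposition loop described in the abstract can be pictured with a short sketch. The Python below is a minimal, hypothetical illustration, not the authors' implementation: the llm() callable, the prompt wording, the depth limit, and the 1-to-5 scoring format are all assumptions made for the example.

```python
# A minimal sketch of recursive CoT construction for dialogue evaluation,
# assuming a caller-supplied llm(prompt) -> str function. All prompt text,
# the depth limit, and the scoring scale are illustrative assumptions.
from typing import Callable

def build_cot(metric: str, llm: Callable[[str], str],
              depth: int = 0, max_depth: int = 2) -> str:
    """Treat the metric as a root task, recursively decompose it top-down
    into sub-problems, solve them, and concatenate the answers into a CoT."""
    if depth >= max_depth:
        # Leaf: solve the sub-problem directly.
        return llm(f"Briefly explain how to judge '{metric}' in a dialogue.")
    # Ask the LLM to split the current task into concrete sub-problems.
    subs = llm(
        f"Decompose the dialogue-evaluation task '{metric}' into 2-3 "
        "concrete sub-problems, one per line."
    ).splitlines()
    steps = [build_cot(s.strip(), llm, depth + 1, max_depth)
             for s in subs if s.strip()]
    return f"To evaluate '{metric}':\n" + "\n".join(steps)

def evaluate_dialogue(dialogue: str, metric: str,
                      llm: Callable[[str], str]) -> str:
    """Use the constructed CoT as the prompt for reference-free evaluation."""
    cot = build_cot(metric, llm)
    return llm(
        f"{cot}\n\nFollowing the reasoning above, rate the dialogue below "
        f"on '{metric}' from 1 to 5 and justify the score.\n\n{dialogue}"
    )

if __name__ == "__main__":
    # Stub LLM so the sketch runs without an API; swap in a real client.
    fake = lambda p: ("check relevance\ncheck fluency"
                      if "Decompose" in p else "OK")
    print(evaluate_dialogue("A: Hi\nB: Hello!", "coherence", fake))
```

The optional human-in-the-loop mechanism would slot in between the two steps: an evaluator could inspect and edit the string returned by build_cot before it is passed to the scoring call, which is one plausible reading of how expert knowledge is integrated.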
Journal description:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of, and experimentation with, complex models of speech and language processing have become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.