MSDiagnosis: A benchmark and framework for evaluating large language models in multi-step clinical diagnosis

IF 7.6 · Zone 1, Computer Science · JCR Q1: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Knowledge-Based Systems · Pub Date: 2025-09-23 · DOI: 10.1016/j.knosys.2025.114524

Ruihui Hou, Shencheng Chen, Yongqi Fan, Guangya Yu, Lifeng Zhu, Jing Sun, Jingping Liu, Tong Ruan

Volume 330, Article 114524 · https://www.sciencedirect.com/science/article/pii/S0950705125015631

Citations: 0

Abstract

Clinical diagnosis is critical in clinical decision-making and typically requires a continuous, evolving process that includes primary, differential, and final diagnoses. However, most existing clinical diagnostic tasks are single-step processes, which do not align with the complex multi-step diagnostic procedures found in real clinical scenarios. In this paper, we propose MSDiagnosis, a Chinese multi-step clinical diagnostic benchmark consisting of 2225 cases from 12 departments, covering primary, differential, and final diagnosis tasks. Conventional approaches often rely on large language models (LLMs) to perform these tasks sequentially, which can lead to error propagation. To address this, we propose a two-stage diagnostic framework consisting of a forward inference module and a backward inference and refinement module. This framework is applied at each diagnostic stage to mitigate error propagation across steps. The forward module retrieves similar cases to assist the LLM in generating an initial diagnosis. In the backward inference and refinement module, we first perform backward inference to infer the diagnostic criteria associated with the initially identified potential diseases. These criteria are then compared with the patient’s records to identify and eliminate possible misdiagnoses. Finally, the diagnostic conclusion is further refined and confirmed. Using MSDiagnosis, we evaluate medical LLMs (e.g., OpenBioLLM, PULSE, and Apollo2), general LLMs (e.g., DeepSeek-V3, OpenAI-O1, and GLM4), and our proposed framework. Experimental results show that our framework achieves state-of-the-art performance, demonstrating its effectiveness in multi-step diagnostic tasks. We also provide a detailed analysis and suggest future research directions for this task. Our code and data are publicly available at https://github.com/nlper-hou/MSDiagnosis.
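
The forward and backward modules described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the retriever, the prompts, or how criteria are matched against the record, so the token-overlap similarity, the substring matching, the `min_support` threshold, and every function name below are assumptions made for illustration. The two callables `propose` and `criteria_for` stand in for LLM calls and are replaced here by toy stubs.

```python
from typing import Callable, Dict, List

Case = Dict[str, str]


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity (assumed stand-in for the real retriever)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0


def retrieve_similar_cases(record: str, case_bank: List[Case], k: int = 2) -> List[Case]:
    """Forward module, step 1: pick the k past cases most similar to the record."""
    ranked = sorted(case_bank, key=lambda c: jaccard(record, c["record"]), reverse=True)
    return ranked[:k]


def forward_inference(record: str, case_bank: List[Case],
                      propose: Callable[[str, List[Case]], List[str]]) -> List[str]:
    """Forward module, step 2: the model proposes an initial diagnosis list."""
    return propose(record, retrieve_similar_cases(record, case_bank))


def backward_refine(record: str, candidates: List[str],
                    criteria_for: Callable[[str], List[str]],
                    min_support: float = 0.5) -> List[str]:
    """Backward module: infer criteria per candidate, compare them with the
    record, and drop candidates whose criteria are poorly supported."""
    text = record.lower()
    kept = []
    for disease in candidates:
        criteria = criteria_for(disease)
        if not criteria:
            continue  # no criteria inferred: cannot confirm this candidate
        support = sum(c.lower() in text for c in criteria) / len(criteria)
        if support >= min_support:
            kept.append(disease)
    return kept


# --- Toy data and stubs (hypothetical; a real system would call an LLM) ---
CASE_BANK = [
    {"record": "fever cough sputum chest pain", "diagnosis": "pneumonia"},
    {"record": "chest pain radiating to left arm sweating",
     "diagnosis": "myocardial infarction"},
    {"record": "headache nausea photophobia", "diagnosis": "migraine"},
]
CRITERIA = {
    "pneumonia": ["fever", "cough"],
    "myocardial infarction": ["chest pain", "sweating", "radiating pain"],
}


def toy_propose(record: str, examples: List[Case]) -> List[str]:
    # Stub: simply reuse the diagnoses of the retrieved cases.
    return [e["diagnosis"] for e in examples]


def toy_criteria(disease: str) -> List[str]:
    # Stub: look criteria up in a fixed table instead of asking a model.
    return CRITERIA.get(disease, [])


record = "patient reports fever and cough with mild chest pain"
initial = forward_inference(record, CASE_BANK, toy_propose)
final = backward_refine(record, initial, toy_criteria)
print(initial)  # ['pneumonia', 'myocardial infarction']
print(final)    # ['pneumonia']
```

In this toy run the forward step proposes two candidates, and the backward step removes the one whose criteria are mostly absent from the record. In a multi-step setting the same pair of calls would be repeated at each diagnostic stage, so a wrong candidate can be filtered out before it is passed to the next stage.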


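Source journal

Knowledge-Based Systems (Engineering & Technology; Computer Science: Artificial Intelligence)

CiteScore: 14.80

Self-citation rate: 12.50%

Articles published: 1245

Review time: 7.8 months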

About the journal: Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems based on knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, provide balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.