MSDiagnosis: A benchmark and framework for evaluating large language models in multi-step clinical diagnosis
Ruihui Hou, Shencheng Chen, Yongqi Fan, Guangya Yu, Lifeng Zhu, Jing Sun, Jingping Liu, Tong Ruan
Knowledge-Based Systems, volume 330, Article 114524 (published 2025-09-23)
DOI: 10.1016/j.knosys.2025.114524
URL: https://www.sciencedirect.com/science/article/pii/S0950705125015631
Impact factor: 7.6; JCR: Q1 (Computer Science, Artificial Intelligence)
Citation count: 0
Abstract
Clinical diagnosis is critical in clinical decision-making and typically follows a continuous, evolving process that includes primary, differential, and final diagnoses. However, most existing clinical diagnostic tasks are single-step and do not reflect the complex multi-step diagnostic procedures found in real clinical scenarios. In this paper, we propose MSDiagnosis, a Chinese multi-step clinical diagnostic benchmark consisting of 2225 cases from 12 departments, covering primary, differential, and final diagnosis tasks. Conventional approaches often rely on large language models (LLMs) to perform these tasks sequentially, which can lead to error propagation. To address this, we propose a two-stage diagnostic framework consisting of a forward inference module and a backward inference and refinement module. The framework is applied at each diagnostic stage to mitigate error propagation across steps. The forward module retrieves similar cases to assist the LLM in generating an initial diagnosis. The backward inference and refinement module first infers the diagnostic criteria associated with the initially identified candidate diseases. These criteria are then compared with the patient's records to identify and eliminate possible misdiagnoses. Finally, the diagnostic conclusion is refined and confirmed. Using MSDiagnosis, we evaluate medical LLMs (e.g., OpenBioLLM, PULSE, and Apollo2), general LLMs (e.g., DeepSeek-V3, OpenAI-O1, and GLM4), and our proposed framework. Experimental results show that our framework achieves state-of-the-art performance, demonstrating its effectiveness in multi-step diagnostic tasks. We also provide a detailed analysis and suggest future research directions for this task. Our code and data are publicly available at https://github.com/nlper-hou/MSDiagnosis.
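The forward and backward steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the LLM calls are replaced by simple set operations, and every name here (`Case`, `CASE_BASE`, `CRITERIA`, `forward_inference`, `backward_refine`, `diagnose`), the toy findings, and the coverage threshold are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    findings: frozenset  # toy findings extracted from a record
    diagnosis: str


# Hypothetical toy case base; not taken from MSDiagnosis.
CASE_BASE = [
    Case(frozenset({"fever", "cough", "chest_infiltrate"}), "pneumonia"),
    Case(frozenset({"fever", "cough", "wheeze"}), "bronchitis"),
    Case(frozenset({"chest_pain", "st_elevation"}), "myocardial_infarction"),
]

# Hypothetical criteria table; a stand-in for criteria an LLM would infer.
CRITERIA = {
    "pneumonia": {"fever", "cough", "chest_infiltrate"},
    "bronchitis": {"cough", "wheeze"},
    "myocardial_infarction": {"chest_pain", "st_elevation"},
}


def jaccard(a, b):
    """Set overlap used here as a simple similarity measure (an assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def forward_inference(findings, k=2):
    """Forward step: retrieve the k most similar cases, propose their diagnoses."""
    ranked = sorted(
        CASE_BASE, key=lambda c: jaccard(findings, c.findings), reverse=True
    )
    candidates = []
    for case in ranked[:k]:
        if case.diagnosis not in candidates:
            candidates.append(case.diagnosis)
    return candidates


def backward_refine(findings, candidates, threshold=1.0):
    """Backward step: look up each candidate's criteria, compare them with the
    record, and drop candidates whose criteria coverage is below the threshold."""
    support = {}
    for disease in candidates:
        criteria = CRITERIA.get(disease, set())
        if not criteria:
            continue
        coverage = len(criteria & findings) / len(criteria)
        if coverage >= threshold:
            support[disease] = coverage
    # Best-supported candidates first; ties broken alphabetically.
    return sorted(support, key=lambda d: (-support[d], d))


def diagnose(findings):
    """Run the forward step, then the backward check, on one record."""
    findings = frozenset(findings)
    return backward_refine(findings, forward_inference(findings))


if __name__ == "__main__":
    print(diagnose({"fever", "cough", "chest_infiltrate"}))
```

In this toy setup the forward step proposes two candidates for the example record, and the backward check removes the one whose criteria the record does not fully meet. In the framework described above, both steps would be driven by an LLM and applied at each diagnostic stage rather than once.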
About the journal:
Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems built on knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, provide balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.