Guanhong Yao , WuJi Zhang , Yingxi Zhu , Ut-kei Wong , Yanfeng Zhang , Cui Yang , Guanghao Shen , Zhanguo Li , Hui Gao
International Journal of Medical Informatics, Volume 203, Article 106026. Published 2025-06-25. DOI: 10.1016/j.ijmedinf.2025.106026.
Comparing the accuracy of large language models and prompt engineering in diagnosing real-world cases
Importance
Large language models (LLMs) hold potential in clinical decision-making, especially for complex and rare disease diagnoses. However, real-world applications require further evaluation for accuracy and utility.
Objective
To evaluate the diagnostic performance of four LLMs (GPT-4o mini, GPT-4o, ERNIE, and Llama-3) using real-world inpatient medical records and assess the impact of different prompt engineering methods.
Method
This single-center, retrospective study was conducted at Peking University International Hospital. It involved 1,122 medical records categorized into common rheumatic autoimmune diseases, rare rheumatic autoimmune diseases, and non-rheumatic diseases. Four LLMs were evaluated using two prompt engineering methods: few-shot and chain-of-thought prompting. Diagnostic accuracy (hit1) was defined as the model's top prediction containing the first final diagnosis recorded in the medical record.
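The hit1 metric above can be computed mechanically. A minimal sketch, assuming each record supplies its first final diagnosis as a string and each model returns a single top-ranked prediction; the substring-match criterion here is a stand-in, since the abstract does not specify how a diagnosis was judged to be "included" in the prediction:

```python
def hit1(gold_diagnoses, top_predictions):
    """Fraction of cases where the record's first final diagnosis
    appears in the model's top prediction (hit1).

    gold_diagnoses: list of first final diagnoses from the records.
    top_predictions: list of each model's top-ranked prediction text.
    Both are hypothetical structures; case-insensitive substring
    matching is an illustrative stand-in for the paper's criterion."""
    hits = sum(
        1
        for gold, pred in zip(gold_diagnoses, top_predictions)
        if gold.lower() in pred.lower()
    )
    return hits / len(gold_diagnoses)

# Example: the gold diagnosis appears inside the prediction text.
acc = hit1(
    ["systemic lupus erythematosus"],
    ["Systemic lupus erythematosus (SLE), active"],
)
```

In practice a stricter criterion (exact ICD-code match, or manual adjudication) may have been used; the aggregation into a single rate is the same either way.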
Results
Hit1 rates for the four LLMs were as follows: GPT-4o mini (81.8%), GPT-4o (82.4%), ERNIE (82.9%), and Llama-3 (82.7%). Few-shot prompting significantly improved GPT-4o's hit1 (85.9%) over its base prompting (p = 0.02), outperforming the other models (all p < 0.05). Chain-of-thought prompting produced no significant improvement. Hit1 for both common and rare rheumatic diseases was consistently higher than for non-rheumatic diseases. Few-shot prompting increased the cost per correct diagnosis for GPT-4o by approximately ¥4.54.
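The cost-per-correct-diagnosis figure follows from simple arithmetic: few-shot prompts add input tokens, so total API spend rises even while accuracy improves. A minimal sketch; all spend figures below are invented for illustration and are not the paper's data:

```python
def cost_per_correct(total_cost_yuan, n_cases, accuracy):
    """Average spend (yuan) per correctly diagnosed case.

    Number of correct diagnoses is approximated as n_cases * accuracy
    (the hit1 rate). All inputs here are illustrative assumptions."""
    return total_cost_yuan / (n_cases * accuracy)

# Hypothetical spend figures for GPT-4o on 1,122 cases:
base = cost_per_correct(1000.0, 1122, 0.824)  # base prompting
few = cost_per_correct(5400.0, 1122, 0.859)   # few-shot prompting
delta = few - base  # the per-correct-diagnosis cost increase
```

This framing makes the trade-off explicit: few-shot prompting is worthwhile only where the marginal accuracy gain justifies the extra per-diagnosis spend.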
Conclusions
LLMs, including GPT-4o, demonstrate promising diagnostic accuracy on real medical records. Few-shot prompting enhances performance but at higher costs, underscoring the need for accuracy improvements and cost management. These findings inform LLM development in Chinese medical contexts and highlight the necessity for further multi-center validation.
About the journal:
International Journal of Medical Informatics provides an international medium for dissemination of original results and interpretative reviews concerning the field of medical informatics. The Journal emphasizes the evaluation of systems in healthcare settings.
The scope of the journal covers:
Information systems, including national or international registration systems, hospital information systems, departmental and/or physician's office systems, document handling systems, electronic medical record systems, standardization, systems integration, etc.;
Computer-aided medical decision support systems using heuristic, algorithmic and/or statistical methods as exemplified in decision theory, protocol development, artificial intelligence, etc.;
Educational computer based programs pertaining to medical informatics or medicine in general;
Organizational, economic, social, clinical impact, ethical and cost-benefit aspects of IT applications in health care.