Guanhong Yao , WuJi Zhang , Yingxi Zhu , Ut-kei Wong , Yanfeng Zhang , Cui Yang , Guanghao Shen , Zhanguo Li , Hui Gao
International Journal of Medical Informatics, Volume 203, Article 106026. Published 2025-06-25. DOI: 10.1016/j.ijmedinf.2025.106026.
Comparing the accuracy of large language models and prompt engineering in diagnosing real-world cases
Importance
Large language models (LLMs) hold potential in clinical decision-making, especially for complex and rare disease diagnoses. However, real-world applications require further evaluation for accuracy and utility.
Objective
To evaluate the diagnostic performance of four LLMs (GPT-4o mini, GPT-4o, ERNIE, and Llama-3) using real-world inpatient medical records and assess the impact of different prompt engineering methods.
Method
This single-center, retrospective study was conducted at Peking University International Hospital. It involved 1,122 medical records categorized into common rheumatic autoimmune diseases, rare rheumatic autoimmune diseases, and non-rheumatic diseases. Four LLMs were evaluated using two prompt engineering methods: few-shot and chain-of-thought prompting. Diagnostic accuracy (hit1) was defined as the model's top prediction containing the first final diagnosis recorded in the medical record.
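The hit1 metric above can be computed mechanically. A minimal sketch, assuming each record supplies its first final diagnosis as a string and each model returns a single top-ranked prediction; the substring-match criterion here is a stand-in, since the abstract does not specify how a diagnosis was judged to be "included" in the prediction:

```python
def hit1(gold_diagnoses, top_predictions):
    """Fraction of cases where the record's first final diagnosis
    appears in the model's top prediction (hit1).

    gold_diagnoses: list of first final diagnoses from the records.
    top_predictions: list of each model's top-ranked prediction text.
    Both are hypothetical structures; case-insensitive substring
    matching is an illustrative stand-in for the paper's criterion."""
    hits = sum(
        1
        for gold, pred in zip(gold_diagnoses, top_predictions)
        if gold.lower() in pred.lower()
    )
    return hits / len(gold_diagnoses)

# Example: the gold diagnosis appears inside the prediction text.
acc = hit1(
    ["systemic lupus erythematosus"],
    ["Systemic lupus erythematosus (SLE), active"],
)
```

In practice a stricter criterion (exact ICD-code match, or manual adjudication) may have been used; the aggregation into a single rate is the same either way.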
Results
Hit1 rates for the four LLMs were as follows: GPT-4o mini (81.8%), GPT-4o (82.4%), ERNIE (82.9%), and Llama-3 (82.7%). Few-shot prompting significantly improved GPT-4o's hit1 (85.9%) over its base prompting (p = 0.02), outperforming the other models (all p < 0.05). Chain-of-thought prompting produced no significant improvement. Hit1 for both common and rare rheumatic diseases was consistently higher than for non-rheumatic diseases. Few-shot prompting increased the cost per correct diagnosis for GPT-4o by approximately ¥4.54.
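The cost-per-correct-diagnosis figure follows from simple arithmetic: few-shot prompts add input tokens, so total API spend rises even while accuracy improves. A minimal sketch; all spend figures below are invented for illustration and are not the paper's data:

```python
def cost_per_correct(total_cost_yuan, n_cases, accuracy):
    """Average spend (yuan) per correctly diagnosed case.

    Number of correct diagnoses is approximated as n_cases * accuracy
    (the hit1 rate). All inputs here are illustrative assumptions."""
    return total_cost_yuan / (n_cases * accuracy)

# Hypothetical spend figures for GPT-4o on 1,122 cases:
base = cost_per_correct(1000.0, 1122, 0.824)  # base prompting
few = cost_per_correct(5400.0, 1122, 0.859)   # few-shot prompting
delta = few - base  # the per-correct-diagnosis cost increase
```

This framing makes the trade-off explicit: few-shot prompting is worthwhile only where the marginal accuracy gain justifies the extra per-diagnosis spend.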
Conclusions
LLMs, including GPT-4o, demonstrate promising diagnostic accuracy on real medical records. Few-shot prompting enhances performance but at higher costs, underscoring the need for accuracy improvements and cost management. These findings inform LLM development in Chinese medical contexts and highlight the necessity for further multi-center validation.
About the journal:
International Journal of Medical Informatics provides an international medium for dissemination of original results and interpretative reviews concerning the field of medical informatics. The Journal emphasizes the evaluation of systems in healthcare settings.
The scope of the journal covers:
Information systems, including national or international registration systems, hospital information systems, departmental and/or physician's office systems, document handling systems, electronic medical record systems, standardization, systems integration, etc.;
Computer-aided medical decision support systems using heuristic, algorithmic and/or statistical methods as exemplified in decision theory, protocol development, artificial intelligence, etc.;
Educational computer based programs pertaining to medical informatics or medicine in general;
Organizational, economic, social, clinical impact, ethical and cost-benefit aspects of IT applications in health care.