Comparative analysis of large language models and clinician responses in patient blood management knowledge
Felix Tran, Patrick Meybohm, Lea V Blum, Vanessa Neef, Jan A Kloka, Florian Rumpf, Tobias E Haas, Sebastian Hottenrott, Philipp Helmer, Peter Kranke, Benedikt Schmid, Denana Mehic, Kai Zacharowski, Suma Choorapoikayil
Minerva Anestesiologica, published online August 5, 2025. DOI: 10.23736/S0375-9393.25.19014-7
Abstract
Background: Large language models (LLMs) are increasingly used in the medical field and have the potential to reduce workload and improve treatment procedures in clinical practice. This study evaluates the capabilities of LLMs to answer common questions related to patient blood management (PBM) and compares their performance to the expertise of clinicians from two university hospitals.
Methods: To evaluate the performance of ChatGPT-3.5, ChatGPT-4o, and Google Gemini in answering PBM-related questions, we used a representative sample of 40 questions (30 single-choice and 10 frequently asked patient questions) and compared their responses to those of clinicians. The accuracy and interrater reliability of the answers were analyzed.
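The abstract does not include the underlying analysis code. The following is a minimal Python sketch, under stated assumptions, of how an accuracy proportion with a 95% Wilson confidence interval and a chance-corrected agreement statistic (Cohen's kappa) could be computed; the function names and the example ratings are hypothetical and purely illustrative, not the authors' method.

```python
# Minimal sketch (not the study's actual code): accuracy with a 95% Wilson CI
# and Cohen's kappa for agreement between two raters scoring the same answers.
# All names and example data below are hypothetical.
from math import sqrt

def wilson_ci(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a proportion of correct answers."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

def cohen_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Chance-corrected agreement between two raters (categorical labels)."""
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

# Hypothetical example: two raters scoring 10 LLM answers as correct (1) or incorrect (0).
rater_a = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
rater_b = [1, 1, 0, 1, 0, 1, 0, 1, 1, 1]
low, high = wilson_ci(correct=sum(rater_a), total=len(rater_a))
print(f"accuracy CI: {low:.1%}-{high:.1%}, kappa: {cohen_kappa(rater_a, rater_b):.2f}")
```

The Wilson interval is used here because it behaves better than the simple Wald interval for proportions near 0% or 100%, which is relevant when a model answers nearly every question correctly; the paper's exact interval method is not stated in the abstract.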
Results: For PBM knowledge-based questions, the proportion of correct answers was 96.4% (95% CI: 93.6-98.0%) for ChatGPT-4o, 81.3% (95% CI: 77.0-85.7%) for ChatGPT-3.5, and 84.0% (95% CI: 79.4-87.7%) for Google Gemini. Clinicians (N=82) provided correct answers to 76.5% (95% CI: 74.7-78.1%) of the questions. For frequently asked patient questions, the proportion of correct answers was 100% for ChatGPT-4o, 95.5% (95% CI: 91.4-99.6%) for ChatGPT-3.5, and 91.7% (95% CI: 86.0-97.4%) for Google Gemini. Clinicians provided correct answers to 62.0% (95% CI: 58.7-65.3%) of the questions. Across all categories (anemia management, iron supplementation, cell salvage, principles of PBM, and blood transfusion), ChatGPT-4o achieved the highest scores, providing the most correct answers.
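As a rough plausibility check (an assumption, not stated in the paper): if each of the 82 clinicians answered all 30 knowledge questions, the denominator would be 2,460 responses, and a Wilson interval around 76.5% correct lands close to the reported 74.7-78.1%. A short sketch using statsmodels, with the count of correct answers assumed:

```python
# Hypothetical consistency check, not the authors' analysis: assume each of the
# 82 clinicians answered all 30 knowledge questions (2460 responses in total)
# and that about 76.5% of them (roughly 1882) were correct.
from statsmodels.stats.proportion import proportion_confint

low, high = proportion_confint(count=1882, nobs=2460, alpha=0.05, method="wilson")
print(f"95% CI: {low:.1%} to {high:.1%}")  # roughly 74.8% to 78.1%
```

The actual interval may differ slightly depending on the exact counts and on whether the analysis accounted for clustering of answers within clinicians.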
Conclusions: LLMs show strong potential for delivering accurate and comprehensive responses to common PBM-related questions. However, it remains essential for clinicians and patients to verify responses, particularly in critical situations.
About the journal
Minerva Anestesiologica is the journal of the Italian National Society of Anaesthesia, Analgesia, Resuscitation, and Intensive Care. It publishes scientific papers on anesthesiology, intensive care, analgesia, perioperative medicine, and related fields.
Manuscripts are expected to comply with the instructions to authors, which conform to the Uniform Requirements for Manuscripts Submitted to Biomedical Journals issued by the International Committee of Medical Journal Editors.