Fact-Checking Large Language Model Responses to a Health Care Prompt: Comparative Study
Padhraig Ryan, Orla Davoren, Glyn Elwyn
JMIR Formative Research, vol. 10, e68223. Published 2026-04-15. DOI: 10.2196/68223
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13082570/pdf/
Abstract
Background: Large language models use machine learning to produce natural language. These models have a range of potential applications in health care, such as patient education and diagnosis. However, evaluations of large language models in health care are still scarce.
Objective: This study aimed to (1) evaluate the accuracy and efficiency of automated fact-checking by 2 large language models and (2) illustrate a process through which a large language model might support a patient in redrafting a prompt to include key information needed for patient safety.
Methods: A parallel comparison of 2 large language models and 3 human experts was conducted. A clinical scenario was devised in which a woman aged 23 years questions the safety of retinoid treatment for acne by sending prompts to 2 large language models (GPT-4o and OpenBioLLM-70B). GPT-4o and OpenBioLLM-70B were asked to suggest improvements to the patient's initial prompt to elicit key information for clinical decision-making. After the patient sent the revised prompt to the large language models, the models were then asked to fact-check the final response. To test the generalizability of automated fact-checking, a set of 20 clinical statements on disparate topics, mostly related to drug indications, contraindications, and side effects, was developed. The large language models also fact-checked these 20 medical statements. The results were compared against the evaluations of 3 clinical experts. The outcome measures were as follows: (1) accuracy of automated fact-checking, measured as percentage agreement with the expert evaluations; (2) time to complete fact-checking; and (3) a binary outcome for prompt redrafting (whether the model advised the patient to revise her prompt by naming her acne medication, which is needed to address safety concerns).
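The core fact-checking step described above — sending a clinical statement to a model and recording a binary verdict — can be sketched as below. This is a hypothetical illustration only: the study does not publish its code, and the prompt wording and verdict parsing here are assumptions, not the authors' actual protocol.

```python
# Hypothetical sketch of an LLM fact-checking harness.
# The prompt template and the TRUE/FALSE parsing convention are
# illustrative assumptions; they are not taken from the study.

def build_factcheck_prompt(statement: str) -> str:
    """Wrap a clinical statement in a fact-checking instruction."""
    return (
        "You are fact-checking a medical statement. Reply with exactly "
        "'TRUE' or 'FALSE', followed by a one-sentence justification.\n\n"
        f"Statement: {statement}"
    )

def parse_verdict(model_reply: str) -> bool:
    """Reduce a free-text model reply to a binary verdict
    (True = the model judged the statement accurate)."""
    return model_reply.strip().upper().startswith("TRUE")

# Example: one of the drug-safety topics mentioned in the Methods.
prompt = build_factcheck_prompt("Isotretinoin is contraindicated in pregnancy.")
verdict = parse_verdict("TRUE - isotretinoin is a known teratogen.")
```

In practice the prompt string would be sent to each model's API, and the parsed verdicts compared against expert labels; the parsing step is where a harness like this is most fragile, since models do not always follow output-format instructions.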
Results: For the scenario of a patient with acne, GPT-4o and OpenBioLLM-70B both had 86% agreement with the clinical experts' fact-checking. The large language models did not consistently convey the urgency of discontinuing isotretinoin treatment when pregnancy is suspected. In addition, the models did not adequately convey the importance of folic acid supplementation during pregnancy. For the set of 20 medical claims, GPT-4o fact-checking had 100% agreement with that of human experts, whereas OpenBioLLM-70B had 95% agreement. OpenBioLLM-70B diverged from human experts and GPT-4o on 1 question related to pediatric use of antihistamines. The expert fact-checks took a mean time of 18 (SD 3.74) minutes, GPT-4o took 42 seconds, and OpenBioLLM-70B took 33 minutes. The GPT-4o responses for the acne scenario had some inconsistencies but zero fabrication and no obvious omissions. In contrast, OpenBioLLM-70B omitted 1 key information item needed for patient safety.
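The agreement figures reported above are simple percentage agreement between model verdicts and the expert consensus. A minimal sketch, using illustrative verdict lists constructed to reproduce the reported 20-statement result (the actual per-item verdicts are not given in the abstract):

```python
def percent_agreement(model_verdicts, expert_verdicts):
    """Percentage of items on which the model matches the expert consensus."""
    assert len(model_verdicts) == len(expert_verdicts)
    matches = sum(m == e for m, e in zip(model_verdicts, expert_verdicts))
    return 100 * matches / len(expert_verdicts)

# Illustrative data: 20 expert verdicts. GPT-4o matches all 20 (100%);
# OpenBioLLM-70B diverges on one item (19/20 = 95%), as reported.
experts = [True] * 12 + [False] * 8
gpt4o = list(experts)
openbio = list(experts)
openbio[5] = not openbio[5]  # the single divergent verdict

gpt4o_score = percent_agreement(gpt4o, experts)      # 100.0
openbio_score = percent_agreement(openbio, experts)  # 95.0
```

Percentage agreement is the simplest such measure; chance-corrected statistics such as Cohen's kappa are often preferred when verdict base rates are skewed.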
Conclusions: GPT-4o can interact with patients to improve the quality and comprehensiveness of the information contained in health-related prompts. GPT-4o and OpenBioLLM-70B can conduct efficient fact-checking that is close to the level of accuracy of human experts. Human experts need to perform additional checks for accuracy and safety.