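Evaluating large language models as a supplementary patient information resource on antimalarial use in systemic lupus erythematosus.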

IF 1.9 · CAS Zone 4 (Medicine) · JCR Q3 (Rheumatology)

Lupus Pub Date : 2025-04-01 Epub Date: 2025-02-27 DOI:10.1177/09612033251324501

Pamela Munguía-Realpozo, Claudia Mendoza-Pinto, Ivet Etchegaray-Morales, Edith Ramírez-Lara, Juan Carlos Solis-Poblano, Socorro Méndez-Martínez, Laura Serrano Vertiz, Jorge Ayón-Aguilar

Citations: 0

Abstract

Objective: To assess the accuracy, completeness, and reproducibility of large language models (LLMs) (Copilot, GPT-3.5, and GPT-4) in answering questions on antimalarial use in systemic lupus erythematosus (SLE).

Materials and Methods: We used 13 questions derived from patient surveys and common inquiries from the National Health Service. Two independent rheumatologists assessed the LLMs' responses using predefined Likert scales for accuracy, completeness, and reproducibility.

Results: The GPT models and Copilot achieved high accuracy scores. However, the completeness of outputs was rated at 38.5%, 55.9%, and 92.3% for Copilot, GPT-3.5, and GPT-4, respectively. When questions related to "mechanism of action" and "lifestyle" were analyzed for completeness (n = 8), ChatGPT-4 scored significantly higher (100%) than Copilot (37.5%). Questions related to "side-effects" (n = 5) also scored higher for the ChatGPT models than for Copilot, but the differences were not statistically significant. All three LLMs demonstrated high reproducibility, with rates ranging from 84.6% to 92.3%.

Conclusions: Advanced LLMs like GPT-4 offer significant promise in enhancing patients' understanding of antimalarial therapy in SLE. Although chatbots can potentially bridge the information gap patients face, the performance and limitations of such tools need further exploration to optimize their use in clinical settings.
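The subgroup comparison in the Results (100% versus 37.5% completeness over n = 8 questions) can be checked with a small calculation. The sketch below is illustrative only: the abstract does not name the statistical test that was used, so Fisher's exact test is an assumption here, as is reading the percentages as 8 of 8 and 3 of 8 answers rated complete. The function and variable names are hypothetical.

```python
from fractions import Fraction
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> Fraction:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, row1)

    def prob(k: int) -> Fraction:
        # Hypergeometric probability of k "complete" answers in row 1,
        # with all row and column totals held fixed.
        return Fraction(comb(col1, k) * comb(total - col1, row1 - k), denom)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    observed = prob(a)
    # Sum the probabilities of every table at least as extreme as the observed one.
    return sum((prob(k) for k in range(lo, hi + 1) if prob(k) <= observed),
               Fraction(0))


# Assumed counts, inferred from the percentages in the abstract.
N_QUESTIONS = 8
gpt4_complete = 8      # 100% of 8
copilot_complete = 3   # 37.5% of 8

p_value = fisher_exact_two_sided(
    gpt4_complete, N_QUESTIONS - gpt4_complete,
    copilot_complete, N_QUESTIONS - copilot_complete,
)
print(f"two-sided p = {float(p_value):.4f}")
```

Exact fractions are used instead of floats so that the "at least as extreme" comparison is not affected by rounding. Under these assumptions the p-value falls below 0.05, which is consistent with the abstract's description of the difference as significant, though the authors' actual analysis may differ.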

Source journal

Lupus (Medicine: Rheumatology)

CiteScore: 4.20
Self-citation rate: 11.50%
Articles published: 225
Review time: 1 month

About the journal: The only fully peer-reviewed international journal devoted exclusively to lupus (and related disease) research. Lupus includes the most promising new clinical and laboratory-based studies from leading specialists in all lupus-related disciplines. Invaluable reading, with extended coverage; lupus-related disciplines include: Rheumatology, Dermatology, Immunology, Obstetrics, Psychiatry and Cardiovascular Research…