How well do different chatbots respond to multiple myeloma treatment guidelines?

IF 12.8 | Zone 1, Medicine | Q1, Hematology
Edwin U. Suárez, Fabio Torres-Saavedra, Amalia Domingo-González, Jorge Cardete, Pilar Llamas-Sillero
Leukemia. Journal article, published 2025-04-07. DOI: 10.1038/s41375-025-02604-8 (https://doi.org/10.1038/s41375-025-02604-8)

Abstract

Recent publications in Leukemia offer further affirmation that daratumumab is a cornerstone of first-line therapy for transplant-ineligible patients with newly diagnosed multiple myeloma (NDMM) [1, 2]. These articles are a subgroup analysis and a long-term follow-up of the MAIA trial. This raises the question of whether the advent of new therapies in multiple myeloma, together with the substantial amount of information now available, will enable the most commonly used chatbots to provide accurate answers aligned with management guidelines. There is a significant risk that these models may produce hallucinations (incorrect or confusing outputs), amplify biases and misinformation, and exhibit deficiencies in reasoning [3]. Therefore, we set out to evaluate artificial intelligence (AI)-based chatbots as tools to aid in understanding and using guidelines for treating NDMM.

Templates were used to create different diagnosis descriptions for NDMM in standard- or high-risk transplant candidates and in non-candidates. Prompts were input to the GPT-4o model via the ChatGPT (OpenAI) interface, Gemini 1.5 Flash and Gemini 1.5 Pro (Google), Copilot (Microsoft), OpenEvidence, and Claude 3.5 Sonnet. The chatbot outputs were evaluated against two reliable sources: the 2021 National Comprehensive Cancer Network (NCCN) guidelines and the 2021 guidelines from the Spanish Myeloma Group on multiple myeloma [4, 5]. The 2021 editions were used because the chatbots' knowledge cutoff dates varied at the start of the study. Questions related to the Spanish guidelines were asked in both Spanish and English. The chatbots were not given feedback containing treatment options, clinical trials, management strategies for multiple myeloma, or examples of correct answers. Concordance was defined as the level of consistency of chatbot results with treatment guidelines and recommendations. It was scored from 0 to 2, where 0 meant the chatbot did not agree with the guidelines, 1 meant it was partly right, and 2 meant it was correct. Three certified hematologists and one hematology trainee analyzed the concordance of the chatbot outputs with both guidelines. Agreement among the evaluators' responses was also assessed. Reliability was evaluated with Kendall's W (which ranges between 0 and 1; values close to 1 indicate a strong association, and values close to 0 indicate a weak or null association). Institutional review board approval was not needed since human participants were not involved. Data were analyzed between November 1, 2024, and January 31, 2025, using Google spreadsheets, Excel (version 16.74), and SPSS (version 25).
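To make the reliability measure concrete, the following is a minimal sketch of how Kendall's W could be computed for 0 to 2 concordance scores. It is not the authors' pipeline (the abstract states that SPSS was used); the function names and the example scores are hypothetical, and the tie-corrected form of W is an assumption chosen because a three-level scale produces many tied ranks.

```python
from collections import Counter


def average_ranks(values):
    """Rank values in ascending order; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # 1-based average rank for the tied run
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def kendalls_w(ratings):
    """Tie-corrected Kendall's W; ratings[r][i] is rater r's score for response i."""
    m = len(ratings)      # number of raters
    n = len(ratings[0])   # number of rated chatbot responses
    rank_rows = [average_ranks(row) for row in ratings]
    rank_sums = [sum(row[i] for row in rank_rows) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    # Tie correction: sum of (t^3 - t) over every group of tied scores, per rater.
    ties = sum(t ** 3 - t for row in ratings for t in Counter(row).values())
    denom = m ** 2 * (n ** 3 - n) - m * ties
    if denom == 0:
        raise ValueError("W is undefined when every rater gives all responses the same score")
    return 12 * s / denom


# Hypothetical scores (0 = not concordant, 1 = partly, 2 = concordant):
# four raters, six chatbot responses. These numbers are illustrative only.
example_scores = [
    [2, 1, 0, 2, 1, 2],
    [2, 1, 0, 2, 2, 2],
    [2, 0, 0, 2, 1, 2],
    [1, 1, 0, 2, 1, 2],
]

if __name__ == "__main__":
    print(f"Kendall's W = {kendalls_w(example_scores):.3f}")
```

Under this sketch, identical score vectors from all raters give W = 1, and two raters who rank the responses in exactly opposite order give W = 0, which matches the interpretation of the range described above.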


Source journal: Leukemia (Medicine, Hematology)