How well do different chatbots respond to multiple myeloma treatment guidelines?
Authors: Edwin U. Suárez, Fabio Torres-Saavedra, Amalia Domingo-González, Jorge Cardete, Pilar Llamas-Sillero
Journal: Leukemia
Publication date: 2025-04-07
DOI: https://doi.org/10.1038/s41375-025-02604-8
Recent publications in Leukemia offer further affirmation that daratumumab is a cornerstone of first-line therapy for transplant-ineligible patients with newly diagnosed multiple myeloma (NDMM) [1, 2]. These articles are a subgroup analysis and a long-term follow-up of the MAIA trial. This raises the question of whether, given the advent of new therapies in multiple myeloma and the substantial amount of information now available, the most commonly used chatbots can provide accurate answers aligned with management guidelines. There is a significant risk that these models may produce hallucinations (incorrect or confusing outputs), amplify biases and misinformation, and exhibit deficiencies in reasoning abilities [3]. Therefore, we wanted to evaluate artificial intelligence (AI)-based chatbots as tools to aid in understanding and using guidelines for treating NDMM.
Templates were used to create diagnosis descriptions for NDMM covering standard-risk and high-risk transplant candidates and non-candidates. Prompts were input to the GPT-4o model via the ChatGPT (OpenAI) interface, and to Gemini 1.5 Flash and Gemini 1.5 Pro (Google), Copilot (Microsoft), OpenEvidence, and Claude 3.5 Sonnet. The outputs of the chatbots were evaluated against two reliable sources: the 2021 National Comprehensive Cancer Network (NCCN) guidelines and the 2021 guidelines from the Spanish Myeloma Group on multiple myeloma [4, 5]. The 2021 editions were used because the chatbots' cutoff dates varied at the start of the study. Questions related to the Spanish guidelines were asked in both Spanish and English. The chatbots were not given feedback containing treatment options, clinical trials, or management strategies for multiple myeloma, nor examples of correct answers. Concordance was defined as the level of consistency of chatbot results with treatment guidelines and recommendations. It was scored from 0 to 2, where 0 meant the chatbot did not agree with the guidelines, 1 meant it was partly right, and 2 meant it was correct. Three certified hematologists and one hematology trainee analyzed the concordance of the chatbot outputs with both guidelines. Agreement among the evaluators' responses was also assessed, using Kendall's W (which ranges from 0 to 1; values close to 1 indicate a strong association, and values close to 0 indicate a weak or null association). Institutional review board approval was not needed since human participants were not involved. Data were analyzed between November 1, 2024, and January 31, 2025, using Google spreadsheets, Excel (version 16.74), and SPSS (version 25).
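To make the agreement measure concrete, the following is a minimal sketch of how Kendall's W could be computed for a table of 0 to 2 concordance scores. It is not the authors' procedure (they report using spreadsheets and SPSS). The function names, the use of the tie-corrected form of W, and the example scores are all assumptions for illustration; the tie correction is chosen here only because a three-point scale produces many tied ranks.

```python
from collections import Counter


def average_ranks(scores):
    """Rank scores from 1..n, giving tied scores the mean of their positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def kendalls_w(ratings):
    """Tie-corrected Kendall's W; ratings[j][i] is rater j's score for item i."""
    m = len(ratings)       # number of raters
    n = len(ratings[0])    # number of rated items (chatbot responses)
    rank_rows = [average_ranks(row) for row in ratings]
    rank_sums = [sum(row[i] for row in rank_rows) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    # Tie term: sum of (t^3 - t) over each group of t tied scores, per rater.
    ties = sum(sum(t ** 3 - t for t in Counter(row).values()) for row in ratings)
    denom = m ** 2 * (n ** 3 - n) - m * ties
    if denom == 0:
        raise ValueError("W is undefined when every rater gives all items the same score")
    return 12 * s / denom


# Hypothetical scores (NOT study data): 4 raters x 6 chatbot responses, scale 0-2.
example_scores = [
    [2, 1, 0, 2, 1, 0],
    [2, 1, 0, 2, 2, 0],
    [2, 2, 0, 1, 1, 0],
    [2, 1, 1, 2, 1, 0],
]
print(round(kendalls_w(example_scores), 3))
```

In this sketch each rater's scores are converted to ranks before the rank sums are compared, so W reflects whether the raters order the responses similarly, not whether they assign identical scores.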
About the journal:
Title: Leukemia
Journal Overview:
Publishes high-quality, peer-reviewed research
Covers all aspects of research and treatment of leukemia and allied diseases
Includes studies of normal hemopoiesis due to comparative relevance
Topics of Interest:
Oncogenes
Growth factors
Stem cells
Leukemia genomics
Cell cycle
Signal transduction
Molecular targets for therapy
And more
Content Types:
Original research articles
Reviews
Letters
Correspondence
Comments elaborating on significant advances and covering topical issues