Chatbot Responses to Frequently Asked Questions About Cannabis and Its Use for Cancer Symptoms
Min Ji Kim, Donald I Abrams, Ilana M Braun, Amy A Case, Mellar P Davis, Kimberson Tanco, Mark S Wallace, Christopher M Manuel, Eduardo Bruera, David Hui
Journal of Pain and Symptom Management, 2026. DOI: 10.1016/j.jpainsymman.2026.04.002
Abstract
Context: Chatbots are increasingly used by the public, but their performance in answering questions about complex health topics, such as cannabis, is unknown.
Objectives: To evaluate the responses of three popular chatbots to questions about cannabis and its use for cancer-related symptoms.
Methods: We asked ChatGPT, Google Gemini, and Microsoft Co-Pilot to answer questions about cannabis derived from the Centers for Disease Control and Prevention website and the American Society of Clinical Oncology guidelines on cannabis. Responses were collected on February 6, 2025. Six physicians with expertise in this field scored responses for accuracy and comprehensiveness (0-10 scale). Reliability of references was scored separately (0-10 scale). Readability was assessed using Flesch-Kincaid Grade Level and Flesch Reading Ease scores.
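For reference, the standard published formulas for these two readability metrics (the abstract does not specify the tool used to compute them) are:
Flesch Reading Ease = 206.835 - 1.015 × (total words / total sentences) - 84.6 × (total syllables / total words)
Flesch-Kincaid Grade Level = 0.39 × (total words / total sentences) + 11.8 × (total syllables / total words) - 15.59
Higher Reading Ease scores indicate easier text; the Grade Level approximates the U.S. school grade needed to understand it.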
Results: Mean accuracy scores (SD) for ChatGPT, Gemini, and Co-Pilot were 9.0 (1.8), 8.8 (2.3), and 8.3 (2.3), respectively. Co-Pilot significantly underperformed in accuracy compared with ChatGPT (mean difference -0.62, 95% CI: -1.11, -0.14; P = 0.008). Mean comprehensiveness scores (SD) for ChatGPT, Gemini, and Co-Pilot were 8.1 (2.2), 8.5 (2.2), and 7.2 (2.4), respectively. ChatGPT and Gemini performed better than Co-Pilot in comprehensiveness (mean difference Co-Pilot vs. ChatGPT: -0.88 [95% CI: -1.34, -0.42; P < 0.001]; mean difference Co-Pilot vs. Gemini: -1.28 [95% CI: -1.74, -0.82; P < 0.001]). Inaccurate or misleading statements about cannabis formulations and symptom benefits were identified, along with missing information on adverse effects and drug interactions. Gemini had the lowest reference reliability score (4.1). Readability was poor for all chatbots.
Conclusion: Despite overall high accuracy and comprehensiveness scores, the chatbots made some misleading or inaccurate statements and omitted information. For now, their answers should be interpreted with caution.
About the Journal
The Journal of Pain and Symptom Management is an internationally respected, peer-reviewed journal serving an interdisciplinary audience of professionals. It provides a forum for the latest clinical research and best practices related to relieving the burden of illness among patients with serious or life-threatening illness.