The doc versus the bot: A pilot study to assess the quality and accuracy of physician and chatbot responses to clinical questions in gynecologic oncology
Mary Katherine Anastasio, Pamela Peters, Jonathan Foote, Alexander Melamed, Susan C. Modesitt, Fernanda Musa, Emma Rossi, Benjamin B. Albright, Laura J. Havrilesky, Haley A. Moss
{"title":"The doc versus the bot: A pilot study to assess the quality and accuracy of physician and chatbot responses to clinical questions in gynecologic oncology","authors":"Mary Katherine Anastasio , Pamela Peters , Jonathan Foote , Alexander Melamed , Susan C. Modesitt , Fernanda Musa , Emma Rossi , Benjamin B. Albright , Laura J. Havrilesky , Haley A. Moss","doi":"10.1016/j.gore.2024.101477","DOIUrl":null,"url":null,"abstract":"<div><p>Artificial intelligence (AI) applications to medical care are currently under investigation. We aimed to evaluate and compare the quality and accuracy of physician and chatbot responses to common clinical questions in gynecologic oncology. In this cross-sectional pilot study, ten questions about the knowledge and management of gynecologic cancers were selected. Each question was answered by a recruited gynecologic oncologist, ChatGPT (Generative Pretreated Transformer) AI platform, and Bard by Google AI platform. Five recruited gynecologic oncologists who were blinded to the study design were allowed 15 min to respond to each of two questions. Chatbot responses were generated by inserting the question into a fresh session in September 2023. Qualifiers and language identifying the response source were removed. Three gynecologic oncology providers who were blinded to the response source independently reviewed and rated response quality using a 5-point Likert scale, evaluated each response for accuracy, and selected the best response for each question. Overall, physician responses were judged to be best in 76.7 % of evaluations versus ChatGPT (10.0 %) and Bard (13.3 %; p < 0.001). The average quality of responses was 4.2/5.0 for physicians, 3.0/5.0 for ChatGPT and 2.8/5.0 for Bard (<em>t</em>-test for both and ANOVA p < 0.001). Physicians provided a higher proportion of accurate responses (86.7 %) compared to ChatGPT (60 %) and Bard (43 %; p < 0.001 for both). Physicians provided higher quality responses to gynecologic oncology clinical questions compared to chatbots. Patients should be cautioned against non-validated AI platforms for medical advice; larger studies on the use of AI for medical advice are needed.</p></div>","PeriodicalId":12873,"journal":{"name":"Gynecologic Oncology Reports","volume":null,"pages":null},"PeriodicalIF":1.2000,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2352578924001565/pdfft?md5=726507797baa7aed0cfa4f99c08eb164&pid=1-s2.0-S2352578924001565-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Gynecologic Oncology Reports","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2352578924001565","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"OBSTETRICS & GYNECOLOGY","Score":null,"Total":0}
Citations: 0
Abstract
Artificial intelligence (AI) applications in medical care are currently under investigation. We aimed to evaluate and compare the quality and accuracy of physician and chatbot responses to common clinical questions in gynecologic oncology. In this cross-sectional pilot study, ten questions about the knowledge and management of gynecologic cancers were selected. Each question was answered by a recruited gynecologic oncologist, the ChatGPT (Generative Pre-trained Transformer) AI platform, and Google's Bard AI platform. Five recruited gynecologic oncologists who were blinded to the study design were allowed 15 min to respond to each of two questions. Chatbot responses were generated by entering each question into a fresh session in September 2023. Qualifiers and language identifying the response source were removed. Three gynecologic oncology providers who were blinded to the response source independently reviewed and rated response quality on a 5-point Likert scale, evaluated each response for accuracy, and selected the best response for each question. Overall, physician responses were judged best in 76.7 % of evaluations, versus ChatGPT (10.0 %) and Bard (13.3 %; p < 0.001). The average quality of responses was 4.2/5.0 for physicians, 3.0/5.0 for ChatGPT, and 2.8/5.0 for Bard (p < 0.001 for both pairwise t-tests and for ANOVA). Physicians provided a higher proportion of accurate responses (86.7 %) than ChatGPT (60 %) and Bard (43 %; p < 0.001 for both). Physicians provided higher quality responses to gynecologic oncology clinical questions than chatbots. Patients should be cautioned against relying on non-validated AI platforms for medical advice; larger studies on the use of AI for medical advice are needed.
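The abstract reports group comparisons of Likert-scale quality ratings (ANOVA and t-tests) and of accuracy proportions. The following is a minimal sketch, not the authors' analysis code, of how such comparisons could be run in Python with SciPy; all rating values and counts below are hypothetical placeholders, not study data.

```python
# Illustrative sketch of the kinds of tests named in the abstract
# (ANOVA and t-tests on Likert ratings, comparison of accuracy proportions).
# All numbers are hypothetical placeholders, not data from the study.
from scipy import stats

# Hypothetical 5-point Likert quality ratings, one value per blinded evaluation
physician = [5, 4, 4, 5, 4, 4, 5, 4, 4, 3]
chatgpt   = [3, 3, 4, 2, 3, 3, 4, 2, 3, 3]
bard      = [3, 2, 3, 3, 2, 3, 3, 2, 3, 4]

# Omnibus comparison of mean quality across the three sources (one-way ANOVA)
f_stat, p_anova = stats.f_oneway(physician, chatgpt, bard)

# Pairwise comparisons of physician ratings against each chatbot (t-tests)
t_gpt, p_gpt = stats.ttest_ind(physician, chatgpt)
t_bard, p_bard = stats.ttest_ind(physician, bard)

# Accuracy compared as proportions, e.g. a chi-square test on counts of
# accurate vs. inaccurate responses per source (hypothetical counts)
accurate   = [26, 18, 13]
inaccurate = [ 4, 12, 17]
chi2, p_acc, _, _ = stats.chi2_contingency([accurate, inaccurate])

print(f"ANOVA p={p_anova:.3g}; physician vs ChatGPT p={p_gpt:.3g}; "
      f"physician vs Bard p={p_bard:.3g}; accuracy chi-square p={p_acc:.3g}")
```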
About the journal:
Gynecologic Oncology Reports is an online-only, open access journal devoted to the rapid publication of narrative review articles, survey articles, case reports, case series, letters to the editor regarding previously published manuscripts, and other short communications in the field of gynecologic oncology. The journal considers papers concerning tumors of the female reproductive tract, with originality, quality, and clarity as the chief criteria for acceptance.