Evaluation of ChatGPT as a Reliable Source of Medical Information on Prostate Cancer for Patients: Global Comparative Survey of Medical Oncologists and Urologists.

Arnulf Stenzl, Andrew J Armstrong, Eamonn Rogers, Dany Habr, Jochen Walz, Martin Gleave, Andrea Sboner, Jennifer Ghith, Lucile Serfass, Kristine W Schuler, Sam Garas, Dheepa Chari, Ken Truman, Cora N Sternberg

Urology Practice, published online November 7, 2024. DOI: 10.1097/UPJ.0000000000000740
Introduction: No consensus exists on performance standards for evaluating the medical responses generated by generative artificial intelligence (AI). The purpose of this study was to assess the ability of Chat Generative Pre-trained Transformer (ChatGPT) to address medical questions about prostate cancer.
Methods: A global online survey was conducted from April to June 2023 among > 700 medical oncologists and urologists who treat patients with prostate cancer. Participants were unaware that the survey was evaluating AI. In component 1, responses to 9 questions were written independently by medical writers (MWs; drawing on medical websites) and by ChatGPT-4.0 (AI generated from publicly available information). Respondents were randomly exposed, in blinded fashion, to both AI-generated and MW-curated responses; ratings against the evaluation criteria and overall preference were recorded. Exploratory component 2 evaluated AI-generated responses to 5 complex questions whose answers are nuanced in the medical literature. Responses were rated on a 5-point Likert scale. Statistical significance was defined as P < .05.
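The abstract does not state which statistical test was applied to the blinded preference data. Purely as an illustration of the kind of comparison described, the sketch below runs a two-sided binomial test on a hypothetical per-question preference count; the counts, the choice of test, and the function name are assumptions, not data or methods from the study.

# Minimal sketch (not the study's actual analysis): test whether blinded
# respondents preferred the AI-generated response over the MW-curated one
# more often than chance for a single question. All numbers are hypothetical.
from scipy.stats import binomtest

def preference_significant(n_prefer_ai: int, n_total: int, alpha: float = 0.05) -> bool:
    """Two-sided binomial test against a 50/50 null preference split."""
    result = binomtest(n_prefer_ai, n_total, p=0.5, alternative="two-sided")
    return result.pvalue < alpha

# Hypothetical example: 360 of 602 blinded respondents prefer the AI response.
print(preference_significant(360, 602))  # True -> significant at P < .05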
Results: In component 1, respondents (N = 602) consistently preferred the clarity of AI-generated responses over MW-curated responses for 7 of 9 questions (P < .05). Despite favoring the AI-generated responses when blinded to their source, respondents considered medical websites a more credible source (52%-67%) than ChatGPT (14%). Respondents in component 2 (N = 98) likewise considered medical websites more credible than ChatGPT, yet rated the AI-generated responses highly on all evaluation criteria, despite the nuanced answers in the medical literature.
Conclusions: These findings provide insight into how clinicians rate AI-generated and MW-curated responses, using evaluation criteria that can be applied in future AI validation studies.