Colorectal Cancer Prevention: Is Chat Generative Pretrained Transformer (ChatGPT) Ready to Assist Physicians in Determining Appropriate Screening and Surveillance Recommendations?
Lisandro Pereyra, Francisco Schlottmann, Leandro Steinberg, Juan Lasa
{"title":"Colorectal Cancer Prevention: Is Chat Generative Pretrained Transformer (Chat GPT) ready to Assist Physicians in Determining Appropriate Screening and Surveillance Recommendations?","authors":"Lisandro Pereyra, Francisco Schlottmann, Leandro Steinberg, Juan Lasa","doi":"10.1097/MCG.0000000000001979","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>To determine whether a publicly available advanced language model could help determine appropriate colorectal cancer (CRC) screening and surveillance recommendations.</p><p><strong>Background: </strong>Poor physician knowledge or inability to accurately recall recommendations might affect adherence to CRC screening guidelines. Adoption of newer technologies can help improve the delivery of such preventive care services.</p><p><strong>Methods: </strong>An assessment with 10 multiple choice questions, including 5 CRC screening and 5 CRC surveillance clinical vignettes, was inputted into chat generative pretrained transformer (ChatGPT) 3.5 in 4 separate sessions. Responses were recorded and screened for accuracy to determine the reliability of this tool. The mean number of correct answers was then compared against a control group of gastroenterologists and colorectal surgeons answering the same questions with and without the help of a previously validated CRC screening mobile app.</p><p><strong>Results: </strong>The average overall performance of ChatGPT was 45%. The mean number of correct answers was 2.75 (95% CI: 2.26-3.24), 1.75 (95% CI: 1.26-2.24), and 4.5 (95% CI: 3.93-5.07) for screening, surveillance, and total questions, respectively. ChatGPT showed inconsistency and gave a different answer in 4 questions among the different sessions. A total of 238 physicians also responded to the assessment; 123 (51.7%) without and 115 (48.3%) with the mobile app. The mean number of total correct answers of ChatGPT was significantly lower than those of physicians without [5.62 (95% CI: 5.32-5.92)] and with the mobile app [7.71 (95% CI: 7.39-8.03); P < 0.001].</p><p><strong>Conclusions: </strong>Large language models developed with artificial intelligence require further refinements to serve as reliable assistants in clinical practice.</p>","PeriodicalId":15457,"journal":{"name":"Journal of clinical gastroenterology","volume":" ","pages":"1022-1027"},"PeriodicalIF":2.8000,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of clinical gastroenterology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1097/MCG.0000000000001979","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/2/7 0:00:00","PubModel":"Epub","JCR":"Q2","JCRName":"GASTROENTEROLOGY & HEPATOLOGY","Score":null,"Total":0}
引用次数: 0
Abstract
Objective: To determine whether a publicly available advanced language model could help determine appropriate colorectal cancer (CRC) screening and surveillance recommendations.
Background: Poor physician knowledge or inability to accurately recall recommendations might affect adherence to CRC screening guidelines. Adoption of newer technologies can help improve the delivery of such preventive care services.
Methods: An assessment consisting of 10 multiple-choice questions, comprising 5 CRC screening and 5 CRC surveillance clinical vignettes, was entered into chat generative pretrained transformer (ChatGPT) 3.5 in 4 separate sessions. Responses were recorded and checked for accuracy to determine the reliability of this tool. The mean number of correct answers was then compared against a control group of gastroenterologists and colorectal surgeons answering the same questions with and without the help of a previously validated CRC screening mobile app.
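The abstract indicates the vignettes were entered into ChatGPT 3.5 in four separate sessions, presumably through the chat interface. As a minimal sketch only, the snippet below shows how a similar repeated-session protocol could be approximated programmatically with the OpenAI Python SDK; the model name ("gpt-3.5-turbo"), the placeholder vignette text, and the loop structure are assumptions for illustration and are not drawn from the study.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder text: the study's 10 multiple-choice vignettes are not reproduced here.
vignettes = [
    "Clinical vignette 1 (multiple-choice question text) ...",
    "Clinical vignette 2 (multiple-choice question text) ...",
]

NUM_SESSIONS = 4  # the study queried the model in 4 separate sessions

for session in range(NUM_SESSIONS):
    for i, vignette in enumerate(vignettes, start=1):
        # Each question is sent in a fresh conversation with no shared history,
        # mirroring the use of independent sessions.
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": vignette}],
        )
        answer = response.choices[0].message.content
        print(f"session {session + 1}, question {i}: {answer}")
```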
Results: The average overall performance of ChatGPT was 45%. The mean number of correct answers was 2.75 (95% CI: 2.26-3.24), 1.75 (95% CI: 1.26-2.24), and 4.5 (95% CI: 3.93-5.07) for screening, surveillance, and total questions, respectively. ChatGPT was inconsistent, giving different answers across sessions for 4 of the questions. A total of 238 physicians also completed the assessment: 123 (51.7%) without and 115 (48.3%) with the mobile app. The mean number of total correct answers for ChatGPT was significantly lower than that of physicians both without [5.62 (95% CI: 5.32-5.92)] and with the mobile app [7.71 (95% CI: 7.39-8.03); P < 0.001].
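The abstract reports group means with 95% confidence intervals and a P value, but does not name the statistical test used. As a hedged illustration only, the sketch below computes a t-distribution confidence interval around a mean score and compares two groups with an unpaired Welch t-test; the score arrays are hypothetical placeholders, not the study's data.

```python
import numpy as np
from scipy import stats

def mean_ci(scores, confidence=0.95):
    """Mean and two-sided confidence interval based on the t distribution."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    half_width = stats.sem(scores) * stats.t.ppf((1 + confidence) / 2, df=len(scores) - 1)
    return mean, mean - half_width, mean + half_width

# Hypothetical per-session / per-physician counts of correct answers (0-10);
# illustrative only, not the published data.
chatgpt_scores = [5, 4, 4, 5]        # 4 ChatGPT sessions
physician_scores = [6, 5, 7, 5, 6]   # physicians answering without the mobile app

print("ChatGPT:", mean_ci(chatgpt_scores))
print("Physicians:", mean_ci(physician_scores))

# The abstract reports P < 0.001 without specifying the test;
# an unpaired two-sample (Welch) t-test is one plausible choice.
t_stat, p_value = stats.ttest_ind(chatgpt_scores, physician_scores, equal_var=False)
print(f"t = {t_stat:.2f}, P = {p_value:.3f}")
```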
Conclusions: Large language models developed with artificial intelligence require further refinements to serve as reliable assistants in clinical practice.
Journal Description:
Journal of Clinical Gastroenterology gathers the world's latest, most relevant clinical studies and reviews, case reports, and technical expertise in a single source. Regular features include cutting-edge, peer-reviewed articles and clinical reviews that put the latest research and development into the context of your practice. Also included are biographies, focused organ reviews, practice management, and therapeutic recommendations.