Accuracy and consistency of publicly available Large Language Models as clinical decision support tools for the management of colon cancer

Kristen N Kaiser, Alexa J Hughes, Anthony D Yang, Anita A Turk, Sanjay Mohanty, Andrew A Gonzalez, Rachel E Patzer, Karl Y Bilimoria, Ryan J Ellis

Journal of Surgical Oncology · Published 2024-08-19 · DOI: 10.1002/jso.27821
Citations: 0
Abstract
Background: Large Language Models (LLMs; e.g., ChatGPT) may be used to assist clinicians and form the basis of future clinical decision support (CDS) for colon cancer. The objectives of this study were to (1) evaluate the response accuracy of two LLM-powered interfaces in identifying guideline-based care in simulated clinical scenarios and (2) define response variation between and within LLMs.
Methods: Clinical scenarios with "next steps in management" queries were developed based on National Comprehensive Cancer Network guidelines. Prompts were entered into OpenAI ChatGPT and Microsoft Copilot in independent sessions, yielding four responses per scenario. Responses were compared to clinician-developed responses and assessed for accuracy, consistency, and verbosity.
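The grading step implied by this protocol — comparing each LLM response against a clinician-developed answer and sorting it into "completely correct," "missing information," or "inaccurate/misleading" — can be sketched as below. This is a minimal illustration, not the authors' actual pipeline: the `grade_response` helper, the keyword-matching approach, and the mock scenario are all assumptions for demonstration (the study used clinician judgment, not string matching).

```python
from dataclasses import dataclass

@dataclass
class Grade:
    category: str  # "correct", "missing", or "inaccurate"

def grade_response(response: str, required: list[str], forbidden: list[str]) -> Grade:
    """Grade one LLM response against clinician-defined key points (hypothetical scheme)."""
    text = response.lower()
    # Any guideline-discordant claim makes the response inaccurate/misleading.
    if any(term.lower() in text for term in forbidden):
        return Grade("inaccurate")
    # All required management steps present -> completely correct.
    if all(term.lower() in text for term in required):
        return Grade("correct")
    # Otherwise the response omits required information.
    return Grade("missing")

# Mock scenario: next step after resection of stage III colon cancer.
required = ["adjuvant chemotherapy"]
forbidden = ["observation alone"]

# Two of the four responses per scenario (two per platform, per the protocol):
responses = {
    "model_a_run_1": "Guidelines recommend adjuvant chemotherapy (e.g., FOLFOX).",
    "model_a_run_2": "Surveillance imaging should be arranged.",  # omits chemotherapy
}
grades = {k: grade_response(v, required, forbidden).category for k, v in responses.items()}
print(grades)  # {'model_a_run_1': 'correct', 'model_a_run_2': 'missing'}
```

Repeating each prompt in independent sessions, as the authors did, is what makes within-model consistency measurable: identical inputs can yield different gradings.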
Results: Across 108 responses to 27 prompts, both platforms yielded completely correct responses to 36% of scenarios (n = 39). For ChatGPT, 39% (n = 21) were missing information and 24% (n = 14) contained inaccurate/misleading information. Copilot performed similarly, with 37% (n = 20) having missing information and 28% (n = 15) containing inaccurate/misleading information (p = 0.96). Clinician responses were significantly shorter (34 ± 15.5 words) than both ChatGPT (251 ± 86 words) and Copilot (271 ± 67 words; both p < 0.01).
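The verbosity comparison reported above (clinician answers far shorter than either LLM, both p < 0.01) is the kind of result a two-sample test on word counts would produce. A stdlib-only sketch using Welch's t statistic is shown below; the word-count samples are synthetic stand-ins (the study's raw data are not reproduced here), and the exact test the authors used is described in the full paper, not assumed here.

```python
import math
import statistics

def welch_t(a: list[float], b: list[float]) -> float:
    """Welch's t statistic for two independent samples with unequal variances."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

# Illustrative word counts only, chosen to echo the reported means
# (clinician ~34 words vs. LLM ~250-270 words):
clinician_words = [30.0, 20.0, 45.0, 40.0]
llm_words = [250.0, 300.0, 220.0, 260.0]

t = welch_t(clinician_words, llm_words)
print(f"Welch t = {t:.2f}")  # large negative t: clinician responses are much shorter
```

Even with tiny synthetic samples, the gap in means dwarfs the pooled standard error, matching the abstract's finding that the difference is highly significant.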
Conclusions: Publicly available LLM applications often provide verbose responses with vague or inaccurate information regarding colon cancer management. Significant optimization is required before use in formal CDS.
Journal Introduction:
The Journal of Surgical Oncology offers peer-reviewed, original papers in the field of surgical oncology and broadly related surgical sciences, including reports on experimental and laboratory studies. As an international journal, the editors encourage participation from leading surgeons around the world. The JSO is the representative journal for the World Federation of Surgical Oncology Societies. Publishing 16 issues in 2 volumes each year, the journal accepts Research Articles, in-depth Reviews of timely interest, Letters to the Editor, and invited Editorials. Guest Editors from the JSO Editorial Board oversee multiple special Seminars issues each year. These Seminars include multifaceted Reviews on a particular topic or current issue in surgical oncology, which are invited from experts in the field.