Large Language Models as Decision-Making Tools in Oncology: Comparing Artificial Intelligence Suggestions and Expert Recommendations.

Impact factor: 3.3 (JCR Q2, Oncology)
JCO Clinical Cancer Informatics. Pub Date: 2025-03-01. Epub Date: 2025-03-20. DOI: 10.1200/CCI-24-00230
Loic Ah-Thiane, Pierre-Etienne Heudel, Mario Campone, Marie Robert, Victoire Brillaud-Meflah, Caroline Rousseau, Magali Le Blanc-Onfroy, Florine Tomaszewski, Stéphane Supiot, Tanguy Perennec, Augustin Mervoyer, Jean-Sébastien Frenel
Volume 9, article e2400230. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11949217/pdf/
Citations: 0

Abstract

Purpose: To determine the accuracy of large language models (LLMs) in generating appropriate treatment options for patients with early breast cancer (BC) on the basis of their medical records.

Materials and methods: This retrospective study used anonymized medical records of patients with breast cancer (BC) presented during multidisciplinary team meetings (MDTs) between January and April 2024. Three generalist artificial intelligence models (Claude3-Opus, GPT4-Turbo, and LLaMa3-70B) were used to generate treatment suggestions, which were compared with the experts' decisions. The primary outcome was the rate of appropriate suggestions from the LLMs, with the experts' decisions as the reference. The secondary outcome was the LLMs' performance (F1 score and specificity) in generating appropriate suggestions for each treatment category.
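The two secondary metrics can be illustrated with a minimal sketch. It assumes that, for one treatment category, each case is reduced to a yes/no indication from the experts (the reference) and a yes/no suggestion from the model. The names `Confusion`, `tally`, `f1_score`, and `specificity` are hypothetical, and the six example cases are invented for illustration; they are not data from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts for one treatment category, with expert decisions as reference."""
    tp: int  # model suggests treatment, experts indicate it
    fp: int  # model suggests treatment, experts do not
    fn: int  # model omits treatment, experts indicate it
    tn: int  # model omits treatment, experts do not indicate it


def tally(expert: list[bool], model: list[bool]) -> Confusion:
    """Build the confusion counts from paired yes/no decisions."""
    if len(expert) != len(model):
        raise ValueError("expert and model lists must have the same length")
    tp = sum(e and m for e, m in zip(expert, model))
    fp = sum((not e) and m for e, m in zip(expert, model))
    fn = sum(e and (not m) for e, m in zip(expert, model))
    tn = sum((not e) and (not m) for e, m in zip(expert, model))
    return Confusion(tp, fp, fn, tn)


def f1_score(c: Confusion) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denom if denom else 0.0


def specificity(c: Confusion) -> float:
    """Specificity = TN / (TN + FP); returns 0.0 when undefined."""
    denom = c.tn + c.fp
    return c.tn / denom if denom else 0.0


# Invented example: six cases for a single treatment category.
expert_decisions = [True, True, False, False, True, False]
model_suggestions = [True, False, True, False, True, False]

counts = tally(expert_decisions, model_suggestions)
print(counts, f1_score(counts), specificity(counts))
```

Low specificity in a category corresponds to a model that suggests a treatment the experts did not indicate, which is the kind of overestimation the abstract reports for adjuvant radiotherapy.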

Results: The rates of appropriate suggestions were 86.6% (97/112), 85.7% (96/112), and 75.0% (84/112) for Claude3-Opus, GPT4-Turbo, and LLaMa3-70B, respectively. No significant difference was found between Claude3-Opus and GPT4-Turbo (P = .85), but both tended to perform better than LLaMa3-70B (P = .027 and P = .043, respectively). LLMs showed high accuracy for adjuvant endocrine therapy and targeted therapy indications. However, they tended to overestimate the need for adjuvant radiotherapy and performed variably in suggesting adjuvant chemotherapy and genomic tests.
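The reported percentages follow directly from the counts in the Results. The sketch below recomputes them and, purely as an illustration, attaches a 95% Wilson score interval to each rate. The interval is an addition for context and is not reported in the abstract. The abstract also does not say which statistical test produced the P values, so they are not reproduced here. The function name `wilson_interval` is hypothetical.

```python
import math

# Counts of appropriate suggestions, taken from the Results.
N_CASES = 112
appropriate = {"Claude3-Opus": 97, "GPT4-Turbo": 96, "LLaMa3-70B": 84}


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion k/n (z=1.96 ~ 95%)."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Rate in percent, rounded to one decimal as in the abstract.
rates = {name: round(100 * k / N_CASES, 1) for name, k in appropriate.items()}

for name, k in appropriate.items():
    low, high = wilson_interval(k, N_CASES)
    print(f"{name}: {rates[name]}% (illustrative 95% CI {100 * low:.1f}-{100 * high:.1f}%)")
```

With 112 cases per model the intervals are fairly wide, which is consistent with the abstract's cautious wording that the two stronger models "tended to" perform better than LLaMa3-70B.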

Conclusion: LLMs, particularly Claude3-Opus and GPT4-Turbo, demonstrated promising accuracy in suggesting appropriate adjuvant treatments for patients with early BC on the basis of their medical records. Although LLMs showed limitations in validating surgery and indicating genomic tests, their performance in other treatment modalities highlights their potential to automate and augment decision making during MDTs. Further studies with fine-tuned LLMs and a prospective design are needed to demonstrate their utility in clinical practice.
