Applying Large Language Models for Surgical Case Length Prediction.

IF 14.9 · CAS Tier 1 (Medicine) · JCR Q1 (Surgery)
Adhitya Ramamurthi, Bhabishya Neupane, Priya Deshpande, Ryan Hanson, Srujan Vegesna, Deborah Cray, Bradley H Crotty, Melek Somai, Kellie R Brown, Sachin S Pawar, Bradley Taylor, Anai N Kothari
{"title":"Applying Large Language Models for Surgical Case Length Prediction.","authors":"Adhitya Ramamurthi,Bhabishya Neupane,Priya Deshpande,Ryan Hanson,Srujan Vegesna,Deborah Cray,Bradley H Crotty,Melek Somai,Kellie R Brown,Sachin S Pawar,Bradley Taylor,Anai N Kothari","doi":"10.1001/jamasurg.2025.2154","DOIUrl":null,"url":null,"abstract":"Importance\r\nAccurate prediction of surgical case duration is critical for operating room (OR) management, as inefficient scheduling can lead to reduced patient and surgeon satisfaction while incurring considerable financial costs.\r\n\r\nObjective\r\nTo evaluate the feasibility and accuracy of large language models (LLMs) in predicting surgical case length using unstructured clinical data compared to existing estimation methods.\r\n\r\nDesign, Setting, and Participants\r\nThis was a retrospective study analyzing elective surgical cases performed between January 2017 and December 2023 at a single academic medical center and affiliated community hospital ORs. Analysis included 125 493 eligible surgical cases, with 1950 used for LLM fine-tuning and 2500 for evaluation. An additional 500 cases from a community site were used for external validation. Cases were randomly sampled using strata to ensure representation across surgical specialties.\r\n\r\nExposures\r\nEleven LLMs, including base models (GPT-4, GPT-3.5, Mistral, Llama-3, Phi-3) and 2 fine-tuned variants (GPT-4 fine-tuned, GPT-3.5 fine-tuned), were used to predict surgical case length based on clinical notes.\r\n\r\nMain Outcomes and Measures\r\nThe primary outcome was average error between predicted and actual surgical case length (wheels-in to wheels-out time). The secondary outcome was prediction accuracy, defined as predicted length within 20% of actual duration.\r\n\r\nResults\r\nFine-tuned GPT-4 achieved the best performance with a mean absolute error (MAE) of 47.64 minutes (95% CI, 45.71-49.56) and R2 of 0.61, matching the performance of current OR scheduling (MAE, 49.34 minutes; 95% CI, 47.60-51.09; R2, 0.63; P = .10). Both GPT-4 fine-tuned and GPT-3.5 fine-tuned significantly outperformed current scheduling methods in accuracy (46.12% and 46.08% vs 40.92%, respectively; P < .001). GPT-4 fine-tuned outperformed all other models during external validation with similar performance metrics (MAE, 48.66 minutes; 95% CI, 45.31-52.00; accuracy, 46.0%). Base models demonstrated variable performance, with GPT-4 showing the highest performance among non-fine-tuned models (MAE, 59.20 minutes; 95% CI, 56.88 - 61.52).\r\n\r\nConclusion and Relevance\r\nThe findings in this study suggest that fine-tuned LLMs can predict surgical case length with accuracy comparable to or exceeding current institutional scheduling methods. This indicates potential for LLMs to enhance operating room efficiency through improved case length prediction using existing clinical documentation.","PeriodicalId":14690,"journal":{"name":"JAMA surgery","volume":"146 1","pages":""},"PeriodicalIF":14.9000,"publicationDate":"2025-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JAMA surgery","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1001/jamasurg.2025.2154","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"SURGERY","Score":null,"Total":0}
引用次数: 0

Abstract

Importance
Accurate prediction of surgical case duration is critical for operating room (OR) management, as inefficient scheduling can lead to reduced patient and surgeon satisfaction while incurring considerable financial costs.

Objective
To evaluate the feasibility and accuracy of large language models (LLMs) in predicting surgical case length from unstructured clinical data compared with existing estimation methods.

Design, Setting, and Participants
This was a retrospective study analyzing elective surgical cases performed between January 2017 and December 2023 at a single academic medical center and affiliated community hospital ORs. Analysis included 125 493 eligible surgical cases, with 1950 used for LLM fine-tuning and 2500 for evaluation. An additional 500 cases from a community site were used for external validation. Cases were randomly sampled within strata to ensure representation across surgical specialties.

Exposures
Eleven LLMs, including base models (GPT-4, GPT-3.5, Mistral, Llama-3, Phi-3) and 2 fine-tuned variants (GPT-4 fine-tuned, GPT-3.5 fine-tuned), were used to predict surgical case length from clinical notes.

Main Outcomes and Measures
The primary outcome was the average error between predicted and actual surgical case length (wheels-in to wheels-out time). The secondary outcome was prediction accuracy, defined as a predicted length within 20% of the actual duration.

Results
Fine-tuned GPT-4 achieved the best performance, with a mean absolute error (MAE) of 47.64 minutes (95% CI, 45.71-49.56) and an R2 of 0.61, matching the performance of current OR scheduling (MAE, 49.34 minutes; 95% CI, 47.60-51.09; R2, 0.63; P = .10). Both GPT-4 fine-tuned and GPT-3.5 fine-tuned significantly outperformed current scheduling methods in accuracy (46.12% and 46.08%, respectively, vs 40.92%; P < .001). GPT-4 fine-tuned outperformed all other models during external validation, with similar performance metrics (MAE, 48.66 minutes; 95% CI, 45.31-52.00; accuracy, 46.0%). Base models demonstrated variable performance, with GPT-4 showing the highest performance among non-fine-tuned models (MAE, 59.20 minutes; 95% CI, 56.88-61.52).

Conclusions and Relevance
The findings of this study suggest that fine-tuned LLMs can predict surgical case length with accuracy comparable to or exceeding current institutional scheduling methods, indicating potential for LLMs to enhance OR efficiency through improved case length prediction using existing clinical documentation.
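The outcome definitions above are precise enough to reproduce: the primary outcome is the mean absolute error in minutes between predicted and actual wheels-in to wheels-out time, and the secondary outcome counts a prediction as accurate when it falls within 20% of the actual duration. The Python sketch below is a minimal, illustrative implementation of those metrics (plus R2, also reported in the Results); the function and variable names are hypothetical and are not taken from the paper.

```python
import numpy as np

def case_length_metrics(actual_min, predicted_min):
    """Evaluation metrics as defined in the abstract (illustrative sketch).

    actual_min    -- actual wheels-in to wheels-out time, in minutes
    predicted_min -- model-predicted case length, in minutes
    """
    actual_min = np.asarray(actual_min, dtype=float)
    predicted_min = np.asarray(predicted_min, dtype=float)

    # Primary outcome: mean absolute error, in minutes.
    mae = np.mean(np.abs(predicted_min - actual_min))

    # Secondary outcome: fraction of cases whose prediction falls
    # within 20% of the actual duration.
    within_20pct = np.abs(predicted_min - actual_min) <= 0.20 * actual_min
    accuracy = within_20pct.mean()

    # Coefficient of determination (R2), also reported in the Results.
    ss_res = np.sum((actual_min - predicted_min) ** 2)
    ss_tot = np.sum((actual_min - actual_min.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot

    return {"mae_min": mae, "accuracy_within_20pct": accuracy, "r2": r2}

# Example with made-up numbers (not study data):
# case_length_metrics([120, 90, 240], [110, 130, 230])
```

The reported 95% CIs could be obtained, for example, by bootstrapping this function over resampled cases; the abstract does not state which interval method the authors used.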
Source Journal
JAMA Surgery
CiteScore: 20.80
Self-citation rate: 3.60%
Annual publications: 400
About the journal: JAMA Surgery, an international peer-reviewed journal established in 1920, is the official publication of the Association of VA Surgeons, the Pacific Coast Surgical Association, and the Surgical Outcomes Club. It is a proud member of the JAMA Network, a consortium of peer-reviewed general medical and specialty publications.