Adhitya Ramamurthi,Bhabishya Neupane,Priya Deshpande,Ryan Hanson,Srujan Vegesna,Deborah Cray,Bradley H Crotty,Melek Somai,Kellie R Brown,Sachin S Pawar,Bradley Taylor,Anai N Kothari
{"title":"Applying Large Language Models for Surgical Case Length Prediction.","authors":"Adhitya Ramamurthi,Bhabishya Neupane,Priya Deshpande,Ryan Hanson,Srujan Vegesna,Deborah Cray,Bradley H Crotty,Melek Somai,Kellie R Brown,Sachin S Pawar,Bradley Taylor,Anai N Kothari","doi":"10.1001/jamasurg.2025.2154","DOIUrl":null,"url":null,"abstract":"Importance\r\nAccurate prediction of surgical case duration is critical for operating room (OR) management, as inefficient scheduling can lead to reduced patient and surgeon satisfaction while incurring considerable financial costs.\r\n\r\nObjective\r\nTo evaluate the feasibility and accuracy of large language models (LLMs) in predicting surgical case length using unstructured clinical data compared to existing estimation methods.\r\n\r\nDesign, Setting, and Participants\r\nThis was a retrospective study analyzing elective surgical cases performed between January 2017 and December 2023 at a single academic medical center and affiliated community hospital ORs. Analysis included 125 493 eligible surgical cases, with 1950 used for LLM fine-tuning and 2500 for evaluation. An additional 500 cases from a community site were used for external validation. Cases were randomly sampled using strata to ensure representation across surgical specialties.\r\n\r\nExposures\r\nEleven LLMs, including base models (GPT-4, GPT-3.5, Mistral, Llama-3, Phi-3) and 2 fine-tuned variants (GPT-4 fine-tuned, GPT-3.5 fine-tuned), were used to predict surgical case length based on clinical notes.\r\n\r\nMain Outcomes and Measures\r\nThe primary outcome was average error between predicted and actual surgical case length (wheels-in to wheels-out time). The secondary outcome was prediction accuracy, defined as predicted length within 20% of actual duration.\r\n\r\nResults\r\nFine-tuned GPT-4 achieved the best performance with a mean absolute error (MAE) of 47.64 minutes (95% CI, 45.71-49.56) and R2 of 0.61, matching the performance of current OR scheduling (MAE, 49.34 minutes; 95% CI, 47.60-51.09; R2, 0.63; P = .10). Both GPT-4 fine-tuned and GPT-3.5 fine-tuned significantly outperformed current scheduling methods in accuracy (46.12% and 46.08% vs 40.92%, respectively; P < .001). GPT-4 fine-tuned outperformed all other models during external validation with similar performance metrics (MAE, 48.66 minutes; 95% CI, 45.31-52.00; accuracy, 46.0%). Base models demonstrated variable performance, with GPT-4 showing the highest performance among non-fine-tuned models (MAE, 59.20 minutes; 95% CI, 56.88 - 61.52).\r\n\r\nConclusion and Relevance\r\nThe findings in this study suggest that fine-tuned LLMs can predict surgical case length with accuracy comparable to or exceeding current institutional scheduling methods. This indicates potential for LLMs to enhance operating room efficiency through improved case length prediction using existing clinical documentation.","PeriodicalId":14690,"journal":{"name":"JAMA surgery","volume":"146 1","pages":""},"PeriodicalIF":14.9000,"publicationDate":"2025-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JAMA surgery","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1001/jamasurg.2025.2154","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"SURGERY","Score":null,"Total":0}
引用次数: 0
Abstract
Importance
Accurate prediction of surgical case duration is critical for operating room (OR) management, as inefficient scheduling can lead to reduced patient and surgeon satisfaction while incurring considerable financial costs.
Objective
To evaluate the feasibility and accuracy of large language models (LLMs) in predicting surgical case length using unstructured clinical data compared to existing estimation methods.
Design, Setting, and Participants
This was a retrospective study analyzing elective surgical cases performed between January 2017 and December 2023 at a single academic medical center and affiliated community hospital ORs. Analysis included 125 493 eligible surgical cases, with 1950 used for LLM fine-tuning and 2500 for evaluation. An additional 500 cases from a community site were used for external validation. Cases were randomly sampled using strata to ensure representation across surgical specialties.
Exposures
Eleven LLMs, including base models (GPT-4, GPT-3.5, Mistral, Llama-3, Phi-3) and 2 fine-tuned variants (GPT-4 fine-tuned, GPT-3.5 fine-tuned), were used to predict surgical case length based on clinical notes.
Main Outcomes and Measures
The primary outcome was average error between predicted and actual surgical case length (wheels-in to wheels-out time). The secondary outcome was prediction accuracy, defined as predicted length within 20% of actual duration.
Results
Fine-tuned GPT-4 achieved the best performance with a mean absolute error (MAE) of 47.64 minutes (95% CI, 45.71-49.56) and R2 of 0.61, matching the performance of current OR scheduling (MAE, 49.34 minutes; 95% CI, 47.60-51.09; R2, 0.63; P = .10). Both GPT-4 fine-tuned and GPT-3.5 fine-tuned significantly outperformed current scheduling methods in accuracy (46.12% and 46.08% vs 40.92%, respectively; P < .001). GPT-4 fine-tuned outperformed all other models during external validation with similar performance metrics (MAE, 48.66 minutes; 95% CI, 45.31-52.00; accuracy, 46.0%). Base models demonstrated variable performance, with GPT-4 showing the highest performance among non-fine-tuned models (MAE, 59.20 minutes; 95% CI, 56.88 - 61.52).
Conclusion and Relevance
The findings in this study suggest that fine-tuned LLMs can predict surgical case length with accuracy comparable to or exceeding current institutional scheduling methods. This indicates potential for LLMs to enhance operating room efficiency through improved case length prediction using existing clinical documentation.
期刊介绍:
JAMA Surgery, an international peer-reviewed journal established in 1920, is the official publication of the Association of VA Surgeons, the Pacific Coast Surgical Association, and the Surgical Outcomes Club.It is a proud member of the JAMA Network, a consortium of peer-reviewed general medical and specialty publications.