Artificial Intelligence in Surgical Coding: Evaluating Large Language Models for Current Procedural Terminology Accuracy in Hand Surgery

Q3 Medicine

Journal of Hand Surgery Global Online. Pub Date: 2025-03-01. DOI: 10.1016/j.jhsg.2024.11.013

Emily L. Isch MD, Jamie Lee BA, D. Mitchell Self MD, Abhijeet Sambangi BS, Theodore E. Habarth-Morales BS, 1LT, John Vaile BS, EJ Caterson MD, PhD

Journal of Hand Surgery Global Online, Volume 7, Issue 2, Pages 181-185. Publication type: Journal Article. Full text: https://www.sciencedirect.com/science/article/pii/S2589514124002330

Citations: 0

Abstract

Purpose

The advent of large language models (LLMs) like ChatGPT has introduced notable advancements in various surgical disciplines and has increased interest in using LLMs for Current Procedural Terminology (CPT) coding in surgery. CPT coding is a complex and time-consuming process, and the scarcity of professional coders compounds the burden, so there is a pressing need for innovative solutions to enhance coding efficiency and accuracy.

Methods

This observational study evaluated how accurately five publicly available large language models (Perplexity.AI, Bard, Bing AI, ChatGPT 3.5, and ChatGPT 4.0) identified CPT codes for hand surgery procedures. Each model was tested with a consistent query format that included detailed procedure components where necessary. Responses were classified as correct, partially correct, or incorrect based on their alignment with established CPT coding for the specified procedures.
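The three-way classification described above can be sketched in code. The abstract does not give the exact grading rubric, so the rule below is an assumption: a response is correct if the codes it names exactly match the reference set, partially correct if it shares at least one code with the reference, and incorrect otherwise. The names `Outcome`, `extract_codes`, and `classify` are hypothetical, and the five-digit strings in the usage note are arbitrary placeholders, not verified CPT codes for any procedure.

```python
import re
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct"
    PARTIALLY_CORRECT = "partially correct"
    INCORRECT = "incorrect"


# Assumption: only five-digit numeric codes are of interest; alphanumeric
# code formats are ignored by this pattern.
CODE_PATTERN = re.compile(r"\b\d{5}\b")


def extract_codes(response: str) -> set[str]:
    """Collect every standalone five-digit number in a model response."""
    return set(CODE_PATTERN.findall(response))


def classify(response: str, reference: set[str]) -> Outcome:
    """Grade a response against the reference code set (assumed rubric)."""
    found = extract_codes(response)
    if found == reference:
        return Outcome.CORRECT
    if found & reference:
        # At least one reference code named, but missing or extra codes.
        return Outcome.PARTIALLY_CORRECT
    return Outcome.INCORRECT
```

For example, `classify("Report 11111 only.", {"11111", "22222"})` returns `Outcome.PARTIALLY_CORRECT` under this rule, because one of the two reference codes is missing. A rubric that penalizes extra codes differently from missing ones would need a separate branch.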

Results

In the evaluation of artificial intelligence (AI) model performance on simple procedures, Perplexity.AI achieved the highest number of correct outcomes (15), followed by Bard and Bing AI (14 each). ChatGPT 4 and ChatGPT 3.5 yielded 8 and 7 correct outcomes, respectively. For complex procedures, Perplexity.AI and Bard each had three correct outcomes, whereas ChatGPT models had none. Bing AI had the highest number of partially correct outcomes (5). There were significant associations between AI models and performance outcomes for both simple and complex procedures.
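The abstract does not name the statistical test behind the reported associations. As a hedged illustration only, the sketch below computes a Pearson chi-square statistic for a model-by-outcome contingency table, which is one common way to test such an association. The function name `chi_square_statistic` is hypothetical. The full contingency tables are not given in the abstract, so no study data appear here.

```python
from typing import Sequence


def chi_square_statistic(table: Sequence[Sequence[float]]) -> tuple[float, int]:
    """Return (chi-square statistic, degrees of freedom) for a contingency table.

    Rows could be models and columns outcome categories
    (correct / partially correct / incorrect).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    if any(total == 0 for total in row_totals + col_totals):
        # An empty row or column gives an expected count of zero.
        raise ValueError("every row and column must contain at least one observation")

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof
```

Converting the statistic to a p-value requires the chi-square distribution with `dof` degrees of freedom. When expected counts are small, as the low complex-procedure counts suggest they may be, an exact test such as Fisher's is often preferred over the chi-square approximation.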

Conclusions

This study highlights the feasibility and potential benefits of integrating LLMs into the CPT coding process for hand surgery. The findings support further refinement and training of AI models to improve their accuracy and practicality, and they suggest a future in which AI-assisted coding could become a standard component of surgical workflows, in line with the ongoing digital transformation in health care.

Type of study/level of evidence

Observational, IIIb.
