Language Models for Multilabel Document Classification of Surgical Concepts in Exploratory Laparotomy Operative Notes: Algorithm Development Study.

IF 3.1 · CAS Tier 3 (Medicine) · JCR Q2, Medical Informatics
Jeremy A Balch, Sasank S Desaraju, Victoria J Nolan, Divya Vellanki, Timothy R Buchanan, Lindsey M Brinkley, Yordan Penev, Ahmet Bilgili, Aashay Patel, Corinne E Chatham, David M Vanderbilt, Rayon Uddin, Azra Bihorac, Philip Efron, Tyler J Loftus, Protiva Rahman, Benjamin Shickel
{"title":"Language Models for Multilabel Document Classification of Surgical Concepts in Exploratory Laparotomy Operative Notes: Algorithm Development Study.","authors":"Jeremy A Balch, Sasank S Desaraju, Victoria J Nolan, Divya Vellanki, Timothy R Buchanan, Lindsey M Brinkley, Yordan Penev, Ahmet Bilgili, Aashay Patel, Corinne E Chatham, David M Vanderbilt, Rayon Uddin, Azra Bihorac, Philip Efron, Tyler J Loftus, Protiva Rahman, Benjamin Shickel","doi":"10.2196/71176","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Operative notes are frequently mined for surgical concepts in clinical care, research, quality improvement, and billing, often requiring hours of manual extraction. These notes are typically analyzed at the document level to determine the presence or absence of specific procedures or findings (eg, whether a hand-sewn anastomosis was performed or contamination occurred). Extracting several binary classification labels simultaneously is a multilabel classification problem. Traditional natural language processing approaches-bag-of-words (BoW) and term frequency-inverse document frequency (tf-idf) with linear classifiers-have been used previously for this task but are now being augmented or replaced by large language models (LLMs). However, few studies have examined their utility in surgery.</p><p><strong>Objective: </strong>We developed and evaluated LLMs for the purpose of expediting data extraction from surgical notes.</p><p><strong>Methods: </strong>A total of 388 exploratory laparotomy notes from a single institution were annotated for 21 concepts related to intraoperative findings, intraoperative techniques, and closure techniques. Annotation consistency was measured using the Cohen κ statistic. Data were preprocessed to include only the description of the procedure. We compared the evolution of document classification technologies from BoW and tf-idf to encoder-only (Clinical-Longformer) and decoder-only (Llama 3) transformer models. Multilabel classification performance was evaluated with 5-fold cross-validation with F1-score and hamming loss (HL). We experimented with and without context. Errors were assessed by manual review. Code and implementation instructions may be found on GitHub.</p><p><strong>Results: </strong>The prevalence of labels ranged from 0.05 (colostomy, ileostomy, active bleed from named vessel) to 0.50 (running fascial closure). Llama 3.3 was the overall best-performing model (micro F1-score 0.88, 5-fold range: 0.88-0.89; HL 0.11, 5-fold range: 0.11-0.12). The BoW model (micro F1-score 0.68, 5-fold range: 0.64-0.71; HL 0.14, 5-fold range: 0.13-0.16) and Clinical-Longformer (micro F1-score 0.73, 5-fold range: 0.70-0.74; HL 0.11, 5-fold range: 0.10-0.12) had overall similar performance, with tf-idf models trailing (micro F1-score 0.57, 5-fold range: 0.55-0.59; HL 0.27, 5-fold range: 0.25-0.29). F1-scores varied across concepts in the Llama model, ranging from 0.30 (5-fold range: 0.23-0.39) for class III contamination to 0.92 (5-fold range: 0.98-0.84) for bowel resection. Context enhanced Llama's performance, adding an average of 0.16 improvement to the F1-scores. 
Error analysis demonstrated semantic nuances and edge cases within operative notes, particularly when patients had references to prior operations in their operative notes or simultaneous operations with other surgical services.</p><p><strong>Conclusions: </strong>Off-the-shelf autoregressive LLMs outperformed fined-tuned, encoder-only transformers and traditional natural language processing techniques in classifying operative notes. Multilabel classification with LLMs may streamline retrospective reviews in surgery, though further refinements are required prior to reliable use in research and quality improvement.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e71176"},"PeriodicalIF":3.1000,"publicationDate":"2025-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12266303/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR Medical Informatics","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.2196/71176","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MEDICAL INFORMATICS","Score":null,"Total":0}
引用次数: 0

Abstract

Background: Operative notes are frequently mined for surgical concepts in clinical care, research, quality improvement, and billing, often requiring hours of manual extraction. These notes are typically analyzed at the document level to determine the presence or absence of specific procedures or findings (eg, whether a hand-sewn anastomosis was performed or contamination occurred). Extracting several binary classification labels simultaneously is a multilabel classification problem. Traditional natural language processing approaches, such as bag-of-words (BoW) and term frequency-inverse document frequency (tf-idf) representations paired with linear classifiers, have previously been used for this task but are now being augmented or replaced by large language models (LLMs). However, few studies have examined their utility in surgery.
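For readers less familiar with these traditional baselines, a multilabel tf-idf pipeline with linear classifiers can be assembled in a few lines of scikit-learn. The sketch below is purely illustrative and is not taken from the study's repository; the file name, column names, and label subset are hypothetical.

```python
# Minimal sketch of a tf-idf + linear-classifier multilabel baseline.
# The CSV name, column names, and label subset are hypothetical and
# do not come from the study's repository.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, hamming_loss
from sklearn.model_selection import cross_val_predict
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

df = pd.read_csv("operative_notes.csv")                  # hypothetical file
label_cols = ["bowel_resection", "colostomy", "running_fascial_closure"]
X, y = df["note_text"], df[label_cols].values            # free text + binary label matrix

# Binary relevance: one independent linear classifier per concept.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)

# 5-fold cross-validated predictions, scored with the same metrics the paper reports.
pred = cross_val_predict(model, X, y, cv=5)
print("micro F1:", round(f1_score(y, pred, average="micro"), 3))
print("Hamming loss:", round(hamming_loss(y, pred), 3))
```

Swapping TfidfVectorizer for CountVectorizer gives the corresponding BoW baseline; in both cases, each concept is handled by its own binary classifier.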

Objective: We developed and evaluated LLMs to expedite data extraction from surgical notes.

Methods: A total of 388 exploratory laparotomy notes from a single institution were annotated for 21 concepts related to intraoperative findings, intraoperative techniques, and closure techniques. Annotation consistency was measured using the Cohen κ statistic. Data were preprocessed to include only the description of the procedure. We compared the evolution of document classification technologies from BoW and tf-idf to encoder-only (Clinical-Longformer) and decoder-only (Llama 3) transformer models. Multilabel classification performance was evaluated with 5-fold cross-validation using the F1-score and Hamming loss (HL). We ran experiments with and without additional context. Errors were assessed by manual review. Code and implementation instructions are available on GitHub.
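For the decoder-only setting, one common implementation pattern is to prompt an instruction-tuned model for a JSON object over the label set and parse the response. The sketch below illustrates that pattern under assumptions of ours: the checkpoint name, label subset, prompt wording, and classify() helper are hypothetical and do not reproduce the study's GitHub code.

```python
# Minimal sketch of zero-shot multilabel classification with a decoder-only
# LLM via Hugging Face Transformers. The checkpoint, label subset, prompt
# wording, and classify() helper are illustrative assumptions, not the
# study's published code.
import json

from transformers import pipeline

LABELS = ["bowel_resection", "colostomy", "running_fascial_closure"]  # subset of the 21 concepts

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.3-70B-Instruct",  # assumed checkpoint; gated and memory-hungry
    device_map="auto",
)

def classify(note_text: str, context: str = "") -> dict:
    """Ask the model for a JSON object mapping each concept to 0 or 1."""
    prompt = (
        "You are labeling an exploratory laparotomy operative note.\n"
        f"{context}\n"
        f"Concepts: {', '.join(LABELS)}\n"
        "Return only a JSON object mapping each concept to 1 (present) or 0 (absent).\n\n"
        f"Operative note:\n{note_text}"
    )
    out = generator(prompt, max_new_tokens=200, do_sample=False)[0]["generated_text"]
    completion = out[len(prompt):]                      # strip the echoed prompt
    start, end = completion.find("{"), completion.rfind("}") + 1
    return json.loads(completion[start:end])            # fails loudly if no JSON is returned
```

The context argument here is a stand-in for the additional context the authors experimented with; in practice it might hold label definitions or annotation guidance. A production version would also apply the model's chat template and guard against malformed JSON.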

Results: The prevalence of labels ranged from 0.05 (colostomy, ileostomy, active bleed from named vessel) to 0.50 (running fascial closure). Llama 3.3 was the overall best-performing model (micro F1-score 0.88, 5-fold range: 0.88-0.89; HL 0.11, 5-fold range: 0.11-0.12). The BoW model (micro F1-score 0.68, 5-fold range: 0.64-0.71; HL 0.14, 5-fold range: 0.13-0.16) and Clinical-Longformer (micro F1-score 0.73, 5-fold range: 0.70-0.74; HL 0.11, 5-fold range: 0.10-0.12) had overall similar performance, with tf-idf models trailing (micro F1-score 0.57, 5-fold range: 0.55-0.59; HL 0.27, 5-fold range: 0.25-0.29). F1-scores varied across concepts in the Llama model, ranging from 0.30 (5-fold range: 0.23-0.39) for class III contamination to 0.92 (5-fold range: 0.84-0.98) for bowel resection. Providing context improved Llama's performance, increasing F1-scores by an average of 0.16. Error analysis demonstrated semantic nuances and edge cases within operative notes, particularly when notes referenced prior operations or described simultaneous operations performed with other surgical services.
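As a reminder of how to read these numbers (standard multilabel definitions, not restated from the paper): for N documents and L labels, the Hamming loss is the fraction of individual label assignments that are wrong, and the micro F1-score pools true positives, false positives, and false negatives across all labels.

\[
\mathrm{HL} = \frac{1}{NL}\sum_{i=1}^{N}\sum_{l=1}^{L}\mathbf{1}\!\left[\hat{y}_{il} \neq y_{il}\right],
\qquad
F1_{\mathrm{micro}} = \frac{2\sum_{l}\mathrm{TP}_{l}}{2\sum_{l}\mathrm{TP}_{l} + \sum_{l}\mathrm{FP}_{l} + \sum_{l}\mathrm{FN}_{l}}
\]

By these definitions, lower HL is better (an HL of 0.11 means roughly 11% of individual concept assignments were incorrect), whereas higher micro F1 is better.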

Conclusions: Off-the-shelf autoregressive LLMs outperformed fine-tuned, encoder-only transformers and traditional natural language processing techniques in classifying operative notes. Multilabel classification with LLMs may streamline retrospective reviews in surgery, though further refinements are required prior to reliable use in research and quality improvement.

Source journal: JMIR Medical Informatics (Medicine-Health Informatics)
CiteScore: 7.90
Self-citation rate: 3.10%
Article volume: 173
Review time: 12 weeks
Journal description: JMIR Medical Informatics (JMI, ISSN 2291-9694) is a top-rated, tier A journal which focuses on clinical informatics, big data in health and health care, decision support for health professionals, electronic health records, ehealth infrastructures and implementation. It has a focus on applied, translational research, with a broad readership including clinicians, CIOs, engineers, industry, and health informatics professionals. Published by JMIR Publications, publisher of the Journal of Medical Internet Research (JMIR), the leading eHealth/mHealth journal (Impact Factor 2016: 5.175), JMIR Med Inform has a slightly different scope: it emphasizes applications for clinicians and health professionals rather than consumers/citizens (the focus of JMIR), publishes even faster, and also allows papers that are more technical or more formative than what would be published in the Journal of Medical Internet Research.