Natural Language Processing and ICD-10 Coding for Detecting Bleeding Events in Discharge Summaries: Comparative Cross-Sectional Study.

IF 3.8 | Zone 3 (Medicine) | Q2 Medical Informatics

JMIR Medical Informatics Pub Date : 2025-08-29 DOI:10.2196/67837

Frederic Gaspar, Mehdi Zayene, Claire Coumau, Elliott Bertrand, Marie Bettex, Marie Annick Le Pogam, Chantal Csajka


Citation count: 0

Abstract

Background: Bleeding adverse drug events (ADEs), particularly among older inpatients receiving antithrombotic therapy, represent a major safety concern in hospitals. These events are often underdetected by conventional rule-based systems that rely on structured electronic medical record data, such as ICD-10 (International Statistical Classification of Diseases and Related Health Problems, 10th Revision) codes, which lack the granularity to capture nuanced clinical narratives.

Objective: This study aimed to develop and evaluate a natural language processing (NLP) model to detect and categorize bleeding ADEs in discharge summaries of older adults. Specifically, the model was designed to distinguish between "clinically significant bleeding," "severe bleeding," "history of bleeding," and "no bleeding," and was compared with a rule-based algorithm using ICD-10 codes.

Methods: Clinicians manually annotated 400 discharge summaries, comprising 65,706 sentences, into four categories: "no bleeding," "clinically significant bleeding," "severe bleeding," and "history of bleeding." The dataset was divided into a training set (70%, 47,100 sentences) and a test set (30%, 18,606 sentences). Two detection approaches were developed and evaluated: (1) an NLP model using binary logistic regression and support vector machine classifiers, and (2) a traditional rule-based algorithm relying exclusively on predefined ICD-10 codes. To address class imbalance, with most sentences categorized as irrelevant ("no bleeding"), a class-weighting strategy was applied in the NLP model. Model performance was assessed using accuracy, precision, recall, F1-score, and receiver operating characteristic (ROC) curve analyses, with manual annotations as the gold standard.
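The Methods mention a class-weighting strategy, but the abstract does not say which one was used. As a minimal sketch, assuming the common inverse-frequency ("balanced") heuristic, per-class weights could be computed as follows. The function name `balanced_class_weights` and the toy label counts are hypothetical and are not study data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Assumption: this is one common weighting heuristic; the study's
    exact scheme is not stated in the abstract.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in counts.items()
    }


# Toy sentence labels, invented for illustration only (not study data).
toy_labels = (
    ["no bleeding"] * 90
    + ["clinically significant bleeding"] * 6
    + ["severe bleeding"] * 2
    + ["history of bleeding"] * 2
)

weights = balanced_class_weights(toy_labels)
for cls, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{cls}: {weight:.2f}")
```

With this heuristic the majority class ("no bleeding") receives a weight below 1 and the rare classes receive weights well above 1, so each class contributes equally to the total weighted loss.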

Results: The NLP model significantly outperformed the rule-based approach across all evaluation metrics. At the document level, the NLP model achieved macro-average scores of 0.81 for accuracy and 0.80 for F1-score. Precision was particularly high for detecting severe (0.92) and clinically significant bleeding events (0.87), demonstrating strong classification capability despite class imbalance. ROC analyses confirmed the model's robust diagnostic performance, yielding an area under the curve (AUC) of 0.91 when distinguishing irrelevant sentences from potential bleeding events, 0.88 for identifying historical mentions of bleeding, and notably, 0.94 for differentiating clinically significant from severe bleeding. In contrast, the rule-based ICD-10 model demonstrated high precision (0.94) for clinically significant bleeding but poor recall (0.03) for severe bleeding events, reflecting frequent missed detections. This limitation arose due to its reliance on commonly used ICD-10 codes (eg, gastrointestinal hemorrhage) and inadequate capture of rare severe bleeding conditions such as shock due to hemorrhage.
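The contrast between high precision for one class and near-zero recall for another can be illustrated with a small sketch. It assumes a rule-based detector that only recognizes codes on a predefined list. The code-to-category map, the function names, and the toy documents are hypothetical; the study's actual code list is not given in the abstract, and none of the numbers produced here are study results.

```python
# Hypothetical code list (assumption). K92.2 is the ICD-10 code for
# unspecified gastrointestinal hemorrhage.
CODE_TO_CATEGORY = {"K92.2": "clinically significant bleeding"}


def rule_based_label(codes):
    """Return the category of the first listed code, else 'no bleeding'."""
    for code in codes:
        if code in CODE_TO_CATEGORY:
            return CODE_TO_CATEGORY[code]
    return "no bleeding"


def precision_recall_f1(gold, predicted, label):
    """Per-class precision, recall, and F1 (0.0 when undefined)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 over the classes present in gold."""
    labels = sorted(set(gold))
    scores = [precision_recall_f1(gold, predicted, lab)[2] for lab in labels]
    return sum(scores) / len(scores)


# Toy documents: (assigned codes, gold label). Invented for illustration.
toy_documents = [
    (["K92.2"], "clinically significant bleeding"),
    (["K92.2", "I10"], "clinically significant bleeding"),
    (["I10"], "severe bleeding"),    # bleeding described only in the narrative
    (["N39.0"], "severe bleeding"),  # no listed bleeding code assigned
    (["I10"], "no bleeding"),
]

gold = [label for _, label in toy_documents]
predicted = [rule_based_label(codes) for codes, _ in toy_documents]

cs_scores = precision_recall_f1(gold, predicted, "clinically significant bleeding")
severe_scores = precision_recall_f1(gold, predicted, "severe bleeding")
overall_macro_f1 = macro_f1(gold, predicted)
print(cs_scores, severe_scores, overall_macro_f1)
```

In this toy setup the detector is perfectly precise on the class its code list covers, yet it never recovers the severe cases because no listed code is attached to them. Macro-averaging exposes the gap because every class counts equally regardless of its frequency.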

Conclusions: This study highlights the considerable advantage of NLP over traditional ICD-10-based methods for detecting bleeding ADEs within electronic medical records. The NLP model effectively captured nuanced clinical narratives, including severity, negations, and historical bleeding events, demonstrating substantial promise for improving patient safety surveillance and clinical decision-making. Future research should extend validation across multiple institutions, diversify annotated datasets, and further refine temporal reasoning capabilities within NLP algorithms.



Source journal: JMIR Medical Informatics (Medicine-Health Informatics)

CiteScore: 7.90
Self-citation rate: 3.10%
Articles published: 173
Review time: 12 weeks

About the journal: JMIR Medical Informatics (JMI, ISSN 2291-9694) is a top-rated, tier A journal which focuses on clinical informatics, big data in health and health care, decision support for health professionals, electronic health records, ehealth infrastructures and implementation. It has a focus on applied, translational research, with a broad readership including clinicians, CIOs, engineers, industry and health informatics professionals. Published by JMIR Publications, publisher of the Journal of Medical Internet Research (JMIR), the leading eHealth/mHealth journal (Impact Factor 2016: 5.175), JMIR Med Inform has a slightly different scope (placing more emphasis on applications for clinicians and health professionals than on consumers/citizens, which is the focus of JMIR), publishes even faster, and also allows papers which are more technical or more formative than what would be published in the Journal of Medical Internet Research.