SSMT-PANBERT: A single-stage multitask model for phenotype extraction and assertion negation detection in unstructured clinical text

IF 6.3 | Zone 2 (Medicine) | JCR Q1 (Biology)

Computers in Biology and Medicine. Pub Date: 2025-06-22. DOI: 10.1016/j.compbiomed.2025.110651

Nour Eddine Zekaoui, Maryem Rhanoui, Siham Yousfi, Mounia Mikram

Volume 195, Article 110651. Full text: https://www.sciencedirect.com/science/article/pii/S0010482525010029

Citations: 0

Abstract

Automatic phenotype extraction and assertion negation detection from large-scale accessible Electronic Health Records (EHRs), including discharge summaries and radiology reports, is a crucial task for various healthcare applications, such as disease diagnosis and treatment planning. The unstructured nature of these documents poses significant challenges for manual processing. However, prior studies exhibit several limitations, such as being restricted to a single label per sentence or omitting the extraction and negation of medical concepts, which make them prone to fail in complex circumstances. In this paper, we capitalize on the advancement of state-of-the-art pre-trained language models (PLMs) to propose a single-stage multitask solution that jointly learns to extract phenotypes and detect their assertion or negation in an end-to-end fashion. Our proposed approach aims to provide practical assistance to healthcare professionals by handling complex and diverse clinical scenarios. We evaluate our method on a validation set derived from an annotated, balanced, and validated dataset based on MIMIC-III clinical notes. The annotations were rigorously reviewed by domain experts to ensure high reliability. The top-performing model in our experiments, SSMT-PANBERT, achieves an average Macro F1 score of 92.33% and a Micro F1 score of 91.66% on the validation set, outperforming traditional pipeline approaches in terms of Macro F1 (92.33% vs. 91.66%), while reducing training time by 37%, inference time by 18.2%, and GPU memory usage by 57%. These results demonstrate the effectiveness of our unified approach in handling complex clinical scenarios while providing significant computational advantages for real-world applications. Furthermore, we conduct a thorough analysis of the model's performance and identify potential areas for future improvement.
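The abstract reports both a Macro F1 and a Micro F1 score. As a minimal sketch of how the two averages differ for multi-label output, the Python below assumes each sentence is encoded as a binary indicator vector with one position per label. The toy data and the names `f1` and `macro_micro_f1` are illustrative assumptions; they are not taken from the paper and do not reproduce its evaluation code.

```python
from typing import List, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(
    y_true: List[List[int]], y_pred: List[List[int]]
) -> Tuple[float, float]:
    """Return (macro F1, micro F1) for binary multi-label indicator rows."""
    n_labels = len(y_true[0])
    # One [tp, fp, fn] counter per label column.
    counts = [[0, 0, 0] for _ in range(n_labels)]
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t == 1 and p == 1:
                counts[j][0] += 1
            elif t == 0 and p == 1:
                counts[j][1] += 1
            elif t == 1 and p == 0:
                counts[j][2] += 1
    # Macro: average the per-label F1 scores, so every label weighs the same.
    macro = sum(f1(*c) for c in counts) / n_labels
    # Micro: pool the counts over all labels first, then compute one F1.
    micro = f1(
        sum(c[0] for c in counts),
        sum(c[1] for c in counts),
        sum(c[2] for c in counts),
    )
    return macro, micro


if __name__ == "__main__":
    # Hypothetical labels: column 0 is frequent, column 1 is rare.
    y_true = [[1, 0], [1, 0], [1, 1], [0, 0]]
    y_pred = [[1, 0], [1, 0], [1, 0], [0, 1]]
    macro, micro = macro_micro_f1(y_true, y_pred)
    print(f"macro F1 = {macro:.2f}, micro F1 = {micro:.2f}")
```

In this toy case the rare label is always missed, which pulls the macro average down more than the micro average. The reverse ordering, as in the scores reported above, is also possible when less frequent labels are predicted comparatively well.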


Source journal

Computers in Biology and Medicine (Engineering and Technology, Engineering: Biomedical)

CiteScore: 11.70
Self-citation rate: 10.40%
Articles published: 1086
Review time: 74 days

About the journal: Computers in Biology and Medicine is an international forum for sharing groundbreaking advancements in the use of computers in bioscience and medicine. This journal serves as a medium for communicating essential research, instruction, ideas, and information regarding the rapidly evolving field of computer applications in these domains. By encouraging the exchange of knowledge, we aim to facilitate progress and innovation in the utilization of computers in biology and medicine.