在电子病历中增强自杀行为检测：一个具有转换模型和基于语义检索的注释的多标签NLP框架。

IF 4 2区医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Biomedical Informatics Pub Date : 2025-01-01 DOI:10.1016/j.jbi.2024.104755

Kimia Zandbiglari , Shobhan Kumar , Muhammad Bilal , Amie Goodin , Masoud Rouhizadeh

{"title":"在电子病历中增强自杀行为检测：一个具有转换模型和基于语义检索的注释的多标签NLP框架。","authors":"Kimia Zandbiglari , Shobhan Kumar , Muhammad Bilal , Amie Goodin , Masoud Rouhizadeh","doi":"10.1016/j.jbi.2024.104755","DOIUrl":null,"url":null,"abstract":"<div><h3>Background:</h3><div>Suicide is a leading cause of death worldwide, making early identification of suicidal behaviors crucial for clinicians. Current Natural Language Processing (NLP) approaches for identifying suicidal behaviors in Electronic Health Records (EHRs) rely on keyword searches, rule-based methods, and binary classification, which may not fully capture the complexity and spectrum of suicidal behaviors. This study aims to create a multi-class labeled dataset with annotation guidelines and develop a novel NLP approach for fine-grained, multi-label classification of suicidal behaviors, improving the efficiency of the annotation process and accuracy of the NLP methods.</div></div><div><h3>Methods:</h3><div>We develop a multi-class labeling system based on guidelines from FDA, CDC, and WHO, distinguishing between six categories of suicidal behaviors and allowing for multiple labels per data sample. To efficiently create an annotated dataset, we use an MPNet-based semantic retrieval framework to extract relevant sentences from a large EHR dataset, reducing annotation space while capturing diverse expressions. Experts annotate the extracted sentences using the multi-class system. We then formulate the task as a multi-label classification problem and fine-tune transformer-based models on the curated dataset to accurately classify suicidal behaviors in EHRs.</div></div><div><h3>Results:</h3><div>Lexical analysis revealed key themes in assessing suicide risk, considering an individual’s history, mental health, substance use, and family background. Fine-tuned transformer-based models effectively identified suicidal behaviors from EHRs, with Bio_ClinicalBERT, BioBERT, and XLNet achieving the F1 scores (0.81), outperforming BERT and RoBERTa. The proposed approach, based on a multi-label classification system, captures the complexity of suicidal behaviors effectively particularly “Suicide Attempt” and “Family History” instances. The proposed approach, using task-specific NLP models and a multi-label classification system, captures the complexity of suicidal behaviors more effectively than traditional binary classification. However, direct comparisons with existing studies are difficult due to varying metrics and label definitions.</div></div><div><h3>Conclusion:</h3><div>This study presents a robust NLP framework for detecting suicidal behaviors in EHRs, leveraging task-specific fine-tuning of transformer-based models and a semi-automated pipeline. Despite limitations, the approach demonstrates the potential of advanced NLP techniques in enhancing the identification of suicidal behaviors. Future work should focus on model expansion and integration to further improve patient care and clinical decision-making.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"161 ","pages":"Article 104755"},"PeriodicalIF":4.0000,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Enhancing suicidal behavior detection in EHRs: A multi-label NLP framework with transformer models and semantic retrieval-based annotation\",\"authors\":\"Kimia Zandbiglari , Shobhan Kumar , Muhammad Bilal , Amie Goodin , Masoud Rouhizadeh\",\"doi\":\"10.1016/j.jbi.2024.104755\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Background:</h3><div>Suicide is a leading cause of death worldwide, making early identification of suicidal behaviors crucial for clinicians. Current Natural Language Processing (NLP) approaches for identifying suicidal behaviors in Electronic Health Records (EHRs) rely on keyword searches, rule-based methods, and binary classification, which may not fully capture the complexity and spectrum of suicidal behaviors. This study aims to create a multi-class labeled dataset with annotation guidelines and develop a novel NLP approach for fine-grained, multi-label classification of suicidal behaviors, improving the efficiency of the annotation process and accuracy of the NLP methods.</div></div><div><h3>Methods:</h3><div>We develop a multi-class labeling system based on guidelines from FDA, CDC, and WHO, distinguishing between six categories of suicidal behaviors and allowing for multiple labels per data sample. To efficiently create an annotated dataset, we use an MPNet-based semantic retrieval framework to extract relevant sentences from a large EHR dataset, reducing annotation space while capturing diverse expressions. Experts annotate the extracted sentences using the multi-class system. We then formulate the task as a multi-label classification problem and fine-tune transformer-based models on the curated dataset to accurately classify suicidal behaviors in EHRs.</div></div><div><h3>Results:</h3><div>Lexical analysis revealed key themes in assessing suicide risk, considering an individual’s history, mental health, substance use, and family background. Fine-tuned transformer-based models effectively identified suicidal behaviors from EHRs, with Bio_ClinicalBERT, BioBERT, and XLNet achieving the F1 scores (0.81), outperforming BERT and RoBERTa. The proposed approach, based on a multi-label classification system, captures the complexity of suicidal behaviors effectively particularly “Suicide Attempt” and “Family History” instances. The proposed approach, using task-specific NLP models and a multi-label classification system, captures the complexity of suicidal behaviors more effectively than traditional binary classification. However, direct comparisons with existing studies are difficult due to varying metrics and label definitions.</div></div><div><h3>Conclusion:</h3><div>This study presents a robust NLP framework for detecting suicidal behaviors in EHRs, leveraging task-specific fine-tuning of transformer-based models and a semi-automated pipeline. Despite limitations, the approach demonstrates the potential of advanced NLP techniques in enhancing the identification of suicidal behaviors. Future work should focus on model expansion and integration to further improve patient care and clinical decision-making.</div></div>\",\"PeriodicalId\":15263,\"journal\":{\"name\":\"Journal of Biomedical Informatics\",\"volume\":\"161 \",\"pages\":\"Article 104755\"},\"PeriodicalIF\":4.0000,\"publicationDate\":\"2025-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Biomedical Informatics\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1532046424001734\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Biomedical Informatics","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1532046424001734","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}

引用次数: 0

摘要

背景：自杀是世界范围内死亡的主要原因，因此早期识别自杀行为对临床医生至关重要。当前用于识别电子健康记录（EHRs）中自杀行为的自然语言处理（NLP）方法依赖于关键字搜索、基于规则的方法和二元分类，这些方法可能无法完全捕捉自杀行为的复杂性和范围。本研究旨在创建一个带有标注指南的多类别标注数据集，并开发一种新的NLP方法，用于细粒度、多标签的自杀行为分类，提高标注过程的效率和NLP方法的准确性。方法：我们根据FDA、CDC和WHO的指南开发了一个多类别标签系统，区分了六种自杀行为，并允许每个数据样本进行多个标签。为了高效地创建带注释的数据集，我们使用基于mpnet的语义检索框架从大型EHR数据集中提取相关句子，在捕获不同表达的同时减少注释空间。专家使用多类系统对提取的句子进行注释。然后，我们将任务制定为一个多标签分类问题，并在策划的数据集上微调基于变压器的模型，以准确地分类电子病历中的自杀行为。结果：词汇分析揭示了评估自杀风险的关键主题，考虑了个人的历史、精神健康、物质使用和家庭背景。基于变压器的微调模型有效地识别了电子健康记录中的自杀行为，Bio_ClinicalBERT、BioBERT和XLNet达到了F1分数（0.81），优于BERT和RoBERTa。该方法基于多标签分类系统，有效地捕捉了自杀行为的复杂性，特别是“自杀企图”和“家族史”实例。该方法使用特定任务的NLP模型和多标签分类系统，比传统的二元分类更有效地捕捉自杀行为的复杂性。然而，由于不同的指标和标签定义，与现有研究的直接比较是困难的。结论：本研究提出了一个强大的NLP框架，用于检测电子病历中的自杀行为，利用基于变压器的特定任务微调模型和半自动化管道。尽管存在局限性，但该方法证明了先进的NLP技术在增强自杀行为识别方面的潜力。未来的工作应注重模型的扩展和整合，以进一步改善患者护理和临床决策。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

Enhancing suicidal behavior detection in EHRs: A multi-label NLP framework with transformer models and semantic retrieval-based annotation

查看原文本刊更多论文

Enhancing suicidal behavior detection in EHRs: A multi-label NLP framework with transformer models and semantic retrieval-based annotation

Background:

Suicide is a leading cause of death worldwide, making early identification of suicidal behaviors crucial for clinicians. Current Natural Language Processing (NLP) approaches for identifying suicidal behaviors in Electronic Health Records (EHRs) rely on keyword searches, rule-based methods, and binary classification, which may not fully capture the complexity and spectrum of suicidal behaviors. This study aims to create a multi-class labeled dataset with annotation guidelines and develop a novel NLP approach for fine-grained, multi-label classification of suicidal behaviors, improving the efficiency of the annotation process and accuracy of the NLP methods.

Methods:

We develop a multi-class labeling system based on guidelines from FDA, CDC, and WHO, distinguishing between six categories of suicidal behaviors and allowing for multiple labels per data sample. To efficiently create an annotated dataset, we use an MPNet-based semantic retrieval framework to extract relevant sentences from a large EHR dataset, reducing annotation space while capturing diverse expressions. Experts annotate the extracted sentences using the multi-class system. We then formulate the task as a multi-label classification problem and fine-tune transformer-based models on the curated dataset to accurately classify suicidal behaviors in EHRs.

Results:

Lexical analysis revealed key themes in assessing suicide risk, considering an individual’s history, mental health, substance use, and family background. Fine-tuned transformer-based models effectively identified suicidal behaviors from EHRs, with Bio_ClinicalBERT, BioBERT, and XLNet achieving the F1 scores (0.81), outperforming BERT and RoBERTa. The proposed approach, based on a multi-label classification system, captures the complexity of suicidal behaviors effectively particularly “Suicide Attempt” and “Family History” instances. The proposed approach, using task-specific NLP models and a multi-label classification system, captures the complexity of suicidal behaviors more effectively than traditional binary classification. However, direct comparisons with existing studies are difficult due to varying metrics and label definitions.

Conclusion:

This study presents a robust NLP framework for detecting suicidal behaviors in EHRs, leveraging task-specific fine-tuning of transformer-based models and a semi-automated pipeline. Despite limitations, the approach demonstrates the potential of advanced NLP techniques in enhancing the identification of suicidal behaviors. Future work should focus on model expansion and integration to further improve patient care and clinical decision-making.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Journal of Biomedical Informatics 医学-计算机：跨学科应用

CiteScore

8.90

自引率

6.70%

发文量

243

审稿时长

32 days

期刊介绍： The Journal of Biomedical Informatics reflects a commitment to high-quality original research papers, reviews, and commentaries in the area of biomedical informatics methodology. Although we publish articles motivated by applications in the biomedical sciences (for example, clinical medicine, health care, population health, and translational bioinformatics), the journal emphasizes reports of new methodologies and techniques that have general applicability and that form the basis for the evolving science of biomedical informatics. Articles on medical devices; evaluations of implemented systems (including clinical trials of information technologies); or papers that provide insight into a biological process, a specific disease, or treatment options would generally be more suitable for publication in other venues. Papers on applications of signal processing and image analysis are often more suitable for biomedical engineering journals or other informatics journals, although we do publish papers that emphasize the information management and knowledge representation/modeling issues that arise in the storage and use of biological signals and images. System descriptions are welcome if they illustrate and substantiate the underlying methodology that is the principal focus of the report and an effort is made to address the generalizability and/or range of application of that methodology. Note also that, given the international nature of JBI, papers that deal with specific languages other than English, or with country-specific health systems or approaches, are acceptable for JBI only if they offer generalizable lessons that are relevant to the broad JBI readership, regardless of their country, language, culture, or health system.