Adapting transformer-based language models for heart disease detection and risk factors extraction

IF 6.4 · Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS

Journal of Big Data Pub Date : 2024-04-04 DOI:10.1186/s40537-024-00903-y

Essam H. Houssein, Rehab E. Mohamed, Gang Hu, Abdelmgeid A. Ali

Abstract

Efficiently treating cardiac patients before the onset of a heart attack relies on the precise prediction of heart disease. Identifying and detecting the risk factors for heart disease, such as diabetes mellitus, coronary artery disease (CAD), hyperlipidemia, hypertension, smoking, family history of CAD, obesity, and medications, is critical for developing effective preventive and management measures. Although electronic health records (EHRs) have emerged as valuable resources for identifying these risk factors, their unstructured format makes it difficult for cardiologists to retrieve relevant information. This research proposed using transfer learning to automatically extract heart disease risk factors from EHRs. Transfer learning, a deep learning technique, has demonstrated strong performance in various clinical natural language processing (NLP) applications, particularly in heart disease risk prediction. This study explored the application of transformer-based language models, specifically the pre-trained architectures BERT (Bidirectional Encoder Representations from Transformers), RoBERTa, BioClinicalBERT, XLNet, and BioBERT, for heart disease detection and extraction of related risk factors from clinical notes, using the i2b2 dataset. These transformer models are pre-trained on an extensive corpus of medical literature and clinical records to learn contextualized language representations. The adapted models are then fine-tuned on annotated datasets specific to heart disease, such as the i2b2 dataset, enabling them to learn patterns and relationships within the domain. These models have demonstrated superior performance in extracting semantic information from EHRs, automating heart disease risk factor identification, and performing downstream NLP tasks within the clinical domain. This study fine-tuned five widely used transformer-based models, namely BERT, RoBERTa, BioClinicalBERT, XLNet, and BioBERT, on the 2014 i2b2 clinical NLP challenge dataset. The fine-tuned models surpass conventional approaches in predicting the presence of heart disease risk factors. The RoBERTa model achieved the highest performance, with a micro F1-score of 94.27%, while the BERT, BioClinicalBERT, XLNet, and BioBERT models provided competitive performance, with micro F1-scores of 93.73%, 94.03%, 93.97%, and 93.99%, respectively. Finally, a simple ensemble of the five transformer-based models was proposed, which outperformed most existing methods in heart disease risk factor identification, achieving a micro F1-score of 94.26%. This study demonstrated the efficacy of transfer learning using transformer-based models in enhancing risk prediction and facilitating early intervention for heart disease prevention.
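The abstract reports micro F1-scores and a "simple ensemble" without describing how either is computed. The sketch below is only an illustration under stated assumptions: it treats each clinical note as carrying a set of risk-factor labels, combines model outputs by majority vote (an assumed combination rule, not confirmed by the source), and computes the micro-averaged F1 as 2TP / (2TP + FP + FN) pooled over all notes and labels. The function names and the toy data are hypothetical.

```python
from collections import Counter
from typing import List, Set


def majority_vote(predictions: List[Set[str]]) -> Set[str]:
    """Keep a label when more than half of the models predicted it.

    Assumption: majority voting stands in for the paper's unspecified
    "simple ensemble".
    """
    counts = Counter(label for labels in predictions for label in labels)
    threshold = len(predictions) / 2
    return {label for label, n in counts.items() if n > threshold}


def micro_f1(gold: List[Set[str]], predicted: List[Set[str]]) -> float:
    """Micro-averaged F1: pool TP, FP, and FN over all notes and labels."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)   # labels predicted and present in the gold set
        fp += len(p - g)   # labels predicted but absent from the gold set
        fn += len(g - p)   # gold labels the prediction missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data (hypothetical): gold labels for two notes and three models' outputs.
gold = [{"hypertension", "diabetes"}, {"smoker"}]
model_outputs = [
    [{"hypertension", "diabetes"}, {"smoker", "obesity"}],  # model A
    [{"hypertension"}, {"smoker"}],                         # model B
    [{"hypertension", "diabetes", "cad"}, set()],           # model C
]

# Combine the per-note predictions of all models, then score the ensemble.
ensemble = [majority_vote(list(per_note)) for per_note in zip(*model_outputs)]
ensemble_score = micro_f1(gold, ensemble)
```

On this toy input the vote discards labels proposed by only one model, which is why an ensemble can score close to its best member even when individual models disagree. Because micro averaging pools counts across labels, frequent risk factors weigh more heavily in the score than rare ones.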

Source journal

Journal of Big Data (Computer Science-Information Systems)

CiteScore: 17.80
Self-citation rate: 3.70%
Articles published: 105
Review time: 13 weeks

About the journal: The Journal of Big Data publishes high-quality, scholarly research papers, methodologies, and case studies covering a broad spectrum of topics, from big data analytics to data-intensive computing and all applications of big data research. It addresses challenges facing big data today and in the future, including data capture and storage, search, sharing, analytics, technologies, visualization, architectures, data mining, machine learning, cloud computing, distributed systems, and scalable storage. The journal serves as a seminal source of innovative material for academic researchers and practitioners alike.