Enhancing coronary heart disease diagnosis: Comparative analysis of data pre-processing techniques and machine learning models using clinical medical records.

IF 2.3 | Zone 3, Medicine | JCR Q2 | HEALTH CARE SCIENCES & SERVICES

Health Informatics Journal Pub Date : 2025-07-01 Epub Date: 2025-08-06 DOI:10.1177/14604582251366160

Chun-Wei Tseng, Ling-Chun Sun, Ke-Feng Lin, Ping-Nan Chen

Health Informatics Journal, vol. 31, no. 3, article 14604582251366160. Publication type: Journal Article.

Citations: 0

Abstract

Machine learning techniques offer significant potential for improving the diagnosis of coronary heart disease by enabling earlier detection and timely intervention. This study presents a machine learning-based method that uses clinical records to evaluate the impact of different data preprocessing sequences on predictive accuracy. Two clinical datasets were examined: one comprising heart failure patient data with 14 clinical features, and the Cleveland Heart Disease Dataset. The investigation compared two preprocessing strategies: standardisation prior to balancing, and balancing prior to scaling. Six machine learning models (XGBoost, GBDT, AdaBoost, Random Forest, KNN, and RaSE) were trained on an 80:20 data split and assessed using accuracy, precision, recall, and F1-score. Hyperparameters were optimised with Bayesian optimisation. Results showed that both preprocessing designs achieved perfect accuracy on the Cleveland dataset. For the heart failure dataset, balancing before scaling led to improved accuracy (95%) compared with standardising before balancing (93.33%), and yielded higher macro-average and weighted-average F1-scores, signifying better overall classification performance. Among the evaluated models, XGBoost consistently provided the most robust predictions across conditions. These findings highlight the critical influence of preprocessing sequence on model effectiveness in imbalanced clinical data and suggest that balancing before scaling significantly enhances classification accuracy. XGBoost stands out as a reliable model for potential implementation in clinical decision support systems. Overall, this study advances the development of AI-driven tools for digital health applications, contributing meaningful insights to the field of health informatics.
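
The two preprocessing orders compared in the abstract can be illustrated with a minimal sketch. The abstract does not name the balancing technique, so the example below assumes simple duplication-based oversampling of the minority class and z-score standardisation. The toy data and the helper names `standardise` and `oversample` are hypothetical, and the train/test split is left out for brevity.

```python
from itertools import cycle, islice
from statistics import mean, pstdev


def standardise(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def oversample(values, labels):
    """Duplicate minority-class rows (cycling in order) until a binary
    dataset has equal class counts. Assumed technique, not the paper's."""
    counts = {c: labels.count(c) for c in set(labels)}
    minority = min(counts, key=counts.get)
    deficit = max(counts.values()) - counts[minority]
    minority_rows = [v for v, lab in zip(values, labels) if lab == minority]
    extra = list(islice(cycle(minority_rows), deficit))
    return values + extra, labels + [minority] * deficit


# Hypothetical toy data: one feature, six negatives and two positives.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 10.0, 12.0]
y = [0, 0, 0, 0, 0, 0, 1, 1]

# Order A: standardise first, then balance.
xa, ya = oversample(standardise(x), y)

# Order B: balance first, then standardise.
xb_raw, yb = oversample(x, y)
xb = standardise(xb_raw)

# After order A the balanced feature is no longer centred, because the
# scaling statistics were computed before the minority rows were duplicated.
print("order A mean:", round(mean(xa), 3), "std:", round(pstdev(xa), 3))
print("order B mean:", round(mean(xb), 3), "std:", round(pstdev(xb), 3))
```

In this sketch, order B leaves the model with a feature that has zero mean and unit variance on the balanced data, while order A does not. Whether that difference explains the accuracy gap reported in the abstract is not something the sketch can establish.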

Source journal

Health Informatics Journal (HEALTH CARE SCIENCES & SERVICES; MEDICAL INFORMATICS)

CiteScore: 7.80

Self-citation rate: 6.70%

Review time: 6 months

About the journal: Health Informatics Journal is an international peer-reviewed journal. All papers submitted to Health Informatics Journal are subject to peer review by members of a carefully appointed editorial board. The journal operates a conventional single-blind reviewing policy in which the reviewer's name is always concealed from the submitting author.