Weighted Rank Difference Ensemble: A New Form of Ensemble Feature Selection Method for Medical Datasets

BioMedInformatics. Pub Date: 2024-02-10. DOI: 10.3390/biomedinformatics4010027

Arju Manara Begum, M. Mondal, Prajoy Podder, J. Kamruzzaman



Abstract

Background: Feature selection (FS), a crucial preprocessing step in machine learning, reduces the dimensionality of the data and improves model performance. This paper focuses on selecting features for medical data classification. Methods: This work proposes a new ensemble FS method called the weighted rank difference ensemble (WRD-Ensemble). It combines three base FS methods to produce a stable and diverse subset of features: Pearson's correlation coefficient (PCC), ReliefF, and gain ratio (GR). Each base method produces its own list of features, ordered by importance or weight. The final subset is chosen using two quantities: the average weight of each feature and the difference in its rank across the three ranked lists. Based on these two quantities, unstable and less significant features are eliminated from the feature space. The WRD-Ensemble method is applied to three medical datasets: chronic kidney disease (CKD), lung cancer, and heart disease. The samples are classified using logistic regression (LR). Results: Compared with the base FS methods and other ensemble FS methods, the proposed WRD-Ensemble method achieves the highest accuracy: 98.97% for CKD, 93.24% for lung cancer, and 83.84% for heart disease. Conclusion: The results indicate that the proposed WRD-Ensemble method can potentially improve the accuracy of disease diagnosis models, contributing to advances in clinical decision-making.
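The selection step described in the Methods can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the function name `wrd_select`, the min-max scaling of each method's scores, the definition of rank difference as the largest rank minus the smallest rank across the three lists, and the thresholds `weight_min` and `rank_diff_max`. The scores are hypothetical values, not results from the paper.

```python
def wrd_select(scores, weight_min=0.5, rank_diff_max=2):
    """Illustrative weighted rank-difference filter (not the authors' exact rule).

    scores: one list of per-feature scores per base FS method
            (for example PCC, ReliefF, GR); higher means more important.
    """
    n_features = len(scores[0])
    weights, ranks = [], []
    for row in scores:
        # Assumption: min-max scale each method's scores to [0, 1]
        # so that weights from different methods are comparable.
        lo, hi = min(row), max(row)
        span = (hi - lo) or 1.0
        w = [(s - lo) / span for s in row]
        # Rank 1 is the most important feature in this method's list.
        order = sorted(range(n_features), key=lambda j: -w[j])
        r = [0] * n_features
        for position, j in enumerate(order, start=1):
            r[j] = position
        weights.append(w)
        ranks.append(r)

    # Average weight of each feature across the base methods.
    mean_weight = [sum(col) / len(col) for col in zip(*weights)]
    # Assumption: rank difference = spread between best and worst rank.
    rank_diff = [max(col) - min(col) for col in zip(*ranks)]

    # Keep features that are both significant and stably ranked.
    selected = [
        j for j in range(n_features)
        if mean_weight[j] >= weight_min and rank_diff[j] <= rank_diff_max
    ]
    return selected, mean_weight, rank_diff


# Hypothetical scores for 5 features from three base methods.
scores = [
    [0.9, 0.1, 0.5, 0.8, 0.2],    # stand-in for PCC
    [0.7, 0.2, 0.1, 0.9, 0.3],    # stand-in for ReliefF
    [0.8, 0.05, 0.6, 0.7, 0.1],   # stand-in for gain ratio
]
selected, mean_weight, rank_diff = wrd_select(scores)
print(selected)
```

With these made-up scores, only features 0 and 3 pass both thresholds. Feature 2 is dropped because its average weight is low, even though its rank spread is within the limit. In practice the thresholds would need tuning per dataset, and the retained features would then be passed to a classifier such as logistic regression.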

