利用粒子群优化--相互信息进行混合分布式特征选择

Khumukcham Robindro, Sanasam Surjalata Devi, Urikhimbam Boby Clinton, Linthoingambi Takhellambam, Yambem Ranjan Singh, Nazrul Hoque
{"title":"利用粒子群优化--相互信息进行混合分布式特征选择","authors":"Khumukcham Robindro,&nbsp;Sanasam Surjalata Devi,&nbsp;Urikhimbam Boby Clinton,&nbsp;Linthoingambi Takhellambam,&nbsp;Yambem Ranjan Singh,&nbsp;Nazrul Hoque","doi":"10.1016/j.dsm.2023.10.003","DOIUrl":null,"url":null,"abstract":"<div><p>Feature selection (FS) is a data preprocessing step in machine learning (ML) that selects a subset of relevant and informative features from a large feature pool. FS helps ML models improve their predictive accuracy at lower computational costs. Moreover, FS can handle the model overfitting problem on a high-dimensional dataset. A major problem with the filter and wrapper FS methods is that they consume a significant amount of time during FS on high-dimensional datasets. The proposed “HDFS(PSO-MI): hybrid distribute feature selection using particle swarm optimization-mutual information (PSO-MI)”, which is a PSO-based hybrid method that can overcome the problem mentioned above. This method hybridizes the filter and wrapper techniques in a distributed manner. A new combiner is also introduced to merge the effective features selected from multiple data distributions. The effectiveness of the proposed HDFS(PSO-MI) method is evaluated using five ML classifiers, i.e., logistic regression (LR), k-NN, support vector machine (SVM), decision tree (DT), and random forest (RF), on various datasets in terms of accuracy and Matthew’s correlation coefficient (MCC). From the experimental analysis, we observed that HDFS(PSO-MI) method yielded more than 98%, 95%, 92%, 90%, and 85% accuracy for the unbalanced, kidney disease, emotions, wafer manufacturing, and breast cancer datasets, respectively. Our method shows promising results comapred to other methods, such as mutual information, gain ratio, Spearman correlation, analysis of variance (ANOVA), Pearson correlation, and an ensemble feature selection with ranking method (EFSRank).</p></div>","PeriodicalId":100353,"journal":{"name":"Data Science and Management","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2023-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2666764923000462/pdfft?md5=712938edf51c71c99b1a5d68d7ef20da&pid=1-s2.0-S2666764923000462-main.pdf","citationCount":"0","resultStr":"{\"title\":\"Hybrid distributed feature selection using particle swarm optimization-mutual information\",\"authors\":\"Khumukcham Robindro,&nbsp;Sanasam Surjalata Devi,&nbsp;Urikhimbam Boby Clinton,&nbsp;Linthoingambi Takhellambam,&nbsp;Yambem Ranjan Singh,&nbsp;Nazrul Hoque\",\"doi\":\"10.1016/j.dsm.2023.10.003\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Feature selection (FS) is a data preprocessing step in machine learning (ML) that selects a subset of relevant and informative features from a large feature pool. FS helps ML models improve their predictive accuracy at lower computational costs. Moreover, FS can handle the model overfitting problem on a high-dimensional dataset. A major problem with the filter and wrapper FS methods is that they consume a significant amount of time during FS on high-dimensional datasets. The proposed “HDFS(PSO-MI): hybrid distribute feature selection using particle swarm optimization-mutual information (PSO-MI)”, which is a PSO-based hybrid method that can overcome the problem mentioned above. This method hybridizes the filter and wrapper techniques in a distributed manner. A new combiner is also introduced to merge the effective features selected from multiple data distributions. The effectiveness of the proposed HDFS(PSO-MI) method is evaluated using five ML classifiers, i.e., logistic regression (LR), k-NN, support vector machine (SVM), decision tree (DT), and random forest (RF), on various datasets in terms of accuracy and Matthew’s correlation coefficient (MCC). From the experimental analysis, we observed that HDFS(PSO-MI) method yielded more than 98%, 95%, 92%, 90%, and 85% accuracy for the unbalanced, kidney disease, emotions, wafer manufacturing, and breast cancer datasets, respectively. Our method shows promising results comapred to other methods, such as mutual information, gain ratio, Spearman correlation, analysis of variance (ANOVA), Pearson correlation, and an ensemble feature selection with ranking method (EFSRank).</p></div>\",\"PeriodicalId\":100353,\"journal\":{\"name\":\"Data Science and Management\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-10-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S2666764923000462/pdfft?md5=712938edf51c71c99b1a5d68d7ef20da&pid=1-s2.0-S2666764923000462-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Data Science and Management\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2666764923000462\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Data Science and Management","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666764923000462","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

摘要

特征选择(FS)是机器学习(ML)中的一个数据预处理步骤,它从一个庞大的特征库中选择相关和信息量大的特征子集。FS 可以帮助 ML 模型以较低的计算成本提高预测准确性。此外,FS 还能处理高维数据集上的模型过拟合问题。过滤器和包装 FS 方法的一个主要问题是,它们在高维数据集上进行 FS 时会消耗大量时间。提出的 "HDFS(PSO-MI):使用粒子群优化-互信息(PSO-MI)的混合分布式特征选择 "是一种基于 PSO 的混合方法,可以克服上述问题。该方法以分布式方式混合了过滤器和包装器技术。此外,还引入了一个新的组合器,用于合并从多个数据分布中选出的有效特征。我们使用五种 ML 分类器,即逻辑回归 (LR)、k-NN、支持向量机 (SVM)、决策树 (DT) 和随机森林 (RF),在各种数据集上从准确率和马修相关系数 (MCC) 角度评估了所提出的 HDFS(PSO-MI) 方法的有效性。通过实验分析,我们发现 HDFS(PSO-MI)方法在不平衡、肾病、情感、晶圆制造和乳腺癌数据集上的准确率分别超过了 98%、95%、92%、90% 和 85%。与其他方法(如互信息、增益比、斯皮尔曼相关性、方差分析(ANOVA)、皮尔逊相关性和带排序的集合特征选择方法(EFSRank))相比,我们的方法显示出良好的效果。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Hybrid distributed feature selection using particle swarm optimization-mutual information

Feature selection (FS) is a data preprocessing step in machine learning (ML) that selects a subset of relevant and informative features from a large feature pool. FS helps ML models improve their predictive accuracy at lower computational costs. Moreover, FS can handle the model overfitting problem on a high-dimensional dataset. A major problem with the filter and wrapper FS methods is that they consume a significant amount of time during FS on high-dimensional datasets. The proposed “HDFS(PSO-MI): hybrid distribute feature selection using particle swarm optimization-mutual information (PSO-MI)”, which is a PSO-based hybrid method that can overcome the problem mentioned above. This method hybridizes the filter and wrapper techniques in a distributed manner. A new combiner is also introduced to merge the effective features selected from multiple data distributions. The effectiveness of the proposed HDFS(PSO-MI) method is evaluated using five ML classifiers, i.e., logistic regression (LR), k-NN, support vector machine (SVM), decision tree (DT), and random forest (RF), on various datasets in terms of accuracy and Matthew’s correlation coefficient (MCC). From the experimental analysis, we observed that HDFS(PSO-MI) method yielded more than 98%, 95%, 92%, 90%, and 85% accuracy for the unbalanced, kidney disease, emotions, wafer manufacturing, and breast cancer datasets, respectively. Our method shows promising results comapred to other methods, such as mutual information, gain ratio, Spearman correlation, analysis of variance (ANOVA), Pearson correlation, and an ensemble feature selection with ranking method (EFSRank).

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
CiteScore
7.50
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信