Authors: Khumukcham Robindro, Sanasam Surjalata Devi, Urikhimbam Boby Clinton, Linthoingambi Takhellambam, Yambem Ranjan Singh, Nazrul Hoque
Journal: Data Science and Management
Published: 2023-10-14
DOI: 10.1016/j.dsm.2023.10.003
URL: https://www.sciencedirect.com/science/article/pii/S2666764923000462
Hybrid distributed feature selection using particle swarm optimization-mutual information
Feature selection (FS) is a data preprocessing step in machine learning (ML) that selects a subset of relevant and informative features from a large feature pool. FS helps ML models improve their predictive accuracy at lower computational cost and can mitigate overfitting on high-dimensional datasets. A major problem with filter and wrapper FS methods is that they consume a significant amount of time on high-dimensional datasets. The proposed method, “HDFS(PSO-MI): hybrid distributed feature selection using particle swarm optimization-mutual information,” is a PSO-based hybrid method that addresses this problem. It hybridizes the filter and wrapper techniques in a distributed manner, and a new combiner is introduced to merge the effective features selected from multiple data distributions. The effectiveness of the proposed HDFS(PSO-MI) method is evaluated using five ML classifiers, i.e., logistic regression (LR), k-nearest neighbors (k-NN), support vector machine (SVM), decision tree (DT), and random forest (RF), on various datasets in terms of accuracy and the Matthews correlation coefficient (MCC). From the experimental analysis, we observed that the HDFS(PSO-MI) method yielded more than 98%, 95%, 92%, 90%, and 85% accuracy for the unbalanced, kidney disease, emotions, wafer manufacturing, and breast cancer datasets, respectively. Our method shows promising results compared to other methods, such as mutual information, gain ratio, Spearman correlation, analysis of variance (ANOVA), Pearson correlation, and an ensemble feature selection with ranking method (EFSRank).
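To make the filter-plus-combiner idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that "multiple data distributions" means partitions of the samples, uses mutual information as the per-partition filter, and merges the per-partition selections with a simple vote. The PSO wrapper stage is omitted, and the vote-based combiner, the function names, and the toy data are hypothetical, since the abstract does not specify them. An MCC helper for binary labels is included because MCC is one of the reported metrics.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def top_k_by_mi(rows, labels, k):
    """Filter step: indices of the k features with the highest MI with the label."""
    n_features = len(rows[0])
    scores = [
        mutual_information([r[j] for r in rows], labels)
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def distributed_select(rows, labels, n_parts, k, min_votes):
    """Split the samples into n_parts, filter each part, then combine by vote."""
    votes = Counter()
    for p in range(n_parts):
        part_rows = rows[p::n_parts]      # interleaved split of the samples
        part_labels = labels[p::n_parts]
        votes.update(top_k_by_mi(part_rows, part_labels, k))
    # Hypothetical combiner: keep features chosen in at least min_votes parts.
    return sorted(j for j, v in votes.items() if v >= min_votes)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy data (made up for illustration): 40 samples, 4 discrete features.
labels = [(i // 4) % 2 for i in range(40)]
rows = [
    [
        labels[i],        # feature 0: copy of the label (informative)
        1 - labels[i],    # feature 1: negated label (informative)
        0,                # feature 2: constant (uninformative)
        (i // 8) % 2,     # feature 3: pattern independent of the label
    ]
    for i in range(40)
]

selected = distributed_select(rows, labels, n_parts=4, k=2, min_votes=3)
print(selected)  # → [0, 1]
```

In a fuller pipeline, a wrapper search such as PSO would refine each partition's candidate subset before the combiner runs, and the merged subset would then be scored with a classifier using accuracy and `mcc`.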