Hybrid Top-K Feature Selection to Improve High-Dimensional Data Classification Using Naïve Bayes Algorithm

Riska Wibowo, M. Soeleman, Affandy Affandy
{"title":"Hybrid Top-K Feature Selection to Improve High-Dimensional Data Classification Using Naïve Bayes Algorithm","authors":"Riska Wibowo, M. Soeleman, Affandy Affandy","doi":"10.15294/sji.v10i2.42818","DOIUrl":null,"url":null,"abstract":"Abstract. Purpose: The naive bayes algorithm is one of the most popular machine learning algorithms, because it is simple, has high computational efficiency and has good accuracy. The naive bayes method assumes each attribute contributes to determining the classification result that may exist between attributes, this can interfere with the classification performance of naive bayes. The naïve bayes algorithm is sensitive to many features so this can reduce the performance of naïve bayes. Efforts to improve the performance of the naïve bayes algorithm by using a hybrid top-k feature selection method that aims to handle high-dimensional data using the naïve bayes algorithm so as to produce better accuracy.Methods: This research proposes a hybrid top-k feature selection method with stages 1. Prepare the dataset, 2. Replace the missing value with the average value of each attribute, 3. Calculate the weight of the attribute value using the weight information gain method, 4. Select attributes using the top-k feature selection method, 5. Backward Elimination with the naïve bayes algorithm, 6. Datasets that have been selected new attributes, then validated using 10 fold-cross validation where the data is divided into training data and testing data, 7. Calculate the accuracy value based on the confusion matrix table.Result: Based on the experimental results of performance and performance comparison of several methods that have been presented (Naïve Bayes, deep feature weighting naïve bayes, top-k feature selection, and hybrid top-k feature selection). The experimental results in this study show that from 5 datasets from UCI Repository that have been tested, the accuracy value of the hybrid top-k feature selection method increases from the previous method. From the accuracy comparison results that the proposed hybrid top-k feature selection method is ranked the first best method.Novelty: Thus it can be concluded that the Hybrid top-k feature selection method can be used to handle dimensional data in the Naïve Bayes algorithm. ","PeriodicalId":30781,"journal":{"name":"Scientific Journal of Informatics","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2023-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Scientific Journal of Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.15294/sji.v10i2.42818","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Purpose: The Naïve Bayes algorithm is one of the most popular machine learning algorithms because it is simple, computationally efficient, and reasonably accurate. However, Naïve Bayes assumes that each attribute contributes to the classification result independently, ignoring dependencies that may exist between attributes, and this can interfere with its classification performance. The algorithm is also sensitive to large numbers of features, which can further reduce its performance. This study aims to improve the performance of Naïve Bayes on high-dimensional data by using a hybrid top-k feature selection method, so as to produce better accuracy.

Methods: This research proposes a hybrid top-k feature selection method with the following stages (a sketch of the pipeline is given after this abstract): (1) prepare the dataset; (2) replace missing values with the mean of each attribute; (3) calculate attribute weights using the information gain method; (4) select attributes using top-k feature selection; (5) apply backward elimination with the Naïve Bayes algorithm; (6) validate the dataset with the newly selected attributes using 10-fold cross-validation, in which the data are split into training and testing sets; (7) compute the accuracy value from the confusion matrix.

Results: Performance was compared across several methods (Naïve Bayes, deep feature weighting Naïve Bayes, top-k feature selection, and hybrid top-k feature selection). The experiments on five datasets from the UCI Repository show that the hybrid top-k feature selection method achieves higher accuracy than the previous methods and ranks first in the accuracy comparison.

Novelty: It can therefore be concluded that the hybrid top-k feature selection method can be used to handle high-dimensional data in the Naïve Bayes algorithm.
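The following is a minimal sketch of the pipeline outlined in the Methods, not the authors' implementation. It assumes scikit-learn, approximates the information gain weighting with mutual information, and uses a hypothetical helper name `hybrid_top_k_selection` and an illustrative value of k.

```python
# Sketch of the hybrid top-k feature selection pipeline described in the abstract.
# Assumptions: scikit-learn is available; mutual information stands in for the
# information gain weighting; k and the dataset are illustrative choices.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import mutual_info_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score


def hybrid_top_k_selection(X, y, k=10, cv=10):
    """Rank attributes by information gain, keep the top k, then refine the
    subset with backward elimination guided by Naive Bayes accuracy."""
    # Stage 2: replace missing values with the mean of each attribute.
    X = SimpleImputer(strategy="mean").fit_transform(X)

    # Stages 3-4: weight attributes (mutual information as a proxy for
    # information gain) and keep the k highest-weighted attributes.
    weights = mutual_info_classif(X, y, random_state=0)
    selected = list(np.argsort(weights)[::-1][:k])

    # Stages 6-7: 10-fold cross-validated Naive Bayes accuracy for a subset.
    def cv_accuracy(cols):
        return cross_val_score(GaussianNB(), X[:, cols], y,
                               cv=cv, scoring="accuracy").mean()

    # Stage 5: backward elimination -- drop an attribute only if doing so
    # does not lower the cross-validated accuracy.
    best_acc = cv_accuracy(selected)
    improved = True
    while improved and len(selected) > 1:
        improved = False
        for col in list(selected):
            candidate = [c for c in selected if c != col]
            acc = cv_accuracy(candidate)
            if acc >= best_acc:
                selected, best_acc = candidate, acc
                improved = True
                break
    return selected, best_acc


if __name__ == "__main__":
    # Example run on a UCI-style dataset bundled with scikit-learn.
    from sklearn.datasets import load_breast_cancer
    X, y = load_breast_cancer(return_X_y=True)
    cols, acc = hybrid_top_k_selection(X, y, k=15)
    print(f"selected attributes: {cols}, 10-fold accuracy: {acc:.4f}")
```

The greedy removal loop stops as soon as no single attribute can be dropped without reducing accuracy; the paper's own elimination criterion may differ in detail.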