A New Performance Metric to Evaluate Filter Feature Selection Methods in Text Classification

JUCS - Journal of Universal Computer Science, Pub Date: 2024-07-28, DOI: 10.3897/jucs.111675

Rasim Çekik, Mahmut Kaya


Abstract

High dimensionality and sparsity are the primary issues in text classification. The most effective way to address them is feature selection, that is, selecting a subset of the features. The most common and effective methods for this purpose are filter techniques. Performance metrics such as Micro-F1, Macro-F1, and Accuracy are used to evaluate filter methods on datasets, but such metrics depend on a classification algorithm. Filter techniques, however, evaluate the information in each feature individually, without considering the relationships between features. Under such an evaluation, the actual performance of the filter technique may not be determined, and the existing metrics are therefore insufficient for testing the validity of a proposed method. To address this, the study proposes a novel performance metric called Selection Error (SE) for evaluating the actual performance of filter techniques. The Selection Error metric allows the information value of the selected features to be analyzed more accurately than with existing metrics, without relying on a classifier. The filter approaches were evaluated on six different datasets using both Selection Error and traditional performance metrics. The results show a strong relationship between the proposed metric and the classification performance metrics. Selection Error aims to contribute significantly to the literature by demonstrating the success of filter feature selection methods regardless of classifier performance.
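For readers unfamiliar with the traditional metrics named in the abstract, the following is a minimal sketch of how Micro-F1 and Macro-F1 are commonly computed from a classifier's predictions. It is not the Selection Error metric, which the abstract does not define. The function names (`f1`, `micro_macro_f1`) and the toy labels are hypothetical and chosen only for illustration; single-label multi-class classification is assumed.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; taken as 0.0 when the class never occurs."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p gains a false positive
            fn[t] += 1  # true class t gains a false negative
    # Micro-F1 pools the counts over all classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro-F1 averages the per-class F1 scores with equal class weight.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy labels (hypothetical): predictions of some classifier trained on
# features chosen by a filter method.
y_true = ["sport", "sport", "sport", "politics", "tech"]
y_pred = ["sport", "sport", "politics", "politics", "sport"]
micro, macro = micro_macro_f1(y_true, y_pred)
print(f"Micro-F1 = {micro:.3f}, Macro-F1 = {macro:.3f}")
```

Both scores are computed from classifier predictions, so they reflect the classifier as well as the selected features. This is the dependence the abstract describes as a limitation of the traditional metrics.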
