Statera: A Balanced Feature Selection Method for Text Classification
Tatiane Nogueira Rios, Braian Varjão Gama Bispo
2018 7th Brazilian Conference on Intelligent Systems (BRACIS), published 2018-10-01
DOI: 10.1109/bracis.2018.00052 (https://doi.org/10.1109/bracis.2018.00052)
Citations: 2
Abstract
Feature selection is widely used to overcome the problems caused by the curse of dimensionality, since it reduces data dimensionality by removing irrelevant and redundant features from a dataset. Moreover, it is an important pre-processing step, usually mandatory, in text mining tasks that use machine learning techniques. In this paper, we propose a new feature selection method for text classification, named Statera, which selects a subset of features that guarantees the representativeness of all classes from a domain in a balanced way, and computes this degree of representativeness using information retrieval measures. We demonstrate the effectiveness of our method by conducting experiments on nine real document collections. The results show that the proposed approach can outperform state-of-the-art feature selection methods, achieving good classification results even with a very small number of features.
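The abstract does not specify how Statera scores or balances features, so the following is only a rough illustration of the general idea of class-balanced selection, not the authors' algorithm. It assumes a hypothetical scoring rule in which each term is treated as a query that "retrieves" the documents containing it, and precision, recall, and F1 are measured against the documents of one class. The function name `balanced_select` and the parameter `k_per_class` are invented for this sketch.

```python
from collections import defaultdict


def balanced_select(docs, labels, k_per_class):
    """Pick up to k_per_class terms for each class, ranked by per-class F1.

    docs: list of token lists; labels: parallel list of class labels.
    Assumption (not from the paper): a term "retrieves" the documents that
    contain it, and is scored by F1 against the documents of a given class.
    """
    class_sizes = defaultdict(int)      # number of documents per class
    term_df = defaultdict(int)          # number of documents containing a term
    term_class_df = defaultdict(lambda: defaultdict(int))

    for tokens, label in zip(docs, labels):
        class_sizes[label] += 1
        for term in set(tokens):        # count each term once per document
            term_df[term] += 1
            term_class_df[term][label] += 1

    selected = {}
    for label in sorted(class_sizes):
        scores = []
        for term, per_class in term_class_df.items():
            hits = per_class.get(label, 0)
            if hits == 0:
                continue
            precision = hits / term_df[term]
            recall = hits / class_sizes[label]
            f1 = 2 * precision * recall / (precision + recall)
            # Negate F1 so an ascending sort ranks the best terms first;
            # ties are broken alphabetically by term.
            scores.append((-f1, term))
        scores.sort()
        # Every class receives the same quota, so none is left unrepresented.
        selected[label] = [term for _, term in scores[:k_per_class]]
    return selected
```

Giving every class the same quota is one simple way to keep a class with few or weak terms from being crowded out, which a single global ranking would not guarantee. The final feature set under this sketch would be the union of the per-class lists.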