Rogerio C. P. Fragoso, Roberto H. W. Pinheiro, George D. C. Cavalcanti
A Method for Automatic Determination of the Feature Vector Size for Text Categorization
2016 5th Brazilian Conference on Intelligent Systems (BRACIS), published 2016-10-01
DOI: 10.1109/BRACIS.2016.055 (https://doi.org/10.1109/BRACIS.2016.055)
Citations: 5
Abstract
In this paper, we propose the Automatic Feature Subsets Analyzer (AFSA), a filter-based feature selection method for text categorization. AFSA extends the Class-dependent Maximum Features per Document (cMFDR) algorithm by automatically determining the best number of features per document. In cMFDR, this number is chosen by applying the method repeatedly, which is a time-consuming strategy. In contrast, AFSA finds the best number of features in a data-driven way, which is faster than cMFDR. Experiments with the Naïve Bayes Multinomial classifier, using four benchmark datasets and three Feature Evaluation Functions, showed that AFSA outperforms cMFDR or achieves similar results.
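To make the role of the "number of features per document" concrete, here is a minimal Python sketch. It is neither AFSA nor cMFDR: it assumes only that a per-document selection scheme keeps the f highest-scoring terms of each document and takes the union as the feature vector. The function names, the toy corpus, and the use of document frequency as a stand-in Feature Evaluation Function are all illustrative assumptions, and the class-dependent part is omitted.

```python
from collections import Counter


def document_frequency_scores(docs):
    """Toy stand-in for a Feature Evaluation Function (assumption):
    score each term by the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return dict(df)


def select_features(docs, scores, f):
    """Keep the f highest-scoring terms of each document and
    return the union over all documents (hypothetical helper)."""
    selected = set()
    for doc in docs:
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(set(doc), key=lambda t: (-scores[t], t))
        selected.update(ranked[:f])
    return selected


# Tiny illustrative corpus (made up for this sketch).
docs = [
    ["cheap", "pills", "buy", "now"],
    ["meeting", "agenda", "now"],
    ["buy", "cheap", "tickets"],
]
scores = document_frequency_scores(docs)

# Sweeping f by hand: each candidate value needs its own selection run
# (and, in a real setting, its own classifier training and evaluation).
vector_sizes = {f: len(select_features(docs, scores, f)) for f in (1, 2, 3)}
print(vector_sizes)  # {1: 2, 2: 4, 3: 6}
```

The sweep at the end shows why choosing f by repeated application is costly: the feature vector size changes with every candidate f, so each value requires a full selection and evaluation pass. A data-driven choice of f avoids repeating that loop.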