Improving Classification Accuracy of Automated Text Classifiers

2018 7th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO). Pub Date: 2018-08-01. DOI: 10.1109/ICRITO.2018.8748498

Shivam Rastogi


Citations: 2

Abstract

The number of textual documents is growing rapidly, and there is often a need to classify these documents automatically into a fixed set of predefined categories. Because the classification is automatic, the classifier must produce as few misclassifications as possible, so classification accuracy is critical. Several factors affect a classifier's accuracy. One of them is the feature selection method used to reduce the number of features in the documents. Information Gain (IG) is one of the most popular measures for this task, but it has a few shortcomings in how it evaluates which words are useful for classification. This paper proposes a new measure that eliminates those shortcomings and thus finds words that are more useful for the classification task. The proposed measure aims to find the words that have more discriminating power than others, and it is therefore named Discriminating Power (DP). We also compare the results of using the IG and DP measures for text classification. The results show that the proposed measure takes almost the same time as IG, yet improves the average classification accuracy of a text classifier and is much more consistent, or stable, in its classification accuracy when the number of features to be selected is varied.
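
For context, the IG baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration of the standard information-gain score for a term under a present/absent document model, not the paper's implementation. The proposed DP measure is not shown because the abstract does not define it. The toy corpus and the function names (`entropy`, `information_gain`, `select_top_k`) are assumptions made for illustration only.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """IG(t) = H(C) - [P(t) * H(C | t) + P(not t) * H(C | not t)].

    docs   : list of token sets (hypothetical toy representation)
    labels : list of class labels, one per document
    """
    classes = sorted(set(labels))
    with_term = Counter(lbl for doc, lbl in zip(docs, labels) if term in doc)
    without_term = Counter(lbl for doc, lbl in zip(docs, labels) if term not in doc)

    n = len(docs)
    n_with = sum(with_term.values())
    n_without = n - n_with

    h_prior = entropy([labels.count(c) for c in classes])
    h_with = entropy([with_term[c] for c in classes])
    h_without = entropy([without_term[c] for c in classes])
    return h_prior - (n_with / n) * h_with - (n_without / n) * h_without


def select_top_k(docs, labels, k):
    """Return the k terms with the highest IG (ties broken alphabetically)."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]


# Toy corpus (assumed for illustration): each document is a set of tokens.
docs = [
    {"goal", "match", "team"},
    {"goal", "player", "team"},
    {"election", "vote", "team"},
    {"election", "party", "vote"},
]
labels = ["sport", "sport", "politics", "politics"]

for term in ("goal", "team"):
    print(term, round(information_gain(docs, labels, term), 4))
print(select_top_k(docs, labels, 3))
```

In this toy corpus, "goal" separates the two classes perfectly and so scores higher than "team", which appears in both classes. Varying `k` in `select_top_k` corresponds to varying the number of selected features, the setting in which the abstract compares IG and DP.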

