Text Categorization via Attribute Distance Weighted k-Nearest Neighbor Classification

H. Wandabwa, Defu Zhang, Korir Sammy
{"title":"Text Categorization via Attribute Distance Weighted k-Nearest Neighbor Classification","authors":"H. Wandabwa, Defu Zhang, Korir Sammy","doi":"10.1109/ICIT.2016.053","DOIUrl":null,"url":null,"abstract":"Text categorization entails making a decision on whether a document belongs to a set of pre-specified classes of other documents. This can be in a supervised way in classification tasks or unsupervised reminiscent of clustering related tasks. Categorization can be a challenging task especially when the discriminating words are large. K-Nearest Neighbor is an instance based learning algorithm that has proven to be effective in such classification tasks including documents. The key element of this algorithm lies in the similarity measurement principle that is capable of identifying neighbors of a particular document to high accuracies. The only drawback of this approach is in the weighting of all features to determine the distance among the documents in question. This is not only time consuming but also overuses computer resources without adding anything substantial to the overall results. In our approach (Attribute Distance Weighted - KNN), we do not make use of all features in the corpus but first extract the most relevant ones by weighting them in relation to the corpus. We then calculated the distance between the highly ranked features in the corpus alone as a representative of the entire document set. So far no known literature has inclined towards this approach thus our comparison will be in relation to the classical KNN measure. 
Our approach showed marginal performance in distance measure compared to classical KNN.","PeriodicalId":220153,"journal":{"name":"2016 International Conference on Information Technology (ICIT)","volume":"212 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 International Conference on Information Technology (ICIT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICIT.2016.053","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 6

Abstract

Text categorization entails deciding whether a document belongs to one of a set of pre-specified classes of documents. It can be performed in a supervised manner, as in classification tasks, or in an unsupervised manner, as in clustering-related tasks. Categorization becomes challenging when the number of discriminating words is large. K-Nearest Neighbor (KNN) is an instance-based learning algorithm that has proven effective in such classification tasks, including document classification. The key element of this algorithm is its similarity measure, which can identify the neighbors of a given document with high accuracy. The main drawback of this approach is that all features are weighted when computing the distances among the documents in question; this is not only time-consuming but also wastes computational resources without contributing anything substantial to the overall results. In our approach (Attribute Distance Weighted KNN), we do not use all features in the corpus; instead, we first extract the most relevant ones by weighting them with respect to the corpus, and we then compute the distances over the highly ranked features alone as a representative of the entire document set. As no known literature has taken this approach, our comparison is made against the classical KNN measure. Our approach showed a marginal performance gain in the distance measure compared to classical KNN.
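The pipeline the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a TF-IDF-style score as the attribute-weighting scheme (the abstract does not fix a specific one), selects the top-ranked attributes across the corpus, and then runs plain Euclidean-distance KNN over that reduced feature set only. All function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Score each term by TF-IDF summed over the corpus (one plausible
    attribute-weighting scheme; assumed here for illustration)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency
    scores = Counter()
    for d in docs:
        tf = Counter(d)
        for t, c in tf.items():
            scores[t] += (c / len(d)) * math.log(n / df[t])
    return scores

def top_features(docs, k):
    """Keep only the k highest-ranked attributes across the corpus."""
    return [t for t, _ in tfidf_weights(docs).most_common(k)]

def vectorize(doc, feats):
    """Represent a document by term counts over the selected attributes only."""
    tf = Counter(doc)
    return [tf[f] for f in feats]

def knn_classify(train, labels, query, feats, k=3):
    """Classical KNN vote, but distances use only the selected attributes."""
    qv = vectorize(query, feats)
    dists = []
    for doc, lab in zip(train, labels):
        dv = vectorize(doc, feats)
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(qv, dv)))
        dists.append((dist, lab))
    dists.sort(key=lambda x: x[0])
    votes = Counter(lab for _, lab in dists[:k])
    return votes.most_common(1)[0][0]
```

Because the distance loop runs over the reduced attribute list rather than the full vocabulary, its cost scales with the number of retained features, which is the saving the abstract claims over weighting every feature in the corpus.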