{"title":"Contextual Feature Weighting Using Knowledge beyond the Repository Knowledge","authors":"Kazem Qazanfari, Abdou Youssef","doi":"10.17706/IJCCE.2018.7.3.45-57","DOIUrl":null,"url":null,"abstract":"Bag of words, bigram, or more complex combinations of words are the most among general and widely used features in text classification. However, in almost all real-world text classification problems, the distribution of the available training dataset for each class often does not match the real distribution of the class concept, which reduces the accuracy of the classifiers. Let W(f) and R(f) be the discriminating power of feature f based on the world knowledge and the repository knowledge, respectively. In an ideal situation, W(f) = R(f) is desirable; however, in most situations, W(f) and R(f) are not equal and sometimes they are quite different, because the repository knowledge and the world knowledge do not have the same statistics about the discriminating power of feature f. In this paper, this phenomenon is called inadequacy of knowledge and we show how this phenomenon could reduce the performance of the text classifiers. To solve this issue, a novel feature weighting method is proposed which combines two bodies of knowledge, world knowledge and repository knowledge, using a particular transformation T. In this method, if both the world knowledge and the repository knowledge indicate a significantly high (resp., low) discriminating power of feature f, the weight of this feature is increased (resp., decreased); otherwise, the weight of the feature will be determined by a linear combination of the two weights. Experimental results show that the performance of classifiers like SVM, KNN and Bayes improves significantly if the proposed feature weighting method is applied on the contextual features such as bigram and unigram. It is shown also that pruning some words from the dataset using the proposed feature weighting method could improve the performance of the text classifier when the feature sets are created using Doc2vec.","PeriodicalId":23787,"journal":{"name":"World Academy of Science, Engineering and Technology, International Journal of Electrical, Computer, Energetic, Electronic and Communication Engineering","volume":"32 1","pages":"47-57"},"PeriodicalIF":0.0000,"publicationDate":"2018-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"World Academy of Science, Engineering and Technology, International Journal of Electrical, Computer, Energetic, Electronic and Communication Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.17706/IJCCE.2018.7.3.45-57","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Bags of words, bigrams, and more complex combinations of words are among the most general and widely used features in text classification. However, in almost all real-world text classification problems, the distribution of the available training dataset for each class does not match the real distribution of the class concept, which reduces the accuracy of the classifiers. Let W(f) and R(f) be the discriminating power of feature f based on the world knowledge and the repository knowledge, respectively. Ideally, W(f) = R(f); however, in most situations W(f) and R(f) are not equal, and sometimes they are quite different, because the repository knowledge and the world knowledge do not have the same statistics about the discriminating power of feature f. In this paper, we call this phenomenon inadequacy of knowledge and show how it can reduce the performance of text classifiers. To address this issue, a novel feature weighting method is proposed which combines the two bodies of knowledge, world knowledge and repository knowledge, using a particular transformation T. In this method, if both the world knowledge and the repository knowledge indicate a significantly high (resp., low) discriminating power of feature f, the weight of this feature is increased (resp., decreased); otherwise, the weight of the feature is determined by a linear combination of the two weights. Experimental results show that the performance of classifiers such as SVM, KNN and naive Bayes improves significantly when the proposed feature weighting method is applied to contextual features such as bigrams and unigrams. It is also shown that pruning some words from the dataset using the proposed feature weighting method can improve the performance of the text classifier when the feature sets are created using Doc2vec.
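The weighting rule described in the abstract can be illustrated with a short sketch. The Python snippet below assumes W(f) and R(f) are normalized to [0, 1]; the thresholds, boost/damp factors, and mixing coefficient alpha are illustrative placeholders, since the abstract does not specify the exact form of the transformation T or its parameters.

```python
# Minimal sketch of the combined feature-weighting rule described in the abstract.
# The thresholds (tau_hi, tau_lo), the boost/damp factors, and the mixing
# coefficient alpha are assumed values for illustration, not the paper's parameters.

def combined_weight(w_f, r_f, tau_hi=0.8, tau_lo=0.2, alpha=0.5, boost=1.5, damp=0.5):
    """Combine world-knowledge (w_f) and repository-knowledge (r_f) scores of the
    discriminating power of a feature f, both assumed to lie in [0, 1]."""
    base = alpha * w_f + (1 - alpha) * r_f      # linear combination of the two weights
    if w_f >= tau_hi and r_f >= tau_hi:         # both sources agree the feature is strong
        return base * boost                     # increase the weight
    if w_f <= tau_lo and r_f <= tau_lo:         # both sources agree the feature is weak
        return base * damp                      # decrease the weight
    return base                                 # sources disagree: fall back to the blend


if __name__ == "__main__":
    # A bigram both knowledge sources rate as highly discriminating -> boosted weight
    print(combined_weight(0.9, 0.85))
    # A unigram the sources disagree on -> plain linear combination
    print(combined_weight(0.9, 0.1))
```

When the two knowledge sources agree, the rule reinforces their shared judgment; when they disagree, it falls back to a plain linear blend of the two scores, which mirrors the behavior the abstract attributes to the proposed method.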