{"title":"Contextual Feature Weighting Using Knowledge beyond the Repository Knowledge","authors":"Kazem Qazanfari, Abdou Youssef","doi":"10.17706/IJCCE.2018.7.3.45-57","DOIUrl":null,"url":null,"abstract":"Bag of words, bigram, or more complex combinations of words are the most among general and widely used features in text classification. However, in almost all real-world text classification problems, the distribution of the available training dataset for each class often does not match the real distribution of the class concept, which reduces the accuracy of the classifiers. Let W(f) and R(f) be the discriminating power of feature f based on the world knowledge and the repository knowledge, respectively. In an ideal situation, W(f) = R(f) is desirable; however, in most situations, W(f) and R(f) are not equal and sometimes they are quite different, because the repository knowledge and the world knowledge do not have the same statistics about the discriminating power of feature f. In this paper, this phenomenon is called inadequacy of knowledge and we show how this phenomenon could reduce the performance of the text classifiers. To solve this issue, a novel feature weighting method is proposed which combines two bodies of knowledge, world knowledge and repository knowledge, using a particular transformation T. In this method, if both the world knowledge and the repository knowledge indicate a significantly high (resp., low) discriminating power of feature f, the weight of this feature is increased (resp., decreased); otherwise, the weight of the feature will be determined by a linear combination of the two weights. Experimental results show that the performance of classifiers like SVM, KNN and Bayes improves significantly if the proposed feature weighting method is applied on the contextual features such as bigram and unigram. It is shown also that pruning some words from the dataset using the proposed feature weighting method could improve the performance of the text classifier when the feature sets are created using Doc2vec.","PeriodicalId":23787,"journal":{"name":"World Academy of Science, Engineering and Technology, International Journal of Electrical, Computer, Energetic, Electronic and Communication Engineering","volume":"32 1","pages":"47-57"},"PeriodicalIF":0.0000,"publicationDate":"2018-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"World Academy of Science, Engineering and Technology, International Journal of Electrical, Computer, Energetic, Electronic and Communication Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.17706/IJCCE.2018.7.3.45-57","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Bags of words, bigrams, and more complex combinations of words are among the most general and widely used features in text classification. However, in almost all real-world text classification problems, the distribution of the available training dataset for each class does not match the real distribution of the class concept, which reduces the accuracy of the classifiers. Let W(f) and R(f) be the discriminating power of feature f based on the world knowledge and the repository knowledge, respectively. Ideally, W(f) = R(f); however, in most situations W(f) and R(f) are not equal, and sometimes they are quite different, because the repository knowledge and the world knowledge do not have the same statistics about the discriminating power of feature f. In this paper, we call this phenomenon inadequacy of knowledge and show how it can reduce the performance of text classifiers. To address this issue, a novel feature weighting method is proposed which combines the two bodies of knowledge, world knowledge and repository knowledge, using a particular transformation T. In this method, if both the world knowledge and the repository knowledge indicate a significantly high (resp., low) discriminating power of feature f, the weight of this feature is increased (resp., decreased); otherwise, the weight of the feature is determined by a linear combination of the two weights. Experimental results show that the performance of classifiers such as SVM, KNN and naive Bayes improves significantly when the proposed feature weighting method is applied to contextual features such as bigrams and unigrams. It is also shown that pruning some words from the dataset using the proposed feature weighting method can improve the performance of the text classifier when the feature sets are created using Doc2vec.
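The weighting rule described in the abstract can be illustrated with a short sketch. The Python snippet below assumes W(f) and R(f) are normalized to [0, 1]; the thresholds, boost/damp factors, and mixing coefficient alpha are illustrative placeholders, since the abstract does not specify the exact form of the transformation T or its parameters.

```python
# Minimal sketch of the combined feature-weighting rule described in the abstract.
# The thresholds (tau_hi, tau_lo), the boost/damp factors, and the mixing
# coefficient alpha are assumed values for illustration, not the paper's parameters.

def combined_weight(w_f, r_f, tau_hi=0.8, tau_lo=0.2, alpha=0.5, boost=1.5, damp=0.5):
    """Combine world-knowledge (w_f) and repository-knowledge (r_f) scores of the
    discriminating power of a feature f, both assumed to lie in [0, 1]."""
    base = alpha * w_f + (1 - alpha) * r_f      # linear combination of the two weights
    if w_f >= tau_hi and r_f >= tau_hi:         # both sources agree the feature is strong
        return base * boost                     # increase the weight
    if w_f <= tau_lo and r_f <= tau_lo:         # both sources agree the feature is weak
        return base * damp                      # decrease the weight
    return base                                 # sources disagree: fall back to the blend


if __name__ == "__main__":
    # A bigram both knowledge sources rate as highly discriminating -> boosted weight
    print(combined_weight(0.9, 0.85))
    # A unigram the sources disagree on -> plain linear combination
    print(combined_weight(0.9, 0.1))
```

When the two knowledge sources agree, the rule reinforces their shared judgment; when they disagree, it falls back to a plain linear blend of the two scores, which mirrors the behavior the abstract attributes to the proposed method.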