A Novel Term Weighting Scheme and an Approach for Classification of Agricultural Arabic Text Complaints
Authors: D. S. Guru, Mostafa Ali, M. Suhil
Published in: 2018 IEEE 2nd International Workshop on Arabic and Derived Script Analysis and Recognition (ASAR), 2018-03-01
DOI: 10.1109/ASAR.2018.8480317 (https://doi.org/10.1109/ASAR.2018.8480317)
Citations: 5
Abstract
This paper proposes a machine learning approach for classifying farmers’ complaints, written in Arabic, by crop. The complaints are first preprocessed through stop word removal, automatic word correction, handling of special cases, and stemming, so that only the content terms remain. The special cases handled are domain-specific ones that may affect classification performance. A new term weighting scheme called Term Class Weight-Inverse Class Frequency (TCW-ICF) is then used to extract the most discriminating features with respect to each class. The extracted features are used to represent the preprocessed complaints as feature vectors for training a classifier. Finally, the trained classifier assigns an unlabeled complaint to one of the crop classes. In addition, a relatively large dataset of more than 5000 farmers’ complaints in Arabic script, covering eight different crops, has been created. The proposed approach has been validated through extensive experiments on the newly created dataset using a KNN classifier. It is argued that the proposed approach outperforms the baseline Vector Space Model (VSM). Further, a comparative analysis against four well-known feature selection techniques demonstrates the superiority of the proposed term weighting scheme in selecting the best set of discriminating features. The new term weighting scheme is applied to Arabic script as a case study, but it can be applied to text data in any language.
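The pipeline outlined in the abstract (class-aware term weighting, per-class feature selection, feature vectors, KNN classification) might be sketched as below. The abstract does not state the TCW-ICF formula, so `class_term_scores` uses a stand-in score (a term's share of its class's tokens multiplied by a smoothed inverse class frequency); this is an assumption for illustration, not the authors' scheme. All function names, the toy English tokens, and the use of cosine similarity are likewise hypothetical, and the input is assumed to be already preprocessed into content terms.

```python
import math
from collections import Counter, defaultdict


def class_term_scores(docs, labels):
    """Score each term per class. STAND-IN formula, not the paper's TCW-ICF:
    (share of the class's tokens) * log(1 + n_classes / class_frequency)."""
    class_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        class_counts[label].update(tokens)
    n_classes = len(class_counts)
    # class_freq[t] = number of classes in which term t occurs
    class_freq = Counter()
    for counts in class_counts.values():
        class_freq.update(counts.keys())
    scores = {}
    for label, counts in class_counts.items():
        total = sum(counts.values())
        scores[label] = {
            term: (count / total) * math.log(1 + n_classes / class_freq[term])
            for term, count in counts.items()
        }
    return scores


def select_features(scores, k):
    """Keep the k highest-scoring terms of every class (ties broken by term)."""
    selected = set()
    for term_scores in scores.values():
        ranked = sorted(term_scores, key=lambda t: (-term_scores[t], t))
        selected.update(ranked[:k])
    return sorted(selected)


def vectorize(tokens, features):
    """Represent a document as raw counts over the selected features."""
    counts = Counter(tokens)
    return [counts[f] for f in features]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train_vecs, train_labels, vec, k=3):
    """Majority vote among the k training vectors most similar to vec."""
    ranked = sorted(
        zip(train_vecs, train_labels),
        key=lambda pair: cosine(pair[0], vec),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy, made-up data standing in for preprocessed (stemmed) complaints.
train_docs = [
    ["wheat", "rust", "leaf", "yellow"],
    ["wheat", "grain", "rust"],
    ["tomato", "fruit", "rot", "leaf"],
    ["tomato", "fruit", "worm"],
]
train_labels = ["wheat", "wheat", "tomato", "tomato"]

features = select_features(class_term_scores(train_docs, train_labels), k=2)
train_vecs = [vectorize(doc, features) for doc in train_docs]
query = vectorize(["rust", "wheat", "leaf"], features)
prediction = knn_predict(train_vecs, train_labels, query, k=3)
print(features, prediction)
```

In this sketch a term such as "leaf", which occurs in both classes, receives a lower inverse-class-frequency factor than terms confined to one class, which is the general intuition behind class-based weighting; the actual weights in the paper may differ.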