A Novel Term Weighting Scheme and an Approach for Classification of Agricultural Arabic Text Complaints

2018 IEEE 2nd International Workshop on Arabic and Derived Script Analysis and Recognition (ASAR). Pub Date: 2018-03-01. DOI: 10.1109/ASAR.2018.8480317

D. S. Guru, Mostafa Ali, M. Suhil


Citations: 5

Abstract

This paper proposes a machine learning based approach for classifying farmers' complaints, written in Arabic text, into different crops. Initially, the complaints are preprocessed using stop word removal, auto-correction of words, and stemming so that only the content terms are retained. Some domain-specific special cases that may affect classification performance are also handled at this stage. A new term weighting scheme called Term Class Weight-Inverse Class Frequency (TCW-ICF) is then used to extract the most discriminating features with respect to each class. The extracted features are used to represent the preprocessed complaints as feature vectors for training a classifier. Finally, an unlabeled complaint is assigned to one of the crop classes by the trained classifier. In addition, a relatively large dataset has been created, consisting of more than 5000 farmers' complaints written in Arabic script and covering eight different crops. The proposed approach has been validated through extensive experiments on the newly created dataset using a KNN classifier. It is argued that the proposed approach outperforms the baseline Vector Space Model (VSM). Further, the superiority of the proposed term weighting scheme in selecting the best set of discriminating features has been demonstrated through a comparative analysis against four well-known feature selection techniques. The new term weighting scheme is applied to Arabic script as a case study, but it can be applied to text data in any language.
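
The pipeline described in the abstract (preprocess, weight terms per class, select features, vectorize, classify with KNN) can be illustrated with a minimal Python sketch. The abstract does not give the TCW-ICF formula, so the weight used here is a stand-in assumption: a term's share of its class's tokens multiplied by a log-scaled inverse class frequency. The stop-word list, the English toy complaints, the function names, and the use of cosine similarity are likewise hypothetical; stemming, auto-correction, and the domain-specific special cases are omitted.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stop-word list; the paper's Arabic list is not given in the abstract.
STOP_WORDS = {"the", "on", "of", "my", "are", "is", "in"}


def preprocess(text):
    """Lowercase, split on whitespace, and drop stop words (no stemming here)."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


def class_term_weights(docs, labels):
    """Stand-in class-aware weight, NOT the paper's TCW-ICF definition:
    (term count in class / total tokens in class) * log(1 + C / class_freq)."""
    class_tf = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        class_tf[label].update(tokens)
    n_classes = len(class_tf)
    class_freq = Counter()  # number of classes in which each term occurs
    for counts in class_tf.values():
        class_freq.update(counts.keys())
    weights = {}
    for label, counts in class_tf.items():
        total = sum(counts.values())
        weights[label] = {
            term: (count / total) * math.log(1 + n_classes / class_freq[term])
            for term, count in counts.items()
        }
    return weights


def select_features(weights, k):
    """Keep the k highest-weighted terms of each class (ties broken alphabetically)."""
    selected = set()
    for per_class in weights.values():
        ranked = sorted(per_class, key=lambda t: (-per_class[t], t))
        selected.update(ranked[:k])
    return sorted(selected)


def vectorize(tokens, features):
    """Represent a complaint as raw counts over the selected features."""
    counts = Counter(tokens)
    return [counts[f] for f in features]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(query, train_vectors, train_labels, k=3):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(range(len(train_vectors)),
                    key=lambda i: -cosine(query, train_vectors[i]))
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical English stand-ins for Arabic complaints).
train_texts = [
    "wheat leaves are yellow", "rust spots on wheat leaves", "wheat grain is small",
    "tomato fruit is cracked", "worms in tomato fruit", "tomato leaves curl",
]
train_labels = ["wheat", "wheat", "wheat", "tomato", "tomato", "tomato"]

train_docs = [preprocess(t) for t in train_texts]
features = select_features(class_term_weights(train_docs, train_labels), k=3)
train_vectors = [vectorize(d, features) for d in train_docs]

query = vectorize(preprocess("yellow spots on wheat"), features)
print(knn_predict(query, train_vectors, train_labels))
```

In this sketch, a term such as "leaves" that occurs in both classes receives a lower inverse class frequency than a term confined to one class, which is the general intuition behind class-based weighting. The paper's actual scheme and experimental settings may differ.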

