Efficient Feature Representation Based on the Effect of Words Frequency for Arabic Documents Classification

Proceedings of the 2nd International Conference on Telecommunications and Communication Engineering. Pub Date: 2018-11-28. DOI: 10.1145/3291842.3291900

Yousif A. Alhaj, W. U. Wickramaarachchi, Aamir Hussain, M. A. Al-qaness, Hammam M. Abdelaal


Citations: 9

Abstract

This paper studies how word frequency influences the classification of Arabic documents, specifically its effect on two feature representations: Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF). Three classification techniques are evaluated: Naive Bayes (NB), k-Nearest Neighbor (KNN), and Support Vector Machine (SVM). The chi-square statistic is used as the feature selection function to retain essential features and remove unnecessary ones. Experiments are conducted on publicly available Arabic documents collected from Arabic websites, namely the CNN Arabic Corpus, to study classification performance. K-fold cross-validation is used to validate the classifiers, and Micro-F1 is used to evaluate them. The results show that the SVM classifier outperformed the KNN and NB classifiers with the TF-IDF representation, and that the NB classifier outperformed the KNN and SVM classifiers with the BoW representation. The SVM and NB classifiers attained Micro-F1 scores of 94.38% and 93.47%, respectively.
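The difference between the two representations, and the Micro-F1 metric used for evaluation, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy tokens, the function names, and the specific TF-IDF variant (raw count times log(N / df)) are assumptions made for illustration, and chi-square selection and the classifiers themselves are omitted.

```python
import math
from collections import Counter

def bag_of_words(docs):
    """Return one raw term-count Counter per tokenized document (BoW)."""
    return [Counter(doc) for doc in docs]

def tf_idf(docs):
    """Weight each raw count by log(N / df); one common TF-IDF variant (assumed)."""
    n_docs = len(docs)
    counts = bag_of_words(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for c in counts for term in c)
    return [{t: tf * math.log(n_docs / df[t]) for t, tf in c.items()}
            for c in counts]

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            tp += (t == label and p == label)
            fp += (t != label and p == label)
            fn += (t == label and p != label)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Hypothetical, already-tokenized toy documents (placeholders, not corpus data).
docs = [
    ["sport", "match", "sport"],
    ["economy", "market"],
    ["sport", "market"],
]
bow = bag_of_words(docs)
weights = tf_idf(docs)
```

In the first toy document, BoW ranks "sport" (count 2) above "match" (count 1), while TF-IDF ranks "match" higher because "sport" also appears in another document. For single-label multiclass predictions, Micro-F1 as defined here coincides with accuracy.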
