Efficient Feature Representation Based on the Effect of Words Frequency for Arabic Documents Classification

Yousif A. Alhaj, W. U. Wickramaarachchi, Aamir Hussain, M. A. Al-qaness, Hammam M. Abdelaal
Published in: Proceedings of the 2nd International Conference on Telecommunications and Communication Engineering
Publication date: 2018-11-28
DOI: 10.1145/3291842.3291900 (https://doi.org/10.1145/3291842.3291900)
Citations: 9

Abstract

This paper studies how word frequency affects the classification of Arabic documents, specifically its effect on two feature representations: Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF). Three classifiers are compared: Naive Bayes (NB), k-Nearest Neighbor (KNN), and Support Vector Machine (SVM). Chi-square is used as the feature selection function to keep essential features and remove unnecessary ones. The experiments classify Arabic documents from a public dataset collected from Arabic websites, the CNN Arabic Corpus. K-fold cross-validation is used to validate the classifiers, and Micro-F1 is used to measure their performance. The results show that the SVM classifier outperformed the KNN and NB classifiers with the TF-IDF representation, while the NB classifier outperformed the KNN and SVM classifiers with the BoW representation. The SVM and NB classifiers attained Micro-F1 scores of 94.38% and 93.47%, respectively, with word elimination.
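The quantities named in the abstract (BoW counts, TF-IDF weights, the chi-square score used for feature selection, and Micro-F1) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy English tokens standing in for Arabic text, the whitespace tokenizer, and the `tf * log(N / df)` weighting variant are all assumptions made for illustration. The classifiers and K-fold splitting are left out.

```python
import math
from collections import Counter

# Toy corpus (assumption): placeholder English tokens stand in for Arabic documents.
DOCS = [
    "goal match team",
    "match team win",
    "market bank stock",
    "bank stock price",
]
LABELS = ["sport", "sport", "economy", "economy"]


def bag_of_words(docs):
    """BoW: raw term counts per whitespace-tokenized document."""
    return [Counter(doc.split()) for doc in docs]


def tf_idf(docs):
    """TF-IDF as tf * log(N / df); other weighting variants exist."""
    bows = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for bow in bows for term in bow)
    return [
        {term: count * math.log(n_docs / df[term]) for term, count in bow.items()}
        for bow in bows
    ]


def chi_square(docs, labels, term, cls):
    """Chi-square score of a (term, class) pair from a 2x2 contingency table."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        has_term = term in doc.split()
        in_cls = label == cls
        if has_term and in_cls:
            a += 1  # term present, document in class
        elif has_term:
            b += 1  # term present, document in another class
        elif in_cls:
            c += 1  # term absent, document in class
        else:
            d += 1  # term absent, document in another class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def micro_f1(y_true, y_pred):
    """Micro-F1: pool TP, FP, and FN over all classes, then compute F1 once."""
    tp = fp = fn = 0
    for cls in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == cls and t == cls:
                tp += 1
            elif p == cls:
                fp += 1
            elif t == cls:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Rank the vocabulary by chi-square score for the "sport" class.
    vocab = sorted({term for doc in DOCS for term in doc.split()})
    scores = {term: chi_square(DOCS, LABELS, term, "sport") for term in vocab}
    for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{term}: {score:.3f}")
```

In a selection step of this kind, terms with the highest chi-square scores would be kept and the rest dropped before training a classifier. For single-label classification, Micro-F1 computed this way equals plain accuracy, since every misclassified document counts once as a false positive and once as a false negative.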