A Frequency-Category Based Feature Selection in Big Data for Text Classification

Houda Amazal, M. Ramdani, M. Kissi
Published in: Proceedings of the 13th International Conference on Intelligent Systems: Theories and Applications
Publication date: 2020-09-23
DOI: 10.1145/3419604.3419620 (https://doi.org/10.1145/3419604.3419620)
Citations: 0

Abstract

In the big data era, text classification is considered one of the most important application domains of machine learning. However, building an efficient classification algorithm requires feature selection as a fundamental step to reduce dimensionality, achieve better accuracy, and improve execution time. In the literature, most feature ranking techniques are document based. The major weakness of this approach is that it favours terms that occur frequently in the documents and neglects the correlation between terms and categories. In this work, unlike traditional approaches that deal with documents individually, we use the MapReduce paradigm to process the documents of each category as a single document. We then introduce a parallel frequency-category feature selection method, independent of any classifier, to select the most relevant features. Experimental results on the 20-Newsgroups dataset show that our approach improves the classification accuracy to 90.3%. Moreover, the system remains simple and keeps execution time low.
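
The idea of treating each category as a single document can be illustrated with a minimal single-machine sketch in Python. It imitates the map and reduce steps with plain functions and is not the authors' implementation. The abstract does not give the scoring formula, so the score used here (in-category term frequency weighted by the log of how few categories contain the term) is an assumption chosen only for illustration, and the function names and toy documents are hypothetical.

```python
from collections import Counter, defaultdict
from math import log


def map_phase(docs):
    """Emit ((category, term), 1) for every token; docs are (category, text) pairs."""
    for category, text in docs:
        for term in text.lower().split():
            yield (category, term), 1


def reduce_phase(pairs):
    """Sum counts per (category, term), so each category acts as one merged document."""
    counts = defaultdict(Counter)
    for (category, term), n in pairs:
        counts[category][term] += n
    return counts


def score_terms(counts):
    """Assumed score (not from the paper): relative in-category frequency
    times log(number of categories / number of categories containing the term)."""
    n_categories = len(counts)
    category_freq = Counter(term for terms in counts.values() for term in terms)
    scores = defaultdict(dict)
    for category, terms in counts.items():
        total = sum(terms.values())
        for term, n in terms.items():
            scores[category][term] = (n / total) * log(n_categories / category_freq[term])
    return scores


def select_features(scores, k):
    """Keep the k highest-scoring terms of each category (ties broken alphabetically)."""
    selected = set()
    for terms in scores.values():
        ranked = sorted(terms.items(), key=lambda kv: (-kv[1], kv[0]))
        selected.update(term for term, _ in ranked[:k])
    return selected


# Hypothetical toy corpus using two 20-Newsgroups category names.
docs = [
    ("sci.space", "the rocket reached orbit"),
    ("sci.space", "the orbit of the moon"),
    ("rec.autos", "the engine of the car"),
    ("rec.autos", "the car needs a new engine"),
]
counts = reduce_phase(map_phase(docs))
scores = score_terms(counts)
features = select_features(scores, k=2)
```

In this toy run, a term such as "the" appears in every category and therefore scores zero, while terms concentrated in one category rank highest. This is the kind of term-category correlation that, according to the abstract, document-based rankings neglect. In a real MapReduce deployment the two phases would run in parallel across the cluster rather than in one process.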