基于频率分类的大数据文本分类特征选择

Proceedings of the 13th International Conference on Intelligent Systems: Theories and Applications Pub Date : 2020-09-23 DOI:10.1145/3419604.3419620

Houda Amazal, M. Ramdani, M. Kissi

{"title":"基于频率分类的大数据文本分类特征选择","authors":"Houda Amazal, M. Ramdani, M. Kissi","doi":"10.1145/3419604.3419620","DOIUrl":null,"url":null,"abstract":"In big data era, text classification is considered as one of the most important machine learning application domain. However, to build an efficient algorithm for classification, feature selection is a fundamental step to reduce dimensionality, achieve better accuracy and improve time execution. In the literature, most of the feature ranking techniques are document based. The major weakness of this approach is that it favours the terms occurring frequently in the documents and neglects the correlation between the terms and the categories. In this work, unlike the traditional approaches which deal with documents individually, we use mapreduce paradigm to process the documents of each category as a single document. Then, we introduce a parallel frequency-category feature selection method independently of any classifier to select the most relevant features. Experimental results on the 20-Newsgroups dataset showed that our approach improves the classification accuracy to 90.3%. Moreover, the system maintains the simplicity and lower execution time.","PeriodicalId":250715,"journal":{"name":"Proceedings of the 13th International Conference on Intelligent Systems: Theories and Applications","volume":"49 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Frequency-Category Based Feature Selection in Big Data for Text Classification\",\"authors\":\"Houda Amazal, M. Ramdani, M. Kissi\",\"doi\":\"10.1145/3419604.3419620\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In big data era, text classification is considered as one of the most important machine learning application domain. However, to build an efficient algorithm for classification, feature selection is a fundamental step to reduce dimensionality, achieve better accuracy and improve time execution. In the literature, most of the feature ranking techniques are document based. The major weakness of this approach is that it favours the terms occurring frequently in the documents and neglects the correlation between the terms and the categories. In this work, unlike the traditional approaches which deal with documents individually, we use mapreduce paradigm to process the documents of each category as a single document. Then, we introduce a parallel frequency-category feature selection method independently of any classifier to select the most relevant features. Experimental results on the 20-Newsgroups dataset showed that our approach improves the classification accuracy to 90.3%. Moreover, the system maintains the simplicity and lower execution time.\",\"PeriodicalId\":250715,\"journal\":{\"name\":\"Proceedings of the 13th International Conference on Intelligent Systems: Theories and Applications\",\"volume\":\"49 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-09-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 13th International Conference on Intelligent Systems: Theories and Applications\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3419604.3419620\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 13th International Conference on Intelligent Systems: Theories and Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3419604.3419620","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

在大数据时代，文本分类被认为是机器学习最重要的应用领域之一。然而，要构建高效的分类算法，特征选择是降低维数、提高准确率和提高执行时间的基本步骤。在文献中，大多数特征排序技术都是基于文档的。这种方法的主要缺点是，它偏爱在文件中经常出现的术语，而忽略了术语与类别之间的相关性。在这项工作中，与传统的单独处理文档的方法不同，我们使用mapreduce范式将每个类别的文档作为单个文档进行处理。然后，我们引入了一种独立于任何分类器的并行频率-类别特征选择方法来选择最相关的特征。在20个新闻组数据集上的实验结果表明，我们的方法将分类准确率提高到90.3%。并且保持了系统的简单性和较低的执行时间。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

A Frequency-Category Based Feature Selection in Big Data for Text Classification

In big data era, text classification is considered as one of the most important machine learning application domain. However, to build an efficient algorithm for classification, feature selection is a fundamental step to reduce dimensionality, achieve better accuracy and improve time execution. In the literature, most of the feature ranking techniques are document based. The major weakness of this approach is that it favours the terms occurring frequently in the documents and neglects the correlation between the terms and the categories. In this work, unlike the traditional approaches which deal with documents individually, we use mapreduce paradigm to process the documents of each category as a single document. Then, we introduce a parallel frequency-category feature selection method independently of any classifier to select the most relevant features. Experimental results on the 20-Newsgroups dataset showed that our approach improves the classification accuracy to 90.3%. Moreover, the system maintains the simplicity and lower execution time.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 13th International Conference on Intelligent Systems: Theories and Applications

自引率

0.00%

发文量