Feature redundancy removal for text classification using correlated feature subsets

Impact Factor: 1.8 | CAS Quartile 4 (Computer Science) | JCR Q3, Computer Science, Artificial Intelligence
Lazhar Farek, Amira Benaidja
{"title":"Feature redundancy removal for text classification using correlated feature subsets","authors":"Lazhar Farek,&nbsp;Amira Benaidja","doi":"10.1111/coin.12621","DOIUrl":null,"url":null,"abstract":"<p>The curse of high dimensionality in text classification is a worrisome problem that requires efficient and optimal feature selection (FS) methods to improve classification accuracy and reduce learning time. Existing filter-based FS methods evaluate features independently of other related ones, which can then lead to selecting a large number of redundant features, especially in high-dimensional datasets, resulting in more learning time and less classification performance, whereas information theory-based methods aim to maximize feature dependency with the class variable and minimize its redundancy for all selected features, which gradually becomes impractical when increasing the feature space. To overcome the time complexity issue of information theory-based methods while taking into account the redundancy issue, in this article, we propose a new feature selection method for text classification termed correlation-based redundancy removal, which aims to minimize the redundancy using subsets of features having close mutual information scores without sequentially seeking already selected features. The idea is that it is not important to assess the redundancy of a dominant feature having high classification information with another irrelevant feature having low classification information and vice-versa since they are implicitly weakly correlated. Our method, tested on seven datasets using both traditional classifiers (Naive Bayes and support vector machines) and deep learning models (long short-term memory and convolutional neural networks), demonstrated strong performance by reducing redundancy and improving classification compared to ten competitive metrics.</p>","PeriodicalId":55228,"journal":{"name":"Computational Intelligence","volume":"40 1","pages":""},"PeriodicalIF":1.8000,"publicationDate":"2023-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computational Intelligence","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/coin.12621","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

The curse of high dimensionality in text classification is a serious problem that calls for efficient and optimal feature selection (FS) methods to improve classification accuracy and reduce learning time. Existing filter-based FS methods evaluate each feature independently of other related ones, which can lead to selecting a large number of redundant features, especially in high-dimensional datasets, resulting in longer learning times and lower classification performance. Information theory-based methods, in contrast, aim to maximize each feature's dependency on the class variable while minimizing its redundancy with all previously selected features, an approach that gradually becomes impractical as the feature space grows. To overcome the time complexity of information theory-based methods while still addressing redundancy, this article proposes a new feature selection method for text classification, termed correlation-based redundancy removal, which minimizes redundancy within subsets of features having close mutual information scores, without sequentially scanning the already selected features. The idea is that it is unnecessary to assess the redundancy between a dominant feature carrying high classification information and an irrelevant feature carrying low classification information (and vice versa), since the two are implicitly weakly correlated. Tested on seven datasets with both traditional classifiers (naive Bayes and support vector machines) and deep learning models (long short-term memory and convolutional neural networks), the method demonstrated strong performance, reducing redundancy and improving classification compared with ten competing metrics.
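The abstract describes the selection strategy only at a high level, so the following is a minimal sketch of that idea rather than the authors' published algorithm: it assumes dense count features, uses scikit-learn's mutual_info_classif as the relevance score, and uses Pearson correlation as the within-subset redundancy test. The names mi_bucket_select, score_gap, and corr_threshold are illustrative, not from the paper.

```python
# Sketch of correlation-based redundancy removal as described in the
# abstract: rank features by mutual information (MI) with the class,
# group features whose MI scores are close into subsets, and test
# pairwise redundancy only inside each subset instead of against
# every already-selected feature.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def mi_bucket_select(X, y, score_gap=0.01, corr_threshold=0.8):
    """Return indices of selected features. X is a dense
    (n_samples, n_features) array of discrete term counts."""
    mi = mutual_info_classif(X, y, discrete_features=True)
    order = np.argsort(mi)[::-1]          # highest relevance first

    selected, bucket = [], []             # bucket = close-MI subset
    for idx in order:
        if bucket and mi[bucket[0]] - mi[idx] > score_gap:
            # MI gap exceeded: prune the finished subset, start a new one.
            selected.extend(_prune(X, bucket, corr_threshold))
            bucket = []
        bucket.append(idx)
    selected.extend(_prune(X, bucket, corr_threshold))
    return sorted(selected)

def _prune(X, bucket, corr_threshold):
    """Within one subset, drop a feature if it is strongly correlated
    with a feature already kept from the same subset."""
    kept = []
    for idx in bucket:
        redundant = any(
            abs(np.corrcoef(X[:, idx], X[:, k])[0, 1]) > corr_threshold
            for k in kept
        )
        if not redundant:
            kept.append(idx)
    return kept
```

Because redundancy is checked only inside each small subset of features with comparable MI scores, the number of pairwise comparisons stays far below the cost of comparing every candidate against all previously selected features, which is the complexity saving the abstract claims.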

Source Journal: Computational Intelligence
CAS category: Engineering & Technology - Computer Science: Artificial Intelligence
CiteScore: 6.90
Self-citation rate: 3.60%
Annual publications: 65
Review time: >12 weeks
Journal Description: This leading international journal promotes and stimulates research in the field of artificial intelligence (AI). Covering a wide range of issues - from the tools and languages of AI to its philosophical implications - Computational Intelligence provides a vigorous forum for the publication of both experimental and theoretical research, as well as surveys and impact studies. The journal is designed to meet the needs of a wide range of AI workers in academic and industrial research.