Semantic similarity-aware feature selection and redundancy removal for text classification using joint mutual information

IF 2.5 · CAS Zone 4 (Computer Science) · JCR Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Farek Lazhar, Benaidja Amira
{"title":"Semantic similarity-aware feature selection and redundancy removal for text classification using joint mutual information","authors":"Farek Lazhar, Benaidja Amira","doi":"10.1007/s10115-024-02143-1","DOIUrl":null,"url":null,"abstract":"<p>The high dimensionality of text data is a challenging issue that requires efficient methods to reduce vector space and improve classification accuracy. Existing filter-based methods fail to address the redundancy issue, resulting in the selection of irrelevant and redundant features. Information theory-based methods effectively solve this problem but are not practical for large amounts of data due to their high time complexity. The proposed method, termed semantic similarity-aware feature selection and redundancy removal (SS-FSRR), employs joint mutual information between the pairs of semantically related terms and the class label to capture redundant features. It is predicated on the assumption that semantically related terms imply potentially redundant ones, which can significantly reduce execution time by avoiding sequential search strategies. In this work, we use Word2Vec’s CBOW model to obtain semantic similarity between terms. The efficiency of the SS-FSRR is compared to six state-of-the-art competitive selection methods for categorical data using two traditional classifiers (SVM and NB) and a robust deep learning model (LSTM) on seven datasets with 10-fold cross-validation, where experimental results show that the SS-FSRR outperforms the other methods on most tested datasets with high stability as measured by the Jaccard’s Index.</p>","PeriodicalId":54749,"journal":{"name":"Knowledge and Information Systems","volume":null,"pages":null},"PeriodicalIF":2.5000,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knowledge and Information Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10115-024-02143-1","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

The high dimensionality of text data is a challenging issue that requires efficient methods to reduce vector space and improve classification accuracy. Existing filter-based methods fail to address the redundancy issue, resulting in the selection of irrelevant and redundant features. Information theory-based methods effectively solve this problem but are not practical for large amounts of data due to their high time complexity. The proposed method, termed semantic similarity-aware feature selection and redundancy removal (SS-FSRR), employs joint mutual information between the pairs of semantically related terms and the class label to capture redundant features. It is predicated on the assumption that semantically related terms imply potentially redundant ones, which can significantly reduce execution time by avoiding sequential search strategies. In this work, we use Word2Vec’s CBOW model to obtain semantic similarity between terms. The efficiency of the SS-FSRR is compared to six state-of-the-art competitive selection methods for categorical data using two traditional classifiers (SVM and NB) and a robust deep learning model (LSTM) on seven datasets with 10-fold cross-validation, where experimental results show that the SS-FSRR outperforms the other methods on most tested datasets with high stability as measured by the Jaccard’s Index.
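The abstract describes the method only at a high level. The following is a minimal, illustrative Python sketch (not the authors' implementation) of the core idea: Word2Vec's CBOW model flags semantically similar term pairs as candidate redundancies, and joint mutual information with the class label decides which term of a candidate pair to drop. The document-term matrix `X_bin`, the `sim_threshold` and `eps` parameters, and the tie-breaking rule are assumptions made for illustration.

```python
# Illustrative sketch only -- not the authors' code. Assumes a binarized
# document-term matrix X_bin (n_docs x n_terms), the vocabulary `terms`,
# a tokenized corpus `sentences`, and class labels `y`.
import numpy as np
from gensim.models import Word2Vec
from sklearn.metrics import mutual_info_score


def joint_mutual_info(x_i, x_j, y):
    """I((X_i, X_j); Y): mutual information between the joint variable
    formed by two term indicators and the class label."""
    joint = np.array([f"{a}_{b}" for a, b in zip(x_i, x_j)])
    return mutual_info_score(joint, y)


def ss_fsrr_sketch(X_bin, terms, sentences, y, sim_threshold=0.8, eps=1e-3):
    # CBOW Word2Vec (sg=0) provides the semantic similarity between terms.
    w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
    selected = set(range(len(terms)))
    for i in range(len(terms)):
        for j in range(i + 1, len(terms)):
            if i not in selected or j not in selected:
                continue
            if terms[i] not in w2v.wv or terms[j] not in w2v.wv:
                continue
            # Only semantically close pairs are treated as redundancy candidates.
            if w2v.wv.similarity(terms[i], terms[j]) < sim_threshold:
                continue
            jmi = joint_mutual_info(X_bin[:, i], X_bin[:, j], y)
            mi_i = mutual_info_score(X_bin[:, i], y)
            mi_j = mutual_info_score(X_bin[:, j], y)
            # If the pair jointly adds almost nothing over the stronger term
            # alone, drop the weaker term as redundant (illustrative rule).
            if jmi - max(mi_i, mi_j) < eps:
                selected.discard(j if mi_i >= mi_j else i)
    return sorted(selected)
```

The published SS-FSRR procedure may differ in how candidate pairs are generated and how the redundant term of a pair is chosen; this sketch only illustrates the joint-mutual-information idea stated in the abstract.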


Source journal: Knowledge and Information Systems
Category: Engineering & Technology – Computer Science: Artificial Intelligence
CiteScore: 5.70
Self-citation rate: 7.40%
Articles published: 152
Review time: 7.2 months
Journal description: Knowledge and Information Systems (KAIS) provides an international forum for researchers and professionals to share their knowledge and report new advances on all topics related to knowledge systems and advanced information systems. This monthly peer-reviewed archival journal publishes state-of-the-art research reports on emerging topics in KAIS, reviews of important techniques in related areas, and application papers of interest to a general readership.