An enhanced algorithm for semantic-based feature reduction in spam filtering

IF 4.3 3区材料科学 Q1 ENGINEERING, ELECTRICAL & ELECTRONIC

ACS Applied Electronic Materials Pub Date : 2024-07-31 DOI:10.7717/peerj-cs.2206

María Novo-Lourés, Reyes Pavón, Rosalía Laza, José R. Méndez, David Ruano-Ordás

{"title":"An enhanced algorithm for semantic-based feature reduction in spam filtering","authors":"María Novo-Lourés, Reyes Pavón, Rosalía Laza, José R. Méndez, David Ruano-Ordás","doi":"10.7717/peerj-cs.2206","DOIUrl":null,"url":null,"abstract":"With the advent and improvement of ontological dictionaries (WordNet, Babelnet), the use of synsets-based text representations is gaining popularity in classification tasks. More recently, ontological dictionaries were used for reducing dimensionality in this kind of representation (e.g., Semantic Dimensionality Reduction System (SDRS) (Vélez de Mendizabal et al., 2020)). These approaches are based on the combination of semantically related columns by taking advantage of semantic information extracted from ontological dictionaries. Their main advantage is that they not only eliminate features but can also combine them, minimizing (low-loss) or avoiding (lossless) the loss of information. The most recent (and accurate) techniques included in this group are based on using evolutionary algorithms to find how many features can be grouped to reduce false positive (FP) and false negative (FN) errors obtained. The main limitation of these evolutionary-based schemes is the computational requirements derived from the use of optimization algorithms. The contribution of this study is a new lossless feature reduction scheme exploiting information from ontological dictionaries, which achieves slightly better accuracy (specially in FP errors) than optimization-based approaches but using far fewer computational resources. Instead of using computationally expensive evolutionary algorithms, our proposal determines whether two columns (synsets) can be combined by observing whether the instances included in a dataset (e.g., training dataset) containing these synsets are mostly of the same class. The study includes experiments using three datasets and a detailed comparison with two previous optimization-based approaches.","PeriodicalId":3,"journal":{"name":"ACS Applied Electronic Materials","volume":null,"pages":null},"PeriodicalIF":4.3000,"publicationDate":"2024-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACS Applied Electronic Materials","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.7717/peerj-cs.2206","RegionNum":3,"RegionCategory":"材料科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}

引用次数: 0

Abstract

With the advent and improvement of ontological dictionaries (WordNet, Babelnet), the use of synsets-based text representations is gaining popularity in classification tasks. More recently, ontological dictionaries were used for reducing dimensionality in this kind of representation (e.g., Semantic Dimensionality Reduction System (SDRS) (Vélez de Mendizabal et al., 2020)). These approaches are based on the combination of semantically related columns by taking advantage of semantic information extracted from ontological dictionaries. Their main advantage is that they not only eliminate features but can also combine them, minimizing (low-loss) or avoiding (lossless) the loss of information. The most recent (and accurate) techniques included in this group are based on using evolutionary algorithms to find how many features can be grouped to reduce false positive (FP) and false negative (FN) errors obtained. The main limitation of these evolutionary-based schemes is the computational requirements derived from the use of optimization algorithms. The contribution of this study is a new lossless feature reduction scheme exploiting information from ontological dictionaries, which achieves slightly better accuracy (specially in FP errors) than optimization-based approaches but using far fewer computational resources. Instead of using computationally expensive evolutionary algorithms, our proposal determines whether two columns (synsets) can be combined by observing whether the instances included in a dataset (e.g., training dataset) containing these synsets are mostly of the same class. The study includes experiments using three datasets and a detailed comparison with two previous optimization-based approaches.

查看原文本刊更多论文

垃圾邮件过滤中基于语义减少特征的增强算法

随着本体字典（WordNet、Babelnet）的出现和改进，基于同义词集的文本表示法在分类任务中越来越受欢迎。最近，本体词典被用于降低此类表示的维度（例如语义维度降低系统（SDRS）（Vélez de Mendizabal 等人，2020 年））。这些方法的基础是利用从本体字典中提取的语义信息，将语义相关的列进行组合。它们的主要优势在于不仅能消除特征，还能组合特征，最大限度地减少（低损耗）或避免（无损耗）信息损失。这类技术中最新（也是最准确）的技术是基于进化算法，找出有多少特征可以通过分组来减少假阳性（FP）和假阴性（FN）误差。这些基于进化算法的方案的主要局限性在于使用优化算法所带来的计算要求。本研究的贡献在于利用本体字典中的信息，提出了一种新的无损特征缩减方案，与基于优化的方法相比，该方案的准确率（尤其是在 FP 错误方面）略高，但使用的计算资源却要少得多。我们的方案不使用计算成本高昂的进化算法，而是通过观察包含这些同义词集的数据集（如训练数据集）中的实例是否大多属于同一类别，来确定是否可以合并两列（同义词集）。研究包括使用三个数据集进行的实验，以及与之前两种基于优化的方法进行的详细比较。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

ACS Applied Electronic Materials Multiple-

CiteScore

7.20

自引率

4.30%

发文量

567