A hybrid TwinSVM-HHO model for multilingual spam review detection using sentiment features and pre-trained embeddings

IF 7.5 · Zone 1, Computer Science · Q1 Computer Science, Artificial Intelligence

Expert Systems with Applications Pub Date : 2025-05-15 DOI:10.1016/j.eswa.2025.128160

Ala’ M. Al-Zoubi, Antonio M. Mora, Hossam Faris, Raneem Qaddoura

Volume 287, Article 128160. https://www.sciencedirect.com/science/article/pii/S0957417425017804

Citations: 0

Abstract

Detecting spam reviews in multilingual environments remains challenging because of linguistic diversity, data imbalance, and semantic complexity. This paper proposes a hybrid model that integrates Twin Support Vector Machine (TwinSVM) with Harris Hawks Optimization (HHO) for simultaneous parameter optimization and feature selection. To enhance semantic understanding, sentiment-based features are incorporated alongside pre-trained embedding models (BERT, FastText, and MUSE) across English, Arabic, and Spanish datasets. The approach generates 24 datasets using embeddings with 100 and 400 dimensions, including a combined multilingual set. Experimental results show that the proposed HHO-TwinSVM model consistently outperforms conventional classifiers and metaheuristic-enhanced SVMs, achieving accuracy improvements of up to 9.44% and enhanced robustness in low-resource languages. The integrated framework represents a scalable and adaptable solution for multilingual spam detection. Four experiments were conducted, each designed to demonstrate a specific aspect of the proposed approach. Across all experiments, the method outperformed existing algorithms, achieving accuracy rates of 92.9741%, 89.0314%, 80.3580%, and 85.0859% on the Arabic, English, Spanish, and multilingual datasets, respectively. Sentiment analysis features were then incorporated to further enhance detection performance, resulting in improvements of 1.0994%, 2.6674%, 9.4430%, and 8.7448%, respectively. A comprehensive analysis of the experimental results, including the influence of reviews and sentiment features, is also presented.
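The abstract does not spell out how a single optimizer handles parameter optimization and feature selection at once. One common way to do this in wrapper-style metaheuristics is to let each candidate solution be a vector in [0, 1] whose first entries decode to hyperparameters and whose remaining entries decode to a feature mask. The sketch below illustrates that encoding only. Everything in it is an assumption rather than the authors' method: the two hyperparameters (`c`, `gamma`), their log2 ranges, the 0.5 threshold, the `alpha` weighting, and `toy_error`, which stands in for the cross-validated error of a real classifier such as TwinSVM. A seeded random search stands in for HHO, which is not implemented here.

```python
import random
from typing import Callable, List, Sequence, Tuple

Mask = List[bool]
ErrorFn = Callable[[float, float, Mask], float]


def decode(position: Sequence[float], n_features: int,
           log2_c=(-5.0, 15.0), log2_g=(-15.0, 3.0)) -> Tuple[float, float, Mask]:
    """Split one candidate in [0, 1]^(2 + n_features) into (c, gamma, mask).

    The log2 ranges are assumed search bounds, not values from the paper.
    """
    c = 2.0 ** (log2_c[0] + position[0] * (log2_c[1] - log2_c[0]))
    gamma = 2.0 ** (log2_g[0] + position[1] * (log2_g[1] - log2_g[0]))
    genes = list(position[2:2 + n_features])
    mask = [g > 0.5 for g in genes]  # a gene above 0.5 selects that feature
    if not any(mask):
        # Never hand the classifier an empty feature set: keep the strongest gene.
        mask[genes.index(max(genes))] = True
    return c, gamma, mask


def fitness(position: Sequence[float], n_features: int,
            error_fn: ErrorFn, alpha: float = 0.99) -> float:
    """Lower is better: mostly classification error, plus a small size penalty."""
    c, gamma, mask = decode(position, n_features)
    return alpha * error_fn(c, gamma, mask) + (1.0 - alpha) * sum(mask) / n_features


def random_search(n_features: int, error_fn: ErrorFn,
                  iterations: int = 100, seed: int = 0) -> Tuple[List[float], float]:
    """Stand-in optimizer (NOT Harris Hawks Optimization): sample and keep the best."""
    rng = random.Random(seed)
    best_pos: List[float] = []
    best_fit = float("inf")
    for _ in range(iterations):
        pos = [rng.random() for _ in range(2 + n_features)]
        fit = fitness(pos, n_features, error_fn)
        if fit < best_fit:
            best_pos, best_fit = pos, fit
    return best_pos, best_fit


def toy_error(c: float, gamma: float, mask: Mask) -> float:
    """Hypothetical error: features 0 and 1 help, every other feature adds noise."""
    return 0.5 - 0.2 * mask[0] - 0.2 * mask[1] + 0.05 * sum(mask[2:])


if __name__ == "__main__":
    position, score = random_search(n_features=4, error_fn=toy_error)
    print(decode(position, 4), score)
```

In a real pipeline, `error_fn` would train the classifier on the masked columns of the embedding-plus-sentiment feature matrix and return its validation error, and the random sampler would be replaced by the optimizer's own position-update rules.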




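Source journal

Expert Systems with Applications (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 13.80
Self-citation rate: 10.60%
Articles published: 2045
Review time: 8.7 months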

About the journal: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.