An optimal document clustering method using hybrid optimal feature selection and efficient soft computing classifier

IF 7.5, Zone 1 (Computer Science), Q1 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)
Perumal Pitchandi, R. Kingsy Grace
Journal: Expert Systems with Applications, volume 294, Article 128762
Published: 2025-06-25
DOI: 10.1016/j.eswa.2025.128762
Source: https://www.sciencedirect.com/science/article/pii/S0957417425023802
Citations: 0

Abstract

An optimal document clustering method using hybrid optimal feature selection and efficient soft computing classifier
Document grouping is an important area of text extraction, commonly used for document organization, browsing, abstraction, and categorization. It is a key process in data recovery, data processing, and document management. Several document grouping methods have recently been proposed to improve system performance, but they face serious challenges. The main problem is choosing appropriate document features and similar tools. Moreover, because of their high computational cost and memory usage, these methods are not suitable for the large numbers of documents that must be processed daily. This paper presents an optimal document clustering method based on hybrid optimal feature selection and an efficient soft computing classifier. The proposed method consists of a three-tier process. First, we introduce a fuzzy density fruit fly optimization (FD-FFO) algorithm for data pre-processing, which removes unwanted artifacts and redundant content from the documents. Second, we present the teaching–learning-based Harris Hawks optimization (TL-HHO) algorithm for optimal feature selection, which selects the best features among the many features in a document. Third, we propose a support vector regression probabilistic neural network (SVR-PNN) for optimal document clustering, which improves clustering performance. Finally, the proposed SVR-PNN method is evaluated on the Reuters, 20 Press, and Web-snippets databases. Its performance is compared with existing methods, namely the Rider-Moth Flame optimization algorithm (RMFO), the Correlation Based Incremental Clustering Algorithm (CBICA), Incremental Construction of GMM Tree (ICGT), and Weighted Probabilistic Latent Semantic Analysis (WPLSA), using Precision, Accuracy, F-Measure, and Recall.
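The abstract names Precision, Recall, F-Measure, and Accuracy as evaluation measures but does not say how they are computed for clusters. The sketch below shows one common convention, pair counting, in which every pair of documents is judged by whether the clustering and the reference labels agree on putting the two documents together. This is an assumption for illustration, not the authors' evaluation procedure; the function name `pairwise_scores` and its inputs are hypothetical.

```python
from itertools import combinations


def pairwise_scores(true_labels, cluster_ids):
    """Pair-counting precision, recall, F-measure and accuracy.

    Assumed convention (not taken from the paper): a pair of documents is a
    true positive when both share a reference label and a cluster.
    Accuracy here equals the Rand index.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("true_labels and cluster_ids must have equal length")

    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = cluster_ids[i] == cluster_ids[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1  # grouped together but labels differ
        elif same_class:
            fn += 1  # same label but split across clusters
        else:
            tn += 1

    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


if __name__ == "__main__":
    # Toy data: six documents, two reference topics, two predicted clusters.
    reference = ["a", "a", "a", "b", "b", "b"]
    predicted = [0, 0, 1, 1, 1, 1]
    print(pairwise_scores(reference, predicted))
```

Pair counting avoids having to match cluster IDs to class names, which is why it is often used when the number of clusters differs from the number of reference classes. Other conventions, such as mapping each cluster to its majority class, give different numbers.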
Source journal: Expert Systems with Applications
Category: Engineering & Technology, Engineering: Electrical & Electronic
CiteScore: 13.80
Self-citation rate: 10.60%
Publication volume: 2045
Review time: 8.7 months
Journal description: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.