A fast approach based on divide-and-conquer for instance selection in classification problem

IF 3.5 | Zone 2 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Applied Intelligence Pub Date : 2025-04-17 DOI:10.1007/s10489-025-06541-y

Hamid Saadatfar, Sayed Iqbal Nawin, Edris Hosseini Gol

Citations: 0

Abstract

Instance selection is a data preprocessing method in data mining that aims to reduce the volume of the training dataset. Reducing samples from a large dataset offers benefits such as lower storage requirements, reduced computational costs, increased processing speed, and, in some cases, improved accuracy for learning algorithms. However, reducing samples from large datasets is also a challenging task due to their sheer volume. Recently, numerous instance selection methods for big data have been proposed, often facing challenges such as low accuracy and slow processing speed. In this research, we propose a fast and efficient three-step method based on the divide-and-conquer approach. In the first step, the training set is divided based on the number of classes. Next, representative summaries of each class are extracted. Finally, samples from each class are reduced independently while considering the representatives of other classes. By using a proposed ranking-based method, it is possible to accurately identify less important and noisy samples. For a comprehensive evaluation, we utilized 20 well-known large datasets and three synthetic datasets featuring challenging structures. The results demonstrate the superiority of the proposed method over four recent related methods.
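
The abstract names three steps but does not say how representatives are extracted or how samples are ranked, so the following is only an illustrative sketch of the divide-and-conquer structure, not the authors' algorithm. Everything concrete here is an assumption: representatives are taken to be a few k-means centres per class, a sample is treated as noisy when it lies closer to another class's representative than to its own, and the remaining samples are ranked by distance to the nearest other-class representative so that boundary samples are kept first. The function names and the parameters `n_reps` and `keep_ratio` are hypothetical.

```python
import math
import random
from collections import defaultdict


def split_by_class(X, y):
    """Step 1: divide the training set into one subset per class label."""
    parts = defaultdict(list)
    for x, label in zip(X, y):
        parts[label].append(x)
    return dict(parts)


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def class_representatives(points, n_reps, rng, n_iter=10):
    """Step 2 (assumption): summarise a class by a few k-means centres."""
    k = min(n_reps, len(points))
    centers = rng.sample(points, k)
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[j].append(p)
        # An empty group keeps its previous centre.
        centers = [mean_point(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers


def nearest_distance(p, reps):
    """Distance from p to the closest representative in reps."""
    return min(math.dist(p, r) for r in reps)


def select_instances(X, y, n_reps=5, keep_ratio=0.3, seed=0):
    """Step 3 (assumption): reduce each class independently, looking only
    at the representatives of the other classes. Needs at least two classes."""
    rng = random.Random(seed)
    parts = split_by_class(X, y)
    reps = {c: class_representatives(pts, n_reps, rng)
            for c, pts in parts.items()}
    selected = []
    for c, pts in parts.items():
        enemy_reps = [r for other, rs in reps.items() if other != c for r in rs]
        ranked = []
        for p in pts:
            d_own = nearest_distance(p, reps[c])
            d_enemy = nearest_distance(p, enemy_reps)
            if d_own <= d_enemy:  # assumed noise rule: drop samples nearer to another class
                ranked.append((d_enemy, p))
        ranked.sort(key=lambda item: item[0])  # boundary samples first
        n_keep = math.ceil(keep_ratio * len(ranked))
        selected.extend((p, c) for _, p in ranked[:n_keep])
    return selected
```

Calling `select_instances(X, y)` with `X` as a list of numeric tuples and `y` as the matching labels returns a list of `(point, label)` pairs. Because each class is compared only against a handful of representatives rather than against every sample of the other classes, the cost per class stays roughly linear in its size, which is the point of the divide-and-conquer framing.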

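Source journal

Applied Intelligence (Engineering and Technology: Computer Science, Artificial Intelligence)

CiteScore: 6.60

Self-citation rate: 20.80%

Articles published: 1361

Review time: 5.9 months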

About the journal: With a focus on research in artificial intelligence and neural networks, this journal addresses issues involving solutions of real-life manufacturing, defense, management, government and industrial problems which are too complex to be solved through conventional approaches and require the simulation of intelligent thought processes, heuristics, applications of knowledge, and distributed and parallel processing. The integration of these multiple approaches in solving complex problems is of particular importance. The journal presents new and original research and technological developments, addressing real and complex issues applicable to difficult problems. It provides a medium for exchanging scientific research and technological achievements accomplished by the international community.