Scalable graph attention-based instance selection via mini-batch sampling and hierarchical hashing

Impact Factor: 14.8
Zahiriddin Rustamov, Ayham Zaitouny, Nazar Zaki
Journal: AI Open, Volume 6 (2025), Pages 167-182
DOI: 10.1016/j.aiopen.2025.08.004
URL: https://www.sciencedirect.com/science/article/pii/S2666651025000130
Citations: 0

Abstract

Instance selection (IS) addresses the critical challenge of reducing dataset size while preserving informative characteristics, a task that grows increasingly important as datasets reach millions of instances. Current IS methods often struggle to capture complex relationships in high-dimensional spaces and to scale to large datasets. This paper introduces a graph attention-based instance selection (GAIS) method that uses attention mechanisms to identify informative instances through their structural relationships in graph representations. We present two approaches for scalable graph construction: a distance-based mini-batch sampling technique that achieves dataset-size-independent complexity through strategic batch processing, and a hierarchical hashing approach that enables efficient similarity computation through random projections. The mini-batch approach preserves class distributions through stratified sampling, while the hierarchical hashing method captures relationships at multiple granularities through single-level, multi-level, and multi-view variants. Experiments across 39 datasets show that GAIS achieves reduction rates above 96% while maintaining or improving model performance relative to state-of-the-art IS methods. The findings show that the distance-based mini-batch approach offers optimal efficiency for large-scale datasets, while the multi-view variants excel on complex, high-dimensional data, demonstrating that attention-based importance scoring can effectively identify instances critical for maintaining decision boundaries while avoiding computationally prohibitive pairwise comparisons. The code is publicly available at https://github.com/zahiriddin-rustamov/gais.
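The stratified mini-batch sampling described above can be sketched as follows. This is an illustrative sketch only, not the paper's implementation: the function name, the proportional quota rule, and the example data are assumptions. The key property it demonstrates is that each batch mirrors the full dataset's class distribution, so batch-local graph construction does not skew minority classes.

```python
import numpy as np

def stratified_minibatch(X, y, batch_size, rng=None):
    """Draw one mini-batch whose class proportions mirror the full dataset.

    Hypothetical sketch; GAIS's actual batch construction may differ.
    """
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    # Allocate batch slots proportionally to class frequency (at least 1 each).
    quotas = np.maximum(1, np.round(batch_size * counts / len(y))).astype(int)
    idx = []
    for cls, quota in zip(classes, quotas):
        members = np.flatnonzero(y == cls)
        idx.extend(rng.choice(members, size=min(quota, len(members)), replace=False))
    return np.asarray(idx)

# Example: a 90/10 imbalanced dataset keeps roughly the same ratio in the batch.
y = np.array([0] * 90 + [1] * 10)
X = np.random.default_rng(0).normal(size=(100, 4))
batch = stratified_minibatch(X, y, batch_size=20, rng=0)
```

Because each batch has a fixed size, the per-batch graph-construction cost is independent of the total dataset size, which is the scalability property the abstract claims.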
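The hierarchical hashing idea rests on signed random projections, a standard locality-sensitive hashing scheme. The sketch below shows only the single-level variant and is an assumption-laden illustration (function name, bit width, and example data are invented); the paper's hierarchical scheme stacks tables at multiple granularities and adds multi-level and multi-view variants on top of this primitive.

```python
import numpy as np

def random_projection_hash(X, n_bits=8, seed=0):
    """Hash each row of X into an n_bits-bit bucket key via signed random projections.

    Points on the same side of all hyperplanes collide, so candidate neighbor
    pairs can be gathered per bucket instead of by all-pairs distance.
    """
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(X.shape[1], n_bits))  # random hyperplanes
    bits = (X @ planes) >= 0                        # sign pattern per point
    # Pack the boolean sign pattern into a single integer bucket key.
    return bits.astype(np.int64) @ (1 << np.arange(n_bits))

X = np.random.default_rng(1).normal(size=(5, 16))
keys = random_projection_hash(X, n_bits=8)
# Identical points always land in the same bucket:
same = random_projection_hash(np.vstack([X[0], X[0]]))
```

Grouping instances by bucket key reduces similarity computation from O(n^2) pairwise comparisons to comparisons within (typically small) buckets.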