Imbalanced data sampling design based on grid boundary domain for big data

IF 1.4, Zone 4 (Mathematics), JCR Q3 (STATISTICS & PROBABILITY)

Computational Statistics, Pub Date: 2024-03-08, DOI: 10.1007/s00180-024-01471-8


Citations: 0

Abstract

Data distributions are often associated with a probability known a priori, and events of interest occur with small probability, so large amounts of imbalanced data arise in sociology, economics, engineering, and many other fields. Existing over- and under-sampling methods are widely used in imbalanced classification problems, but over-sampling tends to cause overfitting, and under-sampling discards useful information. We propose a new sampling design algorithm, the neighbor grid of boundary mixed-sampling (NGBM), which focuses on boundary information. The method obtains classification boundary information through grid boundary domain identification and uses it to determine the importance of the samples. On this basis, the synthetic minority oversampling technique (SMOTE) is applied to the boundary grids, and random under-sampling is applied to the other grids. This mixed sampling strategy extracts more of the important classification boundary information, especially the information needed to identify positive samples. Numerical simulations and real data analysis are used to discuss the parameter-setting strategy of NGBM and to illustrate its advantages on imbalanced data, as well as its practical applications.
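The abstract gives no algorithmic detail, so the following Python sketch only illustrates the general idea of grid-based mixed sampling and is not the authors' NGBM. It rests on several assumptions that do not come from the paper: the grid is built by equal-width binning of each feature, a cell counts as a "boundary" cell when it contains both classes, SMOTE is simplified to interpolation between random pairs of minority points in the same cell (no nearest-neighbor search), and the function name `grid_mixed_sample` and the parameters `bins`, `keep_frac`, and `n_synth` are hypothetical.

```python
import numpy as np


def grid_mixed_sample(X, y, bins=5, keep_frac=0.5, n_synth=2, seed=0):
    """Illustrative grid-based mixed sampling (hypothetical, not the paper's NGBM).

    Labels are assumed to be 1 (minority / positive) and 0 (majority / negative).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Assign every sample to a grid cell by equal-width binning per feature.
    lo, hi = X.min(axis=0), X.max(axis=0)
    width = np.where(hi > lo, (hi - lo) / bins, 1.0)
    cells = np.minimum(((X - lo) / width).astype(int), bins - 1)
    _, cell_id = np.unique(cells, axis=0, return_inverse=True)
    cell_id = cell_id.ravel()

    X_out, y_out = [], []
    for c in np.unique(cell_id):
        idx = np.flatnonzero(cell_id == c)
        pos, neg = idx[y[idx] == 1], idx[y[idx] == 0]
        if len(pos) > 0 and len(neg) > 0:
            # Assumed boundary cell (both classes present): keep all samples.
            X_out.append(X[idx])
            y_out.append(y[idx])
            if len(pos) >= 2:
                # Simplified SMOTE: interpolate between random minority pairs.
                a = rng.choice(pos, size=n_synth * len(pos))
                b = rng.choice(pos, size=n_synth * len(pos))
                t = rng.random((len(a), 1))
                X_out.append(X[a] + t * (X[b] - X[a]))
                y_out.append(np.ones(len(a), dtype=y.dtype))
        else:
            # Non-boundary cell: keep positives, randomly under-sample negatives.
            k = int(np.ceil(keep_frac * len(neg)))
            kept = rng.choice(neg, size=k, replace=False) if k > 0 else neg
            keep = np.concatenate([pos, kept])
            X_out.append(X[keep])
            y_out.append(y[keep])
    return np.vstack(X_out), np.concatenate(y_out)


# Usage on synthetic data: 500 negatives and 25 positives in two dimensions.
data_rng = np.random.default_rng(1)
X_demo = np.vstack([data_rng.normal(0.0, 1.0, (500, 2)),
                    data_rng.normal(1.5, 0.5, (25, 2))])
y_demo = np.array([0] * 500 + [1] * 25)
X_res, y_res = grid_mixed_sample(X_demo, y_demo)
print("negatives:", int((y_res == 0).sum()), "positives:", int((y_res == 1).sum()))
```

In this sketch every original positive sample is retained, synthetic positives are added only in mixed cells, and negatives are thinned only in cells that contain a single class. The paper's actual boundary identification and parameter settings may differ.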




Source journal

Computational Statistics (Mathematics: Statistics & Probability)

CiteScore: 2.90

Self-citation rate: 0.00%

Articles published: 122

Review time: >12 weeks

Journal description: Computational Statistics (CompStat) is an international journal that promotes the publication of applications and methodological research in the field of computational statistics. Papers in CompStat focus on the contribution and influence of computing on statistics and vice versa. The journal provides a forum for computer scientists, mathematicians, and statisticians working in a variety of fields of statistics, such as biometrics, econometrics, data analysis, graphics, simulation, algorithms, knowledge-based systems, and Bayesian computing. CompStat also publishes hardware, software, and package reports.