Restoring balance: principled under/oversampling of data for optimal classification
Emanuele Loffredo, Mauro Pastore, Simona Cocco, Rémi Monasson
arXiv - PHYS - Disordered Systems and Neural Networks, 2024-05-15
DOI: arxiv-2405.09535 (https://doi.org/arxiv-2405.09535)
Abstract
Class imbalance in real-world data poses a common bottleneck for machine
learning tasks, since achieving good generalization on under-represented
examples is often challenging. Mitigation strategies, such as under- or
oversampling the data depending on class abundances, are routinely proposed and
tested empirically, but how they should adapt to the data statistics remains
poorly understood. In this work, we determine exact analytical expressions for
the generalization curves in the high-dimensional regime for linear classifiers
(Support Vector Machines). We also provide a sharp prediction of the effects of
under/oversampling strategies depending on class imbalance, first and second
moments of the data, and the performance metrics considered. We show that
mixed strategies combining under- and oversampling of the data improve
performance. Through numerical experiments, we show the relevance of our
theoretical predictions on real datasets, on deeper architectures and with
sampling strategies based on unsupervised probabilistic models.
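
As an illustration of what a mixed strategy can look like in practice, the following minimal Python sketch resamples a labeled dataset so that every class ends up with the same number of examples: larger classes are undersampled and smaller ones are oversampled. It is not the authors' procedure. The function name `mixed_resample` and the parameter `target_per_class` are hypothetical, oversampling is done by plain random duplication (an assumption; the abstract also mentions samplers based on unsupervised probabilistic models), and the choice of target size, which the paper's theory addresses, is left to the caller.

```python
import random
from collections import Counter


def mixed_resample(labels, target_per_class, seed=0):
    """Return dataset indices with exactly target_per_class examples per class.

    Hypothetical helper for illustration only.
    - Classes with at least target_per_class examples are undersampled
      (drawn without replacement).
    - Smaller classes are oversampled: all originals are kept and random
      duplicates (drawn with replacement) fill the gap.
    """
    rng = random.Random(seed)

    # Group example indices by class label.
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    chosen = []
    for idxs in by_class.values():
        if len(idxs) >= target_per_class:
            # Undersampling: keep a random subset.
            chosen.extend(rng.sample(idxs, target_per_class))
        else:
            # Oversampling: keep everything, then add random duplicates.
            extra = rng.choices(idxs, k=target_per_class - len(idxs))
            chosen.extend(idxs + extra)

    rng.shuffle(chosen)
    return chosen


# Toy imbalanced dataset: 20 positive and 180 negative labels.
labels = [+1] * 20 + [-1] * 180

# A target between the two class sizes triggers both mechanisms at once.
idx = mixed_resample(labels, target_per_class=60)
counts = Counter(labels[i] for i in idx)
print(counts[+1], counts[-1])
```

The returned indices would then be used to select the training examples passed to a classifier such as a linear SVM. Setting `target_per_class` to the minority or majority class size recovers pure undersampling or pure oversampling, respectively.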