Data coarse graining can improve model performance.

ArXiv Pub Date: 2025-09-18
Alex Nguyen, David J Schwab, Vudtiwat Ngampruetikorn
{"title":"数据粗粒度化可以提高模型性能。","authors":"Alex Nguyen, David J Schwab, Vudtiwat Ngampruetikorn","doi":"","DOIUrl":null,"url":null,"abstract":"<p><p>Lossy data transformations by definition lose information. Yet, in modern machine learning, methods like data pruning and lossy data augmentation can help improve generalization performance. We study this paradox using a solvable model of high-dimensional, ridge-regularized linear regression under <i>data coarse graining</i>. Inspired by the renormalization group in statistical physics, we analyze coarse-graining schemes that systematically discard features based on their relevance to the learning task. Our results reveal a nonmonotonic dependence of the prediction risk on the degree of coarse graining. A <i>high-pass</i> scheme-which filters out less relevant, lower-signal features-can help models generalize better. By contrast, a <i>low-pass</i> scheme that integrates out more relevant, higher-signal features is purely detrimental. Crucially, using optimal regularization, we demonstrate that this nonmonotonicity is a distinct effect of data coarse graining and not an artifact of double descent. Our framework offers a clear, analytical explanation for why careful data augmentation works: it strips away less relevant degrees of freedom and isolates more predictive signals. Our results highlight a complex, nonmonotonic risk landscape shaped by the structure of the data, and illustrate how ideas from statistical physics provide a principled lens for understanding modern machine learning phenomena.</p>","PeriodicalId":93888,"journal":{"name":"ArXiv","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2025-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12458590/pdf/","citationCount":"0","resultStr":"{\"title\":\"Data coarse graining can improve model performance.\",\"authors\":\"Alex Nguyen, David J Schwab, Vudtiwat Ngampruetikorn\",\"doi\":\"\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Lossy data transformations by definition lose information. Yet, in modern machine learning, methods like data pruning and lossy data augmentation can help improve generalization performance. We study this paradox using a solvable model of high-dimensional, ridge-regularized linear regression under <i>data coarse graining</i>. Inspired by the renormalization group in statistical physics, we analyze coarse-graining schemes that systematically discard features based on their relevance to the learning task. Our results reveal a nonmonotonic dependence of the prediction risk on the degree of coarse graining. A <i>high-pass</i> scheme-which filters out less relevant, lower-signal features-can help models generalize better. By contrast, a <i>low-pass</i> scheme that integrates out more relevant, higher-signal features is purely detrimental. Crucially, using optimal regularization, we demonstrate that this nonmonotonicity is a distinct effect of data coarse graining and not an artifact of double descent. Our framework offers a clear, analytical explanation for why careful data augmentation works: it strips away less relevant degrees of freedom and isolates more predictive signals. 
Our results highlight a complex, nonmonotonic risk landscape shaped by the structure of the data, and illustrate how ideas from statistical physics provide a principled lens for understanding modern machine learning phenomena.</p>\",\"PeriodicalId\":93888,\"journal\":{\"name\":\"ArXiv\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2025-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12458590/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ArXiv\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ArXiv","FirstCategoryId":"1085","ListUrlMain":"","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Lossy data transformations by definition lose information. Yet, in modern machine learning, methods like data pruning and lossy data augmentation can help improve generalization performance. We study this paradox using a solvable model of high-dimensional, ridge-regularized linear regression under data coarse graining. Inspired by the renormalization group in statistical physics, we analyze coarse-graining schemes that systematically discard features based on their relevance to the learning task. Our results reveal a nonmonotonic dependence of the prediction risk on the degree of coarse graining. A high-pass scheme, which filters out less relevant, lower-signal features, can help models generalize better. By contrast, a low-pass scheme that integrates out more relevant, higher-signal features is purely detrimental. Crucially, using optimal regularization, we demonstrate that this nonmonotonicity is a distinct effect of data coarse graining and not an artifact of double descent. Our framework offers a clear, analytical explanation for why careful data augmentation works: it strips away less relevant degrees of freedom and isolates more predictive signals. Our results highlight a complex, nonmonotonic risk landscape shaped by the structure of the data, and illustrate how ideas from statistical physics provide a principled lens for understanding modern machine learning phenomena.
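
To make the high-pass versus low-pass comparison concrete, here is a minimal numerical sketch, not the paper's analytically solvable model: ridge regression on synthetic Gaussian data whose feature signal strengths decay, with coarse graining implemented as keeping only a fraction of features ranked by signal. The dimensions, noise level, decay profile, and ridge penalty are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch (assumed setup): compare "high-pass" coarse graining
# (discard the weakest-signal features) against "low-pass" coarse graining
# (discard the strongest-signal features) in ridge regression.
import numpy as np

rng = np.random.default_rng(0)

d, n_train, n_test = 200, 150, 5000        # dimensions and sample sizes (assumed)
beta = 1.0 / np.sqrt(np.arange(1, d + 1))  # decaying per-feature signal strengths (assumed)
noise = 1.0                                # label noise standard deviation (assumed)

def make_data(n):
    """Draw Gaussian features and noisy linear responses."""
    X = rng.standard_normal((n, d))
    y = X @ beta + noise * rng.standard_normal(n)
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

def ridge_risk(keep, lam=1.0):
    """Fit ridge regression on the retained feature subset and return test MSE."""
    Xk_tr, Xk_te = X_tr[:, keep], X_te[:, keep]
    p = Xk_tr.shape[1]
    w = np.linalg.solve(Xk_tr.T @ Xk_tr + lam * np.eye(p), Xk_tr.T @ y_tr)
    return np.mean((Xk_te @ w - y_te) ** 2)

order = np.argsort(-np.abs(beta))  # features sorted from strongest to weakest signal
for frac in (1.0, 0.75, 0.5, 0.25):
    k = int(frac * d)
    high_pass = order[:k]   # keep the k strongest features, discard the weak ones
    low_pass = order[-k:]   # keep the k weakest features, integrate out the strong ones
    print(f"keep {frac:.0%}: high-pass risk = {ridge_risk(high_pass):.3f}, "
          f"low-pass risk = {ridge_risk(low_pass):.3f}")
```

With choices like these, one would expect the high-pass risk to stay comparable to, or improve on, the full-feature baseline, while the low-pass risk grows as more of the strong features are removed; the paper derives this kind of behavior exactly in its high-dimensional model and shows it persists under optimally tuned regularization.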
