分布偏差影响了 "留一 "交叉验证。

ArXiv Pub Date : 2025-03-24

George I Austin, Itsik Pe'er, Tal Korem

{"title":"分布偏差影响了 \"留一 \"交叉验证。","authors":"George I Austin, Itsik Pe'er, Tal Korem","doi":"","DOIUrl":null,"url":null,"abstract":"Cross-validation is a common method for estimating the predictive performance of machine learning models. In a data-scarce regime, where one typically wishes to maximize the number of instances used for training the model, an approach called \"leave-one-out cross-validation\" is often used. In this design, a separate model is built for predicting each data instance after training on all other instances. Since this results in a single test instance available per model trained, predictions are aggregated across the entire dataset to calculate common performance metrics such as the area under the receiver operating characteristic or R2 scores. In this work, we demonstrate that this approach creates a negative correlation between the average label of each training fold and the label of its corresponding test instance, a phenomenon that we term distributional bias. As machine learning models tend to regress to the mean of their training data, this distributional bias tends to negatively impact performance evaluation and hyperparameter optimization. We show that this effect generalizes to leave-P-out cross-validation and persists across a wide range of modeling and evaluation approaches, and that it can lead to a bias against stronger regularization. To address this, we propose a generalizable rebalanced cross-validation approach that corrects for distributional bias for both classification and regression. We demonstrate that our approach improves cross-validation performance evaluation in synthetic simulations, across machine learning benchmarks, and in several published leave-one-out analyses.","PeriodicalId":93888,"journal":{"name":"ArXiv","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2025-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11177965/pdf/","citationCount":"0","resultStr":"{\"title\":\"Distributional bias compromises leave-one-out cross-validation.\",\"authors\":\"George I Austin, Itsik Pe'er, Tal Korem\",\"doi\":\"\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Cross-validation is a common method for estimating the predictive performance of machine learning models. In a data-scarce regime, where one typically wishes to maximize the number of instances used for training the model, an approach called \\\"leave-one-out cross-validation\\\" is often used. In this design, a separate model is built for predicting each data instance after training on all other instances. Since this results in a single test instance available per model trained, predictions are aggregated across the entire dataset to calculate common performance metrics such as the area under the receiver operating characteristic or R2 scores. In this work, we demonstrate that this approach creates a negative correlation between the average label of each training fold and the label of its corresponding test instance, a phenomenon that we term distributional bias. As machine learning models tend to regress to the mean of their training data, this distributional bias tends to negatively impact performance evaluation and hyperparameter optimization. We show that this effect generalizes to leave-P-out cross-validation and persists across a wide range of modeling and evaluation approaches, and that it can lead to a bias against stronger regularization. To address this, we propose a generalizable rebalanced cross-validation approach that corrects for distributional bias for both classification and regression. We demonstrate that our approach improves cross-validation performance evaluation in synthetic simulations, across machine learning benchmarks, and in several published leave-one-out analyses.\",\"PeriodicalId\":93888,\"journal\":{\"name\":\"ArXiv\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2025-03-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11177965/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ArXiv\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ArXiv","FirstCategoryId":"1085","ListUrlMain":"","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

交叉验证是估算机器学习模型预测性能的常用方法。在数据稀缺的情况下，人们通常希望最大限度地增加用于训练模型的实例数量，因此经常使用一种称为 "留一不交叉验证 "的方法。在这种设计中，在对所有其他实例进行训练后，为预测每个数据实例建立一个单独的模型。由于这样做的结果是每个训练好的模型只有一个测试数据点，因此预测结果会在整个数据集上汇总，以计算常见的基于等级的性能指标，如接收者操作特征下面积或精度-召回曲线。在这项工作中，我们证明了这种方法会在每个训练折叠的平均标签与其对应测试实例的标签之间产生负相关，我们将这种现象称为分布偏差。由于机器学习模型倾向于向其训练数据的平均值回归，这种分布偏差往往会对性能评估和超参数优化产生负面影响。我们的研究表明，这种影响会泛化到留空交叉验证，并在各种建模和评估方法中持续存在，而且会导致对更强正则化的偏见。为了解决这个问题，我们提出了一种可推广的再平衡交叉验证方法，可以纠正分布偏差。我们在合成模拟和几项已发表的留空分析中证明，我们的方法改进了交叉验证性能评估。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

本刊更多论文

Distributional bias compromises leave-one-out cross-validation.

Cross-validation is a common method for estimating the predictive performance of machine learning models. In a data-scarce regime, where one typically wishes to maximize the number of instances used for training the model, an approach called "leave-one-out cross-validation" is often used. In this design, a separate model is built for predicting each data instance after training on all other instances. Since this results in a single test instance available per model trained, predictions are aggregated across the entire dataset to calculate common performance metrics such as the area under the receiver operating characteristic or R2 scores. In this work, we demonstrate that this approach creates a negative correlation between the average label of each training fold and the label of its corresponding test instance, a phenomenon that we term distributional bias. As machine learning models tend to regress to the mean of their training data, this distributional bias tends to negatively impact performance evaluation and hyperparameter optimization. We show that this effect generalizes to leave-P-out cross-validation and persists across a wide range of modeling and evaluation approaches, and that it can lead to a bias against stronger regularization. To address this, we propose a generalizable rebalanced cross-validation approach that corrects for distributional bias for both classification and regression. We demonstrate that our approach improves cross-validation performance evaluation in synthetic simulations, across machine learning benchmarks, and in several published leave-one-out analyses.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

ArXiv

自引率

0.00%

发文量