Statistical agnostic regression: A machine learning method to validate regression models

IF 11.4 Zone 1 (comprehensive journals) Q1 MULTIDISCIPLINARY SCIENCES

Journal of Advanced Research Pub Date : 2025-05-01 DOI:10.1016/j.jare.2025.04.026

J.M. Gorriz, J. Ramirez, F. Segovia, C. Jimenez-Mesa, F.J. Martinez-Murcia, J. Suckling

Citations: 0

Abstract

Introduction:

Regression analysis is a central topic in statistical modeling, aimed at estimating the relationships between a dependent variable, commonly referred to as the response variable, and one or more independent variables, i.e., explanatory variables. Linear regression is by far the most popular method for performing this task in various fields of research, such as data integration and predictive modeling when combining information from multiple sources.

Objectives:

Classical methods for solving linear regression problems, such as Ordinary Least Squares (OLS), Ridge, or Lasso regressions, often form the foundation for more advanced machine learning (ML) techniques, which have been successfully applied, though without a formal definition of statistical significance. At most, permutation or analyses based on empirical measures (e.g., residuals or accuracy) have been conducted, leveraging the greater sensitivity of ML estimations for detection.
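
As an illustration of the permutation analyses mentioned above, the following is a minimal sketch of a permutation test for the slope of a simple linear regression. It is a generic textbook procedure, not the method proposed in this paper; the function names (`ols_slope`, `permutation_p_value`) and the toy data are hypothetical and chosen only for this example.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y on x (simple linear regression)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the slope.

    Shuffling y breaks any x-y association, so the slopes computed on
    shuffled data approximate the null distribution of the statistic.
    """
    rng = random.Random(seed)
    observed = abs(ols_slope(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if abs(ols_slope(x, y_perm)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: a noisy linear trend (assumed for illustration).
rng = random.Random(1)
x = [i / 10 for i in range(50)]
y = [2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]
p_linear = permutation_p_value(x, y)
```

A small `p_linear` indicates that the observed slope is unlikely under random pairings of `x` and `y`. Such a test is empirical: it does not by itself provide the kind of formal significance guarantee the abstract says is missing for ML-based regression.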

Methods:

In this paper, we introduce Statistical Agnostic Regression (SAR) for evaluating the statistical significance of ML-based linear regression models. This is achieved by analyzing concentration inequalities of the actual risk (expected loss) and considering the worst-case scenario. To this end, we define a threshold that ensures there is sufficient evidence, with a probability of at least $1 - \eta$, to conclude the existence of a linear relationship in the population between the explanatory (feature) and the response (label) variables.
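
The abstract does not state which concentration inequality or threshold SAR uses, so the sketch below only illustrates the general idea of a worst-case risk comparison. It assumes Hoeffding's inequality, a clipped absolute loss bounded in `[0, loss_bound]`, and a held-out sample, so that the fitted models are fixed when the bound is applied. All names (`hoeffding_margin`, `worst_case_linear_evidence`, `make_data`) and the toy data are hypothetical; this is not the authors' SAR test.

```python
import math
import random


def hoeffding_margin(n, eta, loss_bound):
    """One-sided Hoeffding margin eps: for a fixed model and a loss in
    [0, loss_bound], actual risk <= empirical risk + eps holds with
    probability >= 1 - eta (and symmetrically for the lower side)."""
    return loss_bound * math.sqrt(math.log(1.0 / eta) / (2.0 * n))


def fit_line(x, y):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return slope, my - slope * mx


def clipped_abs_risk(pred, y, loss_bound):
    """Empirical risk under absolute error clipped to [0, loss_bound]."""
    return sum(min(abs(p - yi), loss_bound) for p, yi in zip(pred, y)) / len(y)


def worst_case_linear_evidence(train, test, eta=0.05, loss_bound=5.0):
    """Compare the worst-case (upper-bounded) risk of the linear model with
    the best-case (lower-bounded) risk of a constant, slope-free model."""
    x_tr, y_tr = train
    x_te, y_te = test
    slope, intercept = fit_line(x_tr, y_tr)
    mean_y = sum(y_tr) / len(y_tr)
    n = len(y_te)
    # Split eta across the two bounds (union bound) so that both hold
    # jointly with probability >= 1 - eta.
    eps = hoeffding_margin(n, eta / 2.0, loss_bound)
    risk_linear = clipped_abs_risk(
        [slope * xi + intercept for xi in x_te], y_te, loss_bound)
    risk_null = clipped_abs_risk([mean_y] * n, y_te, loss_bound)
    upper_linear = risk_linear + eps
    lower_null = risk_null - eps
    return upper_linear < lower_null, upper_linear, lower_null


def make_data(slope, n, seed):
    """Hypothetical toy sample: y = slope * x + standard Gaussian noise."""
    rng = random.Random(seed)
    x = [rng.uniform(0.0, 10.0) for _ in range(n)]
    y = [slope * xi + rng.gauss(0.0, 1.0) for xi in x]
    return x, y


detected_linear, _, _ = worst_case_linear_evidence(
    make_data(2.0, 500, 1), make_data(2.0, 500, 2))
detected_null, _, _ = worst_case_linear_evidence(
    make_data(0.0, 500, 3), make_data(0.0, 500, 4))
```

In this sketch, a linear relationship is declared only when the linear model still beats the constant model after both risks are pushed to their least favourable values. This makes the check conservative: on the toy data it should fire for the linearly related sample and stay silent for the unrelated one.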

Conclusions:

Simulations demonstrate that the proposed agnostic (non-parametric) test can perform an analysis of variance comparable to the classical multivariate F-test for the slope parameter, without relying on the underlying assumptions of classical methods. A power analysis on a putative regression task revealed an overinflated false positive rate in standard ML methods, whereas the SAR test exhibited excellent control. Moreover, the residuals computed using this method represent a trade-off between those obtained from ML approaches and classical OLS.

Source journal

Journal of Advanced Research Multidisciplinary-Multidisciplinary

CiteScore: 21.60

Self-citation rate: 0.90%

Articles published: 280

Review time: 12 weeks

About the journal: Journal of Advanced Research (J. Adv. Res.) is an applied/natural sciences, peer-reviewed journal that focuses on interdisciplinary research. The journal aims to contribute to applied research and knowledge worldwide through the publication of original and high-quality research articles in the fields of Medicine, Pharmaceutical Sciences, Dentistry, Physical Therapy, Veterinary Medicine, and Basic and Biological Sciences. The following abstracting and indexing services cover the Journal of Advanced Research: PubMed/Medline, Essential Science Indicators, Web of Science, Scopus, PubMed Central, PubMed, Science Citation Index Expanded, Directory of Open Access Journals (DOAJ), and INSPEC.