Statistical agnostic regression: A machine learning method to validate regression models
J.M. Gorriz, J. Ramirez, F. Segovia, C. Jimenez-Mesa, F.J. Martinez-Murcia, J. Suckling
Journal of Advanced Research, published 2025-05-01. DOI: 10.1016/j.jare.2025.04.026 (https://doi.org/10.1016/j.jare.2025.04.026)
Citation count: 0
Abstract
Introduction:
Regression analysis is a central topic in statistical modeling, aimed at estimating the relationships between a dependent variable, commonly referred to as the response variable, and one or more independent variables, i.e., explanatory variables. Linear regression is by far the most popular method for performing this task in various fields of research, such as data integration and predictive modeling when combining information from multiple sources.
Objectives:
Classical methods for solving linear regression problems, such as Ordinary Least Squares (OLS), Ridge, or Lasso regressions, often form the foundation for more advanced machine learning (ML) techniques, which have been successfully applied, though without a formal definition of statistical significance. At most, permutation tests or analyses based on empirical measures (e.g., residuals or accuracy) have been conducted, leveraging the greater sensitivity of ML estimations for detection.
Methods:
In this paper, we introduce Statistical Agnostic Regression (SAR) for evaluating the statistical significance of ML-based linear regression models. This is achieved by analyzing concentration inequalities of the actual risk (expected loss) and considering the worst-case scenario. To this end, we define a threshold that ensures there is sufficient evidence, with a probability of at least 1 − η, to conclude the existence of a linear relationship in the population between the explanatory (feature) and the response (label) variables.
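The abstract does not state which concentration inequality or threshold SAR uses, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' procedure. It assumes a held-out split, an absolute loss clipped to a hypothetical bound `loss_bound`, and Hoeffding's inequality as a stand-in bound. A linear relationship is declared only when the worst-case (upper) bound on the actual risk of the fitted line lies below the best-case (lower) bound on the actual risk of a constant predictor, with η split evenly between the two bounds. All function names are hypothetical.

```python
import numpy as np


def hoeffding_margin(n, eta, loss_bound):
    """Two-sided Hoeffding deviation for a loss in [0, loss_bound].

    For a predictor fixed before seeing the n evaluation samples,
    |empirical risk - actual risk| <= margin with probability >= 1 - eta.
    """
    return loss_bound * np.sqrt(np.log(2.0 / eta) / (2.0 * n))


def worst_case_linear_test(x, y, eta=0.05, loss_bound=4.0, seed=0):
    """Illustrative worst-case risk comparison (an assumption, not the SAR test)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    half = len(y) // 2
    train, test = idx[:half], idx[half:]

    # Fit both predictors on the training half only, so that Hoeffding's
    # inequality applies to their losses on the held-out half.
    slope, intercept = np.polyfit(x[train], y[train], deg=1)
    constant = y[train].mean()

    # Clipped absolute loss keeps every loss value inside [0, loss_bound].
    loss_lin = np.minimum(np.abs(y[test] - (slope * x[test] + intercept)), loss_bound)
    loss_null = np.minimum(np.abs(y[test] - constant), loss_bound)

    # Union bound: each one-model statement gets eta / 2.
    margin = hoeffding_margin(len(test), eta / 2.0, loss_bound)
    upper_lin = loss_lin.mean() + margin    # worst case for the linear model
    lower_null = loss_null.mean() - margin  # best case for the constant model
    return bool(upper_lin < lower_null), upper_lin, lower_null
```

Calling `worst_case_linear_test(x, y)` returns the decision together with the two bounds, so the size of the gap can be inspected. The Hoeffding margin is deliberately conservative; a tighter inequality would raise power without changing the structure of the comparison.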
Conclusions:
Simulations demonstrate that the proposed agnostic (non-parametric) test can perform an analysis of variance comparable to the classical multivariate F-test for the slope parameter, without relying on the underlying assumptions of classical methods. A power analysis on a putative regression task revealed an overinflated false positive rate in standard ML methods, whereas the SAR test exhibited excellent control. Moreover, the residuals computed using this method represent a trade-off between those obtained from ML approaches and classical OLS.
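To make the notion of false positive control concrete, the sketch below estimates false positive rates by Monte Carlo simulation on data where the features and labels are independent, so any declared relationship is a false positive. It is not the paper's experiment. The "naive" rule (declare a relationship whenever the held-out loss of the fitted line is below that of a constant) is a hypothetical stand-in for a decision based on empirical measures alone, and the "guarded" rule adds an assumed Hoeffding margin to both sides. The sample size, trial count, and `loss_bound` are arbitrary choices.

```python
import numpy as np


def holdout_losses(x, y, rng, loss_bound):
    """Fit a line and a constant on one half; return clipped absolute losses on the other."""
    idx = rng.permutation(len(y))
    half = len(y) // 2
    train, test = idx[:half], idx[half:]
    slope, intercept = np.polyfit(x[train], y[train], deg=1)
    loss_lin = np.minimum(np.abs(y[test] - (slope * x[test] + intercept)), loss_bound)
    loss_null = np.minimum(np.abs(y[test] - y[train].mean()), loss_bound)
    return loss_lin, loss_null


def false_positive_rates(n=200, trials=300, eta=0.05, loss_bound=4.0, seed=0):
    """Estimate false positive rates of a naive and a margin-guarded rule under the null."""
    rng = np.random.default_rng(seed)
    naive_hits = 0
    guarded_hits = 0
    for _ in range(trials):
        x = rng.normal(size=n)
        y = rng.normal(size=n)  # drawn independently of x: no relationship exists
        loss_lin, loss_null = holdout_losses(x, y, rng, loss_bound)
        # Hoeffding margin with eta / 2 per bound (union bound over two models).
        margin = loss_bound * np.sqrt(np.log(4.0 / eta) / (2.0 * len(loss_lin)))
        naive_hits += int(loss_lin.mean() < loss_null.mean())
        guarded_hits += int(loss_lin.mean() + margin < loss_null.mean() - margin)
    return naive_hits / trials, guarded_hits / trials


if __name__ == "__main__":
    naive_rate, guarded_rate = false_positive_rates()
    print(f"naive rule: {naive_rate:.3f}, guarded rule: {guarded_rate:.3f}")
```

In this setup the naive rule fires whenever sampling noise happens to favor the fitted line, while the guarded rule requires the gap to exceed twice the margin, so its estimated rate should stay at or below η.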
About the journal:
Journal of Advanced Research (J. Adv. Res.) is an applied/natural sciences, peer-reviewed journal that focuses on interdisciplinary research. The journal aims to contribute to applied research and knowledge worldwide through the publication of original and high-quality research articles in the fields of Medicine, Pharmaceutical Sciences, Dentistry, Physical Therapy, Veterinary Medicine, and Basic and Biological Sciences.
The following abstracting and indexing services cover the Journal of Advanced Research: PubMed/Medline, Essential Science Indicators, Web of Science, Scopus, PubMed Central, PubMed, Science Citation Index Expanded, Directory of Open Access Journals (DOAJ), and INSPEC.