Feature and functional form selection in additive models via mixed-integer optimization
Manuel Navarro-García, Vanesa Guerrero, María Durban, Arturo del Cerro
Computers & Operations Research, volume 176, Article 106945. Published 2024-12-11. DOI: 10.1016/j.cor.2024.106945
https://www.sciencedirect.com/science/article/pii/S0305054824004179
Citations: 0
Abstract
Feature selection is a recurrent research topic in modern regression analysis, which strives to build interpretable models, using sparsity as a proxy, without sacrificing predictive power. The best subset selection problem is central to this statistical task: its goal is to identify the subset of covariates of a given size that provides the best fit in terms of an empirical loss function. In this work, we address the problem of feature and functional form selection in additive regression models through a mathematical optimization lens. Penalized splines (P-splines) are used to estimate the smooth functions involved in the regression equation, which allows us to state the feature selection problem as a cardinality-constrained mixed-integer quadratic program (MIQP) in terms of both linear and non-linear covariates. To strengthen this MIQP formulation, we develop tight bounds for the regression coefficients. A matheuristic approach, which combines a preprocessing step, the construction of a warm-start solution, the MIQP formulation and the large neighborhood search metaheuristic paradigm, is proposed to handle larger instances of the feature and functional form selection problem. The performance of the exact and the matheuristic approaches is compared on simulated data. Furthermore, our matheuristic is compared to other methodologies in the literature that have publicly available implementations, using both simulated and real-world data. We show that the stated approach is competitive in terms of predictive power and in the selection of the correct subset of covariates with the appropriate functional form. A public Python library is available with all the implementations of the methodologies developed in this paper.
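To make the best subset selection problem mentioned in the abstract concrete, the following is a minimal illustrative sketch and not the paper's method or its library. It assumes a purely linear model, a least-squares loss, and a fixed subset size `k`, and it solves the problem by exhaustive enumeration rather than by an MIQP solver, so it only scales to a handful of covariates. The function name `best_subset` and the synthetic data are hypothetical.

```python
from itertools import combinations

import numpy as np


def best_subset(X, y, k):
    """Return the size-k column subset of X with the smallest residual sum of squares.

    Illustrative brute force: every subset of exactly k columns is fitted by
    ordinary least squares. This is an assumption-laden stand-in for a
    cardinality-constrained formulation, not the MIQP from the paper.
    """
    best_rss, best_cols, best_coef = np.inf, None, None
    for cols in combinations(range(X.shape[1]), k):
        Xs = X[:, list(cols)]
        # Least-squares fit restricted to the candidate columns.
        coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = float(np.sum((y - Xs @ coef) ** 2))
        if rss < best_rss:
            best_rss, best_cols, best_coef = rss, cols, coef
    return best_cols, best_coef, best_rss


# Synthetic data (hypothetical): only columns 1, 4 and 6 drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + 1.5 * X[:, 6] + 0.1 * rng.normal(size=200)

cols, coef, rss = best_subset(X, y, k=3)
print("selected columns:", cols)
print("coefficients:", np.round(coef, 2))
```

With a signal this strong relative to the noise, the search should recover the three informative columns. The number of subsets grows combinatorially with the number of covariates, which is the reason exact mixed-integer formulations with tight coefficient bounds, and heuristics for larger instances, are of interest.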
About the journal:
Operations research and computers meet in a large number of scientific fields, many of which are of vital current concern to our troubled society. These include, among others, ecology, transportation, safety, reliability, urban planning, economics, inventory control, investment strategy and logistics (including reverse logistics). Computers & Operations Research provides an international forum for the application of computers and operations research techniques to problems in these and related fields.