An automated exact solution framework towards solving the logistic regression best subset selection problem

Journal impact factor: 0.4 (JCR Q4, Statistics & Probability)
Thomas van Niekerk, Jacques V. Venter, Stephanus E. Terblanche
Journal: South African Statistical Journal, 2023
DOI: 10.37920/sasj.2023.57.2.2 (https://doi.org/10.37920/sasj.2023.57.2.2)

Abstract

An automated logistic regression solution framework (ALRSF) is proposed to solve a mixed integer programming (MIP) formulation of the well-known logistic regression best subset selection problem. The solution framework first determines the optimal number of independent variables to include in the model using an automated cardinality parameter selection procedure. The cardinality parameter dictates the size of the subset of variables and can be problem-specific. A novel regression parameter fixing heuristic that utilises a Benders decomposition algorithm is then applied to prune the solution search space so that the optimal regression parameter values are found faster. An optimality gap is subsequently calculated to quantify the quality of the final regression model by considering the distance between the best possible log-likelihood value and the log-likelihood value calculated using the current parameter values. Attempts are then made to reduce the optimality gap by adjusting the regression parameter values. The ALRSF serves as a holistic variable selection framework that enables the user to consider larger datasets when solving the logistic regression best subset selection problem by significantly reducing the memory requirements associated with its MIP formulation. Furthermore, the automated framework requires minimal user intervention during model training and hyperparameter tuning. Improvements in the quality of the final model (when considering both the optimality gap and the computing resources required to achieve a result) are observed when the ALRSF is applied to well-known real-world UCI machine learning datasets.

Keywords: Best subset selection, Independent variable selection, Logistic regression, Mixed integer programming
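
To make the terms in the abstract concrete, the following minimal Python sketch illustrates best subset selection under a fixed cardinality parameter k, together with a relative optimality gap. It is not the ALRSF: it replaces the MIP formulation, the Benders-based fixing heuristic and the automated cardinality selection with brute-force enumeration and plain gradient ascent. As a stand-in for the "best possible" log-likelihood it uses the fit with all variables included, which is a relaxation of the cardinality constraint. All function names, the toy data and the gap convention (difference divided by the absolute incumbent value) are assumptions made for illustration only.

```python
import math
from itertools import combinations


def log_sigmoid(s):
    """Numerically stable log(1 / (1 + exp(-s)))."""
    return -math.log1p(math.exp(-s)) if s >= 0 else s - math.log1p(math.exp(s))


def linear_score(beta, row):
    """Intercept beta[0] plus the weighted sum of the row's features."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))


def log_likelihood(beta, X, y):
    """Logistic log-likelihood of 0/1 labels y given feature rows X."""
    return sum(
        log_sigmoid(linear_score(beta, row) if label == 1 else -linear_score(beta, row))
        for row, label in zip(X, y)
    )


def fit_logistic(X, y, steps=2000, lr=0.5):
    """Approximate maximum-likelihood fit by plain gradient ascent (illustrative only)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(steps):
        grad = [0.0] * (p + 1)
        for row, label in zip(X, y):
            # Residual: observed label minus predicted probability.
            resid = label - math.exp(log_sigmoid(linear_score(beta, row)))
            grad[0] += resid
            for j, x in enumerate(row):
                grad[j + 1] += resid * x
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def best_subset(X, y, k):
    """Enumerate every subset of exactly k variables; keep the highest log-likelihood."""
    best = None
    for subset in combinations(range(len(X[0])), k):
        X_sub = [[row[j] for j in subset] for row in X]
        ll = log_likelihood(fit_logistic(X_sub, y), X_sub, y)
        if best is None or ll > best[1]:
            best = (subset, ll)
    return best


def relative_gap(bound, incumbent):
    """Assumed convention: (bound - incumbent) / |incumbent|, clamped at zero
    to absorb numerical error in the approximate fits."""
    return max(0.0, bound - incumbent) / abs(incumbent)


# Hypothetical toy data: variable 0 is informative, variables 1 and 2 are not.
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
X = [
    [0, 0, 1], [0, 1, 1], [0, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0],
    [1, 0, 1], [1, 1, 0], [1, 0, 1], [1, 1, 0], [1, 0, 0], [0, 1, 1],
]

subset, incumbent = best_subset(X, y, k=1)
# Dropping the cardinality constraint (all variables allowed) gives an upper bound.
bound = log_likelihood(fit_logistic(X, y), X, y)
gap = relative_gap(bound, incumbent)
print(subset, round(incumbent, 3), round(gap, 3))
```

Enumeration scales combinatorially with the number of variables, which is precisely why exact approaches such as the MIP formulation described above, and heuristics that prune its search space, are of interest.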
Source journal: SOUTH AFRICAN STATISTICAL JOURNAL (Statistics & Probability)
CiteScore: 0.30
Self-citation rate: 0.00%
Articles published: 18
About the journal: The journal will publish innovative contributions to the theory and application of statistics. Authoritative review articles on topics of general interest that are not readily accessible in a coherent form will also be considered for publication. Articles on applications or of a general nature will be published in separate sections, and an author should indicate which of these sections an article is intended for. An applications article should normally consist of the analysis of actual data and need not necessarily contain new theory. The data should be made available with the article but need not necessarily be part of it.