Assessing the impact on quality of prediction and inference from balancing in multilevel logistic regression

Healthcare Analytics (New York, N.Y.), Volume 6, Article 100359. Pub Date: 2024-08-22. DOI: 10.1016/j.health.2024.100359

Carolina Gonzalez-Canas, Gustavo A. Valencia-Zapata, Ana Maria Estrada Gomez, Zachary Hass



Abstract

The primary goal of this research is to examine the impact of balancing data on the prediction quality and inference in multilevel logistic regression models. Logistic regression is a valuable approach for modeling binary outcomes expected in health applications. The class imbalance problem, where one of the two outcome categories occurs much more often than the other, is common in healthcare data, such as when modeling the risk factors for rare diseases. The issue is particularly relevant for medical data that contains individual measurements and other data sources measured at a geographic region level, such as environmental risk factors. For this work, both prediction and model interpretation are of interest. A simulation model is proposed to test the impact of balancing strategies on the logistic multilevel model's parameter estimation, inference, and predictive performance. The simulated information emulates characteristics of a Gestational Diabetes Mellitus (GDM) dataset from Indiana's Medicaid program. Several datasets were simulated with varying levels of complexity, involving the balance of the outcome variable and predictors. These datasets exhibited high- or low-frequency occurrences in specific intersections of variables, often called ‘cells.’ The impact of the balancing strategies on prediction and inference was assessed using different techniques, such as the Equivalence (TOST) Test, power analysis, and predictive measures. To the best of our knowledge, this is the first research that explores the impact of using balanced samples on coefficient estimation and prediction measures when using logistic multilevel modeling, finding evidence about the benefits of using balanced samples in this context.
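The setup described in the abstract can be illustrated with a minimal sketch, which is not the authors' simulation design. It assumes a two-level random-intercept logistic model (individuals nested in regions) with a rare outcome, and it uses random undersampling of the majority class as one possible balancing strategy; the abstract does not say which strategies the paper evaluates. All function names and parameter values (`simulate_multilevel`, `undersample`, `beta0`, `beta1`, `sigma_u`) are hypothetical and chosen only for illustration.

```python
import math
import random


def simulate_multilevel(n_regions=20, n_per_region=200, beta0=-3.0,
                        beta1=0.8, sigma_u=0.5, seed=0):
    """Simulate a two-level random-intercept logistic dataset.

    Assumed model (illustrative only):
        logit P(y = 1) = beta0 + beta1 * x + u_region,
        u_region ~ Normal(0, sigma_u^2).
    A strongly negative beta0 makes the positive class rare.
    """
    rng = random.Random(seed)
    rows = []
    for region in range(n_regions):
        u = rng.gauss(0.0, sigma_u)      # region-level random intercept
        for _ in range(n_per_region):
            x = rng.gauss(0.0, 1.0)      # individual-level predictor
            p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x + u)))
            y = 1 if rng.random() < p else 0
            rows.append((region, x, y))
    return rows


def undersample(rows, seed=0):
    """Randomly drop majority-class rows until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [r for r in rows if r[2] == 1]
    neg = [r for r in rows if r[2] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = rng.sample(majority, len(minority))
    return minority + kept


data = simulate_multilevel()
balanced = undersample(data)

# Share of positive outcomes before and after balancing.
prevalence = sum(y for _, _, y in data) / len(data)
balanced_prevalence = sum(y for _, _, y in balanced) / len(balanced)
print(f"original prevalence: {prevalence:.3f}")
print(f"balanced prevalence: {balanced_prevalence:.3f}")
```

Fitting a multilevel logistic model to `data` and to `balanced`, then comparing the coefficient estimates and predictive measures, would mirror the kind of comparison the abstract describes. That fitting step is left out here because it depends on the modeling library used.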

