Flexible variable selection in the presence of missing data.

IF 1.2, Zone 4 (Mathematics)

International Journal of Biostatistics. Pub Date: 2024-02-13. eCollection Date: 2024-11-01. DOI: 10.1515/ijb-2023-0059

Brian D Williamson, Ying Huang

Pages: 347-359. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11323294/pdf/

Citations: 0

Abstract

In many applications, it is of interest to identify a parsimonious set of features, or panel, from multiple candidates that achieves a desired level of performance in predicting a response. This task is often complicated in practice by missing data arising from the sampling design or other random mechanisms. Most recent work on variable selection in missing data contexts relies in some part on a finite-dimensional statistical model, e.g., a generalized or penalized linear model. In cases where this model is misspecified, the selected variables may not all be truly scientifically relevant and can result in panels with suboptimal classification performance. To address this limitation, we propose a nonparametric variable selection algorithm combined with multiple imputation to develop flexible panels in the presence of missing-at-random data. We outline strategies based on the proposed algorithm that achieve control of commonly used error rates. Through simulations, we show that our proposal has good operating characteristics and results in panels with higher classification and variable selection performance compared to several existing penalized regression approaches in cases where a generalized linear model is misspecified. Finally, we use the proposed method to develop biomarker panels for separating pancreatic cysts with differing malignancy potential in a setting where complicated missingness in the biomarkers arose due to limited specimen volumes.
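The abstract describes pairing multiple imputation with a variable selection step. The general shape of that pairing can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' algorithm: the marginal hot-deck imputation, the univariate AUC screen, and the names `hot_deck_impute`, `select_with_imputation`, `auc_margin`, and `min_frequency` are all assumptions introduced here. Marginal hot-deck draws are a simplification that is only appropriate when data are missing completely at random. The paper's setting is missing-at-random data with a nonparametric importance measure, which this sketch does not implement.

```python
import random


def auc(scores, labels):
    """Empirical AUC: fraction of (case, control) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def hot_deck_impute(column, rng):
    """Replace each missing entry (None) with a random draw from the observed entries."""
    observed = [v for v in column if v is not None]
    return [v if v is not None else rng.choice(observed) for v in column]


def select_with_imputation(X, y, n_imputations=10, auc_margin=0.2,
                           min_frequency=0.5, seed=0):
    """Screen features within each imputed dataset, then keep the features
    that pass the screen in at least `min_frequency` of the imputations."""
    rng = random.Random(seed)
    n_features = len(X[0])
    counts = [0] * n_features
    for _ in range(n_imputations):
        for j in range(n_features):
            column = hot_deck_impute([row[j] for row in X], rng)
            # Direction-free screen: how far the feature's AUC is from 0.5.
            if abs(auc(column, y) - 0.5) >= auc_margin:
                counts[j] += 1
    return [j for j in range(n_features)
            if counts[j] / n_imputations >= min_frequency]


# Hypothetical toy data: feature 0 separates the classes, features 1 and 2 are noise.
data_rng = random.Random(1)
y = [i % 2 for i in range(200)]
X = [[3.0 * label + data_rng.gauss(0, 1), data_rng.gauss(0, 1), data_rng.gauss(0, 1)]
     for label in y]
# Knock out roughly 20% of the entries at random.
X = [[None if data_rng.random() < 0.2 else v for v in row] for row in X]

selected = select_with_imputation(X, y)
print("selected feature indices:", selected)
```

Pooling by selection frequency is one simple way to combine per-imputation results. The error-rate control strategies mentioned in the abstract would need the procedures described in the paper itself.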


Source journal

International Journal of Biostatistics (Mathematics: Statistics and Probability)

CiteScore

2.30

Self-citation rate

8.30%

Publication volume

Journal description: The International Journal of Biostatistics (IJB) seeks to publish new biostatistical models and methods, new statistical theory, and original applications of statistical methods for important practical problems arising from the biological, medical, public health, and agricultural sciences, with an emphasis on semiparametric methods. Given that many publication venues exist within biostatistics, IJB offers a home for research focusing on modern methods, often based on machine learning and other data-adaptive methodologies. It also aims to provide a unique reading experience by compelling the author to be explicit about the statistical inference problem addressed by the paper. The journal is intended to cover the entire range of biostatistics, from theoretical advances to relevant and sensible translations of a practical problem into a statistical framework. Electronic publication also allows data and software code to be appended, which opens the door to reproducible research by letting readers easily replicate the analyses described in a paper. Both original research and review articles will be warmly received, as will articles applying sound statistical methods to practical problems.