Biometrics · Pub Date: 2025-07-03 · DOI: 10.1093/biomtc/ujaf092
Lingxiao Wang
{"title":"Using model-assisted calibration methods to improve efficiency of regression analyses using two-phase samples or pooled samples under complex survey designs.","authors":"Lingxiao Wang","doi":"10.1093/biomtc/ujaf092","DOIUrl":"10.1093/biomtc/ujaf092","url":null,"abstract":"<p><p>Two-phase sampling designs are frequently applied in epidemiological studies and large-scale health surveys. In such designs, certain variables are collected exclusively within a second-phase random subsample of the initial first-phase sample, often due to factors such as high costs, response burden, or constraints on data collection or assessment. Consequently, second-phase sample estimators can be inefficient due to the diminished sample size. Model-assisted calibration methods have been used to improve the efficiency of second-phase estimators in regression analysis. However, limited literature provides valid finite population inferences of the calibration estimators that use appropriate calibration auxiliary variables while simultaneously accounting for the complex sample designs in the first- and second-phase samples. Moreover, no literature considers the \"pooled design\" where some covariates are measured exclusively in certain repeated survey cycles. This paper proposes calibrating the sample weights for the second-phase sample to the weighted first-phase sample based on score functions of the regression model that uses predictions of the second-phase variable for the first-phase sample. We establish the consistency of estimation using calibrated weights and provide variance estimation for the regression coefficients under the two-phase design or the pooled design nested within complex survey designs. Empirical evidence highlights the efficiency and robustness of the proposed calibration compared to existing calibration and imputation methods. 
Data examples from the National Health and Nutrition Examination Survey are provided.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"81 3","pages":""},"PeriodicalIF":1.7,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12288669/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144706201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
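The core step the abstract describes, reweighting the second-phase sample so that weighted totals of first-phase auxiliary variables (in the paper, score functions built from predictions of the second-phase variable) are reproduced, can be illustrated with a minimal linear (GREG-type) calibration sketch. This is generic textbook calibration, not the paper's estimator, and `calibrate_weights` plus the example data are hypothetical names for illustration only.

```python
import numpy as np

def calibrate_weights(d, X, totals):
    """Linear (GREG-type) calibration: adjust design weights d so the
    weighted totals of the auxiliary matrix X match `totals` exactly.
    Returns w = d * (1 + X @ lam), with lam solved from the calibration
    equations  X^T diag(d) X lam = totals - X^T d."""
    d = np.asarray(d, float)
    X = np.asarray(X, float)
    A = X.T @ (d[:, None] * X)          # X^T diag(d) X
    b = totals - X.T @ d                # gap between targets and current totals
    lam = np.linalg.solve(A, b)
    return d * (1.0 + X @ lam)

# Hypothetical example: 50 second-phase units, two auxiliaries
# (an intercept and one score-like variable), known first-phase totals.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
d = np.full(50, 2.0)                    # initial design weights
totals = np.array([120.0, 5.0])         # first-phase weighted totals
w = calibrate_weights(d, X, totals)
```

After calibration, `X.T @ w` reproduces `totals` exactly, which is the defining property the paper's variance theory builds on.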
Biometrics · Pub Date: 2025-07-03 · DOI: 10.1093/biomtc/ujaf110
Tal Agassi, Nir Keret, Malka Gorfine
{"title":"Mastering rare event analysis: subsample-size determination in Cox and logistic regressions.","authors":"Tal Agassi, Nir Keret, Malka Gorfine","doi":"10.1093/biomtc/ujaf110","DOIUrl":"https://doi.org/10.1093/biomtc/ujaf110","url":null,"abstract":"<p><p>In the realm of contemporary data analysis, the use of massive datasets has taken on heightened significance, albeit often entailing considerable demands on computational time and memory. While a multitude of existing works offer optimal subsampling methods for conducting analyses on subsamples with minimized efficiency loss, they notably lack tools for judiciously selecting the subsample size. To bridge this gap, our work introduces tools designed for choosing the subsample size. We focus on three settings: the Cox regression model for survival data with rare events, and logistic regression for both balanced and imbalanced datasets. Additionally, we present a new optimal subsampling procedure tailored to logistic regression with imbalanced data. The efficacy of these tools and procedures is demonstrated through an extensive simulation study and meticulous analyses of two sizable datasets: survival analysis of UK Biobank colorectal cancer data with about 350 million rows and logistic regression of linked birth and infant death data with about 28 million observations.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"81 3","pages":""},"PeriodicalIF":1.7,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144941121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
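The subsampling setting the abstract works in can be sketched as follows: fit a pilot logistic regression, then sample with probabilities proportional to an optimality score. The score used here, |y - p| times the covariate norm, is one common choice from the optimal-subsampling literature, not the authors' new procedure, and their subsample-size determination tools are not reproduced; `logistic_irls` and `subsample_probs` are illustrative names.

```python
import numpy as np

def logistic_irls(X, y, iters=8):
    # Plain Newton/IRLS fit of logistic regression (include a column of
    # ones in X if an intercept is wanted).
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        H = X.T @ (W[:, None] * X)      # Hessian
        g = X.T @ (y - p)               # score
        beta += np.linalg.solve(H, g)
    return beta

def subsample_probs(X, y, beta_pilot):
    # A-optimality-flavoured sampling scores |y - p| * ||x||, normalised
    # to probabilities (a standard rule; not the paper's exact proposal).
    p = 1.0 / (1.0 + np.exp(-X @ beta_pilot))
    score = np.abs(y - p) * np.linalg.norm(X, axis=1)
    return score / score.sum()

# Hypothetical rare-event data: intercept -2 gives roughly 12% events.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(300), rng.normal(size=(300, 2))])
beta_true = np.array([-2.0, 1.0, 0.5])
y = (rng.random(300) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)
probs = subsample_probs(X, y, logistic_irls(X, y))
```

One would then draw the subsample with these probabilities and refit with inverse-probability weights; choosing how large that subsample must be is the gap the paper addresses.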
Biometrics · Pub Date: 2025-07-03 · DOI: 10.1093/biomtc/ujaf049
Malka Gorfine, David M Zucker, Shoval Shoham
{"title":"Cumulative incidence function estimation using population-based biobank data.","authors":"Malka Gorfine, David M Zucker, Shoval Shoham","doi":"10.1093/biomtc/ujaf049","DOIUrl":"https://doi.org/10.1093/biomtc/ujaf049","url":null,"abstract":"<p><p>Many countries have established population-based biobanks, which are being used increasingly in epidemiological and clinical research. These biobanks offer opportunities for large-scale studies addressing questions beyond the scope of traditional clinical trials or cohort studies. However, using biobank data poses new challenges. Typically, biobank data are collected from a study cohort recruited over a defined calendar period, with subjects entering the study at various ages falling between $c_L$ and $c_U$. This work focuses on biobank data with individuals reporting disease-onset age upon recruitment, termed prevalent data, along with individuals initially recruited as healthy, and their disease onset observed during the follow-up period. We propose a novel cumulative incidence function (CIF) estimator that efficiently incorporates prevalent cases, in contrast to existing methods, providing two advantages: (1) increased efficiency and (2) CIF estimation for ages before the lower limit, $c_L$.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"81 3","pages":""},"PeriodicalIF":1.7,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144783415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
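For readers unfamiliar with the estimand, a CIF under competing risks can be estimated nonparametrically by the standard Aalen-Johansen construction, sketched below for incident cases only. This baseline does not incorporate prevalent cases, which is precisely the efficiency gain the paper proposes; `cumulative_incidence` is an illustrative name.

```python
import numpy as np

def cumulative_incidence(time, event, cause=1):
    """Aalen-Johansen estimate of the cumulative incidence function for
    `cause` under right censoring (event = 0 censored; 1, 2, ... causes).
    Returns event times and the CIF evaluated just after each."""
    order = np.argsort(time)
    time, event = np.asarray(time)[order], np.asarray(event)[order]
    surv = 1.0          # overall survival just before t (all causes)
    cif = 0.0
    times, values = [], []
    for t in np.unique(time[event > 0]):
        at_risk = np.sum(time >= t)
        d_cause = np.sum((time == t) & (event == cause))
        d_any = np.sum((time == t) & (event > 0))
        cif += surv * d_cause / at_risk     # hazard for this cause times S(t-)
        surv *= 1.0 - d_any / at_risk       # update all-cause survival
        times.append(t)
        values.append(cif)
    return np.array(times), np.array(values)

# Tiny hypothetical example with two competing causes, no censoring.
time = np.array([1, 2, 3, 4, 5, 6])
event = np.array([1, 2, 1, 1, 2, 1])
t1, v1 = cumulative_incidence(time, event, cause=1)
t2, v2 = cumulative_incidence(time, event, cause=2)
```

With no censoring the cause-specific CIFs sum to one at the last event time, a useful sanity check on any implementation.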
Biometrics · Pub Date: 2025-07-03 · DOI: 10.1093/biomtc/ujaf120
Yifan Dai, Di Wu, Yufeng Liu
{"title":"Statistical significance of clustering for count data.","authors":"Yifan Dai, Di Wu, Yufeng Liu","doi":"10.1093/biomtc/ujaf120","DOIUrl":"10.1093/biomtc/ujaf120","url":null,"abstract":"<p><p>Clustering is widely used in biomedical research for meaningful subgroup identification. However, most existing clustering algorithms do not account for the statistical uncertainty of the resulting clusters and consequently may generate spurious clusters due to natural sampling variation. To address this problem, the Statistical Significance of Clustering (SigClust) method was developed to evaluate the significance of clusters in high-dimensional data. While SigClust has been successful in assessing clustering significance for continuous data, it is not specifically designed for discrete data, such as count data in genomics. Moreover, SigClust and its variations can suffer from reduced statistical power when applied to non-Gaussian high-dimensional data. To overcome these limitations, we propose SigClust-DEV, a method designed to evaluate the significance of clusters in count data. Through extensive simulations, we compare SigClust-DEV against other existing SigClust approaches across various count distributions and demonstrate its superior performance. 
Furthermore, we apply our proposed SigClust-DEV to Hydra single-cell RNA sequencing (scRNA) data and electronic health records (EHRs) of cancer patients to identify meaningful latent cell types and patient subgroups, respectively.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"81 3","pages":""},"PeriodicalIF":1.7,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12448855/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145091099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
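The SigClust idea the abstract builds on can be sketched in a few lines: compute a 2-means cluster index for the data, then compare it with indices from data simulated under a single-cluster null. The sketch below uses the simplest Gaussian null with matched marginal variances; the paper's SigClust-DEV targets count distributions instead. `two_means_ci` and `sigclust_pvalue` are illustrative names.

```python
import numpy as np

def two_means_ci(X, iters=20):
    # Cluster index: within-cluster SS of a 2-means split over total SS.
    mu = X.mean(0)
    total = ((X - mu) ** 2).sum()
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    lab = (X - mu) @ Vt[0] > 0          # deterministic init: split on PC1
    for _ in range(iters):              # Lloyd iterations
        if lab.all() or not lab.any():
            return 1.0                  # degenerate split: no clustering
        c0, c1 = X[~lab].mean(0), X[lab].mean(0)
        lab = ((X - c1) ** 2).sum(1) < ((X - c0) ** 2).sum(1)
    if lab.all() or not lab.any():
        return 1.0
    wss = (((X[~lab] - X[~lab].mean(0)) ** 2).sum()
           + ((X[lab] - X[lab].mean(0)) ** 2).sum())
    return wss / total

def sigclust_pvalue(X, n_sim=200, seed=0):
    # Monte-Carlo p-value: fraction of null (independent Gaussian, matched
    # per-coordinate sd) datasets with a cluster index at least as small.
    rng = np.random.default_rng(seed)
    obs = two_means_ci(X)
    sd = X.std(0)
    null = [two_means_ci(rng.normal(0, sd, size=X.shape)) for _ in range(n_sim)]
    return (1 + sum(ci <= obs for ci in null)) / (n_sim + 1)

# Hypothetical two-cluster data: 6-sd separation in 5 dimensions.
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(size=(30, 5)), rng.normal(size=(30, 5)) + 6.0])
p = sigclust_pvalue(X, n_sim=50, seed=7)
```

A small p-value indicates the observed split is tighter than sampling variation under the single-cluster null can explain.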
Biometrics · Pub Date: 2025-07-03 · DOI: 10.1093/biomtc/ujaf094
John Neuhaus, Charles McCulloch, Ross Boylan
{"title":"Improved prediction and flagging of extreme random effects for non-Gaussian outcomes using weighted methods.","authors":"John Neuhaus, Charles McCulloch, Ross Boylan","doi":"10.1093/biomtc/ujaf094","DOIUrl":"10.1093/biomtc/ujaf094","url":null,"abstract":"<p><p>Investigators often focus on predicting extreme random effects from mixed effects models fitted to longitudinal or clustered data, and on identifying or \"flagging\" outliers such as poorly performing hospitals or rapidly deteriorating patients. Our recent work with Gaussian outcomes showed that weighted prediction methods can substantially reduce mean square error of prediction for extremes and substantially increase correct flagging rates compared to previous methods, while controlling the incorrect flagging rates. This paper extends the weighted prediction methods to non-Gaussian outcomes such as binary and count data. Closed-form expressions for predicted random effects and probabilities of correct and incorrect flagging are not available for the usual non-Gaussian outcomes, and the computational challenges are substantial. Therefore, our results include the development of theory to support algorithms that tune predictors that we call \"self-calibrated\" (which control the incorrect flagging rate using very simple flagging rules) and innovative numerical methods to calculate weighted predictors as well as to evaluate their performance. Comprehensive numerical evaluations show that the novel weighted predictors for non-Gaussian outcomes have substantially lower mean square error of prediction at the extremes and considerably higher correct flagging rates than previously proposed methods, while controlling the incorrect flagging rates. 
We illustrate our new methods using data on emergency room readmissions for children with asthma.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"81 3","pages":""},"PeriodicalIF":1.7,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12309285/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144741072","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
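To convey what "weighted prediction" means here, the sketch below computes, for the closed-form Gaussian case the authors' earlier work covered, the predictor minimising a weighted squared error E[w(b)(b - a)^2 | data], which equals E[w(b) b | data] / E[w(b) | data]. The non-Gaussian case is where the paper's algorithms are needed; this Gaussian sketch with hypothetical function `weighted_predictor` is only meant to show how a weight tilting toward extremes reduces shrinkage.

```python
import numpy as np

def weighted_predictor(ybar, n_i, sigma2_e, sigma2_b, weight):
    """Weighted prediction of a Gaussian random effect b from a cluster
    mean `ybar` of n_i observations. Minimising E[weight(b)*(b - a)^2 | data]
    gives a = E[weight(b)*b | data] / E[weight(b) | data], evaluated here by
    numerical integration over the (Gaussian) posterior of b.
    weight(b) = 1 recovers the usual BLUP / posterior mean."""
    shrink = sigma2_b / (sigma2_b + sigma2_e / n_i)
    post_mean = shrink * ybar
    post_var = shrink * sigma2_e / n_i
    grid = post_mean + np.linspace(-6, 6, 2001) * np.sqrt(post_var)
    dens = np.exp(-0.5 * (grid - post_mean) ** 2 / post_var)
    w = weight(grid) * dens
    # Uniform grid, so the spacing cancels in the ratio of sums.
    return (w * grid).sum() / w.sum()

# Unweighted case reproduces the shrinkage (BLUP) predictor; an |b| weight
# tilts the prediction away from zero, i.e. shrinks extreme clusters less.
blup = weighted_predictor(2.0, 5, 1.0, 1.0, lambda b: np.ones_like(b))
wpred = weighted_predictor(2.0, 5, 1.0, 1.0, lambda b: np.abs(b))
```

The flagging rules in the paper then threshold such predictors so that the incorrect flagging rate is controlled.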
{"title":"A monotone single index model for spatially referenced multistate current status data.","authors":"Snigdha Das, Minwoo Chae, Debdeep Pati, Dipankar Bandyopadhyay","doi":"10.1093/biomtc/ujaf105","DOIUrl":"https://doi.org/10.1093/biomtc/ujaf105","url":null,"abstract":"<p><p>Assessment of multistate disease progression is commonplace in biomedical research, such as in periodontal disease (PD). However, the presence of multistate current status endpoints, where only a single snapshot of each subject's progression through disease states is available at a random inspection time after a known starting state, complicates the inferential framework. In addition, these endpoints can be clustered, and spatially associated, where a group of proximally located teeth (within subjects) may experience similar PD status, compared to those distally located. Motivated by a clinical study recording PD progression, we propose a Bayesian semiparametric accelerated failure time model with an inverse-Wishart proposal for accommodating (spatial) random effects, and flexible errors that follow a Dirichlet process mixture of Gaussians. For clinical interpretability, the systematic component of the event times is modeled using a monotone single index model, with the (unknown) link function estimated via a novel integrated basis expansion and basis coefficients endowed with constrained Gaussian process priors. In addition to establishing parameter identifiability, we present scalable computing via a combination of elliptical slice sampling, fast circulant embedding techniques, and smoothing of hard constraints, leading to straightforward estimation of parameters, and state occupation and transition probabilities. Using synthetic data, we study the finite sample properties of our Bayesian estimates and their performance under model misspecification. 
We also illustrate our method via application to the real clinical PD dataset.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"81 3","pages":""},"PeriodicalIF":1.7,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12391879/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144941056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
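The monotone-link ingredient of the model above can be illustrated without any of the Bayesian machinery: given a fixed index direction, a monotone estimate of the link follows from isotonic regression (pool-adjacent-violators) along the sorted index. The paper instead estimates the link with an integrated basis expansion and constrained Gaussian process priors, jointly with the direction; `pava` and `monotone_index_fit` below are a hypothetical frequentist sketch.

```python
import numpy as np

def pava(y):
    # Pool-adjacent-violators: least-squares nondecreasing fit to y.
    level, weight = [], []
    for v in map(float, y):
        level.append(v); weight.append(1.0)
        while len(level) > 1 and level[-2] > level[-1]:
            w = weight[-2] + weight[-1]
            m = (level[-2] * weight[-2] + level[-1] * weight[-1]) / w
            level[-2:] = [m]; weight[-2:] = [w]
    out = []
    for m, w in zip(level, weight):
        out.extend([m] * int(w))
    return np.array(out)

def monotone_index_fit(X, y, beta):
    # Given an index direction beta, estimate the monotone link g in
    # y ~ g(X beta) by isotonic regression along the sorted index.
    u = X @ (beta / np.linalg.norm(beta))
    order = np.argsort(u)
    ghat = np.empty(len(y))
    ghat[order] = pava(np.asarray(y, float)[order])
    return u, ghat

# Hypothetical data: a monotone cubic link of a 2-variable index plus noise.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
beta = np.array([1.0, 2.0, 0.0])
y = (X @ beta) ** 3 / 10 + rng.normal(scale=0.1, size=100)
u, ghat = monotone_index_fit(X, y, beta)
```

The fitted values are guaranteed nondecreasing in the index, which is the interpretability constraint the single index model trades on.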
Biometrics · Pub Date: 2025-07-03 · DOI: 10.1093/biomtc/ujaf088
Simon N Wood
{"title":"Simple simulation based reconstruction of incidence rates from death data.","authors":"Simon N Wood","doi":"10.1093/biomtc/ujaf088","DOIUrl":"10.1093/biomtc/ujaf088","url":null,"abstract":"<p><p>Daily deaths from an infectious disease provide a means for retrospectively inferring daily incidence, given knowledge of the infection-to-death interval distribution. Existing methods for doing so rely either on fitting simplified non-linear epidemic models to the deaths data or on spline based deconvolution approaches. The former runs the risk of introducing unintended artefacts via the model formulation, while the latter may be viewed as technically obscure, impeding uptake by practitioners. This note proposes a simple simulation based approach to inferring fatal incidence from deaths that requires minimal assumptions, is easy to understand, and allows testing of alternative hypothesized incidence trajectories. The aim is that in any future situation similar to the COVID pandemic, the method can be easily, rapidly, transparently, and uncontroversially deployed as an input to management.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"81 3","pages":""},"PeriodicalIF":1.4,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144706185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
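The forward model underlying this kind of reconstruction is a discrete convolution: expected deaths on day t accumulate earlier fatal incidence weighted by the infection-to-death interval distribution. A candidate incidence trajectory can then be scored against observed deaths, e.g. by a Poisson log-likelihood. The sketch below shows that forward model only; it is a generic convolution, not the paper's simulation procedure, and `expected_deaths` / `poisson_nll` are illustrative names.

```python
import numpy as np

def expected_deaths(incidence, delay_pmf):
    """Expected daily deaths given daily fatal incidence and the
    infection-to-death interval distribution (discrete pmf over lags)."""
    incidence = np.asarray(incidence, float)
    T = len(incidence)
    mu = np.zeros(T)
    for lag, p in enumerate(delay_pmf):
        if lag >= T:
            break
        mu[lag:] += p * incidence[:T - lag]   # infections lag days ago
    return mu

def poisson_nll(deaths, mu, eps=1e-12):
    # Poisson negative log-likelihood (up to a constant) for scoring a
    # hypothesised incidence trajectory against observed deaths.
    mu = np.maximum(mu, eps)
    return np.sum(mu - np.asarray(deaths) * np.log(mu))

# Hypothetical check: 100 infections on day 0, deaths split evenly
# between lags 1 and 2, yields 50 expected deaths on each of days 1 and 2.
mu = expected_deaths([100, 0, 0, 0, 0], [0.0, 0.5, 0.5])
```

Comparing `poisson_nll` across alternative hypothesised trajectories is the simple hypothesis-testing use the note has in mind.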
Biometrics · Pub Date: 2025-07-03 · DOI: 10.1093/biomtc/ujaf101
Yisen Jin, Aaron J Molstad, Ander Wilson, Joseph Antonelli
{"title":"Smooth and shape-constrained quantile distributed lag models.","authors":"Yisen Jin, Aaron J Molstad, Ander Wilson, Joseph Antonelli","doi":"10.1093/biomtc/ujaf101","DOIUrl":"10.1093/biomtc/ujaf101","url":null,"abstract":"<p><p>Exposure to environmental pollutants during the gestational period can significantly impact infant health outcomes, such as birth weight and neurological development. Identifying critical windows of susceptibility, which are specific periods during pregnancy when exposure has the most profound effects, is essential for developing targeted interventions. Distributed lag models (DLMs) are widely used in environmental epidemiology to analyze the temporal patterns of exposure and their impact on health outcomes. However, traditional DLMs focus on modeling the conditional mean, which may fail to capture heterogeneity in the relationship between predictors and the outcome. Moreover, when modeling the distribution of health outcomes like gestational birth weight, it is the extreme quantiles that are of most clinical relevance. We introduce 2 new quantile distributed lag model (QDLM) estimators designed to address the limitations of existing methods by leveraging smoothness and shape constraints, such as unimodality and concavity, to enhance interpretability and efficiency. 
We apply our QDLM estimators to the Colorado birth cohort data, demonstrating their effectiveness in identifying critical windows of susceptibility and informing public health interventions.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"81 3","pages":""},"PeriodicalIF":1.7,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12381565/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144941091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
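A bare-bones QDLM can be sketched as penalised check-loss (pinball) regression of an outcome quantile on lagged exposures, with a roughness penalty on the lag-coefficient curve. The sketch below imposes only smoothness via a second-difference penalty, not the unimodality or concavity shape constraints that are the paper's contribution; `fit_qdlm`, the penalty weight, and the optimiser choice are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def check_loss(u, tau):
    # Pinball/check loss for quantile level tau.
    return np.sum(u * (tau - (u < 0)))

def fit_qdlm(L, y, tau=0.5, lam=1.0):
    """Penalised quantile distributed lag fit:
    y ~ theta0 + sum_l theta[l] * exposure_at_lag_l, with a
    second-difference roughness penalty on the lag curve theta.
    L is an (n, n_lags) matrix of lagged exposures."""
    n_lag = L.shape[1]
    D = np.diff(np.eye(n_lag), 2, axis=0)   # second-difference operator
    def objective(par):
        theta0, theta = par[0], par[1:]
        return (check_loss(y - theta0 - L @ theta, tau)
                + lam * np.sum((D @ theta) ** 2))
    res = minimize(objective, np.zeros(1 + n_lag), method="Powell")
    return res.x[0], res.x[1:]

# Hypothetical example: a smooth hump of lag effects, median regression.
rng = np.random.default_rng(3)
L = rng.normal(size=(120, 6))
theta_true = np.array([0.0, 0.5, 1.0, 0.5, 0.0, 0.0])
y = 1.0 + L @ theta_true + rng.normal(scale=0.1, size=120)
theta0, theta = fit_qdlm(L, y, tau=0.5, lam=0.1)
```

Refitting at tau = 0.05 or 0.95 targets the clinically relevant extreme quantiles of outcomes such as birth weight.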
Biometrics · Pub Date: 2025-07-03 · DOI: 10.1093/biomtc/ujaf089
Fangting Zhou, Kejun He, Yang Ni
{"title":"Tree-based additive noise directed acyclic graphical models for nonlinear causal discovery with interactions.","authors":"Fangting Zhou, Kejun He, Yang Ni","doi":"10.1093/biomtc/ujaf089","DOIUrl":"10.1093/biomtc/ujaf089","url":null,"abstract":"<p><p>Directed acyclic graphical models with additive noises are essential in nonlinear causal discovery and have numerous applications in various domains, such as social science and systems biology. Most such models further assume that structural causal functions are additive to ensure causal identifiability and computational feasibility, which may be too restrictive in the presence of causal interactions. Some methods consider general nonlinear causal functions represented by, for example, Gaussian processes and neural networks, to accommodate interactions. However, they are either computationally intensive or lack interpretability. We propose a highly interpretable and computationally feasible approach using trees to incorporate interactions in nonlinear causal discovery, termed tree-based additive noise models. The nature of the tree construction leads to piecewise constant causal functions, making existing causal identifiability results of additive noise models with continuous and smooth causal functions inapplicable. Therefore, we provide new conditions under which the proposed model is identifiable. We develop a recursive algorithm for source node identification and a score-based ordering search algorithm. Through extensive simulations, we demonstrate the utility of the proposed model and algorithms benchmarking against existing additive noise models, especially when there are strong causal interactions. 
Our method is applied to infer a protein-protein interaction network for breast cancer, where proteins may form protein complexes to perform their functions.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"81 3","pages":""},"PeriodicalIF":1.7,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12288665/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144706199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Biometrics · Pub Date: 2025-07-03 · DOI: 10.1093/biomtc/ujaf095
Guorong Dai, Raymond J Carroll, Jinbo Chen
{"title":"Valid and efficient inference for nonparametric variable importance in two-phase studies.","authors":"Guorong Dai, Raymond J Carroll, Jinbo Chen","doi":"10.1093/biomtc/ujaf095","DOIUrl":"10.1093/biomtc/ujaf095","url":null,"abstract":"<p><p>We consider a common nonparametric regression setting, where the data consist of a response variable Y, some easily obtainable covariates $\mathbf{X}$, and a set of costly covariates $\mathbf{Z}$. Before establishing predictive models for Y, a natural question arises: Is it worthwhile to include $\mathbf{Z}$ as predictors, given the additional cost of collecting data on $\mathbf{Z}$ both for training the models and for predicting Y for future individuals? We therefore aim to conduct preliminary investigations to infer the importance of $\mathbf{Z}$ in predicting Y in the presence of $\mathbf{X}$. To achieve this goal, we propose a nonparametric variable importance measure for $\mathbf{Z}$. It is defined as a parameter that aggregates the maximum potential contributions of $\mathbf{Z}$ in single or multiple predictive models, with contributions quantified by general loss functions. Considering two-phase data that provide a large number of observations for $(Y,\mathbf{X})$ with the expensive $\mathbf{Z}$ measured only in a small subsample, we develop a novel approach to infer the proposed importance measure, accommodating missingness of $\mathbf{Z}$ in the sample by substituting functions of $(Y,\mathbf{X})$ for each individual's contribution to the predictive loss of models involving $\mathbf{Z}$. Our approach attains unified and efficient inference regardless of whether $\mathbf{Z}$ makes zero or positive contribution to predicting Y, a desirable yet surprising property owing to data incompleteness. As intermediate steps of our theoretical development, we establish novel results in two relevant research areas, semi-supervised inference and two-phase nonparametric estimation. 
Numerical results from both simulated and real data demonstrate superior performance of our approach.</p>","PeriodicalId":8930,"journal":{"name":"Biometrics","volume":"81 3","pages":""},"PeriodicalIF":1.7,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12312401/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144752245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
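The estimand here, the drop in optimal predictive loss from adding Z to the predictors, has a simple plug-in analogue on fully observed data: compare out-of-fold losses of learners fit with and without Z. The sketch below uses OLS and squared error as stand-ins for the general learners and loss functions in the paper, and ignores the two-phase missingness that the paper's method is designed to handle; `mse_oof` and `variable_importance` are illustrative names.

```python
import numpy as np

def mse_oof(features, y, n_folds=5):
    # Out-of-fold MSE of an OLS fit (a stand-in for an arbitrary learner).
    n = len(y)
    idx = np.arange(n) % n_folds
    err = np.empty(n)
    for f in range(n_folds):
        tr, te = idx != f, idx == f
        A = np.column_stack([np.ones(tr.sum()), features[tr]])
        coef, *_ = np.linalg.lstsq(A, y[tr], rcond=None)
        A_te = np.column_stack([np.ones(te.sum()), features[te]])
        err[te] = (y[te] - A_te @ coef) ** 2
    return err.mean()

def variable_importance(X, Z, y):
    """Plug-in loss-based importance of Z given X: the drop in
    out-of-sample squared error from adding Z to the predictors
    (nonnegative in population). The paper's contribution is valid,
    efficient *inference* for this kind of target under two-phase
    sampling, which a point estimate alone does not provide."""
    return mse_oof(X, y) - mse_oof(np.column_stack([X, Z]), y)

# Hypothetical data where Z carries strong signal beyond X.
rng = np.random.default_rng(5)
X = rng.normal(size=400)
Z = rng.normal(size=400)
y = X + 2.0 * Z + rng.normal(scale=0.5, size=400)
vi = variable_importance(X, Z, y)
```

When Z is uninformative the population importance sits on the boundary at zero, which is exactly the regime where naive inference breaks down and the paper's unified approach is needed.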