{"title":"Statistical Inference for Chi-square Statistics or F-Statistics Based on Multiple Imputation","authors":"Binhuan Wang, Yixin Fang, Man Jin","doi":"arxiv-2409.10812","DOIUrl":"https://doi.org/arxiv-2409.10812","url":null,"abstract":"Missing data is a common issue in medical, psychiatry, and social studies. In\u0000literature, Multiple Imputation (MI) was proposed to multiply impute datasets\u0000and combine analysis results from imputed datasets for statistical inference\u0000using Rubin's rule. However, Rubin's rule only works for combined inference on\u0000statistical tests with point and variance estimates and is not applicable to\u0000combine general F-statistics or Chi-square statistics. In this manuscript, we\u0000provide a solution to combine F-test statistics from multiply imputed datasets,\u0000when the F-statistic has an explicit fractional form (that is, both the\u0000numerator and denominator of the F-statistic are reported). Then we extend the\u0000method to combine Chi-square statistics from multiply imputed datasets.\u0000Furthermore, we develop methods for two commonly applied F-tests, Welch's ANOVA\u0000and Type-III tests of fixed effects in mixed effects models, which do not have\u0000the explicit fractional form. SAS macros are also developed to facilitate\u0000applications.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"47 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Decomposing Gaussians with Unknown Covariance","authors":"Ameer Dharamshi, Anna Neufeld, Lucy L. Gao, Jacob Bien, Daniela Witten","doi":"arxiv-2409.11497","DOIUrl":"https://doi.org/arxiv-2409.11497","url":null,"abstract":"Common workflows in machine learning and statistics rely on the ability to\u0000partition the information in a data set into independent portions. Recent work\u0000has shown that this may be possible even when conventional sample splitting is\u0000not (e.g., when the number of samples $n=1$, or when observations are not\u0000independent and identically distributed). However, the approaches that are\u0000currently available to decompose multivariate Gaussian data require knowledge\u0000of the covariance matrix. In many important problems (such as in spatial or\u0000longitudinal data analysis, and graphical modeling), the covariance matrix may\u0000be unknown and even of primary interest. Thus, in this work we develop new\u0000approaches to decompose Gaussians with unknown covariance. First, we present a\u0000general algorithm that encompasses all previous decomposition approaches for\u0000Gaussian data as special cases, and can further handle the case of an unknown\u0000covariance. It yields a new and more flexible alternative to sample splitting\u0000when $n>1$. When $n=1$, we prove that it is impossible to partition the\u0000information in a multivariate Gaussian into independent portions without\u0000knowing the covariance matrix. Thus, we use the general algorithm to decompose\u0000a single multivariate Gaussian with unknown covariance into dependent parts\u0000with tractable conditional distributions, and demonstrate their use for\u0000inference and validation. The proposed decomposition strategy extends naturally\u0000to Gaussian processes. 
In simulation and on electroencephalography data, we\u0000apply these decompositions to the tasks of model selection and post-selection\u0000inference in settings where alternative strategies are unavailable.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"77 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interpretability Indices and Soft Constraints for Factor Models","authors":"Justin Philip Tuazon, Gia Mizrane Abubo, Joemari Olea","doi":"arxiv-2409.11525","DOIUrl":"https://doi.org/arxiv-2409.11525","url":null,"abstract":"Factor analysis is a way to characterize the relationships between many\u0000(observable) variables in terms of a smaller number of unobservable random\u0000variables which are called factors. However, the application of factor models\u0000and its success can be subjective or difficult to gauge, since infinitely many\u0000factor models that produce the same correlation matrix can be fit given sample\u0000data. Thus, there is a need to operationalize a criterion that measures how\u0000meaningful or \"interpretable\" a factor model is in order to select the best\u0000among many factor models. While there are already techniques that aim to measure and enhance\u0000interpretability, new indices, as well as rotation methods via mathematical\u0000optimization based on them, are proposed to measure interpretability. The\u0000proposed methods directly incorporate semantics with the help of natural\u0000language processing and are generalized to incorporate any \"prior information\".\u0000Moreover, the indices allow for complete or partial specification of\u0000relationships at a pairwise level. Aside from these, two other main benefits of\u0000the proposed methods are that they do not require the estimation of factor\u0000scores, which avoids the factor score indeterminacy problem, and that no\u0000additional explanatory variables are necessary. The implementation of the proposed methods is written in Python 3 and is made\u0000available together with several helper functions through the package\u0000interpretablefa on the Python Package Index. 
The methods' application is\u0000demonstrated here using data on the Experiences in Close Relationships Scale,\u0000obtained from the Open-Source Psychometrics Project.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"104 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Estimation and imputation of missing data in longitudinal models with Zero-Inflated Poisson response variable","authors":"D. S. Martinez-Lobo, O. O. Melo, N. A. Cruz","doi":"arxiv-2409.11040","DOIUrl":"https://doi.org/arxiv-2409.11040","url":null,"abstract":"This research deals with the estimation and imputation of missing data in\u0000longitudinal models with a Poisson response variable inflated with zeros. A\u0000methodology is proposed that is based on the use of maximum likelihood,\u0000assuming that data is missing at random and that there is a correlation between\u0000the response variables. In each of the times, the expectation maximization (EM)\u0000algorithm is used: in step E, a weighted regression is carried out, conditioned\u0000on the previous times that are taken as covariates. In step M, the estimation\u0000and imputation of the missing data are performed. The good performance of the\u0000methodology in different loss scenarios is demonstrated in a simulation study\u0000comparing the model only with complete data, and estimating missing data using\u0000the mode of the data of each individual. Furthermore, in a study related to the\u0000growth of corn, it is tested on real data to develop the algorithm in a\u0000practical scenario.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"203 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Probability-scale residuals for event-time data","authors":"Eric S. Kawaguchi, Bryan E. Shepherd, Chun Li","doi":"arxiv-2409.11385","DOIUrl":"https://doi.org/arxiv-2409.11385","url":null,"abstract":"The probability-scale residual (PSR) is defined as $E{sign(y, Y^*)}$, where\u0000$y$ is the observed outcome and $Y^*$ is a random variable from the fitted\u0000distribution. The PSR is particularly useful for ordinal and censored outcomes\u0000for which fitted values are not available without additional assumptions.\u0000Previous work has defined the PSR for continuous, binary, ordinal,\u0000right-censored, and current status outcomes; however, development of the PSR\u0000has not yet been considered for data subject to general interval censoring. We\u0000develop extensions of the PSR, first to mixed-case interval-censored data, and\u0000then to data subject to several types of common censoring schemes. We derive\u0000the statistical properties of the PSR and show that our more general PSR\u0000encompasses several previously defined PSR for continuous and censored outcomes\u0000as special cases. The performance of the residual is illustrated in real data\u0000from the Caribbean, Central, and South American Network for HIV Epidemiology.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BMRMM: An R Package for Bayesian Markov (Renewal) Mixed Models","authors":"Yutong Wu, Abhra Sarkar","doi":"arxiv-2409.10835","DOIUrl":"https://doi.org/arxiv-2409.10835","url":null,"abstract":"We introduce the BMRMM package implementing Bayesian inference for a class of\u0000Markov renewal mixed models which can characterize the stochastic dynamics of a\u0000collection of sequences, each comprising alternative instances of categorical\u0000states and associated continuous duration times, while being influenced by a\u0000set of exogenous factors as well as a 'random' individual. The default setting\u0000flexibly models the state transition probabilities using mixtures of Dirichlet\u0000distributions and the duration times using mixtures of gamma kernels while also\u0000allowing variable selection for both. Modeling such data using simpler Markov\u0000mixed models also remains an option, either by ignoring the duration times\u0000altogether or by replacing them with instances of an additional category\u0000obtained by discretizing them by a user-specified unit. The option is also\u0000useful when data on duration times may not be available in the first place. We\u0000demonstrate the package's utility using two data sets.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"42 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance of Cross-Validated Targeted Maximum Likelihood Estimation","authors":"Matthew J. Smith, Rachael V. Phillips, Camille Maringe, Miguel Angel Luque Fernandez","doi":"arxiv-2409.11265","DOIUrl":"https://doi.org/arxiv-2409.11265","url":null,"abstract":"Background: Advanced methods for causal inference, such as targeted maximum\u0000likelihood estimation (TMLE), require certain conditions for statistical\u0000inference. However, in situations where there is not differentiability due to\u0000data sparsity or near-positivity violations, the Donsker class condition is\u0000violated. In such situations, TMLE variance can suffer from inflation of the\u0000type I error and poor coverage, leading to conservative confidence intervals.\u0000Cross-validation of the TMLE algorithm (CVTMLE) has been suggested to improve\u0000on performance compared to TMLE in settings of positivity or Donsker class\u0000violations. We aim to investigate the performance of CVTMLE compared to TMLE in\u0000various settings. Methods: We utilised the data-generating mechanism as described in Leger et\u0000al. (2022) to run a Monte Carlo experiment under different Donsker class\u0000violations. Then, we evaluated the respective statistical performances of TMLE\u0000and CVTMLE with different super learner libraries, with and without regression\u0000tree methods. Results: We found that CVTMLE vastly improves confidence interval coverage\u0000without adversely affecting bias, particularly in settings with small sample\u0000sizes and near-positivity violations. Furthermore, incorporating regression\u0000trees using standard TMLE with ensemble super learner-based initial estimates\u0000increases bias and variance leading to invalid statistical inference. 
Conclusions: It has been shown that when using CVTMLE the Donsker class\u0000condition is no longer necessary to obtain valid statistical inference when\u0000using regression trees and under either data sparsity or near-positivity\u0000violations. We show through simulations that CVTMLE is much less sensitive to\u0000the choice of the super learner library and thereby provides better estimation\u0000and inference in cases where the super learner library uses more flexible\u0000candidates and is prone to overfitting.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"17 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Flexible survival regression with variable selection for heterogeneous population","authors":"Abhishek Mandal, Abhisek Chakraborty","doi":"arxiv-2409.10771","DOIUrl":"https://doi.org/arxiv-2409.10771","url":null,"abstract":"Survival regression is widely used to model time-to-events data, to explore\u0000how covariates may influence the occurrence of events. Modern datasets often\u0000encompass a vast number of covariates across many subjects, with only a subset\u0000of the covariates significantly affecting survival. Additionally, subjects\u0000often belong to an unknown number of latent groups, where covariate effects on\u0000survival differ significantly across groups. The proposed methodology addresses\u0000both challenges by simultaneously identifying the latent sub-groups in the\u0000heterogeneous population and evaluating covariate significance within each\u0000sub-group. This approach is shown to enhance the predictive accuracy for\u0000time-to-event outcomes, via uncovering varying risk profiles within the\u0000underlying heterogeneous population and is thereby helpful to device targeted\u0000disease management strategies.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"bayesCureRateModel: Bayesian Cure Rate Modeling for Time to Event Data in R","authors":"Panagiotis Papastamoulis, Fotios Milienos","doi":"arxiv-2409.10221","DOIUrl":"https://doi.org/arxiv-2409.10221","url":null,"abstract":"The family of cure models provides a unique opportunity to simultaneously\u0000model both the proportion of cured subjects (those not facing the event of\u0000interest) and the distribution function of time-to-event for susceptibles\u0000(those facing the event). In practice, the application of cure models is mainly\u0000facilitated by the availability of various R packages. However, most of these\u0000packages primarily focus on the mixture or promotion time cure rate model. This\u0000article presents a fully Bayesian approach implemented in R to estimate a\u0000general family of cure rate models in the presence of covariates. It builds\u0000upon the work by Papastamoulis and Milienos (2024) by additionally considering\u0000various options for describing the promotion time, including the Weibull,\u0000exponential, Gompertz, log-logistic and finite mixtures of gamma distributions,\u0000among others. Moreover, the user can choose any proper distribution function\u0000for modeling the promotion time (provided that some specific conditions are\u0000met). Posterior inference is carried out by constructing a Metropolis-coupled\u0000Markov chain Monte Carlo (MCMC) sampler, which combines Gibbs sampling for the\u0000latent cure indicators and Metropolis-Hastings steps with Langevin diffusion\u0000dynamics for parameter updates. The main MCMC algorithm is embedded within a\u0000parallel tempering scheme by considering heated versions of the target\u0000posterior distribution. 
The package is illustrated on a real dataset analyzing\u0000the duration of the first marriage under the presence of various covariates\u0000such as the race, age and the presence of kids.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"183 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
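The parallel tempering (Metropolis-coupled MCMC) scheme mentioned in this abstract can be sketched by its between-chain swap move: chain k targets the posterior raised to the power 1/temps[k], and adjacent chains occasionally exchange states. This is a generic sketch with illustrative names, not the package's internal implementation:

```python
import math
import random

def try_swap(states, logpost, temps, rng=random):
    """One swap move in a parallel-tempering (Metropolis-coupled MCMC)
    scheme. Chain k targets pi(x)^(1/temps[k]); temps[0] = 1 is the
    cold chain whose samples are kept for inference."""
    k = rng.randrange(len(states) - 1)            # propose swapping chains k, k+1
    lp_k, lp_k1 = logpost(states[k]), logpost(states[k + 1])
    beta_k, beta_k1 = 1.0 / temps[k], 1.0 / temps[k + 1]
    # Acceptance ratio: exp[(beta_k - beta_{k+1}) * (lp_{k+1} - lp_k)]
    log_alpha = (beta_k - beta_k1) * (lp_k1 - lp_k)
    if math.log(rng.random()) < log_alpha:
        states[k], states[k + 1] = states[k + 1], states[k]
    return states
```

Heated chains explore flattened versions of the posterior and pass good states down to the cold chain, which helps the sampler escape local modes of multimodal cure rate posteriors.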
{"title":"Generalized Matrix Factor Model","authors":"Xinbing Kong, Tong Zhang","doi":"arxiv-2409.10001","DOIUrl":"https://doi.org/arxiv-2409.10001","url":null,"abstract":"This article introduces a nonlinear generalized matrix factor model (GMFM)\u0000that allows for mixed-type variables, extending the scope of linear matrix\u0000factor models (LMFM) that are so far limited to handling continuous variables.\u0000We introduce a novel augmented Lagrange multiplier method, equivalent to the\u0000constraint maximum likelihood estimation, and carefully tailored to be locally\u0000concave around the true factor and loading parameters. This statistically\u0000guarantees the local convexity of the negative Hessian matrix around the true\u0000parameters of the factors and loadings, which is nontrivial in the matrix\u0000factor modeling and leads to feasible central limit theorems of the estimated\u0000factors and loadings. We also theoretically establish the convergence rates of\u0000the estimated factor and loading matrices for the GMFM under general conditions\u0000that allow for correlations across samples, rows, and columns. Moreover, we\u0000provide a model selection criterion to determine the numbers of row and column\u0000factors consistently. To numerically compute the constraint maximum likelihood\u0000estimator, we provide two algorithms: two-stage alternating maximization and\u0000minorization maximization. Extensive simulation studies demonstrate GMFM's\u0000superiority in handling discrete and mixed-type variables. 
An empirical data\u0000analysis of the company's operating performance shows that GMFM does clustering\u0000and reconstruction well in the presence of discontinuous entries in the data\u0000matrix.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"29 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}