{"title":"An optimal subsampling design for large-scale Cox model with censored data.","authors":"Shiqi Liu, Zilong Xie, Ming Zheng, Wen Yu","doi":"10.1080/02664763.2024.2423234","DOIUrl":"10.1080/02664763.2024.2423234","url":null,"abstract":"<p><p>Subsampling designs are useful for reducing computational load and storage cost for large-scale data analysis. For massive survival data with right censoring, we propose a class of optimal subsampling designs under the widely-used Cox model. The proposed designs utilize information from both the outcome and the covariates. Different forms of the design can be derived adaptively to meet various targets, such as optimizing the overall estimation accuracy or minimizing the variation of specific linear combination of the estimators. Given the subsampled data, the inverse probability weighting approach is employed to estimate the model parameters. The resultant estimators are shown to be consistent and asymptotically normally distributed. Simulation results indicate that the proposed subsampling design yields more efficient estimators than the uniform subsampling by using subsampled data of comparable sample sizes. Additionally, the subsampling estimation significantly reduces the computational load and storage cost relative to the full data estimation. An analysis of a real data example is provided for illustration.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 7","pages":"1315-1341"},"PeriodicalIF":1.2,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12123965/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144199240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient fully Bayesian approach to brain activity mapping with complex-valued fMRI data.","authors":"Zhengxin Wang, Daniel B Rowe, Xinyi Li, D Andrew Brown","doi":"10.1080/02664763.2024.2422392","DOIUrl":"https://doi.org/10.1080/02664763.2024.2422392","url":null,"abstract":"<p><p>Functional magnetic resonance imaging (fMRI) enables indirect detection of brain activity changes via the blood-oxygen-level-dependent (BOLD) signal. Conventional analysis methods mainly rely on the real-valued magnitude of these signals. In contrast, research suggests that analyzing both real and imaginary components of the complex-valued fMRI (cv-fMRI) signal provides a more holistic approach that can increase power to detect neuronal activation. We propose a fully Bayesian model for brain activity mapping with cv-fMRI data. Our model accommodates temporal and spatial dynamics. Additionally, we propose a computationally efficient sampling algorithm, which enhances processing speed through image partitioning. Our approach is shown to be computationally efficient via image partitioning and parallel computation while being competitive with state-of-the-art methods. We support these claims with both simulated numerical studies and an application to real cv-fMRI data obtained from a finger-tapping experiment.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1299-1314"},"PeriodicalIF":1.2,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035935/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143998676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prediction intervals and bands with improved coverage for functional data under noisy discrete observation.","authors":"David Kraus","doi":"10.1080/02664763.2024.2420223","DOIUrl":"https://doi.org/10.1080/02664763.2024.2420223","url":null,"abstract":"<p><p>We revisit the classic situation in functional data analysis in which curves are observed at discrete, possibly sparse and irregular, arguments with observation noise. We focus on the reconstruction of individual curves by prediction intervals and bands. The standard approach consists of two steps: first, one estimates the mean and covariance function of curves and observation noise variance function by, e.g. penalized splines, and second, under Gaussian assumptions, one derives the conditional distribution of a curve given observed data and constructs prediction sets with required properties, usually employing sampling from the predictive distribution. This approach is well established, commonly used and theoretically valid but practically, it surprisingly fails in its key property: prediction sets constructed this way often do not have the required coverage. The actual coverage is lower than the nominal one. We investigate the cause of this issue and propose a computationally feasible remedy that leads to prediction regions with a much better coverage. Our method accounts for the uncertainty of the predictive model by sampling from the approximate distribution of its spline estimators whose covariance is estimated by a novel sandwich estimator. Our approach also applies to the important case of covariate-adjusted models.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1258-1277"},"PeriodicalIF":1.2,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035946/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144010105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A non-linear integer-valued autoregressive model with zero-inflated data series.","authors":"Predrag M Popović, Hassan S Bakouch, Miroslav M Ristić","doi":"10.1080/02664763.2024.2419495","DOIUrl":"https://doi.org/10.1080/02664763.2024.2419495","url":null,"abstract":"<p><p>A new non-linear stationary process for time series of counts is introduced. The process is composed of the survival and innovation component. The survival component is based on the generalized zero-modified geometric thinning operator, where the innovation process figures in the survival component as well. A few probability distributions for the innovation process have been discussed, in order to adjust the model for observed series with the excess number of zeros. The conditional maximum likelihood and the conditional least squares methods are investigated for the estimation of the model parameters. The practical aspect of the model is presented on some real-life data sets, where we observe data with inflation as well as deflation of zeroes so we can notice how the model can be adjusted with the proper parameter selection.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1195-1218"},"PeriodicalIF":1.2,"publicationDate":"2024-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035957/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143995010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating the median <i>p</i>-value method for assessing the statistical significance of tests when using multiple imputation.","authors":"Peter C Austin, Iris Eekhout, Stef van Buuren","doi":"10.1080/02664763.2024.2418473","DOIUrl":"https://doi.org/10.1080/02664763.2024.2418473","url":null,"abstract":"<p><p>Rubin's Rules are commonly used to pool the results of statistical analyses across imputed samples when using multiple imputation. Rubin's Rules cannot be used when the result of an analysis in an imputed dataset is not a statistic and its associated standard error, but a test statistic (e.g. Student's t-test). While complex methods have been proposed for pooling test statistics across imputed samples, these methods have not been implemented in many popular statistical software packages. The median <i>p</i>-value method has been proposed for pooling test statistics. The statistical significance level of the pooled test statistic is the median of the associated <i>p</i>-values across the imputed samples. We evaluated the performance of this method with nine statistical tests: Student's t-test, Wilcoxon Rank Sum test, Analysis of Variance, Kruskal-Wallis test, the test of significance for Pearson's and Spearman's correlation coefficient, the Chi-squared test, the test of significance for a regression coefficient from a linear regression and from a logistic regression. For each test, the empirical type I error rate was higher than the advertised rate. The magnitude of inflation increased as the prevalence of missing data increased. The median <i>p</i>-value method should not be used to assess statistical significance across imputed datasets.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1161-1176"},"PeriodicalIF":1.2,"publicationDate":"2024-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035927/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144012737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mitigating the choice of the duration in DDMS models through a parametric link.","authors":"Fernando Henrique de Paula E Silva Mendes, Douglas Eduardo Turatti, Guilherme Pumi","doi":"10.1080/02664763.2024.2419505","DOIUrl":"https://doi.org/10.1080/02664763.2024.2419505","url":null,"abstract":"<p><p>One of the most important hyper-parameters in duration-dependent Markov-switching (DDMS) models is the duration of the hidden states. Because there is currently no procedure for estimating this duration or testing whether a given duration is appropriate for a given data set, an ad hoc duration choice must be heuristically justified. In this paper, we propose and examine a methodology that mitigates the choice of duration in DDMS models when forecasting is the goal. The novelty of this paper is the use of the asymmetric Aranda-Ordaz parametric link function to model transition probabilities in DDMS models, instead of the commonly applied logit link. The idea behind this approach is that any incorrect duration choice is compensated for by the parameter in the link, increasing model flexibility. Two Monte Carlo simulations, based on classical applications of DDMS models, are employed to evaluate the methodology. In addition, an empirical investigation is carried out to forecast the volatility of the S&P500, which showcases the capabilities of the proposed model.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1219-1238"},"PeriodicalIF":1.2,"publicationDate":"2024-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035960/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144018792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A semiparametric accelerated failure time-based mixture cure tree.","authors":"Wisdom Aselisewine, Suvra Pal, Helton Saulo","doi":"10.1080/02664763.2024.2418476","DOIUrl":"https://doi.org/10.1080/02664763.2024.2418476","url":null,"abstract":"<p><p>The mixture cure rate model (MCM) is the most widely used model for the analysis of survival data with a cured subgroup. In this context, the most common strategy to model the cure probability is to assume a generalized linear model with a known link function, such as the logit link function. However, the logit model can only capture simple effects of covariates on the cure probability. In this article, we propose a new MCM where the cure probability is modeled using a decision tree-based classifier and the survival distribution of the uncured is modeled using an accelerated failure time structure. To estimate the model parameters, we develop an expectation maximization algorithm. Our simulation study shows that the proposed model performs better in capturing nonlinear classification boundaries when compared to the logit-based MCM and the spline-based MCM. This results in more accurate and precise estimates of the cured probabilities, which in-turn results in improved predictive accuracy of cure. We further show that capturing nonlinear classification boundary also improves the estimation results corresponding to the survival distribution of the uncured subjects. Finally, we apply our proposed model and the EM algorithm to analyze an existing bone marrow transplant data.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1177-1194"},"PeriodicalIF":1.2,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035937/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144020246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The PCovR biplot: a graphical tool for principal covariates regression.","authors":"Elisa Frutos-Bernal, José Luis Vicente-Villardón","doi":"10.1080/02664763.2024.2417978","DOIUrl":"10.1080/02664763.2024.2417978","url":null,"abstract":"<p><p>Biplots are useful tools because they provide a visual representation of both individuals and variables simultaneously, making it easier to explore relationships and patterns within multidimensional datasets. This paper extends their use to examine the relationship between a set of predictors <math><mrow><mi>X</mi></mrow> </math> and a set of response variables <math><mrow><mi>Y</mi></mrow> </math> using Principal Covariates Regression analysis (PCovR). The PCovR biplot provides a simultaneous graphical representation of individuals, predictor variables and response variables. It also provides the ability to examine the relationship between both types of variables in the form of the regression coefficient matrix.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 5","pages":"1144-1159"},"PeriodicalIF":1.2,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11951325/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143752849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reliability analysis based on doubly-truncated and interval-censored data.","authors":"Pao-Sheng Shen, Huai-Man Li","doi":"10.1080/02664763.2024.2415412","DOIUrl":"10.1080/02664763.2024.2415412","url":null,"abstract":"<p><p>Field data provide important information on product reliability. Interval sampling is widely used for collection of field data, which typically report incident cases during a certain time period. Such sampling scheme induces doubly truncated (DT) data if the exact failure time is known. In many situations, the exact failure date is known only to fall within an interval, leading to doubly truncated and interval censored (DTIC) data. This article considers analysis of DTIC data under parametric failure time models. We consider a conditional likelihood approach and propose interval estimation for parameters and the cumulative distribution functions. Simulation studies show that the proposed method performs well for finite sample size.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 5","pages":"1128-1143"},"PeriodicalIF":1.2,"publicationDate":"2024-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11951335/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143752915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A novel ranked <i>k</i>-nearest neighbors algorithm for missing data imputation.","authors":"Yasir Khan, Said Farooq Shah, Syed Muhammad Asim","doi":"10.1080/02664763.2024.2414357","DOIUrl":"10.1080/02664763.2024.2414357","url":null,"abstract":"<p><p>Missing data is a common problem in many domains that rely on data analysis. The <i>k</i> Nearest Neighbors imputation method has been widely used to address this issue, but it has limitations in accurately imputing missing values, especially for datasets with small pairwise correlations and small values of <i>k</i>. In this study, we proposed a method, Ranked <i>k</i> Nearest Neighbors imputation that uses a similar approach to <i>k</i> Nearest Neighbor, but utilizing the concept of Ranked set sampling to select the most relevant neighbors for imputation. Our results show that the proposed method outperforms the standard <i>k</i> nearest neighbor method in terms of imputation accuracy both in case of Missing Completely at Random and Missing at Random mechanism, as demonstrated by consistently lower MSIE and MAIE values across all datasets. This suggests that the proposed method is a promising alternative for imputing missing values in datasets with small pairwise correlations and small values of <i>k</i>. Thus, the proposed Ranked <i>k</i> Nearest Neighbor method has important implications for data imputation in various domains and can contribute to the development of more efficient and accurate imputation methods without adding any computational complexity to an algorithm.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 5","pages":"1103-1127"},"PeriodicalIF":1.2,"publicationDate":"2024-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11951327/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143752879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}