{"title":"A robust distance-based approach for detecting multidimensional outliers.","authors":"R Lakshmi, T A Sajesh","doi":"10.1080/02664763.2024.2422403","DOIUrl":"https://doi.org/10.1080/02664763.2024.2422403","url":null,"abstract":"<p><p>Identifying outliers in data analysis is a critical task, as outliers can significantly influence the results and conclusions drawn from a dataset. This study explores the use of the Mahalanobis distance metric for detecting outliers in multivariate data, focusing on a novel approach inspired by the work of M. Falk, [<i>On mad and comedians</i>, Ann. Inst. Stat. Math. 49 (1997), pp. 615-644]. The proposed method is rigorously tested through extensive simulation analysis, where it demonstrates high True Positive Rates (TPR) and low False Positive Rates (FPR) when compared to other existing outlier detection techniques. Through extensive simulation analysis, we empirically evaluate the affine equivariance and breakdown properties of our proposed distance measure and it is evident from the outputs that our robust distance measure demonstrates effective results with respect to the measures FPR and TPR. The proposed method was applied to seven different datasets, showing promising true positive rates (TPR) and false positive rates (FPR), and it outperformed several well-known outlier identification approaches. We can effectively use our proposed distance measure in fields demanding outlier detection.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1278-1298"},"PeriodicalIF":1.2,"publicationDate":"2024-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035934/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144016593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Juan F Díaz-Sepúlveda, Nicoletta D'Angelo, Giada Adelfio, Jonatan A González, Francisco J Rodríguez-Cortés
{"title":"Clustering in point processes on linear networks using nearest neighbour volumes.","authors":"Juan F Díaz-Sepúlveda, Nicoletta D'Angelo, Giada Adelfio, Jonatan A González, Francisco J Rodríguez-Cortés","doi":"10.1080/02664763.2024.2411214","DOIUrl":"10.1080/02664763.2024.2411214","url":null,"abstract":"<p><p>This study introduces a novel method specifically designed to detect clusters of points within linear networks. This method extends a classification approach used for point processes in spatial contexts. Unlike traditional methods that operate on planar spaces, our approach adapts to the unique geometric challenges of linear networks, where classical properties of point processes are altered, and intuitive data visualisation becomes more complex. Our method utilises the distribution of the <i>K</i>th nearest neighbour volumes, extending planar-based clustering techniques to identify regions of increased point density within a network. This approach is particularly effective for distinguishing overlapping Poisson processes within the same linear network. We demonstrate the practical utility of our method through applications to road traffic accident data from two Colombian cities, Bogota and Medellin. Our results reveal distinct clusters of high-density points in road segments where severe traffic accidents (resulting in injuries or fatalities) are most likely to occur, highlighting areas of increased risk. These clusters were primarily located on major arterial roads with high traffic volumes. In contrast, low-density points corresponded to areas with fewer accidents, likely due to lower traffic flow or other mitigating factors. Our findings provide valuable insights for urban planning and road safety management.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 5","pages":"993-1016"},"PeriodicalIF":1.2,"publicationDate":"2024-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11951330/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143752900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Daisuke Yoneoka, Takayuki Kawashima, Yuta Tanoue, Shuhei Nomura, Akifumi Eguchi
{"title":"Robust estimation of the incubation period and the time of exposure using <i>γ</i>-divergence.","authors":"Daisuke Yoneoka, Takayuki Kawashima, Yuta Tanoue, Shuhei Nomura, Akifumi Eguchi","doi":"10.1080/02664763.2024.2420221","DOIUrl":"https://doi.org/10.1080/02664763.2024.2420221","url":null,"abstract":"<p><p>Estimating the exposure time to single infectious pathogens and the associated incubation period, based on symptom onset data, is crucial for identifying infection sources and implementing public health interventions. However, data from rapid surveillance systems designed for early outbreak warning often come with outliers originated from individuals who were not directly exposed to the initial source of infection (i.e. tertiary and subsequent infection cases), making the estimation of exposure time challenging. To address this issue, this study uses a three-parameter lognormal distribution and proposes a new <i>γ</i>-divergence-based robust approach for estimating the parameter corresponding to exposure time with a tailored optimization procedure using the majorization-minimization algorithm, which ensures the monotonic decreasing property of the objective function. Comprehensive numerical experiments and real data analyses suggest that our method is superior to conventional methods in terms of bias, mean squared error, and coverage probability of 95% confidence intervals.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1239-1257"},"PeriodicalIF":1.2,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035932/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143992898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhengxin Wang, Daniel B Rowe, Xinyi Li, D Andrew Brown
{"title":"Efficient fully Bayesian approach to brain activity mapping with complex-valued fMRI data.","authors":"Zhengxin Wang, Daniel B Rowe, Xinyi Li, D Andrew Brown","doi":"10.1080/02664763.2024.2422392","DOIUrl":"https://doi.org/10.1080/02664763.2024.2422392","url":null,"abstract":"<p><p>Functional magnetic resonance imaging (fMRI) enables indirect detection of brain activity changes via the blood-oxygen-level-dependent (BOLD) signal. Conventional analysis methods mainly rely on the real-valued magnitude of these signals. In contrast, research suggests that analyzing both real and imaginary components of the complex-valued fMRI (cv-fMRI) signal provides a more holistic approach that can increase power to detect neuronal activation. We propose a fully Bayesian model for brain activity mapping with cv-fMRI data. Our model accommodates temporal and spatial dynamics. Additionally, we propose a computationally efficient sampling algorithm, which enhances processing speed through image partitioning. Our approach is shown to be computationally efficient via image partitioning and parallel computation while being competitive with state-of-the-art methods. We support these claims with both simulated numerical studies and an application to real cv-fMRI data obtained from a finger-tapping experiment.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1299-1314"},"PeriodicalIF":1.2,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035935/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143998676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prediction intervals and bands with improved coverage for functional data under noisy discrete observation.","authors":"David Kraus","doi":"10.1080/02664763.2024.2420223","DOIUrl":"https://doi.org/10.1080/02664763.2024.2420223","url":null,"abstract":"<p><p>We revisit the classic situation in functional data analysis in which curves are observed at discrete, possibly sparse and irregular, arguments with observation noise. We focus on the reconstruction of individual curves by prediction intervals and bands. The standard approach consists of two steps: first, one estimates the mean and covariance function of curves and observation noise variance function by, e.g. penalized splines, and second, under Gaussian assumptions, one derives the conditional distribution of a curve given observed data and constructs prediction sets with required properties, usually employing sampling from the predictive distribution. This approach is well established, commonly used and theoretically valid but practically, it surprisingly fails in its key property: prediction sets constructed this way often do not have the required coverage. The actual coverage is lower than the nominal one. We investigate the cause of this issue and propose a computationally feasible remedy that leads to prediction regions with a much better coverage. Our method accounts for the uncertainty of the predictive model by sampling from the approximate distribution of its spline estimators whose covariance is estimated by a novel sandwich estimator. Our approach also applies to the important case of covariate-adjusted models.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1258-1277"},"PeriodicalIF":1.2,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035946/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144010105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Predrag M Popović, Hassan S Bakouch, Miroslav M Ristić
{"title":"A non-linear integer-valued autoregressive model with zero-inflated data series.","authors":"Predrag M Popović, Hassan S Bakouch, Miroslav M Ristić","doi":"10.1080/02664763.2024.2419495","DOIUrl":"https://doi.org/10.1080/02664763.2024.2419495","url":null,"abstract":"<p><p>A new non-linear stationary process for time series of counts is introduced. The process is composed of the survival and innovation component. The survival component is based on the generalized zero-modified geometric thinning operator, where the innovation process figures in the survival component as well. A few probability distributions for the innovation process have been discussed, in order to adjust the model for observed series with the excess number of zeros. The conditional maximum likelihood and the conditional least squares methods are investigated for the estimation of the model parameters. The practical aspect of the model is presented on some real-life data sets, where we observe data with inflation as well as deflation of zeroes so we can notice how the model can be adjusted with the proper parameter selection.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1195-1218"},"PeriodicalIF":1.2,"publicationDate":"2024-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035957/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143995010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating the median <i>p</i>-value method for assessing the statistical significance of tests when using multiple imputation.","authors":"Peter C Austin, Iris Eekhout, Stef van Buuren","doi":"10.1080/02664763.2024.2418473","DOIUrl":"https://doi.org/10.1080/02664763.2024.2418473","url":null,"abstract":"<p><p>Rubin's Rules are commonly used to pool the results of statistical analyses across imputed samples when using multiple imputation. Rubin's Rules cannot be used when the result of an analysis in an imputed dataset is not a statistic and its associated standard error, but a test statistic (e.g. Student's t-test). While complex methods have been proposed for pooling test statistics across imputed samples, these methods have not been implemented in many popular statistical software packages. The median <i>p</i>-value method has been proposed for pooling test statistics. The statistical significance level of the pooled test statistic is the median of the associated <i>p</i>-values across the imputed samples. We evaluated the performance of this method with nine statistical tests: Student's t-test, Wilcoxon Rank Sum test, Analysis of Variance, Kruskal-Wallis test, the test of significance for Pearson's and Spearman's correlation coefficient, the Chi-squared test, the test of significance for a regression coefficient from a linear regression and from a logistic regression. For each test, the empirical type I error rate was higher than the advertised rate. The magnitude of inflation increased as the prevalence of missing data increased. The median <i>p</i>-value method should not be used to assess statistical significance across imputed datasets.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1161-1176"},"PeriodicalIF":1.2,"publicationDate":"2024-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035927/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144012737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fernando Henrique de Paula E Silva Mendes, Douglas Eduardo Turatti, Guilherme Pumi
{"title":"Mitigating the choice of the duration in DDMS models through a parametric link.","authors":"Fernando Henrique de Paula E Silva Mendes, Douglas Eduardo Turatti, Guilherme Pumi","doi":"10.1080/02664763.2024.2419505","DOIUrl":"https://doi.org/10.1080/02664763.2024.2419505","url":null,"abstract":"<p><p>One of the most important hyper-parameters in duration-dependent Markov-switching (DDMS) models is the duration of the hidden states. Because there is currently no procedure for estimating this duration or testing whether a given duration is appropriate for a given data set, an ad hoc duration choice must be heuristically justified. In this paper, we propose and examine a methodology that mitigates the choice of duration in DDMS models when forecasting is the goal. The novelty of this paper is the use of the asymmetric Aranda-Ordaz parametric link function to model transition probabilities in DDMS models, instead of the commonly applied logit link. The idea behind this approach is that any incorrect duration choice is compensated for by the parameter in the link, increasing model flexibility. Two Monte Carlo simulations, based on classical applications of DDMS models, are employed to evaluate the methodology. In addition, an empirical investigation is carried out to forecast the volatility of the S&P500, which showcases the capabilities of the proposed model.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1219-1238"},"PeriodicalIF":1.2,"publicationDate":"2024-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035960/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144018792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A semiparametric accelerated failure time-based mixture cure tree.","authors":"Wisdom Aselisewine, Suvra Pal, Helton Saulo","doi":"10.1080/02664763.2024.2418476","DOIUrl":"https://doi.org/10.1080/02664763.2024.2418476","url":null,"abstract":"<p><p>The mixture cure rate model (MCM) is the most widely used model for the analysis of survival data with a cured subgroup. In this context, the most common strategy to model the cure probability is to assume a generalized linear model with a known link function, such as the logit link function. However, the logit model can only capture simple effects of covariates on the cure probability. In this article, we propose a new MCM where the cure probability is modeled using a decision tree-based classifier and the survival distribution of the uncured is modeled using an accelerated failure time structure. To estimate the model parameters, we develop an expectation maximization algorithm. Our simulation study shows that the proposed model performs better in capturing nonlinear classification boundaries when compared to the logit-based MCM and the spline-based MCM. This results in more accurate and precise estimates of the cured probabilities, which in-turn results in improved predictive accuracy of cure. We further show that capturing nonlinear classification boundary also improves the estimation results corresponding to the survival distribution of the uncured subjects. Finally, we apply our proposed model and the EM algorithm to analyze an existing bone marrow transplant data.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 6","pages":"1177-1194"},"PeriodicalIF":1.2,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12035937/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144020246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The PCovR biplot: a graphical tool for principal covariates regression.","authors":"Elisa Frutos-Bernal, José Luis Vicente-Villardón","doi":"10.1080/02664763.2024.2417978","DOIUrl":"10.1080/02664763.2024.2417978","url":null,"abstract":"<p><p>Biplots are useful tools because they provide a visual representation of both individuals and variables simultaneously, making it easier to explore relationships and patterns within multidimensional datasets. This paper extends their use to examine the relationship between a set of predictors <math><mrow><mi>X</mi></mrow> </math> and a set of response variables <math><mrow><mi>Y</mi></mrow> </math> using Principal Covariates Regression analysis (PCovR). The PCovR biplot provides a simultaneous graphical representation of individuals, predictor variables and response variables. It also provides the ability to examine the relationship between both types of variables in the form of the regression coefficient matrix.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 5","pages":"1144-1159"},"PeriodicalIF":1.2,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11951325/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143752849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}