{"title":"On use of adaptive cluster sampling for variance estimation.","authors":"Shameem Alam, Javid Shabbir, Malaika Nadeem","doi":"10.1080/02664763.2025.2460072","DOIUrl":"https://doi.org/10.1080/02664763.2025.2460072","url":null,"abstract":"<p><p>Adaptive cluster sampling is particularly helpful whenever the target population is unique, dispersed unevenly, concealed or difficult to find. In the current investigation, under an adaptive cluster sampling approach, we propose a ratio-product-logarithmic type estimator employing a single auxiliary variable for the estimation of finite population variance. The bias and mean square error of the proposed estimator are developed by using simulation as well as real data sets. The study results show that for estimating the finite population variance, the proposed estimator outperforms the competing estimators.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 12","pages":"2291-2305"},"PeriodicalIF":1.1,"publicationDate":"2025-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12416028/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145029941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Influence diagnostics in the Heckman selection models based on EM algorithms.","authors":"Marcos S Oliveira, Marcos O Prates, Christian E Galarza, Victor H Lachos","doi":"10.1080/02664763.2025.2461715","DOIUrl":"https://doi.org/10.1080/02664763.2025.2461715","url":null,"abstract":"<p><p>This study presents diagnostic techniques for Heckman selection models estimated using the EM algorithm. The focus is on the selection <i>t</i> and normal models, based on the bivariate Student's-<i>t</i> and bivariate normal distributions, respectively. The Heckman selection model is a key econometric tool for estimating relationships while addressing selection bias. Relying on the EM-type algorithm, we develop global and local influence analyses based on the conditional expectation of the complete-data log-likelihood function, exploring four perturbation schemes for local influence analysis. To assess the effectiveness of the proposed diagnostic measures in identifying influential observations, we conducted a simulation study, complemented by two real-data applications that demonstrate how these techniques can effectively identify influential points. The proposed algorithms and methodologies are incorporated into the R package HeckmanEM.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 13","pages":"2384-2412"},"PeriodicalIF":1.1,"publicationDate":"2025-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12490367/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145232640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Objective Bayesian trend filtering via adaptive piecewise polynomial regression.","authors":"Sang Gil Kang, Yongku Kim","doi":"10.1080/02664763.2025.2461186","DOIUrl":"https://doi.org/10.1080/02664763.2025.2461186","url":null,"abstract":"<p><p>Several methods have been developed for nonparametric regression problems, including classical approaches such as kernels, local polynomials, smoothing splines, sieves, and wavelets, as well as relatively new methods such as lasso, generalized lasso, and trend filtering. This study proposes an objective Bayesian trend filtering method based on model selection. The procedure followed in this study estimates the functions based on adaptive piecewise polynomial regression models with two components. First, we determine the intervals with varying trends using Bayesian binary segmentation and then evaluate the most reasonable trend via Bayesian model selection at these intervals. This trend filtering procedure follows Bayesian model selection that uses intrinsic priors, which eliminates any subjective input. Additionally, we prove that the proposed method using these intrinsic priors is consistent when applied to large sample sizes. The behavior of the proposed Bayesian trend filtering procedure is compared with that of trend filtering using a simulation study and real examples. 
Finally, we apply the proposed method to detect the variance change points under mean changes, whereas the existing methods yielded inaccurate estimates of the variance change points when the mean varied smoothly, as the sudden-change assumption was violated in such cases.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 13","pages":"2357-2383"},"PeriodicalIF":1.1,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12490381/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145232665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parametric estimation of quantile versions of Zenga and D inequality curves: methodology and application to Weibull distribution.","authors":"Sylwester Pia̧tek","doi":"10.1080/02664763.2025.2458126","DOIUrl":"https://doi.org/10.1080/02664763.2025.2458126","url":null,"abstract":"<p><p>Inequality (concentration) curves such as the Lorenz, Bonferroni, and Zenga curves, as well as a new inequality curve, the <i>D</i> curve, are broadly used to analyse inequalities in wealth and income distribution in certain populations. Quantile versions of these inequality curves are more robust to outliers. We discuss several parametric estimators of quantile versions of the Zenga and <i>D</i> curves. A minimum distance (MD) estimator is proposed for these two curves and the indices related to them. The consistency and asymptotic normality of the MD estimator are proved. The MD estimator can also be used to estimate the inequality measures corresponding to the quantile versions of the inequality curves. The estimation methods considered are illustrated in the case of the Weibull model, which has many applications in life sciences, for example, to fit precipitation data. 
In econometrics it is also considered to fit incomes, especially in the case when a significant share of population have low incomes, for example, in less developed countries or among low-paid jobs.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 12","pages":"2226-2246"},"PeriodicalIF":1.1,"publicationDate":"2025-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12416017/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145029907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Gene mutation estimations via mutual information and Ewens sampling based CNN & machine learning algorithms.","authors":"Wanyang Dai","doi":"10.1080/02664763.2025.2460076","DOIUrl":"https://doi.org/10.1080/02664763.2025.2460076","url":null,"abstract":"<p><p>We conduct gene mutation rate estimations by developing mutual information and Ewens sampling based convolutional neural network (CNN) and machine learning algorithms. More precisely, we develop a systematic methodology by constructing a CNN. Meanwhile, we develop two machine learning algorithms to study protein production with target gene sequences and protein structures. The core of the CNN and machine learning approach is to address a two-stage optimization problem to balance gene mutation rates during protein production. To wit, we try to optimally coordinate the consistency between the given input DNA sequences and the given (or optimally computed) target ones by controlling their intermediate gene mutation rates. The purpose of doing so is to conduct gene editing and protein structure prediction. For example, after the gene mutation rates are estimated, the computational complexity of protein structure prediction will be reduced to a reasonable degree. Our developed CNN numerical optimization scheme consists of two newly designed machine learning algorithms. The stochastic gradients for the two algorithms are designed according to the Kuhn-Tucker conditions with boundary constraints and with the support of Ewens sampling, multi-input multi-output (MIMO) mutual information, and codon optimization techniques. The associated learning rate bounds are explicitly derived from the method and the two algorithms are numerically implemented. The convergence and optimality of the algorithms are mathematically proved. 
To illustrate the usage of our study, we also conduct a real-world data implementation.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 12","pages":"2321-2353"},"PeriodicalIF":1.1,"publicationDate":"2025-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12416021/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145029916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Change point detection to analyze air pollution and its economic effects: an exponentially weighted moving average perspective.","authors":"Shabbir Ahmad, Muhammad Riaz, Tahir Mahmood, Nasir Abbas","doi":"10.1080/02664763.2025.2455636","DOIUrl":"10.1080/02664763.2025.2455636","url":null,"abstract":"<p><p>Air pollution has a direct impact on every society, leading to consequential effects on the economy of a nation. Poor air quality adversely affects human health, resulting in various economic outcomes such as rising healthcare costs, diminished labor productivity, negative impacts on tourism and living standards, increased regulatory expenses for businesses, and heightened economic disparities. Effective control methods are essential to monitor factors influencing the economy, including air quality. The presence of toxic substances in the air reduces air quality, necessitating its monitoring through indices like PM10. Among statistical process control tools, control charts are the most prominent for efficient change point detection. This study introduces a new process monitoring tool that incorporates additional auxiliary information, if available, alongside the main variable of interest. The proposed methodology ensures detection ability remains robust, even under disturbances in the auxiliary variable. Furthermore, mathematical analyses reveal that many existing statistical quality control tools become special cases of the proposed structure for specific sensitivity parameter values. Evaluated through properties of run length distribution, the proposed chart allows control of the robustness-efficiency balance by adjusting its sensitivity parameter. 
A practical implementation demonstrates the effectiveness of the chart in monitoring air quality data.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 11","pages":"2113-2155"},"PeriodicalIF":1.1,"publicationDate":"2025-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12404093/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144992776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the use and misuse of time-rescaling to assess the goodness-of-fit of self-exciting temporal point processes.","authors":"M-A El-Aroui","doi":"10.1080/02664763.2025.2459245","DOIUrl":"https://doi.org/10.1080/02664763.2025.2459245","url":null,"abstract":"<p><p>The paper first highlights important drawbacks and biases related to the common use of time-rescaling to assess the goodness-of-fit (Gof) of self-exciting temporal point process (SETPP) models. Then it presents a new predictive time-rescaling approach leading to an asymptotically unbiased Gof framework for general SETPPs in the case of single observed trajectories. The predictive approach focuses on forecasting accuracy and addresses the bias problem resulting from the plugged-in estimated parameters. Dawid's prequential approach is used and model checking is mainly based on the forecasting accuracy of arrival times. These times are transformed, using sequentially estimated parameters, into random vectors which are proved to converge in probability under the null hypothesis and standard regularity conditions to vectors of iid Exponential(1) rv's. Numerical experiments are used to compare the performances of the standard and predictive time-rescaling for Gof assessment of non-homogeneous Poisson and Hawkes self-exciting temporal processes. 
Data of Japanese seismic events are also used to illustrate the dynamic aspect of the proposed model-checking approach.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 12","pages":"2247-2270"},"PeriodicalIF":1.1,"publicationDate":"2025-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12416029/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145029909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Gradient test to assess homogeneity of probabilities in discrete-time transition models with application in agricultural science data.","authors":"Laura Vicuña Torres de Paula, Idemauro Antonio Rodrigues de Lara, Cesar Auguto Taconeli, Carolina Reigada, Rafael de Andrade Moral","doi":"10.1080/02664763.2025.2457008","DOIUrl":"10.1080/02664763.2025.2457008","url":null,"abstract":"<p><p>Longitudinal studies in discrete or continuous time involving categorical data are common in agricultural sciences. Transition models can be used as a means to analyse the resulting data, especially when the aim is to describe category changes over time, as well as to accommodate covariates due to experimental design. Here we focus on discrete-time models, for which it is critical to assess whether the underlying process is stationary or not. Tests based on likelihood procedures are very useful, and here we propose the Gradient test to assess stationarity, or homogeneity of transition probabilities. We carried out simulation studies to evaluate the performance of the proposed test, which indicated a good performance regarding type-I error and power when compared to other classical tests available in the literature. As motivation, we present two studies with agricultural data: the first is applied to entomology with nominal responses, and the second concerns the degree of injury in pigs. Using our proposed test, stationarity and non-stationarity were verified in the first and second applications, respectively. 
Since the gradient test to assess stationarity has a simplified structure when compared to other tests, it is therefore a useful alternative when carrying out inference in these types of models.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 11","pages":"2172-2190"},"PeriodicalIF":1.1,"publicationDate":"2025-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12404091/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144992721","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pathway-based genetic association analysis for overdispersed count data.","authors":"Yang Liu","doi":"10.1080/02664763.2025.2460073","DOIUrl":"https://doi.org/10.1080/02664763.2025.2460073","url":null,"abstract":"<p><p>Overdispersion is a common phenomenon in genetic data, such as gene expression count data. In genetic association studies, it is important to investigate the association between a gene expression and a set of genetic variants from a pathway. However, existing approaches for pathway analysis are primarily designed for continuous and binary outcomes and are not applicable to overdispersed count data. In this paper, we propose a hierarchical approach to analyze the association between an overdispersed count response and a set of low-frequency genetic variants in negative binomial regression. We derive score-type test statistics for both fixed and random effects of genetic variants, and further introduce a novel procedure for efficiently combining these two statistics for global testing. Through simulation studies, we demonstrate that the proposed method tends to be more powerful than existing methods under a wide range of scenarios. 
Additionally, we apply the proposed method to a colorectal cancer study, demonstrating its power in identifying associations between gene expression and somatic mutations.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 12","pages":"2306-2320"},"PeriodicalIF":1.1,"publicationDate":"2025-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12416034/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145029923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Clustering of recurrent events data.","authors":"G Babykina, V Vandewalle, J Carretero-Bravo","doi":"10.1080/02664763.2025.2452966","DOIUrl":"10.1080/02664763.2025.2452966","url":null,"abstract":"<p><p>Nowadays data are often timestamped; thus, when analysing events that may occur several times (recurrent events), it is desirable to model the whole dynamics of the counting process rather than to focus on the total number of events. Such data can be encountered in hospital readmissions, disease recurrences or repeated failures of industrial systems. Recurrent events can be analysed in the counting process framework, as in the Andersen-Gill model, assuming that the baseline intensity depends on time and on covariates, as in the Cox model. However, observed covariates are often insufficient to explain the observed heterogeneity in the data. We propose a mixture model for recurrent events that accounts for the unobserved heterogeneity and clusters individuals (unsupervised classification that partitions the heterogeneous data according to unobserved, or latent, variables). Within each cluster, the recurrent event process intensity is specified parametrically and is adjusted for covariates. Model parameters are estimated by maximum likelihood using the EM algorithm; the BIC is adopted to choose an optimal number of clusters. The model feasibility is checked on simulated data. Real data on hospital readmissions of elderly people, which motivated the development of the proposed clustering model, are analysed. 
The obtained results allow a fine understanding of the recurrent event process in each cluster.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 11","pages":"2031-2059"},"PeriodicalIF":1.1,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12404095/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144992763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}