{"title":"Imbalanced data sampling design based on grid boundary domain for big data","authors":"","doi":"10.1007/s00180-024-01471-8","DOIUrl":"https://doi.org/10.1007/s00180-024-01471-8","url":null,"abstract":"<h3>Abstract</h3> <p>The data distribution is often associated with a <em>priori</em>-known probability, and the occurrence probability of interest events is small, so a large amount of imbalanced data appears in sociology, economics, engineering, and various other fields. The existing over- and under-sampling methods are widely used in imbalanced data classification problems, but over-sampling leads to overfitting, and under-sampling ignores the effective information. We propose a new sampling design algorithm called the neighbor grid of boundary mixed-sampling (NGBM), which focuses on the boundary information. This paper obtains the classification boundary information through grid boundary domain identification, thereby determining the importance of the samples. Based on this premise, the synthetic minority oversampling technique is applied to the boundary grid, and random under-sampling is applied to the other grids. With the help of this mixed sampling strategy, more important classification boundary information, especially for positive sample information identification is extracted. Numerical simulations and real data analysis are used to discuss the parameter-setting strategy of the NGBM and illustrate the advantages of the proposed NGBM in the imbalanced data, as well as practical applications.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"54 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-03-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140075873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sparse estimation of linear model via Bayesian method $$^*$$","authors":"","doi":"10.1007/s00180-024-01474-5","DOIUrl":"https://doi.org/10.1007/s00180-024-01474-5","url":null,"abstract":"<h3>Abstract</h3> <p>This paper considers the sparse estimation problem of regression coefficients in the linear model. Note that the global–local shrinkage priors do not allow the regression coefficients to be truly estimated as zero, we propose three threshold rules and compare their contraction properties, and also tandem those rules with the popular horseshoe prior and the horseshoe+ prior that are normally regarded as global–local shrinkage priors. The hierarchical prior expressions for the horseshoe prior and the horseshoe+ prior are obtained, and the full conditional posterior distributions for all parameters for algorithm implementation are also given. Simulation studies indicate that the horseshoe/horseshoe+ prior with the threshold rules are both superior to the spike-slab models. Finally, a real data analysis demonstrates the effectiveness of variable selection of the proposed method.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"35 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140036222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Degree selection methods for curve estimation via Bernstein polynomials","authors":"","doi":"10.1007/s00180-024-01473-6","DOIUrl":"https://doi.org/10.1007/s00180-024-01473-6","url":null,"abstract":"<h3>Abstract</h3> <p>Bernstein Polynomial (BP) bases can uniformly approximate any continuous function based on observed noisy samples. However, a persistent challenge is the data-driven selection of a suitable degree for the BPs. In the absence of noise, asymptotic theory suggests that a larger degree leads to better approximation. However, in the presence of noise, which reduces bias, a larger degree also results in larger variances due to high-dimensional parameter estimation. Thus, a balance in the classic bias-variance trade-off is essential. The main objective of this work is to determine the minimum possible degree of the approximating BPs using probabilistic methods that are robust to various shapes of an unknown continuous function. Beyond offering theoretical guidance, the paper includes numerical illustrations to address the issue of determining a suitable degree for BPs in approximating arbitrary continuous functions.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"22 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140016810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic piecewise linear regression","authors":"Mathias von Ottenbreit, Riccardo De Bin","doi":"10.1007/s00180-024-01475-4","DOIUrl":"https://doi.org/10.1007/s00180-024-01475-4","url":null,"abstract":"<p>Regression modelling often presents a trade-off between predictiveness and interpretability. Highly predictive and popular tree-based algorithms such as Random Forest and boosted trees predict very well the outcome of new observations, but the effect of the predictors on the result is hard to interpret. Highly interpretable algorithms like linear effect-based boosting and MARS, on the other hand, are typically less predictive. Here we propose a novel regression algorithm, automatic piecewise linear regression (APLR), that combines the predictiveness of a boosting algorithm with the interpretability of a MARS model. In addition, as a boosting algorithm, it automatically handles variable selection, and, as a MARS-based approach, it takes into account non-linear relationships and possible interaction terms. We show on simulated and real data examples how APLR’s performance is comparable to that of the top-performing approaches in terms of prediction, while offering an easy way to interpret the results. APLR has been implemented in C++ and wrapped in a Python package as a Scikit-learn compatible estimator.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"1 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140016808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Variational Bayesian Lasso for spline regression","authors":"Larissa C. Alves, Ronaldo Dias, Helio S. Migon","doi":"10.1007/s00180-024-01470-9","DOIUrl":"https://doi.org/10.1007/s00180-024-01470-9","url":null,"abstract":"<p>This work presents a new scalable automatic Bayesian Lasso methodology with variational inference for non-parametric splines regression that can capture the non-linear relationship between a response variable and predictor variables. Note that under non-parametric point of view the regression curve is assumed to lie in a infinite dimension space. Regression splines use a finite approximation of this infinite space, representing the regression function by a linear combination of basis functions. The crucial point of the approach is determining the appropriate number of bases or equivalently number of knots, avoiding over-fitting/under-fitting. A decision-theoretic approach was devised for knot selection. Comprehensive simulation studies were conducted in challenging scenarios to compare alternative criteria for knot selection, thereby ensuring the efficacy of the proposed algorithms. Additionally, the performance of the proposed method was assessed using real-world datasets. The novel procedure demonstrated good performance in capturing the underlying data structure by selecting the appropriate number of knots/basis.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"611 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139956295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bayesian estimation of the number of species from Poisson-Lindley stochastic abundance model using non-informative priors","authors":"Anurag Pathak, Manoj Kumar, Sanjay Kumar Singh, Umesh Singh, Sandeep Kumar","doi":"10.1007/s00180-024-01464-7","DOIUrl":"https://doi.org/10.1007/s00180-024-01464-7","url":null,"abstract":"<p>In this article, we propose a Poisson-Lindley distribution as a stochastic abundance model in which the sample is according to the independent Poisson process. Jeffery’s and Bernardo’s reference priors have been obtaining and proposed the Bayes estimators of the number of species for this model. The proposed Bayes estimators have been compared with the corresponding profile and conditional maximum likelihood estimators for their square root of the risks under squared error loss function (SELF). Jeffery’s and Bernardo’s reference priors have been considered and compared with the Bayesian approach based on biological data.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"19 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139951516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generation of normal distributions revisited","authors":"Takayuki Umeda","doi":"10.1007/s00180-024-01468-3","DOIUrl":"https://doi.org/10.1007/s00180-024-01468-3","url":null,"abstract":"<p>Normally distributed random numbers are commonly used in scientific computing in various fields. It is important to generate a set of random numbers as close to a normal distribution as possible for reducing initial fluctuations. Two types of samples from a uniform distribution are examined as source samples for inverse transform sampling methods. Three types of inverse transform sampling methods with new approximations of inverse cumulative distribution functions are also discussed for converting uniformly distributed source samples to normally distributed samples.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"32 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139951514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bayesian regression models in gretl: the BayTool package","authors":"Luca Pedini","doi":"10.1007/s00180-024-01466-5","DOIUrl":"https://doi.org/10.1007/s00180-024-01466-5","url":null,"abstract":"<p>This article presents the <span>gretl</span> package <span>BayTool</span> which integrates the software functionalities, mostly concerned with frequentist approaches, with Bayesian estimation methods of commonly used econometric models. Computational efficiency is achieved by pairing an extensive use of Gibbs sampling for posterior simulation with the possibility of splitting single-threaded experiments into multiple cores or machines by means of parallelization. From the user’s perspective, the package requires only basic knowledge of <span>gretl</span> scripting to fully access its functionality, while providing a point-and-click solution in the form of a graphical interface for a less experienced audience. These features, in particular, make <span>BayTool</span> stand out as an excellent teaching device without sacrificing more advanced or complex applications.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"14 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139927827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Erina Paul, Santosh Sutradhar, Jonathan Hartzel, Devan V. Mehrotra
{"title":"Bayesian sequential probability ratio test for vaccine efficacy trials","authors":"Erina Paul, Santosh Sutradhar, Jonathan Hartzel, Devan V. Mehrotra","doi":"10.1007/s00180-024-01458-5","DOIUrl":"https://doi.org/10.1007/s00180-024-01458-5","url":null,"abstract":"<p>Designing vaccine efficacy (VE) trials often requires recruiting large numbers of participants when the diseases of interest have a low incidence. When developing novel vaccines, such as for COVID-19 disease, the plausible range of VE is quite large at the design stage. Thus, the number of events needed to demonstrate efficacy above a pre-defined regulatory threshold can be difficult to predict and the time needed to accrue the necessary events can often be long. Therefore, it is advantageous to evaluate the efficacy at earlier interim analysis in the trial to potentially allow the trials to stop early for overwhelming VE or futility. In such cases, incorporating interim analyses through the use of the sequential probability ratio test (SPRT) can be helpful to allow for multiple analyses while controlling for both type-I and type-II errors. In this article, we propose a Bayesian SPRT for designing a vaccine trial for comparing a test vaccine with a control assuming two Poisson incidence rates. We provide guidance on how to choose the prior distribution and how to optimize the number of events for interim analyses to maximize the efficiency of the design. Through simulations, we demonstrate how the proposed Bayesian SPRT performs better when compared with the corresponding frequentist SPRT. An R repository to implement the proposed method is placed at: https://github.com/Merck/bayesiansprt.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"14 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139927751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Overlapping coefficient in network-based semi-supervised clustering","authors":"Claudio Conversano, Luca Frigau, Giulia Contu","doi":"10.1007/s00180-024-01457-6","DOIUrl":"https://doi.org/10.1007/s00180-024-01457-6","url":null,"abstract":"<p>Network-based Semi-Supervised Clustering (NeSSC) is a semi-supervised approach for clustering in the presence of an outcome variable. It uses a classification or regression model on resampled versions of the original data to produce a proximity matrix that indicates the magnitude of the similarity between pairs of observations measured with respect to the outcome. This matrix is transformed into a complex network on which a community detection algorithm is applied to search for underlying community structures which is a partition of the instances into highly homogeneous clusters to be evaluated in terms of the outcome. In this paper, we focus on the case the outcome variable to be used in NeSSC is numeric and propose an alternative selection criterion of the optimal partition based on a measure of overlapping between density curves as well as a penalization criterion which takes accounts for the number of clusters in a candidate partition. Next, we consider the performance of the proposed method for some artificial datasets and for 20 different real datasets and compare NeSSC with the other three popular methods of semi-supervised clustering with a numeric outcome. Results show that NeSSC with the overlapping criterion works particularly well when a reduced number of clusters are scattered localized.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"18 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139927826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}