{"title":"Weighted support vector machine for extremely imbalanced data","authors":"Jongmin Mun , Sungwan Bang , Jaeoh Kim","doi":"10.1016/j.csda.2024.108078","DOIUrl":"10.1016/j.csda.2024.108078","url":null,"abstract":"<div><div>Based on an asymptotically optimal weighted support vector machine (SVM) that introduces label shift, a systematic procedure is derived for applying oversampling and weighted SVM to extremely imbalanced datasets with a cluster-structured positive class. This method formalizes three intuitions: (i) oversampling should reflect the structure of the positive class; (ii) weights should account for both the imbalance and oversampling ratios; (iii) synthetic samples should carry less weight than the original samples. The proposed method generates synthetic samples from the estimated positive class distribution using a Gaussian mixture model. To prevent overfitting to excessive synthetic samples, different misclassification penalties are assigned to the original positive class, synthetic positive class, and negative class. The proposed method is numerically validated through simulations and an analysis of Republic of Korea Army artillery training data.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142587412","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
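The three intuitions listed in the abstract lend themselves to a short sketch. The following toy example is a hypothetical illustration, not the paper's method: the cluster locations, weight formulas, and all parameter choices are my own assumptions. It oversamples a two-cluster positive class from a fitted Gaussian mixture and fits an SVM with per-sample weights that down-weight the synthetic points.

```python
# Hypothetical sketch of the three intuitions (not the paper's exact scheme):
# (i) structure-aware oversampling via a Gaussian mixture model,
# (ii) weights reflecting the imbalance/oversampling ratios,
# (iii) synthetic positives weighted less than original positives.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Extremely imbalanced toy data: a two-cluster positive class, many negatives.
X_neg = rng.normal(0.0, 1.0, size=(500, 2))
X_pos = np.vstack([rng.normal(3.0, 0.3, size=(10, 2)),
                   rng.normal(-3.0, 0.3, size=(10, 2))])

# (i) Estimate the positive-class distribution and draw synthetic samples.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X_pos)
X_syn, _ = gmm.sample(100)

X = np.vstack([X_neg, X_pos, X_syn])
y = np.concatenate([np.zeros(len(X_neg)), np.ones(len(X_pos) + len(X_syn))])

# (ii)+(iii) Illustrative weights: original positives up-weighted by the
# imbalance ratio; synthetic positives get half the total positive weight.
w = np.concatenate([np.full(len(X_neg), 1.0),
                    np.full(len(X_pos), len(X_neg) / len(X_pos)),
                    np.full(len(X_syn), 0.5 * len(X_neg) / len(X_syn))])

clf = SVC(kernel="rbf").fit(X, y, sample_weight=w)
```

The weighted fit keeps both small positive clusters from being swallowed by the negative class while the reduced synthetic weights guard against overfitting the oversampled points.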
{"title":"Cox regression model with doubly truncated and interval-censored data","authors":"Pao-sheng Shen","doi":"10.1016/j.csda.2024.108090","DOIUrl":"10.1016/j.csda.2024.108090","url":null,"abstract":"<div><div>Interval sampling is an efficient sampling scheme used in epidemiological studies. Doubly truncated (DT) data arise under this sampling scheme when the failure time can be observed exactly. In practice, the failure time may not be observed and might be recorded only within time intervals, leading to doubly truncated and interval-censored (DTIC) data. This article considers regression analysis of DTIC data under the Cox proportional hazards (PH) model and develops conditional maximum likelihood estimators (cMLEs) for the regression parameters and the baseline cumulative hazard function of the model. The cMLEs are shown to be consistent and asymptotically normal. Simulation results indicate that the cMLEs perform well for samples of moderate size.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142587411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating computation: A pairwise fitting technique for multivariate probit models","authors":"Margaux Delporte , Geert Verbeke , Steffen Fieuws , Geert Molenberghs","doi":"10.1016/j.csda.2024.108082","DOIUrl":"10.1016/j.csda.2024.108082","url":null,"abstract":"<div><div>Fitting multivariate probit models via maximum likelihood presents considerable computational challenges, particularly in terms of computation time and convergence difficulties, even for small numbers of responses. These issues are exacerbated when dealing with ordinal data. An efficient computational approach is introduced, based on a pairwise fitting technique within a pseudo-likelihood framework. This methodology is applied to clinical case studies, specifically using a trivariate probit model. Additionally, the correlation structure among outcomes is allowed to depend on covariates, enhancing both the flexibility and interpretability of the model. By way of simulation and real data applications, the proposed approach demonstrates superior computational efficiency as the dimension of the outcome vector increases. The method's ability to capture covariate-dependent correlations makes it particularly useful in medical research, where understanding complex associations among health outcomes is of scientific importance.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142578447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A unified consensus-based parallel algorithm for high-dimensional regression with combined regularizations","authors":"Xiaofei Wu , Rongmei Liang , Zhimin Zhang , Zhenyu Cui","doi":"10.1016/j.csda.2024.108081","DOIUrl":"10.1016/j.csda.2024.108081","url":null,"abstract":"<div><div>The parallel algorithm is widely recognized for its effectiveness in handling large-scale datasets stored in a distributed manner, making it a popular choice for solving statistical learning models. However, there is currently limited research on parallel algorithms specifically designed for high-dimensional regression with combined regularization terms. These terms, such as elastic-net, sparse group lasso, sparse fused lasso, and their nonconvex variants, have gained significant attention in various fields due to their ability to incorporate prior information and promote sparsity within specific groups or fused variables. The scarcity of parallel algorithms for combined regularizations can be attributed to the inherent nonsmoothness and complexity of these terms, as well as the absence of closed-form solutions for certain proximal operators associated with them. This paper proposes a <em>unified</em> constrained optimization formulation based on the consensus problem for these types of convex and nonconvex regression problems, and derives the corresponding parallel alternating direction method of multipliers (ADMM) algorithms. Furthermore, it is proven that the proposed algorithm not only has global convergence but also exhibits a linear convergence rate. It is worth noting that the computational complexity of the proposed algorithm remains the same for different regularization terms and losses, which implicitly demonstrates the universality of this algorithm. Extensive simulation experiments, along with a financial example, serve to demonstrate the reliability, stability, and scalability of our algorithm. 
The R package for implementing the proposed algorithm can be obtained at <span><span>https://github.com/xfwu1016/CPADMM</span></span>.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2024-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142587410","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
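As a minimal illustration of one building block such a consensus ADMM relies on (a generic sketch of the mathematics, not code from the CPADMM package), the elastic-net penalty admits a closed-form proximal operator: soft-threshold for the lasso part, then shrink for the ridge part.

```python
# Generic sketch: the closed-form proximal operator of the elastic-net
# penalty t * (lam1 * ||x||_1 + (lam2 / 2) * ||x||_2^2), which a consensus
# ADMM would apply elementwise in its consensus-variable update.
import numpy as np

def soft_threshold(v, t):
    """Soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_elastic_net(v, t, lam1, lam2):
    """argmin_x 0.5*||x - v||^2 + t*lam1*||x||_1 + t*(lam2/2)*||x||_2^2."""
    return soft_threshold(v, t * lam1) / (1.0 + t * lam2)

v = np.array([3.0, -0.5, 0.2, -2.0])
z = prox_elastic_net(v, t=1.0, lam1=1.0, lam2=1.0)  # [1.0, 0.0, 0.0, -0.5]
```

The absence of such closed forms for some combined penalties (e.g. sparse fused lasso) is exactly the difficulty the abstract mentions; those cases require an inner solver or a further splitting.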
{"title":"Multi-model subset selection","authors":"Anthony-Alexander Christidis , Stefan Van Aelst , Ruben Zamar","doi":"10.1016/j.csda.2024.108073","DOIUrl":"10.1016/j.csda.2024.108073","url":null,"abstract":"<div><div>The two primary approaches for high-dimensional regression problems are sparse methods (e.g., best subset selection, which uses the <span><math><msub><mrow><mi>ℓ</mi></mrow><mrow><mn>0</mn></mrow></msub></math></span>-norm in the penalty) and ensemble methods (e.g., random forests). Although sparse methods typically yield interpretable models, in terms of prediction accuracy they are often outperformed by “blackbox” multi-model ensemble methods. A regression ensemble is introduced which combines the interpretability of sparse methods with the high prediction accuracy of ensemble methods. An algorithm is proposed to solve the joint optimization of the corresponding <span><math><msub><mrow><mi>ℓ</mi></mrow><mrow><mn>0</mn></mrow></msub></math></span>-penalized regression models by extending recent developments in <span><math><msub><mrow><mi>ℓ</mi></mrow><mrow><mn>0</mn></mrow></msub></math></span>-optimization for sparse methods to multi-model regression ensembles. The sparse and diverse models in the ensemble are learned simultaneously from the data. Each of these models provides an explanation for the relationship between a subset of predictors and the response variable. Empirical studies and theoretical knowledge about ensembles are used to gain insight into the ensemble method's performance, focusing on the interplay between bias, variance, covariance, and variable selection. In prediction tasks, the ensembles can outperform state-of-the-art competitors on both simulated and real data. Forward stepwise regression is also generalized to multi-model regression ensembles and used to obtain an initial solution for the algorithm. 
The optimization algorithms are implemented in publicly available software packages.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142560769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
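The generalization of forward stepwise regression to several models at once can be sketched in a toy form. This is an illustrative simplification under my own assumptions (disjoint supports enforced for maximal diversity; plain least squares per candidate), not the paper's algorithm.

```python
# Toy sketch of multi-model forward stepwise selection: grow several sparse
# models greedily while forbidding reuse of a predictor across models, so
# the ensemble members give diverse explanations. Illustrative only.
import numpy as np

def multi_forward_stepwise(X, y, n_models=2, k=2):
    """Greedily build n_models disjoint supports of size k each."""
    n, p = X.shape
    supports = [set() for _ in range(n_models)]
    for _ in range(k):
        for s in supports:
            best_j, best_rss = None, np.inf
            taken = set().union(*supports)
            for j in range(p):
                if j in taken:
                    continue  # enforce disjoint supports across models
                cols = sorted(s | {j})
                beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
                rss = np.sum((y - X[:, cols] @ beta) ** 2)
                if rss < best_rss:
                    best_j, best_rss = j, rss
            s.add(best_j)
    return supports

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] + X[:, 2]  # two true predictors, one per model
supports = multi_forward_stepwise(X, y, n_models=2, k=1)
```

With one step per model, the first model picks the strongest predictor and the second, barred from reusing it, picks the next one, mimicking how the ensemble splits the explanation of the response across members.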
{"title":"Vine copula based structural equation models","authors":"Claudia Czado","doi":"10.1016/j.csda.2024.108076","DOIUrl":"10.1016/j.csda.2024.108076","url":null,"abstract":"<div><div>Gaussian linear structural equation models (SEMs) are often used as a statistical model associated with a directed acyclic graph (DAG), also known as a Bayesian network. However, such a model might not be able to represent the non-Gaussian dependence present in some data sets, which results in nonlinear, non-additive, and non-Gaussian conditional distributions. Therefore, the use of the class of D-vine copula based regression models for the specification of the conditional distribution of a node given its parents is proposed. This class extends the class of standard linear regression models considerably. The approach also allows an importance ordering of the parents of each node to be created and offers the potential to remove edges from the starting DAG that are not supported by the data. Furthermore, the uncertainty of conditional estimates can be assessed, and fast generative simulation using the D-vine copula based SEM is available. Simulations with random specifications of the D-vine based SEM demonstrate the improvement over a Gaussian linear SEM, as well as the ability to correctly remove edges not present in the data-generating process. An engineering application showcases the usefulness of the proposals.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2024-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142553893","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
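The Gaussian linear SEM baseline that this work generalizes is easy to sketch: data are generated node by node in topological order of the DAG, each node a linear function of its parents plus Gaussian noise. The DAG and coefficients below are my own toy choices; a D-vine copula based SEM would replace each linear conditional with a copula regression of the node on its parents.

```python
# Toy Gaussian linear SEM over the DAG x1 -> x2, {x1, x2} -> x3, simulated
# in topological order. Coefficients and noise scales are illustrative.
import numpy as np

rng = np.random.default_rng(4)
n = 1000
x1 = rng.normal(size=n)                               # root node
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)         # node given its parent
x3 = 0.5 * x2 - 0.3 * x1 + rng.normal(scale=0.5, size=n)
data = np.column_stack([x1, x2, x3])
```

Generative simulation is this same node-by-node pass; the vine copula version keeps the pass but draws each node from its estimated conditional distribution instead of a linear-Gaussian one.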
{"title":"Bayesian grouping-Gibbs sampling estimation of high-dimensional linear model with non-sparsity","authors":"Shanshan Qin , Guanlin Zhang , Yuehua Wu , Zhongyi Zhu","doi":"10.1016/j.csda.2024.108072","DOIUrl":"10.1016/j.csda.2024.108072","url":null,"abstract":"<div><div>In high-dimensional linear regression models, common assumptions typically entail sparsity of the regression coefficients <span><math><mi>β</mi><mo>∈</mo><msup><mrow><mi>R</mi></mrow><mrow><mi>p</mi></mrow></msup></math></span>. However, these assumptions may not hold when the majority, if not all, of the regression coefficients are nonzero. Statistical methods designed for sparse models may lead to substantial bias in model estimation. Therefore, this article proposes a novel Bayesian Grouping-Gibbs Sampling (BGGS) method, which departs from the common sparsity assumptions in high-dimensional problems. The BGGS method leverages a grouping strategy that partitions <strong><em>β</em></strong> into distinct groups, facilitating rapid sampling in high-dimensional space. The grouping number (<em>k</em>) can be determined using the ‘elbow plot’, which operates efficiently and is robust to the initial value. Theoretical analysis, under some regularity conditions, guarantees model selection and parameter estimation consistency and provides a bound on the prediction error. Furthermore, three finite-sample simulation studies are conducted to assess the competitive advantages of the proposed method in terms of parameter estimation and prediction accuracy. 
Finally, the BGGS method is applied to a financial dataset to explore its practical utility.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142529305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
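The payoff of the grouping strategy can be seen in a small sketch (my own toy illustration of the idea, not the Gibbs sampler itself): if coefficients within a group share one common value, summing the design columns within each group reduces a dense p-dimensional regression to a k-dimensional one.

```python
# Toy illustration of the grouping idea behind BGGS: a dense (non-sparse)
# coefficient vector whose entries take only k shared values can be
# estimated by regressing on group-summed design columns.
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 300, 30, 3

# Dense truth: every coefficient is nonzero, taking one of k shared values.
groups = np.repeat(np.arange(k), p // k)     # group label per coefficient
theta_true = np.array([1.0, -2.0, 0.5])      # one shared value per group
beta_true = theta_true[groups]

X = rng.normal(size=(n, p))
y = X @ beta_true                            # noiseless for clarity

# Collapse columns by group, then solve a k-dimensional least-squares problem.
Z = np.stack([X[:, groups == g].sum(axis=1) for g in range(k)], axis=1)
theta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
```

In the Bayesian version the same collapse is what makes sampling fast: each Gibbs update works in the k-dimensional grouped space rather than over all p coefficients.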
{"title":"A comparative analysis of different adjustment sets using propensity score based estimators","authors":"Shanshan Luo , Jiaqi Min , Wei Li , Xueli Wang , Zhi Geng","doi":"10.1016/j.csda.2024.108079","DOIUrl":"10.1016/j.csda.2024.108079","url":null,"abstract":"<div><div>Propensity score based estimators are commonly employed in observational studies to address baseline confounders, without explicitly modeling their association with the outcome. In this paper, to fully leverage these estimators, we consider a series of regression models for improving estimation efficiency. The proposed estimators rely solely on a properly modeled propensity score and do not require the correct specification of outcome models. In addition, we consider a comparative analysis by applying the proposed estimators to four different adjustment sets, each consisting of background covariates. The theoretical results imply that incorporating predictive covariates into both propensity score and regression model demonstrates the lowest asymptotic variance. However, including instrumental variables in the propensity score may decrease the estimation efficiency of the proposed estimators. To evaluate the performance of the proposed estimators, we conduct simulation studies and provide a real data example.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2024-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142529306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
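A baseline propensity score estimator of the kind being compared can be sketched briefly. This is a generic illustration under my own simulated design (a logistic propensity model and the normalized inverse-probability-weighted, i.e. Hájek, estimator), not the paper's proposed regression-assisted estimators; varying which covariate columns enter the propensity model corresponds to the different adjustment sets the paper compares.

```python
# Generic sketch: logistic propensity score + Hájek (normalized IPW)
# estimate of the average treatment effect, with both confounders in the
# adjustment set. Simulated design and parameters are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 5000
x = rng.normal(size=(n, 2))                           # baseline confounders
p_treat = 1.0 / (1.0 + np.exp(-(x[:, 0] - 0.5 * x[:, 1])))
a = rng.binomial(1, p_treat)                          # treatment assignment
y = 2.0 * a + x[:, 0] + x[:, 1] + rng.normal(size=n)  # true effect = 2

# Model the propensity score only; no outcome model is specified.
ps = LogisticRegression().fit(x, a).predict_proba(x)[:, 1]
w1, w0 = a / ps, (1 - a) / (1 - ps)
ate_hat = (w1 @ y) / w1.sum() - (w0 @ y) / w0.sum()
```

Replacing `x` above with an adjustment set that adds an instrument (a covariate affecting treatment but not outcome) illustrates the efficiency loss the theoretical results describe.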
{"title":"Unified specification tests in partially linear time series models","authors":"Shuang Sun , Zening Song , Xiaojun Song","doi":"10.1016/j.csda.2024.108074","DOIUrl":"10.1016/j.csda.2024.108074","url":null,"abstract":"<div><div>Based on a residual marked empirical process, Cramér–von Mises and Kolmogorov–Smirnov tests are proposed for the correct specification of the nonparametric components in partially linear time series models. The tests are unified in the sense that the asymptotic distribution of the residual marked empirical process is invariant across different <span><math><msup><mrow><mi>n</mi></mrow><mrow><mi>ν</mi></mrow></msup></math></span>-consistent estimators used to calculate the residuals (where <span><math><mi>ν</mi><mo>></mo><mn>1</mn><mo>/</mo><mn>4</mn></math></span>) under the null. In addition, the residual marked empirical process has the same power property under the sequence of local alternatives regardless of the estimators used. Achieved through a projection method, these features also enable the use of a computationally convenient multiplier bootstrap to approximate the unified null distributions of the test statistics. Simulations show satisfactory finite-sample performance of the proposed method. 
The application to validate the parametric form of conditional variance in the ARCH-X model is also highlighted, along with an empirical analysis of the conditional variance of the FTSE 100 index return series.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2024-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142529303","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
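The basic object behind these tests is simple to compute. The sketch below is a generic illustration for an i.i.d. scalar regressor (not the paper's projected version): the residual marked empirical process is R_n(x) = n^{-1/2} Σ_i e_i 1{X_i ≤ x}, and the Cramér–von Mises functional averages R_n² over the observed design points.

```python
# Generic sketch: residual marked empirical process and its Cramér-von Mises
# functional. A misspecified fit leaves structure in the residuals, which
# inflates the statistic relative to a correctly specified fit.
import numpy as np

def cvm_statistic(x, resid):
    """Mean over j of R_n(x_j)^2 with R_n(t) = n^{-1/2} sum_i e_i 1{x_i <= t}."""
    n = len(x)
    ind = (x[:, None] <= x[None, :]).astype(float)  # ind[i, j] = 1{x_i <= x_j}
    Rn = resid @ ind / np.sqrt(n)                   # process at each x_j
    return np.mean(Rn ** 2)

rng = np.random.default_rng(3)
x = rng.normal(size=200)
resid_null = rng.normal(size=200)            # correct specification: pure noise
resid_alt = resid_null + x ** 2 - 1.0        # omitted nonlinearity remains
```

Under the null the residuals carry no information about x and the statistic stays of order one; the omitted x² term produces a systematic drift in the partial sums and a much larger statistic.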
{"title":"Testing sufficiency for transfer learning","authors":"Ziqian Lin , Yuan Gao , Feifei Wang , Hansheng Wang","doi":"10.1016/j.csda.2024.108075","DOIUrl":"10.1016/j.csda.2024.108075","url":null,"abstract":"<div><div>Modern statistical analysis often encounters high-dimensional models with limited sample sizes. This makes it difficult to estimate high-dimensional statistical models based on target data with a limited sample size. How to borrow information from a larger source dataset for more accurate estimation of the target model then becomes an interesting problem. This leads to the useful idea of transfer learning. Various estimation methods in this regard have been developed recently. In this work, we study transfer learning from a different perspective. Specifically, we consider the problem of testing for transfer learning sufficiency. We take <em>transfer learning sufficiency</em> as the null hypothesis. It refers to the situation in which, with the help of the source data, the useful information contained in the feature vectors of the target data can be sufficiently extracted for predicting the target response of interest. Therefore, rejection of the null hypothesis implies that information useful for prediction remains in the feature vectors of the target data and thus calls for further exploration. To this end, we develop a novel testing procedure and a centralized and standardized test statistic, whose asymptotic null distribution is analytically derived. Simulation studies are presented to demonstrate the finite-sample performance of the proposed method. 
A deep-learning-related real data example is presented for illustration purposes.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":null,"pages":null},"PeriodicalIF":1.5,"publicationDate":"2024-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142529302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}