Statistical Analysis and Data Mining: Latest Articles

Imputed quantile vector autoregressive model for multivariate spatial–temporal data
IF 1.3, Zone 4, Mathematics
Statistical Analysis and Data Mining Pub Date : 2024-01-25 DOI: 10.1002/sam.11658
Liang Jinwen, Tian Maozai
{"title":"Imputed quantile vector autoregressive model for multivariate spatial–temporal data","authors":"Liang Jinwen, Tian Maozai","doi":"10.1002/sam.11658","DOIUrl":"https://doi.org/10.1002/sam.11658","url":null,"abstract":"Imputing missing values in multivariate spatial–temporal data is important in many fields. Existing low rank tensor learning methods are popular for handling this task but are sensitive to high level of skewness. The aim of this paper is to develop an alternative method with robustness and high imputation accuracy for multivariate spatial–temporal data. In view of the fact that quantile regression is robust to noises and outliers, we propose an imputed quantile vector autoregressive (IQVAR) model. IQVAR can simultaneously impute missing values and estimate parameters of quantile vector autoregressive model. The objective function includes check loss and nuclear norm penalization. We develop an ADMM (Alternating Direction Method of Multipliers) algorithm to solve the resulting optimization problem. Simulation studies and real data analysis are conducted to verify the efficiency of IQVAR. Compared with other approaches, IQVAR is more robust and accurate.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2024-01-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139590190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
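The abstract above names two ingredients of the objective function: a check loss and a nuclear norm penalty. The following is a minimal sketch of those two generic pieces only, assuming NumPy; it is not the authors' IQVAR model or their ADMM solver, and the function names, the `lam` weight, and the way the two terms are combined are illustrative assumptions.

```python
import numpy as np


def check_loss(residual, tau):
    """Quantile (check) loss rho_tau(u) = u * (tau - 1{u < 0}), elementwise."""
    residual = np.asarray(residual, dtype=float)
    return residual * (tau - (residual < 0))


def nuclear_norm(matrix):
    """Nuclear norm: the sum of the singular values of a matrix."""
    return np.linalg.svd(np.asarray(matrix, dtype=float), compute_uv=False).sum()


def penalized_objective(y, y_hat, coef, tau=0.5, lam=1.0):
    """Hypothetical combination: total check loss plus lam times the nuclear norm."""
    residual = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return check_loss(residual, tau).sum() + lam * nuclear_norm(coef)
```

With `tau = 0.5` the check loss is half the absolute error, so the median is targeted; other values of `tau` weight positive and negative residuals asymmetrically.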
Nonparametric Bayesian functional clustering with applications to racial disparities in breast cancer
IF 1.3, Zone 4, Mathematics
Statistical Analysis and Data Mining Pub Date : 2024-01-25 DOI: 10.1002/sam.11657
Wenyu Gao, Inyoung Kim, Wonil Nam, Xiang Ren, Wei Zhou, Masoud Agah
{"title":"Nonparametric Bayesian functional clustering with applications to racial disparities in breast cancer","authors":"Wenyu Gao, Inyoung Kim, Wonil Nam, Xiang Ren, Wei Zhou, Masoud Agah","doi":"10.1002/sam.11657","DOIUrl":"https://doi.org/10.1002/sam.11657","url":null,"abstract":"As we have easier access to massive data sets, functional analyses have gained more interest. However, such data sets often contain large heterogeneities, noises, and dimensionalities. When generalizing the analyses from vectors to functions, classical methods might not work directly. This paper considers noisy information reduction in functional analyses from two perspectives: functional clustering to group similar observations and thus reduce the sample size and functional variable selection to reduce the dimensionality. The complicated data structures and relations can be easily modeled by a Bayesian hierarchical model due to its flexibility. Hence, this paper proposes a nonparametric Bayesian functional clustering and peak point selection method via weighted Dirichlet process mixture (WDPM) modeling that automatically clusters and provides accurate estimations, together with conditional Laplace prior, which is a conjugate variable selection prior. The proposed method is named WDPM-VS for short, and is able to simultaneously perform the following tasks: (1) Automatic cluster without specifying the number of clusters or cluster centers beforehand; (2) Cluster for heterogeneously behaved functions; (3) Select vibrational peak points; and (4) Reduce noisy information from the two perspectives: sample size and dimensionality. The method will greatly outperform its comparison methods in root mean squared errors. 
Based on this proposed method, we are able to identify biological factors that can explain the breast cancer racial disparities.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2024-01-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139581738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Study of a bounded interval perks distribution with quantile regression analysis
IF 1.3, Zone 4, Mathematics
Statistical Analysis and Data Mining Pub Date : 2024-01-25 DOI: 10.1002/sam.11656
Laila A. Al-Essa, Shakaiba Shafiq, Deniz Ozonur, Farrukh Jamal
{"title":"Study of a bounded interval perks distribution with quantile regression analysis","authors":"Laila A. Al-Essa, Shakaiba Shafiq, Deniz Ozonur, Farrukh Jamal","doi":"10.1002/sam.11656","DOIUrl":"https://doi.org/10.1002/sam.11656","url":null,"abstract":"In this article, a novel bounded interval model called the unit-Perks model is developed by suitably transforming the positive random variable of the Perks distribution. Numerous statistical features of the bounded interval Perks model are being explored based on the expansion of the density function. Eight distinct estimation approaches are being used to estimate the parameters of the unit-Perks model. A thorough simulation analysis is also included to evaluate the precision of the resulting estimators from eight estimating approaches. Two real bounded interval data sets are being utilized to investigate the practical applicability of the unit-Perks model. A comparison is also made to determine which method of estimation works better for the given model. According to a comparison of eight different estimation approaches, the maximum likelihood estimation approach outperformed the other seven estimating approaches. The unit-Perks model is then used to introduce the quantile regression model named the quantile unit-Perks distribution. Application to real data set for the quantile unit-Perks distribution is also performed. The quantile residuals are used for the residual analysis of the fitted regression model. On the basis of mathematical, computational, and pictorial evidence, it is concluded that the presented model exhibited greater modeling capabilities.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2024-01-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139581983","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Boosting diversity in regression ensembles
IF 1.3, Zone 4, Mathematics
Statistical Analysis and Data Mining Pub Date : 2023-12-30 DOI: 10.1002/sam.11654
Mathias Bourel, Jairo Cugliari, Yannig Goude, Jean-Michel Poggi
{"title":"Boosting diversity in regression ensembles","authors":"Mathias Bourel, Jairo Cugliari, Yannig Goude, Jean-Michel Poggi","doi":"10.1002/sam.11654","DOIUrl":"https://doi.org/10.1002/sam.11654","url":null,"abstract":"Ensemble methods, such as Bagging, Boosting, or Random Forests, often enhance the prediction performance of single learners on both classification and regression tasks. In the context of regression, we propose a gradient boosting-based algorithm incorporating a diversity term with the aim of constructing different learners that enrich the ensemble while achieving a trade-off of some individual optimality for global enhancement. Verifying the hypotheses of Biau and Cadre's theorem (2021, Advances in contemporary statistics and econometrics—Festschrift in honour of Christine Thomas-Agnan, Springer), we present a convergence result ensuring that the associated optimization strategy reaches the global optimum. In the experiments, we consider a variety of different base learners with increasing complexity: stumps, regression trees, Purely Random Forests, and Breiman's Random Forests. Finally, we consider simulated and benchmark datasets and a real-world electricity demand dataset to show, by means of numerical experiments, the suitability of our procedure by examining the behavior not only of the final or the aggregated predictor but also of the whole generated sequence.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2023-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139063502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Multivariate contaminated normal mixture regression modeling of longitudinal data based on joint mean-covariance model
IF 1.3, Zone 4, Mathematics
Statistical Analysis and Data Mining Pub Date : 2023-12-22 DOI: 10.1002/sam.11653
Niu Xiaoyu, Tian Yuzhu, Tang Manlai, Tian Maozai
{"title":"Multivariate contaminated normal mixture regression modeling of longitudinal data based on joint mean-covariance model","authors":"Niu Xiaoyu, Tian Yuzhu, Tang Manlai, Tian Maozai","doi":"10.1002/sam.11653","DOIUrl":"https://doi.org/10.1002/sam.11653","url":null,"abstract":"Outliers are common in longitudinal data analysis, and the multivariate contaminated normal (MCN) distribution in model-based clustering is often used to detect outliers and provide robust parameter estimates in each subgroup. In this paper, we propose a method, the mixture of MCN (MCNM), based on the joint mean-covariance model, specifically designed to analyze longitudinal data characterized by mild outliers. Our model can automatically detect outliers in longitudinal data and provide robust parameter estimates in each subgroup. We use iteratively expectation-conditional maximization (ECM) algorithm and Aitken acceleration to estimate the model parameters, achieving both algorithm acceleration and stable convergence. Our proposed method simultaneously clusters the population, identifies progression patterns of the mean and covariance structures for different subgroups over time, and detects outliers. To demonstrate the effectiveness of our method, we conduct simulation studies under various cases involving different proportions and degrees of contamination. Additionally, we apply our method to real data on the number of people infected with AIDS in 49 countries or regions from 2001 to 2021. Results show that our proposed method effectively clusters the data based on various mean progression trajectories. In summary, our proposed MCNM based on the joint mean-covariance model and MCD of covariance matrices provides a robust method for clustering longitudinal data with mild outliers. 
It effectively detects outliers and identifies progression patterns in different groups over time, making it valuable for various applications in longitudinal data analysis.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2023-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139031070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A machine learning oracle for parameter estimation
IF 1.3, Zone 4, Mathematics
Statistical Analysis and Data Mining Pub Date : 2023-12-09 DOI: 10.1002/sam.11651
Lucas Koepke, Mary Gregg, Michael Frey
{"title":"A machine learning oracle for parameter estimation","authors":"Lucas Koepke, Mary Gregg, Michael Frey","doi":"10.1002/sam.11651","DOIUrl":"https://doi.org/10.1002/sam.11651","url":null,"abstract":"Competing procedures, involving data smoothing, weighting, imputation, outlier removal, etc., may be available to prepare data for parametric model estimation. Often, however, little is known about the best choice of preparatory procedure for the planned estimation and the observed data. A machine learning-based decision rule, an “oracle,” can be constructed in such cases to decide the best procedure from a set $\\mathcal{C}$ of available preparatory procedures. The oracle learns the decision regions associated with $\\mathcal{C}$ based on training data synthesized solely from the given data using model parameters with high posterior probability. An estimator in combination with an oracle to guide data preparation is called an oracle estimator. Oracle estimator performance is studied in two estimation problems: slope estimation in simple linear regression (SLR) and changepoint estimation in continuous two-linear-segments regression (CTLSR). In both examples, the regression response is given to be increasing, and the oracle must decide whether to isotonically smooth the response data preparatory to fitting the regression model. A measure of performance called headroom is proposed to assess the oracle's potential for reducing estimation error. Experiments with SLR and CTLSR find for important ranges of problem configurations that the headroom is high, the oracle's empirical performance is near the headroom, and the oracle estimator offers clear benefit.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2023-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138561157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
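The abstract above describes an oracle that decides whether to isotonically smooth an increasing response before fitting. As a rough illustration of the smoothing step alone, here is a minimal sketch of the pool-adjacent-violators algorithm, one standard way to compute a least-squares nondecreasing fit. The function name is hypothetical, equal observation weights are assumed, and nothing here reproduces the paper's oracle or its implementation.

```python
def isotonic_smooth(y):
    """Least-squares nondecreasing fit to y via pool-adjacent-violators."""
    blocks = []  # each block holds [mean, count] for a run of pooled points
    for value in y:
        blocks.append([float(value), 1])
        # Pool the last two blocks while they violate the nondecreasing order.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            mean_b, n_b = blocks.pop()
            mean_a, n_a = blocks.pop()
            n = n_a + n_b
            blocks.append([(mean_a * n_a + mean_b * n_b) / n, n])
    # Expand each pooled block back to one fitted value per observation.
    fitted = []
    for mean, count in blocks:
        fitted.extend([mean] * count)
    return fitted
```

Data that are already nondecreasing pass through unchanged; only adjacent points that violate the ordering are replaced by their pooled mean.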
The generalized hyperbolic family and automatic model selection through the multiple-choice LASSO
IF 1.3, Zone 4, Mathematics
Statistical Analysis and Data Mining Pub Date : 2023-12-08 DOI: 10.1002/sam.11652
Luca Bagnato, Alessio Farcomeni, Antonio Punzo
{"title":"The generalized hyperbolic family and automatic model selection through the multiple-choice LASSO","authors":"Luca Bagnato, Alessio Farcomeni, Antonio Punzo","doi":"10.1002/sam.11652","DOIUrl":"https://doi.org/10.1002/sam.11652","url":null,"abstract":"We revisit the generalized hyperbolic (GH) distribution and its nested models. These include widely used parametric choices like the multivariate normal, skew-$t$, Laplace, and several others. We also introduce the multiple-choice LASSO, a novel penalized method for choosing among alternative constraints on the same parameter. A hierarchical multiple-choice Least Absolute Shrinkage and Selection Operator (LASSO) penalized likelihood is optimized to perform simultaneous model selection and inference within the GH family. We illustrate our approach through a simulation study and a real data example. The methodology proposed in this paper has been implemented in R functions which are available as supplementary material.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2023-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138555623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Modeling subpopulations for hierarchically structured data
IF 1.3, Zone 4, Mathematics
Statistical Analysis and Data Mining Pub Date : 2023-11-22 DOI: 10.1002/sam.11650
Andrew Simpson, Semhar Michael, Dylan Borchert, Christopher Saunders, Larry Tang
{"title":"Modeling subpopulations for hierarchically structured data","authors":"Andrew Simpson, Semhar Michael, Dylan Borchert, Christopher Saunders, Larry Tang","doi":"10.1002/sam.11650","DOIUrl":"https://doi.org/10.1002/sam.11650","url":null,"abstract":"The field of forensic statistics offers a unique hierarchical data structure in which a population is composed of several subpopulations of sources and a sample is collected from each source. This subpopulation structure creates an additional layer of complexity. Hence, the data has a hierarchical structure in addition to the existence of underlying subpopulations. Finite mixtures are known for modeling heterogeneity; however, previous parameter estimation procedures assume that the data is generated through a simple random sampling process. We propose using a semi-supervised mixture modeling approach to model the subpopulation structure which leverages the fact that we know the collection of samples came from the same source, yet an unknown subpopulation. A simulation study and a real data analysis based on famous glass datasets and a keystroke dynamic typing data set show that the proposed approach performs better than other approaches that have been used previously in practice.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2023-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138517927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Spatially-correlated time series clustering using location-dependent Dirichlet process mixture model
IF 1.3, Zone 4, Mathematics
Statistical Analysis and Data Mining Pub Date : 2023-11-22 DOI: 10.1002/sam.11649
Junsub Jung, Sungil Kim, Heeyoung Kim
{"title":"Spatially-correlated time series clustering using location-dependent Dirichlet process mixture model","authors":"Junsub Jung, Sungil Kim, Heeyoung Kim","doi":"10.1002/sam.11649","DOIUrl":"https://doi.org/10.1002/sam.11649","url":null,"abstract":"The Dirichlet process mixture (DPM) model has been widely used as a Bayesian nonparametric model for clustering. However, the exchangeability assumption of the Dirichlet process is not valid for clustering spatially correlated time series as these data are indexed spatially and temporally. While analyzing spatially correlated time series, correlations between observations at proximal times and locations must be appropriately considered. In this study, we propose a location-dependent DPM model by extending the traditional DPM model for clustering spatially correlated time series. We model the temporal pattern as an infinite mixture of Gaussian processes while considering spatial dependency using a location-dependent Dirichlet process prior over mixture components. This encourages the assignment of observations from proximal locations to the same cluster. By contrast, because mixture atoms for modeling temporal patterns are shared across space, observations with similar temporal patterns can be still grouped together even if they are located far apart. The proposed model also allows the number of clusters to be automatically determined in the clustering procedure. We validate the proposed model using simulated examples. 
Moreover, in a real case study, we cluster adjacent roads based on their traffic speed patterns that have changed as a result of a traffic accident occurred in Seoul, South Korea.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2023-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138517923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Input-response space-filling designs incorporating response uncertainty
IF 1.3, Zone 4, Mathematics
Statistical Analysis and Data Mining Pub Date : 2023-11-20 DOI: 10.1002/sam.11648
Xiankui Yang, Lu Lu, Christine M. Anderson-Cook
{"title":"Input-response space-filling designs incorporating response uncertainty","authors":"Xiankui Yang, Lu Lu, Christine M. Anderson-Cook","doi":"10.1002/sam.11648","DOIUrl":"https://doi.org/10.1002/sam.11648","url":null,"abstract":"Traditionally space-filling designs have focused on the characteristics of the design in the input space ensuring uniform spread throughout the region. Input-response space-filling designs considered scenarios when having good spread throughout the range or region of the responses is also of interest. This paper acknowledges that there is typically uncertainty associated with the values of the response(s) and hence proposes a method, Input-Response Space-Filling Designs with Uncertainty (IRSFwU), to incorporate this into the design construction. The Pareto front of designs offers alternatives that balance input and response space filling, while prioritizing input combinations with lower associated response uncertainty. These lower uncertainty choices improve the chances of observing the desired response values. We describe the new approach with an uncertainty-adjusted distance to measure the response space filling, the Pareto aggregate point exchange algorithm to populate the set of promising designs, and illustrate the method with three examples of different input and response relationships and dimensions.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2023-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138517929","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0