Statistical Analysis and Data Mining: The ASA Data Science Journal: Latest Articles

Randomized algorithms for tensor response regression
Statistical Analysis and Data Mining: The ASA Data Science Journal. Pub Date: 2022-11-21. DOI: 10.1002/sam.11603
Zhe Cheng, Xiangjian Xu, Zihao Song, Weihua Zhao
Abstract: In this paper, we consider estimation algorithms for the regression of a tensor response on a vector covariate. Based on the projection theory of tensors and the idea of randomized algorithms for tensor decomposition, we propose three new algorithms, named SHOLRR, RHOLRR, and RSHOLRR, under the low‐rank Tucker decomposition, and provide some theoretical analyses for two randomized algorithms. To explore the nonlinear relationship between the tensor response and the vector covariate, we develop the KRSHOLRR algorithm based on the kernel trick and the RSHOLRR algorithm. The proposed algorithms not only guarantee high estimation accuracy but also compute quickly, especially for higher‐order tensor responses. Through extensive synthetic data analyses and applications to two real datasets, we demonstrate that the proposed algorithms outperform the state‐of‐the‐art.
Citations: 0
Local support vector machine based dimension reduction
Statistical Analysis and Data Mining: The ASA Data Science Journal. Pub Date: 2022-10-17. DOI: 10.1002/sam.11600
Linxi Li, Qin Wang, Chenlu Ke
Abstract: Motivated by several recent works that bring support vector machines into sufficient dimension reduction research, we propose a local support vector machine based dimension reduction approach. The proposal handles continuous and binary responses, and linear and nonlinear dimension reduction, in a unified framework. The localization can also help relax the stringent probabilistic assumptions required by the global methods. Numerical experiments and a real data application demonstrate the efficacy of the proposed approach.
Citations: 1
Frequentist model averaging for zero‐inflated Poisson regression models
Statistical Analysis and Data Mining: The ASA Data Science Journal. Pub Date: 2022-10-05. DOI: 10.1002/sam.11598
Jianhong Zhou, Alan T. K. Wan, Dalei Yu
Abstract: This paper considers frequentist model averaging for estimating the unknown parameters of the zero‐inflated Poisson regression model. Our proposed weight choice procedure is based on the minimization of an unbiased estimator of a conditional quadratic loss function. We prove that the resulting model average estimator enjoys an optimal asymptotic property, and simulation studies show that it improves finite sample properties over the two commonly used information‐based model selection estimators and their model average estimators. The proposed method is illustrated by a real data example.
Citations: 0
Feature screening of ultrahigh dimensional longitudinal data based on the C‐statistic
Statistical Analysis and Data Mining: The ASA Data Science Journal. Pub Date: 2022-09-26. DOI: 10.1002/sam.11597
Peng Lai, Qing Di, Zhezi Shen, Yanqiu Zhou
Abstract: This paper considers feature screening for ultrahigh dimensional semiparametric linear models with longitudinal data. The C‐statistic, which measures the rank concordance between predictors and outcomes, is generalized to longitudinal data. On the basis of the C‐statistic and score equation theory, we propose a feature screening method named LCSIS. Based on a smoothing technique and the score equations, the proposed screening procedure is easy to compute and satisfies feature screening consistency. Furthermore, Monte Carlo simulation studies and a real data application are conducted to examine the finite sample performance of the proposed procedure.
Citations: 0
Nonparametric clustering of RNA‐sequencing data
Statistical Analysis and Data Mining: The ASA Data Science Journal. Pub Date: 2022-09-23. DOI: 10.1002/sam.11638
Gabriel L. Lozano, Nadia M. Atallah, M. Levine
Abstract: Identification of clusters of co-expressed genes in transcriptomic data is a difficult task. Most algorithms used for this purpose can be classified into two broad categories: distance-based or model-based approaches. Distance-based approaches typically utilize a distance function between pairs of data objects and group similar objects together into clusters. Model-based approaches are based on the mixture-modeling framework. Compared to distance-based approaches, model-based approaches offer better interpretability because each cluster can be explicitly characterized in terms of the proposed model. However, these models present a particular difficulty in identifying a correct multivariate distribution that a mixture can be based upon. In this manuscript, we first review some of the approaches used to select a distribution for the needed mixture model. Then, we propose avoiding this problem altogether by using a nonparametric MSL (Maximum Smoothed Likelihood) algorithm. This algorithm was proposed earlier in the statistical literature but has not been, to the best of our knowledge, applied to transcriptomics data. The salient feature of this approach is that it avoids explicit specification of distributions of individual biological samples altogether, thus making the task of a practitioner easier. When used on a real dataset, the algorithm produces a large number of biologically meaningful clusters and compares favorably to the two other mixture-based algorithms commonly used for RNA-seq data clustering. Our code is publicly available on GitHub at https://github.com/Matematikoi/non_parametric_clustering.
Citations: 0
Machine learning and neural network based model predictions of soybean export shares from US Gulf to China
Statistical Analysis and Data Mining: The ASA Data Science Journal. Pub Date: 2022-09-07. DOI: 10.1002/sam.11595
Shantanu Awasthi, I. Sengupta, W. Wilson, Prithviraj Lakkakula
Abstract: In this paper, we propose a general model for soybean export market share dynamics and provide several theoretical analyses of a special case of the general model. We implement machine learning and neural network algorithms to train, analyze, and predict US Gulf soybean market shares to China (the target variable) using weekly time series data consisting of several features between January 6, 2012 and January 3, 2020. Overall, the results indicate that US Gulf soybean market shares to China are volatile and can be effectively explained (predicted) using a set of logical input variables. Several variables show significant influence in predicting soybean market shares, including shipments due at US Gulf port in 10 days, the cost of transporting soybean shipments via barge at Mid‐Mississippi, soybean exports loaded at US Gulf port in the past 7 days, and binary variables.
Citations: 2
An auxiliary Part‐of‐Speech tagger for blog and microblog cyber‐slang
Statistical Analysis and Data Mining: The ASA Data Science Journal. Pub Date: 2022-09-06. DOI: 10.1002/sam.11596
Silvia Golia, Paola Zola
Abstract: The increasing impact of Web 2.0 involves a growing usage of slang, abbreviations, and emphasized words, which limit the performance of traditional natural language processing models. State‐of‐the‐art Part‐of‐Speech (POS) taggers are often unable to assign a meaningful POS tag to every word in a Web 2.0 text. To address this limitation, we propose an auxiliary POS tagger that assigns the POS tag to a given token based on information derived from the sequence of preceding and following POS tags. The main advantage of the proposed auxiliary POS tagger is that it does not need information about the tokens themselves, since it relies only on the sequences of existing POS tags. The tagger is called auxiliary because it requires an initial POS tagging procedure, which might be performed using online dictionaries (e.g., Wikidictionary) or other POS tagging algorithms. The auxiliary POS tagger relies on a Bayesian network that uses information about preceding and following POS tags. It was evaluated on the Brown Corpus, which is a general linguistics corpus, on the modern ARK dataset composed of Twitter messages, and on a corpus of manually labeled Web 2.0 data.
Citations: 0
Out‐of‐bag stability estimation for k‐means clustering
Statistical Analysis and Data Mining: The ASA Data Science Journal. Pub Date: 2022-08-03. DOI: 10.1002/sam.11593
Tianmou Liu, Han Yu, R. Blair
Abstract: Clustering data is a challenging problem in unsupervised learning where there is no gold standard. Results depend on several factors, such as the selection of a clustering method, measures of dissimilarity, parameters, and the determination of the number of reliable groupings. Stability has become a valuable surrogate for performance and robustness that can provide insight to an investigator on the quality of a clustering, and guidance on subsequent cluster prioritization. This work develops a framework for stability measurements that is based on resampling and out‐of‐bag (OB) estimation. Bootstrapping methods for cluster stability can be prone to overfitting in a setting that is analogous to poor delineation of test and training sets in supervised learning. Stability that relies on OB items from a resampling overcomes these issues and does not depend on a reference clustering for comparisons. Furthermore, OB stability can provide estimates at the level of the item, the cluster, and as an overall summary, which has good interpretive value. This framework is extended to develop stability estimates for determining the number of clusters (model selection) through contrasts between stability estimates on clustered data and stability estimates of clustered reference data with no signal. These contrasts form stability profiles that can be used to identify the largest differences in stability and do not require a direct threshold on stability values, which tend to be data specific. These approaches can be implemented using the R package bootcluster, which is available on the Comprehensive R Archive Network.
Citations: 0
Distributed dimension reduction with nearly oracle rate
Statistical Analysis and Data Mining: The ASA Data Science Journal. Pub Date: 2022-08-03. DOI: 10.1002/sam.11592
Zhengtian Zhu, Liping Zhu
Abstract: We consider sufficient dimension reduction for heterogeneous massive data. We show that, even in the presence of heterogeneity and nonlinear dependence, the minimizers of convex loss functions of linear regression fall into the central subspace at the population level. We suggest a distributed algorithm to perform sufficient dimension reduction, where the convex loss functions are approximated with surrogate quadratic losses. This allows us to perform dimension reduction in a unified least squares framework and facilitates transmitting the gradients in our distributed algorithm. The minimizers of these surrogate quadratic losses possess a nearly oracle rate after a finite number of iterations. We conduct simulations and an application to demonstrate the effectiveness of our proposed distributed algorithm for heterogeneous massive data.
Citations: 1
A novel Bayesian method for variable selection and estimation in binary quantile regression
Statistical Analysis and Data Mining: The ASA Data Science Journal. Pub Date: 2022-07-23. DOI: 10.1002/sam.11591
Mai Dao, Min Wang, Souparno Ghosh
Abstract: In this paper, we develop a Bayesian hierarchical model and an associated computation strategy for simultaneously conducting parameter estimation and variable selection in binary quantile regression. We specify the customary asymmetric Laplace distribution for the error term and assign quantile‐dependent priors to the regression coefficients and to a binary vector that identifies the model configuration. Thanks to the normal‐exponential mixture representation of the asymmetric Laplace distribution, we develop a novel three‐stage computational scheme that starts with an expectation–maximization algorithm, continues with the Gibbs sampler, and ends with an importance re‐weighting step to draw nearly independent Markov chain Monte Carlo samples from the full posterior distributions of the unknown parameters. Simulation studies are conducted to compare the performance of the proposed Bayesian method with that of several existing ones in the literature. Finally, two real‐data applications are provided for illustrative purposes.
Citations: 0