Statistical Analysis and Data Mining: The ASA Data Science Journal: Latest Articles, Page 4

Randomized algorithms for tensor response regression

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2022-11-21 DOI: 10.1002/sam.11603

Zhe Cheng, Xiangjian Xu, Zihao Song, Weihua Zhao

Citations: 0

Local support vector machine based dimension reduction

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2022-10-17 DOI: 10.1002/sam.11600

Linxi Li, Qin Wang, Chenlu Ke

Citations: 1

Frequentist model averaging for zero‐inflated Poisson regression models

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2022-10-05 DOI: 10.1002/sam.11598

Jianhong Zhou, Alan T. K. Wan, Dalei Yu

Citations: 0

Feature screening of ultrahigh dimensional longitudinal data based on the C‐statistic

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2022-09-26 DOI: 10.1002/sam.11597

Peng Lai, Qing Di, Zhezi Shen, Yanqiu Zhou

Citations: 0

Nonparametric clustering of RNA‐sequencing data

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2022-09-23 DOI: 10.1002/sam.11638

Gabriel L. Lozano, Nadia M. Atallah, M. Levine

Abstract: Identifying clusters of co-expressed genes in transcriptomic data is a difficult task. Most algorithms used for this purpose fall into two broad categories: distance-based and model-based approaches. Distance-based approaches typically use a distance function between pairs of data objects and group similar objects into clusters. Model-based approaches rely on the mixture-modeling framework. Compared to distance-based approaches, model-based approaches offer better interpretability because each cluster can be explicitly characterized in terms of the proposed model. However, these models make it particularly difficult to identify a correct multivariate distribution on which a mixture can be based. In this manuscript, we first review some of the approaches used to select a distribution for the mixture model. We then propose avoiding this problem altogether by using a nonparametric MSL (Maximum Smoothed Likelihood) algorithm. This algorithm was proposed earlier in the statistical literature but has not, to the best of our knowledge, been applied to transcriptomics data. The salient feature of this approach is that it avoids explicit specification of the distributions of individual biological samples, which makes the practitioner's task easier. When used on a real dataset, the algorithm produces a large number of biologically meaningful clusters and compares favorably to the two other mixture-based algorithms commonly used for RNA-seq data clustering. Our code is publicly available on GitHub at https://github.com/Matematikoi/non_parametric_clustering.

Citations: 0

Machine learning and neural network based model predictions of soybean export shares from US Gulf to China

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2022-09-07 DOI: 10.1002/sam.11595

Shantanu Awasthi, I. Sengupta, W. Wilson, Prithviraj Lakkakula

Citations: 2

An auxiliary Part‐of‐Speech tagger for blog and microblog cyber‐slang

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2022-09-06 DOI: 10.1002/sam.11596

Silvia Golia, Paola Zola

Abstract: The increasing impact of Web 2.0 involves a growing usage of slang, abbreviations, and emphasized words, which limit the performance of traditional natural language processing models. State‐of‐the‐art Part‐of‐Speech (POS) taggers are often unable to assign a meaningful POS tag to all the words in a Web 2.0 text. To address this limitation, we propose an auxiliary POS tagger that assigns the POS tag to a given token based on the information derived from a sequence of preceding and following POS tags. The main advantage of the proposed auxiliary POS tagger is that it does not need information about the tokens themselves, since it relies only on the sequences of existing POS tags. This tagger is called auxiliary because it requires an initial POS tagging procedure that might be performed using online dictionaries (e.g., Wikidictionary) or other POS tagging algorithms. The auxiliary POS tagger relies on a Bayesian network that uses information about preceding and following POS tags. It was evaluated on the Brown Corpus, which is a general linguistics corpus, on the modern ARK dataset composed of Twitter messages, and on a corpus of manually labeled Web 2.0 data.
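
The idea of tagging a token purely from its neighboring POS tags can be illustrated with a minimal sketch. This is not the authors' Bayesian network: it is a simplified stand-in that counts how often each tag occurs between a given (previous tag, next tag) pair and picks the most frequent one. The function names, the `UNK` marker, the sentence-boundary symbols, the `NOUN` fallback, and the toy training data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_context_model(tagged_sentences):
    """Count how often each tag appears between a (previous, next) tag pair."""
    counts = defaultdict(Counter)
    for tags in tagged_sentences:
        # Pad with hypothetical boundary symbols so edge tokens have a context.
        padded = ["<S>"] + list(tags) + ["</S>"]
        for i in range(1, len(padded) - 1):
            counts[(padded[i - 1], padded[i + 1])][padded[i]] += 1
    return counts


def fill_unknown_tags(tags, counts, unknown="UNK", fallback="NOUN"):
    """Replace each unknown tag with the most frequent tag seen in that context."""
    padded = ["<S>"] + list(tags) + ["</S>"]
    result = []
    for i in range(1, len(padded) - 1):
        tag = padded[i]
        if tag == unknown:
            seen = counts.get((padded[i - 1], padded[i + 1]))
            # Fall back to a default tag when the context was never observed.
            tag = seen.most_common(1)[0][0] if seen else fallback
        result.append(tag)
    return result


# Toy tag sequences standing in for the output of an initial tagging pass.
training = [
    ["DET", "NOUN", "VERB"],
    ["DET", "ADJ", "NOUN", "VERB"],
    ["PRON", "VERB", "DET", "NOUN"],
    ["DET", "NOUN", "VERB", "ADV"],
]
model = train_context_model(training)
filled = fill_unknown_tags(["DET", "UNK", "VERB"], model)
print(filled)
```

Like the tagger described in the abstract, the sketch never looks at the token itself, which is why it can be applied to slang or misspelled words that an initial tagger left unlabeled.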

Citations: 0

Out‐of‐bag stability estimation for k‐means clustering

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2022-08-03 DOI: 10.1002/sam.11593

Tianmou Liu, Han Yu, R. Blair

Abstract: Clustering data is a challenging problem in unsupervised learning where there is no gold standard. Results depend on several factors, such as the selection of a clustering method, measures of dissimilarity, parameters, and the determination of the number of reliable groupings. Stability has become a valuable surrogate for performance and robustness that can give an investigator insight into the quality of a clustering and guidance on subsequent cluster prioritization. This work develops a framework for stability measurements that is based on resampling and out‐of‐bag (OB) estimation. Bootstrapping methods for cluster stability can be prone to overfitting in a setting that is analogous to poor delineation of test and training sets in supervised learning. Stability that relies on OB items from a resampling overcomes these issues and does not depend on a reference clustering for comparisons. Furthermore, OB stability can provide estimates at the level of the item, the cluster, and as an overall summary, which has good interpretive value. This framework is extended to develop stability estimates for determining the number of clusters (model selection) through contrasts between stability estimates on clustered data and stability estimates of clustered reference data with no signal. These contrasts form stability profiles that can be used to identify the largest differences in stability and do not require a direct threshold on stability values, which tend to be data specific. These approaches can be implemented using the R package bootcluster, which is available on the Comprehensive R Archive Network.
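
A rough sketch of the out‐of‐bag idea, under stated assumptions, may help make the abstract concrete. It clusters each bootstrap resample with a plain k‐means, labels the left-out (OB) items by their nearest centroid, and then measures how often pairs of replicates agree on whether two shared OB items belong together. The agreement score, the farthest-point initialization, and all function names are assumptions chosen for illustration; they are not the definitions used in the paper or in the bootcluster package.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda j: sq_dist(point, centers[j]))


def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm with a deterministic farthest-point start."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the old center if a group happens to be empty.
        centers = [tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers


def oob_stability(data, k, n_boot=20, seed=0):
    """Pairwise co-membership agreement on items that are out-of-bag twice."""
    rng = random.Random(seed)
    n = len(data)
    labelings = []  # per replicate: item index -> label, OB items only
    for _ in range(n_boot):
        in_bag = [rng.randrange(n) for _ in range(n)]
        centers = kmeans([data[i] for i in in_bag], k)
        bag = set(in_bag)
        labelings.append({i: nearest(data[i], centers)
                          for i in range(n) if i not in bag})
    agree = total = 0
    for a in range(n_boot):
        for b in range(a + 1, n_boot):
            shared = sorted(set(labelings[a]) & set(labelings[b]))
            for x in range(len(shared)):
                for y in range(x + 1, len(shared)):
                    i, j = shared[x], shared[y]
                    same_a = labelings[a][i] == labelings[a][j]
                    same_b = labelings[b][i] == labelings[b][j]
                    agree += same_a == same_b
                    total += 1
    return agree / total if total else float("nan")


# Two well-separated toy groups on a line (hypothetical data).
data = [(x * 0.1, 0.0) for x in range(10)] + [(10 + x * 0.1, 0.0) for x in range(10)]
stability = oob_stability(data, k=2)
print(stability)
```

Because only items left out of a resample are scored, no item is evaluated by a clustering it helped to fit, which mirrors the train/test separation the abstract refers to.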

Citations: 0

Distributed dimension reduction with nearly oracle rate

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2022-08-03 DOI: 10.1002/sam.11592

Zhengtian Zhu, Liping Zhu

Citations: 1

A novel Bayesian method for variable selection and estimation in binary quantile regression

Statistical Analysis and Data Mining: The ASA Data Science Journal Pub Date : 2022-07-23 DOI: 10.1002/sam.11591

Mai Dao, Min Wang, Souparno Ghosh

Citations: 0