{"title":"Stability estimation for unsupervised clustering: A review.","authors":"Tianmou Liu, Han Yu, Rachael Hageman Blair","doi":"10.1002/wics.1575","DOIUrl":"10.1002/wics.1575","url":null,"abstract":"<p><p>Cluster analysis remains one of the most challenging yet fundamental tasks in unsupervised learning. This is due in part to the fact that there are no labels or gold standards by which performance can be measured. Moreover, the wide range of clustering methods available is governed by different objective functions, different parameters, and dissimilarity measures. The purpose of clustering is versatile, often playing critical roles in the early stages of exploratory data analysis and as an endpoint for knowledge and discovery. Thus, understanding the quality of a clustering is of critical importance. The concept of <i>stability</i> has emerged as a strategy for assessing the performance and reproducibility of data clustering. The key idea is to produce perturbed data sets that are very close to the original, and cluster them. If the clustering is stable, then the clusters from the original data will be preserved in the perturbed data clustering. The nature of the perturbation, and the methods for quantifying similarity between clusterings, are nontrivial, and ultimately what distinguishes many of the stability estimation methods apart. In this review, we provide an overview of the very active research area of cluster stability estimation and discuss some of the open questions and challenges that remain in the field. This article is categorized under:Statistical Learning and Exploratory Methods of the Data Sciences > Clustering and Classification.</p>","PeriodicalId":47779,"journal":{"name":"Wiley Interdisciplinary Reviews-Computational Statistics","volume":"14 6","pages":"e1575"},"PeriodicalIF":4.4,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/0e/84/WICS-14-e1575.PMC9787023.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10512933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A survey of numerical algorithms that can solve the Lasso problems","authors":"Yujie Zhao, X. Huo","doi":"10.1002/wics.1602","DOIUrl":"https://doi.org/10.1002/wics.1602","url":null,"abstract":"In statistics, the least absolute shrinkage and selection operator (Lasso) is a regression method that performs both variable selection and regularization. There is a lot of literature available, discussing the statistical properties of the regression coefficients estimated by the Lasso method. However, there lacks a comprehensive review discussing the algorithms to solve the optimization problem in Lasso. In this review, we summarize five representative algorithms to optimize the objective function in Lasso, including iterative shrinkage threshold algorithm (ISTA), fast iterative shrinkage‐thresholding algorithms (FISTA), coordinate gradient descent algorithm (CGDA), smooth L1 algorithm (SLA), and path following algorithm (PFA). Additionally, we also compare their convergence rate, as well as their potential strengths and weakness.","PeriodicalId":47779,"journal":{"name":"Wiley Interdisciplinary Reviews-Computational Statistics","volume":" ","pages":""},"PeriodicalIF":3.2,"publicationDate":"2022-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41836132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data validation and statistical issues such as power and other considerations in genome‐wide association study (GWAS)","authors":"Makoto Tomita","doi":"10.1002/wics.1601","DOIUrl":"https://doi.org/10.1002/wics.1601","url":null,"abstract":"A series of steps in genomic data analysis will be presented. In data validation, starting with marker quality control, he mentioned structuring problems from ethnic populations, genome‐wide significant levels, Manhattan plots, and Haploview. Statistical issues such as power, sample size calculation, false discovery rate, and QQ plot of p‐values were also introduced.","PeriodicalId":47779,"journal":{"name":"Wiley Interdisciplinary Reviews-Computational Statistics","volume":" ","pages":""},"PeriodicalIF":3.2,"publicationDate":"2022-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46676632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On unbiasedness and biasedness of the Wilcoxon and some nonparametric tests","authors":"H. Murakami, Seong-Keon Lee","doi":"10.1002/wics.1600","DOIUrl":"https://doi.org/10.1002/wics.1600","url":null,"abstract":"In several fields of applications, the underlying theoretical distribution is unknown and cannot be assumed to have a specific parametric distribution such as a normal distribution. Nonparametric statistical methods are preferable in these cases. Nonparametric testing hypotheses have been one of the primarily used statistical procedures for nearly a century, and the power of the test is an important property in nonparametric testing procedures. This review discusses the unbiasedness of nonparametric tests. In nonparametric hypothesis, the best‐known Wilcoxon–Mann–Whitney (WMW) test has both robustness and power performance. Therefore, the WMW test is widely used to determine the location parameter. In this review, the unbiasedness and biasedness of the WMW test for the location parameter family of the distribution is mainly investigated. An overview of historical developments, detailed discussions, and works on the unbiasedness/biasedness of several nonparametric tests are presented with references to numerous studies. Finally, we conclude this review with a discussion on the unbiasedness/biasedness of nonparametric test procedures. This article is categorized under: Statistical and Graphical Methods of Data Analysis > Nonparametric Methods.","PeriodicalId":47779,"journal":{"name":"Wiley Interdisciplinary Reviews-Computational Statistics","volume":" ","pages":""},"PeriodicalIF":3.2,"publicationDate":"2022-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48942026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A review of recent advances in empirical likelihood","authors":"Pang-Chi Liu, Yichuan Zhao","doi":"10.1002/wics.1599","DOIUrl":"https://doi.org/10.1002/wics.1599","url":null,"abstract":"Empirical likelihood is widely used in many statistical problems. In this article, we provide a review of the empirical likelihood method, due to its significant development in recent years. Since the introduction of empirical likelihood, variants of empirical likelihood have been proposed, and the applications of empirical likelihood in high dimensions have also been studied. It is necessary to summarize the new development of empirical likelihood. In this article, we give a review of the Bayesian empirical likelihood, the bias‐corrected empirical likelihood, the jackknife empirical likelihood, the adjusted empirical likelihood, the extended empirical likelihood, the transformed empirical likelihood, the mean empirical likelihood, and the empirical likelihood with high dimensions. Finally, we have a brief survey of the computation and implementation for empirical likelihood methods.","PeriodicalId":47779,"journal":{"name":"Wiley Interdisciplinary Reviews-Computational Statistics","volume":" ","pages":""},"PeriodicalIF":3.2,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45129169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sequential Monte Carlo optimization and statistical inference","authors":"J. Duan, Shuping Li, Yaxian Xu","doi":"10.1002/wics.1598","DOIUrl":"https://doi.org/10.1002/wics.1598","url":null,"abstract":"Sequential Monte Carlo (SMC) is a powerful technique originally developed for particle filtering and Bayesian inference. As a generic optimizer for statistical and nonstatistical objectives, its role is far less known. Density‐tempered SMC is a highly efficient sampling technique ideally suited for challenging global optimization problems and is implementable with a somewhat arbitrary initialization sampler instead of relying on a prior distribution. SMC optimization is anchored at the fact that all optimization tasks (continuous, discontinuous, combinatorial, or noisy objective function) can be turned into sampling under a density or probability function short of a norming constant. The point with the highest functional value is the SMC estimate for the maximum. Through examples, we systematically present various density‐tempered SMC algorithms and their superior performance vs. other techniques like Markov Chain Monte Carlo. Data cloning and k‐fold duplication are two easily implementable accuracy accelerators, and their complementarity is discussed. The Extreme Value Theorem on the maximum order statistic can also help assess the quality of the SMC optimum. Our coverage includes the algorithmic essence of the density‐tempered SMC with various enhancements and solutions for (1) a bi‐modal nonstatistical function without and with constraints, (2) a multidimensional step function, (3) offline and online optimizations, (4) combinatorial variable selection, and (5) noninvertibility of the Hessian.","PeriodicalId":47779,"journal":{"name":"Wiley Interdisciplinary Reviews-Computational Statistics","volume":" ","pages":""},"PeriodicalIF":3.2,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42620366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cluster analysis: A modern statistical review","authors":"Adam Jaeger, David Banks","doi":"10.1002/wics.1597","DOIUrl":"https://doi.org/10.1002/wics.1597","url":null,"abstract":"Cluster analysis is a big, sprawling field. This review paper cannot hope to fully survey the territory. Instead, it focuses on hierarchical agglomerative clustering, k‐means clustering, mixture models, and then several related topics of which any cluster analysis practitioner should be aware. Even then, this review cannot do justice to the chosen topics. There is a lot of literature, and often it is somewhat ad hoc. That is generally the nature of cluster analysis—each application requires a bespoke analysis. Nonetheless, clustering has proven itself to be incredibly useful as an exploratory data analysis tool in biology, advertising, recommender systems, and genomics.","PeriodicalId":47779,"journal":{"name":"Wiley Interdisciplinary Reviews-Computational Statistics","volume":" ","pages":""},"PeriodicalIF":3.2,"publicationDate":"2022-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48225906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust regression using probabilistically linked data","authors":"R. Chambers, E. Fabrizi, M. Ranalli, N. Salvati, Suojin Wang","doi":"10.1002/wics.1596","DOIUrl":"https://doi.org/10.1002/wics.1596","url":null,"abstract":"There is growing interest in a data integration approach to survey sampling, particularly where population registers are linked for sampling and subsequent analysis. The reason for doing this is simple: it is only by linking the same individuals in the different sources that it becomes possible to create a data set suitable for analysis. But data linkage is not error free. Many linkages are nondeterministic, based on how likely a linking decision corresponds to a correct match, that is, it brings together the same individual in all sources. High quality linking will ensure that the probability of this happening is high. Analysis of the linked data should take account of this additional source of error when this is not the case. This is especially true for secondary analysis carried out without access to the linking information, that is, the often confidential data that agencies use in their record matching. We describe an inferential framework that allows for linkage errors when sampling from linked registers. After first reviewing current research activity in this area, we focus on secondary analysis and linear regression modeling, including the important special case of estimation of subpopulation and small area means. In doing so we consider both robustness and efficiency of the resulting linked data inferences.","PeriodicalId":47779,"journal":{"name":"Wiley Interdisciplinary Reviews-Computational Statistics","volume":" ","pages":""},"PeriodicalIF":3.2,"publicationDate":"2022-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46408778","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SAREV: A review on statistical analytics of single-cell RNA sequencing data.","authors":"Dorothy Ellis, Dongyuan Wu, Susmita Datta","doi":"10.1002/wics.1558","DOIUrl":"10.1002/wics.1558","url":null,"abstract":"<p><p>Due to the development of next-generation RNA sequencing (NGS) technologies, there has been tremendous progress in research involving determining the role of genomics, transcriptomics and epigenomics in complex biological systems. However, scientists have realized that information obtained using earlier technology, frequently called 'bulk RNA-seq' data, provides information averaged across all the cells present in a tissue. Relatively newly developed single cell (scRNA-seq) technology allows us to provide transcriptomic information at a single-cell resolution. Nevertheless, these high-resolution data have their own complex natures and demand novel statistical data analysis methods to provide effective and highly accurate results on complex biological systems. In this review, we cover many such recently developed statistical methods for researchers wanting to pursue scRNA-seq statistical and computational research as well as scientific research about these existing methods and free software tools available for their generated data. This review is certainly not exhaustive due to page limitations. We have tried to cover the popular methods starting from quality control to the downstream analysis of finding differentially expressed genes and concluding with a brief description of network analysis.</p>","PeriodicalId":47779,"journal":{"name":"Wiley Interdisciplinary Reviews-Computational Statistics","volume":"14 4","pages":""},"PeriodicalIF":3.2,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/wics.1558","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9729203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}