{"title":"Cluster analysis: A modern statistical review","authors":"Adam Jaeger, David Banks","doi":"10.1002/wics.1597","DOIUrl":"https://doi.org/10.1002/wics.1597","url":null,"abstract":"Cluster analysis is a big, sprawling field. This review paper cannot hope to fully survey the territory. Instead, it focuses on hierarchical agglomerative clustering, k‐means clustering, mixture models, and then several related topics of which any cluster analysis practitioner should be aware. Even then, this review cannot do justice to the chosen topics. There is a lot of literature, and often it is somewhat ad hoc. That is generally the nature of cluster analysis—each application requires a bespoke analysis. Nonetheless, clustering has proven itself to be incredibly useful as an exploratory data analysis tool in biology, advertising, recommender systems, and genomics.","PeriodicalId":47779,"journal":{"name":"Wiley Interdisciplinary Reviews-Computational Statistics","volume":" ","pages":""},"PeriodicalIF":3.2,"publicationDate":"2022-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48225906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust regression using probabilistically linked data","authors":"R. Chambers, E. Fabrizi, M. Ranalli, N. Salvati, Suojin Wang","doi":"10.1002/wics.1596","DOIUrl":"https://doi.org/10.1002/wics.1596","url":null,"abstract":"There is growing interest in a data integration approach to survey sampling, particularly where population registers are linked for sampling and subsequent analysis. The reason for doing this is simple: it is only by linking the same individuals in the different sources that it becomes possible to create a data set suitable for analysis. But data linkage is not error free. Many linkages are nondeterministic, based on how likely a linking decision corresponds to a correct match, that is, it brings together the same individual in all sources. High quality linking will ensure that the probability of this happening is high. Analysis of the linked data should take account of this additional source of error when this is not the case. This is especially true for secondary analysis carried out without access to the linking information, that is, the often confidential data that agencies use in their record matching. We describe an inferential framework that allows for linkage errors when sampling from linked registers. After first reviewing current research activity in this area, we focus on secondary analysis and linear regression modeling, including the important special case of estimation of subpopulation and small area means. In doing so we consider both robustness and efficiency of the resulting linked data inferences.","PeriodicalId":47779,"journal":{"name":"Wiley Interdisciplinary Reviews-Computational Statistics","volume":" ","pages":""},"PeriodicalIF":3.2,"publicationDate":"2022-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46408778","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the safe use of prior densities for Bayesian model selection","authors":"F. Llorente, Luca Martino, E. Curbelo, J. Lopez-Santiago, D. Delgado","doi":"10.1002/wics.1595","DOIUrl":"https://doi.org/10.1002/wics.1595","url":null,"abstract":"The application of Bayesian inference for the purpose of model selection is very popular nowadays. In this framework, models are compared through their marginal likelihoods, or their quotients, called Bayes factors. However, marginal likelihoods depend on the prior choice. For model selection, even diffuse priors can be actually very informative, unlike for the parameter estimation problem. Furthermore, when the prior is improper, the marginal likelihood of the corresponding model is undetermined. In this work, we discuss the issue of prior sensitivity of the marginal likelihood and its role in model selection. We also comment on the use of uninformative priors, which are very common choices in practice. Several practical suggestions are discussed and many possible solutions, proposed in the literature, to design objective priors for model selection are described. Some of them also allow the use of improper priors. The connection between the marginal likelihood approach and the well‐known information criteria is also presented. We describe the main issues and possible solutions by illustrative numerical examples, providing also some related code. One of them involving a real‐world application on exoplanet detection.","PeriodicalId":47779,"journal":{"name":"Wiley Interdisciplinary Reviews-Computational Statistics","volume":" ","pages":""},"PeriodicalIF":3.2,"publicationDate":"2022-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44402673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A review of normalization and differential abundance methods for microbiome counts data","authors":"Dionne Swift, Kellen Cresswell, Robert Johnson, Spiro C. Stilianoudakis, Xingtao Wei","doi":"10.1002/wics.1586","DOIUrl":"https://doi.org/10.1002/wics.1586","url":null,"abstract":"The recent development of cost‐effective high‐throughput DNA sequencing technologies has tremendously increased microbiome research. However, it has been well documented that the observed microbiome data suffers from compositionality, sparsity, and high variability. All of which pose serious challenges when analyzing microbiome data. Over the last decade, there has been considerable amount of interest into statistical and computational methods to tackle these challenges. The choice of inference aids in the selection of the appropriate statistical methods since only a few methods allow inferences for absolute abundance while most methods allow inferences for relative abundances. An overview of recent methods for differential abundance analysis and normalization of microbiome data is presented, focusing on methods that are accessible but have not been widely covered in previous literature. In detailed descriptions of each method, we discuss assumptions and if and how these methods address the challenges of microbiome data. These methods are compared based on accuracy metrics in real and simulated settings. The goal is to provide a comprehensive but non‐exhaustive set of potential and easily‐accessible tools for differential abundance and normalization of microbiome data.","PeriodicalId":47779,"journal":{"name":"Wiley Interdisciplinary Reviews-Computational Statistics","volume":" ","pages":""},"PeriodicalIF":3.2,"publicationDate":"2022-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45693764","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Projection‐based techniques for high‐dimensional optimal transport problems","authors":"Jingyi Zhang, Ping Ma, Wenxuan Zhong, Cheng Meng","doi":"10.1002/wics.1587","DOIUrl":"https://doi.org/10.1002/wics.1587","url":null,"abstract":"Optimal transport (OT) methods seek a transformation map (or plan) between two probability measures, such that the transformation has the minimum transportation cost. Such a minimum transport cost, with a certain power transform, is called the Wasserstein distance. Recently, OT methods have drawn great attention in statistics, machine learning, and computer science, especially in deep generative neural networks. Despite its broad applications, the estimation of high‐dimensional Wasserstein distances is a well‐known challenging problem owing to the curse‐of‐dimensionality. There are some cutting‐edge projection‐based techniques that tackle high‐dimensional OT problems. Three major approaches of such techniques are introduced, respectively, the slicing approach, the iterative projection approach, and the projection robust OT approach. Open challenges are discussed at the end of the review.","PeriodicalId":47779,"journal":{"name":"Wiley Interdisciplinary Reviews-Computational Statistics","volume":" ","pages":""},"PeriodicalIF":3.2,"publicationDate":"2022-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48984047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Integrative clustering methods for multi-omics data.","authors":"Xiaoyu Zhang, Zhenwei Zhou, Hanfei Xu, Ching-Ti Liu","doi":"10.1002/wics.1553","DOIUrl":"https://doi.org/10.1002/wics.1553","url":null,"abstract":"Integrative analysis of multi-omics data has drawn much attention from the scientific community due to the technological advancements which have generated various omics data. Leveraging these multi-omics data potentially provides a more comprehensive view of the disease mechanism or biological processes. Integrative multi-omics clustering is an unsupervised integrative method specifically used to find coherent groups of samples or features by utilizing information across multi-omics data. It aims to better stratify diseases and to suggest biological mechanisms and potential targeted therapies for the diseases. However, applying integrative multi-omics clustering is both statistically and computationally challenging due to various reasons such as high dimensionality and heterogeneity. In this review, we summarized integrative multi-omics clustering methods into three general categories: concatenated clustering, clustering of clusters, and interactive clustering based on when and how the multi-omics data are processed for clustering. We further classified the methods into different approaches under each category based on the main statistical strategy used during clustering. In addition, we have provided recommended practices tailored to four real-life scenarios to help researchers to strategize their selection in integrative multi-omics clustering methods for their future studies.","PeriodicalId":47779,"journal":{"name":"Wiley Interdisciplinary Reviews-Computational Statistics","volume":"14 3","pages":""},"PeriodicalIF":3.2,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/wics.1553","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9379724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Statistical inference for stochastic differential equations","authors":"P. Craigmile, Radu Herbei, Geoffrey Liu, Grant Schneider","doi":"10.1002/wics.1585","DOIUrl":"https://doi.org/10.1002/wics.1585","url":null,"abstract":"Many scientific fields have experienced growth in the use of stochastic differential equations (SDEs), also known as diffusion processes, to model scientific phenomena over time. SDEs can simultaneously capture the known deterministic dynamics of underlying variables of interest (e.g., ocean flow, chemical and physical characteristics of a body of water, presence, absence, and spread of a disease), while enabling a modeler to capture the unknown random dynamics in a stochastic setting. We focus on reviewing a wide range of statistical inference methods for likelihood‐based frequentist and Bayesian parametric inference based on discretely‐sampled diffusions. Exact parametric inference is not usually possible because the transition density is not available in closed form. Thus, we review the literature on approximate numerical methods (e.g., Euler, Milstein, local linearization, and Aït‐Sahalia) and simulation‐based approaches (e.g., data augmentation and exact sampling) that are used to carry out parametric statistical inference on SDE processes. We close with a brief discussion of other methods of inference for SDEs and more complex SDE processes such as spatio‐temporal SDEs.","PeriodicalId":47779,"journal":{"name":"Wiley Interdisciplinary Reviews-Computational Statistics","volume":" ","pages":""},"PeriodicalIF":3.2,"publicationDate":"2022-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47620930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Function minimization and nonlinear least squares in R","authors":"J. Nash","doi":"10.1002/wics.1580","DOIUrl":"https://doi.org/10.1002/wics.1580","url":null,"abstract":"This review will look at function minimization and nonlinear least squares, possibly bounds constrained, using R. These tools derive from the more general context of numerical optimization and mathematical programming. How R developers have tried to make the application of such tools easier for users not familiar with optimization is highlighted. Some limitations of methods and their implementations are mentioned to provide perspective.","PeriodicalId":47779,"journal":{"name":"Wiley Interdisciplinary Reviews-Computational Statistics","volume":" ","pages":""},"PeriodicalIF":3.2,"publicationDate":"2022-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45021708","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Echelon analysis and its software for spatial lattice data","authors":"K. Kurihara, Fumio Ishioka","doi":"10.1002/wics.1579","DOIUrl":"https://doi.org/10.1002/wics.1579","url":null,"abstract":"In this study, we explore the use of echelon analysis and its software named EcheScan for spatial lattice data. EcheScan is developed as a web application via an internet browser in R language and Shiny server for echelon analysis. The technique of echelon is proposed to analyze the topological structure for spatial lattice data. The echelon tree provides a dendrogram representation. Regional features, such as hierarchical spatial data structure and hotspots clusters, are shown in an echelon dendrogram. In addition, we introduce the conception of echelon with the values and neighbors for lattice data. We also explain the use of EcheScan for one‐ and two‐dimensional regular lattice data. Furthermore, coronavirus disease 2019 death data corresponding to 50 US states are illustrated using EcheScan as an example of geospatial lattice data.","PeriodicalId":47779,"journal":{"name":"Wiley Interdisciplinary Reviews-Computational Statistics","volume":" ","pages":""},"PeriodicalIF":3.2,"publicationDate":"2022-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48195347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}