{"title":"Split-Apply-Combine with Dynamic Grouping","authors":"Mark P. J. van der Loo","doi":"arxiv-2406.09887","DOIUrl":"https://doi.org/arxiv-2406.09887","url":null,"abstract":"Partitioning a data set by one or more of its attributes and computing an\u0000aggregate for each part is one of the most common operations in data analyses.\u0000There are use cases where the partitioning is determined dynamically by\u0000collapsing smaller subsets into larger ones, to ensure sufficient support for\u0000the computed aggregate. These use cases are not supported by software\u0000implementing split-apply-combine types of operations. This paper presents the\u0000texttt{R} package texttt{accumulate} that offers convenient interfaces for\u0000defining grouped aggregation where the grouping itself is dynamically\u0000determined, based on user-defined conditions on subsets, and a user-defined\u0000subset collapsing scheme. The formal underlying algorithm is described and\u0000analyzed as well.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141523193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning High-dimensional Latent Variable Models via Doubly Stochastic Optimisation by Unadjusted Langevin","authors":"Motonori Oka, Yunxiao Chen, Irini Mounstaki","doi":"arxiv-2406.09311","DOIUrl":"https://doi.org/arxiv-2406.09311","url":null,"abstract":"Latent variable models are widely used in social and behavioural sciences,\u0000such as education, psychology, and political science. In recent years,\u0000high-dimensional latent variable models have become increasingly common for\u0000analysing large and complex data. Estimating high-dimensional latent variable\u0000models using marginal maximum likelihood is computationally demanding due to\u0000the complexity of integrals involved. To address this challenge, stochastic\u0000optimisation, which combines stochastic approximation and sampling techniques,\u0000has been shown to be effective. This method iterates between two steps -- (1)\u0000sampling the latent variables from their posterior distribution based on the\u0000current parameter estimate, and (2) updating the fixed parameters using an\u0000approximate stochastic gradient constructed from the latent variable samples.\u0000In this paper, we propose a computationally more efficient stochastic\u0000optimisation algorithm. This improvement is achieved through the use of a\u0000minibatch of observations when sampling latent variables and constructing\u0000stochastic gradients, and an unadjusted Langevin sampler that utilises the\u0000gradient of the negative complete-data log-likelihood to sample latent\u0000variables. Theoretical results are established for the proposed algorithm,\u0000showing that the iterative parameter update converges to the marginal maximum\u0000likelihood estimate as the number of iterations goes to infinity. Furthermore,\u0000the proposed algorithm is shown to scale well to high-dimensional settings\u0000through simulation studies and a personality test application with 30,000\u0000respondents, 300 items, and 30 latent dimensions.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141523194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast solution to the fair ranking problem using the Sinkhorn algorithm","authors":"Yuki Uehara, Shunnosuke Ikeda, Naoki Nishimura, Koya Ohashi, Yilin Li, Jie Yang, Deddy Jobson, Xingxia Zha, Takeshi Matsumoto, Noriyoshi Sukegawa, Yuichi Takano","doi":"arxiv-2406.10262","DOIUrl":"https://doi.org/arxiv-2406.10262","url":null,"abstract":"In two-sided marketplaces such as online flea markets, recommender systems\u0000for providing consumers with personalized item rankings play a key role in\u0000promoting transactions between providers and consumers. Meanwhile, two-sided\u0000marketplaces face the problem of balancing consumer satisfaction and fairness\u0000among items to stimulate activity of item providers. Saito and Joachims (2022)\u0000devised an impact-based fair ranking method for maximizing the Nash social\u0000welfare based on fair division; however, this method, which requires solving a\u0000large-scale constrained nonlinear optimization problem, is very difficult to\u0000apply to practical-scale recommender systems. We thus propose a fast solution\u0000to the impact-based fair ranking problem. We first transform the fair ranking\u0000problem into an unconstrained optimization problem and then design a gradient\u0000ascent method that repeatedly executes the Sinkhorn algorithm. Experimental\u0000results demonstrate that our algorithm provides fair rankings of high quality\u0000and is about 1000 times faster than application of commercial optimization\u0000software.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141523195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Computationally efficient permutation tests for the multivariate two-sample problem based on energy distance or maximum mean discrepancy statistics","authors":"Elias Chaibub Neto","doi":"arxiv-2406.06488","DOIUrl":"https://doi.org/arxiv-2406.06488","url":null,"abstract":"Non-parametric two-sample tests based on energy distance or maximum mean\u0000discrepancy are widely used statistical tests for comparing multivariate data\u0000from two populations. While these tests enjoy desirable statistical properties,\u0000their test statistics can be expensive to compute as they require the\u0000computation of 3 distinct Euclidean distance (or kernel) matrices between\u0000samples, where the time complexity of each of these computations (namely,\u0000$O(n_{x}^2 p)$, $O(n_{y}^2 p)$, and $O(n_{x} n_{y} p)$) scales quadratically\u0000with the number of samples ($n_x$, $n_y$) and linearly with the number of\u0000variables ($p$). Since the standard permutation test requires repeated\u0000re-computations of these expensive statistics it's application to large\u0000datasets can become unfeasible. While several statistical approaches have been\u0000proposed to mitigate this issue, they all sacrifice desirable statistical\u0000properties to decrease the computational cost (e.g., trade computation speed by\u0000a decrease in statistical power). A better computational strategy is to first\u0000pre-compute the Euclidean distance (kernel) matrix of the concatenated data,\u0000and then permute indexes and retrieve the corresponding elements to compute the\u0000re-sampled statistics. While this strategy can reduce the computation cost\u0000relative to the standard permutation test, it relies on the computation of a\u0000larger Euclidean distance (kernel) matrix with complexity $O((n_x + n_y)^2 p)$.\u0000In this paper, we present a novel computationally efficient permutation\u0000algorithm which only requires the pre-computation of the 3 smaller matrices and\u0000achieves large computational speedups without sacrificing finite-sample\u0000validity or statistical power. We illustrate its computational gains in a\u0000series of experiments and compare its statistical power to the current\u0000state-of-the-art approach for balancing computational cost and statistical\u0000performance.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Smiles2Dock: an open large-scale multi-task dataset for ML-based molecular docking","authors":"Thomas Le Menestrel, Manuel Rivas","doi":"arxiv-2406.05738","DOIUrl":"https://doi.org/arxiv-2406.05738","url":null,"abstract":"Docking is a crucial component in drug discovery aimed at predicting the\u0000binding conformation and affinity between small molecules and target proteins.\u0000ML-based docking has recently emerged as a prominent approach, outpacing\u0000traditional methods like DOCK and AutoDock Vina in handling the growing scale\u0000and complexity of molecular libraries. However, the availability of\u0000comprehensive and user-friendly datasets for training and benchmarking ML-based\u0000docking algorithms remains limited. We introduce Smiles2Dock, an open\u0000large-scale multi-task dataset for molecular docking. We created a framework\u0000combining P2Rank and AutoDock Vina to dock 1.7 million ligands from the ChEMBL\u0000database against 15 AlphaFold proteins, giving us more than 25 million\u0000protein-ligand binding scores. The dataset leverages a wide range of\u0000high-accuracy AlphaFold protein models, encompasses a diverse set of\u0000biologically relevant compounds and enables researchers to benchmark all major\u0000approaches for ML-based docking such as Graph, Transformer and CNN-based\u0000methods. We also introduce a novel Transformer-based architecture for docking\u0000scores prediction and set it as an initial benchmark for our dataset. Our\u0000dataset and code are publicly available to support the development of novel\u0000ML-based methods for molecular docking to advance scientific research in this\u0000field.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141531891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Stochastic full waveform inversion with deep generative prior for uncertainty quantification","authors":"Yuke Xie, Hervé Chauris, Nicolas Desassis","doi":"arxiv-2406.04859","DOIUrl":"https://doi.org/arxiv-2406.04859","url":null,"abstract":"To obtain high-resolution images of subsurface structures from seismic data,\u0000seismic imaging techniques such as Full Waveform Inversion (FWI) serve as\u0000crucial tools. However, FWI involves solving a nonlinear and often non-unique\u0000inverse problem, presenting challenges such as local minima trapping and\u0000inadequate handling of inherent uncertainties. In addressing these challenges,\u0000we propose leveraging deep generative models as the prior distribution of\u0000geophysical parameters for stochastic Bayesian inversion. This approach\u0000integrates the adjoint state gradient for efficient back-propagation from the\u0000numerical solution of partial differential equations. Additionally, we\u0000introduce explicit and implicit variational Bayesian inference methods. The\u0000explicit method computes variational distribution density using a normalizing\u0000flow-based neural network, enabling computation of the Bayesian posterior of\u0000parameters. Conversely, the implicit method employs an inference network\u0000attached to a pretrained generative model to estimate density, incorporating an\u0000entropy estimator. Furthermore, we also experimented with the Stein Variational\u0000Gradient Descent (SVGD) method as another variational inference technique,\u0000using particles. We compare these variational Bayesian inference methods with\u0000conventional Markov chain Monte Carlo (McMC) sampling. Each method is able to\u0000quantify uncertainties and to generate seismic data-conditioned realizations of\u0000subsurface geophysical parameters. This framework provides insights into\u0000subsurface structures while accounting for inherent uncertainties.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Multiscale Perspective on Maximum Marginal Likelihood Estimation","authors":"O. Deniz Akyildiz, Iain Souttar, Michela Ottobre","doi":"arxiv-2406.04187","DOIUrl":"https://doi.org/arxiv-2406.04187","url":null,"abstract":"In this paper, we provide a multiscale perspective on the problem of maximum\u0000marginal likelihood estimation. We consider and analyse a diffusion-based\u0000maximum marginal likelihood estimation scheme using ideas from multiscale\u0000dynamics. Our perspective is based on stochastic averaging; we make an explicit\u0000connection between ideas in applied probability and parameter inference in\u0000computational statistics. In particular, we consider a general class of coupled\u0000Langevin diffusions for joint inference of latent variables and parameters in\u0000statistical models, where the latent variables are sampled from a fast Langevin\u0000process (which acts as a sampler), and the parameters are updated using a slow\u0000Langevin process (which acts as an optimiser). We show that the resulting\u0000system of stochastic differential equations (SDEs) can be viewed as a two-time\u0000scale system. To demonstrate the utility of such a perspective, we show that\u0000the averaged parameter dynamics obtained in the limit of scale separation can\u0000be used to estimate the optimal parameter, within the strongly convex setting.\u0000We do this by using recent uniform-in-time non-asymptotic averaging bounds.\u0000Finally, we conclude by showing that the slow-fast algorithm we consider here,\u0000termed Slow-Fast Langevin Algorithm, performs on par with state-of-the-art\u0000methods on a variety of examples. We believe that the stochastic averaging\u0000approach we provide in this paper enables us to look at these algorithms from a\u0000fresh angle, as well as unlocking the path to develop and analyse new methods\u0000using well-established averaging principles.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141531892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Variational Pseudo Marginal Methods for Jet Reconstruction in Particle Physics","authors":"Hanming Yang, Antonio Khalil Moretti, Sebastian Macaluso, Philippe Chlenski, Christian A. Naesseth, Itsik Pe'er","doi":"arxiv-2406.03242","DOIUrl":"https://doi.org/arxiv-2406.03242","url":null,"abstract":"Reconstructing jets, which provide vital insights into the properties and\u0000histories of subatomic particles produced in high-energy collisions, is a main\u0000problem in data analyses in collider physics. This intricate task deals with\u0000estimating the latent structure of a jet (binary tree) and involves parameters\u0000such as particle energy, momentum, and types. While Bayesian methods offer a\u0000natural approach for handling uncertainty and leveraging prior knowledge, they\u0000face significant challenges due to the super-exponential growth of potential\u0000jet topologies as the number of observed particles increases. To address this,\u0000we introduce a Combinatorial Sequential Monte Carlo approach for inferring jet\u0000latent structures. As a second contribution, we leverage the resulting\u0000estimator to develop a variational inference algorithm for parameter learning.\u0000Building on this, we introduce a variational family using a pseudo-marginal\u0000framework for a fully Bayesian treatment of all variables, unifying the\u0000generative model with the inference process. We illustrate our method's\u0000effectiveness through experiments using data generated with a collider physics\u0000generative model, highlighting superior speed and accuracy across a range of\u0000tasks.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141523197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Variance-reduced sampling importance resampling","authors":"Yao Xiao, Kang Fu, Kun Li","doi":"arxiv-2406.01864","DOIUrl":"https://doi.org/arxiv-2406.01864","url":null,"abstract":"The sampling importance resampling method is widely utilized in various\u0000fields, such as numerical integration and statistical simulation. In this\u0000paper, two modified methods are presented by incorporating two variance\u0000reduction techniques commonly used in Monte Carlo simulation, namely antithetic\u0000sampling and Latin hypercube sampling, into the process of sampling importance\u0000resampling method respectively. Theoretical evidence is provided to demonstrate\u0000that the proposed methods significantly reduce estimation errors compared to\u0000the original approach. Furthermore, the effectiveness and advantages of the\u0000proposed methods are validated through both numerical studies and real data\u0000analysis.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141255858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MPCR: Multi- and Mixed-Precision Computations Package in R","authors":"Mary Lai O. Salvana, Sameh Abdulah, Minwoo Kim, David Helmy, Ying Sun, Marc G. Genton","doi":"arxiv-2406.02701","DOIUrl":"https://doi.org/arxiv-2406.02701","url":null,"abstract":"Computational statistics has traditionally utilized double-precision (64-bit)\u0000data structures and full-precision operations, resulting in\u0000higher-than-necessary accuracy for certain applications. Recently, there has\u0000been a growing interest in exploring low-precision options that could reduce\u0000computational complexity while still achieving the required level of accuracy.\u0000This trend has been amplified by new hardware such as NVIDIA's Tensor Cores in\u0000their V100, A100, and H100 GPUs, which are optimized for mixed-precision\u0000computations, Intel CPUs with Deep Learning (DL) boost, Google Tensor\u0000Processing Units (TPUs), Field Programmable Gate Arrays (FPGAs), ARM CPUs, and\u0000others. However, using lower precision may introduce numerical instabilities\u0000and accuracy issues. Nevertheless, some applications have shown robustness to\u0000low-precision computations, leading to new multi- and mixed-precision\u0000algorithms that balance accuracy and computational cost. To address this need,\u0000we introduce MPCR, a novel R package that supports three different precision\u0000types (16-, 32-, and 64-bit) and their combinations, along with its usage in\u0000commonly-used Frequentist/Bayesian statistical examples. The MPCR package is\u0000written in C++ and integrated into R through the pkg{Rcpp} package, enabling\u0000highly optimized operations in various precisions.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141523196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}