{"title":"Automated Discovery of Pairwise Interactions from Unstructured Data","authors":"ZuhengDavid, Xu, Moksh Jain, Ali Denton, Shawn Whitfield, Aniket Didolkar, Berton Earnshaw, Jason Hartford","doi":"arxiv-2409.07594","DOIUrl":"https://doi.org/arxiv-2409.07594","url":null,"abstract":"Pairwise interactions between perturbations to a system can provide evidence\u0000for the causal dependencies of the underlying underlying mechanisms of a\u0000system. When observations are low dimensional, hand crafted measurements,\u0000detecting interactions amounts to simple statistical tests, but it is not\u0000obvious how to detect interactions between perturbations affecting latent\u0000variables. We derive two interaction tests that are based on pairwise\u0000interventions, and show how these tests can be integrated into an active\u0000learning pipeline to efficiently discover pairwise interactions between\u0000perturbations. We illustrate the value of these tests in the context of\u0000biology, where pairwise perturbation experiments are frequently used to reveal\u0000interactions that are not observable from any single perturbation. Our tests\u0000can be run on unstructured data, such as the pixels in an image, which enables\u0000a more general notion of interaction than typical cell viability experiments,\u0000and can be run on cheaper experimental assays. We validate on several synthetic\u0000and real biological experiments that our tests are able to identify interacting\u0000pairs effectively. We evaluate our approach on a real biological experiment\u0000where we knocked out 50 pairs of genes and measured the effect with microscopy\u0000images. We show that we are able to recover significantly more known biological\u0000interactions than random search and standard active learning baselines.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"45 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206610","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Convergence of continuous-time stochastic gradient descent with applications to linear deep neural networks","authors":"Gabor Lugosi, Eulalia Nualart","doi":"arxiv-2409.07401","DOIUrl":"https://doi.org/arxiv-2409.07401","url":null,"abstract":"We study a continuous-time approximation of the stochastic gradient descent\u0000process for minimizing the expected loss in learning problems. The main results\u0000establish general sufficient conditions for the convergence, extending the\u0000results of Chatterjee (2022) established for (nonstochastic) gradient descent.\u0000We show how the main result can be applied to the case of overparametrized\u0000linear neural network training.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring User-level Gradient Inversion with a Diffusion Prior","authors":"Zhuohang Li, Andrew Lowy, Jing Liu, Toshiaki Koike-Akino, Bradley Malin, Kieran Parsons, Ye Wang","doi":"arxiv-2409.07291","DOIUrl":"https://doi.org/arxiv-2409.07291","url":null,"abstract":"We explore user-level gradient inversion as a new attack surface in\u0000distributed learning. We first investigate existing attacks on their ability to\u0000make inferences about private information beyond training data reconstruction.\u0000Motivated by the low reconstruction quality of existing methods, we propose a\u0000novel gradient inversion attack that applies a denoising diffusion model as a\u0000strong image prior in order to enhance recovery in the large batch setting.\u0000Unlike traditional attacks, which aim to reconstruct individual samples and\u0000suffer at large batch and image sizes, our approach instead aims to recover a\u0000representative image that captures the sensitive shared semantic information\u0000corresponding to the underlying user. Our experiments with face images\u0000demonstrate the ability of our methods to recover realistic facial images along\u0000with private user attributes.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tuning-Free Online Robust Principal Component Analysis through Implicit Regularization","authors":"Lakshmi Jayalal, Gokularam Muthukrishnan, Sheetal Kalyani","doi":"arxiv-2409.07275","DOIUrl":"https://doi.org/arxiv-2409.07275","url":null,"abstract":"The performance of the standard Online Robust Principal Component Analysis\u0000(OR-PCA) technique depends on the optimum tuning of the explicit regularizers\u0000and this tuning is dataset sensitive. We aim to remove the dependency on these\u0000tuning parameters by using implicit regularization. We propose to use the\u0000implicit regularization effect of various modified gradient descents to make\u0000OR-PCA tuning free. Our method incorporates three different versions of\u0000modified gradient descent that separately but naturally encourage sparsity and\u0000low-rank structures in the data. The proposed method performs comparable or\u0000better than the tuned OR-PCA for both simulated and real-world datasets.\u0000Tuning-free ORPCA makes it more scalable for large datasets since we do not\u0000require dataset-dependent parameter tuning.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"203 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reranking Laws for Language Generation: A Communication-Theoretic Perspective","authors":"António Farinhas, Haau-Sing Li, André F. T. Martins","doi":"arxiv-2409.07131","DOIUrl":"https://doi.org/arxiv-2409.07131","url":null,"abstract":"To ensure large language models (LLMs) are used safely, one must reduce their\u0000propensity to hallucinate or to generate unacceptable answers. A simple and\u0000often used strategy is to first let the LLM generate multiple hypotheses and\u0000then employ a reranker to choose the best one. In this paper, we draw a\u0000parallel between this strategy and the use of redundancy to decrease the error\u0000rate in noisy communication channels. We conceptualize the generator as a\u0000sender transmitting multiple descriptions of a message through parallel noisy\u0000channels. The receiver decodes the message by ranking the (potentially\u0000corrupted) descriptions and selecting the one found to be most reliable. We\u0000provide conditions under which this protocol is asymptotically error-free\u0000(i.e., yields an acceptable answer almost surely) even in scenarios where the\u0000reranker is imperfect (governed by Mallows or Zipf-Mandelbrot models) and the\u0000channel distributions are statistically dependent. We use our framework to\u0000obtain reranking laws which we validate empirically on two real-world tasks\u0000using LLMs: text-to-code generation with DeepSeek-Coder 7B and machine\u0000translation of medical data with TowerInstruct 13B.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"45 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"From optimal score matching to optimal sampling","authors":"Zehao Dou, Subhodh Kotekal, Zhehao Xu, Harrison H. Zhou","doi":"arxiv-2409.07032","DOIUrl":"https://doi.org/arxiv-2409.07032","url":null,"abstract":"The recent, impressive advances in algorithmic generation of high-fidelity\u0000image, audio, and video are largely due to great successes in score-based\u0000diffusion models. A key implementing step is score matching, that is, the\u0000estimation of the score function of the forward diffusion process from training\u0000data. As shown in earlier literature, the total variation distance between the\u0000law of a sample generated from the trained diffusion model and the ground truth\u0000distribution can be controlled by the score matching risk. Despite the widespread use of score-based diffusion models, basic theoretical\u0000questions concerning exact optimal statistical rates for score estimation and\u0000its application to density estimation remain open. We establish the sharp\u0000minimax rate of score estimation for smooth, compactly supported densities.\u0000Formally, given (n) i.i.d. samples from an unknown (alpha)-H\"{o}lder\u0000density (f) supported on ([-1, 1]), we prove the minimax rate of estimating\u0000the score function of the diffused distribution (f * mathcal{N}(0, t)) with\u0000respect to the score matching loss is (frac{1}{nt^2} wedge\u0000frac{1}{nt^{3/2}} wedge (t^{alpha-1} + n^{-2(alpha-1)/(2alpha+1)})) for\u0000all (alpha > 0) and (t ge 0). As a consequence, it is shown the law\u0000(hat{f}) of a sample generated from the diffusion model achieves the sharp\u0000minimax rate (bE(dTV(hat{f}, f)^2) lesssim n^{-2alpha/(2alpha+1)}) for\u0000all (alpha > 0) without any extraneous logarithmic terms which are prevalent\u0000in the literature, and without the need for early stopping which has been\u0000required for all existing procedures to the best of our knowledge.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Training-Free Guidance for Discrete Diffusion Models for Molecular Generation","authors":"Thomas J. Kerby, Kevin R. Moon","doi":"arxiv-2409.07359","DOIUrl":"https://doi.org/arxiv-2409.07359","url":null,"abstract":"Training-free guidance methods for continuous data have seen an explosion of\u0000interest due to the fact that they enable foundation diffusion models to be\u0000paired with interchangable guidance models. Currently, equivalent guidance\u0000methods for discrete diffusion models are unknown. We present a framework for\u0000applying training-free guidance to discrete data and demonstrate its utility on\u0000molecular graph generation tasks using the discrete diffusion model\u0000architecture of DiGress. We pair this model with guidance functions that return\u0000the proportion of heavy atoms that are a specific atom type and the molecular\u0000weight of the heavy atoms and demonstrate our method's ability to guide the\u0000data generation.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"62 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient and Unbiased Sampling of Boltzmann Distributions via Consistency Models","authors":"Fengzhe Zhang, Jiajun He, Laurence I. Midgley, Javier Antorán, José Miguel Hernández-Lobato","doi":"arxiv-2409.07323","DOIUrl":"https://doi.org/arxiv-2409.07323","url":null,"abstract":"Diffusion models have shown promising potential for advancing Boltzmann\u0000Generators. However, two critical challenges persist: (1) inherent errors in\u0000samples due to model imperfections, and (2) the requirement of hundreds of\u0000functional evaluations (NFEs) to achieve high-quality samples. While existing\u0000solutions like importance sampling and distillation address these issues\u0000separately, they are often incompatible, as most distillation models lack the\u0000necessary density information for importance sampling. This paper introduces a\u0000novel sampling method that effectively combines Consistency Models (CMs) with\u0000importance sampling. We evaluate our approach on both synthetic energy\u0000functions and equivariant n-body particle systems. Our method produces unbiased\u0000samples using only 6-25 NFEs while achieving a comparable Effective Sample Size\u0000(ESS) to Denoising Diffusion Probabilistic Models (DDPMs) that require\u0000approximately 100 NFEs.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Manifold Learning via Foliations and Knowledge Transfer","authors":"E. Tron, E. Fioresi","doi":"arxiv-2409.07412","DOIUrl":"https://doi.org/arxiv-2409.07412","url":null,"abstract":"Understanding how real data is distributed in high dimensional spaces is the\u0000key to many tasks in machine learning. We want to provide a natural geometric\u0000structure on the space of data employing a deep ReLU neural network trained as\u0000a classifier. Through the data information matrix (DIM), a variation of the\u0000Fisher information matrix, the model will discern a singular foliation\u0000structure on the space of data. We show that the singular points of such\u0000foliation are contained in a measure zero set, and that a local regular\u0000foliation exists almost everywhere. Experiments show that the data is\u0000correlated with leaves of such foliation. Moreover we show the potential of our\u0000approach for knowledge transfer by analyzing the spectrum of the DIM to measure\u0000distances between datasets.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"k-MLE, k-Bregman, k-VARs: Theory, Convergence, Computation","authors":"Zuogong Yue, Victor Solo","doi":"arxiv-2409.06938","DOIUrl":"https://doi.org/arxiv-2409.06938","url":null,"abstract":"We develop hard clustering based on likelihood rather than distance and prove\u0000convergence. We also provide simulations and real data examples.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}