{"title":"Towards Explainable Automated Data Quality Enhancement without Domain Knowledge","authors":"Djibril Sarr","doi":"arxiv-2409.10139","DOIUrl":"https://doi.org/arxiv-2409.10139","url":null,"abstract":"In the era of big data, ensuring the quality of datasets has become\u0000increasingly crucial across various domains. We propose a comprehensive\u0000framework designed to automatically assess and rectify data quality issues in\u0000any given dataset, regardless of its specific content, focusing on both textual\u0000and numerical data. Our primary objective is to address three fundamental types\u0000of defects: absence, redundancy, and incoherence. At the heart of our approach\u0000lies a rigorous demand for both explainability and interpretability, ensuring\u0000that the rationale behind the identification and correction of data anomalies\u0000is transparent and understandable. To achieve this, we adopt a hybrid approach\u0000that integrates statistical methods with machine learning algorithms. Indeed,\u0000by leveraging statistical techniques alongside machine learning, we strike a\u0000balance between accuracy and explainability, enabling users to trust and\u0000comprehend the assessment process. Acknowledging the challenges associated with\u0000automating the data quality assessment process, particularly in terms of time\u0000efficiency and accuracy, we adopt a pragmatic strategy, employing\u0000resource-intensive algorithms only when necessary, while favoring simpler, more\u0000efficient solutions whenever possible. Through a practical analysis conducted\u0000on a publicly provided dataset, we illustrate the challenges that arise when\u0000trying to enhance data quality while keeping explainability. We demonstrate the\u0000effectiveness of our approach in detecting and rectifying missing values,\u0000duplicates and typographical errors as well as the challenges remaining to be\u0000addressed to achieve similar accuracy on statistical outliers and logic errors\u0000under the constraints set in our work.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Partial Distribution Matching via Partial Wasserstein Adversarial Networks","authors":"Zi-Ming Wang, Nan Xue, Ling Lei, Rebecka Jörnsten, Gui-Song Xia","doi":"arxiv-2409.10499","DOIUrl":"https://doi.org/arxiv-2409.10499","url":null,"abstract":"This paper studies the problem of distribution matching (DM), which is a\u0000fundamental machine learning problem seeking to robustly align two probability\u0000distributions. Our approach is established on a relaxed formulation, called\u0000partial distribution matching (PDM), which seeks to match a fraction of the\u0000distributions instead of matching them completely. We theoretically derive the\u0000Kantorovich-Rubinstein duality for the partial Wasserstain-1 (PW) discrepancy,\u0000and develop a partial Wasserstein adversarial network (PWAN) that efficiently\u0000approximates the PW discrepancy based on this dual form. Partial matching can\u0000then be achieved by optimizing the network using gradient descent. Two\u0000practical tasks, point set registration and partial domain adaptation are\u0000investigated, where the goals are to partially match distributions in 3D space\u0000and high-dimensional feature space respectively. The experiment results confirm\u0000that the proposed PWAN effectively produces highly robust matching results,\u0000performing better or on par with the state-of-the-art methods.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tight Lower Bounds under Asymmetric High-Order Hölder Smoothness and Uniform Convexity","authors":"Site Bai, Brian Bullins","doi":"arxiv-2409.10773","DOIUrl":"https://doi.org/arxiv-2409.10773","url":null,"abstract":"In this paper, we provide tight lower bounds for the oracle complexity of\u0000minimizing high-order H\"older smooth and uniformly convex functions.\u0000Specifically, for a function whose $p^{th}$-order derivatives are H\"older\u0000continuous with degree $nu$ and parameter $H$, and that is uniformly convex\u0000with degree $q$ and parameter $sigma$, we focus on two asymmetric cases: (1)\u0000$q > p + nu$, and (2) $q < p+nu$. Given up to $p^{th}$-order oracle access,\u0000we establish worst-case oracle complexities of $Omegaleft( left(\u0000frac{H}{sigma}right)^frac{2}{3(p+nu)-2}left(\u0000frac{sigma}{epsilon}right)^frac{2(q-p-nu)}{q(3(p+nu)-2)}right)$ with a\u0000truncated-Gaussian smoothed hard function in the first case and\u0000$Omegaleft(left(frac{H}{sigma}right)^frac{2}{3(p+nu)-2}+\u0000log^2left(frac{sigma^{p+nu}}{H^q}right)^frac{1}{p+nu-q}right)$ in the\u0000second case, for reaching an $epsilon$-approximate solution in terms of the\u0000optimality gap. Our analysis generalizes previous lower bounds for functions\u0000under first- and second-order smoothness as well as those for uniformly convex\u0000functions, and furthermore our results match the corresponding upper bounds in\u0000the general setting.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"89 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Conditional sampling within generative diffusion models","authors":"Zheng Zhao, Ziwei Luo, Jens Sjölund, Thomas B. Schön","doi":"arxiv-2409.09650","DOIUrl":"https://doi.org/arxiv-2409.09650","url":null,"abstract":"Generative diffusions are a powerful class of Monte Carlo samplers that\u0000leverage bridging Markov processes to approximate complex, high-dimensional\u0000distributions, such as those found in image processing and language models.\u0000Despite their success in these domains, an important open challenge remains:\u0000extending these techniques to sample from conditional distributions, as\u0000required in, for example, Bayesian inverse problems. In this paper, we present\u0000a comprehensive review of existing computational approaches to conditional\u0000sampling within generative diffusion models. Specifically, we highlight key\u0000methodologies that either utilise the joint distribution, or rely on\u0000(pre-trained) marginal distributions with explicit likelihoods, to construct\u0000conditional generative samplers.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding Simplicity Bias towards Compositional Mappings via Learning Dynamics","authors":"Yi Ren, Danica J. Sutherland","doi":"arxiv-2409.09626","DOIUrl":"https://doi.org/arxiv-2409.09626","url":null,"abstract":"Obtaining compositional mappings is important for the model to generalize\u0000well compositionally. To better understand when and how to encourage the model\u0000to learn such mappings, we study their uniqueness through different\u0000perspectives. Specifically, we first show that the compositional mappings are\u0000the simplest bijections through the lens of coding length (i.e., an upper bound\u0000of their Kolmogorov complexity). This property explains why models having such\u0000mappings can generalize well. We further show that the simplicity bias is\u0000usually an intrinsic property of neural network training via gradient descent.\u0000That partially explains why some models spontaneously generalize well when they\u0000are trained appropriately.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Veridical Data Science for Medical Foundation Models","authors":"Ahmed Alaa, Bin Yu","doi":"arxiv-2409.10580","DOIUrl":"https://doi.org/arxiv-2409.10580","url":null,"abstract":"The advent of foundation models (FMs) such as large language models (LLMs)\u0000has led to a cultural shift in data science, both in medicine and beyond. This\u0000shift involves moving away from specialized predictive models trained for\u0000specific, well-defined domain questions to generalist FMs pre-trained on vast\u0000amounts of unstructured data, which can then be adapted to various clinical\u0000tasks and questions. As a result, the standard data science workflow in\u0000medicine has been fundamentally altered; the foundation model lifecycle (FMLC)\u0000now includes distinct upstream and downstream processes, in which computational\u0000resources, model and data access, and decision-making power are distributed\u0000among multiple stakeholders. At their core, FMs are fundamentally statistical\u0000models, and this new workflow challenges the principles of Veridical Data\u0000Science (VDS), hindering the rigorous statistical analysis expected in\u0000transparent and scientifically reproducible data science practices. We\u0000critically examine the medical FMLC in light of the core principles of VDS:\u0000predictability, computability, and stability (PCS), and explain how it deviates\u0000from the standard data science workflow. Finally, we propose recommendations\u0000for a reimagined medical FMLC that expands and refines the PCS principles for\u0000VDS including considering the computational and accessibility constraints\u0000inherent to FMs.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"35 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BEnDEM:A Boltzmann Sampler Based on Bootstrapped Denoising Energy Matching","authors":"RuiKang OuYang, Bo Qiang, José Miguel Hernández-Lobato","doi":"arxiv-2409.09787","DOIUrl":"https://doi.org/arxiv-2409.09787","url":null,"abstract":"Developing an efficient sampler capable of generating independent and\u0000identically distributed (IID) samples from a Boltzmann distribution is a\u0000crucial challenge in scientific research, e.g. molecular dynamics. In this\u0000work, we intend to learn neural samplers given energy functions instead of data\u0000sampled from the Boltzmann distribution. By learning the energies of the noised\u0000data, we propose a diffusion-based sampler, ENERGY-BASED DENOISING ENERGY\u0000MATCHING, which theoretically has lower variance and more complexity compared\u0000to related works. Furthermore, a novel bootstrapping technique is applied to\u0000EnDEM to balance between bias and variance. We evaluate EnDEM and BEnDEM on a\u00002-dimensional 40 Gaussian Mixture Model (GMM) and a 4-particle double-welling\u0000potential (DW-4). The experimental results demonstrate that BEnDEM can achieve\u0000state-of-the-art performance while being more robust.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"47 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scaling Continuous Kernels with Sparse Fourier Domain Learning","authors":"Clayton Harper, Luke Wood, Peter Gerstoft, Eric C. Larson","doi":"arxiv-2409.09875","DOIUrl":"https://doi.org/arxiv-2409.09875","url":null,"abstract":"We address three key challenges in learning continuous kernel\u0000representations: computational efficiency, parameter efficiency, and spectral\u0000bias. Continuous kernels have shown significant potential, but their practical\u0000adoption is often limited by high computational and memory demands.\u0000Additionally, these methods are prone to spectral bias, which impedes their\u0000ability to capture high-frequency details. To overcome these limitations, we\u0000propose a novel approach that leverages sparse learning in the Fourier domain.\u0000Our method enables the efficient scaling of continuous kernels, drastically\u0000reduces computational and memory requirements, and mitigates spectral bias by\u0000exploiting the Gibbs phenomenon.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OML-AD: Online Machine Learning for Anomaly Detection in Time Series Data","authors":"Sebastian Wette, Florian Heinrichs","doi":"arxiv-2409.09742","DOIUrl":"https://doi.org/arxiv-2409.09742","url":null,"abstract":"Time series are ubiquitous and occur naturally in a variety of applications\u0000-- from data recorded by sensors in manufacturing processes, over financial\u0000data streams to climate data. Different tasks arise, such as regression,\u0000classification or segmentation of the time series. However, to reliably solve\u0000these challenges, it is important to filter out abnormal observations that\u0000deviate from the usual behavior of the time series. While many anomaly\u0000detection methods exist for independent data and stationary time series, these\u0000methods are not applicable to non-stationary time series. To allow for\u0000non-stationarity in the data, while simultaneously detecting anomalies, we\u0000propose OML-AD, a novel approach for anomaly detection (AD) based on online\u0000machine learning (OML). We provide an implementation of OML-AD within the\u0000Python library River and show that it outperforms state-of-the-art baseline\u0000methods in terms of accuracy and computational efficiency.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Model Selection Through Model Sorting","authors":"Mohammad Ali Hajiani, Babak Seyfe","doi":"arxiv-2409.09674","DOIUrl":"https://doi.org/arxiv-2409.09674","url":null,"abstract":"We propose a novel approach to select the best model of the data. Based on\u0000the exclusive properties of the nested models, we find the most parsimonious\u0000model containing the risk minimizer predictor. We prove the existence of\u0000probable approximately correct (PAC) bounds on the difference of the minimum\u0000empirical risk of two successive nested models, called successive empirical\u0000excess risk (SEER). Based on these bounds, we propose a model order selection\u0000method called nested empirical risk (NER). By the sorted NER (S-NER) method to\u0000sort the models intelligently, the minimum risk decreases. We construct a test\u0000that predicts whether expanding the model decreases the minimum risk or not.\u0000With a high probability, the NER and S-NER choose the true model order and the\u0000most parsimonious model containing the risk minimizer predictor, respectively.\u0000We use S-NER model selection in the linear regression and show that, the S-NER\u0000method without any prior information can outperform the accuracy of feature\u0000sorting algorithms like orthogonal matching pursuit (OMP) that aided with prior\u0000knowledge of the true model order. Also, in the UCR data set, the NER method\u0000reduces the complexity of the classification of UCR datasets dramatically, with\u0000a negligible loss of accuracy.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}