{"title":"Towards Explainable Automated Data Quality Enhancement without Domain Knowledge","authors":"Djibril Sarr","doi":"arxiv-2409.10139","DOIUrl":"https://doi.org/arxiv-2409.10139","url":null,"abstract":"In the era of big data, ensuring the quality of datasets has become\u0000increasingly crucial across various domains. We propose a comprehensive\u0000framework designed to automatically assess and rectify data quality issues in\u0000any given dataset, regardless of its specific content, focusing on both textual\u0000and numerical data. Our primary objective is to address three fundamental types\u0000of defects: absence, redundancy, and incoherence. At the heart of our approach\u0000lies a rigorous demand for both explainability and interpretability, ensuring\u0000that the rationale behind the identification and correction of data anomalies\u0000is transparent and understandable. To achieve this, we adopt a hybrid approach\u0000that integrates statistical methods with machine learning algorithms. Indeed,\u0000by leveraging statistical techniques alongside machine learning, we strike a\u0000balance between accuracy and explainability, enabling users to trust and\u0000comprehend the assessment process. Acknowledging the challenges associated with\u0000automating the data quality assessment process, particularly in terms of time\u0000efficiency and accuracy, we adopt a pragmatic strategy, employing\u0000resource-intensive algorithms only when necessary, while favoring simpler, more\u0000efficient solutions whenever possible. Through a practical analysis conducted\u0000on a publicly provided dataset, we illustrate the challenges that arise when\u0000trying to enhance data quality while keeping explainability. We demonstrate the\u0000effectiveness of our approach in detecting and rectifying missing values,\u0000duplicates and typographical errors as well as the challenges remaining to be\u0000addressed to achieve similar accuracy on statistical outliers and logic errors\u0000under the constraints set in our work.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Partial Distribution Matching via Partial Wasserstein Adversarial Networks","authors":"Zi-Ming Wang, Nan Xue, Ling Lei, Rebecka Jörnsten, Gui-Song Xia","doi":"arxiv-2409.10499","DOIUrl":"https://doi.org/arxiv-2409.10499","url":null,"abstract":"This paper studies the problem of distribution matching (DM), which is a\u0000fundamental machine learning problem seeking to robustly align two probability\u0000distributions. Our approach is established on a relaxed formulation, called\u0000partial distribution matching (PDM), which seeks to match a fraction of the\u0000distributions instead of matching them completely. We theoretically derive the\u0000Kantorovich-Rubinstein duality for the partial Wasserstain-1 (PW) discrepancy,\u0000and develop a partial Wasserstein adversarial network (PWAN) that efficiently\u0000approximates the PW discrepancy based on this dual form. Partial matching can\u0000then be achieved by optimizing the network using gradient descent. Two\u0000practical tasks, point set registration and partial domain adaptation are\u0000investigated, where the goals are to partially match distributions in 3D space\u0000and high-dimensional feature space respectively. The experiment results confirm\u0000that the proposed PWAN effectively produces highly robust matching results,\u0000performing better or on par with the state-of-the-art methods.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tight Lower Bounds under Asymmetric High-Order Hölder Smoothness and Uniform Convexity","authors":"Site Bai, Brian Bullins","doi":"arxiv-2409.10773","DOIUrl":"https://doi.org/arxiv-2409.10773","url":null,"abstract":"In this paper, we provide tight lower bounds for the oracle complexity of\u0000minimizing high-order H\"older smooth and uniformly convex functions.\u0000Specifically, for a function whose $p^{th}$-order derivatives are H\"older\u0000continuous with degree $nu$ and parameter $H$, and that is uniformly convex\u0000with degree $q$ and parameter $sigma$, we focus on two asymmetric cases: (1)\u0000$q > p + nu$, and (2) $q < p+nu$. Given up to $p^{th}$-order oracle access,\u0000we establish worst-case oracle complexities of $Omegaleft( left(\u0000frac{H}{sigma}right)^frac{2}{3(p+nu)-2}left(\u0000frac{sigma}{epsilon}right)^frac{2(q-p-nu)}{q(3(p+nu)-2)}right)$ with a\u0000truncated-Gaussian smoothed hard function in the first case and\u0000$Omegaleft(left(frac{H}{sigma}right)^frac{2}{3(p+nu)-2}+\u0000log^2left(frac{sigma^{p+nu}}{H^q}right)^frac{1}{p+nu-q}right)$ in the\u0000second case, for reaching an $epsilon$-approximate solution in terms of the\u0000optimality gap. Our analysis generalizes previous lower bounds for functions\u0000under first- and second-order smoothness as well as those for uniformly convex\u0000functions, and furthermore our results match the corresponding upper bounds in\u0000the general setting.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"89 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Conditional sampling within generative diffusion models","authors":"Zheng Zhao, Ziwei Luo, Jens Sjölund, Thomas B. Schön","doi":"arxiv-2409.09650","DOIUrl":"https://doi.org/arxiv-2409.09650","url":null,"abstract":"Generative diffusions are a powerful class of Monte Carlo samplers that\u0000leverage bridging Markov processes to approximate complex, high-dimensional\u0000distributions, such as those found in image processing and language models.\u0000Despite their success in these domains, an important open challenge remains:\u0000extending these techniques to sample from conditional distributions, as\u0000required in, for example, Bayesian inverse problems. In this paper, we present\u0000a comprehensive review of existing computational approaches to conditional\u0000sampling within generative diffusion models. Specifically, we highlight key\u0000methodologies that either utilise the joint distribution, or rely on\u0000(pre-trained) marginal distributions with explicit likelihoods, to construct\u0000conditional generative samplers.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding Simplicity Bias towards Compositional Mappings via Learning Dynamics","authors":"Yi Ren, Danica J. Sutherland","doi":"arxiv-2409.09626","DOIUrl":"https://doi.org/arxiv-2409.09626","url":null,"abstract":"Obtaining compositional mappings is important for the model to generalize\u0000well compositionally. To better understand when and how to encourage the model\u0000to learn such mappings, we study their uniqueness through different\u0000perspectives. Specifically, we first show that the compositional mappings are\u0000the simplest bijections through the lens of coding length (i.e., an upper bound\u0000of their Kolmogorov complexity). This property explains why models having such\u0000mappings can generalize well. We further show that the simplicity bias is\u0000usually an intrinsic property of neural network training via gradient descent.\u0000That partially explains why some models spontaneously generalize well when they\u0000are trained appropriately.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Veridical Data Science for Medical Foundation Models","authors":"Ahmed Alaa, Bin Yu","doi":"arxiv-2409.10580","DOIUrl":"https://doi.org/arxiv-2409.10580","url":null,"abstract":"The advent of foundation models (FMs) such as large language models (LLMs)\u0000has led to a cultural shift in data science, both in medicine and beyond. This\u0000shift involves moving away from specialized predictive models trained for\u0000specific, well-defined domain questions to generalist FMs pre-trained on vast\u0000amounts of unstructured data, which can then be adapted to various clinical\u0000tasks and questions. As a result, the standard data science workflow in\u0000medicine has been fundamentally altered; the foundation model lifecycle (FMLC)\u0000now includes distinct upstream and downstream processes, in which computational\u0000resources, model and data access, and decision-making power are distributed\u0000among multiple stakeholders. At their core, FMs are fundamentally statistical\u0000models, and this new workflow challenges the principles of Veridical Data\u0000Science (VDS), hindering the rigorous statistical analysis expected in\u0000transparent and scientifically reproducible data science practices. We\u0000critically examine the medical FMLC in light of the core principles of VDS:\u0000predictability, computability, and stability (PCS), and explain how it deviates\u0000from the standard data science workflow. Finally, we propose recommendations\u0000for a reimagined medical FMLC that expands and refines the PCS principles for\u0000VDS including considering the computational and accessibility constraints\u0000inherent to FMs.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"35 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BEnDEM:A Boltzmann Sampler Based on Bootstrapped Denoising Energy Matching","authors":"RuiKang OuYang, Bo Qiang, José Miguel Hernández-Lobato","doi":"arxiv-2409.09787","DOIUrl":"https://doi.org/arxiv-2409.09787","url":null,"abstract":"Developing an efficient sampler capable of generating independent and\u0000identically distributed (IID) samples from a Boltzmann distribution is a\u0000crucial challenge in scientific research, e.g. molecular dynamics. In this\u0000work, we intend to learn neural samplers given energy functions instead of data\u0000sampled from the Boltzmann distribution. By learning the energies of the noised\u0000data, we propose a diffusion-based sampler, ENERGY-BASED DENOISING ENERGY\u0000MATCHING, which theoretically has lower variance and more complexity compared\u0000to related works. Furthermore, a novel bootstrapping technique is applied to\u0000EnDEM to balance between bias and variance. We evaluate EnDEM and BEnDEM on a\u00002-dimensional 40 Gaussian Mixture Model (GMM) and a 4-particle double-welling\u0000potential (DW-4). The experimental results demonstrate that BEnDEM can achieve\u0000state-of-the-art performance while being more robust.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"47 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scaling Continuous Kernels with Sparse Fourier Domain Learning","authors":"Clayton Harper, Luke Wood, Peter Gerstoft, Eric C. Larson","doi":"arxiv-2409.09875","DOIUrl":"https://doi.org/arxiv-2409.09875","url":null,"abstract":"We address three key challenges in learning continuous kernel\u0000representations: computational efficiency, parameter efficiency, and spectral\u0000bias. Continuous kernels have shown significant potential, but their practical\u0000adoption is often limited by high computational and memory demands.\u0000Additionally, these methods are prone to spectral bias, which impedes their\u0000ability to capture high-frequency details. To overcome these limitations, we\u0000propose a novel approach that leverages sparse learning in the Fourier domain.\u0000Our method enables the efficient scaling of continuous kernels, drastically\u0000reduces computational and memory requirements, and mitigates spectral bias by\u0000exploiting the Gibbs phenomenon.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OML-AD: Online Machine Learning for Anomaly Detection in Time Series Data","authors":"Sebastian Wette, Florian Heinrichs","doi":"arxiv-2409.09742","DOIUrl":"https://doi.org/arxiv-2409.09742","url":null,"abstract":"Time series are ubiquitous and occur naturally in a variety of applications\u0000-- from data recorded by sensors in manufacturing processes, over financial\u0000data streams to climate data. Different tasks arise, such as regression,\u0000classification or segmentation of the time series. However, to reliably solve\u0000these challenges, it is important to filter out abnormal observations that\u0000deviate from the usual behavior of the time series. While many anomaly\u0000detection methods exist for independent data and stationary time series, these\u0000methods are not applicable to non-stationary time series. To allow for\u0000non-stationarity in the data, while simultaneously detecting anomalies, we\u0000propose OML-AD, a novel approach for anomaly detection (AD) based on online\u0000machine learning (OML). We provide an implementation of OML-AD within the\u0000Python library River and show that it outperforms state-of-the-art baseline\u0000methods in terms of accuracy and computational efficiency.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Model Selection Through Model Sorting","authors":"Mohammad Ali Hajiani, Babak Seyfe","doi":"arxiv-2409.09674","DOIUrl":"https://doi.org/arxiv-2409.09674","url":null,"abstract":"We propose a novel approach to select the best model of the data. Based on\u0000the exclusive properties of the nested models, we find the most parsimonious\u0000model containing the risk minimizer predictor. We prove the existence of\u0000probable approximately correct (PAC) bounds on the difference of the minimum\u0000empirical risk of two successive nested models, called successive empirical\u0000excess risk (SEER). Based on these bounds, we propose a model order selection\u0000method called nested empirical risk (NER). By the sorted NER (S-NER) method to\u0000sort the models intelligently, the minimum risk decreases. We construct a test\u0000that predicts whether expanding the model decreases the minimum risk or not.\u0000With a high probability, the NER and S-NER choose the true model order and the\u0000most parsimonious model containing the risk minimizer predictor, respectively.\u0000We use S-NER model selection in the linear regression and show that, the S-NER\u0000method without any prior information can outperform the accuracy of feature\u0000sorting algorithms like orthogonal matching pursuit (OMP) that aided with prior\u0000knowledge of the true model order. Also, in the UCR data set, the NER method\u0000reduces the complexity of the classification of UCR datasets dramatically, with\u0000a negligible loss of accuracy.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}