{"title":"Dataset Distillation from First Principles: Integrating Core Information Extraction and Purposeful Learning","authors":"Vyacheslav Kungurtsev, Yuanfang Peng, Jianyang Gu, Saeed Vahidian, Anthony Quinn, Fadwa Idlahcen, Yiran Chen","doi":"arxiv-2409.01410","DOIUrl":null,"url":null,"abstract":"Dataset distillation (DD) is an increasingly important technique that focuses\non constructing a synthetic dataset capable of capturing the core information\nin training data to achieve comparable performance in models trained on the\nlatter. While DD has a wide range of applications, the theory supporting it is\nless well evolved. New methods of DD are compared on a common set of\nbenchmarks, rather than oriented towards any particular learning task. In this\nwork, we present a formal model of DD, arguing that a precise characterization\nof the underlying optimization problem must specify the inference task\nassociated with the application of interest. Without this task-specific focus,\nthe DD problem is under-specified, and the selection of a DD algorithm for a\nparticular task is merely heuristic. Our formalization reveals novel\napplications of DD across different modeling environments. We analyze existing\nDD methods through this broader lens, highlighting their strengths and\nlimitations in terms of accuracy and faithfulness to optimal DD operation.\nFinally, we present numerical results for two case studies important in\ncontemporary settings. Firstly, we address a critical challenge in medical data\nanalysis: merging the knowledge from different datasets composed of\nintersecting, but not identical, sets of features, in order to construct a\nlarger dataset in what is usually a small sample setting. Secondly, we consider\nout-of-distribution error across boundary conditions for physics-informed\nneural networks (PINNs), showing the potential for DD to provide more\nphysically faithful data. By establishing this general formulation of DD, we\naim to establish a new research paradigm by which DD can be understood and from\nwhich new DD techniques can arise.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - STAT - Computation","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.01410","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Dataset distillation (DD) is an increasingly important technique for constructing a synthetic dataset that captures the core information in a training set, so that models trained on the synthetic data achieve performance comparable to models trained on the original data. While DD has a wide range of applications, the theory supporting it is less well developed. New DD methods are compared on a common set of benchmarks rather than oriented toward any particular learning task. In this work, we present a formal model of DD, arguing that a precise characterization of the underlying optimization problem must specify the inference task associated with the application of interest. Without this task-specific focus, the DD problem is under-specified, and the selection of a DD algorithm for a particular task is merely heuristic. Our formalization reveals novel applications of DD across different modeling environments. We analyze existing DD methods through this broader lens, highlighting their strengths and limitations in terms of accuracy and faithfulness to optimal DD operation.

Finally, we present numerical results for two case studies important in contemporary settings. First, we address a critical challenge in medical data analysis: merging knowledge from datasets composed of intersecting, but not identical, sets of features in order to construct a larger dataset in what is usually a small-sample setting. Second, we consider out-of-distribution error across boundary conditions for physics-informed neural networks (PINNs), showing the potential for DD to provide more physically faithful data. By establishing this general formulation of DD, we aim to lay the foundation for a new research paradigm through which DD can be understood and from which new DD techniques can arise.
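For context, the following is a minimal sketch of the bilevel formulation on which much of the DD literature is built; it is a standard formulation given here for illustration, not necessarily the paper's own formalization, and the symbols are illustrative:

\min_{\mathcal{S}} \; \mathcal{L}_{\mathcal{T}}\big(\theta^{*}(\mathcal{S})\big)
\quad \text{subject to} \quad
\theta^{*}(\mathcal{S}) \in \arg\min_{\theta}\, \mathcal{L}_{\mathcal{S}}(\theta),

where \mathcal{T} denotes the original training set, \mathcal{S} the much smaller synthetic set being optimized, and \mathcal{L}_{D}(\theta) the loss of model parameters \theta evaluated on dataset D. Read in the terms of the abstract, the outer objective \mathcal{L}_{\mathcal{T}} is the under-specified piece: it should instead reflect the inference task of the target application, otherwise the choice of DD algorithm remains heuristic.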