Beyond Efficiency: Molecular Data Pruning for Enhanced Generalization

Dingshuo Chen, Zhixun Li, Yuyan Ni, Guibin Zhang, Ding Wang, Qiang Liu, Shu Wu, Jeffrey Xu Yu, Liang Wang

arXiv - QuanBio - Biomolecules, 2024-09-02. https://doi.org/arxiv-2409.01081
With the emergence of diverse molecular tasks and massive datasets, how to train efficiently has become an urgent yet under-explored issue in this area. Data pruning (DP), an oft-cited approach to reducing the training burden, filters out less influential samples to form a coreset for training.
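As a generic illustration of this score-and-select step (a minimal sketch, not code from the paper; the scoring metric itself is left abstract), consider:

```python
# Minimal sketch of score-based data pruning: given one importance
# score per sample (from any DP metric), keep the top-scoring fraction
# as the coreset. Illustrative only; not the paper's implementation.
import numpy as np

def build_coreset(scores: np.ndarray, prune_ratio: float) -> np.ndarray:
    """Return indices of the retained samples, highest-scoring first."""
    n_keep = int(len(scores) * (1.0 - prune_ratio))
    return np.argsort(scores)[::-1][:n_keep]

scores = np.random.rand(1000)                     # stand-in for a real DP metric
coreset = build_coreset(scores, prune_ratio=0.6)  # prune 60% of the data
```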
However, the increasing reliance on pretrained models for molecular tasks renders traditional in-domain DP methods incompatible. We therefore propose a Molecular data Pruning framework for enhanced Generalization (MolPeg), which focuses on the source-free data pruning scenario, where pruning is applied on top of a pretrained model. By maintaining two models that are updated at different paces during training, MolPeg introduces a novel scoring function that measures the informativeness of each sample by the discrepancy between the two models' losses.
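The abstract does not specify the mechanism, so the following PyTorch sketch is only one plausible reading: a fast online model paired with a slowly updated exponential-moving-average (EMA) copy, both starting from the same pretrained weights, with each sample scored by the absolute gap between the two models' losses. The EMA update, the BCE loss, and all names here are our assumptions, not the authors' implementation.

```python
# Hedged sketch of dual-pace models with loss-discrepancy scoring.
# ASSUMPTIONS (not from the paper): the slow model is an EMA copy of
# the fast model, and the task is binary classification (BCE loss).
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(slow: torch.nn.Module, fast: torch.nn.Module, decay: float = 0.999):
    """The slow model tracks the fast model via exponential moving average."""
    for p_s, p_f in zip(slow.parameters(), fast.parameters()):
        p_s.mul_(decay).add_(p_f, alpha=1.0 - decay)

@torch.no_grad()
def informativeness(fast, slow, x, y):
    """Per-sample score: discrepancy between fast- and slow-model losses."""
    loss_fast = F.binary_cross_entropy_with_logits(fast(x), y, reduction="none")
    loss_slow = F.binary_cross_entropy_with_logits(slow(x), y, reduction="none")
    return (loss_fast - loss_slow).abs()  # a large gap marks an informative sample

# Hypothetical usage: slow = copy.deepcopy(fast) at the start of training;
# each step, score the batch, backprop only on the top-scoring samples,
# then call ema_update(slow, fast).
```

Under this reading, the lagging model would retain more of the pretrained (source) knowledge while the fast model adapts to the target task, which is one way a single loss-gap score could reflect both domains at once.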
As a plug-and-play framework, MolPeg perceives both the source and target domains and consistently outperforms existing DP methods across four downstream tasks. Remarkably, it can surpass the performance of full-dataset training even when pruning up to 60-70% of the data on the HIV and PCBA datasets. Our work suggests that the discovery of effective data-pruning metrics could provide a viable path to both enhanced efficiency and superior generalization in transfer learning.