Beyond Efficiency: Molecular Data Pruning for Enhanced Generalization

Dingshuo Chen, Zhixun Li, Yuyan Ni, Guibin Zhang, Ding Wang, Qiang Liu, Shu Wu, Jeffrey Xu Yu, Liang Wang

arXiv - QuanBio - Biomolecules, 2024-09-02. https://doi.org/arxiv-2409.01081
With the emergence of diverse molecular tasks and massive datasets, how to train efficiently has become an urgent yet under-explored issue in this area. Data pruning (DP), an oft-cited approach to reducing the training burden, filters out less influential samples to form a coreset for training.
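As a generic illustration of this score-and-select step (a minimal sketch, not code from the paper; the scoring metric itself is left abstract), consider:

```python
# Minimal sketch of score-based data pruning: given one importance
# score per sample (from any DP metric), keep the top-scoring fraction
# as the coreset. Illustrative only; not the paper's implementation.
import numpy as np

def build_coreset(scores: np.ndarray, prune_ratio: float) -> np.ndarray:
    """Return indices of the retained samples, highest-scoring first."""
    n_keep = int(len(scores) * (1.0 - prune_ratio))
    return np.argsort(scores)[::-1][:n_keep]

scores = np.random.rand(1000)                     # stand-in for a real DP metric
coreset = build_coreset(scores, prune_ratio=0.6)  # prune 60% of the data
```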
However, the increasing reliance on pretrained models for molecular tasks renders traditional in-domain DP methods incompatible. We therefore propose a Molecular data Pruning framework for enhanced Generalization (MolPeg), which focuses on the source-free data pruning scenario, where pruning is applied on top of a pretrained model. By maintaining two models that are updated at different paces during training, MolPeg introduces a novel scoring function that measures the informativeness of each sample by the discrepancy between the two models' losses.
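The abstract does not specify the mechanism, so the following PyTorch sketch is only one plausible reading: a fast online model paired with a slowly updated exponential-moving-average (EMA) copy, both starting from the same pretrained weights, with each sample scored by the absolute gap between the two models' losses. The EMA update, the BCE loss, and all names here are our assumptions, not the authors' implementation.

```python
# Hedged sketch of dual-pace models with loss-discrepancy scoring.
# ASSUMPTIONS (not from the paper): the slow model is an EMA copy of
# the fast model, and the task is binary classification (BCE loss).
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(slow: torch.nn.Module, fast: torch.nn.Module, decay: float = 0.999):
    """The slow model tracks the fast model via exponential moving average."""
    for p_s, p_f in zip(slow.parameters(), fast.parameters()):
        p_s.mul_(decay).add_(p_f, alpha=1.0 - decay)

@torch.no_grad()
def informativeness(fast, slow, x, y):
    """Per-sample score: discrepancy between fast- and slow-model losses."""
    loss_fast = F.binary_cross_entropy_with_logits(fast(x), y, reduction="none")
    loss_slow = F.binary_cross_entropy_with_logits(slow(x), y, reduction="none")
    return (loss_fast - loss_slow).abs()  # a large gap marks an informative sample

# Hypothetical usage: slow = copy.deepcopy(fast) at the start of training;
# each step, score the batch, backprop only on the top-scoring samples,
# then call ema_update(slow, fast).
```

Under this reading, the lagging model would retain more of the pretrained (source) knowledge while the fast model adapts to the target task, which is one way a single loss-gap score could reflect both domains at once.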
As a plug-and-play framework, MolPeg perceives both the source and target domains and consistently outperforms existing DP methods across four downstream tasks. Remarkably, it can surpass the performance of full-dataset training even when pruning up to 60-70% of the data on the HIV and PCBA datasets. Our work suggests that the discovery of effective data-pruning metrics could provide a viable path to both enhanced efficiency and superior generalization in transfer learning.