Quantifying the Hardness of Bioactivity Prediction Tasks for Transfer Learning

IF 5.3 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MEDICINAL

Journal of Chemical Information and Modeling | Pub Date: 2024-05-13 | DOI: 10.1021/acs.jcim.4c00160

Hosein Fooladi, Steffen Hirte, and Johannes Kirchmair*

Volume 64 (10), pages 4031–4046

Citations: 0

Abstract

Today, machine learning methods are widely employed in drug discovery. However, the chronic lack of data continues to hamper their further development, validation, and application. Several modern strategies aim to mitigate the challenges associated with data scarcity by learning from data on related tasks. These knowledge-sharing approaches encompass transfer learning, multitask learning, and meta-learning. A key question remaining to be answered for these approaches is about the extent to which their performance can benefit from the relatedness of available source (training) tasks; in other words, how difficult (“hard”) a test task is to a model, given the available source tasks. This study introduces a new method for quantifying and predicting the hardness of a bioactivity prediction task based on its relation to the available training tasks. The approach involves the generation of protein and chemical representations and the calculation of distances between the bioactivity prediction task and the available training tasks. In the example of meta-learning on the FS-Mol data set, we demonstrate that the proposed task hardness metric is inversely correlated with performance (Pearson’s correlation coefficient r = −0.72). The metric will be useful in estimating the task-specific gain in performance that can be achieved through meta-learning.
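The distance-based idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state which representations, distance, or aggregation the authors use, so everything below is an assumption made for illustration only: each task is summarized as a fixed-length vector, the distance is Euclidean, and hardness is taken as the mean distance from the test task to its k nearest training tasks. The function names `task_hardness` and `pearson_r` are hypothetical and are not taken from the paper or its code.

```python
import numpy as np


def task_hardness(test_embedding, train_embeddings, k=3):
    """Illustrative hardness score (assumption, not the paper's definition):
    mean Euclidean distance from a test task to its k nearest training tasks.
    """
    test_embedding = np.asarray(test_embedding, dtype=float)
    train_embeddings = np.asarray(train_embeddings, dtype=float)
    if train_embeddings.ndim != 2 or len(train_embeddings) == 0:
        raise ValueError("train_embeddings must be a non-empty 2D array")
    if k < 1:
        raise ValueError("k must be at least 1")

    # Distance from the test task to every training task.
    distances = np.linalg.norm(train_embeddings - test_embedding, axis=1)

    # Average over the k closest training tasks (or all of them if fewer).
    k = min(k, len(distances))
    nearest = np.sort(distances)[:k]
    return float(nearest.mean())


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])


# Toy, made-up task vectors (not data from the paper).
train_tasks = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
close_task = task_hardness([0.5, 0.5], train_tasks, k=2)
distant_task = task_hardness([10.0, 10.0], train_tasks, k=2)
```

In this toy setup, a test task lying far from all training tasks receives a larger score than one lying among them. Given hardness scores and measured performance for a set of test tasks, `pearson_r(hardness_scores, performance_scores)` would give the kind of correlation the abstract reports; a negative value means harder tasks tend to perform worse.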




Source journal

Journal of Chemical Information and Modeling (Chemistry: Chemistry, Multidisciplinary)

CiteScore: 9.80

Self-citation rate: 10.70%

Articles published: 529

Review time: 1.4 months

Journal description: The Journal of Chemical Information and Modeling publishes papers reporting new methodology and/or important applications in the fields of chemical informatics and molecular modeling. Specific topics include the representation and computer-based searching of chemical databases, molecular modeling, computer-aided molecular design of new materials, catalysts, or ligands, development of new computational methods or efficient algorithms for chemical software, and biopharmaceutical chemistry including analyses of biological activity and other issues related to drug discovery. Astute chemists, computer scientists, and information specialists look to this monthly’s insightful research studies, programming innovations, and software reviews to keep current with advances in this integral, multidisciplinary field. As a subscriber you’ll stay abreast of database search systems, use of graph theory in chemical problems, substructure search systems, pattern recognition and clustering, analysis of chemical and physical data, molecular modeling, graphics and natural language interfaces, bibliometric and citation analysis, and synthesis design and reactions databases.