Extrapolation is not the same as interpolation

IF 2.9 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Learning Pub Date : 2024-07-23 DOI:10.1007/s10994-024-06591-2

Yuxuan Wang, Ross D. King

{"title":"Extrapolation is not the same as interpolation","authors":"Yuxuan Wang, Ross D. King","doi":"10.1007/s10994-024-06591-2","DOIUrl":null,"url":null,"abstract":"We propose a new machine learning formulation designed specifically for extrapolation. The textbook way to apply machine learning to drug design is to learn a univariate function that when a drug (structure) is input, the function outputs a real number (the activity): f(drug) \\(\\rightarrow\\) activity. However, experience in real-world drug design suggests that this formulation of the drug design problem is not quite correct. Specifically, what one is really interested in is extrapolation: predicting the activity of new drugs with higher activity than any existing ones. Our new formulation for extrapolation is based on learning a bivariate function that predicts the difference in activities of two drugs F(drug1, drug2) \\(\\rightarrow\\) difference in activity, followed by the use of ranking algorithms. This formulation is general and agnostic, suitable for finding samples with target values beyond the target value range of the training set. We applied the formulation to work with support vector machines , random forests , and Gradient Boosting Machines . We compared the formulation with standard regression on thousands of drug design datasets, gene expression datasets and material property datasets. The test set extrapolation metric was the identification of examples with greater values than the training set, and top-performing examples (within the top 10% of the whole dataset). On this metric our pairwise formulation vastly outperformed standard regression. Its proposed variations also showed a consistent outperformance. Its application in the stock selection problem further confirmed the advantage of this pairwise formulation.","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":"70 1","pages":""},"PeriodicalIF":2.9000,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Machine Learning","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10994-024-06591-2","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

We propose a new machine learning formulation designed specifically for extrapolation. The textbook way to apply machine learning to drug design is to learn a univariate function that when a drug (structure) is input, the function outputs a real number (the activity): f(drug) \(\rightarrow\) activity. However, experience in real-world drug design suggests that this formulation of the drug design problem is not quite correct. Specifically, what one is really interested in is extrapolation: predicting the activity of new drugs with higher activity than any existing ones. Our new formulation for extrapolation is based on learning a bivariate function that predicts the difference in activities of two drugs F(drug1, drug2) \(\rightarrow\) difference in activity, followed by the use of ranking algorithms. This formulation is general and agnostic, suitable for finding samples with target values beyond the target value range of the training set. We applied the formulation to work with support vector machines , random forests , and Gradient Boosting Machines . We compared the formulation with standard regression on thousands of drug design datasets, gene expression datasets and material property datasets. The test set extrapolation metric was the identification of examples with greater values than the training set, and top-performing examples (within the top 10% of the whole dataset). On this metric our pairwise formulation vastly outperformed standard regression. Its proposed variations also showed a consistent outperformance. Its application in the stock selection problem further confirmed the advantage of this pairwise formulation.

Abstract Image

查看原文本刊更多论文

外推法不同于内插法

我们提出了一种专为外推设计的新机器学习方法。教科书上将机器学习应用于药物设计的方法是学习一个单变量函数，当输入药物（结构）时，函数输出一个实数（活性）：f(drug) \(\rightarrow\) activity。然而，实际药物设计的经验表明，药物设计问题的这种表述并不完全正确。具体来说，我们真正感兴趣的是外推法：预测活性高于任何现有药物的新药的活性。我们新的外推方法是基于学习一个预测两种药物活性差异的双变量函数 F(drug1, drug2) \(\rightarrow\) 活性差异，然后使用排序算法。这种公式具有通用性和不可知性，适用于寻找目标值超出训练集目标值范围的样本。我们将该公式应用于支持向量机、随机森林和梯度提升机。我们在数以千计的药物设计数据集、基因表达数据集和材料属性数据集上比较了该公式与标准回归。测试集的外推指标是识别出比训练集值更高的示例，以及表现最好的示例（在整个数据集中排名前 10%）。在这一指标上，我们的成对公式大大优于标准回归。其提出的变体也显示出持续的优越性。在选股问题中的应用进一步证实了这种成对公式的优势。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Machine Learning 工程技术-计算机：人工智能

CiteScore

11.00

自引率

2.70%

发文量

162

审稿时长

3 months

期刊介绍： Machine Learning serves as a global platform dedicated to computational approaches in learning. The journal reports substantial findings on diverse learning methods applied to various problems, offering support through empirical studies, theoretical analysis, or connections to psychological phenomena. It demonstrates the application of learning methods to solve significant problems and aims to enhance the conduct of machine learning research with a focus on verifiable and replicable evidence in published papers.