Conformal prediction-based machine learning in Cheminformatics: Current applications and new challenges

Artificial intelligence in the life sciences, Volume 7, Article 100127. Pub Date: 2025-02-08. DOI: 10.1016/j.ailsci.2025.100127

Mario Astigarraga, Andrés Sánchez-Ruiz, Gonzalo Colmenarejo



Abstract

Conformal Prediction (CP) is a distribution-free Machine Learning (ML) framework, developed over the last ∼25 years, that provides well-calibrated prediction subsets/intervals that include the true label with a user-defined probability, requiring only that the data be exchangeable. It is based on the concept of nonconformity (or dissimilarity) of a new prediction compared to previous data and their predictions, so that the prediction subset/interval is larger for new “unusual” instances and smaller for “typical” instances. Given its simplicity and ease of application, it has been widely adopted in Cheminformatics since 2012, especially in the Quantitative Structure-Activity Relationship (QSAR) modeling and Molecular Screening areas. This rapid popularization of CP in Cheminformatics can be explained on the grounds that: (a) it can handle the applicability domain (AD) issue of ML models, which is of great importance in Cheminformatics due to the immense size of the chemical space; (b) it deals with the classification of heavily imbalanced datasets, which are typical in Molecular Screening; and (c) it quantifies compound-specific prediction uncertainties, which is especially useful because it allows gain-cost strategies to be implemented that accelerate drug discovery by reducing the number of compounds to test. This comprehensive review introduces the method, provides a full appraisal of the work done in the field of Cheminformatics (with special emphasis on the QSAR and Molecular Screening arenas), and discusses its pros and cons and new challenges, especially for Deep Learning applications and nonexchangeable datasets, a very frequent situation in Cheminformatics.
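The nonconformity idea in the abstract can be illustrated with a minimal sketch of split (inductive) conformal regression. This is not code from the review: the function name `split_conformal_interval`, its parameters, and the optional per-compound "difficulty" estimates used to scale the intervals are all assumptions chosen for illustration. The nonconformity score here is the absolute residual on a held-out calibration set, optionally divided by the difficulty estimate, and the interval half-width is the conformal quantile of those scores.

```python
import math
import numpy as np

def split_conformal_interval(cal_true, cal_pred, test_pred, alpha=0.2,
                             cal_difficulty=None, test_difficulty=None):
    """Return (lower, upper) prediction intervals at confidence 1 - alpha.

    Hypothetical helper: all names are illustrative assumptions.
    cal_* arrays come from a calibration set not used to fit the model.
    The *_difficulty arrays are optional positive per-instance estimates;
    when given, "unusual" instances (larger difficulty) get wider intervals.
    """
    cal_true = np.asarray(cal_true, dtype=float)
    cal_pred = np.asarray(cal_pred, dtype=float)
    test_pred = np.asarray(test_pred, dtype=float)
    n = cal_true.size

    # Default to unit difficulty, which gives constant-width intervals.
    cal_d = np.ones(n) if cal_difficulty is None else np.asarray(cal_difficulty, dtype=float)
    test_d = (np.ones(test_pred.size) if test_difficulty is None
              else np.asarray(test_difficulty, dtype=float))

    # Nonconformity scores: (normalized) absolute residuals, sorted ascending.
    scores = np.sort(np.abs(cal_true - cal_pred) / cal_d)

    # Conformal quantile: the k-th smallest score, k = ceil((n + 1)(1 - alpha)).
    # If k exceeds n, there are too few calibration points for this alpha,
    # and the only valid interval is the whole real line.
    k = math.ceil((n + 1) * (1 - alpha))
    q = scores[k - 1] if k <= n else np.inf

    half_width = q * test_d
    return test_pred - half_width, test_pred + half_width

# Usage with made-up numbers: the second test compound is flagged as
# four times "harder" than the first, so its interval is four times wider.
lower, upper = split_conformal_interval(
    cal_true=[1.0, 2.0, 3.0, 4.0],
    cal_pred=[1.5, 2.0, 2.0, 6.0],
    test_pred=[10.0, 10.0],
    alpha=0.5,
    cal_difficulty=[1.0, 1.0, 2.0, 2.0],
    test_difficulty=[1.0, 4.0],
)
```

Under exchangeability of calibration and test data, intervals built this way contain the true value with probability at least 1 - alpha on average. That marginal guarantee is the property the abstract refers to, and it no longer holds automatically for nonexchangeable datasets.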


Source journal

Artificial intelligence in the life sciences Pharmacology, Biochemistry, Genetics and Molecular Biology (General), Computer Science Applications, Health Informatics, Drug Discovery, Veterinary Science and Veterinary Medicine (General)

CiteScore

5.00

Self-citation rate

0.00%

Publication volume

Review time

15 days