Evaluating Machine Learning Models for Molecular Property Prediction: Performance and Robustness on Out-of-Distribution Data
Hosein Fooladi, Thi Ngoc Lan Vu, Miriam Mathea, Johannes Kirchmair
Journal of Chemical Information and Modeling, published 2025-09-15
DOI: 10.1021/acs.jcim.5c00475 (https://doi.org/10.1021/acs.jcim.5c00475)
Abstract
Today, machine learning models are employed extensively to predict the physicochemical and biological properties of molecules. Their performance is typically evaluated on in-distribution (ID) data, i.e., data originating from the same distribution as the training data. However, real-world applications of such models often involve molecules that are more distant from the training data, which makes it necessary to assess their performance on out-of-distribution (OOD) data. In this work, we investigate and evaluate the performance of 14 machine learning models, including classical approaches such as random forests and graph neural network (GNN) methods such as message-passing graph neural networks, across eight data sets using ten splitting strategies for OOD data generation. First, we investigate what constitutes OOD data in the molecular domain for bioactivity and ADMET prediction tasks. Contrary to the common view, we show that both classical machine learning and GNN models perform well on data split by Bemis-Murcko scaffolds, with results not substantially different from random splitting. Splitting based on chemical similarity clustering (UMAP-based clustering using ECFP4 fingerprints) poses the most challenging task for both types of models. Second, we investigate the extent to which ID and OOD performance have a positive linear relationship. If a positive correlation holds, the models that perform best on ID data can be selected with the expectation that they will also perform best on OOD data. We show that the strength of this linear relationship depends strongly on how the OOD data is generated, i.e., on which splitting strategy is used. While the correlation between ID and OOD performance is strong for scaffold splitting (Pearson's r ∼ 0.9), it decreases significantly for cluster-based splitting (Pearson's r ∼ 0.4). The relationship is therefore more nuanced, and a strong positive correlation is not guaranteed for all OOD scenarios. These findings suggest that OOD performance evaluation and model selection should be carefully aligned with the intended application domain.
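The ID-versus-OOD comparison described in the abstract can be illustrated with a minimal sketch, assuming each model has one ID score and one OOD score for a given splitting strategy. The model names and scores below are made-up placeholders for illustration, not results from the study, and the helper `pearson_r` is a hypothetical name written here with the Python standard library only.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Placeholder scores (higher is better) for five hypothetical models.
# These numbers are invented for illustration only.
id_scores = {"model_a": 0.80, "model_b": 0.75, "model_c": 0.70,
             "model_d": 0.85, "model_e": 0.65}
ood_scores = {"model_a": 0.60, "model_b": 0.66, "model_c": 0.58,
              "model_d": 0.62, "model_e": 0.55}

models = sorted(id_scores)
r = pearson_r([id_scores[m] for m in models],
              [ood_scores[m] for m in models])

# Does picking the best model on ID data also pick the best model on OOD data?
best_on_id = max(models, key=id_scores.get)
best_on_ood = max(models, key=ood_scores.get)

print(f"Pearson r between ID and OOD scores: {r:.2f}")
print(f"best on ID: {best_on_id}, best on OOD: {best_on_ood}")
```

Repeating this calculation once per splitting strategy would show how much the ID-based ranking can be trusted for each kind of OOD data; when r is low, the model that is best on ID data need not be the one that is best on OOD data.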
About the journal:
The Journal of Chemical Information and Modeling publishes papers reporting new methodology and/or important applications in the fields of chemical informatics and molecular modeling. Specific topics include the representation and computer-based searching of chemical databases, molecular modeling, computer-aided molecular design of new materials, catalysts, or ligands, development of new computational methods or efficient algorithms for chemical software, and biopharmaceutical chemistry including analyses of biological activity and other issues related to drug discovery.
Astute chemists, computer scientists, and information specialists look to this monthly’s insightful research studies, programming innovations, and software reviews to keep current with advances in this integral, multidisciplinary field.
As a subscriber you’ll stay abreast of database search systems, use of graph theory in chemical problems, substructure search systems, pattern recognition and clustering, analysis of chemical and physical data, molecular modeling, graphics and natural language interfaces, bibliometric and citation analysis, and synthesis design and reactions databases.