Exploring the Potential of Adaptive, Local Machine Learning in Comparison to the Prediction Performance of Global Models: A Case Study from Bayer's Caco-2 Permeability Database.
Frank Filip Steinbauer, Thorsten Lehr, Andreas Reichel
Journal of Chemical Information and Modeling, pp. 9163-9172. Published 2024-12-23 (Epub 2024-11-20). DOI: 10.1021/acs.jcim.4c01083 (https://doi.org/10.1021/acs.jcim.4c01083)
Abstract
Machine learning (ML) techniques are being widely implemented to fill the gap in simple molecular design guidelines for newer therapeutic modalities in the extended and beyond rule of five chemical space (eRo5, bRo5). These ML techniques predict molecular properties directly from the structure, allowing for the prioritization of promising compounds. However, the performance of models varies greatly among ML use cases. A molecular property for which achieving sufficient performance in generalizing global models still remains difficult is Caco-2 permeability. Especially within the lower permeability ranges, which are specific for larger molecules belonging to the e/bRo5 space, accurate regression predictions have proven to be challenging. The present study, therefore, identifies a suitable combination of ML algorithm and descriptors, consisting of the LightGBM algorithm and RDKit molecular property descriptors, to predict Caco-2 permeability very efficiently by a simple global model. An additionally introduced local model uses the same algorithm and descriptors but selects its training data based on Tanimoto fingerprint similarity to match the individual test compound's structure. Evaluation of this adaptive model, by systematically varying the number of most similar structures for training, shows that, in comparison to the global model, there was only marginally improved performance with specific training data constellations. These random improvements indicate that deriving general rules for local model parametrization is not possible a priori for the chosen algorithm and descriptor combination, and preselecting training data does not seem advantageous over global ML based on all available data, while creation of more data-efficient models was generally proven to be possible.
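The training-data selection step of the local model described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: fingerprints are represented here as plain sets of on-bit indices (in practice they would come from a cheminformatics toolkit such as RDKit), and the names `tanimoto`, `select_local_training_set`, the compound labels, and the bit values are all hypothetical. The sketch only covers ranking a training library by Tanimoto similarity to one test compound and keeping the k most similar entries; fitting a regressor such as LightGBM on that subset is left out.

```python
from typing import Dict, FrozenSet, List, Tuple

# Assumption: a fingerprint is the set of indices of its "on" bits.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity: shared on-bits divided by the union of on-bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def select_local_training_set(
    query: Fingerprint,
    library: Dict[str, Fingerprint],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k library compounds most similar to the query.

    Ties are broken alphabetically by compound name so the result is
    deterministic.
    """
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]


# Toy data (hypothetical compounds and bit positions).
library = {
    "cmpd_a": frozenset({1, 2, 3, 4}),
    "cmpd_b": frozenset({1, 2, 9}),
    "cmpd_c": frozenset({7, 8}),
    "cmpd_d": frozenset({1, 2, 3, 5}),
}
query = frozenset({1, 2, 3})

# The local model would be trained only on these neighbours.
neighbours = select_local_training_set(query, library, k=2)
print(neighbours)
```

Varying `k` in such a sketch corresponds to the study's systematic variation of the number of most similar structures used for training; setting `k` to the library size recovers the global model's training set.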
About the journal:
The Journal of Chemical Information and Modeling publishes papers reporting new methodology and/or important applications in the fields of chemical informatics and molecular modeling. Specific topics include the representation and computer-based searching of chemical databases, molecular modeling, computer-aided molecular design of new materials, catalysts, or ligands, development of new computational methods or efficient algorithms for chemical software, and biopharmaceutical chemistry including analyses of biological activity and other issues related to drug discovery.
Astute chemists, computer scientists, and information specialists look to this monthly’s insightful research studies, programming innovations, and software reviews to keep current with advances in this integral, multidisciplinary field.
As a subscriber you’ll stay abreast of database search systems, use of graph theory in chemical problems, substructure search systems, pattern recognition and clustering, analysis of chemical and physical data, molecular modeling, graphics and natural language interfaces, bibliometric and citation analysis, and synthesis design and reactions databases.