Development of Novel Methods for QSAR Modeling by Machine Learning Repeatedly: A Case Study on Drug Distribution to Each Tissue

IF 5.3 | Zone 2, Chemistry | Q1 CHEMISTRY, MEDICINAL

Journal of Chemical Information and Modeling | Pub Date: 2024-04-19 | DOI: 10.1021/acs.jcim.4c00046

Koichi Handa*, Saki Yoshimura, Michiharu Kageyama, and Takeshi Iijima

Citation: Koichi Handa*, Saki Yoshimura, Michiharu Kageyama, and Takeshi Iijima. Development of Novel Methods for QSAR Modeling by Machine Learning Repeatedly: A Case Study on Drug Distribution to Each Tissue. Journal of Chemical Information and Modeling, volume 64, issue 9, pages 3662–3669. Published 2024-04-19. Publication type: Journal Article. Open access: no. https://pubs.acs.org/doi/10.1021/acs.jcim.4c00046

Citations: 0

Abstract

Artificial intelligence is expected to help identify excellent candidates in drug discovery. However, data are scarce, because acquiring complete experimental data for many compounds is time-consuming and expensive. Hence, we sought to develop a novel quantitative structure-activity relationship (QSAR) method that predicts a parameter more precisely from an incomplete data set by optimizing data handling through the use of predicted explanatory variables. As a case study, we focused on the tissue-to-plasma partition coefficient (Kp), an important parameter for understanding drug distribution in tissues and for building physiologically based pharmacokinetic models, and a representative example of a small, sparse data set. We predicted the Kp values of 119 compounds in nine tissues (adipose, brain, gut, heart, kidney, liver, lung, muscle, and skin), although some of these values were not available. To fill the missing Kp values for each tissue, we first predicted them from the nonmissing data using a random forest (RF) model with in vitro parameters (log P, fu, Drug Class, and fi), as in a classical QSAR prediction. Next, to predict the tissue-specific Kp values in a test data set, we constructed a second RF model whose explanatory variables included not only the in vitro parameters but also the Kp values of the other tissues (i.e., tissues other than the target tissue) predicted by the first RF model. Furthermore, we tested all possible combinations of explanatory variables and selected the model with the highest predictability on the test data set as the final model. Evaluation of Kp prediction accuracy by root-mean-square error and R² showed that the proposed models outperformed other machine learning methods such as the conventional RF and message-passing neural networks. Significant improvements were observed in the Kp values of adipose tissue, brain, kidney, liver, and skin. These improvements indicate that Kp information on other tissues can be used to predict the Kp of a specific tissue. Additionally, by evaluating all combinations of explanatory variables, we found novel relationships between tissues. In conclusion, we developed a novel RF model to predict Kp values. We hope that this method will soon be applied to various problems in experimental biology, where data sets often contain missing values.
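The two-stage workflow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes scikit-learn's RandomForestRegressor, synthetic data in place of the 119-compound set, four tissues rather than nine, and a random 70/30 split. The function names, hyperparameters, the use of stage-1 predictions for every compound (rather than only for missing entries), and the restriction of the search to subsets of other-tissue Kp columns are all assumptions made for the example.

```python
from itertools import combinations

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score


def make_synthetic(n=119, n_tissues=4, missing=0.3, seed=0):
    """Synthetic stand-in data (NOT the paper's data): descriptors and sparse Kp."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 4))  # stand-ins for log P, fu, drug class, fi
    shared = X[:, 0] - 0.5 * X[:, 1]  # signal shared by all tissues
    kp = np.column_stack(
        [rng.uniform(0.5, 1.5) * shared + 0.3 * rng.normal(size=n)
         for _ in range(n_tissues)]
    )
    kp[rng.random(kp.shape) < missing] = np.nan  # knock out entries at random
    return X, kp


def fit_rf(X, y, seed=0):
    """Fit a small random forest (hyperparameters are illustrative only)."""
    return RandomForestRegressor(n_estimators=50, random_state=seed).fit(X, y)


def stage1_predict_kp(X, kp, train):
    """Stage 1: per tissue, train on measured training rows, predict all rows."""
    pred = np.empty_like(kp)
    for t in range(kp.shape[1]):
        known = train & ~np.isnan(kp[:, t])
        pred[:, t] = fit_rf(X[known], kp[known, t]).predict(X)
    return pred


def stage2_scores(X, kp, kp_pred, target, train, test):
    """Stage 2: score every subset of other-tissue predicted Kp as extra features.

    Returns {subset: (rmse, r2)} on the test rows; the empty subset is the
    conventional descriptor-only RF baseline.
    """
    others = [t for t in range(kp.shape[1]) if t != target]
    measured = ~np.isnan(kp[:, target])
    tr, te = train & measured, test & measured
    scores = {}
    for k in range(len(others) + 1):
        for subset in combinations(others, k):
            feats = np.hstack([X, kp_pred[:, list(subset)]])
            y_hat = fit_rf(feats[tr], kp[tr, target]).predict(feats[te])
            rmse = float(np.sqrt(mean_squared_error(kp[te, target], y_hat)))
            scores[subset] = (rmse, float(r2_score(kp[te, target], y_hat)))
    return scores


X, kp = make_synthetic()
train = np.random.default_rng(1).random(len(X)) < 0.7
kp_pred = stage1_predict_kp(X, kp, train)
for target in range(kp.shape[1]):
    scores = stage2_scores(X, kp, kp_pred, target, train, ~train)
    best = min(scores, key=lambda s: scores[s][0])  # lowest test RMSE
    rmse, r2 = scores[best]
    print(f"tissue {target}: best subset {best}, RMSE {rmse:.3f}, R2 {r2:.3f}")
```

Picking the subset with the lowest test-set RMSE mirrors the selection step as the abstract describes it. A separate validation split would likely give a less optimistic estimate. With nine tissues, this subset search covers 2^8 = 256 candidate models per target tissue.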


Source journal

Journal of Chemical Information and Modeling (Chemistry: Chemistry, Multidisciplinary)

CiteScore: 9.80

Self-citation rate: 10.70%

Articles published: 529

Review time: 1.4 months

About the journal: The Journal of Chemical Information and Modeling publishes papers reporting new methodology and/or important applications in the fields of chemical informatics and molecular modeling. Specific topics include the representation and computer-based searching of chemical databases, molecular modeling, computer-aided molecular design of new materials, catalysts, or ligands, development of new computational methods or efficient algorithms for chemical software, and biopharmaceutical chemistry including analyses of biological activity and other issues related to drug discovery. Astute chemists, computer scientists, and information specialists look to this monthly’s insightful research studies, programming innovations, and software reviews to keep current with advances in this integral, multidisciplinary field. As a subscriber you’ll stay abreast of database search systems, use of graph theory in chemical problems, substructure search systems, pattern recognition and clustering, analysis of chemical and physical data, molecular modeling, graphics and natural language interfaces, bibliometric and citation analysis, and synthesis design and reactions databases.