Use of tree-based machine learning methods to screen affinitive peptides based on docking data.

IF 3.1 | Zone 4, Medicine | Q3 CHEMISTRY, MEDICINAL

Molecular Informatics Pub Date : 2023-12-01 Epub Date: 2023-11-09 DOI:10.1002/minf.202300143

Hua Feng, Fangyu Wang, Ning Li, Qian Xu, Guanming Zheng, Xuefeng Sun, Man Hu, Xuewu Li, Guangxu Xing, Gaiping Zhang


Citations: 0

Abstract

Screening peptides with good affinity is an important step in peptide-drug discovery. Recent advances in computer and data science have made machine learning a useful tool for accurate screening of affinitive peptides. In the current study, four tree-based algorithms, namely Classification and Regression Trees (CART), the C5.0 decision tree (C50), Bagged CART (BAG) and Random Forest (RF), were employed to explore the relationship between experimental peptide affinities and virtual docking data, and the performance of the models was compared in parallel. All four algorithms performed better on the dataset that had been pre-scaled, centered and PCA-transformed than on the other pre-processed datasets. After model rebuilding and hyperparameter optimization, the optimal C50 model (C50O) showed the best performance in terms of Accuracy, Kappa, Sensitivity, Specificity, F1, MCC and AUC when validated on the test data and on an unknown PEDV dataset (Accuracy = 80.4%). BAG and RFO (the optimal RF), the two best models during training, did not perform as expected in the test and unknown-dataset validations. Furthermore, the high correlation of the RFO and BAG predictions with those of C50O implied high stability and robustness of their predictions. In contrast, despite its good performance on the unknown dataset, the poor performance of CARTO in test-data validation and correlation analysis indicated that it could not be used to predict future data. To accurately evaluate peptide affinity, this study is the first to compare tree-based models for affinitive-peptide prediction using virtual docking data, which should expand the application of machine learning algorithms in studying PepPIs and benefit the development of peptide therapeutics.
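The abstract ranks the models by Accuracy, Kappa, Sensitivity, Specificity, F1 and MCC, all of which can be derived from a binary confusion matrix. The following is a minimal sketch of those standard definitions, not the authors' code: the function name `binary_metrics` and the counts in the example are hypothetical, and AUC is left out because it requires predicted scores rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics for a two-class problem.

    Hypothetical helper for illustration; assumes both classes and both
    predicted labels occur at least once, so no denominator is zero.
    """
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)   # true-positive rate (recall)
    specificity = tn / (tn + fp)   # true-negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)

    # Cohen's kappa: observed agreement corrected for the agreement
    # expected by chance from the row and column totals.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (accuracy - p_chance) / (1 - p_chance)

    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )

    return {
        "Accuracy": accuracy,
        "Kappa": kappa,
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "F1": f1,
        "MCC": mcc,
    }


# Hypothetical counts, not taken from the paper.
example = binary_metrics(tp=40, fp=10, tn=35, fn=15)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Kappa and MCC are useful alongside Accuracy because they account for class imbalance, which is one reason a model can look strong in training yet rank lower on held-out data.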




Source journal

Molecular Informatics CHEMISTRY, MEDICINAL-MATHEMATICAL & COMPUTATIONAL BIOLOGY

CiteScore

7.30

Self-citation rate

2.80%

Articles published

Review time

3 months

About the journal: Molecular Informatics is a peer-reviewed, international forum for publication of high-quality, interdisciplinary research on all molecular aspects of bio/cheminformatics and computer-assisted molecular design. Molecular Informatics succeeded QSAR & Combinatorial Science in 2010. Molecular Informatics presents methodological innovations that will lead to a deeper understanding of ligand-receptor interactions, macromolecular complexes, molecular networks, design concepts and processes that demonstrate how ideas and design concepts lead to molecules with a desired structure or function, preferably including experimental validation. The journal's scope includes but is not limited to the fields of drug discovery and chemical biology, protein and nucleic acid engineering and design, the design of nanomolecular structures, strategies for modeling of macromolecular assemblies, molecular networks and systems, pharmaco- and chemogenomics, computer-assisted screening strategies, as well as novel technologies for the de novo design of biologically active molecules. As a unique feature, Molecular Informatics publishes so-called "Methods Corner" review-type articles, which feature important technological concepts and advances within the scope of the journal.