A Novel Computational Machine Learning Pipeline to Quantify Similarities in Three-Dimensional Protein Structures

bioRxiv - Pharmacology and Toxicology | Pub Date: 2024-08-17 | DOI: 10.1101/2024.08.14.607969

Shreyas U Hirway, Xiao Xu, Fan Fan


Citations: 0


A Novel Computational Machine Learning Pipeline to Quantify Similarities in Three-Dimensional Protein Structures

Animal models are widely used during drug development. The selection of a suitable animal model relies on various factors, such as target biology, animal resource availability, and legacy species. It is imperative that the selected animal species exhibit the highest resemblance to humans, both in target biology and in the similarity of the target protein. The current practice for assessing cross-species protein similarity relies on pairwise comparison of protein sequences rather than the biologically relevant three-dimensional (3D) structure of proteins. We developed a novel quantitative machine learning pipeline using 3D structure-based feature data from the Protein Data Bank, nominal data from UniProt, and bioactivity data from ChEMBL, all of which were matched for human and animal data. Using an XGBoost regression model, similarity scores between targets were calculated, and based on these scores the best animal species for each target was identified. For real-world application, targets from an alternative source, AlphaFold, were tested using the model, and the animal species with the protein most similar to its human counterpart was predicted. These targets were then grouped by their associated phenotype so that the pipeline could predict an optimal animal species.
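The last two steps of the abstract, choosing the best species per target and then per phenotype group, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the target names, species, phenotype labels, and score values are invented placeholders standing in for the regression model's output, and averaging scores within a phenotype is an assumed aggregation rule that the abstract does not specify.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical similarity scores (human target vs. animal counterpart).
# All names and values are invented placeholders, not results from the paper.
scores = {
    ("TARGET_A", "mouse"): 0.82,
    ("TARGET_A", "rat"): 0.79,
    ("TARGET_A", "dog"): 0.91,
    ("TARGET_B", "mouse"): 0.88,
    ("TARGET_B", "rat"): 0.94,
    ("TARGET_B", "dog"): 0.90,
}

# Hypothetical phenotype annotation for each target.
phenotype_of = {"TARGET_A": "phenotype_1", "TARGET_B": "phenotype_1"}


def best_species_per_target(scores):
    """Return {target: species with the highest similarity score}."""
    best = {}
    for (target, species), score in scores.items():
        if target not in best or score > scores[(target, best[target])]:
            best[target] = species
    return best


def best_species_per_phenotype(scores, phenotype_of):
    """Average scores over the targets in each phenotype (an assumed rule),
    then return {phenotype: species with the highest mean score}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for (target, species), score in scores.items():
        grouped[phenotype_of[target]][species].append(score)
    return {
        phenotype: max(by_species, key=lambda s: mean(by_species[s]))
        for phenotype, by_species in grouped.items()
    }


print(best_species_per_target(scores))
print(best_species_per_phenotype(scores, phenotype_of))
```

With these placeholder values, the per-target choice differs between the two targets, while the phenotype-level choice is the species with the highest mean score across both. A different aggregation rule (for example, the minimum score across targets) could select a different species.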

