Thorough Assessment of Machine Learning Techniques for Predicting Protein-Nucleic Acid Binding Hot Spots

IF 2.4 3区 生物学 Q3 BIOCHEMICAL RESEARCH METHODS
Xianzhe Zou, Chen Zhang, MIngyan Tang, Lei Deng
{"title":"Thorough Assessment of Machine Learning Techniques for Predicting Protein-Nucleic Acid Binding Hot Spots","authors":"Xianzhe Zou, Chen Zhang, MIngyan Tang, Lei Deng","doi":"10.2174/1574893618666230913090436","DOIUrl":null,"url":null,"abstract":"Background: Proteins and nucleic acids are vital biomolecules that contribute significantly to biological life. The precise and efficient identification of hot spots at protein-nucleic acid interfaces is crucial for guiding drug development, advancing protein engineering, and exploring the underlying molecular recognition mechanisms. As experimental methods like alanine scanning mutagenesis prove to be time-consuming and expensive, a growing number of machine learning techniques are being employed to predict hot spots. However, the existing approach is distinguished by a lack of uniform standards, a scarcity of data, and a wide range of attributes. Currently, there is no comprehensive overview or evaluation of this field. As a result, providing a full overview and review is extremely helpful. Methods: In this study, we present an overview of cutting-edge machine learning approaches utilized for hot spot prediction in protein-nucleic acid complexes. Additionally, we outline the feature categories currently in use, derived from relevant biological data sources, and assess conventional feature selection methods based on 600 extracted features. Simultaneously, we create two new benchmark datasets, PDHS87 and PRHS48, and develop distinct binary classification models based on these datasets to evaluate the advantages and disadvantages of various machine-learning techniques. Results: Prediction of protein-nucleic acid interaction hotspots is a challenging task. The study demonstrates that structural neighborhood features play a crucial role in identifying hot spots. The prediction performance can be improved by choosing effective feature selection methods and machine learning methods. Among the existing prediction methods, XGBPRH has the best performance. Conclusion: It is crucial to continue studying hot spot theories, discover new and effective features, add accurate experimental data, and utilize DNA/RNA information. Semi-supervised learning, transfer learning, and ensemble learning can optimize predictive ability. Combining computational docking with machine learning methods can potentially further improve predictive performance.","PeriodicalId":10801,"journal":{"name":"Current Bioinformatics","volume":"1 1","pages":"0"},"PeriodicalIF":2.4000,"publicationDate":"2023-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Current Bioinformatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2174/1574893618666230913090436","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}
引用次数: 0

Abstract

Background: Proteins and nucleic acids are vital biomolecules that contribute significantly to biological life. The precise and efficient identification of hot spots at protein-nucleic acid interfaces is crucial for guiding drug development, advancing protein engineering, and exploring the underlying molecular recognition mechanisms. As experimental methods like alanine scanning mutagenesis prove to be time-consuming and expensive, a growing number of machine learning techniques are being employed to predict hot spots. However, the existing approach is distinguished by a lack of uniform standards, a scarcity of data, and a wide range of attributes. Currently, there is no comprehensive overview or evaluation of this field. As a result, providing a full overview and review is extremely helpful. Methods: In this study, we present an overview of cutting-edge machine learning approaches utilized for hot spot prediction in protein-nucleic acid complexes. Additionally, we outline the feature categories currently in use, derived from relevant biological data sources, and assess conventional feature selection methods based on 600 extracted features. Simultaneously, we create two new benchmark datasets, PDHS87 and PRHS48, and develop distinct binary classification models based on these datasets to evaluate the advantages and disadvantages of various machine-learning techniques. Results: Prediction of protein-nucleic acid interaction hotspots is a challenging task. The study demonstrates that structural neighborhood features play a crucial role in identifying hot spots. The prediction performance can be improved by choosing effective feature selection methods and machine learning methods. Among the existing prediction methods, XGBPRH has the best performance. Conclusion: It is crucial to continue studying hot spot theories, discover new and effective features, add accurate experimental data, and utilize DNA/RNA information. Semi-supervised learning, transfer learning, and ensemble learning can optimize predictive ability. Combining computational docking with machine learning methods can potentially further improve predictive performance.
预测蛋白质-核酸结合热点的机器学习技术的全面评估
背景:蛋白质和核酸是对生物生命有重要贡献的重要生物分子。准确、高效地识别蛋白质-核酸界面热点对于指导药物开发、推进蛋白质工程、探索潜在的分子识别机制至关重要。由于像丙氨酸扫描诱变这样的实验方法被证明是耗时且昂贵的,越来越多的机器学习技术被用于预测热点。然而,现有方法的特点是缺乏统一的标准、数据稀缺和属性范围广。目前,对这一领域还没有全面的概述和评价。因此,提供一个完整的概述和回顾是非常有帮助的。方法:在本研究中,我们概述了用于蛋白质-核酸复合物热点预测的尖端机器学习方法。此外,我们概述了目前使用的特征类别,这些特征来自相关的生物数据源,并基于600个提取的特征评估了传统的特征选择方法。同时,我们创建了两个新的基准数据集PDHS87和PRHS48,并基于这些数据集开发了不同的二元分类模型,以评估各种机器学习技术的优缺点。结果:预测蛋白质与核酸相互作用热点是一项具有挑战性的任务。研究表明,结构邻域特征在热点识别中起着至关重要的作用。通过选择有效的特征选择方法和机器学习方法,可以提高预测性能。在现有的预测方法中,XGBPRH的预测效果最好。结论:继续研究热点理论,发现新的有效特征,增加准确的实验数据,利用DNA/RNA信息至关重要。半监督学习、迁移学习和集成学习可以优化预测能力。将计算对接与机器学习方法相结合,可以进一步提高预测性能。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Current Bioinformatics
Current Bioinformatics 生物-生化研究方法
CiteScore
6.60
自引率
2.50%
发文量
77
审稿时长
>12 weeks
期刊介绍: Current Bioinformatics aims to publish all the latest and outstanding developments in bioinformatics. Each issue contains a series of timely, in-depth/mini-reviews, research papers and guest edited thematic issues written by leaders in the field, covering a wide range of the integration of biology with computer and information science. The journal focuses on advances in computational molecular/structural biology, encompassing areas such as computing in biomedicine and genomics, computational proteomics and systems biology, and metabolic pathway engineering. Developments in these fields have direct implications on key issues related to health care, medicine, genetic disorders, development of agricultural products, renewable energy, environmental protection, etc.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信