A Novel Prediction Method for Metal-Ion Binding Sites in Protein Sequence Based on Ensemble Learning
Authors: Chuyi Song, Jing-qing Jiang
Published in: Proceedings of the 2022 5th International Conference on Algorithms, Computing and Artificial Intelligence
Publication date: 2022-12-23
DOI: 10.1145/3579654.3579694 (https://doi.org/10.1145/3579654.3579694)

Abstract
Identifying metal ion-binding sites is important for characterizing protein structures and understanding their biological functions. However, in the Protein Data Bank (PDB), which collects the known crystal structures of proteins, fewer than one percent of the structures are membrane proteins, even though membrane proteins play a significant role in material exchange for cells and are closely tied to drug target design. In this work, we develop an efficient prediction method for six different types of metal ion-binding sites in membrane proteins. To address the class imbalance in the dataset, random down-sampling is applied multiple times to obtain multiple training subsets, each with equal numbers of binding and non-binding residues. Support vector machine (SVM) and random forest (RF) classification models are built on these subsets, and their results are combined by an ensemble learning algorithm, which efficiently reduces the number of false positives in the final prediction. On an independent test set, the proposed method achieves an average accuracy of 0.991 and an average Matthews correlation coefficient (MCC) of 0.681, outperforming a recently proposed prediction method. This performance suggests that the proposed method can serve as an accurate tool for predicting metal ion-binding sites in membrane proteins and should assist in the design of new drug targets.
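The pipeline described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The abstract does not state how the subsets are drawn or how the SVM and RF outputs are combined, so the choices below are assumptions: every binding residue is kept in each subset, non-binding residues are sampled without replacement, and predictions are combined by strict majority vote. The function names `balanced_subsets`, `majority_vote`, and `mcc` are hypothetical.

```python
import math
import random


def balanced_subsets(labels, n_subsets, seed=0):
    """Build n_subsets index lists for training.

    Assumption: each subset keeps all binding residues (label 1) and adds an
    equally sized random sample of non-binding residues (label 0).
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if len(pos) > len(neg):
        raise ValueError("expected binding residues to be the minority class")
    # rng.sample draws without replacement, so each subset is exactly balanced.
    return [pos + rng.sample(neg, len(pos)) for _ in range(n_subsets)]


def majority_vote(votes_per_model):
    """Combine 0/1 predictions from several models, one list per model.

    Assumption: a residue is called binding only when more than half of the
    models agree; ties fall back to non-binding.
    """
    n_models = len(votes_per_model)
    return [1 if 2 * sum(col) > n_models else 0 for col in zip(*votes_per_model)]


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0.0 if undefined)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

In a full pipeline, one SVM and one RF model would be trained on each index list returned by `balanced_subsets`, and their per-residue predictions would be passed to `majority_vote`. Requiring agreement among models is one plausible way an ensemble can lower the false positive count. The `mcc` helper also shows why accuracy and MCC are reported together: on a heavily imbalanced test set, labeling every residue as non-binding gives high accuracy but an MCC of 0.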