Enhancing the identification of malonylation sites using AlphaFold2 and ensemble learning.

IF 3.8 · Zone 2 (Chemistry) · JCR Q2: Chemistry, Applied

Molecular Diversity · Pub Date: 2025-10-05 · DOI: 10.1007/s11030-025-11357-6

Linlin Xu, Yuting Qian, Jiayi Yang, Xiaowei Xu, Zhiqiang Li, Yanhan Wang, Enhui Lv, Xingxing Kang, Hongwei Zhang, Yaping Lu, Fei Wang, Xin Liu


Citations: 0

Abstract

Protein malonylation is closely linked to many diseases, including diabetes and cancer, so accurate identification of malonylation sites is crucial for elucidating the molecular mechanisms underlying these diseases. Traditional experimental methods are costly, time-consuming, and technically difficult. With advances in artificial intelligence, computational prediction of protein post-translational modification sites has emerged as a vital complement to experimental approaches. In this paper, we present a malonylation site prediction model, Catsoft_Kmalsite. Its core innovation is the integration of complementary information from protein three-dimensional structural features and sequence/physicochemical features, coupled with a soft voting ensemble strategy based on Bayesian-optimized base classifiers. Specifically, we use AlphaFold2 to obtain protein tertiary structural information and employ the CTDC, EAAC, and EGAAC methods to extract protein sequence and physicochemical features. Two base classifiers are then built with the CatBoost algorithm, one on each of these feature sets. After their parameters are fine-tuned by Bayesian optimization, the base classifiers are combined using a soft voting strategy. The ablation experiments show that Catsoft_Kmalsite has good robustness and generalization ability. On six metrics (AUC, ACC, Sen, Pre, F1, and MCC), the model achieved average values of 94.03%, 87.91%, 89.15%, 86.91%, 88.00%, and 0.7585, respectively, in fivefold cross-validation, and 95.18%, 89.55%, 90.87%, 88.79%, 89.82%, and 0.7912 on the independent test set; Catsoft_Kmalsite also outperformed other state-of-the-art methods on all evaluated metrics. Furthermore, we have developed a web server for users ( http://1.94.102.146:8501/Catsoft_Kmalsite ). The code and dataset of Catsoft_Kmalsite are available at https://github.com/flyinsky6/Catsoft_Kmalsite .
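The abstract names three building blocks that can be illustrated in a few lines: EAAC sequence encoding, soft voting over two base classifiers, and the MCC metric. The following is a minimal, dependency-free sketch and not the authors' implementation. The function names, the window size of 5, the equal voting weights, and the 0.5 decision threshold are all assumptions made for illustration; the abstract does not state these settings. The two probability lists stand in for the outputs of the structure-based and sequence-based CatBoost classifiers.

```python
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def eaac(peptide, window=5):
    """EAAC sketch: amino acid frequencies in each sliding window.

    The window size is an assumed setting, not taken from the paper.
    """
    features = []
    for start in range(len(peptide) - window + 1):
        segment = peptide[start:start + window]
        # Padding or gap characters (e.g. '-') are left out of the count.
        valid = [aa for aa in segment if aa in AMINO_ACIDS]
        n = len(valid) or 1  # avoid division by zero on all-gap windows
        features.extend(valid.count(aa) / n for aa in AMINO_ACIDS)
    return features


def soft_vote(prob_structure, prob_sequence, weights=(0.5, 0.5), threshold=0.5):
    """Soft voting: weighted mean of two positive-class probabilities.

    Equal weights and the 0.5 threshold are assumptions for illustration.
    """
    w_a, w_b = weights
    fused = [(w_a * a + w_b * b) / (w_a + w_b)
             for a, b in zip(prob_structure, prob_sequence)]
    labels = [1 if p >= threshold else 0 for p in fused]
    return fused, labels


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical probabilities from the two base classifiers.
    fused, labels = soft_vote([0.9, 0.2], [0.7, 0.4])
    print(labels, mcc([1, 0], labels))
```

Soft voting averages probabilities rather than hard labels, so a confident prediction from one feature view can outweigh a marginal prediction from the other. This is one plausible reason to prefer it when the two classifiers see complementary features.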




Source journal

Molecular Diversity (Chemistry: Chemistry, Multidisciplinary)

CiteScore

7.30

Self-citation rate

7.90%

Articles published

219

Review time

2.7 months

Journal description: Molecular Diversity is a new publication forum for the rapid publication of refereed papers dedicated to describing the development, application and theory of molecular diversity and combinatorial chemistry in basic and applied research and drug discovery. The journal publishes both short and full papers, perspectives, news and reviews dealing with all aspects of the generation of molecular diversity, the application of diversity for screening against alternative targets of all types (biological, biophysical, technological), and the analysis of results obtained and their application in various scientific disciplines and approaches, including: combinatorial chemistry and parallel synthesis; small molecule libraries; microwave synthesis; flow synthesis; fluorous synthesis; diversity oriented synthesis (DOS); nanoreactors; click chemistry; multiplex technologies; fragment- and ligand-based design; structure/function/SAR; computational chemistry and molecular design; chemoinformatics; screening techniques and screening interfaces; analytical and purification methods; robotics, automation and miniaturization; targeted libraries; display libraries; peptides and peptoids; proteins; oligonucleotides; carbohydrates; natural diversity; new methods of library formulation and deconvolution; directed evolution, origin of life and recombination; search techniques, landscapes, random chemistry and more.