Enhancing the identification of malonylation sites using AlphaFold2 and ensemble learning.
Authors: Linlin Xu, Yuting Qian, Jiayi Yang, Xiaowei Xu, Zhiqiang Li, Yanhan Wang, Enhui Lv, Xingxing Kang, Hongwei Zhang, Yaping Lu, Fei Wang, Xin Liu
Journal: Molecular Diversity (Impact Factor 3.8; JCR Q2, Chemistry, Applied)
Publication type: Journal Article
Publication date: 2025-10-05
DOI: https://doi.org/10.1007/s11030-025-11357-6
Citation count: 0
Abstract
Protein malonylation is closely associated with many diseases, including diabetes and cancer, so accurately identifying malonylation sites is crucial for elucidating the molecular mechanisms underlying these diseases. Traditional experimental methods are costly, time-consuming, and technically demanding. With advances in artificial intelligence, computational prediction of protein post-translational modification sites has become a vital complement to experimental approaches. In this paper, we present Catsoft_Kmalsite, a malonylation site prediction model whose core innovation is the integration of complementary information from protein three-dimensional structural features and sequence/physicochemical features, coupled with a soft voting ensemble of Bayesian-optimized base classifiers. Specifically, we use AlphaFold2 to obtain protein tertiary structure information and the CTDC, EAAC, and EGAAC methods to extract protein sequence and physicochemical features. Two base classifiers are then built with the CatBoost algorithm, one on each of these feature sets. After the parameters of the base classifiers are tuned by Bayesian optimization, the classifiers are combined with a soft voting strategy. Ablation experiments show that Catsoft_Kmalsite exhibits good robustness and generalization ability. For AUC, ACC, Sen, Pre, F1, and MCC, the model achieved average values of 94.03%, 87.91%, 89.15%, 86.91%, 88.00%, and 0.7585, respectively, in fivefold cross-validation, and values of 95.18%, 89.55%, 90.87%, 88.79%, 89.82%, and 0.7912 on the independent test set. Catsoft_Kmalsite also outperformed other state-of-the-art methods on all evaluated metrics. A web server is available at http://1.94.102.146:8501/Catsoft_Kmalsite and the code and dataset are available at https://github.com/flyinsky6/Catsoft_Kmalsite .
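The soft voting step and the threshold-based metrics named in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two probability lists stand in for the outputs of the structure-based and sequence-based classifiers, the equal weighting and the 0.5 decision threshold are assumptions, and the toy labels and probabilities are invented for the example. AUC is omitted because it is computed from the ranking of scores rather than from thresholded predictions.

```python
import math


def soft_vote(prob_a, prob_b, weight_a=0.5):
    """Weighted average of two classifiers' positive-class probabilities.

    Equal weights (weight_a=0.5) are an assumption for illustration.
    """
    if len(prob_a) != len(prob_b):
        raise ValueError("probability lists must have the same length")
    weight_b = 1.0 - weight_a
    return [weight_a * pa + weight_b * pb for pa, pb in zip(prob_a, prob_b)]


def binary_metrics(y_true, y_prob, threshold=0.5):
    """ACC, Sen (recall), Pre (precision), F1 and MCC from a confusion matrix."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    acc = (tp + tn) / len(y_true)
    sen = tp / (tp + fn) if (tp + fn) else 0.0
    pre = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * pre * sen / (pre + sen) if (pre + sen) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "Sen": sen, "Pre": pre, "F1": f1, "MCC": mcc}


# Hypothetical toy data: 1 = malonylated site, 0 = non-malonylated site.
y_true = [1, 1, 1, 0, 0, 0]
# Made-up probabilities standing in for the two base classifiers.
structure_prob = [0.9, 0.4, 0.8, 0.2, 0.6, 0.1]
sequence_prob = [0.8, 0.9, 0.4, 0.3, 0.2, 0.7]

ensemble_prob = soft_vote(structure_prob, sequence_prob)
metrics = binary_metrics(y_true, ensemble_prob)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this toy case each base classifier misclassifies two of the six sites on its own, while the averaged probabilities fall on the correct side of the threshold for all six. The data were chosen to show how complementary errors can cancel under soft voting; they say nothing about the real model's behavior.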
About the journal:
Molecular Diversity is a new publication forum for the rapid publication of refereed papers dedicated to describing the development, application and theory of molecular diversity and combinatorial chemistry in basic and applied research and drug discovery. The journal publishes both short and full papers, perspectives, news and reviews dealing with all aspects of the generation of molecular diversity, application of diversity for screening against alternative targets of all types (biological, biophysical, technological), analysis of results obtained and their application in various scientific disciplines/approaches including:
combinatorial chemistry and parallel synthesis;
small molecule libraries;
microwave synthesis;
flow synthesis;
fluorous synthesis;
diversity oriented synthesis (DOS);
nanoreactors;
click chemistry;
multiplex technologies;
fragment- and ligand-based design;
structure/function/SAR;
computational chemistry and molecular design;
chemoinformatics;
screening techniques and screening interfaces;
analytical and purification methods;
robotics, automation and miniaturization;
targeted libraries;
display libraries;
peptides and peptoids;
proteins;
oligonucleotides;
carbohydrates;
natural diversity;
new methods of library formulation and deconvolution;
directed evolution, origin of life and recombination;
search techniques, landscapes, random chemistry and more.