ILYCROsite: Identification of lysine crotonylation sites based on FCM-GRNN undersampling technique

IF 2.6, Zone 4 (Biology), JCR Q2 (BIOLOGY)

Computational Biology and Chemistry, Pub Date: 2024-09-13, DOI: 10.1016/j.compbiolchem.2024.108212

Yun Zuo, Minquan Wan, Yang Shen, Xinheng Wang, Wenying He, Yue Bi, Xiangrong Liu, Zhaohong Deng

Volume 113, Article 108212. Full text: https://www.sciencedirect.com/science/article/pii/S1476927124002007

Citations: 0

Abstract


ILYCROsite: Identification of lysine crotonylation sites based on FCM-GRNN undersampling technique

Protein lysine crotonylation is an important post-translational modification that regulates a variety of cellular activities. For example, histone crotonylation affects chromatin structure and promotes histone replacement. Identifying and understanding lysine crotonylation sites is therefore crucial in protein research. However, as the number of known non-histone crotonylation sites grows, existing classifiers based on traditional machine learning may run into performance limitations. To address this problem, and given the advantages of deep learning techniques for sequence data analysis, this study presents a novel deep learning model, based on a multilayer perceptron (MLP) with attention, for identifying crotonylation sites. First, three feature extraction strategies (Amino Acid Composition, K-mer, and a distance-based residue feature extraction strategy) were used to encode crotonylated and non-crotonylated sequences. Then, to balance the training dataset, the FCM-GRNN undersampling algorithm, which combines fuzzy clustering with a generalized regression neural network, was introduced. Finally, to improve the effectiveness of crotonylation site identification, we compared various classification algorithms experimentally and selected the MLP combined with a superimposed self-attention mechanism to construct the prediction model, ILYCROsite. Results from independent testing and five-fold cross-validation showed that ILYCROsite performs well. Notably, on the independent test set ILYCROsite achieves an AUC of 87.93%, which is reported to be significantly better than existing state-of-the-art models. In addition, SHAP (Shapley Additive exPlanations) values were used to analyze the importance of features and their impact on model predictions. To make the model easier for researchers to use, we also developed a prediction program that identifies crotonylation sites in a given protein sequence. The data and code for this program are available at: https://github.com/wmqskr/ILYCROsite.
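
As an illustration of the first two encodings named in the abstract, the sketch below computes Amino Acid Composition and K-mer frequency vectors for a peptide window. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `aac` and `kmer`, the choice of k = 2, the frequency normalization, and the 11-residue example window are all hypothetical, and the paper's exact definitions may differ.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(seq: str) -> list[float]:
    """Amino Acid Composition: relative frequency of each of the 20 residues."""
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def kmer(seq: str, k: int = 2) -> list[float]:
    """K-mer composition: relative frequency of each length-k residue string."""
    vocab = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    # Skip windows containing non-standard symbols (e.g. padding characters).
    counts = Counter(w for w in windows if all(c in AMINO_ACIDS for c in w))
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(vocab)
    return [counts[v] / total for v in vocab]


# Hypothetical 11-residue window centred on a candidate lysine (K).
window = "AGSTKKAPRLE"
features = aac(window) + kmer(window, k=2)
print(len(features))  # 20 AAC values + 400 dipeptide values
```

Concatenating the two vectors gives one fixed-length feature row per candidate site, which is the form a downstream undersampling step or classifier would typically consume.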


Source journal

Computational Biology and Chemistry (Biology; Computer Science: Interdisciplinary Applications)

CiteScore: 6.10

Self-citation rate: 3.20%

Articles published: 142

Review time: 24 days

About the journal: Computational Biology and Chemistry publishes original research papers and review articles in all areas of computational life sciences. High-quality research contributions with a major computational component in the areas of nucleic acid and protein sequence research, molecular evolution, molecular genetics (functional genomics and proteomics), theory and practice of either biology-specific or chemical-biology-specific modeling, and structural biology of nucleic acids and proteins are particularly welcome. Exceptionally high-quality research work in bioinformatics, systems biology, ecology, computational pharmacology, metabolism, biomedical engineering, epidemiology, and statistical genetics will also be considered. Given their inherent uncertainty, protein modeling and molecular docking studies should be thoroughly validated. In the absence of experimental results for validation, complementary techniques such as molecular dynamics simulations with detailed free energy calculations should be used to support the major conclusions. Submissions of premature modeling exercises without additional biological insights will not be considered. Review articles will generally be commissioned by the editors and should not be submitted to the journal without explicit invitation. However, prospective authors are welcome to send a brief (one to three pages) synopsis, which will be evaluated by the editors.