XGBoost-enhanced ensemble model using discriminative hybrid features for the prediction of sumoylation sites.

IF 6.1 · Zone 3 (Biology) · Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

BioData Mining · Pub Date: 2025-02-03 · DOI: 10.1186/s13040-024-00415-8

Salman Khan, Sumaiya Noor, Tahir Javed, Afshan Naseem, Fahad Aslam, Salman A AlQahtani, Nijad Ahmad

BioData Mining, volume 18 (1), article 12. Publication type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11792219/pdf/

Citations: 0

Abstract


XGBoost-enhanced ensemble model using discriminative hybrid features for the prediction of sumoylation sites.


Posttranslational modifications (PTMs) are essential for regulating protein localization and stability, and they significantly affect gene expression, biological functions, and genome replication. Among these, sumoylation, a PTM that attaches a chemical group to protein sequences, plays a critical role in protein function. Identifying sumoylation sites is particularly important because of their links to Parkinson's and Alzheimer's diseases. This study introduces XGBoost-Sumo, a robust model that predicts sumoylation sites by integrating protein structure and sequence data. The model uses a transformer-based attention mechanism to encode peptides and extracts evolutionary features through the PsePSSM-DWT approach. After fusing the word embeddings with the evolutionary descriptors, it applies the SHapley Additive exPlanations (SHAP) algorithm for feature selection and eXtreme Gradient Boosting (XGBoost) for classification. XGBoost-Sumo achieved an accuracy of 99.68% on benchmark datasets under 10-fold cross-validation and 96.08% on independent samples, outperforming existing models by 10.31% on training data and 2.74% on independent tests. The model's reliability and high performance make it a valuable resource for researchers, with strong potential for applications in pharmaceutical development.
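The abstract names PsePSSM-DWT as the evolutionary feature extractor but does not give its settings. As a rough illustration only, the sketch below implements the commonly used PsePSSM definition (per-column means plus lagged mean squared differences of a position-specific scoring matrix) and one level of a Haar discrete wavelet transform. The function names `pse_pssm` and `haar_dwt`, the `max_lag` parameter, the choice of the Haar wavelet, and the toy matrix are all assumptions for this example, not details taken from the paper.

```python
import math


def pse_pssm(pssm, max_lag=2):
    """PsePSSM sketch: column means, then one block of lag terms per lag.

    `pssm` is a list of rows (one per residue); a real PSSM has 20 columns,
    but the code works for any column count.
    """
    length = len(pssm)
    n_cols = len(pssm[0])
    if max_lag >= length:
        raise ValueError("max_lag must be smaller than the sequence length")
    # Column means: average score of each column over all positions.
    features = [sum(row[j] for row in pssm) / length for j in range(n_cols)]
    # Lag terms: mean squared difference between positions i and i + lag.
    for lag in range(1, max_lag + 1):
        for j in range(n_cols):
            total = sum(
                (pssm[i][j] - pssm[i + lag][j]) ** 2 for i in range(length - lag)
            )
            features.append(total / (length - lag))
    return features


def haar_dwt(signal):
    """One level of the Haar wavelet transform: (approximation, detail)."""
    values = list(signal)
    if len(values) % 2:
        values.append(values[-1])  # pad odd lengths by repeating the last value
    scale = math.sqrt(2.0)
    approx = [(values[i] + values[i + 1]) / scale for i in range(0, len(values), 2)]
    detail = [(values[i] - values[i + 1]) / scale for i in range(0, len(values), 2)]
    return approx, detail


# Toy 4-residue, 2-column matrix (hypothetical values, not real PSSM scores).
toy_pssm = [[1.0, 0.0], [3.0, 0.0], [5.0, 2.0], [7.0, 2.0]]
toy_features = pse_pssm(toy_pssm, max_lag=1)
toy_approx, toy_detail = haar_dwt([row[0] for row in toy_pssm])
```

In a full pipeline of the kind the abstract describes, vectors like `toy_features` would be concatenated with the peptide embeddings before feature selection and classification; how the paper combines the PsePSSM and DWT outputs is not stated in the abstract.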


Source journal

BioData Mining · MATHEMATICAL & COMPUTATIONAL BIOLOGY

CiteScore

7.90

Self-citation rate

0.00%

Articles published

Review time

23 weeks

About the journal: BioData Mining is an open access, open peer-reviewed journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data. Topical areas include, but are not limited to:
- Development, evaluation, and application of novel data mining and machine learning algorithms.
- Adaptation, evaluation, and application of traditional data mining and machine learning algorithms.
- Open-source software for the application of data mining and machine learning algorithms.
- Design, development, and integration of databases, software, and web services for the storage, management, retrieval, and analysis of data from large-scale studies.
- Pre-processing, post-processing, modeling, and interpretation of data mining and machine learning results for biological interpretation and knowledge discovery.