Title: Ubigo-X: Protein ubiquitination site prediction using ensemble learning with image-based feature representation and weighted voting
Authors: Disline Manli Tantoh, Jen-Chieh Yu, Ching-Hsuan Chien, Wei-Yi Yeh, Yen-Wei Chu
Journal: Computational and Structural Biotechnology Journal, volume 27, pages 3137-3146
Published: 2025-07-14 (Journal Article, eCollection)
DOI: https://doi.org/10.1016/j.csbj.2025.07.025
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12303043/pdf/
Impact factor: 4.1; JCR: Q2 (Biochemistry & Molecular Biology)
Citations: 0
Abstract
Accurate identification of ubiquitination sites is crucial for analyzing biological function. We developed Ubigo-X, a novel protein ubiquitination site prediction tool. Our training data, sourced from the Protein Lysine Modification Database (PLMD 3.0), comprised 53,338 ubiquitination and 71,399 non-ubiquitination sites retained after CD-HIT and CD-HIT-2d sequence filtering. We developed three sub-models: Single-Type sequence-based features (Single-Type SBF), k-mer sequence-based features (Co-Type SBF), and structure-based and function-based features (S-FBF). Single-Type SBF used amino acid composition (AAC), amino acid index (AAindex), and one-hot encoding; Co-Type SBF derived its features from Single-Type SBF via k-mer encoding; and S-FBF used secondary structure, relative solvent accessibility (RSA)/absolute solvent-accessible area (ASA), and signal peptide cleavage sites. S-FBF was trained using XGBoost, while Single-Type SBF and Co-Type SBF were transformed into image-based features and trained using ResNet34. Ubigo-X combines the three models via a weighted voting strategy. Independent testing on PhosphoSitePlus data (65,421 ubiquitination and 61,222 non-ubiquitination sites retained after filtering) yielded an area under the curve (AUC) of 0.85, an accuracy (ACC) of 0.79, and a Matthews correlation coefficient (MCC) of 0.58. Further testing on imbalanced PhosphoSitePlus data (1:8 positive-to-negative sample ratio) yielded 0.94 AUC, 0.85 ACC, and 0.55 MCC. On the GPS-Uber data, the AUC, ACC, and MCC were 0.81, 0.59, and 0.27, respectively. In conclusion, Ubigo-X outperformed existing tools in MCC (for both balanced and imbalanced data) and in AUC and ACC (for balanced data), highlighting the efficacy of integrating image-based feature representation and weighted voting in ubiquitination prediction. Ubigo-X is a potential species-neutral ubiquitination site prediction tool, accessible at http://merlin.nchu.edu.tw/ubigox/.
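The abstract does not spell out how the weighted voting is computed, so the following is only a minimal sketch of one common reading: soft voting, where each sub-model's positive-class probability is averaged with a per-model weight and the result is thresholded. It also shows how ACC and MCC follow from the confusion-matrix counts. The function names, the weights, the threshold of 0.5, and the example probabilities are all assumptions made for illustration; none of them come from the paper.

```python
import math


def weighted_vote(probs, weights):
    """Weighted average of per-model positive-class probabilities (soft voting)."""
    if len(probs) != len(weights):
        raise ValueError("probs and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(probs, weights)) / total


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # conventional value when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


def evaluate(y_true, y_score, threshold=0.5):
    """Threshold the scores and return (accuracy, MCC)."""
    tp = tn = fp = fn = 0
    for truth, score in zip(y_true, y_score):
        pred = 1 if score >= threshold else 0
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        elif pred == 1:
            fp += 1
        else:
            fn += 1
    accuracy = (tp + tn) / len(y_true)
    return accuracy, mcc(tp, tn, fp, fn)


# Hypothetical probabilities for four candidate lysine sites, in the order
# (Single-Type SBF, Co-Type SBF, S-FBF). These are made-up numbers.
site_probs = [
    (0.9, 0.8, 0.6),
    (0.2, 0.4, 0.3),
    (0.7, 0.3, 0.8),
    (0.4, 0.1, 0.6),
]
labels = [1, 0, 1, 0]
weights = (0.4, 0.4, 0.2)  # illustrative only; not the published weights

scores = [weighted_vote(p, weights) for p in site_probs]
accuracy, matthews = evaluate(labels, scores)
print(f"ACC={accuracy:.2f} MCC={matthews:.2f}")
```

Because MCC uses all four confusion-matrix cells, it is less flattered by class imbalance than accuracy, which is consistent with the abstract reporting a higher ACC but a lower MCC on the 1:8 imbalanced set than on the balanced one.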
About the journal:
Computational and Structural Biotechnology Journal (CSBJ) is an online gold open access journal publishing research articles and reviews after full peer review. All articles are published, without barriers to access, immediately upon acceptance. The journal places a strong emphasis on functional and mechanistic understanding of how molecular components in a biological process work together through the application of computational methods. Structural data may provide such insights, but they are not a prerequisite for publication in the journal. Specific areas of interest include, but are not limited to:
Structure and function of proteins, nucleic acids and other macromolecules
Structure and function of multi-component complexes
Protein folding, processing and degradation
Enzymology
Computational and structural studies of plant systems
Microbial Informatics
Genomics
Proteomics
Metabolomics
Algorithms and Hypothesis in Bioinformatics
Mathematical and Theoretical Biology
Computational Chemistry and Drug Discovery
Microscopy and Molecular Imaging
Nanotechnology
Systems and Synthetic Biology