Integrating multi-encoding sequence features via stacking ensemble learning for RNA m5C site prediction.

IF 1.3 · Zone 4 (Biology) · JCR Q4 (Biochemistry & Molecular Biology)
Ubaid Ur Rahman, Naeem Ul Islam
Journal: Nucleosides, Nucleotides & Nucleic Acids, pp. 1-29
Publication date: 2026-04-20
Publication type: Journal Article
Open access: no
DOI: 10.1080/15257770.2026.2658190 (https://doi.org/10.1080/15257770.2026.2658190)
Citations: 0

Abstract

RNA 5-methylcytosine (m5C) is an important epitranscriptomic modification involved in RNA stability, translation, and post-transcriptional regulation. Accurate identification of m5C sites remains challenging because existing computational methods rely on limited sequence representations and integrate features insufficiently. In this study, we propose a machine learning framework that integrates six complementary sequence encoding schemes: enhanced nucleic acid composition (ENAC), tri-nucleotide composition (TNC), composition of k-spaced nucleic acid pairs (CKSNAP), pseudo-electron-ion interaction potential (PseEIIP), one-hot encoding, and nucleotide chemical properties (NCP). Each encoding is paired with an optimal classifier, and a stacking ensemble strategy is employed to fuse the outputs of the base classifiers. The base learners are trained with 5-fold cross-validation and the meta-learner with 3-fold cross-validation. Evaluation with multiple metrics shows that the proposed approach achieves improved robustness and cross-dataset generalization, with an accuracy of 75.5%, an MCC of 0.51, and a PR-AUC of 0.82. These results indicate that the proposed fusion-based ensemble framework provides an effective and reliable solution for RNA m5C site prediction.
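
To make the encoding step concrete, the following is a minimal sketch of four of the six schemes named in the abstract (one-hot, NCP, TNC, and CKSNAP), written from their commonly used definitions. It is not the authors' implementation. The abstract does not give the window length, the CKSNAP gap range, or how ambiguous bases are handled, so the function names, the `max_gap=2` default, and the 11-nt example sequence are illustrative assumptions. Input is assumed to contain only A, C, G, and U (or T) and to be longer than `max_gap + 1`.

```python
from itertools import product

ALPHABET = "ACGU"  # RNA alphabet; DNA-style T is mapped to U below

# Nucleotide chemical properties as (ring structure, functional group, hydrogen bonding):
# purine=1, amino=1, weak hydrogen bonding=1.
NCP_TABLE = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "U": (0, 0, 1)}


def normalize(seq):
    """Upper-case the sequence and map T to U."""
    return seq.upper().replace("T", "U")


def one_hot(seq):
    """Four binary values per position, in A, C, G, U order."""
    seq = normalize(seq)
    return [1 if base == letter else 0 for base in seq for letter in ALPHABET]


def ncp(seq):
    """Three binary chemical-property values per position."""
    seq = normalize(seq)
    return [bit for base in seq for bit in NCP_TABLE[base]]


def tnc(seq):
    """Frequencies of the 64 trinucleotides over all overlapping 3-mers."""
    seq = normalize(seq)
    kmers = ["".join(p) for p in product(ALPHABET, repeat=3)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - 2
    for i in range(total):
        counts[seq[i:i + 3]] += 1
    return [counts[k] / total for k in kmers]


def cksnap(seq, max_gap=2):
    """Frequencies of the 16 nucleotide pairs separated by 0..max_gap bases."""
    seq = normalize(seq)
    pairs = [a + b for a, b in product(ALPHABET, repeat=2)]
    features = []
    for gap in range(max_gap + 1):
        counts = dict.fromkeys(pairs, 0)
        total = len(seq) - gap - 1
        for i in range(total):
            counts[seq[i] + seq[i + gap + 1]] += 1
        features.extend(counts[p] / total for p in pairs)
    return features


example = "GGACUCGAUCU"  # hypothetical 11-nt window, not taken from the paper
print(len(one_hot(example)), len(ncp(example)), len(tnc(example)), len(cksnap(example)))
```

In a stacking setup like the one the abstract describes, each of these feature vectors would feed its own base classifier, and the base classifiers' out-of-fold predictions would become the inputs to the meta-learner. Which classifier is paired with which encoding is not stated in the abstract.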

Source journal: Nucleosides, Nucleotides & Nucleic Acids (Biology: Biochemistry & Molecular Biology)
CiteScore: 2.60
Self-citation rate: 7.70%
Articles published: 91
Review time: 6 months
Journal description: Nucleosides, Nucleotides & Nucleic Acids publishes research articles, short notices, and concise, critical reviews of related topics that focus on the chemistry and biology of nucleosides, nucleotides, and nucleic acids. Complete with experimental details, this all-inclusive journal emphasizes the synthesis, biological activities, new and improved synthetic methods, and significant observations related to new compounds.