GENNUS: generative approaches for nucleotide sequences enhance mirtron classification.

Impact factor: 2.8; JCR quartile: Q1 (Genetics & Heredity)
NAR Genomics and Bioinformatics. Pub Date: 2025-06-20; eCollection Date: 2025-06-01; DOI: 10.1093/nargab/lqaf072
Alisson Gaspar Chiquitto, Liliane Santana Oliveira, Pedro Henrique Bugatti, Priscila Tiemi Maeda Saito, Mark Basham, Roberto Tadeu Raittz, Alexandre Rossi Paschoal
Volume 7, issue 2, article lqaf072. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12204755/pdf/
Citations: 0

Abstract

Classifying non-coding RNA (ncRNA) sequences, particularly mirtrons, is essential for elucidating gene regulation mechanisms. However, the prevalent class imbalance in ncRNA datasets presents significant challenges, often resulting in overfitting and diminished generalization in machine learning models. This study proposes GENNUS (GENerative approaches for NUcleotide Sequences), which introduces novel data augmentation strategies based on generative adversarial networks (GANs) and the synthetic minority over-sampling technique (SMOTE) to improve the classification of mirtrons and canonical microRNAs (miRNAs). Our GAN-based methods generate high-quality synthetic data that capture the intricate patterns and diversity of real mirtron sequences, eliminating the need for extensive feature engineering. Four experiments demonstrate that models trained on a combination of real and GAN-generated data achieve higher classification accuracy than models trained with traditional SMOTE augmentation or with real data alone. Our findings indicate that GANs enhance model performance and provide a richer representation of minority classes, thus improving generalization across various machine learning frameworks. This work highlights the potential of synthetic data generation for addressing data limitations in genomics, offering a pathway to more effective and scalable classification of mirtrons and canonical miRNAs. GENNUS is available at https://github.com/chiquitto/GENNUS and https://doi.org/10.6084/m9.figshare.28207328.
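For readers unfamiliar with SMOTE, the baseline that the abstract compares GAN augmentation against, the core idea is to create synthetic minority-class samples by interpolating between a real minority sample and one of its nearest minority-class neighbours. The following is a minimal NumPy sketch of that idea only, not the GENNUS implementation: the function name `smote_sketch`, the choice of `k`, and the two example features (GC content and sequence length) are assumptions made for illustration.

```python
import numpy as np


def smote_sketch(minority, n_new, k=3, seed=0):
    """Return n_new synthetic rows interpolated between minority samples.

    Hypothetical helper for illustration; not taken from GENNUS.
    """
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances between all minority samples.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is never its own neighbour

    # Indices of the k nearest minority neighbours of every sample.
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base sample and one of its neighbours for each new row.
    base = rng.integers(0, n, size=n_new)
    partner = neighbours[base, rng.integers(0, k, size=n_new)]

    # Place the new point at a random position on the connecting segment.
    gap = rng.random((n_new, 1))
    return minority[base] + gap * (minority[partner] - minority[base])


# Assumed toy features per sequence: [GC content, length in nucleotides].
X_min = np.array([[0.40, 62.0], [0.45, 70.0], [0.50, 66.0], [0.42, 58.0]])
synthetic = smote_sketch(X_min, n_new=6)
print(synthetic.shape)  # (6, 2)
```

Because each synthetic row lies on a line segment between two real samples, SMOTE operates on fixed-length numeric feature vectors, whereas the abstract describes the GAN-based methods as avoiding extensive feature engineering.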


Source journal: NAR Genomics and Bioinformatics
CiteScore: 8.00
Self-citation rate: 2.20%
Articles published: 95
Review time: 15 weeks