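Using statistical analysis to explore the influencing factors of data imbalance for machine learning identification methods of human transcriptome m6A modification sites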

IF 2.6 | Zone 4 (Biology) | JCR Q2 (BIOLOGY)

Computational Biology and Chemistry Pub Date : 2025-01-14 DOI:10.1016/j.compbiolchem.2025.108351

Mingxin Li, Rujun Li, Yichi Zhang, Shiyu Peng, Zhibin Lv

Volume 115, Article 108351. Full text: https://www.sciencedirect.com/science/article/pii/S1476927125000118

Citations: 0


Using statistical analysis to explore the influencing factors of data imbalance for machine learning identification methods of human transcriptome m6A modification sites

RNA methylation, particularly m6A modification, is a key epigenetic mechanism that governs gene expression and influences a range of biological functions, so accurate identification of methylation sites is essential for understanding those functions. Traditional experimental methods, however, are often costly and sensitive to experimental conditions, which makes machine learning, especially deep learning, a vital tool for m6A site identification. Despite their utility, current machine learning models struggle with imbalanced datasets, a common issue in bioinformatics. This study addresses data imbalance in RNA methylation site prediction from three perspectives: feature encoding, deep learning models, and data resampling strategies. Using a K-mer one-hot encoding strategy, we extracted RNA sequence features and built classification models based on long short-term memory (LSTM) networks and a variant, multiplicative LSTM (mLSTM). We further improved model performance with ensemble and weighted strategies. Additionally, we used the sequence generative adversarial network (SeqGAN) and the synthetic minority over-sampling technique (SMOTE) to construct balanced datasets of RNA methylation sites. The prediction results were analyzed with the Wilcoxon test and multivariate linear regression to quantify the effects of different K values, model architectures, and sampling methods on classification outcomes. The analysis showed that feature selection, model architecture, and sampling technique have a significant impact when addressing data imbalance. Notably, the best prediction performance was achieved with K = 5 using the mLSTM-ensemble model. These findings offer new insights and methods for RNA methylation site identification and provide guidance for similar challenges in bioinformatics.
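
The abstract does not spell out how the K-mer one-hot encoding is built, so the following is only a minimal sketch of one common reading: slide a window of length K over the sequence and map each overlapping K-mer to a one-hot vector of length 4^K. The function names `kmer_index` and `kmer_one_hot`, the `ACGU` alphabet, the T-to-U conversion, and the all-zero handling of ambiguous bases are illustrative assumptions, not the authors' implementation.

```python
from itertools import product

ALPHABET = "ACGU"  # assumed RNA alphabet; the paper's exact alphabet is not stated


def kmer_index(k):
    """Map every possible k-mer over ALPHABET to a column index (hypothetical helper)."""
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}


def kmer_one_hot(seq, k):
    """Encode seq as a list of one-hot rows, one row per overlapping k-mer."""
    index = kmer_index(k)
    seq = seq.upper().replace("T", "U")  # assumption: accept DNA-style input too
    rows = []
    for start in range(len(seq) - k + 1):
        row = [0] * len(index)
        col = index.get(seq[start:start + k])
        if col is not None:  # k-mers containing other symbols (e.g. N) stay all-zero
            row[col] = 1
        rows.append(row)
    return rows


encoded = kmer_one_hot("GGACU", k=2)
print(len(encoded), len(encoded[0]))  # number of k-mers, width of each one-hot row
```

Under this reading, K = 5 (the best-performing value reported in the abstract) gives rows of width 4^5 = 1024, and the resulting matrix could be fed step by step into a recurrent model such as an LSTM.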


Source journal

Computational Biology and Chemistry (Biology; Computer Science, Interdisciplinary Applications)

CiteScore: 6.10
Self-citation rate: 3.20%
Articles published: 142
Review time: 24 days

About the journal: Computational Biology and Chemistry publishes original research papers and review articles in all areas of computational life sciences. High quality research contributions with a major computational component in the areas of nucleic acid and protein sequence research, molecular evolution, molecular genetics (functional genomics and proteomics), theory and practice of either biology-specific or chemical-biology-specific modeling, and structural biology of nucleic acids and proteins are particularly welcome. Exceptionally high quality research work in bioinformatics, systems biology, ecology, computational pharmacology, metabolism, biomedical engineering, epidemiology, and statistical genetics will also be considered. Given their inherent uncertainty, protein modeling and molecular docking studies should be thoroughly validated. In the absence of experimental results for validation, molecular dynamics simulations with detailed free energy calculations, for example, should be used as complementary techniques to support the major conclusions. Submissions of premature modeling exercises without additional biological insights will not be considered. Review articles will generally be commissioned by the editors and should not be submitted to the journal without explicit invitation. However, prospective authors are welcome to send a brief (one to three pages) synopsis, which will be evaluated by the editors.