MSlocPRED: deep transfer learning-based identification of multi-label mRNA subcellular localization.

IF 6.8, Zone 2 (Biology), JCR Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in Bioinformatics, Pub Date: 2024-09-23, DOI: 10.1093/bib/bbae504

Yun Zuo, Bangyi Zhang, Wenying He, Yue Bi, Xiangrong Liu, Xiangxiang Zeng, Zhaohong Deng


Citations: 0

Abstract

Subcellular localization of messenger ribonucleic acid (mRNA) is a universal mechanism for precise and efficient control of translation. Although many computational methods have been developed to predict mRNA subcellular localization, very few are designed for mRNAs with multiple localization annotations, and their generalization performance leaves room for improvement. In this study, the prediction model MSlocPRED was constructed to identify multi-label mRNA subcellular localization. First, the preprocessed Dataset 1 and Dataset 2 are transformed into images. The proposed MDNDO-SMDU resampling technique is then used to balance the number of samples per category in the training dataset. Finally, deep transfer learning is used to construct the predictive model MSlocPRED, which identifies subcellular localization for 16 classes (Dataset 1) and 18 classes (Dataset 2). Comparative tests of different resampling techniques show that the technique proposed in this study is more effective as a preprocessing step for subcellular localization prediction. The NC ends are the 5' and 3' untranslated regions, which flank the protein-coding sequence and influence mRNA function without encoding proteins themselves. Prediction results on datasets constructed by extracting NC ends of different lengths show that, for both Dataset 1 and Dataset 2, prediction performance is best when 35 nucleotides are extracted from the NC end. Both independent testing and five-fold cross-validation comparisons show that MSlocPRED is significantly better than established prediction tools at identifying multi-label mRNA subcellular localization. Additionally, SHapley Additive exPlanations (SHAP) was used to explain how the MSlocPRED model works during prediction. The predictive model and associated datasets are available on GitHub: https://github.com/ZBYnb1/MSlocPRED/tree/main.
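
The preprocessing described in the abstract (extracting the sequence ends, turning a sequence into an image-like array, and representing several localizations per mRNA) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how sequences become images, so one-hot encoding is used here purely as an assumption, and reading "extracting the NC end" as "keep the first and last n nucleotides" is likewise an assumption. The function names `take_ends`, `one_hot_image`, and `multi_hot`, and the class names in the usage lines, are hypothetical.

```python
from typing import List, Sequence

NUCLEOTIDES = "ACGU"  # row order of the image-like matrix (assumed, not from the paper)


def take_ends(seq: str, n: int = 35) -> str:
    """Keep the first n and last n nucleotides of a sequence.

    Assumption: this is one plausible reading of extracting 35 nucleotides
    from the NC end; the paper's exact procedure may differ.
    """
    seq = seq.upper().replace("T", "U")  # accept DNA-style input
    if len(seq) <= 2 * n:
        return seq  # too short to trim, return unchanged
    return seq[:n] + seq[-n:]


def one_hot_image(seq: str) -> List[List[int]]:
    """Encode a sequence as a 4 x len(seq) binary matrix (rows A, C, G, U).

    Bases outside ACGU (for example N) become an all-zero column.
    """
    return [[1 if base == nt else 0 for base in seq] for nt in NUCLEOTIDES]


def multi_hot(labels: Sequence[str], classes: Sequence[str]) -> List[int]:
    """Encode a set of localization labels as a 0/1 vector over `classes`.

    Labels not listed in `classes` are ignored.
    """
    present = set(labels)
    return [1 if c in present else 0 for c in classes]


# Usage with made-up data: an 80-nt sequence and three hypothetical classes.
trimmed = take_ends("A" * 40 + "C" * 40, n=35)
image = one_hot_image(trimmed)
target = multi_hot(["nucleus", "cytosol"], ["cytosol", "exosome", "nucleus"])
```

A multi-hot target of this kind lets one mRNA carry several localization labels at once, which is what distinguishes the multi-label setting from ordinary single-label classification.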


Source journal

Briefings in Bioinformatics (Biology: Biochemical Research Methods)

CiteScore

13.20

Self-citation rate

13.70%

Articles published

549

Review time

6 months

About the journal: Briefings in Bioinformatics is an international journal serving as a platform for researchers and educators in the life sciences. It also appeals to mathematicians, statisticians, and computer scientists applying their expertise to biological challenges. The journal focuses on reviews tailored for users of databases and analytical tools in contemporary genetics, molecular and systems biology. It stands out by offering practical assistance and guidance to non-specialists in computerized methodologies. Covering a wide range from introductory concepts to specific protocols and analyses, the papers address bacterial, plant, fungal, animal, and human data. The journal's detailed subject areas include genetic studies of phenotypes and genotypes, mapping, DNA sequencing, expression profiling, gene expression studies, microarrays, alignment methods, protein profiles and HMMs, lipids, metabolic and signaling pathways, structure determination and function prediction, phylogenetic studies, and education and training.