CytoLNCpred-a computational method for predicting cytoplasm associated long non-coding RNAs in 15 cell-lines.

IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics Pub Date : 2025-05-26 eCollection Date: 2025-01-01 DOI:10.3389/fbinf.2025.1585794

Shubham Choudhury, Naman Kumar Mehta, Gajendra P S Raghava

Frontiers in bioinformatics, volume 5, article 1585794. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12146324/pdf/

Citations: 0

Abstract

The function of a long non-coding RNA (lncRNA) is largely determined by its specific location within a cell. Previous methods have used noisy datasets, including mRNA transcripts in tools intended for lncRNAs, and have excluded lncRNAs lacking significant differential localization between the cytoplasm and nucleus. To overcome these shortcomings, we developed a method for predicting cytoplasm-associated lncRNAs in 15 human cell-lines, identifying which lncRNAs are more abundant in the cytoplasm than in the nucleus. All models in this study were trained using five-fold cross-validation and tested on a validation dataset. Initially, we developed machine learning and deep learning models using traditional features such as composition and correlation. Using composition-based and correlation-based features, machine learning algorithms achieved average AUCs of 0.7049 and 0.7089, respectively, across the 15 cell-lines. Secondly, we developed machine learning models using embedding features obtained from the large language model DNABERT-2. The average AUC across all cell-lines achieved by this approach was 0.665. Subsequently, we also fine-tuned DNABERT-2 on our training dataset and evaluated the fine-tuned model on the validation dataset, where it achieved an average AUC of 0.6336. For predicting differential lncRNA localization, correlation-based features combined with ML algorithms outperformed the LLM-based models. These cell-line-specific models, as well as a web-based service, are publicly available from our web server (https://webs.iiitd.edu.in/raghava/cytolncpred/).
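To make the ingredients named in the abstract concrete, here is a minimal Python sketch of a composition feature, a five-fold split, and the AUC metric. It is an illustration only, not the authors' implementation: the abstract does not define the exact features, so the k-mer frequency vector, the function names (`kmer_composition`, `roc_auc`, `five_fold_indices`), the label convention (1 = cytoplasm-associated, 0 = otherwise), and the round-robin fold assignment are all assumptions.

```python
from itertools import product


def kmer_composition(seq, k=2):
    """Return the normalized frequency of every DNA k-mer in `seq`.

    Assumed feature definition: the abstract only says "composition".
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with N or other symbols
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def roc_auc(labels, scores):
    """AUC as the probability that a positive example outscores a negative one.

    Ties count as 0.5. Labels: 1 = cytoplasm-associated (assumed), 0 = otherwise.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def five_fold_indices(n, folds=5):
    """Yield (train, test) index lists; sample i is held out in fold i % folds."""
    for f in range(folds):
        test = [i for i in range(n) if i % folds == f]
        train = [i for i in range(n) if i % folds != f]
        yield train, test
```

In a workflow like the one described, each lncRNA sequence would be turned into a feature vector, a classifier would be fitted on the training indices of each fold, and `roc_auc` would be computed on held-out scores. The classifier itself is left out because the abstract does not name one.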

