MARS and RNAcmap3: The Master Database of All Possible RNA Sequences Integrated with RNAcmap for RNA Homology Search

Genomics, Proteomics & Bioinformatics Pub Date : 2024-03-01 DOI:10.1093/gpbjnl/qzae018

Ke'ai Chen, Thomas Litfin, Jaswinder Singh, Jian Zhan, Yaoqi Zhou



Abstract



The recent success of AlphaFold2 in protein structure prediction relied heavily on co-evolutionary information derived from homologous protein sequences found in a huge, integrated database of protein sequences (the Big Fantastic Database). In contrast, the existing nucleotide databases have not been consolidated to facilitate wider and deeper homology searches. Here, we built a comprehensive database that includes the non-coding RNA (ncRNA) sequences from RNAcentral, the transcriptome and metagenome assemblies from metagenomics RAST (MG-RAST), the genomic sequences from Genome Warehouse (GWH), and the genomic sequences from MGnify, in addition to the nucleotide database (nt) and its subsets at the National Center for Biotechnology Information (NCBI). The resulting Master database of All possible RNA sequences (MARS) is 20-fold larger than NCBI’s nt database and 60-fold larger than RNAcentral. The new dataset, together with a new split–search strategy, allows a substantial improvement in homology search over existing state-of-the-art techniques. It also yields more accurate and more sensitive multiple sequence alignments (MSAs) than the manually curated MSAs from Rfam for the majority of structured RNAs mapped to Rfam. The results indicate that MARS, coupled with the fully automatic homology search tool RNAcmap, will be useful for improved structural and functional inference of ncRNAs and for RNA language models based on MSAs. MARS is accessible at https://ngdc.cncb.ac.cn/omix/release/OMIX003037 and RNAcmap3 is accessible at http://zhouyq-lab.szbl.ac.cn/download/.
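
The abstract does not describe how the split–search strategy works. As a purely illustrative sketch of one plausible reading (partition a database too large to search in one pass, search each shard independently, then merge the hits), the Python below may help fix the idea. It is not the RNAcmap3 implementation: the names `split_records`, `split_search`, and `toy_search` are hypothetical, and the toy scorer stands in for a real search tool.

```python
from typing import Callable, Iterable, Iterator

Record = tuple[str, str]   # (sequence id, sequence)
Hit = tuple[str, float]    # (sequence id, score; higher is better)


def split_records(records: Iterable[Record], shard_size: int) -> Iterator[list[Record]]:
    """Yield consecutive shards holding at most `shard_size` records."""
    if shard_size < 1:
        raise ValueError("shard_size must be positive")
    shard: list[Record] = []
    for record in records:
        shard.append(record)
        if len(shard) == shard_size:
            yield shard
            shard = []
    if shard:  # final, possibly smaller shard
        yield shard


def split_search(
    query: str,
    records: Iterable[Record],
    search_shard: Callable[[str, list[Record]], list[Hit]],
    shard_size: int = 2,
) -> list[Hit]:
    """Search each shard separately and merge hits, keeping the best score per id."""
    best: dict[str, float] = {}
    for shard in split_records(records, shard_size):
        for seq_id, score in search_shard(query, shard):
            if score > best.get(seq_id, float("-inf")):
                best[seq_id] = score
    # Highest score first; ties broken by id so the order is deterministic.
    return sorted(best.items(), key=lambda hit: (-hit[1], hit[0]))


def toy_search(query: str, shard: list[Record]) -> list[Hit]:
    """Hypothetical stand-in for a real search tool: counts matching positions."""
    hits: list[Hit] = []
    for seq_id, seq in shard:
        score = float(sum(a == b for a, b in zip(query, seq)))
        if score > 0:
            hits.append((seq_id, score))
    return hits


if __name__ == "__main__":
    database = [("a", "ACGU"), ("b", "ACGA"), ("c", "UUUU"), ("d", "GGGG")]
    print(split_search("ACGU", database, toy_search, shard_size=2))
```

In a real pipeline the per-shard scores would also need to be made comparable across shards (for example, E-values depend on the size of the database searched), which this sketch deliberately ignores.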
