NucleoSeeker-precision filtering of RNA databases to curate high-quality datasets.

Impact factor 2.8; JCR Q1, Genetics & Heredity

NAR Genomics and Bioinformatics. Pub Date: 2025-03-18; eCollection Date: 2025-03-01; DOI: 10.1093/nargab/lqaf021

Utkarsh Upadhyay, Fabrizio Pucci, Julian Herold, Alexander Schug



Abstract

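Computational prediction of biomolecular structure complements wet-lab experiments, which are often involved. Unlike protein structure prediction, RNA structure prediction remains a significant challenge in bioinformatics, primarily because annotated RNA structure data are scarce and vary in quality. Many methods have used this limited data to train deep learning models, but redundancy, data leakage, and poor data quality hamper their performance. In this work, we present NucleoSeeker, a tool designed to curate high-quality, tailored datasets from the Protein Data Bank (PDB). It is a unified framework that combines multiple tools and streamlines the otherwise complicated process of data curation. It offers multiple filters at the structure, sequence, and annotation levels, giving researchers full control over data curation. Further, we present several use cases. In particular, we demonstrate how NucleoSeeker allows the creation of a nonredundant RNA structure dataset to assess AlphaFold3's performance for RNA structure prediction. This demonstrates NucleoSeeker's effectiveness in curating valuable, nonredundant, tailored datasets both to train novel methods and to judge existing ones. NucleoSeeker is very easy to use, highly flexible, and can significantly increase the quality of RNA structure datasets.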
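The abstract describes filtering at the structure, sequence, and annotation levels followed by redundancy removal. As a rough illustration of that idea only, here is a minimal Python sketch. Everything in it is an assumption made for illustration: `RnaEntry`, `curate`, the field names, and the thresholds are hypothetical and are not NucleoSeeker's API or defaults, and `difflib.SequenceMatcher` is a simple stand-in for the alignment-based sequence comparison a real curation pipeline would likely use.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class RnaEntry:
    """Hypothetical record for one RNA chain (not NucleoSeeker's data model)."""
    pdb_id: str
    sequence: str
    method: str                  # experimental method, e.g. "X-RAY DIFFRACTION"
    resolution: Optional[float]  # in angstroms; None if not reported
    family: Optional[str]        # annotation label, e.g. a family identifier


def passes_structure_filter(entry: RnaEntry,
                            max_resolution: float = 3.5,
                            methods: Sequence[str] = ("X-RAY DIFFRACTION",)) -> bool:
    # Structure level: accepted method and a reported resolution under the cutoff.
    return (entry.method in methods
            and entry.resolution is not None
            and entry.resolution <= max_resolution)


def passes_sequence_filter(entry: RnaEntry, min_length: int = 10) -> bool:
    # Sequence level: long enough and made only of standard RNA nucleotides.
    return len(entry.sequence) >= min_length and set(entry.sequence) <= set("ACGU")


def passes_annotation_filter(entry: RnaEntry) -> bool:
    # Annotation level: require that some family annotation is present.
    return entry.family is not None


def similarity(a: str, b: str) -> float:
    # Stand-in similarity score in [0, 1]; assumed, not an alignment identity.
    return SequenceMatcher(None, a, b).ratio()


def curate(entries: Sequence[RnaEntry], max_similarity: float = 0.8) -> List[RnaEntry]:
    """Apply the three filters, then greedily drop near-duplicate sequences."""
    candidates = [e for e in entries
                  if passes_structure_filter(e)
                  and passes_sequence_filter(e)
                  and passes_annotation_filter(e)]
    kept: List[RnaEntry] = []
    # Visit best-resolved entries first so each cluster keeps its best structure.
    for entry in sorted(candidates, key=lambda e: e.resolution):
        if all(similarity(entry.sequence, k.sequence) <= max_similarity for k in kept):
            kept.append(entry)
    return kept
```

Calling `curate` on a list of `RnaEntry` objects returns the surviving entries ordered by resolution. The greedy pass is a deliberately simple design choice: it guarantees that no two kept sequences exceed the similarity threshold, at the cost of depending on the visiting order.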



