Evaluation of sequencing reads at scale using rdeval.

Impact factor: 5.4

Bioinformatics (Oxford, England). Publication date: 2025-09-01. DOI: 10.1093/bioinformatics/btaf416

Giulio Formenti, Bonhwang Koo, Marco Sollitto, Jennifer Balacco, Nadolina Brajuka, Richard Burhans, Erick Duarte, Alice M Giani, Kirsty McCaffrey, Jack A Medico, Eugene W Myers, Patrik Smeds, Anton Nekrutenko, Erich D Jarvis



Abstract

Motivation: Large sequencing datasets are being produced and deposited into public archives at unprecedented rates. The availability of tools that can reliably and efficiently generate and store sequencing read summary statistics has become critical.

Results: As part of the effort by the Vertebrate Genomes Project (VGP) to generate high-quality reference genomes at scale, we sought to address the community's need for efficient sequence data evaluation by developing rdeval, a standalone tool to quickly compute and interactively display sequencing read metrics. Rdeval can either run on the fly or store key sequence data metrics in tiny read 'snapshot' files. Statistics can then be efficiently recalled from snapshots for additional processing. Rdeval can convert fa*[.gz] files to and from other popular formats including BAM and CRAM for better compression. Overall, while CRAM achieves the best compression, the gain compared to BAM is marginal, and BAM achieves the best compromise between data compression and access speed. Rdeval also generates a detailed visual report with multiple data analytics that can be exported in various formats. We showcase rdeval's functionalities using long-read data from different sequencing platforms and species, including human. For PacBio long-read sequencing, our analysis shows dramatic improvements in both read length and quality over time, as well as the benefit of increased coverage for genome assembly, though the magnitude varies by taxa.

Availability and implementation: Rdeval is implemented in C++ for data processing and in R for data visualization. Precompiled releases (Linux, macOS, Windows) and commented source code for rdeval are available under the MIT license at https://github.com/vgl-hub/rdeval. Documentation is available on ReadTheDocs (https://rdeval-documentation.readthedocs.io). Rdeval is also available in Bioconda and in Galaxy (https://usegalaxy.org). An automated test workflow ensures the consistency of software updates.

Supplementary information: Supplementary data are available at Bioinformatics online.
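To make the notion of sequencing read summary statistics concrete, the following Python sketch computes a few common metrics (read count, total bases, mean length, N50, GC fraction, mean base quality) from FASTQ text. It is an illustration only, not rdeval's implementation (rdeval is written in C++), and it does not use rdeval's command line or snapshot format. All function and class names here are hypothetical. It also assumes 4-line FASTQ records with Phred+33 quality encoding, and it reports the plain arithmetic mean of Phred scores, which may differ from how a given tool averages quality.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence, quality) pairs from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        seq = next(it, None)
        sep = next(it, None)  # '+' separator line, content ignored
        qual = next(it, None)
        if seq is None or sep is None or qual is None:
            raise ValueError("truncated FASTQ record")
        seq, qual = seq.strip(), qual.strip()
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield seq, qual


def n50(lengths: List[int]) -> int:
    """Length L such that reads of length >= L hold at least half the bases."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0  # no reads


@dataclass
class ReadSummary:
    read_count: int
    total_bases: int
    mean_length: float
    n50: int
    gc_fraction: float
    mean_quality: float  # arithmetic mean of per-base Phred scores


def summarize(lines: Iterable[str]) -> ReadSummary:
    """Compute summary statistics in a single pass over FASTQ lines."""
    lengths: List[int] = []
    gc_bases = 0
    quality_sum = 0
    for seq, qual in parse_fastq(lines):
        lengths.append(len(seq))
        upper = seq.upper()
        gc_bases += upper.count("G") + upper.count("C")
        # Assumption: Phred+33 encoding (ASCII code minus 33).
        quality_sum += sum(ord(c) - 33 for c in qual)
    total = sum(lengths)
    return ReadSummary(
        read_count=len(lengths),
        total_bases=total,
        mean_length=total / len(lengths) if lengths else 0.0,
        n50=n50(lengths),
        gc_fraction=gc_bases / total if total else 0.0,
        mean_quality=quality_sum / total if total else 0.0,
    )
```

For a gzip-compressed file, the same function can be fed a text-mode handle, for example `with gzip.open(path, "rt") as fh: summarize(fh)`, where `path` is a placeholder for a local FASTQ file. Storing only the resulting small summary, rather than rereading the full file, is the general idea behind keeping compact per-dataset statistics, although the actual contents of rdeval's snapshot files are not described here.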

