Analysis-ready VCF at Biobank scale using Zarr.

IF 11.8 Zone 2 (Biology) Q1 MULTIDISCIPLINARY SCIENCES

GigaScience Pub Date : 2025-01-06 DOI:10.1093/gigascience/giaf049

Eric Czech, Will Tyler, Tom White, Ben Jeffery, Timothy R Millar, Benjamin Elsworth, Jérémy Guez, Jonny Hancox, Konrad J Karczewski, Alistair Miles, Sam Tallman, Per Unneberg, Rafal Wojdyla, Shadi Zabad, Jeff Hammerbacher, Jerome Kelleher

Citations: 0

Abstract

Analysis-ready VCF at Biobank scale using Zarr.

Background: Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasizes efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. The Biobank-scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed.
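The access-pattern problem described above can be illustrated with a toy sketch. It assumes a simplified, hypothetical tab-separated record (not real VCF parsing) and compares how many bytes must be scanned to read a single field from a row-wise layout versus a column-wise one. The names `positions_row_wise` and `to_columns` are illustrative only.

```python
# Toy comparison of row-wise vs column-wise layouts (illustrative only;
# the record layout below is a simplified stand-in for real VCF).

rows = [
    # CHROM, POS, REF, ALT, then one genotype string per sample
    "1\t100\tA\tG\t0/0\t0/1\t1/1",
    "1\t250\tC\tT\t0/1\t0/0\t0/0",
    "1\t900\tG\tA\t1/1\t1/1\t0/1",
]


def positions_row_wise(lines):
    """Read POS by scanning every row in full; returns (values, bytes scanned)."""
    scanned = 0
    out = []
    for line in lines:
        scanned += len(line.encode())  # the whole record is read to reach one field
        out.append(int(line.split("\t")[1]))
    return out, scanned


def to_columns(lines):
    """Transpose the rows into one list per field (a column-wise layout)."""
    fields = [line.split("\t") for line in lines]
    return {
        "pos": [int(f[1]) for f in fields],
        "genotypes": [f[4:] for f in fields],
    }


row_pos, row_bytes = positions_row_wise(rows)
columns = to_columns(rows)
col_pos = columns["pos"]
# In the column-wise layout only the POS values themselves need to be read.
col_bytes = sum(len(str(p)) for p in col_pos)
```

Both layouts yield the same positions, but the row-wise scan touches every genotype on the way. With hundreds of thousands of samples per record, that overhead dominates.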

Results: Zarr is a format for storing multidimensional data that is widely used across the sciences, and is ideally suited to massively parallel processing. We present the VCF Zarr specification, an encoding of the VCF data model using Zarr, along with fundamental software infrastructure for efficient and reliable conversion at scale. We show how this format is far more efficient than standard VCF-based approaches, and competitive with specialized methods for storing genotype data in terms of compression ratios and single-threaded calculation performance. We present case studies on subsets of 3 large human datasets (Genomics England: $n$=78,195; Our Future Health: $n$=651,050; All of Us: $n$=245,394) along with whole genome datasets for Norway Spruce ($n$=1,063) and SARS-CoV-2 ($n$=4,484,157). We demonstrate the potential for VCF Zarr to enable a new generation of high-performance and cost-effective applications via illustrative examples using cloud computing and GPUs.
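Why a chunked array layout suits parallel processing can be sketched in plain Python. This is a minimal stand-in under stated assumptions: it does not use the Zarr library or the VCF Zarr specification. The toy genotype matrix, the chunk size, and the function names are all hypothetical. The point is only that each chunk of variants can be processed independently and the results concatenated.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy genotype matrix: one row per variant, one entry per sample,
# each entry the number of alternate alleles (0, 1 or 2) carried by that sample.
genotypes = [
    [0, 1, 2, 0],
    [1, 0, 0, 0],
    [2, 2, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 1],
]


def split_into_chunks(matrix, variants_per_chunk):
    """Cut the matrix into independent blocks of consecutive variants."""
    return [
        matrix[start:start + variants_per_chunk]
        for start in range(0, len(matrix), variants_per_chunk)
    ]


def alt_allele_counts(chunk):
    """Sum alternate alleles per variant, touching only this chunk."""
    return [sum(row) for row in chunk]


def chunked_alt_allele_counts(matrix, variants_per_chunk=2, workers=2):
    """Process chunks concurrently and stitch the per-chunk results together."""
    chunks = split_into_chunks(matrix, variants_per_chunk)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves chunk order, so results concatenate back correctly.
        per_chunk = list(pool.map(alt_allele_counts, chunks))
    return [count for chunk_counts in per_chunk for count in chunk_counts]


counts = chunked_alt_allele_counts(genotypes)
```

Because no chunk depends on another, the same pattern extends in principle from threads to separate processes or machines reading chunks from shared storage.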

Conclusions: Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely used, open-source technologies, has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores, while maintaining compatibility with existing file-oriented workflows.

Source journal

GigaScience (MULTIDISCIPLINARY SCIENCES)

CiteScore: 15.50

Self-citation rate: 1.10%

Articles published: 119

Review time: 1 week

Journal description: GigaScience seeks to transform data dissemination and utilization in the life and biomedical sciences. As an online open-access open-data journal, it specializes in publishing "big-data" studies encompassing various fields. Its scope includes not only "omic" type data and the fields of high-throughput biology currently serviced by large public repositories, but also the growing range of more difficult-to-access data, such as imaging, neuroscience, ecology, cohort data, systems biology and other new types of large-scale shareable data.