Eric Czech, Will Tyler, Tom White, Ben Jeffery, Timothy R Millar, Benjamin Elsworth, Jérémy Guez, Jonny Hancox, Konrad J Karczewski, Alistair Miles, Sam Tallman, Per Unneberg, Rafal Wojdyla, Shadi Zabad, Jeff Hammerbacher, Jerome Kelleher
{"title":"Analysis-ready VCF at Biobank scale using Zarr.","authors":"Eric Czech, Will Tyler, Tom White, Ben Jeffery, Timothy R Millar, Benjamin Elsworth, Jérémy Guez, Jonny Hancox, Konrad J Karczewski, Alistair Miles, Sam Tallman, Per Unneberg, Rafal Wojdyla, Shadi Zabad, Jeff Hammerbacher, Jerome Kelleher","doi":"10.1093/gigascience/giaf049","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasizes efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. The Biobank-scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed.</p><p><strong>Results: </strong>Zarr is a format for storing multidimensional data that is widely used across the sciences, and is ideally suited to massively parallel processing. We present the VCF Zarr specification, an encoding of the VCF data model using Zarr, along with fundamental software infrastructure for efficient and reliable conversion at scale. We show how this format is far more efficient than standard VCF-based approaches, and competitive with specialized methods for storing genotype data in terms of compression ratios and single-threaded calculation performance. We present case studies on subsets of 3 large human datasets (Genomics England: $n$=78,195; Our Future Health: $n$=651,050; All of Us: $n$=245,394) along with whole genome datasets for Norway Spruce ($n$=1,063) and SARS-CoV-2 ($n$=4,484,157). We demonstrate the potential for VCF Zarr to enable a new generation of high-performance and cost-effective applications via illustrative examples using cloud computing and GPUs.</p><p><strong>Conclusions: </strong>Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely used, open-source technologies, has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores, while maintaining compatibility with existing file-oriented workflows.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"14 ","pages":""},"PeriodicalIF":11.8000,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12127038/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"GigaScience","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/gigascience/giaf049","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}
引用次数: 0
Abstract
Background: Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasizes efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. The Biobank-scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed.
Results: Zarr is a format for storing multidimensional data that is widely used across the sciences, and is ideally suited to massively parallel processing. We present the VCF Zarr specification, an encoding of the VCF data model using Zarr, along with fundamental software infrastructure for efficient and reliable conversion at scale. We show how this format is far more efficient than standard VCF-based approaches, and competitive with specialized methods for storing genotype data in terms of compression ratios and single-threaded calculation performance. We present case studies on subsets of 3 large human datasets (Genomics England: $n$=78,195; Our Future Health: $n$=651,050; All of Us: $n$=245,394) along with whole genome datasets for Norway Spruce ($n$=1,063) and SARS-CoV-2 ($n$=4,484,157). We demonstrate the potential for VCF Zarr to enable a new generation of high-performance and cost-effective applications via illustrative examples using cloud computing and GPUs.
Conclusions: Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely used, open-source technologies, has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores, while maintaining compatibility with existing file-oriented workflows.
期刊介绍:
GigaScience seeks to transform data dissemination and utilization in the life and biomedical sciences. As an online open-access open-data journal, it specializes in publishing "big-data" studies encompassing various fields. Its scope includes not only "omic" type data and the fields of high-throughput biology currently serviced by large public repositories, but also the growing range of more difficult-to-access data, such as imaging, neuroscience, ecology, cohort data, systems biology and other new types of large-scale shareable data.