The scalable variant call representation: enabling genetic analysis beyond one million genomes.

Bioinformatics (Oxford, England). Pub Date: 2024-12-26. DOI: 10.1093/bioinformatics/btae746

Timothy Poterba, Christopher Vittal, Daniel King, Daniel Goldstein, Jacqueline I Goldstein, Patrick Schultz, Konrad J Karczewski, Cotton Seed, Benjamin M Neale

Abstract

Motivation: The Variant Call Format (VCF) is widely used in genome sequencing but scales poorly. For instance, we estimate a 150 000 genome VCF would occupy 900 TiB, making it costly and complicated to produce, analyze, and store. The issue stems from VCF's requirement to densely represent both reference-genotypes and allele-indexed arrays. These requirements lead to unnecessary data duplication and, ultimately, very large files.

Results: To address these challenges, we introduce the Scalable Variant Call Representation (SVCR). This representation reduces file sizes by ensuring they scale linearly with samples. SVCR's linear scaling relies on two techniques, both necessary for linearity: local allele indices and reference blocks, which were first introduced by the Genomic Variant Call Format. SVCR is also lossless and mergeable, allowing for N + 1 and N + K incremental joint-calling. We present two implementations of SVCR: SVCR-VCF, which encodes SVCR in VCF format, and VDS, which uses Hail's native format. Our experiments confirm the linear scalability of SVCR-VCF and VDS, in contrast to the super-linear growth seen with standard VCF files. We also discuss the VDS Combiner, a scalable, open-source tool for producing a VDS from GVCFs and unique features of VDS which enable rapid data analysis. SVCR, and VDS in particular, ensure the scientific community can generate, analyze, and disseminate genetics datasets with millions of samples.
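The two techniques named in the Results can be illustrated with a toy sketch. This is not the SVCR-VCF or VDS implementation: the function names, the tuple-based data layout, and the choice to always keep the reference allele at local index 0 are assumptions made for illustration only. The sketch shows (1) re-indexing a sample's genotype against only the alleles that sample carries, so per-sample genotype-indexed arrays stay small as the number of alleles at a site grows, and (2) collapsing runs of homozygous-reference calls into single intervals.

```python
from typing import List, Sequence, Tuple


def n_diploid_genotypes(n_alleles: int) -> int:
    """Number of unordered diploid genotypes over n_alleles alleles."""
    return n_alleles * (n_alleles + 1) // 2


def local_alleles(genotype: Sequence[int]) -> Tuple[List[int], Tuple[int, ...]]:
    """Re-index one sample's genotype against only the alleles it carries.

    Returns (la, lgt): la maps local index -> global allele index, and lgt is
    the genotype rewritten in local indices. Always keeping the reference
    allele (global index 0) in la is an assumption of this sketch.
    """
    la = sorted(set(genotype) | {0})
    lgt = tuple(la.index(a) for a in genotype)
    return la, lgt


def reference_blocks(calls: Sequence[Tuple[int, bool]]) -> List[Tuple[int, int]]:
    """Collapse runs of adjacent hom-ref positions into inclusive (start, end) blocks.

    calls is a position-sorted list of (position, is_hom_ref) for one sample.
    """
    blocks: List[Tuple[int, int]] = []
    start = prev = None
    for pos, is_ref in calls:
        if is_ref and start is not None and pos == prev + 1:
            prev = pos  # extend the current block
        else:
            if start is not None:
                blocks.append((start, prev))  # close the previous block
            start, prev = (pos, pos) if is_ref else (None, None)
    if start is not None:
        blocks.append((start, prev))
    return blocks


if __name__ == "__main__":
    # Toy site with 8 alleles in total (1 reference + 7 alternates), 4 samples.
    genotypes = [(0, 0), (0, 7), (3, 7), (0, 1)]

    # Dense layout: every sample stores one value per genotype over all 8 alleles.
    dense = len(genotypes) * n_diploid_genotypes(8)

    # Local layout: each sample stores values only over its own alleles.
    local = sum(n_diploid_genotypes(len(local_alleles(g)[0])) for g in genotypes)

    print("dense entries:", dense)
    print("local entries:", local)
    print("local view of (3, 7):", local_alleles((3, 7)))

    calls = [(100, True), (101, True), (102, True),
             (103, False), (104, True), (105, True)]
    print("reference blocks:", reference_blocks(calls))
```

In the dense layout the per-sample array length depends on the total number of alleles at the site, which tends to grow as samples are added; in the local layout it depends only on the alleles that one sample carries. That difference is the intuition behind the linear scaling claimed above.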

Availability and implementation: https://github.com/hail-is/hail/.

Supplementary information: Supplementary data are available at Bioinformatics online.
