IGD: a simple, efficient genotype data format.


Bioinformatics Advances. Pub Date: 2025-08-26. eCollection Date: 2025-01-01. DOI: 10.1093/bioadv/vbaf205

Drew DeHaas, Xinzhu Wei


Abstract

Motivation: While there are a variety of file formats for storing reference-sequence-aligned genotype data, many are complex or inefficient, and programming language support for them is often limited. A file format that is simple to understand and implement, yet fast and small, is helpful for research on highly scalable statistical and population genetics methods.

Results: We present the Indexable Genotype Data (IGD) file format, a simple uncompressed binary format that can be more than 100× faster and 3.5× smaller than vcf.gz on biobank-scale whole-genome sequence data. The implementation for reading and writing IGD in Python is under 350 lines of code, which reflects the simplicity of the format.

Availability and implementation: A C++ library for reading and writing IGD, and tooling to convert .vcf.gz files, can be found at https://github.com/aprilweilab/picovcf. A Python library is at https://github.com/aprilweilab/pyigd.
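
To illustrate why an uncompressed, indexable binary layout can be read quickly, here is a minimal sketch of a toy format in Python. It is not the IGD layout, which the abstract does not specify. The magic value `TOY1`, the header and index fields, and the functions `write_toy` and `read_variant` are all hypothetical names chosen for this example. Each variant is assumed to be stored as a sparse list of sample indices that carry the alternate allele.

```python
import io
import struct

# Hypothetical toy layout (NOT the real IGD specification):
#   header : magic (4 bytes), num_variants (u64), num_samples (u64)
#   index  : one (genomic position u64, record byte offset u64) per variant
#   records: count (u32) followed by `count` sample indices (u32 each)
HEADER = struct.Struct("<4sQQ")
INDEX_ENTRY = struct.Struct("<QQ")


def write_toy(variants, num_samples):
    """Serialize [(position, [carrier sample indices]), ...] to bytes."""
    buf = io.BytesIO()
    buf.write(HEADER.pack(b"TOY1", len(variants), num_samples))
    index_start = buf.tell()
    # Reserve space for the index; it is filled in once offsets are known.
    buf.write(b"\x00" * (INDEX_ENTRY.size * len(variants)))
    entries = []
    for position, carriers in variants:
        entries.append((position, buf.tell()))
        buf.write(struct.pack("<I", len(carriers)))
        buf.write(struct.pack(f"<{len(carriers)}I", *carriers))
    buf.seek(index_start)
    for position, offset in entries:
        buf.write(INDEX_ENTRY.pack(position, offset))
    return buf.getvalue()


def read_variant(data, i):
    """Return (position, carriers) for variant i with two seeks and no scan."""
    f = io.BytesIO(data)
    magic, num_variants, _ = HEADER.unpack(f.read(HEADER.size))
    if magic != b"TOY1" or not 0 <= i < num_variants:
        raise ValueError("bad file or variant index")
    # Index entries are fixed width, so entry i sits at a computable offset.
    f.seek(HEADER.size + i * INDEX_ENTRY.size)
    position, offset = INDEX_ENTRY.unpack(f.read(INDEX_ENTRY.size))
    f.seek(offset)
    (count,) = struct.unpack("<I", f.read(4))
    carriers = list(struct.unpack(f"<{count}I", f.read(4 * count)))
    return position, carriers


data = write_toy([(100, [0, 3]), (250, []), (900, [1, 2, 5])], num_samples=6)
print(read_variant(data, 2))
```

Because the index entries have a fixed width, any variant can be reached without decoding the ones before it. A gzip-compressed text file generally has to be decompressed and parsed sequentially unless a separate index is built.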
