IGD: a simple, efficient genotype data format.

Bioinformatics Advances (IF 2.8; JCR Q2, Mathematical & Computational Biology)
Publication date: 2025-08-26; eCollection date: 2025-01-01; DOI: 10.1093/bioadv/vbaf205
Drew DeHaas, Xinzhu Wei
Volume 5, issue 1, article vbaf205. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12448908/pdf/

Abstract

Motivation: Many file formats exist for storing reference-sequence-aligned genotype data, but many of them are complex or inefficient, and programming language support for them is often limited. A file format that is simple to understand and implement, yet fast and small, is helpful for research on highly scalable statistical and population genetics methods.

Results: We present the Indexable Genotype Data (IGD) file format, a simple uncompressed binary format that can be more than 100× faster and 3.5× smaller than vcf.gz on biobank-scale whole-genome sequence data. The implementation for reading and writing IGD in Python is under 350 lines of code, which reflects the simplicity of the format.

Availability and implementation: A C++ library for reading and writing IGD, and tooling to convert .vcf.gz files, can be found at https://github.com/aprilweilab/picovcf. A Python library is at https://github.com/aprilweilab/pyigd.
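
As an illustration of why an uncompressed, indexable binary layout can be read much faster than a gzip-compressed text stream, the following is a minimal sketch of a toy format. It is not the IGD specification: the magic bytes, header fields, index layout, and the function names `write_toy` and `read_variant` are all hypothetical, chosen only to show the general idea of seeking directly to one variant's record through a fixed-width index.

```python
import io
import struct

# Hypothetical toy layout (NOT the real IGD layout):
#   header : magic (4 bytes), num_variants (uint32), num_samples (uint32)
#   index  : one fixed-width entry per variant: position (uint64), byte offset (uint64)
#   data   : per variant, a count (uint32) followed by that many sample indices (uint32)
MAGIC = b"TOYG"
HEADER = struct.Struct("<4sII")
INDEX_ENTRY = struct.Struct("<QQ")


def write_toy(f, num_samples, variants):
    """Write variants given as (position, list of sample indices carrying the alt allele)."""
    f.write(HEADER.pack(MAGIC, len(variants), num_samples))
    # The data section starts right after the header and the fixed-width index.
    offset = HEADER.size + INDEX_ENTRY.size * len(variants)
    records = []
    for position, carriers in variants:
        record = struct.pack(f"<I{len(carriers)}I", len(carriers), *carriers)
        f.write(INDEX_ENTRY.pack(position, offset))
        records.append(record)
        offset += len(record)
    for record in records:
        f.write(record)


def read_variant(f, i):
    """Read only the i-th variant by seeking through the index, without scanning the file."""
    f.seek(0)
    magic, num_variants, _num_samples = HEADER.unpack(f.read(HEADER.size))
    if magic != MAGIC:
        raise ValueError("not a toy genotype file")
    if not 0 <= i < num_variants:
        raise IndexError(i)
    # Index entries are fixed width, so entry i sits at a computable offset.
    f.seek(HEADER.size + i * INDEX_ENTRY.size)
    position, offset = INDEX_ENTRY.unpack(f.read(INDEX_ENTRY.size))
    f.seek(offset)
    (count,) = struct.unpack("<I", f.read(4))
    carriers = list(struct.unpack(f"<{count}I", f.read(4 * count)))
    return position, carriers


if __name__ == "__main__":
    buf = io.BytesIO()
    write_toy(buf, num_samples=4, variants=[(100, [0, 3]), (250, []), (900, [1, 2, 3])])
    print(read_variant(buf, 2))
```

In this sketch, reading one variant costs three seeks and three small reads regardless of file size, whereas a gzip-compressed text file generally has to be decompressed and parsed from the start (or from a block boundary) to reach the same record. The actual IGD layout is defined by the picovcf and pyigd repositories listed above.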

