vcfsim: flexible simulation of all-sites VCFs with missing data.

Paimon Goulart, Kieran Samuk
Journal: bioRxiv : the preprint server for biology
Publication date: 2025-09-28
DOI: 10.1101/2025.01.29.635540 (https://doi.org/10.1101/2025.01.29.635540)
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12486129/pdf/

Abstract

Background: VCFs are the most widely used data format for encoding genetic variation. By design, standard VCFs omit sites where all individuals are homozygous for the reference allele ("invariant sites") and therefore cannot distinguish these from sites where data are completely missing. However, missing data are a key feature of biological datasets across all domains of genomics, and many recent studies have shown that missing data can bias the estimation of key population genetic parameters. One solution is to include invariant sites in a standard VCF, creating an "all-sites VCF" that represents missing and invariant sites explicitly. One hurdle to the wider adoption of all-sites VCFs is the lack of a reliable, parameterized simulation framework for generating biologically realistic all-sites VCFs.
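The distinction between invariant and missing sites can be made concrete with a minimal Python sketch. It is an illustration only and is not part of vcfsim: the toy records, the helper names `classify_site` and `count_sites`, and the whitespace-separated layout (real VCFs are tab-separated) are all assumptions made for this example.

```python
# Toy all-sites VCF body. Columns follow the standard VCF order:
# CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1 sample2
ALL_SITES = """\
chr1 1 . A . . PASS . GT 0/0 0/0
chr1 2 . C T . PASS . GT 0/1 0/0
chr1 3 . G . . PASS . GT ./. ./.
chr1 4 . T . . PASS . GT 0/0 ./.
"""


def classify_site(line):
    """Return 'missing', 'invariant', or 'variant' for one VCF record."""
    genotypes = line.split()[9:]
    alleles = []
    for gt in genotypes:
        # Treat phased (|) and unphased (/) separators alike; drop missing calls.
        alleles.extend(a for a in gt.replace("|", "/").split("/") if a != ".")
    if not alleles:
        return "missing"    # no sample has a called allele
    if any(a != "0" for a in alleles):
        return "variant"    # at least one non-reference allele was called
    return "invariant"      # every called allele is the reference


def count_sites(text):
    """Tally site classes over all non-header records."""
    counts = {"missing": 0, "invariant": 0, "variant": 0}
    for line in text.splitlines():
        if line and not line.startswith("#"):
            counts[classify_site(line)] += 1
    return counts


counts = count_sites(ALL_SITES)
# Sites with at least one called genotype: the correct denominator
# for per-site statistics.
callable_sites = counts["invariant"] + counts["variant"]
print(counts, callable_sites)
```

A variants-only VCF would retain just the record at position 2, so positions 1, 3, and 4 would look identical to a downstream tool. With the all-sites records, the fully missing position can be excluded from the denominator of per-site statistics.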

Results: Here, we introduce an open-source command line tool, vcfsim, that interfaces with the popular coalescent simulation platform msprime and provides convenience functions for simulating all-sites VCFs with variable levels of ploidy and missing data. We show that the post-processed VCFs generated using vcfsim align precisely with population genetic expectations (i.e., are statistically identical to raw msprime output), accurately introduce missing data, and permit the simulation of data with varying ploidy levels, including intraindividual ploidy variation (e.g., heterogametic sex chromosomes), as well as population structure.
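The kind of post-processing described here can be sketched in plain Python. This is a hedged illustration of the general idea, not vcfsim's implementation or interface: the function `to_all_sites`, its parameters, and the input format (a dict of variant genotypes standing in for simulator output) are hypothetical. It fills in invariant positions, supports a different ploidy for each sample, and masks genotypes as missing at a chosen rate.

```python
import random


def to_all_sites(variant_genotypes, sequence_length, ploidies,
                 missing_rate, seed):
    """Expand variant-only genotypes into one record per site.

    variant_genotypes: dict mapping a 1-based position to a list of
        per-sample allele tuples, e.g. {3: [(0, 1), (1,)]}.
    ploidies: ploidy of each sample, e.g. [2, 1] for a diploid and a
        haploid sample (as on a heterogametic sex chromosome).
    missing_rate: probability that any one genotype is masked.
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    records = []
    for pos in range(1, sequence_length + 1):
        # Positions the simulator did not report are invariant:
        # every sample carries only the reference allele (0).
        calls = variant_genotypes.get(pos, [(0,) * p for p in ploidies])
        fields = []
        for alleles in calls:
            if rng.random() < missing_rate:
                # Mask the whole genotype, keeping the sample's ploidy.
                fields.append("/".join("." for _ in alleles))
            else:
                fields.append("/".join(str(a) for a in alleles))
        records.append((pos, fields))
    return records


# One variant at position 3 in a 5-bp region; sample 2 is haploid.
variants = {3: [(0, 1), (1,)]}
complete = to_all_sites(variants, 5, [2, 1], missing_rate=0.0, seed=1)
masked = to_all_sites(variants, 5, [2, 1], missing_rate=0.5, seed=1)
for pos, fields in masked:
    print(pos, *fields)
```

Masking after simulation leaves the underlying genealogy untouched, so any difference between statistics computed on `complete` and on `masked` is attributable to missing data alone.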

Conclusions: Our results suggest that vcfsim is a useful and easy-to-use tool for benchmarking new software tools, performing population genetic inference, training machine learning models, and exploring the effects of missing data in genomic data sets.
