{"title":"Correcting for Bias in Estimates of θ_w and Tajima's D From Missing Data in Next-Generation Sequencing","authors":"Nick Bailey, Laurie Stevison, Kieran Samuk","doi":"10.1111/1755-0998.14104","DOIUrl":"10.1111/1755-0998.14104","url":null,"abstract":"<p>Population genetic analyses use information from the site frequency spectrum to infer evolutionary processes. Two summary statistics, Watterson's estimator (<math><semantics><mrow><msub><mi>θ</mi><mi>w</mi></msub></mrow></semantics></math>) of genetic diversity, and Tajima's <math><semantics><mrow><mi>D</mi></mrow></semantics></math>, used for detecting non-neutral evolution, are among the most frequently computed statistics utilising this information. However, missing information in genomic data, particularly as encoded in the Variant Call Format (VCF), can bias these estimates, leading to incorrect evolutionary inferences. We assessed the impact of missing data on the estimation of these statistics using various population genetic software packages (<span>VCFtools</span>, <span>PopGenome</span>, <span>pegas</span> and <span>scikit-allel</span>). By simulating neutral genomic data with varying levels of missing genotypes and sites, we found consistent underestimation of <math><semantics><mrow><msub><mi>θ</mi><mi>w</mi></msub></mrow></semantics></math> across programs. We found a consequent bias in estimates of Tajima's <math><semantics><mrow><mi>D</mi></mrow></semantics></math>, though the direction varied by software. We developed and implemented correction methods as functions in an update of the popular <span>pixy</span> software, significantly reducing these biases. Our findings highlight the need for accurate data handling in population genomics to avoid misinterpretations of evolutionary phenomena.</p>","PeriodicalId":211,"journal":{"name":"Molecular Ecology Resources","volume":"25 6","pages":""},"PeriodicalIF":5.5,"publicationDate":"2025-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/1755-0998.14104","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143690398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}