PISAD: reference-free intraspecies sample anomalies detection tool based on k-mer counting
Authors: Zhantian Xu, Fan Nie, Jianxin Wang
Journal: GigaScience, volume 14
Publication date: 2025-01-06
DOI: https://doi.org/10.1093/gigascience/giaf061
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12202988/pdf/
Abstract
Background: Genomic sequencing research often requires the simultaneous analysis of heterogeneous data types across single or multiple individuals, introducing a substantial risk of sample swaps (e.g., labeling errors). Existing methods rely primarily on reference information: they require the preselection of informative variant sites with a population allele frequency around 0.5, and such information may be insufficient or unavailable for nonmodel organisms. As research expands to encompass a growing number of new species, a robust quality control tool will become increasingly important.
Findings: We developed PISAD (Phased Intraspecies Sample Anomalies Detection), a tool for validating sample identities in whole-genome sequencing (WGS) data without requiring reference information. It uses a 2-stage approach: first, it performs rapid, reference-free single nucleotide polymorphism (SNP) calling on low-error-rate data from the target individual to create a variant sketch; then, it assesses the concordance of other samples on this sketch to verify relationships. We assessed the performance and efficiency of PISAD on Homo sapiens, Bos taurus, Gallus gallus, Arctia plantaginis, and Pyrus species.
Conclusions: Our evaluation showed that PISAD has a lower data coverage requirement (0.5×) than the reference-based tool ntsm and is broadly applicable to multiple diploid species.
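Illustrative sketch (not part of the original abstract): the 2-stage idea described in the Findings can be pictured with a toy k-mer example. The code below is a minimal sketch under stated assumptions, not PISAD's actual implementation: the function names (`count_kmers`, `build_variant_sketch`, `concordance`), the `min_count` threshold, the "k-mers differing only at the middle base" pairing rule, and the concordance score are all hypothetical simplifications chosen for illustration. It also ignores reverse complements and sequencing errors.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer in a list of reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_variant_sketch(counts, k, min_count=2):
    """Toy stage 1: reference-free candidate SNP detection.

    Two k-mers that share both flanks and differ only at the middle base
    are treated as the two alleles of a heterozygous SNP. This pairing
    rule is an assumption made for illustration.
    """
    mid = k // 2
    groups = {}
    for kmer, count in counts.items():
        if count >= min_count:  # drop low-count k-mers (likely errors)
            flanks = (kmer[:mid], kmer[mid + 1:])
            groups.setdefault(flanks, []).append(kmer)
    # Keep only sites with exactly two alleles, as expected in a diploid.
    return sorted(tuple(sorted(g)) for g in groups.values() if len(g) == 2)


def concordance(sketch, sample_counts):
    """Toy stage 2: fraction of sketch allele k-mers seen in another sample."""
    alleles = [kmer for pair in sketch for kmer in pair]
    if not alleles:
        return 0.0
    seen = sum(1 for kmer in alleles if sample_counts[kmer] > 0)
    return seen / len(alleles)


if __name__ == "__main__":
    k = 5
    # Two haplotypes of a toy target individual, differing at one base.
    target_reads = ["GATTACAGGC", "GATTGCAGGC"] * 2
    sketch = build_variant_sketch(count_kmers(target_reads, k), k)
    print(sketch)  # [('TTACA', 'TTGCA')]

    same_individual = count_kmers(["GATTACAGGC", "GATTGCAGGC"], k)
    other_individual = count_kmers(["GATTACAGGC"], k)  # homozygous here
    print(concordance(sketch, same_individual))   # 1.0
    print(concordance(sketch, other_individual))  # 0.5
```

In this toy setting a sample from the same individual recovers both alleles at the sketched site, while a sample that is homozygous at that site recovers only one, so the score drops. A real tool has to make this decision statistically across many sites, particularly at coverage as low as the 0.5× reported above, where most individual sites are observed at most once.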
About the journal:
GigaScience seeks to transform data dissemination and utilization in the life and biomedical sciences. As an online open-access open-data journal, it specializes in publishing "big-data" studies encompassing various fields. Its scope includes not only "omic" type data and the fields of high-throughput biology currently serviced by large public repositories, but also the growing range of more difficult-to-access data, such as imaging, neuroscience, ecology, cohort data, systems biology and other new types of large-scale shareable data.