Exploring SNP filtering strategies: the influence of strict vs soft core
Mona L Taouk, Leo A Featherstone, George Taiaroa, Torsten Seemann, Danielle J Ingle, Timothy P Stinear, Ryan R Wick
Microbial Genomics 11(1), published 2025-01-01. DOI: 10.1099/mgen.0.001346 (https://doi.org/10.1099/mgen.0.001346). Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11734701/pdf/
Abstract
Phylogenetic analyses are crucial for understanding microbial evolution and infectious disease transmission. Bacterial phylogenies are often inferred from SNP alignments, with SNPs as the fundamental signal within these data. SNP alignments can be reduced to a 'strict core' by removing those sites that do not have data present in every sample. However, as sample size and genome diversity increase, a strict core can shrink markedly, discarding potentially informative data. Here, we propose and provide evidence to support the use of a 'soft core' that tolerates some missing data, preserving more information for phylogenetic analysis. Using large datasets of Neisseria gonorrhoeae and Salmonella enterica serovar Typhi, we assess different core thresholds. Our results show that strict cores can drastically reduce informative sites compared to soft cores. In a 10 000-genome alignment of Salmonella enterica serovar Typhi, a 95% soft core yielded ten times more informative sites than a 100% strict core. Similar patterns were observed in N. gonorrhoeae. We further evaluated the accuracy of phylogenies built from strict- and soft-core alignments using datasets with strong temporal signals. Soft-core alignments generally outperformed strict cores in producing trees displaying clock-like behaviour; for instance, the N. gonorrhoeae 95% soft-core phylogeny had a root-to-tip regression R² of 0.50 compared to 0.21 for the strict-core phylogeny. This study suggests that soft-core strategies are preferable for large, diverse microbial datasets. To facilitate this, we developed Core-SNP-filter (https://github.com/rrwick/Core-SNP-filter), an open-source software tool for generating soft-core alignments from whole-genome alignments based on user-defined thresholds.
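The difference between a strict core and a soft core can be illustrated with a small Python sketch. This is not the Core-SNP-filter implementation: the function names are hypothetical, and the choice to count only unambiguous A/C/G/T calls as "data present" is an assumption made for this example. A site is kept when the fraction of samples with a called base meets the threshold and when at least two different called bases are observed.

```python
from typing import Dict, List

# Assumption for this sketch: only unambiguous bases count as "data present";
# gaps, N and other ambiguity codes are treated as missing.
CALLED_BASES = frozenset("ACGT")


def core_columns(seqs: List[str], threshold: float) -> List[int]:
    """Return column indices where the fraction of called bases >= threshold."""
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    n = len(seqs)
    keep = []
    for col in range(length):
        called = sum(1 for s in seqs if s[col].upper() in CALLED_BASES)
        if called / n >= threshold:
            keep.append(col)
    return keep


def is_variant(seqs: List[str], col: int) -> bool:
    """True when at least two different called bases occur in the column."""
    bases = {s[col].upper() for s in seqs} & CALLED_BASES
    return len(bases) > 1


def filter_alignment(alignment: Dict[str, str], threshold: float) -> Dict[str, str]:
    """Keep only variant sites that pass the core threshold (1.0 = strict core)."""
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    cols = [c for c in core_columns(seqs, threshold) if is_variant(seqs, c)]
    return {name: "".join(alignment[name][c] for c in cols) for name in names}


# Toy alignment: column 3 is variant but has one gap; column 4 has one N.
alignment = {
    "s1": "ACGTA",
    "s2": "ACGTN",
    "s3": "ATG-A",
    "s4": "ATGCA",
}

strict = filter_alignment(alignment, 1.0)   # one variant site survives
soft = filter_alignment(alignment, 0.75)    # two variant sites survive
print(strict)
print(soft)
```

In this toy alignment the strict core retains a single variant site, while a 75% soft core also retains the site where one sample has a gap. This is the same effect the abstract reports at scale, where the number of samples with missing data at any given site grows with dataset size.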
About the journal:
Microbial Genomics (MGen) is a fully open access, mandatory open data and peer-reviewed journal publishing high-profile original research on archaea, bacteria, microbial eukaryotes and viruses.