Exploring SNP filtering strategies: the influence of strict vs soft core.

IF 4.0 | Zone 2 (Biology) | JCR Q1: GENETICS & HEREDITY
Mona L Taouk, Leo A Featherstone, George Taiaroa, Torsten Seemann, Danielle J Ingle, Timothy P Stinear, Ryan R Wick
Microbial Genomics, 11(1). Published 2025-01-01. DOI: 10.1099/mgen.0.001346 (https://doi.org/10.1099/mgen.0.001346). Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11734701/pdf/

Abstract

Phylogenetic analyses are crucial for understanding microbial evolution and infectious disease transmission. Bacterial phylogenies are often inferred from SNP alignments, with SNPs as the fundamental signal within these data. SNP alignments can be reduced to a 'strict core' by removing those sites that do not have data present in every sample. However, as sample size and genome diversity increase, a strict core can shrink markedly, discarding potentially informative data. Here, we propose and provide evidence to support the use of a 'soft core' that tolerates some missing data, preserving more information for phylogenetic analysis. Using large datasets of Neisseria gonorrhoeae and Salmonella enterica serovar Typhi, we assess different core thresholds. Our results show that strict cores can drastically reduce informative sites compared to soft cores. In a 10 000-genome alignment of Salmonella enterica serovar Typhi, a 95% soft core yielded ten times more informative sites than a 100% strict core. Similar patterns were observed in N. gonorrhoeae. We further evaluated the accuracy of phylogenies built from strict- and soft-core alignments using datasets with strong temporal signals. Soft-core alignments generally outperformed strict cores in producing trees displaying clock-like behaviour; for instance, the N. gonorrhoeae 95% soft-core phylogeny had a root-to-tip regression R² of 0.50 compared to 0.21 for the strict-core phylogeny. This study suggests that soft-core strategies are preferable for large, diverse microbial datasets. To facilitate this, we developed Core-SNP-filter (https://github.com/rrwick/Core-SNP-filter), an open-source software tool for generating soft-core alignments from whole-genome alignments based on user-defined thresholds.
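
The difference between a strict core and a soft core can be illustrated with a minimal Python sketch. This is not the Core-SNP-filter implementation; it is a toy illustration under stated assumptions: the alignment is held in memory as a dictionary of equal-length strings, only unambiguous A/C/G/T (either case) count as "data present", and the names `core_columns`, `filter_alignment` and `VALID_BASES` are hypothetical. A site is kept when the fraction of samples with data meets the threshold, so a threshold of 1.0 gives a strict core and 0.95 gives a 95% soft core.

```python
from typing import Dict, List

# Assumption for this sketch: only unambiguous bases count as "data present";
# gaps, N and other ambiguity codes are treated as missing.
VALID_BASES = set("ACGTacgt")


def core_columns(seqs: List[str], threshold: float) -> List[int]:
    """Return indices of alignment columns where the fraction of samples
    with an unambiguous base is at least `threshold` (0.0 to 1.0)."""
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same aligned length")
    n = len(seqs)
    kept = []
    for i in range(length):
        present = sum(1 for s in seqs if s[i] in VALID_BASES)
        if present / n >= threshold:
            kept.append(i)
    return kept


def filter_alignment(alignment: Dict[str, str], threshold: float) -> Dict[str, str]:
    """Keep only the columns that pass the core threshold."""
    names = list(alignment)
    cols = core_columns([alignment[name] for name in names], threshold)
    return {name: "".join(alignment[name][i] for i in cols) for name in names}


# Toy alignment: column 2 has data in 2 of 4 samples, column 4 in 3 of 4.
toy = {
    "s1": "ACGTA",
    "s2": "ACNTA",
    "s3": "AC-TN",
    "s4": "ATGTA",
}

strict = filter_alignment(toy, 1.0)   # strict core: data in every sample
soft = filter_alignment(toy, 0.75)    # soft core: data in >= 75% of samples

print(len(strict["s1"]), len(soft["s1"]))
```

In this toy case the strict core keeps three columns, while the 75% soft core keeps a fourth column in which one sample is missing. The sketch only filters on missing data; removing invariant sites, which a SNP alignment would also need, is left out.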

Source journal: Microbial Genomics (Medicine, Epidemiology)
CiteScore: 6.60
Self-citation rate: 2.60%
Articles published: 153
Review time: 12 weeks
About the journal: Microbial Genomics (MGen) is a fully open access, mandatory open data and peer-reviewed journal publishing high-profile original research on archaea, bacteria, microbial eukaryotes and viruses.