{"title":"Corrigendum: Genetic heterogeneity in the Salmonella Typhi Vi capsule locus: a population genomic study from Fiji.","authors":"Aneley Getahun Strobel, Andrew J Hayes, Wytamma Wirth, Mikaele Mua, Tiko Saumalua, Orisi Cabenatabua, Vika Soqo, Varanisese Rosa, Nancy Wang, Jake A Lacey, Dianna Hocking, Mary Valcanis, Adam Jenney, Benjamin P Howden, Sebastian Duchene, Kim Mulholland, Richard A Strugnell, Mark R Davies","doi":"10.1099/mgen.0.001310","DOIUrl":"10.1099/mgen.0.001310","url":null,"abstract":"","PeriodicalId":18487,"journal":{"name":"Microbial Genomics","volume":"11 2","pages":""},"PeriodicalIF":4.0,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143080390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Predicting the bacterial host range of plasmid genomes using the language model-based one-class support vector machine algorithm.","authors":"Tao Feng, Xirao Chen, Shufang Wu, Waijiao Tang, Hongwei Zhou, Zhencheng Fang","doi":"10.1099/mgen.0.001355","DOIUrl":"10.1099/mgen.0.001355","url":null,"abstract":"<p><p>The prediction of the plasmid host range is crucial for investigating the dissemination of plasmids and the transfer of resistance and virulence genes mediated by plasmids. Several machine learning-based tools have been developed to predict plasmid host ranges. These tools have been trained and tested based on the bacterial host records of plasmids in related databases. Typically, a plasmid genome in databases such as the National Center for Biotechnology Information is annotated with only one or a few bacterial hosts, which does not encompass all possible hosts. Consequently, existing methods may significantly underestimate the host ranges of mobile plasmids. In this work, we propose a novel method named HRPredict, which employs a word vector model to digitally represent the encoded proteins on plasmid genomes. Since it is difficult to confirm which host a particular plasmid definitely cannot enter, we developed a machine learning approach for predicting whether a plasmid can enter a specific bacterium as a no-negative samples learning task. Using multiple one-class support vector machine (SVM) models that do not require negative samples for training, HRPredict predicts the host range of plasmids across 45 families, 56 genera and 56 species. In the benchmark test set, we constructed reliable negative samples for each host taxonomic unit via two indirect methods, and we found that the area under the curve (AUC), F1-score, recall, precision and accuracy of most taxonomic unit prediction models exceeded 0.9. 
Among the 13 broad-host-range plasmid types, HRPredict demonstrated greater coverage than HOTSPOT and PlasmidHostFinder, thus successfully predicting the majority of hosts previously reported. Through feature importance calculation for each SVM model, we found that genes closely related to the plasmid host range are involved in functions such as bacterial adaptability, pathogenicity and survival. These findings provide significant insight into the mechanisms through which bacteria adjust to diverse environments through plasmids. The HRPredict algorithm is expected to facilitate in-depth research on the spread of broad-host-range plasmids and enable host-range predictions for novel plasmids reconstructed from microbiome sequencing data.</p>","PeriodicalId":18487,"journal":{"name":"Microbial Genomics","volume":"11 2","pages":""},"PeriodicalIF":4.0,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143391241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Whole-genome sequencing of <i>Acinetobacter baumannii</i> clinical isolates from a tertiary hospital in Terengganu, Malaysia (2011-2020), revealed the predominance of the Global Clone 2 lineage.","authors":"Nurul Saidah Din, Farahiyah Mohd Rani, Ahmed Ghazi Alattraqchi, Salwani Ismail, Nor Iza A Rahman, David W Cleary, Stuart C Clarke, Chew Chieng Yeo","doi":"10.1099/mgen.0.001345","DOIUrl":"10.1099/mgen.0.001345","url":null,"abstract":"<p><p>Carbapenem-resistant <i>Acinetobacter baumannii</i> is recognized by the World Health Organization (WHO) as one of the top priority pathogens. Despite its public health importance, genomic data of clinical isolates from Malaysia remain scarce. In this study, whole-genome sequencing was performed on 126 <i>A</i>. <i>baumannii</i> isolates collected from the main tertiary hospital in the state of Terengganu, Malaysia, over a 10-year period (2011-2020). Antimicrobial susceptibilities determined for 20 antibiotics belonging to 8 classes showed that 77.0% (<i>n</i>=97/126) of the isolates were categorized as multidrug resistant (MDR), with all MDR isolates being carbapenem resistant. Multilocus sequence typing analysis categorized the Terengganu <i>A. baumannii</i> clinical isolates into 34 Pasteur and 44 Oxford sequence types (STs), with ST2<sub>Pasteur</sub> of the Global Clone 2 lineage identified as the dominant ST (<i>n</i>=76/126; 60.3%). The ST2<sub>Pasteur</sub> isolates could be subdivided into six Oxford STs with the majority being ST195<sub>Oxford</sub> (<i>n</i>=35) and ST208<sub>Oxford</sub> (<i>n</i>=17). Various antimicrobial resistance genes were identified with the <i>bla</i> <sub>OXA-23</sub>-encoded carbapenemase being the predominant acquired carbapenemase gene (<i>n</i>=90/126; 71.4%). Plasmid-encoded <i>rep</i> genes were identified in nearly all (<i>n</i>=122/126; 96.8%) of the isolates with the majority being Rep_3 family (<i>n</i>=121). 
Various virulence factors were identified, highlighting the pathogenic nature of this bacterium. Only 14/126 (11.1%) of the isolates were positive for the carriage of CRISPR-Cas arrays with none of the prevalent ST2<sub>Pasteur</sub> isolates harbouring them. This study provided a genomic snapshot of the <i>A. baumannii</i> isolates obtained from a single tertiary healthcare centre in Malaysia over a 10-year period and showed the predominance of a single closely related ST2<sub>Pasteur</sub> lineage, indicating the entrenchment of this clone in the hospital.</p>","PeriodicalId":18487,"journal":{"name":"Microbial Genomics","volume":"11 2","pages":""},"PeriodicalIF":4.0,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11798184/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143189807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Genomic and pathogenicity analyses to identify the causative agent from multiple serogroups of non-O1, non-O139 <i>Vibrio cholerae</i> in foodborne outbreaks.","authors":"Masatomo Morita, Hirotaka Hiyoshi, Eiji Arakawa, Hidemasa Izumiya, Makoto Ohnishi, Kikuyo Ogata, Mari Sasaki, Hiroshi Narimatsu, Emiko Kitagawa, Yukihiro Akeda, Toshio Kodama","doi":"10.1099/mgen.0.001364","DOIUrl":"10.1099/mgen.0.001364","url":null,"abstract":"<p><p>In 2013, foodborne outbreaks in Japan were linked to non-O1, non-O139 <i>Vibrio cholerae</i>. However, laboratory tests detected several serogroups, making it difficult to determine the causative agent. Whole-genome analyses were therefore performed and revealed that only serogroup O144 <i>V. cholerae</i> possesses a genomic island with a type III secretion system (T3SS). A T3SS-deficient mutant was subsequently generated, and its pathogenicity was assessed using a rabbit ileal loop test. This led to the conclusion that serogroup O144 <i>V. cholerae</i> with T3SS was the causative agent of the foodborne outbreaks. This study provides an illustrative example of the utilization of whole-genome data for pathogenicity and molecular epidemiological analyses in outbreak investigations.</p>","PeriodicalId":18487,"journal":{"name":"Microbial Genomics","volume":"11 2","pages":""},"PeriodicalIF":4.0,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11865499/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143516114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Genomic insights into the beneficial potential of <i>Bifidobacterium</i> and <i>Enterococcus</i> strains isolated from Cameroonian infants.","authors":"Pierre Marie Kaktcham, Magdalena Kujawska, Edith Marius Foko Kouam, Laverdure Tchamani Piame, Michele Letitia Tchabou Tientcheu, Julia Mueller, Angela Felsl, Bastian-Alexander Truppel, François Zambou Ngoufack, Lindsay J Hall","doi":"10.1099/mgen.0.001354","DOIUrl":"10.1099/mgen.0.001354","url":null,"abstract":"<p><p>A healthy early-life gut microbiota plays an important role in maintaining immediate and long-term health. Perturbations, particularly in low- to middle-income communities, are associated with increased infection risk. Thus, a promising avenue for restoring a healthy infant microbiota is to select key beneficial bacterial candidates from underexplored microbiomes for developing new probiotic-based therapies. This study aimed to recover bifidobacteria and lactic acid bacteria from the faeces of healthy Cameroonian infants and unravel the genetic basis of their beneficial properties. Faecal samples were collected from 26 infants aged 0-5 months recruited in Dschang (Cameroon). Recovered bacterial isolates were subjected to whole-genome sequencing and <i>in silico</i> analysis to assess their potential for carbohydrate utilization, their antimicrobial capacities, host-adaptation capabilities and their safety. From the range of infant-associated <i>Bifidobacterium</i> and <i>Enterococcus</i> strains identified, <i>Bifidobacterium</i> species were found to harbour putative gene clusters implicated in human milk oligosaccharide metabolism. Genes linked to the production of antimicrobial peptides such as class IV lanthipeptides were found in <i>Bifidobacterium pseudocatenulatum</i>, while those implicated in biosynthesis of cytolysins, enterolysins, enterocins and propeptins, among others, were identified in enterococci. 
Bifidobacterial isolates did not contain genes associated with virulence; however, we detected the presence of putative tetracycline resistance genes in several strains belonging to <i>Bifidobacterium animalis</i> subsp. <i>lactis</i> and <i>Bifidobacterium longum</i> subsp. <i>longum</i>. Among the enterococci, <i>Enterococcus mundtii</i> PM10 did not carry any genes associated with antimicrobial resistance or virulence. The latter, together with all the <i>Bifidobacterium</i> strains, also encoded several putative adaptive and stress-response-related genes, suggesting robust gastrointestinal tract colonization potential. This work provides the first genomic characterization of <i>Bifidobacterium</i> and <i>Enterococcus</i> isolates from Cameroonian infants. Several strains showed the genomic potential to confer beneficial properties. Further phenotypic and clinical investigations are needed to confirm their suitability as customized probiotics.</p>","PeriodicalId":18487,"journal":{"name":"Microbial Genomics","volume":"11 2","pages":""},"PeriodicalIF":4.0,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11840169/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143448573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating methods for genome sequencing of <i>Chlamydia trachomatis</i> and other sexually transmitted bacteria directly from clinical swabs.","authors":"Karina Andrea Büttner, Vera Bregy, Fanny Wegner, Srinithi Purushothaman, Frank Imkamp, Tim Roloff Handschin, Mirja H Puolakkainen, Eija Hiltunen-Back, Domnique Braun, Ibrahim Kisakesen, Andreas Schreiber, Andrea Carolina Entrocassi, María Lucía Gallo Vaulet, Deysi López Aquino, Laura Svidler López, Luciana La Rosa, Adrian Egli, Marcelo Rodríguez Fermepin, Helena Mb Seth-Smith, On Behalf Of The Escmid Study Group For Mycoplasma And Chlamydia Infections Esgmac","doi":"10.1099/mgen.0.001353","DOIUrl":"10.1099/mgen.0.001353","url":null,"abstract":"<p><p>Rates of bacterial sexually transmitted infections (STIs) are rising, and accessing their genomes provides information on strain evolution, circulating strains and encoded antimicrobial resistance (AMR). Notable pathogens include <i>Chlamydia trachomatis</i> (CT), <i>Neisseria gonorrhoeae</i> (NG) and <i>Treponema pallidum</i> (TP), globally the most common bacterial STIs. <i>Mycoplasmoides</i> (formerly <i>Mycoplasma</i>) <i>genitalium</i> (MG) is also a bacterial STI that is of concern due to AMR development. These bacteria are also fastidious or hard to culture, and standard sampling methods lyse bacteria, completely preventing pathogen culture. Clinical samples contain large amounts of human and other microbiota DNA. These factors hinder the sequencing of bacterial STI genomes. We aimed to overcome these challenges in obtaining whole-genome sequences and evaluated four approaches using clinical samples from Argentina (39), and Switzerland (14), and cultured samples from Finland (2) and Argentina (1). First, direct genome sequencing from swab samples was attempted through Illumina deep metagenomic sequencing, showing extremely low levels of target DNA, with under 0.01% of the sequenced reads being from the target pathogens. 
Second, host DNA depletion followed by Illumina sequencing was not found to produce enrichment in these very low-load samples. Third, we tried a selective long-read approach with the new adaptive sequencing from Oxford Nanopore Technologies, which also did not improve enrichment sufficiently to provide genomic information. Finally, target enrichment using a novel pan-genome set of custom SureSelect probes targeting CT, NG, TP and MG followed by Illumina sequencing was successful. We produced whole genomes from 64% of CT-positive samples, from 36% of NG-positive samples and 60% of TP-positive samples. Additionally, we enriched MG DNA to gain partial genomes from 60% of samples. This is the first publication to date to utilize a pan-genome STI panel in target enrichment. Target enrichment, though costly, proved essential for obtaining genomic data from clinical samples. These data can be utilized to examine circulating strains and genotypic resistance and guide public health strategies.</p>","PeriodicalId":18487,"journal":{"name":"Microbial Genomics","volume":"11 2","pages":""},"PeriodicalIF":4.0,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143409050","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The complete genome sequence of <i>Penaeus vannamei</i> nudivirus (previously Baculovirus penaei or <i>P. vannamei</i> singly enveloped nuclear polyhedrosis virus).","authors":"Hung N Mai, Arun K Dhar","doi":"10.1099/mgen.0.001360","DOIUrl":"10.1099/mgen.0.001360","url":null,"abstract":"<p><p><i>Penaeus vannamei</i> singly enveloped nuclear polyhedrosis virus (PvSNPV), also known as Baculovirus penaei (BP), was the first viral pathogen of penaeid shrimp to be described, in 1974. Although PvSNPV was discovered almost 50 years ago, its complete genome sequence had not been determined until now. We detected the virus in a quarantine stock of <i>P. vannamei</i> shrimp by light microscopy of faecal samples and by PCR screening of broodstock. Subsequently, next-generation sequencing was deployed to determine the complete genome sequence of PvSNPV. The PvSNPV genome is a circular, double-stranded DNA molecule of 119 883 bp in length encoding 101 ORFs. The deduced aa sequences from 28 ORFs were homologous to 28 core proteins from all identified nudiviruses. Phylogenetic analyses based on deduced aa sequences of the core genes and orthologous genes revealed that PvSNPV clusters with <i>Penaeus monodon</i> nudivirus. Therefore, we propose to rename BP/PvSNPV as <i>P. vannamei</i> nudivirus and re-assign the virus to the family <i>Nudiviridae</i> instead of <i>Baculoviridae</i>.</p>","PeriodicalId":18487,"journal":{"name":"Microbial Genomics","volume":"11 2","pages":""},"PeriodicalIF":4.0,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143516118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Molecular epidemiology of <i>Eimeria</i> spp. parasites and the faecal microbiome of Indiana bats (<i>Myotis sodalis</i>): a non-invasive, multiplex metabarcode survey of an endangered species.","authors":"Andrew J Bennett, Cory D Suski, Joy M O'Keefe","doi":"10.1099/mgen.0.001358","DOIUrl":"10.1099/mgen.0.001358","url":null,"abstract":"<p><p>Assessing individual and population health in endangered wildlife poses unique challenges due to the lack of an adequate baseline and ethical constraints on invasive sampling. For endangered bats, minimally invasive samples like guano can often be the ethical and technical limit for studies of pathogens and the microbiome. In this study, we use multiplex metabarcode sequencing to describe the faecal microbiome and parasites of 56 Indiana bats (<i>Myotis sodalis</i>). We show evidence of a high prevalence of <i>Eimeria</i> spp. protozoan parasite and characterize associations between infection and changes to the faecal microbiome. We identify a strong and significant enrichment of <i>Clostridium</i> species in <i>Eimeria</i>-positive bats, including isolates related to <i>Clostridium perfringens</i>.</p>","PeriodicalId":18487,"journal":{"name":"Microbial Genomics","volume":"11 2","pages":""},"PeriodicalIF":4.0,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143516116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The dominant lineage of an emerging pathogen harbours contact-dependent inhibition systems.","authors":"Cristian V Crisan, Joanna B Goldberg","doi":"10.1099/mgen.0.001332","DOIUrl":"10.1099/mgen.0.001332","url":null,"abstract":"<p><p>Bacteria from the <i>Stenotrophomonas maltophilia</i> complex (Smc) are important multidrug-resistant pathogens that cause a broad range of infections. Smc is genomically diverse and has been classified into 23 lineages. Lineage Sm6 is the most common among sequenced strains, but it is unclear why this lineage has evolved to be dominant. Antagonistic interactions can significantly affect the evolution of bacterial populations. These interactions may be mediated by secreted contact-dependent proteins, which allow inhibitor cells to intoxicate adjacent target bacteria. Contact-dependent inhibition (CDI) requires three proteins: CdiA, CdiB and CdiI. CdiA is a large, filamentous protein exported to the surface of inhibitor cells through the pore-like CdiB. The CdiA C-terminal domain (CdiA-CT) is toxic when delivered into target cells of the same species or genus. CdiI immunity proteins neutralize the toxicity of cognate CdiA-CT toxins. We found that all complete Smc genomes from the Sm6 lineage harbour at least one CDI locus. By contrast, less than a quarter of strains from other lineages have CDI genes. Smc CdiA-CT domains are diverse and have a broad range of predicted functions. Most Sm6 strains harbour non-cognate <i>cdiI</i> genes predicted to provide protection against foreign toxins from other strains. 
Finally, we demonstrated that an Smc CdiA-CT toxin has antibacterial properties and is neutralized by its cognate CdiI.</p>","PeriodicalId":18487,"journal":{"name":"Microbial Genomics","volume":"11 1","pages":""},"PeriodicalIF":4.0,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11893273/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143033497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring SNP filtering strategies: the influence of strict vs soft core.","authors":"Mona L Taouk, Leo A Featherstone, George Taiaroa, Torsten Seemann, Danielle J Ingle, Timothy P Stinear, Ryan R Wick","doi":"10.1099/mgen.0.001346","DOIUrl":"10.1099/mgen.0.001346","url":null,"abstract":"<p><p>Phylogenetic analyses are crucial for understanding microbial evolution and infectious disease transmission. Bacterial phylogenies are often inferred from SNP alignments, with SNPs as the fundamental signal within these data. SNP alignments can be reduced to a 'strict core' by removing those sites that do not have data present in every sample. However, as sample size and genome diversity increase, a strict core can shrink markedly, discarding potentially informative data. Here, we propose and provide evidence to support the use of a 'soft core' that tolerates some missing data, preserving more information for phylogenetic analysis. Using large datasets of <i>Neisseria gonorrhoeae</i> and <i>Salmonella enterica</i> serovar Typhi, we assess different core thresholds. Our results show that strict cores can drastically reduce informative sites compared to soft cores. In a 10 000-genome alignment of <i>Salmonella enterica</i> serovar Typhi, a 95% soft core yielded ten times more informative sites than a 100% strict core. Similar patterns were observed in <i>N. gonorrhoeae</i>. We further evaluated the accuracy of phylogenies built from strict- and soft-core alignments using datasets with strong temporal signals. Soft-core alignments generally outperformed strict cores in producing trees displaying clock-like behaviour; for instance, the <i>N. gonorrhoeae</i> 95% soft-core phylogeny had a root-to-tip regression <i>R</i> <sup>2</sup> of 0.50 compared to 0.21 for the strict-core phylogeny. This study suggests that soft-core strategies are preferable for large, diverse microbial datasets. 
To facilitate this, we developed <i>Core-SNP-filter</i> (https://github.com/rrwick/Core-SNP-filter), an open-source software tool for generating soft-core alignments from whole-genome alignments based on user-defined thresholds.</p>","PeriodicalId":18487,"journal":{"name":"Microbial Genomics","volume":"11 1","pages":""},"PeriodicalIF":4.0,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11734701/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142984006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
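The strict-core versus soft-core distinction described in the record above can be illustrated with a small sketch. This is not the Core-SNP-filter implementation; it is a minimal Python illustration under stated assumptions: the alignment is held in memory as a dict of equal-length strings, only A, C, G and T count as present data, and the function names `soft_core_columns` and `filter_alignment` are hypothetical.

```python
from typing import Dict, List

# Assumption: only unambiguous bases count as "data present";
# gaps, N and other ambiguity codes are treated as missing.
BASES = set("ACGT")


def soft_core_columns(alignment: Dict[str, str], threshold: float = 0.95) -> List[int]:
    """Return indices of variable columns in which at least `threshold`
    (a fraction of all samples) have an unambiguous base.

    threshold=1.0 corresponds to a strict core; lower values give a soft core.
    """
    seqs = [s.upper() for s in alignment.values()]
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same aligned length")

    kept = []
    for i in range(length):
        called = [s[i] for s in seqs if s[i] in BASES]
        # Drop columns with too much missing data for this threshold.
        if len(called) / len(seqs) < threshold:
            continue
        # Keep only columns that actually vary among the called bases.
        if len(set(called)) > 1:
            kept.append(i)
    return kept


def filter_alignment(alignment: Dict[str, str], threshold: float = 0.95) -> Dict[str, str]:
    """Reduce every sequence to the columns selected by soft_core_columns."""
    cols = soft_core_columns(alignment, threshold)
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}


# Toy alignment: column 2 is variable but has one missing call.
toy = {
    "s1": "ACGTA",
    "s2": "ACATC",
    "s3": "AT-TA",
    "s4": "ACGTA",
}
strict = soft_core_columns(toy, threshold=1.0)   # columns with no missing data
soft = soft_core_columns(toy, threshold=0.75)    # tolerates one missing call in four
```

In the toy alignment the strict core discards the third column because one sample has a gap, while a 75% soft core retains it. This mirrors, on a very small scale, the loss of informative sites that the abstract reports for strict cores in large datasets.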