Machine learning reveals the dynamic importance of accessory sequences for Salmonella outbreak clustering
Authors: Chao Chun Liu, William W L Hsiao
Journal: mBio, e0265024 (Journal Article; Epub 2025-01-28; published 2025-03-12)
DOI: 10.1128/mbio.02650-24 (https://doi.org/10.1128/mbio.02650-24)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11898705/pdf/
Impact factor: 5.1 (JCR Q1, Microbiology)
Citation count: 0
Abstract
Bacterial typing at whole-genome scales is now feasible owing to the decreasing cost of high-throughput sequencing and recent advances in computation. The unprecedented resolution of whole-genome typing is achieved by genotyping the variable segments of bacterial genomes, which can fluctuate significantly in gene content. However, because many accessory elements are transient and hypervariable, the value of this added resolution in outbreak investigations remains disputed. To assess the analytical value of bacterial accessory genomes in clustering epidemiologically related cases, we trained classifiers on a set of genomes collected from 24 Salmonella enterica outbreaks of food, animal, or environmental origin. The models demonstrated high precision and recall on unseen test data, with near-perfect accuracy in classifying clonal and short-term outbreaks. Annotating the genomic features important for cluster classification revealed functional enrichment of molecular fingerprints in genes involved in membrane transport, trafficking, and carbohydrate metabolism. Importantly, we found that polymorphisms in mobile genetic elements (MGEs) and the gain or loss of MGEs were informative in defining outbreak clusters. To quantify the ability of MGE variations to cluster outbreak clones, we devised a reference-free tree-building algorithm inspired by colored de Bruijn graphs, which enabled topological comparisons between MGE-based and standard typing methods. Systematic evaluation of clustering MGEs on an unseen dataset of 34 Salmonella outbreaks yielded mixed results. They exemplified both the power of accessory sequence variations when the core genomes of unrelated cases are insufficiently discriminatory and the distortion of outbreak signals by microevolution events or the incomplete assembly of MGEs.
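The abstract does not describe the internals of the reference-free tree-building algorithm, so the following is only a minimal illustration of the general idea behind colored de Bruijn graphs: each k-mer is labeled with the set of genomes ("colors") that contain it, and shared k-mers can be turned into pairwise distances without a reference genome. The function names, the choice of Jaccard distance, the value of k, and the toy sequences are all assumptions made for illustration; this is not the authors' method.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def colored_kmer_index(genomes, k):
    """Map each k-mer to the set of genome names ("colors") containing it."""
    index = {}
    for name, seq in genomes.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(name)
    return index


def jaccard_distances(genomes, k):
    """Pairwise Jaccard distance between genomes based on shared k-mers."""
    index = colored_kmer_index(genomes, k)
    total = {name: 0 for name in genomes}
    shared = {pair: 0 for pair in combinations(sorted(genomes), 2)}
    for colors in index.values():
        # Count this k-mer once for every genome that carries it.
        for name in colors:
            total[name] += 1
        # Count it as shared for every pair of genomes that both carry it.
        for pair in combinations(sorted(colors), 2):
            shared[pair] += 1
    distances = {}
    for (a, b), n_shared in shared.items():
        union = total[a] + total[b] - n_shared
        distances[(a, b)] = 1.0 - n_shared / union if union else 0.0
    return distances


# Toy sequences (hypothetical, not real MGE data).
toy = {
    "case_A": "ACGTACGTGA",
    "case_B": "ACGTACGTGA",
    "case_C": "TTTTGGGGCC",
}
dist = jaccard_distances(toy, k=4)
```

In this sketch, identical sequences get a distance of 0 and sequences with no k-mers in common get a distance of 1. A distance matrix of this kind could then be passed to a standard hierarchical clustering or neighbor-joining step to obtain a tree for topological comparison, although the abstract does not say which tree-building step the authors used.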
Importance: Gene-by-gene typing is widely used to detect clusters of foodborne illnesses that share a common origin. Whether including accessory sequences in bacterial typing schemes is informative or deleterious for cluster definitions in outbreak investigations remains actively debated, given the potential confounding effects of horizontal gene transfer. By training machine learning models on a curated set of historical Salmonella outbreaks, we revealed an enriched presence of outbreak-distinguishing features in a wide range of mobile genetic elements. Systematically comparing the efficacy of clustering different accessory elements against standard sequence typing methods allowed us to catalog scenarios in which accessory sequence variations were beneficial, and others in which they were uninformative, for resolving outbreak clusters. The presented work underscores the complexity of the molecular trends in enteric outbreaks and seeks to inspire novel computational ways to exploit whole-genome sequencing data in enteric disease surveillance and management.
About the journal:
mBio® is ASM's first broad-scope, online-only, open access journal. mBio offers streamlined review and publication of the best research in microbiology and allied fields.