Comparison of gene-by-gene and genome-wide short nucleotide sequence-based approaches to define the global population structure of Streptococcus pneumoniae.

IF 4; Zone 2 (Biology); Q1 GENETICS & HEREDITY

Microbial Genomics. Pub Date: 2024-08-01. DOI: 10.1099/mgen.0.001278

Alannah C King, Narender Kumar, Kate C Mellor, Paulina A Hawkins, Lesley McGee, Nicholas J Croucher, Stephen D Bentley, John A Lees, Stephanie W Lo

Microbial Genomics, volume 10, issue 8. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11353345/pdf/

Citations: 0

Abstract

Defining the population structure of a pathogen is a key part of epidemiology, as genomically related isolates are likely to share key clinical features such as antimicrobial resistance profiles and invasiveness. Multiple different methods are currently used to cluster together closely related genomes, potentially leading to inconsistency between studies. Here, we use a global dataset of 26 306 Streptococcus pneumoniae genomes to compare four clustering methods: gene-by-gene seven-locus MLST, core genome MLST (cgMLST)-based hierarchical clustering (HierCC) assignments, life identification number (LIN) barcoding and k-mer-based PopPUNK clustering (known as GPSCs in this species). We compare the clustering results with phylogenetic and pan-genome analyses to assess their relationship with genome diversity and evolution, as we would expect a good clustering method to form a single monophyletic cluster that has high within-cluster similarity of genomic content. We show that the four methods are generally able to accurately reflect the population structure based on these metrics and that the methods were broadly consistent with each other. We investigated further to study the discrepancies in clusters. The greatest concordance was seen between LIN barcoding and HierCC (adjusted mutual information score=0.950), which was expected given that both methods utilize cgMLST, but have different methods for defining an individual cluster and different core genome schema. However, the existence of differences between the two methods shows that the selection of a core genome schema can introduce inconsistencies between studies. GPSC and HierCC assignments were also highly concordant (AMI=0.946), showing that k-mer-based methods which use the whole genome and do not require the careful selection of a core genome schema are just as effective at representing the population structure. 
Additionally, where there were differences in clustering between these methods, this could be explained by differences in the accessory genome that were not identified in cgMLST. We conclude that for S. pneumoniae, standardized and stable nomenclature is important as the number of genomes available expands. Furthermore, the research community should transition away from seven-locus MLST, whilst cgMLST, GPSC and LIN assignments should be used more widely. However, to allow for easy comparison between studies and to make previous literature relevant, the reporting of multiple clustering names should be standardized within the research.
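The concordance figures quoted above are adjusted mutual information (AMI) scores between two sets of cluster assignments for the same genomes. The following is a minimal sketch of how such a score can be computed. It is not the authors' code: it assumes the arithmetic mean of the two entropies as the normalizer (the abstract does not say which variant was used), and the cluster labels in the usage example are hypothetical.

```python
from collections import Counter
from math import exp, lgamma, log


def entropy(labels):
    """Shannon entropy (natural log) of a list of cluster labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """Mutual information between two labelings of the same items."""
    n = len(u)
    a, b = Counter(u), Counter(v)
    joint = Counter(zip(u, v))  # contingency table as a sparse counter
    return sum(
        (nij / n) * log(n * nij / (a[i] * b[j]))
        for (i, j), nij in joint.items()
    )


def expected_mutual_information(u, v):
    """Expected MI when cluster sizes are fixed but membership is random
    (hypergeometric model), computed with log-factorials for stability."""
    n = len(u)
    sizes_u = list(Counter(u).values())
    sizes_v = list(Counter(v).values())

    def log_fact(k):
        return lgamma(k + 1)

    emi = 0.0
    for ai in sizes_u:
        for bj in sizes_v:
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                log_p = (
                    log_fact(ai) + log_fact(bj)
                    + log_fact(n - ai) + log_fact(n - bj)
                    - log_fact(n) - log_fact(nij)
                    - log_fact(ai - nij) - log_fact(bj - nij)
                    - log_fact(n - ai - bj + nij)
                )
                emi += (nij / n) * log(n * nij / (ai * bj)) * exp(log_p)
    return emi


def adjusted_mutual_information(u, v):
    """AMI = (MI - E[MI]) / (mean(H(U), H(V)) - E[MI]).
    The arithmetic-mean normalizer is an assumption of this sketch."""
    if len(u) != len(v):
        raise ValueError("both labelings must cover the same items")
    mi = mutual_information(u, v)
    emi = expected_mutual_information(u, v)
    denom = (entropy(u) + entropy(v)) / 2 - emi
    if abs(denom) < 1e-12:  # e.g. both labelings are one single cluster
        return 1.0
    return (mi - emi) / denom


# Hypothetical assignments for six genomes under two clustering schemes.
gpsc = ["GPSC1", "GPSC1", "GPSC1", "GPSC2", "GPSC2", "GPSC2"]
hiercc = ["HC_a", "HC_a", "HC_a", "HC_b", "HC_b", "HC_c"]
print(adjusted_mutual_information(gpsc, hiercc))
```

A score of 1 means the two schemes partition the genomes identically, regardless of how clusters are named, and a score near 0 means agreement no better than chance. The triple loop in `expected_mutual_information` is fine for illustration but would be slow on tens of thousands of genomes with many clusters; an established library implementation would be the practical choice at that scale.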

Source journal

Microbial Genomics Medicine-Epidemiology

CiteScore: 6.60

Self-citation rate: 2.60%

Articles published: 153

Review time: 12 weeks

About the journal: Microbial Genomics (MGen) is a fully open access, mandatory open data and peer-reviewed journal publishing high-profile original research on archaea, bacteria, microbial eukaryotes and viruses.