Multi-proteins similarity-based sampling to select representative genomes from large databases.

IF 2.9, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics Pub Date : 2025-05-06 DOI:10.1186/s12859-025-06095-3

Rémi-Vinh Coudert, Jean-Philippe Charrier, Frédéric Jauffrit, Jean-Pierre Flandrois, Céline Brochier-Armanet

BMC Bioinformatics 26(1):121. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12057276/pdf/

Citations: 0

Abstract

Background: Genome sequence databases are growing exponentially, but with high redundancy and uneven data quality. For these reasons, selecting representative subsets of genomes is an essential step for almost all studies. However, most current sampling approaches are biased and unable to process large datasets in a reasonable time.

Methods: Here we present MPS-Sampling (Multiple-Protein Similarity-based Sampling), a fast, scalable, and efficient method for selecting reliable and representative samples of genomes from very large datasets. Using families of homologous proteins as input, MPS-Sampling delineates homogeneous groups of genomes through two successive clustering steps. Representative genomes are then selected within these groups according to predefined or user-defined priority criteria.
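The workflow described above (cluster each protein family, group genomes, then pick one representative per group) can be illustrated with a toy sketch. This is not the MPS-Sampling implementation: the similarity measure (per-position identity on equal-length sequences), the greedy clustering, the single-linkage grouping, the thresholds, and every function name below are assumptions chosen only to keep the example small. Genomes missing a family are simply dropped here, and the all-pairs comparison in the second step would not scale to large datasets.

```python
from itertools import combinations


def identity(a: str, b: str) -> float:
    """Toy similarity: fraction of matching positions (equal-length sequences only)."""
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_family(seqs: dict[str, str], min_identity: float) -> dict[str, int]:
    """First step (toy): greedy clustering of one family; returns genome -> cluster id."""
    centroids: list[str] = []
    assignment: dict[str, int] = {}
    for genome in sorted(seqs):
        for cid, centroid in enumerate(centroids):
            if identity(seqs[genome], centroid) >= min_identity:
                assignment[genome] = cid
                break
        else:  # no centroid was close enough: open a new cluster
            centroids.append(seqs[genome])
            assignment[genome] = len(centroids) - 1
    return assignment


def group_genomes(profiles: dict[str, tuple[int, ...]], min_shared: float) -> list[set[str]]:
    """Second step (toy): single-linkage grouping of genomes with similar cluster profiles."""
    parent = {g: g for g in profiles}

    def find(g: str) -> str:
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    for a, b in combinations(sorted(profiles), 2):
        pa, pb = profiles[a], profiles[b]
        shared = sum(x == y for x, y in zip(pa, pb)) / len(pa)
        if shared >= min_shared:
            parent[find(a)] = find(b)

    groups: dict[str, set[str]] = {}
    for g in profiles:
        groups.setdefault(find(g), set()).add(g)
    return list(groups.values())


def sample(families, priority, min_identity=0.9, min_shared=0.5) -> list[str]:
    """Return one representative genome per group, chosen by a user-supplied priority."""
    # Simplification: keep only genomes present in every family.
    genomes = sorted(set.intersection(*(set(f) for f in families.values())))
    per_family = [
        cluster_family({g: fam[g] for g in genomes}, min_identity)
        for _, fam in sorted(families.items())
    ]
    profiles = {g: tuple(a[g] for a in per_family) for g in genomes}
    groups = group_genomes(profiles, min_shared)
    return sorted(max(sorted(grp), key=priority) for grp in groups)


# Hypothetical input: two protein families, three genomes, one quality score each.
families = {
    "L2": {"g1": "MKVAA", "g2": "MKVAA", "g3": "MRLTT"},
    "S3": {"g1": "ACDEF", "g2": "ACDEY", "g3": "WWDEF"},
}
quality = {"g1": 0.8, "g2": 0.95, "g3": 0.7}
representatives = sample(families, quality.get, min_identity=0.8, min_shared=0.5)
print(representatives)  # ['g2', 'g3']
```

Here `g1` and `g2` fall into the same group, and `g2` is kept because the hypothetical priority criterion (a quality score) ranks it higher. Any other callable can be passed as `priority` to mimic a different user-defined criterion.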

Results: MPS-Sampling was applied to a dataset of 48 ribosomal protein families from 178,203 bacterial genomes to generate representative genome sets of various sizes, corresponding to a sampling of 32.17% down to 0.3% of the complete dataset. An in-depth analysis shows that the selected genomes are both taxonomically and phylogenetically representative of the complete dataset, demonstrating the relevance of the approach.
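To give a sense of scale, the sampling percentages can be converted into approximate genome counts. The short calculation below is only an illustration: the abstract does not report these counts, the function name is an assumption, and because the published percentages are rounded (0.3% in particular), the true set sizes may differ noticeably from these estimates.

```python
def approx_genome_count(total: int, percent: float) -> int:
    """Estimate how many genomes a sampling percentage corresponds to."""
    return round(total * percent / 100)


TOTAL_GENOMES = 178_203  # dataset size stated in the abstract

largest = approx_genome_count(TOTAL_GENOMES, 32.17)
smallest = approx_genome_count(TOTAL_GENOMES, 0.3)
print(largest, smallest)  # 57328 535
```

Under these assumptions, the representative sets range from roughly 57,000 genomes down to a few hundred.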

Conclusion: MPS-Sampling provides an efficient, fast and scalable way to sample large collections of genomes in an acceptable computational time. MPS-Sampling does not rely on taxonomic information and does not require the inference of phylogenetic trees, thus avoiding the biases inherent in these approaches. As such, MPS-Sampling meets the needs of a growing number of users.


Source journal

BMC Bioinformatics (Biology: Biochemical Research Methods)

CiteScore: 5.70

Self-citation rate: 3.30%

Articles published: 506

Review time: 4.3 months

About the journal: BMC Bioinformatics is an open access, peer-reviewed journal that considers articles on all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology. BMC Bioinformatics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.