GSearch: ultra-fast and scalable genome search by combining K-mer hashing with hierarchical navigable small world graphs.

IF 16.6 | Zone 2 (Biology) | JCR Q1: Biochemistry & Molecular Biology

Nucleic Acids Research | Pub Date: 2024-09-09 | DOI: 10.1093/nar/gkae609

Jianshu Zhao, Jean Pierre Both, Luis M Rodriguez-R, Konstantinos T Konstantinidis

Pages: e74 | Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11381346/pdf/

Citations: 0

Abstract

Genome search and/or classification typically involves finding the best-match database (reference) genomes and has become increasingly challenging due to the growing number of available database genomes and the fact that traditional methods do not scale well with large databases. By combining k-mer hashing-based probabilistic data structures (i.e. ProbMinHash, SuperMinHash, Densified MinHash and SetSketch) to estimate genomic distance, with a graph-based nearest-neighbor search algorithm (Hierarchical Navigable Small World Graphs, or HNSW), we created a new data structure and developed an associated computer program, GSearch, that is orders of magnitude faster than alternative tools while maintaining high accuracy and low memory usage. For example, GSearch can search 8000 query genomes against all available microbial or viral genomes for their best matches (n = ∼318 000 or ∼3 000 000, respectively) within a few minutes on a personal laptop, using ∼6 GB of memory (2.5 GB via SetSketch). Notably, GSearch has an O(log(N)) time complexity and will scale well with billions of genomes based on a database splitting strategy. Further, GSearch implements a three-step search strategy depending on the degree of novelty of the query genomes to maximize specificity and sensitivity. Therefore, GSearch solves a major bottleneck of microbiome studies that require genome search and/or classification.
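To make the core idea in the abstract concrete, the following is a minimal Python sketch of k-mer hashing used to estimate genomic similarity. It is an illustration under stated assumptions, not GSearch's implementation: it uses classic MinHash rather than ProbMinHash, SuperMinHash, Densified MinHash or SetSketch, and it replaces the HNSW graph search with a brute-force linear scan, which is O(N) per query rather than O(log(N)). All function names (`kmers`, `minhash_sketch`, `jaccard_estimate`, `best_match`) and parameter defaults are hypothetical and chosen only for this example.

```python
import hashlib
import random


def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer, seed):
    """Deterministic 64-bit hash of a k-mer under a given seed."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq, k=12, num_hashes=64):
    """Classic MinHash: for each seed, keep the smallest hash over all k-mers."""
    ks = kmers(seq, k)
    return [min(hash64(km, seed) for km in ks) for seed in range(num_hashes)]


def jaccard_estimate(sketch_a, sketch_b):
    """Fraction of matching sketch positions, an estimate of k-mer Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def best_match(query_sketch, database):
    """Brute-force stand-in for graph search: scan every database sketch."""
    return max(database, key=lambda name: jaccard_estimate(query_sketch, database[name]))


def random_genome(length, rng):
    """Toy random DNA sequence (assumption: real genomes are far longer)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def mutate(seq, rate, rng):
    """Resample each base with probability `rate` to mimic divergence."""
    return "".join(rng.choice("ACGT") if rng.random() < rate else base for base in seq)


if __name__ == "__main__":
    rng = random.Random(0)
    reference = random_genome(1000, rng)
    database = {
        "close_relative": minhash_sketch(mutate(reference, 0.02, rng)),
        "unrelated": minhash_sketch(random_genome(1000, rng)),
    }
    query = minhash_sketch(reference)
    for name, sketch in database.items():
        print(name, round(jaccard_estimate(query, sketch), 3))
    print("best match:", best_match(query, database))
```

The fixed-size sketches are what keep memory low: each genome is reduced to `num_hashes` integers regardless of its length. In a setting like the one the abstract describes, the linear scan in `best_match` would be replaced by an approximate nearest-neighbor index over those sketches, which is where the logarithmic search time comes from.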


Source journal

Nucleic Acids Research (Biology: Biochemistry & Molecular Biology)

CiteScore: 27.10

Self-citation rate: 4.70%

Articles published: 1057

Review time: 2 months

About the journal: Nucleic Acids Research (NAR) is a scientific journal that publishes research on various aspects of nucleic acids and proteins involved in nucleic acid metabolism and interactions. It covers areas such as chemistry and synthetic biology, computational biology, gene regulation, chromatin and epigenetics, genome integrity, repair and replication, genomics, molecular biology, nucleic acid enzymes, RNA, and structural biology. The journal also includes a Survey and Summary section for brief reviews. Additionally, each year, the first issue is dedicated to biological databases, and an issue in July focuses on web-based software resources for the biological community. Nucleic Acids Research is indexed by several services including Abstracts on Hygiene and Communicable Diseases, Animal Breeding Abstracts, Agricultural Engineering Abstracts, Agbiotech News and Information, BIOSIS Previews, CAB Abstracts, and EMBASE.