Fractional hitting sets for efficient multiset sketching.

IF 1.7 4区生物学 Q4 BIOCHEMICAL RESEARCH METHODS

Algorithms for Molecular Biology Pub Date : 2025-02-08 DOI:10.1186/s13015-024-00268-0

Timothé Rouzé, Igor Martayan, Camille Marchet, Antoine Limasset

{"title":"Fractional hitting sets for efficient multiset sketching.","authors":"Timothé Rouzé, Igor Martayan, Camille Marchet, Antoine Limasset","doi":"10.1186/s13015-024-00268-0","DOIUrl":null,"url":null,"abstract":"<p><p>The exponential increase in publicly available sequencing data and genomic resources necessitates the development of highly efficient methods for data processing and analysis. Locality-sensitive hashing techniques have successfully transformed large datasets into smaller, more manageable sketches while maintaining comparability using metrics such as Jaccard and containment indices. However, fixed-size sketches encounter difficulties when applied to divergent datasets. Scalable sketching methods, such as sourmash, provide valuable solutions but still lack resource-efficient, tailored indexing. Our objective is to create lighter sketches with comparable results while enhancing efficiency. We introduce the concept of Fractional Hitting Sets, a generalization of Universal Hitting Sets, which cover a specified fraction of the k-mer space. In theory and practice, we demonstrate the feasibility of achieving such coverage with simple but highly efficient schemes. By encoding the covered k-mers as super-k-mers, we provide a space-efficient exact representation that also enables optimized comparisons. Our novel tool, supersampler, implements this scheme, and experimental results with real bacterial collections closely match our theoretical findings. In comparison to sourmash, supersampler achieves similar outcomes while utilizing an order of magnitude less space and memory and operating several times faster. This highlights the potential of our approach in addressing the challenges presented by the ever-expanding landscape of genomic data. supersampler is an open-source software and can be accessed at https://github.com/TimRouze/supersampler . The data required to reproduce the results presented in this manuscript is available at https://github.com/TimRouze/supersampler/experiments .</p>","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"20 1","pages":"1"},"PeriodicalIF":1.7000,"publicationDate":"2025-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11807336/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Algorithms for Molecular Biology","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1186/s13015-024-00268-0","RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}

引用次数: 0

Abstract

The exponential increase in publicly available sequencing data and genomic resources necessitates the development of highly efficient methods for data processing and analysis. Locality-sensitive hashing techniques have successfully transformed large datasets into smaller, more manageable sketches while maintaining comparability using metrics such as Jaccard and containment indices. However, fixed-size sketches encounter difficulties when applied to divergent datasets. Scalable sketching methods, such as sourmash, provide valuable solutions but still lack resource-efficient, tailored indexing. Our objective is to create lighter sketches with comparable results while enhancing efficiency. We introduce the concept of Fractional Hitting Sets, a generalization of Universal Hitting Sets, which cover a specified fraction of the k-mer space. In theory and practice, we demonstrate the feasibility of achieving such coverage with simple but highly efficient schemes. By encoding the covered k-mers as super-k-mers, we provide a space-efficient exact representation that also enables optimized comparisons. Our novel tool, supersampler, implements this scheme, and experimental results with real bacterial collections closely match our theoretical findings. In comparison to sourmash, supersampler achieves similar outcomes while utilizing an order of magnitude less space and memory and operating several times faster. This highlights the potential of our approach in addressing the challenges presented by the ever-expanding landscape of genomic data. supersampler is an open-source software and can be accessed at https://github.com/TimRouze/supersampler . The data required to reproduce the results presented in this manuscript is available at https://github.com/TimRouze/supersampler/experiments .

Abstract Image

查看原文本刊更多论文

分数命中集用于高效的多集素描。

公开可用的测序数据和基因组资源呈指数级增长，需要开发高效的数据处理和分析方法。位置敏感的散列技术已经成功地将大型数据集转换为更小、更易于管理的草图，同时使用Jaccard和containment索引等指标保持可比性。然而，固定大小的草图在应用于不同的数据集时会遇到困难。可扩展的素描方法，如sourmash，提供了有价值的解决方案，但仍然缺乏资源高效，量身定制的索引。我们的目标是创建具有可比结果的更轻的草图，同时提高效率。我们引入了分数命中集的概念，它是普遍命中集的推广，它覆盖了k-mer空间的特定分数。在理论和实践中，我们证明了用简单而高效的方案实现这种覆盖的可行性。通过将覆盖的k-mers编码为super-k-mers，我们提供了一种节省空间的精确表示，也可以进行优化比较。我们的新工具，超级采样器，实现了这一方案，实验结果与真实的细菌收集密切匹配我们的理论发现。与sourmash相比，supersampler在利用更少的空间和内存并以几倍的速度运行的同时实现了类似的结果。这凸显了我们的方法在应对不断扩大的基因组数据所带来的挑战方面的潜力。supersampler是一个开源软件，可以在https://github.com/TimRouze/supersampler上访问。重现本文中所呈现的结果所需的数据可在https://github.com/TimRouze/supersampler/experiments上获得。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Algorithms for Molecular Biology 生物-生化研究方法

CiteScore

2.40

自引率

10.00%

发文量

审稿时长

>12 weeks

期刊介绍： Algorithms for Molecular Biology publishes articles on novel algorithms for biological sequence and structure analysis, phylogeny reconstruction, and combinatorial algorithms and machine learning. Areas of interest include but are not limited to: algorithms for RNA and protein structure analysis, gene prediction and genome analysis, comparative sequence analysis and alignment, phylogeny, gene expression, machine learning, and combinatorial algorithms. Where appropriate, manuscripts should describe applications to real-world data. However, pure algorithm papers are also welcome if future applications to biological data are to be expected, or if they address complexity or approximation issues of novel computational problems in molecular biology. Articles about novel software tools will be considered for publication if they contain some algorithmically interesting aspects.