Sketching Methods with Small Window Guarantee Using Minimum Decycling Sets.

IF 1.4, Zone 4 (Biology), JCR Q4 (Biochemical Research Methods)
Journal of Computational Biology Pub Date : 2024-07-01 Epub Date: 2024-07-09 DOI:10.1089/cmb.2024.0544
Guillaume Marçais, Dan DeBlasio, Carl Kingsford
Pages: 597-615
Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11304339/pdf/
Citations: 0

Abstract

Most sequence sketching methods work by selecting specific k-mers from sequences so that the similarity between two sequences can be estimated using only the sketches. Because estimating similarity from sketches is much faster than computing a sequence alignment, sketching methods are used to reduce the computational requirements of computational biology software. Applications that use sketches often rely on properties of the k-mer selection procedure to ensure that using a sketch does not degrade the quality of the results compared with using sequence alignment. Two important examples of such properties are locality and window guarantees, the latter of which ensures that no long region of the sequence goes unrepresented in the sketch. A sketching method with a window guarantee corresponds, implicitly or explicitly, to a decycling set of the de Bruijn graph, that is, a set of unavoidable k-mers. By definition, any sufficiently long sequence must contain a k-mer from any decycling set (hence the term "unavoidable"). Conversely, a decycling set defines a sketching method: the k-mers in the set are chosen as representatives. Although current methods belong to one of a small number of sketching method families, the space of decycling sets is much larger and largely unexplored. Finding decycling sets with desirable characteristics (e.g., small remaining path length) is a promising approach to discovering new sketching methods with improved performance (e.g., a small window guarantee). The Minimum Decycling Sets (MDSs) are of particular interest because of their minimum size. Previously, only two algorithms, by Mykkeltveit and by Champarnaud, were known; they generate two particular MDSs, although there are typically a vast number of alternative MDSs. We provide a simple method to enumerate MDSs. This method allows one to explore the space of MDSs and to find MDSs optimized for desirable properties. We give evidence that the Mykkeltveit sets are close to optimal regarding one particular property, the remaining path length. We present a number of conjectures, together with computational and theoretical evidence supporting them. Code is available at https://github.com/Kingsford-Group/mdsscope.
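
To make the notions of decycling set and remaining path length concrete, the following is a minimal Python sketch, not the authors' enumeration method and not the mdsscope code. It assumes the standard de Bruijn graph of order k (one node per k-mer, one edge from each k-mer to every k-mer obtained by dropping its first symbol and appending one symbol). The function names `remaining_path_length` and `minimum_decycling_sets` are hypothetical, and the enumeration is a plain brute force that is only practical for very small k and alphabets.

```python
from itertools import combinations, product


def remaining_path_length(removed, k, alphabet="01"):
    """Delete the k-mers in `removed` from the order-k de Bruijn graph.

    Returns None if a cycle remains (so `removed` is not a decycling set),
    otherwise the number of k-mers on the longest remaining path.
    """
    removed = set(removed)
    nodes = ["".join(p) for p in product(alphabet, repeat=k)]
    nodes = [n for n in nodes if n not in removed]
    # Successors: drop the first symbol, append any symbol of the alphabet.
    succ = {
        n: [n[1:] + c for c in alphabet if n[1:] + c not in removed]
        for n in nodes
    }
    indeg = {n: 0 for n in nodes}
    for n in nodes:
        for s in succ[n]:
            indeg[s] += 1

    # Kahn's topological sort; length[n] = k-mers on the longest path ending at n.
    ready = [n for n in nodes if indeg[n] == 0]
    length = {n: 1 for n in nodes}
    done = 0
    while ready:
        n = ready.pop()
        done += 1
        for s in succ[n]:
            length[s] = max(length[s], length[n] + 1)
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)

    if done < len(nodes):  # some node was never freed: a cycle is left
        return None
    return max(length.values(), default=0)


def minimum_decycling_sets(k, alphabet="01"):
    """Brute-force all decycling sets of minimum size (tiny inputs only)."""
    nodes = ["".join(p) for p in product(alphabet, repeat=k)]
    for size in range(len(nodes) + 1):
        found = [
            set(c)
            for c in combinations(nodes, size)
            if remaining_path_length(c, k, alphabet) is not None
        ]
        if found:
            return found
    return []


if __name__ == "__main__":
    for mds in minimum_decycling_sets(3):
        print(sorted(mds), remaining_path_length(mds, 3))
```

Under these assumptions, if the longest remaining path contains ℓ k-mers, then any ℓ + 1 consecutive k-mers of a sequence must include a k-mer from the set, which is how a small remaining path length translates into a small window guarantee. For the binary alphabet with k = 2, the brute force finds the two sets {00, 01, 11} and {00, 10, 11}.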

Source journal
Journal of Computational Biology (Biology; Computer Science, Interdisciplinary Applications)
CiteScore: 3.60
Self-citation rate: 5.90%
Articles published: 113
Review time: 6-12 weeks
About the journal: Journal of Computational Biology is the leading peer-reviewed journal in computational biology and bioinformatics, publishing in-depth statistical, mathematical, and computational analysis of methods, as well as their practical impact. Available only online, this is an essential journal for scientists and students who want to keep abreast of developments in bioinformatics. Journal of Computational Biology coverage includes:
- Genomics
- Mathematical modeling and simulation
- Distributed and parallel biological computing
- Designing biological databases
- Pattern matching and pattern detection
- Linking disparate databases and data
- New tools for computational biology
- Relational and object-oriented database technology for bioinformatics
- Biological expert system design and use
- Reasoning by analogy, hypothesis formation, and testing by machine
- Management of biological databases