Dominika Draesslerová, Omar Ahmed, Travis Gagie, Jan Holub, Ben Langmead, Giovanni Manzini, Gonzalo Navarro
Taxonomic classification with maximal exact matches in KATKA kernels and minimizer digests
Leibniz International Proceedings in Informatics, vol. 301. Published 2024-07-01 (Epub 2024-07-11). DOI: https://doi.org/10.4230/LIPIcs.SEA.2024.10. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11301608/pdf/
Abstract
For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use k-mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can
■ build an augmented FM-index over the genomes in the tree concatenated in left-to-right order;
■ for each MEM in a read, find the interval in the suffix array containing the starting positions of that MEM's occurrences in those genomes;
■ find the minimum and maximum values stored in that interval;
■ take the lowest common ancestor (LCA) of the genomes containing the characters at those positions.
This solution is practical, however, only when the total size of the genomes in the tree is fairly small. In this paper we consider applying the same solution to three lossily compressed representations of the genomes' concatenation:
■ a KATKA kernel, which discards characters that are not in the first or last occurrence of any k_max-tuple, for a parameter k_max;
■ a minimizer digest;
■ a KATKA kernel of a minimizer digest.
With a test dataset and these three representations of it, simulated reads and various parameter settings, we checked how many reads' longest MEMs occurred only in the sequences from which those reads were generated ("true positive" reads). For some parameter settings we achieved significant compression while only slightly decreasing the true-positive rate.
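The four indexing and query steps in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation. It assumes a plain sorted suffix array in place of the augmented FM-index, a `$` separator after each genome, and a tree given as hypothetical `parent` and `depth` maps. It also takes the match (for example, a MEM found elsewhere) as given rather than computing MEMs. All function and variable names are invented for illustration.

```python
from bisect import bisect_left, bisect_right

def build_index(genomes):
    """Concatenate genomes in left-to-right leaf order and build a toy suffix array."""
    text = "".join(g + "$" for g in genomes)  # "$" separator is an assumption
    # genome_of[i] = index of the genome that owns text position i
    genome_of = [gi for gi, g in enumerate(genomes) for _ in range(len(g) + 1)]
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
    return text, sa, genome_of

def sa_interval(text, sa, pattern):
    """Half-open suffix-array interval [lo, hi) of suffixes starting with pattern."""
    m = len(pattern)
    # Truncated suffixes stay sorted, so binary search applies.
    # A real FM-index would find this interval by backward search instead.
    prefixes = [text[i:i + m] for i in sa]
    return bisect_left(prefixes, pattern), bisect_right(prefixes, pattern)

def lca(parent, depth, u, v):
    """Lowest common ancestor by walking the deeper node upward."""
    while u != v:
        if depth[u] < depth[v]:
            u, v = v, u
        u = parent[u]
    return u

def classify_match(text, sa, genome_of, leaf_of, parent, depth, pattern):
    """Map one match to the LCA of the leftmost and rightmost genomes containing it."""
    lo, hi = sa_interval(text, sa, pattern)
    if lo == hi:
        return None  # pattern does not occur
    positions = sa[lo:hi]
    first, last = min(positions), max(positions)
    return lca(parent, depth, leaf_of[genome_of[first]], leaf_of[genome_of[last]])

# Toy tree: root 0 has children 1 and 4; node 1 has leaves 2 and 3.
parent = {0: 0, 1: 0, 2: 1, 3: 1, 4: 0}
depth = {0: 0, 1: 1, 2: 2, 3: 2, 4: 1}
leaf_of = [2, 3, 4]                       # genome index -> leaf node
genomes = ["ACGTAC", "ACGTTT", "GGGTTT"]  # listed in left-to-right leaf order
text, sa, genome_of = build_index(genomes)
print(classify_match(text, sa, genome_of, leaf_of, parent, depth, "ACGT"))
```

Because the genomes are concatenated in left-to-right leaf order, the minimum and maximum positions in the interval identify the leftmost and rightmost genomes containing the match, and their LCA covers every genome in between. The sketch scans the interval for the minimum and maximum; the augmentation mentioned in the abstract presumably exists to answer those queries without a scan.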
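The two lossy representations named in the abstract can also be sketched in Python. The kernel function follows the abstract's one-line description literally: it keeps a character only if it lies in the first or last occurrence of some k_max-tuple. The minimizer digest uses one common definition (the lexicographically smallest k-mer in each window of `w` consecutive k-mers), which may differ from the paper's exact scheme. The `#` gap marker, the parameters `k` and `w`, and all function names are assumptions made for illustration.

```python
def katka_kernel_mask(text, k_max):
    """keep[i] is True when text[i] lies in the first or last occurrence
    of some k_max-tuple (substring of length k_max)."""
    first, last = {}, {}
    for i in range(len(text) - k_max + 1):
        t = text[i:i + k_max]
        first.setdefault(t, i)  # remember only the earliest start
        last[t] = i             # overwritten until the latest start remains
    keep = [False] * len(text)
    for i in list(first.values()) + list(last.values()):
        for j in range(i, i + k_max):
            keep[j] = True
    return keep

def katka_kernel(text, k_max, gap="#"):
    """Kept characters, with each run of discarded characters shown as one gap
    marker (the marker is an illustrative choice, not taken from the abstract)."""
    out = []
    for ch, kept in zip(text, katka_kernel_mask(text, k_max)):
        if kept:
            out.append(ch)
        elif not out or out[-1] != gap:
            out.append(gap)
    return "".join(out)

def minimizer_digest(seq, k, w):
    """Smallest k-mer of every window of w consecutive k-mers,
    recording each selected position only once."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    digest, last_pos = [], -1
    for start in range(len(kmers) - w + 1):
        # Ties are broken toward the leftmost position, so selected
        # positions never move backward as the window slides.
        pos = min(range(start, start + w), key=lambda i: (kmers[i], i))
        if pos != last_pos:
            digest.append(kmers[pos])
            last_pos = pos
    return digest

print(katka_kernel("AAAAAA", 2))
print(minimizer_digest("ACGTACGT", 2, 3))
```

In this toy form, the third representation would amount to treating each distinct minimizer as a single symbol and applying the kernel to the resulting symbol sequence; how the paper encodes those symbols is not stated in the abstract.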