Robust 16S rRNA classification based on a compressed LCA index
Omar Y. Ahmed, Christina Boucher, Ben Langmead
Genome Research, published 2025-08-25. DOI: 10.1101/gr.279846.124 (https://doi.org/10.1101/gr.279846.124)
Taxonomic sequence classification is a computational problem central to the study of metagenomics and evolution. Advances in compressed indexing with the r-index enable full-text pattern matching against large sequence collections, but the data structures that link pattern sequences to their clades of origin still do not scale well to large collections. Previous work proposed document array profiles, which use O(rd) words of space, where r is the number of maximal equal-letter runs in the Burrows-Wheeler transform and d is the number of distinct genomes. The linear dependence on d is limiting, since real taxonomies can easily contain tens of thousands of leaves or more. We propose a method called cliff compression that reduces this size by a large factor: over 250× when indexing the SILVA 16S rRNA gene database. This method uses Θ(r log d) words of space in expectation under a random model we propose here. We implemented these ideas in an open-source tool called Cliffy, which performs efficient taxonomic classification of sequencing reads with respect to a compressed taxonomic index. When applied to simulated 16S rRNA reads, Cliffy's read-level accuracy is higher than Kraken2's by 11% to 18%. Cliffy also predicts clade abundances more accurately than Kraken2 and Bracken. Overall, Cliffy is a fast and space-economical extension to compressed full-text indexes, enabling them to perform fast and accurate taxonomic classification queries. Cliffy's accuracy underscores the advantages of full-text indexes, which offer a more precise solution than k-mer indexes designed for a specific k value.
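To make the quantities in the abstract concrete, the following is a minimal Python sketch that counts r, the number of maximal equal-letter runs in a Burrows-Wheeler transform, and compares the growth of r·d against r·log2(d). It is not Cliffy and does not implement cliff compression or document array profiles. The function names (`bwt`, `count_runs`, `profile_words`) are hypothetical, the rotation-sort BWT is only suitable for toy inputs, constant factors in the space bounds are ignored, and the values of r and d in the usage note are illustrative rather than taken from the paper.

```python
import math
from typing import Tuple


def bwt(text: str) -> str:
    """Burrows-Wheeler transform by sorting all rotations (toy inputs only)."""
    if "$" in text:
        raise ValueError("'$' is reserved as the end-of-text sentinel")
    s = text + "$"  # '$' sorts before letters in ASCII
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal equal-letter runs in s (the quantity called r)."""
    if not s:
        return 0
    # Every position where adjacent characters differ starts a new run.
    return 1 + sum(1 for a, b in zip(s, s[1:]) if a != b)


def profile_words(r: int, d: int) -> Tuple[int, float]:
    """Illustrative growth terms with constants ignored: (r * d, r * log2(d))."""
    return r * d, r * math.log2(d)
```

For example, `count_runs(bwt("banana"))` gives the run count for a toy text, and `profile_words(8, 1024)` contrasts linear and logarithmic dependence on d for made-up values. The point is only that, for a fixed r, the first term grows in proportion to the number of genomes while the second grows with its logarithm.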
About the journal:
Launched in 1995, Genome Research is an international, continuously published, peer-reviewed journal that focuses on research that provides novel insights into the genome biology of all organisms, including advances in genomic medicine.
Among the topics considered by the journal are genome structure and function, comparative genomics, molecular evolution, genome-scale quantitative and population genetics, proteomics, epigenomics, and systems biology. The journal also features exciting gene discoveries and reports of cutting-edge computational biology and high-throughput methodologies.
New data in these areas are published as research papers, or methods and resource reports that provide novel information on technologies or tools that will be of interest to a broad readership. Complete data sets are presented electronically on the journal's web site where appropriate. The journal also provides Reviews, Perspectives, and Insight/Outlook articles, which present commentary on the latest advances published both here and elsewhere, placing such progress in its broader biological context.