基于分支特异性标记K-Mer的哈希方法在宏基因组分类中的可扩展性如何?

IF 1.3 Q3 ENGINEERING, ELECTRICAL & ELECTRONIC

Frontiers in signal processing Pub Date : 2022-07-05 DOI:10.3389/frsip.2022.842513

Melissa M. Gray, Zhengqiao Zhao, G. Rosen

{"title":"基于分支特异性标记K-Mer的哈希方法在宏基因组分类中的可扩展性如何?","authors":"Melissa M. Gray, Zhengqiao Zhao, G. Rosen","doi":"10.3389/frsip.2022.842513","DOIUrl":null,"url":null,"abstract":"Efficiently and accurately identifying which microbes are present in a biological sample is important to medicine and biology. For example, in medicine, microbe identification allows doctors to better diagnose diseases. Two questions are essential to metagenomic analysis (the analysis of a random sampling of DNA in a patient/environment sample): How to accurately identify the microbes in samples and how to efficiently update the taxonomic classifier as new microbe genomes are sequenced and added to the reference database. To investigate how classifiers change as they train on more knowledge, we made sub-databases composed of genomes that existed in past years that served as “snapshots in time” (1999–2020) of the NCBI reference genome database. We evaluated two classification methods, Kraken 2 and CLARK with these snapshots using a real, experimental metagenomic sample from a human gut. This allowed us to measure how much of a real sample could confidently classify using these methods and as the database grows. Despite not knowing the ground truth, we could measure the concordance between methods and between years of the database within each method using a Bray-Curtis distance. In addition, we also recorded the training times of the classifiers for each snapshot. For all data for Kraken 2, we observed that as more genomes were added, more microbes from the sample were classified. CLARK had a similar trend, but in the final year, this trend reversed with the microbial variation and less unique k-mers. Also, both classifiers, while having different ways of training, generally are linear in time - but Kraken 2 has a significantly lower slope in scaling to more data.","PeriodicalId":93557,"journal":{"name":"Frontiers in signal processing","volume":"12 1","pages":""},"PeriodicalIF":1.3000,"publicationDate":"2022-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"How Scalable Are Clade-Specific Marker K-Mer Based Hash Methods for Metagenomic Taxonomic Classification?\",\"authors\":\"Melissa M. Gray, Zhengqiao Zhao, G. Rosen\",\"doi\":\"10.3389/frsip.2022.842513\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Efficiently and accurately identifying which microbes are present in a biological sample is important to medicine and biology. For example, in medicine, microbe identification allows doctors to better diagnose diseases. Two questions are essential to metagenomic analysis (the analysis of a random sampling of DNA in a patient/environment sample): How to accurately identify the microbes in samples and how to efficiently update the taxonomic classifier as new microbe genomes are sequenced and added to the reference database. To investigate how classifiers change as they train on more knowledge, we made sub-databases composed of genomes that existed in past years that served as “snapshots in time” (1999–2020) of the NCBI reference genome database. We evaluated two classification methods, Kraken 2 and CLARK with these snapshots using a real, experimental metagenomic sample from a human gut. This allowed us to measure how much of a real sample could confidently classify using these methods and as the database grows. Despite not knowing the ground truth, we could measure the concordance between methods and between years of the database within each method using a Bray-Curtis distance. In addition, we also recorded the training times of the classifiers for each snapshot. For all data for Kraken 2, we observed that as more genomes were added, more microbes from the sample were classified. CLARK had a similar trend, but in the final year, this trend reversed with the microbial variation and less unique k-mers. Also, both classifiers, while having different ways of training, generally are linear in time - but Kraken 2 has a significantly lower slope in scaling to more data.\",\"PeriodicalId\":93557,\"journal\":{\"name\":\"Frontiers in signal processing\",\"volume\":\"12 1\",\"pages\":\"\"},\"PeriodicalIF\":1.3000,\"publicationDate\":\"2022-07-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Frontiers in signal processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3389/frsip.2022.842513\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in signal processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/frsip.2022.842513","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}

引用次数: 0

摘要

有效和准确地识别生物样品中存在的微生物对医学和生物学都很重要。例如，在医学上，微生物鉴定使医生能够更好地诊断疾病。宏基因组分析(从患者/环境样本中随机取样DNA的分析)有两个关键问题:如何准确识别样本中的微生物，以及如何在新的微生物基因组测序并添加到参考数据库时有效地更新分类分类器。为了研究分类器在接受更多知识训练时是如何变化的，我们制作了由过去几年存在的基因组组成的子数据库，作为NCBI参考基因组数据库的“快照”(1999-2020)。我们评估了两种分类方法，Kraken 2和CLARK，使用这些快照使用来自人类肠道的真实实验性宏基因组样本。这使我们能够测量使用这些方法和随着数据库的增长，真实样本中有多少可以自信地分类。尽管不知道实际情况，但我们可以使用布雷-柯蒂斯距离测量方法之间的一致性以及每种方法中数据库年份之间的一致性。此外，我们还记录了每个快照的分类器的训练次数。对于Kraken 2的所有数据，我们观察到，随着更多的基因组被添加，更多的样本微生物被分类。CLARK也有类似的趋势，但在最后一年，这种趋势随着微生物的变化和较少独特的k-mers而逆转。此外，这两个分类器虽然有不同的训练方式，但通常在时间上是线性的——但Kraken 2在扩展到更多数据方面的斜率明显较低。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

How Scalable Are Clade-Specific Marker K-Mer Based Hash Methods for Metagenomic Taxonomic Classification?

Efficiently and accurately identifying which microbes are present in a biological sample is important to medicine and biology. For example, in medicine, microbe identification allows doctors to better diagnose diseases. Two questions are essential to metagenomic analysis (the analysis of a random sampling of DNA in a patient/environment sample): How to accurately identify the microbes in samples and how to efficiently update the taxonomic classifier as new microbe genomes are sequenced and added to the reference database. To investigate how classifiers change as they train on more knowledge, we made sub-databases composed of genomes that existed in past years that served as “snapshots in time” (1999–2020) of the NCBI reference genome database. We evaluated two classification methods, Kraken 2 and CLARK with these snapshots using a real, experimental metagenomic sample from a human gut. This allowed us to measure how much of a real sample could confidently classify using these methods and as the database grows. Despite not knowing the ground truth, we could measure the concordance between methods and between years of the database within each method using a Bray-Curtis distance. In addition, we also recorded the training times of the classifiers for each snapshot. For all data for Kraken 2, we observed that as more genomes were added, more microbes from the sample were classified. CLARK had a similar trend, but in the final year, this trend reversed with the microbial variation and less unique k-mers. Also, both classifiers, while having different ways of training, generally are linear in time - but Kraken 2 has a significantly lower slope in scaling to more data.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Frontiers in signal processing

自引率

0.00%

发文量