A benchmark analysis of feature selection and machine learning methods for environmental metabarcoding datasets

IF 4.4 · Region 2, Biology · JCR Q2 BIOCHEMISTRY & MOLECULAR BIOLOGY

Computational and structural biotechnology journal · Pub Date: 2025-01-01 · DOI: 10.1016/j.csbj.2025.04.017

Erik Zschaubitz, Henning Schröder, Conor Christopher Glackin, Lukas Vogel, Matthias Labrenz, Theodor Sperlea

Volume 27, Pages 1636-1647 · Journal Article · https://www.sciencedirect.com/science/article/pii/S2001037025001382

Citations: 0

Abstract


Next-Generation Sequencing methods like DNA metabarcoding enable the generation of large community composition datasets and have grown instrumental in many branches of ecology in recent years. However, the sparsity, compositionality, and high dimensionality of metabarcoding datasets pose challenges in data analysis. In theory, feature selection methods improve the analyzability of eDNA metabarcoding datasets by identifying a subset of informative taxa that are relevant for a certain task and discarding those that are redundant or irrelevant. However, general guidelines on selecting a feature selection method for application to a given setting are lacking. Here, we report a comparison of feature selection methods in a supervised machine learning setup across 13 environmental metabarcoding datasets with differing characteristics. We evaluate workflows that consist of data preprocessing, feature selection and a machine learning model by their ability to capture the ecological relationship between the microbial community composition and environmental parameters. Our results demonstrate that, while the optimal feature selection approach depends on dataset characteristics, feature selection is more likely to impair model performance than to improve it for tree ensemble models like Random Forests. Furthermore, our results show that calculating relative counts impairs model performance, which suggests that novel methods to combat the compositionality of metabarcoding data are required.
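The kind of workflow comparison described in the abstract (preprocessing, optional feature selection, then a tree ensemble model) can be sketched roughly as follows. This is only an illustrative sketch using scikit-learn, not the authors' pipeline: the count table is synthetic, the target variable is a hypothetical environmental parameter, and the choices of `SelectKBest` with mutual information, `k=20`, and five-fold cross-validation are assumptions made for the example. Scores on this toy data say nothing about the results reported in the paper.

```python
from functools import partial

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

rng = np.random.default_rng(0)

# Synthetic stand-in for a taxa count table: 80 samples x 200 taxa, mostly zeros.
n_samples, n_taxa = 80, 200
counts = rng.negative_binomial(n=1, p=0.05, size=(n_samples, n_taxa)).astype(float)
counts[rng.random((n_samples, n_taxa)) < 0.7] = 0.0

# Hypothetical environmental parameter driven by the first five taxa plus noise.
y = np.log1p(counts[:, :5]).sum(axis=1) + rng.normal(0.0, 0.5, n_samples)


def to_relative(X):
    """Convert counts to per-sample relative abundances (each row sums to 1)."""
    totals = X.sum(axis=1, keepdims=True)
    return np.divide(X, totals, out=np.zeros_like(X), where=totals > 0)


def make_workflow(relative, k):
    """Build one workflow: optional relative counts, optional selection, then RF."""
    steps = []
    if relative:
        steps.append(FunctionTransformer(to_relative))
    if k is not None:
        score_func = partial(mutual_info_regression, random_state=0)
        steps.append(SelectKBest(score_func, k=k))
    steps.append(RandomForestRegressor(n_estimators=100, random_state=0))
    return make_pipeline(*steps)


# Compare absolute vs. relative counts, with and without feature selection.
results = {}
for relative in (False, True):
    for k in (None, 20):
        scores = cross_val_score(make_workflow(relative, k), counts, y, cv=5, scoring="r2")
        results[(relative, k)] = scores.mean()
        print(f"relative={relative!s:5} k={k!s:4} mean R^2 = {scores.mean():.3f}")
```

Placing the selection step inside the pipeline means it is refit on each training fold, so the held-out samples never influence which taxa are kept.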


Source journal

Computational and structural biotechnology journal (Biochemistry, Genetics and Molecular Biology-Biophysics)

CiteScore: 9.30
Self-citation rate: 3.30%
Articles published: 540
Review time: 6 weeks

About the journal: Computational and Structural Biotechnology Journal (CSBJ) is an online gold open access journal publishing research articles and reviews after full peer review. All articles are published, without barriers to access, immediately upon acceptance. The journal places a strong emphasis on functional and mechanistic understanding of how molecular components in a biological process work together through the application of computational methods. Structural data may provide such insights, but they are not a prerequisite for publication in the journal. Specific areas of interest include, but are not limited to: structure and function of proteins, nucleic acids and other macromolecules; structure and function of multi-component complexes; protein folding, processing and degradation; enzymology; computational and structural studies of plant systems; microbial informatics; genomics; proteomics; metabolomics; algorithms and hypothesis in bioinformatics; mathematical and theoretical biology; computational chemistry and drug discovery; microscopy and molecular imaging; nanotechnology; systems and synthetic biology.