Unsupervised encoding selection through ensemble pruning for biomedical classification.

IF 4, Zone 3 (Biology), Q1 Mathematical & Computational Biology

BioData Mining, vol. 16, no. 1, article 10. Pub Date: 2023-03-16. DOI: 10.1186/s13040-022-00317-7

Sebastian Spänig, Alexander Michel, Dominik Heider


Citations: 0

Abstract

Background: Owing to the rising levels of multi-resistant pathogens, antimicrobial peptides, an alternative to classic antibiotics, have received increasing attention. A crucial bottleneck is their costly identification and validation. With the ever-growing number of annotated peptides, researchers leverage artificial intelligence to circumvent the cumbersome, wet-lab-based identification and to automate the detection of promising candidates. However, the prediction of a peptide's function is not limited to antimicrobial efficacy. To date, multiple studies have successfully classified additional properties, e.g., antiviral or cell-penetrating effects. In this light, ensemble classifiers are employed to further improve the prediction. Although we recently presented a workflow that significantly narrows the initial choice of encodings, a fully unsupervised encoding selection that considers various machine learning models is still lacking.

Results: We developed a workflow that automatically selects encodings and generates classifier ensembles by employing sophisticated pruning methods. We observed that Pareto frontier pruning is a good method for creating encoding ensembles for the datasets at hand. In addition, encodings combined with the Decision Tree classifier as the base model are often superior. However, our results also demonstrate that no single ensemble-building technique performs best across all datasets.

Conclusion: The workflow applies multiple pruning methods to evaluate ensemble classifiers composed of a wide range of peptide encodings and base models. Consequently, researchers can use the workflow for unsupervised encoding selection and ensemble creation. Ultimately, the extensible workflow can be used as a plugin for the PEPTIDE REACToR, further establishing it as a versatile tool in the domain.
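The abstract names Pareto frontier pruning but does not state which criteria the workflow optimizes. As a minimal sketch of the general idea only, and not the authors' implementation, the Python below assumes each candidate (an encoding paired with a base model) has two hypothetical scores where higher is better, a predictive score and a diversity score, and keeps every candidate that no other candidate dominates. The class `Candidate`, the functions `dominates` and `pareto_frontier`, and all names and numbers are illustrative assumptions.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str         # hypothetical label: "<encoding>+<base model>"
    score: float      # assumed predictive score, higher is better
    diversity: float  # assumed diversity measure, higher is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on both criteria and better on one."""
    no_worse = a.score >= b.score and a.diversity >= b.diversity
    strictly_better = a.score > b.score or a.diversity > b.diversity
    return no_worse and strictly_better


def pareto_frontier(candidates: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that are not dominated by any other candidate."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Made-up values for illustration; not results from the paper.
candidates = [
    Candidate("enc_a+dt", 0.80, 0.10),
    Candidate("enc_b+dt", 0.75, 0.30),
    Candidate("enc_c+rf", 0.70, 0.20),  # dominated by enc_b+dt on both criteria
    Candidate("enc_d+lr", 0.60, 0.40),
]

ensemble = pareto_frontier(candidates)
print([c.name for c in ensemble])
```

In this toy input, `enc_c+rf` is pruned because `enc_b+dt` is better on both criteria; the remaining three candidates each trade one criterion against the other and would form the pruned ensemble.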

Source journal

BioData Mining (Mathematical & Computational Biology)

CiteScore

7.90

Self-citation rate

0.00%

Articles published

Review time

23 weeks

Journal description: BioData Mining is an open access, open peer-reviewed journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data. Topical areas include, but are not limited to:
- Development, evaluation, and application of novel data mining and machine learning algorithms.
- Adaptation, evaluation, and application of traditional data mining and machine learning algorithms.
- Open-source software for the application of data mining and machine learning algorithms.
- Design, development, and integration of databases, software, and web services for the storage, management, retrieval, and analysis of data from large-scale studies.
- Pre-processing, post-processing, modeling, and interpretation of data mining and machine learning results for biological interpretation and knowledge discovery.