Architecture and implementation of ulrb algorithm in R

IF 5.8 · Zone 2, Environmental Science and Ecology · Q1 ECOLOGY

Ecological Informatics Pub Date : 2025-05-28 DOI:10.1016/j.ecoinf.2025.103229

Francisco Pascoal , Rodrigo Costa , Luís Torgo , Catarina Magalhães , Paula Branco

Ecological Informatics, Volume 90, Article 103229 · https://www.sciencedirect.com/science/article/pii/S1574954125002389


Abstract

Low-abundance microorganisms, often referred to as the “rare biosphere”, play a crucial role in ecosystem resistance and resilience but remain challenging to study. One of the main difficulties is the lack of an appropriate definition of rare taxa. Most studies use relative abundance thresholds (e.g., 0.1% relative abundance per sample) to distinguish rare from abundant taxa within a microbial community. This is inappropriate because such thresholds are arbitrary and lack biological meaning. To solve this problem, we have proposed the use of unsupervised machine learning through the ulrb (“Unsupervised Learning Definition of the Microbial Rare Biosphere”) algorithm, implemented as an R package (v0.1.8). The algorithm applies partitioning around medoids (pam) to cluster taxa by their abundance within a community, for any number of samples. Based on the clusters, ulrb by default automatically classifies taxa as “rare”, “undetermined”, or “abundant”. The package includes functions for all analytical steps necessary to define the rare biosphere, in four groups: 1) functions that process user data into the format required by the ulrb algorithm; 2) functions that cluster taxa into abundance classifications; 3) helper functions to evaluate detailed statistics of the clustering steps; and 4) visualization functions, focused on rank abundance curves and Silhouette scores, for assessing clustering quality. In addition, ulrb allows the user to change the number of classifications obtained and includes options for detailed reporting. In this article, we describe the ulrb R package architecture, coding organization, and strategy. Furthermore, we use a 16S rRNA gene amplicon sequencing dataset from the Arctic Ocean to provide illustrative examples, with code, of how to use and explore ulrb capabilities. By explaining the architecture and implementation of ulrb, this study allows independent groups to integrate an abundance classification step into their data analysis protocols, instead of relying on taxa labeled by inconsistent or manual strategies.
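
The clustering idea described in the abstract can be illustrated with a minimal sketch in R. This is not the ulrb package's own code or API; it calls `pam()` from the `cluster` package directly on made-up abundance values for a single sample, and the rule used here to name the clusters (ordering them by the abundance of their medoid) is an assumption made for illustration. All object names and numbers below are hypothetical.

```r
# Illustrative sketch only: NOT the ulrb implementation or its API.
# Uses cluster::pam() on hypothetical abundance values for one sample.
library(cluster)

# Made-up abundance counts for 12 taxa (assumption, not real data)
abundance <- c(1, 1, 2, 2, 3, 4, 40, 45, 55, 950, 1000, 1050)
names(abundance) <- paste0("taxon_", seq_along(abundance))

# Partition the one-dimensional abundance values around k = 3 medoids
k <- 3
x <- matrix(abundance, ncol = 1,
            dimnames = list(names(abundance), "abundance"))
fit <- pam(x, k = k)

# Assumed labelling rule: rank clusters by medoid abundance (lowest first)
# and map them onto the three default labels named in the abstract
labels <- c("rare", "undetermined", "abundant")
medoid_rank <- rank(fit$medoids[, 1])
classification <- factor(labels[medoid_rank[fit$clustering]], levels = labels)

result <- data.frame(
  taxon = names(abundance),
  abundance = abundance,
  classification = classification,
  row.names = NULL
)
print(result)

# Average Silhouette width, one common summary of clustering quality
cat("Average silhouette width:", round(fit$silinfo$avg.width, 3), "\n")
```

With these toy values the three abundance groups are far apart, so the six smallest taxa are labelled “rare”, the three intermediate ones “undetermined”, and the three largest “abundant”. Real community data are rarely this cleanly separated, which is why the package pairs the classification with Silhouette-based diagnostics.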

Source journal

Ecological Informatics (Environmental Science: Ecology)

CiteScore: 8.30

Self-citation rate: 11.80%

Articles published: 346

Review time: 46 days

About the journal: Ecological Informatics is devoted to the publication of high-quality, peer-reviewed articles on all aspects of computational ecology, data science and biogeography. The scope of the journal takes into account the data-intensive nature of ecology, the growing capacity of information technology to access, harness and leverage complex data, and the critical need to inform sustainable management in view of global environmental and climate change. The journal is interdisciplinary, at the crossover between ecology and informatics. It focuses on novel concepts and techniques for image- and genome-based monitoring and interpretation, sensor- and multimedia-based data acquisition, internet-based data archiving and sharing, data assimilation, and the modelling and prediction of ecological data.