Francisco Pascoal , Rodrigo Costa , Luís Torgo , Catarina Magalhães , Paula Branco
Architecture and implementation of ulrb algorithm in R
Ecological Informatics, volume 90, Article 103229 (journal article; impact factor 5.8; JCR Q1, Ecology)
Publication date: 2025-05-28
DOI: 10.1016/j.ecoinf.2025.103229
URL: https://www.sciencedirect.com/science/article/pii/S1574954125002389
Abstract
Low-abundance microorganisms, often referred to as the “rare biosphere”, play a crucial role in ecosystem resistance and resilience but remain challenging to study. One of the main difficulties is the lack of an appropriate definition of rare taxa. Most studies use relative abundance thresholds (e.g., 0.1% relative abundance per sample) to distinguish rare from abundant taxa within a microbial community. This is inappropriate because such thresholds are arbitrary and lack biological meaning. To solve this problem, we have proposed using unsupervised machine learning through the ulrb (“Unsupervised Learning Definition of the Microbial Rare Biosphere”) algorithm, implemented as an R package (v0.1.8). The algorithm applies partitioning around medoids (PAM) to cluster the taxa of a community by abundance, for any number of samples. Based on the clusters, ulrb by default automatically classifies taxa as “rare”, “undetermined” or “abundant”. The package includes functions for all analytical steps necessary to define the rare biosphere, in four groups: 1) functions that process user data into the format required by the ulrb algorithm; 2) functions that cluster taxa into abundance classifications; 3) helper functions that evaluate detailed statistics of the clustering steps; and 4) visualization functions, focused on rank abundance curves and Silhouette scores, for assessing clustering quality. In addition, ulrb allows the user to change the number of classifications obtained and includes options for detailed reporting. In this article, we describe the architecture, code organization, and strategy of the ulrb R package. Furthermore, we use a 16S rRNA gene amplicon sequencing dataset from the Arctic Ocean to provide illustrative examples, with code, of how to use and explore the capabilities of ulrb. By explaining the architecture and implementation of ulrb, this study allows independent groups to integrate an abundance classification step into their data analysis protocols, instead of relying on taxa labeled by inconsistent or manual strategies.
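
The classification idea described in the abstract can be illustrated with a minimal sketch. It is written in Python rather than R and does not use the ulrb package or reproduce its API. It assumes a single sample, invented read counts, and an exhaustive k-medoids search swapped in for the PAM heuristic (both minimise the same total distance to medoids, but the exhaustive search is only practical for toy data). All function names (`k_medoids_1d`, `classify_sample`, `rare_by_threshold`, `mean_silhouette`) are hypothetical.

```python
from itertools import combinations


def k_medoids_1d(values, k):
    """Return the k medoids (ascending) that minimise total absolute distance.

    Exhaustive search over all candidate medoid sets; suitable for toy data only.
    """
    candidates = sorted(set(values))
    if k > len(candidates):
        raise ValueError("k exceeds the number of distinct abundance values")
    best_cost, best_medoids = None, None
    for medoids in combinations(candidates, k):
        cost = sum(min(abs(v - m) for m in medoids) for v in values)
        if best_cost is None or cost < best_cost:
            best_cost, best_medoids = cost, medoids
    return best_medoids


def classify_sample(abundances, labels=("Rare", "Undetermined", "Abundant")):
    """Label each taxon of one sample by its nearest medoid.

    Medoids are sorted ascending, so the lowest-abundance cluster gets labels[0].
    """
    medoids = k_medoids_1d(list(abundances.values()), len(labels))
    return {
        taxon: labels[min(range(len(medoids)), key=lambda i: abs(count - medoids[i]))]
        for taxon, count in abundances.items()
    }


def rare_by_threshold(abundances, threshold=0.001):
    """Fixed-threshold baseline: taxa below `threshold` relative abundance."""
    total = sum(abundances.values())
    return {taxon for taxon, count in abundances.items() if count / total < threshold}


def mean_silhouette(abundances, classes):
    """Average silhouette width (absolute distance); needs at least two clusters."""
    clusters = {}
    for taxon, label in classes.items():
        clusters.setdefault(label, []).append(abundances[taxon])
    scores = []
    for taxon, label in classes.items():
        x = abundances[taxon]
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the taxon's own cluster
        a = sum(abs(x - y) for y in own) / (len(own) - 1)
        # b: mean distance to the members of the closest other cluster
        b = min(
            sum(abs(x - y) for y in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Invented read counts for one sample (illustrative only).
sample = {"otu1": 1, "otu2": 2, "otu3": 3, "otu4": 40, "otu5": 45,
          "otu6": 50, "otu7": 900, "otu8": 950, "otu9": 1000}

classes = classify_sample(sample)
print(classes)
print(sorted(rare_by_threshold(sample)))
print(round(mean_silhouette(sample, classes), 2))
```

On this toy sample, the 0.1% threshold marks otu1 and otu2 as rare but not otu3, whose count differs by a single read, whereas the clustering places all three in the "Rare" group. The mean silhouette width gives a rough indication of how well separated the three groups are.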
About the journal:
The journal Ecological Informatics is devoted to the publication of high-quality, peer-reviewed articles on all aspects of computational ecology, data science and biogeography. The scope of the journal takes into account the data-intensive nature of ecology, the growing capacity of information technology to access, harness and leverage complex data, and the critical need to inform sustainable management in view of global environmental and climate change.
The nature of the journal is interdisciplinary at the crossover between ecology and informatics. It focuses on novel concepts and techniques for image- and genome-based monitoring and interpretation, sensor- and multimedia-based data acquisition, internet-based data archiving and sharing, data assimilation, modelling and prediction of ecological data.