Optimization of clustering parameters for single-cell RNA analysis using intrinsic goodness metrics.

IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics; Pub Date: 2025-06-11; eCollection Date: 2025-01-01; DOI: 10.3389/fbinf.2025.1562410

Nicolina Sciaraffa, Antonino Gagliano, Luigi Augugliaro, Claudia Coronnello

Frontiers in bioinformatics, volume 5, article 1562410. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12187673/pdf/


Abstract

Introduction: Accurate clustering of cell subpopulations is a crucial aspect of single-cell RNA sequencing, and the ability to subdivide cell subpopulations correctly hinges on the efficacy of unsupervised clustering. Despite advances in, and numerous adaptations of, clustering algorithms, correctly clustering cells remains challenging because the outcome depends on both the data in question and the parameters selected for the clustering process. In this context, the present study aimed to predict the accuracy of clustering methods under varying parameters by exploiting intrinsic goodness metrics.

Methods: This study used three datasets, each originating from a distinct anatomical district and each provided with a ground-truth cell annotation. Two clustering methods were employed: the Leiden algorithm and the Deep Embedding for Single-cell Clustering (DESC) algorithm. First, a robust linear mixed regression model was fitted to analyze the impact of clustering parameters on accuracy. Subsequently, fifteen intrinsic measures were calculated and used to train an ElasticNet regression model, in both intra- and cross-dataset approaches, to evaluate whether clustering accuracy can be predicted.
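The regression step in the Methods can be illustrated with a small from-scratch sketch. It fits an elastic-net penalized linear model by coordinate descent, with intrinsic metrics as features and clustering accuracy as the target. This is not the authors' implementation: the function names, the penalty parameterization (chosen here to follow the convention used by scikit-learn's ElasticNet) and the toy numbers are all assumptions made for illustration.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(X, y, alpha=0.1, l1_ratio=0.5, n_iter=200):
    """Minimize (1/2n)*||y - Xw - b||^2 + alpha*l1_ratio*||w||_1
    + 0.5*alpha*(1 - l1_ratio)*||w||^2 by cyclic coordinate descent.

    X: list of rows (one row of intrinsic metrics per clustering run).
    y: list of accuracies. Returns (weights, intercept).
    """
    n, p = len(X), len(X[0])
    # Center features and target so the intercept can be recovered afterwards.
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    residual = [v - y_mean for v in y]  # equals yc - Xc @ w while w is all zeros
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in Xc]
            z = sum(c * c for c in col) / n
            if z == 0.0:
                continue  # constant feature carries no information
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(c * (r + c * w[j]) for c, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1.0 - l1_ratio))
            delta = new_w - w[j]
            residual = [r - c * delta for r, c in zip(residual, col)]
            w[j] = new_w
    intercept = y_mean - sum(wj * mj for wj, mj in zip(w, x_mean))
    return w, intercept


def predict(X, w, intercept):
    """Predicted accuracy for each row of intrinsic metrics."""
    return [intercept + sum(wj * xj for wj, xj in zip(w, row)) for row in X]


if __name__ == "__main__":
    # Synthetic numbers for illustration only: two metrics per clustering run.
    metrics = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
    accuracy = [0.5 + 0.2 * a - 0.1 * b for a, b in metrics]
    weights, bias = fit_elastic_net(metrics, accuracy, alpha=0.01)
    print(weights, bias, predict(metrics, weights, bias))
```

In an intra-dataset setting the model would be fitted and evaluated on runs from the same dataset, whereas a cross-dataset setting would fit on some datasets and predict on another. In practice the features would normally be standardized first so that the penalty treats all metrics on a comparable scale.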

Results and discussion: The first-order interactions showed that using the UMAP method to generate the neighborhood graph and increasing the resolution have a beneficial impact on accuracy. The impact of the resolution parameter is accentuated when the number of nearest neighbors is reduced, which yields sparser and more locally sensitive graphs that better preserve fine-grained cellular relationships. Furthermore, it is advisable to test different numbers of principal components, because this parameter is strongly affected by data complexity. Intrinsic metrics enabled effective prediction of clustering accuracy. In particular, the within-cluster dispersion and the Banfield-Raftery index could be used effectively as proxies for accuracy, allowing an immediate comparison of different clustering parameter configurations.
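The two metrics highlighted in the Results can be computed directly from an embedding and a set of cluster labels. The sketch below is a minimal illustration under stated assumptions: within-cluster dispersion is taken to be the pooled sum of squared distances to each cluster centroid, and the Banfield-Raftery index is taken as the sum over clusters of n_k * log(Tr(W_k) / n_k), where Tr(W_k) is the within-cluster scatter of cluster k. The abstract does not spell out either formula, and the function names and the handling of zero-scatter clusters are hypothetical choices.

```python
import math
from collections import defaultdict


def cluster_scatter(points, labels):
    """Return {label: (n_k, sum of squared distances to the cluster centroid)}."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    scatter = {}
    for label, members in groups.items():
        n_k = len(members)
        dim = len(members[0])
        centroid = [sum(m[d] for m in members) / n_k for d in range(dim)]
        ss = sum((m[d] - centroid[d]) ** 2 for m in members for d in range(dim))
        scatter[label] = (n_k, ss)
    return scatter


def within_cluster_dispersion(points, labels):
    """Pooled within-cluster sum of squares (assumed definition)."""
    return sum(ss for _, ss in cluster_scatter(points, labels).values())


def banfield_raftery(points, labels):
    """Sum over clusters of n_k * log(Tr(W_k) / n_k) (assumed definition).

    Clusters with zero scatter (for example singletons) are skipped here,
    because the logarithm is undefined for them; this is an illustrative choice.
    """
    total = 0.0
    for n_k, ss in cluster_scatter(points, labels).values():
        if ss > 0.0:
            total += n_k * math.log(ss / n_k)
    return total


if __name__ == "__main__":
    # Toy 2-D embedding with two well-separated groups (synthetic values).
    embedding = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
    candidate_labelings = {
        "split by x": [0, 0, 1, 1],
        "split by y": [0, 1, 0, 1],
    }
    for name, labels in candidate_labelings.items():
        print(name,
              within_cluster_dispersion(embedding, labels),
              banfield_raftery(embedding, labels))
```

On this toy example the labeling that follows the true separation yields the lower value for both metrics, which is the sense in which such scores could rank clustering parameter configurations without ground-truth labels.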


Source journal

Frontiers in bioinformatics

CiteScore

2.60

Self-citation rate

0.00%

Number of articles published