SUMA: a lightweight machine learning model-powered shared nearest neighbour-based clustering application interface for scRNA-Seq data.

Impact factor: 0.9

Turkish journal of biology = Turk biyoloji dergisi Pub Date : 2023-12-18 eCollection Date: 2023-01-01 DOI:10.55730/1300-0152.2675

Hamza Umut Karakurt, Pınar Pir

Turkish journal of biology = Turk biyoloji dergisi, volume 47, issue 6, pages 413-422. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11045205/pdf/

Citations: 0

Abstract

Background/aim: Single-cell transcriptomics (scRNA-Seq) explores cellular diversity at the gene expression level. Because scRNA-Seq data are inherently sparse and noisy and the types of sequenced cells are uncertain, effective clustering and cell type annotation are essential. Graph-based clustering of scRNA-Seq data is a simple yet powerful approach that represents the data as a "shared nearest neighbour" graph and clusters the cells using graph clustering algorithms. These algorithms depend on several user-defined parameters. Here we present SUMA, a lightweight tool that uses a random forest model to predict the number of neighbours that yields the best clustering results. Moreover, we integrated our method with other commonly used methods in an RShiny application. SUMA can be used in a local environment (https://github.com/hkarakurt8742/SUMA) or as a browser tool (https://hkarakurt.shinyapps.io/suma/).
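
To make the "shared nearest neighbour" idea concrete, the following is a minimal sketch in plain Python, not SUMA's implementation. It assumes Euclidean distance, counts each cell as a member of its own neighbourhood, and weights an edge by the Jaccard overlap of the two neighbourhoods; the function names `knn_sets` and `snn_graph` and the toy coordinates are hypothetical. The parameter `k` plays the role of the number of neighbours that SUMA aims to predict.

```python
from math import dist


def knn_sets(points, k):
    """For each point, return the set holding its own index and the
    indices of its k nearest neighbours (Euclidean distance)."""
    neighbourhoods = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )
        # Assumption: a cell belongs to its own neighbourhood.
        neighbourhoods.append({i, *others[:k]})
    return neighbourhoods


def snn_graph(points, k):
    """Build a shared-nearest-neighbour graph as a dict mapping
    (i, j) pairs with i < j to a Jaccard-overlap edge weight."""
    nn = knn_sets(points, k)
    edges = {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            shared = len(nn[i] & nn[j])
            if shared:  # keep only pairs that share at least one neighbour
                edges[(i, j)] = shared / len(nn[i] | nn[j])
    return edges


# Toy data: two well-separated groups of three "cells" in 2D.
cells = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
graph = snn_graph(cells, k=2)
```

With this toy input, edges appear only between cells of the same group, which is the structure a graph clustering algorithm would then exploit. A k that is too small fragments the graph, and a k that is too large merges distinct groups.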

Materials and methods: Publicly available scRNA-Seq datasets and 3 different graph-based clustering algorithms were used to develop SUMA, and a wide range of values for the number of neighbours and the number of variant genes was considered. The quality of clustering was assessed by computing the adjusted Rand index (ARI) against the true labels of each dataset. The data were split into training and test datasets, and the model was built and optimised using the Scikit-learn (Python) and randomForest (R) libraries.
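
As an illustration of the evaluation metric, the sketch below computes the adjusted Rand index from two label vectors using the standard pair-counting formula. It is a plain-Python illustration rather than the authors' code; the function name `adjusted_rand_index` and the example labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, predicted_labels):
    """Adjusted Rand index between two labelings of the same items."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label vectors must have the same length")

    # Contingency counts and the two marginal cluster sizes.
    joint = Counter(zip(true_labels, predicted_labels))
    true_sizes = Counter(true_labels)
    pred_sizes = Counter(predicted_labels)

    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_true = sum(comb(c, 2) for c in true_sizes.values())
    sum_pred = sum(comb(c, 2) for c in pred_sizes.values())
    total_pairs = comb(len(true_labels), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: labelings agree trivially

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    maximum = (sum_true + sum_pred) / 2           # best achievable agreement
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_joint - expected) / (maximum - expected)


same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])      # same partition, renamed
opposite = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # worse than chance
```

An ARI of 1 means the clustering reproduces the true partition regardless of how the clusters are named, values near 0 correspond to chance-level agreement, and negative values indicate agreement below chance.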

Results: The accuracy of our machine learning model was 0.96, while the AUC of the ROC curve was 0.98. The model indicated that the number of cells in scRNA-Seq data is the most important feature when deciding the number of neighbours.
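
For readers unfamiliar with the two reported metrics, the sketch below shows how accuracy and ROC AUC can be computed for a binary classifier. The abstract does not state how the prediction task was framed, so the binary labels, the scores, and the function names `accuracy` and `roc_auc` are assumptions made purely for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """ROC AUC as the probability that a randomly chosen positive is
    scored above a randomly chosen negative (ties count as one half)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and classifier outputs.
labels = [0, 0, 1, 1]
predicted = [0, 1, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

acc = accuracy(labels, predicted)
auc = roc_auc(labels, scores)
```

Accuracy depends on the decision threshold used to turn scores into predicted labels, whereas AUC summarises how well the scores rank positives above negatives across all thresholds.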

Conclusion: We developed and evaluated the SUMA model and implemented the method in the SUMAShiny app, which integrates SUMA with different clustering methods and enables users without a bioinformatics background to cluster and visualise their scRNA-Seq data easily. The SUMAShiny app is available for both desktop and browser use.

