Discovery of optimal cell type classification marker genes from single cell RNA sequencing data.

BMC methods. Pub Date: 2024-01-01. Epub Date: 2024-11-04. DOI: 10.1186/s44330-024-00015-2

Angela Liu, Beverly Peng, Ajith V Pankajam, Thu Elizabeth Duong, Gloria Pryhuber, Richard H Scheuermann, Yun Zhang


Citations: 0

Abstract



Background: Single cell/nucleus RNA sequencing (scRNA-seq) technologies, which quantitatively describe cell transcriptional phenotypes, are revolutionizing our understanding of cell biology and yielding new insights into cell type identification, disease mechanisms, and drug development. The rapid growth of scRNA-seq data poses new challenges in efficiently characterizing data-driven cell types and identifying quantifiable marker genes for cell type classification. Machine learning and explainable artificial intelligence have emerged as effective approaches for studying large-scale scRNA-seq data.

Methods: NS-Forest is a random forest-based machine learning algorithm that aims to provide a scalable, data-driven solution for identifying minimum combinations of necessary and sufficient marker genes that capture cell type identity with maximum classification accuracy. Here, we describe the latest version, NS-Forest version 4.0, and its companion Python package (https://github.com/JCVenterInstitute/NSForest). Its enhancements select marker gene combinations that exhibit highly selective expression patterns among closely related cell types and perform marker gene selection more efficiently for large-scale scRNA-seq data atlases with millions of cells.
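The idea of a "minimum combination" of markers with maximum classification accuracy can be illustrated with a small sketch. This is not the NS-Forest implementation or its API: it swaps the random forest ranking for an exhaustive search over small gene combinations, and it assumes each gene has already been binarized into an on/off call per cell. The function names, the gene names, and the choice of `beta=0.5` are all hypothetical and chosen for illustration only.

```python
from itertools import combinations


def fbeta(y_true, y_pred, beta=0.5):
    """F-beta score for binary labels; beta < 1 weights precision over recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


def best_marker_combo(expressed, is_target, max_size=2, beta=0.5):
    """Exhaustive stand-in for marker selection (assumption, not NS-Forest).

    expressed: dict mapping gene -> list of bool (gene called "on" per cell).
    is_target: list of bool (cell belongs to the target cell type).
    Returns (combo, score) for the smallest combination with the top score.
    """
    best_combo, best_score = (), -1.0
    for size in range(1, max_size + 1):
        for combo in combinations(sorted(expressed), size):
            # A cell is called positive only if every gene in the combo is on.
            pred = [all(expressed[g][i] for g in combo)
                    for i in range(len(is_target))]
            score = fbeta(is_target, pred, beta)
            if score > best_score:  # strict ">" keeps the smaller combo on ties
                best_combo, best_score = combo, score
    return best_combo, best_score


# Toy data: 8 cells, the first 4 belong to the target cell type.
expressed = {
    "GENE_A": [True, True, True, True, True, True, False, False],
    "GENE_B": [True, True, True, True, False, False, True, False],
    "GENE_C": [True, True, False, False, False, False, False, False],
}
is_target = [True] * 4 + [False] * 4
combo, score = best_marker_combo(expressed, is_target)
print(combo, score)
```

In this toy data no single gene is both necessary and sufficient: GENE_A and GENE_B each leak into non-target cells, and GENE_C misses half of the target cells. Only the pair GENE_A plus GENE_B classifies every cell correctly, which is why the search returns a two-gene combination.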

Results: By modularizing the final decision tree step, NS-Forest v4.0 can be used to compare the performance of user-defined marker genes with that of the computationally derived NS-Forest marker genes, based on the decision tree classifiers. To quantify how well the identified markers exhibit the desired pattern of being exclusively expressed at high levels within their target cell types, we introduce the On-Target Fraction metric, which ranges from 0 to 1; a value of 1 is assigned to markers that are expressed only within their target cell types and not in cells of any other cell type. NS-Forest v4.0 outperforms previous versions in simulation studies and in its ability to identify markers with higher On-Target Fraction values for closely related cell types in real data. It also outperforms other marker gene selection approaches for cell type classification, with significantly higher F-beta scores when applied to datasets from three human organs: brain, kidney, and lung.
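A metric with the properties described for the On-Target Fraction can be sketched as follows. The abstract does not give the formula, so this sketch assumes one plausible definition: the median expression of a gene in the target cell type divided by the sum of its median expression across all cell types. The function name and the toy data are hypothetical, and the 0 to 1 range holds only for non-negative expression values.

```python
from statistics import median


def on_target_fraction(expression, cell_types, target):
    """Assumed definition: target median / sum of per-type medians.

    expression: non-negative expression values for one gene, one per cell.
    cell_types: cell type label for each cell, in the same order.
    Returns 0.0 if the gene has zero median expression in every cell type.
    """
    by_type = {}
    for value, label in zip(expression, cell_types):
        by_type.setdefault(label, []).append(value)
    medians = {label: median(values) for label, values in by_type.items()}
    total = sum(medians.values())
    return medians[target] / total if total > 0 else 0.0


# Toy data: three cell types with three cells each.
cell_types = ["T"] * 3 + ["B"] * 3 + ["NK"] * 3
specific_gene = [5, 6, 7, 0, 0, 0, 0, 0, 0]  # expressed only in T cells
shared_gene = [4, 4, 4, 4, 4, 4, 0, 0, 0]    # expressed equally in T and B cells

print(on_target_fraction(specific_gene, cell_types, "T"))
print(on_target_fraction(shared_gene, cell_types, "T"))
```

Under this assumed definition, a gene expressed only in its target cell type scores 1, while a gene shared equally with one other cell type scores 0.5, which matches the behavior the abstract describes for closely related cell types.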

Discussion: Finally, we discuss potential use cases of the NS-Forest marker genes for the broad user community, including the design of spatial transcriptomics gene panels and the semantic representation of cell types in biomedical ontologies.
