聚类后差异表达分析中去除双浸偏差的统计软件包。

IF 1.6 4区生物学 Q4 BIOCHEMICAL RESEARCH METHODS

Journal of Computational Biology Pub Date : 2025-10-08 DOI:10.1177/15578666251383562

Christy Lee, Dongyuan Song, Siqi Chen, Jingyi Jessica Li

{"title":"聚类后差异表达分析中去除双浸偏差的统计软件包。","authors":"Christy Lee, Dongyuan Song, Siqi Chen, Jingyi Jessica Li","doi":"10.1177/15578666251383562","DOIUrl":null,"url":null,"abstract":"Typical pipelines for single-cell and spatial transcriptomics involve clustering cells or spatial spots, followed by post-clustering differential expression (DE) analysis to identify marker genes for annotating clusters as cell types or spatial domains. However, using the same data for both clustering and DE analysis-a problem known as double-dipping-can lead to spurious detection of DE genes. In particular, over-clustering can produce artificial clusters that are incorrectly interpreted as distinct cell types or spatial domains. To address this issue, the ClusterDE R package implements a statistical method using a synthetic null dataset, which consists of a single homogeneous cell population or spatial domain but is constructed to match the real dataset in terms of gene means, variances, and gene-gene rank correlations. By serving as a parallel negative control, the synthetic null data allow users to identify and remove false-positive DE genes arising from double-dipping. This article introduces the ClusterDE R package and provides practical guidance on installation and usage for more reliable marker gene detection following clustering.","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":""},"PeriodicalIF":1.6000,"publicationDate":"2025-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ClusterDE: A Statistical Software Package for Removing Double-Dipping Bias in Post-Clustering Differential Expression Analysis.\",\"authors\":\"Christy Lee, Dongyuan Song, Siqi Chen, Jingyi Jessica Li\",\"doi\":\"10.1177/15578666251383562\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Typical pipelines for single-cell and spatial transcriptomics involve clustering cells or spatial spots, followed by post-clustering differential expression (DE) analysis to identify marker genes for annotating clusters as cell types or spatial domains. However, using the same data for both clustering and DE analysis-a problem known as double-dipping-can lead to spurious detection of DE genes. In particular, over-clustering can produce artificial clusters that are incorrectly interpreted as distinct cell types or spatial domains. To address this issue, the ClusterDE R package implements a statistical method using a synthetic null dataset, which consists of a single homogeneous cell population or spatial domain but is constructed to match the real dataset in terms of gene means, variances, and gene-gene rank correlations. By serving as a parallel negative control, the synthetic null data allow users to identify and remove false-positive DE genes arising from double-dipping. This article introduces the ClusterDE R package and provides practical guidance on installation and usage for more reliable marker gene detection following clustering.\",\"PeriodicalId\":15526,\"journal\":{\"name\":\"Journal of Computational Biology\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":1.6000,\"publicationDate\":\"2025-10-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Computational Biology\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://doi.org/10.1177/15578666251383562\",\"RegionNum\":4,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"BIOCHEMICAL RESEARCH METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Computational Biology","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1177/15578666251383562","RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}

引用次数: 0

摘要

单细胞和空间转录组学的典型方法包括聚集细胞或空间点，然后进行聚类后差异表达（DE）分析，以识别标记基因，将集群注释为细胞类型或空间域。然而，在聚类和DE分析中使用相同的数据（称为双重检测的问题）可能导致对DE基因的错误检测。特别是，过度聚类会产生人为的集群，这些集群被错误地解释为不同的细胞类型或空间域。为了解决这个问题，ClusterDE R包实现了一种使用合成空数据集的统计方法，该数据集由单个同质细胞群或空间域组成，但在基因均值、方差和基因-基因秩相关性方面与真实数据集相匹配。通过作为平行阴性对照，合成的零数据允许用户识别和去除由双浸产生的假阳性DE基因。本文介绍了ClusterDE R包，并提供了关于安装和使用的实用指导，以便在聚类之后进行更可靠的标记基因检测。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

ClusterDE: A Statistical Software Package for Removing Double-Dipping Bias in Post-Clustering Differential Expression Analysis.

Typical pipelines for single-cell and spatial transcriptomics involve clustering cells or spatial spots, followed by post-clustering differential expression (DE) analysis to identify marker genes for annotating clusters as cell types or spatial domains. However, using the same data for both clustering and DE analysis-a problem known as double-dipping-can lead to spurious detection of DE genes. In particular, over-clustering can produce artificial clusters that are incorrectly interpreted as distinct cell types or spatial domains. To address this issue, the ClusterDE R package implements a statistical method using a synthetic null dataset, which consists of a single homogeneous cell population or spatial domain but is constructed to match the real dataset in terms of gene means, variances, and gene-gene rank correlations. By serving as a parallel negative control, the synthetic null data allow users to identify and remove false-positive DE genes arising from double-dipping. This article introduces the ClusterDE R package and provides practical guidance on installation and usage for more reliable marker gene detection following clustering.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Journal of Computational Biology 生物-计算机：跨学科应用

CiteScore

3.60

自引率

5.90%

发文量

113

审稿时长

6-12 weeks

期刊介绍： Journal of Computational Biology is the leading peer-reviewed journal in computational biology and bioinformatics, publishing in-depth statistical, mathematical, and computational analysis of methods, as well as their practical impact. Available only online, this is an essential journal for scientists and students who want to keep abreast of developments in bioinformatics. Journal of Computational Biology coverage includes: -Genomics -Mathematical modeling and simulation -Distributed and parallel biological computing -Designing biological databases -Pattern matching and pattern detection -Linking disparate databases and data -New tools for computational biology -Relational and object-oriented database technology for bioinformatics -Biological expert system design and use -Reasoning by analogy, hypothesis formation, and testing by machine -Management of biological databases