Leveraging gene correlations in single cell transcriptomic data

Impact factor 2.9; Zone 3, Biology; JCR Q2, Biochemical Research Methods

BMC Bioinformatics Pub Date : 2024-09-18 DOI:10.1186/s12859-024-05926-z

Kai Silkwood, Emmanuel Dollinger, Joshua Gervin, Scott Atwood, Qing Nie, Arthur D. Lander


Citations: 0

Abstract

Many approaches have been developed to overcome technical noise in single cell RNA-sequencing (scRNAseq). As researchers dig deeper into data—looking for rare cell types, subtleties of cell states, and details of gene regulatory networks—there is a growing need for algorithms with controllable accuracy and fewer ad hoc parameters and thresholds. Impeding this goal is the fact that an appropriate null distribution for scRNAseq cannot simply be extracted from data in which ground truth about biological variation is unknown (i.e., usually). We approach this problem analytically, assuming that scRNAseq data reflect only cell heterogeneity (what we seek to characterize), transcriptional noise (temporal fluctuations randomly distributed across cells), and sampling error (i.e., Poisson noise). We analyze scRNAseq data without normalization—a step that skews distributions, particularly for sparse data—and calculate p values associated with key statistics. We develop an improved method for selecting features for cell clustering and identifying gene–gene correlations, both positive and negative. Using simulated data, we show that this method, which we call BigSur (Basic Informatics and Gene Statistics from Unnormalized Reads), captures even weak yet significant correlation structures in scRNAseq data. Applying BigSur to data from a clonal human melanoma cell line, we identify thousands of correlations that, when clustered without supervision into gene communities, align with known cellular components and biological processes, and highlight potentially novel cell biological relationships. New insights into functionally relevant gene regulatory networks can be obtained using a statistically grounded approach to the identification of gene–gene correlations.
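
The role of sampling error in unnormalized counts can be illustrated with a small simulation. The sketch below is not the BigSur implementation. It assumes a Poisson-only null (no transcriptional noise term), estimates expected counts from cell and gene totals, and uses hypothetical function names and simulation parameters chosen purely for illustration. It computes Pearson residuals of raw counts, a residual-based Fano factor per gene, and a residual-based gene-gene correlation.

```python
import numpy as np


def poisson_residuals(counts):
    """Pearson residuals of raw counts under a Poisson-only null.

    counts: (cells, genes) array of unnormalized counts.
    Expected count for cell i, gene j is taken to be
    (total of cell i) * (total of gene j) / (grand total).
    Assumes every cell and every gene has a nonzero total.
    """
    counts = np.asarray(counts, dtype=float)
    cell_totals = counts.sum(axis=1, keepdims=True)
    gene_totals = counts.sum(axis=0, keepdims=True)
    expected = cell_totals * gene_totals / counts.sum()
    return (counts - expected) / np.sqrt(expected)


def residual_statistics(counts):
    """Return (fano, corr): per-gene mean squared residual and
    the matrix of residual-based gene-gene correlations."""
    resid = poisson_residuals(counts)
    fano = (resid ** 2).mean(axis=0)      # close to 1 for pure Poisson noise
    cov = resid.T @ resid / resid.shape[0]
    corr = cov / np.sqrt(np.outer(fano, fano))
    return fano, corr


# Illustrative simulation (all parameter values are assumptions).
rng = np.random.default_rng(0)
n_cells, n_genes = 5000, 50

# Cells differ in sequencing depth; genes differ in mean expression.
depth = rng.lognormal(mean=0.0, sigma=0.3, size=n_cells)
base = np.full(n_genes, 3.0)
base[:2] = 4.0
rates = depth[:, None] * base[None, :]

# Genes 0 and 1 share a biological factor (mean 1, variance 0.25);
# all other genes vary only through depth and Poisson sampling.
shared = rng.gamma(shape=4.0, scale=0.25, size=n_cells)
rates[:, :2] *= shared[:, None]

counts = rng.poisson(rates)
fano, corr = residual_statistics(counts)

print("Fano, co-regulated genes:", fano[:2])
print("Mean Fano, null genes:   ", fano[2:].mean())
print("Correlation gene 0 vs 1: ", corr[0, 1])
print("Correlation gene 2 vs 3: ", corr[2, 3])
```

In this setup, the two genes that share a biological factor should show residual Fano factors above 1 and a clearly positive correlation, while the remaining genes stay near the Poisson expectation even though no normalization was applied. Attaching p values to such statistics, as the abstract describes, requires a null distribution for them, which this sketch does not attempt.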


Source journal

BMC Bioinformatics (Biology: Biochemical Research Methods)

CiteScore: 5.70

Self-citation rate: 3.30%

Articles published: 506

Review time: 4.3 months

Journal description: BMC Bioinformatics is an open access, peer-reviewed journal that considers articles on all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology. BMC Bioinformatics is part of the BMC series, which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.