Gene representation in scRNA-seq is correlated with common motifs at the 3' end of transcripts.

IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics | Pub Date: 2023-05-15 | eCollection Date: 2023-01-01 | DOI: 10.3389/fbinf.2023.1120290

Xinling Li, Greg Gibson, Peng Qiu



Abstract

One important characteristic of single-cell RNA sequencing (scRNA-seq) data is its high sparsity: the gene-cell count matrix contains a high proportion of zeros. This sparsity has motivated widespread discussion of dropouts and missing data, as well as imputation algorithms for scRNA-seq analysis. Here, we investigate whether some genes are more prone to under-detection in scRNA-seq and, if so, what commonalities those genes share. From public data sources, we gathered paired bulk RNA-seq and scRNA-seq data from 53 human samples generated in diverse biological contexts. We derived pseudo-bulk gene expression by averaging the scRNA-seq data across cells. Comparing the paired bulk and pseudo-bulk gene expression profiles revealed a collection of genes that are frequently under-detected in scRNA-seq relative to bulk RNA-seq. This result was robust to randomization, in which unpaired bulk and pseudo-bulk gene expression profiles were compared. We performed a motif search on the last 350 bp of the identified genes and observed an enrichment of a poly(T) motif. The poly(T) motif near the tails of those genes may form hairpin structures with the poly(A) tails of their mRNA transcripts, making the transcripts difficult to capture during scRNA-seq library preparation. This is a mechanistic conjecture for why certain genes may be more prone to under-detection in scRNA-seq.
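The steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the counts-per-million normalization, the log2 gap threshold, and all function names (`pseudo_bulk`, `log_cpm`, `under_detected`, `longest_t_run`) are assumptions made for illustration, and a simple scan for runs of T stands in for the motif search, which the abstract does not specify.

```python
import numpy as np


def pseudo_bulk(counts):
    """Average a cells x genes count matrix across cells (one value per gene)."""
    return np.asarray(counts, dtype=float).mean(axis=0)


def log_cpm(expr, pseudocount=1.0):
    """Scale a profile to counts per million and log2-transform it.

    This normalization is an assumption; the abstract does not state one.
    """
    expr = np.asarray(expr, dtype=float)
    return np.log2(expr / expr.sum() * 1e6 + pseudocount)


def under_detected(bulk, sc_counts, min_log2_gap=1.0):
    """Return indices of genes whose bulk expression exceeds the paired
    pseudo-bulk expression by at least `min_log2_gap` (hypothetical threshold)."""
    gap = log_cpm(bulk) - log_cpm(pseudo_bulk(sc_counts))
    return np.flatnonzero(gap >= min_log2_gap)


def longest_t_run(sequence, tail_length=350):
    """Length of the longest run of T in the last `tail_length` bases.

    A crude stand-in for a motif search, not the method used in the paper.
    """
    tail = sequence[-tail_length:].upper()
    best = run = 0
    for base in tail:
        run = run + 1 if base == "T" else 0
        best = max(best, run)
    return best


# Toy example: 3 cells x 4 genes, paired with one bulk profile.
sc_counts = [[10, 0, 5, 5],
             [10, 0, 5, 5],
             [10, 1, 5, 5]]
bulk = [100, 50, 50, 50]

flagged = under_detected(bulk, sc_counts)  # gene index 1 is flagged here
t_run = longest_t_run("ACGTTTTTTA")        # longest T run in a toy sequence
```

In a study with many paired samples, the same comparison would be repeated per sample, and a gene would count as frequently under-detected only if it were flagged in a large share of them. That share is another threshold this sketch leaves open.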


Source journal

Frontiers in bioinformatics

CiteScore

2.60

Self-citation rate

0.00%

Publication count