Tim O. Nieuwenhuis, Hunter H. Giles, Jeremy V.A. Arking, Arun H. Patil, Wen Shi, Matthew N. McCall, Marc K. Halushka
Laboratory Investigation, volume 104, issue 6, Article 102069. Published 2024-04-24. DOI: 10.1016/j.labinv.2024.102069
https://www.sciencedirect.com/science/article/pii/S0023683724017471
Citations: 0
Abstract
Patterns of Unwanted Biological and Technical Expression Variation Among 49 Human Tissues
Tissue gene expression studies are impacted by biological and technical sources of variation, which can be broadly classified into wanted and unwanted variation. The latter, if not addressed, results in misleading biological conclusions. Methods have been proposed to reduce unwanted variation, such as normalization and batch correction. A more accurate understanding of all causes of variation could significantly improve the ability of these methods to remove unwanted variation while retaining variation corresponding to the biological question of interest. We used 17,282 samples from 49 human tissues in the Genotype-Tissue Expression data set (v8) to investigate patterns and causes of expression variation. Transcript expression was transformed to z-scores, and only the most variable 2% of transcripts were evaluated and clustered based on coexpression patterns. Clustered gene sets were assigned to different biological or technical causes based on histologic appearances and metadata elements. We identified 522 variable transcript clusters (median: 11 per tissue) among the samples. Of these, 63% were confidently explained, 16% were likely explained, 7% had low-confidence explanations, and 14% had no clear cause. Histologic analysis annotated 46 clusters. Other common causes of variability included sex, sequencing contamination, immunoglobulin diversity, and compositional tissue differences. Less common biological causes included death interval (Hardy score), disease status, and age. Technical causes included blood draw timing and harvesting differences. Many of the causes of variation in bulk tissue expression were identifiable in the Tabula Sapiens data set of single-cell expression. This is among the largest explorations of the underlying sources of tissue expression variation. It uncovered expected and unexpected causes of variable gene expression and demonstrated the utility of matched histologic specimens. It further demonstrated the value of acquiring meaningful tissue harvesting metadata elements to use for improved normalization, batch correction, and analysis of both bulk and single-cell RNA-seq data.
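The selection-and-clustering step described in the abstract (z-scores, the most variable 2% of transcripts, grouping by coexpression) can be illustrated with a minimal sketch. This is not the authors' pipeline. The data are synthetic, the transcript names (`SEX_*`, `HARDY_*`, `BG_*`) are hypothetical, and several choices are assumptions made only for illustration: transcripts are ranked by variance before z-scoring (z-scored rows all have unit variance), similarity is Pearson correlation, the 0.8 cutoff is arbitrary, and connected components stand in for whatever clustering method the study used.

```python
import statistics
import random
from itertools import combinations


def zscore(values):
    """Center to mean 0 and scale to unit sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def top_variable(expr, fraction):
    """Names of the most variable transcripts (at least one is kept)."""
    ranked = sorted(expr, key=lambda t: statistics.variance(expr[t]), reverse=True)
    return ranked[:max(1, round(len(ranked) * fraction))]


def pearson(zx, zy):
    """Pearson correlation of two already z-scored vectors."""
    return sum(a * b for a, b in zip(zx, zy)) / (len(zx) - 1)


def coexpression_clusters(z, min_r):
    """Connected components of the graph linking transcripts whose
    correlation is at least min_r (a simple stand-in for clustering)."""
    parent = {t: t for t in z}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path compression
            t = parent[t]
        return t

    for a, b in combinations(z, 2):
        if pearson(z[a], z[b]) >= min_r:
            parent[find(a)] = find(b)

    groups = {}
    for t in z:
        groups.setdefault(find(t), []).append(t)
    return sorted(sorted(g) for g in groups.values())


# Synthetic data: 40 samples, 300 transcripts, two hypothetical drivers.
random.seed(0)
n_samples = 40
sex = [i % 2 for i in range(n_samples)]          # toy binary covariate
hardy = [(i % 5) - 2 for i in range(n_samples)]  # toy ordinal covariate

expr = {}
for name in ("SEX_A", "SEX_B", "SEX_C"):
    expr[name] = [5 + 4 * s + random.gauss(0, 0.3) for s in sex]
for name in ("HARDY_A", "HARDY_B", "HARDY_C"):
    expr[name] = [5 + 3 * h + random.gauss(0, 0.3) for h in hardy]
for k in range(294):  # background transcripts with noise only
    expr[f"BG_{k}"] = [5 + random.gauss(0, 0.3) for _ in range(n_samples)]

selected = top_variable(expr, fraction=0.02)     # 2% of 300 = 6 transcripts
z = {t: zscore(expr[t]) for t in selected}
clusters = coexpression_clusters(z, min_r=0.8)
print(clusters)
```

In this toy setup the six retained transcripts separate into one group per simulated driver. With real data, each resulting cluster would then be compared against histology and sample metadata, as the abstract describes, to assign a likely cause.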
About the journal:
Laboratory Investigation is an international journal owned by the United States and Canadian Academy of Pathology. Laboratory Investigation offers prompt publication of high-quality original research in all biomedical disciplines relating to the understanding of human disease and the application of new methods to the diagnosis of disease. Both human and experimental studies are welcome.