Causal Inference in Transcriptome-Wide Association Studies with Invalid Instruments and GWAS Summary Data.

Impact factor: 3.0 · Zone 1 (Mathematics) · JCR Q1 (Statistics & Probability)

Journal of the American Statistical Association. Pub Date: 2023-01-01. Epub Date: 2023-03-17. DOI: 10.1080/01621459.2023.2183127

Haoran Xue, Xiaotong Shen, Wei Pan

Volume 118, issue 543, pages 1525-1537. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10557939/pdf/nihms-1877198.pdf

Citations: 3

Abstract

Transcriptome-wide association studies (TWAS) have recently emerged as a popular tool to discover (putative) causal genes by integrating an outcome GWAS dataset with another gene expression/transcriptome GWAS (called eQTL) dataset. In our motivating and target application, we aim to identify causal genes for low-density lipoprotein cholesterol (LDL), which is crucial for developing new treatments for hyperlipidemia and cardiovascular diseases. The statistical principle underlying TWAS is (two-sample) two-stage least squares (2SLS) using multiple correlated SNPs as instrumental variables (IVs); it is closely related to typical (two-sample) Mendelian randomization (MR) using independent SNPs as IVs, which is expected to be impractical and lower-powered for TWAS (and some other) applications. However, some of the SNPs used are often not valid IVs, e.g., because of widespread pleiotropy, in which SNPs have direct effects on the outcome that are not mediated through the gene of interest; this can lead TWAS (or MR) to false conclusions. Building on recent advances in sparse regression, we propose a robust and efficient inferential method to account for both hidden confounding and some invalid IVs via two-stage constrained maximum likelihood (2ScML), an extension of 2SLS. We first develop the proposed method with individual-level data, then extend it both theoretically and computationally to GWAS summary data for the most popular two-sample TWAS design, to which almost all existing robust IV regression methods are not applicable. We show that the proposed method achieves asymptotically valid statistical inference on causal effects, demonstrating its wider applicability and superior finite-sample performance over the standard 2SLS/TWAS (and MR). We apply the methods to identify putative causal genes for LDL by integrating large-scale lipid GWAS summary data with eQTL data.
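
The 2SLS principle and the invalid-IV problem described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' 2ScML method and does not use GWAS summary data. It assumes a one-sample linear model with individual-level data, uses AR(1)-correlated Gaussian variables as a crude stand-in for correlated SNPs, and all function names and parameter values (`simulate`, `tsls`, `run_demo`, effect sizes, sample size) are hypothetical choices made for illustration. It compares naive OLS, 2SLS with all-valid IVs, 2SLS when two IVs have direct effects on the outcome, and an "oracle" 2SLS that is told which IVs are invalid and adjusts for them in the second stage.

```python
import numpy as np


def simulate(n, gamma, alpha, beta, rho, rng):
    """Draw (Z, x, y) from a linear IV model with a hidden confounder U."""
    p = len(gamma)
    idx = np.arange(p)
    # AR(1) correlation among instruments: an assumed stand-in for SNPs in LD.
    sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    Z = rng.standard_normal((n, p)) @ np.linalg.cholesky(sigma).T
    U = rng.standard_normal(n)                              # hidden confounder
    x = Z @ gamma + U + rng.standard_normal(n)              # exposure (expression)
    y = beta * x + Z @ alpha + U + rng.standard_normal(n)   # outcome
    return Z, x, y


def ols_slope(x, y):
    """Naive regression of y on x, ignoring confounding."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / (xc @ xc))


def tsls(Z, x, y, adjust=()):
    """2SLS; columns of Z listed in `adjust` also enter stage 2 as covariates."""
    Zc = Z - Z.mean(axis=0)
    xc, yc = x - x.mean(), y - y.mean()
    gamma_hat, *_ = np.linalg.lstsq(Zc, xc, rcond=None)     # stage 1: x on Z
    x_hat = Zc @ gamma_hat
    extra = Zc[:, np.asarray(adjust, dtype=int)]
    design = np.column_stack([x_hat, extra])
    coef, *_ = np.linalg.lstsq(design, yc, rcond=None)      # stage 2: y on x_hat
    return float(coef[0])


def run_demo(seed=0, n=20_000, beta=0.5):
    rng = np.random.default_rng(seed)
    p = 10
    gamma = np.full(p, 0.3)            # every instrument is relevant
    invalid = [0, 1]                   # instruments with direct (pleiotropic) effects
    alpha_valid = np.zeros(p)
    alpha_invalid = np.zeros(p)
    alpha_invalid[invalid] = 0.5
    Zv, xv, yv = simulate(n, gamma, alpha_valid, beta, 0.5, rng)
    Zi, xi, yi = simulate(n, gamma, alpha_invalid, beta, 0.5, rng)
    return {
        "ols": ols_slope(xv, yv),
        "tsls_valid": tsls(Zv, xv, yv),
        "tsls_invalid": tsls(Zi, xi, yi),
        "tsls_oracle": tsls(Zi, xi, yi, adjust=invalid),
    }


if __name__ == "__main__":
    for name, est in run_demo().items():
        print(f"{name:>13s}: {est:.3f}  (true beta = 0.5)")
```

In this toy setting, OLS is pulled away from the true effect by the hidden confounder, 2SLS recovers it when all IVs are valid, and 2SLS is biased again once some IVs affect the outcome directly. The oracle fit only works because the invalid set is known here by construction; in practice it is unknown, which is the gap the abstract says 2ScML addresses by building on sparse regression.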

Source journal

Journal of the American Statistical Association (Mathematics: Statistics & Probability)

CiteScore: 7.50

Self-citation rate: 8.10%

Articles published: 168

Review time: 12 months

About the journal: Established in 1888 and published quarterly in March, June, September, and December, the Journal of the American Statistical Association (JASA) has long been considered the premier journal of statistical science. Articles focus on statistical applications, theory, and methods in economic, social, physical, engineering, and health sciences. Important books contributing to statistical advancement are reviewed in JASA. JASA is indexed in Current Index to Statistics and MathSci Online and reviewed in Mathematical Reviews. JASA is abstracted by Access Company and is indexed and abstracted in the SRM Database of Social Research Methodology.