Semiparametric efficient estimation of small genetic effects in large-scale population cohorts.

IF 2 · Zone 3 (Mathematics) · JCR Q3 Mathematical & Computational Biology

Biostatistics Pub Date : 2024-12-31 DOI:10.1093/biostatistics/kxaf030

Olivier Labayle, Breeshey Roskams-Hieter, Joshua Slaughter, Kelsey Tetley-Campbell, Mark J van der Laan, Chris P Ponting, Sjoerd V Beentjes, Ava Khamseh

Biostatistics, volume 26, issue 1. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12479317/pdf/


Abstract

Population genetics seeks to quantify DNA variant associations with traits or diseases, as well as interactions among variants and with environmental factors. Computing millions of estimates in large cohorts, in which small effect sizes and tight confidence intervals are expected, necessitates minimizing model-misspecification bias to increase power and control false discoveries. We present TarGene, a unified statistical workflow for the semiparametric efficient and doubly robust estimation of genetic effects, including $k$-point interactions among categorical variables, in the presence of confounding and weak population dependence. $k$-point interactions, or Average Interaction Effects (AIEs), are a direct generalization of the usual average treatment effect (ATE). We estimate genetic effects with cross-validated and/or weighted versions of Targeted Minimum Loss-based Estimators (TMLE) and One-Step Estimators (OSE). The effect of dependence among data units on variance estimates is corrected by using sieve plateau variance estimators based on genetic relatedness across the units. We present extensive realistic simulations to demonstrate power, coverage, and control of type I error. Our motivating application is the targeted estimation of genetic effects on traits, including two-point and higher-order gene-gene and gene-environment interactions, in large-scale genomic databases such as UK Biobank and All of Us. All cross-validated and/or weighted TMLE and OSE for the AIE $k$-point interaction, as well as ATEs, conditional ATEs and functions thereof, are implemented in the general-purpose Julia package TMLE.jl. For high-throughput applications in population genomics, we provide the open-source Nextflow pipeline and software TarGene, which integrates seamlessly with modern high-performance and cloud computing platforms.
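To make the one-step idea concrete, here is a minimal sketch in Python of a one-step (augmented inverse-probability-weighted) estimator of the ATE on simulated data. It is not the TMLE.jl or TarGene implementation, and none of the names below come from those packages. The data-generating process, the function name `one_step_ate`, and the deliberately misspecified outcome model are all illustrative assumptions. The standard error assumes independent units, so it does not reproduce the sieve plateau variance correction described in the abstract.

```python
import numpy as np

def one_step_ate(y, t, w):
    """One-step (AIPW) ATE estimate for binary treatment t and binary confounder w.

    Hypothetical helper for illustration only; not part of TMLE.jl or TarGene.
    """
    n = len(y)
    # Outcome model, deliberately misspecified: it ignores the confounder w.
    q1 = np.full(n, y[t == 1].mean())
    q0 = np.full(n, y[t == 0].mean())
    # Propensity score P(T = 1 | W), estimated by cell proportions (correct here).
    g = np.where(w == 1, t[w == 1].mean(), t[w == 0].mean())
    # Plug-in estimate from the (misspecified) outcome model.
    plug_in = np.mean(q1 - q0)
    # Estimated efficient influence function, evaluated at the plug-in.
    q_obs = np.where(t == 1, q1, q0)
    weights = t / g - (1 - t) / (1 - g)
    eif = weights * (y - q_obs) + (q1 - q0) - plug_in
    # One-step correction: add the empirical mean of the influence function.
    estimate = plug_in + eif.mean()
    # Standard error under independence (std is unaffected by the centering).
    se = eif.std(ddof=1) / np.sqrt(n)
    return plug_in, estimate, se

# Simulated data (assumed for illustration): the true ATE is 1.0.
rng = np.random.default_rng(0)
n = 20_000
w = rng.binomial(1, 0.5, n)                      # binary confounder
t = rng.binomial(1, 0.2 + 0.6 * w)               # treatment depends on w
y = 1.0 * t + 2.0 * w + rng.normal(size=n)       # outcome depends on t and w

plug_in, estimate, se = one_step_ate(y, t, w)
print(f"plug-in:  {plug_in:.3f}")
print(f"one-step: {estimate:.3f} +/- {1.96 * se:.3f}")
```

In this simulation the plug-in estimate is confounded because the outcome model ignores `w`, while the one-step correction recovers an estimate close to the true value because the propensity score is estimated correctly. This is the double robustness property the abstract refers to, shown here only for the ATE rather than for $k$-point interactions.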


Source journal

Biostatistics (Biology: Mathematical & Computational Biology)

CiteScore: 5.10

Self-citation rate: 4.80%

Review time: 6-12 weeks

About the journal: Among the important scientific developments of the 20th century is the explosive growth in statistical reasoning and methods for application to studies of human health. Examples include developments in likelihood methods for inference, epidemiologic statistics, clinical trials, survival analysis, and statistical genetics. Substantive problems in public health and biomedical research have fueled the development of statistical methods, which in turn have improved our ability to draw valid inferences from data. The objective of Biostatistics is to advance statistical science and its application to problems of human health and disease, with the ultimate goal of advancing the public's health.