Zhiwen Jiang, Haoyu Zhang, Thomas U. Ahearn, Montserrat Garcia-Closas, Nilanjan Chatterjee, Hongtu Zhu, Xiang Zhan, Ni Zhao
{"title":"The sequence kernel association test for multicategorical outcomes","authors":"Zhiwen Jiang, Haoyu Zhang, Thomas U. Ahearn, Montserrat Garcia-Closas, Nilanjan Chatterjee, Hongtu Zhu, Xiang Zhan, Ni Zhao","doi":"10.1002/gepi.22527","DOIUrl":"10.1002/gepi.22527","url":null,"abstract":"<p>Disease heterogeneity is ubiquitous in biomedical and clinical studies. In genetic studies, researchers are increasingly interested in understanding the distinct genetic underpinning of subtypes of diseases. However, existing set-based analysis methods for genome-wide association studies are either inadequate or inefficient for handling such multicategorical outcomes. In this paper, we proposed a novel set-based association analysis method, sequence kernel association test (SKAT)-MC, the sequence kernel association test for multicategorical outcomes (nominal or ordinal), which jointly evaluates the relationship between a set of variants (common and rare) and disease subtypes. Through comprehensive simulation studies, we showed that SKAT-MC effectively preserves the nominal type I error rate while substantially increasing statistical power compared to existing methods under various scenarios. We applied SKAT-MC to the Polish breast cancer study (PBCS) and found that the gene <i>FGFR2</i> was significantly associated with estrogen receptor (ER)+ and ER− breast cancer subtypes. We also investigated educational attainment using UK Biobank data ($N=127,127$) with SKAT-MC and identified 21 significant genes in the genome. Consequently, SKAT-MC is a powerful and efficient analysis tool for genetic association studies with multicategorical outcomes. 
A freely distributed R package SKAT-MC can be accessed at https://github.com/Zhiwen-Owen-Jiang/SKATMC.</p>","PeriodicalId":12710,"journal":{"name":"Genetic Epidemiology","volume":"47 6","pages":"432-449"},"PeriodicalIF":2.1,"publicationDate":"2023-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/gepi.22527","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9985331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jasper P. Hof, Sita H. Vermeulen, Anthony C. C. Coolen, Tessel E. Galesloot
{"title":"Fast and accurate recurrent event analysis for genome-wide association studies","authors":"Jasper P. Hof, Sita H. Vermeulen, Anthony C. C. Coolen, Tessel E. Galesloot","doi":"10.1002/gepi.22525","DOIUrl":"10.1002/gepi.22525","url":null,"abstract":"<p>Many diseases recur after recovery, for example, recurrences in cancer and infections. However, research is often focused on analysing only time-to-first recurrence, thereby ignoring any subsequent recurrences that may occur after the first. Statistical models for the analysis of recurrent events are available, of which the extended Cox proportional hazards frailty model is the current state-of-the-art. However, this model is too statistically complex for computationally efficient application in high-dimensional data sets, including genome-wide association studies (GWAS). Here, we develop an application for fast and accurate recurrent event analysis in GWAS, called SPARE (SaddlePoint Approximation for Recurrent Event analysis). In SPARE, every DNA variant is tested for association with recurrence risk using a modified score statistic. A saddlepoint approximation is implemented to achieve statistical accuracy. SPARE controls the Type I error, and its statistical power is similar to existing recurrent event models, yet SPARE is significantly faster. 
An application of SPARE in a recurrent event GWAS on bladder cancer for 6.2 million DNA variants in 1,443 individuals required less than 15 min, whereas existing recurrent event methods would require several weeks.</p>","PeriodicalId":12710,"journal":{"name":"Genetic Epidemiology","volume":"47 5","pages":"365-378"},"PeriodicalIF":2.1,"publicationDate":"2023-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/gepi.22525","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9666885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RoPE: A robust profile likelihood method for differential gene expression analysis","authors":"Lehang Zhong, Lisa J. Strug","doi":"10.1002/gepi.22526","DOIUrl":"10.1002/gepi.22526","url":null,"abstract":"<p>Variation in RNA-Seq data creates modeling challenges for differential gene expression (DE) analysis. Statistical approaches address conventional small sample sizes and implement empirical Bayes or non-parametric tests, but frequently produce different conclusions. Increasing sample sizes enable proposal of alternative DE paradigms. Here we develop RoPE, which uses a data-driven adjustment for variation and a robust profile likelihood ratio DE test. Simulation studies show RoPE can have improved performance over existing tools as sample size increases and has the most reliable control of error rates. Application of RoPE demonstrates that an active <i>Pseudomonas aeruginosa</i> infection downregulates the <i>SLC9A3</i> Cystic Fibrosis modifier gene.</p>","PeriodicalId":12710,"journal":{"name":"Genetic Epidemiology","volume":"47 5","pages":"379-393"},"PeriodicalIF":2.1,"publicationDate":"2023-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/gepi.22526","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9657680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bias correction for inverse variance weighting Mendelian randomization","authors":"Ninon Mounier, Zoltán Kutalik","doi":"10.1002/gepi.22522","DOIUrl":"10.1002/gepi.22522","url":null,"abstract":"<p>Inverse-variance weighted two-sample Mendelian randomization (IVW-MR) is the most widely used approach that utilizes genome-wide association studies (GWAS) summary statistics to infer the existence and the strength of the causal effect between an exposure and an outcome. Estimates from this approach can be subject to different biases due to the use of weak instruments and winner's curse, which can change as a function of the overlap between the exposure and outcome samples. We developed a method (<span>MRlap</span>) that simultaneously considers weak instrument bias and winner's curse while accounting for potential sample overlap. Assuming spike-and-slab genomic architecture and leveraging linkage disequilibrium score regression and other techniques, we could analytically derive, reliably estimate, and hence correct for the bias of IVW-MR using association summary statistics only. We tested our approach using simulated data for a wide range of realistic settings. In all the explored scenarios, our correction reduced the bias, in some situations by as much as 30-fold. In addition, our results are consistent with the fact that the strength of the biases will decrease as the sample size increases and we also showed that the overall bias is also dependent on the genetic architecture of the exposure, and traits with low heritability and/or high polygenicity are more strongly affected. Applying <span>MRlap</span> to obesity-related exposures revealed statistically significant differences between IVW-based and corrected effects, both for nonoverlapping and fully overlapping samples. 
Our method not only reduces bias in causal effect estimation but also enables the use of much larger GWAS sample sizes, by allowing for potentially overlapping samples.</p>","PeriodicalId":12710,"journal":{"name":"Genetic Epidemiology","volume":"47 4","pages":"314-331"},"PeriodicalIF":2.1,"publicationDate":"2023-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/gepi.22522","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10040176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Monica Isgut, Kijoung Song, Margaret G. Ehm, May Dongmei Wang, Jonathan Davitte
{"title":"Effect of case and control definitions on genome-wide association study (GWAS) findings","authors":"Monica Isgut, Kijoung Song, Margaret G. Ehm, May Dongmei Wang, Jonathan Davitte","doi":"10.1002/gepi.22523","DOIUrl":"10.1002/gepi.22523","url":null,"abstract":"<p>Genome-wide association studies (GWAS) have significantly advanced our understanding of the genetic underpinnings of diseases, but case and control cohort definitions for a given disease can vary between different published studies. For example, two GWAS for the same disease using the UK Biobank data set might use different data sources (e.g., self-reported questionnaires or hospital records) or different levels of granularity (i.e., specificity of inclusion criteria) to define cases and controls. The extent to which this variability in cohort definitions impacts the end-results of a GWAS is unclear. In this study, we systematically evaluated the effect of the data sources used for case and control definitions on GWAS findings. Using the UK Biobank, we selected three diseases—glaucoma, migraine, and iron-deficiency anemia. For each disease, we designed 13 GWAS, each using different combinations of data sources to define cases and controls, and then calculated the pairwise genetic correlations between all GWAS for each disease. We found that the data sources used to define cases for a given disease can have a significant impact on GWAS end-results, but the extent of this depends heavily on the disease in question. 
This suggests the need for greater scrutiny on how case cohorts are defined for GWAS.</p>","PeriodicalId":12710,"journal":{"name":"Genetic Epidemiology","volume":"47 5","pages":"394-406"},"PeriodicalIF":2.1,"publicationDate":"2023-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10022267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Muhammad Shoaib, Qiang Ye, Heidi IglayReger, Meng H. Tan, Michael Boehnke, Charles F. Burant, Scott A. Soleimanpour, Sarah A. Gagliano Taliun
{"title":"Evaluation of polygenic risk scores to differentiate between type 1 and type 2 diabetes","authors":"Muhammad Shoaib, Qiang Ye, Heidi IglayReger, Meng H. Tan, Michael Boehnke, Charles F. Burant, Scott A. Soleimanpour, Sarah A. Gagliano Taliun","doi":"10.1002/gepi.22521","DOIUrl":"10.1002/gepi.22521","url":null,"abstract":"<p>Polygenic risk scores (PRS) quantify the genetic liability to disease and are calculated using an individual's genotype profile and disease-specific genome-wide association study (GWAS) summary statistics. Type 1 (T1D) and type 2 (T2D) diabetes are both determined in part by genetic loci. Correctly differentiating between types of diabetes is crucial for accurate diagnosis and treatment. PRS have the potential to address possible misclassification of T1D and T2D. Here we evaluated PRS models for T1D and T2D in European genetic ancestry participants from the UK Biobank (UKB) and then in the Michigan Genomics Initiative (MGI). Specifically, we investigated the utility of T1D and T2D PRS to discriminate between T1D, T2D, and controls in unrelated UKB individuals of European ancestry. We derived PRS models using external non-UKB GWAS. The T1D PRS model with the best discrimination between T1D cases and controls (area under the receiver operating characteristic curve [AUC] = 0.805) also yielded the best discrimination of T1D from T2D cases in the UKB (AUC = 0.792) and separation in MGI (AUC = 0.686). In contrast, the best T2D model did not discriminate between T1D and T2D cases (AUC = 0.527). 
Our analysis suggests that a T1D PRS model based on independent single nucleotide polymorphisms may help differentiate between T1D, T2D, and controls in individuals of European genetic ancestry.</p>","PeriodicalId":12710,"journal":{"name":"Genetic Epidemiology","volume":"47 4","pages":"303-313"},"PeriodicalIF":2.1,"publicationDate":"2023-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/gepi.22521","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9708272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Gene–environment interaction analysis via deep learning","authors":"Shuni Wu, Yaqing Xu, Qingzhao Zhang, Shuangge Ma","doi":"10.1002/gepi.22518","DOIUrl":"10.1002/gepi.22518","url":null,"abstract":"<p>Gene–environment (G–E) interaction analysis plays an important role in studying complex diseases. Extensive methodological research has been conducted on G–E interaction analysis, and the existing methods are mostly based on regression techniques. In many fields including biomedicine and omics, it has been increasingly recognized that deep learning may outperform regression with its unique flexibility (e.g., in accommodating unspecified nonlinear effects) and superior prediction performance. However, there has been a lack of development in deep learning for G–E interaction analysis. In this article, we fill this important knowledge gap and develop a new analysis approach based on deep neural network in conjunction with penalization. The proposed approach can simultaneously conduct model estimation and selection (of important main G effects and G–E interactions), while uniquely respecting the “main effects, interactions” variable selection hierarchy. Simulation shows that it has superior prediction and feature selection performance. The analysis of data on lung adenocarcinoma and skin cutaneous melanoma overall survival further establishes its practical utility. 
Overall, this study can advance G–E interaction analysis by delivering a powerful new analysis approach based on modern deep learning.</p>","PeriodicalId":12710,"journal":{"name":"Genetic Epidemiology","volume":"47 3","pages":"261-286"},"PeriodicalIF":2.1,"publicationDate":"2023-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/gepi.22518","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9944213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Claudia Coscia, Esther Molina-Montes, Raquel Benítez, Evangelina López de Maturana, Alfonso Muriel, Núria Malats, Teresa Pérez
{"title":"New proposal to address mediation analysis interrogations by using genetic variants as instrumental variables","authors":"Claudia Coscia, Esther Molina-Montes, Raquel Benítez, Evangelina López de Maturana, Alfonso Muriel, Núria Malats, Teresa Pérez","doi":"10.1002/gepi.22519","DOIUrl":"10.1002/gepi.22519","url":null,"abstract":"<p>The application of causal mediation analysis (CMA) considering the mediation effect of a third variable is increasing in epidemiological studies; however, this requires fitting strong assumptions on confounding bias. To address this limitation, we propose an extension of CMA combining it with Mendelian randomization (MRinCMA). We applied the new approach to analyse the causal effect of obesity and diabetes on pancreatic cancer, considering each factor as potential mediator. To check the performance of MRinCMA under several conditions/scenarios, we used it in different simulated data sets and compared it with structural equation models. For continuous variables, MRinCMA and structural equation models performed similarly, suggesting that both approaches are valid to obtain unbiased estimates. When noncontinuous variables were considered, MRinCMA presented, overall, lower bias than structural equation models. By applying MRinCMA, we did not find any evidence of causality of obesity or diabetes on pancreatic cancer. 
With this new methodology, researchers would be able to address CMA hypotheses by appropriately accounting for the confounding bias assumption regardless of the conditions used in their studies in different settings.</p>","PeriodicalId":12710,"journal":{"name":"Genetic Epidemiology","volume":"47 3","pages":"287-300"},"PeriodicalIF":2.1,"publicationDate":"2023-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/gepi.22519","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9121098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dapeng Shi, Yuquan Wang, Ziyong Zhang, Yunlong Cao, Yue-Qing Hu
{"title":"MR-BOIL: Causal inference in one-sample Mendelian randomization for binary outcome with integrated likelihood method","authors":"Dapeng Shi, Yuquan Wang, Ziyong Zhang, Yunlong Cao, Yue-Qing Hu","doi":"10.1002/gepi.22520","DOIUrl":"10.1002/gepi.22520","url":null,"abstract":"<p>Mendelian randomization is a statistical method for inferring the causal relationship between exposures and outcomes using an economics-derived instrumental variable approach. The research results are relatively complete when both exposures and outcomes are continuous variables. However, due to the noncollapsing nature of the logistic model, the existing methods inherited from the linear model for exploring binary outcomes cannot take the effect of confounding factors into account, which leads to a biased estimate of the causal effect. In this article, we propose an integrated likelihood method, MR-BOIL, to investigate causal relationships for binary outcomes by treating confounders as latent variables in one-sample Mendelian randomization. Under the assumption of a joint normal distribution of the confounders, we use the expectation-maximization algorithm to estimate the causal effect. Extensive simulations demonstrate that the estimator of MR-BOIL is asymptotically unbiased and that our method improves statistical power without inflating the type I error rate. We then apply this method to analyze the data from the Atherosclerosis Risk in Communities Study. The results show that MR-BOIL can better identify plausible causal relationships with high reliability, compared with the unreliable results of existing methods. 
MR-BOIL is implemented in R and the corresponding R code is provided for free download.</p>","PeriodicalId":12710,"journal":{"name":"Genetic Epidemiology","volume":"47 4","pages":"332-357"},"PeriodicalIF":2.1,"publicationDate":"2023-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9665586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gregory J. M. Zajac, Sarah A. Gagliano Taliun, Carlo Sidore, Sarah E. Graham, Bjørn O. Åsvold, Ben Brumpton, Jonas B. Nielsen, Wei Zhou, Maiken Gabrielsen, Anne H. Skogholt, Lars G. Fritsche, David Schlessinger, Francesco Cucca, Kristian Hveem, Cristen J. Willer, Gonçalo R. Abecasis
{"title":"A fast linkage method for population GWAS cohorts with related individuals","authors":"Gregory J. M. Zajac, Sarah A. Gagliano Taliun, Carlo Sidore, Sarah E. Graham, Bjørn O. Åsvold, Ben Brumpton, Jonas B. Nielsen, Wei Zhou, Maiken Gabrielsen, Anne H. Skogholt, Lars G. Fritsche, David Schlessinger, Francesco Cucca, Kristian Hveem, Cristen J. Willer, Gonçalo R. Abecasis","doi":"10.1002/gepi.22516","DOIUrl":"10.1002/gepi.22516","url":null,"abstract":"<p>Linkage analysis, a class of methods for detecting co-segregation of genomic segments and traits in families, was used to map disease-causing genes for decades before genotyping arrays and dense SNP genotyping enabled genome-wide association studies in population samples. Population samples often contain related individuals, but the segregation of alleles within families is rarely used because traditional linkage methods are computationally inefficient for larger datasets. Here, we describe Population Linkage, a novel application of Haseman–Elston regression as a method of moments estimator of variance components and their standard errors. We achieve additional computational efficiency by using modern methods for detection of IBD segments and variance component estimation, efficient preprocessing of input data, and minimizing redundant numerical calculations. We also refined variance component models to account for the biases in population-scale methods for IBD segment detection. We ran Population Linkage on four blood lipid traits in over 70,000 individuals from the HUNT and SardiNIA studies, successfully detecting 25 known genetic signals. One notable linkage signal that appeared in both was for low-density lipoprotein (LDL) cholesterol levels in the region near the gene <i>APOE</i> (LOD = 29.3, variance explained = 4.1%). 
This is the region where the missense variants rs7412 and rs429358, which together make up the ε2, ε3, and ε4 alleles, account for 2.4% and 0.8% of variation in circulating LDL cholesterol, respectively. Our results show the potential for linkage analysis and other large-scale applications of method of moments variance components estimation.</p>","PeriodicalId":12710,"journal":{"name":"Genetic Epidemiology","volume":"47 3","pages":"231-248"},"PeriodicalIF":2.1,"publicationDate":"2023-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/gepi.22516","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9496203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}